Apply the most suitable ML algorithm and build a text classification model to predict "Article Types".

Let's divide the classification problem into the following steps:

1. Preprocessing and cleaning

2. For vectorization, use SentenceBERT or another BERT-based encoder.

3. Select and train an ML classifier model

4. Hyperparameter tuning

5. Validate and evaluate the accuracy (use cross-validation methods)

6. Present the performance metrics

7. Save and Reload the model from the disk

8. Extract the heading and full article from each article URL listed in unknown_articles.csv

9. Predict the "Article_Types" for the extracted data

# Data Cleaning and Pre-processing

In [1]:
import pandas as pd
import numpy as np

In [2]:
import warnings

# Ignore all warnings
warnings.filterwarnings("ignore")

In [3]:
# Load the dataset
df = pd.read_csv('articles.csv', encoding='ISO-8859-1')

In [4]:
df.shape

(4305, 8)

In [5]:
df.head()

,Id,Heading,Article.Banner.Image,Outlets,Article.Description,Full_Article,Article_Type,Tonality
0,d6995462-5e87-453b-b64d-e9f1df6e94d2,"A Puzzling Maneuver, Then Freefall: NTSB Repor...",,Essex Caller,<p>The helicopter that crashed in Southeast Al...,<p>The helicopter that crashed in Southeast Al...,Commercial,Negative
1,8b05e939-a89e-4548-b92b-013822e8ee7d,Bells Nexus Air Taxi Concept Rings Changes Fo...,,Aviation Week Network,<p>A year after teasing the fledgling electric...,<p>A year after teasing the fledgling electric...,Commercial,Positive
2,69fcd400-bceb-4255-8277-619f2d68ac0b,Bell Helicopter Show Air Taxi Nexus,http://images.tmtpost.com/uploads/images/2019/...,TMTPost,<p>Bell released the full-size design of the v...,<p>Bell released the full-size design of the v...,Commercial,Positive
3,17943578-c11b-414b-b3f5-063d3a93157b,BELL DÉVOILE LA CONCEPTION INTÉGRALE DE SON TA...,http://www.fredzone.org/wp-content/uploads/201...,Fredzone,<p>Bell est une soci&eacute;t&eacute; am&eacut...,<p>Bell est une soci&eacute;t&eacute; am&eacut...,Commercial,Positive
4,f33c7b11-5f77-4a98-bb2e-d36689042aea,Les premiers retours dOlivier Ezratty,,FrenchWeb,<p>It was still anecdotal to observe the explo...,<p>It was still anecdotal to observe the explo...,Commercial,Positive


In [6]:
# Check for null values and data type of each column
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4305 entries, 0 to 4304
Data columns (total 8 columns):
 #   Column                Non-Null Count  Dtype 
---  ------                --------------  ----- 
 0   Id                    4305 non-null   object
 1   Heading               4305 non-null   object
 2   Article.Banner.Image  1753 non-null   object
 3   Outlets               4305 non-null   object
 4   Article.Description   4305 non-null   object
 5   Full_Article          4305 non-null   object
 6   Article_Type          4305 non-null   object
 7   Tonality              3873 non-null   object
dtypes: object(8)
memory usage: 269.2+ KB


In [7]:
len(df['Outlets'].unique())

1762

In [8]:
# Drop irrelevant columns
df1 = df.drop(['Id', 'Article.Banner.Image', 'Outlets'], axis=1)
df1.shape

(4305, 5)

In [9]:
# Check for null values
df1.isnull().sum()

Heading                  0
Article.Description      0
Full_Article             0
Article_Type             0
Tonality               432
dtype: int64

In [10]:
# Fill null values
df1['Tonality'] = df1['Tonality'].fillna(df1['Tonality'].mode()[0])
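
The fill above replaces every missing `Tonality` with the most frequent label. A minimal sketch on a toy series (the values are made up for illustration) shows what `mode()[0]` and `fillna` do together:

```python
import pandas as pd

# Toy column with made-up values; None marks a missing label
tonality = pd.Series(["Positive", None, "Negative", "Positive", None])

# mode() ignores missing values and returns the most frequent entries;
# [0] picks the first of them
most_common = tonality.mode()[0]
filled = tonality.fillna(most_common)

print(most_common)
print(filled.tolist())
```

Mode imputation enlarges the majority class, so a separate placeholder category such as "Unknown" may be worth considering if `Tonality` is ever used as a feature.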

In [11]:
df1.isnull().sum()

Heading                0
Article.Description    0
Full_Article           0
Article_Type           0
Tonality               0
dtype: int64

In [12]:
df1['Article_Type'].unique()

array(['Commercial', 'Military', 'Training', 'Executives', 'Others',
       'Financing', 'Support & Services'], dtype=object)

In [13]:
df1['Tonality'].unique()

array(['Negative', 'Positive', 'Neutral'], dtype=object)

In [14]:
df2 = df1.copy()
df2.head()

,Heading,Article.Description,Full_Article,Article_Type,Tonality
0,"A Puzzling Maneuver, Then Freefall: NTSB Repor...",<p>The helicopter that crashed in Southeast Al...,<p>The helicopter that crashed in Southeast Al...,Commercial,Negative
1,Bells Nexus Air Taxi Concept Rings Changes Fo...,<p>A year after teasing the fledgling electric...,<p>A year after teasing the fledgling electric...,Commercial,Positive
2,Bell Helicopter Show Air Taxi Nexus,<p>Bell released the full-size design of the v...,<p>Bell released the full-size design of the v...,Commercial,Positive
3,BELL DÉVOILE LA CONCEPTION INTÉGRALE DE SON TA...,<p>Bell est une soci&eacute;t&eacute; am&eacut...,<p>Bell est une soci&eacute;t&eacute; am&eacut...,Commercial,Positive
4,Les premiers retours dOlivier Ezratty,<p>It was still anecdotal to observe the explo...,<p>It was still anecdotal to observe the explo...,Commercial,Positive
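
The preview above shows that the description and article columns still contain HTML tags and entities such as `<p>` and `&eacute;`. One possible cleaning step, sketched here with the standard library only, strips tags and decodes entities before the text is encoded. The helper name `strip_html` and the sample string are hypothetical, not part of the original notebook.

```python
import html
import re

# Matches anything that looks like an HTML tag, e.g. <p> or </p>
TAG_PATTERN = re.compile(r"<[^>]+>")

def strip_html(raw: str) -> str:
    """Remove tags, decode entities, and collapse whitespace (hypothetical helper)."""
    without_tags = TAG_PATTERN.sub(" ", raw)
    decoded = html.unescape(without_tags)
    return re.sub(r"\s+", " ", decoded).strip()

# Made-up sample in the style of the dataset
sample = "<p>Une soci&eacute;t&eacute; &amp; ses h&eacute;licopt&egrave;res</p>"
cleaned = strip_html(sample)
print(cleaned)
```

A regular expression is a crude tag stripper; for messy markup, `BeautifulSoup(raw, 'html.parser').get_text()` is likely the more robust choice. Assuming the column names above, the helper could be applied with `df2[column].map(strip_html)`.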


In [15]:
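# Encode categorical columns using LabelEncoder

from sklearn.preprocessing import LabelEncoder

# Separate encoders, so each fitted mapping stays available for inverse_transform
tonality_encoder = LabelEncoder()
le = LabelEncoder()

df2['Tonality'] = tonality_encoder.fit_transform(df2['Tonality'])
df2['Article_Type'] = le.fit_transform(df2['Article_Type'])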

In [16]:
df2['Article_Type'].unique()

array([0, 3, 6, 1, 4, 2, 5])

In [17]:
df2['Tonality'].unique()

array([0, 2, 1])
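
The integer codes printed above follow the sorted order of the class names, which is how `LabelEncoder` assigns them. The sketch below refits an encoder on the seven article types listed earlier to make the mapping explicit and to show how predictions can be turned back into names; the variable names are illustrative.

```python
from sklearn.preprocessing import LabelEncoder

article_types = ['Commercial', 'Military', 'Training', 'Executives',
                 'Others', 'Financing', 'Support & Services']

encoder = LabelEncoder()
encoder.fit(article_types)

# classes_ holds the sorted labels; the position of a label is its integer code
mapping = {label: code for code, label in enumerate(encoder.classes_)}
print(mapping)

# Turn integer codes back into readable names
decoded = encoder.inverse_transform([0, 3, 6]).tolist()
print(decoded)
```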

# Vectorization

In [18]:
# Install the Sentence Transformers library, which is a Python module for SentenceBERT developed by the UKPLab
!pip3 install sentence-transformers

Collecting sentence-transformers
  Downloading sentence-transformers-2.2.2.tar.gz (85 kB)
  Preparing metadata (setup.py) ... done
Collecting sentencepiece (from sentence-transformers)
  Downloading sentencepiece-0.1.99-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)
Building wheels for collected packages: sentence-transformers
  Building wheel for sentence-transformers (setup.py) ... done
  Created wheel for sentence-transformers: filename=sentence_trans

In [19]:
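# Load a pre-trained SentenceBERT model; 'bert-base-nli-mean-tokens' returns one 768-dimensional vector per text.
# Text longer than the model's maximum sequence length is truncated, so long articles are only partly encoded.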

from sentence_transformers import SentenceTransformer

model = SentenceTransformer('bert-base-nli-mean-tokens')

for column in ['Heading', 'Article.Description', 'Full_Article']:
    df2[column + '_embedding'] = df2[column].apply(lambda x: model.encode(x))


In [20]:
df3 = df2.drop(['Heading', 'Article.Description', 'Full_Article'], axis=1)
df3.head()

,Article_Type,Tonality,Heading_embedding,Article.Description_embedding,Full_Article_embedding
0,0,0,"[0.006817788, 0.38549262, 0.111032486, 0.25931...","[-0.3087916, 0.47476256, 0.17605346, 0.2191023...","[-0.2975318, 0.5137314, 0.20075305, 0.22101636..."
1,0,2,"[0.13352107, -0.045962982, 2.3014443, 0.146599...","[-0.2224193, 0.13635927, 1.0301534, 0.08868462...","[-0.2224193, 0.13635927, 1.0301534, 0.08868462..."
2,0,2,"[0.1315209, -0.071940266, 1.0037092, 0.1943929...","[-0.6536702, -0.106780246, 1.311686, 0.1152132...","[-0.6536702, -0.106780246, 1.311686, 0.1152132..."
3,0,2,"[-0.31541964, 0.7637206, 1.5530187, 0.58139217...","[-0.3561128, 0.09793185, 1.634191, 0.17869553,...","[-0.3561128, 0.09793185, 1.634191, 0.17869553,..."
4,0,2,"[-0.27859825, 0.16457734, 0.99810475, 0.414360...","[-0.0018755815, 0.6525793, -0.44657302, 0.0285...","[-0.0018755815, 0.6525793, -0.44657302, 0.0285..."


In [21]:
df4 = df3[['Heading_embedding', 'Full_Article_embedding', 'Article_Type']]
df4.head()

,Heading_embedding,Full_Article_embedding,Article_Type
0,"[0.006817788, 0.38549262, 0.111032486, 0.25931...","[-0.2975318, 0.5137314, 0.20075305, 0.22101636...",0
1,"[0.13352107, -0.045962982, 2.3014443, 0.146599...","[-0.2224193, 0.13635927, 1.0301534, 0.08868462...",0
2,"[0.1315209, -0.071940266, 1.0037092, 0.1943929...","[-0.6536702, -0.106780246, 1.311686, 0.1152132...",0
3,"[-0.31541964, 0.7637206, 1.5530187, 0.58139217...","[-0.3561128, 0.09793185, 1.634191, 0.17869553,...",0
4,"[-0.27859825, 0.16457734, 0.99810475, 0.414360...","[-0.0018755815, 0.6525793, -0.44657302, 0.0285...",0


In [22]:
heading_features = pd.DataFrame(df4['Heading_embedding'].to_list(), columns=[f'heading_{i}' for i in range(len(df4['Heading_embedding'].iloc[0]))])
article_features = pd.DataFrame(df4['Full_Article_embedding'].to_list(), columns=[f'article_{i}' for i in range(len(df4['Full_Article_embedding'].iloc[0]))])

df5 = pd.concat([df4, heading_features, article_features], axis=1)

df5 = df5.drop(['Heading_embedding', 'Full_Article_embedding'], axis=1)
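
The cell above expands each embedding column into one numeric column per dimension. An equivalent way to build the feature matrix, sketched here on made-up 3-dimensional vectors, stacks the arrays with NumPy; with the 768-dimensional embeddings used in this notebook the result would have 1536 columns.

```python
import numpy as np
import pandas as pd

# Toy frame: two embedding columns holding 3-dimensional vectors (made-up values)
toy = pd.DataFrame({
    "Heading_embedding": [np.array([0.1, 0.2, 0.3]), np.array([0.4, 0.5, 0.6])],
    "Full_Article_embedding": [np.array([1.0, 1.1, 1.2]), np.array([1.3, 1.4, 1.5])],
})

# Stack the per-row vectors into (n_rows, n_dims) matrices
heading_matrix = np.vstack(toy["Heading_embedding"].tolist())
article_matrix = np.vstack(toy["Full_Article_embedding"].tolist())

# Place heading and article features side by side
features = np.hstack([heading_matrix, article_matrix])
print(features.shape)
```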

In [23]:
df5.head()

,Article_Type,heading_0,heading_1,heading_2,heading_3,heading_4,heading_5,heading_6,heading_7,heading_8,...,article_758,article_759,article_760,article_761,article_762,article_763,article_764,article_765,article_766,article_767
0,0,0.006818,0.385493,0.111032,0.259312,0.044005,-0.569938,0.613437,-0.476182,0.505468,...,-0.442507,0.213695,-1.048564,-0.371615,0.236706,-0.698086,-0.351208,-0.291638,0.164043,0.16612
1,0,0.133521,-0.045963,2.301444,0.1466,0.312994,-0.354765,-0.20495,-0.314788,0.620836,...,-0.498318,-1.014818,-1.64383,-0.4907,-0.124704,-0.598111,-0.384294,-0.424274,0.30989,-0.254794
2,0,0.131521,-0.07194,1.003709,0.194393,0.171645,0.06097,-0.475287,-0.187274,0.128736,...,-0.221055,-1.092432,-0.661864,-0.15822,-0.150076,-0.647646,-0.896943,-0.415517,-0.015466,-0.050569
3,0,-0.31542,0.763721,1.553019,0.581392,0.836126,-0.659893,-0.19133,0.306782,0.704866,...,0.129382,-0.636668,-0.731162,0.076022,0.134098,-1.099116,-0.230633,-0.127485,-0.038905,0.348545
4,0,-0.278598,0.164577,0.998105,0.414361,0.795069,-0.513556,-0.00794,0.652897,-0.087312,...,-0.360877,-0.186242,-1.036903,-0.809415,-0.027214,-1.038251,0.262545,0.132081,0.640066,-0.072544


In [24]:
x = df5.drop('Article_Type', axis=1)
y = df5['Article_Type']

# Training ML Classifier model

In [25]:
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.2, random_state=42)
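
The split above is purely random, so rare classes can end up under-represented or absent in the test set; the confusion matrix later in this notebook has six rows although there are seven article types. A possible alternative is a stratified split, sketched here on made-up imbalanced labels. Note that `stratify` requires at least two samples in every class.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Made-up imbalanced labels: 90 samples of class 0, 10 of class 1
labels = [0] * 90 + [1] * 10
samples = [[i] for i in range(100)]

# stratify keeps the class proportions the same in both parts
x_tr, x_te, y_tr, y_te = train_test_split(
    samples, labels, test_size=0.2, random_state=42, stratify=labels
)

train_counts = Counter(y_tr)
test_counts = Counter(y_te)
print(train_counts, test_counts)
```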

In [26]:
# Scale the input features using MinMaxScaler

from sklearn.preprocessing import MinMaxScaler

# Initialize MinMaxScaler
scaler = MinMaxScaler()

x_train_scaled = scaler.fit_transform(x_train)
x_test_scaled = scaler.transform(x_test)
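
Min-max scaling maps each feature to the range [0, 1] using the minimum and maximum seen in the training data. The arithmetic can be sketched with NumPy on made-up numbers; it also shows that test values outside the training range land outside [0, 1].

```python
import numpy as np

train = np.array([[1.0], [3.0], [5.0]])   # made-up training feature
test = np.array([[2.0], [6.0]])           # made-up test feature

# Statistics come from the training data only
col_min = train.min(axis=0)
col_max = train.max(axis=0)

# x_scaled = (x - min) / (max - min)
train_scaled = (train - col_min) / (col_max - col_min)
test_scaled = (test - col_min) / (col_max - col_min)

print(train_scaled.ravel())
print(test_scaled.ravel())
```

Scaling to [0, 1] keeps the training features non-negative, which `MultinomialNB` requires when fitting.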

### Logistic Regression

In [27]:
from sklearn.linear_model import LogisticRegression

logistic_model = LogisticRegression()
logistic_model.fit(x_train_scaled, y_train)

In [28]:
logistic_model.score(x_test_scaled, y_test)

0.9036004645760743

### SVM

In [29]:
from sklearn.svm import SVC

svm_model = SVC()
svm_model.fit(x_train_scaled, y_train)

In [30]:
svm_model.score(x_test_scaled, y_test)

0.9175377468060395

### Random Forest

In [31]:
from sklearn.ensemble import RandomForestClassifier

forest_model = RandomForestClassifier()
forest_model.fit(x_train_scaled, y_train)

In [32]:
forest_model.score(x_test_scaled, y_test)

0.8908246225319396

### Multinomial Naive Bayes

In [33]:
from sklearn.naive_bayes import MultinomialNB

naive_bayes_model = MultinomialNB()
naive_bayes_model.fit(x_train_scaled, y_train)

In [34]:
naive_bayes_model.score(x_test_scaled, y_test)

0.818815331010453

# Hyperparameter Tuning and Cross-Validation

In [35]:
from sklearn.model_selection import GridSearchCV, cross_val_score

models = {
    'LogisticRegression': {
        'model': LogisticRegression(C=1),
        'params': {'C': [0.001, 0.01, 0.1, 1, 10, 100, 1000]}
    },
    'SVM': {
        'model': SVC(C=1, gamma=1, kernel='rbf'),
        'params': {'C': [0.1, 1, 10, 100], 'gamma': [1, 0.1, 0.01, 0.001], 'kernel': ['rbf']}
    },
    'MultinomialNB': {
        'model': MultinomialNB(),
        'params': {'alpha': [0.001, 0.01, 0.1, 1, 10, 100, 1000]}
    }
}

# x_train_scaled and y_train are the scaled training features and labels from above

for model_name, model_params in models.items():
    # Use a name other than `model`, which still refers to the SentenceBERT encoder
    estimator = model_params['model']
    param_grid = model_params['params']

    grid_search = GridSearchCV(estimator, param_grid, cv=5)
    grid_search.fit(x_train_scaled, y_train)

    best_model = grid_search.best_estimator_

    # Cross-validation
    cross_val_scores = cross_val_score(best_model, x_train_scaled, y_train, cv=5)
    mean_cross_val_score = np.mean(cross_val_scores)

    # Evaluate the best model on the test set
    test_accuracy = best_model.score(x_test_scaled, y_test)

    print(f'{model_name} - Best Parameters: {grid_search.best_params_}')
    print(f'Mean Cross-Validation Score: {mean_cross_val_score}')
    print(f'Test Set Accuracy: {test_accuracy}\n')

LogisticRegression - Best Parameters: {'C': 1}
Mean Cross-Validation Score: 0.8887927397306511
Test Set Accuracy: 0.9036004645760743

SVM - Best Parameters: {'C': 100, 'gamma': 0.01, 'kernel': 'rbf'}
Mean Cross-Validation Score: 0.9085340230195429
Test Set Accuracy: 0.9279907084785134

MultinomialNB - Best Parameters: {'alpha': 0.001}
Mean Cross-Validation Score: 0.8138838728187128
Test Set Accuracy: 0.8199767711962834
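
In the search above the scaler was fit on the whole training set before cross-validation, so each validation fold contributed to the scaling statistics. One way to avoid that, sketched here on synthetic data rather than the article embeddings, is to place the scaler inside a `Pipeline` so it is refit on the training part of every fold. The step names `scaler` and `svc` and the small grid are illustrative choices.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

# Synthetic stand-in for the embedding features
X_demo, y_demo = make_classification(
    n_samples=200, n_features=20, n_informative=5, random_state=0
)

pipeline = Pipeline([
    ("scaler", MinMaxScaler()),   # refit on each training fold
    ("svc", SVC(kernel="rbf")),
])

# Parameters of a pipeline step are addressed as <step>__<parameter>
param_grid = {"svc__C": [1, 10], "svc__gamma": [0.01, 0.1]}

search = GridSearchCV(pipeline, param_grid, cv=5)
search.fit(X_demo, y_demo)

print(search.best_params_)
print(round(search.best_score_, 3))
```

`best_score_` is already the mean cross-validated score of the best parameter combination, so a separate `cross_val_score` call is optional.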



In [36]:
# Choosing the best model and best parameters

best_model = SVC(C=100, gamma=0.01, kernel='rbf')
best_model.fit(x_train_scaled, y_train)

In [37]:
y_pred = best_model.predict(x_test_scaled)

# Performance Metrics

### Accuracy

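*   The ratio of correctly predicted instances to the total instances.
*   Most informative on balanced datasets; the classes here are heavily imbalanced (515 and 317 test articles in the two largest classes, 10 or fewer in each of the others), so the per-class metrics in the classification report deserve more attention.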



In [38]:
from sklearn.metrics import accuracy_score
accuracy = accuracy_score(y_test, y_pred)
accuracy

0.9279907084785134

### Precision

*   The ratio of correctly predicted positive observations to the total predicted positives, computed per class and then averaged with class support as weights (`average='weighted'`).
*   Focuses on minimizing false positives.



In [39]:
from sklearn.metrics import precision_score
precision = precision_score(y_test, y_pred, average='weighted')
precision

0.9293046576847748

### Recall

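*   The ratio of correctly predicted positive observations to the actual positives, computed per class and then averaged with class support as weights. With this weighting the result equals the overall accuracy.
*   Focuses on minimizing false negatives.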



In [40]:
from sklearn.metrics import recall_score
recall = recall_score(y_test, y_pred, average='weighted')
recall

0.9279907084785134

### F1 Score

*   The harmonic mean of precision and recall, providing a balance between the two metrics.



In [41]:
from sklearn.metrics import f1_score
f1 = f1_score(y_test, y_pred, average='weighted')
f1

0.9257487703287801

### Confusion Matrix

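A table of counts in which each row is a true class and each column a predicted class; the diagonal holds the correct predictions. Only six classes appear because class 6 occurs neither in the test labels nor in the predictions.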

In [42]:
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
cm

array([[486,   0,   0,  29,   0,   0],
       [  2,   7,   0,   1,   0,   0],
       [  0,   0,   3,   0,   0,   0],
       [ 21,   0,   0, 296,   0,   0],
       [  5,   0,   0,   1,   2,   0],
       [  2,   0,   0,   1,   0,   5]])
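
The metrics reported in the previous cells can be recomputed directly from this matrix, which makes the definitions concrete. The sketch below copies the matrix printed above and derives per-class precision and recall, the accuracy, and the support-weighted averages.

```python
import numpy as np

# Confusion matrix copied from the output above (rows: true, columns: predicted)
cm = np.array([
    [486, 0, 0, 29, 0, 0],
    [2, 7, 0, 1, 0, 0],
    [0, 0, 3, 0, 0, 0],
    [21, 0, 0, 296, 0, 0],
    [5, 0, 0, 1, 2, 0],
    [2, 0, 0, 1, 0, 5],
])

true_positives = np.diag(cm)
support = cm.sum(axis=1)      # row sums: actual instances per class
predicted = cm.sum(axis=0)    # column sums: predictions per class

recall_per_class = true_positives / support
precision_per_class = true_positives / predicted

accuracy = true_positives.sum() / cm.sum()
weighted_precision = np.average(precision_per_class, weights=support)
weighted_recall = np.average(recall_per_class, weights=support)

print(np.round(precision_per_class, 2))
print(np.round(recall_per_class, 2))
print(accuracy, weighted_precision, weighted_recall)
```

The weighted recall coincides with the accuracy because weighting each class recall by its support sums the diagonal and divides by the total count.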

### Classification Report

A summary of precision, recall, F1-score, and support for each class.

In [43]:
from sklearn.metrics import classification_report
report = classification_report(y_test, y_pred)
print("Classification Report: \n", report)

Classification Report: 
               precision    recall  f1-score   support

           0       0.94      0.94      0.94       515
           1       1.00      0.70      0.82        10
           2       1.00      1.00      1.00         3
           3       0.90      0.93      0.92       317
           4       1.00      0.25      0.40         8
           5       1.00      0.62      0.77         8

    accuracy                           0.93       861
   macro avg       0.97      0.74      0.81       861
weighted avg       0.93      0.93      0.93       861



# Save and Reload the model

In [44]:
import joblib

# Save the model to disk
joblib.dump(best_model, 'model.joblib')

['model.joblib']

In [45]:
# Reload the model from disk
loaded_model = joblib.load('model.joblib')
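
Only the classifier is written to disk above, while the fitted scaler exists only in the current session. A fresh session needs both to make predictions, so one option is to save them together. The sketch below uses made-up data and hypothetical names (`demo_scaler`, `demo_model`, `bundle.joblib`) to show the round trip.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

# Made-up training data standing in for the embedding features
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(40, 4))
y_demo = (X_demo[:, 0] > 0).astype(int)

demo_scaler = MinMaxScaler().fit(X_demo)
demo_model = SVC().fit(demo_scaler.transform(X_demo), y_demo)

# Bundle everything needed at prediction time in one file
bundle_path = os.path.join(tempfile.mkdtemp(), "bundle.joblib")
joblib.dump({"scaler": demo_scaler, "model": demo_model}, bundle_path)

# Reload and check that predictions are unchanged
restored = joblib.load(bundle_path)
restored_pred = restored["model"].predict(restored["scaler"].transform(X_demo))
original_pred = demo_model.predict(demo_scaler.transform(X_demo))
print((restored_pred == original_pred).all())
```

The fitted label encoder could be added to the same dictionary so that integer predictions can be mapped back to article type names after reloading.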

# Extracting the heading and full article from unknown_articles.csv

In [46]:
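import requests
from bs4 import BeautifulSoup

# Function to extract heading and full article content from a URL
def extract_content(url):
    try:
        # A timeout keeps one unresponsive site from stalling the whole run
        response = requests.get(url, timeout=10)
        response.raise_for_status()
    except requests.RequestException:
        # Unreachable page: return empty text so downstream encoding still works
        return '', ''

    soup = BeautifulSoup(response.text, 'html.parser')

    # Extract heading (first <h1>, if any)
    h1 = soup.find('h1')
    heading = h1.text.strip() if h1 else ''

    # Extract full article content from all paragraph tags
    article_content = '\n'.join(p.text.strip() for p in soup.find_all('p'))

    return heading, article_content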
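
The parsing logic can be tried offline on a small, made-up page before it is run against live URLs. This sketch mirrors the same two steps, first `<h1>` for the heading and all `<p>` tags for the body:

```python
from bs4 import BeautifulSoup

# Made-up page, so the example runs without network access
page = """
<html><body>
  <h1> Sample headline </h1>
  <p>First paragraph.</p>
  <p>Second paragraph.</p>
</body></html>
"""

soup = BeautifulSoup(page, "html.parser")

# Heading: text of the first <h1>, if the page has one
h1 = soup.find("h1")
heading = h1.get_text(strip=True) if h1 else None

# Body: text of every <p>, joined by newlines
paragraphs = [p.get_text(strip=True) for p in soup.find_all("p")]
article = "\n".join(paragraphs)

print(heading)
print(article)
```

Collecting every `<p>` also picks up navigation, captions, or footer text on real pages, so site-specific selectors may give cleaner articles.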

In [47]:
# Read URLs from the CSV file
df = pd.read_csv('unknown_articles.csv')

In [None]:
# Create new columns for heading and full article
df['Heading'], df['Full_Article'] = zip(*df['Article.URL'].apply(extract_content))

In [None]:
# Save the updated DataFrame to a new CSV file
df.to_csv('unknown_articles_extraction.csv', index=False)

# Predict the "Article_Types" for the extracted data

In [None]:
def predict_article_types(df):

    # Load the SentenceBERT encoder under its own name, separate from any classifier variable
    encoder = SentenceTransformer('bert-base-nli-mean-tokens')

    # Vectorize the text columns using SentenceBERT; missing text becomes an empty string
    heading_embeddings = encoder.encode(df['Heading'].fillna('').tolist())
    article_embeddings = encoder.encode(df['Full_Article'].fillna('').tolist())

    # Same column names and order as the training features
    heading_features = pd.DataFrame(heading_embeddings, columns=[f'heading_{i}' for i in range(heading_embeddings.shape[1])])
    article_features = pd.DataFrame(article_embeddings, columns=[f'article_{i}' for i in range(article_embeddings.shape[1])])

    # Keep only the embedding columns; the scaler was fit on these alone
    x = pd.concat([heading_features, article_features], axis=1)

    # Scale the features
    x_scaled = scaler.transform(x)

    y_pred = loaded_model.predict(x_scaled)
    return y_pred


In [None]:
predict_article_types(df)