# Comments classification

<b>Aim</b>

Train a model to classify comments as toxic or non-toxic.

<br><b>Background</b>

The online store "WikiShop" is launching a new service. Now, users can edit and supplement product descriptions, much like in wiki communities. This means that customers can suggest their edits and comment on the changes of others. The store needs a tool that will identify toxic comments and forward them for moderation.

Client's priorities:

- The F1 quality metric should be no less than 0.75.
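
For reference, F1 is the harmonic mean of precision and recall for the positive (toxic) class. The sketch below computes it from confusion-matrix counts; the helper name `f1_from_counts` and the counts are hypothetical and serve only as an illustration.

```python
def f1_from_counts(tp: int, fp: int, fn: int) -> float:
    """F1 for the positive class from true positives, false positives, false negatives."""
    if tp == 0:
        # no true positives: precision and recall are zero (or undefined), so report 0
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts, chosen only for illustration
example_f1 = f1_from_counts(tp=80, fp=20, fn=30)
print(round(example_f1, 3))
```

Because F1 ignores true negatives, a model that labels every comment as non-toxic scores 0 here even though its accuracy on imbalanced data would look high.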

<br><b>Data Description</b>

Features:

- *text* — comment text

Target Feature:

- *toxic* — 0: non-toxic comment; 1: toxic comment

## Preparation

In [1]:
#!pip install catboost
#!pip install lightgbm
#import sys
#!{sys.executable} -m pip install spacy
#!{sys.executable} -m spacy download en_core_web_sm

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import time
import spacy
import nltk
import re

from sklearn.utils import shuffle
from scipy.sparse import vstack
from catboost import CatBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression 
from sklearn.metrics import f1_score, make_scorer
from sklearn.model_selection import train_test_split, GridSearchCV, TimeSeriesSplit
from sklearn.feature_extraction.text import TfidfVectorizer
from tqdm import tqdm
from nltk.corpus import stopwords as nltk_stopwords

import warnings
nltk.download('stopwords')
warnings.simplefilter("ignore")
pd.options.mode.chained_assignment = None
pd.set_option('display.max_columns', None)

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Игорь\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [3]:
try:
    df = pd.read_csv('https://code.s3.yandex.net/datasets/toxic_comments.csv')
except Exception:
    # fall back to a local copy if the remote file is unavailable
    df = pd.read_csv('toxic_comments.csv')

In [4]:
df.head(5)

Unnamed: 0.1,Unnamed: 0,text,toxic
0,0,Explanation\nWhy the edits made under my usern...,0
1,1,D'aww! He matches this background colour I'm s...,0
2,2,"Hey man, I'm really not trying to edit war. It...",0
3,3,"""\nMore\nI can't make any real suggestions on ...",0
4,4,"You, sir, are my hero. Any chance you remember...",0


In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 159292 entries, 0 to 159291
Data columns (total 3 columns):
 #   Column      Non-Null Count   Dtype 
---  ------      --------------   ----- 
 0   Unnamed: 0  159292 non-null  int64 
 1   text        159292 non-null  object
 2   toxic       159292 non-null  int64 
dtypes: int64(2), object(1)
memory usage: 3.6+ MB


In [6]:
df.duplicated().sum()

0

In [7]:
df.isna().sum()

Unnamed: 0    0
text          0
toxic         0
dtype: int64

In [8]:
# verify that only two classes exist. This will also highlight any class imbalance
df['toxic'].value_counts()

0    143106
1     16186
Name: toxic, dtype: int64

In [9]:
# check for class imbalance: share of each class among all comments

print('Proportion of class 0 objects:', len(df.loc[df['toxic'] == 0]) / len(df))
print('Proportion of class 1 objects:', len(df.loc[df['toxic'] == 1]) / len(df))

Proportion of class 0 objects: 0.8983878663084147
Proportion of class 1 objects: 0.10161213369158527


The classes are strongly imbalanced: about 90% of the comments are non-toxic and about 10% are toxic. This imbalance needs to be addressed before training.
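
Two common remedies are repeating minority-class rows (upsampling) or dropping majority-class rows (downsampling). As a rough sketch, the counts printed above suggest how strong either correction would need to be to reach approximate balance. The variable names are illustrative, and the calculation assumes that a stratified split keeps the same class ratio in the training set.

```python
import math

# Class counts taken from the value_counts() output above
n_non_toxic = 143106
n_toxic = 16186

ratio = n_non_toxic / n_toxic        # non-toxic rows per toxic row
repeat = math.ceil(ratio)            # copies of each toxic row when upsampling
fraction = n_toxic / n_non_toxic     # share of non-toxic rows to keep when downsampling

print(f"imbalance ratio: {ratio:.2f}")
print(f"upsampling repeat: {repeat}")
print(f"downsampling fraction: {fraction:.2f}")
```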

In [10]:
# using the spacy model

nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])

In [11]:
# split the data into features and the target feature

features = df['text']
target = df['toxic']

In [12]:
# create a function for lemmatization and cleaning of comments

def space_lemmatize_clear(text):
    text = re.sub(r'[^a-zA-Z ]',' ', text)    
    text = text.split()
    doc = nlp(" ".join(text))
    return " ".join([token.lemma_ for token in doc])
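
To see what the cleaning step does before spaCy lemmatizes the text, the sketch below isolates the regex and whitespace handling. `clean_text` is a hypothetical helper that mirrors only the first two lines of `space_lemmatize_clear`, and the sample string is made up.

```python
import re


def clean_text(text: str) -> str:
    """Replace everything except ASCII letters and spaces, then collapse whitespace."""
    letters_only = re.sub(r'[^a-zA-Z ]', ' ', text)
    return " ".join(letters_only.split())


sample = "D'aww! He matches this\nbackground colour, 100%."
cleaned = clean_text(sample)
print(cleaned)
```

Note that apostrophes are replaced as well, so contractions are split into separate tokens, and digits are dropped entirely.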

In [13]:
# apply the function to the features

features_lemma = features.apply(space_lemmatize_clear)

In [14]:
# create a new dataframe with lemmatized features

df_lemma=pd.concat([features_lemma,target],axis=1)

In [15]:
# split the data into three sets: training, validation, and testing

df_train, df_valid_and_test = train_test_split(df_lemma, test_size=0.4, 
                                               random_state=12345, stratify=target)

df_valid_and_test_features = df_valid_and_test.drop(['toxic'], axis=1)
df_valid_and_test_target = df_valid_and_test['toxic']
df_valid, df_test = train_test_split(df_valid_and_test, test_size=0.5, 
                                     random_state=12345, stratify=df_valid_and_test_target)

In [16]:
# define the set of features and the target feature for each set:

# training
features_train = df_train['text']
target_train = df_train['toxic']

# validation
features_valid = df_valid['text']
target_valid = df_valid['toxic']

# testing
features_test = df_test['text']
target_test = df_test['toxic']

In [17]:
# load English stop words and create a TF-IDF vectorizer
# (recent scikit-learn versions expect stop_words as a list, not a set)

stopwords = list(nltk_stopwords.words('english'))
count_tf_idf = TfidfVectorizer(stop_words=stopwords)

In [18]:
# vectorize the features

train_tf_idf = count_tf_idf.fit_transform(features_train)
valid_tf_idf = count_tf_idf.transform(features_valid)
test_tf_idf = count_tf_idf.transform(features_test)
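
For intuition about what the vectorizer computes, here is a minimal pure-Python sketch of TF-IDF weighting. It assumes the smoothed IDF formula and L2 row normalization that scikit-learn documents as the `TfidfVectorizer` defaults, and it simplifies tokenization to whitespace splitting with no stop-word removal. The function name `tfidf_rows` and the toy comments are illustrative only.

```python
import math
from collections import Counter


def tfidf_rows(docs):
    """Return one {term: weight} dict per document, L2-normalized."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # document frequency: number of documents that contain each term
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # term count times smoothed idf: ln((1 + n) / (1 + df)) + 1
        weights = {
            term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in counts.items()
        }
        norm = math.sqrt(sum(w * w for w in weights.values()))
        rows.append({t: w / norm for t, w in weights.items()} if norm else {})
    return rows


toy_docs = ["you be my hero", "you be wrong", "thank you"]  # made-up comments
toy_rows = tfidf_rows(toy_docs)
print(toy_rows[0])
```

In the first toy comment, "hero" receives a higher weight than "you" because "you" appears in every document. Fitting the vectorizer only on the training set, as in the cell above, keeps document frequencies from the validation and test sets out of the model.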

In [19]:
# Upsampling Function

def upsample(features, target, repeat):
    features_zeros = features[target == 0]
    features_ones = features[target == 1]
    target_zeros = target[target == 0]
    target_ones = target[target == 1]
    features_upsampled = vstack([features_zeros] + [features_ones] * repeat)
    target_upsampled = np.concatenate([target_zeros] + [target_ones] * repeat)
    features_upsampled, target_upsampled = shuffle(
        features_upsampled, target_upsampled, random_state=12345)
    return features_upsampled, target_upsampled

features_upsampled, target_upsampled = upsample(train_tf_idf, target_train, 9)

In [20]:
# training and prediction of the LR model with upsampling

model_lr = LogisticRegression(random_state=12345, solver='liblinear', max_iter=1000, C=5)
model_lr.fit(features_upsampled, target_upsampled)
predicted_valid = model_lr.predict(valid_tf_idf)
lr_f1_up = f1_score(target_valid, predicted_valid)
print("F1 score of the LR model with upsampling:", lr_f1_up)

F1 score of the LR model with upsampling: 0.7722943722943723


In [21]:
# downsampling Function

def downsample(features, target, fraction):
    features_zeros = features[target == 0]
    features_ones = features[target == 1]
    target_zeros = target[target == 0]
    target_ones = target[target == 1]

    random_indices = np.random.choice(features_zeros.shape[0], size=int(features_zeros.shape[0] * fraction), replace=False)
    features_zeros_downsampled = features_zeros[random_indices, :]
    target_zeros_downsampled = target_zeros.iloc[random_indices]

    features_downsampled = vstack([features_zeros_downsampled, features_ones])
    target_downsampled = np.concatenate([target_zeros_downsampled, target_ones])

    features_downsampled, target_downsampled = shuffle(
        features_downsampled, target_downsampled, random_state=12345)

    return features_downsampled, target_downsampled

features_downsampled, target_downsampled = downsample(train_tf_idf, target_train, 0.11)


In [22]:
# training and prediction of the LR model with downsampling

model = LogisticRegression(random_state=12345, solver='liblinear', max_iter=1000, C=5)
model.fit(features_downsampled, target_downsampled)
predicted_valid = model.predict(valid_tf_idf)
lr_f1_down = f1_score(target_valid, predicted_valid)
print("F1 score of the LR model with downsampling:", lr_f1_down)

F1 score of the LR model with downsampling: 0.7043360766490604


<b>Conclusion</b>

Key takeaways from this stage: a class imbalance was identified, and upsampling proved to be the better way to balance the classes (validation F1 of 0.772 versus 0.704 with downsampling). The comments were lemmatized, the data was split into training, validation, and test sets (60/20/20), and the text was converted to TF-IDF vectors.

## Training

In [23]:
# training a Decision Tree

best_model_dt = None
best_result_dt_f1 = 0
best_depth_dt = 0
for depth in range(1, 20):
    model_dt = DecisionTreeClassifier(random_state=12345, max_depth=depth)
    model_dt.fit(features_upsampled, target_upsampled)
    predictions_dt_valid = model_dt.predict(valid_tf_idf)
    f1_dt = f1_score(target_valid, predictions_dt_valid)
    if f1_dt > best_result_dt_f1:
        best_model_dt = model_dt
        best_result_dt_f1 = f1_dt
        best_depth_dt = depth

print('F1 score of the best Decision Tree model on the validation set:', 
      best_result_dt_f1, 'in max_depth:', best_depth_dt)

F1 score of the best Decision Tree model on the validation set: 0.6347107438016529 in max_depth: 18


In [24]:
# training a Random Forest

best_model_rf = None
best_result_rf_f1 = 0
best_depth_rf = 0
best_est_rf = 0
for est in tqdm(range(1,20)):
    for depth in range(1, 20):
        model_rf = RandomForestClassifier(random_state=12345, n_estimators=est, max_depth=depth)
        model_rf.fit(features_upsampled, target_upsampled)
        prediction_rf_valid = model_rf.predict(valid_tf_idf)
        f1_rf = f1_score(target_valid, prediction_rf_valid)
        if f1_rf > best_result_rf_f1:
            best_model_rf = model_rf
            best_result_rf_f1 = f1_rf
            best_depth_rf = depth
            best_est_rf = est

print('F1 score of the best Random Forest model on the validation set:', best_result_rf_f1, 
      'in max_depth:', best_depth_rf, 'and n_estimators:', best_est_rf)

100%|██████████████████████████████████████████████████████████████████████████████████| 19/19 [07:37<00:00, 24.05s/it]

F1 score of the best Random Forest model on the validation set: 0.3455190882481161 in max_depth: 19 and n_estimators: 18





In [25]:
# creating a CatBoostClassifier model
catboost_model = CatBoostClassifier(random_seed=12345, loss_function='Logloss', verbose=False)

# hyperparameter grid for the CatBoostClassifier model
param_grid = {
    'learning_rate': [0.03, 0.1],
    'depth': [6, 8],
    'l2_leaf_reg': [1, 3],
    'iterations': [500, 1000],
}

# create a GridSearchCV object that optimizes F1
f1 = make_scorer(f1_score)
grid_search_cb = GridSearchCV(catboost_model, param_grid, cv=3, scoring=f1, n_jobs=-1, verbose=1)

# run the grid search; with the default refit=True, GridSearchCV retrains
# the best configuration on the full upsampled training set
grid_search_cb.fit(features_upsampled, target_upsampled)
best_catboost_model = grid_search_cb.best_estimator_
predictions_catboost_valid = best_catboost_model.predict(valid_tf_idf)
f1_catboost_valid = f1_score(target_valid, predictions_catboost_valid)
print("F1 score of the CatBoost model on the validation set:", f1_catboost_valid)

Fitting 3 folds for each of 16 candidates, totalling 48 fits
F1 score of the CatBoost model on the validation set: 0.7352006930407162


<b>Summary</b>

In total, we trained four models on the upsampled training set: Logistic Regression, Decision Tree, Random Forest, and CatBoost.

## Conclusions

In [26]:
# constructing a summary table

index = ['LogisticRegression',
         'DecisionTree',
         'RandomForest',
         'CatBoost']
final_data = {'F1':[round(lr_f1_up, 3),
                      best_result_dt_f1,
                      best_result_rf_f1,
                      f1_catboost_valid]}
final_data_table = pd.DataFrame(data=final_data, index=index)
# the client requires F1 of at least 0.75, so the comparison is inclusive
final_data_table['Success'] = final_data_table['F1'] >= 0.75

print('')
print('Summary table:')
print('')
final_data_table


Summary table:



Unnamed: 0,F1,Success
LogisticRegression,0.772,True
DecisionTree,0.634711,False
RandomForest,0.345519,False
CatBoost,0.735201,False


In [27]:
%%time

# the best result was demonstrated by the Logistic Regression model.
# we will test it on the test dataset to ensure it functions properly.
predictions_lr_test = model_lr.predict(test_tf_idf)
lr_f1_test = f1_score(target_test, predictions_lr_test)
print("F1 score of the Logistic Regression model on the test dataset:", lr_f1_test)

F1 score of the Logistic Regression model on the test dataset: 0.7603593161402491
CPU times: total: 0 ns
Wall time: 14 ms


In [28]:
# as a bonus, let's check 3 tweets to see how the model interprets them

tweet = ["It’s exciting to see more & more public figures engaging in active dialogue on this platform!", 
         "The 15 rounds of voting: the US ruling parties weren't electing the House Speaker, but pulling a fat pig.",
         "Conversations about building a brighter future are essential to driving progress."]
tweet = count_tf_idf.transform(tweet)
model_lr.predict(tweet)

array([0, 1, 0], dtype=int64)

<b>Conclusion</b>

Our goal was to train a model to predict the toxicity of a comment with an F1 quality metric of at least 0.75.

The original dataset showed a pronounced class imbalance, and upsampling proved to be the better way to address it. The comments were lemmatized and converted to TF-IDF vectors. Four models were trained: Logistic Regression, Decision Tree, Random Forest, and CatBoost. Only Logistic Regression meets the client's requirement, with an F1 score of 0.772 on the validation set and 0.760 on the test set.

The task is therefore accomplished, and we recommend the Logistic Regression model to the client.