# Toxic comments detection model


The online store 'Vikishop' is launching a new service. Now users can edit and enhance product descriptions, similar to wiki communities. In other words, customers can suggest edits and comment on other people's changes. The store needs a tool that can detect toxic comments and send them for moderation.

Train a model to classify comments as toxic (class 1) or non-toxic (class 0). You have a dataset with annotations indicating the toxicity of edits.

Build a model with an F1 score of at least 0.75.



## Preprocessing

In [2]:
import torch
import transformers
import pandas as pd
import os
import numpy as np
import nltk
import re
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet
from nltk.tokenize import TweetTokenizer
from nltk.corpus import stopwords as nltk_stopwords
from nltk.stem.snowball import EnglishStemmer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

In [3]:
pth1 = '/datasets/toxic_comments.csv'
pth2 = 'C:/Users/n.kirpichnikov/Desktop/Оля Учеба/Проекты/comments review/toxic_comments.csv'

if os.path.exists(pth1):
    data = pd.read_csv(pth1)
elif os.path.exists(pth2):
    data = pd.read_csv(pth2)
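else:
    # Fail early: without the file, `data` would be undefined in later cells
    raise FileNotFoundError('toxic_comments.csv was not found at either path')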

In [4]:
display(data)
display(data.info())

Unnamed: 0.1,Unnamed: 0,text,toxic
0,0,Explanation\nWhy the edits made under my usern...,0
1,1,D'aww! He matches this background colour I'm s...,0
2,2,"Hey man, I'm really not trying to edit war. It...",0
3,3,"""\nMore\nI can't make any real suggestions on ...",0
4,4,"You, sir, are my hero. Any chance you remember...",0
...,...,...,...
159287,159446,""":::::And for the second time of asking, when ...",0
159288,159447,You should be ashamed of yourself \n\nThat is ...,0
159289,159448,"Spitzer \n\nUmm, theres no actual article for ...",0
159290,159449,And it looks like it was actually you who put ...,0


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 159292 entries, 0 to 159291
Data columns (total 3 columns):
 #   Column      Non-Null Count   Dtype 
---  ------      --------------   ----- 
 0   Unnamed: 0  159292 non-null  int64 
 1   text        159292 non-null  object
 2   toxic       159292 non-null  int64 
dtypes: int64(2), object(1)
memory usage: 3.6+ MB


None

The dataset contains 159,292 comments, with no missing values.

In [5]:
data['toxic'].value_counts()

0    143106
1     16186
Name: toxic, dtype: int64

The number of toxic comments is 16,186, and the number of regular comments is 143,106.
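
The class counts above imply a strong imbalance. As a rough illustration (plain arithmetic on the printed counts, not part of the original notebook), the sketch below shows why accuracy would be a misleading metric here and why F1 is used instead:

```python
# Class counts copied from the value_counts() output above
counts = {0: 143106, 1: 16186}

total = sum(counts.values())
toxic_share = counts[1] / total          # fraction of toxic comments
imbalance_ratio = counts[0] / counts[1]  # non-toxic comments per toxic one

# A trivial baseline that labels every comment as non-toxic is right
# whenever the true label is 0, so its accuracy equals the majority share.
baseline_accuracy = counts[0] / total
# It finds no toxic comment at all (zero true positives), so its F1
# for the toxic class is 0 by the usual convention.
baseline_f1 = 0.0

print(f"toxic share:       {toxic_share:.3f}")
print(f"imbalance ratio:   {imbalance_ratio:.2f}")
print(f"baseline accuracy: {baseline_accuracy:.3f}")
print(f"baseline F1:       {baseline_f1:.1f}")
```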

Let's write a function called "clear_text(text)" that keeps only Latin letters and apostrophes in the text and replaces every other character with a space. It takes the text as input and returns the cleaned text.

In [6]:
def clear_text(text):
    clean_text = re.sub(r"[^a-zA-Z']", ' ', text)
    return clean_text

In [7]:
data['text'] = data['text'].apply(clear_text)
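
To make the effect of the regular expression concrete, here is a small sketch on a made-up sample string. `clear_text` is repeated from the cell above; `clear_text_compact` is a hypothetical helper, not used in the notebook, that also collapses the runs of spaces left behind by the substitution.

```python
import re

def clear_text(text):
    # Same rule as in the notebook: anything that is not a Latin letter
    # or an apostrophe (including digits, punctuation, newlines) becomes a space.
    return re.sub(r"[^a-zA-Z']", ' ', text)

def clear_text_compact(text):
    # Hypothetical variant: split on whitespace and re-join with single spaces.
    return ' '.join(clear_text(text).split())

sample = "D'aww!\nHe matches 100% :)"   # made-up example input
cleaned = clear_text(sample)
compact = clear_text_compact(sample)

print(repr(cleaned))
print(repr(compact))
```

Because each removed character is replaced by exactly one space, `clear_text` preserves the length of the input; the extra spaces are harmless for the tokenizer used later, so the compact variant is optional.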

In [8]:
display(data)

Unnamed: 0.1,Unnamed: 0,text,toxic
0,0,Explanation Why the edits made under my userna...,0
1,1,D'aww He matches this background colour I'm s...,0
2,2,Hey man I'm really not trying to edit war It...,0
3,3,More I can't make any real suggestions on im...,0
4,4,You sir are my hero Any chance you remember...,0
...,...,...,...
159287,159446,And for the second time of asking when ...,0
159288,159447,You should be ashamed of yourself That is a ...,0
159289,159448,Spitzer Umm theres no actual article for pr...,0
159290,159449,And it looks like it was actually you who put ...,0


Lemmatize the text with the NLTK library, using part-of-speech tags to choose the lemma.

In [24]:
from nltk import pos_tag
nltk.download('averaged_perceptron_tagger')
wnl = WordNetLemmatizer()
tknzr = TweetTokenizer()

def lemmatize(text, m=wnl):
    word_list = tknzr.tokenize(text)
    tagged_words = pos_tag(word_list)
    lemmatized_words = []
    for word, tag in tagged_words:
        if tag.startswith('N'):  # Nouns
            pos = 'n'
        elif tag.startswith('V'):  # Verbs
            pos = 'v'
        elif tag.startswith('R'):  # Adverbs
            pos = 'r'
        else:  # Adjectives and others
            pos = 'a'
        lemmatized_words.append(m.lemmatize(word, pos=pos))
    return ' '.join(lemmatized_words)
    

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\n.kirpichnikov\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


In [25]:
data['text'] = data['text'].apply(lemmatize)
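
The core of `lemmatize` is the mapping from the Penn Treebank tags returned by `pos_tag` to the single-letter part-of-speech codes that `WordNetLemmatizer.lemmatize` accepts. The sketch below isolates that mapping so it can be checked without downloading any NLTK data. The function name `penn_to_wordnet` and its `fallback` parameter are illustrative assumptions, not part of the notebook.

```python
def penn_to_wordnet(tag, fallback='a'):
    """Map a Penn Treebank tag to a WordNet POS code.

    Mirrors the branches in lemmatize(): nouns, verbs and adverbs are
    recognised by the first letter of the tag, and everything else
    (adjectives such as 'JJ', but also determiners, pronouns, etc.)
    gets the fallback code.
    """
    if tag.startswith('N'):
        return 'n'   # NN, NNS, NNP, ...
    if tag.startswith('V'):
        return 'v'   # VB, VBD, VBG, ...
    if tag.startswith('R'):
        return 'r'   # RB, RBR, RBS
    return fallback

for tag in ['NNS', 'VBD', 'RB', 'JJ', 'DT']:
    print(tag, '->', penn_to_wordnet(tag))
```

With the notebook's choice of `'a'` as the fallback, function words such as determiners are lemmatized as adjectives, which in practice usually leaves them unchanged. Passing `fallback='n'` would match the lemmatizer's own default part of speech; which choice works better for this dataset is not tested here.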

In [13]:
display(data)

Unnamed: 0.1,Unnamed: 0,text,toxic
0,0,Explanation Why the edits make under my userna...,0
1,1,D'aww He match this background colour I'm seem...,0
2,2,Hey man I'm really not try to edit war It's ju...,0
3,3,More I can't make any real suggestion on impro...,0
4,4,You sir be my hero Any chance you remember wha...,0
...,...,...,...
159287,159446,And for the second time of ask when your view ...,0
159288,159447,You should be ashamed of yourself That be a ho...,0
159289,159448,Spitzer Umm theres no actual article for prost...,0
159290,159449,And it look like it be actually you who put on...,0


## Model Training

In [11]:
target = data['toxic']
features = data['text']

In [12]:
features_train, features_test, target_train, target_test = train_test_split(features, target, test_size=0.1, shuffle=False)

In [13]:
model_data = pd.DataFrame(columns=('model','fit_time','score_time','f1-score'))
model_data.head()

Unnamed: 0,model,fit_time,score_time,f1-score


### Logistic Regression

Download the NLTK list of English stop words.

In [14]:
nltk.download('stopwords')
stopwords = set(nltk_stopwords.words('english'))

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\n.kirpichnikov\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


To compute TF-IDF for a text corpus, we fit the vectorizer on the training set only and then use it to transform both the training and test sets.

The vectorizer identifies the unique words in the training corpus and returns a matrix where each row represents a text and each column represents a unique word. The value at their intersection is the word's TF-IDF weight: its frequency in that text, scaled down for words that occur in many texts.
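
The fit/transform split can be illustrated without scikit-learn. The sketch below is a simplified pure-Python version on made-up sentences: it uses the smoothed formula that `TfidfVectorizer` applies by default, idf = ln((1 + n) / (1 + df)) + 1 followed by L2 normalization, but it splits on whitespace and skips stop-word removal, so it is not a drop-in replacement for the vectorizer.

```python
import math
from collections import Counter

train_docs = ["you are my hero", "you are a fool", "nice edit"]  # made-up texts
test_doc = "you fool"

# "fit": vocabulary and document frequencies come from the training texts only
tokenized = [doc.split() for doc in train_docs]
vocab = sorted({word for doc in tokenized for word in doc})
n_docs = len(tokenized)
doc_freq = Counter(word for doc in tokenized for word in set(doc))
idf = {word: math.log((1 + n_docs) / (1 + doc_freq[word])) + 1 for word in vocab}

def transform(doc):
    """Return the L2-normalized TF-IDF vector of `doc` over the fitted vocabulary."""
    # Words never seen during fitting are ignored, as with a fitted vectorizer.
    counts = Counter(word for word in doc.split() if word in idf)
    weights = [counts[word] * idf[word] for word in vocab]
    norm = math.sqrt(sum(w * w for w in weights)) or 1.0
    return [w / norm for w in weights]

test_vec = transform(test_doc)
weight = dict(zip(vocab, test_vec))
print({word: round(value, 3) for word, value in weight.items() if value > 0})
```

In the test vector, "fool" (seen in one training text) receives a larger weight than "you" (seen in two), which is the down-weighting of common words that the inverse document frequency is meant to provide.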

In [15]:
pipeline_lr = Pipeline(
    [
        ("tf_idf", TfidfVectorizer(stop_words=stopwords)),
        ("model_lr", LogisticRegression(random_state=12345, solver = 'liblinear', class_weight='balanced')),
    ]
)
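
The pipeline relies on `class_weight='balanced'` to handle the class imbalance. In scikit-learn this setting weights each class by `n_samples / (n_classes * class_count)`. The sketch below applies that formula to the counts of the full dataset purely as an illustration; the weights actually used during training are computed from the training folds, so they will differ slightly.

```python
# Class counts of the full dataset, copied from the value_counts() output
counts = {0: 143106, 1: 16186}

n_samples = sum(counts.values())
n_classes = len(counts)

# 'balanced' heuristic: weight = n_samples / (n_classes * class_count)
weights = {label: n_samples / (n_classes * count) for label, count in counts.items()}

for label, weight in weights.items():
    print(f"class {label}: weight {weight:.2f}, weighted total {weight * counts[label]:.0f}")
```

After weighting, both classes contribute the same total weight to the loss, so errors on the rare toxic class are no longer dominated by the much larger non-toxic class.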

Set up the grid search.

In [16]:
params_lr = {
    'model_lr__class_weight': ['balanced'],
    'model_lr__C': [8, 10, 12],
    'model_lr__max_iter': [500],
}

In [17]:
grid_lr = GridSearchCV(estimator=pipeline_lr, cv=5, param_grid=params_lr, scoring='f1', n_jobs=-1, verbose=3)
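
For a grid given as a single dictionary, the search evaluates every combination of the listed values, and each combination is fitted once per cross-validation fold. The sketch below reproduces that count with `itertools.product`; `count_fits` is a hypothetical helper, and the two grids are copied from this notebook (the second one from the random forest section).

```python
from itertools import product

def count_fits(param_grid, cv):
    """Return (number of candidates, number of CV fits) for a single-dict grid."""
    candidates = list(product(*param_grid.values()))
    return len(candidates), len(candidates) * cv

params_lr = {
    'model_lr__class_weight': ['balanced'],
    'model_lr__C': [8, 10, 12],
    'model_lr__max_iter': [500],
}
params_rf = {
    'model_rf__class_weight': ['balanced'],
    'model_rf__max_depth': [10, 20, 30],
    'model_rf__n_estimators': [100, 200, 300],
}

print(count_fits(params_lr, cv=5))
print(count_fits(params_rf, cv=5))
```

These counts do not include the final refit of the best candidate on the whole training set, which `GridSearchCV` performs by default after the search.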

In [18]:
grid_lr.fit(features_train, target_train)

Fitting 5 folds for each of 3 candidates, totalling 15 fits


GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('tf_idf',
                                        TfidfVectorizer(stop_words={'a',
                                                                    'about',
                                                                    'above',
                                                                    'after',
                                                                    'again',
                                                                    'against',
                                                                    'ain',
                                                                    'all', 'am',
                                                                    'an', 'and',
                                                                    'any',
                                                                    'are',
                                                                    'aren',
        

In [19]:
print(grid_lr.best_params_, grid_lr.best_score_)

{'model_lr__C': 10, 'model_lr__class_weight': 'balanced', 'model_lr__max_iter': 500} 0.7631816162431637


In [20]:
results = grid_lr.cv_results_
best_index = grid_lr.best_index_

# Mean fit/score time (seconds) and mean cross-validated F1 of the best candidate
fit_time = results['mean_fit_time'][best_index]
score_time = results['mean_score_time'][best_index]
F1 = grid_lr.best_score_

In [21]:
lr_data = ['logistic_regression', fit_time, score_time, F1]

model_data.loc[len(model_data)] = lr_data 
model_data.head()

Unnamed: 0,model,fit_time,score_time,f1-score
0,logistic_regression,16.809028,2.35814,0.763182


### Random Forest Classifier

In [22]:
pipeline_rf = Pipeline(
    [
        ("tf_idf", TfidfVectorizer(stop_words=stopwords)),
        ("model_rf", RandomForestClassifier(random_state=12345)),
    ]
)

In [23]:
params_rf = {
    'model_rf__class_weight': ['balanced'],
    'model_rf__max_depth': [10, 20, 30],
    'model_rf__n_estimators': [100, 200, 300],
}

In [24]:
grid_rf = GridSearchCV(estimator=pipeline_rf, cv=5, param_grid=params_rf, scoring='f1', n_jobs=-1, verbose=3)

In [25]:
grid_rf.fit(features_train, target_train)

Fitting 5 folds for each of 9 candidates, totalling 45 fits


GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('tf_idf',
                                        TfidfVectorizer(stop_words={'a',
                                                                    'about',
                                                                    'above',
                                                                    'after',
                                                                    'again',
                                                                    'against',
                                                                    'ain',
                                                                    'all', 'am',
                                                                    'an', 'and',
                                                                    'any',
                                                                    'are',
                                                                    'aren',
        

In [26]:
print(grid_rf.best_params_, grid_rf.best_score_)

{'model_rf__class_weight': 'balanced', 'model_rf__max_depth': 30, 'model_rf__n_estimators': 300} 0.43154895295118034


In [27]:
results = grid_rf.cv_results_
best_index = grid_rf.best_index_

# Mean fit/score time (seconds) and mean cross-validated F1 of the best candidate
fit_time = results['mean_fit_time'][best_index]
score_time = results['mean_score_time'][best_index]
F1 = grid_rf.best_score_

In [28]:
rf_data = ['rf_classifyer', fit_time, score_time, F1]

model_data.loc[len(model_data)] = rf_data 
model_data.head()

Unnamed: 0,model,fit_time,score_time,f1-score
0,logistic_regression,16.809028,2.35814,0.763182
1,rf_classifyer,112.200676,3.672404,0.431549


## Testing the best model

Logistic regression performs best in cross-validation: its F1 score is 0.763 against 0.432 for the random forest, and it also trains much faster. Let's test it on the held-out set.

In [29]:
preds = grid_lr.predict(features_test)

In [30]:
f1_score(target_test, preds)

0.7718197375926982
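
For reference, F1 is the harmonic mean of precision and recall for the positive (toxic) class. The notebook does not print a confusion matrix, so the counts in the sketch below are made up purely to show the arithmetic; `f1_from_counts` is a hypothetical helper, not a scikit-learn function.

```python
def f1_from_counts(tp, fp, fn):
    """Compute precision, recall and F1 for the positive class from raw counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical counts: 80 toxic comments caught, 20 false alarms, 30 missed
precision, recall, f1 = f1_from_counts(tp=80, fp=20, fn=30)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

True negatives do not enter the formula, which is why F1 is not inflated by the large number of non-toxic comments.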

## Conclusion

We selected the best model for the online store 'Vikishop' to detect toxic comments and send them for moderation.

To achieve this, we loaded a dataset of comments labeled with their toxicity status. The dataset contained 159,292 comments, of which 16,186 (about 10%) were toxic and 143,106 were normal. The target variable was imbalanced, which was taken into account during model training by using balanced class weights.

Next, we cleaned the text, retaining only Latin alphabet letters and apostrophes and replacing everything else with spaces. We then performed tokenization (splitting the text into a list of words) and lemmatization (converting words to their base form).

We then loaded a list of stop words and transformed the text strings into vectors using TfidfVectorizer(stop_words=stopwords), which excludes the stop words from the vocabulary. This resulted in a TF-IDF matrix, which we used as features.

In other words, we tackled a binary classification problem where the target variable was "1" for toxic text and "0" for non-toxic text. The features consisted of words and their respective TF-IDF values for each text.

After text preprocessing, we trained two classifiers and evaluated their performance using cross-validation. To avoid data leakage during cross-validation, we placed the model and vectorizer in a pipeline.

As a result, the logistic regression model (cross-validated F1 of 0.763) outperformed the random forest model (0.432), so we tested only the logistic regression model. On the test set it reached an F1 score of 0.772, which is above the required threshold of 0.75. Thus, this model is suitable for our purposes.

