# Toxic comments classification

>Pre-moderation of comments requires automated text classification. The aim is to detect toxic comments and queue them for moderation. <br>
Using the available text data, a model should be trained to distinguish toxic from non-toxic comments. 
The labeled data contain two columns: <i>text</i> holds the comment text, and <i>toxic</i> is the target. <br> The required F1 score is at least 0.75.

## Libraries import

In [1]:
#importing libraries
import pandas as pd
import numpy as np
import random

import nltk
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords 

import re
from tqdm import notebook

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split, GridSearchCV
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
#SCORERS was unused here and is no longer importable in recent scikit-learn releases
from sklearn.metrics import f1_score
from sklearn.naive_bayes import MultinomialNB, ComplementNB
from sklearn.pipeline import Pipeline

import matplotlib.pyplot as plt

from lightgbm import LGBMClassifier

In [2]:
import torch

In [52]:
import spacy

## Data preprocessing

In [4]:
#path to folder with stopwords and data
path = 'D:/Bot/DataScience/Jupyter projects/3 module/project 4'

In [5]:
#loading data
try:
    data = pd.read_csv('/datasets/toxic_comments.csv') #remote copy of the dataset
    print('Using remote files')
    print()
except FileNotFoundError:
    #falling back to the local copy
    data = pd.read_csv(path + '/toxic_comments.csv')
    print('Using local files')
    print()

Using local files



In [6]:
#data overview
display(data.head())
print(data.info())
#checking target for class imbalance
print('Class proportions in the full dataset:',data['toxic'].value_counts(normalize=True))

Unnamed: 0,text,toxic
0,Explanation\nWhy the edits made under my usern...,0
1,D'aww! He matches this background colour I'm s...,0
2,"Hey man, I'm really not trying to edit war. It...",0
3,"""\nMore\nI can't make any real suggestions on ...",0
4,"You, sir, are my hero. Any chance you remember...",0


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 159571 entries, 0 to 159570
Data columns (total 2 columns):
 #   Column  Non-Null Count   Dtype 
---  ------  --------------   ----- 
 0   text    159571 non-null  object
 1   toxic   159571 non-null  int64 
dtypes: int64(1), object(1)
memory usage: 2.4+ MB
None
Class proportions in the full dataset: 0    0.898321
1    0.101679
Name: toxic, dtype: float64
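With roughly 90% of comments non-toxic, accuracy alone would be misleading, which is one reason F1 is a sensible target here. The sketch below works through two trivial baselines on an illustrative sample of 1000 comments with about the class shares printed above; `f1_from_counts` is a hypothetical helper, not part of the notebook.

```python
def f1_from_counts(tp, fp, fn):
    """F1 for the positive class from confusion-matrix counts."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Illustrative sample: 1000 comments, about 10.2% of them toxic (assumed numbers)
n_total, n_toxic = 1000, 102
n_clean = n_total - n_toxic

# Baseline A: always predict "non-toxic"
accuracy_all_clean = n_clean / n_total
f1_all_clean = f1_from_counts(tp=0, fp=0, fn=n_toxic)

# Baseline B: always predict "toxic"
f1_all_toxic = f1_from_counts(tp=n_toxic, fp=n_clean, fn=0)

print(accuracy_all_clean)       # 0.898
print(f1_all_clean)             # 0.0
print(round(f1_all_toxic, 3))   # 0.185
```

Baseline A reaches almost 90% accuracy while finding no toxic comments at all, so any useful model has to be judged on the positive class.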


### Additional functions

#### Removing punctuation and other symbols

In [7]:
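#removing everything but letters and spaces
#and converting the text to lowercase
def purge(text):
    text = text.lower()
    text = re.sub(r'[^a-zA-Z]',' ',text)
    #collapsing runs of whitespace into single spaces
    #(the results of split and join were previously discarded)
    return " ".join(text.split())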
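A quick check of the cleaning logic on one of the sample comments. This is a minimal standalone sketch; `clean_text` is an illustrative name, and it collapses whitespace explicitly.

```python
import re

def clean_text(text):
    """Lowercase, replace every non-letter with a space, collapse whitespace."""
    text = text.lower()
    text = re.sub(r'[^a-z]', ' ', text)
    return " ".join(text.split())

sample = "D'aww! He matches this background colour"
cleaned = clean_text(sample)
print(cleaned)  # d aww he matches this background colour
```

Note that apostrophes become spaces, so contractions such as "I'm" end up as two tokens ("i" and "m").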

#### Lemmatization

In [8]:
%%time
#using spacy for lemmatization
nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])

#function for sentence lemmatization
def lemm(text):
    value = nlp(text)
    lemmatized = " ".join([token.lemma_ for token in value])
    return lemmatized

Wall time: 659 ms


### Cleaning the text

In [9]:
%%time
#removing symbols and converting to lowercase using function
data['purged'] = data['text'].apply(purge)

Wall time: 6.73 s


In [10]:
%%time
#lemmatizing
data['purged'] = data['purged'].apply(lemm)

Wall time: 16min 26s


In [11]:
#checking out
display(data.head())

Unnamed: 0,text,toxic,purged
0,Explanation\nWhy the edits made under my usern...,0,explanation why the edit make under my usernam...
1,D'aww! He matches this background colour I'm s...,0,d aww he match this background colour I m se...
2,"Hey man, I'm really not trying to edit war. It...",0,hey man I m really not try to edit war it ...
3,"""\nMore\nI can't make any real suggestions on ...",0,more I can t make any real suggestion on im...
4,"You, sir, are my hero. Any chance you remember...",0,you sir be my hero any chance you rememb...


### Stop-words

Stop words need to be removed, so we build a list of them: one part comes from the nltk library, the other is an additional list found on the web.

In [12]:
#https://github.com/saurabbhsp/stopwords source for stopwords

additonal_words = pd.read_csv(path + '/additional_words.csv')
display(additonal_words)

Unnamed: 0,also read
0,read also
1,read more
2,source
3,&
4,'ll
...,...
725,youre
726,yours
727,yourself
728,yourselves


In [13]:
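#getting stopwords from nltk
nltk.download('stopwords')
#merging lists: appending to stopwords.words('english') only changes a temporary list,
#so the additional words are added to the set explicitly (taken from the last column)
extra_words = additonal_words.iloc[:, -1].astype(str).tolist()
stop_words = set(stopwords.words('english')) | set(extra_words)
stop_words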

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Anastasia\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


{'a',
 'about',
 'above',
 'after',
 'again',
 'against',
 'ain',
 'all',
 'am',
 'an',
 'and',
 'any',
 'are',
 'aren',
 "aren't",
 'as',
 'at',
 'be',
 'because',
 'been',
 'before',
 'being',
 'below',
 'between',
 'both',
 'but',
 'by',
 'can',
 'couldn',
 "couldn't",
 'd',
 'did',
 'didn',
 "didn't",
 'do',
 'does',
 'doesn',
 "doesn't",
 'doing',
 'don',
 "don't",
 'down',
 'during',
 'each',
 'few',
 'for',
 'from',
 'further',
 'had',
 'hadn',
 "hadn't",
 'has',
 'hasn',
 "hasn't",
 'have',
 'haven',
 "haven't",
 'having',
 'he',
 'her',
 'here',
 'hers',
 'herself',
 'him',
 'himself',
 'his',
 'how',
 'i',
 'if',
 'in',
 'into',
 'is',
 'isn',
 "isn't",
 'it',
 "it's",
 'its',
 'itself',
 'just',
 'll',
 'm',
 'ma',
 'me',
 'mightn',
 "mightn't",
 'more',
 'most',
 'mustn',
 "mustn't",
 'my',
 'myself',
 'needn',
 "needn't",
 'no',
 'nor',
 'not',
 'now',
 'o',
 'of',
 'off',
 'on',
 'once',
 'only',
 'or',
 'other',
 'our',
 'ours',
 'ourselves',
 'out',
 'over',
 'own',
 'r
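One detail worth keeping in mind when merging the two lists: `stopwords.words('english')` appears to build a new list on every call, so appending to its return value has no lasting effect. The sketch below illustrates this with a stand-in function (`fresh_word_list` is hypothetical and only mimics that behaviour) and shows an explicit set union instead.

```python
def fresh_word_list():
    """Stand-in for a function that builds and returns a new list on every call."""
    return ['a', 'the', 'and']

# Appending to the returned list changes only a temporary object
fresh_word_list().append('source')
lost = set(fresh_word_list())

# Building the union explicitly keeps the extra words
extra_words = ['source', 'youre']
merged = set(fresh_word_list()) | set(extra_words)

print('source' in lost)    # False
print('source' in merged)  # True
```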

In [14]:
#function for removing stop words
def del_stop_words(text):
    text = text.split()
    new_text = []
    for element in text:
        if element not in stop_words:
            new_text.append(element)
    new_text = " ".join(new_text)
    return new_text

In [15]:
%%time
data['purged'] = data['purged'].apply(del_stop_words)

Wall time: 2.7 s


In [16]:
display(data.head())

Unnamed: 0,text,toxic,purged
0,Explanation\nWhy the edits made under my usern...,0,explanation edit make username hardcore metall...
1,D'aww! He matches this background colour I'm s...,0,aww match background colour I seemingly stick ...
2,"Hey man, I'm really not trying to edit war. It...",0,hey man I really try edit war guy constantly r...
3,"""\nMore\nI can't make any real suggestions on ...",0,I make real suggestion improvement I wonder se...
4,"You, sir, are my hero. Any chance you remember...",0,sir hero chance remember page
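The cleaned column above still contains the token `I`. The lemmatizer returns the pronoun capitalised, while the stop-word set holds lowercase `i`, so the membership test misses it. One possible fix is to compare lowercase forms; the sketch uses a tiny illustrative stop-word set and a hypothetical helper name, `del_stop_words_ci`.

```python
# Tiny illustrative stop-word set (the notebook uses the much larger nltk-based set)
demo_stop_words = {'i', 'he', 'this', 'm'}

def del_stop_words_ci(text, stop_words):
    """Drop tokens whose lowercase form is in stop_words."""
    return " ".join(tok for tok in text.split() if tok.lower() not in stop_words)

lemmatized = "d aww he match this background colour I m seemingly stick"
filtered = del_stop_words_ci(lemmatized, demo_stop_words)
print(filtered)  # d aww match background colour seemingly stick
```

TfidfVectorizer's default token pattern ignores single-character tokens anyway, so this mainly affects how the cleaned column reads rather than the resulting features.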


### Splitting data

In [17]:
%%time

features = data['purged']
target = data['toxic']

features_train, features_test, target_train,  target_test = train_test_split(features,target, test_size = 0.2,random_state = 12345)
#checking subsets size
print(features_train.shape, target_train.shape)
print(features_test.shape,target_test.shape)

(127656,) (127656,)
(31915,) (31915,)
Wall time: 32.9 ms


### TF-IDF

In this project it was decided to use TfidfVectorizer for feature extraction. 
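To make the weighting concrete, here is a small pure-Python sketch of TF-IDF on three toy documents. It follows the smoothed formula documented for scikit-learn's default settings, idf = ln((1 + n) / (1 + df)) + 1 followed by L2 normalisation, but the tokenisation is a plain `split()`, so it is an illustration rather than a reimplementation of TfidfVectorizer. The documents and the helper `tfidf_vector` are made up for the example.

```python
import math
from collections import Counter

docs = [
    "you be my hero",
    "you be not my hero",
    "edit war",
]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: number of documents containing each term
df = Counter(term for tokens in tokenized for term in set(tokens))

def tfidf_vector(tokens):
    """L2-normalised TF-IDF weights for one tokenised document."""
    tf = Counter(tokens)
    # Smoothed idf: ln((1 + n) / (1 + df)) + 1
    weights = {t: tf[t] * (math.log((1 + n_docs) / (1 + df[t])) + 1) for t in tf}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()}

vec = tfidf_vector(tokenized[1])
# The rarest term of the second document gets the largest weight
print(max(vec, key=vec.get))  # not
```

Terms that occur in few documents ("not" here) receive higher weights than terms shared by many documents, which is what makes the features informative for classification.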

In [18]:
%%time
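#using TF-IDF for feature engineering
train_corpus = list(features_train)
#making the set of unigrams
#stop_words is passed as a list: recent scikit-learn versions validate this parameter and may reject a set
count_tf_idf = TfidfVectorizer(stop_words=list(stop_words),ngram_range=(1, 1))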
count_tf_idf.fit(train_corpus)

x_train = count_tf_idf.transform(train_corpus)
train_dic = count_tf_idf.vocabulary_

Wall time: 10.3 s


In [19]:
test_corpus = list(features_test)
x_test = count_tf_idf.transform(test_corpus)


Making a second feature set with unigrams and bigrams for comparison. 

In [20]:
%%time
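#an additional set with unigrams and bigrams
#stop_words is passed as a list: recent scikit-learn versions validate this parameter and may reject a set
counter = TfidfVectorizer(stop_words=list(stop_words),ngram_range=(1, 2))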
counter.fit(train_corpus)
x2_train = counter.transform(train_corpus)

x2_test = counter.transform(test_corpus)

Wall time: 36.9 s


## Models training

#### Accessory functions


In [21]:
#dictionary for models and counter
models_dict = {}

predictions_train_dict = {}
predictions_test_dict = {}

f1_train_dict = {}
f1_test_dict = {}

Counter = 0

In [22]:
#features lists
features_list = [x_train,target_train,x_test,target_test]
print(x_train.shape,target_train.shape,x_test.shape,target_test.shape)

features_list2 = [x2_train,target_train,x2_test,target_test]
print(x2_train.shape,target_train.shape,x2_test.shape,target_test.shape)

(127656, 132665) (127656,) (31915, 132665) (31915,)
(127656, 2065784) (127656,) (31915, 2065784) (31915,)
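The shapes above show the vocabulary growing from 132665 to 2065784 columns once bigrams are added. A small pure-Python sketch of n-gram extraction illustrates where the extra features come from; `ngrams` is an illustrative helper, not a scikit-learn function.

```python
def ngrams(tokens, n_min=1, n_max=2):
    """Return all contiguous n-grams with n_min <= n <= n_max as strings."""
    result = []
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            result.append(" ".join(tokens[i:i + n]))
    return result

tokens = "hey man really try edit war".split()
unigrams = ngrams(tokens, 1, 1)
uni_and_bi = ngrams(tokens, 1, 2)
print(len(unigrams), len(uni_and_bi))  # 6 11
print(uni_and_bi[6:])
# ['hey man', 'man really', 'really try', 'try edit', 'edit war']
```

Every pair of adjacent words becomes its own feature, and since most word pairs are rare across the corpus, the number of distinct columns grows much faster than the number of distinct words.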


In [23]:
#function for model training and F1 printing

def train_model(model,features_list):
    
    global Counter
    Counter += 1 #counter counts every function call
    
    name = 'model'+str(Counter) 
    
    predictions_train = 'predictions_train' + '_' + name
    predictions_test = 'predictions_test' + '_' + name
            
    f1_train = 'f1_train' + '_' + name
    f1_test = 'f1_test' + '_' + name
    
    models_dict[name] = model
    models_dict[name].fit(features_list[0],features_list[1])
    
    #reminder of the expected layout: features_list = [features_train,target_train,features_test,target_test]
    
    predictions_train_dict[predictions_train] = models_dict[name].predict(features_list[0])
    predictions_test_dict[predictions_test] = models_dict[name].predict(features_list[2])
    
    f1_train_dict[f1_train] = f1_score(features_list[1],predictions_train_dict[predictions_train])
    f1_test_dict[f1_test] = f1_score(features_list[3],predictions_test_dict[predictions_test])
    
    print('train F1:', f1_train_dict[f1_train])
    print('test F1:',f1_test_dict[f1_test])
    return models_dict[name]

#### Pipeline for TF-IDF

Adding a pipeline for cross-validation, so that the vectorizer is refitted inside each fold (count_tf_idf uses unigrams, counter uses unigrams and bigrams).

In [24]:
#new features lists for pipelines
features_list_ = [features_train,target_train,features_test,target_test]

In [25]:
def pipe(model,tf_vectorizer):
    pipe = Pipeline([('TF-IDF',tf_vectorizer),('model',model)])
    return pipe

In [26]:
#the function takes a model and a dictionary with parameters for GridSearchCV,
#builds pipelines for the model on unigrams and on uni- and bigrams,
#then evaluates the best estimators on train and test
#every model parameter X must be named model__X in the dictionary
def lasy(model,parameters):
    #pipelines
    model_pipe_uni = pipe(model,count_tf_idf)
    model_pipe_bi = pipe(model,counter)
    
    gsearch_model_uni = GridSearchCV(model_pipe_uni,parameters,scoring='f1',n_jobs=-1,cv=5)
    gsearch_model_uni.fit(features_train,target_train)
    print('parameter for model with unigrams:',gsearch_model_uni.best_params_)
    
    gsearch_model_bi = GridSearchCV(model_pipe_bi,parameters,scoring='f1',n_jobs=-1,cv=5)
    gsearch_model_bi.fit(features_train,target_train)
    print('parameter for model with bigrams:',gsearch_model_bi.best_params_)
    
    #checking on train and test
    print('Model trained on unigrams:')
    model_uni = train_model(gsearch_model_uni.best_estimator_,features_list_)
    
    print('Model trained on uni- and bigrams:')
    model_bi = train_model(gsearch_model_bi.best_estimator_,features_list_)
#### Linear model

In [33]:
%%time 
linear_model = LogisticRegression(class_weight='balanced',solver='lbfgs',max_iter=400)
#using pipeline
linear_pipe_uni = pipe(linear_model,count_tf_idf)
linear_pipe_bi = pipe(linear_model,counter)
#varying C
parameters_linear = {
    'model__C':[0.1,10.]
    
}
gsearch_linear_uni = GridSearchCV(linear_pipe_uni,parameters_linear,scoring='f1',n_jobs=-1,cv=5)
gsearch_linear_uni.fit(features_train,target_train)
print('parameter for model with unigrams:',gsearch_linear_uni.best_params_)

gsearch_linear_bi = GridSearchCV(linear_pipe_bi,parameters_linear,scoring='f1',n_jobs=-1,cv=5)
gsearch_linear_bi.fit(features_train,target_train)
print('parameter for model with bigrams:',gsearch_linear_bi.best_params_)

parameter for model with unigrams: {'model__C': 10.0}
parameter for model with bigrams: {'model__C': 10.0}
Wall time: 6min 34s


In [34]:
%%time
print('Model trained on unigrams:')
linear_model3 = train_model(gsearch_linear_uni.best_estimator_,features_list_)

Model trained on unigrams:
train F1: 0.9070380922232775
test F1: 0.7610795454545454
Wall time: 26.3 s


In [35]:
%%time
print('Model trained on uni- and bigrams:')
linear_model4 = train_model(gsearch_linear_bi.best_estimator_,features_list_)

Model trained on uni- and bigrams:
train F1: 0.9831869130566495
test F1: 0.7900452488687781
Wall time: 2min 14s


#### Decision tree

In [36]:
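parameters_tree = {
    'model__criterion':('gini', 'entropy'),
    'model__splitter':('best','random'),
    'model__max_depth':[2,20],
    #note: for classifiers 'auto' is an alias of 'sqrt' and is deprecated in newer scikit-learn releases
    'model__max_features':('auto','sqrt','log2'),   
}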

In [37]:
%%time
tree_model = DecisionTreeClassifier(random_state=12345,class_weight='balanced')
lasy(tree_model,parameters_tree)

parameter for model with unigrams: {'model__criterion': 'entropy', 'model__max_depth': 20, 'model__max_features': 'auto', 'model__splitter': 'best'}
parameter for model with bigrams: {'model__criterion': 'gini', 'model__max_depth': 20, 'model__max_features': 'auto', 'model__splitter': 'best'}
Model trained on unigrams:
train F1: 0.21569454036648697
test F1: 0.2133170257922947
Model trained on uni- and bigrams:
train F1: 0.20031522675202704
test F1: 0.19801466294686304
Wall time: 13min 7s


#### Random forest

In [38]:
parameters_forest = {
    'model__n_estimators':[10,100],
    'model__max_depth':[2,15],
}
forest_model = RandomForestClassifier(random_state=12345,class_weight='balanced')

In [39]:
%%time
lasy(forest_model,parameters_forest)

parameter for model with unigrams: {'model__max_depth': 15, 'model__n_estimators': 100}
parameter for model with bigrams: {'model__max_depth': 15, 'model__n_estimators': 100}
Model trained on unigrams:
train F1: 0.37193970621179634
test F1: 0.35501216233516836
Model trained on uni- and bigrams:
train F1: 0.3616288151426515
test F1: 0.34536334719843403
Wall time: 5min 27s


#### LightGBM

In [40]:
LGBM = LGBMClassifier(random_state = 12345,class_weight='balanced',n_estimators=100,n_jobs=-1)
parameters_LGBMC = {
    'model__boosting_type':('gbdt','dart','goss'),
    'model__max_depth':[10,40],
    'model__num_leaves':[10,30]
    }

In [41]:
%%time
#the number of cross-validation folds and parameter values can be reduced here to save time
LGBM_pipe_uni = pipe(LGBM,count_tf_idf)
LGBM_pipe_bi = pipe(LGBM,counter)
    
gsearch_LGBM_uni = GridSearchCV(LGBM_pipe_uni,parameters_LGBMC,scoring='f1',n_jobs=-1,cv=3)
gsearch_LGBM_uni.fit(features_train,target_train)
print('parameter for model with unigrams:',gsearch_LGBM_uni.best_params_)
    
gsearch_LGBM_bi = GridSearchCV(LGBM_pipe_bi,parameters_LGBMC,scoring='f1',n_jobs=-1,cv=3)
gsearch_LGBM_bi.fit(features_train,target_train)
print('parameter for model with bigrams:',gsearch_LGBM_bi.best_params_)


parameter for model with unigrams: {'model__boosting_type': 'gbdt', 'model__max_depth': 40, 'model__num_leaves': 30}
parameter for model with bigrams: {'model__boosting_type': 'gbdt', 'model__max_depth': 40, 'model__num_leaves': 30}
Wall time: 25min 55s
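The long wall time follows directly from the size of the search: every parameter combination is fitted once per fold. A rough way to estimate the cost before launching a search is to count the fits; `n_fits` below is a hypothetical helper, and the count excludes the final refit on the full training set.

```python
from math import prod

def n_fits(param_grid, cv):
    """Number of model fits a grid search performs, excluding the final refit."""
    n_candidates = prod(len(values) for values in param_grid.values())
    return n_candidates * cv

parameters_LGBMC = {
    'model__boosting_type': ('gbdt', 'dart', 'goss'),
    'model__max_depth': [10, 40],
    'model__num_leaves': [10, 30],
}
print(n_fits(parameters_LGBMC, cv=3))  # 36
print(n_fits(parameters_LGBMC, cv=5))  # 60
```

With 12 parameter combinations, moving from 3 to 5 folds raises the count from 36 to 60 fits per pipeline, and each search is run twice (unigrams, then unigrams plus bigrams).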


In [42]:
%%time
print('Model trained on unigrams:')
LGBM_uni = train_model(gsearch_LGBM_uni.best_estimator_,features_list_)

Model trained on unigrams:
train F1: 0.772264851914864
test F1: 0.7373439910251017
Wall time: 37.5 s


In [43]:
%%time
print('Model trained on uni- and bigrams:')
LGBM_bi = train_model(gsearch_LGBM_bi.best_estimator_,features_list_)

Model trained on uni- and bigrams:
train F1: 0.7741755948239877
test F1: 0.7297520661157024
Wall time: 1min 30s


#### Naive Bayes

In [48]:
NB_model = MultinomialNB()
parameters_NB = {
    'model__alpha':[1.0e-10,10],
    'model__fit_prior':(True,False)
    
}

In [49]:
%%time
lasy(NB_model,parameters_NB)

parameter for model with unigrams: {'model__alpha': 1e-10, 'model__fit_prior': True}
parameter for model with bigrams: {'model__alpha': 1e-10, 'model__fit_prior': True}
Model trained on unigrams:
train F1: 0.8684188428919305
test F1: 0.6186473060756592
Model trained on uni- and bigrams:
train F1: 0.9900649598777226
test F1: 0.5570376269161855
Wall time: 3min 40s


#### Complement Naive Bayes

In [50]:
CNB_model = ComplementNB()
parameters_CNB = {
    'model__alpha':[1.0e-10,10],
    'model__fit_prior':(True,False),
    'model__norm':(True,False)
    
}

In [51]:
%%time
lasy(CNB_model,parameters_CNB)

parameter for model with unigrams: {'model__alpha': 1e-10, 'model__fit_prior': True, 'model__norm': False}
parameter for model with bigrams: {'model__alpha': 1e-10, 'model__fit_prior': True, 'model__norm': False}
Model trained on unigrams:
train F1: 0.7699015213878472
test F1: 0.5506663314055821
Model trained on uni- and bigrams:
train F1: 0.9629299788489368
test F1: 0.5316191799861014
Wall time: 5min 15s


## Results

- Data were loaded and inspected
 - the classes are notably imbalanced (about 10% of comments are toxic)
- Text comments were lowercased, cleaned and lemmatized
- Stop words were removed
- Data were split into train and test sets in an 80/20 ratio
- TF-IDF values of words were used as features
- Several models were trained and tested
 - Hyperparameters were selected with GridSearchCV
 - The best F1 score on the test set (0.79) was achieved by the logistic regression model trained on unigrams and bigrams, which meets the 0.75 target