# Project for Wikishop

The Wikishop online store is launching a new service: users can now edit and extend product descriptions, as in wiki communities. Customers propose their own edits and comment on changes made by others. The store needs a tool that finds toxic comments and sends them for moderation.

In this project, we will train a machine learning model that classifies comments as toxic or non-toxic. Our goal is to reach an *F1* score of at least 0.75.
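
For reference, *F1* is the harmonic mean of precision and recall. A minimal sketch of the computation follows; the confusion counts are hypothetical and do not come from this dataset, and `f1_from_counts` is an illustrative name.

```python
def f1_from_counts(tp, fp, fn):
    """F1 from true positives, false positives and false negatives."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# hypothetical counts for the toxic class
example_f1 = f1_from_counts(tp=75, fp=25, fn=25)
print(round(example_f1, 4))
```

With equal precision and recall of 0.75, the score is exactly at the project threshold.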


**Data description**

The *text* column contains the comment text, and *toxic* is the target. 

In [2]:
import pandas as pd
from pymystem3 import Mystem
import re
import nltk
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.corpus import stopwords as nltk_stopwords
from nltk.stem import WordNetLemmatizer 
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import f1_score
import warnings
from tqdm import tqdm
import time
import optuna
from sklearn.utils import shuffle
from sklearn.pipeline import Pipeline


#algorithms we are going to try
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.neighbors import KNeighborsClassifier
from catboost import CatBoostClassifier
from sklearn.svm import LinearSVC

## Data Preparation

In [3]:
data_raw = pd.read_csv('/Users/Shepunova/Desktop/Data_Science/Projects/Data/toxic_comments.csv')
data_raw.head(), data_raw.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 159571 entries, 0 to 159570
Data columns (total 2 columns):
 #   Column  Non-Null Count   Dtype 
---  ------  --------------   ----- 
 0   text    159571 non-null  object
 1   toxic   159571 non-null  int64 
dtypes: int64(1), object(1)
memory usage: 2.4+ MB


(                                                text  toxic
 0  Explanation\nWhy the edits made under my usern...      0
 1  D'aww! He matches this background colour I'm s...      0
 2  Hey man, I'm really not trying to edit war. It...      0
 3  "\nMore\nI can't make any real suggestions on ...      0
 4  You, sir, are my hero. Any chance you remember...      0,
 None)

In [152]:
nltk.download('wordnet')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
from nltk.corpus import wordnet
def get_wordnet_pos(word):
    tag = nltk.pos_tag([word])[0][1][0].upper()
    tag_dict = {"J": wordnet.ADJ,
                "N": wordnet.NOUN,
                "V": wordnet.VERB,
                "R": wordnet.ADV}
    return tag_dict.get(tag, wordnet.NOUN)

[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/shepunova/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt to /Users/shepunova/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/shepunova/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
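
The helper above maps the first letter of a Penn Treebank tag to a WordNet part of speech and falls back to a noun. The mapping itself can be sketched without NLTK, assuming the WordNet constants are the single letters 'a', 'n', 'v' and 'r' (the values of `wordnet.ADJ`, `wordnet.NOUN`, `wordnet.VERB` and `wordnet.ADV`). The name `penn_to_wordnet` is hypothetical.

```python
# first letter of a Penn Treebank tag -> WordNet part-of-speech letter
WORDNET_POS = {"J": "a", "N": "n", "V": "v", "R": "r"}

def penn_to_wordnet(tag):
    """Map a Penn Treebank tag such as 'VBD' to a WordNet POS letter."""
    return WORDNET_POS.get(tag[0].upper(), "n")  # unknown tags fall back to noun

for tag in ["JJ", "NNS", "VBD", "RB", "DT"]:
    print(tag, penn_to_wordnet(tag))
```

Note that the notebook's helper tags each word in isolation, so the tagger cannot use the surrounding sentence as context.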


Let's create a function that normalizes the text: it lowercases the text, expands common contractions and replaces non-word characters with spaces.

In [153]:
def clean_text(text):
    text = text.lower()
    #expand common contractions; "can't" must be handled before the generic "n't" rule
    text = re.sub(r"what's", "what is ", text)
    text = re.sub(r"\'s", " ", text)
    text = re.sub(r"\'ve", " have ", text)
    text = re.sub(r"can't", "cannot ", text)
    text = re.sub(r"n't", " not ", text)
    text = re.sub(r"i'm", "i am ", text)
    text = re.sub(r"\'re", " are ", text)
    text = re.sub(r"\'d", " would ", text)
    text = re.sub(r"\'ll", " will ", text)
    text = re.sub(r'\W', ' ', text) #replace non-word characters with spaces
    text = re.sub(r'\s+', ' ', text) #collapse repeated whitespace
    text = text.strip(' ')
    return text

data_raw['text'] = data_raw['text'].apply(clean_text)
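
To see what this cleaning step does to a single comment, here is a compact table-driven sketch of the same rules. The name `clean_text_sketch` and the sample sentences are made up for illustration; the rule order matters because "can't" has to be expanded before the generic "n't" rule.

```python
import re

# (pattern, replacement) pairs, applied in this order
CONTRACTIONS = [
    (r"what's", "what is "),
    (r"'s", " "),
    (r"'ve", " have "),
    (r"can't", "cannot "),
    (r"n't", " not "),
    (r"i'm", "i am "),
    (r"'re", " are "),
    (r"'d", " would "),
    (r"'ll", " will "),
]

def clean_text_sketch(text):
    text = text.lower()
    for pattern, replacement in CONTRACTIONS:
        text = re.sub(pattern, replacement, text)
    text = re.sub(r"\W", " ", text)           # punctuation -> space
    return re.sub(r"\s+", " ", text).strip()  # collapse whitespace

print(clean_text_sketch("I can't believe it's not butter!"))
print(clean_text_sketch("You're wrong, I'm sure"))
```

The rules are heuristic: every "'s" is dropped and every "'d" becomes "would", regardless of the intended meaning.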

Let's prepare a function for lemmatization

In [154]:
lemmatizer = WordNetLemmatizer()

def lemmatize_text(text):
    text = re.sub(r"[^a-zA-Z']", ' ', text) #keep only latin letters and apostrophes
    text = " ".join(text.split()) #collapse repeated whitespace into single spaces
    lemm_list = [lemmatizer.lemmatize(w, get_wordnet_pos(w)) for w in nltk.word_tokenize(text)]
    lemm_text = " ".join(lemm_list)
    return lemm_text


The data contains about 159 thousand rows. To speed up model selection, we will take a random sample of 50 thousand texts.

In [84]:
#choose random 50 thousand texts
data_model_selection = data_raw.sample(50000).reset_index(drop=True)
data_model_selection.shape

In [155]:
%%time
#lemmatization
data_model_selection['lemmas'] = data_model_selection['text'].apply(lambda x: lemmatize_text(x))

CPU times: user 6min 37s, sys: 1min 15s, total: 7min 52s
Wall time: 8min 4s


In [156]:
#take a look at the lemmatization results
data_model_selection.head()

| | text | toxic | lemmas |
|---|---|---|---|
| 0 | yep hopefully i am going to take a look at tha... | 0 | yep hopefully i be go to take a look at that a... |
| 1 | best regards wolfowitz | 0 | best regard wolfowitz |
| 2 | i am new at this and may make a few mistakes b... | 0 | i be new at this and may make a few mistake bu... |
| 3 | i did not know you are a lord i am sorry sir c... | 0 | i do not know you be a lord i be sorry sir con... |
| 4 | woah as someone who would been the victim of h... | 0 | woah a someone who would be the victim of his ... |


As we are dealing with a classification problem, let's check the class balance first.

In [5]:
print(data_raw['toxic'].value_counts())
class_ratio = data_raw['toxic'].value_counts()[0]/data_raw['toxic'].value_counts()[1]
class_ratio

0    143346
1     16225
Name: toxic, dtype: int64


8.834884437596301

The classes are imbalanced: there are about 8.83 non-toxic comments for every toxic one. We will try to compensate for this with class weights when training the model.
Let's prepare the features and target before training.

In [158]:
#Divide into features and target
X = data_model_selection['lemmas']
y = data_model_selection['toxic']

In [159]:
#divide features and target into train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25)
X_train.shape, y_train.shape, X_test.shape, y_test.shape

((37500,), (37500,), (12500,), (12500,))

In [None]:
#Stop words are excluded when computing tf-idf; the pipelines below use this list
nltk.download('stopwords')
stopwords = list(nltk_stopwords.words('english'))
#Earlier standalone tf-idf computation, kept for reference
#create the vectorizer
#count = TfidfVectorizer(stop_words = stopwords, min_df = 10)
#count.fit_transform(X_train)
#tf_idf_train = count.transform(X_train)
#tf_idf_test = count.transform(X_test)
#check the size of the resulting matrices
#tf_idf_train.shape, tf_idf_test.shape

### Summary

We selected 50 thousand random comments, lemmatized them and split them into train and test sets.
We now have 37,500 observations in the train set and 12,500 observations in the test set. After tf-idf vectorization, each row has 7393 features.
Let's start testing different models.
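
Each tf-idf feature corresponds to one vocabulary term that survives the *min_df* filter. As a rough sketch of what the vectorizer computes, the snippet below reproduces the smoothed weighting that scikit-learn documents as its default, idf = ln((1 + n) / (1 + df)) + 1 followed by L2 normalization of each row. The tokenization is simplified to a whitespace split, stop words are not removed, and the three documents and the name `tfidf_sketch` are hypothetical.

```python
import math
from collections import Counter

def tfidf_sketch(docs, min_df=1):
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # document frequency: in how many documents each term occurs
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vocab = sorted(term for term, count in df.items() if count >= min_df)
    idf = {term: math.log((1 + n_docs) / (1 + df[term])) + 1 for term in vocab}
    rows = []
    for tokens in tokenized:
        counts = Counter(t for t in tokens if t in idf)
        weights = {t: c * idf[t] for t, c in counts.items()}
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        rows.append({t: w / norm for t, w in weights.items()})
    return vocab, rows

docs = ["you be a hero", "you be wrong", "best regard"]  # hypothetical lemmatized texts
vocab, rows = tfidf_sketch(docs, min_df=2)
print(vocab)
print(rows[0])
```

Raising *min_df* shrinks the vocabulary, and a document whose terms are all filtered out becomes an all-zero row.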

## Model Training

### Default Models
We will first test several models with mostly default hyperparameters and compare their f1 scores and running times.

In [160]:
#parameters for cross validation
cv_outer=StratifiedKFold(n_splits=5, random_state=12345, shuffle=True)
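
Stratified folds keep the share of toxic comments roughly the same in every fold, which matters with imbalanced classes. The snippet below is a simplified pure-Python illustration of the idea, not scikit-learn's actual implementation; `stratified_folds` and the toy labels are hypothetical.

```python
import random
from collections import defaultdict

def stratified_folds(labels, n_splits, seed):
    """Assign sample indices to folds so each fold keeps the class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(n_splits)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # deal the shuffled indices of one class round-robin across the folds
        for position, idx in enumerate(indices):
            folds[position % n_splits].append(idx)
    return folds

labels = [0] * 40 + [1] * 10  # toy data: 20% positive class
folds = stratified_folds(labels, n_splits=5, seed=12345)
for fold in folds:
    print(len(fold), sum(labels[i] for i in fold))
```

Every fold ends up with 10 samples, 2 of them positive, matching the 20% share of the toy data.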

In [161]:
#This function tests several algorithms at once
#As an argument, it takes a dict that maps model names to estimators
#It returns a DataFrame with the mean f1 score and the elapsed time in seconds for each model

def test(models):
    results = {}
    for name, model in models.items():
        start_time = time.time()
        #vectorize inside the pipeline so that tf-idf is fitted on the training folds only
        pipe = Pipeline([
            ('prep', TfidfVectorizer(stop_words=stopwords, min_df = 10)),
            ('est', model)])

        #mean cross-validated f1 score for the model
        f1 = cross_val_score(pipe, X_train, y_train, scoring = 'f1', cv = cv_outer).mean()
        elapsed_time = time.time() - start_time #time spent on all cross-validation fits
        results[name] = [f1, elapsed_time]
    return pd.DataFrame(results, index = ['F1', 'Time'])

In [162]:
models = {'Logistic Regression': LogisticRegression(random_state = 12345),
          'Decision Tree': DecisionTreeClassifier(random_state = 12345),
          'Random Forest': RandomForestClassifier(random_state = 12345),
          'Catboost': CatBoostClassifier(verbose = 50),
          'K-neighbours': KNeighborsClassifier(n_neighbors = 1),
          'Linear SVC': LinearSVC(random_state = 12345)}

In [163]:
%%time
test(models)

Learning rate set to 0.044021
0:	learn: 0.6492746	total: 177ms	remaining: 2m 56s
50:	learn: 0.2229225	total: 6.91s	remaining: 2m 8s
100:	learn: 0.1955475	total: 13.6s	remaining: 2m 1s
150:	learn: 0.1809679	total: 20.4s	remaining: 1m 54s
200:	learn: 0.1704951	total: 27.4s	remaining: 1m 48s
250:	learn: 0.1625447	total: 34.3s	remaining: 1m 42s
300:	learn: 0.1555973	total: 41.3s	remaining: 1m 36s
350:	learn: 0.1492229	total: 48.1s	remaining: 1m 28s
400:	learn: 0.1441965	total: 55.5s	remaining: 1m 22s
450:	learn: 0.1397705	total: 1m 3s	remaining: 1m 17s
500:	learn: 0.1353440	total: 1m 10s	remaining: 1m 10s
550:	learn: 0.1312206	total: 1m 17s	remaining: 1m 3s
600:	learn: 0.1276012	total: 1m 24s	remaining: 56.2s
650:	learn: 0.1242495	total: 1m 32s	remaining: 49.3s
700:	learn: 0.1210165	total: 1m 40s	remaining: 42.8s
750:	learn: 0.1184079	total: 1m 48s	remaining: 35.9s
800:	learn: 0.1165515	total: 1m 56s	remaining: 28.9s
850:	learn: 0.1147529	total: 2m 3s	remaining: 21.7s
900:	learn: 0.1130273

| | Logistic Regression | Decision Tree | Random Forest | Catboost | K-neighbours | Linear SVC |
|---|---|---|---|---|---|---|
| F1 | 0.655776 | 0.686154 | 0.733199 | 0.714337 | 0.374152 | 0.741894 |
| Time | 10.699842 | 190.42893 | 160.378992 | 708.013116 | 40.339485 | 9.076745 |


**Linear SVC** achieves the best f1 score: 0.7419.
<br>It also runs faster than all the other models.
<br>Random Forest is in second place with an f1 score of 0.7332, but it takes about 17.7 times longer than Linear SVC.
<br>Let's try to improve Linear SVC by tuning its hyperparameters.
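
The comparison can be checked directly from the results table. The short sketch below copies the F1 and time values from the output above and recomputes the ranking and the time ratio; the variable names are illustrative.

```python
# (mean F1, elapsed seconds) copied from the results table
results = {
    "Logistic Regression": (0.655776, 10.699842),
    "Decision Tree": (0.686154, 190.42893),
    "Random Forest": (0.733199, 160.378992),
    "Catboost": (0.714337, 708.013116),
    "K-neighbours": (0.374152, 40.339485),
    "Linear SVC": (0.741894, 9.076745),
}

ranked = sorted(results, key=lambda name: results[name][0], reverse=True)
best, runner_up = ranked[0], ranked[1]
time_ratio = results[runner_up][1] / results[best][1]
print(best, runner_up, round(time_ratio, 2))
```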

### Hyperparameters for Linear SVC

First, let's set the "class_weight" parameter using the class ratio computed earlier.

In [6]:
dict_classes={0:1, 1:class_ratio}

In [167]:
%%time
pipe = Pipeline([
        ('prep', TfidfVectorizer(stop_words=stopwords)),
        ('est', LinearSVC(class_weight = dict_classes, random_state=12345))])       
    
f1 = cross_val_score(pipe, X_train, y_train, cv=cv_outer, scoring='f1').mean()
f1

CPU times: user 11.7 s, sys: 244 ms, total: 11.9 s
Wall time: 12.1 s


0.7415313212207305

There is not much difference. Let's try to set the parameter "class_weight" to balanced. 

In [168]:
%%time
pipe = Pipeline([
        ('prep', TfidfVectorizer(stop_words=stopwords)),
        ('est', LinearSVC(class_weight = 'balanced', random_state=12345))])       
    
f1 = cross_val_score(pipe, X_train, y_train, cv=cv_outer, scoring='f1').mean()
f1

CPU times: user 10.9 s, sys: 259 ms, total: 11.2 s
Wall time: 11.3 s


0.744089285080044
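
For context, scikit-learn documents the "balanced" mode as weights equal to n_samples / (n_classes * count of each class). The sketch below applies that formula to the class counts of the full dataset shown earlier. Inside cross-validation the weights are derived from each training fold of the 50 thousand sample, so the real values will differ slightly; `balanced_class_weights` is a hypothetical helper.

```python
def balanced_class_weights(class_counts):
    """Weights inversely proportional to class frequencies."""
    n_samples = sum(class_counts.values())
    n_classes = len(class_counts)
    return {label: n_samples / (n_classes * count)
            for label, count in class_counts.items()}

counts = {0: 143346, 1: 16225}  # value_counts of the full dataset
weights = balanced_class_weights(counts)
weight_ratio = weights[1] / weights[0]
print(weights, round(weight_ratio, 2))
```

The toxic class gets about 8.83 times the weight of the non-toxic class, the same ratio as in the manual dictionary. The two settings differ only in overall scale, which acts like a change of the regularization parameter C.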

The metric improved by 0.0022 compared with the default Linear SVC (0.7441 versus 0.7419). Let's try to tune the hyperparameters with Optuna.

In [172]:
%%time

def objective(trial):
    
    param = {
        'max_iter': trial.suggest_int('max_iter', 1000, 3000, step=100),
        'dual': trial.suggest_categorical('dual', [True, False]),
    }

    #LinearSVC supports only some penalty/loss/dual combinations,
    #so penalty and loss are derived from dual
    if param["dual"]:
        param["penalty"] = 'l2'
        param["loss"] = 'hinge'
    else:
        param["penalty"] = 'l1'
        param["loss"] = 'squared_hinge'

    cv_outer=StratifiedKFold(n_splits=5, random_state=12345, shuffle=True)
    
    pipe = Pipeline([
        ('prep', TfidfVectorizer(stop_words=stopwords)),
        ('est', LinearSVC(**param, random_state=12345))])       
    
    return cross_val_score(pipe, X_train, y_train, cv=cv_outer, scoring='f1').mean()


study = optuna.create_study(direction='maximize', 
                            sampler = optuna.samplers.TPESampler(seed=0), 
                            study_name='Linear Support Vector Classification Optuna')
study.optimize(objective, n_trials=20)

#best parameters
print(study.best_params)
#best f1 score
print(study.best_value)

[I 2022-05-29 13:48:58,150] A new study created in memory with name: Linear Support Vector Classification Optuna
[I 2022-05-29 13:49:11,137] Trial 0 finished with value: 0.7295617913765453 and parameters: {'max_iter': 2100, 'dual': True}. Best is trial 0 with value: 0.7295617913765453.
[I 2022-05-29 13:49:23,556] Trial 1 finished with value: 0.758451449241466 and parameters: {'max_iter': 2100, 'dual': False}. Best is trial 1 with value: 0.758451449241466.
[I 2022-05-29 13:49:40,444] Trial 2 finished with value: 0.758451449241466 and parameters: {'max_iter': 1900, 'dual': False}. Best is trial 1 with value: 0.758451449241466.
[I 2022-05-29 13:49:52,560] Trial 3 finished with value: 0.7295617913765453 and parameters: {'max_iter': 1800, 'dual': True}. Best is trial 1 with value: 0.758451449241466.
[I 2022-05-29 13:50:04,571] Trial 4 finished with value: 0.7295617913765453 and parameters: {'max_iter': 2100, 'dual': T

{'max_iter': 2100, 'dual': False}
0.758451449241466
CPU times: user 4min 6s, sys: 7.42 s, total: 4min 13s
Wall time: 4min 27s


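<br>The metric has improved: the new best cross-validated result is 0.7584, reached with dual = False (l1 penalty, squared hinge loss). In the trials shown, max_iter did not affect the score.
Now let's evaluate the model on the test set.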
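
The search tied penalty and loss to dual because LinearSVC accepts only some combinations of these three parameters. The sketch below encodes the restrictions as described in the scikit-learn documentation; `is_supported` is a hypothetical helper, not part of the library.

```python
from itertools import product

def is_supported(penalty, loss, dual):
    """Whether LinearSVC accepts this penalty/loss/dual combination."""
    if penalty == "l1" and loss == "hinge":
        return False      # not supported at all
    if penalty == "l2" and loss == "hinge":
        return dual       # only in the dual formulation
    if penalty == "l1" and loss == "squared_hinge":
        return not dual   # only in the primal formulation
    return True           # l2 + squared_hinge works either way

valid = [combo
         for combo in product(["l1", "l2"], ["hinge", "squared_hinge"], [True, False])
         if is_supported(*combo)]
for combo in valid:
    print(combo)
```

Of the four valid combinations, the study explored two: l2 with hinge (dual) and l1 with squared hinge (primal). The l2 with squared hinge variants were left untested.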

### Testing the model

In [175]:
#Calculate tf-idf: fit the vectorizer on the train set only, then transform both sets
count = TfidfVectorizer(stop_words = stopwords)
tf_idf_train = count.fit_transform(X_train)
tf_idf_test = count.transform(X_test)

In [176]:
model = LinearSVC(penalty = 'l1', loss = 'squared_hinge', dual = False, random_state = 12345, max_iter = 2100)
model.fit(tf_idf_train, y_train)
predictions = model.predict(tf_idf_test)
f1 = f1_score(y_test, predictions)
f1

0.7502295684113865

## Summary
<b>Data preparation</b>
<br>At the data preparation stage, we cleaned the texts (expanded contractions, removed punctuation), lemmatized them and vectorized them with tf-idf. Because the dataset is large and lemmatization is slow (about 8 minutes for 50,000 texts), we worked on a random sample of 50,000 texts.

<b>Model training</b>
<br>With default hyperparameters, LinearSVC showed the best f1 score (0.74) and the shortest execution time. RandomForest took second place with an f1 score of 0.73.
<br>Hyperparameter tuning for LinearSVC gave a modest increase in the cross-validated f1 score, to 0.7584.
<br>The best model is <b>LinearSVC(penalty = 'l1', loss = 'squared_hinge', dual = False)</b>

<br>On the test data we reached an f1 score of 0.7502, which meets the requirement of at least 0.75.