# Classification of texts for an online store

**Problem statement:** Wikishop is launching a new service: users can now edit and supplement product descriptions, just as in wiki communities. That is, customers propose their own edits and comment on the changes made by others.

**Goal:** create a tool that finds toxic comments and submits them for moderation.

**Task:** build a model that classifies comments as positive or negative (toxic).

**Data:** Tweets labeled with the toxicity of edits. ```Feature:``` comment text. ```Target:``` binary toxicity label.

**Requirements:**

* The F1 metric must be at least 0.75.
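For reference, F1 is the harmonic mean of precision and recall, so it stays low unless both are reasonably high. Below is a minimal sketch of the computation from raw counts. The function name and the counts are illustrative and are not taken from the project data.

```python
def f1_from_counts(tp: int, fp: int, fn: int) -> float:
    """Return F1 for the positive class from true positives, false positives and false negatives."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts: 80 toxic comments caught, 20 false alarms, 30 missed.
example_f1 = f1_from_counts(tp=80, fp=20, fn=30)
print(f"{example_f1:.3f}")
```

With these made-up counts the model would just clear the 0.75 requirement, even though it misses more than a quarter of the toxic comments.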

In this work, we pay close attention to careful data preprocessing: cleaning the texts of extraneous characters and stop words, lemmatizing with part-of-speech information, and vectorizing with TF-IDF.

Then, using cross-validation and randomized hyperparameter search, we will test 3 models:

* Logistic regression;
* LightGBM;
* CatBoost.

Let's start by importing the libraries:



In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import re
import string
import nltk

nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')
from nltk import pos_tag
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords, wordnet as wn
from tqdm import notebook
from sklearn.model_selection import (train_test_split,
                                     cross_val_score,
                                     RandomizedSearchCV)
from sklearn.pipeline import Pipeline
from sklearn.utils import shuffle
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import f1_score

from sklearn.linear_model import LogisticRegression
from lightgbm import LGBMClassifier

!pip install catboost
from catboost import CatBoostClassifier

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


## 1. Preparing the data

In this section, we will load, explore, and preprocess our data.

### 1.1. Primary analysis

Let's write a function to quickly get information about the dataset.

In [None]:
def df_read(file_path):
    """Load a CSV and print a quick summary: schema, duplicates, missing values."""
    df = pd.read_csv(file_path, index_col=[0])
    df.info()  # prints the summary itself and returns None
    print()
    print('Number of duplicates:', '\n', df.duplicated().sum(), '\n')
    print('Number of missing values:', '\n', df.isna().sum(), '\n')
    return df

In [None]:
df = df_read('https://code.s3.yandex.net/datasets/toxic_comments.csv')

<class 'pandas.core.frame.DataFrame'>
Int64Index: 159292 entries, 0 to 159450
Data columns (total 2 columns):
 #   Column  Non-Null Count   Dtype 
---  ------  --------------   ----- 
 0   text    159292 non-null  object
 1   toxic   159292 non-null  int64 
dtypes: int64(1), object(1)
memory usage: 3.6+ MB

Number of duplicates: 
 0 

Number of missing values: 
 text     0
toxic    0
dtype: int64 



We see that there are no missing values or duplicates. Let's see what the data looks like.

In [None]:
df.head()

Unnamed: 0,text,toxic
0,Explanation\nWhy the edits made under my usern...,0
1,D'aww! He matches this background colour I'm s...,0
2,"Hey man, I'm really not trying to edit war. It...",0
3,"""\nMore\nI can't make any real suggestions on ...",0
4,"You, sir, are my hero. Any chance you remember...",0


The very first 5 rows show that there is quite a lot of noise in the text. There is another possible problem, though: class imbalance in the target variable. Let's look at the value counts of the target.

In [None]:
df['toxic'].value_counts()

0    143106
1     16186
Name: toxic, dtype: int64

Indeed, there is a large imbalance. We will try to compensate for it at the model training stage by setting the `class_weight` parameter.
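As a rough illustration of what balanced weighting means, scikit-learn's `class_weight='balanced'` mode assigns each class the weight `n_samples / (n_classes * class_count)`. The sketch below applies that formula to the counts printed above. The variable names are illustrative.

```python
# Class counts taken from the value_counts() output above.
class_counts = {0: 143106, 1: 16186}

n_samples = sum(class_counts.values())
n_classes = len(class_counts)

# Formula used by scikit-learn's class_weight='balanced' mode.
balanced_weights = {
    label: n_samples / (n_classes * count)
    for label, count in class_counts.items()
}

for label, weight in balanced_weights.items():
    print(label, round(weight, 3))
```

The toxic class ends up with a weight close to 9 times larger than the non-toxic class, which mirrors the roughly 9:1 class ratio in the data.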



### 1.2. Preprocessing

Let's separate the target from the text feature:

In [None]:
target = df['toxic']
features = df['text']

Now let's clean the training features of stop words and extra characters and lemmatize them, taking the different parts of speech into account.

In [None]:
stoplist = set(stopwords.words('english'))

In [None]:
lemmatizer = WordNetLemmatizer()

In [None]:
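def text_preprocess(row):
    """Clean a comment, drop stop words and lemmatize it using POS tags."""
    row = row.lower()
    row = re.sub(r'[^a-zA-Z0-9]', ' ', row)  # keep only letters and digits
    row = re.sub(r'\s+', ' ', row)           # collapse runs of whitespace
    row = ' '.join([i for i in row.split() if i not in stoplist])

    # Map the first letter of a Penn Treebank tag to a WordNet part of speech
    wn_map = {'N' : wn.NOUN, 'V' : wn.VERB, 'J' : wn.ADJ, 'R' : wn.ADV}
    tagged = pos_tag(row.split())
    # Tags outside the map (e.g. pronouns, prepositions) fall back to noun
    lemmatized = ' '.join([lemmatizer.lemmatize(word, pos=wn_map.get(pos[0], wn.NOUN)) for word, pos in tagged])
    return lemmatized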
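
To make the individual steps easier to follow, here is a dependency-free sketch of the cleaning stage and of the tag mapping. The tiny stop-word set and the sample sentence are made up for illustration, whereas the real function relies on NLTK's English stop-word list and `pos_tag`. The single-letter values mirror the WordNet constants (`wn.NOUN == 'n'`, `wn.VERB == 'v'`, `wn.ADJ == 'a'`, `wn.ADV == 'r'`).

```python
import re

# Stand-in for stopwords.words('english'); an assumption for this demo only.
DEMO_STOPLIST = {"i", "m", "the", "to", "not", "a"}

# Penn Treebank tags start with N, V, J or R for the four WordNet categories.
DEMO_WN_MAP = {"N": "n", "V": "v", "J": "a", "R": "r"}


def demo_clean(text: str) -> list[str]:
    """Lowercase, keep only letters and digits, and drop stop words."""
    text = text.lower()
    text = re.sub(r"[^a-z0-9]", " ", text)
    return [token for token in text.split() if token not in DEMO_STOPLIST]


def demo_wordnet_pos(treebank_tag: str) -> str:
    """Map a Penn Treebank tag to a WordNet POS, defaulting to noun."""
    return DEMO_WN_MAP.get(treebank_tag[0], "n")


tokens = demo_clean("Hey man, I'm really NOT trying to edit-war!\n")
print(tokens)
print(demo_wordnet_pos("VBG"), demo_wordnet_pos("JJ"), demo_wordnet_pos("IN"))
```

Note that replacing the apostrophe with a space splits "I'm" into "i" and "m", which is why both pieces need to be in the stop-word list for the contraction to disappear.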

We've written the function; now let's apply it and check the result.

In [None]:
# text_preprocess already lowercases the text, so a separate .str.lower() pass is not needed
features = features.apply(text_preprocess)

In [None]:
print(features.shape, '\n')
features.head()

(159292,) 



0    explanation edits make username hardcore metal...
1    aww match background colour seemingly stuck th...
2    hey man really try edit war guy constantly rem...
3    make real suggestion improvement wonder sectio...
4                        sir hero chance remember page
Name: text, dtype: object

Great, it worked! Now I don’t have to pronounce words with the ```73913``` index!




Now let's split the data into training and test samples in a 50/50 ratio.

In [None]:
RANDOM_STATE = 11235

features_train, features_test, target_train, target_test = train_test_split(features,
                                                                            target,
                                                                            test_size=0.5,
                                                                            random_state=RANDOM_STATE)

In [None]:
print(features_train.shape)
print(features_test.shape)
print(target_train.shape)
print(target_test.shape)

(79646,)
(79646,)
(79646,)
(79646,)
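
One optional variation, not used in this project: passing `stratify` to `train_test_split` keeps the class ratio the same in both halves instead of leaving it to chance. The toy labels below are made up for the demonstration.

```python
from sklearn.model_selection import train_test_split

# Toy data (assumption): 90 non-toxic and 10 toxic labels.
toy_labels = [0] * 90 + [1] * 10
toy_texts = [f"comment {i}" for i in range(len(toy_labels))]

texts_a, texts_b, labels_a, labels_b = train_test_split(
    toy_texts,
    toy_labels,
    test_size=0.5,
    stratify=toy_labels,  # preserve the 9:1 ratio in both halves
    random_state=0,
)

print(sum(labels_a), sum(labels_b))
```

With roughly 80 thousand rows per half, a plain random split usually lands very close to the original ratio anyway, so stratification matters mostly for smaller datasets.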


Let's look at the ratio of classes after dividing into samples.

In [None]:
target_train.value_counts()

0    71545
1     8101
Name: toxic, dtype: int64

We will try to eliminate this imbalance during the training using class weights.

We'll vectorize the text with TF-IDF in the following manner: at the cross-validation stage, the vectorizer is a step of the pipeline, which receives the raw, unvectorized texts. This way the vectorizer is fitted only on the training part of each split, so no information leaks from the validation part.
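
A small sketch of what "fitted only on the training part" means in practice. Inside a `Pipeline`, the same thing happens on every cross-validation split: the vocabulary and IDF weights are learned from the training fold, and the validation fold is only transformed. The two folds below are made-up examples.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up folds for illustration only.
train_fold = ["you are great", "this edit is terrible"]
valid_fold = ["what an awful vandal"]

vectorizer = TfidfVectorizer()
vectorizer.fit(train_fold)  # vocabulary and IDF come from the training fold only

# Validation words never seen in training are simply ignored at transform time.
valid_matrix = vectorizer.transform(valid_fold)
shared_words = set(valid_fold[0].split()) & set(vectorizer.vocabulary_)

print(len(vectorizer.vocabulary_), valid_matrix.nnz, shared_words)
```

Had the vectorizer been fitted on all texts before splitting, the validation words would have entered the vocabulary and influenced the IDF weights, which is exactly the leakage the pipeline avoids.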

## 2. Training

Let's go in order: logistic regression, LightGBM and CatBoost. So let's start with...

### 2.1. Logistic regression

Yep. We will measure F1 with 5-fold cross-validation. First, let's create a Pipeline and get baseline results for the model without class weighting. We'll also search over `C` (the inverse of the regularization strength) and, at the same time, over the maximum number of iterations.

In [None]:
pipeline_lr = Pipeline([('tfidf', TfidfVectorizer()),
                            ('lr', LogisticRegression())])

param_grid = { 'lr__C' : np.arange(2, 20, 2),
              'lr__max_iter' : np.arange(100, 500, 100)}
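
Note that `RandomizedSearchCV` does not try every combination: by default it samples `n_iter=10` candidates. Here is a back-of-the-envelope sketch of what that means for the grid above, using only the standard library. The sampling is a simplified stand-in, not scikit-learn's own sampler.

```python
import itertools
import random

# Same values as the np.arange calls above.
c_values = list(range(2, 20, 2))              # 9 values: 2, 4, ..., 18
max_iter_values = list(range(100, 500, 100))  # 4 values: 100, ..., 400

all_candidates = list(itertools.product(c_values, max_iter_values))

rng = random.Random(11235)
sampled = rng.sample(all_candidates, k=10)  # 10 mirrors the default n_iter

print(len(all_candidates), len(sampled))
```

So the search covers 10 of the 36 possible combinations. With 5-fold cross-validation that is 50 model fits, plus one final refit of the best candidate on the whole training sample.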

In [None]:
%%time

lr_random = RandomizedSearchCV(pipeline_lr,
                               param_distributions=param_grid,
                               cv=5,
                               scoring='f1',
                               verbose=100,
                               random_state=RANDOM_STATE)

lr_random.fit(features_train, target_train)

In [None]:
print(f'The best logistic regression is obtained with the following hyperparameters: \n \
{lr_random.best_params_}, \
its F1 score: {lr_random.best_score_:.3f}')

The best logistic regression is obtained with the following hyperparameters: 
 {'lr__max_iter': 100, 'lr__C': 16}, its F1 score: 0.771


Without class balancing, the best F1 of the logistic regression was 0.771. Now let's try to balance the classes.

In [None]:
%%time

pipeline_lr_b = Pipeline([('tfidf', TfidfVectorizer()),
                            ('lr', LogisticRegression(class_weight='balanced'))])

lr_random_b = RandomizedSearchCV(pipeline_lr_b,
                                 param_distributions=param_grid,
                                 cv=5,
                                 scoring='f1',
                                 verbose=100,
                                 random_state=RANDOM_STATE)

lr_random_b.fit(features_train, target_train)

In [None]:
print(f'The best logistic regression with balanced classes \n\
is obtained with the following hyperparameters: {lr_random_b.best_params_}, its F1 score: \
{lr_random_b.best_score_:.3f}')

The best logistic regression with balanced classes 
is obtained with the following hyperparameters: {'lr__max_iter': 300, 'lr__C': 18}, its F1 score: 0.765


The logistic regression with balanced classes has an F1 score of 0.765, i.e. slightly worse than without balancing. Let's see how gradient boosting models compare.

### 2.2. LightGBM

Let's create a new pipeline and, along with it, a dictionary of hyperparameters to search over. Since training a gradient boosting model takes a lot of time, we limit the randomized search to 5 iterations.

First, let's test the model without eliminating class imbalances.

In [None]:
pipeline_lgb = Pipeline(steps=[('tfidf', TfidfVectorizer()),
                            ('clf', LGBMClassifier(random_state=RANDOM_STATE))])

In [None]:
param_grid = {'num_leaves': np.arange(2, 50, 2),
              'reg_lambda' : np.arange(1, 20, 2),
              'max_depth' : np.arange(2, 100, 2),
              'learning_rate' : np.arange(0.1, 1, 0.05)}

param_grid = {'clf__' + key: param_grid[key] for key in param_grid}

param_grid

{'clf__num_leaves': array([ 2,  4,  6,  8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32, 34,
        36, 38, 40, 42, 44, 46, 48]),
 'clf__reg_lambda': array([ 1,  3,  5,  7,  9, 11, 13, 15, 17, 19]),
 'clf__max_depth': array([ 2,  4,  6,  8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32, 34,
        36, 38, 40, 42, 44, 46, 48, 50, 52, 54, 56, 58, 60, 62, 64, 66, 68,
        70, 72, 74, 76, 78, 80, 82, 84, 86, 88, 90, 92, 94, 96, 98]),
 'clf__learning_rate': array([0.1 , 0.15, 0.2 , 0.25, 0.3 , 0.35, 0.4 , 0.45, 0.5 , 0.55, 0.6 ,
        0.65, 0.7 , 0.75, 0.8 , 0.85, 0.9 , 0.95])}
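
The `clf__` prefix follows scikit-learn's `<step name>__<parameter>` convention, which tells the search which pipeline step a hyperparameter belongs to. Below is a small sketch of the renaming step as a reusable helper. The helper name and the demo grid are hypothetical.

```python
def prefix_params(step_name: str, grid: dict) -> dict:
    """Return a copy of `grid` with keys rewritten as '<step_name>__<key>'."""
    return {f"{step_name}__{key}": values for key, values in grid.items()}


demo_grid = {"num_leaves": [2, 4, 6], "learning_rate": [0.1, 0.15]}
prefixed = prefix_params("clf", demo_grid)
print(sorted(prefixed))
```

Keeping the raw grid unprefixed and adding the step name in one place makes it easy to reuse the same grid if the step is renamed.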

In [None]:
%%time

lgb_random = RandomizedSearchCV(pipeline_lgb,
                                param_distributions=param_grid,
                                cv=5,
                                scoring='f1',
                                verbose=10,
                                n_iter=5,
                                random_state=RANDOM_STATE
                                )

lgb_random.fit(features_train, target_train)

In [None]:
print(f'The best LightGBM model is obtained with the following hyperparameters: \n \
{lgb_random.best_params_}. \n \
Its F1 score is {lgb_random.best_score_:.3f}')

The best LightGBM model is obtained with the following hyperparameters: 
 {'clf__reg_lambda': 19, 'clf__num_leaves': 38, 'clf__max_depth': 96, 'clf__learning_rate': 0.7000000000000002}. 
 Its F1 score is 0.762


Now let's evaluate the model with balanced classes.

In [None]:
pipeline_lgb_b = Pipeline(steps=[('tfidf', TfidfVectorizer()),
                            ('clf', LGBMClassifier(random_state=RANDOM_STATE, class_weight='balanced'))])

In [None]:
%%time

lgb_random_b = RandomizedSearchCV(pipeline_lgb_b,
                                param_distributions=param_grid,
                                cv=5,
                                scoring='f1',
                                verbose=10,
                                n_iter=5,
                                random_state=RANDOM_STATE
                                )

lgb_random_b.fit(features_train, target_train)

In [None]:
print(f'The best LightGBM model with balanced classes is obtained with the following hyperparameters: \n \
{lgb_random_b.best_params_}. \n \
Its F1 score is {lgb_random_b.best_score_:.3f}')

The best LightGBM model with balanced classes is obtained with the following hyperparameters: 
 {'clf__reg_lambda': 17, 'clf__num_leaves': 44, 'clf__max_depth': 70, 'clf__learning_rate': 0.7000000000000002}. 
 Its F1 score is 0.742


### 2.3. CatBoost

So far, eliminating the class imbalance has only worsened the quality of the models. Since CatBoost often takes longer to train than LightGBM, we will not set its ```auto_class_weights``` parameter at this stage. We will also limit the search to 5 iterations and the cross-validation to 3 folds, and cap the number of boosting iterations at 30.

We also preset the `Lossguide` growth policy, which builds each tree leaf by leaf: at every step, the leaf whose split gives the best loss improvement is split, until the maximum number of leaves is reached.

Other hyperparameters, such as the number of leaves, the tree depth, and the learning rate, will be sampled at random.

In [None]:
pipeline_cb = Pipeline(steps=[('tfidf', TfidfVectorizer()),
                            ('clf', CatBoostClassifier(iterations=30,
                                                       random_state=RANDOM_STATE,
                                                       grow_policy='Lossguide'))])

param_grid = {'num_leaves': np.arange(2, 50, 2),
              'max_depth' : np.arange(2, 16, 2),
              'learning_rate' : np.arange(0.1, 1, 0.05)}

param_grid = {'clf__' + key: param_grid[key] for key in param_grid}

In [None]:
%%time

cb_random = RandomizedSearchCV(pipeline_cb,
                               param_distributions=param_grid,
                               cv=3,
                               scoring='f1',
                               verbose=10,
                               random_state=RANDOM_STATE,
                               n_iter=5,
                               error_score='raise')

cb_random.fit(features_train, target_train)

In [None]:
print(f'The best CatBoost model is obtained with the following hyperparameters: \n \
{cb_random.best_params_}. \n \
Its F1 score is {cb_random.best_score_:.3f}')

The best CatBoost model is obtained with the following hyperparameters: 
 {'clf__num_leaves': 6, 'clf__max_depth': 4, 'clf__learning_rate': 0.7000000000000002}. 
 Its F1 score is 0.703


The F1 metric turned out to be lower than that of all other models. Welp.

But before discarding the model, let's try one thing: CatBoost can work with text features directly, in which case we won't even need TF-IDF.

Let's set the best hyperparameters that we got as a result of the selection, along with a set of texts.

In [None]:
features_catboost = pd.DataFrame(features_train, columns=['text'])

In [None]:
cb_model = CatBoostClassifier(grow_policy='Lossguide',
                              random_state=RANDOM_STATE,
                              learning_rate=0.7,
                              num_leaves=6,
                              max_depth=4,
                              text_features=['text'])

In [None]:
cb_f1 = cross_val_score(cb_model, features_catboost, target_train, cv=5, scoring='f1')
print(f'F1 of CatBoost trained on text features: {cb_f1.mean():.3f}')

The F1 for CatBoost has improved significantly, to 0.758. Just as an experiment, let's try to balance the classes.

In [None]:
cb_model_b = CatBoostClassifier(grow_policy='Lossguide',
                              auto_class_weights='Balanced',
                              random_state=RANDOM_STATE,
                              learning_rate=0.7,
                              num_leaves=6,
                              max_depth=4,
                              text_features=['text'])

In [None]:
cb_f1_b = cross_val_score(cb_model_b, features_catboost, target_train, cv=5, scoring='f1')
print(f'F1 of CatBoost with balanced classes, \n \
trained on text features: {cb_f1_b.mean():.3f}')

And again, balancing the classes worsened the model's F1 (0.739). On the other hand, we found that with CatBoost it can be better to leave the texts unvectorized: text columns can be declared through the model's `text_features` parameter, and this can yield a higher score.

### 2.4. Comparing the models

Since the models without class balancing performed better in every test, we compare their best metrics in the table.

Model | F1 score | Duration of training, tuning and cross-validation
-----|-------|----------
Logistic regression | 0.771 | 10 min 32 sec
LightGBM | 0.762 | 15 min 55 sec
CatBoost | 0.758 | 25 min 32 sec

Our winner, on both F1 and time, is logistic regression with `max_iter=100` and `C=16` (the inverse of the regularization strength).

## 3. Testing

The moment of truth.

### 3.1. Metrics on the test data

In [None]:
predictions = lr_random.predict(features_test)
print(f'F1 of the logistic regression on the test sample: \
{f1_score(target_test, predictions):.3f}')

F1 of the logistic regression on the test sample: 0.775


Our logistic regression has a final F1 score of 0.775, so it has cleared the required threshold of 0.75!

To be safe, let's also run a sanity check on the model.

### 3.2. Sanity check

For the sanity check, we evaluate a constant model that labels every comment as toxic (1) by default.

In [None]:
dummy_predictions = pd.Series(1, index=features_test.index)

print(f'F1 of the constant model: {f1_score(target_test, dummy_predictions):.3f}')

F1 of the constant model: 0.184


The F1 of the constant model is only 0.184! Thus, our logistic regression has passed the sanity check.
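
This number can be cross-checked by hand. A model that always predicts 1 has a recall of 1 and a precision equal to the share of toxic comments `p`, so its F1 is `2p / (1 + p)`. The test-set count below is derived from the outputs above (16186 toxic comments in total minus 8101 in the training half) rather than read from the notebook directly.

```python
toxic_total = 16186   # from df['toxic'].value_counts()
toxic_train = 8101    # from target_train.value_counts()
test_size = 79646     # from features_test.shape

toxic_test = toxic_total - toxic_train
p = toxic_test / test_size       # precision of the always-toxic model
constant_f1 = 2 * p / (1 + p)    # recall is 1 by construction

print(f"{constant_f1:.3f}")
```

The result agrees with the 0.184 reported by `f1_score`, which confirms that the baseline was computed on the expected class ratio.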

## 4. Results

Here is what we did:

**1. Preprocessing.**

* Loaded and explored the data and found a class imbalance in the target variable;
* Prepared the data for training: cleaned and lemmatized the texts and split them into training and test samples;
* Built a pipeline that vectorizes the texts inside cross-validation to prevent data leakage.

**2. Training.**

* Used randomized search to select the best hyperparameters for three models, one without gradient boosting and two with it: logistic regression, LightGBM and CatBoost;
* Found that **the models' F1 is higher when the class imbalance is left uncorrected**;
* Compared the models by F1 and by total time (training, hyperparameter search and cross-validation). Based on this comparison, we chose plain **logistic regression**, which had the highest F1 (**0.771**);
* Hyperparameters of the selected model: {'max_iter': 100, 'C': 16}.

**3. Testing.**

* Evaluated the selected model's predictions on the test sample;
* Obtained a final F1 score of **0.775**, which is above the required threshold of 0.75;
* Sanity-checked the model against a constant model, whose F1 score was 0.184.
