<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Data-preparation" data-toc-modified-id="Data-preparation-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Data preparation</a></span><ul class="toc-item"><li><span><a href="#TF-IDF" data-toc-modified-id="TF-IDF-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>TF-IDF</a></span></li><li><span><a href="#BERT.-Embeddings-preparation" data-toc-modified-id="BERT.-Embeddings-preparation-1.2"><span class="toc-item-num">1.2&nbsp;&nbsp;</span>BERT. Embeddings preparation</a></span></li></ul></li><li><span><a href="#Models-training" data-toc-modified-id="Models-training-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Models training</a></span><ul class="toc-item"><li><span><a href="#TF-IDF" data-toc-modified-id="TF-IDF-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span>TF-IDF</a></span></li><li><span><a href="#BERT" data-toc-modified-id="BERT-2.2"><span class="toc-item-num">2.2&nbsp;&nbsp;</span>BERT</a></span></li></ul></li><li><span><a href="#Testing-model" data-toc-modified-id="Testing-model-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Testing model</a></span></li></ul></div>

# Project description: Search for toxic comments for the online shop Wikishop - with BERT

The online shop Wikishop is launching a new service. Users can now edit and add to product descriptions, just like in wiki communities. That is, customers suggest their edits and comment on others' changes.

The shop needs a tool that will search for toxic comments and send them for moderation.

Let's train a model to classify comments as toxic (positive) or non-toxic (negative). We have a dataset labelled with the toxicity of edits.

The target is a model with an F1 score of at least 0.75.

**Data description**

The data is in the file `/datasets/toxic_comments.csv`.

The `text` column contains the comment text, and the `toxic` column contains the target label.

In [1]:
# loading and importing required libraries

!pip install torch
!pip install transformers
!pip install pymystem3



In [2]:
import numpy as np
import pandas as pd
import torch
import transformers
import warnings
warnings.filterwarnings('ignore')


from tqdm import notebook
from tqdm.notebook import tqdm
tqdm.pandas()

from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.model_selection import cross_val_score, train_test_split, GridSearchCV
from sklearn.metrics import f1_score
from sklearn.feature_extraction.text import TfidfVectorizer

import lightgbm as lgb
from lightgbm import LGBMRegressor, LGBMClassifier

import re

import nltk
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords as nltk_stopwords

nltk.download('wordnet')
nltk.download('punkt')
nltk.download('stopwords')
stopwords = set(nltk_stopwords.words('english'))

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Serghei\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Serghei\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Serghei\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


## Data preparation

In [3]:
# loading and reading data

try:
    df = pd.read_csv('toxic_comments.csv', index_col=0).reset_index(drop=True)
    
except FileNotFoundError:
    df = pd.read_csv('/datasets/toxic_comments.csv', index_col=0).reset_index(drop=True)

**Data overview**

In [4]:
display(df.info(), df.sample(5))

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 159292 entries, 0 to 159291
Data columns (total 2 columns):
 #   Column  Non-Null Count   Dtype 
---  ------  --------------   ----- 
 0   text    159292 non-null  object
 1   toxic   159292 non-null  int64 
dtypes: int64(1), object(1)
memory usage: 2.4+ MB


None

| | text | toxic |
|---|---|---|
| 143523 | "\n\n September 2010 \n\n Please do not vandal... | 0 |
| 5405 | "\n\nNo, Thanatos666 and Dr.K. are not ""pushi... | 0 |
| 132796 | On the contrary-it is the germans that have al... | 0 |
| 139971 | That'll do fine. Thanks. | 0 |
| 39648 | Oh spare me the drama. What are you trying to ... | 0 |


**Checking the class balance**

In [5]:
df.toxic.value_counts(normalize=True)

0    0.898388
1    0.101612
Name: toxic, dtype: float64

**Conclusion:**

1. There are 159292 rows in the dataset, no missing values.
2. The text contains noise (hyphenation, apostrophes).
3. The target is imbalanced: the ratio of the negative (non-toxic) class to the positive (toxic) class is roughly 9:1.
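
With a 9:1 imbalance, accuracy is a weak guide, which is why F1 is used as the quality metric. The sketch below computes F1 by hand for a toy case in which every comment is predicted as non-toxic. The helper name `f1_from_labels` and the toy labels are illustrative assumptions, not part of the project code.

```python
def f1_from_labels(y_true, y_pred):
    """F1 for the positive class (label 1): harmonic mean of precision and recall."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        # no true positives: precision and recall are both zero (or undefined)
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# toy labels with the same 9:1 ratio as the dataset
y_true = [0] * 9 + [1]
always_clean = [0] * 10  # a "model" that never flags a comment

accuracy = sum(t == p for t, p in zip(y_true, always_clean)) / len(y_true)
print("accuracy:", accuracy)
print("F1:", f1_from_labels(y_true, always_clean))
```

The trivial predictor looks good on accuracy but gets an F1 of zero, because it never finds a toxic comment.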

The problem will be solved in two ways: 
1. by building a features matrix with calculation of TF-IDF values,
2. by forming embeddings of the pre-trained BERT model.

### TF-IDF

In [6]:
lemmatizer = WordNetLemmatizer()

We will write lemmatisation and text-cleaning functions.

In [7]:
def lemmatize(text):
    lemm_list = nltk.word_tokenize(text)
    lemm_text = ' '.join([lemmatizer.lemmatize(w) for w in lemm_list])  
    return lemm_text

def clear_text(text):
    text = re.sub(r"[^a-zA-Z']", ' ', text)
    return ' '.join(text.split()) 
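
As a quick illustration of the cleaning step, the snippet below repeats `clear_text` on its own and applies it to a made-up comment (the sample string is an assumption, not taken from the dataset). Lemmatisation is left out so that the snippet runs without the NLTK corpora.

```python
import re


def clear_text(text):
    # replace everything except Latin letters and apostrophes with a space
    text = re.sub(r"[^a-zA-Z']", ' ', text)
    # collapse runs of whitespace into single spaces
    return ' '.join(text.split())


sample = '"\n\nDon\'t   vandalize pages!! 123"'
cleaned = clear_text(sample)
print(cleaned)  # Don't vandalize pages
```

Quotation marks, line breaks, digits and punctuation are dropped, while apostrophes inside words are kept.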

We apply these functions to clean the dataset corpus.

In [8]:
df['lemm_text'] = df['text'].progress_apply(lambda x: lemmatize(clear_text(x)))

  0%|          | 0/159292 [00:00<?, ?it/s]

In [9]:
df.sample(2)

| | text | toxic | lemm_text |
|---|---|---|---|
| 139698 | you suck messing with honduras! \n\nFuck you\n... | 1 | you suck messing with honduras Fuck you anon |
| 72946 | "\n\nExcept this article is extremely biased b... | 0 | Except this article is extremely biased becaus... |


We split the dataset into training and test samples at a ratio of 80:20, then vectorise the texts to obtain the features and the target.

In [10]:
X_train, X_test, y_train, y_test = train_test_split(df['lemm_text'], df['toxic'], stratify=df.toxic, test_size=0.2, random_state=12345)
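
The `stratify` argument keeps the share of toxic comments the same in both samples. The sketch below illustrates the idea in plain Python; it is not scikit-learn's implementation, and the helper `stratified_split` and the toy labels are assumptions made for illustration.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_size=0.2, seed=12345):
    """Return (train_idx, test_idx) with per-class proportions preserved."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # take the same fraction of every class for the test sample
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


labels = [0] * 90 + [1] * 10  # toy target with a 9:1 class ratio
train_idx, test_idx = stratified_split(labels)

toxic_share_train = sum(labels[i] for i in train_idx) / len(train_idx)
toxic_share_test = sum(labels[i] for i in test_idx) / len(test_idx)
print(len(train_idx), len(test_idx), toxic_share_train, toxic_share_test)
```

Both samples end up with the same 10% share of the positive class, which matters for a metric like F1 that depends on the rare class.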

In [11]:
count_tf_idf = TfidfVectorizer(stop_words=stopwords)

X_train = count_tf_idf.fit_transform(X_train)
X_test = count_tf_idf.transform(X_test)

print("Shape of X_train matrix:", X_train.shape)
print("Shape of X_test matrix:", X_test.shape)

Shape of X_train matrix: (127433, 144000)
Shape of X_test matrix: (31859, 144000)
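
To make the weighting behind these matrices concrete, here is a minimal sketch of the TF-IDF computation in plain Python. It assumes the default `TfidfVectorizer` settings (smoothed idf, L2-normalised rows, raw term counts) and simplifies tokenisation to whitespace splitting with no stop-word removal; the helper `tfidf_rows` and the two toy documents are hypothetical.

```python
import math
from collections import Counter


def tfidf_rows(docs):
    """TF-IDF with smoothed idf and L2-normalised rows for whitespace-tokenised docs."""
    tokenised = [doc.split() for doc in docs]
    vocab = sorted({tok for doc in tokenised for tok in doc})
    n_docs = len(tokenised)

    # document frequency: in how many documents each term occurs
    df = Counter(tok for doc in tokenised for tok in set(doc))
    # smoothed idf: ln((1 + n) / (1 + df)) + 1
    idf = {tok: math.log((1 + n_docs) / (1 + df[tok])) + 1 for tok in vocab}

    rows = []
    for doc in tokenised:
        counts = Counter(doc)
        weights = [counts[tok] * idf[tok] for tok in vocab]
        norm = math.sqrt(sum(w * w for w in weights))
        rows.append([w / norm if norm else 0.0 for w in weights])
    return vocab, rows


vocab, rows = tfidf_rows(["good edit", "bad bad edit"])
for row in rows:
    print(dict(zip(vocab, (round(w, 3) for w in row))))
```

The word `edit` occurs in both documents and therefore gets a lower weight than `good` or `bad`, which each occur in only one.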


The TF-IDF features and the training and test samples are ready. Next, we prepare embeddings to solve the problem with the pre-trained BERT model.

### BERT. Embeddings preparation

Since the BERT model is resource-intensive and embeddings take a long time to generate, we limit ourselves to a random sample of 500 texts.

We read the source file again and draw a new random sample of 500 rows.

In [12]:
try:
    df = pd.read_csv('toxic_comments.csv', index_col=0).reset_index(drop=True)
    
except FileNotFoundError:
    df = pd.read_csv('/datasets/toxic_comments.csv', index_col=0).reset_index(drop=True)

In [13]:
df_bert = df.sample(500, random_state=12345).reset_index(drop = True) 
df_bert.head()

| | text | toxic |
|---|---|---|
| 0 | Expert Categorizers \n\nWhy is there no menti... | 0 |
| 1 | "\n\n Noise \n\nfart* talk. " | 1 |
| 2 | An indefinite block is appropriate, even for a... | 0 |
| 3 | I don't understand why we have a screenshot of... | 0 |
| 4 | Hello! Some of the people, places or things yo... | 0 |


Let's load the pre-trained model and initialise the tokenizer. Note that BERT does not require a lemmatisation step: its tokenizer splits words into subword pieces, so the model handles word forms itself.

In [14]:
model_class, tokenizer_class, pretrained_weights = (transformers.BertModel, transformers.BertTokenizer, 'bert-base-uncased')

# Loading the pre-trained model and initialising the tokeniser 

tokenizer = tokenizer_class.from_pretrained(pretrained_weights)
model = model_class.from_pretrained(pretrained_weights)

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.seq_relationship.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.weight', 'cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


BERT's tokenizer splits each text into tokens (words or subword pieces). It then adds the special tokens needed to classify the text: `[CLS]` at the first position and `[SEP]` at the end.

The next step is to replace each token with its identifier from the vocabulary that ships with the pre-trained model.

We tokenise all comments. Note that BERT accepts sequences of at most 512 tokens; longer inputs cause an indexing error later. We therefore pass `max_length=512` together with `truncation=True` to cut longer sequences down to 512 tokens.

In [15]:
tokenized = df_bert['text'].progress_apply(lambda x: tokenizer.encode(x, add_special_tokens=True, max_length=512, truncation=True))

  0%|          | 0/500 [00:00<?, ?it/s]
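
The effect of `add_special_tokens=True`, `max_length=512` and `truncation=True` on a single text can be sketched without the library. The tokenizer does this internally; the helper `add_special_tokens` below is a hypothetical stand-in, and the token ids in the examples are arbitrary placeholders. Only the ids 101 for `[CLS]` and 102 for `[SEP]` are taken from the `bert-base-uncased` vocabulary.

```python
CLS_ID, SEP_ID = 101, 102  # ids of [CLS] and [SEP] in bert-base-uncased


def add_special_tokens(token_ids, max_length=512):
    """Truncate so that [CLS] + tokens + [SEP] fits into max_length ids."""
    body = token_ids[:max_length - 2]  # leave room for the two special tokens
    return [CLS_ID] + body + [SEP_ID]


short = add_special_tokens([7592, 2088])             # arbitrary placeholder ids
long = add_special_tokens(list(range(1000, 1600)))   # 600 ids, too long for BERT

print(short)      # [101, 7592, 2088, 102]
print(len(long))  # 512
```

This is why the longest sequence in the sample has exactly 512 ids: at least one comment was longer and got cut.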

In [16]:
display(max(len(i) for i in tokenized), tokenized.shape)

512

(500,)

We have converted each comment in the dataset into a list of identifiers, so the dataset is now a list of lists. Before BERT can process it, all vectors must have the same length, which we achieve by appending zeros to the shorter ones (padding).

Let's pad with zeros and create an attention mask that marks the real (non-padding) tokens.

In [17]:
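# length of the longest token sequence in the sample
max_len = max(len(i) for i in tokenized.values)

# right-pad every sequence with zeros (0 is the [PAD] id in bert-base-uncased)
padded = np.array([i + [0]*(max_len - len(i)) for i in tokenized.values])

# 1 for real tokens, 0 for padding
attention_mask = np.where(padded != 0, 1, 0)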

In [18]:
display(padded.shape, attention_mask.shape)

(500, 512)

(500, 512)

Thus, we have a matrix/tensor that can be passed to BERT.
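
On a toy input the padding and the mask look as follows. This is a plain-Python sketch with a hypothetical helper `pad_and_mask` and made-up token ids. It builds the mask from the sequence lengths, which gives the same result as comparing with zero as long as no real token has id 0.

```python
PAD_ID = 0  # id of [PAD] in bert-base-uncased


def pad_and_mask(sequences, pad_id=PAD_ID):
    """Right-pad id lists to a common length and mark real tokens with 1."""
    max_len = max(len(seq) for seq in sequences)
    padded = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences]
    return padded, mask


padded_toy, mask_toy = pad_and_mask([[101, 2023, 102],
                                     [101, 2023, 2003, 2986, 102]])
print(padded_toy)  # [[101, 2023, 102, 0, 0], [101, 2023, 2003, 2986, 102]]
print(mask_toy)    # [[1, 1, 1, 0, 0], [1, 1, 1, 1, 1]]
```

The mask tells the model to ignore the padded positions when computing attention.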

Let's convert the data into tensors, the multidimensional arrays of the torch library. The embeddings are created in batches of 50 rows. To speed up the computation, we wrap the forward pass in `torch.no_grad()` to tell torch that no gradients are needed, since we are not training the BERT model.

We then extract the required elements from the resulting tensor (the hidden state of the first token, `[CLS]`, for each text) and add them to the list of all embeddings.

In [19]:
batch_size = 50
embeddings = []

for i in notebook.tqdm(range(padded.shape[0] // batch_size)):
    batch = torch.LongTensor(padded[batch_size*i:batch_size*(i+1)])
    attention_mask_batch = torch.LongTensor(attention_mask[batch_size*i:batch_size*(i+1)])

    with torch.no_grad():
        batch_embeddings = model(batch, attention_mask=attention_mask_batch)

    embeddings.append(batch_embeddings[0][:,0,:].numpy())

  0%|          | 0/10 [00:00<?, ?it/s]
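
The loop above uses `padded.shape[0] // batch_size` batches, which covers every row here because 500 is divisible by 50. With a sample size that is not a multiple of the batch size, the last rows would be skipped. A possible way to keep them is sketched below; `batch_slices` is a hypothetical helper, not part of the notebook.

```python
import math


def batch_slices(n_rows, batch_size):
    """Yield (start, stop) index pairs that cover all rows, including a short last batch."""
    n_batches = math.ceil(n_rows / batch_size)
    for i in range(n_batches):
        start = i * batch_size
        yield start, min(start + batch_size, n_rows)


print(len(list(batch_slices(500, 50))))  # 10
print(list(batch_slices(120, 50)))       # [(0, 50), (50, 100), (100, 120)]
```

Each pair can be used to slice `padded` and `attention_mask` in the same way as `batch_size*i:batch_size*(i+1)` in the loop.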

We combine all the embeddings into a feature matrix.

In [20]:
features_bert = np.concatenate(embeddings)

In [21]:
features_bert.shape

(500, 768)

The `features_bert` variable now contains an array with the embeddings of all 500 texts in our sample, one 768-dimensional vector per text.

**Conclusion of Stage 1:**

1. We have loaded the data: comments labelled for toxicity.
2. We have prepared both TF-IDF features and embeddings derived from the BERT model.

We can proceed to training the models.

## Models training

### TF-IDF

**Models initialization**

We will train several models with different hyperparameters.

We first create a list that holds each model together with its hyperparameter grid, and then initialise the models.

In [22]:
# create a list to select the best hyperparameters for the models

models = []

Initializing the models `Logistic Regression` and `SGD Classifier (Stochastic Gradient Descent)`.

In [23]:
# Logistic Regression

lr = LogisticRegression(n_jobs=-1, random_state=12345)

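# note: the default 'lbfgs' solver supports only the 'l2' penalty, so the 'l1'
# candidates fail to fit and get a NaN score (solver='liblinear' or 'saga' supports 'l1')
param_grid={'penalty' : ['l1', 'l2'],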
            'fit_intercept': [True, False],
            'max_iter' : [1000, 5000],
            'C' : [0.001, 0.1, 0.8, 1],
            'class_weight' : [None, 'balanced', {0: 0.9, 1: 0.1}]}

models.append(('Logistic Regression', lr, param_grid))

In [24]:
# SGD Classifier (Stochastic Gradient Descent) 

sgd = SGDClassifier(n_jobs=-1, random_state=12345)

# note: newer scikit-learn versions name the 'log' loss 'log_loss'
param_grid = {'loss' : ['log', 'modified_huber'],
              'penalty' : ['l1', 'l2'], 
              'fit_intercept' : [True, False],
              'max_iter' : [5, 1000, 5000],
              'shuffle' : [True, False],
              'learning_rate' : ['optimal'],
              'validation_fraction' : [0.1, 0.2, 0.3, 0.4]  # only used when early_stopping=True
             }

models.append(('SGDClassifier', sgd, param_grid))

We define a function that searches the hyperparameter grid of a model and returns the best cross-validated metric value together with the best estimator.

In [25]:
# Function for enumerating over a grid and finding the best hyperparameters of models

def grid_search(model, param_grid, cv, x, y):
    
    '''
    Takes a model and its hyperparameter grid from the list of models,
    the number of cross-validation folds, and the features and target of the sample.
    Returns the best cross-validated F1 score and the best estimator,
    refitted with the best hyperparameters.
    '''
    
    grid_model = GridSearchCV(model, param_grid=param_grid, scoring='f1', cv=cv, verbose=1, n_jobs=-1)
    grid_model.fit(x, y)
    best_estimator = grid_model.best_estimator_
    best_score = grid_model.best_score_
    
    return best_score, best_estimator

We define a function that collects the best cross-validated **F1** score of each model on the training sample into a table.

In [26]:
def table_builder(models):
    
    table = []
    
    for model in models:
        grid = grid_search(model[1], model[2], 3, X_train, y_train)
        table.append((model[0], grid[0], grid[1]))
        
        print(grid)
        
    final_table = pd.DataFrame(table, columns=['Model', 'F1_score_CV', 'Grid'])
        
    return final_table

In [27]:
final_table = table_builder(models)

Fitting 3 folds for each of 96 candidates, totalling 288 fits
(0.7439326734209278, LogisticRegression(C=1, class_weight='balanced', max_iter=1000, n_jobs=-1,
                   random_state=12345))
Fitting 3 folds for each of 192 candidates, totalling 576 fits
(0.7505653082415282, SGDClassifier(fit_intercept=False, loss='modified_huber', max_iter=5, n_jobs=-1,
              penalty='l1', random_state=12345))
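
The candidate counts in this log follow directly from the grids: an exhaustive search over a dict-of-lists grid tries the Cartesian product of all value lists, and each candidate is fitted once per fold. The sketch below reproduces the numbers with a hypothetical helper `grid_candidates`; the grids are copied from the cells above.

```python
from itertools import product


def grid_candidates(param_grid):
    """Enumerate every combination of values in a dict-of-lists grid."""
    names = list(param_grid)
    return [dict(zip(names, values)) for values in product(*param_grid.values())]


lr_grid = {'penalty': ['l1', 'l2'],
           'fit_intercept': [True, False],
           'max_iter': [1000, 5000],
           'C': [0.001, 0.1, 0.8, 1],
           'class_weight': [None, 'balanced', {0: 0.9, 1: 0.1}]}

sgd_grid = {'loss': ['log', 'modified_huber'],
            'penalty': ['l1', 'l2'],
            'fit_intercept': [True, False],
            'max_iter': [5, 1000, 5000],
            'shuffle': [True, False],
            'learning_rate': ['optimal'],
            'validation_fraction': [0.1, 0.2, 0.3, 0.4]}

cv = 3  # number of folds used in table_builder
for name, grid in [('Logistic Regression', lr_grid), ('SGDClassifier', sgd_grid)]:
    n_candidates = len(grid_candidates(grid))
    print(name, n_candidates, n_candidates * cv)
```

This prints 96 candidates (288 fits) for the logistic regression and 192 candidates (576 fits) for the SGD classifier, matching the log.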


In [28]:
final_table

| | Model | F1_score_CV | Grid |
|---|---|---|---|
| 0 | Logistic Regression | 0.743933 | LogisticRegression(C=1, class_weight='balanced... |
| 1 | SGDClassifier | 0.750565 | SGDClassifier(fit_intercept=False, loss='modif... |


### BERT

We split the 500-row BERT sample into training and test sets at a ratio of 80:20. The features are already prepared in the `features_bert` matrix.

In [29]:
features_bert_train, features_bert_test, y_bert_train, y_bert_test = train_test_split(features_bert, df_bert.toxic, test_size=0.2, stratify=df_bert.toxic, random_state=12345)

In [30]:
print("Shape of features_bert_train matrix:", features_bert_train.shape)
print("Shape of features_bert_test matrix:", features_bert_test.shape)
print("Shape of y_bert_train vector:", y_bert_train.shape)
print("Shape of y_bert_test vector:", y_bert_test.shape)

Shape of features_bert_train matrix: (400, 768)
Shape of features_bert_test matrix: (100, 768)
Shape of y_bert_train vector: (400,)
Shape of y_bert_test vector: (100,)


**Logistic regression with BERT**

In [31]:
param_grid={'penalty' : ['l1', 'l2'],
            'fit_intercept': [True, False],
            'C' : [0.001, 0.1, 0.8, 1],
            'max_iter' : [1000, 2000],
            'class_weight' : [None, 'balanced', {0: 0.9, 1: 0.1}]
            }

In [32]:
print('\nLogistic Regression_BERT\n\nF1 score and hyperparameters on cross-validation:\n', grid_search(lr, param_grid, 3, features_bert_train, y_bert_train))

Fitting 3 folds for each of 96 candidates, totalling 288 fits

Logistic Regression_BERT

F1 score and hyperparameters on cross-validation:
 (0.5632996632996633, LogisticRegression(C=1, max_iter=1000, n_jobs=-1, random_state=12345))


**Conclusion of Stage 2:**

We trained models to predict comment toxicity based on two approaches.

1. For the TF-IDF approach, the SGDClassifier model, with an F1 metric value on cross validation of 0.751, proved to be the best.
2. For the Embedding approach (for 500 records), the Logistic Regression model had a cross-validation F1 metric value of 0.56.

Let us test the best model, SGDClassifier, on the test sample.

## Testing model

The SGDClassifier model with TF-IDF features achieved the best cross-validated F1 score on the training sample. Let us evaluate it on the test sample, using the best hyperparameters found during the grid search.

In [33]:
sgd_test = SGDClassifier(fit_intercept=False, loss='modified_huber', max_iter=5, n_jobs=-1,
              penalty='l1', random_state=12345)

sgd_test.fit(X_train, y_train)
predictions_test = sgd_test.predict(X_test)

print('F1 score on the test sample: SGD Classifier =', f1_score(y_test, predictions_test).round(2))

F1 score on the test sample: SGD Classifier = 0.75


**Overall conclusion:**

A total of 159292 rows with comments from the online shop Wikishop were loaded. To classify the comments for toxicity, we applied two approaches:
1. by training the model on a feature matrix with TF-IDF values computed;
2. by generating a feature matrix as embeddings of a pre-trained BERT model. Given the resource-intensive nature of the embedding process, we only used a subsample of 500 records randomly selected from the original dataset for this approach.

For the TF-IDF approach, the `SGDClassifier` model proved to be the best, with an **F1** score on cross-validation of **0.751**.

For the embedding approach (on 500 records), the **F1** score of the `LogisticRegression` model on cross-validation was **0.56**.

The best model on the training data, `SGDClassifier` with TF-IDF features, was evaluated on the test sample, where it reached an **F1** score of **0.75**.
