# Comment classification for an online shop

The online shop's website enables users to leave comments on products as well as on other users' comments. To maintain a positive user experience, the shop needs to identify negative (toxic/insulting) comments and flag them for manual moderation.

**Our goal**: to detect negative (toxic/insulting) comments.

To achieve this goal, we will develop a machine learning model. The customer's requirement for the quality metric is an F1 score of at least 0.75.
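For reference, the F1 score is the harmonic mean of precision and recall for the positive (toxic) class. The sketch below computes it from raw confusion-matrix counts. The function name `f1_from_counts` and the example counts are illustrative assumptions, not values from this project.

```python
def f1_from_counts(tp: int, fp: int, fn: int) -> float:
    """Return the F1 score for the positive class from confusion-matrix counts."""
    if tp == 0:
        # Without true positives, precision and recall are zero (or undefined)
        return 0.0
    precision = tp / (tp + fp)  # share of flagged comments that are truly toxic
    recall = tp / (tp + fn)     # share of toxic comments that were flagged
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts, chosen only to illustrate the formula
print(round(f1_from_counts(tp=80, fp=20, fn=20), 3))  # precision = recall = 0.8
```

Because it is a harmonic mean, F1 stays low unless both precision and recall are reasonably high, which suits a moderation task where missed toxic comments and false alarms both carry a cost.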

**Data overview:** the dataset contains 159 292 entries, consisting of text data (`text`) and target label (`toxic`).

**Research plan**:
1. Data loading and preprocessing.
2. Model creation.
3. Conclusion.

## 1. Data loading and preprocessing

First, we import the required libraries and download the NLTK resources (tokenizer models, stop words, and the WordNet dictionary) needed for this natural language processing task.

In [1]:
import pandas as pd
import numpy as np
import time
import re
from IPython.display import display

import warnings
warnings.filterwarnings('ignore')

from sklearn.model_selection import train_test_split

import torch
import transformers
from transformers import BertTokenizer, BertModel
from tqdm import tqdm

from sklearn.model_selection import GridSearchCV
from xgboost import XGBClassifier

from sklearn.metrics import f1_score

import nltk
from nltk.tokenize import word_tokenize
nltk.download('punkt')
from nltk.corpus import stopwords
nltk.download('stopwords')
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')

STOP_WORDS = set(stopwords.words('english'))  # a set gives fast membership checks

RANDOM_STATE = 42



Now we load the dataset and view the data.

In [2]:
data = pd.read_csv('https://code.s3.yandex.net/datasets/toxic_comments.csv', index_col=[0])
display(data.head())
data.info()

| | text | toxic |
|---|---|---|
| 0 | Explanation\nWhy the edits made under my usern... | 0 |
| 1 | D'aww! He matches this background colour I'm s... | 0 |
| 2 | Hey man, I'm really not trying to edit war. It... | 0 |
| 3 | "\nMore\nI can't make any real suggestions on ... | 0 |
| 4 | You, sir, are my hero. Any chance you remember... | 0 |


<class 'pandas.core.frame.DataFrame'>
Index: 159292 entries, 0 to 159450
Data columns (total 2 columns):
 #   Column  Non-Null Count   Dtype 
---  ------  --------------   ----- 
 0   text    159292 non-null  object
 1   toxic   159292 non-null  int64 
dtypes: int64(1), object(1)
memory usage: 3.6+ MB


Let's look at the possible values in the `toxic` column.

In [3]:
data['toxic'].value_counts()

toxic
0    143106
1     16186
Name: count, dtype: int64

The target takes only two unique values, so this is a binary classification problem. The classes are also imbalanced: there are roughly nine times fewer toxic comments (16 186) than non-toxic ones (143 106).
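The size of the imbalance can be read directly from the class counts printed above. This is a minimal sketch of the arithmetic, with the counts copied from that output.

```python
# Class counts taken from the value_counts() output above
counts = {0: 143106, 1: 16186}

total = sum(counts.values())
toxic_share = counts[1] / total   # fraction of comments labeled toxic
ratio = counts[0] / counts[1]     # non-toxic comments per toxic comment

print(f"Total comments: {total}")
print(f"Toxic share: {toxic_share:.1%}")
print(f"Non-toxic to toxic ratio: {ratio:.1f}")
```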

Since generating BERT embeddings is computationally expensive, we will not use the entire dataset. Instead, we draw a balanced sample of 10 000 messages, 5 000 from each class.

In [4]:
# Create subsets of each class
class_0_subset = data[data['toxic'] == 0]
class_1_subset = data[data['toxic'] == 1]

# Take 5000 samples of each class
sampled_class_0 = class_0_subset.sample(n=5000, random_state=RANDOM_STATE)
sampled_class_1 = class_1_subset.sample(n=5000, random_state=RANDOM_STATE)

# Combine them to create a new dataset and shuffle
data_sample = pd.concat([sampled_class_0, sampled_class_1])
data_sample = data_sample.sample(frac=1, random_state=RANDOM_STATE).reset_index(drop=True)
data_sample.head()

| | text | toxic |
|---|---|---|
| 0 | A block ohhhhhhhhhhhhhh noooooooooooo I'm sooo... | 1 |
| 1 | 20% reminds me of the percentage of German you... | 0 |
| 2 | Relation to independence \n\nThis concept of a... | 0 |
| 3 | Milage\nIt is more than 30 Miles from Liverpoo... | 0 |
| 4 | Also don't seek revenge that user is an admin.... | 0 |


Let's preprocess the text as follows:

* Convert to lowercase,
* Remove punctuation and digits,
* Tokenize the text,
* Lemmatize each token,
* Remove stop words,
* Reassemble the processed tokens back into a string separated by spaces.

In [5]:
# Create the lemmatizer once instead of on every call
LEMMATIZER = WordNetLemmatizer()

# Function for text preprocessing
def preprocess_text(text):
    """Lowercase, strip non-letters, tokenize, lemmatize, and drop stop words."""
    text = text.lower()
    # Replace everything except Latin letters (punctuation, digits) with spaces
    text = re.sub(r'[^a-z]', ' ', text)

    tokens = word_tokenize(text)
    tokens = [LEMMATIZER.lemmatize(token) for token in tokens]
    tokens = [token for token in tokens if token not in STOP_WORDS]

    return ' '.join(tokens)

In [6]:
# Apply the text preprocessing function to the 'text' column
data_sample['text'] = data_sample['text'].apply(preprocess_text)
data_sample.head()

| | text | toxic |
|---|---|---|
| 0 | block ohhhhhhhhhhhhhh noooooooooooo soooo like... | 1 |
| 1 | reminds percentage german youth fighting battl... | 0 |
| 2 | relation independence concept set generator se... | 0 |
| 3 | milage mile liverpool blackpool like mile also... | 0 |
| 4 | also seek revenge user admin talk | 0 |


Using a pretrained BERT model, we will generate embeddings that will serve as input features for training our model.

In [7]:
# Device choice (GPU/CPU)
device = torch.device("cuda:0") if torch.cuda.is_available() else torch.device("cpu")

# Function for embedding creation
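def get_embeddings(data, tokenizer, model, max_len=512, batch_size=1):
    # Tokenize each text, truncating to max_len tokens (special tokens included)
    tokenized = data['text'].apply(lambda x: tokenizer.encode(x, 
                                                              add_special_tokens=True, 
                                                              max_length=max_len,
                                                              truncation=True))
    
    # Pad with zeros to max_len; the mask marks real tokens with 1 and padding with 0
    padded = np.array([i + [0]*(max_len - len(i)) for i in tokenized.values])
    attention_mask = np.where(padded != 0, 1, 0)
    
    embeddings = []
    # Step through the data in batches; the last batch may be smaller than batch_size,
    # so no rows are dropped when the sample size is not a multiple of batch_size
    for start in tqdm(range(0, padded.shape[0], batch_size)):
        batch = torch.LongTensor(padded[start:start + batch_size]).to(device)
        attention_mask_batch = torch.LongTensor(attention_mask[start:start + batch_size]).to(device)
        
        with torch.no_grad():
            batch_embeddings = model(batch, attention_mask=attention_mask_batch)
        
        # Keep the final hidden state of the first ([CLS]) token as the text embedding
        embeddings.append(batch_embeddings[0][:,0,:].cpu().numpy())
        del batch
        del attention_mask_batch
        del batch_embeddings
        
    return embeddings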
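
The padding and attention-mask step inside `get_embeddings` can be illustrated without loading BERT. The sketch below uses toy token ids and a hypothetical helper, `pad_and_mask`, that is not part of the notebook. It assumes, as the function above does, that id 0 is the padding token.

```python
def pad_and_mask(sequences, max_len, pad_id=0):
    """Right-pad each token-id list to max_len and build the matching attention mask."""
    padded, mask = [], []
    for ids in sequences:
        ids = ids[:max_len]                        # truncate overly long sequences
        n_pad = max_len - len(ids)
        padded.append(ids + [pad_id] * n_pad)      # real tokens first, padding after
        mask.append([1] * len(ids) + [0] * n_pad)  # 1 = real token, 0 = padding
    return padded, mask


# Toy token ids, not produced by a real tokenizer
padded, mask = pad_and_mask([[5, 6, 7], [8, 9]], max_len=4)
print(padded)  # [[5, 6, 7, 0], [8, 9, 0, 0]]
print(mask)    # [[1, 1, 1, 0], [1, 1, 0, 0]]
```

The mask tells the model to ignore padded positions. Since `add_special_tokens=True` places the `[CLS]` token first in every sequence, the slice `batch_embeddings[0][:,0,:]` picks that token's final hidden state as the embedding of the whole text.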

In [8]:
# Create embeddings
start = time.time()

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased', output_hidden_states=True)
model = model.to(device)

text_embeddings = get_embeddings(data_sample, tokenizer, model, max_len=512, batch_size=1)

# Create features from embeddings
features = np.concatenate(text_embeddings)

emb_time = round(time.time() - start, 2)
print('Embedding creation took:', emb_time, 'seconds.')

100%|██████████| 10000/10000 [02:33<00:00, 65.08it/s]

Embedding creation took: 159.25 seconds.





Data preprocessing is complete; we can now train the model.

## 2. Model creation

We will split the dataset into training and testing sets, ensuring that the target feature is proportionally represented in both sets.

In [9]:
# Train/test split
X = features
y = data_sample['toxic']
X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    random_state=RANDOM_STATE,
    stratify=y
)

print('Training set size:', X_train.shape)
print('Testing set size:', X_test.shape)

Training set size: (7500, 768)
Testing set size: (2500, 768)
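
To make the effect of `stratify=y` concrete, here is a simplified pure-Python sketch of a stratified split. The helper `stratified_split` is hypothetical and is not how scikit-learn implements it. It only shows that each class is divided in the same proportion, using the default 25% test share that produced the 7500/2500 split above.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_size=0.25, seed=42):
    """Return (train_idx, test_idx) with each class split in the same proportion."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)                        # randomize within each class
        n_test = round(len(indices) * test_size)    # same test share for every class
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return train_idx, test_idx


# Toy labels mirroring the balanced 5000/5000 sample
labels = [0] * 5000 + [1] * 5000
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))                     # 7500 2500
print(sum(labels[i] for i in test_idx) / len(test_idx))  # 0.5
```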


Next, we use cross-validation to find the best learning rate for a gradient boosting model (XGBoost).

In [10]:
# Parameters grid
param_grid = {
    'learning_rate': [0.2, 0.25, 0.3, 0.35, 0.4, 0.45, 0.5]
}

# Start cross-validation
start = time.time()

model_search = GridSearchCV(
    estimator=XGBClassifier(objective='binary:logistic',
                            random_state=RANDOM_STATE),
    param_grid=param_grid,
    cv=5,
    scoring='f1',
    n_jobs=-1
)

model_search.fit(X_train, y_train)

search_time = round(time.time() - start, 2)
print('The search took:', search_time, 'seconds.\n')
print('\033[1m' + 'Cross-validation results:' + '\033[0m')
print('Model hyperparameters:\n', model_search.best_estimator_,'\n')
print('F1-score:', (model_search.best_score_).round(3))

The search took: 138.75 seconds.

Cross-validation results:
Model hyperparameters:
 XGBClassifier(base_score=None, booster=None, callbacks=None,
              colsample_bylevel=None, colsample_bynode=None,
              colsample_bytree=None, device=None, early_stopping_rounds=None,
              enable_categorical=False, eval_metric=None, feature_types=None,
              gamma=None, grow_policy=None, importance_type=None,
              interaction_constraints=None, learning_rate=0.35, max_bin=None,
              max_cat_threshold=None, max_cat_to_onehot=None,
              max_delta_step=None, max_depth=None, max_leaves=None,
              min_child_weight=None, missing=nan, monotone_constraints=None,
              multi_strategy=None, n_estimators=None, n_jobs=None,
              num_parallel_tree=None, random_state=42, ...) 

F1-score: 0.842
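
For intuition about what `cv=5` does in the search above, the sketch below partitions the 7500 training rows into five validation folds and counts the model fits the grid search performs. The helper `kfold_indices` is a hypothetical simplification that uses contiguous, unshuffled folds. For a classifier with an integer `cv`, scikit-learn uses stratified folds instead, which this sketch ignores.

```python
def kfold_indices(n_samples, n_splits):
    """Yield (train_idx, val_idx) pairs for contiguous, unshuffled k-fold splits."""
    fold_sizes = [n_samples // n_splits + (1 if i < n_samples % n_splits else 0)
                  for i in range(n_splits)]
    start = 0
    for size in fold_sizes:
        val_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, val_idx
        start += size


# Training-set size, fold count, and number of learning rates from the cells above
n_train, n_splits, n_candidates = 7500, 5, 7
folds = list(kfold_indices(n_train, n_splits))
print([len(val) for _, val in folds])  # [1500, 1500, 1500, 1500, 1500]
print(n_candidates * n_splits + 1)     # 36 fits: 35 during CV plus one final refit
```

Each candidate learning rate is scored as the mean F1 over the five validation folds, and the best candidate is then refit on the full training set.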


The cross-validation F1 score meets the requirement. Let's evaluate the model on the test set.

In [11]:
print('\033[1m' + 'Testing results:' + '\033[0m')
y_pred_test = model_search.predict(X_test)
f1_score_test = round(f1_score(y_test, y_pred_test), 3)  # built-in round also works on plain floats
print('F1-score:', f1_score_test)

Testing results:
F1-score: 0.858


The model meets the customer's requirements.

## 3. Conclusion

The goal of this project was to classify comments as neutral or negative (toxic/insulting).

To achieve this goal, we built a machine learning model based on gradient boosting using the XGBoost library. The F1 score was used as the quality metric.

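The dataset contained 159 292 entries. The target was heavily imbalanced: only about 10% of the comments (16 186) were labeled toxic. To keep BERT embedding generation tractable, we trained and evaluated on a balanced sample of 10 000 comments (5 000 per class).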

We preprocessed the text as follows:
* Convert to lowercase,
* Remove punctuation and digits,
* Tokenize the text,
* Lemmatize each token,
* Remove stop words,
* Reassemble the processed tokens back into a string separated by spaces.

Next, using the BERT model, we generated embeddings and converted the texts to vector form.

Cross-validation over learning rates from 0.2 to 0.5 selected the following model: an `XGBClassifier` with `objective='binary:logistic'`, `learning_rate=0.35`, and `random_state=42`, with all other hyperparameters left at their defaults.

Performance Metrics:
* F1-score during cross-validation: 0.842
* F1-score on the test set: 0.858

The customer requires an F1 score of at least 0.75. The model reaches 0.858 on the balanced test sample, so it meets the customer's requirements and can be used for comment classification on the website.