<a href="https://colab.research.google.com/github/Rakhayeva/Data-Science-Projects-in-English/blob/main/ML_for_texts.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Project for "Wikishop" with BERT

The "Wikishop" online store is launching a new service. Users can now edit and supplement product descriptions, similar to wiki communities. In other words, customers suggest edits and comment on the changes made by others. The store needs a tool that will detect toxic comments and send them for moderation.

**Objective**: Train a model to classify comments as positive (non-toxic) or negative (toxic). You have a dataset at your disposal with labels indicating the toxicity of the edits.

**Requirement**: Build a model with an $F1$ quality metric of at least **0.75**.

**Project Instructions**
- [Downloading and preparing the data](#Downloading).
- [Training different models](#Training).
- [Drawing conclusions](#conclusions).


## Data Description

The data is stored in the `toxic_comments.csv` file.
- The `text` column contains the comment text.
- The `toxic` column is the target feature.

## <a name='Downloading'></a> Downloading and Preparing Data

### Modules and Libraries Import

In [None]:
!pip install catboost -q

In [2]:
import torch
import numpy as np
import pandas as pd
import transformers
from tqdm import notebook
from nltk.corpus import stopwords
from sklearn.metrics import f1_score
from catboost import CatBoostClassifier
from transformers import AutoTokenizer, AutoModel
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, GridSearchCV

from google.colab import drive
drive.mount('/content/drive')

from google.colab import output
# Preventing disconnect in Colab environment
output.eval_js('function ClickConnect(){console.log("Preventing disconnect");document.querySelector("colab-toolbar-button#connect").click()}setInterval(ClickConnect,60000)')

import warnings
warnings.filterwarnings('ignore')

Mounted at /content/drive


### <a name='Loading'></a> Data Loading

In [3]:
# Reading data
df = pd.read_csv('/content/drive/MyDrive/Yandex_Practicum/datasets for DS/toxic_comments.csv')

In [4]:
# Checking the class distribution
class_distribution = df['toxic'].value_counts(normalize=True)
class_distribution

toxic  proportion
0      0.898388
1      0.101612


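There is a significant class imbalance: about 90% of the comments are non-toxic and only about 10% are toxic. We should account for this when training, for example by using class weighting in the LogisticRegression model.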
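To make the effect of class weighting concrete, here is a minimal sketch of how scikit-learn's `'balanced'` mode derives per-class weights. The labels in `y_demo` are synthetic and only mimic the roughly 90/10 split; they are not taken from `toxic_comments.csv`.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Synthetic labels that mimic a roughly 90/10 class split (illustrative only)
y_demo = np.array([0] * 90 + [1] * 10)

classes = np.unique(y_demo)
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y_demo)

# 'balanced' is equivalent to n_samples / (n_classes * count_per_class)
manual = len(y_demo) / (len(classes) * np.bincount(y_demo))

class_weights = dict(zip(classes.tolist(), weights.tolist()))
print(class_weights)
```

The minority class receives a weight about nine times larger than the majority class, so each toxic example contributes proportionally more to the loss.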

### Feature Engineering with BERT

In [5]:
# Sample a portion of the data due to limited computational power for the full dataset
df_sample = df.sample(1000).reset_index(drop=True)

In [None]:
# Remove unnecessary characters (cleaning)
# A raw string keeps \w and \s from being treated as Python string escapes
df_sample['text'] = df_sample['text'].replace(to_replace=r'[^\w\s]', value='', regex=True)

# Loading pretrained tokenizer and model
model_name = 'unitary/toxic-bert'
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

# Select the device for inference (GPU if available, otherwise CPU)
device = torch.device("cuda:0") if torch.cuda.is_available() else torch.device("cpu")
print(f"Using device: {device}")

# Tokenizing texts
tokenized = df_sample['text'].apply(
    lambda x: tokenizer.encode(x, add_special_tokens=True, truncation=True, max_length=512))

# Determine the maximum length among all tokenized sequences
max_len = max(len(seq) for seq in tokenized)

# Pad all sequences to the same length (max_len) with zeros
padded = np.array([seq + [0] * (max_len - len(seq)) for seq in tokenized])

# Create attention masks: 1 for real tokens, 0 for padding tokens
attention_mask = np.where(padded != 0, 1, 0)

# Set batch size
batch_size = 100
embeddings = []

# Move model to the selected device
model.to(device)

# Disable gradient calculation (not required for inference/embedding extraction)
with torch.no_grad():
    # Iterate over the data in batches
    for i in notebook.tqdm(range(0, len(padded), batch_size)):
        # Create data batch and attention mask batch
        batch = torch.LongTensor(padded[i:i + batch_size])
        attention_mask_batch = torch.LongTensor(attention_mask[i:i + batch_size])

        # Get model outputs
        outputs = model(batch.to(device), # Move data to GPU
                        attention_mask=attention_mask_batch.to(device))

        # Extract CLS token embeddings and move them back to CPU as numpy arrays
        embeddings.append(outputs.last_hidden_state[:, 0, :].cpu().numpy())

# Concatenate all embeddings into a single feature matrix
features = np.concatenate(embeddings)
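The padding and attention-mask steps in the cell above can be checked in isolation on a toy input. This is a small sketch: `pad_and_mask` is a hypothetical helper name, and the token IDs are made up rather than produced by the real tokenizer. It assumes, as the cell above does, that the padding ID is 0 and never occurs as a real token.

```python
import numpy as np

def pad_and_mask(token_ids, pad_id=0):
    """Right-pad variable-length ID lists and build the matching attention mask."""
    max_len = max(len(seq) for seq in token_ids)
    padded = np.array([seq + [pad_id] * (max_len - len(seq)) for seq in token_ids])
    # 1 marks a real token, 0 marks padding
    mask = (padded != pad_id).astype(int)
    return padded, mask

# Two made-up sequences of different lengths
toy_ids = [[11, 42, 12], [11, 7, 8, 9, 12]]
padded_demo, mask_demo = pad_and_mask(toy_ids)

print(padded_demo)
print(mask_demo)
```

The mask lets the model ignore the padded positions, so the first-token embedding of a short comment is not influenced by the zeros appended to it.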

In [11]:
# Display the shape of the resulting embeddings array
display(features.shape)

(1000, 768)
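The training cells use `X_train_val`, `y_train_val`, `X_test`, and `y_test`, so the embeddings need to be split before modeling. The sketch below shows one way to do that with a stratified hold-out split. The 80/20 ratio, the seed, and the helper name `split_features` are assumptions, not values confirmed by this notebook, and the demo arrays are synthetic stand-ins for `features` and `df_sample['toxic']`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

def split_features(features, labels, test_size=0.2, seed=42):
    """Stratified hold-out split (test_size and seed are assumed values)."""
    return train_test_split(
        features, labels,
        test_size=test_size,
        random_state=seed,
        stratify=labels,  # keep the toxic/non-toxic ratio in both parts
    )

# Synthetic stand-ins with the same shape as the embeddings above
rng = np.random.default_rng(0)
features_demo = rng.normal(size=(1000, 768))
labels_demo = np.array([0] * 900 + [1] * 100)

X_tr_demo, X_te_demo, y_tr_demo, y_te_demo = split_features(features_demo, labels_demo)
print(X_tr_demo.shape, X_te_demo.shape, int(y_te_demo.sum()))
```

In the notebook itself the call would look like `X_train_val, X_test, y_train_val, y_test = split_features(features, df_sample['toxic'])`. Stratification matters here because, with only about 10% toxic comments, an unstratified split of a small sample can leave the test set with very few positive examples.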

## <a name='Training'></a> Training Models

### LogisticRegression

In [8]:
# Create an instance of LogisticRegression
# We use class_weight='balanced' to handle the identified class imbalance
lr = LogisticRegression(random_state=42, class_weight='balanced')

# Define the hyperparameter grid for search
param_grid_lr = {
    'C': [0.01, 0.1, 1, 10],
    'max_iter': [500, 1000]
}

# Configure GridSearchCV
grid_lr = GridSearchCV(lr, param_grid_lr, scoring='f1', cv=5)

# Search for the best hyperparameters
grid_lr.fit(X_train_val, y_train_val)

# Extract the model with the best parameters
best_lr = grid_lr.best_estimator_

print(f'Best parameters for LogisticRegression: {grid_lr.best_params_}')
print(f'F1 via cross-validation: {grid_lr.best_score_}')

Best parameters for LogisticRegression: {'C': 10, 'max_iter': 500}
F1 via cross-validation: 0.9034159166750936


### CatBoost

In [9]:
# Create an instance of CatBoostClassifier
# Using GPU for faster training
cb = CatBoostClassifier(random_state=42, verbose=0, task_type='GPU', devices='0')

# Define the hyperparameter grid for search
param_grid_cb = {
    'iterations': [50, 100, 150],
    'learning_rate': [0.01, 0.1, 0.2],
    'depth': [4, 6, 8]
}

# Configure GridSearchCV
grid_cb = GridSearchCV(cb, param_grid_cb, scoring='f1', cv=5)

# Search for the best hyperparameters
grid_cb.fit(X_train_val, y_train_val)

# Extract the model with the best parameters
best_cb = grid_cb.best_estimator_

print(f'Best parameters for CatBoost: {grid_cb.best_params_}')
print(f'F1 via cross-validation: {grid_cb.best_score_}')

Best parameters for CatBoost: {'depth': 8, 'iterations': 150, 'learning_rate': 0.01}
F1 via cross-validation: 0.9005561735261403


### Best Model Selection and Testing

In [10]:
# Select the best model based on cross-validation results
# (a separate variable avoids overwriting `model_name`, which holds the BERT checkpoint name)
if grid_lr.best_score_ > grid_cb.best_score_:
    final_model = best_lr
    best_model_name = "LogisticRegression"
else:
    final_model = best_cb
    best_model_name = "CatBoost"

# Evaluate the best model on the test set
y_pred_final = final_model.predict(X_test)
f1_final = f1_score(y_test, y_pred_final)

print(f'Best model: {best_model_name}')
print(f'Test F1 score: {f1_final}')

Best model: LogisticRegression
Test F1 score: 0.8936170212765957
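For reference, $F1$ is the harmonic mean of precision and recall, which can also be written as $2TP / (2TP + FP + FN)$. The sketch below computes it from confusion counts and compares the result with `f1_score`. The labels are hypothetical and chosen only to make the arithmetic easy to follow; they are not the notebook's predictions.

```python
from sklearn.metrics import f1_score

def f1_from_counts(tp, fp, fn):
    """F1 = 2*TP / (2*TP + FP + FN); returns 0.0 when the denominator is zero."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

# Hypothetical labels: 8 true positives, 2 false negatives, 2 false positives
y_true_demo = [1] * 10 + [0] * 10
y_pred_demo = [1] * 8 + [0] * 2 + [1] * 2 + [0] * 8

f1_manual = f1_from_counts(tp=8, fp=2, fn=2)
f1_sklearn = f1_score(y_true_demo, y_pred_demo)

# Compare against the project requirement of F1 >= 0.75
meets_requirement = f1_sklearn >= 0.75
print(f1_manual, f1_sklearn, meets_requirement)
```

Because true negatives do not appear in the formula, $F1$ is not inflated by the large non-toxic majority, which is why it suits this imbalanced task better than accuracy.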


## <a name='conclusions'></a> Conclusions

As part of the project for the "Wikishop" online store, we were tasked with developing a machine learning model to automatically detect toxic comments and flag them for moderation. The primary objective was to train a classifier to distinguish between positive and negative comments, achieving an $F1$ score of at least **0.75**.

**Project Workflow:**
- **Data Preprocessing:**
   - An analysis of the class distribution revealed a significant imbalance: toxic comments accounted for only about 10% of the dataset.
   - To keep computation manageable, a random sub-sample of 1,000 records was selected for the experiment.
   - Text data was cleaned of unnecessary characters and noise using regular expressions.
- **Modeling and Feature Engineering:**
   - We explored two classification algorithms: `Logistic Regression` and `CatBoost`.
   - Both models used 768-dimensional text embeddings extracted with the pre-trained `unitary/toxic-bert` model.
   - To mitigate the class imbalance, we applied the `class_weight='balanced'` parameter for Logistic Regression; CatBoost was trained without explicit class weighting.
   - Hyperparameter tuning and performance estimation were conducted using **GridSearchCV** with 5-fold cross-validation.

**Performance Evaluation:**
- **Cross-Validation Results:**
   - Logistic Regression, best parameters `{'C': 10, 'max_iter': 500}`: $F1$ of about 0.903.
   - CatBoost, best parameters `{'depth': 8, 'iterations': 150, 'learning_rate': 0.01}`: $F1$ of about 0.901.
- **Final Test Results:**
   - The **Logistic Regression** model achieved an $F1$ score of **0.89** on the test set, above the required 0.75. This is close to its cross-validation score of about 0.90, which suggests that the model generalizes to unseen data without significant overfitting.