# Description of the project

The online store "Wikishop" is launching a new service.
<br>Users can now edit and extend product descriptions, just like in wiki communities.
<br>That is, customers propose their own edits and comment on changes made by others.
<br>The store needs a tool that finds toxic comments and sends them for moderation.
<br>The task is to train a model that classifies comments as positive or negative.
<br>A dataset labeled for the toxicity of edits is available.
<br>The model must reach an F1 score of at least 0.75.
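
As a reminder of what the target metric measures, here is a minimal sketch that computes F1 for the positive (toxic) class by hand. The function name `f1_from_labels` and the toy labels are illustrative assumptions, not part of the project data; the notebook itself relies on `sklearn.metrics.f1_score`.

```python
def f1_from_labels(y_true, y_pred):
    """Compute F1 for the positive class (label 1) from two label sequences."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    # F1 is the harmonic mean of precision and recall
    return 2 * precision * recall / (precision + recall)


# Toy labels (hypothetical): 1 = toxic, 0 = not toxic
y_true = [1, 0, 0, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 1, 0, 0, 0, 0]
score = f1_from_labels(y_true, y_pred)
print(round(score, 3))
```

Because F1 ignores true negatives, a model that labels every comment as non-toxic scores 0 on it, however high its accuracy is on an imbalanced dataset.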

# Description of data

The data is in the `/datasets/toxic_comments.csv` file.
<br>The `text` column contains the text of the comment, and `toxic` is the target feature.

# Action plan

1. Load and prepare the data.
2. Train different models.
3. Draw conclusions.

# Loading data

In [None]:
import pandas as pd
import numpy as np
import copy

import re
import nltk
from nltk.stem import SnowballStemmer 
from nltk.corpus import stopwords as nltk_stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from tqdm import notebook
# tqdm._tqdm_notebook is deprecated; the notebook progress bar lives in tqdm.notebook
from tqdm.notebook import tqdm as tqdm_notebook

from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression
import lightgbm as lgb

from sklearn.metrics import f1_score
from sklearn.dummy import DummyClassifier

import torch
import transformers
from transformers import BertConfig

pd.set_option('display.max_rows', 100)
pd.set_option('display.max_columns', 100)

In [None]:
df = pd.read_csv('/datasets/toxic_comments.csv')
print(df.info())
df.head()

# Data preparation and model training without using BERT

## Data preparation

In [None]:
eng_stemmer = SnowballStemmer('english')

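def stem_clear(text):
    # Keep Latin letters only and lowercase the text
    clean_text = re.sub(r'[^a-zA-Z]', ' ', text).lower()
    # SnowballStemmer.stem() expects a single word, so stem word by word
    return " ".join(eng_stemmer.stem(word) for word in clean_text.split())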

In [None]:
tqdm_notebook.pandas()

In [None]:
df['stem_text'] = df['text'].progress_apply(stem_clear)
df.head()

In [None]:
features = df['stem_text'].values.astype('U')
target = df['toxic']

In [None]:
nltk.download('stopwords')
stopwords = set(nltk_stopwords.words('english'))

In [None]:
features_train, features_test, target_train, target_test = train_test_split(features, target, test_size=0.25, random_state=12345)

print(features.shape)
print(target.shape)
print()

print(features_train.shape)
print(target_train.shape)
print()

print(features_test.shape)
print(target_test.shape)

In [None]:
%%time
# Recent scikit-learn versions accept a list (not a set) for stop_words
count_tf_idf = TfidfVectorizer(stop_words=list(stopwords))
tf_idf_train = count_tf_idf.fit_transform(features_train)
tf_idf_test = count_tf_idf.transform(features_test)
count_tf_idf

## Model training

### Logistic regression

In [None]:
%%time
model = LogisticRegression(class_weight='balanced')
model.fit(tf_idf_train, target_train)

In [None]:
predictions = model.predict(tf_idf_train)
print('F1 Logistic Regression on train set:', f1_score(target_train, predictions))

predictions = model.predict(tf_idf_test)
print('F1 Logistic Regression on test set:', f1_score(target_test, predictions))

In [None]:
dummy_clf = DummyClassifier(strategy='stratified', random_state=12345)
dummy_clf.fit(tf_idf_train, target_train)
predictions = dummy_clf.predict(tf_idf_test)
print('F1 dummy on test set:', f1_score(target_test, predictions))

### LightGBM

In [None]:
hyper_params = {
    'boosting_type': 'gbdt',
    # F1 is not a built-in LightGBM metric; it is computed with f1_score below
    'learning_rate': 0.005,
    'verbose': 0,
    'max_depth': 20,
    # `num_iterations` is an alias of `n_estimators`, so only one of them is set
    'n_estimators': 1000
}

In [None]:
%%time
gbm = lgb.LGBMClassifier(**hyper_params)
# Verbosity is set via the `verbose` model parameter; fit() in LightGBM 4+ has no `verbose` argument
gbm.fit(tf_idf_train, target_train)
gbm.best_score_

In [None]:
predictions = gbm.predict(tf_idf_train)
print('F1 LightGBM on the train set:', f1_score(target_train, predictions))

predictions = gbm.predict(tf_idf_test)
print('F1 LightGBM on the test set:', f1_score(target_test, predictions))

I did not use a decision tree or a random forest: they are a poor fit for the high-dimensional sparse TF-IDF vectors used here. I tried to train them, but even a single decision tree takes a very long time to train.

## Conclusion

On the test set, LightGBM takes first place and logistic regression comes second.
<br>Logistic regression, however, wins on execution time.
<br>Without the class balancing parameter (`class_weight='balanced'`), logistic regression scores an F1 below 0.75.
<br>I think LightGBM could score higher with different hyperparameters, but that would increase execution time.
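
To make the effect of class balancing more concrete, here is a minimal sketch of the heuristic behind `class_weight='balanced'` in scikit-learn: each class gets the weight `n_samples / (n_classes * class_count)`. The helper name `balanced_class_weights` and the toy labels are illustrative assumptions and not the real class ratio of the dataset.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Return {class: weight} using n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Toy target (hypothetical): nine non-toxic comments and one toxic comment
toy_target = [0] * 9 + [1]
weights = balanced_class_weights(toy_target)
print(weights)
```

The rare class receives a proportionally larger weight, so misclassifying a toxic comment costs the model more during training. This usually raises recall on the toxic class at some cost in precision.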

# Data preparation and model training using BERT

## Data preparation

In [None]:
tqdm_notebook.pandas()

In [None]:
%%time
tokenizer = transformers.BertTokenizer.from_pretrained('bert-base-cased')

tokenized = df['text'].progress_apply(
    lambda x: tokenizer.encode(x, add_special_tokens=True, max_length = 512, truncation = True))

max_len = 0
for i in notebook.tqdm(tokenized.values):
    if len(i) > max_len:
        max_len = len(i)

padded = np.array([i + [0]*(max_len - len(i)) for i in tokenized.values])

attention_mask = np.where(padded != 0, 1, 0)
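
The padding and masking step in the cell above can be illustrated on a few short sequences. This is a minimal pure-Python sketch; the helper `pad_and_mask` and the token ids are made up for illustration, and the mask is derived from `token != pad_id` in the same way as `np.where(padded != 0, 1, 0)`.

```python
def pad_and_mask(token_ids, pad_id=0):
    """Right-pad every sequence to the longest one and build an attention mask."""
    max_len = max(len(ids) for ids in token_ids)
    padded = [ids + [pad_id] * (max_len - len(ids)) for ids in token_ids]
    # 1 marks a real token, 0 marks padding that the model should ignore
    mask = [[1 if tok != pad_id else 0 for tok in row] for row in padded]
    return padded, mask


# Hypothetical token ids for three comments of different lengths
toy_ids = [[101, 7592, 102], [101, 102], [101, 2023, 2003, 102]]
padded, mask = pad_and_mask(toy_ids)
for row, row_mask in zip(padded, mask):
    print(row, row_mask)
```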

In [None]:
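# Load the pretrained weights; BertModel(config=config) would only build a randomly initialized network
model = transformers.BertModel.from_pretrained('bert-base-cased')
model.eval()  # inference mode: disables dropout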

In [None]:
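batch_size = 100
embeddings = []

# Round up so that the last, incomplete batch is processed too;
# otherwise the number of embeddings would not match the number of targets
n_batches = (padded.shape[0] + batch_size - 1) // batch_size

for i in notebook.tqdm(range(n_batches)):

    batch = torch.LongTensor(padded[batch_size*i:batch_size*(i+1)])
    attention_mask_batch = torch.LongTensor(
        attention_mask[batch_size*i:batch_size*(i+1)])

    with torch.no_grad():
        batch_embeddings = model(batch, attention_mask=attention_mask_batch)

    # Keep the embedding of the first ([CLS]) token of each comment
    embeddings.append(batch_embeddings[0][:,0,:].numpy())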

In [None]:
features = np.concatenate(embeddings)
target = df['toxic']

In [None]:
train_features, test_features, train_target, test_target = train_test_split(features, target, test_size=0.25, random_state=12345)

## Neural network training

In [None]:
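%%time
# Assumption: a logistic regression serves as the classifier on top of the BERT embeddings,
# since the BERT model itself has no fit/predict methods
model = LogisticRegression(class_weight='balanced', max_iter=1000)
model.fit(train_features, train_target)

In [None]:
predictions = model.predict(train_features)
print('F1 BERT on the train set:', f1_score(train_target, predictions))

predictions = model.predict(test_features)
print('F1 BERT on the test set:', f1_score(test_target, predictions))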