# Project description

The online store «Викишоп» is launching a new service. Users can now edit and extend product descriptions, as in wiki communities: customers propose their own edits and comment on changes made by others. The store needs a tool that finds toxic comments and sends them for moderation.

In [1]:
import numpy as np
import pandas as pd
import torch
import transformers
from transformers import BertModel, BertConfig
from transformers import BertTokenizer, BertForMaskedLM
from tqdm import notebook
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV
from sklearn.model_selection import GridSearchCV
from catboost import CatBoostClassifier
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score




## 1. Preparation

Let's examine the available data:

In [2]:
try:
    df = pd.read_csv("C:/Users/dimil/OneDrive/Desktop/toxic_comments.csv")
except FileNotFoundError:
    df = pd.read_csv('/datasets/toxic_comments.csv')

df.head(10)

Unnamed: 0,text,toxic
0,Explanation\nWhy the edits made under my usern...,0
1,D'aww! He matches this background colour I'm s...,0
2,"Hey man, I'm really not trying to edit war. It...",0
3,"""\nMore\nI can't make any real suggestions on ...",0
4,"You, sir, are my hero. Any chance you remember...",0
5,"""\n\nCongratulations from me as well, use the ...",0
6,COCKSUCKER BEFORE YOU PISS AROUND ON MY WORK,1
7,Your vandalism to the Matt Shirvington article...,0
8,Sorry if the word 'nonsense' was offensive to ...,0
9,alignment on this subject and which are contra...,0


In [3]:
df.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
toxic,159571.0,0.101679,0.302226,0.0,0.0,0.0,0.0,1.0


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 159571 entries, 0 to 159570
Data columns (total 2 columns):
 #   Column  Non-Null Count   Dtype 
---  ------  --------------   ----- 
 0   text    159571 non-null  object
 1   toxic   159571 non-null  int64 
dtypes: int64(1), object(1)
memory usage: 2.4+ MB


To speed things up, cast the target column to a smaller integer type:

In [5]:
df['toxic'] = df['toxic'].astype('int32')

Take a random sample of 50,000 rows:

In [6]:
df = df.sample(50000).reset_index(drop=True) 

Tokenize and pad the texts, then compute BERT embeddings in batches:

In [7]:
tokenizer = BertTokenizer.from_pretrained(
    'bert-base-uncased',
    do_lower_case=True
)

tokenized = df['text'].apply(
    lambda x: tokenizer.encode(x, add_special_tokens=True, max_length=100, truncation=True))

max_len = 0
for i in tokenized.values:
    if len(i) > max_len:
        max_len = len(i)

padded = np.array([i + [0]*(max_len - len(i)) for i in tokenized.values])

attention_mask = np.where(padded != 0, 1, 0)
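
The padding and masking step can be illustrated on a tiny input. This is a minimal sketch with made-up token ids (`toy_tokenized` is a hypothetical name, not part of the notebook); it only assumes, as the cell above does, that id 0 is the padding token.

```python
import numpy as np

# Made-up token-id sequences of different lengths (0 is reserved for padding).
toy_tokenized = [[101, 7592, 102], [101, 2088, 2003, 2307, 102]]

max_len = max(len(ids) for ids in toy_tokenized)

# Right-pad every sequence with zeros up to the longest length.
padded = np.array([ids + [0] * (max_len - len(ids)) for ids in toy_tokenized])

# 1 marks a real token, 0 marks padding, so the model can ignore padded positions.
attention_mask = np.where(padded != 0, 1, 0)

print(padded)
print(attention_mask)
```

The shorter sequence ends in zeros in `padded`, and the same positions are zero in `attention_mask`.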

In [8]:
config = BertConfig.from_pretrained("bert-base-uncased")
model = transformers.BertModel.from_pretrained("bert-base-uncased", config=config)

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.bias', 'cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [9]:
batch_size = 100
embeddings = []
for i in notebook.tqdm(range(padded.shape[0] // batch_size)):
    # Slice the current batch of token ids and its attention mask
    batch = torch.LongTensor(padded[batch_size*i:batch_size*(i+1)])
    attention_mask_batch = torch.LongTensor(attention_mask[batch_size*i:batch_size*(i+1)])

    # Inference only, so gradients are not tracked
    with torch.no_grad():
        batch_embeddings = model(batch, attention_mask=attention_mask_batch)

    # Keep the vector at position 0 (the [CLS] token) of the last hidden state
    embeddings.append(batch_embeddings[0][:, 0, :].numpy())
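
The indexing `[:, 0, :]` in the loop keeps one vector per comment. The sketch below shows the same slice on a small NumPy stand-in for the model output; the array name and the tiny hidden size are assumptions for illustration only (the real hidden size of `bert-base-uncased` is 768).

```python
import numpy as np

# Stand-in for the last hidden state: (batch, sequence length, hidden size).
last_hidden_state = np.arange(2 * 3 * 4).reshape(2, 3, 4)

# Keep only position 0 of every sequence, i.e. the vector at the [CLS] token.
cls_embeddings = last_hidden_state[:, 0, :]

print(cls_embeddings.shape)
print(cls_embeddings)
```

Each row of `cls_embeddings` is then used as the feature vector for one comment.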





## 2. Training

In [10]:
features = np.concatenate(embeddings)

X = features
y = df['toxic']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
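
Only about 10% of the comments are toxic (the mean of `toxic` is 0.10), so a plain random split can give the test set a slightly different class ratio. As a possible variation, not what the notebook does, `train_test_split` accepts a `stratify` argument that preserves the ratio. The sketch uses synthetic data; `X_toy` and `y_toy` are hypothetical names.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data with the same 10% positive rate as the real target.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(1000, 8))
y_toy = np.array([1] * 100 + [0] * 900)

# stratify=y_toy keeps the share of positives equal in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

print(y_tr.mean(), y_te.mean())
```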



Train and compare three models: logistic regression, random forest, and CatBoost.

In [11]:
lr_clf = LogisticRegression(max_iter=500)
lr_clf.fit(X_train, y_train)


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


LogisticRegression(max_iter=500)
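
The warning above suggests either raising `max_iter` or scaling the data. One way to follow the second suggestion is to put a `StandardScaler` in front of the classifier. This is a sketch on synthetic data, not the notebook's actual run; `X_toy`, `y_toy` and `scaled_lr` are hypothetical names.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the embedding features.
X_toy, y_toy = make_classification(n_samples=500, n_features=20, random_state=0)

# Standardize features, then fit logistic regression on the scaled values.
scaled_lr = make_pipeline(StandardScaler(), LogisticRegression(max_iter=500))
scaled_lr.fit(X_toy, y_toy)

# Number of solver iterations actually used.
n_iter = scaled_lr.named_steps["logisticregression"].n_iter_[0]
print("iterations used:", n_iter)
```

Because the scaler is part of the pipeline, it is fitted on the training data only and reapplied automatically at prediction time.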

In [12]:
n_estimators = [int(x) for x in np.linspace(start = 10, stop = 100, num = 8)]
max_depth = [int(x) for x in np.linspace(10, 60, num = 8)]
max_depth.append(None)

random_grid = {'n_estimators': n_estimators,
               'max_depth': max_depth}
print(random_grid)

{'n_estimators': [10, 22, 35, 48, 61, 74, 87, 100], 'max_depth': [10, 17, 24, 31, 38, 45, 52, 60, None]}


In [13]:
rf = RandomForestClassifier()
rf_random = RandomizedSearchCV(
    estimator=rf,
    param_distributions=random_grid,
    n_iter=50,
    cv=5,
    verbose=2,
    random_state=12345,
    n_jobs=-1,
)
rf_random.fit(X_train, y_train)

Fitting 5 folds for each of 50 candidates, totalling 250 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 12 concurrent workers.
[Parallel(n_jobs=-1)]: Done  17 tasks      | elapsed:  3.4min
[Parallel(n_jobs=-1)]: Done 138 tasks      | elapsed: 15.6min
[Parallel(n_jobs=-1)]: Done 250 out of 250 | elapsed: 30.8min finished


RandomizedSearchCV(cv=5, estimator=RandomForestClassifier(), n_iter=50,
                   n_jobs=-1,
                   param_distributions={'max_depth': [10, 17, 24, 31, 38, 45,
                                                      52, 60, None],
                                        'n_estimators': [10, 22, 35, 48, 61, 74,
                                                         87, 100]},
                   random_state=12345, verbose=2)

In [14]:
rf_random.best_params_

{'n_estimators': 100, 'max_depth': 24}
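
The search above is run without a `scoring` argument, so candidates are ranked by the classifier's default score (accuracy), while the models are later compared by F1. If the search should optimize the same metric, `scoring="f1"` can be passed. The sketch below uses synthetic imbalanced data and a deliberately tiny grid; all names and grid values in it are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic data with roughly 10% positives, mimicking the class imbalance.
X_toy, y_toy = make_classification(
    n_samples=300, n_features=10, weights=[0.9, 0.1], random_state=0
)

search = RandomizedSearchCV(
    estimator=RandomForestClassifier(random_state=0),
    param_distributions={"n_estimators": [10, 20], "max_depth": [3, None]},
    n_iter=3,
    scoring="f1",  # rank candidates by F1 of the positive class
    cv=3,
    random_state=12345,
)
search.fit(X_toy, y_toy)

print(search.best_params_, search.best_score_)
```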

In [15]:
best_rf = RandomForestClassifier(**rf_random.best_params_)
best_rf.fit(X_train, y_train)


RandomForestClassifier(max_depth=24)

In [16]:
cb_model = CatBoostClassifier()
grid = {'max_depth': [1,3,5],'learning_rate':[0.01, 0.03], 'iterations':[10, 50, 100]}
gscb = GridSearchCV(estimator = cb_model, param_grid = grid, cv = 5)
gscb.fit(X_train, y_train)

print(gscb.best_estimator_)

print(gscb.best_score_)

print(gscb.best_params_)

0:	learn: 0.6830781	total: 223ms	remaining: 2.01s
1:	learn: 0.6738778	total: 236ms	remaining: 945ms
2:	learn: 0.6643813	total: 249ms	remaining: 581ms
3:	learn: 0.6551074	total: 261ms	remaining: 392ms
4:	learn: 0.6470026	total: 274ms	remaining: 274ms
5:	learn: 0.6389150	total: 287ms	remaining: 191ms
6:	learn: 0.6310612	total: 299ms	remaining: 128ms
7:	learn: 0.6228229	total: 311ms	remaining: 77.8ms
8:	learn: 0.6153883	total: 324ms	remaining: 36ms
9:	learn: 0.6080823	total: 337ms	remaining: 0us
0:	learn: 0.6829990	total: 22ms	remaining: 198ms
1:	learn: 0.6740711	total: 34.7ms	remaining: 139ms
2:	learn: 0.6647740	total: 46.8ms	remaining: 109ms
3:	learn: 0.6560718	total: 59.2ms	remaining: 88.8ms
4:	learn: 0.6473074	total: 71.7ms	remaining: 71.7ms
5:	learn: 0.6385248	total: 83.8ms	remaining: 55.9ms
6:	learn: 0.6305333	total: 96.1ms	remaining: 41.2ms
7:	learn: 0.6221594	total: 108ms	remaining: 27.1ms
8:	learn: 0.6150656	total: 121ms	remaining: 13.4ms
9:	learn: 0.6078035	total: 133ms	remainin

Compute the F1 score for each model:

In [17]:
models = [lr_clf, best_rf, gscb]
for clf in models:
    y_pred = clf.predict(X_test)
    print("Model:", clf)
    print("F1 score:", f1_score(y_test, y_pred))

Model: LogisticRegression(max_iter=500)
F1 score: 0.7131782945736435
Model: RandomForestClassifier(max_depth=24)
F1 score: 0.46072507552870096
Model: GridSearchCV(cv=5,
             estimator=<catboost.core.CatBoostClassifier object at 0x00000221159CAC10>,
             param_grid={'iterations': [10, 50, 100],
                         'learning_rate': [0.01, 0.03],
                         'max_depth': [1, 3, 5]})
F1 score: 0.5230987917555082
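
For reference, the binary F1 score is the harmonic mean of precision and recall for the positive class, which is what `f1_score` computes with its default settings. The hand-rolled version below is only a sketch of that definition on made-up labels; `f1_binary` is a hypothetical helper, not a library function.

```python
def f1_binary(y_true, y_pred):
    """F1 for the positive class (label 1): 2 * P * R / (P + R)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        # No true positives: precision and recall are both zero.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Made-up labels: 2 true positives, 1 false positive, 1 false negative.
y_true_toy = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred_toy = [1, 1, 0, 1, 0, 0, 0, 0]
score = f1_binary(y_true_toy, y_pred_toy)
print(score)
```

Here precision and recall are both 2/3, so F1 is also 2/3. True negatives do not enter the formula, which is why F1 is more informative than accuracy when only about 10% of the comments are toxic.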


## 3. Conclusion

Logistic regression achieved the best F1 score on the test set, about 0.71. CatBoost (about 0.52) and random forest (about 0.46) scored noticeably lower.