<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Проект-для-&quot;Викишоп&quot;" data-toc-modified-id="Проект-для-&quot;Викишоп&quot;-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Проект для "Викишоп"</a></span></li><li><span><a href="#Загрузка-и-подготовка-данных" data-toc-modified-id="Загрузка-и-подготовка-данных-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Загрузка и подготовка данных</a></span></li><li><span><a href="#Обучение-моделей" data-toc-modified-id="Обучение-моделей-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Обучение моделей</a></span><ul class="toc-item"><li><span><a href="#CatBoost" data-toc-modified-id="CatBoost-3.1"><span class="toc-item-num">3.1&nbsp;&nbsp;</span>CatBoost</a></span><ul class="toc-item"><li><span><a href="#Вывод" data-toc-modified-id="Вывод-3.1.1"><span class="toc-item-num">3.1.1&nbsp;&nbsp;</span>Вывод</a></span></li></ul></li><li><span><a href="#BERT" data-toc-modified-id="BERT-3.2"><span class="toc-item-num">3.2&nbsp;&nbsp;</span>BERT</a></span></li><li><span><a href="#Вывод" data-toc-modified-id="Вывод-3.3"><span class="toc-item-num">3.3&nbsp;&nbsp;</span>Вывод</a></span></li></ul></li></ul></div>

# Проект для "Викишоп"
Интернет-магазин «Викишоп» запускает новый сервис. Теперь пользователи могут редактировать и дополнять описания товаров, как в вики-сообществах. То есть клиенты предлагают свои правки и комментируют изменения других. Магазину нужен инструмент, который будет искать токсичные комментарии и отправлять их на модерацию. 

Обучу модель классифицировать комментарии на позитивные и негативные. В моём распоряжении набор данных с разметкой о токсичности правок.

Построю модель со значением метрики качества *F1* не меньше 0.75.

# Загрузка и подготовка данных

In [108]:
%pip install catboost



In [109]:
%pip install transformers



In [110]:
import pandas as pd
from catboost import CatBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score
from sklearn.metrics import accuracy_score
import re
import string
from sklearn.metrics import roc_auc_score
from tqdm.notebook import tqdm
import torch
from transformers import DistilBertModel, DistilBertTokenizer
import numpy as np
from tqdm import notebook
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import LogisticRegressionCV
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import recall_score
from sklearn.metrics import precision_score

###### Создадим функцию очистки текста

In [111]:
def text_cleaning(text):
    text = text.lower()
    text = re.sub('\[.*?\]', '', text)
    text = re.sub("\\W"," ",text) # remove special chars
    text = re.sub('https?://\S+|www\.\S+', '', text)
    text = re.sub('<.*?>+', '', text)
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text)
    text = re.sub('\n', '', text)
    text = re.sub('\w*\d\w*', '', text)
    
    return text

In [112]:

try:
    df = pd.read_csv('https://toxic_comments.csv')
except FileNotFoundError as e:
    print(e)
    df = pd.read_csv('/datasets/toxic_comments.csv')

In [113]:
df

Unnamed: 0,text,toxic
0,Explanation\nWhy the edits made under my usern...,0
1,D'aww! He matches this background colour I'm s...,0
2,"Hey man, I'm really not trying to edit war. It...",0
3,"""\nMore\nI can't make any real suggestions on ...",0
4,"You, sir, are my hero. Any chance you remember...",0
...,...,...
159566,""":::::And for the second time of asking, when ...",0
159567,You should be ashamed of yourself \n\nThat is ...,0
159568,"Spitzer \n\nUmm, theres no actual article for ...",0
159569,And it looks like it was actually you who put ...,0


**Визуализируем тексты после очистки**

In [114]:
df['text']=df['text'].apply(text_cleaning)

In [115]:
df

Unnamed: 0,text,toxic
0,explanation why the edits made under my userna...,0
1,d aww he matches this background colour i m s...,0
2,hey man i m really not trying to edit war it...,0
3,more i can t make any real suggestions on im...,0
4,you sir are my hero any chance you remember...,0
...,...,...
159566,and for the second time of asking when ...,0
159567,you should be ashamed of yourself that is a ...,0
159568,spitzer umm theres no actual article for pr...,0
159569,and it looks like it was actually you who put ...,0


Добавил для сравнения

# Обучение моделей

## CatBoost

###### Разобьем выборку на тренировочную, тестовую и валидационную.

In [116]:
train, test = train_test_split(df,test_size=0.4,random_state=77)

In [117]:
val, test = train_test_split(test,test_size=0.5,random_state=77)

###### Проверим коректность разбиения

In [118]:
len(train) / len(df)

0.5999962399182809

In [119]:
len(val) / len(df)

0.19999874663942696

In [120]:
len(test) / len(df)

0.20000501344229216

##### Выделим признаки и таргет.

In [121]:
X = ['text']
y = ['toxic']
text_features = ['text']

##### Используем в параметрах AUC т.к. F1 зависит от порога, затем подберём пороги для F1.

In [122]:
params = {'eval_metric':'AUC',
          'text_features':text_features,
          'task_type':'GPU',
          'learning_rate':0.1,
          'verbose':100}

In [123]:
cbc = CatBoostClassifier(**params)

In [124]:
#!g1.1
cbc.fit(train[X], train[y], eval_set=(val[X], val[y]))

0:	test: 0.8959635	best: 0.8959635 (0)	total: 24.4ms	remaining: 24.4s
100:	test: 0.9640577	best: 0.9640577 (100)	total: 1.55s	remaining: 13.8s
200:	test: 0.9671695	best: 0.9671773 (199)	total: 2.93s	remaining: 11.6s
300:	test: 0.9687546	best: 0.9687546 (300)	total: 4.33s	remaining: 10.1s
400:	test: 0.9696219	best: 0.9696219 (400)	total: 5.75s	remaining: 8.59s
500:	test: 0.9700253	best: 0.9700253 (500)	total: 7.13s	remaining: 7.1s
600:	test: 0.9703659	best: 0.9703758 (598)	total: 8.5s	remaining: 5.64s
700:	test: 0.9706537	best: 0.9706537 (700)	total: 9.89s	remaining: 4.22s
800:	test: 0.9709156	best: 0.9709415 (792)	total: 11.3s	remaining: 2.79s
900:	test: 0.9709576	best: 0.9709814 (871)	total: 12.6s	remaining: 1.39s
999:	test: 0.9710866	best: 0.9710928 (998)	total: 14s	remaining: 0us
bestTest = 0.9710927904
bestIteration = 998
Shrink model to first 999 iterations.


<catboost.core.CatBoostClassifier at 0x7f54d0f13b90>

In [125]:
train['y_score'] = cbc.predict_proba(train[X])[:, 1]


In [126]:
test['y_score'] = cbc.predict_proba(test[X])[:, 1]

In [127]:
roc_auc_score(train['toxic'], train['y_score'])

0.9882700290057004

In [128]:
roc_auc_score(test['toxic'], test['y_score'])

0.9699123283708867

##### Найдем порог для F1

In [129]:
val['y_score'] = cbc.predict_proba(val[X])[:, 1]

In [130]:
thrs = [0] + sorted(list(val['y_score'].unique()))


In [131]:
#!g1.1
res = []
for thr in tqdm(thrs):
  val['y_pred'] = (val['y_score'] > thr) * 1
  res.append((thr,
              f1_score(val['toxic'],val['y_pred'])))

  0%|          | 0/31326 [00:00<?, ?it/s]

In [132]:
f1s = pd.DataFrame(res,columns=['thr','f1'])


In [133]:
f1s[f1s['f1'] == f1s['f1'].max()]


Unnamed: 0,thr,f1
28310,0.360313,0.786964


##### Расчитаем F1 На тесте

In [134]:
test['y_pred'] = (test['y_score'] > f1s[f1s['f1'] == f1s['f1'].max()]['thr']) * 1

In [135]:
f1_score(test['toxic'],test['y_pred'])

0.7846966543941091

##### F1 без порога

In [136]:
f1_score(test['toxic'], cbc.predict(test[X]))

0.7761502671032224

### Вывод
**F1 с использованием модели CatBoost 0,79**

## BERT

In [137]:
model_class, tokenizer_class, pretrained_weights = (DistilBertModel, DistilBertTokenizer, 'distilbert-base-uncased')
tokenizer = tokenizer_class.from_pretrained(pretrained_weights)
model = model_class.from_pretrained(pretrained_weights)

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertModel: ['vocab_transform.bias', 'vocab_layer_norm.bias', 'vocab_transform.weight', 'vocab_layer_norm.weight', 'vocab_projector.weight', 'vocab_projector.bias']
- This IS expected if you are initializing DistilBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


**В целях экономии нервов сил и ресурсов уменьшим выборку, наша цель опробировать работоспособность модели т.к. приемлемый результат уже достигнут**

In [138]:
df_sampled = df.sample(4810)

**Проведем токенезацию, паддинг, создадим маску**

In [139]:
#!g1.1
tokenized = df_sampled['text'].apply(
    lambda x: tokenizer.encode(x, add_special_tokens=True, truncation=True, max_length=512))

max_len = 0
for i in tokenized.values:
    if len(i) > max_len:
        max_len = len(i)

padded = np.array([i + [0]*(max_len - len(i)) for i in tokenized.values])

attention_mask = np.where(padded != 0, 1, 0)

**Векторизуем данные**

In [140]:
#!g1.1
batch_size = 10
embeddings = []
for i in notebook.tqdm(range(padded.shape[0] // batch_size)):
        batch = torch.LongTensor(padded[batch_size*i:batch_size*(i+1)]) 
        attention_mask_batch = torch.LongTensor(attention_mask[batch_size*i:batch_size*(i+1)])
        
        with torch.no_grad():
            batch_embeddings = model(batch, attention_mask=attention_mask_batch)
        
        embeddings.append(batch_embeddings[0][:,0,:].numpy())

  0%|          | 0/481 [00:00<?, ?it/s]

In [141]:
features = np.concatenate(embeddings)
target = df_sampled['toxic']
features_train, features_test, target_train, target_test = train_test_split(
features, target, test_size=0.5)


**Используем полученые вектора для обучения классификатора**

In [142]:
len(features_train) / len(df_sampled)


0.5

In [143]:
len(target_train) / len(df_sampled)

0.5

In [144]:
len(features_test) / len(df_sampled)

0.5

In [145]:
len(target_test) / len(df_sampled)

0.5

In [146]:
modelr = LogisticRegression(random_state=77, solver='liblinear')

In [147]:
modelr.fit(features_train, target_train)
predicted = modelr.predict(features_test)
f1_score(target_test, predicted)

0.6585956416464891

##### Подберем коэффицент обратной регуляризации

In [148]:
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=77)

c_values = np.logspace(-2, 3, 500)

logit_searcher = LogisticRegressionCV(Cs=c_values, cv=skf, verbose=100, n_jobs=-1, scoring='f1', solver='liblinear')
logit_searcher.fit(features_train, target_train)

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done   1 tasks      | elapsed:  5.5min
[Parallel(n_jobs=-1)]: Done   2 tasks      | elapsed:  5.7min
[Parallel(n_jobs=-1)]: Done   3 out of   5 | elapsed: 11.3min remaining:  7.6min
[Parallel(n_jobs=-1)]: Done   5 out of   5 | elapsed: 14.6min remaining:    0.0s
[Parallel(n_jobs=-1)]: Done   5 out of   5 | elapsed: 14.6min finished
[LibLinear]

LogisticRegressionCV(Cs=array([1.00000000e-02, 1.02334021e-02, 1.04722519e-02, 1.07166765e-02,
       1.09668060e-02, 1.12227736e-02, 1.14847155e-02, 1.17527712e-02,
       1.20270833e-02, 1.23077980e-02, 1.25950646e-02, 1.28890361e-02,
       1.31898690e-02, 1.34977233e-02, 1.38127630e-02, 1.41351558e-02,
       1.44650734e-02, 1.48026913e-02, 1.51481892e-02, 1.55017512e-02,
       1.58635653e-02, 1.62...
       7.07455942e+02, 7.23968114e+02, 7.40865683e+02, 7.58157646e+02,
       7.75853206e+02, 7.93961785e+02, 8.12493021e+02, 8.31456781e+02,
       8.50863158e+02, 8.70722485e+02, 8.91045332e+02, 9.11842520e+02,
       9.33125118e+02, 9.54904456e+02, 9.77192128e+02, 1.00000000e+03]),
                     cv=StratifiedKFold(n_splits=5, random_state=77, shuffle=True),
                     n_jobs=-1, scoring='f1', solver='liblinear', verbose=100)

In [149]:
logit_searcher.get_params

<bound method BaseEstimator.get_params of LogisticRegressionCV(Cs=array([1.00000000e-02, 1.02334021e-02, 1.04722519e-02, 1.07166765e-02,
       1.09668060e-02, 1.12227736e-02, 1.14847155e-02, 1.17527712e-02,
       1.20270833e-02, 1.23077980e-02, 1.25950646e-02, 1.28890361e-02,
       1.31898690e-02, 1.34977233e-02, 1.38127630e-02, 1.41351558e-02,
       1.44650734e-02, 1.48026913e-02, 1.51481892e-02, 1.55017512e-02,
       1.58635653e-02, 1.62...
       7.07455942e+02, 7.23968114e+02, 7.40865683e+02, 7.58157646e+02,
       7.75853206e+02, 7.93961785e+02, 8.12493021e+02, 8.31456781e+02,
       8.50863158e+02, 8.70722485e+02, 8.91045332e+02, 9.11842520e+02,
       9.33125118e+02, 9.54904456e+02, 9.77192128e+02, 1.00000000e+03]),
                     cv=StratifiedKFold(n_splits=5, random_state=77, shuffle=True),
                     n_jobs=-1, scoring='f1', solver='liblinear', verbose=100)>

In [150]:
predictedcv = logit_searcher.predict(features_test)

In [151]:
f1_score(target_test, predictedcv)

0.6584766584766585

In [152]:
model = LogisticRegression(C=10, random_state=77, solver='liblinear')
model

LogisticRegression(C=10, random_state=77, solver='liblinear')

In [153]:
model.fit(features_train, target_train)
predicted = model.predict(features_test)
f1_score(target_test, predicted)

0.6547085201793722

##### Подбор коэффициента обратной регуляризации не дал результата.

##### Подберем порог

In [155]:
probabilities_train = model.predict_proba(features_train)
probabilities_one_train = probabilities_train[:, 1]
res = []
for threshold in np.arange(0, 0.7, 0.02):
    predicted_train = probabilities_one_train > threshold 
    precision = precision_score(target_train, predicted_train) 
    recall = recall_score(target_train, predicted_train) 
    res.append((threshold, precision, recall))
    print("Порог = {:.2f} | Точность = {:.3f}, Полнота = {:.3f}".format(
        threshold, precision, recall))

Порог = 0.00 | Точность = 0.095, Полнота = 1.000
Порог = 0.02 | Точность = 0.404, Полнота = 1.000
Порог = 0.04 | Точность = 0.508, Полнота = 1.000
Порог = 0.06 | Точность = 0.580, Полнота = 1.000
Порог = 0.08 | Точность = 0.639, Полнота = 1.000
Порог = 0.10 | Точность = 0.687, Полнота = 1.000
Порог = 0.12 | Точность = 0.755, Полнота = 1.000
Порог = 0.14 | Точность = 0.791, Полнота = 0.996
Порог = 0.16 | Точность = 0.828, Полнота = 0.996
Порог = 0.18 | Точность = 0.857, Полнота = 0.996
Порог = 0.20 | Точность = 0.866, Полнота = 0.996
Порог = 0.22 | Точность = 0.882, Полнота = 0.987
Порог = 0.24 | Точность = 0.896, Полнота = 0.987
Порог = 0.26 | Точность = 0.904, Полнота = 0.987
Порог = 0.28 | Точность = 0.911, Полнота = 0.987
Порог = 0.30 | Точность = 0.915, Полнота = 0.987
Порог = 0.32 | Точность = 0.941, Полнота = 0.974
Порог = 0.34 | Точность = 0.945, Полнота = 0.974
Порог = 0.36 | Точность = 0.957, Полнота = 0.974
Порог = 0.38 | Точность = 0.969, Полнота = 0.965
Порог = 0.40 | Точно

In [156]:
table = pd.DataFrame(res, columns=['thr','precision','recall'])
table['f1'] =(2 * table['precision'] * table['recall']) / (table['precision'] + table['recall'])
table


Unnamed: 0,thr,precision,recall,f1
0,0.0,0.094802,1.0,0.173186
1,0.02,0.404255,1.0,0.575758
2,0.04,0.507795,1.0,0.67356
3,0.06,0.580153,1.0,0.7343
4,0.08,0.638655,1.0,0.779487
5,0.1,0.686747,1.0,0.814286
6,0.12,0.754967,1.0,0.860377
7,0.14,0.790941,0.995614,0.881553
8,0.16,0.828467,0.995614,0.904382
9,0.18,0.856604,0.995614,0.920892


In [157]:
table['f1'].max()

0.9688888888888889

In [158]:
table[table['f1'] == table['f1'].max()]['thr']

20    0.4
Name: thr, dtype: float64

In [159]:
modelr = LogisticRegression(random_state=77, solver='liblinear')

In [160]:
modelr.fit(features_train, target_train)
predicted = model.predict_proba(features_test)[:, 1]

In [162]:
predicted_test = (predicted > table[table['f1'] == table['f1'].max()]['thr']) * 1

In [163]:
f1_score(target_test, predicted_test)

0.6288032454361054

Подбор порога не дал результатов.

## Вывод
**Даже на выборке в 30 раз меньше BERT показал F1 = 0.71, имея хорошие вычислительные мощности можно работать с BERT, для результата выше среднего подойдет CatBoost**