# Practice with the pretrained BERT neural network

The task is to train a model that detects toxic tweets.
## Work plan:
* Prepare the data so that features can be built with BERT
* Train classification models
* Choose the model with the best F1 score


## Workflow:

### Data preparation

In [None]:
!pip install transformers -q

In [None]:
pip install catboost

Note: you may need to restart the kernel to use updated packages.


In [None]:
import numpy as np
import pandas as pd
import torch
from transformers import BertModel, BertTokenizer, BertConfig
from tqdm import notebook
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score as F1
from sklearn.model_selection import GridSearchCV
from catboost import CatBoostClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

In [None]:
device = torch.device("cuda:0") if torch.cuda.is_available() else torch.device("cpu")
device

In [None]:
df = pd.read_csv('/datasets/toxic_comments.csv')

In [None]:
model = BertModel.from_pretrained('unitary/toxic-bert')
tokenizer = BertTokenizer.from_pretrained('unitary/toxic-bert')

In [None]:
df

        Unnamed: 0                                               text  toxic
0                0  Explanation\nWhy the edits made under my usern...      0
1                1  D'aww! He matches this background colour I'm s...      0
2                2  Hey man, I'm really not trying to edit war. It...      0
3                3  "\nMore\nI can't make any real suggestions on ...      0
4                4  You, sir, are my hero. Any chance you remember...      0
...            ...                                                ...    ...
159287      159446  ":::::And for the second time of asking, when ...      0
159288      159447  You should be ashamed of yourself \n\nThat is ...      0
159289      159448  Spitzer \n\nUmm, theres no actual article for ...      0
159290      159449  And it looks like it was actually you who put ...      0
159291      159450  "\nAnd ... I really don't think you understand...      0

[159292 rows x 3 columns]

In [None]:
df = df.drop('Unnamed: 0', axis=1)

To reduce the time needed to compute the embeddings, we use a sample of 20,000 objects.

In [None]:
df_sample = df.sample(20000)
corpus_sample = df_sample.reset_index(drop = True)


In [None]:
tokens = corpus_sample['text'].apply(lambda x: tokenizer.encode(x, add_special_tokens = True, max_length= 512 , truncation=True))


In [None]:
max_len = 0
for i in tokens.values:
    if len(i) > max_len:
        max_len = len(i)

padded = np.array([i + [0]*(max_len - len(i)) for i in tokens.values])


attention_mask = np.where(padded != 0, 1, 0)
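
To make the padding step above easier to follow, here is a minimal sketch on toy data. The helper `pad_and_mask` and the `toy_*` names are hypothetical and not part of the notebook. The ids 101 and 102 are the `[CLS]` and `[SEP]` ids of the standard `bert-base-uncased` vocabulary, the remaining ids are arbitrary, and the sketch assumes the padding id is 0, as in the cell above.

```python
import numpy as np

# Toy token-id lists of different lengths (101 = [CLS], 102 = [SEP]; other ids are arbitrary)
toy_tokens = [[101, 7592, 102], [101, 2017, 2024, 2307, 102]]


def pad_and_mask(token_lists, pad_id=0):
    """Right-pad every list to the longest length and build the matching attention mask."""
    max_len = max(len(t) for t in token_lists)
    padded_ids = np.array([t + [pad_id] * (max_len - len(t)) for t in token_lists])
    # 1 marks a real token, 0 marks padding that the model should ignore
    mask = (padded_ids != pad_id).astype(int)
    return padded_ids, mask


toy_padded, toy_mask = pad_and_mask(toy_tokens)
print(toy_padded)
print(toy_mask)
```

The shorter sequence receives two trailing zeros, and its mask row is `[1, 1, 1, 0, 0]`, so attention is computed only over real tokens.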


In [None]:
torch.cuda.empty_cache()
batch_size = 200

In [None]:
embeddings = []
model.to(device)  # move the model to the selected device once, before the loop
for i in notebook.tqdm(range(padded.shape[0] // batch_size)):
    batch = torch.LongTensor(padded[batch_size*i:batch_size*(i+1)]).to(device)
    attention_mask_batch = torch.LongTensor(attention_mask[batch_size*i:batch_size*(i+1)]).to(device)

    with torch.no_grad():
        batch_embeddings = model(batch, attention_mask=attention_mask_batch)

    # keep the [CLS] token vector of the last hidden state as the text embedding
    embeddings.append(batch_embeddings[0][:, 0, :].cpu().numpy())
    del batch
    del attention_mask_batch
    del batch_embeddings
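
The loop above iterates over `padded.shape[0] // batch_size` batches. This covers every row here because 20,000 is divisible by 200, but with another sample size the trailing rows would be skipped and the feature matrix would end up shorter than the label column. A minimal sketch of index ranges that also cover a shorter final batch (the helper `batch_slices` is hypothetical, not part of the notebook):

```python
import math


def batch_slices(n_rows, batch_size):
    """Yield (start, stop) index pairs covering all n_rows, including a shorter last batch."""
    n_batches = math.ceil(n_rows / batch_size)
    for i in range(n_batches):
        start = i * batch_size
        yield start, min(start + batch_size, n_rows)


print(list(batch_slices(20000, 200))[-1])  # last batch for the sample size used here
print(list(batch_slices(450, 200)))        # a size that is not divisible by the batch size
```

Each `(start, stop)` pair can be used to slice both `padded` and `attention_mask` in place of the `batch_size*i:batch_size*(i+1)` expressions.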

### Model training

In [None]:
features = np.concatenate(embeddings)
df_new = pd.DataFrame(features)
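# assign labels by position: df_sample keeps the original row labels, while df_new is indexed 0..n-1
df_new['toxic'] = corpus_sample['toxic'].values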

train, test = train_test_split(df_new, test_size=0.1)
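
When a label column is attached to a freshly built feature frame, pandas aligns the values by index label rather than by position. The toy sketch below illustrates the difference; the frames `sampled` and `toy_features` are hypothetical stand-ins for a randomly sampled frame and an embedding matrix.

```python
import pandas as pd

# Stand-in for a sampled frame: the index keeps the original row labels
sampled = pd.DataFrame({'toxic': [1, 0, 1]}, index=[42, 7, 99])
# Stand-in for the embedding matrix, which gets a fresh 0..n-1 index
toy_features = pd.DataFrame([[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]])

misaligned = toy_features.copy()
misaligned['toxic'] = sampled['toxic']          # aligned by index label: nothing matches

aligned = toy_features.copy()
aligned['toxic'] = sampled['toxic'].to_numpy()  # assigned by position

print(misaligned['toxic'].isna().sum())
print(aligned['toxic'].tolist())
```

In the misaligned frame every label is `NaN`, because none of the labels 42, 7, 99 occurs in the index 0..2. Positional assignment keeps each label next to its own feature row.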

In [None]:
train_features = train.drop('toxic', axis = 1)
train_target = train['toxic']

test_features = test.drop('toxic', axis = 1)
test_target = test['toxic']

#### Logistic regression

In [None]:
model = LogisticRegression()
grid_space = {'penalty':['l1', 'l2', 'elasticnet', None],
              'dual':[True, False],
              'solver':['lbfgs', 'liblinear', 'newton-cg', 'newton-cholesky', 'sag', 'saga'] }
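# score with F1 (the default would be accuracy), so that best_score_ is comparable with the other models
grid = GridSearchCV(model, param_grid = grid_space, scoring = 'f1', cv=3, n_jobs = -1 )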
grid.fit(train_features, train_target)
model = grid.best_estimator_
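
Several combinations in the grid above are rejected by `LogisticRegression` (for example `penalty='l1'` with `solver='lbfgs'`, or `dual=True` with anything other than `liblinear` and `l2`). With the default `error_score`, such candidates fail to fit and are scored as NaN. Below is a sketch of a grid restricted to combinations that should be valid according to the scikit-learn documentation; the exact set may depend on the installed version, and `l1_ratio=0.5` is an arbitrary value chosen for illustration. The names `valid_grid` and `count_candidates` are hypothetical.

```python
from itertools import product

# Assumed-valid solver/penalty/dual combinations (check against your scikit-learn version)
valid_grid = [
    {'solver': ['lbfgs', 'newton-cg', 'newton-cholesky', 'sag'], 'penalty': ['l2', None]},
    {'solver': ['liblinear'], 'penalty': ['l1', 'l2'], 'dual': [False]},
    {'solver': ['liblinear'], 'penalty': ['l2'], 'dual': [True]},
    {'solver': ['saga'], 'penalty': ['l1', 'l2', None]},
    {'solver': ['saga'], 'penalty': ['elasticnet'], 'l1_ratio': [0.5]},
]


def count_candidates(grid):
    """Count the parameter combinations a list-of-dicts grid expands to."""
    return sum(len(list(product(*block.values()))) for block in grid)


print(count_candidates(valid_grid))
```

A list of dictionaries like this can be passed directly as `param_grid` to `GridSearchCV`. It expands to 15 candidates instead of the 48 produced by the full Cartesian product above.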


In [None]:
model = grid.best_estimator_
scores = cross_val_score(model, train_features, train_target, cv = 2, scoring = 'f1')
print('F1:', grid.best_score_)
print('Cross-Val F1:',sum(scores)/len(scores))

The cross-validated F1 score for logistic regression was 0.94.
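
For reference, F1 is the harmonic mean of precision and recall for the positive (toxic) class. The sketch below computes it by hand on made-up labels; the function `f1_from_labels` and the toy labels are hypothetical and only illustrate the formula behind `scoring = 'f1'`.

```python
def f1_from_labels(y_true, y_pred, positive=1):
    """Compute binary F1 from true and predicted labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0  # no true positives: F1 is zero by convention
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


toy_true = [1, 0, 1, 1, 0, 0]
toy_pred = [1, 0, 0, 1, 1, 0]
toy_f1 = f1_from_labels(toy_true, toy_pred)
print(round(toy_f1, 3))
```

Here there are two true positives, one false positive and one false negative, so precision and recall are both 2/3 and F1 is 2/3 as well.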

#### CatBoostClassifier

In [None]:
model_2 = CatBoostClassifier()
grid_space_2 = {'depth'         : [4,7, 10],
                  'iterations'    : [10, 20, 30]
                 }
grid_2 = GridSearchCV(model_2, param_grid = grid_space_2, scoring = 'f1', cv=3, n_jobs = -1 )
grid_2.fit(train_features, train_target)
model_2 = grid_2.best_estimator_


In [None]:
model_2 = grid_2.best_estimator_
scores_2 = cross_val_score(model_2, train_features, train_target, cv = 2, scoring = 'f1')
print('F1:', grid_2.best_score_)
print('Cross-Val F1:',sum(scores_2)/len(scores_2))

The cross-validated F1 score for the CatBoost classifier was 0.93.

#### RandomForestClassifier

In [None]:
model_3 = RandomForestClassifier()
grid_space_3 = {'n_estimators'         : [100, 120, 150],
                  'criterion'  :  ['gini', 'entropy', 'log_loss'],
                'max_depth':[2, 5, 8]
                }
grid_3 = GridSearchCV(model_3, param_grid = grid_space_3, scoring = 'f1', cv=3, n_jobs = -1 )
grid_3.fit(train_features, train_target)



In [None]:
model_3 = grid_3.best_estimator_
scores_3 = cross_val_score(model_3, train_features, train_target, cv = 2, scoring = 'f1')
print('F1:', grid_3.best_score_)
print('Cross-Val F1:',sum(scores_3)/len(scores_3))

The cross-validated F1 score for the random forest, using the hyperparameters selected by the grid search, was 0.94.

The random forest achieved the best score. Let's evaluate this model on the test set:

In [None]:
pred = model_3.predict(test_features)
# f1_score expects the true labels first, then the predictions
print(F1(test_target, pred))

## Conclusion:

To train a model that classifies tweets as toxic or non-toxic, embeddings were computed with the BERT neural network for a sample of 20,000 texts. Three models were then trained, of which the random forest showed the best result. Its F1 score on the test set was 0.936.





