<h1>Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Подготовка" data-toc-modified-id="Подготовка-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Preparation</a></span></li><li><span><a href="#Обучение" data-toc-modified-id="Обучение-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Training</a></span></li><li><span><a href="#Выводы" data-toc-modified-id="Выводы-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Conclusions</a></span></li><li><span><a href="#Чек-лист-проверки" data-toc-modified-id="Чек-лист-проверки-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Review checklist</a></span></li></ul></div>

# Project for «Викишоп» (Wikishop)

The «Викишоп» online store is launching a new service: users can now edit and extend product descriptions, as in wiki communities. Customers propose their own edits and comment on changes made by others. The store needs a tool that finds toxic comments and sends them for moderation.

Train a model to classify comments as positive or negative. You have a dataset labeled for the toxicity of edits.

Build a model with an *F1* score of at least 0.75.
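
As a reminder of what the target metric measures, here is a minimal sketch of F1 computed from raw counts. The function name `f1_from_counts` and the counts themselves are made up for illustration; they are not taken from the dataset.

```python
def f1_from_counts(tp: int, fp: int, fn: int) -> float:
    """F1 is the harmonic mean of precision and recall."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts: 80 toxic comments caught, 20 false alarms, 30 missed
score = f1_from_counts(tp=80, fp=20, fn=30)
print(round(score, 4))  # 0.7619
```

With these hypothetical counts precision is 0.8 and recall is about 0.727, so the score just clears the 0.75 threshold.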

**Project instructions**

1. Load and prepare the data.
2. Train several different models.
3. Draw conclusions.

Using *BERT* is optional for this project, but you are welcome to try it.

**Data description**

The data is in the file `toxic_comments.csv`. Its *text* column contains the comment text, and *toxic* is the target.

## Preparation

In [15]:
import numpy as np
import pandas as pd
import math
import torch
import transformers as ppb # pytorch transformers
import catboost as cb
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, f1_score
from tqdm import notebook

In [16]:
local_data = 'datasets/toxic_comments.csv'
try:
    df = pd.read_csv(local_data)
except FileNotFoundError:
    # No local copy yet: download the dataset and cache it for later runs
    df = pd.read_csv('https://code.s3.yandex.net/datasets/toxic_comments.csv')
    df.to_csv(local_data, index=False)


In [17]:
print('DataFrame shape: ', df.shape)
# Drop the first column, a leftover row index saved in the CSV
df = df.drop(df.columns[0], axis=1)

DataFrame shape:  (159292, 3)


In [18]:
# Hold out 10% of the data as a test set, stratified by the target
train, test = train_test_split(
    df, test_size=0.1,
    stratify=df['toxic'])
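
To illustrate what the `stratify` argument achieves, here is a minimal pure-Python sketch of a stratified split. It is not scikit-learn's implementation; `stratified_split` is a hypothetical helper and the labels are toy data.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_size, seed=0):
    """Return (train_idx, test_idx) with each class split in the same proportion."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Toy labels: 90 non-toxic (0) and 10 toxic (1) comments
labels = [0] * 90 + [1] * 10
train_idx, test_idx = stratified_split(labels, test_size=0.1)
toxic_share_test = sum(labels[i] for i in test_idx) / len(test_idx)
print(len(train_idx), len(test_idx), toxic_share_test)
```

Because each class is split separately, the share of toxic comments in the test part matches the share in the full toy set.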

In [21]:
train

Unnamed: 0,text,toxic
121755,"First and foremost, you were not threatened. ...",0
61621,"""\nAlthough this is not a vote, I suggest the ...",0
93329,I'm not sure but I think I have read that Brit...,0
15611,"""\n\nI hope all is OK and hope to see you soon...",0
72602,"As far as I remember, I needed to provide a pr...",0
...,...,...
96273,"""\n\n Slight change to intro \n\ncould the bit...",0
43521,"""\n\n Spammy edits... \n\nI've been seeing a l...",0
12102,Animals section on Muhammad\nHello Truthspread...,0
91473,THAT'S RIGHT - LISTEN UP ASSHOLE !!!,1


In [22]:
x_train, x_test, y_train, y_test = train_test_split(
    train.drop('toxic', axis=1), train['toxic'], test_size=0.25,
    stratify=train['toxic'])

In [23]:
def fit_model(train_pool, test_pool, **kwargs):
    model = cb.CatBoostClassifier(
        # task_type='GPU',
        iterations=2000,
        eval_metric='F1',
        od_type='IncToDec',
        od_wait=500,
        **kwargs
    )
    return model.fit(
        train_pool,
        eval_set=test_pool,
        verbose=100,
        plot=True,
        use_best_model=True)
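
The idea behind keeping the best iteration and waiting a fixed number of steps for an improvement can be sketched as follows. This is a simplified, patience-based illustration and not CatBoost's implementation: the `IncToDec` detector configured above uses a different criterion, and `best_iteration_with_patience` and the score history are hypothetical.

```python
def best_iteration_with_patience(scores, patience):
    """Return (best_index, stop_index) for a metric where higher is better."""
    best_index = 0
    for i, score in enumerate(scores):
        if score > scores[best_index]:
            best_index = i
        elif i - best_index >= patience:
            # No improvement for `patience` steps: stop here
            return best_index, i
    return best_index, len(scores) - 1


# Hypothetical validation F1 values, one per evaluation step
history = [0.40, 0.70, 0.73, 0.74, 0.75, 0.74, 0.745, 0.748, 0.749]
best, stopped = best_iteration_with_patience(history, patience=3)
print(best, stopped)  # 4 7
```

In this toy run the best score is at step 4, and training would stop at step 7 after three steps without improvement.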

In [24]:
train_pool = cb.Pool(
    data=x_train,
    label=y_train,
    text_features=['text']
)

# valid_pool = Pool(
#     data=X_valid, 
#     label=y_valid,
#     text_features=['text']
# )

test_pool = cb.Pool(
    data=x_test, 
    label=y_test,
    text_features=['text']
)

In [25]:
model = fit_model(
    train_pool, test_pool,
    learning_rate=0.25,
    tokenizers=[
        {
            'tokenizer_id': 'Sense',
            'separator_type': 'BySense',
            'lowercasing': 'False',
            'token_types':['Word', 'SentenceBreak', 'Punctuation'],
            'sub_tokens_policy':'SeveralTokens'
        }      
    ],
    dictionaries = [
        {
            'dictionary_id': 'Word',
            'max_dictionary_size': '150000'
        }
    ],
    feature_calcers = [
        'BoW:top_tokens_count=10000'
    ]
)

0:	learn: 0.4113210	test: 0.4046483	best: 0.4046483 (0)	total: 238ms	remaining: 7m 56s
100:	learn: 0.7198470	test: 0.6961832	best: 0.6961832 (100)	total: 22.4s	remaining: 7m
200:	learn: 0.7728843	test: 0.7272727	best: 0.7272727 (200)	total: 44s	remaining: 6m 34s
300:	learn: 0.8001905	test: 0.7381836	best: 0.7382182 (296)	total: 1m 5s	remaining: 6m 7s
400:	learn: 0.8181343	test: 0.7460444	best: 0.7460444 (400)	total: 1m 26s	remaining: 5m 43s
500:	learn: 0.8322537	test: 0.7480064	best: 0.7493631 (475)	total: 1m 47s	remaining: 5m 20s
600:	learn: 0.8447167	test: 0.7518248	best: 0.7518248 (599)	total: 2m 7s	remaining: 4m 57s
700:	learn: 0.8514496	test: 0.7533576	best: 0.7533576 (700)	total: 2m 28s	remaining: 4m 34s
800:	learn: 0.8581633	test: 0.7568590	best: 0.7572081 (793)	total: 2m 48s	remaining: 4m 12s
900:	learn: 0.8650761	test: 0.7584906	best: 0.7591952 (885)	total: 3m 8s	remaining: 3m 50s
1000:	learn: 0.8709889	test: 0.7607708	best: 0.7612478 (985)	total: 3m 29s	remaining: 3m 28s
1100

In [41]:
# Evaluate CatBoost on the held-out test set
cb_test_true = test['toxic']
cb_test_pred = model.predict(test.drop('toxic', axis=1))
f1_score(cb_test_true, cb_test_pred)

0.7688513037350246

In [26]:
# Lighter alternative: uncomment the next line (and comment out the BERT line below) to use DistilBERT
#model_class, tokenizer_class, pretrained_weights = (ppb.DistilBertModel, ppb.DistilBertTokenizer, 'distilbert-base-uncased')

# BERT is used here
model_class, tokenizer_class, pretrained_weights = (ppb.BertModel, ppb.BertTokenizer, 'bert-base-uncased')

# Load the pretrained model and tokenizer
# Note: this rebinds `model`, which previously held the CatBoost classifier
tokenizer = tokenizer_class.from_pretrained(pretrained_weights)
model = model_class.from_pretrained(pretrained_weights)

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.decoder.weight', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [27]:
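# Tokenize the comments, truncating each one to BERT's 512-token limit
# (slicing the text to 512 characters does not guarantee at most 512 tokens)
tokenized = df['text'].apply(
    lambda x: tokenizer.encode(x, add_special_tokens=True, truncation=True, max_length=512))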

In [30]:
tokenized

0         [101, 7526, 2339, 1996, 10086, 2015, 2081, 210...
1         [101, 1040, 1005, 22091, 2860, 999, 2002, 3503...
2         [101, 4931, 2158, 1010, 1045, 1005, 1049, 2428...
3         [101, 1000, 2062, 1045, 2064, 1005, 1056, 2191...
4         [101, 2017, 1010, 2909, 1010, 2024, 2026, 5394...
                                ...                        
159287    [101, 1000, 1024, 1024, 1024, 1024, 1024, 1998...
159288    [101, 2017, 2323, 2022, 14984, 1997, 4426, 200...
159289    [101, 13183, 6290, 26114, 1010, 2045, 2015, 20...
159290    [101, 1998, 2009, 3504, 2066, 2009, 2001, 2941...
159291    [101, 1000, 1998, 1012, 1012, 1012, 1045, 2428...
Name: text, Length: 159292, dtype: object

In [31]:
# Find the length of the longest token list
max_len = len(max(tokenized, key=len))

# Pad every list with zeros to the same length
padded = np.array([i + [0]*(max_len - len(i)) for i in tokenized.values])

# Mark real tokens with 1 and padding with 0
attention_mask = np.where(padded != 0, 1, 0)
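
The padding and masking step is easier to follow on a tiny example. The sketch below uses plain lists so it runs without NumPy; the token ids are hypothetical and chosen only to show the shapes.

```python
# Hypothetical token-id sequences of different lengths (0 is the padding id)
tokenized = [[101, 7592, 102], [101, 2023, 2003, 1037, 3231, 102]]

# Pad every sequence with zeros up to the longest one
max_len = max(len(ids) for ids in tokenized)
padded = [ids + [0] * (max_len - len(ids)) for ids in tokenized]

# 1 marks a real token, 0 marks padding the model should ignore
attention_mask = [[1 if token != 0 else 0 for token in row] for row in padded]

print(padded[0])          # [101, 7592, 102, 0, 0, 0]
print(attention_mask[0])  # [1, 1, 1, 0, 0, 0]
```

The mask has the same shape as the padded ids, which is what the model expects when both are passed together.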

In [32]:
batch_size = 100
embeddings = []
# Embed the comments batch by batch to keep memory usage bounded
for i in notebook.tqdm(range(math.ceil(padded.shape[0] / batch_size))):
    batch = torch.LongTensor(padded[batch_size*i:batch_size*(i+1)])
    attention_mask_batch = torch.LongTensor(attention_mask[batch_size*i:batch_size*(i+1)])

    with torch.no_grad():
        batch_embeddings = model(batch, attention_mask=attention_mask_batch)

    # Keep the hidden state of the first ([CLS]) token as the comment embedding
    embeddings.append(batch_embeddings[0][:, 0, :].numpy())
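
The batch arithmetic in the loop can be checked in isolation. The sketch below is a hypothetical helper, `batch_bounds`, that lists the slice bounds each iteration would use; the explicit `min` only makes visible what Python slicing already does when the last batch is short.

```python
import math


def batch_bounds(n_rows, batch_size):
    """Return the (start, stop) slice bounds of each batch."""
    n_batches = math.ceil(n_rows / batch_size)
    return [
        (batch_size * i, min(batch_size * (i + 1), n_rows))
        for i in range(n_batches)
    ]


bounds = batch_bounds(n_rows=250, batch_size=100)
print(bounds)  # [(0, 100), (100, 200), (200, 250)]
```

For the 159292 comments in this dataset and a batch size of 100, the same arithmetic gives 1593 batches, the last one holding 92 rows.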

In [None]:
features = np.concatenate(embeddings)
x_train, x_test, y_train, y_test = train_test_split(
    features, df['toxic'], test_size=0.25)

# Train and evaluate the model
lr_model = LogisticRegression(max_iter=1000)
lr_model.fit(x_train, y_train)
pred = lr_model.predict(x_test)

In [None]:
f1_score(y_test, pred)

## Training

## Conclusions

## Review checklist

- [x]  The Jupyter Notebook opens
- [ ]  All code runs without errors
- [ ]  Code cells are arranged in execution order
- [ ]  The data is loaded and prepared
- [ ]  The models are trained
- [ ]  The *F1* score is at least 0.75
- [ ]  Conclusions are written