# Project for "Wikishop"

The online store "Wikishop" is launching a new service. Users can now edit and extend product descriptions, as in wiki communities: customers propose their own edits and comment on changes made by others. The store needs a tool that finds toxic comments and sends them for moderation.

The task is to train a model that classifies comments as positive or negative, that is, non-toxic or toxic. A dataset with toxicity labels for the edits is provided for training. The model's F1 score must be at least 0.75.

### Contents

* [1. Data preparation](#chapter1)
* [2. Model training](#chapter2)

## 1. Data preparation <a class="anchor" id="chapter1"></a>

In [1]:
pip install torch

Defaulting to user installation because normal site-packages is not writeable
Note: you may need to restart the kernel to use updated packages.


In [2]:
pip install transformers

Defaulting to user installation because normal site-packages is not writeable
Note: you may need to restart the kernel to use updated packages.


In [26]:
import numpy as np
import pandas as pd
import torch
import transformers
from tqdm import notebook
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, f1_score

In [4]:
data = pd.read_csv('toxic_comments.csv')

In [5]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 159292 entries, 0 to 159291
Data columns (total 3 columns):
 #   Column      Non-Null Count   Dtype 
---  ------      --------------   ----- 
 0   Unnamed: 0  159292 non-null  int64 
 1   text        159292 non-null  object
 2   toxic       159292 non-null  int64 
dtypes: int64(2), object(1)
memory usage: 3.6+ MB


In [6]:
data.head()

| | Unnamed: 0 | text | toxic |
|---|---|---|---|
| 0 | 0 | Explanation\nWhy the edits made under my usern... | 0 |
| 1 | 1 | D'aww! He matches this background colour I'm s... | 0 |
| 2 | 2 | Hey man, I'm really not trying to edit war. It... | 0 |
| 3 | 3 | "\nMore\nI can't make any real suggestions on ... | 0 |
| 4 | 4 | You, sir, are my hero. Any chance you remember... | 0 |


The *Unnamed: 0* column duplicates the row index, so we drop it.

In [7]:
data = data.drop(['Unnamed: 0'], axis=1)

The data is loaded.  
Next, we prepare it for the model.

In [8]:
data = data.sample(1000).reset_index(drop=True)

In [9]:
# Load the multilingual BERT tokenizer and cap sequences at 64 tokens
tokenizer = transformers.BertTokenizer.from_pretrained("bert-base-multilingual-uncased")
tokenizer.model_max_length = 64
tokenized = data['text'].apply(
    lambda x: tokenizer.encode(x, add_special_tokens=True, truncation=True))

# Length of the longest tokenized comment
max_len = max(len(i) for i in tokenized.values)

# Right-pad every sequence with zeros (the padding token id) up to max_len
padded = np.array([i + [0]*(max_len - len(i)) for i in tokenized.values])

# 1 for real tokens, 0 for padding
attention_mask = np.where(padded != 0, 1, 0)

In [10]:
attention_mask.shape

(1000, 64)
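
To make the padding and masking step concrete, here is a minimal sketch on made-up token ids. The ids and the `pad_and_mask` helper are illustrative assumptions, not part of the notebook. Like the cell above, it assumes that id 0 is reserved for padding.

```python
def pad_and_mask(sequences, pad_id=0):
    """Right-pad token id lists to a common length and build attention masks."""
    max_len = max(len(seq) for seq in sequences)
    padded = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    # 1 marks a real token, 0 marks padding that the model should ignore
    mask = [[1 if tok != pad_id else 0 for tok in row] for row in padded]
    return padded, mask


# Hypothetical token ids for two short comments of different lengths
padded_demo, mask_demo = pad_and_mask([[101, 7592, 102], [101, 2088, 2003, 2307, 102]])
print(padded_demo)
print(mask_demo)
```

The shorter sequence gets two trailing zeros, and its mask has zeros in the same positions, so those positions should not contribute to attention.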

In [11]:
# Load the pretrained BERT model used to compute the embeddings
model = transformers.BertModel.from_pretrained("bert-base-multilingual-uncased")

In [12]:
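%%time
batch_size = 100
embeddings = []
# Integer division drops a trailing partial batch; 1000 rows split evenly into 10 batches
for i in notebook.tqdm(range(padded.shape[0] // batch_size)):
    batch = torch.LongTensor(padded[batch_size*i:batch_size*(i+1)])
    attention_mask_batch = torch.LongTensor(attention_mask[batch_size*i:batch_size*(i+1)])

    # Inference only, so gradients are not needed
    with torch.no_grad():
        batch_embeddings = model(batch, attention_mask=attention_mask_batch)

    # Keep the vector of the first ([CLS]) token from the last hidden state
    embeddings.append(batch_embeddings[0][:,0,:].numpy())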

Wall time: 1min 6s
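
The loop above takes `padded.shape[0] // batch_size` batches, which covers every row only when the row count is a multiple of the batch size (as with 1000 and 100). One possible way to also cover a trailing partial batch is sketched below. `batch_slices` is a hypothetical helper, not something the notebook defines.

```python
def batch_slices(n_rows, batch_size):
    """Yield (start, stop) index pairs covering all rows, including a short last batch."""
    for start in range(0, n_rows, batch_size):
        yield start, min(start + batch_size, n_rows)


# 1000 rows with batch size 100 give ten full batches
print(len(list(batch_slices(1000, 100))))
# 250 rows with batch size 100 give two full batches and one of 50 rows
print(list(batch_slices(250, 100)))
```

Each `(start, stop)` pair could then be used to slice both `padded` and `attention_mask`, so no comment is left without an embedding.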


In [13]:
features = np.concatenate(embeddings)

In [14]:
features

array([[-0.09272899,  0.01173171,  0.01736867, ...,  0.00325953,
        -0.06922017,  0.01061466],
       [-0.09901971,  0.00113225,  0.05407063, ...,  0.05023095,
         0.02505776, -0.0203661 ],
       [-0.08754513, -0.08850002,  0.07842252, ..., -0.00720571,
         0.06203734, -0.10118032],
       ...,
       [-0.07662422, -0.05442185,  0.02123993, ..., -0.01313299,
         0.08040719, -0.04401834],
       [ 0.03850738, -0.05539434,  0.02868565, ..., -0.01562159,
        -0.12733944, -0.09147856],
       [-0.16951336, -0.03366038, -0.05229738, ...,  0.08415107,
        -0.03002633, -0.00089188]], dtype=float32)

## 2. Model training<a class="anchor" id="chapter2"></a>

In [39]:
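features = pd.DataFrame(features)
# Split features and target in a single call so that rows and labels stay aligned;
# two separate calls shuffle independently and pair embeddings with the wrong labels
features_train, features_test, target_train, target_test = train_test_split(
    features, data['toxic'], test_size=0.2)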

In [40]:
model = RandomForestClassifier(n_estimators = 40, max_depth=7)
model.fit(features_train, target_train)

In [41]:
predicted = model.predict(features_test)

In [42]:
predicted

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0], dtype=int64)

In [43]:
target_test

298    0
260    0
408    0
692    0
651    0
      ..
129    0
490    0
834    0
689    1
726    0
Name: toxic, Length: 200, dtype: int64

In [44]:
accuracy_score(target_test, predicted)

0.895

In [38]:
f1_score(target_test, predicted)

0.0
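
The two scores above are consistent with each other: a classifier that predicts only class 0 can reach high accuracy on imbalanced data while its F1 for the toxic class is zero. The sketch below reproduces this in plain Python. The label counts (179 non-toxic, 21 toxic) are inferred from the accuracy of 0.895 on 200 test rows, assuming every prediction is 0 as in the output shown earlier. The helper names are illustrative.

```python
def accuracy(y_true, y_pred):
    """Share of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def f1(y_true, y_pred, positive=1):
    """F1 for the positive class: harmonic mean of precision and recall."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    if tp == 0:
        # No true positives: treat F1 as 0 by convention
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Assumed test set: 179 non-toxic and 21 toxic comments, all predicted as non-toxic
y_true = [0] * 179 + [1] * 21
y_pred = [0] * 200
print(accuracy(y_true, y_pred))  # 0.895
print(f1(y_true, y_pred))        # 0.0
```

Accuracy therefore says little about how well toxic comments are detected here, which is why the task is stated in terms of F1.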