# Project for "Wikishop"

The online store Wikishop is launching a new service. Users can now edit and extend product descriptions, as in wiki communities: customers propose their own edits and comment on changes made by others. The store needs a tool that finds toxic comments and sends them for moderation.

The task is to train a model that classifies comments as positive or negative. A labeled dataset marking the toxicity of edits is provided for training. The model's F1 score must be at least 0.75.
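
For reference, F1 is the harmonic mean of precision and recall. The sketch below computes it from raw counts; the function name `f1_from_counts` and the counts themselves are illustrative assumptions, not values from the dataset.

```python
def f1_from_counts(tp: int, fp: int, fn: int) -> float:
    """Return F1, the harmonic mean of precision and recall."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts: 30 toxic comments caught, 10 false alarms, 10 missed
example_f1 = f1_from_counts(tp=30, fp=10, fn=10)
print(round(example_f1, 2))  # 0.75
```

With these made-up counts both precision and recall equal 0.75, so the result sits exactly at the required threshold.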

### Contents

* [1. Data preparation](#chapter1)
* [2. Model training](#chapter2)

## 1. Data preparation <a class="anchor" id="chapter1"></a>

In [1]:
pip install torch

Defaulting to user installation because normal site-packages is not writeable
Note: you may need to restart the kernel to use updated packages.


In [2]:
pip install transformers

Defaulting to user installation because normal site-packages is not writeable
Note: you may need to restart the kernel to use updated packages.


In [37]:
from time import time

import numpy as np
import pandas as pd
import torch
import transformers
import lightgbm as lgb
from tqdm import notebook
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, f1_score

In [4]:
data = pd.read_csv('toxic_comments.csv')

In [5]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 159292 entries, 0 to 159291
Data columns (total 3 columns):
 #   Column      Non-Null Count   Dtype 
---  ------      --------------   ----- 
 0   Unnamed: 0  159292 non-null  int64 
 1   text        159292 non-null  object
 2   toxic       159292 non-null  int64 
dtypes: int64(2), object(1)
memory usage: 3.6+ MB


In [6]:
data.head()

|   | Unnamed: 0 | text | toxic |
|---|---|---|---|
| 0 | 0 | Explanation\nWhy the edits made under my usern... | 0 |
| 1 | 1 | D'aww! He matches this background colour I'm s... | 0 |
| 2 | 2 | Hey man, I'm really not trying to edit war. It... | 0 |
| 3 | 3 | "\nMore\nI can't make any real suggestions on ... | 0 |
| 4 | 4 | You, sir, are my hero. Any chance you remember... | 0 |


The *Unnamed: 0* column duplicates the row numbers, so we drop it.

In [7]:
data = data.drop(['Unnamed: 0'], axis=1)

Check the data for duplicates:

In [8]:
data.duplicated().sum()

0

There are no duplicates.

The data is loaded.  
Now let's prepare it.

In [9]:
data = data.sample(5000).reset_index(drop=True)
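
The sample above is drawn without a fixed seed, so each run of the notebook selects different rows and the scores further down will vary. If reproducibility is wanted, one option is the `random_state` argument of `DataFrame.sample`. This is a minimal sketch on a made-up frame; the seed value 42 is an arbitrary assumption.

```python
import pandas as pd

# Toy frame standing in for the full comments table (made-up rows)
toy = pd.DataFrame({
    "text": [f"comment {i}" for i in range(100)],
    "toxic": [1 if i % 10 == 0 else 0 for i in range(100)],
})

# random_state fixes the seed, so the same rows are drawn on every run
sample_a = toy.sample(20, random_state=42).reset_index(drop=True)
sample_b = toy.sample(20, random_state=42).reset_index(drop=True)
print(sample_a.equals(sample_b))  # True
```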

In [10]:
# Tokenize every comment, truncating to at most 64 tokens
tokenizer = transformers.BertTokenizer.from_pretrained("bert-base-multilingual-uncased")
tokenizer.model_max_length = 64
tokenized = data['text'].apply(
    lambda x: tokenizer.encode(x, add_special_tokens=True, truncation=True))

# Length of the longest tokenized comment
max_len = max(len(i) for i in tokenized.values)

# Right-pad shorter sequences with zeros so all rows have equal length
padded = np.array([i + [0]*(max_len - len(i)) for i in tokenized.values])

# 1 for real tokens, 0 for padding
attention_mask = np.where(padded != 0, 1, 0)

In [11]:
attention_mask.shape

(5000, 64)
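
To make the padding and masking step concrete, here is the same logic on three short, made-up token lists. The ids are illustrative and are not real entries of the BERT vocabulary.

```python
import numpy as np

# Made-up token id lists of different lengths
toy_tokenized = [[101, 7, 8, 102], [101, 9, 102], [101, 5, 6, 4, 3, 102]]

toy_max_len = max(len(ids) for ids in toy_tokenized)
# Right-pad every sequence with 0 up to the longest length
toy_padded = np.array([ids + [0] * (toy_max_len - len(ids)) for ids in toy_tokenized])
# 1 marks a real token, 0 marks padding that the model should ignore
toy_mask = np.where(toy_padded != 0, 1, 0)

print(toy_padded.shape)  # (3, 6)
print(toy_mask[1])       # [1 1 1 0 0 0]
```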

In [12]:
#config = transformers.BertConfig.from_json_file('bert_config.json')
model = transformers.BertModel.from_pretrained("bert-base-multilingual-uncased")

In [13]:
%%time
batch_size = 500
embeddings = []
for i in notebook.tqdm(range(padded.shape[0] // batch_size)):
    batch = torch.LongTensor(padded[batch_size*i:batch_size*(i+1)])
    attention_mask_batch = torch.LongTensor(attention_mask[batch_size*i:batch_size*(i+1)])

    # Inference only, so gradients are not needed
    with torch.no_grad():
        batch_embeddings = model(batch, attention_mask=attention_mask_batch)

    # Keep the hidden state of the first ([CLS]) token as the comment embedding
    embeddings.append(batch_embeddings[0][:,0,:].numpy())

Wall time: 5min 10s
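
The loop above iterates `padded.shape[0] // batch_size` times, which covers every row only when the row count is a multiple of the batch size (true for 5000 and 500). A possible generalization that also handles a shorter last batch is sketched below. `embed_in_batches` and `fake_embed` are hypothetical names, and `fake_embed` merely stands in for the BERT forward pass.

```python
import math

import numpy as np


def embed_in_batches(rows, batch_size, embed_batch):
    """Apply embed_batch to consecutive slices of rows, including a short last batch."""
    n_batches = math.ceil(rows.shape[0] / batch_size)
    chunks = []
    for i in range(n_batches):
        chunk = rows[batch_size * i: batch_size * (i + 1)]
        chunks.append(embed_batch(chunk))
    return np.concatenate(chunks)


# Hypothetical stand-in for the model call: one 3-dimensional vector per row
def fake_embed(batch):
    return np.ones((batch.shape[0], 3), dtype=np.float32)


toy_rows = np.zeros((1050, 64), dtype=np.int64)
toy_features = embed_in_batches(toy_rows, batch_size=500, embed_batch=fake_embed)
print(toy_features.shape)  # (1050, 3)
```

With 1050 rows and a batch size of 500 the helper runs three batches of 500, 500 and 50 rows, so no comment is dropped.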


In [14]:
features = np.concatenate(embeddings)

In [15]:
features

array([[-1.3215357e-01, -4.4131625e-02, -4.9574725e-02, ...,
         9.8975420e-02, -7.5013965e-02,  2.0131052e-02],
       [-1.4217380e-01, -5.6921348e-02, -7.9326332e-05, ...,
         3.2128420e-02,  3.6965564e-02, -3.4458011e-02],
       [-2.4589820e-02,  9.5604109e-03,  7.2195627e-02, ...,
         2.5419258e-03, -7.4695677e-02, -7.2037958e-02],
       ...,
       [-9.0382516e-02, -7.3958501e-02,  7.3107824e-02, ...,
         1.1583552e-02, -5.6097284e-02, -7.5329505e-02],
       [-7.4199885e-02, -2.8242251e-02,  2.1570886e-02, ...,
         1.7973460e-02, -3.3992678e-02, -1.0720686e-01],
       [-1.2054427e-01, -3.6383942e-02,  5.5085421e-03, ...,
         3.2420177e-02, -2.1999717e-02, -4.8121870e-02]], dtype=float32)

## 2. Model training<a class="anchor" id="chapter2"></a>

In [38]:
features = pd.DataFrame(features)
features_train, features_test, target_train, target_test = train_test_split(features, data['toxic'])

In [39]:
model = LogisticRegression(class_weight='balanced')
model.fit(features_train, target_train)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
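
The warning above means the solver hit its iteration limit before converging. Following the two remedies the message itself suggests, a minimal sketch might scale the features and raise `max_iter`. The synthetic data from `make_classification` and the value `max_iter=1000` are assumptions for illustration only; the sketch does not reproduce the notebook's embeddings or scores.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced data standing in for the BERT embeddings
X_toy, y_toy = make_classification(
    n_samples=500, n_features=20, weights=[0.9, 0.1], random_state=0)

# Scale the features, then give the solver a larger iteration budget
clf = make_pipeline(
    StandardScaler(),
    LogisticRegression(class_weight="balanced", max_iter=1000),
)
clf.fit(X_toy, y_toy)
print(clf.predict(X_toy[:5]).shape)  # (5,)
```

Wrapping the scaler and the classifier in one pipeline keeps the scaling parameters fitted on the training data only when the pipeline is later applied to a test set.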


In [40]:
predicted = model.predict(features_test)

In [41]:
predicted

array([0, 0, 0, ..., 0, 0, 0], dtype=int64)

In [42]:
target_test

2175    0
66      0
1125    0
489     0
1764    1
       ..
321     0
1373    0
1784    0
3503    0
54      0
Name: toxic, Length: 1250, dtype: int64

In [43]:
accuracy_score(target_test, predicted)

0.8856

In [44]:
f1_score(target_test, predicted)

0.5653495440729484
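
The F1 of about 0.57 is below the required 0.75 even though accuracy is close to 0.89. Such a gap is common when one class dominates. The toy example below, built on made-up labels with an assumed 10% share of toxic comments, shows a classifier that never flags anything and still reaches high accuracy.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# Made-up labels: 10 toxic comments out of 100
y_true = np.array([1] * 10 + [0] * 90)
# A useless classifier that labels everything as non-toxic
y_all_clean = np.zeros_like(y_true)

baseline_accuracy = accuracy_score(y_true, y_all_clean)
# zero_division=0 returns 0 instead of warning when nothing is predicted positive
baseline_f1 = f1_score(y_true, y_all_clean, zero_division=0)
print(baseline_accuracy, baseline_f1)  # 0.9 0.0
```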

Let's try a gradient boosting model.

In [None]:
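features_train, features_test, target_train, target_test = train_test_split(features, data['toxic'])
# Hold out part of the training data as a validation set for early stopping
features_train, features_valid, target_train, target_valid = train_test_split(features_train, target_train)

# LightGBM expects feature names as strings, while the DataFrame columns are integers
feature_names = [str(c) for c in features_train.columns]
train_dataset = lgb.Dataset(features_train, target_train, feature_name=feature_names)
valid_dataset = lgb.Dataset(features_valid, target_valid, feature_name=feature_names, reference=train_dataset)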

In [None]:
for lrate in [0.1, 0.5, 0.9]:
    for num_iters in [50, 100, 200]:
        time_start = time()

        # Note: a regression objective scored by RMSE, not a classifier scored by F1
        params = {"objective": "regression",
                  "metric": "rmse",
                  "learning_rate": lrate,
                  "verbose": -100}

        bst = lgb.train(params, train_set=train_dataset, valid_sets=[valid_dataset],
                        num_boost_round=num_iters,
                        callbacks=[lgb.early_stopping(stopping_rounds=5, verbose=False)])

        time_finish = time()
        time_final = time_finish - time_start
        print(f"\nLearning_rate {lrate}, Num_iterations {num_iters}, training time {time_final:.2f} s")
        print('RMSE', bst.best_score["valid_0"]['rmse'])
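
Because the booster above is trained with a regression objective, its predictions are continuous scores, and the RMSE it reports cannot be compared directly with the F1 target. One way to bridge the gap is to threshold the scores and then compute F1. In this sketch `toy_scores` is a made-up array standing in for the output of `bst.predict(features_test)`, and the 0.5 cut-off is an assumption, not a tuned value.

```python
import numpy as np
from sklearn.metrics import f1_score

# Hypothetical continuous scores and the matching true labels
toy_scores = np.array([0.05, 0.80, 0.40, 0.65, 0.10, 0.55])
toy_target = np.array([0, 1, 0, 1, 0, 0])

# Turn scores into class labels with an assumed 0.5 cut-off
toy_labels = (toy_scores >= 0.5).astype(int)
toy_f1 = f1_score(toy_target, toy_labels)
print(toy_labels, round(float(toy_f1), 2))  # [0 1 0 1 0 1] 0.8
```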