<h1>Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Подготовка" data-toc-modified-id="Подготовка-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Preparation</a></span></li><li><span><a href="#Обучение" data-toc-modified-id="Обучение-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Training</a></span></li><li><span><a href="#Выводы" data-toc-modified-id="Выводы-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Conclusions</a></span></li><li><span><a href="#Альтернативное-решиение-с-использованием-Bert" data-toc-modified-id="Альтернативное-решиение-с-использованием-Bert-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Alternative solution using BERT</a></span></li><li><span><a href="#Чек-лист-проверки" data-toc-modified-id="Чек-лист-проверки-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Review checklist</a></span></li></ul></div>

# Project for "Wikishop": BERT

The online store "Wikishop" is launching a new service. Users can now edit and extend product descriptions, as in wiki communities: customers propose their own edits and comment on changes made by others. The store needs a tool that finds toxic comments and sends them for moderation.

Train a model that classifies comments as positive or negative. You have a dataset in which edits are labelled for toxicity.

Build a model with an *F1* score of at least 0.75.
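For reference, *F1* is the harmonic mean of precision and recall for the positive (toxic) class. The sketch below computes it from confusion-matrix counts; the helper name `f1_from_counts` and the counts are hypothetical and are not taken from this project.

```python
def f1_from_counts(tp, fp, fn):
    """Return F1 for the positive (toxic) class from confusion-matrix counts."""
    if tp == 0:
        # No true positives: treat F1 as zero instead of dividing by zero.
        return 0.0
    precision = tp / (tp + fp)  # share of flagged comments that are toxic
    recall = tp / (tp + fn)     # share of toxic comments that were flagged
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts: 8 toxic comments caught, 2 false alarms, 2 missed.
score = f1_from_counts(tp=8, fp=2, fn=2)
print(round(score, 3))
```

Because toxic comments are usually the minority class, *F1* is more informative here than plain accuracy, which a model could inflate by always predicting "not toxic".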

**Project instructions**

1. Load and prepare the data.
2. Train several models.
3. Draw conclusions.

Using *BERT* is optional for this project, but you are welcome to try it.

**Data description**

The data is in the file `toxic_comments.csv`. The *text* column holds the comment text, and *toxic* is the target.

## Preparation

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

import warnings
import re
import nltk
import torch
import sys
import spacy

from sklearn.utils import resample, shuffle
from sklearn.metrics import f1_score

from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet

from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer 
from nltk.corpus import stopwords as nltk_stopwords

from catboost import CatBoostClassifier
from lightgbm import LGBMClassifier
from xgboost import XGBClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
import transformers as ppb

from sklearn.pipeline import Pipeline

from tqdm.notebook import tqdm
from wordcloud import WordCloud


warnings.filterwarnings("ignore")

## Alternative solution using BERT

In [2]:
try:
    df = pd.read_csv('/datasets/toxic_comments.csv')
except FileNotFoundError:
    # Fall back to a local copy when the hosted dataset path is missing
    df = pd.read_csv('/Users/aleksandrivanov/Downloads/toxic_comments.csv')
pd.set_option('display.max_columns', None)

In [3]:
# Drop the index column that was saved into the CSV
df = df.drop(columns=['Unnamed: 0'])

In [4]:
# Work with the first 1000 comments only
batch_1 = df[:1000]

In [5]:
#train, test = train_test_split(df[:1000], test_size=0.2, random_state=12345, stratify = df)

In [6]:
batch_1

Unnamed: 0,text,toxic
0,Explanation\nWhy the edits made under my usern...,0
1,D'aww! He matches this background colour I'm s...,0
2,"Hey man, I'm really not trying to edit war. It...",0
3,"""\nMore\nI can't make any real suggestions on ...",0
4,"You, sir, are my hero. Any chance you remember...",0
...,...,...
995,""" Hi, Writingrights, Welcome to Wikipedia! \n...",0
996,It is common knowledge that Karaims (but not K...,0
997,", 12 April 2006 (UTC)\nThen rewrite and expand...",0
998,"""I was trying to inject some humour (as eviden...",0


In [7]:
tokenizer = ppb.BertTokenizer.from_pretrained('unitary/toxic-bert')
model = ppb.BertModel.from_pretrained('unitary/toxic-bert')

Some weights of the model checkpoint at unitary/toxic-bert were not used when initializing BertModel: ['classifier.weight', 'classifier.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [8]:
# x[:512] caps each comment at 512 characters (not tokens) before encoding
tokenized = batch_1['text'].apply(lambda x: tokenizer.encode(x[:512], add_special_tokens=True))

In [9]:
tokenized

0      [101, 7526, 2339, 1996, 10086, 2015, 2081, 210...
1      [101, 1040, 1005, 22091, 2860, 999, 2002, 3503...
2      [101, 4931, 2158, 1010, 1045, 1005, 1049, 2428...
3      [101, 1000, 2062, 1045, 2064, 1005, 1056, 2191...
4      [101, 2017, 1010, 2909, 1010, 2024, 2026, 5394...
                             ...                        
995    [101, 1000, 7632, 1010, 3015, 15950, 2015, 101...
996    [101, 2009, 2003, 2691, 3716, 2008, 13173, 571...
997    [101, 1010, 2260, 2258, 2294, 1006, 11396, 100...
998    [101, 1000, 1045, 2001, 2667, 2000, 1999, 2061...
999    [101, 2023, 2516, 2323, 2417, 7442, 6593, 2000...
Name: text, Length: 1000, dtype: object
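The slice `x[:512]` in the tokenization cell limits characters, while BERT's 512 limit applies to tokens. The sketch below shows the difference with a hypothetical whitespace tokenizer (`toy_tokenize` is a stand-in, not the real WordPiece tokenizer): a character cap usually leaves far fewer tokens than a token cap of the same size.

```python
def toy_tokenize(text):
    """Hypothetical stand-in for a real tokenizer: split on whitespace."""
    return text.split()


def truncate_by_chars(text, max_chars):
    # Cut the raw string first, then tokenize what is left.
    return toy_tokenize(text[:max_chars])


def truncate_by_tokens(text, max_tokens):
    # Tokenize the full string, then keep the first max_tokens tokens.
    return toy_tokenize(text)[:max_tokens]


text = "word " * 100  # 100 tokens, 500 characters
by_chars = truncate_by_chars(text, 50)
by_tokens = truncate_by_tokens(text, 50)
print(len(by_chars), len(by_tokens))
```

In the notebook the character cap keeps every sequence well under the model limit, at the cost of discarding the tail of long comments earlier than a token-level cap would.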

In [10]:
# Length of the longest tokenized comment
max_len = max(len(i) for i in tokenized.values)

# Right-pad every sequence with zeros up to max_len
padded = np.array([i + [0] * (max_len - len(i)) for i in tokenized.values])

In [11]:
attention_mask = np.where(padded != 0, 1, 0)
attention_mask.shape

(1000, 230)
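The padding and attention-mask steps above can be sketched on a toy batch with plain lists. The helper `pad_and_mask` and the token ids are illustrative assumptions. The mask here is built from sequence lengths, which gives the same result as `padded != 0` as long as the id 0 is used only for padding.

```python
def pad_and_mask(sequences, pad_id=0):
    """Right-pad token-id lists to equal length and build a 0/1 mask."""
    max_len = max(len(seq) for seq in sequences)
    padded = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    # 1 marks a real token, 0 marks padding the model should ignore.
    mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences]
    return padded, mask


# Toy token ids; 101 and 102 play the role of [CLS] and [SEP].
toy_ids = [[101, 7526, 102], [101, 102], [101, 2017, 1010, 2909, 102]]
padded_toy, mask_toy = pad_and_mask(toy_ids)
for row, row_mask in zip(padded_toy, mask_toy):
    print(row, row_mask)
```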

In [12]:
%%time
input_ids = torch.tensor(padded)
attention_mask = torch.tensor(attention_mask)

# Inference only: disable gradient tracking to save memory and time
with torch.no_grad():
    last_hidden_states = model(input_ids, attention_mask=attention_mask)

CPU times: user 38min 1s, sys: 3min 11s, total: 41min 12s
Wall time: 6min 15s
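The cell above sends all 1000 rows through the model in a single call. Assuming memory becomes a constraint on a larger subset, one common option is to process the rows in chunks. The helper `iter_batches` below is hypothetical and only illustrates the slicing logic on a plain list.

```python
def iter_batches(items, batch_size):
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


rows = list(range(10))
batches = list(iter_batches(rows, batch_size=4))
print(batches)
```

With tensors, the same slices could be applied to `input_ids` and `attention_mask`, with the per-batch outputs concatenated afterwards.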


In [13]:
# Take the hidden state of the first token ([CLS]) of every comment as its feature vector
features = last_hidden_states[0][:, 0, :].numpy()
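The indexing `[:, 0, :]` keeps, for every comment, only the vector at position 0, which is the `[CLS]` token added by `add_special_tokens=True`. A minimal sketch with nested lists and made-up numbers (batch of 2, sequence length 3, hidden size 2):

```python
# Toy "hidden states" with shape (batch=2, seq_len=3, hidden=2).
hidden = [
    [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]],
    [[1.1, 1.2], [1.3, 1.4], [1.5, 1.6]],
]

# Equivalent of hidden[:, 0, :] for nested lists:
# the first token's vector from each sequence.
cls_features = [sequence[0] for sequence in hidden]
print(cls_features)
```

In the notebook the result has one row per comment, and the row width equals the model's hidden size.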

In [14]:
labels = batch_1['toxic']

In [15]:
train_features, test_features, train_labels, test_labels = train_test_split(
    features, labels, test_size=0.2, random_state=12345, stratify=labels
)
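Passing `stratify=labels` keeps the share of toxic comments roughly equal in the train and test parts. The idea can be sketched in plain Python. The function `stratified_split` is a simplified stand-in written for illustration, not scikit-learn's implementation, and the label counts are made up.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_size, seed):
    """Split indices so each class keeps about the same share in both parts."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Made-up imbalanced labels: 90 non-toxic, 10 toxic.
toy_labels = [0] * 90 + [1] * 10
train_idx, test_idx = stratified_split(toy_labels, test_size=0.2, seed=12345)
test_toxic_share = sum(toy_labels[i] for i in test_idx) / len(test_idx)
print(len(train_idx), len(test_idx), test_toxic_share)
```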

In [16]:
lr_clf = LogisticRegression()
#lr_clf.fit(train_features, train_labels)

In [17]:
parameters = {'C': np.linspace(0.0001, 100, 20)}
grid_search = GridSearchCV(
    LogisticRegression(max_iter=1000, solver='lbfgs'),
    parameters, scoring='f1', n_jobs=-1,
)
grid_search.fit(train_features, train_labels)

print('best parameters: ', grid_search.best_params_)
print('best score: ', grid_search.best_score_)

best parameters:  {'C': 5.263252631578947}
best score:  0.9040896358543418
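The selected `C` of about 5.26 is the second point of the linear grid, so the search compared only one value below it (0.0001). The sketch below reproduces the grid with a small pure-Python `linspace` helper (a stand-in for `np.linspace`) and, as an untested alternative rather than something done in this notebook, builds a log-spaced grid that covers small values more evenly.

```python
def linspace(start, stop, num):
    """Return num evenly spaced values from start to stop inclusive."""
    step = (stop - start) / (num - 1)
    return [start + i * step for i in range(num)]


c_grid = linspace(0.0001, 100, 20)
print(c_grid[:3])

# Only one grid point lies below C = 1 on the linear grid.
below_one = sum(1 for c in c_grid if c < 1)

# Assumed alternative: powers of ten from 1e-4 to 1e2.
log_grid = [10.0 ** exponent for exponent in range(-4, 3)]
print(below_one, log_grid)
```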


In [18]:
predictions = grid_search.predict(test_features)

In [19]:
print(f1_score(test_labels, predictions))

0.888888888888889


Using the pretrained model:
tokenizer = ppb.BertTokenizer.from_pretrained('unitary/toxic-bert')
model = ppb.BertModel.from_pretrained('unitary/toxic-bert')

F1 rose from 0.68 to 0.90 in cross-validation on the training split and reached 0.89 on the test split (measured on the first 1000 comments only).