The online store "Wikishop" is launching a new service. Users can now edit and extend product descriptions, as in wiki communities: customers propose their own edits and comment on changes made by others. The store needs a tool that finds toxic comments and sends them for moderation.

Train a model to classify comments as positive or negative. You have a dataset in which edits are labeled for toxicity.

Build a model with an *F1* score of at least 0.75.
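For reference, *F1* is the harmonic mean of precision and recall. The sketch below computes it by hand for binary labels, assuming that 1 marks a toxic comment; the function name `f1_binary` and the sample labels are illustrative and not part of the project data.

```python
def f1_binary(y_true, y_pred):
    """Return the F1 score for binary labels, treating 1 as the positive (toxic) class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        # Without true positives the score is defined here as zero
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical labels, for illustration only
y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 1, 0]
score = f1_binary(y_true, y_pred)
print(round(score, 3))
```

In the notebook itself the same metric is requested from scikit-learn through `scoring='f1'`.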

### Project instructions

1. Load and prepare the data.
2. Train several different models.
3. Draw conclusions.

Using *BERT* is optional for this project, but you are welcome to try it.

### Data description

The data is in the file `toxic_comments.csv`. The *text* column contains the comment text, and *toxic* is the target.

# 1. Preparation

In [1]:
import pandas as pd
import numpy as np
from pymystem3 import Mystem
import xgboost as xgb
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
import torch
import tensorflow as tf
import transformers
import os
import re
from tqdm import notebook
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score


[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\ivan\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [2]:
# Locations of the BERT model folder and the source data
BERT_PATH = r'C:\Users\ivan\YandexDisk\DS\Project_11\multi_cased_L-12_H-768_A-12'
DATA_PATH = r'C:\Users\ivan\YandexDisk\DS\Project_11'

In [3]:
df_text = pd.read_csv(os.path.join(DATA_PATH, 'toxic_comments.csv'))

# Optionally keep only a small subset for a quick test run
#df_text = df_text[:300]
# Show the comments labeled as toxic
df_text[df_text['toxic'] != 0]

Unnamed: 0,text,toxic
6,COCKSUCKER BEFORE YOU PISS AROUND ON MY WORK,1
12,Hey... what is it..\n@ | talk .\nWhat is it......,1
16,"Bye! \n\nDon't look, come or think of comming ...",1
42,You are gay or antisemmitian? \n\nArchangel WH...,1
43,"FUCK YOUR FILTHY MOTHER IN THE ASS, DRY!",1
...,...,...
159494,"""\n\n our previous conversation \n\nyou fuckin...",1
159514,YOU ARE A MISCHIEVIOUS PUBIC HAIR,1
159541,Your absurd edits \n\nYour absurd edits on gre...,1
159546,"""\n\nHey listen don't you ever!!!! Delete my e...",1


In [4]:
df_text.head()

Unnamed: 0,text,toxic
0,Explanation\nWhy the edits made under my usern...,0
1,D'aww! He matches this background colour I'm s...,0
2,"Hey man, I'm really not trying to edit war. It...",0
3,"""\nMore\nI can't make any real suggestions on ...",0
4,"You, sir, are my hero. Any chance you remember...",0


In [5]:
print('Number of rows in the dataset', len(df_text))

Number of rows in the dataset 159571
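Besides the row count, the share of each class is worth checking, because *F1* is sensitive to class imbalance. A minimal sketch with hypothetical labels follows; the helper name `class_shares` is an assumption made for illustration.

```python
from collections import Counter


def class_shares(labels):
    """Return the fraction of rows belonging to each class label."""
    counts = Counter(labels)
    total = len(labels)
    return {label: count / total for label, count in counts.items()}


# Hypothetical labels, not taken from toxic_comments.csv
labels = [0, 0, 0, 0, 1, 0, 0, 0, 1, 0]
shares = class_shares(labels)
print(shares)
```

On the real DataFrame the equivalent check would be `df_text['toxic'].value_counts(normalize=True)`.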


In [6]:
# Inspect the text of a few rows
print(df_text['text'][0],'\n')
print(df_text['text'][2],'\n')
print(df_text['text'][10],'\n')

Explanation
Why the edits made under my username Hardcore Metallica Fan were reverted? They weren't vandalisms, just closure on some GAs after I voted at New York Dolls FAC. And please don't remove the template from the talk page since I'm retired now.89.205.38.27 

Hey man, I'm really not trying to edit war. It's just that this guy is constantly removing relevant information and talking to me through edits instead of my talk page. He seems to care more about the formatting than the actual info. 

"
Fair use rationale for Image:Wonju.jpg

Thanks for uploading Image:Wonju.jpg. I notice the image page specifies that the image is being used under fair use but there is no explanation or rationale as to why its use in Wikipedia articles constitutes fair use. In addition to the boilerplate fair use template, you must also write out on the image description page a specific explanation or rationale for why using this image in each article is consistent with fair use.

Please go to the image de

In [7]:
# Text lemmatization
def lemma(row):
    m = Mystem()
    return m.lemmatize(row)

In [8]:

text = df_text['text'][1]

In [9]:
re.sub('[^a-zA-Z]' ,' ' , text)

'D aww  He matches this background colour I m seemingly stuck with  Thanks    talk         January           UTC '
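The substitution above leaves runs of spaces where digits and punctuation were removed. A possible helper that also collapses whitespace is sketched below. The name `clean_text` and the lowercasing step are assumptions: lowercasing suits bag-of-words features, but may not be wanted when the text is fed to a cased BERT vocabulary.

```python
import re


def clean_text(text):
    """Keep Latin letters only, lowercase the text, and collapse repeated whitespace."""
    letters_only = re.sub(r'[^a-zA-Z]', ' ', text)
    return ' '.join(letters_only.lower().split())


sample = "D'aww! He matches this background colour"
cleaned = clean_text(sample)
print(cleaned)
```

Such a function could be applied to the whole column with `df_text['text'].apply(clean_text)`.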

In [4]:
# Load the tokenizer, pointing it at the vocabulary file
# (do_lower_case=False because the checkpoint is a cased model)

tokenizer = transformers.BertTokenizer(
    vocab_file=os.path.join(BERT_PATH, 'vocab.txt'), do_lower_case=False)
tokenized = df_text['text'].apply(
    lambda x: tokenizer.encode(x, add_special_tokens=True))

# Pad all token sequences to the same length
max_len = 100
for i in tokenized.values:
    if len(i) > max_len:
        max_len = len(i)

padded = np.array([i + [0]*(max_len - len(i)) for i in tokenized.values])

attention_mask = np.where(padded != 0, 1, 0)
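Padding every comment to the length of the longest one can produce a very wide matrix. An alternative is to truncate and pad in a single pass, sketched below with plain lists. The helper `pad_and_mask` and the token ids are hypothetical, and 0 is assumed to be the padding id, as in the cell above.

```python
def pad_and_mask(token_ids, max_len, pad_id=0):
    """Truncate each sequence to max_len, right-pad with pad_id, and build a 0/1 attention mask."""
    padded_rows, mask_rows = [], []
    for ids in token_ids:
        ids = ids[:max_len]              # truncation drops trailing tokens, including the final [SEP]
        n_pad = max_len - len(ids)
        padded_rows.append(ids + [pad_id] * n_pad)
        mask_rows.append([1] * len(ids) + [0] * n_pad)
    return padded_rows, mask_rows


# Hypothetical token ids, for illustration only
token_ids = [[101, 7, 8, 102], [101, 9, 102], [101, 5, 6, 7, 8, 102]]
padded_rows, mask_rows = pad_and_mask(token_ids, max_len=4)
print(padded_rows)
print(mask_rows)
```

Building the mask from the sequence lengths, rather than from `padded != 0`, keeps it correct even if a real token happens to have id 0.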

In [5]:
config = transformers.BertConfig.from_json_file(os.path.join(BERT_PATH, 'bert_config.json'))
model = transformers.BertForPreTraining.from_pretrained(
    os.path.join(BERT_PATH, 'bert_model.ckpt.index'), config=config, from_tf=True)

In [6]:
# Keep only the first 30 tokens of each comment to reduce the amount of computation
padded = padded[:, :30]
attention_mask = attention_mask[:, :30]
padded.shape

(159571, 30)

In [None]:
embeddings = []
batch_size=500
for i in notebook.tqdm(range(padded.shape[0] // batch_size)):
    batch = torch.LongTensor(padded[batch_size*i:batch_size*(i+1)])
    attention_mask_batch = torch.LongTensor(attention_mask[batch_size*i:batch_size*(i+1)])

    with torch.no_grad():
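        # Call the BERT encoder inside BertForPreTraining: its first output is the
        # last hidden state, whereas the pre-training head returns vocabulary logits
        batch_embeddings = model.bert(batch, attention_mask=attention_mask_batch)

    # Keep the embedding of the first ([CLS]) token of each comment
    embeddings.append(batch_embeddings[0][:,0,:].numpy())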


In [None]:
features = np.concatenate(embeddings)

In [None]:
features.shape
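With 159571 rows and `batch_size=500`, `range(padded.shape[0] // batch_size)` produces 319 full batches, so the last 71 rows get no embeddings. One way to cover them is to iterate over slice bounds and let the final batch be shorter. The generator name `batch_bounds` is an assumption made for this sketch.

```python
def batch_bounds(n_rows, batch_size):
    """Yield (start, stop) slice bounds that cover all rows, including a shorter final batch."""
    for start in range(0, n_rows, batch_size):
        yield start, min(start + batch_size, n_rows)


bounds = list(batch_bounds(159571, 500))
print(len(bounds), bounds[-1])
```

Inside the embedding loop each pair would be used as `padded[start:stop]` and `attention_mask[start:stop]`.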

# 2. Training

In [None]:
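# features may have fewer rows than df_text, because the embedding loop processes only full batches;
# trim the target to the rows that actually have embeddings
target = df_text['toxic'].values[:features.shape[0]]
X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.3, random_state=12)
X_train.shape, X_test.shape, y_train.shape, y_test.shape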

In [None]:
model = LogisticRegression(solver='liblinear', class_weight='balanced')

scores = cross_val_score(model, X_train, y_train, scoring='f1', cv=10)

In [None]:
scores
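The `class_weight='balanced'` option weights each class inversely to its frequency, using `n_samples / (n_classes * count)`. The sketch below reproduces that formula in plain Python; the helper name `balanced_class_weights` and the labels are illustrative.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Compute per-class weights as n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {label: n_samples / (n_classes * count) for label, count in counts.items()}


# Hypothetical labels: three non-toxic comments and one toxic comment
weights = balanced_class_weights([0, 0, 0, 1])
print(weights)
```

The rarer class receives the larger weight, so errors on toxic comments cost more during training.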

In [None]:
dtrain = xgb.DMatrix(X_train, label=y_train)


In [None]:
param = {'max_depth': 2, 'eta': 1, 'objective': 'binary:logistic'}
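With `'objective': 'binary:logistic'`, predictions from a trained booster are probabilities, while *F1* needs hard class labels. A minimal sketch of the conversion follows, assuming a threshold of 0.5; the helper `to_labels` and the sample probabilities are hypothetical.

```python
def to_labels(probabilities, threshold=0.5):
    """Map predicted probabilities to 0/1 labels; values at or above the threshold become 1."""
    return [1 if p >= threshold else 0 for p in probabilities]


# Hypothetical predicted probabilities, for illustration only
probs = [0.91, 0.12, 0.50, 0.49]
predicted = to_labels(probs)
print(predicted)
```

The threshold is a tunable choice: lowering it usually raises recall at the cost of precision.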


# 3. Conclusions

# Review checklist

- [x]  Jupyter Notebook opens
- [ ]  All code runs without errors
- [ ]  Code cells are arranged in execution order
- [ ]  Data is loaded and prepared
- [ ]  Models are trained
- [ ]  The *F1* score is at least 0.75
- [ ]  Conclusions are written