<h1>Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Подготовка" data-toc-modified-id="Подготовка-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Preparation</a></span></li><li><span><a href="#Обучение" data-toc-modified-id="Обучение-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Training</a></span><ul class="toc-item"><li><span><a href="#Логистическая-регрессия" data-toc-modified-id="Логистическая-регрессия-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span>Logistic regression</a></span></li><li><span><a href="#Случайный-лес" data-toc-modified-id="Случайный-лес-2.2"><span class="toc-item-num">2.2&nbsp;&nbsp;</span>Random forest</a></span></li><li><span><a href="#CatBoost" data-toc-modified-id="CatBoost-2.3"><span class="toc-item-num">2.3&nbsp;&nbsp;</span>CatBoost</a></span></li></ul></li><li><span><a href="#Выводы" data-toc-modified-id="Выводы-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Conclusions</a></span></li><li><span><a href="#Чек-лист-проверки" data-toc-modified-id="Чек-лист-проверки-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Review checklist</a></span></li></ul></div>

# Project for Wikishop

The online store Wikishop is launching a new service. Users can now edit and extend product descriptions, as in wiki communities: customers propose their own edits and comment on changes made by others. The store needs a tool that finds toxic comments and sends them for moderation.

The task is to build a model with an *F1* score of at least 0.75.

**Project instructions**

1. Load and prepare the data.
2. Train several models.
3. Draw conclusions.


**Data description**

The data is in the file `toxic_comments.csv`. The *text* column contains the comment text, and *toxic* is the target.

## Preparation

First, we import the libraries needed for this project:

In [1]:
import pandas as pd
import numpy as np
import torch
import re

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split, RandomizedSearchCV, GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score, confusion_matrix

from catboost import CatBoostClassifier

import nltk
from nltk.corpus import stopwords as nltk_stopwords
nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer

pd.options.display.float_format = '{:,.2f}'.format
pd.options.mode.chained_assignment = None

[nltk_data] Downloading package wordnet to /home/jovyan/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [2]:
import spacy

We load the dataset and take a look at it:

In [3]:
df = pd.read_csv('/datasets/toxic_comments.csv')

In [4]:
df

Unnamed: 0.1,Unnamed: 0,text,toxic
0,0,Explanation\nWhy the edits made under my usern...,0
1,1,D'aww! He matches this background colour I'm s...,0
2,2,"Hey man, I'm really not trying to edit war. It...",0
3,3,"""\nMore\nI can't make any real suggestions on ...",0
4,4,"You, sir, are my hero. Any chance you remember...",0
...,...,...,...
159287,159446,""":::::And for the second time of asking, when ...",0
159288,159447,You should be ashamed of yourself \n\nThat is ...,0
159289,159448,"Spitzer \n\nUmm, theres no actual article for ...",0
159290,159449,And it looks like it was actually you who put ...,0


In [5]:
df['toxic'].value_counts()

0    143106
1     16186
Name: toxic, dtype: int64

The `toxic` column takes two values, but class 0 is roughly nine times more frequent than class 1 (143,106 vs. 16,186 rows), so we balance the dataset to keep the model from favoring the majority class. At this stage we split off the held-out data and then balance the training part.

In [6]:
train, test_balanced = train_test_split(df, test_size = 0.1, random_state=55)

In [7]:
train['toxic'].value_counts()

0    128785
1     14577
Name: toxic, dtype: int64
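The split above is not stratified, so the share of toxic rows in each part depends on the random seed. As a hedged alternative (not what this notebook does), the sketch below shows the `stratify` parameter of `train_test_split` on made-up labels with a 1:9 class ratio; the arrays `labels` and `indices` are illustrative only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy labels with the same kind of imbalance: 100 positives, 900 negatives
labels = np.array([1] * 100 + [0] * 900)
indices = np.arange(len(labels))

# stratify=labels keeps the class proportions equal in both parts
train_idx, test_idx = train_test_split(
    indices, test_size=0.1, random_state=55, stratify=labels
)

print(labels[train_idx].mean(), labels[test_idx].mean())
```

With stratification, both parts keep the 10% share of positives regardless of the seed.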

In [8]:
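# Build the boolean masks from train itself rather than from df
toxic = train.loc[train['toxic'] == 1]
df_zero = train.loc[train['toxic'] == 0].iloc[:21865]
# DataFrame.append was removed in pandas 2.0; pd.concat is the equivalent
df_balanced = pd.concat([toxic, df_zero])
# Display a shuffled view; the result is not assigned, so df_balanced keeps its order
df_balanced.sample(frac=1).reset_index(drop=True)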

Unnamed: 0.1,Unnamed: 0,text,toxic
0,3370,Fraggle81 ==\nI can just make new accounts eve...,1
1,37745,Deletion review\nI have posted your comment he...,0
2,106029,"It has ceased to be active, but it is still a ...",0
3,41056,WOW! I cannot believe that you found a link ex...,0
4,28314,Thanks for fixing the hovercraft definition,0
...,...,...,...
36437,58313,"""Wikipedia's article on Sanchez is a sick joke...",1
36438,5739,"""\n\n The less offensive term of equal clarity...",0
36439,108542,Get out of America. We don't want another supr...,1
36440,46573,"""\nI'm sorry if my post read as an attack. I w...",0


In [9]:
df_balanced['toxic'].value_counts()

0    21865
1    14577
Name: toxic, dtype: int64
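The downsampling step that produced these counts can be wrapped in a small helper. This is a minimal sketch under stated assumptions: the function name `downsample_majority` and its `ratio` parameter are hypothetical, and it draws the majority rows at random, whereas the cell above takes the first 21,865 rows. A `ratio` of 1.5 reproduces the proportion shown above (14,577 toxic rows and 21,865 non-toxic rows).

```python
import pandas as pd


def downsample_majority(frame, target_col, ratio=1.5, random_state=55):
    """Keep all minority rows and about `ratio` times as many majority rows."""
    counts = frame[target_col].value_counts()
    minority_label = counts.idxmin()
    minority = frame[frame[target_col] == minority_label]
    majority = frame[frame[target_col] != minority_label]
    n_majority = min(len(majority), int(len(minority) * ratio))
    sampled = majority.sample(n=n_majority, random_state=random_state)
    # Concatenate, shuffle, and reset the index so the classes are interleaved
    combined = pd.concat([minority, sampled])
    return combined.sample(frac=1, random_state=random_state).reset_index(drop=True)


# Toy frame: 10 toxic rows and 90 non-toxic rows
toy = pd.DataFrame({
    'text': [f'comment {i}' for i in range(100)],
    'toxic': [1] * 10 + [0] * 90,
})
balanced = downsample_majority(toy, 'toxic')
print(balanced['toxic'].value_counts())
```

On the toy frame this keeps all 10 toxic rows and 15 of the 90 non-toxic ones.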

In [10]:
test_balanced['toxic'].value_counts()

0    14321
1     1609
Name: toxic, dtype: int64

The final training dataset contains 14,577 rows labeled 1 and 21,865 rows labeled 0, a ratio of about 1:1.5, which is far closer to balanced than the original ratio of roughly 1:9. Next, we process the text: clean it, then tokenize and lemmatize it.

In [11]:
# Replace every non-letter character with a space and collapse repeated whitespace
def clear_text(text):
    t = re.sub(r'[^a-zA-Z]', ' ', text)
    clear = ' '.join(t.split())
    return clear
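To see what this cleaning step does to a comment, here is a small self-contained sketch. The name `clean_text` is chosen for this illustration; it applies the same regular expression and also collapses repeated spaces. The input string combines two fragments from the sample rows shown further down.

```python
import re


def clean_text(text):
    # Replace anything that is not a Latin letter with a space
    letters_only = re.sub(r'[^a-zA-Z]', ' ', text)
    # Collapse runs of whitespace into single spaces and trim the ends
    return ' '.join(letters_only.split())


print(clean_text("c'mon, speak up!!! 1:35, 23 february 2009 (utc)"))
# prints: c mon speak up february utc
```

Digits and punctuation disappear entirely, and contractions such as `c'mon` are split into separate tokens.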

In [12]:
nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])

In [13]:
def lemmatizer_nlp(text):
    doc = nlp(text)
    return ' '.join([token.lemma_ for token in doc])

In [14]:
%%time
# clean the text first, then lemmatize
df_balanced['text'] = df_balanced['text'].str.lower()
df_balanced['lemma_text'] = df_balanced['text'].apply(clear_text)
df_balanced['lemma_text'] = df_balanced['lemma_text'].apply(lemmatizer_nlp)

CPU times: user 3min 25s, sys: 4.05 s, total: 3min 29s
Wall time: 3min 29s


In [15]:
%%time
test_balanced['text'] = test_balanced['text'].str.lower()
test_balanced['lemma_text'] = test_balanced['text'].apply(clear_text)
test_balanced['lemma_text'] = test_balanced['lemma_text'].apply(lemmatizer_nlp)

CPU times: user 1min 31s, sys: 1.4 s, total: 1min 32s
Wall time: 1min 32s


In [16]:
df_balanced.head(10)

Unnamed: 0.1,Unnamed: 0,text,toxic,lemma_text
116601,116700,"""\n\n the lgbt barnstar awarded because your ...",1,the lgbt barnstar award because your sir...
98912,99008,"c'mon bitch, speak up.",1,c mon bitch speak up
13986,14001,fuck wiki\n\nfuck this piece of shit called wi...,1,fuck wiki fuck this piece of shit call wikip...
20804,20824,dumbhead \n\nyou are a freaky dumbhead!!!!!!!!...,1,dumbhead you be a freaky dumbhead ...
54663,54724,"""\n\n where is your """"wiki-page""""? \n\nyo joe,...",1,where be your wiki page yo joe ...
140570,140722,"oh fuck you douchebag, it's just wikipedia, se...",1,oh fuck you douchebag it s just wikipedia ...
45786,45839,", who blantently privledge shit over quality",1,who blantently privledge shit over quality
136640,136778,"your all a bunch of idiots, rename the page ha...",1,you all a bunch of idiot rename the page hav...
51744,51801,don't you know it's bad to make pizza with hum...,1,don t you know it s bad to make pizza with hum...
141596,141749,what facts you fucking idiot. the facts from t...,1,what fact you fucking idiot the fact from tu...


In [17]:
test_balanced.head(10)

Unnamed: 0.1,Unnamed: 0,text,toxic,lemma_text
113548,113646,""" — preceding unsigned comment added by (talk...",0,precede unsigned comment add by talk ...
90630,90721,"1:35, 23 february 2009 (utc)",0,february utc
31333,31373,"""\n\n google hits must not be taken as a yards...",0,google hit must not be take as a yardstic...
52683,52740,if by original you mean chaucer's the nun's pr...,0,if by original you mean chaucer s the nun s pr...
82021,82097,"""\n\n request de-adminship of user: owenx \n\...",0,request de adminship of user owenx ...
17991,18008,"hey bro, i made this image for the movie -> ht...",0,hey bro I make this image for the movie ...
93030,93122,red dwarf christmas special \n\ni've announced...,0,red dwarf christmas special I ve announce t...
140840,140992,i got around to making the changes. they may b...,0,I get around to make the change they may be ...
108923,109020,thank you for your reply. i'm not sure what y...,0,thank you for your reply I m not sure what ...
1856,1856,my hog is size 10 soft 15 with a nice chubb chubb,0,my hog be size soft with a nice chubb ...


In [18]:
valid, test = train_test_split(test_balanced, test_size = 0.5, random_state=55)

In [19]:
valid_features = valid
test_features = test
valid_target = valid['toxic']
test_target = test['toxic']
print(valid_features.shape)
print(test_features.shape)

(7965, 4)
(7965, 4)


In [20]:
corpus_balanced = df_balanced['lemma_text'].values
corpus_valid = valid_features['lemma_text'].values
corpus_test = test_features['lemma_text'].values

After processing, we have a corpus. We now convert the text to vectors and move on to training the models.

In [21]:
# tf idf 
nltk.download('stopwords')
stopwords = set(nltk_stopwords.words('english'))
count_tf_idf = TfidfVectorizer(stop_words=stopwords)

[nltk_data] Downloading package stopwords to /home/jovyan/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [22]:
tf_idf_balanced = count_tf_idf.fit_transform(corpus_balanced)

In [23]:
valid_features = count_tf_idf.transform(corpus_valid)
test_features = count_tf_idf.transform(corpus_test)

In [24]:
tf_idf_balanced.shape

(36442, 58671)

In [25]:
valid_features.shape

(7965, 58671)

In [26]:
test_features.shape

(7965, 58671)
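All three matrices have the same 58,671 columns because the vectorizer is fitted on the training corpus only and then reused for the validation and test corpora. The sketch below illustrates this on a toy corpus. It uses scikit-learn's built-in `'english'` stop-word list instead of the NLTK set to stay self-contained, and the sentences are made up.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

train_corpus = [
    'you be a great editor',
    'this page be a sick joke',
    'thank you for the fix',
]
valid_corpus = ['what a great joke', 'unseen word here']

vectorizer = TfidfVectorizer(stop_words='english')
# fit_transform learns the vocabulary and IDF weights from the training texts
train_matrix = vectorizer.fit_transform(train_corpus)
# transform reuses that vocabulary; words never seen in training are ignored
valid_matrix = vectorizer.transform(valid_corpus)

print(train_matrix.shape, valid_matrix.shape)
print(sorted(vectorizer.vocabulary_))
```

The second validation sentence shares no words with the training corpus, so its row in `valid_matrix` is all zeros.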

## Training

In this project we try three classification algorithms: logistic regression, random forest, and CatBoost. First, we define the training features and target.

In [27]:
train_features = tf_idf_balanced
train_target = df_balanced['toxic']

print(train_features.shape)
print(train_target.shape)

(36442, 58671)
(36442,)


The training data is ready, so we move on to training.

### Logistic regression

In [None]:
model = LogisticRegression(penalty='l2', 
                           # solver='liblinear', 
                           C = 10, 
                           random_state=55, 
                           max_iter=1000, 
                           class_weight= {0 :5, 1: 1}, 
                           fit_intercept=True
                          )
logistic = model.fit(train_features, train_target)
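The `class_weight={0: 5, 1: 1}` setting makes errors on class 0 cost five times as much as errors on class 1, which generally pushes the model toward predicting class 1 less often. The sketch below illustrates the effect on synthetic data from `make_classification`. The dataset and its parameters are assumptions for illustration, not the notebook's data.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic, noisy two-class data (illustrative only)
X, y = make_classification(n_samples=500, n_features=10, class_sep=0.5,
                           flip_y=0.1, random_state=55)

plain = LogisticRegression(max_iter=1000).fit(X, y)
weighted = LogisticRegression(max_iter=1000,
                              class_weight={0: 5, 1: 1}).fit(X, y)

# Count how many samples each model assigns to class 1
plain_positives = int(plain.predict(X).sum())
weighted_positives = int(weighted.predict(X).sum())
print(plain_positives, weighted_positives)
```

The weighted model should flag fewer samples as class 1, trading recall on the positive class for fewer false positives.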

In [29]:
logistic_predictions_balanced = logistic.predict(valid_features)

In [30]:
f1_score(valid_target, logistic_predictions_balanced)

0.7619647355163728

In [31]:
# temporary test-set check; remove later
logistic_predictions_test = logistic.predict(test_features)
f1_score(test_target, logistic_predictions_test)

0.7645631067961165

In [32]:
confusion_matrix(valid_target, logistic_predictions_balanced)

array([[6761,  424],
       [ 122,  658]])

### Random forest

In [33]:
%%time
n_estimators = [int(x) for x in np.linspace(start = 1, stop = 50)]
max_depth = [int(x) for x in np.linspace(1, 50)]
grid = {'n_estimators': n_estimators, 'max_depth': max_depth}
forest = RandomizedSearchCV(estimator = RandomForestClassifier(), param_distributions = grid, cv = 5, scoring='f1')
forest.fit(train_features, train_target)

CPU times: user 2min 18s, sys: 115 ms, total: 2min 18s
Wall time: 2min 18s


RandomizedSearchCV(cv=5, estimator=RandomForestClassifier(),
                   param_distributions={'max_depth': [1, 2, 3, 4, 5, 6, 7, 8, 9,
                                                      10, 11, 12, 13, 14, 15,
                                                      16, 17, 18, 19, 20, 21,
                                                      22, 23, 24, 25, 26, 27,
                                                      28, 29, 30, ...],
                                        'n_estimators': [1, 2, 3, 4, 5, 6, 7, 8,
                                                         9, 10, 11, 12, 13, 14,
                                                         15, 16, 17, 18, 19, 20,
                                                         21, 22, 23, 24, 25, 26,
                                                         27, 28, 29, 30, ...]},
                   scoring='f1')

In [34]:
forest_predictions = forest.predict(valid_features)
f1_score(valid_target, forest_predictions)

0.6331045003813883

In [35]:
forest.best_score_

0.7024587237010106
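`best_score_` is the mean cross-validated F1 on the training data, which here is the downsampled set, while the validation set keeps the original class ratio. That difference is one plausible reason the two numbers diverge. The sketch below shows how to read the selected parameters from a search. The synthetic data, the small `n_iter`, and the fixed `random_state` values are assumptions added for a quick, reproducible run.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic data stands in for the TF-IDF features
X, y = make_classification(n_samples=200, n_features=20, random_state=55)

param_distributions = {
    'n_estimators': list(range(1, 51)),
    'max_depth': list(range(1, 51)),
}
search = RandomizedSearchCV(
    estimator=RandomForestClassifier(random_state=55),
    param_distributions=param_distributions,
    n_iter=5,          # number of sampled parameter combinations (default is 10)
    cv=3,
    scoring='f1',
    random_state=55,   # makes the sampled combinations repeatable
)
search.fit(X, y)

print(search.best_params_)
print(round(search.best_score_, 3))
```

Because only a handful of combinations are sampled, the selected parameters can differ between runs unless `random_state` is fixed.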

In [36]:
confusion_matrix(valid_target, forest_predictions)

array([[7069,  116],
       [ 365,  415]])
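The F1 score reported for the random forest can be recomputed directly from this confusion matrix. In scikit-learn's layout, rows are true classes and columns are predicted classes, so the flattened matrix reads TN, FP, FN, TP.

```python
import numpy as np

# Confusion matrix of the random forest on the validation set (copied from above)
cm = np.array([[7069, 116],
               [365, 415]])
tn, fp, fn, tp = cm.ravel()

precision = tp / (tp + fp)   # share of flagged comments that are truly toxic
recall = tp / (tp + fn)      # share of toxic comments that were flagged
f1 = 2 * precision * recall / (precision + recall)

print(f'precision={precision:.3f} recall={recall:.3f} f1={f1:.4f}')
# prints: precision=0.782 recall=0.532 f1=0.6331
```

The low recall is what drags the score down: the forest misses 365 of the 780 toxic comments in the validation set.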

### CatBoost

In [37]:
%%time
# CatBoost
catboost = CatBoostClassifier(iterations = 500, 
    loss_function='MultiClass', 
    bootstrap_type = "Bayesian",
    eval_metric = 'MultiClass',
    learning_rate=0.1,)
catboost.fit(train_features, train_target, verbose=50)

0:	learn: 0.6654454	total: 1.78s	remaining: 14m 47s
50:	learn: 0.4148341	total: 1m 4s	remaining: 9m 29s
100:	learn: 0.3637757	total: 2m 6s	remaining: 8m 19s
150:	learn: 0.3327742	total: 3m 8s	remaining: 7m 15s
200:	learn: 0.3136358	total: 4m 9s	remaining: 6m 11s
250:	learn: 0.3007490	total: 5m 10s	remaining: 5m 7s
300:	learn: 0.2898447	total: 6m 10s	remaining: 4m 5s
350:	learn: 0.2826190	total: 7m 11s	remaining: 3m 3s
400:	learn: 0.2768002	total: 8m 11s	remaining: 2m 1s
450:	learn: 0.2718538	total: 9m 12s	remaining: 1m
499:	learn: 0.2676704	total: 10m 11s	remaining: 0us
CPU times: user 10min 13s, sys: 3.44 s, total: 10min 16s
Wall time: 10min 17s


<catboost.core.CatBoostClassifier at 0x7feaf7b8d7c0>

In [38]:
catboost_predictions = catboost.predict(valid_features)
f1_score(valid_target, catboost_predictions)

0.7355320472930927

In [39]:
confusion_matrix(valid_target, catboost_predictions)

array([[6949,  236],
       [ 189,  591]])

## Conclusions

In [32]:
logistic_predictions_test = logistic.predict(test_features)
f1_score(test_target, logistic_predictions_test)

0.7645631067961165

In [41]:
forest_predictions_test = forest.predict(test_features)
f1_score(test_target, forest_predictions_test)

0.6324786324786325

In [47]:
catboost_predictions_test = catboost.predict(test_features)
round(f1_score(test_target, catboost_predictions_test), 2)

0.75

On the test set, the F1 scores are 0.76 for logistic regression, 0.63 for random forest, and 0.75 (rounded) for CatBoost. Logistic regression gives the best result and meets the target of F1 ≥ 0.75.