<h1>Содержание<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Подготовка" data-toc-modified-id="Подготовка-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Подготовка</a></span></li><li><span><a href="#Обучение" data-toc-modified-id="Обучение-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Обучение</a></span></li><li><span><a href="#Выводы" data-toc-modified-id="Выводы-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Выводы</a></span></li><li><span><a href="#Чек-лист-проверки" data-toc-modified-id="Чек-лист-проверки-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Чек-лист проверки</a></span></li></ul></div>

# Проект для «Викишоп»

Интернет-магазин «Викишоп» запускает новый сервис. Теперь пользователи могут редактировать и дополнять описания товаров, как в вики-сообществах. То есть клиенты предлагают свои правки и комментируют изменения других. Магазину нужен инструмент, который будет искать токсичные комментарии и отправлять их на модерацию. 

Обучите модель классифицировать комментарии на позитивные и негативные. В вашем распоряжении набор данных с разметкой о токсичности правок.

Постройте модель со значением метрики качества *F1* не меньше 0.75. 

**Инструкция по выполнению проекта**

1. Загрузите и подготовьте данные.
2. Обучите разные модели. 
3. Сделайте выводы.

Для выполнения проекта применять *BERT* необязательно, но вы можете попробовать.

**Описание данных**

Данные находятся в файле `toxic_comments.csv`. Столбец *text* в нём содержит текст комментария, а *toxic* — целевой признак.

## Подготовка

In [1]:
import pandas as pd
import numpy as np
from pymystem3 import Mystem
from sklearn.feature_extraction.text import CountVectorizer
import nltk
from nltk.corpus import stopwords
import torch
import transformers
from tqdm import notebook
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split
import re
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import f1_score
from lightgbm import LGBMClassifier, LGBMRegressor
from sklearn.neighbors import KNeighborsClassifier
from nltk import WordNetLemmatizer, word_tokenize
from sklearn.model_selection import StratifiedKFold
from pymorphy2 import MorphAnalyzer

In [2]:
df = pd.read_csv("/datasets//toxic_comments.csv")
df.head(5)
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 159571 entries, 0 to 159570
Data columns (total 2 columns):
text     159571 non-null object
toxic    159571 non-null int64
dtypes: int64(1), object(1)
memory usage: 2.4+ MB


In [3]:
df.isna().sum()

text     0
toxic    0
dtype: int64

In [4]:
nltk.download('stopwords')
stopwords = set(stopwords.words('english'))

[nltk_data] Downloading package stopwords to /home/jovyan/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [5]:
wnl = WordNetLemmatizer()

def lemmatize_text(text):
    cleared_text = re.sub(r'[^a-zA-Z]',' ',text)
    word_list = nltk.word_tokenize(cleared_text.lower())
    lemmatized_output = ' '.join([wnl.lemmatize(w) for w in word_list])
    return lemmatized_output

In [6]:
morph = MorphAnalyzer()
wnl = WordNetLemmatizer()
def lemmatize(text):
    cleared_text = re.sub(r'[^a-zA-Z]',' ',text)
    words = cleared_text.lower().split()     
    res = list()
    for word in words:
        p = morph.parse(word)[0]
        res.append(p.normal_form)
    return " ".join(res)

In [7]:
#f=df[:10000]
lemmatized_text = df['text'].apply(lambda x: lemmatize(x))

In [8]:
print(lemmatized_text)

0         explanation why the edits made under my userna...
1         d aww he matches this background colour i m se...
2         hey man i m really not trying to edit war it s...
3         more i can t make any real suggestions on impr...
4         you sir are my hero any chance you remember wh...
                                ...                        
159566    and for the second time of asking when your vi...
159567    you should be ashamed of yourself that is a ho...
159568    spitzer umm theres no actual article for prost...
159569    and it looks like it was actually you who put ...
159570    and i really don t think you understand i came...
Name: text, Length: 159571, dtype: object


In [9]:
X_train, X_test, y_train, y_test = train_test_split(pd.DataFrame(lemmatized_text), df['toxic'], test_size=0.2, random_state=42)

In [10]:
X_train.shape, X_test.shape

((127656, 1), (31915, 1))

In [11]:
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.25, random_state=35)

In [12]:
X_train.shape, X_val.shape

((95742, 1), (31914, 1))

In [13]:
print(X_train[:10])

                                                     text
108095  august utc let s wait until the abs release th...
47135   de neville hi ealdgyth sorry i didn t know you...
124803  the deletion of the term bulgarian in this art...
150942    you should read the external linking policy t c
155739  but as cessna mikoyan gurevich and piper to na...
155094  there is no policy basis for the block your id...
17459   btw i think i am free to see what what you do ...
140599  you seem to be one of the more well intentione...
128110  regarding edits made during november utc to sh...
93528   the reason that morocco is not a member of the...


In [14]:
count_tf_idf1 = TfidfVectorizer(stop_words=stopwords) 
train_tf_idf1 = count_tf_idf1.fit_transform(X_train['text'].values)
val_tf_idf1 = count_tf_idf1.transform(X_val['text'].values)
test_tf_idf1 = count_tf_idf1.transform(X_test['text'].values)

In [15]:
print(test_tf_idf1)

  (0, 122599)	0.07511400185807228
  (0, 119484)	0.1338926825446641
  (0, 117563)	0.24329989407646166
  (0, 116507)	0.09230487855318296
  (0, 108902)	0.2113729221421455
  (0, 104358)	0.2706825445008082
  (0, 104293)	0.12314212384350881
  (0, 101955)	0.19527788280633993
  (0, 96604)	0.11877926965493606
  (0, 88825)	0.17398693284943034
  (0, 78600)	0.17824452426583676
  (0, 77958)	0.12488951331058483
  (0, 71136)	0.2589678156464524
  (0, 68231)	0.19527788280633993
  (0, 67202)	0.22364778159232748
  (0, 66362)	0.12863370851908837
  (0, 60232)	0.08414483628745646
  (0, 46622)	0.1364524273630153
  (0, 43775)	0.1480168282163967
  (0, 43228)	0.2416912867620861
  (0, 38139)	0.16871519983019093
  (0, 35763)	0.22959011973986415
  (0, 33032)	0.1967738174472053
  (0, 30150)	0.15006036546985846
  (0, 16212)	0.12943096181466748
  :	:
  (31914, 41095)	0.11734710357808806
  (31914, 39475)	0.13236863736961532
  (31914, 38253)	0.10475477673414987
  (31914, 38242)	0.09293304930466588
  (31914, 38166)	0.17

## Обучение

In [16]:
df_TestModels = pd.DataFrame()

In [32]:
best_param_C = 0
best_max_iter_param = 0
best_class_weight_param = 0
best_f1 = 0

for param_C in (0.0001, 0.001, 0.01, 0.1, 1):
    for max_iter_param in range(10, 1000, 50):
        for class_weight_param in (None, 'balanced'):
            model = LogisticRegression(C=param_C, max_iter=max_iter_param, class_weight=class_weight_param, verbose=2)
            model.fit(train_tf_idf1, y_train)
            f1 = f1_score(y_val, model.predict(val_tf_idf1))
            if f1 > best_f1:
                best_param_C = param_C
                best_max_iter_param = max_iter_param
                best_class_weight_param = class_weight_param
                best_f1 = f1
print('best_param_C=',best_param_C,'; best_max_iter_param=',best_max_iter_param,'; best_class_weight_param=',
      best_class_weight_param,'; best_f1=',best_f1)            
tmp = pd.DataFrame([[model, best_f1]])
df_TestModels = df_TestModels.append(tmp)


[LibLinear]

  'precision', 'predicted', average, warn_for)


[LibLinear][LibLinear][LibLinear][LibLinear][LibLinear][LibLinear][LibLinear][LibLinear][LibLinear][LibLinear][LibLinear][LibLinear][LibLinear][LibLinear][LibLinear][LibLinear][LibLinear][LibLinear][LibLinear][LibLinear][LibLinear][LibLinear][LibLinear][LibLinear][LibLinear][LibLinear][LibLinear][LibLinear][LibLinear][LibLinear][LibLinear][LibLinear][LibLinear][LibLinear][LibLinear][LibLinear][LibLinear][LibLinear][LibLinear][LibLinear][LibLinear][LibLinear][LibLinear][LibLinear][LibLinear][LibLinear][LibLinear][LibLinear][LibLinear][LibLinear][LibLinear][LibLinear][LibLinear][LibLinear][LibLinear][LibLinear][LibLinear][LibLinear][LibLinear][LibLinear][LibLinear][LibLinear][LibLinear][LibLinear][LibLinear][LibLinear][LibLinear][LibLinear][LibLinear][LibLinear][LibLinear][LibLinear][LibLinear][LibLinear][LibLinear][LibLinear][LibLinear][LibLinear][LibLinear][LibLinear][LibLinear][LibLinear][LibLinear][LibLinear][LibLinear][LibLinear][LibLinear][LibLinear][LibLinear][LibLinear][LibLinear



[LibLinear][LibLinear][LibLinear][LibLinear][LibLinear][LibLinear][LibLinear][LibLinear][LibLinear][LibLinear][LibLinear][LibLinear][LibLinear][LibLinear][LibLinear][LibLinear][LibLinear][LibLinear][LibLinear][LibLinear][LibLinear][LibLinear][LibLinear][LibLinear][LibLinear][LibLinear][LibLinear][LibLinear][LibLinear][LibLinear][LibLinear][LibLinear][LibLinear][LibLinear][LibLinear][LibLinear][LibLinear][LibLinear][LibLinear]best_param_C= 1 ; best_max_iter_param= 10 ; best_class_weight_param= balanced ; best_f1= 0.7351589781290205


<div class="alert alert-block alert-warning"> Добавил поиск гиперпараметров. Лучше результаты с параметрами: best_param_C= 1 ; best_max_iter_param= 10 ; best_class_weight_param= balanced ; best_f1= 0.0.7351589781290205
</div>

In [31]:
model = LogisticRegression(C=1, max_iter=10, class_weight='balanced', verbose=2)
model.fit(train_tf_idf1, y_train)
f1 = f1_score(y_test, model.predict(test_tf_idf1))
print('F1=',f1)



[LibLinear]F1= 0.7373612823674477


In [None]:
best_f1 = 0
best_n_neighbor=0
for n_neighbor in range(2,9,2):
    model = KNeighborsClassifier(n_neighbors=n_neighbor)
    model.fit(train_tf_idf1, y_train) 
    predict_test = model.predict(test_tf_idf1)
    f1 = f1_score(y_test, model.predict(test_tf_idf1))
    if f1 > best_f1:
        best_n_neighbor = n_neighbor
        best_f1 = f1
    print(n_neighbor," F1:", f1)

tmp = pd.DataFrame([[model, best_f1]])
df_TestModels = df_TestModels.append(tmp)

Слишком низкое качество прогноза.

In [17]:
train_tf_idf1 = train_tf_idf1[:3500]
y_train = y_train[:3500]
val_tf_idf1 = val_tf_idf1[:3500]
y_val = y_val[:3500]

In [None]:
best_f1 = 0
best_n_estimators = 0
best_m_depth = 0
n_estimators = 6
max_depth = 9
for maxdepth in range(14, 24, 2):
    for nestimators in range(90, 190, 10):
        model = LGBMClassifier(boosting_type='gbdt', #num_leaves=1200,
                                    learning_rate=0.17, n_estimators=nestimators, max_depth=maxdepth,metric='f1')
        model.fit(train_tf_idf1, y_train) 
        f1 = f1_score(y_val, model.predict(val_tf_idf1))
        if f1 > best_f1:
            best_f1 = f1
            best_n_estimators = n_estimators
            best_m_depth = max_depth
        print(model,': ',f1)
print('best_f1 =', f1,'best_n_estimators =', best_n_estimators,  'best_m_depth =', best_m_depth)
tmp = pd.DataFrame([[maxdepth, best_f1]])
df_TestModels = df_TestModels.append(tmp)

LGBMClassifier(boosting_type='gbdt', class_weight=None, colsample_bytree=1.0,
               importance_type='split', learning_rate=0.17, max_depth=14,
               metric='f1', min_child_samples=20, min_child_weight=0.001,
               min_split_gain=0.0, n_estimators=90, n_jobs=-1, num_leaves=31,
               objective=None, random_state=None, reg_alpha=0.0, reg_lambda=0.0,
               silent=True, subsample=1.0, subsample_for_bin=200000,
               subsample_freq=0) :  0.5436507936507936
LGBMClassifier(boosting_type='gbdt', class_weight=None, colsample_bytree=1.0,
               importance_type='split', learning_rate=0.17, max_depth=14,
               metric='f1', min_child_samples=20, min_child_weight=0.001,
               min_split_gain=0.0, n_estimators=100, n_jobs=-1, num_leaves=31,
               objective=None, random_state=None, reg_alpha=0.0, reg_lambda=0.0,
               silent=True, subsample=1.0, subsample_for_bin=200000,
               subsample_freq=0) :  

Тоже низкое качество прогноза.

## Выводы

In [None]:
display(df_TestModels)

Лучшие результаты показала модель LogisticRegression f1= 0.7373612823674477 с параметрами: C= 1 ; max_iter = 10 ; class_weight = balanced. KNeighborsClassifier лучший результат F1: 0.2848114598693912.  LGBMClassifier F1= 0.6148771235362032.


## Чек-лист проверки

- [x]  Jupyter Notebook открыт
- [x]  Весь код выполняется без ошибок
- [x]  Ячейки с кодом расположены в порядке исполнения
- [x]  Данные загружены и подготовлены
- [x]  Модели обучены
- [x]  Значение метрики *F1* не меньше 0.75
- [x]  Выводы написаны