# Project for «Wikishop»

The online store «Wikishop» is launching a new service. Users can now edit and extend product descriptions, as in wiki communities: customers propose their own edits and comment on changes made by others. The store needs a tool that finds toxic comments and sends them for moderation.

Train a model to classify comments as positive or negative. A dataset labeled for the toxicity of edits is available.

Build a model with an *F1* score of at least 0.75.

**Project instructions**

1. Load and prepare the data.
2. Train several models.
3. Draw conclusions.

Using *BERT* is optional for this project, but you are welcome to try it.

**Data description**

The data is in the file `toxic_comments.csv`. The *text* column contains the comment text, and *toxic* is the target.

## Preparation

In [4]:
import pandas as pd 
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt



from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_score
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline

from lightgbm import LGBMClassifier
from catboost import CatBoostClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier


import nltk
from nltk.corpus import stopwords as nltk_stopwords
from nltk.stem import WordNetLemmatizer

nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('punkt')
# a list rather than a set: recent scikit-learn versions validate stop_words as a list
stopwords = list(nltk_stopwords.words('english'))

import re



from tqdm.notebook import tqdm
tqdm.pandas()

import spacy
nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])

from sklearn.metrics import accuracy_score
from sklearn.metrics import f1_score

import warnings
warnings.filterwarnings("ignore")



In [6]:
df = pd.read_csv("/datasets/toxic_comments.csv")

In [7]:
df

,Unnamed: 0,text,toxic
0,0,Explanation\nWhy the edits made under my usern...,0
1,1,D'aww! He matches this background colour I'm s...,0
2,2,"Hey man, I'm really not trying to edit war. It...",0
3,3,"""\nMore\nI can't make any real suggestions on ...",0
4,4,"You, sir, are my hero. Any chance you remember...",0
...,...,...,...
159287,159446,""":::::And for the second time of asking, when ...",0
159288,159447,You should be ashamed of yourself \n\nThat is ...,0
159289,159448,"Spitzer \n\nUmm, theres no actual article for ...",0
159290,159449,And it looks like it was actually you who put ...,0


In [8]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 159292 entries, 0 to 159291
Data columns (total 3 columns):
 #   Column      Non-Null Count   Dtype 
---  ------      --------------   ----- 
 0   Unnamed: 0  159292 non-null  int64 
 1   text        159292 non-null  object
 2   toxic       159292 non-null  int64 
dtypes: int64(2), object(1)
memory usage: 3.6+ MB


In [9]:
df['toxic'].value_counts()

0    143106
1     16186
Name: toxic, dtype: int64

In [10]:
print("Share of toxic comments:", round(df['toxic'].mean() * 100, 2), '%')

Share of toxic comments: 10.16 %


In the `toxic` column, 16,186 comments (10.16 %) are labeled as toxic, so the classes are imbalanced.
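
With roughly 10 % of comments in the toxic class, accuracy alone would be misleading, which is one reason the task asks for F1. The sketch below uses the class counts printed above and a hypothetical baseline that labels every comment as non-toxic; the baseline is an illustration only and is not one of the models trained in this notebook.

```python
# Class counts taken from df['toxic'].value_counts() above
n_clean, n_toxic = 143106, 16186

# Hypothetical baseline: predict "not toxic" (0) for every comment
tp = 0            # toxic comments correctly flagged
fp = 0            # clean comments wrongly flagged
fn = n_toxic      # toxic comments missed
tn = n_clean      # clean comments correctly passed

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp) if (tp + fp) else 0.0
recall = tp / (tp + fn) if (tp + fn) else 0.0
f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0

print(f"accuracy = {accuracy:.4f}, F1 = {f1:.2f}")
```

The baseline is right on about 90 % of comments and still finds no toxic ones, so its F1 is zero.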

In [11]:
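# note: the check spans all columns, including the unique 'Unnamed: 0',
# so it can only detect fully identical rows
print("Number of duplicates:", df.duplicated().sum())

Number of duplicates: 0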


Create a new column with the lowercased source text; all cleaning and lemmatization steps below operate on it.

In [12]:
df['lemm_text'] = df['text'].str.lower()

In [13]:
df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 159292 entries, 0 to 159291
Data columns (total 4 columns):
 #   Column      Non-Null Count   Dtype 
---  ------      --------------   ----- 
 0   Unnamed: 0  159292 non-null  int64 
 1   text        159292 non-null  object
 2   toxic       159292 non-null  int64 
 3   lemm_text   159292 non-null  object
dtypes: int64(2), object(2)
memory usage: 4.9+ MB


Write a function that cleans and lemmatizes the text: it replaces every non-letter character with a space, collapses the extra whitespace, lemmatizes the tokens with spaCy, and joins the lemmas back into one string.

In [14]:
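def clean_text(text):
    # replace every character that is not a Latin letter with a space
    text = re.sub(r'[^a-zA-Z]', ' ', text)
    # collapse repeated whitespace
    text = " ".join(text.split())
    # lemmatize with spaCy and join the lemmas back into one string
    doc = nlp(text)
    return " ".join([token.lemma_ for token in doc])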



Check how the function works.

In [15]:
print("Source text:", df['text'].loc[0].lower())
print('*'*50)
print("Lemmatized:", clean_text(df['text'].loc[0].lower()))
print('*'*50)
print("Source text:", df['text'].loc[1].lower())
print('*'*50)
print("Lemmatized:", clean_text(df['text'].loc[1].lower()))

Source text: explanation
why the edits made under my username hardcore metallica fan were reverted? they weren't vandalisms, just closure on some gas after i voted at new york dolls fac. and please don't remove the template from the talk page since i'm retired now.89.205.38.27
**************************************************
Lemmatized: explanation why the edit make under my username hardcore metallica fan be revert they weren t vandalism just closure on some gas after I vote at new york dolls fac and please don t remove the template from the talk page since I m retire now
**************************************************
Source text: d'aww! he matches this background colour i'm seemingly stuck with. thanks.  (talk) 21:51, january 11, 2016 (utc)
**************************************************
Lemmatized: d aww he match this background colour I m seemingly stuck with thank talk january utc
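
The output contains fragments such as `weren t` and `I m` because the regular expression turns apostrophes into spaces before spaCy sees the text. A minimal sketch of that regex step in isolation follows; `strip_non_letters` and the sample string are hypothetical names introduced here for illustration, and the spaCy lemmatization is deliberately left out.

```python
import re

def strip_non_letters(text):
    """Regex part of clean_text only; the spaCy lemmatization step is left out."""
    text = re.sub(r'[^a-zA-Z]', ' ', text)   # every non-letter becomes a space
    return " ".join(text.split())            # collapse repeated whitespace

sample = "they weren't vandalisms. i'm retired now.89.205.38.27"
cleaned = strip_non_letters(sample)
print(cleaned)
```

Digits, punctuation, and the trailing IP address disappear, while contractions are split into two tokens. If contractions matter for the model, one option would be to keep apostrophes in the character class, at the cost of a slightly noisier vocabulary.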


In [16]:
# apply the cleaning function to every comment, with a progress bar
df['lemm_text'] = df['lemm_text'].progress_apply(clean_text)


In [17]:
df

,Unnamed: 0,text,toxic,lemm_text
0,0,Explanation\nWhy the edits made under my usern...,0,explanation why the edit make under my usernam...
1,1,D'aww! He matches this background colour I'm s...,0,d aww he match this background colour I m seem...
2,2,"Hey man, I'm really not trying to edit war. It...",0,hey man I m really not try to edit war it s ju...
3,3,"""\nMore\nI can't make any real suggestions on ...",0,more I can t make any real suggestion on impro...
4,4,"You, sir, are my hero. Any chance you remember...",0,you sir be my hero any chance you remember wha...
...,...,...,...,...
159287,159446,""":::::And for the second time of asking, when ...",0,and for the second time of ask when your view ...
159288,159447,You should be ashamed of yourself \n\nThat is ...,0,you should be ashamed of yourself that be a ho...
159289,159448,"Spitzer \n\nUmm, theres no actual article for ...",0,spitzer umm there s no actual article for pros...
159290,159449,And it looks like it was actually you who put ...,0,and it look like it be actually you who put on...


In [18]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 159292 entries, 0 to 159291
Data columns (total 4 columns):
 #   Column      Non-Null Count   Dtype 
---  ------      --------------   ----- 
 0   Unnamed: 0  159292 non-null  int64 
 1   text        159292 non-null  object
 2   toxic       159292 non-null  int64 
 3   lemm_text   159292 non-null  object
dtypes: int64(2), object(2)
memory usage: 4.9+ MB


Define the features and the target.

In [19]:
features = df['lemm_text'].reset_index(drop=True)

target = df['toxic'].reset_index(drop=True)

The text feature and the target are ready. Next, split the data into training and test sets.

In [20]:
features_train, features_test, target_train, target_test = train_test_split(
    features, target, test_size=0.1, random_state=12345)

In [21]:
print(features_train.shape)
print(target_train.shape)


print(features_test.shape)
print(target_test.shape)

(143362,)
(143362,)
(15930,)
(15930,)


Build TF-IDF features. `TfidfVectorizer` is fitted on the training set only, and the fitted vectorizer is then applied to the test set, so that no information from the test set leaks into the features.

In [22]:
count_tf_idf = TfidfVectorizer(stop_words=stopwords)

In [23]:
tf_idf_train = count_tf_idf.fit_transform(features_train)

print("Matrix shape:", tf_idf_train.shape)

Matrix shape: (143362, 142387)


In [24]:
tf_idf_test = count_tf_idf.transform(features_test)

print("Matrix shape:", tf_idf_test.shape)

Matrix shape: (15930, 142387)
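
Both matrices have the same number of columns because the vocabulary is fixed at fit time. The toy class below sketches why fitting on the training texts alone prevents leakage. `TinyTfidf` is a hypothetical, simplified stand-in written for this illustration and is not the scikit-learn implementation: it uses the smoothed IDF formula that scikit-learn documents, but skips tokenization options, stop words, and L2 normalization.

```python
import math
from collections import Counter

class TinyTfidf:
    """Toy TF-IDF: vocabulary and IDF come from the training texts only."""

    def fit(self, docs):
        n_docs = len(docs)
        doc_freq = Counter()
        for doc in docs:
            doc_freq.update(set(doc.split()))
        # Smoothed IDF: ln((1 + n) / (1 + df)) + 1
        self.idf_ = {
            term: math.log((1 + n_docs) / (1 + df)) + 1
            for term, df in doc_freq.items()
        }
        return self

    def transform(self, docs):
        rows = []
        for doc in docs:
            counts = Counter(doc.split())
            # Terms never seen during fit are dropped, so test texts cannot
            # change the vocabulary or the IDF weights
            rows.append({t: c * self.idf_[t] for t, c in counts.items() if t in self.idf_})
        return rows

train_docs = ["you be my hero", "you be a vandal"]   # made-up examples
test_docs = ["you be a troll"]

tfidf = TinyTfidf().fit(train_docs)
test_rows = tfidf.transform(test_docs)
print(test_rows[0])
```

The word `troll` appears only in the test text, so it gets no column and no weight. Calling `fit` on the combined data instead would let test-only words shape the features.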


The features are ready. Now move on to model training and predictions.

## Training

**Logistic regression (LogisticRegression)**

In [26]:
%%time

lr_pipeline = Pipeline([
        ("vect", TfidfVectorizer(stop_words=stopwords)),
        ("logistic", LogisticRegression(solver ='liblinear', random_state=12345))])

params = {'logistic__C': [0.1, 1, 10],
          'logistic__class_weight': ['balanced', None]}
    
                                   
lr_grid = GridSearchCV(estimator=lr_pipeline, param_grid=params, cv=5, scoring='f1', n_jobs=-1, refit=False, verbose=10)
lr_grid.fit(features_train, target_train)


lr_best_params = lr_grid.best_params_
lr_best_score = lr_grid.best_score_


print(lr_best_params)
print(lr_best_score)

Fitting 5 folds for each of 6 candidates, totalling 30 fits
{'logistic__C': 10, 'logistic__class_weight': None}
0.7780645950279016
CPU times: total: 11.1 s
Wall time: 1min 17s
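
The "6 candidates, totalling 30 fits" line in the log follows directly from the grid: every combination of parameter values is one candidate, and each candidate is fitted once per fold. Because the vectorizer sits inside the `Pipeline`, each of those fits also refits TF-IDF on that fold's training part only. A small sketch of the counting, using only the standard library (`cv_folds`, `candidates`, and `total_fits` are names introduced here):

```python
from itertools import product

# Same grid as in the LogisticRegression cell above
params = {'logistic__C': [0.1, 1, 10],
          'logistic__class_weight': ['balanced', None]}
cv_folds = 5

# Every combination of parameter values is one candidate
candidates = [dict(zip(params, values)) for values in product(*params.values())]
total_fits = len(candidates) * cv_folds

print(len(candidates), "candidates,", total_fits, "fits")
```

The same arithmetic explains why the larger grids for the tree and boosting models below take much longer.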


**DecisionTreeClassifier**

In [30]:
%%time

dtc_pipeline = Pipeline([
        ("vect", TfidfVectorizer(stop_words=stopwords)),
        ("logistic", DecisionTreeClassifier(random_state=12345, max_depth=5, class_weight = 'balanced'))])

params = {'logistic__max_depth':range(1,6)}
    
                                   
dtc_grid = GridSearchCV(estimator=dtc_pipeline, param_grid=params, cv=5, scoring='f1', verbose=10)
dtc_grid.fit(features_train, target_train)


dtc_best_params = dtc_grid.best_params_
dtc_best_score = dtc_grid.best_score_


print(dtc_best_params)
print(dtc_best_score)

Fitting 5 folds for each of 5 candidates, totalling 25 fits
[CV 1/5; 1/5] START logistic__max_depth=1.......................................
[CV 1/5; 1/5] END .....................logistic__max_depth=1; total time=   7.0s
[CV 2/5; 1/5] START logistic__max_depth=1.......................................
[CV 2/5; 1/5] END .....................logistic__max_depth=1; total time=   6.8s
[CV 3/5; 1/5] START logistic__max_depth=1.......................................
[CV 3/5; 1/5] END .....................logistic__max_depth=1; total time=   6.7s
[CV 4/5; 1/5] START logistic__max_depth=1.......................................
[CV 4/5; 1/5] END .....................logistic__max_depth=1; total time=   6.7s
[CV 5/5; 1/5] START logistic__max_depth=1.......................................
[CV 5/5; 1/5] END .....................logistic__max_depth=1; total time=   6.9s
[CV 1/5; 2/5] START logistic__max_depth=2.......................................
[CV 1/5; 2/5] END .....................logistic__

**RandomForestClassifier**

In [32]:
%%time

rfc_pipeline = Pipeline([
        ("vect", TfidfVectorizer(stop_words=stopwords)),
        ("classifer", RandomForestClassifier(random_state=12345, max_depth=6, class_weight = 'balanced', n_estimators=52))])

params = {'classifer__max_depth': range(2, 6),
        "classifer__n_estimators": range(1, 52, 10)}
    
                                   
rfc_grid = GridSearchCV(estimator=rfc_pipeline, param_grid=params, cv=5, scoring='f1', verbose=10)
rfc_grid.fit(features_train, target_train)


rfc_best_params = rfc_grid.best_params_
rfc_best_score = rfc_grid.best_score_


print(rfc_best_params)
print(rfc_best_score)

Fitting 5 folds for each of 24 candidates, totalling 120 fits
[CV 1/5; 1/24] START classifer__max_depth=2, classifer__n_estimators=1..........
[CV 1/5; 1/24] END classifer__max_depth=2, classifer__n_estimators=1; total time=   6.5s
[CV 2/5; 1/24] START classifer__max_depth=2, classifer__n_estimators=1..........
[CV 2/5; 1/24] END classifer__max_depth=2, classifer__n_estimators=1; total time=   6.3s
[CV 3/5; 1/24] START classifer__max_depth=2, classifer__n_estimators=1..........
[CV 3/5; 1/24] END classifer__max_depth=2, classifer__n_estimators=1; total time=   6.3s
[CV 4/5; 1/24] START classifer__max_depth=2, classifer__n_estimators=1..........
[CV 4/5; 1/24] END classifer__max_depth=2, classifer__n_estimators=1; total time=   6.4s
[CV 5/5; 1/24] START classifer__max_depth=2, classifer__n_estimators=1..........
[CV 5/5; 1/24] END classifer__max_depth=2, classifer__n_estimators=1; total time=   6.5s
[CV 1/5; 2/24] START classifer__max_depth=2, classifer__n_estimators=11.........
[CV 1/5

**CatBoostClassifier**

In [34]:
%%time

catboost_pipeline = Pipeline([
        ("vect", TfidfVectorizer(stop_words=stopwords)),
        ("classifer", CatBoostClassifier(loss_function="Logloss", iterations=250))])

params = {'classifer__depth'         : [4, 6, 8],
                 'classifer__learning_rate' : [0.01, 0.04]
                 }
    
                                   
catboost_grid = GridSearchCV(estimator=catboost_pipeline, param_grid=params, cv=5, scoring='f1', verbose=50)
catboost_grid.fit(features_train, target_train)


catboost_best_params = catboost_grid.best_params_
catboost_best_score = catboost_grid.best_score_


print(catboost_best_params)
print(catboost_best_score)

Fitting 5 folds for each of 6 candidates, totalling 30 fits
[CV 1/5; 1/6] START classifer__depth=4, classifer__learning_rate=0.01...........
0:	learn: 0.6825210	total: 545ms	remaining: 2m 15s
1:	learn: 0.6720331	total: 965ms	remaining: 1m 59s
2:	learn: 0.6621268	total: 1.34s	remaining: 1m 50s
3:	learn: 0.6520962	total: 1.75s	remaining: 1m 47s
4:	learn: 0.6424390	total: 2.16s	remaining: 1m 45s
5:	learn: 0.6332185	total: 2.54s	remaining: 1m 43s
6:	learn: 0.6239023	total: 2.93s	remaining: 1m 41s
7:	learn: 0.6148027	total: 3.31s	remaining: 1m 40s
8:	learn: 0.6060465	total: 3.71s	remaining: 1m 39s
9:	learn: 0.5974516	total: 4.11s	remaining: 1m 38s
10:	learn: 0.5889526	total: 4.5s	remaining: 1m 37s
11:	learn: 0.5806809	total: 4.89s	remaining: 1m 36s
12:	learn: 0.5727045	total: 5.29s	remaining: 1m 36s
13:	learn: 0.5647464	total: 5.69s	remaining: 1m 35s
14:	learn: 0.5570762	total: 6.09s	remaining: 1m 35s
15:	learn: 0.5495687	total: 6.5s	remaining: 1m 35s
16:	learn: 0.5422130	total: 6.92s	remai

Custom logger is already specified. Specify more than one logger at same time is not thread safe.

[CV 2/5; 5/6] END classifer__depth=8, classifer__learning_rate=0.01; total time=  13.9s
[CV 3/5; 5/6] START classifer__depth=8, classifer__learning_rate=0.01...........


[CV 3/5; 5/6] END classifer__depth=8, classifer__learning_rate=0.01; total time=  13.6s
[CV 4/5; 5/6] START classifer__depth=8, classifer__learning_rate=0.01...........
[CV 4/5; 5/6] END classifer__depth=8, classifer__learning_rate=0.01; total time=  13.7s
[CV 5/5; 5/6] START classifer__depth=8, classifer__learning_rate=0.01...........
[CV 5/5; 5/6] END classifer__depth=8, classifer__learning_rate=0.01; total time=  13.8s
[CV 1/5; 6/6] START classifer__depth=8, classifer__learning_rate=0.04...........
[CV 1/5; 6/6] END classifer__depth=8, classifer__learning_rate=0.04; total time=  13.7s
[CV 2/5; 6/6] START classifer__depth=8, classifer__learning_rate=0.04...........
[CV 2/5; 6/6] END classifer__depth=8, classifer__learning_rate=0.04; total time=  13.7s
[CV 3/5; 6/6] START classifer__depth=8, classifer__learning_rate=0.04...........
[CV 3/5; 6/6] END classifer__depth=8, classifer__learning_rate=0.04; total time=  14.3s
[CV 4/5; 6/6] START classifer__depth=8, classifer__learning_rate=0.04...........
[CV 4/5; 6/6] END classifer__depth=8, classifer__learning_rate=0.04; total time=  13.8s
[CV 5/5; 6/6] START classifer__depth=8, classifer__learning_rate=0.04...........
[CV 5/5; 6/6] END classifer__depth=8, classifer__learning_rate=0.04; total time=  13.6s

0:	learn: 0.6502369	total: 1.17s	remaining: 4m 51s
1:	learn: 0.6108144	total: 2.31s	remaining: 4m 46s
2:	learn: 0.5780332	total: 3.37s	remaining: 4m 37s
3:	learn: 0.5461392	total: 4.47s	remaining: 4m 35s
4:	learn: 0.5160367	total: 5.57s	remaining: 4m 32s
5:	learn: 0.4897282	total: 6.67s	remaining: 4m 31s
6:	learn: 0.4652451	total: 7.77s	remaining: 4m 29s
7:	learn: 0.4441857	total: 8.86s	remaining: 4m 27s
8:	learn: 0.4244666	total: 9.95s	remaining: 4m 26s
9:	learn: 0.4068965	total: 11.1s	remaining: 4m 25s
10:	learn: 0.3911764	total: 12.1s	remaining: 4m 23s
11:	learn: 0.3765306	total: 13.2s	remaining: 4m 22s
12:	learn: 0.3633728	total: 14.4s	remaining: 4m 21s
13:	learn: 0.3515919	total: 15.4s	remaining: 4m 20s
14:	learn: 0.3412200	total: 16.5s	remaining: 4m 18s
15:	learn: 0.3315302	total: 17.6s	remaining: 4m 17s
16:	learn: 0.3226823	total: 18.6s	remaining: 4m 15s
17:	learn: 0.3145473	total: 19.7s	remaining: 4m 14s
18:	learn: 0.3073092	total: 20.8s	remaining: 4m 12s
19:	learn: 0.2996989	t

**LGBMClassifier**

In [43]:
%%time

lgbm_pipeline = Pipeline([
        ("vect", TfidfVectorizer(stop_words=stopwords)),
        ("classifer", LGBMClassifier(max_depth=35, class_weight = 'balanced',
                    n_estimators=250))])

params = {'classifer__max_depth': [15, 25, 35],
              'classifer__learning_rate': [0.01, 0.1],
              'classifer__lambda_l1': [0, 0.6]}
                                   
lgbm_grid = GridSearchCV(estimator=lgbm_pipeline, param_grid=params, cv=5, scoring='f1', verbose=10)
lgbm_grid.fit(features_train, target_train)


lgbm_best_params = lgbm_grid.best_params_
lgbm_best_score = lgbm_grid.best_score_


print(lgbm_best_params)
print(lgbm_best_score)

Fitting 5 folds for each of 12 candidates, totalling 60 fits
[CV 1/5; 1/12] START classifer__lambda_l1=0, classifer__learning_rate=0.01, classifer__max_depth=15
[CV 1/5; 1/12] END classifer__lambda_l1=0, classifer__learning_rate=0.01, classifer__max_depth=15; total time= 1.3min
[CV 2/5; 1/12] START classifer__lambda_l1=0, classifer__learning_rate=0.01, classifer__max_depth=15
[CV 2/5; 1/12] END classifer__lambda_l1=0, classifer__learning_rate=0.01, classifer__max_depth=15; total time= 1.3min
[CV 3/5; 1/12] START classifer__lambda_l1=0, classifer__learning_rate=0.01, classifer__max_depth=15
[CV 3/5; 1/12] END classifer__lambda_l1=0, classifer__learning_rate=0.01, classifer__max_depth=15; total time= 1.5min
[CV 4/5; 1/12] START classifer__lambda_l1=0, classifer__learning_rate=0.01, classifer__max_depth=15
[CV 4/5; 1/12] END classifer__lambda_l1=0, classifer__learning_rate=0.01, classifer__max_depth=15; total time= 1.3min
[CV 5/5; 1/12] START classifer__lambda_l1=0, classifer__learning_ra

Summary of the cross-validated F1 scores on the training set.

In [46]:
model = pd.Series(['LogisticRegression', 'RandomForestClassifier', 'DecisionTreeClassifier',
                   'CatBoostClassifier', 'LGBMClassifier'])
F1 = pd.Series([round(lr_best_score, 2), round(rfc_best_score, 2), round(dtc_best_score, 2),
                round(catboost_best_score, 2), round(lgbm_best_score, 2)])

result = pd.DataFrame({'model': model, 'F1': F1})
result

,model,F1
0,LogisticRegression,0.78
1,RandomForestClassifier,0.3
2,DecisionTreeClassifier,0.43
3,CatBoostClassifier,0.66
4,LGBMClassifier,0.76


**Conclusion:** hyperparameters were tuned for each model. LogisticRegression achieved the best cross-validated F1 (0.78). Tuning the boosting models took a long time, as expected. LGBMClassifier reached a cross-validated F1 of 0.76 on the training set, which also clears the 0.75 threshold.

Check the LogisticRegression model on the test set.

In [45]:
%%time

# Note: the grid search above selected class_weight=None, while this cell
# trains with class_weight='balanced'; the reported test score is for this variant.
lr = LogisticRegression(solver='liblinear', random_state=12345, C=10, class_weight='balanced')
lr.fit(tf_idf_train, target_train)

pred_lr = lr.predict(tf_idf_test)
f1_score_lr = f1_score(target_test, pred_lr)

print('F1 on the test set:', round(f1_score_lr, 2))

F1 on the test set: 0.77
CPU times: total: 3.09 s
Wall time: 3.19 s
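
The test cell trains with `class_weight='balanced'`. As documented by scikit-learn, this option weights each class by `n_samples / (n_classes * class_count)`. The sketch below applies that formula to the class counts of the full dataset shown earlier; this is an assumption made for illustration, since the model actually sees only the training split, whose counts differ slightly.

```python
# Class counts of the full dataset, from df['toxic'].value_counts() above
counts = {0: 143106, 1: 16186}

n_samples = sum(counts.values())
n_classes = len(counts)

# Heuristic documented by scikit-learn: n_samples / (n_classes * class_count)
weights = {label: n_samples / (n_classes * count) for label, count in counts.items()}

for label, weight in weights.items():
    print(f"class {label}: weight {weight:.3f}")
```

Errors on toxic comments end up counting roughly nine times more than errors on clean ones, which typically trades some precision for recall. That trade-off may explain why the grid search preferred `class_weight=None` for F1.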


## Conclusions

We received a dataset of comments pre-labeled as toxic or non-toxic, to be used for building a model that predicts toxic comments. The comment texts were cleaned of non-letter characters and lemmatized, and then represented as TF-IDF feature vectors. The following models were trained:
* LogisticRegression
* RandomForestClassifier
* DecisionTreeClassifier
* CatBoostClassifier
* LGBMClassifier<br>

Hyperparameters for all models were tuned with cross-validation on the training set. <br>
**LogisticRegression showed the best cross-validation result (F1 0.78) and reached F1 0.77 on the test set, above the required 0.75.**