# Project for "Wikishop"

<a id=0></a>
[Contents](#0)

[1. Project description](#1)

[2. Data analysis](#2)

[3. Data preparation](#3)

[4. Model training](#4)

[5. Model testing](#5)

[6. Overall conclusion](#6)

<a id=1></a>
## 1. Project description

The online store "Wikishop" is launching a new service. Users can now edit and extend product descriptions, as in wiki communities: customers propose their own edits and comment on changes made by others. The store needs a tool that finds toxic comments and sends them for moderation.

**Project goal:**

Train a model to classify comments as positive or negative. A dataset labelled for edit toxicity is available.

Build a model with an *F1* score of at least 0.75.
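For reference, *F1* is the harmonic mean of precision and recall for the positive (toxic) class. The sketch below computes it from raw counts; the function name `f1_from_counts` and the example counts are illustrative and are not taken from the project.

```python
def f1_from_counts(tp: int, fp: int, fn: int) -> float:
    """Return F1 for the positive class from true positive, false positive and false negative counts."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Hypothetical counts: 80 toxic comments found, 20 false alarms, 30 missed
example_f1 = f1_from_counts(tp=80, fp=20, fn=30)
print(round(example_f1, 4))
```

Because F1 ignores true negatives, it is not inflated by the large share of non-toxic comments.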

**Data:**

The data is stored in the file:

`/datasets/toxic_comments.csv`

The `text` column contains the comment text, and `toxic` is the target.

**Work plan:**

1. Load the data.

2. Analyse the data and fill in missing values.

3. Prepare the samples for model training.

4. Train several models with different hyperparameters.

5. Using the customer's criteria, choose the best model and check its quality on the test set.

6. Draw conclusions.

The *BERT* model may be used for this project.

<a id=2></a>
[Contents](#0)
## 2. Data analysis

In [1]:
import pandas as pd
import numpy as np
import datetime
import nltk
import re
from nltk.corpus import stopwords as nltk_stopwords
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split, cross_val_score, KFold, GridSearchCV
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.tree import DecisionTreeClassifier
from lightgbm import LGBMClassifier
from catboost import CatBoostClassifier
from sklearn.metrics import f1_score
from tqdm import tqdm
from sklearn.pipeline import Pipeline
import warnings
warnings.filterwarnings('ignore')

tqdm.pandas()

RND_STATE = 20032023

Download the WordNet corpus and the list of stop words:

In [2]:
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\admin\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [3]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\admin\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\stopwords.zip.


True

Download the part-of-speech tagger:

In [4]:
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\admin\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

In [5]:
nltk.download('averaged_perceptron_tagger_eng')

[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     C:\Users\admin\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger_eng is already up-to-
[nltk_data]       date!


True

Load the data from the file into a DataFrame:

In [6]:
try:
    data = pd.read_csv('/datasets/toxic_comments.csv', index_col = [0])
except FileNotFoundError:
    # Fall back to the remote copy when the local file is missing
    data = pd.read_csv('https://code.s3.yandex.net/datasets/toxic_comments.csv', index_col = [0])

data.info()

<class 'pandas.core.frame.DataFrame'>
Index: 159292 entries, 0 to 159450
Data columns (total 2 columns):
 #   Column  Non-Null Count   Dtype 
---  ------  --------------   ----- 
 0   text    159292 non-null  object
 1   toxic   159292 non-null  int64 
dtypes: int64(1), object(1)
memory usage: 3.6+ MB


There are no missing values.

Check the ratio of 0 and 1 in the target `toxic`:

In [7]:
data.toxic.value_counts() / data.shape[0] * 100

toxic
0    89.838787
1    10.161213
Name: count, dtype: float64

Store the class ratio in a variable for later use when training the models:

In [8]:
class_ratio = data['toxic'].value_counts()[0] / data['toxic'].value_counts()[1]
class_ratio

8.841344371679229
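A weight dictionary of the form `{0: 1, 1: class_ratio}` is proportional to the common "balanced" heuristic, which sets each class weight to `n_samples / (n_classes * class_count)`. The sketch below checks this with plain Python. The class counts are derived from the 159292 rows and the percentages printed above, and the variable names are illustrative.

```python
# Class counts derived from data.info() and the value_counts() percentages above
n_negative, n_positive = 143106, 16186
n_samples = n_negative + n_positive

# "Balanced" heuristic: n_samples / (n_classes * count of the class)
w_negative = n_samples / (2 * n_negative)
w_positive = n_samples / (2 * n_positive)

class_ratio = n_negative / n_positive
print(round(class_ratio, 4))              # ratio of negatives to positives
print(round(w_positive / w_negative, 4))  # same ratio, recovered from the balanced weights
```

The two schemes differ only by a constant factor, which can still matter for regularized models because it rescales the loss relative to the penalty.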

Write a text preprocessing function that does the following:
- Converts the text to lower case.
- Keeps only Latin letters.
- Removes the words listed in stop_words.

In [9]:
def clear_text(text):
    stop_words = set(nltk_stopwords.words('english'))
    text = text.lower()
    word_list = re.sub(r'[^a-z ]', ' ', text).split()
    word_not_stop_list = [w for w in word_list if not w in stop_words]
    return ' '.join(word_not_stop_list)
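Note that `clear_text` rebuilds the stop-word set on every call. A possible variant builds the set and the regular expression once and reuses them. The sketch below is self-contained, so it uses a tiny hand-written stop-word set (`DEMO_STOP_WORDS`, an assumption made for illustration) in place of the NLTK list.

```python
import re

# Tiny illustrative stop-word set; the notebook uses nltk_stopwords.words('english') instead
DEMO_STOP_WORDS = frozenset({'d', 'he', 'this', 'i', 'm', 'the', 'a'})
NON_LATIN = re.compile(r'[^a-z ]')

def clear_text_fast(text: str, stop_words=DEMO_STOP_WORDS) -> str:
    """Lower-case the text, keep Latin letters only, and drop stop words."""
    words = NON_LATIN.sub(' ', text.lower()).split()
    return ' '.join(w for w in words if w not in stop_words)

cleaned = clear_text_fast("D'aww! He matches this background colour")
print(cleaned)
```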

In [10]:
data['text_clear'] = data['text'].progress_apply(clear_text)
display(data)

100%|██████████| 159292/159292 [00:19<00:00, 8138.39it/s]


Unnamed: 0,text,toxic,text_clear
0,Explanation\nWhy the edits made under my usern...,0,explanation edits made username hardcore metal...
1,D'aww! He matches this background colour I'm s...,0,aww matches background colour seemingly stuck ...
2,"Hey man, I'm really not trying to edit war. It...",0,hey man really trying edit war guy constantly ...
3,"""\nMore\nI can't make any real suggestions on ...",0,make real suggestions improvement wondered sec...
4,"You, sir, are my hero. Any chance you remember...",0,sir hero chance remember page
...,...,...,...
159446,""":::::And for the second time of asking, when ...",0,second time asking view completely contradicts...
159447,You should be ashamed of yourself \n\nThat is ...,0,ashamed horrible thing put talk page
159448,"Spitzer \n\nUmm, theres no actual article for ...",0,spitzer umm theres actual article prostitution...
159449,And it looks like it was actually you who put ...,0,looks like actually put speedy first version d...


Prepare two versions of the data:
- Without part-of-speech information.
- With part-of-speech information (POS tags).

Use the WordNet lemmatizer to lemmatize the strings:

In [11]:
WNL = WordNetLemmatizer()

Lemmatize each string with WordNetLemmatizer and store the result in the `text_wnl` column:

In [12]:
def lemmatize_text(text):

    word_list = text.split()
    lemmatized_text = ' '.join([WNL.lemmatize(w) for w in word_list])

    return lemmatized_text

In [13]:
start_time = datetime.datetime.now()
data['text_wnl'] = data['text_clear'].progress_apply(lemmatize_text)
print('Processing time:', (datetime.datetime.now() - start_time).seconds, 's')

100%|██████████| 159292/159292 [00:16<00:00, 9676.99it/s]

Processing time: 16 s





Lemmatize each string with WordNetLemmatizer using tags from nltk.pos_tag and store the result in the `text_wnl_postag` column:

In [14]:
def get_wordnet_pos(word):

    tag = nltk.pos_tag([word])[0][1][0].upper()
    tag_dict = {"J": wordnet.ADJ,
                "N": wordnet.NOUN,
                "V": wordnet.VERB,
                "R": wordnet.ADV}
    return tag_dict.get(tag, wordnet.NOUN)

In [15]:
def lemmatize_postag_text(text):

    word_list = text.split()
    lemmatized_text = ' '.join([WNL.lemmatize(w, get_wordnet_pos(w)) for w in word_list])
    
    return lemmatized_text

In [16]:
start_time = datetime.datetime.now()
data['text_wnl_postag'] = data['text_clear'].progress_apply(lemmatize_postag_text)
print('Processing time:', (datetime.datetime.now() - start_time).seconds, 's')

100%|██████████| 159292/159292 [03:53<00:00, 681.53it/s]

Processing time: 233 s
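`get_wordnet_pos` tags every word in isolation, which costs one tagger call per word and discards sentence context; this is a likely reason for the much longer run time. One alternative is to tag the whole token list in a single call. The sketch below shows the idea with the tagger and the lemmatizer passed in as callables, so it runs without NLTK data. `fake_tagger` and `fake_lemmatize` are stand-ins for `nltk.pos_tag` and `WNL.lemmatize`, not real NLTK behaviour.

```python
# Penn Treebank tag prefix -> WordNet POS code ('a', 'n', 'v', 'r' are the
# values of wordnet.ADJ, wordnet.NOUN, wordnet.VERB and wordnet.ADV)
PENN_TO_WORDNET = {'J': 'a', 'N': 'n', 'V': 'v', 'R': 'r'}

def lemmatize_tokens(tokens, tagger, lemmatize):
    """Tag all tokens in one call, then lemmatize each with its mapped POS."""
    tagged = tagger(tokens)  # expected to return [(word, penn_tag), ...]
    return [lemmatize(word, PENN_TO_WORDNET.get(tag[:1].upper(), 'n'))
            for word, tag in tagged]

# Stand-ins for nltk.pos_tag and WordNetLemmatizer().lemmatize (illustration only)
def fake_tagger(tokens):
    tags = {'edits': 'NNS', 'made': 'VBN'}
    return [(t, tags.get(t, 'NN')) for t in tokens]

def fake_lemmatize(word, pos):
    table = {('edits', 'n'): 'edit', ('made', 'v'): 'make'}
    return table.get((word, pos), word)

lemmas = lemmatize_tokens(['explanation', 'edits', 'made'], fake_tagger, fake_lemmatize)
print(lemmas)
```

With NLTK available, the call would look like `lemmatize_tokens(text.split(), nltk.pos_tag, WNL.lemmatize)`.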





In [17]:
display(data)

Unnamed: 0,text,toxic,text_clear,text_wnl,text_wnl_postag
0,Explanation\nWhy the edits made under my usern...,0,explanation edits made username hardcore metal...,explanation edits made username hardcore metal...,explanation edits make username hardcore metal...
1,D'aww! He matches this background colour I'm s...,0,aww matches background colour seemingly stuck ...,aww match background colour seemingly stuck th...,aww match background colour seemingly stuck th...
2,"Hey man, I'm really not trying to edit war. It...",0,hey man really trying edit war guy constantly ...,hey man really trying edit war guy constantly ...,hey man really try edit war guy constantly rem...
3,"""\nMore\nI can't make any real suggestions on ...",0,make real suggestions improvement wondered sec...,make real suggestion improvement wondered sect...,make real suggestion improvement wonder sectio...
4,"You, sir, are my hero. Any chance you remember...",0,sir hero chance remember page,sir hero chance remember page,sir hero chance remember page
...,...,...,...,...,...
159446,""":::::And for the second time of asking, when ...",0,second time asking view completely contradicts...,second time asking view completely contradicts...,second time ask view completely contradicts co...
159447,You should be ashamed of yourself \n\nThat is ...,0,ashamed horrible thing put talk page,ashamed horrible thing put talk page,ashamed horrible thing put talk page
159448,"Spitzer \n\nUmm, theres no actual article for ...",0,spitzer umm theres actual article prostitution...,spitzer umm there actual article prostitution ...,spitzer umm there actual article prostitution ...
159449,And it looks like it was actually you who put ...,0,looks like actually put speedy first version d...,look like actually put speedy first version de...,look like actually put speedy first version de...


**Conclusion:**

The following tasks were completed in this section:

- The comments were cleaned: letters were converted to lower case, only Latin letters were kept, and stop words were removed.
- The comments were lemmatized both without and with part-of-speech information (POS tags).

<a id=3></a>
[Contents](#0)
## 3. Data preparation

Prepare the datasets for training (without and with part-of-speech information) and the target:

In [18]:
corpus = data['text_wnl']
corpus_postag = data['text_wnl_postag']
y = data['toxic']

In [19]:
x_train, x_test, y_train, y_test = train_test_split(corpus, y,
     test_size = 0.2, stratify = y, shuffle = True, random_state = RND_STATE)

x_tag_train, x_tag_test, y_tag_train, y_tag_test = train_test_split(corpus_postag, y,
     test_size = 0.2, stratify = y, shuffle = True, random_state = RND_STATE)

print('Training set size:', x_train.shape)
print('Test set size:', x_test.shape)
print('Training set size (postag):', x_tag_train.shape)
print('Test set size (postag):', x_tag_test.shape)

Training set size: (127433,)
Test set size: (31859,)
Training set size (postag): (127433,)
Test set size (postag): (31859,)


In [20]:
count_tf_idf = TfidfVectorizer()
tf_idf_train = count_tf_idf.fit_transform(x_train)
tf_idf_test = count_tf_idf.transform(x_test)

count_tag_tf_idf = TfidfVectorizer()
tf_idf_tag_train = count_tag_tf_idf.fit_transform(x_tag_train)
tf_idf_tag_test = count_tag_tf_idf.transform(x_tag_test)

print('Training set size (tf_idf):', tf_idf_train.shape)
print('Test set size (tf_idf):', tf_idf_test.shape)
print('Training set size (tf_idf, postag):', tf_idf_tag_train.shape)
print('Test set size (tf_idf, postag):', tf_idf_tag_test.shape)

Training set size (tf_idf): (127433, 139051)
Test set size (tf_idf): (31859, 139051)
Training set size (tf_idf, postag): (127433, 132880)
Test set size (tf_idf, postag): (31859, 132880)
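The vectorizer is fitted on the training texts only and then applied to the test texts, so both matrices share the training vocabulary and words seen only in the test set are ignored. A toy illustration of this behaviour, with made-up sentences that are not from the dataset:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

train_texts = ['good edit thanks', 'bad edit revert', 'thanks for help']
test_texts = ['good help', 'unseen word edit']

vectorizer = TfidfVectorizer()
train_matrix = vectorizer.fit_transform(train_texts)  # learns vocabulary and IDF
test_matrix = vectorizer.transform(test_texts)        # reuses them unchanged

print(train_matrix.shape, test_matrix.shape)
print('unseen' in vectorizer.vocabulary_)
```

This is why the train and test matrices above have the same number of columns.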


<a id=4></a>
[Contents](#0)
## 4. Model training

Train the following models:

- LogisticRegression
- DecisionTreeClassifier
- LightGBM
- SGDClassifier
- CatBoostClassifier

Collect the results in `scores_data`.

Set the common cross-validation parameters.

Write a function that tunes hyperparameters for the models on either dataset (`text_wnl` or `text_wnl_postag`):

In [23]:
scores_data = []

In [24]:
kfold = KFold(n_splits = 8, random_state = RND_STATE, shuffle = True)

In [25]:
def fit_model(estimator, param_grid, param):
   
    model = GridSearchCV(
                         estimator = estimator,
                         param_grid = param_grid,
                         n_jobs = 12,
                         cv = kfold,
                         scoring = 'f1',
                        )

    if param == 'text_wnl':
        x = tf_idf_train
    else:
        x = tf_idf_tag_train
        
    model.fit(x, y_train)
    
    res = pd.DataFrame(model.cv_results_).iloc[model.best_index_]
    f1 = round(model.cv_results_['mean_test_score'][model.best_index_], 4)
    fit_time = round(res['mean_fit_time'], 3)
    score_time = round(res['mean_score_time'], 3)
    
#    print('Best parameters:', model.best_params_)
    print('Model fit time:', fit_time, 's')
    print('Model prediction time:', score_time, 's')
    print('F1 value:', f1)

    scores_data.append([estimator, f1, fit_time, score_time, param, model.best_params_])
    
    return model.best_estimator_, model.best_score_
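In `fit_model` the TF-IDF matrix is computed once on the whole training set before cross-validation, so each validation fold contributes to the IDF statistics. One way to keep the folds separate is to put the vectorizer and the classifier in a `Pipeline` (imported above but unused) and to use `StratifiedKFold`, which preserves the class ratio in every fold. This is a minimal sketch on a made-up toy corpus, not the notebook's actual procedure; the texts, labels and grid are illustrative.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline

# Toy corpus (illustrative only): 1 = toxic, 0 = not toxic
texts = ['you idiot', 'stupid idiot edit', 'idiot moron', 'stupid moron page',
         'thanks for help', 'nice edit thanks', 'good page', 'helpful good edit']
labels = [1, 1, 1, 1, 0, 0, 0, 0]

pipeline = Pipeline([
    ('tfidf', TfidfVectorizer()),                # refitted inside every CV fold
    ('clf', LogisticRegression(max_iter=1000)),
])

search = GridSearchCV(
    estimator=pipeline,
    param_grid={'clf__C': [0.1, 1, 10]},         # step name + '__' + parameter name
    cv=StratifiedKFold(n_splits=2, shuffle=True, random_state=0),
    scoring='f1',
)
search.fit(texts, labels)

print(search.best_params_)
print(search.predict(['stupid idiot']))
```

The price of this design is that the vectorizer is refitted for every fold and candidate, which increases search time on a corpus of this size.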

**4.1. LogisticRegression**

In [26]:
param_grid = {'C' : [0.01, 0.1, 1, 5, 10],
              'solver' : ['lbfgs', 'liblinear'],
              'max_iter' : [1000],
              'class_weight' : [{0 : 1, 1 : class_ratio}],
}

fit_model(LogisticRegression(), param_grid, 'text_wnl')

Model fit time: 12.581 s
Model prediction time: 0.021 s
F1 value: 0.7684


(LogisticRegression(C=5, class_weight={0: 1, 1: 8.841344371679229},
                    max_iter=1000, solver='liblinear'),
 0.768374536539798)

In [27]:
param_grid = {'C' : [0.01, 0.1, 1, 5, 10],
              'solver' : ['lbfgs', 'liblinear'],
              'max_iter' : [1000],
              'class_weight' : [{0 : 1, 1 : class_ratio}],
}

fit_model(LogisticRegression(), param_grid, 'text_wnl_postag')

Model fit time: 11.721 s
Model prediction time: 0.021 s
F1 value: 0.7654


(LogisticRegression(C=5, class_weight={0: 1, 1: 8.841344371679229},
                    max_iter=1000, solver='liblinear'),
 0.7653648623605365)

**4.2. DecisionTreeClassifier**

In [28]:
param_grid = {
    'max_depth': range(5, 33, 2),
    'min_samples_leaf': [ 1, 2, 3],
    'class_weight' : [{0 : 1, 1 : class_ratio}],
}

fit_model(DecisionTreeClassifier(), param_grid, 'text_wnl')

Model fit time: 73.451 s
Model prediction time: 0.041 s
F1 value: 0.6288


(DecisionTreeClassifier(class_weight={0: 1, 1: 8.841344371679229}, max_depth=27,
                        min_samples_leaf=3),
 0.6288048282158489)

In [29]:
param_grid = {
            'max_depth': range(5, 33, 2),
            'min_samples_leaf': [ 1, 2, 3],
            'class_weight' : [{0 : 1, 1 : class_ratio}],
            }

fit_model(DecisionTreeClassifier(), param_grid, 'text_wnl_postag')

Model fit time: 69.26 s
Model prediction time: 0.027 s
F1 value: 0.6497


(DecisionTreeClassifier(class_weight={0: 1, 1: 8.841344371679229}, max_depth=31,
                        min_samples_leaf=3),
 0.6497146172069246)

**4.3. LGBMClassifier**

In [30]:
param_grid = {
            'n_estimators': range(100, 351, 50),
            'learning_rate': [0.1, 0.3, 0.5],
            'max_depth': [5, 7, 9, 11, 13],
            'objective' : ['binary'],
            'class_weight' : [{0 : 1, 1 : class_ratio}],
            }

fit_model(LGBMClassifier(), param_grid, 'text_wnl')

[LightGBM] [Info] Number of positive: 12949, number of negative: 114484
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 1.517649 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 601426
[LightGBM] [Info] Number of data points in the train set: 127433, number of used features: 11090
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.500006 -> initscore=0.000022
[LightGBM] [Info] Start training from score 0.000022
Model fit time: 292.666 s
Model prediction time: 0.259 s
F1 value: 0.7743


(LGBMClassifier(class_weight={0: 1, 1: 8.841344371679229}, learning_rate=0.3,
                max_depth=11, n_estimators=300, objective='binary'),
 0.7742935988029956)
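Grid search cost grows with the product of the option counts. As a rough check of the LightGBM search above, the sketch below counts the candidate combinations and the resulting number of fits for 8 folds using only the standard library. `count_fits` is a hypothetical helper, and the grid values are copied from the cell above (the class weight is replaced by a placeholder string, since only the number of options matters).

```python
from math import prod

def count_fits(param_grid: dict, n_splits: int) -> tuple:
    """Return (number of candidates, number of cross-validation fits)."""
    n_candidates = prod(len(list(values)) for values in param_grid.values())
    return n_candidates, n_candidates * n_splits

lgbm_grid = {
    'n_estimators': range(100, 351, 50),
    'learning_rate': [0.1, 0.3, 0.5],
    'max_depth': [5, 7, 9, 11, 13],
    'objective': ['binary'],
    'class_weight': ['placeholder'],
}

candidates, fits = count_fits(lgbm_grid, n_splits=8)
print(candidates, fits)
```

`GridSearchCV` also refits the best candidate once on the full training set by default, so the total is one fit higher.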

In [31]:
param_grid = {
            'n_estimators': range(100, 351, 50),
            'learning_rate': [0.1, 0.3, 0.5],
            'max_depth': [5, 7, 9, 11, 13],
            'objective' : ['binary'],
            'class_weight' : [{0 : 1, 1 : class_ratio}],
            }

fit_model(LGBMClassifier(), param_grid, 'text_wnl_postag')

[LightGBM] [Info] Number of positive: 12949, number of negative: 114484
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 1.084486 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 549956
[LightGBM] [Info] Number of data points in the train set: 127433, number of used features: 9933
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.500006 -> initscore=0.000022
[LightGBM] [Info] Start training from score 0.000022
Model fit time: 313.697 s
Model prediction time: 0.308 s
F1 value: 0.7725


(LGBMClassifier(class_weight={0: 1, 1: 8.841344371679229}, learning_rate=0.3,
                max_depth=13, n_estimators=350, objective='binary'),
 0.7725092465344539)

**4.4. SGDClassifier**

In [32]:
param_grid = {
            'class_weight' : [{0 : 1, 1 : class_ratio}],
            'eta0' : [0.0, 0.1, 0.2],
            'learning_rate' : ['adaptive'],
            'loss': ['hinge', 'log_loss', 'modified_huber'],
            }

fit_model(SGDClassifier(), param_grid, 'text_wnl')

Model fit time: 5.818 s
Model prediction time: 0.018 s
F1 value: 0.7618


(SGDClassifier(class_weight={0: 1, 1: 8.841344371679229}, eta0=0.1,
               learning_rate='adaptive', loss='modified_huber'),
 0.7618110170134021)

In [33]:
param_grid = {
            'class_weight' : [{0 : 1, 1 : class_ratio}],
            'eta0' : [0.0, 0.1, 0.2],
            'learning_rate' : ['adaptive'],
            'loss': ['hinge', 'log_loss', 'modified_huber'],
            }

fit_model(SGDClassifier(), param_grid, 'text_wnl_postag')

Model fit time: 5.874 s
Model prediction time: 0.018 s
F1 value: 0.7594


(SGDClassifier(class_weight={0: 1, 1: 8.841344371679229}, eta0=0.1,
               learning_rate='adaptive', loss='modified_huber'),
 0.7593993517098043)

**4.5. CatBoostClassifier**

In [34]:
param_grid = {
            'depth': [5, 7, 9],
            'iterations': [200],
            'learning_rate': [0.1],
            }

fit_model(CatBoostClassifier(), param_grid, 'text_wnl')

(<catboost.core.CatBoostClassifier at 0x1b62a28f3b0>, 0.7059882499163044)

(<catboost.core.CatBoostClassifier at 0x2a852c0d0>, 0.7190066441650901)

In [35]:
param_grid = {
            'depth': [5, 7, 9],
            'iterations': [200],
            'learning_rate': [0.1],
            }

fit_model(CatBoostClassifier(), param_grid, 'text_wnl_postag') 

(<catboost.core.CatBoostClassifier at 0x1b639a3cfe0>, 0.6885310421815378)

(<catboost.core.CatBoostClassifier at 0x16bec4400>, 0.7219725835179578)

In [36]:
result = pd.DataFrame(scores_data)
result.columns = ['Model', 'F1', 'Fit_time', 'Score_time', 'Data', 'Param']
result.set_index('F1', inplace = True)
display(result.sort_values('F1', ascending = False))

F1,Model,Fit_time,Score_time,Data,Param
0.7743,LGBMClassifier(),292.666,0.259,text_wnl,"{'class_weight': {0: 1, 1: 8.841344371679229},..."
0.7725,LGBMClassifier(),313.697,0.308,text_wnl_postag,"{'class_weight': {0: 1, 1: 8.841344371679229},..."
0.7684,LogisticRegression(),12.581,0.021,text_wnl,"{'C': 5, 'class_weight': {0: 1, 1: 8.841344371..."
0.7654,LogisticRegression(),11.721,0.021,text_wnl_postag,"{'C': 5, 'class_weight': {0: 1, 1: 8.841344371..."
0.7618,SGDClassifier(),5.818,0.018,text_wnl,"{'class_weight': {0: 1, 1: 8.841344371679229},..."
0.7594,SGDClassifier(),5.874,0.018,text_wnl_postag,"{'class_weight': {0: 1, 1: 8.841344371679229},..."
0.706,<catboost.core.CatBoostClassifier object at 0x...,1192.983,1.822,text_wnl,"{'depth': 7, 'iterations': 200, 'learning_rate..."
0.6885,<catboost.core.CatBoostClassifier object at 0x...,813.71,2.707,text_wnl_postag,"{'depth': 5, 'iterations': 200, 'learning_rate..."
0.6497,DecisionTreeClassifier(),69.26,0.027,text_wnl_postag,"{'class_weight': {0: 1, 1: 8.841344371679229},..."
0.6288,DecisionTreeClassifier(),73.451,0.041,text_wnl,"{'class_weight': {0: 1, 1: 8.841344371679229},..."


**Conclusion:**

In this section:

Five models were trained:
- LogisticRegression
- DecisionTreeClassifier
- LGBMClassifier
- SGDClassifier
- CatBoostClassifier

each on data lemmatized without and with part-of-speech (POS) tags (`text_wnl` and `text_wnl_postag`, respectively).

LGBMClassifier showed the best F1 in the table above: 0.7743 on `text_wnl` and 0.7725 on `text_wnl_postag`.
The selected parameter set:

- 'class_weight': {0: 1, 1: 8.841344371679229}
- 'learning_rate': 0.3
- 'max_depth': 11
- 'n_estimators': 350
- 'objective': 'binary'

LogisticRegression (F1 = 0.7684) and SGDClassifier (F1 = 0.7618) also scored well on data lemmatized without POS tags, and they train and predict far faster: fit times of about 12.6 s and 5.8 s against 292.7 s for LGBMClassifier. If the choice is made on the combination of F1, Fit_time and Score_time, these two models are worth considering.
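The trade-off between F1 and run time can be made explicit. The snippet below is a minimal plain-Python sketch, not part of the original notebook: it copies the rows of the results table above, keeps the models that meet the F1 target of 0.75, and orders them by total fit and score time. The names `results`, `F1_TARGET` and `shortlist` are hypothetical, and ranking by the sum of the two times is only one possible criterion.

```python
# Rows copied from the results table above:
# (model, data, F1, fit time in seconds, score time in seconds).
results = [
    ("LGBMClassifier", "text_wnl", 0.7743, 292.666, 0.259),
    ("LGBMClassifier", "text_wnl_postag", 0.7725, 313.697, 0.308),
    ("LogisticRegression", "text_wnl", 0.7684, 12.581, 0.021),
    ("LogisticRegression", "text_wnl_postag", 0.7654, 11.721, 0.021),
    ("SGDClassifier", "text_wnl", 0.7618, 5.818, 0.018),
    ("SGDClassifier", "text_wnl_postag", 0.7594, 5.874, 0.018),
    ("CatBoostClassifier", "text_wnl", 0.7060, 1192.983, 1.822),
    ("CatBoostClassifier", "text_wnl_postag", 0.6885, 813.710, 2.707),
    ("DecisionTreeClassifier", "text_wnl_postag", 0.6497, 69.260, 0.027),
    ("DecisionTreeClassifier", "text_wnl", 0.6288, 73.451, 0.041),
]

F1_TARGET = 0.75  # quality threshold from the project description


def shortlist(rows, f1_target=F1_TARGET):
    """Keep rows that meet the F1 target, cheapest total time first."""
    passing = [row for row in rows if row[2] >= f1_target]
    return sorted(passing, key=lambda row: row[3] + row[4])


candidates = shortlist(results)
for model, data, f1, fit_s, score_s in candidates:
    print(f"{model:<20} {data:<16} F1={f1:.4f} fit={fit_s:8.3f}s score={score_s:.3f}s")
```

Under this criterion SGDClassifier and LogisticRegression come first and LGBMClassifier last among the models that pass the threshold, which is the same trade-off described in the paragraph above.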

<a id=5></a>
[Contents](#0)
## 5. Model testing

Let's check the quality of LGBMClassifier on the test set.

In [37]:
# Refit LGBMClassifier with the selected parameters on the POS-tagged TF-IDF features
model = LGBMClassifier(n_estimators = 350, class_weight = {0 : 1, 1 : class_ratio},
                     boosting_type = 'gbdt', learning_rate = 0.3, max_depth = 11, objective = 'binary')
model.fit(tf_idf_tag_train, y_train)
# Score the fitted model on the held-out test set
predicted = model.predict(tf_idf_tag_test)
f1 = f1_score(y_test, predicted)
print('F1 on the test data:', f1)

[LightGBM] [Info] Number of positive: 12949, number of negative: 114484
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 1.060950 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 549956
[LightGBM] [Info] Number of data points in the train set: 127433, number of used features: 9933
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.500006 -> initscore=0.000022
[LightGBM] [Info] Start training from score 0.000022
F1 on the test data: 0.7816091954022989
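
The `pavg=0.500006` line in the log can be checked by hand. The snippet below is a minimal sketch, assuming that `pavg` is the weighted share of positive examples (the notebook does not state this); the class counts come from the log and the positive-class weight from the parameter list. The variable names are illustrative only.

```python
# Class counts as reported by LightGBM in the log above.
n_pos = 12949
n_neg = 114484

# Weight of the positive class used in the notebook (from the parameter list).
class_ratio = 8.841344371679229

# Assumption: pavg is the weighted share of positives,
# with weight class_ratio for class 1 and weight 1 for class 0.
weighted_pos = n_pos * class_ratio
pavg = weighted_pos / (weighted_pos + n_neg)

# For comparison: the plain negative/positive ratio on this training split.
train_ratio = n_neg / n_pos

print(f"weighted share of positives: {pavg:.6f}")
print(f"negative/positive ratio on the training split: {train_ratio:.4f}")
```

Under that assumption the weighted share comes out at about 0.500006, matching the log: the class weight almost exactly balances the two classes. The ratio on this training split is about 8.8411, slightly different from `class_ratio`, which suggests the weight was computed on a different subset of the data; this is an inference, not something the notebook shows.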


**Conclusion:**

The results are as follows:

- The best model, **LGBMClassifier** (on data lemmatized with POS tags, with parameters {'class_weight': {0: 1, 1: 8.841344371679229}, 'learning_rate': 0.3, 'max_depth': 11, 'n_estimators': 350, 'objective': 'binary'}), reaches **F1 = 0.782** on the test set.
- **F1 on the test set is above 0.75**, which meets the project requirement.

<a id=6></a>
[Contents](#0)
## 6. General conclusion

The study aimed to train a model that classifies comments as positive or negative.

The input is a dataset labelled for the toxicity of edits.

The best model, LGBMClassifier (on data lemmatized with POS tags), reaches F1 = 0.782 on the test set.

F1 on the test set is above 0.75, which meets the project requirement.

The store can be advised to use the resulting **LGBMClassifier** model as the tool that detects toxic comments and sends them to moderation.
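
How such a tool might route comments can be sketched as follows. This is an illustration under stated assumptions, not the store's implementation: `route_comments` and `keyword_stub` are hypothetical names, and the stub stands in for the trained model so that the snippet runs on its own.

```python
from typing import Callable, Iterable, List, Sequence, Tuple


def route_comments(
    comments: Sequence[str],
    predict: Callable[[Sequence[str]], Iterable[int]],
) -> Tuple[List[str], List[str]]:
    """Split comments into (published, sent_to_moderation) using 0/1 predictions."""
    published: List[str] = []
    moderation: List[str] = []
    for text, label in zip(comments, predict(comments)):
        # Label 1 means "toxic", as in the `toxic` column of the dataset.
        (moderation if label == 1 else published).append(text)
    return published, moderation


def keyword_stub(comments: Sequence[str]) -> List[int]:
    """Hypothetical stand-in for the trained classifier, for illustration only."""
    return [1 if "idiot" in text.lower() else 0 for text in comments]


published, moderation = route_comments(
    ["Thanks, the description is clearer now.", "Only an idiot would write this."],
    keyword_stub,
)
print("published:", published)
print("moderation:", moderation)
```

In the notebook's setting, `predict` would presumably apply the same lemmatization and fitted TF-IDF transformation used in training and then call `model.predict` on the result.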