Table of contents.
- [Step 1. Project description.](#Step_1)
- [Step 1.1. Preparation.](#Step_2)
- [Step 1.2. Training.](#Step_3)
- [Step 1.3. Conclusions.](#Step_4)

<a id='Step_1'></a>
# Classification of online store comments

An online store is developing a service in which users can edit and supplement product descriptions, as in wiki communities. In other words, users themselves describe the products. The service therefore needs a tool that flags toxic comments so that moderators can review them.

In this project, we train a model to classify comments as toxic or non-toxic.

<a id='Step_2'></a>
## Preparation

Load the libraries.

In [None]:
import pandas as pd
from pymystem3 import Mystem
import re
import nltk
from nltk.corpus import stopwords as nltk_stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
# make_scorer is imported from sklearn.metrics; the old sklearn.metrics.scorer
# module is not available in recent scikit-learn releases.
from sklearn.metrics import f1_score, make_scorer
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from lightgbm import LGBMClassifier
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet
from sklearn.model_selection import train_test_split
import numpy as np
import catboost
from catboost import CatBoostClassifier

Load the dataset.

In [None]:
corpus = pd.read_csv('/datasets/toxic_comments.csv')

In [None]:
corpus.head()

Unnamed: 0,text,toxic
0,Explanation\nWhy the edits made under my usern...,0
1,D'aww! He matches this background colour I'm s...,0
2,"Hey man, I'm really not trying to edit war. It...",0
3,"""\nMore\nI can't make any real suggestions on ...",0
4,"You, sir, are my hero. Any chance you remember...",0


Prepare functions for lemmatization and text cleaning.

In [None]:
lemmatizer = WordNetLemmatizer()

In [None]:
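def get_wordnet_pos(word):
    # Tag the single word and keep the first letter of its Penn Treebank tag.
    tag = nltk.pos_tag([word])[0][1][0].upper()
    # Map that letter to a WordNet part of speech; default to noun.
    tag_dict = {"J": wordnet.ADJ,
                "N": wordnet.NOUN,
                "V": wordnet.VERB,
                "R": wordnet.ADV}
    return tag_dict.get(tag, wordnet.NOUN)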

In [None]:
def lemmatize(text):
    # Tokenize, lemmatize each token with its part of speech, then lowercase the result.
    return " ".join([lemmatizer.lemmatize(w, get_wordnet_pos(w)) for w in nltk.word_tokenize(text)]).lower()

In [None]:
def clear_text(text):
    return " ".join((re.sub(r'[^a-zA-Z ]', ' ', text)).split())  
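
To make the effect of the cleaning step concrete, here is a minimal sketch that applies the same regular expression to a made-up sample string (the sample is an assumption for illustration, not a row from the dataset). Note that punctuation inside words, such as an apostrophe, is also replaced by a space, so contractions are split into separate tokens.

```python
import re


def clear_text(text):
    # Replace every character that is not an ASCII letter or a space with a space,
    # then collapse runs of whitespace (including newlines) into single spaces.
    return " ".join(re.sub(r'[^a-zA-Z ]', ' ', text).split())


# Hypothetical sample string, used only for illustration.
sample = "D'aww! He matches\nthis background colour, 100%."
cleaned = clear_text(sample)
print(cleaned)  # D aww He matches this background colour
```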

Clean the text.

In [None]:
corpus['text'] = corpus['text'].apply(clear_text)

Lemmatize the text.

In [None]:
corpus['text'] = corpus['text'].apply(lemmatize)

Split the data into training and test sets.

In [None]:
X = corpus.text
y = corpus.toxic
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

Perform TF-IDF vectorization.

In [None]:
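# Recent scikit-learn releases expect stop_words to be a list (or 'english' / None), not a set.
stopwords = list(nltk_stopwords.words('english'))
vectorizer = TfidfVectorizer(stop_words=stopwords)
# Fit the vocabulary and IDF weights on the training texts only, then reuse them for the test texts.
X_train = vectorizer.fit_transform(X_train)
X_test = vectorizer.transform(X_test)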
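
The fit-on-train, transform-on-test pattern can be illustrated with a simplified pure-Python sketch. It assumes the smoothed IDF formula `ln((1 + n) / (1 + df)) + 1` followed by L2 normalization, which is my understanding of the `TfidfVectorizer` defaults. Tokenization here is a plain whitespace split and stop words are ignored, so the numbers will not match scikit-learn exactly. The function names `fit_idf` and `transform` and the toy documents are hypothetical.

```python
import math
from collections import Counter


def fit_idf(docs):
    # Learn IDF weights from the training documents only.
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc.split()))
    return {term: math.log((1 + n) / (1 + count)) + 1 for term, count in df.items()}


def transform(doc, idf):
    # Terms that were not seen during fitting are dropped, as they have no IDF weight.
    counts = Counter(t for t in doc.split() if t in idf)
    weights = {t: c * idf[t] for t, c in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()} if norm else {}


# Hypothetical toy corpus, used only for illustration.
train_docs = ["you be my hero", "you be rude", "thank you hero"]
idf = fit_idf(train_docs)
vec = transform("rude unknown hero", idf)
print(vec)
```

The rarer training term ("rude") receives a larger weight than the more common one ("hero"), and the word that never appeared in the training corpus contributes nothing to the test vector.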

<a id='Step_3'></a>
## Training

Logistic regression.

In [None]:
%%time
model = LogisticRegression()
model.fit(X_train, y_train)
prediction = model.predict(X_test)
print('F1:', f1_score(y_test, prediction))



F1: 0.7374851013110846
CPU times: user 5.32 s, sys: 3.4 s, total: 8.72 s
Wall time: 8.73 s
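
For reference, the F1 score is the harmonic mean of precision and recall for the positive (toxic) class, which reduces to `2·TP / (2·TP + FP + FN)`. The sketch below computes it by hand on made-up labels (the helper `f1_binary` and the label lists are hypothetical). For a binary target, swapping the two arguments only exchanges FP and FN, so the value stays the same, although scikit-learn documents the order as `f1_score(y_true, y_pred)`.

```python
def f1_binary(y_true, y_pred):
    # Count true positives, false positives and false negatives for class 1.
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    return 2 * tp / (2 * tp + fp + fn)


# Hypothetical labels, used only for illustration.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0, 1, 0]

score = f1_binary(y_true, y_pred)
print(score)  # 0.75 (TP = 3, FP = 1, FN = 1)
```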


LGBM.

In [None]:
%%time
# n_estimators is the scikit-learn style name for the number of boosting iterations.
model = LGBMClassifier(random_state=0, n_estimators=250)
model.fit(X_train, y_train)
prediction = model.predict(X_test)
print('F1:', f1_score(y_test, prediction))



F1: 0.7697902097902097
CPU times: user 8min 51s, sys: 997 ms, total: 8min 52s
Wall time: 8min 57s


CatBoost.

In [None]:
%%time
model = CatBoostClassifier(loss_function="Logloss", iterations=175) 
model.fit(X_train, y_train)
prediction = model.predict(X_test)
print('F1:', f1_score(y_test, prediction))

Learning rate set to 0.306659
0:	learn: 0.4371776	total: 5.92s	remaining: 17m 10s
1:	learn: 0.3228587	total: 10.5s	remaining: 15m 10s
2:	learn: 0.2740831	total: 15.1s	remaining: 14m 26s
3:	learn: 0.2462572	total: 19.8s	remaining: 14m 7s
4:	learn: 0.2329068	total: 24.4s	remaining: 13m 50s
5:	learn: 0.2254028	total: 29s	remaining: 13m 37s
6:	learn: 0.2189279	total: 33.7s	remaining: 13m 29s
7:	learn: 0.2142304	total: 38.4s	remaining: 13m 22s
8:	learn: 0.2105194	total: 43.1s	remaining: 13m 15s
9:	learn: 0.2063897	total: 47.8s	remaining: 13m 9s
10:	learn: 0.2035294	total: 52.6s	remaining: 13m 4s
11:	learn: 0.2008108	total: 57.3s	remaining: 12m 58s
12:	learn: 0.1977061	total: 1m 2s	remaining: 12m 52s
13:	learn: 0.1952309	total: 1m 6s	remaining: 12m 44s
14:	learn: 0.1917842	total: 1m 11s	remaining: 12m 40s
15:	learn: 0.1900512	total: 1m 15s	remaining: 12m 34s
16:	learn: 0.1881420	total: 1m 20s	remaining: 12m 28s
17:	learn: 0.1859558	total: 1m 25s	remaining: 12m 23s
18:	learn: 0.1841839	total:

<a id='Step_4'></a>
## Conclusions

In this project, we prepared the data:
- text cleaning;
- lemmatization;
- splitting into training and test sets.

We also trained three models:
- logistic regression;
- LightGBM classifier;
- CatBoost classifier.

LightGBM reached the required metric level with F1 = 0.7698 (the task requires F1 > 0.75). LightGBM also trained faster than CatBoost. Better results are probably achievable with both LightGBM and CatBoost by tuning their parameters, most likely at the cost of longer training. With the current parameters, LightGBM is the best model. Logistic regression was faster than the other models but showed the worst quality (F1 = 0.7375).
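
One possible way to carry out the parameter tuning mentioned above is a grid search with F1 as the selection metric. The sketch below is only an illustration under stated assumptions: the toy texts, the labels and the grid of `C` values are hypothetical, and logistic regression stands in for the slower boosting models. Wrapping the vectorizer and the classifier in a `Pipeline` keeps the TF-IDF fitting inside each cross-validation fold.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Hypothetical toy data: 1 = toxic, 0 = not toxic.
texts = [
    "you be stupid", "stupid idiot go away", "what a stupid idiot",
    "you idiot shut up", "shut up stupid", "go away idiot",
    "thank you for the edit", "nice article thank you", "great work on the page",
    "please see the talk page", "thank you great edit", "nice work on the article",
]
labels = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Hypothetical grid: only the regularization strength is varied.
param_grid = {"clf__C": [0.1, 1.0, 10.0]}

# scoring="f1" selects the parameters with the best cross-validated F1.
search = GridSearchCV(pipeline, param_grid, scoring="f1", cv=3)
search.fit(texts, labels)

print(search.best_params_, search.best_score_)
```

On the real dataset the same pattern applies, but each additional grid point multiplies the training time, which matters for the boosting models timed above.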