# Анализ токсичности комментариев

Интернет-магазин запускает новый сервис. Теперь пользователи могут редактировать и дополнять описания товаров, как в вики-сообществах. То есть клиенты предлагают свои правки и комментируют изменения других. Магазину нужен инструмент, который будет искать токсичные комментарии и отправлять их на модерацию. 

Обучим модель классифицировать комментарии на позитивные и негативные. В вашем распоряжении набор данных с разметкой о токсичности правок.

Значение метрики качества *F1*  должно быть не меньше 0.75. 

**План работы**  
1. Изучим данные;
2. Очистим и лемматизируем текст;
3. Подготовим признаки;
4. Подберем гиперпараметры и обучим модели классификации;
5. Определим лучшую модель и проверим ее качество на тестовой выборке.

In [1]:
import numpy as np
import pandas as pd

from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import f1_score
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.dummy import DummyClassifier

import re

import nltk
from nltk.corpus import stopwords as nltk_stopwords

import spacy

import lightgbm as lgb

from fast_ml.model_development import train_valid_test_split

## Подготовка

In [2]:
df = pd.read_csv('toxic_comments.csv', index_col=0)
df.head()

Unnamed: 0,text,toxic
0,Explanation\nWhy the edits made under my usern...,0
1,D'aww! He matches this background colour I'm s...,0
2,"Hey man, I'm really not trying to edit war. It...",0
3,"""\nMore\nI can't make any real suggestions on ...",0
4,"You, sir, are my hero. Any chance you remember...",0


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 159292 entries, 0 to 159450
Data columns (total 2 columns):
 #   Column  Non-Null Count   Dtype 
---  ------  --------------   ----- 
 0   text    159292 non-null  object
 1   toxic   159292 non-null  int64 
dtypes: int64(1), object(1)
memory usage: 3.6+ MB


In [4]:
df['toxic'].value_counts()

0    143106
1     16186
Name: toxic, dtype: int64

Обнаружили дисбаланс классов.

### Очистим текст

In [5]:
def clear_text(text):
    clear_text = re.sub(r'[^a-zA-Z]', ' ', text)
    clear_text_split = clear_text.split()
    return ' '.join(clear_text_split)

In [6]:
df['clear_text'] = df['text'].apply(clear_text)
df.head()

Unnamed: 0,text,toxic,clear_text
0,Explanation\nWhy the edits made under my usern...,0,Explanation Why the edits made under my userna...
1,D'aww! He matches this background colour I'm s...,0,D aww He matches this background colour I m se...
2,"Hey man, I'm really not trying to edit war. It...",0,Hey man I m really not trying to edit war It s...
3,"""\nMore\nI can't make any real suggestions on ...",0,More I can t make any real suggestions on impr...
4,"You, sir, are my hero. Any chance you remember...",0,You sir are my hero Any chance you remember wh...


### Лемматизируем текст

Используем библиотеку SpaCy.

In [7]:
nlp = spacy.load("en_core_web_sm", disable=['parser', 'ner'])

In [8]:
# SpaCy
def lemma(text):
    text = text.lower()
    doc = nlp(text)
    lemma_list = [token.lemma_ for token in doc]
    return ' '.join(lemma_list)

In [9]:
%%time
df['lemm_text'] = df['clear_text'].apply(lemma)  # 26 минут

CPU times: total: 25min 19s
Wall time: 26min 2s


In [10]:
df.head(2)

Unnamed: 0,text,toxic,clear_text,lemm_text
0,Explanation\nWhy the edits made under my usern...,0,Explanation Why the edits made under my userna...,explanation why the edit make under my usernam...
1,D'aww! He matches this background colour I'm s...,0,D aww He matches this background colour I m se...,d aww he match this background colour I m seem...


Разобьем данные на обучающую, валидационную и тестовую выборки в соотношении 3:1:1

In [11]:
df = df.drop(['text', 'clear_text'], axis=1)  # удалим лишние признаки

In [12]:
X_train, y_train, X_valid, y_valid, X_test, y_test = train_valid_test_split(df, target = 'toxic',
                                                                            train_size=0.6, valid_size=0.2, 
                                                                            test_size=0.2, random_state=123)

Проверим размер выборок.

In [13]:
display(X_train.shape)
y_train.shape

(95575, 1)

(95575,)

In [14]:
display(X_valid.shape)
y_valid.shape

(31858, 1)

(31858,)

In [15]:
display(X_test.shape)
y_test.shape

(31859, 1)

(31859,)

### Создадим признаки с учетом важности слова с помощью величины TF-IDF.

In [16]:
nltk.download('stopwords')  # загрузим стоп-слова
stopwords = nltk_stopwords.words('english')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\kolyk\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [17]:
count_tf_idf = TfidfVectorizer(stop_words=stopwords)  # удалим стоп-слова
feat_train = count_tf_idf.fit_transform(X_train['lemm_text'])
feat_valid = count_tf_idf.transform(X_valid['lemm_text'])
feat_test = count_tf_idf.transform(X_test['lemm_text'])

# Проверим размерность
print(feat_train.shape)
print(feat_valid.shape)
print(feat_test.shape)

(95575, 111644)
(31858, 111644)
(31859, 111644)


**Вывод**  
На данном этапе мы:
1. Ознакомились с данными;
2. Очистили и лемматизировали текст;
3. Разбили данные на обучающую, валидационную и тестовую выборки.
4. Создали новые признаки с помощью TF-IDF.

## Обучение

Будем решать задачу классификации используя следующие модели:

- Решающее дерево
- Логистическая регрессия
- LightGBM

### Дерево решений

In [18]:
%%time
model = DecisionTreeClassifier(random_state=1, class_weight='balanced')
param = {'criterion': ['gini', 'entropy'], 'max_depth': range(10, 15)}
gscv = GridSearchCV(model, param, scoring='f1', cv=5)
gscv.fit(feat_train, y_train)
print('Лучшие параметры', gscv.best_params_)
best_tree = gscv.best_estimator_
tree_pred = best_tree.predict(feat_valid)
score = f1_score(y_valid, tree_pred)
print('Tree f1_score:', score)
print()  # 22 min

Лучшие параметры {'criterion': 'gini', 'max_depth': 14}
Tree f1_score: 0.6082434514637904

CPU times: total: 23min 36s
Wall time: 24min 27s


### Логистическая регрессия

In [19]:
%%time
model = LogisticRegression(class_weight='balanced')
param = {'C': [0.1, 1, 10], 'solver': ['liblinear', 'lbfgs']}
gscv = GridSearchCV(model, param, scoring='f1', cv=5)
gscv.fit(feat_train, y_train)
print('Лучшие параметры', gscv.best_params_)
best_forest = gscv.best_estimator_
model.fit(feat_train, y_train)
lr_pred = model.predict(feat_valid)
score = f1_score(y_valid, lr_pred)
print('LogisticR f1_score:', score)
print()

Лучшие параметры {'C': 10, 'solver': 'liblinear'}
LogisticR f1_score: 0.7453032742887816

CPU times: total: 5min 17s
Wall time: 1min 23s


### LightGBM

In [20]:
%%time
clf = lgb.LGBMClassifier()
param = {'num_leaves': [20, 30], 'learning_rate': [0.1, 0.3]}
gscv = GridSearchCV(clf, param, scoring='f1', cv=5)
gscv.fit(feat_train, y_train)
print('Лучшие параметры', gscv.best_params_)
best_light = gscv.best_estimator_
best_light.fit(feat_train, y_train)
light_pred = best_light.predict(feat_valid)
score = f1_score(y_valid, light_pred)
print('LGBM f1_score:', score)
print()

[LightGBM] [Info] Number of positive: 7809, number of negative: 68651
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 1.108648 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 375302
[LightGBM] [Info] Number of data points in the train set: 76460, number of used features: 7198
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.102132 -> initscore=-2.173759
[LightGBM] [Info] Start training from score -2.173759
[LightGBM] [Info] Number of positive: 7809, number of negative: 68651
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 1.111803 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 372838
[LightGBM] [Info] Number of data points in the train set: 76460, number of used features: 7144
[LightGBM] [

[LightGBM] [Info] Number of positive: 7809, number of negative: 68651
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 1.115682 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 372838
[LightGBM] [Info] Number of data points in the train set: 76460, number of used features: 7144
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.102132 -> initscore=-2.173759
[LightGBM] [Info] Start training from score -2.173759
[LightGBM] [Info] Number of positive: 7809, number of negative: 68651
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 1.118228 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 374826
[LightGBM] [Info] Number of data points in the train set: 76460, number of used features: 7180
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.102132 -> initscore=-2.173

Проверим нашу лучшую модель на тестовой выборке.

In [21]:
best_light.fit(feat_train, y_train)
light_pred_test = best_light.predict(feat_test)
score = f1_score(y_test, light_pred_test)
print('LGBM f1_score:', score)
print()

[LightGBM] [Info] Number of positive: 9761, number of negative: 85814
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 1.592187 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 440554
[LightGBM] [Info] Number of data points in the train set: 95575, number of used features: 8273
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.102129 -> initscore=-2.173787
[LightGBM] [Info] Start training from score -2.173787
LGBM f1_score: 0.775856105153926



**Вывод**  
На данном шаге мы рассмотрели 3 модели для решения задачи классификации, подобрали гиперпараметры и получили значение f1_score для модели LightGBM = `0.78`

## Общий вывод

В данном проекте мы выполнили следующие этапы:

1. Изучили дынные;
2. Подготовили данные для обучения моделей (очистили и лемматизировали текст, добавили принаки);
3. Подобрали гиперпараметры для моделей дерева решений, логистической регрессии и LightGBM;
4. Проанализировали качество моделей;
5. Определили модель, которая удовлетворяет условию - `LightGBM`. `f1_score` - `0.78`.