<h1>Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Подготовка" data-toc-modified-id="Подготовка-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Preparation</a></span></li><li><span><a href="#Обучение" data-toc-modified-id="Обучение-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Training</a></span></li><li><span><a href="#Выводы" data-toc-modified-id="Выводы-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Conclusions</a></span></li><li><span><a href="#Чек-лист-проверки" data-toc-modified-id="Чек-лист-проверки-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Review checklist</a></span></li></ul></div>

# Predicting Comment Toxicity

An American online store is launching a new service: users can now leave comments and edit or extend product descriptions. The store needs a tool that finds toxic comments and sends them for moderation.

Quality metric: F1 <br>
Required value: >= 0.75
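
As a reminder of what the metric measures, here is a minimal sketch that computes F1 from precision and recall by hand. The labels `y_true` and `y_pred` are made-up illustration data, not taken from the project dataset, and `f1_from_labels` is a hypothetical helper name.

```python
def f1_from_labels(y_true, y_pred):
    """Compute F1 for the positive class (label 1) from two label lists."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0  # no true positives: F1 is defined as 0 here
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    # F1 is the harmonic mean of precision and recall
    return 2 * precision * recall / (precision + recall)

# toy labels (illustration only): 3 toxic comments, 2 of them found, 1 false alarm
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
score = f1_from_labels(y_true, y_pred)
print(round(score, 3))
```

Because F1 ignores true negatives, it is a reasonable choice when the positive class is rare.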

**Data description**

The data is stored in the file `toxic_comments.csv`. The *text* column contains the comment text, and *toxic* is the target feature.

**Presentation**  
https://drive.google.com/file/d/1vMlDw9Vuw9t95teCoVzE8SIEIcSdaIvT/view?usp=share_link

## Preparation

In [None]:
# import libraries
import re
import sys
import warnings

import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import f1_score
from sklearn.feature_extraction.text import TfidfVectorizer

from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from catboost import CatBoostClassifier

from tqdm.notebook import tqdm
tqdm.pandas()  # enables Series.progress_apply

# uncomment to install spaCy and the English model
#!{sys.executable} -m pip install spacy
#!{sys.executable} -m spacy download en_core_web_sm
import spacy

import nltk
from nltk.corpus import stopwords as nltk_stopwords
nltk.download('stopwords')
# keep the name `stopwords` for the word set without shadowing the NLTK module
stopwords = set(nltk_stopwords.words('english'))

# silence library warnings to keep the output readable
warnings.filterwarnings('ignore')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\dronp\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [None]:
# load the data
try:
    df = pd.read_csv('C:/Users/dronp/Downloads/toxic_comments.csv')
except FileNotFoundError:
    # every later cell needs df, so report the problem and stop
    print('Failed to load the file!')
    raise

In [None]:
# drop the Unnamed: 0 column
df = df.drop('Unnamed: 0', axis=1)

In [None]:
df.head()

,text,toxic
0,Explanation\nWhy the edits made under my usern...,0
1,D'aww! He matches this background colour I'm s...,0
2,"Hey man, I'm really not trying to edit war. It...",0
3,"""\nMore\nI can't make any real suggestions on ...",0
4,"You, sir, are my hero. Any chance you remember...",0


In [None]:
df.tail()

,text,toxic
159287,""":::::And for the second time of asking, when ...",0
159288,You should be ashamed of yourself \n\nThat is ...,0
159289,"Spitzer \n\nUmm, theres no actual article for ...",0
159290,And it looks like it was actually you who put ...,0
159291,"""\nAnd ... I really don't think you understand...",0


In [None]:
df.shape

(159292, 2)

In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 159292 entries, 0 to 159291
Data columns (total 2 columns):
 #   Column  Non-Null Count   Dtype 
---  ------  --------------   ----- 
 0   text    159292 non-null  object
 1   toxic   159292 non-null  int64 
dtypes: int64(1), object(1)
memory usage: 2.4+ MB


In [None]:
# use spaCy for English text processing

# load the small English model; the parser and NER are disabled because lemmatization does not need them
nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])

def lemmatize(text):
    """Lowercase the text, lemmatize it, and keep only Latin letters."""
    doc = nlp(text.lower())
    lemm_text = ' '.join([token.lemma_ for token in doc])
    # replace every non-letter character with a space
    cleared_text = re.sub(r'[^a-zA-Z]', ' ', lemm_text)
    # collapse repeated whitespace
    return ' '.join(cleared_text.split())
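
The cleanup step at the end of `lemmatize` can be tried in isolation, without loading spaCy. The sketch below applies the same regular expression to a made-up, already lowercased string; `strip_non_letters` and `sample` are hypothetical names used only for illustration.

```python
import re

def strip_non_letters(text):
    """Replace everything except Latin letters with spaces, then collapse whitespace."""
    cleared = re.sub(r'[^a-zA-Z]', ' ', text)
    return ' '.join(cleared.split())

# made-up sample resembling a lemmatized comment with punctuation and a timestamp
sample = "d'aww! he match this background colour (talk) 21:51, january 11, 2016 (utc)"
cleaned = strip_non_letters(sample)
print(cleaned)
```

Digits, punctuation, and apostrophes all become word boundaries, which is why contractions end up split into separate tokens in the cleaned corpus.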

In [None]:
%%time
df['text'] = df['text'].progress_apply(lemmatize)

  0%|          | 0/159292 [00:00<?, ?it/s]

CPU times: total: 11min 41s
Wall time: 11min 39s


In [None]:
# build the corpus
corpus = df['text'].values
corpus

array(['explanation why the edit make under my username hardcore metallica fan be revert they be not vandalism just closure on some gas after I vote at new york dolls fac and please do not remove the template from the talk page since I be retire now',
       'd aww he match this background colour I be seemingly stick with thank talk january utc',
       'hey man I be really not try to edit war it be just that this guy be constantly remove relevant information and talk to I through edit instead of my talk page he seem to care more about the formatting than the actual info',
       ...,
       'spitzer umm there s no actual article for prostitution ring crunch captain',
       'and it look like it be actually you who put on the speedy to have the first version delete now that I look at it',
       'and I really do not think you understand I come here and my idea be bad right away what kind of community go you have bad idea go away instead of help rewrite they'],
      dtype=object)

In [None]:
df['toxic'].value_counts()

0    143106
1     16186
Name: toxic, dtype: int64

The classes are clearly imbalanced: roughly 1 toxic comment for every 9 non-toxic ones (16,186 versus 143,106).
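
To make the imbalance concrete, the sketch below computes the share of toxic comments from the counts above and the per-class weights that scikit-learn's `class_weight='balanced'` option derives with the formula `n_samples / (n_classes * class_count)`. It uses the full-dataset counts for illustration; during training the weights are derived from the labels actually passed to `fit`, so the exact values there will differ slightly.

```python
# class counts taken from the value_counts() output above
counts = {0: 143106, 1: 16186}

n_samples = sum(counts.values())
n_classes = len(counts)

# share of the positive (toxic) class
toxic_share = counts[1] / n_samples

# the 'balanced' heuristic: rarer classes get proportionally larger weights
balanced_weights = {
    label: n_samples / (n_classes * count) for label, count in counts.items()
}

print(f'toxic share: {toxic_share:.3f}')
print({label: round(weight, 3) for label, weight in balanced_weights.items()})
```

With these weights, a misclassified toxic comment costs the model roughly nine times as much as a misclassified non-toxic one.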

## Training

In [None]:
# separate the features from the target
X = corpus
y = df['toxic']

In [None]:
# split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=42)
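
The split above is purely random. With imbalanced classes, one optional variant is a stratified split, which keeps the class ratio identical in both parts. The sketch below shows the idea on synthetic labels with a 10% positive class; it is not what this notebook runs, and switching to it would change the reported numbers.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# synthetic data (illustration only): 900 negatives and 100 positives
y = np.array([0] * 900 + [1] * 100)
X = np.arange(len(y)).reshape(-1, 1)

# stratify=y preserves the 10% positive share in both the train and test parts
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, random_state=42, stratify=y
)

print('train positive share:', y_train.mean())
print('test positive share:', y_test.mean())
```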

In [None]:
%%time
# convert the texts into TF-IDF vectors with TfidfVectorizer
m_tf_idf = TfidfVectorizer(stop_words=list(stopwords))
X_train = m_tf_idf.fit_transform(X_train) #Learn vocabulary and idf, return document-term matrix.
X_test = m_tf_idf.transform(X_test) #Transform documents to document-term matrix.

CPU times: total: 4.77 s
Wall time: 4.76 s


In [None]:
print('Training set size', X_train.shape)
print('Test set size', X_test.shape)

Training set size (143362, 144000)
Test set size (15930, 144000)
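
Here the vectorizer is fitted once on the whole training set, so during cross-validation each validation fold has already contributed to the vocabulary and IDF weights. A common alternative, sketched below on a tiny made-up corpus, is to put the vectorizer and the classifier into a scikit-learn `Pipeline` so that TF-IDF is refitted inside every fold. The texts, labels, and parameter grid are illustrative assumptions, not the notebook's actual setup.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# tiny made-up corpus (illustration only): 1 = toxic, 0 = not toxic
texts = [
    'you be a stupid idiot', 'thank you for the helpful edit',
    'shut up you idiot', 'please see the talk page',
    'what a stupid article', 'I add a source to the article',
    'nobody like you idiot', 'good point thank for the fix',
]
labels = [1, 0, 1, 0, 1, 0, 1, 0]

# the vectorizer is a pipeline step, so it is fitted on the training folds only
pipe = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('clf', LogisticRegression(max_iter=500)),
])

# parameters of a step are addressed as <step name>__<parameter>
grid = GridSearchCV(pipe, {'clf__C': [1, 12]}, cv=2, scoring='f1')
grid.fit(texts, labels)

print(grid.best_params_)
print(grid.predict(['you idiot']))
```

The fitted `grid` accepts raw strings at prediction time, so there is no separate `transform` step to keep in sync.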


<b>Logistic regression</b>

In [None]:
%%time
lr = LogisticRegression(random_state=42, max_iter=500, solver='saga')
param_lr = {'class_weight':['balanced', None],
    'C':[1,2,12,24]}

grid_lr = GridSearchCV(lr, param_lr, cv=5, scoring='f1', n_jobs=-1, verbose=1)
grid_lr.fit(X_train, y_train)

Fitting 5 folds for each of 8 candidates, totalling 40 fits
CPU times: total: 8.52 s
Wall time: 2min 52s


In [None]:
print('Best parameters:', grid_lr.best_params_)

Best parameters: {'C': 24, 'class_weight': None}


In [None]:
print('Model quality on cross-validation:', grid_lr.best_score_)

Model quality on cross-validation: 0.7714955094084629


<b>Decision tree</b>

In [None]:
%%time
tree = DecisionTreeClassifier(random_state=42, min_samples_leaf=5)
param_tree = {'max_depth':[64,128],
             'class_weight': ['balanced', None]}
grid_tree = GridSearchCV(tree, param_tree, cv=3, scoring='f1', n_jobs=-1, verbose=1)
grid_tree.fit(X_train, y_train)

Fitting 3 folds for each of 4 candidates, totalling 12 fits
CPU times: total: 1min 4s
Wall time: 2min 53s


In [None]:
print('Best parameters:', grid_tree.best_params_)

Best parameters: {'class_weight': None, 'max_depth': 128}


In [None]:
print('Model quality on cross-validation:', grid_tree.best_score_)

Model quality on cross-validation: 0.7223861196757221


<b>CatBoostClassifier</b>

In [None]:
%%time
cat = CatBoostClassifier(random_seed=42, learning_rate=0.01, eval_metric='F1', iterations=200)
param_cat = {'auto_class_weights': ['SqrtBalanced', None]}

cat_grid_search = cat.grid_search(param_cat,
                                       X=X_train,
                                       y=y_train,
                                       cv=3,
                                       train_size=0.9)

0:	learn: 0.5402940	test: 0.5384487	best: 0.5384487 (0)	total: 713ms	remaining: 2m 21s
1:	learn: 0.5867431	test: 0.5872468	best: 0.5872468 (1)	total: 1.28s	remaining: 2m 6s
2:	learn: 0.5570865	test: 0.5610375	best: 0.5872468 (1)	total: 1.85s	remaining: 2m 1s
3:	learn: 0.5821068	test: 0.5802375	best: 0.5872468 (1)	total: 2.4s	remaining: 1m 57s
4:	learn: 0.5852993	test: 0.5866487	best: 0.5872468 (1)	total: 2.97s	remaining: 1m 55s
5:	learn: 0.5830356	test: 0.5840546	best: 0.5872468 (1)	total: 3.54s	remaining: 1m 54s
6:	learn: 0.5835837	test: 0.5851408	best: 0.5872468 (1)	total: 4.13s	remaining: 1m 53s
7:	learn: 0.5832911	test: 0.5838403	best: 0.5872468 (1)	total: 4.71s	remaining: 1m 53s
8:	learn: 0.5828948	test: 0.5830436	best: 0.5872468 (1)	total: 5.28s	remaining: 1m 52s
9:	learn: 0.5837370	test: 0.5860330	best: 0.5872468 (1)	total: 5.86s	remaining: 1m 51s
10:	learn: 0.5831887	test: 0.5837434	best: 0.5872468 (1)	total: 6.41s	remaining: 1m 50s
11:	learn: 0.5838993	test: 0.5850438	best: 0.

In [None]:
print('Best parameters:', cat.get_params())

Best parameters: {'iterations': 200, 'learning_rate': 0.01, 'random_seed': 42, 'eval_metric': 'F1', 'auto_class_weights': 'SqrtBalanced'}


In [None]:
results_df = pd.DataFrame(cat_grid_search['cv_results'])
# pick the boosting iteration with the highest mean F1 across folds
idx = results_df['test-F1-mean'].idxmax()
f1_score_cross_val = results_df.loc[idx, 'test-F1-mean']
print('Model quality on cross-validation:', f1_score_cross_val)

Model quality on cross-validation: 0.6327880071568569


## Conclusions

Logistic regression was the best model on cross-validation (F1 ≈ 0.77, versus ≈ 0.72 for the decision tree and ≈ 0.63 for CatBoost). Let's check its quality on the test set.

In [None]:
pred = grid_lr.best_estimator_.predict(X_test)
print('Logistic regression quality on the test set:', f1_score(y_test, pred))

Logistic regression quality on the test set: 0.7897227856659906




Logistic regression met the requirement: its F1 on the test set is about 0.79, above the required 0.75.