<h1>Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Цели-и-ход-проекта" data-toc-modified-id="Цели-и-ход-проекта-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Project goals and workflow</a></span></li><li><span><a href="#Подготовка" data-toc-modified-id="Подготовка-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Preparation</a></span><ul class="toc-item"><li><span><a href="#Загрузка-и-подготовка-данных." data-toc-modified-id="Загрузка-и-подготовка-данных.-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span>Loading and preparing the data</a></span></li><li><span><a href="#Подготовка-данных" data-toc-modified-id="Подготовка-данных-2.2"><span class="toc-item-num">2.2&nbsp;&nbsp;</span>Data preparation</a></span></li></ul></li><li><span><a href="#Обучение" data-toc-modified-id="Обучение-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Training</a></span><ul class="toc-item"><li><span><a href="#Логистическая-регрессия" data-toc-modified-id="Логистическая-регрессия-3.1"><span class="toc-item-num">3.1&nbsp;&nbsp;</span>Logistic regression</a></span></li><li><span><a href="#LinearSVC" data-toc-modified-id="LinearSVC-3.2"><span class="toc-item-num">3.2&nbsp;&nbsp;</span>LinearSVC</a></span></li><li><span><a href="#CatBoostClassifier" data-toc-modified-id="CatBoostClassifier-3.3"><span class="toc-item-num">3.3&nbsp;&nbsp;</span>CatBoostClassifier</a></span></li></ul></li><li><span><a href="#Выводы" data-toc-modified-id="Выводы-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Conclusions</a></span></li><li><span><a href="#Чек-лист-проверки" data-toc-modified-id="Чек-лист-проверки-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Review checklist</a></span></li></ul></div>

# Comment classification

## Project goals and workflow

An online store is launching a new service. Users can now edit and extend product descriptions, as in wiki communities: customers propose their own edits and comment on changes made by others. The store needs a tool that finds toxic comments and sends them for moderation.

In this project we prepare a set of text data labeled for toxicity and use it to train a classification model.

**Project goal**

Train a model that classifies comments as positive or negative (that is, non-toxic or toxic).

**Project workflow**

1. Load and prepare the data.
2. Train several models.
3. Draw conclusions.

## Preparation

### Loading and preparing the data

In [46]:
# import libraries
import numpy as np
import pandas as pd

import nltk
from nltk.corpus import stopwords as nltk_stopwords

import spacy

import re

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import f1_score
from sklearn.dummy import DummyClassifier

from catboost import CatBoostClassifier, Pool, cv

In [3]:
import sys
!{sys.executable} -m pip install spacy
!{sys.executable} -m spacy download en_core_web_sm

!pip install -q optuna
import optuna

Collecting en-core-web-sm==3.2.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.2.0/en_core_web_sm-3.2.0-py3-none-any.whl (13.9 MB)
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


In [4]:
# open the file: try the local path first, then fall back to the remote copy
try:
    df = pd.read_csv('/datasets/toxic_comments.csv')
except FileNotFoundError:
    df = pd.read_csv('https://code.s3.yandex.net/datasets/toxic_comments.csv')

In [5]:
# inspect the general information
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 159292 entries, 0 to 159291
Data columns (total 3 columns):
 #   Column      Non-Null Count   Dtype 
---  ------      --------------   ----- 
 0   Unnamed: 0  159292 non-null  int64 
 1   text        159292 non-null  object
 2   toxic       159292 non-null  int64 
dtypes: int64(2), object(1)
memory usage: 3.6+ MB


In [6]:
df.head()

,Unnamed: 0,text,toxic
0,0,Explanation\nWhy the edits made under my usern...,0
1,1,D'aww! He matches this background colour I'm s...,0
2,2,"Hey man, I'm really not trying to edit war. It...",0
3,3,"""\nMore\nI can't make any real suggestions on ...",0
4,4,"You, sir, are my hero. Any chance you remember...",0


In [7]:
# check for missing values
df.isna().sum()

Unnamed: 0    0
text          0
toxic         0
dtype: int64

In [8]:
# and for duplicates
df.duplicated().sum()

0

In [9]:
# examine the class balance of the target
df['toxic'].value_counts()

0    143106
1     16186
Name: toxic, dtype: int64

In [10]:
df['toxic'].value_counts(normalize = True)

0    0.898388
1    0.101612
Name: toxic, dtype: float64

The dataset contains **159292** rows and three columns: an index column `Unnamed: 0`, the comment `text`, and the target `toxic`.

There are no missing values or duplicate rows.

The target is imbalanced: class 0 accounts for almost 90% of the rows.
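
One common way to compensate for such an imbalance is to weight classes inversely to their frequency. The sketch below reproduces, in plain Python, the formula scikit-learn documents for `class_weight='balanced'`, namely `n_samples / (n_classes * count)`, using the class counts printed above. The helper name `balanced_class_weights` is hypothetical and exists only for this illustration.

```python
def balanced_class_weights(class_counts):
    """Return weights n_samples / (n_classes * count) for each class."""
    n_samples = sum(class_counts.values())
    n_classes = len(class_counts)
    return {label: n_samples / (n_classes * count)
            for label, count in class_counts.items()}

# Class counts taken from df['toxic'].value_counts() above
weights = balanced_class_weights({0: 143106, 1: 16186})
print(weights)
```

With these counts the toxic class gets a weight of roughly 4.9 and the non-toxic class roughly 0.56, so both classes contribute the same total weight to the loss.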

### Data preparation

In [11]:
# function that lemmatizes a text and strips non-letter characters and extra whitespace

nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])

def lemmatize_text(text):
    lemm_text = ' '.join([token.lemma_ for token in nlp(text.lower())]) 
    clear_text = ' '.join(re.sub(r'[^a-zA-Z]', ' ', lemm_text).split())
    return clear_text
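
To see what the cleaning step does on its own, here is a small illustration of the same regular expression applied to a raw string, without the spaCy lemmatization. `clean_text` is a hypothetical helper name used only for this example, and the sample string is made up.

```python
import re

def clean_text(text):
    """Replace every character that is not a Latin letter with a space
    and collapse runs of whitespace into single spaces."""
    return ' '.join(re.sub(r'[^a-zA-Z]', ' ', text).split())

sample = "D'aww!  He matches\nthis colour, 100%."
cleaned = clean_text(sample.lower())
print(cleaned)  # d aww he matches this colour
```

Note that digits and punctuation are dropped entirely, so tokens such as `100%` leave no trace in the cleaned text.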

In [12]:
# apply the function to the texts

lemmatized = df['text'].apply(lemmatize_text)

In [13]:
df['lemm_text'] = lemmatized
df_lemm = df[['lemm_text', 'toxic']]

In [14]:
df_lemm.head()

,lemm_text,toxic
0,explanation why the edit make under my usernam...,0
1,d aww he match this background colour I be see...,0
2,hey man I be really not try to edit war it be ...,0
3,more I can not make any real suggestion on imp...,0
4,you sir be my hero any chance you remember wha...,0


In [29]:
# split the dataset into training and test sets
features, features_test, target, target_test = train_test_split(
    df_lemm.drop('toxic', axis=1), df_lemm['toxic'], test_size=0.2, random_state=12346)
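
Because the classes are imbalanced, a stratified split may be worth considering: passing `stratify` to `train_test_split` keeps the share of toxic comments the same in both parts. This is a sketch on synthetic labels, not the split used in this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data, illustrative only: 1000 samples, 10% positive class
y = np.array([1] * 100 + [0] * 900)
X = np.arange(len(y)).reshape(-1, 1)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=12345, stratify=y)

# Both parts keep (approximately) the original 10% share of positives
print(y_train.mean(), y_test.mean())
```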

In [30]:
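# load the stop words

nltk.download('stopwords')
# keep them as a list: every scikit-learn version accepts a list for stop_words, while newer ones reject a set
stopwords = nltk_stopwords.words('english')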

[nltk_data] Downloading package stopwords to /home/jovyan/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [31]:
tf = TfidfVectorizer(stop_words=stopwords)

## Training

### Logistic regression

In [32]:
# Initialize a logistic regression model,
# tune its hyperparameters with optuna
# and evaluate it with cross-validation via cross_val_score

def objective(trial):
    max_iter = trial.suggest_int('max_iter', 500, 1500, step=500)
    C = trial.suggest_categorical('C', [0.01, 0.1, 1.0, 10.0])
    
    model = LogisticRegression(max_iter=max_iter,
                               C=C,
                               class_weight = 'balanced',
                               random_state=12345)
    
    tf = TfidfVectorizer(stop_words=stopwords)
    
    pipeline = Pipeline(steps=[('tf', tf), ('model', model)])
    
    sc = cross_val_score(pipeline, 
                         features['lemm_text'], 
                         target, 
                         n_jobs = -1, 
                         cv = 5, 
                         scoring = 'f1')
    return sc.mean()

study = optuna.create_study(direction ='maximize')
study.optimize(objective, n_trials = 5, show_progress_bar = True)

best_max_iter = study.best_trial.params.get('max_iter')
best_C = study.best_trial.params.get('C')
lr_best_score = study.best_value

lr = LogisticRegression(max_iter=best_max_iter,
                        C=best_C,
                        class_weight = 'balanced',
                        random_state=12345)

[I 2023-03-27 08:49:56,850] A new study created in memory with name: no-name-f9b6c403-22f4-40b2-9042-9e7bbadccb95
[I 2023-03-27 08:54:55,019] Trial 0 finished with value: 0.751355892710969 and parameters: {'max_iter': 500, 'C': 1.0}. Best is trial 0 with value: 0.751355892710969.
[I 2023-03-27 09:00:14,851] Trial 1 finished with value: 0.751355892710969 and parameters: {'max_iter': 1500, 'C': 1.0}. Best is trial 0 with value: 0.751355892710969.
[I 2023-03-27 09:01:16,460] Trial 2 finished with value: 0.6803455970916541 and parameters: {'max_iter': 500, 'C': 0.01}. Best is trial 0 with value: 0.751355892710969.
[I 2023-03-27 09:06:35,303] Trial 3 finished with value: 0.751355892710969 and parameters: {'max_iter': 1000, 'C': 1.0}. Best is trial 0 with value: 0.751355892710969.
[I 2023-03-27 09:07:37,226] Trial 4 finished with value: 0.6803455970916541 and parameters: {'max_iter': 1000, 'C': 0.01}. Best is trial 0 with value: 0.751355892710969.


In [36]:
best_max_iter

500

In [37]:
best_C 

1.0

In [38]:
lr_best_score

0.751355892710969
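
Wrapping the vectorizer and the classifier in one `Pipeline` means that, during cross-validation, TF-IDF is refit on the training folds only, so the validation fold never leaks into the vocabulary or the IDF weights. A minimal sketch on a toy corpus follows; the texts and labels are made up for illustration and are not taken from the dataset.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# Toy corpus, illustrative only: 1 = toxic, 0 = not toxic
texts = ['you be stupid', 'thank you for the edit', 'stupid idiot go away',
         'nice article thank', 'idiot stupid page', 'good edit nice work'] * 5
labels = [1, 0, 1, 0, 1, 0] * 5

pipeline = Pipeline(steps=[
    ('tf', TfidfVectorizer()),
    ('model', LogisticRegression(class_weight='balanced', random_state=12345)),
])

# Each of the 5 folds fits TF-IDF on its own training part only
scores = cross_val_score(pipeline, texts, labels, cv=5, scoring='f1')
print(scores.mean())
```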

### LinearSVC

In [33]:
# Initialize a LinearSVC model,
# tune its hyperparameters with optuna
# and evaluate it with cross-validation via cross_val_score

def objective(trial):
    C = trial.suggest_categorical('C', [0.01, 0.1, 1.0, 10.0])
    
    model = LinearSVC(C=C,
                      penalty='l1',
                      dual=False,
                      max_iter=1000,
                      class_weight = 'balanced',
                      random_state = 12345)
    
    tf = TfidfVectorizer(stop_words=stopwords)
    
    pipeline = Pipeline(steps=[('tf', tf), ('model', model)])
    
    sc = cross_val_score(pipeline, 
                         features['lemm_text'], 
                         target, 
                         n_jobs = -1, 
                         cv = 5, 
                         scoring = 'f1')
    return sc.mean()

study = optuna.create_study(direction ='maximize')
study.optimize(objective, n_trials = 3, show_progress_bar = True)

best_C = study.best_trial.params.get('C')

lsvc_best_score = study.best_value

lsvc = LinearSVC(C=best_C,
                 penalty='l1',
                 max_iter=1000,
                 dual=False,
                 class_weight='balanced',
                 random_state=12345)

[I 2023-03-27 09:10:48,041] A new study created in memory with name: no-name-7a5096c3-c4d1-41d1-86bb-ecdde8c4f979
[I 2023-03-27 09:11:18,512] Trial 0 finished with value: 0.7152514682977832 and parameters: {'C': 0.01}. Best is trial 0 with value: 0.7152514682977832.
[I 2023-03-27 09:12:32,256] Trial 1 finished with value: 0.7584681648695567 and parameters: {'C': 1.0}. Best is trial 1 with value: 0.7584681648695567.
[I 2023-03-27 09:13:02,587] Trial 2 finished with value: 0.7152514682977832 and parameters: {'C': 0.01}. Best is trial 1 with value: 0.7584681648695567.


In [39]:
best_C

1.0

In [40]:
lsvc_best_score

0.7584681648695567

### CatBoostClassifier

In [53]:
# Initialize a CatBoostClassifier model
# and evaluate it with cross-validation

cb = CatBoostClassifier(loss_function='Logloss',
                        eval_metric='F1',
                        early_stopping_rounds=200,
                        random_seed=12345)

pool = Pool(data=features[['lemm_text']],
            label=target,
            text_features=['lemm_text'])

cb_results = cv(pool,
                cb.get_params(),
                fold_count=2)

Training on fold [0/2]
0:	learn: 0.6847673	test: 0.7105334	best: 0.7105334 (0)	total: 363ms	remaining: 6m 2s
1:	learn: 0.6843497	test: 0.7100242	best: 0.7105334 (0)	total: 730ms	remaining: 6m 4s
2:	learn: 0.6959534	test: 0.7189091	best: 0.7189091 (2)	total: 1.11s	remaining: 6m 8s
3:	learn: 0.6959534	test: 0.7189091	best: 0.7189091 (2)	total: 1.49s	remaining: 6m 10s
4:	learn: 0.6959534	test: 0.7189091	best: 0.7189091 (2)	total: 1.86s	remaining: 6m 10s
5:	learn: 0.6960635	test: 0.7188471	best: 0.7189091 (2)	total: 2.23s	remaining: 6m 10s
6:	learn: 0.6959317	test: 0.7185339	best: 0.7189091 (2)	total: 2.65s	remaining: 6m 15s
7:	learn: 0.6862573	test: 0.7115367	best: 0.7189091 (2)	total: 3.05s	remaining: 6m 18s
8:	learn: 0.6873299	test: 0.7138269	best: 0.7189091 (2)	total: 3.44s	remaining: 6m 19s
9:	learn: 0.6894312	test: 0.7141964	best: 0.7189091 (2)	total: 3.83s	remaining: 6m 18s
10:	learn: 0.6859577	test: 0.7100730	best: 0.7189091 (2)	total: 4.27s	remaining: 6m 24s
11:	learn: 0.6859577	t
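
The frame returned by `cv` holds one row per boosting iteration, so the cross-validated score can be read from it rather than copied by hand. The sketch below uses a small stand-in DataFrame with placeholder numbers. The column name `test-F1-mean` is an assumption based on CatBoost's usual `test-<metric>-mean` naming and should be checked against `cb_results.columns`.

```python
import pandas as pd

# Stand-in for the DataFrame returned by catboost.cv
# (placeholder values and assumed column names, not real results)
cb_results_demo = pd.DataFrame({
    'iterations': [0, 1, 2, 3],
    'test-F1-mean': [0.70, 0.72, 0.75, 0.74],
    'test-F1-std': [0.010, 0.008, 0.006, 0.007],
})

# Row with the best mean validation F1 across iterations
best_row = cb_results_demo.loc[cb_results_demo['test-F1-mean'].idxmax()]
cb_best_score = best_row['test-F1-mean']
print(int(best_row['iterations']), cb_best_score)
```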

In [60]:
cb.fit(pool)

Custom logger is already specified. Specify more than one logger at same time is not thread safe.

Learning rate set to 0.081637
0:	learn: 0.7066726	total: 515ms	remaining: 8m 34s
1:	learn: 0.7072388	total: 1.04s	remaining: 8m 39s
2:	learn: 0.7081778	total: 1.56s	remaining: 8m 38s
3:	learn: 0.7091675	total: 2.09s	remaining: 8m 40s
4:	learn: 0.7085403	total: 2.63s	remaining: 8m 42s
5:	learn: 0.7076684	total: 3.19s	remaining: 8m 49s
6:	learn: 0.7082055	total: 3.72s	remaining: 8m 48s
7:	learn: 0.7084023	total: 4.3s	remaining: 8m 53s
8:	learn: 0.7093519	total: 4.87s	remaining: 8m 55s
9:	learn: 0.7088971	total: 5.4s	remaining: 8m 54s
10:	learn: 0.7113214	total: 5.98s	remaining: 8m 57s
11:	learn: 0.7129236	total: 6.52s	remaining: 8m 56s
12:	learn: 0.7159523	total: 7.06s	remaining: 8m 55s
13:	learn: 0.7157439	total: 7.63s	remaining: 8m 57s
14:	learn: 0.7174439	total: 8.17s	remaining: 8m 56s
15:	learn: 0.7183645	total: 8.72s	remaining: 8m 56s
16:	learn: 0.7185370	total: 9.23s	remaining: 8m 53s
17:	learn: 0.7200689	total: 9.78s	remaining: 8m 53s
18:	learn: 0.7218879	total: 10.3s	remaining: 8

<catboost.core.CatBoostClassifier at 0x7f88c9bafac0>

## Conclusions

In [61]:
# collect the model results in a table
# (the CatBoost score is entered by hand)
results = pd.DataFrame(index=['LogReg', 'LinearSVC', 'Catboost'],
                       columns=['f1'],
                       data=[lr_best_score, lsvc_best_score, 0.7760034527])

In [62]:
results

,f1
LogReg,0.751356
LinearSVC,0.758468
Catboost,0.776003


In [63]:
# evaluate the best model on the test set

predictions = cb.predict(features_test[['lemm_text']])
f1_score(target_test, predictions)

0.7855172413793104

In [64]:
# sanity check: compare against a constant baseline
dummy = DummyClassifier(strategy='constant', constant=1)
dummy.fit(features, target)
predictions = dummy.predict(features_test)
f1_score(target_test, predictions)

0.18296404026577695
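
The baseline value can be checked by hand. A classifier that always predicts the toxic class has recall 1 and precision equal to the share of toxic comments p, so its F1 is 2p / (1 + p). The sketch below evaluates this with the overall toxic share from the class balance above. The share in the test split is not printed in this notebook and probably differs slightly, which would explain the small gap to the value returned by `f1_score`. The function name is hypothetical.

```python
def constant_positive_f1(positive_share):
    """F1 of a classifier that labels every sample as positive."""
    precision = positive_share  # all predictions are positive, a share p is correct
    recall = 1.0                # every true positive is found
    return 2 * precision * recall / (precision + recall)

# Share of toxic comments in the whole dataset (from value_counts above)
p = 16186 / 159292
print(round(constant_positive_f1(p), 3))  # 0.184
```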

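**Overall conclusion**

We prepared text data labeled for toxicity to train classification models:

- lemmatized the texts and stripped non-letter characters and extra whitespace;
- split the data into training and test sets;
- converted the texts to TF-IDF features with English stop words removed.

We trained three models: `LogisticRegression` and `LinearSVC`, with hyperparameters tuned by Optuna, and `CatBoostClassifier`, without a hyperparameter search.

`CatBoostClassifier` showed the best cross-validation result, with `f1` of about 0.776 against 0.758 for `LinearSVC` and 0.751 for `LogisticRegression`. The CatBoost score comes from 2-fold cross-validation while the linear models used 5 folds, so the comparison is approximate.

On the test set this model reached an `f1` of **0.79**, well above the constant baseline of 0.18.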