<h1>Содержание<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Импорт-библиотек" data-toc-modified-id="Импорт-библиотек-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Импорт библиотек</a></span></li><li><span><a href="#Загрузка-данных" data-toc-modified-id="Загрузка-данных-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Загрузка данных</a></span></li><li><span><a href="#Предобработка" data-toc-modified-id="Предобработка-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Предобработка</a></span></li><li><span><a href="#Обучение" data-toc-modified-id="Обучение-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Обучение</a></span></li><li><span><a href="#Выводы" data-toc-modified-id="Выводы-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Выводы</a></span></li><li><span><a href="#Чек-лист-проверки" data-toc-modified-id="Чек-лист-проверки-6"><span class="toc-item-num">6&nbsp;&nbsp;</span>Чек-лист проверки</a></span></li></ul></div>

# Проект для «Викишоп»

Интернет-магазин «Викишоп» запускает новый сервис. Теперь пользователи могут редактировать и дополнять описания товаров, как в вики-сообществах. То есть клиенты предлагают свои правки и комментируют изменения других. Магазину нужен инструмент, который будет искать токсичные комментарии и отправлять их на модерацию. 

**Цель проекта** - обучить модель классифицировать комментарии на позитивные и негативные со значением метрики качества *F1* не меньше 0.75. 


**Задачи проекта**

1. Загрузить и подготовить данные.
2. Обучить разные модели. 
3. Сделать выводы.


**Описание данных**

В нашем распоряжении набор данных с разметкой о токсичности правок. Данные находятся в файле `toxic_comments.csv`. Столбец *text* в нём содержит текст комментария, а *toxic* — целевой признак.

## Импорт библиотек

In [1]:
pip install optuna




In [2]:
import pandas as pd
import numpy as np
import re
import nltk

nltk.download('averaged_perceptron_tagger')
from nltk.tokenize import word_tokenize
from nltk.tag import pos_tag
from nltk.stem import WordNetLemmatizer 
from nltk.corpus import stopwords
from nltk.corpus import wordnet as wn
from collections import defaultdict

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer

import optuna

from sklearn.linear_model import LogisticRegression
from catboost import CatBoostClassifier
import lightgbm as lgb

from sklearn.metrics import f1_score

import warnings
warnings.filterwarnings("ignore")

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\js\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


## Загрузка данных

In [3]:
server_path = '/datasets/'
local_path = ''
data = 'toxic_comments.csv'

try:
    data = pd.read_csv(server_path + data, index_col=0)  
except: 
    data = pd.read_csv(local_path + data, index_col=0)  
  


In [4]:
data.head()

Unnamed: 0,text,toxic
0,Explanation\nWhy the edits made under my usern...,0
1,D'aww! He matches this background colour I'm s...,0
2,"Hey man, I'm really not trying to edit war. It...",0
3,"""\nMore\nI can't make any real suggestions on ...",0
4,"You, sir, are my hero. Any chance you remember...",0


In [5]:
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 159292 entries, 0 to 159450
Data columns (total 2 columns):
 #   Column  Non-Null Count   Dtype 
---  ------  --------------   ----- 
 0   text    159292 non-null  object
 1   toxic   159292 non-null  int64 
dtypes: int64(1), object(1)
memory usage: 3.6+ MB


In [6]:
data.isnull().sum()

text     0
toxic    0
dtype: int64

In [7]:
data['toxic'].value_counts()

0    143106
1     16186
Name: toxic, dtype: int64

In [8]:
data.duplicated().sum() 

0

Пропусков и дубликатов в данных не обнаружено. Целевой признак неравномерен. Пока оставим, посмотрим, как обучится модель.

## Предобработка

In [9]:
stop_words = set(stopwords.words('english'))

In [10]:
lemmatizer = WordNetLemmatizer()

In [11]:
def preproc(text):
    text = re.sub(r'[^a-zA-z ]', ' ', text)
    text = text.lower()
    token = nltk.word_tokenize(text)
    text = [word for word in token if word not in stop_words]
    tag_map = defaultdict(lambda : wn.NOUN)
    tag_map['J'] = wn.ADJ
    tag_map['V'] = wn.VERB
    tag_map['R'] = wn.ADV
    tag_map['R'] = wn.NOUN
    text = [lemmatizer.lemmatize(word, tag_map[tag[0]]) for word, tag in pos_tag(text)]
    text = ' '.join(text)
    return text

In [12]:
data['preproc'] = data['text'].apply(preproc)
data.head()

Unnamed: 0,text,toxic,preproc
0,Explanation\nWhy the edits made under my usern...,0,explanation edits make username hardcore metal...
1,D'aww! He matches this background colour I'm s...,0,aww match background colour seemingly stuck th...
2,"Hey man, I'm really not trying to edit war. It...",0,hey man really try edit war guy constantly rem...
3,"""\nMore\nI can't make any real suggestions on ...",0,make real suggestion improvement wonder sectio...
4,"You, sir, are my hero. Any chance you remember...",0,sir hero chance remember page


In [13]:
df = data.copy()

In [14]:
train, test = train_test_split(df,
                        test_size = 0.3,
                        random_state = 12348,
                       stratify = df['toxic'])
valid, test = train_test_split(test,
                        test_size = 0.5,
                        random_state = 12348,
                       stratify = test['toxic'])

print(train.shape)
print(valid.shape)
print(test.shape)

(111504, 3)
(23894, 3)
(23894, 3)


<font color='blue'><b>Комментарий ревьюера : </b></font> ✔️\
<font color='green'> Здорово, что у нас есть три выборки!</font>

In [15]:
for sample in [train, valid, test]:    
    print(sample[sample['toxic'] == 1].shape[0] / sample.shape[0])

0.10161070454871574
0.10161546831840629
0.10161546831840629


In [16]:
train.head(5)

Unnamed: 0,text,toxic,preproc
49048,You're such a pussy fairyboy. \n\nBlock your t...,1,pussy fairyboy block talk page edit ghandi
151496,"Saint Petersurg #2\nAlright, Ill make sure.",0,saint petersurg alright ill make sure
80686,"Recatting, etc.\nNice shots - I'd like to get ...",0,recatting etc nice shot like get someday recat...
36662,Sulasgeir\n\nYou have flagged my Sulasgeir art...,0,sulasgeir flag sulasgeir article copy culture ...
11796,Are you monitoring me? \n\nHow did you notice ...,0,monitoring notice edit quickly content true fi...


In [17]:
tfidf = TfidfVectorizer()

In [18]:
tfidf_train = tfidf.fit_transform(train['preproc'])
tfidf_valid = tfidf.transform(valid['preproc'])
tfidf_test = tfidf.transform(test['preproc'])

In [19]:
tfidf_train.shape

(111504, 129486)

In [20]:
tfidf_test.shape

(23894, 129486)

In [21]:
tfidf_valid.shape

(23894, 129486)

## Обучение

In [22]:
model = lgb.LGBMClassifier(n_jobs=-1, random_state = 42)
model.fit(tfidf_train, train['toxic'])
pred = model.predict(tfidf_valid)


In [23]:
f1 = f1_score(pred, valid['toxic'])
print('f-1:', f1)

f-1: 0.7619502868068835


In [24]:
model = CatBoostClassifier(early_stopping_rounds = 100, loss_function= 'Logloss', random_state = 42)
model.fit(tfidf_train, train['toxic'])
pred = model.predict(tfidf_valid)
f1 = f1_score(pred, valid['toxic'])
print('f-1:', f1)

Learning rate set to 0.077113
0:	learn: 0.6154415	total: 5s	remaining: 1h 23m 11s
1:	learn: 0.5484866	total: 6.95s	remaining: 57m 49s
2:	learn: 0.4912873	total: 8.19s	remaining: 45m 20s
3:	learn: 0.4490393	total: 9.39s	remaining: 38m 58s
4:	learn: 0.4119117	total: 10.6s	remaining: 35m 2s
5:	learn: 0.3817380	total: 11.7s	remaining: 32m 25s
6:	learn: 0.3567929	total: 12.9s	remaining: 30m 32s
7:	learn: 0.3361845	total: 14.2s	remaining: 29m 22s
8:	learn: 0.3189970	total: 15.4s	remaining: 28m 16s
9:	learn: 0.3040541	total: 16.6s	remaining: 27m 21s
10:	learn: 0.2918249	total: 17.8s	remaining: 26m 37s
11:	learn: 0.2818345	total: 18.9s	remaining: 25m 59s
12:	learn: 0.2733984	total: 20.1s	remaining: 25m 26s
13:	learn: 0.2658727	total: 21.3s	remaining: 24m 56s
14:	learn: 0.2588651	total: 22.4s	remaining: 24m 33s
15:	learn: 0.2532158	total: 23.6s	remaining: 24m 12s
16:	learn: 0.2486349	total: 24.8s	remaining: 23m 52s
17:	learn: 0.2445171	total: 25.9s	remaining: 23m 35s
18:	learn: 0.2410103	total:

f-1: 0.7481569560047563

In [25]:
lr = LogisticRegression(multi_class='ovr', class_weight='balanced', max_iter=10000)
lr.fit(tfidf_train, train['toxic'])
pred_train, pred_valid = lr.predict(tfidf_train), lr.predict(tfidf_valid)


In [26]:
print('f-1:', f1_score(valid['toxic'], pred_valid))

f-1: 0.7499097798628654


Подберем параметры для lgbm.

In [27]:
def objective_lgbm(trial):

    param = {     
        'max_depth' : trial.suggest_int("max_depth", 3, 10),
        'n_estimators' : trial.suggest_int('n_estimators', 500, 4000, log=True),
        'learning_rate' : trial.suggest_float('learning_rate', 0.01, 0.4, log=True),
        'num_leaves' : trial.suggest_int("num_leaves", 3, 20)
       
    }

    model = lgb.LGBMClassifier(n_jobs=-1, random_state = 42, **param)
    model.fit(tfidf_train, train['toxic'])
    pred_valid = model.predict(tfidf_valid)

    f1 = f1_score(valid['toxic'], pred_valid)

    return f1

In [28]:
%%time
study_lgbm = optuna.create_study(direction='maximize')
study_lgbm.optimize(objective_lgbm, n_trials=500, timeout=600, n_jobs=-1)


[32m[I 2023-10-03 18:01:58,170][0m A new study created in memory with name: no-name-4cf3f0af-6c17-4e98-b3f3-73451423ef18[0m
[32m[I 2023-10-03 18:08:01,249][0m Trial 5 finished with value: 0.6008486562942008 and parameters: {'max_depth': 10, 'n_estimators': 797, 'learning_rate': 0.013408984446545907, 'num_leaves': 4}. Best is trial 5 with value: 0.6008486562942008.[0m
[32m[I 2023-10-03 18:12:31,996][0m Trial 7 finished with value: 0.6260387811634348 and parameters: {'max_depth': 3, 'n_estimators': 1295, 'learning_rate': 0.010394493110643855, 'num_leaves': 16}. Best is trial 7 with value: 0.6260387811634348.[0m
[32m[I 2023-10-03 18:12:33,619][0m Trial 2 finished with value: 0.7846118405897259 and parameters: {'max_depth': 5, 'n_estimators': 859, 'learning_rate': 0.22064744804882344, 'num_leaves': 14}. Best is trial 2 with value: 0.7846118405897259.[0m
[32m[I 2023-10-03 18:14:11,524][0m Trial 8 finished with value: 0.627939142461964 and parameters: {'max_depth': 3, 'n_estima

Wall time: 14min 34s


In [30]:
print('f-1:',study_lgbm.best_value)
p = study_lgbm.best_params
print(p)

f-1: 0.7874771480804389
{'max_depth': 3, 'n_estimators': 2014, 'learning_rate': 0.19098391959565014, 'num_leaves': 15}


Проверим на тестовой выборке.

In [31]:
model = lgb.LGBMClassifier(n_jobs=-1, random_state = 42, **p)
model.fit(tfidf_train, train['toxic'])
pred_test = model.predict(tfidf_test)
f1 = f1_score(test['toxic'], pred_test)

In [32]:
print('f-1:',f1)

f-1: 0.7771402341060364


## Выводы

Удалось обучить модель классификации текстов со значением метрики f1 = 0.777. Для решения задачи лучше всего подходит логистическая регрессия.