<h1>Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Подготовка" data-toc-modified-id="Подготовка-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Preparation</a></span></li><li><span><a href="#Обучение" data-toc-modified-id="Обучение-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Training</a></span></li><li><span><a href="#Выводы" data-toc-modified-id="Выводы-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Conclusions</a></span></li><li><span><a href="#Чек-лист-проверки" data-toc-modified-id="Чек-лист-проверки-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Review checklist</a></span></li></ul></div>

# Project for Wikishop

The online store Wikishop is launching a new service. Users can now edit and extend product descriptions, as in wiki communities: customers propose their own edits and comment on the changes made by others. The store needs a tool that finds toxic comments and sends them to moderation.

We will train a model to classify comments as positive or negative (non-toxic or toxic). We have a dataset in which each edit is labeled for toxicity.

The goal is a model with an *F1* score of at least 0.75.
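
For reference, F1 is the harmonic mean of precision and recall on the positive (toxic) class. The sketch below computes it in plain Python; the helper name `f1_from_counts` and the example counts are illustrative assumptions, not values from the dataset.

```python
def f1_from_counts(tp: int, fp: int, fn: int) -> float:
    """F1 for the positive class from true positives, false positives and false negatives."""
    if tp == 0:
        # No correct positive predictions: F1 is taken to be 0
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts: 80 toxic comments caught, 20 false alarms, 30 missed
print(f1_from_counts(tp=80, fp=20, fn=30))  # about 0.762
```

With these counts precision is 0.80 and recall is about 0.73, so the score just clears the 0.75 target.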

## Preparation

In [22]:
import pandas as pd
import numpy as np
import nltk
import re
import matplotlib.pyplot as plt
from tqdm import tqdm

from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer 
from sklearn.utils import shuffle
from sklearn.pipeline import Pipeline
from catboost import CatBoostClassifier

from pymystem3 import Mystem
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer 
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('punkt')
stopwords = stopwords.words('english')

[nltk_data] Downloading package stopwords to /home/jovyan/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /home/jovyan/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt to /home/jovyan/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [23]:
df_comm = pd.read_csv('/datasets/toxic_comments.csv')
df_comm.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 159292 entries, 0 to 159291
Data columns (total 3 columns):
 #   Column      Non-Null Count   Dtype 
---  ------      --------------   ----- 
 0   Unnamed: 0  159292 non-null  int64 
 1   text        159292 non-null  object
 2   toxic       159292 non-null  int64 
dtypes: int64(2), object(1)
memory usage: 3.6+ MB


In [24]:
df_comm.head(10)

Unnamed: 0.1,Unnamed: 0,text,toxic
0,0,Explanation\nWhy the edits made under my usern...,0
1,1,D'aww! He matches this background colour I'm s...,0
2,2,"Hey man, I'm really not trying to edit war. It...",0
3,3,"""\nMore\nI can't make any real suggestions on ...",0
4,4,"You, sir, are my hero. Any chance you remember...",0
5,5,"""\n\nCongratulations from me as well, use the ...",0
6,6,COCKSUCKER BEFORE YOU PISS AROUND ON MY WORK,1
7,7,Your vandalism to the Matt Shirvington article...,0
8,8,Sorry if the word 'nonsense' was offensive to ...,0
9,9,alignment on this subject and which are contra...,0


In [25]:
print(f"Number of duplicates: {df_comm.duplicated().sum()}")

Number of duplicates: 0


Remove special characters and line breaks from the text and convert it to lowercase.

In [27]:
def cleaning(text):
    text = re.sub(r"(?:\n|\r)", " ", text)
    text = re.sub(r"[^a-zA-Z ]+", "", text).strip()
    text = text.lower()
    return text

df_comm['text'] = df_comm['text'].apply(cleaning)
print(df_comm.head(20))

    Unnamed: 0                                               text  toxic
0            0  explanation why the edits made under my userna...      0
1            1  daww he matches this background colour im seem...      0
2            2  hey man im really not trying to edit war its j...      0
3            3  more i cant make any real suggestions on impro...      0
4            4  you sir are my hero any chance you remember wh...      0
5            5  congratulations from me as well use the tools ...      0
6            6       cocksucker before you piss around on my work      1
7            7  your vandalism to the matt shirvington article...      0
8            8  sorry if the word nonsense was offensive to yo...      0
9            9  alignment on this subject and which are contra...      0
10          10  fair use rationale for imagewonjujpg  thanks f...      0
11          11  bbq   be a man and lets discuss itmaybe over t...      0
12          12  hey what is it   talk 
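
The output shows a side effect of deleting non-letter characters outright: words separated only by punctuation get glued together (`itmaybe`, `imagewonjujpg`). A possible variant, shown here as a sketch and not used in this notebook, replaces each run of non-letters with a single space. The function name `cleaning_keep_boundaries` and the sample string are hypothetical.

```python
import re


def cleaning_keep_boundaries(text: str) -> str:
    """Hypothetical variant of cleaning(): non-letters become spaces instead of being deleted."""
    # One pattern covers \n, \r, digits and punctuation; each run collapses to one space
    text = re.sub(r"[^a-zA-Z]+", " ", text)
    return text.lower().strip()


sample = "Let's discuss it...maybe over the phone?\nImage:Wonju.jpg"
print(cleaning_keep_boundaries(sample))
# let s discuss it maybe over the phone image wonju jpg
```

The trade-off is that contractions are split into fragments (`let s`), which the later stop-word filtering may or may not remove.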

In [28]:
%%time
import sys
import spacy

nlp = spacy.load("en_core_web_sm", disable = ['parser','ner'])
def normalize(row):
    doc = nlp(row)
    lem_doc = ' '.join([token.lemma_ for token in doc if not (token.is_stop or token.is_punct)])
    return lem_doc


df_comm['lemm_text'] = df_comm['text'].apply(normalize)
df_comm = df_comm.drop(['text'], axis = 1)

df_comm.head(10)

CPU times: user 15min 56s, sys: 751 ms, total: 15min 57s

Wall time: 15min 57s


Unnamed: 0.1,Unnamed: 0,toxic,lemm_text
0,0,0,explanation edit username hardcore metallica f...
1,1,0,daww match background colour m seemingly stuck...
2,2,0,hey man m try edit war guy constantly remove r...
3,3,0,not real suggestion improvement wonder secti...
4,4,0,sir hero chance remember page s
5,5,0,congratulation use tool talk
6,6,1,cocksucker piss work
7,7,0,vandalism matt shirvington article revert no...
8,8,0,sorry word nonsense offensive m intend write a...
9,9,0,alignment subject contrary dulithgow


In [29]:
print(df_comm.head(10))

   Unnamed: 0  toxic                                          lemm_text
0           0      0  explanation edit username hardcore metallica f...
1           1      0  daww match background colour m seemingly stuck...
2           2      0  hey man m try edit war guy constantly remove r...
3           3      0  not real suggestion improvement   wonder secti...
4           4      0                    sir hero chance remember page s
5           5      0                     congratulation use tool   talk
6           6      1                               cocksucker piss work
7           7      0  vandalism matt shirvington article revert   no...
8           8      0  sorry word nonsense offensive m intend write a...
9           9      0               alignment subject contrary dulithgow


Check the class balance.

In [30]:
display(df_comm['toxic'].value_counts())
class_ratio = df_comm['toxic'].value_counts()[0] / df_comm['toxic'].value_counts()[1]
print(class_ratio)

0    143106
1     16186
Name: toxic, dtype: int64

8.841344371679229


The classes are heavily imbalanced (about 8.8 non-toxic comments per toxic one), so we need to compensate for this.

In [31]:
target = df_comm['toxic']
features = df_comm.drop(['toxic'], axis=1)

# Split the features, not the full frame, which still contains the target column
features_train, features_test, target_train, target_test = train_test_split(
    features, target, test_size=0.1, random_state=42)

# Fit TF-IDF on the training texts only, then reuse its vocabulary for the test texts
count_tf_idf = TfidfVectorizer(stop_words=stopwords)

features_train = count_tf_idf.fit_transform(features_train['lemm_text'])
features_test = count_tf_idf.transform(features_test['lemm_text'])

print(features_train.shape)
print(features_test.shape)

(143362, 190254)

(15930, 190254)


In [32]:
cv_counts = 3
model = LogisticRegression(class_weight='balanced')
train_f1_balanced = cross_val_score(model, 
                                    features_train, 
                                    target_train, 
                                    cv=cv_counts, 
                                    scoring='f1').mean()
print('F1 on CV with balanced classes', train_f1_balanced)

F1 on CV with balanced classes 0.7525527042612357
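
To make `class_weight='balanced'` concrete: scikit-learn weights each class by `n_samples / (n_classes * class_count)`. The sketch below applies that formula to the class counts printed earlier; the variable names are illustrative.

```python
# Class counts taken from the value_counts() output in this notebook
counts = {0: 143106, 1: 16186}
n_samples = sum(counts.values())
n_classes = len(counts)

# "balanced" heuristic: n_samples / (n_classes * count_of_class)
weights = {label: n_samples / (n_classes * count) for label, count in counts.items()}
print(weights)  # roughly 0.56 for class 0 and 4.92 for class 1
```

The ratio of the two weights equals the class ratio (about 8.84), so each toxic comment counts as much in the loss as roughly nine non-toxic ones.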


### Stage one summary

At this stage we examined the data: the column types and the total number of rows. We checked for duplicates, removed special characters and line breaks, lowercased the text, and then lemmatized it. We found a strong class imbalance, compensated for it with balanced class weights (`class_weight='balanced'`), and computed a preliminary F1 score with cross-validation.

## Training

Tune the hyperparameters of Logistic Regression with a Pipeline and GridSearchCV.

In [34]:
%%time
pipe = Pipeline([
    ('model', LogisticRegression(random_state=1234, solver='liblinear', max_iter=200))
])

# liblinear supports both the l1 and the l2 penalty.
# Note: the 'model' entry below replaces the pipeline step during the search,
# so the candidates run with the default max_iter rather than max_iter=200.
param_grid = [
    {
        'model': [LogisticRegression(random_state=1234, solver='liblinear')],
        'model__penalty': ['l1', 'l2'],
        'model__C': list(range(1, 15, 3))
    }
]
grid = GridSearchCV(pipe, param_grid=param_grid, scoring='f1', cv=cv_counts, verbose=True, n_jobs=-1)
best_grid = grid.fit(features_train, target_train)
print('Best parameters:', grid.best_params_)
print('Best score:', grid.best_score_)

Fitting 3 folds for each of 10 candidates, totalling 30 fits
Best parameters: {'model': LogisticRegression(C=4, penalty='l1', random_state=1234, solver='liblinear'), 'model__C': 4, 'model__penalty': 'l1'}
Best score: 0.772752891688528

CPU times: user 2min 56s, sys: 2min 4s, total: 5min

Wall time: 5min 1s
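
The search above runs on a TF-IDF matrix that was fitted once on the whole training set, so each validation fold has already contributed to the vocabulary and IDF weights. A common alternative, sketched here on toy data and not what this notebook does, is to put the vectorizer inside the pipeline so that it is refit on the training part of every fold. The texts, labels and the small `C` grid below are made up for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Toy stand-in for the lemmatized comments and their labels (not the project data)
texts = ["thank help", "great edit thank", "stupid idiot", "idiot vandal",
         "nice article", "hate stupid page", "good source thank", "stupid vandal hate"]
labels = [0, 0, 1, 1, 0, 1, 0, 1]

text_pipe = Pipeline([
    ("tfidf", TfidfVectorizer()),  # refit on the training part of every fold
    ("model", LogisticRegression(solver="liblinear", random_state=1234)),
])

# The pipeline receives raw strings, so vectorization happens inside each CV split
search = GridSearchCV(text_pipe, {"model__C": [1, 4]}, scoring="f1", cv=2)
search.fit(texts, labels)
print(search.best_params_)
```

On the real data this would cost extra time per fold, since TF-IDF is recomputed for each split.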


Next, try RandomForestClassifier.

In [36]:
%%time
params_forest = {
    'n_estimators': list(range(50,300,50)),
    'max_depth':[5,15],
    'max_features' : list(range(1,20, 2))
}


model = RandomForestClassifier(random_state=12345)
                                 
grid = GridSearchCV(model, param_grid=params_forest, scoring='f1', cv=3, verbose=True, n_jobs=-1)
best_grid = grid.fit(features_train, target_train)
print('Best parameters:', grid.best_params_)
print('Best score:', grid.best_score_)

Fitting 3 folds for each of 100 candidates, totalling 300 fits
Best parameters: {'max_depth': 5, 'max_features': 1, 'n_estimators': 50}
Best score: 0.0

CPU times: user 32min 40s, sys: 4.53 s, total: 32min 44s

Wall time: 32min 46s
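
A best F1 of 0.0 means that no candidate correctly predicted a single toxic comment in any fold. One possible reason, which this notebook does not verify, is how few features each split may examine: the grid caps `max_features` at 19, while the TF-IDF matrix has 190254 mostly sparse columns. The sketch below only compares that cap with scikit-learn's `"sqrt"` setting.

```python
import math

n_features = 190254                        # TF-IDF vocabulary size printed earlier
grid_max_features = list(range(1, 20, 2))  # values tried in the grid above

# With max_features="sqrt", scikit-learn considers int(sqrt(n_features)) features per split
sqrt_features = math.isqrt(n_features)

print(sqrt_features)           # 436
print(max(grid_max_features))  # 19
```

Whether a wider `max_features`, deeper trees, or class weights would lift the score above zero is an open question here and would need a separate run.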


Next, try CatBoostClassifier.

In [45]:
%%time

model = CatBoostClassifier(verbose=False, iterations=200)
model.fit(features_train, target_train)
target_predict = model.predict(features_test)
cv_f1_CBC = cross_val_score(model,
                                         features_train, 
                                         target_train, 
                                         cv=cv_counts, 
                                         scoring='f1').mean()

print('F1 on CV', cv_f1_CBC)


KeyboardInterrupt: 

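The CatBoost run was interrupted before it finished, so it has no cross-validation score. We take the model with the best cross-validation score, logistic regression, and evaluate it on the test set.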

In [44]:
model = LogisticRegression(random_state=1234, C = 4, penalty = 'l1', solver='liblinear', max_iter=200)
model.fit(features_train, target_train)
pred_test = model.predict(features_test)
f1_score(target_test, pred_test)

0.7904312668463612

## Conclusions

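Logistic Regression gave the best result: F1 = 0.79 on the test set (0.77 in cross-validation), which meets the target of at least 0.75. RandomForestClassifier scored F1 = 0.0 in cross-validation, and the CatBoostClassifier run was interrupted before producing a score.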