<h1>Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Подготовка" data-toc-modified-id="Подготовка-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Preparation</a></span></li><li><span><a href="#Обучение" data-toc-modified-id="Обучение-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Training</a></span></li><li><span><a href="#Выводы" data-toc-modified-id="Выводы-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Conclusions</a></span></li></ul></div>

# Project for "Wikishop"

The online store Wikishop is launching a new service. Users can now edit and extend product descriptions, as in wiki communities: customers propose their own edits and comment on changes made by others. The store needs a tool that finds toxic comments and sends them for moderation.

Train a model to classify comments as positive or negative. You have a dataset in which edits are labeled for toxicity.

Build a model with an *F1* score of at least 0.75.

**Project instructions**

1. Load and prepare the data.
2. Train several different models.
3. Draw conclusions.

Using *BERT* is optional for this project, but you are welcome to try it.

**Data description**

The data is in the file `toxic_comments.csv`. The *text* column contains the comment text, and *toxic* is the target feature.

## Preparation

In [1]:
import pandas as pd
import numpy as np
import re
import matplotlib.pyplot as plt

from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from catboost import CatBoostClassifier

import nltk
from nltk.tokenize import word_tokenize 
from nltk.stem import WordNetLemmatizer 
from nltk.corpus import stopwords 

import warnings
warnings.filterwarnings("ignore")

from tqdm import tqdm

Load the data and run the standard checks:

In [2]:
data = pd.read_csv('/datasets/toxic_comments.csv', index_col = 0)
display(data)
display(f'Full duplicates: {data.duplicated().sum()}')
print(f'Missing values:\n{data.isnull().sum()}\n\nGeneral information:')
data.info()

Unnamed: 0,text,toxic
0,Explanation\nWhy the edits made under my usern...,0
1,D'aww! He matches this background colour I'm s...,0
2,"Hey man, I'm really not trying to edit war. It...",0
3,"""\nMore\nI can't make any real suggestions on ...",0
4,"You, sir, are my hero. Any chance you remember...",0
...,...,...
159446,""":::::And for the second time of asking, when ...",0
159447,You should be ashamed of yourself \n\nThat is ...,0
159448,"Spitzer \n\nUmm, theres no actual article for ...",0
159449,And it looks like it was actually you who put ...,0


'Full duplicates: 0'

Missing values:
text     0
toxic    0
dtype: int64

General information:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 159292 entries, 0 to 159450
Data columns (total 2 columns):
 #   Column  Non-Null Count   Dtype 
---  ------  --------------   ----- 
 0   text    159292 non-null  object
 1   toxic   159292 non-null  int64 
dtypes: int64(1), object(1)
memory usage: 3.6+ MB


In [3]:
data.toxic.value_counts()

0    143106
1     16186
Name: toxic, dtype: int64
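
The counts above show a clear class imbalance. As a minimal sketch (plain Python, with the two counts copied from the output above), the share of toxic comments and the F1 of a trivial "always toxic" baseline can be worked out directly. The variable names are illustrative and not part of the notebook.

```python
# Class counts copied from the value_counts() output above
counts = {0: 143106, 1: 16186}

total = sum(counts.values())
toxic_share = counts[1] / total

# A classifier that labels every comment as toxic has recall = 1
# and precision = toxic_share, so its F1 is 2 * p * r / (p + r).
precision, recall = toxic_share, 1.0
f1_all_toxic = 2 * precision * recall / (precision + recall)

print(f"total comments: {total}")
print(f"toxic share: {toxic_share:.2%}")
print(f"F1 of the always-toxic baseline: {f1_all_toxic:.3f}")
```

Because only about one comment in ten is toxic, accuracy would be a misleading metric here, which is consistent with the task asking for F1.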

Text cleaning and lemmatization:

In [4]:
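nltk.download('stopwords')
# word_tokenize and WordNetLemmatizer also need these resources
# (assumed not to be preinstalled; quiet=True suppresses the log lines)
nltk.download('punkt', quiet=True)
nltk.download('wordnet', quiet=True)
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()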


[nltk_data] Downloading package stopwords to /home/jovyan/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [6]:
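def text3(text):
    """Lowercase, keep only Latin letters, drop stop words, lemmatize as verbs."""
    # The text is lowercased first, so a-z suffices
    # (a range like A-z would also match [ \ ] ^ _ and the backtick)
    text = re.sub(r'[^a-z ]', ' ', text.lower())
    text = [word for word in nltk.word_tokenize(text) if word not in stop_words]
    # pos='v' lemmatizes every token as a verb ('went' -> 'go');
    # plural nouns such as 'feet' stay as they are
    return ' '.join([lemmatizer.lemmatize(word, 'v') for word in text])

# Quick check on two hand-written sentences
sentence1 = "The striped bats are hanging on their feet for best"
sentence2 = "you should be ashamed of yourself went worked"
df_my = pd.DataFrame([sentence1, sentence2], columns = ['text'])
print(df_my,'\n')
print(df_my['text'].apply(text3))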

                                                text
0  The striped bats are hanging on their feet for...
1      you should be ashamed of yourself went worked 

0    strip bat hang feet best
1             ashamed go work
Name: text, dtype: object
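
A detail worth keeping in mind when writing the cleaning pattern: in ASCII, the range `A-z` covers codes 65 to 122 and therefore includes the six characters between `Z` and `a`, namely `[`, `\`, `]`, `^`, `_` and the backtick. The sketch below uses only the standard `re` module and a made-up sample string to compare the two character classes.

```python
import re

# Made-up sample containing characters that sit between 'Z' and 'a' in ASCII
sample = "foo_bar [x] `y` ^z"

# A-z spans ASCII 65..122, so '_', '[', ']', '`' and '^' are treated as letters
loose = re.sub(r'[^a-zA-z ]', ' ', sample)

# On lowercased text, a-z alone is enough and removes those characters
strict = re.sub(r'[^a-z ]', ' ', sample)

print(repr(loose))
print(repr(strict))
```

With the loose class the sample passes through unchanged, while the strict class replaces every non-letter with a space.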


In [7]:
# v2
tqdm.pandas()
data['text2'] = data['text'].progress_apply(text3)
data.iloc[:,2:]

100%|██████████| 159292/159292 [01:17<00:00, 2052.41it/s]


Unnamed: 0,text2
0,explanation edit make username hardcore metall...
1,aww match background colour seemingly stick th...
2,hey man really try edit war guy constantly rem...
3,make real suggestions improvement wonder secti...
4,sir hero chance remember page
...,...
159446,second time ask view completely contradict cov...
159447,ashamed horrible thing put talk page
159448,spitzer umm theres actual article prostitution...
159449,look like actually put speedy first version de...


In [11]:
train, test = train_test_split(data.iloc[:,1:], test_size = 0.3, stratify = data.toxic, random_state=1)
print('Train data shape:', train.shape)
print('Test data shape:', test.shape)

Train data shape: (111504, 2)
Test data shape: (47788, 2)


## Training
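
Every model in this section receives TF-IDF features produced by `TfidfVectorizer`. To make the weighting concrete, here is a minimal pure-Python sketch of the same idea, assuming scikit-learn's documented defaults (smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each row). The helper `tfidf_rows` and the three sample documents are hypothetical, and tokenization is simplified to whitespace splitting, so the numbers are illustrative rather than a replica of the vectorizer.

```python
import math
from collections import Counter

def tfidf_rows(docs):
    """Return one {term: weight} dict per document (smoothed IDF, L2-normalized)."""
    tokenized = [doc.split() for doc in docs]  # simplification: whitespace tokens
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term occurs
    df = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {term: math.log((1 + n_docs) / (1 + d)) + 1 for term, d in df.items()}
    rows = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights = {term: count * idf[term] for term, count in tf.items()}
        # L2 norm of the row; fall back to 1.0 for an empty document
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        rows.append({term: w / norm for term, w in weights.items()})
    return rows

docs = ["ashamed go work", "go edit page", "edit war page page"]
rows = tfidf_rows(docs)
for doc, row in zip(docs, rows):
    print(doc, "->", {term: round(w, 3) for term, w in row.items()})
```

Terms that occur in fewer documents (such as `ashamed` here) receive a larger weight than terms shared by several documents (such as `go`).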

In [12]:
result = pd.DataFrame(columns=['f1', 'params', 'model', 'label'])

def training(*, model, params, label=None):
    global result
    pipeline = Pipeline([
        ('tfidf', TfidfVectorizer(min_df = 1)), # TF-IDF vectorization
        ('model', model)])
    grid = GridSearchCV(pipeline, cv = 5, n_jobs = -1, param_grid = params,
                        scoring = 'f1', verbose = False)
    grid.fit(train.text2, train.toxic)
    exist_hyper=[]
    for i in list(grid.get_params().keys()):
        if 'estimator__model__' in i:
            exist_hyper.append(i.replace('estimator__model__', '')) #v1
    display(f"available hyperparameters: {exist_hyper}")
    result.loc[len(result)] = pd.Series({'f1': grid.best_score_,
                     'params': grid.best_params_,
                     'model': grid.best_estimator_,
                     'label': label})


In [13]:
%%time
training(model=LogisticRegression(),
            params={"model__C":[0.1, 1.0, 10.0], "model__penalty":["l2"]},
            label='LogisticRegression')

"available hyperparameters: ['C', 'class_weight', 'dual', 'fit_intercept', 'intercept_scaling', 'l1_ratio', 'max_iter', 'multi_class', 'n_jobs', 'penalty', 'random_state', 'solver', 'tol', 'verbose', 'warm_start']"

CPU times: user 4min 49s, sys: 4min 53s, total: 9min 43s
Wall time: 9min 44s


In [14]:
%%time
# Note: the default lbfgs solver supports only the l2 penalty,
# so the l1 candidates are expected to fail to fit and score NaN
training(model=LogisticRegression(),
            params={"model__C": range(5,16), "model__penalty":["l1", "l2"]},
            label='LogisticRegression v.2')

"available hyperparameters: ['C', 'class_weight', 'dual', 'fit_intercept', 'intercept_scaling', 'l1_ratio', 'max_iter', 'multi_class', 'n_jobs', 'penalty', 'random_state', 'solver', 'tol', 'verbose', 'warm_start']"

CPU times: user 23min 27s, sys: 22min 7s, total: 45min 34s
Wall time: 45min 37s


In [15]:
%%time
training(model=DecisionTreeClassifier(),
            params={'model__criterion':['gini','entropy'],'model__max_depth':[2,4,6]},
            label='DecisionTreeClassifier')

"available hyperparameters: ['ccp_alpha', 'class_weight', 'criterion', 'max_depth', 'max_features', 'max_leaf_nodes', 'min_impurity_decrease', 'min_impurity_split', 'min_samples_leaf', 'min_samples_split', 'min_weight_fraction_leaf', 'random_state', 'splitter']"

CPU times: user 2min 32s, sys: 5.4 s, total: 2min 37s
Wall time: 2min 37s


In [16]:
%%time
training(model=CatBoostClassifier(logging_level='Silent'),
            params={'model__depth': [4,6],
                 'model__learning_rate' : [0.01,0.03],
                  'model__iterations' : [10, 50]},
            label='CatBoostClassifier')

"available hyperparameters: ['logging_level']"

CPU times: user 43min 25s, sys: 1min 13s, total: 44min 38s
Wall time: 45min 14s
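
The run times above depend largely on how many pipelines each grid search has to fit: the number of parameter combinations multiplied by the number of cross-validation folds. The sketch below counts them with the standard library only, using the four grids from the cells above and `cv = 5` as in the `training` function. The helper `count_fits` and the `grids` dictionary are illustrative names, not part of the notebook.

```python
from itertools import product

def count_fits(param_grid, cv=5):
    """Return (number of candidates, number of cross-validation fits) for a grid."""
    n_candidates = len(list(product(*param_grid.values())))
    return n_candidates, n_candidates * cv

# Parameter grids copied from the training cells above
grids = {
    "LogisticRegression": {"model__C": [0.1, 1.0, 10.0], "model__penalty": ["l2"]},
    "LogisticRegression v.2": {"model__C": range(5, 16), "model__penalty": ["l1", "l2"]},
    "DecisionTreeClassifier": {"model__criterion": ["gini", "entropy"],
                               "model__max_depth": [2, 4, 6]},
    "CatBoostClassifier": {"model__depth": [4, 6],
                           "model__learning_rate": [0.01, 0.03],
                           "model__iterations": [10, 50]},
}

for label, grid in grids.items():
    candidates, fits = count_fits(grid)
    print(f"{label}: {candidates} candidates, {fits} cross-validation fits")
```

Each search also refits the best candidate once on the whole training set, since `GridSearchCV` uses `refit=True` by default. Because the TF-IDF step is inside the pipeline, the vectorizer is fitted again for every fold, which avoids leaking validation-fold vocabulary statistics into training.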


In [17]:
result

Unnamed: 0,f1,params,model,label
0,0.77435,"{'model__C': 10.0, 'model__penalty': 'l2'}","(TfidfVectorizer(), LogisticRegression(C=10.0))",LogisticRegression
1,0.775407,"{'model__C': 15, 'model__penalty': 'l2'}","(TfidfVectorizer(), LogisticRegression(C=15))",LogisticRegression v.2
2,0.558517,"{'model__criterion': 'gini', 'model__max_depth...","(TfidfVectorizer(), DecisionTreeClassifier(max...",DecisionTreeClassifier
3,0.5272,"{'model__depth': 6, 'model__iterations': 50, '...","(TfidfVectorizer(), <catboost.core.CatBoostCla...",CatBoostClassifier


In [18]:
best_result = result.sort_values('f1', ascending=False).iloc[0]
result.sort_values('f1', ascending=False)

Unnamed: 0,f1,params,model,label
1,0.775407,"{'model__C': 15, 'model__penalty': 'l2'}","(TfidfVectorizer(), LogisticRegression(C=15))",LogisticRegression v.2
0,0.77435,"{'model__C': 10.0, 'model__penalty': 'l2'}","(TfidfVectorizer(), LogisticRegression(C=10.0))",LogisticRegression
2,0.558517,"{'model__criterion': 'gini', 'model__max_depth...","(TfidfVectorizer(), DecisionTreeClassifier(max...",DecisionTreeClassifier
3,0.5272,"{'model__depth': 6, 'model__iterations': 50, '...","(TfidfVectorizer(), <catboost.core.CatBoostCla...",CatBoostClassifier


In [19]:
predict_test = best_result.model.predict(test.text2)
f1_score(test.toxic, predict_test)  # signature is f1_score(y_true, y_pred)

0.7715398716773602
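
For reference, F1 is the harmonic mean of precision and recall for the positive (toxic) class. The sketch below computes it from scratch on a small made-up example; `binary_scores` is a hypothetical helper, not a scikit-learn function. It also shows that swapping the true and predicted labels exchanges precision and recall but leaves the binary F1 unchanged.

```python
def binary_scores(y_true, y_pred, pos_label=1):
    """Return (precision, recall, f1) for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == pos_label and p == pos_label for t, p in pairs)
    fp = sum(t != pos_label and p == pos_label for t, p in pairs)
    fn = sum(t == pos_label and p != pos_label for t, p in pairs)
    if tp == 0:
        return 0.0, 0.0, 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    # Equivalent to 2 * precision * recall / (precision + recall)
    f1 = 2 * tp / (2 * tp + fp + fn)
    return precision, recall, f1

# Made-up labels: 3 toxic comments, 4 predicted as toxic, 2 of them correctly
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 1, 0, 0, 0]

p, r, f1 = binary_scores(y_true, y_pred)
p_swapped, r_swapped, f1_swapped = binary_scores(y_pred, y_true)
print(f"precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
print(f"swapped:   precision={p_swapped:.3f} recall={r_swapped:.3f} f1={f1_swapped:.3f}")
```

The symmetry holds only for F1 itself, so passing the arguments in the documented order (true labels first) still matters as soon as precision or recall are reported separately.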

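## Conclusions

The best model is LogisticRegression (C=15, l2 penalty). Its F1 is above the required 0.75 threshold both in 5-fold cross-validation on the training set (about 0.775) and on the test set (about 0.772). With the small grids tried here, DecisionTreeClassifier and CatBoostClassifier stayed well below the threshold (about 0.56 and 0.53 in cross-validation).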