# Project for «Wikishop»

The online store is launching a new service. Users can now edit and extend product descriptions, as in wiki communities: customers propose their own edits and comment on changes made by others. The store needs a tool that finds toxic comments and sends them for moderation.

The task is to train a model that classifies comments as positive or negative. A dataset labeled for edit toxicity is available.

The model must reach an *F1* score of at least 0.75.
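
For reference, F1 is the harmonic mean of precision and recall for the positive (toxic) class. The sketch below works through the formula on made-up counts; the helper name `f1_from_counts` and the numbers are illustrative assumptions, not values from the dataset.

```python
def f1_from_counts(tp, fp, fn):
    """Return F1 for the positive class given confusion-matrix counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Made-up example: 80 toxic comments caught, 20 false alarms, 30 missed.
score = f1_from_counts(tp=80, fp=20, fn=30)
print(round(score, 3))
```

With these counts precision is 0.8 and recall is about 0.727, so the score lands just above the 0.75 target.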

**Project stages**

1. Loading and preparing the data.
2. Training the models.
3. Testing the best model.

**Data description**

The data is stored in the file `toxic_comments.csv`. Its *text* column contains the comment text, and *toxic* is the target feature.

In [33]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sbs

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.model_selection import RandomizedSearchCV, cross_val_score
from sklearn.metrics import f1_score
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from catboost import CatBoostClassifier
import re
import nltk
from nltk.corpus import stopwords as nltk_stopwords
from nltk.stem import WordNetLemmatizer

## Preparation

In [3]:
data = pd.read_csv("/datasets/toxic_comments.csv", index_col=0)
data.sample(5)

Unnamed: 0,text,toxic
54326,Thank you for experimenting with the page Nint...,0
68243,"Yes, you are the very model of flexibility and...",0
113732,Please stop. If you continue to vandalize Wiki...,0
8452,"""\nYes, it seems to have calmed down and thank...",0
134428,"Please undo your revert, there was no need for...",0


<div class="alert alert-success">
<b>Reviewer's comment ✔️:</b> Great, the data is in place :)</div>

In [4]:
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 159292 entries, 0 to 159450
Data columns (total 2 columns):
 #   Column  Non-Null Count   Dtype 
---  ------  --------------   ----- 
 0   text    159292 non-null  object
 1   toxic   159292 non-null  int64 
dtypes: int64(1), object(1)
memory usage: 3.6+ MB


The data is loaded: 159,292 rows with no missing values in either column.

### Lemmatization

In [5]:
lemmatizer = WordNetLemmatizer()

def lemmatize(text):
    # Tokenize, lemmatize each token (WordNet treats tokens as nouns by default), and rejoin
    return ' '.join(lemmatizer.lemmatize(word) for word in nltk.word_tokenize(text))

In [6]:
%%time

lemmatize(data['text'][0])

CPU times: user 1.26 s, sys: 52.2 ms, total: 1.31 s
Wall time: 1.31 s


"Explanation Why the edits made under my username Hardcore Metallica Fan were reverted ? They were n't vandalism , just closure on some GAs after I voted at New York Dolls FAC . And please do n't remove the template from the talk page since I 'm retired now.89.205.38.27"

### Cleaning

In [7]:
def clean_text(text):
    # Replace everything except Latin letters and spaces, then collapse repeated whitespace
    cleaned_text = re.sub(r'[^a-zA-Z ]', ' ', text).split()
    return ' '.join(cleaned_text)

In [8]:
clean_text(data['text'][2])

'Hey man I m really not trying to edit war It s just that this guy is constantly removing relevant information and talking to me through edits instead of my talk page He seems to care more about the formatting than the actual info'

In [9]:
%%time

data['cleaned_text'] = data['text'].apply(lambda x: clean_text(lemmatize(x.lower())))

CPU times: user 2min 24s, sys: 403 ms, total: 2min 24s
Wall time: 2min 24s


In [10]:
data.sample(5)

Unnamed: 0,text,toxic,cleaned_text
2861,"""Just to make the point that I'm not making th...",0,just to make the point that i m not making thi...
51233,"""\nWhen the producer goes to Jones """"Truth mea...",0,when the producer go to jones truth meating th...
150067,"2010 (UTC)\n2 or 5 13:03, 22 September",0,utc or september
47365,"""\n\nAlthough I respect your admiration for it...",0,although i respect your admiration for it is n...
5402,Please stop removing referenced information fr...,0,please stop removing referenced information fr...


The text has been processed.

## Training

### Splitting the data

In [11]:
X, y = data['cleaned_text'], data['toxic']

In [12]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

In [13]:
X_train.shape

(127433,)

In [14]:
X_test.shape

(31859,)
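
The split above sets neither `stratify` nor `random_state`, so the share of toxic comments can differ slightly between the two parts and the split changes on every run. A minimal sketch of a stratified, repeatable split on toy data follows; the toy labels, the 10% toxic share, and the seed value are assumptions chosen for illustration only.

```python
from sklearn.model_selection import train_test_split

# Toy stand-in for the real data: 100 comments, 10% labelled toxic.
texts = [f"comment {i}" for i in range(100)]
labels = [1 if i < 10 else 0 for i in range(100)]

X_tr, X_te, y_tr, y_te = train_test_split(
    texts, labels,
    test_size=0.2,
    stratify=labels,    # keep the toxic share equal in both parts
    random_state=42,    # arbitrary seed, makes the split repeatable
)

print(sum(y_tr) / len(y_tr), sum(y_te) / len(y_te))
```

Both parts end up with the same toxic share, which keeps F1 on the test set comparable with the cross-validation scores.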

### Tf-Idf

In [17]:
nltk.download('stopwords')
stopwords = set(nltk_stopwords.words('english'))

[nltk_data] Downloading package stopwords to /home/jovyan/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [18]:
# Pass the stop words as a list: newer scikit-learn versions expect a list here, not a set
tf_idf = TfidfVectorizer(stop_words=list(stopwords))
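
To make the weighting concrete, here is a hand-rolled sketch of TF-IDF on a three-document toy corpus. It uses the smoothed IDF formula and L2 normalisation that `TfidfVectorizer` applies by default, but it is a simplified illustration (whitespace tokenisation, no stop words), not the library's implementation; the corpus and helper names are made up.

```python
import math
from collections import Counter

docs = [
    "you are great",
    "you are awful",
    "great edit thanks",
]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents each term occurs.
df = Counter(term for tokens in tokenized for term in set(tokens))

def idf(term):
    # Smoothed IDF: ln((1 + n) / (1 + df)) + 1
    return math.log((1 + n_docs) / (1 + df[term])) + 1

def tfidf_vector(tokens):
    counts = Counter(tokens)
    weights = {t: c * idf(t) for t, c in counts.items()}
    # L2-normalise so every document vector has unit length.
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()}

vectors = [tfidf_vector(tokens) for tokens in tokenized]
for doc, vec in zip(docs, vectors):
    print(doc, {t: round(w, 3) for t, w in vec.items()})
```

In the second document the rare word "awful" gets a larger weight than "you" or "are", which occur in two documents each.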

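### Model selection

Cross-validation requires placing the vectorizer inside a pipeline so that TF-IDF is refit on each training fold, which takes a long time. To keep the run time manageable, each model is tuned with a randomized search over a small parameter grid using 3-fold cross-validation, and the held-out test set is used only for the final check.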

#### Dummy

In [23]:
dummy_model = Pipeline(steps=[('transform', tf_idf), ('clf', DummyClassifier())])
dummy_score = cross_val_score(dummy_model, X_train, y_train, scoring='f1')

In [24]:
dummy_score

array([0., 0., 0., 0., 0.])
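
The zero scores are expected. With its default strategy in recent scikit-learn versions, `DummyClassifier` ignores the features and predicts the most frequent class, here the non-toxic class 0. It never produces a true positive, so F1 for the toxic class is 0 even though accuracy looks high. A sketch on made-up labels (a 90/10 class split is assumed for illustration):

```python
# Made-up labels: 90 non-toxic (0) and 10 toxic (1) comments.
y_true = [0] * 90 + [1] * 10
# A majority-class baseline predicts 0 for every comment.
y_pred = [0] * len(y_true)

tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
# F1 = 2*TP / (2*TP + FP + FN); taken as 0 when there are no true positives.
f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0

print(accuracy, f1)
```

This is why F1, rather than accuracy, is the meaningful metric for an imbalanced toxicity task.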

#### Logistic regression

In [35]:
%%time

log_reg_pipeline = Pipeline(steps=[('transformer', tf_idf), ("clf", LogisticRegression(max_iter=300))])

parameters = {
    'transformer__max_df': (0.25, 0.5, 0.75),
    'transformer__ngram_range': [(1, 1), (1, 2), (1, 3)],
    'clf__C': [1,2,6]
}

log_reg_model = RandomizedSearchCV(log_reg_pipeline,
                                   parameters,
                                   scoring='f1',
                                   cv=3,
                                   verbose=10,
                                   n_jobs=-1).fit(X_train, y_train)

Fitting 3 folds for each of 10 candidates, totalling 30 fits
[CV 1/3; 1/10] START clf__C=6, transformer__max_df=0.25, transformer__ngram_range=(1, 2)
[CV 1/3; 1/10] END clf__C=6, transformer__max_df=0.25, transformer__ngram_range=(1, 2); total time= 3.6min
[CV 2/3; 1/10] START clf__C=6, transformer__max_df=0.25, transformer__ngram_range=(1, 2)
[CV 2/3; 1/10] END clf__C=6, transformer__max_df=0.25, transformer__ngram_range=(1, 2); total time= 3.2min
[CV 3/3; 1/10] START clf__C=6, transformer__max_df=0.25, transformer__ngram_range=(1, 2)
[CV 3/3; 1/10] END clf__C=6, transformer__max_df=0.25, transformer__ngram_range=(1, 2); total time= 2.7min
[CV 1/3; 2/10] START clf__C=6, transformer__max_df=0.75, transformer__ngram_range=(1, 3)
[CV 1/3; 2/10] END clf__C=6, transformer__max_df=0.75, transformer__ngram_range=(1, 3); total time= 6.0min
[CV 2/3; 2/10] START clf__C=6, transformer__max_df=0.75, transformer__ngram_range=(1, 3)
[CV 2/3; 2/10] END clf__C=6, transformer__max_df=0.75, transformer

In [36]:
log_reg_model.best_score_

0.7677701060377763

In [37]:
log_reg_model.best_params_

{'transformer__ngram_range': (1, 1), 'transformer__max_df': 0.5, 'clf__C': 6}
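
The log line "Fitting 3 folds for each of 10 candidates" follows from the default `n_iter=10` of `RandomizedSearchCV`: the grid above has 3 × 3 × 3 = 27 combinations, and only 10 of them are sampled. The sketch below mimics that sampling with the standard library; it is not scikit-learn's own sampler, and the seed is arbitrary.

```python
import itertools
import random

param_grid = {
    'transformer__max_df': (0.25, 0.5, 0.75),
    'transformer__ngram_range': [(1, 1), (1, 2), (1, 3)],
    'clf__C': [1, 2, 6],
}

# Enumerate the full grid as a list of parameter dictionaries.
names = list(param_grid)
all_candidates = [dict(zip(names, values))
                  for values in itertools.product(*param_grid.values())]

rng = random.Random(0)
sampled = rng.sample(all_candidates, 10)  # 10 distinct candidates, no repeats

n_folds = 3
print(len(all_candidates), len(sampled), len(sampled) * n_folds)
```

Because only part of the grid is evaluated, the reported best parameters are the best among the sampled candidates, not necessarily the best of all 27.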

#### Decision Tree

In [38]:
%%time

tree_pipeline = Pipeline(steps=[
    ('transformer', tf_idf),
    ('clf', DecisionTreeClassifier())
])

parameters = {
    'transformer__max_df': (0.25, 0.5, 0.75),
    'transformer__ngram_range': ((1, 1), (1, 2), (1, 3)),
    'clf__max_depth': range(3, 10)
}

tree_model = RandomizedSearchCV(tree_pipeline,
                                   parameters,
                                   scoring='f1',
                                   cv=3,
                                   verbose=10,
                                   n_jobs=-1).fit(X_train, y_train)

Fitting 3 folds for each of 10 candidates, totalling 30 fits
[CV 1/3; 1/10] START clf__max_depth=3, transformer__max_df=0.25, transformer__ngram_range=(1, 2)
[CV 1/3; 1/10] END clf__max_depth=3, transformer__max_df=0.25, transformer__ngram_range=(1, 2); total time=  24.9s
[CV 2/3; 1/10] START clf__max_depth=3, transformer__max_df=0.25, transformer__ngram_range=(1, 2)
[CV 2/3; 1/10] END clf__max_depth=3, transformer__max_df=0.25, transformer__ngram_range=(1, 2); total time=  25.6s
[CV 3/3; 1/10] START clf__max_depth=3, transformer__max_df=0.25, transformer__ngram_range=(1, 2)
[CV 3/3; 1/10] END clf__max_depth=3, transformer__max_df=0.25, transformer__ngram_range=(1, 2); total time=  24.9s
[CV 1/3; 2/10] START clf__max_depth=6, transformer__max_df=0.75, transformer__ngram_range=(1, 3)
[CV 1/3; 2/10] END clf__max_depth=6, transformer__max_df=0.75, transformer__ngram_range=(1, 3); total time= 1.1min
[CV 2/3; 2/10] START clf__max_depth=6, transformer__max_df=0.75, transformer__ngram_range=(

In [39]:
tree_model.best_score_

0.5704684268109356

In [40]:
tree_model.best_params_

{'transformer__ngram_range': (1, 1),
 'transformer__max_df': 0.25,
 'clf__max_depth': 8}

Logistic regression performed best: a cross-validated F1 of about 0.768 versus about 0.570 for the decision tree.

## Testing the model

We evaluate the selected model on the test set.

In [42]:
f1_score(y_test, log_reg_model.best_estimator_.predict(X_test))

0.7629551820728292

The model meets the requirement: F1 on the test set is about 0.763, above the 0.75 threshold.

## Conclusions

The following was done in this work:
 - The data was loaded
 - The text was preprocessed: lemmatization and cleaning
 - The text was vectorized with TF-IDF
 - Models were compared. Logistic regression turned out to be the best
 - The selected model was tested on the held-out test set