<h1>Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Подготовка" data-toc-modified-id="Подготовка-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Preparation</a></span></li><li><span><a href="#Обучение" data-toc-modified-id="Обучение-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Training</a></span></li><li><span><a href="#Выводы" data-toc-modified-id="Выводы-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Conclusions</a></span></li><li><span><a href="#Чек-лист-проверки" data-toc-modified-id="Чек-лист-проверки-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Review checklist</a></span></li></ul></div>

# Project for Wikishop

The online store Wikishop is launching a new service. Users can now edit and extend product descriptions, as in wiki communities: customers propose their own edits and comment on changes made by others. The store needs a tool that finds toxic comments and sends them for moderation.

Train a model to classify comments as positive or negative. You have a dataset in which edits are labeled for toxicity.

Build a model with an *F1* score of at least 0.75.

**Project instructions**

1. Load and prepare the data.
2. Train several different models.
3. Draw conclusions.

Using *BERT* is optional for this project, but you are welcome to try it.

**Data description**

The data is in the file `toxic_comments.csv`. The *text* column contains the comment text, and *toxic* is the target.

## Preparation

In [None]:
import re
import warnings

import nltk
import pandas as pd
from nltk.corpus import stopwords as nltk_stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier
from tqdm.notebook import tqdm

# Resources for the WordNet lemmatizer and the POS tagger
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')

# Silence library warnings to keep the cell output readable
warnings.filterwarnings('ignore')

[nltk_data] Downloading package wordnet to /home/jovyan/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /home/jovyan/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


In [None]:
df = pd.read_csv('/datasets/toxic_comments.csv')

In [None]:
df['toxic'].value_counts()

0    143106
1     16186
Name: toxic, dtype: int64
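
The counts above show that the classes are imbalanced. As a rough illustration (the dictionary simply copies the numbers printed above), the share of toxic comments and the accuracy of a trivial "always non-toxic" predictor can be computed like this:

```python
# Class counts copied from the value_counts() output above
counts = {0: 143106, 1: 16186}

total = sum(counts.values())
toxic_share = counts[1] / total

# A classifier that always predicts "non-toxic" is right on every class-0 row
majority_accuracy = counts[0] / total

print(f"toxic share: {toxic_share:.3f}")
print(f"accuracy of always predicting 0: {majority_accuracy:.3f}")
```

Roughly one comment in ten is toxic, so accuracy alone would look high even for a useless model. This is one reason the task is scored with *F1*.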

Build the corpus of comment texts

In [None]:
corpus = list(df['text'])

In [None]:
lemmatizer = WordNetLemmatizer()

Define a function that cleans the text

In [None]:
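def clear_text(text):
    # Replace everything except Latin letters, apostrophes and spaces with a space
    cleaned = re.sub(r"[^'a-zA-Z ]", ' ', text)
    # Collapse runs of whitespace into single spaces
    return " ".join(cleaned.split())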
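
To see what the cleaning step does, here is a small standalone check. The function repeats the logic of `clear_text`; the sample string is made up for illustration.

```python
import re

def clear_text(text):
    # Keep Latin letters, apostrophes and spaces; everything else becomes a space
    cleaned = re.sub(r"[^'a-zA-Z ]", ' ', text)
    # Collapse repeated whitespace
    return " ".join(cleaned.split())

# Hypothetical sample with a newline, punctuation and digits
sample = "D'aww!\nHe matches this background colour, 100%..."
result = clear_text(sample)
print(result)  # D'aww He matches this background colour
```

Apostrophes are kept on purpose, so contractions such as `I'm` survive until tokenization.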

Next, define the functions that lemmatize every word of the cleaned text. The first is a helper that maps a word's part-of-speech tag to the WordNet tag the lemmatizer expects

In [None]:
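from nltk.corpus import wordnet

def get_wordnet_pos(word):
    """Map the word's POS tag to the WordNet tag used by the lemmatizer."""
    # The tagger sees one word without context; the first letter of the tag selects the class
    tag = nltk.pos_tag([word])[0][1][0].upper()
    tag_dict = {"J": wordnet.ADJ,
                "N": wordnet.NOUN,
                "V": wordnet.VERB,
                "R": wordnet.ADV}
    # Any other tag falls back to NOUN
    return tag_dict.get(tag, wordnet.NOUN)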
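
The mapping itself can be shown without NLTK. The sketch below uses a hypothetical helper, `penn_to_wordnet`, that takes a ready-made tag string. The single letters are the values behind `wordnet.ADJ`, `wordnet.NOUN`, `wordnet.VERB` and `wordnet.ADV`.

```python
# First letter of a Penn Treebank tag -> WordNet POS letter
WORDNET_POS = {"J": "a", "N": "n", "V": "v", "R": "r"}

def penn_to_wordnet(penn_tag):
    """Hypothetical helper: map a Penn Treebank tag such as 'VBD' to a WordNet POS letter."""
    # Tags outside the four classes (pronouns, determiners, ...) fall back to noun
    return WORDNET_POS.get(penn_tag[:1].upper(), "n")

for tag in ["JJ", "NNS", "VBD", "RB", "PRP"]:
    print(tag, "->", penn_to_wordnet(tag))
```

Separating the mapping from the tagger would also make it possible to tag a whole sentence at once and then map each tag, which gives the tagger more context than a single word.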

Apply the transformation to the whole corpus

In [None]:
def lemmafunction(text):
    # Tokenize, lemmatize each token with its POS tag, and join the lemmas back into a string
    tokens = nltk.word_tokenize(text)
    return ' '.join(lemmatizer.lemmatize(token, get_wordnet_pos(token)) for token in tokens)

# Clean and lemmatize every comment; tqdm shows a progress bar
lemy = [lemmafunction(clear_text(text)) for text in tqdm(corpus)]
df['lemm_text'] = pd.Series(lemy, index=df.index)


In [None]:
df.head()

| | Unnamed: 0 | text | toxic | lemm_text |
|---|---|---|---|---|
| 0 | 0 | Explanation\nWhy the edits made under my usern... | 0 | Explanation Why the edits make under my userna... |
| 1 | 1 | D'aww! He matches this background colour I'm s... | 0 | D'aww He match this background colour I 'm see... |
| 2 | 2 | Hey man, I'm really not trying to edit war. It... | 0 | Hey man I 'm really not try to edit war It 's ... |
| 3 | 3 | "\nMore\nI can't make any real suggestions on ... | 0 | More I ca n't make any real suggestion on impr... |
| 4 | 4 | You, sir, are my hero. Any chance you remember... | 0 | You sir be my hero Any chance you remember wha... |


In [None]:
target = df['toxic']
features = df.drop(['toxic'], axis=1)

# Hold out half of the data for testing; random_state makes the split reproducible
features_train, features_test, target_train, target_test = train_test_split(
    features, target, test_size=0.5, random_state=123
)


In [None]:
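nltk.download('stopwords')
# Keep the stop words as a list: recent scikit-learn releases validate stop_words and expect a list, not a set
stopwords = nltk_stopwords.words('english')

count_tf_idf = TfidfVectorizer(stop_words=stopwords)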

[nltk_data] Downloading package stopwords to /home/jovyan/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [None]:
# Fit the vocabulary and IDF weights on the training texts only, then reuse them for the test texts
features_train = count_tf_idf.fit_transform(features_train['lemm_text'].values)
features_test = count_tf_idf.transform(features_test['lemm_text'].values)
print(features_train.shape)
print(features_test.shape)

(79646, 106137)
(79646, 106137)
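
To make the TF-IDF features less of a black box, here is a minimal pure-Python sketch of the weighting. It assumes the smoothed IDF formula and L2 normalization that scikit-learn documents as the `TfidfVectorizer` defaults, and it splits on whitespace for simplicity, so it is an illustration rather than a reimplementation. The three documents are made up.

```python
import math
from collections import Counter

# Hypothetical toy corpus of already lemmatized comments
docs = [
    "you be my hero",
    "you be a troll",
    "hero edit war",
]

tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: the number of documents in which each term occurs
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

def tfidf_vector(tokens):
    counts = Counter(tokens)
    # Smoothed IDF: ln((1 + n) / (1 + df)) + 1, multiplied by the raw term count
    weights = {
        term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0)
        for term, count in counts.items()
    }
    # L2-normalize so that every document vector has unit length
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}

vec = tfidf_vector(tokenized[1])
for term, weight in sorted(vec.items(), key=lambda kv: -kv[1]):
    print(f"{term}: {weight:.3f}")
```

Terms that occur in a single document (`troll`) get a larger weight than terms shared by several documents (`you`, `be`), which is the property the classifier relies on.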


## Training

LogisticRegression

In [None]:
model = LogisticRegression(random_state=1, solver='liblinear', max_iter=100)
params = {
    'penalty': ['l1', 'l2'],
    'C': list(range(1, 15, 3)),
}
grid_cv = GridSearchCV(estimator=model, cv=3, param_grid=params, n_jobs=-1, verbose=10, scoring='f1')
grid_cv.fit(features_train, target_train)
print('best params =', grid_cv.best_params_)
# Mean cross-validated F1 of the best parameter combination
lr_cv_f1 = grid_cv.best_score_
print('f1_score =', lr_cv_f1)

Fitting 3 folds for each of 10 candidates, totalling 30 fits
[CV 1/3; 1/10] START C=1, penalty=l1............................................
[CV 1/3; 1/10] END ..........................C=1, penalty=l1; total time=   0.5s
[CV 2/3; 1/10] START C=1, penalty=l1............................................
[CV 2/3; 1/10] END ..........................C=1, penalty=l1; total time=   0.5s
[CV 3/3; 1/10] START C=1, penalty=l1............................................
[CV 3/3; 1/10] END ..........................C=1, penalty=l1; total time=   0.5s
[CV 1/3; 2/10] START C=1, penalty=l2............................................
[CV 1/3; 2/10] END ..........................C=1, penalty=l2; total time=   5.1s
[CV 2/3; 2/10] START C=1, penalty=l2............................................
[CV 2/3; 2/10] END ..........................C=1, penalty=l2; total time=   4.9s
[CV 3/3; 2/10] START C=1, penalty=l2............................................
[CV 3/3; 2/10] END ..........................C=1
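
The "10 candidates, totalling 30 fits" line in the log follows directly from the parameter grid. A small sketch that enumerates the same grid with the standard library:

```python
from itertools import product

# Same grid as in the LogisticRegression cell above
params = {
    'penalty': ['l1', 'l2'],
    'C': list(range(1, 15, 3)),  # [1, 4, 7, 10, 13]
}
cv_folds = 3

# Every combination of one value per parameter is one candidate
candidates = [dict(zip(params, combo)) for combo in product(*params.values())]
total_fits = len(candidates) * cv_folds

print(len(candidates), "candidates,", total_fits, "fits")  # 10 candidates, 30 fits
```

With the default `refit=True`, `GridSearchCV` additionally retrains the best candidate on the whole training set, which is what `grid_cv.predict` uses later.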

DecisionTreeClassifier

In [None]:
tree = DecisionTreeClassifier(random_state=123)
params = {
    'criterion': ['gini', 'entropy'],
    'max_depth': list(range(1, 15, 5)),
}

tree_gs = GridSearchCV(tree, params, cv=3, scoring='f1', verbose=True).fit(features_train, target_train)

print("f1_score =", tree_gs.best_score_)

Fitting 3 folds for each of 6 candidates, totalling 18 fits
f1_score = 0.6010043208971775


In [None]:
# Evaluate the best LogisticRegression found by the grid search on the held-out test set
predictions_test = grid_cv.predict(features_test)
f1_test = f1_score(target_test, predictions_test)
print("f1_score =", f1_test)

f1_score = 0.7682711864406779
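
For reference, *F1* is the harmonic mean of precision and recall for the positive (toxic) class. The sketch below computes it by hand on made-up labels. The helper name `f1_from_labels` is hypothetical and only mirrors what `f1_score` reports for binary labels.

```python
def f1_from_labels(y_true, y_pred):
    """Hypothetical helper: F1 for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        # No true positives: precision and recall are zero or undefined, so report 0
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Made-up example: 3 toxic comments out of 10
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]

print(f1_from_labels(y_true, y_pred))     # tp=2, fp=1, fn=1
print(f1_from_labels(y_true, [0] * 10))   # 0.0: never flagging anything scores zero
```

In the first case precision and recall are both 2/3, so *F1* is 2/3 as well, while a model that never flags a comment scores 0.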


## Conclusions

- LogisticRegression is the model that meets the requirement of an F1 score of at least 0.75. DecisionTreeClassifier reached only about 0.60 in cross-validation.
- The f1_score of LogisticRegression on the test set is 0.7682711864406779.