<h1>Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Подготовка" data-toc-modified-id="Подготовка-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Preparation</a></span></li><li><span><a href="#Обучение" data-toc-modified-id="Обучение-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Training</a></span></li><li><span><a href="#Выводы" data-toc-modified-id="Выводы-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Conclusions</a></span></li><li><span><a href="#Чек-лист-проверки" data-toc-modified-id="Чек-лист-проверки-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Review checklist</a></span></li></ul></div>

**Project goal:** train a model that classifies comments as positive or negative. The model must reach an *F1* score of at least 0.75.

**Data description**

The data is stored in `toxic_comments.csv`. The *text* column holds the comment text, and *toxic* is the target label.

## Preparation

In [3]:
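import pandas as pd                     # import libraries
import numpy as np
from pymystem3 import Mystem
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import TfidfVectorizer
import nltk
# stop-word list; nltk.word_tokenize and WordNetLemmatizer used below also need
# the NLTK tokenizer and WordNet resources to be available
nltk.download('stopwords')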
from nltk.corpus import stopwords as nltk_stopwords
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from nltk.stem import WordNetLemmatizer
import re
from catboost import CatBoostClassifier

[nltk_data] Downloading package stopwords to /home/jovyan/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [6]:
import os

pth1 = '/datasets/toxic_comments.csv'
pth2 = 'C:\\Users\\User\\Downloads\\toxic_comments.csv'

if os.path.exists(pth1):
    df = pd.read_csv(pth1)
elif os.path.exists(pth2):
    df = pd.read_csv(pth2)
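else:
    # fail early: without the file, df would be undefined in every later cell
    raise FileNotFoundError('toxic_comments.csv was not found in any of the expected locations')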

In [7]:
df

| | text | toxic |
|---|---|---|
| 0 | Explanation\nWhy the edits made under my usern... | 0 |
| 1 | D'aww! He matches this background colour I'm s... | 0 |
| 2 | Hey man, I'm really not trying to edit war. It... | 0 |
| 3 | "\nMore\nI can't make any real suggestions on ... | 0 |
| 4 | You, sir, are my hero. Any chance you remember... | 0 |
| ... | ... | ... |
| 159566 | ":::::And for the second time of asking, when ... | 0 |
| 159567 | You should be ashamed of yourself \n\nThat is ... | 0 |
| 159568 | Spitzer \n\nUmm, theres no actual article for ... | 0 |
| 159569 | And it looks like it was actually you who put ... | 0 |


We lemmatize the text with WordNetLemmatizer.

In [8]:
lemmatizer = WordNetLemmatizer()

def lemmatize(text):
    # split the text into tokens and lemmatize each one (default part of speech: noun)
    word_list = nltk.word_tokenize(text)
    lemmatized_output = ' '.join([lemmatizer.lemmatize(w) for w in word_list])
    return lemmatized_output

To strip unwanted characters, we apply the clear_text function, which uses a regular expression to keep only Latin letters and spaces.

In [9]:
def clear_text(text):
    text = re.sub(r'[^a-zA-Z ]', ' ', text)
    return " ".join(text.split())
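As a quick illustration of what this cleaning step does, the sketch below runs the same function on a made-up sample string (the sample is an assumption, not a row from the dataset). Note that case is preserved and contractions are split at the apostrophe.

```python
import re

def clear_text(text):
    # Replace every character that is not a Latin letter or a space with a space
    text = re.sub(r'[^a-zA-Z ]', ' ', text)
    # Collapse runs of whitespace into single spaces and trim both ends
    return " ".join(text.split())

# Hypothetical sample with punctuation, a contraction, newlines and an emoticon
sample = "D'aww! He matches this background colour, doesn't he?\n\n:-)"
cleaned = clear_text(sample)
print(cleaned)
```

Lowercasing is not done here; TfidfVectorizer lowercases its input by default, so the vectorization step takes care of it.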

In [10]:
corpus = df['text']

In [11]:
corpus_lemm = [lemmatize(clear_text(text)) for text in corpus]

In [12]:
stopwords = nltk_stopwords.words('english')  # list of English stop words (recent scikit-learn versions expect a list, not a set)

In [13]:
# split the data into training and test sets
features_train, features_test, target_train, target_test = train_test_split(
    corpus_lemm, df['toxic'],
    test_size=0.2,
    random_state=42)

We vectorize the text with TF-IDF and remove stop words.

In [14]:
tf_idf_vec = TfidfVectorizer(stop_words=stopwords)

In [15]:
features_train_vec = tf_idf_vec.fit_transform(features_train)

In [16]:
features_test_vec = tf_idf_vec.transform(features_test)
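The vectorizer is fitted on the training texts only and then applied to the test texts, so the test set does not influence the vocabulary or the IDF weights. The pure-Python sketch below shows the idea under simplifying assumptions: whitespace tokenization, no stop-word removal, and the smoothed IDF formula with L2 normalization that scikit-learn uses by default. The helper names `fit_idf` and `transform` and the toy documents are hypothetical.

```python
import math
from collections import Counter

def fit_idf(train_docs):
    """Learn the vocabulary and smoothed IDF weights from training documents only."""
    n_docs = len(train_docs)
    doc_freq = Counter()
    for doc in train_docs:
        doc_freq.update(set(doc.split()))  # count each term once per document
    # Smoothed IDF: ln((1 + n) / (1 + df)) + 1
    return {term: math.log((1 + n_docs) / (1 + df)) + 1
            for term, df in doc_freq.items()}

def transform(doc, idf):
    """Map one document to an L2-normalized TF-IDF vector, stored as a dict."""
    # Terms that were not seen during fitting are dropped
    counts = Counter(t for t in doc.split() if t in idf)
    weights = {t: c * idf[t] for t, c in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()} if norm else {}

# Toy documents, made up for illustration
train_docs = ["you are my hero", "you are wrong", "nice article"]
idf = fit_idf(train_docs)
test_vec = transform("you are nice unseenword", idf)
print(test_vec)
```

The rarer training term ("nice") receives a larger weight than the frequent one ("you"), and the word that never appeared in training is ignored.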

## Training

**LogisticRegression**

In [22]:
# tune hyperparameters for logistic regression
parameters = {'C': [10, 20],
              'max_iter': [1000]}
lrm = LogisticRegression()
grid = GridSearchCV(lrm, parameters, cv=5, scoring='f1')
grid.fit(features_train_vec, target_train)

GridSearchCV(cv=5, estimator=LogisticRegression(),
             param_grid={'C': [10, 20], 'max_iter': [1000]}, scoring='f1')

In [23]:
grid.best_params_

{'C': 10, 'max_iter': 1000}
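In the grid search above, the TF-IDF vectorizer was fitted once on the whole training set before cross-validation, so each validation fold contributed to the IDF weights. One possible variant, sketched here on a made-up toy corpus, wraps the vectorizer and the classifier in a scikit-learn `Pipeline` so that the vectorizer is refitted inside every fold. The names `pipe`, `toy_grid`, `texts`, and `labels` are assumptions for illustration, and the parameter grid simply mirrors the one used above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Toy data, made up for illustration: 1 = toxic, 0 = not toxic
texts = [
    "you are an idiot", "what a stupid idiot", "stupid and useless edit",
    "shut up you idiot", "useless stupid comment", "you stupid fool",
    "thanks for the helpful edit", "nice article good work", "great work thanks",
    "helpful and clear explanation", "good point thanks", "clear and nice summary",
]
labels = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]

# The vectorizer becomes a pipeline step, so it is fitted on each training fold only
pipe = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Parameters of a pipeline step are addressed as <step name>__<parameter>
param_grid = {"clf__C": [10, 20]}
toy_grid = GridSearchCV(pipe, param_grid, cv=3, scoring="f1")
toy_grid.fit(texts, labels)
print(toy_grid.best_params_)
```

With this layout the raw lemmatized texts, rather than the pre-computed matrices, are passed to `fit`.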

In [17]:
model = LogisticRegression(random_state=8, penalty='l2', C=10, max_iter=1000)
model.fit(features_train_vec, target_train)
predictions = model.predict(features_test_vec)
lr_f1_score = f1_score(target_test, predictions)
print(lr_f1_score)

0.7787182587666264
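The value printed above is the F1 score for the positive class (`toxic = 1`): the harmonic mean of precision and recall. As a worked example, the sketch below computes it by hand on made-up labels; the helper name `f1_from_labels` and the label lists are hypothetical.

```python
def f1_from_labels(y_true, y_pred):
    """Compute F1 for the positive class (label 1) from two label sequences."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    if tp == 0:
        return 0.0  # no correct positive predictions
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Made-up labels: 3 toxic comments, 2 of them found, plus 1 false alarm
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
score = f1_from_labels(y_true, y_pred)
print(score)
```

Here precision and recall are both 2/3, so F1 is also 2/3. A model that never predicts the positive class gets an F1 of 0, no matter how many negatives it labels correctly.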


**CatBoostClassifier**

In [25]:
model_cat = CatBoostClassifier().fit(features_train_vec, target_train)

Learning rate set to 0.081698
0:	learn: 0.6112146	total: 2.86s	remaining: 47m 34s
1:	learn: 0.5423298	total: 5.41s	remaining: 45m
2:	learn: 0.4852243	total: 7.78s	remaining: 43m 5s
3:	learn: 0.4396121	total: 10.1s	remaining: 41m 50s
4:	learn: 0.4027409	total: 12.3s	remaining: 40m 54s
5:	learn: 0.3714661	total: 14.6s	remaining: 40m 26s
6:	learn: 0.3471794	total: 16.9s	remaining: 39m 56s
7:	learn: 0.3275877	total: 19.1s	remaining: 39m 33s
8:	learn: 0.3115686	total: 21.4s	remaining: 39m 14s
9:	learn: 0.2982727	total: 23.6s	remaining: 38m 58s
10:	learn: 0.2872807	total: 25.9s	remaining: 38m 45s
11:	learn: 0.2780288	total: 28.1s	remaining: 38m 35s
12:	learn: 0.2700926	total: 30.3s	remaining: 38m 23s
13:	learn: 0.2636327	total: 32.7s	remaining: 38m 22s
14:	learn: 0.2577095	total: 35s	remaining: 38m 20s
15:	learn: 0.2528595	total: 37.3s	remaining: 38m 12s
16:	learn: 0.2485132	total: 39.5s	remaining: 38m 5s
17:	learn: 0.2447017	total: 41.7s	remaining: 37m 57s
18:	learn: 0.2411621	total: 44s	re

In [26]:
cat_predict = model_cat.predict(features_test_vec)

In [27]:
cat_f1_score = f1_score(target_test, cat_predict)
print(cat_f1_score)

0.7580877066858376


## Conclusions

Two models were used to classify comments as negative or positive:<br>
- CatBoostClassifier<br>
- LogisticRegression<br>
Before training, the texts were lemmatized, stop words were removed, and the texts were vectorized with TF-IDF.
Both models cleared the 0.75 target on the test set. Logistic regression scored best, with F1 above 0.77, against about 0.758 for CatBoostClassifier.

In [28]:
stats = pd.DataFrame([
                      ['CatBoostClassifier', cat_f1_score],
                      ['LogisticRegression', lr_f1_score]],
                    columns = ['model', 'F1'])
display(stats)

| | model | F1 |
|---|---|---|
| 0 | CatBoostClassifier | 0.758088 |
| 1 | LogisticRegression | 0.778507 |
