<h1>Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Подготовка" data-toc-modified-id="Подготовка-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Preparation</a></span><ul class="toc-item"><li><span><a href="#Загрузка-библиотек-и-данных" data-toc-modified-id="Загрузка-библиотек-и-данных-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>Loading libraries and data</a></span></li><li><span><a href="#Подготовка-выборок-для-моделей" data-toc-modified-id="Подготовка-выборок-для-моделей-1.2"><span class="toc-item-num">1.2&nbsp;&nbsp;</span>Preparing samples for the models</a></span></li><li><span><a href="#Выводы-по-1-этапу" data-toc-modified-id="Выводы-по-1-этапу-1.3"><span class="toc-item-num">1.3&nbsp;&nbsp;</span>Conclusions for stage 1</a></span></li></ul></li><li><span><a href="#Обучение" data-toc-modified-id="Обучение-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Training</a></span><ul class="toc-item"><li><span><a href="#Тестирование" data-toc-modified-id="Тестирование-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span>Testing</a></span></li></ul></li><li><span><a href="#Выводы" data-toc-modified-id="Выводы-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Conclusions</a></span></li><li><span><a href="#Чек-лист-проверки" data-toc-modified-id="Чек-лист-проверки-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Review checklist</a></span></li></ul></div>

# Project for an online store

An online store is launching a new service: users can now edit and extend product descriptions, as in wiki communities. Customers propose their own edits and comment on the changes made by others. The store needs a tool that finds toxic comments and sends them for moderation.

Train a model to classify comments as positive or negative. You have a dataset in which edits are labeled for toxicity.

Build a model with an *F1* score of at least 0.75.

**Project instructions**

1. Load and prepare the data.
2. Train several different models.
3. Draw conclusions.

Using *BERT* is not required for this project, but you may try it.

**Data description**

The *text* column contains the comment text, and *toxic* is the target feature.
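
For reference, *F1* is the harmonic mean of precision and recall for the positive (toxic) class. A minimal sketch in plain Python; the helper name `f1_from_counts` and the example counts are illustrative assumptions, not values from the project data.

```python
def f1_from_counts(tp: int, fp: int, fn: int) -> float:
    """Return F1 for the positive class from confusion-matrix counts."""
    if tp == 0:
        # No true positives: precision and recall are both zero (or undefined).
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts: 80 toxic comments caught, 20 false alarms, 30 missed.
print(round(f1_from_counts(tp=80, fp=20, fn=30), 4))
```

Because F1 ignores true negatives, it is not inflated by a large share of non-toxic comments the way accuracy can be.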

## Preparation

### Loading libraries and data

In [4]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import re
from pymystem3 import Mystem
import nltk
from nltk.corpus import stopwords as nltk_stopwords
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.tree import DecisionTreeClassifier
from catboost import CatBoostClassifier
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.utils import shuffle
from sklearn.metrics import f1_score, precision_score, recall_score, accuracy_score, roc_auc_score, roc_curve
import warnings
warnings.filterwarnings('ignore')

In [2]:
data = pd.read_csv('/datasets/toxic_comments.csv')
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 159292 entries, 0 to 159291
Data columns (total 3 columns):
 #   Column      Non-Null Count   Dtype 
---  ------      --------------   ----- 
 0   Unnamed: 0  159292 non-null  int64 
 1   text        159292 non-null  object
 2   toxic       159292 non-null  int64 
dtypes: int64(2), object(1)
memory usage: 3.6+ MB


In [3]:
data.head(10)

Unnamed: 0.1,Unnamed: 0,text,toxic
0,0,Explanation\nWhy the edits made under my usern...,0
1,1,D'aww! He matches this background colour I'm s...,0
2,2,"Hey man, I'm really not trying to edit war. It...",0
3,3,"""\nMore\nI can't make any real suggestions on ...",0
4,4,"You, sir, are my hero. Any chance you remember...",0
5,5,"""\n\nCongratulations from me as well, use the ...",0
6,6,COCKSUCKER BEFORE YOU PISS AROUND ON MY WORK,1
7,7,Your vandalism to the Matt Shirvington article...,0
8,8,Sorry if the word 'nonsense' was offensive to ...,0
9,9,alignment on this subject and which are contra...,0


### Preparing samples for the models

Let's start by defining the features and the target.

In [5]:
def get_wordnet_pos(word):
    """Map POS tag to first character lemmatize() accepts"""
    tag = nltk.pos_tag([word])[0][1][0].upper()
    tag_dict = {"J": wordnet.ADJ,
                "N": wordnet.NOUN,
                "V": wordnet.VERB,
                "R": wordnet.ADV}
    return tag_dict.get(tag, wordnet.NOUN)

In [None]:
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')  # corpus required by WordNetLemmatizer; a no-op if already installed

lemmatizer = WordNetLemmatizer()
 
def lemmatize_text(text):
    text = text.lower()
    words = text.split()
    words = [lemmatizer.lemmatize(word, get_wordnet_pos(word)) for word in words]
    return ' '.join(words)

data['lemm_text'] = data['text'].apply(lemmatize_text)

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /home/jovyan/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
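
Note that `text.split()` splits on whitespace only, so punctuation stays attached to tokens (for example `war.` instead of `war`), which can keep the lemmatizer from recognizing a word. A minimal sketch of an optional cleaning step that could run before lemmatization, assuming only Latin letters and apostrophes should be kept; `clear_text` is a hypothetical helper and is not part of the original pipeline.

```python
import re


def clear_text(text: str) -> str:
    """Lowercase, replace everything except letters and apostrophes with spaces, collapse whitespace."""
    text = re.sub(r"[^a-z']", " ", text.lower())
    return " ".join(text.split())


print(clear_text("Hey man, I'm really not trying to edit war."))
# hey man i'm really not trying to edit war
```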


In [5]:
target = data['toxic']
features = data.drop(['toxic'], axis=1)

Next, we split the data into training, validation, and test sets.

In [6]:
features_train, features_valid, target_train, target_valid = train_test_split(features, 
                                                                              target, 
                                                                              test_size=0.1, 
                                                                              random_state=161222)
features_valid, features_test, target_valid, target_test = train_test_split(features_valid, 
                                                                            target_valid, 
                                                                            test_size=0.1,
                                                                            random_state=161222)
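
The two-stage split first holds out 10% of the rows and then cuts 10% of that hold-out off as the test set, so the proportions are roughly 90% / 9% / 1%. A small sketch of the arithmetic, assuming `train_test_split` rounds the test share up to a whole row; `split_sizes` is an illustrative helper, and the shares are passed as integer percentages to keep the calculation exact.

```python
def split_sizes(n_rows: int, holdout_pct: int, test_pct: int):
    """Return (train, valid, test) row counts for a two-stage split."""
    holdout = -(-n_rows * holdout_pct // 100)  # ceiling division: train vs. hold-out
    test = -(-holdout * test_pct // 100)       # ceiling division: valid vs. test
    return n_rows - holdout, holdout - test, test


print(split_sizes(159292, 10, 10))
# (143362, 14337, 1593)
```

A test set of about 1,600 rows is small, so the test F1 can be expected to be noisier than the validation F1.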

In [7]:
nltk.download('stopwords')
stopwords = nltk_stopwords.words('english')  # already a list; recent scikit-learn versions reject a set for stop_words

count_tf_idf = TfidfVectorizer(stop_words=stopwords)

features_train = count_tf_idf.fit_transform(features_train['lemm_text'])
features_valid = count_tf_idf.transform(features_valid['lemm_text'])
features_test = count_tf_idf.transform(features_test['lemm_text'])
print(features_train.shape)
print(features_valid.shape)
print(features_test.shape)
cv_counts = 3

[nltk_data] Downloading package stopwords to /home/jovyan/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


(143362, 160030)
(14337, 160030)
(1593, 160030)
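
The vectorizer is fitted on the training sample only, and the validation and test samples are transformed with the same vocabulary, which is why all three matrices have 160030 columns. A toy sketch of that behaviour; the example sentences are made up.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

train_docs = ["you be great", "this edit be wrong", "great edit"]
new_docs = ["great unseen word"]

vectorizer = TfidfVectorizer()
train_matrix = vectorizer.fit_transform(train_docs)  # learns the vocabulary and IDF weights
new_matrix = vectorizer.transform(new_docs)          # reuses them; unknown words are dropped

print(train_matrix.shape, new_matrix.shape)  # same number of columns in both
print(new_matrix.nnz)                        # non-zero features in the new document
```

Words that appear only in the validation or test data contribute nothing to the features, so the model never sees information from those samples during training.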


### Conclusions for stage 1

The following steps were completed:

- The data was loaded and given an initial inspection.
- Features and samples were prepared for model training.

## Training

This is a classification task. We will compare three models:

- LogisticRegression
- DecisionTreeClassifier
- CatBoostClassifier

Let's start with LogisticRegression.

In [8]:
%%time

model = LogisticRegression()
hyperparams = [{'solver':['newton-cg', 'liblinear'],
                'C':[0.1, 1, 10]}]


print('# Tuning hyper-parameters for F1_score')
grid = GridSearchCV(model, hyperparams, scoring='f1',cv=cv_counts)
grid.fit(features_train, target_train)
print("Best parameters set found on development set:")
LR_best_params = grid.best_params_
print(LR_best_params)
print("Grid scores on development set:")
means = grid.cv_results_['mean_test_score']
stds = grid.cv_results_['std_test_score']
for mean, std, params in zip(means, stds, grid.cv_results_['params']):
    print("%0.6f for %r"% (mean, params))
print()

cv_f1_LR = max(means)

# Tuning hyper-parameters for F1_score
Best parameters set found on development set:
{'C': 10, 'solver': 'liblinear'}
Grid scores on development set:
0.393789 for {'C': 0.1, 'solver': 'newton-cg'}
0.394471 for {'C': 0.1, 'solver': 'liblinear'}
0.685298 for {'C': 1, 'solver': 'newton-cg'}
0.685298 for {'C': 1, 'solver': 'liblinear'}
0.753687 for {'C': 10, 'solver': 'newton-cg'}
0.753726 for {'C': 10, 'solver': 'liblinear'}

CPU times: user 1min 58s, sys: 2min 3s, total: 4min 2s
Wall time: 4min 2s


In [9]:
%%time

model = LogisticRegression()
model.set_params(**LR_best_params)
model.fit(features_train, target_train)
target_predict = model.predict(features_valid)
valid_f1_LR = f1_score(target_valid, target_predict)
print('F1 on CV', cv_f1_LR)
print('F1 on validation', valid_f1_LR)

F1 on CV 0.7537255301911344
F1 on validation 0.7776719375119071
CPU times: user 7.92 s, sys: 7.55 s, total: 15.5 s
Wall time: 15.5 s
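
In the search above, the TF-IDF weights were fitted on the whole training sample before cross-validation, so the held-out part of each fold has already influenced the vocabulary and IDF values. One common alternative, sketched here on made-up data and not what this project does, is to wrap the vectorizer and the classifier in a `Pipeline` so that both are refitted inside every fold.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Toy corpus (an assumption for illustration): 6 toxic and 6 non-toxic comments.
toxic = ["you idiot", "stupid idiot edit", "you are stupid",
         "idiot go away", "stupid vandal", "you vandal idiot"]
clean = ["thanks for the edit", "nice article", "good source thanks",
         "please see the talk page", "nice edit thanks", "good article"]
texts = toxic + clean
labels = [1] * len(toxic) + [0] * len(clean)

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),                    # refitted on each training fold
    ("clf", LogisticRegression(solver="liblinear")),
])

# Pipeline step parameters are addressed as <step name>__<parameter>.
grid = GridSearchCV(pipeline, {"clf__C": [1, 10]}, scoring="f1", cv=3)
grid.fit(texts, labels)

print(grid.best_params_)
print(grid.best_score_)
```

The trade-off is run time: the vectorizer is refitted for every fold and every parameter combination.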


Next, let's look at DecisionTreeClassifier.

In [10]:
%%time

model = DecisionTreeClassifier()
hyperparams = [{'max_depth':[x for x in range(1,36,2)],
                'random_state':[161222]}]


print('# Tuning hyper-parameters for F1_score')
grid = GridSearchCV(model, hyperparams, scoring='f1',cv=cv_counts)
grid.fit(features_train, target_train)
print("Best parameters set found on development set:")
DTC_best_params = grid.best_params_
print(DTC_best_params)
print("Grid scores on development set:")
means = grid.cv_results_['mean_test_score']
stds = grid.cv_results_['std_test_score']
for mean, std, params in zip(means, stds, grid.cv_results_['params']):
    print("%0.6f for %r"% (mean, params))
print()

cv_f1_DTC = max(means)

# Tuning hyper-parameters for F1_score
Best parameters set found on development set:
{'max_depth': 35, 'random_state': 161222}
Grid scores on development set:
0.271831 for {'max_depth': 1, 'random_state': 161222}
0.422089 for {'max_depth': 3, 'random_state': 161222}
0.491049 for {'max_depth': 5, 'random_state': 161222}
0.540452 for {'max_depth': 7, 'random_state': 161222}
0.569065 for {'max_depth': 9, 'random_state': 161222}
0.590691 for {'max_depth': 11, 'random_state': 161222}
0.610604 for {'max_depth': 13, 'random_state': 161222}
0.619229 for {'max_depth': 15, 'random_state': 161222}
0.625445 for {'max_depth': 17, 'random_state': 161222}
0.640359 for {'max_depth': 19, 'random_state': 161222}
0.648608 for {'max_depth': 21, 'random_state': 161222}
0.657878 for {'max_depth': 23, 'random_state': 161222}
0.662220 for {'max_depth': 25, 'random_state': 161222}
0.666249 for {'max_depth': 27, 'random_state': 161222}
0.669017 for {'max_depth': 29, 'random_state': 161222}
0.672802 for {'max_de

In [11]:
%%time

model = DecisionTreeClassifier()
model.set_params(**DTC_best_params)
model.fit(features_train, target_train)
target_predict = model.predict(features_valid)
valid_f1_DTC = f1_score(target_valid, target_predict)
print('F1 on CV', cv_f1_DTC)
print('F1 on validation', valid_f1_DTC)

F1 on CV 0.6794530855258221
F1 on validation 0.6778121775025799
CPU times: user 15.8 s, sys: 53.8 ms, total: 15.9 s
Wall time: 15.9 s


And finally, CatBoostClassifier.

In [None]:
%%time

model = CatBoostClassifier(verbose=False, iterations=300)
model.fit(features_train, target_train)
target_predict = model.predict(features_valid)
cv_f1_CBC = cross_val_score(model,
                            features_train, 
                            target_train, 
                            cv=cv_counts, 
                            scoring='f1').mean()
valid_f1_CBC = f1_score(target_valid, target_predict)
print('F1 on CV', cv_f1_CBC)
print('F1 on validation', valid_f1_CBC)

### Testing

Let's evaluate the best model (LogisticRegression) on the test data. For clarity, we will also plot its ROC curve.

In [None]:
plt.figure(figsize=[15,10])

plt.plot([0, 1], [0, 1], linestyle='--', label='RandomModel')


model = LogisticRegression()
model.set_params(**LR_best_params)
model.fit(features_train, target_train)
probabilities_test = model.predict_proba(features_test)
probabilities_one_test = probabilities_test[:, 1]
fpr, tpr, thresholds = roc_curve(target_test, probabilities_one_test)
predict_test = model.predict(features_test)
plt.plot(fpr, tpr, label='LogisticRegression')
print('LogisticRegression metrics')
print('ROC AUC:', roc_auc_score(target_test, probabilities_one_test))
print('F1:', f1_score(target_test, predict_test))
print('Precision:', precision_score(target_test, predict_test))
print('Recall:', recall_score(target_test, predict_test))
print('Accuracy:', accuracy_score(target_test, predict_test))
print()

plt.xlim([0,1])
plt.ylim([0,1])

plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")

plt.legend(loc='lower right', fontsize='x-large')

plt.title("ROC curve")
plt.show()
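
ROC AUC can be read as the probability that a randomly chosen toxic comment receives a higher score than a randomly chosen non-toxic one. A minimal pure-Python sketch of that definition on made-up scores; `pairwise_auc` is an illustrative name, and `roc_auc_score` remains the practical choice for real data.

```python
def pairwise_auc(labels, scores):
    """ROC AUC as the share of (positive, negative) pairs ranked correctly; ties count as half."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("both classes are required")
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(pairwise_auc(labels, scores))  # 0.75
```

The diagonal `RandomModel` line on the plot corresponds to a value of 0.5 under this definition.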

## Conclusions

The project included the following stages:

- The data was loaded and prepared.
- The data was split into training, validation, and test sets.
- Several models were trained and compared.
- The best model was evaluated on the test set.

In the comparison by F1, LogisticRegression performed best. On the test set it reached F1 = 0.78.

This suggests that, among the models compared, it should be the most effective at finding toxic comments in practice.