# Project description

The online store "Wikishop" is launching a new service. Users can now edit and extend product descriptions, as in wiki communities: customers propose their own edits and comment on changes made by others. The store needs a tool that finds toxic comments and sends them for moderation.

Train a model to classify comments as positive or negative. A dataset with toxicity labels for the edits is available.

Build a model with an F1 score of at least 0.75.
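
For reference, F1 is the harmonic mean of precision and recall for the positive (toxic) class. The sketch below is a plain-Python illustration on made-up labels, not the evaluation code used later in the notebook; the function name `f1_from_labels` is hypothetical.

```python
def f1_from_labels(y_true, y_pred):
    """Return (precision, recall, f1) for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)

# Made-up labels: 4 toxic comments, 3 of them found, 1 false alarm
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
precision, recall, f1 = f1_from_labels(y_true, y_pred)
print(precision, recall, f1)
```

Because true negatives do not enter the formula, F1 is not inflated by the large share of non-toxic comments the way accuracy would be.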

# 1. Preparation

### 1.1 Load the required libraries and the dataset, and inspect general information

In [2]:
import pandas as pd
import numpy as np
import xgboost as xgb
import catboost as ctb
import lightgbm as lgb
from pymystem3 import Mystem
from sklearn.linear_model import LogisticRegression
import nltk
import re
import random
from nltk.stem import WordNetLemmatizer 
from nltk.corpus import stopwords as nltk_stopwords
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.metrics import f1_score
from sklearn.model_selection import cross_val_score

In [3]:
#data = pd.read_csv('/Users/aleksandrsaraev/project/30_07_2020/toxic_comments.csv')
data = pd.read_csv('/datasets/toxic_comments.csv')
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 159571 entries, 0 to 159570
Data columns (total 2 columns):
text     159571 non-null object
toxic    159571 non-null int64
dtypes: int64(1), object(1)
memory usage: 2.4+ MB


In [3]:
data

Unnamed: 0,text,toxic
0,Explanation\nWhy the edits made under my usern...,0
1,D'aww! He matches this background colour I'm s...,0
2,"Hey man, I'm really not trying to edit war. It...",0
3,"""\nMore\nI can't make any real suggestions on ...",0
4,"You, sir, are my hero. Any chance you remember...",0
...,...,...
159566,""":::::And for the second time of asking, when ...",0
159567,You should be ashamed of yourself \n\nThat is ...,0
159568,"Spitzer \n\nUmm, theres no actual article for ...",0
159569,And it looks like it was actually you who put ...,0


The dataset has 2 columns and 159,571 comments.

### 1.2 Convert the comments to lowercase

In [4]:
data['text'] = data['text'].str.lower()

In [5]:
data

Unnamed: 0,text,toxic
0,explanation\nwhy the edits made under my usern...,0
1,d'aww! he matches this background colour i'm s...,0
2,"hey man, i'm really not trying to edit war. it...",0
3,"""\nmore\ni can't make any real suggestions on ...",0
4,"you, sir, are my hero. any chance you remember...",0
...,...,...
159566,""":::::and for the second time of asking, when ...",0
159567,you should be ashamed of yourself \n\nthat is ...,0
159568,"spitzer \n\numm, theres no actual article for ...",0
159569,and it looks like it was actually you who put ...,0


### 1.3 Strip unwanted characters with a regular expression (keep only Latin letters and spaces)

In [6]:
def clear_text(text):
    re_text = re.sub(r'[^a-zA-Z ]', ' ', text)
    return " ".join(re_text.split())
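
To see what this cleaning does to punctuation and digits, here is a small standalone sketch on a made-up string. It repeats the same `clear_text` logic so that it runs on its own; the sample text is an assumption for illustration.

```python
import re

def clear_text(text):
    # Same logic as above: every character that is not a Latin letter
    # or a space becomes a space, then whitespace runs are collapsed.
    re_text = re.sub(r'[^a-zA-Z ]', ' ', text)
    return " ".join(re_text.split())

sample = "i can't edit this.\n(talk) 21:51"
cleaned = clear_text(sample)
print(cleaned)
```

Note that apostrophes are replaced as well, so contractions split into fragments such as `can t`. Timestamps and other digits disappear entirely.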

In [7]:
from tqdm import tqdm
tqdm.pandas()


In [8]:
data['clear_text'] = data['text'].progress_apply(clear_text)

100%|██████████| 159571/159571 [00:02<00:00, 68146.51it/s]


In [9]:
data['clear_text'][1]

'd aww he matches this background colour i m seemingly stuck with thanks talk january utc'

In [10]:
data['text'][1]

"d'aww! he matches this background colour i'm seemingly stuck with. thanks.  (talk) 21:51, january 11, 2016 (utc)"

### 1.4 Lemmatize the comments and remove stop words

In [11]:
from nltk.corpus import wordnet
def get_wordnet_pos(word):
    """Map POS tag to first character lemmatize() accepts"""
    tag = nltk.pos_tag([word])[0][1][0].upper()
    tag_dict = {"J": wordnet.ADJ,
                "N": wordnet.NOUN,
                "V": wordnet.VERB,
                "R": wordnet.ADV}
    return tag_dict.get(tag, wordnet.NOUN)


In [12]:
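# Build the stop-word set and the lemmatizer once instead of on every call
stop_words = set(nltk_stopwords.words('english'))
wnl = WordNetLemmatizer()

def lemmatiser(row):
    """Drop stop words and lemmatize the remaining tokens of one comment."""
    return ' '.join(wnl.lemmatize(w, get_wordnet_pos(w))
                    for w in nltk.word_tokenize(row)
                    if w not in stop_words)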

In [13]:
data['lemm_text'] = data['clear_text'].progress_apply(lemmatiser)

100%|██████████| 159571/159571 [6:38:04<00:00,  6.68it/s]       
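
The progress bar reports about 6 hours 38 minutes for this step. Since `get_wordnet_pos` tags each word in isolation, its result depends only on the word itself, so repeated words could be served from a cache. Below is a minimal sketch of that idea with `functools.lru_cache`. `slow_tag` is a hypothetical stand-in for the expensive tagger call, not an NLTK function, and the speed-up on the real data was not measured here.

```python
from functools import lru_cache

calls = 0  # counts how often the expensive function actually runs

def slow_tag(word):
    """Hypothetical stand-in for an expensive per-word tagger call."""
    global calls
    calls += 1
    return 'v' if word.endswith('ing') else 'n'

@lru_cache(maxsize=None)
def cached_tag(word):
    # Safe to cache because the result depends only on the word itself
    return slow_tag(word)

tokens = "editing page editing talk page editing".split()
tags = [cached_tag(w) for w in tokens]
print(tags, calls)
```

Six tokens trigger only three real calls here, one per distinct word. In the notebook the same decorator could be tried on `get_wordnet_pos`.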


In [14]:
data

Unnamed: 0,text,toxic,clear_text,lemm_text
0,explanation\nwhy the edits made under my usern...,0,explanation why the edits made under my userna...,explanation edits make username hardcore metal...
1,d'aww! he matches this background colour i'm s...,0,d aww he matches this background colour i m se...,aww match background colour seemingly stuck th...
2,"hey man, i'm really not trying to edit war. it...",0,hey man i m really not trying to edit war it s...,hey man really try edit war guy constantly rem...
3,"""\nmore\ni can't make any real suggestions on ...",0,more i can t make any real suggestions on impr...,make real suggestion improvement wonder sectio...
4,"you, sir, are my hero. any chance you remember...",0,you sir are my hero any chance you remember wh...,sir hero chance remember page
...,...,...,...,...
159566,""":::::and for the second time of asking, when ...",0,and for the second time of asking when your vi...,second time ask view completely contradicts co...
159567,you should be ashamed of yourself \n\nthat is ...,0,you should be ashamed of yourself that is a ho...,ashamed horrible thing put talk page
159568,"spitzer \n\numm, theres no actual article for ...",0,spitzer umm theres no actual article for prost...,spitzer umm there actual article prostitution ...
159569,and it looks like it was actually you who put ...,0,and it looks like it was actually you who put ...,look like actually put speedy first version de...


Save this table so that it can be reused after the notebook is restarted.

In [15]:
data.to_csv('toxic_comments_lemm')

# 2. Training

### 2.1 Split the dataset into training, validation, and test sets

In [16]:
train, testing = train_test_split(data, test_size=0.4, random_state=42)
valid, test = train_test_split(testing, test_size=0.5, random_state=42)
print(train.shape)
print(valid.shape)
print(test.shape)

(95742, 4)
(31914, 4)
(31915, 4)
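
The printed shapes correspond to a 60/20/20 split. A quick arithmetic check, assuming (as the shapes suggest) that the held-out part of each split is rounded up and the remainder goes to the other part:

```python
import math

n_total = 159571

# First split: 40% held out, the rest is the training set
n_holdout = math.ceil(0.4 * n_total)
n_train = n_total - n_holdout

# Second split: the held-out part is halved into validation and test
n_test = math.ceil(0.5 * n_holdout)
n_valid = n_holdout - n_test

print(n_train, n_valid, n_test)
```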


### 2.2 Build the comment corpora for training

In [17]:
corpus_train = train['lemm_text'].values.astype('U')
corpus_valid = valid['lemm_text'].values.astype('U')
corpus_test = test['lemm_text'].values.astype('U')

### 2.3 Consider several ways of converting the corpus into vectors

2.3.1 Build a bag of words

In [18]:
count_vect = CountVectorizer()
vect_train = count_vect.fit_transform(corpus_train)
vect_valid = count_vect.transform(corpus_valid)
vect_test = count_vect.transform(corpus_test)
vect_train.shape

(95742, 111790)
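
The shape above reads as 95,742 training comments by 111,790 vocabulary terms, where each cell holds the number of times a term occurs in a comment. The sketch below illustrates that representation with the standard library on a made-up three-document corpus. It is not how `CountVectorizer` is implemented (which uses its own tokenization and returns a sparse matrix).

```python
from collections import Counter

docs = ["edit war edit", "talk page", "edit talk page page"]

# Vocabulary: every distinct token in the corpus, in sorted order
vocab = sorted({tok for doc in docs for tok in doc.split()})

def to_counts(doc):
    """Return the count of each vocabulary term in one document."""
    counts = Counter(doc.split())
    return [counts[term] for term in vocab]

matrix = [to_counts(doc) for doc in docs]
print(vocab)
for row in matrix:
    print(row)
```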

2.3.2 Compute TF-IDF for the text corpus

In [19]:
count_tf_idf = TfidfVectorizer()
tf_idf_train = count_tf_idf.fit_transform(corpus_train)
tf_idf_valid = count_tf_idf.transform(corpus_valid)
tf_idf_test = count_tf_idf.transform(corpus_test)

In [20]:
tf_idf_train.shape

(95742, 111790)
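
The matrix has the same shape as the bag of words because the vocabulary is the same; only the cell values change. With its default settings (`smooth_idf=True`, `norm='l2'`), scikit-learn documents the weighting as the term count multiplied by `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each row. A standard-library sketch of that formula on a made-up corpus, intended as an illustration rather than a replacement for `TfidfVectorizer`:

```python
import math
from collections import Counter

docs = ["edit war edit", "talk page", "edit talk page page"]
vocab = sorted({tok for doc in docs for tok in doc.split()})
n_docs = len(docs)

# Document frequency: number of documents that contain each term
df = {term: sum(term in doc.split() for doc in docs) for term in vocab}
# Smoothed IDF: ln((1 + n) / (1 + df)) + 1
idf = {term: math.log((1 + n_docs) / (1 + df[term])) + 1 for term in vocab}

def tfidf_row(doc):
    """Weight raw counts by IDF, then scale the row to unit L2 norm."""
    counts = Counter(doc.split())
    raw = [counts[term] * idf[term] for term in vocab]
    norm = math.sqrt(sum(v * v for v in raw))
    return [v / norm for v in raw] if norm else raw

rows = [tfidf_row(doc) for doc in docs]
for row in rows:
    print([round(v, 3) for v in row])
```

Terms that occur in fewer documents (here `war`) receive a larger IDF than terms spread across the corpus.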

2.3.3 Create the target variable

In [21]:
y_train = train['toxic']
y_valid = valid['toxic']
y_test = test['toxic']

2.3.4 Check the balance of the target variable

In [22]:
y_train.mean()

0.10109460842681373

The target is imbalanced: only about 10% of the training comments are toxic.
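
The arithmetic below shows why this matters and what class weighting does about it. It assumes scikit-learn's documented `'balanced'` heuristic, `n_samples / (n_classes * class_count)`, rewritten here in terms of class shares. The numbers apply to the whole training set; inside cross-validation folds the shares may differ slightly.

```python
positive_share = 0.10109460842681373  # y_train.mean() from the cell above
n_classes = 2

# Accuracy of a model that always predicts the majority class (0);
# its F1 for the toxic class would be 0 because it finds no toxic comments
majority_accuracy = 1 - positive_share

# 'balanced' weights: n_samples / (n_classes * class_count)
weight_negative = 1 / (n_classes * (1 - positive_share))
weight_positive = 1 / (n_classes * positive_share)
weight_ratio = weight_positive / weight_negative

print(round(majority_accuracy, 3))
print(round(weight_negative, 3), round(weight_positive, 3))
print(round(weight_ratio, 2))
```

Under this heuristic a toxic comment weighs roughly nine times as much as a non-toxic one, which is stronger than the fixed `class_weights=[1, 5]` passed to CatBoost below.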

### 2.4 Compare the feature sets using LogisticRegression and CatBoost

In [23]:
def cross_val(model, X_train):
    scores = cross_val_score(model, X_train, y_train, cv=3, scoring='f1')
    return scores.mean()

In [24]:
model_test_lr = LogisticRegression(random_state=42, class_weight='balanced')

In [25]:
model_test_ctb = ctb.CatBoostClassifier(random_seed=42, eval_metric='F1', 
                                   n_estimators=500, class_weights=[1, 5])

In [26]:
best_features = pd.DataFrame(
    {'CountVectorizer': [cross_val(model_test_lr, vect_train), cross_val(model_test_ctb, vect_train)],
    'TfidfVectorizer': [cross_val(model_test_lr, tf_idf_train), cross_val(model_test_ctb, tf_idf_train)]},
     index=('lr', 'ctb'))

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression

Learning rate set to 0.114741
0:	learn: 0.5972514	total: 374ms	remaining: 3m 6s
1:	learn: 0.6173412	total: 625ms	remaining: 2m 35s
2:	learn: 0.6426251	total: 865ms	remaining: 2m 23s
3:	learn: 0.6504082	total: 1.11s	remaining: 2m 17s
4:	learn: 0.6396964	total: 1.36s	remaining: 2m 14s
5:	learn: 0.6494202	total: 1.61s	remaining: 2m 12s
6:	learn: 0.6497518	total: 1.85s	remaining: 2m 10s
7:	learn: 0.6492623	total: 2.11s	remaining: 2m 9s
8:	learn: 0.6493105	total: 2.36s	remaining: 2m 8s
9:	learn: 0.6444803	total: 2.61s	remaining: 2m 7s
10:	learn: 0.6357345	total: 2.86s	remaining: 2m 7s
11:	learn: 0.6543266	total: 3.12s	remaining: 2m 6s
12:	learn: 0.6585466	total: 3.37s	remaining: 2m 6s
13:	learn: 0.6634839	total: 3.63s	remaining: 2m 5s
14:	learn: 0.6640163	total: 3.88s	remaining: 2m 5s
15:	learn: 0.6651867	total: 4.13s	remaining: 2m 5s
16:	learn: 0.6814580	total: 4.38s	remaining: 2m 4s
17:	learn: 0.6822874	total: 4.63s	remaining: 2m 3s
18:	learn: 0.6922970	total: 4.87s	remaining: 2m 3s
19:	l

In [27]:
best_features

Unnamed: 0,CountVectorizer,TfidfVectorizer
lr,0.754341,0.744864
ctb,0.769313,0.774405


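The models respond differently to the features: logistic regression scores higher on the CountVectorizer features (0.754 vs. 0.745), while CatBoost scores higher on the TfidfVectorizer features (0.774 vs. 0.769). The best result overall is CatBoost with the TfidfVectorizer features.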

### 2.5 Try to improve the CatBoost model by tuning hyperparameters on the TfidfVectorizer features

In [29]:
best_param = []
for learning_rate in (0.15, 0.2, 0.25):
    for max_depth in range(2, 9, 2):
        model_ctb = ctb.CatBoostClassifier(random_seed=42, eval_metric='F1',
                                           learning_rate=learning_rate, max_depth=max_depth,
                                           n_estimators=500, class_weights=[1, 5])
        model_ctb.fit(tf_idf_train, y_train)
        # F1 on the training and validation sets for this parameter pair
        scores_train = f1_score(y_train, model_ctb.predict(tf_idf_train))
        scores_valid = f1_score(y_valid, model_ctb.predict(tf_idf_valid))
        best_param.append([learning_rate, max_depth, scores_train, scores_valid])
best_param = (pd.DataFrame(data=best_param,
                           columns=['learning_rate', 'max_depth', 'scores_train', 'scores_valid'])
              .sort_values(by='scores_valid', ascending=False))
best_param


0:	learn: 0.4337863	total: 168ms	remaining: 1m 23s
1:	learn: 0.4741596	total: 327ms	remaining: 1m 21s
2:	learn: 0.5160641	total: 506ms	remaining: 1m 23s
3:	learn: 0.5488313	total: 670ms	remaining: 1m 23s
4:	learn: 0.5455903	total: 839ms	remaining: 1m 23s
5:	learn: 0.5796112	total: 989ms	remaining: 1m 21s
6:	learn: 0.5540573	total: 1.15s	remaining: 1m 20s
7:	learn: 0.5787781	total: 1.34s	remaining: 1m 22s
8:	learn: 0.5915436	total: 1.49s	remaining: 1m 21s
9:	learn: 0.6058183	total: 1.65s	remaining: 1m 20s
10:	learn: 0.5861294	total: 1.81s	remaining: 1m 20s
11:	learn: 0.5836723	total: 1.97s	remaining: 1m 19s
12:	learn: 0.5836457	total: 2.13s	remaining: 1m 19s
13:	learn: 0.6056968	total: 2.28s	remaining: 1m 19s
14:	learn: 0.6090127	total: 2.43s	remaining: 1m 18s
15:	learn: 0.5948062	total: 2.59s	remaining: 1m 18s
16:	learn: 0.6073273	total: 2.75s	remaining: 1m 18s
17:	learn: 0.6083287	total: 2.9s	remaining: 1m 17s
18:	learn: 0.6119316	total: 3.06s	remaining: 1m 17s
19:	learn: 0.6157810	to

Unnamed: 0,learning_rate,max_depth,scores_train,scores_valid
11,0.25,8,0.9187,0.788974
7,0.2,8,0.893277,0.786885
3,0.15,8,0.865942,0.784872
6,0.2,6,0.860738,0.781212
5,0.2,4,0.837779,0.78069
9,0.25,4,0.844366,0.779765
10,0.25,6,0.876994,0.779522
1,0.15,4,0.825041,0.778447
2,0.15,6,0.846427,0.777318
8,0.25,2,0.814112,0.774775


# 3. Testing

Train the model with the selected hyperparameters and evaluate it on the test set.

In [30]:
model_best = ctb.CatBoostClassifier(random_seed=42, eval_metric='F1',
                                    learning_rate=0.25, max_depth=8,
                                    n_estimators=500, class_weights=[1, 5])

In [31]:
model_best.fit(tf_idf_train, y_train)
y_pred_best = model_best.predict(tf_idf_test)
f1_score(y_test, y_pred_best)

0:	learn: 0.6023598	total: 1.98s	remaining: 16m 30s
1:	learn: 0.6433346	total: 3.52s	remaining: 14m 36s
2:	learn: 0.6822061	total: 4.73s	remaining: 13m 4s
3:	learn: 0.6733130	total: 5.89s	remaining: 12m 10s
4:	learn: 0.7021140	total: 7.16s	remaining: 11m 48s
5:	learn: 0.7030152	total: 8.35s	remaining: 11m 27s
6:	learn: 0.7104603	total: 9.54s	remaining: 11m 11s
7:	learn: 0.7198703	total: 10.7s	remaining: 10m 59s
8:	learn: 0.7334622	total: 11.9s	remaining: 10m 48s
9:	learn: 0.7358937	total: 13.1s	remaining: 10m 40s
10:	learn: 0.7419240	total: 14.3s	remaining: 10m 33s
11:	learn: 0.7474553	total: 15.4s	remaining: 10m 27s
12:	learn: 0.7504323	total: 16.6s	remaining: 10m 21s
13:	learn: 0.7544178	total: 17.8s	remaining: 10m 16s
14:	learn: 0.7575967	total: 18.9s	remaining: 10m 10s
15:	learn: 0.7605048	total: 20.1s	remaining: 10m 7s
16:	learn: 0.7629469	total: 21.3s	remaining: 10m 4s
17:	learn: 0.7713362	total: 22.4s	remaining: 10m
18:	learn: 0.7776903	total: 23.6s	remaining: 9m 58s
19:	learn: 

0.7821901323706378

Sanity-check the model against a random baseline.

In [32]:
random_predictions = [random.randint(0, 1) for _ in range(len(y_test.index))]
if f1_score(y_test, random_predictions) < f1_score(y_test, y_pred_best):
    print('F1 of the trained model:', f1_score(y_test, y_pred_best))
    print('F1 of the random model:', f1_score(y_test, random_predictions))
    print('F1 of the trained model is better than that of the random model')

F1 of the trained model: 0.7821901323706378
F1 of the random model: 0.17414330218068536
F1 of the trained model is better than that of the random model
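
The random baseline's score of about 0.17 matches what simple arithmetic predicts. The sketch assumes that the share of toxic comments in the test set is close to the training share computed earlier, which the notebook does not verify.

```python
positive_share = 0.10109460842681373  # share of toxic comments in the training set

# A fair coin flags half of all comments as toxic, so in expectation:
precision = positive_share  # flagged comments are toxic at the base rate
recall = 0.5                # half of the toxic comments get flagged
expected_f1 = 2 * precision * recall / (precision + recall)
print(round(expected_f1, 3))
```

The expected value is close to the observed one, and the remaining difference is consistent with randomness in the coin flips and in the test-set class share.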


### A model with F1 above 0.75 has been found (0.782 on the test set)