# Project for Wikishop

The online store Wikishop is launching a new service. Users can now edit and extend product descriptions, as in wiki communities: customers propose their own edits and comment on changes made by others. The store needs a tool that finds toxic comments and sends them for moderation.

Train a model to classify comments as positive or negative. A dataset labeled for the toxicity of edits is available.

## Preparation

Import the required libraries:

In [1]:
import pandas as pd
from tqdm import tqdm

import re

import nltk
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.corpus import stopwords as nltk_stopwords
from sklearn.feature_extraction.text import TfidfVectorizer

from sklearn.linear_model import LogisticRegression
from catboost import CatBoostClassifier

from sklearn.metrics import f1_score, recall_score, precision_score
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.utils import shuffle

In [2]:
data = pd.read_csv('/datasets/toxic_comments.csv')

In [3]:
data.head()

,text,toxic
0,Explanation\nWhy the edits made under my usern...,0
1,D'aww! He matches this background colour I'm s...,0
2,"Hey man, I'm really not trying to edit war. It...",0
3,"""\nMore\nI can't make any real suggestions on ...",0
4,"You, sir, are my hero. Any chance you remember...",0


In [4]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 159571 entries, 0 to 159570
Data columns (total 2 columns):
 #   Column  Non-Null Count   Dtype 
---  ------  --------------   ----- 
 0   text    159571 non-null  object
 1   toxic   159571 non-null  int64 
dtypes: int64(1), object(1)
memory usage: 2.4+ MB


Check whether the classes are imbalanced:

In [5]:
data['toxic'].mean()

0.10167887648758234

The classes are imbalanced: only about 10% of the comments are labeled toxic.
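To put this share in context, here is a rough sketch of the F1 score that a hypothetical classifier would get by flagging every comment as toxic. Its recall is 1 and its precision equals the toxic share, so the result is a floor that any useful model should beat. The variable names are illustrative and not part of the notebook.

```python
# Toxic share reported by data['toxic'].mean() above
toxic_share = 0.10167887648758234

# Hypothetical "flag everything" classifier:
# every toxic comment is caught, but most flags are wrong
precision = toxic_share
recall = 1.0

baseline_f1 = 2 * precision * recall / (precision + recall)
print(f"F1 of the all-toxic baseline: {baseline_f1:.3f}")
```

The baseline comes out at roughly 0.18, which is why F1 rather than accuracy is the metric of interest here.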

Write a function that lemmatizes the text:

In [6]:
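# Create the lemmatizer once instead of on every call
lemmatizer = WordNetLemmatizer()

def lemmatize(text):
    # Split the text into tokens, lemmatize each one
    # (default part of speech is noun) and join them back
    word_list = nltk.word_tokenize(text)
    return ' '.join(lemmatizer.lemmatize(word) for word in word_list)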

Write a function that strips everything except Latin letters from the text:

In [7]:
def clear_text(text):
    return " ".join(re.sub(r'[^a-zA-Z]', ' ', text.lower()).split())
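A short usage sketch of the cleaning step on a made-up sample string (the sample is an assumption for illustration). It shows two side effects worth knowing about: apostrophes split contractions into separate tokens, and digits disappear entirely.

```python
import re

def clear_text(text):
    # Lowercase, replace every character that is not a Latin letter
    # with a space, then collapse runs of whitespace
    return " ".join(re.sub(r'[^a-zA-Z]', ' ', text.lower()).split())

sample = "D'aww! He matches this background colour, 100%."
cleaned = clear_text(sample)
print(cleaned)
```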

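Because of the strong class imbalance, split the data into training and test sets with stratification. The original plan was to reduce the imbalance in the training set and keep the natural class distribution in the test set. The downsampling code is kept below but commented out, so the training set is used unchanged: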

In [8]:
train, test = train_test_split(data,
                               test_size=0.25,
                               random_state=12345,
                               stratify=data['toxic'])
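A minimal sketch of what `stratify` does, on toy labels rather than the real dataset (the lists `labels` and `texts` are made up for illustration): both parts of the split keep the same share of the positive class.

```python
from sklearn.model_selection import train_test_split

# Toy data: 200 comments, 10% of them labeled toxic
labels = [1] * 20 + [0] * 180
texts = [f"comment {i}" for i in range(len(labels))]

texts_train, texts_test, y_train, y_test = train_test_split(
    texts, labels,
    test_size=0.25,
    random_state=12345,
    stratify=labels)  # preserve the class share in both parts

train_share = sum(y_train) / len(y_train)
test_share = sum(y_test) / len(y_test)
print(len(y_train), train_share)
print(len(y_test), test_share)
```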

In [9]:
#train_zeros = train[train['toxic'] == 0]
#train_ones = train[train['toxic'] == 1]

In [10]:
#fraction = train['toxic'].mean()
#train_downsampled = pd.concat(
#    [train_zeros.sample(frac=fraction,
#                        random_state=123)] + [train_ones])
#train_downsampled = shuffle(train_downsampled,
#                            random_state=123)
# Downsampling is commented out above, so the training set is used unchanged
train_downsampled = train
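For reference, a hedged sketch of how majority-class downsampling could look if it were enabled. It is a variant, not the notebook's commented-out code: the helper `downsample_majority` is hypothetical and keeps as many negative rows as there are positive ones, and the toy DataFrame stands in for the real training set.

```python
import pandas as pd

def downsample_majority(df, target_col='toxic', random_state=123):
    """Keep all positive rows and an equally sized random sample of negatives."""
    zeros = df[df[target_col] == 0]
    ones = df[df[target_col] == 1]
    n_keep = min(len(zeros), len(ones))
    zeros_sampled = zeros.sample(n=n_keep, random_state=random_state)
    # Concatenate, shuffle the rows and rebuild a clean 0..n-1 index
    combined = pd.concat([zeros_sampled, ones])
    return combined.sample(frac=1, random_state=random_state).reset_index(drop=True)

# Toy training set: 100 rows, 10% toxic
toy = pd.DataFrame({
    'text': [f'comment {i}' for i in range(100)],
    'toxic': [1] * 10 + [0] * 90,
})
balanced = downsample_majority(toy)
print(len(balanced), balanced['toxic'].mean())
```

Only the training set should be resampled this way; the test set keeps its natural distribution so that the reported F1 reflects real traffic.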

Check the class balance in the training set:

In [12]:
train_downsampled['toxic'].mean()

0.10168117782716957

In [13]:
train_downsampled.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 119678 entries, 510 to 56802
Data columns (total 2 columns):
 #   Column  Non-Null Count   Dtype 
---  ------  --------------   ----- 
 0   text    119678 non-null  object
 1   toxic   119678 non-null  int64 
dtypes: int64(1), object(1)
memory usage: 2.7+ MB


In [14]:
test.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 39893 entries, 47336 to 94090
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   text    39893 non-null  object
 1   toxic   39893 non-null  int64 
dtypes: int64(1), object(1)
memory usage: 935.0+ KB


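The class imbalance in the training set is not removed: with downsampling commented out, the toxic share stays at about 10.2%, the same as in the full dataset.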

Lemmatize and clean the text:

In [15]:
train_downsampled = train_downsampled.reset_index(drop=True)
test = test.reset_index(drop=True)

In [16]:
for i in tqdm(range(len(train_downsampled))):
    train_downsampled.loc[i, 'lemm_text'] = lemmatize(train_downsampled.loc[i, 'text'])

100%|██████████| 119678/119678 [08:54<00:00, 223.80it/s]


In [17]:
for i in tqdm(range(len(test))):
    test.loc[i, 'lemm_text'] = lemmatize(test.loc[i, 'text'])

100%|██████████| 39893/39893 [01:28<00:00, 450.42it/s]


In [18]:
for i in tqdm(range(len(train_downsampled))):
    train_downsampled.loc[i, 'lemm_text'] = clear_text(train_downsampled.loc[i, 'lemm_text'])

100%|██████████| 119678/119678 [09:58<00:00, 199.98it/s]


In [19]:
for i in tqdm(range(len(test))):
    test.loc[i, 'lemm_text'] = clear_text(test.loc[i, 'lemm_text'])

100%|██████████| 39893/39893 [01:06<00:00, 602.31it/s]
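The row-by-row `.loc` assignments above account for most of the running time. A possible alternative, sketched here on a toy DataFrame with only the cleaning step (lemmatization is left out so the example does not need NLTK data), is to apply the function to the whole column at once.

```python
import re
import pandas as pd

def clear_text(text):
    # Same cleaning rule as in the notebook
    return " ".join(re.sub(r'[^a-zA-Z]', ' ', text.lower()).split())

# Toy frame standing in for train_downsampled / test
toy = pd.DataFrame({'text': [
    "Hey man, I'm REALLY not trying!",
    "More\nsuggestions: 1, 2, 3",
]})

# One column-wise call instead of a Python loop over .loc
toy['lemm_text'] = toy['text'].apply(clear_text)
print(toy['lemm_text'].tolist())
```

If a progress bar is still wanted, calling `tqdm.pandas()` once and then using `progress_apply` in place of `apply` should give the same result with a bar.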


The prepared data:

In [20]:
train_downsampled.head()

,text,toxic,lemm_text
0,"Belarus a developed country\nHahaha, nothing m...",0,belarus a developed country hahaha nothing mor...
1,"Simon-in-sagamihara|talk]]) 04:43, 12 February...",0,simon in sagamihara talk february
2,I believe the benefit that you are seeking is ...,0,i believe the benefit that you are seeking is ...
3,before talking about others\nIslam will take over,0,before talking about others islam will take over
4,Looking through the history of this article an...,0,looking through the history of this article an...


In [21]:
test.head()

,text,toxic,lemm_text
0,The same standards obviously do not hold for l...,0,the same standard obviously do not hold for li...
1,"""\n\n I think what we have here is a differenc...",0,i think what we have here is a difference in v...
2,"""\nhttp://don.logan.com/aboutus.htm\n\nNope yo...",0,http don logan com aboutus htm nope your incor...
3,"""=== Page is in rough shape ===\nWith the rece...",0,page is in rough shape with the recent purge o...
4,Seattle Biomed AGAIN \n\nDid you read my messa...,0,seattle biomed again did you read my message t...


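Conclusion: the tokens are lemmatized and the text is stripped of non-letter characters. Because the lemmatizer is called with its default part of speech (noun), plural nouns are reduced ("standards" becomes "standard"), while verb forms such as "seeking" stay unchanged.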

## Training

### CatBoost

CatBoost can process text features internally, so we pass the lemmatized text to it directly.

Separate the target from the features:

In [22]:
features_train = train_downsampled.drop(['text', 'toxic'], axis = 1)
target_train = train_downsampled['toxic']
features_test = test.drop(['text', 'toxic'], axis = 1)
target_test = test['toxic']

Search for the best hyperparameters:

In [None]:
%%time
model = CatBoostClassifier()
params = {
    'max_depth' : [2, 5],
    'random_seed' : [12345],
    'learning_rate' : [0.1],
    'logging_level' : ['Silent'],
}
grid = GridSearchCV(estimator = model,
                    param_grid = params,
                    cv = 3,
                    scoring = 'f1')
grid.fit(features_train, target_train,
         text_features = ['lemm_text'],
         plot = False)
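To estimate how expensive this search is, the grid can be expanded by hand. The sketch below uses only the standard library and mirrors the `params` dictionary from the cell above. The count assumes the default `refit=True` behavior of `GridSearchCV`, which trains one extra model on the full training set after cross-validation.

```python
from itertools import product

params = {
    'max_depth': [2, 5],
    'random_seed': [12345],
    'learning_rate': [0.1],
    'logging_level': ['Silent'],
}
cv_folds = 3

# Every combination of the listed values is one candidate model
keys = sorted(params)
candidates = [dict(zip(keys, values))
              for values in product(*(params[k] for k in keys))]

n_cv_fits = len(candidates) * cv_folds
n_total_fits = n_cv_fits + 1  # plus the final refit on all training data
print(len(candidates), n_cv_fits, n_total_fits)
```

Only `max_depth` actually varies, so the search compares two candidates; the single-value entries just pin the remaining settings.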

In [23]:
print('Best hyperparameters: ', grid.best_params_)

Train the model with the selected hyperparameters:

In [24]:
model = CatBoostClassifier(max_depth = 5,
                           random_seed = 12345,
                           learning_rate = 0.1,
                           logging_level = 'Silent',
                           eval_metric = 'F1')

In [25]:
%%time
model.fit(features_train,
          target_train,
          text_features = ['lemm_text'],
          plot=False)

CPU times: user 5min 40s, sys: 23.2 s, total: 6min 3s
Wall time: 6min 5s


<catboost.core.CatBoostClassifier at 0x7ff60175fcd0>

In [26]:
pred_test = model.predict(features_test)

In [27]:
f1 = f1_score(target_test, pred_test)
print('F1:',f1)

F1: 0.7819528250137137


CatBoost with built-in text processing reaches F1 ≈ 0.78 on the test set.
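As a reminder of what this number measures, here is a small from-scratch sketch of precision, recall and F1 for the positive (toxic) class. The helper `precision_recall_f1` and the toy label lists are illustrative; the notebook itself uses `f1_score` from scikit-learn.

```python
def precision_recall_f1(y_true, y_pred):
    """Precision, recall and F1 for the positive class (label 1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # toxic, flagged
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # clean, flagged
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # toxic, missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(precision, recall, f1)
```

True negatives do not enter the formula at all, which is what makes F1 informative when about 90% of the comments are non-toxic.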

### TF-IDF and logistic regression

Take the lemmatized text as the features and `toxic` as the target for the training and test sets:

In [29]:
features_train = train_downsampled['lemm_text']
target_train = train_downsampled['toxic']
features_test = test['lemm_text']
target_test = test['toxic']

In [30]:
corpus = features_train.values

In [31]:
nltk.download('stopwords')
# Keep the stop words as a list: some scikit-learn versions reject a set for stop_words
stopwords = nltk_stopwords.words('english')

[nltk_data] Downloading package stopwords to /home/jovyan/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [32]:
count_tf_idf = TfidfVectorizer(stop_words=stopwords,
                               ngram_range=(1, 1))
tf_idf = count_tf_idf.fit_transform(corpus)
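To make the vectorization step less of a black box, here is a pure-Python sketch of the weighting that `TfidfVectorizer` applies with its default settings, as far as the scikit-learn documentation describes it: raw term counts, a smoothed inverse document frequency `ln((1 + n) / (1 + df)) + 1`, and L2 normalization of each document vector. The function `tfidf` and the three toy documents are assumptions for illustration, and stop-word removal is omitted.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: weight} dict per document."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term occurs
    df = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {term: math.log((1 + n_docs) / (1 + count)) + 1
           for term, count in df.items()}
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights = {term: count * idf[term] for term, count in counts.items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({t: w / norm for t, w in weights.items()} if norm else {})
    return vectors

docs = ["you are my hero", "you are wrong", "you vandal"]
vectors = tfidf(docs)
for term, weight in sorted(vectors[0].items(), key=lambda item: -item[1]):
    print(f"{term}: {weight:.3f}")
```

In the first document, "hero" occurs in one document only and gets the largest weight, while "you" occurs in all three and gets the smallest. The document frequencies come from the training corpus alone, which is why the test set is only transformed, not fitted.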

Vectorize the test set with the vectorizer fitted on the training set:

In [33]:
corpus_test = features_test.values
tf_idf_test = count_tf_idf.transform(corpus_test)

The vectors are ready; train the model:

In [34]:
model_lr = LogisticRegression(random_state = 4057, 
                              solver='lbfgs',
                              max_iter=1000,
                              class_weight='balanced')
model_lr.fit(tf_idf, target_train)

LogisticRegression(max_iter=1000, random_state=4057)
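The `class_weight='balanced'` setting compensates for the imbalance without resampling. According to the scikit-learn documentation, each class is weighted by `n_samples / (n_classes * class_count)`. The sketch below reproduces that formula in plain Python on toy labels; the helper `balanced_class_weights` is hypothetical.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weights inversely proportional to class frequency."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count)
            for cls, count in counts.items()}

# Toy labels with the same 10% positive share as the dataset
labels = [0] * 9 + [1]
weights = balanced_class_weights(labels)
print(weights)
```

With a 10% positive share, an error on a toxic comment costs about nine times as much as an error on a clean one, which tends to raise recall at the expense of precision.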

Predict on the test set:

In [35]:
predict = model_lr.predict(tf_idf_test)

In [36]:
f1 = f1_score(target_test, predict)
print('F1:',f1)

F1: 0.7283913565426171


### TF-IDF and CatBoost

In [38]:
model = CatBoostClassifier(max_depth = 5,
                           random_seed = 12345,
                           learning_rate = 0.1,
                           logging_level = 'Silent',
                           eval_metric = 'F1')

In [None]:
%%time
model.fit(tf_idf,
          target_train,
          plot=False)

In [None]:
pred_test = model.predict(tf_idf_test)

In [None]:
f1 = f1_score(target_test, pred_test)
print('F1:',f1)

In [None]:
print(recall_score(target_test, pred_test))
print(precision_score(target_test, pred_test))

## Conclusions

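CatBoost with built-in text features reached F1 ≈ 0.78 on the test set, while TF-IDF with logistic regression reached F1 ≈ 0.73. The TF-IDF and CatBoost cells have no recorded output, so that variant is not part of the comparison.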