<h1>Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Подготовка" data-toc-modified-id="Подготовка-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Preparation</a></span></li><li><span><a href="#Обучение" data-toc-modified-id="Обучение-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Training</a></span></li><li><span><a href="#Тест" data-toc-modified-id="Тест-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Testing</a></span></li><li><span><a href="#Выводы" data-toc-modified-id="Выводы-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Conclusions</a></span></li></ul></div>

# Project for an online store

An online store is launching a new service: users can now edit and extend product descriptions, as in wiki communities. Customers propose their own edits and comment on changes made by others. The store needs a tool that finds toxic comments and sends them to moderation.

The task is to train a model that classifies comments as positive (non-toxic) or negative (toxic). We have a dataset in which each comment is labeled for toxicity.

The model must reach an *F1* score of at least 0.75.
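
For reference, F1 is the harmonic mean of precision and recall for the positive (toxic) class. Below is a minimal sketch in plain Python; the function name `f1_from_labels` and the toy labels are illustrative and not part of the project code.

```python
def f1_from_labels(y_true, y_pred, positive=1):
    """Compute F1 for the positive class from two label sequences."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        # No true positives: precision or recall is zero, so F1 is zero.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy example: 3 toxic comments, 2 of them found, plus 1 false alarm.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
score = f1_from_labels(y_true, y_pred)
print(round(score, 3))
```

Because true negatives do not enter the formula, a model that labels everything as non-toxic gets F1 = 0 even though its accuracy on this dataset would be high.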

In [1]:
!pip install pymystem3
from pymystem3 import Mystem

In [2]:
import pandas as pd
import os
import numpy as np 
import nltk
from nltk.corpus import stopwords as nltk_stopwords
from sklearn.feature_extraction.text import TfidfVectorizer, TfidfTransformer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from lightgbm import LGBMClassifier 
from catboost import CatBoostClassifier
from sklearn.metrics import f1_score
import time
import warnings
warnings.filterwarnings('ignore')
import re 
from tqdm import tqdm
from nltk.stem import WordNetLemmatizer 


In [3]:
pth1 = 'toxic_comments.csv'
pth2 = '/datasets/toxic_comments.csv'

if os.path.exists(pth1):
    data = pd.read_csv(pth1)
elif os.path.exists(pth2):
    data = pd.read_csv(pth2)
else:
    raise FileNotFoundError('toxic_comments.csv was not found at either path')

In [4]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 159571 entries, 0 to 159570
Data columns (total 2 columns):
 #   Column  Non-Null Count   Dtype 
---  ------  --------------   ----- 
 0   text    159571 non-null  object
 1   toxic   159571 non-null  int64 
dtypes: int64(1), object(1)
memory usage: 2.4+ MB


In [5]:
data.head()

Unnamed: 0,text,toxic
0,Explanation\nWhy the edits made under my usern...,0
1,D'aww! He matches this background colour I'm s...,0
2,"Hey man, I'm really not trying to edit war. It...",0
3,"""\nMore\nI can't make any real suggestions on ...",0
4,"You, sir, are my hero. Any chance you remember...",0


In [6]:
data.isna().sum()

text     0
toxic    0
dtype: int64

In [7]:
s = data['toxic'].value_counts()
s[1]/s[0]

0.1131876717871444
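
The value above is the ratio of toxic to non-toxic comments, not the share of toxic comments. A short sketch of the conversion, using only the number printed above (the variable names are illustrative):

```python
# Ratio of toxic to non-toxic comments, copied from the cell output.
ratio = 0.1131876717871444

# If toxic / non_toxic = r, then toxic / (toxic + non_toxic) = r / (1 + r).
toxic_share = ratio / (1 + ratio)
non_toxic_share = 1 - toxic_share

print(f"toxic: {toxic_share:.1%}, non-toxic: {non_toxic_share:.1%}")
```

So roughly one comment in ten is toxic, which is why class weighting and the F1 metric matter here.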

Next we clean, tokenize, and lemmatize the text using WordNetLemmatizer.

In [8]:
corp = data['text'].values

In [9]:
corpus = []
s = r'[^a-zA-Z0-9]'
for i in corp:
    cleared = re.sub(s, " ", i)
    corpus.append(" ". join(cleared.split()))
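
The same cleaning step can be wrapped in a small reusable function. This is a sketch under the same regular expression as above; the name `clean_text` is hypothetical.

```python
import re

# Same pattern as in the cell above: anything that is not a Latin letter or a digit.
NON_ALNUM = re.compile(r"[^a-zA-Z0-9]")


def clean_text(text):
    """Replace non-alphanumeric characters with spaces and collapse whitespace."""
    return " ".join(NON_ALNUM.sub(" ", text).split())


sample = "D'aww! He matches this\nbackground colour"
cleaned = clean_text(sample)
print(cleaned)
```

Note that apostrophes are replaced too, so contractions such as "I'm" become the two tokens "I" and "m", which is visible in the lemmatized output further down.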

In [26]:
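wnl = WordNetLemmatizer()
def lemmatizered(corpus):
    """Tokenize each cleaned comment and lemmatize every token."""
    corpus_lem = []
    for i in corpus:
        s = nltk.word_tokenize(i)
        # lemmatize() is called without a POS tag, so each token is treated as a noun
        corpus_lem.append(' '.join([wnl.lemmatize(k) for k in s]))
    return corpus_lem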

In [27]:
data['lemm_text']  = lemmatizered(corpus)

In [28]:
data.head()

Unnamed: 0,text,toxic,lemm_text
0,Explanation\nWhy the edits made under my usern...,0,Explanation Why the edits made under my userna...
1,D'aww! He matches this background colour I'm s...,0,D aww He match this background colour I m seem...
2,"Hey man, I'm really not trying to edit war. It...",0,Hey man I m really not trying to edit war It s...
3,"""\nMore\nI can't make any real suggestions on ...",0,More I can t make any real suggestion on impro...
4,"You, sir, are my hero. Any chance you remember...",0,You sir are my hero Any chance you remember wh...


The dataset has two columns: toxic and text. About 90% of the comments are non-toxic.

## Preparation

In [29]:
nltk.download('stopwords')
stopwords = set(nltk_stopwords.words('english'))

[nltk_data] Downloading package stopwords to /home/jovyan/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [30]:
X = data['lemm_text']
y = data['toxic']

In [31]:
X_train, X_test, target_train, target_test = train_test_split(X, y, test_size = 0.2, random_state=2007)

In [32]:
# Pass the stop words as a list: recent scikit-learn versions reject a set here
count_tf_idf = TfidfVectorizer(stop_words=list(stopwords)) 

In [33]:
features_train = count_tf_idf.fit_transform(X_train)
features_test = count_tf_idf.transform(X_test)

In [34]:
print(features_train.shape)
print(features_test.shape)
print(target_train.shape)
print(target_test.shape)

(127656, 155968)
(31915, 155968)
(127656,)
(31915,)


The data is prepared: after cleaning and lemmatization, we extracted the features and the targets.
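
To make the feature values less of a black box, here is a rough sketch of the weighting that `TfidfVectorizer` applies, assuming its default settings (raw term counts, smoothed IDF, L2 normalization). The toy documents and the helper `tfidf_vector` are illustrative; tokenization and stop word removal are simplified to a plain `split()`.

```python
import math
from collections import Counter

docs = [
    "you are my hero",
    "you are wrong",
    "hero of the day",
]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents each term occurs.
df = Counter(term for tokens in tokenized for term in set(tokens))


def tfidf_vector(tokens):
    """Return L2-normalized TF-IDF weights for one tokenized document."""
    tf = Counter(tokens)
    # Smoothed IDF: ln((1 + n) / (1 + df)) + 1
    raw = {
        term: count * (math.log((1 + n_docs) / (1 + df[term])) + 1)
        for term, count in tf.items()
    }
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {term: w / norm for term, w in raw.items()}


weights = tfidf_vector(tokenized[0])
for term, w in sorted(weights.items()):
    print(f"{term}: {w:.3f}")
```

In the first document, "my" occurs in no other document and therefore gets a larger weight than "you", "are", or "hero", which each appear in two documents.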

## Training

LogisticRegression

In [35]:
LR = LogisticRegression(class_weight='balanced', 
                        random_state=2007,
                        n_jobs=-1,
                        solver='liblinear'
                       )
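
With `class_weight='balanced'`, scikit-learn weights each class by `n_samples / (n_classes * class_count)`. A small sketch of that formula in plain Python; the helper name and the toy labels are assumptions chosen to mimic the roughly 9:1 imbalance of this dataset.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weights computed as n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}


# Toy labels: 90 non-toxic and 10 toxic comments.
labels = [0] * 90 + [1] * 10
weights = balanced_class_weights(labels)
print(weights)
```

Each toxic example then counts about nine times as much as a non-toxic one in the loss, which pushes the model toward higher recall on the rare class.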

In [36]:
# Note: max_iter lists 10 twice, so that value is evaluated twice for every C
LR_params = {"max_iter":[10,100,10],
             'C': [0.1, 1, 10, 100],
            }

In [37]:
LR_gsearch = GridSearchCV(LR, LR_params, scoring='f1', cv=5)
LR_gsearch.fit(features_train, target_train)

GridSearchCV(cv=5,
             estimator=LogisticRegression(class_weight='balanced', n_jobs=-1,
                                          random_state=2007,
                                          solver='liblinear'),
             param_grid={'C': [0.1, 1, 10, 100], 'max_iter': [10, 100, 10]},
             scoring='f1')

In [38]:
LR_gsearch.best_params_

{'C': 10, 'max_iter': 10}

In [39]:
LR_gsearch.best_score_

0.7651767815565916
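
One caveat about this score: the vectorizer was fitted on the whole training set before cross-validation, so every validation fold has already contributed to the vocabulary and IDF weights. A common way to avoid this is to put the vectorizer and the classifier into one `Pipeline` and search over that. The sketch below uses a toy corpus in place of `X_train` and `target_train`; the step names `tfidf` and `clf` are arbitrary.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Toy corpus standing in for X_train / target_train.
texts = [
    "you are an idiot", "stupid useless edit",
    "idiot stop editing", "what a stupid idiot",
    "thanks for the fix", "nice edit thank you",
    "please see the talk page", "good source thanks",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

pipe = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression(class_weight="balanced",
                               solver="liblinear",
                               random_state=2007)),
])

# Parameters of a step are addressed as <step name>__<parameter>.
param_grid = {"clf__C": [0.1, 1, 10]}

# The vectorizer is refitted on the training part of every fold.
search = GridSearchCV(pipe, param_grid, scoring="f1", cv=2)
search.fit(texts, labels)
print(search.best_params_)
```

On the real data the pipeline would be fitted on the raw lemmatized strings (`X_train`) rather than on `features_train`, at the cost of repeating the vectorization in every fold.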

CatBoost is very slow to train; one option is to downsample the training set by a factor of n.
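
A possible way to shrink the training set while keeping the class balance is to sample each class separately. This is only a sketch: the function `stratified_downsample`, the factor, and the toy labels are assumptions, not something the notebook ran.

```python
import random


def stratified_downsample(labels, factor, seed=2007):
    """Return sorted row indices that keep about 1/factor of each class."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    kept = []
    for indices in by_class.values():
        # Keep at least one example of every class.
        k = max(1, len(indices) // factor)
        kept.extend(rng.sample(indices, k))
    return sorted(kept)


# Toy labels with the same 9:1 imbalance as the dataset.
labels = [0] * 90 + [1] * 10
kept = stratified_downsample(labels, factor=5)
print(len(kept), sum(labels[i] for i in kept))
```

With the notebook's objects, the resulting positions could presumably be applied as `features_train[kept]` and `target_train.iloc[kept]`, after calling the function on `target_train.tolist()`.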

In [40]:
# %%time
# CB_model = CatBoostClassifier(random_state = 2007)
# CB_param_search = { 
#                     'learning_rate': [0.03, 0.1],
#                     'depth': [4, 6, 10]
# }
# CB_gsearch = GridSearchCV(n_jobs = -1, estimator=CB_model, cv=3, param_grid=CB_param_search, scoring = 'f1')

# CB_gsearch.fit(features_train, target_train, verbose=None)

In [41]:
# CB_gsearch.best_params_

Forest

In [42]:
Forest = RandomForestClassifier(class_weight='balanced', n_jobs=-1 )
forest_params = { 'n_estimators': [100],
                    'max_depth' : [i for i in range(13,15)]
                }
forest_gsearch = GridSearchCV(Forest, forest_params, scoring='f1', cv=5)
forest_gsearch.fit(features_train, target_train)

GridSearchCV(cv=5,
             estimator=RandomForestClassifier(class_weight='balanced',
                                              n_jobs=-1),
             param_grid={'max_depth': [13, 14], 'n_estimators': [100]},
             scoring='f1')

In [43]:
forest_gsearch.best_params_

{'max_depth': 14, 'n_estimators': 100}

In [44]:
forest_gsearch.best_score_

0.3718524410533445

LGBMClassifier

In [None]:
LGBM_model = LGBMClassifier(random_state = 2007)
LGBM_params = {
  'n_estimators': [200],
  'learning_rate': [0.01, 0.1],
  'max_depth': [i for i in range(8,15)]}

LGBM_gsearch = GridSearchCV(n_jobs = -1, estimator=LGBM_model, cv=5, param_grid=LGBM_params, scoring = 'f1')

LGBM_gsearch.fit(features_train, target_train)


In [None]:
LGBM_gsearch.best_params_

In [None]:
LGBM_gsearch.best_score_

CatBoost takes too long to train, so its cells are commented out.

## Testing

LogisticRegression

In [None]:
LR_test = LogisticRegression(class_weight='balanced',
                             C = 10,
                        random_state=2007,
                        n_jobs=-1,
                        solver='liblinear',
                        max_iter = 10
                       )
LR_test.fit(features_train, target_train)
LR_pred = LR_test.predict(features_test)
LR_F1 = f1_score(target_test, LR_pred)
LR_F1

Forest

In [None]:
Forest_test = RandomForestClassifier(class_weight='balanced',
                                 random_state=2007,
                                 n_jobs=-1,
                                 max_depth =14,
                                 n_estimators = 100
                       )
Forest_test.fit(features_train, target_train)
Forest_pred = Forest_test.predict(features_test)
Forest_F1 = f1_score(target_test, Forest_pred)
Forest_F1

LGBMClassifier

In [None]:
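# Reuse the parameters selected by LGBM_gsearch instead of hard-coded values:
# the previous max_depth=4 was outside the searched range of 8..14.
# class_weight='balanced' is kept from the original cell; it was not part of the search.
LGBM_test = LGBMClassifier(class_weight='balanced',
                           random_state=2007,
                           n_jobs=-1,
                           **LGBM_gsearch.best_params_
                           )
LGBM_test.fit(features_train, target_train)
LGBM_pred = LGBM_test.predict(features_test)
LGBM_F1 = f1_score(target_test, LGBM_pred)
LGBM_F1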

In [None]:
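# CatBoost stays commented out: its grid search was never run, so the
# learning_rate and depth below are placeholders from the search grid, not tuned values.
# CB_model = CatBoostClassifier(random_state = 2007,
#                             learning_rate = 0.03,
#                             depth = 4
#                             )
# CB_model.fit(features_train, target_train)
# CB_predict = CB_model.predict(features_test)
# CB_F1 = f1_score(target_test, CB_predict)
# CB_F1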

## Conclusions

LogisticRegression turned out to be the best model: it is the only one that meets the F1 threshold of 0.75 (about 0.765 in cross-validation), and it is also the fastest.
