# Project for Wikishop («Викишоп»)

## Preparation

The online store needs a tool that finds toxic comments and sends them for moderation. The task is to train a model that classifies comments as "good" (non-toxic) or "bad" (toxic). The F1 score must be at least 0.75.
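
To make the target metric concrete, here is a minimal sketch of how F1 is computed for the positive ("toxic") class. The function name `f1_score_binary` and the labels are made up for illustration; the notebook itself relies on `sklearn.metrics.f1_score`.

```python
def f1_score_binary(y_true, y_pred):
    """F1 for the positive class (label 1): harmonic mean of precision and recall."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # toxic, flagged
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # clean, flagged
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # toxic, missed
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Made-up labels: 4 toxic comments, 3 of them caught, plus 1 false alarm.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
score = f1_score_binary(y_true, y_pred)
print(score)
```

In this toy case precision and recall are both 3/4, so F1 sits exactly at the 0.75 threshold. Correctly ignored clean comments (true negatives) do not enter the formula at all.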

In [1]:
import re

import nltk
import pandas as pd
from nltk.corpus import stopwords, wordnet
from nltk.stem import WordNetLemmatizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, make_scorer
from sklearn.model_selection import GridSearchCV, train_test_split

# WordNet data for the lemmatizer and the tagger used to pick each word's part of speech
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package wordnet to /home/jovyan/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /home/jovyan/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


In [2]:
comments = pd.read_csv('/datasets/toxic_comments.csv')
comments.head()

|   | Unnamed: 0 | text | toxic |
|---|---|---|---|
| 0 | 0 | Explanation\nWhy the edits made under my usern... | 0 |
| 1 | 1 | D'aww! He matches this background colour I'm s... | 0 |
| 2 | 2 | Hey man, I'm really not trying to edit war. It... | 0 |
| 3 | 3 | "\nMore\nI can't make any real suggestions on ... | 0 |
| 4 | 4 | You, sir, are my hero. Any chance you remember... | 0 |


In [3]:
comments.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 159292 entries, 0 to 159291
Data columns (total 3 columns):
 #   Column      Non-Null Count   Dtype 
---  ------      --------------   ----- 
 0   Unnamed: 0  159292 non-null  int64 
 1   text        159292 non-null  object
 2   toxic       159292 non-null  int64 
dtypes: int64(2), object(1)
memory usage: 3.6+ MB


There are no missing values, and the column types are appropriate.

Lemmatize the text and drop stop words:

In [5]:
%%time
lemmatizer = WordNetLemmatizer()
lem_text = []

def clear_text(text):
    """Keep only Latin letters, lowercase the text, and collapse whitespace."""
    text = re.sub(r'[^a-zA-Z]', ' ', text)
    text = text.lower()
    text = text.split()
    return " ".join(text)

def get_wordnet_pos(word):
    """Map the word's POS tag to a WordNet constant; default to noun."""
    tag = nltk.pos_tag([word])[0][1][0].upper()
    tag_dict = {"J": wordnet.ADJ,
                "N": wordnet.NOUN,
                "V": wordnet.VERB,
                "R": wordnet.ADV}
    return tag_dict.get(tag, wordnet.NOUN)

for i in range(comments.shape[0]):
    clear = clear_text(comments.loc[i, 'text'])
    token_text = nltk.word_tokenize(clear)
    # Note: the stop-word set is rebuilt for every token here, which is slow
    texts = [lemmatizer.lemmatize(w, get_wordnet_pos(w)) for w in token_text if w not in set(stopwords.words('english'))]
    lem_text.append(' '.join(texts))

comments['lem_text'] = lem_text

CPU times: user 32min 47s, sys: 3min 49s, total: 36min 36s
Wall time: 36min 44s
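
The cell above rebuilds the stop-word set and re-lemmatizes every occurrence of every word, which likely accounts for much of the runtime. The sketch below shows one possible way to restructure it: build the stop-word set once and cache the lemma of each distinct word. `make_preprocessor`, `toy_lemmas`, and the toy stop-word list are hypothetical stand-ins so that the example runs without NLTK data; in the notebook, `lemmatize_word` would wrap `lemmatizer.lemmatize(w, get_wordnet_pos(w))` and the stop words would come from `stopwords.words('english')`. It also uses plain `split()` instead of `nltk.word_tokenize`, which is a simplification.

```python
import re
from functools import lru_cache


def clear_text(text):
    """Keep only Latin letters, lowercase the text, and collapse whitespace."""
    return " ".join(re.sub(r"[^a-zA-Z]", " ", text).lower().split())


def make_preprocessor(lemmatize_word, stop_words):
    """Return a function that cleans, filters, and lemmatizes one text.

    lemmatize_word: callable mapping a word to its lemma (stand-in here).
    stop_words: iterable of words to drop; turned into a set only once.
    """
    stop_set = frozenset(stop_words)
    # Each distinct word is lemmatized once; repeats are served from the cache.
    cached_lemma = lru_cache(maxsize=None)(lemmatize_word)

    def preprocess(text):
        tokens = clear_text(text).split()
        return " ".join(cached_lemma(w) for w in tokens if w not in stop_set)

    return preprocess


# Toy stand-ins (assumptions, not the real WordNet lemmatizer or stop-word list).
toy_lemmas = {"edits": "edit", "comments": "comment"}
preprocess = make_preprocessor(lambda w: toy_lemmas.get(w, w), ["the", "my", "under"])
result = preprocess("The edits made under my username!")
print(result)
```

With a real lemmatizer plugged in, the result could then be applied with `comments['text'].apply(preprocess)` instead of the index-based loop.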


Split the data into training and test sets in a 60/40 ratio:

In [6]:
comments_train, comments_test = train_test_split(comments, test_size=0.4, random_state=12345)
print('Training set size:', len(comments_train))

Training set size: 95575
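
As a quick sanity check, the reported training size can be reproduced by hand. The sketch assumes that `train_test_split` rounds the test share up and gives the remainder to the training set, which is my understanding of scikit-learn's behaviour when only `test_size` is given as a fraction; the row count comes from `comments.info()` above.

```python
import math

n_samples = 159292   # rows reported by comments.info()
test_size = 0.4

# Assumption: the test share is rounded up, the rest goes to training.
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(n_train, n_test)
```

Under that assumption the training size matches the value printed by the cell.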


Compute TF-IDF features. The vectorizer is fitted on the training set only and then applied to the test set:

In [7]:
count_tf_idf = TfidfVectorizer()
corpus_train = comments_train['lem_text']
corpus_test = comments_test['lem_text']
tf_idf_train = count_tf_idf.fit_transform(corpus_train)
tf_idf_test = count_tf_idf.transform(corpus_test)
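
To illustrate what fitting on the training set and transforming the test set means, here is a small pure-Python sketch of TF-IDF. It follows what I understand to be the `TfidfVectorizer` defaults (raw term counts, smoothed IDF `ln((1 + n) / (1 + df)) + 1`, L2 normalisation), but it tokenizes with a plain `split()`, so it is an approximation rather than a drop-in replacement. The names `fit_idf` and `transform` and the toy documents are hypothetical.

```python
import math
from collections import Counter


def fit_idf(train_docs):
    """Learn smoothed IDF weights from the training documents only."""
    n_docs = len(train_docs)
    doc_freq = Counter()
    for doc in train_docs:
        doc_freq.update(set(doc.split()))  # count each term once per document
    return {term: math.log((1 + n_docs) / (1 + df)) + 1
            for term, df in doc_freq.items()}


def transform(doc, idf):
    """Map one document to L2-normalised TF-IDF weights.

    Terms that were not seen during fitting are ignored.
    """
    counts = Counter(w for w in doc.split() if w in idf)
    weights = {term: tf * idf[term] for term, tf in counts.items()}
    norm = math.sqrt(sum(v * v for v in weights.values()))
    return {term: v / norm for term, v in weights.items()} if norm else {}


train_docs = ["good article thanks", "stupid article", "thanks good edit"]
idf = fit_idf(train_docs)

# "newword" never occurred in training, so it gets no weight at test time.
vec = transform("stupid stupid newword article", idf)
print(vec)
```

Because the vocabulary and IDF weights come only from the training documents, no information about the test set leaks into the features.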

Prepare the features and targets:

In [8]:
target_train = comments_train['toxic']
feature_train = tf_idf_train
feature_test = tf_idf_test
target_test = comments_test['toxic']

## Training

Train a logistic regression model and tune its hyperparameters with cross-validation:

In [9]:
%%time
# class_weight='balanced' reweights classes inversely to their frequency
model_lin = LogisticRegression(class_weight='balanced', solver='liblinear', max_iter=100)

# Candidate values for the inverse regularization strength: C = 1, 4, 7, 10, 13
grid_space = {'C': list(range(1,15,3))}
grid_lin = GridSearchCV(estimator=model_lin, cv=2, param_grid=grid_space, scoring=make_scorer(f1_score), verbose=2)
model_grid = grid_lin.fit(feature_train, target_train)

print('Best hyperparameters of the linear model: '+str(model_grid.best_params_))
print('Best F1 of the linear model: '+str(model_grid.best_score_))

Fitting 2 folds for each of 5 candidates, totalling 10 fits
[CV] END ................................................C=1; total time=   6.7s
[CV] END ................................................C=1; total time=   6.9s
[CV] END ................................................C=4; total time=  11.6s
[CV] END ................................................C=4; total time=  10.6s
[CV] END ................................................C=7; total time=  13.9s
[CV] END ................................................C=7; total time=  14.7s
[CV] END ...............................................C=10; total time=  15.0s
[CV] END ...............................................C=10; total time=  14.1s
[CV] END ...............................................C=13; total time=  15.7s
[CV] END ...............................................C=13; total time=  15.5s
Best hyperparameters of the linear model: {'C': 10}
Best F1 of the linear model: 0.7579136735707597
CPU times: user 47.8 s, sys: 1min 3
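
Both models in this notebook are created with `class_weight='balanced'`. As a rough illustration, the sketch below computes the weights this option assigns, using the documented rule `n_samples / (n_classes * count_of_class)`. The helper name `balanced_class_weights` and the 90/10 label split are made up; the real class ratio of the dataset is not shown in this notebook.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}


# Made-up labels: 9 non-toxic comments and 1 toxic comment.
labels = [0] * 9 + [1]
weights = balanced_class_weights(labels)
print(weights)
```

The rarer class receives the larger weight, so mistakes on toxic comments cost more during training than mistakes on the majority class.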

Train a random forest model and tune its hyperparameters with cross-validation:

In [10]:
%%time
model_forest = RandomForestClassifier(random_state=12345, class_weight='balanced')

# Only the maximum tree depth is tuned; other settings stay at their defaults
grid_space={'max_depth':[50, 150]}

grid_forrest = GridSearchCV(model_forest, param_grid=grid_space, cv=2, scoring=make_scorer(f1_score), verbose=2)
model_grid = grid_forrest.fit(feature_train, target_train)

print('Best hyperparameters of the forest: '+str(model_grid.best_params_))
print('Best F1 of the forest: '+str(model_grid.best_score_))

Fitting 2 folds for each of 2 candidates, totalling 4 fits
[CV] END .......................................max_depth=50; total time=  45.7s
[CV] END .......................................max_depth=50; total time=  43.8s
[CV] END ......................................max_depth=150; total time= 2.3min
[CV] END ......................................max_depth=150; total time= 2.2min
Best hyperparameters of the forest: {'max_depth': 150}
Best F1 of the forest: 0.5651119955187116
CPU times: user 9min, sys: 1.03 s, total: 9min 1s
Wall time: 9min 2s


The best random forest uses `max_depth=150` and reaches a cross-validated F1 of about 0.57, substantially worse than logistic regression (about 0.76).

## Conclusions

Apply the best logistic regression model to the test set:

In [13]:
model = LogisticRegression(class_weight='balanced', solver='liblinear', max_iter=100, C=10)
model.fit(feature_train, target_train)

predict_test = model.predict(feature_test)
test_f1 = f1_score(target_test, predict_test)

print("F1 on the test set:", test_f1)

F1 on the test set: 0.760028653295129


On the test set the model reaches F1 = 0.76, which meets the customer's requirement of at least 0.75.