# Comment classification

An online store is launching a new service that lets users edit and extend product descriptions. It needs a tool that detects toxic comments and sends them for moderation.

The task is to train a model that classifies comments as positive or negative (toxic).

**Plan**

1. Load and prepare the data.
2. Train and compare models.
3. Test the best model and draw conclusions.

**Data description**

The input is a dataset of edits labeled for toxicity. The *text* column contains the comment text, and *toxic* is the target.

## Preparation

In [1]:
import pandas as pd
import numpy as np
import re

import nltk
from nltk.corpus import stopwords as nltk_stopwords
from nltk.stem import WordNetLemmatizer

from sklearn.pipeline import Pipeline

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import StandardScaler

from sklearn.model_selection import train_test_split
from sklearn.model_selection import RandomizedSearchCV

from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

from sklearn.metrics import f1_score

In [2]:
nltk.download('wordnet')
nltk.download('stopwords')

[nltk_data] Downloading package wordnet to /home/jovyan/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to /home/jovyan/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [3]:
SEED = 6

In [4]:
try:
    data = pd.read_csv('toxic_comments.csv')
except FileNotFoundError:
    data = pd.read_csv('/datasets/toxic_comments.csv')

In [5]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 159292 entries, 0 to 159291
Data columns (total 3 columns):
 #   Column      Non-Null Count   Dtype 
---  ------      --------------   ----- 
 0   Unnamed: 0  159292 non-null  int64 
 1   text        159292 non-null  object
 2   toxic       159292 non-null  int64 
dtypes: int64(2), object(1)
memory usage: 3.6+ MB


In [6]:
data.head(10)

,Unnamed: 0,text,toxic
0,0,Explanation\nWhy the edits made under my usern...,0
1,1,D'aww! He matches this background colour I'm s...,0
2,2,"Hey man, I'm really not trying to edit war. It...",0
3,3,"""\nMore\nI can't make any real suggestions on ...",0
4,4,"You, sir, are my hero. Any chance you remember...",0
5,5,"""\n\nCongratulations from me as well, use the ...",0
6,6,COCKSUCKER BEFORE YOU PISS AROUND ON MY WORK,1
7,7,Your vandalism to the Matt Shirvington article...,0
8,8,Sorry if the word 'nonsense' was offensive to ...,0
9,9,alignment on this subject and which are contra...,0


The `Unnamed: 0` column duplicates the index, so we drop it.

In [7]:
data.drop('Unnamed: 0', axis=1, inplace=True)
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 159292 entries, 0 to 159291
Data columns (total 2 columns):
 #   Column  Non-Null Count   Dtype 
---  ------  --------------   ----- 
 0   text    159292 non-null  object
 1   toxic   159292 non-null  int64 
dtypes: int64(1), object(1)
memory usage: 2.4+ MB


Next, clean and lemmatize the texts.

In [8]:
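lemmatizer = WordNetLemmatizer()

def clear_n_lemmatize(row):
    """Keep only Latin letters, then lemmatize each token of row['text']."""
    text = row['text']
    # replace everything except Latin letters and spaces, then collapse whitespace
    cleared = ' '.join(re.sub(r'[^a-zA-Z ]', ' ', text).split())
    # word_tokenize relies on NLTK's Punkt tokenizer data being installed
    tokens = nltk.word_tokenize(cleared)
    # lemmatize() treats every token as a noun by default
    lemms = [lemmatizer.lemmatize(token) for token in tokens]
    return ' '.join(lemms)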
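
To make the cleaning step easier to inspect in isolation, here is a minimal sketch that applies the same regular expression without the NLTK tokenizer or lemmatizer. The helper name `clean_text` and the second sample string are hypothetical and used only for illustration.

```python
import re

def clean_text(text: str) -> str:
    """Replace every character except Latin letters and spaces, then collapse whitespace."""
    return ' '.join(re.sub(r'[^a-zA-Z ]', ' ', text).split())

# The first sample comes from the dataset preview; the second one is made up.
samples = [
    "Hey man, I'm really not trying to edit war.",
    "Version 2.0\nis out!!!",
]
cleaned = [clean_text(s) for s in samples]
for before, after in zip(samples, cleaned):
    print(repr(before), '->', repr(after))
```

Note that punctuation, digits, and line breaks all become spaces, so contractions such as "I'm" end up as two separate tokens.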

In [9]:
%%time

data['lemm_text'] = data.apply(clear_n_lemmatize, axis=1)

CPU times: user 1min 12s, sys: 193 ms, total: 1min 13s
Wall time: 1min 13s


In [10]:
data.head(10)

,text,toxic,lemm_text
0,Explanation\nWhy the edits made under my usern...,0,Explanation Why the edits made under my userna...
1,D'aww! He matches this background colour I'm s...,0,D aww He match this background colour I m seem...
2,"Hey man, I'm really not trying to edit war. It...",0,Hey man I m really not trying to edit war It s...
3,"""\nMore\nI can't make any real suggestions on ...",0,More I can t make any real suggestion on impro...
4,"You, sir, are my hero. Any chance you remember...",0,You sir are my hero Any chance you remember wh...
5,"""\n\nCongratulations from me as well, use the ...",0,Congratulations from me a well use the tool we...
6,COCKSUCKER BEFORE YOU PISS AROUND ON MY WORK,1,COCKSUCKER BEFORE YOU PISS AROUND ON MY WORK
7,Your vandalism to the Matt Shirvington article...,0,Your vandalism to the Matt Shirvington article...
8,Sorry if the word 'nonsense' was offensive to ...,0,Sorry if the word nonsense wa offensive to you...
9,alignment on this subject and which are contra...,0,alignment on this subject and which are contra...


Now prepare the train and test sets. First, check the class balance.

In [11]:
data['toxic'].value_counts() / data.shape[0]

0    0.898388
1    0.101612
Name: toxic, dtype: float64

The classes are imbalanced (roughly 90% to 10%), so the train/test split is stratified by the target.

In [12]:
(text_train, text_test,
 target_train, target_test) = train_test_split(data['lemm_text'], data['toxic'], test_size=0.25,
                                               random_state=SEED, stratify=data['toxic'])

In [13]:
# check the split sizes
print(f'Train size = {text_train.shape[0]}')
print(f'Test size = {text_test.shape[0]}')

Train size = 119469
Test size = 39823
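
The idea behind a stratified split can be sketched in plain Python: split each class separately, then merge the parts, so both subsets keep roughly the same class shares. This is an illustrative sketch on toy labels, not the implementation behind `train_test_split`; the function name `stratified_split` is hypothetical.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_size=0.25, seed=6):
    """Return (train_idx, test_idx) with per-class proportions kept roughly equal."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)                      # shuffle within each class
        n_test = round(len(indices) * test_size)  # same share taken from each class
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return train_idx, test_idx

# Toy labels with 10% positives, loosely mirroring the imbalance seen above.
labels = [1] * 100 + [0] * 900
train_idx, test_idx = stratified_split(labels)
train_share = sum(labels[i] for i in train_idx) / len(train_idx)
test_share = sum(labels[i] for i in test_idx) / len(test_idx)
print(f'positive share: train = {train_share:.3f}, test = {test_share:.3f}')
```

With a plain random split the positive share in a small test set can drift noticeably, which makes F1 estimates less comparable between subsets.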


## Training

Now train the models and tune their hyperparameters. Because parameters can be tuned both for text vectorization and for the model itself, we use a `Pipeline`.

In [14]:
stopwords = set(nltk_stopwords.words('english'))

### LogisticRegression

In [15]:
pipeline = Pipeline(
    [
        ('vect', TfidfVectorizer(stop_words=stopwords)),
        ('clf', LogisticRegression(solver='sag', class_weight='balanced',
                                   max_iter=1000, random_state=SEED))
    ]
)
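
As a rough illustration of what `class_weight='balanced'` does, scikit-learn documents the heuristic as `n_samples / (n_classes * count_c)`, which equals `1 / (n_classes * share_c)`. The sketch below plugs in the class shares from the `value_counts()` output above; the variable names are illustrative only.

```python
# Class shares taken from the value_counts() output earlier in the notebook.
class_share = {0: 0.898388, 1: 0.101612}
n_classes = len(class_share)

# 'balanced' heuristic: n_samples / (n_classes * count_c) == 1 / (n_classes * share_c)
class_weight = {c: 1.0 / (n_classes * share) for c, share in class_share.items()}

for c, w in class_weight.items():
    print(f'class {c}: weight = {w:.3f}')
```

Under this heuristic the toxic class gets a weight of about 4.92 versus about 0.56 for the non-toxic class, so each class contributes equally to the loss in total.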

In [16]:
params = {
    'vect__ngram_range':((1, 1), (1, 2), (1, 3), (2, 2)),
    'clf__C':(1, 10, 100)
}
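
It helps to know how much of this grid a randomized search actually visits. `RandomizedSearchCV` evaluates `n_iter` candidates, 10 by default, and no `n_iter` is passed in this notebook. The sketch below enumerates the grid and draws 10 candidates with the standard library; it only illustrates the coverage and does not reproduce scikit-learn's own sampling order.

```python
import itertools
import random

param_grid = {
    'vect__ngram_range': ((1, 1), (1, 2), (1, 3), (2, 2)),
    'clf__C': (1, 10, 100),
}

# Enumerate every combination of the listed values.
names = list(param_grid)
all_candidates = [dict(zip(names, values))
                  for values in itertools.product(*param_grid.values())]

n_iter = 10  # default number of candidates in RandomizedSearchCV
rng = random.Random(6)
sampled = rng.sample(all_candidates, k=min(n_iter, len(all_candidates)))

print(f'{len(sampled)} of {len(all_candidates)} combinations evaluated')
```

For this grid the search covers 10 of 12 combinations. The grids used for the tree-based models later in the notebook contain 160 and 640 combinations, so there the same 10 candidates cover only a small fraction.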

In [17]:
%%time

grid_search = RandomizedSearchCV(pipeline, params,
                                 n_jobs=-1, cv=3, scoring='f1',
                                 random_state=SEED, refit=False
                                )
grid_search.fit(text_train.values, target_train)



CPU times: user 22min 37s, sys: 13 s, total: 22min 50s
Wall time: 22min 51s


RandomizedSearchCV(cv=3,
                   estimator=Pipeline(steps=[('vect',
                                              TfidfVectorizer(stop_words={'a',
                                                                          'about',
                                                                          'above',
                                                                          'after',
                                                                          'again',
                                                                          'against',
                                                                          'ain',
                                                                          'all',
                                                                          'am',
                                                                          'an',
                                                                          'and',
                      

In [18]:
lr_best_params = grid_search.best_params_
lr_best_score = grid_search.best_score_

print('LogisticRegression')
print(f'Best params {lr_best_params}')
print(f'F1 = {lr_best_score}')

LogisticRegression
Best params {'vect__ngram_range': (1, 2), 'clf__C': 10}
F1 = 0.7806318129742852


### DecisionTreeClassifier

In [19]:
pipeline = Pipeline(
    [
        ('vect', TfidfVectorizer(stop_words=stopwords)),
        ('clf', DecisionTreeClassifier(class_weight='balanced', random_state=SEED))
    ]
)

In [20]:
params = {
    'vect__ngram_range':((1, 1), (1, 2), (1, 3), (2, 2)),
    'vect__max_features':range(1000, 11000, 1000),
    'clf__max_depth':range(2, 9, 2)
}

In [21]:
%%time

grid_search = RandomizedSearchCV(pipeline, params,
                                 n_jobs=-1, cv=3, scoring='f1',
                                 random_state=SEED, refit=False
                                )
grid_search.fit(text_train.values, target_train)

CPU times: user 7min 2s, sys: 12.9 s, total: 7min 15s
Wall time: 7min 15s


RandomizedSearchCV(cv=3,
                   estimator=Pipeline(steps=[('vect',
                                              TfidfVectorizer(stop_words={'a',
                                                                          'about',
                                                                          'above',
                                                                          'after',
                                                                          'again',
                                                                          'against',
                                                                          'ain',
                                                                          'all',
                                                                          'am',
                                                                          'an',
                                                                          'and',
                      

In [22]:
dt_best_params = grid_search.best_params_
dt_best_score = grid_search.best_score_

print('DecisionTreeClassifier')
print(f'Best params {dt_best_params}')
print(f'F1 = {dt_best_score}')

DecisionTreeClassifier
Best params {'vect__ngram_range': (1, 2), 'vect__max_features': 10000, 'clf__max_depth': 8}
F1 = 0.5254125156039966


### RandomForestClassifier

In [23]:
pipeline = Pipeline(
    [
        ('vect', TfidfVectorizer(stop_words=stopwords)),
        ('clf', RandomForestClassifier(class_weight='balanced', random_state=SEED))
    ]
)

In [24]:
params = {
    'vect__ngram_range':((1, 1), (1, 2), (1, 3), (2, 2)),
    'vect__max_features':range(1000, 11000, 1000),
    'clf__n_estimators':range(3, 10, 2),
    'clf__max_depth':range(2, 9, 2)
}

In [25]:
%%time

grid_search = RandomizedSearchCV(pipeline, params,
                                 n_jobs=-1, cv=3, scoring='f1',
                                 random_state=SEED, refit=False
                                )
grid_search.fit(text_train.values, target_train)

CPU times: user 5min 25s, sys: 9.27 s, total: 5min 34s
Wall time: 5min 35s


RandomizedSearchCV(cv=3,
                   estimator=Pipeline(steps=[('vect',
                                              TfidfVectorizer(stop_words={'a',
                                                                          'about',
                                                                          'above',
                                                                          'after',
                                                                          'again',
                                                                          'against',
                                                                          'ain',
                                                                          'all',
                                                                          'am',
                                                                          'an',
                                                                          'and',
                      

In [26]:
rf_best_params = grid_search.best_params_
rf_best_score = grid_search.best_score_

print('RandomForestClassifier')
print(f'Best params {rf_best_params}')
print(f'F1 = {rf_best_score}')

RandomForestClassifier
Best params {'vect__ngram_range': (1, 1), 'vect__max_features': 8000, 'clf__n_estimators': 9, 'clf__max_depth': 8}
F1 = 0.40347425384748153


### Model comparison

In [27]:
results = [['LogisticRegression', lr_best_score],
           ['DecisionTreeClassifier', dt_best_score],
           ['RandomForestClassifier', rf_best_score]]
columns = ['Model', 'F1']
table = pd.DataFrame(data=results, columns=columns)
table

,Model,F1
0,LogisticRegression,0.780632
1,DecisionTreeClassifier,0.525413
2,RandomForestClassifier,0.403474


During training and hyperparameter search, logistic regression achieved the best cross-validated F1. We now evaluate this model on the test set.

## Testing

In [28]:
vectorizer = TfidfVectorizer(stop_words=stopwords,
                             ngram_range=lr_best_params['vect__ngram_range']
                            )
model = LogisticRegression(solver='sag', class_weight='balanced',
                           max_iter=1000, random_state=SEED,
                           C=lr_best_params['clf__C']
                          )

In [29]:
%%time

features_train = vectorizer.fit_transform(text_train.values)

CPU times: user 17.6 s, sys: 647 ms, total: 18.2 s
Wall time: 18.2 s


In [30]:
%%time

model.fit(features_train, target_train)

CPU times: user 3min 25s, sys: 172 ms, total: 3min 26s
Wall time: 3min 26s


LogisticRegression(C=10, class_weight='balanced', max_iter=1000, random_state=6,
                   solver='sag')

In [31]:
%%time

features_test = vectorizer.transform(text_test)
predicted_test = model.predict(features_test)
print(f'Final F1 = {f1_score(target_test, predicted_test)}')

Final F1 = 0.7748384737291235
CPU times: user 3.51 s, sys: 0 ns, total: 3.51 s
Wall time: 3.52 s


The final F1 of the selected model on the test set is above 0.75, which meets the requirement stated in the assignment.
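
For reference, F1 is the harmonic mean of precision and recall for the positive (toxic) class. The sketch below computes it from confusion counts; the function name and the counts are hypothetical and are not taken from this notebook's predictions.

```python
def f1_from_counts(tp: int, fp: int, fn: int) -> float:
    """F1 for the positive class from true positives, false positives, false negatives."""
    if tp == 0:
        return 0.0  # no correct positive predictions: F1 is defined as 0 here
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Hypothetical classifier: finds 80 of 100 toxic comments, with 20 false alarms.
f1_example = f1_from_counts(tp=80, fp=20, fn=20)

# Trivial baseline that labels everything as non-toxic: no true positives at all.
f1_baseline = f1_from_counts(tp=0, fp=0, fn=100)

print(f'example F1 = {f1_example:.3f}, all-negative baseline F1 = {f1_baseline:.3f}')
```

This is why F1 suits the imbalanced data here: the all-negative baseline would reach about 0.90 accuracy given the class shares, yet its F1 is zero.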

## Conclusions

In the first stage, the provided data was loaded, the texts were cleaned and lemmatized, and the data was split into training and test sets.

The following models were tuned with `Pipeline` and `RandomizedSearchCV`:
 - `LogisticRegression`;
 - `DecisionTreeClassifier`;
 - `RandomForestClassifier`.

The following vectorization settings were considered:
 - unigrams only (`ngram_range=(1, 1)`);
 - unigrams and bigrams (`ngram_range=(1, 2)`);
 - unigrams, bigrams, and trigrams (`ngram_range=(1, 3)`);
 - bigrams only (`ngram_range=(2, 2)`).
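
The four `ngram_range` settings listed above can be illustrated with a small pure-Python sketch. The helper `word_ngrams` is hypothetical and only mimics the general idea; the real `TfidfVectorizer` also lowercases the text and removes stop words before building n-grams.

```python
def word_ngrams(tokens, ngram_range=(1, 1)):
    """Return all word n-grams with min_n <= n <= max_n, shortest first."""
    min_n, max_n = ngram_range
    result = []
    for n in range(min_n, max_n + 1):
        for start in range(len(tokens) - n + 1):
            result.append(' '.join(tokens[start:start + n]))
    return result

tokens = 'not trying to edit war'.split()
for ngram_range in [(1, 1), (1, 2), (1, 3), (2, 2)]:
    features = word_ngrams(tokens, ngram_range)
    print(ngram_range, len(features), features)
```

Wider ranges add context such as "edit war", but they also enlarge the vocabulary considerably, which increases training time.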

The best cross-validation result came from `LogisticRegression` with `C=10, class_weight='balanced', max_iter=1000, solver='sag'` on unigrams and bigrams.

On the test data, this model reached `F1 = 0.775`.