# Project for Wikishop

The online store Wikishop is launching a new service. Users can now edit and extend product descriptions, as in wiki communities: customers propose their own edits and comment on changes made by others. The store needs a tool that finds toxic comments and sends them for moderation.

Train a model to classify comments as positive or negative. You have a dataset labeled for the toxicity of edits.

Build a model with an *F1* score of at least 0.75.
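For reference, F1 is the harmonic mean of precision and recall for the positive (toxic) class. The sketch below shows the calculation in plain Python; the function name `f1_from_counts` and the toy counts are illustrative assumptions, not part of the assignment.

```python
def f1_from_counts(tp, fp, fn):
    """F1 for the positive class from true positives, false positives, false negatives."""
    if tp == 0:
        return 0.0  # no correct positive predictions: treat F1 as 0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Toy example: 80 toxic comments caught, 20 false alarms, 30 toxic comments missed
score = f1_from_counts(tp=80, fp=20, fn=30)
print(round(score, 4))
```

Unlike accuracy, this metric ignores true negatives, which is why it suits a dataset where most comments are not toxic.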

**Project instructions**

1. Load and prepare the data.
2. Train several models.
3. Draw conclusions.

Using *BERT* is not required for this project, but you may try it.

**Data description**

The data is in the file `toxic_comments.csv`. The *text* column contains the comment text, and *toxic* is the target.

## Preparation

### Importing libraries

In [None]:
import re

import pandas as pd

import nltk
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet
from nltk.corpus import stopwords as nltk_stopwords

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import f1_score

import warnings
warnings.filterwarnings('ignore')

nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')
nltk.download('punkt')

[nltk_data] Downloading package wordnet to /home/jovyan/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /home/jovyan/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package punkt to /home/jovyan/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

### Loading and preparing the data

In [None]:
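try:
    # Local copy on the course platform
    df = pd.read_csv('/datasets/toxic_comments.csv')
except FileNotFoundError:
    # Fall back to the public copy when the local file is absent
    df = pd.read_csv('https://code.s3.yandex.net/datasets/toxic_comments.csv')
df.head(10)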

Unnamed: 0.1,Unnamed: 0,text,toxic
0,0,Explanation\nWhy the edits made under my usern...,0
1,1,D'aww! He matches this background colour I'm s...,0
2,2,"Hey man, I'm really not trying to edit war. It...",0
3,3,"""\nMore\nI can't make any real suggestions on ...",0
4,4,"You, sir, are my hero. Any chance you remember...",0
5,5,"""\n\nCongratulations from me as well, use the ...",0
6,6,COCKSUCKER BEFORE YOU PISS AROUND ON MY WORK,1
7,7,Your vandalism to the Matt Shirvington article...,0
8,8,Sorry if the word 'nonsense' was offensive to ...,0
9,9,alignment on this subject and which are contra...,0


In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 159292 entries, 0 to 159291
Data columns (total 3 columns):
 #   Column      Non-Null Count   Dtype 
---  ------      --------------   ----- 
 0   Unnamed: 0  159292 non-null  int64 
 1   text        159292 non-null  object
 2   toxic       159292 non-null  int64 
dtypes: int64(2), object(1)
memory usage: 3.6+ MB


In [None]:
df['toxic'].value_counts()

0    143106
1     16186
Name: toxic, dtype: int64

In [None]:
class_ratio = df['toxic'].value_counts()[0] / df['toxic'].value_counts()[1]
class_ratio

8.841344371679229

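The classes are imbalanced: non-toxic comments outnumber toxic ones by a factor of about 8.8. We keep this ratio as the weight of the toxic class for the models, and drop the `Unnamed: 0` column, which looks like a leftover row index.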

In [None]:
dict_classes={0:1, 1:class_ratio}
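As a cross-check, the hand-built weights can be compared with the formula scikit-learn documents for `class_weight='balanced'`, namely `n_samples / (n_classes * class_count)`. The sketch below redoes that arithmetic in plain Python using the counts printed above; the variable names are illustrative.

```python
# Class counts taken from the value_counts() output above
n_negative, n_positive = 143106, 16186
n_samples = n_negative + n_positive
n_classes = 2

# Formula documented for class_weight='balanced'
balanced = {
    0: n_samples / (n_classes * n_negative),
    1: n_samples / (n_classes * n_positive),
}

# Rescale so that the weight of class 0 equals 1
rescaled = {label: weight / balanced[0] for label, weight in balanced.items()}
print(balanced)
print(rescaled)
```

After rescaling, the weight of class 1 equals the class ratio, so the manual dictionary is proportional to the balanced weights. The two differ only by a constant factor, which can interact with the regularization strength `C`.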

In [None]:
df = df.drop(['Unnamed: 0'], axis=1)
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 159292 entries, 0 to 159291
Data columns (total 2 columns):
 #   Column  Non-Null Count   Dtype 
---  ------  --------------   ----- 
 0   text    159292 non-null  object
 1   toxic   159292 non-null  int64 
dtypes: int64(1), object(1)
memory usage: 2.4+ MB


#### Preparing features for training

##### Cleaning and lemmatization

In [None]:
def clear_text(text):
    # Lowercase, then replace every non-letter character (digits, punctuation,
    # line breaks) with a space
    text = text.lower()
    cleared_text = re.sub(r'[^a-zA-Z]', ' ', text)
    # Collapse runs of whitespace into single spaces
    return " ".join(cleared_text.split())
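A quick check of the cleaning step on a short string shows what it keeps and what it drops. The sample string is made up for illustration, and the function is repeated here only so the snippet runs on its own.

```python
import re

def clear_text(text):
    # Same logic as the notebook's function
    text = text.lower()
    cleared_text = re.sub(r'[^a-zA-Z]', ' ', text)
    return " ".join(cleared_text.split())

sample = "D'aww! He matches\nthis background colour"
cleaned = clear_text(sample)
print(cleaned)  # d aww he matches this background colour
```

Note that apostrophes become spaces, so contractions such as `I'm` turn into the two tokens `i` and `m`, which is visible in the dataframe output further down.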


In [None]:
def get_wordnet_pos(word):
    tag = nltk.pos_tag([word])[0][1][0].upper()
    tag_dict = {"J": wordnet.ADJ,
                "N": wordnet.NOUN,
                "V": wordnet.VERB,
                "R": wordnet.ADV
               }
    return tag_dict.get(tag, wordnet.NOUN)

In [None]:
lemmatizer = WordNetLemmatizer()

def lemmatize_text(text):
    text = [lemmatizer.lemmatize(w, get_wordnet_pos(w)) for w in nltk.word_tokenize(text)]
    return ' '.join(text)
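The lemmatization cell further down reports about 17 minutes of wall time. Because `get_wordnet_pos` tags each word in isolation, the lemma depends only on the word itself, so caching per-word results should not change the output. The sketch below illustrates that idea with a stand-in lemmatizer; `make_cached_lemmatizer` and `toy_lemmatize` are hypothetical names, and splitting on whitespace instead of calling `nltk.word_tokenize` is an assumption that may differ for some tokens.

```python
from functools import lru_cache

def make_cached_lemmatizer(lemmatize_word):
    """Return (lemmatize_text, cached_word_fn); each distinct word is processed once."""
    cached = lru_cache(maxsize=None)(lemmatize_word)

    def lemmatize_text(text):
        # Assumption: text is already cleaned, so whitespace splitting stands in
        # for nltk.word_tokenize here
        return ' '.join(cached(word) for word in text.split())

    return lemmatize_text, cached

# Stand-in for lemmatizer.lemmatize(w, get_wordnet_pos(w)); a toy rule, not WordNet
calls = []
def toy_lemmatize(word):
    calls.append(word)
    return word[:-1] if word.endswith('s') and len(word) > 3 else word

lemmatize_text, cached = make_cached_lemmatizer(toy_lemmatize)
result = lemmatize_text('edits and more edits and edits')
print(result)
print(len(calls))
```

In the notebook, the function passed in could be `lambda w: lemmatizer.lemmatize(w, get_wordnet_pos(w))`. Whether this gives a large speed-up on the real data has not been measured here.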


In [None]:
df['text'] = df['text'].apply(clear_text)

In [None]:
%%time

df['lemm_text'] = df['text'].apply(lemmatize_text)
df

CPU times: user 15min 26s, sys: 1min 28s, total: 16min 55s
Wall time: 16min 56s


Unnamed: 0,text,toxic,lemm_text
0,explanation why the edits made under my userna...,0,explanation why the edits make under my userna...
1,d aww he matches this background colour i m se...,0,d aww he match this background colour i m seem...
2,hey man i m really not trying to edit war it s...,0,hey man i m really not try to edit war it s ju...
3,more i can t make any real suggestions on impr...,0,more i can t make any real suggestion on impro...
4,you sir are my hero any chance you remember wh...,0,you sir be my hero any chance you remember wha...
...,...,...,...
159287,and for the second time of asking when your vi...,0,and for the second time of ask when your view ...
159288,you should be ashamed of yourself that is a ho...,0,you should be ashamed of yourself that be a ho...
159289,spitzer umm theres no actual article for prost...,0,spitzer umm there no actual article for prosti...
159290,and it looks like it was actually you who put ...,0,and it look like it be actually you who put on...


In [None]:
df = df.drop(['text'], axis=1)
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 159292 entries, 0 to 159291
Data columns (total 2 columns):
 #   Column     Non-Null Count   Dtype 
---  ------     --------------   ----- 
 0   toxic      159292 non-null  int64 
 1   lemm_text  159292 non-null  object
dtypes: int64(1), object(1)
memory usage: 2.4+ MB


Split the data into training, validation, and test sets in a 60/20/20 ratio.

In [None]:
target = df['toxic']
features = df.drop(['toxic'], axis=1)

features_train, features_valid, target_train, target_valid = train_test_split(features,
                                                                              target,
                                                                              test_size=0.4,
                                                                              random_state=12345)
features_valid, features_test, target_valid, target_test = train_test_split(features_valid,
                                                                            target_valid,
                                                                            test_size=0.5,
                                                                            random_state=12345)

In [None]:
print(features_train.shape)
print(features_valid.shape)
print(features_test.shape)

(95575, 1)
(31858, 1)
(31859, 1)
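The split above is purely random. With only about 10% toxic comments, the class share can drift slightly between subsets; `train_test_split` accepts a `stratify` argument for this case. The sketch below shows the idea of stratification in plain Python on toy labels; `stratified_split` is an illustrative helper, not what the notebook uses.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_size, seed):
    """Split indices into (train, test) keeping each class's share roughly equal."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    train, test = [], []
    for indices in by_class.values():
        # Shuffle within the class, then cut off the same fraction from each class
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return sorted(train), sorted(test)

# Toy labels with roughly the same 9:1 imbalance as the dataset
labels = [0] * 90 + [1] * 10
train_idx, test_idx = stratified_split(labels, test_size=0.4, seed=12345)
print(len(train_idx), len(test_idx))
print(sum(labels[i] for i in test_idx))
```

Both parts end up with 10% positive labels, matching the full toy set.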


##### Stopwords, TF-IDF

In [None]:
nltk.download('stopwords')
# Pass a list: recent scikit-learn versions reject a set for stop_words
stopwords = list(nltk_stopwords.words('english'))

count_tf_idf = TfidfVectorizer(stop_words=stopwords)

features_train = count_tf_idf.fit_transform(features_train['lemm_text'])
features_valid = count_tf_idf.transform(features_valid['lemm_text'])
features_test = count_tf_idf.transform(features_test['lemm_text'])

[nltk_data] Downloading package stopwords to /home/jovyan/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
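To make the TF-IDF weights less of a black box, the sketch below recomputes them by hand for a toy corpus, assuming the `TfidfVectorizer` defaults: raw term counts, smoothed IDF `ln((1 + n) / (1 + df)) + 1`, and L2 normalisation. The three documents are made up, and stop-word removal is left out to keep the example short.

```python
import math

docs = [
    "you be my hero",
    "you be wrong",
    "hero edit war",
]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

def idf(term):
    # Smoothed IDF: ln((1 + n) / (1 + df)) + 1
    df = sum(term in tokens for tokens in tokenized)
    return math.log((1 + n_docs) / (1 + df)) + 1

def tfidf_vector(tokens):
    # Raw term counts times IDF, then L2 normalisation
    weights = {t: tokens.count(t) * idf(t) for t in set(tokens)}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()}

vec = tfidf_vector(tokenized[1])
print({t: round(w, 3) for t, w in sorted(vec.items())})
```

In the second document, `wrong` appears in only one document and therefore gets a higher weight than `you` or `be`, which each occur in two.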


## Training

We will train the following models:

    LogisticRegression
    DecisionTreeClassifier

### LogisticRegression

In [None]:
cv_counts = 3

In [None]:
%%time

classificator = LogisticRegression()
hyperparams = [{'solver':['newton-cg', 'lbfgs', 'liblinear'],
                'C':[0.1, 1, 10],
                'class_weight':[dict_classes]}]


print('# Tuning hyper-parameters')
print()
clf = GridSearchCV(classificator, hyperparams, scoring='f1',cv=cv_counts)
clf.fit(features_train, target_train)
print("Best parameters:")
print()
LR_best_params = clf.best_params_
print(LR_best_params)
print()
print("Grid scores:")
print()
means = clf.cv_results_['mean_test_score']
stds = clf.cv_results_['std_test_score']
for mean, std, params in zip(means, stds, clf.cv_results_['params']):
    print("%0.6f for %r"% (mean, params))
print()

cv_f1_LR = max(means)

# Tuning hyper-parameters

Best parameters:

{'C': 10, 'class_weight': {0: 1, 1: 8.841344371679229}, 'solver': 'lbfgs'}

Grid scores:

0.707272 for {'C': 0.1, 'class_weight': {0: 1, 1: 8.841344371679229}, 'solver': 'newton-cg'}
0.707304 for {'C': 0.1, 'class_weight': {0: 1, 1: 8.841344371679229}, 'solver': 'lbfgs'}
0.707080 for {'C': 0.1, 'class_weight': {0: 1, 1: 8.841344371679229}, 'solver': 'liblinear'}
0.746550 for {'C': 1, 'class_weight': {0: 1, 1: 8.841344371679229}, 'solver': 'newton-cg'}
0.746505 for {'C': 1, 'class_weight': {0: 1, 1: 8.841344371679229}, 'solver': 'lbfgs'}
0.746445 for {'C': 1, 'class_weight': {0: 1, 1: 8.841344371679229}, 'solver': 'liblinear'}
0.751718 for {'C': 10, 'class_weight': {0: 1, 1: 8.841344371679229}, 'solver': 'newton-cg'}
0.752547 for {'C': 10, 'class_weight': {0: 1, 1: 8.841344371679229}, 'solver': 'lbfgs'}
0.751755 for {'C': 10, 'class_weight': {0: 1, 1: 8.841344371679229}, 'solver': 'liblinear'}

CPU times: user 4min 34s, sys: 5min 54s, total: 

In [None]:
%%time

classificator = LogisticRegression()
classificator.set_params(**LR_best_params)
classificator.fit(features_train, target_train)
target_predict = classificator.predict(features_valid)
valid_f1_LR = f1_score(target_valid, target_predict)
print('CrossVal F1', cv_f1_LR)
print('Validation F1', valid_f1_LR)

CrossVal F1 0.7525472577805025
Validation F1 0.7574939622105413
CPU times: user 19 s, sys: 23.3 s, total: 42.3 s
Wall time: 42.4 s
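One caveat about the grid search above: the TF-IDF vectorizer was fitted on the whole training set before cross-validation, so each held-out fold has already influenced the vocabulary and IDF weights. A possible alternative, not used in this notebook, is to put the vectorizer and the classifier into a `Pipeline` so that the vectorizer is refitted inside every fold. The sketch below runs on a made-up toy corpus; the step names `tfidf` and `clf` and the small grid are arbitrary choices.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Toy corpus standing in for the lemmatized comments (hypothetical data)
texts = ["you be my hero", "thank you for the edit", "nice work on the article",
         "please see the talk page", "you be an idiot", "stupid idiot go away",
         "shut up you moron", "what a stupid moron"]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

pipeline = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('clf', LogisticRegression(class_weight='balanced', max_iter=1000)),
])
# Parameters of a pipeline step are addressed as <step name>__<parameter>
param_grid = {'clf__C': [1, 10]}

search = GridSearchCV(pipeline, param_grid, scoring='f1', cv=2)
search.fit(texts, labels)
print(search.best_params_)
```

The fitted `search` object then accepts raw text directly, for example `search.predict(["some comment"])`.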


### DecisionTreeClassifier

In [None]:
%%time

classificator = DecisionTreeClassifier()
hyperparams = [{'max_depth':[x for x in range(10,50,5)],
                'random_state':[12345],
                'class_weight':[dict_classes]}]


print('# Tuning hyper-parameters')
print()
clf = GridSearchCV(classificator, hyperparams, scoring='f1',cv=cv_counts)
clf.fit(features_train, target_train)
print("Best parameters:")
print()
DTC_best_params = clf.best_params_
print(DTC_best_params)
print()
print("Grid scores:")
print()
means = clf.cv_results_['mean_test_score']
stds = clf.cv_results_['std_test_score']
for mean, std, params in zip(means, stds, clf.cv_results_['params']):
    print("%0.6f for %r"% (mean, params))
print()

cv_f1_DTC = max(means)

# Tuning hyper-parameters

Best parameters:

{'class_weight': {0: 1, 1: 8.841344371679229}, 'max_depth': 45, 'random_state': 12345}

Grid scores:

0.583522 for {'class_weight': {0: 1, 1: 8.841344371679229}, 'max_depth': 10, 'random_state': 12345}
0.608500 for {'class_weight': {0: 1, 1: 8.841344371679229}, 'max_depth': 15, 'random_state': 12345}
0.609225 for {'class_weight': {0: 1, 1: 8.841344371679229}, 'max_depth': 20, 'random_state': 12345}
0.620805 for {'class_weight': {0: 1, 1: 8.841344371679229}, 'max_depth': 25, 'random_state': 12345}
0.616930 for {'class_weight': {0: 1, 1: 8.841344371679229}, 'max_depth': 30, 'random_state': 12345}
0.616370 for {'class_weight': {0: 1, 1: 8.841344371679229}, 'max_depth': 35, 'random_state': 12345}
0.612446 for {'class_weight': {0: 1, 1: 8.841344371679229}, 'max_depth': 40, 'random_state': 12345}
0.620953 for {'class_weight': {0: 1, 1: 8.841344371679229}, 'max_depth': 45, 'random_state': 12345}

CPU times: user 5min 16s, sys: 3.31 s, total: 5min 2

In [None]:
%%time

classificator = DecisionTreeClassifier()
classificator.set_params(**DTC_best_params)
classificator.fit(features_train, target_train)
target_predict = classificator.predict(features_valid)
valid_f1_DTC = f1_score(target_valid, target_predict)
print('CrossVal F1', cv_f1_DTC)
print('Validation F1', valid_f1_DTC)

CrossVal F1 0.620952517141258
Validation F1 0.6226754649070186
CPU times: user 25.3 s, sys: 215 ms, total: 25.5 s
Wall time: 25.6 s


### Testing the model

In [None]:
classificator = LogisticRegression()
classificator.set_params(**LR_best_params)
classificator.fit(features_train, target_train)
predict_test = classificator.predict(features_test)

print('Test F1:', f1_score(target_test, predict_test))


Test F1: 0.7435672514619882
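The test score lands just under the 0.75 target. One possible follow-up, not carried out in this notebook, is to tune the decision threshold on the validation set instead of using the default 0.5 applied by `predict`. The sketch below shows the idea in plain Python; the probabilities and labels are made-up values standing in for `predict_proba` output on the validation set, and the helper names are illustrative.

```python
def f1_at_threshold(probabilities, labels, threshold):
    """F1 for the positive class when predicting 1 for probability >= threshold."""
    predictions = [int(p >= threshold) for p in probabilities]
    tp = sum(1 for y, y_hat in zip(labels, predictions) if y == 1 and y_hat == 1)
    fp = sum(1 for y, y_hat in zip(labels, predictions) if y == 0 and y_hat == 1)
    fn = sum(1 for y, y_hat in zip(labels, predictions) if y == 1 and y_hat == 0)
    if tp == 0:
        return 0.0
    return 2 * tp / (2 * tp + fp + fn)

def best_threshold(probabilities, labels, candidates):
    # Pick the candidate threshold with the highest F1
    return max(candidates, key=lambda t: f1_at_threshold(probabilities, labels, t))

# Hypothetical validation-set probabilities for the toxic class and true labels
probs = [0.05, 0.20, 0.35, 0.55, 0.60, 0.65, 0.80, 0.95]
labels = [0, 0, 0, 0, 1, 0, 1, 1]
candidates = [i / 20 for i in range(1, 20)]

threshold = best_threshold(probs, labels, candidates)
print(threshold, round(f1_at_threshold(probs, labels, threshold), 4))
```

The threshold should be chosen on the validation set only; the test set is then scored once with the selected value.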


## Conclusions

The goal was to train a model that classifies comments as positive or negative, with an *F1* score of at least 0.75.

We had a dataset labeled for the toxicity of edits. The *text* column contains the comment text, and *toxic* is the target.

During preprocessing, the texts were cleaned of non-letter characters and lemmatized, stop words were removed, word importance was measured with TF-IDF, and class imbalance was accounted for through class weights.

We trained LogisticRegression and DecisionTreeClassifier. Logistic regression performed best; the decision tree reached a validation F1 of only about 0.62.

CrossVal F1 0.7525472577805025

Validation F1 0.7574939622105413

Test F1 0.7435672514619882

The validation score meets the 0.75 target, while the test score falls slightly short of it.