<a id="up"></a>

## Project for Wikishop

### Project description

The online store Wikishop is launching a new service. Users can now edit and extend product descriptions, as in wiki communities: customers propose their own edits and comment on changes made by others. The store needs a tool that finds toxic comments and sends them for moderation.

Train a model to classify comments as positive or negative. You have a dataset labeled for the toxicity of edits.

Build a model with an F1 score of at least 0.75.

### Research tasks

### [1. Load and prepare the data.](#1)

### [2. Train different models.](#2)

### [3. Draw conclusions.](#3)

### Data description

`Features`
- text: the comment text

`Target`
- toxic: toxicity label for the comment (1 for toxic, 0 otherwise)

<a id="1"></a>

### 1. Load the file, review general information, and prepare the data.

In [1]:
import pandas as pd
import numpy as np
import nltk
import re
import torch
import transformers


from pymystem3 import Mystem
from nltk.corpus import stopwords
from tqdm import notebook

from sklearn.feature_extraction.text import CountVectorizer 
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.stem import WordNetLemmatizer 
from transformers import BertTokenizer

from sklearn.linear_model import LogisticRegression
from catboost import CatBoostClassifier
from lightgbm import LGBMClassifier

from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV

from sklearn.metrics import f1_score,make_scorer
from sklearn.metrics import confusion_matrix


import warnings
warnings.filterwarnings("ignore")

nltk.download('wordnet')
nltk.download('stopwords')
nltk.download('punkt')

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Домашний\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Домашний\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Домашний\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [2]:
try:
    df = pd.read_csv('toxic_comments.csv')
except Exception as e:
    print(e)
    df = pd.read_csv('/datasets/toxic_comments.csv')

df.head()

Unnamed: 0,text,toxic
0,Explanation\nWhy the edits made under my usern...,0
1,D'aww! He matches this background colour I'm s...,0
2,"Hey man, I'm really not trying to edit war. It...",0
3,"""\nMore\nI can't make any real suggestions on ...",0
4,"You, sir, are my hero. Any chance you remember...",0


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 159571 entries, 0 to 159570
Data columns (total 2 columns):
 #   Column  Non-Null Count   Dtype 
---  ------  --------------   ----- 
 0   text    159571 non-null  object
 1   toxic   159571 non-null  int64 
dtypes: int64(1), object(1)
memory usage: 2.4+ MB


In [4]:
display(df['toxic'].value_counts())

0    143346
1     16225
Name: toxic, dtype: int64
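The counts above show a strong class imbalance, which is why the task is scored with F1 rather than accuracy. The sketch below is an illustration added here, not part of the original analysis. It uses only the two counts from the output above and compares two trivial baselines.

```python
# Class counts taken from the value_counts() output above.
n_clean, n_toxic = 143346, 16225
n_total = n_clean + n_toxic

toxic_share = n_toxic / n_total

# Baseline A: label every comment as non-toxic (class 0).
accuracy_all_clean = n_clean / n_total
f1_all_clean = 0.0  # no true positives, so F1 for the toxic class is 0

# Baseline B: label every comment as toxic (class 1).
precision_all_toxic = toxic_share  # TP / (TP + FP)
recall_all_toxic = 1.0             # every toxic comment is caught
f1_all_toxic = (2 * precision_all_toxic * recall_all_toxic
                / (precision_all_toxic + recall_all_toxic))

print(f"toxic share:             {toxic_share:.3f}")
print(f"accuracy, all non-toxic: {accuracy_all_clean:.3f}")
print(f"F1, all non-toxic:       {f1_all_clean:.3f}")
print(f"F1, all toxic:           {f1_all_toxic:.3f}")
```

A model that never flags anything is right about 90% of the time yet has an F1 of zero for the toxic class, so accuracy alone would hide a useless classifier.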

In [5]:
corpus = list(df['text'])

In [6]:
# lemmatization and text-cleaning functions; extra whitespace is also removed
m = WordNetLemmatizer()

def lemmatize(text):
    word_list = nltk.word_tokenize(text)
    
    return ' '.join([m.lemmatize(w) for w in word_list])

def clear_text(text):
    text = re.sub(r'[^a-zA-Z ]', ' ', text)
    text_clear = ' '.join(text.split())
    return text_clear
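To make the effect of `clear_text` concrete, the sketch below applies the same regular expression to one sample string. The sample is hypothetical, modeled on the second row shown by `df.head()`, and the function is repeated so that the snippet runs on its own.

```python
import re


def clear_text(text):
    # Replace every character that is not a Latin letter or a space,
    # then collapse runs of whitespace into single spaces.
    text = re.sub(r'[^a-zA-Z ]', ' ', text)
    return ' '.join(text.split())


# Hypothetical sample with an apostrophe, punctuation and a line break.
sample = "D'aww! He matches\nthis background colour, I'm seemingly stuck with. Thanks."
cleaned = clear_text(sample)
print(cleaned)
```

Apostrophes are replaced by spaces, so contractions such as "I'm" become "I m". This matches the fragments visible in the lemmatized table further down.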

In [7]:
df['text'] = df['text'].apply(clear_text)

In [8]:
df['lemm_text'] = df['text'].apply(lemmatize)

In [9]:
df = df.drop(['text'], axis=1)

In [10]:
df

Unnamed: 0,toxic,lemm_text
0,0,Explanation Why the edits made under my userna...
1,0,D aww He match this background colour I m seem...
2,0,Hey man I m really not trying to edit war It s...
3,0,More I can t make any real suggestion on impro...
4,0,You sir are my hero Any chance you remember wh...
...,...,...
159566,0,And for the second time of asking when your vi...
159567,0,You should be ashamed of yourself That is a ho...
159568,0,Spitzer Umm there no actual article for prosti...
159569,0,And it look like it wa actually you who put on...


In [11]:
corpus = df['lemm_text'].values #.astype('U') # corpus of lemmatized and cleaned texts

In [12]:
features = corpus # split into training and test sets
target = df['toxic'].values

train_features, test_features, train_target, test_target = train_test_split(features, target, test_size=0.2, random_state=42)
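The split above does not stratify by the target. With roughly one toxic comment in ten, passing `stratify=` to `train_test_split` keeps the class ratio the same in both parts. This is a possible variant, not what the notebook ran. The helper name `stratified_split` and the synthetic data are assumptions made for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split


def stratified_split(texts, labels, test_size=0.2, seed=42):
    # stratify=labels preserves the share of each class in both parts.
    return train_test_split(
        texts, labels, test_size=test_size, random_state=seed, stratify=labels
    )


# Synthetic stand-in for the corpus: 90 non-toxic and 10 toxic comments.
labels = np.array([0] * 90 + [1] * 10)
texts = np.array([f"comment {i}" for i in range(100)])

X_train, X_test, y_train, y_test = stratified_split(texts, labels)
print("toxic share in train:", y_train.mean())
print("toxic share in test: ", y_test.mean())
```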

In [13]:
stop_words = set(stopwords.words('english')) 

In [14]:
count_tf_idf = TfidfVectorizer(stop_words=list(stop_words)) # TF-IDF for the corpus; recent scikit-learn versions expect stop_words as a list
train_features = count_tf_idf.fit_transform(train_features) 
test_features = count_tf_idf.transform(test_features)

#### Conclusion:
- loaded the data and found that the classes are imbalanced
- wrote lemmatization and text-cleaning functions, and removed extra whitespace
- obtained stop words and built TF-IDF features for the corpus
- split the data into the training and test sets needed to build the models

<a id="2"></a>

### 2. Model training.

In [54]:
c_space = (1,3)

In [55]:
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfTransformer

In [56]:
pipe_lr = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    # liblinear supports both the 'l1' and 'l2' penalties searched below
    ('clf', LogisticRegression(solver='liblinear'))
])
param_grid = {"clf__C": c_space,"clf__penalty": ['l1', 'l2']}

In [None]:
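# Not executed: the pipeline expects raw text, but train_features
# already holds the TF-IDF matrix built in cell 14.
clf = GridSearchCV(pipe_lr,param_grid=param_grid,cv=3, scoring='f1')
clf.fit(train_features, train_target)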
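A pipeline that contains its own vectorizer is meant to be fit on raw strings, so that each cross-validation fold learns its vocabulary from its own training part. The sketch below shows that pattern on a tiny hypothetical corpus. The texts, the labels and the choice of `TfidfVectorizer` in place of `CountVectorizer` plus `TfidfTransformer` are assumptions for illustration, not the notebook's data or final setup.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus standing in for the lemmatized comments.
texts = [
    "you are an idiot", "thank you for the edit", "stupid idiot go away",
    "nice article thanks", "what a stupid edit", "great work on the page",
    "idiot stupid vandal", "thanks for the helpful source",
    "go away stupid troll", "the article looks great",
    "you stupid vandal", "helpful and nice edit",
]
labels = [1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0]

pipe = Pipeline([
    ("tfidf", TfidfVectorizer()),
    # liblinear handles both penalties in the grid below.
    ("clf", LogisticRegression(solver="liblinear", class_weight="balanced")),
])
param_grid = {"clf__C": [1, 3], "clf__penalty": ["l1", "l2"]}

# The vectorizer is refit inside every fold, on raw strings.
search = GridSearchCV(pipe, param_grid=param_grid, cv=3, scoring="f1")
search.fit(texts, labels)
print(search.best_params_)
```

Because vectorization happens inside the pipeline, the validation part of each fold never contributes to the vocabulary or the IDF weights.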

##### LogisticRegression

In [42]:
%%time

lr_model = LogisticRegression(solver='liblinear',random_state=12345)

hyperparams = [{'C':np.logspace(-3,3,20),   
                'class_weight':['balanced']}]

clf = GridSearchCV(lr_model, hyperparams, scoring='f1',cv=3)
clf.fit(train_features, train_target)
LR_best_params = clf.best_params_
print("Best model parameters:", LR_best_params)
print('F1:', clf.best_score_.round(2))

Best model parameters: {'C': 6.158482110660261, 'class_weight': 'balanced'}
F1: 0.76
Wall time: 3min 5s


##### LGBMClassifier

In [58]:
%%time

lgbm_classifier = LGBMClassifier(random_state=12345, class_weight='balanced') 

hyperparams = [{'max_depth' : [10], 
                'learning_rate':[0.001], 
                'n_estimators' : [50]}]  
                            
 
clf = GridSearchCV(lgbm_classifier, hyperparams, scoring='f1')
clf.fit(train_features, train_target) 
LGBM_best_params = clf.best_params_
print('Best model parameters:', LGBM_best_params) 
print('F1', clf.best_score_.round(2))

Best model parameters: {'learning_rate': 0.001, 'max_depth': 10, 'n_estimators': 50}
F1 0.56
Wall time: 1min 38s


##### CatBoostClassifier

In [17]:
%%time
cat_model=CatBoostClassifier(loss_function="Logloss", iterations=40)

Wall time: 71 ms


In [18]:
cat_model.fit(train_features,train_target)

Learning rate set to 0.5
0:	learn: 0.3474917	total: 2.98s	remaining: 1m 56s
1:	learn: 0.2621497	total: 5.19s	remaining: 1m 38s
2:	learn: 0.2360504	total: 7.36s	remaining: 1m 30s
3:	learn: 0.2207688	total: 9.65s	remaining: 1m 26s
4:	learn: 0.2125140	total: 11.8s	remaining: 1m 22s
5:	learn: 0.2058840	total: 14s	remaining: 1m 19s
6:	learn: 0.2005402	total: 16.3s	remaining: 1m 16s
7:	learn: 0.1964914	total: 18.5s	remaining: 1m 14s
8:	learn: 0.1924651	total: 20.8s	remaining: 1m 11s
9:	learn: 0.1891771	total: 23s	remaining: 1m 8s
10:	learn: 0.1854460	total: 25.2s	remaining: 1m 6s
11:	learn: 0.1832202	total: 27.4s	remaining: 1m 4s
12:	learn: 0.1806358	total: 29.6s	remaining: 1m 1s
13:	learn: 0.1787896	total: 31.7s	remaining: 58.9s
14:	learn: 0.1755252	total: 33.9s	remaining: 56.5s
15:	learn: 0.1733977	total: 36.1s	remaining: 54.1s
16:	learn: 0.1708961	total: 38.2s	remaining: 51.7s
17:	learn: 0.1694914	total: 40.3s	remaining: 49.3s
18:	learn: 0.1676301	total: 42.5s	remaining: 47s
19:	learn: 0.

<catboost.core.CatBoostClassifier at 0x1a292df0>

In [19]:
predict_cat=cat_model.predict(test_features)

In [20]:
f1_cat=f1_score(test_target, predict_cat)  # sklearn order: y_true first, then y_pred
print('F1:', f1_cat.round(2))

F1: 0.71


<a id="3"></a>

### 3. Conclusions.

##### LogisticRegression

In [60]:
%%time
lr_model = LogisticRegression(solver='liblinear',random_state=12345)
lr_model.set_params(**LR_best_params)
lr_model.fit(train_features, train_target)
prediction = lr_model.predict(test_features)
f1_LR = f1_score(test_target, prediction)
print('F1:', f1_LR.round(2))
print()
print('Confusion matrix')
print(confusion_matrix(test_target, prediction))
print()

F1: 0.76

Confusion matrix
[[27554  1117]
 [  545  2699]]

Wall time: 4.43 s


##### LGBMClassifier

In [61]:
%%time
LightGBM_model = LGBMClassifier(random_state=12345, class_weight='balanced')
LightGBM_model.set_params(**LGBM_best_params)
LightGBM_model.fit(train_features, train_target)
prediction = LightGBM_model.predict(test_features)
f1_LGBM = f1_score(test_target, prediction)
print('F1:', f1_LGBM.round(2))
print()
print('Confusion matrix')
print(confusion_matrix(test_target, prediction))
print()

F1: 0.58

Confusion matrix
[[28407   264]
 [ 1814  1430]]

Wall time: 19.8 s
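The two confusion matrices above explain the gap in F1. The sketch below recomputes precision, recall and F1 for the toxic class from the printed counts, assuming scikit-learn's layout `[[TN, FP], [FN, TP]]`. The helper name `prf_from_confusion` is introduced here for illustration.

```python
def prf_from_confusion(matrix):
    """Return (precision, recall, F1) for class 1 from [[TN, FP], [FN, TP]]."""
    (tn, fp), (fn, tp) = matrix
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Matrices copied from the test-set outputs above.
matrices = {
    "LogisticRegression": [[27554, 1117], [545, 2699]],
    "LGBMClassifier": [[28407, 264], [1814, 1430]],
}

scores = {name: prf_from_confusion(m) for name, m in matrices.items()}
for name, (p, r, f1) in scores.items():
    print(f"{name}: precision={p:.3f} recall={r:.3f} F1={f1:.3f}")
```

`LogisticRegression` catches about 83% of toxic comments at a precision of about 0.71. `LGBMClassifier` is more precise (about 0.84) but finds only about 44% of toxic comments, which pulls its F1 down.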


##### CatBoostClassifier

In [62]:
%%time
cat_model=CatBoostClassifier(loss_function="Logloss", iterations=40)
cat_model.fit(train_features,train_target)
predict_cat=cat_model.predict(test_features)
f1_cat=f1_score(test_target, predict_cat)  # y_true first, then y_pred
print('F1:', f1_cat.round(2))
print()
print('Confusion matrix')
print(confusion_matrix(test_target, predict_cat))
print()

Learning rate set to 0.5
0:	learn: 0.3474917	total: 2.78s	remaining: 1m 48s
1:	learn: 0.2621497	total: 5.09s	remaining: 1m 36s
2:	learn: 0.2360504	total: 7.58s	remaining: 1m 33s
3:	learn: 0.2207688	total: 10.1s	remaining: 1m 31s
4:	learn: 0.2125140	total: 12.5s	remaining: 1m 27s
5:	learn: 0.2058840	total: 14.9s	remaining: 1m 24s
6:	learn: 0.2005402	total: 17.1s	remaining: 1m 20s
7:	learn: 0.1964914	total: 19.5s	remaining: 1m 17s
8:	learn: 0.1924651	total: 21.8s	remaining: 1m 15s
9:	learn: 0.1891771	total: 24.2s	remaining: 1m 12s
10:	learn: 0.1854460	total: 26.5s	remaining: 1m 9s
11:	learn: 0.1832202	total: 28.8s	remaining: 1m 7s
12:	learn: 0.1806358	total: 31s	remaining: 1m 4s
13:	learn: 0.1787896	total: 33.3s	remaining: 1m 1s
14:	learn: 0.1755252	total: 35.5s	remaining: 59.2s
15:	learn: 0.1733977	total: 37.7s	remaining: 56.6s
16:	learn: 0.1708961	total: 40s	remaining: 54.1s
17:	learn: 0.1694914	total: 42.2s	remaining: 51.6s
18:	learn: 0.1676301	total: 44.4s	remaining: 49.1s
19:	learn:

In [63]:
results_df = pd.DataFrame(
    {'F1': [f1_LR, f1_LGBM, f1_cat]},
    index=['LogisticRegression', 'LGBMClassifier', 'CatBoostClassifier']
)
results_df

Unnamed: 0,F1
LogisticRegression,0.764589
LGBMClassifier,0.579182
CatBoostClassifier,0.708255


### Conclusion:
- only `LogisticRegression` reached the required F1 of at least 0.75, with F1=0.764589 on the test set
- `CatBoostClassifier` came closest among the other models (F1=0.708255) but stayed below the threshold; `LGBMClassifier` scored F1=0.579182

##### [back to contents](#up)