# Project for «Викишоп»

The online store «Викишоп» is launching a new service. Users can now edit and extend product descriptions, as in wiki communities: customers propose their own edits and comment on changes made by others. The store needs a tool that finds toxic comments and sends them for moderation.


The task is to train a model that classifies comments as positive or negative (toxic). I have a dataset in which edits are labelled for toxicity.

**Project plan**

1. Load and prepare the data.
2. Train several models.
3. Draw conclusions.


**Data description**

The data is stored in the file `toxic_comments.csv`. The *text* column contains the comment text, and *toxic* is the target feature.

## Preparation

In [1]:
# Standard library
import re

# Data handling and NLP
import numpy as np
import pandas as pd
import nltk
from nltk.corpus import stopwords as nltk_stopwords
from pymystem3 import Mystem
from tqdm import notebook
import torch
import transformers

# scikit-learn
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

# Shared lemmatizer instance
m = Mystem()

***Importing the libraries***

In [2]:
data = pd.read_csv("/datasets/toxic_comments.csv")
data

Unnamed: 0,text,toxic
0,Explanation\nWhy the edits made under my usern...,0
1,D'aww! He matches this background colour I'm s...,0
2,"Hey man, I'm really not trying to edit war. It...",0
3,"""\nMore\nI can't make any real suggestions on ...",0
4,"You, sir, are my hero. Any chance you remember...",0
...,...,...
159566,""":::::And for the second time of asking, when ...",0
159567,You should be ashamed of yourself \n\nThat is ...,0
159568,"Spitzer \n\nUmm, theres no actual article for ...",0
159569,And it looks like it was actually you who put ...,0


In [3]:
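# Draw 100 toxic and 100 non-toxic comments to get a small balanced sample.
# The draw is unseeded, so the selected rows differ between runs.
data_t = data.loc[data['toxic'] == 1]
data_t = data_t.sample(100).reset_index(drop=True)

data_g = data.loc[data['toxic'] == 0]
data_g = data_g.sample(100).reset_index(drop=True)
df_tweets = pd.concat([data_g, data_t], ignore_index=True)
df_tweets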

Unnamed: 0,text,toxic
0,"""\n\nIn your theoretical situation, a single r...",0
1,"Bear, I hope you are not angry with me, I was ...",0
2,""" A second opinion will brought in to see if t...",0
3,Needs sources\nThis article has a lot of text ...,0
4,"If you will check the image out, you will see ...",0
...,...,...
195,message on daedalus' page \n\nDaedalus is noth...,1
196,]\nIT'S HER EXACT QUOTE. WHAT CAN'T YOU UNDER...,1
197,important \n\nYou Suck! I think you should go ...,1
198,last warning to you \n\nHEY LOSER\n\nDO WHATEV...,1
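
The balanced draw above can be made repeatable by fixing a seed. In pandas this would normally mean passing `random_state` to `DataFrame.sample`. The sketch below shows the same idea with the standard library only. The helper name `balanced_sample` and the toy rows are hypothetical and are not part of the notebook.

```python
import random


def balanced_sample(rows, label_key, n_per_class, seed=42):
    """Draw n_per_class rows from every class using a fixed seed."""
    rng = random.Random(seed)
    by_class = {}
    for row in rows:
        by_class.setdefault(row[label_key], []).append(row)
    sample = []
    for label in sorted(by_class):
        # rng.sample draws without replacement, like DataFrame.sample by default
        sample.extend(rng.sample(by_class[label], n_per_class))
    return sample


# Toy data (assumption): 1000 comments, every tenth one is toxic.
rows = [{"text": f"comment {i}", "toxic": int(i % 10 == 0)} for i in range(1000)]
subset = balanced_sample(rows, "toxic", n_per_class=50)
print(len(subset), sum(r["toxic"] for r in subset))
```

Calling `balanced_sample` again with the same seed returns the same rows, which makes later metrics comparable between runs.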


In [4]:
df_tweets.isna().sum()

text     0
toxic    0
dtype: int64

***Checking for missing values***

In [5]:
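def lemmatize(text):
    # Reuse the module-level Mystem instance `m` instead of creating one per call.
    # Mystem returns tokens together with their separators, so they are joined as is.
    lemm_list = m.lemmatize(text)
    return "".join(lemm_list)


def clear_text(text):
    # Keep only Latin letters and spaces, then collapse repeated whitespace.
    text = re.sub(r'[^a-zA-Z ]', ' ', text).split()
    return " ".join(text)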
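
To make the effect of `clear_text` concrete, here is a small standalone check on a made-up comment (the sample string is an assumption, not a row from the dataset). It shows that punctuation, digits and line breaks are all replaced by spaces, so contractions such as "You're" are split in two.

```python
import re


def clear_text(text):
    # Keep only Latin letters and spaces, then collapse repeated whitespace.
    return " ".join(re.sub(r"[^a-zA-Z ]", " ", text).split())


sample = "You're WRONG!!! See rule #3,\nplease."
cleaned = clear_text(sample)
print(cleaned)  # You re WRONG See rule please
```

Case is preserved at this stage. Lowercasing happens later inside `TfidfVectorizer`, which uses `lowercase=True` by default.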

In [6]:
corpus = df_tweets['text'].values.astype('U')
a = []  # cleaned and lemmatized comments
for i in notebook.tqdm(range(len(df_tweets))):
    a.append(lemmatize(clear_text(corpus[i])))


In [7]:
df_tweets['lemm_text'] = a
corpus = df_tweets['lemm_text'].values.astype('U')
df_tweets

Unnamed: 0,text,toxic,lemm_text
0,"""\n\nIn your theoretical situation, a single r...",0,In your theoretical situation a single revert ...
1,"Bear, I hope you are not angry with me, I was ...",0,Bear I hope you are not angry with me I was tr...
2,""" A second opinion will brought in to see if t...",0,A second opinion will brought in to see if thi...
3,Needs sources\nThis article has a lot of text ...,0,Needs sources This article has a lot of text t...
4,"If you will check the image out, you will see ...",0,If you will check the image out you will see t...
...,...,...,...
195,message on daedalus' page \n\nDaedalus is noth...,1,message on daedalus page Daedalus is nothing b...
196,]\nIT'S HER EXACT QUOTE. WHAT CAN'T YOU UNDER...,1,IT S HER EXACT QUOTE WHAT CAN T YOU UNDERSTAND...
197,important \n\nYou Suck! I think you should go ...,1,important You Suck I think you should go suck ...
198,last warning to you \n\nHEY LOSER\n\nDO WHATEV...,1,last warning to you HEY LOSER DO WHATEVER YOU ...


In [8]:
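nltk.download('stopwords')
# Pass the stop words as a list; newer scikit-learn releases reject a set here.
stopwords = list(nltk_stopwords.words('english'))

# Vectorize the whole sample once to check the vocabulary size.
count_tf_idf = TfidfVectorizer(stop_words=stopwords)
tf_idf = count_tf_idf.fit_transform(corpus)

print("Matrix shape:", tf_idf.shape)

Matrix shape: (200, 2371)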


[nltk_data] Downloading package stopwords to /home/jovyan/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


***Loading and preparing the data***
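
The weighting that `TfidfVectorizer` applies can be sketched in plain Python. The sketch assumes the scikit-learn defaults: smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalisation of each row. The function name `tfidf`, the whitespace tokenizer and the three toy documents are simplifications introduced here. The real vectorizer also removes stop words and tokens shorter than two characters.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return (vocabulary, rows) with smoothed IDF and L2-normalised rows."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(docs)
    # Document frequency: in how many documents each token occurs.
    df = Counter(tok for toks in tokenized for tok in set(toks))
    idf = {tok: math.log((1 + n_docs) / (1 + df[tok])) + 1 for tok in vocab}
    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        raw = [counts[tok] * idf[tok] for tok in vocab]
        norm = math.sqrt(sum(v * v for v in raw)) or 1.0
        rows.append([v / norm for v in raw])
    return vocab, rows


docs = ["you are my hero", "you are a loser", "needs sources"]
vocab, rows = tfidf(docs)
print(len(docs), len(vocab))  # matrix shape: documents x vocabulary size
```

Tokens that occur in fewer documents get a larger weight: in the first toy document "hero" outweighs "you", which also appears in the second one.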

## Training

In [9]:
# Hold out 25% of the sample for testing.
train, test = train_test_split(df_tweets, test_size=0.25, shuffle=True, random_state=42)

# Only the lemmatized text is used as a feature.
x_train = train.drop(columns=['toxic', 'text'])
y_train = train['toxic']

x_test = test.drop(columns=['toxic', 'text'])
y_test = test['toxic']

# TF-IDF is fitted inside each pipeline, so only on the training part.
text_transformer = ColumnTransformer(
    remainder='passthrough',
    transformers=[('vectorizer', TfidfVectorizer(stop_words=stopwords), 'lemm_text')],
)

In [10]:
y_train

114    1
173    1
5      0
126    1
117    1
      ..
106    1
14     0
92     0
179    1
102    1
Name: toxic, Length: 150, dtype: int64

***Splitting the lemmatized data into training and test sets***
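
With only 200 comments, a plain shuffled split can leave the two classes in slightly different proportions in the training and test parts. One common remedy is a stratified split, which `train_test_split` supports through its `stratify` argument. The sketch below illustrates the idea with the standard library. The helper `stratified_split` and the toy labels are hypothetical.

```python
import random


def stratified_split(labels, test_size=0.25, seed=42):
    """Return (train_idx, test_idx) that keep class proportions in both parts."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    train_idx, test_idx = [], []
    for label in sorted(by_class):
        indices = by_class[label][:]
        rng.shuffle(indices)
        # Take the same share of every class for the test part.
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Toy labels (assumption): 100 non-toxic and 100 toxic comments.
labels = [0] * 100 + [1] * 100
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx), sum(labels[i] for i in test_idx))
```

Here each class contributes exactly a quarter of its rows to the test part, so both parts stay balanced.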

In [11]:
%%time
model1 = Pipeline(steps=[
    ('text_transformer', text_transformer),
    ('model', LogisticRegression(C=5, random_state=12345))
])
 
model1.fit(x_train, y_train)

CPU times: user 27.1 ms, sys: 0 ns, total: 27.1 ms
Wall time: 28.3 ms




Pipeline(memory=None,
         steps=[('text_transformer',
                 ColumnTransformer(n_jobs=None, remainder='passthrough',
                                   sparse_threshold=0.3,
                                   transformer_weights=None,
                                   transformers=[('vectorizer',
                                                  TfidfVectorizer(analyzer='word',
                                                                  binary=False,
                                                                  decode_error='strict',
                                                                  dtype=<class 'numpy.float64'>,
                                                                  encoding='utf-8',
                                                                  input='content',
                                                                  lowercase=True,
                                                                  max_df=1.0,
          

In [12]:
%%time
model2 = Pipeline(steps=[
    ('text_transformer', text_transformer),
    ('model', DecisionTreeClassifier(random_state=12345))
])
 
model2.fit(x_train, y_train)

CPU times: user 40.9 ms, sys: 506 µs, total: 41.4 ms
Wall time: 42.7 ms


Pipeline(memory=None,
         steps=[('text_transformer',
                 ColumnTransformer(n_jobs=None, remainder='passthrough',
                                   sparse_threshold=0.3,
                                   transformer_weights=None,
                                   transformers=[('vectorizer',
                                                  TfidfVectorizer(analyzer='word',
                                                                  binary=False,
                                                                  decode_error='strict',
                                                                  dtype=<class 'numpy.float64'>,
                                                                  encoding='utf-8',
                                                                  input='content',
                                                                  lowercase=True,
                                                                  max_df=1.0,
          

In [13]:
%%time
model3 = Pipeline(steps=[
    ('text_transformer', text_transformer),
    ('model', RandomForestClassifier(random_state=12345))
])
 
model3.fit(x_train, y_train)




CPU times: user 46.7 ms, sys: 5.73 ms, total: 52.5 ms
Wall time: 75.3 ms


Pipeline(memory=None,
         steps=[('text_transformer',
                 ColumnTransformer(n_jobs=None, remainder='passthrough',
                                   sparse_threshold=0.3,
                                   transformer_weights=None,
                                   transformers=[('vectorizer',
                                                  TfidfVectorizer(analyzer='word',
                                                                  binary=False,
                                                                  decode_error='strict',
                                                                  dtype=<class 'numpy.float64'>,
                                                                  encoding='utf-8',
                                                                  input='content',
                                                                  lowercase=True,
                                                                  max_df=1.0,
          

***Training the models***

## Conclusions

In [14]:
# F1 on the training set (f1_score expects y_true first, then y_pred)
predicted_train = model1.predict(x_train)
print(f1_score(y_train, predicted_train))

# F1 on the test set
predicted_test = model1.predict(x_test)
print(f1_score(y_test, predicted_test))

1.0
0.7307692307692308


In [15]:
%%time
# F1 on the training set (f1_score expects y_true first, then y_pred)
predicted_train = model2.predict(x_train)
print(f1_score(y_train, predicted_train))

# F1 on the test set
predicted_test = model2.predict(x_test)
print(f1_score(y_test, predicted_test))

1.0
0.7586206896551724
CPU times: user 27 ms, sys: 6.06 ms, total: 33 ms
Wall time: 31.3 ms


In [16]:
%%time
# F1 on the training set (f1_score expects y_true first, then y_pred)
predicted_train = model3.predict(x_train)
print(f1_score(y_train, predicted_train))

# F1 on the test set
predicted_test = model3.predict(x_test)
print(f1_score(y_test, predicted_test))

1.0
0.7272727272727273
CPU times: user 17.5 ms, sys: 9.04 ms, total: 26.5 ms
Wall time: 44.4 ms


***Testing the models***
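
For reference, the binary F1 score reported in the cells above is the harmonic mean of precision and recall, which simplifies to `2*TP / (2*TP + FP + FN)`. The sketch below computes it by hand on made-up labels. `f1_binary` is a hypothetical helper, not the scikit-learn implementation.

```python
def f1_binary(y_true, y_pred, positive=1):
    """Binary F1 as 2*TP / (2*TP + FP + FN)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0
    return 2 * tp / (2 * tp + fp + fn)


# Toy labels (assumption): 3 true positives, 2 false positives, 1 false negative.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 1, 1, 0]
score = f1_binary(y_true, y_pred)
print(round(score, 3))  # 2*3 / (2*3 + 2 + 1) = 6/9
```

Swapping the two arguments exchanges false positives with false negatives, so the binary F1 value itself does not change. Precision and recall taken separately would be swapped.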

| Parameter / Model | LogisticRegression | DecisionTreeClassifier | RandomForestClassifier |
| :------------- | :-------------: | :-------------: | :-------------: |
| Training time | 49.8 ms | 66 ms | 71.4 ms |
| Prediction time | 34.7 ms | 39.3 ms | 65.7 ms |
| F1 on the training set | 1.0 | 1.0 | 0.99 |
| F1 on the test set | 0.78 | 0.61 | 0.69 |

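We can conclude that LogisticRegression suits the customer best: in the table above it has the shortest training and prediction times and the highest F1 on the test set (0.78). The comparison rests on a sample of only 200 comments drawn without a fixed seed, so the figures vary between runs (the cell outputs above show test F1 of 0.73, 0.76 and 0.73). All three models score about 1.0 on the training set, which points to overfitting.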