# Project for «Викишоп»

The online store «Викишоп» is launching a new service: users can now edit and extend product descriptions, as in wiki communities. Customers propose their own edits and comment on changes made by others. The store needs a tool that finds toxic comments and sends them for moderation.

The task is to train a model that classifies comments as positive or negative (non-toxic or toxic). A dataset with toxicity labels for the edits is provided.

The model must reach an F1 score of at least 0.75.
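As a reminder of what the target metric measures, F1 is the harmonic mean of precision and recall for the toxic class. The sketch below computes it from confusion counts; the function name `f1_from_counts` and the counts are hypothetical and chosen only for illustration.

```python
def f1_from_counts(tp, fp, fn):
    """Return F1 given true positives, false positives and false negatives."""
    if tp == 0:
        # No correct positive predictions: F1 is 0 by convention.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts: 75 toxic comments caught, 25 false alarms, 25 missed.
score = f1_from_counts(tp=75, fp=25, fn=25)
print(round(score, 3))  # 0.75
```

With these counts both precision and recall equal 0.75, so the model would sit exactly on the required threshold.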

**Work plan**

1. Load and prepare the data.
2. Train several models.
3. Draw conclusions.

## Preparation

In [1]:
import pandas as pd
import numpy as np
pd.set_option('max_colwidth', 400)
import warnings
warnings.simplefilter('ignore')
from tqdm.notebook import tqdm

import re
import spacy

import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')
# English stop words, later passed to TfidfVectorizer.
stopwords = set(stopwords.words('english'))

from catboost import CatBoostClassifier
from sklearn.svm import LinearSVC
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score, train_test_split

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Igor\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [2]:
try:
    toxic_comments = pd.read_csv("datasets/toxic_comments.csv")
except FileNotFoundError:
    # Fall back to the current directory if the datasets folder is missing.
    toxic_comments = pd.read_csv("toxic_comments.csv")
toxic_comments.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 159292 entries, 0 to 159291
Data columns (total 3 columns):
 #   Column      Non-Null Count   Dtype 
---  ------      --------------   ----- 
 0   Unnamed: 0  159292 non-null  int64 
 1   text        159292 non-null  object
 2   toxic       159292 non-null  int64 
dtypes: int64(2), object(1)
memory usage: 3.6+ MB


In [3]:
toxic_comments.head()

Unnamed: 0.1,Unnamed: 0,text,toxic
0,0,"Explanation\nWhy the edits made under my username Hardcore Metallica Fan were reverted? They weren't vandalisms, just closure on some GAs after I voted at New York Dolls FAC. And please don't remove the template from the talk page since I'm retired now.89.205.38.27",0
1,1,"D'aww! He matches this background colour I'm seemingly stuck with. Thanks. (talk) 21:51, January 11, 2016 (UTC)",0
2,2,"Hey man, I'm really not trying to edit war. It's just that this guy is constantly removing relevant information and talking to me through edits instead of my talk page. He seems to care more about the formatting than the actual info.",0
3,3,"""\nMore\nI can't make any real suggestions on improvement - I wondered if the section statistics should be later on, or a subsection of """"types of accidents"""" -I think the references may need tidying so that they are all in the exact same format ie date format etc. I can do that later on, if no-one else does first - if you have any preferences for formatting style on references or want to do ...",0
4,4,"You, sir, are my hero. Any chance you remember what page that's on?",0


In [4]:
toxic_comments = toxic_comments.drop('Unnamed: 0', axis = 1)

In [5]:
toxic_comments

Unnamed: 0,text,toxic
0,"Explanation\nWhy the edits made under my username Hardcore Metallica Fan were reverted? They weren't vandalisms, just closure on some GAs after I voted at New York Dolls FAC. And please don't remove the template from the talk page since I'm retired now.89.205.38.27",0
1,"D'aww! He matches this background colour I'm seemingly stuck with. Thanks. (talk) 21:51, January 11, 2016 (UTC)",0
2,"Hey man, I'm really not trying to edit war. It's just that this guy is constantly removing relevant information and talking to me through edits instead of my talk page. He seems to care more about the formatting than the actual info.",0
3,"""\nMore\nI can't make any real suggestions on improvement - I wondered if the section statistics should be later on, or a subsection of """"types of accidents"""" -I think the references may need tidying so that they are all in the exact same format ie date format etc. I can do that later on, if no-one else does first - if you have any preferences for formatting style on references or want to do ...",0
4,"You, sir, are my hero. Any chance you remember what page that's on?",0
...,...,...
159287,""":::::And for the second time of asking, when your view completely contradicts the coverage in reliable sources, why should anyone care what you feel? You can't even give a consistent argument - is the opening only supposed to mention significant aspects, or the """"most significant"""" ones? \n\n""",0
159288,You should be ashamed of yourself \n\nThat is a horrible thing you put on my talk page. 128.61.19.93,0
159289,"Spitzer \n\nUmm, theres no actual article for prostitution ring. - Crunch Captain.",0
159290,And it looks like it was actually you who put on the speedy to have the first version deleted now that I look at it.,0


Lemmatize the text.

In [6]:
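def tokenize_lemmatize_space(list_of_text, verbose=True):
    """Strip non-letter characters and lemmatize each text with spaCy."""
    # The parser and NER components are not needed for lemmatization.
    nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])
    return_series = []
    # verbose=False hides the progress bar.
    for text in tqdm(list_of_text, disable=not verbose):
        # Keep Latin letters only; everything else becomes a space.
        string = re.sub(r'[^a-zA-Z]', ' ', text)
        string = nlp(string)
        return_series.append(' '.join([token.lemma_ for token in string]))
    return return_series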
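The cleaning step used in the function above can be checked in isolation, without loading spaCy. This is a minimal sketch: the helper name `keep_latin_letters` is hypothetical, and collapsing repeated spaces is an extra step that the notebook's function does not perform.

```python
import re


def keep_latin_letters(text):
    """Replace every character that is not a Latin letter with a space."""
    letters_only = re.sub(r'[^a-zA-Z]', ' ', text)
    # Extra step (not in the notebook): collapse runs of whitespace.
    return ' '.join(letters_only.split())


sample = "D'aww! He matches this background colour. (talk) 21:51, January 11, 2016 (UTC)"
cleaned = keep_latin_letters(sample)
print(cleaned)  # D aww He matches this background colour talk January UTC
```

Digits, punctuation and apostrophes are all dropped, which is why contractions such as "don't" appear as "don t" in the lemmatized column.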

In [7]:
toxic_comments['clear_lemm_text'] = tokenize_lemmatize_space(toxic_comments['text'])
toxic_comments.head()


Unnamed: 0,text,toxic,clear_lemm_text
0,"Explanation\nWhy the edits made under my username Hardcore Metallica Fan were reverted? They weren't vandalisms, just closure on some GAs after I voted at New York Dolls FAC. And please don't remove the template from the talk page since I'm retired now.89.205.38.27",0,Explanation why the edit make under my username Hardcore Metallica Fan be revert they weren t vandalism just closure on some gas after I vote at New York Dolls FAC and please don t remove the template from the talk page since I m retire now
1,"D'aww! He matches this background colour I'm seemingly stuck with. Thanks. (talk) 21:51, January 11, 2016 (UTC)",0,d aww he match this background colour I m seemingly stick with Thanks talk January UTC
2,"Hey man, I'm really not trying to edit war. It's just that this guy is constantly removing relevant information and talking to me through edits instead of my talk page. He seems to care more about the formatting than the actual info.",0,hey man I m really not try to edit war it s just that this guy be constantly remove relevant information and talk to I through edit instead of my talk page he seem to care more about the formatting than the actual info
3,"""\nMore\nI can't make any real suggestions on improvement - I wondered if the section statistics should be later on, or a subsection of """"types of accidents"""" -I think the references may need tidying so that they are all in the exact same format ie date format etc. I can do that later on, if no-one else does first - if you have any preferences for formatting style on references or want to do ...",0,more I can t make any real suggestion on improvement I wonder if the section statistic should be later on or a subsection of type of accident I think the reference may need tidy so that they be all in the exact same format ie date format etc I can do that later on if no one else do first if you have any preference for format style on reference or want to do it yourself p...
4,"You, sir, are my hero. Any chance you remember what page that's on?",0,you sir be my hero any chance you remember what page that s on


Fitting TF-IDF

In [8]:
# Prepare the train and test sets.
X, y = toxic_comments.drop(['text', 'toxic'], axis=1), toxic_comments['toxic']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=42, stratify=y)
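The `stratify=y` argument keeps the share of toxic comments the same in both parts of the split. A small sketch with hypothetical labels (90 non-toxic and 10 toxic; the real class ratio is not shown in this notebook):

```python
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 90 non-toxic (0) and 10 toxic (1).
labels = [0] * 90 + [1] * 10
rows = list(range(len(labels)))

rows_train, rows_test, y_tr, y_te = train_test_split(
    rows, labels, test_size=0.1, random_state=42, stratify=labels
)

# Both parts keep the original 10% share of the toxic class.
print(sum(y_tr) / len(y_tr), sum(y_te) / len(y_te))
```

Without stratification, a small test set drawn from imbalanced data could end up with too few toxic examples for a stable F1 estimate.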

Create new features with TF-IDF, excluding stop words.

In [9]:
# TfidfVectorizer documents stop_words as 'english', a list, or None, so pass a list.
count_tf_idf = TfidfVectorizer(stop_words=list(stopwords))
X_train_corpus = X_train['clear_lemm_text'].values
X_train_tf_idf = count_tf_idf.fit_transform(X_train_corpus)
X_test_tf_idf = count_tf_idf.transform(X_test['clear_lemm_text'])
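To make the weighting concrete, here is a pure-Python sketch of the smoothed TF-IDF formula that scikit-learn documents for the `TfidfVectorizer` defaults: `idf = ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each row. The helper `tfidf_rows`, the whitespace tokenization and the toy documents are assumptions made for illustration; the real vectorizer also lowercases text and applies its own token pattern.

```python
import math
from collections import Counter


def tfidf_rows(docs):
    """Return one {term: weight} dict per document (smoothed IDF, L2 norm)."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {term: math.log((1 + n_docs) / (1 + count)) + 1
           for term, count in df.items()}
    rows = []
    for tokens in tokenized:
        tf = Counter(tokens)
        raw = {term: tf[term] * idf[term] for term in tf}
        norm = math.sqrt(sum(w * w for w in raw.values()))
        rows.append({term: w / norm for term, w in raw.items()} if norm else {})
    return rows


# Toy documents, made up for illustration.
docs = ["you be my hero", "you be wrong", "thank you"]
rows = tfidf_rows(docs)
print(rows[0])
```

In the first document, "hero" occurs in one document only and gets a higher weight than "you", which occurs in all three.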

In [11]:
toxic_comments.shape

(159292, 3)

In [79]:
X_train_tf_idf.shape

(143362, 147528)

In [65]:
X_train_tf_idf[1:3].toarray()

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])

In [58]:
pd.Series(X_train_tf_idf[0].toarray()[0]).sort_values(ascending = False)

134352    0.467824
138922    0.339157
38755     0.337799
109300    0.313429
112447    0.298100
            ...   
49180     0.000000
49181     0.000000
49182     0.000000
49183     0.000000
147527    0.000000
Length: 147528, dtype: float64

## Training

Define a function that evaluates a model by its cross-validated F1 score on the training set.

In [17]:
def metric_data(model, name, X_train, y_train, y_test, X_test):
    """
    Evaluate a model with 3-fold cross-validation on the training set.
    Input: model, model name, training features and labels
    (y_test and X_test are accepted but not used).
    Output: DataFrame with the mean cross-validated F1 in column f1_train.
    """
    f1_train = round(cross_val_score(model, X_train, y_train, cv=3, scoring='f1', n_jobs=-1).mean(), 3)
    metric = pd.DataFrame({'f1_train': f1_train}, index=[name])
    return metric
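Because `count_tf_idf` is fitted on the whole training set before `cross_val_score` runs, each validation fold also contributes to the vocabulary and IDF weights. One possible way to keep the vectorizer inside each fold is a scikit-learn pipeline. The sketch below uses made-up toy texts and is not the method used in this notebook:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy data, made up for illustration: 1 = toxic, 0 = non-toxic.
texts = [
    "you be an idiot", "thank you for the help",
    "what a stupid edit", "nice work on the article",
    "you stupid idiot", "please see the talk page",
    "shut up idiot", "I add a reference to the section",
    "stupid stupid page", "thank for the quick fix",
    "idiot editor go away", "the article look good now",
]
labels = [1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0]

# The vectorizer is refitted on the training part of every fold.
pipeline = make_pipeline(TfidfVectorizer(), LinearSVC(random_state=42))
scores = cross_val_score(pipeline, texts, labels, cv=3, scoring='f1')
print(scores.mean())
```

On a corpus as large as this one the difference is likely small, but the pipeline makes the cross-validated estimate independent of the held-out folds.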

### LogisticRegression

In [18]:
model_lr = LogisticRegression(random_state=42, class_weight = 'balanced')
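`class_weight='balanced'` weights classes inversely to their frequency, following the formula given in the scikit-learn documentation: `n_samples / (n_classes * count)`. A sketch with hypothetical label counts (the helper `balanced_class_weights` is illustrative and not part of scikit-learn):

```python
from collections import Counter


def balanced_class_weights(labels):
    """Return {class: n_samples / (n_classes * class_count)}."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Hypothetical imbalance: 90 non-toxic comments and 10 toxic ones.
labels = [0] * 90 + [1] * 10
weights = balanced_class_weights(labels)
print(weights[1])  # 5.0
```

The rare toxic class gets a weight nine times larger than the majority class here, so errors on it cost more during training.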

In [19]:
metric_lr = metric_data(model_lr, 'lr', X_train_tf_idf, y_train, y_test, X_test_tf_idf)
metric_lr

Unnamed: 0,f1_train
lr,0.75


### LinearSVC

In [20]:
model_LinearSVC = LinearSVC(random_state=42)

In [21]:
metric_LinearSVC = metric_data(model_LinearSVC, 'LinearSVC', X_train_tf_idf, y_train, y_test, X_test_tf_idf)
metric_LinearSVC

Unnamed: 0,f1_train
LinearSVC,0.771


### CatBoost

In [22]:
text_features = ['clear_lemm_text']

In [23]:
model_cb = CatBoostClassifier(random_state=42, verbose=100, learning_rate=1, eval_metric='F1', text_features=text_features)

In [24]:
metric_cb = metric_data(model_cb, 'CatBoost', X_train, y_train, y_test, X_test)
metric_cb

Unnamed: 0,f1_train
CatBoost,0.762


In [31]:
# DataFrame.append was removed in pandas 2.0; pd.concat gives the same result.
metric = pd.concat([metric_lr, metric_LinearSVC, metric_cb])
metric.sort_values('f1_train', ascending=False).style.background_gradient(subset=['f1_train'], cmap='Purples')

Unnamed: 0,f1_train
LinearSVC,0.771
CatBoost,0.762
lr,0.75


The linear support vector classifier LinearSVC showed the best cross-validated result (F1 = 0.771).

### Evaluation on the test set

In [32]:
model_LinearSVC.fit(X_train_tf_idf, y_train)
pred = model_LinearSVC.predict(X_test_tf_idf)
score = f1_score(y_test,pred)
print('F1 score: ', round(score,3))

F1 score:  0.787


## Conclusions

In this project, a model was built to classify comments as positive or negative. Several models were compared: the linear models used text features built with TF-IDF, while CatBoost received the lemmatized text as a text feature. LinearSVC showed the best result in the experiment (f1_test = 0.787), which meets the required F1 of at least 0.75.