The online store Wikishop is launching a new service. Users can now edit and extend product descriptions, as in wiki communities: customers propose their own edits and comment on the changes made by others. The store needs a tool that finds toxic comments and sends them for moderation.

Train a model to classify comments as positive or negative. You have a dataset in which edits are labeled for toxicity.

Build a model with an *F1* score of at least 0.75.
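
To make the target metric concrete, here is a minimal sketch of how *F1* is computed for the positive (toxic) class. The function name `f1_from_labels` and the toy labels are illustrative assumptions, not part of the project data.

```python
def f1_from_labels(y_true, y_pred):
    """Compute F1 for the positive class (label 1) from two label lists."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0  # without true positives, precision or recall is zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    # F1 is the harmonic mean of precision and recall
    return 2 * precision * recall / (precision + recall)


# Hypothetical toy labels: 1 = toxic, 0 = not toxic
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
score = f1_from_labels(y_true, y_pred)
print(round(score, 3))
```

In this toy case precision and recall are both 2/3, so *F1* is also 2/3, which would fall short of the 0.75 target.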

### Project instructions

1. Load and prepare the data.
2. Train several different models.
3. Draw conclusions.

Using *BERT* is not required for this project, but you may try it.

### Data description

The data is in the file `toxic_comments.csv`. The *text* column contains the comment text, and *toxic* is the target feature.
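
As a quick illustration of this layout, the sketch below reads a tiny in-memory stand-in with the same two columns and checks the class shares. The sample rows are hypothetical and only mimic the structure of `toxic_comments.csv`.

```python
import io

import pandas as pd

# Hypothetical stand-in for toxic_comments.csv with the same two columns
sample_csv = io.StringIO(
    "text,toxic\n"
    "Thanks for the fix,0\n"
    "You are an idiot,1\n"
    "Please add a source,0\n"
    "Nice edit,0\n"
)
sample = pd.read_csv(sample_csv)

# Share of each class; a small share of 1s would mean an imbalanced target
class_share = sample["toxic"].value_counts(normalize=True)
print(class_share)
```

Running the same `value_counts(normalize=True)` check on the real file is a cheap way to see whether the target is imbalanced before choosing a metric or a split strategy.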

# 1. Preparation

In [1]:
!pip install pymorphy2

Collecting pymorphy2
  Downloading https://files.pythonhosted.org/packages/07/57/b2ff2fae3376d4f3c697b9886b64a54b476e1a332c67eee9f88e7f1ae8c9/pymorphy2-0.9.1-py3-none-any.whl (55kB)
Collecting pymorphy2-dicts-ru<3.0,>=2.4
  Downloading https://files.pythonhosted.org/packages/3a/79/bea0021eeb7eeefde22ef9e96badf174068a2dd20264b9a378f2be1cdd9e/pymorphy2_dicts_ru-2.4.417127.4579844-py2.py3-none-any.whl (8.2MB)
Collecting dawg-python>=0.7.1
  Downloading https://files.pythonhosted.org/packages/6a/84/ff1ce2071d4c650ec857

In [2]:
import pandas as pd
import numpy as np

import pymorphy2
import re

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

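import nltk
from nltk.corpus import stopwords as nltk_stopwords
nltk.download('stopwords')
# scikit-learn documents `stop_words` as 'english', a list, or None,
# so keep the stop words as a list rather than a set
stopwords = list(nltk_stopwords.words('english'))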

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [3]:
from google.colab import drive

drive.mount("/content/gdrive")

Mounted at /content/gdrive


In [4]:
! ls "/content/gdrive/My Drive/Colab Notebooks/data"

'Alfa battle хакатон'  'behavioral cloning'   text_analysis_yandex


In [5]:
df = pd.read_csv('/content/gdrive/My Drive/Colab Notebooks/data/text_analysis_yandex/toxic_comments.csv')
df.head(5)

Unnamed: 0,text,toxic
0,Explanation\nWhy the edits made under my usern...,0
1,D'aww! He matches this background colour I'm s...,0
2,"Hey man, I'm really not trying to edit war. It...",0
3,"""\nMore\nI can't make any real suggestions on ...",0
4,"You, sir, are my hero. Any chance you remember...",0


In [6]:
df.shape

(159571, 2)

In [7]:
print("20 :", df.loc[20,'text'])
print('------------------------------')
print("\n100 :", df.loc[100,'text'])
print('------------------------------')
print("\n1000 :", df.loc[1000,'text'])

20 : "

 Regarding your recent edits 

Once again, please read WP:FILMPLOT before editing any more film articles.  Your edits are simply not good, with entirely too many unnecessary details and very bad writing.  Please stop before you do further damage. -''''''The '45 "
------------------------------

100 : However, the Moonlite edit noted by golden daph was me (on optus ...)  Wake up wikkis.  So funny
------------------------------

1000 : Rex Mundi 

I've created a stub on Rex Mundi at Rex Mundi High School.  Only thing I know about it is that both my Aunt Donna and Bob Griese went there.  Please add anything you might know about it.

BTW, my dad was a Panther; I live in Princeton myself.


# 2. Training

## Bag of words

In [8]:
corpus = df['text']

In [9]:
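morph = pymorphy2.MorphAnalyzer()

def lemmatize(text):
    # Note: pymorphy2 ships Russian dictionaries, so for this English corpus
    # normal_form is expected to do little beyond lowercasing each token
    words = text.split()  # split the text into words
    res = list()
    for word in words:
        p = morph.parse(word)[0]  # take the highest-scored parse
        res.append(p.normal_form)

    return " ".join(res)

def clear_text(text):
    pattern = r'[^a-zA-Z]'  # everything except Latin letters
    latin_text = re.sub(pattern, " ", text)

    return " ".join(latin_text.split())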
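
To see what the cleaning step does to a raw comment, here is a small standalone sketch. It re-implements the same regular expression as `clear_text` so that it runs on its own; the sample string is made up for illustration.

```python
import re


def clear_text(text):
    # Replace every character that is not a Latin letter with a space,
    # then collapse runs of whitespace into single spaces
    return " ".join(re.sub(r"[^a-zA-Z]", " ", text).split())


cleaned = clear_text("D'aww! He matches\nthis background colour, 100%...")
print(cleaned)
```

Note that apostrophes are also replaced, so contractions such as `D'aww` or `I'm` are split into two tokens, and all digits are dropped.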

In [None]:
%%time
lemmatized_text = corpus.apply(lambda x: lemmatize(clear_text(x)) )
lemmatized_text[:3]

CPU times: user 4min, sys: 539 ms, total: 4min
Wall time: 4min 1s


In [None]:
X_train, X_test, y_train, y_test = train_test_split(lemmatized_text, df['toxic'])
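
The split above uses the default settings, so it changes on every run. A possible variant, shown here as a sketch on hypothetical toy data, fixes the seed and stratifies by the target so that both parts keep the same share of toxic comments. The seed value and the toy lists are assumptions for illustration.

```python
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 20 texts, 4 of them toxic (label 1)
texts = [f"comment {i}" for i in range(20)]
labels = [1] * 4 + [0] * 16

X_tr, X_te, y_tr, y_te = train_test_split(
    texts,
    labels,
    test_size=0.25,   # the default share when no sizes are given
    stratify=labels,  # keep the toxic share equal in both parts
    random_state=42,  # fixed seed so the split is repeatable
)
print(len(X_tr), len(X_te), sum(y_tr), sum(y_te))
```

With a fixed seed, the scores reported in later cells can be reproduced exactly on a rerun.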

In [None]:
count_vect = CountVectorizer(stop_words=stopwords)
n_gramm_train = count_vect.fit_transform(X_train)
n_gramm_test = count_vect.transform(X_test)

print("Train shape:", n_gramm_train.shape)
print("Test shape:", n_gramm_test.shape)

Train shape: (119678, 142716)
Test shape: (39893, 142716)
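
Each column of these matrices corresponds to one vocabulary word, and each value is how often that word occurs in a comment. The sketch below builds the same kind of count matrix by hand for a hypothetical two-sentence corpus. It is a simplified illustration that splits on whitespace and skips the stop-word filtering used in the cell above.

```python
from collections import Counter

docs = ["you are great", "you are you"]  # hypothetical toy corpus

# Vocabulary: every distinct token, sorted so the column order is stable
vocab = sorted({token for doc in docs for token in doc.split()})

# One row per document, one column per vocabulary word, values are counts
matrix = []
for doc in docs:
    counts = Counter(doc.split())
    matrix.append([counts[word] for word in vocab])

print(vocab)
print(matrix)
```

On the real corpus most entries are zero, which is why the vectorizer returns a sparse matrix rather than nested lists.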


In [None]:
LR = LogisticRegression(max_iter=1000)
LR.fit(n_gramm_train, y_train)
print("train:", LR.score(n_gramm_train, y_train))
print("test:", LR.score(n_gramm_test, y_test))

print("\nF1:", f1_score(y_test, LR.predict(n_gramm_test)))

train: 0.9810157255301726
test: 0.9556814478730604

F1: 0.7608225108225107
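
Accuracy is much higher than *F1* here. One plausible reason, assuming toxic comments are a minority class, is that accuracy is dominated by the many correctly classified non-toxic comments. The sketch below shows the effect on hypothetical labels. The numbers are made up and are not taken from the dataset.

```python
# Hypothetical labels: 100 comments, 10 of them toxic
y_true = [1] * 10 + [0] * 90
# A model that finds 6 of the 10 toxic comments and raises 2 false alarms
y_pred = [1] * 6 + [0] * 4 + [1] * 2 + [0] * 88

pairs = list(zip(y_true, y_pred))
accuracy = sum(t == p for t, p in pairs) / len(pairs)

tp = sum(t == 1 and p == 1 for t, p in pairs)
fp = sum(t == 0 and p == 1 for t, p in pairs)
fn = sum(t == 1 and p == 0 for t, p in pairs)
f1 = 2 * tp / (2 * tp + fp + fn)  # equivalent form of the harmonic mean

print(accuracy, round(f1, 3))
```

The toy model is right on 94 of 100 comments, yet its *F1* is only about 0.667 because four of the ten toxic comments are missed.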


## TF-IDF vectorizer

In [None]:
Tf_Idf_count = TfidfVectorizer(stop_words=stopwords, ngram_range=(1, 2))
n_gramm_train = Tf_Idf_count.fit_transform(X_train)
n_gramm_test = Tf_Idf_count.transform(X_test)

print("Train shape:", n_gramm_train.shape)
print("Test shape:", n_gramm_test.shape)

Train shape: (119678, 2267440)
Test shape: (39893, 2267440)
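
The feature count grows because `ngram_range=(1, 2)` adds word pairs to the single words. The weights themselves can be sketched by hand. The example below follows the formula scikit-learn documents for its default settings (raw term counts, smoothed idf `ln((1 + n) / (1 + df)) + 1`, then L2 normalization of each row), uses single words only, and works on a hypothetical pre-tokenized corpus.

```python
import math
from collections import Counter

# Hypothetical pre-tokenized toy corpus
docs = [["toxic", "edit"], ["good", "edit"], ["good", "good", "source"]]
n_docs = len(docs)
vocab = sorted({token for doc in docs for token in doc})

# Document frequency: in how many documents each word appears
doc_freq = {w: sum(1 for doc in docs if w in doc) for w in vocab}
# Smoothed inverse document frequency
idf = {w: math.log((1 + n_docs) / (1 + doc_freq[w])) + 1 for w in vocab}


def tfidf_row(doc):
    counts = Counter(doc)
    raw = [counts[w] * idf[w] for w in vocab]
    norm = math.sqrt(sum(v * v for v in raw))
    return [v / norm for v in raw]  # L2-normalized row


rows = [tfidf_row(doc) for doc in docs]
for row in rows:
    print([round(v, 3) for v in row])
```

In the first toy document the rare word `toxic` gets a larger weight than `edit`, which appears in two documents. Down-weighting common words in this way is the main difference from plain counts.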


In [None]:
LR = LogisticRegression(max_iter=1000)
LR.fit(n_gramm_train, y_train)
print("train:", LR.score(n_gramm_train, y_train))
print("test:", LR.score(n_gramm_test, y_test))

print("\nF1:", f1_score(y_test, LR.predict(n_gramm_test)))

train: 0.953065726365748
test: 0.9495400195523024

F1: 0.6943052391799545
