The online store "Wikishop" is launching a new service. Users can now edit and extend product descriptions, as in wiki communities: customers propose their own edits and comment on changes made by others. The store needs a tool that finds toxic comments and sends them to moderation.

Train a model to classify comments as positive or negative (toxic). You have a dataset in which edits are labeled for toxicity.

Build a model with an *F1* score of at least 0.75.

### Project instructions

1. Load and prepare the data.
2. Train several different models.
3. Draw conclusions.

Using *BERT* is not required for this project, but you may try it.

### Data description

The data is stored in the file `toxic_comments.csv`. The *text* column contains the comment text, and *toxic* is the target feature.

In [1]:
import re
import warnings

import nltk
import numpy as np
import pandas as pd
from nltk.corpus import stopwords as nltk_stopwords
from pymystem3 import Mystem
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline

warnings.filterwarnings('ignore')
nltk.download('stopwords')

# Note: MyStem is a morphological analyzer built for Russian text.
m = Mystem()

[nltk_data] Downloading package stopwords to /home/jovyan/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [2]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn import metrics

In [25]:
from sklearn.ensemble import RandomForestClassifier

# 1. Preparation

In [3]:
df = pd.read_csv('/datasets/toxic_comments.csv')
df[:5]

Unnamed: 0,text,toxic
0,Explanation\nWhy the edits made under my usern...,0
1,D'aww! He matches this background colour I'm s...,0
2,"Hey man, I'm really not trying to edit war. It...",0
3,"""\nMore\nI can't make any real suggestions on ...",0
4,"You, sir, are my hero. Any chance you remember...",0


In [4]:
df.text

0         Explanation\nWhy the edits made under my usern...
1         D'aww! He matches this background colour I'm s...
2         Hey man, I'm really not trying to edit war. It...
3         "\nMore\nI can't make any real suggestions on ...
4         You, sir, are my hero. Any chance you remember...
                                ...                        
159566    ":::::And for the second time of asking, when ...
159567    You should be ashamed of yourself \n\nThat is ...
159568    Spitzer \n\nUmm, theres no actual article for ...
159569    And it looks like it was actually you who put ...
159570    "\nAnd ... I really don't think you understand...
Name: text, Length: 159571, dtype: object

### The data contains line breaks and uppercase characters. I will clean them up.

In [5]:
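def cleaning(text):
    # Replace line breaks with spaces so words on adjacent lines do not merge.
    text = re.sub(r"(?:\n|\r)", " ", text)
    # Keep only Latin letters and spaces; digits, punctuation and apostrophes are dropped.
    text = re.sub(r"[^a-zA-Z ]+", "", text).strip()
    text = text.lower()
    return text

df['text'] = df['text'].apply(cleaning)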

In [6]:
print(df.text[:3])

0    explanation why the edits made under my userna...
1    daww he matches this background colour im seem...
2    hey man im really not trying to edit war its j...
Name: text, dtype: object
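To make the effect of the cleaning step concrete, here is a small standalone check on an invented sample string. The function body repeats the `cleaning` logic from the cell above; the sample text is made up for illustration.

```python
import re

def cleaning(text):
    # Same steps as in the notebook cell: line breaks to spaces,
    # drop everything except Latin letters and spaces, then lowercase.
    text = re.sub(r"(?:\n|\r)", " ", text)
    text = re.sub(r"[^a-zA-Z ]+", "", text).strip()
    return text.lower()

# Invented sample with an apostrophe, a line break, digits and brackets.
sample = "D'aww!\nI can't, 21:51 (UTC)"
cleaned = cleaning(sample)
print(repr(cleaned))
```

Note that apostrophes are removed rather than replaced, so "can't" becomes "cant", and removed digits leave double spaces behind. This matches the `im`, `dont` and `january   utc` fragments visible in the outputs.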


### Lemmatization

In [7]:
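corpus = df['text'].values.astype('U')

def lemmatize(text):
    # Mystem returns a list of tokens that already includes whitespace,
    # so the pieces are joined without a separator.
    lemm = m.lemmatize(text)
    return "".join(lemm)

# Note: only the first document is lemmatized here, and the models below
# are trained on df['text'], not on corpus.
corpus[0] = lemmatize(corpus[0])
corpus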

array(['explanation why the edits made under my username hardcore metallica fan were reverted they werent vandalisms just closure on some gas after i voted at new york dolls fac and please dont remove the template from the talk page since im retired now\n',
       'daww he matches this background colour im seemingly stuck with thanks  talk  january   utc',
       'hey man im really not trying to edit war its just that this guy is constantly removing relevant information and talking to me through edits instead of my talk page he seems to care more about the formatting than the actual info',
       ...,
       'spitzer   umm theres no actual article for prostitution ring   crunch captain',
       'and it looks like it was actually you who put on the speedy to have the first version deleted now that i look at it',
       'and  i really dont think you understand  i came here and my idea was bad right away  what kind of community goes you have bad ideas go away instead of helping rewrite th
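For lemmatization to affect the model, it has to be applied to every document and the result has to be the text that is vectorized. The sketch below shows only that pattern. `toy_lemmatize` and its lookup table are hypothetical stand-ins, not a real lemmatizer; in practice an English lemmatizer would take their place.

```python
# Hypothetical lookup table, invented for illustration only.
TOY_LEMMAS = {"edits": "edit", "were": "be", "reverted": "revert", "matches": "match"}

def toy_lemmatize(text):
    """Stand-in for a real lemmatizer: maps known word forms to a base form."""
    return " ".join(TOY_LEMMAS.get(word, word) for word in text.split())

corpus = ["the edits were reverted", "he matches this colour"]

# Apply the function to every document, not only to corpus[0].
lemmatized = [toy_lemmatize(doc) for doc in corpus]
print(lemmatized)
```

With a pandas column, the same idea would be `df['text'].apply(some_lemmatizer)`, with the result assigned back before the train/test split.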

# 2. Training

### Stop-word list

In [8]:
stopwords = set(nltk_stopwords.words('english'))

### Split the data

In [9]:
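target = df['toxic'].values
features = df['text']

# Hold out 10% as the test set, then take 25% of the remainder for validation.
f_other, f_test, t_other, t_test = train_test_split(
    features, target, test_size=0.1, random_state=42)
# The first split already shuffled the rows; with shuffle=False the order is kept,
# so random_state is not needed here.
f_train, f_valid, t_train, t_valid = train_test_split(
    f_other, t_other, shuffle=False, test_size=0.25)

f_train.shape[0], f_valid.shape[0], f_test.shape[0]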

(107709, 35904, 15958)
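The three sizes can be reproduced by hand, which also shows the effective proportions of the two-stage split. This is a small arithmetic check that assumes the held-out part of each split is rounded up, which is consistent with the numbers printed above.

```python
import math

n_total = 159571  # number of rows reported for df.text

# Stage 1: 10% of all rows go to the test set.
n_test = math.ceil(0.1 * n_total)
n_other = n_total - n_test

# Stage 2: 25% of the remaining rows go to validation.
n_valid = math.ceil(0.25 * n_other)
n_train = n_other - n_valid

print(n_train, n_valid, n_test)
# Shares of the full dataset for train, validation and test.
print([round(n / n_total, 3) for n in (n_train, n_valid, n_test)])
```

So the overall split is roughly 67.5% train, 22.5% validation and 10% test.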

### TF-IDF vectorization

In [10]:
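# Recent scikit-learn releases expect a list (or 'english') for stop_words, so the set is converted.
count_tf_idf = TfidfVectorizer(stop_words=list(stopwords))
# Fit the vocabulary and IDF weights on the training part only, then reuse them.
tfidf_train = count_tf_idf.fit_transform(f_train)
tfidf_valid = count_tf_idf.transform(f_valid)
tfidf_test = count_tf_idf.transform(f_test)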
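As a rough illustration of what the vectorizer computes, the sketch below weights terms by hand on three invented documents. It assumes the smoothed IDF `ln((1 + n) / (1 + df)) + 1` followed by L2 normalization, which is how scikit-learn's defaults are documented; tokenization here is a plain whitespace split, not the library's tokenizer.

```python
import math
from collections import Counter

# Invented toy corpus, already cleaned and lowercased.
docs = ["you are my hero", "you are wrong", "edit war again"]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents each term occurs.
df_counts = Counter(term for doc in tokenized for term in set(doc))

def tfidf(doc_tokens):
    tf = Counter(doc_tokens)
    # Term count times smoothed inverse document frequency.
    weights = {
        term: count * (math.log((1 + n_docs) / (1 + df_counts[term])) + 1)
        for term, count in tf.items()
    }
    # L2-normalize so that every document vector has unit length.
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}

vec = tfidf(tokenized[0])
print(vec)
```

Terms that occur in fewer documents ("hero") get a larger weight than terms shared across documents ("you").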

### Hyperparameter search

In [15]:
%%time
pipe = Pipeline([
    ('model', LogisticRegression(random_state=1, solver='liblinear', max_iter=200)),
])

# The 'model' entry replaces the pipeline step during the search,
# so the candidates use random_state=42 and the default max_iter.
param_grid = [
    {
        'model': [LogisticRegression(random_state=42, solver='liblinear')],
        'model__penalty': ['l1', 'l2'],
        'model__C': list(range(1, 15, 3)),  # 1, 4, 7, 10, 13
    }
]
grid = GridSearchCV(pipe, param_grid=param_grid, scoring='f1', cv=3, verbose=True, n_jobs=-1)
best_grid = grid.fit(tfidf_train, t_train)
print('Best parameters is:', grid.best_params_)
print('Best score is:', grid.best_score_)

Fitting 3 folds for each of 10 candidates, totalling 30 fits


[Parallel(n_jobs=-1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=-1)]: Done  30 out of  30 | elapsed:  3.7min finished


Best parameters is: {'model': LogisticRegression(C=4, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='warn', n_jobs=None, penalty='l1',
                   random_state=42, solver='liblinear', tol=0.0001, verbose=0,
                   warm_start=False), 'model__C': 4, 'model__penalty': 'l1'}
Best score is: 0.769191886783492
CPU times: user 2min 30s, sys: 1min 13s, total: 3min 43s
Wall time: 3min 44s
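In the search above the TF-IDF matrix is built once on the whole training part, so each cross-validation fold sees IDF weights computed with the help of its own held-out rows. A common alternative is to put the vectorizer inside the `Pipeline`, so it is refit on the training rows of every fold. The sketch below shows that layout on a tiny invented dataset; the texts, labels and the reduced grid are illustrative assumptions, not the notebook's data or settings.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Toy data, invented for illustration only.
texts = [
    "you are an idiot", "thanks for the helpful edit", "shut up idiot",
    "please see the talk page", "stupid idiot go away", "nice work on the article",
    "you stupid fool", "i added a reference", "what an idiot", "good point thanks",
    "idiot and fool", "the article looks fine",
]
labels = [1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0]

# The vectorizer is a pipeline step, so it is fit per fold on raw text.
text_pipe = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("model", LogisticRegression(solver="liblinear", random_state=42)),
])

param_grid = {"model__penalty": ["l1", "l2"], "model__C": [1, 4]}
search = GridSearchCV(text_pipe, param_grid=param_grid, scoring="f1", cv=3)
search.fit(texts, labels)

print(search.best_params_)
print(search.predict(["what a helpful edit"]))
```

With this layout the fitted search object accepts raw strings directly, so validation and test texts do not need a separate `transform` call.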


In [46]:
%%time
params_forest = {
    'n_estimators': list(range(50,300,50)),
    'max_depth':[5,15],
    'max_features' : list(range(1,20, 2))
}


model_forest = RandomForestClassifier(random_state=12345)
                                 
grid = GridSearchCV(model_forest, param_grid=params_forest, scoring='f1', cv=3, verbose=True, n_jobs=-1)
best_grid = grid.fit(tfidf_train, t_train)
print('Best parameters is:', grid.best_params_)
print('Best score is:', grid.best_score_)

Fitting 3 folds for each of 100 candidates, totalling 300 fits


[Parallel(n_jobs=-1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=-1)]: Done 300 out of 300 | elapsed: 38.9min finished


Best parameters is: {'max_depth': 5, 'max_features': 1, 'n_estimators': 50}
Best score is: 0.0
CPU times: user 38min 27s, sys: 1.93 s, total: 38min 28s
Wall time: 38min 53s


### Train the logistic regression model

In [47]:
model = LogisticRegression(random_state=42, C = 4, penalty = 'l1', solver='liblinear', max_iter=200)
model.fit(tfidf_train, t_train)
valid_pred = model.predict(tfidf_valid)
f1_score(t_valid, valid_pred)

0.7841279241930709

In [50]:
test_pred = model.predict(tfidf_test)
print(metrics.classification_report(t_test, test_pred), metrics.f1_score(t_test, test_pred))

              precision    recall  f1-score   support

           0       0.97      0.99      0.98     14383
           1       0.86      0.72      0.78      1575

    accuracy                           0.96     15958
   macro avg       0.91      0.85      0.88     15958
weighted avg       0.96      0.96      0.96     15958
 0.7813148788927337
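F1 is the harmonic mean of precision and recall, so the toxic-class row of the report can be checked by hand. The inputs below are the rounded values 0.86 and 0.72 from the report, so the result only approximates the exact score printed after it.

```python
def f1_from(precision, recall):
    # Harmonic mean of precision and recall.
    return 2 * precision * recall / (precision + recall)

# Rounded precision and recall for class 1 from the classification report.
f1_toxic = f1_from(0.86, 0.72)
print(round(f1_toxic, 3))
```

The lower recall (0.72) is what holds the score down: the model misses more toxic comments than it falsely flags.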


In [54]:
model_forest = RandomForestClassifier(max_depth=5, max_features=1, n_estimators = 50)
model_forest.fit(tfidf_train, t_train)
valid_pred_f = model_forest.predict(tfidf_valid)
f1_score(t_valid, valid_pred_f)

0.0
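An F1 of exactly 0.0 means the model produced no true positives. One plausible explanation, not verified in this notebook, is that such a shallow forest with `max_features=1` on a very wide sparse matrix predicts the majority class for every comment. The sketch below computes the metrics of an always-zero predictor, using the class counts from the test-set report as an example of the class balance.

```python
# Class counts taken from the test-set classification report.
n_negative, n_positive = 14383, 1575

# Confusion-matrix cells for a classifier that always predicts class 0.
tp, fp = 0, 0
fn, tn = n_positive, n_negative

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp) if (tp + fp) else 0.0
recall = tp / (tp + fn) if (tp + fn) else 0.0
f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0

print(round(accuracy, 3), f1)
```

Accuracy stays near 0.9 while F1 drops to zero, which is why F1 is the more informative metric for this imbalanced task.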

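# 3. Conclusions

Logistic regression (C=4, L1 penalty) reached F1 = 0.7813 on the test set and 0.7841 on validation, which is above the required 0.75. The random forest with the parameters tried here scored F1 = 0.0.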