The online store «Викишоп» is launching a new service. Users can now edit and extend product descriptions, as in wiki communities: customers propose their own edits and comment on changes made by others. The store needs a tool that finds toxic comments and sends them to moderation.

Train a model that classifies comments as positive or negative (toxic). You have a dataset in which edits are labeled for toxicity.

Build a model with an *F1* score of at least 0.75.
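For reference, F1 is the harmonic mean of precision and recall for the positive (toxic) class. The sketch below shows the arithmetic on made-up counts; the helper name `f1_from_counts` is illustrative and not part of the project.

```python
def f1_from_counts(tp: int, fp: int, fn: int) -> float:
    """F1 for the positive class from true positives, false positives, false negatives."""
    if tp == 0:
        # No correctly flagged positives: F1 is defined as 0 here
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts, chosen only to illustrate the formula
example_f1 = f1_from_counts(tp=80, fp=20, fn=20)
print(round(example_f1, 3))
```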

### Project instructions

1. Load and prepare the data.
2. Train several different models.
3. Draw conclusions.


### Data description

The data is stored in the file `toxic_comments.csv`. The *text* column contains the comment text, and *toxic* is the target.

# 1. Preparation

In [1]:
import re

import pandas as pd
import en_core_web_sm
from joblib import dump, load

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.pipeline import Pipeline


In [2]:
data = pd.read_csv('/datasets/toxic_comments.csv')

In [3]:
display(data)

Unnamed: 0,text,toxic
0,Explanation\nWhy the edits made under my usern...,0
1,D'aww! He matches this background colour I'm s...,0
2,"Hey man, I'm really not trying to edit war. It...",0
3,"""\nMore\nI can't make any real suggestions on ...",0
4,"You, sir, are my hero. Any chance you remember...",0
...,...,...
159566,""":::::And for the second time of asking, when ...",0
159567,You should be ashamed of yourself \n\nThat is ...,0
159568,"Spitzer \n\nUmm, theres no actual article for ...",0
159569,And it looks like it was actually you who put ...,0


In [4]:
data['toxic'].value_counts()

0    143346
1     16225
Name: toxic, dtype: int64
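The classes are strongly imbalanced: only about one comment in ten is toxic. The sketch below works through the arithmetic from the counts above. The "flag everything as toxic" baseline is a hypothetical reference point, not a model trained in this project.

```python
# Class counts copied from the value_counts() output above
n_clean, n_toxic = 143346, 16225
total = n_clean + n_toxic

toxic_share = n_toxic / total
print(f"toxic share: {toxic_share:.3f}")

# Hypothetical baseline: predict "toxic" for every comment.
# Recall is then 1 and precision equals the toxic share.
baseline_precision = toxic_share
baseline_recall = 1.0
baseline_f1 = (2 * baseline_precision * baseline_recall
               / (baseline_precision + baseline_recall))
print(f"all-toxic baseline F1: {baseline_f1:.3f}")
```

With this imbalance, accuracy would look high even for a model that never flags anything, which is why the task is scored with F1.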

In [5]:
# lemmatize the text with spaCy
nlp = en_core_web_sm.load()

# Lemmatization function: keeps the lemmas of all tokens that are not stop words
def lemmatize(text):
    temp = []
    for token in nlp(text):
        if not token.is_stop:
            temp.append(token.lemma_)
    return " ".join(temp)
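Calling `nlp(text)` once per row is slow on roughly 160k comments. One common way to speed this up is spaCy's `nlp.pipe`, which processes texts in batches. The sketch below assumes that approach; `lemmatize_batch` and the batch size are illustrative names and values, and the function accepts any pipeline object that exposes `pipe`.

```python
from typing import Iterable, List


def lemmatize_batch(nlp, texts: Iterable[str], batch_size: int = 1000) -> List[str]:
    """Lemmatize many texts with nlp.pipe, dropping stop words.

    `nlp` is expected to be a loaded spaCy pipeline, for example:
        nlp = spacy.load("en_core_web_sm", disable=["parser", "ner"])
    Disabling components that lemmatization does not need is an assumption
    about this project, not a measured optimization.
    """
    result = []
    for doc in nlp.pipe(texts, batch_size=batch_size):
        # Same rule as lemmatize(): keep lemmas of non-stop-word tokens
        result.append(" ".join(tok.lemma_ for tok in doc if not tok.is_stop))
    return result
```

In the notebook this could be used as `data['text_lemma'] = lemmatize_batch(nlp, data['text_clean'])`.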

In [6]:
# Text cleaning function: replaces line breaks with spaces,
# then drops every character except Latin letters and spaces
def clear_text(text):
    text = re.sub(r"(?:\n|\r)", " ", text)
    text = re.sub(r"[^a-zA-Z ]+", "", text).strip()
    return text
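Note that `clear_text` deletes every non-letter character, so words separated only by punctuation get glued together. The sketch below reproduces the function and compares it with a hypothetical variant, `clear_text_spaced`, that substitutes a space instead. The variant is an alternative to consider, not what the notebook uses.

```python
import re


def clear_text(text: str) -> str:
    """Same logic as the notebook's function: keep only Latin letters and spaces."""
    text = re.sub(r"(?:\n|\r)", " ", text)
    return re.sub(r"[^a-zA-Z ]+", "", text).strip()


def clear_text_spaced(text: str) -> str:
    """Hypothetical variant: replace each run of non-letters with a single space."""
    return re.sub(r"[^a-zA-Z]+", " ", text).strip()


sample = "Wait,what?\nI'm not sure..."
print(clear_text(sample))         # Waitwhat Im not sure
print(clear_text_spaced(sample))  # Wait what I m not sure
```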

In [42]:
data['text_clean'] = data['text'].apply(clear_text)

In [13]:
data['text_lemma'] = data['text_clean'].apply(lemmatize)

In [14]:
data

Unnamed: 0,text,toxic,text_clean,text_lemma
0,Explanation\nWhy the edits made under my usern...,0,Explanation Why the edits made under my userna...,explanation edit username Hardcore Metallica F...
1,D'aww! He matches this background colour I'm s...,0,Daww He matches this background colour Im seem...,Daww match background colour be seemingly stic...
2,"Hey man, I'm really not trying to edit war. It...",0,Hey man Im really not trying to edit war Its j...,hey man be try edit war guy constantly remove ...
3,"""\nMore\nI can't make any real suggestions on ...",0,More I cant make any real suggestions on impro...,not real suggestion improvement wonder secti...
4,"You, sir, are my hero. Any chance you remember...",0,You sir are my hero Any chance you remember wh...,sir hero chance remember page s
...,...,...,...,...
159566,""":::::And for the second time of asking, when ...",0,And for the second time of asking when your vi...,second time ask view completely contradict cov...
159567,You should be ashamed of yourself \n\nThat is ...,0,You should be ashamed of yourself That is a ...,ashamed horrible thing talk page
159568,"Spitzer \n\nUmm, theres no actual article for ...",0,Spitzer Umm theres no actual article for pro...,Spitzer Umm s actual article prostitution r...
159569,And it looks like it was actually you who put ...,0,And it looks like it was actually you who put ...,look like actually speedy version delete look


The data has been loaded, cleaned, and lemmatized.

In [7]:
from joblib import dump, load

In [8]:
# Save the result (lemmatization takes too long to rerun)
DATA_PREPARED_FILE = "toxic_comments_lemma"


In [None]:
dump(data, DATA_PREPARED_FILE)
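Because lemmatization is slow, it helps to recompute it only when the cached file is missing. The sketch below shows that pattern. `load_or_compute` is a hypothetical helper, and it uses the standard-library `pickle` instead of `joblib` purely to keep the example free of dependencies.

```python
import os
import pickle
from typing import Any, Callable


def load_or_compute(path: str, compute: Callable[[], Any]) -> Any:
    """Return the object cached at `path`; if absent, compute it, cache it, return it."""
    if os.path.exists(path):
        with open(path, "rb") as fh:
            return pickle.load(fh)
    result = compute()
    with open(path, "wb") as fh:
        pickle.dump(result, fh)
    return result
```

The expensive step is passed in as a function, so it runs only on a cache miss.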

# 2. Training

In [9]:
# load the saved file back and check its shape
data = load(DATA_PREPARED_FILE)
data.shape

(159571, 4)

In [10]:
# split the data into training, validation, and test sets
features = data.drop(columns=['toxic'])
target = data['toxic']

features_train, features_test, target_train, target_test = train_test_split(features, target,
                                                                            shuffle=True,
                                                                            test_size=0.2,
                                                                            random_state=4221)
features_train_valid, features_valid, target_train_valid, target_valid = train_test_split(features_train, target_train,
                                                                                          shuffle=True,
                                                                                          test_size=0.2,
                                                                                          random_state=4221)
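With only about 10% toxic comments, a plain random split can leave slightly different class shares in each part. `train_test_split` accepts a `stratify` argument that keeps the shares equal. The notebook does not use it, so the sketch below is a suggestion, shown on synthetic labels rather than the real data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 1000 rows, 10% positives (not the real dataset)
X = np.arange(1000).reshape(-1, 1)
y = np.array([1] * 100 + [0] * 900)

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,
    stratify=y,          # keep the class ratio the same in both parts
    random_state=4221,
)
print(y_train.mean(), y_test.mean())
```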

In [14]:
count_tf_idf = TfidfVectorizer(ngram_range=(1, 1))
features_train_valid_tfidf = count_tf_idf.fit_transform(features_train_valid['text_lemma'].values.astype('U'))
features_valid_tfidf = count_tf_idf.transform(features_valid['text_lemma'].values.astype('U'))
features_train_tfidf = count_tf_idf.transform(features_train['text_lemma'].values.astype('U'))
features_test_tfidf = count_tf_idf.transform(features_test['text_lemma'].values.astype('U'))
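The vectorizer is fitted only on the training part and then reused for every other split, so the validation and test texts do not influence the vocabulary or the IDF weights. The toy sketch below (made-up documents) shows the effect: words not seen during fitting are simply ignored at transform time.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up documents, for illustration only
train_docs = ["good edit thanks", "bad edit"]
new_docs = ["good unseenword"]

vec = TfidfVectorizer()
train_matrix = vec.fit_transform(train_docs)  # learns vocabulary and IDF weights
new_matrix = vec.transform(new_docs)          # reuses them; unseen words are dropped

print(sorted(vec.vocabulary_))
print(train_matrix.shape, new_matrix.shape)
```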

### Logistic regression

In [15]:
model = LogisticRegression(solver='liblinear')
model.fit(features_train_valid_tfidf, target_train_valid)
target_valid_pred = model.predict(features_valid_tfidf)


In [17]:
print('Validation F1:', f1_score(target_valid, target_valid_pred))

Validation F1: 0.7256309330863625


The result is close to the target F1 of 0.75, but still below it.
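One inexpensive way to move F1 is to tune the decision threshold on the validation set instead of relying on the default 0.5 that `predict` applies. This is a sketch of that idea, not something the notebook does. `best_f1_threshold` is a hypothetical helper and the toy probabilities are made up; in the notebook they would come from `model.predict_proba(features_valid_tfidf)[:, 1]`.

```python
from typing import Sequence, Tuple


def f1_binary(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """F1 for class 1 computed from raw label sequences."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    return 2 * tp / (2 * tp + fp + fn)


def best_f1_threshold(y_true: Sequence[int],
                      proba: Sequence[float],
                      thresholds: Sequence[float]) -> Tuple[float, float]:
    """Return (threshold, F1) for the threshold with the highest F1."""
    best_thr, best_score = thresholds[0], -1.0
    for thr in thresholds:
        pred = [1 if p >= thr else 0 for p in proba]
        score = f1_binary(y_true, pred)
        if score > best_score:
            best_thr, best_score = thr, score
    return best_thr, best_score


# Toy labels and probabilities, for illustration only
y_true = [0, 0, 0, 0, 1, 1, 1, 0]
proba = [0.05, 0.2, 0.35, 0.1, 0.45, 0.8, 0.3, 0.6]
grid_thr = [round(0.1 * k, 1) for k in range(1, 10)]
print(best_f1_threshold(y_true, proba, grid_thr))  # (0.3, 0.75)
```

A threshold picked this way is fitted to the validation data, so its effect should be confirmed on the test set.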

### RandomForest

In [18]:
for i in range(1, 20, 2):
    rndfrs = RandomForestClassifier(random_state=4221, max_depth=3, n_estimators=i)
    rndfrs.fit(features_train_valid_tfidf, target_train_valid)
    target_valid_pred = rndfrs.predict(features_valid_tfidf)
    
    print('Validation F1:', f1_score(target_valid, target_valid_pred))

Validation F1: 0.017358490566037735
Validation F1: 0.0
Validation F1: 0.0007619047619047618
Validation F1: 0.0
Validation F1: 0.0
Validation F1: 0.0
Validation F1: 0.0
Validation F1: 0.0
Validation F1: 0.0
Validation F1: 0.0



In [19]:
rndfrs = RandomForestClassifier(random_state=4221, max_depth=3, n_estimators=10)
rndfrs.fit(features_train_valid_tfidf, target_train_valid)
target_valid_pred = rndfrs.predict(features_valid_tfidf)
print('Validation F1:', f1_score(target_valid, target_valid_pred))

Validation F1: 0.0


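Random forest does not work here: with `max_depth=3` the validation F1 stays at or near 0 for every `n_estimators` value tried, so the model finds almost none of the toxic comments.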
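An F1 of exactly 0 means that not a single toxic comment was flagged correctly. A likely cause, given the class imbalance, is that the shallow forest never predicts class 1 at all. A quick way to check is to count the predicted labels; the helper and the predictions below are illustrative.

```python
from collections import Counter
from typing import Dict, Iterable


def label_counts(predictions: Iterable[int]) -> Dict[int, int]:
    """Count how often each label appears in a sequence of predictions."""
    return dict(Counter(int(p) for p in predictions))


# Made-up predictions standing in for rndfrs.predict(features_valid_tfidf)
fake_pred = [0] * 10
counts = label_counts(fake_pred)
print(counts)
print("predicts toxic at all:", counts.get(1, 0) > 0)
```

If class 1 is missing from the counts, options worth trying are deeper trees (`max_depth=None`) or `class_weight='balanced'`, both of which are `RandomForestClassifier` parameters. Whether they help on this data is untested here.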

### Ridge

In [21]:
from sklearn.linear_model import RidgeClassifier

In [22]:
clf = RidgeClassifier()
clf.fit(features_train_valid_tfidf, target_train_valid)
target_valid_pred = clf.predict(features_valid_tfidf)
print('Validation F1:', f1_score(target_valid, target_valid_pred))

Validation F1: 0.702728127939793


#### Back to logistic regression

In [23]:
# try to find better hyperparameters for logistic regression
pipe = Pipeline([
    ('vectorizer', TfidfVectorizer(ngram_range=(1, 1))),
    ('model', LogisticRegression(random_state=4221, solver='liblinear', max_iter=200))  # alternative solver: lbfgs
])

# Note: the 'model' entry below replaces the pipeline's model step during the search,
# so the searched models use the default max_iter rather than max_iter=200
params = [
    {
        'vectorizer__ngram_range': [(1, 1), (1, 2), (2, 2)],
        'model': [LogisticRegression(random_state=4221, solver='liblinear')],
        'model__C': [1, 10, 50, 100, 200]
    }
]

In [24]:
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=4221)

grid = GridSearchCV(pipe, param_grid=params, scoring='f1', cv=cv, verbose=False)

In [25]:
grid.fit(features_train_valid['text_lemma'], target_train_valid)

GridSearchCV(cv=StratifiedKFold(n_splits=3, random_state=4221, shuffle=True),
             error_score='raise-deprecating',
             estimator=Pipeline(memory=None,
                                steps=[('vectorizer',
                                        TfidfVectorizer(analyzer='word',
                                                        binary=False,
                                                        decode_error='strict',
                                                        dtype=<class 'numpy.float64'>,
                                                        encoding='utf-8',
                                                        input='content',
                                                        lowercase=True,
                                                        max_df=1.0,
                                                        max_features=None,
                                                        min_df=1,
                                        

In [26]:
dump(grid, 'grid')

['grid']

In [27]:
grid = load('grid')
grid.best_params_, grid.best_score_

({'model': LogisticRegression(C=200, class_weight=None, dual=False, fit_intercept=True,
                     intercept_scaling=1, l1_ratio=None, max_iter=100,
                     multi_class='warn', n_jobs=None, penalty='l2',
                     random_state=4221, solver='liblinear', tol=0.0001, verbose=0,
                     warm_start=False),
  'model__C': 200,
  'vectorizer__ngram_range': (1, 2)},
 0.7717174546025666)

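The model improved noticeably: the best cross-validated F1 is about 0.772 (C=200, unigrams plus bigrams), compared with 0.726 on the validation split with default settings.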
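Beyond `best_params_`, the fitted search object keeps the score of every parameter combination in `cv_results_`, which shows how much each setting mattered. The sketch below runs a much smaller grid on a made-up toy corpus so that it finishes instantly. The data, the grid values, and the name `toy_grid` are illustrative, not the project's.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline

# Toy corpus standing in for the real comments (not project data)
texts = ["nice edit", "thanks for help", "good work",
         "great page", "helpful note", "fine article",
         "you idiot", "stupid idiot", "idiot vandal",
         "stupid edit", "dumb idiot", "stupid vandal"]
labels = [0] * 6 + [1] * 6

toy_pipe = Pipeline([
    ("vectorizer", TfidfVectorizer()),
    ("model", LogisticRegression(solver="liblinear")),
])
toy_params = {
    "vectorizer__ngram_range": [(1, 1), (1, 2)],
    "model__C": [1, 10],
}
toy_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=4221)
toy_grid = GridSearchCV(toy_pipe, param_grid=toy_params, scoring="f1", cv=toy_cv)
toy_grid.fit(texts, labels)

# One row per parameter combination, best-ranked first
results = pd.DataFrame(toy_grid.cv_results_)[
    ["param_model__C", "param_vectorizer__ngram_range",
     "mean_test_score", "rank_test_score"]
].sort_values("rank_test_score")
print(results)
```

Applied to the notebook's `grid`, the same table would show, for instance, whether the score is still rising at C=200, the largest value in the searched range.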

# 3. Conclusions

In [28]:
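model = grid.best_estimator_

# Note: the grid search was run on 'text_lemma', while the final fit and the test
# below use 'text_clean', so the reported test score is for cleaned, non-lemmatized text
model.fit(features_train['text_clean'], target_train)

test_pred = model.predict(features_test['text_clean'])

f1_score(target_test, test_pred)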

0.7906019550677413

The test F1 is 0.79, above the required 0.75. Text processing takes a long time, lemmatization in particular. Not every model works on these features: RandomForest with shallow trees, for example, scored an F1 of about 0.