# Machine Learning for Texts

The online store «Викишоп» is launching a new service. Users can now edit and extend product descriptions, as in wiki communities: customers propose their own edits and comment on changes made by others. The store needs a tool that finds toxic comments and sends them for moderation.

Train a model to classify comments as positive or negative. You have a dataset in which edits are labeled for toxicity.

Build a model with an *F1* score of at least 0.75.

**Project instructions**

1. Load and prepare the data.
2. Train several models.
3. Draw conclusions.

Using *BERT* is optional for this project, but you are welcome to try it.

**Data description**

The data is in the file `toxic_comments.csv`. The *text* column contains the comment text, and *toxic* is the target.

## Preparation

In [1]:
import re

import nltk
import pandas as pd
import spacy
from nltk.corpus import stopwords as nltk_stopwords
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import GridSearchCV, cross_val_score, train_test_split

!pip install catboost
from catboost import CatBoostClassifier



### Load the data and take a look

In [2]:
data = pd.read_csv('/datasets/toxic_comments.csv', index_col=0)

In [3]:
display(data)

Unnamed: 0,text,toxic
0,Explanation\nWhy the edits made under my usern...,0
1,D'aww! He matches this background colour I'm s...,0
2,"Hey man, I'm really not trying to edit war. It...",0
3,"""\nMore\nI can't make any real suggestions on ...",0
4,"You, sir, are my hero. Any chance you remember...",0
...,...,...
159446,""":::::And for the second time of asking, when ...",0
159447,You should be ashamed of yourself \n\nThat is ...,0
159448,"Spitzer \n\nUmm, theres no actual article for ...",0
159449,And it looks like it was actually you who put ...,0


Downsampling the dataset would speed up the code, but the cells below process all 159,292 rows.

### Lemmatization and text cleaning

In [4]:
nlp = spacy.load('en_core_web_sm', disable=["parser", "ner"])

In [5]:
# Lemmatize a text with spaCy and join the lemmas with spaces
def lemmatize(text):
    doc = nlp(text)
    return " ".join(token.lemma_ for token in doc)

In [6]:
%%time
data['text_lemm'] = data['text'].apply(lemmatize)

CPU times: user 18min 27s, sys: 4.57 s, total: 18min 32s
Wall time: 18min 33s


In [7]:
display(data['text_lemm'])

0         Explanation \n why the edit make under my user...
1         D'aww ! he match this background colour I be s...
2         hey man , I be really not try to edit war . it...
3         " \n More \n I can not make any real suggestio...
4         you , sir , be my hero . any chance you rememb...
                                ...                        
159446    " : : : : : and for the second time of asking ...
159447    you should be ashamed of yourself \n\n that be...
159448    spitzer \n\n Umm , there s no actual article f...
159449    and it look like it be actually you who put on...
159450    " \n and ... I really do not think you underst...
Name: text_lemm, Length: 159292, dtype: object

In [7]:
# Clean a text: keep Latin letters only, lowercase, collapse whitespace
def clear_text(text):
    text = re.sub(r'[^a-zA-Z]', ' ', text)
    text = text.lower()
    return ' '.join(text.split())

In [8]:
data['text_clear'] = data['text_lemm'].apply(clear_text)

In [9]:
display(data['text_clear'])

0         explanation why the edit make under my usernam...
1         d aww he match this background colour i be see...
2         hey man i be really not try to edit war it be ...
3         more i can not make any real suggestion on imp...
4         you sir be my hero any chance you remember wha...
                                ...                        
159446    and for the second time of asking when your vi...
159447    you should be ashamed of yourself that be a ho...
159448    spitzer umm there s no actual article for pros...
159449    and it look like it be actually you who put on...
159450    and i really do not think you understand i com...
Name: text_clear, Length: 159292, dtype: object
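To make the effect of the cleaning step concrete, here is a small standalone sketch that applies the same regular expression to a made-up sentence (the sample string is hypothetical, not taken from the dataset). It shows that apostrophes split contractions into separate tokens and that digits are dropped.

```python
import re


def clear_text(text):
    # Replace every character that is not a Latin letter with a space
    text = re.sub(r'[^a-zA-Z]', ' ', text)
    # Lowercase, then collapse runs of whitespace into single spaces
    return ' '.join(text.lower().split())


# Hypothetical sample sentence, used only for illustration
sample = "I can't believe it's 2 AM!"
cleaned = clear_text(sample)
print(cleaned)
```

Fragments such as `t` and `s` survive the cleaning as one-letter tokens. NLTK's English stop-word list contains `s` and `t`, and `TfidfVectorizer`'s default token pattern ignores one-character tokens anyway, so they should not reach the feature matrix.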

In [10]:
features = data['text_clear']

In [11]:
target = data['toxic']

### Remove stop words and compute TF-IDF

In [12]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /home/jovyan/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [13]:
stopwords = list(nltk_stopwords.words('english'))

In [14]:
features_train, features_valid, target_train, target_valid = train_test_split(
    features, target, test_size=0.21, random_state=12345
)

In [15]:
count_tf_idf = TfidfVectorizer(stop_words=stopwords)

In [16]:
features_train_tf_idf = count_tf_idf.fit_transform(features_train) 

In [17]:
features_valid_tf_idf = count_tf_idf.transform(features_valid)
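The vectorizer above is fitted on the whole training split, so during cross-validation the IDF weights have already seen each validation fold. One possible way to avoid this, sketched below under assumptions, is to wrap the vectorizer and the classifier in a scikit-learn pipeline so that TF-IDF is refitted inside every fold. The toy corpus, the names `toy_texts`, `toy_target`, `pipe`, and `toy_scores`, and the built-in `"english"` stop-word list are illustrative choices, not part of the original notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Made-up, already cleaned comments; 1 = toxic, 0 = not toxic
toy_texts = [
    "you be a stupid idiot", "thank you for the helpful edit",
    "shut up you idiot", "nice work on the article",
    "stupid useless page", "please add a source for this claim",
    "you idiot stop edit", "great explanation thank you",
]
toy_target = [1, 0, 1, 0, 1, 0, 1, 0]

# The pipeline refits TF-IDF on the training part of each fold,
# so no statistics from the validation part reach the vectorizer
pipe = make_pipeline(
    TfidfVectorizer(stop_words="english"),
    LogisticRegression(C=10, random_state=12345),
)

toy_scores = cross_val_score(pipe, toy_texts, toy_target, cv=2, scoring="f1")
print(toy_scores.mean())
```

The same pipeline object can be passed to `GridSearchCV`; parameter names then take a step prefix such as `logisticregression__C`.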

## Training


### Train LogisticRegression with GridSearchCV

In [22]:
model_lr = LogisticRegression(random_state=12345)

In [19]:
params = {
    'C': [0.01, 0.05, 1, 0.5, 1, 5, 10],
    'penalty': ['l1', 'l2']
}

In [23]:
grid = GridSearchCV(estimator=model_lr, cv=5, n_jobs=-1, param_grid=params, scoring='f1')

In [24]:
%%time
grid.fit(features_train_tf_idf, target_train)

Traceback (most recent call last):
  File "/opt/conda/lib/python3.9/site-packages/sklearn/model_selection/_validation.py", line 593, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/opt/conda/lib/python3.9/site-packages/sklearn/linear_model/_logistic.py", line 1306, in fit
    solver = _check_solver(self.solver, self.penalty, self.dual)
  File "/opt/conda/lib/python3.9/site-packages/sklearn/linear_model/_logistic.py", line 443, in _check_solver
    raise ValueError("Solver %s supports only 'l2' or 'none' penalties, "
ValueError: Solver lbfgs supports only 'l2' or 'none' penalties, got l1 penalty.


CPU times: user 9min 3s, sys: 11min 29s, total: 20min 32s
Wall time: 20min 35s


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


GridSearchCV(cv=5, estimator=LogisticRegression(random_state=12345), n_jobs=-1,
             param_grid={'C': [0.01, 0.05, 1, 0.5, 1, 5, 10],
                         'penalty': ['l1', 'l2']},
             scoring='f1')
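The `ValueError` in the log above comes from the default `lbfgs` solver, which does not support the `l1` penalty, so every grid point with `'penalty': 'l1'` fails to fit and only the `l2` candidates are actually compared. The grid also lists `C=1` twice. A minimal sketch of one way around this, assuming the `liblinear` solver (which accepts both `l1` and `l2`) and a raised `max_iter` in response to the iteration-limit warning. A synthetic dataset from `make_classification` stands in for the TF-IDF matrix; `X_toy`, `y_toy`, `toy_params`, and `toy_grid` are hypothetical names and the `C` values are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the TF-IDF features and the toxic labels
X_toy, y_toy = make_classification(n_samples=200, n_features=20, random_state=12345)

# liblinear supports both penalties; max_iter is raised as a precaution
model = LogisticRegression(solver="liblinear", max_iter=1000, random_state=12345)

toy_params = {
    "C": [0.01, 0.1, 1, 10],
    "penalty": ["l1", "l2"],
}

toy_grid = GridSearchCV(model, param_grid=toy_params, cv=5, scoring="f1")
toy_grid.fit(X_toy, y_toy)

print(toy_grid.best_params_)
# Every candidate was fitted, so no mean score is NaN
print(np.isnan(toy_grid.cv_results_["mean_test_score"]).any())
```

Whether `l1` would beat the `{'C': 10, 'penalty': 'l2'}` result on the real data is not something this sketch can answer; it only shows a grid in which both penalties are evaluated.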

In [25]:
grid.best_params_

{'C': 10, 'penalty': 'l2'}

In [26]:
grid.best_estimator_

LogisticRegression(C=10, random_state=12345)

In [27]:
print("F1:", grid.best_score_.round(2))

F1: 0.77


### Train CatBoostClassifier

In [28]:
model_cbc = CatBoostClassifier(loss_function="Logloss", iterations=160, learning_rate=0.47)

In [30]:
features_train_cb, features_test, target_train_cb, target_test = train_test_split(
    features_train, target_train, test_size=0.21, random_state=12345
)

In [31]:
count_tf_idf_cb = TfidfVectorizer(stop_words=stopwords)

In [33]:
features_train_tf_idf_cb = count_tf_idf_cb.fit_transform(features_train_cb) 

In [35]:
features_test_tf_idf = count_tf_idf_cb.transform(features_test)

In [37]:
model_cbc.fit(features_train_tf_idf_cb, target_train_cb, verbose=10)

0:	learn: 0.3476026	total: 2.28s	remaining: 6m 2s
10:	learn: 0.1813939	total: 19.2s	remaining: 4m 20s
20:	learn: 0.1611229	total: 35.7s	remaining: 3m 56s
30:	learn: 0.1490894	total: 53.4s	remaining: 3m 42s
40:	learn: 0.1413028	total: 1m 10s	remaining: 3m 24s
50:	learn: 0.1345580	total: 1m 26s	remaining: 3m 5s
60:	learn: 0.1288160	total: 1m 43s	remaining: 2m 48s
70:	learn: 0.1252634	total: 1m 59s	remaining: 2m 29s
80:	learn: 0.1219392	total: 2m 16s	remaining: 2m 13s
90:	learn: 0.1182976	total: 2m 33s	remaining: 1m 56s
100:	learn: 0.1154652	total: 2m 50s	remaining: 1m 39s
110:	learn: 0.1135905	total: 3m 6s	remaining: 1m 22s
120:	learn: 0.1112783	total: 3m 22s	remaining: 1m 5s
130:	learn: 0.1094779	total: 3m 38s	remaining: 48.4s
140:	learn: 0.1074918	total: 3m 55s	remaining: 31.7s
150:	learn: 0.1056195	total: 4m 12s	remaining: 15s
159:	learn: 0.1039813	total: 4m 26s	remaining: 0us


<catboost.core.CatBoostClassifier at 0x7fad6e7d5eb0>

In [39]:
predicted_model_cbc = model_cbc.predict(features_test_tf_idf)

In [43]:
f1_cbc = f1_score(target_test, predicted_model_cbc).round(2)
print('F1:', f1_cbc)

F1: 0.75
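For reference, the F1 score reported by `f1_score` is the harmonic mean of precision and recall for the positive (toxic) class. The sketch below computes it by hand on made-up labels; `f1_binary`, `y_true`, and `y_pred` are hypothetical names used only for illustration.

```python
def f1_binary(y_true, y_pred):
    # Count true positives, false positives and false negatives for class 1
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        # No correct positive predictions: F1 is defined as 0 here
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Made-up labels: 3 toxic comments, 2 of them found, plus 1 false alarm
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]

toy_f1 = f1_binary(y_true, y_pred)
print(round(toy_f1, 3))
```

Here precision and recall are both 2/3, so F1 is also 2/3. The five correctly classified negative comments do not affect the score.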


### Train RandomForestClassifier with cross_val_score

In [41]:
model_feat = RandomForestClassifier(random_state=12345)

In [42]:
%%time
scores = cross_val_score(
    model_feat, features_train_tf_idf, target_train, cv=3, n_jobs=-1, verbose=10, scoring='f1'
)

[Parallel(n_jobs=-1)]: Using backend SequentialBackend with 1 concurrent workers.


[CV] START .....................................................................
[CV] END ................................ score: (test=0.667) total time=10.4min
[CV] START .....................................................................


[Parallel(n_jobs=-1)]: Done   1 out of   1 | elapsed: 10.4min remaining:    0.0s


[CV] END ................................ score: (test=0.682) total time=10.4min
[CV] START .....................................................................


[Parallel(n_jobs=-1)]: Done   2 out of   2 | elapsed: 20.8min remaining:    0.0s


[CV] END ................................ score: (test=0.697) total time=10.4min
CPU times: user 31min, sys: 10.1 s, total: 31min 10s
Wall time: 31min 11s


[Parallel(n_jobs=-1)]: Done   3 out of   3 | elapsed: 31.2min remaining:    0.0s
[Parallel(n_jobs=-1)]: Done   3 out of   3 | elapsed: 31.2min finished


In [44]:
mean_score = scores.mean()

print('Mean model quality score:', mean_score.round(2))

Mean model quality score: 0.68


## Conclusions

### Combine the model quality scores into one table and pick the best model

In [45]:
parameters = [
    ['F1', 0.77, 0.75, 0.68]
]

# Named `columns` so the built-in `list` is not shadowed
columns = ['Metric', 'LogisticRegression + GridSearchCV', 'CatBoostClassifier', 'RandomForestClassifier + cross_val_score']

total = pd.DataFrame(data=parameters, columns=columns)

display(total)

Unnamed: 0,Metric,LogisticRegression + GridSearchCV,CatBoostClassifier,RandomForestClassifier + cross_val_score
0,F1,0.77,0.75,0.68


<div class="alert alert-info">

The best F1 score belongs to LogisticRegression + GridSearchCV: 0.77.

Parameters of the LogisticRegression model:

* 'C': 10
* 'penalty': 'l2'

</div>

<div class="alert alert-info">
Check the best model on the test set.
</div>

In [46]:
predicted_grid = grid.predict(features_valid_tf_idf)

In [47]:
f1 = f1_score(target_valid, predicted_grid).round(2)
print('F1:', f1)

F1: 0.79


<div class="alert alert-info">
On the test set, the selected model's F1 is 0.79, which is 0.02 higher than its cross-validation score of 0.77 on the training set.

We can conclude that the trained LogisticRegression model (grid) is suitable for classifying comments as positive or negative.
</div>