<h1>Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Подготовка" data-toc-modified-id="Подготовка-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Preparation</a></span></li><li><span><a href="#Обучение" data-toc-modified-id="Обучение-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Training</a></span></li><li><span><a href="#Выводы" data-toc-modified-id="Выводы-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Conclusions</a></span></li><li><span><a href="#Чек-лист-проверки" data-toc-modified-id="Чек-лист-проверки-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Review checklist</a></span></li></ul></div>

# Project for «Викишоп»

The online store «Викишоп» is launching a new service. Users can now edit and extend product descriptions, as in wiki communities: customers propose their own edits and comment on changes made by others. The store needs a tool that finds toxic comments and sends them for moderation.

Train a model to classify comments as positive or negative. You have a dataset in which edits are labeled for toxicity.

Build a model with an *F1* score of at least 0.75.

**Project instructions**

1. Load and prepare the data.
2. Train several different models.
3. Draw conclusions.

Using *BERT* is not required for this project, but you may try it.

**Data description**

The data are in the file `toxic_comments.csv`. The *text* column contains the comment text, and *toxic* is the target feature.

## Preparation

In [48]:
import re

import matplotlib.pyplot as plt
import nltk
import numpy as np
import pandas as pd
from nltk.corpus import stopwords as nltk_stopwords
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV, train_test_split
from sklearn.svm import LinearSVC

# Resources needed by later cells: the stop-word list and the tokenizer behind word_tokenize.
nltk.download('stopwords', quiet=True)
nltk.download('punkt', quiet=True)
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /home/jovyan/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package wordnet to /home/jovyan/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [49]:
!pip install nltk -q

<div class="alert alert-block alert-success">
<b>Success:</b> The imports are in place, as always.
</div>



In [40]:
RANDOM_STATE = 42

In [41]:
data = pd.read_csv('/datasets/toxic_comments.csv')

In [42]:
data.head()

|  | Unnamed: 0 | text | toxic |
|---|---|---|---|
| 0 | 0 | Explanation\nWhy the edits made under my usern... | 0 |
| 1 | 1 | D'aww! He matches this background colour I'm s... | 0 |
| 2 | 2 | Hey man, I'm really not trying to edit war. It... | 0 |
| 3 | 3 | "\nMore\nI can't make any real suggestions on ... | 0 |
| 4 | 4 | You, sir, are my hero. Any chance you remember... | 0 |


In [43]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 159292 entries, 0 to 159291
Data columns (total 3 columns):
 #   Column      Non-Null Count   Dtype 
---  ------      --------------   ----- 
 0   Unnamed: 0  159292 non-null  int64 
 1   text        159292 non-null  object
 2   toxic       159292 non-null  int64 
dtypes: int64(2), object(1)
memory usage: 3.6+ MB


Counting missing values:

In [44]:
data.isnull().sum()

Unnamed: 0    0
text          0
toxic         0
dtype: int64

Counting the values of the target column `toxic`:

In [45]:
data['toxic'].value_counts()

0    143106
1     16186
Name: toxic, dtype: int64
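
The counts above imply a strong class imbalance, which is a good reason to judge the model by F1 rather than accuracy. A minimal sketch of the arithmetic, using only the two counts printed above (the variable names are illustrative, not part of the notebook):

```python
# Class counts copied from the value_counts() output above.
counts = {0: 143106, 1: 16186}

total = sum(counts.values())
toxic_share = counts[1] / total

# A model that always predicts "not toxic" would reach this accuracy
# while finding no toxic comments at all (recall = 0, so F1 = 0).
majority_accuracy = counts[0] / total

print(f"total comments: {total}")
print(f"toxic share: {toxic_share:.3f}")
print(f"accuracy of an always-0 baseline: {majority_accuracy:.3f}")
```

Roughly one comment in ten is toxic, so a stratified split and an F1-based search, as used later in the notebook, are sensible choices.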

In [46]:
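def clear_text(text):
    """Keep only Latin letters and spaces, then lowercase the text."""
    # Replace line breaks with spaces so words on adjacent lines stay separate.
    text = re.sub(r"(?:\n|\r)", " ", text)
    # Drop digits and punctuation, including apostrophes ("I'm" becomes "im").
    text = re.sub(r"[^a-zA-Z ]+", "", text).strip()
    text = text.lower()
    return text

data['new'] = data['text'].apply(clear_text)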
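
To see what the cleaning step does, here is a small self-contained check. It repeats the same two substitutions as `clear_text`; the sample strings are made up for illustration.

```python
import re

def clear_text(text):
    # Same steps as the notebook function: line breaks to spaces,
    # drop everything except Latin letters and spaces, lowercase.
    text = re.sub(r"(?:\n|\r)", " ", text)
    text = re.sub(r"[^a-zA-Z ]+", "", text).strip()
    return text.lower()

sample = "D'aww!\nHe matches this background colour, doesn't he?"
cleaned = clear_text(sample)
print(cleaned)

# Punctuation is deleted rather than replaced by a space, so words
# separated only by punctuation are glued together.
glued = clear_text("edit,war")
print(glued)
```

The second call shows a side effect worth knowing about: because punctuation is removed rather than replaced with a space, `"edit,war"` becomes a single token.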

In [51]:
stop_words = set(nltk_stopwords.words('english'))

def stopwords_remove(text):
    """Return the list of tokens in `text` that are not English stop words."""
    return [word for word in text.split() if word not in stop_words]

data['stopwords'] = data['new'].apply(stopwords_remove)
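
Note that the `stopwords` column is not used in the later training cells, which vectorize the `lemmatized` column instead. If stop-word removal is wanted, one option is to hand the list to the vectorizer itself. The sketch below assumes a short hand-written list as a stand-in for the NLTK one, and the two documents are toy examples.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["you sir be my hero", "he match this background colour"]

# Illustrative stand-in for list(stop_words) built from NLTK above.
stop_list = ["you", "be", "my", "he", "this"]

vectorizer = TfidfVectorizer(stop_words=stop_list)
matrix = vectorizer.fit_transform(docs)

# Only the non-stop words end up in the vocabulary.
vocab = sorted(vectorizer.vocabulary_)
print(vocab)
```

Passing a list rather than a set is deliberate here, since the `stop_words` parameter is documented as accepting `'english'`, a list, or `None`.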

In [52]:
def get_wordnet_pos(word):
    """Map POS tag to first character lemmatize() accepts"""
    tag = nltk.pos_tag([word])[0][1][0].upper()
    tag_dict = {"J": wordnet.ADJ,
                "N": wordnet.NOUN,
                "V": wordnet.VERB,
                "R": wordnet.ADV}
    return tag_dict.get(tag, wordnet.NOUN)

Lemmatization:

In [53]:
lemmatizer = WordNetLemmatizer()

def lemmatize_text(text):
    """Lemmatize each token using its part-of-speech tag and rejoin with spaces."""
    tokens = nltk.word_tokenize(text)
    return ' '.join(lemmatizer.lemmatize(token, get_wordnet_pos(token)) for token in tokens)

data['lemmatized'] = data['new'].apply(lemmatize_text)
data.head()

|  | Unnamed: 0 | text | toxic | new | stopwords | lemmatized |
|---|---|---|---|---|---|---|
| 0 | 0 | Explanation\nWhy the edits made under my usern... | 0 | explanation why the edits made under my userna... | [explanation, edits, made, username, hardcore,... | explanation why the edits make under my userna... |
| 1 | 1 | D'aww! He matches this background colour I'm s... | 0 | daww he matches this background colour im seem... | [daww, matches, background, colour, im, seemin... | daww he match this background colour im seemin... |
| 2 | 2 | Hey man, I'm really not trying to edit war. It... | 0 | hey man im really not trying to edit war its j... | [hey, man, im, really, trying, edit, war, guy,... | hey man im really not try to edit war it just ... |
| 3 | 3 | "\nMore\nI can't make any real suggestions on ... | 0 | more i cant make any real suggestions on impro... | [cant, make, real, suggestions, improvement, w... | more i cant make any real suggestion on improv... |
| 4 | 4 | You, sir, are my hero. Any chance you remember... | 0 | you sir are my hero any chance you remember wh... | [sir, hero, chance, remember, page, thats] | you sir be my hero any chance you remember wha... |
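
Because `get_wordnet_pos` tags each word in isolation, the lemma of a word depends only on the word itself, so repeated words can be cached instead of being tagged again. The sketch below illustrates the idea with `functools.lru_cache`. The `slow_lemma` function and its small lookup table are hypothetical stand-ins for `lemmatizer.lemmatize(word, get_wordnet_pos(word))`, so that the example runs without NLTK data.

```python
from functools import lru_cache

calls = {"n": 0}

# Toy lookup table, for illustration only.
TOY_LEMMAS = {"matches": "match", "edits": "edit", "made": "make"}

def slow_lemma(word):
    """Stand-in for the expensive POS-tag-and-lemmatize call."""
    calls["n"] += 1
    return TOY_LEMMAS.get(word, word)

@lru_cache(maxsize=None)
def cached_lemma(word):
    # Each distinct word reaches slow_lemma only once.
    return slow_lemma(word)

def lemmatize_text(text):
    return " ".join(cached_lemma(word) for word in text.split())

texts = ["he matches edits", "edits made by edits", "he made edits"]
result = [lemmatize_text(t) for t in texts]
print(result)
print("expensive calls:", calls["n"])
```

The three toy texts contain ten tokens but only five distinct words, so the expensive function runs five times. On a corpus of this size the saving could be substantial, though the actual speed-up was not measured here.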


The `lemmatized` column now holds the processed text:

In [54]:
data.head()

|  | Unnamed: 0 | text | toxic | new | stopwords | lemmatized |
|---|---|---|---|---|---|---|
| 0 | 0 | Explanation\nWhy the edits made under my usern... | 0 | explanation why the edits made under my userna... | [explanation, edits, made, username, hardcore,... | explanation why the edits make under my userna... |
| 1 | 1 | D'aww! He matches this background colour I'm s... | 0 | daww he matches this background colour im seem... | [daww, matches, background, colour, im, seemin... | daww he match this background colour im seemin... |
| 2 | 2 | Hey man, I'm really not trying to edit war. It... | 0 | hey man im really not trying to edit war its j... | [hey, man, im, really, trying, edit, war, guy,... | hey man im really not try to edit war it just ... |
| 3 | 3 | "\nMore\nI can't make any real suggestions on ... | 0 | more i cant make any real suggestions on impro... | [cant, make, real, suggestions, improvement, w... | more i cant make any real suggestion on improv... |
| 4 | 4 | You, sir, are my hero. Any chance you remember... | 0 | you sir are my hero any chance you remember wh... | [sir, hero, chance, remember, page, thats] | you sir be my hero any chance you remember wha... |


## Training

Splitting the data into training and test sets:

In [64]:
X_train, X_test, y_train, y_test = train_test_split(data['lemmatized'], data['toxic'], test_size = 0.2, stratify = data['toxic'], random_state=RANDOM_STATE)

In [65]:
print(X_train.shape, y_train.shape)

(127433,) (127433,)


In [66]:
print(X_test.shape, y_test.shape)

(31859,) (31859,)
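
The `stratify` argument is what keeps the share of toxic comments the same in both parts. A small check on synthetic labels (made up here with roughly the notebook's 10% positive share) shows the effect:

```python
from sklearn.model_selection import train_test_split

# Synthetic data for illustration: 100 positives out of 1000.
labels = [1] * 100 + [0] * 900
texts = [f"comment {i}" for i in range(len(labels))]

X_tr, X_te, y_tr, y_te = train_test_split(
    texts, labels, test_size=0.2, stratify=labels, random_state=42
)

train_share = sum(y_tr) / len(y_tr)
test_share = sum(y_te) / len(y_te)
print(len(y_tr), len(y_te))
print(f"positive share: train={train_share:.3f}, test={test_share:.3f}")
```

The same check can be run on the real `y_train` and `y_test` with `y_train.mean()` and `y_test.mean()`.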


Vectorization with TfidfVectorizer:

In [67]:
tf_idf = TfidfVectorizer(ngram_range=(1, 1), min_df=3, max_df=0.9, strip_accents='unicode', use_idf=True,
                         smooth_idf=True, sublinear_tf=True)

In [68]:
train_tf_idf = tf_idf.fit_transform(X_train)
test_tf_idf = tf_idf.transform(X_test)
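
The vectorizer is fitted on the training texts only and then reused on the test texts, so the test matrix has the same columns and words unseen in training are simply ignored. A toy illustration with made-up documents:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

train_docs = ["edit war again", "nice edit", "war be bad"]
test_docs = ["unseen word edit"]

vec = TfidfVectorizer(sublinear_tf=True)
train_matrix = vec.fit_transform(train_docs)  # learns vocabulary and IDF weights
test_matrix = vec.transform(test_docs)        # reuses them, learns nothing new

print(train_matrix.shape, test_matrix.shape)
# Only "edit" is in the training vocabulary, so one non-zero entry remains.
print(test_matrix.nnz)
```

Calling `fit_transform` on the test set instead would build a different vocabulary and make the features incompatible with the trained model.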

Logistic regression:

In [69]:
parameters = {'C': np.linspace(10, 20, num = 11, endpoint = True), 'max_iter': [500]}

logistic_regression = LogisticRegression(random_state = RANDOM_STATE)

clf = RandomizedSearchCV(logistic_regression, parameters, cv = 5, scoring = 'f1', n_jobs = -1, verbose = 2)


<div class="alert alert-block alert-info">
<b>Tip: </b> A reminder: cross-validation internally splits the data into training and validation folds. Here, however, the vectorizer has already been fitted on the whole training sample, so information from each validation fold leaks into the features, which is not quite correct. To avoid this effect, you can use a <a href="https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html">pipeline</a>.
</div>
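
One way to follow the reviewer's tip is to put the vectorizer and the classifier into a single `Pipeline` and search over that, so the vectorizer is refitted on the training part of every fold. This is a minimal sketch on a tiny made-up corpus, not the notebook's actual run. The step names `tfidf` and `model` and the two-value grid for `C` are arbitrary choices for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Tiny toy corpus, for illustration only.
texts = [
    "you be an idiot", "thank you for the edit",
    "stupid idiot go away", "nice work on the page",
    "shut up you moron", "please see the talk page",
    "what a stupid moron", "i add a source to the article",
]
labels = [1, 0, 1, 0, 1, 0, 1, 0]

pipe = Pipeline([
    ("tfidf", TfidfVectorizer(sublinear_tf=True)),
    ("model", LogisticRegression(max_iter=500, random_state=42)),
])

# Parameters of a step are addressed as <step name>__<parameter>.
search = GridSearchCV(pipe, {"model__C": [1.0, 10.0]}, cv=2, scoring="f1")
search.fit(texts, labels)  # raw texts go in; vectorization happens per fold

print(search.best_params_)
prediction = search.best_estimator_.predict(["you stupid idiot"])
print(prediction)
```

With this setup the search is fitted on `X_train` directly rather than on `train_tf_idf`, and `predict` accepts raw text as well.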


In [70]:
clf.fit(train_tf_idf, y_train)

Fitting 5 folds for each of 10 candidates, totalling 50 fits
[CV] END ...............................C=12.0, max_iter=500; total time= 1.6min
[CV] END ...............................C=12.0, max_iter=500; total time= 1.6min
[CV] END ...............................C=12.0, max_iter=500; total time= 1.6min
[CV] END ...............................C=12.0, max_iter=500; total time= 1.8min
[CV] END ...............................C=12.0, max_iter=500; total time= 1.5min
[CV] END ...............................C=15.0, max_iter=500; total time= 1.9min
[CV] END ...............................C=15.0, max_iter=500; total time= 1.9min
[CV] END ...............................C=15.0, max_iter=500; total time= 1.8min
[CV] END ...............................C=15.0, max_iter=500; total time= 1.9min
[CV] END ...............................C=15.0, max_iter=500; total time= 1.8min
[CV] END ...............................C=14.0, max_iter=500; total time= 1.5min
[CV] END ...............................C=14.0, 

RandomizedSearchCV(cv=5, estimator=LogisticRegression(random_state=42),
                   n_jobs=-1,
                   param_distributions={'C': array([10., 11., 12., 13., 14., 15., 16., 17., 18., 19., 20.]),
                                        'max_iter': [500]},
                   scoring='f1', verbose=2)

In [77]:
print(f"Best F1: {clf.best_score_:.3f}")
print(f"Best parameters: {clf.best_params_}")

Best F1: 0.784
Best parameters: {'max_iter': 500, 'C': 17.0}


In [78]:
result = f1_score(y_test, clf.predict(test_tf_idf))

In [79]:
result

0.7951519289860021
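
For reference, F1 is the harmonic mean of precision and recall for the positive (toxic) class. The sketch below computes it by hand on ten made-up labels. The helper name `f1_from_labels` is illustrative, and in the notebook `sklearn.metrics.f1_score` does this work.

```python
def f1_from_labels(y_true, y_pred):
    """F1 for the positive class (label 1), computed from raw counts."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Made-up labels: 3 true positives, 1 false positive, 1 false negative.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

score = f1_from_labels(y_true, y_pred)
print(score)
```

Here precision and recall are both 3/4, so F1 is 0.75, which happens to match the project's minimum threshold.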

Linear Support Vector Classification:

In [80]:
parameters = {'C': np.linspace(1, 20, num = 5, endpoint = True)}

linear_support_vector_classification = LinearSVC(random_state = RANDOM_STATE)

clf = RandomizedSearchCV(linear_support_vector_classification, parameters, cv=5, scoring='f1', n_jobs=-1,verbose=2)

In [81]:
clf.fit(train_tf_idf, y_train)



Fitting 5 folds for each of 5 candidates, totalling 25 fits
[CV] END ..............................................C=1.0; total time=   1.0s
[CV] END ..............................................C=1.0; total time=   1.0s
[CV] END ..............................................C=1.0; total time=   1.0s
[CV] END ..............................................C=1.0; total time=   1.0s
[CV] END ..............................................C=1.0; total time=   1.0s
[CV] END .............................................C=5.75; total time=   4.2s
[CV] END .............................................C=5.75; total time=   3.0s
[CV] END .............................................C=5.75; total time=   4.6s
[CV] END .............................................C=5.75; total time=   3.2s
[CV] END .............................................C=5.75; total time=   3.2s
[CV] END .............................................C=10.5; total time=   5.4s
[CV] END ........................................
[CV] END ............................................C=15.25; total time=   8.9s
[CV] END ............................................C=15.25; total time=   7.2s
[CV] END .............................................C=20.0; total time=   9.1s
[CV] END .............................................C=20.0; total time=   8.8s
[CV] END .............................................C=20.0; total time=   7.6s
[CV] END .............................................C=20.0; total time=   9.2s
[CV] END .............................................C=20.0; total time=   9.3s


RandomizedSearchCV(cv=5, estimator=LinearSVC(random_state=42), n_jobs=-1,
                   param_distributions={'C': array([ 1.  ,  5.75, 10.5 , 15.25, 20.  ])},
                   scoring='f1', verbose=2)

In [82]:
print(f'F1: {clf.best_score_:.3f}')
print(f'Parameters: {clf.best_params_}')

F1: 0.789
Parameters: {'C': 1.0}


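## Conclusions

**Overall summary:**
- The data were loaded and inspected: 159,292 comments, no missing values, and 16,186 of them (about 10%) labeled toxic.
- A new column, `lemmatized`, was added with the cleaned and lemmatized text produced by `lemmatize_text`.
- A logistic regression model was trained. Best cross-validated F1: 0.784, best parameters: {'max_iter': 500, 'C': 17.0}.
- F1 on the test set: 0.795, above the required 0.75.
- `LinearSVC` reached a cross-validated F1 of 0.789 with {'C': 1.0}; it was not evaluated on the test set.

## Review checklist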