<h1>Содержание<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Подготовка" data-toc-modified-id="Подготовка-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Подготовка</a></span></li><li><span><a href="#Обучение" data-toc-modified-id="Обучение-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Обучение</a></span><ul class="toc-item"><li><span><a href="#Логистическая-Регрессия" data-toc-modified-id="Логистическая-Регрессия-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span>Логистическая Регрессия</a></span></li><li><span><a href="#Catboost" data-toc-modified-id="Catboost-2.2"><span class="toc-item-num">2.2&nbsp;&nbsp;</span>Catboost</a></span></li><li><span><a href="#LightGBM" data-toc-modified-id="LightGBM-2.3"><span class="toc-item-num">2.3&nbsp;&nbsp;</span>LightGBM</a></span></li></ul></li><li><span><a href="#Тестирование" data-toc-modified-id="Тестирование-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Тестирование</a></span></li><li><span><a href="#Выводы" data-toc-modified-id="Выводы-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Выводы</a></span></li>

# ML для текста
## Поиск токсичных комментариев

Интернет-магазину нужен инструмент, который будет искать токсичные комментарии и отправлять их на модерацию.

Обучим модель классифицировать комментарии на позитивные и негативные. В нашем распоряжении набор данных с разметкой о токсичности правок.

Построим модель со значением метрики качества *F1* не меньше 0.7.

Данные находятся в файле `toxic_comments.csv`. Столбец *text* в нём содержит текст комментария, а *toxic* — целевой признак.

## План работы:


*   Очистка текста
*   Лемматизация
*   Токенизация и обучение моделей с помощью pipeline
*   Выбор лучшей модели и тестирование



## Подготовка

In [None]:
!pip3 install catboost

In [None]:
import numpy as np
import pandas as pd

import re
import nltk
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
from nltk.corpus import wordnet
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')
nltk.download('punkt')

import sklearn
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.metrics import f1_score

from catboost import CatBoostClassifier
from lightgbm import LGBMClassifier

import random
seed = random.seed(0)

import warnings
warnings.filterwarnings("ignore")

from tqdm import notebook

In [6]:
df = pd.read_csv('/content/toxic_comments.csv', index_col=[0]).sample(100000).reset_index(drop=True)
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 2 columns):
 #   Column  Non-Null Count   Dtype 
---  ------  --------------   ----- 
 0   text    100000 non-null  object
 1   toxic   100000 non-null  int64 
dtypes: int64(1), object(1)
memory usage: 1.5+ MB


In [7]:
df.head()

Unnamed: 0,text,toxic
0,It's quite a funny coincidence that all of the...,0
1,That is EXACTLY why I asked so vehemently to b...,0
2,"""\n\nsimple reading of the Act would explain:\...",0
3,To the attention of mr. W. Waggel s.s.t.t.:,0
4,What rule did i break exactly?,0


In [8]:
df.toxic.value_counts(normalize=True)

0    0.89867
1    0.10133
Name: toxic, dtype: float64

Баланс классов примерно 90/10

In [9]:
corpus = list(df['text'].values)

In [10]:
def get_wordnet_pos(word):
    """Map POS tag to first character lemmatize() accepts"""
    tag = nltk.pos_tag([word])[0][1][0].upper()
    tag_dict = {"J": wordnet.ADJ,
                "N": wordnet.NOUN,
                "V": wordnet.VERB,
                "R": wordnet.ADV}

    return tag_dict.get(tag, wordnet.NOUN)

In [11]:
def lemmatize(text):
    lemmatizer = WordNetLemmatizer()
    word_list = nltk.word_tokenize(text)
    lemmatized_output = ' '.join([lemmatizer.lemmatize(w.lower(), get_wordnet_pos(w)) for w in word_list])
    return lemmatized_output

def clear_text(text):
    text = re.sub(r"[^a-zA-Z ]", ' ', text)
    text = text.split()
    return " ".join(text)

In [12]:
corpus[3]

'To the attention of mr. W. Waggel s.s.t.t.:'

In [13]:
clear_text(lemmatize(corpus[3]))

'to the attention of mr w waggel s s t t'

In [14]:
for i in notebook.tqdm(range(len(corpus))):
    corpus[i] = clear_text(lemmatize(corpus[i]))

  0%|          | 0/100000 [00:00<?, ?it/s]

In [15]:
corpus[0:3]

['it s quite a funny coincidence that all of the ip editor attack the page be from the uk and several of them have also significantly vandalizes sandpit include special contributions not funny ha ha mind you but funny',
 'that be exactly why i ask so vehemently to be block you know for a site that claim to be righteous and forgive when it come to this kind of thing you be pretty underhanded can i create a new account no because the idiot that block me in the first place put that under the block parameter what evidence be there of personal attack and harassment if you can not answer that then you can not block me it s a very simple concept to grasp',
 'simple reading of the act would explain an act to provide a national currency secure by a pledge of united state stock and to provide for the circulation and redemption thereof sec and be it far enact that every association after have comply with the provision of this act preliminary to the commencement of banking business under it provis

In [16]:
nltk.download('stopwords')
stopwords = list(stopwords.words('english'))

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [17]:
features = corpus
target = df['toxic']

features_train, features_test, target_train, target_test = train_test_split(features, target, test_size=0.1,
                                                                           stratify=target, random_state=seed)

## Обучение

### Логистическая Регрессия

In [21]:
pipe_lr = Pipeline([
    ('count_tf_idf', TfidfVectorizer(stop_words=stopwords, ngram_range=(1,2))),
    ('lr', LogisticRegression(solver='liblinear', class_weight = 'balanced', random_state=seed))
])

In [None]:
%%time

grid_lr = {"lr__C":np.logspace(-3,3,7), "lr__penalty":['l1', "l2"]} # l1 lasso l2 ridge

gs_lr = GridSearchCV(pipe_lr, grid_lr, cv=3, scoring='f1', error_score='raise')
gs_lr.fit(features_train, target_train)

print("tuned hyperparameters :(best parameters) ", gs_lr.best_params_)
print("f1 :", gs_lr.best_score_)

Для Логистической Регрессии подобрали оптимальные параметры:  
С = 100, class_weight = 'balanced', penalty = 'l2'
Результат F1 - 0.78

### Catboost

In [30]:
pipe_cb = Pipeline([
    ('count_tf_idf', TfidfVectorizer(stop_words=stopwords, ngram_range=(1,2))),
    ('cb', CatBoostClassifier(class_weights = list(target.value_counts()), iterations=100,
                              random_seed=seed, verbose=-1))
])

In [31]:
grid_cb = {
    'cb__learning_rate' : list(np.arange(0.1,1.1,0.3)),
    'cb__max_depth' : list(range(4, 13, 4))
}

In [None]:
%%time

gs_cb = GridSearchCV(
    pipe_cb,
    grid_cb,
    scoring='f1',
    n_jobs=-1,
    cv = 3
)

gs_cb.fit(features_train, target_train)

print("tuned hyperparameters :(best parameters) ", gs_cb.best_params_)
print("f1 :", gs_cb.best_score_)

Для модели catboost подобрали оптимальные параметры:  
rate - 0.7, depth - 10
Результат F1 - 0.74

### LightGBM

In [27]:
pipe_lgbm = Pipeline([
    ('count_tf_idf', TfidfVectorizer(stop_words=stopwords, ngram_range=(1,2))),
    ('lgbm', LGBMClassifier(class_weight = 'balanced', random_seed=seed, n_jobs=-1))
])

In [28]:
grid_lgbm = {
    'lgbm__learning_rate' : list(np.arange(0.1,1.1,0.3)),
    'lgbm__max_depth' : list(range(1, 11, 3))
}

In [None]:
%%time

gs_lgbm = GridSearchCV(
    pipe_lgbm,
    grid_lgbm,
    scoring='f1',
    n_jobs=-1,
    cv = 3
)

gs_lgbm.fit(features_train, target_train)

print("tuned hyperparameters :(best parameters) ", gs_lgbm.best_params_)
print("f1 :", gs_lgbm.best_score_)

Для модели lgbm подобрали оптимальные параметры:  
rate - 0.4, depth - 10
Результат F1 - 0.75

По результатам обучения, самой корректной оказалась модель логистической регрессии с показателем метрики F1 0.78

Для Логистической Регрессии подобрали оптимальные параметры:
С = 100, class_weight = 'balanced', penalty = 'l2'  
Результат F1 - 0.78

## Тестирование

In [None]:
prediction_lr = gs_lr.predict_proba(features_test)[:,1]

round(f1_score(target_test, prediction_lr), 2)

0.78

## Выводы

В ходе исследования были применены следующие методы:
* Токенезация
* Лемматизация
* Очистка текста от символов и бранных слов
* Векторизация для n-грамм. (1 леммa)

Проверены следующие модели:
* Линейная Регрессия
* CatBoostClassifier
* LGBMClassifier

**По результатам измерения f1-score рекомендуется использовать модель Логистической Регрессии.  
оптимальные параметры:
С = 100, class_weight = 'balanced', penalty = 'l2' Результат F1 - 0.78**