<h1>Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Подготовка" data-toc-modified-id="Подготовка-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Preparation</a></span><ul class="toc-item"><li><span><a href="#Загрузка-данных" data-toc-modified-id="Загрузка-данных-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>Loading the data</a></span></li><li><span><a href="#Лемматизация" data-toc-modified-id="Лемматизация-1.2"><span class="toc-item-num">1.2&nbsp;&nbsp;</span>Lemmatization</a></span></li><li><span><a href="#Разделение-на-выборки" data-toc-modified-id="Разделение-на-выборки-1.3"><span class="toc-item-num">1.3&nbsp;&nbsp;</span>Splitting into samples</a></span></li><li><span><a href="#Кодирование-текстов-TF-IDF" data-toc-modified-id="Кодирование-текстов-TF-IDF-1.4"><span class="toc-item-num">1.4&nbsp;&nbsp;</span>TF-IDF text encoding</a></span></li></ul></li><li><span><a href="#Обучение" data-toc-modified-id="Обучение-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Training</a></span><ul class="toc-item"><li><span><a href="#Логистическая-регрессия" data-toc-modified-id="Логистическая-регрессия-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span>Logistic regression</a></span></li><li><span><a href="#Модель-градиентного-бустинга" data-toc-modified-id="Модель-градиентного-бустинга-2.2"><span class="toc-item-num">2.2&nbsp;&nbsp;</span>Gradient boosting model</a></span></li><li><span><a href="#Случайный-лес" data-toc-modified-id="Случайный-лес-2.3"><span class="toc-item-num">2.3&nbsp;&nbsp;</span>Random forest</a></span></li></ul></li><li><span><a href="#Выводы" data-toc-modified-id="Выводы-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Conclusions</a></span></li><li><span><a href="#Чек-лист-проверки" data-toc-modified-id="Чек-лист-проверки-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Review checklist</a></span></li></ul></div>

# Project for Wikishop

The online store Wikishop is launching a new service: users can now edit and extend product descriptions, as in wiki communities. Customers propose their own edits and comment on changes made by others. The store needs a tool that finds toxic comments and sends them to moderation.

Train a model to classify comments as positive or negative. You have a dataset in which edits are labeled for toxicity.

Build a model with an *F1* score of at least 0.75.

**Project instructions**

1. Load and prepare the data.
2. Train several different models.
3. Draw conclusions.

Using *BERT* is optional for this project, but you are welcome to try it.

**Data description**

The data is stored in the file `toxic_comments.csv`. The *text* column contains the comment text, and *toxic* is the target feature.

In [1]:
import re

import pandas as pd
import spacy
from lightgbm import LGBMClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from tqdm.notebook import tqdm

## Preparation

### Loading the data

In [2]:
data = pd.read_csv("/datasets/toxic_comments.csv", index_col=0)
data.head(10)

Unnamed: 0,text,toxic
0,Explanation\nWhy the edits made under my usern...,0
1,D'aww! He matches this background colour I'm s...,0
2,"Hey man, I'm really not trying to edit war. It...",0
3,"""\nMore\nI can't make any real suggestions on ...",0
4,"You, sir, are my hero. Any chance you remember...",0
5,"""\n\nCongratulations from me as well, use the ...",0
6,COCKSUCKER BEFORE YOU PISS AROUND ON MY WORK,1
7,Your vandalism to the Matt Shirvington article...,0
8,Sorry if the word 'nonsense' was offensive to ...,0
9,alignment on this subject and which are contra...,0


In [3]:
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 159292 entries, 0 to 159450
Data columns (total 2 columns):
 #   Column  Non-Null Count   Dtype 
---  ------  --------------   ----- 
 0   text    159292 non-null  object
 1   toxic   159292 non-null  int64 
dtypes: int64(1), object(1)
memory usage: 3.6+ MB


### Lemmatization

In [4]:
nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])

In [5]:
def clear_text(text):
    """Keep only Latin letters and collapse runs of whitespace into single spaces."""
    return " ".join(re.sub(r'[^a-zA-Z ]', ' ', text).split())
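
To make the effect of the cleaning step concrete, here is a small standalone sketch that applies the same regular expression to a made-up string (the sample text is an assumption, loosely based on a row of the dataset). It shows that punctuation, digits and line breaks all turn into single spaces, so contractions such as `D'aww` are split into separate tokens.

```python
import re


def clear_text(text):
    # Same regex as in the notebook cell: keep Latin letters and spaces only,
    # then collapse any run of whitespace into a single space.
    return " ".join(re.sub(r"[^a-zA-Z ]", " ", text).split())


# Hypothetical sample with an apostrophe, a line break, digits and punctuation
sample = "D'aww! He matches this\nbackground colour, 100%."
cleaned = clear_text(sample)
print(cleaned)
```

Letter case is left untouched at this stage.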

In [6]:
def lemmatize(text):
    """Replace every token with its spaCy lemma and join the lemmas with spaces."""
    doc = nlp(text)
    return " ".join(token.lemma_ for token in doc)

In [7]:
data['clear'] = data['text'].apply(clear_text)

In [8]:
tqdm.pandas()  # enables progress_apply with a progress bar
data['lemm'] = data['clear'].progress_apply(lemmatize)


In [9]:
data.head(10)

Unnamed: 0,text,toxic,clear,lemm
0,Explanation\nWhy the edits made under my usern...,0,Explanation Why the edits made under my userna...,explanation why the edit make under my usernam...
1,D'aww! He matches this background colour I'm s...,0,D aww He matches this background colour I m se...,d aww he match this background colour I m seem...
2,"Hey man, I'm really not trying to edit war. It...",0,Hey man I m really not trying to edit war It s...,hey man I m really not try to edit war it s ju...
3,"""\nMore\nI can't make any real suggestions on ...",0,More I can t make any real suggestions on impr...,More I can t make any real suggestion on impro...
4,"You, sir, are my hero. Any chance you remember...",0,You sir are my hero Any chance you remember wh...,you sir be my hero any chance you remember wha...
5,"""\n\nCongratulations from me as well, use the ...",0,Congratulations from me as well use the tools ...,congratulation from I as well use the tool wel...
6,COCKSUCKER BEFORE YOU PISS AROUND ON MY WORK,1,COCKSUCKER BEFORE YOU PISS AROUND ON MY WORK,COCKSUCKER before you pis around on my work
7,Your vandalism to the Matt Shirvington article...,0,Your vandalism to the Matt Shirvington article...,your vandalism to the Matt Shirvington article...
8,Sorry if the word 'nonsense' was offensive to ...,0,Sorry if the word nonsense was offensive to yo...,sorry if the word nonsense be offensive to you...
9,alignment on this subject and which are contra...,0,alignment on this subject and which are contra...,alignment on this subject and which be contrar...


### Splitting into samples

In [10]:
features = data['lemm']
target = data['toxic']

# Hold out 20% of the rows, then split that part in half: 80% train, 10% validation, 10% test
features_train, features_valid, target_train, target_valid = train_test_split(
    features, target, test_size=0.2, random_state=12345)
features_valid, features_test, target_valid, target_test = train_test_split(
    features_valid, target_valid, test_size=0.5, random_state=12345)
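
The two calls above give roughly an 80/10/10 split. The arithmetic can be sketched without scikit-learn, under the assumption (my understanding of `train_test_split`, not checked in this notebook) that a fractional `test_size` is rounded up to a whole number of rows:

```python
import math

n_total = 159292  # number of rows reported by data.info()

# First call: test_size=0.2 holds out 20% of the rows (assumed to be rounded up)
n_holdout = math.ceil(0.2 * n_total)
n_train = n_total - n_holdout

# Second call: test_size=0.5 splits the hold-out part into validation and test
n_test = math.ceil(0.5 * n_holdout)
n_valid = n_holdout - n_test

print(n_train, n_valid, n_test)
```

Under that assumption, training gets about 127 thousand comments and validation and test get about 16 thousand each.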

### TF-IDF text encoding

In [11]:
# Fit the vocabulary and IDF weights on the training texts only
count_tf_idf = TfidfVectorizer().fit(features_train.values)
tf_idf = count_tf_idf.transform(features_train.values)

In [12]:
tf_idf1 = count_tf_idf.transform(features_valid.values)

In [13]:
tf_idf2 = count_tf_idf.transform(features_test.values)
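
For intuition about the encoding, here is a pure-Python sketch of the weighting that `TfidfVectorizer` applies with its default settings, as far as I understand them: raw term counts, a smoothed IDF of `ln((1 + n) / (1 + df)) + 1`, and L2 normalization of each document vector. The function names and the toy documents are hypothetical, and tokenization is skipped. As in the cells above, the IDF weights are learned from the training documents only and then reused for new texts.

```python
import math
from collections import Counter


def tfidf_fit(docs):
    """Learn smoothed IDF weights from tokenized training documents."""
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    return {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}


def tfidf_transform(doc, idf):
    """Weight raw term counts by IDF, then L2-normalize the vector."""
    counts = Counter(t for t in doc if t in idf)  # terms unseen in training are dropped
    weights = {t: c * idf[t] for t, c in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()} if norm else {}


# Hypothetical, already lemmatized training documents
train_docs = [["you", "be", "my", "hero"], ["you", "be", "rude"], ["thank", "you"]]
idf = tfidf_fit(train_docs)

# A new document is encoded with the IDF weights learned from the training set
vec = tfidf_transform(["you", "be", "rude", "unseen"], idf)
print(vec)
```

Rare terms such as `rude` end up with a larger weight than terms that occur in every document, such as `you`.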

## Training

### Logistic regression

In [20]:
# With max_iter=50 the solver stops at the iteration limit (see the warning below)
model = LogisticRegression(max_iter=50, class_weight='balanced', C=8)
model.fit(tf_idf, target_train)
prediction = model.predict(tf_idf1)


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


In [21]:
f1_score(target_valid, prediction)

0.7729306487695748

### Gradient boosting model

In [24]:
booster = LGBMClassifier(objective="binary", learning_rate=0.2, seed=12345, n_estimators=50, class_weight='balanced')

In [25]:
# eval_metric="f1" is passed here, but the log below reports only binary_logloss
booster.fit(tf_idf, target_train, eval_set=[(tf_idf1, target_valid)], eval_metric="f1")

[1]	valid_0's binary_logloss: 0.626482
[2]	valid_0's binary_logloss: 0.576613
[3]	valid_0's binary_logloss: 0.539092
[4]	valid_0's binary_logloss: 0.502195
[5]	valid_0's binary_logloss: 0.480376
[6]	valid_0's binary_logloss: 0.450985
[7]	valid_0's binary_logloss: 0.431643
[8]	valid_0's binary_logloss: 0.415956
[9]	valid_0's binary_logloss: 0.398108
[10]	valid_0's binary_logloss: 0.387916
[11]	valid_0's binary_logloss: 0.379443
[12]	valid_0's binary_logloss: 0.362013
[13]	valid_0's binary_logloss: 0.3523
[14]	valid_0's binary_logloss: 0.346669
[15]	valid_0's binary_logloss: 0.337789
[16]	valid_0's binary_logloss: 0.327823
[17]	valid_0's binary_logloss: 0.32179
[18]	valid_0's binary_logloss: 0.315627
[19]	valid_0's binary_logloss: 0.308467
[20]	valid_0's binary_logloss: 0.30336
[21]	valid_0's binary_logloss: 0.298813
[22]	valid_0's binary_logloss: 0.293053
[23]	valid_0's binary_logloss: 0.288369
[24]	valid_0's binary_logloss: 0.28401
[25]	valid_0's binary_logloss:

LGBMClassifier(class_weight='balanced', learning_rate=0.2, n_estimators=50,
               objective='binary', seed=12345)

In [26]:
valid_preds = booster.predict(tf_idf1)

In [27]:
f1_score(target_valid, valid_preds)

0.7149883510225213

### Random forest

In [22]:
model1 = RandomForestClassifier(n_estimators=50, max_depth=10, random_state=12345)
model1.fit(tf_idf, target_train)
preds = model1.predict(tf_idf1)

In [23]:
f1_score(target_valid, preds)

0.0
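
An F1 of exactly 0.0 means the model produced no true positives on the validation set. One possible explanation, not verified in this notebook, is that the shallow forest, trained without class weighting, predicts the majority (non-toxic) class for every comment. The sketch below uses a hand-written F1 for the positive class and hypothetical labels to show how such a classifier can look accurate and still score zero.

```python
def f1_binary(y_true, y_pred):
    """F1 for the positive class (label 1); 0.0 when there are no true positives."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical labels: 1 toxic comment out of 10
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]
all_clean = [0] * 10  # a classifier that never flags anything

accuracy = sum(t == p for t, p in zip(y_true, all_clean)) / len(y_true)
print(accuracy)                      # 0.9
print(f1_binary(y_true, all_clean))  # 0.0
```

Checking how many validation comments the forest actually labels as toxic, for example with `preds.sum()`, would confirm or rule out this reading.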

Logistic regression (F1 ≈ 0.773) and gradient boosting (F1 ≈ 0.715) were the two strongest models on the validation set, while the random forest scored 0.0. Only logistic regression clears the required 0.75, so we continue with it.

## Conclusions

In [28]:
f1_score(target_test, model.predict(tf_idf2))

0.7703415717856151

**Conclusion:** The comment data was loaded, cleaned, lemmatized and encoded with TF-IDF. Several models were then trained and compared on the validation set, and the best one was checked on the test set. Logistic regression gave the best result: F1 ≈ 0.770 on the test set, which is above the required 0.75.