<h1>Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Подготовка" data-toc-modified-id="Подготовка-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Preparation</a></span></li><li><span><a href="#Обучение" data-toc-modified-id="Обучение-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Training</a></span><ul class="toc-item"><li><span><a href="#LogisticRegression" data-toc-modified-id="LogisticRegression-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span>LogisticRegression</a></span></li><li><span><a href="#RandomForestClassifier" data-toc-modified-id="RandomForestClassifier-2.2"><span class="toc-item-num">2.2&nbsp;&nbsp;</span>RandomForestClassifier</a></span></li></ul></li><li><span><a href="#Выводы" data-toc-modified-id="Выводы-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Conclusions</a></span></li><li><span><a href="#Чек-лист-проверки" data-toc-modified-id="Чек-лист-проверки-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Review checklist</a></span></li></ul></div>

# Project for «Викишоп»

The online store «Викишоп» (Wikishop) is launching a new service. Users can now edit and extend product descriptions, as in wiki communities: customers propose their own edits and comment on the changes made by others. The store needs a tool that finds toxic comments and sends them to moderation.

Train a model that classifies comments as positive or negative. You have a dataset in which edits are labeled for toxicity.

Build a model with an *F1* score of at least 0.75.

**Project instructions**

1. Load and prepare the data.
2. Train several different models.
3. Draw conclusions.

Using *BERT* is not required for this project, but you may try it.

**Data description**

The data is stored in the file `toxic_comments.csv`. The *text* column contains the comment text, and *toxic* is the target feature.
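
For reference, the F1 target from the task statement can be illustrated with a small sketch. The helper `f1_from_counts` and the counts passed to it are hypothetical and chosen only for illustration; the formula itself is the standard harmonic mean of precision and recall for the positive (toxic) class.

```python
def f1_from_counts(tp: int, fp: int, fn: int) -> float:
    """Return F1 for the positive class from confusion-matrix counts."""
    if tp == 0:
        # No true positives: precision and recall are both zero.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    # Harmonic mean of precision and recall.
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts: 60 toxic comments caught, 20 false alarms, 20 missed.
example_f1 = f1_from_counts(tp=60, fp=20, fn=20)
print(round(example_f1, 3))
```

With these illustrative counts both precision and recall equal 0.75, so the score lands exactly on the required threshold. Note that F1 treats false positives and false negatives symmetrically, and true negatives do not enter the formula at all, which is why it suits a dataset where toxic comments are rare.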

## Preparation

In [1]:
import re

import nltk
import numpy as np
import pandas as pd
from nltk.corpus import stopwords as nltk_stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Download the NLTK resources used in this notebook.
# Note: nltk.word_tokenize (used below) also relies on the 'punkt' tokenizer
# data, which is assumed to be installed in this environment already.
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')
# English stop words, later passed to TfidfVectorizer.
stopwords = set(nltk_stopwords.words('english'))

[nltk_data] Downloading package stopwords to /home/jovyan/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /home/jovyan/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /home/jovyan/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


In [2]:
df = pd.read_csv('/datasets/toxic_comments.csv')

In [3]:
print(df)

                                                     text  toxic
0       Explanation\nWhy the edits made under my usern...      0
1       D'aww! He matches this background colour I'm s...      0
2       Hey man, I'm really not trying to edit war. It...      0
3       "\nMore\nI can't make any real suggestions on ...      0
4       You, sir, are my hero. Any chance you remember...      0
...                                                   ...    ...
159566  ":::::And for the second time of asking, when ...      0
159567  You should be ashamed of yourself \n\nThat is ...      0
159568  Spitzer \n\nUmm, theres no actual article for ...      0
159569  And it looks like it was actually you who put ...      0
159570  "\nAnd ... I really don't think you understand...      0

[159571 rows x 2 columns]


In [4]:
lemmatizer = WordNetLemmatizer()


def lemmatize(text):
    """Lemmatize every token as a noun and join the tokens with spaces."""
    # All tokens are treated as nouns ('n'); no POS tagging is applied here.
    return ' '.join(lemmatizer.lemmatize(w, 'n') for w in nltk.word_tokenize(text))


def clear_text(text):
    """Keep only Latin letters and collapse whitespace into single spaces."""
    text = re.sub(r'[^a-zA-Z ]', ' ', text)
    return ' '.join(text.split())
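
To make the effect of the cleaning step concrete, here is a minimal standalone sketch of the same regular-expression approach. It re-declares `clear_text` so that it runs on its own; the sample string is the visible beginning of one of the comments printed earlier.

```python
import re


def clear_text(text: str) -> str:
    """Replace every character that is not a Latin letter or a space, then collapse whitespace."""
    letters_only = re.sub(r'[^a-zA-Z ]', ' ', text)
    return ' '.join(letters_only.split())


sample = "D'aww! He matches this background colour"
cleaned = clear_text(sample)
print(cleaned)  # D aww He matches this background colour
```

One side effect worth keeping in mind: apostrophes are removed too, so contractions such as "I'm" are split into "I m", and digits and non-Latin characters disappear entirely.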

In [5]:
# Strip everything except Latin letters before lemmatization.
df['lemm_text'] = df['text'].apply(clear_text)

In [6]:
print(df.head(5))

                                                text  toxic  \
0  Explanation\nWhy the edits made under my usern...      0   
1  D'aww! He matches this background colour I'm s...      0   
2  Hey man, I'm really not trying to edit war. It...      0   
3  "\nMore\nI can't make any real suggestions on ...      0   
4  You, sir, are my hero. Any chance you remember...      0   

                                           lemm_text  
0  Explanation Why the edits made under my userna...  
1  D aww He matches this background colour I m se...  
2  Hey man I m really not trying to edit war It s...  
3  More I can t make any real suggestions on impr...  
4  You sir are my hero Any chance you remember wh...  


In [7]:
df['lemm_text'] = df['lemm_text'].apply(lemmatize)

In [8]:
df['lemm_text'] = df['lemm_text'].str.lower()

In [9]:
print(df.head(5))

                                                text  toxic  \
0  Explanation\nWhy the edits made under my usern...      0   
1  D'aww! He matches this background colour I'm s...      0   
2  Hey man, I'm really not trying to edit war. It...      0   
3  "\nMore\nI can't make any real suggestions on ...      0   
4  You, sir, are my hero. Any chance you remember...      0   

                                           lemm_text  
0  explanation why the edits made under my userna...  
1  d aww he match this background colour i m seem...  
2  hey man i m really not trying to edit war it s...  
3  more i can t make any real suggestion on impro...  
4  you sir are my hero any chance you remember wh...  


## Training

In [10]:
train_df, test_df = train_test_split(df, test_size=0.25)
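
The split above is random and unstratified, so the reported scores can shift from run to run. A minimal sketch of a reproducible, stratified alternative follows; the toy frame, the seed value `42`, and the names `toy_df`, `train_part`, and `test_part` are assumptions made for illustration and are not part of the original notebook.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame standing in for the real data: 10% of rows are labeled toxic.
toy_df = pd.DataFrame({
    'lemm_text': [f'comment {i}' for i in range(200)],
    'toxic': [1 if i % 10 == 0 else 0 for i in range(200)],
})

train_part, test_part = train_test_split(
    toy_df,
    test_size=0.25,
    random_state=42,           # hypothetical seed; any fixed integer works
    stratify=toy_df['toxic'],  # keep the class ratio equal in both parts
)

print(train_part['toxic'].mean(), test_part['toxic'].mean())
```

Stratifying on the target keeps the share of toxic comments the same in the training and test parts, which matters when the positive class is as rare as it is here.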

In [11]:
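# Fit TF-IDF on the training texts only, so no test-set vocabulary leaks in.
corpus = train_df['lemm_text'].values.astype('U')
# stop_words is documented as a list, so convert the set explicitly.
count_tf_idf = TfidfVectorizer(stop_words=list(stopwords))
train_data = count_tf_idf.fit_transform(corpus)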

In [12]:
corpus_test = test_df['lemm_text'].values.astype('U')
test_data = count_tf_idf.transform(corpus_test)
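
The fit-on-train, transform-on-test pattern used above can also be expressed as a single scikit-learn pipeline. The sketch below is one possible arrangement, not the notebook's own method: the toy texts, the labels, and the name `text_clf` are made up for illustration, and the built-in `'english'` stop-word list is swapped in for the NLTK one to keep the example self-contained.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny illustrative corpus (hypothetical, not taken from toxic_comments.csv).
toy_train = [
    'you are a stupid idiot',
    'what a stupid edit idiot',
    'thanks for fixing the article',
    'great edit thanks for the help',
]
toy_labels = [1, 1, 0, 0]

# The vectorizer is fitted only on the data passed to fit(),
# and predict() reuses that vocabulary for new texts.
text_clf = make_pipeline(
    TfidfVectorizer(stop_words='english'),
    LogisticRegression(class_weight='balanced'),
)
text_clf.fit(toy_train, toy_labels)

toy_pred = text_clf.predict(['stupid idiot', 'thanks for the great article'])
print(toy_pred)
```

Bundling both steps into one object makes it harder to fit the vectorizer on test data by accident, and it lets the whole chain be passed to cross-validation utilities as a single estimator.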

### LogisticRegression

In [25]:
model = LogisticRegression(class_weight='balanced')
model.fit(train_data, train_df['toxic'])
predictions = model.predict(test_data)



In [26]:
# f1_score expects the arguments in the order (y_true, y_pred).
print(f1_score(test_df['toxic'], predictions))

0.7552416685058487


### RandomForestClassifier

In [22]:
test_df['toxic'].value_counts()

0    35898
1     3995
Name: toxic, dtype: int64
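
The counts above show that only about 10% of the test comments are toxic. As a rough sketch of what `class_weight='balanced'` does with such an imbalance, the helper below applies the heuristic from the scikit-learn documentation, `n_samples / (n_classes * count)`. The function name `balanced_class_weights` is hypothetical, and the test-split counts are used only as a stand-in: the models actually derive their weights from the training labels, which are assumed to have a similar ratio.

```python
def balanced_class_weights(class_counts: dict) -> dict:
    """Per-class weights following the 'balanced' heuristic: n_samples / (n_classes * count)."""
    n_samples = sum(class_counts.values())
    n_classes = len(class_counts)
    return {
        label: n_samples / (n_classes * count)
        for label, count in class_counts.items()
    }


# Counts taken from the value_counts() output of the test split.
counts = {0: 35898, 1: 3995}
toxic_share = counts[1] / sum(counts.values())
weights = balanced_class_weights(counts)

print(f'toxic share: {toxic_share:.3f}')
print({label: round(w, 3) for label, w in weights.items()})
```

The rare class receives a weight of roughly 5 while the majority class receives roughly 0.56, so each toxic example counts about nine times as much as a non-toxic one during training.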

In [24]:
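# A small, shallow forest: 20 trees with a maximum depth of 4.
model = RandomForestClassifier(n_estimators=20, max_depth=4, class_weight='balanced')
model.fit(train_data, train_df['toxic'])
predictions = model.predict(test_data)
# f1_score expects the arguments in the order (y_true, y_pred).
print(f1_score(test_df['toxic'], predictions))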

0.2595732711857596


## Conclusions

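The LogisticRegression model reaches the required minimum: F1 = 0.7552416685058487 on the test set, just above the 0.75 threshold. RandomForestClassifier with `n_estimators=20` and `max_depth=4` scores much lower, F1 = 0.2595732711857596, and does not meet the requirement.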