The online store Викишоп (Wikishop) is launching a new service. Users can now edit and extend product descriptions, as in wiki communities: customers propose their own edits and comment on changes made by others. The store needs a tool that finds toxic comments and sends them to moderation.

Train a model to classify comments as positive or negative. You have a dataset in which edits are labeled for toxicity.

Build a model with an *F1* score of at least 0.75.

### Project instructions

1. Load and prepare the data.
2. Train several different models.
3. Draw conclusions.

Using *BERT* is not required for this project, but you are welcome to try it.

### Data description

The data is stored in the file `toxic_comments.csv`. The *text* column contains the comment text, and *toxic* is the target.

# 1. Preparation

Load the data.

In [1]:
import pandas as pd
data = pd.read_csv("/datasets/toxic_comments.csv")
data.head()

Unnamed: 0,text,toxic
0,Explanation\nWhy the edits made under my usern...,0
1,D'aww! He matches this background colour I'm s...,0
2,"Hey man, I'm really not trying to edit war. It...",0
3,"""\nMore\nI can't make any real suggestions on ...",0
4,"You, sir, are my hero. Any chance you remember...",0


In [2]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 159571 entries, 0 to 159570
Data columns (total 2 columns):
text     159571 non-null object
toxic    159571 non-null int64
dtypes: int64(1), object(1)
memory usage: 2.4+ MB


Prepare functions that lemmatize sentences with the nltk library.

In [3]:
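import re

import nltk
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer

# POS tagger used by get_wordnet_pos. word_tokenize and WordNetLemmatizer also
# rely on the 'punkt' and 'wordnet' resources, assumed to be installed already.
nltk.download('averaged_perceptron_tagger')

def get_wordnet_pos(word):
    """Map POS tag to first character lemmatize() accepts"""
    # The word is tagged on its own, without sentence context
    tag = nltk.pos_tag([word])[0][1][0].upper()
    tag_dict = {"J": wordnet.ADJ,
                "N": wordnet.NOUN,
                "V": wordnet.VERB,
                "R": wordnet.ADV}
    # Fall back to NOUN for any tag outside the four mapped classes
    return tag_dict.get(tag, wordnet.NOUN)

lemmatizer = WordNetLemmatizer()

def lemm_sent(sentence):
    """Tokenize the sentence and lemmatize every token using its POS tag"""
    return ' '.join(lemmatizer.lemmatize(w, get_wordnet_pos(w))
                    for w in nltk.word_tokenize(sentence))


def clear_text(text):
    """Replace every non-Latin-letter character with a space and collapse whitespace"""
    text = re.sub(r'[^a-zA-Z]', ' ', text)
    text = " ".join(text.split())
    return text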

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /home/jovyan/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
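To make the effect of `clear_text` concrete, here is a minimal sketch that applies the same regular expression to a made-up string. The sample text is an illustrative assumption; the function body repeats the one defined above so that the snippet runs on its own.

```python
import re

def clear_text(text):
    # Same logic as in the cell above: keep Latin letters only,
    # then collapse runs of whitespace into single spaces
    text = re.sub(r'[^a-zA-Z]', ' ', text)
    return " ".join(text.split())

sample = "D'aww! He matches this background colour I'm"
cleaned = clear_text(sample)
print(cleaned)
```

Apostrophes and punctuation turn into spaces, so contractions such as `I'm` are split into separate fragments (`I m`). Letter case is left unchanged by this step.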


Add the lemmatized text.

In [4]:
%%time
# Lemmatize first, then strip non-letter characters, row by row
for i in range(len(data)):
    data.loc[i, 'lemm_text'] = clear_text(lemm_sent(data.loc[i, 'text']))

CPU times: user 1h 15min 19s, sys: 3min 30s, total: 1h 18min 49s
Wall time: 1h 19min 27s
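Because `get_wordnet_pos` tags each word in isolation, the lemma of a word depends only on the word itself, so repeated words could in principle be served from a cache. The sketch below shows the idea with `functools.lru_cache`. Here `slow_lemma` is a hypothetical stand-in for `lemmatizer.lemmatize(w, get_wordnet_pos(w))`, and whether caching shortens the run above was not measured.

```python
from functools import lru_cache

calls = {"count": 0}

def slow_lemma(word):
    """Hypothetical stand-in for the expensive per-word lemmatization."""
    calls["count"] += 1
    return word.lower()

@lru_cache(maxsize=None)
def cached_lemma(word):
    # The expensive function runs only the first time a word is seen
    return slow_lemma(word)

def lemm_tokens(tokens):
    return " ".join(cached_lemma(w) for w in tokens)

tokens = "the edit war and the other edit war".split()
result = lemm_tokens(tokens)
print(result, calls["count"])
```

In this toy run the eight tokens contain five distinct words, so the stand-in function is called five times instead of eight.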


In [11]:
data.head(15)

Unnamed: 0,text,toxic,lemm_text
0,Explanation\nWhy the edits made under my usern...,0,Explanation Why the edits make under my userna...
1,D'aww! He matches this background colour I'm s...,0,D aww He match this background colour I m seem...
2,"Hey man, I'm really not trying to edit war. It...",0,Hey man I m really not try to edit war It s ju...
3,"""\nMore\nI can't make any real suggestions on ...",0,More I ca n t make any real suggestion on impr...
4,"You, sir, are my hero. Any chance you remember...",0,You sir be my hero Any chance you remember wha...
5,"""\n\nCongratulations from me as well, use the ...",0,Congratulations from me a well use the tool we...
6,COCKSUCKER BEFORE YOU PISS AROUND ON MY WORK,1,COCKSUCKER BEFORE YOU PISS AROUND ON MY WORK
7,Your vandalism to the Matt Shirvington article...,0,Your vandalism to the Matt Shirvington article...
8,Sorry if the word 'nonsense' was offensive to ...,0,Sorry if the word nonsense be offensive to you...
9,alignment on this subject and which are contra...,0,alignment on this subject and which be contrar...


Split the data into training and test sets.

In [6]:
from sklearn.model_selection import train_test_split
df_train, df_test = train_test_split(data, test_size=0.2, random_state=12345)

Extract the features.

In [7]:
from nltk.corpus import stopwords as nltk_stopwords
from sklearn.feature_extraction.text import TfidfVectorizer

nltk.download('stopwords')
# Keep the stop words as a list: recent scikit-learn releases validate
# stop_words and reject a set
stopwords = list(nltk_stopwords.words('english'))

# Fit TF-IDF on the training texts only, then reuse the fitted vocabulary for the test texts
count_tf_idf = TfidfVectorizer(stop_words=stopwords)
features_train = count_tf_idf.fit_transform(df_train['lemm_text'])
features_test = count_tf_idf.transform(df_test['lemm_text'])
target_train = df_train['toxic']
target_test = df_test['toxic']

[nltk_data] Downloading package stopwords to /home/jovyan/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
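The vectorizer is fitted on the training texts only and then reused on the test texts, so a word that occurs only in the test set gets no column of its own. A minimal sketch on made-up sentences (the toy documents and variable names are illustrative assumptions, and stop words are left out for brevity):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

train_docs = ["you be my hero", "stop edit war", "thank you for the edit"]
test_docs = ["you be a vandal"]

vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_docs)  # learns vocabulary and IDF weights
X_test = vectorizer.transform(test_docs)        # reuses them, learns nothing new

vocab = vectorizer.vocabulary_
print(X_train.shape, X_test.shape, "vandal" in vocab)
```

The test matrix has the same number of columns as the training matrix. The unseen word `vandal` is dropped, and the single-letter token `a` is ignored by the default token pattern.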


# 2. Training

Train a logistic regression model.

In [12]:
from sklearn.metrics import f1_score
from sklearn.linear_model import LogisticRegression

model = LogisticRegression(solver='liblinear')
model.fit(features_train, target_train)
predicted = model.predict(features_test)
# F1 on the test set for the positive (toxic) class
print(f1_score(target_test, predicted))

0.7365880824206423
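As a reminder of what the number printed above measures: F1 is the harmonic mean of precision and recall for the positive (toxic) class. The sketch below recomputes it by hand on made-up labels. `f1_from_labels` and the toy data are illustrative and not part of the project code.

```python
def f1_from_labels(y_true, y_pred):
    # Count the outcomes that involve the positive class (label 1)
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
score = f1_from_labels(y_true, y_pred)
print(score)
```

On these toy labels precision and recall are both 2/3, so F1 is 2/3 as well. True negatives do not enter the formula at all.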


Compare with a random forest model.

In [22]:
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(n_estimators=400, random_state=12345)
model.fit(features_train, target_train)
predicted = model.predict(features_test)
# F1 on the test set
print(f1_score(target_test, predicted))

0.7027543993879112


Use gradient boosting.

In [20]:
from sklearn.model_selection import GridSearchCV
import lightgbm as lgb

model = lgb.LGBMClassifier()

# Search grid: 3 x 4 x 3 = 36 parameter combinations
parameters = {'num_leaves': [10, 20, 30],
              'learning_rate': [0.03, 0.1, 0.5, 1],
              'n_estimators': [100, 200, 400]}

# 2-fold cross-validation on the training set, scored by F1
grid = GridSearchCV(estimator=model, param_grid=parameters, cv=2, n_jobs=-1, scoring='f1')
grid.fit(features_train, target_train)

print("\n The best estimator across ALL searched params:\n", grid.best_estimator_)
print("\n The best score across ALL searched params:\n", grid.best_score_)
print("\n The best parameters across ALL searched params:\n", grid.best_params_)


 The best estimator across ALL searched params:
 LGBMClassifier(boosting_type='gbdt', class_weight=None, colsample_bytree=1.0,
               importance_type='split', learning_rate=0.1, max_depth=-1,
               min_child_samples=20, min_child_weight=0.001, min_split_gain=0.0,
               n_estimators=400, n_jobs=-1, num_leaves=30, objective=None,
               random_state=None, reg_alpha=0.0, reg_lambda=0.0, silent=True,
               subsample=1.0, subsample_for_bin=200000, subsample_freq=0)

 The best score across ALL searched params:
 0.7588732651453847

 The best parameters across ALL searched params:
 {'learning_rate': 0.1, 'n_estimators': 400, 'num_leaves': 30}
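To get a feel for the cost of the search above, the sketch below enumerates the same grid with `itertools.product` and counts the model fits. It only does the arithmetic and trains nothing; the variable names `combos` and `n_fits` are illustrative.

```python
from itertools import product

parameters = {
    'num_leaves': [10, 20, 30],
    'learning_rate': [0.03, 0.1, 0.5, 1],
    'n_estimators': [100, 200, 400],
}
cv = 2

# Every combination of one value per parameter
combos = [dict(zip(parameters, values)) for values in product(*parameters.values())]
n_fits = len(combos) * cv  # one fit per combination and fold

print(len(combos), n_fits)
```

The grid holds 36 combinations, which means 72 cross-validation fits. With the default `refit=True`, `GridSearchCV` then fits the best combination once more on the whole training set. The reported `best_score_` is the mean F1 over the two folds, which is why it differs from the test-set score computed in the next cell.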


In [21]:
# Train with the best parameters from the grid search and score on the test set
model = lgb.LGBMClassifier(learning_rate=0.1, n_estimators=400, num_leaves=30)
model.fit(features_train, target_train)
predicted = model.predict(features_test)
print(f1_score(target_test, predicted))

0.7719358731253232


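# 3. Conclusions

With lemmatized text and TF-IDF features we were able to train models that classify comments. Gradient boosting (LightGBM) handled the task best: F1 ≈ 0.772 on the test set, which clears the 0.75 target, while logistic regression (≈ 0.737) and random forest (≈ 0.703) did not reach it.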