The online store «Викишоп» (Wikishop) is launching a new service. Users can now edit and extend product descriptions, as in wiki communities: customers propose their own edits and comment on changes made by others. The store needs a tool that finds toxic comments and sends them to moderation.

Train a model to classify comments as positive or negative. You have a dataset in which edits are labeled for toxicity.

Build a model with an *F1* score of at least 0.75.
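
For reference, *F1* is the harmonic mean of precision and recall. The sketch below computes it from raw counts; the function name `f1_from_counts` and the counts themselves are hypothetical and chosen only for illustration, they are not taken from the dataset.

```python
def f1_from_counts(tp: int, fp: int, fn: int) -> float:
    """Return F1, the harmonic mean of precision and recall."""
    if tp == 0:
        # No true positives: precision and recall are both zero (or undefined)
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts: 75 toxic comments caught, 25 false alarms, 25 missed
score = f1_from_counts(tp=75, fp=25, fn=25)
print(round(score, 2))
```

With these counts precision and recall are both 0.75, so the score sits exactly on the required threshold.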

### Project instructions

1. Load and prepare the data.
2. Train several models.
3. Draw conclusions.

Using *BERT* is optional for this project, but you are welcome to try it.

### Data description

The data is in the file `toxic_comments.csv`. The *text* column contains the comment text, and *toxic* is the target feature.

# 1. Preparation

In [108]:
import pandas as pd

In [109]:
data=pd.read_csv('/datasets/toxic_comments.csv')

In [110]:
data

Unnamed: 0,text,toxic
0,Explanation\nWhy the edits made under my usern...,0
1,D'aww! He matches this background colour I'm s...,0
2,"Hey man, I'm really not trying to edit war. It...",0
3,"""\nMore\nI can't make any real suggestions on ...",0
4,"You, sir, are my hero. Any chance you remember...",0
...,...,...
159566,""":::::And for the second time of asking, when ...",0
159567,You should be ashamed of yourself \n\nThat is ...,0
159568,"Spitzer \n\nUmm, theres no actual article for ...",0
159569,And it looks like it was actually you who put ...,0


In [111]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 159571 entries, 0 to 159570
Data columns (total 2 columns):
text     159571 non-null object
toxic    159571 non-null int64
dtypes: int64(1), object(1)
memory usage: 2.4+ MB


In [112]:
# build the corpus for subsequent lemmatization
corpus=data['text']

### 1.1 Tokenization and lemmatization

In [113]:
import nltk
nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer 

[nltk_data] Downloading package wordnet to /home/jovyan/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [114]:
import numpy as np

In [115]:
# tokenization
word_list=[]
for i in corpus:
    word_list.append(' '.join(nltk.word_tokenize(i)))

In [116]:
# build a dictionary mapping the row index to its tokenized text
dict_of_words = { i : word_list[i] for i in range(0, len(word_list) ) }

In [117]:
dict_of_words

{0: "Explanation Why the edits made under my username Hardcore Metallica Fan were reverted ? They were n't vandalisms , just closure on some GAs after I voted at New York Dolls FAC . And please do n't remove the template from the talk page since I 'm retired now.89.205.38.27",
 1: "D'aww ! He matches this background colour I 'm seemingly stuck with . Thanks . ( talk ) 21:51 , January 11 , 2016 ( UTC )",
 2: "Hey man , I 'm really not trying to edit war . It 's just that this guy is constantly removing relevant information and talking to me through edits instead of my talk page . He seems to care more about the formatting than the actual info .",
 3: "`` More I ca n't make any real suggestions on improvement - I wondered if the section statistics should be later on , or a subsection of `` '' types of accidents '' '' -I think the references may need tidying so that they are all in the exact same format ie date format etc . I can do that later on , if no-one else does first - if you have 

In [118]:
from collections import defaultdict
from nltk.corpus import wordnet

In [119]:
import nltk
nltk.download('universal_tagset')

[nltk_data] Downloading package universal_tagset to
[nltk_data]     /home/jovyan/nltk_data...
[nltk_data]   Package universal_tagset is already up-to-date!


True

In [120]:
# convert an nltk (universal tagset) POS tag to a WordNet tag
def nltk_tag_to_wordnet_tag(treebank_tag):

    if treebank_tag.startswith('ADJ'):
        return wordnet.ADJ
    elif treebank_tag.startswith('VERB'):
        return wordnet.VERB
    elif treebank_tag.startswith('NOUN'):
        return wordnet.NOUN
    elif treebank_tag.startswith('ADV'):
        return wordnet.ADV
    else:
        return 'None'

In [121]:
# lemmatize the text
lemma_all_text=[]
lemmatizer = WordNetLemmatizer()
for w in range(0,len(dict_of_words)):    
    lemma_text=[]
    tagged = nltk.pos_tag(dict_of_words[w].split(),tagset='universal')
    for i,tag  in tagged:
        wntag = nltk_tag_to_wordnet_tag(tag)
        # note: .lower() runs after lemmatization, so capitalized tokens reach the lemmatizer unchanged
        if wntag=='None': # no WordNet tag: use the lemmatizer's default part of speech
            lemma = lemmatizer.lemmatize(i).lower()
        else:
            lemma = lemmatizer.lemmatize(i, pos=wntag).lower()
        lemma_text.append(lemma)
    lemma_all_text.append(' '.join(lemma_text))

### 1.2 Cleaning the text of extra characters

In [122]:
import re

In [123]:
lemma_clear_text=[]
for i in lemma_all_text:
    i=re.sub(r"[^a-zA-Z'  ]", ' ', i)
    text=i.split()
    text=' '.join(text)
    lemma_clear_text.append(text)       
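
To make the effect of the cleaning step concrete, here is a minimal sketch that applies the same kind of pattern to one lemmatized comment. The helper name `clean` is hypothetical; the sample string is adapted from the second comment shown above.

```python
import re


def clean(text: str) -> str:
    # Replace every character that is not a Latin letter, an apostrophe
    # or a space with a space
    letters_only = re.sub(r"[^a-zA-Z' ]", " ", text)
    # Collapse the resulting runs of whitespace into single spaces
    return " ".join(letters_only.split())


sample = "d'aww ! he match this background colour ( talk ) 21:51 , january 11 , 2016 ( utc )"
cleaned = clean(sample)
print(cleaned)
```

Digits and punctuation disappear, while apostrophes are kept, which is why fragments such as `n't` and `'m` survive in the cleaned text.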

In [124]:
lemma_clear_text

["explanation why the edits make under my username hardcore metallica fan be revert they be n't vandalisms just closure on some gas after i vote at new york dolls fac and please do n't remove the template from the talk page since i 'm retired now",
 "d'aww he match this background colour i 'm seemingly stick with thanks talk january utc",
 "hey man i 'm really not try to edit war it 's just that this guy be constantly remove relevant information and talk to me through edits instead of my talk page he seem to care more about the formatting than the actual info",
 "more i ca n't make any real suggestion on improvement i wonder if the section statistic should be later on or a subsection of '' type of accident '' '' i think the reference may need tidy so that they be all in the exact same format ie date format etc i can do that later on if no one else do first if you have any preference for format style on reference or want to do it yourself please let me know there appear to be a backlog 

### 1.3 Preparing the features

In [125]:
from sklearn.model_selection import train_test_split

In [126]:
data['lemm_text']=lemma_clear_text

In [127]:
# split the data into training, validation and test sets
train_valid, test = train_test_split(data, shuffle=False, test_size=0.1)
train, valid = train_test_split(train_valid, shuffle=False, test_size=0.12)
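
The two chained splits leave roughly 79% of the rows for training. The arithmetic below is a sketch that assumes `train_test_split` rounds a fractional test size up to the next whole row; the row count comes from the `data.info()` output above.

```python
import math

n_rows = 159571  # RangeIndex size reported by data.info()

# First split: 10% of all rows go to the test set
n_test = math.ceil(n_rows * 0.1)
n_train_valid = n_rows - n_test

# Second split: 12% of the remaining rows go to the validation set
n_valid = math.ceil(n_train_valid * 0.12)
n_train = n_train_valid - n_valid

for name, size in [("train", n_train), ("valid", n_valid), ("test", n_test)]:
    print(f"{name}: {size} rows, {size / n_rows:.1%} of the data")
```

Under that assumption the training set has 126379 rows. This agrees with the 164995 rows of the upsampled TF-IDF matrix reported later in the notebook if the training set contains 12872 toxic comments, since 126379 + 3 * 12872 = 164995.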

In [128]:
features_train=train['lemm_text']
target_train=train['toxic']

features_valid=valid['lemm_text']
target_valid=valid['toxic']

features_test=test['lemm_text']
target_test=test['toxic']

### Handling class imbalance

In [129]:
# check the class ratio in the target feature
data['toxic'].value_counts()

0    143346
1     16225
Name: toxic, dtype: int64

The classes are imbalanced: toxic comments make up about 10% of the data (16225 of 159571).

In [130]:
from sklearn.utils import shuffle
from sklearn.metrics import precision_score, recall_score

In [131]:
# make the rare class less rare by upsampling: repeat its rows `repeat` times
def upsample(features, target, repeat):
    features_zeros = features[target == 0]
    features_ones = features[target == 1]
    target_zeros = target[target == 0]
    target_ones = target[target == 1]
    features_upsampled = pd.concat([features_zeros] + [features_ones] * repeat)
    target_upsampled = pd.concat([target_zeros] + [target_ones] * repeat)

    features_upsampled, target_upsampled = shuffle(features_upsampled, target_upsampled, random_state=12345)
    return features_upsampled, target_upsampled

In [132]:
features_upsampled, target_upsampled = upsample(features_train, target_train,4)
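
The effect of upsampling is easiest to see on a toy example. The sketch below repeats the same idea with plain Python lists instead of pandas objects; `upsample_pairs` and the ten made-up comments are hypothetical and serve only as an illustration.

```python
import random
from collections import Counter


def upsample_pairs(texts, labels, repeat, seed=12345):
    """Repeat every positive example `repeat` times, then shuffle."""
    negatives = [(t, y) for t, y in zip(texts, labels) if y == 0]
    positives = [(t, y) for t, y in zip(texts, labels) if y == 1]
    combined = negatives + positives * repeat
    # Shuffle texts and labels together so the pairs stay aligned
    random.Random(seed).shuffle(combined)
    new_texts = [t for t, _ in combined]
    new_labels = [y for _, y in combined]
    return new_texts, new_labels


# Toy data: nine neutral comments and one toxic comment
texts = [f"comment {i}" for i in range(10)]
labels = [0] * 9 + [1]

up_texts, up_labels = upsample_pairs(texts, labels, repeat=4)
print(Counter(up_labels))
```

With `repeat=4` the toy ratio changes from 1:9 to 4:9, so the rare class becomes more frequent but the classes are still not equal in size.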

### Vectorizing the features

In [133]:
from nltk.corpus import stopwords as nltk_stopwords
from sklearn.feature_extraction.text import TfidfVectorizer

In [134]:
# load the English stop words
nltk.download('stopwords')
stopwords = set(nltk_stopwords.words('english'))

[nltk_data] Downloading package stopwords to /home/jovyan/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [135]:
# compute TF-IDF for the features using unigrams and bigrams
# (stop words are passed as a list, which newer scikit-learn versions expect)
count_tf_idf = TfidfVectorizer(stop_words=list(stopwords),ngram_range=(1,2))
tf_idf_train = count_tf_idf.fit_transform(features_upsampled)
tf_idf_valid = count_tf_idf.transform(features_valid)
tf_idf_test = count_tf_idf.transform(features_test)

In [136]:
tf_idf_train.shape

(164995, 2120129)
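
To show what the numbers in this matrix mean, here is a minimal pure-Python sketch of TF-IDF weighting. It assumes the smoothed formula that scikit-learn documents as its default, idf = ln((1 + n) / (1 + df)) + 1 followed by L2 normalization of each row, and it handles unigrams only, without stop words. The function `tfidf` and the three short documents are hypothetical.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: weight} dict per document, each L2-normalized."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term occurs
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed inverse document frequency
    idf = {t: math.log((1 + n_docs) / (1 + d)) + 1 for t, d in df.items()}
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights = {t: c * idf[t] for t, c in counts.items()}
        # Guard against empty documents, whose norm would be zero
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        rows.append({t: w / norm for t, w in weights.items()})
    return rows


docs = ["thanks for the edit", "stop the edit war", "thanks"]
rows = tfidf(docs)
for row in rows:
    print({term: round(weight, 3) for term, weight in row.items()})
```

Terms that occur in fewer documents, such as `war` here, get a larger weight than terms shared by several documents, such as `edit`.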

# 2. Training

### 2.1 Training logistic regression

In [137]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

In [138]:
model_log=LogisticRegression(random_state=12345,solver='lbfgs')
model_log.fit(tf_idf_train,target_upsampled)



LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=12345, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

In [139]:
predict_valid_log=model_log.predict(tf_idf_valid)

In [140]:
F1_valid_logistic=f1_score(target_valid,predict_valid_log)
print('F1_valid for logistic regression:', '{:.2f}'.format(F1_valid_logistic))

F1_valid for logistic regression: 0.77


### 2.2 Training the LinearSVC model

In [141]:
from sklearn.svm import LinearSVC

In [142]:
model_SVC=LinearSVC(random_state=2018)

In [143]:
model_SVC.fit(tf_idf_train,target_upsampled)

LinearSVC(C=1.0, class_weight=None, dual=True, fit_intercept=True,
          intercept_scaling=1, loss='squared_hinge', max_iter=1000,
          multi_class='ovr', penalty='l2', random_state=2018, tol=0.0001,
          verbose=0)

In [144]:
predict_valid_SVC=model_SVC.predict(tf_idf_valid)

In [145]:
F1_valid_linear=f1_score(target_valid,predict_valid_SVC)
print('F1_valid for LinearSVC:', '{:.2f}'.format(F1_valid_linear))

F1_valid for LinearSVC: 0.77


### 2.3 Training the XGBClassifier model

In [146]:
from xgboost import XGBClassifier

In [147]:
#num_class = len(np.unique(tf_idf_train))

In [148]:
model_XGB=XGBClassifier(learning_rate=0.5,max_depth=6)

In [149]:
model_XGB.fit(tf_idf_train,target_upsampled)

XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, gamma=0,
              learning_rate=0.5, max_delta_step=0, max_depth=6,
              min_child_weight=1, missing=None, n_estimators=100, n_jobs=1,
              nthread=None, objective='binary:logistic', random_state=0,
              reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
              silent=None, subsample=1, verbosity=1)

In [150]:
predict_valid_XGB=model_XGB.predict(tf_idf_valid)

In [151]:
F1_valid_classifier=f1_score(target_valid,predict_valid_XGB)
print('F1_valid for XGBClassifier:', '{:.2f}'.format(F1_valid_classifier))

F1_valid for XGBClassifier: 0.75


# 3. Conclusions

### 3.1 Testing the LogisticRegression model

In [152]:
predict_test_log=model_log.predict(tf_idf_test)
F1_test_logistic=f1_score(target_test,predict_test_log)
print('F1_test for logistic regression:', '{:.2f}'.format(F1_test_logistic))

F1_test for logistic regression: 0.77


### 3.2 Testing the LinearSVC model

In [153]:
predict_test_SVC=model_SVC.predict(tf_idf_test)
F1_test_linear=f1_score(target_test,predict_test_SVC)
print('F1_test for LinearSVC:', '{:.2f}'.format(F1_test_linear))

F1_test for LinearSVC: 0.79


### 3.3 Testing the XGBClassifier model

In [154]:
predict_test_XGB=model_XGB.predict(tf_idf_test)
F1_test_classifier=f1_score(target_test,predict_test_XGB)
print('F1_test for XGBClassifier:', '{:.2f}'.format(F1_test_classifier))

F1_test for XGBClassifier: 0.75


### 3.4 Comparing the results

In [155]:
# summary table with the validation and test results
table = {'Model':['LogisticRegression','LinearSVC','XGBClassifier'], 'F1_valid':[F1_valid_logistic, F1_valid_linear, F1_valid_classifier],
        'F1_test':[F1_test_logistic,F1_test_linear,F1_test_classifier]}
table = pd.DataFrame(table).round(decimals=2).reset_index(drop=True)
table

Unnamed: 0,Model,F1_valid,F1_test
0,LogisticRegression,0.77,0.77
1,LinearSVC,0.77,0.79
2,XGBClassifier,0.75,0.75


LinearSVC gave the best balance of precision and recall for predicting comment toxicity, with F1 = 0.79 on the test set. This is most likely because linear models scale well to high-dimensional sparse features such as the TF-IDF matrix used here. The other two models scored slightly lower: 0.77 for LogisticRegression and 0.75 for XGBClassifier. LogisticRegression trained quickly, whereas XGBClassifier took a long time to train, because its algorithm is more complex and the training set is very large.