<h1>Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Подготовка" data-toc-modified-id="Подготовка-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Preparation</a></span></li><li><span><a href="#Обучение" data-toc-modified-id="Обучение-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Training</a></span></li><li><span><a href="#Выводы" data-toc-modified-id="Выводы-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Conclusions</a></span></li><li><span><a href="#Чек-лист-проверки" data-toc-modified-id="Чек-лист-проверки-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Review checklist</a></span></li></ul></div>

# Project for Wikishop

The online store Wikishop is launching a new service. Users can now edit and extend product descriptions, as in wiki communities: customers propose their own edits and comment on changes made by others. The store needs a tool that finds toxic comments and sends them for moderation.

Train a model to classify comments as positive or negative. You have a dataset labelled for the toxicity of edits.

Build a model with an *F1* score of at least 0.75.

**Project instructions**

1. Load and prepare the data.
2. Train several models.
3. Draw conclusions.

Using *BERT* is optional for this project, but you are welcome to try it.

**Data description**

The data is stored in the file `toxic_comments.csv`. The *text* column contains the comment text, and *toxic* is the target.

## Preparation

Import the libraries:

In [1]:
import re

import numpy as np
import pandas as pd
import nltk
import torch
import transformers
from nltk.corpus import stopwords as nltk_stopwords
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.linear_model import LassoCV, LogisticRegression, RidgeCV
from sklearn.metrics import f1_score, accuracy_score
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.svm import LinearSVC
from sklearn.utils import shuffle
from tqdm import notebook

# NLTK resources used below: WordNet (lemmatization), punkt (tokenization), stop-word list.
nltk.download('wordnet')
nltk.download('punkt')
nltk.download('stopwords')

[nltk_data] Downloading package wordnet to /home/jovyan/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt to /home/jovyan/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /home/jovyan/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


Load the data from the file and display the first 10 rows for a first look.

In [2]:
comments = pd.read_csv('/datasets/toxic_comments.csv')
comments.head(10)

Unnamed: 0,text,toxic
0,Explanation\nWhy the edits made under my usern...,0
1,D'aww! He matches this background colour I'm s...,0
2,"Hey man, I'm really not trying to edit war. It...",0
3,"""\nMore\nI can't make any real suggestions on ...",0
4,"You, sir, are my hero. Any chance you remember...",0
5,"""\n\nCongratulations from me as well, use the ...",0
6,COCKSUCKER BEFORE YOU PISS AROUND ON MY WORK,1
7,Your vandalism to the Matt Shirvington article...,0
8,Sorry if the word 'nonsense' was offensive to ...,0
9,alignment on this subject and which are contra...,0


In [3]:
comments.shape

(159571, 2)

In [4]:
comments['toxic'].value_counts()

0    143346
1     16225
Name: toxic, dtype: int64

In [5]:
print(f"Share of class 1 objects in the dataset: {(sum(comments['toxic']) / len(comments) * 100):.2f}%")

Share of class 1 objects in the dataset: 10.17%


Conclusion: the dataframe contains 2 columns and 159,571 rows, with no missing values. Only about 10% of the comments are labelled toxic, so the classes are imbalanced. The *text* feature needs preprocessing: the comments must be cleaned of extraneous characters and words and then lemmatized (each word reduced to its base form, or lemma).

Prepare the data for vectorization:

1. Lemmatize the words with `WordNetLemmatizer()` from the nltk library.
2. Remove punctuation and extra whitespace.
3. Remove stop words (for now we only load the list; the removal itself happens during TF-IDF vectorization).

In [6]:
lemmatizer = WordNetLemmatizer()
def lemmatize(text):
    """Lowercase the text, tokenize it and lemmatize every token (default POS: noun)."""
    text = text.lower() 
    word_list = nltk.word_tokenize(text)
    lemmatized_output = ' '.join([lemmatizer.lemmatize(w) for w in word_list])
    return lemmatized_output

In [7]:
def clear_text(text):
    """Replace everything except Latin letters and spaces with a space, then collapse whitespace."""
    text = re.sub(r'[^a-zA-Z ]', ' ', text)
    return " ".join(text.split())

In [8]:
corpus = comments['text']

In [9]:
comments['our_clear_text'] = comments['text'].apply(clear_text)

display(comments.head(5))

Unnamed: 0,text,toxic,our_clear_text
0,Explanation\nWhy the edits made under my usern...,0,Explanation Why the edits made under my userna...
1,D'aww! He matches this background colour I'm s...,0,D aww He matches this background colour I m se...
2,"Hey man, I'm really not trying to edit war. It...",0,Hey man I m really not trying to edit war It s...
3,"""\nMore\nI can't make any real suggestions on ...",0,More I can t make any real suggestions on impr...
4,"You, sir, are my hero. Any chance you remember...",0,You sir are my hero Any chance you remember wh...
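
The table above shows `I'm` becoming `I m`: the pattern in `clear_text` replaces the apostrophe with a space, so contractions are split into two tokens. Below is a minimal sketch of that behaviour. `clear_text_keep_apostrophes` is a hypothetical variant that is not part of this notebook; it is included only to illustrate one possible alternative, and the sample sentence is made up.

```python
import re


def clear_text(text):
    # Same pattern as in the notebook: anything that is not a Latin letter
    # or a space becomes a space, then whitespace is collapsed.
    text = re.sub(r'[^a-zA-Z ]', ' ', text)
    return " ".join(text.split())


def clear_text_keep_apostrophes(text):
    # Hypothetical variant (an assumption, not the notebook's code):
    # apostrophes survive, so contractions stay in one piece.
    text = re.sub(r"[^a-zA-Z' ]", ' ', text)
    return " ".join(text.split())


sample = "D'aww! I'm not trying to edit war, don't worry."
cleaned = clear_text(sample)
cleaned_alt = clear_text_keep_apostrophes(sample)
print(cleaned)
print(cleaned_alt)
```

The effect on the final features is probably small: `TfidfVectorizer`'s default token pattern keeps only tokens of two or more characters, so single-letter fragments such as `m` or `t` are dropped during vectorization anyway.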


In [10]:
comments['text_lemma'] = comments['our_clear_text'].apply(lemmatize)


display(comments.tail(10))


Unnamed: 0,text,toxic,our_clear_text,text_lemma
159561,"""\nNo he did not, read it again (I would have ...",0,No he did not read it again I would have thoug...,no he did not read it again i would have thoug...
159562,"""\n Auto guides and the motoring press are not...",0,Auto guides and the motoring press are not goo...,auto guide and the motoring press are not good...
159563,"""\nplease identify what part of BLP applies be...",0,please identify what part of BLP applies becau...,please identify what part of blp applies becau...
159564,Catalan independentism is the social movement ...,0,Catalan independentism is the social movement ...,catalan independentism is the social movement ...
159565,The numbers in parentheses are the additional ...,0,The numbers in parentheses are the additional ...,the number in parenthesis are the additional d...
159566,""":::::And for the second time of asking, when ...",0,And for the second time of asking when your vi...,and for the second time of asking when your vi...
159567,You should be ashamed of yourself \n\nThat is ...,0,You should be ashamed of yourself That is a ho...,you should be ashamed of yourself that is a ho...
159568,"Spitzer \n\nUmm, theres no actual article for ...",0,Spitzer Umm theres no actual article for prost...,spitzer umm there no actual article for prosti...
159569,And it looks like it was actually you who put ...,0,And it looks like it was actually you who put ...,and it look like it wa actually you who put on...
159570,"""\nAnd ... I really don't think you understand...",0,And I really don t think you understand I came...,and i really don t think you understand i came...


In [11]:
nltk.download('averaged_perceptron_tagger')

def get_wordnet_pos(word):
    """Map POS tag to first character lemmatize() accepts"""
    tag = nltk.pos_tag([word])[0][1][0].upper()
    tag_dict = {"J": wordnet.ADJ,
                "N": wordnet.NOUN,
                "V": wordnet.VERB,
                "R": wordnet.ADV}
    return tag_dict.get(tag, wordnet.NOUN)

print(lemmatizer.lemmatize('motoring', get_wordnet_pos('motoring')))

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /home/jovyan/nltk_data...


motor


[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


In [12]:
def lemmatize_pos(text):
    text = text.lower() 
    word_list = nltk.word_tokenize(text)
    lemmatized_output = ' '.join([lemmatizer.lemmatize(w,  get_wordnet_pos(w)) for w in word_list])
    
    return lemmatized_output

print(lemmatize_pos('auto guide and the motoring press are not go'))

auto guide and the motor press be not go


In [13]:
comments['text_lemma_pos'] = comments['our_clear_text'].apply(lemmatize_pos)
display(comments.tail(10))

Unnamed: 0,text,toxic,our_clear_text,text_lemma,text_lemma_pos
159561,"""\nNo he did not, read it again (I would have ...",0,No he did not read it again I would have thoug...,no he did not read it again i would have thoug...,no he do not read it again i would have though...
159562,"""\n Auto guides and the motoring press are not...",0,Auto guides and the motoring press are not goo...,auto guide and the motoring press are not good...,auto guide and the motor press be not good sou...
159563,"""\nplease identify what part of BLP applies be...",0,please identify what part of BLP applies becau...,please identify what part of blp applies becau...,please identify what part of blp applies becau...
159564,Catalan independentism is the social movement ...,0,Catalan independentism is the social movement ...,catalan independentism is the social movement ...,catalan independentism be the social movement ...
159565,The numbers in parentheses are the additional ...,0,The numbers in parentheses are the additional ...,the number in parenthesis are the additional d...,the number in parenthesis be the additional de...
159566,""":::::And for the second time of asking, when ...",0,And for the second time of asking when your vi...,and for the second time of asking when your vi...,and for the second time of ask when your view ...
159567,You should be ashamed of yourself \n\nThat is ...,0,You should be ashamed of yourself That is a ho...,you should be ashamed of yourself that is a ho...,you should be ashamed of yourself that be a ho...
159568,"Spitzer \n\nUmm, theres no actual article for ...",0,Spitzer Umm theres no actual article for prost...,spitzer umm there no actual article for prosti...,spitzer umm there no actual article for prosti...
159569,And it looks like it was actually you who put ...,0,And it looks like it was actually you who put ...,and it look like it wa actually you who put on...,and it look like it be actually you who put on...
159570,"""\nAnd ... I really don't think you understand...",0,And I really don t think you understand I came...,and i really don t think you understand i came...,and i really don t think you understand i come...


In [14]:
# Final corpus: a plain list of cleaned, lemmatized strings for the vectorizer.
corpus = comments['text_lemma_pos']
corpus_lemm = [lemmatize(clear_text(text)) for text in corpus]

In [15]:
stopwords = set(nltk_stopwords.words('english'))

Split the data into training and test sets.

In [16]:
X_train, X_test, y_train, y_test = train_test_split(corpus_lemm, comments['toxic'], 
                                                    test_size=0.2,
                                                    random_state=42)

In [17]:
print(f"Training corpus size: {len(X_train)}")
print(f"Test corpus size: {len(X_test)}")

Training corpus size: 127656
Test corpus size: 31915


Vectorize the corpora with TfidfVectorizer, removing stop words along the way.
First, try training a model without n-grams (unigrams only).

In [18]:
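# Recent scikit-learn versions validate parameter types: the flags must be booleans
# and stop_words must be a list (or 'english' / None), not a set.
tf_idf_vec = TfidfVectorizer(ngram_range=(1, 1), stop_words=list(stopwords),
                             min_df=3, max_df=0.9, use_idf=True,
                             smooth_idf=True, sublinear_tf=True)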
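
To make the weighting configured above more concrete, here is a minimal pure-Python sketch of the arithmetic that, per the scikit-learn documentation, corresponds to `sublinear_tf`, `smooth_idf` and the default `norm='l2'`. The three toy documents are invented for illustration, and the sketch ignores stop words, `min_df` and `max_df`.

```python
import math
from collections import Counter

# Toy corpus (illustrative data, not taken from the dataset).
docs = [
    "you are my hero",
    "you are wrong wrong wrong",
    "my edit was fine",
]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: the number of documents each term occurs in.
df = Counter(term for doc in tokenized for term in set(doc))


def tfidf_vector(tokens):
    """Return an L2-normalised {term: weight} mapping for one document."""
    counts = Counter(tokens)
    weights = {}
    for term, tf in counts.items():
        sublinear_tf = 1.0 + math.log(tf)                        # sublinear_tf=True
        smooth_idf = math.log((1 + n_docs) / (1 + df[term])) + 1.0  # smooth_idf=True
        weights[term] = sublinear_tf * smooth_idf
    norm = math.sqrt(sum(w * w for w in weights.values()))       # norm='l2'
    return {term: w / norm for term, w in weights.items()}


vec = tfidf_vector(tokenized[1])
for term, weight in sorted(vec.items()):
    print(f"{term}: {weight:.3f}")
```

In this toy example `wrong` receives the largest weight in the second document: it is repeated there and appears in no other document, whereas `you` and `are` occur in two documents and are damped by a smaller IDF.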

In [19]:
X_train_vec = tf_idf_vec.fit_transform(X_train)

In [20]:
X_test_vec = tf_idf_vec.transform(X_test)

In [21]:
print(f"Training dataset shape: {X_train_vec.shape}")
print(f"Test dataset shape: {X_test_vec.shape}")

Training dataset shape: (127656, 36120)
Test dataset shape: (31915, 36120)


## Training

Compute the accuracy of a constant model that predicts every comment as non-toxic ('toxic'=0).

In [22]:
base_predicts = pd.Series(data=np.zeros((len(y_test))), index=y_test.index, dtype='int16')
base_accuracy = accuracy_score(y_test, base_predicts)
print(f"Accuracy of the constant model: {base_accuracy:.3f}")

Accuracy of the constant model: 0.898
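
The high accuracy of the constant model reflects the class imbalance rather than any skill, which is why the project targets F1 instead. A minimal sketch on made-up labels (ten comments, one of them toxic; not data from the dataset) shows how the two metrics diverge. The helper `precision_recall_f1` is written here for illustration, and treating an undefined precision or recall as 0 is an assumption of this sketch.

```python
def precision_recall_f1(y_true, y_pred):
    """Compute precision, recall and F1 for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    # Assumption of this sketch: undefined ratios (0/0) are reported as 0.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Toy labels with a 10% share of toxic comments (illustrative only).
y_true = [1] + [0] * 9
y_const = [0] * 10  # constant model: everything is non-toxic

accuracy = sum(t == p for t, p in zip(y_true, y_const)) / len(y_true)
precision, recall, f1 = precision_recall_f1(y_true, y_const)
print(f"accuracy={accuracy:.2f}, f1={f1:.2f}")
```

Since F1 can also be written as 2TP / (2TP + FP + FN), it is 0 whenever the model finds no true positives, no matter how many negatives it classifies correctly.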


Logistic Regression

In [None]:
%%time
# Start with a logistic regression model.
# Training, hyperparameter search and cross-validation are done with sklearn's GridSearchCV.
# The hyperparameter being tuned is the regularization parameter C.

parameters = {'C': np.linspace(10, 20, num = 11, endpoint = True),
             'max_iter': [1000]}
lrm = LogisticRegression()
clf = GridSearchCV(lrm, parameters,
                  cv=5,
                  scoring='f1',
                  n_jobs=-1,
                  verbose=2)
clf.fit(X_train_vec, y_train)

Fitting 5 folds for each of 11 candidates, totalling 55 fits
[CV] END ..............................C=10.0, max_iter=1000; total time= 1.3min
[CV] END ..............................C=10.0, max_iter=1000; total time= 1.4min
[CV] END ..............................C=10.0, max_iter=1000; total time= 1.5min
[CV] END ..............................C=10.0, max_iter=1000; total time= 1.3min
[CV] END ..............................C=10.0, max_iter=1000; total time= 1.4min
[CV] END ..............................C=11.0, max_iter=1000; total time= 1.7min
[CV] END ..............................C=11.0, max_iter=1000; total time= 1.4min
[CV] END ..............................C=11.0, max_iter=1000; total time= 1.4min
[CV] END ..............................C=11.0, max_iter=1000; total time= 1.1min
[CV] END ..............................C=11.0, max_iter=1000; total time= 1.4min
[CV] END ..............................C=12.0, max_iter=1000; total time= 1.5min
[CV] END ..............................C=12.0, m

In [None]:
GridSearchCV(cv=5, estimator=LogisticRegression(), n_jobs=-1,
             param_grid={'C': np.array([10., 11., 12., 13., 14., 15., 16., 17., 18., 19., 20.]),
                         'max_iter': [1000]},
             scoring='f1', verbose=2)

In [None]:
print(f"Best cross-validated F1: {clf.best_score_:.3f}")
print(f"Regularization parameter of the best model: {clf.best_params_}")

NB-SVM

In [None]:
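def prob(x, y_i, y):
    # Smoothed per-feature sum over the rows of class y_i, divided by the smoothed class size.
    p = x[y==y_i].sum(axis=0)
    return (p+1) / ((y==y_i).sum()+1)

In [None]:
# Log-count ratio: positive for features more typical of toxic comments, negative otherwise.
r = np.log(prob(X_train_vec, 1, y_train.values) / prob(X_train_vec, 0, y_train.values))
# Scale every TF-IDF feature by its ratio before fitting the classifier.
X_train_nb = X_train_vec.multiply(r)
X_test_nb = X_test_vec.multiply(r)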
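
The cells above rescale every TF-IDF feature by a log-count ratio before the classifier is fitted. A minimal pure-Python sketch of the same arithmetic on a toy matrix follows. The matrix `X`, the labels `y` and the list-based `prob` are illustrative stand-ins for the sparse matrices used in the notebook, not the notebook's own code.

```python
import math

# Toy document-term matrix (rows: documents, columns: terms).
# The values stand in for TF-IDF weights and are made up.
X = [
    [1.0, 0.0, 0.5],
    [0.8, 0.0, 0.0],
    [0.0, 1.0, 0.5],
    [0.0, 0.9, 0.0],
]
y = [1, 1, 0, 0]


def prob(X, y, label):
    """Smoothed per-term sum over the rows of one class, divided by the smoothed class size."""
    rows = [row for row, target in zip(X, y) if target == label]
    n_terms = len(X[0])
    sums = [sum(row[j] for row in rows) for j in range(n_terms)]
    return [(s + 1) / (len(rows) + 1) for s in sums]


p = prob(X, y, 1)  # term statistics for the toxic class
q = prob(X, y, 0)  # term statistics for the non-toxic class
r = [math.log(pj / qj) for pj, qj in zip(p, q)]

# Element-wise scaling of each row by r, as X_train_vec.multiply(r) does.
X_nb = [[value * rj for value, rj in zip(row, r)] for row in X]
print([round(rj, 3) for rj in r])
```

In this toy case the first term, which appears only in toxic rows, gets a positive ratio, the second gets a negative one, and the third, which is equally common in both classes, is scaled to zero and so carries no signal for the classifier.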

In [None]:
parameters = {'C': np.linspace(1, 11, num = 11, endpoint = True)}
nblrm = LogisticRegression(solver='liblinear', 
                           dual=True, 
                           max_iter = 1000)
clf_nb = GridSearchCV(nblrm, parameters,
                  cv=5,
                  scoring='f1',
                  n_jobs=-1,
                  verbose=2)

In [None]:
%%time
clf_nb.fit(X_train_nb, y_train)

In [None]:
GridSearchCV(cv=5,
             estimator=LogisticRegression(dual=True, max_iter=1000,
                                          solver='liblinear'),
             n_jobs=-1,
             param_grid={'C': np.array([ 1.,  2.,  3.,  4.,  5.,  6.,  7.,  8.,  9., 10., 11.])},
             scoring='f1', verbose=2)

In [None]:
print(f"Best cross-validated F1: {clf_nb.best_score_:.3f}")
print(f"Regularization parameter of the best model: {clf_nb.best_params_}")

LinearSVC

In [None]:
%%time
parameters = {'C': np.linspace(1, 31, num = 7, endpoint = True)}
lsvcm = LinearSVC(max_iter = 1000)
# Search over the LinearSVC estimator itself, not the NB logistic regression defined earlier.
clf_lsvc = GridSearchCV(lsvcm, parameters,
                  cv=5,
                  scoring='f1',
                  n_jobs=-1,
                  verbose=2)

In [None]:
clf_lsvc.fit(X_train_vec, y_train)

In [None]:
GridSearchCV(cv=5, estimator=LinearSVC(), n_jobs=-1,
             param_grid={'C': np.array([ 1.,  6., 11., 16., 21., 26., 31.])},
             scoring='f1', verbose=2)

In [None]:
print(f"Best cross-validated F1: {clf_lsvc.best_score_:.3f}")
print(f"Regularization parameter of the best model: {clf_lsvc.best_params_}")

Evaluate the best model on the test set.

In [None]:
nblrm = LogisticRegression(C=3,
                           solver='liblinear', 
                           dual=True,
                           max_iter=1000)
nblrm.fit(X_train_nb, y_train)
predict = nblrm.predict(X_test_nb)
f1_nblr = f1_score(y_test, predict)

In [None]:
print(f"F1 on the test set: {f1_nblr:.3f}")

## Conclusions

The comment toxicity data was loaded and processed:

Lemmatization was performed with WordNetLemmatizer from the nltk library.

Punctuation and extra whitespace were removed.

Stop words were removed (the list comes from the nltk library).

The corpus was vectorized with TfidfVectorizer.

The following models were trained on the resulting data: LogisticRegression, NB-SVM, LinearSVC.
