# Project for "Wikishop"

The online store "Wikishop" is launching a new service. Users can now edit and extend product descriptions, as in wiki communities: customers propose their own edits and comment on changes made by others. The store needs a tool that finds toxic comments and sends them for moderation.

Train a model to classify comments as positive or negative. You have a dataset labeled for the toxicity of edits.

Build a model with an *F1* score of at least 0.75.

**Project instructions**

1. Load and prepare the data.
2. Train several models.
3. Draw conclusions.

Using *BERT* is not required for this project, but you may try it.

**Data description**

The data is in the file `toxic_comments.csv`. The *text* column contains the comment text, and *toxic* is the target feature.

## Preparation

In [25]:
# libraries
import pandas as pd 
import numpy as np 
import nltk 
import re 
from nltk.stem import WordNetLemmatizer 
from nltk.corpus import wordnet
from nltk.corpus import stopwords as nltk_stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
from tqdm import tqdm
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import f1_score
from sklearn.pipeline import Pipeline


In [26]:
df = pd.read_csv('/datasets/toxic_comments.csv')

In [27]:
df.head()

Unnamed: 0.1,Unnamed: 0,text,toxic
0,0,Explanation\nWhy the edits made under my usern...,0
1,1,D'aww! He matches this background colour I'm s...,0
2,2,"Hey man, I'm really not trying to edit war. It...",0
3,3,"""\nMore\nI can't make any real suggestions on ...",0
4,4,"You, sir, are my hero. Any chance you remember...",0


In [28]:
df['toxic'].value_counts()

0    143106
1     16186
Name: toxic, dtype: int64
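
The counts above show that only about 10% of the comments are toxic, which is presumably why the task is scored with F1 rather than accuracy. A minimal sketch using only the two counts printed above; the `f1_from_counts` helper and the "always predict 0" baseline are illustrative assumptions, not part of the project code:

```python
# Class counts copied from the value_counts() output above.
n_clean, n_toxic = 143106, 16186
total = n_clean + n_toxic

toxic_share = n_toxic / total


def f1_from_counts(tp, fp, fn):
    """F1 = 2*TP / (2*TP + FP + FN); returns 0.0 when the denominator is 0."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Hypothetical baseline that labels every comment as non-toxic:
# it never produces a true positive and misses every toxic comment.
baseline_accuracy = n_clean / total
baseline_f1 = f1_from_counts(tp=0, fp=0, fn=n_toxic)

print(f"toxic share:       {toxic_share:.3f}")
print(f"baseline accuracy: {baseline_accuracy:.3f}")
print(f"baseline F1:       {baseline_f1:.3f}")
```

Such a baseline looks accurate only because the majority class dominates; its F1 for the toxic class is zero, so F1 is the more informative metric here.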

In [29]:
# Download WordNet through NLTK
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to /home/jovyan/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

def get_wordnet_posttag(w):
    """Return the WordNet POS constant for a word, e.g. "fish" may be tagged as a verb."""
    # [0][1][0] keeps only the first letter of the Penn Treebank tag, e.g. "VBG" -> "V"
    tagged = nltk.pos_tag([w])[0][1][0].upper()
    tag_dict = {
        "J": wordnet.ADJ,
        "N": wordnet.NOUN,
        "V": wordnet.VERB,
        "R": wordnet.ADV
    }
    return tag_dict.get(tagged, wordnet.NOUN)  # fall back to NOUN for any other tag

In [30]:
import nltk
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /home/jovyan/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

In [31]:
def get_wordnet_pos(word):
    """Map POS tag to first character lemmatize() accepts"""
    tag = nltk.pos_tag([word])[0][1][0].upper()
    tag_dict = {"J": wordnet.ADJ,
                "N": wordnet.NOUN,
                "V": wordnet.VERB,
                "R": wordnet.ADV}
    return tag_dict.get(tag, wordnet.NOUN)

In [32]:

lemmatizer = WordNetLemmatizer()

def clear_text(text):
    text = text.lower()
    text_clear = re.sub(r'[^a-zA-Z ]', ' ', text) 
    text_clear = " ".join(text_clear.split())
    word_list = nltk.word_tokenize(text_clear)
    lemmatized_output = ' '.join([lemmatizer.lemmatize(w, get_wordnet_pos(w)) for w in word_list])
    return lemmatized_output

df['lemm_text'] = df['text'].apply(clear_text)
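
To make the cleaning part of `clear_text` easier to follow, here is a minimal sketch of only the lowercase, regex and whitespace steps, applied to a fragment of one of the comments shown earlier. The `strip_non_letters` name is hypothetical, and lemmatization is deliberately left out so the sketch needs no NLTK data:

```python
import re


def strip_non_letters(text):
    """Lowercase, replace every non-letter with a space, collapse whitespace."""
    text = text.lower()
    # After lower() only a-z can be letters, so A-Z is not needed in the class.
    text = re.sub(r'[^a-z ]', ' ', text)
    return ' '.join(text.split())


sample = "D'aww! He matches this background colour"
cleaned = strip_non_letters(sample)
print(cleaned)
```

Punctuation and newlines turn into spaces, which is why contractions such as `I'm` end up as separate tokens `i` and `m` in the lemmatized column.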

In [53]:
df

Unnamed: 0.1,Unnamed: 0,text,toxic,lemm_text
0,0,Explanation\nWhy the edits made under my usern...,0,explanation why the edits make under my userna...
1,1,D'aww! He matches this background colour I'm s...,0,d aww he match this background colour i m seem...
2,2,"Hey man, I'm really not trying to edit war. It...",0,hey man i m really not try to edit war it s ju...
3,3,"""\nMore\nI can't make any real suggestions on ...",0,more i can t make any real suggestion on impro...
4,4,"You, sir, are my hero. Any chance you remember...",0,you sir be my hero any chance you remember wha...
...,...,...,...,...
159287,159446,""":::::And for the second time of asking, when ...",0,and for the second time of ask when your view ...
159288,159447,You should be ashamed of yourself \n\nThat is ...,0,you should be ashamed of yourself that be a ho...
159289,159448,"Spitzer \n\nUmm, theres no actual article for ...",0,spitzer umm there no actual article for prosti...
159290,159449,And it looks like it was actually you who put ...,0,and it look like it be actually you who put on...


In [54]:
features = df['lemm_text']
target = df['toxic']

features_train, features_test, target_train, target_test = train_test_split(
    features, target, test_size=0.15, random_state=12345, stratify = target)

In [55]:
print(features_train.shape)
print(features_test.shape)

(135398,)
(23894,)
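
The split above passes `stratify = target`, which keeps the share of toxic comments the same in both parts. A small sketch on made-up labels (the `toy_*` names and the 90/10 class ratio are assumptions chosen for illustration):

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Toy labels with a similar kind of imbalance: 90 non-toxic, 10 toxic.
toy_target = [0] * 90 + [1] * 10
toy_features = [f"comment {i}" for i in range(100)]

toy_train_x, toy_test_x, toy_train_y, toy_test_y = train_test_split(
    toy_features, toy_target, test_size=0.2, random_state=12345, stratify=toy_target
)

# Both parts keep the original 10% share of the positive class.
print(Counter(toy_train_y))
print(Counter(toy_test_y))
```

Without stratification a small test set could end up with noticeably more or fewer toxic examples, which would make the test F1 less comparable to the cross-validation scores.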


In [56]:
# stop words
nltk.download('stopwords')
stopwords = set(nltk_stopwords.words('english'))

[nltk_data] Downloading package stopwords to /home/jovyan/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [57]:
# stop_words is passed as a list: newer scikit-learn versions reject a set here
count_tf_idf = TfidfVectorizer(stop_words=list(stopwords))
features_train = count_tf_idf.fit_transform(features_train)
features_test = count_tf_idf.transform(features_test)

In [58]:
print(features_train.shape)
print(features_test.shape)

(135398, 137687)
(23894, 137687)
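
The vectorizer is fitted on the training texts only and then reused for the test texts, so both matrices share the same 137687 columns. A minimal sketch of that pattern on a toy corpus; the documents, the `toy_*` names and the short stop-word list are made up for illustration:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

toy_train_docs = [
    "you be my hero",
    "this be a horrible edit",
    "thank for the edit",
]
toy_test_docs = ["horrible hero unknownword"]

toy_stop_words = ["you", "be", "my", "this", "a", "for", "the"]
toy_vectorizer = TfidfVectorizer(stop_words=toy_stop_words)

# The vocabulary and IDF weights are learned from the training documents only.
toy_train_matrix = toy_vectorizer.fit_transform(toy_train_docs)
# Test documents are mapped onto that vocabulary; unseen words are ignored.
toy_test_matrix = toy_vectorizer.transform(toy_test_docs)

print(sorted(toy_vectorizer.vocabulary_))
print(toy_train_matrix.shape, toy_test_matrix.shape)
```

Calling `fit_transform` on the test texts instead would build a different vocabulary and leak test information into the features.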


In [59]:
print(target_train.shape)
print(target_test.shape)

(135398,)
(23894,)


## Training

Approach 1 (pipeline)

In [63]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.svm import SVC
pipe_lr = Pipeline([('vect', CountVectorizer()),
                    ('tfidf', TfidfTransformer()),
                    ('clf', LogisticRegression(random_state=12345))])

In [64]:
param_range = [9, 10]
param_range_fl = [1.0, 0.5]

grid_params_lr = [{'clf__penalty': ['l1', 'l2'],
        'clf__C': param_range_fl,
        'clf__solver': ['liblinear']}] 

LR = GridSearchCV(estimator=pipe_lr,
            param_grid=grid_params_lr,
            scoring='f1',
            cv=10) 

# The pipeline vectorizes text itself, so fit it on the raw lemmatized strings
# of the training rows rather than on the TF-IDF matrix stored in features_train.
LR.fit(df.loc[target_train.index, 'lemm_text'], target_train)
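
A pipeline that starts with `CountVectorizer` expects raw strings: the vectorizer calls `.lower()` on every document, and an already vectorized sparse matrix has no such method. A minimal sketch of the same three-step pipeline searched over raw text, on a made-up toy corpus (the `toy_*` names, texts and labels are assumptions for illustration):

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Toy raw comments; labels: 1 = toxic, 0 = not toxic.
toy_texts = [
    "you be a stupid idiot", "thank you for the helpful edit",
    "what a stupid article", "i agree with this change",
    "shut up idiot", "please add a source for this claim",
    "stupid idiot vandal", "nice work on the new section",
]
toy_labels = [1, 0, 1, 0, 1, 0, 1, 0]

toy_pipe = Pipeline([
    ("vect", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("clf", LogisticRegression(solver="liblinear", random_state=12345)),
])

toy_search = GridSearchCV(
    estimator=toy_pipe,
    param_grid={"clf__penalty": ["l1", "l2"], "clf__C": [0.5, 1.0]},
    scoring="f1",
    cv=2,
)

# The pipeline receives raw strings, so the vectorizer is refit inside each fold.
toy_search.fit(toy_texts, toy_labels)

print(toy_search.best_params_)
print(toy_search.predict(["stupid idiot"]))
```

Keeping vectorization inside the pipeline also means the vocabulary of each cross-validation fold is built from that fold's training part only.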

In [65]:
print(LR.best_params_)

{'clf__C': 1.0, 'clf__penalty': 'l1', 'clf__solver': 'liblinear'}


In [66]:
LR = LogisticRegression(solver='liblinear', penalty='l1', C=1, random_state=12345)
LR.fit(features_train, target_train)
scores_LR_pp = cross_val_score(LR,
                            features_train,
                            target_train,
                            cv=3,
                            verbose=10,
                            n_jobs=-1,
                            scoring='f1') 

f1_LR_pp = scores_LR_pp.mean()
print('F1 for logistic regression:', f1_LR_pp)

[Parallel(n_jobs=-1)]: Using backend SequentialBackend with 1 concurrent workers.


[CV] START .....................................................................
[CV] END ................................ score: (test=0.767) total time=   0.8s
[CV] START .....................................................................


[Parallel(n_jobs=-1)]: Done   1 out of   1 | elapsed:    0.8s remaining:    0.0s


[CV] END ................................ score: (test=0.761) total time=   0.8s
[CV] START .....................................................................


[Parallel(n_jobs=-1)]: Done   2 out of   2 | elapsed:    1.6s remaining:    0.0s


[CV] END ................................ score: (test=0.759) total time=   0.8s
F1 for logistic regression: 0.7626192352280553


[Parallel(n_jobs=-1)]: Done   3 out of   3 | elapsed:    2.4s remaining:    0.0s
[Parallel(n_jobs=-1)]: Done   3 out of   3 | elapsed:    2.4s finished


Approach 2

In [17]:
%%time

param_grid = {
    'solver': ['lbfgs', 'liblinear'],
    'penalty': ['l2'],
    'C': [10, 1.0 ,0.1 ,0.01]}


model_LR = LogisticRegression(random_state=12345)

model_LR_gs = GridSearchCV(estimator=model_LR, 
                     param_grid=param_grid,
                     cv=3, 
                     n_jobs=-1, 
                     verbose=10,
                     scoring='f1')
model_LR_gs.fit(features_train, target_train)


Fitting 3 folds for each of 8 candidates, totalling 24 fits
[CV 1/3; 1/8] START C=10, penalty=l2, solver=lbfgs..............................


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


[CV 1/3; 1/8] END ............C=10, penalty=l2, solver=lbfgs; total time=  41.0s
[CV 2/3; 1/8] START C=10, penalty=l2, solver=lbfgs..............................


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


[CV 2/3; 1/8] END ............C=10, penalty=l2, solver=lbfgs; total time=  37.5s
[CV 3/3; 1/8] START C=10, penalty=l2, solver=lbfgs..............................


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


[CV 3/3; 1/8] END ............C=10, penalty=l2, solver=lbfgs; total time=  38.6s
[CV 1/3; 2/8] START C=10, penalty=l2, solver=liblinear..........................
[CV 1/3; 2/8] END ........C=10, penalty=l2, solver=liblinear; total time=  13.8s
[CV 2/3; 2/8] START C=10, penalty=l2, solver=liblinear..........................
[CV 2/3; 2/8] END ........C=10, penalty=l2, solver=liblinear; total time=  11.8s
[CV 3/3; 2/8] START C=10, penalty=l2, solver=liblinear..........................
[CV 3/3; 2/8] END ........C=10, penalty=l2, solver=liblinear; total time=  11.6s
[CV 1/3; 3/8] START C=1.0, penalty=l2, solver=lbfgs.............................


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


[CV 1/3; 3/8] END ...........C=1.0, penalty=l2, solver=lbfgs; total time=  37.7s
[CV 2/3; 3/8] START C=1.0, penalty=l2, solver=lbfgs.............................


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


[CV 2/3; 3/8] END ...........C=1.0, penalty=l2, solver=lbfgs; total time=  39.3s
[CV 3/3; 3/8] START C=1.0, penalty=l2, solver=lbfgs.............................
[CV 3/3; 3/8] END ...........C=1.0, penalty=l2, solver=lbfgs; total time=  29.6s
[CV 1/3; 4/8] START C=1.0, penalty=l2, solver=liblinear.........................
[CV 1/3; 4/8] END .......C=1.0, penalty=l2, solver=liblinear; total time=   6.8s
[CV 2/3; 4/8] START C=1.0, penalty=l2, solver=liblinear.........................
[CV 2/3; 4/8] END .......C=1.0, penalty=l2, solver=liblinear; total time=   7.7s
[CV 3/3; 4/8] START C=1.0, penalty=l2, solver=liblinear.........................
[CV 3/3; 4/8] END .......C=1.0, penalty=l2, solver=liblinear; total time=   7.3s
[CV 1/3; 5/8] START C=0.1, penalty=l2, solver=lbfgs.............................
[CV 1/3; 5/8] END ...........C=0.1, penalty=l2, solver=lbfgs; total time=  15.4s
[CV 2/3; 5/8] START C=0.1, penalty=l2, solver=lbfgs.............................
[CV 2/3; 5/8] END ..........

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


GridSearchCV(cv=3, estimator=LogisticRegression(random_state=12345), n_jobs=-1,
             param_grid={'C': [10, 1.0, 0.1, 0.01], 'penalty': ['l2'],
                         'solver': ['lbfgs', 'liblinear']},
             scoring='f1', verbose=10)

In [18]:
print(model_LR_gs.best_params_)

{'C': 10, 'penalty': 'l2', 'solver': 'lbfgs'}


In [21]:
model_LR = LogisticRegression(solver='lbfgs', penalty='l2', C=10, random_state=12345)
model_LR.fit(features_train, target_train)
scores_LR = cross_val_score(model_LR,
                            features_train,
                            target_train,
                            cv=3,
                            verbose=10,
                            n_jobs=-1,
                            scoring='f1') 

f1_LR = scores_LR.mean()
print('F1 for logistic regression:', f1_LR)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
[Parallel(n_jobs=-1)]: Using backend SequentialBackend with 1 concurrent workers.


[CV] START .....................................................................


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
[Parallel(n_jobs=-1)]: Done   1 out of   1 | elapsed:   41.7s remaining:    0.0s


[CV] END ................................ score: (test=0.765) total time=  41.6s
[CV] START .....................................................................


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
[Parallel(n_jobs=-1)]: Done   2 out of   2 | elapsed:  1.3min remaining:    0.0s


[CV] END ................................ score: (test=0.767) total time=  37.9s
[CV] START .....................................................................
[CV] END ................................ score: (test=0.754) total time=  39.1s
F1 for logistic regression: 0.7620445107624816


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
[Parallel(n_jobs=-1)]: Done   3 out of   3 | elapsed:  2.0min remaining:    0.0s
[Parallel(n_jobs=-1)]: Done   3 out of   3 | elapsed:  2.0min finished
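
The repeated "ITERATIONS REACHED LIMIT" messages mean that `lbfgs` stopped at its iteration cap (the scikit-learn default is `max_iter=100`) before converging, so the scores for this solver come from unfinished fits. One common remedy, as the warning itself suggests, is to raise `max_iter`. A minimal sketch on synthetic data, which stands in for the TF-IDF matrix and is not the project's data (the `toy_*` names and the value 1000 are assumptions):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the feature matrix; not the project's data.
toy_X, toy_y = make_classification(n_samples=300, n_features=20, random_state=12345)

toy_model = LogisticRegression(solver="lbfgs", C=10, max_iter=1000, random_state=12345)
toy_model.fit(toy_X, toy_y)

# n_iter_ holds the iterations actually used; a value below max_iter
# means the solver stopped because it converged, not because it hit the cap.
print(toy_model.n_iter_[0], "iterations used out of", toy_model.max_iter)
```

Whether a higher limit would change the selected parameters or the F1 score on the real data is not known from the runs above; it would need to be rechecked.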


In [26]:
%%time

param_grid = {
    'max_depth': range(1, 10, 1),
    'n_estimators': range(1, 20, 1)}

model_RTC = RandomForestClassifier(random_state=12345) 

model_RTC_gs = GridSearchCV(estimator=model_RTC, 
                     param_grid=param_grid,
                     cv=3, 
                     n_jobs=-1, 
                     verbose=10,
                     scoring='f1')
model_RTC_gs.fit(features_train, target_train)
print(model_RTC_gs.best_params_)

Fitting 3 folds for each of 171 candidates, totalling 513 fits
[CV 1/3; 1/171] START max_depth=1, n_estimators=1...............................
[CV 1/3; 1/171] END .............max_depth=1, n_estimators=1; total time=   0.1s
[CV 2/3; 1/171] START max_depth=1, n_estimators=1...............................
[CV 2/3; 1/171] END .............max_depth=1, n_estimators=1; total time=   0.1s
[CV 3/3; 1/171] START max_depth=1, n_estimators=1...............................
[CV 3/3; 1/171] END .............max_depth=1, n_estimators=1; total time=   0.1s
[CV 1/3; 2/171] START max_depth=1, n_estimators=2...............................
[CV 1/3; 2/171] END .............max_depth=1, n_estimators=2; total time=   0.2s
[CV 2/3; 2/171] START max_depth=1, n_estimators=2...............................
[CV 2/3; 2/171] END .............max_depth=1, n_estimators=2; total time=   0.2s
[CV 3/3; 2/171] START max_depth=1, n_estimators=2...............................
[CV 3/3; 2/171] END .............max_depth=1, 

In [27]:
model_RTC = RandomForestClassifier(max_depth=9, n_estimators=1, random_state=12345)
model_RTC.fit(features_train, target_train)
scores_RTC = cross_val_score(model_RTC,
                            features_train,
                            target_train,
                            cv=3,
                            n_jobs=-1,
                            scoring='f1') 

f1_RTC = scores_RTC.mean()
print('F1 for RandomForestClassifier:', f1_RTC)

F1 for RandomForestClassifier: 0.07508427254846584


In [29]:
%%time

param_grid = {'max_depth': range(1, 50, 1)}

model_DTC = DecisionTreeClassifier(random_state=12345) 

model_DTC_gs = GridSearchCV(estimator=model_DTC, 
                     param_grid=param_grid,
                     cv=3, 
                     n_jobs=-1, 
                     verbose=10,
                     scoring='f1')
model_DTC_gs.fit(features_train, target_train)
print(model_DTC_gs.best_params_)

Fitting 3 folds for each of 49 candidates, totalling 147 fits
[CV 1/3; 1/49] START max_depth=1................................................
[CV 1/3; 1/49] END ..............................max_depth=1; total time=   6.5s
[CV 2/3; 1/49] START max_depth=1................................................
[CV 2/3; 1/49] END ..............................max_depth=1; total time=   7.0s
[CV 3/3; 1/49] START max_depth=1................................................
[CV 3/3; 1/49] END ..............................max_depth=1; total time=   6.7s
[CV 1/3; 2/49] START max_depth=2................................................
[CV 1/3; 2/49] END ..............................max_depth=2; total time=   6.7s
[CV 2/3; 2/49] START max_depth=2................................................
[CV 2/3; 2/49] END ..............................max_depth=2; total time=   6.6s
[CV 3/3; 2/49] START max_depth=2................................................
[CV 3/3; 2/49] END ............................

In [31]:
model_DTC = DecisionTreeClassifier(max_depth=49, random_state=12345)
model_DTC.fit(features_train, target_train)
scores_DTC = cross_val_score(model_DTC,
                            features_train,
                            target_train,
                            cv=3,
                            n_jobs=-1,
                            scoring='f1') 

f1_DTC = scores_DTC.mean()
print('F1 for DecisionTreeClassifier:', f1_DTC)

F1 for DecisionTreeClassifier: 0.7030880057750589


## Testing

The best cross-validated F1 score was achieved by LogisticRegression with the parameters found in the pipeline search (approach 1), so this model is evaluated on the test set.

In [67]:
LR

LogisticRegression(C=1, penalty='l1', random_state=12345, solver='liblinear')

In [68]:
predictions = LR.predict(features_test)
print(f1_score(target_test, predictions))

0.774971031286211


## Conclusions

The result on the test set is good: F1 = 0.77, above the required 0.75. The model learned to classify comments as positive or negative.
First, the data was prepared: the text feature was lemmatized, the dataset was split into training and test sets, stop words were removed, and the text was converted to vectors with TfidfVectorizer.
Then three types of models were evaluated and the best one, LogisticRegression, was selected and checked on the test set.

## Review checklist

- [x]  Jupyter Notebook is open
- [ ]  All code runs without errors
- [ ]  Code cells are arranged in execution order
- [ ]  Data is loaded and prepared
- [ ]  Models are trained
- [ ]  The *F1* score is at least 0.75
- [ ]  Conclusions are written