<h1>Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Подготовка" data-toc-modified-id="Подготовка-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Preparation</a></span></li><li><span><a href="#Обучение" data-toc-modified-id="Обучение-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Training</a></span></li><li><span><a href="#Выводы" data-toc-modified-id="Выводы-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Conclusions</a></span></li><li><span><a href="#Чек-лист-проверки" data-toc-modified-id="Чек-лист-проверки-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Review checklist</a></span></li></ul></div>

# Project for Wikishop

The online store Wikishop is launching a new service. Users can now edit and extend product descriptions, as in wiki communities: customers propose their own edits and comment on changes made by others. The store needs a tool that finds toxic comments and sends them for moderation.

Train a model to classify comments as positive or negative. You have a dataset labeled for the toxicity of edits.

Build a model with an *F1* score of at least 0.75.

**Project instructions**

1. Load and prepare the data.
2. Train several different models.
3. Draw conclusions.

Using *BERT* is not required for this project, but you are welcome to try it.

**Data description**

The data is stored in the file `toxic_comments.csv`. The *text* column contains the comment text, and *toxic* is the target.

## Preparation

We start by importing all the required libraries.

In [1]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split, GridSearchCV

from sklearn.feature_extraction.text import TfidfVectorizer

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from xgboost import XGBClassifier

from sklearn.metrics import f1_score, make_scorer

from tqdm.notebook import tqdm

import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk import pos_tag
from nltk.corpus import wordnet

nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('omw-1.4')
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\egork\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\egork\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\egork\AppData\Roaming\nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\egork\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

Let's load the data and take a look at it.

In [2]:
data = pd.read_csv('/datasets/toxic_comments.csv')
#data = pd.read_csv(r'C:\\Users\\egork\\Downloads\\toxic_comments.csv')

display(data.head())
display(data.info())

|   | Unnamed: 0 | text | toxic |
|---|---|---|---|
| 0 | 0 | Explanation\nWhy the edits made under my usern... | 0 |
| 1 | 1 | D'aww! He matches this background colour I'm s... | 0 |
| 2 | 2 | Hey man, I'm really not trying to edit war. It... | 0 |
| 3 | 3 | "\nMore\nI can't make any real suggestions on ... | 0 |
| 4 | 4 | You, sir, are my hero. Any chance you remember... | 0 |


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 159292 entries, 0 to 159291
Data columns (total 3 columns):
 #   Column      Non-Null Count   Dtype 
---  ------      --------------   ----- 
 0   Unnamed: 0  159292 non-null  int64 
 1   text        159292 non-null  object
 2   toxic       159292 non-null  int64 
dtypes: int64(2), object(1)
memory usage: 3.6+ MB


None

There are no missing values. Let's see how many records of each class we have.

In [3]:
print(data['toxic'].value_counts(normalize=True))

toxic
0    0.898388
1    0.101612
Name: proportion, dtype: float64


Almost 90% of the records are non-toxic and about 10% are toxic, so the classes are imbalanced. We should take this into account when initializing the models. Next, let's check the data for duplicates.
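To make the imbalance remark concrete, here is a minimal sketch of how "balanced" class weights are typically derived: each class gets `n_samples / (n_classes * class_count)`, which is the formula scikit-learn documents for `class_weight='balanced'`. The helper name `balanced_class_weights` and the toy labels are assumptions made for illustration; they are not part of the notebook.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Toy labels with roughly the same 90/10 split as the dataset (illustrative only).
toy_labels = [0] * 9 + [1]
weights = balanced_class_weights(toy_labels)
print(weights)

# Ratio of negatives to positives, a common starting value for
# XGBoost's scale_pos_weight parameter (XGBoost has no class_weight option).
neg_to_pos = toy_labels.count(0) / toy_labels.count(1)
print(neg_to_pos)
```

The rare class receives a weight roughly nine times larger than the frequent one, so errors on toxic comments cost more during training.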

In [4]:
print(data[data.duplicated()])

Empty DataFrame
Columns: [Unnamed: 0, text, toxic]
Index: []


There are no duplicates, so we can continue. We will split the data into training and test sets, but first let's prepare the lemmatization functions.

In [5]:
def get_wordnet_pos(tag, word):
    # Map a Penn Treebank POS tag to the matching WordNet POS constant.
    # Words tagged as adjectives but ending in "ed" are treated as verbs.
    if tag.startswith('J') and word.endswith('ed'):
        return wordnet.VERB
    if tag.startswith('J'):
        return wordnet.ADJ
    elif tag.startswith('V'):
        return wordnet.VERB
    elif tag.startswith('N'):
        return wordnet.NOUN
    elif tag.startswith('R'):
        return wordnet.ADV
    else:
        return None

lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))

def preprocess_text(text):
    # Lowercase, drop English stop words, then lemmatize each remaining token.
    text = text.lower()
    words = [word for word in text.split() if word not in stop_words]
    pos_tags = pos_tag(words)
    lemmatized_words = []
    for word, tag in pos_tags:
        pos = get_wordnet_pos(tag, word)
        if pos:
            lemmatized_word = lemmatizer.lemmatize(word, pos)
        else:
            # No usable POS tag: fall back to the lemmatizer's default (noun).
            lemmatized_word = lemmatizer.lemmatize(word)
        lemmatized_words.append(lemmatized_word)
    return ' '.join(lemmatized_words)
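Because `preprocess_text` splits on whitespace only, tokens such as `edits,` keep their punctuation and may not match the stop-word list or a WordNet entry. One possible refinement, shown here only as a sketch, is to strip everything except Latin letters and apostrophes before splitting. The helper `clean_text` and the sample string are hypothetical and are not part of the notebook.

```python
import re


def clean_text(text):
    """Lowercase and keep only Latin letters and apostrophes (hypothetical helper)."""
    text = text.lower()
    # Replace every run of other characters (digits, punctuation, newlines) with a space.
    text = re.sub(r"[^a-z']+", " ", text)
    # Collapse repeated whitespace and trim the ends.
    return " ".join(text.split())


sample = "Hey man, I'm really NOT trying to edit-war!!\nSee: edits, reverted."
cleaned = clean_text(sample)
print(cleaned)
```

Such a step could run at the start of `preprocess_text`, before stop-word filtering. Whether it improves the final metric would need to be checked.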

In [6]:
X = data['text']
y = data['toxic']

In [7]:
print("Preprocessing text...")
tqdm.pandas()
X = X.progress_apply(preprocess_text)

Preprocessing text...



The text is lemmatized. Now we split it into training and test sets and convert it to vectors.

In [8]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
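With a 90/10 class ratio, a purely random split can leave the test set with a slightly different share of toxic comments than the training set. The sketch below illustrates the idea of a stratified split in plain Python, using the 25% test share that `train_test_split` applies by default. The function `stratified_split` and the toy labels are assumptions for illustration; in scikit-learn the same effect is usually obtained by passing `stratify=y` to `train_test_split`.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_size=0.25, seed=42):
    """Return (train_idx, test_idx) so that each class keeps its share in both parts."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Toy labels: 36 non-toxic and 4 toxic comments (illustrative only).
toy_labels = [0] * 36 + [1] * 4
train_idx, test_idx = stratified_split(toy_labels)
test_positive_share = sum(toy_labels[i] for i in test_idx) / len(test_idx)
print(len(train_idx), len(test_idx), test_positive_share)
```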

In [9]:
vectorizer = TfidfVectorizer(max_features=100000)
X_train_tf = vectorizer.fit_transform(X_train)
X_test_tf = vectorizer.transform(X_test)

Now we can move on to the models.
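The vectorizer above is fitted on the training texts only and then applied to the test texts. The following simplified sketch shows why that matters: the vocabulary and IDF weights come from the training documents, and words seen only at test time are dropped. It assumes a whitespace tokenizer and the smoothed IDF formula `ln((1 + n) / (1 + df)) + 1` with L2 normalization, which are the `TfidfVectorizer` defaults. The functions `fit_idf` and `transform` and the toy documents are hypothetical.

```python
import math
from collections import Counter


def fit_idf(train_docs):
    """Learn the vocabulary and smoothed IDF weights from training documents only."""
    n_docs = len(train_docs)
    doc_freq = Counter()
    for doc in train_docs:
        doc_freq.update(set(doc.split()))
    return {term: math.log((1 + n_docs) / (1 + df)) + 1 for term, df in doc_freq.items()}


def transform(doc, idf):
    """TF-IDF vector for one document, L2-normalized; unseen terms are dropped."""
    counts = Counter(term for term in doc.split() if term in idf)
    raw = {term: tf * idf[term] for term, tf in counts.items()}
    norm = math.sqrt(sum(v * v for v in raw.values()))
    return {term: v / norm for term, v in raw.items()} if norm else {}


train_docs = ["nice edit thanks", "stupid edit", "thanks a lot"]
idf = fit_idf(train_docs)

# "unseen" never occurred in training, so it gets no weight at test time.
vector = transform("stupid unseen edit", idf)
print(vector)
```

The rarer training word ("stupid") ends up with a larger weight than the more common one ("edit").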

## Training

We will consider four models of varying complexity.

In [10]:
models_and_params = {
    'LogisticRegression': (
        LogisticRegression(max_iter=2000, random_state=42),
        {'C': [5, 10, 15], 
         'penalty': ['l1', 'l2'], 
         'solver': ['liblinear'], 
         'class_weight': ['balanced']
        }
    ),
    'RandomForest': (
        RandomForestClassifier(random_state=42),
        {'n_estimators': [100, 200], 
         'max_depth': [10, 20, None], 
         'class_weight': ['balanced']
        }
    ),
    'SVC': (
        SVC(random_state=42),
        {'C': [0.1, 1, 10], 
         'kernel': ['linear', 'rbf'], 
         'class_weight': ['balanced']
        }
    ),
    'XGBClassifier': (
        XGBClassifier(use_label_encoder=False, eval_metric='logloss', random_state=42),
        {'n_estimators': [100, 200], 
         'max_depth': [6, 10], 
         'learning_rate': [0.01, 0.1, 0.2]
        }
    )
}
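Before launching the search it helps to estimate its cost. The sketch below counts the parameter combinations in each grid and multiplies by the number of folds. It copies the grids from the cell above, without the estimators, and assumes five folds, as in the search cell. The helper `count_candidates` is hypothetical.

```python
from itertools import product


def count_candidates(param_grid):
    """Number of parameter combinations in a single grid dictionary."""
    return len(list(product(*param_grid.values())))


# Parameter grids copied from the cell above (estimators omitted).
grids = {
    'LogisticRegression': {'C': [5, 10, 15], 'penalty': ['l1', 'l2'],
                           'solver': ['liblinear'], 'class_weight': ['balanced']},
    'RandomForest': {'n_estimators': [100, 200], 'max_depth': [10, 20, None],
                     'class_weight': ['balanced']},
    'SVC': {'C': [0.1, 1, 10], 'kernel': ['linear', 'rbf'],
            'class_weight': ['balanced']},
    'XGBClassifier': {'n_estimators': [100, 200], 'max_depth': [6, 10],
                      'learning_rate': [0.01, 0.1, 0.2]},
}

n_folds = 5  # assumed to match cv=5 in the search cell
fits = {name: count_candidates(grid) * n_folds for name, grid in grids.items()}
print(fits)
```

By default `GridSearchCV` also refits the best candidate once on the whole training set, so each model costs one fit more than the number printed here.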

The models are initialized. Now we can iterate over them and pick the best one.

In [11]:
best_model = None
best_score = 0
best_name = None

for model_name, (model, param_grid) in models_and_params.items():
    print(f"Training {model_name}...")
    grid = GridSearchCV(model, param_grid, scoring=make_scorer(f1_score), cv=5, verbose=3)
    grid.fit(X_train_tf, y_train)
    if grid.best_score_ > best_score:
        best_score = grid.best_score_
        best_model = grid.best_estimator_
        best_name = model_name

    print(f"Best model {model_name}: {grid.best_params_}")
    print(f"Its metric {model_name}: {grid.best_score_}")

Training LogisticRegression...
Fitting 5 folds for each of 6 candidates, totalling 30 fits
[CV 1/5] END C=5, class_weight=balanced, penalty=l1, solver=liblinear;, score=0.748 total time=   3.4s
[CV 2/5] END C=5, class_weight=balanced, penalty=l1, solver=liblinear;, score=0.762 total time=   3.3s
[CV 3/5] END C=5, class_weight=balanced, penalty=l1, solver=liblinear;, score=0.769 total time=   3.9s
[CV 4/5] END C=5, class_weight=balanced, penalty=l1, solver=liblinear;, score=0.755 total time=   3.2s
[CV 5/5] END C=5, class_weight=balanced, penalty=l1, solver=liblinear;, score=0.769 total time=   3.5s
[CV 1/5] END C=5, class_weight=balanced, penalty=l2, solver=liblinear;, score=0.746 total time=   2.8s
[CV 2/5] END C=5, class_weight=balanced, penalty=l2, solver=liblinear;, score=0.764 total time=   2.7s
[CV 3/5] END C=5, class_weight=balanced, penalty=l2, solver=liblinear;, score=0.768 total time=   2.8s
[CV 4/5] END C=5, class_weight=balanced, penalty=l2, solver=liblinear;, score=0.754 t

Parameters: { "use_label_encoder" } are not used.

[CV 1/5] END learning_rate=0.01, max_depth=6, n_estimators=100;, score=0.414 total time=  34.9s
[CV 2/5] END learning_rate=0.01, max_depth=6, n_estimators=100;, score=0.409 total time=  34.1s
[CV 3/5] END learning_rate=0.01, max_depth=6, n_estimators=100;, score=0.435 total time=  32.9s
[CV 4/5] END learning_rate=0.01, max_depth=6, n_estimators=100;, score=0.411 total time=  34.2s
[CV 5/5] END learning_rate=0.01, max_depth=6, n_estimators=100;, score=0.411 total time=  34.2s
[CV 1/5] END learning_rate=0.01, max_depth=6, n_estimators=200;, score=0.472 total time= 1.2min
[CV 2/5] END learning_rate=0.01, max_depth=6, n_estimators=200;, score=0.478 total time= 1.2min
[CV 3/5] END learning_rate=0.01, max_depth=6, n_estimators=200;, score=0.493 total time= 1.1min
[CV 4/5] END learning_rate=0.01, max_depth=6, n_estimators=200;, score=0.478 total time= 1.2min
[CV 5/5] END learning_rate=0.01, max_depth=6, n_estimators=200;, score=0.484 total time= 1.2min
[CV 1/5] END learning_rate=0.01, max_depth=10, n_estimators=100;, score=0.503 total time= 1.5min
[CV 2/5] END learning_rate=0.01, max_depth=10, n_estimators=100;, score=0.498 total time= 1.4min
[CV 3/5] END learning_rate=0.01, max_depth=10, n_estimators=100;, score=0.517 total time= 1.4min
[CV 4/5] END learning_rate=0.01, max_depth=10, n_estimators=100;, score=0.491 total time= 1.4min
[CV 5/5] END learning_rate=0.01, max_depth=10, n_estimators=100;, score=0.513 total time= 1.5min
[CV 1/5] END learning_rate=0.01, max_depth=10, n_estimators=200;, score=0.550 total time= 3.0min
[CV 2/5] END learning_rate=0.01, max_depth=10, n_estimators=200;, score=0.547 total time= 2.9min
[CV 3/5] END learning_rate=0.01, max_depth=10, n_estimators=200;, score=0.573 total time= 2.9min
[CV 4/5] END learning_rate=0.01, max_depth=10, n_estimators=200;, score=0.551 total time= 2.9min
[CV 5/5] END learning_rate=0.01, max_depth=10, n_estimators=200;, score=0.562 total time= 2.9min
[CV 1/5] END learning_rate=0.1, max_depth=6, n_estimators=100;, score=0.641 total time=  30.4s
[CV 2/5] END learning_rate=0.1, max_depth=6, n_estimators=100;, score=0.633 total time=  30.0s
[CV 3/5] END learning_rate=0.1, max_depth=6, n_estimators=100;, score=0.659 total time=  30.3s
[CV 4/5] END learning_rate=0.1, max_depth=6, n_estimators=100;, score=0.640 total time=  31.1s
[CV 5/5] END learning_rate=0.1, max_depth=6, n_estimators=100;, score=0.659 total time=  33.0s
[CV 1/5] END learning_rate=0.1, max_depth=6, n_estimators=200;, score=0.692 total time=  56.5s
[CV 2/5] END learning_rate=0.1, max_depth=6, n_estimators=200;, score=0.687 total time=  54.8s
[CV 3/5] END learning_rate=0.1, max_depth=6, n_estimators=200;, score=0.707 total time=  53.1s
[CV 4/5] END learning_rate=0.1, max_depth=6, n_estimators=200;, score=0.689 total time=  53.5s
[CV 5/5] END learning_rate=0.1, max_depth=6, n_estimators=200;, score=0.707 total time=  53.4s
[CV 1/5] END learning_rate=0.1, max_depth=10, n_estimators=100;, score=0.683 total time= 1.1min
[CV 2/5] END learning_rate=0.1, max_depth=10, n_estimators=100;, score=0.680 total time= 1.1min
[CV 3/5] END learning_rate=0.1, max_depth=10, n_estimators=100;, score=0.699 total time= 1.1min
[CV 4/5] END learning_rate=0.1, max_depth=10, n_estimators=100;, score=0.684 total time= 1.1min
[CV 5/5] END learning_rate=0.1, max_depth=10, n_estimators=100;, score=0.694 total time= 1.2min
[CV 1/5] END learning_rate=0.1, max_depth=10, n_estimators=200;, score=0.717 total time= 1.8min
[CV 2/5] END learning_rate=0.1, max_depth=10, n_estimators=200;, score=0.722 total time= 1.8min
[CV 3/5] END learning_rate=0.1, max_depth=10, n_estimators=200;, score=0.730 total time= 1.9min
[CV 4/5] END learning_rate=0.1, max_depth=10, n_estimators=200;, score=0.725 total time= 1.9min
[CV 5/5] END learning_rate=0.1, max_depth=10, n_estimators=200;, score=0.731 total time= 1.8min
[CV 1/5] END learning_rate=0.2, max_depth=6, n_estimators=100;, score=0.696 total time=  27.0s
[CV 2/5] END learning_rate=0.2, max_depth=6, n_estimators=100;, score=0.690 total time=  26.8s
[CV 3/5] END learning_rate=0.2, max_depth=6, n_estimators=100;, score=0.706 total time=  27.7s
[CV 4/5] END learning_rate=0.2, max_depth=6, n_estimators=100;, score=0.692 total time=  27.4s
[CV 5/5] END learning_rate=0.2, max_depth=6, n_estimators=100;, score=0.708 total time=  26.7s
[CV 1/5] END learning_rate=0.2, max_depth=6, n_estimators=200;, score=0.730 total time=  47.2s
[CV 2/5] END learning_rate=0.2, max_depth=6, n_estimators=200;, score=0.731 total time=  46.3s
[CV 3/5] END learning_rate=0.2, max_depth=6, n_estimators=200;, score=0.741 total time=  47.6s
[CV 4/5] END learning_rate=0.2, max_depth=6, n_estimators=200;, score=0.731 total time=  47.5s
[CV 5/5] END learning_rate=0.2, max_depth=6, n_estimators=200;, score=0.743 total time=  46.2s
[CV 1/5] END learning_rate=0.2, max_depth=10, n_estimators=100;, score=0.721 total time=  57.0s
[CV 2/5] END learning_rate=0.2, max_depth=10, n_estimators=100;, score=0.721 total time=  57.6s
[CV 3/5] END learning_rate=0.2, max_depth=10, n_estimators=100;, score=0.732 total time=  56.9s
[CV 4/5] END learning_rate=0.2, max_depth=10, n_estimators=100;, score=0.727 total time=  56.0s
[CV 5/5] END learning_rate=0.2, max_depth=10, n_estimators=100;, score=0.732 total time=  55.8s
[CV 1/5] END learning_rate=0.2, max_depth=10, n_estimators=200;, score=0.749 total time= 1.5min
[CV 2/5] END learning_rate=0.2, max_depth=10, n_estimators=200;, score=0.750 total time= 1.5min
[CV 3/5] END learning_rate=0.2, max_depth=10, n_estimators=200;, score=0.749 total time= 1.5min
[CV 4/5] END learning_rate=0.2, max_depth=10, n_estimators=200;, score=0.749 total time= 1.5min
[CV 5/5] END learning_rate=0.2, max_depth=10, n_estimators=200;, score=0.758 total time= 1.5min

Best model XGBClassifier: {'learning_rate': 0.2, 'max_depth': 10, 'n_estimators': 200}
Its metric XGBClassifier: 0.7510765274117126
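The value reported as the best score is the mean of the five fold scores for the winning candidate. As a quick check, the sketch below averages the fold scores printed in the log for the best XGBClassifier configuration. The log rounds them to three decimals, so the result agrees with the reported value only to about that precision.

```python
from statistics import mean

# Fold scores for learning_rate=0.2, max_depth=10, n_estimators=200,
# as printed in the log above (rounded to three decimals there).
fold_scores = [0.749, 0.750, 0.749, 0.749, 0.758]

mean_score = mean(fold_scores)
print(round(mean_score, 3))
```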


All models have been evaluated and the best one has been found. Let's check it on the test set.

## Conclusions

In [13]:
print(f"Best model: {best_name} and its metric: {best_score}")
y_pred_test = best_model.predict(X_test_tf)
f1_test = f1_score(y_test, y_pred_test)
print(f"Metric on the test set ({best_name}): {f1_test}")

Best model: SVC and its metric: 0.7785471440140757
Metric on the test set (SVC): 0.786134903640257
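F1 is the harmonic mean of precision and recall, which is why it suits this imbalanced task better than accuracy. The sketch below computes it from confusion-matrix counts. The function `f1_from_counts` and the counts are hypothetical and chosen only for illustration; they are not taken from the notebook.

```python
def f1_from_counts(tp, fp, fn):
    """F1 as the harmonic mean of precision and recall, from confusion-matrix counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical example: 1000 comments, 100 of them toxic.
reasonable = f1_from_counts(tp=80, fp=20, fn=20)     # finds most toxic comments
always_clean = f1_from_counts(tp=0, fp=0, fn=100)    # always predicts "not toxic"
print(reasonable, always_clean)
```

In this example a classifier that always answers "not toxic" would be 90% accurate, yet its F1 is zero because it finds no toxic comments at all.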


Loading the data revealed no defects, but the classes turned out to be imbalanced: almost 90% of the comments are non-toxic. The texts were then lemmatized and converted to TF-IDF vectors, after which we moved on to the models. We compared four models of varying complexity: logistic regression, random forest, SVC and XGBoost. Logistic regression and XGBoost both cleared the 0.75 threshold in cross-validation, with the best XGBoost configuration reaching about 0.751. The best model overall, however, was SVC, with a cross-validated F1 of about 0.779 and 0.786 on the test set, so it is the best fit for our task.

## Review checklist

- [x]  Jupyter Notebook is open
- [ ]  All code runs without errors
- [ ]  Code cells are in execution order
- [ ]  Data is loaded and prepared
- [ ]  Models are trained
- [ ]  The *F1* score is at least 0.75
- [ ]  Conclusions are written