# Project for "Wikishop"

The online store "Wikishop" is launching a new service. Users can now edit and extend product descriptions, as in wiki communities: customers propose their own edits and comment on the changes made by others. The store needs a tool that finds toxic comments and sends them for moderation.

Train a model to classify comments as positive or negative. A dataset labeled for the toxicity of edits is available.

Build a model with an *F1* score of at least 0.75.

**Project instructions**

1. Load and prepare the data.
2. Train several models.
3. Draw conclusions.

Using *BERT* is optional, but you may try it.

**Data description**

The data is in the file `toxic_comments.csv`. The *text* column contains the comment text, and *toxic* is the target.

## Preparation

In [1]:
# install optuna
!pip install optuna -q

In [2]:
# imports and display settings
import warnings
warnings.filterwarnings('ignore')
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

import re

import nltk
from nltk.corpus import stopwords as nltk_stopwords
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV, cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

from catboost import CatBoostClassifier
from optuna import distributions
from optuna.integration import OptunaSearchCV

nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

from sklearn.metrics import f1_score

pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', None)

pd.options.display.float_format = '{:.2f}'.format

RANDOM_STATE = 42
TEST_SIZE = 0.25

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Ksenchik\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\Ksenchik\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


In [3]:
# load the data
try:
    comment = pd.read_csv('/datasets/toxic_comments.csv')
except FileNotFoundError:
    comment = pd.read_csv('toxic_comments.csv')

In [4]:
# look at the contents
comment.head()

Unnamed: 0.1,Unnamed: 0,text,toxic
0,0,"Explanation\nWhy the edits made under my username Hardcore Metallica Fan were reverted? They weren't vandalisms, just closure on some GAs after I voted at New York Dolls FAC. And please don't remove the template from the talk page since I'm retired now.89.205.38.27",0
1,1,"D'aww! He matches this background colour I'm seemingly stuck with. Thanks. (talk) 21:51, January 11, 2016 (UTC)",0
2,2,"Hey man, I'm really not trying to edit war. It's just that this guy is constantly removing relevant information and talking to me through edits instead of my talk page. He seems to care more about the formatting than the actual info.",0
3,3,"""\nMore\nI can't make any real suggestions on improvement - I wondered if the section statistics should be later on, or a subsection of """"types of accidents"""" -I think the references may need tidying so that they are all in the exact same format ie date format etc. I can do that later on, if no-one else does first - if you have any preferences for formatting style on references or want to do it yourself please let me know.\n\nThere appears to be a backlog on articles for review so I guess there may be a delay until a reviewer turns up. It's listed in the relevant form eg Wikipedia:Good_article_nominations#Transport """,0
4,4,"You, sir, are my hero. Any chance you remember what page that's on?",0


In [5]:
comment.info()
print('Duplicates: \n', comment.duplicated().sum())
print('Missing values: \n', comment.isna().sum())
print('Shape: \n', comment.shape)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 159292 entries, 0 to 159291
Data columns (total 3 columns):
 #   Column      Non-Null Count   Dtype 
---  ------      --------------   ----- 
 0   Unnamed: 0  159292 non-null  int64 
 1   text        159292 non-null  object
 2   toxic       159292 non-null  int64 
dtypes: int64(2), object(1)
memory usage: 3.6+ MB
Duplicates: 
 0
Missing values: 
 Unnamed: 0    0
text          0
toxic         0
dtype: int64
Shape: 
 (159292, 3)


In [6]:
# the "Unnamed: 0" column cannot serve as an index: values 6080-6083 are missing. Drop it
# (.copy() so that the 'text' column can be overwritten without a SettingWithCopyWarning)
comment = comment[['text', 'toxic']].copy()

That leaves two columns: the comment text and the target. There are no missing values or duplicates. The comments are in English.

Let's look at how the target is distributed.

In [7]:
comment['toxic'].value_counts()

0    143106
1     16186
Name: toxic, dtype: int64

In [8]:
# share of toxic comments
len(comment[comment['toxic'] == 1]) / comment.shape[0]

0.10161213369158527

The class imbalance will have to be taken into account during training.
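
One common way to account for the imbalance is to weight classes inversely to their frequency. The sketch below reproduces, in plain Python, the formula that scikit-learn documents for `class_weight='balanced'` (`n_samples / (n_classes * count)`), applied to the counts above. The helper name `balanced_class_weights` is made up for illustration.

```python
def balanced_class_weights(counts):
    """Return {label: n_samples / (n_classes * count)} for a dict of class counts."""
    n_samples = sum(counts.values())
    n_classes = len(counts)
    return {label: n_samples / (n_classes * count) for label, count in counts.items()}

# class counts taken from value_counts() above
weights = balanced_class_weights({0: 143106, 1: 16186})
ratio = weights[1] / weights[0]
print(weights, ratio)
```

The ratio of the two weights is about 8.8, so each toxic comment counts roughly nine times as much as a non-toxic one. This is close to the manual `class_weights=[1, 9]` passed to CatBoost in this notebook.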


Prepare the data for the logistic regression model.

In [9]:
# function that strips digits and symbols from the text
def clear_text(text):
    cleaned = re.sub(r'[^a-zA-Z]', ' ', text)
    return ' '.join(cleaned.split()).lower()

In [10]:
# strip non-letter characters from every comment and overwrite the column
comment['text'] = comment['text'].apply(clear_text)

# check the result
comment.head()

Unnamed: 0,text,toxic
0,explanation why the edits made under my username hardcore metallica fan were reverted they weren t vandalisms just closure on some gas after i voted at new york dolls fac and please don t remove the template from the talk page since i m retired now,0
1,d aww he matches this background colour i m seemingly stuck with thanks talk january utc,0
2,hey man i m really not trying to edit war it s just that this guy is constantly removing relevant information and talking to me through edits instead of my talk page he seems to care more about the formatting than the actual info,0
3,more i can t make any real suggestions on improvement i wondered if the section statistics should be later on or a subsection of types of accidents i think the references may need tidying so that they are all in the exact same format ie date format etc i can do that later on if no one else does first if you have any preferences for formatting style on references or want to do it yourself please let me know there appears to be a backlog on articles for review so i guess there may be a delay until a reviewer turns up it s listed in the relevant form eg wikipedia good article nominations transport,0
4,you sir are my hero any chance you remember what page that s on,0
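
The output shows a side effect of the regular expression: apostrophes are replaced by spaces, so contractions such as "weren't" become "weren t". The sketch below reproduces this on a short string and shows one possible variant that keeps apostrophes. `clear_text_keep_apostrophes` is a hypothetical helper, not part of the notebook.

```python
import re

def clear_text(text):
    # same logic as the notebook's function: keep only Latin letters
    cleaned = re.sub(r'[^a-zA-Z]', ' ', text)
    return ' '.join(cleaned.split()).lower()

def clear_text_keep_apostrophes(text):
    # hypothetical variant: also keep apostrophes
    cleaned = re.sub(r"[^a-zA-Z']", ' ', text)
    return ' '.join(cleaned.split()).lower()

sample = "They weren't vandalisms, 89.205.38.27"
print(clear_text(sample))                    # they weren t vandalisms
print(clear_text_keep_apostrophes(sample))   # they weren't vandalisms
```

Whether keeping contractions intact changes the metric was not tested here.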


In [11]:
# helper for a lemmatizer object `m` that returns a list of lemmas (not called anywhere below)
def lemmatize(text, m):
    return "".join(m.lemmatize(text))

In [12]:
lemmatizer = WordNetLemmatizer()

def get_wordnet_pos(treebank_tag):
    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    elif treebank_tag.startswith('V'):
        return wordnet.VERB
    elif treebank_tag.startswith('N'):
        return wordnet.NOUN
    elif treebank_tag.startswith('R'):
        return wordnet.ADV
    else:
        return None

comment['text'] = comment['text'].apply(lambda v: nltk.pos_tag(nltk.word_tokenize(v)))


In [13]:
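def lemmatize_tagged(tagged_tokens):
    # lemmatize with the mapped POS tag; keep the token as-is when no mapping exists
    lemmas = []
    for token, tag in tagged_tokens:
        pos = get_wordnet_pos(tag)
        lemmas.append(lemmatizer.lemmatize(token, pos=pos) if pos else token)
    return ' '.join(lemmas)

comment['text'] = comment['text'].apply(lemmatize_tagged)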

In [14]:
comment.head()

Unnamed: 0,text,toxic
0,explanation why the edits make under my username hardcore metallica fan be revert they weren t vandalism just closure on some gas after i vote at new york doll fac and please don t remove the template from the talk page since i m retire now,0
1,d aww he match this background colour i m seemingly stick with thanks talk january utc,0
2,hey man i m really not try to edit war it s just that this guy be constantly remove relevant information and talk to me through edits instead of my talk page he seem to care more about the formatting than the actual info,0
3,more i can t make any real suggestion on improvement i wonder if the section statistic should be later on or a subsection of type of accident i think the reference may need tidy so that they be all in the exact same format ie date format etc i can do that later on if no one else do first if you have any preference for format style on reference or want to do it yourself please let me know there appear to be a backlog on article for review so i guess there may be a delay until a reviewer turn up it s list in the relevant form eg wikipedia good article nomination transport,0
4,you sir be my hero any chance you remember what page that s on,0


It worked!

The data is now ready for training.

## Training

In [15]:
X = comment['text']
y = comment['toxic']

In [16]:
# split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.45, random_state=73, stratify=y)

print("X_train.shape", X_train.shape)
print("y_train.shape", y_train.shape)
print("X_test.shape", X_test.shape)
print("y_test.shape", y_test.shape)

X_train.shape (87610,)
y_train.shape (87610,)
X_test.shape (71682,)
y_test.shape (71682,)
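
The sizes printed above follow from `test_size=0.45`. A quick check of the arithmetic, assuming scikit-learn rounds the test part up and gives the remainder to the training part (an assumption about the rounding rule, checked only against the numbers above):

```python
import math

n_samples = 159292      # rows in the dataset
test_size = 0.45        # fraction passed to train_test_split above

n_test = math.ceil(n_samples * test_size)   # assumed rounding rule: ceil
n_train = n_samples - n_test
print(n_train, n_test)
```

Note that the constants `TEST_SIZE = 0.25` and `RANDOM_STATE = 42` defined at the top of the notebook are not used in this split: `test_size=0.45` and `random_state=73` are passed explicitly.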


Remove stop words and vectorize the texts with TF-IDF.

In [17]:
nltk.download('stopwords')
stopwords = set(nltk_stopwords.words('english'))

count_tf_idf = TfidfVectorizer(stop_words=list(stopwords))
X_train_tf_idf = count_tf_idf.fit_transform(X_train)

print("Train matrix shape:", X_train_tf_idf.shape)

X_test_tf_idf = count_tf_idf.transform(X_test.values)
print("Test matrix shape:", X_test_tf_idf.shape)

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Ksenchik\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Train matrix shape: (87610, 109429)
Test matrix shape: (71682, 109429)
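
To make the vectorization step less of a black box, here is a plain-Python sketch of the weighting that `TfidfVectorizer` applies with its default settings, as described in the scikit-learn documentation: raw term counts multiplied by `idf = ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each row. The toy documents and the function name `tfidf` are illustrative, and tokenization is simplified to whitespace splitting.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: weight} dict per document (smoothed idf, L2-normalized)."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # document frequency: in how many documents each term occurs
    df = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {term: math.log((1 + n_docs) / (1 + count)) + 1 for term, count in df.items()}
    rows = []
    for tokens in tokenized:
        tf = Counter(tokens)
        raw = {term: freq * idf[term] for term, freq in tf.items()}
        norm = math.sqrt(sum(w * w for w in raw.values()))
        rows.append({term: w / norm for term, w in raw.items()})
    return rows

docs = ["you be my hero", "you be a troll", "edit war"]
rows = tfidf(docs)
print(rows[0])
```

In the first toy document the rare word "hero" gets a larger weight than "you", which occurs in two of the three documents.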


In [18]:
pipe_tfidf = Pipeline([('tfidf', TfidfVectorizer(stop_words=list(stopwords))),
                       ('logreg', LogisticRegression(class_weight = 'balanced', random_state=RANDOM_STATE))])

In [19]:
pipe_tfidf.fit(X_train, y_train)

In [20]:
scores_list = cross_val_score(estimator=pipe_tfidf, 
                              X=X_train, 
                              y=y_train, 
                              cv=5,  
                              scoring='f1')

val_score = scores_list.mean()
print(f'F1 (cross-validation on the training set): {val_score}')

F1 (cross-validation on the training set): 0.7289524664072075
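
For reference, the `f1` scorer is the harmonic mean of precision and recall for the positive (toxic) class. A minimal sketch on made-up labels, with a hypothetical helper `f1_binary`:

```python
def f1_binary(y_true, y_pred):
    """F1 for the positive class (label 1), computed from TP, FP and FN."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# made-up example: 10 comments, 2 of them toxic
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
all_negative = [0] * 10                       # accuracy 0.9, but no toxic comment found
one_hit = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]      # 1 TP, 1 FP, 1 FN
print(f1_binary(y_true, all_negative), f1_binary(y_true, one_hit))
```

A model that answers "not toxic" for every comment is 90% accurate on such data but has F1 = 0, which illustrates why accuracy would be misleading with about 10% toxic comments.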


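Switch to L1 regularization (`penalty='l1'`, `solver='liblinear'`) with `C=4`. The model above already uses the default L2 regularization with `C=1`.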

In [21]:
pipe_tfidf2 = Pipeline([('tfidf', TfidfVectorizer(stop_words=list(stopwords))),
                       ('logreg', LogisticRegression(penalty='l1', class_weight = 'balanced', random_state=RANDOM_STATE, 
                                                     solver='liblinear', C = 4))])

pipe_tfidf2.fit(X_train, y_train)

In [22]:
scores_list_C = cross_val_score(estimator=pipe_tfidf2, 
                              X=X_train, 
                              y=y_train, 
                              cv=5,  
                              scoring='f1')

val_score_C = scores_list_C.mean()
print(f'F1 (cross-validation on the training set): {val_score_C}')

F1 (cross-validation on the training set): 0.7480188390306297


Logistic regression is trained. Next, try CatBoostClassifier on the TF-IDF matrix built above.

In [23]:
%%time
cat = CatBoostClassifier(
            loss_function="Logloss",
            n_estimators=400, 
            class_weights=[1, 9],
            max_depth=6,
            verbose=100)
cat.fit(X_train_tf_idf, y_train)



Learning rate set to 0.161182
0:	learn: 0.6277789	total: 558ms	remaining: 3m 42s
100:	learn: 0.3241445	total: 41.3s	remaining: 2m 2s
200:	learn: 0.2636211	total: 1m 21s	remaining: 1m 20s
300:	learn: 0.2300850	total: 2m 1s	remaining: 39.8s
399:	learn: 0.2073125	total: 2m 40s	remaining: 0us
Wall time: 2min 43s


<catboost.core.CatBoostClassifier at 0x22c7f13b280>

In [24]:
%%time
scores_list2 = cross_val_score(estimator=cat, 
                              X=X_train_tf_idf, 
                              y=y_train, 
                              cv=3,  
                              scoring='f1')

val_score2 = scores_list2.mean()
print(f'F1 (cross-validation on the training set): {val_score2}')

Learning rate set to 0.135557
0:	learn: 0.6390977	total: 352ms	remaining: 2m 20s
100:	learn: 0.3382736	total: 32.2s	remaining: 1m 35s
200:	learn: 0.2698410	total: 1m 3s	remaining: 1m 2s
300:	learn: 0.2321154	total: 1m 34s	remaining: 31.2s
399:	learn: 0.2072812	total: 2m 5s	remaining: 0us
Learning rate set to 0.135558
0:	learn: 0.6372370	total: 331ms	remaining: 2m 12s
100:	learn: 0.3376575	total: 32.9s	remaining: 1m 37s
200:	learn: 0.2695483	total: 1m 4s	remaining: 1m 4s
300:	learn: 0.2324316	total: 1m 36s	remaining: 31.8s
399:	learn: 0.2065437	total: 2m 7s	remaining: 0us
Learning rate set to 0.135558
0:	learn: 0.6355095	total: 327ms	remaining: 2m 10s
100:	learn: 0.3403007	total: 32.4s	remaining: 1m 36s
200:	learn: 0.2727446	total: 1m 3s	remaining: 1m 2s
300:	learn: 0.2364670	total: 1m 34s	remaining: 31.2s
399:	learn: 0.2101853	total: 2m 5s	remaining: 0us
F1 (cross-validation on the training set): 0.7381088652686932
Wall time: 6min 26s


Try tuning the CatBoostClassifier hyperparameters to improve the metric.

In [25]:
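%time
# note: a bare `%time` line times an empty statement, hence "Wall time: 0 ns" in the output;
# `%%time` as the first line of the cell would time the whole cell

# try other hyperparameters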
cat_model = CatBoostClassifier(loss_function="Logloss", class_weights=[1, 9], verbose=100)

cat_params =  {
    'max_depth': distributions.IntDistribution(1, 6),
    'n_estimators': distributions.IntDistribution(100, 1000)
} 

oscv_all_add = OptunaSearchCV(
    cat_model, 
    cat_params, 
    cv=3,
    n_trials=5,
    error_score='raise',
    scoring='f1',
    random_state=RANDOM_STATE
) 

oscv_all_add.fit(X_train_tf_idf, y_train)

print('Best model and its parameters:\n\n', oscv_all_add.best_estimator_)
print('Best model metric on the training set (CV):', oscv_all_add.best_score_)

[I 2024-03-06 20:59:42,519] A new study created in memory with name: no-name-4e17fa9f-85b9-408e-a205-f98c8493d845


Wall time: 0 ns
Learning rate set to 0.06791
0:	learn: 0.6671122	total: 124ms	remaining: 1m 44s
100:	learn: 0.4449378	total: 10.6s	remaining: 1m 18s
200:	learn: 0.3899090	total: 21.1s	remaining: 1m 8s
300:	learn: 0.3536864	total: 31.4s	remaining: 57.3s
400:	learn: 0.3289825	total: 41.9s	remaining: 46.9s
500:	learn: 0.3102191	total: 52.2s	remaining: 36.3s
600:	learn: 0.2948995	total: 1m 2s	remaining: 25.9s
700:	learn: 0.2818259	total: 1m 12s	remaining: 15.5s
800:	learn: 0.2700054	total: 1m 23s	remaining: 5.09s
849:	learn: 0.2651012	total: 1m 28s	remaining: 0us
Learning rate set to 0.067911
0:	learn: 0.6681078	total: 135ms	remaining: 1m 55s
100:	learn: 0.4456358	total: 10.9s	remaining: 1m 20s
200:	learn: 0.3906247	total: 21.5s	remaining: 1m 9s
300:	learn: 0.3544225	total: 31.9s	remaining: 58.2s
400:	learn: 0.3298151	total: 42.2s	remaining: 47.3s
500:	learn: 0.3102812	total: 52.6s	remaining: 36.6s
600:	learn: 0.2952997	total: 1m 2s	remaining: 26.1s
700:	learn: 0.2819947	total: 1m 13s	rema

[I 2024-03-06 21:04:15,608] Trial 0 finished with value: 0.7283090721645191 and parameters: {'max_depth': 3, 'n_estimators': 850}. Best is trial 0 with value: 0.7283090721645191.


Learning rate set to 0.139727
0:	learn: 0.6577563	total: 66.1ms	remaining: 25.5s
100:	learn: 0.4653671	total: 4.24s	remaining: 12s
200:	learn: 0.4118960	total: 8.36s	remaining: 7.74s
300:	learn: 0.3811489	total: 12.5s	remaining: 3.57s
386:	learn: 0.3614351	total: 16s	remaining: 0us
Learning rate set to 0.139728
0:	learn: 0.6582645	total: 68.7ms	remaining: 26.5s
100:	learn: 0.4662898	total: 4.23s	remaining: 12s
200:	learn: 0.4143290	total: 8.33s	remaining: 7.71s
300:	learn: 0.3816574	total: 12.4s	remaining: 3.55s
386:	learn: 0.3625988	total: 16s	remaining: 0us
Learning rate set to 0.139728
0:	learn: 0.6594654	total: 48.6ms	remaining: 18.7s
100:	learn: 0.4669864	total: 4.25s	remaining: 12s
200:	learn: 0.4141341	total: 8.5s	remaining: 7.87s
300:	learn: 0.3832783	total: 12.7s	remaining: 3.63s
386:	learn: 0.3644857	total: 16.3s	remaining: 0us


[I 2024-03-06 21:05:11,708] Trial 1 finished with value: 0.7146176739431805 and parameters: {'max_depth': 1, 'n_estimators': 387}. Best is trial 0 with value: 0.7283090721645191.


Learning rate set to 0.146328
0:	learn: 0.6433326	total: 133ms	remaining: 48.6s
100:	learn: 0.3794855	total: 10.7s	remaining: 28.2s
200:	learn: 0.3201171	total: 21s	remaining: 17.4s
300:	learn: 0.2864293	total: 31.3s	remaining: 6.97s
367:	learn: 0.2691168	total: 38.2s	remaining: 0us
Learning rate set to 0.14633
0:	learn: 0.6455771	total: 142ms	remaining: 52.2s
100:	learn: 0.3816290	total: 10.8s	remaining: 28.4s
200:	learn: 0.3217965	total: 21.1s	remaining: 17.5s
300:	learn: 0.2863022	total: 31.4s	remaining: 6.99s
367:	learn: 0.2696647	total: 38.4s	remaining: 0us
Learning rate set to 0.14633
0:	learn: 0.6457544	total: 123ms	remaining: 45.1s
100:	learn: 0.3812695	total: 10.6s	remaining: 28.1s
200:	learn: 0.3234858	total: 21s	remaining: 17.5s
300:	learn: 0.2908058	total: 31.3s	remaining: 6.97s
367:	learn: 0.2743353	total: 38.2s	remaining: 0us


[I 2024-03-06 21:07:14,388] Trial 2 finished with value: 0.726431332950329 and parameters: {'max_depth': 3, 'n_estimators': 368}. Best is trial 0 with value: 0.7283090721645191.


Learning rate set to 0.208593
0:	learn: 0.6373040	total: 99.8ms	remaining: 24.8s
100:	learn: 0.3765813	total: 7.29s	remaining: 10.7s
200:	learn: 0.3211764	total: 14.3s	remaining: 3.49s
249:	learn: 0.3048930	total: 17.8s	remaining: 0us
Learning rate set to 0.208595
0:	learn: 0.6414722	total: 100ms	remaining: 24.9s
100:	learn: 0.3776884	total: 7.29s	remaining: 10.8s
200:	learn: 0.3242259	total: 14.4s	remaining: 3.52s
249:	learn: 0.3067547	total: 17.9s	remaining: 0us
Learning rate set to 0.208595
0:	learn: 0.6399539	total: 92.6ms	remaining: 23.1s
100:	learn: 0.3821822	total: 7.21s	remaining: 10.6s
200:	learn: 0.3268480	total: 14.2s	remaining: 3.46s
249:	learn: 0.3097806	total: 17.6s	remaining: 0us


[I 2024-03-06 21:08:15,553] Trial 3 finished with value: 0.7232120902408349 and parameters: {'max_depth': 2, 'n_estimators': 250}. Best is trial 0 with value: 0.7283090721645191.


Learning rate set to 0.139727
0:	learn: 0.6514102	total: 107ms	remaining: 41.5s
100:	learn: 0.4141940	total: 7.26s	remaining: 20.6s
200:	learn: 0.3561478	total: 14.3s	remaining: 13.2s
300:	learn: 0.3248494	total: 21.3s	remaining: 6.09s
386:	learn: 0.3049504	total: 27.3s	remaining: 0us
Learning rate set to 0.139728
0:	learn: 0.6541247	total: 107ms	remaining: 41.4s
100:	learn: 0.4151244	total: 7.31s	remaining: 20.7s
200:	learn: 0.3580300	total: 14.4s	remaining: 13.3s
300:	learn: 0.3262483	total: 21.4s	remaining: 6.12s
386:	learn: 0.3053047	total: 27.5s	remaining: 0us
Learning rate set to 0.139728
0:	learn: 0.6536900	total: 90.6ms	remaining: 35s
100:	learn: 0.4163100	total: 7.27s	remaining: 20.6s
200:	learn: 0.3603230	total: 14.3s	remaining: 13.2s
300:	learn: 0.3271958	total: 21.3s	remaining: 6.09s
386:	learn: 0.3082927	total: 27.3s	remaining: 0us


[I 2024-03-06 21:09:45,539] Trial 4 finished with value: 0.720853087158603 and parameters: {'max_depth': 2, 'n_estimators': 387}. Best is trial 0 with value: 0.7283090721645191.


Learning rate set to 0.080747
0:	learn: 0.6648086	total: 164ms	remaining: 2m 19s
100:	learn: 0.4312831	total: 13.9s	remaining: 1m 43s
200:	learn: 0.3774283	total: 27.6s	remaining: 1m 29s
300:	learn: 0.3437616	total: 41s	remaining: 1m 14s
400:	learn: 0.3204870	total: 54.5s	remaining: 1m 1s
500:	learn: 0.3035106	total: 1m 7s	remaining: 47.2s
600:	learn: 0.2894230	total: 1m 21s	remaining: 33.6s
700:	learn: 0.2772495	total: 1m 34s	remaining: 20.1s
800:	learn: 0.2671610	total: 1m 47s	remaining: 6.58s
849:	learn: 0.2626301	total: 1m 54s	remaining: 0us
Best model and its parameters:

 <catboost.core.CatBoostClassifier object at 0x0000022C86687FD0>
Best model metric on the training set (CV): 0.7283090721645191


Try a few more models, this time inside a pipeline.

In [26]:
pipe_tfidf3 = Pipeline([('tfidf', TfidfVectorizer(stop_words=list(stopwords))),
                       ('models', LogisticRegression(class_weight = 'balanced', random_state=RANDOM_STATE))])

param_grid = [
    # parameters for DecisionTreeClassifier()
    {
        'models': [DecisionTreeClassifier(class_weight = 'balanced', random_state=RANDOM_STATE)],
        "models__max_depth": range(2, 5),
        "models__max_features": range(2, 5)
    },
    
    # parameters for KNeighborsClassifier()
    {
        'models': [KNeighborsClassifier()],
        'models__n_neighbors': range(2, 5)
    }
]

In [27]:
randomized_search = RandomizedSearchCV(
    pipe_tfidf3, 
    param_grid, 
    cv=5,
    scoring='f1',
    random_state=RANDOM_STATE,
    n_jobs=-1
)

In [28]:
%time
# a bare `%time` line times nothing, hence "Wall time: 0 ns" in the output
randomized_search.fit(X_train, y_train)

print('Best model and its parameters:\n\n', randomized_search.best_estimator_)
print('Best model metric on the training set (CV):', randomized_search.best_score_)

Wall time: 0 ns
Best model and its parameters:

 Pipeline(steps=[('tfidf',
                 TfidfVectorizer(stop_words=['doing', 'out', 'own', 'it',
                                             'each', 'so', 'himself', 'mustn',
                                             'its', "she's", 'under', 'or',
                                             "couldn't", 'he', 'again', 'by',
                                             "hadn't", 'off', 'couldn',
                                             'shouldn', 'when', 'didn', 'were',
                                             'during', 'hadn', 'yourselves',
                                             'their', 'will', 'ourselves',
                                             'after', ...])),
                ('models', KNeighborsClassifier(n_neighbors=3))])
Best model metric on the training set (CV): 0.30314958671665887


Well.

Let's collect all the results.

In [31]:
df = pd.DataFrame({'Model': ['LogisticRegression', 'LogisticRegression with L1 regularization', 'CatBoostClassifier', 
                              'CatBoostClassifier with tuned parameters', 'KNeighborsClassifier'],
                   'F1 (cross-validation)': [0.73, 0.75, 0.74, 0.73, 0.30]
                  })

df

Unnamed: 0,Model,F1 (cross-validation)
0,LogisticRegression,0.73
1,LogisticRegression with L1 regularization,0.75
2,CatBoostClassifier,0.74
3,CatBoostClassifier with tuned parameters,0.73
4,KNeighborsClassifier,0.30
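
Rounding hides how close each model is to the target. A small sketch that compares the unrounded cross-validation scores printed in the cells above with the 0.75 threshold (the dictionary keys are labels chosen for illustration):

```python
THRESHOLD = 0.75

# cross-validation F1 values copied from the outputs above
cv_scores = {
    'LogisticRegression': 0.7289524664072075,
    'LogisticRegression, L1, C=4': 0.7480188390306297,
    'CatBoostClassifier': 0.7381088652686932,
    'CatBoostClassifier, tuned': 0.7283090721645191,
    'KNeighborsClassifier': 0.30314958671665887,
}

# list the models from best to worst and mark them against the threshold
for name, score in sorted(cv_scores.items(), key=lambda item: item[1], reverse=True):
    status = 'reaches' if score >= THRESHOLD else 'below'
    print(f'{name}: {score:.4f} ({status} {THRESHOLD})')

best_name = max(cv_scores, key=cv_scores.get)
```

The best cross-validation score, 0.748, rounds to 0.75 but is still slightly below the threshold.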


In cross-validation none of the models reached the 0.75 threshold, although LogisticRegression and CatBoostClassifier came close: the best score, 0.748 for LogisticRegression with L1 regularization, only rounds to 0.75. Compute F1 on the test set with that model, `pipe_tfidf2`.

In [33]:
y_test_pred = pipe_tfidf2.predict(X_test)
print(f'F1 on the test set: {round(f1_score(y_test, y_test_pred), 2)}')

F1 on the test set: 0.75


## Conclusions

1. During preparation we found no missing values or duplicates; the comments are in English.
2. The classes are imbalanced: toxic comments make up only about 10% of the data.
3. The data was preprocessed:
* digits and symbols were removed;
* the texts were lemmatized;
* stop words were removed;
* the texts were converted to TF-IDF vectors.
4. The following models were compared with cross-validation:
* LogisticRegression
* LogisticRegression with L1 regularization
* CatBoostClassifier
* CatBoostClassifier with tuned parameters
* DecisionTreeClassifier and KNeighborsClassifier, compared in one randomized search.

The best cross-validation results came from LogisticRegression with L1 regularization (0.748) and CatBoostClassifier (0.738). On the test set, LogisticRegression with L1 regularization scored an F1 of 0.75 after rounding to two decimals.

The result could probably be improved further with BERT.