# Text analysis project for the online store «ИнтерСейл» using the BERT language model

# Project description

The online store «ИнтерСейл» is launching a service in which users can edit and extend product descriptions, as in wiki communities. Customers propose their own edits and comment on changes made by others. The store needs a tool that finds toxic comments and sends them for moderation.
We will train a model to classify comments as positive or negative. We have a dataset in which edits are labeled for toxicity.

Target value of the quality metric: F1 > 0.75.
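For reference, F1 is the harmonic mean of precision and recall. The sketch below computes it from confusion counts; the function name `f1_from_counts` and the sample counts are illustrative assumptions, not values from this project.

```python
def f1_from_counts(tp: int, fp: int, fn: int) -> float:
    """Return F1 = 2 * P * R / (P + R) computed from confusion counts."""
    if tp == 0:
        return 0.0  # without true positives the score is defined here as zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts: 80 toxic comments found, 20 false alarms, 30 missed
example_f1 = f1_from_counts(tp=80, fp=20, fn=30)
print(round(example_f1, 3))
```

With these made-up counts the score is just above the 0.75 target, which shows how many misses and false alarms such a threshold tolerates.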

In [None]:
# Mounting Google Drive

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


### Loading the required libraries

In [None]:
!pip install transformers

Collecting transformers
  Downloading transformers-4.18.0-py3-none-any.whl (4.0 MB)
Collecting sacremoses
  Downloading sacremoses-0.0.53.tar.gz (880 kB)
Collecting pyyaml>=5.1
  Downloading PyYAML-6.0-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (596 kB)
Collecting huggingface-hub<1.0,>=0.1.0
  Downloading huggingface_hub-0.5.1-py3-none-any.whl (77 kB)
Collecting tokenizers!=0.11.3,<0.13,>=0.11.1
  Downloading tokenizers-0.12.1-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (6.6 MB)
Building wheels for collected packages: sacremoses
  Building wheel for sacremoses (setup.py) ... done
  Created wheel for sacre

In [None]:
!pip install catboost

Collecting catboost
  Downloading catboost-1.0.5-cp37-none-manylinux1_x86_64.whl (76.6 MB)
Installing collected packages: catboost
Successfully installed catboost-1.0.5


In [None]:
import numpy as np
import pandas as pd

import torch
import transformers

from tqdm import notebook

from sklearn.linear_model import LogisticRegression
from catboost import CatBoostClassifier

from sklearn.metrics import f1_score

import spacy

import nltk
from nltk.corpus import stopwords as nltk_stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
import re

from sklearn.model_selection import train_test_split

# Data preparation

Read the data and display the key information.

In [None]:
data = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/toxic_comments.csv')

In [None]:
data

Unnamed: 0,text,toxic
0,Explanation\nWhy the edits made under my usern...,0
1,D'aww! He matches this background colour I'm s...,0
2,"Hey man, I'm really not trying to edit war. It...",0
3,"""\nMore\nI can't make any real suggestions on ...",0
4,"You, sir, are my hero. Any chance you remember...",0
...,...,...
159566,""":::::And for the second time of asking, when ...",0
159567,You should be ashamed of yourself \n\nThat is ...,0
159568,"Spitzer \n\nUmm, theres no actual article for ...",0
159569,And it looks like it was actually you who put ...,0


Let us look at the target feature. Note the class imbalance (roughly 9:1).

In [None]:
target = data['toxic']
target.value_counts()

0    143346
1     16225
Name: toxic, dtype: int64
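The imbalance can be quantified directly from the counts printed above. A minimal sketch; the variable names are illustrative.

```python
# Class counts copied from the value_counts() output above
n_non_toxic = 143346
n_toxic = 16225

ratio = n_non_toxic / n_toxic                    # non-toxic comments per toxic one
toxic_share = n_toxic / (n_non_toxic + n_toxic)  # fraction of toxic comments

print(f"ratio = {ratio:.2f}, toxic share = {toxic_share:.3f}")
```

Because only about one comment in ten is toxic, accuracy would be misleading here, which is consistent with the choice of F1 as the project metric.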

## CatBoostClassifier with default parameters on preprocessed text

To lemmatize the texts, load the spaCy model:

In [None]:
nlp = spacy.load("en_core_web_sm", disable=['parser', 'ner'])

To process the texts, we define two functions: lemma (lemmatizes the text) and clear_text (removes characters that are not part of words):

In [None]:
def lemma(text):
    # Run the spaCy pipeline and join the lemmas of all tokens with spaces
    doc = nlp(text)
    return " ".join(token.lemma_ for token in doc)

In [None]:
def clear_text(text):
    res = re.sub(r'[^a-zA-Z ]', ' ', text)
    result = " ".join(res.split())
    return result
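As an illustration of what clear_text does, the self-contained sketch below repeats the function and applies it to a comment fragment from the dataset preview; the sample string is shortened by hand and serves only as an example.

```python
import re


def clear_text(text):
    # Replace every character that is not a Latin letter or a space with a space
    res = re.sub(r'[^a-zA-Z ]', ' ', text)
    # Collapse runs of whitespace into single spaces
    return " ".join(res.split())


sample = "D'aww! He matches this background colour"
cleaned = clear_text(sample)
print(cleaned)  # D aww He matches this background colour
```

Note that apostrophes are replaced too, so contractions such as "I'm" become "I m", as can be seen in the clear_text column of the dataframe.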

Apply the two functions in sequence to add a column with the cleaned and lemmatized text, 'lemm_text', to the dataframe:

In [None]:
data['clear_text'] = data['text'].apply(clear_text)

In [None]:
%%time
data['lemm_text'] = data['clear_text'].apply(lemma)

CPU times: user 14min 11s, sys: 5 s, total: 14min 16s
Wall time: 14min 41s


In [None]:
data  # sanity check

Unnamed: 0,text,toxic,clear_text,lemm_text
0,Explanation\nWhy the edits made under my usern...,0,Explanation Why the edits made under my userna...,explanation why the edit make under -PRON- use...
1,D'aww! He matches this background colour I'm s...,0,D aww He matches this background colour I m se...,D aww -PRON- match this background colour -PRO...
2,"Hey man, I'm really not trying to edit war. It...",0,Hey man I m really not trying to edit war It s...,hey man -PRON- m really not try to edit war -P...
3,"""\nMore\nI can't make any real suggestions on ...",0,More I can t make any real suggestions on impr...,More -PRON- can t make any real suggestion on ...
4,"You, sir, are my hero. Any chance you remember...",0,You sir are my hero Any chance you remember wh...,-PRON- sir be -PRON- hero any chance -PRON- re...
...,...,...,...,...
159566,""":::::And for the second time of asking, when ...",0,And for the second time of asking when your vi...,and for the second time of ask when -PRON- vie...
159567,You should be ashamed of yourself \n\nThat is ...,0,You should be ashamed of yourself That is a ho...,-PRON- should be ashamed of -PRON- that be a h...
159568,"Spitzer \n\nUmm, theres no actual article for ...",0,Spitzer Umm theres no actual article for prost...,Spitzer Umm there s no actual article for pros...
159569,And it looks like it was actually you who put ...,0,And it looks like it was actually you who put ...,and -PRON- look like -PRON- be actually -PRON-...


In [None]:
features = data.drop(['toxic', 'text', 'clear_text'], axis=1)

In [None]:
features

Unnamed: 0,lemm_text
0,explanation why the edit make under -PRON- use...
1,D aww -PRON- match this background colour -PRO...
2,hey man -PRON- m really not try to edit war -P...
3,More -PRON- can t make any real suggestion on ...
4,-PRON- sir be -PRON- hero any chance -PRON- re...
...,...
159566,and for the second time of ask when -PRON- vie...
159567,-PRON- should be ashamed of -PRON- that be a h...
159568,Spitzer Umm there s no actual article for pros...
159569,and -PRON- look like -PRON- be actually -PRON-...


The simplest way to classify the preprocessed text is a CatBoostClassifier model (default parameters) with the text column passed via text_features:

In [None]:
text_features_cb = ['lemm_text']

In [None]:
features_train, features_test, target_train, target_test = train_test_split(features, target, test_size=0.2)
features_train, features_valid, target_train, target_valid = train_test_split(features_train, target_train, test_size=0.25)
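The two calls above yield roughly a 60/20/20 train/validation/test split, since 25% of the remaining 80% is 20% of the data. The sketch below reproduces the arithmetic; the helper `split_sizes` is hypothetical and assumes the held-out part is rounded up, which agrees with the 127656/31915 matrix shapes reported later in this notebook.

```python
import math


def split_sizes(n_samples, test_size):
    """Return (n_train, n_test) for a fractional test_size, rounding the test part up."""
    n_test = math.ceil(test_size * n_samples)
    return n_samples - n_test, n_test


n_total = 159571
n_train_valid, n_test = split_sizes(n_total, 0.2)     # first split: hold out the test set
n_train, n_valid = split_sizes(n_train_valid, 0.25)   # second split: hold out the validation set

print(n_train, n_valid, n_test)
```

Since neither call fixes random_state or stratify, the exact rows and the class ratio in each part vary from run to run.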

In [None]:
# devices takes GPU ids such as '0' (the original value 'coda:0' was a typo)
model_cb = CatBoostClassifier(task_type="GPU", devices='0', text_features=text_features_cb)

In [None]:
%%time
model_cb.fit(features_train, target_train, eval_set=(features_valid, target_valid))

Learning rate set to 0.04832
0:	learn: 0.6210188	test: 0.6202669	best: 0.6202669 (0)	total: 26.7ms	remaining: 26.7s
1:	learn: 0.5575974	test: 0.5563184	best: 0.5563184 (1)	total: 51.3ms	remaining: 25.6s
2:	learn: 0.5037934	test: 0.5018766	best: 0.5018766 (2)	total: 75.4ms	remaining: 25.1s
3:	learn: 0.4600527	test: 0.4576891	best: 0.4576891 (3)	total: 117ms	remaining: 29.2s
4:	learn: 0.4202248	test: 0.4174488	best: 0.4174488 (4)	total: 142ms	remaining: 28.2s
5:	learn: 0.3851558	test: 0.3819868	best: 0.3819868 (5)	total: 166ms	remaining: 27.5s
6:	learn: 0.3564902	test: 0.3529582	best: 0.3529582 (6)	total: 191ms	remaining: 27s
7:	learn: 0.3332222	test: 0.3293375	best: 0.3293375 (7)	total: 215ms	remaining: 26.6s
8:	learn: 0.3118207	test: 0.3076454	best: 0.3076454 (8)	total: 246ms	remaining: 27.1s
9:	learn: 0.2943138	test: 0.2899285	best: 0.2899285 (9)	total: 271ms	remaining: 26.8s
10:	learn: 0.2791047	test: 0.2745573	best: 0.2745573 (10)	total: 296ms	remaining: 26.6s
11:	learn: 0.2637850	t

<catboost.core.CatBoostClassifier at 0x7fa1d027be90>

In [None]:
predict_test_cb = model_cb.predict(features_test)

F1_test_cb = f1_score(target_test, predict_test_cb)
print('F1 on the test set for CatBoostClassifier:', F1_test_cb)

F1 on the test set for CatBoostClassifier: 0.7619212762268076


**Conclusion.** The target metric value was reached (F1 = 0.7619) in a simple and fast way. Model training took 39 seconds.

## Tuning CatBoostClassifier parameters on preprocessed text

In [None]:
model_cb = CatBoostClassifier(text_features=text_features_cb)

We use the grid_search method to find the best parameters:

In [None]:
grid = {'iterations': [700, 1000, 2000],'learning_rate': [0.05, 0.1, 0.15],'depth': [4, 6, 10]}
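The grid above contains 3 x 3 x 3 = 27 parameter combinations, and each one needs at least one model fit, which is one reason the search takes much longer than a single training run. A small sketch that enumerates them with the standard library (the variable name `combinations` is illustrative):

```python
from itertools import product

grid = {'iterations': [700, 1000, 2000],
        'learning_rate': [0.05, 0.1, 0.15],
        'depth': [4, 6, 10]}

# Every combination of one value per parameter
combinations = [dict(zip(grid, values)) for values in product(*grid.values())]

print(len(combinations))  # 27
print(combinations[0])    # {'iterations': 700, 'learning_rate': 0.05, 'depth': 4}
```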

In [None]:
%%time
grid_search_result = model_cb.grid_search(grid,
                                          X=features_train,
                                          y=target_train,
                                          cv=3,
                                          train_size=0.8)

Output was truncated to the last 5000 lines.
1128:	learn: 0.3271296	test: 0.3336509	best: 0.3336157 (39)	total: 10.9s	remaining: 8.39s
1129:	learn: 0.3271296	test: 0.3336509	best: 0.3336157 (39)	total: 10.9s	remaining: 8.38s
1130:	learn: 0.3271296	test: 0.3336509	best: 0.3336157 (39)	total: 10.9s	remaining: 8.37s
1131:	learn: 0.3271296	test: 0.3336509	best: 0.3336157 (39)	total: 10.9s	remaining: 8.36s
1132:	learn: 0.3271296	test: 0.3336509	best: 0.3336157 (39)	total: 10.9s	remaining: 8.35s
1133:	learn: 0.3271296	test: 0.3336509	best: 0.3336157 (39)	total: 10.9s	remaining: 8.34s
1134:	learn: 0.3271296	test: 0.3336509	best: 0.3336157 (39)	total: 10.9s	remaining: 8.33s
1135:	learn: 0.3271296	test: 0.3336509	best: 0.3336157 (39)	total: 10.9s	remaining: 8.31s
1136:	learn: 0.3271296	test: 0.3336509	best: 0.3336157 (39)	total: 10.9s	remaining: 8.3s
1137:	learn: 0.3271296	test: 0.3336509	best: 0.3336157 (39)	total: 10.9s	remaining: 8.29s
1138:	learn: 0.3271

In [None]:
best_params = grid_search_result['params']
print('Best model parameters:', best_params)

Best model parameters: {'depth': 4, 'iterations': 700, 'learning_rate': 0.15}


In [None]:
# devices takes GPU ids such as '0'
model_cb_best = CatBoostClassifier(**best_params, task_type="GPU", devices='0', text_features=text_features_cb)

In [None]:
%%time
model_cb_best.fit(features_train, target_train, eval_set=(features_valid, target_valid))

0:	learn: 0.4950270	test: 0.4927379	best: 0.4927379 (0)	total: 20.1ms	remaining: 14s
1:	learn: 0.3788344	test: 0.3751274	best: 0.3751274 (1)	total: 38.6ms	remaining: 13.5s
2:	learn: 0.3096862	test: 0.3052916	best: 0.3052916 (2)	total: 57.2ms	remaining: 13.3s
3:	learn: 0.2621402	test: 0.2570664	best: 0.2570664 (3)	total: 76ms	remaining: 13.2s
4:	learn: 0.2283666	test: 0.2226062	best: 0.2226062 (4)	total: 94.7ms	remaining: 13.2s
5:	learn: 0.2108290	test: 0.2049441	best: 0.2049441 (5)	total: 113ms	remaining: 13.1s
6:	learn: 0.1967785	test: 0.1905437	best: 0.1905437 (6)	total: 132ms	remaining: 13.1s
7:	learn: 0.1852520	test: 0.1787335	best: 0.1787335 (7)	total: 150ms	remaining: 13s
8:	learn: 0.1781238	test: 0.1713625	best: 0.1713625 (8)	total: 169ms	remaining: 13s
9:	learn: 0.1719242	test: 0.1649238	best: 0.1649238 (9)	total: 193ms	remaining: 13.3s
10:	learn: 0.1688100	test: 0.1619146	best: 0.1619146 (10)	total: 211ms	remaining: 13.2s
11:	learn: 0.1665835	test: 0.1596146	best: 0.1596146 (1

<catboost.core.CatBoostClassifier at 0x7fa1ba718cd0>

In [None]:
predict_test_cb = model_cb_best.predict(features_test)

F1_test_cb = f1_score(target_test, predict_test_cb)
print('F1 on the test set for CatBoostClassifier:', F1_test_cb)

F1 on the test set for CatBoostClassifier: 0.7622523461939521


**Conclusion.** The target metric value was reached (F1 = 0.7623), but tuning the model parameters gave no noticeable gain in quality. The time spent on the parameter search (11 minutes 18 seconds) was not used effectively.

## BERT model

To use the pre-trained BERT model, we create a tokenizer and an attention mask (attention_mask). Since BERT handles sequences of at most 512 tokens, we set max_len = 512 to limit the text length. Tokens beyond this limit are dropped.

In [None]:
tokenizer = transformers.BertTokenizer(vocab_file='/content/drive/MyDrive/Colab Notebooks/vocab.txt')

In [None]:
%%time
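max_len = 512  # BERT accepts at most 512 tokens per sequence

# Tokenize every comment; truncation keeps the closing [SEP] token within the limit
tokenized = data['text'].apply(
    lambda x: tokenizer.encode(x, add_special_tokens=True,
                               truncation=True, max_length=max_len))

# Pad every sequence with zeros up to max_len
padded = np.array([i + [0]*(max_len - len(i)) for i in tokenized.values])

# 1 marks real tokens, 0 marks padding
attention_mask = np.where(padded != 0, 1, 0)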

CPU times: user 5min 45s, sys: 2.81 s, total: 5min 47s
Wall time: 5min 53s
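To make the padding and masking step concrete, here is a pure-Python sketch on toy token ids. The helper `pad_and_mask` and the ids are hypothetical; it assumes, as in the cell above, that zeros are used for padding.

```python
def pad_and_mask(sequences, max_len):
    """Truncate or zero-pad each id list to max_len and build the matching attention mask."""
    padded, mask = [], []
    for ids in sequences:
        ids = ids[:max_len]                        # drop tokens beyond max_len
        n_pad = max_len - len(ids)
        padded.append(ids + [0] * n_pad)           # zeros mark padding positions
        mask.append([1] * len(ids) + [0] * n_pad)  # 1 = real token, 0 = padding
    return padded, mask


toy_ids = [[101, 7592, 102], [101, 2023, 2003, 1037, 3231, 102]]
padded_toy, mask_toy = pad_and_mask(toy_ids, max_len=5)
print(padded_toy)  # [[101, 7592, 102, 0, 0], [101, 2023, 2003, 1037, 3231]]
print(mask_toy)    # [[1, 1, 1, 0, 0], [1, 1, 1, 1, 1]]
```

The mask tells the model which positions to ignore, so padded zeros do not influence the embeddings of real tokens.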


In [None]:
attention_mask.shape  # check the mask dimensions

(159571, 512)

Prepare to use the GPU in Colab:

In [None]:
device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')

Initialize the BertConfig configuration from a JSON file that describes the model settings.

In [None]:
config = transformers.BertConfig.from_json_file('/content/drive/MyDrive/Colab Notebooks/bert_config.json')

Initialize a BertModel instance from the file with the pre-trained weights and the configuration:

In [None]:
model = transformers.BertModel.from_pretrained('/content/drive/MyDrive/Colab Notebooks/pytorch_model.bin', config=config)
model = model.to(device)

Some weights of the model checkpoint at /content/drive/MyDrive/Colab Notebooks/pytorch_model.bin were not used when initializing BertModel: ['cls.predictions.transform.dense.weight', 'cls.seq_relationship.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.decoder.weight', 'cls.predictions.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Compute the embeddings:

In [None]:
batch_size = 20
embeddings = []
# Iterate over full batches only; a trailing incomplete batch is skipped
for i in notebook.tqdm(range(padded.shape[0] // batch_size)):
    # Token ids and attention mask of the current batch, moved to the device
    batch = torch.LongTensor(padded[batch_size*i:batch_size*(i+1)]).to(device)
    attention_mask_batch = torch.LongTensor(
        attention_mask[batch_size*i:batch_size*(i+1)]).to(device)

    # Inference only, so gradients are not needed
    with torch.no_grad():
        batch_embeddings = model(batch, attention_mask=attention_mask_batch)

    # Keep the hidden state of the first ([CLS]) token of every sequence
    embeddings.append(batch_embeddings[0][:, 0, :].cpu().numpy())


  0%|          | 0/7978 [00:00<?, ?it/s]
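The expression batch_embeddings[0][:,0,:] keeps, for every comment in the batch, only the hidden state of the first token ([CLS]), which is used here as a fixed-size representation of the whole text. A toy illustration with nested lists instead of tensors (all values are made up, and the hidden size is 4 rather than 768):

```python
# Toy "last hidden state": 2 sequences x 3 tokens x 4 hidden units
hidden = [
    [[0.1, 0.2, 0.3, 0.4], [1.0, 1.0, 1.0, 1.0], [2.0, 2.0, 2.0, 2.0]],
    [[0.5, 0.6, 0.7, 0.8], [3.0, 3.0, 3.0, 3.0], [4.0, 4.0, 4.0, 4.0]],
]

# List equivalent of hidden[:, 0, :] for tensors: first token vector of every sequence
cls_vectors = [sequence[0] for sequence in hidden]

print(cls_vectors)  # [[0.1, 0.2, 0.3, 0.4], [0.5, 0.6, 0.7, 0.8]]
```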

We name the array of embeddings features and use it for training below:

In [None]:
features = np.concatenate(embeddings)

In [None]:
target = data['toxic']

In [None]:
features.shape  # check the shape of the training features

(159560, 768)

The last (incomplete) batch was not included in the features: the embedding loop iterates over `padded.shape[0] // batch_size` full batches, so the remaining 11 rows are skipped. To make target match features in size, we drop its last values, those with index 159560 through 159570.

In [None]:
target = target.drop(target[target.index >= 159560].index)

In [None]:
target  # sanity check

0         0
1         0
2         0
3         0
4         0
         ..
159555    0
159556    0
159557    0
159558    0
159559    0
Name: toxic, Length: 159560, dtype: int64
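The trimming is needed only because the embedding loop covers full batches. One possible alternative, sketched here under the assumption that the same loop body is reused, is to iterate over slice boundaries so that the last, shorter batch is included as well. The helper `batch_bounds` is hypothetical.

```python
def batch_bounds(n_rows, batch_size):
    """Return (start, stop) index pairs covering all rows, including a shorter last batch."""
    return [(start, min(start + batch_size, n_rows))
            for start in range(0, n_rows, batch_size)]


bounds = batch_bounds(159571, 20)
print(len(bounds))   # 7979
print(bounds[-1])    # (159560, 159571)
```

Each pair could be used as padded[start:stop] and attention_mask[start:stop]; the features would then have as many rows as the original target.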

## Logistic regression on embeddings



Create the training and test features:

In [None]:
features_train, features_test, target_train, target_test = train_test_split(
    features, target, test_size=0.2)

Create a logistic regression model, train it on the embeddings, make predictions and compute F1:

In [None]:
model_lr = LogisticRegression(max_iter = 1000, class_weight='balanced')
model_lr.fit(features_train, target_train)
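With class_weight='balanced', scikit-learn weights each class by n_samples / (n_classes * class_count), so the rare toxic class receives a larger weight. The sketch below evaluates this formula for the class counts of the full dataset shown earlier; the counts in the actual training split will differ slightly.

```python
# Class counts of the full dataset (from the value_counts() output earlier in the notebook)
class_counts = {0: 143346, 1: 16225}

n_samples = sum(class_counts.values())
n_classes = len(class_counts)

# "balanced" heuristic: n_samples / (n_classes * count of the class)
balanced_weights = {label: n_samples / (n_classes * count)
                    for label, count in class_counts.items()}

print({label: round(w, 2) for label, w in balanced_weights.items()})  # {0: 0.56, 1: 4.92}
```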

In [None]:
predict = model_lr.predict(features_test)

In [None]:
F1_test_lr = f1_score(target_test, predict)
print('F1 on the test set for LogisticRegression:', F1_test_lr)

F1 on the test set for LogisticRegression: 0.6089395922947447


**Conclusion.** With these tools, the target metric value was not reached.

## CatBoostClassifier on embeddings

Create the training and test features:

In [None]:
features_train, features_test, target_train, target_test = train_test_split(features, target, test_size=0.2)

Create a CatBoostClassifier model, search for the best parameters, train the model on the embeddings, make predictions and compute F1:

In [None]:
model_cb = CatBoostClassifier(task_type="GPU", devices='0')  # devices takes GPU ids such as '0'

In [None]:
grid = {'iterations': [550, 650, 750],'learning_rate': [0.105, 0.1, 0.095],'depth': [5, 6, 7]}

In [None]:
%%time
grid_search_result = model_cb.grid_search(grid,
                                          X=features_train,
                                          y=target_train,
                                          cv=3,
                                          train_size=0.8)

Output was truncated to the last 5000 lines.
465:	learn: 0.1404915	test: 0.1831969	best: 0.1831951 (463)	total: 2m 11s	remaining: 52.1s
466:	learn: 0.1404383	test: 0.1832077	best: 0.1831951 (463)	total: 2m 12s	remaining: 51.8s
467:	learn: 0.1403236	test: 0.1831691	best: 0.1831691 (467)	total: 2m 12s	remaining: 51.5s
468:	learn: 0.1402450	test: 0.1831737	best: 0.1831691 (467)	total: 2m 12s	remaining: 51.2s
469:	learn: 0.1401335	test: 0.1831443	best: 0.1831443 (469)	total: 2m 13s	remaining: 51s
470:	learn: 0.1400611	test: 0.1831553	best: 0.1831443 (469)	total: 2m 13s	remaining: 50.7s
471:	learn: 0.1400605	test: 0.1831540	best: 0.1831443 (469)	total: 2m 13s	remaining: 50.3s
472:	learn: 0.1399686	test: 0.1831430	best: 0.1831430 (472)	total: 2m 13s	remaining: 50s
473:	learn: 0.1398942	test: 0.1831223	best: 0.1831223 (473)	total: 2m 13s	remaining: 49.8s
474:	learn: 0.1398028	test: 0.1830623	best: 0.1830623 (474)	total: 2m 14s	remaining: 49.5s
475:	learn: 

In [None]:
best_params = grid_search_result['params']
print('Best model parameters:', best_params)

Best model parameters: {'depth': 5, 'iterations': 550, 'learning_rate': 0.105}


In [None]:
model_cb_best = CatBoostClassifier(**best_params, task_type="GPU", devices='0')  # devices takes GPU ids such as '0'

In [None]:
%%time
model_cb_best.fit(features_train, target_train)

0:	learn: 0.5836061	total: 38.1ms	remaining: 20.9s
1:	learn: 0.4996499	total: 79.5ms	remaining: 21.8s
2:	learn: 0.4388205	total: 119ms	remaining: 21.6s
3:	learn: 0.3920395	total: 158ms	remaining: 21.5s
4:	learn: 0.3568801	total: 196ms	remaining: 21.4s
5:	learn: 0.3309770	total: 250ms	remaining: 22.7s
6:	learn: 0.3103303	total: 290ms	remaining: 22.5s
7:	learn: 0.2953816	total: 327ms	remaining: 22.1s
8:	learn: 0.2834562	total: 360ms	remaining: 21.7s
9:	learn: 0.2724726	total: 397ms	remaining: 21.4s
10:	learn: 0.2643300	total: 433ms	remaining: 21.2s
11:	learn: 0.2582174	total: 469ms	remaining: 21s
12:	learn: 0.2530188	total: 503ms	remaining: 20.8s
13:	learn: 0.2484888	total: 537ms	remaining: 20.5s
14:	learn: 0.2447384	total: 579ms	remaining: 20.6s
15:	learn: 0.2416908	total: 609ms	remaining: 20.3s
16:	learn: 0.2393363	total: 640ms	remaining: 20.1s
17:	learn: 0.2371625	total: 676ms	remaining: 20s
18:	learn: 0.2353833	total: 706ms	remaining: 19.7s
19:	learn: 0.2334711	total: 742ms	remaining

<catboost.core.CatBoostClassifier at 0x7f556193aa90>

In [None]:
predict_test_cb = model_cb_best.predict(features_test)

F1_test_cb = f1_score(target_test, predict_test_cb)
print('F1 on the test set for CatBoostClassifier:', F1_test_cb)

F1 on the test set for CatBoostClassifier: 0.5692942795360724


**Conclusion.** With these tools, the target metric value was not reached.

## Text preparation: lemmatization and cleaning

Since the project was carried out with different technical tools, sometimes in parallel, we create another copy of the original dataset, data2:

In [None]:
data2 = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/toxic_comments.csv')

In [None]:
data2

Unnamed: 0,text,toxic
0,Explanation\nWhy the edits made under my usern...,0
1,D'aww! He matches this background colour I'm s...,0
2,"Hey man, I'm really not trying to edit war. It...",0
3,"""\nMore\nI can't make any real suggestions on ...",0
4,"You, sir, are my hero. Any chance you remember...",0
...,...,...
159566,""":::::And for the second time of asking, when ...",0
159567,You should be ashamed of yourself \n\nThat is ...,0
159568,"Spitzer \n\nUmm, theres no actual article for ...",0
159569,And it looks like it was actually you who put ...,0


To lemmatize the texts, load the spaCy model:

In [None]:
nlp = spacy.load("en_core_web_sm", disable=['parser', 'ner'])

To process the texts, we define two functions: lemma (lemmatizes the text) and clear_text (removes characters that are not part of words):

In [None]:
def lemma(text):
    # Run the spaCy pipeline and join the lemmas of all tokens with spaces
    doc = nlp(text)
    return " ".join(token.lemma_ for token in doc)

In [None]:
def clear_text(text):
    res = re.sub(r'[^a-zA-Z ]', ' ', text)
    result = " ".join(res.split())
    return result

Apply the two functions in sequence to add a column with the cleaned and lemmatized text, 'lemm_text', to the dataframe:

In [None]:
data2['clear_text'] = data2['text'].apply(clear_text)

In [None]:
%%time
data2['lemm_text'] = data2['clear_text'].apply(lemma)

CPU times: user 14min 6s, sys: 7.57 s, total: 14min 13s
Wall time: 14min 46s


In [None]:
#data3 = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/data2_lemmatized.csv')
#data2.to_csv('data2_lemmatized.csv')

In [None]:
data2  # sanity check

Unnamed: 0,text,toxic,clear_text,lemm_text
0,Explanation\nWhy the edits made under my usern...,0,Explanation Why the edits made under my userna...,explanation why the edit make under -PRON- use...
1,D'aww! He matches this background colour I'm s...,0,D aww He matches this background colour I m se...,D aww -PRON- match this background colour -PRO...
2,"Hey man, I'm really not trying to edit war. It...",0,Hey man I m really not trying to edit war It s...,hey man -PRON- m really not try to edit war -P...
3,"""\nMore\nI can't make any real suggestions on ...",0,More I can t make any real suggestions on impr...,More -PRON- can t make any real suggestion on ...
4,"You, sir, are my hero. Any chance you remember...",0,You sir are my hero Any chance you remember wh...,-PRON- sir be -PRON- hero any chance -PRON- re...
...,...,...,...,...
159566,""":::::And for the second time of asking, when ...",0,And for the second time of asking when your vi...,and for the second time of ask when -PRON- vie...
159567,You should be ashamed of yourself \n\nThat is ...,0,You should be ashamed of yourself That is a ho...,-PRON- should be ashamed of -PRON- that be a h...
159568,"Spitzer \n\nUmm, theres no actual article for ...",0,Spitzer Umm theres no actual article for prost...,Spitzer Umm there s no actual article for pros...
159569,And it looks like it was actually you who put ...,0,And it looks like it was actually you who put ...,and -PRON- look like -PRON- be actually -PRON-...


Split the dataset into training and test parts:

In [None]:
features_train, features_test = train_test_split(data2, test_size=0.2)

In [None]:
features_train  # sanity check

Unnamed: 0,text,toxic,clear_text,lemm_text
82301,"""* Oppose – Both the PRC and ROC have never ag...",0,Oppose Both the PRC and ROC have never agreed ...,oppose both the PRC and ROC have never agree t...
5818,"""HUNK should have his own page separate from m...",0,HUNK should have his own page separate from my...,HUNK should have -PRON- own page separate from...
69704,"""\n\n WikiProject Indigenous peoples of North ...",0,WikiProject Indigenous peoples of North Americ...,WikiProject indigenous people of North America...
33113,That template is completely non-useful. In fut...,0,That template is completely non useful In futu...,that template be completely non useful in futu...
10398,I think either would be OK if you can back it ...,0,I think either would be OK if you can back it ...,-PRON- think either would be ok if -PRON- can ...
...,...,...,...,...
55145,You are being disruptive \n\nI am fully entitl...,0,You are being disruptive I am fully entitled t...,-PRON- be be disruptive -PRON- be fully entitl...
51329,"""== Cigarette ==\nHey! I wasn't vandalising I ...",0,Cigarette Hey I wasn t vandalising I wanted to...,Cigarette hey -PRON- wasn t vandalising -PRON-...
112619,"""\n\n Basically what he is trying to do is dis...",0,Basically what he is trying to do is disambigu...,basically what -PRON- be try to do be disambig...
139474,Sorry I confused you with another user \n\nGre...,0,Sorry I confused you with another user Greetin...,sorry -PRON- confuse -PRON- with another user ...


In [None]:
features_test  # sanity check

Unnamed: 0,text,toxic,clear_text,lemm_text
138447,Wildlife Conservation\nThis page should have m...,0,Wildlife Conservation This page should have mo...,Wildlife Conservation this page should have mo...
14467,I commend you \n\nYou and I are making Wikiped...,0,I commend you You and I are making Wikipedia a...,-PRON- commend -PRON- -PRON- and -PRON- be mak...
39186,Thank you for weighing in. I can't believe th...,0,Thank you for weighing in I can t believe that...,thank -PRON- for weigh in -PRON- can t believe...
66936,"""\n\n Article under attack by Websense illegal...",0,Article under attack by Websense illegal astro...,article under attack by websense illegal astro...
121118,"""= Page for Himson ===\nCreated into my usersp...",0,Page for Himson Created into my userspace Can ...,page for Himson create into -PRON- userspace C...
...,...,...,...,...
150277,Freomaniac23 \n\n== \nAbout Me ==\nFreomaniac2...,0,Freomaniac About Me Freomaniac is the username...,freomaniac about Me Freomaniac be the username...
81058,", 1 August 2006 (UTC)\nReplied at User talk:Da...",0,August UTC Replied at User talk David S Adams ...,August UTC reply at User talk David S Adams re...
7141,You would understand... \n\nYou would understa...,1,You would understand You would understand if y...,-PRON- would understand -PRON- would understan...
149680,"""\n\nYou'll just have to forgive me if I take ...",0,You ll just have to forgive me if I take a tim...,-PRON- will just have to forgive -PRON- if -PR...


Create the training corpus of comments and the target:

In [None]:
corpus_train = features_train['lemm_text'].values
target_train = features_train['toxic']

Create the test corpus of comments and the target:

In [None]:
corpus_test = features_test['lemm_text'].values
target_test = features_test['toxic']

Download the list of stop words:

In [None]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [None]:
# Keep the stop words as a list, the collection type TfidfVectorizer documents for stop_words
stopwords = nltk_stopwords.words('english')

We will build two TF-IDF matrices with stop words removed. First, create the vectorizer:

In [None]:
count_tf_idf = TfidfVectorizer(stop_words=stopwords)

Build the training matrix:

In [None]:
tf_idf_train = count_tf_idf.fit_transform(corpus_train)

Build the test matrix:

In [None]:
tf_idf_test = count_tf_idf.transform(corpus_test)

In [None]:
tf_idf_train.shape  # check the matrix dimensions

(127656, 137907)

In [None]:
tf_idf_test.shape  # check the matrix dimensions

(31915, 137907)
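Note that the vectorizer is fitted on the training corpus only and then reused for the test corpus, which is why both matrices have the same 137907 columns and the test set does not influence the vocabulary. The pure-Python sketch below mimics this on a toy corpus, using the smoothed IDF formula idf = ln((1 + n) / (1 + df)) + 1 that TfidfVectorizer applies by default. The helper names and the corpus are invented for illustration, and the real vectorizer additionally applies term frequencies and row normalization.

```python
import math


def fit_idf(corpus):
    """Learn the vocabulary and smoothed IDF weights from a list of documents."""
    n_docs = len(corpus)
    doc_freq = {}
    for doc in corpus:
        for word in set(doc.split()):
            doc_freq[word] = doc_freq.get(word, 0) + 1
    return {word: math.log((1 + n_docs) / (1 + df)) + 1 for word, df in doc_freq.items()}


def known_words(doc, idf):
    """Keep only the words seen during fitting; unseen test words are ignored."""
    return [word for word in doc.split() if word in idf]


train_corpus = ["you be my hero", "you be wrong", "thank you"]
idf = fit_idf(train_corpus)

print(idf["you"])                          # 1.0, the word occurs in every document
print(known_words("you be awesome", idf))  # ['you', 'be']
```

Rare words get larger weights than words that occur in every document, and a word that appears only in the test corpus simply has no column.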

## Logistic regression on the TF-IDF matrix


Create a logistic regression model, train it on tf_idf_train, make predictions on tf_idf_test and compute F1:

In [None]:
model_lr = LogisticRegression(max_iter=1000, class_weight='balanced', verbose=1)
model_lr.fit(tf_idf_train, target_train)

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:   12.5s finished


LogisticRegression(class_weight='balanced', max_iter=1000, verbose=1)

In [None]:
predict = model_lr.predict(tf_idf_test)

In [None]:
F1_test_lr = f1_score(target_test, predict)
print('F1 on the test set for LogisticRegression:', F1_test_lr)

F1 on the test set for LogisticRegression: 0.7380855397148677


**Conclusion.** With these tools, we did not reach the target metric value.

## CatBoostClassifier on the TF-IDF matrix

Create a CatBoostClassifier model, train it on tf_idf_train, make predictions on tf_idf_test and compute F1:

In [None]:
model_cb = CatBoostClassifier()

In [None]:
model_cb.fit(tf_idf_train, target_train)

Learning rate set to 0.081698
0:	learn: 0.6095052	total: 2.81s	remaining: 46m 45s
1:	learn: 0.5426392	total: 4.96s	remaining: 41m 15s
2:	learn: 0.4862732	total: 7.11s	remaining: 39m 21s
3:	learn: 0.4390281	total: 9.26s	remaining: 38m 25s
4:	learn: 0.4019451	total: 11.4s	remaining: 37m 57s
5:	learn: 0.3714825	total: 13.6s	remaining: 37m 31s
6:	learn: 0.3466349	total: 15.8s	remaining: 37m 15s
7:	learn: 0.3265971	total: 17.9s	remaining: 37m
8:	learn: 0.3100915	total: 20.1s	remaining: 36m 47s
9:	learn: 0.2963464	total: 22.2s	remaining: 36m 41s
10:	learn: 0.2856661	total: 24.4s	remaining: 36m 31s
11:	learn: 0.2761954	total: 26.5s	remaining: 36m 22s
12:	learn: 0.2678881	total: 28.6s	remaining: 36m 14s
13:	learn: 0.2598464	total: 30.8s	remaining: 36m 12s
14:	learn: 0.2542034	total: 33s	remaining: 36m 7s
15:	learn: 0.2493626	total: 35.1s	remaining: 36m
16:	learn: 0.2447704	total: 37.3s	remaining: 35m 54s
17:	learn: 0.2403789	total: 39.4s	remaining: 35m 49s
18:	learn: 0.2369575	total: 41.5s	rem

<catboost.core.CatBoostClassifier at 0x7f9eaf5de290>

In [None]:
# Predict with the model trained on TF-IDF features, using the TF-IDF test matrix
predict_test_cb = model_cb.predict(tf_idf_test)

In [None]:
F1_test_cb = f1_score(target_test, predict_test_cb)
print('F1 on the test set for CatBoostClassifier:', F1_test_cb)

F1 on the test set for CatBoostClassifier: 0.7633007600434311


**Conclusion.** With these tools, we reached the target metric value. However, training this model takes far longer (35 minutes) than training logistic regression (a few seconds).

# Overall conclusion

1. A CatBoost model with text_features was applied to the preprocessed text (lemmatization + cleaning). It reached **F1 = 0.7619**, a rather high value for such a simple approach.
2. The BERT model produced embeddings that were used to train:
- logistic regression, F1 = 0.609, and
- a CatBoost model, F1 = 0.569.
No amount of parameter tuning brought the metric to the target value of 0.75.
3. Using lemmatization and text cleaning, a TF-IDF matrix was built and used to train:
- logistic regression, which did not reach the target value (0.75): F1 = 0.7381. No parameter changes raised the metric.
- a CatBoost model, which reached the required metric even with default parameters: **F1 = 0.7633**.

To counter the class imbalance, the logistic regression model used class_weight='balanced'. In the CatBoost model, class_weights did not improve the result, so it is not shown in the code.