<h1>Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Изучение-и-подготовка-данных-для-BERT" data-toc-modified-id="Изучение-и-подготовка-данных-для-BERT-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Exploring and preparing the data for BERT</a></span></li><li><span><a href="#Построение-моделей" data-toc-modified-id="Построение-моделей-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Building the models</a></span><ul class="toc-item"><li><span><a href="#Модель-логистической-регрессии" data-toc-modified-id="Модель-логистической-регрессии-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span>Logistic regression model</a></span></li><li><span><a href="#Модель-LightGBM" data-toc-modified-id="Модель-LightGBM-2.2"><span class="toc-item-num">2.2&nbsp;&nbsp;</span>LightGBM model</a></span></li><li><span><a href="#Модель-CatBoost" data-toc-modified-id="Модель-CatBoost-2.3"><span class="toc-item-num">2.3&nbsp;&nbsp;</span>CatBoost model</a></span></li></ul></li><li><span><a href="#Вывод-по-BERT:" data-toc-modified-id="Вывод-по-BERT:-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Conclusions for BERT</a></span></li><li><span><a href="#Подготовка-данных-для-классификации-текстов" data-toc-modified-id="Подготовка-данных-для-классификации-текстов-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Preparing the data for text classification</a></span></li><li><span><a href="#Построение-моделей" data-toc-modified-id="Построение-моделей-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Building the models</a></span><ul class="toc-item"><li><span><a href="#Модель-логистической-регрессии" data-toc-modified-id="Модель-логистической-регрессии-5.1"><span class="toc-item-num">5.1&nbsp;&nbsp;</span>Logistic regression model</a></span></li><li><span><a href="#Модель-LigthGBM" data-toc-modified-id="Модель-LigthGBM-5.2"><span class="toc-item-num">5.2&nbsp;&nbsp;</span>LightGBM model</a></span></li><li><span><a href="#Модель-CatBoost" 
data-toc-modified-id="Модель-CatBoost-5.3"><span class="toc-item-num">5.3&nbsp;&nbsp;</span>CatBoost model</a></span></li></ul></li><li><span><a href="#Выводы" data-toc-modified-id="Выводы-6"><span class="toc-item-num">6&nbsp;&nbsp;</span>Conclusions</a></span></li></ul></div>

## Exploring and preparing the data for BERT

In [1]:
import numpy as np
import pandas as pd
import torch
import transformers as ppb
from tqdm import notebook
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

from lightgbm import LGBMClassifier
from catboost import CatBoostClassifier

from sklearn.model_selection import GridSearchCV

import re
import nltk
from nltk.stem import WordNetLemmatizer 
from nltk.corpus import stopwords as nltk_stopwords
from sklearn.feature_extraction.text import TfidfVectorizer

In [2]:
# Load the data: read the local copy if it exists, otherwise download and cache it
try:
    df_toxic = pd.read_csv('df_toxic.csv')
except FileNotFoundError:
    df_toxic = pd.read_csv('https://code.s3.yandex.net/datasets/toxic_comments.csv')
    df_toxic.to_csv('df_toxic.csv', index=False)

In [3]:
# Work with a random sample of 1,000 rows only, to keep the run time manageable
df_toxic = df_toxic.sample(n=1000, random_state=12345).reset_index(drop=True)

In [4]:
df_toxic

Unnamed: 0,text,toxic
0,Ahh shut the fuck up you douchebag sand nigger...,1
1,"""\n\nREPLY: There is no such thing as Texas Co...",0
2,"Reply\nHey, you could at least mention Jasenov...",0
3,"Thats fine, there is no deadline ) chi?",0
4,"""\n\nDYK nomination of Mustarabim\n Hello! You...",0
...,...,...
995,"Date warriors \n\nHi, Hertz. Nice catch on Sec...",0
996,REDIRECT Talk:2013 Men's World Ice Hockey Cham...,0
997,This article is one of the most well cited art...,0
998,Harry the point you are making is pointless. I...,0


In [5]:
df_toxic.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   text    1000 non-null   object
 1   toxic   1000 non-null   int64 
dtypes: int64(1), object(1)
memory usage: 15.8+ KB


In [6]:
# Look for duplicate rows
df_toxic[df_toxic.duplicated()]

Unnamed: 0,text,toxic


In [7]:
# Check the class distribution of the target
df_toxic['toxic'].value_counts()

0    901
1     99
Name: toxic, dtype: int64
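
With 99 toxic comments out of 1,000, accuracy alone would be misleading, which is presumably why F1 is the metric used throughout. As a minimal sketch in plain Python (the helper `f1_from_counts` is hypothetical, not part of the notebook), a classifier that always predicts the majority class gets high accuracy but an F1 of zero:

```python
def f1_from_counts(tp, fp, fn):
    """F1 for the positive class from confusion counts; 0.0 when there are no true positives."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Class counts taken from the value_counts() output above
n_negative, n_positive = 901, 99

# Hypothetical baseline: always predict "not toxic" (class 0)
tp, fp, fn, tn = 0, 0, n_positive, n_negative
baseline_accuracy = (tp + tn) / (n_negative + n_positive)
baseline_f1 = f1_from_counts(tp, fp, fn)

print(baseline_accuracy)  # 0.901
print(baseline_f1)        # 0.0
```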

In [8]:
# Select the pretrained DistilBERT model, its tokenizer, and the weights
model_class, tokenizer_class, pretrained_weights = (ppb.DistilBertModel,
                                                    ppb.DistilBertTokenizer, 'distilbert-base-uncased')
# Load the pretrained tokenizer and model
tokenizer = tokenizer_class.from_pretrained(pretrained_weights)
model = model_class.from_pretrained(pretrained_weights)


Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertModel: ['vocab_projector.bias', 'vocab_layer_norm.bias', 'vocab_transform.bias', 'vocab_transform.weight', 'vocab_projector.weight', 'vocab_layer_norm.weight']
- This IS expected if you are initializing DistilBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [9]:
# Tokenize each comment
tokenized = df_toxic['text'].apply(
    lambda x: tokenizer.encode(x, add_special_tokens=True))

Token indices sequence length is longer than the specified maximum sequence length for this model (862 > 512). Running this sequence through the model will result in indexing errors


In [10]:
# Find the length of the longest token sequence
max_len = max(len(i) for i in tokenized.values)
# Right-pad every sequence with zeros up to max_len
padded = np.array([i + [0]*(max_len-len(i)) for i in tokenized.values])

In [11]:
# Build the attention mask: 1 for real tokens, 0 for padding
attention_mask = np.where(padded != 0, 1, 0)

In [12]:
attention_mask.shape

(1000, 4645)
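
The shape `(1000, 4645)` means every comment was padded to the length of the longest one, 4,645 tokens. A toy sketch of the same padding and masking steps in plain Python may make the layout easier to see; the token ids here are illustrative values, not real tokenizer output:

```python
# Toy token-id sequences; 101 and 102 stand in for the [CLS] and [SEP] markers
toy_tokenized = [
    [101, 7592, 102],
    [101, 2023, 2003, 1037, 3231, 102],
]

max_len = max(len(seq) for seq in toy_tokenized)

# Right-pad every sequence with 0 up to the longest length
toy_padded = [seq + [0] * (max_len - len(seq)) for seq in toy_tokenized]

# 1 marks a real token, 0 marks padding
toy_mask = [[int(tok != 0) for tok in seq] for seq in toy_padded]

print(toy_padded[0])  # [101, 7592, 102, 0, 0, 0]
print(toy_mask[0])    # [1, 1, 1, 0, 0, 0]
```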

In [13]:
# Keep only the first 512 tokens of each sequence (the model's maximum input length)
padded_01 = padded[:,:512]
attention_mask_01 = attention_mask[:,:512]
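
Slicing the padded matrix to 512 columns cuts long comments off mid-sequence, so they lose the closing separator token that `add_special_tokens=True` appended. One possible alternative, sketched here under the assumption that the last id in each list is that separator, is to truncate the token lists before padding while keeping the final token. The helper `truncate_keep_last` is hypothetical:

```python
def truncate_keep_last(token_ids, max_len=512):
    """Shorten a token-id list to max_len, keeping its final (separator) token."""
    if len(token_ids) <= max_len:
        return list(token_ids)
    return list(token_ids[: max_len - 1]) + [token_ids[-1]]

# Toy sequence: 101 ... 102 play the roles of [CLS] and [SEP]
long_seq = [101] + list(range(1000, 1010)) + [102]
short = truncate_keep_last(long_seq, max_len=5)
print(short)  # [101, 1000, 1001, 1002, 102]
```

Recent versions of `transformers` can also truncate at tokenization time by passing `truncation=True, max_length=512` to `tokenizer.encode`, which would also avoid the sequence-length warning shown earlier.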

In [14]:
# Run the model over the data in batches
batch_size = 100
embeddings = []
for i in notebook.tqdm(range(padded_01.shape[0] // batch_size)):
        # Convert the batch to tensors
        batch = torch.LongTensor(padded_01[batch_size*i:batch_size*(i+1)])
        attention_mask_batch = torch.LongTensor(attention_mask_01[batch_size*i:batch_size*(i+1)])
        # No gradients are needed: BERT itself is not being trained
        with torch.no_grad():
            batch_embeddings = model(batch, attention_mask=attention_mask_batch)
        # Keep the embedding of the first ([CLS]) token of each comment as a numpy array
        embeddings.append(batch_embeddings[0][:,0,:].numpy())
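
The loop runs `padded_01.shape[0] // batch_size` batches, which covers every row here only because 1,000 is divisible by 100. With any other sample size the trailing rows would be skipped and the feature matrix would end up shorter than the target. A minimal sketch of batch boundaries that also cover a short final batch (the helper `batch_slices` is hypothetical):

```python
def batch_slices(n_rows, batch_size):
    """Yield (start, stop) index pairs covering every row, including a short last batch."""
    for start in range(0, n_rows, batch_size):
        yield start, min(start + batch_size, n_rows)

print(list(batch_slices(1000, 100))[-1])  # (900, 1000)
print(list(batch_slices(1050, 100))[-1])  # (1000, 1050)
print(1050 // 100)                        # 10, so floor division would skip the last 50 rows
```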


## Building the models

In [15]:
# Stack all batch embeddings into a single feature matrix
features = np.concatenate(embeddings)
target = df_toxic['toxic']
# Split the data into training and test sets
features_train, features_test, target_train, target_test = train_test_split(
    features, target, test_size=0.25, random_state=12345)
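
With only 99 positive rows, a purely random split can leave the test set with noticeably more or fewer toxic comments than the training set. The sketch below illustrates the idea of a stratified split in plain Python; it is not scikit-learn's implementation, and the helper `stratified_split` is hypothetical:

```python
import random

def stratified_split(labels, test_size=0.25, seed=12345):
    """Return (train_idx, test_idx) with each class split in the same proportion."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Same class counts as in the sample: 901 non-toxic, 99 toxic
labels = [0] * 901 + [1] * 99
train_idx, test_idx = stratified_split(labels)
print(len(test_idx), sum(labels[i] for i in test_idx))  # 250 25
```

In scikit-learn the same effect is requested by passing `stratify=target` to `train_test_split`.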

### Logistic regression model

In [16]:
model_lr = LogisticRegression(random_state=12345)
result_lr = cross_val_score(model_lr, features_train, target_train,
                            scoring='f1').mean()
print('Logistic regression F1 (cross-validation):', result_lr)

Logistic regression F1 (cross-validation): 0.7165217391304347


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression

In [17]:
# Check F1 on the test set
model_lr.fit(features_train, target_train)
predictions_lr = model_lr.predict(features_test)
print('Test-set F1:', f1_score(target_test, predictions_lr))

Test-set F1: 0.7599999999999999


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


### LightGBM model

In [None]:
model_lgbm = LGBMClassifier(random_state=12345)
params_lgbm = {'n_estimators' : [100, 200, 500],
               'max_depth' : range(2, 7)}

search_lgbm = GridSearchCV(estimator=model_lgbm, param_grid=params_lgbm, scoring='f1',
                           verbose=0, cv=5)
result_lgbm = search_lgbm.fit(features_train , target_train)
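
To get a feel for the cost of this search, the grid can be enumerated by hand. The sketch below uses only `itertools` and mirrors the parameter grid and `cv=5` from the cell above; it illustrates the counting, not how `GridSearchCV` is implemented internally:

```python
from itertools import product

# Same grid and fold count as in the LightGBM search above
param_grid = {'n_estimators': [100, 200, 500], 'max_depth': range(2, 7)}
cv_folds = 5

names = list(param_grid)
candidates = [dict(zip(names, values)) for values in product(*param_grid.values())]

print(len(candidates))             # 15
print(len(candidates) * cv_folds)  # 75
```

With the default `refit=True`, `GridSearchCV` performs one more fit of the best candidate on the whole training set after these 75 cross-validation fits.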

In [19]:
print('Best cross-validation F1 for LightGBM:', result_lgbm.best_score_)
print('Best LightGBM parameters:', result_lgbm.best_params_)

Best cross-validation F1 for LightGBM: 0.6426023252110209
Best LightGBM parameters: {'max_depth': 2, 'n_estimators': 100}


In [20]:
# Evaluate the best LightGBM model on the test set.
# GridSearchCV refits best_estimator_ on the whole training set by default (refit=True),
# so the explicit fit below repeats that step.
best_model_lgbm = search_lgbm.best_estimator_
best_model_lgbm.fit(features_train, target_train)
predictions_lgbm = best_model_lgbm.predict(features_test)
print("Test-set F1 of the best model:", f1_score(target_test, predictions_lgbm))

Test-set F1 of the best model: 0.6666666666666667


### CatBoost model

In [None]:
model_cat = CatBoostClassifier(random_seed=12345)
params_cat = {'iterations' : [100, 300, 500],
              'depth' : range(4, 8)}

search_cat = GridSearchCV(estimator=model_cat, param_grid=params_cat, scoring='f1',
                          verbose=1, cv=3)
result_cat = search_cat.fit(features_train, target_train)

In [22]:
print('Best cross-validation F1 for CatBoost:', result_cat.best_score_)
print('Best CatBoost parameters:', result_cat.best_params_)

Best cross-validation F1 for CatBoost: 0.4985682597297842
Best CatBoost parameters: {'depth': 4, 'iterations': 500}


In [None]:
# Evaluate the best CatBoost model on the test set
best_model_cat = search_cat.best_estimator_
best_model_cat.fit(features_train, target_train)
predictions_cat = best_model_cat.predict(features_test)

In [24]:
print("Test-set F1 of the best model:", f1_score(target_test, predictions_cat))

Test-set F1 of the best model: 0.5714285714285715


## Conclusions for BERT
Computing BERT embeddings is slow, so the models were built on a small part of the data only (1,000 random rows). Surprisingly, logistic regression still reached a test-set F1 above 0.75 (0.76). The test split holds only 250 rows, so this estimate is noisy.

# Models without BERT

## Preparing the data for text classification

In [25]:
# Load the full dataset
df_wikishop = pd.read_csv('df_toxic.csv')
corpus = list(df_wikishop['text'].values.astype('U'))

In [26]:
# Define a lemmatization function
nltk.download('wordnet')
nltk.download('punkt')
lemmatizer = WordNetLemmatizer()
def lemm(text):
    # lemmatize() treats every word as a noun by default (pos='n'),
    # so verb forms such as "trying" are left unchanged
    word_list = nltk.word_tokenize(text)
    return ' '.join([lemmatizer.lemmatize(w) for w in word_list])

[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/aleksandrverlan/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     /Users/aleksandrverlan/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [29]:
# Define a text-cleaning function: keep Latin letters only and collapse whitespace
def clear_text(text):
    return ' '.join(re.sub(r'[^a-zA-Z]', ' ', text).split())
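
A quick check of what `clear_text` does to typical comments. The function is repeated so the snippet runs on its own, and the second input string is made up for illustration:

```python
import re

def clear_text(text):
    # Same regex as in the cell above: non-letters become spaces, then whitespace is collapsed
    return ' '.join(re.sub(r'[^a-zA-Z]', ' ', text).split())

print(clear_text("D'aww! He matches this background colour"))
# D aww He matches this background colour
print(clear_text("Reply\nHey, you could at least mention it (2013)."))
# Reply Hey you could at least mention it
```

Digits and punctuation disappear entirely, and apostrophes split contractions into separate tokens ("I'm" becomes "I m").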

In [30]:
# Apply both functions to every row
df_wikishop['text_lemm'] = df_wikishop['text'].apply(lambda x: lemm(clear_text(x)))

In [31]:
df_wikishop.head()

Unnamed: 0,text,toxic,text_lemm
0,Explanation\nWhy the edits made under my usern...,0,Explanation Why the edits made under my userna...
1,D'aww! He matches this background colour I'm s...,0,D aww He match this background colour I m seem...
2,"Hey man, I'm really not trying to edit war. It...",0,Hey man I m really not trying to edit war It s...
3,"""\nMore\nI can't make any real suggestions on ...",0,More I can t make any real suggestion on impro...
4,"You, sir, are my hero. Any chance you remember...",0,You sir are my hero Any chance you remember wh...


In [32]:
df_wikishop.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 159571 entries, 0 to 159570
Data columns (total 3 columns):
 #   Column     Non-Null Count   Dtype 
---  ------     --------------   ----- 
 0   text       159571 non-null  object
 1   toxic      159571 non-null  int64 
 2   text_lemm  159571 non-null  object
dtypes: int64(1), object(2)
memory usage: 3.7+ MB


In [33]:
# Split the data into training and test sets
train, test = train_test_split(df_wikishop, test_size=0.25, random_state=12345)

In [34]:
nltk.download('stopwords')
stopwords = set(nltk_stopwords.words('english'))
# Separate the features from the target and build TF-IDF features;
# the vectorizer is fitted on the training corpus only
train_target = train['toxic']
train_corpus = train['text_lemm']
# stop_words is passed as a list, which newer scikit-learn versions require
count_tf_idf = TfidfVectorizer(stop_words=list(stopwords))
train_features = count_tf_idf.fit_transform(train_corpus)
test_target = test['toxic']
test_corpus = test['text_lemm']
test_features = count_tf_idf.transform(test_corpus) 
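
The vectorizer is fitted on the training corpus and only applied to the test corpus, so document frequencies from the test set never leak into the features. A rough pure-Python sketch of that fit/transform split, assuming scikit-learn's default weighting (raw term counts, smoothed IDF, L2 normalization). The helpers `fit_idf` and `transform` and the toy documents are illustrative only:

```python
import math
from collections import Counter

def fit_idf(train_docs):
    """Learn smoothed IDF weights from tokenized training documents only."""
    n_docs = len(train_docs)
    doc_freq = Counter(term for doc in train_docs for term in set(doc))
    return {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}

def transform(doc, idf):
    """TF-IDF vector (as a dict) for one document, L2-normalized; unseen terms are dropped."""
    counts = Counter(t for t in doc if t in idf)
    weights = {t: c * idf[t] for t, c in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()} if norm else {}

train_docs = [['edit', 'war'], ['edit', 'page'], ['nice', 'page']]
idf = fit_idf(train_docs)

# 'vandal' never occurred in training, so it gets no feature at transform time
vec = transform(['edit', 'vandal', 'war'], idf)
print(sorted(vec))  # ['edit', 'war']
```

The rarer training term ("war") receives a larger weight than the more common one ("edit").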

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/aleksandrverlan/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


## Building the models

In [46]:
# Define a helper that fits a model and reports its test-set F1
def fit_predict(model):
    model.fit(train_features, train_target)
    predict = model.predict(test_features)
    f1 = f1_score(test_target, predict)
    print("Test-set F1:", f1)

### Logistic regression model

In [36]:
print('Logistic regression')
model_lr = LogisticRegression(random_state=12345)
fit_predict(model_lr)

Logistic regression
Test-set F1: 0.7374153664998528


### LightGBM model

In [38]:
print('LightGBM')
model_lgb = LGBMClassifier(random_state=12345, n_estimators=500)
fit_predict(model_lgb)

LightGBM
Test-set F1: 0.774986346258875


### CatBoost model

In [54]:
print('CatBoost')
model_cat=CatBoostClassifier(random_seed=12345, iterations=500)
model_cat.fit(train_features, train_target)
predict=model_cat.predict(test_features)
f1=f1_score(test_target, predict)

CatBoost


Custom logger is already specified. Specify more than one logger at same time is not thread safe.

Learning rate set to 0.150069

In [52]:
print("Test-set F1:", f1)

Test-set F1: 0.7515341801056087


LightGBM and CatBoost both reached a test-set F1 above 0.75, with LightGBM ahead on this data (0.775 vs. 0.752). Logistic regression came close at 0.737.

## Conclusions
- We explored the data.
- We turned a small part of the data (1,000 rows) into embeddings with the pretrained DistilBERT model.
- We trained several models on those embeddings (logistic regression, LightGBM, CatBoost) and compared their F1 scores.
- We prepared the full dataset with TF-IDF and trained the same kinds of models on the new features.
- The best result was a test-set F1 of 0.77, obtained with LightGBM.