# Project for «Викишоп» with BERT

**Project description**


*The online store «Викишоп» (Wikishop) is launching a new service. Users can now edit and extend product descriptions, as in wiki communities: customers propose their own edits and comment on changes made by others. The store needs a tool that finds toxic comments and sends them for moderation.
Train a model to classify comments as positive or negative. A dataset labeled for the toxicity of edits is at your disposal.
Build a model with an F1 score of at least 0.75.*

**Data description**

*The data is in the file* **toxic_comments.csv**
- **text** - *column containing the comment text*
- **toxic** - *target feature*

**Project plan**

- Load and prepare the data.
- Train several models.
- Write conclusions.

In [6]:
!pip install torch
!pip install transformers
!pip install catboost

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [7]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [8]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV
from sklearn.model_selection import cross_val_score
import torch
import transformers as ppb
import warnings
warnings.filterwarnings('ignore')
import re 
from sklearn.utils import shuffle
from tqdm import notebook
from sklearn.metrics import f1_score
from catboost import CatBoostClassifier
from sklearn.ensemble import RandomForestClassifier

**Load and inspect the data:**

In [9]:
df = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/bert text/toxic_comments.csv')

In [10]:
df.head()

,Unnamed: 0,text,toxic
0,0,Explanation\nWhy the edits made under my usern...,0
1,1,D'aww! He matches this background colour I'm s...,0
2,2,"Hey man, I'm really not trying to edit war. It...",0
3,3,"""\nMore\nI can't make any real suggestions on ...",0
4,4,"You, sir, are my hero. Any chance you remember...",0


In [11]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 159292 entries, 0 to 159291
Data columns (total 3 columns):
 #   Column      Non-Null Count   Dtype 
---  ------      --------------   ----- 
 0   Unnamed: 0  159292 non-null  int64 
 1   text        159292 non-null  object
 2   toxic       159292 non-null  int64 
dtypes: int64(2), object(1)
memory usage: 3.6+ MB


*There are no missing values. There is one column of unknown purpose (`Unnamed: 0`) and a lot of noise in the text data. We strip punctuation and other non-letter characters, extra spaces, and line breaks (\n).*

In [12]:
df = df.drop('Unnamed: 0', axis = 1)

In [13]:
text = list(df['text'])

In [14]:
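def clear(text):
    # Replace everything except Latin letters and spaces with a space.
    text = re.sub(r'[^a-zA-Z ]', ' ', text)
    # Collapse repeated whitespace into single spaces.
    return ' '.join(text.split())

# Clean and lowercase every comment in place.
for i in range(len(text)):
    text[i] = clear(text[i]).lower()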

In [15]:
df['text'] = text
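
To make the effect of the cleaning step concrete, here is a minimal sketch that applies the same `clear` function to a single string. The sample string is a hypothetical input made up for illustration, not a row from the dataset.

```python
import re

def clear(text):
    # Replace everything except Latin letters and spaces with a space.
    text = re.sub(r'[^a-zA-Z ]', ' ', text)
    # Collapse repeated whitespace into single spaces.
    return ' '.join(text.split())

# Hypothetical sample with an apostrophe, punctuation and a line break.
sample = "D'aww! He matches\nthis background colour..."
cleaned = clear(sample).lower()
print(cleaned)
```

One side effect worth keeping in mind: apostrophes are replaced by spaces too, so contractions such as `D'aww` are split into separate tokens, and digits are dropped entirely.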

**Compute text embeddings with BERT (DistilBERT)**

In [19]:

model_class, tokenizer_class, pretrained_weights = (ppb.DistilBertModel, ppb.DistilBertTokenizer, 'distilbert-base-uncased')
tokenizer = tokenizer_class.from_pretrained(pretrained_weights)
model = model_class.from_pretrained(pretrained_weights).cuda()

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertModel: ['vocab_layer_norm.bias', 'vocab_layer_norm.weight', 'vocab_projector.weight', 'vocab_projector.bias', 'vocab_transform.weight', 'vocab_transform.bias']
- This IS expected if you are initializing DistilBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [20]:
# Note: x[:512] keeps the first 512 characters of each comment, not 512 tokens.
tokenized = df['text'].apply(lambda x: tokenizer.encode(x[:512], add_special_tokens=True))

In [21]:
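# Length of the longest tokenized comment.
max_len = max(len(i) for i in tokenized.values)

# Right-pad every sequence with 0 (the [PAD] token id) up to max_len.
padded = np.array([i + [0]*(max_len-len(i)) for i in tokenized.values])
# 1 for real tokens, 0 for padding.
attention_mask = np.where(padded != 0, 1, 0)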
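
The padding and masking step can be checked on a toy input. The sketch below uses three hypothetical token-id lists (the ids are illustrative; 101 and 102 stand in for `[CLS]` and `[SEP]`) and shows that the attention mask is 1 for real tokens and 0 for padding.

```python
import numpy as np

# Hypothetical token-id sequences of different lengths.
tokenized = [[101, 7592, 102], [101, 2023, 2003, 1037, 102], [101, 102]]

max_len = max(len(ids) for ids in tokenized)
# Right-pad each sequence with 0 so that all rows have the same length.
padded = np.array([ids + [0] * (max_len - len(ids)) for ids in tokenized])
# 1 marks real tokens, 0 marks padding the model should ignore.
attention_mask = np.where(padded != 0, 1, 0)

print(padded)
print(attention_mask)
```

This works because 0 is reserved for padding, so no real token in a sequence is ever mistaken for a padded position.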

In [22]:
%tensorflow_version 2.x
import tensorflow as tf
device_name = tf.test.gpu_device_name()
if device_name != '/device:GPU:0':
    raise SystemError('GPU device not found')
print('Found GPU at: {}'.format(device_name))

Colab only includes TensorFlow 2.x; %tensorflow_version has no effect.
Found GPU at: /device:GPU:0


In [23]:
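batch_size = 100
embeddings = []
# Note: integer division skips the final partial batch (159292 % 100 = 92 rows).
for i in notebook.tqdm(range(padded.shape[0] // batch_size)):
    batch = torch.cuda.LongTensor(padded[batch_size*i:batch_size*(i+1)])
    attention_mask_batch = torch.cuda.LongTensor(attention_mask[batch_size*i:batch_size*(i+1)])
    # Inference only: no gradients are needed to extract embeddings.
    with torch.no_grad():
        batch_embeddings = model(batch, attention_mask=attention_mask_batch)

    # Keep the hidden state of the first ([CLS]) token as the comment embedding.
    embeddings.append(batch_embeddings[0][:,0,:].cpu().numpy())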

  0%|          | 0/1592 [00:00<?, ?it/s]
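
The loop above runs over `padded.shape[0] // batch_size` batches, so rows beyond the last full batch are never embedded. The following sketch shows one possible way to cover all rows, including a shorter final batch. `batch_slices` is a hypothetical helper introduced for illustration and is not part of the notebook.

```python
import math

def batch_slices(n_rows, batch_size):
    """Yield (start, stop) pairs covering all rows, including a shorter last batch."""
    n_batches = math.ceil(n_rows / batch_size)
    for i in range(n_batches):
        start = i * batch_size
        yield start, min(start + batch_size, n_rows)

n_rows, batch_size = 159292, 100  # sizes taken from the notebook
slices = list(batch_slices(n_rows, batch_size))

# Rows skipped when only full batches are processed.
skipped = n_rows - (n_rows // batch_size) * batch_size

print(len(slices), slices[-1], skipped)
```

With slices like these, the body of the loop could index `padded[start:stop]` and `attention_mask[start:stop]`, and the target would no longer need trimming to match the features.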

In [24]:
features = np.concatenate(embeddings)
target = df['toxic']
# Trim the target to the rows that were actually embedded.
target = target.iloc[:len(features)]


X_train, X_test, Y_train, Y_test = train_test_split(features, target, test_size=0.2, random_state=42)

*Since the classes are imbalanced, we downsample the training set.*

In [25]:
Y_train.value_counts()

0    114416
1     12944
Name: toxic, dtype: int64

In [26]:
X_train_df = pd.DataFrame(X_train)
X_train_df.index = Y_train.index

In [27]:
def downsample(features, target, fraction):
    features_zeros = features[target == 0]
    features_ones = features[target == 1]
    target_zeros = target[target == 0]
    target_ones = target[target == 1]

    features_downsampled = pd.concat(
        [features_zeros.sample(frac=fraction, random_state=12345)] + [features_ones])
    target_downsampled = pd.concat(
        [target_zeros.sample(frac=fraction, random_state=12345)] + [target_ones])
    
    features_downsampled, target_downsampled = shuffle(
        features_downsampled, target_downsampled, random_state=12345)
    
    return features_downsampled, target_downsampled


X_train_down, Y_train_down = downsample(X_train_df, Y_train, 0.11)

In [28]:
Y_train_down.value_counts()

1    12944
0    12586
Name: toxic, dtype: int64
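
The fraction `0.11` passed to `downsample` can be related to the class counts printed earlier. A small worked check, using only the numbers from `Y_train.value_counts()`:

```python
# Class counts from Y_train.value_counts() in the notebook.
n_neg, n_pos = 114416, 12944

# Fraction of negatives to keep for an exactly balanced training set.
balanced_fraction = n_pos / n_neg

# Number of negatives kept with the rounded fraction used in the notebook.
kept_neg = round(n_neg * 0.11)

print(round(balanced_fraction, 4), kept_neg, n_pos)
```

The exact balancing fraction is about 0.113, so rounding it to 0.11 leaves slightly fewer negatives than positives, which is consistent with the 12586 vs. 12944 rows reported above.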

**Train the models:**

**LogisticRegression**

In [29]:
%%time
modellcr = LogisticRegression()

# Note: without scoring='f1', cross_val_score uses the estimator's default scorer (accuracy).
cross_lcr = np.mean(cross_val_score(modellcr, X_train, Y_train, cv = 3))
print('Cross-validation score (accuracy): ', cross_lcr)

# modellcr.fit( X_train_down, Y_train_down) # Test-set check removed here
# predlcr = modellcr.predict(X_test)

# print('F1 on the test set: ', f1_score(Y_test, predlcr))

Cross-validation score (accuracy):  0.9509971736690402
CPU times: user 55.2 s, sys: 5.19 s, total: 1min
Wall time: 31.8 s
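
The `cross_val_score` call above passes no `scoring` argument, so scikit-learn falls back to the estimator's default scorer, which for classifiers is accuracy. On imbalanced data the two metrics can differ a lot. The sketch below illustrates this on hypothetical toy labels with a majority-class baseline (`DummyClassifier`), which is an assumption chosen to make the contrast obvious and is not one of the notebook's models.

```python
import warnings

import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_val_score

# Hypothetical imbalanced labels: 90 non-toxic, 9 toxic.
y = np.array([0] * 90 + [1] * 9)
X = np.zeros((len(y), 1))  # features are irrelevant for this baseline

# Baseline that always predicts the majority class (0).
baseline = DummyClassifier(strategy="most_frequent")

with warnings.catch_warnings():
    # F1 is ill-defined when no positives are predicted; silence that warning.
    warnings.simplefilter("ignore")
    acc = cross_val_score(baseline, X, y, cv=3).mean()               # default scorer: accuracy
    f1 = cross_val_score(baseline, X, y, cv=3, scoring="f1").mean()  # explicit F1

print(f"accuracy: {acc:.3f}, F1: {f1:.3f}")
```

For the models in this notebook, the equivalent change would be passing `scoring='f1'` to `cross_val_score`, so that the cross-validation numbers are comparable with the F1 measured on the test set.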


**CatBoost**

In [30]:
modelcat = CatBoostClassifier(random_state = 42, verbose=100, early_stopping_rounds=200, eval_metric = 'F1')
# Note: without scoring='f1', cross_val_score uses the estimator's default scorer (accuracy).
cross_cat = np.mean(cross_val_score(modelcat, X_train, Y_train, cv = 3))
# modelcat.fit(X_train, Y_train) 
# predcat = modelcat.predict(X_test) # test-set check removed

print(' ')
print('Cross-validation score (accuracy): ', cross_cat)
# print('F1 on the test set: ', f1_score(Y_test, predcat))

Learning rate set to 0.068642
0:	learn: 0.4449660	total: 501ms	remaining: 8m 20s
100:	learn: 0.6963329	total: 41.2s	remaining: 6m 7s
200:	learn: 0.7363618	total: 1m 21s	remaining: 5m 25s
300:	learn: 0.7697303	total: 1m 54s	remaining: 4m 25s
400:	learn: 0.7918748	total: 2m 26s	remaining: 3m 38s
500:	learn: 0.8171105	total: 3m	remaining: 2m 59s
600:	learn: 0.8382343	total: 3m 36s	remaining: 2m 23s
700:	learn: 0.8568493	total: 4m 9s	remaining: 1m 46s
800:	learn: 0.8725303	total: 4m 42s	remaining: 1m 10s
900:	learn: 0.8872171	total: 5m 14s	remaining: 34.5s
999:	learn: 0.8992783	total: 5m 45s	remaining: 0us
Learning rate set to 0.068642
0:	learn: 0.4718047	total: 444ms	remaining: 7m 23s
100:	learn: 0.6965031	total: 37.1s	remaining: 5m 30s
200:	learn: 0.7369618	total: 1m 12s	remaining: 4m 46s
300:	learn: 0.7690271	total: 1m 44s	remaining: 4m 3s
400:	learn: 0.7988457	total: 2m 17s	remaining: 3m 25s
500:	learn: 0.8204661	total: 2m 49s	remaining: 2m 49s
600:	learn: 0.8396098	total: 3m 22s	remai

**Compare the model scores:**

In [34]:

pd.DataFrame({'Cross-validation score (accuracy)' : {'LogisticRegression': cross_lcr, 'CatBoostClassifier':cross_cat}})

,Cross-validation score (accuracy)
CatBoostClassifier,0.949607
LogisticRegression,0.950997


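*The two models score almost identically in cross-validation. Since `cross_val_score` was called without `scoring='f1'`, these values are accuracy rather than F1. Let's evaluate on the test set.*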

In [35]:
# Note: the model is fitted on the full X_train, not on the downsampled X_train_down.
modellcr.fit(X_train, Y_train)
predlcr = modellcr.predict(X_test)
print('F1 on the test set: ', f1_score(Y_test, predlcr))

F1 on the test set:  0.7274934952298353


In [38]:
modelcattest = CatBoostClassifier(random_state = 42, verbose=100, early_stopping_rounds=200, eval_metric = 'F1').fit( X_train, Y_train)

Learning rate set to 0.081617
0:	learn: 0.4861563	total: 521ms	remaining: 8m 40s
100:	learn: 0.6959416	total: 47.4s	remaining: 7m 2s
200:	learn: 0.7351616	total: 1m 35s	remaining: 6m 17s
300:	learn: 0.7652858	total: 2m 28s	remaining: 5m 45s
400:	learn: 0.7880356	total: 3m 11s	remaining: 4m 45s
500:	learn: 0.8077108	total: 3m 52s	remaining: 3m 51s
600:	learn: 0.8254422	total: 4m 34s	remaining: 3m 2s
700:	learn: 0.8411263	total: 5m 16s	remaining: 2m 15s
800:	learn: 0.8555166	total: 5m 58s	remaining: 1m 28s
900:	learn: 0.8691919	total: 6m 39s	remaining: 43.9s
999:	learn: 0.8787545	total: 7m 19s	remaining: 0us


In [39]:
print('F1 on the test set: ', f1_score(Y_test, modelcattest.predict(X_test)))

F1 on the test set:  0.7113000354233087


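*On the test set, LogisticRegression scored slightly better (F1 ≈ 0.727 vs. ≈ 0.711 for CatBoost). I recommend this model in combination with BERT embeddings. Note that neither result reaches the target F1 of 0.75.*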