# Detecting toxic comments with BERT

The online store "Викишоп" (Wikishop) is launching a new service. Users can now edit and extend product descriptions, as in wiki communities: customers propose their own edits and comment on the changes made by others. The store needs a tool that finds toxic comments and sends them to moderation.

Train a model to classify comments as positive or negative. You have a dataset labeled for the toxicity of edits.

Build a model with an *F1* score of at least 0.75.

**Project instructions**

1. Load and prepare the data.
2. Train several different models.
3. Draw conclusions.

Using *BERT* is not required for this project, but you may try it.

## Data description

The data is stored in the file `toxic_comments.csv`. The *text* column contains the comment text, and *toxic* is the target.

## Importing libraries

In [5]:
# !pip install evaluate transformers[torch]

In [6]:
import numpy as np
import pandas as pd

import torch

# HuggingFace library
import transformers

import evaluate

from transformers import BertForSequenceClassification, AutoModelForSequenceClassification
from transformers import TrainingArguments, Trainer, AutoTokenizer, BertTokenizer, BertConfig, BertModel

import pyarrow as pa
from datasets import Dataset

from tqdm import notebook

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.metrics import accuracy_score, f1_score

# Suppress warnings
import warnings
warnings.filterwarnings('ignore')

RANDOM_STATE = 12345

In [7]:
f1 = evaluate.load('f1')

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    return f1.compute(predictions=predictions, references=labels)
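
To make the metric concrete, here is a minimal plain-Python sketch of what `compute_metrics` evaluates: the predicted class is the index of the largest logit, and F1 is the harmonic mean of precision and recall for the positive class. The helper name `f1_from_logits` and the toy logits are illustrative assumptions, not part of the notebook.

```python
def f1_from_logits(logits, labels, positive=1):
    """Binary F1 for the positive class from per-class logits (illustrative helper)."""
    # argmax over the class axis: index of the largest logit in each row
    preds = [max(range(len(row)), key=row.__getitem__) for row in logits]

    tp = sum(p == positive and y == positive for p, y in zip(preds, labels))
    fp = sum(p == positive and y != positive for p, y in zip(preds, labels))
    fn = sum(p != positive and y == positive for p, y in zip(preds, labels))

    if tp == 0:
        return 0.0  # no true positives: F1 is taken to be 0 here
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy data: 4 comments, 2 logits each (not toxic, toxic)
toy_logits = [[2.0, -1.0], [0.1, 0.3], [-0.5, 1.5], [1.2, 0.4]]
toy_labels = [0, 0, 1, 1]
toy_f1 = f1_from_logits(toy_logits, toy_labels)
```

In this toy case there is one true positive, one false positive and one false negative, so precision and recall are both 0.5.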

## Loading the data

In [8]:
try:
    df = pd.read_csv('/datasets/toxic_comments.csv').drop(['Unnamed: 0'], axis=1)
except FileNotFoundError:
    # Fall back to a local copy of the file
    df = pd.read_csv('./toxic_comments.csv').drop(['Unnamed: 0'], axis=1)

## Exploring the data

In [9]:
df

Unnamed: 0,text,toxic
0,Explanation\nWhy the edits made under my usern...,0
1,D'aww! He matches this background colour I'm s...,0
2,"Hey man, I'm really not trying to edit war. It...",0
3,"""\nMore\nI can't make any real suggestions on ...",0
4,"You, sir, are my hero. Any chance you remember...",0
...,...,...
159287,""":::::And for the second time of asking, when ...",0
159288,You should be ashamed of yourself \n\nThat is ...,0
159289,"Spitzer \n\nUmm, theres no actual article for ...",0
159290,And it looks like it was actually you who put ...,0


## Data preprocessing

We take a random subsample of 800 comments. The classes are not rebalanced, so the sample keeps roughly the class proportions of the full dataset.

In [10]:
# Rename the 'toxic' column to 'label' so that the model pipeline recognizes the column holding the target
# Use only 800 rows of the dataset to avoid long computations
cdf = df.rename(columns={'toxic': 'label'}).sample(n=800, random_state=RANDOM_STATE)

Removing extra whitespace

In [13]:
def preprocess(raw):
    raw['text'] = ' '.join(text.strip() for text in raw['text'].split())
    
    return raw

cdf = cdf.apply(preprocess, axis=1)
cdf.head()

Unnamed: 0,text,label
109486,Expert Categorizers Why is there no mention of...,0
104980,""" Noise fart* talk. """,1
82166,"An indefinite block is appropriate, even for a...",0
18721,I don't understand why we have a screenshot of...,0
128178,"Hello! Some of the people, places or things yo...",0
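
As a small illustration of the whitespace cleanup, the sketch below shows that `str.split()` without arguments already splits on any run of whitespace and drops leading and trailing whitespace, so joining the pieces with a single space is enough. The function name `normalize_whitespace` and the sample string are assumptions for the example.

```python
def normalize_whitespace(text: str) -> str:
    """Collapse any run of whitespace (spaces, tabs, newlines) into one space."""
    # str.split() with no arguments splits on whitespace runs and discards
    # leading/trailing whitespace, so no extra strip() call is needed.
    return " ".join(text.split())


sample = "Explanation\nWhy the   edits\tmade  "
cleaned = normalize_whitespace(sample)
```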


## Feature preprocessing

Load the configuration and the base BERT model for English

In [14]:
config = BertConfig.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased', config=config)

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.bias', 'cls.predictions.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.dense.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Maximum sequence length (in tokens) that the model supports

In [15]:
config.max_position_embeddings

512

### For LogisticRegression

Tokenizing the text

In [16]:
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', model_max_length=config.max_position_embeddings)

This avoids the warning "Token indices sequence length is longer than the specified maximum sequence length for this model (526 > 512). Running this sequence through the model will result in indexing errors" (526 is simply the first over-long sequence encountered; the longest one has 1076 tokens). Some comments are too long for the model, and `truncation=True` solves this by cutting them to the maximum length

In [17]:
cdf['input_ids'] = cdf['text'].apply(lambda x: tokenizer.encode(x, add_special_tokens=True, truncation=True))

# the same as 'input_ids' returned by AutoTokenizer:
# tokenizer(text, padding="max_length", truncation=True, max_length=512)
cdf

Unnamed: 0,text,label,input_ids
109486,Expert Categorizers Why is there no mention of...,0,"[101, 6739, 4937, 20265, 25709, 2869, 2339, 20..."
104980,""" Noise fart* talk. """,1,"[101, 1000, 5005, 2521, 2102, 1008, 2831, 1012..."
82166,"An indefinite block is appropriate, even for a...",0,"[101, 2019, 25617, 3796, 2003, 6413, 1010, 213..."
18721,I don't understand why we have a screenshot of...,0,"[101, 1045, 2123, 1005, 1056, 3305, 2339, 2057..."
128178,"Hello! Some of the people, places or things yo...",0,"[101, 7592, 999, 2070, 1997, 1996, 2111, 1010,..."
...,...,...,...
73497,ou leftist Wikipedia scum are an insignificant...,1,"[101, 15068, 24247, 16948, 8040, 2819, 2024, 2..."
16456,Hey this IP belongs to a Public Library in NYC,0,"[101, 4931, 2023, 12997, 7460, 2000, 1037, 227..."
16015,"H's P, again Thanks for your kind words Sam. I...",0,"[101, 1044, 1005, 1055, 1052, 1010, 2153, 4283..."
50197,Comment to administrator carrying out the edit...,0,"[101, 7615, 2000, 8911, 4755, 2041, 1996, 1008..."
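
A rough sketch of what truncation does to an over-long sequence, assuming ids 101 and 102 are `[CLS]` and `[SEP]` in the bert-base-uncased vocabulary. The helper `truncate_with_specials` and the made-up token ids are for illustration only; the real tokenizer handles this internally.

```python
CLS_ID, SEP_ID = 101, 102  # [CLS] and [SEP] ids in the bert-base-uncased vocabulary


def truncate_with_specials(token_ids, max_len=512):
    """Wrap token ids in [CLS] ... [SEP], cutting the body so the total fits max_len."""
    body = token_ids[: max_len - 2]  # reserve two positions for the special tokens
    return [CLS_ID] + body + [SEP_ID]


long_comment = list(range(1000, 1526))  # 526 made-up token ids
encoded = truncate_with_specials(long_comment)
short_encoded = truncate_with_specials([7592, 999])  # short input is left intact
```

Only the tail of the comment is dropped, so anything toxic that appears after the first 510 tokens is invisible to the model.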


Checking the maximum sequence length in tokens

In [18]:
tokens = [len(el) for el in cdf['input_ids'].values]
max(tokens)

512

In [19]:
max_len_item = tokens.index(max(tokens))
# tokens is aligned with cdf (the 800-row sample), so the row is looked up in cdf, not df
max_len_item, cdf.iloc[max_len_item]['text']

In [20]:
df.head()

Unnamed: 0,text,toxic
0,Explanation\nWhy the edits made under my usern...,0
1,D'aww! He matches this background colour I'm s...,0
2,"Hey man, I'm really not trying to edit war. It...",0
3,"""\nMore\nI can't make any real suggestions on ...",0
4,"You, sir, are my hero. Any chance you remember...",0


For each sequence we build a mask marking which positions hold real tokens; the attention mechanism needs it to ignore padding

In [21]:
padded = np.array([i + [0]*(max(tokens) - len(i)) for i in cdf['input_ids'].values])
attention_mask = np.where(padded != 0, 1, 0)
attention_mask

array([[1, 1, 1, ..., 0, 0, 0],
       [1, 1, 1, ..., 0, 0, 0],
       [1, 1, 1, ..., 0, 0, 0],
       ...,
       [1, 1, 1, ..., 0, 0, 0],
       [1, 1, 1, ..., 0, 0, 0],
       [1, 1, 1, ..., 0, 0, 0]])

The mask differs for each token sequence (in its number of ones)

In [22]:
mask_counts = [list(x).count(1) for x in attention_mask]
mask_counts[:10]

[43, 10, 31, 28, 114, 87, 18, 9, 105, 51]
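
The padding and masking step can be sketched in plain Python as follows, assuming id 0 is `[PAD]` in the bert-base-uncased vocabulary. `pad_and_mask` is a hypothetical helper that builds the mask from the sequence lengths rather than by comparing padded ids with 0.

```python
PAD_ID = 0  # [PAD] id in the bert-base-uncased vocabulary


def pad_and_mask(sequences, pad_id=PAD_ID):
    """Right-pad sequences to a common length and build the matching attention mask."""
    max_len = max(len(seq) for seq in sequences)
    padded = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    # 1 marks a real token, 0 marks padding that attention should ignore
    mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences]
    return padded, mask


toy_ids = [[101, 7592, 102], [101, 2023, 2003, 1037, 102]]
toy_padded, toy_mask = pad_and_mask(toy_ids)
mask_counts_toy = [sum(row) for row in toy_mask]  # number of real tokens per row
```

Deriving the mask from lengths gives the same result as `padded != 0` as long as the padding id never occurs as a real token.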

In [23]:
batch_size = 200
embeddings = []
for i in notebook.tqdm(range(padded.shape[0] // batch_size)):
    batch = torch.LongTensor(padded[batch_size*i:batch_size*(i+1)]) 
    attention_mask_batch = torch.LongTensor(attention_mask[batch_size*i:batch_size*(i+1)])
    
    with torch.no_grad():
        batch_embeddings = model(batch, attention_mask=attention_mask_batch)
    
    embeddings.append(batch_embeddings[0][:,0,:].numpy())

  0%|          | 0/4 [00:00<?, ?it/s]
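
The loop above uses integer division, so it covers every row only when the number of rows is a multiple of `batch_size` (800 and 200 here). A hedged sketch of slice bounds that also cover a final partial batch; `batch_bounds` is a hypothetical helper, not part of the notebook.

```python
import math


def batch_bounds(n_rows, batch_size):
    """Return (start, stop) index pairs covering all rows; the last batch may be shorter."""
    n_batches = math.ceil(n_rows / batch_size)
    return [(i * batch_size, min((i + 1) * batch_size, n_rows)) for i in range(n_batches)]


exact = batch_bounds(800, 200)   # divides evenly: 4 full batches
ragged = batch_bounds(850, 200)  # 4 full batches plus one of 50 rows
```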

Splitting into training and test sets

In [20]:
lr_features = np.concatenate(embeddings)
lr_target = cdf['label']

lr_feature_train, lr_feature_test, lr_target_train, lr_target_test = \
    train_test_split(lr_features, lr_target, test_size=0.5, random_state=RANDOM_STATE)

In [21]:
lr_target_train.value_counts(), lr_target_test.value_counts()

(0    350
 1     50
 Name: label, dtype: int64,
 0    365
 1     35
 Name: label, dtype: int64)

The classes are imbalanced; we leave the distribution as is
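
With so few toxic examples, a purely random split can distribute them unevenly (50 vs 35 above). Below is a minimal plain-Python sketch of a stratified split that keeps the class ratio nearly equal in both halves. `stratified_split` is a hypothetical helper; scikit-learn's `train_test_split` offers a `stratify` argument for the same purpose.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_size=0.5, seed=12345):
    """Split row indices so each class keeps the same share in train and test."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Toy labels with the same 715 / 85 class counts as the 800-row sample
toy_labels = [0] * 715 + [1] * 85
train_idx, test_idx = stratified_split(toy_labels)
train_pos = sum(toy_labels[i] for i in train_idx)
test_pos = sum(toy_labels[i] for i in test_idx)
```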

### For BERT

In [22]:
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')

def process_data(row):
    encodings = tokenizer(row['text'], padding="max_length", truncation=True, max_length=512)

    encodings['label'] = row['label']
    encodings['text'] = row['text']

    return encodings

In [23]:
print(process_data({
    'text': 'this is a sample review of a movie.',
    'label': 'positive'
}))

{'input_ids': [101, 2023, 2003, 1037, 7099, 3319, 1997, 1037, 3185, 1012, 102, 0, 0, 0, ...

In [24]:
processed_data = []
for i in range(len(cdf)):
    processed_data.append(process_data(cdf.iloc[i]))

In [25]:
# The column names matter: by default the model expects
# data with exactly these column names:
# `attention_mask`, `input_ids`, `label`, `token_type_ids`

new_df = pd.DataFrame(processed_data)
new_df.head()

Unnamed: 0,attention_mask,input_ids,label,text,token_type_ids
0,"[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...","[101, 6739, 4937, 20265, 25709, 2869, 2339, 20...",0,Expert Categorizers Why is there no mention of...,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..."
1,"[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, ...","[101, 1000, 5005, 2521, 2102, 1008, 2831, 1012...",1,""" Noise fart* talk. ""","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..."
2,"[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...","[101, 2019, 25617, 3796, 2003, 6413, 1010, 213...",0,"An indefinite block is appropriate, even for a...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..."
3,"[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...","[101, 1045, 2123, 1005, 1056, 3305, 2339, 2057...",0,I don't understand why we have a screenshot of...,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..."
4,"[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...","[101, 7592, 999, 2070, 1997, 1996, 2111, 1010,...",0,"Hello! Some of the people, places or things yo...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..."


In [27]:
train_df, valid_df = train_test_split(
    new_df,
    test_size=0.5,
    random_state=12345
)

cdf_without_train_valid = df[~df.index.isin(cdf.index)]
cdf_without_train_valid = cdf_without_train_valid.rename(columns={'toxic': 'label'})
test_df = cdf_without_train_valid.sample(n=400, random_state=RANDOM_STATE)
test_df

Unnamed: 0,text,label
17770,File:Peppy Hare.png listed for deletion \nA fi...,0
87774,"""Thank you for experimenting with the page X-S...",0
98475,"Also, I was not harassing anyone. I suggest th...",0
9997,? \n\nIt is brutal how you believe that this w...,0
133592,"""\n\nNo worries, it's good of you to apologise...",0
...,...,...
146343,I only removed the Edward Smith stuff. But yo...,0
80891,"""\n\n Removed edit \n\nHi William, I Changed ""...",0
139530,"Crap, this article sucks. I tried reading it,...",1
36173,"""\nMost recent edits have been for typos or by...",0


In [28]:
train_df['label'].value_counts(), valid_df['label'].value_counts(), test_df['label'].value_counts()

(0    350
 1     50
 Name: label, dtype: int64,
 0    365
 1     35
 Name: label, dtype: int64,
 0    359
 1     41
 Name: label, dtype: int64)

In [29]:
train_hg = Dataset(pa.Table.from_pandas(train_df))
valid_hg = Dataset(pa.Table.from_pandas(valid_df))
test_hg = Dataset(pa.Table.from_pandas(test_df))

## Training

In [30]:
results = {}

### Model 1. LogisticRegression

In [31]:
lr_param_grid = {
    'solver': ['liblinear', 'newton-cg', 'lbfgs', 'sag', 'saga']
}

lr = LogisticRegression(random_state=RANDOM_STATE)
grid_lr = GridSearchCV(lr, param_grid=lr_param_grid, scoring='f1', cv=5)
grid_lr.fit(lr_feature_train, lr_target_train);
results['LR'] = grid_lr.best_score_
results['LR']

0.5548692810457516

### Model 2. BERT

In [32]:
bert_cdf = cdf[['text', 'label']].copy()
bert_cdf

Unnamed: 0,text,label
109486,Expert Categorizers Why is there no mention of...,0
104980,""" Noise fart* talk. """,1
82166,"An indefinite block is appropriate, even for a...",0
18721,I don't understand why we have a screenshot of...,0
128178,"Hello! Some of the people, places or things yo...",0
...,...,...
73497,ou leftist Wikipedia scum are an insignificant...,1
16456,Hey this IP belongs to a Public Library in NYC,0
16015,"H's P, again Thanks for your kind words Sam. I...",0
50197,Comment to administrator carrying out the edit...,0


In [33]:
train_hg

Dataset({
    features: ['attention_mask', 'input_ids', 'label', 'text', 'token_type_ids', '__index_level_0__'],
    num_rows: 400
})

In [34]:
model = AutoModelForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.predictions.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly i

In [35]:
training_args = TrainingArguments(output_dir="./result", evaluation_strategy="epoch", report_to="none")

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_hg,
    eval_dataset=valid_hg,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)

Training the model

In [36]:
trainer.train()

You're using a BertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


{'eval_loss': 0.25760623812675476, 'eval_f1': 0.05555555555555556, 'eval_runtime': 17.8291, 'eval_samples_per_second': 22.435, 'eval_steps_per_second': 2.804, 'epoch': 1.0}


{'eval_loss': 0.0644557774066925, 'eval_f1': 0.8529411764705883, 'eval_runtime': 31.3641, 'eval_samples_per_second': 12.753, 'eval_steps_per_second': 1.594, 'epoch': 2.0}


{'eval_loss': 0.0716361328959465, 'eval_f1': 0.8611111111111112, 'eval_runtime': 35.2308, 'eval_samples_per_second': 11.354, 'eval_steps_per_second': 1.419, 'epoch': 3.0}
{'train_runtime': 277.5914, 'train_samples_per_second': 4.323, 'train_steps_per_second': 0.54, 'train_loss': 0.20312365214029948, 'epoch': 3.0}


TrainOutput(global_step=150, training_loss=0.20312365214029948, metrics={'train_runtime': 277.5914, 'train_samples_per_second': 4.323, 'train_steps_per_second': 0.54, 'train_loss': 0.20312365214029948, 'epoch': 3.0})
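
The value `global_step=150` follows from the `TrainingArguments` defaults, which this notebook does not override: a per-device batch size of 8 and 3 epochs. A small sketch of that arithmetic, assuming a single device and no gradient accumulation; `total_training_steps` is an illustrative helper.

```python
import math


def total_training_steps(n_samples, batch_size=8, epochs=3):
    """Optimizer steps for a run on one device without gradient accumulation."""
    steps_per_epoch = math.ceil(n_samples / batch_size)
    return steps_per_epoch * epochs


eval_batches = math.ceil(400 / 8)        # validation batches per evaluation pass
train_steps = total_training_steps(400)  # 400 training rows
```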

In [37]:
bert_res = trainer.evaluate()
print(bert_res)

results['BERT'] = bert_res['eval_f1']
results['BERT']

{'eval_loss': 0.0716361328959465, 'eval_f1': 0.8611111111111112, 'eval_runtime': 36.1282, 'eval_samples_per_second': 11.072, 'eval_steps_per_second': 1.384, 'epoch': 3.0}


0.8611111111111112

In [38]:
model.save_pretrained('./model/')

Results of the two models

In [39]:
models_score = pd.DataFrame(results.values(), index=results.keys(), columns=['F1'])
models_score

Unnamed: 0,F1
LR,0.554869
BERT,0.861111


## Testing the best model

BERT turned out to be the best model

In [41]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
device

device(type='cuda')

In [42]:
new_model = AutoModelForSequenceClassification.from_pretrained('./model/').to(device)

In [43]:
new_tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')

In [44]:
def get_prediction(text):
    # Tokenize and move the tensors to the device that holds the loaded model
    encoding = new_tokenizer(text, return_tensors="pt", padding="max_length", truncation=True, max_length=512)
    encoding = {k: v.to(new_model.device) for k, v in encoding.items()}

    # Inference only: gradients are not needed
    with torch.no_grad():
        outputs = new_model(**encoding)

    logits = outputs.logits
    sigmoid = torch.nn.Sigmoid()

    # Sigmoid is applied to each logit separately, so the two scores are not normalized
    probs = sigmoid(logits.squeeze().cpu())
    probs = probs.detach().numpy()
    label = np.argmax(probs, axis=-1)

    if label == 1:
        return {
            'label': 'toxic',
            'probability': probs[1]
        }
    else:
        return {
            'label': 'not toxic',
            'probability': probs[0]
        }
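
`get_prediction` applies a sigmoid to each of the two logits independently, so the two scores need not sum to 1, whereas a softmax over the pair gives a normalized distribution. Both pick the same class, because each is monotonic in the logits. A plain-Python sketch of the difference, using made-up logits:

```python
import math


def sigmoid(x):
    """Logistic function applied to a single logit."""
    return 1.0 / (1.0 + math.exp(-x))


def softmax(values):
    """Normalized probabilities over a list of logits."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


toy_logits = [3.0, -2.0]  # made-up (not toxic, toxic) logits

sigmoid_scores = [sigmoid(v) for v in toy_logits]  # independent per-logit scores
softmax_probs = softmax(toy_logits)                # normalized over the two classes

# Index of the winning class under each scoring
sigmoid_label = sigmoid_scores.index(max(sigmoid_scores))
softmax_label = softmax_probs.index(max(softmax_probs))
```

If the reported `probability` is meant to be read as a class probability, the softmax value is the one that fits that interpretation.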

In [46]:
pred = test_df['text'].apply(get_prediction).apply(lambda x: 1 if x['label'] == 'toxic' else 0)
pred

17770     0
87774     0
98475     0
9997      0
133592    0
         ..
146343    0
80891     0
139530    1
36173     0
48310     1
Name: text, Length: 400, dtype: int64

In [48]:
bert_test_score = f1_score(test_df['label'], pred)
bert_test_score

0.8043478260869564

In [49]:
i = 103
example = new_df.iloc[i]['text']
print(new_df.iloc[i]['label'], example)
get_prediction(example)

0 You would still have to sit through written text like in a video,, lol; it's really a matter of preference. I don't see how this isn't persuasive, though (I admit she talks a bit silly). It's a summary of some of the arguments that have come up in preference to tau.


{'label': 'not toxic', 'probability': 0.95315117}

## Conclusions

- Two different text preprocessing pipelines were built for the two models: BERT embeddings as features for LogisticRegression, and token sequences for fine-tuning BERT
- The models were trained to tell toxic text from non-toxic text
- The following F1 scores were obtained (5-fold cross-validation on the training split for LR, hold-out validation set for BERT):

In [50]:
models_score

Unnamed: 0,F1
LR,0.554869
BERT,0.861111


- BERT turned out to be the best model, with **F1 0.86** on the validation set and **F1 0.80** on the test set