# Распознавание токсичных комментариев с BERT

Интернет-магазин «Викишоп» запускает новый сервис. Теперь пользователи могут редактировать и дополнять описания товаров, как в вики-сообществах. То есть клиенты предлагают свои правки и комментируют изменения других. Магазину нужен инструмент, который будет искать токсичные комментарии и отправлять их на модерацию.

Необходимо обучить модель классифицировать комментарии на позитивные и негативные. В нашем распоряжении набор данных с разметкой о токсичности правок.

Заказчик требует построить модель со значением метрики качества F1 не меньше 0.75.

Для выполнения проекта будет использован DistilBERT. Выполнено в base google colab environment, T4 GPU Runtime.

## Imports

In [1]:
! pip install -U accelerate
! pip install -U transformers
! pip install -U datasets

Collecting accelerate
  Downloading accelerate-0.30.1-py3-none-any.whl (302 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m302.6/302.6 kB[0m [31m4.9 MB/s[0m eta [36m0:00:00[0m
Collecting nvidia-cuda-nvrtc-cu12==12.1.105 (from torch>=1.10.0->accelerate)
  Using cached nvidia_cuda_nvrtc_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (23.7 MB)
Collecting nvidia-cuda-runtime-cu12==12.1.105 (from torch>=1.10.0->accelerate)
  Using cached nvidia_cuda_runtime_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (823 kB)
Collecting nvidia-cuda-cupti-cu12==12.1.105 (from torch>=1.10.0->accelerate)
  Using cached nvidia_cuda_cupti_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (14.1 MB)
Collecting nvidia-cudnn-cu12==8.9.2.26 (from torch>=1.10.0->accelerate)
  Using cached nvidia_cudnn_cu12-8.9.2.26-py3-none-manylinux1_x86_64.whl (731.7 MB)
Collecting nvidia-cublas-cu12==12.1.3.1 (from torch>=1.10.0->accelerate)
  Using cached nvidia_cublas_cu12-12.1.3.1-py3-none-manylinux1_x86_64.w

In [121]:
import pandas as pd
from transformers import DistilBertTokenizer, DistilBertForSequenceClassification, AutoTokenizer, AutoModelForSequenceClassification,  TrainingArguments, Trainer
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, classification_report
import torch
from tqdm import tqdm
from google.colab import drive
import accelerate
from datasets import load_metric

In [4]:
accelerate.__version__

'0.30.1'

In [5]:
drive.mount('/content/drive')

Mounted at /content/drive


In [6]:
print(torch.cuda.is_available())

True


## Data

In [7]:
dataset_path = 'path'

In [86]:
try:
    df = pd.read_csv(dataset_path)
except Exception as e:
    print("Error reading the file:", e)

Дропаем лишнюю колонку.

In [87]:
df = df.drop(columns=['Unnamed: 0'])

Уменьшаем размер датасета для ускорения тренировки (используем 10% данных). Несмотря на использование лишь 10% от всех данных и использование DistilBERT, требование заказчика с легкостью будет выполнено.


In [88]:
df_bert = df.sample(frac=0.1, random_state=42)
df_bert

Unnamed: 0,text,toxic
31015,"Sometime back, I just happened to log on to ww...",0
102832,"""\n\nThe latest edit is much better, don't mak...",0
67317,""" October 2007 (UTC)\n\nI would think you'd be...",0
81091,Thanks for the tip on the currency translation...,0
90091,I would argue that if content on the Con in co...,0
...,...,...
113390,"On the alternative education article, they did...",0
158273,Does anyone know if Sora Ltd. is a Nintendo se...,0
328,who cares? if i was blocked on myspace or puls...,0
86825,because you are a tweeting cunt flap!!!!!!!,1


Убеждаемся что оба класса присутствуют в итоговой выборке.

In [89]:
df_bert['toxic'].value_counts()

toxic
0    14313
1     1616
Name: count, dtype: int64

## DistilBERT

### Токенизация и сплиты

In [50]:
tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')

In [51]:
def tokenize_function(examples):
    return tokenizer(examples['text'], padding="max_length", truncation=True)

In [90]:
train_texts, temp_texts, train_labels, temp_labels = train_test_split(df_bert['text'], df_bert['toxic'], test_size=0.3, random_state=42)
val_texts, test_texts, val_labels, test_labels = train_test_split(temp_texts, temp_labels, test_size=0.5, random_state=42)

In [91]:
train_encodings = tokenizer(list(tqdm(train_texts, desc="Tokenizing train texts")), truncation=True, padding=True)
val_encodings = tokenizer(list(tqdm(val_texts, desc="Tokenizing val texts")), truncation=True, padding=True)
test_encodings = tokenizer(list(tqdm(test_texts, desc="Tokenizing test texts")), truncation=True, padding=True)

Tokenizing train texts: 100%|██████████| 11150/11150 [00:00<00:00, 1524331.47it/s]
Tokenizing val texts: 100%|██████████| 2389/2389 [00:00<00:00, 1020801.98it/s]
Tokenizing test texts: 100%|██████████| 2390/2390 [00:00<00:00, 1812072.77it/s]


### Pytorch dataset class

In [92]:
class ToxicCommentsDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx], dtype=torch.long)
        return item

    def __len__(self):
        return len(self.labels)

### Обучение модели

In [93]:
train_dataset = ToxicCommentsDataset(train_encodings, train_labels.tolist())
val_dataset = ToxicCommentsDataset(val_encodings, val_labels.tolist())
test_dataset = ToxicCommentsDataset(test_encodings, test_labels.tolist())

In [95]:
model = DistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased', num_labels=2)

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [96]:
device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
model.to(device)

DistilBertForSequenceClassification(
  (distilbert): DistilBertModel(
    (embeddings): Embeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (transformer): Transformer(
      (layer): ModuleList(
        (0-5): 6 x TransformerBlock(
          (attention): MultiHeadSelfAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): Linear(in_features=768, out_features=768, bias=True)
            (k_lin): Linear(in_features=768, out_features=768, bias=True)
            (v_lin): Linear(in_features=768, out_features=768, bias=True)
            (out_lin): Linear(in_features=768, out_features=768, bias=True)
          )
          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (ffn): FFN(
            (dropout): Dropout(p=0.1, inplace=False)
 

In [97]:
training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=3,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    warmup_steps=500,
    weight_decay=0.01,
    logging_dir='./logs',
    logging_steps=10,
    evaluation_strategy="epoch",
    gradient_accumulation_steps=4,
    fp16=True
)



In [98]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
)

In [99]:
trainer.train()

Epoch,Training Loss,Validation Loss
0,0.1291,0.114585
1,0.0604,0.120904
2,0.0637,0.132897


TrainOutput(global_step=522, training_loss=0.15856715296020454, metrics={'train_runtime': 611.3049, 'train_samples_per_second': 54.719, 'train_steps_per_second': 0.854, 'total_flos': 4424940984705024.0, 'train_loss': 0.15856715296020454, 'epoch': 2.995695839311334})

In [100]:
eval_results = trainer.evaluate()

print(f"Validation Loss: {eval_results['eval_loss']}")

Validation Loss: 0.13289690017700195


In [101]:
metric = load_metric("f1")
val_loader = torch.utils.data.DataLoader(val_dataset, batch_size=16)

model.eval()
predictions = []
references = []

  metric = load_metric("f1")
You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this metric from the next major release of `datasets`.


Downloading builder script:   0%|          | 0.00/2.32k [00:00<?, ?B/s]

In [102]:
with torch.no_grad():
    for batch in tqdm(val_loader, desc="Evaluating"):
        inputs = {k: v.to(device) for k, v in batch.items() if k != 'labels'}
        labels = batch['labels'].to(device)
        outputs = model(**inputs)
        preds = outputs.logits.argmax(dim=-1)
        predictions.extend(preds.cpu().numpy())
        references.extend(labels.cpu().numpy())

Evaluating: 100%|██████████| 150/150 [00:13<00:00, 10.81it/s]


In [103]:
f1_score = metric.compute(predictions=predictions, references=references)
print(f"Validation F1 Score: {f1_score['f1']}")

Validation F1 Score: 0.8242187499999999


## Logistic Regression

In [104]:
train_texts_lr, temp_texts_lr, train_labels_lr, temp_labels_lr = train_test_split(df['text'], df['toxic'], test_size=0.3, random_state=42)
val_texts_lr, test_texts_lr, val_labels_lr, test_labels_lr = train_test_split(temp_texts_lr, temp_labels_lr, test_size=0.5, random_state=42)

In [109]:
vectorizer = TfidfVectorizer(max_features=10000)
train_vectors = vectorizer.fit_transform(train_texts_lr)
val_vectors = vectorizer.transform(val_texts_lr)
test_vectors = vectorizer.transform(test_texts_lr)

In [111]:
model_lr = LogisticRegression(max_iter=1000)
model_lr.fit(train_vectors, train_labels_lr)

In [113]:
val_preds = model_lr.predict(val_vectors)

In [122]:
val_f1 = f1_score(val_labels_lr, val_preds)

In [123]:
print(f'Validation F1 Score: {f1}')

Validation F1 Score: 0.7436041083099907


## Оценка лучшей модели на тестовой выборке

Лучшие результаты показывает distilBERT

In [126]:
metric = load_metric("f1")
test_loader = torch.utils.data.DataLoader(test_dataset, batch_size=16)

model.eval()
predictions = []
references = []

You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this metric from the next major release of `datasets`.


In [127]:
with torch.no_grad():
    for batch in tqdm(val_loader, desc="Evaluating"):
        inputs = {k: v.to(device) for k, v in batch.items() if k != 'labels'}
        labels = batch['labels'].to(device)
        outputs = model(**inputs)
        preds = outputs.logits.argmax(dim=-1)
        predictions.extend(preds.cpu().numpy())
        references.extend(labels.cpu().numpy())

Evaluating: 100%|██████████| 150/150 [00:15<00:00,  9.96it/s]


In [128]:
f1_score = metric.compute(predictions=predictions, references=references)
print(f"Test F1 Score: {f1_score['f1']}")

Test F1 Score: 0.818565400843882


### Сохраняем обученную модель и токенайзер

In [129]:
save_path = '/content/drive/MyDrive/models'

In [130]:
model.save_pretrained(save_path)
tokenizer.save_pretrained(save_path)

('/content/drive/MyDrive/models/tokenizer_config.json',
 '/content/drive/MyDrive/models/special_tokens_map.json',
 '/content/drive/MyDrive/models/vocab.txt',
 '/content/drive/MyDrive/models/added_tokens.json')

### Инференс

In [131]:
new_comments = ["Shit product", "Bad product", "Awesome product"]
inputs = tokenizer(new_comments, return_tensors='pt', padding=True, truncation=True)
inputs = {k: v.to(device) for k, v in inputs.items()}
outputs = model(**inputs)
predictions = outputs.logits.argmax(dim=-1)

In [132]:
predictions = outputs.logits.argmax(dim=-1)

print(predictions)

tensor([1, 0, 0], device='cuda:0')


# Вывод

В ходе проделанной работы была разработана и протестирована система классификации токсичных комментариев с использованием DistilBERT. Несмотря на использование этой малой модели и использование лишь 10% от всего датасета, был достигнут результат F1=0.81, что превышает требование заказчика. Модель была сохранена и готова к использованию.

Дальнейшие шаги:
* При наличии более мощного железа, можно использовать обычный BERT и все данные, что, я уверен, значительно увеличит точность модели.
* При тех же условиях, можно провести более тонкий тюнинг модели
* Можно протестировать другие архитектуры

Основным элементом этой работы я бы обозначил факт, что использование моделей трансформеров позволяет эффективно решать задачи классификации текста, даже на крайне ограниченных вычислительных ресурсах, что было немыслимо лишь несколько лет назад.