# Comparing Toxic Texts with Transformers
In this notebook, a language model such as DistilBERT is fine-tuned to take two input texts and predict which one is more offensive, rude, or toxic.

## Installing required dependencies

In [None]:
%pip install transformers datasets evaluate pandas numpy matplotlib accelerate scikit-learn tensorboard > /dev/null

In [None]:
!rm -rf results_pair logs_pair sample_data

In [None]:
%load_ext tensorboard

In [1]:
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    TrainingArguments,
    Trainer,
    DataCollatorWithPadding
)

from datasets import (
    Dataset,
    load_dataset,
    disable_caching
)

import evaluate
import numpy as np
import pandas as pd
import torch

## Loading the Model and Tokenizer

In [2]:
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
print(f"Device is {device}")

# Load the pre-trained model and tokenizer
model_name = 'distilbert-base-uncased'
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)
model = model.to(device)

Device is cuda:0



Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_transform.weight', 'vocab_layer_norm.bias', 'vocab_projector.bias', 'vocab_layer_norm.weight', 'vocab_transform.bias']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['pre_classifier.bias', 'classifier.bias', 'pre_classifier.weight', 'classifier.we

## Loading and Tokenizing the Datasets

In [3]:
# Disable datasets caching so every run re-processes the data from scratch
disable_caching()

# Load the datasets
data_files = {
    "train": "train_pair.csv",
    "test": "test_pair.csv",
    "val": "val_pair.csv",
    "kaggle_val": "kaggle_val_pair.csv"
}

datasets = load_dataset('csv', data_files=data_files)

# Shuffle the dataset
datasets = datasets.shuffle(seed=42).flatten_indices()

# Tokenize one example (the map below runs with batched=False)
def tokenize_function(example):
    # Get the maximum length from the model configuration
    max_length = model.config.max_position_embeddings

    # Tokenize each text separately and truncate to half the maximum length
    tokenized_text1 = tokenizer(example['text1'], truncation=True, max_length=max_length // 2, add_special_tokens=True)
    tokenized_text2 = tokenizer(example['text2'], truncation=True, max_length=max_length // 2, add_special_tokens=True)

    # Merge the results into one [CLS] text1 [SEP] text2 [SEP] sequence
    tokenized_inputs = {
        'input_ids': tokenized_text1['input_ids'] + tokenized_text2['input_ids'][1:],  # exclude the [CLS] token from the second sequence
        'attention_mask': tokenized_text1['attention_mask'] + tokenized_text2['attention_mask'][1:]
    }
    return tokenized_inputs

# Tokenize the datasets
tokenized_datasets = datasets.map(tokenize_function, batched=False)

Split sizes reported while loading, flattening, and mapping: train 17165, test 5725, val 5720, kaggle_val 30108 examples.
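
The merge in `tokenize_function` produces a single sequence laid out as `[CLS] text1 [SEP] text2 [SEP]`. Because each half is truncated to 256 ids and the second `[CLS]` is dropped, the result never exceeds 511 ids. The sketch below illustrates that layout with a toy stand-in for the tokenizer: `toy_encode` and the word ids are hypothetical, and only 101 and 102 mirror the `[CLS]` and `[SEP]` ids of the BERT uncased vocabulary.

```python
CLS_ID, SEP_ID = 101, 102  # [CLS] / [SEP] ids in the BERT uncased vocabulary


def toy_encode(words, max_length):
    """Hypothetical stand-in for tokenizer(...): wrap made-up word ids in
    [CLS] ... [SEP] and truncate so the total never exceeds max_length."""
    body = [1000 + len(w) for w in words][: max_length - 2]  # made-up ids
    ids = [CLS_ID] + body + [SEP_ID]
    return {"input_ids": ids, "attention_mask": [1] * len(ids)}


def merge_pair(enc1, enc2):
    """Same merge as tokenize_function: drop the second sequence's [CLS]."""
    return {
        "input_ids": enc1["input_ids"] + enc2["input_ids"][1:],
        "attention_mask": enc1["attention_mask"] + enc2["attention_mask"][1:],
    }


half = 512 // 2  # max_position_embeddings of distilbert-base-uncased is 512

short = merge_pair(
    toy_encode(["you", "are", "nice"], half),
    toy_encode(["go", "away"], half),
)
long_pair = merge_pair(
    toy_encode(["w"] * 1000, half),
    toy_encode(["w"] * 1000, half),
)

print(short["input_ids"])           # [101, 1003, 1003, 1004, 102, 1002, 1004, 102]
print(len(long_pair["input_ids"]))  # 511
```

DistilBERT has no token-type embeddings, so omitting `token_type_ids` is harmless here. The tokenizer can also encode a pair directly with `tokenizer(text1, text2, truncation=True)`, which trims the longer side first instead of budgeting each half separately; whether that suits this task better is untested in this notebook.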

## Training the Model

In [4]:
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

# define training args
training_args = TrainingArguments(
    output_dir='./results_pair',
    num_train_epochs=5,
    load_best_model_at_end=True,
    metric_for_best_model='accuracy',
    evaluation_strategy="steps",
    eval_steps=500,
    logging_dir='./logs_pair',
    logging_strategy="steps",
    logging_steps=500,
    save_total_limit=2
)

# Load the metric once instead of on every evaluation call
accuracy_metric = evaluate.load("accuracy")

# Compute accuracy from the (logits, labels) pair that the Trainer passes in
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return accuracy_metric.compute(predictions=predictions, references=labels)

# create a trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets['train'],
    eval_dataset=tokenized_datasets['val'],
    compute_metrics=compute_metrics,
    data_collator=data_collator
)

# train the model
trainer.train()

You're using a DistilBertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


| Step | Training Loss | Validation Loss | Accuracy |
|------|---------------|-----------------|----------|
| 500  | 0.6844 | 0.652768 | 0.637937 |
| 1000 | 0.5931 | 0.615458 | 0.7      |
| 1500 | 0.5158 | 0.521679 | 0.756119 |
| 2000 | 0.4205 | 0.566949 | 0.780245 |
| 2500 | 0.3562 | 0.701791 | 0.78042  |
| 3000 | 0.3258 | 0.574124 | 0.77972  |
| 3500 | 0.3072 | 0.676944 | 0.772203 |
| 4000 | 0.2997 | 0.715318 | 0.777098 |
| 4500 | 0.2528 | 1.044132 | 0.778846 |
| 5000 | 0.1867 | 1.025795 | 0.779371 |
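
With `load_best_model_at_end=True` and `metric_for_best_model='accuracy'`, the Trainer reloads the checkpoint with the highest validation accuracy once training finishes. The sketch below applies the same selection rule to the rows logged above. It covers only the steps shown (up to 5000), while training ran to step 10730, so the checkpoint actually restored may differ.

```python
# (step, training_loss, validation_loss, accuracy) rows copied from the log above
log_rows = [
    (500, 0.6844, 0.652768, 0.637937),
    (1000, 0.5931, 0.615458, 0.7),
    (1500, 0.5158, 0.521679, 0.756119),
    (2000, 0.4205, 0.566949, 0.780245),
    (2500, 0.3562, 0.701791, 0.78042),
    (3000, 0.3258, 0.574124, 0.77972),
    (3500, 0.3072, 0.676944, 0.772203),
    (4000, 0.2997, 0.715318, 0.777098),
    (4500, 0.2528, 1.044132, 0.778846),
    (5000, 0.1867, 1.025795, 0.779371),
]

# Higher accuracy is better; lower validation loss is better.
best_by_accuracy = max(log_rows, key=lambda row: row[3])
best_by_val_loss = min(log_rows, key=lambda row: row[2])

print(f"Best accuracy: step {best_by_accuracy[0]} ({best_by_accuracy[3]})")
print(f"Lowest validation loss: step {best_by_val_loss[0]} ({best_by_val_loss[2]})")
```

Among the visible rows, accuracy peaks at step 2500 while validation loss is lowest at step 1500 and roughly doubles by step 4500, even though training loss keeps falling. That pattern suggests the model starts to overfit after the first epoch or so, with accuracy plateauing near 0.78.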



TrainOutput(global_step=10730, training_loss=0.23204719786986167, metrics={'train_runtime': 852.817, 'train_samples_per_second': 100.637, 'train_steps_per_second': 12.582, 'total_flos': 3964204107381564.0, 'train_loss': 0.23204719786986167, 'epoch': 5.0})
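
The `global_step=10730` in the output above can be cross-checked against the training setup. The sketch assumes a batch size of 8 (the `TrainingArguments` default for `per_device_train_batch_size`, which this notebook does not override), a single GPU, and no gradient accumulation; the 17165 training examples come from the dataset-loading output.

```python
import math

train_examples = 17165  # size of the train split, from the loading output
batch_size = 8          # assumed: TrainingArguments default, one GPU
epochs = 5              # num_train_epochs
eval_steps = 500        # eval_steps in TrainingArguments

steps_per_epoch = math.ceil(train_examples / batch_size)
total_steps = steps_per_epoch * epochs
evaluations = total_steps // eval_steps

print(steps_per_epoch, total_steps, evaluations)  # 2146 10730 21
```

Under these assumptions the run performs 21 evaluations, so the ten logged rows (steps 500 to 5000) cover a little less than the first half of training.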

## Make Zip Files to Export

In [None]:
!zip -r logs_pair.zip logs_pair
!zip -r results_pair.zip results_pair

## Reporting the Accuracy on the Test Set

In [6]:
metrics_test = trainer.evaluate(eval_dataset=tokenized_datasets['test'])
print(f"Test set accuracy: {metrics_test['eval_accuracy']}")

Test set accuracy: 0.7917903930131004
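
The `eval_accuracy` value is the fraction of pairs for which the arg-max over the two logits matches the label. The dependency-free sketch below mirrors what `compute_metrics` does with `np.argmax(logits, axis=-1)`; `accuracy_from_logits` is a hypothetical helper and the logits are made up for illustration.

```python
def accuracy_from_logits(logits, labels):
    """Fraction of rows whose highest-scoring class equals the label."""
    predictions = [max(range(len(row)), key=row.__getitem__) for row in logits]
    correct = sum(int(pred == label) for pred, label in zip(predictions, labels))
    return correct / len(labels)


toy_logits = [[2.0, -1.0], [0.3, 0.9], [-0.5, 0.1], [1.2, 1.1]]  # made-up scores
toy_labels = [0, 1, 0, 0]

print(accuracy_from_logits(toy_logits, toy_labels))  # 0.75
```

The reported test accuracy is consistent with 4533 correct predictions out of the 5725 test pairs. That count is inferred from the printed value rather than produced by the notebook.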


## Reporting the Accuracy on the Kaggle Validation Set

In [5]:
metrics_kaggle_val = trainer.evaluate(eval_dataset=tokenized_datasets['kaggle_val'])
print(f"Kaggle val set accuracy: {metrics_kaggle_val['eval_accuracy']}")

Kaggle val set accuracy: 0.6507240600504849


## Launch TensorBoard to See the Logs

In [None]:
# If this is the second time tensorboard is running, we need to kill it first
# to release the port.
!kill $(ps -e | grep 'tensorboard' | awk '{print $1}')
%tensorboard --logdir logs_pair/