# Comparing Toxic Texts with Transformers - Regression
In this notebook, a pre-trained language model (DistilBERT) is fine-tuned to take a single input text and assign it a toxicity score: the higher the score, the more toxic the text.

## Installing Required Dependencies

In [None]:
%pip install transformers[torch] datasets evaluate scikit-learn pandas numpy matplotlib accelerate > /dev/null

In [None]:
!rm -rf results_regression logs_regression sample_data

In [None]:
%load_ext tensorboard

In [1]:
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    TrainingArguments,
    Trainer,
    DataCollatorWithPadding
)

from datasets import (
    Dataset,
    load_dataset,
    disable_caching
)

from sklearn.metrics import accuracy_score

import evaluate
import numpy as np
import pandas as pd
import torch

## Load the Model and Tokenizer

In [2]:
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
print(f"Device is {device}")

# Load the pre-trained model and tokenizer
model_name = 'distilbert-base-uncased'
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=1)  # num_labels=1 selects regression mode
model = model.to(device)

Device is cuda:0


Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_layer_norm.weight', 'vocab_projector.bias', 'vocab_transform.weight', 'vocab_layer_norm.bias', 'vocab_transform.bias']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.weight', 'pre_classifier.weight', 'classifier.bias', 'pre_classifier.
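
With `num_labels=1`, the sequence classification head outputs a single number per text, and Transformers then trains it as a regression problem, which to the best of our knowledge means a mean squared error loss. The sketch below only illustrates that loss in plain Python; `mse_loss` and the example scores are hypothetical and are not part of the notebook or the library.

```python
def mse_loss(predictions, targets):
    """Mean squared error between two equal-length sequences of floats."""
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    squared_errors = [(p - t) ** 2 for p, t in zip(predictions, targets)]
    return sum(squared_errors) / len(squared_errors)


# Hypothetical values: one predicted toxicity score per text and its gold score.
predicted = [0.10, 0.80, 0.35]
gold = [0.00, 1.00, 0.25]
print(f"MSE: {mse_loss(predicted, gold):.4f}")
```

Because the squared error punishes large misses more than small ones, the training loss and the MAE reported during evaluation are related but not interchangeable.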

## Loading and Tokenizing the Datasets

In [3]:
# Disable the datasets cache so that stale cached results are never reused
disable_caching()

# Load the datasets
data_files = {
    "train": "train.csv",
    "test": "test.csv",
    "val": "val.csv",
    "kaggle_val": "kaggle_val_seq.csv"
}

datasets = load_dataset('csv', data_files=data_files)

# Shuffle each dataset except 'kaggle_val'
for split in datasets.keys():
    if split != 'kaggle_val':
        datasets[split] = datasets[split].shuffle(seed=42)

# flatten_indices on each split separately
for split in datasets.keys():
    datasets[split] = datasets[split].flatten_indices()

# Tokenize the dataset
def tokenize_function(example):
    # Use the model's maximum sequence length (512 positions for DistilBERT)
    max_length = model.config.max_position_embeddings

    # Tokenize the text and truncate it to the maximum length;
    # padding is left to the data collator at batch time
    tokenized_text = tokenizer(example['text'], truncation=True, max_length=max_length, add_special_tokens=True)

    return tokenized_text

# Tokenize the datasets one example at a time (batched=False)
tokenized_datasets = datasets.map(tokenize_function, batched=False)

Downloading and preparing dataset csv/default to /home/disi/.cache/huggingface/datasets/csv/default-046fb53f93a5a0d7/0.0.0/eea64c71ca8b46dd3f537ed218fc9bf495d5707789152eb2764f5c78fa66d59d...


Dataset csv downloaded and prepared to /home/disi/.cache/huggingface/datasets/csv/default-046fb53f93a5a0d7/0.0.0/eea64c71ca8b46dd3f537ed218fc9bf495d5707789152eb2764f5c78fa66d59d. Subsequent calls will reuse this data.


Flattening the indices:   0%|          | 0/3433 [00:00<?, ? examples/s]

Flattening the indices:   0%|          | 0/1145 [00:00<?, ? examples/s]

Flattening the indices:   0%|          | 0/1144 [00:00<?, ? examples/s]

Flattening the indices:   0%|          | 0/60216 [00:00<?, ? examples/s]

Map:   0%|          | 0/3433 [00:00<?, ? examples/s]

Map:   0%|          | 0/1145 [00:00<?, ? examples/s]

Map:   0%|          | 0/1144 [00:00<?, ? examples/s]

Map:   0%|          | 0/60216 [00:00<?, ? examples/s]
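
The tokenization step above truncates long texts but does not pad short ones. Padding is deferred to batch time, where `DataCollatorWithPadding` (used in the next section) pads each batch only up to its own longest sequence. The following is a minimal sketch of that idea in plain Python, not the library's implementation; `pad_batch`, the `pad_id` default, and the token ids are assumptions made for illustration.

```python
def pad_batch(sequences, pad_id=0):
    """Right-pad lists of token ids to the longest one and build attention masks."""
    longest = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (longest - len(seq)) for seq in sequences]
    # 1 marks a real token, 0 marks padding the model should ignore.
    attention_mask = [[1] * len(seq) + [0] * (longest - len(seq)) for seq in sequences]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Hypothetical token ids for two texts of different lengths.
batch = pad_batch([[101, 7592, 102], [101, 2023, 2003, 1037, 102]])
print(batch["input_ids"])
print(batch["attention_mask"])
```

Padding per batch rather than to the global maximum length keeps batches of short texts small, which usually saves memory and compute.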

## Training the Model

In [4]:
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

# define training args
training_args = TrainingArguments(
    output_dir='./results_regression',
    num_train_epochs=5,
    load_best_model_at_end=True,
    metric_for_best_model='mae',
    evaluation_strategy="steps",
    eval_steps=500,
    save_steps=500,
    logging_dir='./logs_regression',
    logging_strategy="steps",
    logging_steps=500,
    save_total_limit=2
)

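# create a function to compute metrics
def compute_metrics(preds, metric_name="mae"):
    metric = evaluate.load(metric_name)
    predictions, labels = preds
    result = metric.compute(predictions=predictions, references=labels)
    # Report |1 - MAE| instead of the raw MAE so that a larger value is better,
    # which is the direction `metric_for_best_model` is assumed to maximize here
    result[metric_name] = abs(1 - result[metric_name])
    return result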

# create a trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets['train'],
    eval_dataset=tokenized_datasets['val'],
    compute_metrics=compute_metrics,
    data_collator=data_collator
)

# train the model
trainer.train()

You're using a DistilBertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


Step,Training Loss,Validation Loss,Mae
500,0.0132,0.007104,0.933951
1000,0.0051,0.00712,0.933496
1500,0.003,0.00716,0.933794
2000,0.0018,0.006944,0.934796


TrainOutput(global_step=2150, training_loss=0.005482908362566039, metrics={'train_runtime': 102.211, 'train_samples_per_second': 167.937, 'train_steps_per_second': 21.035, 'total_flos': 494058623166654.0, 'train_loss': 0.005482908362566039, 'epoch': 5.0})
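
Note that the `Mae` column in the training log above is not the raw mean absolute error: `compute_metrics` stores `abs(1 - MAE)`, presumably so that a higher value means a better checkpoint. The sketch below shows the transformation and how to invert it, using plain Python; the helper names are hypothetical, and the inversion is only valid under the assumption that the true MAE does not exceed 1.

```python
def mean_absolute_error(predictions, references):
    """Average absolute difference between predictions and references."""
    return sum(abs(p - r) for p, r in zip(predictions, references)) / len(references)


def selection_score(mae):
    """Value logged as 'Mae' in this notebook: |1 - MAE|, so larger is better."""
    return abs(1 - mae)


def recover_mae(score):
    """Invert selection_score; assumes the true MAE is at most 1."""
    return abs(1 - score)


# Logged value at step 2000 in the table above.
logged = 0.934796
print(f"Validation MAE at step 2000: {recover_mae(logged):.6f}")
```

An alternative design would be to return the raw MAE and tell the trainer that lower is better, which avoids the need to invert the value when reporting.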

## Make Zip Files to Export

In [None]:
!zip -r logs_regression.zip logs_regression
!zip -r results_regression.zip results_regression

## Reporting MAE on the Test Set

In [11]:
metrics_test = trainer.evaluate(eval_dataset=tokenized_datasets['test'])
# `eval_mae` holds |1 - MAE| (see compute_metrics), so invert it to get the actual MAE
print(f"Test set MAE: {abs(1 - metrics_test['eval_mae'])}")

Test set MAE: 0.06218572874516903


## Reporting Accuracy on Kaggle Validation Set

In [8]:
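preds = trainer.predict(tokenized_datasets['kaggle_val'])
predictions = preds.predictions.squeeze(-1)  # shape (N, 1) -> (N,)

# Consecutive rows form a pair: predict 0 if the first text
# is scored as more toxic than the second, otherwise 1
predictions_pair = []
for i in range(0, len(predictions), 2):
    if predictions[i] > predictions[i+1]:
        predictions_pair.append(0)
    else:
        predictions_pair.append(1)

# Load the gold pair labels and compare them with the predicted ones
df = pd.read_csv("kaggle_val_pair.csv")
label_pair = df["labels"].to_list()

accuracy = accuracy_score(label_pair, predictions_pair)
print(f"Kaggle val set Accuracy: {accuracy}")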

Kaggle val set Accuracy: 0.6736415570612462
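
The pairwise evaluation above can be summarized as follows: the regression model scores every text independently, consecutive scores are grouped into pairs, and each pair is labeled by which text received the higher score. A minimal sketch in plain Python, with made-up scores and labels, and assuming (as the loop above does) that label 0 means the first text of a pair is the more toxic one:

```python
def pairwise_labels(scores):
    """Turn a flat list of scores into one label per consecutive pair."""
    if len(scores) % 2 != 0:
        raise ValueError("scores must contain an even number of items")
    # 0 if the first text of the pair scores strictly higher, otherwise 1.
    return [0 if scores[i] > scores[i + 1] else 1 for i in range(0, len(scores), 2)]


def accuracy(gold, predicted):
    """Fraction of positions where gold and predicted labels agree."""
    return sum(g == p for g, p in zip(gold, predicted)) / len(gold)


# Hypothetical scores for three pairs of texts and their gold labels.
scores = [0.9, 0.1, 0.2, 0.7, 0.5, 0.5]
gold = [0, 1, 0]
predicted = pairwise_labels(scores)
print(predicted, accuracy(gold, predicted))
```

Ties fall to label 1 under this rule, which matches the `else` branch of the loop in the cell above.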
