# Setup

I begin by installing the libraries required to train a model with Hugging Face.

I also instruct PyTorch to use the GPU when one is available, and configure the CUDA caching allocator to reduce memory fragmentation.

In [1]:
!pip install transformers evaluate accelerate
import os
import torch
from datasets import load_dataset, DatasetDict

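# Configure the CUDA caching allocator in an attempt to prevent CUDA out of memory errors.
# max_split_size_mb stops the allocator from splitting cached blocks larger than this size (in MB),
# which can reduce fragmentation of the reserved (cached, not necessarily in-use) memory.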
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:1024"


device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")



# Data processing

### Dataset location
Next, I need to create a dataset.

To begin, I need to initialise the dataset location, which differs depending on where I run the notebook.

In [2]:
dataset_location = "scoring_dataset.jsonl"

### Initialise dataset

First, the labels need to be converted to a numerical equivalent, as the model expects numerical class labels.

Recall, however, that the labels are "negative", "neutral" or "positive".

I will assign these to 0, 1 and 2 respectively.

In [3]:
dataset = load_dataset("json", data_files=dataset_location,split='train')

# Map each rating string to its numerical class id
RATING_TO_ID = {"negative": 0, "neutral": 1, "positive": 2}

def convert_to_int(example):
    # Avoids shadowing the built-ins `input` and `type`;
    # an unexpected rating raises a KeyError instead of silently returning None
    example["rating"] = RATING_TO_ID[example["rating"]]
    return example

dataset = dataset.map(convert_to_int)

For classification using a base model, the dataset columns need to be renamed.

The training code expects the input column to be named "text" and the target column to be named "labels".

Currently, they are "summary" for the produced summary and "rating" for the given rating (positive, negative or neutral).

In [4]:
dataset = dataset.rename_column("summary","text")
dataset = dataset.rename_column("rating","labels")

To split the dataset I use the Hugging Face "datasets" library.

I split the dataset into three sets:
- Training set - The data shown to the model during training
- Validation set - The data used to evaluate the model at the end of each epoch during training; it is not used in the backward pass
- Test set - Reserved strictly for after the model is trained, used to evaluate the model on a completely unseen set

However, the "datasets" library doesn't offer a way to split into three sets directly, so I use a workaround sourced from [this Stack Overflow post](https://stackoverflow.com/questions/76001128/splitting-dataset-into-train-test-and-validation-using-huggingface-datasets-fun).

It works by first splitting the dataset into a train set (80%) and a held-out set (20%).

It then splits this held-out set into two halves, resulting in a validation set and a test set of 10% each.

A final dataset consisting of a train, test and validation set is then built from these splits.

In [5]:

dataset = dataset.shuffle(seed=2424)

test_valid_split_dataset = dataset.train_test_split(test_size=0.2, shuffle=False)

test_split = test_valid_split_dataset['test'].train_test_split(test_size=0.5, shuffle = False)

dataset = DatasetDict({
    'train': test_valid_split_dataset['train'],
    'test': test_split['test'],
    'valid': test_split['train']})

print(dataset)

DatasetDict({
    train: Dataset({
        features: ['text', 'labels'],
        num_rows: 867
    })
    test: Dataset({
        features: ['text', 'labels'],
        num_rows: 109
    })
    valid: Dataset({
        features: ['text', 'labels'],
        num_rows: 108
    })
})
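
The split sizes printed above can be checked by hand. The sketch below is a minimal illustration, assuming the full dataset has 1084 rows (867 + 109 + 108) and that each held-out portion is rounded up; `three_way_split_sizes` is a hypothetical helper, not part of the "datasets" library.

```python
import math

def three_way_split_sizes(n_rows, holdout_fraction=0.2, test_fraction=0.5):
    """Return (train, valid, test) sizes for the two-stage split described above."""
    # Stage 1: hold out a fraction of the rows (assumed to be rounded up)
    holdout = math.ceil(n_rows * holdout_fraction)
    train = n_rows - holdout
    # Stage 2: split the held-out rows into test (rounded up) and validation
    test = math.ceil(holdout * test_fraction)
    valid = holdout - test
    return train, valid, test

print(three_way_split_sizes(1084))
```

Under these assumptions the sizes come out as 867, 108 and 109, matching the `DatasetDict` shown above.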


# Dataset review

Next, I want to check that the dataset has been loaded correctly, and to view some properties of the data.

We can see the first example in the test set is a "neutral" summary, meaning it doesn't have a particular negative or positive stance towards user privacy.

In [6]:
dataset["test"][0]

{'text': 'This service is only available to users over a certain age. You can opt out of targeted advertising. Your IP address is collected, which can be used to view your approximate location. The service claims to be CCPA compliant for California users. Do Not Track (DNT) headers are ignored and you are tracked anyway even if you set this header.. You can opt out of promotional communications',
 'labels': 1}

### Category sizes

Next, it is useful to see the number of items in each category ("positive", "neutral" or "negative") in the training set.

The dataset leans towards the negative side, with 54% of items being negative. This is a result of the source of the data (ToS;DR) as well as the state of privacy policies and terms and conditions generally.

In [7]:
dataset_length = len(dataset["train"])
number_of_negative = len(dataset["train"].filter(lambda x: x["labels"] == 0))
number_of_neutral = len(dataset["train"].filter(lambda x: x["labels"] == 1))
number_of_positive = len(dataset["train"].filter(lambda x: x["labels"] == 2))

# Format each share as a percentage of the training set
print(f'The train dataset has {dataset_length} items with: \n'
      f'{number_of_negative} negative items ({number_of_negative/dataset_length:.0%})\n'
      f'{number_of_neutral} neutral items ({number_of_neutral/dataset_length:.0%})\n'
      f'{number_of_positive} positive items ({number_of_positive/dataset_length:.0%}) \n')

The train dataset has 867 items with: 
467 negative items (54%)
239 neutral items (28%)
161 positive items (19%) 
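
The same counts can also be gathered in a single pass rather than with three separate filters. This is a minimal sketch, assuming the labels are available as a plain Python list (for example via `list(dataset["train"]["labels"])`). A stand-in list built from the class counts reported above is used here, and `label_shares` is a hypothetical helper.

```python
from collections import Counter

def label_shares(labels):
    """Return {label: (count, share_of_total)} for a list of class ids."""
    counts = Counter(labels)
    total = len(labels)
    return {label: (count, count / total) for label, count in sorted(counts.items())}

# Stand-in for the training labels, built from the counts reported above
labels = [0] * 467 + [1] * 239 + [2] * 161

names = {0: "negative", 1: "neutral", 2: "positive"}
shares = label_shares(labels)
for label, (count, share) in shares.items():
    print(f"{count} {names[label]} items ({share:.0%})")
```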



Next, as NLP models accept only numerical inputs, a tokeniser is needed.

For training this model, I will be using transfer learning.

This takes the base of a model trained on some other task but in a similar domain (e.g. summarising books), removes the head of the model which is more specialised (e.g. contains information specific to books), while retaining useful information about the English language. The model is then trained on a new specific task, in my case rating summaries of terms and conditions or privacy policies, utilising its pre-existing knowledge of the English language.

This significantly reduces the training time and resources required, so that I can stay within the final year project deadlines.

The base model I will use is the x ...

Next, I will define the tokeniser. The Hugging Face library is used to infer the tokeniser from the model.

In [8]:
from transformers import AutoTokenizer

tokeniser = AutoTokenizer.from_pretrained("distilbert-base-uncased")



Next, I will need to define a function which will apply the tokeniser to the dataset.

Only the summary needs to be tokenised, as the labels are already numerical.

It will also need to truncate anything exceeding the maximum token count of this model.

In [9]:
def tokenise_dataset(item):
    return tokeniser(item["text"],truncation = True)

# Passing batched=True allows `map` to process more than one item at a time
tokenised_dataset = dataset.map(tokenise_dataset, batched=True)


Next, a data collator needs to be defined.

(The information below is sourced from: https://huggingface.co/docs/transformers/main_classes/data_collator)

This is responsible for constructing batches and applying pre-processing such as padding to ensure all inputs are of the same size.

The Hugging Face library provides a function for sourcing a data collator with padding.

In [10]:
from transformers import DataCollatorWithPadding

data_collator = DataCollatorWithPadding(tokenizer=tokeniser)
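
To make the idea of padding concrete, the sketch below pads a batch of token id lists to the length of the longest one and builds the matching attention mask. It only illustrates the concept in plain Python and is not the implementation used by `DataCollatorWithPadding`; `pad_batch`, the pad id of 0 and the token ids are assumptions made for the example.

```python
def pad_batch(batch, pad_id=0):
    """Pad every sequence in the batch to the length of the longest sequence."""
    max_len = max(len(seq) for seq in batch)
    input_ids, attention_mask = [], []
    for seq in batch:
        padding = max_len - len(seq)
        input_ids.append(seq + [pad_id] * padding)
        # 1 marks a real token, 0 marks padding the model should ignore
        attention_mask.append([1] * len(seq) + [0] * padding)
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Token ids below are made up for illustration
batch = pad_batch([[101, 2023, 102], [101, 2023, 2326, 2003, 102]])
print(batch["input_ids"])
print(batch["attention_mask"])
```

Padding per batch rather than to a fixed global length keeps short batches small, which is the main appeal of doing this step in a collator.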

# Evaluation

Next, the evaluation technique needs to be defined, for use during training.

For the model's evaluation loop I define:
 - Accuracy
 - Precision
 - Recall

### Accuracy
Accuracy is simply the number of correct predictions over the total number of predictions:
$$\frac{\text{correct predictions}}{\text{total predictions}}$$

### TP, FP and FN
To calculate precision and recall, TP, FP and FN first need to be defined.

With more than two categories, these are counted separately for each category.

#### True Positive
A correct prediction for a category.

e.g. the model predicts "positive" (class 2) and the correct answer is "positive" (class 2).

#### False Positive
An incorrect prediction of a category.

e.g. if the model predicts "positive" (class 2) when the correct answer is "negative" (class 0), this is a false positive for class 2.

#### False Negative
Failure to predict a category when it should have been predicted.

e.g. if the model predicts "positive" (class 2) when the correct answer is "negative" (class 0), this is a false negative for class 0.


### Precision
Precision is simply true positives divided by the sum of true positives and false positives:

$$\frac{tp}{tp+fp}$$

### Recall
Recall is simply true positives divided by the sum of true positives and false negatives:

$$\frac{tp}{tp+fn}$$

### Average precision, average recall
Average precision and average recall are also calculated as macro-averages: the per-class precisions (or recalls) are summed and divided by 3, the number of classes (positive, negative, neutral).
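
As a worked example of the definitions above, the sketch below counts TP, FP and FN per class for six made-up predictions and derives the macro-averaged precision and recall. The data and the helper name `per_class_counts` are hypothetical and only serve to illustrate the definitions.

```python
def per_class_counts(predictions, labels, num_classes=3):
    """Count true positives, false positives and false negatives per class."""
    counts = {c: {"TP": 0, "FP": 0, "FN": 0} for c in range(num_classes)}
    for prediction, label in zip(predictions, labels):
        if prediction == label:
            counts[label]["TP"] += 1
        else:
            counts[prediction]["FP"] += 1  # the predicted class was wrong
            counts[label]["FN"] += 1       # the true class was missed
    return counts

# Made-up data: 0 = negative, 1 = neutral, 2 = positive
labels      = [0, 0, 1, 2, 2, 1]
predictions = [0, 2, 1, 2, 0, 1]

counts = per_class_counts(predictions, labels)
precisions = [c["TP"] / (c["TP"] + c["FP"]) for c in counts.values()]
recalls = [c["TP"] / (c["TP"] + c["FN"]) for c in counts.values()]

print(counts)
print(sum(precisions) / len(precisions))  # macro-averaged precision
print(sum(recalls) / len(recalls))        # macro-averaged recall
```

Note that a single wrong prediction contributes twice: once as a false positive for the predicted class and once as a false negative for the true class.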




In [11]:
import numpy as np

def find_accuracy(correct, total):
    # Avoid division by zero, return 0 if total is zero
    return correct / total if total != 0 else 0

def find_precision(TP, FP):
    # Avoid division by zero, return 0 if the denominator is zero
    return TP / (TP + FP) if (TP + FP) != 0 else 0

def find_recall(TP, FN):
    # Avoid division by zero, return 0 if the denominator is zero
    return TP / (TP + FN) if (TP + FN) != 0 else 0




In [12]:

def evaluation_loop(preds):
    predictions, labels = preds # extract predictions and labels from input.


    # Each prediction is a list of logits, one per class.
    # Only the most confident class is of interest, so keep the index of the highest logit.
    predictions = np.argmax(predictions,axis=1)
    metrics = {
        "num_correct": 0,

        "TP_neg": 0,
        "TP_mid": 0,
        "TP_pos": 0,

        "FP_neg": 0,
        "FP_mid": 0,
        "FP_pos": 0,

        "FN_neg": 0,
        "FN_mid": 0,
        "FN_pos": 0,
    }

    for prediction,label in zip(predictions,labels):
        if prediction == label:
            metrics["num_correct"] += 1

            match prediction:
                case 0:
                    metrics["TP_neg"] += 1
                case 1:
                    metrics["TP_mid"] += 1
                case 2:
                    metrics["TP_pos"] += 1
        else:
            match prediction:
                case 0:
                    metrics["FP_neg"] += 1
                case 1:
                    metrics["FP_mid"] += 1
                case 2:
                    metrics["FP_pos"] += 1
            match label:
                case 0:
                    metrics["FN_neg"] += 1
                case 1:
                    metrics["FN_mid"] += 1
                case 2:
                    metrics["FN_pos"] += 1

    accuracy = find_accuracy(metrics["num_correct"], len(labels))

    precision_neg = find_precision(metrics["TP_neg"],metrics["FP_neg"])
    recall_neg = find_recall(metrics["TP_neg"],metrics["FN_neg"])

    precision_neutral = find_precision(metrics["TP_mid"],metrics["FP_mid"])
    recall_neutral = find_recall(metrics["TP_mid"],metrics["FN_mid"])

    precision_pos = find_precision(metrics["TP_pos"],metrics["FP_pos"])
    recall_pos = find_recall(metrics["TP_pos"],metrics["FN_pos"])

    avg_precision = (precision_neg + precision_neutral + precision_pos) / 3
    avg_recall = (recall_neg + recall_neutral + recall_pos) / 3

    return {
        "accuracy": float(
            accuracy
        ),
        "precision_negative":float(
            precision_neg
        ),
        "precision_neutral":float(
            precision_neutral
        ),
        "precision_positive":float(
            precision_pos
        ),
        "average precision":float(
            avg_precision
        ),
        "recall_negative":float(
            recall_neg
        ),
        "recall_neutral":float(
            recall_neutral
        ),
        "recall_positive":float(
            recall_pos
        ),
        "average_recall":float(
            avg_recall
        )
    }



# Training a model

First, for training a classification model, two more parameters need to be defined:
  - `id2label`
  - `label2id`

Recall that earlier I translated the labels from "negative", "neutral" and "positive" to 0, 1 and 2 respectively.

These parameters contain the translation from the ids to labels and vice versa.

In [13]:
id2label = {0: "negative", 1: "neutral", 2:"positive"}
label2id = {"negative": 0, "neutral": 1, "positive":2}
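
As a small illustration of how these mappings are used, the sketch below turns a list of raw model scores (logits) into a label name. The logits are made-up values and `decode_prediction` is a hypothetical helper, not a Hugging Face function.

```python
id2label = {0: "negative", 1: "neutral", 2: "positive"}
# Build the reverse mapping from the forward one so the two cannot drift apart
label2id = {label: idx for idx, label in id2label.items()}

def decode_prediction(logits):
    """Return the label name of the highest-scoring class."""
    predicted_id = max(range(len(logits)), key=lambda i: logits[i])
    return id2label[predicted_id]

# The two mappings are inverses of each other
assert all(label2id[id2label[i]] == i for i in id2label)

print(decode_prediction([-1.2, 0.3, 2.1]))  # highest score is at index 2
```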

Finally, the base model can be defined.


In [14]:
from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer

model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=3, id2label=id2label, label2id=label2id
)

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'pre_classifier.weight', 'classifier.weight', 'pre_classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [15]:
training_args = TrainingArguments(
    output_dir="summarisation_model_10_epoch",
    evaluation_strategy="epoch", # Run evaluation function on each epoch
    weight_decay=0.01, # Applies weight decay (a regulariser) in an attempt to prevent overfitting
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=10,
    save_strategy="epoch",
    logging_steps = 55
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenised_dataset["train"],
    eval_dataset=tokenised_dataset["valid"],
    tokenizer=tokeniser,
    data_collator=data_collator,
    compute_metrics=evaluation_loop,
)

trainer.train()
trainer.save_model("summarisation_model_10_epoch")



| Epoch | Training Loss | Validation Loss | Accuracy | Precision Negative | Precision Neutral | Precision Positive | Average precision | Recall Negative | Recall Neutral | Recall Positive | Average Recall |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0.9785 | 0.841888 | 0.62963 | 0.722892 | 0.32 | 0.0 | 0.347631 | 0.923077 | 0.296296 | 0.0 | 0.406458 |
| 2 | 0.8254 | 0.767908 | 0.675926 | 0.796875 | 0.37931 | 0.733333 | 0.636506 | 0.784615 | 0.407407 | 0.6875 | 0.626508 |
| 3 | 0.7014 | 0.728043 | 0.666667 | 0.808824 | 0.333333 | 0.5 | 0.547386 | 0.846154 | 0.222222 | 0.6875 | 0.585292 |
| 4 | 0.5949 | 0.696335 | 0.712963 | 0.791667 | 0.470588 | 0.631579 | 0.631278 | 0.876923 | 0.296296 | 0.75 | 0.641073 |
| 5 | 0.4935 | 0.704884 | 0.722222 | 0.80597 | 0.5 | 0.705882 | 0.670618 | 0.830769 | 0.444444 | 0.75 | 0.675071 |
| 6 | 0.4029 | 0.814784 | 0.666667 | 0.8125 | 0.363636 | 0.545455 | 0.573864 | 0.8 | 0.296296 | 0.75 | 0.615432 |
| 7 | 0.3404 | 1.087383 | 0.555556 | 0.844444 | 0.28125 | 0.419355 | 0.515016 | 0.584615 | 0.333333 | 0.8125 | 0.576816 |
| 8 | 0.2829 | 1.004818 | 0.638889 | 0.851852 | 0.354839 | 0.521739 | 0.576143 | 0.707692 | 0.407407 | 0.75 | 0.6217 |
| 9 | 0.2435 | 0.906473 | 0.685185 | 0.830508 | 0.433333 | 0.631579 | 0.631807 | 0.753846 | 0.481481 | 0.75 | 0.661776 |
| 10 | 0.2113 | 0.913119 | 0.703704 | 0.836066 | 0.464286 | 0.631579 | 0.643977 | 0.784615 | 0.481481 | 0.75 | 0.672032 |
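
Reading the log above, the validation loss stops improving partway through training while the training loss keeps falling. The sketch below simply copies the validation losses from the log and picks out the epoch with the lowest value; it is an illustration for inspecting the log, not part of the training code.

```python
# Validation loss per epoch, copied from the training log above
validation_loss = {
    1: 0.841888, 2: 0.767908, 3: 0.728043, 4: 0.696335, 5: 0.704884,
    6: 0.814784, 7: 1.087383, 8: 1.004818, 9: 0.906473, 10: 0.913119,
}

# Epoch whose validation loss is the smallest
best_epoch = min(validation_loss, key=validation_loss.get)
print(best_epoch, validation_loss[best_epoch])
```

Because `save_strategy="epoch"` writes a checkpoint after every epoch, the checkpoint from that epoch could be compared against the final model. `TrainingArguments` also provides a `load_best_model_at_end` option that may be worth exploring for this purpose.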


# Evaluation on test set

In [16]:
trained_tokeniser = AutoTokenizer.from_pretrained("summarisation_model_10_epoch")
trained_model = AutoModelForSequenceClassification.from_pretrained("summarisation_model_10_epoch").to(device)


In [17]:
def eval_calculate_accuracy(preds):
    predictions, labels = preds # extract predicted class ids and labels from input.

    # The predictions are already class ids, so they are compared to the labels directly

    num_correct = 0

    for prediction,label in zip(predictions,labels):
        if prediction == label:
            num_correct += 1

    return round(num_correct / len(labels),2)


In [18]:
predicted_labels = []
truths = []

for item in dataset["test"]:

    document = item["text"]
    ground_truth_label = item["labels"]

    tokenised_document = trained_tokeniser(document,return_tensors="pt",truncation=True).to(device)
    with torch.no_grad():
        # Use the reloaded model rather than the in-memory `model` from training
        logits = trained_model(**tokenised_document).logits

    # The index of the highest logit is the predicted class id
    predicted_class_id = logits.argmax(dim=-1).item()

    predicted_labels.append(predicted_class_id)
    truths.append(ground_truth_label)

print(eval_calculate_accuracy([predicted_labels,truths]))

0.64
