# Install required libraries and imports.

In [1]:
!pip install -U transformers
!pip3 install torch --index-url https://download.pytorch.org/whl/cu118
!pip install -U datasets
from datasets import load_dataset, DatasetDict
import os

# Cap the size (in MB) of the blocks the CUDA caching allocator is allowed to split.
# This reduces memory fragmentation, in an attempt to prevent CUDA out-of-memory errors.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:1024"

Collecting transformers
  Obtaining dependency information for transformers from https://files.pythonhosted.org/packages/20/0a/739426a81f7635b422fbe6cb8d1d99d1235579a6ac8024c13d743efa6847/transformers-4.36.2-py3-none-any.whl.metadata
  Downloading transformers-4.36.2-py3-none-any.whl.metadata (126 kB)
Downloading transformers-4.36.2-py3-none-any.whl (8.2 MB)
Installing collected packages: transformers
  Attempting uninstall: transformers
    Found existing installation: transformers 4.36.0
    Uninstalling transformers-4.36.0:
      Successfully uninstalled transformers-4.36.0
Successfully installed transformers-4.36.2
Looking in indexes: https://download.pytorch.org/whl/cu118
Collecting datasets
  Obtaining dependency informati

Tell PyTorch to use GPU wherever possible

In [2]:
import torch

device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")

# Data processing

First, the dataset and checkpoint need to be initialised.

### Dataset location
The dataset location depends on where the notebook is running. As I run the notebook in several environments, I set it up so that I only need to uncomment the line below that matches the current environment.

In [3]:
dataset_location = "/kaggle/input/new-datasets/Privacy_Policy_dataset.jsonl" # Kaggle

#dataset_location = "Privacy_Policy_dataset.jsonl" # Local / Google Colab

### Initialise dataset
To initialise the dataset I use the "datasets" library from Hugging Face.

I split the dataset into three sets:
- Training set - The data shown to the model during training
- Validation - The data used to evaluate the model during training (here, after each epoch); it is not used to update the model's weights
- Test - Reserved strictly for after the model is trained, used to evaluate the model on a completely unseen set

However, the "datasets" library doesn't offer a single call that splits into three sets, so I use a workaround sourced from [this Stack Overflow post](https://stackoverflow.com/questions/76001128/splitting-dataset-into-train-test-and-validation-using-huggingface-datasets-fun).

It works by first splitting the dataset into a train set (80%) and a held-out set (20%).

It then splits this held-out set in half, resulting in a test set and a validation set of 10% each.

A final dataset consisting of a train, test and validation set is then built from these splits.

In [4]:
dataset = load_dataset("json", data_files=dataset_location,split='train')

dataset = dataset.shuffle(seed=2424)

test_valid_split_dataset = dataset.train_test_split(test_size=0.2, shuffle=False)

test_split = test_valid_split_dataset['test'].train_test_split(test_size=0.5, shuffle = False)

dataset = DatasetDict({
    'train': test_valid_split_dataset['train'],
    'test': test_split['test'],
    'valid': test_split['train']})

print(dataset)



Generating train split: 0 examples [00:00, ? examples/s]

DatasetDict({
    train: Dataset({
        features: ['summary', 'document'],
        num_rows: 453
    })
    test: Dataset({
        features: ['summary', 'document'],
        num_rows: 57
    })
    valid: Dataset({
        features: ['summary', 'document'],
        num_rows: 57
    })
})
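The split sizes printed above can be sanity-checked with a little arithmetic. This is a minimal sketch, assuming the held-out share is rounded up and the remainder is kept for training (the rounding rule is my assumption, and `split_sizes` is a hypothetical helper, not part of the "datasets" library):

```python
import math

def split_sizes(total_rows, held_out_fraction=0.2, test_fraction=0.5):
    """Estimate (train, test, valid) sizes for the two-stage split.

    Assumption: the held-out share is rounded up and the rest is kept.
    """
    held_out = math.ceil(total_rows * held_out_fraction)  # 20% held out
    train = total_rows - held_out                         # remaining 80%
    test = math.ceil(held_out * test_fraction)            # half of the held-out set
    valid = held_out - test                               # the other half
    return train, test, valid

# 453 + 57 + 57 = 567 rows in total, as reported by print(dataset)
print(split_sizes(567))
```

Under these assumptions the helper reproduces the 453 / 57 / 57 row counts shown in the output.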


### Reviewing the dataset
Next I want to see the properties of the dataset, to understand what I'm working with.

For training a summarisation model, knowing the length of the collected documents is crucial.

The base model used in this notebook can process at most 16,384 tokens - a limited input length is a common constraint of transformer models in NLP as a whole.

Roughly, [one token is equal to about four English characters](https://help.openai.com/en/articles/4936856-what-are-tokens-and-how-to-count-them).

This gives roughly 65,536 characters which the model will be able to parse and create a summary for, and anything which exceeds this number needs to be truncated down to the 16384 token limit.

This unfortunately means on some documents, some detail will be missing.

However, as visible below, the average document in the train set is just 24,724 characters (roughly 6,181 tokens), considerably under the 16,384 token limit; thus, for most items in the dataset this isn't a problem.

In [5]:
length_of_first_item = len(dataset['train'][0]['document'])
print(f'The length of the first privacy policy in the train dataset is {length_of_first_item} characters')

length_of_longest_document = len(max(dataset['train'], key=lambda x: len(x['document']))['document'])
print(f'Length of the longest privacy policy in the train dataset is {length_of_longest_document} characters')

length_of_shortest_document = len(min(dataset['train'], key=lambda x: len(x['document']))['document'])
print(f'Length of the shortest privacy policy in the train dataset is {length_of_shortest_document} characters')

total_char_count = sum(map(len, dataset['train']['document']))
avg_char_count = round(total_char_count / len(dataset['train']['document']))

print(f'The average character count in the dataset is: {avg_char_count}')

The length of the first privacy policy in the train dataset is 5342 characters
Length of the longest privacy policy in the train dataset is 261619 characters
Length of the shortest privacy policy in the train dataset is 826 characters
The average character count in the dataset is: 24724
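To relate these character counts back to the token limit, the four-characters-per-token rule of thumb can be applied directly. This is only a rough sketch: the ratio is an approximation, and `estimate_tokens` and `needs_truncation` are hypothetical helpers rather than anything provided by the tokeniser.

```python
CHARS_PER_TOKEN = 4   # rule of thumb quoted earlier, not an exact figure
MAX_TOKENS = 16384    # token limit used in this notebook

def estimate_tokens(char_count, chars_per_token=CHARS_PER_TOKEN):
    """Rough token estimate from a character count."""
    return char_count // chars_per_token

def needs_truncation(char_count, max_tokens=MAX_TOKENS):
    """True if the estimated token count exceeds the limit."""
    return estimate_tokens(char_count) > max_tokens

# Character counts taken from the output above
for name, chars in [("shortest", 826), ("average", 24724), ("longest", 261619)]:
    print(name, estimate_tokens(chars), needs_truncation(chars))
```

By this estimate only the longest document (about 65,404 tokens) is well beyond the limit, while the average document (about 6,181 tokens) fits comfortably.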


# Base model

I will be utilising transfer learning to train a model.

This takes a model already trained on some other task in a similar domain (e.g. summarising books) and trains it further on a new specific task, in my case, summarising terms and conditions or privacy policies, utilising its pre-existing knowledge of the English language. In some forms of transfer learning the more specialised head of the model (e.g. containing information specific to books) is removed and replaced; in this notebook the whole pretrained model is loaded and fine-tuned.

This significantly reduces training time and resources required for training such that I can stay within the final year project deadlines.

The model I will be using as a base is [led-large-book-summary](https://huggingface.co/pszemraj/led-large-book-summary). This model uses the Longformer Encoder-Decoder (LED) model as its base, and was trained further to summarise long-form text such as novels, plays and stories from the [BookSum dataset](https://arxiv.org/abs/2105.08209).

Below I initialise the tokeniser for this model through the [Hugging Face](https://huggingface.co/models) library, which offers a variety of base models for transfer learning.
  - This base model is ~1.8 GB (the `model.safetensors` download later in the notebook reports 1.84G)


In [6]:
from transformers import AutoTokenizer

base_model_name = "pszemraj/led-large-book-summary"
tokeniser = AutoTokenizer.from_pretrained(base_model_name)

tokenizer_config.json:   0%|          | 0.00/1.32k [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/798k [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/2.11M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/772 [00:00<?, ?B/s]

Transformer models *only* take numerical inputs; thus, a tokeniser is responsible for transforming the input text into its numerical representation.

It does this by splitting text into "tokens", which are small groups of characters.

Tokenisers also use special tokens, often indicating the start and the end of a sequence.

Below I display an example by tokenising a simple string using the tokeniser for the "led-large-book-summary" model.

`input_ids` represents the tokenised input; `attention_mask` tells the model to ignore a token if the value at the same index in the attention mask array is zero.

Note that when converting the IDs back to tokens, we can see `<s>` and `</s>` being used to indicate the start and end of the sequence, and each word that follows a space beginning with `Ġ`.
 - This changes depending on the base model, but the Hugging Face library picks out the right tokeniser for the base model.

Also, a word can be split into several tokens if it is deemed useful: below, `tokeniser` is split into the two tokens `token` and `iser`, as both pieces can be re-used in other words, which keeps the vocabulary small.

In [7]:
test_string = tokeniser("This is a test string to test out the tokeniser")
print(test_string)
tokeniser.convert_ids_to_tokens(test_string.input_ids)

{'input_ids': [0, 713, 16, 10, 1296, 6755, 7, 1296, 66, 5, 19233, 5999, 2], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}


['<s>',
 'This',
 'Ġis',
 'Ġa',
 'Ġtest',
 'Ġstring',
 'Ġto',
 'Ġtest',
 'Ġout',
 'Ġthe',
 'Ġtoken',
 'iser',
 '</s>']
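The idea of re-using word pieces can be illustrated with a toy example. Note that this is not how the real tokeniser works internally (it uses learned byte-level merges); the sketch below is a simple greedy longest-match split over a tiny, made-up vocabulary, and both `greedy_subword_split` and `toy_vocab` are hypothetical.

```python
def greedy_subword_split(word, vocab):
    """Split `word` into the longest vocabulary pieces, left to right."""
    pieces = []
    start = 0
    while start < len(word):
        # Try the longest candidate first, shrinking until one is in the vocabulary.
        for end in range(len(word), start, -1):
            if word[start:end] in vocab:
                pieces.append(word[start:end])
                start = end
                break
        else:
            # No known piece starts here: fall back to a single character.
            pieces.append(word[start])
            start += 1
    return pieces

# Made-up vocabulary for illustration only
toy_vocab = {"token", "iser", "summar", "test"}

print(greedy_subword_split("tokeniser", toy_vocab))   # ['token', 'iser']
print(greedy_subword_split("summariser", toy_vocab))  # ['summar', 'iser']
```

The piece `iser` is shared by both words, so two different words are covered without needing a dedicated vocabulary entry for each.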

Next, a function is needed which will tokenise the text.

Here, the maximum length can be defined as 16384 tokens and the tokeniser will be responsible for ensuring any text exceeding this is truncated.

In the dataset, the "document" column contains the input document (privacy policy or Terms of Service).

The "summary" column contains the ground truth summary for the matching document.
 - These don't need to be truncated, as they are all <500 tokens

Furthermore, this function assigns the `input_ids` of the tokenised ground truth summaries to the "labels" property of the tokenised documents.
 - This is the format required for training a summarisation model in PyTorch.

In [8]:
def tokenise_truncate_dataset(batch):
    # Tokenise the documents, truncating anything longer than 16384 tokens
    truncated_input = tokeniser(
        batch["document"],
        max_length=16384,
        truncation=True,
    )
    # Tokenise the ground truth summaries; don't truncate labels
    labels = tokeniser(
        batch["summary"],
        truncation=False,
    )
    # Attach the summary token IDs as the training labels
    truncated_input["labels"] = labels["input_ids"]
    return truncated_input

# By passing `batched=True` into the map function, several items are passed to the function at a time.
tokenised_dataset = dataset.map(tokenise_truncate_dataset, batched=True)

Map:   0%|          | 0/453 [00:00<?, ? examples/s]

Map:   0%|          | 0/57 [00:00<?, ? examples/s]

Map:   0%|          | 0/57 [00:00<?, ? examples/s]

The properties of the dataset can now be viewed. As expected, there is a train, test and validation dataset.

In [9]:
tokenised_dataset

DatasetDict({
    train: Dataset({
        features: ['summary', 'document', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 453
    })
    test: Dataset({
        features: ['summary', 'document', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 57
    })
    valid: Dataset({
        features: ['summary', 'document', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 57
    })
})

However, the "summary" and "document" columns are no longer needed, as we have their tokenised equivalents - "input_ids" and "labels".

Thus, these can be removed.

Furthermore, the dataset needs to be set to return PyTorch tensors so that it can be used for training in PyTorch.

In [10]:
tokenised_dataset = tokenised_dataset.remove_columns(["summary","document"])
tokenised_dataset.set_format("torch")
tokenised_dataset

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 453
    })
    test: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 57
    })
    valid: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 57
    })
})

Next, a data collator needs to be defined.

(The information below is sourced from: https://huggingface.co/docs/transformers/main_classes/data_collator)

This is responsible for constructing batches and applying pre-processing such as padding to ensure all inputs in a batch are of the same size.

The Hugging Face library provides a data collator class for sequence-to-sequence tasks which applies this padding.

In [11]:
from transformers import DataCollatorForSeq2Seq

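# `model` expects a model instance rather than a checkpoint name, and the base model
# is only loaded further down, so just the tokeniser is passed here.
data_collator = DataCollatorForSeq2Seq(tokenizer=tokeniser)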
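To make the padding step concrete, here is a minimal pure-Python sketch of what a collator conceptually does for one batch. It is not the library's implementation: `pad_batch` is a hypothetical helper, the pad token ID of 1 is an assumption for illustration, and -100 is used for label padding because that value is conventionally ignored by the loss.

```python
PAD_TOKEN_ID = 1      # assumed pad token ID, for illustration only
LABEL_PAD_ID = -100   # label positions with this value are ignored by the loss

def pad_batch(examples, pad_token_id=PAD_TOKEN_ID, label_pad_id=LABEL_PAD_ID):
    """Pad every example to the longest input / label length in the batch."""
    max_input = max(len(e["input_ids"]) for e in examples)
    max_label = max(len(e["labels"]) for e in examples)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for e in examples:
        n_pad = max_input - len(e["input_ids"])
        batch["input_ids"].append(e["input_ids"] + [pad_token_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore
        batch["attention_mask"].append([1] * len(e["input_ids"]) + [0] * n_pad)
        batch["labels"].append(
            e["labels"] + [label_pad_id] * (max_label - len(e["labels"]))
        )
    return batch

batch = pad_batch([
    {"input_ids": [0, 713, 2], "labels": [0, 5, 2]},
    {"input_ids": [0, 713, 16, 10, 2], "labels": [0, 2]},
])
print(batch)
```

With a batch size of 1, as used later in this notebook, little or no padding is needed within a batch, but the same mechanism applies for larger batches.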


# Model

First, the base model needs to be defined.

Next, an optimiser needs to be chosen.

The standard optimiser to use is `AdamW`, but due to VRAM limitations, a less memory-intensive optimiser called "Adafactor" will be used.

In [12]:
from transformers import Seq2SeqTrainingArguments, Seq2SeqTrainer, DataCollatorForSeq2Seq, AutoModelForSeq2SeqLM
from transformers.optimization import Adafactor

base_model = AutoModelForSeq2SeqLM.from_pretrained(base_model_name)

config.json:   0%|          | 0.00/1.44k [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/1.84G [00:00<?, ?B/s]

Next, ROUGE needs to be defined for the evaluation loop, along with a function which will compute the ROUGE scores.

The ROUGE metric is loaded through the `evaluate` library, which relies on the `rouge_score` Python package, and will calculate and return the following metrics:
 
 ### ROUGE-1
 ROUGE-1 is the overlap of words (unigrams) between the produced summary and ground truth summary.
 $$\frac{\text{overlapping words}}{\text{total words in the ground truth summary}}$$
 
 ### ROUGE-2
 ROUGE-2 is the overlap of bi-grams (pairs of consecutive words).
 
 $$\frac{\text{overlapping bi-grams}}{\text{total number of bi-grams in the ground truth summary}}$$
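The two fractions above can be computed by hand for a small example. This is a minimal sketch, assuming the denominator counts the n-grams of the ground truth summary (a recall-style score); the library may combine precision and recall, so its reported numbers can differ. `ngrams` and `rouge_n_recall` are hypothetical helpers and the sentences are made up.

```python
from collections import Counter

def ngrams(words, n):
    """All runs of n consecutive words."""
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

def rouge_n_recall(prediction, reference, n):
    """Overlapping n-grams divided by the n-grams in the reference."""
    pred_counts = Counter(ngrams(prediction.split(), n))
    ref_counts = Counter(ngrams(reference.split(), n))
    # Counter intersection keeps the minimum count of each shared n-gram
    overlap = sum((pred_counts & ref_counts).values())
    total = sum(ref_counts.values())
    return overlap / total if total else 0.0

reference = "we share your data with advertisers"   # made-up ground truth
prediction = "we share data with advertisers"       # made-up model output

print(rouge_n_recall(prediction, reference, 1))  # 5 of 6 words overlap
print(rouge_n_recall(prediction, reference, 2))  # 3 of 5 bi-grams overlap
```

Dropping a single word costs more under ROUGE-2 than under ROUGE-1 here, because it breaks two of the reference bi-grams.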
 
 ### ROUGE-L
 Source partially from (https://en.wikipedia.org/wiki/Longest_common_subsequence)
 
 ROUGE-L is based on the idea of a *longest common subsequence* - the longest sequence of words that appears, in the same order, in both texts.
 
 A subsequence does not necessarily have to be *consecutive*: elements can be skipped to make a subsequence.
 
 Consider the strings "ABCD" and "ACBAD" - the longest common subsequences are "ABD" and "ACD".
 
 $$\frac{\text{length of the longest common subsequence}}{\text{total words in the ground truth summary}}$$
 
 Such that for the above example, if we consider "ABCD" the produced text and "ACBAD" the ground truth, ROUGE-L is equal to $\frac{3}{5} = 0.6$
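The example above can be checked with a short dynamic-programming sketch. It treats each letter as one "word" and divides by the length of the ground truth, as in the worked example; `lcs_length` and `rouge_l_recall` are hypothetical helpers, not functions from the `rouge_score` package.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two sequences."""
    # table[i][j] holds the LCS length of a[:i] and b[:j]
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, start=1):
        for j, y in enumerate(b, start=1):
            if x == y:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[len(a)][len(b)]

def rouge_l_recall(prediction, reference):
    """LCS length divided by the length of the reference."""
    return lcs_length(prediction, reference) / len(reference) if reference else 0.0

print(lcs_length("ABCD", "ACBAD"))      # 3
print(rouge_l_recall("ABCD", "ACBAD"))  # 0.6
```

For real summaries the same functions can be applied to lists of words, e.g. `rouge_l_recall(prediction.split(), reference.split())`.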
 
 ### ROUGE-L-SUM
 sourced from https://dev.to/aws-builders/mastering-rouge-matrix-your-guide-to-large-language-model-evaluation-for-summarization-with-examples-jjg
 
This is similar to ROUGE-L, but instead compares this at a *sentence* level, calculating ROUGE-L for each sentence in the produced summary.

This is a better measurement of accuracy in my use-case, as the produced summaries are split into sentences, where each sentence is a "summary point".

In [13]:
!pip install evaluate
!pip install rouge_score

import evaluate
import numpy as np

rouge = evaluate.load("rouge")

def decode_and_find_rouge(result):
    predictions, labels = result  # extract predictions and labels from the passed in result.

    # Padded label positions may hold -100, which cannot be decoded,
    # so replace them with the pad token before decoding.
    labels = np.where(labels != -100, labels, tokeniser.pad_token_id)

    # Decode by converting tokenised inputs back into English representations
    decoded_predictions = tokeniser.batch_decode(predictions, skip_special_tokens=True)
    decoded_labels = tokeniser.batch_decode(labels, skip_special_tokens=True)

    # Calculate rouge scores by passing in predictions and ground truths
    rouge_score = rouge.compute(predictions=decoded_predictions, references=decoded_labels)

    # Calculate the average generated length of the examples (non-padding tokens).
    num_of_predictions = [np.count_nonzero(pred != tokeniser.pad_token_id) for pred in predictions]
    rouge_score["gen_len"] = np.mean(num_of_predictions)

    # Round and return results.
    return {k: round(v, 4) for k, v in rouge_score.items()}

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


Collecting evaluate
  Obtaining dependency information for evaluate from https://files.pythonhosted.org/packages/70/63/7644a1eb7b0297e585a6adec98ed9e575309bb973c33b394dae66bc35c69/evaluate-0.4.1-py3-none-any.whl.metadata
  Downloading evaluate-0.4.1-py3-none-any.whl.metadata (9.4 kB)
Downloading evaluate-0.4.1-py3-none-any.whl (84 kB)
Installing collected packages: evaluate
Successfully installed evaluate-0.4.1


Collecting rouge_score
  Downloading rouge_score-0.1.2.tar.gz (17 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: rouge_score
  Building wheel for rouge_score (setup.py) ... done
  Created wheel for rouge_score: filename=rouge_score-0.1.2-py3-none-any.whl size=24932 sha256=13111d44b393f6651ada33ed7d0360c7c21d7d5b5bba946fecf7657c84fbf6ad
  Stored in directory: /root/.cache/pip/wheels/5f/dd/89/461065a73be61a532ff8599a28e9beef17985c9e9c31e541b4
Successfully built rouge_score
Installing collected packages: rouge_score
Successfully installed rouge_score-0.1.2


Downloading builder script:   0%|          | 0.00/6.27k [00:00<?, ?B/s]

Next, the model training loop needs to be defined:

This model is very memory-intensive and it is essential to avoid exceeding the GPU's VRAM, thus the hyper-parameters are configured as follows:

1) The batch size will be set to just 1 to reduce the amount of data stored in memory.

2) 16-bit floating point precision will be used, rather than 32-bit.

3) The optimiser "Adafactor" is used instead of the more common "AdamW". It keeps less optimiser state, so it uses less memory, although the model may converge more slowly.

4) "Gradient Checkpointing" is used, instructing the model to discard the majority of forward-pass activations and recompute them on demand during the backward pass, keeping only the activations at selected checkpoints. This makes training significantly slower, but also saves a lot of VRAM.

5) "Gradient Accumulation" is used to simulate a larger effective batch size. Instead of updating the model's parameters after processing each batch, the gradients are accumulated over 2 batches before performing a single update.
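Point 5 can be illustrated without any deep learning library. The sketch below uses a one-parameter model `w * x` with a squared-error loss (a toy stand-in I chose for illustration, not the LED model) and shows that accumulating the scaled gradients of two batches of size 1 gives the same gradient as one batch of size 2. All function names are hypothetical.

```python
def grad_squared_error(w, x, y):
    """Gradient of (w*x - y)**2 with respect to w."""
    return 2 * (w * x - y) * x

def accumulated_gradient(w, micro_batches, accumulation_steps):
    """Sum the gradients of several size-1 batches before one update."""
    total = 0.0
    for x, y in micro_batches:
        # Each batch contributes its gradient scaled by 1 / accumulation_steps
        total += grad_squared_error(w, x, y) / accumulation_steps
    return total

def full_batch_gradient(w, batch):
    """Gradient of the mean loss over the whole batch at once."""
    return sum(grad_squared_error(w, x, y) for x, y in batch) / len(batch)

samples = [(1.0, 2.0), (3.0, 5.0)]  # made-up (input, target) pairs
w = 0.5

accumulated = accumulated_gradient(w, samples, accumulation_steps=2)
full = full_batch_gradient(w, samples)
print(accumulated, full)  # both are -12.0

# A single parameter update is then made with the accumulated gradient
learning_rate = 2e-5
w_updated = w - learning_rate * accumulated
```

Only one example's activations need to be held in memory at a time, which is why this helps with VRAM while behaving like a batch size of 2.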

In [14]:
from transformers import Seq2SeqTrainingArguments, Seq2SeqTrainer, DataCollatorForSeq2Seq

models_arguments = Seq2SeqTrainingArguments(
    output_dir="new_privacy_policy_model_4_epoch",
    evaluation_strategy="epoch", # Run evaluation function on each epoch
    learning_rate=2e-5, # learning rate hyperparameter set to 0.00002
    per_device_train_batch_size= 1, # split into batches of 1 for training
    per_device_eval_batch_size=1, # split to batches of 1 for evaluation
    weight_decay=0.01, # Utilises L2 regularization in an attempt to prevent overfitting
    save_total_limit=3, # save 3 checkpoints only and delete older checkpoints (Kept using all RAM without)
    num_train_epochs=4, # train for 4 epochs
    predict_with_generate=True, # Generate summaries for each input ; essential for summarisation tasks
    fp16=True, # use 16 bit floating point - reduced memory usage
    gradient_accumulation_steps=2,
    gradient_checkpointing=True,
    optim="adafactor"
)

# Collect the previously defined trainer parameters, such as the evaluation technique, tokeniser and datasets.
trainer = Seq2SeqTrainer(
    model=base_model,
    args=models_arguments,
    train_dataset=tokenised_dataset["train"],
    eval_dataset=tokenised_dataset["valid"],
    tokenizer=tokeniser,
    data_collator=data_collator,
    compute_metrics=decode_and_find_rouge,
)

import os

# Disable Weights & Biases logging (needed when running on Kaggle); comment out the line below to enable it.
os.environ["WANDB_DISABLED"] = "true"

trainer.train()
trainer.save_model("new_privacy_policy_model_4_epoch")

You're using a LEDTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


| Epoch | Training Loss | Validation Loss | Rouge1 | Rouge2 | Rougel | Rougelsum | Gen Len |
|---|---|---|---|---|---|---|---|
| 0 | No log | 0.638767 | 0.426 | 0.2438 | 0.2359 | 0.2358 | 293.0175 |
| 2 | 0.782500 | 0.564965 | 0.5073 | 0.3157 | 0.2721 | 0.2719 | 183.4386 |
| 3 | 0.782500 | 0.560927 | 0.4883 | 0.2946 | 0.2709 | 0.2708 | 145.5789 |


# Evaluation