# Install required libraries and imports.

In [2]:
!pip install -U datasets
!pip install -U transformers
!pip3 install torch --index-url https://download.pytorch.org/whl/cu118

from datasets import load_dataset, DatasetDict
import os

# Limit the size of the blocks the CUDA caching allocator is allowed to split, to reduce
# fragmentation in an attempt to prevent CUDA out of memory errors.
# This affects reserved memory, not in-use memory, and must be set before CUDA is initialised.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:1024"


Tell PyTorch to use GPU wherever possible

In [3]:
import torch

device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")

# Data processing

First, the dataset and checkpoint need to be initialised.

### Dataset location
The dataset location depends on where the notebook is running. As I run the notebook in a lot of locations, I keep one line per location and uncomment whichever one applies.

In [4]:
dataset_location = "/kaggle/input/new-datasets/Privacy_Policy_dataset.jsonl" # Kaggle

#dataset_location = "Privacy_Policy_dataset.jsonl" # Local / Google Colab

### Initialise dataset
To initialise the dataset I use the Hugging Face "datasets" library for Python.

I split the dataset into three sets:
- Training set - The data shown to the model during training; the loss on this set drives the backward pass
- Validation set - Held-out data used to evaluate the model while it is being trained; no weights are updated from it
- Test set - Reserved strictly for after the model is trained, used to evaluate the model on a completely unseen set

However, the "datasets" library doesn't offer a way to split into three sets in a single call, so I use a workaround sourced from [this Stack Overflow post](https://stackoverflow.com/questions/76001128/splitting-dataset-into-train-test-and-validation-using-huggingface-datasets-fun).

It works by first splitting the dataset into a train set (80%) and a held-out set (20%).

It then splits this held-out set in half, resulting in a validation set and a test set of 10% each.

A final `DatasetDict` is then built from these three splits.

In [5]:
dataset = load_dataset("json", data_files=dataset_location,split='train')

dataset = dataset.shuffle(seed=2424)

test_valid_split_dataset = dataset.train_test_split(test_size=0.2, shuffle=False)

test_split = test_valid_split_dataset['test'].train_test_split(test_size=0.5, shuffle = False)

dataset = DatasetDict({
    'train': test_valid_split_dataset['train'],
    'test': test_split['test'],
    'valid': test_split['train']})

print(dataset)



Generating train split: 0 examples [00:00, ? examples/s]

DatasetDict({
    train: Dataset({
        features: ['summary', 'document'],
        num_rows: 453
    })
    test: Dataset({
        features: ['summary', 'document'],
        num_rows: 57
    })
    valid: Dataset({
        features: ['summary', 'document'],
        num_rows: 57
    })
})
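The row counts above can be reproduced with a little arithmetic. This is a minimal sketch, assuming the held-out portion of each split is rounded up (which matches the printed sizes); `three_way_split_sizes` is a hypothetical helper, not part of the "datasets" library.

```python
import math

def three_way_split_sizes(total_rows, held_out_fraction=0.2):
    """Return (train, valid, test) sizes for an 80/10/10 split done in two steps."""
    # Step 1: carve off the held-out 20% (assumed to be rounded up)
    held_out = math.ceil(total_rows * held_out_fraction)
    train = total_rows - held_out
    # Step 2: split the held-out rows in half
    test = math.ceil(held_out * 0.5)
    valid = held_out - test
    return train, valid, test

# 453 + 57 + 57 rows are reported in the DatasetDict above
print(three_way_split_sizes(453 + 57 + 57))
```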


### Review dataset
Next I want to see the properties of the dataset, to understand what I'm working with.

For training a summarisation model knowing the length of the collected documents is crucial.

The base model used later in this notebook is only capable of processing 16384 tokens at once - limited input length is a common constraint for transformer models in NLP.

Roughly, we can equate [one token to about 4 English characters](https://help.openai.com/en/articles/4936856-what-are-tokens-and-how-to-count-them).

This gives roughly 65,536 characters (16384 x 4) which the model will be able to parse at once, and anything which exceeds this number needs to be truncated down to the 16384 token limit.

This unfortunately means that on some documents, some detail will be missing.

However, as visible below, the average document is just 24,724 characters (roughly 6,181 tokens), considerably under the maximum token limit; thus, for most items in the dataset this isn't a problem.

In [6]:
length_of_first_item = len(dataset['train'][0]['document'])
print(f'The length of the first privacy policy in the train dataset is {length_of_first_item} characters')

length_of_longest_document = len(max(dataset['train'], key=lambda x: len(x['document']))['document'])
print(f'Length of the longest privacy policy in the train dataset is {length_of_longest_document} characters')

length_of_shortest_document = len(min(dataset['train'], key=lambda x: len(x['document']))['document'])
print(f'Length of the shortest privacy policy in the train dataset is {length_of_shortest_document} characters')

total_char_count = sum(map(len, dataset['train']['document']))
avg_char_count = round(total_char_count / len(dataset['train']['document']))

print(f'The average character count in the dataset is: {avg_char_count}')

The length of the first privacy policy in the train dataset is 5342 characters
Length of the longest privacy policy in the train dataset is 261619 characters
Length of the shortest privacy policy in the train dataset is 826 characters
The average character count in the dataset is: 24724
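To put these figures against the token limit, here is a rough sketch using the 4-characters-per-token rule of thumb. The helper names are hypothetical and the estimate is only approximate; the real count depends on the tokeniser.

```python
MAX_TOKENS = 16384       # input limit of the base model
CHARS_PER_TOKEN = 4      # rough heuristic, not an exact figure

def estimate_tokens(char_count):
    """Approximate the number of tokens in a text of the given character length."""
    return char_count // CHARS_PER_TOKEN

def exceeds_limit(char_count):
    """True if the estimated token count is over the model's input limit."""
    return estimate_tokens(char_count) > MAX_TOKENS

# Character counts taken from the output above
for label, chars in [("shortest", 826), ("average", 24724), ("longest", 261619)]:
    print(label, estimate_tokens(chars), exceeds_limit(chars))
```

Under this estimate only the longest documents would need truncating.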


# Base model

I will be utilising transfer learning to train a model.

This takes the base of a model trained on some other task but in a similar domain (e.g. summarising books), removes the head of the model which is more specialised (e.g. contains information specific to books), while retaining useful information about the English language. The model is then trained on a new specific task, in my case, summarising terms and conditions or privacy policies, utilising its pre-existing knowledge of the English language.

This significantly reduces training time and resources required for training such that I can stay within the final year project deadlines.

The model I will be using as a base is [led-large-book-summary](https://huggingface.co/pszemraj/led-large-book-summary). This model utilises the Longformer Encoder-Decoder (LED) model as its base, and was trained further to summarise long-form text such as novels, plays and stories from the [BookSum dataset](https://arxiv.org/abs/2105.08209).

Below I initialise the tokeniser for this model through the [Hugging Face](https://huggingface.co/models) library, which offers a variety of base models for transfer learning.


In [7]:
from transformers import AutoTokenizer

base_model_name = "pszemraj/led-large-book-summary"
tokeniser = AutoTokenizer.from_pretrained(base_model_name)

tokenizer_config.json:   0%|          | 0.00/1.32k [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/798k [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/2.11M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/772 [00:00<?, ?B/s]

Transformer models *only* take numerical inputs, thus, a tokeniser is responsible for transforming the input text into its numerical representation.

It does this by splitting text into "tokens", which are small groups of characters.

Tokenisers also use special tokens and marker characters, often indicating the start and the end of sequences and words.

Below I display an example by tokenising a simple string using the tokeniser for the "led-large-book-summary" model.

`input_ids` holds the tokenised input, while `attention_mask` tells the model to ignore a token if the value at the same index in the attention mask array is zero.

Note that when converting the ids back to tokens, we can see `<s>` and `</s>` being used to indicate the start and end of a sequence, and each token that follows a space beginning with `Ġ`.
 - This changes depending on the base model, but the Hugging Face library picks out the right tokeniser for the base model.

Also, a word can be split into several tokens if it is deemed useful. Below, `tokeniser` is split into the two tokens `token` and `iser`, as these pieces can be re-used in other words, which keeps the vocabulary small.

In [8]:
test_string = tokeniser("This is a test string to test out the tokeniser")
print(test_string)
tokeniser.convert_ids_to_tokens(test_string.input_ids)

{'input_ids': [0, 713, 16, 10, 1296, 6755, 7, 1296, 66, 5, 19233, 5999, 2], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}


['<s>',
 'This',
 'Ġis',
 'Ġa',
 'Ġtest',
 'Ġstring',
 'Ġto',
 'Ġtest',
 'Ġout',
 'Ġthe',
 'Ġtoken',
 'iser',
 '</s>']
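In the output above every attention mask value is 1 because nothing was padded. The sketch below shows how zeros appear once sequences of different lengths are padded to a common length. It is plain Python for illustration only: `pad_batch` is a hypothetical helper and the padding id of 1 is an assumption, not something read from the tokeniser.

```python
PAD_ID = 1  # assumed padding id, for illustration only

def pad_batch(sequences, pad_id=PAD_ID):
    """Right-pad lists of token ids to a common length and build matching attention masks."""
    longest = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        padding = longest - len(seq)
        input_ids.append(seq + [pad_id] * padding)
        # 1 = real token the model should attend to, 0 = padding to ignore
        attention_mask.append([1] * len(seq) + [0] * padding)
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Token ids borrowed from the example above: "<s> This is a test </s>" and "<s> This </s>"
batch = pad_batch([[0, 713, 16, 10, 1296, 2], [0, 713, 2]])
print(batch["attention_mask"])
```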

Next, a function is needed which will tokenise the text.

Here, the maximum length can be defined as 16384 tokens and the tokeniser will be responsible for ensuring any text exceeding this is truncated.

In the dataset, the "document" column contains the input document (privacy policy or Terms of Service).

The "summary" column contains the ground truth summary for the matching document.
 - These don't need to be truncated, as they are all <500 tokens.

Furthermore, this function assigns the `input_ids` of the tokenised ground truth summaries to the "labels" property of the tokenised documents.
 - This is the format required for training a summarisation model in PyTorch with the Hugging Face library: when "labels" are present, the model computes the loss itself.

In [9]:
def tokenise_truncate_dataset(batch):
    # Tokenise the documents, truncating anything beyond the 16384 token limit
    truncated_input = tokeniser(
        batch["document"],
        max_length = 16384,
        truncation = True
    )
    labels = tokeniser( # don't truncate labels
        batch["summary"],
        truncation = False,
    )
    # The token ids of the summaries become the training targets
    truncated_input["labels"] = labels["input_ids"]
    return truncated_input

# By passing `batched = True` into the map function, more than one item is passed to the function at a time.
tokenised_dataset = dataset.map(tokenise_truncate_dataset, batched = True)

Map:   0%|          | 0/453 [00:00<?, ? examples/s]

Map:   0%|          | 0/57 [00:00<?, ? examples/s]

Map:   0%|          | 0/57 [00:00<?, ? examples/s]
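To make the shape of a batched call clearer, here is a toy sketch of what the mapped function receives and returns. `toy_tokenise` is a hypothetical stand-in for the real tokeniser (it simply uses each word's length as its "id"), and the tiny `max_length` is only there to make the truncation visible.

```python
def toy_tokenise(texts, max_length=None):
    """Hypothetical stand-in for the tokeniser: one 'id' (the word length) per word."""
    ids = [[len(word) for word in text.split()] for text in texts]
    if max_length is not None:
        # Truncation keeps only the first `max_length` ids of each text
        ids = [seq[:max_length] for seq in ids]
    return {"input_ids": ids}

def tokenise_batch(batch, max_length=4):
    # With batched mapping, each column arrives as a list of values
    encoded = toy_tokenise(batch["document"], max_length=max_length)
    # Summaries are tokenised without truncation and stored as labels
    encoded["labels"] = toy_tokenise(batch["summary"])["input_ids"]
    return encoded

batch = {
    "document": ["we collect your email address and name", "no data stored"],
    "summary": ["collects email", "stores nothing"],
}
result = tokenise_batch(batch)
print(result)
```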

The properties of the dataset can now be viewed. As expected, there is a train, test and validation dataset.

In [10]:
tokenised_dataset

DatasetDict({
    train: Dataset({
        features: ['summary', 'document', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 453
    })
    test: Dataset({
        features: ['summary', 'document', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 57
    })
    valid: Dataset({
        features: ['summary', 'document', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 57
    })
})

However, the "summary" and "document" columns are no longer needed, as we have their tokenised equivalents - "input_ids" and "labels".

Thus, these can be removed.

Furthermore, the dataset needs to be set to return PyTorch tensors, so that it can be used for training in PyTorch.

In [11]:
tokenised_dataset = tokenised_dataset.remove_columns(["summary","document"])
tokenised_dataset.set_format("torch")
tokenised_dataset

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 453
    })
    test: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 57
    })
    valid: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 57
    })
})

Next, a data collator needs to be defined.

(The information below is sourced from: https://huggingface.co/docs/transformers/main_classes/data_collator)

This is responsible for constructing batches and applying pre-processing such as padding to ensure all inputs in a batch are of the same size.

The Hugging Face library provides a class for creating a data collator with padding.

In [12]:
from transformers import DataCollatorWithPadding

data_collator = DataCollatorWithPadding(tokenizer=tokeniser)
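With a batch size of 1 there is nothing to pad, so only the inputs matter here. For larger batches the "labels" would also need padding to a common length. A common convention, and as far as I'm aware the one `DataCollatorForSeq2Seq` follows, is to pad labels with -100, which PyTorch's cross-entropy loss ignores by default. A plain-Python sketch of that idea, where `pad_labels` is a hypothetical helper:

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss by default

def pad_labels(label_batch, ignore_index=IGNORE_INDEX):
    """Right-pad label id lists to a common length using the ignore index."""
    longest = max(len(labels) for labels in label_batch)
    return [
        labels + [ignore_index] * (longest - len(labels))
        for labels in label_batch
    ]

# Two made-up label sequences of different lengths
padded = pad_labels([[0, 713, 16, 2], [0, 2]])
print(padded)
```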


Next, PyTorch dataloaders need to be defined, in order to load the data into the model.

A *very* small batch size is used, of just 1, as training this model uses *a lot* of GPU RAM, and anything higher than this causes Kaggle / Colab to crash.

In [13]:
from torch.utils.data import DataLoader

train_dataloader = DataLoader(
    tokenised_dataset["train"], batch_size=1, collate_fn=data_collator
)
eval_dataloader = DataLoader(
    tokenised_dataset["valid"], batch_size=1, collate_fn=data_collator
)
test_dataloader = DataLoader(
    tokenised_dataset["test"], batch_size=1, collate_fn=data_collator
)

# Model

First, the base model needs to be defined.

Next, an optimiser needs to be defined.

The standard optimiser to use is `AdamW`, but again, due to VRAM limitations, the less memory-intensive `Adafactor` optimiser will be used.

The learning rate and base model parameters will be passed to the optimiser.

`relative_step` will be set to `False` so that a custom learning rate can be defined; I used this learning rate in previous models and found it to be good.

In [14]:
from transformers import AutoModelForSeq2SeqLM
from transformers.optimization import Adafactor

base_model = AutoModelForSeq2SeqLM.from_pretrained(base_model_name)
optimiser = Adafactor(base_model.parameters(),lr=2e-5, relative_step=False)


config.json:   0%|          | 0.00/1.44k [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/1.84G [00:00<?, ?B/s]
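As a rough illustration of why Adafactor needs less memory: Adam-style optimisers keep two extra values per weight, whereas Adafactor (without momentum) keeps only one value per row and one per column of each weight matrix. The helpers below are hypothetical and ignore 1-D parameters such as biases, which are not factored.

```python
def adam_state_elements(rows, cols):
    """Adam/AdamW: first and second moment estimates for every weight."""
    return 2 * rows * cols

def adafactor_state_elements(rows, cols):
    """Adafactor without momentum: factored second moment (one row and one column vector)."""
    return rows + cols

# An illustrative 1024 x 4096 weight matrix (sizes chosen arbitrarily)
rows, cols = 1024, 4096
print(adam_state_elements(rows, cols), adafactor_state_elements(rows, cols))
```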

(This Stack Overflow post helped with defining training steps: https://stackoverflow.com/questions/60120043/optimizer-and-scheduler-for-bert-fine-tuning)

Next, the *learning rate scheduler* needs to be defined, which is responsible for reducing the learning rate as training continues.

The simplest implementation is to handle this linearly: multiply the number of epochs by the number of batches per epoch (equal to the number of training items here, as the batch size is 1) to calculate the number of "training steps", then define a scheduler that decays the learning rate linearly over those steps.

The number of warm-up steps is set to zero here.

In [15]:
from transformers.optimization import get_scheduler

epochs = 4
training_steps = epochs * len(train_dataloader)
learning_rate_scheduler = get_scheduler("linear",optimizer=optimiser,num_warmup_steps=0,num_training_steps=training_steps)
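The effect of a linear schedule with no warm-up can be sketched in plain Python. This is an illustration of the idea rather than the library's code, and `linear_lr` is a hypothetical helper: the learning rate starts at its full value and falls in a straight line to zero at the final step.

```python
def linear_lr(step, total_steps, base_lr=2e-5, warmup_steps=0):
    """Learning rate at a given step under linear warm-up followed by linear decay."""
    if step < warmup_steps:
        # Ramp up from zero during warm-up (unused when warmup_steps is 0)
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)

total_steps = 4 * 453  # epochs x batches per epoch, as in the cell above
for step in (0, total_steps // 2, total_steps):
    print(step, linear_lr(step, total_steps))
```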

Next, the base model needs to be moved to the GPU.

If the output is 'cuda' then a GPU is assigned.

In [16]:
base_model.to(device)
device

device(type='cuda')

Next, the ROUGE metric needs to be loaded for the evaluation loop.

In [17]:
!pip install evaluate
!pip install rouge_score

import evaluate

rouge = evaluate.load("rouge")

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)



Next, the model training loop needs to be defined.

By default PyTorch doesn't offer any visualisation of model training progress, so the library `tqdm` is used to visualise this.

In [18]:
from tqdm.auto import tqdm
progress_bar = tqdm(range(training_steps))

base_model.train()
for epoch in range(epochs):
    for batch in train_dataloader:
        batch = {k: v.to(device) for k, v in batch.items()}
        outputs = base_model(**batch)
        loss = outputs.loss
        loss.backward()

        optimiser.step()
        learning_rate_scheduler.step()
        optimiser.zero_grad()
        progress_bar.update(1)

  0%|          | 0/1812 [00:00<?, ?it/s]

You're using a LEDTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


OutOfMemoryError: CUDA out of memory. Tried to allocate 1.06 GiB (GPU 0; 15.89 GiB total capacity; 14.18 GiB already allocated; 960.12 MiB free; 14.66 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

Finally, the evaluation loop to be used during model training needs to be defined.

This will run against the evaluation set (eval_dataloader) defined earlier.

In [None]:
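base_model.eval()
for batch in eval_dataloader:
    batch = {k: v.to(device) for k, v in batch.items()}
    with torch.no_grad():
        outputs = base_model(**batch)

    # Most likely token at each label position
    logits = outputs.logits
    predictions = torch.argmax(logits, dim=-1)

    # ROUGE compares text, so the token ids are decoded back into strings first
    decoded_predictions = tokeniser.batch_decode(predictions, skip_special_tokens=True)
    decoded_references = tokeniser.batch_decode(batch["labels"], skip_special_tokens=True)
    rouge.add_batch(predictions=decoded_predictions, references=decoded_references)

rouge.compute()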
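ROUGE scores how much of the wording in a reference summary also appears in the generated summary, so it works on text rather than token ids. The sketch below is a simplified unigram (ROUGE-1) F1 written for illustration. It is not the `rouge_score` implementation, which handles tokenisation and the other ROUGE variants more carefully, and `rouge1_f1` is a hypothetical helper.

```python
from collections import Counter

def rouge1_f1(prediction, reference):
    """Simplified ROUGE-1 F1: unigram overlap between two whitespace-tokenised strings."""
    pred_counts = Counter(prediction.lower().split())
    ref_counts = Counter(reference.lower().split())
    # Each shared word counts as many times as it appears in both texts
    overlap = sum((pred_counts & ref_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)

# Made-up example sentences
print(rouge1_f1("we collect your email", "we collect your email and name"))
```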