# Fine-tuning led-large-book-summary on privacy policies (10 Epochs)

This notebook walks through the data processing and training of the NLP model responsible for summarising privacy policies. It uses "led-large-book-summary" as a base model, which is fine-tuned on my task of summarising privacy policies.

led-large-book-summary: https://huggingface.co/pszemraj/led-large-book-summary

This model will be trained for 10 epochs, in contrast to the previous one, which was trained for four epochs, to study the effect of the epoch count on the produced model.

**This model needs ~16GB of VRAM for training; running with less will give a CUDA out of memory error.**

# Pre-Processing
First, install and import the required libraries.

In [1]:
!pip install datasets
!pip install -U accelerate
!pip install -U transformers
from datasets import load_dataset, DatasetDict
import os

# Stop the CUDA caching allocator from splitting blocks larger than 1024 MB,
# which can reduce fragmentation and help avoid CUDA out of memory errors.
# This changes how reserved memory is managed, not how much memory is in use.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:1024"

Collecting transformers
  Obtaining dependency information for transformers from https://files.pythonhosted.org/packages/20/0a/739426a81f7635b422fbe6cb8d1d99d1235579a6ac8024c13d743efa6847/transformers-4.36.2-py3-none-any.whl.metadata
  Downloading transformers-4.36.2-py3-none-any.whl.metadata (126 kB)
Downloading transformers-4.36.2-py3-none-any.whl (8.2 MB)
Installing collected packages: transformers
  Attempting uninstall: transformers
    Found existing installation: transformers 4.36.0
    Uninstalling transformers-4.36.0:
      Successfully uninstalled transformers-4.36.0
Successfully installed transformers-4.36.2


First, the dataset needs to be converted from JSON to a "dataset" object from the `datasets` library.

This library provides a `train_test_split` to split the dataset into a test set and training set.

I shuffle the dataset first with a fixed seed, so the results are always reproducible.
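
The idea of shuffling with a fixed seed and then slicing off a test portion can be sketched in plain Python. This is only an illustration and not how the `datasets` library implements it: the helper name `shuffle_and_split` is hypothetical, and rounding the test size up is an assumption.

```python
import math
import random

def shuffle_and_split(items, test_fraction, seed):
    """Shuffle a copy of `items` with a fixed seed, then hold out a test slice."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # same seed -> same order on every run
    n_test = math.ceil(len(shuffled) * test_fraction)  # assumption: round up
    return shuffled[n_test:], shuffled[:n_test]  # (train, test)

# 242 placeholder ids stand in for the documents of this dataset
train_ids, test_ids = shuffle_and_split(range(242), test_fraction=0.1, seed=2424)
print(len(train_ids), len(test_ids))  # 217 25
```

Calling the helper again with the same seed returns exactly the same split, which is the property the fixed seed is there to provide.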

In [2]:
# Dataset files are stored in different locations depending on where the notebook is run
# Uncomment depending on location:

# Kaggle:
dataset_location = "/kaggle/input/nlp-attempt-2/dataset.json"

# Google Colab / Running Locally:
# dataset_location = "dataset.json"

dataset = load_dataset("json", data_files=dataset_location, split="train")
dataset = dataset.shuffle(seed=2424)
# shuffle=False here because the dataset was already shuffled with a fixed seed on the previous line
dataset = dataset.train_test_split(test_size=0.1, shuffle=False)
print(dataset)

DatasetDict({
    train: Dataset({
        features: ['summary', 'document'],
        num_rows: 217
    })
    test: Dataset({
        features: ['summary', 'document'],
        num_rows: 25
    })
})


The dataset has been split into two sets:
 - Train
 - Test

The train set will be used to train the model; this is the data the model will learn from.
The test set will be used to evaluate the model after training, to see how it performs on data **it has never seen**.

The dataset has two features: "document", which is a privacy policy, and "summary", which is the corresponding summary.

Next, a model needs to be selected to conduct transfer learning on.

There is a problem here: the collected privacy policies and terms and conditions are **very long**.

Below, the first item in the train set is 167,601 characters long, whilst the longest is 644,722 characters.

In [3]:
length_of_first_item = len(dataset['train'][0]['document'])
print(f'The length of the first privacy policy in the train dataset is {length_of_first_item} characters')

length_of_longest_document = len(max(dataset['train'], key=lambda x: len(x['document']))['document'])
print(f'Length of the longest privacy policy in the train dataset is {length_of_longest_document} characters')

The length of the first privacy policy in the train dataset is 167601 characters
Length of the longest privacy policy in the train dataset is 644722 characters


This is problematic because NLP models have a **maximum token count** that they can handle, often much lower than the length of the collected documents. This is exacerbated by the fact that summarisation models typically have lower maximum token counts.

This may end up affecting the accuracy of the model and its ability to learn from the data: if a portion of the document is cut off to stay within the maximum token count, the summary may not fully match the document.

Unfortunately, this is a limitation shared by most current NLP models.

The maximum input length of many common summarisation models is 1024 tokens. The exact number of characters this corresponds to depends on the text, but a rough heuristic is [1 token = 4 characters](https://help.openai.com/en/articles/4936856-what-are-tokens-and-how-to-count-them).

The model below thus supports around 4096 characters, which is clearly not enough for this data.
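
The arithmetic behind these estimates can be sketched as follows. The 4-characters-per-token ratio is only the rough heuristic linked above, so the results are ballpark figures rather than real token counts; the helper names are hypothetical.

```python
CHARS_PER_TOKEN = 4  # rough heuristic for English text, not an exact ratio

def estimated_characters(num_tokens, chars_per_token=CHARS_PER_TOKEN):
    """Approximate number of characters that fit into a token budget."""
    return num_tokens * chars_per_token

def estimated_tokens(num_characters, chars_per_token=CHARS_PER_TOKEN):
    """Approximate token count for a text of the given character length."""
    return num_characters // chars_per_token

model_limit_tokens = 1024
first_document_chars = 167_601    # length reported earlier in this notebook
longest_document_chars = 644_722  # length reported earlier in this notebook

print(estimated_characters(model_limit_tokens))  # 4096
print(estimated_tokens(first_document_chars))    # 41900
print(estimated_tokens(longest_document_chars))  # 161180
```

Under this heuristic, even the first document is roughly forty times longer than a 1024-token model can read.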

In [4]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Commented out to save memory for training...

# bart_model_checkpoint_name = "facebook/bart-large-cnn"
# bart_tokenizer = AutoTokenizer.from_pretrained(bart_model_checkpoint_name)
# bart_max_length = bart_tokenizer.model_max_length
# bart_max_length, bart_max_length*4

Thus, a model with a larger input limit is required.

Below, "led-large-book-summary" has a maximum input length of 16384 tokens. Applying the same heuristic as before, this is about 65536 characters.

This model was trained on the [BookSum dataset](https://arxiv.org/abs/2105.08209) which contains "plays, short stories, and novels" with expired copyright, and the aim of the training was to produce valuable summaries of the given documents.

I will conduct transfer learning by fine-tuning this model, which is specifically focused on books, on the new task of summarising **privacy policies**.

In [5]:
model_checkpoint_name = "pszemraj/led-large-book-summary"
tokeniser = AutoTokenizer.from_pretrained(model_checkpoint_name)
tokeniser.model_max_length, tokeniser.model_max_length * 4

(16384, 65536)

The cell above loads the tokeniser that belongs to the model.

Transformer models only take numerical inputs, so it is essential to get a numerical representation of the input.

The tokeniser is responsible for splitting text into a series of sub-word units called "tokens", each of which is mapped to a numerical ID.

Tokenisers also use special tokens, often to indicate the start and end of a sequence.

All this information helps an NLP neural network form an understanding of the input.

The output below shows the tokens used to encode the given test string. `<s>` indicates the start of a sequence and `</s>` the end, and a token that begins a new word is prefixed with `Ġ`, which encodes the space before that word.

Note that `tokeniser` was split into two tokens, `token` and `iser`. Splitting words into sub-word pieces lets the tokeniser re-use tokens wherever possible: `iser` can serve as the suffix of many different words.

An attention mask tells the model which tokens to attend to (1) and which to ignore (0), such as padding.
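
Attention masks matter most when sequences of different lengths are padded into one batch. The sketch below is a simplified illustration using toy token IDs; the helper `pad_batch` is hypothetical and `pad_token_id=1` is an assumption made for the example, not a value read from the model.

```python
def pad_batch(sequences, pad_token_id):
    """Right-pad token-id lists to equal length and build matching attention masks."""
    longest = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        padding = longest - len(seq)
        input_ids.append(list(seq) + [pad_token_id] * padding)
        # 1 marks a real token, 0 marks padding that should be ignored
        attention_mask.append([1] * len(seq) + [0] * padding)
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Toy ids; pad_token_id=1 is an assumption for this sketch
batch = pad_batch([[0, 713, 16, 2], [0, 713, 2]], pad_token_id=1)
print(batch["input_ids"])       # [[0, 713, 16, 2], [0, 713, 2, 1]]
print(batch["attention_mask"])  # [[1, 1, 1, 1], [1, 1, 1, 0]]
```

When a single sequence is tokenised on its own, nothing needs padding, so every entry of its mask is 1.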

In [6]:
test_string = tokeniser("This is a test string to test out the tokeniser")
print(test_string)
tokeniser.convert_ids_to_tokens(test_string.input_ids)

{'input_ids': [0, 713, 16, 10, 1296, 6755, 7, 1296, 66, 5, 19233, 5999, 2], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}


['<s>',
 'This',
 'Ġis',
 'Ġa',
 'Ġtest',
 'Ġstring',
 'Ġto',
 'Ġtest',
 'Ġout',
 'Ġthe',
 'Ġtoken',
 'iser',
 '</s>']
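
To see what the `Ġ` marker does, the token strings above can be joined back into text by hand. This is a simplified sketch: the helper `tokens_to_text` is hypothetical, and real decoding (for example `tokeniser.decode`) also handles other byte-level characters that this ignores.

```python
def tokens_to_text(tokens, special_tokens=("<s>", "</s>", "<pad>")):
    """Rebuild text from token strings, treating `Ġ` as an encoded leading space."""
    pieces = [tok for tok in tokens if tok not in special_tokens]
    return "".join(pieces).replace("Ġ", " ")

tokens = ["<s>", "This", "Ġis", "Ġa", "Ġtest", "Ġstring", "Ġto",
          "Ġtest", "Ġout", "Ġthe", "Ġtoken", "iser", "</s>"]
print(tokens_to_text(tokens))  # This is a test string to test out the tokeniser
```

Because `iser` carries no `Ġ`, it is glued directly onto `token`, which is how the two pieces form a single word again.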

Next, recall that the maximum input length for the model is 16384 tokens, or about 65536 characters.

The majority of the scraped documents still exceed this limit. Unfortunately, since no summarisation model with a larger input limit was available to me, the input text needs to be truncated to the maximum token count, which **will have an impact on the model's ability to learn**, but there is no other option.

Furthermore, the tokeniser produces several outputs, one of which is `input_ids`, the numerical IDs of the tokens in the tokenised text.

The `input_ids` of the tokenised summary need to be assigned to the `labels` key of the tokenised input document. This prepares each example for training: the model is given the document and learns to predict the summary.

The cell below defines and runs a function which truncates the input document, **but not the summary**, and assigns the labels and input IDs as necessary.

In [7]:
def truncate_input_tokens(batch):
    # Tokenise the documents, cutting each one off at the model's 16384-token limit
    truncated_input = tokeniser(
        batch["document"],
        max_length=16384,
        truncation=True,
    )
    # Tokenise the summaries without truncating them
    labels = tokeniser(
        batch["summary"],
        truncation=False,
    )
    truncated_input["labels"] = labels["input_ids"]
    return truncated_input

# Passing `batched=True` to `map` sends several examples to the function at once.
tokenised_dataset = dataset.map(truncate_input_tokens, batched=True)

The cell below inspects the new dataset.

The new features `input_ids`, `attention_mask` and `labels` are visible.

The lengths of the inputs are also shown: both the first input and the longest input are 16384 tokens (the maximum).

The shortest input, however, is just 393 tokens, showing that some documents stayed well under the maximum.

In [8]:
print(tokenised_dataset)

length_of_first_item = len(tokenised_dataset['train'][0]['input_ids'])
print(f'The length of the first privacy policy in the train dataset is {length_of_first_item} tokens')

length_of_longest_document = len(max(tokenised_dataset['train'], key=lambda x: len(x['input_ids']))['input_ids'])
print(f'Length of the longest privacy policy in the train dataset is {length_of_longest_document} tokens')

length_of_shortest_document = len(min(tokenised_dataset['train'], key=lambda x: len(x['input_ids']))['input_ids'])
print(f'Length of the smallest privacy policy in the train dataset is {length_of_shortest_document} tokens')

DatasetDict({
    train: Dataset({
        features: ['summary', 'document', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 217
    })
    test: Dataset({
        features: ['summary', 'document', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 25
    })
})
The length of the first privacy policy in the train dataset is 16384 tokens
Length of the longest privacy policy in the train dataset is 16384 tokens
Length of the smallest privacy policy in the train dataset is 393 tokens
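
A natural follow-up is to count how many documents hit the limit. The sketch below uses hypothetical lengths purely for illustration; the helper `truncation_report` is not part of any library, and a document sitting exactly at the limit is only *probably* truncated.

```python
def truncation_report(token_lengths, max_length):
    """Count how many sequences reach the limit and so were probably truncated."""
    at_limit = sum(1 for n in token_lengths if n >= max_length)
    return {
        "at_limit": at_limit,
        "total": len(token_lengths),
        "share": at_limit / len(token_lengths),
    }

# Hypothetical lengths, for illustration only (not the real dataset)
example_lengths = [16384, 16384, 393, 9120, 16384]
print(truncation_report(example_lengths, max_length=16384))
# {'at_limit': 3, 'total': 5, 'share': 0.6}
```

With the real data, the lengths could be collected with `[len(ids) for ids in tokenised_dataset["train"]["input_ids"]]` and passed to the same function.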


# Defining Evaluation Techniques

When evaluating summarisation tasks in NLP, the ROUGE metric is most often used.
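
As a rough intuition, ROUGE-1 measures how many single words a generated summary shares with the reference summary. The sketch below is a simplified illustration, assuming lower-casing and whitespace splitting; it is not the implementation used by the `rouge_score` library, which tokenises text differently.

```python
from collections import Counter

def rouge1(prediction, reference):
    """Simplified ROUGE-1: unigram overlap precision, recall and F1."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    # Each shared word counts at most as often as it appears in both texts
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return {"precision": 0.0, "recall": 0.0, "f1": 0.0}
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    f1 = 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}

scores = rouge1("the cat sat", "the cat sat on the mat")
print(scores)  # precision 1.0, recall 0.5, f1 about 0.667
```

ROUGE-2 applies the same idea to pairs of consecutive words, and ROUGE-L is based on the longest common subsequence of the two texts.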

To evaluate the model, I will use the Hugging Face `evaluate` library, as well as the `rouge_score` library, to produce the ROUGE score for the model's summaries.

In [9]:
!pip install evaluate
!pip install rouge_score

import evaluate

rouge = evaluate.load("rouge")

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


Collecting evaluate
  Obtaining dependency information for evaluate from https://files.pythonhosted.org/packages/70/63/7644a1eb7b0297e585a6adec98ed9e575309bb973c33b394dae66bc35c69/evaluate-0.4.1-py3-none-any.whl.metadata
  Downloading evaluate-0.4.1-py3-none-any.whl.metadata (9.4 kB)
Downloading evaluate-0.4.1-py3-none-any.whl (84 kB)
Installing collected packages: evaluate
Successfully installed evaluate-0.4.1


Collecting rouge_score
  Downloading rouge_score-0.1.2.tar.gz (17 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: rouge_score
  Building wheel for rouge_score (setup.py) ... done
  Created wheel for rouge_score: filename=rouge_score-0.1.2-py3-none-any.whl size=24932 sha256=07f9da9d29a0e808baef9e7af1cdcb968c6116e08b5952241525b04640f83772
  Stored in directory: /root/.cache/pip/wheels/5f/dd/89/461065a73be61a532ff8599a28e9beef17985c9e9c31e541b4
Successfully built rouge_score
Installing collected packages: rouge_score
Successfully installed rouge_score-0.1.2




To calculate the ROUGE metric on predictions, we need the **decoded** predictions and the **decoded** labels, as they are currently *tokenised* (labels simply refers to the expected values).

Thus, we need to decode the token IDs back into text.

The `batch_decode` function does this, with the parameter `skip_special_tokens` specifying that special tokens, such as `<s>` indicating the start of a sequence, are left out of the decoded text.

The function below will decode and find the ROUGE score.

In [10]:
import numpy as np

def decode_and_find_rouge(result):
    predictions, labels = result  # extract predictions and labels from the passed-in result

    # Label padding is commonly stored as -100, which is not a valid token id,
    # so swap any such values for the pad token id before decoding.
    labels = np.where(labels != -100, labels, tokeniser.pad_token_id)

    # Decode by converting the token ids back into text
    decoded_predictions = tokeniser.batch_decode(predictions, skip_special_tokens=True)
    decoded_labels = tokeniser.batch_decode(labels, skip_special_tokens=True)

    rouge_score = rouge.compute(predictions=decoded_predictions, references=decoded_labels)

    # Average generated length: count the non-padding ids in each prediction, then take the mean.
    num_of_predictions = [np.count_nonzero(pred != tokeniser.pad_token_id) for pred in predictions]
    rouge_score["gen_len"] = np.mean(num_of_predictions)  # stored in the dict under the key "gen_len"

    return {k: round(v, 4) for k, v in rouge_score.items()}  # round all scores to 4 decimal places

# Training the Model

First, the model hyper-parameters need to be defined.

This model is very expensive on memory and it is essential to avoid exceeding the GPU's VRAM, so the hyper-parameters are configured as follows:

1) The batch size is set to just 1 to reduce the amount of data stored in memory.

2) 16-bit floating point precision is used, rather than 32-bit.

3) The "adafactor" optimiser is used instead of the industry standard "Adam". Adafactor keeps less optimiser state, so it uses less memory, but the model may converge more slowly.

4) "Gradient Checkpointing" is used, instructing the model to discard the majority of forward-pass activations and recompute them on demand during the backward pass, keeping only the activations at selected checkpoints. This makes training significantly slower, but also saves a lot of VRAM.

5) "Gradient Accumulation" is used to simulate a larger effective batch size. Instead of updating the model's parameters after processing each batch, the gradients are accumulated over 2 batches before performing a single update.
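
The reasoning behind gradient accumulation can be checked on a toy problem. The sketch below is a minimal illustration with a single weight and a squared-error loss, not the Trainer's actual implementation; the function names and sample values are made up for the example.

```python
import math

def gradient(weight, x, y):
    """Gradient of the squared error 0.5 * (weight * x - y) ** 2 with respect to weight."""
    return (weight * x - y) * x

def batch_gradient(weight, batch):
    """Mean gradient over a batch of (x, y) pairs."""
    return sum(gradient(weight, x, y) for x, y in batch) / len(batch)

weight = 0.5
samples = [(1.0, 2.0), (2.0, 3.0)]  # toy (x, y) pairs

# One pass over a real batch of 2
full_batch = batch_gradient(weight, samples)

# Two micro-batches of 1, accumulated before a single update
accumulation_steps = 2
accumulated = 0.0
for sample in samples:
    accumulated += batch_gradient(weight, [sample]) / accumulation_steps

print(full_batch, accumulated)  # -2.75 -2.75
assert math.isclose(full_batch, accumulated)
```

With a batch size of 1 and 2 accumulation steps, the effective batch size is 1 × 2 = 2, while only one example's activations need to be held in memory at a time.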

In [11]:
from transformers import Seq2SeqTrainingArguments, Seq2SeqTrainer, DataCollatorForSeq2Seq
model = AutoModelForSeq2SeqLM.from_pretrained(model_checkpoint_name)

# Create a data collator, which forms the batches that are fed to the model.
# It also pads inputs and labels where necessary so that all items in a batch have equal length.
# The `model` argument expects the model object itself rather than the checkpoint name.
data_collator = DataCollatorForSeq2Seq(tokenizer=tokeniser, model=model)

models_arguments = Seq2SeqTrainingArguments(
    output_dir="trained_model_book_10",
    evaluation_strategy="epoch", # Run evaluation function on each epoch
    learning_rate=2e-5, # learning rate hyperparameter set to 0.00002
    per_device_train_batch_size= 1, # split into batches of 1 for training
    per_device_eval_batch_size=1, # split to batches of 1 for evaluation
    weight_decay=0.01, # Utilises L2 regularisation in an attempt to prevent overfitting
    save_total_limit=3, # save 3 checkpoints only and delete older checkpoints (Kept using all RAM without)
    num_train_epochs=10, # train for 10 epochs
    predict_with_generate=True, # Generate summaries for each input ; essential for summarisation tasks
    fp16=True, # use 16 bit floating point - reduced memory usage
    gradient_accumulation_steps=2,
    gradient_checkpointing=True,
    optim="adafactor"
)

# Collect the previously defined trainer parameters, such as the evaluation technique, tokeniser and datasets.
trainer = Seq2SeqTrainer(
    model=model,
    args=models_arguments,
    train_dataset=tokenised_dataset["train"],
    eval_dataset=tokenised_dataset["test"],
    tokenizer=tokeniser,
    data_collator=data_collator,
    compute_metrics=decode_and_find_rouge,
)

Finally, the model can be trained by calling the `train()` function on the trainer.

After training, the model will be saved in a directory called "trained_model_book_10".
 - The size of the model is ~1.7GB

In [12]:
import os

# Disable Weights & Biases logging, which is needed on Kaggle. Comment out the line below to keep it enabled.
os.environ["WANDB_DISABLED"] = "true"

trainer.train()
trainer.save_model("trained_model_book_10")

You're using a LEDTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


| Epoch | Training Loss | Validation Loss | Rouge1 | Rouge2 | Rougel | Rougelsum | Gen Len |
|---|---|---|---|---|---|---|---|
| 0 | No log | 0.701851 | 0.4914 | 0.3056 | 0.2528 | 0.2509 | 359.04 |
| 2 | No log | 0.574102 | 0.4798 | 0.2795 | 0.2536 | 0.2533 | 155.08 |
| 4 | 0.690900 | 0.553204 | 0.5045 | 0.3054 | 0.2565 | 0.2564 | 176.28 |
| 6 | 0.690900 | 0.605856 | 0.4659 | 0.289 | 0.2468 | 0.2461 | 128.0 |
| 8 | 0.690900 | 0.669184 | 0.4854 | 0.3 | 0.2616 | 0.2612 | 150.04 |
| 9 | 0.286800 | 0.671114 | 0.5087 | 0.3219 | 0.2709 | 0.2701 | 167.12 |
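
One way to read this log is to ask which logged epoch looks best under each criterion. The sketch below copies only the validation loss and Rouge1 columns from the rows shown above; it is a small illustration of comparing the two, not a statement about which checkpoint should be kept.

```python
# Values copied from the logged rows above (only the epochs that were captured)
eval_log = [
    {"epoch": 0, "validation_loss": 0.701851, "rouge1": 0.4914},
    {"epoch": 2, "validation_loss": 0.574102, "rouge1": 0.4798},
    {"epoch": 4, "validation_loss": 0.553204, "rouge1": 0.5045},
    {"epoch": 6, "validation_loss": 0.605856, "rouge1": 0.4659},
    {"epoch": 8, "validation_loss": 0.669184, "rouge1": 0.4854},
    {"epoch": 9, "validation_loss": 0.671114, "rouge1": 0.5087},
]

best_by_loss = min(eval_log, key=lambda row: row["validation_loss"])
best_by_rouge1 = max(eval_log, key=lambda row: row["rouge1"])

print(best_by_loss["epoch"], best_by_rouge1["epoch"])  # 4 9
```

Among the logged rows, validation loss is lowest at epoch 4 while Rouge1 is highest at epoch 9, so the two criteria do not agree on a single best epoch here.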


# Evaluating the model

First, we can observe the output of the model on some examples, to see how the model is handling privacy policies.

In [13]:
from transformers import pipeline, AutoTokenizer, AutoModel,LEDTokenizer, LEDForConditionalGeneration

trained_tokeniser = LEDTokenizer.from_pretrained("/kaggle/working/trained_model_book_10")

trained_model = LEDForConditionalGeneration.from_pretrained("/kaggle/working/trained_model_book_10")

summarizer = pipeline("summarization", model=trained_model,tokenizer=trained_tokeniser)


example = dataset["test"]["document"][0]

tokenised_example = trained_tokeniser(example,return_tensors="pt",truncation=True).input_ids


outputs = trained_model.generate(tokenised_example)
trained_tokeniser.decode(outputs[0], skip_special_tokens=True)

'There is a date of the last update of the agreements. This service is only available to users over a certain age. You can opt out of promotional communications. Your IP address is collected, which can be used to view your approximate location. Instead of asking directly, this Service will assume your consent merely from your usage.. The service may use tracking pixels, web beacons, browser fingerprinting, and/or device fingerprinting on users.. Blocking first party cookies may limit your ability to use the service. Third-party cookies are used for advertising. Your personal data may be sold or otherwise transferred as part of a bankruptcy proceeding or other type of financial transaction'

In [14]:
second_example = dataset["test"]["document"][1]

tokenised_example_2 = trained_tokeniser(second_example,return_tensors="pt",truncation=True).input_ids

outputs_2 = trained_model.generate(tokenised_example_2)
trained_tokeniser.decode(outputs_2[0], skip_special_tokens=True)

'Blocking first party cookies may limit your ability to use the service. The service provides details about what kinds of personal information they collect. You can request access, correction and/or deletion of your data. Your IP address is collected, which can be used to view your approximate location. There is a date of the last update of the agreements.  Terms may be changed any time at their discretion, without notice to you. The service claims to be GDPR compliant for European users. This service gives your personal data to third parties involved in its operation. Information is gathered about you through third parties. Your personal data may be sold or otherwise transferred as part of a bankruptcy proceeding or other type of financial transaction. Logs are kept for an undefined period of time. Do Not Track (DNT) headers are ignored and you are tracked anyway even if you set this header.. Tracking pixels are used in service-to-user communication. Many different types of personal d

In [15]:
lorem_example = "Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum"

tokenised_example_lorem = trained_tokeniser(lorem_example,return_tensors="pt",truncation=True).input_ids

outputs_lorem = trained_model.generate(tokenised_example_lorem)
trained_tokeniser.decode(outputs_lorem[0], skip_special_tokens=True)

'The dolor sit amet, consectetur adipiscing elit, sed do eiusmodem tempor incididunt ut labore et dolore magna aliqua. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolor eu fugiat nulla pariatur. Except for a sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum'

On the surface these summaries look sufficient.

We can look at the ROUGE scores of a test set example:

In [16]:
example_document = dataset["test"]["document"][5]
ground_truth_summary = dataset["test"]["summary"][5]

tokenised_document = trained_tokeniser(example_document,return_tensors="pt",truncation=True).input_ids

tokenised_ground_truth = trained_tokeniser(ground_truth_summary,return_tensors="pt",truncation=True).input_ids

output_summary = trained_model.generate(tokenised_document)

decoded_summary = trained_tokeniser.decode(output_summary[0], skip_special_tokens=True)

# The ROUGE metric expects lists of strings.
truth = [ground_truth_summary]
summary = [decoded_summary]

# The generated summary is the prediction and the ground truth is the reference.
rouge_score = rouge.compute(predictions=summary, references=truth)

print(rouge_score)

{'rouge1': 0.45662100456621, 'rouge2': 0.28571428571428575, 'rougeL': 0.2831050228310502, 'rougeLsum': 0.2831050228310502}


We can now compute the ROUGE score across the entire test set.

This will take a while, as a summary needs to be generated for every document in the test set.

In [17]:
summaries = []
truths = []
count = 1

for item in dataset["test"]:
    print(f'Processing document {count} of {len(dataset["test"])}')
    document = item["document"]
    ground_truth_summary = item["summary"]

    tokenised_document = trained_tokeniser(document,return_tensors="pt",truncation=True).input_ids
    generated_summary = trained_model.generate(tokenised_document)
    decoded_summary = trained_tokeniser.decode(generated_summary[0], skip_special_tokens=True)

    summaries.append(decoded_summary)
    truths.append(ground_truth_summary)
    count += 1


print(f'{len(summaries)} summaries produced and {len(truths)} ground truths collected')

rouge_score = rouge.compute(predictions=summaries,references=truths)
print(rouge_score)

Processing document 1 of 25
Processing document 2 of 25
Processing document 3 of 25
Processing document 4 of 25
Processing document 5 of 25
Processing document 6 of 25
Processing document 7 of 25
Processing document 8 of 25
Processing document 9 of 25
Processing document 10 of 25
Processing document 11 of 25
Processing document 12 of 25
Processing document 13 of 25
Processing document 14 of 25
Processing document 15 of 25
Processing document 16 of 25
Processing document 17 of 25
Processing document 18 of 25
Processing document 19 of 25
Processing document 20 of 25
Processing document 21 of 25
Processing document 22 of 25
Processing document 23 of 25
Processing document 24 of 25
Processing document 25 of 25
25 summaries produced and 25 ground truths collected
{'rouge1': 0.4990144671830846, 'rouge2': 0.3087143739207341, 'rougeL': 0.26336704673984224, 'rougeLsum': 0.26386319844376394}
