# **Finetuning Longformer Encoder-Decoder (LED) Model** 
### Trained on SGH Dataset
#### Adapted from Source: https://colab.research.google.com/drive/12LjJazBl7Gam0XBPy_y0CTOJZeZ34c2v?usp=sharing

To check that the allocated GPU has enough memory, run the following command.
If the randomly allocated GPU is too small, the notebook can be crashed and
restarted in the hope of being allocated a better GPU.

## Installing packages

In [None]:
# Checking that the GPU has sufficient memory
!nvidia-smi

Tue Nov  1 22:18:26 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   46C    P8    12W /  70W |      0MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces
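The same check can be made programmatically, which may help when deciding whether to restart the runtime. This is a minimal sketch, assuming `nvidia-smi` is on the PATH. The helper names `parse_total_memory_mib` and `gpu_total_memory_mib` and the 15000 MiB threshold are illustrative and not part of the original notebook.

```python
import subprocess


def parse_total_memory_mib(query_output):
    """Parse the nvidia-smi memory query output: one integer (MiB) per GPU line."""
    return [int(line.strip()) for line in query_output.splitlines() if line.strip()]


def gpu_total_memory_mib():
    """Return total memory (MiB) per visible GPU, or [] if nvidia-smi is unavailable."""
    try:
        result = subprocess.run(
            ["nvidia-smi", "--query-gpu=memory.total", "--format=csv,noheader,nounits"],
            capture_output=True,
            text=True,
            check=True,
        )
    except (FileNotFoundError, subprocess.CalledProcessError):
        return []
    return parse_total_memory_mib(result.stdout)


# Illustrative threshold (an assumption): roughly the size of the Tesla T4 shown above
MIN_MEMORY_MIB = 15000

memory = gpu_total_memory_mib()
if not memory:
    print("No GPU detected")
elif max(memory) < MIN_MEMORY_MIB:
    print(f"GPU memory {max(memory)} MiB is below {MIN_MEMORY_MIB} MiB")
else:
    print(f"GPU memory OK: {max(memory)} MiB")
```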

In [None]:
%%capture
# Installing packages (the %%capture cell magic must be the first line of the cell)
!pip install datasets==1.2.1
!pip install transformers==4.21.3
!pip install rouge_score
!pip install dill==0.3.4

In [None]:
!pip install huggingface_hub

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [None]:
from huggingface_hub import notebook_login

notebook_login()

Login successful
Your token has been saved to /root/.huggingface/token


## Loading data

In [None]:
from datasets import load_dataset, load_metric, Dataset

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
import pandas as pd

dfall = pd.read_csv("/content/drive/MyDrive/SGH Project/SGH_combined100.csv", encoding="utf_8")

In [None]:
from sklearn.model_selection import train_test_split
dftrain, dftest = train_test_split(dfall, test_size=0.1)

In [None]:
# Selecting only the necessary columns
dftrain = dftrain[['Article','Summary']]
dftest = dftest[['Article','Summary']]

In [None]:
# Dropping index column

dftrain = dftrain.reset_index(drop=True)
dftest = dftest.reset_index(drop=True)

In [None]:
dftrain.head(5)

|   | Article | Summary |
|---|---------|---------|
| 0 | coronavirus cases over the past few days, the ... | On 30 Sep, MOH said that there had been a 35% ... |
| 1 | Reading the article "Respect for GPs needed fo... | Reading the article "Respect for GPs needed fo... |
| 2 | WASHINGTON (REUTERS) - The fast-spreading BA.4... | The US Centers for Disease Control and Prevent... |
| 3 | I recently made an insurance claim for a heart... | Mr James Koh recently made an insurance claim ... |
| 4 | SINGAPORE - There were 12,784 new Covid-19 cas... | Singapore recorded 12,784 new COVID-19 cases o... |


In [None]:
dftest.head(5)

|   | Article | Summary |
|---|---------|---------|
| 0 | SINGAPORE - For the next six weeks, cyclists w... | ST (10 Jul) reported that over the next six we... |
| 1 | SINGAPORE: It has been more than five months s... | Tanglin Halt elderly residents who moved to Da... |
| 2 | SINGAPORE - Traditional Chinese medicine (TCM)... | Min Ong said in Parliament on 5 Oct that Tradi... |
| 3 | SINGAPORE - For the first time, Singapore will... | On 6 Jul, Minister Ong said that for the first... |
| 4 | SINGAPORE - Senior citizens in the Beo Crescen... | SGH signed a Collaborative Agreement with Thye... |


In [None]:
train_dataset = Dataset.from_pandas(dftrain, split = "train")

In [None]:
val_dataset = Dataset.from_pandas(dftest, split = "validation")

In [None]:
train_dataset

Dataset({
    features: ['Article', 'Summary'],
    num_rows: 86
})

In [None]:
val_dataset

Dataset({
    features: ['Article', 'Summary'],
    num_rows: 10
})

In [None]:
train_dataset['Article'][1]

'Reading the article "Respect for GPs needed for Healthier SG\'s big shift to preventive care, say MPs" (Oct 4), I recalled my family\'s experience with a caring and dedicated general practitioner.\r\n\r\nThis GP went beyond the call of duty.\r\n\r\nHe always called to follow up after each visit made by my elderly mother, who has several health problems.\r\n\r\n\r\nADVERTISING\r\n\r\n\r\nMy children were comfortable seeing him, and he always made the correct diagnosis.\r\n\r\nWe went to him for a few years before he moved to another clinic.\r\n\r\nEven though he works at a clinic that is farther away now, we are contemplating whether to continue seeing him.\r\n\r\nMany Singaporeans form good relationships with their GPs who have gained their trust.\r\n\r\nIssues may sometimes arise, however, when GPs need to refer patients to public hospitals.\r\n\r\nOur GP once referred my mother to a hospital, with a detailed letter on what he could diagnose at his level of care.\r\n\r\nBut I found t

In [None]:
train_dataset['Summary'][1]

'Reading the article "Respect for GPs needed for Healthier SG\'s big shift to preventive care, say MPs" (4 Oct), Ms Vivien Goh Choon Lian recounted her family\'s experience with a caring and dedicated General Practitioner (GP) who went beyond the call of duty. However, this GP moved to another clinic further away, and her family was contemplating whether to continue seeing him. The writer said many Singaporeans form good relationships with their GPs who have gained their trust. While issues may sometimes arise when GPs need to refer patients to public hospitals, the partnership between GPs and hospitals needed to be a good one. The writer noted that hospitals needed to respect the role GPs play in ensuring the healthcare system works well.'

## Data preprocessing

In [None]:
from transformers import AutoTokenizer

Next, load the tokenizer.

In [None]:
# Loading LED base instead of LED large model as it requires less GPU RAM to run.
tokenizer = AutoTokenizer.from_pretrained("allenai/led-base-16384")

In [None]:
# Max input length is set to the median length of SGH articles.  Datasets will be tokenized to max_input_length.
max_input_length = 3072
max_output_length = 512
batch_size = 4
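The comment above says `max_input_length` follows the median article length. One way such a figure could be derived is sketched here. The helpers `median_token_length` and `fraction_truncated` are hypothetical, and whitespace splitting stands in for the real tokenizer so the snippet runs on its own.

```python
from statistics import median


def median_token_length(texts, tokenize=str.split):
    """Median number of tokens per text under the given tokenize callable."""
    return median(len(tokenize(text)) for text in texts)


def fraction_truncated(texts, max_length, tokenize=str.split):
    """Share of texts whose token count exceeds max_length (these would be truncated)."""
    lengths = [len(tokenize(text)) for text in texts]
    return sum(length > max_length for length in lengths) / len(lengths)


# Toy stand-in data (an assumption); in the notebook this would be dfall["Article"]
sample_texts = ["a b c", "a b c d e", "a b c d e f g"]

print(median_token_length(sample_texts))    # median of the lengths 3, 5 and 7
print(fraction_truncated(sample_texts, 5))  # only the 7-token text exceeds 5
```

With the notebook's tokenizer, the callable could be `tokenize=lambda text: tokenizer(text).input_ids`, so that lengths are counted in model tokens rather than whitespace-separated words.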

In [None]:
# Create function to tokenize inputs and labels, create attention and global attention masks.
def process_data_to_model_inputs(batch):
    # tokenize the inputs and labels
    inputs = tokenizer(
        batch["Article"],
        padding="max_length",
        truncation=True,
        max_length=max_input_length,
    )
    outputs = tokenizer(
        batch["Summary"],
        padding="max_length",
        truncation=True,
        max_length=max_output_length,
    )

    batch["input_ids"] = inputs.input_ids
    batch["attention_mask"] = inputs.attention_mask

    # create the global_attention_mask: global attention on the first token of
    # every sample, local attention (0) everywhere else. Each sample gets its
    # own list, so the result does not rely on shared list references.
    batch["global_attention_mask"] = [
        [1] + [0] * (len(input_ids) - 1) for input_ids in batch["input_ids"]
    ]
    batch["labels"] = outputs.input_ids

    # We have to make sure that the PAD token is ignored
    batch["labels"] = [
        [-100 if token == tokenizer.pad_token_id else token for token in labels]
        for labels in batch["labels"]
    ]

    return batch
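To make the two masking steps in this function concrete, here is a small sketch on toy token ids. `PAD_ID` and the helper names are illustrative assumptions; in the notebook the pad id comes from `tokenizer.pad_token_id`. The last lines show why multiplying a list of lists yields shared references.

```python
PAD_ID = 1           # hypothetical pad token id, for illustration only
IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss by default


def build_global_attention_mask(input_ids):
    """Global attention (1) on the first token of each sample, 0 elsewhere."""
    return [[1] + [0] * (len(ids) - 1) for ids in input_ids]


def mask_padding_in_labels(labels, pad_id=PAD_ID):
    """Replace pad tokens in the labels so they do not contribute to the loss."""
    return [[IGNORE_INDEX if tok == pad_id else tok for tok in seq] for seq in labels]


input_ids = [[0, 11, 12, 2, PAD_ID], [0, 21, 2, PAD_ID, PAD_ID]]
labels = [[0, 31, 2, PAD_ID], [0, 41, 42, 2]]

print(build_global_attention_mask(input_ids))
print(mask_padding_in_labels(labels))

# List multiplication repeats a reference to the SAME inner list,
# so writing to one row is visible in every row.
shared = 2 * [[0, 0, 0]]
shared[0][0] = 1
print(shared)
```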

In [None]:
# Preprocessing training dataset
train_dataset = train_dataset.map(
    process_data_to_model_inputs,
    batched=True,
    batch_size=batch_size,
    remove_columns=["Article", "Summary"],
)

In [None]:
# Preprocessing validation dataset
val_dataset = val_dataset.map(
    process_data_to_model_inputs,
    batched=True,
    batch_size=batch_size,
    remove_columns=["Article", "Summary"],
)

In [None]:
# Convert datasets to PyTorch format
train_dataset.set_format(
    type="torch",
    columns=["input_ids", "attention_mask", "global_attention_mask", "labels"],
)
val_dataset.set_format(
    type="torch",
    columns=["input_ids", "attention_mask", "global_attention_mask", "labels"],
)

In [None]:
train_dataset

Dataset({
    features: ['attention_mask', 'global_attention_mask', 'input_ids', 'labels'],
    num_rows: 86
})

In [None]:
val_dataset

Dataset({
    features: ['attention_mask', 'global_attention_mask', 'input_ids', 'labels'],
    num_rows: 10
})

## Loading model and metrics

In [None]:
from transformers import AutoModelForSeq2SeqLM

In [None]:
led = AutoModelForSeq2SeqLM.from_pretrained("allenai/led-base-16384", gradient_checkpointing=True, use_cache=False)

In [None]:
# Set generation (beam search) parameters
led.config.num_beams = 2
led.config.max_length = 512
led.config.min_length = 100
led.config.length_penalty = 2.0
led.config.early_stopping = True
led.config.no_repeat_ngram_size = 3
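Setting `no_repeat_ngram_size = 3` asks generation to avoid producing the same trigram twice. The sketch below only illustrates the property being enforced, by listing repeated n-grams in a token sequence. It is not the library's implementation, and `repeated_ngrams` is a hypothetical helper.

```python
def repeated_ngrams(tokens, n):
    """Return the set of n-grams that occur more than once in tokens."""
    seen, repeated = set(), set()
    for i in range(len(tokens) - n + 1):
        gram = tuple(tokens[i:i + n])
        if gram in seen:
            repeated.add(gram)
        seen.add(gram)
    return repeated


# "the cat sat" appears twice, so a summary like this would violate the constraint
tokens = "the cat sat on the cat sat".split()
print(repeated_ngrams(tokens, 3))

# No trigram repeats here
print(repeated_ngrams("a b c d".split(), 3))
```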

In [None]:
rouge = load_metric("rouge")

In [None]:
# Defining compute_metrics function
# Compute metrics function expects output pred.predictions and label pred.label_ids
def compute_metrics(pred):
    labels_ids = pred.label_ids
    pred_ids = pred.predictions

    # Decoding tokens
    pred_str = tokenizer.batch_decode(pred_ids, skip_special_tokens=True)
    labels_ids[labels_ids == -100] = tokenizer.pad_token_id
    label_str = tokenizer.batch_decode(labels_ids, skip_special_tokens=True)

    # Computing all ROUGE variants in a single call
    rouge_types = ["rouge1", "rouge2", "rougeL", "rougeLsum"]
    rouge_output = rouge.compute(
        predictions=pred_str, references=label_str, rouge_types=rouge_types)

    # Report the mid (aggregate) precision, recall and F-measure for each variant
    metrics = {}
    for rouge_type in rouge_types:
        score = rouge_output[rouge_type].mid
        metrics[f"{rouge_type}_precision"] = round(score.precision, 4)
        metrics[f"{rouge_type}_recall"] = round(score.recall, 4)
        metrics[f"{rouge_type}_fmeasure"] = round(score.fmeasure, 4)

    return metrics
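For intuition about what the precision, recall and F-measure figures mean, here is a simplified ROUGE-1 sketch based on unigram overlap. It assumes lowercased whitespace tokenization and is not the `rouge_score` implementation used above, which tokenizes differently and aggregates over the whole evaluation set. The function name `rouge1` and the example sentences are illustrative.

```python
from collections import Counter


def rouge1(prediction, reference):
    """Simplified ROUGE-1: clipped unigram overlap between prediction and reference."""
    pred_counts = Counter(prediction.lower().split())
    ref_counts = Counter(reference.lower().split())

    # Counter intersection keeps the minimum count of each shared token
    overlap = sum((pred_counts & ref_counts).values())

    precision = overlap / max(sum(pred_counts.values()), 1)
    recall = overlap / max(sum(ref_counts.values()), 1)
    if precision + recall == 0:
        return 0.0, 0.0, 0.0
    fmeasure = 2 * precision * recall / (precision + recall)
    return precision, recall, fmeasure


prediction = "the gp moved to another clinic"
reference = "the gp moved to a clinic further away"
precision, recall, fmeasure = rouge1(prediction, reference)
print(round(precision, 4), round(recall, 4), round(fmeasure, 4))
```

Five of the six predicted words appear in the eight-word reference, so precision is 5/6 and recall is 5/8.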

## Model training, evaluation and deployment

In [None]:
# Importing relevant packages
from transformers import Seq2SeqTrainer, Seq2SeqTrainingArguments

In [None]:
# Defining training arguments
training_args = Seq2SeqTrainingArguments(
    predict_with_generate=True,
    evaluation_strategy="steps",
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    fp16=True,
    output_dir="summarise_v10",
    logging_steps=5,
    eval_steps=10,
    save_steps=10,
    save_total_limit=2,
    gradient_accumulation_steps=1,
    num_train_epochs=10,
    push_to_hub=True
)

PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).


In [None]:
# Pass training arguments, model, tokenizer, datasets and compute_metrics function to the trainer
trainer = Seq2SeqTrainer(
    model=led,
    tokenizer=tokenizer,
    args=training_args,
    compute_metrics=compute_metrics,
    train_dataset=train_dataset,
    eval_dataset=val_dataset
)

/content/summarise_v10 is already a clone of https://huggingface.co/debbiesoon/summarise_v10. Make sure you pull the latest changes with `repo.git_pull()`.
Using cuda_amp half precision backend


In [None]:
# Train model, generate evaluation metrics
trainer.train()

***** Running training *****
  Num examples = 90
  Num Epochs = 10
  Instantaneous batch size per device = 4
  Total train batch size (w. parallel, distributed & accumulation) = 4
  Gradient Accumulation steps = 1
  Total optimization steps = 230


| Step | Training Loss | Validation Loss | Rouge1 Precision | Rouge1 Recall | Rouge1 Fmeasure | Rouge2 Precision | Rouge2 Recall | Rouge2 Fmeasure | Rougel Precision | Rougel Recall | Rougel Fmeasure | Rougelsum Precision | Rougelsum Recall | Rougelsum Fmeasure |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 10 | 1.4834 | 1.700125 | 0.2304 | 0.6761 | 0.3152 | 0.1326 | 0.4034 | 0.1797 | 0.1495 | 0.4624 | 0.2069 | 0.1495 | 0.4624 | 0.2069 |
| 20 | 1.5011 | 1.605079 | 0.4301 | 0.5372 | 0.4087 | 0.2481 | 0.3439 | 0.245 | 0.2878 | 0.3928 | 0.2834 | 0.2878 | 0.3928 | 0.2834 |
| 30 | 0.9289 | 1.550093 | 0.431 | 0.597 | 0.4364 | 0.2653 | 0.393 | 0.2736 | 0.3007 | 0.4233 | 0.3037 | 0.3007 | 0.4233 | 0.3037 |
| 40 | 1.0895 | 1.596948 | 0.4661 | 0.5481 | 0.4486 | 0.2736 | 0.3439 | 0.2689 | 0.3318 | 0.4045 | 0.3221 | 0.3318 | 0.4045 | 0.3221 |
| 50 | 0.7785 | 1.587543 | 0.4527 | 0.5405 | 0.4209 | 0.2942 | 0.3634 | 0.272 | 0.3268 | 0.4047 | 0.3042 | 0.3268 | 0.4047 | 0.3042 |
| 60 | 0.635 | 1.608058 | 0.4142 | 0.5649 | 0.4172 | 0.242 | 0.3659 | 0.2549 | 0.2787 | 0.4156 | 0.2909 | 0.2787 | 0.4156 | 0.2909 |
| 70 | 0.514 | 1.61502 | 0.4431 | 0.5665 | 0.4569 | 0.2656 | 0.3754 | 0.2853 | 0.3252 | 0.441 | 0.3434 | 0.3252 | 0.441 | 0.3434 |
| 80 | 0.5617 | 1.644672 | 0.3956 | 0.6304 | 0.451 | 0.2353 | 0.425 | 0.2776 | 0.2883 | 0.4904 | 0.3332 | 0.2883 | 0.4904 | 0.3332 |
| 90 | 0.396 | 1.742341 | 0.4276 | 0.609 | 0.4506 | 0.2657 | 0.4142 | 0.2858 | 0.3091 | 0.4677 | 0.3316 | 0.3091 | 0.4677 | 0.3316 |
| 100 | 0.3427 | 1.757154 | 0.3877 | 0.5633 | 0.4169 | 0.216 | 0.3635 | 0.2468 | 0.2706 | 0.4314 | 0.3018 | 0.2706 | 0.4314 | 0.3018 |
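Among the logged steps, training loss keeps falling while validation loss stops improving early, a pattern that often indicates overfitting on a dataset this small. The sketch below finds the logged step with the lowest validation loss. The values are copied from the log above, and the variable names are illustrative.

```python
# (step, training loss, validation loss) copied from the training log above
history = [
    (10, 1.4834, 1.700125),
    (20, 1.5011, 1.605079),
    (30, 0.9289, 1.550093),
    (40, 1.0895, 1.596948),
    (50, 0.7785, 1.587543),
    (60, 0.635, 1.608058),
    (70, 0.514, 1.61502),
    (80, 0.5617, 1.644672),
    (90, 0.396, 1.742341),
    (100, 0.3427, 1.757154),
]

# Pick the logged step with the lowest validation loss
best_step, best_train_loss, best_val_loss = min(history, key=lambda row: row[2])
print(f"Lowest validation loss {best_val_loss} at step {best_step}")

# Gap between validation and training loss at the last logged step
last_step, last_train_loss, last_val_loss = history[-1]
print(f"Gap at step {last_step}: {last_val_loss - last_train_loss:.4f}")
```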


***** Running Evaluation *****
  Num examples = 11
  Batch size = 4
Saving model checkpoint to summarise_v10/checkpoint-10
Configuration saved in summarise_v10/checkpoint-10/config.json
Model weights saved in summarise_v10/checkpoint-10/pytorch_model.bin
tokenizer config file saved in summarise_v10/checkpoint-10/tokenizer_config.json
Special tokens file saved in summarise_v10/checkpoint-10/special_tokens_map.json
tokenizer config file saved in summarise_v10/tokenizer_config.json
Special tokens file saved in summarise_v10/special_tokens_map.json
***** Running Evaluation *****
  Num examples = 11
  Batch size = 4
Saving model checkpoint to summarise_v10/checkpoint-20
Configuration saved in summarise_v10/checkpoint-20/config.json
Model weights saved in summarise_v10/checkpoint-20/pytorch_model.bin
tokenizer config file saved in summarise_v10/checkpoint-20/tokenizer_config.json
Special tokens file saved in summarise_v10/checkpoint-20/special_tokens_map.json
***** Running Evaluation *****
 

TrainOutput(global_step=230, training_loss=0.4629465349342512, metrics={'train_runtime': 5564.2312, 'train_samples_per_second': 0.162, 'train_steps_per_second': 0.041, 'total_flos': 1822638263500800.0, 'train_loss': 0.4629465349342512, 'epoch': 10.0})
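The step count and throughput figures in this output can be reproduced from the logged settings. This is a minimal sketch, assuming a single GPU. The inputs are taken from the "Running training" log and the `TrainOutput` above, and the variable names are illustrative.

```python
import math

# Values taken from the training log and TrainOutput above
num_examples = 90
per_device_batch_size = 4
gradient_accumulation_steps = 1
num_epochs = 10
train_runtime_s = 5564.2312

# Batches per epoch; the last batch may be smaller than the batch size
batches_per_epoch = math.ceil(num_examples / per_device_batch_size)

# One optimization step per gradient_accumulation_steps batches (single device assumed)
steps_per_epoch = max(batches_per_epoch // gradient_accumulation_steps, 1)
total_steps = steps_per_epoch * num_epochs

steps_per_second = round(total_steps / train_runtime_s, 3)
samples_per_second = round(num_examples * num_epochs / train_runtime_s, 3)

print(total_steps, steps_per_second, samples_per_second)
```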

In [None]:
# Push model to HuggingFace Hub
trainer.push_to_hub(commit_message="Training complete")

Saving model checkpoint to summarise_v10
Configuration saved in summarise_v10/config.json
Model weights saved in summarise_v10/pytorch_model.bin
tokenizer config file saved in summarise_v10/tokenizer_config.json
Special tokens file saved in summarise_v10/special_tokens_map.json
Several commits (2) will be pushed upstream.
The progress bars may be unreliable.


remote: Scanning LFS files for validity, may be slow...        
remote: LFS file scan complete.        
To https://huggingface.co/debbiesoon/summarise_v10
   4a6008c..22fa740  main -> main

Dropping the following result as it does not have all the necessary fields:
{'task': {'name': 'Sequence-to-sequence Language Modeling', 'type': 'text2text-generation'}}
To https://huggingface.co/debbiesoon/summarise_v10
   22fa740..5a90b97  main -> main



'https://huggingface.co/debbiesoon/summarise_v10/commit/22fa740ce33ddf32a8f3810914215c1e7cb6def6'

In [None]:
# Save model to Google Drive
trainer.save_model('/content/drive/MyDrive/SGH Project/summarise_v10/')

Saving model checkpoint to /content/drive/MyDrive/SGH Project/summarise_v10/
Configuration saved in /content/drive/MyDrive/SGH Project/summarise_v10/config.json
Model weights saved in /content/drive/MyDrive/SGH Project/summarise_v10/pytorch_model.bin
tokenizer config file saved in /content/drive/MyDrive/SGH Project/summarise_v10/tokenizer_config.json
Special tokens file saved in /content/drive/MyDrive/SGH Project/summarise_v10/special_tokens_map.json
Saving model checkpoint to summarise_v10
Configuration saved in summarise_v10/config.json
Model weights saved in summarise_v10/pytorch_model.bin
tokenizer config file saved in summarise_v10/tokenizer_config.json
Special tokens file saved in summarise_v10/special_tokens_map.json
Dropping the following result as it does not have all the necessary fields:
{'task': {'name': 'Sequence-to-sequence Language Modeling', 'type': 'text2text-generation'}}
