# <center> Fine-Tune a Generative AI Model for News Article Summarization </center>

In this notebook, you will fine-tune an existing LLM from Hugging Face for improved news article summarization. The notebook is built around the [Zephyr-7b](https://huggingface.co/HuggingFaceH4/zephyr-7b-beta) model, a high-quality instruction-tuned model that can summarize text out of the box; because of memory constraints, the executed cells use the smaller `google/flan-t5-base` model instead. To improve the inferences, you will explore a full fine-tuning approach and evaluate the results with ROUGE metrics. Then you will perform Parameter Efficient Fine-Tuning (PEFT), evaluate the resulting model, and see that the benefits of PEFT outweigh the slightly lower performance metrics.

# Table of Contents

- [ 1 - Set up - Load Required Dependencies, Dataset and LLM](#1)
  - [ 1.1 - Set up Required Dependencies](#1.1)
  - [ 1.2 - Load Dataset and LLM](#1.2)
  - [ 1.3 - Test the Model with Zero Shot Inferencing](#1.3)
- [ 2 - Perform Full Fine-Tuning](#2)
  - [ 2.1 - Preprocess the News-Summary Dataset](#2.1)
  - [ 2.2 - Fine-Tune the Model with the Preprocessed Dataset](#2.2)
- [ 3 - Perform Parameter Efficient Fine-Tuning (PEFT)](#3)
  - [ 3.1 - Set up the PEFT/LoRA model for Fine-Tuning](#3.1)
  - [ 3.2 - Train PEFT Adapter](#3.2)
  - [ 3.3 - Evaluate the Model Qualitatively (Human Evaluation)](#3.3)
  - [ 3.4 - Evaluate the Model Quantitatively (with ROUGE Metric)](#3.4)

<a name='1'></a>
## 1 - Set up - Load Required Dependencies, Dataset and LLM

<a name='1.1'></a>
### 1.1 - Set up Required Dependencies

In [1]:
%pip install --upgrade pip
%pip install --disable-pip-version-check \
    torch==1.13.1 \
    torchdata==0.5.1 --quiet

%pip install -U transformers
%pip install -U datasets 
%pip install evaluate==0.4.0 \
    rouge_score==0.1.2 \
    loralib==0.1.1 \
    peft==0.3.0 --quiet

Note: you may need to restart the kernel to use updated packages.


Import the necessary components.

In [2]:
from datasets import load_dataset
from transformers import AutoModelForSeq2SeqLM, AutoModelForCausalLM, AutoTokenizer, GenerationConfig, TrainingArguments, Trainer
import torch
import time
import evaluate
import pandas as pd
import numpy as np



In [None]:
from kaggle_secrets import UserSecretsClient

# Read the Weights & Biases API key stored as a Kaggle secret.
secret_label = "wandb-key"
personal_key_for_api = UserSecretsClient().get_secret(secret_label)

! wandb login $personal_key_for_api

<a name='1.2'></a>
### 1.2 - Load Dataset and LLM

You are going to experiment with the [News-Sum](https://huggingface.co/datasets/glnmario/news-qa-summarization) Hugging Face dataset. It contains 10,000+ news articles with corresponding manually labeled summaries and question-answer pairs.

In [4]:
huggingface_dataset_name = "glnmario/news-qa-summarization"

dataset = (load_dataset(huggingface_dataset_name, data_files="data.jsonl", split='train').train_test_split(train_size=800, test_size=200))


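Load the pre-trained [Zephyr-7b model](https://huggingface.co/HuggingFaceH4/zephyr-7b-beta) from Hugging Face. Setting `torch_dtype=torch.bfloat16` loads the weights in bfloat16 precision, which uses half the memory of 32-bit floats. The cell below is left commented out; a smaller model is loaded further down.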

In [None]:

# model_name='HuggingFaceH4/zephyr-7b-beta'
# model = AutoModelForCausalLM.from_pretrained('HuggingFaceH4/zephyr-7b-beta', torch_dtype=torch.bfloat16)

You can count the model's parameters and check how many of them are trainable. The following function does that, which helps when comparing full fine-tuning with LoRA.

In [None]:
# def print_number_of_trainable_model_parameters(model):
#     trainable_model_params = 0
#     all_model_params = 0
#     for _, param in model.named_parameters():
#         all_model_params += param.numel()
#         if param.requires_grad:
#             trainable_model_params += param.numel()
#     return f"trainable model parameters: {trainable_model_params}\nall model parameters: {all_model_params}\npercentage of trainable model parameters: {100 * trainable_model_params / all_model_params:.2f}%"

# print(print_number_of_trainable_model_parameters(model))

Because of memory constraints, the rest of the notebook uses the much smaller `google/flan-t5-base` model.

In [5]:
model_name='google/flan-t5-base'

original_model = AutoModelForSeq2SeqLM.from_pretrained(model_name, torch_dtype=torch.bfloat16, device_map='auto')
tokenizer = AutoTokenizer.from_pretrained(model_name)

def print_number_of_trainable_model_parameters(model):
    trainable_model_params = 0
    all_model_params = 0
    for _, param in model.named_parameters():
        all_model_params += param.numel()
        if param.requires_grad:
            trainable_model_params += param.numel()
    return f"trainable model parameters: {trainable_model_params}\nall model parameters: {all_model_params}\npercentage of trainable model parameters: {100 * trainable_model_params / all_model_params:.2f}%"

print(print_number_of_trainable_model_parameters(original_model))

trainable model parameters: 247577856
all model parameters: 247577856
percentage of trainable model parameters: 100.00%
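
To get a feel for why the parameter count matters for memory, the arithmetic can be sketched as follows. This is a rough estimate under stated assumptions, not a measurement: 2 bytes per parameter for bfloat16, and, for training, one gradient plus two Adam moment buffers per trainable parameter stored in the same precision. Activations and framework overhead are ignored, and the 7-billion figure is a round-number assumption for a Zephyr-sized model.

```python
# Back-of-the-envelope memory estimate (a sketch, not a measurement).
BYTES_PER_BF16 = 2  # bfloat16 stores each value in 16 bits


def weights_gib(num_params, bytes_per_param=BYTES_PER_BF16):
    """Memory needed just to hold the weights, in GiB."""
    return num_params * bytes_per_param / 1024**3


def full_finetune_gib(num_params, bytes_per_param=BYTES_PER_BF16):
    """Weights + gradients + two Adam moment buffers (assumed same dtype)."""
    return 4 * weights_gib(num_params, bytes_per_param)


flan_t5_base_params = 247_577_856  # printed by the cell above
assumed_7b_params = 7_000_000_000  # assumption: round figure for a 7B model

for name, n in [("flan-t5-base", flan_t5_base_params), ("7B model", assumed_7b_params)]:
    print(f"{name}: weights ~{weights_gib(n):.2f} GiB, "
          f"full fine-tuning state ~{full_finetune_gib(n):.2f} GiB")
```

Real usage is higher once activations for padded 512-token batches are included, which is one reason a method that trains only a small adapter is attractive.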


<a name='1.3'></a>
### 1.3 - Test the Model with Zero Shot Inferencing

Test the model with zero-shot inference. You can see that the model struggles to summarize the news article compared to the baseline summary, but it does pull out some important information from the text, which indicates the model can be fine-tuned to the task at hand.

In [6]:
index = 40

story = dataset['train'][index]['story']
summary = dataset['train'][index]['summary']
prompt =  f"""
Summarize the following article.

{story}

Summary:
"""

inputs = tokenizer(prompt, return_tensors='pt')
output = tokenizer.decode(
    original_model.generate(
        inputs["input_ids"], 
        max_new_tokens=60,
    )[0], 
    skip_special_tokens=True
)

dash_line = '-'.join('' for x in range(100))
print(dash_line)
print(f'INPUT PROMPT:\n{prompt}')
print(dash_line)
print(f'BASELINE HUMAN SUMMARY:\n{summary}\n')
print(dash_line)
print(f'MODEL GENERATION - ZERO SHOT:\n{output}')

Token indices sequence length is longer than the specified maximum sequence length for this model (936 > 512). Running this sequence through the model will result in indexing errors


---------------------------------------------------------------------------------------------------
INPUT PROMPT:

Summarize the following article.

WASHINGTON (CNN)  -- President Richard M. Nixon and his Brazilian counterpart, Emilio Medici, in 1971 discussed ways their countries could work together to overthrow the socialist government of Salvador Allende in Chile, according to a newly declassified document. President Richard M. Nixon, right, and his Brazilian counterpart, Emilio Medici. During a meeting of the two leaders at the White House on December 9 of that year, Medici was discussing the possibility of a coup by the Chilean military with assistance from Brazilian military officers when Nixon said that it was "very important that Brazil and the United States work closely in this field," according to the document. Nixon offered money or other discreet aid for the effort if it could be made available, the document shows. "We must try and prevent new Allendes and Castros, and try 

<a name='2'></a>
## 2 - Perform Full Fine-Tuning

<a name='2.1'></a>
### 2.1 - Preprocess the News-Summary Dataset

You need to convert the News-Summary (prompt-response) pairs into explicit instructions for the LLM. Prepend the instruction `Summarize the following article.` to the start of the article and append `Summary:` after it, so that the summary follows as the response:

Training prompt (article):
```
Summarize the following article.

For many travelers, duty-free is a luxurious enigma wrapped up in discounted Swiss chocolate and soaked in tax-free vodka. Duty-free goods are mostly sold inside international airport terminals, ferry stations, cruise ports, and border stops. 

Duty-free shops sell products without local import tax. 

As the name implies, duty-free shops sell products without duty (a.k.a. local import tax). For example, by buying goods in a duty-free shop at Paris's Charles de Gaulle, you avoid paying the duty that France slaps on imported goods (like Swedish vodka) and that French stores ordinarily include as part of a product's list price. 

In Europe, there's a bonus perk: Duty-free shops in airports and ports are "tax-free shops," too, which means you are spared the value added tax (or V.A.T., a type of sales tax) that would otherwise be included in the price of goods sold elsewhere in the European Union. That means a savings of between 5 and 25 percent, depending on the country. 

But there's a catch for duty-free products bought in Europe and elsewhere. If you bring into the U.S. more than $800 worth of items purchased abroad -- duty-free or not -- you'll have to pay the U.S. duty. As a rule of thumb, Americans returning from overseas trips must pay 3 percent on the first $1,000 worth of merchandise over the $800 allowance. Import products worth even more than that and you may be taxed at a higher percentage. 

In short, duty-free is hit-or-miss for Americans. The best deals are on items labeled "tax free" and otherwise taxed heavily -- such as alcohol and cigarettes. You may also find it worthwhile to shop in duty-free stores if you have some local currency left and would rather put it to use than redeem it for dollars (and get hit with the high conversion fee of a bank or currency exchange bureau). 

Not every duty-free item is a true bargain. Yngve Bia, president of the duty-free research company Generation Research, says price differences depend on two things: geography and currency exchange rates. "Right now, Heathrow and Gatwick in London offer good deals, especially for liquor, because of the weak British pound," he says. For example, a one-liter bottle of Absolut vodka has a typical non-duty-free price of about $30 at retail U.S. shops. But travelers can buy it for just $15 (£10) at duty free shops at London's Heathrow and Gatwick airports. That's a significant savings.

Summary: 
```

Training response (summary):
```
Duty-free shopping is hit-or-miss for Americans .
The best deals are on items labeled "tax free" and otherwise taxed heavily .
For countries in the EU, duty-free shops at airports and ports are also tax-free .
```

Then preprocess the prompt-response dataset into tokens and pull out their `input_ids` (1 per token).

In [7]:
def tokenize_function(example):
    
    start_prompt = 'Summarize the following article.\n\n'
    end_prompt = '\n\nSummary: '
    prompt = [start_prompt + story + end_prompt for story in example["story"]]
    example['input_ids'] = tokenizer(prompt, padding="max_length", truncation=True, return_tensors="pt").input_ids
    example['labels'] = tokenizer(example["summary"], padding="max_length", truncation=True, return_tensors="pt").input_ids
    
    return example

# After train_test_split, the dataset contains 2 splits: train and test.
# tokenize_function is applied to both splits in batches.

tokenized_datasets = dataset.map(tokenize_function, batched=True)
tokenized_datasets = tokenized_datasets.remove_columns(['story', 'questions', 'answers', 'summary',])
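
One detail worth knowing about the `labels` column: with `padding="max_length"`, most label positions hold the pad-token id. A common refinement (an assumption here, not something the cell above does) is to replace those ids with `-100`, the index that PyTorch's cross-entropy loss ignores by default, so padding does not contribute to the loss. The sketch below uses plain Python lists and a hypothetical helper `mask_pad_tokens`; the toy token ids are made up, and the pad id should be read from `tokenizer.pad_token_id` in practice.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss


def mask_pad_tokens(label_ids, pad_token_id=0):
    """Return a copy of label_ids (list of lists) with pad ids replaced by IGNORE_INDEX."""
    return [
        [IGNORE_INDEX if tok == pad_token_id else tok for tok in seq]
        for seq in label_ids
    ]


# Toy batch: two label sequences padded to length 6 with pad id 0.
toy_labels = [
    [71, 1712, 1, 0, 0, 0],
    [27, 43, 3, 9, 1, 0],
]
masked_labels = mask_pad_tokens(toy_labels)
print(masked_labels)
```

Inside `tokenize_function`, the same idea would be applied to the tokenized summaries before assigning them to `example['labels']`.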

To save some time in the lab, you will subsample the dataset:

In [8]:
tokenized_datasets = tokenized_datasets.filter(lambda example, index: index % 2 == 0, with_indices=True)

In [9]:
print(f"Shapes of the datasets:")
print(f"Training: {tokenized_datasets['train'].shape}")
print(f"Test: {tokenized_datasets['test'].shape}")

print(tokenized_datasets)

Shapes of the datasets:
Training: (400, 2)
Test: (100, 2)
DatasetDict({
    train: Dataset({
        features: ['input_ids', 'labels'],
        num_rows: 400
    })
    test: Dataset({
        features: ['input_ids', 'labels'],
        num_rows: 100
    })
})


<a name='2.2'></a>
### 2.2 - Fine-Tune the Model with the Preprocessed Dataset

Now utilize the built-in Hugging Face `Trainer` class (see the documentation [here](https://huggingface.co/docs/transformers/main_classes/trainer)). Pass the preprocessed dataset with reference to the original model. Other training parameters are found experimentally and there is no need to go into details about those at the moment.

In [None]:
output_dir = f'./article-summary-training-{str(int(time.time()))}'

training_args = TrainingArguments(
    output_dir=output_dir,
    learning_rate=1e-5,
    num_train_epochs=15,
    weight_decay=0.01,
    logging_steps=1,
    max_steps=1, # When set, max_steps takes precedence over num_train_epochs.
    save_strategy='epoch'
)

trainer = Trainer(
    model=original_model,
    args=training_args,
    train_dataset=tokenized_datasets['train'],
    eval_dataset=tokenized_datasets['test']
)

In [None]:
# trainer.train()
# Left commented out: this call raises an out-of-memory error under the available memory constraints.

Because full fine-tuning needs more memory than is available here, let's try the PEFT method instead.

<a name='3'></a>
## 3 - Perform Parameter Efficient Fine-Tuning (PEFT)

Now, let's perform **Parameter Efficient Fine-Tuning (PEFT)** as opposed to the "full fine-tuning" attempted above. PEFT is a form of instruction fine-tuning that is much more efficient than full fine-tuning, often with comparable evaluation results.

PEFT is a generic term that includes **Low-Rank Adaptation (LoRA)** and prompt tuning (which is NOT THE SAME as prompt engineering!). In most cases, when someone says PEFT, they typically mean LoRA. LoRA, at a very high level, allows the user to fine-tune their model using fewer compute resources (in some cases, a single GPU). After fine-tuning for a specific task, use case, or tenant with LoRA, the result is that the original LLM remains unchanged and a newly-trained “LoRA adapter” emerges. This LoRA adapter is much, much smaller than the original LLM - on the order of a single-digit % of the original LLM size (MBs vs GBs).  

That said, at inference time, the LoRA adapter needs to be reunited and combined with its original LLM to serve the inference request.  The benefit, however, is that many LoRA adapters can re-use the original LLM which reduces overall memory requirements when serving multiple tasks and use cases.
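
In symbols, following the standard LoRA formulation rather than anything specific to this notebook, a frozen weight matrix $W_0$ is augmented with a trainable low-rank product, scaled by `lora_alpha` ($\alpha$) over the rank $r$:

```latex
h = W_0 x + \frac{\alpha}{r} B A x, \qquad
W_0 \in \mathbb{R}^{d_{out} \times d_{in}},\quad
B \in \mathbb{R}^{d_{out} \times r},\quad
A \in \mathbb{R}^{r \times d_{in}},\quad
r \ll \min(d_{in}, d_{out})
```

Only $A$ and $B$ are trained, so each adapted matrix contributes $r\,(d_{in} + d_{out})$ trainable parameters instead of $d_{in}\, d_{out}$. The adapter file holds just these matrices, which is why it is so much smaller than the base model.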

<a name='3.1'></a>
### 3.1 - Set up the PEFT/LoRA model for Fine-Tuning

You need to set up the PEFT/LoRA model for fine-tuning with a new layer/parameter adapter. Using PEFT/LoRA, you freeze the underlying LLM and train only the adapter. Have a look at the LoRA configuration below. Note the rank (`r`) hyper-parameter, which defines the rank/dimension of the adapter to be trained.

In [21]:
from peft import LoraConfig, get_peft_model, TaskType

lora_config = LoraConfig(
    r=16, # Rank
    lora_alpha=16,
    target_modules=["q", "v"],
    lora_dropout=0.05,
    bias="all",
    task_type=TaskType.SEQ_2_SEQ_LM # FLAN-T5
)

In [22]:
peft_model = get_peft_model(original_model, 
                            lora_config)
print(print_number_of_trainable_model_parameters(peft_model))

trainable model parameters: 1770240
all model parameters: 249347328
percentage of trainable model parameters: 0.71%
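
The printed numbers can be reproduced by hand. The sketch below assumes the published FLAN-T5-base shape (model width 768, 12 encoder and 12 decoder blocks, with `q` and `v` projections in every self-attention and cross-attention layer). Those figures are not stated in this notebook, so treat them as assumptions. Each adapted projection gains two matrices, `A` (r x d) and `B` (d x r).

```python
def lora_param_count(r, d_in, d_out, num_modules):
    """Parameters added by LoRA: A is (r x d_in) and B is (d_out x r) per module."""
    return num_modules * r * (d_in + d_out)


# Assumed FLAN-T5-base architecture (not stated in the notebook).
d_model = 768
encoder_blocks = 12
decoder_blocks = 12

# One attention layer per encoder block, two (self + cross) per decoder block.
attention_layers = encoder_blocks + 2 * decoder_blocks
adapted_modules = attention_layers * 2  # target_modules=["q", "v"]

added = lora_param_count(r=16, d_in=d_model, d_out=d_model, num_modules=adapted_modules)
print(f"LoRA parameters added: {added:,}")

# Compare with the totals printed before and after get_peft_model.
print(f"Difference in printed totals: {249_347_328 - 247_577_856:,}")
```

Both lines give 1,769,472. The trainable count printed above is 768 higher (1,770,240); a plausible explanation, not verified here, is that `bias="all"` also unfreezes parameters whose names contain "bias", such as T5's relative attention bias tables.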


<a name='3.2'></a>
### 3.2 - Train PEFT Adapter

Define training arguments and create `Trainer` instance.

In [23]:
output_dir = f'./peft-article-summary-training-{str(int(time.time()))}'

peft_training_args = TrainingArguments(
    output_dir=output_dir,
    auto_find_batch_size=True,
    save_strategy='epoch',
    learning_rate=1e-3, # Higher learning rate than full fine-tuning.
    num_train_epochs=15,
    logging_steps=1,
    max_steps=20, # When set, max_steps takes precedence over num_train_epochs.
    logging_dir=f'{output_dir}/logs'
)
    
peft_trainer = Trainer(
    model=peft_model,
    args=peft_training_args,
    train_dataset=tokenized_datasets["train"],
)
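
Because `max_steps` is set, it takes precedence over `num_train_epochs` in `TrainingArguments`, so training stops after a fixed number of optimizer steps. The following sketch estimates how much of the data that covers. It assumes a single device, no gradient accumulation, and the `TrainingArguments` default batch size of 8; `auto_find_batch_size=True` may fall back to a smaller value if memory runs out, so the batch size is a parameter of the hypothetical helper.

```python
import math


def training_coverage(num_examples, max_steps, batch_size=8):
    """Estimate steps per epoch, examples seen, and epochs covered by max_steps."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    examples_seen = max_steps * batch_size
    epochs_covered = max_steps / steps_per_epoch
    return steps_per_epoch, examples_seen, epochs_covered


steps_per_epoch, examples_seen, epochs_covered = training_coverage(
    num_examples=400,  # size of the subsampled training split
    max_steps=20,
)
print(f"steps per epoch: {steps_per_epoch}")
print(f"examples seen: {examples_seen}")
print(f"epochs covered: {epochs_covered:.2f} (of the 15 requested)")
```

Under these assumptions the adapter sees fewer than half of the 400 training examples once, so the run is closer to a smoke test than a full training schedule.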

In [24]:

peft_trainer.train()

peft_model_path="./peft-article-summary-checkpoint-local"

peft_trainer.model.save_pretrained(peft_model_path)
tokenizer.save_pretrained(peft_model_path)

| Step | Training Loss |
|------|---------------|
| 1 | 43.5 |
| 2 | 47.5 |
| 3 | 44.0 |
| 4 | 43.75 |
| 5 | 43.0 |
| 6 | 40.5 |
| 7 | 38.75 |
| 8 | 37.5 |
| 9 | 36.0 |
| 10 | 34.25 |


('./peft-article-summary-checkpoint-local/tokenizer_config.json',
 './peft-article-summary-checkpoint-local/special_tokens_map.json',
 './peft-article-summary-checkpoint-local/spiece.model',
 './peft-article-summary-checkpoint-local/added_tokens.json',
 './peft-article-summary-checkpoint-local/tokenizer.json')

In [25]:
from peft import PeftModel, PeftConfig

peft_model_base = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-base", torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-base")

peft_model = PeftModel.from_pretrained(peft_model_base, 
                                       './peft-article-summary-checkpoint-local/', 
                                       torch_dtype=torch.bfloat16,
                                       is_trainable=False)

The number of trainable parameters will be `0` because the adapter was loaded with `is_trainable=False`:

In [15]:
print(print_number_of_trainable_model_parameters(peft_model))

trainable model parameters: 0
all model parameters: 249347328
percentage of trainable model parameters: 0.00%


<a name='3.3'></a>
### 3.3 - Evaluate the Model Qualitatively (Human Evaluation)

Generate summaries for a single test example with the original model and the PEFT model, using the same prompt format as in section [1.3](#1.3). The fully fine-tuned model is left commented out because full fine-tuning was not run.

In [16]:
index = 10
story = dataset['test'][index]['story']
baseline_human_summary = dataset['test'][index]['summary']

prompt = f"""
Summarize the following article.

{story}

Summary: """

input_ids = tokenizer(prompt, return_tensors="pt").input_ids

original_model_outputs = original_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200, num_beams=1))
original_model_text_output = tokenizer.decode(original_model_outputs[0], skip_special_tokens=True)

# instruct_model_outputs = instruct_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200, num_beams=1))
# instruct_model_text_output = tokenizer.decode(instruct_model_outputs[0], skip_special_tokens=True)

peft_model_outputs = peft_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200, num_beams=1))
peft_model_text_output = tokenizer.decode(peft_model_outputs[0], skip_special_tokens=True)

print(dash_line)
print(f'BASELINE HUMAN SUMMARY:\n{baseline_human_summary}')
print(dash_line)
print(f'ORIGINAL MODEL:\n{original_model_text_output}')
print(dash_line)
# print(f'INSTRUCT MODEL:\n{instruct_model_text_output}')
# print(dash_line)
print(f'PEFT MODEL: {peft_model_text_output}')

Token indices sequence length is longer than the specified maximum sequence length for this model (952 > 512). Running this sequence through the model will result in indexing errors


---------------------------------------------------------------------------------------------------
BASELINE HUMAN SUMMARY:
John Stremlau: 20 years ago today, Nelson Mandela was released from Cape Town jail .
Stremlau: After 27 years, Mandela emerged "without bitterness, his humanity intact"
Mandela maintains commitment to democracy, freedom and rule of law, he writes .
Mandela is inspiration in troubled nation that desperately needs it, Stremlau says .
---------------------------------------------------------------------------------------------------
ORIGINAL MODEL:
John Stremlau: Mandela has declined to participate in this week's many celebrations in his honor.
---------------------------------------------------------------------------------------------------
PEFT MODEL: John Stremlau: The world's most famous political prisoner emerged without bitterness, his humanity intact.


<a name='3.4'></a>
### 3.4 - Evaluate the Model Quantitatively (with ROUGE Metric)

Perform inferences for a sample of the test dataset (only 10 articles and summaries, to save time).

In [26]:
stories = dataset['test'][0:10]['story']
human_baseline_summaries = dataset['test'][0:10]['summary']

original_model_summaries = []
# instruct_model_summaries = []
peft_model_summaries = []

for idx, story in enumerate(stories):
    prompt = f"""
Summarize the following article.

{story}

Summary: """
    
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids

    human_baseline_text_output = human_baseline_summaries[idx]
    
    original_model_outputs = original_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200))
    original_model_text_output = tokenizer.decode(original_model_outputs[0], skip_special_tokens=True)

#     instruct_model_outputs = instruct_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200))
#     instruct_model_text_output = tokenizer.decode(instruct_model_outputs[0], skip_special_tokens=True)

    peft_model_outputs = peft_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200))
    peft_model_text_output = tokenizer.decode(peft_model_outputs[0], skip_special_tokens=True)

    original_model_summaries.append(original_model_text_output)
#     instruct_model_summaries.append(instruct_model_text_output)
    peft_model_summaries.append(peft_model_text_output)

zipped_summaries = list(zip(human_baseline_summaries, original_model_summaries, peft_model_summaries))
 
df = pd.DataFrame(zipped_summaries, columns = ['human_baseline_summaries', 'original_model_summaries', 'peft_model_summaries'])
df.to_csv('model_output_results.csv',index=False)

Token indices sequence length is longer than the specified maximum sequence length for this model (536 > 512). Running this sequence through the model will result in indexing errors


Compute ROUGE score for this subset of the data. 

In [27]:
rouge = evaluate.load('rouge')

original_model_results = rouge.compute(
    predictions=original_model_summaries,
    references=human_baseline_summaries[0:len(original_model_summaries)],
    use_aggregator=True,
    use_stemmer=True,
)

# instruct_model_results = rouge.compute(
#     predictions=instruct_model_summaries,
#     references=human_baseline_summaries[0:len(instruct_model_summaries)],
#     use_aggregator=True,
#     use_stemmer=True,
# )

peft_model_results = rouge.compute(
    predictions=peft_model_summaries,
    references=human_baseline_summaries[0:len(peft_model_summaries)],
    use_aggregator=True,
    use_stemmer=True,
)

print('ORIGINAL MODEL:')
print(original_model_results)
# print('INSTRUCT MODEL:')
# print(instruct_model_results)
print('PEFT MODEL:')
print(peft_model_results)

ORIGINAL MODEL:
{'rouge1': 0.2388080517387528, 'rouge2': 0.08595817328211694, 'rougeL': 0.1814362801134182, 'rougeLsum': 0.23031945093532405}
PEFT MODEL:
{'rouge1': 0.25536953771640963, 'rouge2': 0.10062124166103695, 'rougeL': 0.19085985743535694, 'rougeLsum': 0.21585355759610594}
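
For intuition about what these numbers measure, ROUGE-1 is an F-measure over unigram overlap between a prediction and its reference. The sketch below is a simplified re-implementation for illustration only: it lowercases and splits on whitespace, whereas the `rouge` metric loaded above applies its own tokenization and, with `use_stemmer=True`, stemming, so its scores will differ. The helper name `rouge1_f1` and the example sentences are made up.

```python
from collections import Counter


def rouge1_f1(prediction, reference):
    """Simplified ROUGE-1 F1: unigram overlap on lowercased, whitespace-split text."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return 0.0
    # Clipped overlap: a token counts at most as often as it appears in both texts.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


reference = "mandela was released from jail after 27 years"
prediction = "mandela was released after 27 years in prison"
score = rouge1_f1(prediction, reference)
print(f"ROUGE-1 F1 (simplified): {score:.3f}")
```

ROUGE-2 applies the same idea to bigrams, and ROUGE-L replaces the overlap count with the length of the longest common subsequence.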


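Notice that the PEFT model scores slightly above the original model on ROUGE-1, ROUGE-2 and ROUGE-L and slightly below it on ROUGE-Lsum, even though only about 0.71% of the parameters were trained.

Calculate the absolute improvement of PEFT over the original model, in percentage points: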

In [28]:
print("Absolute percentage improvement of PEFT MODEL over ORIGINAL MODEL")

improvement = (np.array(list(peft_model_results.values())) - np.array(list(original_model_results.values())))
for key, value in zip(peft_model_results.keys(), improvement):
    print(f'{key}: {value*100:.2f}%')

Absolute percentage improvement of PEFT MODEL over ORIGINAL MODEL
rouge1: 1.66%
rouge2: 1.47%
rougeL: 0.94%
rougeLsum: -1.45%
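
The figures above are absolute differences in ROUGE score, expressed in percentage points. A relative change can be easier to interpret when the baseline score is small. The sketch below recomputes both from the scores printed earlier, rounded to four decimals; the helper name `compare_scores` is illustrative.

```python
# ROUGE scores copied from the output above, rounded to four decimals.
original = {"rouge1": 0.2388, "rouge2": 0.0860, "rougeL": 0.1814, "rougeLsum": 0.2303}
peft = {"rouge1": 0.2554, "rouge2": 0.1006, "rougeL": 0.1909, "rougeLsum": 0.2159}


def compare_scores(baseline, candidate):
    """Return {metric: (absolute change in points, relative change in percent)}."""
    rows = {}
    for key, base in baseline.items():
        diff = candidate[key] - base
        rows[key] = (diff * 100, diff / base * 100)
    return rows


for key, (points, relative) in compare_scores(original, peft).items():
    print(f"{key}: {points:+.2f} points, {relative:+.1f}% relative")
```

With only 10 test articles, differences of this size should be read with caution; a larger evaluation sample would give a more reliable comparison.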


Training for more steps, or tuning the LoRA hyperparameters, would likely improve these results further.