# Fine-tune a Model and Evaluate it using ROUGE Metrics

* Make sure you change the kernel to **PyTorch 2.6** to test the notebook
* We mark **TODO** in the notebook cells to indicate the place where you need to complete the missing code. You can refer to the exercises in the course repository for code examples.

## Install necessary packages

This is a onestep process to install necessary bitsandbytes(Alpha release) for the notebook.
1. Run below cell uncommenting the installation commands, after successful installation, comment back again.
2. Now Restart the kernel. `Kernel->Restart Kernel`
3. Now run the cells normally.

In [1]:
import sys
import os
import site
from pathlib import Path

!echo "Installation in progress, please wait..."
!{sys.executable} -m pip cache purge > /dev/null

%pip install --user --upgrade transformers datasets trl peft accelerate scipy sentencepiece ipywidgets evaluate rouge_score --no-warn-script-location

!echo "Installation completed."

# Get the site-packages directory
site_packages_dir = site.getsitepackages()[0]

# add the site pkg directory where these pkgs are insalled to the top of sys.path
if not os.access(site_packages_dir, os.W_OK):
    user_site_packages_dir = site.getusersitepackages()
    if user_site_packages_dir in sys.path:
        sys.path.remove(user_site_packages_dir)
    sys.path.insert(0, user_site_packages_dir)
else:
    if site_packages_dir in sys.path:
        sys.path.remove(site_packages_dir)
    sys.path.insert(0, site_packages_dir)

Installation in progress, please wait...
Collecting peft
  Downloading peft-0.15.0-py3-none-any.whl.metadata (13 kB)
Downloading peft-0.15.0-py3-none-any.whl (410 kB)
[0mInstalling collected packages: peft
  Attempting uninstall: peft
    Found existing installation: peft 0.14.0
    Uninstalling peft-0.14.0:
      Successfully uninstalled peft-0.14.0
[0mSuccessfully installed peft-0.15.0
[0mNote: you may need to restart the kernel to use updated packages.
Installation completed.


## Import necessary packages

In [2]:
import torch
import os

os.environ["WANDB_DISABLED"] = "true"
import transformers
from transformers import AutoTokenizer
from peft import LoraConfig
from transformers import BitsAndBytesConfig, AutoModelForCausalLM
from peft import get_peft_model, prepare_model_for_kbit_training
from datasets import load_dataset
from trl import SFTTrainer, SFTConfig
from evaluate import load

Using the `WANDB_DISABLED` environment variable is deprecated and will be removed in v5. Use the --report_to flag to control the integrations used for logging result (for instance --report_to none).


## Login to HuggingFace

In [3]:
from huggingface_hub import notebook_login
notebook_login()

VBox(children=(HTML(value='<center> <img\nsrc=https://huggingface.co/front/assets/huggingface_logo-noborder.sv…

## Load Gemma-2-2b-it Model from HuggingFace Hub

In [4]:
model_path = "google/gemma-2-2b-it"

# TODO: create tokenizer using AutoTokenizer class
# tokenizer = ...
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path,
                                             attn_implementation='eager',
                                             device_map="auto")


  Overriding a previously registered kernel for the same operator and the same dispatch key
  operator: aten::_cummax_helper(Tensor self, Tensor(a!) values, Tensor(b!) indices, int dim) -> ()
    registered at /build/pytorch/build/aten/src/ATen/RegisterSchema.cpp:6
  dispatch key: XPU
  previous kernel: registered at /build/pytorch/build/aten/src/ATen/RegisterCPU.cpp:30476
       new kernel: registered at /build/intel-pytorch-extension/build/Release/csrc/gpu/csrc/aten/generated/ATen/RegisterXPU.cpp:2971 (function operator())
2025-03-20 00:47:02,198 - accelerate.utils.modeling - INFO - We will use 90% of the memory on device 0 for storing the model, and 10% for the buffer to avoid OOM. You can set `max_memory` in to a higher value to use more memory (at your own risk).


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

## Load, Format, and Split Dataset

In [5]:
def process_dataset(sample):
    messages = [
        {"role": "user", "content": f"Instruction:\nSummarize the following article.\n\nInput:\n{sample['Articles']}"},
        {"role": "assistant", "content": sample['Summaries']}
    ]
    sample = tokenizer.apply_chat_template(messages, tokenize=True, return_dict=True)
    return sample

dataset = load_dataset("gopalkalpande/bbc-news-summary", split="train")
dataset = dataset.map(process_dataset)

split_dataset = dataset.train_test_split(test_size=0.1, seed=99)
train_dataset = split_dataset["train"]
validation_dataset = split_dataset["test"]

## Evaluate Base Model Summaries using ROUGE Metric

In [6]:
rouge = load('rouge')

# initialize lists of predictions and references later used to compute rouge scores
predictions = []
references = []

# iterate through the first 15 samples
for article, abstract in zip(validation_dataset["Articles"][:15], validation_dataset["Summaries"][:15]):
    messages = [
        {"role": "user", "content": f"Instruction:\nSummarize the following article.\n\nInput:\n{article}"},
    ]
    input_ids = tokenizer.apply_chat_template(messages,
                                              tokenize=True,
                                              add_generation_prompt=True,
                                              return_tensors="pt").to("xpu")
    
    # TODO: perform model inference using the tokens in ``input_ids''
    # output =   
    output = model.generate(
        input_ids=input_ids,
        max_new_tokens=100,  # Adjust token limit as needed
    )
    
    # Remove input prompt from output
    prompt_length = input_ids.shape[1]
    answer = tokenizer.decode(output[0][prompt_length:], skip_special_tokens=True)
    
    # TODO: add one answer to the ``predictions'' list, which is later passed to rouge compute
    # 
    predictions.append(answer)
    # TODO: add one abstract to the ``references'' list, which is later passed to rouge compute
    # 
    references.append(abstract)
    
    
    print(100*'-')
    print("Abstract:", abstract)
    print(100*'-')
    print("Model Summary:", answer)

print(100*'-')
# TODO: compute and print out the rouge scores including rouge1, rouge2, rougeL and rougeLsum
# TODO: you can refer to the exercise that computes rouge scores
# print(...)
rouge_scores = rouge.compute(predictions=predictions, references=references)
print("ROUGE-1:", rouge_scores["rouge1"])
print("ROUGE-2:", rouge_scores["rouge2"])
print("ROUGE-L:", rouge_scores["rougeL"])
print("ROUGE-Lsum:", rouge_scores["rougeLsum"])

print(100*'-')


----------------------------------------------------------------------------------------------------
Abstract: Immigration and asylum have normally been issues politicians from the big parties have tiptoed around at election time.But, while all the parties appear to agree the time has come to properly debate and address the issue, there are already signs they will run into precisely the same problems as before.Labour has already branded the proposal unworkable but party strategists have seen the Tories seizing a poll advantage over the issue.Former union leader Sir Bill Morris has already accused both the big parties of engaging in a "bidding war about who can be nastiest to asylum seekers".The challenge for the big parties is to ensure they can engage in the debate during the cut and thrust of a general election while also avoiding that trap.That has been attacked by the Tories as too little, too late and for failing to tackle the key issue of the numbers entering the UK.That was also

2025-03-20 01:04:42,493 - absl - INFO - Using default tokenizer.


----------------------------------------------------------------------------------------------------
Abstract: The US stock market regulator is investigating troubled insurance broker Marsh & McLennan's shareholder transactions, the firm has said.Marsh has said it is co-operating fully with the SEC investigation.Marsh is also the focus of an inquiry the New York attorney-general into whether insurers rigged the market.Since that inquiry was launched in October, Marsh has replaced its chief executive and held a boardroom shake-out to meet criticism by lessening the number of company executives on the board.Prosecutors allege that Marsh - the world's biggest insurance broker - and other US insurance firms may have fixed bids for corporate cover.The uncertainty unleashed by the scandal has prompted three credit rating agencies - Standard & Poor's, Moody's and Fitch - to downgrade Marsh in recent weeks.
---------------------------------------------------------------------------------------

## Run the SFTTrainer to Fine-tune Model

In [7]:
finetuned_model = "gemma-2-2b-it-finetuned"

peft_config = LoraConfig(
    r=64,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules="all-linear",
    modules_to_save=["lm_head", "embed_token"],
    task_type="CAUSAL_LM",
)

if torch.xpu.is_available():
    torch.xpu.empty_cache()

# TODO: set up the trainer using SFTTrainer class
# TODO: you can refer to the gemma_xpu_finetuning.ipynb exercise
# TODO: this part is relatively long because of the arguments that need to be set
# trainer = SFTTrainer(...)

# Calculate max_steps based on the subset size
num_train_samples = len(train_dataset)
batch_size = 1
gradient_accumulation_steps = 8
steps_per_epoch = num_train_samples // (batch_size * gradient_accumulation_steps)
num_epochs = 5
max_steps = steps_per_epoch * num_epochs

finetuned_model_id = "gemma-2-2b-it-finetuned"
PUSH_TO_HUB = True
USE_WANDB = False

training_args = transformers.TrainingArguments(
        per_device_train_batch_size=batch_size,
        per_device_eval_batch_size=2,
        gradient_accumulation_steps=gradient_accumulation_steps,
        warmup_ratio=0.05,
        max_steps=max_steps,
        learning_rate=1e-5,
        evaluation_strategy="steps",
        save_steps=500,
        bf16=True,
        logging_steps=100,
        output_dir=finetuned_model_id,
        hub_model_id=finetuned_model_id if PUSH_TO_HUB else None,
        use_ipex=True,
        report_to="wandb" if USE_WANDB else None,
        #push_to_hub=PUSH_TO_HUB,
        max_grad_norm=0.6,
        weight_decay=0.01,
        group_by_length=True
)

trainer = SFTTrainer(
    model=model,
    train_dataset=train_dataset,
    eval_dataset=validation_dataset,
    tokenizer=tokenizer,
    args=training_args,
    peft_config=peft_config
)

model.config.use_cache = False  # silence the warnings. Please re-enable for inference!
model.gradient_checkpointing_enable()
model = torch.compile(model)
result = trainer.train()
model.config.use_cache = True
print(result)

# save lora model
tuned_lora_model = "gemma-2-2b-it-finetuned-lora"
trainer.model.save_pretrained(tuned_lora_model)

Using the `WANDB_DISABLED` environment variable is deprecated and will be removed in v5. Use the --report_to flag to control the integrations used for logging result (for instance --report_to none).
  trainer = SFTTrainer(
No label_names provided for model class `PeftModelForCausalLM`. Since `PeftModel` hides base models input arguments, if label_names is not given, label_names can't be set automatically within `Trainer`. Note that empty label_names list will be used instead.
2025-03-19 04:16:39,861 - _logger.py - IPEX - INFO - Currently split master weight for xpu only support sgd
2025-03-19 04:16:39,889 - _logger.py - IPEX - INFO - Currently split master weight for xpu only support sgd
  return fn(*args, **kwargs)


Step,Training Loss,Validation Loss
100,2.3671,2.28148
200,2.2002,2.171038
300,2.1245,2.104002
400,2.0429,2.057977
500,2.0047,2.031388
600,1.9694,2.016161
700,1.9576,2.004878
800,1.9291,1.997234
900,1.9177,1.990582
1000,1.9251,1.98596


Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead.
Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead.
2025-03-19 04:18:33,751 - _logger.py - IPEX - INFO - Currently split master weight for xpu only support sgd
2025-03-19 04:18:33,780 - _logger.py - IPEX - INFO - Linear BatchNorm folding failed during the optimize process.
Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead.
Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead.
2025-03-19 04:20:51,168 - _logger.py - IPEX - INFO - Currently split master weight for xpu only support sgd
2025-03-19 04:20:51,195 - _logger.py - IPEX - INFO - Linear BatchNorm folding failed during the optimize process.
Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead.
Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead.
2025-03-19 04:22:47,810 - _logger.py - I

TrainOutput(global_step=1250, training_loss=2.0158024047851564, metrics={'train_runtime': 1511.7378, 'train_samples_per_second': 6.615, 'train_steps_per_second': 0.827, 'total_flos': 1.0874404918906675e+17, 'train_loss': 2.0158024047851564})


## Inference Fine-tuned Model and Evaluate Summaries using ROUGE

In [8]:
rouge = load('rouge')
finetuned_model = "gemma-2-2b-it-finetuned"
finetuned_model_path = f"{finetuned_model}/checkpoint-1250"
loaded_model = AutoModelForCausalLM.from_pretrained(finetuned_model_path, device_map="xpu")
tokenizer = AutoTokenizer.from_pretrained(finetuned_model_path)

predictions = []
references = []

# TODO: compute rouge scores on the first 15 sample again.
# TODO: you can repeat the code from the earlier cells.
#
for article, abstract in zip(validation_dataset["Articles"][:15], validation_dataset["Summaries"][:15]):
    messages = [
        {"role": "user", "content": f"Instruction:\nSummarize the following article.\n\nInput:\n{article}"},
    ]
    
    # Tokenize the input
    input_ids = tokenizer.apply_chat_template(messages,
                                              tokenize=True,
                                              add_generation_prompt=True,
                                              return_tensors="pt").to("xpu")
    
    # Perform model inference
    output = loaded_model.generate(
        input_ids=input_ids,
        max_new_tokens=100,  # Adjust token limit as needed
    )
    
    # Remove the input prompt from the output
    prompt_length = input_ids.shape[1]
    answer = tokenizer.decode(output[0][prompt_length:], skip_special_tokens=True)
    
    # Append the prediction and reference to their respective lists
    predictions.append(answer)
    references.append(abstract)

    print(100*'-')
    print("Abstract:", abstract)
    print(100*'-')
    print("Model Summary:", answer)
# Compute the ROUGE scores
rouge_scores = rouge.compute(predictions=predictions, references=references)

# Print out the ROUGE scores
print(100 * '-')
print("ROUGE-1:", rouge_scores["rouge1"])
print("ROUGE-2:", rouge_scores["rouge2"])
print("ROUGE-L:", rouge_scores["rougeL"])
print("ROUGE-Lsum:", rouge_scores["rougeLsum"])
print(100 * '-')

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]



----------------------------------------------------------------------------------------------------
Abstract: Immigration and asylum have normally been issues politicians from the big parties have tiptoed around at election time.But, while all the parties appear to agree the time has come to properly debate and address the issue, there are already signs they will run into precisely the same problems as before.Labour has already branded the proposal unworkable but party strategists have seen the Tories seizing a poll advantage over the issue.Former union leader Sir Bill Morris has already accused both the big parties of engaging in a "bidding war about who can be nastiest to asylum seekers".The challenge for the big parties is to ensure they can engage in the debate during the cut and thrust of a general election while also avoiding that trap.That has been attacked by the Tories as too little, too late and for failing to tackle the key issue of the numbers entering the UK.That was also

2025-03-20 01:08:28,575 - absl - INFO - Using default tokenizer.


----------------------------------------------------------------------------------------------------
Abstract: The US stock market regulator is investigating troubled insurance broker Marsh & McLennan's shareholder transactions, the firm has said.Marsh has said it is co-operating fully with the SEC investigation.Marsh is also the focus of an inquiry the New York attorney-general into whether insurers rigged the market.Since that inquiry was launched in October, Marsh has replaced its chief executive and held a boardroom shake-out to meet criticism by lessening the number of company executives on the board.Prosecutors allege that Marsh - the world's biggest insurance broker - and other US insurance firms may have fixed bids for corporate cover.The uncertainty unleashed by the scandal has prompted three credit rating agencies - Standard & Poor's, Moody's and Fitch - to downgrade Marsh in recent weeks.
---------------------------------------------------------------------------------------