# 📘 Dataset Description
This project uses the dialogsum-test dataset, sourced from Hugging Face's neil-code repository. It consists of dialogue data and corresponding summaries, suitable for fine-tuning or evaluating models on dialogue summarization tasks. This dataset is particularly useful for training transformer-based language models to generate concise summaries from conversational input.

# 🧰 Install Required Libraries
This cell installs all necessary libraries using pip. These include:

- bitsandbytes for efficient 8-bit model quantization,

- transformers for model loading and tokenization,

- peft for parameter-efficient fine-tuning,

- accelerate for distributed training,

- datasets for loading Hugging Face datasets,

- scipy, einops, and evaluate for various utilities and metrics,

- trl (Transformers Reinforcement Learning) for supervised fine-tuning.

In [1]:
!pip install -q -U bitsandbytes transformers peft accelerate datasets scipy einops evaluate trl

[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m40.9/40.9 kB[0m [31m1.3 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m61.9/61.9 kB[0m [31m4.4 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m72.9/72.9 MB[0m [31m10.7 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m10.8/10.8 MB[0m [31m41.8 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m472.3/472.3 kB[0m [31m32.7 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m491.5/491.5 kB[0m [31m20.6 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m35.3/35.3 MB[0m [31m12.1 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m84.1/84.1 kB[0m [31m3.9 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

In [2]:
import os
# disable Weights and Biases
os.environ['WANDB_DISABLED']="true"

# 📦 Import Required Libraries

In [3]:
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    HfArgumentParser,
    AutoTokenizer,
    TrainingArguments,
    Trainer,
    GenerationConfig
)
from tqdm import tqdm
from trl import SFTTrainer
import torch
import time
import pandas as pd
import numpy as np
from huggingface_hub import interpreter_login

interpreter_login()

Using the `WANDB_DISABLED` environment variable is deprecated and will be removed in v5. Use the --report_to flag to control the integrations used for logging result (for instance --report_to none).



    _|    _|  _|    _|    _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|_|_|_|    _|_|      _|_|_|  _|_|_|_|
    _|    _|  _|    _|  _|        _|          _|    _|_|    _|  _|            _|        _|    _|  _|        _|
    _|_|_|_|  _|    _|  _|  _|_|  _|  _|_|    _|    _|  _|  _|  _|  _|_|      _|_|_|    _|_|_|_|  _|        _|_|_|
    _|    _|  _|    _|  _|    _|  _|    _|    _|    _|    _|_|  _|    _|      _|        _|    _|  _|        _|
    _|    _|    _|_|      _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|        _|    _|    _|_|_|  _|_|_|_|



The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Enter your token (input will not be visible): ··········
Add token as git credential? (Y/n) y


# 📊 Define GPU Memory Monitoring Utility
This utility function checks and prints the current GPU memory usage using NVIDIA's NVML library. It’s useful for ensuring that the model and data fit into memory during training or inference.

In [4]:
from pynvml import *

def print_gpu_utilization():
    nvmlInit()
    handle = nvmlDeviceGetHandleByIndex(0)
    info = nvmlDeviceGetMemoryInfo(handle)
    print(f"GPU memory occupied: {info.used//1024**2} MB.")

# 📂 Load Dataset from Hugging Face
This cell loads the dialogsum-test dataset using Hugging Face’s load_dataset() function. It assigns the dataset to a variable named dataset for future use in tokenization and training.

In [5]:
huggingface_dataset_name = "neil-code/dialogsum-test"
dataset = load_dataset(huggingface_dataset_name)
dataset

README.md: 0.00B [00:00, ?B/s]

train.csv: 0.00B [00:00, ?B/s]

validation.csv: 0.00B [00:00, ?B/s]

test.csv: 0.00B [00:00, ?B/s]

Generating train split:   0%|          | 0/1999 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/499 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/499 [00:00<?, ? examples/s]

DatasetDict({
    train: Dataset({
        features: ['id', 'dialogue', 'summary', 'topic'],
        num_rows: 1999
    })
    validation: Dataset({
        features: ['id', 'dialogue', 'summary', 'topic'],
        num_rows: 499
    })
    test: Dataset({
        features: ['id', 'dialogue', 'summary', 'topic'],
        num_rows: 499
    })
})

# 🔍 View a Sample from the Dataset
This cell displays the first training example from the loaded dataset. It's a good practice to inspect a few samples before preprocessing to understand the structure and contents (e.g., input dialogues and summaries).

In [6]:
dataset['train'][0]

{'id': 'train_0',
 'dialogue': "#Person1#: Hi, Mr. Smith. I'm Doctor Hawkins. Why are you here today?\n#Person2#: I found it would be a good idea to get a check-up.\n#Person1#: Yes, well, you haven't had one for 5 years. You should have one every year.\n#Person2#: I know. I figure as long as there is nothing wrong, why go see the doctor?\n#Person1#: Well, the best way to avoid serious illnesses is to find out about them early. So try to come at least once a year for your own good.\n#Person2#: Ok.\n#Person1#: Let me see here. Your eyes and ears look fine. Take a deep breath, please. Do you smoke, Mr. Smith?\n#Person2#: Yes.\n#Person1#: Smoking is the leading cause of lung cancer and heart disease, you know. You really should quit.\n#Person2#: I've tried hundreds of times, but I just can't seem to kick the habit.\n#Person1#: Well, we have classes and some medications that might help. I'll give you more information before you leave.\n#Person2#: Ok, thanks doctor.",
 'summary': "Mr. Smith'

# ⚙️ Configure 4-bit Quantization for Model Loading
This block sets up 4-bit quantization using the BitsAndBytesConfig from Hugging Face Transformers. This allows loading large models more efficiently on limited GPU memory by reducing precision:

- load_in_4bit=True: Enables 4-bit weight loading.

- bnb_4bit_quant_type='nf4': Uses NormalFloat 4 quantization.

- bnb_4bit_compute_dtype=torch.float16: Sets computation precision to float16.

- bnb_4bit_use_double_quant=False: Disables nested quantization.

- device_map={"": 0}: Maps the model to the first GPU.

In [7]:
compute_dtype = getattr(torch, "float16")
bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type='nf4',
        bnb_4bit_compute_dtype=compute_dtype,
        bnb_4bit_use_double_quant=False,
    )
device_map = {"": 0}

# 🤖 Load Pretrained GPT-2 Model with Quantization
This cell loads the GPT-2 model (gpt2) from Hugging Face with the previously defined quantization settings. Key options include:

- device_map=device_map: Ensures model is loaded on GPU.

- quantization_config=bnb_config: Enables 4-bit quantization.

- trust_remote_code=True: Allows loading custom model classes from remote repos.

- use_auth_token=True: Uses your Hugging Face authentication token for private model access.

In [8]:
model_name='gpt2'
original_model = AutoModelForCausalLM.from_pretrained(model_name,
                                                      device_map=device_map,
                                                      quantization_config=bnb_config,
                                                      trust_remote_code=True,
                                                      use_auth_token=True)



config.json:   0%|          | 0.00/665 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/548M [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/124 [00:00<?, ?B/s]

# ✂️ Load and Configure the Tokenizer
Here, the GPT-2 tokenizer is loaded and customized:

padding_side="left": Left-pads sequences (important for causal language modeling).

add_eos_token=True: Adds end-of-sequence tokens if missing.

add_bos_token=True: Adds beginning-of-sequence tokens if needed.

use_fast=False: Uses the slower, more flexible Python tokenizer.

tokenizer.pad_token = tokenizer.eos_token: Sets the padding token to be the same as the EOS token (GPT-2 doesn't have a pad token by default).

In [9]:
# https://ai.stackexchange.com/questions/41485/while-fine-tuning-a-decoder-only-llm-like-llama-on-chat-dataset-what-kind-of-pa
tokenizer = AutoTokenizer.from_pretrained(model_name,trust_remote_code=True,padding_side="left",add_eos_token=True,add_bos_token=True,use_fast=False)
tokenizer.pad_token = tokenizer.eos_token

tokenizer_config.json:   0%|          | 0.00/26.0 [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/1.04M [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.36M [00:00<?, ?B/s]

In [10]:
print_gpu_utilization()

GPU memory occupied: 683 MB.


In [11]:
eval_tokenizer = AutoTokenizer.from_pretrained(model_name, add_bos_token=True, trust_remote_code=True, use_fast=False)
eval_tokenizer.pad_token = eval_tokenizer.eos_token

def gen(model,p, maxlen=100, sample=True):
    toks = eval_tokenizer(p, return_tensors="pt")
    res = model.generate(**toks.to("cuda"), max_new_tokens=maxlen, do_sample=sample,num_return_sequences=1,temperature=0.1,num_beams=1,top_p=0.95,).to('cpu')
    return eval_tokenizer.batch_decode(res,skip_special_tokens=True)

In [12]:
%%time
from transformers import set_seed
seed = 42
set_seed(seed)

index = 10

prompt = dataset['test'][index]['dialogue']
summary = dataset['test'][index]['summary']

formatted_prompt = f"Instruct: Summarize the following conversation.\n{prompt}\nOutput:\n"
res = gen(original_model,formatted_prompt,100,)
#print(res[0])
output = res[0].split('Output:\n')[1]

dash_line = '-'.join('' for x in range(100))
print(dash_line)
print(f'INPUT PROMPT:\n{formatted_prompt}')
print(dash_line)
print(f'BASELINE HUMAN SUMMARY:\n{summary}\n')
print(dash_line)
print(f'MODEL GENERATION - ZERO SHOT:\n{output}')

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


---------------------------------------------------------------------------------------------------
INPUT PROMPT:
Instruct: Summarize the following conversation.
#Person1#: Happy Birthday, this is for you, Brian.
#Person2#: I'm so happy you remember, please come in and enjoy the party. Everyone's here, I'm sure you have a good time.
#Person1#: Brian, may I have a pleasure to have a dance with you?
#Person2#: Ok.
#Person1#: This is really wonderful party.
#Person2#: Yes, you are always popular with everyone. and you look very pretty today.
#Person1#: Thanks, that's very kind of you to say. I hope my necklace goes with my dress, and they both make me look good I feel.
#Person2#: You look great, you are absolutely glowing.
#Person1#: Thanks, this is a fine party. We should have a drink together to celebrate your birthday
Output:

---------------------------------------------------------------------------------------------------
BASELINE HUMAN SUMMARY:
#Person1# attends Brian's birthday pa

In [13]:
def create_prompt_formats(sample):
    """
    Format various fields of the sample ('instruction','output')
    Then concatenate them using two newline characters
    :param sample: Sample dictionnary
    """
    INTRO_BLURB = "Below is an instruction that describes a task. Write a response that appropriately completes the request."
    INSTRUCTION_KEY = "### Instruct: Summarize the below conversation."
    RESPONSE_KEY = "### Output:"
    END_KEY = "### End"

    blurb = f"\n{INTRO_BLURB}"
    instruction = f"{INSTRUCTION_KEY}"
    input_context = f"{sample['dialogue']}" if sample["dialogue"] else None
    response = f"{RESPONSE_KEY}\n{sample['summary']}"
    end = f"{END_KEY}"

    parts = [part for part in [blurb, instruction, input_context, response, end] if part]

    formatted_prompt = "\n\n".join(parts)
    sample["text"] = formatted_prompt

    return sample

In [14]:
# SOURCE https://github.com/databrickslabs/dolly/blob/master/training/trainer.py
def get_max_length(model):
    conf = model.config
    max_length = None
    for length_setting in ["n_positions", "max_position_embeddings", "seq_length"]:
        max_length = getattr(model.config, length_setting, None)
        if max_length:
            print(f"Found max lenth: {max_length}")
            break
    if not max_length:
        max_length = 1024
        print(f"Using default max length: {max_length}")
    return max_length


def preprocess_batch(batch, tokenizer, max_length):
    """
    Tokenizing a batch
    """
    return tokenizer(
        batch["text"],
        max_length=max_length,
        truncation=True,
    )

In [15]:
from functools import partial

# SOURCE https://github.com/databrickslabs/dolly/blob/master/training/trainer.py
def preprocess_dataset(tokenizer: AutoTokenizer, max_length: int,seed, dataset):
    """Format & tokenize it so it is ready for training
    :param tokenizer (AutoTokenizer): Model Tokenizer
    :param max_length (int): Maximum number of tokens to emit from tokenizer
    """

    # Add prompt to each sample
    print("Preprocessing dataset...")
    dataset = dataset.map(create_prompt_formats)#, batched=True)

    _preprocessing_function = partial(preprocess_batch, max_length=max_length, tokenizer=tokenizer)
    dataset = dataset.map(
        _preprocessing_function,
        batched=True,
        remove_columns=['id', 'topic', 'dialogue', 'summary'],
    )

    # Filter out samples that have input_ids exceeding max_length
    dataset = dataset.filter(lambda sample: len(sample["input_ids"]) < max_length)

    # Shuffle dataset
    dataset = dataset.shuffle(seed=seed)

    return dataset

In [16]:
print_gpu_utilization()

GPU memory occupied: 713 MB.


In [17]:
# ## Pre-process dataset
max_length = get_max_length(original_model)
print(max_length)

train_dataset = preprocess_dataset(tokenizer, max_length,seed, dataset['train'])
eval_dataset = preprocess_dataset(tokenizer, max_length,seed, dataset['validation'])

Found max lenth: 1024
1024
Preprocessing dataset...


Map:   0%|          | 0/1999 [00:00<?, ? examples/s]

Map:   0%|          | 0/1999 [00:00<?, ? examples/s]

Filter:   0%|          | 0/1999 [00:00<?, ? examples/s]

Preprocessing dataset...


Map:   0%|          | 0/499 [00:00<?, ? examples/s]

Map:   0%|          | 0/499 [00:00<?, ? examples/s]

Filter:   0%|          | 0/499 [00:00<?, ? examples/s]

In [18]:
print(f"Shapes of the datasets:")
print(f"Training: {train_dataset.shape}")
print(f"Validation: {eval_dataset.shape}")
print(train_dataset)

Shapes of the datasets:
Training: (1996, 3)
Validation: (499, 3)
Dataset({
    features: ['text', 'input_ids', 'attention_mask'],
    num_rows: 1996
})


In [19]:
def print_number_of_trainable_model_parameters(model):
    trainable_model_params = 0
    all_model_params = 0
    for _, param in model.named_parameters():
        all_model_params += param.numel()
        if param.requires_grad:
            trainable_model_params += param.numel()
    return f"trainable model parameters: {trainable_model_params}\nall model parameters: {all_model_params}\npercentage of trainable model parameters: {100 * trainable_model_params / all_model_params:.2f}%"

print(print_number_of_trainable_model_parameters(original_model))

trainable model parameters: 39422208
all model parameters: 81972480
percentage of trainable model parameters: 48.09%


In [20]:
print(original_model)

GPT2LMHeadModel(
  (transformer): GPT2Model(
    (wte): Embedding(50257, 768)
    (wpe): Embedding(1024, 768)
    (drop): Dropout(p=0.1, inplace=False)
    (h): ModuleList(
      (0-11): 12 x GPT2Block(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): GPT2Attention(
          (c_attn): Linear4bit(in_features=768, out_features=2304, bias=True)
          (c_proj): Linear4bit(in_features=768, out_features=768, bias=True)
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (mlp): GPT2MLP(
          (c_fc): Linear4bit(in_features=768, out_features=3072, bias=True)
          (c_proj): Linear4bit(in_features=3072, out_features=768, bias=True)
          (act): NewGELUActivation()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
    )
    (ln_f): LayerNorm((768,), eps=1e-05, elementwise_affin

In [21]:
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

config = LoraConfig(
    r=32, #Rank
    lora_alpha=32,
    target_modules=[
        'c_attn',
        'c_proj'
    ],
    bias="none",
    lora_dropout=0.05,  # Conventional
    task_type="CAUSAL_LM",
)

# 1 - Enabling gradient checkpointing to reduce memory usage during fine-tuning
original_model.gradient_checkpointing_enable()

# 2 - Using the prepare_model_for_kbit_training method from PEFT
original_model = prepare_model_for_kbit_training(original_model)

peft_model = get_peft_model(original_model, config)

In [22]:
print(print_number_of_trainable_model_parameters(peft_model))

trainable model parameters: 3244032
all model parameters: 85216512
percentage of trainable model parameters: 3.81%


In [23]:
# See how the model looks different now, with the LoRA adapters added:
print(peft_model)

PeftModelForCausalLM(
  (base_model): LoraModel(
    (model): GPT2LMHeadModel(
      (transformer): GPT2Model(
        (wte): Embedding(50257, 768)
        (wpe): Embedding(1024, 768)
        (drop): Dropout(p=0.1, inplace=False)
        (h): ModuleList(
          (0-11): 12 x GPT2Block(
            (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
            (attn): GPT2Attention(
              (c_attn): lora.Linear4bit(
                (base_layer): Linear4bit(in_features=768, out_features=2304, bias=True)
                (lora_dropout): ModuleDict(
                  (default): Dropout(p=0.05, inplace=False)
                )
                (lora_A): ModuleDict(
                  (default): Linear(in_features=768, out_features=32, bias=False)
                )
                (lora_B): ModuleDict(
                  (default): Linear(in_features=32, out_features=2304, bias=False)
                )
                (lora_embedding_A): ParameterDict()
                (lora_

In [24]:
output_dir = './peft-dialogue-summary-training/final-checkpoint'
import transformers

peft_training_args = TrainingArguments(
    output_dir = output_dir,
    warmup_steps=2,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,
    max_steps=1000,
    learning_rate=2e-4,
    optim="paged_adamw_8bit",
    save_strategy="steps",
    logging_steps=25,
    logging_dir="./logs",
    save_steps=25,
    eval_steps=25,
    do_eval=True,
    gradient_checkpointing=True,
    report_to="none",
    overwrite_output_dir = 'True',
    group_by_length=True,
)

peft_model.config.use_cache = False

peft_trainer = transformers.Trainer(
    model=peft_model,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    args=peft_training_args,
    data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False),
)

No label_names provided for model class `PeftModelForCausalLM`. Since `PeftModel` hides base models input arguments, if label_names is not given, label_names can't be set automatically within `Trainer`. Note that empty label_names list will be used instead.


In [25]:
peft_training_args.device

device(type='cuda', index=0)

In [26]:
peft_trainer.train()

`loss_type=None` was set in the config but it is unrecognised.Using the default loss: `ForCausalLMLoss`.


Step,Training Loss
25,2.7993
50,2.1912
75,2.1553
100,1.7769
125,2.0499
150,1.6537
175,2.0251
200,1.6773
225,2.0023
250,1.6462


TrainOutput(global_step=1000, training_loss=1.8337457427978516, metrics={'train_runtime': 558.0451, 'train_samples_per_second': 7.168, 'train_steps_per_second': 1.792, 'total_flos': 609997106663424.0, 'train_loss': 1.8337457427978516, 'epoch': 2.004008016032064})

In [28]:
print_gpu_utilization()

GPU memory occupied: 1365 MB.


In [29]:
# Free memory for merging weights
del original_model
del peft_trainer
torch.cuda.empty_cache()

In [30]:
print_gpu_utilization()

GPU memory occupied: 737 MB.


In [31]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

base_model_id = "gpt2"
base_model = AutoModelForCausalLM.from_pretrained(base_model_id,
                                                      device_map='auto',
                                                      quantization_config=bnb_config,
                                                      trust_remote_code=True,
                                                      use_auth_token=True)



# 🧪 Define Text Generation Function for Evaluation
This cell sets up an evaluation tokenizer (a clone of the training tokenizer) and defines a gen() function for generating text with the model:

- It encodes a prompt into tokens.

- Uses the model to generate text (model.generate) with defined decoding settings (e.g. temperature, top_p, etc.).

- The generated output is returned after decoding.

This function is used to test zero-shot performance before fine-tuning.

In [32]:
eval_tokenizer = AutoTokenizer.from_pretrained(base_model_id, add_bos_token=True, trust_remote_code=True, use_fast=False)
eval_tokenizer.pad_token = eval_tokenizer.eos_token

# 🧪 Run a Zero-Shot Generation Test
This block tests the baseline performance of the model before fine-tuning:

Sets a seed for reproducibility.

Selects a test example from the dataset.

Constructs a natural-language prompt asking the model to summarize the conversation.

Uses the gen() function to generate output.

Prints the original prompt, human-written summary, and model-generated summary.

In [35]:
from peft import PeftModel

ft_model = PeftModel.from_pretrained(base_model, "/content/peft-dialogue-summary-training/final-checkpoint/checkpoint-1000",torch_dtype=torch.float16,is_trainable=False)
ft_model.push_to_hub(
    "AnikDey/gpt2-peft-lora-modded",
    use_auth_token=True,
    commit_message="Test model for homelessness app"
)



README.md: 0.00B [00:00, ?B/s]

HfHubHTTPError: (Request ID: Root=1-686abf47-4339dbba168638fb23d7b494;93fd4a42-f71d-42a0-ba19-c2799c485e47)

403 Forbidden: Authorization error..
Cannot access content at: https://huggingface.co/AnikDey/gpt2-peft-lora-modded.git/info/lfs/objects/batch.
Make sure your token has the correct permissions.

# ✅ Test the Fine-Tuned Model (PEFT)
This cell evaluates the fine-tuned model on a test example by:

- Creating a summarization prompt from the 26th test dialogue.

- Generating output using the PEFT model.

- Extracting and displaying:

1. The input prompt,

2. Ground truth summary,

3. Model-generated summary.

This helps assess how well the model performs after fine-tuning.

In [36]:
%%time
from transformers import set_seed
set_seed(seed)

index = 26
dialogue = dataset['test'][index]['dialogue']
summary = dataset['test'][index]['summary']

prompt = f"Instruct: Summarize the following conversation.\n{dialogue}\nOutput:\n"

peft_model_res = gen(ft_model,prompt,100,)
peft_model_output = peft_model_res[0].split('Output:\n')[1]
#print(peft_model_output)
prefix, success, result = peft_model_output.partition('#End')

dash_line = '-'.join('' for x in range(100))
print(dash_line)
print(f'INPUT PROMPT:\n{prompt}')
print(dash_line)
print(f'BASELINE HUMAN SUMMARY:\n{summary}\n')
print(dash_line)
print(f'PEFT MODEL:\n{prefix}')

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


---------------------------------------------------------------------------------------------------
INPUT PROMPT:
Instruct: Summarize the following conversation.
#Person1#: Steven, I need badly your help.
#Person2#: What's the matter?
#Person1#: My wife has found that I have an affair with my secretary, and now she is going to divorce me.
#Person2#: How could you cheat on your wife? You have been married for ten years.
#Person1#: Yes, I know I'm wrong. But I swear that the affair lasts only for two months. And I still love my wife. I couldn't live without her.
#Person2#: I will try my best to persuade her to reconsider the divorce. But are you sure that from now on you will be faithful to her forever?
#Person1#: Yes, I swear.
Output:

---------------------------------------------------------------------------------------------------
BASELINE HUMAN SUMMARY:
#Person1# begs Steven's to persuade #Person1#'s wife not to divorce him, and Steven agrees.

--------------------------------------

#  Load Base Language Model with Quantization

In [37]:
original_model = AutoModelForCausalLM.from_pretrained(base_model_id,
                                                      device_map='auto',
                                                      quantization_config=bnb_config,
                                                      trust_remote_code=True,
                                                      use_auth_token=True)


# 📊 Compare Summaries Across Models
This block evaluates and compares summaries generated by:

- The original pre-trained GPT-2 model,

- The fine-tuned model (using PEFT),

- Human-written reference summaries (ground truth).

Steps:

- Takes the first 10 dialogues from the test set.

For each dialogue:

- Generates summaries using both original and PEFT models.

- Stores the outputs alongside the human-written summary.

- Combines all summaries into a pandas DataFrame for easy viewing and comparison.



In [38]:
import pandas as pd

dialogues = dataset['test'][0:10]['dialogue']
human_baseline_summaries = dataset['test'][0:10]['summary']

original_model_summaries = []
instruct_model_summaries = []
peft_model_summaries = []

for idx, dialogue in enumerate(dialogues):
    human_baseline_text_output = human_baseline_summaries[idx]
    prompt = f"Instruct: Summarize the following conversation.\n{dialogue}\nOutput:\n"

    original_model_res = gen(original_model,prompt,100,)
    original_model_text_output = original_model_res[0].split('Output:\n')[1]

    peft_model_res = gen(ft_model,prompt,100,)
    peft_model_output = peft_model_res[0].split('Output:\n')[1]
    #print(peft_model_output)
    peft_model_text_output, success, result = peft_model_output.partition('#End')


    original_model_summaries.append(original_model_text_output)
    peft_model_summaries.append(peft_model_text_output)

zipped_summaries = list(zip(human_baseline_summaries, original_model_summaries, peft_model_summaries))

df = pd.DataFrame(zipped_summaries, columns = ['human_baseline_summaries', 'original_model_summaries', 'peft_model_summaries'])
df

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end gene

Unnamed: 0,human_baseline_summaries,original_model_summaries,peft_model_summaries
0,Ms. Dawson helps #Person1# to write a memo to ...,#Person1#: I am ready.\n#Person2#: I am ready....,#Person1# tells #Person2# to take a dictation ...
1,In order to prevent employees from wasting tim...,#Person1#: I am ready.\n#Person2#: I am ready....,#Person1# tells #Person2# to take a dictation ...
2,Ms. Dawson takes a dictation for #Person1# abo...,#Person1#: I am ready.\n#Person2#: I am ready....,#Person1# tells #Person2# to take a dictation ...
3,#Person2# arrives late because of traffic jam....,#Person1#: I'm not sure if I can do it.\n#Pers...,#Person1# and #Person2# decide to take public ...
4,#Person2# decides to follow #Person1#'s sugges...,#Person1#: I'm not sure if I can do it.\n#Pers...,#Person1# and #Person2# decide to take public ...
5,#Person2# complains to #Person1# about the tra...,#Person1#: I'm not sure if I can do it.\n#Pers...,#Person1# and #Person2# decide to take public ...
6,#Person1# tells Kate that Masha and Hero get d...,"#Person1#: Masha and Hero, I'm so happy to see...",#Person1# tells Kate that Masha and Hero are g...
7,#Person1# tells Kate that Masha and Hero are g...,"#Person1#: Masha and Hero, they are going to d...",Kate tells #Person1# that Masha and Hero are g...
8,#Person1# and Kate talk about the divorce betw...,#Person2#: I think they are going to divorce.\...,Kate tells #Person1# that Masha and Hero are g...
9,#Person1# and Brian are at the birthday party ...,"#Person1#: Brian, I'm so happy you remember, p...",Brian and Brian are happy to have a dance with...


# 📦 Install ROUGE Score Library
This command installs the rouge_score package, a popular library used to compute ROUGE metrics (Recall-Oriented Understudy for Gisting Evaluation), which are commonly used to evaluate the quality of generated summaries against reference texts.

In [39]:
!pip install rouge_score

Collecting rouge_score
  Downloading rouge_score-0.1.2.tar.gz (17 kB)
  Preparing metadata (setup.py) ... [?25l[?25hdone
Building wheels for collected packages: rouge_score
  Building wheel for rouge_score (setup.py) ... [?25l[?25hdone
  Created wheel for rouge_score: filename=rouge_score-0.1.2-py3-none-any.whl size=24934 sha256=9eb873d144bc919f805e5c66087427825be8cbb615840b97b004bb493b166424
  Stored in directory: /root/.cache/pip/wheels/1e/19/43/8a442dc83660ca25e163e1bd1f89919284ab0d0c1475475148
Successfully built rouge_score
Installing collected packages: rouge_score
Successfully installed rouge_score-0.1.2


# 📐 Evaluate Summaries with ROUGE Metrics
This block uses the evaluate library to compute ROUGE scores for both the original and fine-tuned (PEFT) models. ROUGE scores help measure how closely the generated summaries match human-written ones by comparing overlapping n-grams and sequences.

Key points:

- use_stemmer=True: Improves matching by reducing words to their root form.

- use_aggregator=True: Returns average scores across all samples.

Compares both:

- original_model_summaries vs. human_baseline_summaries

- peft_model_summaries vs. human_baseline_summaries

In [40]:
import evaluate

rouge = evaluate.load('rouge')

original_model_results = rouge.compute(
    predictions=original_model_summaries,
    references=human_baseline_summaries[0:len(original_model_summaries)],
    use_aggregator=True,
    use_stemmer=True,
)

peft_model_results = rouge.compute(
    predictions=peft_model_summaries,
    references=human_baseline_summaries[0:len(peft_model_summaries)],
    use_aggregator=True,
    use_stemmer=True,
)

print('ORIGINAL MODEL:')
print(original_model_results)
print('PEFT MODEL:')
print(peft_model_results)

Downloading builder script: 0.00B [00:00, ?B/s]

ORIGINAL MODEL:
{'rouge1': np.float64(0.0914053113260613), 'rouge2': np.float64(0.013666586557718496), 'rougeL': np.float64(0.08626870539858417), 'rougeLsum': np.float64(0.07270080576607135)}
PEFT MODEL:
{'rouge1': np.float64(0.24208498134076503), 'rouge2': np.float64(0.09352472884705074), 'rougeL': np.float64(0.1992675206820696), 'rougeLsum': np.float64(0.19422274052140942)}


#📈 Calculate Improvement of PEFT Model Over Original
This cell computes the absolute percentage improvement of the PEFT model compared to the original model for each ROUGE metric.

How it works:

- Converts the ROUGE scores to NumPy arrays.

- Calculates the difference (PEFT - Original) for each metric.

- Multiplies by 100 to express the result as a percentage.

- Prints improvement for each ROUGE score (e.g., ROUGE-1, ROUGE-2, ROUGE-L).

In [41]:
print("Absolute percentage improvement of PEFT MODEL over ORIGINAL MODEL")

improvement = (np.array(list(peft_model_results.values())) - np.array(list(original_model_results.values())))
for key, value in zip(peft_model_results.keys(), improvement):
    print(f'{key}: {value*100:.2f}%')

Absolute percentage improvement of PEFT MODEL over ORIGINAL MODEL
rouge1: 15.07%
rouge2: 7.99%
rougeL: 11.30%
rougeLsum: 12.15%
