<a href="https://colab.research.google.com/github/MelMacLondon/ML/blob/main/LLM_The_Fine_Tuning_Lab_Rank_vs_Performance.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# The Fine-Tuning Lab: Rank vs. Performance

Welcome to the final practical challenge of Workshop 2.

Earlier, you learned how to run LoRA. Now, you will learn how to **tune** it.

## What is a Summarisation Task?

Summarisation is a classic Natural Language Processing (NLP) task where the goal is to produce a concise and fluent summary while keeping the key information and overall meaning of a longer text.

There are generally two types:

- **Extractive**: Selecting and stitching together important sentences from the original text (like highlighting).
- **Abstractive**: Generating entirely new sentences to capture the essence of the text (like a human writing a summary).

In this lab, we are doing **Abstractive Summarisation** using a Sequence-to-Sequence (Seq2Seq) model. We are feeding in a dialogue (from the `SAMSum` dataset) and asking the model to generate a natural language summary of the conversation.

## The Objective

You are now an AI Researcher. Your goal is to determine the optimal LoRA configuration for a summarization task, balancing **Model Size** (efficiency) against **Performance** (quality).

You will conduct a controlled experiment to answer two questions:

1.  **Does Rank matter?** Does increasing the LoRA Rank ($r$) from 4 to 32 actually improve the ROUGE score, or is it just a waste of parameters?
2.  **Does Data Size matter?** Is it better to have a High Rank model on small data, or a Low Rank model on more data?

Of course, there are other hyperparameters to consider when fine-tuning a LoRA model. You can find further details of these at [https://huggingface.co/docs/peft/developer_guides/lora](https://huggingface.co/docs/peft/developer_guides/lora).

## The Task

You need to write a Python script (or use the provided Notebook) to conduct an **Ablation Study**. This means changing one variable at a time to isolate its effect.

1.  **Define a reusable training loop** that accepts `rank` and `sample_size` as arguments.
2.  **Run four distinct experiments** (A 2x2 Grid Search):
    - **R8_D200**: Rank=8, Data=200.
    - **R32_D200**: Rank=32, Data=200.
    - **R8_D500**: Rank=8, Data=500.
    - **R32_D500**: Rank=32, Data=500.
3.  **Evaluate all models** on a held-out test set using an appropriate metric.
4.  **Compare the results** in a table.

## Getting Started

1.  Run the setup cells to load the `samsum` dataset (Conversations -> Summaries).
2.  Fill in the "Hypothesis" section before writing any code.
3.  Implement the training loop using the `transformers` Train API.
4.  Generate your report.

## Helpful Tips

- **Saving Models**: Make sure to save each model to a different directory (e.g., `lora_r8_d200` vs `lora_r32_d200`).
- **Evaluation**: Use the `evaluate` library for ROUGE scores.
- **Time Management**: A single training run on CPU might take 5-10 minutes. Start small! Get your code running first, then scale up.


## 1. Setup Data
We will use the `knkarthick/samsum` dataset. It contains text messages and their summaries.

You need to install the library rouge-score for this practical. Uncomment the line below, run it, comment it out then restart the kernel.

In [2]:
!pip install rouge_score
!pip install evaluate

Collecting rouge_score
  Downloading rouge_score-0.1.2.tar.gz (17 kB)
  Preparing metadata (setup.py) ... [?25l[?25hdone
Building wheels for collected packages: rouge_score
  Building wheel for rouge_score (setup.py) ... [?25l[?25hdone
  Created wheel for rouge_score: filename=rouge_score-0.1.2-py3-none-any.whl size=24934 sha256=5d4bbed0716f7848de887086cf6255ee2bcccde8c25099c46b38dafa5f6cc44b
  Stored in directory: /root/.cache/pip/wheels/85/9d/af/01feefbe7d55ef5468796f0c68225b6788e85d9d0a281e7a70
Successfully built rouge_score
Installing collected packages: rouge_score
Successfully installed rouge_score-0.1.2
Collecting evaluate
  Downloading evaluate-0.4.6-py3-none-any.whl.metadata (9.5 kB)
Downloading evaluate-0.4.6-py3-none-any.whl (84 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m84.1/84.1 kB[0m [31m3.5 MB/s[0m eta [36m0:00:00[0m
[?25hInstalling collected packages: evaluate
Successfully installed evaluate-0.4.6


In [3]:
from collections import Counter

## The Model: FLAN-T5

For this experiment, we are switching from `distilgpt2` to **`FLAN-T5-Small`**.

*   **Type**: Encoder-Decoder (Seq2Seq). This architecture is superior for summarisation because it can see the entire input text (Encoder) before generating the summary (Decoder).
*   **Why FLAN?**: It has been "instruction tuned" on thousands of tasks. It understands prompts like "Summarize this conversation:" much better than a raw base model.
*   **Why Small?**: It has ~60M parameters, making it trainable on a CPU in reasonable time, which is perfect for our lab comparison.

In [4]:
import os
import evaluate
import numpy as np
import nltk
import torch
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, DataCollatorForSeq2Seq, Seq2SeqTrainingArguments, Seq2SeqTrainer
from peft import LoraConfig, get_peft_model, TaskType, PeftModel

try:
    nltk.data.find("tokenizers/punkt")
except (LookupError, OSError):
    nltk.download("punkt")

from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

MODEL_ID = "google/flan-t5-small"     # MODEL CARD: https://huggingface.co/google/flan-t5-small
OUTPUT_DIR = "solution_output"
os.makedirs(OUTPUT_DIR, exist_ok=True)
USE_CPU = not torch.cuda.is_available()

print(f">>> Device selection: {'CPU' if USE_CPU else 'GPU'}")

def preprocess_function(examples):
    inputs = ["summarize: " + doc for doc in examples["dialogue"]]
    model_inputs = tokenizer(inputs, max_length=128, truncation=True, padding="max_length")
    labels = tokenizer(examples["summary"], max_length=128, truncation=True, padding="max_length")
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

## LOAD TOKENIZER
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
dataset = load_dataset("knkarthick/samsum")     # https://huggingface.co/datasets/knkarthick/samsum
tokenized_dataset = dataset.map(preprocess_function, batched=True)
print("Data Ready!")

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


>>> Device selection: GPU


config.json: 0.00B [00:00, ?B/s]

tokenizer_config.json: 0.00B [00:00, ?B/s]

spiece.model:   0%|          | 0.00/792k [00:00<?, ?B/s]

tokenizer.json: 0.00B [00:00, ?B/s]

special_tokens_map.json: 0.00B [00:00, ?B/s]

README.md: 0.00B [00:00, ?B/s]

train.csv: 0.00B [00:00, ?B/s]

validation.csv: 0.00B [00:00, ?B/s]

test.csv: 0.00B [00:00, ?B/s]

Generating train split:   0%|          | 0/14731 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/818 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/819 [00:00<?, ? examples/s]

Map:   0%|          | 0/14731 [00:00<?, ? examples/s]

Map:   0%|          | 0/818 [00:00<?, ? examples/s]

Map:   0%|          | 0/819 [00:00<?, ? examples/s]

Data Ready!


## 2. Formulate your Predictions

Before you code, predict what will happen in your **2x2 Grid Search**.

**Question 1:** Will increasing the Rank ($r$) from 8 to 32 significantly increase the ROUGE score on small data (200 samples)?

**Question 2:** Will simply adding more data (500 samples) be more effective than increasing the Rank?


## 3. The Experiment Loop

Write a function `run_experiment(rank, sample_size, output_name)` that:

1.  Loads a fresh `google/flan-t5-small` model.
2.  Configures LoRA with the given `rank`.
3.  Slices the training data to `sample_size`.
4.  Trains for 1 epoch.
5.  Saves the model to disk (so we can check size).

In [5]:
ds_train = tokenized_dataset["train"].select(range(200))
print( ds_train )
ds_eval  = tokenized_dataset["validation"].select(range(50))
print( ds_eval )
print( ds_train[:3]['dialogue'] )
print( ds_train[:3]['summary'] )



Dataset({
    features: ['id', 'dialogue', 'summary', 'input_ids', 'attention_mask', 'labels'],
    num_rows: 200
})
Dataset({
    features: ['id', 'dialogue', 'summary', 'input_ids', 'attention_mask', 'labels'],
    num_rows: 50
})
["Amanda: I baked  cookies. Do you want some?\nJerry: Sure!\nAmanda: I'll bring you tomorrow :-)", 'Olivia: Who are you voting for in this election? \nOliver: Liberals as always.\nOlivia: Me too!!\nOliver: Great', "Tim: Hi, what's up?\nKim: Bad mood tbh, I was going to do lots of stuff but ended up procrastinating\nTim: What did you plan on doing?\nKim: Oh you know, uni stuff and unfucking my room\nKim: Maybe tomorrow I'll move my ass and do everything\nKim: We were going to defrost a fridge so instead of shopping I'll eat some defrosted veggies\nTim: For doing stuff I recommend Pomodoro technique where u use breaks for doing chores\nTim: It really helps\nKim: thanks, maybe I'll do that\nTim: I also like using post-its in kaban style"]
['Amanda baked cookie

In [6]:
def run_experiment(rank, sample_size, exp_name):

    # 1. Load Model
    output_path = os.path.join(OUTPUT_DIR, exp_name)

    if os.path.exists(output_path) and os.path.exists(os.path.join(output_path, "adapter_config.json")):
        print(f">>> Found existing model at {output_path}, skipping training.")
        return output_path

    # Reload Base Model (Fresh start)
    model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_ID)

    # 2. Config LoRA
    # TODO: Define config using the 'rank' argument
    peft_config = LoraConfig(task_type=TaskType.SEQ_2_SEQ_LM
                             , inference_mode=False
                             , r=rank              # Rank argument
                             , lora_alpha=32       # dont know why we are hard coding this ?
                             , lora_dropout=0.1
                             )

    # apply the config!!
    model = get_peft_model(model, peft_config)


    # 3. Setup Trainer and Train
    # TODO: Slice dataset -> tokenized_dataset["train"].select(range(sample_size))

    ds_train = tokenized_dataset["train"].shuffle(seed=42).select(range(sample_size))
    # ds_eval  = tokenized_dataset["validation"].select(range(sample_size))

    # TODO: Configure Seq2SeqTrainingArguments() and Seq2SeqTrainer()
    training_args = Seq2SeqTrainingArguments(output_dir=output_path,
                                             per_device_train_batch_size=8,
                                             logging_strategy="epoch",
                                             save_strategy="no",        # Save manually at the end
                                             learning_rate=1e-3,
                                             num_train_epochs = 1,
                                             load_best_model_at_end=False,
                                             use_cpu = USE_CPU
                                             )

    trainer = Seq2SeqTrainer(model=model,
                             args=training_args,
                             train_dataset=ds_train,
                             data_collator=DataCollatorForSeq2Seq(tokenizer, model=model))

    '''
      the Trainer must already be holding the unwrapped base model before it enters training / evaluation
      (where .generate() is called).
    '''
    # trainer.model = trainer.model.base_model

    # TODO: Begin training
    trainer.train()

    # TODO: Save model to disc

    model.save_pretrained(output_path)
    tokenizer.save_pretrained(output_path)
    print(f">>> Experiment saved to {output_path}")

    return output_path
    pass


In [7]:

# Run Experiments (2x2 Grid Search)
# 1. R8_D200
# 2. R32_D200
# 3. R8_D500
# 4. R32_D500

experiments = [
    {"rank": 8, "data": 200, "name": "R8_D200"},
    {"rank": 32, "data": 200, "name": "R32_D200"},
    {"rank": 8, "data": 500, "name": "R8_D500"},
    {"rank": 32, "data": 500, "name": "R32_D500"},
]

results = {}

for exp in experiments:
    path = run_experiment(rank=exp["rank"], sample_size=exp["data"], exp_name=exp["name"])
    exp["path"] = path

model.safetensors:   0%|          | 0.00/308M [00:00<?, ?B/s]

Loading weights:   0%|          | 0/190 [00:00<?, ?it/s]



generation_config.json:   0%|          | 0.00/147 [00:00<?, ?B/s]

Step,Training Loss
25,10.900123


>>> Experiment saved to solution_output/R8_D200


Loading weights:   0%|          | 0/190 [00:00<?, ?it/s]



Step,Training Loss
25,10.858899


>>> Experiment saved to solution_output/R32_D200


Loading weights:   0%|          | 0/190 [00:00<?, ?it/s]



Step,Training Loss
63,9.120339


>>> Experiment saved to solution_output/R8_D500


Loading weights:   0%|          | 0/190 [00:00<?, ?it/s]



Step,Training Loss
63,9.147206


>>> Experiment saved to solution_output/R32_D500


## 4. Evaluation

Compare the results! Check file sizes and calculate ROUGE scores on the test set.

Note that this might be slow running on the CPU

In [8]:
import evaluate
rouge = evaluate.load("rouge")
test_data = tokenized_dataset["test"].select(range(10)) # First 10 examples for quick check

# TODO: Write a function that loads a model adaptor, generates summaries, and scores it. Should return a single report.
# TODO: Apply this to all experiment outputs. Generate a full report.


def evaluate_model(path, name):
    try:
        # Load Base + Adapter ...
        base = AutoModelForSeq2SeqLM.from_pretrained(MODEL_ID)
        model = PeftModel.from_pretrained(base, path)

        # Move to GPU if available
        if not USE_CPU:
            model.to("cuda")

        predictions = []
        references = []

        print(f"Generating for {name}...")

        for i in range(len(test_data)):
            # TODO define inputs ...
            inputs = tokenizer(test_data[i]["dialogue"], return_tensors="pt", truncation=True, padding=True).input_ids

            if not USE_CPU:
                inputs = inputs.to("cuda")

            # TODO generate outputs and decode them ...
            outputs = model.generate(input_ids=inputs, max_new_tokens=50)
            decoded = tokenizer.decode(outputs[0], skip_special_tokens=True)

            predictions.append(decoded)
            references.append(test_data[i]["summary"])

        # TODO compute metrics ...
        metrics = rouge.compute(predictions=predictions, references=references)

        # Check File Size
        size_mb = 0
        for dirpath, _, filenames in os.walk(path):
            for f in filenames:
                fp = os.path.join(dirpath, f)
                size_mb += os.path.getsize(fp)
        size_mb /= (1024 * 1024)

        return metrics["rougeL"], size_mb
    except Exception as e:
        print(f"Error evaluating {name}: {e}")
        return 0.0, 0.0

print(f"\n{'Name':<10} | {'Rank':^5} | {'Data':^6} | {'Size (MB)':^10} | {'ROUGE-L':^8}")
print("-" * 50)

for exp in experiments:
    score, size = evaluate_model(exp["path"], exp["name"])
    print(f"{exp['name']:<10} | {exp['rank']:^5} | {exp['data']:^6} | {size:^10.2f} | {score:^8.4f}")

Downloading builder script: 0.00B [00:00, ?B/s]


Name       | Rank  |  Data  | Size (MB)  | ROUGE-L 
--------------------------------------------------


Loading weights:   0%|          | 0/190 [00:00<?, ?it/s]



Generating for R8_D200...
R8_D200    |   8   |  200   |    3.34    |  0.2685 


Loading weights:   0%|          | 0/190 [00:00<?, ?it/s]



Generating for R32_D200...
R32_D200   |  32   |  200   |    7.28    |  0.2666 


Loading weights:   0%|          | 0/190 [00:00<?, ?it/s]



Generating for R8_D500...
R8_D500    |   8   |  500   |    3.34    |  0.0944 


Loading weights:   0%|          | 0/190 [00:00<?, ?it/s]



Generating for R32_D500...
R32_D500   |  32   |  500   |    7.28    |  0.0727 
