# Model Evaluation: Fine-tuned vs Base Model Comparison

This notebook compares the fine-tuned model against the original `unsloth/Llama-3.2-3B-Instruct` using:
1. **ROUGE scores** comparing generated responses to ground truth
2. **Qualitative examples** - side-by-side comparisons
3. **Response length analysis**

In [1]:
%%capture
# Install dependencies
%uv pip install unsloth
%uv pip install rouge-score evaluate datasets tqdm

In [2]:
from unsloth import FastLanguageModel
from unsloth.chat_templates import get_chat_template, standardize_sharegpt
import torch
import numpy as np
from tqdm import tqdm
from datasets import load_dataset

# Configuration
max_seq_length = 2048
dtype = "float16"
load_in_4bit = True

BASE_MODEL_NAME = "unsloth/Llama-3.2-3B-Instruct"
LORA_ADAPTER_PATH = "/vol/checkpoint-10688"  # your local folder in Modal


🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!


## 1. Prepare the Test Dataset

Using the same split procedure as during training. Note that `test_size=0.15` holds out 15% of the data (15,000 samples) as the test set.

In [3]:
# Load and prepare dataset (same as training)
dataset = load_dataset("mlabonne/FineTome-100k", split="train")
dataset = standardize_sharegpt(dataset)

# Same split as training: train+val (85%), test (15%)
train_val_split = dataset.train_test_split(test_size=0.15, seed=42)
test_dataset = train_val_split["test"]

print(f"Test set size: {len(test_dataset)} samples")

Test set size: 15000 samples


## 2. Load Base Model

In [4]:
# Load the BASE model (not fine-tuned)
base_model, base_tokenizer = FastLanguageModel.from_pretrained(
    model_name     = BASE_MODEL_NAME,
    max_seq_length = max_seq_length,
    dtype          = dtype,
    load_in_4bit   = load_in_4bit,
)
base_tokenizer = get_chat_template(base_tokenizer, chat_template="llama-3.1")
FastLanguageModel.for_inference(base_model)
print("Base model loaded!")

==((====))==  Unsloth 2025.11.6: Fast Llama patching. Transformers: 4.56.0.
   \\   /|    Tesla T4. Num GPUs = 3. Max memory: 14.563 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.9.0+cu128. CUDA: 7.5. CUDA Toolkit: 12.8. Triton: 3.5.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.33.post1. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


Base model loaded!


## 3. Load Fine-tuned Model (with LoRA adapters)

In [16]:
# Load the FINE-TUNED model from LoRA checkpoint folder
finetuned_model, finetuned_tokenizer = FastLanguageModel.from_pretrained(
    model_name     = LORA_ADAPTER_PATH,    # <-- point directly at /vol/checkpoint-10688
    max_seq_length = max_seq_length,
    dtype          = dtype,
    load_in_4bit   = load_in_4bit,
)
finetuned_tokenizer = get_chat_template(finetuned_tokenizer, chat_template="llama-3.1")
FastLanguageModel.for_inference(finetuned_model)
print("Fine-tuned model loaded!")


Unsloth 2025.11.6 patched 28 layers with 28 QKV layers, 28 O layers and 28 MLP layers.


Fine-tuned model loaded!


In [17]:
prompt = [
    {"role": "user", "content": "Explain photosynthesis in one short paragraph."}
]

b = generate_response(base_model, base_tokenizer, prompt, deterministic=True)
f = generate_response(finetuned_model, finetuned_tokenizer, prompt, deterministic=True)

print("BASE:\n", b)
print("\nFINETUNED:\n", f)


BASE:
 Photosynthesis is the process by which plants, algae, and some bacteria convert light energy from the sun into chemical energy in the form of glucose. This process occurs in specialized organelles called chloroplasts, which contain the pigment chlorophyll. Water and carbon dioxide are absorbed by the plant, and with the energy from sunlight, they are converted into glucose and oxygen, releasing oxygen into the atmosphere as a byproduct.

FINETUNED:
 Photosynthesis is the process by which plants, algae, and some bacteria convert sunlight, water, and carbon dioxide into glucose and oxygen. During photosynthesis, chlorophyll in the plant's cells absorbs sunlight, which is then used to convert carbon dioxide and water into glucose and oxygen. This process is essential for life on Earth, as it provides the energy and organic compounds needed for growth and sustenance.


## 4. Evaluation Functions

In [6]:
from rouge_score import rouge_scorer

def format_conversation_for_eval(example):
    """Split a conversation into prompt messages and the reference reply."""
    convos = example["conversations"]

    # reference = the last assistant reply; search backwards by index so that
    # an earlier message with identical content cannot be picked by mistake
    for idx in range(len(convos) - 1, -1, -1):
        if convos[idx]["role"] == "assistant":
            # prompt = all messages before the reference
            return convos[:idx], convos[idx]["content"]

    # no assistant reply found
    return [], None




def generate_response(model, tokenizer, messages, max_new_tokens=256, deterministic=True):
    inputs = tokenizer.apply_chat_template(
        messages,
        tokenize=True,
        add_generation_prompt=True,
        return_tensors="pt",
    ).to("cuda")

    attention_mask = torch.ones_like(inputs)

    gen_kwargs = {
        "input_ids": inputs,
        "attention_mask": attention_mask,
        "max_new_tokens": max_new_tokens,
        "use_cache": True,
        "pad_token_id": tokenizer.eos_token_id,
    }

    if deterministic:
        gen_kwargs.update(
            dict(
                do_sample=False,
                temperature=None,
            )
        )
    else:
        gen_kwargs.update(
            dict(
                do_sample=True,
                temperature=0.7,
            )
        )

    with torch.no_grad():
        outputs = model.generate(**gen_kwargs)

    generated = tokenizer.decode(outputs[0][inputs.shape[1]:], skip_special_tokens=True)
    return generated.strip()



def compute_rouge_scores(predictions, references):
    """Compute ROUGE scores"""
    scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)
    
    scores = {'rouge1': [], 'rouge2': [], 'rougeL': []}
    
    for pred, ref in zip(predictions, references):
        if ref and pred:
            score = scorer.score(ref, pred)
            scores['rouge1'].append(score['rouge1'].fmeasure)
            scores['rouge2'].append(score['rouge2'].fmeasure)
            scores['rougeL'].append(score['rougeL'].fmeasure)
    
    return {
        'rouge1': np.mean(scores['rouge1']),
        'rouge2': np.mean(scores['rouge2']),
        'rougeL': np.mean(scores['rougeL']),
    }

print("Evaluation functions defined!")

Evaluation functions defined!
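
To make the metric less of a black box, here is a minimal pure-Python sketch of what ROUGE-1 and ROUGE-L F-measures compute. It is not the `rouge_score` implementation: the `tokenize` helper below is a simplifying assumption (lowercase and whitespace split), whereas `rouge_score` also strips non-alphanumeric characters and, with `use_stemmer=True`, applies stemming. The function names are hypothetical.

```python
from collections import Counter

def tokenize(text):
    # Simplified tokenizer (assumption): lowercase + whitespace split.
    return text.lower().split()

def f_measure(matches, pred_len, ref_len):
    # Harmonic mean of precision and recall; 0.0 when nothing matches.
    if matches == 0:
        return 0.0
    precision = matches / pred_len
    recall = matches / ref_len
    return 2 * precision * recall / (precision + recall)

def rouge1_f(reference, prediction):
    # ROUGE-1: clipped unigram overlap between reference and prediction.
    ref, pred = tokenize(reference), tokenize(prediction)
    overlap = sum((Counter(ref) & Counter(pred)).values())
    return f_measure(overlap, len(pred), len(ref))

def lcs_length(a, b):
    # Length of the longest common subsequence (row-by-row dynamic programming).
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def rougeL_f(reference, prediction):
    # ROUGE-L: F-measure based on the longest common subsequence of tokens.
    ref, pred = tokenize(reference), tokenize(prediction)
    return f_measure(lcs_length(ref, pred), len(pred), len(ref))

reference = "plants convert sunlight into sugar"
prediction = "plants turn sunlight into glucose"
r1 = rouge1_f(reference, prediction)
rl = rougeL_f(reference, prediction)
print(f"ROUGE-1 F: {r1:.2f}, ROUGE-L F: {rl:.2f}")
```

Because both scores reward token overlap only, a response can be correct yet score low if it is phrased differently from the reference, which is one reason the qualitative comparison further down is still useful.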


## 5. Run Evaluation on Test Set

We evaluate on a random subset of the test set for speed (adjust `num_samples` as needed).

In [19]:
# Number of samples to evaluate (more = better statistics, but slower)
num_samples = 100  # Increase for more reliable results

# Sample from test set
eval_indices = np.random.RandomState(42).choice(len(test_dataset), min(num_samples, len(test_dataset)), replace=False)
eval_samples = test_dataset.select(eval_indices)

print(f"Evaluating on {len(eval_samples)} samples...")

Evaluating on 100 samples...


In [20]:
# Generate responses from both models
base_predictions = []
finetuned_predictions = []
references = []
prompts_used = []

for i, example in enumerate(tqdm(eval_samples, desc="Generating responses")):
    prompt_messages, reference = format_conversation_for_eval(example)
    
    if reference is None or len(prompt_messages) == 0:
        continue
    
    try:
        base_response = generate_response(base_model, base_tokenizer, prompt_messages, deterministic=True)
        finetuned_response = generate_response(finetuned_model, finetuned_tokenizer, prompt_messages, deterministic=True)

        
        base_predictions.append(base_response)
        finetuned_predictions.append(finetuned_response)
        references.append(reference)
        prompts_used.append(prompt_messages)
        
    except Exception as e:
        print(f"Error on sample {i}: {e}")
        continue

print(f"Successfully evaluated {len(references)} samples")

Successfully evaluated 100 samples





## 6. Compute Metrics

In [21]:
# Calculate ROUGE scores
base_rouge = compute_rouge_scores(base_predictions, references)
finetuned_rouge = compute_rouge_scores(finetuned_predictions, references)

print("=" * 60)
print("EVALUATION RESULTS")
print("=" * 60)
print(f"Number of samples evaluated: {len(references)}")
print()
print("-" * 60)
print("ROUGE Scores (higher is better)")
print("-" * 60)
print(f"{'Metric':<15} {'Base Model':<20} {'Fine-tuned':<20} {'Change':<15}")
print("-" * 60)

for metric in ['rouge1', 'rouge2', 'rougeL']:
    base_score = base_rouge[metric]
    ft_score = finetuned_rouge[metric]
    improvement = ((ft_score - base_score) / base_score) * 100 if base_score > 0 else 0
    print(f"{metric:<15} {base_score:<20.4f} {ft_score:<20.4f} {improvement:+.2f}%")

print("=" * 60)

EVALUATION RESULTS
Number of samples evaluated: 100

------------------------------------------------------------
ROUGE Scores (higher is better)
------------------------------------------------------------
Metric          Base Model           Fine-tuned           Change         
------------------------------------------------------------
rouge1          0.4732               0.5323               +12.49%
rouge2          0.2255               0.2849               +26.35%
rougeL          0.2856               0.3521               +23.30%
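
With only 100 samples, it may be worth checking how stable these differences are. One option is a paired bootstrap over per-sample scores, sketched below. This assumes you keep the per-sample F-measures, which `compute_rouge_scores` above does not return (it only returns means). The function name `paired_bootstrap_ci` is hypothetical, and the toy scores are made-up placeholders, not results from this notebook.

```python
import random

def paired_bootstrap_ci(base_scores, ft_scores, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean per-sample score difference."""
    assert len(base_scores) == len(ft_scores) and base_scores
    rng = random.Random(seed)
    # Paired differences: both models were scored on the same prompts.
    diffs = [f - b for b, f in zip(base_scores, ft_scores)]
    n = len(diffs)
    means = []
    for _ in range(n_resamples):
        # Resample the differences with replacement and record the mean.
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return sum(diffs) / n, lo, hi

# Placeholder per-sample scores for illustration only.
toy_base = [0.40, 0.52, 0.47, 0.35, 0.50, 0.44]
toy_ft = [0.45, 0.55, 0.46, 0.42, 0.58, 0.47]

mean_diff, ci_lo, ci_hi = paired_bootstrap_ci(toy_base, toy_ft)
print(f"mean difference: {mean_diff:+.4f}, 95% CI: [{ci_lo:+.4f}, {ci_hi:+.4f}]")
```

If the interval excludes zero, the improvement is less likely to be an artifact of which 100 samples were drawn.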


## 7. Qualitative Comparison - Side by Side Examples

In [22]:
num_examples = 10

for i in range(min(num_examples, len(references))):
    print("=" * 80)
    print(f"EXAMPLE {i+1}")
    print("=" * 80)
    
    # Show the most recent user turn (empty string if the prompt has none)
    last_user_msg = ""
    for msg in prompts_used[i]:
        if msg["role"] == "user":
            last_user_msg = msg["content"]
    
    print(f"USER PROMPT:\n{last_user_msg[:500]}{'...' if len(last_user_msg) > 500 else ''}")
    print(f"\nREFERENCE:\n{references[i][:500]}{'...' if len(references[i]) > 500 else ''}")
    print(f"\nBASE MODEL:\n{base_predictions[i][:500]}{'...' if len(base_predictions[i]) > 500 else ''}")
    print(f"\nFINE-TUNED MODEL:\n{finetuned_predictions[i][:500]}{'...' if len(finetuned_predictions[i]) > 500 else ''}")
    print()

EXAMPLE 1
USER PROMPT:
Explain the process of photosynthesis in simple terms and describe its importance for the ecosystem.

REFERENCE:
Photosynthesis is a process by which plants, algae, and some bacteria convert sunlight, water, and carbon dioxide into sugar and oxygen. This process occurs in the chloroplasts of these organisms. In simple terms, sunlight is absorbed, and its energy is used to break down water and carbon dioxide molecules, which are then reassembled into sugars and oxygen. The sugar provides energy for growth, while oxygen is released into the atmosphere. Photosynthesis is essential for the ecosystem because it...

BASE MODEL:
**What is Photosynthesis?**

Photosynthesis is a process by which plants, algae, and some bacteria convert light energy from the sun into chemical energy in the form of glucose (a type of sugar). This process is essential for life on Earth, as it provides energy and organic compounds for plants to grow and thrive.

**The Process of Photosynthesi

## 8. Response Length Analysis

In [24]:
base_lengths = [len(p.split()) for p in base_predictions]
ft_lengths = [len(p.split()) for p in finetuned_predictions]
ref_lengths = [len(r.split()) for r in references]

print("Response Length Analysis (words)")
print("-" * 50)
print(f"{'Metric':<25} {'Base':<12} {'Fine-tuned':<12} {'Reference':<12}")
print("-" * 50)
print(f"{'Mean length':<25} {np.mean(base_lengths):<12.1f} {np.mean(ft_lengths):<12.1f} {np.mean(ref_lengths):<12.1f}")
print(f"{'Median length':<25} {np.median(base_lengths):<12.1f} {np.median(ft_lengths):<12.1f} {np.median(ref_lengths):<12.1f}")
print(f"{'Std deviation':<25} {np.std(base_lengths):<12.1f} {np.std(ft_lengths):<12.1f} {np.std(ref_lengths):<12.1f}")

Response Length Analysis (words)
--------------------------------------------------
Metric                    Base         Fine-tuned   Reference   
--------------------------------------------------
Mean length               165.0        155.7        216.3       
Median length             176.0        165.5        199.0       
Std deviation             42.5         53.6         109.4       
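
The table above compares aggregate lengths only. Since `generate_response` caps output at `max_new_tokens=256`, both models may be cut off on prompts whose references are long, which could partly explain why generated responses are shorter than references on average. A per-sample view can help here. The sketch below computes the prediction-to-reference length ratio per sample. `length_ratios` is a hypothetical helper and the toy strings are placeholders.

```python
def length_ratios(predictions, references):
    """Per-sample ratio of prediction length to reference length (in words)."""
    ratios = []
    for pred, ref in zip(predictions, references):
        ref_words = len(ref.split())
        if ref_words == 0:
            continue  # skip empty references to avoid division by zero
        ratios.append(len(pred.split()) / ref_words)
    return ratios

# Placeholder data for illustration only.
toy_predictions = ["a b c d", "a b", "a b c d e f"]
toy_references = ["a b c d e f g h", "a b", "a b c"]

ratios = length_ratios(toy_predictions, toy_references)
# Share of samples where the prediction is shorter than its reference.
shorter_share = sum(r < 1.0 for r in ratios) / len(ratios)
print(f"ratios: {ratios}, share shorter than reference: {shorter_share:.2f}")
```

In the notebook, the same helper could be called as `length_ratios(base_predictions, references)` and `length_ratios(finetuned_predictions, references)`.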


## 9. Save Results

In [25]:
import json
from datetime import datetime

results = {
    "timestamp": datetime.now().isoformat(),
    "base_model": BASE_MODEL_NAME,
    "finetuned_model": LORA_ADAPTER_PATH,
    "num_samples": len(references),
    "metrics": {
        "base_model": base_rouge,
        "finetuned_model": finetuned_rouge,
    },
    "response_lengths": {
        "base_mean": float(np.mean(base_lengths)),
        "finetuned_mean": float(np.mean(ft_lengths)),
        "reference_mean": float(np.mean(ref_lengths)),
    }
}

with open("evaluation_results.json", "w") as f:
    json.dump(results, f, indent=2)

print("Results saved to evaluation_results.json")
print(json.dumps(results, indent=2))

Results saved to evaluation_results.json
{
  "timestamp": "2025-12-02T10:09:18.235153",
  "base_model": "unsloth/Llama-3.2-3B-Instruct",
  "finetuned_model": "/vol/checkpoint-10688",
  "num_samples": 100,
  "metrics": {
    "base_model": {
      "rouge1": 0.4732268736349978,
      "rouge2": 0.22545052802537063,
      "rougeL": 0.2855963588727634
    },
    "finetuned_model": {
      "rouge1": 0.5323317587648463,
      "rouge2": 0.28485583536195763,
      "rougeL": 0.35213159677059536
    }
  },
  "response_lengths": {
    "base_mean": 165.0,
    "finetuned_mean": 155.68,
    "reference_mean": 216.27
  }
}
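
The saved file can later be reloaded to recompute the relative changes without rerunning generation. The sketch below parses an inline JSON string rather than reading `evaluation_results.json`, so that it runs on its own. The metric values are copied from the output above, and `relative_change` is a hypothetical helper.

```python
import json

# Inline copy of the "metrics" section saved above (stand-in for reading the file).
saved = json.loads("""
{
  "metrics": {
    "base_model": {
      "rouge1": 0.4732268736349978,
      "rouge2": 0.22545052802537063,
      "rougeL": 0.2855963588727634
    },
    "finetuned_model": {
      "rouge1": 0.5323317587648463,
      "rouge2": 0.28485583536195763,
      "rougeL": 0.35213159677059536
    }
  }
}
""")

def relative_change(results):
    """Percent change of the fine-tuned score relative to the base score."""
    base = results["metrics"]["base_model"]
    tuned = results["metrics"]["finetuned_model"]
    return {m: (tuned[m] - base[m]) / base[m] * 100 for m in base if base[m] > 0}

changes = relative_change(saved)
for metric, pct in changes.items():
    print(f"{metric}: {pct:+.2f}%")
```

The printed percentages should agree with the Change column in the ROUGE table in section 6.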
