# Understanding Supervised Fine-Tuning (SFT)

Supervised Fine-Tuning is a machine learning technique where we take a **pre-trained language model** (like Llama 3) and teach it to perform specific tasks by showing it examples.

Think of it like this:
- **Pre-trained model** = A student who has read millions of books and knows general knowledge
- **Fine-tuning** = Teaching that student to become an expert in Nigerian government services by showing them question-answer pairs

**How it works:**
1. We start with Llama 3 (already trained on internet text)
2. We show it examples: "Question: How do I register a business?" → "Answer: To register..."
3. The model learns the pattern and adapts its knowledge
4. After training, it can answer similar questions it hasn't seen before



**Why LoRA (Low-Rank Adaptation)?**
- Training all 8 billion parameters is slow and expensive
- LoRA trains well under 1% of the parameters by adding small "adapter" matrices
- Result: 2x faster, uses 60% less memory, but maintains quality
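To make the adapter idea concrete, here is a minimal sketch of a low-rank update on a tiny weight matrix in plain Python. The sizes (`d = 8`, `r = 2`) and the helper names are made up for illustration; real LoRA layers in PEFT work on GPU tensors, not lists.

```python
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

random.seed(0)
d, r, alpha = 8, 2, 2  # toy sizes, not the notebook's real dimensions

W = [[random.gauss(0, 1) for _ in range(d)] for _ in range(d)]      # frozen weight (d x d)
A = [[random.gauss(0, 0.01) for _ in range(d)] for _ in range(r)]   # trainable (r x d)
B = [[0.0] * r for _ in range(d)]                                   # trainable (d x r), starts at zero

def effective_weight(W, A, B, alpha, r):
    """W + (alpha / r) * B @ A: the weight the adapted layer effectively uses."""
    delta = matmul(B, A)
    scale = alpha / r
    return [[w + scale * dw for w, dw in zip(w_row, d_row)]
            for w_row, d_row in zip(W, delta)]

full_params = d * d          # what full fine-tuning would train
lora_params = r * d + d * r  # what LoRA trains instead
print(f"full: {full_params} params, LoRA: {lora_params} params")

# Because B starts at zero, the adapted layer initially equals the original.
print(effective_weight(W, A, B, alpha, r) == W)  # True
```

The saving grows with layer size: the full matrix scales with `d * d`, while the adapter scales with `r * d`.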

**Features:**
- 2x faster training with Unsloth
- 60% less memory usage
- 4-bit quantization support
- LoRA efficient fine-tuning

**Requirements:**
- GPU Runtime (T4, V100, or A100)
- ~15GB GPU memory
- training_data.json file uploaded to Colab

## 📦 Step 1: Install Required Libraries

This cell installs all the required libraries:
- **Unsloth**: Makes training 2x faster and uses less memory
- **TRL**: Provides the `SFTTrainer` for supervised fine-tuning
- **Transformers**: HuggingFace library for working with language models
- **PEFT**: Enables LoRA (parameter-efficient fine-tuning)
- **bitsandbytes**: Enables 4-bit quantization to save memory

**What is 4-bit quantization?**
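Model weights are normally stored as 16-bit or 32-bit floating-point numbers. 4-bit quantization stores them in 4 bits instead, which cuts weight memory roughly 4x relative to 16-bit (8x relative to 32-bit) with minimal quality loss.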
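The arithmetic behind that saving can be sketched as follows. This counts weight storage only and assumes a round 8 billion parameters; real usage is higher because of quantization constants, activations, and optimizer state.

```python
def weight_memory_gb(num_params: float, bits_per_weight: int) -> float:
    """Memory needed to store the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9

NUM_PARAMS = 8e9  # approximate size of an 8B model (assumed round number)

for bits in (32, 16, 4):
    print(f"{bits:>2}-bit: {weight_memory_gb(NUM_PARAMS, bits):.0f} GB")
```

At 4 bits the weights fit comfortably in the ~15GB listed under Requirements; at 16 bits they would not.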

In [None]:
!pip install -r requirements.txt

## 📚 Step 2: Import Libraries

Import all required libraries and check GPU availability.

In [6]:
import torch

# --- CRITICAL FIX FOR "torch.int1" ERROR ---
if not hasattr(torch, "int1"):
    print("⚠️ Patching torch.int1 = torch.int8 to fix attribute error.")
    torch.int1 = torch.int8
# -------------------------------------------

from unsloth import FastLanguageModel, is_bfloat16_supported

from trl import SFTTrainer
from transformers import TrainingArguments
from datasets import Dataset
import pandas as pd
import json
from pathlib import Path

print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"GPU Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.2f} GB")

PyTorch version: 2.9.1+cu128
CUDA available: True
GPU: Tesla T4
GPU Memory: 15.83 GB


## Step 3: Configuration

This is your control panel for training. Key settings:

**Model Settings:**
- `MODEL_NAME`: Which pre-trained model to start from
- `MAX_SEQ_LENGTH`: Maximum tokens in one training example (1024 = ~750 words)
- `LOAD_IN_4BIT`: Use 4-bit quantization to save memory

**LoRA Settings:**
- `LORA_R = 16`: Rank of LoRA adapters (higher = more parameters, slower but potentially better)
- `LORA_ALPHA = 16`: Scaling factor (typically equals rank)
- `LORA_DROPOUT = 0`: No dropout (regularization technique)

**Training Settings:**
- `MAX_STEPS = 60`: Train for 60 optimizer steps (quick test run)
- `PER_DEVICE_BATCH_SIZE = 2`: Process 2 examples at once
- `GRADIENT_ACCUMULATION_STEPS = 4`: Accumulate 4 batches before updating (effective batch = 2×4 = 8)
- `LEARNING_RATE = 2e-4`: How big each training step is (0.0002)
- `WARMUP_STEPS = 5`: Gradually increase learning rate for first 5 steps

**Why these numbers?**
- Smaller batch sizes use less memory
- Gradient accumulation simulates larger batches
- 60 steps is for testing; real training uses 500-5000 steps
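To see why gradient accumulation "simulates" a larger batch, here is a small sketch with a one-parameter model and mean squared error. The data and the model are toy assumptions; the point is only that averaging the gradients of four micro-batches of 2 matches the gradient of one batch of 8.

```python
import math

def mse_grad(w, batch):
    """Gradient of mean squared error for the model y_hat = w * x."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)

# Eight toy (x, y) examples; the values are arbitrary.
data = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 7.8),
        (5.0, 10.1), (6.0, 12.2), (7.0, 13.8), (8.0, 16.1)]
w = 0.5

PER_DEVICE_BATCH_SIZE = 2
GRADIENT_ACCUMULATION_STEPS = 4

# One big batch of 8.
full_grad = mse_grad(w, data)

# Four micro-batches of 2, gradients averaged before a single update.
micro_batches = [data[i:i + PER_DEVICE_BATCH_SIZE]
                 for i in range(0, len(data), PER_DEVICE_BATCH_SIZE)]
accumulated = sum(mse_grad(w, mb) for mb in micro_batches) / GRADIENT_ACCUMULATION_STEPS

print(f"full-batch gradient:  {full_grad:.6f}")
print(f"accumulated gradient: {accumulated:.6f}")
print(math.isclose(full_grad, accumulated))  # True
```

The match is exact here because every micro-batch has the same size. With language-model batches the number of non-padding tokens varies per micro-batch, which is why the training log later notes that accumulation can be "very slightly less accurate".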

In [7]:
# ============================================================================
# CONFIGURATION - OPTIMIZED FOR SPEED
# ============================================================================

# Model settings
MODEL_NAME = "unsloth/llama-3.1-8b-Instruct-bnb-4bit"
MAX_SEQ_LENGTH = 1024
LOAD_IN_4BIT = True

# LoRA settings (lower rank = faster training)
LORA_R = 16  # Reduced from 64 for faster training
LORA_ALPHA = 16  # Matches rank
LORA_DROPOUT = 0

# Training settings (optimized for speed)
MAX_STEPS = 60  # Set max steps instead of epochs for testing
PER_DEVICE_BATCH_SIZE = 2
GRADIENT_ACCUMULATION_STEPS = 4
LEARNING_RATE = 2e-4
WARMUP_STEPS = 5
LOGGING_STEPS = 1

# Data settings
DATA_FILE = "training_data.json"
TRAIN_SPLIT = 0.9  # Note: the data-loading cell below hard-codes an 80/10/10 split and does not read this value

# Output settings
OUTPUT_DIR = "./outputs"

print("✅ Configuration loaded!")
print(f"Effective batch size: {PER_DEVICE_BATCH_SIZE * GRADIENT_ACCUMULATION_STEPS}")
print(f"Max training steps: {MAX_STEPS}")

✅ Configuration loaded!
Effective batch size: 8
Max training steps: 60


## Step 4: Load Model and Tokenizer

This cell loads the Llama 3 model and tokenizer:

**Tokenizer:**
- Converts text into numbers (tokens) that the model understands
- Example: "hello world" → [15339, 1917]
- Each token represents a piece of text (word, subword, or character)

**Model:**
- Loaded in 4-bit quantization to save memory
- `dtype=None`: Auto-detect best precision (FP16 for T4, BF16 for A100)
- Pre-trained on trillions of tokens of internet text

**Padding configuration:**
- Models need all sequences in a batch to be the same length
- `pad_token`: Special token used to fill shorter sequences
- `padding_side = "right"`: Add padding at the end
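The padding rules above can be sketched in plain Python. The token IDs and the pad ID of `0` are hypothetical; in the notebook the tokenizer and data collator do this for you.

```python
def pad_batch(sequences, pad_token_id, padding_side="right"):
    """Pad token-ID lists to a common length and build attention masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        padding = [pad_token_id] * (max_len - len(seq))
        mask_pad = [0] * len(padding)
        if padding_side == "right":
            input_ids.append(seq + padding)
            attention_mask.append([1] * len(seq) + mask_pad)
        else:
            input_ids.append(padding + seq)
            attention_mask.append(mask_pad + [1] * len(seq))
    return input_ids, attention_mask

# Hypothetical token IDs; 0 stands in for the pad token.
batch = [[11, 12, 13, 14], [21, 22]]
ids, mask = pad_batch(batch, pad_token_id=0)
print(ids)   # [[11, 12, 13, 14], [21, 22, 0, 0]]
print(mask)  # [[1, 1, 1, 1], [1, 1, 0, 0]]
```

The attention mask tells the model which positions are real tokens (1) and which are padding to ignore (0).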

In [8]:
print("Loading model and tokenizer...")

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name=MODEL_NAME,
    max_seq_length=MAX_SEQ_LENGTH,
    dtype=None,  # Auto-detect (Float16 for T4, Bfloat16 for A100)
    load_in_4bit=LOAD_IN_4BIT,
)

# Configure tokenizer
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
    tokenizer.pad_token_id = tokenizer.eos_token_id

tokenizer.padding_side = "right"

print("✅ Model and tokenizer loaded successfully!")
print(f"Model dtype: {model.dtype}")
print(f"Vocabulary size: {len(tokenizer)}")

Loading model and tokenizer...
==((====))==  Unsloth 2026.1.3: Fast Llama patching. Transformers: 4.57.3.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.9.1+cu128. CUDA: 7.5. CUDA Toolkit: 12.8. Triton: 3.5.1
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.33.post2. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


✅ Model and tokenizer loaded successfully!
Model dtype: torch.float16
Vocabulary size: 128256


## Step 5: Add LoRA Adapters

This cell adds LoRA (Low-Rank Adaptation) layers to the model:

**What are we adding?**
- Small trainable matrices to attention and feed-forward layers
- Target modules: `q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj`
- These are the key computation layers in each transformer block

**Why LoRA?**
- Full fine-tuning: Train all 8B parameters (slow, memory-intensive)
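- LoRA: Train only ~42M parameters (about 0.5% of the 8B total; the 0.92% this cell prints is computed against the 4-bit packed weights, which store two weights per element and so undercount the total)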
- Result: Much faster, less memory, similar quality

**Key parameters:**
- `r=16`: Rank of adaptation matrices (controls capacity)
- `lora_alpha=16`: Scaling factor for LoRA updates
- `use_gradient_checkpointing`: Trade compute for memory (slower but uses less RAM)

After this cell, only the LoRA weights will be updated during training!
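The trainable-parameter count can be reproduced by hand. The sketch below assumes the layer shapes from the public Llama 3.1 8B configuration (hidden size 4096, MLP size 14336, 8 key/value heads of dimension 128, 32 layers); these numbers are not stated in this notebook.

```python
# Layer shapes assumed from the public Llama 3.1 8B configuration.
HIDDEN, MLP, KV_DIM, NUM_LAYERS = 4096, 14336, 8 * 128, 32
LORA_R = 16

# (input features, output features) of each targeted projection
target_modules = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, MLP),
    "up_proj": (HIDDEN, MLP),
    "down_proj": (MLP, HIDDEN),
}

def lora_params(in_features, out_features, r):
    """Adapter A is (r x in_features) and adapter B is (out_features x r)."""
    return r * in_features + out_features * r

per_layer = sum(lora_params(i, o, LORA_R) for i, o in target_modules.values())
trainable = per_layer * NUM_LAYERS
print(f"{trainable:,}")  # 41,943,040
```

Doubling `LORA_R` doubles this count, which is the capacity/speed trade-off the rank controls.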

In [9]:
print("Adding LoRA adapters...")

model = FastLanguageModel.get_peft_model(
    model,
    r=LORA_R,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    lora_alpha=LORA_ALPHA,
    lora_dropout=LORA_DROPOUT,
    bias="none",
    use_gradient_checkpointing="unsloth",
    random_state=42,
    use_rslora=False,
    loftq_config=None,
)

# Calculate parameters
total_params = sum(p.numel() for p in model.parameters())
trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)

print("✅ LoRA adapters added!")
print(f"Total parameters: {total_params:,}")
print(f"Trainable parameters: {trainable_params:,} ({trainable_params/total_params*100:.2f}%)")

Adding LoRA adapters...


Unsloth 2026.1.3 patched 32 layers with 32 QKV layers, 32 O layers and 32 MLP layers.


✅ LoRA adapters added!
Total parameters: 4,582,543,360
Trainable parameters: 41,943,040 (0.92%)




## 📊 Step 6: Load and Prepare Training Data

Load your training_data.json file and split into three sets:

**Three-way split:**
- **Training (80%)**: Model learns from these examples
- **Validation (10%)**: Monitor training progress, prevent overfitting
- **Test (10%)**: Final evaluation on completely unseen data

**Why split this way?**
- **Train**: The model sees these during training
- **Validation**: Check if model is generalizing (not memorizing)
- **Test**: Hold out for final evaluation after training is complete

**Data format:**
Each example is converted to chat format with:
- **System**: Instructions for the model's behavior
- **User**: The question being asked
- **Assistant**: The correct answer to learn from

This structure teaches the model to respond appropriately to Nigerian government service questions.
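A minimal sketch of a three-way split is shown here on stand-in data. Unlike the cell that follows, which slices the records in file order, this version shuffles first; that is an addition worth considering if the JSON file is grouped by agency. The function name and seed are illustrative.

```python
import random

def three_way_split(items, train_frac=0.8, val_frac=0.1, seed=42):
    """Shuffle a copy of `items`, then cut it into train / validation / test."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    train_size = int(len(shuffled) * train_frac)
    val_size = int(len(shuffled) * val_frac)
    train = shuffled[:train_size]
    val = shuffled[train_size:train_size + val_size]
    test = shuffled[train_size + val_size:]
    return train, val, test

# Stand-in for the real records: 6,181 integers instead of question/answer dicts.
records = list(range(6181))
train, val, test = three_way_split(records)
print(len(train), len(val), len(test))  # 4944 618 619
```

The test set picks up the one leftover record because the train and validation sizes are rounded down.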

In [29]:
print(f"Loading training data from {DATA_FILE}...")

# Load JSON data
with open(DATA_FILE, 'r', encoding='utf-8') as f:
    raw_data = json.load(f)

print(f"✅ Loaded {len(raw_data)} samples from JSON")

# Convert to chat format for fine-tuning
def convert_to_chat_format(item):
    """
    Convert each item to chat format with system, user, and assistant messages.
    """
    return {
        "messages": [
            {
                "role": "system",
                "content": "You are a helpful assistant for Nigerian government services and agencies. Provide accurate information and include relevant contact details when available."
            },
            {
                "role": "user",
                "content": item["question"]
            },
            {
                "role": "assistant",
                "content": item["answer"]
            }
        ],
        "agency": item.get("agency", "Unknown")
    }

# Convert all data
formatted_data = [convert_to_chat_format(item) for item in raw_data]

# Create DataFrame
df = pd.DataFrame(formatted_data)

# Split into train (80%), validation (10%), and test (10%)
train_size = int(len(df) * 0.8)
val_size = int(len(df) * 0.1)

train_df = df[:train_size]
val_df = df[train_size:train_size + val_size]
test_df = df[train_size + val_size:]

print(f"\n📊 Dataset Statistics:")
print(f"Total samples: {len(df):,}")
print(f"Training samples: {len(train_df):,} ({len(train_df)/len(df)*100:.1f}%)")
print(f"Validation samples: {len(val_df):,} ({len(val_df)/len(df)*100:.1f}%)")
print(f"Test samples: {len(test_df):,} ({len(test_df)/len(df)*100:.1f}%)")


Loading training data from training_data.json...
✅ Loaded 6181 samples from JSON

📊 Dataset Statistics:
Total samples: 6,181
Training samples: 4,944 (80.0%)
Validation samples: 618 (10.0%)
Test samples: 619 (10.0%)


## 🔄 Step 7: Format Dataset with Chat Template

Apply Llama 3 chat template to format messages properly for training.
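To show roughly what the template produces, here is a simplified hand-written stand-in. It is not the tokenizer's real template: the actual Llama 3.1 template also injects a date preamble into the system turn, so always use `tokenizer.apply_chat_template` for training. The function name is hypothetical.

```python
def render_llama3_chat(messages, add_generation_prompt=False):
    """Simplified stand-in for the tokenizer's Llama 3 chat template."""
    text = "<|begin_of_text|>"
    for message in messages:
        text += f"<|start_header_id|>{message['role']}<|end_header_id|>\n\n"
        text += f"{message['content'].strip()}<|eot_id|>"
    if add_generation_prompt:
        # Open an assistant turn so the model continues with its answer.
        text += "<|start_header_id|>assistant<|end_header_id|>\n\n"
    return text

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "How do I register a business?"},
    {"role": "assistant", "content": "To register..."},
]
print(render_llama3_chat(messages))
```

For training, the assistant answer is part of the text (`add_generation_prompt=False`); for inference, the prompt ends with an open assistant header instead.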

In [11]:
def format_chat_template(example):
    """Apply chat template to format messages for training."""
    formatted_text = tokenizer.apply_chat_template(
        example["messages"],
        tokenize=False,
        add_generation_prompt=False
    )
    return {"text": formatted_text}

print("Formatting training dataset...")
train_dataset = Dataset.from_pandas(train_df)
train_dataset = train_dataset.map(format_chat_template)

print("Formatting validation dataset...")
val_dataset = Dataset.from_pandas(val_df)
val_dataset = val_dataset.map(format_chat_template)

print(f"\n✅ Datasets formatted successfully!")
print(f"Training dataset: {len(train_dataset)} samples")
print(f"Validation dataset: {len(val_dataset)} samples")

print("\n📄 Formatted example (first 500 chars):")
print(train_dataset[0]["text"][:500] + "...")

Formatting training dataset...


Map:   0%|          | 0/5562 [00:00<?, ? examples/s]

Formatting validation dataset...


Map:   0%|          | 0/619 [00:00<?, ? examples/s]


✅ Datasets formatted successfully!
Training dataset: 5562 samples
Validation dataset: 619 samples

📄 Formatted example (first 500 chars):
<|begin_of_text|><|start_header_id|>system<|end_header_id|>

Cutting Knowledge Date: December 2023
Today Date: 26 Jul 2024

You are a helpful assistant for Nigerian government services and agencies. Provide accurate information and include relevant contact details when available.<|eot_id|><|start_header_id|>user<|end_header_id|>

What is APCON and what is its regulatory role in Nigeria’s advertising industry?<|eot_id|><|start_header_id|>assistant<|end_header_id|>

APCON stands for the Advertisin...


## ⚙️ Step 8: Configure Training Arguments

This cell configures how the model will be trained:

**Training Schedule:**
- `max_steps=60`: Stop after 60 training steps (quick test)
- For real training, use `num_train_epochs=3` instead of `max_steps`

**Batch Configuration:**
- `per_device_train_batch_size=2`: Process 2 examples per GPU
- `gradient_accumulation_steps=4`: Accumulate gradients for 4 batches
- **Effective batch size = 2 × 4 = 8 examples per update**

**Optimization:**
- `learning_rate=2e-4`: Step size for parameter updates (0.0002)
- `warmup_steps=5`: Gradually increase LR for stability
- `optim="adamw_8bit"`: Memory-efficient AdamW optimizer
- `lr_scheduler_type="linear"`: Learning rate decreases linearly

**Precision:**
- `fp16` or `bf16`: Use 16-bit floating point (2x faster, half the memory)
- BF16 is better for training but only available on newer GPUs

**Other:**
- `seed=3407`: For reproducibility (same results every time)
- `report_to="none"`: Don't log to WandB/TensorBoard
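The combination of `warmup_steps=5` and `lr_scheduler_type="linear"` can be sketched as a simple function. This follows the usual linear warmup and decay formula; whether the values the Trainer logs line up exactly with these step indices is an assumption.

```python
def linear_schedule_lr(completed_steps, base_lr=2e-4, warmup_steps=5, total_steps=60):
    """Learning rate after `completed_steps` optimizer updates."""
    if completed_steps < warmup_steps:
        # Warmup: ramp linearly from 0 up to base_lr.
        factor = completed_steps / max(1, warmup_steps)
    else:
        # Decay: ramp linearly from base_lr down to 0 at the final step.
        factor = max(0.0, (total_steps - completed_steps) / max(1, total_steps - warmup_steps))
    return base_lr * factor

for step in (0, 1, 5, 30, 60):
    print(f"step {step:>2}: lr = {linear_schedule_lr(step):.2e}")
```

The peak of `2e-4` is reached only at the end of warmup, and the rate falls steadily afterwards.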

In [12]:
training_args = TrainingArguments(
    output_dir=OUTPUT_DIR,
    
    # Training schedule
    max_steps=MAX_STEPS,  # Using max_steps instead of num_epochs
    per_device_train_batch_size=PER_DEVICE_BATCH_SIZE,
    gradient_accumulation_steps=GRADIENT_ACCUMULATION_STEPS,
    
    # Optimization
    learning_rate=LEARNING_RATE,
    warmup_steps=WARMUP_STEPS,
    optim="adamw_8bit",
    weight_decay=0.01,
    lr_scheduler_type="linear",
    
    # Precision
    fp16=not is_bfloat16_supported(),
    bf16=is_bfloat16_supported(),
    
    # Logging
    logging_steps=LOGGING_STEPS,
    
    # Other
    seed=3407,
    report_to="none",
    remove_unused_columns=False,  # Keep extra dataset columns such as "messages" and "agency"
)

print("✅ Training configuration created!")
print(f"Effective batch size: {PER_DEVICE_BATCH_SIZE * GRADIENT_ACCUMULATION_STEPS}")
print(f"Total training steps: {MAX_STEPS}")

✅ Training configuration created!
Effective batch size: 8
Total training steps: 60


## 🎯 Step 9: Initialize Trainer

This cell creates the SFTTrainer that handles the training loop:

**What does SFTTrainer do?**
1. Takes your formatted dataset
2. Tokenizes text into numbers
3. Creates batches of examples
4. Feeds batches through the model
5. Computes loss (how wrong the predictions are)
6. Updates model weights to reduce loss
7. Repeats until training is complete

**Key parameters:**
- `dataset_text_field="text"`: Column containing formatted chat text
- `max_seq_length=1024`: Truncate sequences longer than 1024 tokens
- `dataset_num_proc=2`: Use 2 CPU cores for data processing
- `packing=False`: Don't pack multiple examples into one sequence

**What is packing?**
- With packing: Combine short examples to fill max_seq_length → More efficient
- Without packing: One example per sequence → Simpler, easier to debug
- For beginners, `packing=False` is recommended
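As a rough picture of what packing buys, the sketch below places examples into sequences with a simple first-fit rule. The token counts are hypothetical and TRL's actual packing strategy may differ; this only illustrates the idea.

```python
def pack_greedy(lengths, max_seq_length):
    """Place each example into the first sequence that still has room."""
    bins = []  # each bin holds the token lengths packed into one sequence
    for length in lengths:
        length = min(length, max_seq_length)  # longer examples are truncated
        for b in bins:
            if sum(b) + length <= max_seq_length:
                b.append(length)
                break
        else:
            bins.append([length])
    return bins

# Hypothetical token counts for six formatted examples.
lengths = [300, 700, 250, 1000, 120, 400]
MAX_SEQ_LENGTH = 1024

packed = pack_greedy(lengths, MAX_SEQ_LENGTH)
print(f"without packing: {len(lengths)} sequences")
print(f"with packing:    {len(packed)} sequences -> {packed}")
```

Here six examples fit into three sequences, so the model processes far fewer padding tokens per step.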

In [13]:
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,           # validation set used by trainer.evaluate()
    dataset_text_field="text",
    max_seq_length=MAX_SEQ_LENGTH,
    dataset_num_proc=2,
    packing=False,
    args=training_args,
)

Unsloth: Tokenizing ["text"] (num_proc=4):   0%|          | 0/5562 [00:00<?, ? examples/s]

Unsloth: Tokenizing ["text"] (num_proc=4):   0%|          | 0/619 [00:00<?, ? examples/s]

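## 🚀 Step 10: Start Training

Execute the training loop. This will:
- Train for the configured number of optimizer steps (`MAX_STEPS`)
- Log the training loss at every step (`LOGGING_STEPS = 1`)
- Report the average training loss when the run finishes

Evaluation during training and checkpoint saving follow the `TrainingArguments` defaults, because this notebook does not set them explicitly.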

**This may take 30 minutes to several hours depending on dataset size and GPU.**

In [14]:
print("🚀 Starting training...")
print("=" * 70)

trainer_stats = trainer.train()

print("=" * 70)
print("✅ Training completed!")
print(f"Final training loss: {trainer_stats.training_loss:.4f}")

The model is already on multiple devices. Skipping the move to device specified in `args`.


🚀 Starting training...


==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 5,562 | Num Epochs = 1 | Total steps = 60
O^O/ \_/ \    Batch size per device = 2 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (2 x 4 x 1) = 8
 "-____-"     Trainable parameters = 41,943,040 of 8,072,204,288 (0.52% trained)


Step,Training Loss
1,2.7786
2,2.8938
3,2.8796
4,2.6027
5,2.3353
6,2.0886
7,1.962
8,1.7464
9,1.4155
10,1.2428


✅ Training completed!
Final training loss: 1.1324


## 📊 Step 11: Final Evaluation

This cell evaluates the model on the validation set:

**Why evaluate?**
- Training loss shows how well the model fits training data
- Validation loss shows if the model generalizes to new examples
- If training loss ↓ but validation loss ↑ = overfitting (memorizing, not learning)

**What is `eval_loss`?**
- Average loss on validation examples
- Should be close to final training loss
- If much higher: Model is overfitting

**Other metrics:**
- `eval_runtime`: How long evaluation took
- `eval_samples_per_second`: Throughput
- `eval_steps_per_second`: Processing speed

**Good results:**
- `eval_loss` close to training loss (within 0.1-0.2)
- `eval_loss` < 1.0 for good quality responses
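One way to make a loss value more intuitive is to convert it to perplexity, the exponential of the mean cross-entropy loss. The sketch below uses two values from this notebook's logged output: the step 1 training loss and the reported `eval_loss`.

```python
import math

def perplexity(loss: float) -> float:
    """Perplexity is exp(mean cross-entropy loss, measured in nats)."""
    return math.exp(loss)

first_step_loss = 2.7786  # step 1 training loss from this notebook's log
eval_loss = 0.9315        # eval_loss reported by this notebook's run

print(f"perplexity at step 1:     {perplexity(first_step_loss):.2f}")
print(f"perplexity on validation: {perplexity(eval_loss):.2f}")
```

Perplexity falls from roughly 16 to about 2.5, meaning the model went from hesitating between about sixteen equally likely next tokens to between two and three.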

In [15]:
print("Running final evaluation...")

# Safety check
print("Validation dataset columns:", val_dataset.column_names)
print("Sample text length:", len(val_dataset[0]["text"]))

eval_results = trainer.evaluate()

print("\n📊 Evaluation Results:")
for key, value in eval_results.items():
    if isinstance(value, float):
        print(f"  {key}: {value:.4f}")
    else:
        print(f"  {key}: {value}")

print("\n✅ Evaluation completed!")

Unsloth: Not an error, but LlamaForCausalLM does not accept `num_items_in_batch`.
Using gradient accumulation will be very slightly less accurate.
Read more on gradient accumulation issues here: https://unsloth.ai/blog/gradient


Running final evaluation...
Validation dataset columns: ['messages', 'agency', 'text']
Sample text length: 1288



📊 Evaluation Results:
  eval_loss: 0.9315
  eval_runtime: 139.2768
  eval_samples_per_second: 4.4440
  eval_steps_per_second: 0.5600
  epoch: 0.0863

✅ Evaluation completed!


## 💾 Step 12: Save Fine-tuned Model

Save both LoRA adapters and merged model.

In [16]:
print("Saving fine-tuned model...")

# Create output directory
Path(OUTPUT_DIR).mkdir(parents=True, exist_ok=True)

# Save LoRA adapters (small, ~50-100MB)
lora_path = f"{OUTPUT_DIR}/lora_adapters"
model.save_pretrained(lora_path)
tokenizer.save_pretrained(lora_path)
print(f"✅ LoRA adapters saved to: {lora_path}")

# Save merged model (large, ~16GB)
print("\nSaving merged model (this may take a few minutes)...")
merged_path = f"{OUTPUT_DIR}/merged_model"

model.save_pretrained_merged(
    merged_path,
    tokenizer,
    save_method="merged_16bit",
)
print(f"✅ Merged model saved to: {merged_path}")

print("\n" + "=" * 70)
print("MODEL SAVED SUCCESSFULLY!")
print("=" * 70)

Saving fine-tuned model...


✅ LoRA adapters saved to: ./outputs/lora_adapters

Saving merged model (this may take a few minutes)...
Found HuggingFace hub cache directory: /teamspace/studios/this_studio/.cache/huggingface/hub
Checking cache directory for required files...
Cache check failed: model-00001-of-00004.safetensors not found in local cache.
Not all required files found in cache. Will proceed with downloading.
Checking cache directory for required files...
Cache check failed: tokenizer.model not found in local cache.
Not all required files found in cache. Will proceed with downloading.


Unsloth: Preparing safetensor model files: 100%|██████████| 4/4 [00:00<00:00, 26800.66it/s]


Note: tokenizer.model not found (this is OK for non-SentencePiece models)


Unsloth: Merging weights into 16bit: 100%|██████████| 4/4 [01:18<00:00, 19.61s/it]


Unsloth: Merge process complete. Saved to `/teamspace/studios/this_studio/outputs/merged_model`
✅ Merged model saved to: ./outputs/merged_model

MODEL SAVED SUCCESSFULLY!


## 🧪 Step 13: Test the Fine-tuned Model

Test the fine-tuned model with sample questions from your domain.

In [20]:
print("Testing the fine-tuned model...")

# Prepare model for inference
FastLanguageModel.for_inference(model)

# Test with multiple questions
test_questions = [
    "How can I register my business in Nigeria?",
    "What are the requirements for obtaining a business license?",
    "How do I get a loan from the CBN?"
]

for i, question in enumerate(test_questions, 1):
    print(f"\n{'='*70}")
    print(f"Test {i}: {question}")
    print('='*70)
    
    test_messages = [
        {"role": "system", "content": "You are a helpful assistant for Nigerian government services and agencies. Provide accurate and concise information."},
        {"role": "user", "content": question}
    ]
    
    # Format and tokenize (the chat template already adds <|begin_of_text|>,
    # so skip the tokenizer's own special tokens to avoid a duplicate)
    test_prompt = tokenizer.apply_chat_template(test_messages, tokenize=False, add_generation_prompt=True)
    inputs = tokenizer(test_prompt, return_tensors="pt", add_special_tokens=False).to(model.device)
    
    # Generate
    outputs = model.generate(
        **inputs,
        max_new_tokens=512,
        temperature=0.7,
        top_p=0.9,
        do_sample=True,
        use_cache=True
    )
    
    # Decode only the newly generated tokens, so the prompt (whose system
    # message contains the word "assistant") never leaks into the response
    prompt_length = inputs["input_ids"].shape[1]
    assistant_response = tokenizer.decode(outputs[0][prompt_length:], skip_special_tokens=True).strip()
    print(f"\nResponse: {assistant_response}")

print("\n" + "=" * 70)
print("✅ Testing completed!")
print("=" * 70)

Testing the fine-tuned model...

Test 1: How can I register my business in Nigeria?



Response: Business registration in Nigeria typically starts with choosing a name and then registering the business with the Corporate Affairs Commission (CAC). The CAC website provides online registration tools and guidelines. You can start by checking name availability, then proceed to register your company through the CAC portal, including filing documents and making payments. For more information, visit the CAC website (www.cac.gov.ng) and follow the registration process.

Contact Information:
Agency: Corporate Affairs Commission (CAC)
Official Address: Plot 420, Tigris Crescent,
Off Aguiyi Ironsi Street,
Maitama, Abuja.
Nigeria.
Official Email: cservice@cac.gov.ng
Official Phone: (+234) 708 062 9000
Official Website: https://www.cac.gov.ng/

Test 2: What are the requirements for obtaining a business license?

Response: Applicants must meet specific licensing requirements for their business type, including registration with CAC, payment of relevant fees, and compliance with CAC reg

# Model Evaluation

This cell evaluates your fine-tuned model using quantitative metrics:

**ROUGE Metrics** (Recall-Oriented Understudy for Gisting Evaluation):
- Measures overlap between generated response and reference answer
- **ROUGE-1**: Unigram (single word) overlap
- **ROUGE-2**: Bigram (two consecutive words) overlap
- **ROUGE-L**: Longest common subsequence

**How ROUGE works:**
- Reference: "The capital of Nigeria is Abuja"
- Response: "Nigeria's capital city is Abuja"
- ROUGE-1: High (matches: capital, Nigeria, is, Abuja)
- ROUGE-2: Medium (matches: "is Abuja")
- ROUGE-L: Medium (longest common sequence)

**Interpreting scores** (0-1 scale):
- **>0.3**: Good overlap, relevant response
- **0.2-0.3**: Moderate overlap, partially correct
- **<0.2**: Low overlap, may be wrong or off-topic

**Note:** ROUGE measures n-gram overlap, not semantic meaning. A response can be correct but score low if phrased differently.

**Evaluation process:**
1. Load test dataset (unseen during training)
2. For each question, generate model response
3. Compare response to reference answer
4. Calculate ROUGE scores
5. Average across all samples

In [None]:
!pip install evaluate 
!pip install rouge_score

Collecting rouge_score
  Downloading rouge_score-0.1.2.tar.gz (17 kB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Building wheels for collected packages: rouge_score
  Building wheel for rouge_score (pyproject.toml) ... done
  Created wheel for rouge_score: filename=rouge_score-0.1.2-py3-none-any.whl size=24987 sha256=d33a006ae27a669fd5941c1a2ae4b1e90e74bfc09eb7da95b2df949b6761cc54
  Stored in directory: /home/zeus/.cache/pip/wheels/85/9d/af/01feefbe7d55ef5468796f0c68225b6788e85d9d0a281e7a70
Successfully built rouge_score
Installing collected packages: rouge_score
Successfully installed rouge_score-0.1.2


In [27]:
# ============================================================================
# Simple Model Evaluation on Test Set
# ============================================================================

import evaluate
from tqdm import tqdm

# Prepare model for inference
FastLanguageModel.for_inference(model)

# Load ROUGE metric
rouge = evaluate.load('rouge')

# Number of samples to test
NUM_EVAL_SAMPLES = min(10, len(test_df))  # Use test set

predictions = []
references = []

print(f"Evaluating {NUM_EVAL_SAMPLES} samples from TEST SET...")
print("="*60)

# Evaluate samples from TEST set (unseen during training)
for i in tqdm(range(NUM_EVAL_SAMPLES)):
    sample = test_df.iloc[i]
    
    # Get question and reference answer
    question = sample['messages'][1]['content']  # User message
    reference = sample['messages'][2]['content']  # Assistant message
    
    # Create prompt
    messages = [
        {"role": "system", "content": "You are a helpful assistant for Nigerian government services and agencies. Provide accurate information and include relevant contact details when available."},
        {"role": "user", "content": question}
    ]
    
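    # Format and generate (the chat template already adds <|begin_of_text|>,
    # so skip the tokenizer's own special tokens to avoid a duplicate)
    prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    inputs = tokenizer(prompt, return_tensors="pt", add_special_tokens=False).to(model.device)
    
    outputs = model.generate(
        **inputs,
        max_new_tokens=512,
        temperature=0.7,
        top_p=0.9,
        do_sample=True
    )
    
    # Extract response: decode only the tokens generated after the prompt
    prompt_length = inputs["input_ids"].shape[1]
    response = tokenizer.decode(outputs[0][prompt_length:], skip_special_tokens=True).strip()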
    
    predictions.append(response)
    references.append(reference)

# Calculate ROUGE scores
rouge_results = rouge.compute(predictions=predictions, references=references)

# Print results
print("\n" + "="*60)
print("📊 EVALUATION RESULTS (Test Set - Unseen Data)")
print("="*60)
print(f"Samples evaluated: {NUM_EVAL_SAMPLES}")
print(f"\nROUGE Scores (0-1, higher is better):")
print(f"  ROUGE-1: {rouge_results['rouge1']:.3f}  (word overlap)")
print(f"  ROUGE-2: {rouge_results['rouge2']:.3f}  (phrase overlap)")
print(f"  ROUGE-L: {rouge_results['rougeL']:.3f}  (sentence overlap)")

# Interpretation
print("\nInterpretation:")
if rouge_results['rouge1'] > 0.3:
    print("Good word overlap with reference answers")
elif rouge_results['rouge1'] > 0.2:
    print("Fair overlap - model is learning but could improve")
else:
    print("Low overlap - model needs more training")

if rouge_results['rougeL'] > 0.25:
    print("Good sentence structure similarity")
else:
    print("Responses differ significantly in structure")

print("="*60)

# Show 3 examples
print("\n📝 Sample Responses:\n")
for i in range(min(3, NUM_EVAL_SAMPLES)):
    print(f"Example {i+1}:")
    print(f"Q: {test_df.iloc[i]['messages'][1]['content'][:100]}...")
    print(f"\n📖 Reference: {references[i][:200]}...")
    print(f"\n🤖 Model: {predictions[i][:200]}...")
    print("-"*60 + "\n")

Evaluating 10 samples from TEST SET...


  0%|          | 0/10 [00:00<?, ?it/s]

100%|██████████| 10/10 [01:51<00:00, 11.15s/it]



📊 EVALUATION RESULTS (Test Set - Unseen Data)
Samples evaluated: 10

ROUGE Scores (0-1, higher is better):
  ROUGE-1: 0.424  (word overlap)
  ROUGE-2: 0.235  (phrase overlap)
  ROUGE-L: 0.358  (sentence overlap)

Interpretation:
  ✅ Good word overlap with reference answers
  ✅ Good sentence structure similarity

📝 Sample Responses:

Example 1:
Q: How does the Finance Act, 2020 reclassify company sizes and define primary agricultural production?...

📖 Reference: The Act defines new categories: (a) medium sized company as one with gross turnover between N25,000,000 and N100,000,000 per annum; (b) small sized company as one with gross turnover of N25,000,000 or...

🤖 Model: The Finance Act, 2020 amends the Companies and Allied Matters Act to classify companies by size (Small, Medium, and Large) based on turnover, assets, and employees. Small companies are defined as thos...
------------------------------------------------------------

Example 2:
Q: What is the significanc

## ☁️ Step 14: (Optional) Push to Hugging Face Hub

Upload your model to Hugging Face for sharing and deployment.

**Set `HF_MODEL_NAME` to your own repository ID before running this cell.**

In [None]:
from huggingface_hub import login

# Placeholder repository ID (hypothetical): replace with "<your-username>/<repo-name>"
HF_MODEL_NAME = "your-username/llama3-finetuned-lora"

# Login to Hugging Face
login()

# Push LoRA adapters
print("Pushing LoRA adapters to Hugging Face...")
model.push_to_hub(HF_MODEL_NAME, token=True, private=False)
tokenizer.push_to_hub(HF_MODEL_NAME, token=True, private=False)

# Optionally push merged model (takes longer)
# print("Pushing merged model to Hugging Face...")
# model.push_to_hub_merged(f"{HF_MODEL_NAME}-merged", tokenizer, save_method="merged_16bit")

print(f"✅ Model pushed to: https://huggingface.co/{HF_MODEL_NAME}")

## 📥 Step 15: Download Model Files

Download the trained model to your local machine.

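This cell zips the saved LoRA adapters and downloads the archive.

In [None]:
import shutil
from google.colab import files

# Zip the saved LoRA adapters (assumed target; use the merged model directory instead if preferred)
shutil.make_archive('llama3_finetuned_model', 'zip', f"{OUTPUT_DIR}/lora_adapters")

# Download directly in Colab
files.download('llama3_finetuned_model.zip')
print("✅ Download started!")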

## 🎉 Training Complete!


2. **Adjust hyperparameters**: If results aren't optimal, try:
   - Training for longer (raise `MAX_STEPS`, or switch to `num_train_epochs`)
   - Adjusting learning rate (LEARNING_RATE)
   - Changing LoRA rank (LORA_R)
3. **Deploy**: Use the model in your application
4. **Share**: Upload to Hugging Face Hub

