# Fine-Tune a Generative AI Model for Study Assistant (Q&A)

This notebook is adapted to work with the **study_assistant_data.jsonl** dataset.

- **Data format**: `instruction`/`input`/`output`
- **Task type**: question answering
- **Prompt template**: Adapted for instruction-following Q&A format
- **Data loading**: From local JSONL file

---
# Why Do We Need to Preprocess the Dataset?

Preprocessing is **essential** for fine-tuning LLMs. Here's why:

## 1. Format Compatibility
LLMs don't understand raw JSON or text - they need **tokenized numerical sequences**.

Your data looks like:
```json
{"instruction": "You are My Learning Buddy...", "input": "What is the Pythagorean theorem?", "output": "The formula is a² + b² = c²..."}
```

But the model needs:
```
input_ids: [425, 19, 8, 12901, 7, 9, 102, ...]
labels: [37, 3, 31839, 32, 9, ...]
```

## 2. Tokenization Requirements
- **Breaking text into tokens**: "preprocessing" → ["pre", "process", "ing"]
- **Adding special tokens**: for example `</s>` (end of sequence) and `<pad>` (padding); T5-style tokenizers do not add a begin-of-sequence token
- **Uniform lengths**: All sequences padded/truncated to same length for efficient batching
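
As a rough illustration of the first bullet, the sketch below splits a word by greedy longest-prefix matching against a tiny hand-written vocabulary. Both `TOY_VOCAB` and `toy_subword_split` are hypothetical names made up for this example; FLAN-T5's real tokenizer is a trained SentencePiece model and will generally produce different pieces.

```python
# Toy illustration only: a hand-written vocabulary, not FLAN-T5's real one.
TOY_VOCAB = {"pre", "process", "ing", "token", "ize"}

def toy_subword_split(word, vocab=TOY_VOCAB):
    """Greedily split `word` into the longest vocabulary pieces, left to right."""
    pieces = []
    start = 0
    while start < len(word):
        # Try the longest remaining substring first, then shrink it.
        for end in range(len(word), start, -1):
            if word[start:end] in vocab:
                pieces.append(word[start:end])
                start = end
                break
        else:
            # No piece matched: emit an unknown marker for this character.
            pieces.append("<unk>")
            start += 1
    return pieces

print(toy_subword_split("preprocessing"))  # ['pre', 'process', 'ing']
```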

## 3. Prompt Engineering
We wrap your data in a template so the model learns the expected format:
```
Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
{instruction}

### Input:
{input}

### Response:
{output}
```
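
A minimal sketch of filling this template for one record. The `build_prompt` helper and `PROMPT_TEMPLATE` constant are hypothetical names; the prompt stops at `### Response:` because, for an encoder-decoder model such as FLAN-T5, the answer is supplied separately as the label rather than appended to the input.

```python
PROMPT_TEMPLATE = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n"
    "### Input:\n{input}\n\n"
    "### Response:\n"
)

def build_prompt(record):
    """Fill the template with one record; the answer stays separate as the target."""
    return PROMPT_TEMPLATE.format(
        instruction=record["instruction"],
        input=record["input"],
    )

record = {
    "instruction": "You are My Learning Buddy...",
    "input": "What is the Pythagorean theorem?",
    "output": "The formula is a² + b² = c²...",
}
print(build_prompt(record))
```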

## 4. Separating Input and Labels
- **input_ids**: What the model sees (the full prompt: instruction plus question)
- **labels**: What the model should generate (the answer)
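
One detail worth knowing when labels are padded: Hugging Face models ignore label positions set to `-100` when computing the loss, so padding in the labels is commonly replaced with that value. The sketch below shows the idea on made-up token ids; `mask_padding` is a hypothetical helper, and the pad id of `0` is the value T5 tokenizers use.

```python
PAD_ID = 0            # padding token id used by T5 tokenizers
IGNORE_INDEX = -100   # label value skipped by the loss

def mask_padding(label_ids, pad_id=PAD_ID, ignore_index=IGNORE_INDEX):
    """Replace padding ids in a label sequence so they do not affect the loss."""
    return [ignore_index if tok == pad_id else tok for tok in label_ids]

padded_labels = [37, 3, 31839, 32, 9, 1, 0, 0, 0]  # made-up ids, padded to length 9
print(mask_padding(padded_labels))
```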

## 5. Memory Efficiency
- GPUs process data in batches
- All sequences in a batch must have the same length
- Preprocessing creates uniform tensors for parallel computation
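
The padding and truncation step can be sketched in plain Python. This is only an illustration of what the tokenizer does internally when `padding="max_length"` and `truncation=True` are set; `pad_or_truncate` is a hypothetical helper and the token ids are made up.

```python
def pad_or_truncate(ids, max_length, pad_id=0):
    """Cut or right-pad `ids` to exactly `max_length`; also return an attention mask."""
    ids = ids[:max_length]
    mask = [1] * len(ids)              # 1 marks a real token
    padding = max_length - len(ids)
    return ids + [pad_id] * padding, mask + [0] * padding  # 0 marks padding

batch = [[425, 19, 8], [12901, 7, 9, 102, 5, 3, 1]]
for seq in batch:
    print(pad_or_truncate(seq, max_length=5))
```

After this step every row has the same length, so the batch can be stacked into a single rectangular tensor.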

---
# Table of Contents

- [1 - Set up Kernel, Load Required Dependencies, Dataset and LLM](#1)
  - [1.1 - Set up Kernel and Required Dependencies](#1.1)
  - [1.2 - Load Dataset and LLM](#1.2)
  - [1.3 - Test the Model with Zero Shot Inferencing](#1.3)
- [2 - Perform Full Fine-Tuning](#2)
  - [2.1 - Preprocess the Q&A Dataset](#2.1)
  - [2.2 - Fine-Tune the Model with the Preprocessed Dataset](#2.2)
  - [2.3 - Evaluate the Model Qualitatively (Human Evaluation)](#2.3)
  - [2.4 - Evaluate the Model Quantitatively (with ROUGE Metric)](#2.4)
- [3 - Perform Parameter Efficient Fine-Tuning (PEFT)](#3)
  - [3.1 - Setup the PEFT/LoRA model for Fine-Tuning](#3.1)
  - [3.2 - Train PEFT Adapter](#3.2)
  - [3.3 - Evaluate the Model Qualitatively (Human Evaluation)](#3.3)
  - [3.4 - Evaluate the Model Quantitatively (with ROUGE Metric)](#3.4)

<a name='1'></a>
## 1 - Set up Kernel, Load Required Dependencies, Dataset and LLM

<a name='1.1'></a>
### 1.1 - Set up Kernel and Required Dependencies

In [None]:
!pip install -U pip setuptools wheel
!pip install -q torch torchdata
!pip install -q "transformers>=4.40.0" datasets evaluate rouge-score peft loralib



In [None]:
from datasets import load_dataset, Dataset, DatasetDict
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, GenerationConfig, TrainingArguments, Trainer
import torch
import time
import evaluate
import pandas as pd
import numpy as np
import json

<a name='1.2'></a>
### 1.2 - Load Dataset and LLM

We load the local `study_assistant_data.jsonl` file.

The data has the format:
- `instruction`: System instruction ("You are My Learning Buddy...")
- `input`: The user's question
- `output`: The expected answer
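
Before loading, it can help to confirm that every line of the file really has these three fields. The following is a small sketch using only the standard library; `find_bad_records` and `REQUIRED_KEYS` are hypothetical names, and the sample lines are made up.

```python
import json

REQUIRED_KEYS = {"instruction", "input", "output"}

def find_bad_records(lines):
    """Return (line_number, problem) pairs for lines that are not valid records."""
    problems = []
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # ignore blank lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError as err:
            problems.append((number, f"invalid JSON: {err.msg}"))
            continue
        if not isinstance(record, dict):
            problems.append((number, "not a JSON object"))
            continue
        missing = REQUIRED_KEYS - set(record)
        if missing:
            problems.append((number, f"missing keys: {sorted(missing)}"))
    return problems

sample = [
    '{"instruction": "You are My Learning Buddy...", "input": "Q?", "output": "A."}',
    '{"instruction": "You are My Learning Buddy...", "input": "Q?"}',
]
print(find_bad_records(sample))  # [(2, "missing keys: ['output']")]
```

The same function accepts an open file handle, since iterating over a file yields its lines.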

In [None]:

# Load the study assistant dataset from JSONL file
file_path = '/content/study_assistant_data.jsonl'
dataset = load_dataset('json', data_files=file_path, split='train')

print(f"Dataset size: {len(dataset)}")
print(f"\nDataset features: {dataset.features}")
print(f"\nFirst example:")
print(dataset[0])

In [None]:
# First split: 80% train, 20% temp (divided into validation and test below)
train_test = dataset.train_test_split(test_size=0.2, seed=42)
train_dataset = train_test['train']
temp_dataset = train_test['test']

# Second split: 50% of temp for validation, 50% for test
val_test = temp_dataset.train_test_split(test_size=0.5, seed=42)
val_dataset = val_test['train']
test_dataset = val_test['test']

print(f"Train samples: {len(train_dataset)}")
print(f"Validation samples: {len(val_dataset)}")
print(f"Test samples: {len(test_dataset)}")

# Create HuggingFace DatasetDict
dataset = DatasetDict({
    'train': train_dataset,
    'validation': val_dataset,
    'test': test_dataset
})

print(f"\nDataset structure:")
print(dataset)
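
The two-stage split yields roughly an 80/10/10 division. As a worked example, the sketch below computes the expected sizes for a 50-example dataset, assuming the test share is rounded up (the scikit-learn convention); the exact rounding used by `datasets` may differ for sizes that do not divide evenly, and `split_sizes` is a hypothetical helper.

```python
import math

def split_sizes(n_samples, test_fraction):
    """Sizes of (train, test) when the test share is rounded up."""
    n_test = math.ceil(n_samples * test_fraction)
    return n_samples - n_test, n_test

n_train, n_temp = split_sizes(50, 0.2)    # assumes 50 examples in total
n_val, n_test = split_sizes(n_temp, 0.5)  # halve the held-out part
print(n_train, n_val, n_test)  # 40 5 5
```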

Load the pre-trained [FLAN-T5 model](https://huggingface.co/docs/transformers/model_doc/flan-t5) and its tokenizer directly from HuggingFace. FLAN-T5 is already instruction-tuned, making it a good choice for Q&A tasks.

In [None]:
model_name = 'google/flan-t5-base'

original_model = AutoModelForSeq2SeqLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(model_name)

In [None]:
def print_number_of_trainable_model_parameters(model):
    trainable_model_params = 0
    all_model_params = 0
    for _, param in model.named_parameters():
        all_model_params += param.numel()
        if param.requires_grad:
            trainable_model_params += param.numel()
    return f"trainable model parameters: {trainable_model_params}\nall model parameters: {all_model_params}\npercentage of trainable model parameters: {100 * trainable_model_params / all_model_params:.2f}%"

print(print_number_of_trainable_model_parameters(original_model))

<a name='1.3'></a>
### 1.3 - Test the Model with Zero Shot Inferencing

Let's test how the base model performs on your Q&A task WITHOUT fine-tuning.

In [None]:

index = 30

# Get a sample from the train split
instruction = dataset['train'][index]['instruction']
question = dataset['train'][index]['input']
expected_answer = dataset['train'][index]['output']

# Create prompt for Q&A
prompt = f"""
Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
{instruction}

### Input:
{question}

### Response:
"""

inputs = tokenizer(prompt, return_tensors='pt')
output = tokenizer.decode(
    original_model.generate(
        inputs["input_ids"],
        max_new_tokens=500,
    )[0],
    skip_special_tokens=True
)

dash_line = '-' * 99
print(dash_line)
print(f'INPUT QUESTION:\n{question}')
print(dash_line)
print(f'EXPECTED ANSWER (first 500 chars):\n{expected_answer[:500]}...')
print(dash_line)
print(f'MODEL GENERATION - ZERO SHOT:\n{output}')

<a name='2'></a>
## 2 - Perform Full Fine-Tuning

<a name='2.1'></a>
### 2.1 - Preprocess the Q&A Dataset

**This is where the magic happens!** We need to:

1. Convert each Q&A pair into a prompt template
2. Tokenize both the prompt (input) and the answer (labels)
3. Pad/truncate to uniform lengths

The tokenize function uses `instruction`, `input`, and `output` fields.

In [None]:

def tokenize_function(example):
    """
    Preprocess the Q&A dataset:
    - Creates prompt from instruction + input
    - Tokenizes prompt as input_ids
    - Tokenizes output as labels
    """
    # Build the prompt template (Alpaca-style format)
    start_prompt = 'Below is an instruction that describes a task. Write a response that appropriately completes the request.\n\n### Instruction:\n'
    middle_prompt = '\n\n### Input:\n'
    end_prompt = '\n\n### Response:\n'

    # Combine instruction and input into the prompt
    prompts = [
        start_prompt + inst + middle_prompt + inp + end_prompt
        for inst, inp in zip(example["instruction"], example["input"])
    ]

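    # Tokenize the prompts (input to the model), keeping the attention mask
    # so the encoder can ignore padding positions
    model_inputs = tokenizer(
        prompts,
        padding="max_length",
        truncation=True,
        max_length=512,  # Increased for longer instructions
    )
    example['input_ids'] = model_inputs.input_ids
    example['attention_mask'] = model_inputs.attention_mask

    # Tokenize the outputs/answers (what the model should generate)
    label_ids = tokenizer(
        example["output"],
        padding="max_length",
        truncation=True,
        max_length=512,  # Increased for longer answers
    ).input_ids

    # Replace padding ids with -100 so the loss skips padded label positions
    example['labels'] = [
        [tok if tok != tokenizer.pad_token_id else -100 for tok in seq]
        for seq in label_ids
    ]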

    return example

# Apply tokenization to all splits
tokenized_datasets = dataset.map(tokenize_function, batched=True)

# Remove the original text columns (no longer needed after tokenization)
tokenized_datasets = tokenized_datasets.remove_columns(['instruction', 'input', 'output'])

print("Tokenization complete!")
print(tokenized_datasets)

In [None]:
print(f"Shapes of the datasets:")
print(f"Training: {tokenized_datasets['train'].shape}")
print(f"Validation: {tokenized_datasets['validation'].shape}")
print(f"Test: {tokenized_datasets['test'].shape}")

print(tokenized_datasets)

<a name='2.2'></a>
### 2.2 - Fine-Tune the Model with the Preprocessed Dataset

Now we train the model using the HuggingFace `Trainer` class.
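
When both `num_train_epochs` and `max_steps` are set in `TrainingArguments`, a positive `max_steps` takes precedence, so it is worth checking how many passes over the data a given step budget implies. The sketch below does that arithmetic; `epochs_covered` is a hypothetical helper, and the figure of 40 training examples assumes a 50-example dataset with an 80% train split, a single device, and no gradient accumulation.

```python
import math

def epochs_covered(n_train, batch_size, max_steps, grad_accum=1):
    """Approximate number of passes over the data made in `max_steps` optimizer steps."""
    steps_per_epoch = math.ceil(n_train / (batch_size * grad_accum))
    return max_steps / steps_per_epoch

# Assumed: 40 training examples, batch size 4, 100 optimizer steps.
print(epochs_covered(n_train=40, batch_size=4, max_steps=100))  # 10.0
```

Under these assumptions, 100 steps correspond to about ten passes over the training split.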

In [None]:
output_dir = f'./study-assistant-training-{str(int(time.time()))}'

training_args = TrainingArguments(
    output_dir=output_dir,
    learning_rate=1e-5,
    num_train_epochs=3,  # Ignored while max_steps is set (max_steps takes precedence)
    weight_decay=0.01,
    logging_steps=1,
    max_steps=100,  # Total optimizer steps; overrides num_train_epochs
    per_device_train_batch_size=4,  # Small batch size for small dataset
)

trainer = Trainer(
    model=original_model,
    args=training_args,
    train_dataset=tokenized_datasets['train'],
    eval_dataset=tokenized_datasets['validation']
)

In [None]:
# Optional: log in to Weights & Biases for experiment tracking
# Skip this cell if you do not use wandb (it expects a WANDB_API_KEY secret in Colab)

import wandb
from google.colab import userdata
wandb_key = userdata.get('WANDB_API_KEY')
wandb.login(key=wandb_key)

In [None]:
# Start training
trainer.train()

In [None]:
# Save the fine-tuned model
trainer.save_model("./study-assistant-finetuned-checkpoint")
tokenizer.save_pretrained("./study-assistant-finetuned-checkpoint")

In [None]:
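# Load the fine-tuned model for inference
instruct_model = AutoModelForSeq2SeqLM.from_pretrained(
    "./study-assistant-finetuned-checkpoint",
    torch_dtype=torch.bfloat16
)

# Trainer updated `original_model` in place, so reload the base weights
# to keep a true pre-fine-tuning baseline for the comparisons below
original_model = AutoModelForSeq2SeqLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)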

<a name='2.3'></a>
### 2.3 - Evaluate the Model Qualitatively (Human Evaluation)

Let's compare the original model vs the fine-tuned model on the same question.

In [None]:

from transformers import GenerationConfig

# Pick a sample from test set
index = 0
instruction = dataset['test'][index]['instruction']
question = dataset['test'][index]['input']
expected_answer = dataset['test'][index]['output']

# Create the prompt
prompt = f"""
Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
{instruction}

### Input:
{question}

### Response:
"""

# Move models to device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
original_model = original_model.to(device)
instruct_model = instruct_model.to(device)

# Tokenize input
input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)

# Generation config
gen_config = GenerationConfig(max_new_tokens=300, num_beams=1)

# Generate outputs
with torch.no_grad():
    original_outputs = original_model.generate(input_ids=input_ids, generation_config=gen_config)
    instruct_outputs = instruct_model.generate(input_ids=input_ids, generation_config=gen_config)

# Decode
original_text = tokenizer.decode(original_outputs[0], skip_special_tokens=True)
instruct_text = tokenizer.decode(instruct_outputs[0], skip_special_tokens=True)

# Print comparison
dash_line = "-" * 80
print(dash_line)
print(f'QUESTION:\n{question}')
print(dash_line)
print(f'EXPECTED ANSWER (first 500 chars):\n{expected_answer[:500]}...')
print(dash_line)
print(f'ORIGINAL MODEL OUTPUT:\n{original_text}')
print(dash_line)
print(f'FINE-TUNED MODEL OUTPUT:\n{instruct_text}')
print(dash_line)

<a name='2.4'></a>
### 2.4 - Evaluate the Model Quantitatively (with ROUGE Metric)

ROUGE metrics help measure how similar the generated text is to the expected output.
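
To make the idea concrete, here is a simplified sketch of ROUGE-1, which scores unigram overlap between a prediction and a reference. It assumes lowercase whitespace tokenization and no stemming, so its numbers will not match the `evaluate` implementation exactly; `rouge1_f1` is a hypothetical helper.

```python
from collections import Counter

def rouge1_f1(prediction, reference):
    """Unigram-overlap F1 between two strings, using lowercase whitespace tokens."""
    pred_counts = Counter(prediction.lower().split())
    ref_counts = Counter(reference.lower().split())
    # Counter intersection keeps the minimum count per token (clipped matches).
    overlap = sum((pred_counts & ref_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)

print(rouge1_f1("the formula is a2 + b2 = c2", "the formula is a2 + b2 = c2"))  # 1.0
print(round(rouge1_f1("the cat", "the cat sat on"), 3))  # 0.667
```

In the second call the prediction's two words both appear in the reference (precision 1.0) but cover only half of it (recall 0.5), which gives an F1 of about 0.667.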

In [None]:
rouge = evaluate.load('rouge')

In [None]:

# Get all test samples
test_instructions = dataset['test']['instruction']
test_questions = dataset['test']['input']
expected_answers = dataset['test']['output']

# Initialize result lists
original_model_answers = []
instruct_model_answers = []

gen_config = GenerationConfig(max_new_tokens=300, num_beams=1)

# Loop through test samples
for idx, (inst, question) in enumerate(zip(test_instructions, test_questions)):
    prompt = f"""
Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
{inst}

### Input:
{question}

### Response:
"""

    input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)

    with torch.no_grad():
        # Original model
        orig_out = original_model.generate(input_ids=input_ids, generation_config=gen_config)
        orig_text = tokenizer.decode(orig_out[0], skip_special_tokens=True)
        original_model_answers.append(orig_text)

        # Fine-tuned model
        inst_out = instruct_model.generate(input_ids=input_ids, generation_config=gen_config)
        inst_text = tokenizer.decode(inst_out[0], skip_special_tokens=True)
        instruct_model_answers.append(inst_text)

    print(f"✅ Processed sample {idx+1}/{len(test_questions)}")

# Create results DataFrame
results_df = pd.DataFrame({
    'question': test_questions,
    'expected_answer': expected_answers,
    'original_model': original_model_answers,
    'finetuned_model': instruct_model_answers
})

results_df

In [None]:
# Compute ROUGE scores
original_model_results = rouge.compute(
    predictions=original_model_answers,
    references=expected_answers,
    use_aggregator=True,
    use_stemmer=True,
)

instruct_model_results = rouge.compute(
    predictions=instruct_model_answers,
    references=expected_answers,
    use_aggregator=True,
    use_stemmer=True,
)

print('ORIGINAL MODEL ROUGE SCORES:')
print(original_model_results)
print('\nFINE-TUNED MODEL ROUGE SCORES:')
print(instruct_model_results)

In [None]:
print("Absolute percentage improvement of FINE-TUNED MODEL over ORIGINAL MODEL")

improvement = (np.array(list(instruct_model_results.values())) - np.array(list(original_model_results.values())))
for key, value in zip(instruct_model_results.keys(), improvement):
    print(f'{key}: {value*100:.2f}%')

<a name='3'></a>
## 3 - Perform Parameter Efficient Fine-Tuning (PEFT)

PEFT allows us to fine-tune only a small number of parameters, saving compute resources while achieving good results.
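
How small is "a small number"? With LoRA, each adapted linear layer gains two low-rank matrices, so the added parameter count is `rank * (d_in + d_out)` per layer. The sketch below works this out for rank-32 adapters on the query and value projections. The FLAN-T5-base shapes used here (hidden size 768, 12 encoder and 12 decoder blocks, about 247.6M base parameters) are assumptions stated for the calculation, and `lora_param_count` is a hypothetical helper.

```python
def lora_param_count(d_in, d_out, rank, n_modules):
    """Each adapted linear layer gains A (rank x d_in) and B (d_out x rank)."""
    return n_modules * rank * (d_in + d_out)

# Assumed FLAN-T5-base shapes: hidden size 768, 12 encoder and 12 decoder blocks.
d_model = 768
attention_blocks = 12 + 12 * 2          # encoder self-attn + decoder self- and cross-attn
adapted_modules = attention_blocks * 2  # query and value projections in each block
base_params = 247_577_856               # assumed parameter count of the base model

added = lora_param_count(d_model, d_model, rank=32, n_modules=adapted_modules)
share = 100 * added / (base_params + added)
print(added, f"{share:.2f}%")  # 3538944 1.41%
```

The helper that prints trainable parameters earlier in this notebook can be used to confirm the actual figure once the PEFT model is built.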

<a name='3.1'></a>
### 3.1 - Setup the PEFT/LoRA model for Fine-Tuning

In [None]:
from peft import LoraConfig, get_peft_model, TaskType

lora_config = LoraConfig(
    r=32,  # Rank
    lora_alpha=32,
    target_modules=["q", "v"],
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.SEQ_2_SEQ_LM  # FLAN-T5
)

In [None]:
# Reload the original model for PEFT training
original_model_for_peft = AutoModelForSeq2SeqLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16
)

peft_model = get_peft_model(original_model_for_peft, lora_config)
print(print_number_of_trainable_model_parameters(peft_model))

<a name='3.2'></a>
### 3.2 - Train PEFT Adapter

In [None]:
output_dir = f'./peft-study-assistant-training-{str(int(time.time()))}'

peft_training_args = TrainingArguments(
    output_dir=output_dir,
    auto_find_batch_size=True,
    learning_rate=1e-3,  # Higher learning rate for PEFT
    num_train_epochs=5,  # Ignored while max_steps is set (max_steps takes precedence)
    logging_steps=1,
    max_steps=100  # Total optimizer steps; overrides num_train_epochs
)

peft_trainer = Trainer(
    model=peft_model,
    args=peft_training_args,
    train_dataset=tokenized_datasets["train"],
)

In [None]:
# Train PEFT model
peft_trainer.train()

# Save the PEFT model
peft_model_path = "./peft-study-assistant-checkpoint-local-l"
peft_trainer.model.save_pretrained(peft_model_path)
tokenizer.save_pretrained(peft_model_path)

<a name='3.3'></a>
### 3.3 - Evaluate the Model Qualitatively (Human Evaluation)

In [None]:
from peft import PeftModel

# Load PEFT model for inference
peft_model_base = AutoModelForSeq2SeqLM.from_pretrained(
    "google/flan-t5-base",
    torch_dtype=torch.bfloat16
)

peft_model = PeftModel.from_pretrained(
    peft_model_base,
    './peft-study-assistant-checkpoint-local-l/',
    torch_dtype=torch.bfloat16,
    is_trainable=False
)

print(print_number_of_trainable_model_parameters(peft_model))

In [None]:
# Compare all three models on a test sample
index = 0
instruction = dataset['test'][index]['instruction']
question = dataset['test'][index]['input']
expected_answer = dataset['test'][index]['output']

prompt = f"""
Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
{instruction}

### Input:
{question}

### Response:
"""

# Move PEFT model to device
peft_model = peft_model.to(device)

input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)
gen_config = GenerationConfig(max_new_tokens=300, num_beams=1)

with torch.no_grad():
    original_outputs = original_model.generate(input_ids=input_ids, generation_config=gen_config)
    instruct_outputs = instruct_model.generate(input_ids=input_ids, generation_config=gen_config)
    peft_outputs = peft_model.generate(input_ids=input_ids, generation_config=gen_config)

original_text = tokenizer.decode(original_outputs[0], skip_special_tokens=True)
instruct_text = tokenizer.decode(instruct_outputs[0], skip_special_tokens=True)
peft_text = tokenizer.decode(peft_outputs[0], skip_special_tokens=True)

dash_line = "-" * 80
print(dash_line)
print(f'QUESTION:\n{question}')
print(dash_line)
print(f'EXPECTED ANSWER (first 500 chars):\n{expected_answer[:500]}...')
print(dash_line)
print(f'ORIGINAL MODEL:\n{original_text}')
print(dash_line)
print(f'FULL FINE-TUNED MODEL:\n{instruct_text}')
print(dash_line)
print(f'PEFT MODEL:\n{peft_text}')
print(dash_line)

<a name='3.4'></a>
### 3.4 - Evaluate the Model Quantitatively (with ROUGE Metric)

In [None]:
# Generate PEFT model outputs for all test samples
peft_model_answers = []

for idx, (inst, question) in enumerate(zip(test_instructions, test_questions)):
    prompt = f"""
Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
{inst}

### Input:
{question}

### Response:
"""

    input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)

    with torch.no_grad():
        peft_out = peft_model.generate(input_ids=input_ids, generation_config=gen_config)
        peft_text = tokenizer.decode(peft_out[0], skip_special_tokens=True)
        peft_model_answers.append(peft_text)

    print(f"✅ Processed sample {idx+1}/{len(test_questions)}")

In [None]:
# Compute ROUGE scores for PEFT model
peft_model_results = rouge.compute(
    predictions=peft_model_answers,
    references=expected_answers,
    use_aggregator=True,
    use_stemmer=True,
)

print('ORIGINAL MODEL:')
print(original_model_results)
print('\nFULL FINE-TUNED MODEL:')
print(instruct_model_results)
print('\nPEFT MODEL:')
print(peft_model_results)

In [None]:
print("Absolute percentage improvement of PEFT MODEL over ORIGINAL MODEL")

improvement = (np.array(list(peft_model_results.values())) - np.array(list(original_model_results.values())))
for key, value in zip(peft_model_results.keys(), improvement):
    print(f'{key}: {value*100:.2f}%')

In [None]:
print("Absolute percentage improvement of PEFT MODEL over FULL FINE-TUNED MODEL")

improvement = (np.array(list(peft_model_results.values())) - np.array(list(instruct_model_results.values())))
for key, value in zip(peft_model_results.keys(), improvement):
    print(f'{key}: {value*100:.2f}%')

## Summary

In this notebook, we:

1. **Loaded your custom study assistant Q&A dataset** from a JSONL file
2. **Preprocessed the data** by:
   - Creating prompt templates with instruction/input/response format
   - Tokenizing both inputs and labels
   - Padding/truncating to uniform lengths
3. **Performed full fine-tuning** on FLAN-T5
4. **Performed PEFT/LoRA fine-tuning** with only ~1.4% of parameters trainable (rank 32 on the query and value projections)
5. **Evaluated both approaches** using ROUGE metrics

### Key Takeaways:
- **Preprocessing is essential** to convert text to tokenized tensors
- **PEFT can achieve comparable results** with much less compute; check the ROUGE scores above to confirm this for your data
- **With small datasets** like yours (50 samples), PEFT often works better because training fewer parameters lowers the overfitting risk