# Fine-Tuned Model Inference and Testing

## 🎯 Purpose
This notebook demonstrates **inference and testing** with our fine-tuned LLaMA-2-7B model. After completing the LoRA fine-tuning process, this notebook loads the trained model and runs it on various inputs to demonstrate the improvements achieved through instruction fine-tuning on the Dolly-15K dataset.

## 📊 Evaluation Context
Following the complete fine-tuning pipeline:

1. **[Baseline Model Evaluation](2_baseline_model.ipynb)** - Established baseline performance metrics
2. **[Fine-tuning Process](3_finetuning.ipynb)** - LoRA fine-tuning on Dolly-15K dataset  
3. **This Notebook** - Fine-tuned model inference and testing
4. **[AlpacaEval Benchmark](5_benchmark_alpaca_eval.ipynb)** - Formal evaluation on AlpacaEval dataset

## 🔧 Technical Implementation
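- **Model**: LLaMA-2-7B base loaded in 4-bit NF4 quantization, with the fine-tuned LoRA adapter (rank 16, dropout 0.05) loaded from the saved adapter directory
- **Inference**: Generate responses with sampling (temperature 0.7, top-p 0.9, up to 1024 new tokens) using the `### Instruction:` / `### Response:` prompt template
- **Testing**: Compare fine-tuned outputs side by side with the saved baseline outputs
- **Evaluation**: Check qualitatively for improved instruction following; formal scoring is left to the AlpacaEval notebook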

## 📈 Expected Improvements
After LoRA fine-tuning on Dolly-15K, we expect to see:
- **Better instruction following** - More accurate responses to user prompts
- **Improved formatting** - Consistent response structure and formatting
- **Enhanced helpfulness** - More useful and coherent responses
- **Better context understanding** - Improved ability to follow complex instructions

## 🛠️ Workflow
1. **Load fine-tuned model** from saved LoRA adapters
2. **Test basic inference** with simple prompts
3. **Run on evaluation dataset** to generate responses
4. **Compare outputs** with baseline model results
5. **Save results** for further analysis and reporting

## 🎯 Success Criteria
- **Successful model loading** - Fine-tuned weights load correctly
- **Improved response quality** - Better instruction following than baseline
- **Consistent performance** - Reliable responses across different input types
- **Clear improvements** - Demonstrable benefits from fine-tuning

---
**Note**: This notebook applies the fine-tuned model to a small evaluation sample and provides qualitative evidence for the LoRA fine-tuning approach; quantitative results come from the AlpacaEval benchmark notebook.


In [1]:
!pip install git+https://github.com/tatsu-lab/alpaca_eval.git

Collecting git+https://github.com/tatsu-lab/alpaca_eval.git
  Cloning https://github.com/tatsu-lab/alpaca_eval.git to /tmp/pip-req-build-t94pdjug
  Running command git clone --filter=blob:none --quiet https://github.com/tatsu-lab/alpaca_eval.git /tmp/pip-req-build-t94pdjug
  Resolved https://github.com/tatsu-lab/alpaca_eval.git to commit cd543a149df89434d8a54582c0151c0b945c3d20
  Preparing metadata (setup.py) ... done
Collecting fire (from alpaca_eval==0.6.6)
  Downloading fire-0.7.1-py3-none-any.whl.metadata (5.8 kB)
Downloading fire-0.7.1-py3-none-any.whl (115 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 115.9/115.9 kB 5.2 MB/s eta 0:00:00
Building wheels for collected packages: alpaca_eval
  Building wheel for alpaca_eval (setup.py) ... done
  Created wheel for alpaca_eval: filename=alpaca_eval-0.6.6-py3-none-any.whl size=362273 sha256=8821dab3535a1f8fc63a66f72dbdfb5cd3caf6af4d35edbe22645d95b32bbbf1
  Stored in

In [2]:
!pip install -U transformers peft bitsandbytes

Collecting bitsandbytes
  Downloading bitsandbytes-0.48.1-py3-none-manylinux_2_24_x86_64.whl.metadata (10 kB)
Downloading bitsandbytes-0.48.1-py3-none-manylinux_2_24_x86_64.whl (60.1 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 60.1/60.1 MB 15.7 MB/s eta 0:00:00
Installing collected packages: bitsandbytes
Successfully installed bitsandbytes-0.48.1


In [3]:
import os

import polars as pl
import torch
from google.colab import drive
from peft import PeftModel
from tqdm.notebook import tqdm
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

In [4]:
drive.mount('/content/drive', force_remount=True)

Mounted at /content/drive


In [5]:
from huggingface_hub import login
login(new_session=False)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



In [6]:
# Load the 4-bit quantized base model, then attach the fine-tuned LoRA adapter
model_id = "meta-llama/Llama-2-7b-hf"
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)
base_model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True
)

adapter_path = "/content/drive/MyDrive/LLaMA2-Dolly-Training/results/final_lora_adapter"
print(f"Loading Fine-Tuned LoRA adapter from: {adapter_path}...")
model = PeftModel.from_pretrained(base_model, adapter_path)
model.eval()  # inference only: make sure LoRA dropout is disabled

# The tokenizer was saved alongside the adapter during fine-tuning
tokenizer = AutoTokenizer.from_pretrained(adapter_path)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
# Left-pad for generation so that new tokens directly follow each prompt
# (no effect at batch size 1, but required for correct batched generation)
tokenizer.padding_side = "left"

print(model)




Loading Fine-Tuned LoRA adapter from: /content/drive/MyDrive/LLaMA2-Dolly-Training/results/final_lora_adapter...
PeftModelForCausalLM(
  (base_model): LoraModel(
    (model): LlamaForCausalLM(
      (model): LlamaModel(
        (embed_tokens): Embedding(32000, 4096)
        (layers): ModuleList(
          (0-31): 32 x LlamaDecoderLayer(
            (self_attn): LlamaAttention(
              (q_proj): lora.Linear4bit(
                (base_layer): Linear4bit(in_features=4096, out_features=4096, bias=False)
                (lora_dropout): ModuleDict(
                  (default): Dropout(p=0.05, inplace=False)
                )
                (lora_A): ModuleDict(
                  (default): Linear(in_features=4096, out_features=16, bias=False)
                )
                (lora_B): ModuleDict(
                  (default): Linear(in_features=16, out_features=4096, bias=False)
                )
                (lora_embedding_A): ParameterDict()
                (lora_embedding_B): P
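
The printout above shows the adapter on `q_proj`: `lora_A` maps 4096 features to 16 and `lora_B` maps 16 back to 4096, so the LoRA rank is 16. The sketch below is a back-of-the-envelope count of how many trainable parameters that adds relative to the frozen 4096 x 4096 base weight. `lora_param_count` is a hypothetical helper, and only `q_proj` is visible in the truncated printout, so nothing is assumed about other projections.

```python
def lora_param_count(in_features: int, out_features: int, rank: int) -> int:
    """Parameters in one LoRA pair: A is (in x rank), B is (rank x out), no biases."""
    return in_features * rank + rank * out_features


# Shapes read from the q_proj entry in the model printout
base_params = 4096 * 4096
lora_params = lora_param_count(4096, 4096, 16)
ratio = lora_params / base_params

print(lora_params)       # 131072
print(base_params)       # 16777216
print(f"{ratio:.2%}")    # 0.78%
```

So for this one projection the adapter trains under 1% as many parameters as the frozen weight it modifies.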

In [7]:
# Smoke test: a single raw prompt, without the instruction template used in the evaluation run
prompt = "What is the capital of France"

inputs = tokenizer(prompt, return_tensors="pt")
inputs = inputs.to(model.device)
generate_ids = model.generate(
    **inputs,
    max_new_tokens=50,
    pad_token_id=tokenizer.eos_token_id
)
generated_text = tokenizer.decode(generate_ids[0], skip_special_tokens=True)
print(generated_text)

What is the capital of France?
 nobody knows
What is the capital of France?
Paris is the capital of France.
What is the capital of France? Paris
What is the capital of France? Paris
What is the capital of France? Paris
What
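
The smoke test above passes the raw question to the model without the `### Instruction:` / `### Response:` template used in the batch-generation cell of this notebook, which may be why the output keeps repeating the question instead of stopping after one answer. A minimal sketch of wrapping a prompt in that template follows; `build_prompt` is a hypothetical helper, not part of the original notebook.

```python
# Template copied from the batch-generation cell of this notebook
PROMPT_TEMPLATE = """### Instruction:
{instruction}

### Response:
"""


def build_prompt(instruction: str) -> str:
    """Wrap a raw instruction in the instruction/response template (hypothetical helper)."""
    return PROMPT_TEMPLATE.format(instruction=instruction.strip())


prompt = build_prompt("What is the capital of France?")
print(prompt)
```

The resulting string can be passed to `tokenizer(...)` in place of the raw prompt. Whether this removes the repetition for this particular checkpoint is untested here.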


In [8]:
output_dir = "/content/drive/MyDrive/LLaMA2-Dolly-Training/outputs"
baseline_parquet_path = os.path.join(output_dir, "baseline_model_outputs.parquet")

eval_set_with_baseline = pl.read_parquet(baseline_parquet_path)

eval_set_with_baseline.head(10)

| dataset | instruction | output | generator | baseline_output |
| --- | --- | --- | --- | --- |
| str | str | str | str | str |
| "koala" | "in billiards what happens if o… | "If every striped ball is pocke… | "text_davinci_003" | "**f**: ``` (6) ``` ### Instr… |
| "koala" | "i assume you are familiar with… | "Estimates and Error Margins: •… | "text_davinci_003" | "<img src="https://i.imgur.com/… |
| "selfinstruct" | "Give students tips on how to k… | "1. Practice your presentation … | "text_davinci_003" | "1. Relax 2. Breathe 3. Talk … |
| "helpful_base" | "what is the name of chris tuck… | "Chris Tucker's first movie was… | "text_davinci_003" | "[The Fifth Element](https://ww… |
| "koala" | "Please summarise in point form… | "1. Decline in agricultural pro… | "text_davinci_003" | "In the article, Devèze summari… |
| "helpful_base" | "I'm trying to teach myself to … | "Sure! Here are a few tips to h… | "text_davinci_003" | "### Instruction:" |
| "koala" | "cost of fuel for a 14 mile jou… | "£3.75" | "text_davinci_003" | "cost of fuel for a 14 mile jou… |
| "koala" | "Explain me the Finite Elemente… | "The Finite Element Method (FEM… | "text_davinci_003" | "The finite element method (FEM… |
| "oasst" | "How can I use software defined… | "To detect and locate a drone f… | "text_davinci_003" | "The first thing you need to do… |
| "selfinstruct" | "You are given a tweet and you … | "Offensive" | "text_davinci_003" | "Offensive. ### Instruction: … |


In [9]:
instructions = eval_set_with_baseline.get_column("instruction").to_list()
finetuned_outputs = []
BATCH_SIZE = 1
print(f"\nGenerating {len(instructions)} outputs from fine-tuned model in batches of {BATCH_SIZE}...")

PROMPT_TEMPLATE = """### Instruction:
{instruction}

### Response:
"""

for i in tqdm(range(0, len(instructions), BATCH_SIZE)):
    batch_instructions = instructions[i : i + BATCH_SIZE]
    prompts = [PROMPT_TEMPLATE.format(instruction=inst) for inst in batch_instructions]

    inputs = tokenizer(
        prompts,
        return_tensors="pt",
        padding=True,
        truncation=True,
        max_length=1024
    ).to(model.device)

    with torch.no_grad():
        generate_ids = model.generate(
            **inputs,
            max_new_tokens=1024,
            temperature=0.7,
            top_p=0.9,
            do_sample=True,
            pad_token_id=tokenizer.eos_token_id,
            eos_token_id=tokenizer.eos_token_id
        )

    output_tokens = generate_ids[:, inputs.input_ids.shape[1]:]
    batch_outputs = tokenizer.batch_decode(output_tokens, skip_special_tokens=True)
    finetuned_outputs.extend(batch_outputs)

eval_set_complete = eval_set_with_baseline.with_columns(
    pl.Series("finetuned_output", finetuned_outputs)
)

eval_set_complete.head()


Generating 10 outputs from fine-tuned model in batches of 1...



| dataset | instruction | output | generator | baseline_output | finetuned_output |
| --- | --- | --- | --- | --- | --- |
| str | str | str | str | str | str |
| "koala" | "in billiards what happens if o… | "If every striped ball is pocke… | "text_davinci_003" | "**f**: ``` (6) ``` ### Instr… | "On the break every strip ball … |
| "koala" | "i assume you are familiar with… | "Estimates and Error Margins: •… | "text_davinci_003" | "<img src="https://i.imgur.com/… | "The Drake equation gives the f… |
| "selfinstruct" | "Give students tips on how to k… | "1. Practice your presentation … | "text_davinci_003" | "1. Relax 2. Breathe 3. Talk … | "- Practice the presentation ma… |
| "helpful_base" | "what is the name of chris tuck… | "Chris Tucker's first movie was… | "text_davinci_003" | "[The Fifth Element](https://ww… | "Friday" |
| "koala" | "Please summarise in point form… | "1. Decline in agricultural pro… | "text_davinci_003" | "In the article, Devèze summari… | "1. High cost of production 2. … |

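
The generation loop above runs with `BATCH_SIZE = 1`, so padding never comes into play. If the batch size is raised, the padding side matters for a decoder-only model: generation continues from the last column of each row, and with right padding that column holds a pad token for every prompt shorter than the longest one. The toy sketch below uses plain lists of made-up token ids (no tokenizer or model) to illustrate the difference; `pad_batch` and `last_column` are hypothetical helpers.

```python
PAD_ID = 0  # made-up pad token id, for illustration only


def pad_batch(rows, side):
    """Pad lists of token ids to equal length on the given side."""
    if side not in ("left", "right"):
        raise ValueError("side must be 'left' or 'right'")
    width = max(len(row) for row in rows)
    padded = []
    for row in rows:
        filler = [PAD_ID] * (width - len(row))
        padded.append(filler + row if side == "left" else row + filler)
    return padded


def last_column(padded):
    """The token each row would continue from during generation."""
    return [row[-1] for row in padded]


prompts = [[11, 12, 13, 14], [21, 22]]  # two prompts of different length

right_last = last_column(pad_batch(prompts, "right"))
left_last = last_column(pad_batch(prompts, "left"))
print(right_last)  # [14, 0]: the shorter prompt ends in PAD_ID
print(left_last)   # [14, 22]: every row ends in a real prompt token
```

With left padding, slicing the generated ids at `inputs.input_ids.shape[1]` also cleanly separates prompt from continuation for every row in the batch.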

In [16]:
random_row_with_output = eval_set_complete.sample(n=1)
print(f"DATASET: {random_row_with_output['dataset'][0]}")
print(f"INSTRUCTION: {random_row_with_output['instruction'][0]}")
print(f"GENERATOR: {random_row_with_output['generator'][0]}")
print(f"OUTPUT:\n {random_row_with_output['output'][0]}")

print(f"BASELINE:\n {random_row_with_output['baseline_output'][0]}")

print(f"FINE-TUNED:\n {random_row_with_output['finetuned_output'][0]}")

DATASET: koala
INSTRUCTION: Explain me the Finite Elemente Method
GENERATOR: text_davinci_003
OUTPUT:
 The Finite Element Method (FEM) is an analytical technique used to approximate the solution of differential equations. It divides a region into a large number of small elements, and then uses the properties of those elements to calculate an approximate solution. FEM is used extensively in engineering and science to solve a wide range of problems. It can be used to determine the stresses and displacements in structures and machines, to analyze fluid flow, to predict the behavior of electrical and electronic circuits, and to solve partial differential equations.
BASELINE:
 The finite element method (FEM) is a computational method for solving problems in engineering and mathematical physics.

The basic idea behind the finite element method is to divide the domain into small sub-domains, called elements. The
function values on the boundary of an element are specified, and the function val
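
Several `baseline_output` previews earlier in this notebook contain `### Instruction:` (or a truncated `### Instr…`), which suggests the baseline model sometimes runs past its answer and starts a new template block. One rough way to quantify that when comparing the two models is the fraction of outputs that contain the template marker. The sketch below assumes outputs are available as a list of strings; `template_leak_rate` is a hypothetical helper and the sample strings are made up for illustration.

```python
def template_leak_rate(outputs, marker="### Instruction:"):
    """Return the fraction of outputs that contain the template marker."""
    if not outputs:
        return 0.0
    leaked = sum(1 for text in outputs if marker in text)
    return leaked / len(outputs)


# Made-up strings for illustration, not real model outputs
sample_outputs = [
    "Offensive.\n\n### Instruction:\nClassify the next tweet",
    "1. Relax 2. Breathe",
    "### Instruction:",
]
rate = template_leak_rate(sample_outputs)
print(f"{rate:.2f}")
```

In this notebook the inputs could come from `eval_set_complete.get_column("baseline_output").to_list()` and the matching `finetuned_output` column. A lower rate for the fine-tuned model would be consistent with better instruction following, though it is only a crude proxy.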

In [17]:
output_dir = "/content/drive/MyDrive/LLaMA2-Dolly-Training/outputs"
combined_parquet_path = os.path.join(output_dir, "eval_outputs_combined.parquet")
os.makedirs(output_dir, exist_ok=True)
print(f"Saving combined DataFrame (baseline + fine-tuned) to: {combined_parquet_path}")
eval_set_complete.write_parquet(combined_parquet_path)
!ls -lh "{combined_parquet_path}"

Saving combined DataFrame (baseline + fine-tuned) to: /content/drive/MyDrive/LLaMA2-Dolly-Training/outputs/eval_outputs_combined.parquet
-rw------- 1 root root 14K Oct 19 09:33 /content/drive/MyDrive/LLaMA2-Dolly-Training/outputs/eval_outputs_combined.parquet
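
For the follow-up AlpacaEval benchmark notebook, the saved outputs will likely need to be converted to a JSON list of records. To our understanding AlpacaEval reads model outputs with `instruction` and `output` keys plus a `generator` name; treat that schema as an assumption and check the AlpacaEval README before relying on it. The sketch below works on plain dictionaries; `to_alpaca_eval_records` is a hypothetical helper, and the row and generator name are made up for illustration.

```python
import json


def to_alpaca_eval_records(rows, output_key, generator_name):
    """Build AlpacaEval-style records from row dicts (schema is an assumption)."""
    records = []
    for row in rows:
        records.append({
            "instruction": row["instruction"],
            "output": row[output_key],
            "generator": generator_name,
        })
    return records


# Made-up row for illustration, mirroring the column names in this notebook
rows = [
    {"instruction": "Say hello.", "baseline_output": "hi", "finetuned_output": "Hello!"},
]
records = to_alpaca_eval_records(rows, "finetuned_output", "llama2-7b-dolly-lora")
print(json.dumps(records, indent=2))
```

In the notebook, `rows` could be obtained with `eval_set_complete.to_dicts()`, and the same helper can be called with `"baseline_output"` to export the baseline side.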
