# Baseline Model Evaluation

## Purpose
This notebook benchmarks the **baseline LLaMA-2-7B model** before fine-tuning, so that the fine-tuned model can later be compared against the same metrics.

## Evaluation Framework
Following the requirements from [Assignment 7](../../tasks/Assignment7.md), we run the baseline model using:

### 1. **AlpacaEval 2**
- **Purpose**: Instruction-following quality assessment
- **Repository**: https://github.com/tatsu-lab/alpaca_eval
- **Metrics**: Win rate against a reference model's responses, as judged by an LLM annotator
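
As a rough illustration of the win-rate metric, the sketch below computes the percentage of pairwise comparisons won by the evaluated model. It is not the `alpaca_eval` implementation: the function name and the 0/1 verdict encoding are assumptions made for this example, the verdicts are made up, and AlpacaEval 2 also reports a length-controlled variant that this sketch does not attempt.

```python
from typing import Sequence


def win_rate(preferences: Sequence[int]) -> float:
    """Return the percentage of comparisons won by the evaluated model.

    Each entry is a hypothetical judge verdict: 1 if the evaluated model's
    response was preferred, 0 if the reference response was preferred.
    """
    if not preferences:
        raise ValueError("preferences must not be empty")
    return 100.0 * sum(preferences) / len(preferences)


# Hypothetical verdicts for 10 instructions (not real results).
example_verdicts = [0, 0, 1, 0, 0, 0, 0, 1, 0, 0]
print(f"Win rate: {win_rate(example_verdicts):.1f}%")
```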

### 2. **MT-Bench (FastChat)**
- **Purpose**: Multi-turn dialogue quality evaluation  
- **Repository**: https://github.com/lm-sys/FastChat
- **Metrics**: Judge scores (1-10) per turn, covering multi-turn conversation capability and consistency
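
A minimal sketch of how per-turn judge scores could be aggregated into a summary. The `(question_id, turn, score)` tuple layout is an assumption chosen for illustration and is not FastChat's judgment file format; the scores are made up.

```python
from collections import defaultdict
from statistics import mean


def average_scores_by_turn(judgments):
    """Average judge scores per turn and overall.

    `judgments` is a list of (question_id, turn, score) tuples. This layout
    is an assumption for illustration, not FastChat's file format.
    """
    by_turn = defaultdict(list)
    for _question_id, turn, score in judgments:
        by_turn[turn].append(score)
    per_turn = {turn: mean(scores) for turn, scores in sorted(by_turn.items())}
    overall = mean(score for _, _, score in judgments)
    return per_turn, overall


# Hypothetical scores on a 1-10 scale (not real results).
example_judgments = [(81, 1, 4), (81, 2, 2), (82, 1, 6), (82, 2, 4)]
per_turn, overall = average_scores_by_turn(example_judgments)
print(per_turn, overall)
```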

## Workflow
1. **Load baseline model**: `meta-llama/Llama-2-7b-hf`
2. **Run AlpacaEval 2**: Generate responses and calculate win rates
3. **Run MT-Bench**: Evaluate multi-turn dialogue performance
4. **Document results**: Baseline metrics for comparison with fine-tuned model

## Expected Outcomes
- Establish baseline performance metrics for comparison with the fine-tuned model
- Identify areas where fine-tuning should improve performance
- Expose the performance gaps that fine-tuning aims to close

---
**Note**: This baseline evaluation is crucial for demonstrating the effectiveness of our LoRA fine-tuning approach on the Dolly-15K dataset.


In [2]:
from transformers import AutoModelForCausalLM, AutoTokenizer
import os
import torch

from google.colab import drive
drive.mount('/content/drive')



Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [3]:
from huggingface_hub import login
login(new_session=False)

# Load the model, preferring the copy cached in Google Drive if it exists
model_id = "meta-llama/Llama-2-7b-hf"
cache_path = "/content/drive/MyDrive/LLaMA2-Dolly-Training/models/Llama-2-7b-hf"

if os.path.exists(cache_path):
    print("Found cached model in Google Drive, Loading...")
    model = AutoModelForCausalLM.from_pretrained(
        cache_path,
        dtype=torch.float16,
        device_map="auto",
        local_files_only=True
    )
    tokenizer = AutoTokenizer.from_pretrained(cache_path)
    print(f"✓ Model {model_id} loaded")
else:
    print(f"Downloading {model_id} for the first time...")

    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        dtype=torch.float16,
        device_map="auto"  # Llama-2 uses built-in transformers classes, so trust_remote_code is not needed
    )
    tokenizer = AutoTokenizer.from_pretrained(model_id)


    os.makedirs(cache_path, exist_ok=True)
    model.save_pretrained(cache_path)
    tokenizer.save_pretrained(cache_path)
    print(f"✓ Cached to: {cache_path}")


Found cached model in Google Drive, Loading...



✓ Model meta-llama/Llama-2-7b-hf loaded


In [4]:
prompt = "What is the capital of France"

inputs = tokenizer(prompt, return_tensors="pt")
inputs = inputs.to(model.device)
generate_ids = model.generate(
    **inputs,
    max_new_tokens=50,
    pad_token_id=tokenizer.eos_token_id
)
generated_text = tokenizer.decode(generate_ids[0], skip_special_tokens=True)
print(generated_text)

What is the capital of France?
How many regions are in France?
What is the currency of France?
What is the largest city in France?
What is the official language of France?
What is the religion of France?
What is the population of France


In [50]:
import polars as pl
import pandas as pd
from huggingface_hub import hf_hub_url

EVAL_MODE = "sample"  # "sample" evaluates 10 random instructions; any other value keeps the full set
json_url = hf_hub_url(
    repo_id="tatsu-lab/alpaca_eval",
    filename="alpaca_eval.json",
    repo_type="dataset"
)

print(f"Downloading data from: {json_url}")

df_pandas = pd.read_json(json_url)
eval_set = pl.from_pandas(df_pandas)

print(f"✓ Loaded {eval_set.height} instructions")

if EVAL_MODE == "sample":
    eval_set = eval_set.sample(n=10, seed=42)
    print(f"✓ Sampled {eval_set.height} instructions")

eval_set.head(10)

Downloading data from: https://huggingface.co/datasets/tatsu-lab/alpaca_eval/resolve/main/alpaca_eval.json
✓ Loaded 805 instructions
✓ Sampled 10 instructions


dataset,instruction,output,generator
str,str,str,str
"""koala""","""in billiards what happens if o…","""If every striped ball is pocke…","""text_davinci_003"""
"""koala""","""i assume you are familiar with…","""Estimates and Error Margins: •…","""text_davinci_003"""
"""selfinstruct""","""Give students tips on how to k…","""1. Practice your presentation …","""text_davinci_003"""
"""helpful_base""","""what is the name of chris tuck…","""Chris Tucker's first movie was…","""text_davinci_003"""
"""koala""","""Please summarise in point form…","""1. Decline in agricultural pro…","""text_davinci_003"""
"""helpful_base""","""I'm trying to teach myself to …","""Sure! Here are a few tips to h…","""text_davinci_003"""
"""koala""","""cost of fuel for a 14 mile jou…","""£3.75""","""text_davinci_003"""
"""koala""","""Explain me the Finite Elemente…","""The Finite Element Method (FEM…","""text_davinci_003"""
"""oasst""","""How can I use software defined…","""To detect and locate a drone f…","""text_davinci_003"""
"""selfinstruct""","""You are given a tweet and you …","""Offensive""","""text_davinci_003"""


In [53]:
random_row = eval_set.sample(n=1)
print(f"DATASET: {random_row['dataset'][0]}")
print(f"INSTRUCTION: {random_row['instruction'][0]}")
print(f"OUTPUT: {random_row['output'][0]}")
print(f"GENERATOR: {random_row['generator'][0]}")

DATASET: koala
INSTRUCTION: Please summarise in point form "Challenges for
African Agriculture" by Jean-Claude Devèze
OUTPUT: 1. Decline in agricultural productivity due to overworked soils 
2. Outdated agricultural practices 
3. Poor access to markets 
4. Lack of agricultural diversification 
5. Poor access to inputs such as fertilizer and modern farming techniques 
6. Poor infrastructure and lack of transport networks 
7. Low levels of investment in agricultural research and development 
8. Natural disasters such as drought and floods 
9. Poor access to credit and financial services
GENERATOR: text_davinci_003


In [54]:
def generate_response(instruction, reference_output, model, tokenizer):
    """
    Generate a response with an adaptive `max_new_tokens` budget: twice the
    reference length in tokens, clamped to the range [256, 2048].
    """
    prompt = f"""### Instruction:
{instruction}

### Response:"""

    reference_tokens = len(tokenizer.encode(reference_output))
    max_new_tokens = min(max(reference_tokens * 2, 256), 2048)

    print(f"Reference: {reference_tokens} tokens → Using max: {max_new_tokens} tokens")

    inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=2048)
    inputs = {k: v.to(model.device) for k, v in inputs.items()}

    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            temperature=0.7,
            top_p=0.9,
            do_sample=True,
            pad_token_id=tokenizer.eos_token_id,
            eos_token_id=tokenizer.eos_token_id
        )

    # Decode only the newly generated tokens. Splitting the full text on
    # "### Response:" would return the wrong segment whenever the model
    # emits another "### Response:" marker of its own.
    prompt_length = inputs["input_ids"].shape[1]
    response = tokenizer.decode(
        outputs[0][prompt_length:], skip_special_tokens=True
    ).strip()

    return response
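
The token budget used in `generate_response` can be isolated as a small pure function, which makes the clamping easy to check by hand. This is a sketch only: the helper name and the `floor`/`ceiling` parameter names are introduced here for illustration, while the formula itself is the one used in the function above.

```python
def adaptive_max_new_tokens(
    reference_tokens: int, floor: int = 256, ceiling: int = 2048
) -> int:
    """Return twice the reference length, clamped to [floor, ceiling]."""
    return min(max(reference_tokens * 2, floor), ceiling)


# Short references fall back to the floor, long ones hit the ceiling.
for reference_tokens in (3, 165, 282, 1500):
    print(reference_tokens, adaptive_max_new_tokens(reference_tokens))
```

References of 33, 165, and 282 tokens map to budgets of 256, 330, and 564, which matches the values logged during generation in this notebook.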

In [55]:
from tqdm.notebook import tqdm

In [57]:
print(f"Starting generation for {eval_set.height} instructions...")

baseline_outputs = []

for row in tqdm(eval_set.iter_rows(named=True), total=eval_set.height):
    response = generate_response(
        instruction=row['instruction'],
        reference_output=row['output'],
        model=model,
        tokenizer=tokenizer
    )
    baseline_outputs.append(response)

print("✓ All responses generated.")

eval_set_with_outputs = eval_set.with_columns(
    pl.Series("baseline_output", baseline_outputs)
)

eval_set_with_outputs.head(10)

Starting generation for 10 instructions...



Reference: 33 tokens → Using max: 256 tokens
Reference: 282 tokens → Using max: 564 tokens
Reference: 165 tokens → Using max: 330 tokens
Reference: 22 tokens → Using max: 256 tokens
Reference: 122 tokens → Using max: 256 tokens
Reference: 125 tokens → Using max: 256 tokens
Reference: 6 tokens → Using max: 256 tokens
Reference: 111 tokens → Using max: 256 tokens
Reference: 110 tokens → Using max: 256 tokens
Reference: 3 tokens → Using max: 256 tokens
✓ All responses generated.


dataset,instruction,output,generator,baseline_output
str,str,str,str,str
"""koala""","""in billiards what happens if o…","""If every striped ball is pocke…","""text_davinci_003""","""**f**: ``` (6) ``` ### Instr…"
"""koala""","""i assume you are familiar with…","""Estimates and Error Margins: •…","""text_davinci_003""","""<img src=""https://i.imgur.com/…"
"""selfinstruct""","""Give students tips on how to k…","""1. Practice your presentation …","""text_davinci_003""","""1. Relax 2. Breathe 3. Talk …"
"""helpful_base""","""what is the name of chris tuck…","""Chris Tucker's first movie was…","""text_davinci_003""","""[The Fifth Element](https://ww…"
"""koala""","""Please summarise in point form…","""1. Decline in agricultural pro…","""text_davinci_003""","""In the article, Devèze summari…"
"""helpful_base""","""I'm trying to teach myself to …","""Sure! Here are a few tips to h…","""text_davinci_003""","""### Instruction:"""
"""koala""","""cost of fuel for a 14 mile jou…","""£3.75""","""text_davinci_003""","""cost of fuel for a 14 mile jou…"
"""koala""","""Explain me the Finite Elemente…","""The Finite Element Method (FEM…","""text_davinci_003""","""The finite element method (FEM…"
"""oasst""","""How can I use software defined…","""To detect and locate a drone f…","""text_davinci_003""","""The first thing you need to do…"
"""selfinstruct""","""You are given a tweet and you …","""Offensive""","""text_davinci_003""","""Offensive. ### Instruction: …"


In [66]:
random_row_with_output = eval_set_with_outputs.sample(n=1)
print(f"DATASET: {random_row_with_output['dataset'][0]}")
print(f"INSTRUCTION: {random_row_with_output['instruction'][0]}")
print(f"GENERATOR: {random_row_with_output['generator'][0]}")
print(f"OUTPUT:\n {random_row_with_output['output'][0]}")

print(f"BASELINE:\n {random_row_with_output['baseline_output'][0]}")

DATASET: selfinstruct
INSTRUCTION: Give students tips on how to keep their nerves under control during class presentations.
GENERATOR: text_davinci_003
OUTPUT:
 1. Practice your presentation beforehand to build confidence and reduce anxiety.
2. Make sure you are well-rested and have eaten something before your presentation.
3. Take deep breaths to help calm your mind and body.
4. Visualize yourself delivering the presentation with confidence.
5. Break down your presentation into smaller chunks to make it easier to manage.
6. Remind yourself that your audience wants you to succeed.
7. Focus on the message you are delivering, not on yourself.
8. Remember to speak slowly and clearly to keep your nerves in check.
9. If you make a mistake, don't worry about it - just move on.
10. Reward yourself after a successful presentation.
BASELINE:
 1. Relax

2. Breathe

3. Talk slowly

4. Speak clearly

5. Be confident

6. Practice beforehand

7. Sm
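
Several `baseline_output` values in the results table earlier in this notebook run on into a new `### Instruction:` block, because the base model keeps continuing the prompt template. One possible post-processing step, not part of the original pipeline, is to cut each response at the first template marker. The helper name and default markers below are assumptions for this sketch; whether to trim before scoring is a methodological choice that should be applied identically to the fine-tuned model.

```python
def trim_at_marker(text: str, markers=("### Instruction:", "### Response:")) -> str:
    """Cut `text` at the earliest occurrence of any marker and strip whitespace."""
    cut = len(text)
    for marker in markers:
        index = text.find(marker)
        if index != -1:
            cut = min(cut, index)
    return text[:cut].strip()


# Made-up strings that mimic the run-on pattern seen in the table.
print(repr(trim_at_marker("Offensive. ### Instruction: Classify the next tweet.")))
print(repr(trim_at_marker("No markers here.")))
```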


In [None]:
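# Batched generation, intended to evaluate a larger number of items faster.
# Left commented out: the last run hit a CUDA out-of-memory error even with BATCH_SIZE = 1.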
# BATCH_SIZE = 1 # Reduce batch size to avoid OOM errors
# tokenizer.pad_token = tokenizer.eos_token
# tokenizer.padding_side = "left"
# PROMPT_TEMPLATE = """### Instruction:
# {instruction}

# ### Response:
# """
# instructions = eval_set.get_column("instruction").to_list()
# all_outputs = []
# print(f"Generating {len(instructions)} outputs in batches of {BATCH_SIZE}...")

# for i in tqdm(range(0, len(instructions), BATCH_SIZE)):
#     batch_instructions = instructions[i : i + BATCH_SIZE]
#     prompts = [PROMPT_TEMPLATE.format(instruction=inst) for inst in batch_instructions]
#     inputs = tokenizer(
#         prompts,
#         return_tensors="pt",
#         padding=True,
#         truncation=True,
#         max_length=2048
#     ).to(model.device)
#     with torch.no_grad():
#         generate_ids = model.generate(
#             **inputs,
#             max_new_tokens=512,
#             temperature=0.7,
#             top_p=0.9,
#             do_sample=True,
#             pad_token_id=tokenizer.eos_token_id,
#             eos_token_id=tokenizer.eos_token_id
#         )
#     output_tokens = generate_ids[:, inputs.input_ids.shape[1]:]
#     batch_outputs = tokenizer.batch_decode(output_tokens, skip_special_tokens=True)
#     all_outputs.extend(batch_outputs)


# eval_set_with_outputs = eval_set.with_columns(
#     pl.Series("baseline_output", all_outputs)
# )
# print(eval_set_with_outputs.head())

Generating 10 outputs in batches of 1...



OutOfMemoryError: CUDA out of memory. Tried to allocate 2.00 MiB. GPU 0 has a total capacity of 14.74 GiB of which 2.12 MiB is free. Process 20689 has 14.74 GiB memory in use. Of the allocated memory 14.59 GiB is allocated by PyTorch, and 22.68 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
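
A back-of-the-envelope estimate helps explain why memory is so tight on this GPU. The sketch below assumes roughly 6.74 billion parameters for Llama-2-7B (an approximation, not read from the checkpoint) stored in fp16 at 2 bytes each, and takes the 14.74 GiB capacity from the error message above.

```python
def weights_gib(num_parameters: float, bytes_per_parameter: int = 2) -> float:
    """Return the memory needed for the model weights alone, in GiB."""
    return num_parameters * bytes_per_parameter / 2**30


APPROX_PARAMS = 6.74e9    # assumption: approximate parameter count of Llama-2-7B
GPU_CAPACITY_GIB = 14.74  # total capacity reported in the error message

weights = weights_gib(APPROX_PARAMS)
headroom = GPU_CAPACITY_GIB - weights
print(f"fp16 weights: ~{weights:.1f} GiB, headroom: ~{headroom:.1f} GiB")
```

Under these assumptions the weights alone take most of the card, leaving only a couple of GiB for activations and the KV cache. The extra memory reported as allocated in the error may also include tensors left over from earlier cells, although that is a guess rather than something the log confirms.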

In [None]:

output_dir = "/content/drive/MyDrive/LLaMA2-Dolly-Training/outputs"
file_path = os.path.join(output_dir, "baseline_model_outputs.parquet")

os.makedirs(output_dir, exist_ok=True)
print(f"Saving DataFrame to Google Drive at: {file_path}")

eval_set_with_outputs.write_parquet(file_path)

!ls -lh {file_path}


Saving DataFrame to Google Drive at: /content/drive/MyDrive/LLaMA2-Dolly-Training/outputs/baseline_model_outputs.parquet
-rw------- 1 root root 9.8K Oct 18 09:10 /content/drive/MyDrive/LLaMA2-Dolly-Training/outputs/baseline_model_outputs.parquet
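
To score these outputs with AlpacaEval, they will probably need to be exported as JSON as well. As far as the AlpacaEval README describes it, model outputs are a JSON list of dictionaries with `instruction` and `output` keys; the sketch below writes that shape using only the standard library. The function name, the `generator` label, and the demo row are hypothetical, and the expected format should be checked against the repository before relying on it.

```python
import json
import tempfile
from pathlib import Path


def write_alpaca_eval_outputs(rows, path, generator="llama-2-7b-baseline"):
    """Write rows as a JSON list of {instruction, output, generator} records.

    `rows` is an iterable of dicts with "instruction" and "baseline_output"
    keys. The default generator label is a hypothetical name.
    """
    records = [
        {
            "instruction": row["instruction"],
            "output": row["baseline_output"],
            "generator": generator,
        }
        for row in rows
    ]
    Path(path).write_text(
        json.dumps(records, indent=2, ensure_ascii=False), encoding="utf-8"
    )
    return records


# Demo with a made-up row, written to a temporary directory.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = Path(tmp_dir) / "baseline_outputs.json"
    demo_records = write_alpaca_eval_outputs(
        [{"instruction": "Say hi.", "baseline_output": "Hi!"}], demo_path
    )
    reloaded = json.loads(demo_path.read_text(encoding="utf-8"))
    print(reloaded == demo_records)
```

In this notebook the rows could come from `eval_set_with_outputs.to_dicts()`, with the JSON file written next to the Parquet file in `output_dir`.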
