# Gemma Model Comparison: Factual QA Benchmark

**Author:** Kanishq Gandharv  
**Date:** February 2026  
**Purpose:** Reproducible benchmark comparing Google Gemma 2B and 7B models (base and instruction-tuned) on 100 factual questions

## What This Notebook Does

1. Loads 100 factual questions across 6 categories
2. Evaluates 4 Gemma model variants:
   - `google/gemma-2b` (base)
   - `google/gemma-2b-it` (instruction-tuned)
   - `google/gemma-7b` (base)
   - `google/gemma-7b-it` (instruction-tuned)
3. Saves detailed results and summary statistics
4. Compares performance across models

## Expected Results

- **2B-IT**: ~85% accuracy (best performer)
- **7B-IT**: ~69% accuracy
- **Base models**: ~20% accuracy (not instruction-aligned)

**Runtime:** ~45-60 minutes total on a free Colab T4


In [None]:
# Install required packages
!pip install -q transformers accelerate torch bitsandbytes

print("✓ Dependencies installed successfully!")

## Setup & Configuration

Before running, you need:
1. A Hugging Face account: https://huggingface.co/join
2. Accept the Gemma license: https://huggingface.co/google/gemma-2b
3. Create an access token: https://huggingface.co/settings/tokens (read permission)

Replace `YOUR_TOKEN_HERE` below with your actual token.


In [None]:
import os

# REQUIRED: Replace with your Hugging Face token
HF_TOKEN = "YOUR_TOKEN_HERE"  # Get from https://huggingface.co/settings/tokens
os.environ["HF_TOKEN"] = HF_TOKEN

print("✓ Hugging Face token set successfully!")
print("⚠️ Make sure you've accepted the Gemma license at: https://huggingface.co/google/gemma-2b")


## Evaluation Dataset

100 factual questions across 6 balanced categories:
- **Geography** (15): Capitals, continents, landmarks
- **Math** (15): Arithmetic, geometry, basic algebra  
- **Science** (20): Chemistry, physics, biology, astronomy
- **History** (15): Major events, figures, dates
- **Literature** (15): Authors and famous works
- **General Knowledge** (20): Common facts, animals, everyday knowledge

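Each question has a single short gold answer, scored by case-insensitive substring matching. Very short gold answers (for example `O`, `C`, `0`, or `1`) can also match unrelated text in a prediction, so scoring on those items is lenient.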
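
Because scoring relies on substring matching, it can help to see which gold answers are short enough to match almost any output. The sketch below copies the notebook's `normalize` rule; the `risky_gold_answers` helper and its `max_len=2` threshold are hypothetical and not part of the original benchmark.

```python
def normalize(s: str) -> str:
    """Same rule as the notebook's helper: lowercase and strip all whitespace."""
    return "".join(s.lower().split())

def risky_gold_answers(eval_data, max_len=2):
    """Return gold answers whose normalized form has at most max_len characters.

    max_len=2 is an arbitrary illustrative threshold, not a benchmark setting.
    """
    return sorted({
        item["answer"]
        for item in eval_data
        if len(normalize(item["answer"])) <= max_len
    })

# Two items taken from the dataset, used here as a small sample.
sample = [
    {"question": "What is the chemical symbol for carbon?", "answer": "C"},
    {"question": "What is the capital of France?", "answer": "Paris"},
]

print(risky_gold_answers(sample))
# A wrong prediction can still be scored as correct under substring matching:
print(normalize("C") in normalize("Calcium"))
```

Running `risky_gold_answers(eval_data)` on the full list would show which items deserve a manual check.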


In [None]:
# 100-question evaluation dataset for Gemma Mini-Benchmark
# Categories: Geography, Math, Science, History, Literature, General Knowledge

eval_data = [
    # Geography (15 questions)
    {"question": "What is the capital of France?", "answer": "Paris"},
    {"question": "What is the capital of Japan?", "answer": "Tokyo"},
    {"question": "What is the capital of Italy?", "answer": "Rome"},
    {"question": "What is the capital of Germany?", "answer": "Berlin"},
    {"question": "What is the capital of Canada?", "answer": "Ottawa"},
    {"question": "What is the capital of Australia?", "answer": "Canberra"},
    {"question": "What is the capital of Brazil?", "answer": "Brasília"},
    {"question": "What is the capital of India?", "answer": "New Delhi"},
    {"question": "What is the capital of Russia?", "answer": "Moscow"},
    {"question": "What is the capital of Egypt?", "answer": "Cairo"},
    {"question": "What is the largest ocean on Earth?", "answer": "Pacific Ocean"},
    {"question": "What is the smallest continent?", "answer": "Australia"},
    {"question": "How many continents are there?", "answer": "7"},
    {"question": "What is the longest river in the world?", "answer": "Nile"},
    {"question": "What is the highest mountain in the world?", "answer": "Mount Everest"},

    # Math (15 questions)
    {"question": "What is 2 + 2?", "answer": "4"},
    {"question": "What is 5 * 6?", "answer": "30"},
    {"question": "What is 100 - 37?", "answer": "63"},
    {"question": "What is 144 / 12?", "answer": "12"},
    {"question": "What is the square root of 64?", "answer": "8"},
    {"question": "What is 15% of 200?", "answer": "30"},
    {"question": "What is the smallest prime number?", "answer": "2"},
    {"question": "What is the next prime number after 7?", "answer": "11"},
    {"question": "What is 2 to the power of 5?", "answer": "32"},
    {"question": "What is the value of pi rounded to 2 decimal places?", "answer": "3.14"},
    {"question": "How many sides does a hexagon have?", "answer": "6"},
    {"question": "What is 1/4 as a decimal?", "answer": "0.25"},
    {"question": "What is 20% of 50?", "answer": "10"},
    {"question": "What is the cube of 3?", "answer": "27"},
    {"question": "What is the sum of angles in a triangle?", "answer": "180"},

    # Science (20 questions)
    {"question": "What is the chemical symbol for water?", "answer": "H2O"},
    {"question": "What is the chemical symbol for gold?", "answer": "Au"},
    {"question": "What is the chemical symbol for oxygen?", "answer": "O"},
    {"question": "What is the chemical symbol for carbon?", "answer": "C"},
    {"question": "What is the speed of light in vacuum (in m/s)?", "answer": "299792458"},
    {"question": "What is the largest planet in our solar system?", "answer": "Jupiter"},
    {"question": "What is the smallest planet in our solar system?", "answer": "Mercury"},
    {"question": "How many planets are in our solar system?", "answer": "8"},
    {"question": "What is the closest planet to the Sun?", "answer": "Mercury"},
    {"question": "What gas do plants absorb from the atmosphere?", "answer": "Carbon dioxide"},
    {"question": "What gas do plants release during photosynthesis?", "answer": "Oxygen"},
    {"question": "What is the powerhouse of the cell?", "answer": "Mitochondria"},
    {"question": "What is DNA an abbreviation for?", "answer": "Deoxyribonucleic acid"},
    {"question": "What is the hardest natural substance on Earth?", "answer": "Diamond"},
    {"question": "What is the boiling point of water at sea level in Celsius?", "answer": "100"},
    {"question": "What is the freezing point of water in Celsius?", "answer": "0"},
    {"question": "How many bones are in the adult human body?", "answer": "206"},
    {"question": "What is the largest organ in the human body?", "answer": "Skin"},
    {"question": "What force keeps us on the ground?", "answer": "Gravity"},
    {"question": "What is the atomic number of hydrogen?", "answer": "1"},

    # History (15 questions)
    {"question": "What year did World War 2 end?", "answer": "1945"},
    {"question": "What year did World War 1 start?", "answer": "1914"},
    {"question": "Who was the first President of the United States?", "answer": "George Washington"},
    {"question": "In what year did the Titanic sink?", "answer": "1912"},
    {"question": "Who discovered America in 1492?", "answer": "Christopher Columbus"},
    {"question": "What year did the Berlin Wall fall?", "answer": "1989"},
    {"question": "What year did India gain independence?", "answer": "1947"},
    {"question": "Who was the first man on the moon?", "answer": "Neil Armstrong"},
    {"question": "What year did humans first land on the moon?", "answer": "1969"},
    {"question": "What ancient wonder is located in Egypt?", "answer": "Pyramids"},
    {"question": "What year did the French Revolution begin?", "answer": "1789"},
    {"question": "Who invented the telephone?", "answer": "Alexander Graham Bell"},
    {"question": "Who invented the light bulb?", "answer": "Thomas Edison"},
    {"question": "What year did the American Civil War end?", "answer": "1865"},
    {"question": "What ancient civilization built Machu Picchu?", "answer": "Inca"},

    # Literature (15 questions)
    {"question": "Who wrote Romeo and Juliet?", "answer": "Shakespeare"},
    {"question": "Who wrote Hamlet?", "answer": "Shakespeare"},
    {"question": "Who wrote 1984?", "answer": "George Orwell"},
    {"question": "Who wrote Pride and Prejudice?", "answer": "Jane Austen"},
    {"question": "Who wrote The Great Gatsby?", "answer": "F. Scott Fitzgerald"},
    {"question": "Who wrote Moby Dick?", "answer": "Herman Melville"},
    {"question": "Who wrote To Kill a Mockingbird?", "answer": "Harper Lee"},
    {"question": "Who wrote Harry Potter?", "answer": "J.K. Rowling"},
    {"question": "Who wrote The Odyssey?", "answer": "Homer"},
    {"question": "Who wrote The Iliad?", "answer": "Homer"},
    {"question": "Who wrote Macbeth?", "answer": "Shakespeare"},
    {"question": "Who wrote The Divine Comedy?", "answer": "Dante"},
    {"question": "Who wrote War and Peace?", "answer": "Leo Tolstoy"},
    {"question": "Who wrote Crime and Punishment?", "answer": "Fyodor Dostoevsky"},
    {"question": "Who wrote The Lord of the Rings?", "answer": "J.R.R. Tolkien"},

    # General Knowledge (20 questions)
    {"question": "How many days are in a leap year?", "answer": "366"},
    {"question": "How many hours are in a day?", "answer": "24"},
    {"question": "How many minutes are in an hour?", "answer": "60"},
    {"question": "How many seconds are in a minute?", "answer": "60"},
    {"question": "How many weeks are in a year?", "answer": "52"},
    {"question": "How many months are in a year?", "answer": "12"},
    {"question": "What color is the sky on a clear day?", "answer": "Blue"},
    {"question": "What is the opposite of hot?", "answer": "Cold"},
    {"question": "What animal is known as man's best friend?", "answer": "Dog"},
    {"question": "What is the largest land animal?", "answer": "Elephant"},
    {"question": "What is the fastest land animal?", "answer": "Cheetah"},
    {"question": "What bird is known for its ability to mimic human speech?", "answer": "Parrot"},
    {"question": "How many legs does a spider have?", "answer": "8"},
    {"question": "How many legs does an insect have?", "answer": "6"},
    {"question": "What is the largest mammal in the world?", "answer": "Blue whale"},
    {"question": "What is the tallest animal in the world?", "answer": "Giraffe"},
    {"question": "What do bees produce?", "answer": "Honey"},
    {"question": "What is the name of the fairy tale character who left a glass slipper?", "answer": "Cinderella"},
    {"question": "What fruit is associated with keeping doctors away?", "answer": "Apple"},
    {"question": "What vegetable makes you cry when you cut it?", "answer": "Onion"}
]

print(f"✓ Loaded {len(eval_data)} evaluation questions")
print(f"\nBreakdown:")
print(f"  Geography: 15")
print(f"  Math: 15")
print(f"  Science: 20")
print(f"  History: 15")
print(f"  Literature: 15")
print(f"  General Knowledge: 20")


## Helper Functions

Define reusable functions for:
- **Prompt formatting**: Converts questions into instruction-style prompts
- **Answer generation**: Runs model inference with consistent settings
- **String normalization**: Prepares text for case-insensitive matching
- **Model loading**: Handles authentication and device placement
- **Evaluation**: Runs full benchmark and saves results
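
One detail of the answer-extraction step is worth checking by hand. The helper keeps the text after the last `Answer:` in the decoded output, so if a model continues the prompt with a new question-and-answer pair, the extracted text comes from that continuation. The sketch below reproduces that rule next to a hypothetical alternative (`extract_first`, not part of the notebook). The decoded string is an invented example, and whether this affects the reported scores is not established here.

```python
def extract_last(full_text: str) -> str:
    """Mirror of the notebook's rule: keep the text after the LAST 'Answer:'."""
    return full_text.split("Answer:")[-1].strip()

def extract_first(full_text: str) -> str:
    """Hypothetical alternative: first line after the FIRST 'Answer:'."""
    if "Answer:" in full_text:
        tail = full_text.split("Answer:", 1)[1]
    else:
        tail = full_text
    tail = tail.strip()
    return tail.splitlines()[0].strip() if tail else ""

# Invented decoded output: the model answers, then invents a follow-up question.
decoded = (
    "Question: What is the capital of France?\nAnswer: Paris\n\n"
    "Question: What is the capital of Spain?\nAnswer: Madrid"
)

print(repr(extract_last(decoded)))   # text from the model's own follow-up
print(repr(extract_first(decoded)))  # text answering the original question
```

Inspecting a few `predicted` fields in the saved JSONL files is a quick way to see whether this pattern occurs in practice.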


In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
import json

def format_prompt(question: str) -> str:
    """Format question as instruction-style prompt."""
    return (
        "You are a helpful assistant. "
        "Answer the following question in one short phrase.\n\n"
        f"Question: {question}\n"
        "Answer:"
    )

def generate_answer(model, tokenizer, question, max_new_tokens=16):
    """Generate answer from Gemma model with consistent settings."""
    prompt = format_prompt(question)
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            do_sample=False,  # Deterministic greedy decoding
            pad_token_id=tokenizer.eos_token_id,
        )

    full_text = tokenizer.decode(outputs[0], skip_special_tokens=True)

    # Extract only text after "Answer:" if present
    if "Answer:" in full_text:
        full_text = full_text.split("Answer:")[-1].strip()

    return full_text

def normalize(s: str) -> str:
    """Normalize string for case-insensitive matching."""
    return "".join(s.lower().split())

def load_model(model_name: str):
    """Load Gemma model and tokenizer from Hugging Face."""
    print(f"\n{'='*60}")
    print(f"Loading {model_name}...")
    print('='*60)

    tokenizer = AutoTokenizer.from_pretrained(model_name, token=HF_TOKEN)
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        token=HF_TOKEN,
        device_map="auto",
        torch_dtype=torch.float16,
    )

    print(f"✓ Model loaded successfully!\n")
    return model, tokenizer

def evaluate_model(model, tokenizer, model_name, eval_data, label_prefix):
    """Run full evaluation and save results."""
    correct = 0
    results = []

    print(f"\n{'='*60}")
    print(f"Evaluating {model_name} ({label_prefix})")
    print('='*60 + "\n")

    for i, item in enumerate(eval_data, 1):
        question = item["question"]
        gold_answer = item["answer"]

        pred = generate_answer(model, tokenizer, question)
        is_correct = normalize(gold_answer) in normalize(pred)
        correct += int(is_correct)

        result = {
            "question": question,
            "gold_answer": gold_answer,
            "predicted": pred,
            "correct": is_correct,
        }
        results.append(result)

        # Progress updates every 10 questions
        if i % 10 == 0 or i == len(eval_data):
            print(f"[{i}/{len(eval_data)}] Progress: {correct}/{i} correct so far")

    accuracy = correct / len(eval_data)

    print(f"\n{'='*60}")
    print(f"FINAL RESULTS: {model_name}")
    print('='*60)
    print(f"Accuracy: {accuracy:.2%} ({correct}/{len(eval_data)} correct)")
    print('='*60 + "\n")

    # Save detailed results (one JSON object per line)
    results_path = f"{label_prefix}_eval_results.jsonl"
    with open(results_path, "w", encoding="utf-8") as f:
        for r in results:
            f.write(json.dumps(r, ensure_ascii=False) + "\n")

    # Save summary statistics
    summary = {
        "model": model_name,
        "label": label_prefix,
        "total_questions": len(eval_data),
        "correct": correct,
        "accuracy": accuracy,
    }
    summary_path = f"{label_prefix}_eval_summary.json"
    with open(summary_path, "w", encoding="utf-8") as f:
        json.dump(summary, f, indent=2)

    print(f"Results saved to:")
    print(f"  • {results_path} (detailed per-question results)")
    print(f"  • {summary_path} (summary statistics)\n")

    return accuracy

print("✓ All helper functions defined successfully!")


## Model Evaluation

Running evaluation on all 4 models:

### Estimated Runtime (Free Colab T4)
- **gemma-2b**: ~5 minutes
- **gemma-2b-it**: ~5 minutes  
- **gemma-7b**: ~15 minutes
- **gemma-7b-it**: ~15 minutes (with memory optimization)

**Total: ~40-50 minutes** of evaluation time; allow extra time for model downloads

Each model processes 100 questions with greedy decoding. Results for a model are written to disk once that model finishes, so you can check the Files panel (left sidebar) for outputs as each model completes.


In [None]:
# Define first 3 models to evaluate (skip 7B-IT for now)
models_to_evaluate = [
    ("google/gemma-2b",    "gemma2b_base"),
    ("google/gemma-2b-it", "gemma2b_it"),
    ("google/gemma-7b",    "gemma7b_base"),
]

# Store results
all_accuracies = {}

# Evaluate each model sequentially
for model_name, label in models_to_evaluate:
    model, tokenizer = load_model(model_name)
    accuracy = evaluate_model(model, tokenizer, model_name, eval_data, label)
    all_accuracies[label] = accuracy

    # Free memory before loading next model
    del model
    del tokenizer
    torch.cuda.empty_cache()

print("\n" + "="*60)
print("✓ 2B base, 2B-IT, and 7B base models complete!")
print("="*60)


## Evaluating google/gemma-7b-it

**Memory Challenge:** The 7B instruction-tuned model requires aggressive memory optimization on free Colab.

**Optimizations applied:**
- Reduced `max_new_tokens` from 16 → 8
- Explicit memory cleanup after each generation
- Truncation of long input prompts
- Mixed precision inference

This may slightly reduce accuracy compared with the 16-token budget used for the other models, but it ensures the evaluation completes without OOM crashes.
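
A back-of-envelope estimate helps explain why the 7B model is tight on a 16 GB T4. The sketch below counts only fp16 weights at 2 bytes per parameter. The nominal parameter counts (2e9 and 7e9) are assumptions taken from the model names, and real checkpoints may be larger. Activations and the KV cache are not included.

```python
def fp16_weight_gib(n_params: float) -> float:
    """Approximate size of the weights alone in GiB at 2 bytes per parameter."""
    return n_params * 2 / 2**30

# Nominal sizes inferred from the model names (assumption, not measured).
for name, n_params in [("gemma-2b", 2e9), ("gemma-7b", 7e9)]:
    print(f"{name}: ~{fp16_weight_gib(n_params):.1f} GiB of fp16 weights")
```

Under these assumptions the 7B weights alone take roughly 13 GiB, which leaves little headroom for generation on a T4.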


In [None]:
import gc

# Hard memory reset
gc.collect()
torch.cuda.empty_cache()

HF_TOKEN = os.environ["HF_TOKEN"]

print("Loading google/gemma-7b-it (fp16, memory-optimized)...")

tokenizer_7b_it = AutoTokenizer.from_pretrained(
    "google/gemma-7b-it",
    token=HF_TOKEN,
)

model_7b_it = AutoModelForCausalLM.from_pretrained(
    "google/gemma-7b-it",
    token=HF_TOKEN,
    device_map="auto",
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
    max_memory={0: "13GB"},
)

print("✓ Model loaded!\n")

def generate_answer_7b_it(question, max_new_tokens=8):
    """Memory-optimized generation for 7B-IT."""
    prompt = format_prompt(question)
    inputs = tokenizer_7b_it(prompt, return_tensors="pt", truncation=True, max_length=512).to("cuda")

    with torch.no_grad():
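        # torch.cuda.amp.autocast() is deprecated in recent PyTorch; torch.autocast is the equivalent
        with torch.autocast(device_type="cuda", dtype=torch.float16):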
            outputs = model_7b_it.generate(
                **inputs,
                max_new_tokens=max_new_tokens,
                do_sample=False,
                pad_token_id=tokenizer_7b_it.eos_token_id,
                use_cache=True,
            )

    text = tokenizer_7b_it.decode(outputs[0], skip_special_tokens=True)
    if "Answer:" in text:
        text = text.split("Answer:")[-1].strip()

    # Aggressive cleanup
    del inputs, outputs
    torch.cuda.empty_cache()
    gc.collect()

    return text

# Run evaluation
correct = 0
results = []

print("Evaluating gemma-7b-it (fp16, 100 questions)...\n")

for i, item in enumerate(eval_data, 1):
    question = item["question"]
    gold_answer = item["answer"]

    try:
        pred = generate_answer_7b_it(question)
        # Note: plain lowercase substring match; unlike evaluate_model(), normalize() is not applied here
        is_correct = gold_answer.lower() in pred.lower()
        correct += int(is_correct)

        results.append({
            "question": question,
            "gold_answer": gold_answer,
            "predicted": pred,
            "correct": is_correct,
        })

        if i % 10 == 0:
            print(f"[{i}/100] {correct}/{i} correct")

    except RuntimeError as e:
        # Usually a CUDA out-of-memory error, but print the message to be sure
        print(f"\n⚠️ RuntimeError at question {i} (possibly OOM): {e}")
        break

accuracy = correct / len(results) if results else 0

print("\n" + "=" * 60)
print(f"Final: {accuracy:.2%} ({correct}/{len(results)})")
print("=" * 60)

# Save results
with open("gemma7b_it_fp16_eval_results.jsonl", "w", encoding="utf-8") as f:
    for r in results:
        f.write(json.dumps(r, ensure_ascii=False) + "\n")

summary = {
    "model": "google/gemma-7b-it",
    "label": "gemma7b_it_fp16",
    "quantization": "none (fp16)",
    "total_questions": len(results),
    "correct": correct,
    "accuracy": accuracy,
}
with open("gemma7b_it_fp16_eval_summary.json", "w", encoding="utf-8") as f:
    json.dump(summary, f, indent=2)

print("\n✓ Results saved!")
all_accuracies["gemma7b_it_fp16"] = accuracy

# Clean up
del model_7b_it, tokenizer_7b_it
gc.collect()
torch.cuda.empty_cache()


## Final Results & Comparison

With all 4 models evaluated, the next cell loads the saved summary files and prints a comparison table.


In [None]:
import json

print("\n" + "="*60)
print("FINAL COMPARISON - 100 Question Factual QA")
print("="*60 + "\n")

# Load accuracies from saved summary files
summary_files = {
    "gemma2b_base": "gemma2b_base_eval_summary.json",
    "gemma2b_it": "gemma2b_it_eval_summary.json",
    "gemma7b_base": "gemma7b_base_eval_summary.json",
    "gemma7b_it_fp16": "gemma7b_it_fp16_eval_summary.json",
}

accuracies = {}
for label, filename in summary_files.items():
    try:
        with open(filename, "r") as f:
            data = json.load(f)
            accuracies[label] = data["accuracy"]
    except FileNotFoundError:
        accuracies[label] = None

# Print comparison table
print(f"{'Model':<25} {'Params':<10} {'Type':<20} {'Accuracy':<10}")
print("-" * 70)

model_info = [
    ("google/gemma-2b",    "gemma2b_base",    "2B", "base"),
    ("google/gemma-2b-it", "gemma2b_it",      "2B", "instruct"),
    ("google/gemma-7b",    "gemma7b_base",    "7B", "base"),
    ("google/gemma-7b-it", "gemma7b_it_fp16", "7B", "instruct (fp16)"),
]

for model_name, label, params, model_type in model_info:
    acc = accuracies.get(label)
    if acc is not None:
        print(f"{model_name:<25} {params:<10} {model_type:<20} {acc:.2%}")
    else:
        print(f"{model_name:<25} {params:<10} {model_type:<20} N/A")

print("\n" + "="*60)
print("All result files saved in Files panel (left sidebar)")
print("="*60)


## Key Findings

### 1. Instruction Tuning is Critical
- **2B-IT vs 2B Base**: +66 percentage points (19% → 85%)
- Base models generate continuations instead of direct answers

### 2. Surprising Result: 2B-IT Outperforms 7B-IT
- **2B-IT: 85%** vs **7B-IT: 69%**
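- Possibly an artifact of the evaluation setup rather than model quality: 7B-IT ran with an 8-token budget (vs 16) and a separate scoring line that skips `normalize()`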
- Future work: Re-evaluate 7B-IT with full 16-token budget

### 3. Scale Without Tuning Adds Little Value
- 7B base is only 2 percentage points above 2B base (21% vs 19%)
- Instruction alignment > raw parameter count
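
With only 100 questions per model, small gaps deserve caution. The sketch below computes approximate 95% Wilson score intervals for the accuracies quoted above. The counts are inferred from those percentages on the assumption that each model was scored on all 100 questions. The `wilson_interval` helper is illustrative and not part of the notebook.

```python
import math

def wilson_interval(correct: int, total: int, z: float = 1.96):
    """Approximate 95% Wilson score interval for a binomial proportion."""
    p = correct / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return center - half, center + half

# Counts inferred from the quoted percentages, assuming 100 scored questions each.
scores = {"2B base": 19, "2B-IT": 85, "7B base": 21, "7B-IT": 69}

for name, k in scores.items():
    low, high = wilson_interval(k, 100)
    print(f"{name}: {k}% (95% interval roughly {low:.0%} to {high:.0%})")
```

Under these assumptions the two base-model intervals overlap heavily, so the 2-point gap between them is within noise, while the gap between 2B base and 2B-IT is far outside either interval.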

## Download Results

Click the Files icon (📁) on the left sidebar to download:
- `*_eval_results.jsonl` - Per-question detailed results
- `*_eval_summary.json` - Accuracy statistics

Upload these to your GitHub repo under the `results/` folder.
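
The summary files store only overall accuracy, but a per-category breakdown can be recovered from the per-question results, since the questions are stored in category order. The sketch below assumes the results list follows the dataset order and uses the category sizes listed in the Evaluation Dataset section. The `per_category_accuracy` helper is hypothetical.

```python
# Category sizes in dataset order, as listed in the Evaluation Dataset section.
CATEGORY_SIZES = [
    ("Geography", 15), ("Math", 15), ("Science", 20),
    ("History", 15), ("Literature", 15), ("General Knowledge", 20),
]

def per_category_accuracy(results):
    """Return {category: (correct, total)}, assuming results follow dataset order.

    Categories with no results (for example after an early stop) are omitted.
    """
    breakdown = {}
    start = 0
    for name, size in CATEGORY_SIZES:
        chunk = results[start:start + size]
        if chunk:
            n_correct = sum(bool(r["correct"]) for r in chunk)
            breakdown[name] = (n_correct, len(chunk))
        start += size
    return breakdown

# Synthetic stand-in for a loaded JSONL file: every other question correct.
demo_results = [{"correct": i % 2 == 0} for i in range(100)]
for name, (n_correct, total) in per_category_accuracy(demo_results).items():
    print(f"{name:<18} {n_correct}/{total}")
```

For real data, each line of a `*_eval_results.jsonl` file can be parsed with `json.loads` and the resulting list passed to the helper.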
