<a href="https://colab.research.google.com/github/kissflow/prompt2finetune/blob/main/Tinyllama_before_finetuning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**📋 PURPOSE:**

This notebook runs a baseline test of what TinyLlama knows about the
IPL 2023 cricket tournament BEFORE any fine-tuning. This helps us:
1. Understand what the pre-trained model already knows
2. Identify knowledge gaps (IPL 2023 facts it doesn't know)
3. Establish a baseline to measure improvement after fine-tuning

**🎯 LEARNING OBJECTIVES:**
- Understand how to load and test large language models
- Learn about model quantization for efficient memory usage
- Explore prompt formatting and chat-template structure
- See the difference between general knowledge and specific domain knowledge
- Measure inference performance and response quality

**⚙️ REQUIREMENTS:**
- Google Colab with GPU (T4, A100, or similar)
- No prior ML experience needed!


In [None]:
#============================================================================
# 🔧 STEP 1: INSTALLATION
#============================================================================
# Install necessary libraries for running the language model

# Install 'uv' - A fast Python package installer (faster than pip)
!pip install uv

# Install Unsloth - A library that optimizes language model training/inference
# This is 2-5x faster than standard methods and uses less memory
# 'colab-new' flag ensures compatibility with Google Colab's latest environment
!uv pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"

# Install supporting libraries:
# - trl: Transformer Reinforcement Learning (for training)
# - peft: Parameter-Efficient Fine-Tuning (LoRA adapters)
# - accelerate: Device placement, mixed precision, and multi-GPU support
# - bitsandbytes: 4-bit/8-bit model quantization for memory efficiency
!uv pip install trl peft accelerate bitsandbytes

# 💡 KEY TAKEAWAY:
# These libraries work together to let us run large models (1.1B parameters!)
# on consumer GPUs by using clever memory optimization techniques
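
After the install cell finishes, it can help to confirm that each package is actually visible to the current Python environment. The following is a minimal sketch using only the standard library; `installed_versions` is a hypothetical helper name, not part of any of the libraries above.

```python
from importlib import metadata


def installed_versions(packages):
    """Return {package: version string, or None if it is not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# Distribution names taken from the install commands above
report = installed_versions(["unsloth", "trl", "peft", "accelerate", "bitsandbytes"])
for name, version in report.items():
    print(f"{name:<14} {version or 'NOT INSTALLED'}")
```

If any line prints NOT INSTALLED, re-running the install cell (or restarting the runtime) is usually the first thing to try.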

In [None]:
#============================================================================
# 📦 STEP 2: IMPORT LIBRARIES & CONFIGURE MODEL
#============================================================================

from unsloth import FastLanguageModel  # Optimized model loading from Unsloth
import torch                           # PyTorch - the deep learning framework
import time                            # For measuring inference speed
from datetime import datetime          # For timestamps in results

# Model configuration constants
# UPPERCASE names are the Python convention for values that are meant to stay constant
MODEL_NAME = "unsloth/tinyllama"  # TinyLlama: A small but capable 1.1B parameter model
MAX_SEQ_LENGTH = 2048             # Maximum number of tokens (words/subwords) the model can process at once

# 💡 WHY TinyLlama?
# - Small enough to run on free Colab GPU (only 1.1B parameters vs 7B+ for larger models)
# - Fast inference on a single GPU (per-question timings are measured in Step 7)
# - Good baseline for demonstrating fine-tuning improvements

print(f"Loading {MODEL_NAME}...\n")

In [None]:
#============================================================================
# 🤖 STEP 3: LOAD THE PRE-TRAINED MODEL
#============================================================================
# Load the TinyLlama model WITHOUT any fine-tuning
# This is the "base" model trained on general internet text

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name=MODEL_NAME,           # Which model to load
    max_seq_length=MAX_SEQ_LENGTH,   # How many tokens it can handle
    dtype=None,                       # Auto-detect best precision (float16/bfloat16)
    load_in_4bit=True,               # ⭐ Use 4-bit quantization to save memory
)

# 💡 WHAT IS QUANTIZATION?
# Model weights are usually stored as 16-bit numbers (2 bytes per parameter)
# 4-bit quantization stores each weight in about 4 bits (0.5 bytes per parameter)
# This cuts weight memory by roughly 75%, usually with only a small quality loss
# Example: 1.1B parameters × 2 bytes = 2.2GB (16-bit)
#          1.1B parameters × 0.5 bytes = 550MB (4-bit)

print("✅ Base model loaded successfully!")
print(f"✅ Model: {MODEL_NAME}")
print("✅ Parameters: ~1.1 Billion")
print("✅ Quantization: 4-bit (saves ~75% memory)")
print("✅ Status: Ready for testing (NO fine-tuning applied yet)")
print("="*80 + "\n")
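
The memory arithmetic in the quantization note can be reproduced in a few lines. This is a rough sketch that counts weight storage only; `weight_memory_gb` is a hypothetical helper, and real usage will be somewhat higher because of quantization metadata, layers kept in higher precision, activations, and the KV cache.

```python
def weight_memory_gb(num_params: float, bits_per_param: float) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes), ignoring all overhead."""
    total_bytes = num_params * bits_per_param / 8
    return total_bytes / 1e9


NUM_PARAMS = 1.1e9  # TinyLlama's approximate parameter count

fp16_gb = weight_memory_gb(NUM_PARAMS, 16)
int4_gb = weight_memory_gb(NUM_PARAMS, 4)
saving = 1 - int4_gb / fp16_gb

print(f"16-bit weights: {fp16_gb:.2f} GB")
print(f" 4-bit weights: {int4_gb:.2f} GB")
print(f"Saving: {saving:.0%}")
```

The same function can be used to estimate larger models, for example by passing `7e9` for a 7B-parameter model.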

In [None]:
#============================================================================
# 🔤 STEP 4: CONFIGURE THE TOKENIZER
#============================================================================
# The tokenizer converts text into numbers (tokens) that the model understands

# Set padding token to be the same as end-of-sequence token
# Padding is used when processing multiple sentences of different lengths
tokenizer.pad_token = tokenizer.eos_token

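# Pad on the right side of the text
# This only matters when batching prompts of different lengths; each call below
# sends a single prompt, so no padding is actually added. For batched generation
# with decoder-only models, left padding is generally preferred.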
tokenizer.padding_side = "right"

# 💡 WHAT IS A TOKENIZER?
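# Language models don't operate on raw text; they operate on numbers.
# The tokenizer splits text into pieces (tokens) and maps each piece to an integer ID:
# Example: "Hello world!" → ["Hello", " world", "!"] → [15496, 1879, 0]
# (the pieces and IDs here are illustrative; actual values depend on the tokenizer's vocabulary)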


#============================================================================
# 🚀 STEP 5: ENABLE INFERENCE MODE
#============================================================================
# Put the model in "inference mode" for faster predictions
FastLanguageModel.for_inference(model)

# 💡 WHAT IS INFERENCE MODE?
# Training mode: Model learns by updating weights (slow, uses more memory)
# Inference mode: Model just makes predictions (fast, uses less memory)
# We're only testing, not training, so inference mode is perfect!
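
To make the tokenizer and padding notes concrete, here is a toy sketch in plain Python. It assumes a simple whitespace tokenizer, which is a deliberate simplification: the real TinyLlama tokenizer works on subword pieces with a fixed vocabulary. The helper names (`build_vocab`, `encode`, `pad_batch`) are hypothetical and exist only for this illustration.

```python
def build_vocab(texts):
    """Assign an integer ID to every distinct word; ID 0 is reserved for padding."""
    vocab = {"<pad>": 0}
    for text in texts:
        for word in text.split():
            vocab.setdefault(word, len(vocab))
    return vocab


def encode(text, vocab):
    """Convert text to a list of token IDs using the toy vocabulary."""
    return [vocab[word] for word in text.split()]


def pad_batch(sequences, pad_id, side="right"):
    """Pad every sequence to the length of the longest one."""
    width = max(len(seq) for seq in sequences)
    padded = []
    for seq in sequences:
        filler = [pad_id] * (width - len(seq))
        padded.append(seq + filler if side == "right" else filler + seq)
    return padded


texts = ["who won the match", "who won"]
vocab = build_vocab(texts)
batch = [encode(text, vocab) for text in texts]

right_padded = pad_batch(batch, vocab["<pad>"], side="right")
left_padded = pad_batch(batch, vocab["<pad>"], side="left")

print("right:", right_padded)
print("left: ", left_padded)
```

With left padding the real tokens sit at the end of each row, which is why it is generally preferred when a decoder-only model generates for a whole batch at once.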

In [None]:
#============================================================================
# 📝 STEP 6: HELPER FUNCTIONS FOR PROMPTING
#============================================================================

def build_prompt(user_msg: str) -> str:
    """
    Convert a user question into the chat prompt format used in this notebook.

    The <|user|> / <|assistant|> markers follow the Zephyr-style template used by
    TinyLlama-1.1B-Chat-v1.0. Strictly speaking this is not ChatML, which uses
    <|im_start|> / <|im_end|> markers. The format tells the model:
    - What part is the user's question
    - Where the model should start its response

    Args:
        user_msg: The question we want to ask (plain English)

    Returns:
        Formatted prompt string

    Example:
        Input:  "Who won IPL 2023?"
        Output: "<|user|>\nWho won IPL 2023?</s>\n<|assistant|>\n"

    💡 Prompt Structure Explained:
    <|user|>          ← Marks the start of user's message
    {question}</s>    ← The actual question, ended with </s> (end-of-sequence)
    <|assistant|>     ← Marks where the model should generate its response
    """
    return f"<|user|>\n{user_msg}</s>\n<|assistant|>\n"


def extract_answer(full_text: str) -> str:
    """
    Extract just the assistant's response from the full generated text.

    The model generates the entire conversation including special tokens.
    We only want the actual answer part, not the formatting tokens.

    Args:
        full_text: Complete generated text with all tokens

    Returns:
        Just the assistant's answer, cleaned up

    Example:
        Input:  "<|user|>\nQuestion?</s>\n<|assistant|>\nThe answer is...</s>"
        Output: "The answer is..."

    💡 WHY WE NEED THIS:
    Models generate more than just the answer - they include the prompt too!
    We need to extract just the useful part for display.
    """
    # Check if the assistant marker exists in the text
    if "<|assistant|>" not in full_text:
        return full_text  # Return as-is if no marker found

    # Split on the first "<|assistant|>" (the one from our prompt) and keep everything after it
    ans = full_text.split("<|assistant|>", 1)[1]

    # Remove the end-of-sequence token and any extra whitespace
    return ans.split(tokenizer.eos_token)[0].strip()


def generate_response(
    question: str,
    max_new_tokens: int = 120,
    temperature: float = 0.1,
    show_time: bool = True
):
    """
    Generate a response to a question using the language model.

    This is the main function that:
    1. Formats the question properly
    2. Converts text to tokens
    3. Runs the model to generate a response
    4. Converts tokens back to text
    5. Extracts and returns the answer

    Args:
        question: The question to ask (plain English)
        max_new_tokens: Maximum length of response in tokens (120 tokens is roughly 90 English words)
        temperature: Randomness of response (0.1 = focused, 1.0 = creative)
        show_time: Whether to return inference time

    Returns:
        answer: The model's response
        elapsed: Time taken (if show_time=True)

    💡 GENERATION PARAMETERS EXPLAINED:
    - max_new_tokens: Stop after this many tokens to prevent endless generation
    - temperature: Controls randomness
        * 0.0 = Always pick the most likely token (greedy; this function disables sampling)
        * 0.1 = Almost always the most likely token (good for facts)
        * 0.7 = Balanced (good for conversation)
        * 1.0+ = More random (good for creativity)
    - top_p: Sample only from the smallest set of tokens whose cumulative
      probability reaches 0.9 (nucleus sampling)
    - do_sample: Whether to sample randomly (True) or always pick the top token (False)
    """
    # STEP 1: Format the question into the chat prompt format
    prompt = build_prompt(question)

    # STEP 2: Convert text to tokens and move to GPU
    # tokenizer() converts text → token IDs
    # .to("cuda") moves the data to GPU for fast processing
    inputs = tokenizer(prompt, return_tensors="pt").to("cuda")

    # STEP 3: Start timing the inference
    start_time = time.time()

    # STEP 4: Generate response (no gradient calculation needed for inference)
    with torch.no_grad():  # Saves memory by not tracking gradients
        outputs = model.generate(
            **inputs,                                           # Input token IDs
            max_new_tokens=max_new_tokens,                    # Maximum response length
            temperature=temperature,                           # Randomness (0.1 = focused)
            top_p=0.9,                                        # Nucleus sampling threshold
            do_sample=temperature > 0,                        # Enable sampling only if temperature > 0
            pad_token_id=tokenizer.eos_token_id,             # Token for padding
            eos_token_id=tokenizer.eos_token_id,             # Token to end generation
        )

    # STEP 5: Calculate how long inference took
    elapsed = time.time() - start_time

    # STEP 6: Convert token IDs back to text
    # skip_special_tokens=False keeps formatting tokens like <|assistant|>
    full_text = tokenizer.decode(outputs[0], skip_special_tokens=False)

    # STEP 7: Extract just the answer part
    answer = extract_answer(full_text)

    # STEP 8: Return answer (and time if requested)
    if show_time:
        return answer, elapsed
    return answer
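
The effect of `temperature` and `top_p` described in the docstring can be seen on a tiny example. The sketch below is a plain-Python illustration of the idea, not the implementation used inside `model.generate`; the logits are made-up scores for four imaginary tokens, and both function names are hypothetical.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities; lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, top_p):
    """Keep the smallest set of most likely tokens whose cumulative
    probability reaches top_p, then renormalize over that set."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}


logits = [2.0, 1.0, 0.5, 0.0]  # hypothetical scores for four candidate tokens

for temperature in (0.1, 0.7, 1.0):
    probs = softmax_with_temperature(logits, temperature)
    nucleus = top_p_filter(probs, top_p=0.9)
    rounded = [round(p, 3) for p in probs]
    print(f"T={temperature}: probs={rounded}, tokens kept by top_p=0.9: {sorted(nucleus)}")
```

At a temperature of 0.1 almost all probability lands on the top token, so sampling behaves nearly like greedy decoding; at 1.0 several tokens remain in the nucleus and the output varies more between runs.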

In [None]:
#============================================================================
# 🏏 STEP 7: TEST WITH IPL 2023 SPECIFIC QUESTIONS
#============================================================================
# These questions require knowledge of IPL 2023 tournament
# The base model likely WON'T know these answers (they're after its training cutoff)

print("\n" + "="*80)
print("📊 TESTING: IPL 2023 Specific Knowledge (Expected: ❌ Poor Performance)")
print("="*80)
print("\n⚠️  IMPORTANT: The base model was most likely trained on text collected before IPL 2023")
print("   It should NOT know specific match results, scores, or statistics.")
print("   This demonstrates why fine-tuning is necessary!\n")
print("="*80 + "\n")

# List of IPL 2023 specific questions
# These require exact knowledge of the 2023 tournament
ipl_2023_questions = [
    "In IPL 2023, who won when Gujarat Titans played against Chennai Super Kings?",
    "Where was the IPL 2023 match between Rajasthan Royals and Sunrisers Hyderabad played?",
    "Which stadium hosted the Punjab Kings vs Rajasthan Royals match in IPL 2023?",
    "Who was the Man of the Match when Mumbai Indians played Chennai Super Kings in IPL 2023?",
    "What was the result of the match between Royal Challengers Bangalore and Kolkata Knight Riders in IPL 2023?",
    "Who won the toss in the Delhi Capitals vs Lucknow Super Giants match in IPL 2023?",
    "What was the margin of victory when Chennai Super Kings beat Gujarat Titans in IPL 2023?",
    "Who scored the most runs in the IPL 2023 match between Punjab Kings and Mumbai Indians?",
]

# Storage for results
results = []        # Will store each question, answer, and time
total_time = 0      # Track total time for all questions

# Loop through each question and get the model's response
for i, question in enumerate(ipl_2023_questions, 1):
    # Display question number and text
    print(f"\n[Question {i}/{len(ipl_2023_questions)}]")
    print(f"Q: {question}")
    print("-" * 80)

    # Generate response with low temperature (0.1) for factual answers
    # Low temperature concentrates sampling on the most likely tokens, so output is nearly deterministic
    answer, elapsed = generate_response(question, temperature=0.1)
    total_time += elapsed  # Add to running total

    # Display the answer and how long it took
    print(f"A: {answer}")
    print(f"⏱️  Time: {elapsed:.2f}s")
    print("=" * 80)

    # Store results for later analysis
    results.append({
        "question": question,
        "answer": answer,
        "time": f"{elapsed:.2f}s"
    })

# Calculate and display average response time
avg_time = total_time / len(ipl_2023_questions)

print("\n📊 RESULTS SUMMARY:")
print(f"   Total questions: {len(ipl_2023_questions)}")
print(f"   Average response time: {avg_time:.2f}s")
print(f"   Total test time: {total_time:.2f}s")

# 💡 KEY TAKEAWAY:
# You'll likely see the model give VAGUE, GENERIC, or INCORRECT answers
# This is EXPECTED! The model doesn't have this specific knowledge.
# After fine-tuning, these same questions should get accurate, specific answers.
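
Reading the answers by eye works for eight questions, but a numeric baseline makes the before/after comparison easier. One simple option, sketched below, is to check whether each answer mentions a set of expected keywords. This is an assumption about how scoring could be done, not something the notebook itself implements; `keyword_score` is a hypothetical helper, and the records use placeholder text ("Team A", "Team B") rather than real IPL 2023 facts.

```python
def keyword_score(answer: str, expected_keywords: list[str]) -> float:
    """Fraction of expected keywords that appear in the answer (case-insensitive)."""
    if not expected_keywords:
        return 0.0
    text = answer.lower()
    hits = sum(1 for keyword in expected_keywords if keyword.lower() in text)
    return hits / len(expected_keywords)


# Placeholder records for illustration only; replace with real answers and
# verified reference keywords before drawing any conclusions.
sample_results = [
    {"answer": "Team A won the match by 5 wickets.", "expected": ["Team A", "5 wickets"]},
    {"answer": "I am not sure which team won.", "expected": ["Team B"]},
]

scores = [keyword_score(r["answer"], r["expected"]) for r in sample_results]
mean_score = sum(scores) / len(scores)

print(f"Per-question scores: {scores}")
print(f"Mean keyword score: {mean_score:.2f}")
```

Keyword matching is crude (it cannot tell a correct sentence from one that merely mentions the right team), so it is best treated as a quick sanity check alongside manual review.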