# Inference Notebook

Use this notebook to run inference with a trained model. It is designed to:
1.  **Load Weights:** Load either your own trained checkpoint (a local `.pth` file) or the official GPT-2 weights (downloaded from the Hugging Face Hub).
2.  **Generate Text:** Run inference to produce new text from the loaded model.

## Imports & Device Setup

In [1]:
import torch
import tiktoken
import os
import tqdm

from model import GPTModel
from generate import generate_and_print
from download_weights_hf import load_weights_into_gpt 

# Import Transformers for downloading official weights
from transformers import GPT2Model

# Device Setup
if torch.backends.mps.is_available():
    device = torch.device("mps")
elif torch.cuda.is_available():
    device = torch.device("cuda")
else:
    device = torch.device("cpu")

print(f"Inference Device: {device}")

Inference Device: mps


## Model Selection & Configuration

**Choose Your Model:**
In the cell below, set `MODEL_SELECTION` to either your own trained model or one of the official pre-trained GPT-2 models (loaded via Hugging Face).

* **Custom Model (`"custom"`):** Loads the `.pth` file you trained earlier from the `./model_weights` directory.
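    * *Important:* Ensure the configuration inside the `if MODEL_SELECTION == "custom":` block (especially `context_length`) matches your training settings exactly. A mismatch in any size-related setting typically makes `load_state_dict` fail with a size-mismatch error.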
* **Official Models:** Downloads pre-trained weights from the Hugging Face Hub and maps them into our custom architecture on the fly.
    * *Available Sizes:* `gpt2-small (124M)`, `gpt2-medium (355M)`, `gpt2-large (774M)`, `gpt2-xl (1558M)`.
    * *Mechanism:* This uses the `transformers` library to fetch weights, then transfers them layer-by-layer into your `GPTModel` class.
    * *Hardware Warning:* The **XL (1558M)** model requires significant RAM (~12GB+) during the loading process because it must hold both the source weights and your custom model in memory simultaneously before cleanup.
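
The memory figure above can be sanity-checked with a back-of-the-envelope parameter count. The sketch below is an estimate under stated assumptions, not a measurement: it assumes the standard GPT-2 layout (4x MLP expansion, biases on every linear layer, output head tied to the token embedding) and float32 weights. `count_gpt2_params` is a hypothetical helper, not part of this repository. If your `GPTModel` keeps a separate, untied output head, add `vocab_size * emb_dim` parameters to the result.

```python
def count_gpt2_params(emb_dim, n_layers, vocab_size=50257, context_length=1024):
    """Approximate GPT-2 parameter count, assuming a tied output head."""
    # Token embedding table plus positional embedding table
    embeddings = vocab_size * emb_dim + context_length * emb_dim
    # Fused QKV projection (3*d*d + 3*d) plus output projection (d*d + d)
    attention = 4 * emb_dim * emb_dim + 4 * emb_dim
    # Feed-forward: d -> 4d (4*d*d + 4*d) and 4d -> d (4*d*d + d)
    mlp = 8 * emb_dim * emb_dim + 5 * emb_dim
    # Two LayerNorms per block, each with a scale and a shift vector
    layer_norms = 4 * emb_dim
    per_block = attention + mlp + layer_norms
    final_norm = 2 * emb_dim
    return embeddings + n_layers * per_block + final_norm


# (emb_dim, n_layers) pairs taken from the model sizes listed above
sizes = {
    "gpt2-small (124M)": (768, 12),
    "gpt2-medium (355M)": (1024, 24),
    "gpt2-large (774M)": (1280, 36),
    "gpt2-xl (1558M)": (1600, 48),
}

for name, (emb_dim, n_layers) in sizes.items():
    n_params = count_gpt2_params(emb_dim, n_layers)
    gigabytes = n_params * 4 / 1e9  # 4 bytes per float32 parameter
    print(f"{name}: {n_params:,} parameters, ~{gigabytes:.2f} GB per float32 copy")
```

For the XL model this works out to about 1.56 billion parameters, or roughly 6.2 GB per float32 copy, so holding the Hugging Face model and the custom model at the same time lands near the 12 GB quoted above.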

In [2]:
# Available configurations for gpt2
model_configs = {
    "gpt2-small (124M)":  {"emb_dim": 768,  "n_layers": 12, "n_heads": 12},
    "gpt2-medium (355M)": {"emb_dim": 1024, "n_layers": 24, "n_heads": 16},
    "gpt2-large (774M)":  {"emb_dim": 1280, "n_layers": 36, "n_heads": 20},
    "gpt2-xl (1558M)":    {"emb_dim": 1600, "n_layers": 48, "n_heads": 25},
}

# --- USER CHOICE ---
# You can copy-paste keys from above:
# "gpt2-small (124M)", "gpt2-medium (355M)", "gpt2-large (774M)", "gpt2-xl (1558M)"
# OR set to "custom" for your locally trained model.
MODEL_SELECTION = "gpt2-small (124M)" 

# Directory to store models
MODELS_DIR = "./model_weights"
os.makedirs(MODELS_DIR, exist_ok=True)

# 1. Setup Configuration based on selection
if MODEL_SELECTION == "custom":
    # --- PATH A: YOUR CUSTOM MODEL ---
    filename = "trained_gpt2.pth" 
    model_path = os.path.join(MODELS_DIR, filename)
    print(f"\nSelection: Custom Model ({model_path})")
    
    # Ensure this config matches your training exactly!
    config = {
        "vocab_size": 50257,    
        "context_length": 256, 
        "emb_dim": 768,         
        "n_heads": 12,          
        "n_layers": 12,         
        "drop_rate": 0.1,       
        "qkv_bias": False       
    }

elif MODEL_SELECTION in model_configs:
    # --- PATH B: OFFICIAL OPENAI MODEL (VIA HF) ---
    print(f"\nSelection: Official OpenAI GPT-2 ({MODEL_SELECTION})")
    
    # The official GPT-2 checkpoints use a context length of 1024 and QKV bias terms
    config = {
        "vocab_size": 50257,
        "context_length": 1024,
        "drop_rate": 0.0,
        "qkv_bias": True 
    }
    config.update(model_configs[MODEL_SELECTION])
    model_path = None # No local file path needed yet

else:
    raise ValueError(f"Unknown model selection: {MODEL_SELECTION}")

print("Configuration set.")


Selection: Official OpenAI GPT-2 (gpt2-small (124M))
Configuration set.
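
For the custom path, a mismatch between `config` and the checkpoint only surfaces later, as a shape error when the weights are loaded. A rough pre-flight check is sketched below. It assumes the checkpoint stores the token and positional embedding tables under the keys `tok_emb.weight` and `pos_emb.weight`; those names are guesses about `GPTModel`, so adjust them to whatever `state_dict.keys()` shows. `find_config_mismatches` is a hypothetical helper.

```python
def find_config_mismatches(config, shapes,
                           tok_key="tok_emb.weight",   # assumed key name
                           pos_key="pos_emb.weight"):  # assumed key name
    """Compare config values against embedding-table shapes from a checkpoint.

    `shapes` maps parameter names to shape tuples.
    Returns a list of mismatch messages (empty if everything agrees).
    """
    expected = {}
    if tok_key in shapes:
        # Token embedding table is (vocab_size, emb_dim)
        expected["vocab_size"], expected["emb_dim"] = shapes[tok_key]
    if pos_key in shapes:
        # Positional embedding table is (context_length, emb_dim)
        expected["context_length"] = shapes[pos_key][0]
    return [
        f"{name}: config has {config.get(name)}, checkpoint has {value}"
        for name, value in expected.items()
        if config.get(name) != value
    ]


# Stand-in shapes for a checkpoint trained with context_length=256
example_shapes = {"tok_emb.weight": (50257, 768), "pos_emb.weight": (256, 768)}
example_config = {"vocab_size": 50257, "context_length": 1024, "emb_dim": 768}

for problem in find_config_mismatches(example_config, example_shapes):
    print(problem)
```

With a real checkpoint, `shapes` could be built from the loaded state dict as `{k: tuple(v.shape) for k, v in state_dict.items()}`.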


## Instantiation & Weight Loading

In [3]:
# 1. Initialize YOUR custom architecture
print("Initializing Custom GPTModel...")
model = GPTModel(config)
model.to(device)
model.eval()

# 2. Load Weights (Path A vs Path B)
if MODEL_SELECTION == "custom":
    print(f"Loading local weights from: {model_path} ...")
    state_dict = torch.load(model_path, map_location=device, weights_only=True)
    model.load_state_dict(state_dict)

else:
    print("Downloading/Loading Official Weights via Hugging Face...")
    
    # Map selection to HF ID
    hf_mapping = {
        "gpt2-small (124M)": "gpt2",
        "gpt2-medium (355M)": "gpt2-medium",
        "gpt2-large (774M)": "gpt2-large",
        "gpt2-xl (1558M)": "gpt2-xl"
    }
    
    # Download the HF model (to CPU first to save GPU RAM)
    hf_model = GPT2Model.from_pretrained(
        hf_mapping[MODEL_SELECTION], 
        cache_dir=MODELS_DIR
    )
    
    # Transfer weights to custom model
    print("Transferring weights to custom architecture...")
    load_weights_into_gpt(model, hf_model)
    
    # Re-assert the device move to ensure MPS allocates the new weight storage
    model.to(device)
    
    # Cleanup HF model
    del hf_model
    if torch.cuda.is_available():
        torch.cuda.empty_cache()

print("-> Model successfully loaded and ready!")

Initializing Custom GPTModel...
Downloading/Loading Official Weights via Hugging Face...
Transferring weights to custom architecture...
-> Weights successfully transferred from HF to Custom Model.
-> Model successfully loaded and ready!
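
The source of `load_weights_into_gpt` is not shown in this notebook, so the following is only an illustration of the layout conversion such a transfer typically involves. In the Hugging Face GPT-2 implementation, each block's attention projection is a single fused `c_attn` layer whose weight has shape `(emb_dim, 3 * emb_dim)`, with inputs along the rows. A `torch.nn.Linear` layer stores its weight as `(out_features, in_features)`. Assuming the custom model uses three separate query, key, and value linear layers (an assumption about `GPTModel`), the fused matrix has to be split in three and each part transposed. The sketch uses NumPy with a tiny `emb_dim` to stay self-contained; `split_fused_qkv` is a hypothetical helper.

```python
import numpy as np


def split_fused_qkv(c_attn_weight, c_attn_bias):
    """Split a fused (emb_dim, 3*emb_dim) QKV weight into Linear-style parts.

    Returns three (weight, bias) pairs, with each weight shaped
    (emb_dim, emb_dim) in the (out_features, in_features) layout.
    """
    # Split along the output axis into query, key, and value thirds
    q_w, k_w, v_w = np.split(c_attn_weight, 3, axis=-1)
    q_b, k_b, v_b = np.split(c_attn_bias, 3, axis=-1)
    # Transpose each weight from (in, out) to (out, in)
    return (q_w.T, q_b), (k_w.T, k_b), (v_w.T, v_b)


emb_dim = 4  # tiny stand-in for 768
rng = np.random.default_rng(0)
fused_w = rng.normal(size=(emb_dim, 3 * emb_dim))
fused_b = rng.normal(size=3 * emb_dim)
x = rng.normal(size=(2, emb_dim))  # two token vectors

(q_w, q_b), (k_w, k_b), (v_w, v_b) = split_fused_qkv(fused_w, fused_b)

# What the fused layer computes: x @ W + b
fused_out = x @ fused_w + fused_b
# What three Linear-style layers compute: x @ W.T + b, concatenated
split_out = np.concatenate(
    [x @ q_w.T + q_b, x @ k_w.T + k_b, x @ v_w.T + v_b], axis=-1
)
print(np.allclose(fused_out, split_out))
```

The final check confirms that the three split layers reproduce the fused projection, which is the property any such transfer needs to preserve.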


## Interactive Generation

Run the cell below to generate text. 

* **`start_context`**: The prompt you give the model.
* **`max_new_tokens`**: The maximum number of new tokens to generate.
* **`temperature`**: Controls how random the sampling is.
    * `0.0` = Deterministic (always picks the most likely token).
    * `0.8` = Creative (more random, more diverse).
    * `1.2` = Chaotic (can become incoherent).
* **`top_k`**: Limits sampling to the $K$ most likely next tokens. This prevents the model from picking highly improbable "junk" tokens that can derail the sentence.
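
How `temperature` and `top_k` interact can be made concrete with a small sketch. `generate_and_print` is imported from `generate.py`, whose source is not shown here, so this is not its implementation. It is a minimal pure-Python version of top-k filtering followed by temperature scaling, with `sample_next_token` as a hypothetical name and made-up logits. It treats `temperature == 0` as greedy decoding, in line with the description above.

```python
import math
import random


def sample_next_token(logits, temperature=1.0, top_k=None, rng=random):
    """Pick a token index from raw logits using top-k filtering and temperature."""
    candidates = list(enumerate(logits))
    if top_k is not None:
        # Keep only the top_k highest-scoring tokens
        candidates = sorted(candidates, key=lambda pair: pair[1], reverse=True)[:top_k]
    if temperature == 0:
        # Greedy decoding: always take the single most likely token
        return max(candidates, key=lambda pair: pair[1])[0]
    # Temperature scaling: values below 1 sharpen the distribution, above 1 flatten it
    scaled = [score / temperature for _, score in candidates]
    peak = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scaled]
    indices = [index for index, _ in candidates]
    # Draw one index with probability proportional to its softmax weight
    return rng.choices(indices, weights=weights, k=1)[0]


logits = [2.0, 1.0, 0.5, -1.0, -3.0]  # made-up scores for a 5-token vocabulary
rng = random.Random(0)                # seeded so repeated runs give the same draws

greedy = sample_next_token(logits, temperature=0.0)
samples = [sample_next_token(logits, temperature=0.7, top_k=2, rng=rng) for _ in range(200)]
print("greedy choice:", greedy)
print("indices seen with top_k=2:", sorted(set(samples)))
```

With `top_k=2`, indices 2 to 4 can never be drawn regardless of temperature, which is the protection against improbable tokens described above.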

In [4]:
# Setup Tokenizer
tokenizer = tiktoken.get_encoding("gpt2")

# --- USER SETTINGS ---
PROMPT = "The history of artificial intelligence"
LENGTH = 100
TEMP   = 0.7  # Try 0.5 to 1.0 for best results
TOP_K  = 50   # Sample only from the 50 most probable tokens

# Run Generation
print(f"Generating for prompt: '{PROMPT}'\n" + "-"*50)
output = generate_and_print(
    model, 
    tokenizer, 
    device, 
    start_context=PROMPT, 
    max_new_tokens=LENGTH, 
    temperature=TEMP,
    top_k=TOP_K
)

Generating for prompt: 'The history of artificial intelligence'
--------------------------------------------------

[Gen]: The history of artificial intelligence in AI has been a fascinating one, ranging from the earliest days of AI to the present day. For example, the world of the human brain was first created in the 1960s by a group of researchers called the Institute for Advanced Study in the US. However, in the early 1980s, the team that was led by George Allen, a neuroscientist at the University of California, Berkeley, was led by David S. Siegel, an assistant professor of psychiatry at the University of California,

