# SC3000 Lab Assignment 1: Teaching NanoGPT to Do Math

Group members: Siaw Xuan, Ramakrishna Rohan, Law Wei Zhen

---

## Summary

This notebook applies **Direct Preference Optimization (DPO)** to fine-tune a pretrained NanoGPT for mathematical reasoning. DPO trains the language model to favor better responses over weaker ones—without building a separate reward model.

### What is DPO?

**Direct Preference Optimization** is a training approach that replaces the RLHF pipeline. Rather than learning a reward model and using reinforcement learning, DPO updates the model directly from paired examples where one response is preferred over another.

### Why DPO?

* **Simpler pipeline**: Removes the need for a reward model and RL steps
* **More stable**: Avoids the instabilities common in RL-based training
* **Strong results**: Reported to match or exceed PPO-based RLHF on several preference-tuning benchmarks
* **Resource-friendly**: Lower computational overhead

### Data Setup

Training uses **preference pairs**:

* **Positive (preferred)**: Correct solutions with clear reasoning
* **Negative (dispreferred)**: Incorrect answers or weak/illogical reasoning

The objective pushes the model to assign higher likelihood to positive responses and lower likelihood to negative ones.


## The DPO Algorithm: Mathematical Intuition

### Core Idea

In classic RLHF, you typically:

1. Train a reward model on human preferences
2. Use that reward model to score candidate outputs
3. Run RL (e.g., PPO) to improve the policy

**DPO** collapses this into a *single* stage: it updates the policy (language model) *directly* from preference pairs.

### The DPO Loss Function (Explained)

Given a prompt $x$ and a preference pair $(y_{pos}, y_{neg})$ in which $y_{pos}$ is preferred over $y_{neg}$:

$$\mathcal{L}_{DPO} = -\mathbb{E}\left[\log \sigma\left(\beta \log \frac{\pi_\theta(y_{pos}|x)}{\pi_{ref}(y_{pos}|x)} - \beta \log \frac{\pi_\theta(y_{neg}|x)}{\pi_{ref}(y_{neg}|x)}\right)\right]$$

**Term meanings:**

* $\pi_\theta$: the policy being trained
* $\pi_{ref}$: a frozen reference policy, usually a copy of the model as it was before preference training
* $\beta$: a scaling coefficient that controls how strongly preferences shape updates relative to the reference
* $\sigma$: the sigmoid function, which maps the scaled log-ratio difference to a probability

**Intuition:**

* Pushes the model to assign *higher* likelihood to preferred responses
* Penalizes giving *higher* likelihood to dispreferred ones
* $\beta$ adjusts the *strength* of these updates
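
To make the loss concrete, the sketch below evaluates it for a single preference pair in plain Python. The function name `dpo_loss` and the example log-probabilities are illustrative assumptions, not values produced by this notebook.

```python
import math

def dpo_loss(logp_pos, logp_neg, ref_logp_pos, ref_logp_neg, beta=0.5):
    """DPO loss for one preference pair, given sequence log-probabilities."""
    # Log-ratio of policy to reference for each response
    pos_ratio = logp_pos - ref_logp_pos
    neg_ratio = logp_neg - ref_logp_neg
    margin = beta * (pos_ratio - neg_ratio)
    # -log(sigmoid(margin)) = log(1 + exp(-margin)), written in a stable form
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))

# Policy identical to the reference: the margin is 0, so the loss is log(2)
loss_at_start = dpo_loss(-10.0, -12.0, -10.0, -12.0)

# Policy has raised the preferred response and lowered the dispreferred one
loss_improved = dpo_loss(-8.0, -14.0, -10.0, -12.0)

print(f"{loss_at_start:.4f}")  # 0.6931
print(f"{loss_improved:.4f}")
```

The loss starts at $\log 2$ when the policy equals the reference and falls as the policy widens the gap between the preferred and dispreferred responses.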

### Why DPO Works

1. *Direct optimization*: skips reward-model + RL loops
2. *Stable*: avoids RL-induced instability
3. *Principled*: connects back to the RLHF objective
4. *Efficient*: one-stage training

---


## Step 1: Install necessary packages

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
%cd "/content/drive/MyDrive/Colab Notebooks/NanoGPT-Math"

In [None]:
!pip install matplotlib
!pip install torch numpy transformers datasets tiktoken wandb tqdm

In [None]:
import torch
torch.cuda.is_available()

## Hyperparameter Choices and Their Impact

| Parameter         | Value | Impact                                                                                                  |
| ----------------- | ----: | ------------------------------------------------------------------------------------------------------- |
| **beta**          |   0.5 | Preference strength: lower values push harder, higher values are more conservative.                     |
| **learning_rate** |  1e-4 | Small enough for stable fine-tuning; too high risks divergence.                                         |
| **batch_size**    |    64 | Larger batches give smoother gradients but need more VRAM (Step 6 later resets this to 16).             |
| **max_length**    |    64 | Maximum sequence length in tokens; longer sequences keep only their last 64 tokens.                     |
| **temperature**   |   0.8 | Sampling randomness: values below 1 are more deterministic, values above 1 are more diverse.            |


**Key trade-offs:**

* **Higher β** → More conservative, closer to reference policy
* **Lower β** → More aggressive preference learning (risk of overfitting/drift)
* **Larger batch_size** → Smoother gradients, higher VRAM demand
* **Higher learning_rate** → Faster convergence, greater instability risk
* **Lower temperature / smaller top_k** → Safer, more deterministic generations (possible blandness)
* **More epochs / larger max_new_tokens** → Better coverage and completeness, higher compute and overfit risk
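
The temperature trade-off above can be seen by rescaling logits before the softmax. This is a minimal sketch in plain Python; the helper name `softmax_with_temperature` and the three logits are made up for illustration.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Convert logits to probabilities after dividing by the temperature."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)                           # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]                      # made-up scores for three tokens
for t in (0.5, 0.8, 1.5):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

Lower temperatures concentrate probability on the highest-scoring token, while higher temperatures flatten the distribution.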


## Step 2: Package imports and configuration

### Configuration Parameters

* **`beta`**: DPO scaling temperature—tunes how strongly the policy is pushed away from the reference model
* **`base_lr`**: Optimizer step size (learning rate)
* **`epochs`**: Upper bound on full passes over the training data
* **`batch_size`**: Count of preference pairs processed per optimization step
* **`max_length`**: Cap on input sequence length
* **`temperature`**: Generation randomness control (smaller values → more deterministic outputs)
* **`top_k`**: Limits sampling to the top *k* most probable tokens
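
As a rough illustration of `top_k`, the sketch below keeps the *k* largest logits and masks the rest, mirroring the thresholding idea used during sampling. The helper name `top_k_filter` and the sample logits are assumptions for this example only.

```python
def top_k_filter(logits, k):
    """Keep the k largest logits and set the rest to -inf (ties at the cutoff are kept)."""
    if k <= 0 or k >= len(logits):
        return list(logits)                   # nothing to filter
    threshold = sorted(logits, reverse=True)[k - 1]
    return [x if x >= threshold else float("-inf") for x in logits]

filtered = top_k_filter([3.0, 0.5, 2.0, -1.0], k=2)
print(filtered)  # [3.0, -inf, 2.0, -inf]
```

Note that when *k* is at least the vocabulary size, the filter leaves the distribution unchanged, so a character-level vocabulary smaller than 200 would make `top_k = 200` a no-op.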


In [None]:
import sys
import os
import random
import pickle
import time
import json

sys.path.append(os.path.abspath(".."))      # also search the parent directory for modules
os.environ["CUDA_VISIBLE_DEVICES"] = "0"    # expose only GPU 0 to this process

import torch
import torch.nn as nn
import torch.nn.functional as F
import matplotlib.pyplot as plt
from tqdm import tqdm
from model import GPT, GPTConfig

# DPO hyperparameters
beta = 0.5
base_lr = 1e-4
epochs = 20
batch_size = 64

# Model parameters
max_length = 64
num_samples = 1
max_new_tokens = 200
temperature = 0.8
top_k = 200
device = 'cuda' if torch.cuda.is_available() else 'cpu'


# tokenizer functions
with open("sft/meta.pkl", "rb") as f:
    meta = pickle.load(f)
stoi, itos = meta["stoi"], meta["itos"]

PAD_ID = len(stoi)          # add new padding token id
stoi['<PAD>'] = PAD_ID
itos[PAD_ID] = '<PAD>'

def encode(s): return [stoi[c] for c in s]
def decode(l): return ''.join([itos[i] for i in l])

## Step 3: Define Helper Functions

We’ll implement three core utilities to support DPO training:

### 1. `compute_logprob(input_ids)`

Calculates the length-normalized log-likelihood (the mean per-token log-probability over non-padding positions) of each sequence in a batch under the current model. This value feeds directly into the training objective.

**Intuition**: It quantifies how confident the model is in a sequence. Larger log-probability ⇒ the model finds the sequence more plausible.
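
The averaging step can be sketched without a model. The example below assumes per-token log-probabilities are already available; `mean_logprob`, the stand-in padding id, and the numbers are hypothetical.

```python
PAD = -1  # stand-in padding id for this example only

def mean_logprob(token_logprobs, targets, pad_id=PAD):
    """Average per-token log-probabilities over non-padding targets."""
    kept = [lp for lp, t in zip(token_logprobs, targets) if t != pad_id]
    return sum(kept) / max(len(kept), 1)

# Three real tokens followed by two padding positions
token_logprobs = [-0.2, -1.0, -0.3, -5.0, -5.0]
targets = [7, 3, 9, PAD, PAD]

score = mean_logprob(token_logprobs, targets)
print(round(score, 4))  # -0.5
```

Padding positions are excluded from both the sum and the count, so padded and unpadded versions of the same sequence get the same score.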

### 2. `pad_or_truncate(seq, max_length)`

Normalizes sequence length: sequences longer than `max_length` keep only their last `max_length` tokens, and shorter ones are right-padded with `PAD_ID` so tensors align in a batch.
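
A quick check of both branches, using a stand-in padding value (the notebook itself sets `PAD_ID = len(stoi)`):

```python
PAD_ID = 99  # stand-in value for this example only

def pad_or_truncate(seq, max_length):
    # Keep the last max_length tokens, or right-pad with PAD_ID
    return seq[-max_length:] if len(seq) > max_length else seq + [PAD_ID] * (max_length - len(seq))

short = pad_or_truncate([1, 2, 3], 5)
long_ = pad_or_truncate([1, 2, 3, 4, 5, 6, 7], 5)
print(short)  # [1, 2, 3, 99, 99]
print(long_)  # [3, 4, 5, 6, 7]
```

Truncation drops tokens from the front, which keeps the end of the answer but can cut off the start of a long prompt.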

### 3. `get_batches(lines, batch_size)`

Builds mini-batches of preference pairs for training. Each batch bundles matched **(positive, negative)** examples, where the positive is the correct reasoning/answer and the negative is the incorrect one. A trailing batch smaller than `batch_size` is skipped.
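
The batching behaviour can be illustrated on toy data. This sketch uses a hypothetical helper, `iter_full_batches`, which shuffles a copy of the list (the notebook's version shuffles in place) and yields only complete batches.

```python
import random

def iter_full_batches(items, batch_size, seed=0):
    """Shuffle a copy of items and yield only complete batches."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    for i in range(0, len(shuffled) - batch_size + 1, batch_size):
        yield shuffled[i:i + batch_size]

pairs = [{"positive": f"p{i}", "negative": f"n{i}"} for i in range(10)]
batches = list(iter_full_batches(pairs, batch_size=4))
print(len(batches))  # 2 (the last 2 pairs are dropped)
```

Because the data is reshuffled every epoch, the pairs dropped from the incomplete batch differ from one epoch to the next.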


In [None]:
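def compute_logprob(input_ids):
    """Return the mean per-token log-probability of each sequence, ignoring padding."""
    inputs = input_ids[:, :-1]
    targets = input_ids[:, 1:]
    # PAD_ID was appended after the vocabulary, so it may fall outside the model's
    # embedding table; swap it for a valid id on the input side. Padding sits at the
    # end of each sequence and its targets are masked, so the result is unaffected.
    inputs = inputs.masked_fill(inputs == PAD_ID, 0)
    logits, _ = gpt(inputs, full_seq=True)
    B, T, V = logits.size()
    logits_flat = logits.reshape(-1, V)
    targets_flat = targets.reshape(-1)
    # Per-token negative log-likelihood; positions whose target is PAD_ID contribute 0
    loss = F.cross_entropy(logits_flat, targets_flat, ignore_index=PAD_ID, reduction='none')
    loss = loss.reshape(B, T)
    # Mask on PAD_ID (not 0, which is a real token id) so the mean covers real tokens only
    attention_mask = (targets != PAD_ID).float()
    loss = (loss * attention_mask).sum(dim=1) / attention_mask.sum(dim=1).clamp(min=1)
    return -loss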

def pad_or_truncate(seq, max_length):
    return seq[-max_length:] if len(seq) > max_length else seq + [PAD_ID] * (max_length - len(seq))

def get_batches(lines, batch_size):
    """Yield (negative, positive) tensor batches; shuffles `lines` in place on each call."""
    random.shuffle(lines)
    for i in range(0, len(lines), batch_size):
        batch = lines[i:i+batch_size]
        if len(batch) < batch_size:
            continue  # skip the incomplete final batch
        neg_inputs = [pad_or_truncate(encode(p['negative'] + '\n\n\n\n'), max_length) for p in batch]
        pos_inputs = [pad_or_truncate(encode(p['positive'] + '\n\n\n\n'), max_length) for p in batch]
        neg_tensor = torch.tensor(neg_inputs, dtype=torch.long, device=device)
        pos_tensor = torch.tensor(pos_inputs, dtype=torch.long, device=device)
        yield neg_tensor, pos_tensor

## Step 4: Load the pretrained NanoGPT model
Load the pretrained NanoGPT checkpoint (`sft/gpt.pt`) to use as the initialization for DPO fine-tuning.

In [None]:
ckpt = torch.load("sft/gpt.pt", map_location=device)
gptconf = GPTConfig(**ckpt['model_args'])
gpt = GPT(gptconf)
state_dict = ckpt['model']
unwanted_prefix = '_orig_mod.'
for k in list(state_dict.keys()):
    if k.startswith(unwanted_prefix):
        state_dict[k[len(unwanted_prefix):]] = state_dict.pop(k)
gpt.load_state_dict(state_dict)
gpt.to(device).train()

## Step 5: Load Data

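Generate and load the DPO dataset of **preference pairs**. Each record includes:

* **`positive`**: the prompt followed by the correct answer and a short justification, produced by a rule-based solver
* **`negative`**: the prompt followed by a reply sampled from the NanoGPT checkpoint (or the fallback `Sorry, I do not know!`), treated as the weaker response

The corpus contains **100,000 pairs**, giving ample supervision for the model to learn preference alignment. Sampled replies are not checked for correctness, so some negatives may in fact be right.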
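
For orientation, a single record might look like the sketch below. The prompt is borrowed from the Step 8 test prompts and the strings are hand-written for illustration; actual records depend on what the generator and the model produce.

```python
import json

# One hand-written preference pair in the same two-key layout
record = {
    "negative": "x*11=44,x=? Sorry, I do not know!",
    "positive": "x*11=44,x=? The answer is 4 because 44/11 equals 4.",
}

# Both responses start with the same prompt, so they differ only in the reply
prompt = "x*11=44,x=?"
shared_prefix = record["positive"].startswith(prompt) and record["negative"].startswith(prompt)

print(json.dumps([record], indent=2))
```

Keeping the prompt identical in both strings means any difference in sequence likelihood comes from the reply itself.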



In [None]:
import torch, pickle, os
from model import GPT, GPTConfig

device = 'cuda' if torch.cuda.is_available() else 'cpu'

if 'gpt' not in globals() or 'encode' not in globals() or 'decode' not in globals():
    with open("sft/meta.pkl", "rb") as f:
        meta = pickle.load(f)
    stoi, itos = meta["stoi"], meta["itos"]

    def encode(s): return [stoi[c] for c in s]
    def decode(l): return ''.join([itos[i] for i in l])

    ckpt = torch.load("sft/gpt.pt", map_location=device)
    gptconf = GPTConfig(**ckpt['model_args'])
    gpt = GPT(gptconf)
    sd = ckpt['model']
    unwanted_prefix = '_orig_mod.'
    for k in list(sd.keys()):
        if k.startswith(unwanted_prefix):
            sd[k[len(unwanted_prefix):]] = sd.pop(k)
    gpt.load_state_dict(sd)
    gpt.to(device).eval()

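# Reuse earlier settings when present; otherwise fall back to defaults
max_length     = globals().get('max_length', 64)
temperature    = globals().get('temperature', 0.8)
top_k          = globals().get('top_k', 200)
max_new_tokens = min(globals().get('max_new_tokens', 60), 120)  # capped at 120 for data generation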


In [None]:
import torch

@torch.no_grad()
def sample_from_gpt(prompt: str,
                    max_new: int = max_new_tokens,
                    temp: float = temperature,
                    topk: int = top_k) -> str:
    gpt.eval()
    ids = encode(prompt)
    x = torch.tensor([ids[-max_length:]], dtype=torch.long, device=device)
    generated = []

    for _ in range(max_new):
        logits, _ = gpt(x)
        logits = logits[:, -1, :]
        if temp and temp > 0:
            logits = logits / temp
        if topk and 0 < topk < logits.size(-1):
            v, _ = torch.topk(logits, topk)
            thresh = v[:, -1].unsqueeze(-1)
            logits[logits < thresh] = -float('inf')
        probs = torch.softmax(logits, dim=-1)
        next_id = torch.multinomial(probs, num_samples=1)
        token_id = int(next_id.item())
        generated.append(token_id)

        ch = itos[token_id]
        if ch in ['\n', '\r']:
            break

        x = torch.cat([x, next_id], dim=1)
        if x.size(1) > max_length:
            x = x[:, -max_length:]

    return ''.join(itos[i] for i in generated).strip()


In [None]:
import random

def _int_div_triplet():
    b = random.randint(1, 12)
    x = random.randint(1, 12)
    a = b * x
    return a, b, x

def generate_problem():
    mode = random.choice([
        "arith_add","arith_sub","arith_mul","arith_div",
        "alg_x_mul","alg_mul_x","alg_x_add","alg_add_x",
        "alg_x_sub","alg_sub_x","alg_div_x","alg_x_div"
    ])

    # Prompt formats follow the Step 8 test prompts, e.g. "17+19=?" and "72-x=34,x=?"
    if mode == "arith_add":
        a,b = random.randint(-50,99), random.randint(-50,99)
        ans = a + b; prompt = f"{a}+{b}=?"
        expl = f"{a}+{b} equals {ans}."
    elif mode == "arith_sub":
        a,b = random.randint(-50,99), random.randint(-50,99)
        ans = a - b; prompt = f"{a}-{b}=?"
        expl = f"{a}-{b} equals {ans}."
    elif mode == "arith_mul":
        a,b = random.randint(-12,12), random.randint(-12,12)
        ans = a * b; prompt = f"{a}*{b}=?"
        expl = f"{a}*{b} equals {ans}."
    elif mode == "arith_div":
        a,b,x = _int_div_triplet()
        ans = a // b; prompt = f"{a}/{b}=?"
        expl = f"{a}/{b} equals {ans}."
    elif mode == "alg_x_mul":            # x*b=a
        a,b,x = _int_div_triplet()
        ans = x; prompt = f"x*{b}={a},x=?"
        expl = f"{a}/{b} equals {ans}."
    elif mode == "alg_mul_x":            # b*x=a
        a,b,x = _int_div_triplet()
        ans = x; prompt = f"{b}*x={a},x=?"
        expl = f"{a}/{b} equals {ans}."
    elif mode == "alg_x_add":            # x+b=a
        b = random.randint(-50,99); x = random.randint(-50,99); a = x + b
        ans = x; prompt = f"x+{b}={a},x=?"
        expl = f"{a}-{b} equals {ans}."
    elif mode == "alg_add_x":            # b+x=a
        b = random.randint(-50,99); x = random.randint(-50,99); a = b + x
        ans = x; prompt = f"{b}+x={a},x=?"
        expl = f"{a}-{b} equals {ans}."
    elif mode == "alg_x_sub":            # x-b=a => x=a+b
        b = random.randint(-50,99); x = random.randint(-50,99); a = x - b
        ans = x; prompt = f"x-{b}={a},x=?"
        expl = f"{a}+{b} equals {ans}."
    elif mode == "alg_sub_x":            # b-x=a => x=b-a
        a = random.randint(-50,99); b = random.randint(-50,99); x = b - a
        ans = x; prompt = f"{b}-x={a},x=?"
        expl = f"{b}-{a} equals {ans}."
    elif mode == "alg_div_x":            # a/x=b => x=a/b
        a,b,x = _int_div_triplet()
        ans = x; prompt = f"{a}/x={b},x=?"
        expl = f"{a}/{b} equals {ans}."
    else:                                # x/b=a => x=a*b
        b = random.randint(1,12); a = random.randint(-12,12); x = a*b
        ans = x; prompt = f"x/{b}={a},x=?"
        expl = f"{a}*{b} equals {ans}."

    return prompt, ans, expl

def build_positive(prompt: str, ans: int, expl: str) -> str:
    return f"{prompt} The answer is {ans} because {expl}"


In [None]:
def tidy_model_reply(text: str) -> str:
    """
    Keep first line, clamp length, fallback to a safe string if junk/empty.
    """
    t = text.replace('\r','\n').split('\n')[0].strip()
    if not t:
        return "Sorry, I do not know!"
    if len(t) > 160:
        t = t[:160].rstrip()
    bad = ["<|", "|>", "[INST]", "[/INST]"]
    if any(b in t for b in bad):
        return "Sorry, I do not know!"
    return t


In [None]:
import os, json
from tqdm import tqdm

N_SAMPLES = 100_000
OUT_PATH  = "dpo/pos_neg_pairs.json"
os.makedirs(os.path.dirname(OUT_PATH), exist_ok=True)

pairs = []
for _ in tqdm(range(N_SAMPLES)):
    prompt, ans, expl = generate_problem()

    # Negative via pretrained NanoGPT
    neg_cont = sample_from_gpt(prompt, max_new=max_new_tokens)
    neg_text = tidy_model_reply(neg_cont)
    negative = f"{prompt} {neg_text}"

    # Positive via solver (human-preference)
    positive = build_positive(prompt, ans, expl)

    pairs.append({"negative": negative, "positive": positive})

with open(OUT_PATH, "w", encoding="utf-8") as f:
    json.dump(pairs, f, ensure_ascii=True, indent=2)

print(f"Generated {len(pairs)} pairs → {OUT_PATH}")


In [None]:
import json

# Path to the data file
# We use "dpo/pos_neg_pairs.json" because working directory is "NanoGPT-Math"
data_path = "dpo/pos_neg_pairs.json"

# Open and load the JSON file
with open(data_path, 'r', encoding='utf-8') as f:
    lines = json.load(f)

print(f"Successfully loaded {len(lines)} data pairs from {data_path}")
print("First pair:", lines[0])

## Step 6: Configure the Optimizer and LR Schedule

### AdamW Optimizer

Use **AdamW**, the go-to optimizer for Transformers. It adds **decoupled weight decay**, which acts as regularization to curb overfitting while keeping Adam’s adaptive updates.
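
To show what "decoupled" means, here is a scalar sketch of one AdamW update. It assumes the common formulation in which the parameter is first shrunk by `lr * weight_decay` and then receives a bias-corrected Adam step; the function name `adamw_step` is hypothetical, and the defaults mirror the optimizer settings used later in this step.

```python
import math

def adamw_step(p, g, m, v, t, lr=1e-4, wd=0.10, b1=0.9, b2=0.95, eps=1e-8):
    """One AdamW update for a scalar parameter p with gradient g at step t (t >= 1)."""
    p = p * (1 - lr * wd)                 # decoupled weight decay: shrink p directly
    m = b1 * m + (1 - b1) * g             # first-moment (mean) estimate
    v = b2 * v + (1 - b2) * g * g         # second-moment (uncentered variance) estimate
    m_hat = m / (1 - b1 ** t)             # bias correction for the zero-initialized moments
    v_hat = v / (1 - b2 ** t)
    p = p - lr * m_hat / (math.sqrt(v_hat) + eps)
    return p, m, v

# With a zero gradient, only the decay term acts on the parameter
p_decay_only, _, _ = adamw_step(p=1.0, g=0.0, m=0.0, v=0.0, t=1)

# With a nonzero gradient, the first Adam step has magnitude close to lr
p_after, m1, v1 = adamw_step(p=1.0, g=0.5, m=0.0, v=0.0, t=1)
print(p_decay_only, p_after)
```

Because the decay is applied to the parameter rather than added to the gradient, it is not rescaled by Adam's adaptive denominator.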

### Cosine Annealing Schedule

Apply a **cosine decay** to the learning rate so it:

* Begins with relatively larger steps for quick early gains
* Tapers to smaller steps for fine-grained refinement
* Improves convergence by reducing the chance of overshooting the optimum
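
The shape of the schedule follows from the closed-form cosine annealing formula. The sketch below evaluates it with this notebook's `base_lr = 1e-4` and 20 epochs; the helper name `cosine_lr` is an assumption for illustration.

```python
import math

def cosine_lr(step, base_lr=1e-4, t_max=20, eta_min=0.0):
    """Closed-form cosine annealing from base_lr down to eta_min over t_max steps."""
    return eta_min + 0.5 * (base_lr - eta_min) * (1 + math.cos(math.pi * step / t_max))

schedule = [cosine_lr(epoch) for epoch in range(21)]
print(f"{schedule[0]:.2e} {schedule[10]:.2e} {schedule[20]:.2e}")
# 1.00e-04 5.00e-05 0.00e+00
```

The rate drops slowly at first, fastest around the midpoint, and slowly again near the end, reaching `eta_min` after `t_max` steps.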


In [None]:
# Scheduler helper from the 'transformers' library (installed in Step 1)
from torch.optim import AdamW
from transformers import get_linear_schedule_with_warmup

# Hyperparameters for this stage (batch_size is lowered from 64 to 16 here)
epochs = 20
batch_size = 16
base_lr = 1e-4

# Number of optimizer steps (depends on 'lines' from Step 5)
num_batches = len(lines) // batch_size
total_steps = num_batches * epochs

# AdamW with linear warmup/decay, a schedule meant to be stepped once per batch.
# Note: the next cell replaces both objects with AdamW + CosineAnnealingLR.
optimizer = AdamW(gpt.parameters(), lr=base_lr)
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=int(0.15 * total_steps),
    num_training_steps=total_steps
)

print("Optimizer and scheduler created.")
print(f"Total training steps: {total_steps}")

In [None]:
import torch.optim as optim
from torch.optim.lr_scheduler import CosineAnnealingLR

# Assumes 'gpt', 'base_lr', and 'epochs' are already defined
optimizer = optim.AdamW(gpt.parameters(), lr=base_lr, weight_decay=0.10, betas=(0.9, 0.95), eps=1e-8)

# T_max=epochs means the learning rate will gradually decrease over the total number of epochs
scheduler = CosineAnnealingLR(optimizer, T_max=epochs, eta_min=0)

print("AdamW optimizer and CosineAnnealingLR scheduler created successfully.")


## Step 7: Begin training (**students are required to complete this part!**)

In [None]:
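import torch
# Report the GPU that PyTorch will use (avoids loading TensorFlow just for this check)
print(torch.cuda.get_device_name(0) if torch.cuda.is_available() else "No GPU detected")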


In [None]:
# choose a safe fallback char that is definitely in your vocab
SAFE_CHAR = ' ' if ' ' in stoi else '\n'

def encode(s):
    fallback_id = stoi[SAFE_CHAR]
    return [stoi.get(c, fallback_id) for c in s]


In [None]:
import torch.nn.functional as F
from torch.nn.utils import clip_grad_norm_
from tqdm import tqdm

# if beta wasn't set earlier, provide a sensible default
beta = 0.1 if 'beta' not in globals() else beta

total_steps = len(lines) // batch_size
for epoch in range(epochs):
    pbar = tqdm(get_batches(lines, batch_size), total=total_steps, desc=f"Epoch {epoch+1}/{epochs}")
    for step, (neg_tensor, pos_tensor) in enumerate(pbar, start=1):
        ###########################################################
        # Completed training code
        ###########################################################
        # 1) move to device
        neg_tensor = neg_tensor.to(device)
        pos_tensor = pos_tensor.to(device)

        # 2) zero grad
        optimizer.zero_grad()

        # 3) forward: per-sample log-probabilities
        #    (compute_logprob should be defined in your Step 3)
        neg_logprob = compute_logprob(neg_tensor)   # shape [B]
        pos_logprob = compute_logprob(pos_tensor)   # shape [B]

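        # 4) DPO-style loss (reference-free simplification: there is no pi_ref term, and the
        #    log-prob margin is divided by beta, so a smaller beta sharpens the push)
        loss = -F.logsigmoid((pos_logprob - neg_logprob) / beta).mean()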

        # 5) backward
        loss.backward()
        clip_grad_norm_(gpt.parameters(), 1.0)

        # 6) optimizer step
        optimizer.step()

        # progress
        lr_now = optimizer.param_groups[0]['lr']
        pbar.set_postfix(loss=f"{loss.item():.4f}", lr=f"{lr_now:.2e}")

        # keep the progress bar aligned with total_steps
        if step >= total_steps:
            break
    ###########################################################

    # step the scheduler once per epoch (matches CosineAnnealingLR in Step 6)
    if 'scheduler' in globals():
        scheduler.step()

    # save checkpoint, overwriting the same file each epoch (guard model_args in case ckpt isn't in scope)
    ckpt_path = "./dpo.pt"
    model_args = ckpt['model_args'] if ('ckpt' in globals() and isinstance(ckpt, dict) and 'model_args' in ckpt) else None
    torch.save({
        "model_state_dict": gpt.state_dict(),
        "model_args": model_args,
    }, ckpt_path)
    print(f"Saved checkpoint to {ckpt_path} | LR: {optimizer.param_groups[0]['lr']:.6f}")

print("Training finished.")

## Step 8: Model Evaluation and Testing

Test the model on a small set of example problems covering different operation types. This quick test helps verify that the model is working correctly.
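
Beyond reading the outputs by eye, the replies can be scored automatically. The sketch below assumes replies follow the `The answer is N because ...` pattern used for the positive examples; `extract_answer` and the sample outputs are hypothetical, not results from this model.

```python
import re

def extract_answer(text):
    """Return the integer after 'The answer is', or None if the pattern is absent."""
    match = re.search(r"The answer is (-?\d+)", text)
    return int(match.group(1)) if match else None

# Made-up model outputs paired with the expected integer answers
samples = [
    ("The answer is 36 because 17+19 equals 36.", 36),
    ("The answer is 50 because 3*17 equals 50.", 51),
    ("Sorry, I do not know!", 18),
]
correct = sum(extract_answer(out) == expected for out, expected in samples)
print(f"accuracy: {correct}/{len(samples)}")  # accuracy: 1/3
```

Replies that do not match the pattern count as wrong, which keeps the metric conservative.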

In [None]:
import os, torch
import torch.nn.functional as F

# 1) Load the fine-tuned model
ckpt_path = "./dpo.pt"
assert os.path.exists(ckpt_path), f"Checkpoint not found at {ckpt_path}. Run Step 7 first."

checkpoint = torch.load(ckpt_path, map_location=device)
gptconf = GPTConfig(**checkpoint['model_args'])
gpt = GPT(gptconf).to(device)

# state dict (supports either key name)
state_dict = checkpoint.get('model', checkpoint.get('model_state_dict'))
unwanted_prefix = '_orig_mod.'
for k in list(state_dict.keys()):
    if k.startswith(unwanted_prefix):
        state_dict[k[len(unwanted_prefix):]] = state_dict.pop(k)

gpt.load_state_dict(state_dict, strict=True)

# 2) Test
gpt.eval()
test_set = ["17+19=?", "3*17=?", "72/4=?", "72-x=34,x=?", "x*11=44,x=?", "3*17=?", "72/4=?", "72-x=34,x=?"]

with torch.no_grad():
    for prompt in test_set:
        prompt_ids = encode(prompt)
        x_ids = prompt_ids[-max_length:]
        x = torch.tensor([x_ids], dtype=torch.long, device=device)

        # generate continuation
        y, _ = gpt.generate(
            x,
            max_new_tokens=max_new_tokens,
            temperature=temperature,
            top_k=top_k
        )

        # decode only the newly generated tokens (the prompt is printed separately)
        out_ids = y[0].tolist()
        new_tokens_only = out_ids[len(x_ids):]
        out = decode(new_tokens_only)

        print(f"{prompt} → {out.strip()}")

