Task 1: Building positive-negative data pairs.

Since the pretrained NanoGPT only knows QA-type responses, we need to create a dataset that contrasts incorrect model-style answers with correct, human-preferred answers to math problems.
These pairs will later be used for DPO fine-tuning.

Approach:
1. Load existing QA pairs (if available)
2. Generate additional arithmetic and algebra problems automatically
3. For each problem, create:
   - a negative response (uncertain or incorrect)
   - a positive response (human-correct, with explanation)
4. Save all pairs to the provided .json file.

In [1]:
import json
import os
import random

import pandas as pd

In [2]:
# Set random seed for reproducibility
random.seed(42)

In [3]:
# Define file paths
save_path = "C:/Users/syedr/OneDrive/Documents/NTU Semester Resources/NTU Y3S1/SC3000 - Artificial Intelligence/Lab 1/NanoGPT-Math-main/dpo/pos_neg_pairs.json"
os.makedirs(os.path.dirname(save_path), exist_ok=True)  # create the folder that will hold the file

In [4]:
# Load existing base data if present
base = []
if os.path.exists(save_path):
    with open(save_path, "r", encoding="utf-8") as f:
        base = json.load(f)

print(f"Loaded {len(base)} existing samples from {save_path}")

Loaded 40000 existing samples from C:/Users/syedr/OneDrive/Documents/NTU Semester Resources/NTU Y3S1/SC3000 - Artificial Intelligence/Lab 1/NanoGPT-Math-main/dpo/pos_neg_pairs.json


In [7]:
# Helper functions to create problems
def make_add(a, b):
    q = f"{a}+{b}=?"
    ans = a + b
    pos = f"{q} The answer is {ans} because {a}+{b} equals {ans}."
    neg = f"{q} Hmm, I’m not sure!"
    return {"negative": neg, "positive": pos}

def make_sub(a, b):
    q = f"{a}-{b}=?"
    ans = a - b
    pos = f"{q} The answer is {ans} because {a}-{b} equals {ans}."
    neg = f"{q} Sorry, I don’t know the answer!"
    return {"negative": neg, "positive": pos}

def make_mul(a, b):
    q = f"{a}*{b}=?"
    ans = a * b
    pos = f"{q} The answer is {ans} because {a}×{b} equals {ans}."
    neg = f"{q} Sorry, I don’t know!"
    return {"negative": neg, "positive": pos}

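def make_div(a, b):
    if b == 0:
        b = 1
    a = a - (a % b)  # round a down to a multiple of b so the quotient is exact
    q = f"{a}/{b}=?"
    ans = a // b
    # Pick a wrong single-digit guess so the negative never states the true answer
    guess = random.choice([d for d in range(10) if d != ans])
    pos = f"{q} The answer is {ans} because {a}/{b} equals {ans}."
    neg = f"{q} I think it’s {guess}, but not sure."
    return {"negative": neg, "positive": pos}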

def make_linear(a, b):
    # ax + b = c  -> x = (c - b)/a
    if a == 0:
        a = 1
    x = random.randint(-20, 50)
    c = a * x + b
    q = f"{a}*x+{b}={c}, x=?"
    ans = x
    pos = f"{q} The answer is {ans} because ({c}-{b})/{a} = {ans}."
    neg = f"{q} Sorry, I’m not sure how to solve that."
    return {"negative": neg, "positive": pos}

def make_two_step():
    # Example: (x + a)*b = c
    a = random.randint(1, 10)
    b = random.randint(1, 10)
    x = random.randint(0, 20)
    c = (x + a) * b
    q = f"(x+{a})*{b}={c}, x=?"
    # Pick a wrong guess so the negative never states the true value of x
    guess = random.choice([g for g in range(11) if g != x])
    pos = f"{q} The answer is {x} because x+{a}={c}/{b}={c//b}, so x={x}."
    neg = f"{q} Hmm… maybe {guess}?"
    return {"negative": neg, "positive": pos}

def make_word():
    # Word-style simple arithmetic
    a = random.randint(1, 10)
    b = random.randint(1, 10)
    ans = a + b
    q = f"Tom has {a} apples and buys {b} more. How many apples does he have in total?"
    pos = f"{q} He has {ans} apples because {a}+{b}={ans}."
    neg = f"{q} Sorry, I don’t know!"
    return {"negative": neg, "positive": pos}
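
DPO only learns something useful when the two responses in a pair genuinely differ, so a quick consistency check on the generated pairs can be worthwhile. The sketch below is a hypothetical helper (`check_pair` is not part of the original notebook). It assumes each pair is a dict whose `negative` and `positive` strings start with the same question, and that positives use the "The answer is N" or "He has N" phrasing produced by the helpers above.

```python
import os
import re

ANSWER_RE = re.compile(r"(?:The answer is|He has) (-?\d+)")

def check_pair(pair):
    """Return a list of problems found in one pair (empty list means OK)."""
    pos, neg = pair["positive"], pair["negative"]
    problems = []
    # The shared prefix of both strings should be the question itself.
    question = os.path.commonprefix([pos, neg])
    if "?" not in question:
        problems.append("responses do not share a question prefix")
    match = ANSWER_RE.search(pos[len(question):])
    if match is None:
        problems.append("positive response states no answer")
    else:
        answer = match.group(1)
        # Any integer mentioned in the negative response counts as a guess.
        guesses = re.findall(r"-?\d+", neg[len(question):])
        if answer in guesses:
            problems.append("negative response contains the correct answer")
    return problems

good = {"negative": "2*6=? Sorry, I don't know!",
        "positive": "2*6=? The answer is 12 because 2×6 equals 12."}
bad = {"negative": "8/2=? I think it's 4, but not sure.",
       "positive": "8/2=? The answer is 4 because 8/2 equals 4."}
print(check_pair(good))
print(check_pair(bad))
```

The second pair is flagged because a random guess happened to equal the true quotient, which would give DPO a contradictory signal.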

In [8]:
# Generate data
target_total = 40000
need = max(0, target_total - len(base))

pairs = []
ops = ["add", "sub", "mul", "div", "linear", "two_step", "word"]
weights = [0.2, 0.15, 0.2, 0.1, 0.2, 0.1, 0.05]

for _ in range(need):
    op = random.choices(ops, weights=weights)[0]
    if op == "add":
        pairs.append(make_add(random.randint(0, 999), random.randint(0, 999)))
    elif op == "sub":
        pairs.append(make_sub(random.randint(0, 999), random.randint(0, 999)))
    elif op == "mul":
        pairs.append(make_mul(random.randint(0, 99), random.randint(0, 20)))
    elif op == "div":
        pairs.append(make_div(random.randint(1, 999), random.randint(1, 20)))
    elif op == "linear":
        pairs.append(make_linear(random.randint(-9, 9) or 1, random.randint(-50, 100)))
    elif op == "two_step":
        pairs.append(make_two_step())
    elif op == "word":
        pairs.append(make_word())
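
To see what the `weights` list means in practice, the following sketch draws operation names the same way the loop does and compares the sampled counts with the expected ones. It uses its own `random.Random` instance (an assumption made here so that the notebook's seeded global generator is left untouched) and a sample size of 10,000 chosen only for illustration.

```python
import random
from collections import Counter

ops = ["add", "sub", "mul", "div", "linear", "two_step", "word"]
weights = [0.2, 0.15, 0.2, 0.1, 0.2, 0.1, 0.05]

rng = random.Random(0)  # separate generator; does not disturb random.seed(42)
n = 10_000
counts = Counter(rng.choices(ops, weights=weights, k=n))

for op, w in zip(ops, weights):
    print(f"{op:>8}: expected {w * n:6.0f}, sampled {counts[op]:5d}")
```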

In [9]:
# Combine with existing, shuffle
combined = base + pairs
random.shuffle(combined)

In [10]:
# Save to file
with open(save_path, "w", encoding="utf-8") as f:
    json.dump(combined, f, indent=2, ensure_ascii=False)

print(f"Saved expanded dataset with {len(combined)} examples to {save_path}")

Saved expanded dataset with 40000 examples to C:/Users/syedr/OneDrive/Documents/NTU Semester Resources/NTU Y3S1/SC3000 - Artificial Intelligence/Lab 1/NanoGPT-Math-main/dpo/pos_neg_pairs.json


In [11]:
# Preview a few examples
df_preview = pd.DataFrame(combined[:10])
display(df_preview)

| | negative | positive |
|---|---|---|
| 0 | 984/3=? I think it’s 0, but not sure. | 984/3=? The answer is 328 because 984/3 equals... |
| 1 | 2*6=? Sorry, I don’t know! | 2*6=? The answer is 12 because 2×6 equals 12. |
| 2 | 210+36=? Hmm, I’m not sure! | 210+36=? The answer is 246 because 210+36 equa... |
| 3 | 4*x+75=203, x=? Sorry, I’m not sure how to sol... | 4*x+75=203, x=? The answer is 32 because (203-... |
| 4 | 867+516=? Hmm, I’m not sure! | 867+516=? The answer is 1383 because 867+516 e... |
| 5 | 8*x+-7=-23, x=? Sorry, I’m not sure how to sol... | 8*x+-7=-23, x=? The answer is -2 because (-23-... |
| 6 | 243-76=? Sorry, I don’t know the answer! | 243-76=? The answer is 167 because 243-76 equa... |
| 7 | 455+985=? Hmm, I’m not sure! | 455+985=? The answer is 1440 because 455+985 e... |
| 8 | 343/7=? I think it’s 1, but not sure. | 343/7=? The answer is 49 because 343/7 equals 49. |
| 9 | 79*7=? Sorry, I don’t know! | 79*7=? The answer is 553 because 79×7 equals 553. |


I generated a total of 40,000 positive–negative QA pairs for DPO training.

- **Positive samples**: contain the correct mathematical reasoning and final answer.
- **Negative samples**: represent model-like uncertain or incorrect responses.
- **Problem diversity**: includes arithmetic (+, −, ×, ÷), single-variable linear equations,
  two-step algebra problems, and word-style problems.
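- **Coverage**: the generator samples problem types with fixed weights (20% each for addition, multiplication, and linear equations; 15% for subtraction; 10% each for division and two-step problems; 5% for word problems), and every positive sample states the answer together with a one-line justification.

This dataset is now ready for use in *Task 2* to fine-tune the NanoGPT model using DPO.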
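
Because operands are drawn at random from small ranges (the word problems, for instance, have only 10 × 10 operand combinations), identical pairs can occur more than once. The sketch below shows one way to count and drop exact duplicates. `dedupe_pairs` is a hypothetical helper, not part of the original notebook, and it assumes the dataset is a list of dicts with `negative` and `positive` keys.

```python
def dedupe_pairs(pairs):
    """Drop exact duplicate pairs, keeping the first occurrence of each."""
    seen = set()
    unique = []
    for pair in pairs:
        key = (pair["negative"], pair["positive"])
        if key not in seen:
            seen.add(key)
            unique.append(pair)
    return unique

sample = [
    {"negative": "1+1=? Hmm, I'm not sure!",
     "positive": "1+1=? The answer is 2 because 1+1 equals 2."},
    {"negative": "2*3=? Sorry, I don't know!",
     "positive": "2*3=? The answer is 6 because 2×3 equals 6."},
    {"negative": "1+1=? Hmm, I'm not sure!",
     "positive": "1+1=? The answer is 2 because 1+1 equals 2."},
]
unique = dedupe_pairs(sample)
print(f"{len(sample) - len(unique)} duplicate(s) removed, {len(unique)} kept")
```

Removing duplicates lowers the total below the 40,000 target, so whether to deduplicate or to keep the repeats is a judgment call.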

### Step 1: Install necessary packages

In [1]:
!pip install matplotlib
!pip install torch numpy transformers datasets tiktoken wandb tqdm

Collecting torch
  Downloading torch-2.9.0-cp311-cp311-win_amd64.whl.metadata (30 kB)
Collecting transformers
  Downloading transformers-4.57.1-py3-none-any.whl.metadata (43 kB)
     ---------------------------------------- 0.0/44.0 kB ? eta -:--:--
     ---------------------------------------- 44.0/44.0 kB 2.3 MB/s eta 0:00:00
Collecting datasets
  Downloading datasets-4.2.0-py3-none-any.whl.metadata (18 kB)
Collecting tiktoken
  Downloading tiktoken-0.12.0-cp311-cp311-win_amd64.whl.metadata (6.9 kB)
Collecting wandb
  Downloading wandb-0.22.2-py3-none-win_amd64.whl.metadata (10 kB)
Collecting typing-extensions>=4.10.0 (from torch)
  Downloading typing_extensions-4.15.0-py3-none-any.whl.metadata (3.3 kB)
Collecting sympy>=1.13.3 (from torch)
  Downloading sympy-1.14.0-py3-none-any.whl.metadata (12 kB)
Collecting huggingface-hub<1.0,>=0.34.0 (from transformers)
  Downloading huggingface_hub-0.35.3-py3-none-any.whl.metadata (14 kB)
Collecting tokenizers<=0.23.0,>=0.22.0 (from transforme

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
conda-repo-cli 1.0.75 requires requests_mock, which is not installed.
conda-repo-cli 1.0.75 requires clyent==1.2.1, but you have clyent 1.2.2 which is incompatible.
conda-repo-cli 1.0.75 requires python-dateutil==2.8.2, but you have python-dateutil 2.9.0.post0 which is incompatible.
conda-repo-cli 1.0.75 requires requests==2.31.0, but you have requests 2.32.5 which is incompatible.
spyder 5.4.3 requires jedi<0.19.0,>=0.17.2, but you have jedi 0.19.1 which is incompatible.
streamlit 1.30.0 requires packaging<24,>=16.8, but you have packaging 24.0 which is incompatible.


### Step 2: Package imports and configuration

In [5]:
import sys
import os
sys.path.append(os.path.abspath(".."))  # make model.py in the parent directory importable
os.environ["CUDA_VISIBLE_DEVICES"] = "1"  # expose only GPU 1; set before torch initializes CUDA
import json
import pickle
import random
import time

import matplotlib.pyplot as plt
import torch
import torch.nn as nn
import torch.nn.functional as F
from tqdm import tqdm

from model import GPT, GPTConfig

# Configuration
beta = 0.5
device = 'cuda' if torch.cuda.is_available() else 'cpu'
base_lr = 1e-4
epochs = 5
batch_size = 64
max_length = 64
num_samples = 1
max_new_tokens = 200
temperature = 0.8
top_k = 200

# Character-level tokenizer built from the SFT vocabulary
with open("../sft/meta.pkl", "rb") as f:
    meta = pickle.load(f)
stoi, itos = meta["stoi"], meta["itos"]
def encode(s): return [stoi[c] for c in s]
def decode(l): return ''.join([itos[i] for i in l])

OSError: [WinError 1114] A dynamic link library (DLL) initialization routine failed. Error loading "C:\Users\syedr\anaconda3\Lib\site-packages\torch\lib\c10.dll" or one of its dependencies.
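
Note that `encode` looks up every character in `stoi`, so any character missing from the vocabulary raises a `KeyError`. The pairs built in Task 1 contain characters such as `’`, `×`, and `…`; whether `meta.pkl` includes them is not shown in this notebook. The sketch below lists characters a vocabulary cannot encode. `find_unknown_chars` is a hypothetical helper, and the toy vocabulary is an assumption standing in for the real `stoi`.

```python
import string

def find_unknown_chars(texts, stoi):
    """Return the sorted list of characters in `texts` that `stoi` cannot encode."""
    unknown = set()
    for text in texts:
        unknown.update(ch for ch in text if ch not in stoi)
    return sorted(unknown)

# Toy vocabulary for illustration only; the real one comes from meta.pkl.
toy_chars = string.ascii_letters + string.digits + "+-*/=?(). ,!\n"
toy_stoi = {ch: i for i, ch in enumerate(toy_chars)}

texts = ["2*6=? Sorry, I don’t know!",
         "2*6=? The answer is 12 because 2×6 equals 12."]
print(find_unknown_chars(texts, toy_stoi))
```

If the real vocabulary reports unknown characters, one option is to replace them with ASCII equivalents (for example `'` and `*`) before encoding.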

### Step 3: Define helper functions

In [13]:
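def compute_logprob(input_ids):
    """Return the mean per-token log-probability of each sequence, shape (B,)."""
    inputs = input_ids[:, :-1]   # tokens the model conditions on
    targets = input_ids[:, 1:]   # next-token targets, shifted by one
    logits, _ = gpt(inputs, full_seq=True)
    B, T, V = logits.size()
    logits_flat = logits.reshape(-1, V)
    targets_flat = targets.reshape(-1)
    # Token id 0 is the padding value used by pad_or_truncate, so it is ignored
    loss = F.cross_entropy(logits_flat, targets_flat, ignore_index=0, reduction='none')
    loss = loss.reshape(B, T)
    attention_mask = (targets != 0).float()
    # Average over real tokens only; clamp guards against an all-padding row
    loss = (loss * attention_mask).sum(dim=1) / attention_mask.sum(dim=1).clamp(min=1)
    return -loss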

def pad_or_truncate(seq, max_length):
    return seq[-max_length:] if len(seq) > max_length else seq + [0] * (max_length - len(seq))

def get_batches(lines, batch_size):
    random.shuffle(lines)  # shuffles the caller's list in place
    for i in range(0, len(lines), batch_size):
        batch = lines[i:i+batch_size]
        if len(batch) < batch_size:
            continue  # drop the final incomplete batch
        neg_inputs = [pad_or_truncate(encode(p['negative'] + '\n\n\n\n'), max_length) for p in batch]
        pos_inputs = [pad_or_truncate(encode(p['positive'] + '\n\n\n\n'), max_length) for p in batch]
        neg_tensor = torch.tensor(neg_inputs, dtype=torch.long, device=device)
        pos_tensor = torch.tensor(pos_inputs, dtype=torch.long, device=device)
        yield neg_tensor, pos_tensor
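
One detail of `pad_or_truncate` is easy to miss: when a sequence is too long it keeps the *last* `max_length` tokens, so the start of the question is cut off rather than the end of the answer. The sketch below reproduces the helper on small lists to show both branches, then measures a word-problem example built by hand in the style of `make_word` (the operand values 3 and 4 are arbitrary). Since the tokenizer is character-level, string length equals token count.

```python
def pad_or_truncate(seq, max_length):
    # Same logic as the notebook helper: keep the tail, or right-pad with 0.
    return seq[-max_length:] if len(seq) > max_length else seq + [0] * (max_length - len(seq))

short_seq = [5, 6, 7]
long_seq = list(range(1, 11))  # ten tokens: 1..10

print(pad_or_truncate(short_seq, 6))  # [5, 6, 7, 0, 0, 0]
print(pad_or_truncate(long_seq, 6))   # [5, 6, 7, 8, 9, 10]

# A word-problem positive example plus the '\n\n\n\n' terminator from get_batches.
example = ("Tom has 3 apples and buys 4 more. How many apples does he have in total? "
           "He has 7 apples because 3+4=7." + "\n\n\n\n")
print(len(example), "characters; max_length in the configuration is 64")
```

The word-problem example exceeds 64 characters, so with the current `max_length` its leading characters would be dropped before training.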

### Step 4: Load the pretrained NanoGPT model

In [15]:
ckpt = torch.load("C:/Users/syedr/OneDrive/Documents/NTU Semester Resources/NTU Y3S1/SC3000 - Artificial Intelligence/Lab 1/NanoGPT-Math-main/sft/gpt.pt", map_location=device)
gptconf = GPTConfig(**ckpt['model_args'])
gpt = GPT(gptconf)
state_dict = ckpt['model']
unwanted_prefix = '_orig_mod.'
for k in list(state_dict.keys()):
    if k.startswith(unwanted_prefix):
        state_dict[k[len(unwanted_prefix):]] = state_dict.pop(k)
gpt.load_state_dict(state_dict)
gpt.to(device).train()

NameError: name 'torch' is not defined

Task 2: Train NanoGPT using DPO (Steps 5–7)

### Step 5: Load Data (**students are required to complete this part!**)

In [7]:
# Load data from ./data/pos_neg_pairs.json
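
A minimal sketch of this step, assuming the file is the JSON list of `{"negative": ..., "positive": ...}` dicts written in Task 1. `load_pairs` is a hypothetical helper, and the demonstration writes a temporary file so the block runs on its own; in the notebook the path would point at the real `pos_neg_pairs.json`. The resulting list is what `get_batches` and `total_steps` refer to as `lines`.

```python
import json
import os
import tempfile

def load_pairs(path):
    """Load the pair list and keep only records with both required keys."""
    with open(path, "r", encoding="utf-8") as f:
        data = json.load(f)
    return [p for p in data if "negative" in p and "positive" in p]

# Demonstration with a temporary file standing in for pos_neg_pairs.json.
demo = [{"negative": "1+1=? Hmm, I'm not sure!",
         "positive": "1+1=? The answer is 2 because 1+1 equals 2."}]
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "pos_neg_pairs.json")
    with open(demo_path, "w", encoding="utf-8") as f:
        json.dump(demo, f)
    lines = load_pairs(demo_path)

print(f"Loaded {len(lines)} pairs")
```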

### Step 6: Build the optimizer and scheduler (**students are required to complete this part!**)

In [8]:
# recommend to use the AdamW optimizer 
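
One common pairing with AdamW is a linear warmup followed by cosine decay. The sketch below computes the learning-rate multiplier for each step in plain Python. The `warmup_steps` and `min_ratio` values are illustrative assumptions, not part of the assignment. A function of this shape can be supplied as `lr_lambda` to `torch.optim.lr_scheduler.LambdaLR`, which multiplies the optimizer's initial learning rate by the returned factor.

```python
import math

def lr_factor(step, total_steps, warmup_steps=100, min_ratio=0.1):
    """Multiplier for the base learning rate at a given optimizer step."""
    if step < warmup_steps:
        # Linear warmup from 1/warmup_steps up to 1.0.
        return (step + 1) / warmup_steps
    # Cosine decay from 1.0 down to min_ratio over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, progress)
    return min_ratio + (1.0 - min_ratio) * 0.5 * (1.0 + math.cos(math.pi * progress))

base_lr = 1e-4                    # value from the configuration cell
total_steps = 5 * (40000 // 64)   # epochs * full batches per epoch = 3125
for step in (0, 99, 100, total_steps // 2, total_steps):
    print(step, f"{base_lr * lr_factor(step, total_steps):.2e}")
```

Here `total_steps` counts optimizer steps across all epochs, which differs from the per-epoch `total_steps` defined in Step 7.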

### Step 7: Begin training (**students are required to complete this part!**)

In [None]:
total_steps = len(lines) // batch_size
for epoch in range(epochs):
    pbar = tqdm(get_batches(lines, batch_size))
    for step, (neg_tensor,pos_tensor) in enumerate(pbar):
        ###########################################################
        # Please complete the training code here!
        # Examples: 
        # ...
        # neg_logprob
        # pos_logprob 
        # loss = -F.logsigmoid((pos_logprob - neg_logprob) / beta).mean() - pos_logprob.mean() * 0.1 
        # ...
        ###########################################################
    ckpt_path = "./dpo.pt"
    torch.save({
        "model_state_dict": gpt.state_dict(),
        "model_args": ckpt['model_args'],
    }, ckpt_path)
    print(f"Saved checkpoint to {ckpt_path}")
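
To make the loss in the hint concrete, the sketch below evaluates it on hand-picked numbers using only the `math` module. It mirrors the hint literally: the log-probability gap is divided by `beta`, and a term weighted by 0.1 rewards a high positive log-probability. Standard DPO formulations instead multiply the gap by `beta` and measure it relative to a frozen reference model, so this is a simplified variant. `hint_loss` and the input values are illustrative assumptions.

```python
import math

def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    return -math.log1p(math.exp(-x)) if x >= 0 else x - math.log1p(math.exp(x))

def hint_loss(pos_logprobs, neg_logprobs, beta=0.5, sft_weight=0.1):
    """Batch loss following the formula in the training hint."""
    n = len(pos_logprobs)
    pref = sum(log_sigmoid((p - q) / beta)
               for p, q in zip(pos_logprobs, neg_logprobs)) / n
    pos_mean = sum(pos_logprobs) / n
    return -pref - sft_weight * pos_mean

# Per-sequence mean log-probabilities (made-up numbers for illustration).
separated = hint_loss([-0.5, -0.6], [-2.5, -2.6])  # positives clearly preferred
confused = hint_loss([-2.5, -2.6], [-0.5, -0.6])   # negatives preferred
print(f"separated: {separated:.4f}, confused: {confused:.4f}")
```

The loss is small when the model already assigns higher log-probability to the positive responses and large when the preference is reversed, which is the gradient signal the training loop relies on.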

### Step 8: Begin testing (**students are required to complete this part!**)

In [None]:
# Load the fine-tuned model
ckpt_path = "../dpo/dpo.pt"
checkpoint = torch.load(ckpt_path, map_location=device)
gptconf = GPTConfig(**checkpoint['model_args'])
gpt = GPT(gptconf).to(device)  # follow `device` so the cell also runs without CUDA
# Step 7 saves weights under 'model_state_dict'; the SFT checkpoint uses 'model'
state_dict = checkpoint['model'] if 'model' in checkpoint else checkpoint['model_state_dict']
unwanted_prefix = '_orig_mod.'
for k in list(state_dict.keys()):
    if k.startswith(unwanted_prefix):
        state_dict[k[len(unwanted_prefix):]] = state_dict.pop(k)
gpt.load_state_dict(state_dict)
# Test
gpt.eval()
test_set = ["17+19=?", "3*17=?", "72/4=?", "72-x=34,x=?", "x*11=44,x=?", "3*17=?", "72/4=?", "72-x=34,x=?"]
with torch.no_grad():
    for prompt in test_set: 
        prompt_ids = encode(prompt)
        ###########################################################
        # Please complete the test code here!
        # ...
        # gpt.generate(x, max_new_tokens, temperature=temperature, top_k=top_k)
        # ...
        ###########################################################
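
Once responses have been generated, they still need to be compared with the true results. The sketch below extracts the integer that follows "The answer is" and checks it against expected values worked out by hand for the distinct prompts in `test_set`. `extract_answer` is a hypothetical helper and the sample responses are invented for illustration; real model output may be phrased differently and need a looser pattern.

```python
import re

def extract_answer(response):
    """Return the integer following 'The answer is', or None if absent."""
    match = re.search(r"The answer is (-?\d+)", response)
    return int(match.group(1)) if match else None

# Expected results for the distinct prompts in test_set, worked out by hand.
expected = {"17+19=?": 36, "3*17=?": 51, "72/4=?": 18,
            "72-x=34,x=?": 38, "x*11=44,x=?": 4}

# Hypothetical model outputs, for illustration only.
responses = {
    "17+19=?": "17+19=? The answer is 36 because 17+19 equals 36.",
    "3*17=?": "3*17=? The answer is 41 because 3*17 equals 41.",
    "72/4=?": "72/4=? Sorry, I don't know!",
}

correct = sum(extract_answer(r) == expected[p] for p, r in responses.items())
print(f"{correct}/{len(responses)} correct")
```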