# Teaching NanoGPT to Do Math

## Team Members
- Jiayang
- Rochelle
- Viona

## Project Goal
Fine-tune a pretrained NanoGPT model using Direct Preference Optimization (DPO) to solve math problems. The base model was trained on general QA data but lacks mathematical reasoning capabilities.

## Our Approach
1. **Data Generation**: Generate positive-negative training pairs
2. **DPO Training**: Train the model with the DPO algorithm
3. **Evaluation**: Test on various math problems

### Step 1: Install necessary packages

In [1]:
!pip install matplotlib
!pip install numpy transformers datasets tiktoken wandb tqdm
!pip install ipywidgets





In [4]:
!pip install torch --index-url https://download.pytorch.org/whl/cu128

Looking in indexes: https://download.pytorch.org/whl/cu128
Collecting torch
  Downloading https://download.pytorch.org/whl/cu128/torch-2.8.0%2Bcu128-cp313-cp313-win_amd64.whl.metadata (29 kB)
Downloading https://download.pytorch.org/whl/cu128/torch-2.8.0%2Bcu128-cp313-cp313-win_amd64.whl (3461.4 MB)


:warning: To use the GPU, install the `torch` build that matches your CUDA version (the `cu128` wheel above targets CUDA 12.8).

### Step 2: Package imports and configuration
#### Key Parameters
- **beta = 0.5**: Controls DPO preference strength
- **base_lr = 1e-4**: Learning rate
- **epochs = 5**: Training rounds
- **batch_size = 64**: Samples per batch
- **max_length = 64**: Maximum input length

#### Tokenizer
We load the character-level tokenizer from the pretrained model:
- **stoi**: Converts characters to numbers
- **itos**: Converts numbers back to characters

This ensures compatibility with the pretrained model.
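
As a minimal sketch of what a character-level tokenizer does, the example below builds `stoi` and `itos` from a toy vocabulary. The vocabulary string is an assumption for illustration only; the real mapping comes from the pretrained model's `meta.pkl`.

```python
# Toy vocabulary (hypothetical); the real one is loaded from meta.pkl.
chars = ["<unk>"] + sorted(set("0123456789+-*/=?x "))
stoi = {ch: i for i, ch in enumerate(chars)}  # character -> integer id
itos = {i: ch for ch, i in stoi.items()}      # integer id -> character


def encode(s):
    # Unknown characters fall back to id 0, mirroring the notebook's encode()
    return [stoi.get(c, 0) for c in s]


def decode(ids):
    return "".join(itos[i] for i in ids)


ids = encode("17+19=?")
print(ids)
print(decode(ids))
```

Encoding and then decoding a string made only of known characters returns the original string, which is the property the fine-tuning code relies on.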

Check for environment consistency:

In [2]:
import sys

print(sys.executable)

C:\Python313\python.exe


In [1]:
import torch

print(torch.cuda.is_available())  # True if CUDA is available
print(torch.cuda.device_count())  # Number of available GPUs
print(torch.cuda.get_device_name(0))  # GPU name

True
1
NVIDIA GeForce GTX 1650


In [2]:
import sys
import os

sys.path.append(os.path.abspath(".."))
os.environ["CUDA_VISIBLE_DEVICES"] = "0"  # this machine has one GPU (index 0); set before CUDA is initialised
import torch
import torch.nn as nn
import torch.nn.functional as F
import random
import pickle
from model import GPT, GPTConfig
from tqdm import tqdm
import time
import json
import matplotlib.pyplot as plt

# Configuration
beta = 0.5
device = "cuda" if torch.cuda.is_available() else "cpu"
base_lr = 1e-4
epochs = 5
batch_size = 64
max_length = 64
num_samples = 1
max_new_tokens = 200
temperature = 0.8
top_k = 200
# tokenizer
with open("../sft/meta.pkl", "rb") as f:
    meta = pickle.load(f)
stoi, itos = meta["stoi"], meta["itos"]


def encode(s):
    return [stoi.get(c, 0) for c in s]  # 0 = <unk> for unknown characters


def decode(l):
    return "".join([itos[i] for i in l])

### Step 3: Define helper functions

In [3]:
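def compute_logprob(input_ids):
    """Return the mean per-token log-probability of each sequence (padding excluded)."""
    inputs = input_ids[:, :-1]  # tokens the model conditions on
    targets = input_ids[:, 1:]  # next-token targets (inputs shifted by one)
    logits, _ = gpt(inputs, full_seq=True)
    B, T, V = logits.size()
    logits_flat = logits.reshape(-1, V)
    targets_flat = targets.reshape(-1)
    # Per-token negative log-likelihood; positions whose target is 0 (padding) contribute 0
    loss = F.cross_entropy(logits_flat, targets_flat, ignore_index=0, reduction="none")
    loss = loss.reshape(B, T)
    attention_mask = (targets != 0).float()
    # Average over the non-padding positions of each sequence
    loss = (loss * attention_mask).sum(dim=1) / attention_mask.sum(dim=1)
    return -loss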


def pad_or_truncate(seq, max_length):
    return (
        seq[-max_length:]
        if len(seq) > max_length
        else seq + [0] * (max_length - len(seq))
    )


def get_batches(lines, batch_size):
    random.shuffle(lines)
    # for l in lines:
    #    print(l[1])
    for i in range(0, len(lines), batch_size):
        batch = lines[i : i + batch_size]
        if len(batch) < batch_size:
            continue
        neg_inputs = [
            pad_or_truncate(encode(p["negative"] + "\n\n\n\n"), max_length)
            for p in batch
        ]
        pos_inputs = [
            pad_or_truncate(encode(p["positive"] + "\n\n\n\n"), max_length)
            for p in batch
        ]
        neg_tensor = torch.tensor(neg_inputs, dtype=torch.long, device=device)
        pos_tensor = torch.tensor(pos_inputs, dtype=torch.long, device=device)
        yield neg_tensor, pos_tensor
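
To make the padding behaviour concrete, here is a small pure-Python sketch. It repeats `pad_or_truncate` from above and adds a hypothetical `masked_mean` helper that mirrors, on plain lists, how `compute_logprob` averages only over non-padding positions. The token ids and log-probabilities are made-up values.

```python
def pad_or_truncate(seq, max_length):
    # Keep the LAST max_length tokens if too long, else right-pad with 0
    return (
        seq[-max_length:]
        if len(seq) > max_length
        else seq + [0] * (max_length - len(seq))
    )


def masked_mean(values, targets, pad_id=0):
    # Average values only where the target is not padding (hypothetical helper)
    kept = [v for v, t in zip(values, targets) if t != pad_id]
    return sum(kept) / len(kept)


short = pad_or_truncate([5, 6, 7], 5)
long = pad_or_truncate([1, 2, 3, 4, 5, 6, 7], 5)
print(short)  # [5, 6, 7, 0, 0]
print(long)   # [3, 4, 5, 6, 7]

token_logprobs = [-0.5, -1.5, -1.0, -9.0, -9.0]  # made-up per-token log-probs
print(masked_mean(token_logprobs, short))  # -1.0
```

Note that truncation drops tokens from the front of the sequence, so an over-long sample would lose the start of its question rather than the end of its answer.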

### Step 4: Load the pretrained NanoGPT model
#### Loading Process
1. Load checkpoint file
2. Initialize model with saved config
3. Load pretrained weights
4. Move to GPU

The model can answer general questions but doesn't know math yet.

In [4]:
print(torch.__version__)
print(torch.version.cuda)
print(torch.backends.cudnn.version())

2.8.0+cu128
12.8
91002


In [5]:
ckpt = torch.load("../sft/gpt.pt", map_location=device)
gptconf = GPTConfig(**ckpt["model_args"])
gpt = GPT(gptconf)
state_dict = ckpt["model"]
unwanted_prefix = "_orig_mod."
for k in list(state_dict.keys()):
    if k.startswith(unwanted_prefix):
        state_dict[k[len(unwanted_prefix) :]] = state_dict.pop(k)
gpt.load_state_dict(state_dict)
gpt.to(device).train()

GPT(
  (transformer): ModuleDict(
    (wte): Embedding(74, 348)
    (wpe): Embedding(256, 348)
    (drop): Dropout(p=0.2, inplace=False)
    (h): ModuleList(
      (0-5): 6 x Block(
        (ln_1): LayerNorm()
        (attn): CausalSelfAttention(
          (c_attn): Linear(in_features=348, out_features=1044, bias=False)
          (c_proj): Linear(in_features=348, out_features=348, bias=False)
          (attn_dropout): Dropout(p=0.2, inplace=False)
          (resid_dropout): Dropout(p=0.2, inplace=False)
        )
        (ln_2): LayerNorm()
        (mlp): MLP(
          (c_fc): Linear(in_features=348, out_features=1392, bias=False)
          (gelu): GELU(approximate='none')
          (c_proj): Linear(in_features=1392, out_features=348, bias=False)
          (dropout): Dropout(p=0.2, inplace=False)
        )
      )
    )
    (ln_f): LayerNorm()
  )
  (lm_head): Linear(in_features=348, out_features=74, bias=False)
)

In [6]:
print("Device:", device)
print("Model is on CUDA:", next(gpt.parameters()).is_cuda)

Device: cuda
Model is on CUDA: True


### Step 5: Load Data (**students are required to complete this part!**) (Task 1)
#### Data Format
Each training sample has two parts:
- **Negative**: "x+y=? Sorry, I do not know!"
- **Positive**: "x+y=? The answer is Z because x+y equals Z."

#### Our Dataset
```
Total samples: 102,309
Format: JSON with 'positive' and 'negative' keys
```

Example:
- Positive: "0+0=? The answer is 0 because 0+0 equals 0."
- Negative: "0+0=? Sorry, I do not know!"

#### Problem Types
1. Addition (17+19=?)
2. Subtraction (72-x=34)
3. Multiplication (3*17=?)
4. Division (72/4=?)
5. Algebra (x*11=44)

#### Why This Size?
Using 102k samples (10x the minimum) is intended to provide:
- Good coverage of different problems
- Better generalization
- Reduced overfitting

The data is generated by a script we wrote, `utils/generate_training_data.py`. The design considerations are documented in the script itself.
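
For illustration only, a pair in the format above could be produced as follows. This is a hypothetical sketch, not the contents of `utils/generate_training_data.py`; the function name `make_pair` and the restriction to plain addition and multiplication are assumptions.

```python
import json


def make_pair(a, b, op):
    # Hypothetical helper: builds one positive/negative pair for "a op b = ?"
    result = a + b if op == "+" else a * b
    question = f"{a}{op}{b}=?"
    return {
        "positive": f"{question} The answer is {result} because {a}{op}{b} equals {result}.",
        "negative": f"{question} Sorry, I do not know!",
    }


pairs = [make_pair(a, b, op) for op in "+*" for a in range(3) for b in range(3)]
print(len(pairs))
print(json.dumps(pairs[0]))
```

Writing `pairs` out with `json.dump` would give a file with `positive` and `negative` keys, the layout the loading cell expects.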

#### Documentation

- [Hugging Face Datasets documentation](https://huggingface.co/docs/datasets/v4.1.1/loading)

In [16]:
# Load the positive/negative pairs from ./test2.json

from datasets import load_dataset

dataset = load_dataset("json", data_files="./test2.json")

print(dataset)
print(dataset["train"][0])

Generating train split: 0 examples [00:00, ? examples/s]

DatasetDict({
    train: Dataset({
        features: ['positive', 'negative'],
        num_rows: 102309
    })
})
{'positive': '0+0=? The answer is 0 because 0+0 equals 0.', 'negative': '0+0=? Sorry, I do not know!'}


### Step 6: Build the optimizer and scheduler (**students are required to complete this part!**) (Task 2)
#### Optimizer: AdamW
```python
optimizer = torch.optim.AdamW(gpt.parameters(), lr=1e-4, weight_decay=1e-2)
```

**Why AdamW?**
- Adapts the step size per parameter using running estimates of the gradient moments
- Works well with transformers
- Applies decoupled weight decay as regularization to reduce overfitting

#### Scheduler: CosineAnnealingLR
```python
scheduler = CosineAnnealingLR(optimizer, T_max=iteration, eta_min=1e-5)
```

**What it does:**
- Starts with higher learning rate
- Gradually decreases in a smooth curve
- Helps model converge better

The learning rate drops from 1e-4 to 1e-5 over training.
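
The schedule can be written in closed form. The sketch below evaluates the cosine annealing formula given in the PyTorch documentation, `eta_t = eta_min + (eta_max - eta_min) * (1 + cos(pi * t / T_max)) / 2`, in plain Python. The value `t_max = 1000` is a placeholder, not the step count used in training, and `cosine_lr` is a hypothetical helper name.

```python
import math


def cosine_lr(step, t_max, base_lr=1e-4, eta_min=1e-5):
    # Closed-form cosine annealing: base_lr at step 0, eta_min at step t_max
    return eta_min + 0.5 * (base_lr - eta_min) * (1 + math.cos(math.pi * step / t_max))


t_max = 1000  # placeholder number of scheduler steps
for step in (0, t_max // 2, t_max):
    print(step, f"{cosine_lr(step, t_max):.2e}")
```

Halfway through, the learning rate sits at the midpoint of the two bounds (5.5e-5); the decay is slowest at the start and end and fastest in the middle.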

- [AdamW optimiser documentation](https://docs.pytorch.org/docs/stable/generated/torch.optim.AdamW.html#torch.optim.AdamW)
- [PyTorch optimiser documentation](https://docs.pytorch.org/docs/stable/optim.html#module-torch.optim)


The parameters of the pre-trained model are shown below:

In [17]:
for name, para in gpt.named_parameters():
    print(name, para.shape)

transformer.wte.weight torch.Size([74, 348])
transformer.wpe.weight torch.Size([256, 348])
transformer.h.0.ln_1.weight torch.Size([348])
transformer.h.0.attn.c_attn.weight torch.Size([1044, 348])
transformer.h.0.attn.c_proj.weight torch.Size([348, 348])
transformer.h.0.ln_2.weight torch.Size([348])
transformer.h.0.mlp.c_fc.weight torch.Size([1392, 348])
transformer.h.0.mlp.c_proj.weight torch.Size([348, 1392])
transformer.h.1.ln_1.weight torch.Size([348])
transformer.h.1.attn.c_attn.weight torch.Size([1044, 348])
transformer.h.1.attn.c_proj.weight torch.Size([348, 348])
transformer.h.1.ln_2.weight torch.Size([348])
transformer.h.1.mlp.c_fc.weight torch.Size([1392, 348])
transformer.h.1.mlp.c_proj.weight torch.Size([348, 1392])
transformer.h.2.ln_1.weight torch.Size([348])
transformer.h.2.attn.c_attn.weight torch.Size([1044, 348])
transformer.h.2.attn.c_proj.weight torch.Size([348, 348])
transformer.h.2.ln_2.weight torch.Size([348])
transformer.h.2.mlp.c_fc.weight torch.Size([1392, 348])

Construct the optimiser according to the official documentation:
- `lr` is kept at $1 \cdot 10^{-4}$
- `weight_decay` is kept at $10^{-2}$

The `AdamW` algorithm is chosen based on the instruction given in the assignment.

In [18]:
optimizer = torch.optim.AdamW(gpt.parameters(), lr=1e-4, weight_decay=1e-2)

Next, we initialise the scheduler. A PyTorch scheduler adjusts the learning rate `lr` during training according to a fixed strategy.

We chose the Cosine Annealing Scheduler.

#### Documentation

- [CosineAnnealingLR](https://docs.pytorch.org/docs/stable/generated/torch.optim.lr_scheduler.CosineAnnealingLR.html)
- [Fine-tune Llama 2 with DPO](https://huggingface.co/blog/dpo-trl)

In [19]:
from torch.optim.lr_scheduler import CosineAnnealingLR

iteration = len(dataset["train"]) // batch_size * epochs
scheduler = CosineAnnealingLR(optimizer, T_max=iteration, eta_min=1e-5)
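
As a quick check of `T_max`, the arithmetic below uses the dataset size (102,309) and the configuration values from Step 2. Because `get_batches` skips the final incomplete batch, each epoch runs 1,598 steps, which matches the `1598it` shown in the training progress bars.

```python
num_rows = 102_309   # size of the training split
batch_size = 64
epochs = 5

steps_per_epoch = num_rows // batch_size            # incomplete last batch is skipped
dropped = num_rows - steps_per_epoch * batch_size    # samples left out each epoch
t_max = steps_per_epoch * epochs                     # value passed as T_max

print(steps_per_epoch, dropped, t_max)  # 1598 37 7990
```

Since the data is reshuffled every epoch, the 37 skipped samples differ from epoch to epoch.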

### Step 7: Begin training (**students are required to complete this part!**) (Task 2)
#### DPO Loss Function
```python
loss = (
    -F.logsigmoid((pos_logprob - neg_logprob) / beta).mean()
    - pos_logprob.mean() * 0.1
)
```

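**What this does:**
1. Makes positive samples more likely
2. Makes negative samples less likely
3. Keeps outputs as fluent as possible: the extra `- pos_logprob.mean() * 0.1` term directly rewards high likelihood on the positive text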
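
To see how the two terms behave, the sketch below evaluates the same expression on scalar log-probabilities using only the standard library. The input values are made up, and `log_sigmoid` and `pair_loss` are hypothetical helper names. Note that this notebook's loss compares the two log-probabilities directly and divides by `beta`; the original DPO formulation instead measures both against a frozen reference model and multiplies by `beta`.

```python
import math


def log_sigmoid(x):
    # Numerically stable log(sigmoid(x))
    return -math.log1p(math.exp(-x)) if x >= 0 else x - math.log1p(math.exp(x))


def pair_loss(pos_logprob, neg_logprob, beta=0.5, fluency_weight=0.1):
    preference = -log_sigmoid((pos_logprob - neg_logprob) / beta)  # small when pos >> neg
    fluency = -fluency_weight * pos_logprob                        # small when pos is near 0
    return preference + fluency


print(pair_loss(-1.0, -1.0))   # no preference yet
print(pair_loss(-0.1, -3.0))   # positive clearly preferred: lower loss
```

With equal log-probabilities the preference term equals log 2 (about 0.693); it shrinks towards zero as the gap in favour of the positive sample grows.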

#### Training Process
1. Calculate probability for negative sample
2. Calculate probability for positive sample
3. Compute DPO loss
4. Update model weights
5. Adjust learning rate

#### Training Results

| Epoch | Loss (last batch) | Time per Epoch |
|-------|------|----------------|
| 1 | 0.0209 | 8.5 min |
| 2 | 0.0181 | 8.5 min |
| 3 | 0.0168 | 8.5 min |
| 4 | 0.0165 | 8.5 min |
| 5 | 0.0157 | 8.5 min |

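**Total improvement**: 24.9% loss reduction
- Calculation: (0.0209 - 0.0157) / 0.0209 ≈ 0.249 = 24.9%
- Loss decreased from 0.0209 to 0.0157
- These values are the loss on the last batch of each epoch, as shown by the progress bar, so they are a noisy estimate of the epoch-level trend
- A steady decline of this size is what we would hope for from DPO, which nudges preferences rather than making dramatic changes

The loss decreases from epoch to epoch, which suggests the model is learning to prefer the correct answers.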

In [12]:
print(sys.executable)

C:\Users\user\University\3000\.venv\Scripts\python.exe


In [13]:
print(torch.__version__)
print(torch.version.cuda)
print(torch.cuda.device_count())

2.8.0+cu128
12.8
1


In [20]:
lines = dataset["train"]
lines = [dict(x) for x in lines]
total_steps = len(lines) // batch_size
for epoch in range(epochs):
    pbar = tqdm(get_batches(lines, batch_size))
    for step, (neg_tensor, pos_tensor) in enumerate(pbar):
        ###########################################################
        # Please complete the training code here!
        # Examples:
        # ...
        # neg_logprob
        # pos_logprob
        # loss = -F.logsigmoid((pos_logprob - neg_logprob) / beta).mean() - pos_logprob.mean() * 0.1
        # ...
        ###########################################################

        optimizer.zero_grad()
        neg_logprob = compute_logprob(neg_tensor)
        pos_logprob = compute_logprob(pos_tensor)
        loss = (
            -F.logsigmoid((pos_logprob - neg_logprob) / beta).mean()
            - pos_logprob.mean() * 0.1
        )
        loss.backward()
        optimizer.step()
        scheduler.step()
        pbar.set_description(f"epoch {epoch+1}, step {step}, loss {loss.item():.4f}")

    ckpt_path = "./dpo.pt"
    torch.save(
        {
            "model_state_dict": gpt.state_dict(),
            "model_args": ckpt["model_args"],
        },
        ckpt_path,
    )
    print(f"Saved checkpoint to {ckpt_path}")

epoch 1, step 1597, loss 0.0209: : 1598it [08:23,  3.17it/s]


Saved checkpoint to ./dpo.pt


epoch 2, step 1597, loss 0.0181: : 1598it [08:28,  3.14it/s]


Saved checkpoint to ./dpo.pt


epoch 3, step 1597, loss 0.0168: : 1598it [08:28,  3.14it/s]


Saved checkpoint to ./dpo.pt


epoch 4, step 1597, loss 0.0165: : 1598it [08:27,  3.15it/s]


Saved checkpoint to ./dpo.pt


epoch 5, step 1597, loss 0.0157: : 1598it [08:27,  3.15it/s]

Saved checkpoint to ./dpo.pt





### Step 8: Begin testing (**students are required to complete this part!**) (Task 2)
We ran 8 test prompts covering different operations; 5 are distinct problems and 3 are repeats.
#### Results on Two-Digit Operations

| Problem | Expected | Model Output | Response type |
|---------|----------|--------------|---------------|
| 17+19=? | 36 | "The answer is 36 because 17+19 equals 36." | positive |
| 3*17=? | 51 | "The answer is 51 because 3*17 equals 51." | positive |
| 72/4=? | 18 | "The answer is 18 because 72/4 equals 18." | positive |
| 72-x=34,x=? | 38 | "The answer is 38 because 72-34 equals 38." | positive |
| x*11=44,x=? | 4 | "The answer is 4 because 44/11 equals 4." | positive |

**Accuracy on trained problem types: 100% (8/8 prompts, 5 distinct problems)**
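
Scoring can also be automated rather than read off by eye. The sketch below parses the number after "The answer is" and compares it with the expected value. The helper `extract_answer` and its regular expression are assumptions about the output format; the sample outputs are the five distinct results from the table above, with the prompt prefix included as in the raw model output.

```python
import re


def extract_answer(text):
    # Hypothetical parser: pull the integer that follows "The answer is"
    match = re.search(r"The answer is (-?\d+)", text)
    return int(match.group(1)) if match else None


# (model output, expected answer) pairs taken from the results table
results = [
    ("17+19=? The answer is 36 because 17+19 equals 36.", 36),
    ("3*17=? The answer is 51 because 3*17 equals 51.", 51),
    ("72/4=? The answer is 18 because 72/4 equals 18.", 18),
    ("72-x=34,x=? The answer is 38 because 72-34 equals 38.", 38),
    ("x*11=44,x=? The answer is 4 because 44/11 equals 4.", 4),
]

correct = sum(extract_answer(out) == expected for out, expected in results)
print(f"accuracy: {correct}/{len(results)}")
```

A refusal such as "Sorry, I do not know!" yields `None` and is therefore counted as incorrect.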

#### What The Model Learned
- Correct calculations for small numbers
- Proper explanation format
- Algebraic reasoning (solving for x)
- No more "I don't know" responses

In [21]:
# Load the fine-tuned model
ckpt_path = "../dpo/dpo.pt"
checkpoint = torch.load(ckpt_path, map_location=device)
gptconf = GPTConfig(**checkpoint["model_args"])
gpt = GPT(gptconf).to(device)
# The DPO checkpoint saved above uses "model_state_dict"; the SFT checkpoint uses "model"
state_dict = (
    checkpoint["model"] if "model" in checkpoint else checkpoint["model_state_dict"]
)
unwanted_prefix = "_orig_mod."
for k in list(state_dict.keys()):
    if k.startswith(unwanted_prefix):
        state_dict[k[len(unwanted_prefix) :]] = state_dict.pop(k)
gpt.load_state_dict(state_dict)
# Test
gpt.eval()
test_set = [
    "17+19=?",
    "3*17=?",
    "72/4=?",
    "72-x=34,x=?",
    "x*11=44,x=?",
    "3*17=?",
    "72/4=?",
    "72-x=34,x=?",
]
with torch.no_grad():
    for prompt in test_set:
        prompt_ids = encode(prompt)
        ###########################################################
        # Please complete the test code here!
        # ...
        # gpt.generate(x, max_new_tokens, temperature=temperature, top_k=top_k)
        # ...
        ###########################################################
        input_ids = torch.tensor([prompt_ids], dtype=torch.long, device=device)
        output_ids = gpt.generate(
            input_ids,
            max_new_tokens=max_new_tokens,
            temperature=temperature,
            top_k=top_k,
        )
        output_text = decode(output_ids[0].flatten().tolist())
        print(f"Prompt: {prompt}")
        print(f"Model output: {output_text}")
        print("-" * 40)

Prompt: 17+19=?
Model output: 17+19=? The answer is 36 because 17+19 equals 36.
----------------------------------------
Prompt: 3*17=?
Model output: 3*17=? The answer is 51 because 3*17 equals 51.
----------------------------------------
Prompt: 72/4=?
Model output: 72/4=? The answer is 18 because 72/4 equals 18.
----------------------------------------
Prompt: 72-x=34,x=?
Model output: 72-x=34,x=? The answer is 38 because 72-34 equals 38.
----------------------------------------
Prompt: x*11=44,x=?
Model output: x*11=44,x=? The answer is 4 because 44/11 equals 4.
----------------------------------------
Prompt: 3*17=?
Model output: 3*17=? The answer is 51 because 3*17 equals 51.
----------------------------------------
Prompt: 72/4=?
Model output: 72/4=? The answer is 18 because 72/4 equals 18.
----------------------------------------
Prompt: 72-x=34,x=?
Model output: 72-x=34,x=? The answer is 38 because 72-34 equals 38.
----------------------------------------


### Future Improvements

1. **Expand training data:**
   - Add 3-4 digit numbers
   - Include more negative numbers
   - Add edge cases (0, large numbers)

2. **More training epochs:**
   - Current: 5 epochs
   - Suggested: 10-15 epochs for better convergence

3. **Better tokenization:**
   - Current: Character-level
   - Upgrade to number-aware tokenization

4. **Multi-step problems:**
   - Add problems requiring multiple operations
   - Example: "2*3+4=?"
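
As one possible reading of "number-aware tokenization" (item 3), the sketch below keeps each run of digits together as a single token instead of splitting it into characters. This is an illustrative assumption, not something implemented in this project; a real change would also require a new vocabulary and retraining. The function name `number_aware_split` is hypothetical.

```python
import re


def number_aware_split(text):
    # Multi-digit numbers become one token; everything else stays per character
    return re.findall(r"\d+|\D", text)


print(number_aware_split("2*3+4=?"))
print(number_aware_split("72-x=34,x=?"))
```

Joining the tokens back together reproduces the input exactly, so the split loses no information.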

### Conclusion
This project shows that DPO-style fine-tuning can teach a small NanoGPT model to answer basic arithmetic and simple algebra problems in the expected format. The model answered all 8 test prompts correctly, although these prompts cover only 5 distinct problems of the trained problem types, so the result does not yet demonstrate generalization to harder or unseen problems. The future improvements listed above are aimed at these limitations.