---
format:
  html:
    code-fold: true
jupyter: python3
---


**Cell 1: Task Setup & Experiment Plan**

**Tiny Model**

- I use **`EleutherAI/gpt-neo-125M`**, a small decoder-only language model.
- It is pretrained and small enough for simple RL fine-tuning on a single GPU.

---

**Toy Task: Creative Analogy Generator**

The model is given analogies of the form:

> **"A is to B as C is to"**

and must output a suitable word **D** so that:

> **A : B :: C : D**

Examples:

- `sun : bright :: moon : ?` → *dim, pale, glowing, luminous…*  
- `king : man :: queen : ?` → *woman, lady, female…*  
- `cat : kitten :: dog : ?` → *puppy…*

**Prompt format:**

```text
"{A} is to {B} as {C} is to"

```

---

**Reward Logic**

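The planned reward is a **semantic similarity reward** based on the cosine similarity between the embedding of the candidate word and the embedding of each target word.
Reward = max similarity over the targets, linearly re-scaled from [-1, +1] to [0, 1], i.e. (similarity + 1) / 2.

So:

Reward is close to 1 when the candidate word is semantically close to one of the targets.

Reward is near 0.5 when the candidate is unrelated (cosine similarity near 0), and approaches 0 only when its embedding points away from every target.

In the final run (Cell 4) I replaced this with a binary exact-match reward; the analysis at the end discusses that choice.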
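
For reference, the similarity reward described above can be sketched as follows. This is a minimal illustration, not the code that was run: the helper names (`cosine`, `similarity_reward`) and the 2-d vectors are hypothetical, and a real version would look the vectors up in a pretrained word-embedding table. It assumes the re-scaling is linear, so a cosine of 0 maps to 0.5.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def similarity_reward(candidate_vec, target_vecs):
    """Max cosine similarity over the targets, linearly rescaled to [0, 1]."""
    best = max(cosine(candidate_vec, t) for t in target_vecs)
    return (best + 1.0) / 2.0

# Hypothetical 2-d "embeddings", chosen only to keep the arithmetic simple.
targets = [[1.0, 0.0], [0.6, 0.8]]

print(similarity_reward([1.0, 0.0], targets))   # same direction as a target
print(similarity_reward([-0.8, 0.6], targets))  # best cosine is 0 (orthogonal)
print(similarity_reward([-1.0, 0.0], targets))  # negative cosine with every target
```

Taking the maximum over targets means a candidate only needs to be close to one acceptable answer, which matches prompts that list several valid completions.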

---

**Hypothesis**

Initially, the pretrained model is not specialized for our analogy completion format.

After GRPO fine-tuning, I expect:

- The average embedding-based reward on analogies to increase.

- Generated completions to be more semantically aligned with expected answers.

With the KL penalty, the model should still behave like a language model and not completely drift into weird outputs.








In [1]:
# Cell 2: Setup and Model Loading

# importing libraries
import math
import random
import string
import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import AutoTokenizer,AutoModelForCausalLM

device = "cuda" if torch.cuda.is_available() else "cpu"
model_name = "EleutherAI/gpt-neo-125M"

tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.padding_side = "left"
if tokenizer.pad_token_id is None:
    tokenizer.pad_token_id = tokenizer.eos_token_id

# This is trainable policy model
policy_model = AutoModelForCausalLM.from_pretrained(model_name).to(device)

# Frozen reference model
ref_model = AutoModelForCausalLM.from_pretrained(model_name).to(device)
ref_model.eval()
for p in ref_model.parameters():
    p.requires_grad = False

n_params = sum(p.numel() for p in policy_model.parameters())
print(f"Loaded {model_name} with {n_params/1e6:.1f}M parameters")


Loaded EleutherAI/gpt-neo-125M with 125.2M parameters


In [2]:
# Cell 3: Dataset Generation

# Base analogy patterns: A, B, C, [list of acceptable D targets (taken from Wikipedia)]
base_analogies = [
    ("sun",      "bright",   "moon",    ["dim", "pale", "glowing", "luminous", "reflective"]),
    ("king",     "man",      "queen",   ["woman", "lady", "female"]),
    ("cat",      "kitten",   "dog",     ["puppy"]),
    ("teacher",  "school",   "doctor",  ["hospital", "clinic"]),
    ("rain",     "wet",      "snow",    ["cold", "white", "icy"]),
    ("fire",     "hot",      "ice",     ["cold", "freezing", "chilly"]),
    ("bird",     "fly",      "fish",    ["swim", "swimming"]),
    ("word",     "sentence", "note",    ["melody", "tune"]),
    ("ear",      "hear",     "eye",     ["see", "look", "watch"]),
    ("lion",     "courage",  "fox",     ["cunning", "clever", "sly"]),
    ("knife",    "cut",      "pen",     ["write", "scribble"]),
    ("car",      "road",     "boat",    ["water", "sea", "river"]),
    ("winter",   "cold",     "summer",  ["hot", "warm"]),
    ("seed",     "plant",    "egg",     ["bird", "chick"]),
    ("up",       "down",     "left",    ["right"]),
    ("strong",   "strength", "wise",    ["wisdom", "insight"]),
    ("mother",   "parent",   "son",     ["child", "kid"]),
    ("glass",    "transparent","brick", ["opaque", "solid"]),
    ("bee",      "honey",    "cow",     ["milk"]),
    ("author",   "book",     "composer",["music", "symphony", "song"]),
]

def build_dataset(repeats: int = 40):
    data = []
    for i in range(repeats):
        for A, B, C, targets in base_analogies:
            prompt = f"{A} is to {B} as {C} is to"
            data.append({"prompt": prompt, "targets": targets})
    random.shuffle(data)
    return data

train_data = build_dataset(repeats=40)   # 20 analogies x 40 repeats = 800 samples
eval_data  = build_dataset(repeats=5)    # 20 analogies x 5 repeats = 100 samples (same prompts as training)

print("Dataset sizes:")
print("Training:", len(train_data))
print("Evaluation:", len(eval_data))

print("\nExample prompts:")
for ex in train_data[:3]:
    print(ex)



Dataset sizes:
Training: 800
Evaluation: 100

Example prompts:
{'prompt': 'ear is to hear as eye is to', 'targets': ['see', 'look', 'watch']}
{'prompt': 'word is to sentence as note is to', 'targets': ['melody', 'tune']}
{'prompt': 'winter is to cold as summer is to', 'targets': ['hot', 'warm']}


In [3]:
# Cell 4: Reward Function Implementation

def extract_first_word(text: str) -> str:
    text = text.strip()
    if not text:
        return ""
    tokens = text.split()
    if len(tokens) == 0:
        return ""
    w = tokens[0].strip(string.punctuation).lower()
    return w

def reward(prompts, completions, targets_list):
    rewards = []
    for comp, targets in zip(completions, targets_list):
        can_word = extract_first_word(comp)
        targets_lower = [t.lower() for t in targets]
        if can_word in targets_lower:
            rewards.append(1.0)
        else:
            rewards.append(0.0)

    return torch.tensor(rewards, dtype=torch.float32, device=device)

#Sanity Check
test_prompt = "sun is to bright as moon is to"
test_targets = [["dim", "pale", "glowing", "luminous", "reflective"]]
test_completions = ["dim", "tall", "luminous"]

print("\nSanity check:")
for comp in test_completions:
    r = reward([test_prompt], [comp], test_targets).item()
    print(f"Completion: {comp:<15} -> Reward: {r:.1f}")




Sanity check:
Completion: dim             -> Reward: 1.0
Completion: tall            -> Reward: 0.0
Completion: luminous        -> Reward: 1.0


In [4]:
# Cell 5: GRPO Implementation

G = 4                 #group size
beta_kl = 0.1         #KL regularization strength
eps = 1e-5

def grpo_step(batch, return_details=False):

    prompts = [ex["prompt"] for ex in batch]
    targets_list = [ex["targets"] for ex in batch]
    batch_size = len(prompts)

    all_log_probs = []
    all_rewards   = []
    all_kls       = []
    all_completions_groups = []

    prompt_enc = tokenizer(prompts, return_tensors="pt", padding=True).to(device)
    prompt_input_ids = prompt_enc["input_ids"]
    prompt_lens = prompt_input_ids.ne(tokenizer.pad_token_id).sum(dim=1)

    for g in range(G):
        enc = tokenizer(prompts, return_tensors="pt", padding=True).to(device)
        with torch.no_grad():
            gen_ids = policy_model.generate(
                **enc,
                max_new_tokens=1,
                do_sample=True,
                temperature=1.0,
                pad_token_id=tokenizer.pad_token_id
            )

        full_texts = tokenizer.batch_decode(gen_ids, skip_special_tokens=True)
        completions = [
            full[len(prompt):].strip()
            for full, prompt in zip(full_texts, prompts)
        ]
        all_completions_groups.append(completions)

        rewards = reward(prompts, completions, targets_list)
        all_rewards.append(rewards)


        seq_input = gen_ids[:, :-1]
        seq_labels = gen_ids[:, 1:]
        attn_mask = (seq_input != tokenizer.pad_token_id).long()

        outputs = policy_model(input_ids=seq_input, attention_mask=attn_mask)
        logits = outputs.logits
        log_probs_all = F.log_softmax(logits, dim=-1)
        token_logp = log_probs_all.gather(-1, seq_labels.unsqueeze(-1)).squeeze(-1)

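        # Sum token log-probs from position pl-1 onward for each sequence.
        # NOTE: prompts are left-padded, so every prompt ends at column P-1, where
        # P = prompt_input_ids.shape[1]. prompt_lens holds the unpadded lengths, so for
        # prompts shorter than P this slice also includes log-probs of prompt tokens.
        seq_logps = []
        for i in range(batch_size):
            pl = prompt_lens[i].item()
            lp = token_logp[i, pl-1:].sum()
            seq_logps.append(lp)
        seq_logps = torch.stack(seq_logps)
        all_log_probs.append(seq_logps)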

        # KL estimate vs reference model
        with torch.no_grad():
            ref_outputs = ref_model(input_ids=seq_input, attention_mask=attn_mask)
            ref_logits = ref_outputs.logits
            ref_log_probs_all = F.log_softmax(ref_logits, dim=-1)
            ref_token_logp = ref_log_probs_all.gather(-1, seq_labels.unsqueeze(-1)).squeeze(-1)

            ref_seq_logps = []
            for i in range(batch_size):
                pl = prompt_lens[i].item()
                lp = ref_token_logp[i, pl-1:].sum()
                ref_seq_logps.append(lp)
            ref_seq_logps = torch.stack(ref_seq_logps)

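        # Single-sample KL estimate: log pi(y|x) - log pi_ref(y|x).
        # Individual values can be negative.
        kl_estimate = (seq_logps - ref_seq_logps)
        all_kls.append(kl_estimate)

    # Stack the per-sample tensors; each has shape [G, batch_size]
    log_probs = torch.stack(all_log_probs, dim=0)
    rewards = torch.stack(all_rewards, dim=0)
    kls = torch.stack(all_kls, dim=0)

    # Group statistics per prompt (dim 0 indexes the G samples of each prompt)
    mean_rewards = rewards.mean(dim=0, keepdim=True)
    std_rewards  = rewards.std(dim=0, unbiased=False, keepdim=True) + eps

    # If all G samples of a prompt get the same reward, its advantages are all 0
    advantages = (rewards - mean_rewards) / std_rewards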

    loss_policy = -(advantages * log_probs).mean()
    loss_kl = kls.mean()
    loss_total = loss_policy + beta_kl * loss_kl

    avg_reward = rewards.mean().item()

    if return_details:
        return loss_total, loss_policy.detach(), loss_kl.detach(), advantages.detach(), avg_reward, all_completions_groups
    else:
        return loss_total, avg_reward

# Testing one GRPO step
test_batch = train_data[:4]
loss, loss_pol, loss_kl, A, avg_r, comp_groups = grpo_step(test_batch, return_details=True)

print("Advantages shape:", A.shape)
print("Initial total loss:", loss.item())
print("Initial avg reward:", avg_r)


Advantages shape: torch.Size([4, 4])
Initial total loss: -0.9311904311180115
Initial avg reward: 0.3125
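
One detail of `grpo_step` worth checking is which columns of `token_logp` are summed. The tokenizer pads on the left, so after generating one token every row has the same layout: pads, then the prompt, then the new token in the last column. The sketch below uses plain lists with made-up log-probabilities (the numbers and the helper `sum_from` are illustrative, not taken from the notebook) to compare the slice `pl-1:` based on the unpadded length with a slice based on the padded width `P`.

```python
def sum_from(row, start):
    """Sum of row[start:], mirroring token_logp[i, start:].sum()."""
    return sum(row[start:])

# Two left-padded prompts with padded width P = 5, plus one generated token each.
# Row 0 has 5 real prompt tokens; row 1 has 3 real tokens and 2 pads on the left.
P = 5
prompt_lens = [5, 3]

# token_logp[i][j] is the log-prob of the token at column j + 1 of the padded
# sequence, so each row has P entries and only the last entry (index P - 1)
# belongs to the generated token. Values are made up for illustration.
token_logp = [
    [-2.0, -1.5, -3.0, -2.5, -0.7],
    [-9.0, -9.0, -1.0, -2.0, -0.4],
]

# Slice used in the notebook: start at the unpadded prompt length minus one.
notebook_slice = [sum_from(row, pl - 1) for row, pl in zip(token_logp, prompt_lens)]

# Slice based on the padded width: only the generated token is counted.
padded_width_slice = [sum_from(row, P - 1) for row in token_logp]

print("unpadded-length slice:", notebook_slice)
print("padded-width slice:   ", padded_width_slice)
```

For the longest prompt in a batch the two slices agree; for shorter prompts the first slice also adds prompt-token terms. In `grpo_step` the padded width would be `prompt_input_ids.shape[1]`. The logged losses and rewards in this notebook were produced with the original slice.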


In [7]:
# Cell 6: Training Loop

optimizer = torch.optim.AdamW(policy_model.parameters(), lr=1e-5)

steps   = 300
batch_size  = 8
log_every   = 20

policy_model.train()

for step in range(1, steps + 1):
    batch = random.sample(train_data, batch_size)
    loss, avg_r = grpo_step(batch, return_details=False)

    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

    if step % log_every == 0:
        print(f"[Step {step:03d}] Loss = {loss.item():.4f} -  Avg Reward = {avg_r:.3f}")

[Step 020] Loss = -76.3048 -  Avg Reward = 0.562
[Step 040] Loss = 0.2952 -  Avg Reward = 0.469
[Step 060] Loss = -86.7424 -  Avg Reward = 0.875
[Step 080] Loss = -76.7771 -  Avg Reward = 0.875
[Step 100] Loss = -69.2336 -  Avg Reward = 0.969
[Step 120] Loss = -0.0240 -  Avg Reward = 0.688
[Step 140] Loss = 0.4905 -  Avg Reward = 1.000
[Step 160] Loss = 0.6562 -  Avg Reward = 0.625
[Step 180] Loss = -47.9028 -  Avg Reward = 0.875
[Step 200] Loss = 0.5123 -  Avg Reward = 0.594
[Step 220] Loss = 0.3055 -  Avg Reward = 0.594
[Step 240] Loss = 0.4250 -  Avg Reward = 0.625
[Step 260] Loss = 0.4413 -  Avg Reward = 0.781
[Step 280] Loss = -173.8563 -  Avg Reward = 0.812
[Step 300] Loss = -117.2856 -  Avg Reward = 0.875


In [8]:
# Cell 7: Evaluation and Generation

@torch.no_grad()
def eval_model(model, dataset, num_samples=100):
    # Remember the current mode so the frozen reference model is not left in train mode
    was_training = model.training
    model.eval()
    sample = random.sample(dataset, min(num_samples, len(dataset)))
    total_reward = 0.0

    for ex in sample:
        prompt = ex["prompt"]
        targets = ex["targets"]

        enc = tokenizer(prompt, return_tensors="pt").to(device)
        # Greedy decoding of a single new token
        gen_ids = model.generate(
            **enc,
            max_new_tokens=1,
            do_sample=False,
            pad_token_id=tokenizer.pad_token_id
        )
        full = tokenizer.decode(gen_ids[0], skip_special_tokens=True)
        completion = full[len(prompt):].strip()

        r = reward([prompt], [completion], [targets]).item()
        total_reward += r

    avg = total_reward / len(sample)
    # Restore whatever mode the model was in before evaluation
    model.train(was_training)
    return avg

#Quantitative Evaluation
baseline_reward = eval_model(ref_model,  eval_data, num_samples=100)
finetuned_reward = eval_model(policy_model, eval_data, num_samples=100)

print("Quantitative Evaluation")
print(f"Baseline Avg Reward: {baseline_reward:.3f}")
print(f"GRPO Avg Reward: {finetuned_reward:.3f}")

#Qualitative Examples
print("Qualitative Examples: ")
for i in range(5):
    ex = eval_data[i]
    prompt = ex["prompt"]
    targets = ex["targets"]

    #Baseline
    with torch.no_grad():
        enc = tokenizer(prompt, return_tensors="pt").to(device)
        base_ids = ref_model.generate(
            **enc,
            max_new_tokens=1,
            do_sample=False,
            pad_token_id=tokenizer.pad_token_id
        )
        base_full = tokenizer.decode(base_ids[0], skip_special_tokens=True)
        base_comp = base_full[len(prompt):].strip()
        base_reward = reward([prompt], [base_comp], [targets]).item()

    #fine-tuning
    with torch.no_grad():
        enc = tokenizer(prompt, return_tensors="pt").to(device)
        ft_ids = policy_model.generate(
            **enc,
            max_new_tokens=1,
            do_sample=False,
            pad_token_id=tokenizer.pad_token_id
        )
        ft_full = tokenizer.decode(ft_ids[0], skip_special_tokens=True)
        ft_comp = ft_full[len(prompt):].strip()
        ft_reward = reward([prompt], [ft_comp], [targets]).item()

    print(f"\nPrompt:   {prompt}")
    print(f"Targets:  {targets}")
    print(f"Baseline → '{base_comp}' (reward={base_reward:.3f})")
    print(f"Fine-tuned → '{ft_comp}' (reward={ft_reward:.3f})")


Quantitative Evaluation
Baseline Avg Reward: 0.150
GRPO Avg Reward: 0.800
Qualitative Examples: 

Prompt:   rain is to wet as snow is to
Targets:  ['cold', 'white', 'icy']
Baseline → 'fall' (reward=0.000)
Fine-tuned → 'cold' (reward=1.000)

Prompt:   knife is to cut as pen is to
Targets:  ['write', 'scribble']
Baseline → 'cut' (reward=0.000)
Fine-tuned → 'write' (reward=1.000)

Prompt:   teacher is to school as doctor is to
Targets:  ['hospital', 'clinic']
Baseline → 'school' (reward=0.000)
Fine-tuned → 'clinic' (reward=1.000)

Prompt:   ear is to hear as eye is to
Targets:  ['see', 'look', 'watch']
Baseline → 'see' (reward=1.000)
Fine-tuned → 'see' (reward=1.000)

Prompt:   word is to sentence as note is to
Targets:  ['melody', 'tune']
Baseline → 'sentence' (reward=0.000)
Fine-tuned → 'phasis' (reward=0.000)


**Cell 8: Analysis**

1. Did the Model Learn?

- The **average reward** increased substantially from the baseline reference model to the fine-tuned GRPO model.
- Here the reward is **binary exact match** (1 if the first generated word is exactly one of the target words, 0 otherwise), so the average reward directly reflects the fraction of correct answers.

Quantitatively:
- **Baseline Avg Reward:** 0.150  
- **GRPO Avg Reward:** 0.800  

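This means the model went from completing about **15%** of the evaluation prompts correctly to about **80%**, a large improvement. Because the evaluation set repeats the same 20 analogies used for training and decoding is greedy, these numbers correspond to 3 of 20 distinct prompts before training and 16 of 20 after. They measure how well the policy fits the training prompts, not generalization to unseen analogies.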

Qualitatively (from the printed examples):
- The baseline often repeats a word from the prompt or outputs a generic word, e.g.  
  - `"cut"` for *knife:cut :: pen:?*  
  - `"school"` for *teacher:school :: doctor:?*  
  Both of these receive reward 0.
- The GRPO-trained model much more often returns the intended targets:
  - `"write"` for *knife:cut :: pen:?* (reward = 1.0)  
  - `"clinic"` for *teacher:school :: doctor:?* (reward = 1.0)  

There are still some misses, such as:
- `"phasis"` for *word:sentence :: note:?*, where the targets are `["melody","tune"]`. Generation is limited to one new token, so this is a subword fragment rather than a word, and it receives reward 0.

Related but non-target answers also get no credit: `"cold"` for *sun:bright :: moon:?* would score 0, because the targets are `["dim","pale","glowing","luminous","reflective"]`. So not all analogies are solved even after training, especially when there are multiple plausible but non-target answers.

---

2. Reward Hacking?

- In this final setup, I used a **strict exact-match reward** rather than an embedding-based similarity reward.
- Because the model only gets reward 1.0 when the first generated word is *exactly* one of the target words, it is harder for it to “hack” the reward by outputting vaguely related words.
- Outputs that are related but not exact matches (e.g., `"cold"` for the moon analogy) are scored as 0 and therefore **not reinforced**.

So, compared to a softer semantic reward, there is **less obvious reward hacking** here. Instead, the main limitation is that the model sometimes guesses a reasonable but non-target word and simply gets no credit.
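
To make the strictness concrete, here is a condensed restatement of the exact-match rule (the names `first_word` and `exact_match_reward` are mine; the logic follows `extract_first_word` and `reward` from Cell 4). It shows that casing and trailing punctuation are forgiven, while a related word or an inflected form is not.

```python
import string

def first_word(text):
    """Lower-cased first whitespace-separated token, surrounding punctuation stripped."""
    tokens = text.strip().split()
    return tokens[0].strip(string.punctuation).lower() if tokens else ""

def exact_match_reward(completion, targets):
    """1.0 if the first word is exactly one of the targets, else 0.0."""
    return 1.0 if first_word(completion) in {t.lower() for t in targets} else 0.0

moon_targets = ["dim", "pale", "glowing", "luminous", "reflective"]

# "Pale." matches after normalization; "cold" is related but not a target;
# "paler" is an inflected form of a target; the empty string has no first word.
for completion in ["Pale.", "cold", "paler", ""]:
    print(repr(completion), exact_match_reward(completion, moon_targets))
```

The trade-off is the one described above: the rule is hard to game, but any reasonable answer outside the target list is treated the same as a wrong one.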

---

3. Effect of Group Size

- I used **G = 4** samples per prompt in the GRPO step.
- Larger G generally:
  - Gives a more accurate estimate of which completions are “above average” within a group.
  - Stabilizes the advantage normalization (mean/std of rewards per prompt).
- But larger G also:
  - Increases compute cost per step.
  - Requires more GPU memory and time.

If G is too small (e.g., G = 2), the variance of the advantages becomes higher and training may become noisier and less stable. In this experiment, G = 4 was sufficient to obtain a strong improvement from 0.150 to 0.800 average reward.
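
With a binary reward, group size also determines how often a group carries any learning signal: if all G samples for a prompt get the same reward, every advantage in that group is 0. The sketch below reproduces the normalization from Cell 5 in plain Python (using `statistics.pstdev` in place of `std(unbiased=False)`) and adds a small helper, `p_mixed`, which is my own illustration. It assumes the G samples are independent with a fixed per-sample success probability `p`, which is a simplification.

```python
from statistics import mean, pstdev

EPS = 1e-5  # plays the same role as `eps` in Cell 5

def group_advantages(rewards, eps=EPS):
    """Normalize one prompt's rewards by the group mean and population std."""
    mu = mean(rewards)
    sigma = pstdev(rewards) + eps
    return [(r - mu) / sigma for r in rewards]

def p_mixed(p, G):
    """Probability that G independent 0/1 rewards are not all identical."""
    return 1.0 - p ** G - (1.0 - p) ** G

# One success among four samples: that sample gets a positive advantage.
print(group_advantages([1.0, 0.0, 0.0, 0.0]))

# Identical rewards: every advantage is 0, so the policy term has no gradient.
print(group_advantages([0.0, 0.0, 0.0, 0.0]))
print(group_advantages([1.0, 1.0, 1.0, 1.0]))

# Chance that a group is informative, for an illustrative p = 0.15.
for G in (2, 4, 8):
    print(G, round(p_mixed(0.15, G), 3))
```

Under these assumptions, a larger group is more likely to contain both a success and a failure, which is one concrete way a larger G helps when rewards are sparse.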

---

4. KL Regularization

- The KL penalty with the frozen reference model:
  - Prevents the policy from drifting too far from the original language model.
  - Encourages the updated policy to remain “close” to the reference while still improving on the analogy reward.
- Without KL, the model might:
  - Collapse to strange completions that happen to satisfy the reward occasionally.
  - Overfit to high-reward patterns and lose general language quality.

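In this run, the fine-tuned model still mostly produces recognizable English words while becoming much better at the analogy task; the exception in the printed examples is the fragment `"phasis"`. This is consistent with the KL penalty limiting drift, although the notebook does not include a run without the KL term to confirm its effect, and the logged loss still swings between roughly +0.7 and -174 across steps.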
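
The KL term in Cell 5 is the per-sample difference `seq_logps - ref_seq_logps`. Averaged over samples drawn from the policy, this estimates the KL divergence from the policy to the reference, but individual values can be negative. A commonly used alternative is the estimator `r - 1 - log r` with `r = pi_ref / pi`, which is non-negative for every sample. The sketch below compares the two on made-up log-probabilities; the function names `k1` and `k3` are labels I chose, and `k3` is shown as an alternative, not as what this notebook ran.

```python
import math

def k1(logp_policy, logp_ref):
    """Estimator used in Cell 5: log pi - log pi_ref (can be negative)."""
    return logp_policy - logp_ref

def k3(logp_policy, logp_ref):
    """Alternative estimator: r - 1 - log r with r = pi_ref / pi (never negative)."""
    log_r = logp_ref - logp_policy
    return math.exp(log_r) - 1.0 - log_r

# Made-up log-probabilities of one sampled token under the policy and the reference.
pairs = [(-1.0, -2.0), (-2.0, -1.0), (-1.5, -1.5)]
for lp, lr in pairs:
    print(f"policy={lp:+.1f} ref={lr:+.1f}  k1={k1(lp, lr):+.3f}  k3={k3(lp, lr):.3f}")
```

A penalty that can go negative on a single sample lets the total loss drop when the policy assigns lower probability than the reference to what it sampled; a non-negative estimator avoids that particular effect.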

---

5. Overall

- The experiment demonstrates **GRPO** on a **toy analogy-completion task** with a **binary exact-match reward**.
- The key features are:
  - A simple, auto-evaluable reward that checks whether the first generated word exactly matches one of the target words.
  - **Group-normalized advantages** within each prompt (comparing multiple sampled completions).
  - **KL regularization** to anchor the policy to a reference model and avoid extreme drift.
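
For reference, the objective minimized in Cell 5 can be written out as follows. The notation is mine and is read off from the code: $x_i$ is the $i$-th prompt in a batch of size $B$, $y_{g,i}$ is the $g$-th sampled completion for it, $r_{g,i}$ is its reward, and $\log \pi_\theta(y_{g,i} \mid x_i)$ stands for the summed log-probability `seq_logps` as computed in `grpo_step`. In this run $G = 4$, $\beta = 0.1$ and $\epsilon = 10^{-5}$.

$$
\mu_i = \frac{1}{G}\sum_{g=1}^{G} r_{g,i}, \qquad
\sigma_i = \sqrt{\frac{1}{G}\sum_{g=1}^{G} \left(r_{g,i} - \mu_i\right)^2}, \qquad
A_{g,i} = \frac{r_{g,i} - \mu_i}{\sigma_i + \epsilon}
$$

$$
\mathcal{L}(\theta) =
-\frac{1}{GB}\sum_{g=1}^{G}\sum_{i=1}^{B} A_{g,i}\,\log \pi_\theta(y_{g,i} \mid x_i)
\;+\;
\beta \cdot \frac{1}{GB}\sum_{g=1}^{G}\sum_{i=1}^{B}
\left[\log \pi_\theta(y_{g,i} \mid x_i) - \log \pi_{\mathrm{ref}}(y_{g,i} \mid x_i)\right]
$$

The first term is a policy-gradient loss weighted by group-normalized advantages, with no probability-ratio clipping; the second is the single-sample KL estimate against the frozen reference model.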

