# Exercise 1

Recreate the RLHF loop in your own words: first supervised fine-tune (SFT) a model, then train a reward model on comparisons, then run PPO. Why is the KL divergence term (keeping $\pi_\theta$ close to a reference model $\pi_{\text{ref}}$) crucial? What happens if $\beta$ (the KL penalty coefficient) is set too low or zero?

## Solution

**RLHF Loop**

1. **SFT:** Fine-tune a pretrained model on demonstration data so that the starting policy already follows instructions and has a decent style.

2. **Reward Modeling:** Train a Reward Model (RM) on human preference comparisons between candidate answers. Freeze the RM after this step, and use it to score rewards on the rollouts.

3. **PPO:** Run PPO to push the policy toward outputs that the RM scores higher. PPO trains two networks: an **actor** (the policy) and a **critic** (the **value function**). The critic is not another reward model; it learns to predict the expected shaped return (reward minus KL) and is trained continuously (not frozen), alongside the actor.
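Step 2 can be made concrete with a pairwise loss on preference data. The sketch below assumes a Bradley-Terry style objective, $-\log \sigma(r_{\text{chosen}} - r_{\text{rejected}})$, which the solution above does not specify; the function name `pairwise_rm_loss` and the toy scores are hypothetical.

```python
import numpy as np


def pairwise_rm_loss(r_chosen, r_rejected):
    """Mean pairwise loss: -log sigmoid(r_chosen - r_rejected).

    Assumes a Bradley-Terry preference model (an assumption, not taken
    from the solution text).
    """
    margin = (np.asarray(r_chosen, dtype=np.float64)
              - np.asarray(r_rejected, dtype=np.float64))
    # -log(sigmoid(m)) == log(1 + exp(-m)); logaddexp is numerically stable.
    return float(np.mean(np.logaddexp(0.0, -margin)))


# Made-up RM scores for three (chosen, rejected) answer pairs.
r_chosen = np.array([1.2, 0.3, 2.0])
r_rejected = np.array([0.4, 0.5, -1.0])
print("pairwise loss:", pairwise_rm_loss(r_chosen, r_rejected))
```

The loss shrinks as the RM scores the preferred answer further above the rejected one, which is all the comparison data asks of it.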

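The **KL term** relative to the reference model turns PPO into a softly constrained optimization problem: *“maximize reward while staying close to the original model.”* This matters because the RM is only a learned proxy for human preference and is reliable only near the kind of outputs it was trained on; the further the policy drifts from $\pi_{\text{ref}}$, the less its scores can be trusted.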
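One common way to apply the KL term is to fold it into the per-token reward that PPO sees. The sketch below assumes that convention: a per-token penalty $\beta\,(\log \pi_\theta - \log \pi_{\text{ref}})$ on the sampled tokens, with the RM score added on the final token. The function name `kl_shaped_rewards` and all numbers are hypothetical.

```python
import numpy as np


def kl_shaped_rewards(logp_policy, logp_ref, rm_score, beta=0.1):
    """Per-token shaped reward: -beta * (log pi_theta - log pi_ref),
    plus the scalar RM score on the last token (assumed convention)."""
    logp_policy = np.asarray(logp_policy, dtype=np.float64)
    logp_ref = np.asarray(logp_ref, dtype=np.float64)
    # Sample-based estimate of the KL contribution of each generated token.
    kl_per_token = logp_policy - logp_ref
    rewards = -beta * kl_per_token
    rewards[-1] += rm_score
    return rewards


# Toy log-probabilities of the sampled tokens under each model.
logp_policy = [-1.0, -0.5, -0.2]
logp_ref = [-1.2, -1.0, -1.5]

print("beta=0.1:", kl_shaped_rewards(logp_policy, logp_ref, rm_score=2.0, beta=0.1))
print("beta=0.0:", kl_shaped_rewards(logp_policy, logp_ref, rm_score=2.0, beta=0.0))
```

With `beta=0.0` only the RM score remains, so nothing in the reward discourages the policy from moving arbitrarily far from $\pi_{\text{ref}}$.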

### Consequences of a Low $\beta$

If $\beta$ is too low or zero, the policy will drift aggressively toward patterns that “spike” the reward model, even if they decrease real-world usefulness. This results in reward hacking, characterized by:

* **Repeated Structures:** Exploiting structural patterns the RM likes.

* **Confident Nonsense:** Hallucinating facts that sound convincing.

* **Evasive Disclaimers:** Over-relying on safe but unhelpful responses.

This drift can create a feedback loop: the RM score increases while human preference decreases, leading to instability, mode collapse, and degraded general language quality.


# Exercise 2

Coding: Implement the Generalized Advantage Estimator (GAE) for PPO. Given a sequence of rewards and value estimates, code the calculation of advantages with a decay parameter $\lambda$. Verify your implementation on a small synthetic sequence by comparing to the definition.

## Solution

```python
import numpy as np


def compute_gae(rewards, values, dones=None, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation (GAE-λ), backward recursion.

    Formula (recursive form):
        δ_t = r_t + γ · V(s_{t+1}) · (1 - d_t) - V(s_t)
        A_t = δ_t + γ · λ · (1 - d_t) · A_{t+1}

    Complexity: O(T) time, O(T) space.

    Args:
        rewards: (T,) array of rewards at each timestep.
        values:  (T+1,) array of value estimates; includes bootstrap V(s_T).
        dones:   (T,) array, 1 if episode ended after step t (optional).
        gamma:   Discount factor (default 0.99).
        lam:     GAE λ for bias-variance tradeoff (default 0.95).

    Returns:
        adv: (T,) array of advantage estimates.
    """
    rewards = np.asarray(rewards, dtype=np.float64)
    values = np.asarray(values, dtype=np.float64)
    T = rewards.shape[0]

    if dones is None:
        dones = np.zeros(T, dtype=np.float64)
    else:
        dones = np.asarray(dones, dtype=np.float64)

    adv = np.zeros(T, dtype=np.float64)
    gae = 0.0

    for t in range(T - 1, -1, -1):
        nonterminal = 1.0 - dones[t]
        delta = rewards[t] + gamma * values[t + 1] * nonterminal - values[t]
        gae = delta + gamma * lam * nonterminal * gae
        adv[t] = gae

    return adv


# --- Example usage ---
rewards = np.array([1.0, 0.5, -0.2, 2.0])       # T=4
values = np.array([0.3, 0.1, 0.0, 0.2, 0.4])    # T+1=5 (includes bootstrap)
dones = np.array([0, 0, 1, 0])                  # episode ends after t=2

advantages = compute_gae(rewards, values, dones=dones, gamma=0.99, lam=0.95)
print("advantages:", advantages)
```

Output:

```
advantages: [ 0.99829195  0.2119     -0.2         2.196     ]
```
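The exercise also asks for a check against the definition. A minimal sketch of one way to do that: unroll the recursion into the explicit sum $A_t = \sum_{l \ge 0} (\gamma\lambda)^l \, \delta_{t+l}$, cut off at episode boundaries, and compare it with the backward pass. Here `gae_backward` is a condensed copy of `compute_gae` so that the block runs on its own, and `gae_from_definition` is a hypothetical helper name.

```python
import numpy as np


def gae_backward(rewards, values, dones, gamma=0.99, lam=0.95):
    """O(T) backward recursion (same logic as compute_gae)."""
    T = len(rewards)
    adv, gae = np.zeros(T), 0.0
    for t in range(T - 1, -1, -1):
        nonterminal = 1.0 - dones[t]
        delta = rewards[t] + gamma * values[t + 1] * nonterminal - values[t]
        gae = delta + gamma * lam * nonterminal * gae
        adv[t] = gae
    return adv


def gae_from_definition(rewards, values, dones, gamma=0.99, lam=0.95):
    """O(T^2) explicit sum of discounted TD residuals."""
    T = len(rewards)
    # One-step TD residuals delta_t for every timestep.
    deltas = rewards + gamma * values[1:] * (1.0 - dones) - values[:-1]
    adv = np.zeros(T)
    for t in range(T):
        weight = 1.0
        for l in range(T - t):
            adv[t] += weight * deltas[t + l]
            # The weight drops to zero once an episode boundary is crossed.
            weight *= gamma * lam * (1.0 - dones[t + l])
    return adv


rewards = np.array([1.0, 0.5, -0.2, 2.0])
values = np.array([0.3, 0.1, 0.0, 0.2, 0.4])
dones = np.array([0.0, 0.0, 1.0, 0.0])

a_fast = gae_backward(rewards, values, dones)
a_slow = gae_from_definition(rewards, values, dones)
assert np.allclose(a_fast, a_slow)
print("max abs difference:", np.max(np.abs(a_fast - a_slow)))
```

A further sanity check is `lam=0`, where both versions should reduce to the one-step TD residuals $\delta_t$.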


# Exercise 3

Experiment: (Thought experiment or optional coding) Consider an LLM fine-tuned with PPO on a reward model that highly values verbosity. Describe how the generated outputs might drift if $\beta$ is not high enough. How would you detect reward hacking in practice (e.g., the model finds loopholes in the reward)?

## Solution

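If PPO optimizes a reward model that strongly prefers verbosity and $\beta$ is not high enough, outputs will tend to drift toward ever longer answers: padding, repetition, restated questions, and extra caveats that raise the RM score without making the answer more useful. We can detect this in the following ways:

1. Inspect extremes: review top-scoring and bottom-scoring samples.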
2. Held-out evals: run fixed benchmarks / prompt suites not used in the RL/RM loop (true hold-out), watching for regressions in instruction-following, factuality, calibration, and length.
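Both checks can be backed by simple numbers logged during training. The sketch below is one possible setup, not a standard recipe: it measures how strongly RM score tracks response length, and flags checkpoints where the mean RM score rose while a held-out metric fell. The function names and the synthetic per-checkpoint values are hypothetical.

```python
import numpy as np


def length_reward_correlation(token_lengths, rm_scores):
    """Pearson correlation between response length and RM score."""
    return float(np.corrcoef(token_lengths, rm_scores)[0, 1])


def rm_vs_heldout_divergence(rm_means, heldout_means):
    """Boolean array, one entry per checkpoint transition: True where the
    mean RM score increased but the held-out metric decreased."""
    rm_delta = np.diff(np.asarray(rm_means, dtype=np.float64))
    heldout_delta = np.diff(np.asarray(heldout_means, dtype=np.float64))
    return (rm_delta > 0) & (heldout_delta < 0)


# Synthetic logs for five checkpoints (made-up values).
mean_lengths = np.array([120, 180, 260, 400, 650])     # tokens per response
mean_rm = np.array([0.8, 1.1, 1.5, 2.1, 2.9])          # mean RM score
mean_heldout = np.array([0.62, 0.64, 0.63, 0.58, 0.51])  # held-out eval score

print("length/RM correlation:", length_reward_correlation(mean_lengths, mean_rm))
print("RM up, held-out down:", rm_vs_heldout_divergence(mean_rm, mean_heldout))
```

A correlation near 1 together with repeated divergence flags suggests the policy is being paid for length rather than quality, which is the point at which the top-scoring samples deserve a manual read.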
