# Exercise 1

Conceptual: Why does a static preference dataset become insufficient as a policy improves? Explain using an example: initial model outputs are poor, so human preferences cover only easy mistakes; once the model stops making those, the old data is less relevant. How does online DPO address this?

## Solution


A static dataset is **tied to the behavior of the policy that generated it**. As the policy improves, two things happen:

1. **The dataset stops matching the policy’s error profile (distribution shift).**
   Early on, the model makes *obvious* mistakes, so humans mostly compare **bad vs. decent** outputs. After training, the policy rarely produces the “bad” outputs anymore, so those comparisons are no longer representative of what the policy currently does.

2. **The comparisons become weak training signal (saturation).**
   DPO updates are driven by the *margin* between preferred and rejected:
   $$
   \Delta = \big(\log \pi(y_w|x)-\log \pi(y_l|x)\big) - \big(\log \pi_{\text{ref}}(y_w|x)-\log \pi_{\text{ref}}(y_l|x)\big).
   $$
   If the improved policy already strongly prefers $y_w$ over $y_l$ on the old dataset, $\Delta$ is large and positive, the logistic loss $-\log \sigma(\beta \Delta)$ saturates, and gradients become small. So the old dataset gives diminishing returns.
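The saturation effect in point 2 can be checked numerically. The sketch below assumes the standard per-pair DPO loss $-\log \sigma(\beta \Delta)$, whose gradient with respect to $\Delta$ has magnitude $\beta\,\sigma(-\beta \Delta)$. The function name and the value $\beta = 0.1$ are illustrative choices, not taken from the exercise.

```python
import math

def dpo_loss_and_grad_weight(delta, beta=0.1):
    """Per-pair DPO loss and the size of its gradient w.r.t. the margin delta.

    loss        = -log(sigmoid(beta * delta))
    grad_weight = beta * sigmoid(-beta * delta)   (magnitude of d loss / d delta)
    beta=0.1 is an assumed, illustrative value.
    """
    z = beta * delta
    loss = math.log1p(math.exp(-z))            # equals -log(sigmoid(z))
    grad_weight = beta / (1.0 + math.exp(z))   # equals beta * sigmoid(-z)
    return loss, grad_weight

# As the policy's margin on an old pair grows, the pair contributes less and less.
for delta in [0.0, 10.0, 50.0, 100.0]:
    loss, weight = dpo_loss_and_grad_weight(delta)
    print(f"delta={delta:6.1f}  loss={loss:.6f}  grad_weight={weight:.6f}")
```

The gradient weight shrinks monotonically as the margin grows, which is why pairs the policy already separates well stop driving learning.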


Online DPO fixes the mismatch by making the data **on-policy (or near on-policy)** each iteration:

* **Generate candidates from the current policy** (often with some exploration to get diversity).

* **Collect fresh preferences** on *these* candidates.

* **Update with DPO on the new comparisons**, typically with **anchoring/regularization** to prevent drift.



# Exercise 2

Coding: Implement a simple simulation of OFS-DPO. Use two copies of a model (e.g., two sets of parameters representing fast and slow). At each iteration: have the fast model generate an output for a query, label its quality with a simulated metric (e.g., a known reward function or a proxy judge), update the fast model with the DPO loss, and occasionally copy the fast weights to the slow model. Monitor a metric (such as the difference between fast and slow policy outputs, or their rewards over time) to verify that the fast model adapts quickly and the slow model provides a stabilizing anchor.

## Solution

#implementation
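One possible minimal sketch follows. It implements the simplified loop described in the exercise (a hard copy of fast into slow every few steps), which is not necessarily the published OFS-DPO algorithm. Everything concrete here is an assumption made for illustration: the policy is a softmax over five discrete outputs for a single query, the reward table is made up, two candidates are sampled per step because the DPO loss needs a pair, and `beta`, `lr`, and `sync_every` are arbitrary values.

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def expected_reward(logits, rewards):
    return sum(p * r for p, r in zip(softmax(logits), rewards))

def simulate_ofs_dpo(steps=2000, sync_every=200, beta=1.0, lr=0.5, seed=0):
    rng = random.Random(seed)
    rewards = [0.0, 0.25, 0.5, 0.75, 1.0]  # hypothetical known reward per output
    k = len(rewards)
    fast = [0.0] * k   # logits of the fast policy (updated every step)
    slow = [0.0] * k   # logits of the slow policy (DPO reference / anchor)
    history = []       # (step, fast expected reward, slow expected reward)
    for step in range(1, steps + 1):
        # Sample two candidate outputs from the current fast policy.
        a, b = rng.choices(range(k), weights=softmax(fast), k=2)
        if rewards[a] != rewards[b]:
            # Label the pair with the simulated reward: w is preferred, l is rejected.
            w, l = (a, b) if rewards[a] > rewards[b] else (b, a)
            # DPO margin; for a softmax policy the normalisers cancel,
            # so log pi(w) - log pi(l) is just the logit difference.
            delta = (fast[w] - fast[l]) - (slow[w] - slow[l])
            # Gradient step on -log sigmoid(beta * delta) w.r.t. the fast logits.
            weight = beta / (1.0 + math.exp(beta * delta))  # beta * sigmoid(-beta*delta)
            fast[w] += lr * weight
            fast[l] -= lr * weight
        if step % sync_every == 0:
            # Record the gap just before syncing, then copy fast into slow.
            history.append((step,
                            expected_reward(fast, rewards),
                            expected_reward(slow, rewards)))
            slow = list(fast)
    return history

for step, r_fast, r_slow in simulate_ofs_dpo():
    print(f"step={step:5d}  fast={r_fast:.3f}  slow={r_slow:.3f}  gap={r_fast - r_slow:+.3f}")
```

The printed gap between fast and slow expected reward is the monitoring metric: the fast policy moves between syncs, while the slow policy changes only at sync points and bounds how far each DPO update can pull the fast logits away from it.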

# Exercise 3

Discussion: The fast-slow approach mimics having an exploitative agent and a conservative baseline. In a production system like Sqwish, how might we implement this practically? (Hint: The “fast” could be a live updated prompt policy, and the “slow” could be a periodically retrained baseline that prevents the prompt strategy from drifting too far and causing bad experiences.) Propose a mechanism for deciding when to sync the slow model with the fast one (e.g., based on performance plateau or time interval).

## Solution

solution
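As one way to make the sync decision concrete, the sketch below combines the two triggers suggested in the exercise: a performance plateau and a time interval. The class name `SyncController`, its parameters, and the extra guard that the fast policy must currently beat the slow baseline are all hypothetical choices for illustration, not a description of any real system.

```python
from collections import deque

class SyncController:
    """Decides when to copy the fast policy into the slow baseline (hypothetical sketch)."""

    def __init__(self, window=50, min_gain=0.01, max_interval=500):
        self.window = window              # size of each comparison window
        self.min_gain = min_gain          # improvement below this counts as a plateau
        self.max_interval = max_interval  # force a sync check after this many steps
        self.rewards = deque(maxlen=2 * window)
        self.steps_since_sync = 0

    def observe(self, fast_reward):
        """Record one reward observed for the fast policy."""
        self.rewards.append(fast_reward)
        self.steps_since_sync += 1

    def should_sync(self, slow_reward):
        if not self.rewards:
            return False
        values = list(self.rewards)
        recent = values[-self.window:]
        recent_mean = sum(recent) / len(recent)
        # Guard: never promote a fast policy that is not beating the anchor.
        if recent_mean <= slow_reward:
            return False
        # Time-based trigger.
        if self.steps_since_sync >= self.max_interval:
            return True
        # Plateau trigger needs two full windows to compare.
        if len(values) < 2 * self.window:
            return False
        older = values[:self.window]
        older_mean = sum(older) / len(older)
        return recent_mean - older_mean < self.min_gain

    def mark_synced(self):
        """Call after the slow model has been updated from the fast one."""
        self.rewards.clear()
        self.steps_since_sync = 0
```

In use, the serving loop would call `observe` after each scored interaction, check `should_sync` with the slow baseline's measured reward, and call `mark_synced` once the copy has happened. The guard means a drifting fast policy is never promoted, so the slow baseline stays a safe fallback.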

# Exercise 4

Discussion (Context vs. Weight Adaptation): The fast-slow paradigm updates model weights online. An alternative is updating contexts while keeping weights frozen: ACE (Day 10) does exactly this with evolving playbooks. Compare: weight-based (changes behavior fundamentally, requires gradients, risk of forgetting) vs. context-based (faster, interpretable, limited by context window, no forgetting). When might you prefer each? Consider: (1) dramatic domain shifts, (2) incorporating a single new policy, (3) regulated industries requiring auditability.

## Solution

..