# Exercise 1

Design: Formulate a bandit problem for retrieval in a QA system. Say you have two document indexes (DocsA and DocsB) and also the option to not retrieve at all. How would you set up: contexts (features of the query, like length or topic), actions (which index or none), and reward (e.g. +1 if the answer was correct and -0.1 per document retrieved to penalize cost). Is this reward structure multi-objective (accuracy vs. cost)? How would you incorporate the cost in a bandit reward or would you treat it separately?

## Solution

# Exercise 2

Coding: Given a small QA dataset and two retrieval strategies (e.g., keyword search vs. dense embedding search), simulate an adaptive retrieval policy. Implement a simple bandit (like $\epsilon$-greedy or Thompson Sampling) that chooses a strategy per question and receives a reward of 1 if the retrieved set contains the answer. Over many questions, track the bandit’s strategy selection proportions. Does it learn which strategy is generally better? Now introduce a context feature: e.g., questions containing dates might do better with keyword search. Modify the bandit to be contextual (e.g., LinUCB on a feature like “has date or number”). Check whether it learns a policy: keyword for date queries, dense for others.

## Solution
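
One possible starting point is sketched below. It is a minimal simulation, not a reference solution: the QA dataset is replaced by a synthetic environment, the success probabilities in `SUCCESS_PROB` are made-up numbers, and the contextual variant is a tabular $\epsilon$-greedy (one value table per context) standing in for LinUCB, which is enough for a single binary feature. The names `run_bandit`, `SUCCESS_PROB`, and `STRATEGIES` are hypothetical.

```python
import random

# Assumed (not measured) probability that the retrieved set contains the
# answer, keyed by (question has a date, strategy).
SUCCESS_PROB = {
    (True, "keyword"): 0.8,
    (True, "dense"): 0.5,
    (False, "keyword"): 0.4,
    (False, "dense"): 0.7,
}
STRATEGIES = ("keyword", "dense")


def run_bandit(num_questions=10000, epsilon=0.1, contextual=True, seed=0):
    """Epsilon-greedy over retrieval strategies.

    Returns (values, counts), both keyed by (context, strategy), where
    values holds the running mean reward and counts the number of pulls.
    """
    rng = random.Random(seed)
    counts = {}
    values = {}
    for _ in range(num_questions):
        has_date = rng.random() < 0.5  # synthetic context feature
        # The non-contextual bandit ignores the feature and shares one table.
        ctx = has_date if contextual else None
        if rng.random() < epsilon:
            arm = rng.choice(STRATEGIES)  # explore
        else:
            arm = max(STRATEGIES, key=lambda s: values.get((ctx, s), 0.0))  # exploit
        # Reward is 1 if the (simulated) retrieved set contains the answer.
        reward = 1.0 if rng.random() < SUCCESS_PROB[(has_date, arm)] else 0.0
        key = (ctx, arm)
        counts[key] = counts.get(key, 0) + 1
        old = values.get(key, 0.0)
        values[key] = old + (reward - old) / counts[key]  # incremental mean
    return values, counts


values, counts = run_bandit()
for ctx in (True, False):
    best = max(STRATEGIES, key=lambda s: values.get((ctx, s), 0.0))
    print(f"has_date={ctx}: prefers {best}, pulls={counts.get((ctx, best), 0)}")
```

Calling `run_bandit(contextual=False)` gives the non-contextual baseline, whose selection proportions can be read from `counts`. With the assumed probabilities the two strategies have equal average success overall, so the baseline has no clear winner, while the contextual version can separate date queries from the rest.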

# Exercise 3

Future Thinking: In Search-R1, an LLM is trained to decide when to call a search API during its reasoning. Why is this an RL problem and not a supervised one? (Hint: the optimal points to call search aren’t known in the training data; the model must discover them by trial and error, guided by a reward like final answer correctness.) If you were to integrate this into Sqwish, what kind of feedback signal could train such behavior? Describe a possible reward function for an LLM agent that can either answer directly or decide to issue a search query and then answer. How would you ensure it doesn’t over-use the search (to minimize latency) or under-use it (and risk being wrong)?

## Solution

The **Search-R1** framework treats the decision of *when* and *how* to search as an RL problem because the optimal search trajectory is not labeled. In supervised fine-tuning, you would need ground-truth sequences of tool calls, which are typically unavailable.

RL allows the agent to explore search calls and learn from an outcome-based signal: if the final answer is correct, actions along the trajectory receive positive feedback.

A simple reward design could be:

- **Outcome reward**: +1 for a correct answer, 0 for an incorrect one.
- **Search cost**: penalize each search call (e.g., -0.1 per call) or use a latency-based penalty to discourage over-use.

One example total reward is:

- If correct: \(r = 1 - 0.1 \cdot N\)
- If incorrect: \(r = -0.1 \cdot N\)

where \(N\) is the number of search calls.
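
The reward above can be written as a small function. This is only a sketch of that formula; the name `search_reward` and the `penalty` parameter are illustrative, with the default of 0.1 taken from the example.

```python
def search_reward(correct: bool, num_searches: int, penalty: float = 0.1) -> float:
    """Outcome reward (+1 if correct, else 0) minus a per-call search cost."""
    if num_searches < 0:
        raise ValueError("num_searches must be non-negative")
    outcome = 1.0 if correct else 0.0
    return outcome - penalty * num_searches


# A correct answer after two searches, and a wrong answer after three.
print(search_reward(True, 2))
print(search_reward(False, 3))
```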

Tuning the search-call penalty lets the agent learn to search only when necessary: a high penalty encourages answering from memory, while a low penalty encourages more frequent search. In a production system like Sqwish, you could further incorporate user feedback (thumbs-up/down) as an additional reward signal.
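
As a rough check on how the penalty sets the search threshold, assume (hypothetically) that the agent answers correctly with probability \(p_0\) without searching and with probability \(p_1\) after one search, and write \(c\) for the per-call penalty (0.1 in the example above). Under the reward defined earlier:

\[
\mathbb{E}[r \mid \text{no search}] = p_0,
\qquad
\mathbb{E}[r \mid \text{one search}] = p_1 - c
\]

So a single search pays off in expectation only when \(p_1 - p_0 > c\), that is, when the accuracy gain from searching exceeds the penalty. With \(c = 0.1\), a search is worthwhile only if it raises the chance of a correct answer by more than 10 percentage points.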

# Exercise 4

Connection to Context Engineering: Consider ACE's playbook (Day 10) as learned retrieval context: instead of retrieving from a static corpus, the system builds its own knowledge base through experience. How might you combine ACE with traditional RAG? One approach: use RAG for factual, up-to-date information while using an ACE playbook for procedural knowledge (strategies, patterns, failure modes). The playbook could even store meta-knowledge about retrieval itself, e.g., "For financial questions, retrieve from SEC filings first." Design a hybrid architecture.

## Solution

Hybrid architecture: use RAG for facts and an ACE playbook for procedures.

- RAG: factual grounding (internal docs, web, filings).
- ACE playbook: reusable procedures/heuristics (e.g., “SEC first for finance”, “cross-check numerics”).

Runtime:

1. Extract query features (domain, has numbers/dates, risk, latency budget).
2. Retrieve relevant playbook entries (procedure) and retrieve evidence from RAG (facts).
3. Router/planner (the “controller” for RAG) chooses a retrieval plan: sources, retrievers, top-k, stop rules, budget.
4. Execute the plan, then answer using evidence + the playbook checklist.
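
Steps 1 to 3 of the runtime could be sketched roughly as follows. Everything here is a hypothetical illustration: the `RetrievalPlan` fields, the single `PLAYBOOK` entry, the keyword-based `extract_features`, and the latency threshold are all assumptions, and a real router would likely be a learned policy rather than hand-written rules.

```python
from dataclasses import dataclass, field


@dataclass
class RetrievalPlan:
    sources: list
    top_k: int = 5
    checklist: list = field(default_factory=list)


# Hypothetical playbook entries: which source to try first for a domain,
# plus a checklist item to apply when answering.
PLAYBOOK = [
    {"domain": "finance", "source": "sec_filings", "check": "cross-check numerics"},
]


def extract_features(query: str) -> dict:
    """Step 1: crude keyword-based feature extraction (an assumption)."""
    q = query.lower()
    finance_terms = ("revenue", "10-k", "earnings")
    return {
        "domain": "finance" if any(t in q for t in finance_terms) else "general",
        "has_numbers": any(ch.isdigit() for ch in query),
    }


def plan_retrieval(query: str, latency_budget_ms: int = 1000) -> RetrievalPlan:
    """Steps 2-3: look up matching playbook entries, then build a plan."""
    features = extract_features(query)
    matched = [e for e in PLAYBOOK if e["domain"] == features["domain"]]
    # Playbook-recommended sources come first; a default index is the fallback.
    sources = [e["source"] for e in matched] + ["internal_docs"]
    top_k = 3 if latency_budget_ms < 500 else 5  # tighter budget, fewer documents
    return RetrievalPlan(sources, top_k, [e["check"] for e in matched])


print(plan_retrieval("What was ACME's revenue in 2023?"))
```

Step 4 (executing the plan and answering with the evidence plus the checklist) is left out because it depends on the retrievers and the LLM in use.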

Learning:

1. Log outcomes along with cost and latency.
2. Update the router policy from rewards (accuracy vs. latency).
3. Distill good runs into new or edited playbook rules (generate → reflect → curate).
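
The learning step could look something like the sketch below: it scores logged runs with an accuracy-minus-latency reward and proposes a "retrieve from X first" rule per domain. This is a crude statistical stand-in for rule curation, not ACE's actual procedure; the log keys (`domain`, `source`, `correct`, `latency_ms`), the `latency_weight`, and the `min_runs` threshold are all assumptions.

```python
from collections import defaultdict


def distill_rules(logs, latency_weight=0.001, min_runs=3):
    """Propose one 'try this source first' rule per domain from logged runs."""
    totals = defaultdict(lambda: [0.0, 0])  # (domain, source) -> [reward sum, runs]
    for run in logs:
        # Scalarized reward: correctness minus a latency cost (assumed weighting).
        reward = (1.0 if run["correct"] else 0.0) - latency_weight * run["latency_ms"]
        entry = totals[(run["domain"], run["source"])]
        entry[0] += reward
        entry[1] += 1

    best = {}  # domain -> (mean reward, source)
    for (domain, source), (total, n) in totals.items():
        if n < min_runs:
            continue  # too little evidence to propose a rule
        mean = total / n
        if domain not in best or mean > best[domain][0]:
            best[domain] = (mean, source)

    return {
        domain: f"For {domain} questions, retrieve from {source} first."
        for domain, (_, source) in best.items()
    }


logs = (
    [{"domain": "finance", "source": "sec_filings", "correct": True, "latency_ms": 200}] * 3
    + [{"domain": "finance", "source": "web", "correct": False, "latency_ms": 100}] * 3
)
print(distill_rules(logs))
```

Proposed rules would still need review (by a human or a reflection step) before being written into the playbook, since a small log can easily favor a source by chance.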