# Exercise 1

Design: Formulate a bandit problem for retrieval in a QA system. Say you have two document indexes (DocsA and DocsB) and also the option to not retrieve at all. How would you set up the contexts (features of the query, like length or topic), the actions (which index, or none), and the reward (e.g., +1 if the answer was correct and -0.1 per document retrieved to penalize cost)? Is this reward structure multi-objective (accuracy vs. cost)? Would you fold the cost into the bandit reward, or treat it separately?

## Solution
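
One way to make the formulation concrete, offered as a sketch rather than a reference solution: the context is a small feature vector derived from the query, the action set has three arms (DocsA, DocsB, no retrieval), and the reward folds accuracy and cost into one scalar using the example values from the exercise (+1 for a correct answer, -0.1 per retrieved document). The names `Action`, `Outcome`, `featurize`, and `scalar_reward`, and the specific features chosen, are illustrative assumptions.

```python
from dataclasses import dataclass
from enum import Enum

import numpy as np


class Action(Enum):
    NO_RETRIEVAL = 0
    DOCS_A = 1
    DOCS_B = 2


@dataclass
class Outcome:
    correct: bool   # was the final answer correct?
    n_docs: int     # documents retrieved (0 for NO_RETRIEVAL)


def featurize(query: str) -> np.ndarray:
    """Context vector: [bias, normalized length, contains a digit]."""
    n_tokens = len(query.split())
    has_digit = any(ch.isdigit() for ch in query)
    return np.array([1.0, min(n_tokens / 32.0, 1.0), float(has_digit)])


def scalar_reward(outcome: Outcome, cost_per_doc: float = 0.1) -> float:
    """Accuracy and cost in one number: +1 if correct, minus a per-document cost."""
    return float(outcome.correct) - cost_per_doc * outcome.n_docs


x = featurize("What was revenue in 2021?")
r_retrieve = scalar_reward(Outcome(correct=True, n_docs=3))
r_direct = scalar_reward(Outcome(correct=True, n_docs=0))
print(x, r_retrieve, r_direct)
```

The reward is multi-objective in the sense that it trades accuracy against cost. In this scalarized form, `cost_per_doc` sets the exchange rate between the two. The alternative is to keep the signals separate, for example by maximizing accuracy subject to a cost budget.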

# Exercise 2

Coding: Given a small QA dataset and two retrieval strategies (e.g., keyword search vs. dense embedding search), simulate an adaptive retrieval policy. Implement a simple bandit (like $\epsilon$-greedy or Thompson Sampling) that chooses a strategy per question and receives a reward of 1 if the retrieved set contained the answer. Over many questions, watch the bandit’s strategy selection proportions. Does it learn which strategy is generally better? Now, introduce a context feature: e.g., questions containing dates might do better with keyword search. Modify the bandit to be contextual (e.g., LinUCB on a feature like “has date or number”). See if it learns a policy: keyword for date queries, dense for others.

## Solution
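
The contextual part of the solution below imports `LinUCBAgent` from the Day 1 code, which is not reproduced in this section. For readers without that module, here is a minimal stand-in that matches only the calls the solution makes (`LinUCBAgent(d, alpha, lam)`, `select_arm(X)`, `update(x, r)`). The Day 1 implementation may differ in its details, so treat this as an assumption about its interface rather than a copy of it.

```python
import numpy as np


class LinUCBAgent:
    """Assumed interface: one shared ridge-regression model over d-dim arm features."""

    def __init__(self, d: int, alpha: float = 1.0, lam: float = 1.0):
        self.alpha = alpha
        self.A = lam * np.eye(d)   # regularized design matrix: lam*I + sum of x x^T
        self.b = np.zeros(d)       # sum of r * x

    def select_arm(self, X: np.ndarray) -> int:
        """X has shape (K, d), one feature row per arm; returns the highest-UCB arm."""
        A_inv = np.linalg.inv(self.A)
        theta = A_inv @ self.b                                    # ridge estimate
        means = X @ theta                                         # predicted reward per arm
        widths = np.sqrt(np.einsum("kd,de,ke->k", X, A_inv, X))   # sqrt(x^T A^-1 x) per arm
        return int(np.argmax(means + self.alpha * widths))

    def update(self, x: np.ndarray, r: float) -> None:
        """Fold one (features, reward) observation into the model."""
        self.A += np.outer(x, x)
        self.b += r * x


# Toy check: two arms with disjoint (block) features.
agent = LinUCBAgent(d=2, alpha=0.2)
X = np.eye(2)
for _ in range(10):
    agent.update(X[0], 1.0)   # arm 0 always pays off
    agent.update(X[1], 0.0)   # arm 1 never does
chosen = agent.select_arm(X)
print("chosen arm:", chosen)
```

With block features, as used in the solution, the shared model decomposes into one independent linear model per arm, because each arm only ever touches its own slice of the parameter vector.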

```python
import numpy as np
import random

# -----------------------------
# Data simulation (Exercise 2)
# -----------------------------
np.random.seed(42)
random.seed(42)

num_questions = 5000
has_number = np.random.rand(num_questions) < 0.35  # 35% queries contain numbers/dates

# Two retrieval strategies (arms):
# 0 = keyword search, 1 = dense search
def success_prob(strategy: int, hn: bool) -> float:
    if strategy == 0:   # keyword
        return 0.78 if hn else 0.45
    else:               # dense
        return 0.62 if hn else 0.74

def retrieval_cost(strategy: int) -> float:
    # dense is more expensive (latency/compute), so it must be more accurate to win
    return 0.05 if strategy == 0 else 0.15

def simulate_outcome(strategy: int, hn: bool):
    """
    Returns:
      success: 1.0 if retrieved set contained the answer else 0.0
      net_reward: success - cost  (cost-aware reward; matches the exercise's "penalize retrieval" idea)
    """
    p = success_prob(strategy, hn)
    success = 1.0 if (np.random.rand() < p) else 0.0
    net_reward = success - retrieval_cost(strategy)
    return success, net_reward


# -----------------------------------------
# Part A: Non-contextual bandit (ε-greedy)
# -----------------------------------------
epsilon = 0.1
K = 2

Q = np.zeros(K)         # estimated mean net reward per arm
N = np.zeros(K, dtype=int)

eg_actions = []
eg_successes = []
eg_net_rewards = []

for t in range(num_questions):
    # Warm start: try each arm once
    if t < K:
        a = t
    else:
        if np.random.rand() < epsilon:
            a = np.random.randint(K)
        else:
            a = int(np.argmax(Q))

    s, r = simulate_outcome(a, bool(has_number[t]))
    N[a] += 1
    Q[a] += (r - Q[a]) / N[a]  # incremental mean update

    eg_actions.append(a)
    eg_successes.append(s)
    eg_net_rewards.append(r)

eg_actions = np.array(eg_actions)
eg_successes = np.array(eg_successes)
eg_net_rewards = np.array(eg_net_rewards)

print("=== ε-greedy (non-contextual) ===")
print("Estimated mean net reward Q:", Q)
print("Action counts: keyword =", np.sum(eg_actions == 0), ", dense =", np.sum(eg_actions == 1))
print("Mean success rate:", float(np.mean(eg_successes)))
print("Mean net reward  :", float(np.mean(eg_net_rewards)))
print()


# -----------------------------------
# Part B: Contextual bandit (LinUCB)
# -----------------------------------
# Import the LinUCB implementation from Day 1.
# We use block features (same idea as Day 3) so each arm has its own linear model.
import sys
from pathlib import Path

REPO_ROOT = Path("/home/luigi/Programming/Onboarding")
sys.path.append(str(REPO_ROOT / "Day_1"))
from linucb import LinUCBAgent

K = 2
base_d = 2
D = K * base_d  # block-feature dimension

linucb = LinUCBAgent(d=D, alpha=0.2, lam=1.0)

ucb_actions = []
ucb_successes = []
ucb_net_rewards = []

for t in range(num_questions):
    # Base context feature vector: [1, has_number]
    x_base = np.array([1.0, 1.0 if has_number[t] else 0.0], dtype=np.float64)

    # Block features: each arm gets its own copy of x_base in a disjoint slice.
    X = np.zeros((K, D), dtype=np.float64)
    for a in range(K):
        X[a, a * base_d : (a + 1) * base_d] = x_base

    # Warm start: try each arm once
    if t < K:
        a = t
    else:
        a = linucb.select_arm(X)

    s, r = simulate_outcome(a, bool(has_number[t]))
    linucb.update(X[a], r)

    ucb_actions.append(a)
    ucb_successes.append(s)
    ucb_net_rewards.append(r)

ucb_actions = np.array(ucb_actions)
ucb_successes = np.array(ucb_successes)
ucb_net_rewards = np.array(ucb_net_rewards)

hn = has_number

print("\n=== LinUCB (contextual, block features) ===")
print("Action counts: keyword =", np.sum(ucb_actions == 0), ", dense =", np.sum(ucb_actions == 1))
print("Mean success rate:", float(np.mean(ucb_successes)))
print("Mean net reward  :", float(np.mean(ucb_net_rewards)))

# Policy diagnostics requested by the exercise
print("\n--- Conditional policy behavior ---")
print("P(keyword | has_number=True ) =", float(np.mean(ucb_actions[hn] == 0)))
print("P(dense   | has_number=False) =", float(np.mean(ucb_actions[~hn] == 1)))

print("\n--- Overall selection proportions ---")
print("Overall keyword share:", float(np.mean(ucb_actions == 0)))
print("Overall dense share  :", float(np.mean(ucb_actions == 1)))
```

Output:

```text
=== ε-greedy (non-contextual) ===
Estimated mean net reward Q: [0.52777778 0.54788584]
Action counts: keyword = 270 , dense = 4730
Mean success rate: 0.6914
Mean net reward  : 0.5468


=== LinUCB (contextual, block features) ===
Action counts: keyword = 1770 , dense = 3230
Mean success rate: 0.7642
Mean net reward  : 0.6496

--- Conditional policy behavior ---
P(keyword | has_number=True ) = 0.9988706945228685
P(dense   | has_number=False) = 0.9996903065964695

--- Overall selection proportions ---
Overall keyword share: 0.354
Overall dense share  : 0.646
```
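
As a sanity check on these numbers, the expected net reward of each arm can be computed directly from the success probabilities and costs hard-coded in the simulation (restated here so the snippet runs on its own). It uses the nominal 35% share of number/date queries, which is only an approximation because the sampled share differs slightly.

```python
import numpy as np

p_hn = 0.35                                   # nominal share of queries with numbers/dates
# Success probabilities: rows = arm (keyword, dense), columns = (has_number, no_number)
p_success = np.array([[0.78, 0.45],
                      [0.62, 0.74]])
cost = np.array([0.05, 0.15])                 # per-arm retrieval cost

net = p_success - cost[:, None]               # expected net reward per (arm, context)
weights = np.array([p_hn, 1.0 - p_hn])

per_arm = net @ weights                       # value of always playing one arm
oracle = float(net.max(axis=0) @ weights)     # value of picking the best arm per context

print("per-arm expected net reward:", per_arm)
print("best arm per context:", net.argmax(axis=0))
print("contextual oracle:", oracle)
```

Under these assumptions the context-blind values are about 0.516 (keyword) and 0.548 (dense), so a non-contextual bandit should settle on dense, which matches the ε-greedy action counts and its estimate for the dense arm. The per-context optimum is keyword for number/date queries and dense otherwise, worth about 0.639 per question. The 0.6496 that LinUCB achieved on this run is within sampling noise of that value.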


# Exercise 3

Future Thinking: In Search-R1, an LLM is trained to decide when to call a search API during its reasoning. Why is this an RL problem and not a supervised one? (Hint: the optimal points to call search aren’t known in the training data; the model must discover them by trial and error, guided by a reward like final answer correctness.) If you were to integrate this into Sqwish, what kind of feedback signal could train such behavior? Describe a possible reward function for an LLM agent that can either answer directly or decide to issue a search query and then answer. How would you ensure it doesn’t over-use the search (to minimize latency) or under-use it (and risk being wrong)?

## Solution

The **Search-R1** framework treats the decision of *when* and *how* to search as an RL problem because the optimal search trajectory is not labeled. In supervised fine-tuning, you would need ground-truth sequences of tool calls, which are typically unavailable.

RL allows the agent to explore search calls and learn from an outcome-based signal: if the final answer is correct, actions along the trajectory receive positive feedback.

A simple reward design could be:

- **Outcome reward**: +1 for a correct answer, 0 for an incorrect one.
- **Search cost**: penalize each search call (e.g., -0.1 per call) or use a latency-based penalty to discourage over-use.

One example total reward is:

- If correct: $r = 1 - 0.1 \cdot N$
- If incorrect: $r = -0.1 \cdot N$

where $N$ is the number of search calls.

By tuning the search-call penalty, the agent can learn to search only when necessary: a high penalty encourages answering from memory, while a low penalty encourages more frequent search. In a production system like Sqwish, you could further incorporate user feedback (thumbs-up/down) as an additional reward signal.
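
The reward above, and the trade-off the penalty creates, can be written down in a few lines. This is a sketch: `episode_reward` and `search_is_worth_it` are hypothetical helpers, and the accuracy figures in the example calls are made up for illustration.

```python
def episode_reward(correct: bool, n_searches: int, penalty: float = 0.1) -> float:
    """r = 1 - penalty * N if the answer is correct, else -penalty * N."""
    return float(correct) - penalty * n_searches


def search_is_worth_it(p_direct: float, p_with_search: float, penalty: float = 0.1) -> bool:
    """Compare expected reward of answering directly (N = 0) vs. after one search (N = 1)."""
    expected_direct = p_direct
    expected_search = p_with_search - penalty
    return expected_search > expected_direct


r_two_searches = episode_reward(True, 2)         # correct after two searches
r_failed_search = episode_reward(False, 1)       # wrong after one search
big_gain = search_is_worth_it(0.60, 0.85)        # search adds 25 points of accuracy
small_gain = search_is_worth_it(0.90, 0.95)      # search adds only 5 points
print(r_two_searches, r_failed_search, big_gain, small_gain)
```

Under this reward the penalty acts as a break-even threshold: a single search pays off in expectation only when it raises the probability of a correct answer by more than the penalty (0.1 here). Raising the penalty pushes the agent toward answering from memory, and lowering it pushes toward searching.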

# Exercise 4

Connection to Context Engineering: Consider ACE's playbook (Day 10) as a learned retrieval context. Instead of retrieving from a static corpus, the system builds its own knowledge base through experience. How might you combine ACE with traditional RAG? One approach: use RAG for factual, up-to-date information while using an ACE playbook for procedural knowledge (strategies, patterns, failure modes). The playbook could even store meta-knowledge about retrieval itself, e.g., "For financial questions, retrieve from SEC filings first." Design a hybrid architecture.

## Solution

Hybrid architecture: use RAG for facts and an ACE playbook for procedures.

- RAG: factual grounding (internal docs, web, filings).
- ACE playbook: reusable procedures/heuristics (e.g., “SEC first for finance”, “cross-check numerics”).

Runtime:

1. Extract query features (domain, has numbers/dates, risk, latency budget).
2. Retrieve relevant playbook entries (procedure) and retrieve evidence from RAG (facts).
3. Router/planner (the “controller” for RAG) chooses a retrieval plan: sources, retrievers, top-k, stop rules, budget.
4. Execute the plan, then answer using evidence + the playbook checklist.
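
The runtime steps above can be sketched as a small pipeline. Everything here is a hypothetical stand-in: the playbook is a dict of tagged rules, the retrievers are plain functions that return placeholder snippets, and the router is a fixed rule lookup rather than a learned policy.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Plan:
    sources: List[str]
    top_k: int = 3
    checklist: List[str] = field(default_factory=list)


# Hypothetical ACE-style playbook: procedural rules keyed by query domain.
PLAYBOOK: Dict[str, List[str]] = {
    "finance": ["retrieve from SEC filings first", "cross-check numerics"],
    "general": ["prefer internal docs"],
}

# Hypothetical RAG sources: each maps (query, k) to k placeholder evidence snippets.
SOURCES: Dict[str, Callable[[str, int], List[str]]] = {
    "sec_filings": lambda q, k: [f"[sec] snippet {i} for '{q}'" for i in range(k)],
    "internal_docs": lambda q, k: [f"[docs] snippet {i} for '{q}'" for i in range(k)],
}


def extract_features(query: str) -> Dict[str, object]:
    """Step 1: cheap query features (domain guess, presence of digits)."""
    finance_terms = ("revenue", "earnings", "10-k", "filing")
    domain = "finance" if any(t in query.lower() for t in finance_terms) else "general"
    return {"domain": domain, "has_number": any(c.isdigit() for c in query)}


def route(features: Dict[str, object]) -> Plan:
    """Steps 2-3: look up playbook rules and turn them into a retrieval plan."""
    domain = str(features["domain"])
    sources = ["sec_filings", "internal_docs"] if domain == "finance" else ["internal_docs"]
    return Plan(sources=sources, top_k=2, checklist=PLAYBOOK[domain])


def run(query: str) -> Dict[str, object]:
    features = extract_features(query)
    plan = route(features)
    # Step 4: execute the plan; an LLM would then answer from evidence + checklist.
    evidence = [s for name in plan.sources for s in SOURCES[name](query, plan.top_k)]
    return {"plan": plan, "evidence": evidence}


result = run("What was ACME's revenue in 2023?")
print(result["plan"].sources, len(result["evidence"]), result["plan"].checklist)
```

Keeping the plan as an explicit object makes it easy to log which sources, `top_k`, and checklist were used for each query, which is what a learned router would need as training data.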

Learning: log outcomes + cost/latency, update the router policy from rewards (accuracy vs latency), and distill good runs into new/edited playbook rules (generate → reflect → curate).
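
A minimal sketch of that learning loop, under strong simplifying assumptions: the reward is correctness minus a latency penalty, the router keeps a running mean reward per (domain, plan) pair, and the generate → reflect → curate cycle is reduced to a single threshold rule that promotes the best sufficiently-tried plan into the playbook. The constants and the names `log_outcome` and `curate` are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

LATENCY_WEIGHT = 0.1   # assumed exchange rate: reward units per second of latency
MIN_TRIALS = 3         # assumed evidence threshold before a rule is curated

# (domain, plan) -> [running mean reward, number of trials]
stats: Dict[Tuple[str, str], List[float]] = defaultdict(lambda: [0.0, 0])
playbook: Dict[str, str] = {}


def log_outcome(domain: str, plan: str, correct: bool, latency_s: float) -> float:
    """Update the running mean reward for this (domain, plan) pair; return the reward."""
    reward = float(correct) - LATENCY_WEIGHT * latency_s
    entry = stats[(domain, plan)]
    entry[1] += 1
    entry[0] += (reward - entry[0]) / entry[1]   # incremental mean update
    return reward


def curate(domain: str) -> None:
    """Promote the best plan with enough trials into the playbook for this domain."""
    tried = {p: m for (d, p), (m, n) in stats.items() if d == domain and n >= MIN_TRIALS}
    if tried:
        best = max(tried, key=tried.get)
        playbook[domain] = f"use plan '{best}' first"


for _ in range(3):
    log_outcome("finance", "sec_first", correct=True, latency_s=1.0)
    log_outcome("finance", "web_first", correct=False, latency_s=0.5)
curate("finance")
print(playbook)
```

In a fuller system the curation step would be an LLM reflecting over logged trajectories rather than a threshold on mean reward, and the router would still need some exploration so that newly curated rules get tested.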