# Exercise 1

Design: Propose a logging schema for Sqwish’s online learning system. It should record everything needed to reproduce training and support later analysis. For each user request, what would you log? (Think: a unique request ID, user context features, the chosen prompt/model, possibly all model outputs, any user click or outcome, timestamps, the propensity (probability) of the chosen action if the policy is stochastic, etc.) Write a structured list of fields and justify each: why is it needed? For example, the propensity is needed for IPS in OPE.

## Solution: Logging schema (concise)

### Decision event (one row per request)
- `request_id`, `timestamp_ms`, `session_id`, `user_id_hash` (join keys: link this decision to later rewards/errors)
- `trace_id` (distributed tracing: follow the request across services and break down latency)
- `context_features`, `feature_version` (exact inputs the policy used at decision time)
- `candidate_actions`, `chosen_action`, `action_params` (what was available + what we did)
- `logging_policy_id`, `logging_policy_version`, `propensity` (needed for IPS/DR OPE)
- `exploration_state` (epsilon/UCB alpha/TS params; why it explored)
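- `latency_ms_total`, `latency_ms_by_stage`, `token_usage`, `estimated_cost_usd`, `error_flags` (operational health and cost: debug slow or failed requests and track spend per decision)
- `safety_labels`, `safety_score`, `safety_model_version`, `mitigation_taken` (safety audit trail: what was flagged, by which classifier version, and what was done about it)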

### Reward/outcome event (separate row, can arrive later)
- `request_id` (join key: attach this outcome back to the original decision)
- `reward_timestamp_ms` (when the outcome was observed)
- `reward_type`, `reward_value` (what happened + numeric value)
- `reward_observation_window` (what window you waited for, e.g. 24h conversion)
- `attribution_model_version` (if credit assignment/labeling uses a model)
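
Because rewards arrive later as separate rows, training and OPE need a join step. The following is a minimal sketch of that join, assuming in-memory lists of dicts that use the field names above. The helper `join_rewards`, the `reward_observed` flag, and the choice to default a missing reward to `0.0` are illustrative assumptions, not part of the schema.

```python
def join_rewards(decisions, rewards, window_ms, default_reward=0.0):
    """Attach the earliest in-window reward to each decision, keyed by request_id."""
    by_request = {}
    for r in sorted(rewards, key=lambda r: r["reward_timestamp_ms"]):
        by_request.setdefault(r["request_id"], []).append(r)

    joined = []
    for d in decisions:
        deadline = d["timestamp_ms"] + window_ms
        in_window = [
            r for r in by_request.get(d["request_id"], [])
            if d["timestamp_ms"] <= r["reward_timestamp_ms"] <= deadline
        ]
        # Assumption: no reward inside the window counts as the default reward.
        value = in_window[0]["reward_value"] if in_window else default_reward
        joined.append({**d, "reward_value": value, "reward_observed": bool(in_window)})
    return joined


decisions = [
    {"request_id": "a", "timestamp_ms": 1_000, "chosen_action": "prompt_v1", "propensity": 0.5},
    {"request_id": "b", "timestamp_ms": 2_000, "chosen_action": "prompt_v2", "propensity": 0.5},
]
rewards = [
    {"request_id": "a", "reward_timestamp_ms": 1_500, "reward_type": "click", "reward_value": 1.0},
    {"request_id": "b", "reward_timestamp_ms": 90_000, "reward_type": "click", "reward_value": 1.0},
]
joined = join_rewards(decisions, rewards, window_ms=10_000)
```

Keeping `reward_observed` separate from `reward_value` lets later analysis distinguish "no reward yet" from "observed a zero reward", which matters when the observation window changes.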

### Minimal template
```json
{
  "request_id": "...",
  "timestamp_ms": 0,
  "session_id": "...",
  "user_id_hash": "...",
  "context_features": {"...": "..."},
  "feature_version": "...",
  "candidate_actions": ["..."],
  "chosen_action": "...",
  "action_params": {"...": "..."},
  "logging_policy_id": "...",
  "logging_policy_version": "...",
  "propensity": 0.0,
  "latency_ms_total": 0,
  "token_usage": {"prompt": 0, "completion": 0},
  "estimated_cost_usd": 0.0,
  "safety_labels": [],
  "error_flags": []
}
```
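
A cheap write-time check can catch records that would later break OPE. The sketch below assumes decision events are plain dicts shaped like the template; `REQUIRED_FIELDS` and `validate_decision` are hypothetical names, and the required set is one possible choice. Note that the `0.0` in the template is only a placeholder: a logged action must have a propensity strictly greater than zero for IPS to be defined.

```python
REQUIRED_FIELDS = (
    "request_id", "timestamp_ms", "context_features", "feature_version",
    "candidate_actions", "chosen_action", "logging_policy_id",
    "logging_policy_version", "propensity",
)


def validate_decision(record):
    """Return a list of human-readable problems; an empty list means the record passes."""
    problems = [f"missing field: {name}" for name in REQUIRED_FIELDS if name not in record]

    p = record.get("propensity")
    is_number = isinstance(p, (int, float)) and not isinstance(p, bool)
    if not is_number or not (0.0 < p <= 1.0):
        problems.append("propensity must be in (0, 1]")

    if record.get("chosen_action") not in record.get("candidate_actions", []):
        problems.append("chosen_action not in candidate_actions")
    return problems


good = {
    "request_id": "r1", "timestamp_ms": 1, "context_features": {"lang": "en"},
    "feature_version": "v1", "candidate_actions": ["p1", "p2"], "chosen_action": "p1",
    "logging_policy_id": "eps_greedy", "logging_policy_version": "3", "propensity": 0.55,
}
bad = dict(good, propensity=0.0)

good_problems = validate_decision(good)
bad_problems = validate_decision(bad)
```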



# Exercise 2

Analysis: You have deployed a bandit that optimizes prompts. After a week, you examine results and see overall user satisfaction went up 5%. However, for new users (first-time visitors) satisfaction dropped. How would you investigate this? Outline an experiment or analysis using logs to diagnose why the policy might be underperforming for new users (maybe it over-explored or didn’t personalize properly for cold-start users). What changes to the algorithm could you consider (e.g. epsilon-greedy for new users until enough data)?

## Solution
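
One possible first step is to split the joined logs by user segment and compare mean reward and exploration rate side by side. The sketch below assumes each joined row already carries an `is_new_user` flag (derived from `context_features`) and an `explored` flag (derived from `exploration_state`); both flags and the helper `segment_summary` are hypothetical.

```python
from collections import defaultdict


def segment_summary(rows):
    """Mean reward and exploration rate for new vs. returning users."""
    acc = defaultdict(lambda: {"n": 0, "reward": 0.0, "explored": 0})
    for row in rows:
        segment = "new" if row["is_new_user"] else "returning"
        acc[segment]["n"] += 1
        acc[segment]["reward"] += row["reward_value"]
        acc[segment]["explored"] += int(row["explored"])
    return {
        segment: {
            "n": a["n"],
            "mean_reward": a["reward"] / a["n"],
            "explore_rate": a["explored"] / a["n"],
        }
        for segment, a in acc.items()
    }


rows = [
    {"is_new_user": True, "reward_value": 0.0, "explored": True},
    {"is_new_user": True, "reward_value": 1.0, "explored": True},
    {"is_new_user": False, "reward_value": 1.0, "explored": False},
    {"is_new_user": False, "reward_value": 1.0, "explored": True},
]
summary = segment_summary(rows)
```

If new users show a much higher exploration rate together with lower reward, over-exploration is a plausible cause. If exploration rates are similar, the next thing to check is whether new-user context features are sparse or defaulted, which would point to a cold-start personalization problem instead.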

# Exercise 3

Coding: Using an open bandit dataset (e.g. the Open Bandit Pipeline’s logged data if available, or a simulated one), perform an off-policy evaluation of a hypothetical new policy. For example, use logged data from a uniform random policy on a classification task as bandit feedback. Define a new deterministic policy (e.g. always choose arm 1 for certain feature values and arm 2 otherwise). Use IPS and Doubly Robust (DR) to estimate the new policy’s reward from the logs. Compare the estimates with the actual reward obtained by running the new policy on the dataset (if ground truth is available). This exercise solidifies understanding of OPE’s value and limitations: if the new policy is very different from the logging policy, IPS variance will be high.

## Solution
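
The following is a minimal sketch on simulated data rather than on the Open Bandit Pipeline. Everything in it is an assumption made for illustration: a three-class problem where the label equals the feature 80% of the time, a uniform random logging policy, a target policy that plays the arm equal to the feature, and a tabular reward model fitted on a separate half of the logs.

```python
import random

N_ARMS = 3


def simulate(n, seed):
    """Classification turned into bandit feedback under a uniform random logging policy."""
    rng = random.Random(seed)
    logs = []
    for _ in range(n):
        x = rng.randrange(N_ARMS)                                  # context feature
        y = x if rng.random() < 0.8 else rng.randrange(N_ARMS)     # true label
        a = rng.randrange(N_ARMS)                                  # logged action
        logs.append({"x": x, "y": y, "a": a, "r": float(a == y), "p": 1.0 / N_ARMS})
    return logs


def target_policy(x):
    """Deterministic policy under evaluation: play the arm equal to the feature."""
    return x


def fit_reward_model(logs):
    """Tabular estimate of E[r | x, a]; returns 0.0 for unseen (x, a) pairs."""
    sums, counts = {}, {}
    for row in logs:
        key = (row["x"], row["a"])
        sums[key] = sums.get(key, 0.0) + row["r"]
        counts[key] = counts.get(key, 0) + 1
    return lambda x, a: sums[(x, a)] / counts[(x, a)] if (x, a) in counts else 0.0


def ips(logs, policy):
    """Inverse propensity scoring: reweight rows where the logged action matches the policy."""
    return sum(float(policy(r["x"]) == r["a"]) / r["p"] * r["r"] for r in logs) / len(logs)


def doubly_robust(logs, policy, q_hat):
    """Model prediction for the policy's action plus an IPS correction on the residual."""
    total = 0.0
    for r in logs:
        pi_a = policy(r["x"])
        match = float(pi_a == r["a"])
        total += q_hat(r["x"], pi_a) + match / r["p"] * (r["r"] - q_hat(r["x"], r["a"]))
    return total / len(logs)


train_logs = simulate(10_000, seed=0)   # used only to fit the reward model
eval_logs = simulate(10_000, seed=1)    # used only for the estimators

q_hat = fit_reward_model(train_logs)
ips_estimate = ips(eval_logs, target_policy)
dr_estimate = doubly_robust(eval_logs, target_policy, q_hat)
# Ground truth is available here because the simulation keeps the full labels.
true_value = sum(float(target_policy(r["x"]) == r["y"]) for r in eval_logs) / len(eval_logs)
```

In this setup only about one third of the logged rows match the target policy, so IPS effectively uses a third of the data; with more arms or a more skewed logging policy the match rate falls and IPS variance grows. Fitting the reward model on a separate split keeps the DR correction term from cancelling trivially, which is what would happen with a tabular model fitted on the same rows.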

# Exercise 4

Monitoring: List 3 metrics you would put on a dashboard for the live Sqwish system and why. For example: “Cumulative regret” - to watch if the learning is improving over a non-learning baseline; “10th percentile reward” - to ensure we’re not badly serving a subset of users even if average looks good; “Unsafe response count per 1000 interactions” - to catch any increase in policy violations. Explain briefly how each metric helps ensure the system is healthy.

## Solution
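
As a sketch of how the three example metrics from the prompt could be computed from joined logs, assuming each row has `reward_value` and `safety_labels`. True regret needs an oracle, so this version reports cumulative gain over a fixed baseline reward instead; `baseline_reward`, `percentile`, and `dashboard_metrics` are hypothetical names.

```python
import math


def percentile(values, q):
    """Nearest-rank percentile for q in (0, 100]."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]


def dashboard_metrics(rows, baseline_reward):
    rewards = [row["reward_value"] for row in rows]
    unsafe = sum(1 for row in rows if row["safety_labels"])
    return {
        # Positive means the learner is ahead of the non-learning baseline so far.
        "cumulative_gain_vs_baseline": sum(rewards) - baseline_reward * len(rewards),
        # Tail metric: catches a badly served subset even when the mean looks good.
        "p10_reward": percentile(rewards, 10),
        # Rate of flagged responses, normalized per 1000 interactions.
        "unsafe_per_1000": 1000 * unsafe / len(rows),
    }


reward_values = [0.0, 0.2, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0, 1.0]
rows = [{"reward_value": v, "safety_labels": []} for v in reward_values]
rows[3]["safety_labels"] = ["policy_violation"]

metrics = dashboard_metrics(rows, baseline_reward=0.5)
```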