# Exercise 1


Implement a LinUCBAgent class in Python. Use efficient updates (e.g. Sherman-Morrison formula) to achieve $O(d^2)$ per-step updates instead of naive $O(d^3)$. Test it on a simulation with a few arms and measure regret.

## Solution

### **Assumptions**

1. **Linear reward model:**

For each round $t$, each arm $a$ has a feature vector:
$$
x_{t,a}\in\mathbb{R}^d
$$
There exists an **unknown** parameter $\theta_\star\in\mathbb{R}^d$ such that the **conditional mean reward** is linear:
$$
\mathbb{E}[r_t(a)\mid x_{t,a}] = x_{t,a}^\top \theta_\star
$$
Observed reward is:
$$
r_t = x_{t,a_t}^\top\theta_\star + \eta_t
$$

2. **Noise assumption:**

$\eta_t$ is zero-mean, **conditionally sub-Gaussian**:
$$
\mathbb{E}[\eta_t\mid \mathcal{F}_{t-1}] = 0,\quad \eta_t \text{ is } \sigma\text{-sub-Gaussian}
$$

3. **Bounded features:**

Assume feature vectors have bounded norm:
$$
\|x_{t,a}\|_2 \le L
$$
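Without such a bound, the exploration width $\|x_{t,a}\|_{V_t^{-1}}$, and with it the regret of a single round, could be arbitrarily large.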

4. **Bounded parameter:** 

Often assume:
$$
\|\theta_\star\|_2 \le S
$$
This is not strictly necessary for running the algorithm, but it lets us define an explicit confidence radius $\beta_t$.



## Math Prerequisites:

1. **The Sherman–Morrison formula:**

For an invertible matrix $A \in \mathbb R^{d\times d}$ and vectors $u,v \in \mathbb R^d$ with $1 + v^\top A^{-1} u \neq 0$,

$$
(A + u v^\top)^{-1} = A^{-1} - \frac{A^{-1} u v^\top A^{-1}}{1 + v^\top A^{-1} u}
$$
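
As a quick numerical sanity check of the formula, the rank-one update can be compared against a direct inverse. This is only a sketch: the helper name `sherman_morrison_update` and the test matrix are introduced here for illustration. In LinUCB we have $u=v=x$ and $A\succ 0$, so the denominator $1+x^\top A^{-1}x$ is at least 1.

```python
import numpy as np

def sherman_morrison_update(A_inv, u, v):
    """Return (A + u v^T)^{-1} given A^{-1}, using O(d^2) operations."""
    Au = A_inv @ u            # A^{-1} u, shape (d,)
    vA = v @ A_inv            # v^T A^{-1}, shape (d,)
    denom = 1.0 + v @ Au      # must be nonzero for the formula to apply
    return A_inv - np.outer(Au, vA) / denom

rng = np.random.default_rng(0)
d = 4
A = 2.0 * np.eye(d)           # a simple positive definite matrix (illustrative choice)
x = rng.normal(size=d)

fast = sherman_morrison_update(np.linalg.inv(A), x, x)
direct = np.linalg.inv(A + np.outer(x, x))
max_abs_err = np.max(np.abs(fast - direct))
print(max_abs_err)            # expected to be at floating-point rounding level
```

The update costs two matrix-vector products and one outer product, which is where the $O(d^2)$ per-step cost comes from.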

In [4]:
import numpy as np

try:
    from linucb import LinUCBAgent
except ImportError:  
    from Day_1.linucb import LinUCBAgent


if __name__ == "__main__":
    # Setup: 2 arms, each with 3 features
    d, K = 3, 2
    agent = LinUCBAgent(d=d, alpha=1.0, lam=1.0)

    # True unknown parameter (what we want to discover)
    theta_star = np.array([0.5, 0.3, 0.1])
    print(f"True theta_star: {theta_star}")

    # Contexts for 2 arms (concrete numbers)
    X = np.array([
        [1.0, 2.0, 3.0],  # arm 0 features
        [4.0, 5.0, 6.0]   # arm 1 features
    ])

    # Step 1: Select an arm
    a = agent.select_arm(X)
    print(f"\nSelected arm: {a}")
    print(f"Feature vector of chosen arm: {X[a]}")

    # Step 2: Calculate reward from environment: r = x^T theta_star + noise
    noise = 0.1  # small noise
    r = X[a] @ theta_star + noise
    print(f"Observed reward: r = {X[a]} @ {theta_star} + {noise} = {r:.3f}")

    # Step 3: Update agent with chosen arm's context and reward
    agent.update(X[a], r)

    # Step 4: Check updated parameter estimate
    theta_hat = agent.theta_hat()
    print(f"\nUpdated theta_hat: {theta_hat}")
    print(f"True theta_star:   {theta_star}")
    print(f"Difference:        {theta_hat - theta_star}")

True theta_star: [0.5 0.3 0.1]

Selected arm: 1
Feature vector of chosen arm: [4. 5. 6.]
Observed reward: r = [4. 5. 6.] @ [0.5 0.3 0.1] + 0.1 = 4.200

Updated theta_hat: [0.21538462 0.26923077 0.32307692]
True theta_star:   [0.5 0.3 0.1]
Difference:        [-0.28461538 -0.03076923  0.22307692]
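
The `linucb` module imported above is not shown in this notebook. Below is a minimal sketch of an agent exposing the same three calls (`select_arm`, `update`, `theta_hat`). Its internals are assumptions, not the original implementation: it stores $A^{-1}$ and $b=\sum_s x_s r_s$, uses a fixed exploration weight `alpha` in place of $\beta_t$, and is named `LinUCBAgentSketch` to make clear it is a stand-in.

```python
import numpy as np

class LinUCBAgentSketch:
    """Hypothetical stand-in for linucb.LinUCBAgent (internals assumed)."""

    def __init__(self, d: int, alpha: float = 1.0, lam: float = 1.0):
        self.alpha = alpha
        self.A_inv = np.eye(d) / lam   # inverse of A = lam * I
        self.b = np.zeros(d)           # running sum of x * r

    def theta_hat(self) -> np.ndarray:
        return self.A_inv @ self.b     # ridge estimate A^{-1} b

    def select_arm(self, X: np.ndarray) -> int:
        means = X @ self.theta_hat()
        # sqrt(x^T A^{-1} x) for every row x of X
        widths = np.sqrt(np.einsum("kd,de,ke->k", X, self.A_inv, X))
        return int(np.argmax(means + self.alpha * widths))

    def update(self, x: np.ndarray, r: float) -> None:
        Ax = self.A_inv @ x
        # Sherman-Morrison with u = v = x (A_inv is symmetric): O(d^2)
        self.A_inv -= np.outer(Ax, Ax) / (1.0 + x @ Ax)
        self.b += r * x

sketch = LinUCBAgentSketch(d=3, alpha=1.0, lam=1.0)
X_demo = np.array([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
arm = sketch.select_arm(X_demo)
sketch.update(X_demo[arm], 4.2)
print(arm, sketch.theta_hat())
```

On the two-arm example above, this sketch selects arm 1 and gives `theta_hat` $= 4.2\,x/(1+\|x\|_2^2) \approx [0.2154, 0.2692, 0.3231]$, matching the printed output.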



## Problem
Derive the regret bound $R_T = O(d\sqrt{T})$ (more precisely $\tilde O(d\sqrt T)$) for a linear contextual/linear bandit using LinUCB/OFUL-style analysis. Explain:
*   the role of confidence ellipsoids,
*   why dimension $d$ worsens worst-case regret,
*   how to choose $\beta_t$ to get a uniform high-probability guarantee, and why $\beta_t$ is nondecreasing.

### 1. Model, notation, and estimator
At each round $t=1,\dots,T$:
1.  A set of arm feature vectors $\{x_{t,a}\in\mathbb R^d: a\in\mathcal A\}$ is observed.
2.  An arm $a_t$ is selected, and the reward is observed
    $$
    r_t = x_{t,a_t}^\top\theta_\star + \eta_t,
    $$
    where $\theta_\star\in\mathbb R^d$ is unknown.

Define the played feature vector $x_t := x_{t,a_t}$.


**Ridge design matrix and estimator**
$$
V_t := \lambda I + \sum_{s=1}^{t-1} x_s x_s^\top,
\qquad
\hat\theta_t := V_t^{-1}\sum_{s=1}^{t-1} x_s r_s.
$$
Because $\lambda>0$, $V_t\succ 0$ and $V_t^{-1}$ exists.
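
To make the definitions concrete, the sketch below builds $V_t$ and $\hat\theta_t$ from made-up data (the parameter, noise level, and variable names are assumptions for illustration) and checks two facts: $\hat\theta_t$ is a stationary point of the ridge objective $\sum_s (x_s^\top\theta-r_s)^2+\lambda\|\theta\|_2^2$, and the smallest eigenvalue of $V_t$ is at least $\lambda$.

```python
import numpy as np

rng = np.random.default_rng(1)
d, n, lam = 3, 20, 1.0
X_hist = rng.normal(size=(n, d))          # rows are played features x_s
theta_true = np.array([0.5, 0.3, 0.1])    # made-up parameter for the demo
r_hist = X_hist @ theta_true + 0.1 * rng.normal(size=n)

V = lam * np.eye(d) + X_hist.T @ X_hist   # lambda*I + sum_s x_s x_s^T
b = X_hist.T @ r_hist                     # sum_s x_s r_s
theta_hat = np.linalg.solve(V, b)         # V^{-1} b without forming the inverse

# Half the gradient of the ridge objective at theta_hat; should vanish
grad = X_hist.T @ (X_hist @ theta_hat - r_hist) + lam * theta_hat
min_eig = np.linalg.eigvalsh(V).min()     # at least lam, so V is invertible
print(theta_hat, np.abs(grad).max(), min_eig)
```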

### 2. Confidence ellipsoids and $\beta_t$ (probability guarantee)
Define the (random) confidence ellipsoid:
$$
\mathcal C_t := \{ \theta:\ \|\theta-\hat\theta_t\|_{V_t}\le \beta_t \},
\quad \|u\|_{V}:=\sqrt{u^\top V u}.
$$

### What "$\theta_\star\in\mathcal C_t$" means (frequentist)
$\theta_\star$ is fixed; $\mathcal C_t$ is random (depends on data).
$\Pr(\theta_\star\in\mathcal C_t)\ge 1-\delta$ means: across repeated runs, the constructed set contains $\theta_\star$ at least $1-\delta$ fraction of the time.

We want a single event
$$
\mathcal E := \{ \forall t\le T:\ \theta_\star\in\mathcal C_t \}
$$
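with $\Pr(\mathcal E)\ge 1-\delta$. We cannot multiply per-step probabilities, because the events are not independent. A generic alternative is the union bound, which gives $\Pr(\mathcal E)\ge 1-\delta$ as soon as each $E_t$ fails with probability at most $\delta/T$: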
$$
\Pr(\exists t\le T:\ \theta_\star\notin\mathcal C_t)
=\Pr\Big(\bigcup_{t=1}^T E_t^c\Big)
\le \sum_{t=1}^T \Pr(E_t^c),
\quad E_t:=\{\theta_\star\in\mathcal C_t\}.
$$

### A standard valid choice of $\beta_t$
A self-normalized concentration inequality for linear regression with sub-Gaussian martingale noise holds for all rounds simultaneously, so no union bound over $t$ is needed. It implies that for any fixed $\delta\in(0,1)$, with probability at least $1-\delta$,
$$
\forall t\le T:\quad \|\hat\theta_t-\theta_\star\|_{V_t}\le
\sigma\sqrt{2\log\frac{1}{\delta} + \log\frac{\det(V_t)}{\det(\lambda I)}}
+ \sqrt{\lambda}S.
$$
Thus the following choice is valid:
$$
\boxed{
\beta_t(\delta)
=
\sigma\sqrt{2\log\frac{1}{\delta} + \log\frac{\det(V_t)}{\det(\lambda I)}}
+ \sqrt{\lambda}S.
}
$$


### Why $\beta_t$ is nondecreasing
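$V_{t+1}=V_t+x_t x_t^\top\succeq V_t \Rightarrow \det(V_{t+1})\ge \det(V_t)\Rightarrow \log\det(V_t)$ is nondecreasing. All other terms in $\beta_t$ ($\sigma$, $\delta$, $\lambda$, $S$) are constant in $t$, so $\beta_t$ is nondecreasing.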
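
The confidence radius can be evaluated with a log-determinant routine. The sketch below is one possible way to do it: the function name `beta_t` and the numerical values are illustrative assumptions. It also checks along a random trajectory that the radius never decreases.

```python
import numpy as np

def beta_t(V, lam, sigma, S, delta):
    """Confidence radius sigma*sqrt(2 log(1/delta) + log(det V / det(lam I))) + sqrt(lam)*S."""
    d = V.shape[0]
    sign, logdet_V = np.linalg.slogdet(V)      # V is positive definite, so sign == 1
    log_ratio = logdet_V - d * np.log(lam)     # log(det V / det(lam I))
    return sigma * np.sqrt(2.0 * np.log(1.0 / delta) + log_ratio) + np.sqrt(lam) * S

rng = np.random.default_rng(2)
d, lam, sigma, S, delta = 3, 1.0, 0.5, 1.0, 0.05
V = lam * np.eye(d)
betas = [beta_t(V, lam, sigma, S, delta)]      # beta_1, where V_1 = lam * I
for _ in range(50):
    x = rng.normal(size=d)
    V = V + np.outer(x, x)                     # V_{t+1} = V_t + x_t x_t^T
    betas.append(beta_t(V, lam, sigma, S, delta))
print(betas[0], betas[-1])
```

In an $O(d^2)$ implementation one would not recompute the determinant each round; the log-determinant ratio can be accumulated incrementally as $\sum_s \log(1+x_s^\top V_s^{-1}x_s)$.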

### 3. LinUCB rule and role of confidence ellipsoids
Define the UCB score for any arm $a$:
$$
\mathrm{UCB}_t(a):= x_{t,a}^\top\hat\theta_t + \beta_t\|x_{t,a}\|_{V_t^{-1}},
\quad \|x\|_{V^{-1}}:=\sqrt{x^\top V^{-1}x}.
$$
LinUCB chooses
$$
a_t\in\arg\max_{a\in\mathcal A}\mathrm{UCB}_t(a).
$$

**Why this is the right bonus:** the ellipsoid gives a closed-form bound on how much $x^\top\theta$ can change when $\theta$ varies within $\mathcal C_t$. Specifically,
$$
\max_{\theta\in\mathcal C_t} x^\top\theta
= x^\top\hat\theta_t + \beta_t\|x\|_{V_t^{-1}}.
$$
So LinUCB is "optimistic over plausible parameters."
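
The closed form can be checked numerically. The sketch below (random $V_t$, $\hat\theta_t$ and $x$, all made up for illustration) verifies that $\theta^+ = \hat\theta_t + \beta_t V_t^{-1}x/\|x\|_{V_t^{-1}}$ lies on the boundary of the ellipsoid and attains the UCB value, and that randomly sampled boundary points never exceed it.

```python
import numpy as np

rng = np.random.default_rng(3)
d, beta = 3, 1.5
M = rng.normal(size=(d, d))
V = np.eye(d) + M @ M.T                 # a positive definite V_t
theta_hat = rng.normal(size=d)
x = rng.normal(size=d)

V_inv_x = np.linalg.solve(V, x)
width = np.sqrt(x @ V_inv_x)            # ||x||_{V^{-1}}
ucb = x @ theta_hat + beta * width      # closed-form maximum

# Candidate maximizer: move from theta_hat along V^{-1} x until the boundary
theta_plus = theta_hat + beta * V_inv_x / width
diff = theta_plus - theta_hat
on_boundary = np.sqrt(diff @ V @ diff)  # V-norm of the step; should equal beta

# Random boundary points: theta_hat + beta * C^{-T} z with ||z||_2 = 1, V = C C^T
C = np.linalg.cholesky(V)
Z = rng.normal(size=(2000, d))
Z /= np.linalg.norm(Z, axis=1, keepdims=True)
thetas = theta_hat + beta * np.linalg.solve(C.T, Z.T).T
best_sampled = (thetas @ x).max()
print(ucb, x @ theta_plus, best_sampled)
```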


### 4. Define regret
Let the optimal arm at time $t$ be
$$
a_t^\star\in\arg\max_{a} x_{t,a}^\top\theta_\star.
$$
Instantaneous regret:
$$
\Delta_t := x_{t,a_t^\star}^\top\theta_\star - x_{t,a_t}^\top\theta_\star.
$$
Cumulative regret:
$$
R_T := \sum_{t=1}^T \Delta_t.
$$
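
As a small worked example of these definitions (the helper `cumulative_regret` and the numbers are hypothetical), the sketch below computes $\Delta_t$ and $R_t$ from expected rewards for a policy that always plays arm 1.

```python
import numpy as np

def cumulative_regret(contexts, chosen, theta_star):
    """contexts: (T, K, d) features; chosen: (T,) arm indices. Returns the running R_t."""
    means = contexts @ theta_star                        # (T, K) expected rewards
    gaps = means.max(axis=1) - means[np.arange(len(chosen)), chosen]
    return np.cumsum(gaps)

theta_star = np.array([1.0, 0.0])
contexts = np.array([
    [[1.0, 0.0], [0.0, 1.0]],   # round 1: means (1.0, 0.0), arm 0 is optimal
    [[0.5, 0.0], [1.0, 0.0]],   # round 2: means (0.5, 1.0), arm 1 is optimal
    [[0.2, 0.3], [0.1, 0.9]],   # round 3: means (0.2, 0.1), arm 0 is optimal
])
chosen = np.array([1, 1, 1])     # always play arm 1
R = cumulative_regret(contexts, chosen, theta_star)
print(R)                         # gaps are 1.0, 0.0, 0.1, so R = [1.0, 1.0, 1.1]
```
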
Work on the high-probability event $\mathcal E=\{\forall t\le T:\theta_\star\in\mathcal C_t\}$.

### 4.1 First inequality: from $\theta_\star\in\mathcal C_t$ to an upper bound on $x^\top\theta_\star$
**Claim:** On $\theta_\star\in\mathcal C_t$, for any vector $x$,
$$
x^\top\theta_\star \le x^\top\hat\theta_t + \beta_t\|x\|_{V_t^{-1}}.
\tag{A}
$$
**Proof:**
Because $\theta_\star\in\mathcal C_t$, we have
$$
\|\theta_\star-\hat\theta_t\|_{V_t}\le\beta_t.
\tag{C}
$$
Write
$$
x^\top\theta_\star = x^\top\hat\theta_t + x^\top(\theta_\star-\hat\theta_t).
$$
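Now apply Cauchy–Schwarz after inserting $V_t^{-1/2}V_t^{1/2}=I$:
$$
x^\top(\theta_\star-\hat\theta_t)
= \big(V_t^{-1/2}x\big)^\top\big(V_t^{1/2}(\theta_\star-\hat\theta_t)\big)
\le \|x\|_{V_t^{-1}}\,\|\theta_\star-\hat\theta_t\|_{V_t}.
$$
Thus, by (C),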
$$
x^\top(\theta_\star-\hat\theta_t)\le \beta_t\|x\|_{V_t^{-1}}.
$$
Plugging back gives (A).

Apply (A) with $x=x_{t,a_t^\star}$:
$$
x_{t,a_t^\star}^\top\theta_\star
\le x_{t,a_t^\star}^\top\hat\theta_t + \beta_t\|x_{t,a_t^\star}\|_{V_t^{-1}}.
\tag{A($\star$)}
$$

### 4.2 Second inequality: LinUCB maximization ("optimism")
By definition of $a_t$ as the maximizer of $\mathrm{UCB}_t(\cdot)$,
$$
x_{t,a_t^\star}^\top\hat\theta_t + \beta_t\|x_{t,a_t^\star}\|_{V_t^{-1}}
\le
x_{t,a_t}^\top\hat\theta_t + \beta_t\|x_{t,a_t}\|_{V_t^{-1}}.
\tag{B}
$$
Combine (A($\star$)) and (B):
$$
x_{t,a_t^\star}^\top\theta_\star
\le
x_{t,a_t}^\top\hat\theta_t + \beta_t\|x_{t,a_t}\|_{V_t^{-1}}.
\tag{D}
$$
Subtract $x_{t,a_t}^\top\theta_\star$ from both sides:
$$
\Delta_t
\le
x_{t,a_t}^\top(\hat\theta_t-\theta_\star) + \beta_t\|x_{t,a_t}\|_{V_t^{-1}}.
\tag{1}
$$

### 4.3 Bound the estimation-error term by the same bonus
**Deterministic inequality (always true):**
For any $x$ and any vector $u$,
$$
x^\top u \le \|x\|_{V_t^{-1}}\cdot\|u\|_{V_t}.
\tag{CS-V}
$$
Proof is identical to the one used above (insert $V_t^{\pm1/2}$ and apply Euclidean Cauchy–Schwarz).

**Use the confidence event:**
On $\mathcal E$, $\|\hat\theta_t-\theta_\star\|_{V_t}\le\beta_t$. So with $x=x_{t,a_t}$ and $u=\hat\theta_t-\theta_\star$,
$$
x_{t,a_t}^\top(\hat\theta_t-\theta_\star)
\le
\|x_{t,a_t}\|_{V_t^{-1}}\cdot\|\hat\theta_t-\theta_\star\|_{V_t}
\le
\beta_t\|x_{t,a_t}\|_{V_t^{-1}}.
\tag{2a}
$$
Plug into (1):
$$
\Delta_t \le 2\beta_t\|x_{t,a_t}\|_{V_t^{-1}}.
\tag{2}
$$

### 4.4 Sum over time and replace $\beta_t$ by $\beta_T$
Let $x_t=x_{t,a_t}$. Summing (2):
$$
R_T = \sum_{t=1}^T \Delta_t
\le 2\sum_{t=1}^T \beta_t\|x_t\|_{V_t^{-1}}.
$$
Because $\beta_t$ is nondecreasing, $\beta_t\le \beta_T$ for all $t\le T$, hence
$$
R_T \le 2\beta_T\sum_{t=1}^T \|x_t\|_{V_t^{-1}}.
\tag{3}
$$

## 5) Control $\sum_{t=1}^T \|x_t\|_{V_t^{-1}}$ via elliptical potential
This part is deterministic. It assumes only:
1.  $V_{t+1}=V_t+x_t x_t^\top$,
2.  $\lambda>0$ so $V_t\succ0$.

### 5.1 Determinant recursion (matrix determinant lemma)
Matrix determinant lemma:
$$
\det(V_t + x_t x_t^\top)=\det(V_t)\big(1+x_t^\top V_t^{-1}x_t\big).
$$
So
$$
\det(V_{t+1})=\det(V_t)\big(1+s_t\big),
\quad s_t:=x_t^\top V_t^{-1}x_t=\|x_t\|_{V_t^{-1}}^2.
$$
Telescoping products:
$$
\frac{\det(V_{T+1})}{\det(V_1)}=\prod_{t=1}^T (1+s_t).
$$
Taking logs and using $V_1=\lambda I$:
$$
\log\frac{\det(V_{T+1})}{\det(\lambda I)}
= \sum_{t=1}^T \log(1+s_t).
\tag{4}
$$
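
Identity (4) is easy to confirm numerically. In the sketch below (random features, illustrative constants), the running sum of $\log(1+s_t)$ is compared with the log-determinant ratio computed directly.

```python
import numpy as np

rng = np.random.default_rng(4)
d, lam, T = 3, 1.0, 30
V = lam * np.eye(d)                     # V_1 = lam * I
sum_log = 0.0
for _ in range(T):
    x = rng.normal(size=d)
    s = x @ np.linalg.solve(V, x)       # s_t = x_t^T V_t^{-1} x_t
    sum_log += np.log1p(s)              # log(1 + s_t)
    V = V + np.outer(x, x)              # V_{t+1}

log_det_ratio = np.linalg.slogdet(V)[1] - d * np.log(lam)
print(sum_log, log_det_ratio)           # the two numbers should agree up to rounding
```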

### 5.2 Elliptical Potential Inequality (from $\log(1+u)\ge u/2$ on $[0,1]$)

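**General bound**: For any $s_t \ge 0$, we have $\min\{1,s_t\}\le 2\log(1+s_t)$: if $s_t\le 1$ this is $\log(1+u)\ge u/2$ on $[0,1]$, and if $s_t>1$ then $2\log(1+s_t)\ge 2\log 2>1$. Summing and using (4),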
$$
\sum_{t=1}^T \min\{1,s_t\}\ \le\ 2\sum_{t=1}^T \log(1+s_t)
=2\log\frac{\det(V_{T+1})}{\det(\lambda I)}.
\tag{5a}
$$

**Simplifying condition**: If we enforce $\lambda \ge L^2$, then for all $t$ we have $s_t \le 1$. Here's why:

- Since $V_t = \lambda I + \sum_{s=1}^{t-1} x_s x_s^\top \succeq \lambda I$, matrix inversion reverses the PSD order: $V_t^{-1} \preceq (\lambda I)^{-1} = \frac{1}{\lambda}I$.
- Using the feature bound $\|x_t\|_2 \le L$:
  $$
  s_t = x_t^\top V_t^{-1} x_t \le \frac{\|x_t\|_2^2}{\lambda} \le \frac{L^2}{\lambda} \le 1.
  $$

**Consequence**: With $\lambda \ge L^2$, $\min\{1,s_t\}=s_t$, so (5a) becomes:
$$
\sum_{t=1}^T \|x_t\|_{V_t^{-1}}^2 = \sum_{t=1}^T s_t \le 2\log\frac{\det(V_{T+1})}{\det(\lambda I)}.
\tag{5}
$$

*(Not assuming $\lambda \ge L^2$? Keep $\min\{1,s_t\}$; it doesn't affect the $\tilde O(d\sqrt{T})$ final rate.)*
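
The elliptical potential inequality (5) can also be checked on simulated data. The sketch below assumes $L=1$ and $\lambda=L^2$, with random features rescaled to satisfy $\|x_t\|_2\le L$; all constants are illustrative.

```python
import numpy as np

rng = np.random.default_rng(5)
d, T, L = 4, 200, 1.0
lam = L ** 2                                   # enforce lambda >= L^2
V = lam * np.eye(d)
s_values = []
for _ in range(T):
    x = rng.normal(size=d)
    x *= L / max(L, np.linalg.norm(x))         # rescale so that ||x||_2 <= L
    s_values.append(x @ np.linalg.solve(V, x)) # s_t = ||x_t||^2 in the V_t^{-1} norm
    V = V + np.outer(x, x)

s_values = np.array(s_values)
potential = s_values.sum()                     # left-hand side of (5)
log_det_ratio = np.linalg.slogdet(V)[1] - d * np.log(lam)
print(s_values.max(), potential, 2.0 * log_det_ratio)
```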

### 5.3 Cauchy–Schwarz to convert sum of squares to sum
Let $w_t=\|x_t\|_{V_t^{-1}}$ (written $w_t$ to avoid a clash with the arm index $a_t$). Then
$$
\left(\sum_{t=1}^T w_t\right)^2 \le \left(\sum_{t=1}^T 1\right)\left(\sum_{t=1}^T w_t^2\right)=T\sum_{t=1}^T w_t^2,
$$
so
$$
\sum_{t=1}^T \|x_t\|_{V_t^{-1}}
\le
\sqrt{T\sum_{t=1}^T \|x_t\|_{V_t^{-1}}^2}.
\tag{CS}
$$
Combine (CS) with (5):
$$
\sum_{t=1}^T \|x_t\|_{V_t^{-1}}
\le
\sqrt{2T\log\frac{\det(V_{T+1})}{\det(\lambda I)}}.
\tag{6a}
$$

## 6) Bound the log-det by $d\log(1+TL^2/\lambda)$
Start from
$$
V_{T+1}=\lambda I+\sum_{t=1}^T x_t x_t^\top
=\lambda\left(I+\frac{1}{\lambda}\sum_{t=1}^T x_t x_t^\top\right).
$$
Thus
$$
\log\frac{\det(V_{T+1})}{\det(\lambda I)}
= \log\det\left(I+A\right),
\quad
A:=\frac{1}{\lambda}\sum_{t=1}^T x_t x_t^\top\succeq 0.
$$
Let eigenvalues of $A$ be $\mu_1,\dots,\mu_d\ge 0$. Then
$$
\log\det(I+A)=\sum_{i=1}^d \log(1+\mu_i).
$$
Use AM–GM on $1+\mu_i$:
$$
\prod_{i=1}^d (1+\mu_i)\le
\left(\frac{1}{d}\sum_{i=1}^d(1+\mu_i)\right)^d
= \left(1+\frac{1}{d}\sum_{i=1}^d \mu_i\right)^d.
$$
Take logs and note $\sum_i\mu_i=\mathrm{tr}(A)$:
$$
\log\det(I+A)\le d\log\left(1+\frac{\mathrm{tr}(A)}{d}\right).
$$
Now
$$
\mathrm{tr}(A)=\frac{1}{\lambda}\sum_{t=1}^T \mathrm{tr}(x_t x_t^\top)
=\frac{1}{\lambda}\sum_{t=1}^T \|x_t\|_2^2
\le \frac{TL^2}{\lambda}.
$$
Therefore
$$
\log\frac{\det(V_{T+1})}{\det(\lambda I)}
\le d\log\left(1+\frac{TL^2}{\lambda d}\right)
\le d\log\left(1+\frac{TL^2}{\lambda}\right).
\tag{7}
$$
Plug (7) into (6a):
$$
\sum_{t=1}^T \|x_t\|_{V_t^{-1}}
\le
\sqrt{2T\cdot d\log\left(1+\frac{TL^2}{\lambda}\right)}.
\tag{8}
$$
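
A numerical check of the log-det bound (7), using random features clipped to $\|x_t\|_2\le L$ (the constants are illustrative assumptions). It compares the actual log-determinant ratio with the AM–GM bound and with the simplified bound.

```python
import numpy as np

rng = np.random.default_rng(6)
d, T, L, lam = 5, 300, 2.0, 1.0
X_hist = rng.normal(size=(T, d))
norms = np.linalg.norm(X_hist, axis=1, keepdims=True)
X_hist *= L / np.maximum(norms, L)             # enforce ||x_t||_2 <= L

V = lam * np.eye(d) + X_hist.T @ X_hist        # V_{T+1}
log_det_ratio = np.linalg.slogdet(V)[1] - d * np.log(lam)
tight = d * np.log(1.0 + T * L**2 / (lam * d)) # AM-GM bound
loose = d * np.log(1.0 + T * L**2 / lam)       # simplified bound in (7)
print(log_det_ratio, tight, loose)
```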

## 7) Final regret bound
Combine (3) and (8):
$$
R_T
\le
2\beta_T\sqrt{2T\cdot d\log\left(1+\frac{TL^2}{\lambda}\right)}.
\tag{9}
$$
Using the explicit $\beta_T$ from Section 2, together with $\det(V_T)\le\det(V_{T+1})$ and the log-det bound (7),
$$
\beta_T
= \sigma\sqrt{2\log\frac{1}{\delta} + \log\frac{\det(V_T)}{\det(\lambda I)}}
+ \sqrt{\lambda}S
\le
\sigma\sqrt{2\log\frac{1}{\delta} + d\log\left(1+\frac{TL^2}{\lambda}\right)}
+ \sqrt{\lambda}S.
$$
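Hence (9) implies
$$
R_T = \tilde O(d\sqrt T)
$$
with probability at least $1-\delta$. Treating $\sigma$, $S$, $\lambda$, $L$, $\delta$ as constants, $\beta_T=O(\sqrt{d\log T})$ and the sum in (8) is $O(\sqrt{dT\log T})$, so their product is $O(d\sqrt T\log T)$. The guarantee is uniform over $t\le T$ because the self-normalized bound holds for all rounds simultaneously. If per-round bounds were combined by a union bound instead, each round would use $\delta/T$, which changes only logarithmic factors.
This is the requested $O(d\sqrt T)$ up to logs.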
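
To see the bound in action, here is a small OFUL-style simulation that uses the boxed $\beta_t$ and compares the cumulative (pseudo-)regret with the right-hand side of (9). Everything numerical here is an assumption made for the sketch: Gaussian noise with standard deviation $\sigma$, random arm features clipped to the unit ball ($L=1$), and a made-up $\theta_\star$ with $\|\theta_\star\|_2\le S$.

```python
import numpy as np

rng = np.random.default_rng(7)
d, K, T = 3, 5, 500
lam, sigma, S, L, delta = 1.0, 0.1, 1.0, 1.0, 0.05
theta_star = np.array([0.6, -0.3, 0.2])        # made-up parameter, norm 0.7 <= S

V_inv = np.eye(d) / lam
b = np.zeros(d)
log_det_ratio = 0.0                            # log(det V_t / det(lam I)) for V_1
regret = 0.0
for t in range(T):
    X = rng.normal(size=(K, d))
    X *= L / np.maximum(np.linalg.norm(X, axis=1, keepdims=True), L)  # ||x||_2 <= L
    beta = sigma * np.sqrt(2 * np.log(1 / delta) + log_det_ratio) + np.sqrt(lam) * S
    theta_hat = V_inv @ b
    widths = np.sqrt(np.einsum("kd,de,ke->k", X, V_inv, X))
    a = int(np.argmax(X @ theta_hat + beta * widths))   # LinUCB choice
    x = X[a]
    r = x @ theta_star + sigma * rng.normal()           # noisy reward
    means = X @ theta_star
    regret += means.max() - means[a]                    # instantaneous regret
    Vx = V_inv @ x
    s = x @ Vx
    log_det_ratio += np.log1p(s)                        # determinant recursion (4)
    V_inv -= np.outer(Vx, Vx) / (1.0 + s)               # Sherman-Morrison update
    b += r * x

log_term = d * np.log(1 + T * L**2 / lam)
beta_T = sigma * np.sqrt(2 * np.log(1 / delta) + log_term) + np.sqrt(lam) * S
bound = 2 * beta_T * np.sqrt(2 * T * log_term)          # right-hand side of (9)
print(regret, bound)
```

The bound is a worst-case guarantee, so the observed regret is typically far below it.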

## 8) Role of confidence ellipsoids (what they do in the proof)
The confidence ellipsoid provides two indispensable ingredients:

1.  A uniform statement about parameter error:
    $$
    \|\hat\theta_t-\theta_\star\|_{V_t}\le \beta_t
    \quad(\text{with high probability}).
    $$
2.  A conversion from parameter uncertainty to reward uncertainty for any arm feature $x$:
    $$
    x^\top\theta_\star \le x^\top\hat\theta_t + \beta_t\|x\|_{V_t^{-1}}.
    $$

This is exactly what yields the UCB exploration bonus and enables the "optimism chain."
Without the ellipsoid geometry, the clean $\|x\|_{V_t^{-1}}$ term is not available, and its sum is not directly controllable via the determinant argument.

## 9) Why dimension $d$ worsens worst-case regret
 
The dimension $d$ enters the regret bound in two places, each contributing a factor of $\sqrt d$:

1. **Sum of exploration bonuses**: The sum $\sum_{t=1}^T \|x_t\|_{V_t^{-1}}$ is bounded by $O(\sqrt{Td\log T})$ via the log-determinant argument, because $\log\det(V_{T+1}) - \log\det(\lambda I)$ is at most $d\log(1 + TL^2/\lambda)$.

2. **Confidence width $\beta_T$**: The confidence parameter $\beta_T$ contains a term $\sqrt{d\log(\cdot)}$, which comes from the same log-determinant inside the self-normalized martingale bound.

**Intuition**: In higher dimensions, there are more directions in parameter space that need to be explored. Each new direction adds uncertainty that must be resolved through data. With $d$ parameters to estimate, more observations are required to pin down $\theta_\star$ in all directions, leading to larger regret.
 
Mathematically, the eigenvalues of $V_t^{-1}$ decay more slowly when $d$ is large (observations are "spread thinner" across dimensions), so the exploration bonuses $\|x_t\|_{V_t^{-1}}$ remain larger for longer.
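
To illustrate the scaling, the sketch below evaluates the right-hand side of (9), with $\beta_T$ replaced by its log-det upper bound, for several dimensions. The function name `regret_bound` and the default constants are assumptions. Each doubling of $d$ multiplies the bound by a factor between $\sqrt 2$ and $2$: the bonus sum contributes exactly $\sqrt 2$, and $\beta_T$ contributes between $1$ and $\sqrt 2$.

```python
import numpy as np

def regret_bound(d, T, lam=1.0, sigma=1.0, S=1.0, L=1.0, delta=0.05):
    """Right-hand side of (9) with beta_T replaced by its log-det upper bound."""
    log_term = d * np.log(1.0 + T * L**2 / lam)
    beta_T = sigma * np.sqrt(2.0 * np.log(1.0 / delta) + log_term) + np.sqrt(lam) * S
    return 2.0 * beta_T * np.sqrt(2.0 * T * log_term)

T = 10_000
bounds = {d: regret_bound(d, T) for d in (2, 4, 8, 16, 32)}
ratios = [bounds[2 * d] / bounds[d] for d in (2, 4, 8, 16)]  # growth per doubling of d
print(bounds)
print(ratios)
```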




# Exercise 3

Application: Consider a simplified Sqwish scenario with 3 prompt variants. Formulate it as a contextual bandit: define the context (e.g. user embedding), actions (prompts), and reward (e.g. conversion vs no conversion). How would UCB decide which prompt to show early on versus after gathering data?

## Solution: Sqwish as a Contextual Bandit

### Formulation

**Context** $x_t \in \mathbb{R}^{d_{user}}$: User embedding/features available before showing a prompt (e.g., user profile, session info, device, time of day).

**Actions** $a \in \{0, 1, 2\}$: The 3 prompt variants to choose from.

**Reward** $r_t \in [0, R_{\max}]$: Continuous spend (e.g., dollars spent after seeing the prompt), clipped to the range $[0, R_{\max}]$ (negative values become 0, values above $R_{\max}$ are capped).

> Note: Clipping spend into a bounded range $[0, R_{\max}]$ makes the environment no longer perfectly linear (both the lower bound at 0 and upper bound at $R_{\max}$ introduce non-linearity).

### Feature construction (user + one-hot prompt)

For each prompt $a$, build a combined feature vector:
$$
x_{t,a} = [\text{user\_features}; \text{onehot}(a)] \in \mathbb{R}^{d_{user} + 3}
$$

Example with $d_{user}=4$ and 3 prompts:
- Prompt 0: $x_{t,0} = [u_1, u_2, u_3, u_4, 1, 0, 0]$
- Prompt 1: $x_{t,1} = [u_1, u_2, u_3, u_4, 0, 1, 0]$
- Prompt 2: $x_{t,2} = [u_1, u_2, u_3, u_4, 0, 0, 1]$

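This lets the model learn a separate baseline reward (intercept) for each prompt on top of a shared user effect. It does not by itself capture user-prompt interactions: the user block contributes the same amount to every prompt's score, so the predicted-best prompt is the same for all users. Personalization requires interaction features, for example the outer product of the user features with the one-hot vector.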
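
With concatenated features, $x_{t,a}^\top\theta$ differs across prompts only through the one-hot block, so the ranking of predicted means does not depend on the user. One way to let the ranking depend on the user is to give each prompt its own copy of the user weights. The sketch below is a hypothetical variant (the name `build_X_interaction` is not part of the original code) using a Kronecker product, which yields dimension $3\,(d_{user}+1)$.

```python
import numpy as np

def build_X_interaction(user_features: np.ndarray, n_prompts: int = 3) -> np.ndarray:
    """Hypothetical variant: row a is onehot(a) kron [user_features; 1]."""
    u = np.append(user_features, 1.0)          # trailing 1 gives each prompt its own bias
    eye = np.eye(n_prompts)
    return np.stack([np.kron(eye[a], u) for a in range(n_prompts)])

u_demo = np.array([0.5, -1.0, 2.0, 0.1])
X_int = build_X_interaction(u_demo)
print(X_int.shape)                             # (3, 15): 3 prompts, 3 * (4 + 1) features
print(X_int[1])                                # only the block for prompt 1 is nonzero
```

The price is a larger $d$, which, by the regret bound above, means more exploration before the estimates settle.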

### How UCB decides: early vs late

**Early (few observations):**
- $A^{-1}$ (the agent's inverse design matrix, $V_t^{-1}$ in the notation of the regret analysis) is large (close to $(1/\lambda) I$) $\Rightarrow$ uncertainty bonus $\alpha\sqrt{x^\top A^{-1} x}$ is high.
- UCB will **explore**: try prompts that haven't been tested much for similar users.
- Even if a prompt has lower estimated mean, its high uncertainty can make it win.

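**Later (many observations):**
- $A^{-1}$ shrinks along the observed directions $\Rightarrow$ uncertainty bonus becomes small.
- UCB **exploits**: picks the prompt with the best predicted reward $x^\top \hat\theta$.
- With user-prompt interaction features, different users can get different "best" prompts (personalization); with the purely additive features above, the predicted-best prompt is the same for every user.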
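
The early-versus-late behaviour can be seen directly from the bonus term. In the sketch below (made-up unit feature vectors, $\lambda=\alpha=1$), one direction is observed repeatedly while another is never observed; the bonus of the first shrinks like $1/\sqrt{1+n}$ and the bonus of the second stays at its initial value.

```python
import numpy as np

d, lam, alpha = 3, 1.0, 1.0
x = np.array([1.0, 0.0, 0.0])        # direction observed in every round
x_new = np.array([0.0, 1.0, 0.0])    # direction never observed
A_inv = np.eye(d) / lam

bonus_seen, bonus_unseen = [], []
for n in range(100):
    # Bonuses after n observations of x
    bonus_seen.append(alpha * np.sqrt(x @ A_inv @ x))
    bonus_unseen.append(alpha * np.sqrt(x_new @ A_inv @ x_new))
    Ax = A_inv @ x
    A_inv = A_inv - np.outer(Ax, Ax) / (1.0 + x @ Ax)   # Sherman-Morrison update

print(bonus_seen[0], bonus_seen[-1], bonus_unseen[-1])  # 1.0, about 0.1, 1.0
```

After 99 observations the bonus along the observed direction has fallen from 1 to about 0.1, so an arm pointing in the unobserved direction would still be explored unless its estimated mean is much lower.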

In [None]:
# --- Sqwish simulation: 3 prompts, user features + one-hot ---

def build_X(user_features: np.ndarray, n_prompts: int = 3) -> np.ndarray:
    """
    Build feature matrix X (K x d) from user features + one-hot prompt ID.
    
    Args:
        user_features: (d_user,) array of user features
        n_prompts: number of prompt variants
    
    Returns:
        X: (n_prompts, d_user + n_prompts) feature matrix
    """
    d_user = len(user_features)
    X = np.zeros((n_prompts, d_user + n_prompts), dtype=np.float64)
    for a in range(n_prompts):
        X[a, :d_user] = user_features      # user features
        X[a, d_user + a] = 1.0             # one-hot for prompt a
    return X


def simulate_sqwish(n_rounds: int = 50, seed: int = 42):
    """
    Run a simple Sqwish simulation with 3 prompts.
    """
    rng = np.random.default_rng(seed)
    
    # Setup
    d_user = 4      # user embedding dimension
    n_prompts = 3   # number of prompt variants
    d = d_user + n_prompts  # total feature dimension
    
    # True unknown parameter theta_star
    # First d_user dims: user feature weights
    # Last 3 dims: prompt-specific biases (prompt 2 is best)
    theta_star = np.array([0.2, 0.1, -0.1, 0.05,   # user weights
                           0.1, 0.3, 0.6])         # prompt biases: prompt 2 > 1 > 0
    
    agent = LinUCBAgent(d=d, alpha=1.0, lam=1.0)
    
    selections = {0: 0, 1: 0, 2: 0}  # count selections per prompt
    cumulative_regret = 0.0          # track regret
    regrets = []                      # regret history
    
    print(f"True theta_star: {theta_star}")
    print(f"Prompt biases (last 3 values): prompt0={theta_star[d_user]:.1f}, "
          f"prompt1={theta_star[d_user+1]:.1f}, prompt2={theta_star[d_user+2]:.1f}")
    print("-" * 60)
    
    for t in range(n_rounds):
        # Sample a random user
        user_features = rng.normal(size=d_user)
        
        # Build X: one row per prompt
        X = build_X(user_features, n_prompts)
        
        # Expected reward per prompt (oracle knows theta_star)
        expected_rewards = X @ theta_star  # shape (n_prompts,)

        # Select prompt using UCB
        a = agent.select_arm(X)
        selections[a] += 1

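        # --- Realized regret (simulation-only) ---
        # Regret here is measured against the best *realized* (noisy, clipped) reward
        # in each round, so it keeps growing even when the best prompt is always chosen.
        # Pseudo-regret would compare expected_rewards instead.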
        noise_vec = rng.normal(0, 0.5, size=n_prompts)
        raw_spend_all = expected_rewards + noise_vec

        max_spend = 10.0
        r_all = np.clip(raw_spend_all, 0.0, max_spend).astype(np.float64)
        r = float(r_all[a])

        instant_regret = float(np.max(r_all) - r)
        cumulative_regret += instant_regret
        regrets.append(cumulative_regret)

        # Update agent with the observed (chosen-arm) realized reward
        agent.update(X[a], r)
        
        # Print progress at checkpoints
        if t + 1 in [5, 10, 25, 50, 100, 500, 1000]:
            print(f"After {t+1:4d} rounds: selections = {dict(selections)}, "
                  f"cumulative regret = {cumulative_regret:.2f}")
    
    print("-" * 60)
    print(f"Final theta_hat: {agent.theta_hat().round(3)}")
    print(f"True theta_star: {theta_star}")
    print(f"Total cumulative regret: {cumulative_regret:.2f}")
    
    return agent, selections, regrets


# Run the simulation
agent, selections, regrets = simulate_sqwish(n_rounds=1000)

True theta_star: [ 0.2   0.1  -0.1   0.05  0.1   0.3   0.6 ]
Prompt biases (last 3 values): prompt0=0.1, prompt1=0.3, prompt2=0.6
------------------------------------------------------------
After    5 rounds: selections = {0: 2, 1: 3, 2: 0}, cumulative regret = 3.25
After   10 rounds: selections = {0: 2, 1: 3, 2: 5}, cumulative regret = 3.47
After   25 rounds: selections = {0: 2, 1: 3, 2: 20}, cumulative regret = 4.27
After   50 rounds: selections = {0: 2, 1: 3, 2: 45}, cumulative regret = 8.06
After  100 rounds: selections = {0: 2, 1: 4, 2: 94}, cumulative regret = 22.58
After  500 rounds: selections = {0: 2, 1: 4, 2: 494}, cumulative regret = 96.01
After 1000 rounds: selections = {0: 2, 1: 4, 2: 994}, cumulative regret = 183.89
------------------------------------------------------------
Final theta_hat: [ 0.16   0.087 -0.093  0.029  0.028  0.146  0.631]
True theta_star: [ 0.2   0.1  -0.1   0.05  0.1   0.3   0.6 ]
Total cumulative regret: 183.89
