# Capstone Project

**Objective:** Synthesize everything you’ve learned by designing and implementing a mini version of Sqwish’s optimization engine. You will create a simulated environment and then build an agent that optimizes within it, combining bandit algorithms, proxy reward modeling, safety checks, and off-policy evaluation (OPE) in one end-to-end prototype.

**Project Brief:** *E-commerce Description Optimizer.* You will simulate an e-commerce website where an LLM generates product descriptions for users, and the goal is to maximize conversion (purchase) while respecting cost. Three different LLMs (of varying cost and quality) are available. Users have different preferences. You’ll build a contextual bandit agent to route and prompt the LLMs optimally.

**Environment Setup:**

- **User Context:** Define a user persona feature (e.g. budget_sensitive vs quality_seeker) and a product category feature. These together form the context $x$.
- **Arms/Actions:** Three LLM choices for generating the description: *Model A* (cheap & concise), *Model B* (moderate), *Model C* (expensive & detailed). You can also allow the prompt to vary or other actions, but at minimum choosing the model is the action.
- **Hidden Reward Function:** Simulate probability of conversion as a function of context and model. For example: budget_sensitive users convert better with concise Model A (perhaps they don’t like fluff), quality_seekers convert better with detailed Model C. You can fabricate this mapping, e.g., $P(\text{buy}|x,\text{A}) = 0.05$ normally, but $0.15$ if user is budget_sensitive; $P(\text{buy}|x,\text{C}) = 0.05$ normally, but $0.15$ if user is quality_seeker, etc. The idea is each model is optimal for a certain segment. Conversion is binary (success/fail).
- **Cost Model:** Assign a “cost” to using each model (e.g. A costs $\$0.01$, B $\$0.02$, and C $\$0.10$ per description). These costs are used when evaluating profit.
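
The environment described above can be sketched as a small simulator. This is a minimal illustration, not a reference implementation: the category values, the flat $0.08$ conversion rate for Model B, and the function names (`conversion_prob`, `sample_context`, `step`) are assumptions made for the example. Only the A/C conversion rates and the per-model costs come from the brief.

```python
import random

MODELS = ["A", "B", "C"]
PERSONAS = ["budget_sensitive", "quality_seeker"]
CATEGORIES = ["electronics", "apparel"]  # hypothetical category values

# Per-description cost in dollars, taken from the brief.
COST = {"A": 0.01, "B": 0.02, "C": 0.10}


def conversion_prob(persona, category, model):
    """Hidden reward function P(buy | x, model).

    A and C follow the example numbers in the brief; the flat 0.08 for
    Model B is an assumption. `category` is accepted but unused here.
    """
    if model == "A":
        return 0.15 if persona == "budget_sensitive" else 0.05
    if model == "C":
        return 0.15 if persona == "quality_seeker" else 0.05
    return 0.08


def sample_context(rng):
    """Draw a context x = (persona, category) uniformly at random."""
    return rng.choice(PERSONAS), rng.choice(CATEGORIES)


def step(rng, context, model):
    """Simulate one interaction; return (converted, cost)."""
    persona, category = context
    converted = int(rng.random() < conversion_prob(persona, category, model))
    return converted, COST[model]


if __name__ == "__main__":
    rng = random.Random(0)
    n, buys, spend = 10_000, 0, 0.0
    for _ in range(n):
        x = sample_context(rng)
        converted, cost = step(rng, x, rng.choice(MODELS))
        buys += converted
        spend += cost
    print(f"random policy: conversion={buys / n:.3f}, mean cost=${spend / n:.3f}")
```

Running the script gives a random-policy baseline to compare your agent against. Making `conversion_prob` depend on `category` as well is a natural extension if you want a richer context.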

**Agent Requirements:**

- Use a **Contextual Bandit algorithm** (Thompson Sampling or LinUCB recommended) to learn over interactions which model works best for which context. The agent will make a choice each round (given context, pick model), observe a stochastic reward (1 if conversion happened, 0 if not).
- Incorporate a **Proxy Reward Model (LLM Judge)**: To make the problem more realistic, assume conversion events are rare or delayed (users may purchase much later). Introduce an immediate proxy reward instead, e.g., an LLM that scores the description’s “persuasiveness” from 0 to 1. The bandit trains on this proxy reward every round (since conversion is delayed), but you will later evaluate it on actual conversion. The proxy should be correlated with conversion but not perfectly, to simulate reality.
- **Off-Policy Evaluation:** Before fully trusting your learned policy, evaluate it offline. For example, have the agent follow a random policy for the first 1000 interactions to gather a log. Once your bandit policy is learned, use **IPS or Doubly Robust** estimation on that log to estimate the conversion rate of the bandit policy *without* deploying it. Then compare this estimate to the actual performance when you run the bandit live in the simulator. This validates your OPE integration.
- **Safety Constraint:** Implement a simple safety rule in the simulator (for instance, Model C might occasionally produce an unsafe word). When that happens, the user does not buy and is unhappy. Ensure that your agent either learns to avoid this outcome or that you add a filter. (You can simulate this by saying that, with small probability, Model C outputs something disallowed, which always results in no conversion; the agent can learn that risk, or you can penalize it explicitly.)
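
One way to combine the bandit, proxy-reward, and safety requirements is a Beta-Bernoulli Thompson Sampling agent that keeps one posterior per (persona, model) pair. The sketch below is illustrative only. The simulated judge `proxy_score` (a noisy linear function of the hidden conversion rate), the value of `UNSAFE_PROB_C`, the $0.08$ rate for Model B, and the choice to use the persona alone as context are all assumptions. Because the proxy lies in $[0, 1]$ rather than $\{0, 1\}$, the agent binarizes it with a Bernoulli trial before updating, which is one common trick and not the only option.

```python
import random

MODELS = ["A", "B", "C"]
PERSONAS = ["budget_sensitive", "quality_seeker"]
UNSAFE_PROB_C = 0.05  # assumed chance that Model C emits disallowed text


def conversion_prob(persona, model):
    """Hidden conversion rates; B = 0.08 is an assumption."""
    if model == "A":
        return 0.15 if persona == "budget_sensitive" else 0.05
    if model == "C":
        return 0.15 if persona == "quality_seeker" else 0.05
    return 0.08


def proxy_score(rng, persona, model):
    """Stand-in for an LLM judge: a noisy score in [0, 1] that rises with
    the true conversion rate. The linear form and noise level are assumptions."""
    noisy = 0.3 + 2.0 * conversion_prob(persona, model) + rng.gauss(0.0, 0.1)
    return min(1.0, max(0.0, noisy))


class ThompsonAgent:
    """Beta-Bernoulli Thompson Sampling with one posterior per (context, arm)."""

    def __init__(self, contexts, arms, rng):
        self.rng = rng
        self.arms = list(arms)
        # [alpha, beta] of a Beta(1, 1) prior for every (context, arm) pair.
        self.params = {(c, a): [1.0, 1.0] for c in contexts for a in self.arms}

    def choose(self, context):
        """Sample each arm's posterior and play the arm with the largest draw."""
        draws = {a: self.rng.betavariate(*self.params[(context, a)]) for a in self.arms}
        return max(draws, key=draws.get)

    def update(self, context, arm, reward):
        """Binarize a reward in [0, 1] with a Bernoulli trial, then update."""
        success = self.rng.random() < reward
        self.params[(context, arm)][0 if success else 1] += 1.0

    def greedy(self, context):
        """Arm with the highest posterior mean for this context."""
        def mean(arm):
            alpha, beta = self.params[(context, arm)]
            return alpha / (alpha + beta)
        return max(self.arms, key=mean)


def train(n_rounds=5000, seed=0):
    rng = random.Random(seed)
    agent = ThompsonAgent(PERSONAS, MODELS, rng)
    for _ in range(n_rounds):
        persona = rng.choice(PERSONAS)
        arm = agent.choose(persona)
        reward = proxy_score(rng, persona, arm)
        if arm == "C" and rng.random() < UNSAFE_PROB_C:
            reward = 0.0  # explicit penalty for an unsafe output
        agent.update(persona, arm, reward)
    return agent


if __name__ == "__main__":
    agent = train()
    for persona in PERSONAS:
        print(persona, "->", agent.greedy(persona))
```

Note that the agent only ever sees the proxy score. Whether its learned routing actually improves conversion is a separate question that you should answer by evaluating against the hidden `conversion_prob`, which is the point of the proxy exercise.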

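The off-policy evaluation step can be sketched with the inverse propensity scoring (IPS) estimator. For a deterministic target policy $\pi$ and a logging policy $\mu$, it is

$$\hat{V}_{\text{IPS}}(\pi) = \frac{1}{n}\sum_{i=1}^{n} \frac{\mathbb{1}[\pi(x_i) = a_i]}{\mu(a_i \mid x_i)}\, r_i .$$

The code below is a minimal sketch under stated assumptions: the logging policy is uniform over the three models (so every propensity is $1/3$), the context is the persona only, Model B converts at an assumed $0.08$, and `learned_policy` is a hypothetical stand-in for whatever your bandit ends up learning.

```python
import random

MODELS = ["A", "B", "C"]
PERSONAS = ["budget_sensitive", "quality_seeker"]


def conversion_prob(persona, model):
    """Hidden conversion rates; B = 0.08 is an assumption."""
    if model == "A":
        return 0.15 if persona == "budget_sensitive" else 0.05
    if model == "C":
        return 0.15 if persona == "quality_seeker" else 0.05
    return 0.08


def collect_log(n, rng):
    """Run a uniform-random logging policy and record
    (context, action, propensity, reward) for each interaction."""
    log = []
    for _ in range(n):
        persona = rng.choice(PERSONAS)
        model = rng.choice(MODELS)
        reward = int(rng.random() < conversion_prob(persona, model))
        log.append((persona, model, 1.0 / len(MODELS), reward))
    return log


def ips_estimate(log, target_policy):
    """IPS estimate of a deterministic target policy's expected reward."""
    total = 0.0
    for context, action, propensity, reward in log:
        if target_policy(context) == action:
            total += reward / propensity
    return total / len(log)


def learned_policy(persona):
    """Hypothetical stand-in for the policy your bandit has learned."""
    return "A" if persona == "budget_sensitive" else "C"


if __name__ == "__main__":
    log = collect_log(1000, random.Random(0))
    print(f"IPS estimate of conversion rate: {ips_estimate(log, learned_policy):.3f}")
```

With only 1000 logged interactions and rare conversions, the IPS estimate is unbiased but noisy, so expect it to differ somewhat from the live conversion rate. A Doubly Robust estimator, which adds a fitted reward model as a baseline, is the usual next step for reducing that variance.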