# The K-Armed Bandit Problem

## 1. Problem Overview
- **Objective**: A decision-maker (agent) repeatedly selects one of $k$ possible actions, receives a stochastic reward that depends on the chosen action, and aims to maximize the expected total reward over time
- **Key characteristics**:
  - Reward distributions for actions are initially unknown
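  - True action value: $q_*(a) = \mathbb{E}[R_t \mid A_t = a]$, the expected reward $R_t$ given that action $A_t = a$ is selected at timestep $t$
  - Optimal goal: Maximize expected cumulative reward, which amounts to finding and selecting the action with the highest $q_*(a)$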
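
To make the setup concrete, the following is a minimal sketch of a bandit environment. The class name `GaussianBandit`, its methods, and the choice of unit-variance Gaussian rewards are assumptions made for illustration; the problem itself only requires that each action has some unknown reward distribution with mean $q_*(a)$.

```python
import random

class GaussianBandit:
    """Hypothetical k-armed bandit with Gaussian rewards (an illustrative assumption)."""

    def __init__(self, k, seed=None):
        self.rng = random.Random(seed)
        # True action values q_*(a); hidden from the agent in a real experiment.
        self.q_star = [self.rng.gauss(0.0, 1.0) for _ in range(k)]

    def pull(self, a):
        """Return a stochastic reward R_t for action a, with mean q_*(a)."""
        return self.rng.gauss(self.q_star[a], 1.0)

    def optimal_action(self):
        """Index of the action with the highest true value."""
        return max(range(len(self.q_star)), key=lambda a: self.q_star[a])

bandit = GaussianBandit(k=5, seed=0)
reward = bandit.pull(2)  # one noisy sample from action 2
print(reward, bandit.optimal_action())
```

The agent only ever observes the values returned by `pull`; `q_star` and `optimal_action` exist so that an experiment can measure how well the agent is doing.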

## 2. Estimating Action Values

### 2.1. Sample-Average Method
$$Q_t(a) = \frac{\sum_{i=1}^{t-1} R_i \cdot \mathbb{I}(A_i = a)}{N_t(a)}$$
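- $Q_t(a)$: Estimate of $q_*(a)$ at timestep $t$, the mean of the rewards received from action $a$ so far
- $N_t(a) = \sum_{i=1}^{t-1} \mathbb{I}(A_i = a)$: Number of times action $a$ was chosen before timestep $t$
- $R_i$: Reward received at timestep $i$
- If $N_t(a) = 0$, the ratio is undefined and $Q_t(a)$ is set to a default value such as 0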
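
The formula above can be sketched directly as a function over the recorded history. The function name `sample_average` and the `default` argument (used when the action has never been chosen) are illustrative assumptions, not part of the definition.

```python
def sample_average(actions, rewards, a, default=0.0):
    """Return Q_t(a): the mean reward over the steps where action `a` was taken."""
    total = 0.0
    count = 0
    for action, reward in zip(actions, rewards):
        if action == a:  # indicator I(A_i = a)
            total += reward
            count += 1
    # With N_t(a) = 0 the ratio is undefined, so fall back to the default.
    return total / count if count > 0 else default

actions = [0, 1, 0, 2, 0]
rewards = [1.0, 0.5, 3.0, 2.0, 2.0]
print(sample_average(actions, rewards, 0))  # mean of 1.0, 3.0 and 2.0
```

This version stores and rescans the whole history on every call, so its cost grows with the number of timesteps.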

### 2.2. Incremental Update Rule
$$Q_{n+1} = Q_n + \frac{1}{n} \left( R_n - Q_n \right)$$
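- **Notation**: $Q_n$ is the estimate of a single action's value after it has been selected $n-1$ times, and $R_n$ is the reward from its $n$-th selection
- **Advantage**: No need to store historical reward data; memory and per-step computation stay constant
- **Online adjustment**: Each update moves the estimate toward the new reward by a fraction $\frac{1}{n}$ of the prediction error $(R_n - Q_n)$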
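
As a quick check that the incremental rule reproduces the ordinary mean, here is a small sketch for a single action. The helper name `incremental_update` and the sample rewards are made up for illustration.

```python
def incremental_update(q, n, reward):
    """Q_{n+1} = Q_n + (1/n)(R_n - Q_n), where `reward` is the n-th reward (n >= 1)."""
    return q + (reward - q) / n

rewards = [1.0, 3.0, 2.0, 6.0]
q = 0.0  # initial estimate; the first update overwrites it because 1/n = 1
for n, r in enumerate(rewards, start=1):
    q = incremental_update(q, n, r)

batch_mean = sum(rewards) / len(rewards)
print(q, batch_mean)  # the two estimates agree up to floating-point rounding
```

Only the current estimate and the count `n` are kept between updates, which is what makes the rule suitable for online use.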

## 3. Balancing Exploration & Exploitation

### 3.1. The Fundamental Tradeoff
| **Exploration**                                | **Exploitation**                                    |
|------------------------------------------------|-----------------------------------------------------|
| Tries non-greedy actions to gather information | Chooses the action with the highest current estimate |
| Improves long-term knowledge                   | Maximizes immediate expected reward                 |

A single action selection cannot do both, so exploration and exploitation are mutually exclusive at each step.

### 3.2. ε-Greedy Action Selection
$$A_t = \begin{cases}
\underset{a}{\operatorname{argmax}} \, Q_t(a) & \text{with probability } 1-\varepsilon \\
\text{uniformly random action from } \{a_1, \dots, a_k\} & \text{with probability } \varepsilon
\end{cases}$$
- $\varepsilon$: Exploration rate (typically 0.01 to 0.1)
- **Balance**: Exploits the current estimates most of the time and explores at random on a fraction $\varepsilon$ of steps
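
A minimal sketch of this selection rule follows. The function name `epsilon_greedy`, the optional `rng` argument, and the random tie-breaking among equally good greedy actions are assumptions for illustration; the formula itself leaves ties unspecified.

```python
import random

def epsilon_greedy(q_values, epsilon, rng=random):
    """Return an action index: uniformly random with probability epsilon, else greedy."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))  # explore: any of the k actions
    best = max(q_values)
    # Exploit: break ties among maximal estimates at random (an assumption).
    return rng.choice([a for a, q in enumerate(q_values) if q == best])

rng = random.Random(0)
q_values = [0.1, 0.9, 0.3]
picks = [epsilon_greedy(q_values, epsilon=0.1, rng=rng) for _ in range(1000)]
print(picks.count(1) / len(picks))  # fraction of steps on the greedy action
```

Because the random branch can also land on the greedy action, that action is chosen with probability $1 - \varepsilon + \varepsilon / k$ rather than exactly $1 - \varepsilon$.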

### 3.3. Optimistic Initial Values
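- **Mechanism**: Initialize every estimate $Q_0(a)$ well above any plausible true value $q_*(a)$
- **Effect**: Encourages early exploration of all actions; each observed reward falls short of the optimistic estimate, so the estimate drops and greedy selection moves on to actions not yet tried
- **Limitations**:
  - Only effective during the initial phase; the drive to explore fades once the estimates have settled
  - Unsuitable for non-stationary problems, where the need to explore recurs as the action values change
  - No systematic way to set initial values; a suitable choice requires some prior knowledge of the reward scale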
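
The effect can be seen in a small deterministic sketch. The function `run_greedy`, the three true values, and the use of noise-free rewards are simplifying assumptions chosen so that the behavior is easy to trace by hand.

```python
def run_greedy(true_values, initial_q, steps):
    """Purely greedy agent with sample-average updates; returns the actions taken."""
    k = len(true_values)
    q = [initial_q] * k
    n = [0] * k
    history = []
    for _ in range(steps):
        a = max(range(k), key=lambda i: q[i])  # first maximal index wins ties
        reward = true_values[a]  # deterministic reward (simplifying assumption)
        n[a] += 1
        q[a] += (reward - q[a]) / n[a]  # incremental sample average
        history.append(a)
    return history

true_values = [0.2, 0.8, 0.5]
optimistic = run_greedy(true_values, initial_q=5.0, steps=6)
neutral = run_greedy(true_values, initial_q=0.0, steps=6)
print(optimistic)  # [0, 1, 2, 1, 1, 1]
print(neutral)     # [0, 0, 0, 0, 0, 0]
```

With optimistic estimates the greedy agent tries every action once before settling on the best one, whereas with estimates initialized to 0 it locks onto the first action it happens to try.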

### 3.4. Upper Confidence Bound (UCB) Action Selection
$$A_t = \underset{a}{\operatorname{argmax}} \left[ Q_t(a) + c \sqrt{\frac{\ln t}{N_t(a)}} \right]$$
- **Components**:
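  - $Q_t(a)$: Current value estimate
  - $\sqrt{\frac{\ln t}{N_t(a)}}$: Uncertainty measure; it shrinks each time $a$ is selected and grows slowly (logarithmically in $t$) while $a$ is not selected
  - $c > 0$: Exploration control parameter; larger values favor exploration
  - If $N_t(a) = 0$, action $a$ is treated as a maximizing action, so every action is tried at least once
- **Principle**: Prefers actions with a high estimate or high uncertainty, so under-explored actions are revisited until their estimates can be trusted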
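
The selection rule can be sketched as follows. The function name `ucb_select`, the default `c=2.0`, and the convention of picking the lowest-indexed untried action first are illustrative assumptions.

```python
import math

def ucb_select(q_values, counts, t, c=2.0):
    """Return argmax_a [Q_t(a) + c * sqrt(ln t / N_t(a))]; untried actions go first."""
    for a, n in enumerate(counts):
        if n == 0:  # N_t(a) = 0: treat the action as maximizing
            return a
    scores = [q + c * math.sqrt(math.log(t) / n)
              for q, n in zip(q_values, counts)]
    return max(range(len(scores)), key=lambda a: scores[a])

# Action 0 has the higher estimate, but action 1 has been tried only once.
print(ucb_select([0.5, 0.4], counts=[100, 1], t=101))  # 1
```

In this example the uncertainty bonus of the rarely tried action outweighs its slightly lower estimate; with `c=0.0` the same call reduces to greedy selection and returns action 0.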