# Multi-Armed Bandits

**Chapter 1: Exploration vs Exploitation in Sequential Decision Making**

The multi-armed bandit problem is a fundamental framework in reinforcement learning that captures the exploration-exploitation trade-off. Named after slot machines (one-armed bandits), it models scenarios where an agent must choose between multiple options with unknown reward distributions.

## Problem Formulation

A $k$-armed bandit has $k$ possible actions (arms). Each arm $a$ has an expected reward that is unknown to the agent:

$$q_*(a) = \mathbb{E}[R_t \mid A_t = a]$$

At each timestep $t$, the agent:
1. Selects an action $A_t$
2. Receives reward $R_t$
3. Updates estimates of action values

The goal is to maximize cumulative reward over time.
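
The three steps above can be sketched as a short loop. This is a minimal, self-contained illustration, assuming a hypothetical 3-armed Gaussian bandit with hand-picked means and a uniformly random policy (both chosen only for illustration). It also tracks cumulative regret, the expected reward lost relative to always pulling the best arm.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical true means, unknown to the agent (illustrative values)
true_means = np.array([0.2, 1.0, -0.5])
k = len(true_means)

Q = np.zeros(k)   # estimated action values
N = np.zeros(k)   # number of times each arm was pulled
total_reward = 0.0
total_regret = 0.0

for t in range(500):
    action = rng.integers(k)                      # 1. select an action (uniformly at random here)
    reward = rng.normal(true_means[action], 1.0)  # 2. receive a noisy reward
    N[action] += 1                                # 3. update the estimate for that arm
    Q[action] += (reward - Q[action]) / N[action]

    total_reward += reward
    total_regret += true_means.max() - true_means[action]

print("Estimated values:", np.round(Q, 2))
print("Cumulative reward:", round(total_reward, 1))
print("Cumulative regret:", round(total_regret, 1))
```

Maximizing cumulative reward and minimizing cumulative regret are equivalent in expectation. A random policy accumulates regret linearly in the number of steps, which is the baseline a good strategy should beat.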

In [None]:
import numpy as np
import matplotlib.pyplot as plt

# Fix the global NumPy seed so the results are reproducible
np.random.seed(42)

## Bandit Environment

We'll create a simple $k$-armed bandit where each arm returns rewards from a Gaussian distribution:

In [None]:
class MultiArmedBandit:
    """
    k-armed bandit with Gaussian reward distributions.
    """
    def __init__(self, k=10, mean_range=(-2, 2), std=1.0):
        """
        Args:
            k: Number of arms
            mean_range: Range for sampling true means
            std: Standard deviation of reward noise
        """
        self.k = k
        self.std = std
        
        # True mean rewards for each arm (unknown to agent)
        self.true_means = np.random.uniform(
            mean_range[0], mean_range[1], size=k
        )
        self.optimal_arm = np.argmax(self.true_means)
        self.optimal_value = self.true_means[self.optimal_arm]
        
    def pull(self, arm):
        """Pull an arm and receive a noisy reward."""
        return np.random.normal(self.true_means[arm], self.std)
    
    def get_regret(self, arm):
        """Calculate regret for choosing this arm."""
        return self.optimal_value - self.true_means[arm]

# Create a 10-armed bandit
bandit = MultiArmedBandit(k=10)

print("True mean rewards:")
for i, mean in enumerate(bandit.true_means):
    marker = " ← optimal" if i == bandit.optimal_arm else ""
    print(f"  Arm {i}: {mean:.3f}{marker}")

## Action Value Estimation

We estimate the value of action $a$ using sample averaging:

$$Q_t(a) = \frac{\sum_{i=1}^{t-1} R_i \cdot \mathbb{1}_{A_i=a}}{\sum_{i=1}^{t-1} \mathbb{1}_{A_i=a}}$$

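This can be computed incrementally. For a single action, let $Q_n$ be its estimate after it has been selected $n-1$ times and $R_n$ the reward from its $n$-th selection: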

$$Q_{n+1} = Q_n + \frac{1}{n}[R_n - Q_n]$$
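
As a quick sanity check, the incremental rule should reproduce the ordinary mean without storing past rewards. The sketch below assumes an illustrative stream of Gaussian rewards for a single arm; the seed and parameters are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
rewards = rng.normal(loc=0.5, scale=1.0, size=100)  # illustrative reward stream for one arm

Q = 0.0  # Q_1: initial estimate before any reward is seen
for n, r in enumerate(rewards, start=1):
    Q += (r - Q) / n  # Q_{n+1} = Q_n + (1/n) * (R_n - Q_n)

print(f"incremental: {Q:.6f}")
print(f"batch mean:  {rewards.mean():.6f}")
```

Note that the first update ($n = 1$) overwrites the initial estimate entirely, so with sample averaging the initial value only matters for arms that have not been pulled yet.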

## Strategy 1: Greedy

Always select the action with highest estimated value:

$$A_t = \arg\max_a Q_t(a)$$

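This exploits current knowledge but never explores, so an arm whose first few rewards happen to be low may never be tried again, even if it is actually the best.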

In [None]:
class GreedyAgent:
    """Greedy action selection (pure exploitation)."""
    def __init__(self, k, initial_value=0.0):
        self.k = k
        self.Q = np.full(k, initial_value, dtype=float)  # Estimated values (float, even if initial_value is an int)
        self.N = np.zeros(k)  # Action counts
        
    def select_action(self):
        """Select action with highest estimated value."""
        return np.argmax(self.Q)
    
    def update(self, action, reward):
        """Update estimates using sample averaging."""
        self.N[action] += 1
        self.Q[action] += (reward - self.Q[action]) / self.N[action]

## Strategy 2: ε-Greedy

With probability $\varepsilon$, explore (random action); otherwise exploit (greedy action):

$$A_t = \begin{cases}
\text{random action} & \text{with probability } \varepsilon \\
\arg\max_a Q_t(a) & \text{with probability } 1-\varepsilon
\end{cases}$$
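
One detail worth making explicit: if the random action is drawn uniformly from all $k$ arms, it can coincide with the greedy arm, so the greedy arm is actually chosen with probability $1 - \varepsilon + \varepsilon/k$. The sketch below checks this empirically; the value estimates and the helper `epsilon_greedy` are hypothetical and used only for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)

k, epsilon = 10, 0.1
Q = np.zeros(k)
Q[3] = 1.0  # hypothetical estimates: arm 3 is currently the greedy arm

def epsilon_greedy(Q, epsilon, rng):
    """Return a uniformly random arm with probability epsilon, else the greedy arm."""
    if rng.random() < epsilon:
        return int(rng.integers(len(Q)))
    return int(np.argmax(Q))

n_trials = 100_000
picks = np.array([epsilon_greedy(Q, epsilon, rng) for _ in range(n_trials)])

empirical = np.mean(picks == 3)
expected = 1 - epsilon + epsilon / k  # 0.91 for k=10, epsilon=0.1
print(f"empirical: {empirical:.3f}, expected: {expected:.3f}")
```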

In [None]:
class EpsilonGreedyAgent:
    """ε-greedy action selection."""
    def __init__(self, k, epsilon=0.1, initial_value=0.0):
        self.k = k
        self.epsilon = epsilon
        self.Q = np.full(k, initial_value, dtype=float)  # float, even if initial_value is an int
        self.N = np.zeros(k)
        
    def select_action(self):
        """Select action using ε-greedy strategy."""
        if np.random.random() < self.epsilon:
            return np.random.randint(self.k)  # Explore
        else:
            return np.argmax(self.Q)  # Exploit
    
    def update(self, action, reward):
        """Update estimates."""
        self.N[action] += 1
        self.Q[action] += (reward - self.Q[action]) / self.N[action]

## Strategy 3: Upper Confidence Bound (UCB)

Select actions based on estimated value plus an uncertainty bonus:

$$A_t = \arg\max_a \left[ Q_t(a) + c\sqrt{\frac{\ln t}{N_t(a)}} \right]$$

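Here $N_t(a)$ is the number of times action $a$ has been selected before time $t$, and $c > 0$ controls how strongly uncertainty is rewarded. The bonus shrinks each time an action is selected and grows slowly (through $\ln t$) while it is not, so rarely tried actions are eventually revisited.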
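
To make the size of the bonus concrete, the sketch below evaluates the UCB score for four arms at a single timestep. The pull counts and value estimates are hypothetical numbers chosen for illustration, not results from an experiment.

```python
import numpy as np

c = 2.0
t = 1000                             # current timestep
N = np.array([1, 10, 100, 888])      # hypothetical pull counts (sum to t - 1)
Q = np.array([0.0, 0.5, 1.0, 1.2])   # hypothetical value estimates

bonus = c * np.sqrt(np.log(t) / N)   # uncertainty bonus per arm
ucb = Q + bonus

for n, q, b, u in zip(N, Q, bonus, ucb):
    print(f"N={int(n):4d}  Q={q:.2f}  bonus={b:.2f}  UCB={u:.2f}")
print("Selected arm:", int(np.argmax(ucb)))
```

With these numbers the arm that has been pulled only once has the lowest estimate but by far the largest bonus, so it is the one selected. The heavily sampled arm is ranked almost entirely by its estimate.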

In [None]:
class UCBAgent:
    """Upper Confidence Bound action selection."""
    def __init__(self, k, c=2.0, initial_value=0.0):
        self.k = k
        self.c = c
        self.Q = np.full(k, initial_value, dtype=float)  # float, even if initial_value is an int
        self.N = np.zeros(k)
        self.t = 0
        
    def select_action(self):
        """Select action using UCB."""
        self.t += 1
        
        # Try each action at least once
        if np.min(self.N) == 0:
            return np.argmin(self.N)
        
        # UCB formula
        ucb_values = self.Q + self.c * np.sqrt(np.log(self.t) / self.N)
        return np.argmax(ucb_values)
    
    def update(self, action, reward):
        """Update estimates."""
        self.N[action] += 1
        self.Q[action] += (reward - self.Q[action]) / self.N[action]

## Strategy 4: Thompson Sampling

Bayesian approach that maintains a probability distribution over action values. At each step:
1. Sample a value from each arm's posterior distribution
2. Select the arm with the highest sampled value

For Gaussian rewards with known variance, a Gaussian prior on each arm's mean is conjugate: the posterior is also Gaussian and has a closed-form update.
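
The Gaussian update with known reward variance can be written out explicitly. As a sketch, assume the belief about one arm's mean is $\mathcal{N}(\mu_0, 1/\tau_0)$ and rewards have known variance $\sigma^2$ (the symbols $\mu_0$, $\tau_0$, and $\sigma$ are notation introduced here for illustration). After observing a reward $r$ from that arm, the posterior is again Gaussian, with precision and mean:

$$\tau_1 = \tau_0 + \frac{1}{\sigma^2}, \qquad \mu_1 = \frac{\tau_0 \mu_0 + r/\sigma^2}{\tau_1}$$

Applying this after every pull, the posterior precision after $n$ rewards from the arm is $\tau_0 + n/\sigma^2$, so the posterior standard deviation shrinks roughly as $\sigma/\sqrt{n}$ and the sampled values concentrate around the posterior mean. This is what makes exploration fade for well-sampled arms.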

In [None]:
class ThompsonSamplingAgent:
    """Thompson Sampling with Gaussian posterior."""
    def __init__(self, k, prior_mean=0.0, prior_std=1.0, reward_std=1.0):
        self.k = k
        self.reward_std = reward_std
        
        # Posterior parameters (mean and precision)
        self.mu = np.full(k, prior_mean, dtype=float)  # float, even if prior_mean is an int
        self.tau = np.full(k, 1.0 / (prior_std ** 2))  # Precision
        self.N = np.zeros(k)
        
    def select_action(self):
        """Sample from posteriors and select best."""
        # Sample from each arm's posterior
        samples = np.random.normal(
            self.mu, 
            1.0 / np.sqrt(self.tau)
        )
        return np.argmax(samples)
    
    def update(self, action, reward):
        """Update posterior using Bayesian update."""
        self.N[action] += 1
        
        # Bayesian update for Gaussian with known variance
        reward_precision = 1.0 / (self.reward_std ** 2)
        
        new_tau = self.tau[action] + reward_precision
        new_mu = (self.tau[action] * self.mu[action] + 
                  reward_precision * reward) / new_tau
        
        self.mu[action] = new_mu
        self.tau[action] = new_tau

## Comparison Experiment

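Let's compare the four strategies, with two settings of ε for ε-greedy. Each configuration is averaged over 200 runs, and every run draws a fresh 10-armed bandit: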

In [None]:
def run_experiment(agent, bandit, steps=1000):
    """Run bandit experiment and track performance."""
    rewards = np.zeros(steps)
    optimal_actions = np.zeros(steps)
    
    for t in range(steps):
        action = agent.select_action()
        reward = bandit.pull(action)
        agent.update(action, reward)
        
        rewards[t] = reward
        optimal_actions[t] = (action == bandit.optimal_arm)
    
    return rewards, optimal_actions

# Run multiple experiments and average
num_runs = 200
steps = 1000

agents_config = [
    ("Greedy", lambda k: GreedyAgent(k)),
    ("ε-greedy (0.01)", lambda k: EpsilonGreedyAgent(k, epsilon=0.01)),
    ("ε-greedy (0.1)", lambda k: EpsilonGreedyAgent(k, epsilon=0.1)),
    ("UCB (c=2)", lambda k: UCBAgent(k, c=2.0)),
    ("Thompson Sampling", lambda k: ThompsonSamplingAgent(k))
]

results = {}

for name, agent_factory in agents_config:
    print(f"Running {name}...")
    all_rewards = np.zeros((num_runs, steps))
    all_optimal = np.zeros((num_runs, steps))
    
    for run in range(num_runs):
        bandit = MultiArmedBandit(k=10)
        agent = agent_factory(10)
        rewards, optimal = run_experiment(agent, bandit, steps)
        all_rewards[run] = rewards
        all_optimal[run] = optimal
    
    results[name] = {
        'rewards': np.mean(all_rewards, axis=0),
        'optimal': np.mean(all_optimal, axis=0)
    }

print("\nDone!")

## Results Visualization

In [None]:
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))

# Average reward over time
for name in results:
    ax1.plot(results[name]['rewards'], label=name, linewidth=1.5)
ax1.set_xlabel('Steps', fontsize=11)
ax1.set_ylabel('Average Reward', fontsize=11)
ax1.set_title('Average Reward vs Steps', fontsize=12)
ax1.legend(fontsize=9)
ax1.grid(True, alpha=0.3)

# % Optimal action
for name in results:
    ax2.plot(results[name]['optimal'] * 100, label=name, linewidth=1.5)
ax2.set_xlabel('Steps', fontsize=11)
ax2.set_ylabel('% Optimal Action', fontsize=11)
ax2.set_title('Optimal Action Selection vs Steps', fontsize=12)
ax2.legend(fontsize=9)
ax2.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("\nFinal Performance (last 100 steps):")
for name in results:
    avg_reward = np.mean(results[name]['rewards'][-100:])
    pct_optimal = np.mean(results[name]['optimal'][-100:]) * 100
    print(f"  {name:20s}: Reward={avg_reward:.3f}, Optimal={pct_optimal:.1f}%")

## Key Insights

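1. **Greedy tends to fail**: with no exploration, it can lock onto a suboptimal arm after a few unlucky early rewards
2. **ε-greedy balances**: a small ε (0.01) exploits more but finds the best arm slowly; a larger ε (0.1) finds it sooner but keeps spending about 10% of its steps on random actions
3. **UCB explores systematically**: its uncertainty bonus directs exploration toward rarely tried arms, and it often outperforms a fixed ε
4. **Thompson Sampling adapts**: posterior sampling explores less as the posteriors concentrate, balancing exploration and exploitation without a fixed exploration rate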

The optimal strategy depends on:
- Time horizon (finite vs infinite)
- Reward variance
- Number of arms
- Non-stationarity of environment

## Extensions

- **Non-stationary bandits**: reward distributions change over time
- **Contextual bandits**: the best action depends on an observed context
- **Restless bandits**: arm states evolve even while the arm is not being pulled
- **Combinatorial bandits**: select multiple arms simultaneously
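
For the non-stationary case, one common adjustment (not part of this chapter's experiments) is to replace the $1/n$ step size of the sample-average update with a constant step size $\alpha$, so that recent rewards are weighted more heavily. The sketch below assumes a single hypothetical arm whose true mean jumps from 0 to 2 halfway through, and an illustrative $\alpha = 0.1$.

```python
import numpy as np

rng = np.random.default_rng(3)

steps = 2000
alpha = 0.1  # constant step size (illustrative choice)

# Hypothetical arm: the true mean jumps from 0.0 to 2.0 at the midpoint
true_mean = np.where(np.arange(steps) < steps // 2, 0.0, 2.0)
rewards = rng.normal(true_mean, 1.0)

q_avg, q_const = 0.0, 0.0
for n, r in enumerate(rewards, start=1):
    q_avg += (r - q_avg) / n          # sample average: all rewards weighted equally
    q_const += alpha * (r - q_const)  # constant step: exponentially recency-weighted

print(f"true mean at end:   {true_mean[-1]:.2f}")
print(f"sample average:     {q_avg:.2f}")
print(f"constant step size: {q_const:.2f}")
```

The sample average stays anchored to the old regime because it weights all 2000 rewards equally, while the constant-step estimate tracks the new mean at the cost of a noisier estimate.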

Multi-armed bandits form the foundation for more complex RL algorithms like Q-learning and policy gradients.
