![Logo](https://raw.githubusercontent.com/BartaZoltan/deep-reinforcement-learning-course/main/notebooks/shared_assets/logo.png)


**Developers:** Domonkos Nagy, Balázs Nagy, Zoltán Barta  
**Date:** 2026-02-17  
**Version:** 2025-26/2

[<img src="https://colab.research.google.com/assets/colab-badge.svg">](https://colab.research.google.com/github/BartaZoltan/deep-reinforcement-learning-course/blob/main/notebooks/sessions/session_01_k_armed_bandit/session_01_k_armed_bandit_empty.ipynb)

# Practice 1: K-armed Bandit

## Summary

This notebook introduces the **k-armed bandit** setting and the exploration-exploitation tradeoff in reinforcement learning.

Content outline:
- k-armed bandit problem formulation and evaluation setup,
- strategy interface and environment construction,
- epsilon-greedy action selection,
- upper-confidence-bound (UCB) action selection,
- gradient bandit methods,
- comparative experiments and result interpretation.


## Introduction

Consider the following learning problem. You are faced repeatedly with a choice among
$k$ different options, or actions. After each choice, you receive a numerical reward
from a stationary probability distribution that depends on the action you selected. Your
objective is to maximize the expected total reward over some time period, for example,
over 1000 action selections, or time steps.

This is called the $k$-armed bandit problem. You can visualize it as having to
choose between $k$ slot machines (also known as one-armed bandits) at each time step,
each of which has a different probability distribution for rewards; that is where the name comes from.

<img src="https://raw.githubusercontent.com/BartaZoltan/deep-reinforcement-learning-course/main/notebooks/sessions/session_01_k_armed_bandit/assets/k_armed_bandit.png" width="500"/>

The $k$-armed bandit problem illustrates an important problem in reinforcement
learning: **exploration vs. exploitation**. At each time step $t$, the agent has to make a decision:
take the action with the highest expected reward according to its current knowledge of the environment,
or choose a different action to get a better estimate of that action's value. The former is called an
*exploitation* step, because it exploits the agent's current knowledge to obtain a high reward.
The latter is called an *exploration* step, since it tries out an action to improve the
estimate of its value, thereby exploring the environment.

This notebook introduces a few common strategies for this problem and evaluates them by simulating
many independent runs and comparing the results.

This notebook follows Chapter 2 of Sutton & Barto {cite}`sutton2018`.


In [None]:
import numpy as np
from abc import ABC, abstractmethod
import matplotlib.pyplot as plt
from tqdm.notebook import trange
import seaborn as sns
import time

import random

SEED = 42
np.random.seed(SEED)
random.seed(SEED)

try:
    import torch
    torch.manual_seed(SEED)
except Exception:
    pass


### Why the seed setup matters

Bandit methods are stochastic in two ways: the environment generates random rewards and the policy often explores randomly. 
Setting a fixed seed makes experiments reproducible and comparable across strategy variants.

This follows the testbed philosophy in Chapter 2: compare methods under controlled conditions and average over many runs {cite}`sutton2018`.
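
As a small illustration of why seeding helps, the sketch below uses NumPy's `Generator` API (an alternative to the global `np.random.seed` call above). The helper name `sample_rewards` is hypothetical and not part of the notebook. Two generators created with the same seed produce identical reward sequences, so differences between strategies are not caused by different random draws.

```python
import numpy as np

def sample_rewards(seed, n=5):
    """Draw n standard-normal rewards from a generator seeded with `seed`."""
    rng = np.random.default_rng(seed)
    return rng.normal(loc=0.0, scale=1.0, size=n)

first = sample_rewards(seed=42)
second = sample_rewards(seed=42)
other = sample_rewards(seed=7)

print(np.array_equal(first, second))  # identical seeds give identical draws
print(np.array_equal(first, other))   # a different seed gives a different sequence
```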


## Strategy setup

The `Strategy` base class is used to implement strategies for action selection. An action is selected by the `act` method, and then the `update` method is used
to update the inner state after receiving a reward for the selected action. After an episode (a "run" consisting of $n$ steps, 1000 for example) is over, the `reset` method is called to reset the inner state of the class. The `name` property is used to get a display name for the strategy in plots.

In [None]:
# DO NOT MODIFY THIS CELL

class Strategy(ABC):

    def __init__(self, k):
        self.k = k  # Number of actions
         
        self.rewards_history = {i: [] for i in range(self.k)}  # Store observed rewards per action

    @property
    @abstractmethod
    def name(self):
        pass

    @abstractmethod
    def act(self):
        pass

    @abstractmethod
    def update(self, action, reward):
        self.rewards_history[action].append(reward)  # Update rewards history
        pass

    @abstractmethod
    def reset(self):
        self.rewards_history = {i: [] for i in range(self.k)}  # Reset reward history
        pass

    def plot_estimated_distributions(self):
        """
        Plots the estimated reward distributions for each action
        based on the rewards observed during training.
        """
        plt.figure(figsize=(12, 5))
        for action, rewards in self.rewards_history.items():
            if rewards:
                sns.kdeplot(rewards, label=f"Action {action+1}", fill=True, alpha=0.5)
        
        plt.xlabel("Estimated Reward")
        plt.ylabel("Density")
        plt.title(f"Estimated Reward Distributions - {self.name}")
        plt.legend()
        plt.show()

NumPy's `np.argmax` returns the smallest index when there are multiple
maximal values. The `argmax` helper below breaks these ties randomly instead, which is
desirable in many cases.
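
The difference can be checked with a small standalone sketch. The name `random_tie_argmax` is hypothetical and independent of the course helper; it only illustrates the idea with an explicit random generator.

```python
import numpy as np

def random_tie_argmax(values, rng):
    """Return the index of a maximal entry, chosen uniformly among ties."""
    values = np.asarray(values)
    candidates = np.flatnonzero(values == values.max())
    return int(rng.choice(candidates))

rng = np.random.default_rng(0)
q = np.array([1.0, 3.0, 3.0, 0.5, 3.0])  # three actions tie for the maximum

first_index = int(np.argmax(q))  # np.argmax always returns the first maximal index
picks = {random_tie_argmax(q, rng) for _ in range(200)}

print(first_index)
print(sorted(picks))  # only tied indices can appear here
```

With a greedy policy, always returning the first index would systematically favour low-numbered actions whenever several estimates are equal, for example at the start of a run when all estimates share the same initial value.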

In [None]:
# DO NOT MODIFY THIS CELL

# Argmax function that breaks ties randomly
def argmax(arr):
    arr_max = np.max(arr)
    return np.random.choice(np.where(arr == arr_max)[0])

## Environment Setup

We evaluate each strategy over multiple independent runs and average the received reward at each time step across runs. We then plot the results to
compare the strategies.
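
The following is a minimal sketch of the kind of testbed used for such experiments. It assumes the Sutton & Barto setup, where each true action value is drawn from a normal distribution and each reward is drawn from a unit-variance normal distribution centred on the chosen action's true value. The function names `make_true_values` and `sample_reward` are illustrative and are not part of the `KArmedBandit` interface.

```python
import numpy as np

def make_true_values(k, mean=0.0, std_dev=1.0, rng=None):
    """Draw the true value of each of the k actions."""
    rng = np.random.default_rng() if rng is None else rng
    return rng.normal(mean, std_dev, size=k)

def sample_reward(true_values, action, rng, reward_std=1.0):
    """Draw one noisy reward around the true value of `action` (assumed unit variance)."""
    return rng.normal(true_values[action], reward_std)

rng = np.random.default_rng(42)
q_star = make_true_values(k=10, rng=rng)
best = int(np.argmax(q_star))

# With many samples, the empirical mean reward approaches the true value.
samples = [sample_reward(q_star, best, rng) for _ in range(20000)]
print(best, q_star[best], np.mean(samples))
```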

In [None]:
class KArmedBandit:
    def __init__(self, K, mean=0, std_dev=1):
        # TODO: initialize bandit parameters and true action values
        pass

    def get_reward(self, action):
        # TODO: sample stochastic reward for the selected action
        pass

    def get_optimal_action(self):
        # TODO: return current best action index
        pass

    def reset(self):
        # TODO: reset true action values using initialization distribution
        pass


In [None]:
def plot_bandit_distributions(bandit, num_samples=10000):
    """
    Plots the reward distributions for all K actions in a K-armed bandit using a violin plot.

    Parameters:
    - bandit (KArmedBandit): An instance of the KArmedBandit class.
    - num_samples (int): Number of reward samples to generate for each action.
    """
    K = bandit.K  # Number of actions
    rewards = {action: [bandit.get_reward(action) for _ in range(num_samples)] for action in range(K)}

    # Convert to data format suitable for seaborn
    reward_data = []
    action_labels = []
    
    for action, reward_list in rewards.items():
        reward_data.extend(reward_list)
        action_labels.extend([action + 1] * num_samples)  # Convert 0-indexed to 1-indexed for display

    # Create violin plot
    plt.figure(figsize=(12, 5))
    sns.violinplot(x=action_labels, y=reward_data, inner=None, color="lightblue", linewidth=1.5)

    # Add scatter points for true action values
    plt.scatter(range(K), bandit.optimal_action_values, color='blue', s=50)

    # Formatting
    plt.axhline(0, linestyle='dotted', color='black', linewidth=1)  # Dotted reference line at 0
    plt.xlabel("Actions")
    plt.ylabel("Expected Reward")
    plt.title("Action Reward Distributions in K-Armed Bandit")
    plt.show()

In [None]:
# Create a 10-armed bandit
bandit = KArmedBandit(K=10)

# Plot the full action reward distributions
plot_bandit_distributions(bandit)

### Evaluation protocol

The `simulate(...)` function implements the main empirical protocol:
- independent runs with fresh bandits,
- fixed number of steps per run,
- averaged reward curves,
- frequency of selecting the optimal action.

The last two items (average reward and frequency of the optimal action) are the two canonical diagnostics used throughout Sutton & Barto’s Chapter 2 comparisons {cite}`sutton2018`.
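
The aggregation step of this protocol can be sketched as follows. This is not the `simulate` implementation; it uses a uniformly random placeholder policy and illustrative sizes (`runs`, `n_steps`, `k`) only to show how per-run results are averaged across runs into the two curves.

```python
import numpy as np

rng = np.random.default_rng(0)
runs, n_steps, k = 500, 100, 4  # illustrative sizes

rewards = np.zeros((runs, n_steps))
optimal_hits = np.zeros((runs, n_steps))

for run in range(runs):
    q_star = rng.normal(0.0, 1.0, size=k)        # fresh bandit for every run
    best_action = int(np.argmax(q_star))
    for t in range(n_steps):
        action = int(rng.integers(k))            # placeholder policy: uniform random
        rewards[run, t] = rng.normal(q_star[action], 1.0)
        optimal_hits[run, t] = float(action == best_action)

mean_reward_curve = rewards.mean(axis=0)          # average over runs, shape (n_steps,)
optimal_action_rate = optimal_hits.mean(axis=0)   # fraction of runs picking the best arm

print(mean_reward_curve.shape, optimal_action_rate.mean())
```

A uniformly random policy picks the optimal arm about `1/k` of the time, which gives a useful baseline when reading the "% Optimal action" plots.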


In [None]:
def simulate(strategies, K, bandit_mean=0, bandit_std=1, runs=2000, n_steps=1000):
    """TODO: run repeated bandit experiments and return mean reward / optimal-action curves."""
    # Expected outputs:
    # - mean_rewards shape: (n_strategies, n_steps)
    # - mean_best_action_choices shape: (n_strategies, n_steps)
    pass


To examine the results, we define a plotting function.

In [None]:
# DO NOT MODIFY THIS CELL

def plotResults(strategies, rewards, best_action_choices):
  fig, (ax1, ax2) = plt.subplots(nrows=1, ncols=2, figsize=(14, 5))

  for strategy, reward in zip(strategies, rewards):
      ax1.plot(reward, label=f"{strategy.name}", zorder=2)
  ax1.set_xlabel('Steps')
  ax1.set_ylabel('Average reward')
  ax1.grid(alpha=0.8, linestyle=':', zorder=0)
  ax1.set_title('Average reward of strategies')
  ax1.legend()

  for strategy, choices in zip(strategies, best_action_choices):
      ax2.plot(choices, label=f"{strategy.name}")
  ax2.set_xlabel('Steps')
  ax2.set_ylabel('% Optimal action')
  ax2.grid(alpha=0.8, linestyle=':', zorder=0)
  ax2.set_title('% Optimal action choices of strategies')
  ax2.legend()

  plt.show()

## $\varepsilon$-greedy Action Selection

With this method, the agent selects a random action with probability $\varepsilon$ ($0 \le \varepsilon \le 1$) and acts greedily (selects the best action according to its current estimates) otherwise. The action values are estimated with the *sample-average* method: the value of an action is the average of all the rewards received after taking that action.



<img src="https://raw.githubusercontent.com/BartaZoltan/deep-reinforcement-learning-course/main/notebooks/sessions/session_01_k_armed_bandit/assets/epsilon_greedy.png" width="700"/>

*Pseudocode adapted from Sutton & Barto {cite}`sutton2018` (p. 32).*
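
The two ingredients of the pseudocode can be sketched as standalone functions. The names `epsilon_greedy_action` and `incremental_mean_update` are hypothetical and do not prescribe how the `EpsilonGreedy` class should be structured.

```python
import numpy as np

def epsilon_greedy_action(q_estimates, epsilon, rng):
    """Pick a random action with probability epsilon, else a greedy one."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_estimates)))
    best = np.flatnonzero(q_estimates == q_estimates.max())
    return int(rng.choice(best))  # break ties randomly

def incremental_mean_update(q, n, reward):
    """Return the updated sample average and count after one new reward."""
    n += 1
    q += (reward - q) / n
    return q, n

# The incremental rule reproduces the plain average of the rewards seen so far.
q, n = 0.0, 0
observed = [1.0, 0.0, 2.0, 5.0]
for r in observed:
    q, n = incremental_mean_update(q, n, r)
print(q, np.mean(observed))  # both are 2.0
```

The incremental form needs only the current estimate and the count per action, so memory does not grow with the number of steps.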




In [None]:
class EpsilonGreedy(Strategy):
    def __init__(self, k, epsilon=0, initial=0, step_size=None):
        super().__init__(k)
        # TODO: initialize epsilon-greedy state
        pass

    @property
    def name(self):
        # TODO: return readable strategy name
        pass

    def act(self):
        # TODO: implement epsilon-greedy action selection (with random tie-break)
        pass

    def update(self, action, reward):
        # TODO: update action-value estimate (sample-average or constant step-size)
        pass

    def reset(self):
        # TODO: reset internal state
        pass


In [None]:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import time

def plot_and_save(strategy, step, save_path="assets/bandit_training.png"):
    """
    Creates and updates the figure, then saves it to disk.

    Parameters:
    - strategy (EpsilonGreedy): The current strategy with reward history.
    - step (int): Current training step.
    - save_path (str): Output path of the PNG file (overwritten on each call).
    """
    K = strategy.k

    # Create new figure inside the function
    fig, ax = plt.subplots(figsize=(12, 5))

    # Prepare data
    reward_data = []
    action_labels = []

    for action, reward_list in strategy.rewards_history.items():
        reward_data.extend(reward_list)
        action_labels.extend([action+1] * len(reward_list))

    # Create violin plot
    sns.violinplot(x=action_labels, y=reward_data, inner=None, density_norm="width", color="lightcoral", linewidth=1.5, ax=ax)

    # Scatter plot for estimated means
    ax.scatter(range(K), strategy.q_estimations, color='red', s=50, zorder=3, label="Estimated Means")

    ax.axhline(0, linestyle='dotted', color='black', linewidth=1)
    ax.set_xlabel("Actions")
    ax.set_ylabel("Estimated Reward")
    ax.set_title(f"Training Step: {step} - Estimated Reward Distributions")
    ax.legend(["Estimated Means"])

    # Save the figure to the same file (overwrite each time)
    fig.savefig(save_path, format="png", dpi=150)
    plt.close(fig)  # Close figure to free memory

Visualize how $\varepsilon$-greedy action selection updates its estimates of the reward distributions of the $k$-armed bandit problem.

In [None]:
# Initialize bandit and strategy
from pathlib import Path
from PIL import Image
from IPython.display import Image as IPyImage, display

K = 10  # Number of arms
bandit = KArmedBandit(K)
strategy = EpsilonGreedy(k=K, epsilon=0.2)

# Training loop with in-memory frame capture for GIF export
n_steps = 5000
plot_interval = n_steps // 50

# Plot initial bandit distributions before training
plot_bandit_distributions(bandit)

tmp_frame = Path("assets/.tmp_bandit_frame.png")
tmp_frame.parent.mkdir(parents=True, exist_ok=True)  # Make sure the output folder exists
frames = []

for step in range(n_steps + 1):
    action = strategy.act()
    reward = bandit.get_reward(action)
    strategy.update(action, reward)

    if step % plot_interval == 0:
        plot_and_save(strategy, step, save_path=str(tmp_frame))
        frames.append(Image.open(tmp_frame).copy())

# Build GIF from captured frames (for website use)
gif_path = Path("assets/bandit_training.gif")
if frames:
    frames[0].save(
        gif_path,
        save_all=True,
        append_images=frames[1:],
        duration=120,
        loop=0,
    )

for fr in frames:
    fr.close()

tmp_frame.unlink(missing_ok=True)

print(f"Saved GIF to: {gif_path}")
display(IPyImage(filename=str(gif_path)))


Test of $\varepsilon$-greedy Action Selection.

### Epsilon sensitivity

Compare greedy (`epsilon=0`) and exploratory (`epsilon>0`) behavior.

What to check while reading the plots:
- short-term reward vs long-term reward,
- how quickly each method discovers good arms,
- whether pure greedy gets stuck because it under-explores.

This directly matches the exploration-exploitation discussion in Sections 2.1–2.2 {cite}`sutton2018`.


In [None]:
K = 3  # Number of actions

# List of strategies to test
strategies = [EpsilonGreedy(K,epsilon=0.0), 
              EpsilonGreedy(K, epsilon=0.1), 
              EpsilonGreedy(K, epsilon=0.01)]

# Evaluate strategies
rewards, best_action_choices = simulate(strategies, K=K, runs=200, n_steps=500)

plotResults(strategies, rewards, best_action_choices)

### Optimistic initialization and step-size

Here the initial action-value estimates are intentionally high (`initial > 0`), which induces early exploration even with small `epsilon`.

Also note the role of `step_size`:
- sample-average update (`1/N`) gives long memory,
- constant step-size gives faster adaptation.

See the action-value update discussion in Chapter 2 {cite}`sutton2018`.
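
To make the "memory" of the two update rules concrete, the sketch below checks numerically that the constant step-size update is an exponentially weighted average of the initial estimate and past rewards. The reward values, `alpha`, and `initial` are arbitrary illustrative choices.

```python
import numpy as np

alpha = 0.1
initial = 5.0                      # optimistic starting estimate Q_1
rewards = np.array([1.0, 0.0, 2.0, 1.5, 0.5])
n = len(rewards)

# Recursive form: Q <- Q + alpha * (R - Q)
q = initial
for r in rewards:
    q += alpha * (r - q)

# Closed form: (1 - alpha)^n * Q_1 + sum_i alpha * (1 - alpha)^(n - i) * R_i
weights = alpha * (1 - alpha) ** (n - np.arange(1, n + 1))
closed_form = (1 - alpha) ** n * initial + np.dot(weights, rewards)

print(q, closed_form)                      # the two forms agree
print((1 - alpha) ** n + weights.sum())    # all weights sum to 1
```

The weight on the initial estimate, $(1-\alpha)^n$, decays but never reaches zero, which is why an optimistic `initial` keeps influencing a constant step-size learner for a while. With sample averages, the initial estimate is overwritten by the first observed reward.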


In [None]:
K = 10  # Number of actions

# List of strategies to test
strategies = [EpsilonGreedy(K,epsilon=0.0, initial=10,step_size=0.1), 
              EpsilonGreedy(K, epsilon=0.1,initial=3,step_size=0.1), 
              EpsilonGreedy(K, epsilon=0.1, initial=0,step_size=0.1)]

# Evaluate strategies
rewards, best_action_choices = simulate(strategies, K=K, runs=200, n_steps=500)

plotResults(strategies, rewards, best_action_choices)

## Upper-Confidence-Bound (UCB) Action Selection

The UCB action selection method selects an action by taking into account both the estimated value and the uncertainty of that estimate.
It uses the following formula:

$$ A_t := \underset{a}{\arg\max} \left[ Q_t(a) + c \sqrt{\frac{\ln(t)}{N_t(a)}} \right] $$

Here $Q_t(a)$ denotes the estimated value of action $a$ (calculated using the *sample-average* method), $N_t(a)$ denotes the number of times that action $a$ has
been selected prior to time $t$, and the number $c > 0$ controls
the degree of exploration. If $N_t(a) = 0$, then $a$ is considered to be a maximizing action.
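
The selection rule can be sketched as a standalone function. The name `ucb_action` is hypothetical, and picking the first untried action (rather than a random one) is a simplifying assumption of this sketch.

```python
import numpy as np

def ucb_action(q_estimates, counts, t, c=2.0):
    """Select an action by the UCB rule; untried actions are chosen first."""
    untried = np.flatnonzero(counts == 0)
    if untried.size > 0:
        return int(untried[0])          # N_t(a) = 0 counts as maximizing
    bonus = c * np.sqrt(np.log(t) / counts)
    return int(np.argmax(q_estimates + bonus))

q = np.array([0.5, 0.5, 0.4])
print(ucb_action(q, np.array([10, 0, 10]), t=21))    # action 1 has never been tried
print(ucb_action(q, np.array([50, 5, 45]), t=101))   # equal estimates: the rarely tried action wins
```

In the second call, actions 0 and 1 have the same estimate, but action 1 has been selected only 5 times, so its uncertainty bonus is larger and it is chosen.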




In [None]:
class UCB(Strategy):
    def __init__(self, k, c=1, initial=0, step_size=None):
        super().__init__(k)
        # TODO: initialize UCB parameters and internal state
        pass

    @property
    def name(self):
        # TODO: return readable strategy name
        pass

    def act(self):
        # TODO: implement UCB action selection
        # hint: ensure each action is selected at least once
        pass

    def update(self, action, reward):
        # TODO: update counts and action-value estimate
        pass

    def reset(self):
        # TODO: reset internal state
        pass


Test of UCB.

### UCB exploration coefficient

UCB uses an explicit uncertainty bonus, controlled by `c`.

Interpretation goal:
- small `c`: more exploitation,
- large `c`: more exploration pressure.

This is the confidence-bound approach from Section 2.7 {cite}`sutton2018`.


In [None]:
K = 3  # Number of actions

# List of strategies to test
strategies = [UCB(K),
              UCB(K, c=2),
              UCB(K, c=5)]

# Evaluate strategies
rewards, best_action_choices = simulate(strategies, K=K, runs=200, n_steps=500)

plotResults(strategies, rewards, best_action_choices)

## Gradient Bandit Algorithms

Instead of estimating action values, this method learns a numerical *preference*, denoted $H_t(a)$, for each action. The larger the preference, the more often that action is taken, but the preference has no interpretation in terms of reward. Action probabilities are determined using the *soft-max* function:

$$ \pi_t(a) := \Pr\{A_t = a\} := \frac{e^{H_t(a)}}{\sum_{b=1}^k e^{H_t(b)}} $$

Here we have also introduced a useful new notation, $\pi_t(a)$, for the probability of
taking action $a$ at time $t$. Note that this function defines a probability distribution over the set of all actions. On each step, after selecting action $A_t$ and receiving the reward $R_t$, the
action preferences are updated by:

$$ H_{t+1}(a) := H_t(a) + \alpha(R_t - \bar{R}_t)(\mathbb{1}_{a=A_t} - \pi_t(a)) $$

Here $\alpha > 0$ is a step-size parameter, and $\bar{R}_t \in \mathbb{R}$ is the average of all the rewards up
through and including time $t$. The $\bar{R}_t$ term serves as a
baseline with which the reward is compared.
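
A single preference update can be sketched as follows. The function names `softmax` and `gradient_step` are illustrative, the baseline is passed in as a plain number, and the max-shift inside `softmax` is a standard numerical-stability trick rather than something stated above.

```python
import numpy as np

def softmax(preferences):
    """Numerically stable soft-max: shifting by the max leaves the result unchanged."""
    shifted = preferences - np.max(preferences)
    exp = np.exp(shifted)
    return exp / exp.sum()

def gradient_step(preferences, action, reward, baseline, alpha=0.1):
    """Apply one preference update and return the new preference vector."""
    probs = softmax(preferences)
    one_hot = np.zeros_like(preferences)
    one_hot[action] = 1.0
    return preferences + alpha * (reward - baseline) * (one_hot - probs)

h = np.zeros(3)                       # equal preferences: uniform policy
h_new = gradient_step(h, action=1, reward=2.0, baseline=0.5)

print(softmax(h))                            # uniform probabilities
print(h_new)                                 # preference of action 1 rises, the others fall
print(softmax(np.array([1000.0, 1000.0])))   # no overflow thanks to the shift
```

Because the reward (2.0) is above the baseline (0.5), the preference of the selected action increases and the others decrease by the same total amount, so the preferences still sum to zero.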



In [None]:
class Gradient(Strategy):
    def __init__(self, k, step_size=0.1, use_baseline=True):
        super().__init__(k)
        # TODO: initialize gradient bandit state
        pass

    @property
    def name(self):
        # TODO: return readable strategy name
        pass

    def act(self):
        # TODO: sample action from softmax(preferences)
        pass

    def update(self, action, reward):
        # TODO: implement gradient update with optional baseline
        pass

    def reset(self):
        # TODO: reset preferences, baseline and timestep
        pass

    def softmax(self, x):
        # TODO: implement numerically stable softmax
        pass


Test of Gradient Bandit Algorithm.

### Gradient bandit hyperparameters

Gradient bandits optimize action preferences via a softmax policy.

Focus points:
- effect of the step-size $\alpha$ (`step_size`) on stability/speed,
- effect of using a baseline (variance reduction).

This corresponds to Section 2.8 (gradient bandit algorithm) {cite}`sutton2018`.


In [None]:
K = 3  # Number of actions

# List of strategies to test
strategies = [Gradient(K,step_size=0.1),
              Gradient(K, step_size=0.4),
              Gradient(K, step_size=0.01)]

# Evaluate strategies
rewards, best_action_choices = simulate(strategies, K=K, runs=200, n_steps=500)

plotResults(strategies, rewards, best_action_choices)

In [None]:
K = 10  # Number of actions

# List of strategies to test
strategies = [Gradient(K,step_size=0.1),
             #Gradient(K, step_size=0.4),
              Gradient(K, step_size=0.1, use_baseline=False),
             #Gradient(K, step_size=0.4, use_baseline=False)
             ]

# Evaluate strategies
rewards, best_action_choices = simulate(strategies, K=K,bandit_mean=4, runs=200, n_steps=1000)

plotResults(strategies, rewards, best_action_choices)

## Comprehensive test

In this section, let's run a larger comprehensive test that compares all the strategies introduced above, averaged over more runs.

The comparison setup is aligned with the classical ten-armed testbed style in {cite}`zhang_ten_armed`.


In [None]:
K = 5  # Number of actions

# List of strategies to test
strategies = [
        EpsilonGreedy(K),
        EpsilonGreedy(K, epsilon=0.1),
        UCB(K, c=2),
        Gradient(K)
    ]

# Evaluate strategies
rewards, best_action_choices = simulate(strategies, K=K, runs=2000, n_steps=500)

plotResults(strategies, rewards, best_action_choices)

## Non-stationary bandits and constant learning rate

So far we used stationary bandits, where each arm has a fixed true value.
In a **non-stationary** setting, the true action values drift over time.

In this case, sample-average updates (step-size `1/N`) can react too slowly, because they keep very long memory.
A constant step-size $\alpha$ tracks recent changes better, as discussed in Sutton & Barto {cite}`sutton2018`.

Below we compare:
- $\varepsilon$-greedy with sample-average updates,
- $\varepsilon$-greedy with constant step-size updates.


### Why non-stationarity changes the conclusion

When true action values drift, old data becomes less reliable.

Key takeaway to verify in plots:
- sample-average update can lag behind changing optima,
- constant `alpha` tracks drift better because it weights recent rewards more.

This is the main motivation behind constant step-size updates in non-stationary bandits (Section 2.5) {cite}`sutton2018`.
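
The tracking effect can be isolated with a small sketch that ignores action selection entirely. Every arm is observed at every step, so only the two update rules differ. The values `drift_std=0.1`, `alpha=0.1`, and the number of arms are illustrative assumptions; the drift is deliberately larger than the `drift_std=0.01` default used in the cell below so that the gap is clearly visible.

```python
import numpy as np

rng = np.random.default_rng(0)
n_arms, n_steps = 50, 2000
drift_std, reward_std, alpha = 0.1, 1.0, 0.1   # illustrative values

true_values = np.zeros(n_arms)
q_avg = np.zeros(n_arms)      # sample-average estimates
q_const = np.zeros(n_arms)    # constant step-size estimates
sq_err_avg = 0.0
sq_err_const = 0.0

for t in range(1, n_steps + 1):
    true_values += rng.normal(0.0, drift_std, size=n_arms)   # random-walk drift
    rewards = rng.normal(true_values, reward_std)
    q_avg += (rewards - q_avg) / t            # step-size 1/N
    q_const += alpha * (rewards - q_const)    # constant step-size
    sq_err_avg += np.mean((q_avg - true_values) ** 2)
    sq_err_const += np.mean((q_const - true_values) ** 2)

print("MSE sample-average:", sq_err_avg / n_steps)
print("MSE constant alpha:", sq_err_const / n_steps)
```

Under these assumptions the constant step-size estimate stays much closer to the drifting true values, because the sample average keeps weighting rewards from a time when the true value was different.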


In [None]:
class NonStationaryKArmedBandit:
    def __init__(self, K, mean=0.0, std_dev=1.0, drift_std=0.01, reward_std=1.0):
        # TODO: initialize non-stationary bandit
        pass

    def step_dynamics(self):
        # TODO: apply random-walk drift to true action values
        pass

    def get_reward(self, action):
        # TODO: sample reward for selected action
        pass

    def get_optimal_action(self):
        # TODO: return best action index
        pass


def simulate_nonstationary(strategies, K, runs=200, n_steps=2000, drift_std=0.01, reward_std=1.0):
    # TODO: implement non-stationary evaluation loop
    pass


class EpsilonGreedyWithLabel(EpsilonGreedy):
    def __init__(self, *args, label_suffix='', **kwargs):
        super().__init__(*args, **kwargs)
        self.label_suffix = label_suffix

    @property
    def name(self):
        # TODO: readable label for plot legends
        pass


## References
```{bibliography}
```
