# Chapter 2: Multi-armed Bandits

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style("whitegrid")

### Exercise 2.1

Q: In $\varepsilon$-greedy action selection, for the case of two actions and $\varepsilon = 0.5$, what is the probability that the greedy action is selected?

A: In $\varepsilon$-greedy action selection, the greedy action can be selected in one of two ways:
1. With probability $1 - \varepsilon$ it is chosen as the exploitation action;
2. With probability $\tfrac{\varepsilon}{|\mathcal{A}|}$ (where $\mathcal{A}$ is the action space), it could be chosen at random as an exploration action.

These are mutually exclusive events, so the total probability of the greedy action being chosen is

$$1 - \varepsilon + \frac{\varepsilon}{|\mathcal{A}|}.$$

In the question, $|\mathcal{A}| = 2$ and $\varepsilon = 0.5$. Hence the greedy action is chosen with probability

$$1 - 0.5 + \frac{0.5}{2} = 0.75.$$
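The formula above can be checked numerically. The following is a minimal sketch, not part of the textbook: the function names are hypothetical, and the simulation assumes there is a single greedy action (labelled 0 here) and that exploration draws uniformly from all actions, including the greedy one.

```python
import numpy as np


def greedy_selection_probability(epsilon, n_actions):
    """Closed-form probability that epsilon-greedy picks the unique greedy action."""
    return 1.0 - epsilon + epsilon / n_actions


def simulate_greedy_frequency(epsilon, n_actions, n_trials=100_000, seed=0):
    """Empirical frequency with which the greedy action is selected."""
    rng = np.random.default_rng(seed)
    greedy_action = 0  # assumption: action 0 is the unique greedy action
    # With probability epsilon we explore: draw uniformly from all actions.
    explore = rng.random(n_trials) < epsilon
    random_actions = rng.integers(n_actions, size=n_trials)
    # Otherwise we exploit and take the greedy action.
    chosen = np.where(explore, random_actions, greedy_action)
    return float(np.mean(chosen == greedy_action))


print(greedy_selection_probability(0.5, 2))
print(simulate_greedy_frequency(0.5, 2))
```

The simulated frequency should agree with the closed-form value up to sampling noise.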

### Exercise 2.2

Q: Consider a $k$-armed bandit problem with $k=4$ actions, denoted 1, 2, 3 and 4. Consider applying to this problem a bandit algorithm using $\varepsilon$-greedy action selection, sample-average action-value estimates, and initial estimates of $Q_1(a) = 0$ for all $a$. Suppose the initial sequence of actions and rewards is

$$i$$ | $$A_i$$ | $$R_i$$
:-: | :-: | :-:
 1 | 1 | 1 
 2 | 2 | 1
 3 | 2 | 2
 4 | 2 | 2
 5 | 3 | 0

On some of these time steps the $\varepsilon$ case may have occurred, causing an action to be selected at random. On which time steps did this definitely occur? On which time steps could this possibly have occurred?

A: We can manually calculate the action values as follows:

In [None]:
# Initial data
action_space = [1, 2, 3, 4]
actions = [1, 2, 2, 2, 3]
rewards = [1, 1, 2, 2, 0]

# Set up initial Q values and action counts N for i=1
Q = {a: 0 for a in action_space}
N = {a: 0 for a in action_space}

# Step through the data, recording the estimates Q_i in effect *before* each
# action, then updating N and Q with the incremental sample-average rule
Q_history = []
for A, R in zip(actions, rewards):
    Q_history.append(Q.copy())
    N[A] += 1
    Q[A] = Q[A] + (1.0 / N[A]) * (R - Q[A])

# Display table of action values
(
    pd.DataFrame.from_records(Q_history, index=range(1, len(actions) + 1))
    .rename(lambda a:f"$$Q({a})$$", axis=1)
    .rename_axis("$$i$$", axis=0)
)

From these we can determine whether a greedy action was chosen at each step:

In [None]:
chosen_action_values = [Q[A] for Q, A in zip(Q_history, actions)]
greedy_action_values = [max(Q.values()) for Q in Q_history]
definitely_exploration = [c != g for c, g in zip(chosen_action_values, greedy_action_values)]

# Display results as a table
pd.DataFrame(
    {
        r"$$\max_{a \in \mathcal{A}} Q_i(a)$$": greedy_action_values,  # raw string: avoids invalid escape sequences
        "$$Q_i(A_i)$$": chosen_action_values,
        "Definitely exploration?": definitely_exploration,
    },
    index=pd.Index(range(1, len(actions) + 1), name="$$i$$"),
)

So we can see that at time steps $i=2$ and $i=5$, the chosen action was **not** a greedy action and hence exploration (the "$\varepsilon$ case") definitely happened. For all other time steps, we cannot tell: the chosen action had the maximal estimated value, so it may have been chosen "on purpose", i.e. as an exploitation action, or it may have been chosen at random as an exploration action.

### Exercise 2.3

Q: In the comparison shown in Figure 2.2, which method will perform best in the long run in terms of cumulative reward and probability of selecting the best action? How much better will it be? Express your answer quantitatively.

A: Figure 2.2 from the textbook shows average learning curves (over 2000 randomly chosen bandits) for $\varepsilon$-greedy agents with $\varepsilon = 0$, $0.01$ and $0.1$.

Let's begin by reproducing this figure. To save computation time, we use a testbed of only 200 bandits; the simulation length is set by `n_steps` in the cell below:

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from bandits import get_epsilon_greedy_bandit_agent_builder, bandit_experiment, BanditResults

# Set up experiment parameters
n_levers = 10
agents_to_test = {
    fr"$\varepsilon = {epsilon}$": get_epsilon_greedy_bandit_agent_builder(epsilon, n_levers)
    for epsilon in [0.0, 0.01, 0.1]
}
n_steps = 1000
test_bed_size = 200
rng = np.random.default_rng(238402285)

# Load saved results, or if not found, run experiment. Then plot results.
results_file = f"results/epsilon-bandit-results-n_steps-{n_steps}-test_bed_size-{test_bed_size}.pkl"
try:
    results = BanditResults.load(results_file)
except FileNotFoundError:
    %time results = bandit_experiment(agents_to_test, test_bed_size, n_steps, random_state=rng)
    results.save(results_file)
results.plot()

We now return to answering the question.

Let's start with the case where $\varepsilon > 0$. In this case, as the simulation length goes to infinity, *all* actions will be selected an infinite number of times and hence the agent's action-value estimates for all actions will converge to the true values. This means that, in the long run, the average reward per step for an $\varepsilon$-greedy agent with non-zero $\varepsilon$ will ultimately converge to

$$(1 - \varepsilon)  R_\mathrm{opt} + \varepsilon R_\mathrm{random}$$

where $R_\mathrm{opt}$ is the expected reward per step for an agent that always selects the optimal action and $R_\mathrm{random}$ is the expected reward per step for an agent that always selects an action uniformly at random. When we average over a large number of random bandits (whose mean rewards are drawn from a standard normal distribution and whose reward noise has unit standard deviation, as specified in the book), then $R_\mathrm{random} = 0$ and

$$R_\mathrm{opt} = \mathbb{E}\left[\max(Z_1, Z_2, \ldots, Z_{10})\right]$$

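where $Z_1, \ldots, Z_{10}$ are independent standard normal random variables. Because the variables are independent, their maximum has CDF $\Phi(z)^{10}$ and hence pdf $10\,\Phi(z)^9\,\varphi(z)$, where $\Phi$ and $\varphi$ are the standard normal CDF and pdf. The expectation can then be estimated numerically as follows: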

In [None]:
experiment.average_optimal_action_value

In [None]:
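from scipy.stats import norm
from scipy.integrate import quad  # `quadrature` is deprecated and removed in recent SciPy

# z times the pdf of the max of 10 i.i.d. standard normals: 10 * Phi(z)^9 * phi(z)
def integrand(z):
    return 10.0 * z * norm.cdf(z) ** 9 * norm.pdf(z)

# quad returns (integral estimate, estimate of the absolute error)
value, abs_error = quad(integrand, -10, 10)
print(f"Expected reward for perfect agent: {value:.3f} (error: {abs_error:.1e})")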
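As a cross-check on the numerical integral above, $R_\mathrm{opt}$ can also be estimated by Monte Carlo sampling and then plugged into the long-run formula $(1 - \varepsilon) R_\mathrm{opt} + \varepsilon R_\mathrm{random}$. This is only a sketch under the assumptions already stated (10 arms, standard normal mean rewards, $R_\mathrm{random} = 0$, $\varepsilon > 0$); the helper names are hypothetical and not part of the `bandits` module. The long-run probability of selecting the optimal action reuses the expression from Exercise 2.1, assuming the greedy action has converged to the optimal one.

```python
import numpy as np

n_levers = 10
rng = np.random.default_rng(0)

# Monte Carlo estimate of E[max(Z_1, ..., Z_10)] for i.i.d. standard normals
samples = rng.standard_normal((200_000, n_levers))
r_opt = float(samples.max(axis=1).mean())


def asymptotic_reward(epsilon, r_opt, r_random=0.0):
    """Long-run average reward per step for epsilon > 0."""
    return (1.0 - epsilon) * r_opt + epsilon * r_random


def asymptotic_optimal_probability(epsilon, n_actions):
    """Long-run probability of selecting the optimal action for epsilon > 0."""
    return 1.0 - epsilon + epsilon / n_actions


print(f"Monte Carlo estimate of R_opt: {r_opt:.3f}")
for epsilon in [0.01, 0.1]:
    reward = asymptotic_reward(epsilon, r_opt)
    p_opt = asymptotic_optimal_probability(epsilon, n_levers)
    print(f"epsilon = {epsilon}: reward -> {reward:.3f}, P(optimal action) -> {p_opt:.3f}")
```

These are limiting values only; they say nothing about how quickly each agent approaches them.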


In [None]:
0.8 * 1.54