# Multi-armed Bandits: Exercises

### Exercise 2.1: **$\epsilon$-greedy action selection**

**Question: In $\epsilon$-greedy action selection, for the case of two actions and $\epsilon = 0.5$,  
what is the probability that the greedy action is selected?**

> With probability 1/2 the agent explores and picks an action uniformly at random; since there are two actions,  
  that random pick is the greedy action half of the time.  
  With the remaining probability 1/2 the agent exploits and is sure to take the greedy action.  
  This gives:  
  $p(a = \text{greedy action}) = {1 \over 2} \cdot {1 \over 2} + {1 \over 2} \cdot 1$  
  $p(a = \text{greedy action}) = {3 \over 4}$
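
The result can be sanity-checked with a quick simulation. The sketch below is only an illustration: the function name `greedy_selection_rate` and its parameters are hypothetical, and the greedy action is arbitrarily labelled 0.

```python
import numpy as np

def greedy_selection_rate(epsilon=0.5, nb_actions=2, nb_trials=100_000, seed=0):
    """Estimate how often epsilon-greedy ends up picking the greedy action."""
    rng = np.random.default_rng(seed)
    greedy_action = 0  # arbitrary label for the greedy action
    # With probability epsilon the agent explores, otherwise it exploits.
    explore = rng.random(nb_trials) < epsilon
    # Exploration picks uniformly among ALL actions, greedy one included.
    random_actions = rng.integers(0, nb_actions, size=nb_trials)
    actions = np.where(explore, random_actions, greedy_action)
    return float(np.mean(actions == greedy_action))

rate = greedy_selection_rate()
analytical = (1 - 0.5) + 0.5 / 2
print(f"empirical: {rate:.3f}, analytical: {analytical:.3f}")
```

The empirical rate should land close to the analytical value of 3/4, up to sampling noise.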

### Exercise 2.2: **Bandit example**

**Question: Consider a k-armed bandit problem with k = 4 actions, denoted 1, 2, 3, and 4.  
Consider applying to this problem a bandit algorithm using $\epsilon$-greedy action selection,  
sample-average action-value estimates, and initial estimates of $Q_1(a) = 0$, for all $a$.  
Suppose the initial sequence of actions and rewards is $A_1 = 1$, $R_1 = -1$, $A_2 = 2$, $R_2 = 1$, $A_3 = 2$, $R_3 = -2$, $A_4 = 2$, $R_4 = 2$, $A_5 = 3$, $R_5 = 0$.  
On some of these time steps the $\epsilon$ case may have occurred, causing an action to be selected at random.  
On which time steps did this definitely occur? On which time steps could this possibly have occurred?**

> 

### Exercise 2.3: **Which method will perform best**

**Question: In the comparison shown in Figure 2.2,  
which method will perform best in the long run in terms of cumulative reward and probability of selecting the best action?  
How much better will it be?  
Express your answer quantitatively.**
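
One way to get a quantitative handle on the long run is sketched below. It assumes that, after enough steps, the action-value estimates have converged so that the greedy action is the truly optimal one; under that assumption an $\epsilon$-greedy method picks the best of $k$ actions with probability $1 - \epsilon + \epsilon / k$. The helper name is hypothetical, and the assumption does not cover the purely greedy method, which may never find the best action.

```python
def long_run_optimal_probability(epsilon, nb_arms=10):
    """Asymptotic chance of picking the best arm, assuming the greedy action is optimal."""
    # Exploit with probability 1 - epsilon, or hit the best arm by chance while exploring.
    return (1 - epsilon) + epsilon / nb_arms

for epsilon in (0.01, 0.1):
    p = long_run_optimal_probability(epsilon)
    print(f"epsilon = {epsilon}: P(optimal action) -> {p:.3f}")
```

Under this assumption, $\epsilon = 0.01$ approaches 99.1% optimal actions and $\epsilon = 0.1$ approaches 91%.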

### Exercise 2.4: **Step-size parameters**

**Question: If the step-size parameters, $\alpha_n$, are not constant, then the estimate $Q_n$ is
a weighted average of previously received rewards with a weighting different from that
given by (2.6). What is the weighting on each prior reward for the general case, analogous
to (2.6), in terms of the sequence of step-size parameters?**
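
A numerical check can make the target of this exercise concrete. The sketch below assumes the usual incremental update $Q_{n+1} = Q_n + \alpha_n (R_n - Q_n)$ and compares it with the unrolled form in which reward $R_i$ receives weight $\alpha_i \prod_{j=i+1}^{n} (1 - \alpha_j)$ and the initial estimate $Q_1$ receives weight $\prod_{i=1}^{n} (1 - \alpha_i)$. The function names are hypothetical.

```python
import numpy as np

def reward_weights(alphas):
    """Weights on Q_1 and on each reward R_1..R_n after n incremental updates."""
    alphas = np.asarray(alphas, dtype=float)
    n = len(alphas)
    # Weight on R_i is alpha_i * prod_{j=i+1}^{n} (1 - alpha_j); an empty product is 1.
    weights = np.array([alphas[i] * np.prod(1 - alphas[i + 1:]) for i in range(n)])
    # Weight on the initial estimate Q_1 is prod_{i=1}^{n} (1 - alpha_i).
    initial_weight = np.prod(1 - alphas)
    return initial_weight, weights

def incremental_estimate(q1, alphas, rewards):
    """Apply Q <- Q + alpha_n * (R_n - Q) for each (alpha_n, R_n) pair in order."""
    q = q1
    for alpha, reward in zip(alphas, rewards):
        q += alpha * (reward - q)
    return q

rng = np.random.default_rng(0)
alphas = rng.uniform(0.05, 0.5, size=8)   # arbitrary non-constant step sizes
rewards = rng.normal(size=8)
q1 = 5.0

w0, w = reward_weights(alphas)
closed_form = w0 * q1 + np.dot(w, rewards)
print(np.isclose(closed_form, incremental_estimate(q1, alphas, rewards)))
print(np.isclose(w0 + w.sum(), 1.0))  # the weights form a proper weighted average
```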

### Exercise 2.5: **Design and conduct an experiment**
**Programming: Design and conduct an experiment to demonstrate the difficulties that sample-average methods have for nonstationary problems. Use a modified
version of the 10-armed testbed in which all the $q_*(a)$ start out equal and then take
independent random walks (say by adding a normally distributed increment with mean
zero and standard deviation 0.01 to all the $q_*(a)$ on each step). Prepare plots like
Figure 2.2 for an action-value method using sample averages, incrementally computed,
and another action-value method using a constant step-size parameter, $\alpha = 0.1$. Use
$\epsilon = 0.1$ and longer runs, say of 10,000 steps.**

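The experiment in Exercise 2.5 could be set up along the following lines. This is a minimal NumPy-only sketch, not a full solution: the names `run_nonstationary` and `compare` are hypothetical, plotting is left out, and only a handful of runs are averaged to keep it fast (the testbed in the text averages many more).

```python
import numpy as np

def run_nonstationary(alpha=None, nb_arms=10, nb_steps=10_000, epsilon=0.1, seed=0):
    """One epsilon-greedy run on a random-walk bandit; alpha=None means sample averages."""
    rng = np.random.default_rng(seed)
    q_true = np.zeros(nb_arms)   # all q_*(a) start out equal
    q_est = np.zeros(nb_arms)
    counts = np.zeros(nb_arms)
    rewards = np.zeros(nb_steps)
    optimal = np.zeros(nb_steps, dtype=bool)
    for t in range(nb_steps):
        if rng.random() < epsilon:
            action = int(rng.integers(nb_arms))
        else:
            action = int(np.argmax(q_est))
        rewards[t] = q_true[action] + rng.normal()
        optimal[t] = action == np.argmax(q_true)
        counts[action] += 1
        # Sample average uses 1/N(a); the alternative is a constant step size.
        step_size = 1.0 / counts[action] if alpha is None else alpha
        q_est[action] += step_size * (rewards[t] - q_est[action])
        # Every true action value takes an independent random-walk step.
        q_true += rng.normal(0.0, 0.01, size=nb_arms)
    return rewards, optimal

def compare(nb_runs=3, nb_steps=10_000):
    """Print the optimal-action rate over the second half of each run."""
    for label, alpha in (("sample average", None), ("constant alpha = 0.1", 0.1)):
        rates = [run_nonstationary(alpha, nb_steps=nb_steps, seed=s)[1][nb_steps // 2:].mean()
                 for s in range(nb_runs)]
        print(f"{label}: {100 * np.mean(rates):.1f}% optimal actions")

compare()
```

The per-step arrays returned by `run_nonstationary` can be averaged over runs and passed to a plotting library to produce curves like those in Figure 2.2.
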
### Exercise 2.6: **Mysterious spikes**
**Question: The results shown in Figure 2.3 should be quite reliable
because they are averages over 2000 individual, randomly chosen 10-armed bandit tasks.
Why, then, are there oscillations and spikes in the early part of the curve for the optimistic
method? In other words, what might make this method perform particularly better or
worse, on average, on particular early steps?**

### Exercise 2.7: **Unbiased constant-step-size trick**
**Question: In most of this chapter we have used
sample averages to estimate action values because sample averages do not produce the
initial bias that constant step sizes do (see the analysis leading to (2.6)). However, sample
averages are not a completely satisfactory solution because they may perform poorly
on nonstationary problems. Is it possible to avoid the bias of constant step sizes while
retaining their advantages on nonstationary problems? One way is to use a step size of  
$\beta_n \doteq \alpha / \bar{o}_n$, (2.8)  
to process the $n$th reward for a particular action, where $\alpha > 0$ is a conventional constant
step size, and $\bar{o}_n$ is a trace of one that starts at 0:  
$\bar{o}_n \doteq \bar{o}_{n-1} + \alpha (1 - \bar{o}_{n-1})$, for $n \ge 0$, with $\bar{o}_0 \doteq 0$. (2.9)  
Carry out an analysis like that in (2.6) to show that $Q_n$ is an exponential recency-weighted
average without initial bias.**
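
Before doing the analysis, it can help to see the claim numerically. The sketch below builds the step sizes of (2.8) from the trace of (2.9), that is, a trace starting at 0 and updated by adding $\alpha$ times (1 minus the trace). It then checks two things: the first step size equals 1, so the initial estimate is overwritten, and the resulting estimate matches an exponential recency-weighted average with weights $\alpha (1 - \alpha)^{n-i} / (1 - (1 - \alpha)^n)$. The function names are hypothetical, and the closed-form weights are what the analysis is expected to yield, not something stated in the exercise.

```python
import numpy as np

def unbiased_step_sizes(alpha, n):
    """Step sizes beta_1..beta_n from (2.8) and (2.9), with the trace starting at 0."""
    o_bar = 0.0
    betas = []
    for _ in range(n):
        o_bar += alpha * (1.0 - o_bar)   # trace update (2.9)
        betas.append(alpha / o_bar)      # step size (2.8)
    return np.array(betas)

def estimate(q1, betas, rewards):
    """Apply Q <- Q + beta_n * (R_n - Q) for each reward in order."""
    q = q1
    for beta, reward in zip(betas, rewards):
        q += beta * (reward - q)
    return q

alpha, n = 0.1, 20
betas = unbiased_step_sizes(alpha, n)
rewards = np.random.default_rng(0).normal(size=n)

# beta_1 equals 1, so the first update replaces Q_1 with R_1: no initial bias.
print(np.isclose(betas[0], 1.0))
print(np.isclose(estimate(0.0, betas, rewards), estimate(100.0, betas, rewards)))

# Exponential recency weighting, normalised so the weights sum to 1.
i = np.arange(1, n + 1)
weights = alpha * (1 - alpha) ** (n - i) / (1 - (1 - alpha) ** n)
print(np.isclose(np.dot(weights, rewards), estimate(0.0, betas, rewards)))
```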

### Exercise 2.8: **UCB spikes**
**Question: In Figure 2.4 the UCB algorithm shows a distinct spike
in performance on the 11th step. Why is this? Note that for your answer to be fully
satisfactory it must explain both why the reward increases on the 11th step and why it
decreases on the subsequent steps. Hint: if $c = 1$, then the spike is less prominent.**
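
The spike can be reproduced with a small simulation. The sketch below assumes the standard UCB rule, $A_t = \arg\max_a \left[ Q_t(a) + c \sqrt{\ln t / N_t(a)} \right]$ with untried actions treated as maximizing, on a stationary 10-armed testbed with unit-variance Gaussian rewards. The function name and defaults are hypothetical.

```python
import numpy as np

def ucb_average_rewards(c=2.0, nb_arms=10, nb_steps=15, nb_runs=2000, seed=0):
    """Average reward per step of UCB over many random 10-armed bandit tasks."""
    rng = np.random.default_rng(seed)
    total = np.zeros(nb_steps)
    for _ in range(nb_runs):
        q_true = rng.normal(size=nb_arms)
        q_est = np.zeros(nb_arms)
        counts = np.zeros(nb_arms)
        for t in range(1, nb_steps + 1):
            untried = np.flatnonzero(counts == 0)
            if untried.size > 0:
                action = int(untried[0])  # untried actions count as maximizing
            else:
                bonus = c * np.sqrt(np.log(t) / counts)
                action = int(np.argmax(q_est + bonus))
            reward = q_true[action] + rng.normal()
            counts[action] += 1
            q_est[action] += (reward - q_est[action]) / counts[action]
            total[t - 1] += reward
    return total / nb_runs

avg = ucb_average_rewards()
for step, value in enumerate(avg, start=1):
    print(f"step {step:2d}: average reward {value:.2f}")
```

During the first 10 steps every arm is tried once; on step 11 all arms have the same exploration bonus, so the choice is driven by the estimates alone. Comparing `ucb_average_rewards(c=1.0)` with the default shows how the bonus size changes what happens after step 11.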

### Exercise 2.9
**Question: Show that in the case of two actions, the soft-max distribution is the same
as that given by the logistic, or sigmoid, function often used in statistics and artificial
neural networks.**
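
The identity can be checked numerically before proving it. The sketch below assumes two action preferences, here called `h[0]` and `h[1]` (hypothetical names), and compares the soft-max probability of the first action with the sigmoid of the preference difference.

```python
import numpy as np

def softmax(h):
    """Soft-max distribution over preferences, shifted by the max for numerical stability."""
    z = np.exp(h - np.max(h))
    return z / z.sum()

def sigmoid(x):
    """Logistic function 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + np.exp(-x))

h = np.array([0.3, -1.2])  # arbitrary preferences for the two actions
p_first = softmax(h)[0]
print(np.isclose(p_first, sigmoid(h[0] - h[1])))
```

Algebraically this is $e^{h_1} / (e^{h_1} + e^{h_2}) = 1 / (1 + e^{-(h_1 - h_2)})$, obtained by dividing numerator and denominator by $e^{h_1}$.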

### Exercise 2.10
**Question: Suppose you face a 2-armed bandit task whose true action values change
randomly from time step to time step. Specifically, suppose that, for any time step, the
true values of actions 1 and 2 are respectively 0.1 and 0.2 with probability 0.5 (case A),
and 0.9 and 0.8 with probability 0.5 (case B). If you are not able to tell which case you
face at any step, what is the best expectation of success you can achieve and how should
you behave to achieve it? Now suppose that on each step you are told whether you are
facing case A or case B (although you still don’t know the true action values). This is an
associative search task. What is the best expectation of success you can achieve in this
task, and how should you behave to achieve it?**
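
The two expectations asked for here reduce to a short calculation, sketched below with hypothetical variable names. Without case information each action is worth its average over the two cases; with case information the best action can be chosen per case.

```python
import numpy as np

# Rows are cases (A, B); columns are actions (1, 2).
true_values = np.array([[0.1, 0.2],
                        [0.9, 0.8]])
case_probs = np.array([0.5, 0.5])

# Case unknown: commit to the action with the highest value averaged over cases.
blind_action_values = case_probs @ true_values
best_blind = blind_action_values.max()

# Case known: pick the best action separately in each case.
best_informed = case_probs @ true_values.max(axis=1)

print("per-action value without case information:", np.round(blind_action_values, 3))
print("best expectation without case information:", round(float(best_blind), 3))
print("best expectation with case information:", round(float(best_informed), 3))
```

Both actions are worth 0.5 when the case is hidden, so any policy achieves 0.5; knowing the case raises the best achievable expectation to 0.55 (action 2 in case A, action 1 in case B).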

### Exercise 2.11: **Make a figure**
**Programming: Make a figure analogous to Figure 2.6 for the nonstationary case outlined in Exercise 2.5.  
Include the constant-step-size $\epsilon$-greedy algorithm with $\epsilon = 0.1$.  
Use runs of 200,000 steps and, as a performance measure for each algorithm and  
parameter setting, use the average reward over the last 100,000 steps.**

In [None]:
import numpy as np
import gymnasium as gym
import matplotlib.pyplot as plt

%matplotlib inline

class KArmedBanditNonStationary(gym.Env):

    def __init__(self, nb_arms=10, nb_steps=10_000):
        self._nb_arms = nb_arms
        self._nb_steps = nb_steps

        self.action_space = gym.spaces.Discrete(nb_arms)
        self.observation_space = gym.spaces.Discrete(1)
    
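    def step(self, action):
        self._step += 1

        # Expected reward of the chosen arm plus unit-variance Gaussian noise.
        reward = self._arms[action] + self.np_random.normal(0, 1)
        terminated = self._step >= self._nb_steps

        # Record optimality before the arms drift.
        info = {"is_optimal_action": int(action == np.argmax(self._arms))}

        # Random walk: every true action value drifts by N(0, 0.01^2) on each step.
        self._arms += self.np_random.normal(0, 0.01, size=self._nb_arms)

        # Gymnasium's step API returns (observation, reward, terminated, truncated, info);
        # the bandit has a single dummy state, so the observation is always 0.
        return 0, reward, terminated, False, info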

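    def reset(self, seed=None, options=None):
        super().reset(seed=seed)

        self._step = 0
        # Note: Exercise 2.5 starts all q_*(a) equal; here they are drawn from N(0, 1).
        self._arms = self.np_random.normal(0, 1, size=self._nb_arms)

        # Gymnasium's reset API returns (observation, info).
        return 0, {}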

In [None]:
class EpsilonGreedy():

    def __init__(self, nb_actions, epsilon, alpha):
        self.nb_actions = nb_actions
        self.epsilon = epsilon
        self.alpha = alpha

        self.q = np.zeros(self.nb_actions)

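    def action(self):
        # Explore with probability epsilon: pick an action uniformly at random.
        if np.random.uniform(0, 1) < self.epsilon:
            return np.random.randint(0, self.nb_actions)

        # Otherwise exploit; np.argmax breaks ties in favor of the lowest index.
        return int(np.argmax(self.q))

    def observe(self, action, reward):
        # Constant step-size update: recent rewards weigh more than old ones.
        self.q[action] += self.alpha * (reward - self.q[action])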
    
    def reset(self):
        self.q = np.zeros(self.nb_actions)