# Day 1 - Multi-armed Bandits

$Part\ I$ of [the book](http://incompleteideas.net/book/RLbook2020.pdf) is divided into three subsections:
1. The first two chapters, covering the simplest case, multi-armed bandits, and the general problem formulation, finite Markov decision processes.
2. The following three chapters, introducing dynamic programming, Monte Carlo methods, and temporal-difference learning.
3. The final two chapters, showing how these three classes of methods can be combined into increasingly powerful algorithms.

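* RL is distinguished from other kinds of learning by using training information that *evaluates* the actions taken rather than *instructing* with the correct action
* Evaluative feedback rates the action taken, without indicating whether it was the best or the worst action possible
* Instructive feedback, the basis of supervised learning, indicates the correct action to take, independently of the action actually taken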
* This chapter treats learning in only a single situation, the *nonassociative* setting
* The specific problem is a simple $k$-armed bandit problem
* At the end of the chapter, we take a step to full RL by considering the *associative* setting of multiple different situations

## A $k$-armed Bandit Problem

* The $k$-armed bandit is a simple problem where you pick from a set of actions over and over again
* Each time you select an action, you receive a reward from some distribution
* The goal is to maximize rewards over some number of time steps
* Each of the $k$ actions has an expected (mean) reward given that it is selected, which we call the $value$ of the action
* The action selected at time step $t$ is $A_t$, the reward received $R_t$
* The value of an action $a$ is: $$q_*(a)\doteq\mathbb E\left[R_t|A_t=a\right]$$
* Our estimate at time $t$ of this value is $Q_t(a)$
* At every time step, at least one action has the greatest estimated value; these are the $greedy$ actions
* Picking a greedy action is $exploitation$ of our current knowledge
* Picking any nongreedy action is $exploration$ to gain more knowledge
* Exploration may lower the immediate reward but can yield a greater total reward in the long run, so it pays off mainly when many time steps remain
* Sophisticated methods for balancing exploration and exploitation exist, but most make strong assumptions about stationarity and prior knowledge that are violated or impossible to verify in most applications and in the full RL problem
* The chapter therefore does not try to balance the two in a sophisticated way, only to balance them at all
* Methods involving exploration work much better than purely greedy action selection

## Action-value Methods

* $Action-value\ methods$ use estimates of $q_*(a)$ to make decisions
* One way to estimate these is to average the rewards received: $$Q_t(a)\doteq\frac{\text{sum of rewards when }a\text{ taken prior to }t}{\text{number of times }a\text{ taken prior to }t}=\frac{\sum_{i=1}^{t-1}R_i\cdot\mathbb 1_{A_i=a}}{\sum_{i=1}^{t-1}\mathbb 1_{A_i=a}}$$
* If the action $a$ has not yet been taken, $Q_t(a)$ has some default value, like 0
* As the denominator goes to infinity, $Q_t(a)$ converges to $q_*(a)$
* This is called the $sample$-$average\ method$ for estimating action values
* $Greedy$ action-selection is written as $$A_t\doteq\underset{a}{\operatorname{arg max}}Q_t(a)$$
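* A simple alternative is $\varepsilon$-$greedy$ action-selection, where with probability $\varepsilon$ an action is drawn uniformly at random from all actions (including the greedy one), and the greedy action is taken otherwise
* In the limit of infinitely many steps, this ensures that each action is taken an infinite number of times, so the $Q_t(a)$ converge to their respective $q_*(a)$
* This implies that the probability of selecting the optimal action converges to greater than $1-\varepsilon$, that is, to near certainty for small $\varepsilon$; these are only asymptotic guarantees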
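The two ingredients above, sample-average estimates and $\varepsilon$-greedy selection, can be sketched in a few lines of Python. This is only a minimal illustration: the function names `sample_average` and `epsilon_greedy` are made up for this note, actions are indexed from 0, and breaking ties among greedy actions at random is an assumption.

```python
import random

def sample_average(rewards_by_action, default=0.0):
    """Q_t(a): mean of the rewards seen for each action, or a default if untried."""
    return [sum(rs) / len(rs) if rs else default for rs in rewards_by_action]

def epsilon_greedy(q_estimates, epsilon, rng=random):
    """Pick an action index: uniformly random with probability epsilon, else greedy."""
    if rng.random() < epsilon:
        # Exploration: every action, including the greedy one, is equally likely.
        return rng.randrange(len(q_estimates))
    best = max(q_estimates)
    # Exploitation: break ties among greedy actions at random (an assumption).
    return rng.choice([a for a, q in enumerate(q_estimates) if q == best])

# Rewards observed so far for four actions; the last one has never been tried.
rewards_by_action = [[-1.0], [1.0, -2.0, 2.0], [0.0], []]
q = sample_average(rewards_by_action)
print(q)
print(epsilon_greedy(q, epsilon=0.1))
```

With `epsilon=0.0` the function is purely greedy and always returns index 1 for these estimates.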

### $Exercise\ \mathcal{2.1}$

#### In $\varepsilon$-greedy action selection, for the case of two actions and $\varepsilon = 0.5$, what is the probability that the greedy action is selected?

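With probability $1-\varepsilon$ the greedy action is chosen deliberately, and with probability $\varepsilon$ an action is drawn uniformly at random from all $k$ actions, which still hits the greedy action with probability $\frac{1}{k}$. Assuming a single greedy action, it is therefore selected with probability $1-\varepsilon+\frac{\varepsilon}{k}=0.5+\frac{0.5}{2}=0.75$, so $75\%$.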
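As a quick check, the probability that $\varepsilon$-greedy ends up on the greedy action can be computed directly by adding the random draws that happen to land on it. The helper name `prob_greedy_selected` is hypothetical, and a single greedy action (no ties) is assumed.

```python
def prob_greedy_selected(epsilon, k):
    """Probability that epsilon-greedy picks the unique greedy action among k actions."""
    # (1 - epsilon): deliberate greedy choice; epsilon / k: random draw hitting it anyway.
    return (1 - epsilon) + epsilon / k

print(prob_greedy_selected(0.5, 2))  # 0.75
```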

## The 10-armed Testbed

* An experiment with 2000 randomly generated $k$-armed bandit problems with $k=10$ arms was performed
* The action values $q_*(a),a=1,...,10$ were sampled from a normal distribution with $\mu=0, \sigma^2=1$
* Actual reward $R_t$ received at time step $t$ after selecting action $A_t$ was sampled from a normal distribution with $\mu=q_*(A_t), \sigma^2=1$
* Performance was measured over 1000 time steps, constituting one $run$, over a total of 2000 independent runs on different bandit problems
* Performance was compared between the greedy method, $\varepsilon$-greedy with $\varepsilon=0.1$, and $\varepsilon$-greedy with $\varepsilon=0.01$
* The greedy method improved slightly faster at the very beginning but then leveled off at an average reward of about 1, well below the best possible value of about 1.54; the $\varepsilon=0.1$ method quickly reached a clearly higher reward, which the $\varepsilon=0.01$ method approached more slowly
* The greedy method found the optimal action in only about a third of the runs, while by the end of the 1000 steps the $\varepsilon=0.1$ method selected it about $80\%$ of the time and the $\varepsilon=0.01$ method about $50\%$ of the time
* Performance could improve even further if $\varepsilon$ is reduced over time
* Environments with noisier rewards favor exploration even more heavily
* Even in the deterministic case, exploration pays off if the rewards are nonstationary, as a previously nongreedy action may have become better than the greedy one
* Nonstationarity is the most common case in RL: even if the underlying task is stationary, the agent's decision tasks change over time as it learns and its policy changes

### $Exercise\ \mathcal{2.2}$*:*$\ Bandit\ example$

#### Consider a $k$-armed bandit problem with $k = 4$ actions, denoted 1, 2, 3, and 4. Consider applying to this problem a bandit algorithm using $\varepsilon$-greedy action selection, sample-average action-value estimates, and initial estimates of $Q_1(a)=0$, for all $a$. Suppose the initial sequence of actions and rewards is $A_1=1,\ R_1=-1,\ A_2=2,\ R_2=1,\ A_3=2,\ R_3=-2,\ A_4=2,\ R_4=2,\ A_5=3,\ R_5=0$. On some of these time steps the $\varepsilon$ case may have occurred, causing an action to be selected at random.

$t=1:\; Q_2(1)=R_1=-1$, all others $0$, so $2$-$4$ are greedy  
$t=2:\; Q_3(2)=R_2=1$, so $2$ is greedy  
$t=3:\; Q_4(2)=\frac{R_2+R_3}{2}=\frac{1-2}{2}=-0.5$, so $3$ and $4$ are greedy  
$t=4:\; Q_5(2)=\frac{R_2+R_3+R_4}{3}=\frac{1-2+2}{3}=\frac{1}{3}$, so $2$ is greedy  
$t=5:\; Q_6(3)=R_5=0$, so $2$ is still greedy

#### On which time steps did this definitely occur?

Each line above gives the estimates after step $t$ and thus the greedy actions for step $t+1$; before $t=1$ all estimates are $0$, so every action is greedy. At $t=4$ the greedy actions were $3$ and $4$, but $2$ was chosen, and at $t=5$ the greedy action was $2$, but $3$ was chosen, so those two steps were definitely exploratory.

#### On which time steps could this possibly have occurred?

As a random draw can also land on a greedy action, and the actions chosen at $t=1$, $t=2$, and $t=3$ were all among the greedy ones, an action could have been selected at random at every time step.
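The bookkeeping for this exercise can be replayed mechanically. The short script below is a sketch under the exercise's stated assumptions (sample averages, initial estimates of 0): it recomputes $Q_t(a)$ before each step and flags the steps where the chosen action was not among the greedy ones. The variable names are chosen for this note only.

```python
actions = [1, 2, 2, 2, 3]
rewards = [-1, 1, -2, 2, 0]

sums = {a: 0.0 for a in range(1, 5)}
counts = {a: 0 for a in range(1, 5)}
definitely_random = []

for t, (action, reward) in enumerate(zip(actions, rewards), start=1):
    # Q_t(a): sample average so far, with the initial estimate 0 for untried actions.
    q = {a: sums[a] / counts[a] if counts[a] else 0.0 for a in sums}
    best = max(q.values())
    greedy = sorted(a for a in q if q[a] == best)
    if action not in greedy:
        # A nongreedy choice can only come from the random (epsilon) case.
        definitely_random.append(t)
    print(f"t={t}: greedy={greedy}, chosen={action}")
    sums[action] += reward
    counts[action] += 1

print(definitely_random)  # [4, 5]
```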

### $Exercise\ \mathcal{2.3}$

#### In the comparison shown in Figure 2.2, which method will perform best in the long run in terms of cumulative reward and probability of selecting the best action?

In the long run, the method using $\varepsilon=0.01$ will perform the best.

#### How much better will it be? Express your answer quantitatively.

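Both $\varepsilon$-greedy methods will eventually identify the optimal action, which they then select with probability $1-\varepsilon+\frac{\varepsilon}{10}$: the $\varepsilon=0.01$ method chooses it $99.1\%$ of the time, while the $\varepsilon=0.1$ method only does so $91\%$ of the time. The book gives the best possible reward per step as about $1.54$. Approximating the expected reward of every other action by the overall mean of $0$, the $\varepsilon=0.01$ method receives an average reward of about $1.53$ per time step, while the $\varepsilon=0.1$ method receives only about $1.40$. This slightly overestimates both values: it is a uniformly random action that has an expected reward of $0$, which gives $(1-\varepsilon)\cdot 1.54$, so about $1.52$ and $1.39$.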

In [4]:
0.991 * 1.54, 0.91 * 1.54

(1.52614, 1.4014)