$$Authors: Toke~ Faurby~ (s136232)~ and~ Maciej~ Korzepa~ (mjko).$$

#### Important notes
 * illustrate the knowledge of the Bayesian workflow.
 * Structure: The notebook presents a clear cohesive data analysis story, which is enjoyable to read
 * Statistical terms: Statistical terms are used accurately and with clarity

#### TODO:
 * Description of Bayesian UCB
 * Description of Thompson Sampling
 * Description of Hierarchical extensions
 * Description of the prior
 * Include code
 * Posterior predictive checking
 * Model comparison if applicable (e.g. with loo)
 * Predictive performance assessment if applicable (e.g. classification accuracy)
 * Potentially sensitivity analysis
 * Discussion of problems, and potential improvements
 * Convergence diagnostics (Rhat, divergences, neff)
 * Conclusion:
     * The main conclusion of the data analysis should be clear


Toke's job
 * Run existing code
 * Give Stan a chance


#### DONE: 
 * Introduction: 
     * The introduction is inviting, presents an overview of the notebook. Information is relevant and presented in a logical order.
 * Description of the data
 * Description of analysis problem
 * Description of Frequentist UCB


Things we skip
 * Stan Code 
 * How Stan model is run

# Introduction
**Reinforcement learning** is about goal-directed learning from interactions, and is generally thought of as distinct from supervised and unsupervised learning.
The goal is to learn what actions to take, and when to take them, so as to optimize long-term performance.
This may involve sacrificing immediate reward to obtain greater reward in the long-term or just to obtain more information about the environment.
The tradeoff between maximizing reward and learning about the environment is called the _exploration/exploitation dilemma_, and it is one of the problems at the core of reinforcement learning.


**Bayesian methods** use Bayes' rule to define a posterior distribution, based on a prior on the model parameters and the likelihood of the observations given the model parameters.
Should we obtain even more observations, the old posterior becomes the new prior and the process is repeated.
This property makes Bayesian models attractive for reinforcement learning, where, unlike in many other data analysis problems, the ability to effectively utilize new observations as they become available is important for overall performance.
This explicit modelling of probabilities gives rise to principled methods for incorporating prior information and action-selection (exploration/exploitation) as a function of the uncertainty in learning.
Bayesian methods also enable hierarchical approaches that enhance data efficiency even further.
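
As a minimal illustration of the "posterior becomes the new prior" idea, the sketch below uses the conjugate Beta-Bernoulli model. The function name `update_beta` and the example observations are made up for illustration, and the uniform $Beta(1, 1)$ starting prior is an assumption; this is not necessarily how `agents.py` performs the update.

```python
def update_beta(alpha, beta, rewards):
    """Return the Beta posterior parameters after observing 0/1 rewards."""
    successes = sum(rewards)
    failures = len(rewards) - successes
    return alpha + successes, beta + failures


# Assumed uniform Beta(1, 1) prior, updated in two batches:
# the posterior after the first batch acts as the prior for the second.
a, b = 1, 1
a, b = update_beta(a, b, [1, 0, 0])
a, b = update_beta(a, b, [0, 1])

# Processing all observations at once gives the same posterior.
a_all, b_all = update_beta(1, 1, [1, 0, 0, 0, 1])
print((a, b), (a_all, b_all))
```

Both routes end in the same $Beta(3, 4)$ posterior, which is why observations can be incorporated one at a time as the agent interacts with the environment.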


In this notebook we show the principles of **Bayesian reinforcement learning** using a classic and very simple problem: the **multi-armed bandit problem**.
At each time step the agent must select one of $K$ arms, and subsequently receives a reward drawn from some unknown, arm-specific probability distribution $p(r \mid \theta_k)$.
The goal is to get as high a total reward as possible.
Performance is often measured in terms of expected regret, i.e. the expected difference between the total reward of always taking the optimal action and the total reward of the actions actually selected.

$$
\mathbb{E} \left[ \mathrm{Regret}(T) \right]
=
\mathbb{E} \left[ \sum_{t=1}^T \left( r(a^*) - r(a_t) \right)\right]
$$

where $r(a)$ is the reward received after performing action $a$, and $a^*$ is the action that gives the highest expected reward.

The reward distributions do not change over time, so the environment is static. This greatly simplifies the problem, but many of the general properties of reinforcement learning still hold.
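
To make the regret definition concrete, here is a small sketch for the Bernoulli case, where the expected reward of arm $k$ is simply $\theta_k$. It computes the expected regret for a given sequence of actions using the true success probabilities rather than sampled rewards. The function name `expected_regret` and the numbers are hypothetical and chosen only for illustration.

```python
def expected_regret(thetas, actions):
    """Sum the gap between the best arm's mean reward and each chosen arm's."""
    best = max(thetas)  # expected reward of the optimal action a*
    return sum(best - thetas[a] for a in actions)


thetas = [0.1, 0.3, 0.2]  # hypothetical success probabilities per arm
actions = [0, 1, 2, 1]    # arms selected at t = 1, ..., 4
regret = expected_regret(thetas, actions)
print(round(regret, 6))
```

Only the pulls of arms 0 and 2 contribute here; every pull of the optimal arm 1 adds zero regret.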


## Content
1. The Environment: Multi-Armed Bernoulli Bandits
2. The Agents - Frequentist baseline, Bayesian approach, and hierarchical Bayesian approach
3. Implementation - what we did and how it works
4. Results
5. Conclusion and discussion

# The Environment: Multi-Armed Bernoulli Bandits

The **data** is obtained iteratively through interacting with the environment, which we define as follows:

When initialized, $K$ parameters $\theta_k$ are sampled from a $Beta(\alpha, \beta)$ distribution, with $\alpha=10$ and $\beta=40$ (these values are arbitrary, but we don't want to use a uniform distribution, as it would then match our prior).
One episode lasts $5000$ steps. At each time step $t$ the agent picks an action $a_t$ and receives a reward sampled from $Bernoulli(\theta_{a_t})$.

The agent must therefore balance selecting arms that it believes are good (exploit high $\theta_k$) and unknown arms that might be even better (explore uncertain $\theta_k$).
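
The environment described above could be sketched roughly as follows. This is an illustrative stand-in using only the standard library, not the contents of `bandit.py`; the class name `BernoulliBandit`, its method `pull`, the choice of $K=10$ arms, and the uniformly random agent are all assumptions made for the example.

```python
import random


class BernoulliBandit:
    """K-armed bandit: arm k pays 1 with probability theta_k, else 0."""

    def __init__(self, n_arms, alpha=10.0, beta=40.0, seed=None):
        self.rng = random.Random(seed)
        # Draw each arm's success probability from Beta(alpha, beta).
        self.thetas = [self.rng.betavariate(alpha, beta) for _ in range(n_arms)]

    def pull(self, arm):
        """Return a reward sampled from Bernoulli(theta_arm)."""
        return 1 if self.rng.random() < self.thetas[arm] else 0


# One episode of 5000 steps with an agent that picks arms uniformly at random.
bandit = BernoulliBandit(n_arms=10, seed=0)
agent_rng = random.Random(1)
rewards = [bandit.pull(agent_rng.randrange(10)) for _ in range(5000)]
print(sum(rewards))
```

A random agent like this ignores everything it observes, so it serves only as a floor against which the agents below can be compared.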



# The Agents
 * Did you get a sense of what is the model? Where and how might the author make the model description more clear?

In order to demonstrate the effectiveness of Bayesian approaches we use the popular frequentist *upper confidence bound* (UCB) method as a benchmark.
For the Bayesian approaches we have selected '*Bayesian UCB*', for easy direct comparison, and '*Thompson Sampling*' (a.k.a. probability matching), which achieves state-of-the-art performance on this task.
We also examine the hierarchical extensions of the two Bayesian approaches.


## Frequentist UCB
The idea behind frequentist UCB is that we should always keep exploring, as we can never be certain that we have found the optimal arm.
Exploratory actions should be selected based on their potential for being optimal, taking into account both how close their estimates are to being optimal and the uncertainties in those estimates.
This is done using the following equation:

$$
a_t = \arg \max_a \left[ Q(a) + c\sqrt{\frac{\ln{t}}{N(a)}} \right]
$$

where $Q(a)$ is the empirical mean reward of action $a$, $c$ is a hyperparameter determining the confidence level, and $N(a)$ is the number of times action $a$ has been selected.

Each time $a$ is selected, the associated uncertainty estimate is reduced ($N(a)$ increases).
Similarly, when an action other than $a$ is selected, the uncertainty estimate for $a$ increases ($t$ increases).
The natural logarithm ensures that this increase in uncertainty gets smaller over time, but it is still unbounded, meaning that in the limit all actions will be taken infinitely many times (constant exploration).
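
The selection rule above can be sketched in a few lines. The helper names `ucb_action` and `update`, the default $c=2$, and the convention of pulling every untried arm once before applying the formula (to avoid dividing by $N(a)=0$) are assumptions for this example, not necessarily what `agents.py` does.

```python
import math


def ucb_action(q, n, t, c=2.0):
    """Pick the arm maximising Q(a) + c * sqrt(ln(t) / N(a))."""
    # Assumed convention: an arm that was never pulled is tried first.
    for a, count in enumerate(n):
        if count == 0:
            return a
    scores = [q[a] + c * math.sqrt(math.log(t) / n[a]) for a in range(len(q))]
    return max(range(len(q)), key=scores.__getitem__)


def update(q, n, a, reward):
    """Update the pull count and the running mean reward of arm a in place."""
    n[a] += 1
    q[a] += (reward - q[a]) / n[a]


# Arm 1 has the lower mean but far fewer pulls, so its bonus dominates.
explore = ucb_action(q=[0.5, 0.4], n=[100, 2], t=102, c=1.0)
# With equal counts the bonuses cancel and the higher mean wins.
exploit = ucb_action(q=[0.5, 0.4], n=[100, 100], t=200, c=1.0)
print(explore, exploit)
```

The two calls show both regimes: a rarely pulled arm is selected despite a lower estimate, whereas with equal counts the rule reduces to greedy selection.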


## Bayesian UCB
 * TODO: Priors are listed and justified

## Thompson Sampling


## Hierarchical extensions
* TODO: Only use point estimates?


# Code
> NB: Two files `agents.py` and `bandit.py` hold the code describing the agents and bandit environment respectively. The code is reproduced at the end of the notebook (in case you want to run it, run those first). The code is also available [here](https://github.com/Faur/Multi-Armed-Bayesian-Bandits). 

* TODO: Can we get stan to work? http://andrewgelman.com/2018/02/04/andrew-vs-multi-armed-bandit/



# Conclusion and discussion
 * The main conclusion of the data analysis should be clear

# Appendix

In [1]:
### agents.py

In [2]:
### bandit.py