<a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/"><img alt="Creative Commons License" align="left" src="https://i.creativecommons.org/l/by-nc-sa/4.0/80x15.png" /></a>&nbsp;| [Emmanuel Rachelson](https://personnel.isae-supaero.fr/emmanuel-rachelson?lang=en) | <a href="https://supaerodatascience.github.io/reinforcement-learning/">https://supaerodatascience.github.io/reinforcement-learning/</a>

<div style="font-size:22pt; line-height:25pt; font-weight:bold; text-align:center;">Class 0: Introduction to Reinforcement Learning</div>

Contents of the class
1. [The mad hatter's casino](#mad)
2. [References](#refs)
3. [Ruining the suspense with a general definition](#def)
4. [Examples of RL problems](#examples)
5. [RL in the taxonomy of Machine Learning](#taxonomy)
6. [Key questions and keywords in RL](#keywords)
7. [Software](#gym)

# <a id="mad"></a>The mad hatter's casino

Getting the main intuitions in 30 minutes by playing a fun game!
<img src="img/madhatter.png"></img>

> The Mad Hatter has invited you to play in his casino.  
> It is a strange place. There are 4 rooms, with three slot machines in each.  
> Whenever you pull the arm of a slot machine, you get a certain amount of tea and a tunnel opens that leads you to another room (possibly the same).
> You can play for as long as you want.

<div class="alert alert-warning">
    
**Game on!**  
    
- What could possibly be the goal of the game? Can you express it mathematically?
- If the goal is to accumulate tea, what's a good strategy?
    
Let's play for 5 minutes. If you get bored of following the game rules, you're allowed to cheat, change rooms at will, look at the game cards, etc. (but the questions above are posed for the non-cheating case).
</div>

In case we're not able to play this game together (it's a shame, it's a good laugh), you can emulate it with the following code.

In [None]:
from environments.MadHatterCasino import MadHatterCasino
import random
casino = MadHatterCasino()
room = casino.reset()  # default reset of the game, takes you to room 0
room = casino.reset(2) # reset the game in room 2
print("starting room:", room)
for i in range(3):
    machine = random.randint(0,2)
    room, tea = casino.step(machine) # pull the arm of slot machine 'machine' and be teleported to 'room' while drinking 'tea'
    print("you pulled the arm of machine ", machine, ", reached room ", room, ", and drank ", tea, " tea.", sep='')

In [None]:
# Your code here

Let's now answer the questions above by discussing them together. If you're going through this notebook on your own, here are a few discussion elements:

<details class="alert alert-danger">
    <summary markdown="span"><b>What is the goal of the game?  Can you express it mathematically?</b></summary>

Tricky question and many possible answers here.  
It seems most people try to obtain as much tea as possible.
    But it quickly appears that since the game never stops, the amount of tea you can get is just infinite.  
    So maybe the goal is to get tea as quickly as possible?  
    But in this case, what does "quickly" mean?
    Again, most people seem to consider "quickly" in the sense of "I don't know when the game will stop but there is a certain non-zero probability of termination at each step and I want to drink as much tea as possible before this happens".  
    Let's write $1-\gamma$ for this termination probability (and assume that it is constant over time steps). Then each time step is a two-step process: first check if the game has ended, then maybe get some tea and keep playing. The game ends with probability $1-\gamma$ and we get $0$ tea thereafter. So, seen from time step $t$, the amount of tea we can expect to get afterwards is $\gamma$ times whatever the future steps will bring. Let's call $R_t$ the amount of tea we collect at time step $t$. Then, formally, the amount of tea we can expect to obtain after $T$ time steps is $\mathbb{E}\left( \sum_{t=0}^T \gamma^t R_t \right)$.  
    So maybe the goal of the game is to find a way to maximize $\mathbb{E}\left( \sum_{t=0}^\infty \gamma^t R_t \right)$ (the infinite horizon expected accumulated tea).  
    Another frequent interpretation is to say "I want to get the highest average inflow of tea possible over a certain horizon".  
    With the notation above and without termination probability, this average inflow is $\mathbb{E}\left( \frac{1}{T} \sum_{t=0}^T R_t \right)$.  
    But again, we don't know when the game will end, so maybe the goal of the game is to maximize $\lim_{T\rightarrow \infty} \mathbb{E} \left(\frac{1}{T} \sum_{t=0}^T R_t \right)$.  
    These two interpretations are the most common ones but many others are possible. For example, you could wish to keep your tea intake as steady as possible. Or try to always have an odd cumulative amount of tea (you crazy fool!). Or you could try to get tea while avoiding a certain room (where you might believe a bear lives).  
    Overall, the purpose of this open-ended question is to have a discussion about expressing and formalizing behavior objectives, i.e. "what's your goal in this game?" and "what makes a strategy quantitatively better than another?".
</details>
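
To make the two criteria above concrete, here is a minimal sketch that evaluates them on a single recorded sequence of tea amounts. The numbers in `tea` are made up for illustration, and the helper names are not part of the class material. A single sequence is only one sample: the expectations in the formulas would be estimated by averaging over many plays. Note also that the average below divides by the number of recorded pulls.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**t * r_t over one recorded sequence of tea amounts."""
    return sum(gamma**t * r for t, r in enumerate(rewards))


def average_reward(rewards):
    """Average tea intake per pull over the recorded sequence."""
    return sum(rewards) / len(rewards)


# Hypothetical tea amounts collected over four pulls (made-up numbers).
tea = [1.0, 0.0, 2.0, 1.0]

print(discounted_return(tea, gamma=0.5))  # 1 + 0 + 0.5 + 0.125 = 1.625
print(average_reward(tea))                # 4 / 4 = 1.0
```

With a small $\gamma$ the early pulls dominate the sum, while the average treats all pulls equally.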

<details class="alert alert-danger">
    <summary markdown="span"><b>If the goal is to accumulate tea, what's a good strategy?</b></summary>

Sorry, no answer here since it is precisely the goal of the class. However, you're encouraged to play this game with your friends and fill in the table below with your votes. Who believes machine $m$ should be picked in room $n$? 
    
<img src="img/votes.png"></img>

Now suppose you normalize each column. Then you can read it as "we believe the best way to play in room $n$ is to pick machine $m$ with probability $\pi(n,m)$". So you have expressed a strategy as a function.  
Interestingly, this function has three important characteristics:
- it is stationary: the best course of action changes based on the room you are in, but not the time step
- it only depends on the current room, not on previous rooms and machines
- it is a probability distribution over machines
</details>
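
The vote table and its normalization can be sketched in a few lines of Python. The vote counts below are invented, and the layout (machines as rows, rooms as columns) is an assumption chosen to match the "normalize each column" reading. Here `policy[m][n]` plays the role of $\pi(n,m)$.

```python
import random

# Hypothetical vote counts: votes[m][n] = number of players who would pick
# machine m in room n (3 machines x 4 rooms, made-up numbers).
votes = [
    [6, 1, 2, 0],
    [2, 7, 2, 5],
    [2, 2, 6, 5],
]
n_machines, n_rooms = len(votes), len(votes[0])

# Normalize each column so that the probabilities for a given room sum to 1.
column_totals = [sum(votes[m][n] for m in range(n_machines)) for n in range(n_rooms)]
policy = [[votes[m][n] / column_totals[n] for n in range(n_rooms)]
          for m in range(n_machines)]


def pick_machine(room, rng=random):
    """Sample a machine for the given room according to the policy."""
    weights = [policy[m][room] for m in range(n_machines)]
    return rng.choices(range(n_machines), weights=weights)[0]


print([policy[m][0] for m in range(n_machines)])  # probabilities in room 0
print(pick_machine(0))                            # a machine sampled for room 0
```

The function `pick_machine` only looks at the current room, not at the time step or the history, which mirrors the three characteristics listed above.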

As you were playing this game, you followed a thought process that led you to elaborate a reasonable strategy over several time steps.

This thought process did not involve explicitly modeling the game to obtain a good strategy. Rather this strategy was adjusted using some instantaneous feedback (the amount of tea you obtain at each step). And this feedback was not representative of your long-term goal.

Reinforcement Learning is the study of how this thought process can be formalized, analyzed and transformed into algorithms.

# <a id="refs"></a>References

There are lots of excellent books and online resources. Here are a few that are freely available online.

<table>
<tr>
<td><img src="img/book_sutton2.jpg" style="width: 200px;"></td>
<td><b>Reinforcement Learning: an introduction (2nd edition)</b><br>Richard Sutton and Andrew Barto<br>2018.<br>The Reinforcement Learning bible. Both complete and didactical.<br><a href="http://incompleteideas.net/book/the-book.html">PDF available online</a>.</td>
</tr>
<tr>
<td><img src="img/book_szepesvari.jpg" style="width: 200px;"></td>
<td><b>Algorithms for Reinforcement Learning</b><br>Csaba Szepesvari<br>2010.<br>The essentials in a hundred pages. A bit mathematical.<br><a href="https://sites.ualberta.ca/~szepesva/RLBook.html">PDF available online</a>.</td>
</tr>

<tr>
<td><img src="img/book_deeprl.jpg" style="width: 200px;"></td>
<td><b>An Introduction to Deep Reinforcement Learning</b><br>Vincent Francois-Lavet, Peter Henderson, Riashat Islam, Marc G. Bellemare, Joelle Pineau<br>2019.<br>Deep Reinforcement Learning.<br><a href="https://arxiv.org/abs/1811.12560">PDF available online</a>.</td>
</tr>
    
<tr>
<td><img src="img/web_silver.png" style="width: 200px;"></td>
<td><b>David Silver's UCL course on RL</b><br>10 video lectures + presentation PDFs.<br>2015.<br><a href="http://www0.cs.ucl.ac.uk/staff/d.silver/web/Teaching.html">Available here</a>.</td>
</tr>
</table>


# <a id="def"></a>Ruining the suspense with a general definition

What is Reinforcement Learning about?

It is about controlling dynamic systems.
<img src="img/dynamic.png" style="width: 400px;"></img>
Dynamic systems? The state $s$ and the observation $o$ evolve **dynamically** over time under the policy $\pi$.

Our object of study:<br>
We want to find a control policy $\pi$ (with $u = \pi(o)$) such that the system $\Sigma$ behaves as we desire.
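
As an illustration only, the closed loop between the system and the policy can be sketched as follows. The names `s`, `o`, `u` and `pi` follow the text, but the dynamics, the observation function and the policy itself are hypothetical toy choices (a scalar state that drifts upwards and a policy that tries to bring it back to 0).

```python
def observe(s):
    """Hypothetical observation function: here we observe the state exactly."""
    return s


def dynamics(s, u):
    """Hypothetical dynamics: the state drifts by +1 and the control is added."""
    return s + 1.0 + u


def pi(o):
    """Hypothetical control policy: cancel the drift and halve the observed value."""
    return -0.5 * o - 1.0


s = 8.0  # arbitrary initial state
trajectory = [s]
for t in range(5):
    o = observe(s)      # the system emits an observation
    u = pi(o)           # the policy turns it into a control
    s = dynamics(s, u)  # the state evolves under that control
    trajectory.append(s)

print(trajectory)  # [8.0, 4.0, 2.0, 1.0, 0.5, 0.25]
```

Here "behaves as we desire" means that the state converges to 0. In RL the dynamics are not known in advance, so a policy like `pi` cannot simply be written down by hand.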

# <a id="examples"></a>Examples of RL problems

<table>
<tr>
  <td><img src="img/spiral.jpg" style="width: 200px;"></td>
  <td>Exiting a spiral</td>
</tr>
<tr>
  <td><img src="img/tests.jpg" style="width: 200px;"></td>
  <td>Dynamic treatment regimes for HIV patients</td>
</tr>
<tr>
  <td><img src="img/pend.png" style="width: 200px;"></td>
  <td>Cart-pole balancing</td>
</tr>
<tr>
  <td><img src="img/waiting.jpg" style="width: 200px;"></td>
  <td>Queueing problems</td>
</tr>
<tr>
  <td><img src="img/market.jpg" style="width: 200px;"></td>
  <td>Portfolio management</td>
</tr>
<tr>
  <td><img src="img/dam.jpg" style="width: 200px;"></td>
  <td>Hydroelectric production</td>
</tr>
</table>

But also:
- Elevator scheduling
- Bicycle riding
- Ship steering
- Bioreactor control
- Aerobatics helicopter control
- Airport departures scheduling
- Airlines scheduling
- Robocup soccer
- Video game playing (Quake, CS, Starcraft...)
- Game of Go
- ...

# <a id="taxonomy"></a>RL in the taxonomy of Machine Learning

You may have had classes on Machine Learning before. There are three strongly distinct categories of problems in ML:
- Supervised Learning
- Unsupervised Learning
- Reinforcement Learning

Let's try to answer the following questions for each category.
- What's the abstract problem we are trying to solve?
- What's the data provided to the algorithms?
- Give examples of algorithms in SL/UL/RL.  

<center>
<table border="1">
<tr>
    <td> <b>Question</b> </td>
    <td style="border-left: 1px solid black"> <b>Supervised</b> </td>
    <td style="border-left: 1px solid black"> <b>Unsupervised</b> </td>
    <td style="border-left: 1px solid black"> <b>Reinforcement</b> </td>
</tr>
<tr>
    <td> Target </td>
    <td style="border-left: 1px solid black"> $f(x)=y$ </td>
    <td style="border-left: 1px solid black"> $x\in X$ </td>
    <td style="border-left: 1px solid black"> $\pi(s)=a$ </td>
</tr>
<tr>
    <td> Target (rephrased) </td>
    <td style="border-left: 1px solid black"> Predict outputs given inputs</td>
    <td style="border-left: 1px solid black"> Discover structure in data </td>
    <td style="border-left: 1px solid black"> Find an optimal behavior </td>
</tr>
<tr>
    <td> Data </td>
    <td style="border-left: 1px solid black"> $\left\{\left(x,y\right)\right\}$ supervisor's labels </td>
    <td style="border-left: 1px solid black"> $\left\{x\right\}$ unlabelled data </td>
    <td style="border-left: 1px solid black"> $\left\{\left(s,a,r,s'\right)\right\}$ experience samples </td>
</tr>
<tr>
    <td> Output </td>
    <td style="border-left: 1px solid black"> Classifier or regressor</td>
    <td style="border-left: 1px solid black"> Clusters or dimension reduction </td>
    <td style="border-left: 1px solid black"> Policies, value functions </td>
</tr>
<tr>
    <td> Key algorithms </td>
    <td style="border-left: 1px solid black"> Neural networks, SVMs, etc.</td>
    <td style="border-left: 1px solid black"> k-means, PCA, etc. </td>
    <td style="border-left: 1px solid black"> Q-learning, Policy Gradients, etc. </td>
</tr>
</table>
</center>

This table helps distinguish the different natures of the problems tackled. The RL problem is about finding the optimal policy for a given environment.

# <a id="keywords"></a>Key questions and keywords in RL

The problem RL tries to solve is the *evaluation* and the *improvement* of the agent's behavior, based on experience samples:
$$(s,a,r,s')$$

One can distinguish two subproblems in RL:
- **Value prediction**: what is this policy's value?<br>
$\rightarrow$ Useful for decision support applications: predict the cost of using the water in a hydro-electric reservoir, predict the expected gain of an investment policy.
- **Policy optimization**: what is the best control policy for this system?<br>
$\rightarrow$ Useful for control applications: robotic actuator control, operations planning, elevator scheduling, agro-ecosystems management, etc.
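
To give a feel for value prediction, here is a rough sketch that estimates the discounted value of two fixed policies by plain Monte Carlo simulation, i.e. by averaging the discounted sum of rewards over many simulated episodes. Everything in it is an assumption made for illustration: the two-state system in `step`, the two policies, and the truncation of the sum after `horizon` steps. It is not the method developed in the class.

```python
import random


def step(state, action, rng):
    """Hypothetical 2-state system. Action 0 = stay, 1 = switch; the intended
    move succeeds with probability 0.9. Reward is 1 when landing in state 1."""
    intended = state if action == 0 else 1 - state
    next_state = intended if rng.random() < 0.9 else 1 - intended
    reward = 1.0 if next_state == 1 else 0.0
    return reward, next_state


def estimate_value(policy, start_state, gamma=0.9, horizon=50, episodes=200, seed=0):
    """Monte Carlo estimate of E[sum_t gamma^t R_t] under `policy`,
    truncated after `horizon` steps."""
    rng = random.Random(seed)
    total = 0.0
    for episode in range(episodes):
        state, discount = start_state, 1.0
        for t in range(horizon):
            reward, state = step(state, policy(state), rng)
            total += discount * reward
            discount *= gamma
    return total / episodes


def seek_state_1(state):
    return 1 if state == 0 else 0  # switch if in state 0, stay if in state 1


def seek_state_0(state):
    return 0 if state == 0 else 1  # stay if in state 0, switch if in state 1


print(estimate_value(seek_state_1, start_state=0))
print(estimate_value(seek_state_0, start_state=0))
```

Comparing the two estimates is a crude form of policy optimization: among the candidates, keep the one with the highest estimated value.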

The core hypothesis of RL is that the environment behaves as a Markov Decision Process, although we do not know an explicit model of this process.

Finally, there are different contexts and challenges in RL that we can discuss via a few keywords:
- curse of **dimensionality**<br>
The more state/action variables, the larger the state space and the harder it is to efficiently optimize the policy.
- **exploration/exploitation dilemma**<br>
Suppose we have discovered that some region of the state space yields great rewards. Should we rush towards that region or explore the unknown in search of even better rewards?
- **model-based vs. model-free RL**<br>
One straightforward approach to RL would be to collect samples $(s,a,r,s')$, approximate the underlying MDP, and then solve this approximate model: this is model-based RL. Model-free RL tries to infer the optimal policy without this intermediate step. This type of RL is the one discussed in this class.
- **online vs. interactive vs. non-interactive (offline) RL**<br>
Online: we directly interact with the environment, as a player in a game.<br>
Interactive: we control the interaction with the environment, in particular the choice of the current state or the possibility of resets, as if we had a simulator that we can query.<br>
Non-interactive (offline): somebody collected that $\left\{\left(s,a,r,s'\right)\right\}$ pile of data and left it here without a note. Can we still get something out of it?
- **on-policy vs. off-policy learning**<br>
Can we learn something about $\pi$ while applying $\pi'$?
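
As a small illustration of the exploration/exploitation dilemma, here is a sketch of the common $\epsilon$-greedy heuristic: with probability $\epsilon$ pick an action at random, otherwise pick the action that currently looks best. The function name and the numbers in `estimates` are hypothetical, and this is only one of many possible ways to balance exploration and exploitation.

```python
import random


def epsilon_greedy(estimates, epsilon, rng=random):
    """With probability epsilon pick a random action (explore),
    otherwise pick the action with the highest current estimate (exploit)."""
    if rng.random() < epsilon:
        return rng.randrange(len(estimates))
    return max(range(len(estimates)), key=lambda a: estimates[a])


# Hypothetical current estimates of the tea yielded by 3 machines in one room.
estimates = [0.2, 1.5, 0.7]

print(epsilon_greedy(estimates, epsilon=0.0))  # never explores: picks machine 1
print(epsilon_greedy(estimates, epsilon=0.1))  # explores 10% of the time
```

With `epsilon=0.0` the agent never tries the other machines and cannot find out whether its estimates are wrong. With `epsilon=1.0` it never uses what it has learned.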

# <a id="gym"></a> Software

This class requires a recent version of Python 3 and scikit-learn (available in the <a href="https://www.anaconda.com/download">Anaconda distribution</a>).

You will need standard elements of Anaconda (numpy, matplotlib, scikit-learn, scikit-image) and graphviz.
```sh
conda install graphviz
conda install python-graphviz
```

Although some environments will be provided as independent files, we will make use of the <a href="https://github.com/openai/gym">OpenAI Gym</a> collection of Reinforcement Learning environments.

In case this notebook becomes outdated, refer to the <a href="https://github.com/openai/gym">installation instructions for Gym</a>. On Debian-based GNU/Linux distributions, this should do the trick:
```sh
sudo apt install -y g++ libglu1-mesa-dev libgl1-mesa-dev libosmesa6-dev xvfb ffmpeg curl patchelf libglfw3 libglfw3-dev cmake zlib1g zlib1g-dev swig
pip install gym gym[atari] gym[classic_control] gym[box2d] gym[algorithms]
pip install gym[accept-rom-license]
```

Test your installation (if the code below runs fine, you're sorted).

In [None]:
# %load extras/colab_gym_setup.sh
# If you're running this notebook on Colab, uncomment the line above and run this cell.

In [None]:
# %load extras/virtualdisplay_gym_setup.py
# If you're running this notebook on Binder or Colab, uncomment the line above and run this cell.

In [None]:
# This should display a 4x4 grid of letters and open windows for the Breakout and LunarLander games.
# Don't close the windows yourself (it shouldn't work anyway)
import gym

env0 = gym.make('FrozenLake-v1')
env0.reset()  # reset before rendering: some Gym versions complain if render() comes first
env0.render()

env1 = gym.make('Breakout-v0')
env1.reset()
env1.render()

env2 = gym.make('LunarLander-v2')
env2.reset()
env2.render()

In [None]:
# This should close the Breakout and LunarLander windows
env1.close()
env2.close()