# Workshop RL01: Introduction to Reinforcement Learning

## Motivation:

So far we have learned about supervised learning, unsupervised learning and deep learning. It's probably a good time to stop and think about the fundamental challenge of machine learning and artificial intelligence. Quoting reinforcement learning (RL) professor Emma Brunskill from Stanford: "Fundamental challenge in artificial intelligence and machine learning is 


**<center>learning to make good decisions under uncertainty".</center>**


If we break this sentence down, we can see that we need to address the following aspects:
- "learning": no prior knowledge, so the agent has to learn from experience
- "good decisions": need some measure of decision quality, and a way to optimise that measure
- "uncertainty": need to explore different possibilities to gain experience

And RL is all about making **sequential decisions under uncertainty**, which involves:  

- **optimization**: yield the best decisions
- **generalization**: generalise from experience to make decisions in previously unseen situations
- **delayed consequences**: account for decisions made now that can impact things much later
- **exploration**: interact with the world through decision-making and learn which decisions are best

As a comparison with other AI methods:

|Comparison|AI planning|Supervised ML|Unsupervised ML|Imitation learning| 
|:------:|:---------:|:-----------:|:-------------:|:----------------:|
|optimization| $\checkmark$ | $\checkmark$ |$\checkmark$| $\checkmark$| 
|generalization|$\checkmark$ |$\checkmark$ |$\checkmark$ |$\checkmark$ |
|delayed consequences|$\checkmark$ | - | - |$\checkmark$ |
|exploration| - | - | - | - |
|how it learns|learn from models of how decisions impact results|learn from experience/data|learn from experience/data|learn from experience from other intelligence like human|


Some successful RL implementations: 
Gaming, Robotics, Healthcare, ML (NLP, CV) ...





## The fundamentals

So how does RL make sequential decisions? Through a loop:


<img src = 'SDP.png'>

This is known as a **sequential decision process**. At each time step $t$:
- **agent** uses data up to time $t$ and takes action $a_t$
- **world** emits observation $o_t$ and reward $r_t$, received by agent
- data are stored in **history**: $h_t = (a_1,o_1,r_1,...,a_t,o_t,r_t)$


|Examples|Action|Observation|Reward|
|:------:|:----:|:---------:|:----:|
|web ad|choose web ad|view time|click on ad|
|blood pressure control|exercise or medication|blood pressure|within healthy range|
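
The loop above can be sketched in a few lines of Python. This is only an illustration: `CoinWorld` and `agent` are hypothetical names invented here (they are not part of the assignment code), and the "world" is a made-up biased coin that the agent tries to guess.

```python
import random


class CoinWorld:
    """Hypothetical toy world: the agent guesses the outcome of a biased coin."""

    def __init__(self, p_heads=0.7, seed=0):
        self.p_heads = p_heads
        self.rng = random.Random(seed)

    def step(self, action):
        # The world emits an observation o_t and a reward r_t for action a_t.
        observation = 1 if self.rng.random() < self.p_heads else 0
        reward = 1.0 if action == observation else 0.0
        return observation, reward


def agent(history):
    """Guess the outcome seen most often so far (guess 1 when there is no data)."""
    observations = [o for (_, o, _) in history]
    if not observations:
        return 1
    return 1 if 2 * sum(observations) >= len(observations) else 0


world = CoinWorld()
history = []  # h_t = (a_1, o_1, r_1, ..., a_t, o_t, r_t)
for t in range(100):
    a_t = agent(history)             # agent uses data up to time t
    o_t, r_t = world.step(a_t)       # world emits observation and reward
    history.append((a_t, o_t, r_t))  # data are stored in the history

total_reward = sum(r for (_, _, r) in history)
print("total reward:", total_reward)
```

Because the coin is random, the total reward differs from run to run (or from seed to seed), which is one way to see why the goal is stated in terms of *expected* reward.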

Our goal is to maximise total expected (why expected?) future rewards, which may require balancing immediate and long-term rewards, as well as strategic behaviour to achieve high rewards. 

## RL terminologies:
- **agent**: an intelligent subject that can take actions
- **world**: the environment that the agent operates in, which produces observations and rewards in response
- **state**: the information assumed to determine what happens next
- **world state**: representation of how the world changes; the true state of the world is often unknown to the agent, so we model it with limited data (why?)
- **agent state**: the information the agent uses to make decisions and evaluate its rewards, e.g. if it's in state $s_1$ then take action $a_1$ 

## RL components
An RL algorithm often contains one or more of:
- **model**: 
    - mathematical models of dynamics and rewards, agent's representation of how the world changes in response to agent's action, e.g.:
        - transition/dynamics model that predicts: $p(s_{t+1} = s'|s_t=s,a_t=a)$, e.g.
        $$\begin{bmatrix}
            p(s_1|s_1,a_1) & p(s_2|s_1,a_1) & p(s_3|s_1,a_1) & \dots  & p(s_N|s_1,a_1) \\
            p(s_1|s_2,a_1) & p(s_2|s_2,a_1) & p(s_3|s_2,a_1) & \dots  & p(s_N|s_2,a_1) \\
            \vdots & \vdots & \vdots & \ddots & \vdots \\
            p(s_1|s_N,a_1) & p(s_2|s_N,a_1) & p(s_3|s_N,a_1) & \dots  & p(s_N|s_N,a_1)
        \end{bmatrix}$$
        - reward model that determines current rewards based on action and/or states: $R(s_t=s,a_t=a)=E \lbrack r_t|s_t,a_t \rbrack$
    - the model is explicit; an agent that has one may or may not also have an explicit policy
- **policy**: 
    - function mapping agent's states to actions, determines agent's actions by some function $\pi$, e.g.:
        - deterministic policy: $a = \pi(s)$
        - stochastic policy: $p(a_t=a|s_t=s)=\pi(a|s)$
- **value function**: 
    - expected (discounted) future rewards, 2 types of value: 
        - state value: $V(s_t=s)=E\lbrack r_t+\gamma r_{t+1}+\gamma^2r_{t+2}+...|s_t=s \rbrack$
        - state-action value: $Q(s_t=s,a_t=a)=E\lbrack r_t+\gamma r_{t+1}+\gamma^2r_{t+2}+...|s_t=s , a_t = a\rbrack$
        - in both, $\gamma \in [0,1]$ is the discount factor
    - used as a measurement of rewards for agent
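
To make the three components concrete, here is a minimal tabular sketch in plain Python. The 2-state, 2-action problem and all of its numbers are made up for illustration, and the names `p`, `R`, `pi_det`, `pi_sto`, `sample_action` and `discounted_return` are our own choices rather than anything from the assignment.

```python
import random

# Transition model: p[s][a][s2] = p(s_{t+1}=s2 | s_t=s, a_t=a)
p = [
    [[0.9, 0.1], [0.2, 0.8]],  # from state 0: action 0, action 1
    [[0.0, 1.0], [0.5, 0.5]],  # from state 1: action 0, action 1
]
# Reward model: R[s][a] = E[r_t | s_t=s, a_t=a]
R = [[0.0, 1.0], [2.0, 0.0]]

# Deterministic policy: a = pi(s)
pi_det = [1, 0]
# Stochastic policy: pi_sto[s][a] = pi(a|s)
pi_sto = [[0.5, 0.5], [0.9, 0.1]]


def sample_action(policy_row, rng):
    """Draw an action index according to the probabilities pi(.|s)."""
    return rng.choices(range(len(policy_row)), weights=policy_row)[0]


def discounted_return(rewards, gamma):
    """r_t + gamma*r_{t+1} + gamma^2*r_{t+2} + ... for one sampled trajectory."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g


rng = random.Random(0)
action = sample_action(pi_sto[0], rng)
print(discounted_return([1.0, 1.0, 1.0], gamma=0.5))  # 1 + 0.5 + 0.25 = 1.75
```

The value functions $V$ and $Q$ are the *expectation* of `discounted_return` over trajectories, so a single sampled trajectory only gives a noisy estimate of them.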


    
By choosing and combining these components, we have different types of agents:

<img src='agents.png'>

### Value-Based
In the value-based method, instead of calculating the value function explicitly, we approximate it with a set of parameters $w$, i.e. $$V(s) \approx V(s,w)$$ If $w$ are the weights of a deep neural network that approximates the state-action value $Q$, we get a Deep Q-Network (DQN), but more about this in the next workshop. 
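
As a small illustration of $V(s) \approx V(s,w)$, the sketch below uses a *linear* approximator instead of a neural network, so this is not DQN. The feature function `features`, the target values and the step size are all assumptions made up for this example.

```python
def features(s):
    """Hypothetical feature vector phi(s) for an integer state s."""
    return [1.0, float(s)]


def v_hat(s, w):
    """Approximate value V(s, w) = w . phi(s)."""
    return sum(wi * xi for wi, xi in zip(w, features(s)))


# Made-up target values for four states (chosen to be exactly 2*s + 1).
targets = {0: 1.0, 1: 3.0, 2: 5.0, 3: 7.0}

w = [0.0, 0.0]
alpha = 0.05  # step size
for _ in range(2000):
    for s, v_target in targets.items():
        error = v_target - v_hat(s, w)
        # Gradient step on the squared error 0.5 * error**2 with respect to w.
        w = [wi + alpha * error * xi for wi, xi in zip(w, features(s))]

print(w)  # should end up close to [1.0, 2.0]
```

Two parameters are enough here to represent four state values; with a large state space this compression is the whole point of function approximation.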


### Policy-Based
In the policy-based method, we parameterize and "learn" the policy, i.e. $$\pi(a|s) \approx \pi_{\theta}(a|s)$$
We then find the parameters $\theta$ that maximize the value function. More about this in the later policy gradient workshop. 
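
One common way to parameterize a stochastic policy is a softmax over per-state action preferences. The sketch below assumes that choice (the text above does not prescribe one), and `theta` and `softmax_policy` are names made up for the example.

```python
import math


def softmax_policy(theta, s):
    """Return pi_theta(.|s) from the action preferences theta[s][a]."""
    prefs = theta[s]
    m = max(prefs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in prefs]
    z = sum(exps)
    return [e / z for e in exps]


# Made-up parameters: 2 states, 2 actions.
theta = [[0.0, 0.0], [1.0, -1.0]]
print(softmax_policy(theta, 0))  # equal preferences give a uniform policy
print(softmax_policy(theta, 1))  # action 0 is preferred in state 1
```

Learning then means adjusting `theta` so that actions leading to higher value get larger preferences.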


### Model-Based
In the model-based method, we provide the agent with a model of how the world works. More explicitly, we define a dynamics/transition model that tells the agent how the state changes with its actions. If we assume the states are Markov (the next state depends only on the current state and action), we can define the agent's decision-making process as a **Markov Decision Process (MDP)**. If you are interested, feel free to go through the optional workshop for more details. 

An MDP is often combined with a policy, and is defined as below: 

|MDP with policy|
|:-------------|
|$S$ is a set of states $s_t \in S$|
|$A$ is a set of actions $a \in A $|
|$P^{\pi}$ is dynamics/transition model that specifies $p^{\pi}(s'|{s})$ **under a certain policy**, where $p^{\pi}(s'|s) = \sum_{a \in A} \pi(a|s)p(s'|s,a)$|    
|$R^{\pi}$ is a reward function that specifies current reward $R^{\pi}(s)$ **under a certain policy**, where $R^{\pi}(s) = \sum_{a \in A} \pi(a|s)R(s,a)$|
|$\gamma \in [0,1]$ is the discount factor|
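
The two sums in the table can be computed directly. Below is a minimal sketch on a made-up 2-state, 2-action MDP; `policy_model` and the array layout `p[s][a][s2]`, `R[s][a]`, `pi[s][a]` are our own conventions for this example.

```python
def policy_model(p, R, pi):
    """Average the model over actions to get p^pi(s'|s) and R^pi(s)."""
    n_states = len(p)
    n_actions = len(p[0])
    p_pi = [
        [sum(pi[s][a] * p[s][a][s2] for a in range(n_actions))
         for s2 in range(n_states)]
        for s in range(n_states)
    ]
    R_pi = [
        sum(pi[s][a] * R[s][a] for a in range(n_actions))
        for s in range(n_states)
    ]
    return p_pi, R_pi


# Made-up model and stochastic policy.
p = [
    [[0.9, 0.1], [0.2, 0.8]],
    [[0.0, 1.0], [0.5, 0.5]],
]
R = [[0.0, 1.0], [2.0, 0.0]]
pi = [[0.5, 0.5], [0.9, 0.1]]

p_pi, R_pi = policy_model(p, R, pi)
print(p_pi)  # approximately [[0.55, 0.45], [0.05, 0.95]]
print(R_pi)  # approximately [0.5, 1.8]
```

Once the policy is folded into the model like this, the actions disappear and only a state-to-state transition matrix and a per-state reward remain.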

Then we can calculate the value function for a policy iteratively:
- for all $s \in S$, initialise $V_0^{\pi}(s) = 0$ 
- for $k = 1, 2, \dots$ until convergence, update all $s \in S$: $$ V_k^{\pi}(s) = R^{\pi}(s) + \gamma \sum_{s' \in S} p^{\pi}(s'|s)V_{k-1}^{\pi}(s') $$ 
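
The update above translates almost line by line into code. This is a sketch under two assumptions of ours: convergence is declared when the largest change between two sweeps drops below a tolerance `tol`, and $\gamma < 1$ so that the loop is guaranteed to terminate. The 2-state example is made up.

```python
def evaluate_policy(p_pi, R_pi, gamma, tol=1e-8):
    """Iterate V_k(s) = R_pi(s) + gamma * sum_s' p_pi(s'|s) V_{k-1}(s')."""
    n = len(R_pi)
    V = [0.0] * n  # V_0(s) = 0 for all s
    while True:
        V_new = [
            R_pi[s] + gamma * sum(p_pi[s][s2] * V[s2] for s2 in range(n))
            for s in range(n)
        ]
        if max(abs(x - y) for x, y in zip(V_new, V)) < tol:
            return V_new
        V = V_new


# Made-up example: state 0 always moves to state 1, state 1 stays put.
p_pi = [[0.0, 1.0], [0.0, 1.0]]
R_pi = [1.0, 2.0]
V = evaluate_policy(p_pi, R_pi, gamma=0.5)
print(V)  # close to [3.0, 4.0]
```

The result can be checked by hand: $V(s_2) = 2 / (1 - 0.5) = 4$ and $V(s_1) = 1 + 0.5 \cdot 4 = 3$.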

Using the above value function, we can also compute state-action value (Q value):
$$Q^{\pi}(s,a) = R(s,a) + \gamma \sum_{s' \in S}p(s'|s,a)V^{\pi}(s'), \forall s \in S, a \in A$$
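
As a quick numerical check of the formula, here is a sketch that turns a given $V^{\pi}$ into a table of Q values. The model, the value vector and the name `q_from_v` are made up for illustration.

```python
def q_from_v(p, R, V, gamma):
    """Q(s,a) = R(s,a) + gamma * sum_s' p(s'|s,a) V(s') for every s and a."""
    n = len(V)
    return [
        [R[s][a] + gamma * sum(p[s][a][s2] * V[s2] for s2 in range(n))
         for a in range(len(R[s]))]
        for s in range(n)
    ]


# Made-up model: p[s][a][s2] and R[s][a]; V is assumed to be given.
p = [
    [[0.9, 0.1], [0.2, 0.8]],
    [[0.0, 1.0], [0.5, 0.5]],
]
R = [[0.0, 1.0], [2.0, 0.0]]
V = [3.0, 4.0]

Q = q_from_v(p, R, V, gamma=0.5)
print(Q)  # approximately [[1.55, 2.9], [4.0, 1.75]]
```

For example, $Q(s_1, a_2) = 1 + 0.5 \cdot (0.2 \cdot 3 + 0.8 \cdot 4) = 2.9$.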

And finally, to find the optimal policy we can do an exhaustive search over all policies and pick the best one. Or we can improve the policy/value function iteratively until we reach the optimal one. (Which one is better?)

The second method can be done in 2 ways:
- policy iteration (pi)
- value iteration (vi)

#### 1. Policy Iteration
Policy iteration involves 3 steps:
- policy evaluation, which is to compute $V^{\pi}(s)$
- policy improvement, where we take the action that maximises the Q value for each state, i.e. $$ \pi_{i+1}(s) = \arg\max_a Q^{\pi_i}(s,a)$$
- policy iteration, which repeats the two steps above: 
    - for all $s \in S$, initialise $\pi_0(s)$
    - start with $i=0$ and repeat until $||\pi_i - \pi_{i+1}||_1 = 0$ (no changes in policy)
        - policy evaluation: compute $V^{\pi_i}(s)$
        - policy improvement: compute $\pi_{i+1}$
        - $i = i + 1$
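
The steps above can be sketched as follows. This is not the suggested solution in `vi_and_pi.py`, just a minimal version on a made-up 2-state, 2-action MDP stored as nested lists `p[s][a][s2]` and `R[s][a]`; the function names and the tolerance are our own choices, and $\gamma < 1$ is assumed.

```python
def evaluate(p, R, pi, gamma, tol=1e-10):
    """Policy evaluation for a deterministic policy pi[s]."""
    n = len(p)
    V = [0.0] * n
    while True:
        V_new = [
            R[s][pi[s]] + gamma * sum(p[s][pi[s]][s2] * V[s2] for s2 in range(n))
            for s in range(n)
        ]
        if max(abs(x - y) for x, y in zip(V_new, V)) < tol:
            return V_new
        V = V_new


def improve(p, R, V, gamma):
    """Policy improvement: pick argmax_a Q(s, a) in every state."""
    n = len(p)
    new_pi = []
    for s in range(n):
        q = [
            R[s][a] + gamma * sum(p[s][a][s2] * V[s2] for s2 in range(n))
            for a in range(len(R[s]))
        ]
        new_pi.append(q.index(max(q)))  # ties go to the lowest action index
    return new_pi


def policy_iteration(p, R, gamma):
    pi = [0] * len(p)  # pi_0: always take action 0
    while True:
        V = evaluate(p, R, pi, gamma)
        new_pi = improve(p, R, V, gamma)
        if new_pi == pi:  # no change in policy, so stop
            return pi, V
        pi = new_pi


p = [
    [[0.9, 0.1], [0.2, 0.8]],
    [[0.0, 1.0], [0.5, 0.5]],
]
R = [[0.0, 1.0], [2.0, 0.0]]
pi_star, V_star = policy_iteration(p, R, gamma=0.9)
print(pi_star, V_star)
```

On this example the loop stops after the second improvement step, because the policy no longer changes.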

#### 2. Value Iteration
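- for all $s \in S$, initialise $V_0(s)$ with zeros 
- start with $k=1$ and repeat until $V(s)$ converges
     - for each $s \in S$: $V_{k}(s) = \max_a \left[ R(s,a) + \gamma \sum_{s' \in S} p(s'|s,a)V_{k-1}(s') \right]$
     - $k = k + 1$
- policy extraction, using the converged $V$: $\pi(s) = \arg\max_a \left[ R(s,a) + \gamma \sum_{s' \in S} p(s'|s,a)V(s') \right]$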
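
Here is a minimal sketch of value iteration on a made-up 2-state, 2-action MDP. It is not the suggested solution in `vi_and_pi.py`; the nested-list layout `p[s][a][s2]`, `R[s][a]`, the stopping tolerance and the function name are assumptions for this example, and $\gamma < 1$ is assumed so that the loop terminates.

```python
def value_iteration(p, R, gamma, tol=1e-10):
    n = len(p)

    def q_values(s, V):
        # One-step lookahead: Q(s,a) = R(s,a) + gamma * sum_s' p(s'|s,a) V(s')
        return [
            R[s][a] + gamma * sum(p[s][a][s2] * V[s2] for s2 in range(n))
            for a in range(len(R[s]))
        ]

    V = [0.0] * n  # V_0(s) = 0
    while True:
        V_new = [max(q_values(s, V)) for s in range(n)]
        converged = max(abs(x - y) for x, y in zip(V_new, V)) < tol
        V = V_new
        if converged:
            break

    # Policy extraction: act greedily with respect to the converged V.
    pi = []
    for s in range(n):
        q = q_values(s, V)
        pi.append(q.index(max(q)))
    return pi, V


p = [
    [[0.9, 0.1], [0.2, 0.8]],
    [[0.0, 1.0], [0.5, 0.5]],
]
R = [[0.0, 1.0], [2.0, 0.0]]
pi_star, V_star = value_iteration(p, R, gamma=0.9)
print(pi_star, V_star)
```

Unlike policy iteration, no policy is kept during the loop; the policy is only read off once at the end.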



### Exercise 
#### Task
For the coding part, we're going to do the Stanford reinforcement learning course assignment.
([link](http://web.stanford.edu/class/cs234/assignment1/index.html))

Your task is to implement policy iteration and value iteration in the Frozen Lake environment. Since it's hard to render the environment in a Jupyter notebook, we'll be running scripts. 

What are the scripts:
- discrete_env.py: dependency of frozen_lake.py
- draft.py: prints out environment parameters to help understanding. You can also try printing out different parameters.
- frozen_lake.py: creates environment; specifies states, actions and dynamics model. (More info [here](https://gym.openai.com/envs/FrozenLake-v0/))
- getting_started_with_gym.py: simple example to help you familiarise yourself with the Frozen Lake environment. You can also try other environments in Gym. 
- lake_env.py: defines and registers for specific frozen lake environments
- vi_and_pi.py: implementation of value iteration and policy iteration. We have provided the suggested solutions but feel free to try your own solutions!
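
Before opening the scripts, it may help to see how a dynamics model can be stored as a nested dictionary. The sketch below assumes the layout used by Gym's discrete environments, where `P[s][a]` is a list of `(probability, next_state, reward, terminal)` tuples; please check `discrete_env.py` and `draft.py` to confirm that this matches your version. The 3-state `P` here is a hand-written stand-in, not the real Frozen Lake model, and `backup` is a hypothetical helper name.

```python
# Hand-written stand-in for an environment's dynamics model (assumed layout):
# P[s][a] = [(probability, next_state, reward, terminal), ...]
P = {
    0: {0: [(1.0, 0, 0.0, False)],
        1: [(0.8, 1, 0.0, False), (0.2, 0, 0.0, False)]},
    1: {0: [(1.0, 0, 0.0, False)],
        1: [(1.0, 2, 1.0, True)]},
    2: {0: [(1.0, 2, 0.0, True)],
        1: [(1.0, 2, 0.0, True)]},
}


def backup(P, V, s, a, gamma):
    """One-step lookahead: expected reward plus discounted next-state value."""
    # The terminal flag is ignored here; this relies on V being 0 in
    # absorbing terminal states, which is an assumption of this sketch.
    return sum(
        prob * (reward + gamma * V[s2])
        for prob, s2, reward, terminal in P[s][a]
    )


V = [0.0, 1.0, 0.0]
print(backup(P, V, 1, 1, 0.9))  # only the immediate reward counts here
print(backup(P, V, 0, 1, 0.9))  # no immediate reward, only discounted value
```

Note that in this layout the reward is attached to each transition rather than stored as a separate $R(s,a)$, so the expectation over next states covers the reward as well.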


#### Setting up
To set up the world or environment, we're going to use "gym". (Doc [here](https://gym.openai.com/docs/))


Gym contains lots of common environments (Pacman, cartpole etc.) that we can try. Go to [full list](https://gym.openai.com/envs/#classic_control) to check out more.

Install the dependencies from a notebook cell with `! pip install -r ./assignment1/requirements.txt`, which produces output like the following:


Collecting gym==0.10.9 (from -r ./assignment1/requirements.txt (line 1))
Collecting pyglet>=1.2.0 (from gym==0.10.9->-r ./assignment1/requirements.txt (line 1))
  Using cached https://files.pythonhosted.org/packages/6b/aa/121ad16b96b6141a04d781be38215581162031bb0410ccf15fc9a597e02f/pyglet-1.4.4-py2.py3-none-any.whl
Installing collected packages: pyglet, gym
Successfully installed gym-0.10.9 pyglet-1.4.4


We realise that over the 3 workshops we can only cover the very basics of RL and the most commonly used algorithms (DQN, policy gradient). If you are really interested, check out [this **open-source** Stanford course](http://web.stanford.edu/class/cs234/schedule.html). 
