# Workshop RL01: Introduction to Reinforcement Learning

## Motivation

So far we have learned about supervised learning, unsupervised learning, and deep learning. It's probably a good time to stop and think about the fundamental challenge of machine learning and artificial intelligence. Quoting reinforcement learning (RL) professor Emma Brunskill from Stanford: "Fundamental challenge in artificial intelligence and machine learning is 


**<center>learning to make good decisions under uncertainty".</center>**


If we break this sentence down into pieces, we can see that we need to address the following aspects:
- "learning": no advance knowledge, so we have to learn from experience
- "good decisions": need some measure of decision quality, and a way to optimize that measure
- "uncertainty": need to explore different possibilities to gain experience

And RL is all about making **sequential decisions under uncertainty**, which involves:  

- **optimization**: yield the best decisions
- **generalization**: generalise experience for decision-making in unprecedented situations  
- **delayed consequences**: account for decisions made now that can impact things much later 
- **exploration**: interact with the world through decision-making and learn what's the best decision  

As a comparison with other AI methods:

|Comparison|AI planning|Supervised ML|Unsupervised ML|Imitation learning| 
|:------:|:---------:|:-----------:|:-------------:|:----------------:|
|optimization| $\checkmark$ | $\checkmark$ |$\checkmark$| $\checkmark$| 
|generalization|$\checkmark$ |$\checkmark$ |$\checkmark$ |$\checkmark$ |
|delayed consequences|$\checkmark$ | - | - |$\checkmark$ |
|exploration| - | - | - | - |
|how it learns|learns from models of how decisions impact results|learns from experience/data|learns from experience/data|learns from the experience of other intelligent agents, e.g. humans|


Some successful applications of RL: games, robotics, healthcare, and other areas of ML (NLP, CV), among others.





## The fundamentals

So how does RL make sequential decisions? The answer should be pretty obvious: through a loop.


<img src = 'SDP.png'>

This is known as a **sequential decision process**. At each time step $t$:
- **agent** uses data up to time $t$ and takes action $a_t$
- **world** emits observation $o_t$ and reward $r_t$, received by agent
- data are stored in **history**: $h_t = (a_1,o_1,r_1,...,a_t,o_t,r_t)$


|Examples|Action|Observation|Reward|
|:------:|:----:|:---------:|:----:|
|web ad|choose web ad|view time|click on ad|
|blood pressure control|exercise or medication|blood pressure|within healthy range|
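
The loop above can be sketched in a few lines of Python. This is only a minimal illustration based on the web ad row of the table: the `AdWorld` and `AverageRewardAgent` classes, the click probabilities, and the view-time range are all made-up assumptions, not part of any library or of the course material.

```python
import random


class AdWorld:
    """Hypothetical world: shows an ad, emits a view time and a click reward."""

    CLICK_PROB = {0: 0.2, 1: 0.6}  # assumed click rates, unknown to the agent

    def __init__(self, seed=0):
        self.rng = random.Random(seed)

    def step(self, action):
        clicked = self.rng.random() < self.CLICK_PROB[action]
        observation = round(self.rng.uniform(1.0, 10.0), 1)  # view time (seconds)
        reward = 1.0 if clicked else 0.0
        return observation, reward


class AverageRewardAgent:
    """Hypothetical agent: mostly picks the ad with the best average reward so far."""

    def __init__(self, n_actions=2, epsilon=0.1, seed=1):
        self.n_actions = n_actions
        self.epsilon = epsilon
        self.rng = random.Random(seed)

    def act(self, history):
        # Explore while some action is untried, and otherwise with probability epsilon.
        tried = {a for a, _, _ in history}
        if len(tried) < self.n_actions or self.rng.random() < self.epsilon:
            return self.rng.randrange(self.n_actions)

        # Exploit: pick the action with the highest average reward in the history.
        def mean_reward(a):
            rewards = [r for a2, _, r in history if a2 == a]
            return sum(rewards) / len(rewards)

        return max(range(self.n_actions), key=mean_reward)


def run_episode(agent, world, n_steps=200):
    history = []  # h_t = (a_1, o_1, r_1, ..., a_t, o_t, r_t)
    for _ in range(n_steps):
        action = agent.act(history)               # agent uses the data so far
        observation, reward = world.step(action)  # world emits o_t and r_t
        history.append((action, observation, reward))
    return history


history = run_episode(AverageRewardAgent(), AdWorld())
total_reward = sum(r for _, _, r in history)
print(f"steps: {len(history)}, total reward: {total_reward}")
```

Because both the world and the agent are random, the total reward changes with the seeds, which is one reason the goal is stated in terms of *expected* reward.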

Our goal is to maximise total expected (why expected?) future rewards, which may require balancing immediate and long-term rewards, as well as strategic behaviour to achieve high rewards. 

Terminology:
- **agent**: an intelligent subject that can take actions
- **world**: the environment the agent operates in, which produces observations and rewards in response to the agent's actions
- **state**: the information assumed to determine what happens next
- **world state**: representation of how the world changes; the true state of the world is often unknown to the agent, so we model it with limited data (why?)
- **agent state**: the information the agent uses to make decisions, generally some function of the history, i.e. $s_t = f(h_t)$; it could also include metadata such as how many computations have been executed and how many decisions are left
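
To make $s_t = f(h_t)$ concrete, here are two possible choices of $f$, sketched under the assumption that the history is stored as a list of `(action, observation, reward)` tuples. The function names and the blood-pressure-style numbers are hypothetical and only for illustration.

```python
def last_k_observations(history, k=2):
    """One possible f: the agent state is the k most recent observations."""
    return tuple(o for _, o, _ in history[-k:])


def summary_with_metadata(history, max_steps):
    """Another possible f: last observation plus metadata (decisions left)."""
    last_obs = history[-1][1] if history else None
    return {"last_obs": last_obs, "steps_left": max_steps - len(history)}


# Assumed history: (action, observation, reward), e.g. blood pressure readings.
h = [(0, 120, 0.0), (1, 135, 0.0), (1, 118, 1.0)]

print(last_k_observations(h, k=2))              # (135, 118)
print(summary_with_metadata(h, max_steps=10))   # {'last_obs': 118, 'steps_left': 7}
```

Which $f$ is appropriate depends on the problem: a short window keeps the state small, but it may discard information that the agent needs.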

## RL components
An RL algorithm often contains one or more of:
- **model**: mathematical models of dynamics and rewards
    - agent's representation of how the world changes in response to agent's action, e.g.:
    - transition/dynamics model that predicts $p(s_{t+1}|s_t,a_t)$
    - reward model that determines rewards based on action and/or states $R(s_t=s,a_t=a)=E \lbrack r_t|s_t,a_t \rbrack$
    - an agent with an explicit model may or may not also have a policy and/or value function
- **policy**: functions mapping agent's states to actions 
    - determines agent's actions by some function $\pi$, e.g.:
    - deterministic policy: $a = \pi(s)$
    - stochastic policy: $p(a_t=a|s_t=s)=\pi(a|s)$
- **value function**: expected (discounted) future rewards:
    - if we start in state $s$, value function is defined as:
    - $V(s_t=s)=E\lbrack r_t+\gamma r_{t+1}+\gamma^2r_{t+2}+...|s_t=s \rbrack$, where
    - $\gamma$ is the discount factor (if $\gamma<1$, we place more weight on near-term rewards than on distant ones) 
    - can be used to quantify goodness and badness of states and actions
    - can compare different policies
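
The three components can be put side by side on a tiny example. The sketch below assumes a made-up two-state world (the states, actions, probabilities, and rewards are all hypothetical) and computes the value function of a fixed stochastic policy. It relies on the recursive form of the definition above, $V(s)=E\lbrack r_t+\gamma V(s_{t+1})|s_t=s \rbrack$, which follows from factoring $\gamma$ out of all terms after $r_t$; repeatedly applying it is commonly called iterative policy evaluation.

```python
GAMMA = 0.9
STATES = ["low", "high"]    # hypothetical states
ACTIONS = ["wait", "work"]  # hypothetical actions

# Transition model p(s' | s, a): P[s][a] maps next state -> probability.
P = {
    "low":  {"wait": {"low": 1.0},              "work": {"low": 0.3, "high": 0.7}},
    "high": {"wait": {"low": 0.5, "high": 0.5}, "work": {"high": 1.0}},
}

# Reward model R(s, a) = E[r_t | s_t = s, a_t = a].
R = {
    "low":  {"wait": 0.0, "work": -1.0},
    "high": {"wait": 2.0, "work": 1.0},
}

# Stochastic policy pi(a | s).
PI = {
    "low":  {"wait": 0.1, "work": 0.9},
    "high": {"wait": 0.5, "work": 0.5},
}


def evaluate_policy(P, R, pi, gamma, tol=1e-10):
    """Repeat V(s) <- E[r + gamma * V(s')] under pi until the values stop changing."""
    V = {s: 0.0 for s in P}
    while True:
        new_V = {}
        for s in P:
            new_V[s] = sum(
                pi[s][a] * (R[s][a] + gamma * sum(p * V[s2] for s2, p in P[s][a].items()))
                for a in P[s]
            )
        if max(abs(new_V[s] - V[s]) for s in P) < tol:
            return new_V
        V = new_V


V = evaluate_policy(P, R, PI, GAMMA)
for s in STATES:
    print(f"V({s}) = {V[s]:.3f}")
```

Running the same function with a different `PI` gives a different `V`, which is how value functions can be used to compare policies.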



    
By choosing and combining these components, we have different types of agents:

<img src='agents.png'>

For the coding part, we're going to use the Stanford reinforcement learning course assignment: http://web.stanford.edu/class/cs234/assignment1/index.html

We realised that over the 3 workshops we can only cover the very basics of RL and the most commonly used algorithms (DQN, policy gradient). If you are really interested, challenge yourself and try [this **open-source** Stanford course](http://web.stanford.edu/class/cs234/schedule.html). 
