## 1. Reinforcement Learning

### 1A. What?

Given state $s$, determine action $a$

- We could use supervised learning, but it is extremely difficult in practice: there are many possible states in real time, and the correct action for a given state is often ambiguous.

The key to reinforcement learning is learning through a **reward function**.

Eg: Autonomous helicopter
- Positive reward: Helicopter flying well (+1)
- Negative reward: Helicopter flying poorly (-1000)

- We can't explicitly program which actions are "good"; the machine figures them out itself through the rewards it receives

**Applications**:
1. Controlling robots
2. Factory optimisation
3. Financial (stock) trading
4. Playing games (including video games)

### 1B. Mars Rover Example

- State $s$
- Action $a$
- Reward from state $R(s)$
- New state $s'$
- At every time step - $(s, a, R(s), s')$
- Terminal state: the program receives the reward for that state and then terminates

Eg: $(4, \leftarrow, 0, 3)$: in state 4, go left, receive reward 0, and arrive in state 3





### 1C. Return in Reinforcement Learning

- Calculated using a discount factor $\gamma$

$$\text{Return} = R_1 + \gamma R_2 + \gamma^2 R_3 + ... \text{ until terminal state}$$
$$\text{Discount Factor }\gamma = 0.9$$

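- Makes the reinforcement learning algorithm "impatient": rewards received sooner are discounted less, so they contribute more to the return
- Low $\gamma$ (close to 0) = very heavily discounts rewards in the future; high $\gamma$ (close to 1) = future rewards count almost as much as immediate ones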
- Similar concept to Time Value of money
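
A minimal sketch of the return formula above, assuming the rewards are given as a plain list in the order they are received. The reward sequence is hypothetical and only illustrates how a smaller $\gamma$ shrinks a delayed reward.

```python
def discounted_return(rewards, gamma=0.9):
    """Sum the rewards, weighting the reward at step t by gamma**t (t starts at 0)."""
    total = 0.0
    for t, reward in enumerate(rewards):
        total += (gamma ** t) * reward
    return total


# Hypothetical sequence: three steps with no reward, then 100 at the terminal state.
rewards = [0, 0, 0, 100]
print(discounted_return(rewards, gamma=0.9))  # roughly 72.9
print(discounted_return(rewards, gamma=0.5))  # 12.5
```

The same delayed reward of 100 is worth far less with $\gamma = 0.5$ than with $\gamma = 0.9$.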

Eg:
<img src= ./images/Wk10_1.png style="width:70%; padding:10px, 20px;">

### 1D. Policies in Reinforcement Learning

- Policy $\pi$
- Function $\pi(s) = a$, takes state and maps to an action, tells you what action $a$ to take in given state  $s$.
- Eg:
    * $\pi(s) = a$
    * $\pi(2) = \leftarrow$
    * $\pi(3) = \leftarrow$
    * $\pi(4) = \leftarrow$
    * $\pi(5) = \rightarrow$
    
- **Goal** - Find policy $\pi$ that maps each state to an action in which **RETURN** is **MAXIMIZED**.
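
A policy can be sketched as a simple lookup table. The snippet below follows the example policy above and computes the return it earns from a given start state. The environment is an assumption for illustration only: six states on a line, terminal states 1 and 6, rewards of 100 and 40 there and 0 elsewhere, and $\gamma = 0.5$.

```python
# Assumed environment (not given in the text): states 1..6 on a line.
REWARDS = {1: 100, 2: 0, 3: 0, 4: 0, 5: 0, 6: 40}  # assumed reward values
TERMINAL = {1, 6}
LEFT, RIGHT = "<-", "->"

# The example policy from the notes: pi(s) = a
policy = {2: LEFT, 3: LEFT, 4: LEFT, 5: RIGHT}


def next_state(s, a):
    """Deterministic move: one step left or right along the line."""
    return s - 1 if a == LEFT else s + 1


def rollout_return(start, policy, gamma):
    """Follow the policy from `start` and accumulate the discounted rewards."""
    s, total, discount = start, 0.0, 1.0
    while True:
        total += discount * REWARDS[s]
        if s in TERMINAL:
            return total
        s = next_state(s, policy[s])
        discount *= gamma


print(rollout_return(4, policy, gamma=0.5))  # 12.5
print(rollout_return(5, policy, gamma=0.5))  # 20.0
```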
     

<img src= ./images/Wk10_2.png style="width:70%; padding:10px, 20px;">

### 1E. Markov Decision Process
- Key idea: the future depends only on the current state, and not on how the machine got to the current state

<img src= ./images/Wk10_3.png style="width:70%; padding:10px, 20px;">

## 2. State-action value function

- Key quantity for developing a reinforcement learning algorithm
- Denoted by $Q(s, a)$ = return if you
    * start in state $s$
    * take action $a$ (once)
    * behave optimally after that (might be circular and hard to understand)

<img src= ./images/Wk10_4.png style="width:70%; padding:10px, 20px;">

### 2A. Picking actions

- Choose action which has highest Q-value (duh)
- Best possible **RETURN** from state $s$ is $\max_{a}{Q(s,a)}$
- Best action in state $s$ is action $a$ that gives $\max_a {Q(s,a)}$
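
As a small sketch, picking an action is just an argmax over the Q-values of the current state. The Q-values below are hypothetical numbers, not taken from the figures.

```python
def best_action(q_values):
    """Return (action, value) for the highest Q-value; q_values maps action -> Q(s, a)."""
    action = max(q_values, key=q_values.get)
    return action, q_values[action]


# Hypothetical Q(s, a) values for a single state s
q_for_state = {"left": 12.5, "right": 10.0}
action, value = best_action(q_for_state)
print(action, value)  # left 12.5
```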

### 2B. Bellman Equation

- **MOST IMPORTANT EQUATION** in Reinforcement Learning
- Compute state-action value / $Q(s, a)$

- $s$: current state
- $R(s)$: reward of current state
- $a$: current action
- $s'$: state you get to after taking action $a$
- $a'$: action you take in state $s'$

$$Q(s,a) = R(s) + \gamma \max_{a'}Q(s',a')$$
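
One way to see the equation in action is to apply it repeatedly to a small table until the values stop changing. The sketch below does this for a six-state line world. The rewards (100 in state 1, 40 in state 6, 0 elsewhere), the terminal states, and $\gamma = 0.5$ are assumptions chosen so the numbers are easy to check by hand.

```python
# Assumed environment (not stated in the text): states 1..6 on a line,
# terminal states 1 and 6, rewards 100 and 40 there and 0 elsewhere.
REWARDS = {1: 100.0, 2: 0.0, 3: 0.0, 4: 0.0, 5: 0.0, 6: 40.0}
TERMINAL = {1, 6}
ACTIONS = {"left": -1, "right": +1}
GAMMA = 0.5  # assumed discount factor


def compute_q(num_sweeps=50):
    """Repeatedly apply Q(s,a) = R(s) + gamma * max_a' Q(s',a') to a table."""
    q = {s: {a: 0.0 for a in ACTIONS} for s in REWARDS}
    for _ in range(num_sweeps):
        for s in REWARDS:
            for a, step in ACTIONS.items():
                if s in TERMINAL:
                    q[s][a] = REWARDS[s]  # nothing follows a terminal state
                else:
                    s_next = s + step
                    q[s][a] = REWARDS[s] + GAMMA * max(q[s_next].values())
    return q


q = compute_q()
for s in sorted(q):
    print(s, q[s])
```

For example, $Q(4, \leftarrow) = 0 + 0.5 \cdot \max_{a'} Q(3, a') = 0.5 \cdot 25 = 12.5$ under these assumptions.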


<img src= ./images/Wk10_5.png style="width:70%; padding:10px, 20px;">

### 2C. Stochastic (Random) Environment

- Include probability of accidentally doing the wrong action
- Eg: going left instead of going right
- In this case, we want to maximise the average value of the sum of discounted rewards (expected return), instead of return

`misstep_prob = 0.4`

 
$$\text{Expected Return} = E[R_1 + \gamma R_2 + \gamma^2 R_3 + \gamma^3 R_4 + ... ]$$

 
$$Q(s,a) = R(s) + \gamma E[\max_{a'}Q(s',a')]$$
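
A sketch of the expected-value version, assuming the same six-state line world as a toy environment (rewards 100 and 40 at the two ends, $\gamma = 0.5$; these are illustrative assumptions). With probability `misstep_prob` the agent moves in the opposite direction, so the expectation is a weighted average of the two possible next states.

```python
# Assumed toy environment: states 1..6, terminal states 1 and 6.
REWARDS = {1: 100.0, 2: 0.0, 3: 0.0, 4: 0.0, 5: 0.0, 6: 40.0}
TERMINAL = {1, 6}
ACTIONS = {"left": -1, "right": +1}


def compute_expected_q(misstep_prob=0.4, gamma=0.5, num_sweeps=100):
    """Apply Q(s,a) = R(s) + gamma * E[max_a' Q(s',a')] until values settle."""
    q = {s: {a: 0.0 for a in ACTIONS} for s in REWARDS}
    for _ in range(num_sweeps):
        for s in REWARDS:
            for a, step in ACTIONS.items():
                if s in TERMINAL:
                    q[s][a] = REWARDS[s]
                    continue
                intended = max(q[s + step].values())  # move succeeds
                slipped = max(q[s - step].values())   # move goes the wrong way
                expected_next = (1 - misstep_prob) * intended + misstep_prob * slipped
                q[s][a] = REWARDS[s] + gamma * expected_next
    return q


q = compute_expected_q(misstep_prob=0.4)
for s in sorted(q):
    print(s, q[s])
```

Setting `misstep_prob=0` recovers the deterministic case, and a larger value lowers the Q-values because the agent cannot reliably reach the high-reward state.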


## 3. Continuous State Spaces

- Mars rover example with 6 different states represents **DISCRETE state spaces**
- Continuous (Mars Rover):
$$s = \begin{bmatrix}
    x \\
    y \\
    \theta \text{ (0-360}^\circ\text{)}\\
    \dot{x} \\
    \dot{y} \\
    \dot{\theta} \\
\end{bmatrix}$$

**Autonomous Helicopter**
$$s = \begin{bmatrix}
    x \\
    y \\
    z \\
    \phi \text{ (roll)}\\
    \theta \text{ (pitch)}\\
    \omega \text{ (yaw)}\\ 
    \dot{x} \\
    \dot{y} \\
    \dot{z} \\
    \dot{\phi} \\
    \dot{\theta} \\
    \dot{\omega} \\
\end{bmatrix}$$

### 3A. Lunar Lander

$$s = \begin{bmatrix}
    x \\
    y \\
    \dot{x} \\
    \dot{y} \\
    \theta\\
    \dot{\theta} \\
    l (0,1)\\
    r (0,1)\\
\end{bmatrix}$$

$$a\text{ (actions)} = \begin{bmatrix}
    \text{do nothing} \\
    \text{left thruster} \\
    \text{main thruster} \\
    \text{right thruster} \\
\end{bmatrix}$$
<img src= ./images/Wk10_6.png style="width:70%; padding:10px, 20px;">

$$\text{Policy}(\pi) - \text{pick action $a$ = $\pi(s)$ so as to maximise return}$$
$$\gamma =  0.985$$


### 3B. Deep Reinforcement Learning (State-action value function)
- Train a neural network to approximate the state-action value function $Q(s,a)$
<img src= ./images/Wk10_7.png style="width:70%; padding:10px, 20px;">

How do you train a neural network to output $Q(s,a)$?

General approach:
* Use the Bellman equation to create a large training set ($x \to y$)
* Use supervised learning to learn the mapping from $x$ (state-action pair) to $y$ (target value for $Q(s,a)$)
* Try many different actions in the environment to collect results and rewards as tuples $(s,a,R(s),s')$

#### Bellman Equation

<img src= ./images/Wk10_8.png style="width:70%; padding:10px, 20px;">

#### Learning Algorithm
1. Initialise the neural network randomly as an initial guess of $Q(s,a)$
2. Repeat {
    * Take actions in the lunar lander. Get $(s,a,R(s),s')$
    * Store the 10,000 most recent $(s,a,R(s),s')$ tuples (replay buffer)
    * Train neural network:
        * Create training set of 10,000 examples using
        
        $x = (s,a)$ and $y = R(s) + \gamma \max_{a'}Q(s',a')$
        
        * Train $Q_{new}$ such that $Q_{new}(s,a)\approx y$
    * Set $Q = Q_{new}$
    
    }
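
The data side of the loop above can be sketched in plain Python. This is an illustration under stated assumptions, not a full training loop: `q` is a hypothetical callable standing in for the current network, actions are encoded as the integers 0 to 3, and terminal-state handling is left out for brevity. The default `gamma` is the 0.985 quoted in the Lunar Lander section.

```python
from collections import deque

ACTIONS = (0, 1, 2, 3)  # do nothing, left thruster, main thruster, right thruster


def make_replay_buffer(capacity=10_000):
    """A deque with maxlen keeps only the most recent `capacity` tuples."""
    return deque(maxlen=capacity)


def build_training_set(buffer, q, gamma=0.985):
    """Turn stored (s, a, R(s), s') tuples into inputs x and Bellman targets y.

    `q(state, action)` is a hypothetical stand-in for the current Q estimate.
    """
    xs, ys = [], []
    for s, a, r, s_next in buffer:
        target = r + gamma * max(q(s_next, a_next) for a_next in ACTIONS)
        xs.append((s, a))
        ys.append(target)
    return xs, ys


# Usage with a dummy Q estimate that always returns 0
buffer = make_replay_buffer()
buffer.append(("s0", 2, -1.0, "s1"))
xs, ys = build_training_set(buffer, q=lambda s, a: 0.0)
print(xs, ys)  # [('s0', 2)] [-1.0]
```

A supervised learner would then be fitted so that $Q_{new}(s,a) \approx y$ on these pairs.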

### 3C. Algorithm refinement (Improved neural network architecture)

<img src= ./images/Wk10_7.png style="width:50%; padding:10px, 20px; float: left;">

<img src= ./images/Wk10_9.png style="width:50%; padding:10px, 20px; float: left;">

Old: Inference must be run 4 separate times per state (once for each action), which is inefficient  
New: A single forward pass outputs the Q-values for all 4 actions at once, so the max can be computed efficiently
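
A rough sketch of the "new" architecture in plain Python: the network takes only the state and returns one Q-value per action, so a single forward pass is enough to pick the best action. The layer sizes, the random initialisation, and the helper names are assumptions for illustration.

```python
import random


def linear(weights, biases, x):
    """One dense layer without activation: returns W x + b."""
    return [sum(w * x_i for w, x_i in zip(row, x)) + b
            for row, b in zip(weights, biases)]


def relu(v):
    return [max(0.0, v_i) for v_i in v]


def q_all_actions(params, state):
    """Single forward pass: state in, one Q-value per action out."""
    hidden = relu(linear(params["W1"], params["b1"], state))
    return linear(params["W2"], params["b2"], hidden)


# Assumed sizes: 8 state features, 16 hidden units, 4 actions
random.seed(0)
STATE_DIM, HIDDEN, NUM_ACTIONS = 8, 16, 4
params = {
    "W1": [[random.gauss(0, 0.1) for _ in range(STATE_DIM)] for _ in range(HIDDEN)],
    "b1": [0.0] * HIDDEN,
    "W2": [[random.gauss(0, 0.1) for _ in range(HIDDEN)] for _ in range(NUM_ACTIONS)],
    "b2": [0.0] * NUM_ACTIONS,
}

state = [0.1] * STATE_DIM
q_values = q_all_actions(params, state)
greedy_action = max(range(NUM_ACTIONS), key=lambda a: q_values[a])
print(len(q_values), greedy_action)
```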

### 3D. Algorithm refinement ($\epsilon$-greedy policy)

- Common way to take actions while still learning.

- **Approach**: In some state $s$,

    - Option 1 (not as good): Always pick the action that maximizes $Q(s,a)$, even if the estimate is not great yet. (Greedy, "Exploitation")
    - Option 2: Pick the action that maximizes $Q(s,a)$ most of the time (e.g. 95%), but with a small probability (e.g. 5%) pick an action randomly. This guards against a poor initialisation of $Q(s,a)$ under which some good actions would otherwise never be tried.
    - Picking actions randomly is sometimes called an "exploration step".
- $\epsilon$ is the probability of taking an action randomly (e.g. 0.05).
- **Trick**: Start with epsilon high and gradually decrease it over time, so that eventually you're taking greedy actions most of the time.
- Epsilon-greedy exploration allows the agent to balance between exploring new actions and exploiting the actions that it currently believes to be the best.
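
A minimal sketch of the approach, assuming the Q-values for the current state are available as a list indexed by action. The decay schedule and its parameters (`start`, `end`, `decay`) are hypothetical choices that illustrate the "start high, decrease over time" trick.

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """With probability epsilon explore (random action), otherwise exploit (argmax)."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))  # exploration step
    return max(range(len(q_values)), key=lambda a: q_values[a])  # greedy step


def decayed_epsilon(step, start=1.0, end=0.01, decay=0.995):
    """Shrink epsilon geometrically with the step count, never going below `end`."""
    return max(end, start * decay ** step)


q_values = [0.2, 1.5, -0.3, 0.9]  # hypothetical Q(s, a) for the 4 actions
for step in (0, 100, 1000):
    eps = decayed_epsilon(step)
    print(step, round(eps, 3), epsilon_greedy(q_values, eps))
```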


### 3E. Algorithm refinement (Mini-batch and Soft updates)

#### Mini-batches

- Mini batch gradient descent - refinement to the gradient descent algorithm that is used to **speed up both supervised and reinforcement learning** algorithms.
- The idea is to **use a subset of the training examples**, called a mini batch, rather than the entire dataset on each iteration of the algorithm.
- This **makes each iteration** of the algorithm **faster and more efficient**, particularly when dealing with large datasets.
- In reinforcement learning, this could involve using a subset of the stored tuples in the replay buffer to train the neural network.
<img src= ./images/Wk10_10.png style="width:49%; padding:10px, 100px; float: left;">

<img src= ./images/Wk10_11.png style="width:49%; padding:10px, 20px; ">
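
For the replay buffer, a mini-batch can be sketched as a random sample of the stored tuples. The batch size of 64 and the dummy tuples are assumptions for illustration.

```python
import random
from collections import deque


def sample_minibatch(buffer, batch_size, rng=random):
    """Draw a random subset of stored tuples, without replacement."""
    batch_size = min(batch_size, len(buffer))  # buffer may not be full yet
    return rng.sample(list(buffer), batch_size)


# Dummy replay buffer holding 10,000 placeholder (s, a, R(s), s') tuples
buffer = deque(((i, i % 4, 0.0, i + 1) for i in range(10_000)), maxlen=10_000)
batch = sample_minibatch(buffer, batch_size=64)
print(len(batch))  # 64
```

Each training step then touches 64 tuples rather than all 10,000.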

#### Soft Updates
Problem: Setting $Q = Q_{new}$ after training on mini-batches makes an abrupt change, so a noisy or unlucky $Q_{new}$ can overwrite $Q$ with something worse (higher error).

- Soft updates are another refinement to the reinforcement learning algorithm that helps it converge to a good solution.
$$w = 0.01w_{new} + 0.99w$$
$$b = 0.01b_{new} + 0.99b$$
- The idea is to update the target Q-network slowly, rather than copying the weights of the policy network to the target Q-network all at once.
- This helps to stabilize the training process and makes the algorithm less likely to oscillate or diverge.
- Soft updates are particularly useful in deep reinforcement learning, where unstable training is a common problem.
- Soft update method helps to prevent $Q_{new}$ from becoming worse than the previous $Q$ as it makes a more gradual change to the neural network parameters $w$ and $b$.
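
The update rule can be sketched for parameters stored as flat lists of floats. The function name and the `tau` argument are illustrative; `tau=0.01` matches the 0.01 / 0.99 split in the equations above.

```python
def soft_update(current, new, tau=0.01):
    """Blend parameters: tau * new + (1 - tau) * current, element by element."""
    return [tau * p_new + (1.0 - tau) * p_cur for p_cur, p_new in zip(current, new)]


w = [1.0, -2.0, 0.5]      # hypothetical current weights
w_new = [0.0, 0.0, 0.0]   # hypothetical freshly trained weights
w = soft_update(w, w_new)
print(w)  # each weight moves only 1% of the way toward w_new
```

With `tau=1.0` this reduces to the hard update $Q = Q_{new}$.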

## 4. State of Reinforcement Learning

- Fewer applications than supervised and unsupervised learning.
- Reinforcement learning is easier to get to work in simulated environments than in the real world, especially on real robots.
- Odds of using supervised or unsupervised learning are higher than using reinforcement learning in practical applications.
- Remains one of the major pillars of machine learning and has a lot of potential for future applications.