-----------
# Outline of Notebook
- ### Reinforcement Learning
- ### Deep Reinforcement Learning
-----------

# Reinforcement Learning

<u>Intuition</u>
- To train a dog, you can't demonstrate the behavior you want. Instead, when it does something good, you say "Good Dog" and when it does something bad, you say "Bad Dog"
- You hope that the dog will learn to get better at doing the things that make you say "Good Dog"
- This is how reinforcement learning works
- Reinforcement learning is based on a reward function, which scores how well the agent is doing
- For example, to make a helicopter fly autonomously, you can reward it while it flies stably and penalize it when it crashes

<u>Return in Reinforcement Learning</u>

![](2022-07-29-22-49-31.png)
- Let's say you are using reinforcement learning to decide where a rover should go (State 1, 2, 3, 4, 5, or 6)
- The reward at State 1 is 100, the reward at State 6 is 40, and the reward at every other state is 0

Let's assume that the rover goes from State 4 to State 1. Steps taken:
- State 4 = 0 reward
- State 3 = 0 reward
- State 2 = 0 reward
- State 1 = 100 reward

Return = $0 + 0(0.9) + 0(0.9)^2 + 100(0.9)^3 = 72.9$ where $0.9$ = discount factor
- The discount factor makes rewards that arrive later worth less, so it penalizes the rover for taking longer to reach a reward. As an analogy, if you can get a \$5 bill right where you are or a \$10 bill by traveling 30 minutes away, you may stick with the \$5 bill because you don't want to travel. Likewise, reaching the same reward in fewer steps gives a higher return than reaching it in more steps.

<u>Generalized Return</u> 
$$\text{Return} = R_1 + \gamma R_2 + \gamma^2 R_3 + \ldots$$
where $R_i$ = reward received at step $i$ and $\gamma$ = discount factor ($\gamma < 1$)
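
The return formula can be sketched in a few lines of Python. This is a minimal illustration, not part of the original notes; the function name `discounted_return` is made up for this example, and the rewards are the ones from the rover path above.

```python
def discounted_return(rewards, gamma):
    """Sum the rewards, weighting the reward at 0-based step i by gamma**i."""
    return sum(r * gamma ** i for i, r in enumerate(rewards))


# Rover path from the example: State 4 -> 3 -> 2 -> 1
left_path = [0, 0, 0, 100]
# Alternative path for comparison: State 4 -> 5 -> 6
right_path = [0, 0, 40]

print(round(discounted_return(left_path, 0.9), 2))  # 72.9
print(round(discounted_return(right_path, 0.9), 2))
```

Comparing the two printed values shows which direction gives the higher return from State 4 for this discount factor.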

### Reinforcement Learning Formalism

What we want to do is find a function that takes in the state of the rover and returns what action we should take so as to maximize the return.

The function is called a policy and is written $\pi(s) = a$
- This function is saying that given a state, the policy tells us an action that maximizes the return

<u>State Action Value Function (Q-Function)</u>

$Q(s, a)$ = the return if you
- start in state $s$
- take action $a$ (once)
- then behave optimally after that (the Bellman equation below makes this precise)

![](2022-07-29-23-45-07.png)

Once you know the Q-Function, you can use it to pick actions:
- The best possible return from state $s$ is $\max_a Q(s, a)$, the maximum of the Q-Function over all actions that can be taken in state $s$
- The best possible action in state $s$ is the action $a$ that gives that maximum

![](2022-07-29-23-48-31.png)
- Here the optimal action from state 4 is to go left because $Q(4, \text{left})$ is greater than $Q(4, \text{right})$
- Similarly, the optimal action from state 5 is to go right
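
As a rough sketch of how a Q-Function is used to choose actions, the snippet below stores Q-values in a dictionary and takes the maximum over actions. The numbers are an assumption: they are the values that follow from the rewards above with $\gamma = 0.5$, and may not match the figure exactly. The names `best_return` and `best_action` are hypothetical.

```python
ACTIONS = ("left", "right")

# Assumed Q-values for the non-terminal rover states (gamma = 0.5,
# reward 100 at State 1 and 40 at State 6). Illustrative only.
Q = {
    (2, "left"): 50.0, (2, "right"): 12.5,
    (3, "left"): 25.0, (3, "right"): 6.25,
    (4, "left"): 12.5, (4, "right"): 10.0,
    (5, "left"): 6.25, (5, "right"): 20.0,
}


def best_return(q, state):
    """Best possible return from `state`: the max of Q over all actions."""
    return max(q[(state, a)] for a in ACTIONS)


def best_action(q, state):
    """Action that achieves the max of Q in `state`."""
    return max(ACTIONS, key=lambda a: q[(state, a)])


for s in (2, 3, 4, 5):
    print(s, best_action(Q, s), best_return(Q, s))
```

Under these assumed values, `best_action` reproduces the two observations above: left from state 4 and right from state 5.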

<u>Bellman Equation</u>

Notation:
- $s$ = current state
- $R(s)$ = reward of current state
- $a$ = current action
- $s'$ = state you get to after taking action $a$
- $a'$ = action that you take in state $s'$

As a reminder: 

$Q(s, a)$ = the return if you
- start in state $s$
- take action $a$ (once)
- then behave optimally after that

Bellman Equation: $$Q(s, a) = R(s) + \gamma\max_{a'}(Q(s', a'))$$
- This matches the definition of $Q$ above
- $R(s)$ is the reward you get right away
- The second term is the discounted return from behaving optimally starting from state $s'$
- Together, the two terms are another way of writing $R_1 + \gamma R_2 + \gamma^2 R_3 + \ldots$
- Think of it like recursion: $Q$ is defined in terms of itself, and if you keep substituting the equation for each future $Q$ into the previous one, you get $R_1 + \gamma R_2 + \gamma^2 R_3 + \ldots$
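
One way to see the recursion, as a sketch: write $s''$ for the state reached from $s'$ under the best action (the symbols $s''$ and $a''$ are introduced here only for illustration) and apply the Bellman equation to the inner $\max$ term repeatedly:

$$\begin{aligned}
Q(s, a) &= R(s) + \gamma \max_{a'} Q(s', a') \\
        &= R(s) + \gamma \left[ R(s') + \gamma \max_{a''} Q(s'', a'') \right] \\
        &= R(s) + \gamma R(s') + \gamma^2 R(s'') + \ldots
\end{aligned}$$

With $R_1 = R(s)$, $R_2 = R(s')$, and $R_3 = R(s'')$, the last line is the generalized return $R_1 + \gamma R_2 + \gamma^2 R_3 + \ldots$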

![](2022-07-30-00-02-58.png)

Here, let's say we want to calculate $Q(2, \text{right})$. Going right from state 2 leads to $s' = 3$.

$R(2) = 0$

$\gamma = 0.5$ (in this example)

$\max_{a'}(Q(3, a')) = 25$

Thus, $Q(2, \text{right}) = 0 + 0.5(25) = 12.5$, which is the Q value we derived earlier for $Q(2, \text{right})$
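
The same calculation can be done for every state and action by applying the Bellman equation repeatedly until the values stop changing. The sketch below assumes that States 1 and 6 are terminal (so $Q$ there is just the reward), which is not stated explicitly in these notes; the helper names are hypothetical.

```python
GAMMA = 0.5
REWARDS = {1: 100, 2: 0, 3: 0, 4: 0, 5: 0, 6: 40}
TERMINAL = {1, 6}  # assumption: the episode ends in these states
ACTIONS = ("left", "right")


def next_state(state, action):
    """The rover moves one state to the left or to the right."""
    return state - 1 if action == "left" else state + 1


def compute_q(gamma=GAMMA, sweeps=50):
    """Repeatedly apply Q(s, a) = R(s) + gamma * max_a' Q(s', a')."""
    q = {(s, a): 0.0 for s in REWARDS for a in ACTIONS}
    for _ in range(sweeps):
        for s in REWARDS:
            for a in ACTIONS:
                if s in TERMINAL:
                    q[(s, a)] = float(REWARDS[s])
                else:
                    s2 = next_state(s, a)
                    best_next = max(q[(s2, b)] for b in ACTIONS)
                    q[(s, a)] = REWARDS[s] + gamma * best_next
    return q


q = compute_q()
print(q[(2, "right")])  # 12.5
```

The fixed number of sweeps is a simplification; for this small problem the values settle after a handful of passes.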

### Continuous States vs. Discrete States

In the above examples, we were using discrete states because the state could only be one of 6 possible values for the rover. 

However, more often you will need several variables to represent the state, and those variables will be continuous. For example, if you are trying to program an autonomous truck, its state is a vector containing its x-location, y-location, x-speed, y-speed, the angle at which it is turned, and the rate of change of that angle. All these variables are continuous because they can take on any real value, not just one of a few values such as 1, 2, or 3.

In terms of the algorithm, however, everything else is the same.
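
To make the idea of a continuous state vector concrete, here is a small sketch. The class name `TruckState`, its field names, and the sample numbers are all made up for illustration; they simply mirror the six quantities listed above.

```python
from dataclasses import dataclass, astuple


@dataclass
class TruckState:
    """Hypothetical continuous state of the truck (all fields are floats)."""
    x: float
    y: float
    x_speed: float
    y_speed: float
    angle: float
    angular_velocity: float

    def as_vector(self):
        """Flatten the state into a plain list of numbers."""
        return list(astuple(self))


state = TruckState(x=1.5, y=-2.0, x_speed=0.3, y_speed=0.0,
                   angle=0.1, angular_velocity=-0.02)
print(state.as_vector())
```

Unlike the rover, which is in exactly one of six states, each field here can take any real value.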

# Deep Reinforcement Learning

![](2022-07-30-00-55-48.png)
- The input to the neural network is a vector containing the state of the object and the action that is taken
- The neural network then outputs the Q value for that specific state and action
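
The input/output shape described above can be sketched with a tiny, untrained network in plain Python. Everything here is an assumption for illustration: the layer sizes, the one-hot encoding of the action, the ReLU activations, and the random weights are not taken from the figure, and the printed number has no meaning until the network is trained.

```python
import random


def one_hot(index, size):
    """Encode an action index as a vector with a single 1.0."""
    return [1.0 if i == index else 0.0 for i in range(size)]


def init_layer(n_in, n_out, rng):
    """Create random weights and zero biases for one dense layer."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out


def dense(layer, x, relu):
    """Apply one dense layer, optionally followed by ReLU."""
    weights, biases = layer
    out = [sum(w * xi for w, xi in zip(row, x)) + b
           for row, b in zip(weights, biases)]
    return [max(0.0, v) for v in out] if relu else out


def q_network(layers, state, action_index, n_actions):
    """Map (state, action) to a single Q estimate."""
    x = list(state) + one_hot(action_index, n_actions)
    for layer in layers[:-1]:
        x = dense(layer, x, relu=True)
    return dense(layers[-1], x, relu=False)[0]


rng = random.Random(0)
STATE_SIZE, N_ACTIONS, HIDDEN = 6, 2, 8
layers = [
    init_layer(STATE_SIZE + N_ACTIONS, HIDDEN, rng),
    init_layer(HIDDEN, HIDDEN, rng),
    init_layer(HIDDEN, 1, rng),
]
state = [1.5, -2.0, 0.3, 0.0, 0.1, -0.02]
print(q_network(layers, state, action_index=0, n_actions=N_ACTIONS))
```

To pick an action with such a network, you would evaluate it once per action and take the action with the largest output.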

### How to train the Neural Network

![](2022-07-30-01-10-02.png)
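
The notes give only a figure for this step. One common approach, sketched here under stated assumptions rather than as a transcription of the figure, is to turn recorded experiences into supervised training pairs whose targets come from the Bellman equation: $y = R(s) + \gamma \max_{a'} Q(s', a')$. The function name `bellman_targets`, the tuple layout, and the `done` flag (marking experiences after which the episode ends) are hypothetical.

```python
def bellman_targets(experiences, q_estimate, actions, gamma):
    """Build (input, target) pairs from (s, a, reward, s_next, done) tuples.

    q_estimate(state, action) is the current guess of the Q-Function,
    for example the output of a neural network.
    """
    inputs, targets = [], []
    for s, a, reward, s_next, done in experiences:
        if done:
            # Assumption: no future return once the episode has ended.
            y = reward
        else:
            y = reward + gamma * max(q_estimate(s_next, b) for b in actions)
        inputs.append((s, a))
        targets.append(y)
    return inputs, targets


# Stand-in for a network that currently guesses 25 everywhere.
def constant_guess(state, action):
    return 25.0


experiences = [
    (2, "right", 0, 3, False),
    (1, "left", 100, 1, True),
]
x, y = bellman_targets(experiences, constant_guess, ("left", "right"), 0.5)
print(x, y)
```

A network would then be fit to map each input `(s, a)` to its target `y`, and the process repeated with the improved estimate.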

### Refinements to Algorithm

<u>Changing Architecture of Neural Network</u>

![](2022-07-30-01-12-37.png)