# Discrete vs Continuous State

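Let's say we are controlling a helicopter. In this case the state is a vector of 12 continuous values: the position _x, y, z_, the orientation _roll, pitch, yaw_ ($\phi, \theta, \omega$), and the velocity of each of these six quantities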
$$s = \begin{bmatrix} x \\ y \\ z \\ \phi \\ \theta \\ \omega \\ \dot{x} \\ \dot{y} \\ \dot{z} \\ \dot{\phi} \\ \dot{\theta} \\ \dot{\omega} \end{bmatrix}$$

## Lunar Lander

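__The state of our system is__
$$state = \begin{bmatrix} x \\ y \\ \theta \\ \dot{x} \\ \dot{y} \\ \dot{\theta} \\ l \\ r\end{bmatrix}$$
where $l$ and $r$ are binary values that indicate whether the left and right legs are touching the ground.

__The actions will be:__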
1) do nothing
2) left thruster
3) main thruster
4) right thruster

__Reward function:__
- Getting to landing pad: 100 to 140
- Additional reward for moving towards/away from pad
- Crash: -100
- Soft landing: 100
- Leg grounded: 10
- Fire main engine: -0.3
- Fire side thruster: -0.03

## Learning the state-action value function

In a state $s$, use a neural network to compute the following:
1) Q(s, nothing)
2) Q(s, left)
3) Q(s, main)
4) Q(s, right)

Pick the action that maximizes Q(s, a)

To train the neural network we use the Bellman Equation
$$Q(s, a) = R(s) + \gamma \max_{a'}Q(s', a')$$

First, we try random actions in the lunar lander and record, at each step, the state we were in, the action we took, the reward we got, and the state we ended up in. Each step gives one tuple, for example <br> $(s^{(1)}, a^{(1)}, R(s^{(1)}), s'^{(1)})$, $(s^{(2)}, a^{(2)}, R(s^{(2)}), s'^{(2)})$, $(s^{(3)}, a^{(3)}, R(s^{(3)}), s'^{(3)})$ and $(s^{(4)}, a^{(4)}, R(s^{(4)}), s'^{(4)})$<br> and these tuples become the training examples from which the neural network parameters are learned

Each tuple gives one training example. For the first tuple the input is $$x^{(1)} = (s^{(1)}, a^{(1)})$$ and the target is $$y^{(1)} = R(s^{(1)}) + \gamma \max_{a'}Q(s'^{(1)}, a')$$
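As a worked example with made-up numbers (none of them come from the lander itself): suppose firing the main engine gave $R(s^{(1)}) = -0.3$, we assume $\gamma = 0.995$, and the current guess of Q gives the values 1.0, 2.0, 5.0 and 3.0 for the four actions in $s'^{(1)}$. Then the target would be
$$y^{(1)} = -0.3 + 0.995 \times \max(1.0,\ 2.0,\ 5.0,\ 3.0) = -0.3 + 0.995 \times 5.0 = 4.675$$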

## Learning Algorithm

Initialize the neural network randomly as a guess of $Q(s, a)$. In this first version the network takes $x=(s, a)$ as input and has a single output unit, so Q has to be computed separately for each of the 4 actions

Repeat {

- Take actions in the lunar lander. Get $(s, a, R(s), s')$
- Store the 10,000 most recent $(s, a, R(s), s')$ tuples <- _Replay Buffer_
- Train neural network:
    - Create training set of 10,000 examples using
        - $x=(s, a)$ and $y=R(s)+ \gamma \max_{a'}Q(s', a')$
    - Train $Q_{new}$ such that $Q_{new}(s, a) \approx y$
- Set $Q = Q_{new}$

}
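
The replay buffer and the training-set step can be sketched in plain Python. This is only an illustration under assumptions: `GAMMA`, `store`, `build_training_set` and `q_guess` are hypothetical names, the discount value is not given in these notes, and `q_guess` stands in for the randomly initialized network.

```python
from collections import deque

REPLAY_CAPACITY = 10_000  # "10,000 most recent" tuples, from the notes
GAMMA = 0.995             # assumed discount factor; the notes give no value
ACTIONS = ("nothing", "left", "main", "right")

# A deque with maxlen drops the oldest tuple automatically once it is full.
replay_buffer = deque(maxlen=REPLAY_CAPACITY)


def store(buffer, s, a, reward, s_next):
    """Append one experience tuple (s, a, R(s), s')."""
    buffer.append((s, a, reward, s_next))


def build_training_set(buffer, q, gamma=GAMMA):
    """Turn stored tuples into (x, y) pairs using the current guess q(s, a)."""
    xs, ys = [], []
    for s, a, reward, s_next in buffer:
        best_next = max(q(s_next, a_next) for a_next in ACTIONS)
        xs.append((s, a))
        ys.append(reward + gamma * best_next)
    return xs, ys


def q_guess(s, a):
    """Hypothetical stand-in for the initial network: always guesses 0."""
    return 0.0


store(replay_buffer, (0.0,) * 8, "main", -0.3, (0.0,) * 8)
xs, ys = build_training_set(replay_buffer, q_guess)
```

Using `deque(maxlen=...)` is one simple way to keep only the most recent tuples; fitting $Q_{new}$ to `xs` and `ys` is left out here because it depends on the training library.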

## Algorithm refinement

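Well, it turns out that computing $Q(s, a)$ separately for each of the 4 actions is not the best approach. It's better to have a single neural network that takes only the state $s$ as input and has an output layer with 4 neurons, one per action, so one forward pass gives all four Q values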
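
A rough sketch of that shape, in plain Python with no ML library: 8 state values go in and 4 Q values come out of one forward pass. The hidden layer size, the weight range and the helper names (`init_layer`, `dense`, `q_values`) are assumptions for illustration, not values from these notes.

```python
import random

STATE_SIZE = 8    # x, y, theta, their velocities, l, r
NUM_ACTIONS = 4   # nothing, left, main, right
HIDDEN_SIZE = 16  # assumed; the notes do not specify a hidden layer size


def init_layer(n_in, n_out, rng):
    """Small random weights and zero biases for one dense layer."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases


def dense(x, layer, relu):
    """Compute W x + b, optionally followed by ReLU."""
    weights, biases = layer
    out = [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, biases)]
    return [max(0.0, v) for v in out] if relu else out


def q_values(state, layers):
    """One forward pass; ReLU on every layer except the linear output layer."""
    h = state
    for i, layer in enumerate(layers):
        h = dense(h, layer, relu=(i < len(layers) - 1))
    return h


rng = random.Random(0)
layers = [init_layer(STATE_SIZE, HIDDEN_SIZE, rng),
          init_layer(HIDDEN_SIZE, NUM_ACTIONS, rng)]

state = [0.1] * STATE_SIZE
qs = q_values(state, layers)                              # four Q values at once
best_action = max(range(NUM_ACTIONS), key=lambda a: qs[a])
```

The same forward pass also gives $\max_{a'}Q(s', a')$ directly, since all four values for $s'$ are available at once.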

### $\epsilon$-greedy policy

In some state s
- Option 1:
    - Pick the action a that maximizes $Q(s, a)$
- Option 2:
    - With probability 0.95, pick the action a that maximizes $Q(s, a)$
    - With probability 0.05, pick an action a randomly

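Option 1 may work okay, but Option 2 usually works better: if the initial guess of Q happens to make some action look bad, Option 1 would never try it and so never learn otherwise. Option 2 is called an __$\epsilon$-greedy policy__, here with $\epsilon = 0.05$. Picking the action that maximizes $Q(s, a)$ is the __exploitation step__ (the __greedy__ action), and picking a random action is the __exploration step__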

It's better to start with a high $\epsilon$ and gradually decrease it, so that initially we mostly pick random actions and learn what every action does, and later we mostly take greedy actions
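
A minimal sketch of $\epsilon$-greedy action selection with a decaying $\epsilon$. The decay rate, the floor on $\epsilon$ and the function names are assumptions picked for illustration; the notes only say to start high and decrease gradually.

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """Return an action index: random with probability epsilon, else greedy."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))                      # exploration
    return max(range(len(q_values)), key=lambda a: q_values[a])  # exploitation


def decay_epsilon(epsilon, decay=0.995, min_epsilon=0.01):
    """Shrink epsilon a little each step, but never below min_epsilon (assumed values)."""
    return max(min_epsilon, epsilon * decay)


epsilon = 1.0                    # start fully random
q_now = [0.1, 0.9, 0.3, 0.2]     # made-up Q(s, a) values for the four actions
for _ in range(5):
    action = epsilon_greedy(q_now, epsilon)
    epsilon = decay_epsilon(epsilon)
```

Multiplicative decay is just one common choice; a linear schedule would fit the description equally well.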

### Mini-batch and soft updates

### Mini-batches

Let's consider the house pricing problem from the first course. If m is a hundred million, every step of gradient descent has to go over all m examples, so the algorithm becomes very slow

So in the mini-batch algorithm we take a subset of the large set, let's say m' = 1000, and each iteration of gradient descent uses only those 1000 examples. The subset keeps __changing__ every iteration

So in the case of reinforcement learning, even though we have stored the 10,000 most recent tuples, we may not use all of them in each training step, and may use only 1,000 of them
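
Drawing a mini-batch from the replay buffer could look like this. It is a sketch under assumptions: `sample_mini_batch` is a hypothetical helper, sampling is uniform without replacement, and the buffer is filled with placeholder tuples just so the example runs.

```python
import random

BUFFER_SIZE = 10_000  # tuples kept in the replay buffer
BATCH_SIZE = 1_000    # tuples actually used per training step


def sample_mini_batch(buffer, batch_size=BATCH_SIZE, rng=random):
    """Pick batch_size distinct tuples uniformly at random from the buffer."""
    items = list(buffer)
    return rng.sample(items, min(batch_size, len(items)))


# Placeholder experience tuples (s, a, R(s), s'), only for the demo.
buffer = [(i, "nothing", 0.0, i + 1) for i in range(BUFFER_SIZE)]
batch = sample_mini_batch(buffer, rng=random.Random(0))
```

Calling `sample_mini_batch` again on the next training step gives a different subset, which matches the idea that the mini-batch changes every iteration.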

### Soft updates

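Setting $Q = Q_{new}$ replaces the old parameters all at once, so a single bad training step can make $Q$ much worse.

With the soft update we instead move the parameters only a small step towards the new values: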
$$w = 0.01w_{new} + 0.99w$$
$$b = 0.01b_{new} + 0.99b$$
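
The two update rules can be sketched for a flat list of parameters. The name `TAU` for the 0.01 mixing factor and the helper `soft_update` are assumptions for illustration; a real network would apply the same blend to every weight matrix and bias vector.

```python
TAU = 0.01  # the 0.01 from the update rule; the name is an assumption


def soft_update(current, new, tau=TAU):
    """Blend parameters: tau * new + (1 - tau) * current, element by element."""
    return [tau * p_new + (1.0 - tau) * p for p, p_new in zip(current, new)]


w = [1.0, -2.0]      # made-up current weights
w_new = [0.0, 2.0]   # made-up freshly trained weights
w = soft_update(w, w_new)   # moves w only 1% of the way towards w_new
```

With these made-up numbers the first weight moves from 1.0 to about 0.99 and the second from -2.0 to about -1.96, so the parameters change gradually instead of jumping to the new values.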