## What is Reinforcement Learning

- the goal is to find a function that maps a **state (s)** to an **action (a)**
- a reward function, **R(s)**, is used to train the model

Applications
- controlling robots
- factory optimization
- financial (stock) trading
- playing games (including video games)

### The Return in Reinforcement Learning
- **discount factor (γ)**: scales the reward credited to each step so that rewards further in the future count for less (often a number close to 1)
- the **return** in reinforcement learning is the **sum of the rewards the system gets**, **weighted by the discount factor**: each later reward is multiplied by the discount factor raised to a higher power, i.e. R₁ + γR₂ + γ²R₃ + ...
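
The weighting can be made concrete with a few lines of Python. This is a minimal sketch; the reward sequence and the value of γ are made-up numbers chosen only for illustration.

```python
def discounted_return(rewards, gamma):
    """Sum the rewards, weighting the reward at step t by gamma**t."""
    total = 0.0
    for t, reward in enumerate(rewards):
        total += (gamma ** t) * reward
    return total

# Hypothetical episode: no reward for three steps, then a reward of 100.
rewards = [0, 0, 0, 100]
example_return = discounted_return(rewards, gamma=0.5)  # 0.5**3 * 100 = 12.5
```

With γ closer to 1 (say 0.9), the same final reward would be discounted far less, so a patient agent values distant rewards more.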

### Making Decisions: Policies in Reinforcement Learning
- a policy function, **π(s) = a**, tells you what **action (a)** to take in a given **state (s)**

- **goal**: find a policy that tells you what action to take in every state (s) so as to maximize the return


### Markov Decision Process (MDP)
- model for sequential decision making when outcomes are uncertain and partly controllable
- "Markov" means that the future only depends on the current state

### State-Action Value Function (Q Function, Q*, Optimal Q Function)
- a function typically denoted by **Q(s, a)**
    - gives a number equal to the return if you start in a **state (s)**, take the **action (a)** once, and behave optimally after that
    - tells us how good it is to take action a in state s
    - the best possible return from **state (s)** is **max Q(s, a)**, the largest value over all actions a; the best action in s is the one that attains it
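
Once Q(s, a) is known, a policy falls out by picking the action with the largest value in each state. The sketch below assumes a small lookup table; the state names, action names, and Q-values are hypothetical.

```python
# Hypothetical Q-values for three states and two actions (made-up numbers).
Q = {
    "s0": {"left": 12.5, "right": 6.25},
    "s1": {"left": 25.0, "right": 10.0},
    "s2": {"left": 6.25, "right": 20.0},
}

def best_action(Q, state):
    """Return the action a that maximizes Q(state, a)."""
    return max(Q[state], key=Q[state].get)

def best_return(Q, state):
    """Return max over a of Q(state, a)."""
    return max(Q[state].values())

# The greedy policy pi(s) = argmax over a of Q(s, a), for every state.
policy = {state: best_action(Q, state) for state in Q}
```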

### Bellman Equation
- helps compute the state-action value function (**Q(s, a)**)
- terms
    - **s: current state**
    - **a: current action**
    - **s': state you get to after taking action a**
    - **a': action that you take in state s'**
    - **R(s): reward of the current state**
- Equation: **Q(s, a) = R(s) + γ · max Q(s', a')**, where the max is taken over all actions a' available in s'
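
One way to see the equation at work is to apply it repeatedly on a tiny problem until the values stop changing. The sketch below assumes a hypothetical six-state chain with rewards only at the two ends, γ = 0.5, and the convention that Q(s, a) = R(s) in a terminal state; none of these specifics come from the notes above.

```python
GAMMA = 0.5
REWARDS = [100, 0, 0, 0, 0, 40]      # hypothetical R(s) for states 0..5
TERMINAL = {0, 5}                    # the episode ends at either end
ACTIONS = {"left": -1, "right": +1}  # deterministic moves along the chain

def solve_q(n_sweeps=50):
    """Repeatedly apply Q(s, a) = R(s) + gamma * max_a' Q(s', a')."""
    Q = [{a: 0.0 for a in ACTIONS} for _ in REWARDS]
    for _ in range(n_sweeps):
        for s in range(len(REWARDS)):
            for a, step in ACTIONS.items():
                if s in TERMINAL:
                    # Nothing happens after a terminal state.
                    Q[s][a] = float(REWARDS[s])
                else:
                    s_next = s + step
                    Q[s][a] = REWARDS[s] + GAMMA * max(Q[s_next].values())
    return Q

Q = solve_q()
```

In this toy chain, going left from state 1 is worth 50 (half of the 100 waiting one step away), while going right from state 4 is worth 20.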

### Stochastic Environments
- sometimes, when you take an action, the outcome is not completely reliable (e.g. a slippery floor causes the robot to move in the wrong direction, the device is thrown off balance, wind, etc.)
- there isn't one sequence of rewards that you are guaranteed to see
- now, we are trying to maximize the **average or expected return**
- **Q(s, a) = R(s) + γ · E[max Q(s', a')]**, where the expectation is over the random next state s' and the max is over a'
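
The expectation can be computed exactly when the transition probabilities are known. The sketch below assumes a hypothetical six-state chain in which each move goes in the opposite direction with probability `slip_prob`; the rewards, γ, and the slip model are all illustrative assumptions.

```python
GAMMA = 0.5
REWARDS = [100, 0, 0, 0, 0, 40]      # hypothetical R(s) for states 0..5
TERMINAL = {0, 5}
ACTIONS = {"left": -1, "right": +1}

def expected_q(slip_prob, n_sweeps=100):
    """Apply Q(s, a) = R(s) + gamma * E[max_a' Q(s', a')] until it settles."""
    Q = [{a: 0.0 for a in ACTIONS} for _ in REWARDS]
    for _ in range(n_sweeps):
        for s in range(len(REWARDS)):
            for a, step in ACTIONS.items():
                if s in TERMINAL:
                    Q[s][a] = float(REWARDS[s])
                    continue
                intended, slipped = s + step, s - step
                # Expectation over the two possible next states.
                expected_best = ((1 - slip_prob) * max(Q[intended].values())
                                 + slip_prob * max(Q[slipped].values()))
                Q[s][a] = REWARDS[s] + GAMMA * expected_best
    return Q

Q_reliable = expected_q(slip_prob=0.0)   # reduces to the deterministic case
Q_slippery = expected_q(slip_prob=0.1)   # every value near the big reward drops
```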

### Learning the State-Action Value Function
- key idea: train a neural network to compute / approximate the state-action value function (Q(s, a)), which in turn lets us pick good actions
- in a state s, get Q(s, a) from the NN for every possible action a and select the action with the largest Q(s, a)
- for training
    $$ x^{(1)} = (s^{(1)}, a^{(1)}) $$

    $$ y^{(1)} = R(s^{(1)}) + \gamma \max_{a'} Q(s'^{(1)}, a') $$
- at every step, Q will be a guess (that will get better over time)
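
The construction of (x, y) pairs can be sketched with NumPy. Several things here are assumptions for illustration: `build_training_set` and `q_values_fn` are hypothetical names, the action is encoded as a one-hot vector appended to the state, and terminal states are not treated specially.

```python
import numpy as np

def build_training_set(transitions, q_values_fn, gamma):
    """Turn (s, a, R(s), s') tuples into inputs x and targets y.

    q_values_fn(states) is the current guess of Q: it maps an array of
    states, shape (batch, state_dim), to Q-values, shape (batch, n_actions).
    """
    states = np.array([t[0] for t in transitions], dtype=float)
    actions = np.array([t[1] for t in transitions], dtype=int)
    rewards = np.array([t[2] for t in transitions], dtype=float)
    next_states = np.array([t[3] for t in transitions], dtype=float)

    next_q = q_values_fn(next_states)          # Q(s', a') for every a'
    one_hot = np.eye(next_q.shape[1])[actions]  # encode a as a one-hot vector
    x = np.concatenate([states, one_hot], axis=1)  # x = (s, a)
    y = rewards + gamma * next_q.max(axis=1)       # y = R(s) + gamma * max Q(s', a')
    return x, y

# Hypothetical usage: one transition and a dummy Q that always returns [1, 3].
dummy_q = lambda s: np.tile([1.0, 3.0], (len(s), 1))
transitions = [(np.array([0.0, 1.0]), 1, 2.0, np.array([1.0, 0.0]))]
x, y = build_training_set(transitions, dummy_q, gamma=0.9)
```

Because the targets are built from the current guess of Q, they are only as good as that guess, which is why training has to be repeated.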

### Full Algorithm
1. Initialize neural network parameters randomly as an initial random guess for the Q function 
2. Repeat 
    - Take actions in the lunar lander environment. Get (s, a, R(s), s')
    - Store the 10,000 most recent tuples (s, a, R(s), s') (called the replay buffer)
    - Occasionally train the NN by creating a training set of 10,000 examples: x = (s, a), y = R(s) + γ · max Q(s', a')
    - Train Q_new such that Q_new(s, a) ≈ y
    - Set Q = Q_new
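
The loop above can be sketched end to end on a toy problem. To stay dependency-free, this sketch makes several substitutions that are assumptions rather than part of the notes: a lookup table stands in for the neural network, a hypothetical six-state chain stands in for the lunar lander, actions are chosen at random, and each stored tuple carries an extra `done` flag marking a terminal state.

```python
import random
from collections import deque

GAMMA = 0.5
REWARDS = [100, 0, 0, 0, 0, 40]   # hypothetical R(s) for a 6-state chain
TERMINAL = {0, 5}
ACTIONS = [-1, +1]                # move left / move right
BUFFER_SIZE = 10_000
TRAIN_EVERY = 50                  # "occasionally" = every 50 steps here

def train(n_steps=5_000, seed=0):
    rng = random.Random(seed)
    # 1. Random initial guess for Q (a table stands in for the network).
    Q = [[rng.random() for _ in ACTIONS] for _ in REWARDS]
    # deque(maxlen=...) silently drops the oldest tuple when full.
    replay_buffer = deque(maxlen=BUFFER_SIZE)
    s = rng.randrange(len(REWARDS))
    # 2. Repeat.
    for step in range(1, n_steps + 1):
        # Take an action (random here) and record (s, a, R(s), s', done).
        a = rng.randrange(len(ACTIONS))
        done = s in TERMINAL
        s_next = s if done else s + ACTIONS[a]
        replay_buffer.append((s, a, REWARDS[s], s_next, done))
        # Restart from a random state once the episode has ended.
        s = rng.randrange(len(REWARDS)) if done else s_next

        if step % TRAIN_EVERY == 0:
            # Build targets y = R(s) + gamma * max Q(s', a') from the buffer.
            targets = {}
            for bs, ba, br, bs_next, bdone in replay_buffer:
                y = br if bdone else br + GAMMA * max(Q[bs_next])
                targets[(bs, ba)] = y
            # "Train" Q_new so that Q_new(s, a) ~= y, then set Q = Q_new.
            Q_new = [row[:] for row in Q]
            for (bs, ba), y in targets.items():
                Q_new[bs][ba] = y
            Q = Q_new
    return Q

Q = train()
```

With a real neural network, the table assignment would be replaced by a few gradient steps on the (x, y) pairs, and in a stochastic environment the targets for the same (s, a) would differ from sample to sample rather than agree exactly.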

     