# Concept 1 - Return

| State | 0 | 1 | 2 | 3 | 4 | 5 |
| - | ------ | ------ | ------ | ------ | ------ | ------ |
| Return | 100     | 50 | 25 | 12.5 | 20 | 40 |
| Policy $\pi$ | | ← | ← | ← | → | |
| Reward | 100    | 0      | 0      | 0      | 0      | 40     |

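* $\pi$ is the policy: it maps each state to an action, here ← or →
* discount factor ("inflation") $\gamma=0.5$
* Return is the sum of the rewards collected along the path, where the reward received after $t$ steps is multiplied by $\gamma^t$; e.g. from state 3: $0 + 0.5 \cdot 0 + 0.5^2 \cdot 0 + 0.5^3 \cdot 100 = 12.5$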
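The Return row of the table can be reproduced with a few lines of Python. This is a minimal sketch: the helper name `discounted_return` and the two reward lists are illustrative assumptions, written out by hand from the Reward row.

```python
def discounted_return(rewards, gamma=0.5):
    """Sum the rewards along a path, discounting the reward at step t by gamma**t."""
    return sum(r * gamma**t for t, r in enumerate(rewards))

# Rewards seen when starting in state 3 and always moving left: 3 -> 2 -> 1 -> 0
left_from_3 = [0, 0, 0, 100]
# Rewards seen when starting in state 4 and moving right: 4 -> 5
right_from_4 = [0, 40]

print(discounted_return(left_from_3))   # 12.5
print(discounted_return(right_from_4))  # 20.0
```

Both values match the Return row for states 3 and 4.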

```python
import numpy as np

def policy(current_state, obey_rate):
    """Return the next state: the optimal move with probability obey_rate, otherwise the opposite move."""

    # state -> [direction, next state]; the terminal states 0 and 5 have no entry
    optimal_way = {1 : ['left', 0], 2:['left', 1], 3:['left', 2], 4:['right', 5]} # as in table above
    not_optimal_way = {1 : ['right', 2], 2:['right', 3], 3:['right', 4], 4:['left', 3]}

    obeys = np.random.rand() < obey_rate
    next_step = optimal_way[current_state][1] if obeys else not_optimal_way[current_state][1]

    return next_step

next_step = policy(3, 0.9)
print(f"next step is {next_step}")
```

```
next step is 2
```


# Concept 2 - MDP

MDP - Markov Decision Process.
The future depends only on the state you are in now, not on how you got there (the Markov property). Here the states, actions and time steps are discrete.

# Concept 3 - State-action value function

$Q(s,a)$ = Return if you
* start in state $s$,
* take action $a$ (once),
* behave optimally after that

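Example (states numbered 1 to 6 here, shifted by one relative to the Concept 1 table). Going right from state 2 lands in state 3; from there the optimal path goes left through state 2 to the terminal state 1:

$Q(2,→) = R(2) + \gamma \cdot R(3) + \gamma^2 \cdot R(2) + \gamma^3 \cdot R(1) = 0 + 0.5 \cdot 0 + 0.25 \cdot 0 + 0.125 \cdot 100 = 12.5$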


Bellman's formula:

$Q(s,a) = R(s) + \gamma \cdot \max_{a'} Q(s',a')$, where
* $R(s)$ - reward at current state $s$
* $a$ - current action
* $s'$ - new state reached by taking $a$ in $s$
* $a'$ - action taken in the new state $s'$

For a terminal state the formula reduces to $Q(s,a) = R(s)$.

Example:

| State | 1 | 2 | 3 | 4 | 5 | 6 |
| - | ------ | ------ | ------ | ------ | ------ | ------ |
| $Q(s,←)$ | 100 | 50 | 25 | 12.5 | 6.25 | 40 |
| $Q(s,→)$ | 100 | 12.5 | 6.25 | 10 | 20 | 40 |



$Q(2,→) = R(2) + \gamma \cdot \max_{a'} Q(3,a') = 0 + 0.5 \cdot \max(25, 6.25) = 0.5 \cdot 25 = 12.5$
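The whole table can be obtained by applying Bellman's formula repeatedly until the values stop changing. The sketch below assumes the six-state chain from the example, with both end states terminal; the array layout and the number of sweeps are illustrative choices.

```python
import numpy as np

rewards = np.array([100.0, 0, 0, 0, 0, 40.0])  # R(s) for states 1..6 (array index = state - 1)
gamma = 0.5
n_states = len(rewards)
terminal = {0, n_states - 1}

# Q[s, 0] = Q(s, left), Q[s, 1] = Q(s, right)
Q = np.zeros((n_states, 2))

for _ in range(50):  # more sweeps than this small chain needs to converge
    Q_next = np.empty_like(Q)
    for s in range(n_states):
        if s in terminal:
            Q_next[s] = rewards[s]  # the episode ends here, so the return is just R(s)
            continue
        for a, step in enumerate((-1, +1)):
            s_next = s + step
            # Bellman's formula: R(s) + gamma * max over a' of Q(s', a')
            Q_next[s, a] = rewards[s] + gamma * Q[s_next].max()
    Q = Q_next

for s in range(n_states):
    print(f"state {s + 1}: Q(left) = {Q[s, 0]:g}, Q(right) = {Q[s, 1]:g}")
```

Every number here is an exact binary fraction, so the result should agree with the table digit for digit.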


# Concept 4 - Stochastic Environment

The environment is now stochastic: an action $a$ is carried out as intended with 90% probability, and with 10% probability the agent disobeys and moves in the other direction.

Goal - choose the policy $\pi$ that maximises the Expected Return (estimated here as the average over 1000 simulated runs):
$G = E[R_1 + \gamma \cdot R_2 + \gamma^2 \cdot R_3 + \gamma^3 \cdot R_4 + ...]$

Bellman's function: $Q(s,a) = R(s) + \gamma \cdot E[\max_{a'} Q(s',a')]$, where the expectation is taken over the random next state $s'$.

```python
import numpy as np

def bellman_function(current_state, obey_rate = 1):
    """Estimate the expected return from current_state by averaging 1000 simulated runs.

    Uses policy() from the cell above; states are 0-indexed as in the Concept 1 table.
    """

    rewards = np.array([100,0,0,0,0,40])  # reward per state; states 0 and 5 are terminal
    inflation_rate = 0.5                  # discount factor gamma

    stochastic_rewards = []

    for _ in range(1000):

        intermediate_state = current_state
        intermediate_reward = rewards[current_state] # Q(s,a) = R(s) + ...
        intermediate_step = 0

        # only terminal states carry a non-zero reward, so this loops until one is reached
        while intermediate_reward == 0:

            intermediate_step += 1
            intermediate_state = policy(intermediate_state, obey_rate)
            # add the reward of the new state, discounted by gamma^step
            intermediate_reward += rewards[intermediate_state] * inflation_rate**intermediate_step

        stochastic_rewards.append(intermediate_reward)

    result = np.mean(stochastic_rewards)

    return result

stochasticQ1 = bellman_function(1, obey_rate = 0.9)
stochasticQ2 = bellman_function(2, obey_rate = 0.9)
stochasticQ3 = bellman_function(3, obey_rate = 0.9)
stochasticQ4 = bellman_function(4, obey_rate = 0.9)

print(f"Q(1) = {stochasticQ1:.2f}, Q(2) = {stochasticQ2:.2f}, Q(3) = {stochasticQ3:.2f}, Q(4) = {stochasticQ4:.2f}")
```

```
Q(1) = 45.41, Q(2) = 21.33, Q(3) = 10.34, Q(4) = 18.55
```
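As a cross-check on the sampled values, the expectation in Bellman's function can also be evaluated exactly instead of by simulation. The sketch below assumes that a disobeyed action always moves one step in the opposite direction (as in `not_optimal_way`) and uses 0-indexed states like the code above; the function name `expected_q_values` and the parameter `n_sweeps` are illustrative, not part of the original notebook.

```python
import numpy as np

def expected_q_values(obey_rate=0.9, gamma=0.5, n_sweeps=100):
    """Iterate Q(s,a) = R(s) + gamma * E[max_a' Q(s',a')] on the six-state chain."""
    rewards = np.array([100.0, 0, 0, 0, 0, 40.0])  # states 0..5, both ends terminal
    n_states = len(rewards)
    Q = np.zeros((n_states, 2))  # column 0 = left, column 1 = right

    for _ in range(n_sweeps):
        V = Q.max(axis=1)  # best value obtainable from each state
        Q_next = np.empty_like(Q)
        for s in range(n_states):
            if s in (0, n_states - 1):
                Q_next[s] = rewards[s]  # terminal: the return is just R(s)
                continue
            for a, step in enumerate((-1, +1)):
                intended, slipped = s + step, s - step
                # expectation over the two possible next states
                expected_next = obey_rate * V[intended] + (1 - obey_rate) * V[slipped]
                Q_next[s, a] = rewards[s] + gamma * expected_next
        Q = Q_next
    return Q

Q = expected_q_values()
print(np.round(Q.max(axis=1), 2))
```

With `obey_rate=0.9` the per-state maxima for states 1 to 4 come out at roughly 46.06, 21.25, 10.49 and 18.52, in the same range as the sampled figures, which carry Monte Carlo noise. Setting `obey_rate=1.0` recovers the deterministic values 50, 25, 12.5 and 20.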


# Deep Reinforcement Learning

Train a neural network (NN) to estimate which action $a$ gives the best return.

Compute the policy $\pi$ from a network $f$ that maps a state $\vec{x}$ and a one-hot encoded action $a$ to $Q(s,a)$:
$
\pi : f
\begin{bmatrix} 
\vec{x} \\
a
\end{bmatrix} = 
Q(s,a)
$, where
$\vec{x} = \begin{bmatrix} oX \\ oY \\ oZ \\ oX^{'} \\ oY^{'} \\ oZ^{'} \end{bmatrix}$,  and
$a = \begin{bmatrix} 0_{left} \\ 0_{right} \\ 1_{up} \\ 0_{down} \end{bmatrix}$

For each action $a$ evaluate the network, and pick the action with the max value:
* Q(s, nothing), 
* Q(s, left), 
* Q(s, right),
* etc
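Picking the action amounts to one network evaluation per candidate action followed by an argmax. In the sketch below the "network" `f` is a hypothetical fixed linear scorer, and the state vector and weights are made-up numbers used only to make the example runnable; the action order follows the one-hot vector $a$ shown earlier.

```python
import numpy as np

ACTIONS = ["left", "right", "up", "down"]  # order of the one-hot action vector

def one_hot(index, size):
    """Vector of zeros with a single 1.0 at position index."""
    v = np.zeros(size)
    v[index] = 1.0
    return v

def greedy_action(f, x):
    """Evaluate f on [x, a] for every one-hot action a and return the best action."""
    q = [f(np.concatenate([x, one_hot(i, len(ACTIONS))])) for i in range(len(ACTIONS))]
    return ACTIONS[int(np.argmax(q))], q

# Hypothetical stand-in for a trained network: a fixed linear scorer
weights = np.array([0.1, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 2.0, 5.0, 3.0])
f = lambda xa: float(weights @ xa)

x = np.array([1.0, 2.0, 3.0, 0.0, 0.0, 0.0])  # made-up six-component state vector
action, q = greedy_action(f, x)
print(action, q)
```

With these made-up weights the "up" entry scores highest, so `greedy_action` returns `"up"`.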

Generate a dataset by randomly picking actions and recording the outcomes: $(s, a, R(s), s')$. With 10,000 such records, train a NN to solve $Q(s,a) = R(s) + \gamma \cdot \max_{a'} Q(s',a')$:
* inputs x = $(s,a)$, and
* targets y = $R(s) + \gamma \cdot \max_{a'} Q(s',a')$

Deep Q-network (DQN) algorithm:
* initially, $Q(s',a')$ is a random guess (much like the initial parameters in gradient descent)
* train a new NN so that $Q_{new}(s,a)$ approximates the targets y
* update Q to $Q_{new}$
* repeat the algorithm to gradually improve the Q function
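The loop can be tried on the six-state chain from the earlier concepts. This is a sketch under loud assumptions, not a real DQN: a linear model on one-hot $(s,a)$ features, fitted by least squares, stands in for the neural network and its training, and the environment is the deterministic chain with 0-indexed states. All names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, gamma = 6, 2, 0.5
rewards = np.array([100.0, 0, 0, 0, 0, 40.0])
terminal = np.array([True, False, False, False, False, True])

def encode(s, a):
    """One-hot feature vector for the pair (s, a)."""
    x = np.zeros(n_states * n_actions)
    x[s * n_actions + a] = 1.0
    return x

# 1. Dataset of random experience records (s, a, R(s), s')
records = []
for _ in range(10_000):
    s = rng.integers(1, n_states - 1)  # random non-terminal state 1..4
    a = rng.integers(n_actions)        # 0 = left, 1 = right
    s_next = s - 1 if a == 0 else s + 1
    records.append((s, a, rewards[s], s_next))

S, A, R, S_next = (np.array(col) for col in zip(*records))
X = np.stack([encode(s, a) for s, a in zip(S, A)])

# 2. Random initial guess for Q (assumption: linear model instead of a NN)
w = rng.normal(size=n_states * n_actions)

for _ in range(20):
    Q_table = w.reshape(n_states, n_actions)
    # 3. Targets y = R(s) + gamma * max_a' Q(s', a'); a terminal s' contributes only its reward
    next_best = np.where(terminal[S_next], rewards[S_next], Q_table[S_next].max(axis=1))
    y = R + gamma * next_best
    # 4. Fit Q_new to (X, y); least squares stands in for training the network
    w_new, *_ = np.linalg.lstsq(X, y, rcond=None)
    # 5. Update Q to Q_new and repeat
    w = w_new

Q_table = w.reshape(n_states, n_actions)
print(np.round(Q_table[1:5], 2))  # rows: states 1..4, columns: left, right
```

Because the chain is tiny and deterministic, the fitted values should settle on the same numbers as Bellman's formula gives directly. A real DQN replaces the one-hot linear model with a neural network trained by gradient descent, so that it can generalise over large or continuous state spaces.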
