<img src="img/probabilistic+black+friday.jpeg"/>

[Markov Chains - The Black Friday Puzzle](https://www.countbayesie.com/blog/2015/11/21/the-black-friday-puzzle-understanding-markov-chains)

# MDP (Markov Decision Process)

<img src="img/Markov Decision Process 2.png"/>

# Action-value function $q_\pi(s,a)$ and state-value function $v_\pi(s)$

\begin{eqnarray*}
q_\pi(s,a)&=&E_\pi(G_t|S_t=s,A_t=a)\nonumber\\
\\
v_\pi(s)&=&E_\pi(G_t|S_t=s)\nonumber\\
\end{eqnarray*}
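Both definitions depend on the return $G_t$, which is not spelled out here. Assuming the usual discounted return $G_t=R_{t+1}+\gamma R_{t+2}+\gamma^2R_{t+3}+\cdots$, a minimal sketch of how it could be computed from a finite list of sampled rewards follows. The helper name `discounted_return` and the sample numbers are illustrative assumptions, not part of the original notes.

```python
def discounted_return(rewards, gamma):
    """Return G_t for rewards [R_{t+1}, R_{t+2}, ...] under discount gamma.

    Assumes the standard discounted-return definition (an assumption here).
    """
    g = 0.0
    # Work backwards so that each step applies G = R + gamma * G_next.
    for r in reversed(rewards):
        g = r + gamma * g
    return g


# Hypothetical reward sequence, for illustration only.
print(discounted_return([1.0, 0.0, 2.0], 0.5))  # 1 + 0.5*0 + 0.25*2 = 1.5
```

Averaging such returns over many episodes that start from $S_t=s$ (and, for $q_\pi$, take $A_t=a$ first) would give Monte Carlo estimates of $v_\pi(s)$ and $q_\pi(s,a)$.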

<table>
<tr>
<td> <img src="img/Bellman's expectation equation 1.png"/> </td>
<td> <img src="img/Bellman's expectation equation 2.png"/> </td>
</tr>
</table>

# Bellman's expectation equation for $v_\pi$ and $q_\pi$

\begin{eqnarray*}
q_\pi(s,a)&=&{\cal R}_s^a+\gamma\sum_{s'}{\cal P}^a_{ss'}v_\pi(s')\nonumber\\
v_\pi(s)&=&\sum_{a}\pi(a|s)q_\pi(s,a)\nonumber\\
q_\pi(s,a)&=&{\cal R}_s^a+\gamma\sum_{s'}{\cal P}^a_{ss'}\left(\sum_{a'}\pi(a'|s')q_\pi(s',a')\right)\nonumber\\
v_\pi(s)&=&\sum_{a}\pi(a|s)\left({\cal R}_s^a+\gamma\sum_{s'}{\cal P}^a_{ss'}v_\pi(s')\right)\nonumber\\
\end{eqnarray*}
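For a fixed policy these equations are linear in $v_\pi$, so on a small MDP they can be solved directly. The sketch below does this with NumPy. The two-state, two-action MDP, the array layouts `P[a, s, s']` and `R[s, a]`, and the function name `evaluate_policy` are all assumptions made for illustration.

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP (numbers chosen only for illustration).
# P[a, s, t] = probability of moving from state s to state t under action a.
P = np.array([[[0.9, 0.1],
               [0.2, 0.8]],
              [[0.5, 0.5],
               [0.0, 1.0]]])
# R[s, a] = expected immediate reward for taking action a in state s.
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
gamma = 0.9
# pi[s, a] = probability of choosing action a in state s (uniform here).
pi = np.full((2, 2), 0.5)


def evaluate_policy(P, R, pi, gamma):
    """Solve v = R_pi + gamma * P_pi v exactly, then recover q from v."""
    n_states = R.shape[0]
    R_pi = np.sum(pi * R, axis=1)            # sum_a pi(a|s) R[s, a]
    P_pi = np.einsum('sa,ast->st', pi, P)    # sum_a pi(a|s) P[a, s, t]
    v = np.linalg.solve(np.eye(n_states) - gamma * P_pi, R_pi)
    q = R + gamma * np.einsum('ast,t->sa', P, v)
    return v, q


v_pi, q_pi = evaluate_policy(P, R, pi, gamma)
print(v_pi)
print(q_pi)
```

The second equation, $v_\pi(s)=\sum_a\pi(a|s)q_\pi(s,a)$, can then be checked numerically with `np.allclose(v_pi, np.sum(pi * q_pi, axis=1))`.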

# Optimal action-value function, state-value function, and policy

<img src="img/Optimal Policy 1.png"/>

<img src="img/Optimal Policy 3.png"/>



<table>
<tr>
<td> <img src="img/Bellman's optimality equation 1.png"/> </td>
<td> <img src="img/Bellman's optimality equation 2.png"/> </td>
</tr>
</table>

# Bellman's optimality equation for $v_{*}$ and $q_{*}$

\begin{eqnarray*}
q_*(s,a)&=&{\cal R}_s^a+\gamma\sum_{s'}{\cal P}^a_{ss'}v_*(s')\nonumber\\
v_*(s)&=&\max_{a}q_*(s,a)\nonumber\\
q_*(s,a)&=&{\cal R}_s^a+\gamma\sum_{s'}{\cal P}^a_{ss'}\left(\max_{a'}q_*(s',a')\right)\nonumber\\
v_*(s)&=&\max_{a}\left({\cal R}_s^a+\gamma\sum_{s'}{\cal P}^a_{ss'}v_*(s')\right)\nonumber\\
\end{eqnarray*}

# Value iteration

- Initialize $v_0(s)=0$ for all $s$.

- Repeat until the values stop changing.

    For every $s$ (synchronous or asynchronous), use Bellman's optimality equation as an update rule, where $v_k$ and $q_k$ are the current estimates of $v_*$ and $q_*$. The last two lines are the first two combined, written in terms of $q$ alone and of $v$ alone:

\begin{eqnarray*}
q_{k+1}(s,a)&=&{\cal R}_s^a+\gamma\sum_{s'}{\cal P}^a_{ss'}v_k(s')\nonumber\\
v_{k+1}(s)&=&\max_{a}q_{k+1}(s,a)\nonumber\\
q_{k+1}(s,a)&=&{\cal R}_s^a+\gamma\sum_{s'}{\cal P}^a_{ss'}\left(\max_{a'}q_k(s',a')\right)\nonumber\\
v_{k+1}(s)&=&\max_{a}\left({\cal R}_s^a+\gamma\sum_{s'}{\cal P}^a_{ss'}v_k(s')\right)\nonumber\\
\end{eqnarray*}
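The loop above can be sketched in a few lines of NumPy. This is a synchronous version under stated assumptions: the small MDP, the array layouts `P[a, s, s']` and `R[s, a]`, the stopping tolerance `tol`, and the function name `value_iteration` are illustrative choices, not part of the original notes.

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP (numbers chosen only for illustration).
# P[a, s, t] = probability of moving from state s to state t under action a.
P = np.array([[[0.9, 0.1],
               [0.2, 0.8]],
              [[0.5, 0.5],
               [0.0, 1.0]]])
# R[s, a] = expected immediate reward for taking action a in state s.
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
gamma = 0.9


def value_iteration(P, R, gamma, tol=1e-10, max_iter=10_000):
    """Synchronous value iteration; returns estimates of v_* and q_*."""
    v = np.zeros(R.shape[0])                           # start from v(s) = 0
    for _ in range(max_iter):
        q = R + gamma * np.einsum('ast,t->sa', P, v)   # q(s, a) from current v
        v_new = q.max(axis=1)                          # v(s) = max_a q(s, a)
        converged = np.max(np.abs(v_new - v)) < tol
        v = v_new
        if converged:
            break
    q = R + gamma * np.einsum('ast,t->sa', P, v)
    return v, q


v_star, q_star = value_iteration(P, R, gamma)
print(v_star)
print(q_star.argmax(axis=1))  # greedy action in each state
```

As a sanity check on this particular MDP, taking action 1 in state 1 earns a reward of 2 forever, so its value should approach $2/(1-\gamma)=20$.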

### [$Q$-learning](https://en.wikipedia.org/wiki/Q-learning)

<img src="img/Q-learning algorithm.png"/>
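As a rough companion to the figure, here is a minimal tabular sketch of the standard Q-learning update, $Q(s,a)\leftarrow Q(s,a)+\alpha\left(r+\gamma\max_{a'}Q(s',a')-Q(s,a)\right)$, with an epsilon-greedy behaviour policy. The toy MDP, the `step` helper, and the values of `alpha`, `epsilon`, and the step count are assumptions for illustration. The task is treated as continuing, with no terminal state.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 2-state, 2-action MDP (numbers chosen only for illustration).
# P[a, s, t] = probability of moving from state s to state t under action a.
P = np.array([[[0.9, 0.1],
               [0.2, 0.8]],
              [[0.5, 0.5],
               [0.0, 1.0]]])
# R[s, a] = reward for taking action a in state s (deterministic here).
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
gamma, alpha, epsilon = 0.9, 0.1, 0.2


def step(s, a):
    """Simulated environment: return (reward, next state) for one transition."""
    s_next = int(rng.choice(2, p=P[a, s]))
    return R[s, a], s_next


Q = np.zeros((2, 2))
s = 0
for _ in range(20_000):
    # Epsilon-greedy behaviour policy: explore with probability epsilon.
    if rng.random() < epsilon:
        a = int(rng.integers(2))
    else:
        a = int(np.argmax(Q[s]))
    r, s_next = step(s, a)
    # Off-policy target: bootstrap from the best action in the next state.
    Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
    s = s_next

print(Q)
```

Unlike value iteration, the learner never reads `P` or `R` directly. It only sees sampled transitions through `step`.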

### [SARSA](https://en.wikipedia.org/wiki/State%E2%80%93action%E2%80%93reward%E2%80%93state%E2%80%93action)

<img src="img/SARSA algorithm.png"/>
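SARSA differs from Q-learning only in its target: it bootstraps from the action actually chosen in the next state, $Q(s,a)\leftarrow Q(s,a)+\alpha\left(r+\gamma Q(s',a')-Q(s,a)\right)$. The sketch below shows that standard update on a toy problem. The MDP, the helpers `step` and `epsilon_greedy`, and the hyperparameters are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 2-state, 2-action MDP (numbers chosen only for illustration).
# P[a, s, t] = probability of moving from state s to state t under action a.
P = np.array([[[0.9, 0.1],
               [0.2, 0.8]],
              [[0.5, 0.5],
               [0.0, 1.0]]])
# R[s, a] = reward for taking action a in state s (deterministic here).
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
gamma, alpha, epsilon = 0.9, 0.1, 0.2


def step(s, a):
    """Simulated environment: return (reward, next state) for one transition."""
    s_next = int(rng.choice(2, p=P[a, s]))
    return R[s, a], s_next


def epsilon_greedy(Q, s):
    """Pick a random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return int(rng.integers(Q.shape[1]))
    return int(np.argmax(Q[s]))


Q = np.zeros((2, 2))
s = 0
a = epsilon_greedy(Q, s)
for _ in range(20_000):
    r, s_next = step(s, a)
    a_next = epsilon_greedy(Q, s_next)   # choose A' before updating
    # On-policy target: bootstrap from the action that will actually be taken.
    Q[s, a] += alpha * (r + gamma * Q[s_next, a_next] - Q[s, a])
    s, a = s_next, a_next

print(Q)
```

The tuple $(s,a,r,s',a')$ used in each update is where the name comes from.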

### [Temporal difference learning](https://en.wikipedia.org/wiki/Temporal_difference_learning)

<img src="img/Temporal difference learning algorithm.png"/>
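For state values, the simplest temporal difference method is TD(0), which estimates $v_\pi$ for a fixed policy with the update $V(s)\leftarrow V(s)+\alpha\left(r+\gamma V(s')-V(s)\right)$. The sketch below applies it to a toy MDP under a uniform random policy. The MDP, the policy, and the values of `alpha` and the step count are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 2-state, 2-action MDP (numbers chosen only for illustration).
# P[a, s, t] = probability of moving from state s to state t under action a.
P = np.array([[[0.9, 0.1],
               [0.2, 0.8]],
              [[0.5, 0.5],
               [0.0, 1.0]]])
# R[s, a] = reward for taking action a in state s (deterministic here).
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
gamma, alpha = 0.9, 0.05
# pi[s, a] = probability of choosing action a in state s (uniform here).
pi = np.full((2, 2), 0.5)

V = np.zeros(2)
s = 0
for _ in range(20_000):
    a = int(rng.choice(2, p=pi[s]))          # follow the fixed policy
    s_next = int(rng.choice(2, p=P[a, s]))   # sample the next state
    r = R[s, a]
    # TD(0): move V(s) toward the one-step bootstrapped target.
    V[s] += alpha * (r + gamma * V[s_next] - V[s])
    s = s_next

print(V)
```

With a constant step size the estimate keeps fluctuating around $v_\pi$ rather than settling exactly, so a decaying `alpha` is a common refinement.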

### DQN


# Policy iteration

- Initialize $\pi$ randomly.

- Repeat until $\pi$ stops changing.

    Policy evaluation: update $q_\pi$ and $v_\pi$ by solving Bellman's expectation equation for the current $\pi$.
\begin{eqnarray*}
q_\pi(s,a)&=&{\cal R}_s^a+\gamma\sum_{s'}{\cal P}^a_{ss'}v_\pi(s')\nonumber\\
v_\pi(s)&=&\sum_{a}\pi(a|s)q_\pi(s,a)\nonumber\\
q_\pi(s,a)&=&{\cal R}_s^a+\gamma\sum_{s'}{\cal P}^a_{ss'}\left(\sum_{a'}\pi(a'|s')q_\pi(s',a')\right)\nonumber\\
v_\pi(s)&=&\sum_{a}\pi(a|s)\left({\cal R}_s^a+\gamma\sum_{s'}{\cal P}^a_{ss'}v_\pi(s')\right)\nonumber\\
\end{eqnarray*}
    
    Policy improvement: update $\pi$ by acting greedily with respect to $q_\pi$, that is, by solving

$$
\pi(s)=\mbox{argmax}_{a}q_\pi(s,a)
$$
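The two alternating steps can be sketched as follows, using an exact linear solve for the evaluation step and a deterministic greedy policy for the improvement step. The toy MDP, the array layouts `P[a, s, s']` and `R[s, a]`, the `seed` and `max_iter` parameters, and the function name `policy_iteration` are assumptions for illustration.

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP (numbers chosen only for illustration).
# P[a, s, t] = probability of moving from state s to state t under action a.
P = np.array([[[0.9, 0.1],
               [0.2, 0.8]],
              [[0.5, 0.5],
               [0.0, 1.0]]])
# R[s, a] = expected immediate reward for taking action a in state s.
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
gamma = 0.9


def policy_iteration(P, R, gamma, seed=0, max_iter=100):
    """Alternate exact policy evaluation and greedy policy improvement."""
    n_states, n_actions = R.shape
    rng = np.random.default_rng(seed)
    policy = rng.integers(n_actions, size=n_states)   # random initial policy
    states = np.arange(n_states)
    for _ in range(max_iter):
        # Evaluation: solve v = R_pi + gamma * P_pi v for the current policy.
        P_pi = P[policy, states]                      # P_pi[s, t] = P[policy[s], s, t]
        R_pi = R[states, policy]                      # R_pi[s] = R[s, policy[s]]
        v = np.linalg.solve(np.eye(n_states) - gamma * P_pi, R_pi)
        q = R + gamma * np.einsum('ast,t->sa', P, v)
        # Improvement: pi(s) = argmax_a q(s, a).
        greedy = q.argmax(axis=1)
        if np.array_equal(greedy, policy):
            break                                     # policy is stable
        policy = greedy
    return policy, v, q


policy, v, q = policy_iteration(P, R, gamma)
print(policy)
print(v)
```

On a finite MDP there are only finitely many deterministic policies, so this loop usually stops after a handful of iterations.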


### A3C

### TRPO

### ACKTR

### DDPG

### Reinforcement learning

### GAE



```python
# Markov Chains in Python
# http://charlesfranzen.com/posts/markov-chains-in-python/

import numpy as np


class Markov(object):
    """A Markov chain that moves between states with fixed probabilities."""

    def __init__(self, state_dict):
        # state_dict maps each state to a pair (next_states, probabilities).
        self.state_dict = state_dict
        # Start in the first state listed in the dictionary.
        self.state = list(self.state_dict.keys())[0]

    def check_state(self):
        print('Current State: %s' % (self.state))

    def set_state(self, state):
        self.state = state
        print('State is now: %s' % (self.state))

    def next_state(self):
        # Sample the next state from the current state's transition distribution.
        next_states, probs = self.state_dict[self.state]
        self.state = np.random.choice(a=next_states, p=probs)
        print('New State: %s' % (self.state))


# Labels and probabilities are kept in separate lists: packing them into one
# np.array would silently convert the probabilities to strings.
state_dict = {'A': (['A', 'B', 'C'], [.2, .4, .4]),
              'B': (['A', 'C'], [.4, .6]),
              'C': (['A', 'B'], [.6, .4])}

diagram_a = Markov(state_dict)
diagram_a.check_state()

diagram_a.set_state('B')

for _ in range(10):
    diagram_a.next_state()

# Exercise.
# Modify the code so that we have rewards in addition.

# Exercise.
# Modify the code so that we have actions and rewards in addition.

# Exercise.
# Modify the code so that we have discount, actions, and rewards in addition.
```

Sample output (the transitions are random, so each run differs):

```
Current State: A
State is now: B
New State: C
New State: A
New State: C
New State: A
New State: C
New State: B
New State: C
New State: A
New State: B
New State: C
```
