






# Model Learning
 
Sungchul Lee  




# References

- Reinforcement Learning: 4 Model-Free Prediction [David Silver](https://www.youtube.com/watch?v=PnHCvfgC_ZA&list=PL7-jPKtc4r78-wCZcQn5IqyuWhBZ8fOxT&index=4) [local-video](http://localhost:8888/notebooks/Dropbox/Video/RL Course by David Silver - Lecture 4_ Model-Free Prediction.mp4) [local-slide](http://localhost:8888/notebooks/Dropbox/Paper/Reinforcement Learning by David Silver 4.pdf) [slide](http://www0.cs.ucl.ac.uk/staff/d.silver/web/Teaching_files/MC-TD.pdf)

- Reinforcement Learning: 5 Model Free Control [David Silver](https://www.youtube.com/watch?v=0g4j2k_Ggc4&index=5&list=PL7-jPKtc4r78-wCZcQn5IqyuWhBZ8fOxT) [local-slide](http://localhost:8888/notebooks/Dropbox/Paper/Reinforcement Learning by David Silver 5.pdf) [slide](http://www0.cs.ucl.ac.uk/staff/d.silver/web/Teaching_files/control.pdf)

- Tutorial: Deep Reinforcement Learning, ICML 2016 [David Silver](http://icml.cc/2016/tutorials/deep_rl_tutorial.pdf) [local-slide](http://localhost:8888/notebooks/Dropbox/Paper/deep_rl_tutorial.pdf)

- Machine Learning, part III: The Q-learning algorithm [JAKE BENNETT](https://articles.wearepop.com/secret-formula-for-self-learning-computers)




# How to run these slides yourself

**Setup python environment**

- Install RISE for an interactive presentation viewer

||Algorithm|
|---|---|
|Policy evaluation|Iterative Policy Evaluation|
|Policy improvement|Value Iteration|
||Policy Iteration|

# Iterative Policy Evaluation 

<div align="center"><img src="img/Snrky_20130629_TC.jpg" width="80%"></div>

http://4.bp.blogspot.com/-TGdTiKfQ23E/Uc79lskYtJI/AAAAAAAABj4/2ZuVOO3-8rI/s1600/Snrky_20130629_TC.jpg

# Iterative Policy Evaluation for $v_\pi$



- Initialize $v_{\pi}(s)=0$ for all $s$.

- Repeat.

    For every $s$ (synchronous or asynchronous) update $v_\pi$  using Bellman's expectation equation: 


\begin{eqnarray*}
v_\pi(s)&=&\sum_{a}\pi(a|s)\left({\cal R}_s^a+\gamma\sum_{s'}{\cal P}^a_{ss'}v_\pi(s')\right)\nonumber\\
\end{eqnarray*}

# Iterative Policy Evaluation for $q_\pi$

- Initialize $q_{\pi}(s,a)=0$ for all $s$ and $a$.

- Repeat.

    For every $s$ and $a$ (synchronous or asynchronous) update $q_\pi$ using Bellman's expectation equation: 


\begin{eqnarray*}
q_\pi(s,a)&=&{\cal R}_s^a+\gamma\sum_{s'}{\cal P}^a_{ss'}\left(\sum_{a'}\pi(a'|s')q_\pi(s',a')\right)\nonumber\\
\end{eqnarray*}

# Iterative Policy Evaluation for $v_\pi$ in Andrew Ng's lecture 16

<div align="center"><img src="img/Screenshot+2017-8.png" width="100%" height="10%"></div>

<div align="center"><img src="img/Screenshot+2017-1.png" width="100%" height="10%"></div>

In [3]:
# Iterative Policy Evaluation for $v_\pi$ in Andrew Ng's lecture 16

# import libraries
import numpy as np

# state
states = [0,1,2,3,4,5,6,7,8,9,10]
N_STATES = len(states)

# action
actions = [0,1,2,3] # left, right, up, down
N_ACTIONS = len(actions)

# transition probability
P = np.zeros((N_STATES, N_ACTIONS, N_STATES))  

P[0,0,:] = [1,0,0,0,0,0,0,0,0,0,0]
P[0,1,:] = [0,1,0,0,0,0,0,0,0,0,0]
P[0,2,:] = [1,0,0,0,0,0,0,0,0,0,0]
P[0,3,:] = [0,0,0,0,1,0,0,0,0,0,0]  

P[1,0,:] = [0.9,0,0,0,0.1,0,0,0,0,0,0]
P[1,1,:] = [0,0,0.9,0,0,0.1,0,0,0,0,0]
P[1,2,:] = [0,1,0,0,0,0,0,0,0,0,0]
P[1,3,:] = [0,1,0,0,0,0,0,0,0,0,0] 

P[2,0,:] = [0,1,0,0,0,0,0,0,0,0,0]
P[2,1,:] = [0,0,0,0.9,0,0,0.1,0,0,0,0]
P[2,2,:] = [0,0,1,0,0,0,0,0,0,0,0]
P[2,3,:] = [0,0,0,0,0,0.9,0.1,0,0,0,0] 

P[3,0,:] = [0,0,0,1,0,0,0,0,0,0,0]
P[3,1,:] = [0,0,0,1,0,0,0,0,0,0,0]
P[3,2,:] = [0,0,0,1,0,0,0,0,0,0,0]
P[3,3,:] = [0,0,0,1,0,0,0,0,0,0,0] 

P[4,0,:] = [0,0,0,0,1,0,0,0,0,0,0]
P[4,1,:] = [0,0,0,0,1,0,0,0,0,0,0]
P[4,2,:] = [0.9,0.1,0,0,0,0,0,0,0,0,0]
P[4,3,:] = [0,0,0,0,0,0,0,0.9,0.1,0,0] 

P[5,0,:] = [0,0,0,0,0,1,0,0,0,0,0]
P[5,1,:] = [0,0,0,0.1,0,0,0.8,0,0,0,0.1]
P[5,2,:] = [0,0.1,0.8,0.1,0,0,0,0,0,0,0]
P[5,3,:] = [0,0,0,0,0,0,0,0,0.1,0.8,0.1] 

P[6,0,:] = [0,0,0,0,0,0,1,0,0,0,0]
P[6,1,:] = [0,0,0,0,0,0,1,0,0,0,0]
P[6,2,:] = [0,0,0,0,0,0,1,0,0,0,0]
P[6,3,:] = [0,0,0,0,0,0,1,0,0,0,0]

P[7,0,:] = [0,0,0,0,0,0,0,1,0,0,0]
P[7,1,:] = [0,0,0,0,0,0,0,0,1,0,0]
P[7,2,:] = [0,0,0,0,1,0,0,0,0,0,0]
P[7,3,:] = [0,0,0,0,0,0,0,1,0,0,0] 

P[8,0,:] = [0,0,0,0,0.1,0,0,0.9,0,0,0]
P[8,1,:] = [0,0,0,0,0,0.1,0,0,0,0.9,0]
P[8,2,:] = [0,0,0,0,0,0,0,0,1,0,0]
P[8,3,:] = [0,0,0,0,0,0,0,0,1,0,0] 

P[9,0,:] = [0,0,0,0,0,0,0,0,1,0,0]
P[9,1,:] = [0,0,0,0,0,0,0.1,0,0,0,0.9]
P[9,2,:] = [0,0,0,0,0,0.9,0.1,0,0,0,0]
P[9,3,:] = [0,0,0,0,0,0,0,0,0,1,0] 

P[10,0,:] = [0,0,0,0,0,0.1,0,0,0,0.9,0]
P[10,1,:] = [0,0,0,0,0,0,0,0,0,0,1]
P[10,2,:] = [0,0,0,0,0,0.1,0.9,0,0,0,0]
P[10,3,:] = [0,0,0,0,0,0,0,0,0,0,1] 

# reward
if True: # fuel-efficient robot
    R = -0.02 * np.ones((N_STATES, N_ACTIONS)) 
else: # fuel-inefficient robot 
    R = -0.5 * np.ones((N_STATES, N_ACTIONS)) 

# discount factor
gamma = 0.99

# policy
if False: # bad policy presented above (top)
    policy = np.zeros((N_STATES, N_ACTIONS))
    policy[0,:] = [0,1,0,0]
    policy[1,:] = [0,1,0,0]
    policy[2,:] = [0,1,0,0]
    policy[3,:] = [0,1,0,0]
    policy[4,:] = [0,0,0,1]
    policy[5,:] = [0,1,0,0]
    policy[6,:] = [0,1,0,0]
    policy[7,:] = [0,1,0,0]
    policy[8,:] = [0,1,0,0]
    policy[9,:] = [0,0,1,0]
    policy[10,:] = [0,0,1,0]
elif False: # random policy
    policy = 0.25*np.ones((N_STATES, N_ACTIONS))
elif True: # optimal policy presented above (bottom)
    policy = np.zeros((N_STATES, N_ACTIONS))
    policy[0,:] = [0,1,0,0]
    policy[1,:] = [0,1,0,0]
    policy[2,:] = [0,1,0,0]
    policy[3,:] = [0,1,0,0]
    policy[4,:] = [0,0,1,0]
    policy[5,:] = [0,0,1,0]
    policy[6,:] = [0,0,1,0]
    policy[7,:] = [0,0,1,0]
    policy[8,:] = [1,0,0,0]
    policy[9,:] = [1,0,0,0]
    policy[10,:] = [1,0,0,0]

# value function
V = np.zeros(N_STATES)
V[3] = 1
V[6] = -1

for i in range(100):
    for s in range(N_STATES):
        V[s] = sum(
                    [policy[s,a]*(R[s,a]+ gamma*
                     sum([P[s,a,s1]*V[s1] for s1 in range(N_STATES)]))
                     for a in range(N_ACTIONS)]
                        )
    V[3] = 1
    V[6] = -1
print(V)

[ 0.71576205  0.74319399  0.772       1.          0.69132019  0.76103021
 -1.          0.66440699  0.64042733  0.61402305  0.60243653]


# Exercise

Find $v_\pi$ for bad and random policy, respectively. 

# Iterative Policy Evaluation for $q_\pi$ in Andrew Ng's lecture 16

<div align="center"><img src="img/Screenshot+2017-9.png" width="100%" height="10%"></div>

<div align="center"><img src="img/Screenshot+2017-2.png" width="100%" height="10%"></div>

In [3]:
# Iterative Policy Evaluation for $q_\pi$ in Andrew Ng's lecture 16

# import libraries
import numpy as np

# state
states = [0,1,2,3,4,5,6,7,8,9,10]
N_STATES = len(states)

# action
actions = [0,1,2,3] # left, right, up, down
N_ACTIONS = len(actions)

# transition probability
P = np.zeros((N_STATES, N_ACTIONS, N_STATES))  

P[0,0,:] = [1,0,0,0,0,0,0,0,0,0,0]
P[0,1,:] = [0,1,0,0,0,0,0,0,0,0,0]
P[0,2,:] = [1,0,0,0,0,0,0,0,0,0,0]
P[0,3,:] = [0,0,0,0,1,0,0,0,0,0,0]  

P[1,0,:] = [0.9,0,0,0,0.1,0,0,0,0,0,0]
P[1,1,:] = [0,0,0.9,0,0,0.1,0,0,0,0,0]
P[1,2,:] = [0,1,0,0,0,0,0,0,0,0,0]
P[1,3,:] = [0,1,0,0,0,0,0,0,0,0,0] 

P[2,0,:] = [0,1,0,0,0,0,0,0,0,0,0]
P[2,1,:] = [0,0,0,0.9,0,0,0.1,0,0,0,0]
P[2,2,:] = [0,0,1,0,0,0,0,0,0,0,0]
P[2,3,:] = [0,0,0,0,0,0.9,0.1,0,0,0,0] 

P[3,0,:] = [0,0,0,1,0,0,0,0,0,0,0]
P[3,1,:] = [0,0,0,1,0,0,0,0,0,0,0]
P[3,2,:] = [0,0,0,1,0,0,0,0,0,0,0]
P[3,3,:] = [0,0,0,1,0,0,0,0,0,0,0] 

P[4,0,:] = [0,0,0,0,1,0,0,0,0,0,0]
P[4,1,:] = [0,0,0,0,1,0,0,0,0,0,0]
P[4,2,:] = [0.9,0.1,0,0,0,0,0,0,0,0,0]
P[4,3,:] = [0,0,0,0,0,0,0,0.9,0.1,0,0] 

P[5,0,:] = [0,0,0,0,0,1,0,0,0,0,0]
P[5,1,:] = [0,0,0,0.1,0,0,0.8,0,0,0,0.1]
P[5,2,:] = [0,0.1,0.8,0.1,0,0,0,0,0,0,0]
P[5,3,:] = [0,0,0,0,0,0,0,0,0.1,0.8,0.1] 

P[6,0,:] = [0,0,0,0,0,0,1,0,0,0,0]
P[6,1,:] = [0,0,0,0,0,0,1,0,0,0,0]
P[6,2,:] = [0,0,0,0,0,0,1,0,0,0,0]
P[6,3,:] = [0,0,0,0,0,0,1,0,0,0,0]

P[7,0,:] = [0,0,0,0,0,0,0,1,0,0,0]
P[7,1,:] = [0,0,0,0,0,0,0,0,1,0,0]
P[7,2,:] = [0,0,0,0,1,0,0,0,0,0,0]
P[7,3,:] = [0,0,0,0,0,0,0,1,0,0,0] 

P[8,0,:] = [0,0,0,0,0.1,0,0,0.9,0,0,0]
P[8,1,:] = [0,0,0,0,0,0.1,0,0,0,0.9,0]
P[8,2,:] = [0,0,0,0,0,0,0,0,1,0,0]
P[8,3,:] = [0,0,0,0,0,0,0,0,1,0,0] 

P[9,0,:] = [0,0,0,0,0,0,0,0,1,0,0]
P[9,1,:] = [0,0,0,0,0,0,0.1,0,0,0,0.9]
P[9,2,:] = [0,0,0,0,0,0.9,0.1,0,0,0,0]
P[9,3,:] = [0,0,0,0,0,0,0,0,0,1,0] 

P[10,0,:] = [0,0,0,0,0,0.1,0,0,0,0.9,0]
P[10,1,:] = [0,0,0,0,0,0,0,0,0,0,1]
P[10,2,:] = [0,0,0,0,0,0.1,0.9,0,0,0,0]
P[10,3,:] = [0,0,0,0,0,0,0,0,0,0,1] 

# reward
if True: # fuel-efficient robot
    R = -0.02 * np.ones((N_STATES, N_ACTIONS)) 
else: # fuel-inefficient robot 
    R = -0.5 * np.ones((N_STATES, N_ACTIONS)) 

# discount factor
gamma = 0.99

# policy
if False: # bad policy presented above (top)
    policy = np.zeros((N_STATES, N_ACTIONS))
    policy[0,:] = [0,1,0,0]
    policy[1,:] = [0,1,0,0]
    policy[2,:] = [0,1,0,0]
    policy[3,:] = [0,1,0,0]
    policy[4,:] = [0,0,0,1]
    policy[5,:] = [0,1,0,0]
    policy[6,:] = [0,1,0,0]
    policy[7,:] = [0,1,0,0]
    policy[8,:] = [0,1,0,0]
    policy[9,:] = [0,0,1,0]
    policy[10,:] = [0,0,1,0]
elif False: # random policy
    policy = 0.25*np.ones((N_STATES, N_ACTIONS))
elif True: # optimal policy presented above (bottom)
    policy = np.zeros((N_STATES, N_ACTIONS))
    policy[0,:] = [0,1,0,0]
    policy[1,:] = [0,1,0,0]
    policy[2,:] = [0,1,0,0]
    policy[3,:] = [0,1,0,0]
    policy[4,:] = [0,0,1,0]
    policy[5,:] = [0,0,1,0]
    policy[6,:] = [0,0,1,0]
    policy[7,:] = [0,0,1,0]
    policy[8,:] = [1,0,0,0]
    policy[9,:] = [1,0,0,0]
    policy[10,:] = [1,0,0,0]

# Q function
Q = np.zeros((N_STATES, N_ACTIONS))
Q[3,:] = 1
Q[6,:] = -1

for i in range(100):
    for s in range(N_STATES):
        for a in range(N_ACTIONS):
            Q[s,a] = R[s,a]+gamma*sum(
                                        [P[s,a,s1]*
                                         sum([policy[s1,a1]*Q[s1,a1] for a1 in range(N_ACTIONS)])
                                         for s1 in range(N_STATES)]
                                            )
    Q[3,:] = 1
    Q[6,:] = -1
print(Q)

[[ 0.68860443  0.71576205  0.68860443  0.66440699]
 [ 0.68618469  0.74319399  0.71576205  0.71576205]
 [ 0.71576205  0.772       0.74428     0.55907791]
 [ 1.          1.          1.          1.        ]
 [ 0.66440699  0.66440699  0.69132019  0.63538893]
 [ 0.7334199  -0.65632878  0.76103021  0.58934978]
 [-1.         -1.         -1.         -1.        ]
 [ 0.63776292  0.61402305  0.66440699  0.63776292]
 [ 0.64042733  0.60243653  0.61402305  0.61402305]
 [ 0.61402305  0.41678095  0.55808791  0.58788282]
 [ 0.60243653  0.57641217 -0.84456801  0.57641217]]


# Exercise

Find $q_\pi$ for bad and random policy, respectively. 

# Value iteration

<div align="center"><img src="img/Value.jpg" width="50%" height="10%"></div>

http://seanheritage.com/wp-content/uploads/2017/03/Value.jpg

# Value iteration for $v_*$

- Initialize $v_*(s)=0$ for all $s$.

- Repeat.

    For every $s$  (synchronous or asynchronous) update $v_*$ using Bellman's optimality equation: 

\begin{eqnarray*}
v_*(s)&=&\max_{a}\left({\cal R}_s^a+\gamma\sum_{s'}{\cal P}^a_{ss'}v_*(s')\right)\nonumber\\
\end{eqnarray*}

- Find optimal policy $\pi_*$ by solving

$$
\pi_*(s)=\mbox{argmax}_{a}q_*(s,a)
$$

    where
$$
q_*(s,a)={\cal R}_s^a+\gamma\sum_{s'}{\cal P}^a_{ss'}v_*(s')
$$


# Value iteration for $q_*$

- Initialize $q_*(s,a)=0$ for all $s$ and $a$.

- Repeat.

    For every $s$ and $a$ (synchronous or asynchronous) update $q_*$ using Bellman's optimality equation: 

\begin{eqnarray*}
q_*(s,a)&=&{\cal R}_s^a+\gamma\sum_{s'}{\cal P}^a_{ss'}\left(\max_{a'}q_*(s',a')\right)\nonumber\\
\end{eqnarray*}

- Find optimal policy $\pi_*$ by solving

$$
\pi_*(s)=\mbox{argmax}_{a}q_*(s,a)
$$


# Value iteration for $v_*$ in Andrew Ng's lecture 16

<div align="center"><img src="img/cs188_mdp_optimal_policies.png" width="70%" height="10%"></div>

https://raw.githubusercontent.com/mebusy/notes/master/imgs/cs188_mdp_optimal_policies.png

<div align="center"><img src="img/Screenshot+2017-7.png" width="100%" height="10%"></div>

In [8]:
# Value iteration for $v_*$ in Andrew Ng's lecture 16

# import libraries
import numpy as np

# state
states = [0,1,2,3,4,5,6,7,8,9,10]
N_STATES = len(states)

# action
actions = [0,1,2,3] # left, right, up, down
N_ACTIONS = len(actions)

# transition probability
P = np.zeros((N_STATES, N_ACTIONS, N_STATES))

P[0,0,:] = [1,0,0,0,0,0,0,0,0,0,0]
P[0,1,:] = [0,1,0,0,0,0,0,0,0,0,0]
P[0,2,:] = [1,0,0,0,0,0,0,0,0,0,0]
P[0,3,:] = [0,0,0,0,1,0,0,0,0,0,0]

P[1,0,:] = [0.9,0,0,0,0.1,0,0,0,0,0,0]
P[1,1,:] = [0,0,0.9,0,0,0.1,0,0,0,0,0]
P[1,2,:] = [0,1,0,0,0,0,0,0,0,0,0]
P[1,3,:] = [0,1,0,0,0,0,0,0,0,0,0]

P[2,0,:] = [0,1,0,0,0,0,0,0,0,0,0]
P[2,1,:] = [0,0,0,0.9,0,0,0.1,0,0,0,0]
P[2,2,:] = [0,0,1,0,0,0,0,0,0,0,0]
P[2,3,:] = [0,0,0,0,0,0.9,0.1,0,0,0,0]

P[3,0,:] = [0,0,0,1,0,0,0,0,0,0,0]
P[3,1,:] = [0,0,0,1,0,0,0,0,0,0,0]
P[3,2,:] = [0,0,0,1,0,0,0,0,0,0,0]
P[3,3,:] = [0,0,0,1,0,0,0,0,0,0,0]

P[4,0,:] = [0,0,0,0,1,0,0,0,0,0,0]
P[4,1,:] = [0,0,0,0,1,0,0,0,0,0,0]
P[4,2,:] = [0.9,0.1,0,0,0,0,0,0,0,0,0]
P[4,3,:] = [0,0,0,0,0,0,0,0.9,0.1,0,0]

P[5,0,:] = [0,0,0,0,0,1,0,0,0,0,0]
P[5,1,:] = [0,0,0,0.1,0,0,0.8,0,0,0,0.1]
P[5,2,:] = [0,0.1,0.8,0.1,0,0,0,0,0,0,0]
P[5,3,:] = [0,0,0,0,0,0,0,0,0.1,0.8,0.1]

P[6,0,:] = [0,0,0,0,0,0,1,0,0,0,0]
P[6,1,:] = [0,0,0,0,0,0,1,0,0,0,0]
P[6,2,:] = [0,0,0,0,0,0,1,0,0,0,0]
P[6,3,:] = [0,0,0,0,0,0,1,0,0,0,0]

P[7,0,:] = [0,0,0,0,0,0,0,1,0,0,0]
P[7,1,:] = [0,0,0,0,0,0,0,0,1,0,0]
P[7,2,:] = [0,0,0,0,1,0,0,0,0,0,0]
P[7,3,:] = [0,0,0,0,0,0,0,1,0,0,0]

P[8,0,:] = [0,0,0,0,0.1,0,0,0.9,0,0,0]
P[8,1,:] = [0,0,0,0,0,0.1,0,0,0,0.9,0]
P[8,2,:] = [0,0,0,0,0,0,0,0,1,0,0]
P[8,3,:] = [0,0,0,0,0,0,0,0,1,0,0]

P[9,0,:] = [0,0,0,0,0,0,0,0,1,0,0]
P[9,1,:] = [0,0,0,0,0,0,0.1,0,0,0,0.9]
P[9,2,:] = [0,0,0,0,0,0.9,0.1,0,0,0,0]
P[9,3,:] = [0,0,0,0,0,0,0,0,0,1,0]

P[10,0,:] = [0,0,0,0,0,0.1,0,0,0,0.9,0]
P[10,1,:] = [0,0,0,0,0,0,0,0,0,0,1]
P[10,2,:] = [0,0,0,0,0,0.1,0.9,0,0,0,0]
P[10,3,:] = [0,0,0,0,0,0,0,0,0,0,1]

# reward
if True: # fuel-efficient robot
    R = -0.02 * np.ones((N_STATES, N_ACTIONS))
else: # fuel-inefficient robot
    R = -0.5 * np.ones((N_STATES, N_ACTIONS))

# discount factor
gamma = 0.99

# value function
V = np.zeros(N_STATES)
V[3] = 1
V[6] = -1

for i in range(100):
    for s in range(N_STATES):
        V[s] = max(
                    [R[s,a] + gamma *
                     sum([P[s,a,s1]*V[s1] for s1 in range(N_STATES)])
                     for a in range(N_ACTIONS)]
                        )
    V[3] = 1
    V[6] = -1
print(V)

# Q function
Q = np.zeros((N_STATES, N_ACTIONS))
Q[3,:] = 1
Q[6,:] = -1

for s in range(N_STATES):
    for a in range(N_ACTIONS):
        Q[s,a] = R[s,a]+gamma*sum([P[s,a,s1]*V[s1] for s1 in range(N_STATES)])
    Q[3,:] = 1
    Q[6,:] = -1
print(Q)

[ 0.71576205  0.74319399  0.772       1.          0.69132019  0.76103021
 -1.          0.66440699  0.64042733  0.61402305  0.60243653]
[[ 0.68860443  0.71576205  0.68860443  0.66440699]
 [ 0.68618469  0.74319399  0.71576205  0.71576205]
 [ 0.71576205  0.772       0.74428     0.55907791]
 [ 1.          1.          1.          1.        ]
 [ 0.66440699  0.66440699  0.69132019  0.63538893]
 [ 0.7334199  -0.65335878  0.76400021  0.58934978]
 [-1.         -1.         -1.         -1.        ]
 [ 0.63776292  0.61402305  0.66440699  0.63776292]
 [ 0.64042733  0.60243653  0.61402305  0.61402305]
 [ 0.61402305  0.41777095  0.55907791  0.58788282]
 [ 0.60243653  0.57641217 -0.83565801  0.57641217]]


# Value iteration for $q_*$ in Andrew Ng's lecture 16

<div align="center"><img src="img/cs188_mdp_optimal_policies.png" width="70%" height="10%"></div>

https://raw.githubusercontent.com/mebusy/notes/master/imgs/cs188_mdp_optimal_policies.png

<div align="center"><img src="img/Screenshot+2017-3.png" width="100%" height="10%"></div>

In [2]:
# Value iteration for $q_*$ in Andrew Ng's lecture 16

# import libraries
import numpy as np

# state
states = [0,1,2,3,4,5,6,7,8,9,10]
N_STATES = len(states)

# action
actions = [0,1,2,3] # left, right, up, down
N_ACTIONS = len(actions)

# transition probability
P = np.zeros((N_STATES, N_ACTIONS, N_STATES))  

P[0,0,:] = [1,0,0,0,0,0,0,0,0,0,0]
P[0,1,:] = [0,1,0,0,0,0,0,0,0,0,0]
P[0,2,:] = [1,0,0,0,0,0,0,0,0,0,0]
P[0,3,:] = [0,0,0,0,1,0,0,0,0,0,0]  

P[1,0,:] = [0.9,0,0,0,0.1,0,0,0,0,0,0]
P[1,1,:] = [0,0,0.9,0,0,0.1,0,0,0,0,0]
P[1,2,:] = [0,1,0,0,0,0,0,0,0,0,0]
P[1,3,:] = [0,1,0,0,0,0,0,0,0,0,0] 

P[2,0,:] = [0,1,0,0,0,0,0,0,0,0,0]
P[2,1,:] = [0,0,0,0.9,0,0,0.1,0,0,0,0]
P[2,2,:] = [0,0,1,0,0,0,0,0,0,0,0]
P[2,3,:] = [0,0,0,0,0,0.9,0.1,0,0,0,0] 

P[3,0,:] = [0,0,0,1,0,0,0,0,0,0,0]
P[3,1,:] = [0,0,0,1,0,0,0,0,0,0,0]
P[3,2,:] = [0,0,0,1,0,0,0,0,0,0,0]
P[3,3,:] = [0,0,0,1,0,0,0,0,0,0,0] 

P[4,0,:] = [0,0,0,0,1,0,0,0,0,0,0]
P[4,1,:] = [0,0,0,0,1,0,0,0,0,0,0]
P[4,2,:] = [0.9,0.1,0,0,0,0,0,0,0,0,0]
P[4,3,:] = [0,0,0,0,0,0,0,0.9,0.1,0,0] 

P[5,0,:] = [0,0,0,0,0,1,0,0,0,0,0]
P[5,1,:] = [0,0,0,0.1,0,0,0.8,0,0,0,0.1]
P[5,2,:] = [0,0.1,0.8,0.1,0,0,0,0,0,0,0]
P[5,3,:] = [0,0,0,0,0,0,0,0,0.1,0.8,0.1] 

P[6,0,:] = [0,0,0,0,0,0,1,0,0,0,0]
P[6,1,:] = [0,0,0,0,0,0,1,0,0,0,0]
P[6,2,:] = [0,0,0,0,0,0,1,0,0,0,0]
P[6,3,:] = [0,0,0,0,0,0,1,0,0,0,0]

P[7,0,:] = [0,0,0,0,0,0,0,1,0,0,0]
P[7,1,:] = [0,0,0,0,0,0,0,0,1,0,0]
P[7,2,:] = [0,0,0,0,1,0,0,0,0,0,0]
P[7,3,:] = [0,0,0,0,0,0,0,1,0,0,0] 

P[8,0,:] = [0,0,0,0,0.1,0,0,0.9,0,0,0]
P[8,1,:] = [0,0,0,0,0,0.1,0,0,0,0.9,0]
P[8,2,:] = [0,0,0,0,0,0,0,0,1,0,0]
P[8,3,:] = [0,0,0,0,0,0,0,0,1,0,0] 

P[9,0,:] = [0,0,0,0,0,0,0,0,1,0,0]
P[9,1,:] = [0,0,0,0,0,0,0.1,0,0,0,0.9]
P[9,2,:] = [0,0,0,0,0,0.9,0.1,0,0,0,0]
P[9,3,:] = [0,0,0,0,0,0,0,0,0,1,0] 

P[10,0,:] = [0,0,0,0,0,0.1,0,0,0,0.9,0]
P[10,1,:] = [0,0,0,0,0,0,0,0,0,0,1]
P[10,2,:] = [0,0,0,0,0,0.1,0.9,0,0,0,0]
P[10,3,:] = [0,0,0,0,0,0,0,0,0,0,1] 

# reward
if False: # fuel-efficient robot
    R = -0.02 * np.ones((N_STATES, N_ACTIONS)) 
else: # fuel-inefficient robot 
    R = -0.5 * np.ones((N_STATES, N_ACTIONS)) 

# discount factor
gamma = 0.99

# Q function
Q = np.zeros((N_STATES, N_ACTIONS))
Q[3,:] = 1
Q[6,:] = -1

for t in range(100):
    for s in range(N_STATES):
        for a in range(N_ACTIONS):
            Q[s,a] = R[s,a]+gamma*sum(
                                        [P[s,a,s1]*
                                         max([Q[s1,a1] for a1 in range(N_ACTIONS)]) 
                                         for s1 in range(N_STATES)]
                                            )
    Q[3,:] = 1
    Q[6,:] = -1    
print(Q)

[[-1.25396202 -0.76157779 -1.25396202 -1.69267636]
 [-1.29783345 -0.26421999 -0.76157779 -0.76157779]
 [-0.76157779  0.292      -0.21092    -0.81852795]
 [ 1.          1.          1.          1.        ]
 [-1.69267636 -1.69267636 -1.20472359 -2.13656999]
 [-0.74391994 -1.37188536 -0.24638378 -1.44348477]
 [-1.         -1.         -1.         -1.        ]
 [-2.17574959 -1.78395358 -1.69267636 -2.17574959]
 [-2.12744227 -1.29692281 -1.78395358 -1.78395358]
 [-1.78395358 -1.80306822 -0.86703795 -1.35836757]
 [-1.29692281 -1.78395358 -1.85198199 -1.78395358]]


# Policy iteration



<img src="img/Policy Iteration.png"/>

http://www0.cs.ucl.ac.uk/staff/d.silver/web/Teaching_files/DP.pdf

# Policy iteration using $v$

- Initialize $\pi$ randomly.

- Repeat

    [Policy evaluation] Evalate $v_\pi$ by iterating Bellman's expectation equation.
\begin{eqnarray*}
v_\pi(s)&=&\sum_{a}\pi(a|s)\left({\cal R}_s^a+\gamma\sum_{s'}{\cal P}^a_{ss'}v_\pi(s')\right)\nonumber\\
\end{eqnarray*}
    
    [Policy improvement] Improve $\pi$ by solving

$$
\pi(s)=\mbox{argmax}_{a}q_\pi(s,a)
$$

    where
$$
q_\pi(s,a)={\cal R}_s^a+\gamma\sum_{s'}{\cal P}^a_{ss'}v_\pi(s')
$$

# Policy iteration using $q$

- Initialize $\pi$ randomly.

- Repeat

    [Policy evaluation] Evalate $q_\pi$ by iterating Bellman's expectation equation.
\begin{eqnarray*}
q_\pi(s,a)&=&{\cal R}_s^a+\gamma\sum_{s'}{\cal P}^a_{ss'}\left(\sum_{a'}\pi(a'|s')q_\pi(s',a')\right)\nonumber\\
\end{eqnarray*}
    
    [Policy improvement] Improve $\pi$ by solving

$$
\pi(s)=\mbox{argmax}_{a}q_\pi(s,a)
$$

# Policy iteration using $v$ in Andrew Ng's lecture 16

<div align="center"><img src="img/cs188_mdp_optimal_policies.png" width="70%" height="10%"></div>

https://raw.githubusercontent.com/mebusy/notes/master/imgs/cs188_mdp_optimal_policies.png

<div align="center"><img src="img/Screenshot+2017-11.png" width="100%" height="10%"></div>

<div align="center"><img src="img/Screenshot+2017-10.png" width="100%" height="10%"></div>




In [None]:
# Policy iteration using $v$ in Andrew Ng's lecture 16

# import libraries
import numpy as np

# state
states = [0,1,2,3,4,5,6,7,8,9,10]
N_STATES = len(states)

# action
actions = [0,1,2,3] # left, right, up, down
N_ACTIONS = len(actions)

# transition probability
P = np.zeros((N_STATES, N_ACTIONS, N_STATES))

P[0,0,:] = [1,0,0,0,0,0,0,0,0,0,0]
P[0,1,:] = [0,1,0,0,0,0,0,0,0,0,0]
P[0,2,:] = [1,0,0,0,0,0,0,0,0,0,0]
P[0,3,:] = [0,0,0,0,1,0,0,0,0,0,0]

P[1,0,:] = [0.9,0,0,0,0.1,0,0,0,0,0,0]
P[1,1,:] = [0,0,0.9,0,0,0.1,0,0,0,0,0]
P[1,2,:] = [0,1,0,0,0,0,0,0,0,0,0]
P[1,3,:] = [0,1,0,0,0,0,0,0,0,0,0]

P[2,0,:] = [0,1,0,0,0,0,0,0,0,0,0]
P[2,1,:] = [0,0,0,0.9,0,0,0.1,0,0,0,0]
P[2,2,:] = [0,0,1,0,0,0,0,0,0,0,0]
P[2,3,:] = [0,0,0,0,0,0.9,0.1,0,0,0,0]

P[3,0,:] = [0,0,0,1,0,0,0,0,0,0,0]
P[3,1,:] = [0,0,0,1,0,0,0,0,0,0,0]
P[3,2,:] = [0,0,0,1,0,0,0,0,0,0,0]
P[3,3,:] = [0,0,0,1,0,0,0,0,0,0,0]

P[4,0,:] = [0,0,0,0,1,0,0,0,0,0,0]
P[4,1,:] = [0,0,0,0,1,0,0,0,0,0,0]
P[4,2,:] = [0.9,0.1,0,0,0,0,0,0,0,0,0]
P[4,3,:] = [0,0,0,0,0,0,0,0.9,0.1,0,0]

P[5,0,:] = [0,0,0,0,0,1,0,0,0,0,0]
P[5,1,:] = [0,0,0,0.1,0,0,0.8,0,0,0,0.1]
P[5,2,:] = [0,0.1,0.8,0.1,0,0,0,0,0,0,0]
P[5,3,:] = [0,0,0,0,0,0,0,0,0.1,0.8,0.1]

P[6,0,:] = [0,0,0,0,0,0,1,0,0,0,0]
P[6,1,:] = [0,0,0,0,0,0,1,0,0,0,0]
P[6,2,:] = [0,0,0,0,0,0,1,0,0,0,0]
P[6,3,:] = [0,0,0,0,0,0,1,0,0,0,0]

P[7,0,:] = [0,0,0,0,0,0,0,1,0,0,0]
P[7,1,:] = [0,0,0,0,0,0,0,0,1,0,0]
P[7,2,:] = [0,0,0,0,1,0,0,0,0,0,0]
P[7,3,:] = [0,0,0,0,0,0,0,1,0,0,0]

P[8,0,:] = [0,0,0,0,0.1,0,0,0.9,0,0,0]
P[8,1,:] = [0,0,0,0,0,0.1,0,0,0,0.9,0]
P[8,2,:] = [0,0,0,0,0,0,0,0,1,0,0]
P[8,3,:] = [0,0,0,0,0,0,0,0,1,0,0]

P[9,0,:] = [0,0,0,0,0,0,0,0,1,0,0]
P[9,1,:] = [0,0,0,0,0,0,0.1,0,0,0,0.9]
P[9,2,:] = [0,0,0,0,0,0.9,0.1,0,0,0,0]
P[9,3,:] = [0,0,0,0,0,0,0,0,0,1,0]

P[10,0,:] = [0,0,0,0,0,0.1,0,0,0,0.9,0]
P[10,1,:] = [0,0,0,0,0,0,0,0,0,0,1]
P[10,2,:] = [0,0,0,0,0,0.1,0.9,0,0,0,0]
P[10,3,:] = [0,0,0,0,0,0,0,0,0,0,1]

# reward
if True: # fuel-efficient robot
    R = -0.02 * np.ones((N_STATES, N_ACTIONS))
else: # fuel-inefficient robot
    R = -0.5 * np.ones((N_STATES, N_ACTIONS))

# discount factor
gamma = 0.99

# policy
if False: # bad policy presented above (top)
    policy = np.zeros((N_STATES, N_ACTIONS))
    policy[0,:] = [0,1,0,0]
    policy[1,:] = [0,1,0,0]
    policy[2,:] = [0,1,0,0]
    policy[3,:] = [0,1,0,0]
    policy[4,:] = [0,0,0,1]
    policy[5,:] = [0,1,0,0]
    policy[6,:] = [0,1,0,0]
    policy[7,:] = [0,1,0,0]
    policy[8,:] = [0,1,0,0]
    policy[9,:] = [0,0,1,0]
    policy[10,:] = [0,0,1,0]
elif True: # random policy
    policy = 0.25*np.ones((N_STATES, N_ACTIONS))
elif True: # optimal policy presented above (bottom)
    policy = np.zeros((N_STATES, N_ACTIONS))
    policy[0,:] = [0,1,0,0]
    policy[1,:] = [0,1,0,0]
    policy[2,:] = [0,1,0,0]
    policy[3,:] = [0,1,0,0]
    policy[4,:] = [0,0,1,0]
    policy[5,:] = [0,0,1,0]
    policy[6,:] = [0,0,1,0]
    policy[7,:] = [0,0,1,0]
    policy[8,:] = [1,0,0,0]
    policy[9,:] = [1,0,0,0]
    policy[10,:] = [1,0,0,0]

# value function
V = np.zeros(N_STATES)
V[3] = 1
V[6] = -1

# Q function
Q = np.zeros((N_STATES, N_ACTIONS))
Q[3,:] = 1
Q[6,:] = -1

for t in range(100):

    # policy evaluation - v
    for i in range(100):
        for s in range(N_STATES):
            V[s] = sum(
                        [policy[s, a] * (R[s, a] + gamma *
                         sum([P[s, a, s1] * V[s1] for s1 in range(N_STATES)]))
                         for a in range(N_ACTIONS)]
                            )
        V[3] = 1
        V[6] = -1

    # policy evaluation - q
    for s in range(N_STATES):
        for a in range(N_ACTIONS):
            Q[s, a] = R[s, a] + gamma * sum([P[s, a, s1] * V[s1] for s1 in range(N_STATES)])
        Q[3, :] = 1
        Q[6, :] = -1

    # policy improvement
    policy = np.zeros((N_STATES, N_ACTIONS))
    m = np.argmax(Q, 1)
    for i in range(N_STATES):
        policy[i, m[i]] = 1

print(Q)

# Policy iteration using $q$ in Andrew Ng's lecture 16

<div align="center"><img src="img/cs188_mdp_optimal_policies.png" width="70%" height="10%"></div>

https://raw.githubusercontent.com/mebusy/notes/master/imgs/cs188_mdp_optimal_policies.png

<div align="center"><img src="img/Screenshot+2017-11.png" width="100%" height="10%"></div>

<div align="center"><img src="img/Screenshot+2017-4.png" width="100%" height="10%"></div>

In [1]:
# Policy iteration using $q$ in Andrew Ng's lecture 16

# import libraries
import numpy as np

# state
states = [0,1,2,3,4,5,6,7,8,9,10]
N_STATES = len(states)

# action
actions = [0,1,2,3] # left, right, up, down
N_ACTIONS = len(actions)

# transition probability
P = np.zeros((N_STATES, N_ACTIONS, N_STATES))  

P[0,0,:] = [1,0,0,0,0,0,0,0,0,0,0]
P[0,1,:] = [0,1,0,0,0,0,0,0,0,0,0]
P[0,2,:] = [1,0,0,0,0,0,0,0,0,0,0]
P[0,3,:] = [0,0,0,0,1,0,0,0,0,0,0]  

P[1,0,:] = [0.9,0,0,0,0.1,0,0,0,0,0,0]
P[1,1,:] = [0,0,0.9,0,0,0.1,0,0,0,0,0]
P[1,2,:] = [0,1,0,0,0,0,0,0,0,0,0]
P[1,3,:] = [0,1,0,0,0,0,0,0,0,0,0] 

P[2,0,:] = [0,1,0,0,0,0,0,0,0,0,0]
P[2,1,:] = [0,0,0,0.9,0,0,0.1,0,0,0,0]
P[2,2,:] = [0,0,1,0,0,0,0,0,0,0,0]
P[2,3,:] = [0,0,0,0,0,0.9,0.1,0,0,0,0] 

P[3,0,:] = [0,0,0,1,0,0,0,0,0,0,0]
P[3,1,:] = [0,0,0,1,0,0,0,0,0,0,0]
P[3,2,:] = [0,0,0,1,0,0,0,0,0,0,0]
P[3,3,:] = [0,0,0,1,0,0,0,0,0,0,0] 

P[4,0,:] = [0,0,0,0,1,0,0,0,0,0,0]
P[4,1,:] = [0,0,0,0,1,0,0,0,0,0,0]
P[4,2,:] = [0.9,0.1,0,0,0,0,0,0,0,0,0]
P[4,3,:] = [0,0,0,0,0,0,0,0.9,0.1,0,0] 

P[5,0,:] = [0,0,0,0,0,1,0,0,0,0,0]
P[5,1,:] = [0,0,0,0.1,0,0,0.8,0,0,0,0.1]
P[5,2,:] = [0,0.1,0.8,0.1,0,0,0,0,0,0,0]
P[5,3,:] = [0,0,0,0,0,0,0,0,0.1,0.8,0.1] 

P[6,0,:] = [0,0,0,0,0,0,1,0,0,0,0]
P[6,1,:] = [0,0,0,0,0,0,1,0,0,0,0]
P[6,2,:] = [0,0,0,0,0,0,1,0,0,0,0]
P[6,3,:] = [0,0,0,0,0,0,1,0,0,0,0]

P[7,0,:] = [0,0,0,0,0,0,0,1,0,0,0]
P[7,1,:] = [0,0,0,0,0,0,0,0,1,0,0]
P[7,2,:] = [0,0,0,0,1,0,0,0,0,0,0]
P[7,3,:] = [0,0,0,0,0,0,0,1,0,0,0] 

P[8,0,:] = [0,0,0,0,0.1,0,0,0.9,0,0,0]
P[8,1,:] = [0,0,0,0,0,0.1,0,0,0,0.9,0]
P[8,2,:] = [0,0,0,0,0,0,0,0,1,0,0]
P[8,3,:] = [0,0,0,0,0,0,0,0,1,0,0] 

P[9,0,:] = [0,0,0,0,0,0,0,0,1,0,0]
P[9,1,:] = [0,0,0,0,0,0,0.1,0,0,0,0.9]
P[9,2,:] = [0,0,0,0,0,0.9,0.1,0,0,0,0]
P[9,3,:] = [0,0,0,0,0,0,0,0,0,1,0] 

P[10,0,:] = [0,0,0,0,0,0.1,0,0,0,0.9,0]
P[10,1,:] = [0,0,0,0,0,0,0,0,0,0,1]
P[10,2,:] = [0,0,0,0,0,0.1,0.9,0,0,0,0]
P[10,3,:] = [0,0,0,0,0,0,0,0,0,0,1] 

# reward
if False: # fuel-efficient robot
    R = -0.02 * np.ones((N_STATES, N_ACTIONS)) 
else: # fuel-inefficient robot 
    R = -0.5 * np.ones((N_STATES, N_ACTIONS)) 

# discount factor
gamma = 0.99

# policy
if False: # bad policy presented above (top)
    policy = np.zeros((N_STATES, N_ACTIONS))
    policy[0,:] = [0,1,0,0]
    policy[1,:] = [0,1,0,0]
    policy[2,:] = [0,1,0,0]
    policy[3,:] = [0,1,0,0]
    policy[4,:] = [0,0,0,1]
    policy[5,:] = [0,1,0,0]
    policy[6,:] = [0,1,0,0]
    policy[7,:] = [0,1,0,0]
    policy[8,:] = [0,1,0,0]
    policy[9,:] = [0,0,1,0]
    policy[10,:] = [0,0,1,0]
elif True: # random policy
    policy = 0.25*np.ones((N_STATES, N_ACTIONS))
elif True: # optimal policy presented above (bottom)
    policy = np.zeros((N_STATES, N_ACTIONS))
    policy[0,:] = [0,1,0,0]
    policy[1,:] = [0,1,0,0]
    policy[2,:] = [0,1,0,0]
    policy[3,:] = [0,1,0,0]
    policy[4,:] = [0,0,1,0]
    policy[5,:] = [0,0,1,0]
    policy[6,:] = [0,0,1,0]
    policy[7,:] = [0,0,1,0]
    policy[8,:] = [1,0,0,0]
    policy[9,:] = [1,0,0,0]
    policy[10,:] = [1,0,0,0]

# Q function
Q = np.zeros((N_STATES, N_ACTIONS))
Q[3,:] = 1
Q[6,:] = -1

for t in range(100):

    # policy evaluation
    for i in range(100):
        for s in range(N_STATES):
            for a in range(N_ACTIONS):
                Q[s, a] = R[s, a] + gamma * sum(
                                                    [P[s, a, s1] * 
                                                     sum([policy[s1, a1] * Q[s1, a1] for a1 in range(N_ACTIONS)]) 
                                                     for s1 in range(N_STATES)]
                                                        )
    
    # policy improvement
    policy = np.zeros((N_STATES, N_ACTIONS))
    m = np.argmax(Q,1)
    for i in range(N_STATES):
        policy[i,m[i]] = 1
    
print(Q)

[[-50. -50. -50. -50.]
 [-50. -50. -50. -50.]
 [-50. -50. -50. -50.]
 [-50. -50. -50. -50.]
 [-50. -50. -50. -50.]
 [-50. -50. -50. -50.]
 [-50. -50. -50. -50.]
 [-50. -50. -50. -50.]
 [-50. -50. -50. -50.]
 [-50. -50. -50. -50.]
 [-50. -50. -50. -50.]]


# Asynchronous Dynamic Programming




### In-Place Dynamic Programming



### Prioritised Sweeping



### Real-Time Dynamic Programming

# In-Place Dynamic Programming

<img src="img/In-Place Dynamic Programming.png"/>

http://www0.cs.ucl.ac.uk/staff/d.silver/web/Teaching_files/DP.pdf

# Prioritised Sweeping

<img src="img/Prioritised Sweeping.png"/>

http://www0.cs.ucl.ac.uk/staff/d.silver/web/Teaching_files/DP.pdf

# Real-Time Dynamic Programming

<img src="img/Real-Time Dynamic Programming.png"/>

http://www0.cs.ucl.ac.uk/staff/d.silver/web/Teaching_files/DP.pdf

# Exercise

Check whether the code I provide is synchronous or asynchronous.
If the code is asynchronous, make is synchronous.  