






# Iterative Policy Evaluation, VI and PI
 
Sungchul Lee  




# References

- Reinforcement Learning, Lecture 4: Model-Free Prediction. David Silver. [video](https://www.youtube.com/watch?v=PnHCvfgC_ZA&list=PL7-jPKtc4r78-wCZcQn5IqyuWhBZ8fOxT&index=4) [slide](http://www0.cs.ucl.ac.uk/staff/d.silver/web/Teaching_files/MC-TD.pdf)

- Reinforcement Learning, Lecture 5: Model-Free Control. David Silver. [video](https://www.youtube.com/watch?v=0g4j2k_Ggc4&index=5&list=PL7-jPKtc4r78-wCZcQn5IqyuWhBZ8fOxT) [slide](http://www0.cs.ucl.ac.uk/staff/d.silver/web/Teaching_files/control.pdf)

- Tutorial: Deep Reinforcement Learning, ICML 2016. David Silver. [slide](http://icml.cc/2016/tutorials/deep_rl_tutorial.pdf)

- Machine Learning, Part III: The Q-learning Algorithm. Jake Bennett. [article](https://articles.wearepop.com/secret-formula-for-self-learning-computers)




# How to run these slides yourself

**Set up the Python environment**

- Install RISE, an interactive presentation viewer for Jupyter notebooks

|Problem|Algorithm|
|---|---|
|Policy evaluation|Iterative Policy Evaluation|
|Policy improvement|Value Iteration|
|Policy improvement|Policy Iteration|

# Iterative Policy Evaluation - $v_\pi$

- Initialize $q_{\pi}(s,a)=0$ and $v_{\pi}(s)=0$ for all $s$ and $a$.

- Repeat.

    For every $s$ and $a$ (synchronous or asynchronous) update $q_\pi$ and $v_\pi$ using Bellman's expectation equation: 


\begin{eqnarray*}
q_\pi(s,a)&=&{\cal R}_s^a+\gamma\sum_{s'}{\cal P}^a_{ss'}\left(\sum_{a'}\pi(a'|s')q_\pi(s',a')\right)\nonumber\\
v_\pi(s)&=&\sum_{a}\pi(a|s)\left({\cal R}_s^a+\gamma\sum_{s'}{\cal P}^a_{ss'}v_\pi(s')\right)\nonumber\\
\end{eqnarray*}
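
The sweep described above can also be written with array operations and stopped once the values no longer change. The following is a minimal sketch under stated assumptions, not the notebook's own implementation: the function name `evaluate_policy`, the tolerance `tol`, and the two-state toy MDP are hypothetical, and terminal states are not pinned the way they are in the grid-world cells.

```python
import numpy as np

def evaluate_policy(P, R, policy, gamma, tol=1e-10, max_sweeps=10_000):
    """Synchronous iterative policy evaluation (hypothetical helper).

    P[s, a, s1]  : transition probability, shape (S, A, S)
    R[s, a]      : expected immediate reward, shape (S, A)
    policy[s, a] : probability pi(a|s), shape (S, A)
    """
    V = np.zeros(P.shape[0])
    for _ in range(max_sweeps):
        # q[s, a] = R[s, a] + gamma * sum_s1 P[s, a, s1] * V[s1]
        q = R + gamma * (P @ V)
        # v[s] = sum_a pi(a|s) * q[s, a]
        V_new = (policy * q).sum(axis=1)
        # stop when the largest change in one sweep is below tol
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new
    return V

# Toy MDP (an assumption for illustration): 2 states, 2 actions.
# Action 0 stays put, action 1 moves to the other state;
# only (state 0, action 1) pays a reward of 1.
P = np.zeros((2, 2, 2))
P[0, 0, 0] = P[1, 0, 1] = 1.0
P[0, 1, 1] = P[1, 1, 0] = 1.0
R = np.array([[0.0, 1.0], [0.0, 0.0]])
policy = 0.5 * np.ones((2, 2))  # uniform random policy
gamma = 0.9

V = evaluate_policy(P, R, policy, gamma)
print(V)
```

Because $v_\pi$ solves a linear system, the result can be cross-checked against `np.linalg.solve(np.eye(2) - gamma * P_pi, R_pi)`, where `P_pi` and `R_pi` are the policy-averaged transition matrix and reward vector.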

# Bad Policy in Andrew Ng's Lecture 16

|$\Rightarrow$|$\Rightarrow$|$\Rightarrow$|1|
|------|------|------|------|
|$\Downarrow$|H|$\Rightarrow$|-1|
|$\Rightarrow$|$\Rightarrow$|$\Uparrow$|$\Uparrow$|

# Optimal Policy in Andrew Ng's Lecture 16

|$\Rightarrow$|$\Rightarrow$|$\Rightarrow$|1|
|------|------|------|------|
|$\Uparrow$|H|$\Uparrow$|-1|
|$\Uparrow$|$\Leftarrow$|$\Leftarrow$|$\Leftarrow$|

In [None]:
# import libraries
import numpy as np

In [None]:
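# states 0..10: the 11 free cells of the 3x4 grid above, numbered row by row
# and skipping the blocked cell H; states 3 and 6 are the +1 and -1 terminal cells
states = [0,1,2,3,4,5,6,7,8,9,10]
N_STATES = len(states)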

In [None]:
# action
actions = [0,1,2,3] # left, right, up, down
N_ACTIONS = len(actions)

In [None]:
# transition probability: P[s, a, s1] is the probability of moving from state s to state s1 under action a
P = np.zeros((N_STATES, N_ACTIONS, N_STATES))  

P[0,0,:] = [1,0,0,0,0,0,0,0,0,0,0]
P[0,1,:] = [0,1,0,0,0,0,0,0,0,0,0]
P[0,2,:] = [1,0,0,0,0,0,0,0,0,0,0]
P[0,3,:] = [0,0,0,0,1,0,0,0,0,0,0]  

P[1,0,:] = [0.9,0,0,0,0.1,0,0,0,0,0,0]
P[1,1,:] = [0,0,0.9,0,0,0.1,0,0,0,0,0]
P[1,2,:] = [0,1,0,0,0,0,0,0,0,0,0]
P[1,3,:] = [0,1,0,0,0,0,0,0,0,0,0] 

P[2,0,:] = [0,1,0,0,0,0,0,0,0,0,0]
P[2,1,:] = [0,0,0,0.9,0,0,0.1,0,0,0,0]
P[2,2,:] = [0,0,1,0,0,0,0,0,0,0,0]
P[2,3,:] = [0,0,0,0,0,0.9,0.1,0,0,0,0] 

P[3,0,:] = [0,0,0,1,0,0,0,0,0,0,0]
P[3,1,:] = [0,0,0,1,0,0,0,0,0,0,0]
P[3,2,:] = [0,0,0,1,0,0,0,0,0,0,0]
P[3,3,:] = [0,0,0,1,0,0,0,0,0,0,0] 

P[4,0,:] = [0,0,0,0,1,0,0,0,0,0,0]
P[4,1,:] = [0,0,0,0,1,0,0,0,0,0,0]
P[4,2,:] = [0.9,0.1,0,0,0,0,0,0,0,0,0]
P[4,3,:] = [0,0,0,0,0,0,0,0.9,0.1,0,0] 

P[5,0,:] = [0,0,0,0,0,1,0,0,0,0,0]
P[5,1,:] = [0,0,0,0.1,0,0,0.8,0,0,0,0.1]
P[5,2,:] = [0,0.1,0.8,0.1,0,0,0,0,0,0,0]
P[5,3,:] = [0,0,0,0,0,0,0,0,0.1,0.8,0.1] 

P[6,0,:] = [0,0,0,0,0,0,1,0,0,0,0]
P[6,1,:] = [0,0,0,0,0,0,1,0,0,0,0]
P[6,2,:] = [0,0,0,0,0,0,1,0,0,0,0]
P[6,3,:] = [0,0,0,0,0,0,1,0,0,0,0]

P[7,0,:] = [0,0,0,0,0,0,0,1,0,0,0]
P[7,1,:] = [0,0,0,0,0,0,0,0,1,0,0]
P[7,2,:] = [0,0,0,0,1,0,0,0,0,0,0]
P[7,3,:] = [0,0,0,0,0,0,0,1,0,0,0] 

P[8,0,:] = [0,0,0,0,0.1,0,0,0.9,0,0,0]
P[8,1,:] = [0,0,0,0,0,0.1,0,0,0,0.9,0]
P[8,2,:] = [0,0,0,0,0,0,0,0,1,0,0]
P[8,3,:] = [0,0,0,0,0,0,0,0,1,0,0] 

P[9,0,:] = [0,0,0,0,0,0,0,0,1,0,0]
P[9,1,:] = [0,0,0,0,0,0,0.1,0,0,0,0.9]
P[9,2,:] = [0,0,0,0,0,0.9,0.1,0,0,0,0]
P[9,3,:] = [0,0,0,0,0,0,0,0,0,1,0] 

P[10,0,:] = [0,0,0,0,0,0.1,0,0,0,0.9,0]
P[10,1,:] = [0,0,0,0,0,0,0,0,0,0,1]
P[10,2,:] = [0,0,0,0,0,0.1,0.9,0,0,0,0]
P[10,3,:] = [0,0,0,0,0,0,0,0,0,0,1] 

In [None]:
# reward: R[s, a] is the immediate reward for taking action a in state s (a constant per-step cost)
if True: # fuel-efficient robot
    R = -0.02 * np.ones((N_STATES, N_ACTIONS)) 
else: # fuel-inefficient robot 
    R = -0.5 * np.ones((N_STATES, N_ACTIONS)) 

In [None]:
# discount factor
gamma = 0.99

In [None]:
# policy: policy[s, a] is the probability pi(a|s) of taking action a in state s
if False: # bad policy presented above (top)
    policy = np.zeros((N_STATES, N_ACTIONS))
    policy[0,:] = [0,1,0,0]
    policy[1,:] = [0,1,0,0]
    policy[2,:] = [0,1,0,0]
    policy[3,:] = [0,1,0,0]
    policy[4,:] = [0,0,0,1]
    policy[5,:] = [0,1,0,0]
    policy[6,:] = [0,1,0,0]
    policy[7,:] = [0,1,0,0]
    policy[8,:] = [0,1,0,0]
    policy[9,:] = [0,0,1,0]
    policy[10,:] = [0,0,1,0]
elif False: # random policy
    policy = 0.25*np.ones((N_STATES, N_ACTIONS))
elif True: # optimal policy presented above (bottom)
    policy = np.zeros((N_STATES, N_ACTIONS))
    policy[0,:] = [0,1,0,0]
    policy[1,:] = [0,1,0,0]
    policy[2,:] = [0,1,0,0]
    policy[3,:] = [0,1,0,0]
    policy[4,:] = [0,0,1,0]
    policy[5,:] = [0,0,1,0]
    policy[6,:] = [0,0,1,0]
    policy[7,:] = [0,0,1,0]
    policy[8,:] = [1,0,0,0]
    policy[9,:] = [1,0,0,0]
    policy[10,:] = [1,0,0,0]

In [None]:
# value function, initialized to 0; the terminal states 3 and 6 are pinned to +1 and -1
V = np.zeros(N_STATES)
V[3] = 1
V[6] = -1

In [None]:
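# 100 in-place sweeps of the Bellman expectation backup for v_pi
for i in range(100):
    for s in range(N_STATES):
        V[s] = sum([policy[s,a]*(R[s,a]+ gamma*sum([P[s,a,s1]*V[s1] for s1 in range(N_STATES)])) for a in range(N_ACTIONS)])
    # terminal states keep their fixed values
    V[3] = 1
    V[6] = -1
print(V)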

In [2]:
# iterative policy evaluation - v_\pi

# import libraries
import numpy as np

# state
states = [0,1,2,3,4,5,6,7,8,9,10]
N_STATES = len(states)

# action
actions = [0,1,2,3] # left, right, up, down
N_ACTIONS = len(actions)

# transition probability
P = np.zeros((N_STATES, N_ACTIONS, N_STATES))  

P[0,0,:] = [1,0,0,0,0,0,0,0,0,0,0]
P[0,1,:] = [0,1,0,0,0,0,0,0,0,0,0]
P[0,2,:] = [1,0,0,0,0,0,0,0,0,0,0]
P[0,3,:] = [0,0,0,0,1,0,0,0,0,0,0]  

P[1,0,:] = [0.9,0,0,0,0.1,0,0,0,0,0,0]
P[1,1,:] = [0,0,0.9,0,0,0.1,0,0,0,0,0]
P[1,2,:] = [0,1,0,0,0,0,0,0,0,0,0]
P[1,3,:] = [0,1,0,0,0,0,0,0,0,0,0] 

P[2,0,:] = [0,1,0,0,0,0,0,0,0,0,0]
P[2,1,:] = [0,0,0,0.9,0,0,0.1,0,0,0,0]
P[2,2,:] = [0,0,1,0,0,0,0,0,0,0,0]
P[2,3,:] = [0,0,0,0,0,0.9,0.1,0,0,0,0] 

P[3,0,:] = [0,0,0,1,0,0,0,0,0,0,0]
P[3,1,:] = [0,0,0,1,0,0,0,0,0,0,0]
P[3,2,:] = [0,0,0,1,0,0,0,0,0,0,0]
P[3,3,:] = [0,0,0,1,0,0,0,0,0,0,0] 

P[4,0,:] = [0,0,0,0,1,0,0,0,0,0,0]
P[4,1,:] = [0,0,0,0,1,0,0,0,0,0,0]
P[4,2,:] = [0.9,0.1,0,0,0,0,0,0,0,0,0]
P[4,3,:] = [0,0,0,0,0,0,0,0.9,0.1,0,0] 

P[5,0,:] = [0,0,0,0,0,1,0,0,0,0,0]
P[5,1,:] = [0,0,0,0.1,0,0,0.8,0,0,0,0.1]
P[5,2,:] = [0,0.1,0.8,0.1,0,0,0,0,0,0,0]
P[5,3,:] = [0,0,0,0,0,0,0,0,0.1,0.8,0.1] 

P[6,0,:] = [0,0,0,0,0,0,1,0,0,0,0]
P[6,1,:] = [0,0,0,0,0,0,1,0,0,0,0]
P[6,2,:] = [0,0,0,0,0,0,1,0,0,0,0]
P[6,3,:] = [0,0,0,0,0,0,1,0,0,0,0]

P[7,0,:] = [0,0,0,0,0,0,0,1,0,0,0]
P[7,1,:] = [0,0,0,0,0,0,0,0,1,0,0]
P[7,2,:] = [0,0,0,0,1,0,0,0,0,0,0]
P[7,3,:] = [0,0,0,0,0,0,0,1,0,0,0] 

P[8,0,:] = [0,0,0,0,0.1,0,0,0.9,0,0,0]
P[8,1,:] = [0,0,0,0,0,0.1,0,0,0,0.9,0]
P[8,2,:] = [0,0,0,0,0,0,0,0,1,0,0]
P[8,3,:] = [0,0,0,0,0,0,0,0,1,0,0] 

P[9,0,:] = [0,0,0,0,0,0,0,0,1,0,0]
P[9,1,:] = [0,0,0,0,0,0,0.1,0,0,0,0.9]
P[9,2,:] = [0,0,0,0,0,0.9,0.1,0,0,0,0]
P[9,3,:] = [0,0,0,0,0,0,0,0,0,1,0] 

P[10,0,:] = [0,0,0,0,0,0.1,0,0,0,0.9,0]
P[10,1,:] = [0,0,0,0,0,0,0,0,0,0,1]
P[10,2,:] = [0,0,0,0,0,0.1,0.9,0,0,0,0]
P[10,3,:] = [0,0,0,0,0,0,0,0,0,0,1] 

# reward
if True: # fuel-efficient robot
    R = -0.02 * np.ones((N_STATES, N_ACTIONS)) 
else: # fuel-inefficient robot 
    R = -0.5 * np.ones((N_STATES, N_ACTIONS)) 

# discount factor
gamma = 0.99

# policy
if False: # bad policy presented above (top)
    policy = np.zeros((N_STATES, N_ACTIONS))
    policy[0,:] = [0,1,0,0]
    policy[1,:] = [0,1,0,0]
    policy[2,:] = [0,1,0,0]
    policy[3,:] = [0,1,0,0]
    policy[4,:] = [0,0,0,1]
    policy[5,:] = [0,1,0,0]
    policy[6,:] = [0,1,0,0]
    policy[7,:] = [0,1,0,0]
    policy[8,:] = [0,1,0,0]
    policy[9,:] = [0,0,1,0]
    policy[10,:] = [0,0,1,0]
elif False: # random policy
    policy = 0.25*np.ones((N_STATES, N_ACTIONS))
elif True: # optimal policy presented above (bottom)
    policy = np.zeros((N_STATES, N_ACTIONS))
    policy[0,:] = [0,1,0,0]
    policy[1,:] = [0,1,0,0]
    policy[2,:] = [0,1,0,0]
    policy[3,:] = [0,1,0,0]
    policy[4,:] = [0,0,1,0]
    policy[5,:] = [0,0,1,0]
    policy[6,:] = [0,0,1,0]
    policy[7,:] = [0,0,1,0]
    policy[8,:] = [1,0,0,0]
    policy[9,:] = [1,0,0,0]
    policy[10,:] = [1,0,0,0]

# value function
V = np.zeros(N_STATES)
V[3] = 1
V[6] = -1

for i in range(100):
    for s in range(N_STATES):
        V[s] = sum([policy[s,a]*(R[s,a]+ gamma*sum([P[s,a,s1]*V[s1] for s1 in range(N_STATES)])) for a in range(N_ACTIONS)])
    V[3] = 1
    V[6] = -1
print(V)

[ 0.71576205  0.74319399  0.772       1.          0.69132019  0.76103021
 -1.          0.66440699  0.64042733  0.61402305  0.60243653]


# Iterative Policy Evaluation for $q_\pi$

- Initialize $q_{\pi}(s,a)=0$ and $v_{\pi}(s)=0$ for all $s$ and $a$.

- Repeat.

    For every $s$ and $a$ (synchronous or asynchronous) update $q_\pi$ and $v_\pi$ using Bellman's expectation equation: 


\begin{eqnarray*}
q_\pi(s,a)&=&{\cal R}_s^a+\gamma\sum_{s'}{\cal P}^a_{ss'}\left(\sum_{a'}\pi(a'|s')q_\pi(s',a')\right)\nonumber\\
v_\pi(s)&=&\sum_{a}\pi(a|s)\left({\cal R}_s^a+\gamma\sum_{s'}{\cal P}^a_{ss'}v_\pi(s')\right)\nonumber\\
\end{eqnarray*}

In [None]:
# import libraries
import numpy as np

# state
states = [0,1,2,3,4,5,6,7,8,9,10]
N_STATES = len(states)

# action
actions = [0,1,2,3] # left, right, up, down
N_ACTIONS = len(actions)

# transition probability
P = np.zeros((N_STATES, N_ACTIONS, N_STATES))  

P[0,0,:] = [1,0,0,0,0,0,0,0,0,0,0]
P[0,1,:] = [0,1,0,0,0,0,0,0,0,0,0]
P[0,2,:] = [1,0,0,0,0,0,0,0,0,0,0]
P[0,3,:] = [0,0,0,0,1,0,0,0,0,0,0]  

P[1,0,:] = [0.9,0,0,0,0.1,0,0,0,0,0,0]
P[1,1,:] = [0,0,0.9,0,0,0.1,0,0,0,0,0]
P[1,2,:] = [0,1,0,0,0,0,0,0,0,0,0]
P[1,3,:] = [0,1,0,0,0,0,0,0,0,0,0] 

P[2,0,:] = [0,1,0,0,0,0,0,0,0,0,0]
P[2,1,:] = [0,0,0,0.9,0,0,0.1,0,0,0,0]
P[2,2,:] = [0,0,1,0,0,0,0,0,0,0,0]
P[2,3,:] = [0,0,0,0,0,0.9,0.1,0,0,0,0] 

P[3,0,:] = [0,0,0,1,0,0,0,0,0,0,0]
P[3,1,:] = [0,0,0,1,0,0,0,0,0,0,0]
P[3,2,:] = [0,0,0,1,0,0,0,0,0,0,0]
P[3,3,:] = [0,0,0,1,0,0,0,0,0,0,0] 

P[4,0,:] = [0,0,0,0,1,0,0,0,0,0,0]
P[4,1,:] = [0,0,0,0,1,0,0,0,0,0,0]
P[4,2,:] = [0.9,0.1,0,0,0,0,0,0,0,0,0]
P[4,3,:] = [0,0,0,0,0,0,0,0.9,0.1,0,0] 

P[5,0,:] = [0,0,0,0,0,1,0,0,0,0,0]
P[5,1,:] = [0,0,0,0.1,0,0,0.8,0,0,0,0.1]
P[5,2,:] = [0,0.1,0.8,0.1,0,0,0,0,0,0,0]
P[5,3,:] = [0,0,0,0,0,0,0,0,0.1,0.8,0.1] 

P[6,0,:] = [0,0,0,0,0,0,1,0,0,0,0]
P[6,1,:] = [0,0,0,0,0,0,1,0,0,0,0]
P[6,2,:] = [0,0,0,0,0,0,1,0,0,0,0]
P[6,3,:] = [0,0,0,0,0,0,1,0,0,0,0]

P[7,0,:] = [0,0,0,0,0,0,0,1,0,0,0]
P[7,1,:] = [0,0,0,0,0,0,0,0,1,0,0]
P[7,2,:] = [0,0,0,0,1,0,0,0,0,0,0]
P[7,3,:] = [0,0,0,0,0,0,0,1,0,0,0] 

P[8,0,:] = [0,0,0,0,0.1,0,0,0.9,0,0,0]
P[8,1,:] = [0,0,0,0,0,0.1,0,0,0,0.9,0]
P[8,2,:] = [0,0,0,0,0,0,0,0,1,0,0]
P[8,3,:] = [0,0,0,0,0,0,0,0,1,0,0] 

P[9,0,:] = [0,0,0,0,0,0,0,0,1,0,0]
P[9,1,:] = [0,0,0,0,0,0,0.1,0,0,0,0.9]
P[9,2,:] = [0,0,0,0,0,0.9,0.1,0,0,0,0]
P[9,3,:] = [0,0,0,0,0,0,0,0,0,1,0] 

P[10,0,:] = [0,0,0,0,0,0.1,0,0,0,0.9,0]
P[10,1,:] = [0,0,0,0,0,0,0,0,0,0,1]
P[10,2,:] = [0,0,0,0,0,0.1,0.9,0,0,0,0]
P[10,3,:] = [0,0,0,0,0,0,0,0,0,0,1] 

# reward
if True: # fuel-efficient robot
    R = -0.02 * np.ones((N_STATES, N_ACTIONS)) 
else: # fuel-inefficient robot 
    R = -0.5 * np.ones((N_STATES, N_ACTIONS)) 

# discount factor
gamma = 0.99

# policy
if False: # bad policy presented above (top)
    policy = np.zeros((N_STATES, N_ACTIONS))
    policy[0,:] = [0,1,0,0]
    policy[1,:] = [0,1,0,0]
    policy[2,:] = [0,1,0,0]
    policy[3,:] = [0,1,0,0]
    policy[4,:] = [0,0,0,1]
    policy[5,:] = [0,1,0,0]
    policy[6,:] = [0,1,0,0]
    policy[7,:] = [0,1,0,0]
    policy[8,:] = [0,1,0,0]
    policy[9,:] = [0,0,1,0]
    policy[10,:] = [0,0,1,0]
elif False: # random policy
    policy = 0.25*np.ones((N_STATES, N_ACTIONS))
elif True: # optimal policy presented above (bottom)
    policy = np.zeros((N_STATES, N_ACTIONS))
    policy[0,:] = [0,1,0,0]
    policy[1,:] = [0,1,0,0]
    policy[2,:] = [0,1,0,0]
    policy[3,:] = [0,1,0,0]
    policy[4,:] = [0,0,1,0]
    policy[5,:] = [0,0,1,0]
    policy[6,:] = [0,0,1,0]
    policy[7,:] = [0,0,1,0]
    policy[8,:] = [1,0,0,0]
    policy[9,:] = [1,0,0,0]
    policy[10,:] = [1,0,0,0]

In [None]:
# value function
# V = np.zeros(N_STATES)
# V[3] = 1
# V[6] = -1

# Q function
Q = np.zeros((N_STATES, N_ACTIONS))
Q[3,:] = 1
Q[6,:] = -1

In [None]:
# for i in range(100):
#     for s in range(N_STATES):
#         V[s] = sum([policy[s,a]*(R[s,a]+ gamma*sum([P[s,a,s1]*V[s1] for s1 in range(N_STATES)])) for a in range(N_ACTIONS)])
#     V[3] = 1
#     V[6] = -1
# print(V)

# 100 in-place sweeps of the Bellman expectation backup for q_pi
for i in range(100):
    for s in range(N_STATES):
        for a in range(N_ACTIONS):
            Q[s,a] = R[s,a]+gamma*sum([P[s,a,s1]*sum([policy[s1,a1]*Q[s1,a1] for a1 in range(N_ACTIONS)]) for s1 in range(N_STATES)])
    # terminal states keep their fixed values
    Q[3,:] = 1
    Q[6,:] = -1
print(Q)

In [3]:
# iterative policy evaluation - q_\pi

# import libraries
import numpy as np

# state
states = [0,1,2,3,4,5,6,7,8,9,10]
N_STATES = len(states)

# action
actions = [0,1,2,3] # left, right, up, down
N_ACTIONS = len(actions)

# transition probability
P = np.zeros((N_STATES, N_ACTIONS, N_STATES))  

P[0,0,:] = [1,0,0,0,0,0,0,0,0,0,0]
P[0,1,:] = [0,1,0,0,0,0,0,0,0,0,0]
P[0,2,:] = [1,0,0,0,0,0,0,0,0,0,0]
P[0,3,:] = [0,0,0,0,1,0,0,0,0,0,0]  

P[1,0,:] = [0.9,0,0,0,0.1,0,0,0,0,0,0]
P[1,1,:] = [0,0,0.9,0,0,0.1,0,0,0,0,0]
P[1,2,:] = [0,1,0,0,0,0,0,0,0,0,0]
P[1,3,:] = [0,1,0,0,0,0,0,0,0,0,0] 

P[2,0,:] = [0,1,0,0,0,0,0,0,0,0,0]
P[2,1,:] = [0,0,0,0.9,0,0,0.1,0,0,0,0]
P[2,2,:] = [0,0,1,0,0,0,0,0,0,0,0]
P[2,3,:] = [0,0,0,0,0,0.9,0.1,0,0,0,0] 

P[3,0,:] = [0,0,0,1,0,0,0,0,0,0,0]
P[3,1,:] = [0,0,0,1,0,0,0,0,0,0,0]
P[3,2,:] = [0,0,0,1,0,0,0,0,0,0,0]
P[3,3,:] = [0,0,0,1,0,0,0,0,0,0,0] 

P[4,0,:] = [0,0,0,0,1,0,0,0,0,0,0]
P[4,1,:] = [0,0,0,0,1,0,0,0,0,0,0]
P[4,2,:] = [0.9,0.1,0,0,0,0,0,0,0,0,0]
P[4,3,:] = [0,0,0,0,0,0,0,0.9,0.1,0,0] 

P[5,0,:] = [0,0,0,0,0,1,0,0,0,0,0]
P[5,1,:] = [0,0,0,0.1,0,0,0.8,0,0,0,0.1]
P[5,2,:] = [0,0.1,0.8,0.1,0,0,0,0,0,0,0]
P[5,3,:] = [0,0,0,0,0,0,0,0,0.1,0.8,0.1] 

P[6,0,:] = [0,0,0,0,0,0,1,0,0,0,0]
P[6,1,:] = [0,0,0,0,0,0,1,0,0,0,0]
P[6,2,:] = [0,0,0,0,0,0,1,0,0,0,0]
P[6,3,:] = [0,0,0,0,0,0,1,0,0,0,0]

P[7,0,:] = [0,0,0,0,0,0,0,1,0,0,0]
P[7,1,:] = [0,0,0,0,0,0,0,0,1,0,0]
P[7,2,:] = [0,0,0,0,1,0,0,0,0,0,0]
P[7,3,:] = [0,0,0,0,0,0,0,1,0,0,0] 

P[8,0,:] = [0,0,0,0,0.1,0,0,0.9,0,0,0]
P[8,1,:] = [0,0,0,0,0,0.1,0,0,0,0.9,0]
P[8,2,:] = [0,0,0,0,0,0,0,0,1,0,0]
P[8,3,:] = [0,0,0,0,0,0,0,0,1,0,0] 

P[9,0,:] = [0,0,0,0,0,0,0,0,1,0,0]
P[9,1,:] = [0,0,0,0,0,0,0.1,0,0,0,0.9]
P[9,2,:] = [0,0,0,0,0,0.9,0.1,0,0,0,0]
P[9,3,:] = [0,0,0,0,0,0,0,0,0,1,0] 

P[10,0,:] = [0,0,0,0,0,0.1,0,0,0,0.9,0]
P[10,1,:] = [0,0,0,0,0,0,0,0,0,0,1]
P[10,2,:] = [0,0,0,0,0,0.1,0.9,0,0,0,0]
P[10,3,:] = [0,0,0,0,0,0,0,0,0,0,1] 

# reward
if True: # fuel-efficient robot
    R = -0.02 * np.ones((N_STATES, N_ACTIONS)) 
else: # fuel-inefficient robot 
    R = -0.5 * np.ones((N_STATES, N_ACTIONS)) 

# discount factor
gamma = 0.99

# policy
if False: # bad policy presented above (top)
    policy = np.zeros((N_STATES, N_ACTIONS))
    policy[0,:] = [0,1,0,0]
    policy[1,:] = [0,1,0,0]
    policy[2,:] = [0,1,0,0]
    policy[3,:] = [0,1,0,0]
    policy[4,:] = [0,0,0,1]
    policy[5,:] = [0,1,0,0]
    policy[6,:] = [0,1,0,0]
    policy[7,:] = [0,1,0,0]
    policy[8,:] = [0,1,0,0]
    policy[9,:] = [0,0,1,0]
    policy[10,:] = [0,0,1,0]
elif False: # random policy
    policy = 0.25*np.ones((N_STATES, N_ACTIONS))
elif True: # optimal policy presented above (bottom)
    policy = np.zeros((N_STATES, N_ACTIONS))
    policy[0,:] = [0,1,0,0]
    policy[1,:] = [0,1,0,0]
    policy[2,:] = [0,1,0,0]
    policy[3,:] = [0,1,0,0]
    policy[4,:] = [0,0,1,0]
    policy[5,:] = [0,0,1,0]
    policy[6,:] = [0,0,1,0]
    policy[7,:] = [0,0,1,0]
    policy[8,:] = [1,0,0,0]
    policy[9,:] = [1,0,0,0]
    policy[10,:] = [1,0,0,0]

# value function
# V = np.zeros(N_STATES)
# V[3] = 1
# V[6] = -1

# Q function
Q = np.zeros((N_STATES, N_ACTIONS))
Q[3,:] = 1
Q[6,:] = -1

# for i in range(100):
#     for s in range(N_STATES):
#         V[s] = sum([policy[s,a]*(R[s,a]+ gamma*sum([P[s,a,s1]*V[s1] for s1 in range(N_STATES)])) for a in range(N_ACTIONS)])
#     V[3] = 1
#     V[6] = -1
# print(V)

for i in range(100):
    for s in range(N_STATES):
        for a in range(N_ACTIONS):
            Q[s,a] = R[s,a]+gamma*sum([P[s,a,s1]*sum([policy[s1,a1]*Q[s1,a1] for a1 in range(N_ACTIONS)]) for s1 in range(N_STATES)])
    Q[3,:] = 1
    Q[6,:] = -1
print(Q)

[[ 0.68860443  0.71576205  0.68860443  0.66440699]
 [ 0.68618469  0.74319399  0.71576205  0.71576205]
 [ 0.71576205  0.772       0.74428     0.55907791]
 [ 1.          1.          1.          1.        ]
 [ 0.66440699  0.66440699  0.69132019  0.63538893]
 [ 0.7334199  -0.65632878  0.76103021  0.58934978]
 [-1.         -1.         -1.         -1.        ]
 [ 0.63776292  0.61402305  0.66440699  0.63776292]
 [ 0.64042733  0.60243653  0.61402305  0.61402305]
 [ 0.61402305  0.41678095  0.55808791  0.58788282]
 [ 0.60243653  0.57641217 -0.84456801  0.57641217]]
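
The two evaluations can be cross-checked through the identity $v_\pi(s)=\sum_a\pi(a|s)q_\pi(s,a)$. As a small sketch, the snippet below copies the printed $q_\pi$ table, the printed $v_\pi$ vector, and the optimal policy from the cells above and recovers the state values from the action values. The names `Q_printed`, `V_printed`, `greedy_actions` and `V_from_Q` are introduced here for illustration only.

```python
import numpy as np

# q_pi as printed above (rows: states 0..10, columns: left, right, up, down)
Q_printed = np.array([
    [ 0.68860443,  0.71576205,  0.68860443,  0.66440699],
    [ 0.68618469,  0.74319399,  0.71576205,  0.71576205],
    [ 0.71576205,  0.772,       0.74428,     0.55907791],
    [ 1.0,         1.0,         1.0,         1.0       ],
    [ 0.66440699,  0.66440699,  0.69132019,  0.63538893],
    [ 0.7334199,  -0.65632878,  0.76103021,  0.58934978],
    [-1.0,        -1.0,        -1.0,        -1.0       ],
    [ 0.63776292,  0.61402305,  0.66440699,  0.63776292],
    [ 0.64042733,  0.60243653,  0.61402305,  0.61402305],
    [ 0.61402305,  0.41678095,  0.55808791,  0.58788282],
    [ 0.60243653,  0.57641217, -0.84456801,  0.57641217],
])

# v_pi as printed by the earlier v_pi cell
V_printed = np.array([0.71576205, 0.74319399, 0.772, 1.0, 0.69132019, 0.76103021,
                      -1.0, 0.66440699, 0.64042733, 0.61402305, 0.60243653])

# the optimal policy used above, written as one action index per state
greedy_actions = [1, 1, 1, 1, 2, 2, 2, 2, 0, 0, 0]
policy = np.eye(4)[greedy_actions]  # one-hot rows, shape (11, 4)

# v_pi(s) = sum_a pi(a|s) * q_pi(s, a)
V_from_Q = (policy * Q_printed).sum(axis=1)
print(np.allclose(V_from_Q, V_printed))
```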


# Value Iteration






- Initialize $q_*(s,a)=0$ and $v_*(s)=0$ for all $s$ and $a$.

- Repeat.

    For every $s$ and $a$ (synchronous or asynchronous) update $q_*$ and $v_*$ using Bellman's optimality equation: 

\begin{eqnarray*}
q_*(s,a)&=&{\cal R}_s^a+\gamma\sum_{s'}{\cal P}^a_{ss'}\left(\max_{a'}q_*(s',a')\right)\nonumber\\
v_*(s)&=&\max_{a}\left({\cal R}_s^a+\gamma\sum_{s'}{\cal P}^a_{ss'}v_*(s')\right)\nonumber\\
\end{eqnarray*}
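
The optimality backup for $q_*$ can be sketched in vectorized form as follows. This is an illustration under assumptions rather than the notebook's implementation: the helper name `value_iteration`, the stopping tolerance `tol`, and the two-state toy MDP are hypothetical, and no terminal states are pinned.

```python
import numpy as np

def value_iteration(P, R, gamma, tol=1e-10, max_sweeps=10_000):
    """Synchronous value iteration on q_* (hypothetical helper).

    P[s, a, s1] has shape (S, A, S) and R[s, a] has shape (S, A).
    """
    Q = np.zeros_like(R, dtype=float)
    for _ in range(max_sweeps):
        # q[s, a] = R[s, a] + gamma * sum_s1 P[s, a, s1] * max_a1 q[s1, a1]
        Q_new = R + gamma * (P @ Q.max(axis=1))
        if np.max(np.abs(Q_new - Q)) < tol:
            return Q_new
        Q = Q_new
    return Q

# Toy MDP (an assumption for illustration): 2 states, 2 actions.
# Action 0 stays put, action 1 moves to the other state;
# only (state 0, action 1) pays a reward of 1.
P = np.zeros((2, 2, 2))
P[0, 0, 0] = P[1, 0, 1] = 1.0
P[0, 1, 1] = P[1, 1, 0] = 1.0
R = np.array([[0.0, 1.0], [0.0, 0.0]])
gamma = 0.9

Q = value_iteration(P, R, gamma)
V = Q.max(axis=1)          # v_*(s) = max_a q_*(s, a)
greedy = Q.argmax(axis=1)  # a greedy policy read off from q_*
print(V, greedy)
```

In this toy problem the best behaviour is to keep switching states, so $v_*(0)=1/(1-\gamma^2)$ and $v_*(1)=\gamma/(1-\gamma^2)$, which gives a closed form to compare against.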


In [None]:
# import libraries
import numpy as np

# state
states = [0,1,2,3,4,5,6,7,8,9,10]
N_STATES = len(states)

# action
actions = [0,1,2,3] # left, right, up, down
N_ACTIONS = len(actions)

# transition probability
P = np.zeros((N_STATES, N_ACTIONS, N_STATES))  

P[0,0,:] = [1,0,0,0,0,0,0,0,0,0,0]
P[0,1,:] = [0,1,0,0,0,0,0,0,0,0,0]
P[0,2,:] = [1,0,0,0,0,0,0,0,0,0,0]
P[0,3,:] = [0,0,0,0,1,0,0,0,0,0,0]  

P[1,0,:] = [0.9,0,0,0,0.1,0,0,0,0,0,0]
P[1,1,:] = [0,0,0.9,0,0,0.1,0,0,0,0,0]
P[1,2,:] = [0,1,0,0,0,0,0,0,0,0,0]
P[1,3,:] = [0,1,0,0,0,0,0,0,0,0,0] 

P[2,0,:] = [0,1,0,0,0,0,0,0,0,0,0]
P[2,1,:] = [0,0,0,0.9,0,0,0.1,0,0,0,0]
P[2,2,:] = [0,0,1,0,0,0,0,0,0,0,0]
P[2,3,:] = [0,0,0,0,0,0.9,0.1,0,0,0,0] 

P[3,0,:] = [0,0,0,1,0,0,0,0,0,0,0]
P[3,1,:] = [0,0,0,1,0,0,0,0,0,0,0]
P[3,2,:] = [0,0,0,1,0,0,0,0,0,0,0]
P[3,3,:] = [0,0,0,1,0,0,0,0,0,0,0] 

P[4,0,:] = [0,0,0,0,1,0,0,0,0,0,0]
P[4,1,:] = [0,0,0,0,1,0,0,0,0,0,0]
P[4,2,:] = [0.9,0.1,0,0,0,0,0,0,0,0,0]
P[4,3,:] = [0,0,0,0,0,0,0,0.9,0.1,0,0] 

P[5,0,:] = [0,0,0,0,0,1,0,0,0,0,0]
P[5,1,:] = [0,0,0,0.1,0,0,0.8,0,0,0,0.1]
P[5,2,:] = [0,0.1,0.8,0.1,0,0,0,0,0,0,0]
P[5,3,:] = [0,0,0,0,0,0,0,0,0.1,0.8,0.1] 

P[6,0,:] = [0,0,0,0,0,0,1,0,0,0,0]
P[6,1,:] = [0,0,0,0,0,0,1,0,0,0,0]
P[6,2,:] = [0,0,0,0,0,0,1,0,0,0,0]
P[6,3,:] = [0,0,0,0,0,0,1,0,0,0,0]

P[7,0,:] = [0,0,0,0,0,0,0,1,0,0,0]
P[7,1,:] = [0,0,0,0,0,0,0,0,1,0,0]
P[7,2,:] = [0,0,0,0,1,0,0,0,0,0,0]
P[7,3,:] = [0,0,0,0,0,0,0,1,0,0,0] 

P[8,0,:] = [0,0,0,0,0.1,0,0,0.9,0,0,0]
P[8,1,:] = [0,0,0,0,0,0.1,0,0,0,0.9,0]
P[8,2,:] = [0,0,0,0,0,0,0,0,1,0,0]
P[8,3,:] = [0,0,0,0,0,0,0,0,1,0,0] 

P[9,0,:] = [0,0,0,0,0,0,0,0,1,0,0]
P[9,1,:] = [0,0,0,0,0,0,0.1,0,0,0,0.9]
P[9,2,:] = [0,0,0,0,0,0.9,0.1,0,0,0,0]
P[9,3,:] = [0,0,0,0,0,0,0,0,0,1,0] 

P[10,0,:] = [0,0,0,0,0,0.1,0,0,0,0.9,0]
P[10,1,:] = [0,0,0,0,0,0,0,0,0,0,1]
P[10,2,:] = [0,0,0,0,0,0.1,0.9,0,0,0,0]
P[10,3,:] = [0,0,0,0,0,0,0,0,0,0,1] 

# reward
if True: # fuel-efficient robot
    R = -0.02 * np.ones((N_STATES, N_ACTIONS)) 
else: # fuel-inefficient robot 
    R = -0.5 * np.ones((N_STATES, N_ACTIONS)) 

# discount factor
gamma = 0.99

# policy (kept for consistency with the other cells; value iteration below does not use it)
if False: # bad policy presented above (top)
    policy = np.zeros((N_STATES, N_ACTIONS))
    policy[0,:] = [0,1,0,0]
    policy[1,:] = [0,1,0,0]
    policy[2,:] = [0,1,0,0]
    policy[3,:] = [0,1,0,0]
    policy[4,:] = [0,0,0,1]
    policy[5,:] = [0,1,0,0]
    policy[6,:] = [0,1,0,0]
    policy[7,:] = [0,1,0,0]
    policy[8,:] = [0,1,0,0]
    policy[9,:] = [0,0,1,0]
    policy[10,:] = [0,0,1,0]
elif False: # random policy
    policy = 0.25*np.ones((N_STATES, N_ACTIONS))
elif True: # optimal policy presented above (bottom)
    policy = np.zeros((N_STATES, N_ACTIONS))
    policy[0,:] = [0,1,0,0]
    policy[1,:] = [0,1,0,0]
    policy[2,:] = [0,1,0,0]
    policy[3,:] = [0,1,0,0]
    policy[4,:] = [0,0,1,0]
    policy[5,:] = [0,0,1,0]
    policy[6,:] = [0,0,1,0]
    policy[7,:] = [0,0,1,0]
    policy[8,:] = [1,0,0,0]
    policy[9,:] = [1,0,0,0]
    policy[10,:] = [1,0,0,0]

# Q function
Q = np.zeros((N_STATES, N_ACTIONS))
Q[3,:] = 1
Q[6,:] = -1

In [None]:
# 100 in-place sweeps of the Bellman optimality backup for q_*
for t in range(100):
    for s in range(N_STATES):
        for a in range(N_ACTIONS):
            Q[s,a] = R[s,a]+gamma*sum([P[s,a,s1]*max([Q[s1,a1] for a1 in range(N_ACTIONS)]) for s1 in range(N_STATES)])
    # terminal states keep their fixed values
    Q[3,:] = 1
    Q[6,:] = -1    
print(Q)

In [4]:
# value iteration - q_*

# import libraries
import numpy as np

# state
states = [0,1,2,3,4,5,6,7,8,9,10]
N_STATES = len(states)

# action
actions = [0,1,2,3] # left, right, up, down
N_ACTIONS = len(actions)

# transition probability
P = np.zeros((N_STATES, N_ACTIONS, N_STATES))  

P[0,0,:] = [1,0,0,0,0,0,0,0,0,0,0]
P[0,1,:] = [0,1,0,0,0,0,0,0,0,0,0]
P[0,2,:] = [1,0,0,0,0,0,0,0,0,0,0]
P[0,3,:] = [0,0,0,0,1,0,0,0,0,0,0]  

P[1,0,:] = [0.9,0,0,0,0.1,0,0,0,0,0,0]
P[1,1,:] = [0,0,0.9,0,0,0.1,0,0,0,0,0]
P[1,2,:] = [0,1,0,0,0,0,0,0,0,0,0]
P[1,3,:] = [0,1,0,0,0,0,0,0,0,0,0] 

P[2,0,:] = [0,1,0,0,0,0,0,0,0,0,0]
P[2,1,:] = [0,0,0,0.9,0,0,0.1,0,0,0,0]
P[2,2,:] = [0,0,1,0,0,0,0,0,0,0,0]
P[2,3,:] = [0,0,0,0,0,0.9,0.1,0,0,0,0] 

P[3,0,:] = [0,0,0,1,0,0,0,0,0,0,0]
P[3,1,:] = [0,0,0,1,0,0,0,0,0,0,0]
P[3,2,:] = [0,0,0,1,0,0,0,0,0,0,0]
P[3,3,:] = [0,0,0,1,0,0,0,0,0,0,0] 

P[4,0,:] = [0,0,0,0,1,0,0,0,0,0,0]
P[4,1,:] = [0,0,0,0,1,0,0,0,0,0,0]
P[4,2,:] = [0.9,0.1,0,0,0,0,0,0,0,0,0]
P[4,3,:] = [0,0,0,0,0,0,0,0.9,0.1,0,0] 

P[5,0,:] = [0,0,0,0,0,1,0,0,0,0,0]
P[5,1,:] = [0,0,0,0.1,0,0,0.8,0,0,0,0.1]
P[5,2,:] = [0,0.1,0.8,0.1,0,0,0,0,0,0,0]
P[5,3,:] = [0,0,0,0,0,0,0,0,0.1,0.8,0.1] 

P[6,0,:] = [0,0,0,0,0,0,1,0,0,0,0]
P[6,1,:] = [0,0,0,0,0,0,1,0,0,0,0]
P[6,2,:] = [0,0,0,0,0,0,1,0,0,0,0]
P[6,3,:] = [0,0,0,0,0,0,1,0,0,0,0]

P[7,0,:] = [0,0,0,0,0,0,0,1,0,0,0]
P[7,1,:] = [0,0,0,0,0,0,0,0,1,0,0]
P[7,2,:] = [0,0,0,0,1,0,0,0,0,0,0]
P[7,3,:] = [0,0,0,0,0,0,0,1,0,0,0] 

P[8,0,:] = [0,0,0,0,0.1,0,0,0.9,0,0,0]
P[8,1,:] = [0,0,0,0,0,0.1,0,0,0,0.9,0]
P[8,2,:] = [0,0,0,0,0,0,0,0,1,0,0]
P[8,3,:] = [0,0,0,0,0,0,0,0,1,0,0] 

P[9,0,:] = [0,0,0,0,0,0,0,0,1,0,0]
P[9,1,:] = [0,0,0,0,0,0,0.1,0,0,0,0.9]
P[9,2,:] = [0,0,0,0,0,0.9,0.1,0,0,0,0]
P[9,3,:] = [0,0,0,0,0,0,0,0,0,1,0] 

P[10,0,:] = [0,0,0,0,0,0.1,0,0,0,0.9,0]
P[10,1,:] = [0,0,0,0,0,0,0,0,0,0,1]
P[10,2,:] = [0,0,0,0,0,0.1,0.9,0,0,0,0]
P[10,3,:] = [0,0,0,0,0,0,0,0,0,0,1] 

# reward
if True: # fuel-efficient robot
    R = -0.02 * np.ones((N_STATES, N_ACTIONS)) 
else: # fuel-inefficient robot 
    R = -0.5 * np.ones((N_STATES, N_ACTIONS)) 

# discount factor
gamma = 0.99

# policy (kept for consistency with the other cells; value iteration below does not use it)
if False: # bad policy presented above (top)
    policy = np.zeros((N_STATES, N_ACTIONS))
    policy[0,:] = [0,1,0,0]
    policy[1,:] = [0,1,0,0]
    policy[2,:] = [0,1,0,0]
    policy[3,:] = [0,1,0,0]
    policy[4,:] = [0,0,0,1]
    policy[5,:] = [0,1,0,0]
    policy[6,:] = [0,1,0,0]
    policy[7,:] = [0,1,0,0]
    policy[8,:] = [0,1,0,0]
    policy[9,:] = [0,0,1,0]
    policy[10,:] = [0,0,1,0]
elif False: # random policy
    policy = 0.25*np.ones((N_STATES, N_ACTIONS))
elif True: # optimal policy presented above (bottom)
    policy = np.zeros((N_STATES, N_ACTIONS))
    policy[0,:] = [0,1,0,0]
    policy[1,:] = [0,1,0,0]
    policy[2,:] = [0,1,0,0]
    policy[3,:] = [0,1,0,0]
    policy[4,:] = [0,0,1,0]
    policy[5,:] = [0,0,1,0]
    policy[6,:] = [0,0,1,0]
    policy[7,:] = [0,0,1,0]
    policy[8,:] = [1,0,0,0]
    policy[9,:] = [1,0,0,0]
    policy[10,:] = [1,0,0,0]

# Q function
Q = np.zeros((N_STATES, N_ACTIONS))
Q[3,:] = 1
Q[6,:] = -1

for t in range(100):
    for s in range(N_STATES):
        for a in range(N_ACTIONS):
            Q[s,a] = R[s,a]+gamma*sum([P[s,a,s1]*max([Q[s1,a1] for a1 in range(N_ACTIONS)]) for s1 in range(N_STATES)])
    Q[3,:] = 1
    Q[6,:] = -1    
print(Q)

[[ 0.68860443  0.71576205  0.68860443  0.66440699]
 [ 0.68618469  0.74319399  0.71576205  0.71576205]
 [ 0.71576205  0.772       0.74428     0.55907791]
 [ 1.          1.          1.          1.        ]
 [ 0.66440699  0.66440699  0.69132019  0.63538893]
 [ 0.7334199  -0.65632878  0.76103021  0.58934978]
 [-1.         -1.         -1.         -1.        ]
 [ 0.63776292  0.61402305  0.66440699  0.63776292]
 [ 0.64042733  0.60243653  0.61402305  0.61402305]
 [ 0.61402305  0.41678095  0.55808791  0.58788282]
 [ 0.60243653  0.57641217 -0.84456801  0.57641217]]


# Policy Iteration



<img src="img/Policy Iteration.png"/>

- Initialize $\pi$ randomly.

- Repeat.

    Update $q_\pi$ and $v_\pi$ by solving Bellman's expectation equation.
\begin{eqnarray*}
q_\pi(s,a)&=&{\cal R}_s^a+\gamma\sum_{s'}{\cal P}^a_{ss'}\left(\sum_{a'}\pi(a'|s')q_\pi(s',a')\right)\nonumber\\
v_\pi(s)&=&\sum_{a}\pi(a|s)\left({\cal R}_s^a+\gamma\sum_{s'}{\cal P}^a_{ss'}v_\pi(s')\right)\nonumber\\
\end{eqnarray*}
    
    Update $\pi$ by solving

$$
\pi(s)=\operatorname*{argmax}_{a}q_\pi(s,a)
$$
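
The two alternating steps can be sketched as below. Since the evaluation step asks for a solution of Bellman's expectation equation, this sketch solves the linear system $(I-\gamma P_\pi)v=R_\pi$ directly instead of sweeping. That is one possible choice and not the approach taken in the cells that follow. The helper name `policy_iteration` and the two-state toy MDP are hypothetical.

```python
import numpy as np

def policy_iteration(P, R, gamma, max_iters=1_000):
    """Policy iteration with exact policy evaluation (hypothetical helper).

    P[s, a, s1] has shape (S, A, S) and R[s, a] has shape (S, A).
    The policy is deterministic and stored as one action index per state.
    """
    n_states = P.shape[0]
    idx = np.arange(n_states)
    pi = np.zeros(n_states, dtype=int)  # arbitrary initial policy: action 0 everywhere
    for _ in range(max_iters):
        # policy evaluation: solve (I - gamma * P_pi) v = R_pi
        P_pi = P[idx, pi]  # shape (S, S)
        R_pi = R[idx, pi]  # shape (S,)
        V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, R_pi)
        # policy improvement: act greedily with respect to q_pi
        Q = R + gamma * (P @ V)
        new_pi = Q.argmax(axis=1)
        if np.array_equal(new_pi, pi):
            break  # the policy is stable, so it is greedy w.r.t. its own q_pi
        pi = new_pi
    return pi, V, Q

# Toy MDP (an assumption for illustration): 2 states, 2 actions.
# Action 0 stays put, action 1 moves to the other state;
# only (state 0, action 1) pays a reward of 1.
P = np.zeros((2, 2, 2))
P[0, 0, 0] = P[1, 0, 1] = 1.0
P[0, 1, 1] = P[1, 1, 0] = 1.0
R = np.array([[0.0, 1.0], [0.0, 0.0]])
gamma = 0.9

pi, V, Q = policy_iteration(P, R, gamma)
print(pi, V)
```

The loop stops as soon as the improvement step leaves the policy unchanged, which replaces the fixed number of outer iterations with a convergence test.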

In [None]:
# import libraries
import numpy as np

# state
states = [0,1,2,3,4,5,6,7,8,9,10]
N_STATES = len(states)

# action
actions = [0,1,2,3] # left, right, up, down
N_ACTIONS = len(actions)

# transition probability
P = np.zeros((N_STATES, N_ACTIONS, N_STATES))  

P[0,0,:] = [1,0,0,0,0,0,0,0,0,0,0]
P[0,1,:] = [0,1,0,0,0,0,0,0,0,0,0]
P[0,2,:] = [1,0,0,0,0,0,0,0,0,0,0]
P[0,3,:] = [0,0,0,0,1,0,0,0,0,0,0]  

P[1,0,:] = [0.9,0,0,0,0.1,0,0,0,0,0,0]
P[1,1,:] = [0,0,0.9,0,0,0.1,0,0,0,0,0]
P[1,2,:] = [0,1,0,0,0,0,0,0,0,0,0]
P[1,3,:] = [0,1,0,0,0,0,0,0,0,0,0] 

P[2,0,:] = [0,1,0,0,0,0,0,0,0,0,0]
P[2,1,:] = [0,0,0,0.9,0,0,0.1,0,0,0,0]
P[2,2,:] = [0,0,1,0,0,0,0,0,0,0,0]
P[2,3,:] = [0,0,0,0,0,0.9,0.1,0,0,0,0] 

P[3,0,:] = [0,0,0,1,0,0,0,0,0,0,0]
P[3,1,:] = [0,0,0,1,0,0,0,0,0,0,0]
P[3,2,:] = [0,0,0,1,0,0,0,0,0,0,0]
P[3,3,:] = [0,0,0,1,0,0,0,0,0,0,0] 

P[4,0,:] = [0,0,0,0,1,0,0,0,0,0,0]
P[4,1,:] = [0,0,0,0,1,0,0,0,0,0,0]
P[4,2,:] = [0.9,0.1,0,0,0,0,0,0,0,0,0]
P[4,3,:] = [0,0,0,0,0,0,0,0.9,0.1,0,0] 

P[5,0,:] = [0,0,0,0,0,1,0,0,0,0,0]
P[5,1,:] = [0,0,0,0.1,0,0,0.8,0,0,0,0.1]
P[5,2,:] = [0,0.1,0.8,0.1,0,0,0,0,0,0,0]
P[5,3,:] = [0,0,0,0,0,0,0,0,0.1,0.8,0.1] 

P[6,0,:] = [0,0,0,0,0,0,1,0,0,0,0]
P[6,1,:] = [0,0,0,0,0,0,1,0,0,0,0]
P[6,2,:] = [0,0,0,0,0,0,1,0,0,0,0]
P[6,3,:] = [0,0,0,0,0,0,1,0,0,0,0]

P[7,0,:] = [0,0,0,0,0,0,0,1,0,0,0]
P[7,1,:] = [0,0,0,0,0,0,0,0,1,0,0]
P[7,2,:] = [0,0,0,0,1,0,0,0,0,0,0]
P[7,3,:] = [0,0,0,0,0,0,0,1,0,0,0] 

P[8,0,:] = [0,0,0,0,0.1,0,0,0.9,0,0,0]
P[8,1,:] = [0,0,0,0,0,0.1,0,0,0,0.9,0]
P[8,2,:] = [0,0,0,0,0,0,0,0,1,0,0]
P[8,3,:] = [0,0,0,0,0,0,0,0,1,0,0] 

P[9,0,:] = [0,0,0,0,0,0,0,0,1,0,0]
P[9,1,:] = [0,0,0,0,0,0,0.1,0,0,0,0.9]
P[9,2,:] = [0,0,0,0,0,0.9,0.1,0,0,0,0]
P[9,3,:] = [0,0,0,0,0,0,0,0,0,1,0] 

P[10,0,:] = [0,0,0,0,0,0.1,0,0,0,0.9,0]
P[10,1,:] = [0,0,0,0,0,0,0,0,0,0,1]
P[10,2,:] = [0,0,0,0,0,0.1,0.9,0,0,0,0]
P[10,3,:] = [0,0,0,0,0,0,0,0,0,0,1] 

# reward
if True: # fuel-efficient robot
    R = -0.02 * np.ones((N_STATES, N_ACTIONS)) 
else: # fuel-inefficient robot 
    R = -0.5 * np.ones((N_STATES, N_ACTIONS)) 

# discount factor
gamma = 0.99

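# initial policy for policy iteration; enable the "random policy" branch instead
# to start from a uniformly random policy, as in the description above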
if False: # bad policy presented above (top)
    policy = np.zeros((N_STATES, N_ACTIONS))
    policy[0,:] = [0,1,0,0]
    policy[1,:] = [0,1,0,0]
    policy[2,:] = [0,1,0,0]
    policy[3,:] = [0,1,0,0]
    policy[4,:] = [0,0,0,1]
    policy[5,:] = [0,1,0,0]
    policy[6,:] = [0,1,0,0]
    policy[7,:] = [0,1,0,0]
    policy[8,:] = [0,1,0,0]
    policy[9,:] = [0,0,1,0]
    policy[10,:] = [0,0,1,0]
elif False: # random policy
    policy = 0.25*np.ones((N_STATES, N_ACTIONS))
elif True: # optimal policy presented above (bottom)
    policy = np.zeros((N_STATES, N_ACTIONS))
    policy[0,:] = [0,1,0,0]
    policy[1,:] = [0,1,0,0]
    policy[2,:] = [0,1,0,0]
    policy[3,:] = [0,1,0,0]
    policy[4,:] = [0,0,1,0]
    policy[5,:] = [0,0,1,0]
    policy[6,:] = [0,0,1,0]
    policy[7,:] = [0,0,1,0]
    policy[8,:] = [1,0,0,0]
    policy[9,:] = [1,0,0,0]
    policy[10,:] = [1,0,0,0]

# Q function
Q = np.zeros((N_STATES, N_ACTIONS))
Q[3,:] = 1
Q[6,:] = -1

In [None]:
for t in range(100):

    # policy evaluation: 100 in-place sweeps of the Bellman expectation backup for q_pi
    for i in range(100):
        for s in range(N_STATES):
            for a in range(N_ACTIONS):
                Q[s,a] = R[s,a]+gamma*sum([P[s,a,s1]*sum([policy[s1,a1]*Q[s1,a1] for a1 in range(N_ACTIONS)]) for s1 in range(N_STATES)])
        # terminal states keep their fixed values
        Q[3,:] = 1
        Q[6,:] = -1
        #print(Q)
    
    # policy improvement: act greedily with respect to the current Q
    policy = np.zeros((N_STATES, N_ACTIONS))
    m = np.argmax(Q,1)
    for i in range(N_STATES):
        policy[i,m[i]] = 1
    
print(Q)

In [5]:
# policy iteration - q_\pi

# import libraries
import numpy as np

# state
states = [0,1,2,3,4,5,6,7,8,9,10]
N_STATES = len(states)

# action
actions = [0,1,2,3] # left, right, up, down
N_ACTIONS = len(actions)

# transition probability
P = np.zeros((N_STATES, N_ACTIONS, N_STATES))  

P[0,0,:] = [1,0,0,0,0,0,0,0,0,0,0]
P[0,1,:] = [0,1,0,0,0,0,0,0,0,0,0]
P[0,2,:] = [1,0,0,0,0,0,0,0,0,0,0]
P[0,3,:] = [0,0,0,0,1,0,0,0,0,0,0]  

P[1,0,:] = [0.9,0,0,0,0.1,0,0,0,0,0,0]
P[1,1,:] = [0,0,0.9,0,0,0.1,0,0,0,0,0]
P[1,2,:] = [0,1,0,0,0,0,0,0,0,0,0]
P[1,3,:] = [0,1,0,0,0,0,0,0,0,0,0] 

P[2,0,:] = [0,1,0,0,0,0,0,0,0,0,0]
P[2,1,:] = [0,0,0,0.9,0,0,0.1,0,0,0,0]
P[2,2,:] = [0,0,1,0,0,0,0,0,0,0,0]
P[2,3,:] = [0,0,0,0,0,0.9,0.1,0,0,0,0] 

P[3,0,:] = [0,0,0,1,0,0,0,0,0,0,0]
P[3,1,:] = [0,0,0,1,0,0,0,0,0,0,0]
P[3,2,:] = [0,0,0,1,0,0,0,0,0,0,0]
P[3,3,:] = [0,0,0,1,0,0,0,0,0,0,0] 

P[4,0,:] = [0,0,0,0,1,0,0,0,0,0,0]
P[4,1,:] = [0,0,0,0,1,0,0,0,0,0,0]
P[4,2,:] = [0.9,0.1,0,0,0,0,0,0,0,0,0]
P[4,3,:] = [0,0,0,0,0,0,0,0.9,0.1,0,0] 

P[5,0,:] = [0,0,0,0,0,1,0,0,0,0,0]
P[5,1,:] = [0,0,0,0.1,0,0,0.8,0,0,0,0.1]
P[5,2,:] = [0,0.1,0.8,0.1,0,0,0,0,0,0,0]
P[5,3,:] = [0,0,0,0,0,0,0,0,0.1,0.8,0.1] 

P[6,0,:] = [0,0,0,0,0,0,1,0,0,0,0]
P[6,1,:] = [0,0,0,0,0,0,1,0,0,0,0]
P[6,2,:] = [0,0,0,0,0,0,1,0,0,0,0]
P[6,3,:] = [0,0,0,0,0,0,1,0,0,0,0]

P[7,0,:] = [0,0,0,0,0,0,0,1,0,0,0]
P[7,1,:] = [0,0,0,0,0,0,0,0,1,0,0]
P[7,2,:] = [0,0,0,0,1,0,0,0,0,0,0]
P[7,3,:] = [0,0,0,0,0,0,0,1,0,0,0] 

P[8,0,:] = [0,0,0,0,0.1,0,0,0.9,0,0,0]
P[8,1,:] = [0,0,0,0,0,0.1,0,0,0,0.9,0]
P[8,2,:] = [0,0,0,0,0,0,0,0,1,0,0]
P[8,3,:] = [0,0,0,0,0,0,0,0,1,0,0] 

P[9,0,:] = [0,0,0,0,0,0,0,0,1,0,0]
P[9,1,:] = [0,0,0,0,0,0,0.1,0,0,0,0.9]
P[9,2,:] = [0,0,0,0,0,0.9,0.1,0,0,0,0]
P[9,3,:] = [0,0,0,0,0,0,0,0,0,1,0] 

P[10,0,:] = [0,0,0,0,0,0.1,0,0,0,0.9,0]
P[10,1,:] = [0,0,0,0,0,0,0,0,0,0,1]
P[10,2,:] = [0,0,0,0,0,0.1,0.9,0,0,0,0]
P[10,3,:] = [0,0,0,0,0,0,0,0,0,0,1] 

# reward
if True: # fuel-efficient robot
    R = -0.02 * np.ones((N_STATES, N_ACTIONS)) 
else: # fuel-inefficient robot 
    R = -0.5 * np.ones((N_STATES, N_ACTIONS)) 

# discount factor
gamma = 0.99

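# initial policy for policy iteration; enable the "random policy" branch instead
# to start from a uniformly random policy, as in the description above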
if False: # bad policy presented above (top)
    policy = np.zeros((N_STATES, N_ACTIONS))
    policy[0,:] = [0,1,0,0]
    policy[1,:] = [0,1,0,0]
    policy[2,:] = [0,1,0,0]
    policy[3,:] = [0,1,0,0]
    policy[4,:] = [0,0,0,1]
    policy[5,:] = [0,1,0,0]
    policy[6,:] = [0,1,0,0]
    policy[7,:] = [0,1,0,0]
    policy[8,:] = [0,1,0,0]
    policy[9,:] = [0,0,1,0]
    policy[10,:] = [0,0,1,0]
elif False: # random policy
    policy = 0.25*np.ones((N_STATES, N_ACTIONS))
elif True: # optimal policy presented above (bottom)
    policy = np.zeros((N_STATES, N_ACTIONS))
    policy[0,:] = [0,1,0,0]
    policy[1,:] = [0,1,0,0]
    policy[2,:] = [0,1,0,0]
    policy[3,:] = [0,1,0,0]
    policy[4,:] = [0,0,1,0]
    policy[5,:] = [0,0,1,0]
    policy[6,:] = [0,0,1,0]
    policy[7,:] = [0,0,1,0]
    policy[8,:] = [1,0,0,0]
    policy[9,:] = [1,0,0,0]
    policy[10,:] = [1,0,0,0]

# Q function
Q = np.zeros((N_STATES, N_ACTIONS))
Q[3,:] = 1
Q[6,:] = -1

for t in range(100):

    # evaluate policy
    for i in range(100):
        for s in range(N_STATES):
            for a in range(N_ACTIONS):
                Q[s,a] = R[s,a]+gamma*sum([P[s,a,s1]*sum([policy[s1,a1]*Q[s1,a1] for a1 in range(N_ACTIONS)]) for s1 in range(N_STATES)])
        Q[3,:] = 1
        Q[6,:] = -1
        #print(Q)
    
    # improve policy
    policy = np.zeros((N_STATES, N_ACTIONS))
    m = np.argmax(Q,1)
    for i in range(N_STATES):
        policy[i,m[i]] = 1
    
print(Q)

[[ 0.68860443  0.71576205  0.68860443  0.66440699]
 [ 0.68618469  0.74319399  0.71576205  0.71576205]
 [ 0.71576205  0.772       0.74428     0.55907791]
 [ 1.          1.          1.          1.        ]
 [ 0.66440699  0.66440699  0.69132019  0.63538893]
 [ 0.7334199  -0.65632878  0.76103021  0.58934978]
 [-1.         -1.         -1.         -1.        ]
 [ 0.63776292  0.61402305  0.66440699  0.63776292]
 [ 0.64042733  0.60243653  0.61402305  0.61402305]
 [ 0.61402305  0.41678095  0.55808791  0.58788282]
 [ 0.60243653  0.57641217 -0.84456801  0.57641217]]


# Asynchronous Dynamic Programming

### In-Place Dynamic Programming

### Prioritised Sweeping

### Real-Time Dynamic Programming


# In-Place Dynamic Programming

<img src="img/In-Place Dynamic Programming.png"/>

http://www0.cs.ucl.ac.uk/staff/d.silver/web/Teaching_files/DP.pdf
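
The difference between the two schemes is that a synchronous sweep keeps two copies of the value function, while an in-place sweep keeps one and lets later states see values already updated in the same sweep. The following sketch contrasts them for the value-iteration backup. The helper names and the two-state toy MDP are hypothetical and not taken from the slides.

```python
import numpy as np

def synchronous_sweep(V, P, R, gamma):
    """Build a new array from the old one (two copies of the value function)."""
    return (R + gamma * (P @ V)).max(axis=1)

def in_place_sweep(V, P, R, gamma):
    """Overwrite V state by state, so later states use already-updated values."""
    for s in range(len(V)):
        V[s] = np.max(R[s] + gamma * (P[s] @ V))
    return V

def sweeps_until_converged(sweep, P, R, gamma, tol=1e-10, max_sweeps=10_000):
    """Repeat a sweep until the largest change in one sweep is below tol."""
    V = np.zeros(P.shape[0])
    for n in range(1, max_sweeps + 1):
        V_old = V.copy()
        V = sweep(V, P, R, gamma)
        if np.max(np.abs(V - V_old)) < tol:
            return V, n
    return V, max_sweeps

# Toy MDP (an assumption for illustration): 2 states, 2 actions.
# Action 0 stays put, action 1 moves to the other state;
# only (state 0, action 1) pays a reward of 1.
P = np.zeros((2, 2, 2))
P[0, 0, 0] = P[1, 0, 1] = 1.0
P[0, 1, 1] = P[1, 1, 0] = 1.0
R = np.array([[0.0, 1.0], [0.0, 0.0]])
gamma = 0.9

V_sync, n_sync = sweeps_until_converged(synchronous_sweep, P, R, gamma)
V_inpl, n_inpl = sweeps_until_converged(in_place_sweep, P, R, gamma)
print(V_sync, n_sync)
print(V_inpl, n_inpl)
```

Both variants approach the same fixed point. The in-place version typically needs fewer sweeps, and the nested loops in the cells above are in-place in this sense because `V` and `Q` are overwritten during each sweep.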

# Prioritised Sweeping

<img src="img/Prioritised Sweeping.png"/>

http://www0.cs.ucl.ac.uk/staff/d.silver/web/Teaching_files/DP.pdf
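
The idea is to back up the state with the largest remaining Bellman error first and then refresh the errors of the states that can lead into it. The sketch below is a simplified illustration under assumptions: it selects the state with `np.argmax` over an error array rather than maintaining a priority queue, and the helper names and the two-state toy MDP are hypothetical.

```python
import numpy as np

def backup(V, P, R, gamma, s):
    """Bellman optimality backup for a single state s."""
    return np.max(R[s] + gamma * (P[s] @ V))

def prioritised_sweeping(P, R, gamma, tol=1e-10, max_backups=100_000):
    n_states = P.shape[0]
    V = np.zeros(n_states)
    # predecessors[t]: states that reach t with positive probability under some action
    predecessors = [np.flatnonzero(P[:, :, t].sum(axis=1) > 0) for t in range(n_states)]
    # Bellman error |backup(s) - V[s]| of every state
    errors = np.array([abs(backup(V, P, R, gamma, s) - V[s]) for s in range(n_states)])
    for _ in range(max_backups):
        s = int(np.argmax(errors))  # state with the largest remaining error
        if errors[s] < tol:
            break
        V[s] = backup(V, P, R, gamma, s)
        errors[s] = 0.0
        # only predecessors of s depend on V[s], so only their errors change
        for p in predecessors[s]:
            errors[p] = abs(backup(V, P, R, gamma, p) - V[p])
    return V

# Toy MDP (an assumption for illustration): 2 states, 2 actions.
# Action 0 stays put, action 1 moves to the other state;
# only (state 0, action 1) pays a reward of 1.
P = np.zeros((2, 2, 2))
P[0, 0, 0] = P[1, 0, 1] = 1.0
P[0, 1, 1] = P[1, 1, 0] = 1.0
R = np.array([[0.0, 1.0], [0.0, 0.0]])
gamma = 0.9

V = prioritised_sweeping(P, R, gamma)
print(V)
```

For larger state spaces a heap-based priority queue would avoid the linear scan in `np.argmax`.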

# Real-Time Dynamic Programming

<img src="img/Real-Time Dynamic Programming.png"/>

http://www0.cs.ucl.ac.uk/staff/d.silver/web/Teaching_files/DP.pdf