### Implementing the gridworld example from the book
<img src="./gridworld.png" alt="" width="60%"/>  

<div>At each cell, four actions are possible: north, south, east, and west, which deterministically cause the agent to move one cell in the respective direction on the grid. Actions that would take the agent off the grid leave its location unchanged, but also result in a reward of -1. Other actions result in a reward of 0, except those that move the agent out of the special states A and B. From state A, all four actions yield a reward of +10 and take the agent to A'. From state B, all four actions yield a reward of +5 and take the agent to B'.</div>
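
The code in this notebook weights each of the four actions by 0.25, so the quantity it computes appears to be the state-value function of the equiprobable random policy. Under that assumption, and because every transition is deterministic, the equation each cell's value has to satisfy can be sketched as follows. The shorthand $r(s,a)$ and $s'(s,a)$ for the reward and next state is introduced here for illustration and is not taken from the original text; $\gamma$ is the discount factor (`GAMMA = 0.9` in the code).

$$
v_\pi(s) \;=\; \sum_{a \in \{N,\,S,\,E,\,W\}} \frac{1}{4}\,\Bigl[\, r(s,a) + \gamma\, v_\pi\bigl(s'(s,a)\bigr) \Bigr]
$$

For example, every action in state A gives $r = 10$ and $s' = A'$, so $v_\pi(A) = 10 + \gamma\, v_\pi(A')$.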

In [100]:
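import numpy as np  # used for the 5x5 value grids in the cells that follow

from bokeh.io import output_notebook
# note: bokeh.charts was removed from core Bokeh in 0.12.9 and now lives in the separate bkcharts package
from bokeh.charts import Line, show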

output_notebook()

In [54]:
GAMMA = 0.9

In [83]:
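def transition(s, a):
    """given state s and action a, return new state s' and reward r"""
    # special states: every action moves the agent to the primed state
    if s == (0,1): # A to A'
        return ((4,1), 10)
    elif s == (0,3):  # B to B'
        return ((2,3), 5)
    # states are (row, col) tuples, with row 0 at the top (north) edge
    if a == "N":
        s_new = (s[0]-1, s[1])
    elif a == "S":
        s_new = (s[0]+1, s[1])
    elif a == "W":
        s_new = (s[0], s[1]-1)
    elif a == "E":
        s_new = (s[0], s[1]+1)
    else:
        raise ValueError("unknown action: {0!r}".format(a))
    # a move off the 5x5 grid leaves the agent in place with reward -1
    if (s_new[0] < 0 or s_new[0] > 4 or s_new[1] < 0 or s_new[1] > 4):
        return (s, -1)
    else:
        return (s_new, 0)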
    
    
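def update_value(s, grid):
    """one backup for state s: average reward + discounted next value over the four actions"""
    new_value = 0
    for a in ["N", "S", "E", "W"]:
        s_new, reward = transition(s, a)
        # each action is taken with probability 0.25
        new_value += 0.25 * (reward + GAMMA * grid[s_new])
    return new_value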

In [113]:
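def value_iteration(iters, init):
    """run `iters` synchronous sweeps over the grid, starting from `init`.

    Despite the name, each sweep averages over the four actions (no max),
    so this is iterative policy evaluation for the uniform random policy.
    """
    new_grid = init.copy()
    for i in range(iters):
        # every cell in this sweep is updated from the previous sweep's values
        previous_grid = new_grid.copy()
        for row in range(5):
            for col in range(5):
                new_grid[row, col] = update_value((row, col), previous_grid)
    return new_grid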
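
The function above runs a fixed number of sweeps. A common alternative is to stop once the largest per-cell change falls below a small threshold. The sketch below illustrates that idea under stated assumptions: `step`, `MOVES`, `SPECIAL`, `evaluate_until_converged`, `theta`, and `max_sweeps` are hypothetical names introduced here, and the dynamics are re-declared only so the block runs on its own.

```python
import numpy as np

GAMMA = 0.9
# Hypothetical lookup tables for this sketch; they mirror the N/S/W/E moves
# and the A -> A', B -> B' jumps used elsewhere in the notebook.
MOVES = {"N": (-1, 0), "S": (1, 0), "W": (0, -1), "E": (0, 1)}
SPECIAL = {(0, 1): ((4, 1), 10), (0, 3): ((2, 3), 5)}


def step(s, a):
    """Return (next_state, reward) for state s and action a."""
    if s in SPECIAL:
        return SPECIAL[s]
    row, col = s[0] + MOVES[a][0], s[1] + MOVES[a][1]
    if 0 <= row <= 4 and 0 <= col <= 4:
        return (row, col), 0
    return s, -1  # off-grid move: stay in place, reward -1


def evaluate_until_converged(theta=1e-6, max_sweeps=10_000):
    """Sweep until the largest per-cell change drops below theta."""
    grid = np.zeros((5, 5))
    for sweep in range(1, max_sweeps + 1):
        previous = grid.copy()
        for row in range(5):
            for col in range(5):
                # Average over the four actions, each with probability 0.25.
                grid[row, col] = sum(
                    0.25 * (r + GAMMA * previous[s_new])
                    for s_new, r in (step((row, col), a) for a in MOVES)
                )
        if np.abs(grid - previous).max() < theta:
            return grid, sweep
    return grid, max_sweeps


values, sweeps = evaluate_until_converged()
print("stopped after", sweeps, "sweeps")
print(values.round(2))
```

The threshold `theta` trades accuracy for run time; with a discount of 0.9 the largest change shrinks by at least a factor of 0.9 per sweep, so the loop is expected to stop well before `max_sweeps`.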

In [122]:
# grid values after 1 to 50 iterations, printed at selected checkpoints
initial = np.zeros((5,5))
value_trend = []
for i in range(1,51):
    grid = value_iteration(i, initial)  # restart from zeros and run i sweeps
    value_trend.append(grid)
    if i in [1,2,3,5,10,20,50]:
        print("after {0} iteration, the grid values are:".format(i))
        print(grid)  # reuse the grid computed above instead of recomputing it
        print("")

after 1 iteration, the grid values are:
[[ -0.5   10.    -0.25   5.    -0.5 ]
 [ -0.25   0.     0.     0.    -0.25]
 [ -0.25   0.     0.     0.    -0.25]
 [ -0.25   0.     0.     0.    -0.25]
 [ -0.5   -0.25  -0.25  -0.25  -0.5 ]]

after 2 iteration, the grid values are:
[[ 1.46875  9.775    3.06875  5.       0.34375]
 [-0.475    2.19375 -0.05625  1.06875 -0.475  ]
 [-0.41875 -0.05625  0.      -0.05625 -0.41875]
 [-0.475   -0.1125  -0.05625 -0.1125  -0.475  ]
 [-0.8375  -0.475   -0.41875 -0.475   -0.8375 ]]

after 3 iteration, the grid values are:
[[ 2.2534375   9.5725      3.7521875   4.949375    0.6728125 ]
 [ 0.37296875  2.0671875   1.42453125  0.9928125  -0.13328125]
 [-0.570625    0.3740625  -0.050625    0.1209375  -0.570625  ]
 [-0.66484375 -0.2390625  -0.14484375 -0.2390625  -0.66484375]
 [-1.090625   -0.66484375 -0.570625   -0.66484375 -1.090625  ]]

after 5 iteration, the grid values are:
[[ 3.00614424  9.25555586  4.29815439  5.02683125  1.04080088]
 [ 1.03519121  2.67124658 

In [123]:
data = {"A": [i[(0,1)] for i in value_trend], 
        "B": [i[(0,3)] for i in value_trend],
        "(0,2)": [i[(0,2)] for i in value_trend]}
p = Line(data, plot_width=400, plot_height=300, legend="center_right", 
         ylabel='grid value', xlabel="iterations")
show(p)
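
The curves plotted above flatten out as the number of iterations grows. As a cross-check on where they settle, the same averaged update can be written as a linear system and solved directly. This is a sketch under the assumption that each action is taken with probability 0.25, matching the weight used in `update_value`; the names `P`, `r`, `MOVES`, `SPECIAL`, and `exact` are introduced here for illustration only.

```python
import numpy as np

GAMMA = 0.9
N = 5  # grid side length
MOVES = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # N, S, W, E
SPECIAL = {(0, 1): ((4, 1), 10.0), (0, 3): ((2, 3), 5.0)}  # A -> A', B -> B'

P = np.zeros((N * N, N * N))  # state-to-state transition probabilities
r = np.zeros(N * N)           # expected immediate reward per state

for row in range(N):
    for col in range(N):
        i = row * N + col  # flatten (row, col) into a single state index
        for d_row, d_col in MOVES:
            if (row, col) in SPECIAL:
                (n_row, n_col), reward = SPECIAL[(row, col)]
            else:
                n_row, n_col, reward = row + d_row, col + d_col, 0.0
                if not (0 <= n_row < N and 0 <= n_col < N):
                    # Off-grid move: stay in place, reward -1.
                    n_row, n_col, reward = row, col, -1.0
            P[i, n_row * N + n_col] += 0.25
            r[i] += 0.25 * reward

# Solve v = r + GAMMA * P v, i.e. (I - GAMMA * P) v = r.
v = np.linalg.solve(np.eye(N * N) - GAMMA * P, r)
exact = v.reshape(N, N)
print(exact.round(1))
```

Entries `exact[0, 1]`, `exact[0, 3]`, and `exact[0, 2]` correspond to the three cells tracked in the plot, so they can be compared against the last points of each curve.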