# GridWorld

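There are two kinds of non-zero reward. When the agent tries to move off the grid, it stays where it is and the -1 reward is credited to the cell it tried to leave (the origin of the transition).
When the agent reaches a high-reward cell (A or B), the positive reward is credited to that cell itself: in the code below, every action taken from A yields +10 and moves the agent to A' = (4, 1), and every action taken from B yields +5 and moves the agent to B' = (2, 3).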

Implementation notes:
   - The state is the agent's position on the grid, stored as a (row, column) tuple.
   - The action is the agent's move in one time step, stored as a (row offset, column offset) tuple.
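
The reward rules and implementation notes above can be sketched as a single transition function. This is a minimal illustration, not code from the notebook: the names `step`, `SPECIAL`, `ACTIONS`, and `N` are hypothetical, while the coordinates and rewards for A and B are taken from the cells below.

```python
# Hypothetical helper: `step`, `SPECIAL`, `ACTIONS` and `N` are illustrative names.
N = 5                                          # the grid is N x N
SPECIAL = {(0, 1): ((4, 1), 10),               # A -> A', reward +10
           (0, 3): ((2, 3), 5)}                # B -> B', reward +5
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]   # up, down, left, right


def step(state, action):
    """Return (next_state, reward) for one deterministic move."""
    if state in SPECIAL:                       # any action taken from A or B teleports
        return SPECIAL[state]
    row, col = state[0] + action[0], state[1] + action[1]
    if not (0 <= row < N and 0 <= col < N):
        return state, -1                       # off-grid move: stay put, reward -1
    return (row, col), 0                       # ordinary move: reward 0


print(step((0, 0), (-1, 0)))   # off-grid move from the corner -> ((0, 0), -1)
print(step((0, 1), (1, 0)))    # any action from A             -> ((4, 1), 10)
```

The cells below inline the same logic inside their loops instead of calling a helper.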

# State-value function for the equiprobable random policy

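This value function is the solution of the system of linear equations (3.12) for the discounted
reward case with γ = 0.9. The code below does not solve that system directly; it approximates the
solution by applying the same equation repeatedly as an update rule (iterative policy evaluation)
until the values stop changing.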
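
For comparison, the linear system can also be solved in one step. The sketch below is an assumption about how that could be done, not the notebook's method: it builds a transition matrix `P` and an expected-reward vector `R` for the equiprobable random policy (both names are hypothetical) and solves (I - γP) v = R with `np.linalg.solve`.

```python
import numpy as np

n, gamma = 5, 0.9
special = {(0, 1): ((4, 1), 10.0), (0, 3): ((2, 3), 5.0)}   # A -> A', B -> B'
actions = [(-1, 0), (1, 0), (0, -1), (0, 1)]

P = np.zeros((n * n, n * n))   # P[s, s']: transition probability under the random policy
R = np.zeros(n * n)            # R[s]: expected one-step reward from state s

for i in range(n):
    for j in range(n):
        s = i * n + j          # flatten (row, column) into a single state index
        for di, dj in actions:
            if (i, j) in special:
                (ni, nj), r = special[(i, j)]
            else:
                ni, nj, r = i + di, j + dj, 0.0
                if not (0 <= ni < n and 0 <= nj < n):
                    ni, nj, r = i, j, -1.0   # off-grid move: stay put, reward -1
            P[s, ni * n + nj] += 0.25
            R[s] += 0.25 * r

# v = R + gamma * P v   <=>   (I - gamma * P) v = R
v = np.linalg.solve(np.eye(n * n) - gamma * P, R).reshape(n, n)
print(np.round(v, 2))
```

The result should agree with the iterative computation below up to its convergence tolerance.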

In [2]:
import numpy as np
from collections import defaultdict

In [11]:
n = 5
discount = 0.9
error = 1e-3  # convergence threshold (10^-3 is a bitwise XOR in Python and evaluates to -9)
grid = np.zeros((n, n))
A = [(0, 1), (4, 1), 10]
B = [(0, 3), (2, 3), 5]

pi = [.25, .25, .25, .25]
actions = [(-1, 0), (1, 0), (0, -1), (0, 1)]

for c in range(1000):
    grid_new = np.zeros((n, n))
    # equation (3.12):
    # (i, j) ranges over all states s; stplus1 below is the successor state s'
    for i in range(n):
        for j in range(n):
            st = (i, j)
            if st == A[0]:
                st =  A[1]
                r = A[2]
                grid_new[i, j] += 1 * (r + discount * grid[st[0], st[1]])
            elif st == B[0]:
                st = B[1]
                r = B[2]
                grid_new[i, j] += 1 * (r + discount * grid[st[0], st[1]])
            else:
                # policy-weighted average over all actions
                for aIndex, a in enumerate(actions):
                    r = 0
                    stplus1 = (st[0] + a[0], st[1] + a[1])  # element-wise addition of tuples
                    if stplus1[0]<0 or stplus1[0]>n-1 or stplus1[1]<0 or stplus1[1]>n-1:
                        stplus1 = st
                        r = -1
                    grid_new[i, j] += pi[aIndex] * (r + discount * grid[stplus1[0], stplus1[1]])

    if np.sum(np.abs(grid - grid_new)) < error: break
    grid = grid_new

print(np.round(grid, decimals=2))

[[ 3.31  8.79  4.43  5.32  1.49]
 [ 1.52  2.99  2.25  1.91  0.55]
 [ 0.05  0.74  0.67  0.36 -0.4 ]
 [-0.97 -0.44 -0.35 -0.59 -1.18]
 [-1.86 -1.35 -1.23 -1.42 -1.98]]


# State-value function for the Optimal policy

In [9]:
n = 5
discount = 0.9
error = 1e-3  # convergence threshold (10^-3 is a bitwise XOR in Python and evaluates to -9)
grid = np.zeros((n, n))
A = [(0, 1), (4, 1), 10]
B = [(0, 3), (2, 3), 5]


actions = [(-1, 0), (1, 0), (0, -1), (0, 1)]

for c in range(1000):
    grid_new = np.zeros((n, n))
    # Bellman optimality backup: a max over actions replaces the policy-weighted average
    # (i, j) ranges over all states s; stplus1 below is the successor state s'
    for i in range(n):
        for j in range(n):
            st = (i, j)
            if st == A[0]:
                st =  A[1]
                r = A[2]
                grid_new[i, j] = 1 * (r + discount * grid[st[0], st[1]])
            elif st == B[0]:
                st = B[1]
                r = B[2]
                grid_new[i, j] = 1 * (r + discount * grid[st[0], st[1]])
            else:
                # maximize over all actions
                values = []
                for a in actions:
                    r = 0
                    stplus1 = (st[0] + a[0], st[1] + a[1])  # element-wise addition of tuples
                    if stplus1[0]<0 or stplus1[0]>n-1 or stplus1[1]<0 or stplus1[1]>n-1:
                        stplus1 = st
                        r = -1
                    values.append(1 * (r + discount * grid[stplus1[0], stplus1[1]]))
                    
                grid_new[i, j] = np.max(values)

    if np.sum(np.abs(grid - grid_new)) < error: break
    print(np.round(grid, decimals=1))
    print()
    grid = grid_new
  

[[ 0.  0.  0.  0.  0.]
 [ 0.  0.  0.  0.  0.]
 [ 0.  0.  0.  0.  0.]
 [ 0.  0.  0.  0.  0.]
 [ 0.  0.  0.  0.  0.]]

[[  0.  10.   0.   5.   0.]
 [  0.   0.   0.   0.   0.]
 [  0.   0.   0.   0.   0.]
 [  0.   0.   0.   0.   0.]
 [  0.   0.   0.   0.   0.]]

[[  9.   10.    9.    5.    4.5]
 [  0.    9.    0.    4.5   0. ]
 [  0.    0.    0.    0.    0. ]
 [  0.    0.    0.    0.    0. ]
 [  0.    0.    0.    0.    0. ]]

[[  9.   10.    9.    5.    4.5]
 [  8.1   9.    8.1   4.5   4. ]
 [  0.    8.1   0.    4.    0. ]
 [  0.    0.    0.    0.    0. ]
 [  0.    0.    0.    0.    0. ]]

[[  9.   10.    9.    8.6   4.5]
 [  8.1   9.    8.1   7.3   4. ]
 [  7.3   8.1   7.3   4.    3.6]
 [  0.    7.3   0.    3.6   0. ]
 [  0.    0.    0.    0.    0. ]]

[[  9.   10.    9.    8.6   7.8]
 [  8.1   9.    8.1   7.8   6.6]
 [  7.3   8.1   7.3   6.6   3.6]
 [  6.6   7.3   6.6   3.6   3.3]
 [  0.    6.6   0.    3.3   0. ]]

[[  9.   15.9   9.   10.9   7.8]
 [  8.1   9.    8.1   7.8   7. ]
 [  7.3

In [10]:
print(np.round(grid, decimals=1))

[[ 22.   24.4  22.   19.4  17.5]
 [ 19.8  22.   19.8  17.8  16. ]
 [ 17.8  19.8  17.8  16.   14.4]
 [ 16.   17.8  16.   14.4  13. ]
 [ 14.4  16.   14.4  13.   11.7]]
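
One way to sanity-check the table above is to apply a single optimality backup to it and confirm that nothing moves by more than rounding error. The sketch below is illustrative rather than part of the original notebook: it copies the printed one-decimal values into a hypothetical array `v`, so a small residual from rounding is expected.

```python
import numpy as np

n, gamma = 5, 0.9
# Values copied from the printed output above (rounded to one decimal).
v = np.array([[22.0, 24.4, 22.0, 19.4, 17.5],
              [19.8, 22.0, 19.8, 17.8, 16.0],
              [17.8, 19.8, 17.8, 16.0, 14.4],
              [16.0, 17.8, 16.0, 14.4, 13.0],
              [14.4, 16.0, 14.4, 13.0, 11.7]])
special = {(0, 1): ((4, 1), 10), (0, 3): ((2, 3), 5)}   # A -> A', B -> B'
actions = [(-1, 0), (1, 0), (0, -1), (0, 1)]

backup = np.zeros_like(v)
for i in range(n):
    for j in range(n):
        if (i, j) in special:
            (ni, nj), r = special[(i, j)]
            backup[i, j] = r + gamma * v[ni, nj]
        else:
            best = -np.inf
            for di, dj in actions:
                ni, nj, r = i + di, j + dj, 0
                if not (0 <= ni < n and 0 <= nj < n):
                    ni, nj, r = i, j, -1       # off-grid move: stay put, reward -1
                best = max(best, r + gamma * v[ni, nj])
            backup[i, j] = best

# Largest change produced by one more backup; it should be on the order of the rounding.
print(np.max(np.abs(backup - v)))
```

For example, the entry for A satisfies 10 + 0.9 × 16.0 = 24.4, where 16.0 is the value of A' = (4, 1).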
