# GridWorld

When the agent tries to move off the grid, the negative reward is assigned to the cell where the transition originates.
When the agent moves into a high-reward cell (A or B), the positive reward is assigned to the destination of the transition (A or B).

Implementation notes:
   - The state is the agent position on the grid.
   - The action is the movement of the agent in one time unit.
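
As a minimal sketch of these two notes, a state can be a `(row, col)` tuple and an action a `(d_row, d_col)` offset. The helper name `step` and the upper-case constants are illustrative assumptions, not part of the notebook; the special cells and rewards mirror the values used in the code further down.

```python
N = 5  # grid side length

# Special cells as (origin, destination, reward), same values as in the notebook
A = ((0, 1), (4, 1), 10)
B = ((0, 3), (2, 3), 5)

ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right


def step(state, action):
    """Return (next_state, reward) for one deterministic transition (hypothetical helper)."""
    for origin, destination, reward in (A, B):
        if state == origin:
            return destination, reward  # any action taken in A or B moves the agent to A' or B'
    row, col = state[0] + action[0], state[1] + action[1]
    if not (0 <= row < N and 0 <= col < N):
        return state, -1  # off-grid move: the agent stays put and receives -1
    return (row, col), 0


print(step((0, 0), (-1, 0)))  # → ((0, 0), -1)
print(step((0, 1), (1, 0)))   # → ((4, 1), 10)
```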

# State-value function for the equiprobable random policy

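This value function is the solution of the system of linear equations (3.12) for the discounted-reward case with γ = 0.9.
The code below approximates that solution iteratively: it sweeps over all states, applying (3.12) as an update rule, for up to 1000 sweeps.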
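
For comparison, the linear system can also be solved directly. The following is a sketch under stated assumptions rather than the notebook's own method: states are flattened to the index `i * n + j`, and the names `P`, `r` and `specials` are introduced here for illustration. It writes the Bellman equation as v = r + γPv and hands (I − γP)v = r to `np.linalg.solve`.

```python
import numpy as np

n, gamma = 5, 0.9
specials = {(0, 1): ((4, 1), 10.0), (0, 3): ((2, 3), 5.0)}  # A -> A', B -> B'
actions = [(-1, 0), (1, 0), (0, -1), (0, 1)]

P = np.zeros((n * n, n * n))  # P[s, s']: transition probability under the random policy
r = np.zeros(n * n)           # r[s]: expected immediate reward under the random policy

for i in range(n):
    for j in range(n):
        s = i * n + j
        for di, dj in actions:
            if (i, j) in specials:
                (ni, nj), reward = specials[(i, j)]
            else:
                ni, nj, reward = i + di, j + dj, 0.0
                if not (0 <= ni < n and 0 <= nj < n):
                    ni, nj, reward = i, j, -1.0  # off-grid: stay in place, reward -1
            P[s, ni * n + nj] += 0.25  # each action has probability 1/4
            r[s] += 0.25 * reward

# Bellman equation in matrix form: v = r + gamma * P v  <=>  (I - gamma * P) v = r
v = np.linalg.solve(np.eye(n * n) - gamma * P, r).reshape(n, n)
print(np.round(v, 2))
```

The direct solve needs no convergence threshold, at the cost of building an n² × n² matrix, which is negligible for a 5 × 5 grid.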

In [1]:
import numpy as np
from collections import defaultdict

In [2]:
n = 5
gamma = 0.9   # discount rate
error = 1e-3  # convergence threshold (`10^-3` is a bitwise XOR in Python and evaluates to -9)
grid = np.zeros((n, n))
A = [(0, 1), (4, 1), 10]
B = [(0, 3), (2, 3), 5]

pi = [.25, .25, .25, .25]
actions = [(-1, 0), (1, 0), (0, -1), (0, 1)]

for c in range(1000):
    grid_new = np.zeros((n, n))
    # equation (3.12):
    # (i, j) ranges over all states s; the successor state s' is computed below
    # note that in this problem the result of taking an action is deterministic,
    # so there is no need to compute an expectation over the possible outcomes of an action
    # as in (3.12)
    for i in range(n):
        for j in range(n):
            st = (i, j)
            # in the case of the special points A and B, the agent chooses an action randomly
            # but the result is always moving to A' and B' regardless of the action
            # so there is no need to compute a weighted sum over all possible actions as in 
            # the other states that are not A or B
            if st == A[0]:
                st =  A[1]
                r = A[2]
                grid_new[i, j] = 1 * (r + gamma * grid[st[0], st[1]])
            elif st == B[0]:
                st = B[1]
                r = B[2]
                grid_new[i, j] = 1 * (r + gamma * grid[st[0], st[1]])
            else:
                # sum of rewards(*) over all possible actions from current state (i, j)
                # (*) weighted by the probability of taking each action under policy pi
                for aIndex, a in enumerate(actions):
                    r = 0
                    stplus1 = (st[0] + a[0], st[1] + a[1])  # element-wise addition of tuples
                    if stplus1[0]<0 or stplus1[0]>n-1 or stplus1[1]<0 or stplus1[1]>n-1:
                        stplus1 = st
                        r = -1
                    grid_new[i, j] += pi[aIndex] * (r + gamma * grid[stplus1[0], stplus1[1]])

    delta = np.sum(np.abs(grid - grid_new))  # total change in this sweep
    grid = grid_new                          # keep the latest sweep before testing for convergence
    if delta < error:
        break

print(np.round(grid, decimals=2))

[[ 3.31  8.79  4.43  5.32  1.49]
 [ 1.52  2.99  2.25  1.91  0.55]
 [ 0.05  0.74  0.67  0.36 -0.4 ]
 [-0.97 -0.44 -0.35 -0.59 -1.18]
 [-1.86 -1.35 -1.23 -1.42 -1.98]]
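
As a quick consistency check, a few entries of the printed table can be plugged back into the right-hand side of the Bellman equation. This is only a sketch: it uses the rounded values from the output above, so agreement is expected to about two decimals, and the names `rhs_A`, `rhs_inner` and `rhs_corner` are introduced here for illustration.

```python
import numpy as np

gamma = 0.9
V = np.array([
    [ 3.31,  8.79,  4.43,  5.32,  1.49],
    [ 1.52,  2.99,  2.25,  1.91,  0.55],
    [ 0.05,  0.74,  0.67,  0.36, -0.40],
    [-0.97, -0.44, -0.35, -0.59, -1.18],
    [-1.86, -1.35, -1.23, -1.42, -1.98],
])

# Cell A = (0, 1): every action yields +10 and lands on A' = (4, 1)
rhs_A = 10 + gamma * V[4, 1]

# Interior cell (1, 1): four equiprobable moves (up, down, left, right), all with reward 0
rhs_inner = 0.25 * gamma * (V[0, 1] + V[2, 1] + V[1, 0] + V[1, 2])

# Corner cell (4, 4): "up" and "left" are ordinary moves, "down" and "right" bounce back with reward -1
rhs_corner = 0.25 * (gamma * V[3, 4] + gamma * V[4, 3] + 2 * (-1 + gamma * V[4, 4]))

# Each pair should agree up to the rounding of the table
print(V[0, 1], rhs_A)
print(V[1, 1], rhs_inner)
print(V[4, 4], rhs_corner)
```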


# State-value function for the optimal policy

In [3]:
n = 5
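gamma = 0.9   # discount rate
error = 1e-3  # convergence threshold (`10^-3` is a bitwise XOR in Python and evaluates to -9)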
grid = np.zeros((n, n))
A = [(0, 1), (4, 1), 10]
B = [(0, 3), (2, 3), 5]


actions = [(-1, 0), (1, 0), (0, -1), (0, 1)]

for c in range(1000):
    grid_new = np.zeros((n, n))
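    # Bellman optimality backup: same form as (3.12), but with a max over actions
    # in place of the policy-weighted sum
    # (i, j) ranges over all states s; the successor state s' is computed below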
    for i in range(n):
        for j in range(n):
            st = (i, j)
            # in the case of the special points A and B, the agent chooses an action as per policy
            # but the result is always moving to A' and B' deterministically regardless of the action
            # so the optimal policy used by the agent to choose the action doesn't matter in A and B
            if st == A[0]:
                st =  A[1]
                r = A[2]
                grid_new[i, j] = 1 * (r + gamma * grid[st[0], st[1]])
            elif st == B[0]:
                st = B[1]
                r = B[2]
                grid_new[i, j] = 1 * (r + gamma * grid[st[0], st[1]])
            else:
                # maximize over all actions
                max_action_value = -np.inf  # start below any reachable value (comparing against None fails in Python 3)
                for aIndex, a in enumerate(actions):
                    r = 0
                    stplus1 = (st[0] + a[0], st[1] + a[1])  # element-wise addition of tuples
                    if stplus1[0]<0 or stplus1[0]>n-1 or stplus1[1]<0 or stplus1[1]>n-1:
                        stplus1 = st
                        r = -1
                    action_value = 1 * (r + gamma * grid[stplus1[0], stplus1[1]])
                    max_action_value = max(action_value, max_action_value)
                    
                grid_new[i, j] = max_action_value

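    delta = np.max(np.abs(grid - grid_new))  # largest change in this sweep
    grid = grid_new                          # keep the latest sweep before testing for convergence
    # print(np.round(grid, decimals=1))      # uncomment to watch the values converge
    if delta < error:
        break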
  

In [4]:
print(np.round(grid, decimals=1))

[[ 22.   24.4  22.   19.4  17.5]
 [ 19.8  22.   19.8  17.8  16. ]
 [ 17.8  19.8  17.8  16.   14.4]
 [ 16.   17.8  16.   14.4  13. ]
 [ 14.4  16.   14.4  13.   11.7]]
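
The largest entry can be checked by hand. The table suggests that the best move from A' = (4, 1) is to walk straight back up to A, since each step up column 1 multiplies the value by about 1/0.9. Under that assumption the agent collects +10 once every five steps, so v*(A) = 10 + γ⁵ v*(A), which gives v*(A) = 10 / (1 − γ⁵). The variable names below are illustrative.

```python
gamma = 0.9

# A -> A' takes one step (+10), A' -> A takes four steps up (reward 0 each),
# so the +10 recurs every 5 steps: v(A) = 10 + gamma**5 * v(A)
v_A = 10 / (1 - gamma ** 5)
v_A_prime = gamma ** 4 * v_A  # A' is four discounted steps away from A

print(round(v_A, 1), round(v_A_prime, 1))  # → 24.4 16.0
```

Both numbers match the entries at (0, 1) and (4, 1) in the output above.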
