**Example 3.5**
Gridworld. The figure (left) shows a rectangular gridworld representation
of a simple finite MDP. The cells of the grid correspond to the states of the environment. At
each cell, four actions are possible: $north$, $south$, $east$, and $west$, which deterministically
cause the agent to move one cell in the respective direction on the grid. Actions that
would take the agent off the grid leave its location unchanged, but also result in a reward
of $-1$. Other actions result in a reward of $0$, except those that move the agent out of the
special states $A$ and $B$. From state $A$, all four actions yield a reward of $+10$ and take the
agent to $A'$. From state $B$, all actions yield a reward of $+5$ and take the agent to $B'$.

Suppose the agent selects all **four actions with equal probability** in all states. The table below shows the value function, $\large v_\pi$, for this policy, for the discounted reward case with
$\large \gamma = 0.9$. This value function was computed by solving the system of linear equations
$$\large v_\pi(s) \doteq \sum_{a} \pi(a \mid s) \sum_{s',r} p(s',r \mid s, a) \Big[ r + \gamma \, v_{\pi}(s') \Big] \quad \text{for all } s \in \mathcal{S}$$

|a|b|c|d|e|
|---|-----|----|----|----|
|3.3| 8.8 | 4.4| 5.3| 1.5|
|1.5 |3.0 |2.3 |1.9 |0.5|
|0.1 |0.7 | 0.7 |0.4 |-0.4|
|-1.0 |-0.4| -0.4| -0.6 |-1.2|
|-1.9| -1.3| -1.2| -1.4 |-2.0|
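
Because the Bellman equation is linear in $v_\pi$, the system can be solved directly instead of iteratively. The following is a minimal sketch, not necessarily how the table was produced. It assumes row-major state indexing and the same positions for $A$, $B$, $A'$, and $B'$ that the notebook cells use; the names `build_model`, `P`, `R`, `SPECIAL`, and `MOVES` are hypothetical and introduced only for illustration.

```python
import numpy as np

# Assumed layout (mirrors the example): 5x5 grid, states indexed row-major.
N = 5
GAMMA = 0.9
SPECIAL = {(0, 1): ((4, 1), 10.0),   # A -> A', reward +10
           (0, 3): ((2, 3), 5.0)}    # B -> B', reward +5
MOVES = [(0, -1), (-1, 0), (0, 1), (1, 0)]  # west, north, east, south


def build_model():
    """Return (P, R): policy-averaged transition matrix and expected rewards."""
    P = np.zeros((N * N, N * N))
    R = np.zeros(N * N)
    for i in range(N):
        for j in range(N):
            s = i * N + j
            for di, dj in MOVES:
                if (i, j) in SPECIAL:
                    # Every action from a special state jumps to its primed state.
                    (ni, nj), r = SPECIAL[(i, j)]
                else:
                    ni, nj = i + di, j + dj
                    if 0 <= ni < N and 0 <= nj < N:
                        r = 0.0
                    else:
                        # Off-grid move: stay in place and pay a penalty.
                        ni, nj, r = i, j, -1.0
                P[s, ni * N + nj] += 0.25   # equiprobable policy
                R[s] += 0.25 * r
    return P, R


P, R = build_model()
# Bellman expectation equation in matrix form: (I - gamma * P) v = R
v = np.linalg.solve(np.eye(N * N) - GAMMA * P, R).reshape(N, N)
print(np.round(v, 1))
```

Under these assumptions the printed grid should agree with the table above to within its one-decimal rounding.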

Notice the negative values near the lower edge; these are the result of the high
probability of hitting the edge of the grid there under the random policy. State $A$ is the
best state to be in under this policy, but its expected return is less than 10, its immediate
reward, because from $A$ the agent is taken to $A'$, from which it is likely to run into the
edge of the grid. State $B$, on the other hand, is valued more than 5, its immediate reward,
because from $B$ the agent is taken to $B'$, which has a positive value. From $B'$ the expected
penalty (negative reward) for possibly running into an edge is more than compensated
for by the expected gain for possibly stumbling onto $A$ or $B$.
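
Both observations can be checked by hand, since every action from $A$ or $B$ leads deterministically to the primed state and the Bellman equation collapses to a single term. As a rough check using the rounded entries of the table above, and assuming $A'$ is the bottom cell of column b and $B'$ the third cell of column d (the positions set by `A_PRIME_POS` and `B_PRIME_POS` in the code below):

$$\large v_\pi(A) = 10 + \gamma \, v_\pi(A') \approx 10 + 0.9 \times (-1.3) = 8.83, \qquad v_\pi(B) = 5 + \gamma \, v_\pi(B') \approx 5 + 0.9 \times 0.4 = 5.36$$

Both results agree with the tabulated $8.8$ and $5.3$ up to the one-decimal rounding of the table entries. The sign of the primed state's value is what pushes $v_\pi(A)$ below its immediate reward and $v_\pi(B)$ above it.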

In [1]:
import numpy as np
import matplotlib
#matplotlib.use('Agg')
import matplotlib.pyplot as plt
#from matplotlib.table import Table


In [2]:
WORLD_SIZE = 5

A_POS = [0,1]
B_POS = [0,3]

A_PRIME_POS = [4,1]
B_PRIME_POS = [2,3]

A_TO_APRIME_TRANSITION_REWARD = 10
B_TO_BPRIME_TRANSITION_REWARD = 5

Gamma = 0.9

# Moves as (row, column) offsets: west, north, east, south.
Actions =[np.array([0,-1]),
          np.array([-1,0]),
          np.array([0,1]),
          np.array([1,0])]

# The policy is equiprobable over the four actions, so each has probability 1/4 = 0.25.
ACTION_PROB =  1/len(Actions)

In [3]:
def Step(state, action):
    # Normalize to a plain list so the comparisons below yield a single bool,
    # even when the caller passes a NumPy array.
    state = list(state)
    if state == A_POS:
        # Every action from A jumps to A' with reward +10.
        return A_PRIME_POS, A_TO_APRIME_TRANSITION_REWARD
    elif state == B_POS:
        # Every action from B jumps to B' with reward +5.
        return B_PRIME_POS, B_TO_BPRIME_TRANSITION_REWARD
    else:
        state = np.array(state)
        next_state = state + action
        x, y = next_state
        # Valid indices run from 0 to WORLD_SIZE - 1.
        if x < 0 or x >= WORLD_SIZE or y < 0 or y >= WORLD_SIZE:
            reward = -1
            next_state = state
        else:
            reward = 0
        return next_state, reward

In [4]:
test_state =np.array( [1,4])

for a in Actions:
    n_s = test_state + a
    print(f'{test_state} + {a} = {n_s}')
    

[1 4] + [ 0 -1] = [1 3]
[1 4] + [-1  0] = [0 4]
[1 4] + [0 1] = [1 5]
[1 4] + [1 0] = [2 4]


In [5]:
x, y = test_state
x,y

(1, 4)

In [6]:
# Top-right corner: moving north or east would leave the grid.
test_state =[0,4]

for a in Actions:
    m_test, r = Step(test_state,a)
    print(f'{m_test} {r}')

[0 3] 0
[0 4] -1
[0 4] -1
[1 4] 0


In [17]:
values = np.zeros((WORLD_SIZE,WORLD_SIZE))
count = 0
while True:
    # Sweep until successive value estimates agree to within the tolerance.
    count += 1
    new_value = np.zeros(values.shape)
    # Visit every cell, including the last row and column.
    for i in range(values.shape[0]):
        for j in range(values.shape[1]):
            for action in Actions:
                (next_i, next_j), reward = Step([i,j], action)
                # Bellman expectation backup, accumulated at full precision.
                new_value[i,j] += ACTION_PROB * (reward + Gamma * values[next_i,next_j])
    if np.sum(np.abs(values - new_value)) < 1e-4:
        print(f'Values converged after {count} iterations')
        print(values)
        break
    values = new_value

Rounded to one decimal place, the converged values should match the table of $v_\pi$ above.
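
The loop above stops once successive sweeps barely differ. As background (a standard property of this kind of synchronous backup, not something the example itself derives), each sweep shrinks the worst-case error by at least a factor of $\gamma$. Writing $v_k$ for the estimate after $k$ sweeps:

$$\large \max_{s} \big| v_{k+1}(s) - v_\pi(s) \big| \le \gamma \, \max_{s} \big| v_{k}(s) - v_\pi(s) \big|$$

With $\gamma = 0.9$, the error after $k$ sweeps is therefore at most $0.9^k$ times the initial error, which is why the loop converges from the all-zero starting estimate.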
