# Homework 11 - Reinforcement Learning

In this homework, we consider the grid world shown in the lecture and the random policy that assigns equal probability 1/4 to each action.

Implement a slightly modified version of the iterative policy evaluation algorithm, as described below. The changes are:

1. the algorithm does not run to convergence, but for a fixed number of iterations $N$, where one iteration is one pass over all states
2. the value function is updated immediately for each state, instead of keeping an old $V_k$ and a new $V_{k+1}$

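The update is thus $V_{new}(x_k) = \sum_i \pi(u_i|x_k) \left(r_{k+1} + V(x_{k+1})\right)$, followed immediately by $V(x_k) = V_{new}(x_k)$, where $x_{k+1}$ and $r_{k+1}$ are the next state and the reward obtained by applying action $u_i$ in state $x_k$. Initialize $V$ as a numpy array of zeros. In each iteration, update the states in order from $x=0$ to $x=10$; state $x=11$ is not updated and keeps its initial value of zero. We consider the undiscounted case ($\gamma = 1$).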
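
As a quick check of the immediate update, the first iteration can be worked by hand for the first two states. This assumes the transition table and rewards from the environment cell later in this notebook (every move out of states 0 and 1 gives $r=-1$) and $V=0$ initially, with the actions taken in the order right, up, left, down:

$$
\begin{aligned}
V(0) &= \tfrac{1}{4}\big[(-1+V(0)) + (-1+V(1)) + (-1+V(0)) + (-1+V(0))\big] = -1,\\
V(1) &= \tfrac{1}{4}\big[(-1+V(1)) + (-1+V(2)) + (-1+V(1)) + (-1+V(0))\big] = \tfrac{1}{4}\big[-3 + (-1-1)\big] = -1.25.
\end{aligned}
$$

The update of state 1 already sees $V(0)=-1$. With separate arrays $V_k$ and $V_{k+1}$ it would have used $V(0)=0$ and produced $-1$ instead.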

The function to implement is called `pol_eval` and takes a single input $N$, the number of iterations over all states. It returns a numpy array $V$ that contains $V(x_i)$ in `V[i]`.

The environment is available and can be called as `x2, r = gridworld(x, u)`, the same way as in the code example after the lecture. Use this notebook to implement and test your code, then enter your solution in Moodle.

**Function Name:** pol_eval

**Function Input Name(s):** N

**Function Input Type(s):** integer > 0

**Output Type:** numpy array with 12 entries of type float

## Environment
This function defines the grid world environment presented in the lecture. The actions are defined as follows:  
u = 0: go right  
u = 1: go up  
u = 2: go left  
u = 3: go down  
The inputs are the current state $x$ and the chosen action $u$; the outputs are the next state $x2$ and the reward $r$.
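
The action encoding and the random policy can be written down explicitly, which may help when looping over actions. This is only a minimal sketch: the names `ACTION_NAMES` and `PI` are hypothetical and not part of the provided environment; only the indices 0 to 3 and the probability 1/4 come from the text.

```python
import numpy as np

# Hypothetical lookup table: action index -> direction, as listed above.
ACTION_NAMES = {0: "right", 1: "up", 2: "left", 3: "down"}

# Uniform random policy: probability 1/4 for every action, in every state.
PI = np.full(len(ACTION_NAMES), 1.0 / len(ACTION_NAMES))

for u, name in ACTION_NAMES.items():
    print(f"u = {u}: go {name}, pi(u|x) = {PI[u]}")
```

Because the policy does not depend on the state, a single vector of four probabilities is enough.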

In [None]:
import numpy as np

def gridworld(x, u):
    # Transition Matrix
    T = np.array([
        [0, 1, 0, 0],
        [1, 2, 1, 0],
        [3, 2, 2, 1],
        [8, 4, 2, 3],
        [5, 4, 4, 3],
        [6, 5, 4, 8],
        [6, 6, 5, 7],
        [7, 6, 8, 10],
        [7, 5, 3, 9],
        [10, 8, 9, 9],
        [10, 7, 9, 11],
        [11, 11, 11, 11]
    ], dtype = int)
    
    # Get the next state
    x2 = T[x, u]
    
    # Reward
    r = -1
    if x2 in [8, 9]:
        r = -2
    elif x == 11:
        r = 0
    
    # Return the next state and reward
    return x2, r
    

## Policy Evaluation
Write your function here. The return value should be a numpy array of length 12.

In [None]:
def pol_eval(N):
    # Value function, initialized to zeros as specified in the task
    V = np.zeros(12)
    # insert your code here
    return V

The implemented function will be called with different numbers of iterations. A run over 2000 iterations should produce the converged result `V = [-126.5 -122.5 -114.5 -102.5 -100 -93.5 -87.5 -77.5 -88 -73.5 -52 0]`.
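
One possible way to structure the in-place evaluation is sketched below, for comparison with your own solution. It is not the reference solution: the name `pol_eval_sketch` is hypothetical, and the environment is repeated in compact form (same transition table and reward rule as in the environment cell) only so that the block runs on its own.

```python
import numpy as np

# Transition table copied from the environment cell: T[x, u] is the next state.
T = np.array([
    [0, 1, 0, 0], [1, 2, 1, 0], [3, 2, 2, 1], [8, 4, 2, 3],
    [5, 4, 4, 3], [6, 5, 4, 8], [6, 6, 5, 7], [7, 6, 8, 10],
    [7, 5, 3, 9], [10, 8, 9, 9], [10, 7, 9, 11], [11, 11, 11, 11],
], dtype=int)


def gridworld(x, u):
    # Same reward rule as the provided environment.
    x2 = T[x, u]
    r = -1
    if x2 in [8, 9]:
        r = -2
    elif x == 11:
        r = 0
    return x2, r


def pol_eval_sketch(N):
    """Hypothetical sketch: N in-place sweeps under the uniform random policy."""
    V = np.zeros(12)
    for _ in range(N):
        # States are visited in order 0..10; state 11 keeps V[11] = 0.
        for x in range(11):
            v_new = 0.0
            for u in range(4):
                x2, r = gridworld(x, u)
                # pi(u|x) = 1/4, gamma = 1; V[x2] may already be updated.
                v_new += 0.25 * (r + V[x2])
            V[x] = v_new  # overwrite immediately, no separate V_k array
    return V


print(pol_eval_sketch(1)[:2])
```

After a single iteration the first two entries are -1 and -1.25, which reflects that state 1 already uses the updated value of state 0. For large $N$ the result can be compared with the converged vector stated above, for example with `np.allclose`.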

In [None]:
iterations = [10, 50, 100, 2000]

for N in iterations:
    V = pol_eval(N)
    print(f"The value function for {N} iterations: \n {V} \n")