# Iterative Policy Evaluation

In the previous chapter we studied the Bellman optimality equations.<br>
For the state-value function:
$$
\begin{align}
    v_*(s) &= \max_{a}\ \mathbb{E}\big[R_{t+1} + \gamma v_*(S_{t + 1}) | S_t = s, A_t=a\big]\\
    &= \max_{a}\sum_{s^{\prime}, r}p(s^{\prime}, r | s, a)\big[r + \gamma v_*(s^{\prime})\big]
\end{align}
$$

For the action-value function (the value of a state-action pair):
$$
\begin{align}
    q_*(s, a) &= \mathbb{E}\big[R_{t+1} + \gamma\ \max_{a^{\prime}}\ q_*(S_{t+1}, a^{\prime}) | S_t=s, A_t=a\big]\\
    &= \sum_{s^{\prime}, r}p(s^{\prime}, r|s, a)\big[r + \gamma\ \max_{a^{\prime}}\ q_*(s^{\prime}, a^{\prime})\big]
\end{align}
$$

## Policy Evaluation
Also in the previous chapter, we saw how to express $v_\pi(s)$ in terms of $v_\pi(s^{\prime})$, where $s^\prime$ is a successor state of $s$.
$$
\begin{align}
    v_\pi(s) &= \mathbb{E}_\pi\big[G_t | S_t=s\big]\\
    &= \mathbb{E}_\pi\big[R_{t+1} + \gamma G_{t+1} | S_t=s\big]\\
    &= \mathbb{E}_\pi\big[R_{t+1} + \gamma v_\pi(S_{t+1}) | S_t = s\big]\\
    &= \sum_{a}\pi(a|s)\sum_{s^\prime, r}p(s^\prime, r | s, a)\big[r + \gamma v_\pi(s^\prime)\big]
\end{align}
$$

- $v_\pi(s)$ is the expected return when starting in state $s$ and following policy $\pi$ thereafter, i.e. how good it is to be in $s$ under $\pi$
- $\pi(a|s)$ is the probability of taking action $a$ in state $s$
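
Because the last line of the derivation is linear in $v_\pi$, a small MDP can be solved directly as a linear system, $v_\pi = r_\pi + \gamma P_\pi v_\pi$. The sketch below illustrates this on a hypothetical 2-state, 2-action MDP that is not part of this chapter. The arrays `P`, `R`, `pi_sa`, the flat state indexing and the discount factor $\gamma = 0.9$ are all assumptions made for the example.

```python
import numpy as np

gamma = 0.9  # assumed discount factor; gamma < 1 keeps (I - gamma * P_pi) invertible

# Hypothetical MDP with 2 states and 2 actions (action 0: stay, action 1: switch state).
# P[s, a, s_next] = p(s_next | s, a)
P = np.array([
    [[1.0, 0.0], [0.0, 1.0]],   # from state 0
    [[0.0, 1.0], [1.0, 0.0]],   # from state 1
])
# R[s, a]: deterministic reward for taking action a in state s (assumed values)
R = np.array([[0.0, 1.0],
              [2.0, 0.0]])
# pi_sa[s, a] = pi(a | s), here the uniform random policy
pi_sa = np.full((2, 2), 0.5)

# Average the dynamics and the rewards over the policy
P_pi = np.einsum("sa,sat->st", pi_sa, P)   # P_pi[s, s_next] = sum_a pi(a|s) p(s_next|s,a)
r_pi = np.einsum("sa,sa->s", pi_sa, R)     # r_pi[s] = sum_a pi(a|s) r(s,a)

# Solve (I - gamma * P_pi) v = r_pi for v_pi
v_pi = np.linalg.solve(np.eye(2) - gamma * P_pi, r_pi)
print(v_pi)  # approximately [7.25, 7.75]
```

The direct solve needs a matrix of size $|S| \times |S|$, which is one reason an iterative scheme is attractive for larger state spaces.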

### Iterative method
We approximate $v_\pi$ with a sequence of value functions $(v_0, v_1, v_2, \dots)$ that converges to $v_\pi$. The initial $v_0$ is arbitrary, except that terminal states must be given the value 0.<br>
The update rule is:
$$
\begin{align}
    v_{k+1}(s) &= \mathbb{E}_\pi\big[R_{t+1} + \gamma v_k(S_{t+1}) | S_t=s\big]\\
    &= \sum_{a}\pi(a|s)\sum_{s^\prime, r}p(s^\prime, r | s, a)\big[r + \gamma v_k(s^\prime)\big]
\end{align}
$$

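- The sequence $(v_k)$ is guaranteed to converge to $v_\pi$ as $k\rightarrow\infty$, provided that $\gamma < 1$ or that a terminal state is eventually reached from every state under $\pi$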
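
Read literally, the update rule computes every $v_{k+1}(s)$ from the old values $v_k$, which means keeping two arrays per sweep. The following is a minimal vectorized sketch of that synchronous variant, assuming rewards that depend only on $(s, a)$ and the array layouts `rewards[i, j, a]`, `transition[i, j, a, i', j']` and `policy[a, i, j]`. The helper names `build_gridworld` and `synchronous_policy_evaluation`, the `max_sweeps` cap and the 4x4 grid with terminal corners are illustrative assumptions.

```python
import numpy as np

def build_gridworld(rows=4, cols=4):
    """Hypothetical helper: grid world with absorbing terminal states in two corners."""
    moves = [(-1, 0), (0, 1), (1, 0), (0, -1)]   # 0: up, 1: right, 2: down, 3: left
    terminals = [(0, 0), (rows - 1, cols - 1)]
    transition = np.zeros((rows, cols, 4, rows, cols))
    rewards = -np.ones((rows, cols, 4))          # -1 per step
    for i in range(rows):
        for j in range(cols):
            for a, (di, dj) in enumerate(moves):
                if (i, j) in terminals:
                    ni, nj = i, j                # terminal states are absorbing
                else:
                    # moves that would leave the grid keep the agent in place
                    ni = min(max(i + di, 0), rows - 1)
                    nj = min(max(j + dj, 0), cols - 1)
                transition[i, j, a, ni, nj] = 1.0
    for t in terminals:
        rewards[t] = 0.0                         # no reward once terminated
    policy = np.full((4, rows, cols), 0.25)      # uniform random policy
    return rewards, transition, policy

def synchronous_policy_evaluation(rewards, transition, policy,
                                  gamma=1.0, threshold=1e-10, max_sweeps=100_000):
    V = np.zeros(transition.shape[:2])           # v_0 = 0 everywhere
    for _ in range(max_sweeps):
        # sum_{s'} p(s'|s,a) v_k(s') for every state-action pair
        expected_next = np.einsum("ijakl,kl->ija", transition, V)
        # rewards depend only on (s, a), so they factor out of the sum over s'
        q = rewards + gamma * expected_next
        # average over actions with pi(a|s) to get v_{k+1}
        V_new = np.einsum("aij,ija->ij", policy, q)
        delta = np.max(np.abs(V_new - V))
        V = V_new                                # v_k is replaced only after the full sweep
        if delta <= threshold:
            break
    return V

rewards, transition, policy = build_gridworld()
V = synchronous_policy_evaluation(rewards, transition, policy)
print(np.round(V, 1))
```

Updating the values in place, so that later states in a sweep already see the new values of earlier ones, converges to the same $v_\pi$ and needs only one array.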

In [1]:
import numpy as np
import matplotlib.pyplot as plt

In [2]:
def iterative_policy_evaluation(grid_world_shape, rewards, transition, policy, gamma=1.0, threshold=1e-12):
    """
    Evaluate a policy by sweeping over all states, updating V in place,
    until the largest change in a sweep is at most `threshold`.

    args:
        grid_world_shape: shape of the grid world, (rows, columns)
        rewards:          reward for taking action a in state s, shape S x A
        transition:       probability of moving from state s to s_prime under action a, shape S x A x S
        policy:           the policy (pi) to be evaluated, indexed as policy[a, i, j]
        gamma:            discount factor
        threshold:        determines the accuracy of the estimate
    returns:
        V:                approximation of v_pi, same shape as the grid world
    """

    # Initialize V(s) arbitrarily (zeros here); terminal states must have V(terminal) = 0
    V = np.zeros(grid_world_shape)

    # Take the number of rows and columns in grid world shape
    rows, columns = grid_world_shape
    
    while True:
        delta = 0

        # i and j together denote state s
        for i in range(rows):
            for j in range(columns):
                v = V[i, j]
                # for each action find its expected return given action a is taken
                new_vs = 0
                for a in range(4):
                    # i_prime and j_prime denote state s_prime
                    for i_prime in range(rows):
                        for j_prime in range(columns):
                            # use the function arguments (policy, rewards), not the globals pi and reward
                            new_vs += policy[a, i, j] * transition[i, j, a, i_prime, j_prime] * (rewards[i, j, a] + gamma * V[i_prime, j_prime])
                # Update the value for state s
                V[i, j] = new_vs 

                # Storing the difference that each update makes
                delta = max(delta, np.abs(v - V[i, j]))

        # If the maximum difference made is less than the threshold break
        if delta <= threshold: break
    return V

### Definition of the MDP environment
Below we define the environment of the Markov decision process (MDP): its reward signal and its dynamics.<br>
We also define a policy $\pi$ that chooses among the four actions uniformly at random.

In [3]:
# Defining the shape of the grid world
grid_world_shape = (4, 4)


# Defining the reward function, SxA (-1 for every step, 0 in the terminal states)
reward = np.zeros(grid_world_shape + (4,)) - 1
reward[0, 0, :] = 0                                 # Terminal state at top left corner
reward[-1, -1, :] = 0                               # Terminal state at bottom right corner


# Defining the transition function, SxAxS
transition = np.zeros(grid_world_shape + (4,) + grid_world_shape)
# Moves that would leave the grid keep the agent in the same state
for a in range(4):                                  # let's denote 0: up, 1:right, 2:down, 3:left
    for i in range(grid_world_shape[0]):
        for j in range(grid_world_shape[1]):
            if a == 0:                    
                transition[i, j, 0, max(0, i - 1), j] = 1
            if a == 1 :
                transition[i, j, 1, i, min(grid_world_shape[1] - 1, j + 1)] = 1
            if a == 2:
                transition[i, j, 2, min(grid_world_shape[0] - 1, i + 1), j] = 1
            if a == 3:
                transition[i, j, 3, i, max(j - 1, 0)] = 1
# Change the transition function for terminal states
transition[0, 0, :, :, :] = 0
transition[0, 0, :, 0, 0] = 1
transition[-1, -1, :, :, :] = 0
transition[-1, -1, :, -1, -1] = 1


# Define the policy in which actions are chosen from a uniform probability distribution
pi = np.ones((4,) + grid_world_shape) / 4

In [4]:
iterative_policy_evaluation(grid_world_shape, reward, transition, pi)

array([[  0., -14., -20., -22.],
       [-14., -18., -20., -20.],
       [-20., -20., -18., -14.],
       [-22., -20., -14.,   0.]])

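- Note that we use ```transition```, i.e. $p(s^\prime | s, a)$, together with a separate reward array instead of the full dynamics $p(s^\prime, r | s, a)$. This is simpler to implement and loses nothing here, because the rewards are deterministic rather than drawn from a probability distribution.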
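
As a short worked step, assuming the reward is a deterministic function $r(s, a)$ as in this grid world (the symbol $r(s, a)$ is introduced here only for illustration), summing the dynamics over $r$ gives the state-transition probabilities, and the inner sum of the update rule reduces to a sum over successor states only:

$$
\begin{align}
    p(s^\prime | s, a) &= \sum_{r}p(s^\prime, r | s, a)\\
    \sum_{s^\prime, r}p(s^\prime, r | s, a)\big[r + \gamma v_k(s^\prime)\big] &= \sum_{s^\prime}p(s^\prime | s, a)\big[r(s, a) + \gamma v_k(s^\prime)\big]
\end{align}
$$

The right-hand side is the quantity accumulated in the innermost loops of the implementation, weighted by $\pi(a|s)$.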