<a href="https://colab.research.google.com/github/Chris-Fourie/rl_at_ammi/blob/master/Policy_Value_New_Solutions.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Policy Evaluation
$$
\def\E{\mathbb{E}}
\def\given{\mid}
\def\states{\mathcal{S}}
\def\argmax{\text{argmax}}
$$
The first step to improving a policy is to evaluate the state-value function $v_\pi$ for an arbitrary policy $\pi$ (see Sutton & Barto, Section 4.1, for details). The state value under a policy is:

$$
\begin{aligned}
v_\pi (s) &= \E_\pi \left[ G_t \given S_t = s\right]\\
&= \E_\pi \left[ R_{t+1} + \gamma G_{t+1} \given S_t = s\right]\\
&= \E_\pi \left[ R_{t+1} + \gamma v_\pi ( S_{t+1}) \mid S_t = s \right]\\
&= \sum_a \pi(a|s) \sum_{s^\prime, r} p(s^\prime, r \mid s,a) \left[r + \gamma v_\pi (s^\prime)\right]
\end{aligned}
$$

An estimate of $v_\pi$ can be computed iteratively by applying the following update to all $s$:

$$
\begin{aligned}
v_{k+1}(s) &= \E_{\pi} \left[ R_{t+1} + \gamma v_k (S_{t+1}) \given S_t = s\right] \\
&= \sum_a \pi(a|s) \sum_{s^\prime, r} p(s^\prime, r \mid s,a) \left[r + \gamma v_k (s^\prime)\right]
\end{aligned}
$$
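The update can be applied in two ways: with two arrays, exactly as the equation is written (every $v_{k+1}(s)$ is computed from the old $v_k$), or in place, where states later in a sweep already see values updated earlier in that sweep. Sutton & Barto note that the in-place variant also converges to $v_\pi$. The sketch below compares the two on a hypothetical three-state chain (not the GridWorld) that stores its dynamics as `(prob, next_state, reward, done)` tuples; the names `sweep_two_array` and `sweep_in_place` are illustrative only.

```python
import numpy as np

# Hypothetical 3-state chain: 0 -> 1 -> 2 (terminal), reward -1 per move.
# There is one action per state, so pi(a|s) = 1 and the sum over actions disappears.
P = {
    0: {0: [(1.0, 1, -1.0, False)]},
    1: {0: [(1.0, 2, -1.0, True)]},
    2: {0: [(1.0, 2, 0.0, True)]},  # terminal state: absorbing, zero reward
}
gamma = 1.0


def sweep_two_array(v_k):
    """Synchronous sweep: every v_{k+1}(s) is computed from the old v_k."""
    v_next = np.zeros_like(v_k)
    for s in P:
        v_next[s] = sum(p * (r + gamma * v_k[s2]) for p, s2, r, _ in P[s][0])
    return v_next


def sweep_in_place(v):
    """In-place sweep: later states already see values updated in this sweep."""
    v = v.copy()
    for s in P:
        v[s] = sum(p * (r + gamma * v[s2]) for p, s2, r, _ in P[s][0])
    return v


v_sync = np.zeros(3)
v_inpl = np.zeros(3)
for k in range(3):
    v_sync = sweep_two_array(v_sync)
    v_inpl = sweep_in_place(v_inpl)

print(v_sync, v_inpl)
```

On this chain both variants settle on the same values: the number of remaining steps to the terminal state, negated.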

For any policy $\pi$, your task is to implement a policy evaluation function based on the following pseudocode:


![policy-evaluation](https://i.ibb.co/j4zZ9Xw/policy-evaluation.png)

(source: S&B Section 4.1, page 75)

Recall that $p$ represents the transition probabilities, which we will get from a simple GridWorld below, $\gamma$ is the discount factor, and $\theta$ is a small threshold that determines when to stop updating the value function. A simple GridWorld looks something like the following:

![gridworld](https://i.ibb.co/Lk166s4/gridworld.png)

(source: S&B Section 4.1, page 76)
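The exercises below read the dynamics $p$ from `env.P`, where `env.P[s][a]` is a list of `(prob, next_state, reward, done)` tuples. As a rough illustration of that layout, here is a hypothetical two-state MDP (invented for this example, not the GridWorld) stored in the same format, with a sanity check that the probabilities for each state-action pair sum to 1.

```python
# Hypothetical two-state, two-action MDP (not the GridWorld itself).
# P[s][a] is a list of (prob, next_state, reward, done) tuples.
P = {
    0: {
        0: [(1.0, 0, -1.0, False)],                         # action 0: stay in state 0
        1: [(0.8, 1, -1.0, True), (0.2, 0, -1.0, False)],   # action 1: usually reach state 1
    },
    1: {
        0: [(1.0, 1, 0.0, True)],  # state 1 is terminal: absorbing, zero reward
        1: [(1.0, 1, 0.0, True)],
    },
}

# Sanity check: outgoing probabilities sum to 1 for every (state, action) pair.
for s, actions in P.items():
    for a, transitions in actions.items():
        total = sum(prob for prob, _, _, _ in transitions)
        assert abs(total - 1.0) < 1e-12, (s, a, total)

# Expected immediate reward of taking action 1 in state 0:
# the sum over outcomes of prob * reward.
expected_reward = sum(prob * reward for prob, _, reward, _ in P[0][1])
print(expected_reward)
```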

### Exercise 1
Note that the code for this and the following exercises is based on the reinforcement-learning [repo](https://github.com/dennybritz/reinforcement-learning) from Denny Britz.

Complete the `policy_eval` function below so that the following evaluates without error:
```python
v = policy_eval(random_policy, env)
```

In [0]:
!git clone https://github.com/dennybritz/reinforcement-learning

Cloning into 'reinforcement-learning'...

In [0]:
import numpy as np
import sys
if "/content/reinforcement-learning" not in sys.path:
  sys.path.append("/content/reinforcement-learning") 
from lib.envs.gridworld import GridworldEnv

In [0]:
env = GridworldEnv()

In [0]:

def policy_eval(policy, env, discount_factor=1.0, theta=0.00001):
    """
    Evaluate a policy given an environment and a full description of the environment's dynamics.
    
    Args:
        policy: [S, A] shaped matrix representing the policy.
        env: OpenAI env. env.P represents the transition probabilities of the environment.
            env.P[s][a] is a list of transition tuples (prob, next_state, reward, done).
            env.nS is a number of states in the environment. 
            env.nA is a number of actions in the environment.
        theta: We stop evaluation once our value function change is less than theta for all states.
        discount_factor: Gamma discount factor.
    
    Returns:
        Vector of length env.nS representing the value function.
    """
    # Start with an all-zero value function (an arbitrary initial guess)
    V = np.zeros(env.nS)
    while True:
      
        delta = 0
        # For each state, perform a "full backup"
        for s in range(env.nS):
            v = 0
            
            # Look at the possible next actions
            for a, action_prob in enumerate(policy[s]):
              
                # For each action, look at the possible next states...
                for prob, next_state, reward, done in env.P[s][a]:
                  
                    # Calculate the expected value. Ref: Sutton book eq. 4.6.
                    v += action_prob * prob * (reward + discount_factor * V[next_state])
                    
            # How much our value function changed (across any states)
            delta = max(delta, np.abs(v - V[s]))
            V[s] = v
              
        # Stop evaluating once our value function change is below a threshold
        if delta < theta:
            break
    return np.array(V)

In [0]:
random_policy = np.ones([env.nS, env.nA]) / env.nA
v = policy_eval(random_policy, env)

Run the cell below to verify your $v_\pi$ has converged to the proper value.

In [0]:
expected_v = np.array([0, -14, -20, -22, -14, -18, -20, -20, -20, -20, -18, -14, -22, -20, -14, 0])
result = np.testing.assert_array_almost_equal(v, expected_v, decimal=2)
print(result)
print("If `result` prints None, then your algorithm was successfully implemented")

None


## Policy Iteration

Now that we can evaluate the state value function for any given policy, we can use the values to improve our policy. 

First consider the state-action value function (q-function), which consists of selecting $a$ in $s$ and then behaving according to $\pi$:

$$
\begin{aligned}
q_\pi(s,a) &= \E \left[ R_{t+1} + \gamma v_\pi (S_{t+1}) \given S_t = s, A_t = a \right] \\
&= \sum_{s^\prime, r} p(s^\prime, r \given s, a) \left[ r + \gamma v_\pi (s^\prime) \right]
\end{aligned}
$$

Let $\pi^\prime$ be a policy such that, for all $s \in \states$:
$$
q_\pi(s, \pi^\prime (s)) \geq v_\pi (s)
$$

Then the following holds for all $s \in \states$:
$$
v_{\pi^\prime} (s) \geq v_\pi (s)
$$
(See S&B page 78 for proof)

Consider a simple policy improvement in which $\pi^\prime$ selects, in each state, the action with the maximum state-action value:

$$
\begin{aligned}
\pi^\prime (s) &= \argmax_{a} q_\pi (s,a) \\
&= \argmax_a \E \left[ R_{t+1} + \gamma v_\pi (S_{t+1}) \given S_t = s, A_t = a \right] \\
&= \argmax_a \sum_{s^\prime, r} p(s^\prime, r \given s, a) [r + \gamma v_\pi (s^\prime)]
\end{aligned}
$$
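As a quick numerical check of the greedy step, the sketch below evaluates a uniform random policy on a hypothetical three-state MDP (not the GridWorld), computes $q_\pi$ from $v_\pi$, picks the greedy action in each state, and confirms that the new policy's values are at least as large everywhere. The helper names `q_values` and `evaluate`, the fixed sweep count, and the choice $\gamma = 0.9$ are assumptions made for this illustration.

```python
import numpy as np

# Hypothetical MDP: states 0 and 1 are non-terminal, state 2 is terminal.
# P[s][a] -> list of (prob, next_state, reward, done) tuples.
P = {
    0: {0: [(1.0, 0, -1.0, False)], 1: [(1.0, 1, -1.0, False)]},
    1: {0: [(1.0, 0, -1.0, False)], 1: [(1.0, 2, -1.0, True)]},
    2: {0: [(1.0, 2, 0.0, True)], 1: [(1.0, 2, 0.0, True)]},
}
n_states, n_actions, gamma = 3, 2, 0.9


def q_values(v):
    """q[s, a] = sum over (s', r) of p(s', r | s, a) * (r + gamma * v[s'])."""
    q = np.zeros((n_states, n_actions))
    for s in range(n_states):
        for a in range(n_actions):
            q[s, a] = sum(p * (r + gamma * v[s2]) for p, s2, r, _ in P[s][a])
    return q


def evaluate(policy, sweeps=500):
    """Iterative policy evaluation with a fixed number of synchronous sweeps."""
    v = np.zeros(n_states)
    for _ in range(sweeps):
        v = (policy * q_values(v)).sum(axis=1)
    return v


random_policy = np.full((n_states, n_actions), 1.0 / n_actions)
v_old = evaluate(random_policy)

# Greedy improvement: pick argmax_a q_pi(s, a) in every state.
greedy_actions = np.argmax(q_values(v_old), axis=1)
greedy_policy = np.eye(n_actions)[greedy_actions]
v_new = evaluate(greedy_policy)

print("v_pi :", v_old)
print("v_pi':", v_new)
```

In this toy problem the greedy policy chooses action 1 in both non-terminal states, which is the shortest route to the terminal state.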

Alternating between policy evaluation and policy improvement is known as **policy iteration**. Implement the algorithm based on the following pseudocode:

![policy-iteration](https://i.ibb.co/GHpV8kV/policy-iteration.png)

(source: S&B page 80)


In [0]:
import pprint

pp = pprint.PrettyPrinter(indent=2)
env = GridworldEnv()

### Exercise 2
Complete the `policy_improvement` function, based on the above pseudocode, so that the following runs without error:
```
policy, v = policy_improvement(env)
```

In [0]:

def policy_improvement(env, policy_eval_fn=policy_eval, discount_factor=1.0):
    """
    Policy Improvement Algorithm. Iteratively evaluates and improves a policy
    until an optimal policy is found.
    
    Args:
        env: The OpenAI environment.
        policy_eval_fn: Policy Evaluation function that takes 3 arguments:
            policy, env, discount_factor.
        discount_factor: gamma discount factor.
        
    Returns:
        A tuple (policy, V). 
        policy is the optimal policy, a matrix of shape [S, A] where each state s
        contains a valid probability distribution over actions.
        V is the value function for the optimal policy.
        
    """

    def one_step_lookahead(state, V):
        """
        Helper function to calculate the value of every action in a given state.
        
        Args:
            state: The state to consider (int)
            V: The value to use as an estimator, Vector of length env.nS
        
        Returns:
            A vector of length env.nA containing the expected value of each action.
        """
        A = np.zeros(env.nA)
        for a in range(env.nA):
            for prob, next_state, reward, done in env.P[state][a]:
                A[a] += prob * (reward + discount_factor * V[next_state])
        return A
      
    # Start with a random policy
    policy = np.ones([env.nS, env.nA]) / env.nA
    
    while True:
        # Evaluate the current policy
        V = policy_eval_fn(policy, env, discount_factor)
        
        # Will be set to false if we make any changes to the policy
        policy_stable = True
        
        # For each state...
        for s in range(env.nS):
            # The best action we would take under the current policy
            chosen_a = np.argmax(policy[s])
            
            # Find the best action by one-step lookahead
            # Ties are resolved arbitrarily
            action_values = one_step_lookahead(s, V)
            best_a = np.argmax(action_values)
            
            # Greedily update the policy
            if chosen_a != best_a:
                policy_stable = False
            policy[s] = np.eye(env.nA)[best_a]
        
        # If the policy is stable we've found an optimal policy. Return it
        if policy_stable:
            return policy, V

In [0]:
policy, v = policy_improvement(env)
print("Policy Probability Distribution:")
print(policy)
print("")

print("Reshaped Grid Policy (0=up, 1=right, 2=down, 3=left):")
print(np.reshape(np.argmax(policy, axis=1), env.shape))
print("")

print("Value Function:")
print(v)
print("")

print("Reshaped Grid Value Function:")
print(v.reshape(env.shape))
print("")

In [0]:
# Test the value function
expected_v = np.array([ 0, -1, -2, -3, -1, -2, -3, -2, -2, -3, -2, -1, -3, -2, -1,  0])
np.testing.assert_array_almost_equal(v, expected_v, decimal=2)

## Value Iteration
The issue with policy iteration is that every time we improve the policy, we must run policy evaluation again, and each evaluation is itself an iterative computation that may need many sweeps through the state set. Value iteration avoids this by combining evaluation and improvement in a single update. Implement value iteration based on the following pseudocode:


![value-iteration](https://i.ibb.co/SVTmBFQ/value-iteration.png)

(source: S&B Section 4.4, page 83)
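For reference, the value iteration update can be written in the same form as the policy evaluation update, with the expectation over the policy's actions replaced by a maximum over actions:

$$
\begin{aligned}
v_{k+1}(s) &= \max_a \mathbb{E} \left[ R_{t+1} + \gamma v_k (S_{t+1}) \mid S_t = s, A_t = a \right] \\
&= \max_a \sum_{s^\prime, r} p(s^\prime, r \mid s, a) \left[ r + \gamma v_k (s^\prime) \right]
\end{aligned}
$$

Once the values stop changing by more than $\theta$, a deterministic policy is read off by taking the maximizing action in each state.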


### Exercise 3
Complete the `value_iteration` function below, based on the above pseudocode, so that the following evaluates without error:
```python
policy, v = value_iteration(env)
```

In [0]:

def value_iteration(env, theta=0.0001, discount_factor=1.0):
    """
    Value Iteration Algorithm.
    
    Args:
        env: OpenAI env. env.P represents the transition probabilities of the environment.
            env.P[s][a] is a list of transition tuples (prob, next_state, reward, done).
            env.nS is a number of states in the environment. 
            env.nA is a number of actions in the environment.
        theta: We stop evaluation once our value function change is less than theta for all states.
        discount_factor: Gamma discount factor.
        
    Returns:
        A tuple (policy, V) of the optimal policy and the optimal value function.
    """
    
    def one_step_lookahead(state, V):
        """
        Helper function to calculate the value of every action in a given state.
        
        Args:
            state: The state to consider (int)
            V: The value to use as an estimator, Vector of length env.nS
        
        Returns:
            A vector of length env.nA containing the expected value of each action.
        """
        A = np.zeros(env.nA)
        for a in range(env.nA):
            for prob, next_state, reward, done in env.P[state][a]:
                A[a] += prob * (reward + discount_factor * V[next_state])
        return A

    V = np.zeros(env.nS)
    while True:
        # Stopping condition
        delta = 0
        # Update each state...
        for s in range(env.nS):
            # Do a one-step lookahead to find the best action
            A = one_step_lookahead(s, V)
            best_action_value = np.max(A)
            # Calculate delta across all states seen so far
            delta = max(delta, np.abs(best_action_value - V[s]))
            # Update the value function. Ref: Sutton book eq. 4.10. 
            V[s] = best_action_value        
        # Check if we can stop 
        if delta < theta:
            break
    
    # Create a deterministic policy using the optimal value function
    policy = np.zeros([env.nS, env.nA])
    for s in range(env.nS):
        # One step lookahead to find the best action for this state
        A = one_step_lookahead(s, V)
        best_action = np.argmax(A)
        # Always take the best action
        policy[s, best_action] = 1.0
    
    return policy, V

In [0]:
policy, v = value_iteration(env)

print("Policy Probability Distribution:")
print(policy)
print("")

print("Reshaped Grid Policy (0=up, 1=right, 2=down, 3=left):")
print(np.reshape(np.argmax(policy, axis=1), env.shape))
print("")

print("Value Function:")
print(v)
print("")

print("Reshaped Grid Value Function:")
print(v.reshape(env.shape))
print("")

In [0]:
# Test the value function
expected_v = np.array([ 0, -1, -2, -3, -1, -2, -3, -2, -2, -3, -2, -1, -3, -2, -1,  0])
np.testing.assert_array_almost_equal(v, expected_v, decimal=2)

## Exercise 4
Now that you have implemented Policy Iteration and Value Iteration, plot the average running time for both algorithms by varying the discount rate $\gamma$ between 0 and 1. What do you observe?
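One possible way to set up the measurement is sketched below. The helper `average_runtime` and the stand-in `toy_solver` are hypothetical names introduced for this example; the toy solver only mimics the "iterate until the change is below $\theta$" pattern so that the snippet runs on its own.

```python
import time

import numpy as np


def average_runtime(solver, gammas, repeats=5):
    """Return the mean wall-clock time of solver(gamma) for each gamma."""
    means = []
    for gamma in gammas:
        start = time.perf_counter()
        for _ in range(repeats):
            solver(gamma)
        means.append((time.perf_counter() - start) / repeats)
    return np.array(means)


def toy_solver(gamma, theta=1e-5):
    """Hypothetical stand-in: iterate v <- -1 + gamma * v until the change is below theta."""
    v = 0.0
    while True:
        v_next = -1.0 + gamma * v
        if abs(v_next - v) < theta:
            return v_next
        v = v_next


gammas = np.linspace(0.0, 0.99, 12)
runtimes = average_runtime(toy_solver, gammas)

for gamma, seconds in zip(gammas, runtimes):
    print(f"gamma={gamma:.2f}  mean time={seconds:.6f}s")
```

In the notebook, the stand-in could be swapped for `lambda g: value_iteration(env, discount_factor=g)` and `lambda g: policy_improvement(env, discount_factor=g)`, and the two resulting arrays plotted against `gammas`.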

## Alternative implementation
The algorithms you've seen in this notebook can alternatively be expressed in matrix notation and, in some cases, solved in closed form. See [this](https://drive.google.com/file/d/1UR20JtQRjFyrvCseusVuPBmQIpB3XFAH/view?usp=sharing) notebook.
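As one example of the closed-form idea (a sketch under stated assumptions, not necessarily the approach taken in the linked notebook): for a fixed policy, the Bellman equation is linear in $v_\pi$, so $v_\pi = (I - \gamma P_\pi)^{-1} r_\pi$ whenever $I - \gamma P_\pi$ is invertible, which holds for $\gamma < 1$. Here $P_\pi$ is the state-to-state transition matrix under $\pi$ and $r_\pi$ is the expected one-step reward. The three-state chain below is hypothetical.

```python
import numpy as np

# Hypothetical 3-state chain: 0 -> 1 -> 2 (terminal), reward -1 per move.
# P[s][a] -> list of (prob, next_state, reward, done) tuples.
P = {
    0: {0: [(1.0, 1, -1.0, False)]},
    1: {0: [(1.0, 2, -1.0, True)]},
    2: {0: [(1.0, 2, 0.0, True)]},  # terminal state: absorbing, zero reward
}
n_states, n_actions, gamma = 3, 1, 0.9
policy = np.ones((n_states, n_actions))  # only one action, taken with probability 1

# Build the policy-averaged transition matrix and expected reward vector.
P_pi = np.zeros((n_states, n_states))
r_pi = np.zeros(n_states)
for s in range(n_states):
    for a in range(n_actions):
        for prob, s2, reward, _ in P[s][a]:
            P_pi[s, s2] += policy[s, a] * prob
            r_pi[s] += policy[s, a] * prob * reward

# Solve (I - gamma * P_pi) v = r_pi instead of forming the inverse explicitly.
v = np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)
print(v)
```

Solving the linear system directly avoids the iterative sweeps, at the cost of work that grows roughly cubically with the number of states.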