***
<p style="text-align:left;">Reinforcement Learning
<span style="float:right;">Monday, 04. May 2020</span></p>

<p style="text-align:left;">Prof. S. Harmeling
<span style="float:right;">DUE 23:55 Monday, 11. May 2020</span></p>

---
<p style="text-align:center;"><b>Exercise set #3</b></p>

---

# 2. Policy iteration

In this exercise you will implement **policy iteration**, which is a dynamic programming algorithm.  
This exercise was inspired by the Reinforcement Learning tutorial by Shimon Whiteson  
from the Machine Learning Summer School 2019: https://github.com/mlss-skoltech

In [None]:
import gym
import numpy as np
from time import sleep
from IPython.display import clear_output

## Frozen Lake Environment

In this exercise we will work with the Frozen Lake environment from OpenAI's gym library.  
Make yourself familiar with the environment:
- https://gym.openai.com/envs/FrozenLake-v0
- https://github.com/openai/gym/wiki/FrozenLake-v0

The environment also provides some useful attributes:
- ```env.nS```: the number of states
- ```env.nA```: the number of actions
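- ```env.P```: contains the transition model. ```env.P[state][action]``` is a list of possible outcomes, each a tuple ```(prob, next_state, reward, done)```, i.e.
    ```
    for prob, next_state, reward, done in env.P[state][action]:
        ...
    ```
    where ```prob``` is the probability $p(s',r|s,a)$ that ```state``` and ```action``` lead to ```next_state``` and ```reward```, and ```done``` indicates whether ```next_state``` is terminal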
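
To make the layout of the transition model concrete, here is a minimal sketch on a hand-written toy MDP. The dictionary `P` and the helper `expected_reward` are hypothetical and only mimic the nested structure of ```env.P```; they are not part of gym.

```python
# Hypothetical 3-state, 2-action MDP in the same nested layout as env.P:
# P[state][action] is a list of (prob, next_state, reward, done) tuples.
P = {
    0: {0: [(1.0, 0, 0.0, False)],
        1: [(0.8, 1, 0.0, False), (0.2, 0, 0.0, False)]},
    1: {0: [(1.0, 0, 0.0, False)],
        1: [(0.8, 2, 1.0, True), (0.2, 1, 0.0, False)]},
    2: {0: [(1.0, 2, 0.0, True)],   # state 2 is terminal and absorbing
        1: [(1.0, 2, 0.0, True)]},
}


def expected_reward(P, state, action):
    """Expected immediate reward of taking `action` in `state`."""
    return sum(prob * reward for prob, _next_state, reward, _done in P[state][action])


# The outcome probabilities of every state-action pair must sum to one.
for state in P:
    for action in P[state]:
        total = sum(prob for prob, *_ in P[state][action])
        assert abs(total - 1.0) < 1e-12

r = expected_reward(P, state=1, action=1)
```

The same loop pattern should work on ```env.P``` directly, since only indexing and tuple unpacking are used.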

Now, let's create an instance of the environment:

In [None]:
env = gym.make('FrozenLake-v0').env

## Policies

In our setup, policies are functions that take two arguments, ```env``` and ```state```, and return an action based on that state:

```
def my_policy(env, state):
    action = ...
    return action
```

Below, we have implemented a function that runs one rollout of a given policy:

In [None]:
def rollout(env, policy, render=False):
    state = env.reset()
    total_reward = 0.
    done = False
    while not done:
        if render:
            env.render()
            clear_output(wait=True)
            
        action = policy(env, state)
        state, reward, done, info = env.step(action)
        total_reward += reward
        
        if render:
            sleep(0.4)
    
    if render:
        env.render()
    return total_reward

A second function runs multiple rollouts of a given policy and averages the total rewards:

In [None]:
def evaluate(env, policy, num_rollouts=100):
    return sum(rollout(env, policy) for _ in range(num_rollouts)) / num_rollouts

### Random policy
The random policy selects random actions and ignores the current state.  
Let's see how this very simple policy performs on the environment:

In [None]:
def random_policy(env, state):
    return env.action_space.sample()

In [None]:
total_reward = rollout(env, random_policy, render=True)
print('\nTotal reward:', total_reward)

In [None]:
print('Average total reward:', evaluate(env, random_policy))

As we can see, the random policy performs poorly.

### Non-deterministic policies
We will use non-deterministic (stochastic) policies, which define conditional probabilities over the actions given a state, i.e. $\pi(a|s)$.  
Since we work with finite state and action spaces, we can store these conditional probabilities in a 2D array ```pi``` of shape ```(env.nS, env.nA)```,  
such that ```pi[state, action]``` corresponds to $\pi(a|s)$.

```policy_from_pi``` creates a policy function that samples an action according to these conditional probabilities.

In [None]:
def policy_from_pi(pi):
    def policy(env, state):
        action_probs = pi[state]
        return np.random.choice(np.arange(env.nA), p=action_probs)
    return policy
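
As an illustration of the array representation, the following sketch builds a small ```pi``` by hand and samples from one of its rows. The sizes `n_states` and `n_actions` and the fixed seed are assumptions for the example; in the notebook they would come from ```env.nS``` and ```env.nA```.

```python
import numpy as np

n_states, n_actions = 3, 2  # hypothetical sizes, stand-ins for env.nS and env.nA

# Uniform random policy: every action is equally likely in every state.
pi = np.full((n_states, n_actions), 1 / n_actions)

# A deterministic policy is the special case of a one-hot row.
pi[1] = [0.0, 1.0]  # always choose action 1 in state 1

# Each row must be a valid probability distribution over the actions.
assert np.allclose(pi.sum(axis=1), 1.0)

# Sample actions for state 1; with a one-hot row every sample is action 1.
rng = np.random.default_rng(0)
samples = rng.choice(n_actions, size=1000, p=pi[1])
```

Storing deterministic and stochastic policies in the same array means the later steps need only one code path.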

## Policy iteration
We will follow the policy iteration algorithm from *Reinforcement Learning: An Introduction* by Sutton and Barto, p. 80  
(http://incompleteideas.net/book/the-book-2nd.html), but because we use non-deterministic policies, the implementation will be slightly different.

### Policy evaluation

We want to determine the value function $V_\pi(s)$ for a given policy $\pi$.  
Since we work with finite state spaces, we can store the values for each state in an array ```V```,  
such that ```V[s]``` corresponds to $V_\pi(s)$.

```V``` is initialized with zeros; we then apply the *Bellman expectation equation* repeatedly as an update rule until the values converge.  
Because we use non-deterministic policies, the equation looks slightly different from the one in the book:

$$\begin{aligned}
V_\pi(s) & = \mathbb{E} [r + \gamma V_\pi(s') \,|\, s] \\[3pt]
         & = \sum_a \pi(a|s) \sum_{s',r} \ p(s',r|s,a) \ [r + \gamma V_\pi(s')] \\
         & = \sum_a \pi(a|s)\ Q_\pi(s,a)
\end{aligned}$$

In the last step we used the fact that the action-value function can be expressed in terms of the state-value function:

$$\begin{aligned}
Q_\pi(s,a) & = \mathbb{E} [r + \gamma V_\pi(s') \,|\, s, a] \\[3pt]
           & = \sum_{s',r} p(s',r|s,a)\ [r + \gamma V_\pi(s')]
\end{aligned}$$

Implement this last equation, since it will be very handy:

In [None]:
def action_value(env, V, state, action, gamma):
    """Computes the action-value Q(s,a) for a given state-action pair (state, action)
    based on the state-value function.
    - gamma: The discount-rate.
    """
    #########################
    # Write your code here. #
    #########################
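
The sum over outcomes in the equation for $Q_\pi(s,a)$ can be sketched on a toy MDP as follows. This is one possible shape of the computation, not a reference solution: the dictionary `P` and the helper `q_from_v` are hypothetical, and `P` only mimics the layout of ```env.P```.

```python
import numpy as np

# Hypothetical toy MDP: P[state][action] -> list of (prob, next_state, reward, done).
P = {
    0: {0: [(1.0, 0, 0.0, False)],
        1: [(0.8, 1, 0.0, False), (0.2, 0, 0.0, False)]},
    1: {0: [(1.0, 0, 0.0, False)],
        1: [(0.8, 2, 1.0, True), (0.2, 1, 0.0, False)]},
    2: {0: [(1.0, 2, 0.0, True)],
        1: [(1.0, 2, 0.0, True)]},
}


def q_from_v(P, V, state, action, gamma):
    """Sum p(s',r|s,a) * (r + gamma * V[s']) over all outcomes of (state, action)."""
    return sum(prob * (reward + gamma * V[next_state])
               for prob, next_state, reward, _done in P[state][action])


V = np.array([0.0, 0.5, 0.0])  # arbitrary state-values for the illustration
q = q_from_v(P, V, state=1, action=1, gamma=0.9)
# By hand: 0.8 * (1.0 + 0.9 * 0.0) + 0.2 * (0.0 + 0.9 * 0.5) = 0.89
```

In this toy MDP the terminal state loops to itself with zero reward, so its value stays at zero and no special handling of `done` is needed.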

Then use the ```action_value()``` function and implement policy evaluation:

In [None]:
def policy_evaluation(env, pi, gamma, theta):
    """Computes the state-value function V of a policy pi.
    - gamma: The discount-rate.
    - theta: A small threshold; iteration stops once the largest value change in a sweep is below theta (see algorithm p. 80).
    """
    V = np.zeros(env.nS)
    #########################
    # Write your code here. #
    #########################
    return V
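
A minimal sketch of iterative policy evaluation on a toy MDP is shown below, assuming in-place updates and a stopping rule based on the largest change per sweep. The names `P` and `evaluate_policy` are hypothetical, and the closed-form values in the comments hold only for this toy example.

```python
import numpy as np

# Hypothetical toy MDP: P[state][action] -> list of (prob, next_state, reward, done).
P = {
    0: {0: [(1.0, 0, 0.0, False)],
        1: [(0.8, 1, 0.0, False), (0.2, 0, 0.0, False)]},
    1: {0: [(1.0, 0, 0.0, False)],
        1: [(0.8, 2, 1.0, True), (0.2, 1, 0.0, False)]},
    2: {0: [(1.0, 2, 0.0, True)],
        1: [(1.0, 2, 0.0, True)]},
}


def evaluate_policy(P, pi, gamma, theta):
    """Sweep over all states until the largest value change is below theta."""
    n_states, n_actions = pi.shape
    V = np.zeros(n_states)
    while True:
        delta = 0.0
        for s in range(n_states):
            v_old = V[s]
            # Bellman expectation update: average Q(s, a) under pi(a|s).
            V[s] = sum(
                pi[s, a] * sum(p * (r + gamma * V[s2]) for p, s2, r, _ in P[s][a])
                for a in range(n_actions)
            )
            delta = max(delta, abs(v_old - V[s]))
        if delta < theta:
            return V


# Policy that always picks action 1 ("move right") in every state.
pi_right = np.array([[0.0, 1.0], [0.0, 1.0], [0.0, 1.0]])
V = evaluate_policy(P, pi_right, gamma=0.9, theta=1e-10)
# Closed form for this toy MDP: V[1] = 0.8 / 0.82 and V[0] = 0.72 * V[1] / 0.82.
```

Updating `V` in place within a sweep is an assumption made here; keeping a separate copy of the old values also converges, typically in more sweeps.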

### Policy improvement

Now we want to find a better policy based on the value function that we computed in policy evaluation,  
by acting greedily, i.e. by maximizing the action-value function.  

Again, because we use non-deterministic policies, the implementation will slightly differ from the book.  
Instead of just using $\arg\max$, we assign probabilities. If there are multiple maximizing actions,  
we evenly distribute their probabilities:
$$\pi'(a|s) := \begin{cases}
    \frac{1}{|\arg\max_{a'} Q_\pi(s,a')\,|} & \text{if } a \in \arg\max_{a'} Q_\pi(s,a') \\
    0 & \text{otherwise}
\end{cases}$$

Furthermore, to determine if the policy is *stable* we need to check if the *state-values* no longer change,  
instead of checking if the actions change.

Therefore, implement only the loop of policy improvement from the book and do *not* check whether the policy is stable here:

In [None]:
def policy_improvement(env, V, gamma):
    pi = np.zeros((env.nS, env.nA))
    #########################
    # Write your code here. #
    #########################
    return pi
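
The tie-splitting rule for a single state can be sketched as follows. The helper `greedy_probs` is hypothetical, and using `np.isclose` to detect ties is an assumption made here so that floating-point noise does not hide actions that are equally good.

```python
import numpy as np


def greedy_probs(q_values):
    """Spread probability evenly over all actions that maximize q_values."""
    best = np.isclose(q_values, np.max(q_values))  # boolean mask of maximizers
    return best / best.sum()


# Two actions tie for the maximum, so each receives probability 1/2.
probs_tie = greedy_probs(np.array([0.2, 0.5, 0.5, 0.1]))

# A unique maximizer receives all of the probability mass.
probs_single = greedy_probs(np.array([1.0, 0.0]))
```

Applying such a helper to the action-values of every state yields one row of ```pi``` per state.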

### Policy iteration

Finally, we arrive at policy iteration by alternating policy evaluation and policy improvement  
until the policy is stable, i.e. until the state-values no longer change from one iteration to the next:

In [None]:
def policy_iteration(env, gamma, theta):
    pi = np.full((env.nS, env.nA), 1 / env.nA)  # initialize with the uniform random policy
    #########################
    # Write your code here. #
    #########################
    return pi
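
For orientation, here is a hedged end-to-end sketch of the loop on a toy MDP, with stability checked on the state-values. All names (`P`, `q_values`, `evaluate_policy`, `improve_policy`, `iterate_policy`) are hypothetical, and unlike the notebook's ```policy_iteration``` this sketch also returns the final values.

```python
import numpy as np

# Hypothetical toy MDP: P[state][action] -> list of (prob, next_state, reward, done).
P = {
    0: {0: [(1.0, 0, 0.0, False)],
        1: [(0.8, 1, 0.0, False), (0.2, 0, 0.0, False)]},
    1: {0: [(1.0, 0, 0.0, False)],
        1: [(0.8, 2, 1.0, True), (0.2, 1, 0.0, False)]},
    2: {0: [(1.0, 2, 0.0, True)],
        1: [(1.0, 2, 0.0, True)]},
}


def q_values(P, V, s, gamma):
    """Action-values of all actions in state s, computed from V."""
    return np.array([sum(p * (r + gamma * V[s2]) for p, s2, r, _ in P[s][a])
                     for a in sorted(P[s])])


def evaluate_policy(P, pi, gamma, theta):
    """Iterative policy evaluation with in-place updates."""
    V = np.zeros(len(P))
    while True:
        delta = 0.0
        for s in range(len(P)):
            v_new = pi[s] @ q_values(P, V, s, gamma)
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < theta:
            return V


def improve_policy(P, V, gamma):
    """Greedy policy w.r.t. V, splitting probability evenly among ties."""
    pi = np.zeros((len(P), len(P[0])))
    for s in range(len(P)):
        q = q_values(P, V, s, gamma)
        best = np.isclose(q, q.max())
        pi[s] = best / best.sum()
    return pi


def iterate_policy(P, gamma, theta):
    """Alternate evaluation and improvement until the values stop changing."""
    n_states, n_actions = len(P), len(P[0])
    pi = np.full((n_states, n_actions), 1 / n_actions)  # uniform random start
    V = evaluate_policy(P, pi, gamma, theta)
    while True:
        pi = improve_policy(P, V, gamma)
        V_new = evaluate_policy(P, pi, gamma, theta)
        if np.max(np.abs(V_new - V)) < theta:
            return pi, V_new
        V = V_new


pi, V = iterate_policy(P, gamma=0.9, theta=1e-10)
```

In this toy MDP the resulting policy moves right in states 0 and 1 and is uniform in the terminal state, where both actions tie.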

In [None]:
pi = policy_iteration(env, gamma=1, theta=1e-8)
policy = policy_from_pi(pi)
print(pi)

In [None]:
total_reward = rollout(env, policy, render=True)
print('\nTotal reward:', total_reward)

In [None]:
print('Average total reward:', evaluate(env, policy))

You should achieve an average total reward of at least 0.8.