# Dynamic Programming (Policy Iteration)

### Policy/Value improvement
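During execution, the value function and the policy improve together: each evaluation step refines the value estimate, and each improvement step makes the policy greedy with respect to that estimate.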
![alt text](imgs/policy_value_improvement.png "Game")

### Policy Iteration vs Value Iteration
* Both algorithms converge to the optimal policy.
* Policy iteration converges in fewer iterations, but each iteration is more computationally intensive.
* Both need a model of the environment (the MDP).

### Policy vs Value iteration
One drawback of policy iteration is that each iteration involves a full policy evaluation in addition to the policy improvement step. A single value iteration sweep is therefore cheaper to compute, but value iteration needs more iterations to converge.
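
To make the comparison concrete, here is a minimal value iteration sketch. It runs on a hypothetical two-state MDP written in the same `P[state][action]` layout used later in this notebook. The toy MDP, the `tol` threshold and the function name are assumptions for illustration, not part of the original code.

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP (made-up numbers), same layout as env.P:
# P[state][action] -> list of (prob, next_state, reward, done)
P = {
    0: {0: [(1.0, 0, 0.0, False)],
        1: [(1.0, 1, 1.0, False)]},
    1: {0: [(1.0, 0, 0.0, False)],
        1: [(1.0, 1, 2.0, False)]},
}

def value_iteration(P, gamma=0.9, tol=1e-8, max_iterations=10000):
    n_states = len(P)
    v = np.zeros(n_states)
    for i in range(max_iterations):
        prev_v = v.copy()
        for s in range(n_states):
            # Bellman optimality backup: best expected return over all actions
            v[s] = max(
                sum(p * (r + gamma * prev_v[s_]) for p, s_, r, _ in P[s][a])
                for a in P[s]
            )
        # Stop when the largest change in any state value is below tol
        if np.max(np.abs(v - prev_v)) < tol:
            return v, i + 1
    return v, max_iterations

v, n_iterations = value_iteration(P)
print('Values:', v, 'iterations:', n_iterations)
```

Each sweep here is a single backup per state, with no inner evaluation loop, which is why one sweep is cheap but many sweeps are needed.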

### References
* https://medium.com/@m.alzantot/deep-reinforcement-learning-demysitifed-episode-2-policy-iteration-value-iteration-and-q-978f9e89ddaa
* https://web.stanford.edu/class/psych209/Readings/SuttonBartoIPRLBook2ndEd.pdf
* https://en.wikipedia.org/wiki/Bellman_equation

### Import libraries and define Hyperparameters

In [1]:
import numpy as np
import time
import gym
from gym import wrappers
from IPython.display import clear_output

# Hyper parameters
gamma = 0.99

### Load OpenAI Gym Environment

In [2]:
#env_name  = 'FrozenLake8x8-v0'
#env_name = 'FrozenLake-v0'
env_name = 'Taxi-v2'

env = gym.make(env_name)



### Access Environment Model
This part of the code exposes a dictionary $P(state,action)$ that returns a list of tuples $(prob,\text{next_state},reward,done)$: the possible next states for that particular (state, action) pair. Together, these entries make up the MDP of the environment.

For example, stepping the environment samples one of the transitions that `env.P` lists for the same (state, action) pair:
```python
state_reset = env.reset()
print('reset state:',state_reset)
action=0
next_state, reward, done, info = env.step(action)
print('############')
print('Next state:',next_state)
print('Reward:',reward)
print('done:',done)
print('info:',info)

#reset state: 0
############
#Next state: 0
#Reward: 0.0
#done: False
#info: {'prob': 0.3333333333333333}

# Return all the possible transitions as tuples (probability, next_state, reward, done)
env.P[state_reset][action]
# [(0.3333333333333333, 0, 0.0, False),
# (0.3333333333333333, 0, 0.0, False),
# (0.3333333333333333, 8, 0.0, False)]
```
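
The layout of `P` can also be explored without Gym. The sketch below builds a small hand-written table in the same format (all numbers are made up for illustration) and checks two properties that any such table should have: the probabilities of each (state, action) pair sum to 1, and the expected immediate reward is the probability-weighted sum of rewards.

```python
# Hypothetical transition table in the same layout as env.P (values are made up):
# P[state][action] -> list of (prob, next_state, reward, done)
P = {
    0: {
        0: [(1/3, 0, 0.0, False), (1/3, 0, 0.0, False), (1/3, 1, 0.0, False)],
        1: [(1.0, 1, 1.0, True)],
    },
    1: {
        0: [(1.0, 1, 0.0, True)],
        1: [(1.0, 1, 0.0, True)],
    },
}

total_prob = {}
expected_reward = {}
for s, actions in P.items():
    for a, transitions in actions.items():
        # Probabilities over next states must sum to 1 for every (state, action)
        total_prob[(s, a)] = sum(p for p, _, _, _ in transitions)
        # Expected immediate reward of taking action a in state s
        expected_reward[(s, a)] = sum(p * r for p, _, r, _ in transitions)
        print('state', s, 'action', a,
              'total prob:', total_prob[(s, a)],
              'expected reward:', expected_reward[(s, a)])
```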

Only a few environments expose their MDP:
* FrozenLake-v0
* FrozenLake8x8-v0
* Taxi-v2

In [3]:
# Have access to the MDP of the environment
env = env.unwrapped

### Display search space size

In [4]:
num_states = None
num_actions = None
try:
    num_states = env.observation_space.n
    num_actions = env.action_space.n
    print('Search space:',num_states*num_actions)
except AttributeError:  # the space is not Discrete, so it has no .n attribute
    num_states = np.prod(env.observation_space.shape)
    num_actions = np.prod(env.action_space.shape)
    print('Search space:',num_states*num_actions)

Search space: 3000


### Policy Improvement
Extract a policy from an estimate of the value function by acting greedily with respect to it. When the value estimate converges to the optimal value function, the extracted policy is the optimal policy.
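
Written out, the greedy extraction computes an action value for every (state, action) pair from the model and then picks the best action in each state. The notation below ($p$ for the transition probability, $r$ for the reward, $\pi'$ for the extracted policy) is an assumption chosen to match the tuples in `env.P`:

$$q(s,a) = \sum_{s'} p(s' \mid s,a)\,\bigl[r(s,a,s') + \gamma\, v(s')\bigr], \qquad \pi'(s) = \arg\max_{a}\, q(s,a)$$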

In [5]:
def policy_improvement(v, gamma = 1.0):
    policy = np.zeros(env.observation_space.n)
    q = np.zeros((env.observation_space.n, env.action_space.n))
    
    for s in range(env.observation_space.n):
        for a in range(env.action_space.n):
            q[s,a] = sum([p * (r + gamma * v[s_]) for p, s_, r, _ in  env.P[s][a]])
        
        # With the action-value function we can act greedily and take the action with the largest value
        policy[s] = np.argmax(q[s,:])
    return policy

### Policy Evaluation
Estimates the value function of a given policy.
This could also be solved directly as a system of linear equations instead of an iterative process.
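
As a rough sketch of the linear-equation alternative: for a fixed policy, the Bellman expectation equation can be written as $(I - \gamma P_\pi)\,v = r_\pi$ and solved in one step. The function name `evaluate_policy_exact` and the toy MDP are hypothetical, and the sketch assumes `gamma < 1` so that the matrix is invertible.

```python
import numpy as np

def evaluate_policy_exact(P, policy, gamma=0.9):
    """Solve (I - gamma * P_pi) v = r_pi for a fixed deterministic policy."""
    n_states = len(P)
    P_pi = np.zeros((n_states, n_states))  # state-to-state transition matrix under the policy
    r_pi = np.zeros(n_states)              # expected immediate reward under the policy
    for s in range(n_states):
        for p, s_, r, _ in P[s][policy[s]]:
            P_pi[s, s_] += p
            r_pi[s] += p * r
    return np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)

# Hypothetical 2-state MDP (made-up numbers), same layout as env.P
P = {
    0: {0: [(1.0, 0, 0.0, False)],
        1: [(1.0, 1, 1.0, False)]},
    1: {0: [(1.0, 0, 0.0, False)],
        1: [(1.0, 1, 2.0, False)]},
}

v = evaluate_policy_exact(P, policy=[1, 1], gamma=0.9)
print('Exact values under the policy:', v)
```

The direct solve costs roughly cubic time in the number of states, so the iterative version can still be preferable for large state spaces.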

In [6]:
def evaluate_v_under_policy(env, policy, gamma=1.0):
    v = np.zeros(env.observation_space.n)

    # Improves the value under a given policy
    while True:
        prev_v = np.copy(v)
        for s in range(env.observation_space.n):
            # Get action from given policy
            a = policy[s]
            
            # Bellman Expectation equation
            v[s] = sum([p * (r + gamma * prev_v[s_]) for p, s_, r, _ in env.P[s][a]])
        
        # Check convergence (the value function has stopped changing)
        if (np.sum(np.fabs(prev_v - v)) <= np.finfo(float).eps):
            # Converged: the value estimate is stable for this policy
            break
    return v

### Policy Iteration Algorithm
The algorithm has the following structure:
1. Start with a random policy.
2. Find the value function of that policy.
3. Find a new policy based on that value function.
4. Go back to step 2 until the policy no longer changes.

The policy is evaluated using the Bellman expectation equation.
![alt text](imgs/policy_iteration.png "Game")
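
For a deterministic policy $\pi$, the evaluation step repeatedly applies the Bellman expectation equation until the values stop changing. The symbols follow the tuples in `env.P` and are an assumed notation, not taken from the figure:

$$v_\pi(s) = \sum_{s'} p\bigl(s' \mid s, \pi(s)\bigr)\,\bigl[r(s,\pi(s),s') + \gamma\, v_\pi(s')\bigr]$$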

In [7]:
def policy_iteration(env, gamma = 1.0, max_iterations = 200000):
    # initialize a random policy
    policy = np.random.choice(env.action_space.n, size=(env.observation_space.n))  
    
    # Iterate a lot...
    for i in range(max_iterations):
        # Policy Evaluation
        old_policy_v = evaluate_v_under_policy(env, policy, gamma)
        
        # Policy Improvement
        new_policy = policy_improvement(old_policy_v, gamma)
        
        # Detect convergence
        if (np.all(policy == new_policy)):
            print ('Policy-Iteration converged at step %d.' %(i+1))
            break
        policy = new_policy
    return policy

### Run the Policy Iteration Algorithm
Compare the number of iterations and the compute time of policy iteration and value iteration on the same environment.

#### Result against Value Iteration on Taxi driver problem
* Value iteration took 3270 iterations and 14 seconds to complete.

In [11]:
%%time
optimal_policy = policy_iteration(env, gamma = gamma)

Policy-Iteration converged at step 16.
CPU times: user 37.9 s, sys: 321 ms, total: 38.2 s
Wall time: 38 s


### Test learned policy

In [10]:
state = env.reset()
done = False
step = 0
while not done:
    action = int(optimal_policy[state])
    next_state, reward, done, info = env.step(action)
    state = next_state
    step += 1
    print("Step: {}".format(step))
    env.render()
    clear_output(wait=True)
    time.sleep(0.2)

Step: 12
+---------+
|R: | : :G|
| : : : : |
| : : : : |
| | : | : |
|Y| : |B: |
+---------+
  (Dropoff)
