# Tutorial: Evaluating Policies & First Contact with State of the Art Continuous RL methods

This is a hands-on tutorial to get familiar with the following topics:

 - OpenAI `gym` 2D "videogame" environments
 - State-of-the-art methods for RL over continuous control spaces
 - Measuring the performance of a given policy and reporting it

We will rely heavily on four scientific Python computing libraries:

 - NumPy & SciPy: high-performance linear algebra and statistical analysis
 - Pandas: analysis and management of complex datasets
 - Matplotlib: visualization

In [3]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline


## Bipedal Walker

Bipedal Walker is a straightforward task that can be decomposed into two subtasks. The first is to keep the robot upright, as it lacks the means to pull itself back onto its feet. The second is to move as fast as possible towards the right, as reward grows linearly with the distance from the origin. The first subtask must be sustained indefinitely and concurrently with the second one.

Besides the challenge posed by the robot dynamics (four actuated joints, two hips and two knees, plus the modeling of inertia), the ground is uneven. These "bumps" along the way make it harder to keep balance while moving fast: if the robot goes too fast, it can become impossible to stabilize.

In [None]:
import gym

In [None]:
env = gym.make('BipedalWalker-v2')

## Baseline Policies

### Random Policy

In [None]:
num_trials = 1000
trials_per_checkpoint = 10
checkpoints = []
random_observed_R = []

for trial in range(10, num_trials+1, trials_per_checkpoint):
    R = 0
    env.reset()
    for t in range(100):
        # the following call implements a random policy that picks actions from a uniform distribution
        u_t = env.action_space.sample()
        obs, r, done, info = env.step(u_t)  # avoid shadowing the built-in next()
        R += r
        # uncomment if capable of rendering
        #env.render()
        if done: break
    checkpoints.append(trial)
    random_observed_R.append(R)
# uncomment if capable of rendering
#env.close()
checkpoints = np.array(checkpoints)
random_observed_R = np.array(random_observed_R)

### Zero Input

In [None]:
num_trials = 1000
trials_per_checkpoint = 10
zero_observed_R = []

for trial in range(10, num_trials+1, trials_per_checkpoint):
    R = 0
    env.reset()
    for t in range(100):
        # the zero-input policy always applies zero torque to every joint
        u_t = np.zeros(env.action_space.shape)
        obs, r, done, info = env.step(u_t)
        R += r
        # uncomment if capable of rendering
        #env.render()
        if done: break
    zero_observed_R.append(R)
# uncomment if capable of rendering
#env.close()

zero_observed_R = np.array(zero_observed_R)
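
The random and zero-input cells above repeat the same rollout loop. A minimal sketch of how that loop could be factored into a helper is given below. The names `rollout` and `ToyEnv` are hypothetical and not part of `gym`; `ToyEnv` is only a stand-in that mimics the `reset()` / `step()` interface used in this tutorial, so that the sketch runs without Box2D. The dimensions 24 and 4 are assumed to match the BipedalWalker observation and action sizes.

```python
import numpy as np


class ToyEnv:
    """Hypothetical stand-in for a gym environment (NOT BipedalWalker).

    Mimics the classic interface used in this tutorial:
    reset() -> obs and step(action) -> (obs, reward, done, info).
    """

    def __init__(self, obs_dim=24, act_dim=4, seed=0):
        self.obs_dim = obs_dim
        self.act_dim = act_dim
        self.rng = np.random.default_rng(seed)
        self.t = 0

    def reset(self):
        self.t = 0
        return np.zeros(self.obs_dim)

    def step(self, action):
        self.t += 1
        obs = self.rng.standard_normal(self.obs_dim)
        # toy reward: penalize large actions
        reward = -float(np.sum(np.square(action)))
        done = self.t >= 50
        return obs, reward, done, {}


def rollout(env, policy, horizon=100):
    """Run one episode and return the undiscounted sum of rewards."""
    obs = env.reset()
    total = 0.0
    for _ in range(horizon):
        obs, r, done, _ = env.step(policy(obs))
        total += r
        if done:
            break
    return total


env_toy = ToyEnv()
zero_policy = lambda obs: np.zeros(env_toy.act_dim)
random_policy = lambda obs: env_toy.rng.uniform(-1.0, 1.0, env_toy.act_dim)

# 20 rollouts per baseline, collected into arrays as in the cells above
zero_R = np.array([rollout(env_toy, zero_policy) for _ in range(20)])
random_R = np.array([rollout(env_toy, random_policy) for _ in range(20)])
```

With the real environment, the same helper would be called as `rollout(env, lambda obs: env.action_space.sample())`, which keeps each baseline down to a single line.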

### Comparing Policies Robustly

In [None]:
colors = [ '#2D328F', '#F15C19',"#81b13c","#ca49ac"]
          
label_fontsize = 18
tick_fontsize = 14
linewidth = 3
markersize = 5

median_zero = np.median(zero_observed_R)
plt.plot([checkpoints[0],checkpoints[-1]], [median_zero, median_zero], color=colors[0], linewidth=linewidth,\
             linestyle='--',label='Zero Control')
median_random = np.median(random_observed_R)
plt.plot([checkpoints[0],checkpoints[-1]], [median_random, median_random], color=colors[1], linewidth=linewidth,\
             linestyle=':', label='Random Control')

plt.axis([0,1000,-500,500])

plt.xlabel('rollouts',fontsize=label_fontsize)
plt.ylabel('R',fontsize=label_fontsize)
plt.legend(fontsize=18, bbox_to_anchor=(1.0, 1.0))
plt.xticks(fontsize=tick_fontsize)
plt.yticks(fontsize=tick_fontsize)
plt.grid(True)

fig = plt.gcf()
fig.set_size_inches(9, 6)

plt.show()
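
The plot above reports only the median return of each baseline. One possible way to make the comparison more robust is to also report the interquartile range and a bootstrap confidence interval for the median. The sketch below assumes a percentile bootstrap; the helper name `bootstrap_median_ci` is hypothetical, and `fake_returns` is synthetic placeholder data, not a result from BipedalWalker.

```python
import numpy as np


def bootstrap_median_ci(samples, n_boot=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for the median."""
    samples = np.asarray(samples, dtype=float)
    rng = np.random.default_rng(seed)
    # resample with replacement: one row of indices per bootstrap replicate
    idx = rng.integers(0, len(samples), size=(n_boot, len(samples)))
    medians = np.median(samples[idx], axis=1)
    lo, hi = np.percentile(medians, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return np.median(samples), lo, hi


# synthetic placeholder returns, standing in for e.g. random_observed_R
rng = np.random.default_rng(1)
fake_returns = rng.normal(loc=-90.0, scale=15.0, size=100)

med, lo, hi = bootstrap_median_ci(fake_returns)
q25, q75 = np.percentile(fake_returns, [25, 75])
print(f"median={med:.1f}  95% CI=[{lo:.1f}, {hi:.1f}]  IQR=[{q25:.1f}, {q75:.1f}]")
```

Two policies whose confidence intervals overlap heavily are hard to tell apart from the number of rollouts collected, even if their medians differ.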

## State of the Art RL over Continuous Control Spaces

```
$ python -m ars --env_name gym:BipedalWalker-v2 --n_directions 240 --deltas_used 240 --step_size 0.02 \
--delta_std 0.0075 --n_workers 6 --n_iter 1000 --address 10.100.228.201:6379
```

```
$ python3 -m ars.run_policy data/lin_policy_plus.npz BipedalWalker-v2 --render --num_rollouts 20
```
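
To give an idea of what the training command optimizes, here is a minimal single-process sketch of one Augmented Random Search (ARS) update on a toy objective. It assumes the flags `--n_directions`, `--deltas_used`, `--step_size` and `--delta_std` map onto the number of sampled directions, the number of top directions kept, the step size and the perturbation scale. The function `ars_step` and the quadratic `toy_return` are hypothetical illustrations, not the code of the `ars` package; observation normalization and parallel workers are left out.

```python
import numpy as np


def ars_step(M, evaluate, rng, n_directions=16, deltas_used=8,
             step_size=0.02, delta_std=0.0075):
    """One basic ARS update of the policy matrix M (illustrative sketch)."""
    # sample random perturbation directions with the same shape as M
    deltas = rng.standard_normal((n_directions,) + M.shape)
    # evaluate the return of the policy perturbed in both directions
    r_plus = np.array([evaluate(M + delta_std * d) for d in deltas])
    r_minus = np.array([evaluate(M - delta_std * d) for d in deltas])
    # keep the directions whose better side achieved the highest return
    top = np.argsort(np.maximum(r_plus, r_minus))[-deltas_used:]
    # scale the step by the standard deviation of the returns that are used
    sigma_r = np.concatenate([r_plus[top], r_minus[top]]).std() + 1e-8
    update = np.tensordot(r_plus[top] - r_minus[top], deltas[top], axes=1)
    return M + step_size / (deltas_used * sigma_r) * update


# toy objective (assumption): the return is highest when M equals `target`
target = np.ones((2, 3))
toy_return = lambda W: -float(np.sum((W - target) ** 2))

rng = np.random.default_rng(0)
M_init = np.zeros((2, 3))
M_ars = M_init.copy()
for _ in range(300):
    M_ars = ars_step(M_ars, toy_return, rng)
```

In the actual training run, `evaluate` would be replaced by one or more rollouts of the linear policy in the environment, which is the expensive part that the workers parallelize.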

### Loading pre-trained policy

In [None]:
do_render = False

In [None]:
print('loading and building expert policy')
lin_policy = np.load("trained_policies/BipedalWalker-v1/e_545/lin_policy_plus.npz")
lin_policy = list(lin_policy.items())[0][1]

M = lin_policy[0]
# mean and std of state vectors estimated online by ARS.
mean = lin_policy[1]
std = lin_policy[2]

env = gym.make('BipedalWalker-v2')

returns = []
observations = []
actions = []
for i in range(num_trials):
    #print('iter', i)
    obs = env.reset()
    done = False
    totalr = 0.
    steps = 0
    while not done:
        action = np.dot(M, (obs - mean)/std)
        observations.append(obs)
        actions.append(action)


        obs, r, done, _ = env.step(action)
        totalr += r
        steps += 1
        if do_render:
            env.render()
        #if steps % 100 == 0: print("%i/%i"%(steps, env.spec.max_episode_steps))
        # timestep_limit is deprecated in later gym releases; max_episode_steps holds the same limit
        if steps >= env.spec.max_episode_steps:
            break
    returns.append(totalr)
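
The pre-trained policy in the cell above is linear: the observation is standardized with the running `mean` and `std`, then multiplied by the matrix `M`. The sketch below isolates that computation. The helper name `linear_policy_action`, the `eps` guard against zero standard deviations, and the final clipping to `[-1, 1]` are assumptions added for illustration; the `*_demo` arrays are random placeholders, with shapes assumed to match a 24-dimensional observation and a 4-dimensional action.

```python
import numpy as np


def linear_policy_action(M, mean, std, obs, eps=1e-8):
    """Action of a linear policy applied to a standardized observation."""
    # eps is an added safeguard (assumption), not part of the original cell
    return M @ ((obs - mean) / (std + eps))


obs_dim, act_dim = 24, 4  # assumed BipedalWalker dimensions
rng = np.random.default_rng(0)

# random placeholders standing in for the contents of lin_policy_plus.npz
M_demo = rng.standard_normal((act_dim, obs_dim))
mean_demo = rng.standard_normal(obs_dim)
std_demo = np.abs(rng.standard_normal(obs_dim)) + 0.5
obs_demo = rng.standard_normal(obs_dim)

action_demo = linear_policy_action(M_demo, mean_demo, std_demo, obs_demo)
# optional: keep the action inside an assumed [-1, 1] action range
action_clipped = np.clip(action_demo, -1.0, 1.0)
```

An observation equal to the running mean maps to the zero action, so the zero-input baseline corresponds to what this policy does at its "average" state.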

In [None]:
colors = [ '#2D328F', '#F15C19',"#81b13c","#ca49ac"]
          
label_fontsize = 18
tick_fontsize = 14
linewidth = 3
markersize = 5

median_zero = np.median(zero_observed_R)
plt.plot([checkpoints[0],checkpoints[-1]], [median_zero, median_zero], color=colors[0], linewidth=linewidth,\
             linestyle='--',label='Zero Control')
median_random = np.median(random_observed_R)
plt.plot([checkpoints[0],checkpoints[-1]], [median_random, median_random], color=colors[1], linewidth=linewidth,\
             linestyle=':', label='Random Control')

plt.plot(checkpoints, np.ones(checkpoints.shape)*np.median(returns, axis=0), \
         '--', color=colors[2], linewidth=linewidth, markersize=markersize,label='ARS, n=545')
plt.fill_between(checkpoints, np.ones(checkpoints.shape)*np.min(returns, axis=0), \
                 np.ones(checkpoints.shape)*np.max(returns, axis=0), alpha=0.25)


plt.axis([0,1000,-500,500])

plt.xlabel('rollouts',fontsize=label_fontsize)
plt.ylabel('R',fontsize=label_fontsize)
plt.legend(fontsize=8, bbox_to_anchor=(1.0, 1.0))
plt.xticks(fontsize=tick_fontsize)
plt.yticks(fontsize=tick_fontsize)
plt.grid(True)

fig = plt.gcf()
fig.set_size_inches(9, 6)

plt.show()
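
Besides the plot, the same comparison can be reported as a table. The sketch below uses Pandas to summarize the returns of each policy by their quartiles. The helper `summarize_returns` is hypothetical, and the arrays passed to it are synthetic placeholders rather than measured returns; in the notebook they would be `zero_observed_R`, `random_observed_R` and `returns`.

```python
import numpy as np
import pandas as pd


def summarize_returns(returns_by_policy):
    """Build a per-policy table of order statistics from a dict of returns."""
    frames = [
        pd.DataFrame({"policy": name, "R": np.asarray(R, dtype=float)})
        for name, R in returns_by_policy.items()
    ]
    long_table = pd.concat(frames, ignore_index=True)
    stats = long_table.groupby("policy")["R"].describe()
    # keep order statistics only, which are robust to outlier rollouts
    return stats[["count", "min", "25%", "50%", "75%", "max"]]


# synthetic placeholder data (NOT results from BipedalWalker)
rng = np.random.default_rng(0)
summary = summarize_returns({
    "zero_demo": rng.normal(-50.0, 5.0, size=30),
    "random_demo": rng.normal(-100.0, 20.0, size=30),
    "ars_demo": rng.normal(300.0, 10.0, size=30),
})
print(summary.round(1))
```

Reporting the quartiles alongside the median makes it visible when a policy has a good typical return but occasionally fails badly.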

### We never stop learning