# Reinforcement Learning Exercise

For this exercise we will be using the [OpenAI Gym](https://gym.openai.com/) provided by [OpenAI](https://openai.com/). To get to know Gym, you are encouraged to read this [blogpost](https://openai.com/blog/openai-gym-beta/) (~10 minutes) and refer to the [docs](https://gym.openai.com/docs) along the way.

In this exercise we will train a neural network agent to navigate various environments from the OpenAI Gym.

## 0. Prerequisites (if running on your own machine)

We assume you already have Theano and Lasagne installed -- otherwise go back to the first exercise for instructions. 

Below is a brief guide on how to install OpenAI Gym. For more details please refer to the [docs](https://gym.openai.com/docs).
   
```
$ cd ~/path/to/dir/...
$ git clone https://github.com/openai/gym
$ cd gym
$ pip install -e . # minimal install
```

Verify that your installation is working by importing `gym` and checking for errors:

```
$ python
>>> import gym
[no errors]
```

Now restart this notebook before moving on to the next part.

## 1. Getting started

Now that you have everything installed, let's get started!

The code below will import `gym` and initialize the [CartPole-v0](https://gym.openai.com/envs/CartPole-v0) environment. The task of this environment is to move a cart in order to balance a pole attached on top, but for now we will just take random actions for 100 timesteps to see what happens.

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
from JSAnimation.IPython_display import display_animation
from matplotlib import animation
from IPython.display import display

import numpy as np
import theano
import theano.tensor as T
import lasagne
import gym
from lasagne.layers import InputLayer, DenseLayer
from lasagne.nonlinearities import tanh, softmax
    
from PIL import Image, ImageDraw

In [None]:
def display_frames_as_gif(frames):
    """
    Displays a list of frames as a gif, with controls
    """
    patch = plt.imshow(frames[0], cmap='gray')
    plt.axis('off')

    def animate(i):
        patch.set_data(frames[i])

    anim = animation.FuncAnimation(plt.gcf(), animate, frames = len(frames), interval=50)
    display(display_animation(anim, default_mode='loop')
            , autoplay=True)

def custom_render(state):
    """
    env.render() uses pyglet, which requires GLX > 1.2, which is cumbersome
    to get working on a server.
    """
    x = int(state[0] * 50 + 128) # Position
    theta = state[2]             # Pole angle in radians away from vertical

    mark_size = 3
    cart_w = 25; cart_h = 10
    pole_l = 90; pole_w = 4
    
    frame = np.zeros((256,256)) + 256
    frame[200, :] = 0
    frame[200-mark_size:200+mark_size, 128-mark_size:128+mark_size] = 0

    image = Image.fromarray(frame)
    draw = ImageDraw.Draw(image)
    draw.line((x, 200,
               x   + np.sin(theta)*pole_l,   # x coord for top of pole
               200 - np.cos(theta)*pole_l)   # y coord for top of pole
              , width=pole_w, fill=0)
    draw.line((x-cart_w, 200, x+cart_w, 200), width=cart_h, fill=128)
    del draw

    return image


In [None]:
# init and run an example environment
env = gym.make('CartPole-v0')
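state = env.reset()

frames = []
for _ in range(100):
    state, reward, done, info = env.step(env.action_space.sample())
    frames.append(custom_render(state))
    if done:
        # the episode has ended, so start a new one before stepping again
        state = env.reset()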

display_frames_as_gif(frames)

That was all very nice, but taking random actions doesn't really solve the task. We have to do something smarter. In the next part we will train an agent to solve the task by reinforcement learning.

## 2. Policy gradient agent

In this part we will create an agent that can learn to solve tasks from OpenAI Gym by applying the policy gradient method.

The agent is designed to work on environments with a discrete action space. Extending the code to also handle environments with a continuous action space is left as an optional exercise.

But first here is a short introduction to policy gradients.

### Policy gradients

We want to learn a policy neural network $p_\theta(a_{t}|s_{t-1})$ with parameters $\theta$ for action $a_t$ given the previous state $s_{t-1}$ only.
When the action $a$ is discrete we can implement this by a softmax output taking $s$ as input. 
The (discounted) cumulative reward for a sequence terminating after $T$ time-steps is

$$
R = \sum_{t=1}^T \gamma^{t-1} r_{t} \ .
$$
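
As a small illustration of this formula, the sketch below computes $R$ for a made-up reward sequence. The function name `discounted_return` and the reward values are assumptions chosen for the example, not part of the exercise code.

```python
def discounted_return(rewards, gamma):
    """Return R = sum_{t=1}^{T} gamma^(t-1) * r_t for one roll-out."""
    total = 0.0
    for t, r in enumerate(rewards, start=1):
        total += gamma ** (t - 1) * r
    return total

# Hypothetical roll-out: three steps with reward 1 each.
rewards = [1.0, 1.0, 1.0]
undiscounted = discounted_return(rewards, gamma=1.0)  # 1 + 1 + 1
discounted = discounted_return(rewards, gamma=0.5)    # 1 + 0.5 + 0.25
print("%.2f %.2f" % (undiscounted, discounted))
```

With $\gamma < 1$ later rewards count less, so the same roll-out yields a smaller $R$.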

The expectation of $R$ over policy roll-outs drawn from $p_\theta({\bf a},{\bf s})$ is

$$
\mathbb{E}[R|\theta] = \int R({\bf a},{\bf s}) p_\theta({\bf a},{\bf s}) d{\bf a} d{\bf s}\ ,
$$

where ${\bf a} = a_1,\ldots,a_T$, ${\bf s}=s_0,\ldots,s_T$ and

$$
p_\theta({\bf a},{\bf s}) = p(s_0) \prod_{t=1}^T \left[ 
p(s_{t}|s_{t-1},a_t) p_\theta(a_{t}|s_{t-1})
\right]\ .
$$

In this formulation $s_t$ is a stochastic function of the previous action and state: $p(s_t|a_t,s_{t-1})$. We can draw samples from the joint distribution of actions and states by running the policy in the environment, but $p(s_t|a_t,s_{t-1})$ itself is unknown.

A deterministic environment (think chess or Go) is a special case of this set-up where the state is a deterministic function of the previous state and action: $s_t = f(a_t,s_{t-1})$. We can include the deterministic formulation in the general one by using a Dirac $\delta$-function: $p(s_t|a_t,s_{t-1})= \delta(s_t - f(a_t,s_{t-1}))$. The integration over $s_1,\ldots,s_T$ may then be carried out explicitly:

$$
\mathbb{E}[R|\theta] = \int R({\bf a},s_0) p(s_0) \prod_{t=1}^T  p_\theta(a_{t}|s_{t-1})d{\bf a} ds_0\ ,
$$

where $s_{t-1}=f(a_{t-1},s_{t-2})$.

We use gradient ascent to learn an approximation to a policy that maximizes the cumulative reward. 
So we need to compute the gradient:

$$
\nabla_\theta \mathbb{E}[R|\theta] = \int R({\bf a}, {\bf s}) \nabla_\theta p_\theta({\bf a},{\bf s}) \, d{\bf a}d{\bf s} \ .
$$

We can now use the identity

$$
\nabla_\theta p_\theta({\bf a},{\bf s}) = p_\theta({\bf a},{\bf s}) \nabla_\theta \log p_\theta({\bf a},{\bf s})
$$

to express the gradient as an average over $p_\theta({\bf a},{\bf s})$:

$$
\nabla_\theta \mathbb{E}[R|\theta] = \int p_\theta({\bf a},{\bf s}) ( R({\bf a}, {\bf s}) - b ) \nabla_\theta \log p_\theta({\bf a},{\bf s}) d{\bf a}d{\bf s}\ .
$$

The constant $b$, called the baseline, does not affect the expected gradient but will be needed in practice when we estimate gradients by Monte Carlo (that is, from roll-outs). We can prove that subtracting $b$ does not change the gradient by using the identity from above again:

\begin{align*}
0 & = \nabla_\theta 1 \\
  & = \nabla_\theta \int p_\theta({\bf a},{\bf s}) \, d{\bf a} d{\bf s}\\
  & = \int \nabla_\theta p_\theta({\bf a},{\bf s}) \, d{\bf a}d{\bf s}\\
  & = \int p_\theta({\bf a},{\bf s}) \nabla_\theta \log p_\theta({\bf a},{\bf s}) \, d{\bf a}d{\bf s} \ .
\end{align*}
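
This identity can be checked numerically in a simple case. The sketch below assumes a one-step problem with a softmax policy whose parameters are the logits themselves; the helper names and the logit values are hypothetical. It evaluates $\sum_a p_\theta(a) \nabla_\theta \log p_\theta(a)$ exactly and shows that every component is zero up to floating-point rounding.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def grad_log_prob(logits, action):
    """Gradient of log softmax(logits)[action] with respect to the logits."""
    probs = softmax(logits)
    return [(1.0 if k == action else 0.0) - probs[k] for k in range(len(logits))]

def expected_score(logits):
    """Exact expectation of grad log p under the policy's action distribution."""
    probs = softmax(logits)
    n = len(logits)
    expectation = [0.0] * n
    for a in range(n):
        g = grad_log_prob(logits, a)
        for k in range(n):
            expectation[k] += probs[a] * g[k]
    return expectation

logits = [0.3, -1.2, 2.0]  # hypothetical policy parameters
print(expected_score(logits))  # every entry is zero up to rounding error
```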

We cannot evaluate the average over roll-outs analytically, but we have an environment simulator that, when supplied with our current policy $p_\theta(a|s)$, returns the sequence of actions, states and rewards. This allows us to replace the integral by a Monte Carlo average over $V$ roll-outs:

$$
\nabla_\theta \mathbb{E}[R|\theta] \approx \frac{1}{V} \sum_{v=1}^V ( R({\bf a}^{(v)}, {\bf s}^{(v)}) - b) \nabla_\theta \log p_\theta({\bf a}^{(v)},{\bf s}^{(v)})
$$
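
To make the estimator concrete, here is a minimal sketch on a toy problem rather than a Gym environment. It assumes one-step roll-outs, two actions, a softmax policy parameterized directly by its logits, and a made-up reward function `toy_reward`; all names are hypothetical. For logits `[0, 0]` the exact gradient is `[0.25, -0.25]`, and the Monte Carlo average approaches it for both choices of $b$.

```python
import math
import random

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def grad_log_prob(probs, action):
    """Gradient of log softmax(logits)[action] with respect to the logits."""
    return [(1.0 if k == action else 0.0) - p for k, p in enumerate(probs)]

def sample_action(probs, rng):
    """Draw an action index from the categorical distribution probs."""
    u, cumulative = rng.random(), 0.0
    for k, p in enumerate(probs):
        cumulative += p
        if u < cumulative:
            return k
    return len(probs) - 1  # guard against rounding in the cumulative sum

def mc_policy_gradient(logits, reward_fn, n_rollouts, baseline, rng):
    """Average (R - b) * grad log p over n_rollouts sampled one-step roll-outs."""
    probs = softmax(logits)
    grad = [0.0] * len(logits)
    for _ in range(n_rollouts):
        a = sample_action(probs, rng)
        weight = (reward_fn(a) - baseline) / n_rollouts
        for k, g in enumerate(grad_log_prob(probs, a)):
            grad[k] += weight * g
    return grad

def toy_reward(action):
    """Hypothetical reward: 1 for action 0, otherwise 0."""
    return 1.0 if action == 0 else 0.0

rng = random.Random(0)
no_baseline = mc_policy_gradient([0.0, 0.0], toy_reward, 20000, 0.0, rng)
with_baseline = mc_policy_gradient([0.0, 0.0], toy_reward, 20000, 0.5, rng)
print(no_baseline)    # noisy estimate of [0.25, -0.25]
print(with_baseline)  # same target; here every sample contributes the same value
```

In this particular toy problem the choice $b = 0.5$ makes each sampled term identical, so the estimate has no sampling noise at all. That is an artifact of the example, but it shows why the choice of $b$ matters.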

Note also that the gradient of $\log p_\theta({\bf a}^{(v)},{\bf s}^{(v)})$ does not depend explicitly on the state distribution:

$$
\nabla_\theta \log p_\theta({\bf a}^{(v)},{\bf s}^{(v)}) = \sum_{t=1}^T \nabla_\theta \log p_\theta(a_{t}|s_{t-1}) \ .
$$
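
To see why, take the logarithm of the factorized roll-out distribution defined earlier:

$$
\log p_\theta({\bf a},{\bf s}) = \log p(s_0) + \sum_{t=1}^T \log p(s_{t}|s_{t-1},a_t) + \sum_{t=1}^T \log p_\theta(a_{t}|s_{t-1}) \ .
$$

The first two terms do not depend on $\theta$, so their gradients vanish and only the policy terms remain.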

We are almost done. As a last step we will use the freedom in the choice of $b$ to select a $b$ that will make the Monte Carlo estimate of the gradient have the lowest possible variance. In other words, the finite Monte Carlo sample gives us a noisy gradient, and by this correction we can make it vary as little as possible between roll-out draws.

The $b$ that minimizes the variance can be found by minimizing the following expression:

$$
\int p_\theta({\bf a},{\bf s}) \, ( R({\bf a}, {\bf s}) - b )^2 \,  |\nabla_\theta \log p_\theta({\bf a},{\bf s})|^2 \,  d{\bf a}d{\bf s} - \left( \nabla_\theta \mathbb{E}[R|\theta] \right)^2 \ .
$$

The solution to this problem is

$$
b = \frac{\int p_\theta({\bf a},{\bf s}) \, R({\bf a}, {\bf s}) \,  |\nabla_\theta \log p_\theta({\bf a},{\bf s})|^2 d{\bf a}d{\bf s}}{\int p_\theta({\bf a},{\bf s}) \, |\nabla_\theta \log p_\theta({\bf a},{\bf s})|^2 d{\bf a}d{\bf s}} \ .
$$
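
This follows from setting the derivative of the variance expression with respect to $b$ to zero. The term $\left( \nabla_\theta \mathbb{E}[R|\theta] \right)^2$ does not depend on $b$ and drops out, leaving

$$
-2 \int p_\theta({\bf a},{\bf s}) \, ( R({\bf a}, {\bf s}) - b ) \, |\nabla_\theta \log p_\theta({\bf a},{\bf s})|^2 \, d{\bf a}d{\bf s} = 0 \ ,
$$

which can be solved for $b$.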

We replace this expression by a Monte Carlo average:

$$
b = \frac{\sum_{v=1}^V  R({\bf a}^{(v)}, {\bf s}^{(v)}) \, |\nabla_\theta \log p_\theta({\bf a}^{(v)},{\bf s}^{(v)})|^2}{\sum_{v=1}^V | \nabla_\theta \log p_\theta({\bf a}^{(v)},{\bf s}^{(v)})|^2}
$$
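
A minimal sketch of this Monte Carlo estimate is given below. It assumes the return and the squared gradient norm $|\nabla_\theta \log p_\theta|^2$ have already been computed for each roll-out; the function name `optimal_baseline` and the numbers are made up for illustration.

```python
def optimal_baseline(returns, score_sq_norms):
    """b = sum_v R_v * |g_v|^2 / sum_v |g_v|^2 over V roll-outs."""
    numerator = sum(R * g2 for R, g2 in zip(returns, score_sq_norms))
    denominator = sum(score_sq_norms)
    return numerator / denominator

# Hypothetical values for three roll-outs.
returns = [10.0, 20.0, 30.0]
score_sq_norms = [1.0, 1.0, 2.0]
b = optimal_baseline(returns, score_sq_norms)  # (10 + 20 + 60) / 4
print("%.2f" % b)
```

When all squared gradient norms are equal, the estimate reduces to the plain average of the returns.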

In the code below we instead use a time-step dependent baseline correction, $b(s_t)$, as described [here](https://gym.openai.com/docs/rl#policy-gradients).
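
The idea can be sketched in plain Python, independently of the agent class: compute the discounted return from each time-step onward, average these returns over roll-outs at each time-step to get the baseline, and subtract. The function names and the two example roll-outs are assumptions made for illustration; shorter roll-outs are padded with zeros because no reward is collected after termination.

```python
def returns_to_go(rewards, gamma):
    """Discounted return from each time-step to the end of the roll-out."""
    out = [0.0] * len(rewards)
    running = 0.0
    for i in reversed(range(len(rewards))):
        running = rewards[i] + gamma * running
        out[i] = running
    return out

def time_dependent_advantages(reward_lists, gamma=1.0):
    """Return (baseline per time-step, advantages per roll-out)."""
    rets = [returns_to_go(r, gamma) for r in reward_lists]
    maxlen = max(len(ret) for ret in rets)
    # pad shorter roll-outs with zeros so they can be averaged per time-step
    padded = [ret + [0.0] * (maxlen - len(ret)) for ret in rets]
    baseline = [sum(column) / len(padded) for column in zip(*padded)]
    advantages = [[ret[t] - baseline[t] for t in range(len(ret))] for ret in rets]
    return baseline, advantages

# Two hypothetical roll-outs with reward 1 per step, of length 3 and 1.
baseline, advantages = time_dependent_advantages([[1.0, 1.0, 1.0], [1.0]])
print(baseline)    # [2.0, 1.0, 0.5]
print(advantages)  # [[1.0, 1.0, 0.5], [-1.0]]
```

The longer roll-out gets positive advantages and the shorter one a negative advantage, so the update favors the actions taken in the roll-out that survived longer.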

In [None]:
class Agent(object):
    """
    Reinforcement Learning Agent
    
    This agent can learn to solve reinforcement learning tasks from
    OpenAI Gym by applying the policy gradient method.
    """

    def __init__(self, n_inputs, n_outputs):
        # symbolic variables for state, action, and advantage
        sym_state = T.fmatrix()
        sym_action = T.ivector()
        sym_advantage = T.fvector()
        # policy network
        l_in = InputLayer(shape=(None, n_inputs))
        l_hid = DenseLayer(incoming=l_in, num_units=20, nonlinearity=tanh, name='hiddenlayer')
        l_out = DenseLayer(incoming=l_hid, num_units=n_outputs, nonlinearity=softmax, name='outputlayer')
        # get network output
        eval_out = lasagne.layers.get_output(l_out, {l_in: sym_state}, deterministic=True)
        # get trainable parameters in the network.
        params = lasagne.layers.get_all_params(l_out, trainable=True)
        # get total number of timesteps
        t_total = sym_state.shape[0]
        # loss function that we'll differentiate to get the policy gradient
        loss = -T.log(eval_out[T.arange(t_total), sym_action]).dot(sym_advantage) / t_total
        # learning_rate
        learning_rate = T.fscalar()
        # get gradients
        grads = T.grad(loss, params)
        # update function
        updates = lasagne.updates.sgd(grads, params, learning_rate=learning_rate)
        # declare training and evaluation functions
        self.f_train = theano.function([sym_state, sym_action, sym_advantage, learning_rate], loss, updates=updates, allow_input_downcast=True)
        self.f_eval = theano.function([sym_state], eval_out, allow_input_downcast=True)
    
    def learn(self, env, n_epochs=100, t_per_batch=10000, traj_t_limit=None,
              learning_rate=0.1, discount_factor=1.0, n_early_stop=0):
        """
        Learn the given environment by the policy gradient method.
        """
        self.mean_train_rs = []
        self.mean_val_rs = []
        self.loss = []
        for epoch in range(n_epochs):
            # 1. collect trajectories until we have at least t_per_batch total timesteps
            trajs = []; t_total = 0
            while t_total < t_per_batch:
                traj = self.get_trajectory(env, traj_t_limit, deterministic=False)
                trajs.append(traj)
                t_total += len(traj["r"])
            all_s = np.concatenate([traj["s"] for traj in trajs])
            # 2. compute cumulative discounted rewards (returns)
            rets = [self._cumulative_discount(traj["r"], discount_factor) for traj in trajs]
            maxlen = max(len(ret) for ret in rets)
            padded_rets = [np.concatenate([ret, np.zeros(maxlen-len(ret))]) for ret in rets]
            # 3. compute time-dependent baseline
            baseline = np.mean(padded_rets, axis=0)
            # 4. compute advantages
            advs = [ret - baseline[:len(ret)] for ret in rets]
            all_a = np.concatenate([traj["a"] for traj in trajs])
            all_adv = np.concatenate(advs)
            # 5. do policy gradient update step
            loss = self.f_train(all_s, all_a, all_adv, learning_rate)
            train_rs = np.array([traj["r"].sum() for traj in trajs]) # trajectory total rewards
            eplens = np.array([len(traj["r"]) for traj in trajs]) # trajectory lengths
            # compute validation reward
            val_rs = np.array([self.get_trajectory(env, traj_t_limit, deterministic=True)['r'].sum() for _ in range(10)])
            # update stats
            self.mean_train_rs.append(train_rs.mean())
            self.mean_val_rs.append(val_rs.mean())
            self.loss.append(loss)
            # print stats
            print('%3d mean_train_r: %6.2f mean_val_r: %6.2f loss: %f' % (epoch+1, train_rs.mean(), val_rs.mean(), loss))
            # render solution
            #self.get_trajectory(env, traj_t_limit, render=True)
            # check for early stopping: true if the validation reward has not changed in n_early_stop epochs
            if n_early_stop and len(self.mean_val_rs) >= n_early_stop and \
                all([x == self.mean_val_rs[-1] for x in self.mean_val_rs[-n_early_stop:-1]]):
                break
    
    def get_trajectory(self, env, t_limit=None, render=False, deterministic=True):
        """
        Compute trajectory by iteratively evaluating the agent policy on the environment.
        """
        t_limit = t_limit or env.spec.timestep_limit
        s = env.reset()
        traj = {'s': [], 'a': [], 'r': [], 'f': []}
        for _ in range(t_limit):
            a = self.get_action(s, deterministic)
            # record the state the action was chosen in, matching p(a_t | s_{t-1})
            traj['s'].append(s)
            (s, r, done, _) = env.step(a)
            traj['a'].append(a)
            traj['r'].append(r)
            if render: 
                traj['f'].append(custom_render(s))
                if done:
                    return {'s': np.array(traj['s']), 'a': np.array(traj['a']), 
                            'r': np.array(traj['r']), 'f': traj['f']}
            if done: break
        return {'s': np.array(traj['s']), 'a': np.array(traj['a']), 
                'r': np.array(traj['r']), 'f': traj['f']}
    
    def get_action(self, s, deterministic=True):
        """
        Evaluate the agent policy to choose an action, a, given state, s.
        """
        # compute action probabilities
        prob_a = self.f_eval(s.reshape(1,-1))
        if deterministic:
            # choose action with highest probability
            return prob_a.argmax()
        else:
            # sample action from distribution
            return (np.cumsum(np.asarray(prob_a)) > np.random.rand()).argmax()
    
    def _cumulative_discount(self, r, gamma):
        """
        Compute the cumulative discounted rewards (returns).
        """
        r_out = np.zeros(len(r), 'float64')
        r_out[-1] = r[-1]
        for i in reversed(range(len(r)-1)):
            r_out[i] = r[i] + gamma * r_out[i+1]
        return r_out

Now that we have an agent, let's train it to solve the CartPole task.

Note: The agent is not guaranteed to learn a good solution every time as the policy gradient method might get stuck in a local optimum -- you may have to do several restarts to find a good solution.

In [None]:
# init environment
env = gym.make('CartPole-v0')
# init agent
agent = Agent(n_inputs=env.observation_space.shape[0],
              n_outputs=env.action_space.n)
# train agent on the environment
agent.learn(env, n_epochs=100, learning_rate=0.05, discount_factor=1,
            t_per_batch=10000, traj_t_limit=env.spec.timestep_limit, n_early_stop=5)

In [None]:
# plot training and validation mean reward
plt.figure(figsize=(10,5))
plt.xlabel('epochs'); plt.ylabel('mean reward')
plt.plot(agent.mean_train_rs, label='training')
plt.plot(agent.mean_val_rs, label='validation')
plt.xlim((0,len(agent.mean_val_rs)-1))
plt.legend(loc=2); plt.grid()
_=plt.show()

In [None]:
# review solution
agent_out = agent.get_trajectory(env, t_limit=1000, render=True)

display_frames_as_gif(agent_out['f'])

The initial solution does not learn to solve the task very well. Here are some hints on how to improve the solution:

* Increase the trajectory timestep limit to let the simulations look further into the future.
* Increase number of timesteps evaluated per batch.
* Try different optimization functions.
* Adjust the learning rate.
* Adjust the discount factor.

### Exercises

1. Describe the changes you made and why they should improve the agent. Are you able to get solutions consistently?
2. In the plot above you will sometimes see that the validation reward starts out lower than the training reward but later they cross. How can you explain this behavior?
3. Explain step by step the algorithm in the `agent.learn` method, with particular attention to the points denoted 1-5 in the code above.
4. Optional: Monitor and submit your best solution to the Gym (see code below).

In [None]:
# start monitor
#env.monitor.start('cartpole-experiment-1')
#for _ in xrange(100):
#    agent.get_trajectory(env)
#env.monitor.close()

When you have monitored a solution, open a Python shell and use the following command to upload the results to OpenAI Gym:

```
import gym
gym.upload('cartpole-experiment-1', api_key='YOUR_API_KEY')
```

You can also run the command here in the notebook, but remember to remove the API key before handing in the exercise.

You can find your API key at your [OpenAI Gym](https://gym.openai.com/) account page. 