#### run global setup

In [None]:
try:
    with open("../global_setup.py") as setupfile:
        exec(setupfile.read())
except FileNotFoundError:
    print('Setup already completed')
import sys
from os import getcwd
sys.path.append(getcwd())

#### run local setup

In [None]:
%matplotlib inline

import gym
from gym import logger

logger.set_level(logger.ERROR)

import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from tqdm import tqdm

from src.rl.RandomAgent import RandomAgent
from src.rl.util import run_episode
from src.rl.TabularQAgent import TabularQAgent
from src.rl.NeuralQAgent import NeuralQAgent

## Cart-pole environment

In this notebook we're going to look at a slightly more difficult setting. Here, we have a pole balancing on top of a cart. 
The cart can move left or right along a track, and the goal is to keep the pole balanced on top, without the cart moving too far away from its starting position.  

We can render the setting like this:

In [None]:
env = gym.make('CartPole-v1')
env.reset()
plt.imshow(env.render(mode='rgb_array'))
env.close()

In model-free reinforcement learning, which is currently the most common and successful approach, we do not use any prior knowledge of the system dynamics. It is worth noting, though, that for an environment like the cart-pole we can easily write down the dynamics as two differential equations, one for the angular acceleration of the pole $\ddot\theta$ and one for the acceleration of the cart $\ddot x$ (here we ignore friction):

$T = \frac{F + m_p L\dot\theta^2\sin\theta}{m_p + m_c}$

$\ddot\theta = \frac{g \sin\theta - T\cos\theta}{L (\frac{4}{3} - \frac{m_p\cos^2\theta}{m_p + m_c})}$ 

$\ddot x = T - \frac{m_p L\ddot\theta\cos\theta}{m_p + m_c}$

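Here $F$ is the force applied to the cart, $m_c$ and $m_p$ are the masses of the cart and the pole, $L$ is the half-length of the pole, and $g$ is the gravitational acceleration. Don't worry, you don't need to understand these equations. With this knowledge we could manually design a control system that behaves as we would like. In model-free reinforcement learning, however, we ignore any such knowledge and let the agent learn how to control the system from an essentially blank slate.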
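For the curious, the equations of motion can be turned into a small simulator. The following is a minimal sketch, not the code the environment itself runs: the constants (masses, pole half-length, time step) and the function name `cartpole_step` are assumptions chosen for illustration, and the integration is a plain explicit Euler step.

```python
import math

# Assumed constants (not given in the text above); chosen for illustration only.
GRAVITY = 9.8        # g, m/s^2
MASS_CART = 1.0      # m_c, kg
MASS_POLE = 0.1      # m_p, kg
HALF_LENGTH = 0.5    # L, m
DT = 0.02            # integration time step, s


def cartpole_step(state, force):
    """Advance (x, x_dot, theta, theta_dot) by one explicit Euler step."""
    x, x_dot, theta, theta_dot = state
    total_mass = MASS_POLE + MASS_CART
    sin_t, cos_t = math.sin(theta), math.cos(theta)

    # T = (F + m_p L theta_dot^2 sin(theta)) / (m_p + m_c)
    temp = (force + MASS_POLE * HALF_LENGTH * theta_dot ** 2 * sin_t) / total_mass
    # Angular acceleration of the pole.
    theta_acc = (GRAVITY * sin_t - temp * cos_t) / (
        HALF_LENGTH * (4.0 / 3.0 - MASS_POLE * cos_t ** 2 / total_mass)
    )
    # Acceleration of the cart.
    x_acc = temp - MASS_POLE * HALF_LENGTH * theta_acc * cos_t / total_mass

    return (
        x + DT * x_dot,
        x_dot + DT * x_acc,
        theta + DT * theta_dot,
        theta_dot + DT * theta_acc,
    )


state = (0.0, 0.0, 0.05, 0.0)  # pole slightly tilted, everything at rest
for _ in range(10):
    state = cartpole_step(state, force=0.0)  # no control input
print(state)
```

With no force applied, the tilt of the pole grows from step to step, which is exactly the instability the agent has to learn to counteract.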

Now let us consider how the state space for this task could be designed. We clearly need the position of the cart and the angle of the pole in the state, since these are the quantities we want to control. Additionally, because we assume what's called the Markov property, we need to include the velocity of the cart and the angular velocity of the pole. The Markov property essentially means that we have no "memory": we must be able to describe the evolution of the system using only the current state. In this case, we need the velocities in order to know the positions at the next time step, just like in the equations of motion.

The result is a 4-dimensional state space, whereas the grid worlds from the previous exercise had a 2-dimensional one. However, there is more to the picture than the dimensionality of the space. Whereas the grid worlds we looked at previously had around $4 \times 12 = 48$ distinct states, each dimension of the 4-dimensional state space is in theory a real number. This leads to an effectively infinite number of distinct states that the pole and cart could be in, since even the smallest change in angle or position gives a distinct state. We could even have an infinite number of actions, namely every possible force applied to the cart. However, for this reinforcement learning environment, we discretize the force into 2 distinct actions: a fixed force applied in either direction.
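To get a feel for the numbers, suppose we tried to rescue the lookup table by splitting each state variable into a fixed number of bins. The sketch below is purely illustrative (the helper `table_entries` and the bin counts are made up for this example) and simply counts how many entries such a table would need.

```python
# Hypothetical illustration: size of a lookup table if each of the four
# state variables were split into a fixed number of bins.
N_STATE_DIMS = 4   # cart position, cart velocity, pole angle, pole angular velocity
N_ACTIONS = 2      # push left, push right


def table_entries(bins_per_dim):
    """Number of (state, action) entries in a table over the binned state space."""
    return bins_per_dim ** N_STATE_DIMS * N_ACTIONS


for bins in (10, 100, 1000):
    print(bins, "bins per dimension ->", table_entries(bins), "table entries")
```

Even a coarse grid of 10 bins per dimension already needs far more entries than the 48 states of the grid world, and the count grows with the fourth power of the resolution.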

First, let's see the performance of a random policy plotted as a histogram.

In [None]:
env = gym.make('CartPole-v1')
agent = RandomAgent()
run_episode(env, agent, render=True)
dist = [run_episode(env, agent) for _ in range(1000)]
ax = sns.distplot(dist, kde=False)
ax.set(xlabel='Reward', ylabel='# occurrences');

Now, as an experiment, let's see how our tabular Q-learning agent that we used to solve the grid world tasks works on this environment. First, a learning curve.

In [None]:
agent = TabularQAgent(0.1, 0.5, 0.99)

def run_experiment(env, agent, epsilon_decay, n_episodes) -> list:
    """Train for n_episodes, decaying epsilon after each one; return the per-episode rewards."""
    rewards = []
    for _ in tqdm(range(n_episodes)):
        sum_r = run_episode(env, agent, learn=True)
        rewards.append(sum_r)
        agent.epsilon *= epsilon_decay  # explore less as training progresses
    agent.epsilon = 0  # act greedily in the final evaluation episode
    sum_r = run_episode(env, agent)
    print('Trained for', n_episodes, 'episodes. Last episode achieved a reward of', sum_r)
    return rewards


#run_episode(env, agent, learn=True)
rewards = run_experiment(env, agent, 0.99, 10000)
fig, ax = plt.subplots()
ax.plot(rewards);  # plain matplotlib line plot; sns.tsplot is deprecated in newer seaborn
sns.despine()
ax.set(xlabel='Episode', ylabel='Reward');

And the corresponding histogram.

In [None]:
ax = sns.distplot(rewards, kde=False)
ax.set(xlabel='Reward', ylabel='# occurrences');

It didn't seem to learn anything, and the reason is the size of the state space. Intuitively, since we have so many states, we're unlikely to ever end up in the same state twice, so we'll never learn anything by storing our knowledge in a lookup table. Additionally, what we've learned in one state might be useful in similar states, which a table cannot exploit. What we can do instead is use *function approximation*. Instead of a lookup table, we're going to learn a much simpler, continuous function that maps a state to the value we'd otherwise store in the table. Since this function is much simpler, there is no way it's going to return exactly the same values as our table would, but we can hope that it's going to be good enough.
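The claim that a table never sees the same continuous state twice is easy to check on made-up data. The sketch below does not use the cart-pole environment at all: it draws random 4-dimensional states (an assumption for illustration) and counts how many distinct table keys they produce, first exactly and then after coarse rounding.

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable


def random_state():
    """A made-up 4-dimensional continuous state, each component in [-1, 1]."""
    return tuple(random.uniform(-1.0, 1.0) for _ in range(4))


states = [random_state() for _ in range(1000)]

# Exact states as table keys: how many distinct rows would the table need?
exact_keys = set(states)
# States rounded to the nearest integer: many different states now share a row.
rounded_keys = {tuple(round(v) for v in s) for s in states}

print("distinct exact states:  ", len(exact_keys))
print("distinct rounded states:", len(rounded_keys))
```

With exact keys, essentially every visited state gets its own row and nothing is ever reused. Rounding forces nearby states to share a row, which is a crude form of the generalisation that function approximation provides more gracefully.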

We'll use a small neural network, taking in the 4-dimensional state. Through training, it should learn to output the value of taking our two actions, applying a force to the left or right. We can then use this instead of our lookup table, choosing the action with the highest value.
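As a rough picture of what such a network computes, here is a minimal, untrained sketch in plain Python. It is not the architecture of `NeuralQAgent`: the hidden layer size, the `tanh` activation, the missing bias terms, and the helper names `q_values` and `greedy_action` are all assumptions made for illustration.

```python
import math
import random

random.seed(0)
N_INPUTS, N_HIDDEN, N_ACTIONS = 4, 8, 2  # hidden size is an arbitrary choice


def random_matrix(rows, cols):
    """Small random weights; a real agent would learn these from experience."""
    return [[random.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


W1 = random_matrix(N_HIDDEN, N_INPUTS)   # input -> hidden weights
W2 = random_matrix(N_ACTIONS, N_HIDDEN)  # hidden -> output weights


def q_values(state):
    """Forward pass: state -> one estimated value per action."""
    hidden = [math.tanh(sum(w * s for w, s in zip(row, state))) for row in W1]
    return [sum(w * h for w, h in zip(row, hidden)) for row in W2]


def greedy_action(state):
    """Pick the action whose estimated value is highest."""
    values = q_values(state)
    return max(range(N_ACTIONS), key=lambda a: values[a])


# cart position, cart velocity, pole angle, pole angular velocity
state = [0.0, 0.1, 0.02, -0.1]
print(q_values(state), greedy_action(state))
```

Because the weights are random, the chosen action is meaningless here; training adjusts the weights so that the two outputs approximate the values the lookup table would have held.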

Training this agent is going to be significantly slower than what we've seen previously. The reason is that, as the agent gets better, each episode takes significantly longer to end -- and because the problem is harder, we're going to need more training data, and more time to make use of that data.

In [None]:
agent = NeuralQAgent(4, env.action_space.n, alpha=1e-3, gamma=0.95, epsilon=1.0, decay=1e-6)
rewards = run_experiment(env, agent, 0.995, 600)
fig, ax = plt.subplots()
ax.plot(rewards)  # plain matplotlib line plot; sns.tsplot is deprecated in newer seaborn
sns.despine()
ax.set(xlabel='Episode', ylabel='Reward');

That learning curve should look significantly better than the tabular one. It might be unstable and jump up and down, but at least it is learning. Let's look at a histogram.

In [None]:
dist = [run_episode(env, agent) for _ in tqdm(range(200))]
sns.distplot(dist, kde=False)
print("Mean neural agent reward: ", np.mean(dist))

Certainly better, and if you weren't too unlucky with the training, it should even hit the maximum reward of 500 in this environment. However, the imperfect results go to show how difficult the reinforcement learning problem is. We can even render the policy in action: try running it a few times and see how it behaves.

In [None]:
env = gym.make('CartPole-v1')
run_episode(env, agent, render=True)