# Introducing CartPole

CartPole is a classic control problem from OpenAI Gym, a collection of reinforcement learning (RL) environments for developing and testing RL agents.

https://gym.openai.com/envs/CartPole-v0/

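A pole is attached by an un-actuated joint to a cart, which moves along a frictionless track. The system is controlled by applying a force of +1 or -1 to the cart (in Gym these are the discrete actions 1, push right, and 0, push left). The pole starts upright, and the goal is to prevent it from falling over. A reward of +1 is provided for every timestep that the pole remains upright. The episode ends when the pole is more than 15 degrees from vertical, or the cart moves more than 2.4 units from the center. (The 15 degree figure comes from the Gym description; the environment's source code applies a 12 degree threshold.) In this notebook each episode is also stopped after at most 200 steps.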
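The termination rule can be written down as a small check. The sketch below is not Gym's own code: `episode_over` and its parameter names are hypothetical, and the limits are passed in explicitly so that the figures quoted above can be tried out. Gym reports the pole angle in radians, so a real observation would need converting (for example with `math.degrees`) before being passed in.

```python
def episode_over(cart_position, pole_angle_deg, max_position=2.4, max_angle_deg=15.0):
    """Return True if either termination condition from the description is met."""
    cart_out_of_bounds = abs(cart_position) > max_position
    pole_fallen = abs(pole_angle_deg) > max_angle_deg
    return cart_out_of_bounds or pole_fallen


# Pole nearly upright and cart near the centre: the episode continues
print(episode_over(cart_position=0.3, pole_angle_deg=2.0))    # False
# Pole tilted 20 degrees: the episode ends
print(episode_over(cart_position=0.3, pole_angle_deg=20.0))   # True
# Cart has run 3 units from the centre: the episode ends
print(episode_over(cart_position=-3.0, pole_angle_deg=2.0))   # True
```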

## Key Gym methods

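* Make (creates the environment), e.g. `env = gym.make("CartPole-v1")`

* Reset (resets the environment and returns the first observation), e.g. `obs = env.reset()`

* Step (passes an action to the environment and returns the new observation, reward, done*, and any extra info), e.g. `obs, reward, done, info = env.step(action)`

* Close (closes the environment at the end), e.g. `env.close()`

*With the step method, the environment returns whether it has reached a terminal state, that is, the end of an episode. This notebook uses the older four-value form of `step`; from Gym 0.26 onwards (and in Gymnasium), `reset` returns `(obs, info)` and `step` returns `(obs, reward, terminated, truncated, info)`.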
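To make the shape of these calls concrete without needing Gym installed, here is a sketch of a stand-in environment that follows the same reset/step/close pattern. `CountdownEnv` is a hypothetical toy class, not part of Gym; it ignores the action and simply ends the episode after a fixed number of steps.

```python
class CountdownEnv:
    """Hypothetical stand-in that mimics the four-value step interface."""

    def __init__(self, episode_length=5):
        self.episode_length = episode_length
        self.steps_taken = 0

    def reset(self):
        # Start a new episode and return the first observation
        self.steps_taken = 0
        return [0.0]

    def step(self, action):
        # Advance one timestep; the action is ignored in this toy example
        self.steps_taken += 1
        obs = [float(self.steps_taken)]
        reward = 1.0
        done = self.steps_taken >= self.episode_length
        info = {}
        return obs, reward, done, info

    def close(self):
        # Nothing to release in this toy environment
        pass


env = CountdownEnv(episode_length=5)
obs = env.reset()
total_reward = 0.0
done = False
while not done:
    obs, reward, done, info = env.step(action=0)
    total_reward += reward
env.close()

print(total_reward)  # 5.0
```

The loop has the same structure as the CartPole loops later in the notebook: reset once per episode, call step until `done` is true, then close.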


## Load libraries

In [1]:
import gym
import matplotlib.pyplot as plt
import numpy as np
import random

# Turn warnings off to keep notebook tidy
import warnings
warnings.filterwarnings("ignore")

# Set whether the environment will be rendered
RENDER = True

## Random choice

Our first baseline pushes the cart left or right at random.

Note: The CartPole visualisation in this demo may not work on remote servers. If it does not work, set `RENDER = False` in the cell above to run the rest of the notebook without visualisation (re-run the cell after changing the setting).

In [2]:
def random_choice(obs):
    """
    Return a random action: 0 (push left) or 1 (push right).
    `obs` is unused; it is accepted so that all policies share the same signature.
    """

    return random.randint(0, 1)

In [3]:
# Set up environment
env = gym.make("CartPole-v1")

totals = []
for episode in range(10):
    episode_reward = 0
    obs = env.reset()
    for step in range(200):
        if RENDER:
            env.render()
        action = random_choice(obs)
        obs, reward, done, info = env.step(action)
        episode_reward += reward
        # Pole has fallen over if done is True
        if done:
            break
    totals.append(episode_reward)

# Close the environment once all episodes have finished
env.close()

print("Average: {0:.1f}".format(np.mean(totals)))
print("Stdev: {0:.1f}".format(np.std(totals)))
print("Minimum: {0:.0f}".format(np.min(totals)))
print("Maximum: {0:.0f}".format(np.max(totals)))

Average: 16.4
Stdev: 4.6
Minimum: 11
Maximum: 26


## A simple policy

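Here we use a simple policy that pushes the cart towards the side the pole is leaning: it accelerates left when the pole is leaning to the left, and accelerates right when the pole is leaning to the right, so that the cart moves back under the pole.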

In [4]:
def basic_policy(obs):
    
    """
    A simple policy that accelerates left when the pole is leaning to the left,
    and accelerates right when the pole is leaning to the right.

    CartPole observations:
        obs[0]: cart position (0 = centre)
        obs[1]: cart velocity (+ve = right)
        obs[2]: pole angle (0 = upright, +ve = leaning right)
        obs[3]: pole angular velocity (+ve = clockwise)
    """

    # Action 0 pushes the cart left, action 1 pushes it right
    angle = obs[2]
    return 0 if angle < 0 else 1
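
A quick way to check the rule is to call it on hand-made observations rather than a live environment. The sketch below restates the policy so that it runs on its own; the observation values are made up for illustration.

```python
def basic_policy(obs):
    # Same rule as above: action 0 for a negative angle, otherwise action 1
    angle = obs[2]
    return 0 if angle < 0 else 1


# Made-up observations: [position, velocity, angle, angular velocity]
negative_angle = [0.0, 0.0, -0.05, 0.0]
positive_angle = [0.0, 0.0, 0.05, 0.0]
upright = [0.0, 0.0, 0.0, 0.0]

print(basic_policy(negative_angle))  # 0
print(basic_policy(positive_angle))  # 1
print(basic_policy(upright))         # 1 (a zero angle falls into the else branch)
```

Only the sign of the angle is used; position, velocity and angular velocity are ignored by this policy.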

In [5]:
# Set up environment
env = gym.make("CartPole-v1")

totals = []
for episode in range(10):
    episode_reward = 0
    obs = env.reset()
    for step in range(200):
        if RENDER:
            env.render()
        action = basic_policy(obs)
        obs, reward, done, info = env.step(action)
        episode_reward += reward
        # Pole has fallen over if done is True
        if done:
            break
    totals.append(episode_reward)

# Close the environment once all episodes have finished
env.close()

print("Average: {0:.1f}".format(np.mean(totals)))
print("Stdev: {0:.1f}".format(np.std(totals)))
print("Minimum: {0:.0f}".format(np.min(totals)))
print("Maximum: {0:.0f}".format(np.max(totals)))

Average: 44.2
Stdev: 10.7
Minimum: 25
Maximum: 63
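
The two evaluation loops in this notebook differ only in the policy they call, so they could be folded into one helper. The following is a sketch, not part of the original notebook: `evaluate` and `summarise` are hypothetical names, the environment is assumed to follow the four-value `step` interface used here, rendering is left out for brevity, and `statistics.pstdev` is used because it computes the population standard deviation, which is also the default behaviour of `np.std`.

```python
import statistics


def evaluate(env, policy, episodes=10, max_steps=200):
    """Run `policy` for several episodes and return the total reward of each."""
    totals = []
    for _ in range(episodes):
        episode_reward = 0
        obs = env.reset()
        for _ in range(max_steps):
            action = policy(obs)
            obs, reward, done, info = env.step(action)
            episode_reward += reward
            # Stop early if the environment reports a terminal state
            if done:
                break
        totals.append(episode_reward)
    return totals


def summarise(totals):
    """Return the four summary statistics reported in this notebook."""
    return {
        "average": statistics.mean(totals),
        "stdev": statistics.pstdev(totals),
        "minimum": min(totals),
        "maximum": max(totals),
    }
```

With a freshly created environment `env`, `summarise(evaluate(env, basic_policy))` would return the four figures as a dictionary, and swapping in `random_choice` would give the baseline for comparison.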


The next notebook will use a Deep Q Network (Double DQN) to see if we can improve on the simple policy.