# Learning how to use the OpenAI Gym environment

The Gym library by OpenAI provides virtual environments for developing and comparing reinforcement learning algorithms. We will learn how to use this toolkit with the 'cart pole' environment.

## The cart pole problem

A pole is attached by an un-actuated joint to a cart, which moves along a frictionless track. 
The system is controlled by applying a force of +1 or -1 to the cart. 
The pendulum starts upright, and the goal is to prevent it from falling over. 
A reward of +1 is provided for every timestep that the pole remains upright. 
The episode ends if any of the following holds:

1. The pole angle is more than ±12° from vertical.
2. The cart position is more than ±2.4 (the center of the cart reaches the edge of the display).
3. The episode length is greater than 200.

The problem is considered solved when the average reward is greater than or equal to 195 over 100 consecutive trials.
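
The failure conditions and the success criterion above can be sketched in plain Python, without Gym installed. The helper names `out_of_bounds` and `is_solved` are hypothetical and introduced only for illustration. The sketch assumes the angle is supplied in radians and that "100 consecutive trials" means the most recent 100 episodes.

```python
import math

ANGLE_LIMIT_DEG = 12.0     # pole angle threshold quoted above
POSITION_LIMIT = 2.4       # cart position threshold quoted above
SOLVED_THRESHOLD = 195.0   # required average reward
SOLVED_WINDOW = 100        # number of consecutive trials


def out_of_bounds(position, angle_rad):
    """Return True if the cart position or pole angle violates its limit.

    Hypothetical helper: assumes the angle is given in radians.
    """
    angle_deg = math.degrees(angle_rad)
    return abs(angle_deg) > ANGLE_LIMIT_DEG or abs(position) > POSITION_LIMIT


def is_solved(episode_rewards):
    """Return True if the most recent 100 episode rewards average >= 195."""
    if len(episode_rewards) < SOLVED_WINDOW:
        return False  # not enough trials yet
    recent = episode_rewards[-SOLVED_WINDOW:]
    return sum(recent) / SOLVED_WINDOW >= SOLVED_THRESHOLD


print(out_of_bounds(0.0, math.radians(13)))  # pole tilted past 12 degrees
print(is_solved([200.0] * 100))              # 100 maximal episodes
```

A training loop could append each episode's total reward to a list and stop once `is_solved` returns `True`.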

The leaderboard score can be found here:
https://github.com/openai/gym/wiki/Leaderboard

In [1]:
%matplotlib inline
import gym
import matplotlib.pyplot as plt
from JSAnimation.IPython_display import display_animation
from matplotlib import animation
from IPython.display import display

## Create the environment

In [21]:
env = gym.make('CartPole-v0')
env.reset() 

array([-0.02430858, -0.03550512,  0.03583215,  0.01936399])

## Display the environment

In [22]:
def display_frames_as_gif(frames):
    """
    Displays a list of frames as a gif, with controls
    """
    #plt.figure(figsize=(frames[0].shape[1] / 72.0, frames[0].shape[0] / 72.0), dpi = 72)
    patch = plt.imshow(frames[0])
    plt.axis('off')

    def animate(i):
        patch.set_data(frames[i])

    anim = animation.FuncAnimation(plt.gcf(), animate, frames = len(frames), interval=50)
    display(display_animation(anim, default_mode='loop'))

In [23]:
frames = []
frames.append(env.render(mode = 'rgb_array'))
env.close()
display_frames_as_gif(frames)

## Show the environment's bounds

In [26]:
print(env.action_space.n)
print(env.observation_space)

print(env.observation_space.high)
print(env.observation_space.low)

2
Box(4,)
[4.8000002e+00 3.4028235e+38 4.1887903e-01 3.4028235e+38]
[-4.8000002e+00 -3.4028235e+38 -4.1887903e-01 -3.4028235e+38]


There are 2 possible actions: push the cart to the left (0) or to the right (1). Each observation is a vector of 4 values: [Position, Velocity, Angle, Velocity_at_tip]

- position: min = -4.8, max = 4.8
- velocity: min = -3.4e38, max = 3.4e38 (effectively unbounded)
- angle: min = -0.42 rad, max = +0.42 rad
- velocity at tip: min = -3.4e38, max = 3.4e38 (effectively unbounded)
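
The printed bounds are easier to interpret once converted. The sketch below uses only the standard library and the values printed above (the variable names are illustrative). It checks that the angle bound is about 24°, twice the 12° termination threshold, that the position bound is twice 2.4, and that 3.4028235e+38 is the largest finite 32-bit float.

```python
import math
import struct

angle_bound_rad = 0.41887903   # observation_space.high[2] as printed above
position_bound = 4.8           # observation_space.high[0], rounded

# Convert the angle bound from radians to degrees
angle_bound_deg = math.degrees(angle_bound_rad)
print(round(angle_bound_deg, 3))

# Ratio of the observation bounds to the termination thresholds
print(round(angle_bound_deg / 12.0, 3))
print(round(position_bound / 2.4, 3))

# Largest finite float32 (bit pattern 0x7f7fffff), decoded big-endian
float32_max = struct.unpack('>f', bytes.fromhex('7f7fffff'))[0]
print(float32_max)
```

The observation bounds are therefore wider than the limits that end an episode, so the agent can still observe a state slightly past the failure thresholds.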

## Define a policy function returning random actions

In [27]:
def policy():
    """ return a random action: either 0 (left) or 1 (right)"""
    action = env.action_space.sample()  
    return action

## Iterate over episodes and timesteps

In [28]:
nb_episodes = 20
nb_timesteps = 100

for episode in range(nb_episodes):  # iterate over the episodes
    state = env.reset()             # initialise the environment
    rewards = []
    
    for t in range(nb_timesteps):    # iterate over time steps
        env.render()                 # display the environment
        state, reward, done, info = env.step(policy())  # implement the action chosen by the policy
        rewards.append(reward)      # record the reward received at this timestep
        
        # the episode ends if the pole is more than 12° from vertical
        # or the cart moves more than 2.4 units from the centre
        if done:
            cumulative_reward = sum(rewards)
            print("episode {} finished after {} timesteps. Total reward: {}".format(episode, t+1, cumulative_reward))  
            break
    
env.close()

episode 0 finished after 16 timesteps. Total reward: 16.0
episode 1 finished after 15 timesteps. Total reward: 15.0
episode 2 finished after 15 timesteps. Total reward: 15.0
episode 3 finished after 11 timesteps. Total reward: 11.0
episode 4 finished after 17 timesteps. Total reward: 17.0
episode 5 finished after 12 timesteps. Total reward: 12.0
episode 6 finished after 12 timesteps. Total reward: 12.0
episode 7 finished after 10 timesteps. Total reward: 10.0
episode 8 finished after 25 timesteps. Total reward: 25.0
episode 9 finished after 28 timesteps. Total reward: 28.0
episode 10 finished after 18 timesteps. Total reward: 18.0
episode 11 finished after 12 timesteps. Total reward: 12.0
episode 12 finished after 45 timesteps. Total reward: 45.0
episode 13 finished after 34 timesteps. Total reward: 34.0
episode 14 finished after 9 timesteps. Total reward: 9.0
episode 15 finished after 10 timesteps. Total reward: 10.0
episode 16 finished after 55 timesteps. Total reward: 55.0
episode 1

## Define a new hardcoded policy: alternate right push and left push

In [29]:
def policy(t):
    """ return 0 (left) on even timesteps and 1 (right) on odd timesteps"""
    action = 0
    if t%2 == 1:  # if t is odd
        action = 1
    return action
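
As a quick check that needs no environment, the sketch below restates the alternating rule under the hypothetical name `alternating_policy` and lists the actions for the first few timesteps. Since the action is 1 exactly when `t` is odd, the rule is equivalent to `t % 2`.

```python
def alternating_policy(t):
    """Return 0 (left) on even timesteps and 1 (right) on odd timesteps."""
    return t % 2


actions = [alternating_policy(t) for t in range(6)]
print(actions)  # [0, 1, 0, 1, 0, 1]
```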

## Iterate with the new policy

In [30]:
nb_episodes = 20
nb_timesteps = 100

for episode in range(nb_episodes):
    state = env.reset()             
    rewards = []
    frames = []                     # reset for each episode, so only the last episode's frames are kept
    
    for t in range(nb_timesteps):   
        #env.render()
        frames.append(env.render(mode = 'rgb_array'))
        state, reward, done, info = env.step(policy(t))
        rewards.append(reward)
        
        if done: 
            cumulative_reward = sum(rewards)
            print("episode {} finished after {} timesteps. Total reward: {}".format(episode, t+1, cumulative_reward))  
            break
    
env.close()

episode 0 finished after 23 timesteps. Total reward: 23.0
episode 1 finished after 30 timesteps. Total reward: 30.0
episode 2 finished after 36 timesteps. Total reward: 36.0
episode 3 finished after 53 timesteps. Total reward: 53.0
episode 4 finished after 34 timesteps. Total reward: 34.0
episode 5 finished after 37 timesteps. Total reward: 37.0
episode 6 finished after 35 timesteps. Total reward: 35.0
episode 7 finished after 22 timesteps. Total reward: 22.0
episode 8 finished after 22 timesteps. Total reward: 22.0
episode 9 finished after 44 timesteps. Total reward: 44.0
episode 10 finished after 55 timesteps. Total reward: 55.0
episode 11 finished after 28 timesteps. Total reward: 28.0
episode 12 finished after 62 timesteps. Total reward: 62.0
episode 13 finished after 33 timesteps. Total reward: 33.0
episode 14 finished after 55 timesteps. Total reward: 55.0
episode 15 finished after 68 timesteps. Total reward: 68.0
episode 16 finished after 29 timesteps. Total reward: 29.0
episode

In [31]:
display_frames_as_gif(frames)
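
To put numbers on the two policies, the sketch below averages the episode rewards printed above. Only the 17 episodes visible in each output are used (the remaining lines are cut off), so the figures are indicative at best. The variable names are illustrative.

```python
from statistics import mean

# Total rewards of episodes 0-16 as printed above (later episodes are cut off)
random_policy_rewards = [16, 15, 15, 11, 17, 12, 12, 10, 25, 28,
                         18, 12, 45, 34, 9, 10, 55]
alternating_policy_rewards = [23, 30, 36, 53, 34, 37, 35, 22, 22, 44,
                              55, 28, 62, 33, 55, 68, 29]

SOLVED_THRESHOLD = 195  # average reward required to consider the problem solved

for name, rewards in [("random", random_policy_rewards),
                      ("alternating", alternating_policy_rewards)]:
    avg = mean(rewards)
    print("{}: mean reward {:.1f} ({:.0%} of the threshold)".format(
        name, avg, avg / SOLVED_THRESHOLD))
```

On this sample, alternating pushes roughly doubles the mean reward of random actions (about 39 versus about 20), yet both remain far below 195.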

Of course, these hardcoded policies are not very interesting. In the next notebook, we will see how to implement reinforcement learning algorithms that learn a policy instead.