# OpenAI Gym Cartpole

Reference: https://gym.openai.com/docs/

OpenAI Gym provides environments that can be used to develop agents that make use of deep learning predictive models. In this workshop we go over the basics of developing agents that use deep learning models with the OpenAI Gym CartPole environment.

### Installation

OpenAI Gym is installed through pip, like most Python packages. The command is:

    pip install gym

## Environments

As stated before, the gym package contains multiple environments that we can use; for now, however, our focus will be on the CartPole environment. These environments are essentially games: the package provides functions that give the agent its percepts and let the agent act on the environment.

### Setting up an Environment

We are now going to set up the basic CartPole environment in Python. We will let it run for 1000 steps, taking a random action at each step, and see what happens. The code that we write is the agent. An agent acts on an environment and takes "percepts" from the environment; normally the agent's actions are based on those percepts. In the example below, however, our agent simply takes a random action.

In [1]:
import gym # To import all the environments

env = gym.make('CartPole-v0') # Creates the cartpole environment
env.reset() # Resets the environment

# Loops for 1000 steps

for step in range(1000):
    env.render() # Displays the environment
    env.step(env.action_space.sample()) # Takes a random action
env.close() # Cleans up the environment



Using the basic code above, we can see how other environments behave by replacing 'CartPole-v0' with the name of any other environment in OpenAI Gym.

## Observations

As alluded to earlier, agents take percepts, or observations, which are then used with a predictive model to choose the actions they take. Hence, we will want to change the code so that we are not choosing a random move at each step. How can we get this information? Through the environment's step function. It returns four values, one of which is the observation:
  - The observation (object): An environment-specific object that represents the agent's observation of the environment. This could be pixel data from a camera, the joint angles and joint velocities of a robot, or many other things. The documentation for a specific environment explains what each observation object represents.
  - The reward (float): The reward achieved by the previous action. This is what our deep learning model learns from.
  - The done flag (boolean): True when the end condition has been reached and it is time to reset the simulation.
  - Extra info (dict): Additional diagnostic information that can be used for debugging.
 
As the OpenAI docs put it, this is "an implementation of the classic “agent-environment loop”. Each timestep, the agent chooses an action, and the environment returns an observation and a reward."
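
The loop described above can be sketched without gym at all. The example below is a minimal illustration, assuming the four-value return from step listed above. `CountdownEnv` and `run_episode` are hypothetical names invented for this sketch and are not part of the gym package.

```python
class CountdownEnv:
    """Hypothetical stand-in environment, not part of gym.

    It mimics the reset()/step() interface used in this workshop:
    the episode simply ends after a fixed number of steps.
    """

    def __init__(self, max_steps=5):
        self.max_steps = max_steps
        self.steps_taken = 0

    def reset(self):
        # Start a new episode and return the first observation.
        self.steps_taken = 0
        return [0.0]

    def step(self, action):
        # Advance one timestep and return (observation, reward, done, info).
        self.steps_taken += 1
        observation = [float(self.steps_taken)]
        reward = 1.0  # one point for every step survived
        done = self.steps_taken >= self.max_steps
        info = {"last_action": action}
        return observation, reward, done, info


def run_episode(env, policy):
    """Run one episode and return (total_reward, number_of_steps)."""
    observation = env.reset()
    total_reward = 0.0
    steps = 0
    done = False
    while not done:
        action = policy(observation)  # the agent chooses an action
        observation, reward, done, info = env.step(action)  # the environment responds
        total_reward += reward
        steps += 1
    return total_reward, steps


total_reward, steps = run_episode(CountdownEnv(max_steps=5), lambda obs: 0)
print(total_reward, steps)  # prints: 5.0 5
```

The policy here always returns action 0. The point is only the shape of the loop: reset once, then alternate between choosing an action and reading the values that step returns until done is true.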


Now, back to the earlier code. The first example kept calling step even after an episode had already ended. To get rid of that problem, we can rewrite the code to use the output of the step function to decide when to reset the environment. We can also print the observation that is returned.

In [2]:
import gym

env = gym.make('CartPole-v0')
for i_episode in range(20):
    observation = env.reset()
    for t in range(100):
        env.render()
        print(observation)
        action = env.action_space.sample()
        observation, reward, done, info = env.step(action)
        if done:
            print("Episode finished after {} timesteps".format(t+1))
            break
env.close()

[ 0.01705412  0.02018016 -0.01005875 -0.02106801]
[ 0.01745773  0.21544491 -0.01048011 -0.31690754]
[ 0.02176663  0.41071456 -0.01681826 -0.61287702]
[ 0.02998092  0.60606745 -0.0290758  -0.91080926]
[ 0.04210227  0.41135077 -0.04729199 -0.62740477]
[ 0.05032928  0.21691968 -0.05984008 -0.34998288]
[ 0.05466768  0.02269756 -0.06683974 -0.07675363]
[ 0.05512163 -0.17140569 -0.06837481  0.19411486]
[ 0.05169351 -0.36548632 -0.06449251  0.46446903]
[ 0.04438379 -0.55964048 -0.05520313  0.73614716]
[ 0.03319098 -0.36380123 -0.04048019  0.4266143 ]
[ 0.02591495 -0.55832715 -0.0319479   0.70626621]
[ 0.01474841 -0.36277747 -0.01782258  0.40370029]
[ 0.00749286 -0.55764218 -0.00974857  0.69071137]
[-0.00365998 -0.36238633  0.00406565  0.39497547]
[-0.01090771 -0.55756573  0.01196516  0.68893745]
[-0.02205902 -0.75285168  0.02574391  0.98536307]
[-0.03711606 -0.94830881  0.04545117  1.2860195 ]
[-0.05608223 -0.75379387  0.07117156  1.00790645]
[-0.07115811 -0.94979008  0.09132969  1.32206327]


[-0.12471431 -0.56617307  0.19127859  1.18828379]
Episode finished after 14 timesteps
[ 0.04314376 -0.02359359 -0.04360867  0.00426616]
[ 0.04267189 -0.21806388 -0.04352334  0.28287739]
[ 0.03831061 -0.02234904 -0.0378658  -0.02320869]
[ 0.03786363  0.17329491 -0.03832997 -0.32759414]
[ 0.04132952 -0.02126098 -0.04488185 -0.04724091]
[ 0.04090431 -0.21571158 -0.04582667  0.23095043]
[ 0.03659007 -0.01996577 -0.04120766 -0.07582835]
[ 0.03619076  0.17572196 -0.04272423 -0.38122238]
[ 0.0397052  -0.0187681  -0.05034868 -0.10231061]
[ 0.03932984  0.1770379  -0.05239489 -0.41044368]
[ 0.04287059 -0.01730361 -0.06060376 -0.1347285 ]
[ 0.04252452  0.17863169 -0.06329833 -0.4458985 ]
[ 0.04609716 -0.01554031 -0.0722163  -0.17382107]
[ 0.04578635 -0.20955839 -0.07569272  0.09523435]
[ 0.04159518 -0.01343777 -0.07378804 -0.22033667]
[ 0.04132643  0.1826571  -0.07819477 -0.53535258]
[ 0.04497957 -0.01128324 -0.08890182 -0.26829684]
[ 0.0447539   0.18498755 -0.09426776 -0.58764308]
[ 0.04845365  

[0.34839645 0.90196097 0.15536243 0.22689725]
[0.36643567 0.70499943 0.15990037 0.56427415]
[0.38053565 0.89755933 0.17118585 0.32592961]
[0.39848684 0.70046617 0.17770445 0.66733132]
[0.41249616 0.50337642 0.19105107 1.01028087]
Episode finished after 60 timesteps
[ 0.04567742  0.02775975  0.02798271 -0.01350174]
[ 0.04623261 -0.1677521   0.02771267  0.28787706]
[ 0.04287757 -0.36325806  0.03347021  0.58916989]
[ 0.03561241 -0.16862038  0.04525361  0.30721529]
[ 0.03224    -0.36435697  0.05139792  0.61381962]
[ 0.02495286 -0.56015805  0.06367431  0.92223771]
[ 0.0137497  -0.75607987  0.08211906  1.23423248]
[-0.00137189 -0.56210393  0.10680371  0.96836496]
[-0.01261397 -0.36856548  0.12617101  0.71105212]
[-0.01998528 -0.17539561  0.14039205  0.46059651]
[-0.02349319  0.01749187  0.14960398  0.21525177]
[-0.02314336 -0.17941694  0.15390902  0.55113552]
[-0.0267317  -0.37632744  0.16493173  0.88808067]
[-0.03425825 -0.57325733  0.18269334  1.22773725]
[-0.04572339 -0.77019864  0.207248

Now we can see each observation as it is returned, as well as when each episode ends and the simulation resets.
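
To show how an agent could use these observations instead of acting randomly, here is a small hand-written baseline. It is only a sketch: `lean_policy` is a hypothetical name, and the code assumes the documented CartPole layout, in which the four numbers are cart position, cart velocity, pole angle and pole angular velocity, with action 0 pushing the cart left and action 1 pushing it right.

```python
def lean_policy(observation):
    """Push the cart toward the side the pole is leaning.

    Assumes the observation is
    [cart position, cart velocity, pole angle, pole angular velocity]
    and that action 0 pushes left while action 1 pushes right.
    """
    pole_angle = observation[2]
    return 1 if pole_angle > 0 else 0


# Two observations copied from the output above.
print(lean_policy([0.01705412, 0.02018016, -0.01005875, -0.02106801]))  # prints: 0
print(lean_policy([0.04567742, 0.02775975, 0.02798271, -0.01350174]))   # prints: 1
```

To try it in the loop above, replace `env.action_space.sample()` with `lean_policy(observation)`. A learned model plays the same role as this function; it just works out the rule for itself.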

## Spaces

Above, we have been taking a random action from the environment's action space. Every environment comes with an action_space and an observation_space. These describe all the valid actions and all the valid observations. We can inspect the format of these spaces so that we can program our model to take those inputs and return those outputs. However, because we are using OpenAI Gym with deep learning models, we do not need to understand what the data represents. We only need to accept observations in the given format and return actions in the given format; the deep learning algorithm will learn what to do with the observations to maximize its performance through its actions.

In [3]:
import gym
env = gym.make('CartPole-v0')
print(env.action_space)
#> Discrete(2)
print(env.observation_space)
#> Box(4,)

Discrete(2)
Box(-3.4028234663852886e+38, 3.4028234663852886e+38, (4,), float32)


A Discrete space contains a fixed range of non-negative integers, in this case 2 of them. Hence, our actions for the cart pole are either 0 or 1. The Box space returned for the observations represents an n-dimensional box, so the observations in this case are arrays of 4 numbers. Other environments use other spaces. It is important to know what both the action space and the observation space require so that the model can be built to match.
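
The two kinds of space can be illustrated with plain Python. `ToyDiscrete` and `ToyBox` below are hypothetical stand-ins written for this sketch, not the gym classes. The real gym spaces offer `sample()` and `contains()` methods as well, but with more features than shown here.

```python
import random


class ToyDiscrete:
    """Hypothetical stand-in for a Discrete(n) space: the integers 0..n-1."""

    def __init__(self, n):
        self.n = n

    def sample(self):
        # Pick one of the n valid actions uniformly at random.
        return random.randrange(self.n)

    def contains(self, x):
        return isinstance(x, int) and 0 <= x < self.n


class ToyBox:
    """Hypothetical stand-in for a Box space: a fixed-length array of
    numbers where each entry must lie between low and high."""

    def __init__(self, low, high, size):
        self.low = low
        self.high = high
        self.shape = (size,)

    def contains(self, x):
        return len(x) == self.shape[0] and all(
            self.low <= value <= self.high for value in x
        )


action_space = ToyDiscrete(2)
observation_space = ToyBox(-3.4e38, 3.4e38, 4)

print(action_space.contains(action_space.sample()))  # prints: True
print(action_space.contains(2))                      # prints: False (only 0 and 1 are valid)
print(observation_space.contains([0.017, 0.020, -0.010, -0.021]))  # prints: True
```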

For the cartpole, the two available actions are applying a force to the left or a force to the right. However, we do not need to work out which number means which direction, as that is the job of our learning algorithm.
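
In other words, a model only has to accept four numbers and return 0 or 1. The sketch below shows that input and output contract with a very small linear scorer. It is an illustrative stand-in, not the workshop's deep learning model: `make_linear_policy` is a hypothetical name and the weights are random and untrained.

```python
import random


def make_linear_policy(weights):
    """Return a policy that scores each action as a weighted sum of the
    observation and picks the action with the highest score.

    weights holds one row of four numbers per action (2 x 4 in total).
    """

    def policy(observation):
        scores = [
            sum(w * x for w, x in zip(row, observation))
            for row in weights
        ]
        # The index of the largest score is the chosen action (0 or 1).
        return scores.index(max(scores))

    return policy


# Untrained, randomly initialised weights; a learning algorithm would adjust these.
rng = random.Random(0)
weights = [[rng.uniform(-1, 1) for _ in range(4)] for _ in range(2)]
policy = make_linear_policy(weights)

action = policy([0.01705412, 0.02018016, -0.01005875, -0.02106801])
print(action in (0, 1))  # prints: True
```

Whatever model is used in its place, it has to honour the same contract: an observation shaped like the Box space goes in, and a valid member of the Discrete space comes out.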

# Using Deep Learning for CartPole

From here, we are going to implement a deep learning algorithm as our model.

We will start with the code for the agent and then move on to the code for the model.

### READ THE GITHUB