# Introduction

In this notebook we will be investigating a number of reinforcement learning (RL) environments using the open-source RL library [OpenAI Gym](https://gym.openai.com/) (gym). I will not be training any models or agents as my main goal is to get a strong understanding of how to work in this setting, and get exposed to potential environments that would be interesting to explore further.

In [1]:
import numpy as np

import gym
import tensorflow as tf
import tf_agents as tfa

# OpenAI Gym

We will begin by walking through the "Getting Started" guide in the gym documentation. The first environment we consider is the classic Cart-Pole problem. In this problem the goal is to keep the pole balanced upright on the cart.

In the code below we instantiate the environment, render it, take a random action, and repeat.

In [3]:
env = gym.make('CartPole-v0')
env.reset()
for _ in range(1000):
    env.render()
    env.step(env.action_space.sample())
env.close()



Okay, so that was fairly chaotic, but it gets the point across -- cool! We are also informed by the documentation that we should ignore the warning we receive about calling step().

Let's try one of the other environments they suggest...

In [4]:
env = gym.make('MountainCar-v0')
env.reset()
for _ in range(500):
    env.render()
    env.step(env.action_space.sample())
env.close()

We didn't run that one for quite as long, but we can see the goal is to get the cart out of the valley and to the flag. Clearly this will require some amount of higher-level reasoning because you have to rock back-and-forth in order to accomplish the goal.

Let's look at a few more cool environments before we home in on some more specifics. I think the [Atari games](https://gym.openai.com/envs/#atari) are really interesting, so I will go with that. It requires a few extra dependencies, but they are easy to install.

For each Atari game there are two versions: one where the input is the machine RAM (128 bytes), and one where the input is the screen. At the moment this doesn't matter, but it is an important consideration when building a model. Pitfall is a classic, so I will take a look at that. Interestingly this wasn't one of the games listed in the plot from the "Human-level Control through Deep Reinforcement Learning" class slides, so I am not sure where things stand in terms of RL agents' capabilities with this game.

In [4]:
env = gym.make('Pitfall-ram-v0')
env.reset()
for _ in range(5000):
    env.render()
    env.step(env.action_space.sample())
env.close()

At least on my system, the rendered image is much too small to see. In looking for a solution to this I found the following two resources: this [GitHub issue](https://github.com/openai/gym/issues/550) and this [GitHub gist](https://gist.github.com/mttk/74dc6eaaea83b9f06c2cc99584d45f96). They are essentially the same fix, and I have implemented it in the code below.

In [2]:
from gym.envs.classic_control import rendering

def upsample(rgb_array, k, l):
    if k<=0 or l<=0:
        print('Scale factors must be greater than 0!')
        return rgb_array
    
    return np.repeat(np.repeat(rgb_array, k, axis=0), l, axis=1)

def renderEnv(env, steps=1000, scale=4):
    viewer = rendering.SimpleImageViewer()
    env.reset()
    for _ in range(steps):
        env.step(env.action_space.sample())
        
        rgb = env.render('rgb_array')
        upsampled = upsample(rgb, scale, scale)
        
        viewer.imshow(upsampled)
        
    viewer.close()
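
To see what the two nested `np.repeat` calls do without opening a viewer, here is a minimal pure-Python sketch of the same nearest-neighbour idea on a tiny 2x2 grid. The name `upsample_nested` is hypothetical and not part of gym or NumPy, and unlike `upsample` above it raises on bad scale factors instead of printing a message.

```python
def upsample_nested(grid, k, l):
    """Repeat every row k times and every column l times (nearest-neighbour)."""
    if k <= 0 or l <= 0:
        # Assumption: raise rather than return the input unchanged.
        raise ValueError("scale factors must be greater than 0")
    return [
        [pixel for pixel in row for _ in range(l)]  # widen: each pixel l times
        for row in grid
        for _ in range(k)  # heighten: each row k times
    ]


tiny = [[1, 2],
        [3, 4]]
scaled = upsample_nested(tiny, 2, 3)
for row in scaled:
    print(row)
```

Each source pixel becomes a k-by-l block, so a scale of 4 in both directions makes the rendered frame four times taller and four times wider.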

Essentially, we just upscale the images before rendering them. Let's try it out.

In [3]:
env = gym.make('Pitfall-ram-v0')
renderEnv(env, steps=1000)
env.close()

Okay, that looks much better, and the scale can be adjusted as necessary to make the image larger. One last environment that I found, which I thought looked interesting, was the Lunar Lander environment. It has both discrete and continuous control versions -- I will consider the continuous variant.

Turns out that this also requires some extra dependencies. Specifically, I used conda to install swig and pip to install box2d-py.

In [6]:
env = gym.make('LunarLanderContinuous-v2')
env.reset()
for _ in range(1000):
    env.render()
    env.step(env.action_space.sample())
env.close()

Because we are just naively taking a random action, we don't actually end the episode when we should (i.e. when the lander crashes). This partially relates to that warning we got much earlier, and really it seems like a good time to get a bit more into the details. Up to now we have just been looking at environments with agents making random decisions, but that is not our long-term goal.

We will take the Lunar Lander Continuous environment as our example and look a bit beyond simply rendering. According to the documentation the *step()* function returns a plethora of useful information that includes environment observations and rewards. The *reset()* function gets things started by returning an initial observation, so let's start there.

In [3]:
lander = gym.make('LunarLanderContinuous-v2')
obst = lander.reset() #i.e. fruit

print(obst)

[-0.00512733  1.4206668  -0.51935774  0.43317077  0.00594808  0.11764237
  0.          0.        ]


Currently this doesn't look like much, but we can consult the environment-specific documentation to get more information. Doing so tells us that the first two values are the current (x,y) coordinates of the lander, the next two are the (x,y) velocities of the lander, then the angle, the angular speed, and finally two boolean values indicating whether the left and right legs have contacted the ground. It should be noted that some of this information had to be obtained directly from the source code.

This is the information that the agent will have available to make an action decision. It is worth pointing out that, in Pitfall for example, if the observation is the game screen, then the only observation information we would have is a matrix representing pixel values.
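
To make the eight lander values easier to read, one option is to label them. The sketch below is only an illustration: `LanderObservation` and `label_observation` are hypothetical names (not part of gym), and the field order simply follows the description above.

```python
from typing import NamedTuple


class LanderObservation(NamedTuple):
    """Hypothetical labels for the 8-value Lunar Lander observation."""
    x: float
    y: float
    vx: float
    vy: float
    angle: float
    angular_velocity: float
    left_leg_contact: bool
    right_leg_contact: bool


def label_observation(obs):
    """Attach field names to a raw 8-element observation."""
    values = [float(v) for v in obs]
    if len(values) != 8:
        raise ValueError(f"expected 8 values, got {len(values)}")
    # The last two entries are 0.0/1.0 flags, so convert them to booleans.
    return LanderObservation(*values[:6], bool(values[6]), bool(values[7]))


# The initial observation printed earlier in this notebook.
first = label_observation([-0.00512733, 1.4206668, -0.51935774, 0.43317077,
                           0.00594808, 0.11764237, 0.0, 0.0])
print(first)
```

With this in place, `first.y` or `first.left_leg_contact` can be read directly instead of indexing into the array.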

Now let's look at the information that *step()* returns, which will include an observation, but also other information.

In [4]:
o, r, done, info = lander.step(lander.action_space.sample())

print(f'Reward: {r}, Done: {done}')
lander.close()

Reward: 0.17200371560562644, Done: False


We have already examined an observation, and the info variable contains diagnostic information, so we will focus on the reward and done values shown above. The reward is the amount of reward generated by taking the previous action, which at this point is a random action. The documentation gives detailed information on how the reward is calculated. For example, crashing would be worth -100 and safely coming to rest would be worth 100. In this case there is always a reward being generated, but in other environments this may not be the case. The done value indicates whether the episode has ended, so in this case whether the lander has landed or crashed. We can see what happens when we use this information in the simulation below.

In [13]:
env = gym.make('LunarLanderContinuous-v2')
env.reset()
for t in range(1000):
    env.render()
    _, _, done, _ = env.step(env.action_space.sample())
    if done:
        print("Episode finished.")
        break
env.close()

Episode finished.
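
The same done check extends naturally to tallying the total reward of an episode. The sketch below is a rough illustration that runs without gym: `CountdownEnv` is a made-up stand-in that only mimics the `reset()` / `step()` interface used above (with the four-value return), and its reward values are invented for the example. `run_episode` is likewise a hypothetical helper.

```python
class CountdownEnv:
    """Hypothetical stand-in for a gym environment; not part of gym."""

    def __init__(self, episode_length=5):
        self.episode_length = episode_length
        self.t = 0

    def reset(self):
        self.t = 0
        return self.t  # the "observation" is just the step counter

    def step(self, action):
        self.t += 1
        done = self.t >= self.episode_length
        # Invented rewards: +1 per surviving step, -100 on the final step.
        reward = -100.0 if done else 1.0
        return self.t, reward, done, {}


def run_episode(env, policy, max_steps=1000):
    """Step until done (or max_steps); return (total reward, steps taken)."""
    obs = env.reset()
    total = 0.0
    for t in range(max_steps):
        obs, reward, done, _info = env.step(policy(obs))
        total += reward
        if done:
            return total, t + 1
    return total, max_steps


episode_return, steps = run_episode(CountdownEnv(5), policy=lambda obs: 0)
print(f"Return: {episode_return}, steps: {steps}")
```

Stopping at `done` also avoids calling `step()` on a finished episode, which is what the earlier warning was about.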


Up to now we have been taking actions by randomly sampling from the environment action space, but how is this space even defined and how can we make more informed decisions? 

According to the documentation every environment must have an associated action and observation space. In the case of actions, these are all the valid actions an agent can take and they will have some kind of effect on the environment. In the case of Lunar Lander an action is a pair of real values, each in $[-1,1]$. The first is the main engine throttle, where values in $[-1,0]$ leave the engine off and $(0,1]$ corresponds to 50% through 100% throttle -- apparently the engine cannot run below 50% power. The second is the lateral engine throttle, with $[-1,-0.5]$ firing the left engine, $[0.5,1]$ firing the right engine, and the remainder indicating no engine should be fired.

In the discrete formulation we lose these continuous values and are instead restricted to four actions: fire the main engine at full throttle, fire the left engine, fire the right engine, or fire no engine.

Let's validate what we have discussed above.
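
The continuous mapping described above can be sketched as a small decoder. This is an illustration under assumptions rather than gym's own code: `decode_action` is a hypothetical helper, the throttle is assumed to scale linearly between 50% and 100%, and the handling of the exact boundary values (0 and ±0.5) is a guess.

```python
def decode_action(action):
    """Translate a (main, lateral) pair into (main throttle, side engine)."""
    main, lateral = (float(a) for a in action)
    if not (-1.0 <= main <= 1.0 and -1.0 <= lateral <= 1.0):
        raise ValueError("both action components must lie in [-1, 1]")

    # Assumption: positive values map linearly onto 50%..100% throttle,
    # everything else leaves the main engine off.
    main_throttle = 0.5 + 0.5 * main if main > 0.0 else 0.0

    # Assumption: the thresholds themselves count as firing.
    if lateral <= -0.5:
        side = "left"
    elif lateral >= 0.5:
        side = "right"
    else:
        side = None
    return main_throttle, side


# Illustrative values, not sampled from the environment.
print(decode_action([1.0, -0.7]))   # full main engine, left engine
print(decode_action([-0.3, 0.2]))   # nothing fires
print(decode_action([0.5, 0.9]))    # 75% main engine, right engine
```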

In [19]:
env = gym.make('LunarLanderContinuous-v2')
action = env.action_space.sample()
print(f'Random Action: {action}')

Random Action: [-0.89128834 -0.42673907]


It looks like the random action we have sampled does not fire any of the engines, so the lander will just continue its current trajectory. Let's try specifying a more interesting action that will have some effect on the environment over a few steps.

In [30]:
o = env.reset()
print(f'Initial observation: {o}\n')
action = np.array([1.0, -.7])
print(f'Action: {action}\n')

for i in range(20):
    o, _, _, _ = env.step(action)
    
print(f'Final Observation: {o}')
env.close()

Initial observation: [ 0.00773897  1.4076171   0.7838566  -0.14682311 -0.00896072 -0.17755531
  0.          0.        ]

Action: [ 1.  -0.7]

Final Observation: [0.16171427 1.4709758  0.6856318  0.31994307 0.1292875  0.40468216
 0.         0.        ]


The action we have applied for 20 steps fires the main engine at 100% together with the left engine. We expect to see the x and y coordinates (first and second values in the array) increase, and this is exactly what happens. The vertical speed (the fourth value) also rises, from about -0.15 to 0.32, as further verification. The horizontal speed (the third value) actually drops slightly, from about 0.78 to 0.69, but it stays positive, which is why x keeps growing.
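
The change in each component can be checked directly from the two printed observations. The snippet below is a small sketch using the first six values copied from the output above; the field names are labels chosen here for readability, not identifiers from gym.

```python
FIELDS = ("x", "y", "vx", "vy", "angle", "angular_velocity")

# First six values of the initial and final observations printed above.
initial = [0.00773897, 1.4076171, 0.7838566, -0.14682311, -0.00896072, -0.17755531]
final = [0.16171427, 1.4709758, 0.6856318, 0.31994307, 0.1292875, 0.40468216]

# Signed change of each component over the 20 steps.
deltas = {name: f - i for name, i, f in zip(FIELDS, initial, final)}

for name, delta in deltas.items():
    print(f"{name:>16}: {delta:+.4f}")
```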

This pretty much sums up everything we need to know about gym, with more of the complicated tasks coming from modeling and training an agent.