# Lunar Lander With Deep Q-Learning

In this lab, we'll do some cool stuff!

In [1]:
import numpy as np

import io
import base64
from IPython import display

import gym
from gym import wrappers

In [2]:
# This code is to embed the output into this notebook. 
# You may prefer to use the terminal directly, which will
# open a new window when you run a gym environment instead of 
# capturing a video. 
def imbed_round_video():
    video = io.open('./gym-videos/openaigym.video.%s.video000000.mp4' % env.file_infix, 'r+b').read()
    encoded = base64.b64encode(video)
    return display.HTML(data='''
        <video width="360" height="auto" alt="test" controls><source src="data:video/mp4;base64,{0}" type="video/mp4" /></video>'''
    .format(encoded.decode('ascii')))

In [3]:
# First, we can just make an environment from Gym
# And have the agent make a random action every time
original_env = gym.make('LunarLander-v2')

# The wrapper allows us to take a video so we can display it
# in the Jupyter notebook. 
env = wrappers.Monitor(original_env, "gym-videos/", force=True)
env.reset()

for _ in range(1000):
    # Randomly take an action
    action = env.action_space.sample()
    observation, reward, done, info = env.step(action)

    if done: break
        
# You're always supposed to close an environment when you're
# done with it in Gym. 
env.close()
original_env.close()

imbed_round_video()

In [4]:
# Okay, so that's what taking random actions looks like.
# Lets take a closer look at the information Gym gives us
original_env = gym.make('LunarLander-v2')
original_env.reset()

print("Actions: " , original_env.action_space)

observation, reward, done, info = env.step(env.action_space.sample())
print("Observation: ", observation)
print("Reward: ", reward)
print("done: ", done)
print("Info: ", info)

original_env.close()

Actions:  Discrete(4)
Observation:  [ 0.38900527 -0.07480966  0.5380346  -0.16801898 -0.07077409 -0.67004156
  0.          1.        ]
Reward:  -100
done:  True
Info:  {}


Okay, that's not completely enlightening. Here's what the documentation has to say about this environment:
    
"Landing pad is always at coordinates (0,0). Coordinates are the first two numbers in state vector. Reward for moving from the top of the screen to landing pad and zero speed is about 100..140 points. If lander moves away from landing pad it loses reward back. Episode finishes if the lander crashes or comes to rest, receiving additional -100 or +100 points. Each leg ground contact is +10. Firing main engine is -0.3 points each frame. Solved is 200 points. Landing outside landing pad is possible. Fuel is infinite, so an agent can learn to fly and then land on its first attempt. Four discrete actions available: do nothing, fire left orientation engine, fire main engine, fire right orientation engine."

So, actions:

```
0: Do nothing  
1: Fire left engine  
2: Fire main engine  
3: Fire right engine  
```

And we can only take one of these actions per frame. Lets gut check:

In [6]:
# First, we can just make an environment from Gym
# And have the agent make a random action every time
original_env = gym.make('LunarLander-v2')

# The wrapper allows us to take a video so we can display it
# in the Jupyter notebook. 
env = wrappers.Monitor(original_env, "gym-videos/", force=True)

env.reset()
for _ in range(1000):
    # We should just fall straight down, never use the engine
    # Or we can change this to take the other actions...
    action = 0
    observation, reward, done, info = env.step(action)

    if done: break
        
# You're always supposed to close an environment when you're
# done with it in Gym. 
env.close()
original_env.close()

imbed_round_video()

In [7]:
# Okay great, looks like we have a good idea about the action space.
# But what about the "observation"? Lets get the first three observations:
original_env = gym.make('LunarLander-v2')
original_env.reset()

print("Actions: " , original_env.action_space)

observation, reward, done, info = env.step(0)
print("Observation: \n", observation)

observation, reward, done, info = env.step(0)
print("Observation: \n", observation)

observation, reward, done, info = env.step(0)
print("Observation: \n", observation)

original_env.close()

Actions:  Discrete(4)
Observation: 
 [-3.4008986e-01 -7.3655307e-02 -6.0416913e-01 -6.0267959e-02
  1.4877194e-01 -7.2481363e-08  1.0000000e+00  0.0000000e+00]
Observation: 
 [-0.34660116 -0.07454019 -0.6466638  -0.04569064  0.14758897  0.09853733
  1.          0.        ]
Observation: 
 [-0.35276383 -0.07385215 -0.55434597  0.09588828  0.13967209 -0.41137037
  1.          1.        ]


Unfortunately, a lot of the Gym environments are not well documented. I had to dig through the [source code](https://github.com/openai/gym/blob/master/gym/envs/box2d/lunar_lander.py) for this line to figure out what the observation space was:

```python
 state = [
            (pos.x - VIEWPORT_W/SCALE/2) / (VIEWPORT_W/SCALE/2),
            (pos.y - (self.helipad_y+LEG_DOWN/SCALE)) / (VIEWPORT_H/SCALE/2),
            vel.x*(VIEWPORT_W/SCALE/2)/FPS,
            vel.y*(VIEWPORT_H/SCALE/2)/FPS,
            self.lander.angle,
            20.0*self.lander.angularVelocity/FPS,
            1.0 if self.legs[0].ground_contact else 0.0,
            1.0 if self.legs[1].ground_contact else 0.0
]
```

So, the first two values are the position of the lander (x, y). The next two values are the x,y velocity. After that the current angle of the lander, then the angular velocity. Finally, the last two values indicate whether or not the landers left and right legs are touching the ground. 