## Deep Reinforcement Learning in Gym's Lunar Lander Environment

It took some experimentation, but these are the steps to record video output in LunarLander-v2 from an Anaconda environment now that the `Monitor` wrapper has been removed:

1. Install the required packages with the conda package manager, e.g. `conda install -c conda-forge moviepy`
2. Create environment with argument `render_mode='rgb_array'`
3. Use new wrapper `gym.wrappers.record_video.RecordVideo`
4. Wrap like so `RecordVideo(env, "Recordings")`

Let's initialize a LunarLander-v2 environment, take random actions in it, and then view a recording of the episode.

In [80]:
import gym
from gym.wrappers.record_video import RecordVideo

env = gym.make('LunarLander-v2', render_mode='rgb_array')
env = RecordVideo(env, "Recordings", name_prefix="random-movements")
env.reset(seed=42)

terminated, truncated = False, False
while not (terminated or truncated):  # stop on either end-of-episode signal
    action = env.action_space.sample()  # Take a random action
    _, _, terminated, truncated, _ = env.step(action)

env.close()

Moviepy - Building video /Users/alex/Documents/Programming/Jupyter Notebooks/Deep RL Lunar Lander/Recordings/random-movements-episode-0.mp4.
Moviepy - Writing video /Users/alex/Documents/Programming/Jupyter Notebooks/Deep RL Lunar Lander/Recordings/random-movements-episode-0.mp4



                                                              

Moviepy - Done !
Moviepy - video ready /Users/alex/Documents/Programming/Jupyter Notebooks/Deep RL Lunar Lander/Recordings/random-movements-episode-0.mp4




## General Information
This information is from the official Gym documentation.

https://www.gymlibrary.dev/environments/box2d/lunar_lander/

| Feature Category  | Details                                |
|-------------------|----------------------------------------|
| Action Space      | Discrete(4)                            |
| Observation Shape | (8,)                                   |
| Observation High  | [1.5 1.5 5. 5. 3.14 5. 1. 1. ]         |
| Observation Low   | [-1.5 -1.5 -5. -5. -3.14 -5. -0. -0. ] |
| Import            | `gym.make("LunarLander-v2")`           |

## Description of Environment

This environment is a classic rocket trajectory optimization problem. According to Pontryagin’s maximum principle, it is optimal to fire the engine at full throttle or turn it off. This is the reason why this environment has discrete actions: engine on or off.

There are two environment versions: discrete or continuous. The landing pad is always at coordinates `(0,0)`. The coordinates are the first two numbers in the state vector. Landing outside of the landing pad is possible. Fuel is infinite, so an agent could learn to fly and then land on its first attempt.

## Action Space
There are four discrete actions available: do nothing, fire left orientation engine, fire main engine, fire right orientation engine.

| Action  | Result                          |
|---------|---------------------------------|
| 0       | Do nothing                      |
| 1       | Fire left orientation engine    |
| 2       | Fire main engine                |
| 3       | Fire right orientation engine   |
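Agents are easier to read when the four action indices have names. The block below is a minimal sketch that assumes the mapping in the table above; the `Action` enum is a hypothetical helper and is not part of Gym.

```python
from enum import IntEnum


class Action(IntEnum):
    """Hypothetical names for LunarLander-v2's four discrete actions."""
    DO_NOTHING = 0
    FIRE_LEFT = 1
    FIRE_MAIN = 2
    FIRE_RIGHT = 3


# IntEnum members compare equal to plain ints.
print(Action.FIRE_MAIN == 2)
print([a.name for a in Action])
```

If an API insists on a plain integer, `int(Action.FIRE_MAIN)` converts the member explicitly.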

## Observation Space
The state is an 8-dimensional vector: the coordinates of the lander in `x` & `y`, its linear velocities in `x` & `y`, its angle, its angular velocity, and two booleans that represent whether each leg is in contact with the ground or not.

| Observation  | Value                                   |
|--------------|-----------------------------------------|
| 0            | `x` coordinate (float)                  |
| 1            | `y` coordinate (float)                  |
| 2            | `x` linear velocity (float)             |
| 3            | `y` linear velocity (float)             |
| 4            | Angle in radians from -π to +π (float)  |
| 5            | Angular velocity (float)                |
| 6            | Left leg contact (bool)                 |
| 7            | Right leg contact (bool)                |
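Indexing the observation by position (`state[1]`, `state[4]`) is easy to get wrong. One option is to unpack it into named fields. This is only a sketch: `LanderState` and `parse_observation` are hypothetical names, and the sample values are made up.

```python
from typing import NamedTuple, Sequence


class LanderState(NamedTuple):
    """Hypothetical named view of the 8-dimensional observation."""
    x: float
    y: float
    vx: float
    vy: float
    angle: float
    angular_velocity: float
    left_leg_contact: bool
    right_leg_contact: bool


def parse_observation(obs: Sequence[float]) -> LanderState:
    """Unpack a raw observation; leg contacts arrive as 0.0/1.0 floats."""
    x, y, vx, vy, angle, ang_vel, left, right = (float(v) for v in obs)
    return LanderState(x, y, vx, vy, angle, ang_vel, bool(left), bool(right))


# Made-up observation values, for illustration only.
state = parse_observation([0.0, 1.4, 0.1, -0.5, 0.02, 0.0, 0.0, 1.0])
print(state.y, state.right_leg_contact)
```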

## Rewards
Reward for moving from the top of the screen to the landing pad and coming to rest is about 100-140 points. If the lander moves away from the landing pad, it loses reward. If the lander crashes, it receives an additional -100 points. If it comes to rest, it receives an additional +100 points. Each leg with ground contact is +10 points. Firing the main engine is -0.3 points each frame. Firing the side engine is -0.03 points each frame. Solved is 200 points.
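To get a feel for how the fixed-value terms add up, the sketch below sums only the numbers quoted above. It deliberately leaves out the distance-based part of the reward, so it does not reproduce the environment's actual reward; the function name and its arguments are hypothetical.

```python
MAIN_ENGINE_COST = 0.3    # points lost per frame the main engine fires
SIDE_ENGINE_COST = 0.03   # points lost per frame a side engine fires
CRASH_PENALTY = -100.0
REST_BONUS = 100.0
LEG_CONTACT_BONUS = 10.0


def fixed_reward_terms(main_frames, side_frames, legs_down, outcome):
    """Sum only the fixed-value reward terms listed in the text.

    outcome is "rest", "crash", or None (episode still running).
    The distance-based term is deliberately left out.
    """
    total = -MAIN_ENGINE_COST * main_frames - SIDE_ENGINE_COST * side_frames
    total += LEG_CONTACT_BONUS * legs_down
    if outcome == "rest":
        total += REST_BONUS
    elif outcome == "crash":
        total += CRASH_PENALTY
    return total


# Hypothetical episode: 100 main-engine frames, 50 side-engine frames,
# both legs down, and the lander comes to rest.
print(round(fixed_reward_terms(100, 50, 2, "rest"), 2))
```

For this made-up episode the terms are -30 for the main engine, -1.5 for the side engines, +20 for the legs, and +100 for coming to rest, which totals 88.5.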

## Starting State
The lander starts at the top center of the viewport with a random initial force applied to its center of mass.

## Episode Termination
The episode finishes if:

1. The lander crashes (the lander body gets in contact with the moon);

2. The lander gets outside of the viewport (`x` coordinate is greater than 1);

3. The lander is not awake. From the Box2D docs, a body which is not awake is a body which doesn’t move and doesn’t collide with any other body.
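The three conditions above can be sketched as a single predicate. The `crashed` and `awake` flags are hypothetical stand-ins for the physics engine's internal state, and the sketch assumes the viewport is symmetric, so that the bound applies to the absolute value of `x` and is inclusive. Treat that as an assumption rather than a statement about the environment's source.

```python
def episode_terminated(crashed: bool, x: float, awake: bool) -> bool:
    """Return True if any of the three termination conditions holds.

    crashed and awake are hypothetical flags standing in for the physics
    engine's internal state; x is the first entry of the observation.
    Assumption: the viewport is symmetric, so the bound applies to |x|.
    """
    out_of_view = abs(x) >= 1.0
    return crashed or out_of_view or not awake


print(episode_terminated(crashed=False, x=0.2, awake=True))
print(episode_terminated(crashed=False, x=-1.3, awake=True))
```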

---

## The Safe Agent
We're going to implement a simple agent, 'The Safe Agent', which thrusts upward if and only if the lander's `y` position is less than 1.

In theory this agent shouldn't hit the ground as we have unlimited fuel, but let's see.

In [81]:
class SafeAgent:
    def __init__(self):
        self.total_reward = 0
        self.n_steps = 0
        self.done = False

    def reset(self, env):
        self.total_reward = 0
        self.n_steps = 0
        self.done = False

        # New API format returns observations and info (which we don't need)
        self.state, _ = env.reset(seed=42)

    def act(self, state):
        MIN_HEIGHT = 1

        if state[1] < MIN_HEIGHT:
            return 2
        else:
            return 0
        
    def step(self, env):
        action = self.act(self.state)
        next_obs, reward, terminated, truncated, info = env.step(action)
        
        self.total_reward += reward
        self.n_steps += 1  # count steps taken this episode
        self.state = next_obs
        self.done = terminated or truncated
        

env = gym.make('LunarLander-v2', render_mode='rgb_array')
env = RecordVideo(env, "Recordings", name_prefix="safe-agent")
agent = SafeAgent()

def play_episode(env, agent):
    agent.reset(env=env)

    while not agent.done:
        agent.step(env)

play_episode(env, agent)

Moviepy - Building video /Users/alex/Documents/Programming/Jupyter Notebooks/Deep RL Lunar Lander/Recordings/safe-agent-episode-0.mp4.
Moviepy - Writing video /Users/alex/Documents/Programming/Jupyter Notebooks/Deep RL Lunar Lander/Recordings/safe-agent-episode-0.mp4



                                                               

Moviepy - Done !
Moviepy - video ready /Users/alex/Documents/Programming/Jupyter Notebooks/Deep RL Lunar Lander/Recordings/safe-agent-episode-0.mp4




## The Stable Agent
Let's try to define an agent that can remain stable in the air.

It will operate via the following rules:

1. If below height of 1: action = 2 (main engine)
2. If angle is above π/50: action = 3 (fire right engine)
3. If angle is below -π/50: action = 1 (fire left engine)
4. If x distance is above 0.4: action = 1 (fire left engine)
5. If x distance is below -0.4: action = 3 (fire right engine)
6. If below height of 1.5: action = 2 (main engine)
7. Else: action = 0 (do nothing)

The idea is that the lander always fires its main engine if it falls below a certain height. After that, it prioritizes stabilizing its angle, then its horizontal distance, and finally keeping itself above a second, higher height.

Let's see how this approach does:

In [83]:
class StableAgent:
    def __init__(self):
        self.total_reward = 0
        self.n_steps = 0
        self.done = False

    def reset(self, env):
        self.total_reward = 0
        self.n_steps = 0
        self.done = False

        # New API format returns observations and info (which we don't need)
        self.state, _ = env.reset(seed=42)

    def act(self, state):
        UPPER_MIN_Y = 1.5
        LOWER_MIN_Y = 1
        MIN_X = -0.4
        MAX_X = 0.4
        MIN_ANGLE = -3.14/50
        MAX_ANGLE = 3.14/50

        x = state[0]
        y = state[1]

        angle = state[4]

        MAIN_ENGINE = 2
        LEFT_ENGINE = 1
        RIGHT_ENGINE = 3
        DO_NOTHING = 0

        if y < LOWER_MIN_Y:
            return MAIN_ENGINE

        elif angle > MAX_ANGLE:
            return RIGHT_ENGINE
        elif angle < MIN_ANGLE:
            return LEFT_ENGINE
        
        elif x > MAX_X:
            return LEFT_ENGINE
        elif x < MIN_X:
            return RIGHT_ENGINE
        
        elif y < UPPER_MIN_Y:
            return MAIN_ENGINE
        
        else:
            return DO_NOTHING
        

        
    def step(self, env):
        action = self.act(self.state)
        next_obs, reward, terminated, truncated, info = env.step(action)
        
        self.total_reward += reward
        self.n_steps += 1  # count steps taken this episode
        self.state = next_obs
        self.done = terminated or truncated
        

env = gym.make('LunarLander-v2', render_mode='rgb_array')
env = RecordVideo(env, "Recordings", name_prefix="stable-agent")
agent = StableAgent()

def play_episode(env, agent):
    agent.reset(env=env)

    while not agent.done:
        agent.step(env)

play_episode(env, agent)

Moviepy - Building video /Users/alex/Documents/Programming/Jupyter Notebooks/Deep RL Lunar Lander/Recordings/stable-agent-episode-0.mp4.
Moviepy - Writing video /Users/alex/Documents/Programming/Jupyter Notebooks/Deep RL Lunar Lander/Recordings/stable-agent-episode-0.mp4



                                                               

Moviepy - Done !
Moviepy - video ready /Users/alex/Documents/Programming/Jupyter Notebooks/Deep RL Lunar Lander/Recordings/stable-agent-episode-0.mp4




#### Observations:
- Crafting a straightforward set of rules to guide the lunar lander is more challenging than anticipated.
- Our initial efforts achieved some stability, but eventually, the lander lost control.

## Deep Reinforcement Learning
To address this challenge, we'll use deep reinforcement learning techniques to train an agent to land the spacecraft.

Simpler tabular methods are limited to discrete observation spaces, where the number of possible states is finite. In `LunarLander-v2`, however, six of the eight observation values are continuous, so the number of possible states is effectively infinite. We could try to bin similar values into groups, but the controls are sensitive enough that even small binning errors can lead to significant missteps.
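A rough count shows how quickly a binned Q-table grows. The sketch below assumes the six continuous observation values are each split into the same number of bins, the two leg-contact flags stay binary, and there are four actions. The function name and the bin counts are illustrative only.

```python
def q_table_size(bins_per_dim: int, n_continuous: int = 6,
                 n_boolean: int = 2, n_actions: int = 4) -> int:
    """Count Q-table entries if each continuous value is split into bins."""
    n_states = (bins_per_dim ** n_continuous) * (2 ** n_boolean)
    return n_states * n_actions


for bins in (5, 10, 20):
    print(bins, q_table_size(bins))
```

Even a coarse 10 bins per value gives 16 million entries, and doubling the resolution to 20 bins pushes that past one billion.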

To get around this, we'll use a neural network as a Q-function approximator. This lets us predict the best action to take in a given state, even when dealing with a vast number of potential states. It's a much better match for our complex landing challenge.
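To make the idea concrete, here is a minimal sketch of the forward pass only: an 8-dimensional state goes in, one estimated Q-value per action comes out, and the greedy action is the index of the largest value. The network is untrained, written in pure Python for portability, and every detail is an assumption for illustration (one hidden layer of 16 units, ReLU, the weight range). A real agent would use a deep learning library and a training loop.

```python
import random


def init_layer(n_in, n_out, rng):
    """Random weights and zero biases for one fully connected layer."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)]
               for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases


def dense(inputs, layer, relu):
    """Apply one layer: weights @ inputs + biases, optionally with ReLU."""
    weights, biases = layer
    out = [sum(w * x for w, x in zip(row, inputs)) + b
           for row, b in zip(weights, biases)]
    return [max(0.0, v) for v in out] if relu else out


def q_values(state, layers):
    """Forward pass: 8-dim state in, one estimated Q-value per action out."""
    hidden = dense(state, layers[0], relu=True)
    return dense(hidden, layers[1], relu=False)


def greedy_action(state, layers):
    """Pick the action whose estimated Q-value is largest."""
    q = q_values(state, layers)
    return max(range(len(q)), key=q.__getitem__)


rng = random.Random(0)
# Hidden size 16 is an arbitrary choice for this sketch.
layers = [init_layer(8, 16, rng), init_layer(16, 4, rng)]
state = [0.0, 1.4, 0.1, -0.5, 0.02, 0.0, 0.0, 0.0]  # made-up observation
q = q_values(state, layers)
action = greedy_action(state, layers)
print(len(q), action)
```

Because the network maps any real-valued state to four numbers, nearby states share parameters instead of each needing its own table row, which is what lets it generalize across a continuous state space.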