# COMP47590 - Advanced Machine Learning 

## DQN for Lunar Lander
Uses a Deep Q Network (DQN) to train a neural-network-based player for the Lunar Lander environment from Gymnasium (https://gymnasium.farama.org/environments/box2d/lunar_lander/). The environment uses a vector-based state representation.

![Lunar Lander](lunar_lander.gif)

There are four **actions** in this environment:
- none (0)
- left engine (1)
- main engine (2)
- right engine (3)

**Reward** is awarded after each frame as follows:
- crash: -100 
- land: +100 
- leg ground contact: +10
- firing main engine: -0.3
- landing between flags: +200

The **state** is represented using 8 values:
- the position of the spaceship (x and y coordinates)
- the velocity of the spaceship (in the x and y directions)
- the angle of the spaceship
- the angular velocity of the spaceship
- leg contact with the ground (left and right)
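
To make the 8-value state easier to read while debugging, the observation can be unpacked into named fields. The sketch below is illustrative only: the `LanderState` class and `unpack_observation` helper are hypothetical names, the example observation is made up, and the field order (position, velocity, angle, angular velocity, leg contacts) is assumed from the Gymnasium documentation for this environment.

```python
from typing import NamedTuple, Sequence


class LanderState(NamedTuple):
    """Named view of the 8-value observation (field names are illustrative)."""
    x: float
    y: float
    vx: float
    vy: float
    angle: float
    angular_velocity: float
    left_leg_contact: bool
    right_leg_contact: bool


def unpack_observation(obs: Sequence[float]) -> LanderState:
    """Convert a raw 8-value observation into a LanderState."""
    if len(obs) != 8:
        raise ValueError(f"expected 8 values, got {len(obs)}")
    x, y, vx, vy, angle, angular_velocity, left, right = (float(v) for v in obs)
    # Leg contact is reported as 0.0 or 1.0, so threshold it into a bool
    return LanderState(x, y, vx, vy, angle, angular_velocity,
                       left > 0.5, right > 0.5)


# Made-up observation, for illustration only
state = unpack_observation([0.0, 1.4, 0.1, -0.5, 0.02, 0.0, 0.0, 1.0])
print(state.y, state.right_leg_contact)
```

The same helper should work on the `obs` array returned by `reset()` or `step()`, since it only relies on the observation having 8 numeric entries.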


### Initialisation - Google Colab

If using Google Colab you need to install some packages: uncomment and run the lines below.

In [73]:
#!apt install swig cmake 
#!apt-get install -y xvfb x11-utils
#!apt-get install -y python-opengl ffmpeg > /dev/null 2>&1
#!python -m pip install 'git+https://github.com/DLR-RM/stable-baselines3@feat/gymnasium-support#egg=stable-baselines3[extra]' 
#!pip install pyglet box2d box2d-kengz
#!pip install pyvirtualdisplay PyOpenGL PyOpenGL-accelerate
#!pip install -U colabgymrender

For Google Colab, uncomment this cell to create a virtual rendering canvas so that render calls work (we still won't see a display!).

In [74]:
#import pyvirtualdisplay
#
#_display = pyvirtualdisplay.Display(visible=False,  # use False with Xvfb
#                                    size=(1400, 900))
#_ = _display.start()

In [75]:
#from colabgymrender.recorder import Recorder
#env = Recorder(env,"./video")
#
#obs = env.reset()
#done = False
#while not done:
#    action = env.action_space.sample()
#    obs, reward, done, info = env.step(action)
#
#env.play()

### Initialisation - Windows
To set up on a Windows machine you need the following prerequisites:
* git to be installed (https://git-scm.com/download/win) and be on the path
* Visual Studio Developer Tools to be installed (https://visualstudio.microsoft.com/visual-cpp-build-tools/)

Uncomment and run the following:

In [76]:
#!pip install tqdm rich
#!pip install torch
#!pip install gymnasium
#!python -m pip install 'git+https://github.com/DLR-RM/stable-baselines3@feat/gymnasium-support#egg=stable-baselines3[extra]'
#!pip install pygame
#!pip install pyglet

### Initialisation - Mac
Uncomment and run the following:

In [77]:
# !pip install tqdm rich
# !pip install ipywidgets
# !pip install torch
# !pip install gymnasium
# !pip install stable-baselines3
# !pip install pygame
# !pip install pyglet
# !conda install swig -y # needed to build Box2D in the pip install

In [78]:
# !xcode-select --install

In [79]:
# !pip install box2d-py # a repackaged version of pybox2d

### Import Packages

Import required packages. 

In [80]:
import gymnasium as gym
import stable_baselines3 as sb3

### Create the Environment

Create the Lunar Lander Environment

In [81]:
env_render = gym.make('LunarLander-v2', 
               render_mode = 'human')

Add a time limit wrapper to avoid infinite hovering (remember the spaceship never runs out of fuel!)

In [82]:
env_render = gym.wrappers.TimeLimit(env_render, 
                                    max_episode_steps = 3000)
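
To illustrate what a time limit does, the following toy sketch caps an episode that would otherwise never end. It is not Gymnasium's `TimeLimit` implementation: `HoverForever` and `StepLimit` are hypothetical classes that only mimic the five-value `step` return used in this notebook, where `truncated` signals that the step limit was reached rather than that the lander crashed or landed.

```python
class HoverForever:
    """Toy stand-in for an environment whose episodes never end by themselves."""

    def reset(self):
        return 0.0, {}

    def step(self, action):
        # observation, reward, terminated, truncated, info
        return 0.0, 0.0, False, False, {}


class StepLimit:
    """Minimal time-limit wrapper: sets `truncated` after max_episode_steps."""

    def __init__(self, env, max_episode_steps):
        self.env = env
        self.max_episode_steps = max_episode_steps
        self.elapsed = 0

    def reset(self):
        self.elapsed = 0
        return self.env.reset()

    def step(self, action):
        obs, reward, terminated, truncated, info = self.env.step(action)
        self.elapsed += 1
        if self.elapsed >= self.max_episode_steps:
            truncated = True
        return obs, reward, terminated, truncated, info


toy_env = StepLimit(HoverForever(), max_episode_steps=5)
toy_obs, _ = toy_env.reset()
toy_terminated = toy_truncated = False
toy_steps = 0
while not (toy_terminated or toy_truncated):
    toy_obs, toy_reward, toy_terminated, toy_truncated, _ = toy_env.step(0)
    toy_steps += 1
print(toy_steps, toy_terminated, toy_truncated)
```

This is why the episode loops in this notebook test both flags: checking only the termination flag would loop forever on an agent that learns to hover.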

Explore the Lunar Lander environment

In [83]:
env_render.action_space

Discrete(4)

In [84]:
env_render.observation_space

Box([-90.        -90.         -5.         -5.         -3.1415927  -5.
  -0.         -0.       ], [90.        90.         5.         5.         3.1415927  5.
  1.         1.       ], (8,), float32)

In [85]:
env_render.reset()
env_render.render()

Play an episode of the Lunar Lander environment using random actions

In [86]:
obs, _ = env_render.reset()

terminate = False
truncate = False

while not (terminate or truncate):
    
    action = env_render.action_space.sample()
    obs, reward, terminate, truncate, info = env_render.step(action)
    
    env_render.render()

Complete an episode of the Lunar Lander environment using random actions, recording the actions taken and the rewards received.

In [87]:
cumulative_reward = 0
actions = []
action_map = {0:'none', 
            1:'left engine',
            2:'main engine',
            3:'right engine'}

In [88]:
obs, _ = env_render.reset()

terminate = False
truncate = False
while not (terminate or truncate):
    
    action = env_render.action_space.sample()
    obs, reward, terminate, truncate, info = env_render.step(action)
    print(reward)
    
    # Record reward and action
    cumulative_reward = cumulative_reward + reward
    actions.append(action)
    
    env_render.render()

-0.5788542607556895
-2.4488270692613683
-2.25583937776031
-3.230851275361515
-2.256738861297491
2.2605112261959563
-3.0373441199956788
-2.003048836411607
2.5539433404101546
-1.6484483536123935
-1.819216733804011
-2.5305333347310452
-2.183158218676084
-2.7679098495373453
-2.3041821150175963
-2.723861046043082
-2.3677161770833095
-2.81528874664471
3.1282038383383624
-2.582712679385452
-3.0250226408456045
-3.178131377972447
3.050467520654178
1.0966834168944615
-3.6176206125942643
-2.3616791528446854
-2.2699469411498954
-2.530481685384501
-2.480865455626656
-2.4296459780397015
-3.1162841228114346
-2.559213984493539
1.0302307006622697
-2.1220612238268175
-2.3199071665100632
-2.7928863752097173
-1.7283027191281792
1.6119041212866534
-2.6633455798106227
1.6446961770311759
-1.641735723558013
1.7895555904297964
-2.7945675818376103
-1.3880165098927637
-2.586947948604744
1.2893001964395807
-1.2064143684697786
-2.3183005902708644
-2.551819001308472
-2.744788076379136
-1.3859275807471068
-1.0512185

Print actions, rewards and cumulative reward.

In [89]:
print("Actions: ", ', '.join([action_map[a] for a in actions]))
print("Cumulative Reward: {}".format(cumulative_reward))

Actions:  right engine, right engine, none, right engine, left engine, main engine, right engine, left engine, main engine, left engine, left engine, right engine, none, right engine, none, right engine, none, right engine, main engine, none, right engine, right engine, main engine, main engine, right engine, left engine, left engine, none, none, none, right engine, none, main engine, left engine, none, right engine, left engine, main engine, right engine, main engine, left engine, main engine, right engine, left engine, right engine, main engine, left engine, right engine, right engine, right engine, left engine, left engine, none, none, right engine, left engine, none, main engine, none, left engine, right engine, none, main engine, main engine, main engine, left engine, none, left engine, main engine, right engine, main engine, right engine, none, left engine, left engine, left engine, left engine, left engine, left engine, left engine, none, none, left engine, main engine, none, ma

### Create and Train an Agent

Create an environment without rendering.

In [90]:
env_train = gym.make('LunarLander-v2')

Create a simple DQN agent using stable-baselines3. LunarLander uses a state vector representation so a simple MLP can drive this model.

In [91]:
agent = sb3.DQN('MlpPolicy', 
                env_train, 
                verbose=1)

Using cpu device
Wrapping the env with a `Monitor` wrapper
Wrapping the env in a DummyVecEnv.
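
For intuition about what the network is trained to predict, a DQN regresses its Q-value estimates towards a one-step temporal-difference target: the immediate reward plus the discounted value of the best action in the next state, with no bootstrapping once the episode has terminated. The sketch below shows that calculation in plain Python. It is not the stable-baselines3 implementation; `td_target` is a hypothetical helper, the Q-values are made up, and the discount factor of 0.99 is an assumption.

```python
def td_target(reward, next_q_values, terminated, gamma=0.99):
    """One-step TD target: r + gamma * max_a' Q(s', a'), or just r at termination."""
    if terminated:
        # No future reward is possible after a crash or landing
        return reward
    return reward + gamma * max(next_q_values)


# Made-up Q-value estimates for the next state, one per action (none, left, main, right)
next_q = [1.0, 2.5, 0.3, -1.2]

target_mid_episode = td_target(-0.3, next_q, terminated=False)
target_on_crash = td_target(-100.0, next_q, terminated=True)
print(target_mid_episode, target_on_crash)
```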


Train the agent for a fixed number of time steps (5,000 here, which is only enough for a short demonstration).

In [92]:
agent.learn(total_timesteps=5000)

----------------------------------
| rollout/            |          |
|    ep_len_mean      | 97       |
|    ep_rew_mean      | -501     |
|    exploration_rate | 0.263    |
| time/               |          |
|    episodes         | 4        |
|    fps              | 5380     |
|    time_elapsed     | 0        |
|    total_timesteps  | 388      |
| train/              |          |
|    learning_rate    | 0.0001   |
|    loss             | 3.73     |
|    n_updates        | 71       |
----------------------------------
----------------------------------
| rollout/            |          |
|    ep_len_mean      | 86.8     |
|    ep_rew_mean      | -434     |
|    exploration_rate | 0.05     |
| time/               |          |
|    episodes         | 8        |
|    fps              | 4963     |
|    time_elapsed     | 0        |
|    total_timesteps  | 694      |
| train/              |          |
|    learning_rate    | 0.0001   |
|    loss             | 3.08     |
|    n_updates      

<stable_baselines3.dqn.dqn.DQN at 0x33dfff020>
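
The `exploration_rate` values in the training log are consistent with a linear epsilon schedule that decays from 1.0 to 0.05 over the first 10% of training and then stays constant. The sketch below reproduces that calculation under those assumptions; `linear_epsilon` is a hypothetical helper written for illustration, not a stable-baselines3 function, and the three parameter defaults are assumed rather than read from the agent.

```python
def linear_epsilon(step, total_timesteps,
                   initial_eps=1.0, final_eps=0.05, exploration_fraction=0.1):
    """Linearly anneal epsilon over the first `exploration_fraction` of training."""
    progress = step / total_timesteps
    if progress >= exploration_fraction:
        return final_eps
    return initial_eps + (progress / exploration_fraction) * (final_eps - initial_eps)


# Time steps taken from the training log, with total_timesteps=5000
for step in (0, 388, 694):
    print(step, round(linear_epsilon(step, 5000), 3))
```

With these assumed settings the helper gives roughly 0.263 at step 388 and 0.05 at step 694, matching the two log entries shown.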

### Evaluation

Evaluate the agent in the environment

In [93]:
mean_reward, std_reward = sb3.common.evaluation.evaluate_policy(agent, 
                                                                env_render,
                                                                render = True,
                                                                n_eval_episodes=10)
print("Mean Reward: {} +/- {}".format(mean_reward, std_reward))



Mean Reward: -140.84506327623967 +/- 17.65738118891108
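
The two numbers reported are the mean and standard deviation of the total reward per evaluation episode. As a minimal sketch of that summary, assuming a population standard deviation and using made-up episode returns (`summarise_returns` is a hypothetical helper, not part of stable-baselines3):

```python
import statistics


def summarise_returns(episode_returns):
    """Return (mean, population standard deviation) of per-episode total rewards."""
    mean_return = statistics.fmean(episode_returns)
    std_return = statistics.pstdev(episode_returns)
    return mean_return, std_return


# Made-up total rewards for four evaluation episodes
example_returns = [-120.0, -150.0, -160.0, -130.0]
example_mean, example_std = summarise_returns(example_returns)
print("Mean Reward: {} +/- {}".format(example_mean, example_std))
```

A large standard deviation relative to the mean suggests the agent behaves inconsistently across episodes, so it is worth reading both numbers together.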


### Deployment

We can save an agent easily in SB3.

In [94]:
agent.save("./dqn_lunar_lander_agent")

We can easily load an agent. 

In [95]:
agent = sb3.dqn.DQN.load("./dqn_lunar_lander_agent")

Deploy the agent into the environment

In [96]:
obs, _ = env_render.reset()

terminate = False
truncate = False
while not (terminate or truncate):

    action, _ = agent.predict(obs, deterministic = True)
    
    obs, reward, terminate, truncate, _ = env_render.step(action)
    
    env_render.render()

We can also continue training a loaded agent. First set the environment (this is not saved with the agent).

In [97]:
agent.set_env(env_train)

Wrapping the env with a `Monitor` wrapper
Wrapping the env in a DummyVecEnv.


Now continue training the agent.

In [98]:
agent.learn(total_timesteps = 10000, 
            reset_num_timesteps = False)

----------------------------------
| rollout/            |          |
|    ep_len_mean      | 80.1     |
|    ep_rew_mean      | -230     |
|    exploration_rate | 0.05     |
| time/               |          |
|    episodes         | 64       |
|    fps              | 3965     |
|    time_elapsed     | 0        |
|    total_timesteps  | 5132     |
| train/              |          |
|    learning_rate    | 0.0001   |
|    loss             | 0.188    |
|    n_updates        | 1257     |
----------------------------------
----------------------------------
| rollout/            |          |
|    ep_len_mean      | 79.9     |
|    ep_rew_mean      | -226     |
|    exploration_rate | 0.05     |
| time/               |          |
|    episodes         | 68       |
|    fps              | 4376     |
|    time_elapsed     | 0        |
|    total_timesteps  | 5436     |
| train/              |          |
|    learning_rate    | 0.0001   |
|    loss             | 0.725    |
|    n_updates      

<stable_baselines3.dqn.dqn.DQN at 0x105762bd0>

Evaluate the performance of the trained agent.

In [101]:
mean_reward, std_reward = sb3.common.evaluation.evaluate_policy(agent, 
                                                                env_render, 
                                                                n_eval_episodes=10,
                                                                render = True)
print("Mean Reward: {} +/- {}".format(mean_reward, std_reward))

KeyboardInterrupt: 

View the agent in action!

In [28]:
obs, _ = env_render.reset()

terminate = False
truncate = False
while not (terminate or truncate):
    
    # Use the trained agent (rather than random actions) to choose each action
    action, _ = agent.predict(obs, deterministic = True)
    obs, reward, terminate, truncate, _ = env_render.step(action)

    env_render.render()