# COMP47590 - Advanced Machine Learning 

## DQN for Lunar Lander - Custom Value Network
This notebook uses a Deep Q-Network (DQN) to train a neural-network-based player for the Lunar Lander environment from Gymnasium (https://gymnasium.farama.org/environments/box2d/lunar_lander/). It uses a vector-based state representation and a custom value network architecture.

### Initialisation

If you are using Google Colab you need to install some packages first: uncomment the lines below.

In [1]:
#!apt install swig cmake ffmpeg
#!apt-get install -y xvfb x11-utils
#!python -m pip install 'git+https://github.com/DLR-RM/stable-baselines3@feat/gymnasium-support#egg=stable-baselines3[extra]' 
#!pip install pyglet box2d box2d-kengz
#!pip install pyvirtualdisplay PyOpenGL PyOpenGL-accelerate

On Google Colab, uncomment this cell to create a virtual rendering canvas so that render calls work (we still won't see a display!).

In [2]:
#import pyvirtualdisplay
#
#_display = pyvirtualdisplay.Display(visible=False,  # use False with Xvfb
#                                    size=(1400, 900))
#_ = _display.start()

Import required packages. 

In [3]:
import gymnasium as gym
import torch
import stable_baselines3 as sb3


### Create the Environment

Create the Lunar Lander Environment

In [4]:
# Create environment
env_train = gym.make('LunarLander-v2')
env_train = gym.wrappers.TimeLimit(env_train, max_episode_steps=3000)
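
As a rough guide to the "vector-based state representation", the sketch below labels the eight entries of a Lunar Lander observation and the four discrete actions, following the Gymnasium documentation linked above. The `LanderState` class, the `ACTIONS` table, the `decode` helper and the sample vector are made up for illustration; they are not part of Gymnasium.

```python
from typing import NamedTuple, Sequence


class LanderState(NamedTuple):
    """Hypothetical helper that names the 8 entries of an observation."""
    x: float                 # horizontal position
    y: float                 # vertical position
    vx: float                # horizontal velocity
    vy: float                # vertical velocity
    angle: float             # lander angle
    angular_velocity: float  # rate of rotation
    leg1_contact: bool       # first leg touching the ground?
    leg2_contact: bool       # second leg touching the ground?


# The four discrete actions, indexed as in the Gymnasium documentation.
ACTIONS = {
    0: "do nothing",
    1: "fire left orientation engine",
    2: "fire main engine",
    3: "fire right orientation engine",
}


def decode(obs: Sequence[float]) -> LanderState:
    """Turn a raw 8-element observation into a labelled tuple."""
    if len(obs) != 8:
        raise ValueError("expected 8 values, got {}".format(len(obs)))
    *continuous, leg1, leg2 = obs
    return LanderState(*map(float, continuous), bool(leg1), bool(leg2))


# Invented sample observation, not taken from a real episode.
sample_obs = [0.0, 1.4, 0.1, -0.3, 0.02, 0.0, 0.0, 1.0]
state = decode(sample_obs)
print(state)
print(ACTIONS[2])
```

These sizes match the `in_features=8` and `out_features=4` that appear when the agent's networks are printed.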

### Create and Train an Agent

Create a basic DQN agent and inspect its networks.

In [5]:
agent = sb3.DQN('MlpPolicy', 
                env_train)
print(agent.policy)

DQNPolicy(
  (q_net): QNetwork(
    (features_extractor): FlattenExtractor(
      (flatten): Flatten(start_dim=1, end_dim=-1)
    )
    (q_net): Sequential(
      (0): Linear(in_features=8, out_features=64, bias=True)
      (1): ReLU()
      (2): Linear(in_features=64, out_features=64, bias=True)
      (3): ReLU()
      (4): Linear(in_features=64, out_features=4, bias=True)
    )
  )
  (q_net_target): QNetwork(
    (features_extractor): FlattenExtractor(
      (flatten): Flatten(start_dim=1, end_dim=-1)
    )
    (q_net): Sequential(
      (0): Linear(in_features=8, out_features=64, bias=True)
      (1): ReLU()
      (2): Linear(in_features=64, out_features=64, bias=True)
      (3): ReLU()
      (4): Linear(in_features=64, out_features=4, bias=True)
    )
  )
)
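
The printout shows two identical networks, `q_net` and `q_net_target`. In DQN the second is a periodically synced copy of the first, and it supplies the bootstrap value when building the regression target for `q_net`. A minimal sketch of that target computation in plain Python follows. The `td_target` function, the reward and the Q-values are illustrative only, and `gamma=0.99` is an assumption (taken here to be the Stable-Baselines3 default, since this notebook does not set it).

```python
from typing import Sequence


def td_target(reward: float,
              next_q_target: Sequence[float],
              done: bool,
              gamma: float = 0.99) -> float:
    """One-step Q-learning target for a single transition.

    next_q_target holds the target network's Q-values for the next state,
    one entry per action (four for Lunar Lander).
    """
    if done:
        # A terminal transition has no future reward to bootstrap from.
        return reward
    return reward + gamma * max(next_q_target)


# Invented numbers for illustration.
next_q = [0.5, 2.0, -1.0, 0.0]
print(td_target(1.0, next_q, done=False))
print(td_target(1.0, next_q, done=True))
```

The online network is then trained so that its Q-value for the action actually taken moves towards this target.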


Set the parameters for a custom value-function network in the DQN agent.

In [6]:
vf_network_args = {'activation_fn':torch.nn.ReLU,
                   'net_arch':[256, 256]}    

Create the agent using the custom value function network architecture. Also change some hyperparameters to tuned values:
- learning_rate: 0.00063
- batch_size: 128
- buffer_size: 50000
- learning_starts: 0
- target_update_interval: 250
- gradient_steps: -1
- exploration_fraction: 0.12
- exploration_final_eps: 0.1    
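
To make the two exploration settings concrete: with epsilon-greedy exploration, `exploration_fraction` is the share of training over which epsilon is annealed, and `exploration_final_eps` is the value it ends at. The sketch below assumes a linear decay from an initial epsilon of 1.0 (taken here to be the Stable-Baselines3 default) and the 200,000-timestep budget used in the training cell later in this notebook. `epsilon_at` is a hypothetical helper, not a library function.

```python
def epsilon_at(timestep: int,
               total_timesteps: int = 200_000,
               exploration_fraction: float = 0.12,
               initial_eps: float = 1.0,
               final_eps: float = 0.1) -> float:
    """Linearly annealed exploration rate at a given training timestep."""
    # Number of timesteps over which epsilon decays (about 24,000 here).
    decay_steps = exploration_fraction * total_timesteps
    # Fraction of the decay completed, capped at 1 once the decay is over.
    progress = min(timestep / decay_steps, 1.0)
    return initial_eps + progress * (final_eps - initial_eps)


for t in (0, 5_000, 24_000, 100_000):
    print(t, round(epsilon_at(t), 3))
```

Under these assumptions epsilon reaches 0.1 after about 24,000 timesteps and stays there. The value at 5,000 timesteps (about 0.81) agrees with the `exploration_rate` reported in the training log further down.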

In [7]:
tb_log = './log_tb_lunarlander/'
agent = sb3.DQN('MlpPolicy', 
                env_train, 
                learning_rate = 0.00063,
                batch_size = 128,
                buffer_size = 50000,
                learning_starts = 0,
                target_update_interval = 250,
                gradient_steps = -1,
                exploration_fraction = 0.12,
                exploration_final_eps = 0.1,
                verbose=1, 
                policy_kwargs = vf_network_args,
                tensorboard_log=tb_log)

Using cpu device
Wrapping the env with a `Monitor` wrapper
Wrapping the env in a DummyVecEnv.


Inspect the value-function (Q) networks of the new agent.

In [8]:
print(agent.policy)

DQNPolicy(
  (q_net): QNetwork(
    (features_extractor): FlattenExtractor(
      (flatten): Flatten(start_dim=1, end_dim=-1)
    )
    (q_net): Sequential(
      (0): Linear(in_features=8, out_features=256, bias=True)
      (1): ReLU()
      (2): Linear(in_features=256, out_features=256, bias=True)
      (3): ReLU()
      (4): Linear(in_features=256, out_features=4, bias=True)
    )
  )
  (q_net_target): QNetwork(
    (features_extractor): FlattenExtractor(
      (flatten): Flatten(start_dim=1, end_dim=-1)
    )
    (q_net): Sequential(
      (0): Linear(in_features=8, out_features=256, bias=True)
      (1): ReLU()
      (2): Linear(in_features=256, out_features=256, bias=True)
      (3): ReLU()
      (4): Linear(in_features=256, out_features=4, bias=True)
    )
  )
)
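
For a sense of scale, the number of trainable parameters in each Q-network can be worked out from the layer sizes printed above: a `Linear(in, out)` layer with bias holds `in * out + out` values. The helper below is a small illustrative function (not part of Stable-Baselines3) that compares the default `[64, 64]` architecture with the custom `[256, 256]` one.

```python
from typing import Sequence


def mlp_param_count(n_inputs: int, hidden: Sequence[int], n_outputs: int) -> int:
    """Count weights and biases in a fully connected network."""
    sizes = [n_inputs, *hidden, n_outputs]
    # Each consecutive pair of sizes is one Linear layer with a bias vector.
    return sum(fan_in * fan_out + fan_out
               for fan_in, fan_out in zip(sizes, sizes[1:]))


default_params = mlp_param_count(8, [64, 64], 4)
custom_params = mlp_param_count(8, [256, 256], 4)
print(default_params)  # 4996
print(custom_params)   # 69124
```

The custom network is roughly 14 times larger than the default, and the target network doubles the total held in memory in both cases.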


Make an evaluation callback that evaluates the agent every 5000 training steps (`eval_freq=5000`) and does not request rendering (`render=False`).

In [9]:
eval_env = gym.make('LunarLander-v2', render_mode = 'human') # We use a separate evaluation env in case any wrappers have been used
eval_callback = sb3.common.callbacks.EvalCallback(eval_env, 
                                                  best_model_save_path='./logs_lunarlander_custom/',
                                                  log_path='./logs_lunarlander_custom/', 
                                                  eval_freq=5000,
                                                  render=False)

Train the agent for 200,000 timesteps, logging to TensorBoard and running the evaluation callback along the way.

In [None]:
agent.learn(total_timesteps=200000,
            tb_log_name="Custom Network",
            callback=eval_callback)

Logging to ./log_tb_lunarlander/Custom Network_6
----------------------------------
| rollout/            |          |
|    ep_len_mean      | 80       |
|    ep_rew_mean      | -85.5    |
|    exploration_rate | 0.988    |
| time/               |          |
|    episodes         | 4        |
|    fps              | 125      |
|    time_elapsed     | 2        |
|    total_timesteps  | 320      |
| train/              |          |
|    learning_rate    | 0.00063  |
|    loss             | 1.77     |
|    n_updates        | 316      |
----------------------------------
----------------------------------
| rollout/            |          |
|    ep_len_mean      | 91.5     |
|    ep_rew_mean      | -109     |
|    exploration_rate | 0.973    |
| time/               |          |
|    episodes         | 8        |
|    fps              | 134      |
|    time_elapsed     | 5        |
|    total_timesteps  | 732      |
| train/              |          |
|    learning_rate    | 0.00063  |
|    l


Eval num_timesteps=5000, episode_reward=-114.41 +/- 30.84
Episode length: 1000.00 +/- 0.00
----------------------------------
| eval/               |          |
|    mean_ep_length   | 1e+03    |
|    mean_reward      | -114     |
| rollout/            |          |
|    exploration_rate | 0.813    |
| time/               |          |
|    total_timesteps  | 5000     |
| train/              |          |
|    learning_rate    | 0.00063  |
|    loss             | 1.66     |
|    n_updates        | 4996     |
----------------------------------
New best mean reward!
----------------------------------
| rollout/            |          |
|    ep_len_mean      | 90.5     |
|    ep_rew_mean      | -111     |
|    exploration_rate | 0.81     |
| time/               |          |
|    episodes         | 56       |
|    fps              | 37       |
|    time_elapsed     | 136      |
|    total_timesteps  | 5068     |
| train/              |          |
|    learning_rate    | 0.00063  |
|    loss   

Save the trained agent.

In [None]:
agent.save("./dqn_lunar_lander_agent_custom")

### Evaluation

Evaluate the final trained agent over 10 episodes in the evaluation environment.

In [None]:
mean_reward, std_reward = sb3.common.evaluation.evaluate_policy(agent,
                                                                eval_env,
                                                                n_eval_episodes=10,
                                                                render=True)
print("Mean Reward: {} +/- {}".format(mean_reward, std_reward))
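
`evaluate_policy` returns the mean and standard deviation of the total reward per episode. The sketch below reproduces that summary for a hand-written list of episode returns and compares the mean with the 200-point score that the Gymnasium documentation gives as the threshold for a solved Lunar Lander episode. The returns are invented for illustration and are not results from this agent, `summarise` is a hypothetical helper, and the population standard deviation is assumed (which is what NumPy's `std` computes by default).

```python
from statistics import fmean, pstdev
from typing import Sequence, Tuple

# Score at which a Lunar Lander episode counts as solved (Gymnasium docs).
SOLVED_SCORE = 200.0


def summarise(episode_returns: Sequence[float]) -> Tuple[float, float]:
    """Mean and population standard deviation of per-episode returns."""
    return fmean(episode_returns), pstdev(episode_returns)


# Invented episode returns, NOT output from the trained agent.
demo_returns = [210.0, 190.0, 250.0, 150.0]
demo_mean, demo_std = summarise(demo_returns)
print("Mean Reward: {:.1f} +/- {:.1f}".format(demo_mean, demo_std))
print("Solved on average:", demo_mean >= SOLVED_SCORE)
```

With only 10 evaluation episodes the standard deviation can be large relative to the mean, so it is worth reading both numbers together when comparing the final agent with the best saved model.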

Evaluate the best model saved by the evaluation callback during training.

In [None]:
best_agent = sb3.dqn.DQN.load("./logs_lunarlander_custom/best_model")

In [None]:
mean_reward, std_reward = sb3.common.evaluation.evaluate_policy(best_agent,
                                                                eval_env,
                                                                n_eval_episodes=10,
                                                                render=True)
print("Mean Reward: {} +/- {}".format(mean_reward, std_reward))