# COMP47590 - Advanced Machine Learning 

## Actor-Critic Method for Continuous Lunar Lander Using PPO
Uses the PPO actor-critic method to train a neural-network-based player for the continuous-action Lunar Lander environment from OpenAI Gym (https://gym.openai.com/envs/LunarLander-v2/). The agent uses a vector-based state representation.

### Initialisation

If using Google Colab, you need to install the packages first: uncomment the lines below.

In [1]:
#!apt install swig cmake ffmpeg
#!apt-get install -y xvfb x11-utils
#!pip install stable-baselines3[extra] pyglet box2d box2d-kengz
#!pip install pyvirtualdisplay PyOpenGL PyOpenGL-accelerate

For Google Colab, uncomment this cell to create a virtual rendering canvas so that render calls work (we still won't see a display!).

In [2]:
#import pyvirtualdisplay
#
#_display = pyvirtualdisplay.Display(visible=False,  # use False with Xvfb
#                                    size=(1400, 900))
#_ = _display.start()

Import required packages. 

In [4]:
import gymnasium as gym
import stable_baselines3 as sb3

import pandas as pd # For data frames and data frame manipulation
pd.set_option('display.max_rows', 100)
pd.set_option('display.max_columns', 100)
pd.set_option('display.width', 1000)
import numpy as np # For general  numeric operations

import matplotlib.pyplot as plt
%matplotlib inline 



### Create the Environment

Create the Lunar Lander environment and wrap it in a TimeLimit wrapper so that episodes end after at most 1000 steps.

In [None]:
env = gym.make('LunarLanderContinuous-v2')
env = gym.wrappers.TimeLimit(env, max_episode_steps=1000)

Examine the environment

In [None]:
env.action_space

In [None]:
env.observation_space

### Create an agent using an actor-critic method (PPO)

Create the PPO agent with some tuned hyper-parameters.

In [None]:
tb_log = './log_tb_lunarlander/'
agent = sb3.PPO('MlpPolicy',         
        env, 
        n_steps = 1024,
        batch_size = 64,
        gae_lambda = 0.98,
        gamma = 0.999,
        n_epochs = 4,
        ent_coef = 0.01,
        verbose=1, 
        tensorboard_log=tb_log)
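The `gamma` and `gae_lambda` arguments control how rewards are discounted and how advantages are smoothed through generalised advantage estimation (GAE). The block below is a minimal NumPy sketch of that recursion, not Stable-Baselines3's own implementation. The helper name `compute_gae` and the toy rewards and values are hypothetical, and the sketch assumes a single rollout segment in which no episode terminates.

```python
import numpy as np

def compute_gae(rewards, values, last_value, gamma=0.999, gae_lambda=0.98):
    """Hypothetical helper: GAE over one rollout segment with no episode ends."""
    rewards = np.asarray(rewards, dtype=float)
    values = np.asarray(values, dtype=float)
    advantages = np.zeros_like(rewards)
    next_value = last_value   # critic's estimate for the state after the segment
    running = 0.0             # discounted sum of later TD errors
    for t in reversed(range(len(rewards))):
        # One-step TD error: how much better the step was than the critic expected
        delta = rewards[t] + gamma * next_value - values[t]
        # Blend in later TD errors, decayed by gamma * gae_lambda per step
        running = delta + gamma * gae_lambda * running
        advantages[t] = running
        next_value = values[t]
    returns = advantages + values   # regression targets for the critic
    return advantages, returns

# Toy segment (made-up numbers): constant reward, critic predicting zero everywhere
toy_advantages, toy_returns = compute_gae([1.0, 1.0, 1.0], [0.0, 0.0, 0.0], last_value=0.0)
print(toy_advantages)
```

With `gae_lambda` close to 1 the advantage approaches the full discounted return minus the critic's estimate (lower bias, higher variance). With `gae_lambda = 0` it reduces to the one-step TD error.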

Examine the actor and critic network architecture.

In [None]:
print(agent.policy)

Create an evaluation callback

In [None]:
eval_env = gym.make('LunarLanderContinuous-v2') # We use a separate evaluation env in case any wrappers have been used
eval_log_path = './logs_lunarlander_PPO/'
eval_callback = sb3.common.callbacks.EvalCallback(eval_env, 
                                                  best_model_save_path=eval_log_path,
                                                  log_path=eval_log_path, 
                                                  eval_freq=5000,
                                                  render=True)

Train the model

In [None]:
agent.learn(total_timesteps=10000, 
            callback=eval_callback,
            tb_log_name="PPO Network")
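As a rough back-of-the-envelope check of the settings used above, the arithmetic below works out how many rollouts, gradient steps, and evaluations this training run should produce. It is a sketch that assumes a single, non-vectorised environment (`n_envs = 1`) and that PPO always finishes the rollout it has started. The variable names are illustrative only.

```python
import math

n_steps, batch_size, n_epochs = 1024, 64, 4   # from the PPO constructor
total_timesteps, eval_freq = 10000, 5000      # from learn() and EvalCallback
n_envs = 1                                    # assumption: one environment

rollout_size = n_steps * n_envs                            # steps collected per policy update
n_rollouts = math.ceil(total_timesteps / rollout_size)     # rollouts needed to reach the budget
steps_collected = n_rollouts * rollout_size                # budget rounded up to whole rollouts
gradient_steps_per_rollout = (rollout_size // batch_size) * n_epochs
n_evaluations = steps_collected // eval_freq               # one evaluation every eval_freq steps

print(n_rollouts, steps_collected, gradient_steps_per_rollout, n_evaluations)
```

Under these assumptions the run collects slightly more than the requested 10000 steps and triggers only two evaluations, so a larger `total_timesteps` is needed for a meaningful learning curve.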

Then connect to the log using **TensorBoard** from the command line: 

`tensorboard --logdir ./log_tb_lunarlander/`

Then open TensorBoard in a browser, typically located at:

`http://localhost:6006/`

Examine the EvalCallback outputs.

In [None]:
evaluation_log = np.load(eval_log_path + 'evaluations.npz')
# Average each logged array over the episodes run at each evaluation point
evaluation_log_df = pd.DataFrame({item: [np.mean(ep) for ep in evaluation_log[item]] for item in evaluation_log.files})
ax = evaluation_log_df['results'].plot(color = 'lightgray', xlim = [-5, len(evaluation_log_df)], figsize = (10,5))
evaluation_log_df['results'].rolling(5).mean().plot(color = 'black', xlim = [-5, len(evaluation_log_df)])
# Fix the tick positions before relabelling them with the training timestep of each evaluation
ax.set_xticks(range(len(evaluation_log_df)))
ax.set_xticklabels(evaluation_log_df['timesteps'].astype(int))
ax.set_xlabel("Training Timesteps")
plt.ylabel("Rolling Mean Cumulative Return")
plt.show()
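To make the data-frame construction and the `rolling(5)` smoothing concrete, here is a small sketch on synthetic data. The dictionary `toy_log` stands in for the loaded `evaluations.npz` file, which stores `timesteps`, `results`, and `ep_lengths` arrays with one row per evaluation point. All numbers are made up for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for evaluations.npz: 6 evaluation points, 5 episodes each
toy_log = {
    'timesteps': np.arange(1, 7) * 5000,
    'results': np.arange(30, dtype=float).reshape(6, 5),
    'ep_lengths': np.full((6, 5), 200),
}

# Same construction as the cell above: one mean value per evaluation point
toy_df = pd.DataFrame({key: [np.mean(row) for row in toy_log[key]] for key in toy_log})

# A window of 5 leaves the first 4 rows as NaN, so those points are not drawn
toy_df['rolling'] = toy_df['results'].rolling(5).mean()
print(toy_df)
```

Because the first four rolling values are undefined, the black line only appears once at least five evaluations have been logged.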

Save the trained agent.

In [None]:
agent.save("./ppo_lunar_lander_agent")

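Load the trained agent.

In [None]:
# PPO.load is a class method that returns a new model, so assign the result.
# Passing env re-attaches the environment so agent.get_env() works below.
agent = sb3.PPO.load("./ppo_lunar_lander_agent", env=env)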

### Evaluation

Evaluate the agent in the environment

In [None]:
mean_reward, std_reward = sb3.common.evaluation.evaluate_policy(agent, 
                                                                agent.get_env(), 
                                                                n_eval_episodes=10,
                                                                render=True)
print("Mean Reward: {} +/- {}".format(mean_reward, std_reward))