# PPO for Lunar Lander with Stable Baselines3

This notebook uses the Stable Baselines3 implementation of Proximal Policy Optimization (PPO) to train an agent for the Lunar Lander environment from Gymnasium, the maintained successor to OpenAI Gym.

## 1. Install Required Packages

Run this cell first to install all dependencies and import the libraries used in the rest of the notebook.

In [None]:
!pip install swig
!pip install gymnasium[box2d]
!pip install stable-baselines3[extra]
!pip install tensorboard
import gymnasium as gym
import numpy as np
import matplotlib.pyplot as plt
from stable_baselines3 import PPO
from stable_baselines3.common.env_util import make_vec_env
from stable_baselines3.common.evaluation import evaluate_policy
from stable_baselines3.common.callbacks import EvalCallback, CheckpointCallback
import os
from matplotlib import animation
from IPython.display import HTML




## 3. Create the Environment

In [None]:
# Create a single environment for testing
env = gym.make('LunarLander-v3', render_mode='rgb_array')
obs, info = env.reset()



## 4. Create Vectorized Environment for Training

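Running several copies of the environment collects more experience per rollout. Note that `make_vec_env` uses `DummyVecEnv` by default, which steps the copies sequentially in a single process; pass `vec_env_cls=SubprocVecEnv` if you want each copy to run in its own process.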

In [None]:
# Create vectorized environment with 10 environment copies
vec_env = make_vec_env('LunarLander-v3', n_envs=10, seed=42)
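
The number of environments also sets how much data PPO gathers before each update. A minimal sketch of that arithmetic, assuming the Stable Baselines3 defaults `n_steps=2048` and `batch_size=64` (the notebook does not override either); the helper name `rollout_stats` is made up for illustration:

```python
def rollout_stats(n_envs, n_steps=2048, batch_size=64):
    """Return (transitions per rollout, minibatches per epoch).

    n_steps and batch_size are assumed to be the SB3 PPO defaults,
    since the notebook does not set them explicitly.
    """
    rollout_size = n_envs * n_steps           # transitions gathered before each update
    minibatches = rollout_size // batch_size  # minibatches in one optimization epoch
    return rollout_size, minibatches


rollout_size, minibatches = rollout_stats(n_envs=10)
print(rollout_size, minibatches)  # 20480 320
```

With 10 environments this gives 20,480 transitions per rollout, which matches the `total_timesteps` value reported after the first iteration of the training log in section 7.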

## 5. Initialize PPO Agent

In [None]:
# Create directories for logs and models
os.makedirs("logs", exist_ok=True)
os.makedirs("models", exist_ok=True)

# Initialize the PPO agent.
# TODO: fill in the model parameters here, using the online docs and best
# practices as needed. You only need to fill in model parameters, but try to
# think about why each parameter is set a certain way and how you might
# adjust it for the final, much more complex rover project.
# No hyperparameters are passed below, so the SB3 defaults are used.

model = PPO(
    "MlpPolicy",
    vec_env,
    verbose=1,
)

Using cpu device
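
One possible starting point for filling in the parameters is to write out the Stable Baselines3 defaults explicitly, so each value is visible and can be adjusted one at a time. The sketch below is an assumption about how to organize this, not the intended solution; the name `ppo_kwargs` is hypothetical, and the values are the library defaults rather than tuned settings:

```python
# Hypothetical starting point: SB3's PPO defaults written out explicitly.
ppo_kwargs = dict(
    learning_rate=3e-4,  # optimizer step size
    n_steps=2048,        # steps collected per environment before each update
    batch_size=64,       # minibatch size for each gradient step
    n_epochs=10,         # passes over the rollout buffer per update
    gamma=0.99,          # discount factor
    gae_lambda=0.95,     # bias/variance trade-off for GAE advantages
    clip_range=0.2,      # PPO clipping parameter
    ent_coef=0.0,        # entropy bonus weight (raise to encourage exploration)
)

# Usage (requires the imports and vec_env from the cells above):
# model = PPO("MlpPolicy", vec_env, verbose=1, **ppo_kwargs)
```

The `learning_rate` and `clip_range` values agree with the `0.0003` and `0.2` entries in the training log, which is consistent with the defaults being in effect.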




## 6. Set Up Callbacks for Evaluation and Checkpointing

In [None]:
# Create evaluation environment
eval_env = gym.make('LunarLander-v3')

# Evaluation callback. eval_freq counts callback calls (one per vectorized
# step), so with n_envs=10 this evaluates every 10,000 * 10 = 100,000 timesteps.
eval_callback = EvalCallback(
    eval_env,
    best_model_save_path="./models/",
    log_path="./logs/",
    eval_freq=10000,
    deterministic=True,
    render=False
)

# Checkpoint callback. save_freq is also counted per vectorized step, so with
# n_envs=10 this saves the model every 50,000 * 10 = 500,000 timesteps.
checkpoint_callback = CheckpointCallback(
    save_freq=50000,
    save_path="./models/",
    name_prefix="ppo_model"
)
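
If the intent is to evaluate every 10,000 and checkpoint every 50,000 total timesteps, the frequencies need to be divided by the number of environments, because both callbacks are called once per vectorized step. A minimal sketch of that conversion; `per_env_freq` is a hypothetical helper, not part of Stable Baselines3:

```python
def per_env_freq(timestep_freq, n_envs):
    """Convert a frequency in total timesteps into callback calls.

    Each callback call corresponds to n_envs environment steps, so the
    frequency is divided by n_envs (and never allowed to drop below 1).
    """
    return max(timestep_freq // n_envs, 1)


print(per_env_freq(10_000, n_envs=10))  # 1000
print(per_env_freq(50_000, n_envs=10))  # 5000
```

The results could then be passed as `eval_freq` and `save_freq` in the callbacks above.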

## 7. Train the Agent

This will train for 500,000 timesteps. Adjust as needed.
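
When adjusting the budget, it can help to translate timesteps into PPO iterations. The sketch below assumes training proceeds in whole rollouts of `n_envs * n_steps` transitions and uses the SB3 defaults `n_steps=2048` and `n_epochs=10`; `training_plan` is a hypothetical helper:

```python
import math


def training_plan(total_timesteps, n_envs=10, n_steps=2048, n_epochs=10):
    """Return (iterations, timesteps actually collected, optimization epochs)."""
    rollout_size = n_envs * n_steps
    # Rollouts are collected whole, so the budget is rounded up.
    iterations = math.ceil(total_timesteps / rollout_size)
    collected = iterations * rollout_size
    epochs = iterations * n_epochs
    return iterations, collected, epochs


print(training_plan(500_000))  # (25, 512000, 250)
```

Under these assumptions, a 500,000-timestep budget corresponds to 25 iterations and slightly more than 500,000 collected timesteps.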

In [None]:
# Train the agent
total_timesteps = 500000

print(f"Starting training for {total_timesteps} timesteps...")

# Re-create the vectorized environment and model in this cell so that training
# always starts with a valid environment attached to the model. This avoids
# the AssertionError raised when model.env is None.
vec_env = make_vec_env('LunarLander-v3', n_envs=10, seed=42)
model = PPO(
    "MlpPolicy",
    vec_env,
    verbose=1,
)

model.learn(
    total_timesteps=total_timesteps,
    callback=[eval_callback, checkpoint_callback],
    progress_bar=False
)

print("\nTraining completed!")

# Save the final model
model.save("models/ppo_lunar_lander_final")
print("Final model saved!")

Starting training for 500000 timesteps...
Using cpu device
---------------------------------
| rollout/           |          |
|    ep_len_mean     | 93.1     |
|    ep_rew_mean     | -173     |
| time/              |          |
|    fps             | 2733     |
|    iterations      | 1        |
|    time_elapsed    | 7        |
|    total_timesteps | 20480    |
---------------------------------
-----------------------------------------
| rollout/                |             |
|    ep_len_mean          | 98          |
|    ep_rew_mean          | -156        |
| time/                   |             |
|    fps                  | 1278        |
|    iterations           | 2           |
|    time_elapsed         | 32          |
|    total_timesteps      | 40960       |
| train/                  |             |
|    approx_kl            | 0.011093736 |
|    clip_fraction        | 0.0737      |
|    clip_range           | 0.2         |
|    entropy_loss         | -1.38       |
|    explaine



Eval num_timesteps=100000, episode_reward=-238.59 +/- 36.87
Episode length: 293.00 +/- 28.19
-----------------------------------------
| eval/                   |             |
|    mean_ep_length       | 293         |
|    mean_reward          | -239        |
| time/                   |             |
|    total_timesteps      | 100000      |
| train/                  |             |
|    approx_kl            | 0.012305767 |
|    clip_fraction        | 0.186       |
|    clip_range           | 0.2         |
|    entropy_loss         | -1.25       |
|    explained_variance   | 0.639       |
|    learning_rate        | 0.0003      |
|    loss                 | 44.4        |
|    n_updates            | 40          |
|    policy_gradient_loss | -0.0169     |
|    value_loss           | 183         |
-----------------------------------------
New best mean reward!
---------------------------------
| rollout/           |          |
|    ep_len_mean     | 119      |
|    ep_rew_mean     | -63.

## 8. Evaluate the Trained Agent

In [None]:
# Load the final saved model (the best model found by EvalCallback is saved
# separately as models/best_model.zip)
model = PPO.load("models/ppo_lunar_lander_final.zip")

# Evaluate the agent
eval_env = gym.make('LunarLander-v3')
mean_reward, std_reward = evaluate_policy(model, eval_env, n_eval_episodes=100)

print(f"Mean reward over 100 episodes: {mean_reward:.2f} +/- {std_reward:.2f}")

eval_env.close()

Mean reward over 100 episodes: 196.13 +/- 90.97
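
A standard deviation this large relative to the mean suggests that returns vary widely between episodes, so the mean alone does not show how often the agent lands successfully. A minimal sketch of a per-episode summary follows; `summarize_returns` is a hypothetical helper and the example returns are made up for illustration, not taken from this run:

```python
import statistics


def summarize_returns(returns, solved_threshold=200.0):
    """Return (mean, population std, fraction of episodes at or above threshold)."""
    mean = statistics.fmean(returns)
    std = statistics.pstdev(returns)  # population std, like numpy.std's default
    solved = sum(r >= solved_threshold for r in returns) / len(returns)
    return mean, std, solved


# Made-up returns for illustration only. Real values could be obtained with
# evaluate_policy(model, eval_env, n_eval_episodes=100, return_episode_rewards=True),
# which returns per-episode rewards and lengths instead of the mean and std.
example_returns = [250.0, 230.0, 260.0, -20.0, 240.0]
mean, std, solved = summarize_returns(example_returns)
print(f"mean={mean:.1f} std={std:.1f} solved={solved:.0%}")
```

In this made-up example, a single crash pulls the mean below 200 even though four of five episodes clear the threshold.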


In [None]:
def create_episode_animation(model, env_name='LunarLander-v3', deterministic=True):
    """
    Create an animated visualization of one episode.
    """
    env = gym.make(env_name, render_mode='rgb_array')

    # Collect frames from one episode
    frames = []
    obs, info = env.reset()
    done = False
    total_reward = 0
    step_count = 0

    while not done:
        # Render and store frame
        frame = env.render()
        frames.append(frame)

        # Get action from trained model
        action, _ = model.predict(obs, deterministic=deterministic)
        obs, reward, terminated, truncated, info = env.step(action)
        done = terminated or truncated
        total_reward += reward
        step_count += 1

    env.close()

    print(f"Episode completed in {step_count} steps with total reward: {total_reward:.2f}")

    # Create animation
    fig, ax = plt.subplots(figsize=(8, 6))
    ax.axis('off')
    img = ax.imshow(frames[0])

    def animate(i):
        img.set_array(frames[i])
        ax.set_title(f'Step: {i+1}/{len(frames)} | Reward: {total_reward:.2f}', fontsize=14)
        return [img]

    anim = animation.FuncAnimation(
        fig, animate, frames=len(frames), interval=50, blit=True, repeat=True
    )

    plt.close()  # Prevent static display
    return anim

# Create and display the animation
print("Creating animation of trained agent...\n")
anim = create_episode_animation(model)
HTML(anim.to_jshtml())

Creating animation of trained agent...

Episode completed in 306 steps with total reward: 250.42




## Summary

This notebook demonstrates:
1. Setting up the Lunar Lander environment
2. Training a PPO agent with Stable Baselines3
3. Evaluating the trained agent
4. Visualizing performance
5. Comparing with a random baseline

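A mean reward of 200 or more is the usual threshold for considering Lunar Lander solved. The final model from the run above reached 196.13 +/- 90.97 over 100 evaluation episodes, just short of that threshold, so longer training or adjusted hyperparameters may be needed.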
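
For the random-baseline comparison listed in the summary, a minimal sketch is given below. It assumes only that `env` follows the Gymnasium API (`reset`, `step`, `action_space`); the function name `random_baseline` and its parameters are hypothetical:

```python
def random_baseline(env, n_episodes=10, seed=0):
    """Average undiscounted return of uniformly random actions.

    `env` is assumed to follow the Gymnasium API (reset/step/action_space).
    """
    env.action_space.seed(seed)  # make the sampled actions reproducible
    returns = []
    for episode in range(n_episodes):
        obs, info = env.reset(seed=seed + episode)
        done, total = False, 0.0
        while not done:
            action = env.action_space.sample()
            obs, reward, terminated, truncated, info = env.step(action)
            total += float(reward)
            done = terminated or truncated
        returns.append(total)
    return sum(returns) / len(returns)


# Usage (requires gymnasium[box2d] as installed in section 1):
# env = gym.make('LunarLander-v3')
# print(random_baseline(env, n_episodes=100))
# env.close()
```

Comparing this number with the mean reward from `evaluate_policy` gives a rough sense of how much the trained policy improves over acting at random.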