# PPO for Lunar Lander with Stable Baselines3

This notebook trains a Proximal Policy Optimization (PPO) agent with Stable Baselines3 to solve the Lunar Lander environment from Gymnasium (the maintained successor to OpenAI Gym).

## 1. Install Required Packages

Run this cell first to install all dependencies.

In [None]:
!pip install swig
!pip install gymnasium[box2d]
!pip install stable-baselines3[extra]
!pip install tensorboard

## 2. Import Libraries

In [None]:
import gymnasium as gym
import numpy as np
import matplotlib.pyplot as plt
from stable_baselines3 import PPO
from stable_baselines3.common.env_util import make_vec_env
from stable_baselines3.common.evaluation import evaluate_policy
from stable_baselines3.common.callbacks import EvalCallback, CheckpointCallback
import os
from matplotlib import animation
from IPython.display import HTML



## 3. Create the Environment

In [None]:
# Create a single environment for testing
env = gym.make('LunarLander-v3', render_mode='rgb_array')
obs, info = env.reset()
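
Before training, it can help to step the environment with random actions, both as a sanity check and as a rough reference point for the trained agent. The sketch below assumes only the standard Gymnasium API (`reset`, `step`, `action_space.sample`); the helper name `random_rollout` is hypothetical and not part of the original notebook.

```python
def random_rollout(env, n_episodes=5, seed=0):
    """Run episodes with uniformly random actions and return each episode's total reward.

    `env` is any object following the Gymnasium API, e.g. the `env` created above.
    """
    env.action_space.seed(seed)  # make the sampled actions reproducible
    returns = []
    for episode in range(n_episodes):
        # Seed only the first reset; later resets continue the same RNG stream
        obs, info = env.reset(seed=seed if episode == 0 else None)
        done = False
        total_reward = 0.0
        while not done:
            action = env.action_space.sample()
            obs, reward, terminated, truncated, info = env.step(action)
            total_reward += float(reward)
            done = terminated or truncated
        returns.append(total_reward)
    return returns
```

Calling `random_rollout(env)` on the environment created above returns a list of five episode returns; their mean gives a baseline to compare against the trained agent's score.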

## 4. Create Vectorized Environment for Training

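Running several copies of the environment lets each rollout gather more experience per policy update. Note that `make_vec_env` uses `DummyVecEnv` by default, which steps the copies sequentially in a single process; pass `vec_env_cls=SubprocVecEnv` to run them in separate processes.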

In [None]:
# Create a vectorized environment with 10 copies of the environment
vec_env = make_vec_env('LunarLander-v3', n_envs=10, seed=42)

## 5. Initialize PPO Agent

In [None]:
# Create directories for logs and models
os.makedirs("logs", exist_ok=True)
os.makedirs("models", exist_ok=True)

# Initialize the PPO agent. Only the policy type is set here, so every
# hyperparameter uses its Stable Baselines3 default.
model = PPO("MlpPolicy", vec_env, verbose=1)
# Exercise: create the PPO model here, using the online docs and best
# practices as needed. You only need to fill in the model parameters,
# but think about why each parameter is set a certain way and how you
# might adjust it for the final, much more complex rover project.

Using cuda device
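
As a starting point for the exercise, the defaults that `PPO("MlpPolicy", vec_env)` falls back on can be written out explicitly. The values below are believed to match the Stable Baselines3 defaults (the training log in section 7 confirms `learning_rate` 0.0003 and `clip_range` 0.2), but check them against the installed version. The names `ppo_kwargs` and `linear_schedule` are illustrative and not part of the original notebook.

```python
# Assumed Stable Baselines3 PPO defaults, written out so they can be tuned one at a time
ppo_kwargs = {
    "learning_rate": 3e-4,   # Adam step size
    "n_steps": 2048,         # steps collected per environment before each update
    "batch_size": 64,        # minibatch size for each gradient step
    "n_epochs": 10,          # passes over the rollout buffer per update
    "gamma": 0.99,           # discount factor
    "gae_lambda": 0.95,      # bias/variance trade-off for advantage estimation
    "clip_range": 0.2,       # PPO clipping parameter
    "ent_coef": 0.0,         # entropy bonus weight (raise to encourage exploration)
    "vf_coef": 0.5,          # value loss weight
    "max_grad_norm": 0.5,    # gradient clipping threshold
}


def linear_schedule(initial_value):
    """Return a learning-rate schedule that decays linearly to zero.

    Stable Baselines3 calls the schedule with `progress_remaining`,
    which goes from 1.0 at the start of training to 0.0 at the end.
    """
    def schedule(progress_remaining):
        return progress_remaining * initial_value
    return schedule


# Gradient steps per update with the 10 environments used in this notebook
n_envs = 10
rollout_size = n_envs * ppo_kwargs["n_steps"]                      # 20480 transitions
minibatches_per_epoch = rollout_size // ppo_kwargs["batch_size"]   # 320
gradient_steps = minibatches_per_epoch * ppo_kwargs["n_epochs"]    # 3200
print(rollout_size, minibatches_per_epoch, gradient_steps)
```

The dictionary could then be passed as `PPO("MlpPolicy", vec_env, verbose=1, **ppo_kwargs)`. To try a decaying learning rate, replace the `learning_rate` entry with `linear_schedule(3e-4)` rather than passing it as a second keyword.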



## 6. Set Up Callbacks for Evaluation and Checkpointing

In [None]:
# Create evaluation environment
eval_env = gym.make('LunarLander-v3')

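# Evaluation callback. eval_freq counts calls to env.step(), and each call
# advances all 10 training environments, so eval_freq=10000 evaluates the
# model every 100,000 total timesteps.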
eval_callback = EvalCallback(
    eval_env,
    best_model_save_path="./models/",
    log_path="./logs/",
    eval_freq=10000,
    deterministic=True,
    render=False
)

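# Checkpoint callback. save_freq is also counted in env.step() calls, so with
# 10 environments save_freq=50000 saves the model every 500,000 total timesteps.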
checkpoint_callback = CheckpointCallback(
    save_freq=50000,
    save_path="./models/",
    name_prefix="ppo_model"
)
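
Both callbacks count calls to `env.step()` rather than total timesteps, and each call on a vectorized environment advances every copy by one step. The Stable Baselines3 documentation suggests dividing the desired interval by the number of environments. The sketch below shows that arithmetic with a hypothetical helper, `per_env_freq`, which is not part of the library.

```python
def per_env_freq(total_timestep_interval, n_envs):
    """Convert an interval in total timesteps into a callback frequency
    measured in env.step() calls on a vectorized environment."""
    return max(total_timestep_interval // n_envs, 1)


n_envs = 10  # matches the vectorized training environment

# What the notebook's settings mean in total timesteps
print(10_000 * n_envs)   # eval_freq=10000 -> 100000 timesteps between evaluations
print(50_000 * n_envs)   # save_freq=50000 -> 500000 timesteps between checkpoints

# Frequencies that would evaluate every 10,000 and save every 50,000 total timesteps
print(per_env_freq(10_000, n_envs))  # 1000
print(per_env_freq(50_000, n_envs))  # 5000
```

With the values used in this notebook, the evaluations in the training log land at 100,000-timestep intervals, which is consistent with this reading.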

## 7. Train the Agent

This trains for a nominal 500,000 timesteps (the log below ends at 512,000). Adjust as needed.
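
PPO collects a full rollout from every environment before each update, so training stops at the first rollout boundary at or beyond `total_timesteps`. A minimal sketch of that arithmetic, assuming the default `n_steps=2048` and the 10 environments created earlier:

```python
import math

n_envs = 10               # from make_vec_env above
n_steps = 2048            # assumed Stable Baselines3 default for PPO
total_timesteps = 500_000

rollout_size = n_envs * n_steps                          # timesteps per iteration
iterations = math.ceil(total_timesteps / rollout_size)   # rollouts needed
actual_timesteps = iterations * rollout_size

print(rollout_size)      # 20480
print(iterations)        # 25
print(actual_timesteps)  # 512000
```

These figures match the `iterations` and `total_timesteps` rows in the training output.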

In [None]:
# Train the agent
total_timesteps = 500000

print(f"Starting training for {total_timesteps} timesteps...")
model.learn(
    total_timesteps=total_timesteps,
    callback=[eval_callback, checkpoint_callback],
    progress_bar=True
)

print("\nTraining completed!")

# Save the final model
model.save("models/ppo_lunar_lander_final")
print("Final model saved!")


Starting training for 500000 timesteps...
---------------------------------
| rollout/           |          |
|    ep_len_mean     | 89.8     |
|    ep_rew_mean     | -176     |
| time/              |          |
|    fps             | 1840     |
|    iterations      | 1        |
|    time_elapsed    | 11       |
|    total_timesteps | 20480    |
---------------------------------
-----------------------------------------
| rollout/                |             |
|    ep_len_mean          | 105         |
|    ep_rew_mean          | -150        |
| time/                   |             |
|    fps                  | 1178        |
|    iterations           | 2           |
|    time_elapsed         | 34          |
|    total_timesteps      | 40960       |
| train/                  |             |
|    approx_kl            | 0.007676603 |
|    clip_fraction        | 0.0585      |
|    clip_range           | 0.2         |
|    entropy_loss         | -1.38       |
|    explained_variance   | -0

-----------------------------------------
| eval/                   |             |
|    mean_ep_length       | 322         |
|    mean_reward          | -446        |
| time/                   |             |
|    total_timesteps      | 100000      |
| train/                  |             |
|    approx_kl            | 0.014662604 |
|    clip_fraction        | 0.221       |
|    clip_range           | 0.2         |
|    entropy_loss         | -1.26       |
|    explained_variance   | 0.612       |
|    learning_rate        | 0.0003      |
|    loss                 | 31.1        |
|    n_updates            | 40          |
|    policy_gradient_loss | -0.021      |
|    value_loss           | 126         |
-----------------------------------------


---------------------------------
| rollout/           |          |
|    ep_len_mean     | 138      |
|    ep_rew_mean     | -54.1    |
| time/              |          |
|    fps             | 947      |
|    iterations      | 5        |
|    time_elapsed    | 108      |
|    total_timesteps | 102400   |
---------------------------------
-----------------------------------------
| rollout/                |             |
|    ep_len_mean          | 160         |
|    ep_rew_mean          | -48.6       |
| time/                   |             |
|    fps                  | 928         |
|    iterations           | 6           |
|    time_elapsed         | 132         |
|    total_timesteps      | 122880      |
| train/                  |             |
|    approx_kl            | 0.013622736 |
|    clip_fraction        | 0.18        |
|    clip_range           | 0.2         |
|    entropy_loss         | -1.19       |
|    explained_variance   | 0.737       |
|    learning_rate        | 0.

-----------------------------------------
| eval/                   |             |
|    mean_ep_length       | 650         |
|    mean_reward          | -161        |
| time/                   |             |
|    total_timesteps      | 200000      |
| train/                  |             |
|    approx_kl            | 0.013322706 |
|    clip_fraction        | 0.122       |
|    clip_range           | 0.2         |
|    entropy_loss         | -1.07       |
|    explained_variance   | 0.887       |
|    learning_rate        | 0.0003      |
|    loss                 | 9.29        |
|    n_updates            | 90          |
|    policy_gradient_loss | -0.00797    |
|    value_loss           | 43.4        |
-----------------------------------------


---------------------------------
| rollout/           |          |
|    ep_len_mean     | 592      |
|    ep_rew_mean     | 2.89     |
| time/              |          |
|    fps             | 680      |
|    iterations      | 10       |
|    time_elapsed    | 301      |
|    total_timesteps | 204800   |
---------------------------------
------------------------------------------
| rollout/                |              |
|    ep_len_mean          | 684          |
|    ep_rew_mean          | 18.5         |
| time/                   |              |
|    fps                  | 651          |
|    iterations           | 11           |
|    time_elapsed         | 345          |
|    total_timesteps      | 225280       |
| train/                  |              |
|    approx_kl            | 0.0075996616 |
|    clip_fraction        | 0.0846       |
|    clip_range           | 0.2          |
|    entropy_loss         | -1.06        |
|    explained_variance   | 0.825        |
|    learning_r

-----------------------------------------
| eval/                   |             |
|    mean_ep_length       | 714         |
|    mean_reward          | 177         |
| time/                   |             |
|    total_timesteps      | 300000      |
| train/                  |             |
|    approx_kl            | 0.006180631 |
|    clip_fraction        | 0.0649      |
|    clip_range           | 0.2         |
|    entropy_loss         | -0.969      |
|    explained_variance   | 0.905       |
|    learning_rate        | 0.0003      |
|    loss                 | 2.63        |
|    n_updates            | 140         |
|    policy_gradient_loss | -0.00381    |
|    value_loss           | 29.2        |
-----------------------------------------


---------------------------------
| rollout/           |          |
|    ep_len_mean     | 898      |
|    ep_rew_mean     | 110      |
| time/              |          |
|    fps             | 584      |
|    iterations      | 15       |
|    time_elapsed    | 525      |
|    total_timesteps | 307200   |
---------------------------------
-----------------------------------------
| rollout/                |             |
|    ep_len_mean          | 927         |
|    ep_rew_mean          | 119         |
| time/                   |             |
|    fps                  | 575         |
|    iterations           | 16          |
|    time_elapsed         | 569         |
|    total_timesteps      | 327680      |
| train/                  |             |
|    approx_kl            | 0.009765765 |
|    clip_fraction        | 0.109       |
|    clip_range           | 0.2         |
|    entropy_loss         | -1.03       |
|    explained_variance   | 0.944       |
|    learning_rate        | 0.

-----------------------------------------
| eval/                   |             |
|    mean_ep_length       | 338         |
|    mean_reward          | 188         |
| time/                   |             |
|    total_timesteps      | 400000      |
| train/                  |             |
|    approx_kl            | 0.006490971 |
|    clip_fraction        | 0.0597      |
|    clip_range           | 0.2         |
|    entropy_loss         | -0.846      |
|    explained_variance   | 0.906       |
|    learning_rate        | 0.0003      |
|    loss                 | 3.55        |
|    n_updates            | 190         |
|    policy_gradient_loss | -0.00129    |
|    value_loss           | 30.8        |
-----------------------------------------


---------------------------------
| rollout/           |          |
|    ep_len_mean     | 841      |
|    ep_rew_mean     | 117      |
| time/              |          |
|    fps             | 555      |
|    iterations      | 20       |
|    time_elapsed    | 736      |
|    total_timesteps | 409600   |
---------------------------------
----------------------------------------
| rollout/                |            |
|    ep_len_mean          | 809        |
|    ep_rew_mean          | 111        |
| time/                   |            |
|    fps                  | 553        |
|    iterations           | 21         |
|    time_elapsed         | 776        |
|    total_timesteps      | 430080     |
| train/                  |            |
|    approx_kl            | 0.00855321 |
|    clip_fraction        | 0.124      |
|    clip_range           | 0.2        |
|    entropy_loss         | -0.844     |
|    explained_variance   | 0.941      |
|    learning_rate        | 0.0003     |
|   

-----------------------------------------
| eval/                   |             |
|    mean_ep_length       | 509         |
|    mean_reward          | 211         |
| time/                   |             |
|    total_timesteps      | 500000      |
| train/                  |             |
|    approx_kl            | 0.013682656 |
|    clip_fraction        | 0.13        |
|    clip_range           | 0.2         |
|    entropy_loss         | -0.703      |
|    explained_variance   | 0.86        |
|    learning_rate        | 0.0003      |
|    loss                 | 69.4        |
|    n_updates            | 240         |
|    policy_gradient_loss | -0.00313    |
|    value_loss           | 52.6        |
-----------------------------------------


---------------------------------
| rollout/           |          |
|    ep_len_mean     | 750      |
|    ep_rew_mean     | 150      |
| time/              |          |
|    fps             | 551      |
|    iterations      | 25       |
|    time_elapsed    | 928      |
|    total_timesteps | 512000   |
---------------------------------



Training completed!
Final model saved!
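
The `clip_fraction` and `approx_kl` rows in the log are diagnostics of how far each update moved the policy. As a rough illustration, the sketch below computes such quantities from per-sample log-probabilities. It is meant to mirror the estimators Stable Baselines3 is understood to use, but the function name `ppo_diagnostics` and the sample numbers are made up for this example.

```python
import math


def ppo_diagnostics(old_log_probs, new_log_probs, advantages, clip_range=0.2):
    """Return (clipped surrogate loss, clip fraction, approximate KL) for a batch."""
    n = len(old_log_probs)
    loss_terms, clipped, kl_terms = [], 0, []
    for old_lp, new_lp, adv in zip(old_log_probs, new_log_probs, advantages):
        log_ratio = new_lp - old_lp
        ratio = math.exp(log_ratio)
        # Clipped surrogate: take the more pessimistic of the two objectives
        clipped_ratio = min(max(ratio, 1 - clip_range), 1 + clip_range)
        loss_terms.append(-min(ratio * adv, clipped_ratio * adv))
        # Count samples whose probability ratio left the interval [1 - clip, 1 + clip]
        clipped += abs(ratio - 1) > clip_range
        # Low-variance KL estimator: (r - 1) - log r
        kl_terms.append((ratio - 1) - log_ratio)
    return sum(loss_terms) / n, clipped / n, sum(kl_terms) / n


# Made-up batch: only the second sample's probability ratio (about 1.35) is clipped
old_lp = [-1.0, -1.2, -0.8, -1.5]
new_lp = [-1.0, -0.9, -0.85, -1.45]
adv = [0.5, 1.0, -0.3, 0.2]
loss, clip_fraction, approx_kl = ppo_diagnostics(old_lp, new_lp, adv)
print(clip_fraction)  # 0.25
```

In the log above, `clip_fraction` stays between roughly 0.06 and 0.22 and `approx_kl` around 0.01, meaning only a minority of samples hit the clipping bound in each update.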


## 8. Evaluate the Trained Agent

In [None]:
# Load the final saved model (EvalCallback stores its best checkpoint
# separately as models/best_model.zip)
model = PPO.load("models/ppo_lunar_lander_final.zip")

# Evaluate the agent
eval_env = gym.make('LunarLander-v3')
mean_reward, std_reward = evaluate_policy(model, eval_env, n_eval_episodes=100)

print(f"Mean reward over 100 episodes: {mean_reward:.2f} +/- {std_reward:.2f}")

eval_env.close()

Mean reward over 100 episodes: 207.20 +/- 81.51
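
The reported `+/-` value is the standard deviation across episodes, not the uncertainty of the mean. A quick normal-approximation check, using only the numbers printed above, suggests how confidently the 200-point threshold is cleared:

```python
import math

mean_reward = 207.20   # from the evaluation output above
std_reward = 81.51
n_episodes = 100

# Standard error of the mean and an approximate 95% confidence interval
standard_error = std_reward / math.sqrt(n_episodes)
half_width = 1.96 * standard_error
ci_low, ci_high = mean_reward - half_width, mean_reward + half_width

print(f"standard error: {standard_error:.2f}")      # 8.15
print(f"95% CI: [{ci_low:.2f}, {ci_high:.2f}]")     # [191.22, 223.18]
```

Because the interval includes 200, this run is consistent with a solved agent but does not establish it firmly. More evaluation episodes or further training would narrow the gap.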


## 9. Visualize the Trained Agent

In [None]:
def create_episode_animation(model, env_name='LunarLander-v3', deterministic=True):
    """
    Create an animated visualization of one episode.
    """
    env = gym.make(env_name, render_mode='rgb_array')

    # Collect frames from one episode
    frames = []
    obs, info = env.reset()
    done = False
    total_reward = 0
    step_count = 0

    while not done:
        # Render and store frame
        frame = env.render()
        frames.append(frame)

        # Get action from trained model
        action, _ = model.predict(obs, deterministic=deterministic)
        obs, reward, terminated, truncated, info = env.step(action)
        done = terminated or truncated
        total_reward += reward
        step_count += 1

    env.close()

    print(f"Episode completed in {step_count} steps with total reward: {total_reward:.2f}")

    # Create animation
    fig, ax = plt.subplots(figsize=(8, 6))
    ax.axis('off')
    img = ax.imshow(frames[0])

    def animate(i):
        img.set_array(frames[i])
        ax.set_title(f'Step: {i+1}/{len(frames)} | Reward: {total_reward:.2f}', fontsize=14)
        return [img]

    anim = animation.FuncAnimation(
        fig, animate, frames=len(frames), interval=50, blit=True, repeat=True
    )

    plt.close()  # Prevent static display
    return anim

# Create and display the animation
print("Creating animation of trained agent...\n")
anim = create_episode_animation(model)
HTML(anim.to_jshtml())

Creating animation of trained agent...

Episode completed in 316 steps with total reward: 252.01




## Summary

This notebook demonstrates:
1. Setting up the Lunar Lander environment
2. Training a PPO agent with Stable Baselines3
3. Evaluating the trained agent
4. Visualizing a trained episode as an animation

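A mean reward of 200 or more is conventionally considered solved. In the run above, the final model averaged 207.20 +/- 81.51 over 100 evaluation episodes, so it clears that bar on average but with high episode-to-episode variance.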