This repository contains the implementation of a Proximal Policy Optimization (PPO) agent trained to solve the LunarLander-v3 (Continuous) environment from Gymnasium. It was developed as part of a Reinforcement Learning assignment.
The project not only trains a baseline PPO agent but also includes a comprehensive hyperparameter tuning suite to analyze the effects of various configurations (Learning Rate, Gamma, Network Architecture, and Clipping Boundary) on the agent's performance.
lunar_lander/
│
├── config.py # Centralized configuration for environment and PPO baseline params
├── main.py # Main training script with argparse for hyperparameter tuning
├── utils.py # Custom SB3 Callbacks (Top-3 model saving) & plotting functions
├── run_experiments.sh # Bash script to automate the hyperparameter tuning pipeline
├── visualize_models.ipynb # Interactive script to load and render trained models
├── hyperparameter_report.tex # LaTeX source code for the analysis report
│
├── results/ # Auto-generated directory for outputs
│ ├── logs/ # Console logs for each experiment
│ ├── models/ # Saved .zip models (keeps top 3 per experiment)
│ └── plots/ # Training vs Evaluation learning curves
│
└── fig/ # Directory containing plots used in the LaTeX report
Ensure you have Python 3.8+ installed. It is recommended to use a virtual environment (e.g., Anaconda or venv).
- Clone the repository:
git clone git@github.com:MissionAC/lunar_lander.git
cd lunar_lander
- Install the required dependencies:
pip install gymnasium[box2d] stable-baselines3[extra] matplotlib
(Note: box2d is required for the LunarLander environment.)
You can train a single PPO agent using the default configurations defined in config.py:
python main.py
To override specific hyperparameters, use the command-line arguments:
python main.py --lr 0.001 --gamma 0.999 --clip_range 0.1 --net_arch 128 128
To reproduce the experiments for the hyperparameter analysis (Question 1B), run the bash script. This will sequentially train models with different configurations, log the console outputs, save the top 3 best-performing models for each run, and generate learning curves.
chmod +x run_experiments.sh
./run_experiments.sh
To watch the trained agent land the spacecraft, run the visualization script. It automatically scans the results/ directory for .zip models and allows you to select which one to render.
python visualize_models.py
(Note: If you are running this on WSL/Linux without an audio driver, the script automatically suppresses ALSA audio warnings by setting a dummy audio driver.)
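As a rough illustration of the command-line overrides for main.py described above, the flags could be parsed and mapped onto Stable-Baselines3 PPO keyword arguments along the following lines. This is a minimal sketch, not the repository's actual code: `BASELINE`, `build_parser`, and `to_ppo_kwargs` are hypothetical names, and the `lr` and `clip_range` defaults are SB3's own PPO defaults, assumed here because the values in config.py are not shown.

```python
import argparse

# Hypothetical stand-in for the baseline in config.py. gamma and net_arch
# follow the README; lr and clip_range are SB3's PPO defaults (an assumption).
BASELINE = {"lr": 3e-4, "gamma": 0.99, "clip_range": 0.2, "net_arch": [64, 64]}


def build_parser():
    """Build a parser exposing the four tunable hyperparameters."""
    parser = argparse.ArgumentParser(description="PPO on LunarLander (sketch)")
    parser.add_argument("--lr", type=float, default=BASELINE["lr"])
    parser.add_argument("--gamma", type=float, default=BASELINE["gamma"])
    parser.add_argument("--clip_range", type=float, default=BASELINE["clip_range"])
    # nargs="+" lets "--net_arch 128 128" become the list [128, 128].
    parser.add_argument("--net_arch", type=int, nargs="+", default=BASELINE["net_arch"])
    return parser


def to_ppo_kwargs(args):
    """Translate parsed flags into keyword arguments that SB3's PPO accepts."""
    return {
        "learning_rate": args.lr,
        "gamma": args.gamma,
        "clip_range": args.clip_range,
        "policy_kwargs": {"net_arch": list(args.net_arch)},
    }


# Parse an explicit argument list so the sketch runs without a real command line.
args = build_parser().parse_args(
    ["--lr", "0.001", "--gamma", "0.999", "--clip_range", "0.1", "--net_arch", "128", "128"]
)
ppo_kwargs = to_ppo_kwargs(args)
```

The resulting dictionary could then be passed on as `PPO("MlpPolicy", env, **ppo_kwargs)`, which keeps argument parsing separate from model construction.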
- Episode-based Logging: The assignment specifically requests plotting the X-axis by episode number (not timesteps). A custom EpisodeLoggerCallback was implemented to track and log rewards precisely at the end of each episode.
- Top-K Model Saving: The callback dynamically tracks the evaluation performance and saves only the Top 3 models for each hyperparameter configuration to save disk space while keeping the best policies.
- Smoothing: The generated plots include an Exponential Moving Average (EMA) smoothed curve for training rewards to better visualize the learning trend amidst high variance.
- Control Variable Analysis: The run_experiments.sh script employs a strict control variable approach, changing only one hyperparameter at a time against an optimized [64, 64] baseline to cleanly observe its effect.
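Two of the mechanisms listed above, EMA smoothing and Top-3 bookkeeping, can be sketched independently of Stable-Baselines3. The code below is an illustrative approximation rather than the contents of utils.py: `ema_smooth`, `TopKTracker`, the smoothing factor `alpha=0.1`, and the example file names are all assumptions made for the sake of the example.

```python
import heapq


def ema_smooth(values, alpha=0.1):
    """Exponential moving average; alpha is an assumed smoothing factor."""
    smoothed = []
    for v in values:
        if not smoothed:
            smoothed.append(float(v))  # seed the average with the first reward
        else:
            smoothed.append(alpha * v + (1 - alpha) * smoothed[-1])
    return smoothed


class TopKTracker:
    """Remember the k best (score, path) pairs seen so far."""

    def __init__(self, k=3):
        self.k = k
        self._heap = []  # min-heap: the worst retained score sits at index 0

    def offer(self, score, path):
        """Return (accepted, evicted_path); the caller deletes evicted_path."""
        if len(self._heap) < self.k:
            heapq.heappush(self._heap, (score, path))
            return True, None
        if score > self._heap[0][0]:
            # Swap out the current worst model for the new, better one.
            _, evicted = heapq.heapreplace(self._heap, (score, path))
            return True, evicted
        return False, None

    def best(self):
        """Retained models, best first."""
        return sorted(self._heap, reverse=True)


# Hypothetical evaluation scores and file names, for illustration only.
tracker = TopKTracker(k=3)
decisions = [
    tracker.offer(score, f"model_{i}.zip")
    for i, score in enumerate([10.0, 50.0, 30.0, 40.0, 5.0])
]
```

Inside a callback, a model would be written to disk only when `offer` accepts it, and the evicted file (if any) removed, so that at most three checkpoints exist per experiment.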
A detailed analysis of how hyperparameters (LR, Gamma, Capacity, Clip Range) affect the agent's performance is provided in the submitted PDF report. Below is a summary of the key findings:
- Network Capacity: Wider networks (e.g., [128, 128]) significantly improve learning speed and stability compared to the default [64, 64].
- Gamma: While 0.999 offers theoretical long-term stability, it severely hinders early-stage learning (credit assignment problem). The baseline 0.99 is clearly more sample-efficient.
- Learning Rate & Clipping: Aggressive learning rates (0.001) cause catastrophic forgetting, while overly tight clipping (0.1) restricts exploration and slows down convergence.
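Some intuition for the gamma and clipping findings above can be obtained from two small formulas. The sketch below uses the common rule of thumb that a discount factor gamma corresponds to an effective horizon of roughly 1 / (1 - gamma) steps, together with the per-sample PPO clipped surrogate term. Both functions are illustrative helpers with hypothetical names; neither is taken from the repository or the report, and the horizon figure is a heuristic rather than a measured result.

```python
def effective_horizon(gamma):
    """Rule-of-thumb number of future steps that matter under discount gamma."""
    return 1.0 / (1.0 - gamma)


def clipped_surrogate(ratio, advantage, clip_range):
    """Per-sample PPO objective: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    clipped_ratio = max(1.0 - clip_range, min(ratio, 1.0 + clip_range))
    return min(ratio * advantage, clipped_ratio * advantage)


# gamma = 0.99 looks about 100 steps ahead; gamma = 0.999 about 1000 steps,
# so each reward is spread over roughly ten times as many preceding actions.
horizon_baseline = effective_horizon(0.99)
horizon_long = effective_horizon(0.999)

# With a probability ratio of 1.5 and a positive advantage of 2.0, a tighter
# clip range caps the objective (and hence the update signal) at a lower value.
objective_tight = clipped_surrogate(1.5, 2.0, clip_range=0.1)
objective_default = clipped_surrogate(1.5, 2.0, clip_range=0.2)
```

The tighter clip range yields the smaller objective value for the same sample, which is consistent with the slower convergence reported for a clip range of 0.1.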