# PPO vs GRPO on MountainCarContinuous-v0

## Introduction

This notebook compares two reinforcement learning algorithms on the **MountainCarContinuous-v0** environment:

- **PPO (Proximal Policy Optimization)**: A popular policy gradient method from `stable-baselines3` that uses a value function (critic) and clipped policy gradients for stable training.

- **GRPO (Group Relative Policy Optimization)**: An algorithm implemented in this repository (`sb3_contrib`) that scores each sample relative to the other samples in its group instead of against a learned value function. Advantages are normalized within each group, which is intended to give stable gradient estimates without a critic network.

### Environment: MountainCarContinuous-v0
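The goal is to drive an underpowered car up a steep hill. The engine is too weak to climb the slope directly, so the car must build momentum by rocking back and forth. The observation consists of the car's position and velocity, and the action is a single continuous force in [-1, 1]. The reward is sparse: +100 for reaching the goal, minus a small penalty proportional to the squared action at every step.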
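
To make the momentum requirement concrete, here is a dependency-free sketch of the car's dynamics. The constants and update rule are assumptions modeled on Gymnasium's implementation, and `simulate`, `always_right`, and `rock_with_velocity` are hypothetical helpers that are not used elsewhere in this notebook. In this sketch, pushing right at full force stalls on the slope, while pushing in the direction of the current velocity reaches the goal.

```python
import math

# Constants modeled on Gymnasium's MountainCarContinuous-v0 (assumed values).
POWER = 0.0015
MAX_SPEED = 0.07
MIN_POS, MAX_POS = -1.2, 0.6
GOAL_POS = 0.45


def step(position, velocity, action):
    """Advance the car by one time step for a force `action` in [-1, 1]."""
    force = max(-1.0, min(1.0, action))
    velocity += force * POWER - 0.0025 * math.cos(3 * position)
    velocity = max(-MAX_SPEED, min(MAX_SPEED, velocity))
    position = max(MIN_POS, min(MAX_POS, position + velocity))
    if position == MIN_POS and velocity < 0:
        velocity = 0.0  # the left wall stops the car
    return position, velocity


def simulate(policy, max_steps=999):
    """Run one episode from near the valley floor; return (reached_goal, steps)."""
    position, velocity = -0.5, 0.0
    for t in range(1, max_steps + 1):
        position, velocity = step(position, velocity, policy(position, velocity))
        if position >= GOAL_POS:
            return True, t
    return False, max_steps


def always_right(position, velocity):
    """Naive policy: full throttle toward the goal."""
    return 1.0


def rock_with_velocity(position, velocity):
    """Push in the direction the car is already moving to add energy."""
    return 1.0 if velocity >= 0 else -1.0


print("always_right:", simulate(always_right))
print("rock_with_velocity:", simulate(rock_with_velocity))
```

Neither policy is learned; they only illustrate why an agent has to discover the rocking strategy before it ever sees the +100 reward.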

### Outcome
By the end of this notebook, you will see:
1. Training curves comparing episodic rewards for PPO and GRPO
2. A side-by-side video of both trained agents for direct visual comparison

## 1) Setup and Imports

In [None]:
import os
import gymnasium as gym
import numpy as np
import matplotlib.pyplot as plt

# stable-baselines3 imports
from stable_baselines3 import PPO
from stable_baselines3.common.evaluation import evaluate_policy
from stable_baselines3.common.monitor import Monitor
from stable_baselines3.common.callbacks import BaseCallback

# sb3_contrib imports (GRPO from this repository)
from sb3_contrib import GRPO

# Video recording
from gymnasium.wrappers import RecordVideo

# Video composition (MoviePy 1.x API; `moviepy.editor` was removed in MoviePy 2.0)
from moviepy.editor import VideoFileClip, clips_array

# For displaying video in notebook
from IPython.display import HTML

In [None]:
# Print versions for reproducibility
import stable_baselines3
import sb3_contrib

print(f"gymnasium version: {gym.__version__}")
print(f"stable-baselines3 version: {stable_baselines3.__version__}")
print(f"sb3_contrib version: {sb3_contrib.__version__}")

## 2) Environment Helpers

In [None]:
from typing import List, Optional


def make_env(seed: Optional[int] = None) -> gym.Env:
    """
    Create a MountainCarContinuous-v0 environment.
    
    Args:
        seed: Optional random seed for reproducibility.
        
    Returns:
        The created gymnasium environment.
    """
    env = gym.make("MountainCarContinuous-v0")
    if seed is not None:
        env.reset(seed=seed)
    return env

## 3) Training Configuration

In [None]:
# Training configuration
TOTAL_TIMESTEPS = 300_000  # Total environment steps for training (suitable for a demo run)
SEED = 0                   # Random seed for reproducibility
N_EVAL_EPISODES = 10       # Number of episodes for final evaluation
VIDEO_EPISODES = 1         # Number of episodes to record for video

# Ensure output directories exist
os.makedirs("examples/models", exist_ok=True)
os.makedirs("examples/videos", exist_ok=True)

## 4) Callback for Logging Training Rewards

We create a simple callback to track episodic rewards during training.
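
The callback relies on the `Monitor` wrapper, which adds an `episode` entry to the `info` dict on the step where an episode ends, with `r` holding the episode return and `l` its length. The sketch below shows that lookup in isolation. `collect_finished_episodes` and `mock_infos` are hypothetical names, and the numbers are made up for illustration.

```python
from typing import Dict, List, Tuple


def collect_finished_episodes(infos: List[Dict]) -> List[Tuple[float, int]]:
    """Return (return, length) for every info dict that carries episode stats."""
    return [
        (info["episode"]["r"], info["episode"]["l"])
        for info in infos
        if "episode" in info
    ]


# Hand-written stand-ins for the per-environment `infos` list seen in `_on_step`.
mock_infos = [
    {},  # episode still running: nothing to log
    {"episode": {"r": 93.4, "l": 212, "t": 1.7}},  # episode just finished
]
print(collect_finished_episodes(mock_infos))  # [(93.4, 212)]
```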

In [None]:
class RewardLoggerCallback(BaseCallback):
    """
    Callback for logging episode rewards during training.
    """
    
    def __init__(self, verbose: int = 0):
        super().__init__(verbose)
        self.episode_rewards: List[float] = []
        self.episode_lengths: List[int] = []
        self.timesteps: List[int] = []
        
    def _on_step(self) -> bool:
        # Check if any episode has finished
        if self.locals.get("infos"):
            for info in self.locals["infos"]:
                if "episode" in info:
                    self.episode_rewards.append(info["episode"]["r"])
                    self.episode_lengths.append(info["episode"]["l"])
                    self.timesteps.append(self.num_timesteps)
        return True

## 5) Training PPO Baseline

**PPO (Proximal Policy Optimization)** is a popular on-policy algorithm that:
- Uses a value function (critic) to estimate expected returns
- Uses clipped policy gradients to prevent large policy updates
- Is known for stable training across many environments
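
The clipping in the second bullet can be sketched for a single sample as follows. This is a minimal illustration of the standard clipped surrogate, not the `stable-baselines3` code. `clipped_surrogate` is a hypothetical helper, and `clip_range=0.2` is assumed here because it is PPO's usual default.

```python
def clipped_surrogate(ratio: float, advantage: float, clip_range: float = 0.2) -> float:
    """Per-sample PPO objective: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    clipped_ratio = max(1.0 - clip_range, min(1.0 + clip_range, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)


# A positive advantage stops paying off once the ratio exceeds 1 + clip_range.
print(f"{clipped_surrogate(ratio=1.5, advantage=2.0):.2f}")   # 2.40
# With a negative advantage, moving the ratio the wrong way is fully penalized.
print(f"{clipped_surrogate(ratio=1.5, advantage=-2.0):.2f}")  # -3.00
# Lowering the ratio below 1 - clip_range earns no further credit.
print(f"{clipped_surrogate(ratio=0.5, advantage=-2.0):.2f}")  # -1.60
```

Here `ratio` is the probability of the action under the new policy divided by its probability under the policy that collected the data.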

In [None]:
# Create environment with Monitor wrapper to track episode stats
ppo_env = Monitor(make_env(SEED))

# Create PPO model
ppo_model = PPO(
    "MlpPolicy",
    ppo_env,
    seed=SEED,
    verbose=1,
)

# Create callback for logging rewards
ppo_callback = RewardLoggerCallback()

# Train PPO
print("Training PPO...")
ppo_model.learn(
    total_timesteps=TOTAL_TIMESTEPS,
    callback=ppo_callback,
    progress_bar=True,
)

# Save the trained model
ppo_model.save("examples/models/ppo_mountaincar")
print("PPO model saved to examples/models/ppo_mountaincar.zip")

In [None]:
# Evaluate PPO
# Different seed for evaluation; Monitor records the episode returns that evaluate_policy reports
ppo_eval_env = Monitor(make_env(SEED + 100))
ppo_mean_reward, ppo_std_reward = evaluate_policy(
    ppo_model, ppo_eval_env, n_eval_episodes=N_EVAL_EPISODES, deterministic=True
)
print(f"PPO Evaluation: Mean reward = {ppo_mean_reward:.2f} +/- {ppo_std_reward:.2f}")
ppo_eval_env.close()

# Store training rewards for plotting
ppo_rewards = np.array(ppo_callback.episode_rewards)
ppo_timesteps = np.array(ppo_callback.timesteps)

## 6) Training GRPO

**GRPO (Group Relative Policy Optimization)** is an algorithm that:
- Uses relative performance within a group instead of a learned value function
- Normalizes advantages within groups of samples for stable gradient estimates
- Reduces computational overhead by eliminating the critic network
- Uses KL divergence regularization to prevent policy collapse
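
The group-relative normalization in the first two bullets can be sketched as follows. How this repository's GRPO forms its groups is not shown in this notebook, so the sketch assumes a group is simply a list of returns. `group_relative_advantages`, `eps`, and `group_returns` are hypothetical names.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(returns: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize each return against the mean and std of its own group."""
    mu = mean(returns)
    sigma = pstdev(returns)
    # eps keeps the division finite when every return in the group is equal.
    return [(r - mu) / (sigma + eps) for r in returns]


group_returns = [90.0, 95.0, -5.0, 0.0]  # made-up returns for one group
advantages = group_relative_advantages(group_returns)
print([round(a, 3) for a in advantages])
```

Samples that beat their group's average get a positive advantage and the rest get a negative one, so no value network is needed to provide a baseline.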

In [None]:
# Create environment with Monitor wrapper to track episode stats
grpo_env = Monitor(make_env(SEED))

# Create GRPO model
grpo_model = GRPO(
    "MlpPolicy",
    grpo_env,
    seed=SEED,
    verbose=1,
)

# Create callback for logging rewards
grpo_callback = RewardLoggerCallback()

# Train GRPO
print("Training GRPO...")
grpo_model.learn(
    total_timesteps=TOTAL_TIMESTEPS,
    callback=grpo_callback,
    progress_bar=True,
)

# Save the trained model
grpo_model.save("examples/models/grpo_mountaincar")
print("GRPO model saved to examples/models/grpo_mountaincar.zip")

In [None]:
# Evaluate GRPO
# Different seed for evaluation; Monitor records the episode returns that evaluate_policy reports
grpo_eval_env = Monitor(make_env(SEED + 100))
grpo_mean_reward, grpo_std_reward = evaluate_policy(
    grpo_model, grpo_eval_env, n_eval_episodes=N_EVAL_EPISODES, deterministic=True
)
print(f"GRPO Evaluation: Mean reward = {grpo_mean_reward:.2f} +/- {grpo_std_reward:.2f}")
grpo_eval_env.close()

# Store training rewards for plotting
grpo_rewards = np.array(grpo_callback.episode_rewards)
grpo_timesteps = np.array(grpo_callback.timesteps)

## 7) Training Curve Comparison

In [None]:
def smooth_rewards(rewards: np.ndarray, window: int = 10) -> np.ndarray:
    """
    Apply a simple moving average to smooth reward curves.
    
    Args:
        rewards: Array of episode rewards.
        window: Window size for moving average.
        
    Returns:
        Smoothed rewards array.
    """
    if len(rewards) < window:
        return rewards
    return np.convolve(rewards, np.ones(window) / window, mode="valid")


# Create figure for comparison
fig, ax = plt.subplots(figsize=(12, 6))

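# Plot PPO training curve
if len(ppo_rewards) > 0:
    ppo_smoothed = smooth_rewards(ppo_rewards)
    # "valid" smoothing drops the first window - 1 points, so shift the x-axis
    # to align each smoothed value with the last episode in its window.
    ppo_offset = len(ppo_rewards) - len(ppo_smoothed)
    ax.plot(
        range(ppo_offset, len(ppo_rewards)),
        ppo_smoothed,
        label="PPO",
        color="blue",
        alpha=0.8,
    )
    # Plot raw rewards with transparency
    ax.plot(
        range(len(ppo_rewards)),
        ppo_rewards,
        color="blue",
        alpha=0.2,
    )

# Plot GRPO training curve
if len(grpo_rewards) > 0:
    grpo_smoothed = smooth_rewards(grpo_rewards)
    # Same x-axis shift as for PPO so both smoothed curves line up with their raw data.
    grpo_offset = len(grpo_rewards) - len(grpo_smoothed)
    ax.plot(
        range(grpo_offset, len(grpo_rewards)),
        grpo_smoothed,
        label="GRPO",
        color="red",
        alpha=0.8,
    )
    # Plot raw rewards with transparency
    ax.plot(
        range(len(grpo_rewards)),
        grpo_rewards,
        color="red",
        alpha=0.2,
    )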

ax.set_xlabel("Episode")
ax.set_ylabel("Episodic Reward")
ax.set_title("PPO vs GRPO on MountainCarContinuous-v0")
ax.legend()
ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()
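
A moving average computed with `mode="valid"` returns `len(rewards) - window + 1` points, so the smoothed curve is shorter than the raw one. The dependency-free sketch below shows the same computation on a tiny made-up series, along with the episode index each smoothed point summarizes. `moving_average` and the sample values are illustrative only.

```python
from typing import List


def moving_average(values: List[float], window: int) -> List[float]:
    """Plain-Python equivalent of a 'valid' moving average."""
    if len(values) < window:
        return list(values)
    return [
        sum(values[i : i + window]) / window
        for i in range(len(values) - window + 1)
    ]


rewards = [0.0, 0.0, 10.0, 20.0, 30.0, 40.0]
window = 3
smoothed = moving_average(rewards, window)
# Each smoothed point summarizes the window that ends at this episode index.
episode_index = list(range(len(rewards) - len(smoothed), len(rewards)))

print(len(smoothed))   # 4
print(episode_index)   # [2, 3, 4, 5]
```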

## 8) Video Recording and Side-by-Side Visualization

Now we record videos of both trained agents and create a side-by-side comparison.

In [None]:
def record_video(model, model_name: str, video_dir: str) -> str:
    """
    Record a video of the trained agent.
    
    Args:
        model: The trained RL model.
        model_name: Name of the model (for file naming).
        video_dir: Directory to save the video.
        
    Returns:
        Path to the recorded video file.
    """
    # Create video directory
    video_folder = os.path.join(video_dir, model_name)
    os.makedirs(video_folder, exist_ok=True)
    
    # Create environment with video recorder
    env = gym.make("MountainCarContinuous-v0", render_mode="rgb_array")
    env = RecordVideo(
        env,
        video_folder=video_folder,
        episode_trigger=lambda x: True,  # Record all episodes
        name_prefix=model_name,
    )
    
    # Run episodes
    for episode in range(VIDEO_EPISODES):
        obs, info = env.reset()
        done = False
        while not done:
            action, _ = model.predict(obs, deterministic=True)
            obs, reward, terminated, truncated, info = env.step(action)
            done = terminated or truncated
    
    env.close()
    
    # Find the generated video file (pick the newest in case earlier runs left files behind)
    video_files = [
        os.path.join(video_folder, f)
        for f in os.listdir(video_folder)
        if f.endswith(".mp4")
    ]
    if video_files:
        return max(video_files, key=os.path.getmtime)
    return ""

In [None]:
# Record videos for both models
print("Recording PPO video...")
ppo_video_path = record_video(ppo_model, "ppo_mountaincar", "examples/videos")
print(f"PPO video saved to: {ppo_video_path}")

print("\nRecording GRPO video...")
grpo_video_path = record_video(grpo_model, "grpo_mountaincar", "examples/videos")
print(f"GRPO video saved to: {grpo_video_path}")

## 9) Side-by-Side Video Composition

Using moviepy, we create a side-by-side video comparison:
- **Left**: PPO agent
- **Right**: GRPO agent

This allows a direct visual comparison of how each trained agent behaves on the MountainCar task.

In [None]:
# Load the video clips
ppo_clip = VideoFileClip(ppo_video_path)
grpo_clip = VideoFileClip(grpo_video_path)

# Resize to same height if needed
target_height = min(ppo_clip.h, grpo_clip.h)
ppo_clip_resized = ppo_clip.resize(height=target_height)
grpo_clip_resized = grpo_clip.resize(height=target_height)

# Make clips the same duration (use the shorter one)
min_duration = min(ppo_clip_resized.duration, grpo_clip_resized.duration)
ppo_clip_resized = ppo_clip_resized.subclip(0, min_duration)
grpo_clip_resized = grpo_clip_resized.subclip(0, min_duration)

# Create side-by-side composition
side_by_side = clips_array([[ppo_clip_resized, grpo_clip_resized]])

# Save the composite video
output_path = "examples/videos/ppo_vs_grpo_mountaincar_side_by_side.mp4"
side_by_side.write_videofile(output_path, fps=30, codec="libx264")

# Close clips
ppo_clip.close()
grpo_clip.close()

print(f"\nSide-by-side video saved to: {output_path}")

## 10) Display Side-by-Side Video

The video below shows:
- **Left**: PPO agent on MountainCarContinuous-v0
- **Right**: GRPO agent on MountainCarContinuous-v0

In [None]:
# Display the side-by-side video in the notebook
video_html = f"""
<video width="800" controls>
    <source src="{output_path}" type="video/mp4">
    Your browser does not support the video tag.
</video>
"""

HTML(video_html)

## Summary

This notebook compared PPO and GRPO on the MountainCarContinuous-v0 environment.

### Key Differences:

| Aspect | PPO | GRPO |
|--------|-----|------|
| Value Function | Uses a learned critic | No critic needed |
| Advantage Estimation | GAE with value function | Group-relative normalization |
| Computational Cost | Higher (actor and critic networks) | Lower (actor network only) |
| Regularization | Clipped policy gradients | KL divergence + clipping |
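
As a rough illustration of the "Computational Cost" row, the sketch below counts the trainable weights of an actor and a critic for this environment. It assumes separate actor and critic MLPs with two hidden layers of 64 units and ignores the policy's learned log-standard-deviation parameter. `mlp_param_count` is a hypothetical helper, and the count says nothing about rollout or update cost.

```python
from typing import List


def mlp_param_count(layer_sizes: List[int]) -> int:
    """Weights plus biases of a fully connected network with the given layer widths."""
    return sum(
        n_in * n_out + n_out
        for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])
    )


OBS_DIM, ACTION_DIM = 2, 1  # (position, velocity) in, one force out
HIDDEN = [64, 64]           # assumed hidden layer sizes

actor = mlp_param_count([OBS_DIM, *HIDDEN, ACTION_DIM])
critic = mlp_param_count([OBS_DIM, *HIDDEN, 1])
print(f"actor: {actor}, critic: {critic}, actor + critic: {actor + critic}")
# actor: 4417, critic: 4417, actor + critic: 8834
```

Under these assumptions, dropping the critic roughly halves the number of parameters to train.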

### Files Generated:
- `examples/models/ppo_mountaincar.zip` - Trained PPO model
- `examples/models/grpo_mountaincar.zip` - Trained GRPO model
- `examples/videos/ppo_mountaincar/` - PPO agent video
- `examples/videos/grpo_mountaincar/` - GRPO agent video
- `examples/videos/ppo_vs_grpo_mountaincar_side_by_side.mp4` - Side-by-side comparison

In [None]:
# Cleanup: Close training environments
ppo_env.close()
grpo_env.close()

print("All environments closed.")