# Reinforcement Learning with Gymnasium & Stable-Baselines3

## 🎯 Learning Objectives

In this notebook, you'll learn:
1. **Gymnasium API fundamentals** - How to create, interact with, and understand RL environments
2. **Environment visualization** - Render and visualize agent behavior
3. **Complete training pipeline** - Train a PPO agent using Stable-Baselines3
4. **Evaluation & comparison** - Compare random vs. trained agents

---

## 📚 What is Reinforcement Learning?

**Reinforcement Learning (RL)** is a type of machine learning where an **agent** learns to make decisions by interacting with an **environment**:

```
     ┌─────────┐
     │  Agent  │
     └────┬────┘
          │ action
          ↓
   ┌─────────────┐
   │ Environment │
   └──────┬──────┘
          │ observation, reward
          ↓
```

- **Agent**: The learner/decision maker
- **Environment**: The world the agent interacts with
- **Action**: What the agent does
- **Observation**: What the agent sees
- **Reward**: Feedback signal (positive or negative)

The agent's goal is to learn a **policy** (a strategy) that maximizes cumulative reward over time.
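
"Cumulative reward" is usually formalized as the *discounted return*: each reward is weighted by `gamma**t`, where `t` is how many steps in the future it arrives. The sketch below uses plain Python; the helper name `discounted_return` is illustrative and not part of Gymnasium or Stable-Baselines3.

```python
def discounted_return(rewards, gamma=0.99):
    """Sum of rewards, each weighted by gamma**t for its time step t."""
    total = 0.0
    # Iterate backwards so each step computes G_t = r_t + gamma * G_{t+1}
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total

# Three steps with a reward of 1 each
print(discounted_return([1.0, 1.0, 1.0], gamma=0.5))  # 1 + 0.5 + 0.25 = 1.75
```

A `gamma` close to 1 makes the agent value long-term reward almost as much as immediate reward; a smaller `gamma` makes it more short-sighted.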

## 🔧 Setup & Installation

First, let's install the required packages. We'll need:
- **gymnasium**: The environment API (successor to OpenAI Gym)
- **stable-baselines3**: RL algorithms implementation (PyTorch-based)
- **imageio**: For creating GIFs of episodes
- **matplotlib**: For plotting results

In [None]:
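# Install required packages
# (tensorboard is needed for `tensorboard_log`; tqdm and rich for `progress_bar=True`)
!pip install "gymnasium[classic-control,box2d]" stable-baselines3 tensorboard tqdm rich imageio imageio-ffmpeg matplotlib numpy --quiet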

print("✅ Installation complete!")

In [None]:
# Import all necessary libraries
import gymnasium as gym
import numpy as np
import matplotlib.pyplot as plt
from IPython.display import Image, display
import imageio
from stable_baselines3 import PPO
from stable_baselines3.common.evaluation import evaluate_policy
from stable_baselines3.common.callbacks import EvalCallback
import warnings
warnings.filterwarnings('ignore')

# Set random seeds for reproducibility
SEED = 42
np.random.seed(SEED)

print(f"Gymnasium version: {gym.__version__}")
print("✅ All imports successful!")

---

# 1️⃣ Gymnasium API Fundamentals

## Creating an Environment with `gym.make()`

The first step in any RL project is creating an environment. Gymnasium provides a unified interface for many environments.

### Key Parameters:
- **`env_id`**: The name of the environment (e.g., "CartPole-v1")
- **`render_mode`**: How to visualize the environment
  - `None`: No rendering (fastest, for training)
  - `"human"`: Real-time visualization window
  - `"rgb_array"`: Returns images (for creating videos/GIFs)

In [None]:
# Create our first environment - CartPole
env = gym.make("CartPole-v1", render_mode="rgb_array")

print(f"Environment: {env.spec.id}")
print(f"Render mode: {env.render_mode}")
print("\n✅ Environment created successfully!")

## Understanding Environment Spaces

Every environment has two important spaces:

### 1. **Observation Space** (`env.observation_space`)
- Defines what the agent can observe
- Different types: `Box` (continuous), `Discrete` (categorical), `MultiDiscrete`, etc.

### 2. **Action Space** (`env.action_space`)
- Defines what actions the agent can take
- Determines the output of your policy network

In [None]:
# Examine the observation space
print("=== OBSERVATION SPACE ===")
print(f"Type: {type(env.observation_space).__name__}")
print(f"Shape: {env.observation_space.shape}")
print(f"High bounds: {env.observation_space.high}")
print(f"Low bounds: {env.observation_space.low}")
print(f"Data type: {env.observation_space.dtype}")

print("\n=== ACTION SPACE ===")
print(f"Type: {type(env.action_space).__name__}")
print(f"Number of actions: {env.action_space.n}")
print(f"Actions: 0 = Push cart left, 1 = Push cart right")

# Sample random observations and actions
print("\n=== SAMPLING FROM SPACES ===")
print(f"Random observation sample: {env.observation_space.sample()}")
print(f"Random action sample: {env.action_space.sample()}")

### 🧠 Understanding CartPole Observations

The observation is a 4-dimensional vector:
1. **Cart Position**: Horizontal position of the cart
2. **Cart Velocity**: Speed of the cart
3. **Pole Angle**: Angle of the pole (in radians)
4. **Pole Angular Velocity**: How fast the pole is rotating

**Goal**: Keep the pole balanced (upright) by moving the cart left or right!
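
What "balanced" means can be made concrete. According to the Gymnasium documentation for CartPole-v1, an episode terminates when the pole tilts more than 12° (about 0.2095 rad) from vertical or the cart moves more than 2.4 units from the center, and it is truncated after 500 steps. The helper below is a hypothetical restatement of that rule for intuition only; the environment performs this check internally.

```python
import math

# Thresholds as documented for CartPole-v1 (restated here, not read from the env)
ANGLE_LIMIT_RAD = 12 * 2 * math.pi / 360   # 12 degrees, about 0.2095 rad
POSITION_LIMIT = 2.4

def pole_has_fallen(observation):
    """Return True if the observation is outside the documented limits."""
    cart_position, _cart_velocity, pole_angle, _pole_angular_velocity = observation
    return abs(cart_position) > POSITION_LIMIT or abs(pole_angle) > ANGLE_LIMIT_RAD

print(pole_has_fallen([0.0, 0.0, 0.05, 0.0]))  # False: nearly upright, near the center
print(pole_has_fallen([0.0, 0.0, 0.30, 0.0]))  # True: pole tilted past 12 degrees
print(pole_has_fallen([2.5, 0.0, 0.00, 0.0]))  # True: cart left the track
```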

## 🧪 Exercise 1: Your Turn to Practice!

Now it's your turn! Complete the following code by filling in the blanks. This will help you understand the Gymnasium API.

**Your Task**: Complete the `my_first_episode()` function below.

```python
def my_first_episode():
    """
    Run one episode and collect statistics.
    
    TODO: Fill in the missing code!
    """
    # 1. Create the CartPole environment with render_mode="rgb_array"
    env = gym.make(___________)
    
    # 2. Reset the environment with seed=42
    observation, info = env.___________
    
    total_reward = 0
    steps = 0
    done = False
    
    while not done:
        # 3. Sample a random action from the action space
        action = env.action_space.___________
        
        # 4. Take the action in the environment
        observation, reward, terminated, truncated, info = env.___________(action)
        
        # 5. Update total reward
        total_reward += ___________
        steps += 1
        
        # 6. Check if episode is done (terminated OR truncated)
        done = ___________ or ___________
    
    # 7. Don't forget to close the environment!
    env.___________
    
    return total_reward, steps

# Test your function
# reward, steps = my_first_episode()
# print(f"Episode reward: {reward}, Steps: {steps}")
```

**Hints:**
- Step 1: Environment name is "CartPole-v1"
- Step 2: Method to start a new episode
- Step 3: Method to get a random action
- Step 4: Method to execute an action
- Step 5: What value should we add to total_reward?
- Step 6: Episode is done when either condition is True
- Step 7: Cleanup method

Try it yourself in the cell below! 👇

In [None]:
# YOUR CODE HERE - Complete the function!

def my_first_episode():
    """
    Run one episode and collect statistics.
    
    TODO: Fill in the missing code!
    """
    # 1. Create the CartPole environment with render_mode="rgb_array"
    env = gym.make("___________", render_mode="rgb_array")  # FILL THIS
    
    # 2. Reset the environment with seed=42
    observation, info = env.___________(seed=42)  # FILL THIS
    
    total_reward = 0
    steps = 0
    done = False
    
    while not done:
        # 3. Sample a random action from the action space
        action = env.action_space.___________()  # FILL THIS
        
        # 4. Take the action in the environment
        observation, reward, terminated, truncated, info = env.___________(action)  # FILL THIS
        
        # 5. Update total reward
        total_reward += ___________  # FILL THIS
        steps += 1
        
        # 6. Check if episode is done (terminated OR truncated)
        done = ___________ or ___________  # FILL THIS
    
    # 7. Don't forget to close the environment!
    env.___________()  # FILL THIS
    
    return total_reward, steps

# Uncomment to test your function
# reward, steps = my_first_episode()
# print(f"✅ Episode reward: {reward}, Steps: {steps}")

### ✅ Solution

<details>
<summary>Click here to see the solution (try it yourself first!)</summary>

```python
def my_first_episode():
    # 1. Create the CartPole environment
    env = gym.make("CartPole-v1", render_mode="rgb_array")
    
    # 2. Reset the environment
    observation, info = env.reset(seed=42)
    
    total_reward = 0
    steps = 0
    done = False
    
    while not done:
        # 3. Sample a random action
        action = env.action_space.sample()
        
        # 4. Take the action
        observation, reward, terminated, truncated, info = env.step(action)
        
        # 5. Update total reward
        total_reward += reward
        steps += 1
        
        # 6. Check if done
        done = terminated or truncated
    
    # 7. Close the environment
    env.close()
    
    return total_reward, steps
```

</details>

## The RL Interaction Loop: `reset()` and `step()`

These are the two most important methods in the Gymnasium API:

### `env.reset()` - Start a New Episode
Returns: `(observation, info)`
- **observation**: Initial state of the environment
- **info**: Additional diagnostic information (dict)

### `env.step(action)` - Take an Action
Returns: `(observation, reward, terminated, truncated, info)`
- **observation**: New state after taking the action
- **reward**: Immediate reward received (float)
- **terminated**: Episode ended naturally (e.g., goal reached or failed)
- **truncated**: Episode ended due to time limit
- **info**: Additional information

**Note**: An episode is done when `terminated or truncated` is `True`.
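
To make the difference between `terminated` and `truncated` concrete, here is a toy environment that follows the same five-value `step()` convention without depending on Gymnasium. The class `CountToTarget`, its parameters, and the `run` helper are invented for this illustration.

```python
class CountToTarget:
    """Toy environment: the state is a counter; action 1 increments it, action 0 does nothing."""

    def __init__(self, target=3, max_steps=5):
        self.target = target
        self.max_steps = max_steps

    def reset(self, seed=None):
        self.state = 0
        self.steps = 0
        return self.state, {}

    def step(self, action):
        self.steps += 1
        self.state += action
        terminated = self.state >= self.target    # the task itself is over
        truncated = self.steps >= self.max_steps  # the time limit cut the episode off
        reward = 1.0 if terminated else 0.0
        return self.state, reward, terminated, truncated, {}


def run(env, policy):
    """Run one episode and report which flag ended it."""
    obs, info = env.reset()
    while True:
        obs, reward, terminated, truncated, info = env.step(policy(obs))
        if terminated or truncated:
            return "terminated" if terminated else "truncated"

print(run(CountToTarget(), lambda obs: 1))  # terminated: target reached in 3 steps
print(run(CountToTarget(), lambda obs: 0))  # truncated: time limit hit after 5 steps
```

The distinction matters for learning algorithms: after a truncation the final state could still have led to future reward, whereas after a termination there is nothing left to collect.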

In [None]:
# Reset the environment to start a new episode
observation, info = env.reset(seed=SEED)

print("=== AFTER RESET ===")
print(f"Initial observation: {observation}")
print(f"Info: {info}")
print(f"Observation shape: {observation.shape}")

# Take a random action
action = env.action_space.sample()  # Random action (0 or 1)
print(f"\nTaking action: {action}")

# Execute the action
observation, reward, terminated, truncated, info = env.step(action)

print("\n=== AFTER STEP ===")
print(f"New observation: {observation}")
print(f"Reward received: {reward}")
print(f"Episode terminated?: {terminated}")
print(f"Episode truncated?: {truncated}")
print(f"Episode done?: {terminated or truncated}")
print(f"Info: {info}")

---

# 2️⃣ Visualizing Environments

## Rendering a Single Frame

Let's visualize what the environment looks like!

In [None]:
# Reset and render the initial state
env.reset(seed=SEED)
frame = env.render()  # Returns RGB array when render_mode="rgb_array"

# Display the frame
plt.figure(figsize=(8, 6))
plt.imshow(frame)
plt.title("CartPole-v1 Initial State", fontsize=14, fontweight='bold')
plt.axis('off')
plt.tight_layout()
plt.show()

print(f"Frame shape: {frame.shape} (height, width, RGB channels)")

## Running a Full Episode with Random Actions

Let's see how a **random agent** (taking random actions) performs:

In [None]:
def run_random_episode(env, seed=None, render=True):
    """
    Run one episode with random actions.
    
    Args:
        env: Gymnasium environment
        seed: Random seed for reproducibility
        render: Whether to collect frames for visualization
    
    Returns:
        total_reward: Sum of all rewards in the episode
        frames: List of RGB arrays (if render=True)
    """
    observation, info = env.reset(seed=seed)
    frames = []
    total_reward = 0
    steps = 0
    
    # Episode loop: continue until done
    done = False
    while not done:
        # Render and save frame
        if render:
            frames.append(env.render())
        
        # Take random action
        action = env.action_space.sample()
        
        # Execute action in environment
        observation, reward, terminated, truncated, info = env.step(action)
        
        # Accumulate reward
        total_reward += reward
        steps += 1
        
        # Check if episode is complete
        done = terminated or truncated
    
    print(f"Episode finished after {steps} steps")
    print(f"Total reward: {total_reward}")
    
    return total_reward, frames

# Run one episode
reward, frames = run_random_episode(env, seed=SEED)
print(f"\nCollected {len(frames)} frames")

## Creating a GIF Visualization

Let's create an animated GIF to see the agent in action!

In [None]:
def save_frames_as_gif(frames, filename, fps=30):
    """
    Save a list of frames as an animated GIF.
    
    Args:
        frames: List of RGB arrays
        filename: Output filename (e.g., 'episode.gif')
        fps: Frames per second
    """
    imageio.mimsave(filename, frames, fps=fps)
    print(f"✅ Saved GIF to: {filename}")

# Save the random agent episode
save_frames_as_gif(frames, '/tmp/random_agent_cartpole.gif')

# Display the GIF
display(Image(filename='/tmp/random_agent_cartpole.gif'))

### 📊 Baseline Performance: Random Agent

Let's evaluate the random agent over multiple episodes to establish a baseline:

In [None]:
# Run multiple episodes to get average performance
n_episodes = 100
random_rewards = []

for episode in range(n_episodes):
    reward, _ = run_random_episode(env, seed=episode, render=False)
    random_rewards.append(reward)

# Calculate statistics
mean_reward = np.mean(random_rewards)
std_reward = np.std(random_rewards)

print(f"\n{'='*50}")
print(f"RANDOM AGENT PERFORMANCE ({n_episodes} episodes)")
print(f"{'='*50}")
print(f"Mean reward: {mean_reward:.2f} ± {std_reward:.2f}")
print(f"Min reward: {min(random_rewards):.2f}")
print(f"Max reward: {max(random_rewards):.2f}")

# Visualize distribution
plt.figure(figsize=(10, 5))
plt.hist(random_rewards, bins=20, alpha=0.7, color='skyblue', edgecolor='black')
plt.axvline(mean_reward, color='red', linestyle='--', linewidth=2, label=f'Mean: {mean_reward:.2f}')
plt.xlabel('Episode Reward', fontsize=12)
plt.ylabel('Frequency', fontsize=12)
plt.title('Random Agent Performance Distribution - CartPole', fontsize=14, fontweight='bold')
plt.legend()
plt.grid(alpha=0.3)
plt.tight_layout()
plt.show()

---

# 3️⃣ Complete Training Pipeline with PPO

Now let's train an intelligent agent using **Proximal Policy Optimization (PPO)**!

## What is PPO?

**PPO** is a widely used policy-gradient reinforcement learning algorithm that:
- Is relatively easy to tune
- Works well across many environments
- Balances exploration and exploitation effectively
- Is the default choice for many RL practitioners

### Key Hyperparameters:
- **`policy`**: Network architecture ("MlpPolicy" = Multi-Layer Perceptron)
- **`learning_rate`**: How fast the agent learns (default: 3e-4)
- **`n_steps`**: Steps per environment per update (default: 2048)
- **`batch_size`**: Minibatch size (default: 64)
- **`n_epochs`**: Number of epochs per update (default: 10)
- **`gamma`**: Discount factor for future rewards (default: 0.99)
- **`verbose`**: Print training progress (0=none, 1=info)
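
These hyperparameters interact, and a little arithmetic shows how. The sketch below assumes a single (non-vectorized) environment, a budget of 50,000 timesteps (the value used in the training cell later in this notebook), and that SB3 collects whole rollouts of `n_steps` until the budget is reached. Treat the numbers as a back-of-the-envelope estimate rather than a guarantee about SB3 internals.

```python
import math

n_envs = 1              # assumption: a single, non-vectorized environment
n_steps = 2048
batch_size = 64
n_epochs = 10
total_timesteps = 50_000

rollout_size = n_steps * n_envs                         # transitions collected per update
minibatches_per_epoch = rollout_size // batch_size      # 2048 / 64
gradient_steps_per_update = minibatches_per_epoch * n_epochs
n_updates = math.ceil(total_timesteps / rollout_size)   # rollouts are collected whole

print(rollout_size)               # 2048
print(minibatches_per_epoch)      # 32
print(gradient_steps_per_update)  # 320
print(n_updates)                  # 25 (51,200 timesteps in total)
```

Choosing `batch_size` as a divisor of `n_steps * n_envs` avoids a smaller leftover minibatch at the end of each epoch.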

## Step 1: Create Training Environment

For training, we don't need rendering (it slows things down):

In [None]:
# Close previous environment
env.close()

# Create new environment for training (no rendering)
train_env = gym.make("CartPole-v1")

print("✅ Training environment created!")

## Step 2: Initialize the PPO Agent

In [None]:
# Create PPO model
model = PPO(
    policy="MlpPolicy",          # Use Multi-Layer Perceptron policy
    env=train_env,                # Training environment
    learning_rate=3e-4,           # Learning rate for optimizer
    n_steps=2048,                 # Steps per update
    batch_size=64,                # Minibatch size
    n_epochs=10,                  # Epochs per update
    gamma=0.99,                   # Discount factor
    verbose=1,                    # Print training info
    tensorboard_log="/home/rl/workspace/logs",  # TensorBoard logging directory
    seed=SEED                     # For reproducibility
)

print("\n✅ PPO agent initialized!")
print(f"Policy architecture: {model.policy}")
print(f"📊 TensorBoard logs will be saved to: /home/rl/workspace/logs")

## Step 3: Train the Agent

Now we train the agent for a specified number of timesteps. This is where the magic happens! 🎩✨

**Note**: Training typically takes a few minutes depending on your hardware.

In [None]:
# Train the agent
print("🚀 Starting training...\n")

# For CartPole, 50,000 timesteps is usually enough
model.learn(
    total_timesteps=50_000,
    progress_bar=True
)

print("\n✅ Training complete!")

### 📊 Monitoring Training with TensorBoard

During training, SB3 automatically logs important metrics to TensorBoard. Your logs are saved at: `/home/rl/workspace/logs`

**Key Metrics You'll See:**

1. **`rollout/ep_rew_mean`**: Average episode reward over time (learning curve)
2. **`rollout/ep_len_mean`**: Average episode length
3. **`train/entropy_loss`**: Policy entropy (exploration measure)
4. **`train/policy_gradient_loss`**: Policy gradient loss
5. **`train/value_loss`**: Value function loss
6. **`train/approx_kl`**: KL divergence (how much policy changed)
7. **`train/clip_fraction`**: Fraction of samples clipped by PPO
8. **`train/explained_variance`**: How well value function predicts returns

**What to Look For:**
- ✅ **`ep_rew_mean` should increase** - Agent is learning!
- ✅ **Losses should stabilize** - Training is converging
- ⚠️ **If `ep_rew_mean` plateaus too early** - Try different hyperparameters
- ⚠️ **If losses explode** - Reduce learning rate

Since TensorBoard is already running and monitoring `/home/rl/workspace/logs`, you can view these metrics in real-time! 📈
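
The `train/clip_fraction` metric is easier to read once you have seen PPO's clipped objective. The sketch below computes it per sample in plain Python; the function names are illustrative, `clip_range=0.2` is assumed to match the SB3 default, and SB3 itself works on batched tensors rather than Python lists.

```python
def clipped_surrogate(ratio, advantage, clip_range=0.2):
    """PPO objective for one sample: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    clipped_ratio = max(1.0 - clip_range, min(ratio, 1.0 + clip_range))
    return min(ratio * advantage, clipped_ratio * advantage)

def clip_fraction(ratios, clip_range=0.2):
    """Share of samples whose probability ratio moved outside [1 - eps, 1 + eps]."""
    outside = [abs(r - 1.0) > clip_range for r in ratios]
    return sum(outside) / len(outside)

# ratio = pi_new(a|s) / pi_old(a|s); 1.0 means the policy did not change for that sample
ratios = [1.0, 1.1, 1.5, 0.6]
print(clipped_surrogate(1.5, advantage=2.0))   # 2.4: the gain is capped at (1 + 0.2) * 2.0
print(clipped_surrogate(0.6, advantage=-1.0))  # -0.8: the more pessimistic of -0.6 and -0.8
print(clip_fraction(ratios))                   # 0.5: two of four ratios are outside the range
```

A clip fraction that stays high suggests the policy is trying to change more per update than the clip range allows, which is one reason to consider a lower learning rate.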

## Step 4: Save the Trained Model

Always save your trained models so you can reuse them later!

In [None]:
# Save the model
model_path = "/tmp/ppo_cartpole"
model.save(model_path)

print(f"✅ Model saved to: {model_path}")

# You can load it later with:
# loaded_model = PPO.load(model_path, env=train_env)

## Step 5: Evaluate the Trained Agent

Let's see how well our trained agent performs!

In [None]:
# Evaluate the agent
mean_reward, std_reward = evaluate_policy(
    model, 
    train_env, 
    n_eval_episodes=100,
    deterministic=True  # Use deterministic actions (no exploration)
)

print(f"\n{'='*50}")
print(f"TRAINED PPO AGENT PERFORMANCE (100 episodes)")
print(f"{'='*50}")
print(f"Mean reward: {mean_reward:.2f} ± {std_reward:.2f}")
print(f"\n🎯 Maximum possible reward in CartPole: 500")

## Step 6: Visualize Trained Agent Performance

In [None]:
def run_trained_episode(model, env, seed=None):
    """
    Run one episode with a trained agent.
    
    Args:
        model: Trained SB3 model
        env: Gymnasium environment (with render_mode="rgb_array")
        seed: Random seed
    
    Returns:
        total_reward: Episode reward
        frames: List of frames
    """
    observation, info = env.reset(seed=seed)
    frames = []
    total_reward = 0
    steps = 0
    done = False
    
    while not done:
        frames.append(env.render())
        
        # Use trained model to predict action (deterministic)
        action, _ = model.predict(observation, deterministic=True)
        
        observation, reward, terminated, truncated, info = env.step(action)
        total_reward += reward
        steps += 1
        done = terminated or truncated
    
    print(f"Episode finished after {steps} steps")
    print(f"Total reward: {total_reward}")
    
    return total_reward, frames

# Create environment with rendering
eval_env = gym.make("CartPole-v1", render_mode="rgb_array")

# Run trained agent
trained_reward, trained_frames = run_trained_episode(model, eval_env, seed=SEED)

# Save and display
save_frames_as_gif(trained_frames, '/tmp/trained_agent_cartpole.gif')
display(Image(filename='/tmp/trained_agent_cartpole.gif'))

eval_env.close()

## Step 7: Compare Random vs. Trained Agent

Let's visualize the improvement!

In [None]:
# Compare performance
comparison_data = {
    'Random Agent': np.mean(random_rewards),
    'Trained PPO': mean_reward
}

plt.figure(figsize=(10, 6))
bars = plt.bar(comparison_data.keys(), comparison_data.values(), 
               color=['lightcoral', 'lightgreen'], alpha=0.8, edgecolor='black', linewidth=2)

# Add value labels on bars
for bar in bars:
    height = bar.get_height()
    plt.text(bar.get_x() + bar.get_width()/2., height,
             f'{height:.2f}',
             ha='center', va='bottom', fontsize=14, fontweight='bold')

plt.ylabel('Mean Episode Reward', fontsize=12)
plt.title('Performance Comparison: Random vs. Trained Agent - CartPole', 
          fontsize=14, fontweight='bold')
plt.axhline(y=500, color='gold', linestyle='--', linewidth=2, label='Maximum Possible (500)')
plt.ylim(0, 550)
plt.legend(fontsize=10)
plt.grid(axis='y', alpha=0.3)
plt.tight_layout()
plt.show()

# Calculate improvement
improvement = ((mean_reward - np.mean(random_rewards)) / np.mean(random_rewards)) * 100
print(f"\n🎉 Improvement: {improvement:.1f}% better than random!")

---

# 4️⃣ Second Environment: LunarLander-v3

Let's apply the same pipeline to a more challenging environment!

## About LunarLander

- **Goal**: Land a spacecraft safely on the moon
- **Observation Space**: 8 dimensions (position, velocity, angle, etc.)
- **Action Space**: 4 discrete actions (do nothing, fire left, fire main, fire right)
- **Rewards**: 
  - +100 for landing safely
  - -100 for crashing
  - Penalties for fuel usage and distance from landing pad

In [None]:
# Create LunarLander environment
lunar_env = gym.make("LunarLander-v3", render_mode="rgb_array")

print("=== LUNARLANDER-V3 ENVIRONMENT ===")
print(f"Observation space: {lunar_env.observation_space}")
print(f"  - Shape: {lunar_env.observation_space.shape}")
print(f"  - 8 continuous values: x, y, vel_x, vel_y, angle, angular_vel, leg1_contact, leg2_contact")

print(f"\nAction space: {lunar_env.action_space}")
print(f"  - 0: Do nothing")
print(f"  - 1: Fire left engine")
print(f"  - 2: Fire main engine")
print(f"  - 3: Fire right engine")

# Visualize initial state
lunar_env.reset(seed=SEED)
frame = lunar_env.render()

plt.figure(figsize=(8, 6))
plt.imshow(frame)
plt.title("LunarLander-v3 Initial State", fontsize=14, fontweight='bold')
plt.axis('off')
plt.tight_layout()
plt.show()

## Baseline: Random Agent on LunarLander

In [None]:
# Evaluate random agent
print("Testing random agent on LunarLander...\n")

lunar_random_rewards = []
for episode in range(100):
    reward, _ = run_random_episode(lunar_env, seed=episode, render=False)
    lunar_random_rewards.append(reward)

lunar_random_mean = np.mean(lunar_random_rewards)
lunar_random_std = np.std(lunar_random_rewards)

print(f"\n{'='*50}")
print(f"RANDOM AGENT - LUNARLANDER (100 episodes)")
print(f"{'='*50}")
print(f"Mean reward: {lunar_random_mean:.2f} ± {lunar_random_std:.2f}")

# Show one episode
print("\nRecording one random episode...")
_, random_lunar_frames = run_random_episode(lunar_env, seed=SEED, render=True)
save_frames_as_gif(random_lunar_frames, '/tmp/random_lunar.gif', fps=20)
display(Image(filename='/tmp/random_lunar.gif'))

## Training PPO on LunarLander

LunarLander is more complex, so we'll train for more timesteps:

In [None]:
# Close rendering environment
lunar_env.close()

# Create training environment
lunar_train_env = gym.make("LunarLander-v3")

# Initialize PPO
lunar_model = PPO(
    policy="MlpPolicy",
    env=lunar_train_env,
    learning_rate=3e-4,
    n_steps=2048,
    batch_size=64,
    n_epochs=10,
    gamma=0.99,
    verbose=1,
    tensorboard_log="/home/rl/workspace/logs",  # TensorBoard logging
    seed=SEED
)

print("\n🚀 Training PPO on LunarLander...")
print("⏰ This will take a few minutes...")
print("📊 TensorBoard logs: /home/rl/workspace/logs\n")

# Train for more timesteps (LunarLander is harder)
lunar_model.learn(
    total_timesteps=300_000,
    progress_bar=True
)

print("\n✅ Training complete!")

# Save model
lunar_model.save("/tmp/ppo_lunar_lander")
print("✅ Model saved!")

## Evaluate Trained LunarLander Agent

In [None]:
# Evaluate
lunar_mean_reward, lunar_std_reward = evaluate_policy(
    lunar_model,
    lunar_train_env,
    n_eval_episodes=100,
    deterministic=True
)

print(f"\n{'='*50}")
print(f"TRAINED PPO - LUNARLANDER (100 episodes)")
print(f"{'='*50}")
print(f"Mean reward: {lunar_mean_reward:.2f} ± {lunar_std_reward:.2f}")
print(f"\n🎯 A score > 200 is considered solved!")

# Visualize trained agent
lunar_eval_env = gym.make("LunarLander-v3", render_mode="rgb_array")
print("\nRecording trained agent...")
_, trained_lunar_frames = run_trained_episode(lunar_model, lunar_eval_env, seed=SEED)
save_frames_as_gif(trained_lunar_frames, '/tmp/trained_lunar.gif', fps=20)
display(Image(filename='/tmp/trained_lunar.gif'))

lunar_eval_env.close()
lunar_train_env.close()

## LunarLander Performance Comparison

In [None]:
# Compare LunarLander performance
lunar_comparison = {
    'Random Agent': lunar_random_mean,
    'Trained PPO': lunar_mean_reward
}

plt.figure(figsize=(10, 6))
bars = plt.bar(lunar_comparison.keys(), lunar_comparison.values(),
               color=['lightcoral', 'lightgreen'], alpha=0.8, edgecolor='black', linewidth=2)

for bar in bars:
    height = bar.get_height()
    plt.text(bar.get_x() + bar.get_width()/2., height,
             f'{height:.2f}',
             ha='center', va='bottom', fontsize=14, fontweight='bold')

plt.ylabel('Mean Episode Reward', fontsize=12)
plt.title('Performance Comparison: Random vs. Trained Agent - LunarLander',
          fontsize=14, fontweight='bold')
plt.axhline(y=200, color='gold', linestyle='--', linewidth=2, label='Solved Threshold (200)')
plt.legend(fontsize=10)
plt.grid(axis='y', alpha=0.3)
plt.tight_layout()
plt.show()

# Calculate improvement
if lunar_random_mean < 0:
    improvement = lunar_mean_reward - lunar_random_mean
    print(f"\n🎉 Improvement: +{improvement:.1f} points from random baseline!")
else:
    improvement = ((lunar_mean_reward - lunar_random_mean) / abs(lunar_random_mean)) * 100
    print(f"\n🎉 Improvement: {improvement:.1f}% better than random!")

---

# 5️⃣ Creating Custom Gymnasium Environments

Sometimes you need to create your own environment for a specific problem. Let's learn how!

## 🎮 Problem Description: Simple GridWorld

We'll create a simple grid world environment where:

### Environment Rules:
- **Grid Size**: 5x5 grid
- **Agent**: Starts at position (0, 0) - top-left corner
- **Goal**: Reach position (4, 4) - bottom-right corner
- **Actions**: 4 possible actions
  - 0: Move UP
  - 1: Move DOWN
  - 2: Move LEFT
  - 3: Move RIGHT
- **Rewards**:
  - +10 for reaching the goal
  - -1 for each step (encourages finding shortest path)
  - -5 for hitting walls (trying to move outside grid)
- **Episode End**: The episode terminates when the agent reaches the goal and is truncated after 100 steps

### Visual Representation:
```
S . . . .    S = Start (0,0)
. . . . .    G = Goal (4,4)
. . . . .    . = Empty cell
. . . . .
. . . . G
```
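
These rules fix the best score an agent can achieve, which gives a useful target when judging a trained policy. The arithmetic below assumes that the final move earns the goal reward *instead of* the step penalty (rather than in addition to it); the variable names are illustrative.

```python
grid_size = 5
start, goal = (0, 0), (4, 4)

# Manhattan distance = minimum number of moves when there are no obstacles
shortest_path = abs(goal[0] - start[0]) + abs(goal[1] - start[1])

step_penalty, goal_reward = -1, 10
# Assumption: the final move earns the goal reward instead of the step penalty
best_return = (shortest_path - 1) * step_penalty + goal_reward

print(shortest_path)  # 8
print(best_return)    # 3
```

At the other extreme, an agent that bumps into a wall on every one of the 100 allowed steps would collect 100 × (-5) = -500.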

## 🏗️ Building the Custom Environment

Every Gymnasium environment must inherit from `gym.Env` and implement these key methods:

### Required Methods:

1. **`__init__()`**: Initialize the environment
   - Define observation_space and action_space
   - Set up any necessary variables

2. **`reset()`**: Reset environment to initial state
   - Return: `(observation, info)`
   - Must handle the `seed` parameter for reproducibility

3. **`step(action)`**: Execute one action
   - Return: `(observation, reward, terminated, truncated, info)`
   - Update environment state
   - Calculate reward
   - Check if episode is done

4. **`render()`** (optional): Visualize the environment
   - Return visualization based on render_mode

Let's implement each part step by step! 👇

In [None]:
class GridWorldEnv(gym.Env):
    """
    Custom GridWorld Environment
    
    Agent navigates a 5x5 grid from start (0,0) to goal (4,4)
    """
    
    # Metadata for the environment
    metadata = {"render_modes": ["human", "rgb_array"], "render_fps": 4}
    
    def __init__(self, render_mode=None, grid_size=5):
        """
        Initialize the GridWorld environment.
        
        What we need to define here:
        - observation_space: What the agent observes
        - action_space: What actions the agent can take
        - Internal state variables
        """
        super().__init__()
        
        self.grid_size = grid_size
        self.render_mode = render_mode
        
        # Define action space: 4 discrete actions (UP, DOWN, LEFT, RIGHT)
        self.action_space = gym.spaces.Discrete(4)
        
        # Define observation space: agent's (x, y) position
        # Box space with shape (2,) for [x, y] coordinates
        # Values range from 0 to grid_size-1
        self.observation_space = gym.spaces.Box(
            low=0,
            high=grid_size - 1,
            shape=(2,),
            dtype=np.int32
        )
        
        # Define start and goal positions
        self.start_pos = np.array([0, 0], dtype=np.int32)
        self.goal_pos = np.array([grid_size - 1, grid_size - 1], dtype=np.int32)
        
        # Current agent position (will be set in reset)
        self.agent_pos = None
        self.num_steps = 0
        
        print("✅ GridWorld environment initialized!")
        print(f"   Grid size: {grid_size}x{grid_size}")
        print(f"   Start: {self.start_pos}, Goal: {self.goal_pos}")
    
    def reset(self, seed=None, options=None):
        """
        Reset the environment to initial state.
        
        What we need to do:
        - Handle the seed for reproducibility
        - Reset agent to start position
        - Reset step counter
        - Return (observation, info)
        """
        # Seed the random number generator
        super().reset(seed=seed)
        
        # Reset agent to start position
        self.agent_pos = self.start_pos.copy()
        self.num_steps = 0
        
        # Get observation (agent's current position)
        observation = self.agent_pos.copy()
        
        # Info dict (can contain any additional information)
        info = {"distance_to_goal": self._get_distance_to_goal()}
        
        return observation, info
    
    def step(self, action):
        """
        Execute one action and return the result.
        
        What we need to do:
        1. Update environment state based on action
        2. Calculate reward
        3. Check if episode is terminated or truncated
        4. Return (observation, reward, terminated, truncated, info)
        """
        self.num_steps += 1
        
        # Store old position to check if move was valid
        old_pos = self.agent_pos.copy()
        
        # Execute action: 0=UP, 1=DOWN, 2=LEFT, 3=RIGHT
        if action == 0:  # UP
            self.agent_pos[1] = max(0, self.agent_pos[1] - 1)
        elif action == 1:  # DOWN
            self.agent_pos[1] = min(self.grid_size - 1, self.agent_pos[1] + 1)
        elif action == 2:  # LEFT
            self.agent_pos[0] = max(0, self.agent_pos[0] - 1)
        elif action == 3:  # RIGHT
            self.agent_pos[0] = min(self.grid_size - 1, self.agent_pos[0] + 1)
        
        # Calculate reward
        reward = -1  # Step penalty (encourages shorter paths)
        
        # Check if agent hit a wall (position didn't change)
        if np.array_equal(old_pos, self.agent_pos):
            reward = -5  # Penalty for hitting wall
        
        # Check if agent reached the goal
        terminated = np.array_equal(self.agent_pos, self.goal_pos)
        if terminated:
            reward = 10  # Big reward for reaching goal!
        
        # Check if episode should be truncated (time limit)
        truncated = self.num_steps >= 100
        
        # Get observation
        observation = self.agent_pos.copy()
        
        # Additional info
        info = {
            "distance_to_goal": self._get_distance_to_goal(),
            "num_steps": self.num_steps
        }
        
        return observation, reward, terminated, truncated, info
    
    def render(self):
        """
        Visualize the environment.
        
        For simplicity, we'll create a text-based visualization.
        """
        if self.render_mode == "human" or self.render_mode == "rgb_array":
            # Create grid visualization
            grid = np.zeros((self.grid_size, self.grid_size), dtype=str)
            grid[:, :] = '.'
            
            # Mark start and goal
            grid[self.start_pos[1], self.start_pos[0]] = 'S'
            grid[self.goal_pos[1], self.goal_pos[0]] = 'G'
            
            # Mark agent position
            if not np.array_equal(self.agent_pos, self.goal_pos):
                grid[self.agent_pos[1], self.agent_pos[0]] = 'A'
            else:
                grid[self.agent_pos[1], self.agent_pos[0]] = 'W'  # Win!
            
            # Print grid
            print("\n" + "="*20)
            for row in grid:
                print(' '.join(row))
            print("="*20)
            print(f"Steps: {self.num_steps}")
            print(f"Position: {self.agent_pos}")
            
            # Simplification: a real "rgb_array" renderer returns an (H, W, 3) uint8 image.
            # Here we return the character grid instead, so it cannot be used as a video frame.
            if self.render_mode == "rgb_array":
                return grid
    
    def _get_distance_to_goal(self):
        """Helper function: Manhattan distance to goal"""
        return int(np.sum(np.abs(self.agent_pos - self.goal_pos)))

print("✅ GridWorldEnv class defined!")

## 🧪 Testing Our Custom Environment

In [None]:
# Create our custom environment
custom_env = GridWorldEnv(render_mode="human")

print("\n=== CUSTOM ENVIRONMENT INFO ===")
print(f"Observation space: {custom_env.observation_space}")
print(f"Action space: {custom_env.action_space}")
print("Action meanings: 0=UP, 1=DOWN, 2=LEFT, 3=RIGHT")

In [None]:
# Test reset
print("\n=== TESTING RESET ===")
obs, info = custom_env.reset(seed=42)
print(f"Initial observation: {obs}")
print(f"Info: {info}")
custom_env.render()

# Take a few manual steps
print("\n=== TAKING MANUAL ACTIONS ===")

# Move RIGHT
print("\nAction: RIGHT (3)")
obs, reward, terminated, truncated, info = custom_env.step(3)
print(f"Observation: {obs}, Reward: {reward}")
custom_env.render()

# Move DOWN
print("\nAction: DOWN (1)")
obs, reward, terminated, truncated, info = custom_env.step(1)
print(f"Observation: {obs}, Reward: {reward}")
custom_env.render()

# Try to move UP (back towards start)
print("\nAction: UP (0)")
obs, reward, terminated, truncated, info = custom_env.step(0)
print(f"Observation: {obs}, Reward: {reward}")
custom_env.render()

## 🤖 Training an Agent on Our Custom Environment

Now let's train a PPO agent on our custom GridWorld!

In [None]:
# Create environment for training (no rendering)
train_grid_env = GridWorldEnv()

# Initialize PPO
grid_model = PPO(
    policy="MlpPolicy",
    env=train_grid_env,
    learning_rate=3e-4,
    n_steps=2048,
    batch_size=64,
    verbose=1,
    tensorboard_log="/home/rl/workspace/logs",
    seed=SEED
)

print("\n🚀 Training PPO on GridWorld...\n")

# Train (GridWorld is simple, so we don't need many steps)
grid_model.learn(total_timesteps=20_000, progress_bar=True)

print("\n✅ Training complete!")

## 🎬 Visualizing the Trained Agent

In [None]:
# Test the trained agent
print("\n=== TRAINED AGENT DEMONSTRATION ===")

test_env = GridWorldEnv(render_mode="human")
obs, info = test_env.reset(seed=42)

print("\nInitial state:")
test_env.render()

total_reward = 0
done = False
step = 0

while not done and step < 20:  # Limit steps for demonstration
    # Get action from trained model
    action, _ = grid_model.predict(obs, deterministic=True)
    
    # Take action
    obs, reward, terminated, truncated, info = test_env.step(action)
    total_reward += reward
    done = terminated or truncated
    step += 1
    
    # Render (predict() returns a NumPy array, so convert it before indexing the list)
    action_names = ['UP', 'DOWN', 'LEFT', 'RIGHT']
    print(f"\nStep {step} - Action: {action_names[int(action)]}")
    test_env.render()
    
    if terminated:
        print("\n🎉 GOAL REACHED!")
        break

print(f"\nTotal reward: {total_reward}")
if terminated:
    print(f"Reached goal in {step} steps!")
else:
    print(f"Goal not reached within {step} steps.")

test_env.close()

## 💡 Key Takeaways: Custom Environments

### What We Learned:

1. **Environment Structure**:
   - Inherit from `gym.Env`
   - Define `observation_space` and `action_space` in `__init__`
   - Implement `reset()` and `step()` methods

2. **`__init__()` Method**:
   - Define spaces that match your problem
   - Initialize environment parameters
   - Set up internal state variables

3. **`reset()` Method**:
   - Accept the `seed` argument and forward it with `super().reset(seed=seed)` so that `self.np_random` is seeded
   - Return `(observation, info)`
   - Reset all state variables

4. **`step()` Method**:
   - Update state based on action
   - Calculate reward (design rewards carefully!)
   - Determine `terminated` and `truncated`
   - Return `(observation, reward, terminated, truncated, info)`

5. **Reward Design**:
   - Positive rewards for desired behavior (reaching goal)
   - Negative rewards for undesired behavior (hitting walls)
   - Step penalties to encourage efficiency
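
The reward rules listed above can be sketched as a small pure function, which makes them easy to test in isolation. This is a minimal sketch rather than the class's actual code: the reward values and the 100-step limit are taken from `GridWorldEnv.step()`, while `step_outcome`, `best_possible_return`, and the 5x5 layout with start `(0, 0)` and goal `(4, 4)` are assumptions made for illustration.

```python
# Sketch of the GridWorld step logic as pure functions.
# Reward values and the step limit mirror the class above; the function
# names and the example layout are hypothetical.

STEP_PENALTY = -1
WALL_PENALTY = -5
GOAL_REWARD = 10
MAX_STEPS = 100


def step_outcome(old_pos, new_pos, goal_pos, num_steps):
    """Return (reward, terminated, truncated) for one transition."""
    terminated = new_pos == goal_pos          # goal reached ends the episode
    if terminated:
        reward = GOAL_REWARD
    elif new_pos == old_pos:                  # position unchanged: hit a wall
        reward = WALL_PENALTY
    else:
        reward = STEP_PENALTY
    truncated = num_steps >= MAX_STEPS        # time limit, independent of the goal
    return reward, terminated, truncated


def best_possible_return(start_pos, goal_pos):
    """Return of a shortest path: each move costs -1 except the last (+10)."""
    distance = abs(start_pos[0] - goal_pos[0]) + abs(start_pos[1] - goal_pos[1])
    return (distance - 1) * STEP_PENALTY + GOAL_REWARD


# Assumed layout: 5x5 grid, start at (0, 0), goal at (4, 4).
print(step_outcome((0, 0), (1, 0), (4, 4), num_steps=1))  # ordinary move
print(step_outcome((0, 0), (0, 0), (4, 4), num_steps=2))  # bumped into a wall
print(step_outcome((4, 3), (4, 4), (4, 4), num_steps=8))  # reached the goal
print(best_possible_return((0, 0), (4, 4)))
```

Under the assumed layout the shortest path takes 8 moves, so the best achievable return is 7 × (-1) + 10 = 3. A trained agent whose total reward is close to that value is following a near-optimal path; a different grid size or goal position changes the number.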

### 🚀 Your Turn!

Try modifying the GridWorld environment:
- Add obstacles that the agent must avoid
- Change the reward structure
- Make the grid larger
- Add multiple goals
- Implement diagonal movement

---

# 6️⃣ Summary & Key Takeaways

## 🎓 What We Learned

### 1. **Gymnasium API**
- ✅ `gym.make()` - Create environments with different render modes
- ✅ `env.reset()` - Initialize episodes → returns `(observation, info)`
- ✅ `env.step(action)` - Execute actions → returns `(obs, reward, terminated, truncated, info)`
- ✅ `env.observation_space` & `env.action_space` - Understand environment structure
- ✅ `env.render()` - Visualize agent behavior
- ✅ `env.close()` - Clean up resources

### 2. **Complete Training Pipeline**
1. Create environment
2. Initialize RL algorithm (PPO)
3. Train with `model.learn()`
4. Save model with `model.save()`
5. Evaluate with `evaluate_policy()`
6. Visualize performance

### 3. **Stable-Baselines3 Workflow**
```python
# Initialize
model = PPO("MlpPolicy", env, verbose=1)

# Train
model.learn(total_timesteps=50_000)

# Save
model.save("model_name")

# Load
model = PPO.load("model_name", env=env)

# Use
action, _ = model.predict(obs, deterministic=True)
```

### 4. **TensorBoard Monitoring**
- ✅ Track learning curves (`ep_rew_mean`)
- ✅ Monitor training losses (policy, value, entropy)
- ✅ Visualize policy updates (KL divergence, clip fraction)
- ✅ Real-time training feedback

### 5. **Custom Environments**
- ✅ Inherit from `gym.Env`
- ✅ Define observation and action spaces
- ✅ Implement `__init__`, `reset`, `step` methods
- ✅ Design reward functions carefully

### 6. **Key Insights**
- **Random agents perform poorly** - A learned policy is needed for non-trivial tasks, and the random agent gives a useful lower bound
- **Training time varies** - Simpler environments (CartPole) train faster
- **Visualization helps** - See what the agent is learning
- **Evaluation is crucial** - Always benchmark against baselines
- **Reward design matters** - Shapes agent behavior significantly

---

## 🚀 Next Steps

### Experiments to Try:
1. **Hyperparameter tuning**: Adjust learning rate, batch size, etc.
2. **Different algorithms**: Try DQN, A2C, or SAC
3. **Custom rewards**: Modify reward functions for different behaviors
4. **More environments**: Explore Atari, MuJoCo, or custom environments
5. **Curriculum learning**: Train on progressively harder tasks

### Resources:
- 📖 [Gymnasium Documentation](https://gymnasium.farama.org/)
- 📖 [Stable-Baselines3 Docs](https://stable-baselines3.readthedocs.io/)
- 📄 [PPO Paper](https://arxiv.org/abs/1707.06347)
- 🎓 [Spinning Up in Deep RL](https://spinningup.openai.com/)

---

## 💡 Discussion Questions

1. Why do we need `render_mode="rgb_array"` for creating GIFs?
2. What's the difference between `terminated` and `truncated`?
3. Why use `deterministic=True` during evaluation?
4. How would you modify the code to train on a custom environment?
5. What happens if you train for too long? (Hint: overfitting)

---

### 🎉 Congratulations!

You now know how to:
- ✅ Use the Gymnasium API
- ✅ Visualize agent behavior
- ✅ Train RL agents with Stable-Baselines3
- ✅ Evaluate and compare performance
- ✅ Build complete RL pipelines

**Keep learning and experimenting! 🚀**

---

# 📚 Appendix: Quick Reference

## Gymnasium API Cheat Sheet

```python
# Create environment
env = gym.make("EnvName-v0", render_mode="rgb_array")

# Reset (start new episode)
obs, info = env.reset(seed=42)

# Take action
obs, reward, terminated, truncated, info = env.step(action)
done = terminated or truncated

# Spaces
obs_space = env.observation_space
act_space = env.action_space
sample_obs = obs_space.sample()
sample_act = act_space.sample()

# Render
frame = env.render()  # Returns an RGB array when render_mode="rgb_array"

# Cleanup
env.close()
```
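
The calls in the cheat sheet combine into a standard episode loop. The sketch below shows that loop against `CorridorEnv`, a hypothetical stand-in written only for this example so the block runs without any installed environment; it mimics the `reset()`/`step()` return signatures but is not part of Gymnasium. `run_episode` is likewise an illustrative helper name.

```python
import random


class CorridorEnv:
    """Hypothetical stand-in that mimics the Gymnasium reset/step signatures.

    The agent starts at cell 0 of a 1-D corridor and must reach the last cell.
    Actions: 0 = left, 1 = right.
    """

    def __init__(self, length=5, max_steps=20):
        self.length = length
        self.max_steps = max_steps

    def reset(self, seed=None):
        self.pos = 0
        self.num_steps = 0
        return self.pos, {}

    def step(self, action):
        move = 1 if action == 1 else -1
        self.pos = min(self.length - 1, max(0, self.pos + move))
        self.num_steps += 1
        terminated = self.pos == self.length - 1       # goal reached
        truncated = self.num_steps >= self.max_steps   # time limit hit
        reward = 10 if terminated else -1
        return self.pos, reward, terminated, truncated, {}


def run_episode(env, policy, seed=None):
    """Run one episode and return (total_reward, steps, terminated)."""
    obs, info = env.reset(seed=seed)
    total_reward, steps = 0, 0
    terminated = truncated = False
    while not (terminated or truncated):
        action = policy(obs)
        obs, reward, terminated, truncated, info = env.step(action)
        total_reward += reward
        steps += 1
    return total_reward, steps, terminated


rng = random.Random(0)
env = CorridorEnv()
print(run_episode(env, lambda obs: 1))                  # always move right
print(run_episode(env, lambda obs: rng.randint(0, 1)))  # random baseline
```

Because `run_episode` relies only on the `reset()`/`step()` signatures, it should also work with an environment created by `gym.make()`, for example with `policy=lambda obs: env.action_space.sample()` as the random baseline.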

## SB3 PPO Cheat Sheet

```python
from stable_baselines3 import PPO
from stable_baselines3.common.evaluation import evaluate_policy

# Create model
model = PPO("MlpPolicy", env, verbose=1)

# Train
model.learn(total_timesteps=100_000)

# Save/Load
model.save("path/to/model")
model = PPO.load("path/to/model", env=env)

# Predict
action, _states = model.predict(obs, deterministic=True)

# Evaluate
mean_reward, std_reward = evaluate_policy(
    model, env, n_eval_episodes=100
)
```
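
`evaluate_policy` reports the mean and standard deviation of the per-episode returns. As far as we can tell it uses NumPy's default (population) standard deviation; treat that as an assumption and check the SB3 source if the distinction matters. The sketch below reproduces the same kind of summary on made-up returns, so the numbers illustrate only the calculation.

```python
import statistics

# Made-up per-episode returns, for illustration only.
episode_returns = [3, 3, 1, -2, 3]

mean_reward = statistics.fmean(episode_returns)
std_reward = statistics.pstdev(episode_returns)  # population std (assumed to match SB3)

print(f"mean_reward = {mean_reward:.2f} +/- {std_reward:.2f}")
```

A large standard deviation relative to the mean usually means the policy succeeds in some episodes and fails in others, which is worth checking before comparing against the random baseline.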