# Car Racing with PPO - Reinforcement Learning Project

This notebook demonstrates the implementation and training of a **Proximal Policy Optimization (PPO)** agent for the **Car Racing** environment.

## Table of Contents
1. [Introduction](#introduction)
2. [Environment Setup](#environment)
3. [PPO Algorithm](#ppo)
4. [Training](#training)
5. [Evaluation](#evaluation)
6. [Results Analysis](#results)
7. [Conclusion](#conclusion)

## 1. Introduction <a name="introduction"></a>

### Problem Overview

**Car Racing** is a continuous control task where an agent must learn to drive a car around a racing track. The challenge involves:
- **Continuous action space**: steering, acceleration, and braking
- **Pixel-based observations**: 96x96 RGB images
- **Reward structure**: +1000/N for each newly visited track tile (N = total number of tiles) and -0.1 per frame, so slow or off-track driving lowers the return
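
As a rough worked example of how the reward adds up over an episode, the sketch below assumes the standard Gymnasium CarRacing rule: +1000/N per visited tile plus a -0.1 penalty per frame. The function name `episode_return` and the tile and frame counts are illustrative, not taken from this project.

```python
def episode_return(tiles_visited, total_tiles, frames, frame_penalty=0.1):
    """Return of one episode under the assumed CarRacing reward rule."""
    tile_reward = 1000.0 * tiles_visited / total_tiles   # +1000/N per visited tile
    time_penalty = frame_penalty * frames                 # -0.1 for every frame
    return tile_reward - time_penalty

# Hypothetical full lap: every tile visited, finished in 732 frames.
full_lap = episode_return(tiles_visited=300, total_tiles=300, frames=732)
print(round(full_lap, 1))  # prints 926.8
```

Under this rule the return only approaches 1000 when the car completes the lap quickly.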

### Why PPO?

PPO is a good fit for this task because:
- **Stable**: Clips the probability ratio so that a single update cannot move the policy too far
- **Sample reuse**: Runs several epochs of minibatch updates on each rollout and uses GAE to reduce the variance of advantage estimates
- **Continuous control**: Handles continuous action spaces through a Gaussian policy
- **Widely used**: A standard baseline among modern RL algorithms

In [None]:
# Imports
import sys
sys.path.append('..')

import numpy as np
import matplotlib.pyplot as plt
import torch
from IPython.display import Image, display

from config.ppo_config import PPOConfig
from src.environment import CarRacingEnv, NormalizeActions
from src.ppo_agent import PPOAgent
from src.utils import set_seed, evaluate_agent

print("Imports successful!")
print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")

## 2. Environment Setup <a name="environment"></a>

### Environment Preprocessing

We apply several preprocessing steps:
1. **Grayscale conversion**: Reduce from RGB to grayscale
2. **Frame cropping**: Remove car dashboard (bottom part)
3. **Resizing**: 96x96 → 84x84
4. **Normalization**: Pixel values to [0, 1]
5. **Frame stacking**: Stack 4 consecutive frames for temporal information
6. **Frame skipping**: Repeat actions for 2 frames (action repeat)
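
The steps above can be sketched with plain NumPy. This is a minimal illustration, not the code in `src/environment.py` (which is not shown here): the 84-row crop, the nearest-neighbour resize, and the names `preprocess_frame` and `FrameStack` are assumptions, and frame skipping is left out for brevity.

```python
from collections import deque

import numpy as np

def preprocess_frame(rgb, crop_rows=84, out_size=84):
    """RGB uint8 (96, 96, 3) -> float32 grayscale (out_size, out_size) in [0, 1]."""
    weights = np.array([0.299, 0.587, 0.114], dtype=np.float32)
    gray = rgb.astype(np.float32) @ weights           # luminance-weighted grayscale
    gray = gray[:crop_rows, :]                         # drop the bar at the bottom
    # Nearest-neighbour resize by index selection (stand-in for cv2.resize).
    rows = np.arange(out_size) * gray.shape[0] // out_size
    cols = np.arange(out_size) * gray.shape[1] // out_size
    resized = gray[rows][:, cols]
    return np.clip(resized / 255.0, 0.0, 1.0)          # normalize to [0, 1]

class FrameStack:
    """Keeps the most recent n_frames preprocessed frames."""

    def __init__(self, n_frames=4):
        self.frames = deque(maxlen=n_frames)

    def reset(self, frame):
        # Fill the whole stack with the first frame of the episode.
        for _ in range(self.frames.maxlen):
            self.frames.append(frame)
        return np.stack(list(self.frames))

    def step(self, frame):
        self.frames.append(frame)                      # oldest frame is dropped
        return np.stack(list(self.frames))

# Random image standing in for a real CarRacing frame.
rgb = np.random.default_rng(0).integers(0, 256, size=(96, 96, 3), dtype=np.uint8)
stack = FrameStack(n_frames=4)
obs = stack.reset(preprocess_frame(rgb))
print(obs.shape, obs.dtype)  # (4, 84, 84) float32
```

The resulting `(4, 84, 84)` array matches the channel count expected by the first convolution of the network described later.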

In [None]:
# Create environment
env = NormalizeActions(CarRacingEnv())

print(f"Observation space: {env.observation_space.shape}")
print(f"Action space: {env.action_space.shape}")
print(f"Action space bounds: {env.action_space.low} to {env.action_space.high}")
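
The implementation of `NormalizeActions` is not shown in this notebook. One plausible reading, sketched below, is that it maps policy outputs in [-1, 1] onto the native CarRacing bounds (steering in [-1, 1], gas and brake in [0, 1]). The helper `rescale_action` is hypothetical and only illustrates that mapping.

```python
import numpy as np

# Native continuous CarRacing bounds: [steering, gas, brake].
LOW = np.array([-1.0, 0.0, 0.0], dtype=np.float32)
HIGH = np.array([1.0, 1.0, 1.0], dtype=np.float32)

def rescale_action(action, low=LOW, high=HIGH):
    """Map an action in [-1, 1]^3 linearly onto [low, high]."""
    action = np.clip(action, -1.0, 1.0)                # guard against out-of-range samples
    return low + 0.5 * (action + 1.0) * (high - low)

print(rescale_action(np.array([0.0, 0.0, 0.0])))       # steering 0, gas 0.5, brake 0.5
```

With such a mapping, a zero-mean policy output corresponds to half throttle and half brake, which is worth keeping in mind when inspecting early rollouts.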

In [None]:
# Visualize environment
obs, _ = env.reset(seed=42)

# Show stacked frames
fig, axes = plt.subplots(1, 4, figsize=(15, 3))
for i in range(4):
    axes[i].imshow(obs[i], cmap='gray')
    axes[i].set_title(f'Frame {i+1}')
    axes[i].axis('off')
plt.suptitle('Stacked Frames (Grayscale)')
plt.tight_layout()
plt.show()

## 3. PPO Algorithm <a name="ppo"></a>

### Algorithm Overview

PPO optimizes a clipped surrogate objective:

$$L^{CLIP}(\theta) = \mathbb{E}_t \left[ \min \left( r_t(\theta) \hat{A}_t, \text{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon) \hat{A}_t \right) \right]$$

where:
- $r_t(\theta) = \frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{old}}(a_t|s_t)}$ is the probability ratio
- $\hat{A}_t$ is the advantage estimate (computed using GAE)
- $\epsilon$ is the clipping parameter (typically 0.2)
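
The clipped objective can be sketched in NumPy as follows. This is an illustration of the formula only: the function name `clipped_surrogate` is an assumption, and a real training loop would compute the same quantity on PyTorch tensors and minimize its negative.

```python
import numpy as np

def clipped_surrogate(log_prob_new, log_prob_old, advantages, clip_epsilon=0.2):
    """Mean clipped surrogate objective L^CLIP (to be maximized)."""
    ratio = np.exp(log_prob_new - log_prob_old)        # r_t(theta)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - clip_epsilon, 1.0 + clip_epsilon) * advantages
    return np.mean(np.minimum(unclipped, clipped))

# Two samples with probability ratios 1.5 and 0.5.
log_old = np.log(np.array([0.2, 0.5]))
log_new = np.log(np.array([0.3, 0.25]))
adv = np.array([1.0, -1.0])

# Sample 1: min(1.5, 1.2) = 1.2; sample 2: min(-0.5, -0.8) = -0.8; mean is about 0.2.
objective = clipped_surrogate(log_new, log_old, adv)
print(objective)
```

In both samples the clipped term is selected, so the gradient with respect to the ratio is zero: the objective gives no incentive to push the ratio further outside $[1-\epsilon, 1+\epsilon]$.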

### Network Architecture

**Shared CNN Feature Extractor:**
- Conv2D(4, 32, 8x8, stride=4) + ReLU
- Conv2D(32, 64, 4x4, stride=2) + ReLU
- Conv2D(64, 64, 3x3, stride=1) + ReLU
- Flatten → 3136 features

**Actor (Policy) Head:**
- Linear(3136, 256) + ReLU
- Linear(256, 256) + ReLU
- Linear(256, 3) + Tanh
- Outputs: mean actions + learnable log std

**Critic (Value) Head:**
- Linear(3136, 256) + ReLU
- Linear(256, 256) + ReLU
- Linear(256, 1)
- Outputs: state value estimate
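
The 3136-feature figure can be checked with a few lines of arithmetic. The sketch assumes an 84x84 input and unpadded convolutions, which is what the layer list implies but is not confirmed by the network code itself.

```python
def conv_out(size, kernel, stride):
    """Output width of a convolution without padding."""
    return (size - kernel) // stride + 1

size = 84                                   # input height and width
for kernel, stride in [(8, 4), (4, 2), (3, 1)]:
    size = conv_out(size, kernel, stride)   # 84 -> 20 -> 9 -> 7

flat_features = 64 * size * size            # 64 channels in the last conv layer
print(size, flat_features)  # 7 3136
```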

In [None]:
# Initialize PPO agent
config = PPOConfig()
set_seed(config.seed)

obs_shape = env.observation_space.shape
action_dim = env.action_space.shape[0]

agent = PPOAgent(obs_shape, action_dim, config)

print("PPO Agent initialized!")
print(f"Device: {agent.device}")
print(f"\nNetwork architecture:")
print(agent.actor_critic)

### Hyperparameters

Key hyperparameters for PPO:

In [None]:
# Display hyperparameters
hyperparams = {
    'Learning Rate': config.learning_rate,
    'Gamma (Discount)': config.gamma,
    'GAE Lambda': config.gae_lambda,
    'Clip Epsilon': config.clip_epsilon,
    'Value Coefficient': config.value_coef,
    'Entropy Coefficient': config.entropy_coef,
    'Steps per Update': config.n_steps,
    'Batch Size': config.batch_size,
    'Epochs per Update': config.n_epochs,
    'Max Grad Norm': config.max_grad_norm
}

for key, value in hyperparams.items():
    print(f"{key}: {value}")
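
To show where Gamma and GAE Lambda enter the computation, here is a minimal NumPy sketch of Generalized Advantage Estimation. The function name `compute_gae`, its signature, and the default values 0.99 and 0.95 are assumptions for illustration; the project's own values live in `PPOConfig`.

```python
import numpy as np

def compute_gae(rewards, values, dones, last_value, gamma=0.99, gae_lambda=0.95):
    """Advantages and value targets for one rollout of length T."""
    T = len(rewards)
    advantages = np.zeros(T, dtype=np.float64)
    gae = 0.0
    for t in reversed(range(T)):
        next_value = last_value if t == T - 1 else values[t + 1]
        not_done = 1.0 - dones[t]           # stop bootstrapping at episode ends
        delta = rewards[t] + gamma * next_value * not_done - values[t]
        gae = delta + gamma * gae_lambda * not_done * gae
        advantages[t] = gae
    returns = advantages + np.asarray(values, dtype=np.float64)
    return advantages, returns

# With gamma = lambda = 1 and a zero critic, GAE reduces to the reward-to-go.
adv, ret = compute_gae(
    rewards=[1.0, 2.0, 3.0], values=[0.0, 0.0, 0.0], dones=[0.0, 0.0, 0.0],
    last_value=0.0, gamma=1.0, gae_lambda=1.0,
)
print(adv)  # [6. 5. 3.]
```

Lower values of `gae_lambda` rely more on the critic (less variance, more bias), while values near 1 rely more on observed rewards.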

## 4. Training <a name="training"></a>

To train the agent, run the training script:

```bash
python train.py
```

This will:
- Train for 1,000,000 timesteps
- Save checkpoints every 50 updates
- Evaluate every 25 updates
- Log training metrics
- Save the best model based on evaluation performance
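
To get a feel for this schedule, the sketch below counts updates, checkpoints, and evaluations. The rollout length of 2048 steps is an assumption (the real value is `config.n_steps`), as is counting updates from 1.

```python
total_timesteps = 1_000_000
n_steps = 2048             # assumed rollout length per update
checkpoint_every = 50      # updates between checkpoints
eval_every = 25            # updates between evaluations

n_updates = total_timesteps // n_steps
checkpoints = [u for u in range(1, n_updates + 1) if u % checkpoint_every == 0]
evaluations = [u for u in range(1, n_updates + 1) if u % eval_every == 0]

print(n_updates, len(checkpoints), len(evaluations))  # 488 9 19
```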

### Training Progress

Training typically takes 4-8 hours on a GPU (depending on hardware).

## 5. Evaluation <a name="evaluation"></a>

Load and evaluate a trained model:

In [None]:
# Load trained model
checkpoint_path = "../checkpoints/best_model.pt"

try:
    agent.load(checkpoint_path)
    print(f"Model loaded from: {checkpoint_path}")
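except FileNotFoundError:
    # Catch only the missing-file case so that other loading errors stay visible
    print(f"No trained model found at {checkpoint_path}. Please train the model first!")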

In [None]:
# Evaluate agent
eval_env = NormalizeActions(CarRacingEnv())
mean_reward, std_reward, mean_length = evaluate_agent(agent, eval_env, n_episodes=10)

print(f"\nEvaluation Results (10 episodes):")
print(f"Mean Reward: {mean_reward:.2f} +/- {std_reward:.2f}")
print(f"Mean Length: {mean_length:.1f}")

eval_env.close()
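
The body of `evaluate_agent` lives in `src/utils.py` and is not shown here. A generic evaluation loop over a Gymnasium-style environment might look like the sketch below; the `evaluate` function, the `policy` callable, and the toy `CountdownEnv` are all hypothetical stand-ins.

```python
import numpy as np

def evaluate(policy, env, n_episodes=10):
    """Run n_episodes and return (mean return, std of returns, mean length)."""
    returns, lengths = [], []
    for _ in range(n_episodes):
        obs, _ = env.reset()
        done, total, steps = False, 0.0, 0
        while not done:
            obs, reward, terminated, truncated, _ = env.step(policy(obs))
            total += reward
            steps += 1
            done = terminated or truncated
        returns.append(total)
        lengths.append(steps)
    return float(np.mean(returns)), float(np.std(returns)), float(np.mean(lengths))

class CountdownEnv:
    """Toy environment: pays +1 per step and terminates after 5 steps."""

    def reset(self, seed=None):
        self.t = 0
        return 0, {}

    def step(self, action):
        self.t += 1
        return self.t, 1.0, self.t >= 5, False, {}

toy_mean, toy_std, toy_len = evaluate(lambda obs: 0, CountdownEnv(), n_episodes=3)
print(toy_mean, toy_std, toy_len)  # 5.0 0.0 5.0
```

For a stochastic policy, evaluation commonly uses the mean action rather than a sample, which usually gives a less noisy estimate of performance.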

## 6. Results Analysis <a name="results"></a>

Analyze training curves and performance:

In [None]:
# Load training metrics
try:
    metrics = np.load("../logs/training_metrics.npz")
    
    timesteps = metrics['timesteps']
    episode_rewards = metrics['episode_rewards']
    episode_lengths = metrics['episode_lengths']
    policy_losses = metrics['policy_losses']
    value_losses = metrics['value_losses']
    
    print("Training metrics loaded successfully!")
except FileNotFoundError:
    print("No training metrics found. Train the model first!")

In [None]:
# Plot training curves
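def moving_average(data, window=10):
    """Trailing mean over `window` points; shorter inputs are returned unchanged."""
    data = np.asarray(data, dtype=float)
    if len(data) < window:
        return data
    return np.convolve(data, np.ones(window) / window, mode='valid')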

fig, axes = plt.subplots(2, 2, figsize=(15, 10))

# Episode rewards
reward_ma = moving_average(episode_rewards, 50)
axes[0, 0].plot(timesteps, episode_rewards, alpha=0.3, label='Raw')
# mode='valid' drops the first window-1 points, so align with the trailing timesteps
axes[0, 0].plot(timesteps[len(timesteps) - len(reward_ma):], reward_ma,
                linewidth=2, label='MA(50)')
axes[0, 0].set_xlabel('Timestep')
axes[0, 0].set_ylabel('Episode Reward')
axes[0, 0].set_title('Training Progress: Episode Rewards')
axes[0, 0].legend()
axes[0, 0].grid(True)

# Episode lengths
length_ma = moving_average(episode_lengths, 50)
axes[0, 1].plot(timesteps, episode_lengths, alpha=0.3, label='Raw')
# mode='valid' drops the first window-1 points, so align with the trailing timesteps
axes[0, 1].plot(timesteps[len(timesteps) - len(length_ma):], length_ma,
                linewidth=2, label='MA(50)')
axes[0, 1].set_xlabel('Timestep')
axes[0, 1].set_ylabel('Episode Length')
axes[0, 1].set_title('Training Progress: Episode Lengths')
axes[0, 1].legend()
axes[0, 1].grid(True)

# Policy loss
axes[1, 0].plot(policy_losses)
axes[1, 0].set_xlabel('Update')
axes[1, 0].set_ylabel('Policy Loss')
axes[1, 0].set_title('Policy Loss over Updates')
axes[1, 0].grid(True)

# Value loss
axes[1, 1].plot(value_losses)
axes[1, 1].set_xlabel('Update')
axes[1, 1].set_ylabel('Value Loss')
axes[1, 1].set_title('Value Loss over Updates')
axes[1, 1].grid(True)

plt.tight_layout()
plt.show()
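
When a moving average is computed with `mode='valid'`, the output is shorter than the input, and each value summarizes the window that ends at a given point. The small sketch below, using made-up data and a hypothetical helper `trailing_average`, shows how to pick the matching x-values.

```python
import numpy as np

def trailing_average(data, window):
    """Mean of each full window; the result has len(data) - window + 1 points."""
    data = np.asarray(data, dtype=float)
    if len(data) < window:
        return data
    return np.convolve(data, np.ones(window) / window, mode="valid")

x = np.arange(10)                           # stand-in for timesteps
y = np.arange(10, dtype=float)              # stand-in for episode rewards
smoothed = trailing_average(y, window=4)

# Pair each average with the last x-value of its window.
x_aligned = x[len(x) - len(smoothed):]
print(len(smoothed), x_aligned[0], smoothed[0])  # 7 3 1.5
```

Plotting against the first `len(smoothed)` x-values instead would shift the smoothed curve to the left by `window - 1` points.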

### Performance Statistics

In [None]:
# Calculate statistics
print("Training Statistics:")
print(f"Total Episodes: {len(episode_rewards)}")
print(f"Final 100 Episodes - Mean Reward: {np.mean(episode_rewards[-100:]):.2f}")
print(f"Final 100 Episodes - Mean Length: {np.mean(episode_lengths[-100:]):.1f}")
print(f"Best Episode Reward: {np.max(episode_rewards):.2f}")
print(f"Worst Episode Reward: {np.min(episode_rewards):.2f}")

## 7. Conclusion <a name="conclusion"></a>

### Summary

In this project, we successfully:
1. Implemented a PPO agent from scratch using PyTorch
2. Created a comprehensive environment wrapper with preprocessing
3. Trained the agent on the Car Racing environment
4. Achieved [YOUR RESULTS HERE] average reward
5. Analyzed training dynamics and performance

### Key Insights

- **Preprocessing is crucial**: Frame stacking and normalization significantly improve learning
- **PPO is stable**: The clipped objective prevents destructive policy updates
- **Hyperparameters matter**: GAE lambda and clip epsilon heavily influence convergence
- **Sample efficiency**: PPO reuses each rollout for several epochs, but as an on-policy method it still needs a large budget (1,000,000 timesteps in this setup)

### Future Improvements

1. **Architecture**: Try larger networks or attention mechanisms
2. **Curriculum learning**: Start with simpler tracks
3. **Data augmentation**: Random crops, color jitter for robustness
4. **Reward shaping**: More sophisticated reward engineering
5. **Ensemble methods**: Multiple agents for better exploration

### References

- Schulman et al. (2017). [Proximal Policy Optimization Algorithms](https://arxiv.org/abs/1707.06347)
- Schulman et al. (2016). [High-Dimensional Continuous Control Using Generalized Advantage Estimation](https://arxiv.org/abs/1506.02438)
- [Gymnasium Documentation](https://gymnasium.farama.org/)