# DQN Training for CarRacing Environment 🏎️

This notebook implements a complete DQN training pipeline for the CarRacing-v3 environment. You'll train an AI agent to drive a car using Deep Q-Networks!

## 🚀 What You'll Do
1. **Set up the environment** - CarRacing with preprocessing
2. **Build the DQN network** - CNN for visual input processing
3. **Implement training components** - Replay buffer, target network, etc.
4. **Train the agent** - Watch it learn to drive!
5. **Analyze results** - Visualize training progress

## ⚡ Quick Start
Make sure you have:
- Activated the virtual environment
- Installed all dependencies
- Run the tutorial notebook first (recommended)

Let's start training your racing AI! 🏁

In [None]:
# Import all necessary libraries
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
import numpy as np
import gymnasium as gym
import cv2
import random
import os
import time
from collections import deque
from typing import Tuple, List, Optional, Dict, Any
from pathlib import Path
import matplotlib.pyplot as plt
from tqdm.notebook import tqdm
import warnings
warnings.filterwarnings('ignore')

# Set plotting style
plt.style.use('seaborn-v0_8')
plt.rcParams['figure.figsize'] = [12, 8]

# Check device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"🖥️  Using device: {device}")
print(f"🐍 Python packages ready!")

# Set random seeds for reproducibility
torch.manual_seed(42)
np.random.seed(42)
random.seed(42)
print(f"🎲 Random seeds set for reproducibility")

---
## 📋 Hyperparameters Configuration

These hyperparameters control the training process. Feel free to experiment with different values!

In [None]:
# Hyperparameters - Modify these to experiment!
HYPERPARAMETERS = {
    # Learning parameters
    'learning_rate': 0.0001,
    'gamma': 0.99,  # Discount factor
    
    # Exploration parameters
    'epsilon_start': 1.0,
    'epsilon_end': 0.01,
    'epsilon_decay': 0.995,
    
    # Training parameters
    'batch_size': 32,
    'buffer_size': 10000,
    'target_update': 1000,  # Sync target network every N gradient updates
    
    # Episode parameters
    'num_episodes': 100,  # Start with fewer episodes for notebook
    'max_steps_per_episode': 1000,
    
    # Environment parameters
    'frame_stack': 4,
    'image_size': (84, 84),
    
    # Logging
    'save_interval': 25,
    'log_interval': 5
}

print("📊 Hyperparameters:")
for key, value in HYPERPARAMETERS.items():
    print(f"  {key}: {value}")
    
print(f"\n💡 Tip: Increase 'num_episodes' to 500+ for better performance!")
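
Because epsilon is multiplied by `epsilon_decay` once per episode, its value after `n` episodes is `epsilon_start * epsilon_decay ** n`, clipped at `epsilon_end`. The sketch below estimates how long exploration stays high under the values configured above. The helper names `epsilon_after` and `episodes_to_floor` are hypothetical and not part of the notebook.

```python
import math

def epsilon_after(n_episodes: int, start: float = 1.0, end: float = 0.01,
                  decay: float = 0.995) -> float:
    """Epsilon after n per-episode multiplicative decays, clipped at `end`."""
    return max(end, start * decay ** n_episodes)

def episodes_to_floor(start: float = 1.0, end: float = 0.01,
                      decay: float = 0.995) -> int:
    """Smallest n for which start * decay**n has dropped to `end` or below."""
    return math.ceil(math.log(end / start) / math.log(decay))

print(f"epsilon after 100 episodes: {epsilon_after(100):.3f}")
print(f"episodes until epsilon reaches the floor: {episodes_to_floor()}")
```

With these settings the agent still picks a random action about 60% of the time after 100 episodes, and only reaches the floor after roughly 900 episodes. That is one reason short runs tend to look noisy.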

---
## 🚗 Environment Setup

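Let's wrap the CarRacing environment so that every observation is converted to grayscale, resized to 84×84, scaled to [0, 1], and stacked with the three previous frames. The wrapper also maps four discrete actions (left, straight, right, brake) onto the environment's continuous steering, gas, and brake controls.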

In [None]:
class CarRacingWrapper:
    """Wrapper for CarRacing environment with preprocessing."""
    
    def __init__(self, render_mode: Optional[str] = None):
        """
        Initialize CarRacing environment wrapper.
        
        Args:
            render_mode: Rendering mode ('human', 'rgb_array', or None)
        """
        self.env = gym.make('CarRacing-v3', render_mode=render_mode)
        self.frame_stack = HYPERPARAMETERS['frame_stack']
        self.image_size = HYPERPARAMETERS['image_size']
        
        # Frame buffer for stacking
        self.frames = deque(maxlen=self.frame_stack)
        
    def reset(self) -> np.ndarray:
        """Reset environment and return initial stacked frames."""
        obs, info = self.env.reset()
        
        # Preprocess initial frame
        processed_frame = self._preprocess_frame(obs)
        
        # Initialize frame stack with repeated first frame
        for _ in range(self.frame_stack):
            self.frames.append(processed_frame)
            
        return self._get_stacked_frames()
        
    def step(self, action: int) -> Tuple[np.ndarray, float, bool, bool, Dict]:
        """
        Take action and return preprocessed observation.
        
        Args:
            action: Discrete action index
            
        Returns:
            Tuple of (observation, reward, terminated, truncated, info)
        """
        # Convert discrete action to continuous
        continuous_action = self._discrete_to_continuous(action)
        
        # Take step in environment
        obs, reward, terminated, truncated, info = self.env.step(continuous_action)
        
        # Preprocess and stack frames
        processed_frame = self._preprocess_frame(obs)
        self.frames.append(processed_frame)
        stacked_frames = self._get_stacked_frames()
        
        return stacked_frames, reward, terminated, truncated, info
        
    def _preprocess_frame(self, frame: np.ndarray) -> np.ndarray:
        """Preprocess frame: grayscale, resize, normalize to [0, 1]."""
        # Convert to grayscale
        gray_frame = cv2.cvtColor(frame, cv2.COLOR_RGB2GRAY)
        
        # Resize to target size
        resized_frame = cv2.resize(gray_frame, self.image_size)
        
        # Normalize to [0, 1]
        normalized_frame = resized_frame.astype(np.float32) / 255.0
        
        return normalized_frame
        
    def _get_stacked_frames(self) -> np.ndarray:
        """Get stacked frames as numpy array."""
        return np.array(list(self.frames))
        
    def _discrete_to_continuous(self, action: int) -> np.ndarray:
        """Convert discrete action to continuous action space."""
        if action == 0:     # Turn left
            return np.array([-0.5, 0.3, 0.0])
        elif action == 1:   # Go straight
            return np.array([0.0, 0.5, 0.0])
        elif action == 2:   # Turn right
            return np.array([0.5, 0.3, 0.0])
        elif action == 3:   # Brake
            return np.array([0.0, 0.0, 0.8])
        else:
            return np.array([0.0, 0.0, 0.0])
            
    def close(self):
        """Close the environment."""
        self.env.close()

# Test the environment
print("🚗 Testing CarRacing Environment...")
test_env = CarRacingWrapper()
test_obs = test_env.reset()
print(f"✅ Environment initialized successfully!")
print(f"   Observation shape: {test_obs.shape}")
print(f"   Action space: 4 discrete actions (left, straight, right, brake)")
test_env.close()
del test_env
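
The frame stack in the wrapper relies on `deque(maxlen=...)` discarding the oldest entry whenever a new one is appended. Here is a minimal sketch of that mechanism, using short strings as stand-ins for the 84×84 frames. The helpers `reset_stack` and `push_frame` are hypothetical names used only for illustration.

```python
from collections import deque

FRAME_STACK = 4  # same value as HYPERPARAMETERS['frame_stack']

def reset_stack(first_frame, size=FRAME_STACK):
    """Fill the buffer with copies of the first frame, as reset() does."""
    return deque([first_frame] * size, maxlen=size)

def push_frame(frames, new_frame):
    """Append the newest frame; the deque drops the oldest one automatically."""
    frames.append(new_frame)
    return list(frames)

frames = reset_stack("f0")
print(list(frames))              # ['f0', 'f0', 'f0', 'f0']
print(push_frame(frames, "f1"))  # ['f0', 'f0', 'f0', 'f1']
print(push_frame(frames, "f2"))  # ['f0', 'f0', 'f1', 'f2']
```

The newest frame always sits at the end of the stack, so the network sees the frames in time order and can infer motion from the differences between them.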

---
## 🧠 DQN Network Architecture

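Let's build our CNN-based Deep Q-Network! Three convolutional layers extract features from the stacked frames, and two fully connected layers turn them into one Q-value per action.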

In [None]:
class DQN(nn.Module):
    """CNN-based Deep Q-Network for CarRacing."""
    
    def __init__(self, action_dim: int = 4, input_channels: int = 4):
        """
        Initialize DQN network.
        
        Args:
            action_dim: Number of discrete actions
            input_channels: Number of input channels (frame stack)
        """
        super(DQN, self).__init__()
        
        # Convolutional layers
        self.conv1 = nn.Conv2d(input_channels, 32, kernel_size=8, stride=4, padding=0)
        self.conv2 = nn.Conv2d(32, 64, kernel_size=4, stride=2, padding=0)
        self.conv3 = nn.Conv2d(64, 64, kernel_size=3, stride=1, padding=0)
        
        # Calculate conv output size
        self._conv_output_size = self._get_conv_output_size((input_channels, 84, 84))
        
        # Fully connected layers
        self.fc1 = nn.Linear(self._conv_output_size, 512)
        self.fc2 = nn.Linear(512, action_dim)
        
        # Initialize weights
        self._initialize_weights()
        
    def _get_conv_output_size(self, input_shape: Tuple[int, int, int]) -> int:
        """Calculate output size after conv layers."""
        with torch.no_grad():
            dummy_input = torch.zeros(1, *input_shape)
            dummy_output = self._forward_conv(dummy_input)
            return dummy_output.numel()
            
    def _forward_conv(self, x: torch.Tensor) -> torch.Tensor:
        """Forward pass through conv layers only."""
        x = F.relu(self.conv1(x))
        x = F.relu(self.conv2(x))
        x = F.relu(self.conv3(x))
        return x.view(x.size(0), -1)
        
    def _initialize_weights(self):
        """Initialize network weights."""
        for module in self.modules():
            if isinstance(module, (nn.Conv2d, nn.Linear)):
                nn.init.xavier_uniform_(module.weight)
                if module.bias is not None:
                    nn.init.constant_(module.bias, 0)
                    
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """Forward pass through network."""
        x = self._forward_conv(x)
        x = F.relu(self.fc1(x))
        x = self.fc2(x)
        return x

# Create and analyze the network
dqn = DQN(action_dim=4, input_channels=4).to(device)
total_params = sum(p.numel() for p in dqn.parameters() if p.requires_grad)

print("🧠 DQN Network Created!")
print(f"   Total parameters: {total_params:,}")
print(f"   Network size: ~{total_params * 4 / 1024 / 1024:.1f} MB")

# Test forward pass
dummy_input = torch.randn(1, 4, 84, 84).to(device)
with torch.no_grad():
    output = dqn(dummy_input)
print(f"   Input shape: {dummy_input.shape}")
print(f"   Output shape: {output.shape}")
print(f"✅ Network test passed!")
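
The network works out its flattened feature size by pushing a dummy tensor through the conv layers. As a cross-check, the same number can be derived by hand with the usual convolution output formula, `(size + 2 * padding - kernel) // stride + 1`. This sketch assumes dilation 1, which is the `nn.Conv2d` default, and `conv_out` is a hypothetical helper.

```python
def conv_out(size: int, kernel: int, stride: int, padding: int = 0) -> int:
    """Output length of one spatial dimension after a conv layer (dilation 1)."""
    return (size + 2 * padding - kernel) // stride + 1

# (kernel_size, stride) for conv1, conv2, conv3 as defined in the DQN class
layers = [(8, 4), (4, 2), (3, 1)]

size = 84
for kernel, stride in layers:
    size = conv_out(size, kernel, stride)
    print(size)  # 20, then 9, then 7

flat_features = 64 * size * size  # conv3 has 64 output channels
print(flat_features)  # 3136
```

So `fc1` receives 3136 input features for 84×84 inputs. A different image size would change this number, which is why the class computes it instead of hard-coding it.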

---
## 🗃️ Experience Replay Buffer

The replay buffer stores past transitions and returns random mini-batches for training, which breaks up the strong correlation between consecutive frames.

In [None]:
class ReplayBuffer:
    """Experience replay buffer for storing transitions."""
    
    def __init__(self, capacity: int):
        self.buffer = deque(maxlen=capacity)
        self.capacity = capacity
        
    def push(self, state, action, reward, next_state, done):
        """Add transition to buffer."""
        self.buffer.append((state, action, reward, next_state, done))
        
    def sample(self, batch_size: int) -> Tuple:
        """Sample batch of transitions."""
        batch = random.sample(self.buffer, batch_size)
        states, actions, rewards, next_states, dones = zip(*batch)
        
        return (
            torch.FloatTensor(np.array(states)),
            torch.LongTensor(actions),
            torch.FloatTensor(rewards),
            torch.FloatTensor(np.array(next_states)),
            torch.BoolTensor(dones)
        )
        
    def __len__(self):
        return len(self.buffer)

# Create a standalone buffer as a quick check (the agent below builds its own)
replay_buffer = ReplayBuffer(HYPERPARAMETERS['buffer_size'])
print(f"🗃️  Replay buffer created with capacity: {HYPERPARAMETERS['buffer_size']:,}")
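
Before increasing the buffer size, it helps to estimate how much RAM the stored frames need. The rough calculation below assumes float32 frame stacks of shape (4, 84, 84) and counts `state` and `next_state` separately for every transition, so it is an upper bound. `buffer_bytes` is a hypothetical helper.

```python
def buffer_bytes(capacity: int, frame_stack: int = 4, height: int = 84,
                 width: int = 84, bytes_per_value: int = 4,
                 arrays_per_transition: int = 2) -> int:
    """Upper bound on the memory used by the image arrays in the buffer."""
    per_array = frame_stack * height * width * bytes_per_value
    return capacity * arrays_per_transition * per_array

float32_total = buffer_bytes(10_000)                   # frames stored as float32
uint8_total = buffer_bytes(10_000, bytes_per_value=1)  # frames stored as uint8

print(f"float32: {float32_total / 1024**3:.2f} GiB")
print(f"uint8:   {uint8_total / 1024**3:.2f} GiB")
```

For a capacity of 10,000 this comes to roughly 2.1 GiB as float32. In practice the usage can be closer to half of that when the `next_state` array of one transition is reused as the `state` of the next, since the buffer then holds two references to the same array. Storing frames as uint8 and normalizing at sampling time is a common way to cut the footprint further, although this notebook does not do that.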

---
## 🤖 DQN Agent

Let's create our DQN agent that combines all the components!

In [None]:
class DQNAgent:
    """DQN Agent with all training components."""
    
    def __init__(self, device: torch.device):
        self.device = device
        self.action_dim = 4  # left, straight, right, brake
        
        # Networks
        self.main_network = DQN(self.action_dim).to(device)
        self.target_network = DQN(self.action_dim).to(device)
        self.target_network.load_state_dict(self.main_network.state_dict())
        
        # Optimizer
        self.optimizer = optim.Adam(
            self.main_network.parameters(), 
            lr=HYPERPARAMETERS['learning_rate']
        )
        
        # Replay buffer
        self.replay_buffer = ReplayBuffer(HYPERPARAMETERS['buffer_size'])
        
        # Exploration strategy
        self.epsilon = HYPERPARAMETERS['epsilon_start']
        self.epsilon_decay = HYPERPARAMETERS['epsilon_decay']
        self.epsilon_min = HYPERPARAMETERS['epsilon_end']
        
        # Training counters
        self.step_count = 0
        self.episode_count = 0
        
    def select_action(self, state: np.ndarray, training: bool = True) -> int:
        """Select action using epsilon-greedy policy."""
        if training and random.random() < self.epsilon:
            return random.randint(0, self.action_dim - 1)
        
        with torch.no_grad():
            state_tensor = torch.FloatTensor(state).unsqueeze(0).to(self.device)
            q_values = self.main_network(state_tensor)
            return q_values.argmax().item()
            
    def store_transition(self, state, action, reward, next_state, done):
        """Store transition in replay buffer."""
        self.replay_buffer.push(state, action, reward, next_state, done)
        
    def update(self) -> Optional[float]:
        """Update network using batch from replay buffer."""
        if len(self.replay_buffer) < HYPERPARAMETERS['batch_size']:
            return None
            
        # Sample batch
        states, actions, rewards, next_states, dones = \
            self.replay_buffer.sample(HYPERPARAMETERS['batch_size'])
            
        states = states.to(self.device)
        actions = actions.to(self.device)
        rewards = rewards.to(self.device)
        next_states = next_states.to(self.device)
        dones = dones.to(self.device)
        
        # Current Q-values
        current_q_values = self.main_network(states).gather(1, actions.unsqueeze(1))
        
        # Next Q-values from target network
        with torch.no_grad():
            next_q_values = self.target_network(next_states).max(1)[0]
            targets = rewards + (HYPERPARAMETERS['gamma'] * next_q_values * (~dones))
            
        # Compute Huber loss; squeeze(1) drops only the gathered action dimension
        loss = F.smooth_l1_loss(current_q_values.squeeze(1), targets)
        
        # Optimize
        self.optimizer.zero_grad()
        loss.backward()
        torch.nn.utils.clip_grad_norm_(self.main_network.parameters(), 1.0)
        self.optimizer.step()
        
        # Update step counter
        self.step_count += 1
        
        # Update target network
        if self.step_count % HYPERPARAMETERS['target_update'] == 0:
            self.update_target_network()
            
        return loss.item()
        
    def update_target_network(self):
        """Update target network with main network weights."""
        self.target_network.load_state_dict(self.main_network.state_dict())
        
    def update_epsilon(self):
        """Update epsilon for next episode."""
        self.epsilon = max(self.epsilon_min, self.epsilon * self.epsilon_decay)
        self.episode_count += 1

# Create the agent
agent = DQNAgent(device)
print(f"🤖 DQN Agent created!")
print(f"   Main network parameters: {sum(p.numel() for p in agent.main_network.parameters()):,}")
print(f"   Target network parameters: {sum(p.numel() for p in agent.target_network.parameters()):,}")
print(f"   Initial epsilon: {agent.epsilon}")
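
The core of `update()` is the one-step TD target followed by a Huber (smooth L1) loss. The plain-Python sketch below works through both for a single transition, so the numbers can be checked by hand. The Q-values are made up for illustration, and `td_target` and `huber` are hypothetical helpers that mirror the tensor operations, not replacements for them.

```python
GAMMA = 0.99  # same value as HYPERPARAMETERS['gamma']

def td_target(reward: float, next_q_values: list, done: bool,
              gamma: float = GAMMA) -> float:
    """reward + gamma * max_a Q_target(s', a), with no bootstrap when done."""
    bootstrap = 0.0 if done else max(next_q_values)
    return reward + gamma * bootstrap

def huber(prediction: float, target: float, beta: float = 1.0) -> float:
    """Smooth L1 loss for one element: quadratic below beta, linear above."""
    diff = abs(prediction - target)
    if diff < beta:
        return 0.5 * diff ** 2 / beta
    return diff - 0.5 * beta

next_q = [0.5, 2.0, 1.0, -0.3]  # hypothetical target-network outputs

print(td_target(1.0, next_q, done=False))  # 1.0 + 0.99 * 2.0, about 2.98
print(td_target(1.0, next_q, done=True))   # 1.0, no bootstrapping
print(huber(0.5, 1.0))                     # 0.125 (quadratic region)
print(huber(0.0, 3.0))                     # 2.5 (linear region)
```

The linear region is what keeps a few very wrong Q-estimates from dominating the gradient. In the agent, the per-sample losses are averaged over the batch.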

---
## 🏋️ Training Loop

Now let's train our agent! This is where the magic happens. 🪄

In [None]:
def train_agent(num_episodes: int = HYPERPARAMETERS['num_episodes']):
    """Train the DQN agent."""
    
    # Initialize environment
    env = CarRacingWrapper()
    
    # Training statistics
    episode_rewards = []
    episode_losses = []
    episode_lengths = []
    
    # Create directories for saving
    models_dir = Path("../models/saved_weights")
    models_dir.mkdir(parents=True, exist_ok=True)
    
    print(f"🚀 Starting training for {num_episodes} episodes...")
    print(f"📊 Progress will be displayed every {HYPERPARAMETERS['log_interval']} episodes")
    print("-" * 60)
    
    start_time = time.time()
    best_reward = float('-inf')
    
    # Training loop with progress bar
    progress_bar = tqdm(range(num_episodes), desc="Training")
    
    for episode in progress_bar:
        # Reset environment
        state = env.reset()
        episode_reward = 0.0
        episode_loss_list = []
        step = 0
        
        # Episode loop
        for step in range(HYPERPARAMETERS['max_steps_per_episode']):
            # Select and take action
            action = agent.select_action(state, training=True)
            next_state, reward, terminated, truncated, _ = env.step(action)
            
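            # Store transition. Only a true termination should stop bootstrapping;
            # a time-limit truncation still has a valid next state to learn from.
            done = terminated or truncated
            agent.store_transition(state, action, reward, next_state, terminated)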
            
            # Update agent
            loss = agent.update()
            if loss is not None:
                episode_loss_list.append(loss)
                
            # Update state and reward
            state = next_state
            episode_reward += reward
            
            if done:
                break
                
        # Update statistics
        episode_rewards.append(episode_reward)
        episode_lengths.append(step + 1)
        avg_loss = np.mean(episode_loss_list) if episode_loss_list else 0.0
        episode_losses.append(avg_loss)
        
        # Update exploration
        agent.update_epsilon()
        
        # Update progress bar
        recent_rewards = episode_rewards[-10:]  # slicing already handles short lists
        avg_reward = np.mean(recent_rewards)
        progress_bar.set_postfix({
            'Reward': f'{episode_reward:.1f}',
            'Avg': f'{avg_reward:.1f}',
            'ε': f'{agent.epsilon:.3f}'
        })
        
        # Logging
        if episode % HYPERPARAMETERS['log_interval'] == 0 and episode > 0:
            print(f"\nEpisode {episode:4d} | "
                  f"Reward: {episode_reward:8.2f} | "
                  f"Avg Reward: {avg_reward:8.2f} | "
                  f"Loss: {avg_loss:.4f} | "
                  f"Epsilon: {agent.epsilon:.4f} | "
                  f"Buffer: {len(agent.replay_buffer)}")
                  
        # Save model
        if episode % HYPERPARAMETERS['save_interval'] == 0 and episode > 0:
            model_path = models_dir / f"dqn_episode_{episode}.pth"
            torch.save({
                'main_network': agent.main_network.state_dict(),
                'target_network': agent.target_network.state_dict(),
                'optimizer': agent.optimizer.state_dict(),
                'epsilon': agent.epsilon,
                'step_count': agent.step_count,
                'episode_count': agent.episode_count
            }, model_path)
            
        # Save best model (checked every episode, not only at checkpoint intervals)
        if episode_reward > best_reward:
            best_reward = episode_reward
            best_model_path = models_dir / "dqn_best.pth"
            torch.save({
                'main_network': agent.main_network.state_dict(),
                'target_network': agent.target_network.state_dict(),
                'optimizer': agent.optimizer.state_dict(),
                'epsilon': agent.epsilon,
                'step_count': agent.step_count,
                'episode_count': agent.episode_count
            }, best_model_path)
            print(f"💾 New best model saved! Reward: {best_reward:.2f}")
    
    # Save final model
    final_model_path = models_dir / "dqn_final.pth"
    torch.save({
        'main_network': agent.main_network.state_dict(),
        'target_network': agent.target_network.state_dict(),
        'optimizer': agent.optimizer.state_dict(),
        'epsilon': agent.epsilon,
        'step_count': agent.step_count,
        'episode_count': agent.episode_count
    }, final_model_path)
    
    # Training summary
    total_time = time.time() - start_time
    print("\n" + "=" * 60)
    print("🎉 TRAINING COMPLETE!")
    print("=" * 60)
    print(f"Total episodes: {len(episode_rewards)}")
    print(f"Total time: {total_time/60:.1f} minutes")
    print(f"Average reward: {np.mean(episode_rewards):.2f}")
    print(f"Best reward: {np.max(episode_rewards):.2f}")
    print(f"Final epsilon: {agent.epsilon:.4f}")
    print(f"Total steps: {agent.step_count}")
    
    # Cleanup
    env.close()
    
    return episode_rewards, episode_losses, episode_lengths

# Start training!
print("🏁 Ready to start training!")
print(f"📋 Training for {HYPERPARAMETERS['num_episodes']} episodes")
print(f"⚡ Using device: {device}")

In [None]:
# Run the training!
episode_rewards, episode_losses, episode_lengths = train_agent()

print("\n🎊 Training finished! Check the results below.")

---
## 📊 Training Results Analysis

Let's visualize how our agent performed during training!

In [None]:
# Create comprehensive training plots
fig, axes = plt.subplots(2, 2, figsize=(15, 10))
fig.suptitle('🏎️ DQN Training Results', fontsize=16, fontweight='bold')

# Episode rewards
axes[0, 0].plot(episode_rewards, 'b-', alpha=0.6)
if len(episode_rewards) >= 10:
    # Add moving average
    window = min(10, len(episode_rewards))
    moving_avg = np.convolve(episode_rewards, np.ones(window)/window, mode='valid')
    axes[0, 0].plot(range(window-1, len(episode_rewards)), moving_avg, 'r-', linewidth=2, label=f'MA({window})')
    axes[0, 0].legend()
axes[0, 0].set_title('Episode Rewards')
axes[0, 0].set_xlabel('Episode')
axes[0, 0].set_ylabel('Reward')
axes[0, 0].grid(True, alpha=0.3)

# Episode losses
non_zero_losses = [loss for loss in episode_losses if loss > 0]
if non_zero_losses:
    axes[0, 1].plot(non_zero_losses, 'g-')
    axes[0, 1].set_title('Training Loss')
    axes[0, 1].set_xlabel('Episode (with training)')
    axes[0, 1].set_ylabel('Loss')
    axes[0, 1].grid(True, alpha=0.3)
else:
    axes[0, 1].text(0.5, 0.5, 'No training data\n(buffer too small)', 
                   ha='center', va='center', transform=axes[0, 1].transAxes)
    axes[0, 1].set_title('Training Loss')

# Episode lengths
axes[1, 0].plot(episode_lengths, 'orange')
axes[1, 0].set_title('Episode Lengths')
axes[1, 0].set_xlabel('Episode')
axes[1, 0].set_ylabel('Steps')
axes[1, 0].grid(True, alpha=0.3)

# Epsilon decay
epsilons = [HYPERPARAMETERS['epsilon_start'] * (HYPERPARAMETERS['epsilon_decay'] ** i) for i in range(len(episode_rewards))]
epsilons = [max(HYPERPARAMETERS['epsilon_end'], eps) for eps in epsilons]
axes[1, 1].plot(epsilons, 'purple')
axes[1, 1].set_title('Epsilon Decay')
axes[1, 1].set_xlabel('Episode')
axes[1, 1].set_ylabel('Epsilon')
axes[1, 1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# Print statistics
print("📈 Training Statistics:")
print(f"   Episodes: {len(episode_rewards)}")
print(f"   Average reward: {np.mean(episode_rewards):.2f} ± {np.std(episode_rewards):.2f}")
print(f"   Best reward: {np.max(episode_rewards):.2f}")
print(f"   Worst reward: {np.min(episode_rewards):.2f}")
print(f"   Average episode length: {np.mean(episode_lengths):.1f} steps")
print(f"   Final epsilon: {agent.epsilon:.4f}")

# Performance analysis
if len(episode_rewards) >= 20:
    early_rewards = np.mean(episode_rewards[:10])
    late_rewards = np.mean(episode_rewards[-10:])
    improvement = late_rewards - early_rewards
    print(f"\n📊 Learning Progress:")
    print(f"   Early episodes (1-10): {early_rewards:.2f}")
    print(f"   Late episodes ({len(episode_rewards)-9}-{len(episode_rewards)}): {late_rewards:.2f}")
    pct = improvement / abs(early_rewards) * 100 if early_rewards != 0 else float('nan')
    print(f"   Improvement: {improvement:.2f} ({pct:.1f}%)")
    
    if improvement > 0:
        print("   🎉 Your agent is learning!")
    else:
        print("   💡 Try training for more episodes or tuning hyperparameters")
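
The red curve in the reward plot is a simple moving average. For readers who want to see what the `np.convolve(..., mode='valid')` call computes, here is an illustrative pure-Python version for the case where the window is no longer than the data. `moving_average` is a hypothetical helper and the rewards are made-up numbers.

```python
def moving_average(values, window):
    """Mean of each full run of `window` consecutive values."""
    if window <= 0 or window > len(values):
        return []
    running = sum(values[:window])
    averages = [running / window]
    for i in range(window, len(values)):
        # Slide the window: add the newest value, drop the oldest one
        running += values[i] - values[i - window]
        averages.append(running / window)
    return averages

rewards = [1.0, 2.0, 3.0, 4.0, 5.0]  # hypothetical episode rewards
print(moving_average(rewards, 3))     # [2.0, 3.0, 4.0]
```

The result has `len(values) - window + 1` points, and the first one summarizes episodes `0` to `window - 1`. That is why the plot draws the moving average starting at x = `window - 1`.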

---
## 🏆 Next Steps

Congratulations! You've successfully trained a DQN agent. Here's what you can do next:

### 🎮 Test Your Agent
Run the demo script to see your trained agent in action:
```bash
python ../games/demo_trained_agent.py
```

### 🔧 Improve Performance
Try these techniques to get better results:

1. **Train Longer**: Increase `num_episodes` to 500-1000
2. **Tune Hyperparameters**: 
   - Lower learning rate (0.00005)
   - Larger buffer size (50000; note that float32 frame stacks at this size can take several GB of RAM)
   - Different epsilon decay (0.999)
3. **Advanced Techniques**:
   - Double DQN
   - Dueling DQN
   - Prioritized Experience Replay
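
As a taste of the first variant, Double DQN changes only how the target is built: the main network chooses the next action and the target network scores it, which tends to reduce overestimated Q-values. The sketch below compares the two targets on made-up Q-values. All function names here are hypothetical, and this is an illustration of the idea, not a drop-in change to the agent in this notebook.

```python
def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=values.__getitem__)

def dqn_target(reward, target_next_q, done, gamma=0.99):
    """Standard DQN: the target network both picks and scores the action."""
    return reward + (0.0 if done else gamma * max(target_next_q))

def double_dqn_target(reward, online_next_q, target_next_q, done, gamma=0.99):
    """Double DQN: the main network picks the action, the target network scores it."""
    if done:
        return reward
    best_action = argmax(online_next_q)
    return reward + gamma * target_next_q[best_action]

online_next_q = [1.0, 3.0, 2.0, 0.0]  # hypothetical main-network outputs
target_next_q = [0.5, 1.5, 4.0, 0.0]  # hypothetical target-network outputs

# Standard DQN bootstraps from 4.0, the target network's own maximum
print(dqn_target(0.0, target_next_q, done=False))
# Double DQN bootstraps from 1.5, the target's score for the main network's choice
print(double_dqn_target(0.0, online_next_q, target_next_q, done=False))
```

In a tensor version, the same idea would use the main network's `argmax` over `next_states` and then `gather` the matching entries from the target network's output.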

### 📚 Learn More
- Read the original [DQN paper](https://arxiv.org/abs/1312.5602)
- Try other environments from Gymnasium
- Implement DQN variants

### 💾 Save Your Work
Your trained models are saved in `../models/saved_weights/`:
- `dqn_best.pth` - Best performing model
- `dqn_final.pth` - Final model after training
- `dqn_episode_X.pth` - Checkpoints during training

Happy racing! 🏎️💨

In [None]:
# Final summary
print("🎉 DQN Training Notebook Complete!")
print("="*50)
print("✅ Environment setup")
print("✅ DQN network architecture")
print("✅ Experience replay buffer")
print("✅ Agent training")
print("✅ Results visualization")
print("\n🚀 Your AI agent is ready to race!")
print("Run the demo script to see it in action.")