# Deep Q-Network (DQN) for Gravity Guy

## What is DQN?
**Deep Q-Network (DQN)** is a reinforcement learning method that learns a function
$Q_\theta(s,a)$ estimating the long-term return of taking action $a$ in state $s$.
We act with **ε-greedy** (mostly pick the action with the highest Q, sometimes explore),
and we **train** the network to match a bootstrapped target:

$$
y = \begin{cases}
r & \text{if episode terminated}\\
r + \gamma \max_{a'} Q_{\bar\theta}(s', a') & \text{otherwise}
\end{cases}
$$

$$
\text{Loss} = \mathrm{Huber}\big( Q_\theta(s,a) - y \big)
$$

Two stabilizers make DQN work well in practice:
- **Replay buffer**: learn from randomized past transitions $(s,a,r,s',\text{done})$ to break correlations.
- **Target network** $Q_{\bar\theta}$: a slowly updated copy used to compute $y$.
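
The target and loss above can be sketched in a few lines of PyTorch. This is a minimal illustration on a made-up batch of two transitions, not the notebook's training code; the helper name `td_targets` and the toy numbers are assumptions.

```python
import torch
import torch.nn.functional as F

def td_targets(rewards, next_q_target, dones, gamma=0.99):
    """y = r on terminal transitions, else r + gamma * max_a' Q_target(s', a')."""
    max_next_q = next_q_target.max(dim=1).values   # best next-state value, shape [batch]
    not_done = (~dones).float()                    # 0.0 where the episode ended
    return rewards + gamma * not_done * max_next_q

# Toy batch of two transitions; the second one is terminal.
rewards = torch.tensor([1.0, 1.0])
next_q_target = torch.tensor([[0.5, 2.0], [3.0, 4.0]])  # target-network Q(s', .) per row
dones = torch.tensor([False, True])

y = td_targets(rewards, next_q_target, dones, gamma=0.9)
q_sa = torch.tensor([2.0, 1.5])        # online-network Q(s, a) for the actions taken
loss = F.smooth_l1_loss(q_sa, y)       # Huber loss with threshold 1
print(y, loss)
```

The terminal row ignores its next-state values entirely, which is what the first case of the target definition requires.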

## Why DQN fits this game
- **Tiny, discrete action space:** 2 actions (NOOP / FLIP).
- **Dense, shaped reward:** per-step progress minus a small flip penalty.
- **Compact observation:** 6 floats capture what matters (vertical state + look-ahead probes).
- **Fast, headless env:** high sample throughput for replay.

## State / Action / Reward (this notebook)
- **Observation (6 floats):**
  1. `y_norm` ∈ [0,1] — vertical position (0=top, 1=bottom)  
  2. `vy_norm` ∈ [-1,1] — normalized vertical speed  
  3. `grav_dir` ∈ {−1,+1} — current gravity (up/down)  
  4–6. `p1, p2, p3` ∈ [0,1] — **look-ahead clearances** in the gravity direction (near → far)
- **Actions:** `0 = NOOP`, `1 = FLIP` (flip only **fires** when grounded & cooldown is over; invalid flips act as no-ops).
- **Reward per step:** `progress − flip_penalty × [flip_fired]`
- **Termination:** off-screen (death) or time limit (e.g., 10 s).
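
As a rough sketch of the reward rule above (the function name `step_reward` and the progress value are illustrative assumptions, not the environment's actual code), the penalty is charged only on steps where a flip really fires:

```python
def step_reward(progress, flip_fired, flip_penalty=0.01):
    """Per-step reward: progress minus a penalty charged only when a flip actually fires."""
    return progress - flip_penalty * (1.0 if flip_fired else 0.0)

# Hypothetical step with 2.0 units of progress.
r_noop = step_reward(progress=2.0, flip_fired=False)
r_flip = step_reward(progress=2.0, flip_fired=True)
print(r_noop, r_flip)
```

Because invalid flips act as no-ops, requesting `FLIP` while airborne would leave the reward unchanged under this rule.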

## Learning loop (at a glance)
1. **Observe** state $s$.  
2. **Act** with ε-greedy: pick $\arg\max_a Q_\theta(s,a)$ with prob $1-ε$, random action with prob $ε$.  
3. **Step** the env → get $(r, s', \text{done})$.  
4. **Store** $(s,a,r,s',\text{done})$ in the replay buffer.  
5. **Sample** a mini-batch from replay, compute targets $y$ with the **target network**.  
6. **Update** the online network $Q_\theta$ to minimize Huber loss; periodically **update** the target network.  
7. **Anneal** $ε$ over time to reduce exploration.
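
Step 2 of the loop can be sketched without any deep-learning machinery. The function name `epsilon_greedy` and the Q-values below are hypothetical; the point is only the explore/exploit split.

```python
import random

def epsilon_greedy(q_values, epsilon, rng=random):
    """With probability epsilon pick a uniformly random action, otherwise the argmax."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

rng = random.Random(0)
q = [0.2, 1.3]  # hypothetical Q(s, NOOP), Q(s, FLIP)

greedy = epsilon_greedy(q, epsilon=0.0, rng=rng)   # always the argmax when epsilon is 0
picks = [epsilon_greedy(q, epsilon=0.5, rng=rng) for _ in range(2000)]
# Expected FLIP share: 0.5 (greedy) + 0.5 * 0.5 (random) = 0.75
flip_fraction = sum(picks) / len(picks)
print(greedy, round(flip_fraction, 2))
```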

## Game-specific caveats (and how we handle them)
- **Action validity:** flips only take effect when grounded.  
  *Mitigation:* treat invalid flips as no-ops and/or mask them at action selection time.
- **Timing & partial observability:** probes look ahead in x while y changes over time.  
  *Mitigation:* keep the observation compact but informative (probes + gravity + velocity).  
  (Optionally, stack a few recent observations or add `grounded`/`cooldown` scalars.)
- **Evaluation fairness:** fix a set of level seeds; report mean/median distance, % time-limit terminations, and flips per 1000 px.
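
The action-masking mitigation mentioned under "Action validity" could look like the sketch below. It assumes the caller can obtain a validity mask (for example from a `grounded`/`cooldown` signal, which the 6-float observation does not expose by itself); `masked_greedy_action` is a hypothetical helper.

```python
import torch

def masked_greedy_action(q_values, valid_mask):
    """Greedy action restricted to valid actions: invalid ones are set to -inf before argmax."""
    masked = q_values.masked_fill(~valid_mask, float("-inf"))
    return int(torch.argmax(masked).item())

q = torch.tensor([0.1, 0.9])             # Q(s, NOOP), Q(s, FLIP)
airborne = torch.tensor([True, False])   # FLIP unavailable (not grounded or on cooldown)
grounded = torch.tensor([True, True])    # both actions available

a_air = masked_greedy_action(q, airborne)
a_ground = masked_greedy_action(q, grounded)
print(a_air, a_ground)
```

Masking only changes which action is selected; since the environment already treats invalid flips as no-ops, it mainly avoids filling the replay buffer with flips that never fired.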

## What the reader should expect
- Baselines (**Random**, **Heuristic**) for context.  
- A DQN agent that learns to time flips better than random, often matching or surpassing the hand-crafted heuristic on held-out seeds.  
- Clear plots: training return, evaluation distance, and failure-mode breakdown.
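
The evaluation metrics listed in the caveats (mean/median distance, % time-limit terminations, flips per 1000 px) can be aggregated with a small helper. This is a sketch: `distance_px` mirrors the key in the environment's `info` dict, while `flips` and `time_limit` are assumed per-episode fields, and the numbers are made up.

```python
from statistics import mean, median

def summarize(episodes):
    """Aggregate per-episode records into the evaluation metrics."""
    distances = [e["distance_px"] for e in episodes]
    total_px = sum(distances)
    total_flips = sum(e["flips"] for e in episodes)
    return {
        "mean_distance_px": mean(distances),
        "median_distance_px": median(distances),
        "pct_time_limit": 100.0 * sum(e["time_limit"] for e in episodes) / len(episodes),
        "flips_per_1000px": 1000.0 * total_flips / total_px if total_px else 0.0,
    }

# Made-up records for two evaluation episodes on fixed seeds.
episodes = [
    {"distance_px": 1000, "flips": 4, "time_limit": False},
    {"distance_px": 3000, "flips": 8, "time_limit": True},
]
report = summarize(episodes)
print(report)
```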


# 1. Environment Setup and Imports

First, we'll import the necessary libraries and set up our environment.

In [47]:
# Standard libraries for ML and data handling
import numpy as np
import matplotlib.pyplot as plt
import random
import json
from collections import deque
import time

# PyTorch for neural networks (you might need: pip install torch)
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F

# Our custom environment
import sys
sys.path.append('../..')  # Go up two directories to access src/
from src.env.gg_env import GGEnv

print("Libraries imported successfully!")
print(f"PyTorch version: {torch.__version__}")
print(f"Using device: {'CUDA' if torch.cuda.is_available() else 'CPU'}")

Libraries imported successfully!
PyTorch version: 2.8.0+cpu
Using device: CPU


## Quick Environment Test

Before building our AI, let's make sure we understand our environment perfectly:

In [48]:
# Create a test environment
env = GGEnv(level_seed=12345, max_time_s=10.0, flip_penalty=0.01)
obs = env.reset()

print("=== ENVIRONMENT UNDERSTANDING ===")
print(f"Observation space: {len(obs)} dimensions")
print(f"Action space: {env.action_space_n} actions (0=wait, 1=flip)")
print(f"First observation: {obs}")

# Take a few random actions to see what happens
total_reward = 0
for step in range(10):
    action = random.choice([0, 1])  # Random action
    obs, reward, done, info = env.step(action)
    total_reward += reward
    
    print(f"Step {step+1}: action={action}, reward={reward:.3f}, done={done}")
    print(f"  → obs: [{obs[0]:.2f}, {obs[1]:.2f}, {obs[2]:.0f}, {obs[3]:.2f}, {obs[4]:.2f}, {obs[5]:.2f}]")
    
    if done:
        print(f"Episode ended! Total reward: {total_reward:.2f}, Distance: {info['distance_px']}px")
        break

print("\n✅ Environment test completed!")

=== ENVIRONMENT UNDERSTANDING ===
Observation space: 6 dimensions
Action space: 2 actions (0=wait, 1=flip)
First observation: [0.5, 0.0, 1.0, 0.24814814814814815, 0.24814814814814815, 0.24814814814814815]
Step 1: action=1, reward=2.083, done=False
  → obs: [0.50, 0.01, 1, 0.25, 0.25, 0.25]
Step 2: action=0, reward=2.083, done=False
  → obs: [0.50, 0.03, 1, 0.25, 0.25, 0.25]
Step 3: action=0, reward=2.083, done=False
  → obs: [0.50, 0.04, 1, 0.25, 0.25, 0.25]
Step 4: action=1, reward=2.083, done=False
  → obs: [0.50, 0.05, 1, 0.25, 0.25, 0.25]
Step 5: action=1, reward=2.083, done=False
  → obs: [0.50, 0.06, 1, 0.25, 0.25, 0.25]
Step 6: action=1, reward=2.083, done=False
  → obs: [0.51, 0.07, 1, 0.24, 0.24, 0.24]
Step 7: action=1, reward=2.083, done=False
  → obs: [0.51, 0.09, 1, 0.24, 0.24, 0.24]
Step 8: action=1, reward=2.083, done=False
  → obs: [0.51, 0.10, 1, 0.24, 0.24, 0.24]
Step 9: action=0, reward=2.083, done=False
  → obs: [0.51, 0.11, 1, 0.24, 0.24, 0.24]
Step 10: action=0, re

## Part 2: Building the Neural Network Brain

### What is our Neural Network doing?

Think of the neural network as the agent's "brain". It takes in the 6 observations from the game and outputs 2 numbers:
- **Q(state, wait)**: How good is it to wait/do nothing in this situation?
- **Q(state, flip)**: How good is it to flip gravity in this situation?

When exploiting, the agent chooses the action with the higher Q-value; during training, ε-greedy occasionally overrides this choice with a random action.

### Network Architecture Design

For our Gravity Guy game, we'll use a simple but effective architecture:

```
Input Layer (6 neurons) → Hidden Layer (128 neurons) → Hidden Layer (64 neurons) → Output Layer (2 neurons)
       ↓                        ↓                           ↓                        ↓
  [y, vy, grav,              [lots of                   [more                   [Q(wait), 
   p1, p2, p3]                neurons]                   neurons]                 Q(flip)]
```

### Why this architecture?
- **Input**: 6 observations (exactly what our environment gives us)
- **Hidden layers**: 128 and 64 neurons, small enough to be fast, large enough to capture the timing patterns in probes/velocity
- **Output**: 2 Q-values (one for each possible action)
- **Activation:** ReLU in hidden layers, **linear** in the output (Q-values can be any real number).
- **Parameter budget (intuition):** 6×128 + 128×64 + 64×2 = 9,088 weights plus 128 + 64 + 2 = 194 biases → 9,282 parameters, tiny and trainable.
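
A minimal sketch of the 6 → 128 → 64 → 2 network described in this section, useful for checking the parameter count. The class name `SimpleQNet` is an assumption for illustration; it is not the `DQN` class implemented further down.

```python
import torch
import torch.nn as nn

class SimpleQNet(nn.Module):
    """Plain MLP: two ReLU hidden layers and a linear output with one Q-value per action."""
    def __init__(self, input_size=6, hidden1=128, hidden2=64, output_size=2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_size, hidden1), nn.ReLU(),
            nn.Linear(hidden1, hidden2), nn.ReLU(),
            nn.Linear(hidden2, output_size),  # no activation: Q-values are unbounded
        )

    def forward(self, x):
        return self.net(x)

net = SimpleQNet()
n_params = sum(p.numel() for p in net.parameters())  # weights and biases
q_values = net(torch.zeros(1, 6))                    # one dummy observation
print(n_params, tuple(q_values.shape))
```

Counting weights and biases layer by layer gives (6×128 + 128) + (128×64 + 64) + (64×2 + 2) = 9,282.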


### How it fits into the DQN loop
1. **Forward pass:** $Q_\theta(s)$ → two Q-values.  
2. **Act:** choose action via ε-greedy (with optional mask).  
3. **Store:** add $(s,a,r,s',\text{done})$ to replay.  
4. **Target:** $y = r + \gamma \max_{a'} Q_{\bar\theta}(s',a')$ (or $y=r$ on terminal).  
5. **Update:** minimize Huber$(Q_\theta(s,a) - y)$.  
6. **Stabilize:** periodically copy online weights to the **target network** (or use soft updates).
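
Step 6 mentions soft updates as an alternative to periodic hard copies. A sketch of that variant (Polyak averaging) follows; the helper name `soft_update` and the value of `tau` are assumptions, and small `nn.Linear` layers stand in for the Q-networks.

```python
import torch
import torch.nn as nn

@torch.no_grad()
def soft_update(target_net, online_net, tau=0.005):
    """Move each target parameter a fraction tau toward the online parameter."""
    for t_param, o_param in zip(target_net.parameters(), online_net.parameters()):
        t_param.mul_(1.0 - tau).add_(tau * o_param)

online = nn.Linear(6, 2)
target = nn.Linear(6, 2)
target.load_state_dict(online.state_dict())   # hard copy: start identical

with torch.no_grad():
    online.weight.add_(1.0)                   # pretend training moved the online weights

before_gap = (online.weight - target.weight).abs().mean().item()
soft_update(target, online, tau=0.1)
after_gap = (online.weight - target.weight).abs().mean().item()
print(before_gap, after_gap)                  # the gap shrinks by a factor of (1 - tau)
```

This helper only touches parameters; a network with buffers (for example batch-norm running statistics) would need those copied separately.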


## Neural Network Implementation

In [73]:
class DQN(nn.Module):
    """Enhanced Deep Q-Network with modern architecture features
    
    Improvements:
    1. Deeper architecture with a residual connection
    2. Layer normalization for better training stability
    3. He (Kaiming) weight initialization
    4. Dropout for regularization
    5. Dueling value/advantage output streams
    """
    def __init__(self, input_size, output_size, hidden_size=128):
        super(DQN, self).__init__()
        
        # Input preprocessing
        self.input_norm = nn.LayerNorm(input_size)
        
        # Main network layers
        self.fc1 = nn.Linear(input_size, hidden_size)
        self.norm1 = nn.LayerNorm(hidden_size)
        self.drop1 = nn.Dropout(0.1)
        
        self.fc2 = nn.Linear(hidden_size, hidden_size)
        self.norm2 = nn.LayerNorm(hidden_size)
        self.drop2 = nn.Dropout(0.1)
        
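        # Residual path: projects the normalized input (input_size features) up to
        # hidden_size so it can be added to the hidden activations in forward()
        self.residual = nn.Linear(input_size, hidden_size)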
        
        # Output layers with value and advantage streams
        self.value_stream = nn.Sequential(
            nn.Linear(hidden_size, hidden_size // 2),
            nn.ReLU(),
            nn.Linear(hidden_size // 2, 1)
        )
        
        self.advantage_stream = nn.Sequential(
            nn.Linear(hidden_size, hidden_size // 2),
            nn.ReLU(),
            nn.Linear(hidden_size // 2, output_size)
        )
        
        # Initialize weights using He initialization
        self._init_weights()
        
    def _init_weights(self):
        """Initialize network weights using He initialization"""
        for module in self.modules():
            if isinstance(module, nn.Linear):
                nn.init.kaiming_normal_(module.weight, nonlinearity='relu')
                if module.bias is not None:
                    nn.init.constant_(module.bias, 0.0)
    
    def forward(self, x):
        # Ensure input is at least 2D
        if x.dim() == 1:
            x = x.unsqueeze(0)
        
        # Input normalization
        x = self.input_norm(x)
        
        # Main network path with residual connection
        identity = x
        
        x = F.relu(self.norm1(self.fc1(x)))
        x = self.drop1(x)
        
        x = F.relu(self.norm2(self.fc2(x)))
        x = self.drop2(x)
        
        # Add residual connection
        residual = self.residual(identity)
        x = x + residual
        
        # Dueling network architecture
        value = self.value_stream(x)
        advantages = self.advantage_stream(x)
        
        # Combine value and advantages using dueling formula
        qvalues = value + (advantages - advantages.mean(dim=1, keepdim=True))
        
        return qvalues
    
print("=== TESTING DQN MODEL ===")

=== TESTING DQN MODEL ===


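RuntimeError: mat1 and mat2 shapes cannot be multiplied (1x6 and 128x128)

This error is raised when the residual projection is declared as `nn.Linear(hidden_size, hidden_size)` but applied to the 6-dimensional input saved in `identity`; the projection's input dimension must be `input_size`.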

## Understanding the Network Components

In [72]:
print("=== UNDERSTANDING NETWORK COMPONENTS ===")

# Let's examine what each layer does
test_obs = torch.tensor([0.3, -0.1, -1.0, 0.6, 0.7, 0.8], dtype=torch.float32).unsqueeze(0)

print("Step-by-step forward pass:")
print(f"1. Input observations: {test_obs.squeeze().tolist()}")

# Manual forward pass to see each step
x = test_obs
print(f"   Shape: {x.shape}")

# Layer 1
x = F.relu(dqn.fc1(x))  
print(f"2. After first hidden layer (128 neurons): {x.shape}")
print(f"   Sample values: [{x[0][0]:.3f}, {x[0][1]:.3f}, {x[0][2]:.3f}, ...] (showing first 3)")

# Layer 2  
x = F.relu(dqn.fc2(x))
print(f"3. After second hidden layer (64 neurons): {x.shape}")
print(f"   Sample values: [{x[0][0]:.3f}, {x[0][1]:.3f}, {x[0][2]:.3f}, ...] (showing first 3)")

# Layer 3
x = dqn.fc3(x)
print(f"4. Final Q-values: {x.squeeze().tolist()}")
print(f"   Q(wait) = {x[0][0]:.3f}, Q(flip) = {x[0][1]:.3f}")

# Decision making
best_action = torch.argmax(x).item()
print(f"5. Decision: Choose action {best_action} ({'flip' if best_action == 1 else 'wait'})")

=== UNDERSTANDING NETWORK COMPONENTS ===
Step-by-step forward pass:
1. Input observations: [0.30000001192092896, -0.10000000149011612, -1.0, 0.6000000238418579, 0.699999988079071, 0.800000011920929]
   Shape: torch.Size([1, 6])
2. After first hidden layer (128 neurons): torch.Size([1, 128])
   Sample values: [0.000, 0.652, 0.373, ...] (showing first 3)
3. After second hidden layer (64 neurons): torch.Size([1, 64])
   Sample values: [0.000, 0.000, 0.000, ...] (showing first 3)
4. Final Q-values: [-0.21892616152763367, -0.08885392546653748]
   Q(wait) = -0.219, Q(flip) = -0.089
5. Decision: Choose action 1 (flip)


## Key Concepts Explained

**What just happened?**

1. **Input Processing**: We fed the network 6 numbers representing the game state
2. **Hidden Layers**: The network processed this information through two layers of neurons
3. **Q-Value Output**: We got back 2 numbers - Q(wait) and Q(flip)  
4. **Action Selection**: We pick the action with the highest Q-value

**Why ReLU activation?**
- ReLU (Rectified Linear Unit) sets negative values to 0 and passes positive values through unchanged
- It's fast, simple, and works well for most problems
- Its nonlinearity lets the network represent patterns that a purely linear model cannot

**Why no activation on the output?**
- Q-values can be positive or negative (good or bad situations)  
- We want the raw values, not constrained to 0-1 range

## Part 3: Building the Agent's Memory System (Experience Replay Buffer)

### What is Experience Replay?

**The Problem with Naive Learning:**
Imagine trying to learn to drive by only remembering your last 1 second of driving. You'd never learn patterns like "red light ahead → slow down" because you'd forget the red light by the time you needed to brake.

**The Solution - Experience Replay:**
Instead of learning from just the current experience, we store the agent's experiences in a "memory buffer" and learn from random samples. This breaks the correlation between consecutive experiences and stabilizes learning.

**Key Benefits:**
1. **Breaks Correlation**: Learning from random past experiences prevents overfitting to the current situation
2. **Sample Efficiency**: Reuse valuable experiences multiple times  
3. **Stability**: Smooths out learning by averaging over diverse situations

### Experience Replay Buffer Implementation

In [51]:
import random
from collections import deque
import numpy as np

class ReplayBuffer:
    """
    Experience Replay Buffer for DQN Agent
    
    This is the agent's "memory" - it stores past experiences and lets us
    sample random batches for training. Think of it as a photo album of
    game moments that we can learn from later.
    
    Each experience is a tuple: (state, action, reward, next_state, done)
    """
    
    def __init__(self, capacity=100000):
        """
        Initialize the replay buffer
        
        Args:
            capacity: Maximum number of experiences to store
                     (older experiences get overwritten when full)
        """
        self.buffer = deque(maxlen=capacity)
        self.capacity = capacity
        
        print(f"🧠 Replay Buffer Created:")
        print(f"   Capacity: {capacity:,} experiences")
        print(f"   Memory usage: ~{capacity * 6 * 4 / 1024 / 1024:.1f} MB")  # Rough lower bound: counts one 6-float32 state per experience
    
    def push(self, state, action, reward, next_state, done):
        """
        Store a new experience in the buffer
        
        Args:
            state: Current observation (6D vector)
            action: Action taken (0 or 1)  
            reward: Reward received
            next_state: Next observation after action
            done: True if episode ended
        """
        # Convert to numpy arrays for consistency
        state = np.array(state, dtype=np.float32)
        next_state = np.array(next_state, dtype=np.float32)
        
        # Store as tuple
        experience = (state, action, reward, next_state, done)
        self.buffer.append(experience)
    
    def sample(self, batch_size=32):
        """
        Sample a random batch of experiences for training
        
        This is where the magic happens - we grab random past experiences
        to train on, which breaks the correlation problem.
        
        Args:
            batch_size: Number of experiences to sample
            
        Returns:
            Tuple of tensors: (states, actions, rewards, next_states, dones)
        """
        if len(self.buffer) < batch_size:
            raise ValueError(f"Not enough experiences! Have {len(self.buffer)}, need {batch_size}")
        
        # Sample random batch
        batch = random.sample(self.buffer, batch_size)
        
        # Unpack batch into separate arrays
        states = np.array([e[0] for e in batch])
        actions = np.array([e[1] for e in batch]) 
        rewards = np.array([e[2] for e in batch])
        next_states = np.array([e[3] for e in batch])
        dones = np.array([e[4] for e in batch])
        
        # Convert to PyTorch tensors
        states = torch.tensor(states, dtype=torch.float32)
        actions = torch.tensor(actions, dtype=torch.long)  # Long for indexing
        rewards = torch.tensor(rewards, dtype=torch.float32)
        next_states = torch.tensor(next_states, dtype=torch.float32)
        dones = torch.tensor(dones, dtype=torch.bool)
        
        return states, actions, rewards, next_states, dones
    
    def __len__(self):
        """Return current number of experiences stored"""
        return len(self.buffer)
    
    def is_ready(self, min_size=1000):
        """Check if buffer has enough experiences to start training"""
        return len(self.buffer) >= min_size

# Test the replay buffer
print("=== TESTING REPLAY BUFFER ===")

# Create buffer
replay_buffer = ReplayBuffer(capacity=50000)

# Add some fake experiences
for i in range(100):
    state = [0.5, 0.1, 1.0, 0.8, 0.9, 1.0]  # Fake observation
    action = random.choice([0, 1])           # Random action
    reward = 2.0 + random.random()           # Small reward variation
    next_state = [0.51, 0.12, 1.0, 0.75, 0.85, 0.95]  # Slightly different
    done = (i % 50 == 49)                    # Episode ends every 50 steps
    
    replay_buffer.push(state, action, reward, next_state, done)

print(f"Buffer size after adding 100 experiences: {len(replay_buffer)}")

# Test sampling
if replay_buffer.is_ready(min_size=32):
    states, actions, rewards, next_states, dones = replay_buffer.sample(batch_size=8)
    
    print(f"\n=== SAMPLE BATCH (size=8) ===")
    print(f"States shape: {states.shape}")
    print(f"Actions shape: {actions.shape}")
    print(f"Rewards shape: {rewards.shape}")
    
    print(f"\nSample states (first 3):")
    for i in range(3):
        print(f"  State {i}: {states[i].tolist()}")
    
    print(f"\nSample actions: {actions.tolist()}")
    print(f"Sample rewards: {rewards.tolist()}")
    print(f"Sample dones: {dones.tolist()}")
    
    print(f"\n✅ Replay buffer working correctly!")
else:
    print("❌ Not enough experiences to sample")

=== TESTING REPLAY BUFFER ===
🧠 Replay Buffer Created:
   Capacity: 50,000 experiences
   Memory usage: ~1.1 MB
Buffer size after adding 100 experiences: 100

=== SAMPLE BATCH (size=8) ===
States shape: torch.Size([8, 6])
Actions shape: torch.Size([8])
Rewards shape: torch.Size([8])

Sample states (first 3):
  State 0: [0.5, 0.10000000149011612, 1.0, 0.800000011920929, 0.8999999761581421, 1.0]
  State 1: [0.5, 0.10000000149011612, 1.0, 0.800000011920929, 0.8999999761581421, 1.0]
  State 2: [0.5, 0.10000000149011612, 1.0, 0.800000011920929, 0.8999999761581421, 1.0]

Sample actions: [0, 0, 0, 1, 1, 1, 1, 0]
Sample rewards: [2.522397041320801, 2.5459907054901123, 2.131537437438965, 2.1102023124694824, 2.8620920181274414, 2.5370280742645264, 2.8412656784057617, 2.0784473419189453]
Sample dones: [False, False, False, False, False, False, False, False]

✅ Replay buffer working correctly!


### Understanding the Memory System

**Key Concepts Explained:**

1. **Capacity Management**: 
   - Buffer has fixed size (100,000 experiences by default; the test above uses 50,000)
   - When full, oldest experiences are overwritten
   - This prevents memory from growing infinitely

2. **Experience Format**:
   ```python
   (state, action, reward, next_state, done)
   # Example: ([0.5, 0.1, 1.0, 0.8, 0.9, 1.0], 1, 2.1, [0.51, 0.12, 1.0, 0.75, 0.85, 0.95], False)
   ```

3. **Random Sampling**:
   - We don't learn from experiences in order
   - Random sampling breaks temporal correlations
   - Each training batch has diverse situations

4. **Batch Processing**:
   - Sample multiple experiences at once (e.g., 32)
   - More efficient than training on single experiences
   - Provides better gradient estimates
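
The capacity and sampling behaviour described above can be seen on a tiny buffer. This sketch uses a capacity of 3 and placeholder string states purely for illustration.

```python
import random
from collections import deque

buffer = deque(maxlen=3)   # tiny capacity so eviction is easy to see
for t in range(5):
    # (state, action, reward, next_state, done) with placeholder states
    buffer.append((f"s{t}", 0, 1.0, f"s{t + 1}", False))

kept = [e[0] for e in buffer]            # the two oldest experiences were dropped
random.seed(0)
batch = random.sample(list(buffer), 2)   # uniform sample without replacement
print(kept, len(batch))
```

After five appends only the three most recent experiences remain, which is the "oldest experiences are overwritten" behaviour of `deque(maxlen=...)`.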

### Why This Matters for Gravity Guy

**Without Replay Buffer:**
- Agent only learns from current situation
- Forgets valuable lessons from past mistakes
- Training is unstable and inefficient

**With Replay Buffer:**
- Learns from diverse situations (different platform layouts, gravity states)
- Remembers rare but important events (close calls, successful flips)
- Training is stable and data-efficient

**Memory Efficiency:**
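- 100,000 experiences hold about 4.6 MB of raw float32 state data (two 6-float states each); Python tuple and array overhead brings the total to a few tens of MB (very reasonable)
- Enough to store ~28 minutes of gameplay at 60 FPS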
- Captures diverse situations for robust learning
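
A back-of-the-envelope check of these figures, assuming one stored transition per frame at 60 frames per second and counting only the raw float32 state data (Python object overhead is not included):

```python
capacity = 100_000
obs_dim, bytes_per_float = 6, 4

# Each experience stores two observations: state and next_state.
raw_state_bytes = capacity * 2 * obs_dim * bytes_per_float
raw_mb = raw_state_bytes / 1024 / 1024

# One transition per frame at 60 FPS (assumed step rate).
minutes_at_60fps = capacity / 60 / 60

print(f"{raw_mb:.1f} MB raw, {minutes_at_60fps:.1f} min")  # 4.6 MB raw, 27.8 min
```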

### Next Steps

Now that we have our replay buffer, we need:
1. **DQN Agent Class** - Combines neural network + replay buffer
2. **Training Loop** - The learning process
3. **Epsilon-Greedy Policy** - Balancing exploration vs exploitation

The replay buffer is the foundation that makes everything else possible!

## Part 4: The Complete DQN Agent

### What We're Building

Now we combine everything into a complete learning agent:
- **Brain**: Neural network (from Part 2) 
- **Memory**: Replay buffer (from Part 3)
- **Decision Making**: Epsilon-greedy exploration
- **Learning**: DQN training algorithm

Think of this as assembling all the pieces into a complete AI player.

### Key Concepts

**🧠 Epsilon-Greedy Strategy:**
- **Exploration**: Sometimes take random actions to discover new strategies
- **Exploitation**: Usually take the action our network thinks is best
- **Balance**: Start with high exploration (95% by default, 90% in the test below), gradually reduce to low (5%)
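
How fast exploration shrinks depends on how often the decay factor is applied. The agent in this notebook multiplies ε by `epsilon_decay` on every call to `step`, so a quick calculation shows how many steps the schedule takes to reach its floor. The helper name `steps_to_floor` is an assumption for illustration.

```python
import math

def steps_to_floor(eps_start, eps_end, decay):
    """Number of multiplicative decay steps until eps_start * decay**n <= eps_end."""
    return math.ceil(math.log(eps_end / eps_start) / math.log(decay))

# Default agent settings: 0.95 -> 0.05 with a factor of 0.995 per step.
n_steps = steps_to_floor(0.95, 0.05, 0.995)

# Test settings further down: start at 0.9, 100 steps of decay.
eps_after_100 = 0.9 * 0.995 ** 100

print(n_steps, round(eps_after_100, 3))
```

With per-step decay the floor is reached after a few hundred environment steps, which is only a handful of episodes; a factor closer to 1 or per-episode decay would keep exploration going longer, if that turns out to be needed.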

**🎯 DQN Learning Process:**
1. Observe current state
2. Choose action (epsilon-greedy)
3. Take action, get reward and new state
4. Store experience in replay buffer
5. Sample random batch from memory
6. Train network on batch
7. Repeat!

### Complete DQN Agent Implementation

In [52]:
import torch.nn.functional as F
import math

class DQNAgent:
    """
    Complete Deep Q-Network Agent for Gravity Guy
    
    This agent combines:
    - Neural network for decision making
    - Replay buffer for experience storage  
    - Epsilon-greedy exploration strategy
    - DQN training algorithm
    """
    
    def __init__(
        self,
        state_size=6,
        action_size=2, 
        learning_rate=0.0003,
        gamma=0.99,
        epsilon_start=0.95,
        epsilon_end=0.05,
        epsilon_decay=0.995,
        buffer_size=50000,
        batch_size=32,
        target_update_freq=1000
    ):
        """
        Initialize the DQN agent
        
        Args:
            state_size: Size of observation space (6 for Gravity Guy)
            action_size: Number of actions (2: wait/flip)
            learning_rate: How fast the network learns
            gamma: Discount factor (how much future rewards matter)
            epsilon_start: Initial exploration rate (95% random actions)
            epsilon_end: Final exploration rate (5% random actions)  
            epsilon_decay: Multiplicative factor applied to epsilon after every step() call
            buffer_size: Replay buffer capacity
            batch_size: Number of experiences per training batch
            target_update_freq: How often to update target network
        """
        self.state_size = state_size
        self.action_size = action_size
        self.learning_rate = learning_rate
        self.gamma = gamma
        self.batch_size = batch_size
        self.target_update_freq = target_update_freq
        
        # Exploration parameters
        self.epsilon = epsilon_start
        self.epsilon_end = epsilon_end
        self.epsilon_decay = epsilon_decay
        
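        # Neural networks (main and target). The DQN class defined above takes
        # (input_size, output_size, hidden_size).
        self.q_network = DQN(state_size, action_size, hidden_size=128)
        self.target_network = DQN(state_size, action_size, hidden_size=128)
        
        # Initialize target network with same weights as main network
        self.target_network.load_state_dict(self.q_network.state_dict())
        self.target_network.eval()  # never trained directly; keeps dropout off when computing targets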
        
        # Optimizer for training
        self.optimizer = torch.optim.Adam(self.q_network.parameters(), lr=learning_rate)
        
        # Experience replay buffer
        self.memory = ReplayBuffer(buffer_size)
        
        # Training step counter
        self.steps_done = 0
        
        print(f"🤖 DQN Agent Initialized:")
        print(f"   State/Action space: {state_size} → {action_size}")
        print(f"   Learning rate: {learning_rate}")
        print(f"   Exploration: {epsilon_start:.2f} → {epsilon_end:.2f}")
        print(f"   Buffer size: {buffer_size:,}")
        print(f"   Batch size: {batch_size}")
    
    def act(self, state, training=True):
        """
        Choose action using epsilon-greedy strategy
        
        Args:
            state: Current observation (6D vector)
            training: If False, always exploit (no exploration)
            
        Returns:
            action: 0 (wait) or 1 (flip)
        """
        # Convert state to tensor
        if not isinstance(state, torch.Tensor):
            state = torch.tensor(state, dtype=torch.float32)
        if state.dim() == 1:
            state = state.unsqueeze(0)  # Add batch dimension
        
        # Epsilon-greedy action selection
        if training and random.random() < self.epsilon:
            # Explore: uniformly random action over the whole action space
            action = random.randrange(self.action_size)
        else:
            # Exploit: best action according to Q-network
            with torch.no_grad():
                q_values = self.q_network(state)
                action = torch.argmax(q_values[0]).item()
        
        return action
    
    def step(self, state, action, reward, next_state, done):
        """
        Store experience and learn from replay buffer
        
        Args:
            state: Previous observation
            action: Action taken
            reward: Reward received  
            next_state: New observation
            done: True if episode ended
        """
        # Store experience in replay buffer
        self.memory.push(state, action, reward, next_state, done)
        
        # Increment step counter
        self.steps_done += 1
        
        # Learn from experiences (if we have enough)
        if self.memory.is_ready(self.batch_size):
            experiences = self.memory.sample(self.batch_size)
            self._learn(experiences)
        
        # Update target network periodically
        if self.steps_done % self.target_update_freq == 0:
            self._update_target_network()
        
        # Decay exploration rate once per environment step, never below the floor
        self.epsilon = max(self.epsilon_end, self.epsilon * self.epsilon_decay)
    
    def _learn(self, experiences):
        """
        Learn from a batch of experiences using DQN algorithm
        
        This is where the actual learning happens!
        """
        states, actions, rewards, next_states, dones = experiences
        
        # Get current Q-values for chosen actions
        current_q_values = self.q_network(states).gather(1, actions.unsqueeze(1))
        
        # Get next Q-values from target network
        with torch.no_grad():
            next_q_values = self.target_network(next_states).max(1)[0]
            # Set next Q-values to 0 for terminal states
            next_q_values[dones] = 0.0
        
        # Compute target Q-values
        target_q_values = rewards + (self.gamma * next_q_values)
        
        # Compute loss (Huber loss for stability); squeeze only the action
        # dimension so a batch of size 1 keeps its batch axis
        loss = F.smooth_l1_loss(current_q_values.squeeze(1), target_q_values)
        
        # Optimize the model
        self.optimizer.zero_grad()
        loss.backward()
        # Gradient clipping for stability
        torch.nn.utils.clip_grad_norm_(self.q_network.parameters(), max_norm=1.0)
        self.optimizer.step()
        
        return loss.item()
    
    def _update_target_network(self):
        """Copy weights from main network to target network"""
        self.target_network.load_state_dict(self.q_network.state_dict())
        print(f"🎯 Target network updated at step {self.steps_done}")
    
    def save(self, filepath):
        """Save the trained model"""
        torch.save({
            'q_network_state_dict': self.q_network.state_dict(),
            'target_network_state_dict': self.target_network.state_dict(),
            'optimizer_state_dict': self.optimizer.state_dict(),
            'epsilon': self.epsilon,
            'steps_done': self.steps_done
        }, filepath)
        print(f"💾 Model saved to {filepath}")
    
    def load(self, filepath):
        """Load a trained model"""
        checkpoint = torch.load(filepath)
        self.q_network.load_state_dict(checkpoint['q_network_state_dict'])
        self.target_network.load_state_dict(checkpoint['target_network_state_dict'])
        self.optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
        self.epsilon = checkpoint['epsilon']
        self.steps_done = checkpoint['steps_done']
        print(f"📂 Model loaded from {filepath}")

# Test the complete agent
print("=== TESTING COMPLETE DQN AGENT ===")

# Create agent
agent = DQNAgent(
    state_size=6,
    action_size=2,
    learning_rate=0.001,
    epsilon_start=0.9,
    epsilon_end=0.05,
    buffer_size=10000,  # Smaller for testing
    batch_size=32
)

print(f"\n=== TESTING ACTION SELECTION ===")

# Test action selection
test_states = [
    [0.3, -0.1, 1.0, 0.8, 0.9, 1.0],  # Safe situation
    [0.7, 0.2, -1.0, 0.2, 0.3, 0.4],  # Dangerous situation  
    [0.5, 0.0, 1.0, 1.0, 1.0, 1.0],   # Very safe
]

for i, state in enumerate(test_states):
    action_explore = agent.act(state, training=True)   # With exploration
    action_exploit = agent.act(state, training=False)  # Pure exploitation
    
    print(f"State {i+1}: explore={action_explore}, exploit={action_exploit}")

print(f"\n=== TESTING LEARNING PROCESS ===")

# Add some experiences and test learning
for i in range(100):
    state = [0.5 + 0.1*random.random(), 0.1*random.random(), 
             random.choice([-1, 1]), 0.8*random.random(), 
             0.8*random.random(), 0.8*random.random()]
    action = random.choice([0, 1])
    reward = 2.0 + random.random()
    next_state = [s + 0.01*random.random() for s in state]
    done = (i % 25 == 24)
    
    agent.step(state, action, reward, next_state, done)

print(f"Agent trained on 100 experiences")
print(f"Current exploration rate: {agent.epsilon:.3f}")
print(f"Memory size: {len(agent.memory)}")
print(f"Training steps completed: {agent.steps_done}")

print(f"\n✅ DQN Agent is ready for training!")

=== TESTING COMPLETE DQN AGENT ===
🧠 DQN Network Created:
   Input: 6 → Hidden: 128 → Hidden: 64 → Output: 2
   Total parameters: 9,282
🧠 DQN Network Created:
   Input: 6 → Hidden: 128 → Hidden: 64 → Output: 2
   Total parameters: 9,282
🧠 Replay Buffer Created:
   Capacity: 10,000 experiences
   Memory usage: ~0.2 MB
🤖 DQN Agent Initialized:
   State/Action space: 6 → 2
   Learning rate: 0.001
   Exploration: 0.90 → 0.05
   Buffer size: 10,000
   Batch size: 32

=== TESTING ACTION SELECTION ===
State 1: explore=1, exploit=0
State 2: explore=0, exploit=0
State 3: explore=0, exploit=0

=== TESTING LEARNING PROCESS ===
Agent trained on 100 experiences
Current exploration rate: 0.545
Memory size: 100
Training steps completed: 100

✅ DQN Agent is ready for training!


**🧠 The Learning Process:**

1. **Observe**: Get current game state (6D vector)
2. **Decide**: Use epsilon-greedy to pick action
3. **Act**: Take action in environment  
4. **Remember**: Store experience in replay buffer
5. **Learn**: Train on random batch from memory
6. **Improve**: Update exploration rate and target network

**🎯 Key Components:**

- **Main Network**: Makes decisions and gets trained
- **Target Network**: Provides stable learning targets (updated every 1000 steps)
- **Epsilon Decay**: Exploration decreases over time (90% → 5%, as printed in the agent configuration above)
- **Experience Replay**: Learns from diverse past experiences
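
The exploration schedule can be checked against the test output above. This is a minimal sketch, assuming a multiplicative per-step decay factor of 0.995. That factor is not shown in this section; it is inferred from the printed exploration rate (0.90 at the start, 0.545 after 100 steps). `decay_epsilon` is a hypothetical helper, not a method of the agent.

```python
def decay_epsilon(epsilon, epsilon_end=0.05, decay=0.995):
    """One multiplicative decay step, clipped at the floor (decay value is an assumption)."""
    return max(epsilon_end, epsilon * decay)

epsilon = 0.90  # starting value from the printed agent configuration
for _ in range(100):  # one decay per agent.step() call
    epsilon = decay_epsilon(epsilon)

print(f"Exploration rate after 100 steps: {epsilon:.3f}")
```

Under that assumption the loop lands on the same 0.545 that the test cell reported, which suggests epsilon is decayed on every `step()` call rather than only on learning updates.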

**🚀 Why This Works for Gravity Guy:**

- **Exploration**: Discovers new flipping strategies
- **Memory**: Remembers successful and failed situations  
- **Stability**: Target network prevents learning instability
- **Efficiency**: Batch training is much faster than single updates

### Next Steps

Your agent is now complete! Next we need:
1. **Training Loop** - Connect agent to your environment
2. **Performance Monitoring** - Track learning progress
3. **Evaluation** - Test against baseline agents

Ready to see this agent learn to play Gravity Guy? 🎮

## Part 5: Training the Improved DQN Agent

### Key Improvements

1. **Reward Structure**:
   - Scale down progress rewards
   - Add efficiency-based flip penalties
   - Encourage momentum preservation

2. **State Space**:
   - Add grounded state
   - Add cooldown information
   - Normalize input features

3. **Network Architecture**:
   - Add batch normalization
   - Better weight initialization
   - Entropy regularization

4. **Training Process**:
   - Lower learning rate
   - More exploration time
   - Larger replay buffer

Let's implement these improvements and see how they affect learning!
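
To make the reward changes concrete, here is a small sketch of the shaped reward. The constants (0.1 progress scale, 50 px expected progress per flip) are taken from the training loop later in this part; `shaped_reward` itself is a hypothetical helper written for illustration.

```python
def shaped_reward(progress, did_flip, scale=0.1, expected_progress=50.0):
    """Scaled progress reward with an efficiency-based flip penalty."""
    reward = scale * progress
    if did_flip:
        efficiency = progress / expected_progress
        # Penalty shrinks linearly from 1.0 to 0.0 as efficiency approaches 1
        reward -= max(0.0, 1.0 - efficiency)
    return reward

# Made-up progress values, for illustration only
for progress, did_flip in [(10.0, False), (10.0, True), (60.0, True)]:
    print(progress, did_flip, round(shaped_reward(progress, did_flip), 3))
```

One thing to keep in mind: `progress` is measured over a single step. If per-step progress is much smaller than 50 px, the efficiency term stays near zero and almost every flip pays close to the full penalty of 1.0.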

In [67]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import numpy as np
import random
from collections import deque
import math

class ImprovedDQN(nn.Module):
    """
    Improved Deep Q-Network with batch normalization and better initialization
    """
    def __init__(self, input_size=8, hidden1_size=128, hidden2_size=64, output_size=2):
        super(ImprovedDQN, self).__init__()
        
        # Network layers
        self.fc1 = nn.Linear(input_size, hidden1_size)
        self.bn1 = nn.BatchNorm1d(hidden1_size)
        self.fc2 = nn.Linear(hidden1_size, hidden2_size)
        self.bn2 = nn.BatchNorm1d(hidden2_size)
        self.fc3 = nn.Linear(hidden2_size, output_size)
        
        # Initialize with small weights
        for m in self.modules():
            if isinstance(m, nn.Linear):
                nn.init.xavier_uniform_(m.weight, gain=0.01)
                nn.init.constant_(m.bias, 0)
        
        print(f"🧠 Improved DQN Created:")
        print(f"   Input: {input_size} → Hidden: {hidden1_size} → Hidden: {hidden2_size} → Output: {output_size}")
        print(f"   Total parameters: {sum(p.numel() for p in self.parameters()):,}")
    
    def forward(self, x):
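        # Input normalization. Batch statistics need at least 2 samples:
        # the std of a single row is NaN, which would turn every Q-value into NaN.
        if self.training and x.shape[0] > 1:
            x = (x - x.mean(dim=0)) / (x.std(dim=0) + 1e-8)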
        
        # First hidden layer
        x = self.fc1(x)
        if x.shape[0] > 1:  # Only use batch norm for batches
            x = self.bn1(x)
        x = F.relu(x)
        
        # Second hidden layer
        x = self.fc2(x)
        if x.shape[0] > 1:
            x = self.bn2(x)
        x = F.relu(x)
        
        # Output layer (no activation - raw Q-values)
        return self.fc3(x)

class ImprovedReplayBuffer:
    """
    Enhanced Replay Buffer with prioritized sampling and better memory management
    """
    def __init__(self, capacity=100000):
        self.buffer = deque(maxlen=capacity)
        self.priorities = deque(maxlen=capacity)
        self.capacity = capacity
        self.epsilon = 0.01  # Small constant to ensure non-zero priorities
        self.alpha = 0.6     # Priority exponent
        self.beta = 0.4      # Importance-sampling exponent (annealed toward 1.0)
        self.beta_increment = 0.001
        
        print(f"🧠 Improved Replay Buffer Created:")
        print(f"   Capacity: {capacity:,} experiences")
        print(f"   Memory usage: ~{capacity * 8 * 4 / 1024 / 1024:.1f} MB")
    
    def push(self, state, action, reward, next_state, done):
        """Store experience with max priority for new experiences"""
        state = np.array(state, dtype=np.float32)
        next_state = np.array(next_state, dtype=np.float32)
        
        experience = (state, action, reward, next_state, done)
        max_priority = max(self.priorities, default=1.0)
        
        self.buffer.append(experience)
        self.priorities.append(max_priority)
    
    def sample(self, batch_size=32):
        """Sample batch with prioritized experience replay"""
        total_priority = sum(self.priorities)
        probs = [p / total_priority for p in self.priorities]
        
        # Sample indices based on priorities
        indices = np.random.choice(
            len(self.buffer), 
            batch_size, 
            p=probs,
            replace=False
        )
        
        # Get experiences and calculate importance weights
        experiences = [self.buffer[i] for i in indices]
        total = len(self.buffer)
        weights = [(total * probs[i]) ** -self.beta for i in indices]
        max_weight = max(weights)
        weights = [w / max_weight for w in weights]  # Normalize weights
        
        # Increment beta
        self.beta = min(1.0, self.beta + self.beta_increment)
        
        # Unpack experiences
        states = np.array([e[0] for e in experiences])
        actions = np.array([e[1] for e in experiences])
        rewards = np.array([e[2] for e in experiences])
        next_states = np.array([e[3] for e in experiences])
        dones = np.array([e[4] for e in experiences])
        
        # Convert to tensors
        states = torch.tensor(states, dtype=torch.float32)
        actions = torch.tensor(actions, dtype=torch.long)
        rewards = torch.tensor(rewards, dtype=torch.float32)
        next_states = torch.tensor(next_states, dtype=torch.float32)
        dones = torch.tensor(dones, dtype=torch.bool)
        weights = torch.tensor(weights, dtype=torch.float32)
        
        return states, actions, rewards, next_states, dones, weights, indices
    
    def update_priorities(self, indices, td_errors):
        """Update priorities based on TD errors"""
        for idx, error in zip(indices, td_errors):
            self.priorities[idx] = (abs(error) + self.epsilon) ** self.alpha
    
    def __len__(self):
        return len(self.buffer)
    
    def is_ready(self, min_size=1000):
        return len(self.buffer) >= min_size

class ImprovedDQNAgent:
    """
    Enhanced DQN Agent with:
    - Prioritized experience replay
    - Double DQN
    - Entropy bonus on the bootstrapped target
    Note: ImprovedDQN is a plain MLP; no dueling value/advantage heads are implemented.
    """
    def __init__(
        self,
        state_size=8,
        action_size=2,
        learning_rate=0.0001,
        gamma=0.99,
        epsilon_start=0.99,
        epsilon_end=0.01,
        epsilon_decay=0.9999,
        buffer_size=100000,
        batch_size=32,
        target_update_freq=100
    ):
        self.state_size = state_size
        self.action_size = action_size
        self.learning_rate = learning_rate
        self.gamma = gamma
        self.batch_size = batch_size
        self.target_update_freq = target_update_freq
        
        # Exploration parameters
        self.epsilon = epsilon_start
        self.epsilon_end = epsilon_end
        self.epsilon_decay = epsilon_decay
        
        # Networks
        self.q_network = ImprovedDQN(state_size, 128, 64, action_size)
        self.target_network = ImprovedDQN(state_size, 128, 64, action_size)
        self.target_network.load_state_dict(self.q_network.state_dict())
        
        # Optimizer (gradient clipping is applied in step(), not here)
        self.optimizer = torch.optim.Adam(self.q_network.parameters(), lr=learning_rate)
        
        # Enhanced replay buffer
        self.memory = ImprovedReplayBuffer(buffer_size)
        
        # Training tracking
        self.steps_done = 0
        self.losses = []
        
        print(f"🤖 Improved DQN Agent Initialized:")
        print(f"   State/Action space: {state_size} → {action_size}")
        print(f"   Learning rate: {learning_rate}")
        print(f"   Exploration: {epsilon_start:.2f} → {epsilon_end:.2f}")
        print(f"   Buffer size: {buffer_size:,}")
        print(f"   Batch size: {batch_size}")
    
    def act(self, state, training=True):
        """Choose action using epsilon-greedy with validity masking"""
        if not isinstance(state, torch.Tensor):
            state = torch.tensor(state, dtype=torch.float32)
        if state.dim() == 1:
            state = state.unsqueeze(0)
        
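        # Valid-action mask: reads grounded from index 3 and cooldown-ready from index 4.
        # This presumes the two flags sit right after grav_dir; StatePreprocessor later in
        # this notebook appends them at indices 6 and 7 instead, so the indices must match
        # whichever layout is actually fed to the agent.
        can_flip = bool(state[0, 3].item() and state[0, 4].item())  # grounded and cooldown ready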
        
        # Epsilon-greedy with validity masking
        if training and random.random() < self.epsilon:
            if can_flip:
                return random.choice([0, 1])
            else:
                return 0  # Can only wait
        else:
            with torch.no_grad():
                q_values = self.q_network(state)
                if not can_flip:
                    q_values[0, 1] = float('-inf')  # Mask out flip action
                return torch.argmax(q_values[0]).item()
    
    def calculate_loss(self, states, actions, rewards, next_states, dones, weights):
        """Calculate loss with entropy regularization"""
        # Current Q-values
        current_q = self.q_network(states).gather(1, actions.unsqueeze(1))
        
        with torch.no_grad():
            # Double DQN: use online network to select action, target network to evaluate
            online_next_q = self.q_network(next_states)
            target_next_q = self.target_network(next_states)
            
            # Get best actions from online network
            best_actions = torch.argmax(online_next_q, dim=1, keepdim=True)
            
            # Use those actions to get Q-values from target network
            next_q = target_next_q.gather(1, best_actions).squeeze()
            
            # Add entropy regularization
            probs = F.softmax(target_next_q, dim=1)
            entropy = -0.01 * (probs * torch.log(probs + 1e-10)).sum(1)
            next_q = next_q + entropy
            
            # Compute targets
            target_q = rewards + (1 - dones.float()) * self.gamma * next_q
        
        # Importance-weighted Huber loss; the raw TD errors are returned for priority updates
        q_sa = current_q.squeeze(1)
        td_errors = target_q - q_sa
        loss = (weights * F.smooth_l1_loss(q_sa, target_q, reduction='none')).mean()
        
        return loss, td_errors.detach()
    
    def step(self, state, action, reward, next_state, done):
        """Process one step of experience"""
        # Store experience
        self.memory.push(state, action, reward, next_state, done)
        
        # Increment step counter
        self.steps_done += 1
        
        # Learn if we have enough experiences
        if self.memory.is_ready(self.batch_size):
            # Sample batch with priorities
            states, actions, rewards, next_states, dones, weights, indices = \
                self.memory.sample(self.batch_size)
            
            # Calculate loss
            loss, td_errors = self.calculate_loss(
                states, actions, rewards, next_states, dones, weights
            )
            
            # Update priorities
            self.memory.update_priorities(indices, td_errors.numpy())
            
            # Optimize
            self.optimizer.zero_grad()
            loss.backward()
            torch.nn.utils.clip_grad_norm_(self.q_network.parameters(), max_norm=1.0)
            self.optimizer.step()
            
            # Store loss
            self.losses.append(loss.item())
        
        # Update target network
        if self.steps_done % self.target_update_freq == 0:
            self.target_network.load_state_dict(self.q_network.state_dict())
            print(f"🎯 Target network updated at step {self.steps_done}")
        
        # Decay exploration rate
        self.epsilon = max(
            self.epsilon_end,
            self.epsilon * self.epsilon_decay
        )
    
    def save(self, filepath):
        """Save the trained model"""
        torch.save({
            'q_network_state_dict': self.q_network.state_dict(),
            'target_network_state_dict': self.target_network.state_dict(),
            'optimizer_state_dict': self.optimizer.state_dict(),
            'epsilon': self.epsilon,
            'steps_done': self.steps_done,
            'losses': self.losses
        }, filepath)
        print(f"💾 Model saved to {filepath}")
    
    def load(self, filepath):
        """Load a trained model"""
        checkpoint = torch.load(filepath)
        self.q_network.load_state_dict(checkpoint['q_network_state_dict'])
        self.target_network.load_state_dict(checkpoint['target_network_state_dict'])
        self.optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
        self.epsilon = checkpoint['epsilon']
        self.steps_done = checkpoint['steps_done']
        self.losses = checkpoint['losses']
        print(f"📂 Model loaded from {filepath}")

# Create the improved agent
print("=== CREATING IMPROVED DQN AGENT ===")

improved_agent = ImprovedDQNAgent(
    state_size=8,              # Extended state space
    action_size=2,
    learning_rate=0.0001,      # Slower learning
    gamma=0.99,                # High discount
    epsilon_start=0.99,        # More exploration
    epsilon_end=0.01,
    epsilon_decay=0.9999,      # Much slower decay
    buffer_size=100000,        # 10x larger buffer
    batch_size=32,
    target_update_freq=100     # More frequent updates
)

print("\n✅ Improved agent created and ready for training!")

=== CREATING IMPROVED DQN AGENT ===
🧠 Improved DQN Created:
   Input: 8 → Hidden: 128 → Hidden: 64 → Output: 2
   Total parameters: 9,922
🧠 Improved DQN Created:
   Input: 8 → Hidden: 128 → Hidden: 64 → Output: 2
   Total parameters: 9,922
🧠 Improved Replay Buffer Created:
   Capacity: 100,000 experiences
   Memory usage: ~3.1 MB
🤖 Improved DQN Agent Initialized:
   State/Action space: 8 → 2
   Learning rate: 0.0001
   Exploration: 0.99 → 0.01
   Buffer size: 100,000
   Batch size: 32

✅ Improved agent created and ready for training!
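
The prioritized sampling in `ImprovedReplayBuffer` above can be followed on a toy example. The sketch below uses the same constants (`epsilon=0.01`, `alpha=0.6`, `beta=0.4`) but only the standard library. The TD errors are made-up numbers, the helper names are hypothetical, and `random.choices` samples with replacement, whereas the buffer samples without replacement.

```python
import random

def sampling_probs(priorities):
    """Normalize stored priorities into sampling probabilities."""
    total = sum(priorities)
    return [p / total for p in priorities]

def importance_weights(probs, indices, beta):
    """w_i = (N * P(i)) ** (-beta), scaled so the largest weight is 1."""
    n = len(probs)
    raw = [(n * probs[i]) ** -beta for i in indices]
    top = max(raw)
    return [w / top for w in raw]

# Toy TD errors for five stored transitions (made-up numbers)
td_errors = [0.1, 0.5, 2.0, 0.05, 1.0]
eps, alpha, beta = 0.01, 0.6, 0.4  # same constants as ImprovedReplayBuffer
priorities = [(abs(e) + eps) ** alpha for e in td_errors]
probs = sampling_probs(priorities)

rng = random.Random(0)  # seeded only to make the illustration repeatable
indices = rng.choices(range(len(probs)), weights=probs, k=3)  # with replacement
weights = importance_weights(probs, indices, beta)

print([round(p, 3) for p in probs])
print(indices, [round(w, 3) for w in weights])
```

Transitions with a large TD error are drawn more often, and the importance weights scale their loss contribution back down so the update stays closer to what uniform sampling would give.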


### Training Setup


In [69]:
import time
import matplotlib.pyplot as plt
from collections import deque
import numpy as np
import torch.nn.functional as F

class ImprovedTrainingMonitor:
    """
    Enhanced training progress tracking with more detailed metrics
    """
    def __init__(self, window_size=100):
        self.window_size = window_size
        
        # Core metrics
        self.episode_rewards = []
        self.episode_distances = []
        self.episode_flips = []
        self.episode_lengths = []
        self.epsilons = []
        self.losses = []
        
        # Detailed metrics
        self.flip_efficiencies = []  # Distance gained per flip
        self.survival_times = []     # Episode duration
        self.q_value_gaps = []       # Difference between Q(flip) and Q(wait)
        
        # Rolling windows
        self.reward_window = deque(maxlen=window_size)
        self.distance_window = deque(maxlen=window_size)
        self.efficiency_window = deque(maxlen=window_size)
        
        # Performance tracking
        self.best_distance = 0
        self.best_efficiency = 0
        self.best_episode = 0
    
    def update(self, episode, reward, distance, flips, length, epsilon, loss=None, q_values=None):
        """Record metrics from completed episode"""
        # Core metrics
        self.episode_rewards.append(reward)
        self.episode_distances.append(distance)
        self.episode_flips.append(flips)
        self.episode_lengths.append(length)
        self.epsilons.append(epsilon)
        if loss is not None:
            self.losses.append(loss)
        
        # Calculate efficiency
        efficiency = distance / max(flips, 1)
        self.flip_efficiencies.append(efficiency)
        
        # Calculate Q-value gap if available
        if q_values is not None:
            gap = abs(q_values[1] - q_values[0])  # Gap between flip and wait
            self.q_value_gaps.append(gap)
        
        # Update rolling windows
        self.reward_window.append(reward)
        self.distance_window.append(distance)
        self.efficiency_window.append(efficiency)
        
        # Track best performance
        if distance > self.best_distance:
            self.best_distance = distance
            self.best_episode = episode
        if efficiency > self.best_efficiency:
            self.best_efficiency = efficiency
    
    def get_averages(self):
        """Get current rolling averages"""
        avg_reward = np.mean(self.reward_window) if self.reward_window else 0
        avg_distance = np.mean(self.distance_window) if self.distance_window else 0
        avg_efficiency = np.mean(self.efficiency_window) if self.efficiency_window else 0
        return avg_reward, avg_distance, avg_efficiency
    
    def print_progress(self, episode, verbose=True):
        """Print detailed training progress"""
        if verbose or episode % 50 == 0:
            avg_reward, avg_distance, avg_efficiency = self.get_averages()
            current_epsilon = self.epsilons[-1] if self.epsilons else 0
            recent_flips = np.mean(self.episode_flips[-self.window_size:])
            
            print(f"Episode {episode:4d} | "
                  f"Distance: {self.episode_distances[-1]:4.0f}px | "
                  f"Avg Dist: {avg_distance:4.0f}px | "
                  f"Flips: {self.episode_flips[-1]:2d} | "
                  f"Efficiency: {avg_efficiency:4.0f} px/flip | "
                  f"ε: {current_epsilon:.3f}")
    
    def plot_training(self, show_baselines=True):
        """Generate comprehensive training visualization"""
        fig, axes = plt.subplots(3, 2, figsize=(15, 15))
        fig.suptitle('Improved DQN Training Progress', fontsize=16, fontweight='bold')
        
        episodes = range(len(self.episode_distances))
        
        # 1. Distance Performance
        ax1 = axes[0, 0]
        ax1.plot(episodes, self.episode_distances, alpha=0.3, color='blue')
        if len(self.episode_distances) >= 50:
            rolling_dist = np.convolve(self.episode_distances, 
                                     np.ones(50)/50, mode='valid')
            ax1.plot(range(49, len(episodes)), rolling_dist, 
                    color='blue', linewidth=2, label='50-episode average')
        
        if show_baselines:
            ax1.axhline(y=1576, color='red', linestyle='--', 
                       alpha=0.7, label='Random baseline')
            ax1.axhline(y=2001, color='green', linestyle='--', 
                       alpha=0.7, label='Heuristic baseline')
        
        ax1.set_title('Distance Performance')
        ax1.set_xlabel('Episode')
        ax1.set_ylabel('Distance (px)')
        ax1.legend()
        ax1.grid(True, alpha=0.3)
        
        # 2. Flip Efficiency
        ax2 = axes[0, 1]
        ax2.plot(episodes, self.flip_efficiencies, alpha=0.3, color='green')
        if len(self.flip_efficiencies) >= 50:
            rolling_eff = np.convolve(self.flip_efficiencies, 
                                    np.ones(50)/50, mode='valid')
            ax2.plot(range(49, len(episodes)), rolling_eff, 
                    color='green', linewidth=2, label='50-episode average')
        
        ax2.set_title('Flip Efficiency (Distance per Flip)')
        ax2.set_xlabel('Episode')
        ax2.set_ylabel('Pixels per Flip')
        ax2.grid(True, alpha=0.3)
        
        # 3. Learning Progress (Loss)
        ax3 = axes[1, 0]
        if self.losses:
            window = min(100, len(self.losses))
            rolling_loss = np.convolve(self.losses, 
                                     np.ones(window)/window, mode='valid')
            ax3.plot(range(len(rolling_loss)), rolling_loss, 
                    color='red', linewidth=1)
            ax3.set_title('Training Loss (Rolling Average)')
            ax3.set_xlabel('Training Step')
            ax3.set_ylabel('Loss')
            ax3.grid(True, alpha=0.3)
        
        # 4. Q-Value Analysis
        ax4 = axes[1, 1]
        if self.q_value_gaps:
            ax4.plot(range(len(self.q_value_gaps)), self.q_value_gaps, 
                    color='purple', alpha=0.6)
            ax4.set_title('Q-Value Gap (|Q(flip) - Q(wait)|)')
            ax4.set_xlabel('Episode')
            ax4.set_ylabel('Q-Value Difference')
            ax4.grid(True, alpha=0.3)
        
        # 5. Exploration Rate
        ax5 = axes[2, 0]
        ax5.plot(episodes, self.epsilons, color='orange', linewidth=2)
        ax5.set_title('Exploration Rate (Epsilon)')
        ax5.set_xlabel('Episode')
        ax5.set_ylabel('Epsilon')
        ax5.grid(True, alpha=0.3)
        
        # 6. Flips per Episode
        ax6 = axes[2, 1]
        ax6.plot(episodes, self.episode_flips, alpha=0.3, color='brown')
        if len(self.episode_flips) >= 50:
            rolling_flips = np.convolve(self.episode_flips, 
                                      np.ones(50)/50, mode='valid')
            ax6.plot(range(49, len(episodes)), rolling_flips, 
                    color='brown', linewidth=2, label='50-episode average')
        
        ax6.set_title('Flips per Episode')
        ax6.set_xlabel('Episode')
        ax6.set_ylabel('Number of Flips')
        ax6.grid(True, alpha=0.3)
        
        plt.tight_layout()
        plt.show()

def train_improved_dqn(
    agent,
    n_episodes=500,
    max_steps_per_episode=3600,  # 30 seconds at 120 FPS
    print_every=10,
    plot_every=100,
    save_every=100,
    eval_every=50
):
    """
    Train the improved DQN agent with enhanced monitoring and evaluation
    """
    # Create environment with improved rewards
    env = GGEnv(
        level_seed=None,        # Random levels
        max_time_s=30.0,        # 30 second episodes
        flip_penalty=0.1,       # Higher flip penalty
        dt=1/120                # 120 FPS
    )
    
    # Training monitor
    monitor = ImprovedTrainingMonitor(window_size=100)
    
    print("🚀 Starting Improved DQN Training!")
    print(f"Episodes: {n_episodes}")
    print(f"Target: Beat heuristic baseline of 2001px")
    print(f"Environment: 30s episodes, random levels, 0.1 flip penalty")
    print("-" * 60)
    
    start_time = time.time()
    
    for episode in range(1, n_episodes + 1):
        # Reset environment
        state = env.reset()
        total_reward = 0
        steps = 0
        flips_count = 0
        episode_loss = []
        
        for step in range(max_steps_per_episode):
            # Agent selects action
            action = agent.act(state, training=True)
            
            # Take action in environment
            next_state, reward, done, info = env.step(action)
            
            # Calculate improved reward
            progress = info.get('distance_delta', 0)
            did_flip = info.get('did_flip', False)
            
            # Adjust reward based on efficiency
            if did_flip:
                flips_count += 1
                efficiency = progress / 50.0  # Expected progress per flip
                flip_penalty = -1.0 * max(0, 1.0 - efficiency)
                reward = 0.1 * progress + flip_penalty
            else:
                reward = 0.1 * progress
            
            # Agent learns from experience
            agent.step(state, action, reward, next_state, done)
            
            # Store loss if available
            if agent.losses:
                episode_loss.append(agent.losses[-1])
            
            # Update state and reward
            state = next_state
            total_reward += reward
            steps += 1
            
            if done:
                break
        
        # Record episode results
        distance = info.get('distance_px', 0)
        avg_loss = np.mean(episode_loss) if episode_loss else None
        
        # Get Q-values for a test state
        test_state = torch.tensor([0.5, 0, 1, 1, 1, 0.8, 0.9, 1.0], 
                                dtype=torch.float32).unsqueeze(0)
        with torch.no_grad():
            q_values = agent.q_network(test_state)[0].numpy()
        
        monitor.update(
            episode=episode,
            reward=total_reward,
            distance=distance,
            flips=flips_count,
            length=steps,
            epsilon=agent.epsilon,
            loss=avg_loss,
            q_values=q_values
        )
        
        # Print progress
        if episode % print_every == 0:
            monitor.print_progress(episode, verbose=True)
        
        # Plot progress
        if episode % plot_every == 0:
            print(f"\n📊 Training Progress at Episode {episode}")
            monitor.plot_training()
            
            # Performance summary
            avg_reward, avg_distance, avg_efficiency = monitor.get_averages()
            print(f"Current 100-episode averages:")
            print(f"   Distance: {avg_distance:.0f}px")
            print(f"   Efficiency: {avg_efficiency:.0f} px/flip")
            
            # Compare to baselines
            if avg_distance > 1576:
                improvement = (avg_distance - 1576) / 1576 * 100
                print(f"✅ Beating random by {improvement:.1f}%!")
            
            if avg_distance > 2001:
                improvement = (avg_distance - 2001) / 2001 * 100
                print(f"🎉 BEATING HEURISTIC by {improvement:.1f}%!")
            
            print("-" * 60)
        
        # Save model periodically
        if episode % save_every == 0:
            agent.save(f'./models/dqn/improved_dqn_episode_{episode}.pth')
    
    # Training complete!
    elapsed_time = time.time() - start_time
    final_reward, final_distance, final_efficiency = monitor.get_averages()
    
    print(f"\n🏁 Training Complete!")
    print(f"Total time: {elapsed_time/60:.1f} minutes")
    print(f"Final 100-episode averages:")
    print(f"   Distance: {final_distance:.0f}px")
    print(f"   Efficiency: {final_efficiency:.0f} px/flip")
    
    # Final comparison to baselines
    print(f"\n📊 Final Performance Comparison:")
    print(f"Random baseline:    1,576px")
    print(f"Heuristic baseline: 2,001px")
    print(f"DQN agent:         {final_distance:.0f}px")
    
    if final_distance > 2001:
        improvement = (final_distance - 2001) / 2001 * 100
        print(f"🎉 SUCCESS! Beat heuristic by {improvement:.1f}%")
    elif final_distance > 1576:
        improvement = (final_distance - 1576) / 1576 * 100
        print(f"✅ Good progress! Beat random by {improvement:.1f}%")
    else:
        print(f"📈 Keep training - agent needs more episodes")
    
    # Save final model
    agent.save('improved_dqn_final.pth')
    print(f"\n💾 Final model saved as 'improved_dqn_final.pth'")
    
    # Final visualization
    print(f"\n📈 Final Training Visualization:")
    monitor.plot_training()
    
    return monitor

# Create and train the improved agent
print("=== STARTING IMPROVED DQN TRAINING ===")
print("This will take 20-25 minutes to train for 500 episodes.")
print("You'll see much more detailed progress updates and metrics.")
print("-" * 60)

# Training run
training_monitor = train_improved_dqn(
    agent=improved_agent,
    n_episodes=500,              # Full training run
    max_steps_per_episode=3600,  # 30 seconds at 120 FPS
    print_every=10,              # Frequent updates
    plot_every=100,              # Regular plotting
    save_every=100               # Regular checkpoints
)

=== STARTING IMPROVED DQN TRAINING ===
This will take 20-25 minutes to train for 500 episodes.
You'll see much more detailed progress updates and metrics.
------------------------------------------------------------
🚀 Starting Improved DQN Training!
Episodes: 500
Target: Beat heuristic baseline of 2001px
Environment: 30s episodes, random levels, 0.1 flip penalty
------------------------------------------------------------


RuntimeError: mat1 and mat2 shapes cannot be multiplied (32x6 and 8x128)

### Understanding the Training Process

**🧠 What Happens During Training:**

1. **Episode Start**: Agent spawns in random level, starts with high exploration (ε = 0.99 for the improved agent)
2. **Action Selection**: Agent chooses actions (mostly random at first, gradually more intelligent)
3. **Experience Storage**: Every action/reward/outcome gets stored in memory
4. **Learning**: Agent trains on random batches from its memory
5. **Progress**: Over time, exploration decreases and performance improves
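
The learning step (item 4) hinges on how the bootstrapped target is built. Here is a minimal sketch of the Double DQN target used by `ImprovedDQNAgent`, written for a single transition in plain Python. The Q-values are made up, `double_dqn_target` is a hypothetical helper, and the entropy bonus from `calculate_loss` is left out for clarity.

```python
def double_dqn_target(reward, done, online_next_q, target_next_q, gamma=0.99):
    """Select the next action with the online net, evaluate it with the target net."""
    if done:
        return reward  # no bootstrapping past a terminal state
    best_action = max(range(len(online_next_q)), key=lambda a: online_next_q[a])
    return reward + gamma * target_next_q[best_action]

# Made-up Q-values for the next state: [Q(NOOP), Q(FLIP)]
online_next_q = [1.2, 1.5]  # online network prefers FLIP
target_next_q = [1.0, 0.8]  # target network's estimate for each action

y = double_dqn_target(0.5, False, online_next_q, target_next_q)
print(round(y, 3))  # 1.292
```

A plain DQN target would take the maximum over the target network's own values (0.5 + 0.99 × 1.0 = 1.49 here). Decoupling selection from evaluation is what reduces the overestimation.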

**📊 What to Watch For:**

- **Distance Plot**: Should gradually increase from ~1576px (random) toward 2001px+ (better than heuristic)
- **Exploration Decay**: Epsilon should smoothly decrease from 0.99 toward 0.01 (the improved agent's configured range)
- **Flip Efficiency**: Distance per flip should improve as agent learns better timing
- **Stability**: Less variance in performance as training progresses
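
How long the exploration decay takes can be estimated up front. This sketch uses the improved agent's configured values (`epsilon_start=0.99`, `epsilon_end=0.01`, `epsilon_decay=0.9999`) and assumes one decay per environment step, as in `ImprovedDQNAgent.step`. `steps_to_floor` is a hypothetical helper.

```python
import math

def steps_to_floor(epsilon_start, epsilon_end, decay):
    """Smallest n with epsilon_start * decay**n <= epsilon_end."""
    return math.ceil(math.log(epsilon_end / epsilon_start) / math.log(decay))

n = steps_to_floor(0.99, 0.01, 0.9999)
print(f"Steps until epsilon reaches its floor: {n:,}")
print(f"Equivalent number of full 3600-step episodes: {n / 3600:.1f}")
```

With these settings the floor is reached after roughly 46,000 steps. Episodes that end early contribute fewer steps, so in practice the decay spans more episodes than the second line suggests.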

**🎯 Success Metrics:**

- **Beat Random**: Consistently above 1576px average distance
- **Beat Heuristic**: Consistently above 2001px average distance  
- **Efficiency**: Fewer flips per distance than random baseline
- **Consistency**: Lower variance in episode performance
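
These checks can be expressed as a small summary over recent episodes. The sketch below is one possible way to do it, using only the standard library. The baselines are the figures quoted in this notebook; the distances are made-up numbers and `summarize` is a hypothetical helper.

```python
from statistics import mean, pstdev

RANDOM_BASELINE_PX = 1576     # baselines quoted in this notebook
HEURISTIC_BASELINE_PX = 2001

def summarize(distances, window=100):
    """Mean and spread of the most recent `window` episode distances."""
    recent = distances[-window:]
    avg = mean(recent)
    return {
        "avg_px": avg,
        "std_px": pstdev(recent),  # lower spread means more consistent episodes
        "beats_random": avg > RANDOM_BASELINE_PX,
        "beats_heuristic": avg > HEURISTIC_BASELINE_PX,
    }

# Made-up distances, for illustration only
print(summarize([1500, 1700, 1900, 2100, 2300]))
```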

The agent should start performing like your random baseline but gradually learn to match or exceed your heuristic baseline of 2001px!


# Execute this cell for full training (after quick test succeeds)

## Full DQN Training Session

# Improved DQN Training with Enhanced State

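The `RuntimeError` above occurs because the environment returns 6-dimensional observations while the improved network expects 8 inputs. The improved DQN agent therefore needs the 6-dimensional state from the environment plus 2 extra dimensions taken from the info dict: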

1. Original State (6D):
   - y_norm ∈ [0,1]
   - vy_norm ∈ [-1,1]
   - grav_dir ∈ {−1,+1}
   - probes[0,1,2] ∈ [0,1]

2. Extra Info (2D):
   - grounded (bool)
   - cooldown_ready (bool)

The training code will combine these into a full 8D state vector before passing it to the network.
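
Before the full classes, here is a minimal sketch of that combination in plain Python. The info keys (`grounded`, `cooldown`) are the ones read by the preprocessor in the next cell. The fallback to an empty info dict (for example right after `reset()`, which returns only the observation in the training loop above) is an assumption, and both helper names are hypothetical.

```python
def augment_state(obs, info=None):
    """Append grounded and cooldown_ready flags to a 6-float observation."""
    if len(obs) != 6:
        raise ValueError(f"expected 6 values, got {len(obs)}")
    info = info or {}  # assumption: no info dict is available yet
    grounded = float(info.get("grounded", False))
    cooldown_ready = float(info.get("cooldown", 0) <= 0)
    return list(obs) + [grounded, cooldown_ready]

def can_flip(state):
    """With this layout the two flags sit at indices 6 and 7."""
    return bool(state[6] and state[7])

obs = [0.5, 0.0, 1.0, 1.0, 1.0, 1.0]
airborne = augment_state(obs, {"grounded": False, "cooldown": 0.0})
ready = augment_state(obs, {"grounded": True, "cooldown": 0.0})
print(len(ready), can_flip(airborne), can_flip(ready))
```

Because the flags are appended at the end, any action-masking logic has to read indices 6 and 7 rather than positions in the middle of the vector.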

In [70]:
class StatePreprocessor:
    """Handles state preprocessing and augmentation"""
    
    def __init__(self, env):
        self.env = env
        self.base_dim = 6  # y, vy, grav_dir, 3 probes
    
    def process_state(self, state, info):
        """Convert 6D state + info into 8D state vector"""
        if isinstance(state, torch.Tensor):
            state = state.numpy()
        state = np.array(state).flatten()
        if len(state) != self.base_dim:
            raise ValueError(f"Expected state dimension {self.base_dim}, got {len(state)}")
        
        grounded = float(info.get('grounded', False))
        cooldown_ready = float(info.get('cooldown', 0) <= 0)
        return np.concatenate([state, [grounded, cooldown_ready]])
    
    def process_batch(self, states, infos):
        """Process a batch of states"""
        return np.array([self.process_state(s, i) for s, i in zip(states, infos)])
    
    def get_state_size(self):
        """Return size of processed state"""
        return self.base_dim + 2  # 6 base dims + 2 extra

class DQNNetwork(nn.Module):
    """Neural network for DQN with batch normalization"""
    
    def __init__(self, state_size, action_size, hidden_size=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_size, hidden_size),
            nn.BatchNorm1d(hidden_size),
            nn.ReLU(),
            nn.Linear(hidden_size, hidden_size // 2),
            nn.BatchNorm1d(hidden_size // 2),
            nn.ReLU(),
            nn.Linear(hidden_size // 2, action_size)
        )
        
        # Initialize weights
        for m in self.modules():
            if isinstance(m, nn.Linear):
                nn.init.kaiming_normal_(m.weight)
                if m.bias is not None:
                    nn.init.constant_(m.bias, 0)
    
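    def forward(self, x):
        # Note: in training mode nn.BatchNorm1d raises an error for a single-sample
        # batch, so switch the network to eval() before single-state inference.
        if len(x.shape) == 1:
            x = x.unsqueeze(0)  # Add batch dimension
        return self.net(x)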

class ReplayBuffer:
    """Experience replay buffer with uniform sampling"""
    
    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)
    
    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))
    
    def sample(self, batch_size):
        states, actions, rewards, next_states, dones = zip(*random.sample(self.buffer, batch_size))
        return (
            np.array(states),
            np.array(actions),
            np.array(rewards),
            np.array(next_states),
            np.array(dones)
        )
    
    def __len__(self):
        return len(self.buffer)

class DQNAgent:
    """DQN Agent with state preprocessing and enhanced training"""
    
    def __init__(
        self,
        state_processor,
        action_size=2,
        hidden_size=128,
        learning_rate=0.0001,
        gamma=0.99,
        epsilon_start=0.99,
        epsilon_end=0.01,
        epsilon_decay=0.9999,
        buffer_size=100000,
        batch_size=32,
        target_update_freq=100
    ):
        self.state_processor = state_processor
        self.state_size = state_processor.get_state_size()
        self.action_size = action_size
        self.batch_size = batch_size
        self.gamma = gamma
        self.target_update_freq = target_update_freq
        
        # Exploration
        self.epsilon = epsilon_start
        self.epsilon_end = epsilon_end
        self.epsilon_decay = epsilon_decay
        
        # Networks
        self.q_network = DQNNetwork(self.state_size, action_size, hidden_size)
        self.target_network = DQNNetwork(self.state_size, action_size, hidden_size)
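        self.target_network.load_state_dict(self.q_network.state_dict())
        # The target network is only used for inference, so keep its
        # BatchNorm layers on running statistics.
        self.target_network.eval()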
        
        # Optimizer
        self.optimizer = torch.optim.Adam(self.q_network.parameters(), lr=learning_rate)
        
        # Replay buffer
        self.memory = ReplayBuffer(buffer_size)
        
        # Training stats
        self.steps_done = 0
        self.losses = []
        
        print(f"🤖 DQN Agent Created:")
        print(f"   State size: {self.state_size}")
        print(f"   Hidden size: {hidden_size}")
        print(f"   Learning rate: {learning_rate}")
        print(f"   Batch size: {batch_size}")
    
    def remember(self, state, action, reward, next_state, done, info, next_info):
        """Store experience in replay buffer"""
        # Process states before storing
        full_state = self.state_processor.process_state(state, info)
        full_next_state = self.state_processor.process_state(next_state, next_info)
        self.memory.push(full_state, action, reward, full_next_state, done)
    
    def get_action(self, state, info, training=True):
        """Select action using epsilon-greedy"""
        # Process state
        full_state = self.state_processor.process_state(state, info)
        state_tensor = torch.FloatTensor(full_state)
        
        # Epsilon-greedy
        if training and random.random() < self.epsilon:
            return random.randrange(self.action_size)
        
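        # BatchNorm1d cannot normalize a single sample in training mode,
        # so use eval mode (running statistics) for action selection.
        self.q_network.eval()
        with torch.no_grad():
            q_values = self.q_network(state_tensor)
        self.q_network.train()
        return q_values.argmax().item()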
    
    def train_step(self):
        """Train on one batch of experiences"""
        if len(self.memory) < self.batch_size:
            return
        
        # Sample batch
        states, actions, rewards, next_states, dones = self.memory.sample(self.batch_size)
        
        # Convert to tensors
        states = torch.FloatTensor(states)
        actions = torch.LongTensor(actions)
        rewards = torch.FloatTensor(rewards)
        next_states = torch.FloatTensor(next_states)
        dones = torch.FloatTensor(dones)
        
        # Current Q values
        current_q = self.q_network(states).gather(1, actions.unsqueeze(1))
        
        # Next Q values (with target network)
        with torch.no_grad():
            next_q = self.target_network(next_states).max(1)[0]
            target_q = rewards + (1 - dones) * self.gamma * next_q
        
        # Compute loss and update
        # squeeze(1) drops only the action dimension, keeping shape (batch,)
        loss = F.smooth_l1_loss(current_q.squeeze(1), target_q)
        self.optimizer.zero_grad()
        loss.backward()
        
        # Gradient clipping
        torch.nn.utils.clip_grad_norm_(self.q_network.parameters(), 1.0)
        self.optimizer.step()
        
        # Store loss
        self.losses.append(loss.item())
        
        # Update target network if needed
        self.steps_done += 1
        if self.steps_done % self.target_update_freq == 0:
            self.target_network.load_state_dict(self.q_network.state_dict())
        
        # Decay epsilon
        self.epsilon = max(self.epsilon_end, self.epsilon * self.epsilon_decay)
    
    def save(self, path):
        """Save model"""
        torch.save({
            'q_network_state_dict': self.q_network.state_dict(),
            'target_network_state_dict': self.target_network.state_dict(),
            'optimizer_state_dict': self.optimizer.state_dict(),
            'epsilon': self.epsilon,
            'steps_done': self.steps_done,
            'losses': self.losses
        }, path)
    
    def load(self, path):
        """Load model"""
        checkpoint = torch.load(path)
        self.q_network.load_state_dict(checkpoint['q_network_state_dict'])
        self.target_network.load_state_dict(checkpoint['target_network_state_dict'])
        self.optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
        self.epsilon = checkpoint['epsilon']
        self.steps_done = checkpoint['steps_done']
        self.losses = checkpoint['losses']

# Create environment and agent
env = GGEnv(
    level_seed=None,
    max_time_s=30.0,
    flip_penalty=0.1,
    dt=1/120
)

# Create state processor and agent
state_processor = StatePreprocessor(env)
agent = DQNAgent(
    state_processor=state_processor,
    action_size=2,
    hidden_size=128,
    learning_rate=0.0001,
    gamma=0.99,
    epsilon_start=0.99,
    epsilon_end=0.01,
    epsilon_decay=0.9999,
    buffer_size=100000,
    batch_size=32,
    target_update_freq=100
)

print("\n✅ Agent created and ready for training!")

🤖 DQN Agent Created:
   State size: 8
   Hidden size: 128
   Learning rate: 0.0001
   Batch size: 32

✅ Agent created and ready for training!
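
With the settings above, ε is multiplied by `epsilon_decay` at the end of every `train_step` that actually trains, so it is worth checking how long exploration lasts before ε hits `epsilon_end`. The snippet below is a small sketch that uses only the hyperparameters passed to `DQNAgent`; the helper name `steps_until_floor` is made up for illustration and is not part of the notebook.

```python
import math

def steps_until_floor(eps_start: float, eps_end: float, decay: float) -> int:
    """Smallest n such that eps_start * decay**n <= eps_end."""
    if not (0.0 < decay < 1.0):
        raise ValueError("decay must be in (0, 1)")
    if eps_start <= eps_end:
        return 0
    return math.ceil(math.log(eps_end / eps_start) / math.log(decay))

# Hyperparameters copied from the DQNAgent call above
n = steps_until_floor(eps_start=0.99, eps_end=0.01, decay=0.9999)
print(f"epsilon reaches its floor after about {n:,} training steps")

# With at most 1000 steps per episode (max_steps), this is a lower bound
# on the number of episodes needed to reach the floor.
print(f"i.e. no sooner than episode {math.ceil(n / 1000)}")
```

Episodes that end early contribute fewer training steps, so in practice the floor is reached later than this lower bound suggests.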


In [71]:
# Training parameters
num_episodes = 1000
max_steps = 1000
log_interval = 10
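save_interval = 100

# Make sure the checkpoint directory exists before the first save
import os
os.makedirs("models", exist_ok=True)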

# Training metrics
episode_rewards = []
episode_lengths = []
avg_losses = []

print("🎮 Starting training...")
try:
    for episode in range(num_episodes):
        state, info = env.reset()
        episode_reward = 0
        episode_steps = 0
        
        for step in range(max_steps):
            # Select and perform action
            action = agent.get_action(state, info, training=True)
            next_state, reward, done, next_info = env.step(action)
            
            # Store in memory (with proper state processing)
            agent.remember(state, action, reward, next_state, done, info, next_info)
            
            # Train
            agent.train_step()
            
            # Update metrics
            episode_reward += reward
            episode_steps += 1
            
            # Move to next state
            state = next_state
            info = next_info
            
            if done:
                break
        
        # Store episode metrics
        episode_rewards.append(episode_reward)
        episode_lengths.append(episode_steps)
        if agent.losses:
            avg_losses.append(np.mean(agent.losses[-episode_steps:]))
        
        # Logging
        if (episode + 1) % log_interval == 0:
            avg_reward = np.mean(episode_rewards[-log_interval:])
            avg_length = np.mean(episode_lengths[-log_interval:])
            avg_loss = np.mean(avg_losses[-log_interval:]) if avg_losses else 0
            print(f"\nEpisode {episode + 1}")
            print(f"Average Reward: {avg_reward:.2f}")
            print(f"Average Length: {avg_length:.2f}")
            print(f"Average Loss: {avg_loss:.4f}")
            print(f"Epsilon: {agent.epsilon:.4f}")
        
        # Save model
        if (episode + 1) % save_interval == 0:
            agent.save(f"models/dqn_ep{episode + 1}.pt")
            print(f"\n💾 Model saved at episode {episode + 1}")

except Exception as e:
    print(f"\n❌ Training interrupted: {str(e)}")
    raise
finally:
    # Plot training results
    if len(episode_rewards) > 0:
        plt.figure(figsize=(15, 5))
        
        plt.subplot(1, 3, 1)
        plt.plot(episode_rewards)
        plt.title('Episode Rewards')
        plt.xlabel('Episode')
        plt.ylabel('Total Reward')
        
        plt.subplot(1, 3, 2)
        plt.plot(episode_lengths)
        plt.title('Episode Lengths')
        plt.xlabel('Episode')
        plt.ylabel('Steps')
        
        if avg_losses:
            plt.subplot(1, 3, 3)
            plt.plot(avg_losses)
            plt.title('Average Loss')
            plt.xlabel('Episode')
            plt.ylabel('Loss')
        
        plt.tight_layout()
        plt.show()
    
    print("\n🎉 Training complete!")

🎮 Starting training...

❌ Training interrupted: too many values to unpack (expected 2)

🎉 Training complete!


ValueError: too many values to unpack (expected 2)
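
The message `too many values to unpack (expected 2)` is what Python raises when a sequence with more than two items is assigned to two names. In the training cell, the only two-name unpacking is `state, info = env.reset()`, so the most likely cause is that `reset()` returns just the 6-float observation. This is an assumption, since `GGEnv.reset` is not shown here. The sketch below reproduces the error with a plain list and shows a hypothetical helper, `split_reset`, that accepts either return shape.

```python
def split_reset(reset_out):
    """Return (observation, info) whether reset() gave obs or (obs, info).

    Hypothetical helper: assumes an (obs, info) pair is a 2-tuple whose
    second element is a dict, and that anything else is the bare observation.
    """
    if (isinstance(reset_out, tuple) and len(reset_out) == 2
            and isinstance(reset_out[1], dict)):
        return reset_out
    return reset_out, {}

obs_only = [0.5, 0.0, 1.0, 0.8, 0.6, 0.4]  # made-up 6-float observation

# Reproduce the failure: six values cannot be unpacked into two names.
try:
    state, info = obs_only
except ValueError as err:
    error_message = str(err)
    print("ValueError:", error_message)

# The helper handles both return shapes.
state, info = split_reset(obs_only)
print(len(state), info)
state, info = split_reset((obs_only, {"grounded": True, "cooldown": 0.0}))
print(len(state), info["grounded"])
```

With an empty `info`, `StatePreprocessor.process_state` falls back to `grounded = 0.0` and `cooldown_ready = 1.0`, so if those flags matter they would have to be read from the environment some other way.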