# Deep Q-Network (DQN) for Gravity Guy

## What is DQN?
**Deep Q-Network (DQN)** is a reinforcement learning method that learns a function
$Q_\theta(s,a)$ estimating the expected discounted return of taking action $a$ in state $s$ and acting greedily afterwards.
We act with **ε-greedy** (mostly pick the action with the highest Q, sometimes explore),
and we **train** the network to match a bootstrapped target:

$$
y = \begin{cases}
r & \text{if episode terminated}\\
r + \gamma \max_{a'} Q_{\bar\theta}(s', a') & \text{otherwise}
\end{cases}
$$

$$
\text{Loss} = \mathrm{Huber}\big( Q_\theta(s,a) - y \big)
$$
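
The target and loss above can be computed on a batch roughly as follows. This is a minimal sketch rather than the notebook's training code: the helper names `td_targets` and `dqn_loss`, the value of `gamma`, and the toy numbers are assumptions for illustration. PyTorch's `F.smooth_l1_loss` with its default `beta=1.0` coincides with the Huber loss with threshold 1.

```python
import torch
import torch.nn.functional as F

def td_targets(rewards, next_q_target, dones, gamma=0.99):
    """y = r for terminal transitions, else r + gamma * max_a' Q_target(s', a')."""
    max_next_q = next_q_target.max(dim=1).values          # shape [batch]
    return rewards + gamma * (1.0 - dones) * max_next_q   # dones is 1.0 where terminal

def dqn_loss(q_online, actions, targets):
    """Huber loss between Q_theta(s, a) for the taken actions and the targets y."""
    q_taken = q_online.gather(1, actions.unsqueeze(1)).squeeze(1)  # shape [batch]
    return F.smooth_l1_loss(q_taken, targets.detach())    # no gradient through y

# Toy batch of two transitions (made-up numbers, not from the game)
rewards = torch.tensor([1.0, 2.0])
dones = torch.tensor([0.0, 1.0])                  # second transition is terminal
next_q_target = torch.tensor([[0.5, 3.0], [4.0, 1.0]])
y = td_targets(rewards, next_q_target, dones, gamma=0.9)
# y[0] = 1.0 + 0.9 * 3.0 = 3.7 (bootstrapped), y[1] = 2.0 (terminal, reward only)

q_online = torch.tensor([[3.2, 0.0], [0.0, 2.0]], requires_grad=True)
actions = torch.tensor([0, 1])
loss = dqn_loss(q_online, actions, y)
```

Multiplying by `(1 - dones)` is a compact way to express the two-case definition of $y$ without branching per sample.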

Two stabilizers make DQN work well in practice:
- **Replay buffer**: learn from randomly sampled past transitions $(s,a,r,s',\text{done})$ to break the correlation between consecutive steps.
- **Target network** $Q_{\bar\theta}$: a slowly updated copy used to compute $y$, so the targets do not shift with every gradient step.
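
Both stabilizers fit in a few lines. The sketch below is an assumption about how they could look, not the notebook's actual implementation: the names `ReplayBuffer` and `sync_target`, the capacity, and the use of a hard copy (rather than a soft, Polyak-style update) are illustrative choices.

```python
import random
from collections import deque

import torch
import torch.nn as nn

class ReplayBuffer:
    """Fixed-capacity FIFO store of (s, a, r, s_next, done) transitions."""

    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)  # oldest transitions are dropped first

    def push(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size):
        # Uniform sampling without replacement breaks temporal correlations
        batch = random.sample(self.buffer, batch_size)
        states, actions, rewards, next_states, dones = zip(*batch)
        return states, actions, rewards, next_states, dones

    def __len__(self):
        return len(self.buffer)

def sync_target(online_net, target_net):
    """Hard update: copy the online weights into the target network."""
    target_net.load_state_dict(online_net.state_dict())

# Fill a tiny buffer with placeholder transitions (6-float states, 2 actions)
buffer = ReplayBuffer(capacity=3)
for t in range(5):
    buffer.push([float(t)] * 6, t % 2, 1.0, [float(t + 1)] * 6, False)

# Stand-in networks; any two modules with identical architecture work the same way
online_net, target_net = nn.Linear(6, 2), nn.Linear(6, 2)
sync_target(online_net, target_net)
```

With `capacity=3`, only the three most recent transitions survive the five pushes, which is the behavior wanted once the buffer is full.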

## Why DQN fits this game
- **Tiny, discrete action space:** 2 actions (NOOP / FLIP).
- **Dense, shaped reward:** per-step progress minus a small flip penalty.
- **Compact observation:** 6 floats capture what matters (vertical state + look-ahead probes).
- **Fast, headless env:** high sample throughput for replay.

## State / Action / Reward (this notebook)
- **Observation (6 floats):**
  1. `y_norm` ∈ [0,1] — vertical position (0=top, 1=bottom)  
  2. `vy_norm` ∈ [-1,1] — normalized vertical speed  
  3. `grav_dir` ∈ {−1,+1} — current gravity (up/down)  
  4–6. `p1, p2, p3` ∈ [0,1] — **look-ahead clearances** in the gravity direction (near → far)
- **Actions:** `0 = NOOP`, `1 = FLIP` (flip only **fires** when grounded & cooldown is over; invalid flips act as no-ops).
- **Reward per step:** `progress − flip_penalty × [flip_fired]`
- **Termination:** off-screen (death) or time limit (e.g., 10 s).
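
As a worked example of the reward formula, the bracket `[flip_fired]` is 1 when a flip actually fires and 0 otherwise. The function name `step_reward` is hypothetical, and how `GGEnv` measures `progress` internally is not shown here; the numbers below only illustrate the arithmetic with a penalty of 0.01.

```python
def step_reward(progress, flip_fired, flip_penalty=0.01):
    """progress minus the penalty, charged only when a flip actually fired."""
    return progress - flip_penalty * (1.0 if flip_fired else 0.0)

no_flip = step_reward(2.083, flip_fired=False)   # 2.083
with_flip = step_reward(2.083, flip_fired=True)  # 2.083 - 0.01 = 2.073
```

An invalid flip request (not grounded, or still on cooldown) does not fire, so under this formula it costs nothing.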

## Learning loop (at a glance)
1. **Observe** state $s$.  
2. **Act** with ε-greedy: pick $\arg\max_a Q_\theta(s,a)$ with probability $1-\varepsilon$, random action with probability $\varepsilon$.  
3. **Step** the env → get $(r, s', \text{done})$.  
4. **Store** $(s,a,r,s',\text{done})$ in the replay buffer.  
5. **Sample** a mini-batch from replay, compute targets $y$ with the **target network**.  
6. **Update** the online network $Q_\theta$ to minimize Huber loss; periodically **update** the target network.  
7. **Anneal** $\varepsilon$ over time to reduce exploration.
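
Steps 2 and 7 of the loop can be sketched as below. The linear schedule and its default values (`eps_start`, `eps_end`, `decay_steps`) are assumptions chosen for illustration; the notebook may anneal differently.

```python
import random

def epsilon_at(step, eps_start=1.0, eps_end=0.05, decay_steps=10_000):
    """Linearly anneal epsilon from eps_start to eps_end over decay_steps."""
    frac = min(step / decay_steps, 1.0)   # clamp so epsilon stays at eps_end afterwards
    return eps_start + frac * (eps_end - eps_start)

def select_action(q_values, epsilon, rng=random):
    """Epsilon-greedy over a sequence of Q-values (index 0 = NOOP, 1 = FLIP)."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))                       # explore
    return max(range(len(q_values)), key=lambda a: q_values[a])   # exploit

# Early in training the agent explores almost always; later it mostly exploits
early_eps = epsilon_at(0)        # 1.0
late_eps = epsilon_at(50_000)    # clamped at eps_end
greedy_choice = select_action([0.1, 0.7], epsilon=0.0)  # always the argmax
```

Passing `epsilon=0.0` gives a purely greedy policy, which is what evaluation runs would typically use.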

## Game-specific caveats (and how we handle them)
- **Action validity:** flips only take effect when grounded.  
  *Mitigation:* treat invalid flips as no-ops and/or mask them at action selection time.
- **Timing & partial observability:** probes look ahead in x while y changes over time.  
  *Mitigation:* keep the observation compact but informative (probes + gravity + velocity).  
  (Optionally, stack a few recent observations or add `grounded`/`cooldown` scalars.)
- **Evaluation fairness:** fix a set of level seeds; report mean/median distance, % time-limit terminations, and flips per 1000 px.
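
Masking invalid flips at action selection time could look like the sketch below. The `can_flip` flag is hypothetical: the 6-float observation does not include a grounded or cooldown signal, so this assumes the environment (or a wrapper around it) can report whether a flip would fire.

```python
import torch

NOOP, FLIP = 0, 1

def masked_greedy_action(q_values, can_flip):
    """Pick the best action, ignoring FLIP when it cannot fire."""
    q = q_values.clone()             # leave the caller's tensor untouched
    if not can_flip:
        q[FLIP] = float("-inf")      # an invalid flip can never be the argmax
    return int(torch.argmax(q).item())

q = torch.tensor([0.1, 0.9])         # network prefers FLIP
airborne = masked_greedy_action(q, can_flip=False)  # falls back to NOOP
grounded = masked_greedy_action(q, can_flip=True)   # FLIP is allowed
```

Since invalid flips already behave as no-ops in the environment, masking mainly keeps meaningless flip attempts out of the agent's behavior and the replay data.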

## What the reader should expect
- Baselines (**Random**, **Heuristic**) for context.  
- A DQN agent that learns to time flips better than random, often matching or surpassing the hand-crafted heuristic on held-out seeds.  
- Clear plots: training return, evaluation distance, and failure-mode breakdown.


## Environment Setup

Let's start by setting up our environment and importing the libraries we'll need:

In [4]:
# Standard libraries for ML and data handling
import numpy as np
import matplotlib.pyplot as plt
import random
import json
from collections import deque
import time

# PyTorch for neural networks (you might need: pip install torch)
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F

# Our custom environment
import sys
sys.path.append('../..')  # Go up two directories to access src/
from src.env.gg_env import GGEnv

print("Libraries imported successfully!")
print(f"PyTorch version: {torch.__version__}")
print(f"Using device: {'CUDA' if torch.cuda.is_available() else 'CPU'}")

Libraries imported successfully!
PyTorch version: 2.8.0+cpu
Using device: CPU


## Quick Environment Test

Before building the agent, let's check that the environment behaves as described above:

In [5]:
# Create a test environment
env = GGEnv(level_seed=12345, max_time_s=10.0, flip_penalty=0.01)
obs = env.reset()

print("=== ENVIRONMENT UNDERSTANDING ===")
print(f"Observation space: {len(obs)} dimensions")
print(f"Action space: {env.action_space_n} actions (0=wait, 1=flip)")
print(f"First observation: {obs}")

# Take a few random actions to see what happens
total_reward = 0
for step in range(10):
    action = random.choice([0, 1])  # Random action
    obs, reward, done, info = env.step(action)
    total_reward += reward
    
    print(f"Step {step+1}: action={action}, reward={reward:.3f}, done={done}")
    print(f"  → obs: [{obs[0]:.2f}, {obs[1]:.2f}, {obs[2]:.0f}, {obs[3]:.2f}, {obs[4]:.2f}, {obs[5]:.2f}]")
    
    if done:
        print(f"Episode ended! Total reward: {total_reward:.2f}, Distance: {info['distance_px']}px")
        break

print("\n✅ Environment test completed!")

=== ENVIRONMENT UNDERSTANDING ===
Observation space: 6 dimensions
Action space: 2 actions (0=wait, 1=flip)
First observation: [0.5, 0.0, 1.0, 0.24814814814814815, 0.24814814814814815, 0.24814814814814815]
Step 1: action=0, reward=2.083, done=False
  → obs: [0.50, 0.01, 1, 0.25, 0.25, 0.25]
Step 2: action=0, reward=2.083, done=False
  → obs: [0.50, 0.03, 1, 0.25, 0.25, 0.25]
Step 3: action=0, reward=2.083, done=False
  → obs: [0.50, 0.04, 1, 0.25, 0.25, 0.25]
Step 4: action=0, reward=2.083, done=False
  → obs: [0.50, 0.05, 1, 0.25, 0.25, 0.25]
Step 5: action=0, reward=2.083, done=False
  → obs: [0.50, 0.06, 1, 0.25, 0.25, 0.25]
Step 6: action=0, reward=2.083, done=False
  → obs: [0.51, 0.07, 1, 0.24, 0.24, 0.24]
Step 7: action=1, reward=2.083, done=False
  → obs: [0.51, 0.09, 1, 0.24, 0.24, 0.24]
Step 8: action=1, reward=2.083, done=False
  → obs: [0.51, 0.10, 1, 0.24, 0.24, 0.24]
Step 9: action=0, reward=2.083, done=False
  → obs: [0.51, 0.11, 1, 0.24, 0.24, 0.24]
Step 10: action=1, re

## Part 2: Building the Neural Network Brain

### What is our Neural Network doing?

Think of the neural network as the agent's "brain". It takes in the 6 observations from the game and outputs 2 numbers:
- **Q(state, wait)**: How good is it to wait/do nothing in this situation?
- **Q(state, flip)**: How good is it to flip gravity in this situation?

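When acting greedily, the agent chooses the action with the higher Q-value; during training, ε-greedy exploration occasionally replaces this choice with a random action.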

### Network Architecture Design

For our Gravity Guy game, we'll use a simple but effective architecture:

```
Input Layer (6 neurons) → Hidden Layer (128 neurons) → Hidden Layer (64 neurons) → Output Layer (2 neurons)
       ↓                        ↓                           ↓                        ↓
  [y, vy, grav,              [lots of                   [more                   [Q(wait), 
   p1, p2, p3]                neurons]                   neurons]                 Q(flip)]
```

### Why this architecture?
- **Input**: 6 observations (exactly what our environment gives us)
- **Hidden layers**: 128 and 64 neurons, enough capacity for a 6-dimensional input while keeping the network small (about 9k parameters) and fast to train
- **Output**: 2 Q-values (one for each possible action)
- **Activation**: ReLU (simple and effective for this type of problem)
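
To get a feel for the size of this architecture, the parameter count can be worked out by hand: each fully connected layer has `n_in * n_out` weights plus `n_out` biases. The helper name `count_mlp_params` is just for illustration.

```python
def count_mlp_params(layer_sizes):
    """Weights plus biases for a fully connected network."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))

sizes = [6, 128, 64, 2]
per_layer = [n_in * n_out + n_out for n_in, n_out in zip(sizes, sizes[1:])]
total = count_mlp_params(sizes)
print(per_layer, total)  # [896, 8256, 130] 9282
```

The total agrees with the `Total parameters: 9,282` line printed when the network is created in the implementation section.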

## Neural Network Implementation

In [6]:
class DQN(nn.Module):
    """
    Deep Q-Network for Gravity Guy
    
    This neural network takes game observations and predicts Q-values for each action.
    Think of it as the "brain" that learns to evaluate how good each action is.
    """
    
    def __init__(self, input_size=6, hidden_size1=128, hidden_size2=64, output_size=2):
        """
        Initialize the network layers
        
        Args:
            input_size: Number of observations (6 for our game)
            hidden_size1: First hidden layer size (128 neurons)
            hidden_size2: Second hidden layer size (64 neurons)  
            output_size: Number of actions (2: wait or flip)
        """
        super().__init__()
        
        # Define the network layers
        self.fc1 = nn.Linear(input_size, hidden_size1)      # Input → Hidden1
        self.fc2 = nn.Linear(hidden_size1, hidden_size2)    # Hidden1 → Hidden2  
        self.fc3 = nn.Linear(hidden_size2, output_size)     # Hidden2 → Output
        
        print("🧠 DQN Network Created:")
        print(f"   Input: {input_size} → Hidden: {hidden_size1} → Hidden: {hidden_size2} → Output: {output_size}")
        print(f"   Total parameters: {sum(p.numel() for p in self.parameters()):,}")
    
    def forward(self, x):
        """
        Forward pass: convert observations into Q-values
        
        This is where the magic happens - observations go in, Q-values come out!
        
        Args:
            x: Observations tensor [batch_size, 6]
            
        Returns:
            Q-values tensor [batch_size, 2] - one Q-value for each action
        """
        # Layer 1: observations → first hidden layer (with ReLU activation)
        x = F.relu(self.fc1(x))
        
        # Layer 2: first hidden → second hidden layer (with ReLU activation)  
        x = F.relu(self.fc2(x))
        
        # Layer 3: second hidden → Q-values (no activation - we want raw Q-values)
        q_values = self.fc3(x)
        
        return q_values

# Test our network
print("=== TESTING THE NEURAL NETWORK ===")

# Create the network
dqn = DQN(input_size=6, output_size=2)

# Test with a fake observation (like what our environment produces)
test_observation = torch.tensor([0.5, 0.2, 1.0, 0.8, 0.9, 1.0], dtype=torch.float32)
test_observation = test_observation.unsqueeze(0)  # Add batch dimension

# Get Q-values from the network
with torch.no_grad():  # No gradients needed for testing
    q_values = dqn(test_observation)

print(f"Test observation: {test_observation.squeeze().tolist()}")
print(f"Network output (Q-values): {q_values.squeeze().tolist()}")
print(f"Best action: {torch.argmax(q_values).item()} ({'flip' if torch.argmax(q_values).item() == 1 else 'wait'})")

print("\n✅ Neural Network test completed!")

=== TESTING THE NEURAL NETWORK ===
🧠 DQN Network Created:
   Input: 6 → Hidden: 128 → Hidden: 64 → Output: 2
   Total parameters: 9,282
Test observation: [0.5, 0.20000000298023224, 1.0, 0.800000011920929, 0.8999999761581421, 1.0]
Network output (Q-values): [0.16926881670951843, 0.14115166664123535]
Best action: 0 (wait)

✅ Neural Network test completed!


## Understanding the Network Components

In [7]:
print("=== UNDERSTANDING NETWORK COMPONENTS ===")

# Let's examine what each layer does
test_obs = torch.tensor([0.3, -0.1, -1.0, 0.6, 0.7, 0.8], dtype=torch.float32).unsqueeze(0)

print("Step-by-step forward pass:")
print(f"1. Input observations: {test_obs.squeeze().tolist()}")

# Manual forward pass to see each step
x = test_obs
print(f"   Shape: {x.shape}")

# Layer 1
x = F.relu(dqn.fc1(x))  
print(f"2. After first hidden layer (128 neurons): {x.shape}")
print(f"   Sample values: [{x[0][0]:.3f}, {x[0][1]:.3f}, {x[0][2]:.3f}, ...] (showing first 3)")

# Layer 2  
x = F.relu(dqn.fc2(x))
print(f"3. After second hidden layer (64 neurons): {x.shape}")
print(f"   Sample values: [{x[0][0]:.3f}, {x[0][1]:.3f}, {x[0][2]:.3f}, ...] (showing first 3)")

# Layer 3
x = dqn.fc3(x)
print(f"4. Final Q-values: {x.squeeze().tolist()}")
print(f"   Q(wait) = {x[0][0]:.3f}, Q(flip) = {x[0][1]:.3f}")

# Decision making
best_action = torch.argmax(x).item()
print(f"5. Decision: Choose action {best_action} ({'flip' if best_action == 1 else 'wait'})")

=== UNDERSTANDING NETWORK COMPONENTS ===
Step-by-step forward pass:
1. Input observations: [0.30000001192092896, -0.10000000149011612, -1.0, 0.6000000238418579, 0.699999988079071, 0.800000011920929]
   Shape: torch.Size([1, 6])
2. After first hidden layer (128 neurons): torch.Size([1, 128])
   Sample values: [0.000, 0.179, 0.000, ...] (showing first 3)
3. After second hidden layer (64 neurons): torch.Size([1, 64])
   Sample values: [0.000, 0.000, 0.000, ...] (showing first 3)
4. Final Q-values: [0.15515024960041046, 0.1052684634923935]
   Q(wait) = 0.155, Q(flip) = 0.105
5. Decision: Choose action 0 (wait)


## Key Concepts Explained

**What just happened?**

1. **Input Processing**: We fed the network 6 numbers representing the game state
2. **Hidden Layers**: The network processed this information through two layers of neurons
3. **Q-Value Output**: We got back 2 numbers - Q(wait) and Q(flip)  
4. **Action Selection**: We pick the action with the highest Q-value

**Why ReLU activation?**
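- ReLU (Rectified Linear Unit) sets negative values to 0 and passes positive values through unchanged
- It's cheap to compute and a common default for small fully connected networks like this one
- The nonlinearity is what lets the network learn complex patterns: without it, the stacked linear layers would collapse into a single linear map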

**Why no activation on the output?**
- Q-values estimate returns, so they can be positive or negative and have no fixed range  
- We want the raw values; squashing them (for example into a 0-1 range) would distort the estimates
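
Both points about activations can be checked numerically. The sketch below uses small throwaway layers (sizes 6, 4 and 2 are arbitrary choices, not the notebook's network) to show what ReLU does and why a nonlinearity between layers is needed.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# ReLU zeroes negatives and leaves non-negative values unchanged
x = torch.tensor([-2.0, -0.5, 0.0, 1.5])
relu_out = F.relu(x)

# Two linear layers with no activation in between equal one linear layer
torch.manual_seed(0)
a, b = nn.Linear(6, 4), nn.Linear(4, 2)
obs = torch.randn(3, 6)                     # batch of 3 fake observations

with torch.no_grad():
    stacked = b(a(obs))                     # apply the layers one after another
    w = b.weight @ a.weight                 # combined weight, shape [2, 6]
    bias = b.weight @ a.bias + b.bias       # combined bias, shape [2]
    collapsed = obs @ w.T + bias            # a single equivalent linear map
```

Here `stacked` and `collapsed` agree up to floating-point error, so without ReLU the extra layer would add parameters but no expressive power.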