# Deep Q-Network (DQN) for Gravity-Guy Env v2 — Plan & Rationale

**Goal.** Train a DQN agent that outperforms both **Random** and our **Tiny Heuristic** on a held-out set of seeds.  
We’ll keep the notebook educational: each section explains *what we do* and *why*, before showing code.

**Notebook roadmap**
1. **DQN at a glance (this section):** problem framing and the core equations, kept at a high level.
2. **Experiment setup:** action/obs spaces, reward, time limits, seeds, evaluation protocol, logging.
3. **Network & optimizer choices:** architecture, activations, initialization, loss, target updates, exploration.
4. **Training loop design:** replay buffer, batches, update cadence, target sync, eval cadence, checkpoints.
5. **Implementation:** minimal training code, clean metrics.
6. **Results & analysis:** curves, tables, seed-paired eval, side-by-side with heuristic.
7. **Next steps:** ablations (Double DQN, PER, n-step) and improvements.

*Environment:* GGEnv v2, Observation v2 (15-dim), discrete actions (2), default decision rate ≈ 15 Hz (`frame_skip=4`).

## Part 1 — DQN at a glance

### 1) Problem framing (MDP)
We model the game as a Markov Decision Process (MDP):
- **State** $s_t$: the 15-dim observation vector (player position/velocity/gravity + probe features).
- **Action** $a_t \in \{0,1\}$: `0 = NOOP`, `1 = FLIP` gravity.
- **Reward** $r_t$: scalar signal per step (defined in the setup section).
- **Transition**: environment moves platforms, applies gravity, checks collisions → $s_{t+1}$.
- **Discount** $\gamma \in [0,1)$: how much we value future rewards.

The objective is to learn a policy $\pi(a \mid s)$ that maximizes expected discounted return.
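
To get a feel for what $\gamma$ implies, the sketch below computes the discounted return of a reward sequence and the effective horizon $1/(1-\gamma)$. It assumes $\gamma = 0.99$ and 15 decisions per second, as used later in this notebook; the constant per-step reward of 1.0 is a placeholder assumption, not the env’s actual reward.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**k * r_k, accumulated backwards from the last reward."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g


GAMMA = 0.99          # discount used later in this notebook
DECISION_HZ = 15.0    # 60 FPS physics with frame_skip = 4
ALIVE_REWARD = 1.0    # assumption: placeholder value, not the env's real reward

# Rewards further away than ~1/(1-gamma) decisions contribute little to the return.
horizon_steps = 1.0 / (1.0 - GAMMA)
horizon_seconds = horizon_steps / DECISION_HZ

# Return of an episode that survives all 450 decisions under the placeholder reward.
full_episode_return = discounted_return([ALIVE_REWARD] * 450, GAMMA)

print(f"effective horizon: about {horizon_steps:.0f} decisions ({horizon_seconds:.1f} s)")
print(f"discounted return over 450 decisions: {full_episode_return:.2f}")
```

Under these assumptions the agent effectively plans a few seconds ahead, which is worth keeping in mind when reading the Q-values later.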

---

### 2) Q-learning objective
The **optimal action-value** function $Q^*(s,a)$ satisfies the Bellman optimality equation:

$$
Q^*(s_t,a_t) = \mathbb{E}\big[\, r_t + \gamma \max_{a'} Q^*(s_{t+1}, a') \;\big|\; s_t, a_t \,\big]
$$

DQN approximates $Q(s,a;\theta)$ with a neural network and minimizes a temporal-difference (TD) loss toward a **target**:

$$
y_t = r_t + \gamma (1 - \text{done}_t) \, \max_{a'} Q(s_{t+1}, a'; \theta^-)
$$

$$
\mathcal{L}(\theta) = \mathbb{E}\left[\, \ell\big(y_t - Q(s_t, a_t; \theta)\big) \,\right]
$$

- $\theta$: online network parameters (updated every gradient step).  
- $\theta^-$: **target network** parameters (held fixed between syncs; copied from $\theta$ periodically, or blended in softly at every update).  
- $\ell(\cdot)$: loss; we’ll use **Huber** (smooth L1) for stability:

$$
\ell(\delta)=
\begin{cases}
\tfrac{1}{2}\delta^2 & \text{if } |\delta|\le \kappa \\
\kappa\,(|\delta| - \tfrac{1}{2}\kappa) & \text{otherwise}
\end{cases}
$$

with $\kappa=1$ by default. (MSE also works but is less robust to outliers.)
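
The target and loss above can be sketched for a single transition in plain Python. The function names and the example numbers are hypothetical; real training code would compute the same quantities over a whole batch of tensors.

```python
def td_target(reward, done, next_q_values, gamma=0.99):
    """y = r + gamma * (1 - done) * max_a' Q(s', a'; theta^-)."""
    return reward + gamma * (1.0 - float(done)) * max(next_q_values)


def huber(delta, kappa=1.0):
    """Quadratic for |delta| <= kappa, linear beyond it."""
    abs_delta = abs(delta)
    if abs_delta <= kappa:
        return 0.5 * delta * delta
    return kappa * (abs_delta - 0.5 * kappa)


# Hypothetical numbers for illustration (not taken from the env).
q_online_sa = 2.0            # Q(s_t, a_t; theta)
q_target_next = [1.5, 3.0]   # Q(s_{t+1}, . ; theta^-) for NOOP and FLIP

# Non-terminal step: the bootstrap term is kept.
y_alive = td_target(reward=1.0, done=False, next_q_values=q_target_next)
# Terminal step: (1 - done) zeroes the bootstrap term, so y is just the reward.
y_dead = td_target(reward=-1.0, done=True, next_q_values=q_target_next)

loss_alive = huber(y_alive - q_online_sa)
loss_dead = huber(y_dead - q_online_sa)
print(y_alive, y_dead, loss_alive, loss_dead)
```

Both TD errors here exceed $\kappa = 1$, so both losses fall in the linear branch; this is the regime where Huber limits the gradient magnitude compared with MSE.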

> **Double DQN (optional):** reduces over-estimation by selecting with $\theta$ but evaluating with $\theta^-$:
> $$
> y_t = r_t + \gamma (1-\text{done}_t)\, Q\!\left(s_{t+1}, \arg\max_{a'} Q(s_{t+1},a';\theta);\ \theta^- \right)
> $$
> We’ll start with vanilla DQN and can switch to Double DQN if needed.
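
The difference between the two targets is easiest to see on one transition where the online and target networks disagree about the best next action. This is a minimal sketch with hypothetical Q-values; the helper names are illustrative only.

```python
def argmax(values):
    """Index of the largest entry."""
    return max(range(len(values)), key=lambda i: values[i])


def vanilla_target(reward, done, q_next_target, gamma=0.99):
    """Select and evaluate the next action with the target network."""
    return reward + gamma * (1.0 - float(done)) * max(q_next_target)


def double_dqn_target(reward, done, q_next_online, q_next_target, gamma=0.99):
    """Select the next action with theta, evaluate it with theta^-."""
    best_action = argmax(q_next_online)
    return reward + gamma * (1.0 - float(done)) * q_next_target[best_action]


# Hypothetical next-state Q-values for [NOOP, FLIP].
q_next_online = [2.0, 1.0]   # online net prefers NOOP
q_next_target = [1.2, 3.0]   # target net has an optimistic estimate for FLIP

y_vanilla = vanilla_target(0.0, False, q_next_target)
y_double = double_dqn_target(0.0, False, q_next_online, q_next_target)
print(y_vanilla, y_double)
```

Because the Double DQN target evaluates one specific action instead of taking the maximum, it can never exceed the vanilla target computed from the same target-network values.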

---

### 3) Experience replay & exploration
- **Replay buffer** $\mathcal{D}$: stores transitions $(s_t,a_t,r_t,s_{t+1},\text{done}_t)$.  
  At each update, we sample a mini-batch uniformly at random from $\mathcal{D}$. This breaks the temporal correlation between consecutive transitions and stabilizes training.
- **$\varepsilon$-greedy exploration**: with prob. $\varepsilon_t$ choose a random action; otherwise act greedily:

$$
a_t =
\begin{cases}
\text{rand action} & \text{with prob } \varepsilon_t \\
\arg\max_a Q(s_t,a;\theta) & \text{with prob } 1-\varepsilon_t
\end{cases}
$$

We’ll **anneal** $\varepsilon$ linearly from 1.0 to 0.05 over a fixed number of decision steps.
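
A uniform replay buffer can be sketched with a fixed-size list used as a ring. The class and method names are hypothetical and not taken from this repo; transitions are stored as plain tuples to keep the example dependency-free.

```python
import random


class ReplayBuffer:
    """Fixed-capacity buffer with uniform sampling; oldest entries are overwritten."""

    def __init__(self, capacity, seed=None):
        self.capacity = capacity
        self._storage = []
        self._next = 0                      # index of the slot to write next
        self._rng = random.Random(seed)

    def __len__(self):
        return len(self._storage)

    def add(self, state, action, reward, next_state, done):
        item = (state, action, reward, next_state, done)
        if len(self._storage) < self.capacity:
            self._storage.append(item)
        else:
            self._storage[self._next] = item
        self._next = (self._next + 1) % self.capacity

    def sample(self, batch_size):
        """Draw batch_size distinct transitions uniformly; returns five parallel tuples."""
        batch = self._rng.sample(self._storage, batch_size)
        states, actions, rewards, next_states, dones = zip(*batch)
        return states, actions, rewards, next_states, dones


# Tiny demo: capacity 3, five inserts, so the two oldest transitions are gone.
buf = ReplayBuffer(capacity=3, seed=0)
for i in range(5):
    buf.add(state=i, action=i % 2, reward=1.0, next_state=i + 1, done=False)

states, actions, rewards, next_states, dones = buf.sample(batch_size=2)
print(len(buf), states)
```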

---

### 4) Network & activation (why ReLU?)
- **Input:** 15-dim observation (already normalized/scaled by design).
- **Output:** 2 Q-values $[Q(s,NOOP),\,Q(s,FLIP)]$.
- **Backbone:** MLP with two hidden layers (e.g., **256 → 256**, **ReLU**).
  - **Why ReLU?** Simple, fast, less prone to vanishing gradients than saturating activations such as sigmoid or tanh, and works well on sparse/tabular-like signals (our probes).
  - We’ll use the framework’s default weight initialization and the **Adam** optimizer (adaptive per-parameter step sizes).
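
For a sense of scale, the parameter count of the proposed MLP can be worked out directly from the layer sizes. The helper name is hypothetical; the sizes are the 15 → 256 → 256 → 2 stack described above.

```python
def mlp_param_count(layer_sizes):
    """Weights plus biases of a fully connected stack."""
    return sum(
        n_in * n_out + n_out
        for n_in, n_out in zip(layer_sizes, layer_sizes[1:])
    )


LAYER_SIZES = [15, 256, 256, 2]   # obs -> hidden -> hidden -> Q-values
total_params = mlp_param_count(LAYER_SIZES)
print(f"{total_params:,} trainable parameters")
```

This comes to roughly 70k parameters, most of them in the 256 × 256 hidden layer, so the network is small enough to train comfortably on a CPU.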

---

### 5) Target updates & stability knobs
- **Target network sync:** either **hard** copy every $C$ updates or **soft** update  
  $\theta^- \leftarrow \tau \theta + (1-\tau)\theta^-$ with small $\tau$.
- **Gradient clipping:** cap global norm (e.g., 10) to prevent rare exploding updates.
- **Reward clipping (optional):** clip $r_t$ to $[-1,1]$ if reward magnitudes vary widely or are unbounded.  
- **Batch size / buffer size:** large enough to decorrelate; we’ll start with batch 256, buffer 100k.
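
The sync rules and global-norm clipping can be sketched on toy parameters. Here each network is a dict of scalar values and the gradients are a flat list, which is an illustrative simplification; the function names and $\tau = 0.01$ are assumptions chosen for the example.

```python
import math


def hard_sync(online, target):
    """Copy every online parameter into the target network (in place)."""
    for name, value in online.items():
        target[name] = value


def soft_update(online, target, tau):
    """theta^- <- tau * theta + (1 - tau) * theta^- (in place)."""
    for name, value in online.items():
        target[name] = tau * value + (1.0 - tau) * target[name]


def clip_global_norm(grads, max_norm):
    """Rescale all gradients together so their joint L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    scale = min(1.0, max_norm / total_norm) if total_norm > 0 else 1.0
    return [g * scale for g in grads], total_norm


online = {"w": 1.0, "b": -2.0}
target_soft = {"w": 0.0, "b": 0.0}
target_hard = {"w": 0.0, "b": 0.0}

soft_update(online, target_soft, tau=0.01)   # moves 1% of the way toward online
hard_sync(online, target_hard)               # jumps all the way

clipped, norm_before = clip_global_norm([30.0, 40.0], max_norm=10.0)
print(target_soft, target_hard, clipped, norm_before)
```

Clipping by global norm keeps the direction of the update and only shrinks its length, which is why it is usually preferred over clipping each gradient entry separately.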

---

### 6) What “success” looks like here
- On held-out seeds, DQN surpasses both **Random** and **Tiny Heuristic** in:
  - **distance** (px) and **episode length** (s),
  - **death-cause mix** (fewer early spike deaths *and* fewer out-of-bounds (OOB) deaths).
- Learning curves improve steadily without divergence.


## Part 2 — Experiment setup & evaluation protocol

**Goal.** Fix all the “contract” details before writing any training code so results are reproducible and comparable.

### 1) Environment contract
- **Env:** `GGEnv v2` (Observation v2, 15-dim).
- **Actions:** discrete {0 = NOOP, 1 = FLIP}.
- **Decision cadence:** `frame_skip = 4` ⇒ ~15 decisions/sec at 60 FPS physics.
- **Time limit:** 30 s simulated per episode ⇒ **max_steps ≈ 450** decisions.
- **Seeding:** full determinism per episode seed (and per-run RNG seeds).

> We’ll keep the env reward as-is (alive reward per decision, small terminal penalty if defined).  
> Our *primary* evaluation metrics remain **distance (px)** and **episode length (s)**.

### 2) Datasets: training vs evaluation seeds
- **Eval seeds (held-out):** reuse the 20 seeds from the sanity notebook (101–120).  
  We never train on these; we only evaluate on them periodically and at the end.
- **Training seeds:** a larger pool (e.g., 1000–1999) sampled per episode to avoid overfitting.
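
One way to enforce the train/eval split is to draw each episode’s seed from the training range and assert up front that the range cannot touch the held-out seeds. This is a sketch; `sample_train_seed` is a hypothetical helper, and the half-open range `(1000, 2000)` covers seeds 1000 to 1999.

```python
import random

EVAL_SEEDS = list(range(101, 121))   # held-out seeds 101..120
TRAIN_SEED_RANGE = (1000, 2000)      # half-open, like range()

# Fail fast if the two seed sets could ever overlap.
assert set(range(*TRAIN_SEED_RANGE)).isdisjoint(EVAL_SEEDS)


def sample_train_seed(rng):
    """Draw one training seed uniformly from the training range."""
    lo, hi = TRAIN_SEED_RANGE
    return rng.randrange(lo, hi)


# A dedicated RNG keeps seed sampling reproducible and independent of other randomness.
seed_rng = random.Random(42)
episode_seeds = [sample_train_seed(seed_rng) for _ in range(5)]
print(episode_seeds)
```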

### 3) Logging & folders (so replay works)
We’ll log under `experiments/runs/dqn/<run_id>/`:
- `config.json` — all hyperparams & seeds
- `metrics.csv` — rolling train stats (loss, eps, steps, buffer size, etc.)
- `eval/episodes.csv` — eval summaries on the held-out seeds
- `eval/traces/<seed>_actions.npy` — action sequences for exact replay
- `checkpoints/` — network snapshots (e.g., best and periodic)
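
Appending rows to `metrics.csv` could look like the sketch below. The column names are hypothetical, and the demo writes into a temporary directory instead of the real run folder.

```python
import csv
import tempfile
from pathlib import Path

# Hypothetical column set; the actual notebook may log more fields.
METRIC_FIELDS = ["decision_step", "loss", "epsilon", "buffer_size"]


def append_metrics(csv_path, row):
    """Append one row, writing the header first if the file does not exist yet."""
    is_new = not csv_path.exists()
    with csv_path.open("a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=METRIC_FIELDS)
        if is_new:
            writer.writeheader()
        writer.writerow(row)


# Demo in a throwaway directory that mirrors the run layout.
run_dir = Path(tempfile.mkdtemp()) / "experiments" / "runs" / "dqn" / "demo_run"
run_dir.mkdir(parents=True)
metrics_path = run_dir / "metrics.csv"

append_metrics(metrics_path, {"decision_step": 5004, "loss": 0.5, "epsilon": 1.0, "buffer_size": 5004})
append_metrics(metrics_path, {"decision_step": 5008, "loss": 0.4, "epsilon": 1.0, "buffer_size": 5008})

with metrics_path.open(newline="") as f:
    rows = list(csv.DictReader(f))
print(rows)
```

Appending row by row means a crashed run still leaves a readable log up to the last completed write.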

### 4) Hyperparameters (first pass; we’ll freeze them in Part 3)
- Network: MLP 2×256, ReLU, output dimension = 2 Q-values
- Optimizer: Adam (lr = 1e-3), Huber loss, γ = 0.99
- Replay buffer: 100k, batch = 256
- Train every 4 decisions; target sync every 1,000 updates
- ε-greedy: 1.0 → 0.05 linearly over ~100k decisions

We’ll confirm/adjust these after a small smoke run.
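
Collecting these values in one frozen dataclass makes them easy to dump into `config.json` and hard to change by accident mid-run. The class and field names below are hypothetical; the values are the first-pass numbers listed above.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class DQNConfig:
    hidden_sizes: tuple = (256, 256)
    lr: float = 1e-3
    gamma: float = 0.99
    buffer_size: int = 100_000
    batch_size: int = 256
    train_every: int = 4             # decisions between gradient updates
    target_sync_every: int = 1_000   # gradient updates between hard syncs
    eps_start: float = 1.0
    eps_end: float = 0.05
    eps_decay_decisions: int = 100_000


cfg = DQNConfig()
config_json = json.dumps(asdict(cfg), indent=2)   # tuples are serialized as JSON lists
print(config_json)

# Derived quantity: environment decisions between two target syncs.
decisions_per_sync = cfg.train_every * cfg.target_sync_every
print("decisions per target sync:", decisions_per_sync)
```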


In [2]:
# Make the repo root (the folder that contains `src/`) importable
import sys
from pathlib import Path

# Start from the current notebook dir and walk upwards until we find a 'src' folder
here = Path.cwd().resolve()
repo_root = None
for parent in [here, *here.parents]:
    if (parent / "src").exists():
        repo_root = parent
        break

if repo_root is None:
    raise FileNotFoundError(
        f"Couldn't find a 'src' directory by walking up from {here}. "
        "Open this notebook from inside your repo or adjust the path below manually."
    )

# Put repo root at the front of sys.path
sys.path.insert(0, str(repo_root))
print("Added to sys.path:", repo_root)

# Optional: ensure packages are recognized (create empty __init__.py if missing)
for pkg in ["src", "src/env", "src/game"]:
    init = repo_root / pkg / "__init__.py"
    if not init.exists():
        try:
            init.touch()
            print("Created", init)
        except Exception as e:
            print("Note:", init, "does not exist and couldn't be created automatically:", e)

# Sanity import
from src.env.gg_env_v2 import GGEnv
print("Import OK:", GGEnv)


Added to sys.path: D:\Projects\GravityGuyML
Created D:\Projects\GravityGuyML\src\__init__.py
Import OK: <class 'src.env.gg_env_v2.GGEnv'>


In [3]:
import os, random, json, time
from pathlib import Path
import numpy as np

import gymnasium as gym
from src.env.gg_env_v2 import GGEnv

# ----- Paths (assuming notebook in experiments/notebooks/) -----
NOTEBOOK_DIR = Path.cwd()
EXP_DIR = NOTEBOOK_DIR.parent                   # experiments/
RUNS_BASE = EXP_DIR / "runs" / "dqn"            # experiments/runs/dqn/
RUNS_BASE.mkdir(parents=True, exist_ok=True)

# Unique run id (timestamp) and run directory
RUN_ID = time.strftime("%Y%m%d_%H%M%S")
RUN_DIR = RUNS_BASE / RUN_ID
(RUN_DIR / "eval" / "traces").mkdir(parents=True, exist_ok=True)
(RUN_DIR / "checkpoints").mkdir(parents=True, exist_ok=True)
print("Run dir:", RUN_DIR)

# ----- Core env settings -----
SIM_FPS = 60
FRAME_SKIP = 4
DECISION_HZ = SIM_FPS / FRAME_SKIP
TIME_LIMIT_S = 30
MAX_STEPS = int(TIME_LIMIT_S * DECISION_HZ)

# ----- Seed sets -----
EVAL_SEEDS = list(range(101, 121))               # held-out
TRAIN_SEED_RANGE = (1000, 2000)                  # sampled per episode

# ----- Global RNG seeds for reproducibility (you can change these) -----
GLOBAL_SEED = 42
np.random.seed(GLOBAL_SEED)
random.seed(GLOBAL_SEED)

print(f"Decision rate ≈ {DECISION_HZ:.1f} Hz; max_steps = {MAX_STEPS}")
print(f"Eval seeds ({len(EVAL_SEEDS)}): {EVAL_SEEDS[0]}..{EVAL_SEEDS[-1]}")
print(f"Train seed range: [{TRAIN_SEED_RANGE[0]}, {TRAIN_SEED_RANGE[1]})")

# ----- Small helper: env factory -----
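def make_env(seed: int, render_mode=None):
    """
    Create a GGEnv v2 with the agreed contract and reset it once with `seed`.
    Use render_mode=None for training and 'human' for debug runs.
    """
    env = GGEnv(render_mode=render_mode, frame_skip=FRAME_SKIP)
    # The env carries its own time limit; for a hard cap, wrap it with
    # gym.wrappers.TimeLimit(env, max_episode_steps=MAX_STEPS).
    # This reset seeds the env RNG; the initial observation is discarded, so
    # callers that need it should call env.reset(seed=seed) themselves.
    env.reset(seed=seed)
    return env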

# Sanity: instantiate once headless (no window), check spaces
_env = make_env(seed=EVAL_SEEDS[0], render_mode=None)
print("Obs space:", _env.observation_space)
print("Act space:", _env.action_space)
_env.close()

# Save a tiny config snapshot for the run folder now
config_snapshot = {
    "run_id": RUN_ID,
    "sim_fps": SIM_FPS,
    "frame_skip": FRAME_SKIP,
    "decision_hz": DECISION_HZ,
    "time_limit_s": TIME_LIMIT_S,
    "max_steps": MAX_STEPS,
    "eval_seeds": EVAL_SEEDS,
    "train_seed_range": TRAIN_SEED_RANGE,
    "global_seed": GLOBAL_SEED,
}
with (RUN_DIR / "config.json").open("w") as f:
    json.dump(config_snapshot, f, indent=2)
print("Wrote:", RUN_DIR / "config.json")


Run dir: d:\Projects\GravityGuyML\experiments\runs\dqn\20250911_015257
Decision rate ≈ 15.0 Hz; max_steps = 450
Eval seeds (20): 101..120
Train seed range: [1000, 2000)
Obs space: Box([ 0. -1. -1.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.], 1.0, (15,), float32)
Act space: Discrete(2)
Wrote: d:\Projects\GravityGuyML\experiments\runs\dqn\20250911_015257\config.json


## Part 3 — Network & optimizer choices

### Architecture (DQN head)
- **Input:** 15-dim observation (obs v2).
- **Output:** 2 Q-values $[Q(s,\text{NOOP}),\, Q(s,\text{FLIP})]$.
- **Backbone:** MLP with two hidden layers: **256 → 256**, **ReLU** activations.

**Why ReLU?**  
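Simple, fast, less prone to vanishing gradients than saturating activations, and works well on sparse/tabular-ish inputs (our probe features). PyTorch’s default `nn.Linear` initialization is a Kaiming-uniform variant, which should be adequate for a network this shallow; explicit He initialization remains an option if training stalls.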

### Loss & targets
- **TD target:** $y_t = r_t + \gamma (1 - \text{done}_t)\max_{a'} Q(s_{t+1}, a'; \theta^-)$
- **Loss:** **Huber (smooth L1)** on $(y_t - Q(s_t,a_t;\theta))$ for outlier robustness.
- **Target network:** copy online → target every **C = 1000** updates (hard sync).  
  *(We can switch to soft updates or Double DQN later if needed.)*

### Optimizer & stability
- **Optimizer:** Adam, **lr = 1e-3**.
- **Discount:** $\gamma = 0.99$.
- **Gradient clipping:** global-norm cap **10.0** to avoid rare spikes.
- **Replay buffer:** **100k** transitions; **batch = 256**; train every **4** decisions.
- **Warmup:** collect **5k** decisions with pure exploration before first gradient step.
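
The interaction between warmup, update cadence, and target sync is easy to miscount, so here is a small sketch that tallies gradient updates and hard syncs for a given number of decisions. It assumes updates happen on every 4th decision after the warmup and that the target is synced on every 1000th update; the exact off-by-one conventions are an assumption, not a description of the final training loop.

```python
def training_schedule(total_decisions, warmup=5_000, train_every=4, sync_every=1_000):
    """Count gradient updates and hard target syncs over a run."""
    updates = 0
    syncs = 0
    for step in range(1, total_decisions + 1):
        if step <= warmup:
            continue                      # warmup: only fill the buffer
        if step % train_every == 0:
            updates += 1                  # one gradient step on a sampled batch
            if updates % sync_every == 0:
                syncs += 1                # copy online -> target
    return updates, syncs


updates, syncs = training_schedule(total_decisions=100_000)
print(f"updates: {updates}, target syncs: {syncs}")
```

Under these assumptions, the first 100k decisions yield 23,750 gradient updates and 23 target syncs, so the target network changes only a few dozen times while ε is annealing.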

### Exploration (ε-greedy)
- Start **ε=1.0**, linearly anneal to **ε=0.05** over **100k** decisions:

$$
\varepsilon_t = \max\big(\varepsilon_{\min}, \varepsilon_{\max} - (\varepsilon_{\max}-\varepsilon_{\min}) \cdot t / T\big)
$$

where $T = 100{,}000$ decisions.
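
The schedule and the ε-greedy rule translate directly into code. This is a minimal sketch with hypothetical function names; Q-values are passed in as a plain list so the example does not depend on the network.

```python
import random

EPS_MAX = 1.0
EPS_MIN = 0.05
DECAY_DECISIONS = 100_000   # T in the formula above


def epsilon_at(t):
    """Linear anneal from EPS_MAX to EPS_MIN over DECAY_DECISIONS, then flat."""
    return max(EPS_MIN, EPS_MAX - (EPS_MAX - EPS_MIN) * t / DECAY_DECISIONS)


def select_action(q_values, epsilon, rng):
    """Random action with probability epsilon, otherwise the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])


rng = random.Random(0)
for t in (0, 50_000, 100_000, 200_000):
    print(t, round(epsilon_at(t), 3))

# With epsilon = 0 the choice is purely greedy: FLIP (index 1) has the higher Q-value.
greedy_action = select_action([0.1, 0.9], epsilon=0.0, rng=rng)
print("greedy action:", greedy_action)
```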

### What success should look like
- On held-out seeds, the trained DQN surpasses **Random** and the **Tiny Heuristic** in **distance** and **episode length**, and reduces both **early spikes** and **OOB** deaths.
