ARCbound Intelligence — Proximal Policy Optimization
A PyTorch reinforcement-learning pipeline for training real-time multiplayer game AI. Built for and proven in the Arcbound arena shooter.
Standard PPO struggles in real-time multiplayer settings where:
- Scalar rewards flatten distinct skills (survival vs combat vs strategy) into one noisy signal
- Shared optimizer state lets the critic's massive value loss drown out the actor's tiny policy gradient
- Credit assignment over ~10,000-tick trajectories is too diffuse to yield a usable learning signal
ABI-PPO fixes all three:
- Decoupled optimizers — actor and critic each get an independent Adam optimizer with its own learning rate. No momentum corruption between the two.
- Reward decomposition — rewards split into three orthogonal channels (survival / combat / strategy). Each channel's advantages are normalized independently (see the sketch after this list).
- Staged curriculum — three training phases (Movement → Combat → Strategy), each mastering one skill before unlocking the next. Collapses credit assignment from ~10,000 ticks to ~200 per phase.
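A minimal sketch of the first two fixes, using illustrative module shapes, names, and learning rates rather than the abi_ppo.py internals (how the normalized channels are combined afterward is also an assumption):

```python
import torch
import torch.nn as nn

# Stand-in networks; the real architecture is described further down.
actor = nn.Sequential(nn.Linear(270, 256), nn.ReLU(), nn.Linear(256, 9))
critic = nn.Sequential(nn.Linear(270, 256), nn.ReLU(), nn.Linear(256, 1))

# Decoupled optimizers: each network keeps its own Adam moment estimates
# and learning rate, so the critic's large value-loss gradients never
# corrupt the actor's momentum.
actor_opt = torch.optim.Adam(actor.parameters(), lr=3e-4)    # lr values are
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)  # illustrative

def normalize_per_channel(advantages: torch.Tensor) -> torch.Tensor:
    """Normalize each reward channel's advantages independently.

    advantages: (T, 3) tensor, one column per channel
    (survival / combat / strategy). Per-column normalization keeps a
    high-variance channel from drowning out the others when the columns
    are later combined into a single PPO advantage.
    """
    mean = advantages.mean(dim=0, keepdim=True)
    std = advantages.std(dim=0, keepdim=True)
    return (advantages - mean) / (std + 1e-8)

# Example: 512 timesteps, 3 channels -> per-channel z-scores, then combine
# (a plain sum here; the real combination rule is an assumption).
adv = normalize_per_channel(torch.randn(512, 3)).sum(dim=1)
```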
| | Legacy rl_train.py | ABI-PPO |
|---|---|---|
| Epochs to convergence | 800+ (did not converge) | 15 |
| Value-function explained variance | -3.3 (useless) | 0.49 (predictive) |
| Policy gradient strength | baseline | 3–4× stronger |
| One-phase value loss | stuck at ~945 | 420 → 0.9 |
```bash
# Train all three phases from scratch
python abi_ppo.py --epochs 150 --phase all

# Resume training from a checkpoint
python abi_ppo.py --resume models/checkpoint_v6.pt --epochs 100

# Resume and train only the combat phase
python abi_ppo.py --resume models/checkpoint_v6.pt --phase combat --epochs 50

# Show model / observation info
python abi_ppo.py --info
```

Observation: 270-dim float vector (RLState.ts v6 layout)
```
[0-7]     Self: x, y, vx, vy, hp, energy, rotation, alive
[8-10]    Ammo: missile, bouncy, grenade
[11-16]   Weapon economy (v6): 4 affordability flags + laserBudget + energyRegenMult
[17-34]   Flags (18): carrying, nearest-flag, pole/carrier/escort dirs, role one-hot
[35-76]   Enemies: 7 × (dx, dy, vx, vy, hp, alive)
[77]      Nearest-enemy angle error
[78-127]  Projectiles: 10 × (dx, dy, vx, vy, threat)
[128-151] Teammates: 4 × (dx, dy, vx, vy, hp, carrying)
[152-155] Game state: scoreDiff, nearestAllyDist, timeAlive, roundProgress
[156-269] Tile awareness: viewport grid + raycasts (114)
```
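For inspecting observations offline, named slices matching this layout might look like the following (the slice names are illustrative, not abi_ppo.py identifiers):

```python
import numpy as np

# Named slices over the v6 observation vector (names are illustrative).
SELF        = slice(0, 8)      # x, y, vx, vy, hp, energy, rotation, alive
AMMO        = slice(8, 11)     # missile, bouncy, grenade
ECONOMY     = slice(11, 17)    # 4 affordability flags + laserBudget + energyRegenMult
FLAGS       = slice(17, 35)    # flag/carrier/escort features + role one-hot
ENEMIES     = slice(35, 77)    # 7 enemies x (dx, dy, vx, vy, hp, alive)
AIM_ERROR   = 77               # nearest-enemy angle error
PROJECTILES = slice(78, 128)   # 10 projectiles x (dx, dy, vx, vy, threat)
TEAMMATES   = slice(128, 152)  # 4 teammates x (dx, dy, vx, vy, hp, carrying)
GAME_STATE  = slice(152, 156)  # scoreDiff, nearestAllyDist, timeAlive, roundProgress
TILES       = slice(156, 270)  # viewport grid + raycasts (114 values)

obs = np.zeros(270, dtype=np.float32)          # one observation
enemies = obs[ENEMIES].reshape(7, 6)           # one row per enemy slot
projectiles = obs[PROJECTILES].reshape(10, 5)  # one row per projectile slot
```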
Actor: 270 → 256 → 256 (LayerNorm + ReLU) → { move(9), fire(2), aim(128) }
Critic: 270 → 256 → 256 (LayerNorm + ReLU) → value scalar
Separate backbones, separate optimizers, separate learning rates.
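In code, those shapes look roughly like this (head sizes come from the spec above; the meaning of the 9 move actions is an assumption):

```python
import torch
import torch.nn as nn

def backbone() -> nn.Sequential:
    # 270 -> 256 -> 256, LayerNorm + ReLU after each linear layer.
    return nn.Sequential(
        nn.Linear(270, 256), nn.LayerNorm(256), nn.ReLU(),
        nn.Linear(256, 256), nn.LayerNorm(256), nn.ReLU(),
    )

class Actor(nn.Module):
    def __init__(self):
        super().__init__()
        self.body = backbone()
        self.move = nn.Linear(256, 9)    # 9 move actions (e.g. 8 dirs + idle; assumption)
        self.fire = nn.Linear(256, 2)    # fire / hold
        self.aim = nn.Linear(256, 128)   # 128 aim-offset bins

    def forward(self, obs: torch.Tensor):
        h = self.body(obs)
        return self.move(h), self.fire(h), self.aim(h)

class Critic(nn.Module):
    def __init__(self):
        super().__init__()
        self.body = backbone()
        self.value = nn.Linear(256, 1)

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        return self.value(self.body(obs)).squeeze(-1)

# Separate backbones; each network gets its own optimizer, per the design above.
actor, critic = Actor(), Critic()
```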
v6 aim semantics: the 128-bin aim head outputs a target-relative offset (±90° around a reference angle, 1.4°/bin). Reference = nearest-visible-enemy direction if one exists, else self.rotation. The policy learns corrections to rule-based lead aim rather than absolute aim direction — rlInfluence (0–0.45) becomes a meaningful precision dial.
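A sketch of one plausible decoding of the aim head, assuming bin centers evenly spaced across the ±90° span (decode_aim and its exact rounding are illustrative, not the shipped decoder):

```python
import math
from typing import Optional

NUM_BINS = 128
SPAN = math.pi  # 180 degrees total: ±90° around the reference, ~1.4°/bin

def decode_aim(bin_idx: int, nearest_enemy_angle: Optional[float],
               self_rotation: float) -> float:
    """Map an aim-head bin to an absolute angle in radians.

    Reference angle = nearest visible enemy direction if one exists,
    else the agent's own rotation. The bin selects an offset in
    [-90°, +90°] around that reference (bin centers; an assumption).
    """
    reference = self_rotation if nearest_enemy_angle is None else nearest_enemy_angle
    offset = (bin_idx + 0.5) / NUM_BINS * SPAN - SPAN / 2.0
    return reference + offset

# Example: bin 64 is a ~0.7° offset from the reference direction.
angle = decode_aim(64, None, 0.0)
```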
- abi_ppo.py — the ABI-PPO training system (1511 LOC, v6)
- rl_train.py — legacy PPO trainer, kept for comparison
- Python 3.9+
- NumPy
- PyTorch (CUDA wheel recommended — training runs on GPU)
See requirements.txt. Pick the PyTorch build that matches your CUDA toolchain from https://pytorch.org/get-started/locally/.
Extracted from the Arcbound game where it trains the in-game AI. v6 policy deployed 2026-04-19; 103 epochs in ~23 min produced a converged three-phase policy on a single GPU.
MIT