ABI-PPO

ARCbound Intelligence — Proximal Policy Optimization

A PyTorch reinforcement-learning pipeline for training real-time multiplayer game AI. Built for and proven in the Arcbound arena shooter.

What it solves

Standard PPO struggles in real-time multiplayer settings where:

  • Scalar rewards flatten distinct skills (survival vs combat vs strategy) into one noisy signal
  • Shared optimizer state lets the critic's massive value loss drown out the actor's tiny policy gradient
  • Credit assignment over ~10,000-tick trajectories gives the agent no chance to learn anything

ABI-PPO fixes all three:

  1. Decoupled optimizers — actor and critic each get an independent Adam optimizer with its own learning rate. No momentum corruption between the two.
  2. Reward decomposition — rewards split into three orthogonal channels (survival / combat / strategy). Each channel's advantages are normalized independently.
  3. Staged curriculum — three training phases (Movement → Combat → Strategy), each mastering one skill before unlocking the next. Collapses credit assignment from ~10,000 ticks to ~200 per phase.
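
A minimal sketch of ideas 1 and 2, assuming toy stand-in networks and a dummy (T, 3) advantage tensor; layer sizes, learning rates, and names here are illustrative, not the values used in abi_ppo.py:

import torch
import torch.nn as nn

# Toy stand-ins for the real actor/critic networks in abi_ppo.py.
actor  = nn.Sequential(nn.Linear(270, 256), nn.ReLU(), nn.Linear(256, 9))
critic = nn.Sequential(nn.Linear(270, 256), nn.ReLU(), nn.Linear(256, 1))

# 1. Decoupled optimizers: each network keeps its own Adam state and
#    learning rate, so the critic's large value-loss gradients never
#    corrupt the actor's momentum/variance estimates.
actor_opt  = torch.optim.Adam(actor.parameters(),  lr=3e-4)   # placeholder LR
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)   # placeholder LR

# 2. Reward decomposition: advantages are computed per channel
#    (survival / combat / strategy), normalized independently, then
#    combined into the single advantage used by the PPO ratio loss.
def normalize_per_channel(adv: torch.Tensor) -> torch.Tensor:
    # adv: (T, 3), one column per reward channel
    return (adv - adv.mean(dim=0, keepdim=True)) / (adv.std(dim=0, keepdim=True) + 1e-8)

channel_advantages = torch.randn(200, 3)                            # dummy per-channel GAE output
advantages = normalize_per_channel(channel_advantages).sum(dim=1)   # shape (T,)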

Measured results vs legacy PPO

Metric                     Legacy rl_train.py         ABI-PPO
Epochs to convergence      800+ (did not converge)    15
Value-function EV          -3.3 (useless)             0.49 (predictive)
Policy gradient strength   baseline                   3–4× stronger
One-phase value loss       stuck at ~945              420 → 0.9

Usage

python abi_ppo.py --epochs 150 --phase all                                      # full three-phase curriculum run
python abi_ppo.py --resume models/checkpoint_v6.pt --epochs 100                 # continue training from a checkpoint
python abi_ppo.py --resume models/checkpoint_v6.pt --phase combat --epochs 50   # train a single phase from a checkpoint
python abi_ppo.py --info                                                        # print checkpoint / configuration info

Architecture (v6 — 2026-04-19)

Observation: 270-dim float vector (RLState.ts v6 layout)
  [0-7]     Self: x, y, vx, vy, hp, energy, rotation, alive
  [8-10]    Ammo: missile, bouncy, grenade
  [11-16]   Weapon economy (v6): 4 affordability flags + laserBudget + energyRegenMult
  [17-34]   Flags (18): carrying, nearest-flag, pole/carrier/escort dirs, role one-hot
  [35-76]   Enemies: 7 × (dx, dy, vx, vy, hp, alive)
  [77]      Nearest-enemy angle error
  [78-127]  Projectiles: 10 × (dx, dy, vx, vy, threat)
  [128-151] Teammates: 4 × (dx, dy, vx, vy, hp, carrying)
  [152-155] Game state: scoreDiff, nearestAllyDist, timeAlive, roundProgress
  [156-269] Tile awareness: viewport grid + raycasts (114)
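
For reference, the layout above can be expressed as index slices (a sketch; the slice names are illustrative, the canonical layout lives in RLState.ts):

OBS_SLICES = {
    "self":        slice(0, 8),      # x, y, vx, vy, hp, energy, rotation, alive
    "ammo":        slice(8, 11),     # missile, bouncy, grenade
    "weapon_econ": slice(11, 17),    # 4 affordability flags + laserBudget + energyRegenMult
    "flags":       slice(17, 35),    # carrying, nearest-flag, pole/carrier/escort dirs, role one-hot
    "enemies":     slice(35, 77),    # 7 x (dx, dy, vx, vy, hp, alive)
    "aim_error":   slice(77, 78),    # nearest-enemy angle error
    "projectiles": slice(78, 128),   # 10 x (dx, dy, vx, vy, threat)
    "teammates":   slice(128, 152),  # 4 x (dx, dy, vx, vy, hp, carrying)
    "game_state":  slice(152, 156),  # scoreDiff, nearestAllyDist, timeAlive, roundProgress
    "tiles":       slice(156, 270),  # viewport grid + raycasts (114)
}
assert sum(s.stop - s.start for s in OBS_SLICES.values()) == 270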

Actor:  270 → 256 → 256 (LayerNorm + ReLU) → { move(9), fire(2), aim(128) }
Critic: 270 → 256 → 256 (LayerNorm + ReLU) → value scalar

Separate backbones, separate optimizers, separate learning rates.
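
A sketch of that structure in plain PyTorch, assuming the listing above is complete; initialization details and class names are guesses, not the code in abi_ppo.py:

import torch
import torch.nn as nn

class Backbone(nn.Module):
    """270 -> 256 -> 256 MLP with LayerNorm + ReLU, per the listing above."""
    def __init__(self, obs_dim: int = 270, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.LayerNorm(hidden), nn.ReLU(),
            nn.Linear(hidden, hidden),  nn.LayerNorm(hidden), nn.ReLU(),
        )
    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        return self.net(obs)

class Actor(nn.Module):
    """Own backbone, three action heads: move(9), fire(2), aim(128)."""
    def __init__(self):
        super().__init__()
        self.backbone = Backbone()
        self.move = nn.Linear(256, 9)
        self.fire = nn.Linear(256, 2)
        self.aim  = nn.Linear(256, 128)
    def forward(self, obs):
        h = self.backbone(obs)
        return self.move(h), self.fire(h), self.aim(h)

class Critic(nn.Module):
    """Own backbone, scalar value head."""
    def __init__(self):
        super().__init__()
        self.backbone = Backbone()
        self.value = nn.Linear(256, 1)
    def forward(self, obs):
        return self.value(self.backbone(obs)).squeeze(-1)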

v6 aim semantics: the 128-bin aim head outputs a target-relative offset (±90° around a reference angle, 1.4°/bin). Reference = nearest-visible-enemy direction if one exists, else self.rotation. The policy learns corrections to rule-based lead aim rather than absolute aim direction — rlInfluence (0–0.45) becomes a meaningful precision dial.
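
Under that scheme, decoding an aim bin back into a world angle could look roughly like this (bin width 180°/128 ≈ 1.4°; the exact bin centering and how rlInfluence blends the result with the rule-based lead aim are assumptions, not taken from abi_ppo.py):

import math

AIM_BINS = 128
AIM_SPAN_DEG = 180.0                  # ±90° around the reference angle
BIN_DEG = AIM_SPAN_DEG / AIM_BINS     # ≈ 1.4° per bin

def aim_bin_to_angle(bin_idx: int, reference_rad: float) -> float:
    # reference_rad: nearest-visible-enemy direction if one exists,
    # otherwise the agent's own rotation (self.rotation).
    # Center-of-bin convention (bin 0 ≈ -90°, bin 127 ≈ +90°) is a guess.
    offset_deg = (bin_idx + 0.5) * BIN_DEG - AIM_SPAN_DEG / 2.0
    return reference_rad + math.radians(offset_deg)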

Files

  • abi_ppo.py — the ABI-PPO training system (1511 LOC, v6)
  • rl_train.py — legacy PPO trainer, kept for comparison

Requirements

  • Python 3.9+
  • NumPy
  • PyTorch (CUDA wheel recommended — training runs on GPU)

See requirements.txt. Pick the PyTorch build that matches your CUDA toolchain from https://pytorch.org/get-started/locally/.

Status

Extracted from the Arcbound game where it trains the in-game AI. v6 policy deployed 2026-04-19; 103 epochs in ~23 min produced a converged three-phase policy on a single GPU.

License

MIT
