SotoAlt/silent

silent

A JEPA world model plays predator. It hunts the human player by listening to four directional microphones and watching its own audio echolocation — no game state, no positions, no ground truth. Audio in, predator move out.

Live demo: https://sotoalt.dev/experiments/silent.html
Model on HuggingFace: https://huggingface.co/sotoalt/silent

4-channel mel-spec (N/E/S/W ears) ──> ViT-Tiny encoder ──> 192-dim embedding
                                                                  │
            action (3-d: dx, dy, ping_amp) ──> ActionEncoder ──> conditioning
                                                                  │
                                                          AR causal predictor
                                                          (6 layers, AdaLN)
                                                                  │
                                                          predicted embedding
                                                                  │
                                              MLP state head (192 → 256² → 8)
                                                                  │
                                                  predator_xy, player_xy, ...

What this is

A Joint Embedding Predictive Architecture (JEPA) — Yann LeCun's "post-LLM" framework — trained to predict next-step audio embeddings on a custom predator-prey environment. The predator senses the world through four cardioid microphones (N/E/S/W). It hears the player breathing and running, plus the echoes from its own active sonar pings reflecting off walls. There is no map, no GPS, no oracle. The JEPA's predictor learns the audio dynamics; a CEM planner samples thrust + ping actions, rolls them through the predictor, and a state-head MLP decodes the rolled-out embedding into a position so the planner can score "did this trajectory get me closer to the player?". Then the predator commits the first action and re-plans.

The architecture is byte-for-byte LeWM (arxiv 2603.19312; Maes, Le Lidec, Scieur, LeCun, Balestriero) with one addition borrowed from DexWM (arxiv 2512.13644): a state head trained jointly with the predictor. The head's MSE gradient flows back through the projector and into the encoder, forcing both to preserve task-relevant spatial information that pure next-embedding-MSE training would otherwise discard.
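One way to write the joint objective described above, as a sketch rather than the exact training loss: the symbols $\hat{z}_{t+1}$ (predicted projected embedding), $z_{t+1}$ (encoder target), $h$ (state head), $s_{t+1}$ (ground-truth state) and the regularizer weight $\lambda_{\text{reg}}$ are notation assumed here. Only $\lambda_{\text{state}} = 10$ comes from this README, and whether the head reads predicted or encoded embeddings during training is also an assumption.

```latex
\mathcal{L}
  = \underbrace{\lVert \hat{z}_{t+1} - z_{t+1} \rVert_2^2}_{\text{next-embedding prediction}}
  + \lambda_{\text{reg}} \, \mathcal{L}_{\text{SIGReg}}
  + \lambda_{\text{state}} \, \underbrace{\lVert h(\hat{z}_{t+1}) - s_{t+1} \rVert_2^2}_{\text{state head}},
  \qquad \lambda_{\text{state}} = 10
```

Because the last term is differentiated with respect to the projector and encoder weights as well as the head, it penalizes embeddings from which position cannot be decoded.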

The shipping model is pure JEPA at inference — no privileged state, no simulator in the loop. The state-head decode happens only inside the planner's cost function on the predictor's own outputs.
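The plan-commit-replan loop described above can be sketched as a small cross-entropy method (CEM) routine. This is an illustration under stated assumptions, not the code in envs/silent_jepa_predator_v2.py: `cem_plan`, `predict`, `decode_state`, the population sizes, the [-1, 1] action range, and the toy stand-in functions are all hypothetical. The real planner would pass the AR predictor and the state head in their place.

```python
import numpy as np

def cem_plan(z0, predict, decode_state, horizon=5, pop=64, n_elite=8, iters=4, seed=0):
    """Return the first action of the CEM-refined action sequence."""
    rng = np.random.default_rng(seed)
    mean = np.zeros((horizon, 3))  # one (dx, dy, ping_amp) triple per step
    std = np.ones((horizon, 3))
    for _ in range(iters):
        # Sample candidate sequences; the [-1, 1] clip range is an assumption.
        noise = rng.standard_normal((pop, horizon, 3))
        actions = np.clip(mean + std * noise, -1.0, 1.0)
        costs = np.empty(pop)
        for i in range(pop):
            z = z0
            for t in range(horizon):
                z = predict(z, actions[i, t])  # latent rollout, no simulator
            state = decode_state(z)  # assumed layout: pred_xy, player_xy, ...
            costs[i] = np.linalg.norm(state[0:2] - state[2:4])
        # Refit the sampling distribution to the lowest-cost sequences.
        elite = actions[np.argsort(costs)[:n_elite]]
        mean = elite.mean(axis=0)
        std = elite.std(axis=0) + 1e-3
    return mean[0]

# Toy stand-ins (NOT the real model): the "latent" is simply
# [pred_x, pred_y, player_x, player_y] and the decoder is the identity.
def toy_predict(z, action):
    z = z.copy()
    z[0:2] += action[0:2]  # predator moves by (dx, dy); ping_amp is ignored
    return z

def toy_decode(z):
    return z

first_action = cem_plan(np.array([0.0, 0.0, 10.0, 0.0]), toy_predict, toy_decode)
```

With the toy player placed far along +x, the returned first action has a positive dx. In a real loop the predator would execute that action, encode the next audio frame, and call the planner again.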

Architecture

  • Encoder: ViT-Tiny, 4-channel input, trained from scratch.
  • ActionEncoder: Linear(frameskip × 3 → 192) so a horizon of 5 frames of (dx, dy, ping_amp) maps into the same 192-dim conditioning the predictor consumes per-step.
  • ARPredictor: 6-layer causal transformer, 16 heads, AdaLN-zero conditioning on action embedding.
  • ProjectorMLP: 192 → 2048 → 192, BatchNorm, applied to both encoder outputs and predictor outputs (the LeWM SIGReg projection space).
  • SIGReg: Sketched Isotropic Gaussian Regularization on projected embeddings. It replaces VICReg's three terms with a single isotropic-Gaussian regularizer.
  • StateHead: MLP(192 → 256 → 256 → 8), decoding predator_xy, player_xy, and 4 auxiliary state vars. Trained jointly with the predictor at lambda_state = 10. ~117K params.
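The ~117K figure for the state head can be checked by hand. The helper below is a hypothetical illustration that assumes plain fully connected layers with biases. It counts only linear layers, so the projector figure excludes its BatchNorm parameters.

```python
def mlp_param_count(widths):
    """Weights plus biases for a fully connected MLP with the given layer widths."""
    return sum(w_in * w_out + w_out for w_in, w_out in zip(widths, widths[1:]))

state_head = mlp_param_count([192, 256, 256, 8])      # StateHead: 192 -> 256 -> 256 -> 8
projector_linear = mlp_param_count([192, 2048, 192])  # ProjectorMLP linear layers only
action_encoder = mlp_param_count([5 * 3, 192])        # Linear(frameskip x 3 -> 192), frameskip = 5

print(state_head)        # 117256
print(projector_linear)  # 788672
print(action_encoder)    # 3072
```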

Total trainable: ~26M params (encoder + projectors + predictor + head). Runs at 10 Hz on a single shared vCPU.

Training pipeline

The recipe — validated end-to-end on this project — is the cheap-gates playbook for any new JEPA-as-agent game:

  1. Generate data with uniformly randomized starting positions. The first Silent dataset baked pred_start = (80, 80), player_start = (430, 430) into every episode and every downstream decoder memorized "player ≈ bottom-right" as a strong prior. Always randomize.
  2. Pure-LeWM smoke test, 5–10 epochs. Watch val_pred drop monotonically. This validates the pipeline. It does not validate that the encoder has learned what the planner needs.
  3. Preflight v2 probe (scripts/preflight_silent_v2.py). Linear ridge regression: projected embedding → ground-truth state. Use ≥ 900 samples — Silent's first preflight ran on 400 and reported R² = -0.17 (false negative). The honest probe at 908 samples showed predator_xy R² = 1.00, player_xy R² = 0.35, and the projector was throwing away 0.21 of the encoder's spatial information.
  4. Joint DexWM training at λ=10 for 10 epochs as a cheap validation gate. If player_xy R² jumps from ~0.5 → ~0.9, commit to the full 100-epoch run. If it stalls, kill cheaply and try a different recipe before burning the full training budget.
  5. Post-hoc head on uniform-sampled positions (train_silent_head_uniform-style). The planner's CEM cost decoder must be free of the spawn-region prior in the training distribution.
  6. Audit before shipping (scripts/audit_silent_v1.py, four gates: baseline comparison, closed-loop drift, causal sensitivity, horizon match). Exit code 0 = safe to deploy.
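The probe in step 3 can be approximated with a closed-form ridge regression scored on held-out samples. This is a minimal sketch on synthetic data, not the contents of scripts/preflight_silent_v2.py: the function name `ridge_probe_r2`, the `alpha` value, the 700/208 split, and the random embeddings are all assumptions made for illustration.

```python
import numpy as np

def ridge_probe_r2(Z_train, Y_train, Z_test, Y_test, alpha=1.0):
    """Fit a linear ridge probe Z -> Y and return per-target R^2 on held-out data."""
    mu_z, mu_y = Z_train.mean(axis=0), Y_train.mean(axis=0)
    Zc, Yc = Z_train - mu_z, Y_train - mu_y
    # Closed-form ridge solution: W = (Z^T Z + alpha * I)^-1 Z^T Y
    W = np.linalg.solve(Zc.T @ Zc + alpha * np.eye(Zc.shape[1]), Zc.T @ Yc)
    pred = (Z_test - mu_z) @ W + mu_y
    ss_res = ((Y_test - pred) ** 2).sum(axis=0)
    ss_tot = ((Y_test - Y_test.mean(axis=0)) ** 2).sum(axis=0)
    return 1.0 - ss_res / ss_tot

# Synthetic stand-in for 908 projected embeddings of width 192.
rng = np.random.default_rng(0)
Z = rng.standard_normal((908, 192))
linear_target = Z @ rng.standard_normal(192)  # fully recoverable from Z
noise_target = rng.standard_normal(908)       # carries no information about Z
Y = np.stack([linear_target, noise_target], axis=1)

r2 = ridge_probe_r2(Z[:700], Y[:700], Z[700:], Y[700:])
```

Here `r2[0]` comes out close to 1, while `r2[1]` is low and can be negative, because a 192-feature probe overfits a target it cannot explain. Fewer samples make that overfitting worse, which may be part of why a small-sample probe can report a misleading negative R².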

Quick start

pip install torch torchvision timm einops fastapi uvicorn websockets \
    librosa pymunk h5py pygame scipy

# Download checkpoints from HuggingFace
huggingface-cli download sotoalt/silent --local-dir checkpoints/

# Run the inference server
python -m world_model.infer_silent_env \
    --jepa-ckpt checkpoints/silent_v1_3e_ep030.pt \
    --jepa-head checkpoints/3e_ep030_head_uniform.pt \
    --host 0.0.0.0 --port 8801

# Open http://localhost:8801/ in your browser. Use WASD to move,
# space to voice (your only audio source the predator can lock onto).

Training from scratch

# 1. Generate training data (uniform-randomized starts)
python -m scripts.collect_silent_data \
    --episodes 12500 --steps 40 --frameskip 5 \
    --output data/silent_train.h5

# 2. Pure-LeWM smoke test (Phase A)
python -m scripts.train_silent_v1_lewm \
    --h5 data/silent_train.h5 \
    --output checkpoints/silent_v1_phaseA.pt \
    --epochs 10 --batch 64 --device cuda --lambda-state 0.0

# 3. Preflight probe — does the encoder preserve player_xy?
python -m scripts.preflight_silent_v2 \
    --checkpoint checkpoints/silent_v1_phaseA.pt --samples 1000

# 4. Joint DexWM 10-epoch validation gate
python -m scripts.train_silent_v1_lewm \
    --h5 data/silent_train.h5 \
    --output checkpoints/silent_v1_3b_ep010.pt \
    --epochs 10 --batch 64 --device cuda --lambda-state 10.0 \
    --resume-from checkpoints/silent_v1_phaseA.pt

# Re-probe — if player_xy R² > 0.7, commit to the full run.
python -m scripts.preflight_silent_v2 \
    --checkpoint checkpoints/silent_v1_3b_ep010.pt --samples 1000

# 5. Full 100-epoch joint run
python -m scripts.train_silent_v1_lewm \
    --h5 data/silent_train.h5 \
    --output checkpoints/silent_v1_3e_ep100.pt \
    --epochs 100 --batch 128 --device cuda --lambda-state 10.0 \
    --num-workers 4

# 6. Audit gate before shipping
python -m scripts.audit_silent_v1 \
    --checkpoint checkpoints/silent_v1_3e_ep030.pt \
    --deploy-horizon 5

Repository layout

world_model/
  infer_silent_env.py     FastAPI + WebSocket server, env loop, planner host

envs/
  silent.py               Pymunk environment (2D top-down, walls, exit zone)
  silent_jepa_predator_v2.py  CEM planner in latent space + state-head cost
  silent_predators.py     Scripted baselines (BeaconSeeker, etc.)
  silent_players.py       Scripted bot players for data generation
  silent_rooms.py         Stage geometry + spawn distributions

scripts/
  collect_silent_data.py            Uniform-randomized data generation
  train_silent_v1_lewm.py           LeWM + DexWM joint head trainer
  preflight_silent_v2.py            5-check encoder probe
  audit_silent_v1.py                4-gate ship audit
  abtest_silent_predators.py        Multi-predator A/B harness

client/silent/
  index.html, main.js, audio.js     Server-side JEPA demo UI

docs/
  JOURNAL.md              Curated research narrative — what worked, what didn't

Results

Encoder probe (linear ridge: projected embedding → state)

| Variant | predator_xy R² | player_xy R² | dist R² |
|---|---|---|---|
| Pure LeWM ep5 (best of A) | 1.00 | 0.55 | n/a |
| Pure LeWM ep60 (worst of A) | 0.99 | 0.35 | 0.65 |
| Joint DexWM ep10 | 1.00 | 0.90 | 0.71 |

Ten epochs of joint training raised player_xy R² on projected embeddings to 0.90, up from 0.35–0.55 under pure LeWM. Pure-LeWM training was silently degrading the information the planner needed: player_xy R² fell from 0.55 at epoch 5 to 0.35 at epoch 60.

Federated training (mention only)

The hosted demo at sotoalt.dev/experiments/silent.html includes an opt-in browser-side federated training experiment — same JEPA, same predictor, your gradients. That's a separate research thread and lives in its own infrastructure; this repo is the JEPA WM core.

Model

Available on HuggingFace at https://huggingface.co/sotoalt/silent:

  • silent_v1_3e_ep030.pt — shipping checkpoint (joint DexWM, λ=10)
  • 3e_ep030_head_uniform.pt — post-hoc state head for planner CEM cost

Related work

  • LeWM (Maes, Le Lidec, Scieur, LeCun, Balestriero, 2026 — arxiv 2603.19312). The architecture this repo is faithful to.
  • DexWM (arxiv 2512.13644). The joint state-head technique.
  • V-JEPA 2-AC (FAIR, 2025). Uncontrolled-thing prediction + controlled-thing search, the planning frame Silent uses.
  • lepong (sibling repo, https://github.com/SotoAlt/lepong). 13M CNN-JEPA on 2D Pong, the predecessor that proved JEPA-as-policy worked at all on a controlled-paddle game.
  • relay (sibling repo, https://github.com/SotoAlt/relay). v9 of this same recipe applied to a turn-based 2D pushing game.

License

MIT — see LICENSE.
