HPP 10K multimodal modeling (CGM, DEXA, retina, metabolites): loaders in `data/`, LeJEPA-style pretraining in `training/`.
```
uv venv && source .venv/bin/activate
uv pip install -r requirements.txt
```

- This machine: `python -m training --max-steps 10` (uses CPU if no CUDA).
- GPU (mcluster11): `python scripts/train_elysium_mcluster11.py --max-steps 10`; needs passwordless `ssh mcluster11` and the same NFS project path on genie and the cluster.
Slurm metadata and logs land in `elysium_runs/` (gitignored). Optional: `scripts/train_elysium_local.py` runs Elysium without SSH (no cluster GPU).
Training writes `embeddings/<run_id>/checkpoint.pt` and `embeddings.h5`; add `--run-eval` to run Ridge probes against the tabular baseline table (prints `EVAL_SCORE=...`). Or run the eval standalone: `python -m eval.run_eval --embeddings embeddings/<run_id>/embeddings.h5`.
Autonomous ML research runs through Claude Code (`.claude/agents/`) and OpenCode (`opencode.json` + `.opencode/agents/`).
Supervisor (talk to this one): your single conversation partner for fleet status and priorities.
```
claude --agent supervisor
# or:
opencode run --agent supervisor
```

It reads `training/results.tsv` and the shared diary, then answers questions like "how are agents doing?" and writes your ideas as directives that workers pick up on their next cycle.
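A sketch of what "reads `training/results.tsv`" could amount to (the column names `agent` and `score` are assumptions; check the actual file header):

```python
import csv

def summarize_results(tsv_path: str) -> dict[str, float]:
    """Best score per agent from a results TSV (assumed columns: agent, score)."""
    best: dict[str, float] = {}
    with open(tsv_path, newline="") as f:
        for row in csv.DictReader(f, delimiter="\t"):
            agent, score = row["agent"], float(row["score"])
            if agent not in best or score > best[agent]:
                best[agent] = score
    return best
```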
Research workers run one per worktree on `autoresearch/agent-<suffix>` branches. Start a loop:
```
# Create worktree (once):
bash scripts/agent_worktree.sh a
# Start loop (or ask the supervisor to do it):
bash scripts/supervisor_spawn_agent.sh a
# Gemini CLI instead of OpenCode: AGENT_CLI=gemini bash scripts/supervisor_spawn_agent.sh a
# logs → agent_logs/agent-a.log
```

Training from a shell: use bash (genie's default shell is often tcsh, which rejects `$(…)` and prints `Illegal variable name.`). `training/run_experiment.sh` always runs on the mcluster11 GPU via Elysium.
- Slurm `Requested node configuration is not available`: override the GPU spec with `export ELYSIUM_TRAINING_GPU=1` or e.g. `L40S:1`.
- Remote job `uv: command not found`: GPU nodes don't have uv; `run_training_elysium.py` uses `.venv/bin/python` on the cluster. Ensure `uv sync` has created `.venv` at the NFS project path.
- Quiet Elysium: `run_training_elysium.py` is quiet by default (`RUN_TRAINING_ELYSIUM_QUIET=1`). Use `--human` for live output.
- Why is `--max-steps 1` still minutes? Queue time and CUDA init dominate; one step itself is cheap.
- Agent logs: `agent_logs/agent-<suffix>.log`; remote training logs: `elysium_runs/collection_*/0000/stdout.log`.
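The `uv: command not found` fix assumes `.venv` already exists at the shared project path; a tiny preflight check (the function and messages are illustrative, and the path you pass is a placeholder for your NFS project path) can catch a missing venv before a job is submitted:

```shell
# Fails fast if the shared venv's python is missing at the given project path.
check_venv() {
  if [ -x "$1/.venv/bin/python" ]; then
    echo "venv ok: $1/.venv/bin/python"
  else
    echo "missing .venv at $1; run 'uv sync' there first" >&2
    return 1
  fi
}
```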
Run tests:

```
python -m pytest tests/ -q
```

Building H5s from the cohort requires `MULTIMODAL_BENCHMARK_CSV` (see `env.example`); processed H5 paths are defined in `data/constants.py`. Large files stay out of git (`*.h5` is gitignored).
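A loader might resolve the cohort CSV along these lines (a minimal sketch; the helper name and error messages are illustrative, not the actual `data/` API — only the env-var name comes from `env.example`):

```python
import os

def benchmark_csv_path() -> str:
    """Resolve the cohort CSV from MULTIMODAL_BENCHMARK_CSV (see env.example)."""
    path = os.environ.get("MULTIMODAL_BENCHMARK_CSV")
    if not path:
        raise RuntimeError(
            "MULTIMODAL_BENCHMARK_CSV is not set; copy env.example and fill it in."
        )
    if not os.path.exists(path):
        raise FileNotFoundError(f"MULTIMODAL_BENCHMARK_CSV points at a missing file: {path}")
    return path
```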