
multimodal_research

HPP 10K multimodal modeling (CGM, DEXA, retina, metabolites): loaders in data/, LeJEPA-style pretraining in training/.

Setup

uv venv && source .venv/bin/activate
uv pip install -r requirements.txt

Training

  • This machine: python -m training --max-steps 10 (falls back to CPU if no CUDA).
  • GPU (mcluster11): python scripts/train_elysium_mcluster11.py --max-steps 10 (requires passwordless ssh mcluster11 and the project checked out at the same NFS path on both genie and the cluster).

Slurm metadata and logs land in elysium_runs/ (gitignored). Optional: scripts/train_elysium_local.py runs Elysium without SSH (no cluster GPU).
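
For a quick smoke test without the cluster (assuming train_elysium_local.py accepts the same --max-steps cap as python -m training):

# CPU-only Elysium run, no SSH (flag assumed to match python -m training):
python scripts/train_elysium_local.py --max-steps 10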

Training writes embeddings/<run_id>/checkpoint.pt and embeddings.h5; add --run-eval to run Ridge probes against the tabular baseline table (prints EVAL_SCORE=...). To run evaluation standalone: python -m eval.run_eval --embeddings embeddings/<run_id>/embeddings.h5.
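
One way to capture the score line (train.log is just a scratch filename):

# Train briefly with the probe evaluation attached, keeping a copy of stdout:
python -m training --max-steps 10 --run-eval | tee train.log
# Pull out the printed score:
grep EVAL_SCORE= train.log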

Autoresearch Agents

Autonomous ML research runs through Claude Code (.claude/agents/) and OpenCode (opencode.json + .opencode/agents/).

Supervisor (talk to this one): your single conversation partner for fleet status and priorities.

claude --agent supervisor
# or:
opencode run --agent supervisor

It reads training/results.tsv and the shared diary, then answers questions like "how are agents doing?" and writes your ideas as directives that workers pick up on their next cycle.
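
A one-shot question works too; exact flag handling may vary by CLI version, so treat these as sketches:

# Fleet summary without an interactive session (Claude Code print mode):
claude --agent supervisor -p "how are agents doing?"
# OpenCode equivalent (message passed as the positional argument):
opencode run --agent supervisor "how are agents doing?"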

Research workers run one per git worktree, each on an autoresearch/agent-<suffix> branch. To start a loop:

# Create worktree (once):
bash scripts/agent_worktree.sh a

# Start loop (or ask the supervisor to do it):
bash scripts/supervisor_spawn_agent.sh a
# Gemini CLI instead of OpenCode: AGENT_CLI=gemini bash scripts/supervisor_spawn_agent.sh a
# logs → agent_logs/agent-a.log
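
Several workers can be brought up with a small loop over suffixes (assuming the worktrees don't exist yet; the suffixes are arbitrary):

# Create and spawn three workers in one go:
for s in a b c; do
  bash scripts/agent_worktree.sh "$s"
  bash scripts/supervisor_spawn_agent.sh "$s"
done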

Training from a shell: use bash (genie's default shell is often tcsh, which rejects $(…) and prints Illegal variable name). training/run_experiment.sh always runs on the mcluster11 GPU via Elysium.
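
A sketch of an invocation, assuming the script forwards flags like --max-steps to training:

# Run under bash explicitly so tcsh never parses the script's $(…):
bash training/run_experiment.sh --max-steps 10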

If Slurm fails with Requested node configuration is not available, override the GPU spec with export ELYSIUM_TRAINING_GPU=1 or a named type, e.g. L40S:1.
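
Both forms, ready to copy:

# Request any single GPU:
export ELYSIUM_TRAINING_GPU=1
# Or pin a type Slurm actually advertises (L40S, from the example above):
export ELYSIUM_TRAINING_GPU=L40S:1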

Remote job fails with uv: command not found: GPU nodes don't have uv; run_training_elysium.py uses .venv/bin/python on the cluster instead. Make sure uv sync has created .venv at the NFS project path.
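
From a machine that does have uv (e.g. genie), sync once at the shared checkout; the path below is a placeholder:

# Create .venv where the GPU nodes will look for it (path is hypothetical):
cd /nfs/path/to/project
uv sync
ls .venv/bin/python   # should exist after the sync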

Quiet Elysium: run_training_elysium.py is quiet by default (RUN_TRAINING_ELYSIUM_QUIET=1). Use --human for live output.
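
A sketch, assuming the launcher lives under scripts/ like the others and accepts the same step cap:

# Stream Slurm output live instead of the quiet default (path/flags assumed):
python scripts/run_training_elysium.py --human --max-steps 10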

Why does --max-steps 1 still take minutes? Queue time and CUDA init dominate; the step itself is cheap.

Agent logs: agent_logs/agent-<suffix>.log; remote training → elysium_runs/collection_*/0000/stdout.log.
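
To follow a worker live, or peek at the newest remote stdout:

# Follow one agent's loop output as it writes:
tail -f agent_logs/agent-a.log
# Newest remote training stdout (assumes the collection_* layout above):
tail -n 50 "$(ls -dt elysium_runs/collection_*/0000/stdout.log | head -1)"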

Tests

python -m pytest tests/ -q

Data

Building H5s from the cohort requires MULTIMODAL_BENCHMARK_CSV (see env.example); processed H5 paths are defined in data/constants.py. Large files stay out of git (*.h5 is gitignored).
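
For example (the CSV path below is a placeholder):

# Point the builders at the cohort CSV before constructing H5s:
export MULTIMODAL_BENCHMARK_CSV=/path/to/hpp10k_cohort.csv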
