Add GPU node exclusion, runtime cap, and unified memory for inference #42

Merged
DimaMolod merged 4 commits into main from feature/gpu-exclude-unified-memory
May 21, 2026

@DimaMolod
Collaborator

Summary

Makes structure_inference robust to two real cluster failure modes seen in
production: jobs landing on GPUs the prediction container can't use, and large
complexes exhausting GPU VRAM.

Three groups of new (backward-compatible) config options, all read by the
existing structure_inference rule:

  • slurm_exclude_nodes — comma-separated node list passed straight to
    sbatch --exclude (e.g. "gpu50,gpu51,gpu52,gpu53"). Use it to skip nodes
    whose GPU the container can't compile for — e.g. a CUDA compute capability
    newer than the bundled ptxas, which fails with ptxas too old /
    UNIMPLEMENTED. --constraint/--gres are managed by the Slurm executor
    plugin (and forbidden inside slurm_extra), so --exclude is the supported
    way to drop a few incompatible nodes while keeping the rest of the partition.
    It is a Slurm resource, not rule code, so it does not trigger reruns of
    already-finished predictions.

  • structure_inference_max_runtime (default 10080 = 7 days) — caps the
    per-job wall time. Wall time scales with the retry attempt (1440 * attempt
    minutes); without a cap, enough retries request more time than the partition
    MaxTime and SLURM rejects the job with Requested time limit is invalid.

  • structure_inference_unified_memory (default true) +
    structure_inference_xla_mem_fraction (default 3.2) — export the
    DeepMind-recommended
    JAX/XLA unified-memory env so inference spills GPU VRAM into host RAM instead
    of OOM-ing (RESOURCE_EXHAUSTED / bfc_allocator ran out of memory):

    export TF_FORCE_UNIFIED_MEMORY=true
    export XLA_PYTHON_CLIENT_PREALLOCATE=false
    export XLA_CLIENT_MEM_FRACTION=3.2

    Set structure_inference_unified_memory: false to fail fast instead. When
    disabled the env string is empty, so the rule's shell is byte-identical to
    before.
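The enable/disable behaviour of the unified-memory exports can be sketched in Python, the language of the Snakefile. This is a minimal illustration, not the rule's actual code: the helper name `unified_memory_env` and its parameters are hypothetical, and only the three variable names and their values come from the description above.

```python
def unified_memory_env(enabled: bool, mem_fraction: float = 3.2) -> str:
    """Return a shell prefix with the unified-memory exports, or "" when disabled.

    Hypothetical helper for illustration; the real Snakefile may be structured
    differently.
    """
    if not enabled:
        # An empty prefix leaves the rendered shell command unchanged.
        return ""
    exports = [
        "export TF_FORCE_UNIFIED_MEMORY=true",
        "export XLA_PYTHON_CLIENT_PREALLOCATE=false",
        f"export XLA_CLIENT_MEM_FRACTION={mem_fraction}",
    ]
    # Join with "; " so the prefix can be prepended to the inference command.
    return "; ".join(exports) + "; "
```

Returning an empty string, rather than branching inside the shell command, is one simple way to keep the disabled case identical to the pre-change shell.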

Notes

  • Pair slurm_exclude_nodes with structure_inference_gpu_model to both
    restrict to a model and exclude bad nodes.
  • Unified memory slows inference down when it actually spills, and the spilled
    data lives in host RAM, so give the job enough host RAM via
    structure_inference_ram_bytes.

Testing

  • snakemake --list and --dry-run parse cleanly with the modified Snakefile.
  • --dry-run -p confirms the unified-memory exports appear in the
    structure_inference shell, and that structure_inference_unified_memory: false removes them.
  • Verified slurm_exclude_nodes produces slurm_extra=--exclude=<nodes> on the
    inference jobs and SLURM normalizes it to the expected ExcNodeList.
  • Unified-memory env confirmed against AlphaFold 3's own docs/performance.md.
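
The mapping from slurm_exclude_nodes to slurm_extra that the third test checks could look roughly like the sketch below. The function name `exclude_to_slurm_extra` and the whitespace handling are assumptions made for illustration; only the `--exclude=<nodes>` output format is taken from this PR.

```python
def exclude_to_slurm_extra(exclude_nodes: str) -> str:
    """Turn a comma-separated node list into an sbatch --exclude argument.

    Hypothetical helper; returns "" when no nodes are configured so that
    nothing is added to slurm_extra.
    """
    # Drop whitespace and empty entries so "gpu50, gpu51," is still accepted.
    nodes = [n.strip() for n in exclude_nodes.split(",") if n.strip()]
    if not nodes:
        return ""
    return "--exclude=" + ",".join(nodes)
```

With the example value from the summary, `exclude_to_slurm_extra("gpu50,gpu51,gpu52,gpu53")` yields `--exclude=gpu50,gpu51,gpu52,gpu53`.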

🤖 Generated with Claude Code

DimaMolod and others added 2 commits May 20, 2026 15:57
structure_inference jobs can now avoid unsuitable GPUs and survive large
complexes that exceed VRAM:

- slurm_exclude_nodes: comma-separated nodes passed to sbatch --exclude, to
  skip GPUs the prediction container cannot use (e.g. a CUDA compute
  capability newer than the bundled ptxas, which fails "ptxas too old" /
  UNIMPLEMENTED). It is a Slurm resource, not rule code, so it does not
  trigger reruns of finished predictions. --constraint/--gres are managed by
  the plugin (and forbidden in slurm_extra), so --exclude is the supported way
  to drop specific nodes.
- structure_inference_max_runtime: cap wall time so retry scaling
  (1440 * attempt minutes) cannot exceed the partition MaxTime and produce
  "Requested time limit is invalid" sbatch failures. Default 10080 (7 days).
- structure_inference_unified_memory (default true): export the
  DeepMind-recommended JAX/XLA unified-memory env (TF_FORCE_UNIFIED_MEMORY,
  XLA_PYTHON_CLIENT_PREALLOCATE=false, XLA_CLIENT_MEM_FRACTION) so inference
  spills GPU VRAM into host RAM instead of OOM-ing. Toggle off to fail fast;
  tune via structure_inference_xla_mem_fraction.

Documented in config/config.yaml and README. The unified-memory env is empty
when disabled, so the rule code is unchanged in that case.
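
The runtime cap amounts to a single `min()` over the retry-scaled wall time. A minimal sketch, assuming a hypothetical function name and using the 1440-minute base and 10080-minute default quoted above:

```python
def inference_runtime_minutes(attempt: int, max_runtime: int = 10080) -> int:
    """Wall time in minutes for a given retry attempt, capped at max_runtime.

    Hypothetical helper; the real rule may compute this inline in its
    resources section.
    """
    base_minutes = 1440  # one day per attempt, as described in the commit
    return min(base_minutes * attempt, max_runtime)
```

With the default cap, attempts 1 through 7 request 1 to 7 days and every later attempt stays at 10080 minutes, so the request never grows past the configured limit. In a Snakefile, a function like this would typically be wired in as a resources callable that receives `attempt`.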

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…ble details

- config.yaml: note that *_ram_bytes values are in MB (used as SLURM --mem),
  not bytes — 64000 = ~64 GB.
- README: keep the SLURM section minimal; move the GPU-exclude/runtime-cap and
  unified-memory explanations into <details> blocks so non-expert users are not
  overwhelmed.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@DimaMolod
Collaborator Author

Empirical validation: unified memory resolves a real OOM

Tested end-to-end on a pair that genuinely OOM'd before: O00194+Q9ULV0
(2066 tokens, ~25 GiB) had failed with RESOURCE_EXHAUSTED on 24 GB RTX 3090s.
Re-ran it forced back onto a 3090 (structure_inference_gpu_model: "3090")
with structure_inference_unified_memory: true, exercising this branch's
Snakefile shell (not an external env):

  • Node: gpu35 (RTX 3090, 24 GB), jobid 54337038
  • Result: completed successfully — model.cif, ranking_scores.csv,
    completed_fold.txt written, no RESOURCE_EXHAUSTED.
  • Timing: inference 16:38→16:58 (~20 min vs the usual few minutes) — the
    expected host-paging slowdown when actually spilling, i.e. the documented
    speed/robustness trade-off.

Setting structure_inference_unified_memory: false removes the exports (verified
via --dry-run -p), so the behaviour is opt-out.

…U VRAM)

structure_inference_xla_mem_fraction now defaults to "auto": instead of a fixed
3.2, the per-job ceiling is computed in the inference shell as
(allocated host RAM, the SLURM --mem value) / (physical GPU VRAM read via
nvidia-smi once the job lands on a node). This keeps XLA's unified-memory ceiling
within the SLURM allocation so it cannot oversubscribe host RAM past what the job
requested and get OOM-killed -- the EMBL run_AF_multimer.sh convention.

The GPU VRAM is only known at run time and the SLURM executor exposes no per-job
env hook (it passes the submit env through --export=ALL, which is the same for
every job), so the value must be computed in the job shell; doing it inside the
container also avoids apptainer env-crossing. Falls back to 3.2 if nvidia-smi is
unavailable. XLA_PYTHON_CLIENT_PREALLOCATE=false is kept (without it XLA grabs a
large VRAM slice up front, defeating on-demand spill). Pin a number to override.

Also exports XLA_PYTHON_CLIENT_MEM_FRACTION alongside XLA_CLIENT_MEM_FRACTION so
the JAX/AF2 path honors the same ceiling.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@DimaMolod
Collaborator Author

Follow-up commit: structure_inference_xla_mem_fraction now defaults to auto (was a fixed 3.2).

When auto, the fraction is computed per job in the inference shell as (allocated host RAM, the SLURM --mem value) / (physical GPU VRAM read via nvidia-smi once the job lands). This keeps XLA's unified-memory ceiling within the SLURM allocation so it can't oversubscribe host RAM past what the job requested and get OOM-killed — the EMBL run_AF_multimer.sh convention. Falls back to 3.2 if nvidia-smi is unavailable; pin a number to override.
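
The arithmetic behind auto can be illustrated in Python. The PR performs this division in the job shell with nvidia-smi and awk; the function below is only a sketch of the same calculation, with a hypothetical name, and it assumes both inputs are expressed in the same megabyte unit.

```python
from typing import Optional


def auto_xla_mem_fraction(
    host_ram_mb: float,
    gpu_vram_mb: Optional[float],
    fallback: float = 3.2,
) -> float:
    """Ratio of allocated host RAM to physical GPU VRAM, or the fallback.

    Hypothetical helper: gpu_vram_mb is None (or non-positive) when the VRAM
    could not be read, for example because nvidia-smi is unavailable.
    """
    if gpu_vram_mb is None or gpu_vram_mb <= 0:
        return fallback
    return round(host_ram_mb / gpu_vram_mb, 2)
```

As an illustrative case (the numbers are assumptions, not measurements from this PR), a job with a 64000 MB allocation on a GPU reporting 24576 MB of VRAM gets a fraction of about 2.6, below the fixed 3.2, so XLA's ceiling stays inside what the job requested.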

Notes:

  • Computed in the job shell because the value depends on which GPU the job lands on (only known at run time) and the SLURM executor exposes no per-job env hook (it passes the submit env through --export=ALL, same for every job). Doing it inside the container also avoids apptainer env-crossing.
  • XLA_PYTHON_CLIENT_PREALLOCATE=false is kept (without it XLA grabs a large VRAM slice up front, defeating on-demand spill). Also now exports XLA_PYTHON_CLIENT_MEM_FRACTION alongside XLA_CLIENT_MEM_FRACTION so the JAX/AF2 path honors the same ceiling.
  • Verified via snakemake -n -p: the rendered shell shows auto → nvidia-smi + awk division, a pinned number → direct export, and disabled → no exports.

Keep only the code-reader rationale (why the fraction is resolved at run time,
why --exclude doesn't trigger reruns); the user-facing "what each option does"
already lives in config.yaml and README.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@DimaMolod DimaMolod merged commit bd962ae into main May 21, 2026
@DimaMolod DimaMolod deleted the feature/gpu-exclude-unified-memory branch May 21, 2026 08:21