Add GPU node exclusion, runtime cap, and unified memory for inference #42
structure_inference jobs can now avoid unsuitable GPUs and survive large complexes that exceed VRAM:

- slurm_exclude_nodes: comma-separated nodes passed to sbatch --exclude, to skip GPUs the prediction container cannot use (e.g. a CUDA compute capability newer than the bundled ptxas, which fails with "ptxas too old" / UNIMPLEMENTED). It is a Slurm resource, not rule code, so it does not trigger reruns of finished predictions. --constraint/--gres are managed by the plugin (and forbidden in slurm_extra), so --exclude is the supported way to drop specific nodes.
- structure_inference_max_runtime: caps wall time so retry scaling (1440 * attempt minutes) cannot exceed the partition MaxTime and produce "Requested time limit is invalid" sbatch failures. Default 10080 (7 days).
- structure_inference_unified_memory (default true): exports the DeepMind-recommended JAX/XLA unified-memory env (TF_FORCE_UNIFIED_MEMORY, XLA_PYTHON_CLIENT_PREALLOCATE=false, XLA_CLIENT_MEM_FRACTION) so inference spills GPU VRAM into host RAM instead of OOM-ing. Toggle it off to fail fast; tune via structure_inference_xla_mem_fraction.

Documented in config/config.yaml and README. The unified-memory env is empty when disabled, so the rule code is unchanged in that case.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
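The runtime cap reduces to taking the minimum of the retry-scaled request and the configured ceiling. A minimal Python sketch (Snakefiles are Python), assuming a hypothetical helper `capped_runtime` that returns minutes; the Snakefile's actual code is not shown in this PR page and may be wired differently:

```python
def capped_runtime(attempt: int, max_runtime: int = 10080, per_attempt: int = 1440) -> int:
    """Hypothetical helper: wall time in minutes for a given Snakemake retry attempt.

    The request grows linearly (per_attempt * attempt) but is clamped to
    max_runtime, so retries never ask sbatch for more than the partition MaxTime.
    """
    if attempt < 1:
        raise ValueError("Snakemake retry attempts are numbered from 1")
    return min(per_attempt * attempt, max_runtime)


if __name__ == "__main__":
    # With the 7-day default, attempts 1-7 scale normally and attempt 8+ is clamped.
    for attempt in range(1, 10):
        print(f"attempt {attempt}: {capped_runtime(attempt)} min")
```

In a rule, a helper like this would typically be passed as a resource callable, e.g. `runtime=lambda wildcards, attempt: capped_runtime(attempt, config["structure_inference_max_runtime"])`; Snakemake supplies `attempt` to resource callables that accept it.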
…ble details

- config.yaml: note that *_ram_bytes values are in MB (used as SLURM --mem), not bytes; 64000 = ~64 GB.
- README: keep the SLURM section minimal; move the GPU-exclude/runtime-cap and unified-memory explanations into <details> blocks so non-expert users are not overwhelmed.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Empirical validation: unified memory resolves a real OOM

Tested end-to-end on a pair that genuinely OOM'd before:
Toggling |
…U VRAM)

structure_inference_xla_mem_fraction now defaults to "auto": instead of a fixed 3.2, the per-job ceiling is computed in the inference shell as (allocated host RAM, i.e. the SLURM --mem value) / (physical GPU VRAM, read via nvidia-smi once the job lands on a node). This keeps XLA's unified-memory ceiling within the SLURM allocation, so it cannot oversubscribe host RAM beyond what the job requested and get OOM-killed (the EMBL run_AF_multimer.sh convention).

The GPU VRAM is only known at run time, and the SLURM executor exposes no per-job env hook (it passes the submit env through --export=ALL, which is the same for every job), so the value must be computed in the job shell; doing it inside the container also avoids apptainer env-crossing. Falls back to 3.2 if nvidia-smi is unavailable.

XLA_PYTHON_CLIENT_PREALLOCATE=false is kept (without it, XLA grabs a large VRAM slice up front, defeating on-demand spill). Pin a number to override.

Also exports XLA_PYTHON_CLIENT_MEM_FRACTION alongside XLA_CLIENT_MEM_FRACTION so the JAX/AF2 path honors the same ceiling.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
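The "auto" resolution can be sketched in Python as below. This only illustrates the ratio described above; it is not the PR's job-shell code, which runs inside the container. The helper names `gpu_vram_mib` and `resolve_mem_fraction`, the two-decimal rounding, and reading only the first GPU listed by `nvidia-smi` are all assumptions:

```python
import shutil
import subprocess
from typing import Optional, Union

FALLBACK_FRACTION = "3.2"  # fallback named in the commit message when nvidia-smi is unavailable


def gpu_vram_mib() -> Optional[int]:
    """Hypothetical helper: total memory of the first visible GPU in MiB, or None on failure."""
    if shutil.which("nvidia-smi") is None:
        return None
    try:
        result = subprocess.run(
            ["nvidia-smi", "--query-gpu=memory.total", "--format=csv,noheader,nounits"],
            capture_output=True, text=True, check=True, timeout=30,
        )
        first_line = result.stdout.strip().splitlines()[0]
        return int(float(first_line))
    except (OSError, subprocess.SubprocessError, IndexError, ValueError):
        return None


def resolve_mem_fraction(setting: Union[str, float], host_mem_mb: int, vram_mib: Optional[int]) -> str:
    """Hypothetical helper: resolve structure_inference_xla_mem_fraction to the exported value.

    A pinned number is passed through unchanged; "auto" becomes host RAM / GPU VRAM,
    treating SLURM --mem (MB) and nvidia-smi (MiB) as the same unit, as the ratio does.
    """
    if str(setting).strip().lower() != "auto":
        return str(setting)
    if not vram_mib:
        return FALLBACK_FRACTION
    return f"{host_mem_mb / vram_mib:.2f}"


if __name__ == "__main__":
    # e.g. a job submitted with --mem=64000; VRAM is read on whichever node it lands on.
    fraction = resolve_mem_fraction("auto", 64000, gpu_vram_mib())
    print(f"XLA_CLIENT_MEM_FRACTION={fraction}")
    print(f"XLA_PYTHON_CLIENT_MEM_FRACTION={fraction}")
```

Because the ceiling is host RAM divided by VRAM, XLA's total addressable memory (fraction times VRAM) comes out equal to the SLURM allocation, which is the property the commit relies on.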
Follow-up commit: When Notes:
Keep only the code-reader rationale (why the fraction is resolved at run time, why --exclude doesn't trigger reruns); the user-facing "what each option does" already lives in config.yaml and README.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Summary
Makes `structure_inference` robust to two real cluster failure modes seen in production: jobs landing on GPUs the prediction container can't use, and large complexes exhausting GPU VRAM.
Three new (backward-compatible) config options, all read by the existing `structure_inference` rule:

- `slurm_exclude_nodes`: comma-separated node list passed straight to `sbatch --exclude` (e.g. `"gpu50,gpu51,gpu52,gpu53"`). Use it to skip nodes whose GPU the container can't compile for, e.g. a CUDA compute capability newer than the bundled `ptxas`, which fails with `ptxas too old` / `UNIMPLEMENTED`. `--constraint`/`--gres` are managed by the Slurm executor plugin (and forbidden inside `slurm_extra`), so `--exclude` is the supported way to drop a few incompatible nodes while keeping the rest of the partition. It is a Slurm resource, not rule code, so it does not trigger reruns of already-finished predictions.
- `structure_inference_max_runtime` (default `10080` = 7 days): caps the per-job wall time. Wall time scales with the retry attempt (`1440 * attempt` minutes); without a cap, enough retries request more time than the partition `MaxTime`, and SLURM rejects the job with `Requested time limit is invalid`.
- `structure_inference_unified_memory` (default `true`) + `structure_inference_xla_mem_fraction` (default `3.2`): export the DeepMind-recommended JAX/XLA unified-memory env so inference spills GPU VRAM into host RAM instead of OOM-ing (`RESOURCE_EXHAUSTED` / `bfc_allocator ran out of memory`). Set `structure_inference_unified_memory: false` to fail fast instead. When disabled, the env string is empty, so the rule's shell is byte-identical to before.
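Two of the options above boil down to small string builders: an env prefix that collapses to `""` when disabled, and an `--exclude` flag for `slurm_extra`. A minimal Python sketch, assuming hypothetical helpers `unified_memory_exports` and `exclude_slurm_extra`; the value `true` for `TF_FORCE_UNIFIED_MEMORY` is also an assumption, since the PR names only the variable:

```python
import shlex
from typing import Optional


def unified_memory_exports(enabled: bool, mem_fraction: str = "3.2") -> str:
    """Hypothetical helper: shell prefix exporting the JAX/XLA unified-memory env.

    Returns "" when disabled, so a shell template that prepends it renders
    exactly as it did before these options existed.
    """
    if not enabled:
        return ""
    env = {
        "TF_FORCE_UNIFIED_MEMORY": "true",  # assumed value; the PR names only the variable
        "XLA_PYTHON_CLIENT_PREALLOCATE": "false",  # no up-front VRAM grab
        "XLA_CLIENT_MEM_FRACTION": mem_fraction,
        "XLA_PYTHON_CLIENT_MEM_FRACTION": mem_fraction,  # same ceiling for the JAX/AF2 path
    }
    return "".join(f"export {name}={shlex.quote(value)}; " for name, value in env.items())


def exclude_slurm_extra(nodes: Optional[str]) -> str:
    """Hypothetical helper: turn slurm_exclude_nodes into an sbatch --exclude flag.

    Whitespace around node names is dropped; an empty or missing value yields ""
    so no flag is added at all.
    """
    if not nodes:
        return ""
    cleaned = ",".join(n.strip() for n in nodes.split(",") if n.strip())
    return f"--exclude={cleaned}" if cleaned else ""


if __name__ == "__main__":
    print(repr(unified_memory_exports(False)))
    print(unified_memory_exports(True))
    print(exclude_slurm_extra("gpu50,gpu51,gpu52,gpu53"))
```

Keeping the flag in `resources: slurm_extra=...` rather than in the shell command is what keeps node exclusion out of the rule code, matching the no-rerun behaviour described above.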
Notes
- `slurm_exclude_nodes` with `structure_inference_gpu_model` to both restrict to a model and exclude bad nodes.
- host RAM via `structure_inference_ram_bytes`.

Testing
- `snakemake --list` and `--dry-run` parse cleanly with the modified Snakefile.
- `--dry-run -p` confirms the unified-memory exports appear in the `structure_inference` shell, and that `structure_inference_unified_memory: false` removes them.
- `slurm_exclude_nodes` produces `slurm_extra=--exclude=<nodes>` on the inference jobs, and SLURM normalizes it to the expected `ExcNodeList`.
- `docs/performance.md`.

🤖 Generated with Claude Code