fix(ludus-renderer): scope JIT cache by GPU arch to prevent shared-FS poisoning #216
Draft
wlewNV wants to merge 1 commit into
… poisoning
``torch.utils.cpp_extension.load`` keys its build cache off a hash of
the source files but NOT off the GPU arch / compile flags. On any
shared-filesystem host (NFS ``$HOME`` on Slurm / HPC clusters being
the typical case) a single user's ``~/.cache/torch_extensions/`` can
end up holding a ``.so`` that was built on one node (e.g. a CPU-only
login or a Hopper dev box) and then silently get reused on a later
job that lands on a node with a different GPU generation (Blackwell
GB200 / GB300). The mismatched binary doesn't carry the right SASS
for the new arch, so kernels fall back to PTX-JIT (or a degraded
path) at every launch and steady-state cudaludus throughput collapses
by ~4x.
That matches the exact pattern Junchen and Ruilong reported on the
HSG GB200/GB300 cluster (omnidreams demo capped at ~20 fps) while
the same code is fast on a locally-installed GB300 box -- where the
cache is necessarily built fresh on the running GPU and so always
ends up with the right SASS.
Two changes here:
1) ``TORCH_CUDA_ARCH_LIST`` is now pinned to the live device's
compute capability (e.g. ``"10.0"`` on GB200/GB300) instead of
being cleared. Clearing it relied on ``torch.cuda.get_arch_list()``
which varies with the installed PyTorch wheel and which may
silently lack SM 10.0 / 10.0a / 12.0 entries on older wheels.
Pinning is deterministic and forces nvcc to gencode for the right
thing.
2) The cache slot's directory name now includes the arch tag
(``ludus_renderer_plugin`` -> ``ludus_renderer_plugin_sm_100`` on
Blackwell). This makes the cache safe to share across hosts: each
arch gets its own slot, so a Hopper build can never be picked up
on a Blackwell job and vice versa. Existing users incur a
one-time ~30-60s rebuild on first load after this lands; the
legacy ``ludus_renderer_plugin`` directory is left in place (we
warn rather than auto-delete since other tooling may reference
it).
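The two changes above can be sketched roughly as follows. This is a minimal illustration, not the code in this patch: the helper names (``arch_tag``, ``arch_list``, ``scoped_name``, ``pin_arch``) are hypothetical, and the capability tuple is assumed to come from ``torch.cuda.get_device_capability()``.

```python
import os
from typing import MutableMapping, Tuple

Capability = Tuple[int, int]  # (major, minor), e.g. (10, 0)


def arch_tag(capability: Capability) -> str:
    """Hypothetical helper: (10, 0) -> 'sm_100', (9, 0) -> 'sm_90'."""
    major, minor = capability
    return f"sm_{major}{minor}"


def arch_list(capability: Capability) -> str:
    """Value to pin TORCH_CUDA_ARCH_LIST to: (10, 0) -> '10.0'."""
    major, minor = capability
    return f"{major}.{minor}"


def scoped_name(base: str, capability: Capability) -> str:
    """Arch-scoped extension name, so each GPU arch gets its own cache slot."""
    return f"{base}_{arch_tag(capability)}"


def pin_arch(capability: Capability,
             env: MutableMapping[str, str] = os.environ) -> str:
    """Pin TORCH_CUDA_ARCH_LIST in `env` instead of clearing it."""
    value = arch_list(capability)
    env["TORCH_CUDA_ARCH_LIST"] = value
    return value


# Assumed usage with PyTorch installed (not executed here):
#   cap = torch.cuda.get_device_capability()
#   pin_arch(cap)
#   torch.utils.cpp_extension.load(
#       name=scoped_name("ludus_renderer_plugin", cap), sources=[...])
```

Because the extension name doubles as the cache directory name, changing the name is enough to give each arch its own slot without touching the cache root.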
The startup log line also now records the detected
``device_capability``, the resolved ``arch_tag``, the value of
``TORCH_CUDA_ARCH_LIST`` actually used, and whether the load was a
cache hit or a fresh build. With that line in hand it's possible to
diagnose any future "fine on host A, slow on host B" reports from
the log alone (compare the values across the two hosts; if anything
differs the cache is the suspect).
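As a rough illustration of the cross-host comparison described above, the sketch below collects the four logged fields into a dict and reports which ones differ between two hosts. The function names and the exact ``key=value`` layout are assumptions for illustration; only the field names and the ``Loading Ludus renderer extension`` prefix come from this patch.

```python
from typing import Dict, Optional, Tuple


def load_fields(capability: Optional[Tuple[int, int]], arch_tag: str,
                arch_list: str, cache_hit: bool) -> Dict[str, str]:
    """Collect the values the startup log line is meant to record."""
    cap = "none" if capability is None else f"{capability[0]}.{capability[1]}"
    return {
        "device_capability": cap,
        "arch": arch_tag,
        "TORCH_CUDA_ARCH_LIST": arch_list,
        "cache": "hit" if cache_hit else "miss (will build)",
    }


def format_line(fields: Dict[str, str]) -> str:
    """Render the fields as one line (layout is an assumption)."""
    body = " ".join(f"{key}={value}" for key, value in fields.items())
    return f"Loading Ludus renderer extension: {body}"


def diff_hosts(a: Dict[str, str], b: Dict[str, str]) -> Dict[str, Tuple[str, str]]:
    """Return only the fields whose values differ between host A and host B."""
    keys = sorted(set(a) | set(b))
    return {k: (a.get(k, ""), b.get(k, "")) for k in keys if a.get(k) != b.get(k)}


host_a = load_fields((9, 0), "sm_90", "9.0", cache_hit=True)
host_b = load_fields((10, 0), "sm_100", "10.0", cache_hit=True)
print(format_line(host_b))
print(diff_hosts(host_a, host_b))
```

An empty diff means both hosts resolved the same arch and cache state, which points away from the cache as the cause.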
Fallback behaviour:
* No CUDA available at import time -> ``arch_tag = "nocuda"`` and
``TORCH_CUDA_ARCH_LIST`` is cleared. Preserves the prior code
path for CPU-only login nodes that just import the package.
* ``get_device_capability`` raises -> ``arch_tag = "auto"`` and
``TORCH_CUDA_ARCH_LIST`` is cleared, with a warning. Same
behaviour as before this patch, just labelled so the operator can
see we punted.
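The fallback rules above amount to a small decision function. The sketch below is one possible shape, assuming the caller passes in ``torch.cuda.is_available`` and ``torch.cuda.get_device_capability``; the name ``resolve_arch`` and the warning text are hypothetical, and an empty string stands for "``TORCH_CUDA_ARCH_LIST`` cleared".

```python
import logging
from typing import Callable, Tuple

logger = logging.getLogger(__name__)


def resolve_arch(is_available: Callable[[], bool],
                 get_capability: Callable[[], Tuple[int, int]]) -> Tuple[str, str]:
    """Return (arch_tag, TORCH_CUDA_ARCH_LIST value); '' means cleared."""
    if not is_available():
        # CPU-only login node that only imports the package.
        return "nocuda", ""
    try:
        major, minor = get_capability()
    except Exception as exc:  # capability detection failed: punt, but say so
        logger.warning("Could not detect device capability (%s); "
                       "leaving TORCH_CUDA_ARCH_LIST cleared", exc)
        return "auto", ""
    return f"sm_{major}{minor}", f"{major}.{minor}"


# Assumed usage with PyTorch installed (not executed here):
#   tag, arch = resolve_arch(torch.cuda.is_available,
#                            torch.cuda.get_device_capability)
```

Passing the two probes in as callables keeps the decision logic testable on machines without CUDA or PyTorch.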
Tested locally on a non-cluster host: the extension builds and loads
fine, and the log line shows the new fields. Cluster validation (the
symptom this patch targets) needs someone with Slurm access to run on
the affected GB200/GB300 nodes:
1. Apply this patch on top of current ``main``.
2. Run a representative omnidreams-webrtc rollout on the affected
cluster.
3. Look for the ``Loading Ludus renderer extension ...`` log line.
``cache=miss (will build)`` on the first run is expected (new
arch-scoped slot); ``arch=sm_100`` confirms the right SASS is
being produced.
4. Confirm cudaludus per-chunk cost drops back to the
locally-observed value and the demo holds 30 fps.
If FPS does NOT improve, the cache hypothesis is wrong and the
new log line tells us what to chase next (e.g. driver too old, MIG
slice, container CUDA toolkit version) without needing to rebuild
the patch.
Collaborator
Profiling result with this PR applied:
jatentaki
reviewed
Jun 2, 2026
    logger.warning(
        "Found legacy Ludus extension cache at {}; this directory "
        "was produced by a pre-arch-pinning build and may contain "
        "a .so compiled for a different GPU. Safe to delete -- "
Collaborator
I wouldn't say it's safe to delete, precisely because another operator may be depending on it :) I'd remove this legacy-detection branch for simplicity and assume the few extra files won't hurt anyone and will eventually get cleaned up when somebody clears the cache for whatever reason (disk cleanup, etc.).
Summary
Renames the Ludus JIT cache slot from ``ludus_renderer_plugin`` to ``ludus_renderer_plugin_sm_<cap>`` (e.g. ``_sm_100`` on Blackwell GB200/GB300) and pins ``TORCH_CUDA_ARCH_LIST`` to the live device's compute capability instead of clearing it. Targets the ~4x cudaludus slowdown @junchen / @ruilong reported on the HSG GB200/GB300 Slurm cluster, while the same code runs fast on a locally-installed GB300.

Root cause
``torch.utils.cpp_extension.load(...)`` keys its build cache off a hash of the source files but NOT off the GPU arch / compile flags. On any shared-filesystem host (NFS ``$HOME`` on Slurm) a single user's ``~/.cache/torch_extensions/`` can hold a ``.so`` built on one node (CPU login, Hopper dev box, ...) and silently reuse it on a later job that lands on a different GPU generation. The mismatched binary lacks the right SASS, kernels fall back to PTX-JIT at every launch, and cudaludus throughput collapses by ~4x. That is exactly the omnidreams-webrtc demo's "20 fps on cluster, 30 fps locally" pattern. Local installs work because the cache is necessarily built fresh on the running GPU.

Changes (``ludus_renderer/_ops/_plugin.py``)
* Pin ``TORCH_CUDA_ARCH_LIST`` to the device's compute capability (e.g. ``"10.0"`` on GB200/GB300) instead of clearing it. Deterministic gencode; no dependency on ``torch.cuda.get_arch_list()`` (which varies per PyTorch wheel and can silently lack SM 10.0+ entries).
* Scope the cache slot by arch, so a Hopper-built ``.so`` can't be reused on Blackwell, and vice versa. One-time ~30–60 s rebuild on first load after this lands.
* The startup log line now records ``device_capability``, ``arch_tag``, the value of ``TORCH_CUDA_ARCH_LIST`` actually used, and cache hit/miss.

Fallback: no CUDA at import → ``arch_tag="nocuda"``; capability detect fails → ``arch_tag="auto"``. Both behave exactly as before the patch.

Out of scope: the same ``TORCH_CUDA_ARCH_LIST=""`` pattern in ``ludus_renderer/nvjpeg.py`` line 71 (separate extension, separate cache; left for a follow-up).