fix(ludus-renderer): scope JIT cache by GPU arch to prevent shared-FS poisoning by wlewNV · Pull Request #216 · NVIDIA/flashdreams

wlewNV · 2026-05-29T16:08:28Z

Summary

Renames the Ludus JIT cache slot from ludus_renderer_plugin to ludus_renderer_plugin_sm_<cap> (e.g. _sm_100 on Blackwell GB200/GB300) and pins TORCH_CUDA_ARCH_LIST to the live device's compute capability instead of clearing it. Targets the ~4x cudaludus slowdown @junchen / @ruilong reported on the HSG GB200/GB300 Slurm cluster, while the same code runs fast on a locally-installed GB300.

Root cause

torch.utils.cpp_extension.load(...) keys its build cache off a hash of the source files but NOT off the GPU arch / compile flags. On any shared-filesystem host (NFS $HOME on Slurm) a single user's ~/.cache/torch_extensions/ can hold a .so built on one node (CPU login, Hopper dev box, ...) and silently reuse it on a later job that lands on a different GPU generation. The mismatched binary lacks the right SASS, kernels fall back to PTX-JIT at every launch, and cudaludus throughput collapses by ~4x — exactly the omnidreams-webrtc demo's "20 fps on cluster, 30 fps locally" pattern. Local installs work because the cache is necessarily built fresh on the running GPU.

Changes (`ludus_renderer/_ops/_plugin.py`)

- Pin TORCH_CUDA_ARCH_LIST to the device's compute capability (e.g. "10.0" on GB200/GB300) instead of clearing it. Deterministic gencode; no dependency on torch.cuda.get_arch_list() (which varies per PyTorch wheel and can silently lack SM 10.0+ entries).
- Arch-tag the cache slot name so a Hopper .so can't be reused on Blackwell, and vice versa. One-time ~30–60 s rebuild on first load after this lands.
- Richer startup log line records device_capability, arch_tag, the value of TORCH_CUDA_ARCH_LIST actually used, and cache hit/miss.
- Warn (don't delete) if the legacy cache slot still exists, so operators can clean it up.

Fallback: no CUDA at import → arch_tag="nocuda"; capability detect fails → arch_tag="auto". Both behave exactly as before the patch.
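
The arch-tag and pinning logic described above can be sketched roughly as follows. This is a minimal illustration, not the code in `ludus_renderer/_ops/_plugin.py`: the helper names `detect_capability`, `resolve_arch` and `slot_name` are hypothetical, and appending the tag to the slot name in the `nocuda` / `auto` fallback cases is an assumption. Only `torch.cuda.is_available()` and `torch.cuda.get_device_capability()` are real PyTorch calls.

```python
import os
from typing import Optional, Tuple

BASE_NAME = "ludus_renderer_plugin"  # legacy slot name, from the PR summary


def detect_capability() -> Tuple[Optional[Tuple[int, int]], bool]:
    """Return (capability, cuda_available); capability is None if unknown."""
    try:
        import torch
    except ImportError:
        return None, False
    if not torch.cuda.is_available():
        return None, False
    try:
        return torch.cuda.get_device_capability(), True
    except Exception:
        # Capability detection failed even though CUDA is present.
        return None, True


def resolve_arch(capability: Optional[Tuple[int, int]],
                 cuda_available: bool) -> Tuple[str, str]:
    """Map a capability to (arch_tag, TORCH_CUDA_ARCH_LIST value)."""
    if not cuda_available:
        return "nocuda", ""   # cleared, as before the patch
    if capability is None:
        return "auto", ""     # cleared, as before the patch
    major, minor = capability
    return f"sm_{major}{minor}", f"{major}.{minor}"


def slot_name(arch_tag: str) -> str:
    """Arch-scoped cache slot name (fallback naming is an assumption)."""
    return f"{BASE_NAME}_{arch_tag}"


if __name__ == "__main__":
    capability, has_cuda = detect_capability()
    arch_tag, arch_list = resolve_arch(capability, has_cuda)
    os.environ["TORCH_CUDA_ARCH_LIST"] = arch_list
    # slot_name(arch_tag) would then be passed as `name=` to
    # torch.utils.cpp_extension.load(...), giving each arch its own slot.
    print(slot_name(arch_tag), repr(arch_list))
```

Keeping `resolve_arch` free of any `torch` import makes the mapping easy to check on a CPU-only login node, which is where the fallback paths matter.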

Out of scope: the same TORCH_CUDA_ARCH_LIST="" pattern in ludus_renderer/nvjpeg.py line 71 — separate extension, separate cache, follow-up.

… poisoning

``torch.utils.cpp_extension.load`` keys its build cache off a hash of the source files but NOT off the GPU arch / compile flags. On any shared-filesystem host (NFS ``$HOME`` on Slurm / HPC clusters being the typical case) a single user's ``~/.cache/torch_extensions/`` can end up holding a ``.so`` that was built on one node (e.g. a CPU-only login or a Hopper dev box) and then silently get reused on a later job that lands on a node with a different GPU generation (Blackwell GB200 / GB300). The mismatched binary doesn't carry the right SASS for the new arch, so kernels fall back to PTX-JIT (or a degraded path) at every launch and steady-state cudaludus throughput collapses by ~4x.

That matches the exact pattern Junchen and Ruilong reported on the HSG GB200/GB300 cluster (omnidreams demo capped at ~20 fps) while the same code is fast on a locally-installed GB300 box -- where the cache is necessarily built fresh on the running GPU and so always ends up with the right SASS.

Two changes here:

1) ``TORCH_CUDA_ARCH_LIST`` is now pinned to the live device's compute capability (e.g. ``"10.0"`` on GB200/GB300) instead of being cleared. Clearing it relied on ``torch.cuda.get_arch_list()`` which varies with the installed PyTorch wheel and which may silently lack SM 10.0 / 10.0a / 12.0 entries on older wheels. Pinning is deterministic and forces nvcc to gencode for the right thing.

2) The cache slot's directory name now includes the arch tag (``ludus_renderer_plugin`` -> ``ludus_renderer_plugin_sm_100`` on Blackwell). This makes the cache safe to share across hosts: each arch gets its own slot, so a Hopper build can never be picked up on a Blackwell job and vice versa. Existing users incur a one-time ~30-60s rebuild on first load after this lands; the legacy ``ludus_renderer_plugin`` directory is left in place (we warn rather than auto-delete since other tooling may reference it).

The startup log line also now records the detected ``device_capability``, the resolved ``arch_tag``, the value of ``TORCH_CUDA_ARCH_LIST`` actually used, and whether the load was a cache hit or a fresh build. With that line in hand it's possible to diagnose any future "fine on host A, slow on host B" reports from the log alone (compare the values across the two hosts; if anything differs the cache is the suspect).

Fallback behaviour:

* No CUDA available at import time -> ``arch_tag = "nocuda"`` and ``TORCH_CUDA_ARCH_LIST`` is cleared. Preserves the prior code path for CPU-only login nodes that just import the package.
* ``get_device_capability`` raises -> ``arch_tag = "auto"`` and ``TORCH_CUDA_ARCH_LIST`` is cleared, with a warning. Same behaviour as before this patch, just labelled so the operator can see we punted.

Tested locally on a non-cluster host: builds + loads fine, log line shows the new fields.

Cluster validation (the symptom this patch targets) needs someone with SLURM access to run on the affected GB200/GB300 nodes:

1. Apply this patch on top of current ``main``.
2. Run a representative omnidreams-webrtc rollout on the affected cluster.
3. Look for the ``Loading Ludus renderer extension ...`` log line. ``cache=miss (will build)`` on the first run is expected (new arch-scoped slot); ``arch=sm_100`` confirms the right SASS is being produced.
4. Confirm cudaludus per-chunk cost drops back to the locally-observed value and the demo holds 30 fps.

If FPS does NOT improve, the cache hypothesis is wrong and the new log line tells us what to chase next (e.g. driver too old, MIG slice, container CUDA toolkit version) without needing to rebuild the patch.
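
The "compare the values across the two hosts" step could be scripted along these lines. This is only a sketch: it assumes the log line carries its fields as whitespace-separated `key=value` tokens (the commit message shows `cache=miss` and `arch=sm_100` in that shape, but the full format is not given), and the two sample lines, the field values in them, and the helper names are hypothetical.

```python
import re
from typing import Dict, Optional, Tuple


def parse_fields(line: str) -> Dict[str, str]:
    """Extract key=value tokens from a log line (assumed format)."""
    return dict(re.findall(r"(\w+)=(\S+)", line))


def diff_fields(a: Dict[str, str],
                b: Dict[str, str]) -> Dict[str, Tuple[Optional[str], Optional[str]]]:
    """Return only the fields whose values differ between two hosts."""
    keys = set(a) | set(b)
    return {k: (a.get(k), b.get(k)) for k in keys if a.get(k) != b.get(k)}


# Hypothetical log lines, made up for illustration only.
host_a = "device_capability=10.0 arch=sm_100 TORCH_CUDA_ARCH_LIST=10.0 cache=hit"
host_b = "device_capability=9.0 arch=sm_90 TORCH_CUDA_ARCH_LIST=9.0 cache=hit"

if __name__ == "__main__":
    for key, (va, vb) in sorted(diff_fields(parse_fields(host_a),
                                            parse_fields(host_b)).items()):
        print(f"{key}: host A = {va}, host B = {vb}")
```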

copy-pr-bot · 2026-05-29T16:08:32Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

JunchenLiu77 · 2026-05-29T17:50:15Z

Profiling result with this PR applied:
gen_ms: avg 402.9 ms / 8 frames (~19.9 FPS)
pipeline_total_ms: avg 179.0 ms / 8 frames
wrapper_render_condition_ms: avg 219.4 ms
renderer_ctx_render_ms: avg 218.3 ms
ctx_render_plugin_cuda_ms_sum: avg 214.7 ms
enqueue_ms: avg 26.5 ms
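
For reference, the arithmetic behind these figures can be reproduced as below. It is a small sketch that assumes `gen_ms` and `ctx_render_plugin_cuda_ms_sum` both cover the same 8-frame chunk (the second is not stated explicitly in the profile); the function names are illustrative.

```python
FRAMES_PER_CHUNK = 8          # from the "/ 8 frames" annotation
GEN_MS = 402.9                # gen_ms average
PLUGIN_CUDA_MS = 214.7        # ctx_render_plugin_cuda_ms_sum average


def fps(frames: int, elapsed_ms: float) -> float:
    """Frames per second for a chunk that took elapsed_ms milliseconds."""
    return frames / (elapsed_ms / 1000.0)


def budget_ms(frames: int, target_fps: float) -> float:
    """Time budget in ms for a chunk to sustain target_fps."""
    return frames / target_fps * 1000.0


if __name__ == "__main__":
    print(f"measured: {fps(FRAMES_PER_CHUNK, GEN_MS):.1f} FPS")
    print(f"budget for 30 FPS: {budget_ms(FRAMES_PER_CHUNK, 30):.1f} ms per chunk")
    print(f"plugin CUDA share of gen_ms: {PLUGIN_CUDA_MS / GEN_MS:.0%}")
```

Under those assumptions the chunk would need to finish in about 266.7 ms to hold 30 FPS, against the 402.9 ms measured here.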

jatentaki · 2026-06-02T13:47:34Z

+            logger.warning(
+                "Found legacy Ludus extension cache at {}; this directory "
+                "was produced by a pre-arch-pinning build and may contain "
+                "a .so compiled for a different GPU. Safe to delete -- "


I wouldn't say it's safe to delete, precisely because another operator may be depending on it :) I'd remove this legacy detection code branch for simplicity and assume the few extra files won't hurt anyone and will eventually get cleaned up when somebody clears the cache for whatever reason (disk cleanup, etc).

wlewNV self-assigned this May 29, 2026

jatentaki reviewed Jun 2, 2026

wlewNV wants to merge 1 commit into main from dev/wlew/ludus-plugin-arch-cache

3 participants
