
fix(ludus-renderer): scope JIT cache by GPU arch to prevent shared-FS poisoning #216

Draft
wlewNV wants to merge 1 commit into main from dev/wlew/ludus-plugin-arch-cache


wlewNV commented May 29, 2026

Summary

Renames the Ludus JIT cache slot from ludus_renderer_plugin to ludus_renderer_plugin_sm_<cap> (e.g. _sm_100 on Blackwell GB200/GB300) and pins TORCH_CUDA_ARCH_LIST to the live device's compute capability instead of clearing it. Targets the ~4x cudaludus slowdown that @junchen / @ruilong reported on the HSG GB200/GB300 Slurm cluster; the same code runs fast on a locally installed GB300.

Root cause

torch.utils.cpp_extension.load(...) keys its build cache off a hash of the source files but NOT off the GPU arch / compile flags. On any shared-filesystem host (NFS $HOME on Slurm) a single user's ~/.cache/torch_extensions/ can hold a .so built on one node (CPU login, Hopper dev box, ...) and silently reuse it on a later job that lands on a different GPU generation. The mismatched binary lacks the right SASS, kernels fall back to PTX-JIT at every launch, and cudaludus throughput collapses by ~4x — exactly the omnidreams-webrtc demo's "20 fps on cluster, 30 fps locally" pattern. Local installs work because the cache is necessarily built fresh on the running GPU.

Changes (ludus_renderer/_ops/_plugin.py)

  1. Pin TORCH_CUDA_ARCH_LIST to the device's compute capability (e.g. "10.0" on GB200/GB300) instead of clearing it. Deterministic gencode; no dependency on torch.cuda.get_arch_list() (which varies per PyTorch wheel and can silently lack SM 10.0+ entries).
  2. Arch-tag the cache slot name so a Hopper .so can't be reused on Blackwell, and vice versa. One-time ~30–60 s rebuild on first load after this lands.
  3. Richer startup log line records device_capability, arch_tag, the value of TORCH_CUDA_ARCH_LIST actually used, and cache hit/miss.
  4. Warn (don't delete) the legacy cache slot if it still exists, so operators can clean it up.

Fallback: no CUDA at import → arch_tag="nocuda"; capability detect fails → arch_tag="auto". Both behave exactly as before the patch.
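
The tagging and pinning rules above, including both fallbacks, can be sketched roughly as follows. This is an illustration only, not the code in `_plugin.py`: `resolve_arch` is a hypothetical helper, and the capability tuple is passed in (in practice it would come from `torch.cuda.get_device_capability()`) so the sketch runs without a GPU. An empty string stands in for "`TORCH_CUDA_ARCH_LIST` cleared".

```python
from typing import Optional, Tuple


def resolve_arch(
    cuda_available: bool,
    capability: Optional[Tuple[int, int]],
) -> Tuple[str, str]:
    """Return (arch_tag, arch_list) for a device.

    Hypothetical helper. `capability` is None when detection failed.
    An empty arch_list means "cleared", i.e. the pre-patch behaviour.
    """
    if not cuda_available:
        # CPU-only host that merely imports the package.
        return "nocuda", ""
    if capability is None:
        # Capability detection raised; fall back and label it.
        return "auto", ""
    major, minor = capability
    # (10, 0) -> tag "sm_100", arch list "10.0"
    return f"sm_{major}{minor}", f"{major}.{minor}"
```

Under these assumptions, `resolve_arch(True, (10, 0))` yields `("sm_100", "10.0")`, while the two fallback cases yield `("nocuda", "")` and `("auto", "")`.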

Out of scope: the same TORCH_CUDA_ARCH_LIST="" pattern in ludus_renderer/nvjpeg.py line 71 — separate extension, separate cache, follow-up.

… poisoning

``torch.utils.cpp_extension.load`` keys its build cache off a hash of
the source files but NOT off the GPU arch / compile flags. On any
shared-filesystem host (NFS ``$HOME`` on Slurm / HPC clusters being
the typical case) a single user's ``~/.cache/torch_extensions/`` can
end up holding a ``.so`` that was built on one node (e.g. a CPU-only
login or a Hopper dev box) and then silently get reused on a later
job that lands on a node with a different GPU generation (Blackwell
GB200 / GB300). The mismatched binary doesn't carry the right SASS
for the new arch, so kernels fall back to PTX-JIT (or a degraded
path) at every launch and steady-state cudaludus throughput collapses
by ~4x.

That matches the exact pattern Junchen and Ruilong reported on the
HSG GB200/GB300 cluster (omnidreams demo capped at ~20 fps) while
the same code is fast on a locally-installed GB300 box -- where the
cache is necessarily built fresh on the running GPU and so always
ends up with the right SASS.

Two changes here:

1) ``TORCH_CUDA_ARCH_LIST`` is now pinned to the live device's
   compute capability (e.g. ``"10.0"`` on GB200/GB300) instead of
   being cleared. Clearing it relied on ``torch.cuda.get_arch_list()``
   which varies with the installed PyTorch wheel and which may
   silently lack SM 10.0 / 10.0a / 12.0 entries on older wheels.
   Pinning is deterministic and forces nvcc to gencode for the right
   thing.

2) The cache slot's directory name now includes the arch tag
   (``ludus_renderer_plugin`` -> ``ludus_renderer_plugin_sm_100`` on
   Blackwell). This makes the cache safe to share across hosts: each
   arch gets its own slot, so a Hopper build can never be picked up
   on a Blackwell job and vice versa. Existing users incur a
   one-time ~30-60s rebuild on first load after this lands; the
   legacy ``ludus_renderer_plugin`` directory is left in place (we
   warn rather than auto-delete since other tooling may reference
   it).
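
As a rough sketch of why arch-scoped slots keep builds apart on a shared filesystem, the snippet below derives a per-arch directory and reports hit or miss by looking for a `.so` inside it. The helper names, the `*.so` check, and the temporary directory used in place of `~/.cache/torch_extensions/` are assumptions made for illustration; the patch may detect hits differently.

```python
import tempfile
from pathlib import Path

BASE = "ludus_renderer_plugin"  # legacy slot name from the description


def slot_dir(cache_root: Path, arch_tag: str) -> Path:
    """Directory for one arch, e.g. <root>/ludus_renderer_plugin_sm_100."""
    return cache_root / f"{BASE}_{arch_tag}"


def cache_status(cache_root: Path, arch_tag: str) -> str:
    """Return "hit" if the arch-scoped slot already holds a .so, else "miss"."""
    d = slot_dir(cache_root, arch_tag)
    return "hit" if d.is_dir() and any(d.glob("*.so")) else "miss"


def demo() -> tuple:
    """Simulate a Hopper build on shared storage, then a Blackwell job."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        hopper = slot_dir(root, "sm_90")
        hopper.mkdir(parents=True)
        (hopper / "plugin.so").touch()  # placeholder for a built binary
        # The Hopper job sees its own build; the Blackwell job does not.
        return cache_status(root, "sm_90"), cache_status(root, "sm_100")
```

Here the Blackwell lookup misses even though a Hopper binary sits in the same cache root, which is the isolation the rename is meant to provide.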

The startup log line also now records the detected
``device_capability``, the resolved ``arch_tag``, the value of
``TORCH_CUDA_ARCH_LIST`` actually used, and whether the load was a
cache hit or a fresh build. With that line in hand it's possible to
diagnose any future "fine on host A, slow on host B" reports from
the log alone (compare the values across the two hosts; if anything
differs the cache is the suspect).
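
The cross-host comparison could be automated along these lines. The sketch assumes the log line carries `key=value` tokens, as the `arch=sm_100` and `cache=miss` examples in this message suggest; the exact field names and layout in the sample lines are hypothetical.

```python
import re
from typing import Dict, Optional, Tuple

_FIELD = re.compile(r"(\w+)=(\S+)")


def parse_fields(line: str) -> Dict[str, str]:
    """Extract key=value tokens from one log line (assumed format)."""
    return dict(_FIELD.findall(line))


def diff_fields(a: str, b: str) -> Dict[str, Tuple[Optional[str], Optional[str]]]:
    """Return the fields whose values differ between two hosts' log lines."""
    fa, fb = parse_fields(a), parse_fields(b)
    return {
        key: (fa.get(key), fb.get(key))
        for key in sorted(fa.keys() | fb.keys())
        if fa.get(key) != fb.get(key)
    }


# Hypothetical log lines from two hosts.
host_a = "Loading Ludus renderer extension device_capability=10.0 arch=sm_100 cache=hit"
host_b = "Loading Ludus renderer extension device_capability=9.0 arch=sm_90 cache=hit"
differences = diff_fields(host_a, host_b)
```

An empty result means the two hosts resolved the same values; any non-empty result names the fields to inspect first.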

Fallback behaviour:

* No CUDA available at import time -> ``arch_tag = "nocuda"`` and
  ``TORCH_CUDA_ARCH_LIST`` is cleared. Preserves the prior code
  path for CPU-only login nodes that just import the package.
* ``get_device_capability`` raises -> ``arch_tag = "auto"`` and
  ``TORCH_CUDA_ARCH_LIST`` is cleared, with a warning. Same
  behaviour as before this patch, just labelled so the operator can
  see we punted.

Tested locally on a non-cluster host: builds + loads fine, log line
shows the new fields. Cluster validation (the symptom this patch
targets) needs someone with Slurm access to run on the affected
GB200/GB300 nodes:

  1. Apply this patch on top of current ``main``.
  2. Run a representative omnidreams-webrtc rollout on the affected
     cluster.
  3. Look for the ``Loading Ludus renderer extension ...`` log line.
     ``cache=miss (will build)`` on the first run is expected (new
     arch-scoped slot); ``arch=sm_100`` confirms the right SASS is
     being produced.
  4. Confirm cudaludus per-chunk cost drops back to the
     locally-observed value and the demo holds 30 fps.

If FPS does NOT improve, the cache hypothesis is wrong and the
new log line tells us what to chase next (e.g. driver too old, MIG
slice, container CUDA toolkit version) without needing to rebuild
the patch.

wlewNV self-assigned this May 29, 2026

copy-pr-bot Bot commented May 29, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.


JunchenLiu77 commented May 29, 2026

Profiling result with this PR applied:
gen_ms: avg 402.9 ms / 8 frames (~19.9 FPS)
pipeline_total_ms: avg 179.0 ms / 8 frames
wrapper_render_condition_ms: avg 219.4 ms
renderer_ctx_render_ms: avg 218.3 ms
ctx_render_plugin_cuda_ms_sum: avg 214.7 ms
enqueue_ms: avg 26.5 ms
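
For reference, the conversion between the per-chunk time and FPS in the first line, and the time budget a 30 fps target would imply for the same 8-frame chunk, works out as below. This is plain arithmetic on the reported `gen_ms` figure; the helper names are illustrative only.

```python
def fps(frames: int, elapsed_ms: float) -> float:
    """Frames per second for a chunk of `frames` generated in `elapsed_ms`."""
    return frames / (elapsed_ms / 1000.0)


def budget_ms(frames: int, target_fps: float) -> float:
    """Maximum chunk time in ms that still sustains `target_fps`."""
    return frames / target_fps * 1000.0


measured = fps(8, 402.9)           # about 19.9 FPS, matching the report
budget = budget_ms(8, 30.0)        # about 266.7 ms per 8-frame chunk
over_budget = 402.9 - budget       # about 136 ms above the 30 fps budget
```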

logger.warning(
"Found legacy Ludus extension cache at {}; this directory "
"was produced by a pre-arch-pinning build and may contain "
"a .so compiled for a different GPU. Safe to delete -- "

I wouldn't say it's safe to delete, precisely because another operator may be depending on it :) I'd remove this legacy-detection branch for simplicity and assume the few extra files won't hurt anyone and will eventually get cleaned up when somebody clears the cache for whatever reason (disk cleanup, etc.).
