Skip to content

[FEAT][kernels] Fix CUDA extension build on non-Hopper SM>=90 (Blackkwell SM120)#91

Merged
Flink-ddd merged 2 commits into
RL-Align:mainfrom
KJLdefeated:fix/fused-logp-blackwell-sm120
Jun 9, 2026
Merged

[FEAT][kernels] Fix CUDA extension build on non-Hopper SM>=90 (Blackkwell SM120)#91
Flink-ddd merged 2 commits into
RL-Align:mainfrom
KJLdefeated:fix/fused-logp-blackwell-sm120

Conversation

@KJLdefeated

@KJLdefeated KJLdefeated commented Jun 8, 2026

Copy link
Copy Markdown
Collaborator

#87

Summary

pip install -e . force-built the Hopper-only TMA fused-logp kernel (csrc/cuda/fused_logp_sm90.cu) on every device with compute capability >= 9, including Blackwell (SM120) and SM100. Its hardcoded gencode=arch=compute_90a,code=sm_90a also suppressed PyTorch's automatic native-arch gencode, so the entire extension — including the generic and attention kernels — was compiled for sm_90a only and could not load on the actual device.

Change

  • setup.py: When opted in, emit the detected device's architecture-specific gencode (SM90->90a, SM120->120a) instead of a hardcoded compute_90a.
  • registry.py: Prioritize the TMA logp op only when its symbol is compiled into _C, and drop the misleading "Failed to instantiate CUDA_FUSED_LOGP_SM90" ERROR that fired on every non-Hopper SM>=9 run before falling back.

Tests on Blackwell and Hopper

  • Blackwell (SM120) + CUDA13
> pip install -e . # No error
> python examples/grpo_single_gpu.py --require-fused-logp --device cuda
INFO 06-08 20:00:03 [RL-Kernel]: RL-Engine initialized with NVIDIA CUDA backend (Version: 13.0)
INFO 06-08 20:00:03 [RL-Kernel]: KernelRegistry initialized for cuda
INFO 06-08 20:00:03 [RL-Kernel]: Successfully linked to precompiled _C.fused_logp fallback kernel.
starting grpo_single_gpu device=cuda backend=FusedLogpGenericOp batch=8x16 active_tokens=115
reward_stats mean=0.446666 min=0.297619 max=0.600000
step=0 loss=0.002686 policy_loss=0.002686 kl=0.000000 train_logp_source=autograd_reference kernel_max_abs_error=4.768372e-07
step=1 loss=-0.084697 policy_loss=-0.086432 kl=0.173448 train_logp_source=autograd_reference kernel_max_abs_error=4.768372e-07
step=2 loss=-0.103133 policy_loss=-0.107106 kl=0.397346 train_logp_source=autograd_reference kernel_max_abs_error=4.768372e-07
step=3 loss=-0.102970 policy_loss=-0.109759 kl=0.678900 train_logp_source=autograd_reference kernel_max_abs_error=4.768372e-07
completed grpo_single_gpu steps=4 device=cuda backend=FusedLogpGenericOp
  • H100 (SM90) + CUDA12.8
> pip install -e . # No error
> python examples/grpo_single_gpu.py --require-fused-logp --device cuda
backend=FusedLogpGenericOp
INFO 06-08 11:52:22 [RL-Kernel]: RL-Engine initialized with NVIDIA CUDA backend (Version: 12.8)
INFO 06-08 11:52:22 [RL-Kernel]: KernelRegistry initialized for cuda
INFO 06-08 11:52:22 [RL-Kernel]: Successfully linked to precompiled _C.fused_logp fallback kernel.
starting grpo_single_gpu device=cuda backend=FusedLogpGenericOp batch=8x16 active_tokens=115
reward_stats mean=0.446666 min=0.297619 max=0.600000
step=0 loss=0.002686 policy_loss=0.002686 kl=0.000000 train_logp_source=autograd_reference kernel_max_abs_error=4.768372e-07
step=1 loss=-0.084697 policy_loss=-0.086432 kl=0.173448 train_logp_source=autograd_reference kernel_max_abs_error=4.768372e-07
step=2 loss=-0.103133 policy_loss=-0.107106 kl=0.397345 train_logp_source=autograd_reference kernel_max_abs_error=4.768372e-07
step=3 loss=-0.102970 policy_loss=-0.109759 kl=0.678900 train_logp_source=autograd_reference kernel_max_abs_error=4.768372e-07
completed grpo_single_gpu steps=4 device=cuda backend=FusedLogpGenericOp

Summary by CodeRabbit

  • Refactor
    • Improved CUDA kernel backend selection logic with stricter hardware compatibility checks.
    • Refined SM90 optimization enablement to require explicit environment variable configuration (KERNEL_ALIGN_FORCE_SM90="1").
    • Enhanced device capability detection during build process for more precise hardware targeting.

…kwell SM120)

`pip install -e .` force-built the Hopper-only TMA fused-logp kernel
(csrc/cuda/fused_logp_sm90.cu) on every device with compute capability >= 9,
including Blackwell (SM120) and SM100. Its hardcoded
`-gencode=arch=compute_90a,code=sm_90a` also suppressed PyTorch's automatic
native-arch gencode, so the entire extension — including the generic and
attention kernels — was compiled for sm_90a only and could not load on the
actual device. The TMA kernel is additionally non-functional on all
architectures (TMA box width exceeds the 256-element cuTensorMapEncodeTiled
limit; its warp-specialized layout deadlocks cub::BlockReduce across a partial
block), so it should not be built by default.

setup.py:
- Build the experimental TMA kernel only via KERNEL_ALIGN_FORCE_SM90=1 (off by
  default), so the default build compiles the generic fused kernel for the
  detected native architecture and runs on SM120 + CUDA 13.
- When opted in, emit the detected device's architecture-specific gencode
  (SM90->90a, SM120->120a) instead of a hardcoded compute_90a.

registry.py:
- Prioritize the TMA logp op only when its symbol is compiled into _C, and drop
  the misleading "Failed to instantiate CUDA_FUSED_LOGP_SM90" ERROR that fired
  on every non-Hopper SM>=9 run before falling back.

Verified on RTX PRO 6000 (SM120) + CUDA 13: build succeeds, the example selects
FusedLogpGenericOp, --require-fused-logp passes (kernel_max_abs_error 4.77e-07).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@coderabbitai

coderabbitai Bot commented Jun 8, 2026

Copy link
Copy Markdown

Review Change Stack

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: aa5df907-6ebe-4341-b51d-8bf5bb264941

📥 Commits

Reviewing files that changed from the base of the PR and between 12fc220 and 1c3e388.

📒 Files selected for processing (2)
  • rl_engine/kernels/registry.py
  • setup.py

📝 Walkthrough

Walkthrough

This PR refines SM90 CUDA kernel availability detection and compilation. The build system now gates SM90 extension compilation behind an environment variable and derives gencode targets from detected device capability. Runtime kernel selection validates extension presence and SM major version before prioritizing the fused TMA LogP backend.

Changes

SM90 CUDA Kernel Build and Runtime Selection

Layer / File(s) Summary
Build-time SM90 extension gating
setup.py
CUDA extension build now captures both major and minor device capability and gates SM90 "tma" support compilation to only when KERNEL_ALIGN_FORCE_SM90 environment variable is set to "1". When enabled, gencode targets are computed from detected capability ({cc_major}{cc_minor}a) instead of hardcoded compute_90a/sm_90a.
Runtime kernel availability checking
rl_engine/kernels/registry.py
Kernel registry now validates fused TMA LogP backend availability by checking for the fused_logp_sm90 extension symbol and restricting prioritization to specific SM major versions (9, 10, 12). Non-CUDA devices return early; CUDA devices with unavailable fused kernels log a debug message instead of injecting the backend.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related issues

Poem

🐰 A kernel fine-tuned for the SM90 day,
No longer forced where it cannot play—
Build with a flag, runtime checks with care,
Device compatibility everywhere!
Compute ninety whispers: only when ready.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 50.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title directly summarizes the main change: fixing CUDA extension build for non-Hopper SM>=90 architectures (specifically Blackwell SM120).
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

@KJLdefeated KJLdefeated changed the title [FEAT][kernels] Fix CUDA extension build on non-Hopper SM>=9 (Blackkwell SM120) [FEAT][kernels] Fix CUDA extension build on non-Hopper SM>=90 (Blackkwell SM120) Jun 8, 2026
@inaniloquentee

Copy link
Copy Markdown
Collaborator

LGTM. The default build path looks good and the registry fallback behavior is cleaner. I only noticed a minor non-blocking edge case around KERNEL_ALIGN_FORCE_SM90=1 deriving an a arch from the local GPU, but that can be addressed separately if needed. Happy to merge.

@Flink-ddd Flink-ddd left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, I think we can merge this PR. Thanks.

@Flink-ddd Flink-ddd merged commit 3178f69 into RL-Align:main Jun 9, 2026
4 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants