[FEAT][kernels] Fix CUDA extension build on non-Hopper SM>=90 (Blackwell SM120) by KJLdefeated · Pull Request #91 · RL-Align/RL-Kernel

KJLdefeated · 2026-06-08T12:03:03Z

Summary

`pip install -e .` force-built the Hopper-only TMA fused-logp kernel (csrc/cuda/fused_logp_sm90.cu) on every device with compute capability >= 9.0, including Blackwell (SM120) and SM100. Its hardcoded `-gencode=arch=compute_90a,code=sm_90a` also suppressed PyTorch's automatic native-arch gencode, so the entire extension, including the generic and attention kernels, was compiled for sm_90a only and could not load on the actual device.

Change

setup.py: When opted in via KERNEL_ALIGN_FORCE_SM90=1, emit the detected device's architecture-specific gencode (SM90->90a, SM120->120a) instead of a hardcoded compute_90a.
registry.py: Prioritize the TMA logp op only when its symbol is compiled into _C, and drop the misleading "Failed to instantiate CUDA_FUSED_LOGP_SM90" ERROR that fired on every non-Hopper SM>=9 run before falling back.
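The setup.py change described above can be sketched roughly as follows. This is an illustrative reconstruction, not the code from the PR: the helper name `tma_gencode_flags` and its signature are hypothetical, and the capability tuple is assumed to come from something like `torch.cuda.get_device_capability()`. Only the environment variable name and the flag format are taken from the PR text.

```python
import os
from typing import List, Mapping, Optional, Tuple

# Environment variable named in the PR; every other name here is hypothetical.
FORCE_ENV = "KERNEL_ALIGN_FORCE_SM90"


def tma_gencode_flags(
    capability: Tuple[int, int],
    env: Optional[Mapping[str, str]] = None,
) -> List[str]:
    """Return extra nvcc flags for the opt-in TMA kernel, or [] by default."""
    env = os.environ if env is None else env
    if env.get(FORCE_ENV) != "1":
        # Default build: add no hardcoded gencode, so the build system's
        # own architecture selection stays in effect.
        return []
    cc_major, cc_minor = capability
    arch = f"{cc_major}{cc_minor}a"  # e.g. (9, 0) -> "90a", (12, 0) -> "120a"
    return [f"-gencode=arch=compute_{arch},code=sm_{arch}"]


if __name__ == "__main__":
    opted_in = {FORCE_ENV: "1"}
    print(tma_gencode_flags((9, 0), opted_in))
    print(tma_gencode_flags((12, 0), opted_in))
    print(tma_gencode_flags((12, 0), {}))
```

Passing the environment as a parameter keeps the sketch testable without a GPU; a real setup.py would more likely read `os.environ` directly.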

Tests on Blackwell and Hopper

Blackwell (SM120) + CUDA13

> pip install -e . # No error
> python examples/grpo_single_gpu.py --require-fused-logp --device cuda
INFO 06-08 20:00:03 [RL-Kernel]: RL-Engine initialized with NVIDIA CUDA backend (Version: 13.0)
INFO 06-08 20:00:03 [RL-Kernel]: KernelRegistry initialized for cuda
INFO 06-08 20:00:03 [RL-Kernel]: Successfully linked to precompiled _C.fused_logp fallback kernel.
starting grpo_single_gpu device=cuda backend=FusedLogpGenericOp batch=8x16 active_tokens=115
reward_stats mean=0.446666 min=0.297619 max=0.600000
step=0 loss=0.002686 policy_loss=0.002686 kl=0.000000 train_logp_source=autograd_reference kernel_max_abs_error=4.768372e-07
step=1 loss=-0.084697 policy_loss=-0.086432 kl=0.173448 train_logp_source=autograd_reference kernel_max_abs_error=4.768372e-07
step=2 loss=-0.103133 policy_loss=-0.107106 kl=0.397346 train_logp_source=autograd_reference kernel_max_abs_error=4.768372e-07
step=3 loss=-0.102970 policy_loss=-0.109759 kl=0.678900 train_logp_source=autograd_reference kernel_max_abs_error=4.768372e-07
completed grpo_single_gpu steps=4 device=cuda backend=FusedLogpGenericOp

H100 (SM90) + CUDA12.8

> pip install -e . # No error
> python examples/grpo_single_gpu.py --require-fused-logp --device cuda
backend=FusedLogpGenericOp
INFO 06-08 11:52:22 [RL-Kernel]: RL-Engine initialized with NVIDIA CUDA backend (Version: 12.8)
INFO 06-08 11:52:22 [RL-Kernel]: KernelRegistry initialized for cuda
INFO 06-08 11:52:22 [RL-Kernel]: Successfully linked to precompiled _C.fused_logp fallback kernel.
starting grpo_single_gpu device=cuda backend=FusedLogpGenericOp batch=8x16 active_tokens=115
reward_stats mean=0.446666 min=0.297619 max=0.600000
step=0 loss=0.002686 policy_loss=0.002686 kl=0.000000 train_logp_source=autograd_reference kernel_max_abs_error=4.768372e-07
step=1 loss=-0.084697 policy_loss=-0.086432 kl=0.173448 train_logp_source=autograd_reference kernel_max_abs_error=4.768372e-07
step=2 loss=-0.103133 policy_loss=-0.107106 kl=0.397345 train_logp_source=autograd_reference kernel_max_abs_error=4.768372e-07
step=3 loss=-0.102970 policy_loss=-0.109759 kl=0.678900 train_logp_source=autograd_reference kernel_max_abs_error=4.768372e-07
completed grpo_single_gpu steps=4 device=cuda backend=FusedLogpGenericOp

Summary by CodeRabbit

Refactor
- Improved CUDA kernel backend selection logic with stricter hardware compatibility checks.
- Refined SM90 optimization enablement to require explicit environment variable configuration (KERNEL_ALIGN_FORCE_SM90="1").
- Enhanced device capability detection during build process for more precise hardware targeting.

…kwell SM120)

`pip install -e .` force-built the Hopper-only TMA fused-logp kernel (csrc/cuda/fused_logp_sm90.cu) on every device with compute capability >= 9, including Blackwell (SM120) and SM100. Its hardcoded `-gencode=arch=compute_90a,code=sm_90a` also suppressed PyTorch's automatic native-arch gencode, so the entire extension, including the generic and attention kernels, was compiled for sm_90a only and could not load on the actual device.

The TMA kernel is additionally non-functional on all architectures (TMA box width exceeds the 256-element cuTensorMapEncodeTiled limit; its warp-specialized layout deadlocks cub::BlockReduce across a partial block), so it should not be built by default.

setup.py:
- Build the experimental TMA kernel only via KERNEL_ALIGN_FORCE_SM90=1 (off by default), so the default build compiles the generic fused kernel for the detected native architecture and runs on SM120 + CUDA 13.
- When opted in, emit the detected device's architecture-specific gencode (SM90->90a, SM120->120a) instead of a hardcoded compute_90a.

registry.py:
- Prioritize the TMA logp op only when its symbol is compiled into _C, and drop the misleading "Failed to instantiate CUDA_FUSED_LOGP_SM90" ERROR that fired on every non-Hopper SM>=9 run before falling back.

Verified on RTX PRO 6000 (SM120) + CUDA 13: build succeeds, the example selects FusedLogpGenericOp, --require-fused-logp passes (kernel_max_abs_error 4.77e-07).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
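To make the 256-element box limit mentioned in the commit message concrete, here is a minimal sketch of how a row that is wider than the limit could be covered by several boxes. It is not the fix applied in this PR (the PR simply stops building the TMA kernel by default), and `split_box_width` is a hypothetical helper used only for illustration.

```python
from typing import List

# Per-dimension element limit for a TMA box, as cited in the commit message.
TMA_BOX_LIMIT = 256


def split_box_width(width: int, limit: int = TMA_BOX_LIMIT) -> List[int]:
    """Split a row of `width` elements into chunks of at most `limit` elements."""
    if width <= 0 or limit <= 0:
        raise ValueError("width and limit must be positive")
    full, rest = divmod(width, limit)
    # `full` chunks of exactly `limit`, plus one shorter tail chunk if needed.
    return [limit] * full + ([rest] if rest else [])


if __name__ == "__main__":
    print(split_box_width(600))
    print(split_box_width(256))
```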

coderabbitai · 2026-06-08T12:03:14Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: aa5df907-6ebe-4341-b51d-8bf5bb264941

📥 Commits

Reviewing files that changed from the base of the PR and between 12fc220 and 1c3e388.

📒 Files selected for processing (2)

rl_engine/kernels/registry.py
setup.py

📝 Walkthrough

Walkthrough

This PR refines SM90 CUDA kernel availability detection and compilation. The build system now gates SM90 extension compilation behind an environment variable and derives gencode targets from detected device capability. Runtime kernel selection validates extension presence and SM major version before prioritizing the fused TMA LogP backend.

Changes

SM90 CUDA Kernel Build and Runtime Selection

Layer / File(s)	Summary
Build-time SM90 extension gating `setup.py`	CUDA extension build now captures both major and minor device capability and gates SM90 "tma" support compilation to only when `KERNEL_ALIGN_FORCE_SM90` environment variable is set to `"1"`. When enabled, gencode targets are computed from detected capability (`{cc_major}{cc_minor}a`) instead of hardcoded `compute_90a`/`sm_90a`.
Runtime kernel availability checking `rl_engine/kernels/registry.py`	Kernel registry now validates fused TMA LogP backend availability by checking for the `fused_logp_sm90` extension symbol and restricting prioritization to specific SM major versions (9, 10, 12). Non-CUDA devices return early; CUDA devices with unavailable fused kernels log a debug message instead of injecting the backend.
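The runtime check summarized in the table can be sketched as below. This is an assumption-laden illustration rather than the code in rl_engine/kernels/registry.py: the function name `should_prioritize_tma` and its parameters are hypothetical, and the compiled extension is modeled as any object that may or may not expose a `fused_logp_sm90` attribute. The symbol name and the SM major versions (9, 10, 12) come from the walkthrough.

```python
from types import SimpleNamespace
from typing import Optional, Tuple

TMA_SYMBOL = "fused_logp_sm90"  # extension symbol named in the walkthrough
TMA_SM_MAJORS = (9, 10, 12)     # SM major versions listed in the walkthrough


def should_prioritize_tma(
    device_type: str,
    capability: Optional[Tuple[int, int]],
    ext: object,
) -> bool:
    """Decide whether the fused TMA LogP backend should be tried first."""
    if device_type != "cuda" or capability is None:
        return False  # non-CUDA devices return early
    if not hasattr(ext, TMA_SYMBOL):
        return False  # kernel was not compiled in; use the generic op quietly
    return capability[0] in TMA_SM_MAJORS


if __name__ == "__main__":
    # Stand-ins for the compiled _C module in a default and an opt-in build.
    default_build = SimpleNamespace(fused_logp=object())
    opt_in_build = SimpleNamespace(fused_logp=object(), fused_logp_sm90=object())
    print(should_prioritize_tma("cuda", (12, 0), default_build))
    print(should_prioritize_tma("cuda", (9, 0), opt_in_build))
```

Checking for the symbol before the SM version means a default build never attempts to instantiate the TMA op, which is consistent with the PR's removal of the spurious ERROR log on fallback.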

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related issues

[FEAT][kernels] SM90 fused-logp (TMA) kernel cannot build or run on SM120 (Blackwell); Ask support for SM120 (Blackwell) #87: Related — both changes constrain runtime prioritization to avoid forcing the SM90 fused-logp backend on unsupported devices by gating selection on kernel availability and device SM versions.

Poem

🐰 A kernel fine-tuned for the SM90 day,
No longer forced where it cannot play—
Build with a flag, runtime checks with care,
Device compatibility everywhere!
Compute ninety whispers: only when ready. ✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 50.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title directly summarizes the main change: fixing CUDA extension build for non-Hopper SM>=90 architectures (specifically Blackwell SM120).
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

inaniloquentee · 2026-06-09T07:32:32Z

LGTM. The default build path looks good and the registry fallback behavior is cleaner. I only noticed a minor non-blocking edge case around KERNEL_ALIGN_FORCE_SM90=1 deriving an `a` arch from the local GPU, but that can be addressed separately if needed. Happy to merge.

Flink-ddd

LGTM, I think we can merge this PR. Thanks.

KJLdefeated requested review from Flink-ddd and inaniloquentee as code owners June 8, 2026 12:03

KJLdefeated changed the title ~~[FEAT][kernels] Fix CUDA extension build on non-Hopper SM>=9 (Blackwell SM120)~~ [FEAT][kernels] Fix CUDA extension build on non-Hopper SM>=90 (Blackwell SM120) Jun 8, 2026

KJLdefeated mentioned this pull request Jun 8, 2026

[FEAT][kernels] SM90 fused-logp (TMA) kernel cannot build or run on SM120 (Blackwell); Ask support for SM120 (Blackwell) #87

Closed

4 tasks

Merge branch 'main' into fix/fused-logp-blackwell-sm120

7a19aab

inaniloquentee approved these changes Jun 9, 2026

Flink-ddd approved these changes Jun 9, 2026

Flink-ddd merged commit 3178f69 into RL-Align:main Jun 9, 2026
4 checks passed

coderabbitai Bot mentioned this pull request Jun 26, 2026

[WS1][kernels] Batch-invariant deterministic GEMM (fwd + bwd) #180

Open

Flink-ddd merged 2 commits into RL-Align:main from KJLdefeated:fix/fused-logp-blackwell-sm120
