[None][perf] FC2 DenseGEMM autotune: split-K, swap_ab, fine-grained tuning buckets #13833
📝 Walkthrough
This PR adds split-K execution and SwapAB transposition support to the MoE FC2 dense GEMM kernel. It expands tactic search with split_k, introduces vectorized/Float16 atomics, threads split_k/swap_ab through kernel/grid/wrapper, rewrites epilogue storage for atomic reductions, and updates the test runner and CLI.
Changes
Split-K and SwapAB MoE FC2 GEMM
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~75 minutes
🚥 Pre-merge checks | ✅ 4 | ❌ 1
❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
Actionable comments posted: 1
🧹 Nitpick comments (2)
tensorrt_llm/_torch/cute_dsl_kernels/blackwell/moe_as_dense_gemm/fc2.py (1)
2574-2580: 💤 Low value
Int64 to Int32 cast lacks bounds validation.
The cast assumes "practical tensor dimensions always fit in Int32" but doesn't verify this. While dimensions >2B elements are rare, a silent overflow would cause subtle correctness issues. Consider adding an assertion for safety, especially since this is a public wrapper API.
🛡️ Optional bounds check

    + INT32_MAX = 2147483647
    + if m > INT32_MAX or n > INT32_MAX or k > INT32_MAX or l > INT32_MAX:
    +     raise ValueError("Tensor dimensions exceed Int32 range")
      m = cutlass.Int32(m)
      n = cutlass.Int32(n)
      k = cutlass.Int32(k)
      l = cutlass.Int32(l)

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tensorrt_llm/_torch/cute_dsl_kernels/blackwell/moe_as_dense_gemm/fc2.py` around lines 2574 - 2580, Add explicit bounds validation before casting m, n, k, l to cutlass.Int32: check each dimension is within the signed 32-bit range (e.g., between -(2**31) and 2**31-1) and raise/throw a clear error or assert if not. Update the wrapper where m, n, k, l are converted (the cutlass.Int32(...) lines) to perform these checks and include the offending variable name and value in the error message so callers of this public API get immediate, actionable feedback instead of silent overflow. Ensure the validation runs before any cutlass.range() or other 32-bit-only calls.
tests/scripts/cute_dsl_kernels/moe_as_dense_gemm/run_moe_as_dense_gemm_fc2.py (1)
319-351: 💤 Low value
Misleading comment and unused parameter in `create_alpha_scale_tensor`.
Line 320: The comment `### if not swapAB` is misleading since this function is called for both swap_ab modes; the caller passes different values for `m` (either the actual M or N dimension).
Line 319: The `n` parameter is never used inside `create_alpha_scale_tensor`. Consider removing it or documenting why it's retained.
These are documentation nits; the logic is correct.
📝 Suggested documentation fix

    - def create_alpha_scale_tensor(l, m, n, expert_count, dtype):  # noqa: E741
    -     ### if not swapAB
    -     # True means alpha_scale is token (m) major for coalesced global memory access.
    + def create_alpha_scale_tensor(l, token_dim, expert_count, dtype):  # noqa: E741
    +     # token_dim = M (standard) or N (swap_ab)
    +     # alpha_scale is token-major for coalesced global memory access.
          alpha_scale_ref = cutlass_torch.matrix(
              l,
    -         m,
    +         token_dim,
              expert_count,
              True,  # token_dim is major
              cutlass.Float32,

Then update the call sites accordingly.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/scripts/cute_dsl_kernels/moe_as_dense_gemm/run_moe_as_dense_gemm_fc2.py` around lines 319 - 351, The function create_alpha_scale_tensor currently has an unused parameter n and a misleading comment "### if not swapAB"; remove the unused n parameter from create_alpha_scale_tensor's signature and from its internal references, update the call site where create_alpha_scale_tensor(...) is invoked (the call that passes alpha_token_dim, n, expert_count, ...) to pass only the needed parameters (l, alpha_token_dim, expert_count, dtype), and replace the "### if not swapAB" comment with a short clarifying comment that the caller controls token-major ordering via alpha_token_dim (derived from swap_ab).
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@tensorrt_llm/_torch/cute_dsl_kernels/blackwell/moe_as_dense_gemm/fc2.py`:
- Around line 1873-1878: The scalar atomic fallback loop incorrectly calls
atomic_add_func(rVec_flat[j], scatter_out) instead of using the per-j computed
address; change the call to atomic_add_func(rVec_flat[j], scatter_j) so each
iteration uses scatter_j (computed via cute.domain_offset) rather than the
shared scatter_out, preserving correct per-element destinations; this affects
the branch handling in the vectorized_atomic_add_bf16x8 fallback where rVec,
rVec_flat, scatter_out, scatter_j, and atomic_add_func are used.
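The difference between the shared destination and the per-element destination can be sketched in plain Python. This is only a model of the pattern, not CuTe DSL code: `memory` stands in for global memory, and `atomic_add`, `scatter_shared_base`, and `scatter_per_element` are hypothetical helpers introduced for illustration.

```python
# Plain-Python model of the scalar fallback loop; not CuTe DSL code.
# `memory` stands in for global memory; all helper names are hypothetical.

def atomic_add(memory, address, value):
    """Model an atomic add as an in-place accumulate at one address."""
    memory[address] += value


def scatter_shared_base(memory, base, values):
    """Flagged pattern: every element is added to the same base address."""
    for value in values:
        atomic_add(memory, base, value)


def scatter_per_element(memory, base, values):
    """Suggested pattern: element j is added to its own address base + j."""
    for j, value in enumerate(values):
        atomic_add(memory, base + j, value)


values = [1.0, 2.0, 3.0, 4.0]

buggy = [0.0] * 6
scatter_shared_base(buggy, 1, values)

fixed = [0.0] * 6
scatter_per_element(fixed, 1, values)

print(buggy)  # all four values collapse into slot 1
print(fixed)  # each value lands in its own slot
```

In the shared-base version the four contributions sum into a single slot, which is the kind of silent corruption the review comment warns about.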
📒 Files selected for processing (4)
- tensorrt_llm/_torch/custom_ops/cute_dsl_custom_ops.py
- tensorrt_llm/_torch/cute_dsl_kernels/blackwell/moe_as_dense_gemm/fc2.py
- tensorrt_llm/_torch/cute_dsl_kernels/blackwell/utils.py
- tests/scripts/cute_dsl_kernels/moe_as_dense_gemm/run_moe_as_dense_gemm_fc2.py
/bot run
@coderabbitai full review
/bot run --disable-fail-fast
✅ Actions performed: Full review triggered.
PR_Github #47359 [ run ] triggered by Bot. Commit:
Actionable comments posted: 4
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
tensorrt_llm/_torch/custom_ops/cute_dsl_custom_ops.py (1)
4255-4274: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win
Expose `swap_ab` in the FC2 tactic path too.
This runner now autotunes only `(mma_tiler_mn, cluster_shape_mn, split_k)`. Because `swap_ab` never enters the tactic tuple, `forward` parsing, or the cache/kernel construction, `cute_dsl_nvfp4_dense_gemm_fc2_blackwell` still cannot autotune or launch the new swap-ab mode described for this PR, so the op-level sweep/cache population will miss that dimension entirely.
Also applies to: 4338-4437
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tensorrt_llm/_torch/custom_ops/cute_dsl_custom_ops.py` around lines 4255 - 4274, The FC2 tactic candidate generator currently only returns (mma_tiler_mn, cluster_shape_mn, split_k) and omits the swap_ab dimension; update the generator in the method that builds FC2 candidates (the function surrounding the shown diff that feeds cute_dsl_nvfp4_dense_gemm_fc2_blackwell) to include swap_ab in the candidate tuple and enumeration (e.g., add a swap_ab_candidates list like [False, True] and produce candidates as (mma_tiler_mn, cluster_shape_mn, split_k, swap_ab)); then propagate this new 4-tuple shape through the FC2 tactic path by (1) updating the forward parsing logic that reads tactics to expect swap_ab for cute_dsl_nvfp4_dense_gemm_fc2_blackwell, and (2) including swap_ab when constructing cache keys and kernel launch configuration so the autotuner and cache population account for and can select the swap-ab mode.
🧹 Nitpick comments (3)
tests/scripts/cute_dsl_kernels/moe_as_dense_gemm/run_moe_as_dense_gemm_fc2.py (2)
603-608: 💤 Low value
Consider adding CLI validation for `split_k` values.
Per the PR description, split-K only supports values in {1, 2, 4}. Adding a `choices` constraint would provide clearer error messages than failing later in kernel construction.
💡 Suggested fix

      parser.add_argument(
          "--split_k",
          type=int,
          default=1,
    +     choices=[1, 2, 4],
          help="Split-K factor (default: 1)",
      )

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/scripts/cute_dsl_kernels/moe_as_dense_gemm/run_moe_as_dense_gemm_fc2.py` around lines 603 - 608, The CLI currently accepts any integer for the "--split_k" argument in parser.add_argument; restrict accepted values to the supported set {1,2,4} by adding an argparse validation (e.g., use the choices parameter) on the "--split_k" argument so invalid inputs produce a helpful argparse error before kernel construction; update the parser.add_argument call for "--split_k" to include choices=[1,2,4] and adjust the help text to reflect allowed values.
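A minimal standalone sketch of the suggested `choices` constraint follows. It uses only the standard `argparse` module; `build_parser` and `parse_split_k` are hypothetical helper names for illustration, not functions from the run script.

```python
import argparse


def build_parser():
    """Build a parser whose --split_k flag only accepts supported factors."""
    parser = argparse.ArgumentParser(description="split_k validation sketch")
    parser.add_argument(
        "--split_k",
        type=int,
        default=1,
        choices=[1, 2, 4],
        help="Split-K factor; one of 1, 2, 4 (default: 1)",
    )
    return parser


def parse_split_k(argv):
    """Return the parsed split_k, or None if argparse rejects the value."""
    try:
        return build_parser().parse_args(argv).split_k
    except SystemExit:
        # argparse prints a usage error and exits on an invalid choice.
        return None


print(parse_split_k(["--split_k", "4"]))
print(parse_split_k([]))
print(parse_split_k(["--split_k", "3"]))
```

With `choices` in place, an unsupported factor such as 3 is rejected at argument parsing time, before any kernel construction is attempted.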
105-106: 💤 Low value
Consider documenting new parameters in docstring.
The new `split_k` and `swap_ab` parameters are not documented in the function docstring below. For consistency with other parameters, consider adding brief descriptions.
📝 Suggested docstring addition

      :param use_cold_l2: Whether to use circular buffer strategy to ensure cold L2 cache, defaults to False
      :type use_cold_l2: bool, optional
    + :param split_k: Split-K factor for parallel reduction (valid values: 1, 2, 4), defaults to 1
    + :type split_k: int, optional
    + :param swap_ab: Whether to swap A/B roles (A=weight, B=activation), defaults to False
    + :type swap_ab: bool, optional
      :raises RuntimeError: If CUDA GPU is not available

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/scripts/cute_dsl_kernels/moe_as_dense_gemm/run_moe_as_dense_gemm_fc2.py` around lines 105 - 106, The function run_moe_as_dense_gemm_fc2 has two new parameters split_k and swap_ab that are not described in the function docstring; update the docstring for run_moe_as_dense_gemm_fc2 to add short descriptions for "split_k: int" (what it controls, default 1) and "swap_ab: bool" (what swapping A/B does, default False), matching the style/format of the existing parameter descriptions so callers and generated docs are consistent.
tensorrt_llm/_torch/cute_dsl_kernels/blackwell/moe_as_dense_gemm/fc2.py (1)
2627-2634: 💤 Low value
Clarify semantic difference in swap_ab alpha tensor.
The comments state "alpha per M (token)" and "alpha per N (token)" but this could be confusing since the term "token" has a specific meaning in MoE contexts. Consider clarifying whether swap_ab actually swaps which dimension represents tokens, or if the alpha semantics change for a different reason.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tensorrt_llm/_torch/cute_dsl_kernels/blackwell/moe_as_dense_gemm/fc2.py` around lines 2627 - 2634, The comment around alpha_token_dim/alpha_scale is ambiguous about what swap_ab changes; update the comment to clearly state that swap_ab toggles which matrix axis corresponds to the token dimension (i.e., when swap_ab is true the token index is N instead of M) and that alpha_scale_ptr/make_tensor/make_ordered_layout are built accordingly. Edit the block around alpha_token_dim, alpha_scale_ptr, and the call to cute.make_ordered_layout((alpha_token_dim, expert_count, l), order=(0,1,2)) so the comment explicitly says "when swap_ab is False, tokens map to M (first dim); when True, tokens map to N (first dim)" and reference swap_ab, alpha_token_dim, alpha_scale, and alpha_scale_ptr so readers know the layout change is intentional.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@tensorrt_llm/_torch/custom_ops/cute_dsl_custom_ops.py`:
- Around line 4306-4314: The wrapped conditional causing flake8 E129 should be
reindented so continuations align under the opening parenthesis; update the
predicate in the loop over split_k_candidates (the if that checks k_tiles %
split_k and (k_tiles // split_k) % tiles_per_expert) to use a single
parenthesized expression with proper alignment, keeping the same logic, and then
append the tactic tuple (mma_tiler_mn, cluster_shape_mn, split_k) to tactics as
before; ensure variables referenced (_MMA_TILE_K, self.weight_per_expert,
k_tiles, tiles_per_expert, split_k) are unchanged.
In `@tensorrt_llm/_torch/cute_dsl_kernels/blackwell/moe_as_dense_gemm/fc2.py`:
- Around line 2265-2273: The calculation of alpha_dim uses
tiled_mma.thr_id.shape directly which can be a layout/tuple; change the division
to use cute.size(tiled_mma.thr_id.shape) so alpha_dim = max(mma_tiler_mnk[0] //
cute.size(tiled_mma.thr_id.shape), mma_tiler_mnk[1]) and recompute alpha_bytes
accordingly (update the expression that constructs alpha_bytes if necessary) to
match usages elsewhere in this file (see references to tiled_mma.thr_id.shape
and cute.size).
- Around line 916-919: Add an explicit validation that k_tile_total is divisible
by self.split_k to avoid silently dropping K-tiles: check the divisibility
either in the class constructor or in can_implement() (where other kernel
constraints are validated) and raise/assert with a clear message if k_tile_total
% self.split_k != 0; refer to k_tile_total, self.split_k, and k_tiles_per_split
when adding the check so it runs before computing k_tiles_per_split.
- Line 1582: The swap_ab split-K epilogue is using m_total =
malpha_scale_mnl.shape[0] and then checking m_global + 7 < m_total which
compares an M-coordinate against the N-sized alpha tensor; instead use the
actual M bound from the output tensor and the existing helper used in the
non-swap path. Replace the m_total/m_global comparison in the swap_ab split-K
branch with a bounds check against mC_raw.shape[0] (or call thread_in_bounds
with m_global and mC_raw) so the epilogue validates M against the real M
dimension of mC_raw, mirroring the non-swap split-K approach.
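The two divisibility conditions mentioned in the inline comments (K-tiles split evenly across `split_k`, and each split covering a whole number of experts) can be sketched as a small filter. This is an illustration under assumptions: it takes `k_tiles = k // mma_tile_k` and `tiles_per_expert = weight_per_expert // mma_tile_k`, and the function names and the tile sizes in the usage lines are hypothetical, not taken from the kernel.

```python
# Sketch of the split-K validity checks; names and tile sizes are illustrative.

SPLIT_K_CANDIDATES = (1, 2, 4)  # supported factors per the PR description


def is_valid_split_k(k_tiles, tiles_per_expert, split_k):
    """Return True if K-tiles divide evenly across splits and experts."""
    if split_k <= 0 or k_tiles % split_k != 0:
        # An uneven split would silently drop trailing K-tiles.
        return False
    k_tiles_per_split = k_tiles // split_k
    # Each split must cover a whole number of experts.
    return k_tiles_per_split % tiles_per_expert == 0


def enumerate_split_k(k, mma_tile_k, weight_per_expert):
    """List the split_k candidates that pass both divisibility checks.

    Assumes k_tiles = k // mma_tile_k and
    tiles_per_expert = weight_per_expert // mma_tile_k (at least 1).
    """
    k_tiles = k // mma_tile_k
    tiles_per_expert = max(weight_per_expert // mma_tile_k, 1)
    return [
        split_k
        for split_k in SPLIT_K_CANDIDATES
        if is_valid_split_k(k_tiles, tiles_per_expert, split_k)
    ]


# 256 K-tiles, one tile per expert: every candidate divides evenly.
print(enumerate_split_k(k=65536, mma_tile_k=256, weight_per_expert=256))
# 6 K-tiles, two tiles per expert: only split_k=1 keeps experts whole.
print(enumerate_split_k(k=1536, mma_tile_k=256, weight_per_expert=512))
```

Rejecting a candidate up front, rather than computing `k_tiles // split_k` and proceeding, is what prevents the silent loss of K-tiles that the review flags.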
PR_Github #47359 [ run ] completed with state
f7b6b2e to b12b493
/bot run --disable-fail-fast
/bot run --disable-fail-fast
PR_Github #47645 [ run ] triggered by Bot. Commit:
PR_Github #47645 [ run ] completed with state
/bot run --disable-fail-fast
PR_Github #47883 [ run ] triggered by Bot. Commit:
PR_Github #47883 [ run ] completed with state
/bot run --disable-fail-fast
PR_Github #48264 [ run ] triggered by Bot. Commit:
…GEMM FC2 Signed-off-by: JacobHu-NV <266902545+JacobHu-NV@users.noreply.github.com>
Introduces swap_ab mode in the Blackwell MoE FC2 kernel plus the accompanying run script. Dedicated warp (warp_id=6) cp.async-loads alpha scale into a pipelined smem buffer; alpha buffer is now token-major (stride=1 along token dim) and token_dim flips to N when swap_ab is enabled. Signed-off-by: JacobHu-NV <266902545+JacobHu-NV@users.noreply.github.com>
The Sm100BlockScaledPersistentDenseGemmKernel.__call__ entry receives m/n/k/l as Int64 for API compatibility, but downstream cutlass.range and cute.size derivations require Int32. Without an explicit cast, autotune profiling at small m (e.g. m=8 from deep_gemm_gen_tuning_buckets) fails with "DSLRuntimeError: expected Int32 for stop, got Int64". Cast at the kernel entry; practical tensor dimensions always fit in 32-bit. Signed-off-by: JacobHu-NV <266902545+JacobHu-NV@users.noreply.github.com>
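The review thread suggests guarding this cast with a range check. A minimal sketch of such a guard in plain Python is shown here; `check_int32_dims` is a hypothetical helper, and the actual cast to `cutlass.Int32` is left out so the snippet runs without the CUTLASS DSL installed.

```python
INT32_MIN = -(2**31)
INT32_MAX = 2**31 - 1


def check_int32_dims(**dims):
    """Raise ValueError naming the first dimension outside signed 32-bit range."""
    for name, value in dims.items():
        if not INT32_MIN <= value <= INT32_MAX:
            raise ValueError(
                f"{name}={value} does not fit in Int32 [{INT32_MIN}, {INT32_MAX}]"
            )
    return dims


# Small-m autotune shape: passes, so the cast that follows would be safe.
check_int32_dims(m=8, n=7168, k=65536, l=1)

try:
    check_int32_dims(m=2**31, n=7168, k=65536, l=1)
except ValueError as err:
    print(err)
```

Naming the offending dimension and value in the message follows the reviewer's request for actionable feedback instead of a silent overflow.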
…ckets Match the FC1 DenseGEMM autotuner setup so FC2 uses the dense per-token buckets generated by deep_gemm_gen_tuning_buckets (8-stride below 128, 128-stride above) instead of the coarse power-of-2 bucket grid, and raise tune_max_num_tokens from 256 to 512. This gives autotune room to pick split-K tactics that benefit small and mid-range m without inflating the cache or losing the M=288 step that the previous bucket grid masked. Signed-off-by: JacobHu-NV <266902545+JacobHu-NV@users.noreply.github.com>
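One possible reading of "8-stride below 128, 128-stride above" is sketched below. The real `deep_gemm_gen_tuning_buckets` is not shown in this PR, so its signature and exact boundary handling are unknown; `gen_buckets_sketch` and its parameters are hypothetical and only illustrate the two-stride idea.

```python
def gen_buckets_sketch(max_num_tokens, fine_stride=8, coarse_stride=128):
    """Illustrative bucket grid: fine stride below 128, coarse stride from 128 up.

    This is an assumption about the shape of the grid, not the real helper.
    """
    fine_stop = min(coarse_stride, max_num_tokens + 1)
    fine = list(range(fine_stride, fine_stop, fine_stride))
    coarse = list(range(coarse_stride, max_num_tokens + 1, coarse_stride))
    return fine + coarse


buckets = gen_buckets_sketch(512)
print(buckets[:4], buckets[-4:], len(buckets))
```

Compared with a power-of-2 grid, a grid like this gives the autotuner many more distinct small-m shapes to profile, which is where the commit message expects split-K tactics to matter.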
…er alpha layout / SF zero-volume
Three discrete fixes that together unblock the full test_nvfp4_dense_gemm_fc2_blackwell parametrize sweep (180 cases) and the production-path test_moe_backend -k DENSEGEMM (8 cases):
1. acc / alpha pipeline cadence (kernel hang). MMA accumulates tiles_per_expert k-tiles into a single tmem stage and commits the acc_pipeline once per expert, but the epilogue and the alpha producer were both iterating k_tile_cnt_local times. The 2nd consumer_wait per expert blocked forever for weight_per_expert > mma_tiler_k. Producer and consumer of both acc_pipeline and alpha_scale_pipeline now iterate experts_per_split = k_tile_cnt_local / tiles_per_expert (compile-time constant per tactic), matching MMA's commit cadence. Applied to both swap_ab and non-swap epilogue paths.
2. Wrapper alpha layout (numerical mismatch). Kernel wrapper builds alpha_scale token-major (token has stride 1, expert has stride m) so that warp 6 can coalesce-load 32 contiguous M alphas per expert. PyTorch's default contiguous (M, expert_count) is expert-major, so the runner now does alpha_scale = alpha_scale.t().contiguous() before taking data_ptr(). This is invisible at m=1 or expert_count=1 but produced 16-34% match for any case with m>1 AND expert>1.
3. Wrapper SF zero-volume at m<128. The A/B SF cute layouts used floor div m // 128 / n // 128 for the block dim, which collapses to 0 for m or n < 128 and made the MMA read undefined SF state. Mirror FC1: use ceil-div m_blocks = (m + 127) // 128 (and the same for n).
Verification on B200 (sm_100):
- test_moe_densegemm.py (kernel-direct, 564 cases): PASS
- test_nvfp4_dense_gemm_fc2_blackwell (FC2 only, 180 cases): PASS
- test_moe_backend.py -k DENSEGEMM (production path, 8 cases): PASS
Signed-off-by: JacobHu-NV <266902545+JacobHu-NV@users.noreply.github.com>
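Fixes 2 and 3 in the commit message above come down to index arithmetic, which can be sketched without PyTorch or CuTe. The helper names below are hypothetical; the list-based `to_token_major` only mimics, on a flat list, the reordering that the commit describes as `alpha_scale.t().contiguous()`.

```python
def ceil_div(a, b):
    """Integer ceiling division for positive b, as in (m + 127) // 128."""
    return (a + b - 1) // b


def token_major_offset(token, expert, token_dim):
    """Linear offset when token has stride 1 and expert has stride token_dim."""
    return token + expert * token_dim


def expert_major_offset(token, expert, expert_count):
    """Linear offset in a contiguous (token_dim, expert_count) array."""
    return token * expert_count + expert


def to_token_major(expert_major, token_dim, expert_count):
    """Reorder a flat expert-major buffer into token-major order."""
    out = [0.0] * (token_dim * expert_count)
    for token in range(token_dim):
        for expert in range(expert_count):
            src = expert_major_offset(token, expert, expert_count)
            out[token_major_offset(token, expert, token_dim)] = expert_major[src]
    return out


# Floor division yields zero SF blocks for m < 128; ceil-div yields one.
print(8 // 128, ceil_div(8, 128))

# Two tokens, three experts: rows are tokens in the expert-major input.
expert_major = [0.0, 1.0, 2.0, 10.0, 11.0, 12.0]
print(to_token_major(expert_major, token_dim=2, expert_count=3))
```

When either dimension is 1 the two offset formulas coincide, which is consistent with the commit's note that the layout mismatch was invisible at m=1 or expert_count=1.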
e5fbbb2 to 7ddc37e
/bot run --disable-fail-fast
PR_Github #48282 [ run ] triggered by Bot. Commit:
PR_Github #48282 [ run ] completed with state
/bot run --disable-fail-fast
PR_Github #48317 [ run ] triggered by Bot. Commit:
PR_Github #48317 [ run ] completed with state
…uning buckets (NVIDIA#13833) Signed-off-by: JacobHu-NV <266902545+JacobHu-NV@users.noreply.github.com>
[None][perf] FC2 DenseGEMM autotune: split-K, swap_ab, fine-grained tuning buckets
Summary
Improves the MoE FC2 DenseGEMM kernel and its autotune setup:
- `(mma_tiler_mn, cluster_shape_mn)` candidates; epilogue uses atomic-add reduction when `split_k > 1`.
- `swap_ab` mode in the Blackwell MoE FC2 kernel (token-major alpha-scale buffer, dedicated `cp.async` warp).
- `m/n/k/l` from `Int64` → `Int32` at the kernel entry; required by downstream `cutlass.range` / `cute.size` for small-m autotune profiling.
- `deep_gemm_gen_tuning_buckets` (8-stride below 128, 128-stride above) and raise `tune_max_num_tokens` 256 → 512 to match FC1.
Performance
NVFP4, n=7168, k=65536, expert_count=256, B200, cold L2.
Brute-force sweep over (m × tactic × split_k × swap_ab) on the new kernel vs baseline (HEAD~4, no split-K / no swap_ab).
Highlights:
- `split_k=2` lifts per-wave occupancy → +10–23%.
- `split_k=4` → +35–39%.
Test Plan
- `torch.ops.trtllm.cute_dsl_nvfp4_dense_gemm_fc2_blackwell`; autotune cache populated for all buckets with no failures.
- `/bot run` on B200 stages.
contributors: @mingyangHao @zongfeijing @JacobHu-NV