Conversation
Adds a custom AIR TopK implementation (header-only, vendored into transformer_engine/common/util/) exposed as a JAX FFI custom call via the TE JAX extension.

Key changes:
- transformer_engine/common/util/air_topk.cu: AIR TopK CUDA kernel
- transformer_engine/common/util/standalone_air_topk.cuh: vendored header
- transformer_engine/common/include/transformer_engine/air_topk.h: C API
- transformer_engine/jax/csrc/extensions/air_topk.cpp: JAX FFI binding
- transformer_engine/jax/cpp_extensions/air_topk.py: Python wrapper
- CMakeLists.txt: compile new kernel; use CCCL from CUDA toolkit
- CMakeLists.txt: fix SM100 arch handling when all arches are special-cased

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: dcampora <961215+dcampora@users.noreply.github.com>
for more information, see https://pre-commit.ci
Greptile Summary

This PR vendors the AIR radix-selection top-K algorithm as a new CUDA kernel.

Confidence Score: 5/5. Safe to merge; all findings are P2 style/hygiene issues that do not affect correctness or runtime behaviour. All P0/P1 concerns from prior review rounds have been addressed. The four remaining comments are P2: two style issues in the vendored header (global-namespace helpers, dead code with magic constants), one unresolved UB TODO in a union that nvcc handles correctly in practice, and one missing Python-level guard for k > seq_len that the kernel already handles gracefully.

Important Files Changed: transformer_engine/common/util/standalone_topk.cuh (three minor P2 issues); transformer_engine/jax/cpp_extensions/topk.py (missing k <= seq_len guard)
Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
A["topk(x, k_value)\nPython API"] --> B{"x.ndim == 1?"}
B -->|yes| C["unsqueeze → (1, seq_len)"]
B -->|no| D["(batch_size, seq_len)"]
C --> D
D --> E["TopKPrimitive.outer_primitive.bind()\nlengths = full(batch_size, seq_len, int32)"]
E --> F["TopkFFI (C++)\nJAX FFI handler"]
F --> G["nvte_topk (C API)\ntopk.cu"]
G --> H{"len ≤ 32768?"}
H -->|yes – one-block| I["radix_topk_one_block_kernel\n<<<batch_size, 1024>>>"]
H -->|no – multi-block| J["calc_grid_dim → grid_dim\n(cached sm_cnt)"]
J --> K["radix_kernel loop\n<<<grid_dim × batch, 256>>>"]
K --> L["last_filter_kernel"]
I --> M["out_keys (batch, k)\nout_indices (batch, k)"]
L --> M
M --> N{"squeezed?"}
N -->|yes| O["squeeze → (k,)"]
N -->|no| P["return (values, indices)"]
O --> P
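The radix kernels in the flowchart implement radix selection: each pass histograms one digit group of the keys' bit patterns, keeps whole buckets that are guaranteed to lie inside the top-k, and narrows the candidate set to the single bucket that must contain the k-th largest element. Below is a simplified, single-threaded NumPy sketch of that selection loop for unsigned 32-bit keys only; the real AIR kernels parallelize the histogram and filtering steps across thread blocks and handle floating-point keys via order-preserving bit transforms, so treat this purely as an illustration of the algorithmic idea:

```python
import numpy as np

def radix_topk(values, k, digit_bits=8):
    """Return the k largest uint32 values via radix selection.

    Scans digits from most to least significant. In each pass, buckets
    strictly above the k-th largest are kept outright; only the bucket
    containing the k-th largest survives into the next pass.
    """
    candidates = np.asarray(values, dtype=np.uint32)
    selected = []          # elements already known to be in the top-k
    remaining = k          # how many slots are still unfilled
    for shift in range(32 - digit_bits, -1, -digit_bits):
        digits = (candidates >> shift) & ((1 << digit_bits) - 1)
        hist = np.bincount(digits, minlength=1 << digit_bits)
        for d in range((1 << digit_bits) - 1, -1, -1):
            if hist[d] < remaining:
                # Whole bucket fits inside the top-k: keep it all.
                selected.extend(candidates[digits == d].tolist())
                remaining -= int(hist[d])
            else:
                # The k-th largest lies in this bucket: recurse into it.
                candidates = candidates[digits == d]
                break
        if len(candidates) == remaining:
            break          # bucket is exactly the rest of the top-k
    selected.extend(candidates.tolist()[:remaining])  # resolve ties
    return sorted(selected, reverse=True)
```

The early-exit when the surviving bucket size equals the number of unfilled slots mirrors the kernel's `last_filter` step, which emits the final candidates without further digit passes.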
Reviews (3): Last reviewed commit: "[pre-commit.ci] auto fixes from pre-comm..."
…ing export, cache sm_cnt

- Move WARP_SIZE/WARP_BITS/FULL_WARP_MASK/VECTORIZED_READ_SIZE into namespace nv
- Remove unused keys_element_bytes variable in AirTopkFFI; collapse switch to dtype validation
- Add missing `from .air_topk import *` export in jax/cpp_extensions/__init__.py
- Cache sm_cnt per device with static vars to avoid repeated cudaGetDevice/cudaDeviceGetAttribute calls
- Add CMAKE_BUILD_WITH_INSTALL_RPATH=ON to build_ext.py

Signed-off-by: dcampora <961215+dcampora@users.noreply.github.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
36e8405 to 1e6c976
for more information, see https://pre-commit.ci
Remove the `air_` prefix from all TopK-related identifiers: file names, C API functions (nvte_air_topk -> nvte_topk), FFI handler/primitive names (te_air_topk_ffi -> te_topk_ffi), Python symbols, and the internal `air_topk` namespace in standalone_topk.cuh. No functional changes. Signed-off-by: Diego Campora <dcampora@nvidia.com> Signed-off-by: dcampora <961215+dcampora@users.noreply.github.com>
for more information, see https://pre-commit.ci
 * \param[in] k Top-K count.
 * \return Required workspace size in bytes.
 */
size_t nvte_get_topk_workspace_bytes(int batch_size, int seq_len, int k);
In the other parts of TE we follow the convention of running the main function with empty workspace to get the size, rather than a specialized function, see e.g. the layernorm functions. Could we make that consistent?
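The convention the reviewer refers to uses a single entry point for both the size query and the actual run: calling it with an empty workspace makes it report the bytes it needs instead of computing, and the caller then allocates and calls again. A hedged Python sketch of the two-phase pattern (all names hypothetical; TE's real layernorm-style API operates on NVTETensor objects in C, and the size requirement here is made up):

```python
import numpy as np

def topk_with_workspace(x, k, workspace):
    """Hypothetical entry point following the TE workspace convention:
    with an empty workspace, record the required size and return
    without doing work; otherwise run using the provided buffer."""
    needed = x.shape[0] * 2 * np.dtype(np.int32).itemsize  # made-up requirement
    if workspace["data"] is None:
        workspace["size"] = needed            # phase 1: size query only
        return None
    assert workspace["size"] >= needed
    idx = np.argsort(x)[::-1][:k]             # stand-in for the kernel
    return x[idx], idx

# Two-phase call: query, allocate, run.
x = np.array([0.3, 0.9, 0.1, 0.7])
ws = {"data": None, "size": 0}
topk_with_workspace(x, 2, ws)                 # query required bytes
ws["data"] = np.empty(ws["size"], np.uint8)   # allocate
values, indices = topk_with_workspace(x, 2, ws)
```

The benefit over a dedicated `nvte_get_*_workspace_bytes` function is that size computation and kernel launch can never drift apart, since both live in the same code path.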
// Helper: convert a float literal to type T without relying on implicit
// conversions (needed when __CUDA_NO_BFLOAT16_CONVERSIONS__ is defined).
@@ -0,0 +1,1281 @@
/*************************************************************************
A general comment - there is some duplication here with the rest of the codebase. My assumption is though that this is mostly temporary and we will want to switch to cub once it has this implementation, so I'm fine with merging this file as is.
Yes, this is temporary and we will switch to cub the moment the optimizations to top-k land there.
# If all architectures were special-cased and removed, disable CMake's automatic
# CUDA_ARCHITECTURES management — compilation flags are set via COMPILE_OPTIONS below.
if(NOT CMAKE_CUDA_ARCHITECTURES)
  set(CMAKE_CUDA_ARCHITECTURES OFF)
endif()
This change is not needed for this PR.
if squeezed:
    x = x[jnp.newaxis, :]  # (1, seq_len)

batch_size, seq_len = x.shape
nit self-resolve: Can we add this assert before this line?
assert x.ndim == 2, f"topk expected 2D input tensor 'x' but {x.shape=}"
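Together with the missing `k <= seq_len` guard noted in the Greptile summary, the suggested assert slots naturally into the wrapper's shape handling. A hedged NumPy sketch of the 1-D/2-D logic from the flowchart, using `np.argpartition` as a stand-in for the CUDA kernel (the real wrapper is JAX-based and binds `TopKPrimitive` instead):

```python
import numpy as np

def topk(x, k):
    # Accept 1-D input by temporarily promoting it to a single-row batch.
    squeezed = x.ndim == 1
    if squeezed:
        x = x[np.newaxis, :]                     # (1, seq_len)
    assert x.ndim == 2, f"topk expected 2D input tensor 'x' but {x.shape=}"
    batch_size, seq_len = x.shape
    assert k <= seq_len, f"k={k} exceeds seq_len={seq_len}"
    # Partial-select the k largest per row, then sort just those k.
    part = np.argpartition(x, seq_len - k, axis=-1)[:, seq_len - k:]
    part_vals = np.take_along_axis(x, part, axis=-1)
    order = np.argsort(-part_vals, axis=-1)      # descending within top-k
    indices = np.take_along_axis(part, order, axis=-1)
    values = np.take_along_axis(part_vals, order, axis=-1)
    if squeezed:
        values, indices = values[0], indices[0]  # back to (k,)
    return values, indices
```

Asserting before unpacking `x.shape` keeps the error message about dimensionality rather than a tuple-unpacking failure, which is the point of the reviewer's nit.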
/te-ci
Description
Adds a custom AIR TopK implementation (header-only, vendored into transformer_engine/common/util/) exposed as a JAX FFI custom call via the TE JAX extension.
Type of change
Changes
transformer_engine/common/util/air_topk.cu: AIR TopK CUDA kerneltransformer_engine/common/util/standalone_air_topk.cuh: vendored header (AIR TopK, header-only)transformer_engine/common/include/transformer_engine/air_topk.h: C APItransformer_engine/jax/csrc/extensions/air_topk.cpp: JAX FFI bindingtransformer_engine/jax/cpp_extensions/air_topk.py: Python wrappertransformer_engine/common/CMakeLists.txt: compile new kernel; use CCCL from CUDA toolkit; fix SM100 arch handling when all arches are special-casedChecklist: