Add fast-path TensorAccessor & Optimize SarBp/ChannelizePoly #1161
tbensonatl wants to merge 3 commits into main
Conversation
Adds detail::TensorAccessor (a per-kernel wrapper with an IsUnitStride fast path) and plumbs it through the SAR BP and channelize_poly transforms. Extends the SAR BP shared-mem preamble to float and mixed precisions.

Adds a CTILE=32 variant of ChannelizePoly1D_SmemTiled for num_channels <= 32, which replaces the generic ChannelizePoly1D kernel for most small-channel oversampled configs (M=32/D=16, M=40/D=20, etc.) at roughly 3x the throughput. Adds a Full filter smem layout (single copy at smem[p*M + phase]) alongside the existing Rotated layout; the dispatcher picks the smaller footprint. Refactors the SmemTiled FIR loops to use a single running filter_idx counter, saving 3 registers/thread.

Scopes all channelize_poly internals under matx::detail::cpoly, dropping the redundant MATX_CHANNELIZE_POLY1D_ and matxChannelizePoly1DInternal_ prefixes. Adds support for num_channels = 1 to channelize_poly() for completeness, although such cases should use conv1d() for better performance.

The TensorAccessor class improves performance for tensors with unit stride by eliding the load and multiply of the last-dimension stride. Speedups range from 3-12%, depending on the GPU and precision. The float path for the sarbp example improved by > 3x on GPUs with reduced double-precision throughput, but that is due to amortizing the cost of double-to-float conversions.

General polyphase channelizer improvements from the TensorAccessor class range from 0-10%. Oversampled cases with <= 48 channels improved by > 2x due to the addition of a tiled shared memory implementation with a smaller tiling factor.

Signed-off-by: Thomas Benson <tbenson@nvidia.com>
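A rough sketch of the fast-path idea described above. This is not the PR's actual detail::TensorAccessor (the class name, members, and construction interface below are illustrative assumptions); it only shows how a unit-stride wrapper can precompute the outer strides and index raw data directly, while the generic path forwards to the wrapped operator.

```cpp
// Illustrative-only accessor: FastPath=true indexes a raw pointer using
// preloaded outer strides (last dim assumed to have unit stride);
// FastPath=false forwards to the wrapped operator's generic operator().
#include <cstdint>

template <typename Op, typename T, int RANK, bool FastPath>
class AccessorSketch {
public:
  __host__ AccessorSketch(Op op, T *data, const int64_t (&strides)[RANK])
      : op_(op), data_(data) {
    for (int d = 0; d + 1 < RANK; d++) {
      outer_strides_[d] = strides[d];   // preload all but the last stride
    }
  }

  template <typename... Is>
  __device__ decltype(auto) operator()(Is... is) const {
    if constexpr (FastPath) {
      const int64_t idx[sizeof...(Is)]{static_cast<int64_t>(is)...};
      int64_t offset = idx[sizeof...(Is) - 1];   // last-dim stride elided (== 1)
      for (int d = 0; d + 1 < RANK; d++) {
        offset += idx[d] * outer_strides_[d];
      }
      return data_[offset];
    } else {
      return op_(is...);   // generic operator evaluation
    }
  }

private:
  Op op_;
  T *data_;
  int64_t outer_strides_[RANK > 1 ? RANK - 1 : 1];
};
```

Keeping FastPath a compile-time parameter is what lets the inner loop skip the load and multiply of the last-dimension stride, which is where the 3-12% speedups quoted above come from.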
Greptile Summary

This PR introduces a fast-path detail::TensorAccessor and optimizes the SAR BP and channelize_poly transforms. The previously flagged P1 concern about scalar range_to_mcp values has been resolved.

Confidence Score: 5/5 — Safe to merge; no P0/P1 findings; the previously flagged scalar range_to_mcp concern is resolved and all P2s are non-blocking. All P1 concerns from the previous review round have been addressed: the scalar range_to_mcp API now requires a MatX operator (enforced by static_assert), the int{0} rtm_acc placeholder is gone, and all existing callers already use tensor views. The three remaining P2s (CUDA_ARCH allowlist, const accessor mutability, managed-memory test assumption) are quality/style concerns that do not affect correctness of the current implementation.

Important Files Changed: include/matx/kernels/sar_bp.cuh — CUDA_ARCH allowlist will need updating as new GPU generations are released.
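The static_assert enforcement mentioned above could look roughly like the following; the helper name is hypothetical and the exact trait and message used in the PR may differ.

```cpp
// Hypothetical sketch only: reject plain scalars for range_to_mcp at compile
// time and require a MatX operator (e.g. a rank-0 tensor view) instead.
#include "matx.h"

template <typename RtmOp>
void require_matx_op_for_rtm(const RtmOp &) {
  static_assert(matx::is_matx_op<RtmOp>(),
                "range_to_mcp must be a MatX operator (e.g. a rank-0 tensor), not a scalar");
}
```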
Flowchart
%%{init: {'theme': 'neutral'}}%%
flowchart TD
A[sar_bp_impl / channelize_poly_impl] --> B{is_tensor_view_v for hot inputs?}
B -- No --> C[IsUnitStride = false\nlaunch slow-path kernel]
B -- Yes --> D{runtime: all last-dim strides == 1?}
D -- No --> C
D -- Yes --> E[IsUnitStride = true\nlaunch fast-path kernel]
E --> F[TensorAccessor FastPath=true\ndata_ = op.Data\nouter_strides_ preloaded]
C --> G[TensorAccessor FastPath=false\nforwards to op operator]
F --> H["operator()(is...) → data_[offset]\nno stride reload in inner loop"]
G --> I["operator()(is...) → op(is...)\ngeneric MatX evaluation"]
subgraph channelize_poly dispatch
J[num_channels ≤ 6 and D==M and real] --> K[FusedChan kernel]
L[D==M and smem fits] --> M[Smem kernel]
N[SmemTiled smem fits] --> O{num_channels ≤ 32?}
O -- Yes --> P[SmemTiled CTILE=32]
O -- No --> Q[SmemTiled CTILE=64]
R[fallback] --> S[Generic kernel]
end
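A companion host-side sketch of the unit-stride dispatch shown in the flowchart above. The kernel, helper, and launch configuration are illustrative placeholders; is_tensor_view_v, Rank(), and Stride() are the existing tensor-view APIs the diagram refers to.

```cpp
// Illustrative dispatch: a runtime check of the last-dimension stride selects
// a compile-time FastPath instantiation of the kernel.
#include <cuda_runtime.h>
#include "matx.h"

template <bool FastPath, typename OutOp, typename InOp>
__global__ void ExampleKernel(OutOp out, InOp in) {
  // ... a real kernel would wrap out/in in FastPath accessors and do the work ...
}

template <typename Op>
bool LastDimUnitStride(const Op &op) {
  if constexpr (matx::is_tensor_view_v<Op>) {
    return op.Stride(Op::Rank() - 1) == 1;   // runtime stride check
  } else {
    return false;                            // non-views stay on the slow path
  }
}

template <typename OutOp, typename InOp>
void LaunchExample(OutOp out, InOp in, cudaStream_t stream) {
  const dim3 grid(64), block(256);           // placeholder launch configuration
  if (LastDimUnitStride(out) && LastDimUnitStride(in)) {
    ExampleKernel<true><<<grid, block, 0, stream>>>(out, in);
  } else {
    ExampleKernel<false><<<grid, block, 0, stream>>>(out, in);
  }
}
```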
Reviews (3): Last reviewed commit: "Remove assert() from ChannelizePoly1D_Sm..."
/build
Constant range_to_mcp values can be provided as rank-0 tensors. Signed-off-by: Thomas Benson <tbenson@nvidia.com>
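A brief sketch of what this commit enables: a constant range_to_mcp supplied as a rank-0 tensor instead of a scalar. The surrounding sar_bp() call is omitted, the value is illustrative, and the make_tensor<T>({}) 0-D form is assumed.

```cpp
#include "matx.h"
using namespace matx;

void set_constant_rtm()
{
  auto range_to_mcp = make_tensor<double>({});   // rank-0 (0-D) tensor
  range_to_mcp() = 1.0e4;                        // constant range to the motion-compensation point
  // range_to_mcp can then be passed where sar_bp() expects the range_to_mcp operator
}
```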
/build
// rotations[] array size. Assert this invariant at runtime.
assert(K <= detail::MATX_CHANNELIZE_POLY1D_SMEM_TILED_MAX_ROTATIONS);
int32_t rotations[detail::MATX_CHANNELIZE_POLY1D_SMEM_TILED_MAX_ROTATIONS];
assert(K <= detail::cpoly::SmemTiledMaxRotations);
Do we want to use assert in device code? I thought it has a first-time and register penalty.
Good point. It should be removed with NDEBUG defined, but we can rely on the transform dispatch checks to keep this from impacting debug build performance too much.
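For context on this exchange: device-side assert() follows the host semantics and compiles out when NDEBUG is defined, so its cost is confined to debug builds. A minimal sketch of the pattern being discussed:

```cpp
#include <cassert>

__device__ void check_rotation_count(int K, int max_rotations) {
  // With -DNDEBUG this expands to nothing; otherwise a violated invariant
  // triggers a device-side trap that halts the kernel.
  assert(K <= max_rotations);
}
```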
Signed-off-by: Thomas Benson <tbenson@nvidia.com>