
Add fast-path TensorAccessor & Optimize SarBp/ChannelizePoly #1161

Open

tbensonatl wants to merge 3 commits into main from add-tensor-accessor-optimize-sarbp-cpoly

Conversation

@tbensonatl
Collaborator

Adds detail::TensorAccessor (a per-kernel wrapper with an IsUnitStride fast path) and plumbs it through the SAR BP and channelize_poly transforms. Other changes:

- Extends the SAR BP shared-mem preamble to float and mixed precisions.
- Adds a CTILE=32 variant of ChannelizePoly1D_SmemTiled for num_channels <= 32, which replaces the generic ChannelizePoly1D kernel for most small-channel oversampled configs (M=32/D=16, M=40/D=20, etc.) at roughly 3x the throughput.
- Adds a Full filter smem layout (single copy at smem[p*M + phase]) alongside the existing Rotated layout; the dispatcher picks the smaller footprint.
- Refactors the SmemTiled FIR loops to use a single running filter_idx counter, saving 3 registers/thread.
- Scopes all channelize_poly internals under matx::detail::cpoly, dropping the redundant MATX_CHANNELIZE_POLY1D_ and matxChannelizePoly1DInternal_ prefixes.
- Adds support for num_channels = 1 to channelize_poly() for completeness, although such cases should use conv1d() for better performance.

The TensorAccessor class improves performance for tensors with unit stride by eliding the load and multiply of the last-dimension stride. Speedups range from 3-12%, depending on the GPU and precision. The float path for the sarbp example improved by > 3x on GPUs with reduced double-precision throughput, but that is due to amortizing the cost of double-to-float conversions.
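
As a rough illustration of the mechanism (a minimal sketch only; the class and member names below, such as GenericAccessor2D, FastAccessor2D, data_, and stride0_, are hypothetical and do not match the actual detail::TensorAccessor code), the unit-stride fast path drops the last-dimension stride from the indexing arithmetic entirely:

```cpp
// Minimal sketch of a unit-stride fast-path accessor. All names are illustrative.
#include <cstdint>
#include <type_traits>

template <typename T>
struct GenericAccessor2D {
  T* data_;
  int64_t strides_[2];  // both strides consulted on every access
  __device__ T& operator()(int64_t i, int64_t j) const {
    // Generic path: load and multiply by the last-dimension stride each time.
    return data_[i * strides_[0] + j * strides_[1]];
  }
};

template <typename T>
struct FastAccessor2D {
  T* data_;
  int64_t stride0_;  // only the outer stride is captured at kernel entry
  __device__ T& operator()(int64_t i, int64_t j) const {
    // Fast path: last dimension is known to be unit stride, so no load/multiply.
    return data_[i * stride0_ + j];
  }
};

// A kernel template can pick the accessor at compile time, e.g.:
template <bool IsUnitStride, typename T>
using Accessor2D =
    std::conditional_t<IsUnitStride, FastAccessor2D<T>, GenericAccessor2D<T>>;
```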

General polyphase channelizer improvements from the TensorAccessor class range from 0-10%. Oversampled cases with <= 48 channels improved by > 2x due to the addition of a tiled shared memory implementation with a smaller tiling factor.

…Tiled expansion


Signed-off-by: Thomas Benson <tbenson@nvidia.com>
tbensonatl self-assigned this Apr 24, 2026
@copy-pr-bot

copy-pr-bot Bot commented Apr 24, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@greptile-apps
Contributor

greptile-apps Bot commented Apr 24, 2026

Greptile Summary

This PR introduces detail::TensorAccessor — a compile-time-selected wrapper that replaces the per-access stride reload in the SAR BP and channelize_poly hot loops with a single pointer-plus-outer-strides snapshot taken at kernel entry, yielding 3–12% throughput gains. It simultaneously extends the SAR BP float/mixed shared-memory preamble, adds a CTILE=32 ChannelizePoly1D_SmemTiled variant for num_channels ≤ 32 (≈3× speedup over the generic path), and introduces a Full filter smem layout that saves footprint for small-channel oversampled configs.

The previously flagged P1 concern about scalar range_to_mcp has been addressed — the API now requires a MatX operator (enforced by static_assert), and all existing callers already use make_tensor. The rtm_acc int{0} placeholder pattern is fully replaced. Three P2 observations remain (see inline comments): the __CUDA_ARCH__ allowlist, the mutable return from const operator(), and the managed-memory assumption in the new accessor test.

Confidence Score: 5/5

Safe to merge — no P0/P1 findings; the previously flagged scalar range_to_mcp concern is resolved and all P2s are non-blocking.

All P1 concerns from the previous review round have been addressed: the scalar range_to_mcp API now requires a MatX operator (enforced by static_assert), the int{0} rtm_acc placeholder is gone, and all existing callers already use tensor views. The three remaining P2s (CUDA_ARCH allowlist, const accessor mutability, managed-memory test assumption) are quality/style concerns that do not affect correctness of the current implementation.

include/matx/kernels/sar_bp.cuh — CUDA_ARCH allowlist will need updating as new GPU generations are released.

Important Files Changed

| Filename | Overview |
| --- | --- |
| include/matx/kernels/tensor_accessor.h | New TensorAccessor/BoundAccessor wrapper with correct fast/slow path dispatch; const operator() returns mutable T& by design, matching existing MatX patterns |
| include/matx/kernels/sar_bp.cuh | Extends shared-mem preamble to Float/Mixed, replaces scalar range_to_mcp with required MatX operator, plumbs TensorAccessor fast-path; CUDA_ARCH whitelist still forward-exclusive |
| include/matx/transforms/sar_bp.h | Adds unit-stride fast-path detection and dispatches IsUnitStride kernel instantiation; all existing callers already use tensor views for range_to_mcp |
| include/matx/kernels/channelize_poly.cuh | Adds Full filter smem layout, CTILE=32 SmemTiled variant, running filter_idx counter; TensorAccessor plumbed through all kernel entry points |
| include/matx/transforms/channelize_poly.h | Renamed to matx::detail::cpoly namespace, adds SmemTiled CTILE=32 dispatcher, Full filter layout selection, and num_channels=1 support; dispatch logic is correct |
| include/matx/core/tensor.h | Single-line change: adds MATX_DEVICE to Stride() to make it callable from device kernels |
| test/00_misc/TensorAccessorTests.cu | New host-side unit tests for TensorAccessor fast/slow paths; FastPathWriteThrough does host-side writes assuming managed memory |
| test/00_transform/ChannelizePoly.cu | Adds num_channels=1 regression test and updates existing tests for cpoly namespace rename |
| test/CMakeLists.txt | Adds TensorAccessorTests target to the build |

Flowchart

```mermaid
%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[sar_bp_impl / channelize_poly_impl] --> B{is_tensor_view_v for hot inputs?}
    B -- No --> C[IsUnitStride = false\nlaunch slow-path kernel]
    B -- Yes --> D{runtime: all last-dim strides == 1?}
    D -- No --> C
    D -- Yes --> E[IsUnitStride = true\nlaunch fast-path kernel]

    E --> F[TensorAccessor FastPath=true\ndata_ = op.Data\nouter_strides_ preloaded]
    C --> G[TensorAccessor FastPath=false\nforwards to op operator]

    F --> H["operator()(is...) → data_[offset]\nno stride reload in inner loop"]
    G --> I["operator()(is...) → op(is...)\ngeneric MatX evaluation"]

    subgraph channelize_poly dispatch
      J[num_channels ≤ 6 and D==M and real] --> K[FusedChan kernel]
      L[D==M and smem fits] --> M[Smem kernel]
      N[SmemTiled smem fits] --> O{num_channels ≤ 32?}
      O -- Yes --> P[SmemTiled CTILE=32]
      O -- No --> Q[SmemTiled CTILE=64]
      R[fallback] --> S[Generic kernel]
    end
```
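
The host-side selection between the two kernel instantiations can be pictured with the following sketch (hypothetical names only: DemoAccessor, DemoKernel, and LaunchDemo are not MatX symbols; the real dispatch lives in the sar_bp and channelize_poly transforms):

```cpp
// Sketch: a runtime unit-stride check selects a compile-time kernel specialization.
#include <cuda_runtime.h>
#include <cstdint>

struct DemoAccessor {
  float* data;
  int64_t strides[2];  // strides[1] == 1 enables the fast path
};

template <bool IsUnitStride>
__global__ void DemoKernel(DemoAccessor acc, int64_t rows, int64_t cols) {
  const int64_t i = blockIdx.y * blockDim.y + threadIdx.y;
  const int64_t j = blockIdx.x * blockDim.x + threadIdx.x;
  if (i >= rows || j >= cols) return;
  if constexpr (IsUnitStride) {
    acc.data[i * acc.strides[0] + j] += 1.0f;                   // last-dim stride elided
  } else {
    acc.data[i * acc.strides[0] + j * acc.strides[1]] += 1.0f;  // generic indexing
  }
}

void LaunchDemo(DemoAccessor acc, int64_t rows, int64_t cols, cudaStream_t stream) {
  const dim3 block(32, 8);
  const dim3 grid(static_cast<unsigned>((cols + block.x - 1) / block.x),
                  static_cast<unsigned>((rows + block.y - 1) / block.y));
  // Runtime check picks the IsUnitStride specialization.
  if (acc.strides[1] == 1) {
    DemoKernel<true><<<grid, block, 0, stream>>>(acc, rows, cols);
  } else {
    DemoKernel<false><<<grid, block, 0, stream>>>(acc, rows, cols);
  }
}
```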

Reviews (3): Last reviewed commit: "Remove assert() from ChannelizePoly1D_Sm..."

@tbensonatl
Collaborator Author

/build

tbensonatl requested a review from cliffburdick April 24, 2026 21:21
Constant range_to_mcp values can be provided as rank-0 tensors.

Signed-off-by: Thomas Benson <tbenson@nvidia.com>
@tbensonatl
Collaborator Author

/build

```cpp
// rotations[] array size. Assert this invariant at runtime.
assert(K <= detail::MATX_CHANNELIZE_POLY1D_SMEM_TILED_MAX_ROTATIONS);
int32_t rotations[detail::MATX_CHANNELIZE_POLY1D_SMEM_TILED_MAX_ROTATIONS];
assert(K <= detail::cpoly::SmemTiledMaxRotations);
```

Collaborator

Do we want to use assert in device code? I thought it has a first-time and register penalty.

Collaborator Author

Good point. It should be removed with NDEBUG defined, but we can rely on the transform dispatch checks to keep this from impacting debug build performance too much.
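
For reference, a minimal illustration of the behavior being relied on here (not MatX code; the kernel below is hypothetical): with nvcc, the standard assert() is usable in device code and compiles out entirely when NDEBUG is defined, so the register and branch cost applies only to debug builds.

```cpp
// Hypothetical kernel showing that device-side assert() disappears under NDEBUG.
#include <cassert>

__global__ void CheckedKernel(const int* in, int* out, int n, int k, int max_k) {
  // Generates no code when NDEBUG is defined (typical release builds).
  assert(k <= max_k);
  const int idx = blockIdx.x * blockDim.x + threadIdx.x;
  if (idx < n) {
    out[idx] = in[idx] * k;
  }
}
```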

@coveralls

Coverage Status

Coverage is 91.858% when pulling add-tensor-accessor-optimize-sarbp-cpoly into main. No base build found for main.

Signed-off-by: Thomas Benson <tbenson@nvidia.com>