Add fast-path TensorAccessor & Optimize SarBp/ChannelizePoly #1161
tbensonatl wants to merge 3 commits into main
Conversation
Adds detail::TensorAccessor (a per-kernel wrapper with an IsUnitStride fast path) and plumbs it through the SAR BP and channelize_poly transforms. Extends the SAR BP shared-mem preamble to float and mixed precisions.

Adds a CTILE=32 variant of ChannelizePoly1D_SmemTiled for num_channels <= 32, which replaces the generic ChannelizePoly1D kernel for most small-channel oversampled configs (M=32/D=16, M=40/D=20, etc.) at roughly 3x the throughput. Adds a Full filter smem layout (single copy at smem[p*M + phase]) alongside the existing Rotated layout; the dispatcher picks the smaller footprint. Refactors the SmemTiled FIR loops to use a single running filter_idx counter, saving 3 registers/thread.

Scopes all channelize_poly internals under matx::detail::cpoly, dropping the redundant MATX_CHANNELIZE_POLY1D_ and matxChannelizePoly1DInternal_ prefixes. Adds support for num_channels = 1 to channelize_poly() for completeness, although such cases should use conv1d() for better performance.

The TensorAccessor class improves performance for tensors with unit stride by eliding the load and multiply of the last-dimension stride. Speedups range from 3-12%, depending on the GPU and precision. The float path for the sarbp example improved by > 3x on GPUs with reduced double-precision throughput, but that is due to amortizing the cost of double-to-float conversions.

General polyphase channelizer improvements from the TensorAccessor class range from 0-10%. Oversampled cases with <= 48 channels improved by > 2x due to the addition of a tiled shared memory implementation with a smaller tiling factor.

Signed-off-by: Thomas Benson <tbenson@nvidia.com>
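A rough sketch of the fast-path idea described above. This is not the PR's actual detail::TensorAccessor (the class name, members, and construction interface below are illustrative assumptions); it only shows how a unit-stride wrapper can precompute the outer strides and index raw data directly, while the generic path forwards to the wrapped operator.

```cpp
// Illustrative-only accessor: FastPath=true indexes a raw pointer using
// preloaded outer strides (last dim assumed to have unit stride);
// FastPath=false forwards to the wrapped operator's generic operator().
#include <cstdint>

template <typename Op, typename T, int RANK, bool FastPath>
class AccessorSketch {
public:
  __host__ AccessorSketch(Op op, T *data, const int64_t (&strides)[RANK])
      : op_(op), data_(data) {
    for (int d = 0; d + 1 < RANK; d++) {
      outer_strides_[d] = strides[d];   // preload all but the last stride
    }
  }

  template <typename... Is>
  __device__ decltype(auto) operator()(Is... is) const {
    if constexpr (FastPath) {
      const int64_t idx[sizeof...(Is)]{static_cast<int64_t>(is)...};
      int64_t offset = idx[sizeof...(Is) - 1];   // last-dim stride elided (== 1)
      for (int d = 0; d + 1 < RANK; d++) {
        offset += idx[d] * outer_strides_[d];
      }
      return data_[offset];
    } else {
      return op_(is...);   // generic operator evaluation
    }
  }

private:
  Op op_;
  T *data_;
  int64_t outer_strides_[RANK > 1 ? RANK - 1 : 1];
};
```

Keeping FastPath a compile-time parameter is what lets the inner loop skip the load and multiply of the last-dimension stride, which is where the 3-12% speedups quoted above come from.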
Greptile Summary

This PR introduces a fast-path detail::TensorAccessor and optimizes the SAR BP and channelize_poly transforms. The previously flagged P1 concern about scalar range_to_mcp values has been resolved.

Confidence Score: 5/5 — Safe to merge; no P0/P1 findings; the previously flagged scalar range_to_mcp concern is resolved and all P2s are non-blocking. All P1 concerns from the previous review round have been addressed: the scalar range_to_mcp API now requires a MatX operator (enforced by static_assert), the int{0} rtm_acc placeholder is gone, and all existing callers already use tensor views. The three remaining P2s (CUDA_ARCH allowlist, const accessor mutability, managed-memory test assumption) are quality/style concerns that do not affect correctness of the current implementation.

Important Files Changed: include/matx/kernels/sar_bp.cuh — CUDA_ARCH allowlist will need updating as new GPU generations are released.
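The static_assert enforcement mentioned above could look roughly like the following; the helper name is hypothetical and the exact trait and message used in the PR may differ.

```cpp
// Hypothetical sketch only: reject plain scalars for range_to_mcp at compile
// time and require a MatX operator (e.g. a rank-0 tensor view) instead.
#include "matx.h"

template <typename RtmOp>
void require_matx_op_for_rtm(const RtmOp &) {
  static_assert(matx::is_matx_op<RtmOp>(),
                "range_to_mcp must be a MatX operator (e.g. a rank-0 tensor), not a scalar");
}
```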
Flowchart
%%{init: {'theme': 'neutral'}}%%
flowchart TD
A[sar_bp_impl / channelize_poly_impl] --> B{is_tensor_view_v for hot inputs?}
B -- No --> C[IsUnitStride = false\nlaunch slow-path kernel]
B -- Yes --> D{runtime: all last-dim strides == 1?}
D -- No --> C
D -- Yes --> E[IsUnitStride = true\nlaunch fast-path kernel]
E --> F[TensorAccessor FastPath=true\ndata_ = op.Data\nouter_strides_ preloaded]
C --> G[TensorAccessor FastPath=false\nforwards to op operator]
F --> H["operator()(is...) → data_[offset]\nno stride reload in inner loop"]
G --> I["operator()(is...) → op(is...)\ngeneric MatX evaluation"]
subgraph channelize_poly dispatch
J[num_channels ≤ 6 and D==M and real] --> K[FusedChan kernel]
L[D==M and smem fits] --> M[Smem kernel]
N[SmemTiled smem fits] --> O{num_channels ≤ 32?}
O -- Yes --> P[SmemTiled CTILE=32]
O -- No --> Q[SmemTiled CTILE=64]
R[fallback] --> S[Generic kernel]
end
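A companion host-side sketch of the unit-stride dispatch shown in the flowchart above. The kernel, helper, and launch configuration are illustrative placeholders; is_tensor_view_v, Rank(), and Stride() are the existing tensor-view APIs the diagram refers to.

```cpp
// Illustrative dispatch: a runtime check of the last-dimension stride selects
// a compile-time FastPath instantiation of the kernel.
#include <cuda_runtime.h>
#include "matx.h"

template <bool FastPath, typename OutOp, typename InOp>
__global__ void ExampleKernel(OutOp out, InOp in) {
  // ... a real kernel would wrap out/in in FastPath accessors and do the work ...
}

template <typename Op>
bool LastDimUnitStride(const Op &op) {
  if constexpr (matx::is_tensor_view_v<Op>) {
    return op.Stride(Op::Rank() - 1) == 1;   // runtime stride check
  } else {
    return false;                            // non-views stay on the slow path
  }
}

template <typename OutOp, typename InOp>
void LaunchExample(OutOp out, InOp in, cudaStream_t stream) {
  const dim3 grid(64), block(256);           // placeholder launch configuration
  if (LastDimUnitStride(out) && LastDimUnitStride(in)) {
    ExampleKernel<true><<<grid, block, 0, stream>>>(out, in);
  } else {
    ExampleKernel<false><<<grid, block, 0, stream>>>(out, in);
  }
}
```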
Reviews (3): Last reviewed commit: "Remove assert() from ChannelizePoly1D_Sm..."
/build
Constant range_to_mcp values can be provided as rank-0 tensors. Signed-off-by: Thomas Benson <tbenson@nvidia.com>
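A brief sketch of what this commit enables: a constant range_to_mcp supplied as a rank-0 tensor instead of a scalar. The surrounding sar_bp() call is omitted, the value is illustrative, and the make_tensor<T>({}) 0-D form is assumed.

```cpp
#include "matx.h"
using namespace matx;

void set_constant_rtm()
{
  auto range_to_mcp = make_tensor<double>({});   // rank-0 (0-D) tensor
  range_to_mcp() = 1.0e4;                        // constant range to the motion-compensation point
  // range_to_mcp can then be passed where sar_bp() expects the range_to_mcp operator
}
```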
/build
// rotations[] array size. Assert this invariant at runtime.
assert(K <= detail::MATX_CHANNELIZE_POLY1D_SMEM_TILED_MAX_ROTATIONS);
int32_t rotations[detail::MATX_CHANNELIZE_POLY1D_SMEM_TILED_MAX_ROTATIONS];
assert(K <= detail::cpoly::SmemTiledMaxRotations);
Do we want to use assert in device code? I thought it has a first-time and register penalty.
Good point. It should be removed with NDEBUG defined, but we can rely on the transform dispatch checks to keep this from impacting debug build performance too much.
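For context on this exchange: device-side assert() follows the host semantics and compiles out when NDEBUG is defined, so its cost is confined to debug builds. A minimal sketch of the pattern being discussed:

```cpp
#include <cassert>

__device__ void check_rotation_count(int K, int max_rotations) {
  // With -DNDEBUG this expands to nothing; otherwise a violated invariant
  // triggers a device-side trap that halts the kernel.
  assert(K <= max_rotations);
}
```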
Signed-off-by: Thomas Benson <tbenson@nvidia.com>