Add a unit_stride_last capability #1153

Merged

tbensonatl merged 3 commits into main from add-unit-stride-last-dispatch-optimization on Apr 15, 2026

Conversation

@tbensonatl tbensonatl commented Apr 13, 2026

Add a capability to indicate that a tensor has unit stride in the last dimension. When true, we can elide loading and multiplying by the last stride. The last dimension being unit-stride is the nominal case since that is what is created via make_tensor() without user-provided strides.

This capability approach applies to the matxOpT*Kernel dispatch. It will not apply to custom kernels in MatX.

For a single large set kernel (i.e., (t = 1.0f).run()), this results in 18% fewer executed instructions. Since that kernel is memory bound, there is little performance benefit in that case, but it verifies that the capability effectively elides instructions.
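The core of the optimization described above can be sketched in host-side C++. This is an illustrative reconstruction, not the MatX internals: the function name and signature are hypothetical, but it shows how a compile-time unit-stride flag lets the last dimension's stride load and multiply be elided.

```cpp
#include <array>
#include <cstdint>

// Hypothetical sketch (not the actual MatX code): compute a linear offset
// from multi-dimensional indices and strides. When UnitStrideLast is true,
// the last dimension's stride is known to be 1 at compile time, so the
// load and multiply for that stride are elided.
template <bool UnitStrideLast, int Rank>
int64_t linear_offset(const std::array<int64_t, Rank>& idx,
                      const std::array<int64_t, Rank>& strides) {
  int64_t off = 0;
  for (int d = 0; d < Rank - 1; ++d) {
    off += idx[d] * strides[d];
  }
  if constexpr (UnitStrideLast) {
    off += idx[Rank - 1];                       // stride == 1: no load, no multiply
  } else {
    off += idx[Rank - 1] * strides[Rank - 1];   // general path
  }
  return off;
}
```

Both instantiations produce the same offset when the last stride really is 1; the fast path simply skips the redundant arithmetic.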

copy-pr-bot bot commented Apr 13, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

greptile-apps bot commented Apr 13, 2026

Greptile Summary

This PR adds a UNIT_STRIDE_LAST capability that elides one ULDC + IMAD instruction per tensor access in the last dimension when all leaf tensors in an expression have stride-1 in that dimension, which is the common case for make_tensor()-created tensors. The capability uses an AND-query so it is automatically disabled for transposed, sliced, or otherwise non-unit-stride tensors. The interp.h operators are also fixed to properly propagate capabilities to their child operands (they previously returned the default value for all capabilities except ELEMENTS_PER_THREAD).
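The AND-query semantics described above can be illustrated with a small tree sketch. This is a hypothetical model, not the MatX capability API: each node combines its own answer with all of its children, so a single non-unit-stride leaf (e.g., a transposed tensor) disables the fast path for the entire expression.

```cpp
#include <vector>

// Hypothetical sketch of an AND-combined capability query (not the MatX
// API). Leaf tensors answer based on their runtime strides; non-tensor
// operators contribute the default value (true) and just forward the
// query to their operands.
struct Node {
  bool self_unit_stride_last;   // for leaves: Stride(Rank()-1) == 1
  std::vector<Node> children;

  bool query_unit_stride_last() const {
    bool cap = self_unit_stride_last;
    for (const auto& c : children) {
      cap = cap && c.query_unit_stride_last();  // AND-combine through the tree
    }
    return cap;
  }
};
```

With this combining rule, the fast path can only be selected when every leaf tensor in the expression satisfies the unit-stride invariant.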

Confidence Score: 5/5

Safe to merge — optimization is correctly gated by an AND-query through the full expression tree, so it cannot fire when any tensor has a non-unit last stride.

All leaf-tensor stride checks are runtime-evaluated before dispatch; the AND-query propagation (verified in SetOp, PermuteOp, SliceOp, and interp operators) ensures the fast path is only taken when the invariant is truly satisfied. No incorrect memory access patterns are possible. The m_ workspace exclusion in Interp1Op is safe because AllocateTempTensor always produces a contiguous tensor. Prior P1 concerns from the thread are resolved. Only P2-level observations remain.

No files require special attention.

Important Files Changed

| Filename | Overview |
| --- | --- |
| include/matx/core/capabilities.h | Adds the UNIT_STRIDE_LAST enum, a CapabilityParams template parameter, a capability_attributes specialization (default true, AND-query), and a get_query_type case. All consistent and correct. |
| include/matx/core/tensor_impl.h | Introduces a DimStride helper to elide the stride load/multiply for the last dim; refactors GetOffsetOptimized, GetVal, and GetValC to use CapType instead of EPT; adds a runtime Stride check for UNIT_STRIDE_LAST in get_capability. Logic is correct. |
| include/matx/executors/cuda.h | Adds a runtime unit_stride_last check before kernel dispatch; both the rank ≤ 4 and rank > 4 paths branch on the bool to select USL=true/false CapType instantiations. Correct and complete. |
| include/matx/executors/kernel.h | matxOpTDKernel (rank > 4) gains a CapType template param and passes it through operator() calls, completing the 5D+ dispatch path. Clean change. |
| include/matx/operators/interp.h | Fixes capability propagation in both LTOIR and Interp1Op get_capability; these previously returned the default for all non-EPT capabilities and now correctly combine with all child operands. |
| include/matx/core/nvrtc_helper.h | Comment-only addition noting that JIT CapabilityParams deliberately omits unit_stride_last (strides are constexpr in JIT, so the compiler handles it). No logic change. |
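The cuda.h change in the table above follows a common runtime-to-compile-time dispatch pattern. The sketch below is illustrative (the function names are hypothetical, not the actual MatX symbols): a runtime capability query result selects between two compile-time instantiations of the kernel launcher.

```cpp
#include <string>

// Hypothetical sketch of the dispatch pattern (not the MatX code): the
// runtime unit_stride_last bool chooses between two template
// instantiations, so the offset math inside the kernel specializes at
// compile time.
template <bool USL>
std::string launch_kernel() {
  // In MatX, this would launch matxOpT*Kernel with a CapType whose
  // unit_stride_last field equals USL.
  return USL ? "fast path: last-dim stride elided"
             : "general path: full stride arithmetic";
}

std::string dispatch(bool unit_stride_last) {
  return unit_stride_last ? launch_kernel<true>()
                          : launch_kernel<false>();
}
```

The branch happens once per launch on the host, so the per-element cost of checking the stride at runtime is avoided entirely inside the kernel.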

Flowchart

```mermaid
%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A["CudaExecutor::run(op)"] --> B["get_operator_capability&lt;UNIT_STRIDE_LAST&gt;(op)"]
    B --> C{"AND-query through\nexpression tree"}
    C --> D["SetOp → combines out_ + op_"]
    D --> E["PermuteOp / SliceOp\n→ generic else branch\n→ propagates to child"]
    D --> F["tensor_impl_t\nStride(Rank()-1) == 1 ?"]
    E --> F
    C --> G["Non-tensor ops\n→ default_value = true"]
    F -->|true| H["unit_stride_last = true"]
    F -->|false| I["unit_stride_last = false"]
    H --> J["dispatch_kernel&lt;EPT, true&gt;\n→ CapType::unit_stride_last = true"]
    I --> K["dispatch_kernel&lt;EPT, false&gt;\n→ CapType::unit_stride_last = false"]
    J --> L["DimStride: last dim\nreturns idx_val\nOR idx_val * EPT\n(no stride load)"]
    K --> M["DimStride: last dim\nreturns idx_val * stride\nOR idx_val * stride * EPT"]
```
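The DimStride behavior in the flowchart's final nodes can be sketched as a standalone function. This is a hypothetical reconstruction (the real MatX helper differs in signature and context): for the last dimension under unit_stride_last=true the index is used directly, optionally scaled by the elements-per-thread factor, with no stride load or multiply.

```cpp
// Hypothetical sketch of the per-dimension offset contribution described
// in the flowchart (not the actual MatX DimStride). EPT is the
// elements-per-thread vectorization factor applied along the last dim.
template <bool UnitStrideLast, int EPT>
long long dim_stride(int dim, int rank, long long idx_val, long long stride) {
  if (dim == rank - 1) {
    if constexpr (UnitStrideLast) {
      return idx_val * EPT;           // stride known to be 1: no load, no multiply
    }
    return idx_val * stride * EPT;    // general last-dim path
  }
  return idx_val * stride;            // non-last dims always use the stride
}
```

With EPT = 1 this reduces to the scalar case from the flowchart (idx_val vs. idx_val * stride).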

Reviews (3): Last reviewed commit: "Propagate/combine capabilities through I..."

Comment thread include/matx/executors/cuda.h
Comment thread include/matx/core/tensor_impl.h
Comment thread include/matx/core/tensor_impl.h
@cliffburdick

/build

Signed-off-by: Thomas Benson <tbenson@nvidia.com>
@tbensonatl

/build

Signed-off-by: Thomas Benson <tbenson@nvidia.com>
@tbensonatl

/build

2 similar comments
@tbensonatl

/build

@cliffburdick

/build

@coveralls

Coverage Status

Coverage is 91.843% when pulling add-unit-stride-last-dispatch-optimization into main. No base build found for main.

@tbensonatl tbensonatl merged commit d2f550c into main Apr 15, 2026
1 check passed
@tbensonatl tbensonatl deleted the add-unit-stride-last-dispatch-optimization branch April 15, 2026 13:02