Add a unit_stride_last capability #1153

Merged

tbensonatl merged 3 commits into main from add-unit-stride-last-dispatch-optimization on Apr 15, 2026

Conversation

@tbensonatl tbensonatl commented Apr 13, 2026

Add a capability to indicate that a tensor has unit stride in the last dimension. When true, we can elide loading and multiplying by the last stride. The last dimension being unit-stride is the nominal case since that is what is created via make_tensor() without user-provided strides.

This capability approach applies to the matxOpT*Kernel dispatch. It will not apply to custom kernels in MatX.

For a single large set kernel (i.e., (t = 1.0f).run()), this results in 18% fewer executed instructions. Since that kernel is memory bound, there is little performance benefit in that case, but it verifies that the capability effectively elides instructions.
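The core of the optimization described above can be sketched in host-side C++. This is an illustrative reconstruction, not the MatX internals: the function name and signature are hypothetical, but it shows how a compile-time unit-stride flag lets the last dimension's stride load and multiply be elided.

```cpp
#include <array>
#include <cstdint>

// Hypothetical sketch (not the actual MatX code): compute a linear offset
// from multi-dimensional indices and strides. When UnitStrideLast is true,
// the last dimension's stride is known to be 1 at compile time, so the
// load and multiply for that stride are elided.
template <bool UnitStrideLast, int Rank>
int64_t linear_offset(const std::array<int64_t, Rank>& idx,
                      const std::array<int64_t, Rank>& strides) {
  int64_t off = 0;
  for (int d = 0; d < Rank - 1; ++d) {
    off += idx[d] * strides[d];
  }
  if constexpr (UnitStrideLast) {
    off += idx[Rank - 1];                       // stride == 1: no load, no multiply
  } else {
    off += idx[Rank - 1] * strides[Rank - 1];   // general path
  }
  return off;
}
```

Both instantiations produce the same offset when the last stride really is 1; the fast path simply skips the redundant arithmetic.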

copy-pr-bot bot commented Apr 13, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

greptile-apps bot commented Apr 13, 2026

Greptile Summary

This PR adds a UNIT_STRIDE_LAST capability that elides one ULDC + IMAD instruction per tensor access in the last dimension when all leaf tensors in an expression have stride-1 in that dimension, which is the common case for make_tensor()-created tensors. The capability uses an AND-query so it is automatically disabled for transposed, sliced, or otherwise non-unit-stride tensors. The interp.h operators are also fixed to properly propagate capabilities to their child operands (they previously returned the default value for all capabilities except ELEMENTS_PER_THREAD).
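The AND-query semantics described above can be illustrated with a small tree sketch. This is a hypothetical model, not the MatX capability API: each node combines its own answer with all of its children, so a single non-unit-stride leaf (e.g., a transposed tensor) disables the fast path for the entire expression.

```cpp
#include <vector>

// Hypothetical sketch of an AND-combined capability query (not the MatX
// API). Leaf tensors answer based on their runtime strides; non-tensor
// operators contribute the default value (true) and just forward the
// query to their operands.
struct Node {
  bool self_unit_stride_last;   // for leaves: Stride(Rank()-1) == 1
  std::vector<Node> children;

  bool query_unit_stride_last() const {
    bool cap = self_unit_stride_last;
    for (const auto& c : children) {
      cap = cap && c.query_unit_stride_last();  // AND-combine through the tree
    }
    return cap;
  }
};
```

With this combining rule, the fast path can only be selected when every leaf tensor in the expression satisfies the unit-stride invariant.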

Confidence Score: 5/5

Safe to merge — optimization is correctly gated by an AND-query through the full expression tree, so it cannot fire when any tensor has a non-unit last stride.

All leaf-tensor stride checks are runtime-evaluated before dispatch; the AND-query propagation (verified in SetOp, PermuteOp, SliceOp, and interp operators) ensures the fast path is only taken when the invariant is truly satisfied. No incorrect memory access patterns are possible. The m_ workspace exclusion in Interp1Op is safe because AllocateTempTensor always produces a contiguous tensor. Prior P1 concerns from the thread are resolved. Only P2-level observations remain.

No files require special attention.

Important Files Changed

| Filename | Overview |
| --- | --- |
| include/matx/core/capabilities.h | Adds the UNIT_STRIDE_LAST enum, a CapabilityParams template parameter, a capability_attributes specialization (default true, AND-query), and a get_query_type case. All consistent and correct. |
| include/matx/core/tensor_impl.h | Introduces a DimStride helper to elide the stride load/multiply for the last dim; refactors GetOffsetOptimized, GetVal, and GetValC to use CapType instead of EPT; adds a runtime Stride check for UNIT_STRIDE_LAST in get_capability. Logic is correct. |
| include/matx/executors/cuda.h | Adds a runtime unit_stride_last check before kernel dispatch; both the rank ≤ 4 and rank > 4 paths branch on the bool to select USL=true/false CapType instantiations. Correct and complete. |
| include/matx/executors/kernel.h | matxOpTDKernel (rank > 4) gains a CapType template param and passes it through operator() calls, completing the 5D+ dispatch path. Clean change. |
| include/matx/operators/interp.h | Fixes capability propagation in both LTOIR and Interp1Op get_capability; these previously returned the default for all non-EPT capabilities and now correctly combine with all child operands. |
| include/matx/core/nvrtc_helper.h | Comment-only addition noting that JIT CapabilityParams deliberately omits unit_stride_last (strides are constexpr in JIT, so the compiler handles it). No logic change. |
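The cuda.h change in the table above follows a common runtime-to-compile-time dispatch pattern. The sketch below is illustrative (the function names are hypothetical, not the actual MatX symbols): a runtime capability query result selects between two compile-time instantiations of the kernel launcher.

```cpp
#include <string>

// Hypothetical sketch of the dispatch pattern (not the MatX code): the
// runtime unit_stride_last bool chooses between two template
// instantiations, so the offset math inside the kernel specializes at
// compile time.
template <bool USL>
std::string launch_kernel() {
  // In MatX, this would launch matxOpT*Kernel with a CapType whose
  // unit_stride_last field equals USL.
  return USL ? "fast path: last-dim stride elided"
             : "general path: full stride arithmetic";
}

std::string dispatch(bool unit_stride_last) {
  return unit_stride_last ? launch_kernel<true>()
                          : launch_kernel<false>();
}
```

The branch happens once per launch on the host, so the per-element cost of checking the stride at runtime is avoided entirely inside the kernel.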

Flowchart

```mermaid
%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A["CudaExecutor::run(op)"] --> B["get_operator_capability&lt;UNIT_STRIDE_LAST&gt;(op)"]
    B --> C{"AND-query through\nexpression tree"}
    C --> D["SetOp → combines out_ + op_"]
    D --> E["PermuteOp / SliceOp\n→ generic else branch\n→ propagates to child"]
    D --> F["tensor_impl_t\nStride(Rank()-1) == 1 ?"]
    E --> F
    C --> G["Non-tensor ops\n→ default_value = true"]
    F -->|true| H["unit_stride_last = true"]
    F -->|false| I["unit_stride_last = false"]
    H --> J["dispatch_kernel&lt;EPT, true&gt;\n→ CapType::unit_stride_last = true"]
    I --> K["dispatch_kernel&lt;EPT, false&gt;\n→ CapType::unit_stride_last = false"]
    J --> L["DimStride: last dim\nreturns idx_val\nOR idx_val * EPT\n(no stride load)"]
    K --> M["DimStride: last dim\nreturns idx_val * stride\nOR idx_val * stride * EPT"]
```
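The DimStride behavior in the flowchart's final nodes can be sketched as a standalone function. This is a hypothetical reconstruction (the real MatX helper differs in signature and context): for the last dimension under unit_stride_last=true the index is used directly, optionally scaled by the elements-per-thread factor, with no stride load or multiply.

```cpp
// Hypothetical sketch of the per-dimension offset contribution described
// in the flowchart (not the actual MatX DimStride). EPT is the
// elements-per-thread vectorization factor applied along the last dim.
template <bool UnitStrideLast, int EPT>
long long dim_stride(int dim, int rank, long long idx_val, long long stride) {
  if (dim == rank - 1) {
    if constexpr (UnitStrideLast) {
      return idx_val * EPT;           // stride known to be 1: no load, no multiply
    }
    return idx_val * stride * EPT;    // general last-dim path
  }
  return idx_val * stride;            // non-last dims always use the stride
}
```

With EPT = 1 this reduces to the scalar case from the flowchart (idx_val vs. idx_val * stride).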

Reviews (3): Last reviewed commit: "Propagate/combine capabilities through I..."

Comment thread include/matx/executors/cuda.h
Comment thread include/matx/core/tensor_impl.h
Comment thread include/matx/core/tensor_impl.h
@cliffburdick

/build

Signed-off-by: Thomas Benson <tbenson@nvidia.com>
@tbensonatl

/build

Signed-off-by: Thomas Benson <tbenson@nvidia.com>
@tbensonatl

/build

2 similar comments
@tbensonatl

/build

@cliffburdick

/build

@coveralls

Coverage Status

Coverage is 91.843% when pulling add-unit-stride-last-dispatch-optimization into main. No base build found for main.

@tbensonatl tbensonatl merged commit d2f550c into main Apr 15, 2026
1 check passed
@tbensonatl tbensonatl deleted the add-unit-stride-last-dispatch-optimization branch April 15, 2026 13:02