Add a capability to indicate that a tensor has unit stride in the last dimension. When true, we can elide loading and multiplying by the last stride. The last dimension being unit-stride is the nominal case since that is what is created via make_tensor() without user-provided strides.
This capability approach applies to the matxOpT*Kernel dispatch. It will not apply to custom kernels in MatX.
Signed-off-by: Thomas Benson <tbenson@nvidia.com>
Greptile Summary
This PR adds a
Confidence Score: 5/5
Safe to merge: the optimization is correctly gated by an AND-query through the full expression tree, so it cannot fire when any tensor has a non-unit last stride.
All leaf-tensor stride checks are runtime-evaluated before dispatch; the AND-query propagation (verified in SetOp, PermuteOp, SliceOp, and interp operators) ensures the fast path is only taken when the invariant is truly satisfied. No incorrect memory access patterns are possible. The m_ workspace exclusion in Interp1Op is safe because AllocateTempTensor always produces a contiguous tensor. Prior P1 concerns from the thread are resolved. Only P2-level observations remain. No files require special attention.
Flowchart
%%{init: {'theme': 'neutral'}}%%
flowchart TD
A["CudaExecutor::run(op)"] --> B["get_operator_capability<UNIT_STRIDE_LAST>(op)"]
B --> C{"AND-query through\nexpression tree"}
C --> D["SetOp → combines out_ + op_"]
D --> E["PermuteOp / SliceOp\n→ generic else branch\n→ propagates to child"]
D --> F["tensor_impl_t\nStride(Rank()-1) == 1 ?"]
E --> F
C --> G["Non-tensor ops\n→ default_value = true"]
F -->|true| H["unit_stride_last = true"]
F -->|false| I["unit_stride_last = false"]
H --> J["dispatch_kernel<EPT, true>\n→ CapType::unit_stride_last = true"]
I --> K["dispatch_kernel<EPT, false>\n→ CapType::unit_stride_last = false"]
J --> L["DimStride: last dim\nreturns idx_val\nOR idx_val * EPT\n(no stride load)"]
K --> M["DimStride: last dim\nreturns idx_val * stride\nOR idx_val * stride * EPT"]
Reviews (3): Last reviewed commit: "Propagate/combine capabilities through I..."
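The AND-query in the flowchart can be illustrated with a small host-side sketch. Everything here is hypothetical and is not MatX code: `Leaf`, `Scalar`, `Add`, and `unit_stride_last` are stand-in names chosen for illustration. The sketch assumes the rules stated above: a tensor leaf reports whether its last stride is 1, a non-tensor operand defaults to true, and a composite operator ANDs the answers of its children.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-ins for expression-tree nodes (not MatX types).
struct Leaf { std::ptrdiff_t last_stride; };  // tensor leaf
struct Scalar { float value; };               // non-tensor operand
template <typename L, typename R>
struct Add { L lhs; R rhs; };                 // binary operator node

// A tensor leaf satisfies the capability only if its last stride is 1.
inline bool unit_stride_last(const Leaf &t) { return t.last_stride == 1; }

// Non-tensor operands never index memory, so they default to true.
inline bool unit_stride_last(const Scalar &) { return true; }

// Composite operators AND the results of their children, so one
// non-unit-stride leaf anywhere in the tree disables the fast path.
template <typename L, typename R>
bool unit_stride_last(const Add<L, R> &op) {
  return unit_stride_last(op.lhs) && unit_stride_last(op.rhs);
}
```

Because the combination is a logical AND with a default of true, an operator that does not touch memory cannot block the fast path, while a single strided leaf always does.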
With this capability, a single large set kernel (i.e., (t = 1.0f).run()) executes 18% fewer instructions. There is little performance benefit in that case since the kernel is memory bound, but it does verify that the capability effectively elides some instructions.
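As a rough illustration of where the saved instructions come from, the following host-side sketch computes a 2D element offset with and without the unit-stride assumption. It is not the MatX implementation: `View2D`, `linear_offset`, and `offset_dispatch` are hypothetical names, and the real dispatch selects a kernel specialization rather than a plain function.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical 2D view holding only strides (not a MatX type).
struct View2D {
  std::array<std::ptrdiff_t, 2> stride;
};

// Linear element offset for index (i, j).
// When UnitStrideLast is true, the last stride is assumed to be 1,
// so it is neither loaded nor multiplied.
template <bool UnitStrideLast>
std::ptrdiff_t linear_offset(const View2D &v, std::ptrdiff_t i,
                             std::ptrdiff_t j) {
  if constexpr (UnitStrideLast) {
    return i * v.stride[0] + j;
  } else {
    return i * v.stride[0] + j * v.stride[1];
  }
}

// Check the stride once at runtime, then pick the compile-time
// specialization; this mirrors the "query, then dispatch" idea.
inline std::ptrdiff_t offset_dispatch(const View2D &v, std::ptrdiff_t i,
                                      std::ptrdiff_t j) {
  return v.stride[1] == 1 ? linear_offset<true>(v, i, j)
                          : linear_offset<false>(v, i, j);
}
```

Both paths return the same offset whenever the last stride is 1; the specialized path simply skips one load and one multiply per element, which is consistent with a lower instruction count but no speedup for a memory-bound kernel.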