[PyTorch] Expose function to bulk-allocate tensors backed by the same buffer #2900
Conversation
[PyTorch] Add bulk_allocate utility and use it in quantized tensor allocators

Introduces transformer_engine/pytorch/csrc/extensions/allocate.cpp with a general-purpose bulk_allocate function: given parallel lists of shapes, dtypes, and per-tensor byte alignments, it computes a packed layout, does a single CUDA allocation, and returns at::from_blob views whose deleters keep the backing buffer alive. The three internal bulk_allocate_*_tensors helpers in cast.cpp are refactored to call bulk_allocate instead of each owning a copy of the make_torch_view lambda and the offset-computation loops (~120 lines removed). The new function is also exposed via pybind11 so Python can allocate packed CUDA buffers directly without going through a quantizer.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Tim Moon <tmoon@nvidia.com>
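For illustration, the packed-layout scheme this commit describes can be sketched in a few lines of Python. The helper below is hypothetical (the real implementation is C++ in allocate.cpp) and assumes each requested alignment is a multiple of the dtype's element size:

```python
import torch

def bulk_allocate_sketch(shapes, dtypes, alignments, device="cuda"):
    """Illustrative sketch only; not the actual C++ implementation."""
    # Packed layout: round each tensor's offset up to its requested
    # byte alignment, then accumulate its size.
    offsets, total = [], 0
    for shape, dtype, align in zip(shapes, dtypes, alignments):
        total = (total + align - 1) // align * align  # round up to alignment
        offsets.append(total)
        total += dtype.itemsize * torch.Size(shape).numel()
    # A single allocation backs every tensor.
    base = torch.empty(total, dtype=torch.uint8, device=device)
    # Slicing and viewing share base's storage, so the buffer stays alive
    # as long as any view does (the C++ version achieves the same thing
    # with at::from_blob deleters holding a shared_ptr).
    views = []
    for shape, dtype, off in zip(shapes, dtypes, offsets):
        nbytes = dtype.itemsize * torch.Size(shape).numel()
        views.append(base[off:off + nbytes].view(dtype).view(shape))
    return views

# Example: three bfloat16 tensors, each 256B-aligned, one allocation.
ws = bulk_allocate_sketch([(4, 8), (16,), (2, 2)],
                          [torch.bfloat16] * 3, [256] * 3)
```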
Greptile Summary
This PR introduces a dedicated bulk-allocation function.

Confidence Score: 5/5. The change is safe to merge: it is a pure performance optimization with well-contained scope. A new allocation helper replaces equivalent per-tensor allocations, and the refactored cast.cpp paths preserve the same memory-layout semantics. The core bulk_allocate logic is straightforward: offset arithmetic, a single CUDA allocation, and reference-counted views. The alignment padding is conservative and correct. All three Python call-sites pass arguments in the right positional order. The cast.cpp refactoring replaces identical local lambdas with equivalent calls to the new primitive, and the contiguous_data_and_scale rewrite is semantically equivalent to the original cumulative-offset check. No files require special attention; all changes are self-contained, and the critical path is covered by the three Python call-sites.
Sequence Diagram

```mermaid
sequenceDiagram
    participant Py as Python (grouped_linear backward)
    participant PB as pybind11
    participant BA as bulk_allocate (C++)
    participant Alloc as CUDA Allocator
    Py->>PB: tex.bulk_allocate(shapes, dtypes, device, alignments)
    Note over PB: GIL released
    PB->>BA: bulk_allocate(shapes, dtypes, device, alignments)
    BA->>BA: "compute per-tensor offsets & base_alignment"
    BA->>BA: "base_byte_size += base_alignment (padding)"
    BA->>Alloc: "at::empty({base_byte_size}, kUInt8, device)"
    Alloc-->>BA: "base_buffer (shared_ptr<at::Tensor>)"
    BA->>BA: align base_ptr to base_alignment
    loop for each tensor i
        alt "byte_sizes[i] == 0"
            BA->>Alloc: at::empty(shape_i, dtype_i)
            Alloc-->>BA: standalone empty tensor
        else
            BA->>BA: at::from_blob(base_ptr+offset[i], shape_i, deleter, dtype_i)
        end
    end
    BA-->>PB: "vector<at::Tensor>"
    PB-->>Py: List[torch.Tensor] (wgrad_list)
    Note over Py: base_buffer kept alive via shared_ptr in each view deleter
```
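Concretely, the Python side of this flow might look like the sketch below. The module alias `tex`, its import path, and the exact argument types are assumptions inferred from the diagram, not confirmed API:

```python
import torch
import transformer_engine_torch as tex  # import path is an assumption

num_experts = 8
shapes = [(4096, 1024)] * num_experts        # one wgrad per expert
dtypes = [torch.bfloat16] * num_experts
alignments = [256] * num_experts             # 256B alignment, as in this PR
wgrad_list = tex.bulk_allocate(shapes, dtypes, torch.device("cuda"), alignments)
# Per the diagram, zero-byte entries fall back to standalone empty tensors;
# all other views share one packed base buffer, which is freed only after
# every view in wgrad_list has been released.
```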
Reviews (7). Last reviewed commit: "Merge branch 'main' into tmoon/group-mlp…"
// Check whether data and scales can be packed in contiguous
// buffer. Amaxes are not contiguous since they are aligned to
// 16B.
I don't think I understand the logic here. What is the case where we would not be able to pack those buffers together if we can control the alignment requirements for the individual tensors in the allocation?
I think there was some alignment requirement for the weight tensors, and some implementation difference depending on whether the weights are contiguous. In any case, this is the logic in the existing implementation and I'm trying to avoid functional changes.
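For context, the cumulative-offset contiguity check being discussed can be sketched as follows. This is illustrative Python, not the actual contiguous_data_and_scale code; the function name is hypothetical, and per the diff comment it is the 16B-aligned amaxes that break contiguity:

```python
def is_contiguous_pack(tensors) -> bool:
    """True iff each tensor starts exactly where the previous one ends."""
    if not tensors:
        return True
    expected_ptr = tensors[0].data_ptr()
    for t in tensors:
        if t.data_ptr() != expected_ptr:
            return False  # alignment padding (e.g. 16B-aligned amaxes) left a gap
        expected_ptr += t.numel() * t.element_size()
    return True
```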
Apply review suggestions: make device and alignment optional args, handle the case where the base data_ptr is unaligned, and align grouped linear wgrad buffers to 256B. Signed-off-by: Tim Moon <tmoon@nvidia.com>
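The unaligned-base handling mentioned here amounts to over-allocating by the base alignment and rounding the start address up. A minimal sketch, with an illustrative helper name:

```python
import torch

def aligned_base(byte_size: int, base_alignment: int = 256):
    """Illustrative sketch of the padding trick described in this commit."""
    # Over-allocate so an aligned start address always fits in the buffer.
    buf = torch.empty(byte_size + base_alignment, dtype=torch.uint8, device="cuda")
    skip = (-buf.data_ptr()) % base_alignment  # bytes until the next aligned address
    return buf, skip  # carve per-tensor views out of buf[skip:]
```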
[PyTorch] Expose function to bulk-allocate tensors backed by the same buffer (NVIDIA#2900)

Squashed commits:

* [PyTorch] Add bulk_allocate utility and use it in quantized tensor allocators (described above)
* Bulk-allocate wgrads in grouped linear impls
* [pre-commit.ci] auto fixes from pre-commit.com hooks (see https://pre-commit.ci)
* Apply review suggestions: make optional args for device and alignment, handle case where base data_ptr is unaligned, align grouped linear wgrad buffers to 256B
* Nits from Claude
* [pre-commit.ci] auto fixes from pre-commit.com hooks (see https://pre-commit.ci)
* Fix incorrect call to `bulk_allocate`
* Fix ambiguous return type
* [pre-commit.ci] auto fixes from pre-commit.com hooks (see https://pre-commit.ci)
* Use c10::Device consistently

Signed-off-by: Tim Moon <tmoon@nvidia.com>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Description
Allocating PyTorch tensors carries non-trivial CPU overhead, which becomes especially painful when allocating per-expert tensors in the grouped linear layer. This PR generalizes the bulk-allocation approach used in the split-quantize functions (see #1793), which allocates a single large buffer and creates tensor subviews with at::from_blob. We add a dedicated bulk-allocation function exposed to Python, refactor the split-quantize functions to use it, and bulk-allocate wgrads in the grouped linear implementations.

This is incremental progress toward #2897. When I run the grouped MLP benchmark discussed in that issue, I see a runtime reduction of 150 us (for reference, the backward pass takes ~2.1 ms).
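For a sense of what changes at the call-site, the per-expert allocation pattern being replaced versus the bulk path might look like this. This is a hedged sketch: the module alias `tex`, its import path, and the argument types are assumptions:

```python
import torch
import transformer_engine_torch as tex  # import path is an assumption

shapes = [(4096, 1024)] * 8  # one wgrad shape per expert

# Before: one CPU-side allocator round-trip per expert.
wgrad_list = [torch.empty(s, dtype=torch.bfloat16, device="cuda") for s in shapes]

# After: a single packed allocation with 256B-aligned per-tensor views,
# amortizing the per-tensor allocation overhead.
wgrad_list = tex.bulk_allocate(shapes, [torch.bfloat16] * len(shapes),
                               torch.device("cuda"), [256] * len(shapes))
```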