
[PyTorch] Expose function to bulk-allocate tensors backed by the same buffer #2900

Merged

timmoon10 merged 14 commits into NVIDIA:main from timmoon10:tmoon/group-mlp-bulk-allocate on May 12, 2026

Conversation

@timmoon10
Collaborator

Description

Allocating PyTorch tensors carries non-trivial CPU overhead, which becomes especially painful when allocating per-expert tensors in the grouped linear layer. This PR generalizes the bulk-allocation approach used in the split-quantize functions (see #1793): allocate a single large buffer and create tensor subviews with at::from_blob. We add a dedicated bulk-allocation function that is exposed to Python, refactor the split-quantize functions to use it, and bulk-allocate wgrads in the grouped linear implementations.

This is incremental progress toward #2897. When I run the grouped MLP benchmark discussed in that issue, I see a runtime reduction of 150 us (for reference, the backward pass takes ~2.1 ms).
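
To make the approach concrete, here is a minimal pure-PyTorch sketch of the same idea: one flat byte buffer, with each requested tensor created as an aligned view into it. This is illustrative only; the actual implementation lives in C++ (allocate.cpp) and builds the views with at::from_blob, and the function name and defaults below are hypothetical.

    import math
    import torch

    def bulk_allocate_sketch(shapes, dtypes, device="cuda", alignment=256):
        """Allocate one flat byte buffer and return per-tensor views into it."""
        offsets, total = [], 0
        for shape, dtype in zip(shapes, dtypes):
            # Round the running offset up to the requested alignment.
            total = (total + alignment - 1) // alignment * alignment
            offsets.append(total)
            total += math.prod(shape) * torch.empty((), dtype=dtype).element_size()

        # One allocation instead of one torch.empty call per tensor.
        buf = torch.empty(total, dtype=torch.uint8, device=device)

        views = []
        for shape, dtype, off in zip(shapes, dtypes, offsets):
            nbytes = math.prod(shape) * torch.empty((), dtype=dtype).element_size()
            # Reinterpret an aligned byte slice as the requested dtype and shape.
            # All views share buf's storage, so the buffer stays alive as long as
            # any view exists (the C++ version achieves this with a shared_ptr
            # captured in each from_blob deleter).
            views.append(buf[off:off + nbytes].view(dtype).view(shape))
        return views

The real extension additionally handles zero-size tensors and unaligned base pointers, as described in the review summary below.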

Type of change

  • Documentation change (change only to the documentation, either a fix or new content)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Infra/Build change
  • Code refactoring

Changes

  • Add dedicated function for bulk-allocating PyTorch tensors
  • Bulk-allocate wgrad tensors in grouped linear implementations

Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

timmoon10 and others added 4 commits April 18, 2026 01:19
[PyTorch] Add bulk_allocate utility and use it in quantized tensor allocators

Introduces transformer_engine/pytorch/csrc/extensions/allocate.cpp with a
general-purpose bulk_allocate function: given parallel lists of shapes,
dtypes, and per-tensor byte alignments, it computes a packed layout, does
a single CUDA allocation, and returns at::from_blob views whose deleters
keep the backing buffer alive.

The three internal bulk_allocate_*_tensors helpers in cast.cpp are
refactored to call bulk_allocate instead of each owning a copy of the
make_torch_view lambda and the offset-computation loops (~120 lines
removed). The new function is also exposed via pybind11 so Python can
allocate packed CUDA buffers directly without going through a quantizer.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Tim Moon <tmoon@nvidia.com>
Signed-off-by: Tim Moon <tmoon@nvidia.com>
@timmoon10
Collaborator Author

/te-ci pytorch

@greptile-apps
Contributor

greptile-apps Bot commented Apr 18, 2026

Greptile Summary

This PR introduces a dedicated bulk_allocate C++ function that allocates a single contiguous CUDA buffer and returns per-tensor views via at::from_blob, eliminating the CPU overhead of individual torch.empty calls in the grouped linear backward passes. The existing split-quantize helpers in cast.cpp are refactored to delegate to this new primitive, and all three grouped linear backward implementations are updated to use it for weight gradient allocation.

  • New allocate.cpp: Computes per-tensor byte offsets with configurable alignment, pads the base buffer for pointer alignment, and uses a shared_ptr<at::Tensor> deleter to keep the backing buffer alive as long as any view exists; zero-size tensors fall back to a standalone at::empty to avoid from_blob edge-case bugs.
  • cast.cpp refactor: Removes three copies of the bespoke buffer-management lambda and delegates to bulk_allocate; the contiguous_data_and_scale pre-check logic is correctly rewritten as a per-tensor size divisibility test that is semantically equivalent to the original cumulative offset check.
  • Python call-sites: Consistently pass device as the third positional argument and [256] * num_groups as alignments, matching the pybind11 signature.

Confidence Score: 5/5

The change is safe to merge: it is a pure performance optimization with well-contained scope — a new allocation helper replaces equivalent per-tensor allocations, and the refactored cast.cpp paths preserve the same memory layout semantics.

The core bulk_allocate logic is straightforward: offset arithmetic, a single CUDA allocation, and reference-counted views. The alignment padding is conservative and correct. All three Python call-sites pass arguments in the right positional order. The cast.cpp refactoring replaces identical local lambdas with equivalent calls to the new primitive, and the contiguous_data_and_scale rewrite is semantically equivalent to the original cumulative-offset check.
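
As a rough sketch of that conservative padding (hypothetical names, not the code in allocate.cpp): over-allocate the base buffer by one base_alignment and skip forward to the first aligned byte before carving out the views.

    import torch

    def aligned_base(total_bytes, base_alignment=256, device="cuda"):
        # Over-allocate so an aligned start is guaranteed to fit in the buffer,
        # then compute the byte offset that realigns the data pointer.
        buf = torch.empty(total_bytes + base_alignment, dtype=torch.uint8, device=device)
        start = (-buf.data_ptr()) % base_alignment
        return buf, start  # per-tensor views are carved out of buf[start:]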

No files require special attention; all changes are self-contained and the critical path is covered by the three Python call-sites.

Important Files Changed

  • transformer_engine/pytorch/csrc/extensions/allocate.cpp: New file implementing bulk_allocate: allocates a single contiguous CUDA buffer and creates at::from_blob views for each tensor, with per-tensor alignment support and a correct empty-tensor workaround.
  • transformer_engine/pytorch/csrc/extensions/cast.cpp: Refactors the bulk_allocate_fp8/fp4 helpers to delegate to the new bulk_allocate function; eliminates duplicated buffer-management code while preserving the contiguous_data_and_scale semantics.
  • transformer_engine/pytorch/csrc/extensions/pybind.cpp: Exposes bulk_allocate to Python with py::call_guard<py::gil_scoped_release>(), consistent with other allocation-adjacent bindings in the file.
  • transformer_engine/pytorch/module/grouped_linear.py: Replaces the per-expert torch.empty loop with tex.bulk_allocate for wgrad tensors, using 256-byte alignment and the context device.
  • transformer_engine/pytorch/ops/basic/grouped_linear.py: Replaces the per-expert torch.empty loop with tex.bulk_allocate for wgrad tensors; aligned with the module/grouped_linear.py change.
  • transformer_engine/pytorch/ops/fused/backward_grouped_mlp.py: Replaces the per-expert torch.empty loop with tex.bulk_allocate in _compute_grad_params; correctly passes device as the third argument and [256]*num_groups as alignments.

Sequence Diagram

sequenceDiagram
    participant Py as Python (grouped_linear backward)
    participant PB as pybind11
    participant BA as bulk_allocate (C++)
    participant Alloc as CUDA Allocator

    Py->>PB: tex.bulk_allocate(shapes, dtypes, device, alignments)
    Note over PB: GIL released
    PB->>BA: bulk_allocate(shapes, dtypes, device, alignments)
    BA->>BA: "compute per-tensor offsets & base_alignment"
    BA->>BA: "base_byte_size += base_alignment (padding)"
    BA->>Alloc: "at::empty({base_byte_size}, kUInt8, device)"
    Alloc-->>BA: "base_buffer (shared_ptr<at::Tensor>)"
    BA->>BA: align base_ptr to base_alignment
    loop for each tensor i
        alt "byte_sizes[i] == 0"
            BA->>Alloc: at::empty(shape_i, dtype_i)
            Alloc-->>BA: standalone empty tensor
        else
            BA->>BA: at::from_blob(base_ptr+offset[i], shape_i, deleter, dtype_i)
        end
    end
    BA-->>PB: "vector<at::Tensor>"
    PB-->>Py: List[torch.Tensor] (wgrad_list)
    Note over Py: base_buffer kept alive via shared_ptr in each view deleter
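
Going by this diagram and the file summaries above, the wgrad allocation in the grouped-linear backward reduces to a single call along the following lines; the import path, the surrounding weights list, and the exact argument types are assumptions rather than verified code:

    import transformer_engine_torch as tex  # pybind11 extension; import path assumed

    num_groups = len(weights)  # `weights`: hypothetical list of per-expert weight tensors
    wgrad_list = tex.bulk_allocate(
        [tuple(w.shape) for w in weights],  # one shape per expert wgrad
        [w.dtype for w in weights],         # matching dtypes
        weights[0].device,                  # device as the third positional argument
        [256] * num_groups,                 # 256-byte alignment per tensor
    )
    # wgrad_list is a list of torch.Tensors, all backed by one CUDA buffer.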

Reviews (7): Last reviewed commit: "Merge branch 'main' into tmoon/group-mlp..."

Comment on lines +756 to +758
// Check whether data and scales can be packed in contiguous
// buffer. Amaxes are not contiguous since they are aligned to
// 16B.
Member


I don't think I understand the logic here. What is the case where we would not be able to pack those buffers together if we can control the alignment requirements for the individual tensors in the allocation?

Collaborator Author

@timmoon10 timmoon10 Apr 22, 2026


I think there was some alignment requirement for the weight tensors, and some implementation difference depending on if the weights are contiguous. In any case, this is the logic in the existing implementation and I'm trying to avoid functional changes.

timmoon10 and others added 4 commits April 22, 2026 23:14
Make optional args for device and alignment. Handle case where base data_ptr is unaligned. Align grouped linear wgrad buffers to 256B.

Signed-off-by: Tim Moon <tmoon@nvidia.com>
Signed-off-by: Tim Moon <tmoon@nvidia.com>
Signed-off-by: Tim Moon <tmoon@nvidia.com>
@timmoon10 timmoon10 force-pushed the tmoon/group-mlp-bulk-allocate branch from befa6f6 to 16806b4 on April 22, 2026 at 23:48
@timmoon10
Collaborator Author

/te-ci pytorch

timmoon10 and others added 3 commits May 11, 2026 20:21
Signed-off-by: Tim Moon <tmoon@nvidia.com>
Signed-off-by: Tim Moon <tmoon@nvidia.com>
@timmoon10
Collaborator Author

/te-ci pytorch

vthumbe1503 previously approved these changes May 11, 2026
Collaborator

@vthumbe1503 vthumbe1503 left a comment


LGTM

Signed-off-by: Tim Moon <tmoon@nvidia.com>
@timmoon10
Collaborator Author

/te-ci pytorch

@timmoon10 timmoon10 merged commit cb59ef1 into NVIDIA:main May 12, 2026
21 of 24 checks passed
@timmoon10 timmoon10 deleted the tmoon/group-mlp-bulk-allocate branch May 13, 2026 00:33
faradawn pushed a commit to faradawn/TransformerEngine that referenced this pull request May 14, 2026
[PyTorch] Expose function to bulk-allocate tensors backed by the same buffer (NVIDIA#2900)

* [PyTorch] Add bulk_allocate utility and use it in quantized tensor allocators

Introduces transformer_engine/pytorch/csrc/extensions/allocate.cpp with a
general-purpose bulk_allocate function: given parallel lists of shapes,
dtypes, and per-tensor byte alignments, it computes a packed layout, does
a single CUDA allocation, and returns at::from_blob views whose deleters
keep the backing buffer alive.

The three internal bulk_allocate_*_tensors helpers in cast.cpp are
refactored to call bulk_allocate instead of each owning a copy of the
make_torch_view lambda and the offset-computation loops (~120 lines
removed). The new function is also exposed via pybind11 so Python can
allocate packed CUDA buffers directly without going through a quantizer.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Bulk-allocate wgrads in grouped linear impls

Signed-off-by: Tim Moon <tmoon@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Apply review suggestions

Make optional args for device and alignment. Handle case where base data_ptr is unaligned. Align grouped linear wgrad buffers to 256B.

Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Nits from Claude

Signed-off-by: Tim Moon <tmoon@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix incorrect call to `bulk_allocate`

Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Fix ambiguous return type

Signed-off-by: Tim Moon <tmoon@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Use c10::Device consistently

Signed-off-by: Tim Moon <tmoon@nvidia.com>

---------

Signed-off-by: Tim Moon <tmoon@nvidia.com>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>