
[pyTorch] Replace the make_empty implementation to use C++ implementation#2666

Merged
ptrendx merged 30 commits into NVIDIA:main from ptrendx:pr_unify_make_empty
May 13, 2026
Conversation

@ptrendx
Member

@ptrendx ptrendx commented Feb 10, 2026

Description

This PR unifies the creation of QuantizedTensor objects by using the C++ implementation of create_tensor.

Type of change

  • Documentation change (change only to the documentation, either a fix or new content)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Infra/Build change
  • Code refactoring

Changes

Please list the changes introduced in this PR:

  • Replaced the Python implementations of make_empty with calls to the C++ create_tensor

Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

@ptrendx ptrendx requested a review from negvet February 10, 2026 00:17
@ptrendx
Member Author

ptrendx commented Feb 10, 2026

/te-ci L1 pytorch

1 similar comment
@ptrendx
Member Author

ptrendx commented Feb 10, 2026

/te-ci L1 pytorch

Comment thread transformer_engine/pytorch/quantized_tensor.py
ptrendx and others added 4 commits February 18, 2026 17:27
known quantizers

Signed-off-by: Przemek Tredak <ptredak@nvidia.com>
Signed-off-by: Przemek Tredak <ptredak@nvidia.com>
Signed-off-by: Przemek Tredak <ptredak@nvidia.com>
@ptrendx ptrendx force-pushed the pr_unify_make_empty branch from 98f9681 to 6be430a Compare February 19, 2026 01:27
Signed-off-by: Przemek Tredak <ptredak@nvidia.com>
@ptrendx ptrendx marked this pull request as ready for review February 19, 2026 22:41
@ptrendx
Member Author

ptrendx commented Feb 19, 2026

/te-ci pytorch L1

@greptile-apps
Contributor

greptile-apps Bot commented Feb 19, 2026

Greptile Summary

This PR removes per-class Python make_empty overrides from all built-in quantizer types (Float8Quantizer, Float8CurrentScalingQuantizer, Float8BlockQuantizer, MXFP8Quantizer, NVFP4Quantizer) and replaces them with a single C++-backed implementation in the Quantizer base class. A new create_empty_quantized_tensor C++ function is exposed via pybind11 and a new resolve_device helper centralises device inference across all create_tensor overloads.

  • Adds device and pin_memory parameters (defaulting to nullopt/false) to every create_tensor virtual declaration and implementation, maintaining backward compatibility at existing call sites.
  • Introduces resolve_device which correctly infers the target device from an explicit argument, a pre-provided data tensor, or the current CUDA device — in that priority order.
  • Python make_empty now delegates to tex.create_empty_quantized_tensor; a custom attribute guard is added to preserve a friendly error path for custom quantizers, though the attribute is undocumented.
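
The device-inference priority described above can be sketched in plain Python. This is purely illustrative: the real helper is the C++ resolve_device in quantizer.cpp, the `data` dict stands in for a pre-provided at::Tensor, and `current_cuda_device` stands in for c10::cuda::current_device().

```python
def resolve_device(device=None, data=None, current_cuda_device=lambda: 0):
    """Sketch of the described priority order: explicit argument >
    device of a pre-provided data tensor > current CUDA device."""
    if device is not None:
        return device                       # 1. explicit argument wins
    if data is not None:
        return data["device"]               # 2. fall back to the data tensor's device
    return f"cuda:{current_cuda_device()}"  # 3. finally, the current CUDA device

print(resolve_device(device="cuda:1"))            # explicit argument
print(resolve_device(data={"device": "cuda:2"}))  # inferred from data
print(resolve_device())                           # current CUDA device
```

Note that always resolving to a concrete device index (rather than a bare "cuda") is what the summary credits with keeping the autograd engine happy.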

Confidence Score: 5/5

The refactor is mechanically sound — all built-in quantizer paths are correctly unified through C++ with no correctness regressions.

All allocation sites are consistently updated; the new resolve_device helper properly handles all four device-inference cases. The only rough edge is an undocumented custom attribute that affects error message quality for niche custom-quantizer authors, not correctness of the built-in paths.

transformer_engine/pytorch/quantized_tensor.py — the undocumented custom attribute guard for custom quantizers.

Important Files Changed

  • transformer_engine/pytorch/quantized_tensor.py: Base-class make_empty now routes through C++; the custom attribute guard is undocumented and risks replacing a clear Python error with an opaque C++ assertion for custom quantizer authors.
  • transformer_engine/pytorch/csrc/quantizer.cpp: Adds the resolve_device helper and threads device/pin_memory through all create_tensor overloads; device inference from a pre-provided data tensor is correctly handled via the new helper.
  • transformer_engine/pytorch/csrc/extensions/cast.cpp: New create_empty_quantized_tensor function added cleanly before the anonymous namespace; delegates directly to the C++ quantizer's create_tensor with device and pin_memory.
  • transformer_engine/pytorch/csrc/extensions/pybind.cpp: Registers create_empty_quantized_tensor with pybind11 with correct argument names and types; no issues.
  • transformer_engine/pytorch/csrc/common.h: All create_tensor virtual declarations updated with device and pin_memory defaulting to nullopt/false; backward-compatible at call sites that don't supply them.
  • transformer_engine/pytorch/tensor/nvfp4_tensor.py: Removes Python make_empty from NVFP4Quantizer; the old implementation used torch.zeros for amax_* buffers while C++ uses at::empty, a deliberate change per inline C++ comments stating that kernels zero those buffers.
  • transformer_engine/pytorch/tensor/mxfp8_tensor.py: Removes Python make_empty from MXFP8Quantizer; shape validation is preserved in C++ via NVTE_CHECK.
  • transformer_engine/pytorch/tensor/float8_blockwise_tensor.py: Removes Python make_empty from Float8BlockQuantizer; C++ now passes the device kwarg, which correctly flows through **kwargs to QuantizedTensor.__new__.
  • transformer_engine/pytorch/tensor/float8_tensor.py: Removes Python make_empty from Float8Quantizer and Float8CurrentScalingQuantizer; both are now covered by the C++ path.

Sequence Diagram

sequenceDiagram
    participant PY as Python (Quantizer.make_empty)
    participant TEX as tex.create_empty_quantized_tensor (C++)
    participant CONV as convert_quantizer
    participant QC as QuantizerCpp::create_tensor
    participant RD as resolve_device
    participant PT as PyTorch (at::empty)
    participant PYTENSOR as Python Tensor __new__

    PY->>TEX: (self, shape, dtype, device, pin_memory)
    TEX->>CONV: convert Python quantizer to C++ Quantizer
    TEX->>QC: create_tensor(shape, te_dtype, device, pin_memory)
    QC->>RD: resolve_device(device_opt, data_opt)
    RD-->>QC: concrete at::Device
    QC->>PT: at::empty(..., opts with device + pin_memory)
    PT-->>QC: allocated at::Tensor(s)
    QC->>PYTENSOR: PyObject_Call(TensorClass, kwargs incl. device)
    PYTENSOR-->>QC: Python QuantizedTensor
    QC-->>TEX: (TensorWrapper, py::object)
    TEX-->>PY: QuantizedTensor

Reviews (13): Last reviewed commit: "Merge branch 'main' into pr_unify_make_e..."

Contributor

@greptile-apps greptile-apps Bot left a comment


10 files reviewed, no comments


    pin_memory,
)
if requires_grad:
    result.requires_grad_(True)
Collaborator


Doing this in C++ itself might be faster, since we are going to call the QuantizedTensor.__new__ method with the requires_grad argument anyway. Calling this from Python for a custom quantized tensor incurs significant Python overhead.

Collaborator


But I see it can get quite complicated since we might have to change the create_tensor API to accept the requires_grad argument.

Signed-off-by: Przemek Tredak <ptredak@nvidia.com>
@ptrendx
Member Author

ptrendx commented Mar 4, 2026

/te-ci pytorch

Comment thread transformer_engine/pytorch/csrc/extensions/cast.cpp Outdated
ptrendx and others added 2 commits March 4, 2026 16:43
Signed-off-by: Przemek Tredak <ptredak@nvidia.com>
@ptrendx
Member Author

ptrendx commented Mar 5, 2026

/te-ci pytorch

@greptile-apps
Contributor

greptile-apps Bot commented Mar 5, 2026

Additional Comments (1)

transformer_engine/pytorch/csrc/quantizer.cpp, line 596
device parameter shadowed by pre-existing local variable

The new device function parameter (line 562) is shadowed by the local variable at::Device device declared here. In C++, this is a scoping issue: the local declaration shadows the parameter, so kwargs["device"] on line 632 uses the local variable instead of the caller's argument.

This creates two problems:

  1. If the build uses -Wshadow -Werror, this will be a compile error.
  2. More critically, in the edge case where both with_data == false and with_transpose == false, the device will resolve to c10::cuda::current_device() rather than the device parameter passed by the caller. This causes the Python Float8Tensor object to report the wrong device.

Fix: Remove the shadowing local variable and use the parameter directly:

  // Construct Python FP8 tensor
  py::object out_py;
  py::object scale_inv_py = py::cast(scale_inv_tensor);
  py::object data_py = with_data ? py::cast(data_tensor) : py::none();
  py::object transpose_py = with_transpose ? py::cast(transpose_tensor) : py::none();
  if (internal) {
    // ...
    kwargs["quantizer"] = this->quantizer;

    py::tuple args(0);
    // ...
  } else {
    // ...
    kwargs["quantizer"] = this->quantizer;
    kwargs["device"] = py::cast(device);

The local at::Device device declaration (lines 593-596) should be removed entirely, as the device parameter already holds the requested device.
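
The failure mode generalizes beyond C++. A minimal Python analogue (purely illustrative, not code from this PR) shows how a local rebinding silently discards the caller's argument:

```python
def create_tensor_buggy(device="cuda:0"):
    # Analogue of the C++ bug: rebinding `device` locally shadows the
    # parameter (like `at::Device device = ...;` inside the function body),
    # so the caller's argument is silently ignored.
    device = "cuda:0"
    return device  # always reports the local value

def create_tensor_fixed(device="cuda:0"):
    # Fix: use the parameter directly, with no local rebinding.
    return device

print(create_tensor_buggy(device="cuda:3"))  # wrong device reported
print(create_tensor_fixed(device="cuda:3"))  # caller's device honored
```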

Signed-off-by: Przemek Tredak <ptredak@nvidia.com>
@ptrendx
Member Author

ptrendx commented Mar 10, 2026

/te-ci pytorch

ptrendx and others added 5 commits April 9, 2026 16:45
known quantizers

Signed-off-by: Przemek Tredak <ptredak@nvidia.com>
Signed-off-by: Przemek Tredak <ptredak@nvidia.com>
Signed-off-by: Przemek Tredak <ptredak@nvidia.com>
Signed-off-by: Przemek Tredak <ptredak@nvidia.com>
ptrendx and others added 6 commits April 9, 2026 16:45
Signed-off-by: Przemek Tredak <ptredak@nvidia.com>
The merge with main introduced duplicate function definition,
declaration, and pybind registration for create_empty_quantized_tensor.
Remove the duplicates.

Signed-off-by: Przemek Tredak <ptredak@nvidia.com>
Change the device parameter from at::Device with default torch::kCUDA
to std::optional<at::Device> with default nullopt. When no device is
specified, resolve to the current CUDA device via
c10::cuda::current_device(), ensuring the device always has a valid
index. This fixes autograd engine assertions when tensors created
without an explicit device are used in backward passes.

Signed-off-by: Przemek Tredak <ptredak@nvidia.com>
@ptrendx
Member Author

ptrendx commented Apr 13, 2026

/te-ci pytorch

ptrendx added 2 commits April 13, 2026 17:02
Custom quantizers that set self.custom = True and don't override
make_empty() will now get a clear NotImplementedError instead of
hitting an opaque C++ NVTE_ERROR("Unexpected type for quantizer").
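
The guard described in this commit might look roughly like the following. This is an illustrative sketch only: `_cpp_create_tensor` stands in for the pybind11-exposed tex.create_empty_quantized_tensor call and is not the actual repository code.

```python
class Quantizer:
    custom = False  # custom (user-defined) quantizers set this to True

    def make_empty(self, shape):
        if self.custom:
            # Fail early with a clear Python error instead of letting the
            # request fall through to the C++ path, which would raise an
            # opaque NVTE_ERROR("Unexpected type for quantizer").
            raise NotImplementedError(
                f"{type(self).__name__} is a custom quantizer and must "
                "override make_empty()"
            )
        return self._cpp_create_tensor(shape)

    def _cpp_create_tensor(self, shape):
        # Stand-in for the C++-backed allocation path.
        return ("quantized_tensor", tuple(shape))

class MyCustomQuantizer(Quantizer):
    custom = True  # forgot to override make_empty() -> clear error
```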

Signed-off-by: Przemek Tredak <ptredak@nvidia.com>
@greptile-apps
Contributor

greptile-apps Bot commented Apr 16, 2026


@ptrendx
Member Author

ptrendx commented Apr 17, 2026

/te-ci pytorch

ptrendx and others added 5 commits May 11, 2026 14:24
Signed-off-by: Przemek Tredak <ptredak@nvidia.com>
Signed-off-by: vthumbe1503 <vthumbe@nvidia.com>
Signed-off-by: Przemek Tredak <ptredak@nvidia.com>
@ptrendx
Member Author

ptrendx commented May 11, 2026

/te-ci pytorch

vthumbe1503 previously approved these changes May 12, 2026
Collaborator

@vthumbe1503 vthumbe1503 left a comment


LGTM

Comment on lines +341 to +342
if requires_grad:
    result.requires_grad_(True)
Collaborator


It might make sense for the tex API to accept requires_grad as an argument as well, considering CPU overheads. Given that we are already passing torch::TensorOptions attributes like pin_memory and device, we might as well add requires_grad, which is also an attribute of torch::TensorOptions.

Signed-off-by: Przemek Tredak <ptredak@nvidia.com>
@ptrendx
Member Author

ptrendx commented May 12, 2026

/te-ci pytorch

@ptrendx ptrendx merged commit 4631d97 into NVIDIA:main May 13, 2026
11 of 14 checks passed