[PyTorch] Replace the make_empty implementation to use the C++ implementation #2666
Conversation
/te-ci L1 pytorch
/te-ci L1 pytorch
known quantizers Signed-off-by: Przemek Tredak <ptredak@nvidia.com>
for more information, see https://pre-commit.ci
Signed-off-by: Przemek Tredak <ptredak@nvidia.com>
Force-pushed from 98f9681 to 6be430a
/te-ci pytorch L1
Greptile Summary: This PR removes the per-class Python make_empty implementations, unifying quantized tensor allocation through the C++ create_tensor path.
Confidence Score: 5/5. The refactor is mechanically sound: all built-in quantizer paths are correctly unified through C++ with no correctness regressions, and all allocation sites are consistently updated.
Important Files Changed: transformer_engine/pytorch/quantized_tensor.py
Sequence Diagram
sequenceDiagram
participant PY as Python (Quantizer.make_empty)
participant TEX as tex.create_empty_quantized_tensor (C++)
participant CONV as convert_quantizer
participant QC as QuantizerCpp::create_tensor
participant RD as resolve_device
participant PT as PyTorch (at::empty)
participant PYTENSOR as Python Tensor __new__
PY->>TEX: (self, shape, dtype, device, pin_memory)
TEX->>CONV: convert Python quantizer to C++ Quantizer
TEX->>QC: create_tensor(shape, te_dtype, device, pin_memory)
QC->>RD: resolve_device(device_opt, data_opt)
RD-->>QC: concrete at::Device
QC->>PT: at::empty(..., opts with device + pin_memory)
PT-->>QC: allocated at::Tensor(s)
QC->>PYTENSOR: PyObject_Call(TensorClass, kwargs incl. device)
PYTENSOR-->>QC: Python QuantizedTensor
QC-->>TEX: (TensorWrapper, py::object)
TEX-->>PY: QuantizedTensor
Reviews (13): Last reviewed commit: "Merge branch 'main' into pr_unify_make_e..."
    pin_memory,
)
if requires_grad:
    result.requires_grad_(True)
Doing this in C++ itself might be faster, since we are going to call the QuantizedTensor __new__ method with the requires_grad argument anyway. Calling this from Python for a custom quantized tensor has severe Python overheads.
But I see it can get quite complicated since we might have to change the create_tensor API to accept the requires_grad argument.
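For context, a minimal sketch of the Python-side pattern this thread is discussing, based on the diff excerpt and the sequence diagram above; the module name `transformer_engine_torch` and the exact keyword list are assumptions, not taken verbatim from this PR:

```python
import torch
import transformer_engine_torch as tex  # assumed name of the C++ extension module


def make_empty(self, shape, *, dtype=torch.float32, device=None,
               pin_memory=False, requires_grad=False):
    # Allocation is delegated to the unified C++ implementation.
    result = tex.create_empty_quantized_tensor(self, shape, dtype, device, pin_memory)
    # requires_grad is still applied from Python afterwards; this extra
    # Python call is the overhead the comment above refers to.
    if requires_grad:
        result.requires_grad_(True)
    return result
```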
Signed-off-by: Przemek Tredak <ptredak@nvidia.com>
/te-ci pytorch
for more information, see https://pre-commit.ci
for more information, see https://pre-commit.ci
/te-ci pytorch
Additional Comments (1)
The new code declares a local variable that shadows one of the function's parameters. This creates two problems. Fix: remove the shadowing local variable and use the parameter directly.
/te-ci pytorch
known quantizers Signed-off-by: Przemek Tredak <ptredak@nvidia.com>
for more information, see https://pre-commit.ci
Signed-off-by: Przemek Tredak <ptredak@nvidia.com>
…ngine into pr_unify_make_empty
The merge with main introduced duplicate function definition, declaration, and pybind registration for create_empty_quantized_tensor. Remove the duplicates. Signed-off-by: Przemek Tredak <ptredak@nvidia.com>
Change the device parameter from at::Device with default torch::kCUDA to std::optional<at::Device> with default nullopt. When no device is specified, resolve to the current CUDA device via c10::cuda::current_device(), ensuring the device always has a valid index. This fixes autograd engine assertions when tensors created without an explicit device are used in backward passes. Signed-off-by: Przemek Tredak <ptredak@nvidia.com>
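The following is a hedged Python analogue of the device-resolution rule described in this commit; the actual fix lives in the C++ resolve_device helper, and torch.cuda.current_device() stands in for c10::cuda::current_device():

```python
import torch


def resolve_device(device=None):
    """Illustrative only: always return a CUDA device with a concrete index."""
    if device is None:
        # No device specified: fall back to the current CUDA device.
        return torch.device("cuda", torch.cuda.current_device())
    device = torch.device(device)
    if device.type == "cuda" and device.index is None:
        # A bare "cuda" device has no index; pin it to the current device so
        # downstream consumers (e.g. the autograd engine) see a valid index.
        return torch.device("cuda", torch.cuda.current_device())
    return device
```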
for more information, see https://pre-commit.ci
/te-ci pytorch
Custom quantizers that set self.custom = True and don't override
make_empty() will now get a clear NotImplementedError instead of
hitting an opaque C++ NVTE_ERROR("Unexpected type for quantizer").
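A minimal sketch of what this dispatch implies; the attribute name `custom` comes from the commit message, while everything else is an assumption about how the base class might be written:

```python
import transformer_engine_torch as tex  # assumed name of the C++ extension module


class Quantizer:
    custom: bool = False  # custom quantizers set self.custom = True

    def make_empty(self, shape, *, dtype=None, device=None, pin_memory=False):
        if self.custom:
            # Custom quantizers are unknown to the C++ side, so they must
            # override make_empty() themselves.
            raise NotImplementedError(
                f"{type(self).__name__} is a custom quantizer and must "
                "override make_empty()."
            )
        # Built-in quantizers take the unified C++ allocation path.
        return tex.create_empty_quantized_tensor(self, shape, dtype, device, pin_memory)
```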
Signed-off-by: Przemek Tredak <ptredak@nvidia.com>
/te-ci pytorch
Signed-off-by: Przemek Tredak <ptredak@nvidia.com>
Signed-off-by: vthumbe1503 <vthumbe@nvidia.com>
Signed-off-by: Przemek Tredak <ptredak@nvidia.com>
for more information, see https://pre-commit.ci
/te-ci pytorch
if requires_grad:
    result.requires_grad_(True)
It might make sense for the tex API to accept requires_grad as an argument as well, considering CPU overheads. Given that we are already passing things like pin_memory and device, which are torch::TensorOptions fields, we might as well add requires_grad to the picture, since it is also an attribute of torch::TensorOptions.
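If that suggestion were adopted, the Python call site might collapse to something like the sketch below; the trailing requires_grad parameter is hypothetical and does not exist in the current binding:

```python
# Hypothetical extension of the binding: requires_grad is forwarded to C++,
# so the separate Python-side requires_grad_() call disappears.
result = tex.create_empty_quantized_tensor(
    quantizer, shape, dtype, device, pin_memory, requires_grad
)
```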
Signed-off-by: Przemek Tredak <ptredak@nvidia.com>
/te-ci pytorch
Description
This PR unifies the implementation of QuantizedTensor creation by using the C++ implementation of create_tensor.
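As a usage sketch (the quantizer construction is omitted and the keyword names follow the sequence diagram above, so they may differ slightly from the final API):

```python
import torch

# `quantizer` is assumed to be one of the built-in TE quantizers.
tensor = quantizer.make_empty(
    (1024, 1024),
    dtype=torch.bfloat16,
    device=torch.device("cuda", 0),
    pin_memory=False,
    requires_grad=True,
)
# The allocation itself happens in the C++ create_tensor implementation;
# only the thin make_empty wrapper remains in Python.
```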
Type of change
Changes
Please list the changes introduced in this PR:
Checklist: