
CPU Overhead Optimizations #2559

Merged
ksivaman merged 82 commits into NVIDIA:main from vthumbe1503:cpu_fp8_optimizations
Mar 3, 2026

Conversation


@vthumbe1503 vthumbe1503 commented Jan 5, 2026

Description

CPU overhead optimizations

Fixes # (issue)

Type of change

  • Documentation change (change only to the documentation, either a fix or new content)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Infra/Build change
  • Code refactoring

Changes

Please list the changes introduced in this PR:

Python Optimizations

  • TE pybind11 enums such as tex.FP8FwdTensors.GEMM1_INPUT were cast to int on each and every forward pass. We now cache the integer values in a constants file and use those instead.
  • Getting the tensor device in the helper function went through the expensive tensor.device lookup even for a QuantizedTensor. The device is now declared as a property of QuantizedTensor, so it does not go through the PyObject lookup.
  • Defined the device, shape, and is_cuda attributes as QuantizedTensor properties (they are cheap enough to compute in Python), avoiding the expensive PyObject lookup.
  • Defined requires_grad and dtype as properties of the base QuantizedTensor class, caching the values when they are set to avoid the expensive PyObject lookup. The setter still needs to go through pybind11 into C++; for instance, the torch autograd engine in C++ needs to be aware of requires_grad changes.
  • The dtype of our custom QuantizedTensor can change when we go through x.data = new_tensor, so we keep the dtype cache in sync by defining appropriate _get_data and _set_data methods for the data property of QuantizedTensor (see the sketch after this list).
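A minimal sketch of the caching patterns above, assuming the names from this description (FP8FwdTensorIdx, _dtype, _requires_grad); the real constants.py and quantized_tensor.py may differ in detail:

    from types import SimpleNamespace

    import torch

    # Hypothetical constants-file cache: the enum-to-int conversion happens
    # once at import time instead of on every forward pass.
    FP8FwdTensorIdx = SimpleNamespace(
        GEMM1_INPUT=0,   # e.g. int(tex.FP8FwdTensors.GEMM1_INPUT)
        GEMM1_WEIGHT=1,  # e.g. int(tex.FP8FwdTensors.GEMM1_WEIGHT)
    )

    class QuantizedTensor(torch.Tensor):
        """Sketch of the cached-attribute pattern, not the real class."""

        @property
        def dtype(self):
            try:
                return self._dtype  # cached, avoids the PyObject lookup
            except AttributeError:
                # Fallback for alternate creation paths (unpickling, FSDP, ...)
                self._dtype = super().dtype
                return self._dtype

        @property
        def requires_grad(self):
            try:
                return self._requires_grad
            except AttributeError:
                self._requires_grad = super().requires_grad
                return self._requires_grad

        @requires_grad.setter
        def requires_grad(self, value):
            self._requires_grad = value
            # Still go through the base setter so the C++ autograd engine
            # sees the change.
            torch.Tensor.requires_grad.__set__(self, value)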

C++ Optimizations

  • Cache symbol lookups in libcuda.so for driver calls such as cuCtxGetCurrent, so we do not look up the symbol on each and every forward/backward call (see the sketch after this list).
  • Cache the result of the nvte_non_tn_fp8_gemm_supported() function call.
  • Faster Python object call (avoiding cxa_demangle) to construct QuantizedTensor classes in C++.
  • Reduce Python work in QuantizedTensor object creation (calculating strides from the shape and getting the current CUDA device can be done in C++ instead of in the Python constructor).
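The driver-symbol cache can be pictured with this Python/ctypes analogue (the PR implements it in C++ in cuda_driver.h; the library name libcuda.so.1 and the use of lru_cache here are illustrative assumptions):

    import ctypes
    import functools

    @functools.lru_cache(maxsize=None)
    def _libcuda():
        # dlopen happens once; later calls return the cached handle.
        return ctypes.CDLL("libcuda.so.1")

    @functools.lru_cache(maxsize=None)
    def driver_symbol(name: str):
        # One dlsym-style lookup per symbol; afterwards served from cache.
        return getattr(_libcuda(), name)

    # Usage: repeated calls skip the symbol lookup entirely.
    cuCtxGetCurrent = driver_symbol("cuCtxGetCurrent")
    ctx = ctypes.c_void_p()
    cuCtxGetCurrent(ctypes.byref(ctx))  # returns a CUresult status code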

Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

vthumbe1503 and others added 2 commits January 5, 2026 18:11
Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>
vthumbe1503 and others added 4 commits January 6, 2026 12:34
Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>
…ormerEngine into cpu_fp8_optimizations

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>
@vthumbe1503
Collaborator Author

/te-ci L1 pytorch

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>
…ormerEngine into cpu_fp8_optimizations

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>
@vthumbe1503 vthumbe1503 marked this pull request as ready for review January 7, 2026 17:22
@vthumbe1503
Collaborator Author

/te-ci L1 pytorch


greptile-apps bot commented Jan 7, 2026

Greptile Summary

This PR implements comprehensive CPU overhead optimizations across Python and C++ layers of TransformerEngine, targeting hotspots identified through profiling.

Key Optimizations

Python Layer:

  • Cached pybind11 enum values (FP8FwdTensorIdx, FP8BwdTensorIdx) as integers to avoid repeated enum-to-int conversions on every forward/backward pass
  • Added cached _dtype and _requires_grad properties to QuantizedTensor base class with lazy initialization fallback to avoid expensive PyObject lookups
  • Added device, shape, and is_cuda properties to all quantized tensor types to bypass parent class attribute resolution
  • Modified QuantizedTensor.__new__ to accept a pre-computed stride parameter, allowing C++ to compute strides without calling back into Python (see the stride sketch after this list)
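For reference, the contiguous strides now precomputed on the C++ side are just running products of the trailing dimensions; a hypothetical Python equivalent of the work removed from the constructor:

    def contiguous_strides(shape):
        # Row-major (C-contiguous) strides: stride[i] = product of shape[i+1:].
        strides = [1] * len(shape)
        for i in range(len(shape) - 2, -1, -1):
            strides[i] = strides[i + 1] * shape[i + 1]
        return tuple(strides)

    assert contiguous_strides((2, 3, 4)) == (12, 4, 1)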

C++ Layer:

  • Replaced pybind11 named argument function calls with direct PyObject_Call using py::dict and py::tuple for lower overhead in tensor creation paths
  • Added thread-safe symbol caching in cuda_driver.h using mutex-protected map to avoid repeated dlsym lookups for CUDA driver API functions
  • Cached nvte_is_non_tn_fp8_gemm_supported() result per function invocation instead of calling multiple times for A and B matrix configurations
  • Moved stride computation from Python to C++ to eliminate PyObjectVectorCall overhead
  • Thread-safe extension initialization using std::call_once (see the once-initialization sketch after this list)
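The std::call_once behavior has a straightforward Python analogue, sketched here with double-checked locking (function names are illustrative, not from the PR):

    import threading

    _init_lock = threading.Lock()
    _initialized = False

    def ensure_extension_initialized():
        """Run one-time setup exactly once, even with concurrent callers."""
        global _initialized
        if _initialized:          # fast path: no lock after initialization
            return
        with _init_lock:
            if not _initialized:  # re-check under the lock
                _do_expensive_setup()
                _initialized = True

    def _do_expensive_setup():
        pass  # placeholder for the real one-time initialization work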

Code Quality

The implementation demonstrates solid engineering practices:

  • Proper RAII memory management with py::dict/py::tuple preventing reference leaks
  • Thread safety through std::mutex and std::call_once
  • Defensive error handling with RuntimeError for edge cases where tensor data is missing
  • Fallback logic for cached properties to handle alternate tensor creation paths (unpickling, FSDP, etc.)

All critical issues from previous review rounds have been addressed, including memory leaks, thread safety concerns, and proper initialization of cached attributes.

Confidence Score: 5/5

  • This PR is safe to merge with high confidence
  • All previously identified critical issues (memory leaks, thread safety, initialization bugs) have been properly addressed. The optimizations are well-implemented with proper error handling, thread safety mechanisms, and defensive programming. The changes maintain backward compatibility while reducing CPU overhead through caching and reduced Python-C++ boundary crossings. Code quality is excellent with RAII memory management, mutex-protected shared state, and comprehensive error handling
  • No files require special attention - all changes follow best practices and critical issues have been resolved

Important Files Changed

  • transformer_engine/pytorch/csrc/quantizer.cpp — Replaces pybind11 function calls with direct PyObject_Call for tensor creation, computes strides in C++ instead of Python, and caches the nvte_is_non_tn_fp8_gemm_supported() result. Memory management is properly handled with py::dict/py::tuple RAII.
  • transformer_engine/pytorch/quantized_tensor.py — Adds cached _dtype and _requires_grad properties with proper initialization and fallback logic. Adds a stride parameter to __new__ to avoid a Python call from C++. Implements a data property setter to sync the dtype cache.
  • transformer_engine/common/util/cuda_driver.h — Adds symbol caching with a thread-safe, mutex-protected map to avoid repeated libcuda.so symbol lookups.
  • transformer_engine/pytorch/constants.py — Adds FP8FwdTensorIdx and FP8BwdTensorIdx SimpleNamespace objects to cache enum-to-int conversions, avoiding expensive pybind11 enum casts on every forward/backward pass.
  • transformer_engine/common/gemm/cublaslt_gemm.cu — Caches the nvte_is_non_tn_fp8_gemm_supported() result at function start to avoid redundant calls for the A and B matrix configurations.
  • transformer_engine/pytorch/csrc/extensions/pybind.cpp — Replaces null-check guards with std::call_once for thread-safe initialization of Python extension classes.

Last reviewed commit: e52a12d

@greptile-apps greptile-apps bot left a comment

24 files reviewed, 3 comments

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>
…ormerEngine into cpu_fp8_optimizations

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>
@vthumbe1503
Collaborator Author

/te-ci L1 pytorch

@greptile-apps greptile-apps bot left a comment

Greptile Overview

Greptile Summary

This PR implements CPU-side performance optimizations for FP8 operations by caching frequently accessed attributes and reducing redundant function calls. The optimizations target expensive PyObject attribute lookups on custom tensor types and repeated C++ function calls.

Key Changes:

  • Caches requires_grad, dtype, shape, and is_cuda attribute accesses to avoid expensive PyObject lookups on custom tensors
  • Reorders attribute checks in get_tensor_device() to prioritize internal quantized tensor attributes
  • Makes num_devices static in nvte_is_non_tn_fp8_gemm_supported() to cache the device count (see the memoization sketch after this list)
  • Stores GEMM support check results in local variables to avoid redundant function calls
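The num_devices caching is plain memoization; pictured in Python (the PR uses a C++ function-local static, and torch.cuda.device_count here is an illustrative stand-in):

    import functools

    import torch

    @functools.cache  # computed on the first call, reused afterwards
    def num_devices() -> int:
        return torch.cuda.device_count()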

Critical Issues Found:

  • Variable redeclaration error in cublaslt_gemm.cu (line 224) will prevent compilation
  • Logic bug in linear.py (line 484) changes FP8 state management from OR logic to AND logic, breaking functionality when bias is None or doesn't require grad

Confidence Score: 0/5

  • This PR cannot be merged due to compilation error and critical logic bug
  • Two critical issues prevent merging: (1) C++ compilation will fail due to variable redeclaration at line 224 of cublaslt_gemm.cu, and (2) logic bug at line 484 of linear.py breaks FP8 state management by requiring all three tensors to have requires_grad=True instead of any one of them
  • Pay close attention to transformer_engine/common/gemm/cublaslt_gemm.cu (compilation error) and transformer_engine/pytorch/module/linear.py (logic bug)

Important Files Changed

File Analysis

  • transformer_engine/common/gemm/cublaslt_gemm.cu — score 1/5 — Caches a function call result to reduce overhead, but contains a variable redeclaration error that will cause a compilation failure.
  • transformer_engine/common/transformer_engine.cpp — score 5/5 — Makes num_devices static to avoid redundant calls to cuda::num_devices(); a valid optimization.
  • transformer_engine/pytorch/module/linear.py — score 0/5 — Caches requires_grad checks for performance, but contains a critical logic bug at line 484 that changes FP8 state management behavior.

Sequence Diagram

sequenceDiagram
    participant User as User Code
    participant Linear as Linear Module
    participant Quantizer as Quantizer/QuantizedTensor
    participant GEMM as GEMM Operations
    participant CPP as C++ Extensions

    Note over Linear,CPP: Performance Optimization Flow
    
    User->>Linear: forward(input, weight, bias)
    
    Note over Linear: Cache requires_grad checks
    Linear->>Linear: inp_requires_grad = inp.requires_grad<br/>weight_requires_grad = weight.requires_grad<br/>bias_requires_grad = bias.requires_grad
    
    Linear->>Quantizer: Check if quantized tensor
    alt QuantizedTensor
        Note over Quantizer: Use cached dtype property
        Quantizer->>Quantizer: return self._dtype
        Note over Quantizer: Use cached shape/is_cuda
        Quantizer->>Quantizer: return self._data.shape
    else Regular Tensor
        Quantizer->>Linear: Standard attribute access
    end
    
    Linear->>CPP: get_tensor_device(tensor)
    Note over CPP: Reordered attribute checks
    CPP->>CPP: Check _rowwise_data first<br/>Check _columnwise_data<br/>Check device last
    CPP-->>Linear: device_index
    
    Linear->>GEMM: Configure GEMM parameters
    Note over GEMM: Cache nvte_is_non_tn_fp8_gemm_supported
    GEMM->>CPP: nvte_is_non_tn_fp8_gemm_supported()
    Note over CPP: Static num_devices cached
    CPP-->>GEMM: support_flag
    GEMM->>GEMM: Store in local variable
    
    GEMM->>GEMM: Execute optimized GEMM
    GEMM-->>Linear: output
    
    Note over Linear: FP8 State Management
    alt FP8 enabled and requires_grad check
        Linear->>Linear: Update FP8 tensors<br/>based on cached flags
    end
    
    Linear-->>User: output


greptile-apps bot commented Jan 7, 2026

Additional Comments (2)

transformer_engine/common/gemm/cublaslt_gemm.cu
Variable redeclared in the same scope; already declared at line 132:

    // int is_nvte_non_tn_fp8_gemm_supported already declared at line 132

transformer_engine/pytorch/module/linear.py
Logic change from the original requires_grad(inp, weight, bias), which returns True if ANY tensor requires grad; the new code requires ALL THREE to be True, breaking FP8 state management when bias is None or does not require grad. Suggested fix:

            if ctx.fp8 and (inp_requires_grad or weight_requires_grad or bias_requires_grad):

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>
@vthumbe1503
Collaborator Author

/te-ci L1 pytorch

@vthumbe1503
Collaborator Author

/te-ci L1 pytorch

@greptile-apps greptile-apps bot left a comment

8 files reviewed, 8 comments

@greptile-apps greptile-apps bot left a comment

4 files reviewed, 4 comments

vthumbe1503 and others added 3 commits January 11, 2026 19:12
Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>
…ormerEngine into cpu_fp8_optimizations

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Signed-off-by: vthumbe1503 <vthumbe@nvidia.com>
@greptile-apps greptile-apps bot left a comment

13 files reviewed, 13 comments

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>
@vthumbe1503 vthumbe1503 changed the title from "CPU Optimizations for FP8" to "CPU Optimizations" Feb 24, 2026
vthumbe1503 and others added 4 commits February 23, 2026 23:03
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Signed-off-by: vthumbe1503 <vthumbe@nvidia.com>
Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>
@greptile-apps greptile-apps bot left a comment

19 files reviewed, no comments

@vthumbe1503 vthumbe1503 changed the title from "CPU Optimizations" to "CPU Overhead Optimizations" Feb 24, 2026
@vthumbe1503
Collaborator Author

/te-ci pytorch

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>
@vthumbe1503
Collaborator Author

/te-ci pytorch

@vthumbe1503
Collaborator Author

/te-ci pytorch

@greptile-apps greptile-apps bot left a comment

23 files reviewed, no comments

@vthumbe1503
Collaborator Author

/te-ci pytorch

@greptile-apps greptile-apps bot left a comment

23 files reviewed, no comments

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>
Signed-off-by: vthumbe1503 <vthumbe@nvidia.com>
@vthumbe1503
Collaborator Author

/te-ci pytorch

pre-commit-ci bot and others added 2 commits February 27, 2026 17:51
Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>
@vthumbe1503
Collaborator Author

/te-ci pytorch


greptile-apps bot commented Feb 27, 2026

Additional Comments (1)

transformer_engine/pytorch/tensor/float8_tensor.py, line 1019
Cached _dtype is not updated when copying a Float8Tensor with a different dtype.

When copying from one Float8Tensor to another with a different dtype (the condition at line 1004), the code creates a dummy tensor with the new dtype and sets it using super(Float8Tensor, type(self)).data.__set__(self, dummy_tensor) (line 1019). This bypasses QuantizedTensor._set_data(), which updates the cached _dtype attribute.

Result: the cached _dtype becomes stale and will not match the actual tensor's dtype.

Add after line 1019:

self._dtype = tensor.dtype
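For context, a hypothetical simplification of the data property the comment refers to, showing why bypassing _set_data leaves the cache stale (the real QuantizedTensor._set_data() may differ):

    import torch

    class QuantizedTensorSketch:
        """Illustrates the dtype-cache sync performed by _set_data."""

        def _get_data(self):
            return self._data

        def _set_data(self, tensor):
            self._data = tensor
            self._dtype = tensor.dtype  # cache stays in sync on x.data = ...

        data = property(_get_data, _set_data)

    t = QuantizedTensorSketch()
    t.data = torch.zeros(2, dtype=torch.bfloat16)
    assert t._dtype == torch.bfloat16  # setter kept the cache fresh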


ksivaman commented Mar 2, 2026

/te-ci L0 L1

@ksivaman ksivaman merged commit 9dac78e into NVIDIA:main Mar 3, 2026
11 of 13 checks passed
