[PyTorch] Reduce CPU overheads by ksivaman · Pull Request #2377 · NVIDIA/TransformerEngine

ksivaman · 2025-11-13T16:13:40Z

Description

Based on single-GPU profiling of the GroupedLinear module, implement several optimizations to reduce the CPU overhead introduced by PyTorch.

Type of change

Documentation change (change only to the documentation, either a fix or a new content)
Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to not work as expected)
Infra/Build change
Code refactoring
Optimization/performance

Changes

Consolidate creation and caching of the workspace in the GEMM logic. Fix the workspace device for cases where an incorrectly cached tensor is used.
Reduce the number of arguments to the PyTorch autograd function in order to reduce overheads due to functions such as unwrap_dead_wrappers.
Use the nvtx context manager only when enabled via envvar.
Move the torch.cuda.device context manager to C++.
Minor refactors so that delayed scaling recipe checks are grouped together, reducing unnecessary checks for other recipes.
Reduce the number of calls to torch.is_grad_enabled and miscellaneous other torch calls.
Add quantizer-specific copy implementations in order to avoid copy.copy().

Checklist:

I have read and followed the contributing guidelines
The functionality is complete
I have commented my code, particularly in hard-to-understand areas
I have made corresponding changes to the documentation
My changes generate no new warnings
I have added tests that prove my fix is effective or that my feature works
New and existing unit tests pass locally with my changes

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

greptile-apps · 2025-11-13T16:19:06Z

Greptile Summary

Consolidates workspace creation/caching and moves device context management to C++ to reduce CPU overhead in PyTorch operations
Reduces PyTorch autograd function arguments by grouping non-tensor args into tuples, minimizing overhead from unwrap_dead_wrappers and arg validation
Optimizes recipe checks by caching torch.is_grad_enabled() results and grouping delayed scaling checks, plus implements custom quantizer copy() methods to avoid copy.copy() overhead
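
To illustrate the custom `copy()` point, here is a minimal sketch of a hand-written `copy()` method that builds the new object directly rather than going through the generic `copy.copy()` machinery. `ExampleQuantizer` and its attributes are hypothetical stand-ins, not the Transformer Engine quantizer classes.

```python
import copy


class ExampleQuantizer:
    """Hypothetical quantizer-like object holding a few plain attributes."""

    def __init__(self, dtype, rowwise=True, columnwise=True):
        self.dtype = dtype
        self.rowwise = rowwise
        self.columnwise = columnwise

    def copy(self):
        """Shallow copy built by calling the constructor directly."""
        return ExampleQuantizer(self.dtype, self.rowwise, self.columnwise)


original = ExampleQuantizer("fp8", rowwise=True, columnwise=False)

# Class-specific copy: one constructor call.
fast_copy = original.copy()

# Generic copy: same resulting attributes, but routed through the
# general-purpose copy protocol.
generic_copy = copy.copy(original)
```

The trade-off is maintenance: a hand-written `copy()` has to be updated whenever the class gains a new attribute, whereas `copy.copy()` picks up new attributes automatically.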

Confidence Score: 3/5

This PR has critical bugs that must be fixed before merging
The workspace caching and CPU overhead optimizations are solid improvements, but the get_tensor_device() function has a critical bug where .device.index can return None for CUDA devices without explicit index (e.g., torch.device('cuda')), which will cause runtime errors when creating tensors
Pay close attention to transformer_engine/pytorch/cpp_extensions/gemm.py - the device index handling must be fixed
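
The behavior flagged here can be observed without a GPU: `torch.device('cuda')` has `index` equal to `None`, while `torch.device('cuda:1')` has `index` equal to `1`. Below is a minimal sketch of one way to normalize such a device before its index is used. The helper `with_explicit_index` and its `default_index` parameter are hypothetical; in real code the default would more likely come from `torch.cuda.current_device()`, and this is not necessarily how the PR resolved the issue.

```python
import torch


def with_explicit_index(device, default_index=0):
    """Return a device whose CUDA index is never None.

    default_index is an assumption of this sketch; a real caller would
    typically pass the index of the current CUDA device.
    """
    device = torch.device(device)
    if device.type == "cuda" and device.index is None:
        return torch.device("cuda", default_index)
    return device


bare = torch.device("cuda")           # no explicit index
fixed = with_explicit_index(bare, 0)  # index filled in from the default
```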

Important Files Changed

| Filename | Overview |
| --- | --- |
| transformer_engine/pytorch/cpp_extensions/gemm.py | Added workspace caching and device detection, but `.device.index` can return `None` causing issues with device specification |

Sequence Diagram

sequenceDiagram
    participant User
    participant Linear/GroupedLinear
    participant AutogradFunction
    participant general_gemm
    participant CUDAGuard
    participant cuBLAS/cuDNN

    User->>Linear/GroupedLinear: forward(input)
    Linear/GroupedLinear->>Linear/GroupedLinear: Cache torch.is_grad_enabled()
    Linear/GroupedLinear->>Linear/GroupedLinear: Consolidate args into non_tensor_args tuple
    Linear/GroupedLinear->>AutogradFunction: forward(tensors, non_tensor_args)
    AutogradFunction->>general_gemm: gemm(A, B, quantization_params)
    general_gemm->>general_gemm: get_cublas_workspace(device, ub, grouped)
    general_gemm->>CUDAGuard: Set correct device context
    CUDAGuard->>cuBLAS/cuDNN: Execute GEMM with correct device
    cuBLAS/cuDNN-->>general_gemm: output
    general_gemm-->>AutogradFunction: result
    AutogradFunction-->>Linear/GroupedLinear: output
    Linear/GroupedLinear-->>User: result

greptile-apps

27 files reviewed, 5 comments

greptile-apps

27 files reviewed, no comments

vthumbe1503 · 2025-11-14T22:35:47Z

Reviewed with Kirthi offline. LGTM.

vthumbe1503 · 2025-11-15T01:46:01Z

+    if ub:
+        return torch.empty(
+            get_cublas_workspace_size_bytes(), dtype=torch.uint8, device=device
+        ).repeat(_NUM_MAX_UB_STREAMS)


Just a minor further optimization, which can also be done in a later PR. Calling empty and then repeat results in two torch operations. Instead, you can allocate the full buffer with a single call:

torch.empty(get_cublas_workspace_size_bytes() * _NUM_MAX_UB_STREAMS, dtype=torch.uint8, device=device)
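
One quick way to check that this suggestion preserves the buffer shape and dtype is to compare the two allocations on CPU. This is a minimal sketch with small stand-in constants: `WORKSPACE_BYTES` and `NUM_STREAMS` are hypothetical placeholders for `get_cublas_workspace_size_bytes()` and `_NUM_MAX_UB_STREAMS`, whose real values are not shown in this thread.

```python
import torch

# Placeholder values for this sketch only, not the real constants.
WORKSPACE_BYTES = 32
NUM_STREAMS = 3

# Original pattern: allocate one workspace, then tile it (two torch ops).
two_ops = torch.empty(WORKSPACE_BYTES, dtype=torch.uint8, device="cpu").repeat(NUM_STREAMS)

# Suggested pattern: allocate the full buffer in a single call.
one_op = torch.empty(WORKSPACE_BYTES * NUM_STREAMS, dtype=torch.uint8, device="cpu")
```

Both buffers are uninitialized, so only shape and dtype are meaningful to compare, not contents.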

ksivaman · 2025-11-16T15:59:12Z

/te-ci

greptile-apps

27 files reviewed, no comments

ksivaman added 2 commits November 11, 2025 20:58

Initial changes to remove pytorch overheads

58b96ab

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

Merge branch 'NVIDIA:main' into reduce_framework_cpu_overheads

54c750c

ksivaman marked this pull request as draft November 13, 2025 16:13

greptile-apps Bot reviewed Nov 13, 2025

yaox12 requested review from yaox12 and zhongbozhu November 14, 2025 03:40

ksivaman added 2 commits November 14, 2025 10:55

Merge branch 'main' into reduce_framework_cpu_overheads

69f3149

Merge branch 'main' into reduce_framework_cpu_overheads

37564de

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

ksivaman marked this pull request as ready for review November 14, 2025 19:46

ksivaman requested a review from vthumbe1503 November 14, 2025 19:46

greptile-apps Bot reviewed Nov 14, 2025

vthumbe1503 approved these changes Nov 14, 2025

vthumbe1503 reviewed Nov 15, 2025

Merge branch 'main' into reduce_framework_cpu_overheads

0e6a0ee

ksivaman merged commit e1edaae into NVIDIA:main Nov 17, 2025
9 of 12 checks passed

greptile-apps Bot reviewed Nov 17, 2025

pggPL mentioned this pull request Nov 18, 2025

[PyTorch] Fix small errors #2396

Merged

13 tasks

This was referenced Nov 19, 2025

Minor improvements to CPU overhead #2400

Merged

[PyTorch] CPU Overhead Micro-optimizations #2146

Closed

KshitijLakhani added the 2.10.0 label Nov 20, 2025

KshitijLakhani pushed a commit that referenced this pull request Nov 20, 2025

[PyTorch] Reduce CPU overheads (#2377)

645716c

Initial changes to remove pytorch overheads Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

timmoon10 referenced this pull request in NVIDIA-NeMo/RL May 4, 2026

fixes to reduce allocated/reserved memory after offload

620f90e

Signed-off-by: Anna Shors <ashors@nvidia.com>

ksivaman merged 5 commits into NVIDIA:main from ksivaman:reduce_framework_cpu_overheads

3 participants
