
Change the cudagraph distribution from linearly to exponentially-decreasing#3509

Open
mathemakitten wants to merge 14 commits into NVIDIA:main from mathemakitten:helenn-exponential-decay-cudagraph-sizes

Conversation

@mathemakitten
Contributor

What does this PR do?

For speed, it is often much more useful to have several small cudagraphs rather than a few large ones, so we create them over an exponentially-decaying distribution instead of a linear one.
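
As a minimal sketch (hypothetical helper names, not the actual Megatron-Core implementation), assuming sizes are rounded down to a multiple of some rounder such as the TP size, the two distributions could look like this; the exponential variant concentrates graphs at the small token counts where the PR argues small graphs are most useful:

def linear_token_counts(max_tokens, num_graphs, rounder):
    # Roughly evenly spaced sizes from max_tokens down toward rounder.
    step = max(rounder, max_tokens // num_graphs)
    counts, val = [], max_tokens
    while val >= rounder and len(counts) < num_graphs:
        rounded = max(rounder, (val // rounder) * rounder)
        if rounded not in counts:
            counts.append(rounded)
        val -= step
    return counts

def exponential_token_counts(max_tokens, num_graphs, rounder):
    # Repeated halving, so most graphs cover small token counts.
    counts, val = [], max_tokens
    for _ in range(num_graphs):
        rounded = max(rounder, (val // rounder) * rounder)
        if rounded not in counts:
            counts.append(rounded)
        val //= 2
        if val < rounder:
            break
    return counts

print(linear_token_counts(1024, 6, 8))       # [1024, 848, 680, 512, 344, 168]
print(exponential_token_counts(1024, 6, 8))  # [1024, 512, 256, 128, 64, 32]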

Contribution process

flowchart LR
    A[Pre-checks] --> B[PR Tests]
    subgraph Code Review/Approval
        C1[Expert Review] --> C2[Final Review]
    end
    B --> C1
    C2 --> D[Merge]

Pre-checks

  • I want this PR in a versioned release and have added the appropriate Milestone (e.g., Core 0.8)
  • I have added relevant unit tests
  • I have added relevant functional tests
  • I have added proper typing to my code (see Typing guidelines)
  • I have added relevant documentation
  • I have run the autoformatter.sh on my PR

Code review

The following process is enforced via the CODEOWNERS file for changes into megatron/core. For changes outside of megatron/core, it is up to the PR author whether or not to tag the Final Reviewer team.

For MRs into `main` branch

Feel free to message @mcore-oncall or tag them in a comment to help accelerate your merge into main. The less complex your PR is, the faster it will be approved and merged!

(Step 1): Add PR label Expert Review

(Step 2): Collect the expert reviewers' reviews

  1. Attach the Expert Review label when your PR is ready for review.
  2. GitHub auto-assigns expert reviewers based on your changes. They will get notified and pick up your PR soon.

⚠️ Only proceed to the next step once all reviewers have approved, merge conflicts are resolved, and the CI is passing.
Final Review might get declined if these requirements are not fulfilled.

(Step 3): Final Review

  1. Add Final Review label
  2. GitHub auto-assigns final reviewers based on your changes. They will get notified and pick up your PR soon.

(Optional Step 4): Cherry-pick into release branch

If this PR also needs to be merged into core_r* release branches, after this PR has been merged, select Cherry-pick to open a new PR into the release branch.

For MRs into `dev` branch

The proposed review process for the `dev` branch is under active discussion.

MRs are mergeable after one approval by either eharper@nvidia.com or zijiey@nvidia.com.

Merging your PR

Any member of core-adlr or core-nemo will be able to merge your PR.

@mathemakitten mathemakitten requested review from a team as code owners February 20, 2026 01:56
@svcnvidia-nemo-ci svcnvidia-nemo-ci requested a review from a team February 20, 2026 01:56
@ko3n1g ko3n1g added this to the Core 0.16 milestone Feb 20, 2026
@mathemakitten mathemakitten changed the title from "Change from linearly sized cudagraphs to exponentially-decreasing sized cudagraphs" to "Change the cudagraph distribution from linearly to exponentially-decreasing" Feb 20, 2026
@janEbert
Contributor

janEbert commented Feb 20, 2026

Hey, are there empirics available to support the change? Should the old setting still be supported for cases where it may be better?

Also, do any tests need to be updated because of this?

@janEbert janEbert added the Expert Review and complexity: low labels Feb 20, 2026
Comment thread megatron/core/inference/batch_dimensions_utils.py
@mathemakitten
Contributor Author

Hey, are there empirics available to support the change? Should the old setting still be supported for cases where it may be better?

Also, do any tests need to be updated because of this?

The empirics are the reinforcement learning runs. I can provide internal pointers if you need. I don't think anyone can presently make a strong case for the old setting.

I will update the values for test_cuda_graph_token_counts.

@janEbert
Contributor

The empirics are the reinforcement learning runs. I can provide internal pointers if you need. I don't think anyone can presently make a strong case for the old setting.

Awesome, thank you!

@janEbert
Contributor

/ok to test 3c718e9

cuda_graph_token_counts.reverse()
return [cuda_graph_max_tokens]

# Exponentially decreasing, stops after num_cuda_graphs entries
Contributor

I also vote to leave the linear-spaced CGs as an option; there's no harm in doing so since the code is already set up, and we can just default to exponential in the arguments.

One reason for keeping this setting & code is that vLLM uses linear spacing; they just create a ton more graphs than we do because they can create them so quickly and efficiently, and I think that just speaks to how unoptimized our CG system is. So I personally would keep the old option and just plan to use it in the future.

Contributor Author

@mathemakitten mathemakitten Feb 23, 2026

I would be against adding a new flag to toggle the distribution of inference cudagraphs. Inference already has a lot of flags, users don't know how to combine them effectively, and there is currently no empirical case for keeping the linear distribution around. I will leave a TODO to re-enable it when someone wants to take it on.

Contributor Author

Also, #3527 already implements the vLLM strategy, orthogonally to this.

Contributor

@mathemakitten on second thought, I am also in favor of only slowly phasing out the older strategy, mostly because it's the most stress-tested one we have right now. We could make yours the default while keeping the option to fall back to it.

Contributor

re: #3527 - it does use a linear function, but it builds a lot more cudagraphs compared to our default strategy.

# Include a (possibly extra) size-1 graph
min_token_count = math.ceil(1 / tp_size) * tp_size
if cuda_graph_token_counts[-1] != min_token_count:
    cuda_graph_token_counts.append(min_token_count)
Contributor

not sure how strict we want to be about sticking to num_cuda_graphs, but this line will generally cause us to have num_cuda_graphs + 1 graphs. Probably a minor concern, but might want to consider whether this could be confusing to anyone.

also, why are we adding this size-1 graph? we didn't have this before

Contributor Author

@mathemakitten mathemakitten Feb 23, 2026

Having one more size-1 graph is mildly confusing if you're counting graphs, but it's not functionally breaking.

I will replace a middle graph with the size-1 graph to adhere to num_cuda_graphs.
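
A tiny illustration with hypothetical values, consistent with the snippets quoted further down in this thread: append the TP-rounded size-1 graph, then drop a middle entry so the total still equals num_cuda_graphs.

num_cuda_graphs, tp_size = 4, 2
cuda_graph_token_counts = [1024, 512, 256, 128]
if cuda_graph_token_counts[-1] != tp_size:
    cuda_graph_token_counts.append(tp_size)   # [1024, 512, 256, 128, 2]
while len(cuda_graph_token_counts) > num_cuda_graphs:
    cuda_graph_token_counts.pop(-2)           # drop a middle entry
print(cuda_graph_token_counts)                # [1024, 512, 256, 2]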

val = cuda_graph_max_tokens
for _ in range(num_cuda_graphs):
    # Round down to multiple of rounder, then up to multiple of TP size
    rounded = max(rounder, (val // rounder) * rounder)
Contributor

the old code guaranteed that cuda_graph_max_tokens is in the list, but now we don't have that guarantee anymore. Do we care about this, e.g., someone wants to very strictly set the max cuda graph size?
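
A short illustration of the concern, with hypothetical values: the very first generated entry is already rounded down, so the exact maximum can disappear from the list.

cuda_graph_max_tokens, rounder = 1000, 64  # hypothetical values
first = max(rounder, (cuda_graph_max_tokens // rounder) * rounder)
print(first)                               # 960, not 1000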

Contributor Author

I will bring this back and sub out a middle graph for it.

Comment thread megatron/core/inference/batch_dimensions_utils.py Outdated
Comment thread tests/unit_tests/inference/engines/test_dynamic_engine.py
Contributor

@santhnm2 santhnm2 left a comment

Do any functional tests need to be updated as a result of this change?

Comment on lines +276 to +277
while len(cuda_graph_token_counts) > num_cuda_graphs:
    cuda_graph_token_counts.pop(-2)
Contributor

Can you add these lines afterwards:

assert len(cuda_graph_token_counts) == num_cuda_graphs
assert cuda_graph_max_tokens in cuda_graph_token_counts

cuda_graph_token_counts.append(tp_size)

# Trim from the middle if we exceed num_cuda_graphs requested by the user
while len(cuda_graph_token_counts) > num_cuda_graphs:
Contributor

Does this line need to be

while len(cuda_graph_token_counts) > num_cuda_graphs and len(cuda_graph_token_counts) >= 2

Otherwise pop(-2) might give an index error or (even worse) silently wrap around?

Contributor Author

We have a guarantee that num_cuda_graphs >= 1 at the top of the block and we also check while len(cuda_graph_token_counts) > num_cuda_graphs, so we're actually already guaranteed that len(cuda_graph_token_counts) >= 2 when this block runs.
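
A minimal check of that invariant, with hypothetical values: when num_cuda_graphs >= 1, the loop condition len(cuda_graph_token_counts) > num_cuda_graphs already implies a length of at least 2, so pop(-2) always has a valid index.

num_cuda_graphs = 1                  # guaranteed >= 1 upstream
cuda_graph_token_counts = [1024, 2]
while len(cuda_graph_token_counts) > num_cuda_graphs:
    assert len(cuda_graph_token_counts) >= 2  # holds on every iteration
    cuda_graph_token_counts.pop(-2)
print(cuda_graph_token_counts)               # [2]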

if rounded not in cuda_graph_token_counts:
    cuda_graph_token_counts.append(rounded)
val //= 2
if val < rounder:
Contributor

@sidsingh-nvidia sidsingh-nvidia Feb 25, 2026

why aren't we allowing a CG of size 1?
Maybe we can make this check - if val < 1 or if val

Contributor

@sidsingh-nvidia sidsingh-nvidia left a comment

LGTM, except for the minor nit of allowing CG of size 1.
Still unsure if we should simply chuck out the older CG strategy or keep it around as a fallback.

while len(cuda_graph_token_counts) > num_cuda_graphs:
    cuda_graph_token_counts.pop(-2)

assert len(cuda_graph_token_counts) == num_cuda_graphs
Contributor

This assert will fail when we request too few CUDA graphs for the size of the memory buffer. It looks like it would fail the unit test as well.
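
One way this can surface, shown with hypothetical values using the halving sketch from the PR description above: if cuda_graph_max_tokens is small relative to num_cuda_graphs, the halving loop runs out of sizes early and the list ends up shorter than requested, so the length check cannot hold.

counts = exponential_token_counts(max_tokens=64, num_graphs=10, rounder=8)
print(counts)             # [64, 32, 16, 8] -- only 4 entries
assert len(counts) == 10  # AssertionError: fewer graphs than requested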
