
Change the cudagraph distribution from linearly to exponentially-decreasing#3509

Open
mathemakitten wants to merge 14 commits into NVIDIA:main from mathemakitten:helenn-exponential-decay-cudagraph-sizes

Conversation

@mathemakitten
Contributor

What does this PR do?

For speed, it is often much more useful to have several small cudagraphs rather than a few large ones, so we create them over an exponentially-decaying distribution instead of a linear one.
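
As a minimal sketch (hypothetical helper names, not the actual Megatron-Core implementation), assuming sizes are rounded down to a multiple of some rounder such as the TP size, the two distributions could look like this; the exponential variant concentrates graphs at the small token counts where the PR argues small graphs are most useful:

def linear_token_counts(max_tokens, num_graphs, rounder):
    # Roughly evenly spaced sizes from max_tokens down toward rounder.
    step = max(rounder, max_tokens // num_graphs)
    counts, val = [], max_tokens
    while val >= rounder and len(counts) < num_graphs:
        rounded = max(rounder, (val // rounder) * rounder)
        if rounded not in counts:
            counts.append(rounded)
        val -= step
    return counts

def exponential_token_counts(max_tokens, num_graphs, rounder):
    # Repeated halving, so most graphs cover small token counts.
    counts, val = [], max_tokens
    for _ in range(num_graphs):
        rounded = max(rounder, (val // rounder) * rounder)
        if rounded not in counts:
            counts.append(rounded)
        val //= 2
        if val < rounder:
            break
    return counts

print(linear_token_counts(1024, 6, 8))       # [1024, 848, 680, 512, 344, 168]
print(exponential_token_counts(1024, 6, 8))  # [1024, 512, 256, 128, 64, 32]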

Contribution process

flowchart LR
    A[Pre-checks] --> B[PR Tests]
    subgraph Code Review/Approval
        C1[Expert Review] --> C2[Final Review]
    end
    B --> C1
    C2 --> D[Merge]

Pre-checks

  • I want this PR in a versioned release and have added the appropriate Milestone (e.g., Core 0.8)
  • I have added relevant unit tests
  • I have added relevant functional tests
  • I have added proper typing to my code (see Typing guidelines)
  • I have added relevant documentation
  • I have run the autoformatter.sh on my PR

Code review

The following process is enforced via the CODEOWNERS file for changes into megatron/core. For changes outside of megatron/core, it is up to the PR author whether or not to tag the Final Reviewer team.

For MRs into `main` branch

Feel free to message @mcore-oncall or tag them in a comment to help accelerate your merge into main. The less complex your PR is, the faster it will be approved and merged!

(Step 1): Add PR label Expert Review

(Step 2): Collect the expert reviewers' reviews

  1. Attach the Expert Review label when your PR is ready for review.
  2. GitHub auto-assigns expert reviewers based on your changes. They will get notified and pick up your PR soon.

⚠️ Only proceed to the next step once all reviewers have approved, merge conflicts are resolved, and the CI is passing.
Final Review might get declined if these requirements are not fulfilled.

(Step 3): Final Review

  1. Add Final Review label
  2. GitHub auto-assigns final reviewers based on your changes. They will get notified and pick up your PR soon.

(Optional Step 4): Cherry-pick into release branch

If this PR also needs to be merged into core_r* release branches, after this PR has been merged, select Cherry-pick to open a new PR into the release branch.

For MRs into `dev` branch

The proposed review process for the `dev` branch is under active discussion.

MRs are mergeable after one approval by either eharper@nvidia.com or zijiey@nvidia.com.

Merging your PR

Any member of core-adlr or core-nemo will be able to merge your PR.

@mathemakitten mathemakitten requested review from a team as code owners February 20, 2026 01:56
@svcnvidia-nemo-ci svcnvidia-nemo-ci requested a review from a team February 20, 2026 01:56
@ko3n1g ko3n1g added this to the Core 0.16 milestone Feb 20, 2026
@mathemakitten mathemakitten changed the title from "Change from linearly sized cudagraphs to exponentially-decreasing sized cudagraphs" to "Change the cudagraph distribution from linearly to exponentially-decreasing" Feb 20, 2026
@janEbert
Contributor

janEbert commented Feb 20, 2026

Hey, are there empirics available to support the change? Should the old setting still be supported for cases where it may be better?

Also, do any tests need to be updated because of this?

@janEbert janEbert added the Expert Review and complexity: low labels Feb 20, 2026
Comment thread megatron/core/inference/batch_dimensions_utils.py
@mathemakitten
Contributor Author

Hey, are there empirics available to support the change? Should the old setting still be supported for cases where it may be better?

Also, do any tests need to be updated because of this?

The empirics are the reinforcement learning runs. I can provide internal pointers if you need. I don't think anyone can presently make a strong case for the old setting.

I will update the values for test_cuda_graph_token_counts.

@janEbert
Contributor

The empirics are the reinforcement learning runs. I can provide internal pointers if you need. I don't think anyone can presently make a strong case for the old setting.

Awesome, thank you!

@janEbert
Contributor

/ok to test 3c718e9

cuda_graph_token_counts.reverse()
return [cuda_graph_max_tokens]

# Exponentially decreasing, stops after num_cuda_graphs entries
Contributor

I also vote to leave the linear-spaced CGs as an option; there's no harm in doing so since the code is already set up, and we can just default to exponential in the arguments.

One reason for keeping this setting & code is that vLLM uses linear spacing; they just create a ton more graphs than we do because they can create them so quickly and efficiently, and I think that just speaks to how unoptimized our CG system is. So I personally would keep the old option and just plan to use it in the future.

Contributor Author

@mathemakitten mathemakitten Feb 23, 2026

I would be against adding a new flag to toggle the distribution of inference cudagraphs. Inference already has a lot of flags, users don't know how to combine them effectively, and there is currently no empirical case for keeping the linear distribution around. I will leave a TODO to re-enable it when someone wants to take it on.

Contributor Author

Also, #3527 already implements the vLLM strategy, orthogonally to this.

Contributor

@mathemakitten on second thought, I am also in favor of only slowly phasing out the older strategy, mostly because it's the most stress-tested one we have right now. We could make yours the default while keeping the option to fall back to it.

Contributor

re: #3527 - it does use a linear function, but it builds a lot more cudagraphs compared to our default strategy.

# Include a (possibly extra) size-1 graph
min_token_count = math.ceil(1 / tp_size) * tp_size
if cuda_graph_token_counts[-1] != min_token_count:
    cuda_graph_token_counts.append(min_token_count)
Contributor

not sure how strict we want to be about sticking to num_cuda_graphs, but this line will generally cause us to have num_cuda_graphs + 1 graphs. Probably a minor concern, but might want to consider whether this could be confusing to anyone.

also, why are we adding this size-1 graph? we didn't have this before

Contributor Author

@mathemakitten mathemakitten Feb 23, 2026

Having one more size-1 graph is mildly confusing if you're counting graphs, but it's not functionally breaking.

I will replace a middle graph with the size-1 graph to adhere to num_cuda_graphs.
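
A tiny illustration with hypothetical values, consistent with the snippets quoted further down in this thread: append the TP-rounded size-1 graph, then drop a middle entry so the total still equals num_cuda_graphs.

num_cuda_graphs, tp_size = 4, 2
cuda_graph_token_counts = [1024, 512, 256, 128]
if cuda_graph_token_counts[-1] != tp_size:
    cuda_graph_token_counts.append(tp_size)   # [1024, 512, 256, 128, 2]
while len(cuda_graph_token_counts) > num_cuda_graphs:
    cuda_graph_token_counts.pop(-2)           # drop a middle entry
print(cuda_graph_token_counts)                # [1024, 512, 256, 2]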

val = cuda_graph_max_tokens
for _ in range(num_cuda_graphs):
    # Round down to multiple of rounder, then up to multiple of TP size
    rounded = max(rounder, (val // rounder) * rounder)
Contributor

the old code guaranteed that cuda_graph_max_tokens is in the list, but now we don't have that guarantee anymore. Do we care about this, e.g., someone wants to very strictly set the max cuda graph size?
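
A short illustration of the concern, with hypothetical values: the very first generated entry is already rounded down, so the exact maximum can disappear from the list.

cuda_graph_max_tokens, rounder = 1000, 64  # hypothetical values
first = max(rounder, (cuda_graph_max_tokens // rounder) * rounder)
print(first)                               # 960, not 1000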

Contributor Author

I will bring this back and sub out a middle graph for it.

Comment thread megatron/core/inference/batch_dimensions_utils.py Outdated
Comment thread tests/unit_tests/inference/engines/test_dynamic_engine.py
Contributor

@santhnm2 santhnm2 left a comment

Do any functional tests need to be updated as a result of this change?

Comment on lines +276 to +277
while len(cuda_graph_token_counts) > num_cuda_graphs:
    cuda_graph_token_counts.pop(-2)
Contributor

Can you add these lines afterwards:

assert len(cuda_graph_token_counts) == num_cuda_graphs
assert cuda_graph_max_tokens in cuda_graph_token_counts

cuda_graph_token_counts.append(tp_size)

# Trim from the middle if we exceed num_cuda_graphs requested by the user
while len(cuda_graph_token_counts) > num_cuda_graphs:
Contributor

Does this line need to be

while len(cuda_graph_token_counts) > num_cuda_graphs and len(cuda_graph_token_counts) >= 2

Otherwise pop(-2) might give an index error or (even worse) silently wrap around?

Contributor Author

We have a guarantee that num_cuda_graphs >= 1 at the top of the block and we also check while len(cuda_graph_token_counts) > num_cuda_graphs, so we're actually already guaranteed that len(cuda_graph_token_counts) >= 2 when this block runs.
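
A minimal check of that invariant, with hypothetical values: when num_cuda_graphs >= 1, the loop condition len(cuda_graph_token_counts) > num_cuda_graphs already implies a length of at least 2, so pop(-2) always has a valid index.

num_cuda_graphs = 1                  # guaranteed >= 1 upstream
cuda_graph_token_counts = [1024, 2]
while len(cuda_graph_token_counts) > num_cuda_graphs:
    assert len(cuda_graph_token_counts) >= 2  # holds on every iteration
    cuda_graph_token_counts.pop(-2)
print(cuda_graph_token_counts)               # [2]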

if rounded not in cuda_graph_token_counts:
    cuda_graph_token_counts.append(rounded)
val //= 2
if val < rounder:
Contributor

@sidsingh-nvidia sidsingh-nvidia Feb 25, 2026

why aren't we allowing a CG of size 1?
Maybe we can make this check - if val < 1 or if val

Contributor

@sidsingh-nvidia sidsingh-nvidia left a comment

LGTM, except for the minor nit of allowing CG of size 1.
Still unsure if we should simply chuck out the older CG strategy or keep it around as a fallback.

while len(cuda_graph_token_counts) > num_cuda_graphs:
    cuda_graph_token_counts.pop(-2)

assert len(cuda_graph_token_counts) == num_cuda_graphs
Contributor

This assert will fail when we request too few CUDA graphs for the size of the memory buffer. It looks like it would fail the unit test as well.
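
One way this can surface, shown with hypothetical values using the halving sketch from the PR description above: if cuda_graph_max_tokens is small relative to num_cuda_graphs, the halving loop runs out of sizes early and the list ends up shorter than requested, so the length check cannot hold.

counts = exponential_token_counts(max_tokens=64, num_graphs=10, rounder=8)
print(counts)             # [64, 32, 16, 8] -- only 4 entries
assert len(counts) == 10  # AssertionError: fewer graphs than requested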
