Change the cudagraph distribution from linearly to exponentially-decreasing #3509
mathemakitten wants to merge 14 commits into NVIDIA:main from mathemakitten:helenn-exponential-decay-cudagraph-sizes
Conversation
…thub.com/mathemakitten/Megatron-LM into helenn-exponential-decay-cudagraph-sizes
Hey, are there empirics available to support the change? Should the old setting still be supported for cases where it may be better? Also, do any tests need to be updated because of this?

The empirics are the reinforcement learning runs. I can provide internal pointers if you need. I don't think anyone can presently make a strong case for the old setting. I will update the values for the tests.

Awesome, thank you!

/ok to test 3c718e9
```python
cuda_graph_token_counts.reverse()
return [cuda_graph_max_tokens]

# Exponentially decreasing, stops after num_cuda_graphs entries
```
I also vote to leave the linear-spaced CGs as an option; there's no harm in doing so since the code is already set up, and we can just default to exponential in the arguments.
One reason for keeping this setting and code is that vLLM uses linear spacing; they just create a ton more graphs than we do because they can create them so quickly and efficiently, and I think that speaks to how unoptimized our CG system is. So I personally would keep the old option and plan to use it in the future.
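For intuition only (a hypothetical illustration, not code from either project), here is how the two spacings differ for, say, a 4096-token maximum and four graphs:

```python
# Linear spacing places sizes evenly; exponential spacing clusters them
# at the small end, where decode steps spend most of their time.
linear = [4096 * (i + 1) // 4 for i in range(4)]  # [1024, 2048, 3072, 4096]
exponential = [4096 >> i for i in range(4)]       # [4096, 2048, 1024, 512]
```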
I would be against adding a new flag to toggle the distribution of inference cudagraphs. Inference already has a lot of flags, users don't know how to combine them effectively, and there is currently no empirical case for keeping the linear distribution around. I will leave a TODO to re-enable it when someone wants to take it on.
Also, #3527 already implements the vLLM strategy, orthogonally to this change.
@mathemakitten on second thought, I am also in favor of slowly phasing out the older strategy rather than removing it immediately - mostly because it's the most stress-tested one we have right now. We could make yours the default while keeping the option to fall back to it.
re: #3527 - it does use a linear function, but it builds a lot more cudagraphs than our default strategy does.
```python
# Include a (possibly extra) size-1 graph
min_token_count = math.ceil(1 / tp_size) * tp_size
if cuda_graph_token_counts[-1] != min_token_count:
    cuda_graph_token_counts.append(min_token_count)
```
Not sure how strict we want to be about sticking to `num_cuda_graphs`, but this line will generally cause us to have `num_cuda_graphs + 1` graphs. Probably a minor concern, but might want to consider whether this could be confusing to anyone.
Also, why are we adding this size-1 graph? We didn't have this before.
Having one extra size-1 graph is mildly confusing if you're counting graphs, but not functionally breaking.
I will replace a middle graph with the size-1 graph to adhere to `num_cuda_graphs`.
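A hypothetical sketch of that replacement, reusing the names from the diff above (not the exact code that landed):

```python
if cuda_graph_token_counts[-1] != min_token_count:
    # Drop a middle size to make room, so the total stays at num_cuda_graphs
    if len(cuda_graph_token_counts) >= max(num_cuda_graphs, 2):
        cuda_graph_token_counts.pop(-2)
    cuda_graph_token_counts.append(min_token_count)
```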
```python
val = cuda_graph_max_tokens
for _ in range(num_cuda_graphs):
    # Round down to multiple of rounder, then up to multiple of TP size
    rounded = max(rounder, (val // rounder) * rounder)
```
The old code guaranteed that `cuda_graph_max_tokens` is in the list, but now we don't have that guarantee anymore. Do we care about this, e.g., if someone wants to set the max cuda graph size very strictly?
I will bring this back and sub out a middle graph for it.
…thub.com/mathemakitten/Megatron-LM into helenn-exponential-decay-cudagraph-sizes
santhnm2 left a comment
Do any functional tests need to be updated as a result of this change?
```python
while len(cuda_graph_token_counts) > num_cuda_graphs:
    cuda_graph_token_counts.pop(-2)
```
Can you add these lines afterwards:

```python
assert len(cuda_graph_token_counts) == num_cuda_graphs
assert cuda_graph_max_tokens in cuda_graph_token_counts
```
```python
cuda_graph_token_counts.append(tp_size)

# Trim from the middle if we exceed num_cuda_graphs requested by the user
while len(cuda_graph_token_counts) > num_cuda_graphs:
```
Does this line need to be

```python
while len(cuda_graph_token_counts) > num_cuda_graphs and len(cuda_graph_token_counts) >= 2:
```

Otherwise `pop(-2)` might give an index error or (even worse) silently wrap around?
We have a guarantee that `num_cuda_graphs >= 1` at the top of the block, and the loop condition `len(cuda_graph_token_counts) > num_cuda_graphs` already guarantees that `len(cuda_graph_token_counts) >= 2` whenever the body runs.
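Spelled out as a hypothetical restatement of that invariant (not code from the PR):

```python
assert num_cuda_graphs >= 1  # established at the top of the block

while len(cuda_graph_token_counts) > num_cuda_graphs:
    # The loop condition implies len >= num_cuda_graphs + 1 >= 2,
    # so index -2 always exists and pop(-2) cannot wrap around.
    cuda_graph_token_counts.pop(-2)
```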
…thub.com/mathemakitten/Megatron-LM into helenn-exponential-decay-cudagraph-sizes
```python
if rounded not in cuda_graph_token_counts:
    cuda_graph_token_counts.append(rounded)
val //= 2
if val < rounder:
```
Why aren't we allowing a CG of size 1?
Maybe we can make this check `if val < 1` or if val
sidsingh-nvidia left a comment
LGTM, except for the minor nit of allowing CG of size 1.
Still unsure if we should simply chuck out the older CG strategy or keep it around as a fallback.
```python
while len(cuda_graph_token_counts) > num_cuda_graphs:
    cuda_graph_token_counts.pop(-2)

assert len(cuda_graph_token_counts) == num_cuda_graphs
```
This assert will fail when we request more CUDA graphs than the size of the memory buffer can support. It looks like it would fail the unit test as well.
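A concrete repro of that case (hypothetical numbers): with `cuda_graph_max_tokens = 64` and `rounder = 64`, the halving loop yields a single entry, so after the size-1 append the list has only two entries and any larger `num_cuda_graphs` trips the assert:

```python
# Hypothetical repro of the over-request case
cuda_graph_token_counts = [64, 1]  # all the sizes a 64-token buffer yields
num_cuda_graphs = 4                # user asked for more graphs than exist
assert len(cuda_graph_token_counts) == num_cuda_graphs  # AssertionError: 2 != 4
```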
What does this PR do?
For speed, it is often much more useful to have several small cudagraphs instead of a few large ones, so we create them over an exponentially-decreasing distribution of token counts instead of a linear one.
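For reference, a minimal self-contained sketch of the new schedule, assembled from the diff hunks quoted in the review above; the function wrapper and example numbers are hypothetical, and the TP-size round-up mentioned in one diff comment is omitted for brevity:

```python
import math

def exponential_cuda_graph_token_counts(cuda_graph_max_tokens, num_cuda_graphs,
                                        rounder, tp_size):
    cuda_graph_token_counts = []
    val = cuda_graph_max_tokens
    for _ in range(num_cuda_graphs):
        # Round down to a multiple of rounder (clamped to at least rounder)
        rounded = max(rounder, (val // rounder) * rounder)
        if rounded not in cuda_graph_token_counts:
            cuda_graph_token_counts.append(rounded)
        val //= 2
        if val < rounder:
            break
    # Include a (possibly extra) size-1 graph, padded to the TP size
    min_token_count = math.ceil(1 / tp_size) * tp_size
    if cuda_graph_token_counts[-1] != min_token_count:
        cuda_graph_token_counts.append(min_token_count)
    # Trim from the middle if we exceed num_cuda_graphs requested by the user
    while len(cuda_graph_token_counts) > num_cuda_graphs:
        cuda_graph_token_counts.pop(-2)
    return cuda_graph_token_counts

# e.g. exponential_cuda_graph_token_counts(4096, 4, rounder=64, tp_size=2)
# -> [4096, 2048, 1024, 2]
```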