Add dynamic bucket cache mode to improve peak and avg gpu buffer memory usage #25120
Hi @fs-eire @guschmue, I'm currently working on the BucketCacheManager in buffer_manager.cc to improve peak/average GPU buffer memory usage. This is an initial solution I'd like to propose: replacing the default buckets with dynamic buckets that use dynamic bucket keys and are updated across runs. Can you help review the code and doc and provide feedback on whether this is a reasonable optimization? CC @qjia7. Thanks.
Pull Request Overview
This PR adds a dynamic bucket cache mode to improve GPU buffer memory usage by dynamically adjusting bucket sizes based on session usage patterns. Key changes include:
- Defining a new buffer cache mode constant "dynamicBucket" in provider options.
- Incorporating dynamic bucket support in the provider factory and execution provider with new run start/end memory tracking.
- Implementing a new DynamicBucketCacheManager with logic to adjust cache buckets based on observed memory usage patterns.
Reviewed Changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| onnxruntime/core/providers/webgpu/webgpu_provider_options.h | Added a new constant for the dynamic bucket cache mode. |
| onnxruntime/core/providers/webgpu/webgpu_provider_factory.cc | Added a branch to return DynamicBucket mode. |
| onnxruntime/core/providers/webgpu/webgpu_execution_provider.cc | Added hooks for buffer manager run start/end calls. |
| onnxruntime/core/providers/webgpu/buffer_manager.h | Updated enum, documentation, and added run lifecycle hooks for buffer caches. |
| onnxruntime/core/providers/webgpu/buffer_manager.cc | Introduced DynamicBucketCacheManager with methods for memory pattern tracking and bucket adjustments. |
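To make the summary above concrete, here is a hedged sketch of the "adjust buckets from observed usage" idea. This is not the PR's actual implementation; the class name, the 16-byte rounding rule, and the data structures are all illustrative assumptions:

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <vector>

// Illustrative sketch: record requested sizes during a run, then rebuild the
// bucket table from the sizes actually seen, so future runs cache buffers at
// sizes that match the workload instead of fixed coarse buckets.
class DynamicBucketSketch {
 public:
  void RecordRequest(size_t size) { requests_.push_back(size); }

  // Called between runs: derive fine-grained buckets from observed requests.
  void AdjustBuckets() {
    buckets_.clear();
    for (size_t s : requests_) {
      // Fine-grained key: round up to a 16-byte multiple (assumed rule).
      buckets_[(s + 15) & ~static_cast<size_t>(15)]++;
    }
    requests_.clear();
  }

  // Smallest bucket that fits the request, or 0 if none has been learned yet.
  size_t BucketFor(size_t size) const {
    auto it = buckets_.lower_bound(size);
    return it == buckets_.end() ? 0 : it->first;
  }

 private:
  std::vector<size_t> requests_;
  std::map<size_t, int> buckets_;  // bucket size -> observed request count
};
```

The key point is that the bucket table is derived from the session's own memory pattern rather than from a hard-coded table.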
Comments suppressed due to low confidence (1)
onnxruntime/core/providers/webgpu/buffer_manager.cc:286
- Verify that the CalculateBufferSize function is declared as a virtual method in the IBufferCacheManager interface since it is marked override here. This ensures consistency across cache manager implementations.
```cpp
size_t CalculateBufferSize(size_t request_size) override {
```
```cpp
// Adjust buckets based on the collected memory patterns every 2 runs.
// The reason for this is to allow the cache to adapt to the memory usage patterns
// of previous runs of last completed token generation session.
static size_t run_id = 0;
```
Consider whether the use of a static non-atomic run_id in OnRunEnd might lead to thread-safety issues if the provider is used in a multi-threaded context.
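One way to address this concern, sketched here under the assumption that OnRunEnd may be called from multiple threads (the class and member names are hypothetical, not the PR's code), is to replace the function-local static with an atomic member counter:

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>

// Hypothetical sketch: an atomic run counter avoids the data race that a
// function-local `static size_t run_id` would have under concurrent OnRunEnd
// calls. All names are illustrative.
class DynamicBucketCacheManagerSketch {
 public:
  void OnRunEnd() {
    // fetch_add returns the previous value; adjust buckets every 2 runs.
    size_t id = run_id_.fetch_add(1, std::memory_order_relaxed);
    if (id % 2 == 1) {
      AdjustBuckets();
    }
  }

  size_t adjust_count() const { return adjust_count_.load(); }

 private:
  void AdjustBuckets() { adjust_count_.fetch_add(1, std::memory_order_relaxed); }

  std::atomic<size_t> run_id_{0};
  std::atomic<size_t> adjust_count_{0};
};
```

Note that making the counter atomic only fixes the increment itself; if AdjustBuckets mutates shared bucket state, that mutation would still need its own synchronization.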
The cache algorithm looks good to me. But I have concerns about the lifecycle of the cache state.

Background

When ORT is running, there are many different kinds of states, managed by different components. Each state has its lifecycle. For example, an initializer is created during session initialization and lives until the session is released. An intermediate value, however, has a shorter lifecycle: it is created at the start of a session run and destroyed no later than the end of that run. From the lifecycle perspective, there are 3 different layers of lifecycles: per-process (e.g. the ORT environment), per-session (e.g. initializers), and per-run (e.g. intermediate values).

For WebGPU, there is an additional lifecycle layer: per logical device (e.g. the WebGpuContext instance and the buffer manager it owns).

Problem

In the main branch, the buffer manager and the cache manager are per-logical-device. Since one WebGpuContext instance can be used by multiple sessions, the buffer manager and the buffer cache manager are also shared by multiple sessions. The new cache algorithm in this PR exposes two new requirements: (1) the buffer manager needs run start/end notifications so it can track memory usage patterns, and (2) the consecutive runs observed by the cache manager are assumed to come from the same session, so that bucket adjustments reflect one session's usage.

While (1) is quite straightforward, (2) is actually missing from the current implementation. Per the current design, WebGpuContext is not bound to any session state. This assumption is very important because it safely allows multiple inference runs of different sessions at the same time. This is the problem: (2) is never guaranteed.

Solution

There are a few things that need to change in the framework to support the new cache algorithm. Considering the latest requirement for graph capture, this needs to be considered carefully to ensure no design flaw or regression. I will think about it, come up with a design that works with both the buffer manager optimization and the graph capture feature, and update here when it's ready.
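As an illustration of the session-binding concern discussed above, one could key the per-run usage tracking by a session identifier so that interleaved runs of different sessions sharing one context cannot mix their patterns. This is purely a sketch, not a proposed design for ORT; all names are invented:

```cpp
#include <cassert>
#include <cstddef>
#include <unordered_map>
#include <vector>

// Hypothetical sketch: per-session usage tracking inside a shared context.
// Each session's observed request sizes are kept separate, so bucket
// adjustments made at a session's run end only see that session's pattern.
class SessionAwareUsageTracker {
 public:
  void OnRunStart(int session_id) { patterns_[session_id].clear(); }

  void RecordRequest(int session_id, size_t bytes) {
    patterns_[session_id].push_back(bytes);
  }

  // On run end, only the finishing session's pattern is consulted.
  const std::vector<size_t>& OnRunEnd(int session_id) {
    return patterns_[session_id];
  }

 private:
  std::unordered_map<int, std::vector<size_t>> patterns_;
};
```

A real design would also have to define where the session identifier comes from and how per-session state interacts with graph capture, which is exactly the open question above.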
Description
Add a dynamic bucket cache mode to improve GPU buffer memory usage, reducing both peak and average memory usage without other performance regressions.
Motivation and Context
The default bucket solution uses fixed buckets with coarse-grained bucket keys to cache buffers and does not update the bucket keys across runs. The dynamic bucket solution instead uses dynamic buckets with fine-grained bucket keys and updates the bucket keys across runs. Overall, the dynamic bucket solution requests fewer buffers than the default solution while keeping a similar amount of cache-missed buffer bytes. We summarized the optimization details in this doc: https://microsoftapc-my.sharepoint.com/:w:/g/personal/feich_microsoft_com/Eb1nPP2LyDZNq9bpG-jXp64BbEVcbbMzYeOPJHc3ITUu-w?e=tQLJVo.
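To illustrate the coarse- versus fine-grained bucket-key distinction, here is a minimal sketch. The rounding rules below are assumptions for illustration, not the exact rules used in buffer_manager.cc:

```cpp
#include <cassert>
#include <cstddef>

// Coarse-grained key (illustrative): round the request up to the next power
// of two, so many distinct request sizes share one, potentially much larger,
// bucket. This wastes memory per buffer but needs very few bucket sizes.
size_t CoarseBucketKey(size_t request_size) {
  size_t key = 16;  // assumed minimum bucket size
  while (key < request_size) key <<= 1;
  return key;
}

// Fine-grained key (illustrative): round up to a 16-byte multiple, so the
// cached buffer size stays close to the actual request, lowering the
// over-allocation per buffer at the cost of more distinct buckets.
size_t FineBucketKey(size_t request_size) {
  return (request_size + 15) & ~static_cast<size_t>(15);
}
```

For a 1000-byte request, the coarse key allocates a 1024-byte buffer while the fine key allocates 1008 bytes; across many live buffers, that per-buffer slack is where the peak/average memory savings come from.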
An experimental result shows that the dynamic bucket solution can reduce peak memory usage by 16.9% for the Edge built-in AI Phi3 model, measured with the GenAI tool command: python benchmark/python/benchmark_e2e.py -i /Users/feich/models/phi3 -l 1000 -o perf_result.csv
Open question: if this solution works well, is it possible to replace the current default bucket solution?
Test Device
Model Memory and Perf Analysis
The peak/avg memory calculation is based on this commit 61cdfb9
The token gen latency and tokens/sec are taken from the output file of the benchmark_e2e.py script:
- Token Gen Latency -> prompt_processing_latency_ms
- Tokens/sec -> token_generation_throughput_tps
Phi3-Windows
DeepSeek-R1-Windows
LLAMA3.2-1B-Windows
Phi3-Mac
DeepSeek-R1-Mac
LLAMA3.2-1B-Mac