Add dynamic bucket cache mode to improve peak and avg gpu buffer memory usage #25120

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Status: Open · wants to merge 8 commits into main from feich-ms/buffer_memory_optimization_with_dynamic_bucket_mode

Conversation

@feich-ms (Contributor) commented on Jun 20, 2025

Description

Add a dynamic bucket cache mode that improves GPU buffer memory usage, reducing both peak and average memory usage without any other performance regression.

Motivation and Context

The default bucket solution uses default buckets with coarse-grained bucket keys to cache buffers and does not update the bucket keys across runs, whereas the dynamic bucket solution uses dynamic buckets with fine-grained bucket keys and updates the keys across runs. Overall, the dynamic bucket solution requests fewer buffers than the default bucket solution while keeping a similar amount of cache-missed buffer bytes. We summarized the optimization details in this doc: https://microsoftapc-my.sharepoint.com/:w:/g/personal/feich_microsoft_com/Eb1nPP2LyDZNq9bpG-jXp64BbEVcbbMzYeOPJHc3ITUu-w?e=tQLJVo
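
As a rough illustration of the difference between the two key schemes (a minimal sketch under assumed names; the fixed bucket table and 16-byte alignment are illustrative, not the PR's actual values):

```cpp
#include <cstddef>
#include <map>

// Hypothetical illustration only; names and bucket table are not from the PR.

// Coarse-grained key: round each request up to a fixed, predefined bucket
// size. Many distinct request sizes map to one bucket, the table never
// changes across runs, and the slack between request and bucket is wasted.
size_t CoarseBucketKey(size_t request_size) {
  static const size_t kFixedBuckets[] = {64, 256, 1024, 4096, 16384, 65536};
  for (size_t bucket : kFixedBuckets) {
    if (request_size <= bucket) return bucket;
  }
  return request_size;  // very large requests fall through to the exact size
}

// Fine-grained key: keep the (16-byte-aligned) exact size and count how often
// it occurs, so the bucket table can be rebuilt from observed usage per run.
size_t FineBucketKey(size_t request_size, std::map<size_t, size_t>& usage) {
  size_t aligned = (request_size + 15) & ~static_cast<size_t>(15);
  ++usage[aligned];
  return aligned;
}
```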

An experimental result shows that the dynamic bucket solution can reduce peak memory usage by 16.9% against the Edge built-in AI Phi3 model, measured with the GenAI tool command: `python benchmark/python/benchmark_e2e.py -i /Users/feich/models/phi3 -l 1000 -o perf_result.csv`

Open question: if this solution works well, could it replace the current default bucket solution?

Test Devices

  • Windows with an NVIDIA GeForce RTX 5080 GPU
  • Mac Mini with an Apple M2 Pro chip

Model Memory and Perf Analysis

The peak/avg memory calculation is based on commit 61cdfb9.

The token gen latency and tokens/sec figures come from the output file of the benchmark_e2e.py script:
Token Gen Latency -> prompt_processing_latency_ms
Tokens/sec -> token_generation_throughput_tps

Phi3-Windows

| Optimization Strategy | Peak Memory (MB) | Avg Memory (MB) | Token Gen Latency (ms) | Tokens/sec |
| --- | --- | --- | --- | --- |
| Default Bucket | 3603.83 | 3127.05 | 7.17 | 139.50 |
| Dynamic Bucket | 2994.94 (+16.90%) | 2510.18 (+19.73%) | 7.15 (+0.24%) | 139.84 (+0.24%) |

DeepSeek-R1-Windows

| Optimization Strategy | Peak Memory (MB) | Avg Memory (MB) | Token Gen Latency (ms) | Tokens/sec |
| --- | --- | --- | --- | --- |
| Default Bucket | 2089.03 | 1716.15 | 6.07 | 164.67 |
| Dynamic Bucket | 1752.53 (+16.11%) | 1345.29 (+21.61%) | 6.10 (-0.43%) | 163.97 (-0.42%) |

LLAMA3.2-1B-Windows

| Optimization Strategy | Peak Memory (MB) | Avg Memory (MB) | Token Gen Latency (ms) | Tokens/sec |
| --- | --- | --- | --- | --- |
| Default Bucket | 1736.03 | 1424.64 | 3.37 | 296.53 |
| Dynamic Bucket | 1550.81 (+10.67%) | 1222.38 (+14.20%) | 3.37 (+0.06%) | 296.71 (+0.06%) |

Phi3-Mac

| Optimization Strategy | Peak Memory (MB) | Avg Memory (MB) | Token Gen Latency (ms) | Tokens/sec |
| --- | --- | --- | --- | --- |
| Default Bucket | 3603.83 | 3121.60 | 7.17 | 139.50 |
| Dynamic Bucket | 2994.94 (+16.90%) | 2508.87 (+19.63%) | 7.15 (+0.24%) | 139.84 (+0.24%) |

DeepSeek-R1-Mac

| Optimization Strategy | Peak Memory (MB) | Avg Memory (MB) | Token Gen Latency (ms) | Tokens/sec |
| --- | --- | --- | --- | --- |
| Default Bucket | 2065.88 | 1697.92 | 11.23 | 89.01 |
| Dynamic Bucket | 1739.98 (+15.78%) | 1344.08 (+20.84%) | 11.22 (+0.11%) | 89.10 (+0.11%) |

LLAMA3.2-1B-Mac

| Optimization Strategy | Peak Memory (MB) | Avg Memory (MB) | Token Gen Latency (ms) | Tokens/sec |
| --- | --- | --- | --- | --- |
| Default Bucket | 1652.87 | 1361.16 | 8.52 | 117.32 |
| Dynamic Bucket | 1535.11 (+7.12%) | 1220.86 (+10.31%) | 8.53 (-0.02%) | 117.30 (-0.02%) |

@feich-ms feich-ms changed the title add dynamic bucket cache mode Add dynamic bucket cache mode Jun 20, 2025
@feich-ms feich-ms marked this pull request as ready for review June 24, 2025 05:06
@feich-ms feich-ms force-pushed the feich-ms/buffer_memory_optimization_with_dynamic_bucket_mode branch from 00d1a34 to f6465d4 on June 24, 2025 06:07
@feich-ms (Contributor, Author) commented:

Hi @fs-eire @guschmue, I'm currently working on the BucketCacheManager in buffer_manager.cc to improve peak/average GPU buffer memory usage. This is an initial solution I'd like to propose: replacing the default buckets with dynamic buckets that use dynamic bucket keys and update the buckets over the runs. Could you review the code and the doc, and provide feedback on whether this is a reasonable optimization? CC @qjia7. Thanks.

@feich-ms feich-ms changed the title Add dynamic bucket cache mode Add dynamic bucket cache mode to improve peak and avg memory usage Jun 26, 2025
@feich-ms feich-ms changed the title Add dynamic bucket cache mode to improve peak and avg memory usage Add dynamic bucket cache mode to improve peak and avg gpu buffer memory usage Jun 26, 2025
@guschmue guschmue added the ep:WebGPU ort-web webgpu provider label Jun 26, 2025
@guschmue guschmue requested a review from Copilot June 26, 2025 15:19
Copilot AI (Contributor) left a comment

Pull Request Overview

This PR adds a dynamic bucket cache mode to improve GPU buffer memory usage by dynamically adjusting bucket sizes based on session usage patterns. Key changes include:

  • Defining a new buffer cache mode constant "dynamicBucket" in provider options.
  • Incorporating dynamic bucket support in the provider factory and execution provider with new run start/end memory tracking.
  • Implementing a new DynamicBucketCacheManager with logic to adjust cache buckets based on observed memory usage patterns (a rough sketch follows below).
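
A skeletal sketch of what such a manager could look like (hypothetical shapes only; the real class in buffer_manager.cc differs in detail, e.g. it implements the IBufferCacheManager interface):

```cpp
#include <cstddef>
#include <unordered_map>
#include <vector>

using BufferHandle = void*;  // stand-in for a WGPUBuffer; illustrative only

// Hypothetical sketch; see buffer_manager.cc in the PR for the real class.
class DynamicBucketCacheManager {
 public:
  // Record every requested (aligned) size during a run so the bucket keys
  // can be rebuilt from the observed memory usage pattern.
  size_t CalculateBufferSize(size_t request_size) {
    size_t aligned = (request_size + 15) & ~static_cast<size_t>(15);
    ++pending_sizes_[aligned];
    return aligned;
  }

  void OnRunStart() { pending_sizes_.clear(); }

  // After a run, rebuild the bucket table from the sizes actually observed,
  // so later runs reuse buffers through fine-grained, up-to-date keys.
  void OnRunEnd() {
    buckets_.clear();
    for (const auto& [size, count] : pending_sizes_) {
      buckets_[size].reserve(count);
    }
  }

 private:
  std::unordered_map<size_t, size_t> pending_sizes_;               // size -> request count
  std::unordered_map<size_t, std::vector<BufferHandle>> buckets_;  // size -> cached buffers
};
```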

Reviewed Changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| onnxruntime/core/providers/webgpu/webgpu_provider_options.h | Added a new constant for the dynamic bucket cache mode. |
| onnxruntime/core/providers/webgpu/webgpu_provider_factory.cc | Added a branch to return DynamicBucket mode. |
| onnxruntime/core/providers/webgpu/webgpu_execution_provider.cc | Added hooks for buffer manager run start/end calls. |
| onnxruntime/core/providers/webgpu/buffer_manager.h | Updated enum, documentation, and added run lifecycle hooks for buffer caches. |
| onnxruntime/core/providers/webgpu/buffer_manager.cc | Introduced DynamicBucketCacheManager with methods for memory pattern tracking and bucket adjustments. |
Comments suppressed due to low confidence (1)

onnxruntime/core/providers/webgpu/buffer_manager.cc:286

  • Verify that the CalculateBufferSize function is declared as a virtual method in the IBufferCacheManager interface, since it is marked override here. This ensures consistency across cache manager implementations.

```cpp
size_t CalculateBufferSize(size_t request_size) override {
```

```cpp
// Adjust buckets based on the collected memory patterns every 2 runs.
// This allows the cache to adapt to the memory usage patterns of previous
// runs of the last completed token generation session.
static size_t run_id = 0;
```
Copilot AI commented on Jun 26, 2025:

Consider whether the use of a static non-atomic run_id in OnRunEnd might lead to thread-safety issues if the provider is used in a multi-threaded context.
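
One way to address this (a sketch assuming OnRunEnd can be reached from multiple threads; member names are hypothetical) is an atomic member counter instead of the function-local static. Note this only removes the counter's data race; the run-interleaving issue raised below is a separate design problem:

```cpp
#include <atomic>
#include <cstddef>

class DynamicBucketCacheManager {
 public:
  void OnRunEnd() {
    // fetch_add atomically increments and returns the previous value,
    // avoiding the data race of a non-atomic `static size_t run_id`.
    size_t run_id = run_id_.fetch_add(1, std::memory_order_relaxed);
    if ((run_id + 1) % 2 == 0) {
      AdjustBuckets();  // adjust buckets every 2 runs, as in the PR's comment
    }
  }

 private:
  void AdjustBuckets() { /* rebuild buckets from collected memory patterns */ }
  std::atomic<size_t> run_id_{0};
};
```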

@fs-eire (Contributor) commented on Jun 27, 2025

The cache algorithm looks good to me, but I have concerns about OnRunStart() and OnRunEnd(). Let me explain.

Background

When ORT is running, there are many different kinds of state, managed by different components. For example, a SessionState manages the graph and all initializers, an ExecutionFrame manages all intermediate values, and a WebGpuContext manages the WebGPU Device and all Dawn objects created from it, including the buffers.

Each state has its own lifecycle. For example, an initializer is created during session initialization and lives until the session is released. An intermediate value, however, has a shorter lifecycle: it is created at the start of a session run and destroyed no later than the end of that run.

From the lifecycle perspective, there are 3 different layers of lifecycles:

  • global or global-like: for example, OrtEnv. It can simply be assumed to be always available.
  • per-session: the lifecycle is bound to the session; for example, a SessionState.
  • per-run: the lifecycle is bound to an inference run call; for example, an ExecutionFrame.

For WebGPU, there is one additional lifecycle layer:

  • per-logical-device: the WebGpuContext. It's bound to the lifecycle of the WGPUDevice object.
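
Summarized as a hypothetical enum (for illustration only; not an actual ORT type):

```cpp
// Illustrative summary of the lifecycle layers described above; not ORT code.
enum class LifecycleLayer {
  kGlobal,            // e.g. OrtEnv: can be assumed to always be available
  kPerSession,        // e.g. SessionState: bound to one inference session
  kPerRun,            // e.g. ExecutionFrame: bound to a single Run() call
  kPerLogicalDevice,  // e.g. WebGpuContext: bound to the WGPUDevice object
};
```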

Problem

In the main branch, the buffer manager and the cache manager are both per-logical-device. Since one WebGpuContext instance can be used by multiple sessions, the buffer manager and the buffer cache manager are also shared across multiple sessions.

In the current PR, the new cache algorithm exposes new requirements:

  1. It needs to know when an inference run starts and ends.
  2. It requires that, between that start and end, no buffer operations from a different inference run happen.

While (1) is quite straightforward, (2) is actually missing from the current implementation.

Per the current design, WebGpuContext is not bound to any session state. This assumption is important because it safely allows multiple inference runs of different sessions to happen at the same time. And this is exactly the problem: (2) is never guaranteed.
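
A hypothetical interleaving (constructed for illustration, not observed) shows how (2) breaks when two sessions share one WebGpuContext and therefore one cache manager:

```cpp
// Two sessions, one shared per-device cache manager:
//
//   Session A: OnRunStart()
//   Session B: OnRunStart()      // same cache manager, different run
//   Session A: GetBuffer(1024)   // recorded as part of the "current" run
//   Session B: GetBuffer(4096)   // pollutes A's observed pattern
//   Session A: OnRunEnd()        // adjusts buckets using B's sizes as well
//   Session B: OnRunEnd()
//
// Between A's start and end, buffer operations from B's run occur, so
// requirement (2) is never guaranteed by the current design.
```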

Solution

There are a few things that need to change in the framework to support the new cache algorithm. Considering the latest requirement for graph capture, this needs careful consideration to ensure there is no design flaw or regression.

I will think about it and come up with a design that works with both the buffer manager optimization and the graph capture feature. I will update this thread when it's ready.

Labels
ep:WebGPU ort-web webgpu provider
3 participants