Add dynamic bucket cache mode to improve peak and avg gpu buffer memory usage #25120

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Status: Open · wants to merge 8 commits into main from feich-ms/buffer_memory_optimization_with_dynamic_bucket_mode

Conversation

@feich-ms (Contributor) commented on Jun 20, 2025

Description

Add a dynamic bucket cache mode that improves GPU buffer memory usage, reducing both peak and average memory usage without any other performance regression.

Motivation and Context

The default bucket solution uses default buckets with coarse-grained bucket keys to cache buffers and does not update the bucket keys across runs, whereas the dynamic bucket solution uses dynamic buckets with fine-grained bucket keys and updates the keys across runs. Overall, the dynamic bucket solution requests fewer buffers than the default bucket solution while keeping a similar amount of cache-missed buffer bytes. We summarized the optimization details in this doc: https://microsoftapc-my.sharepoint.com/:w:/g/personal/feich_microsoft_com/Eb1nPP2LyDZNq9bpG-jXp64BbEVcbbMzYeOPJHc3ITUu-w?e=tQLJVo
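
As a rough illustration of the difference between the two key schemes (a minimal sketch under assumed names; the fixed bucket table and 16-byte alignment are illustrative, not the PR's actual values):

```cpp
#include <cstddef>
#include <map>

// Hypothetical illustration only; names and bucket table are not from the PR.

// Coarse-grained key: round each request up to a fixed, predefined bucket
// size. Many distinct request sizes map to one bucket, the table never
// changes across runs, and the slack between request and bucket is wasted.
size_t CoarseBucketKey(size_t request_size) {
  static const size_t kFixedBuckets[] = {64, 256, 1024, 4096, 16384, 65536};
  for (size_t bucket : kFixedBuckets) {
    if (request_size <= bucket) return bucket;
  }
  return request_size;  // very large requests fall through to the exact size
}

// Fine-grained key: keep the (16-byte-aligned) exact size and count how often
// it occurs, so the bucket table can be rebuilt from observed usage per run.
size_t FineBucketKey(size_t request_size, std::map<size_t, size_t>& usage) {
  size_t aligned = (request_size + 15) & ~static_cast<size_t>(15);
  ++usage[aligned];
  return aligned;
}
```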

An experimental result shows that the dynamic bucket solution can reduce peak memory usage by 16.9% against the Edge built-in AI Phi3 model, measured with the GenAI tool command: `python benchmark/python/benchmark_e2e.py -i /Users/feich/models/phi3 -l 1000 -o perf_result.csv`

Open question: if this solution works well, could it replace the current default bucket solution?

Test Devices

  • Windows with an NVIDIA GeForce RTX 5080 GPU
  • Mac Mini with an Apple M2 Pro chip

Model Memory and Perf Analysis

The peak/avg memory calculation is based on commit 61cdfb9.

The token gen latency and tokens/sec figures come from the output file of the benchmark_e2e.py script:
Token Gen Latency -> prompt_processing_latency_ms
Tokens/sec -> token_generation_throughput_tps

Phi3-Windows

| Optimization Strategy | Peak Memory (MB) | Avg Memory (MB) | Token Gen Latency (ms) | Tokens/sec |
| --- | --- | --- | --- | --- |
| Default Bucket | 3603.83 | 3127.05 | 7.17 | 139.50 |
| Dynamic Bucket | 2994.94 (+16.90%) | 2510.18 (+19.73%) | 7.15 (+0.24%) | 139.84 (+0.24%) |

DeepSeek-R1-Windows

| Optimization Strategy | Peak Memory (MB) | Avg Memory (MB) | Token Gen Latency (ms) | Tokens/sec |
| --- | --- | --- | --- | --- |
| Default Bucket | 2089.03 | 1716.15 | 6.07 | 164.67 |
| Dynamic Bucket | 1752.53 (+16.11%) | 1345.29 (+21.61%) | 6.10 (-0.43%) | 163.97 (-0.42%) |

LLAMA3.2-1B-Windows

| Optimization Strategy | Peak Memory (MB) | Avg Memory (MB) | Token Gen Latency (ms) | Tokens/sec |
| --- | --- | --- | --- | --- |
| Default Bucket | 1736.03 | 1424.64 | 3.37 | 296.53 |
| Dynamic Bucket | 1550.81 (+10.67%) | 1222.38 (+14.20%) | 3.37 (+0.06%) | 296.71 (+0.06%) |

Phi3-Mac

| Optimization Strategy | Peak Memory (MB) | Avg Memory (MB) | Token Gen Latency (ms) | Tokens/sec |
| --- | --- | --- | --- | --- |
| Default Bucket | 3603.83 | 3121.60 | 7.17 | 139.50 |
| Dynamic Bucket | 2994.94 (+16.90%) | 2508.87 (+19.63%) | 7.15 (+0.24%) | 139.84 (+0.24%) |

DeepSeek-R1-Mac

| Optimization Strategy | Peak Memory (MB) | Avg Memory (MB) | Token Gen Latency (ms) | Tokens/sec |
| --- | --- | --- | --- | --- |
| Default Bucket | 2065.88 | 1697.92 | 11.23 | 89.01 |
| Dynamic Bucket | 1739.98 (+15.78%) | 1344.08 (+20.84%) | 11.22 (+0.11%) | 89.10 (+0.11%) |

LLAMA3.2-1B-Mac

| Optimization Strategy | Peak Memory (MB) | Avg Memory (MB) | Token Gen Latency (ms) | Tokens/sec |
| --- | --- | --- | --- | --- |
| Default Bucket | 1652.87 | 1361.16 | 8.52 | 117.32 |
| Dynamic Bucket | 1535.11 (+7.12%) | 1220.86 (+10.31%) | 8.53 (-0.02%) | 117.30 (-0.02%) |

@feich-ms feich-ms changed the title add dynamic bucket cache mode Add dynamic bucket cache mode Jun 20, 2025
@feich-ms feich-ms marked this pull request as ready for review June 24, 2025 05:06
@feich-ms feich-ms force-pushed the feich-ms/buffer_memory_optimization_with_dynamic_bucket_mode branch from 00d1a34 to f6465d4 on June 24, 2025 06:07
@feich-ms (Contributor, Author) commented:

Hi @fs-eire @guschmue, I'm currently working on the BucketCacheManager in buffer_manager.cc to improve peak/average GPU buffer memory usage. This is an initial solution I'd like to propose: replacing the default buckets with dynamic buckets that use dynamic bucket keys and update the buckets over the runs. Could you review the code and the doc, and provide feedback on whether this is a reasonable optimization? CC @qjia7. Thanks.

@feich-ms feich-ms changed the title Add dynamic bucket cache mode Add dynamic bucket cache mode to improve peak and avg memory usage Jun 26, 2025
@feich-ms feich-ms changed the title Add dynamic bucket cache mode to improve peak and avg memory usage Add dynamic bucket cache mode to improve peak and avg gpu buffer memory usage Jun 26, 2025
@guschmue guschmue added the ep:WebGPU ort-web webgpu provider label Jun 26, 2025
@guschmue guschmue requested a review from Copilot June 26, 2025 15:19
Copilot AI (Contributor) left a comment

Pull Request Overview

This PR adds a dynamic bucket cache mode to improve GPU buffer memory usage by dynamically adjusting bucket sizes based on session usage patterns. Key changes include:

  • Defining a new buffer cache mode constant "dynamicBucket" in provider options.
  • Incorporating dynamic bucket support in the provider factory and execution provider with new run start/end memory tracking.
  • Implementing a new DynamicBucketCacheManager with logic to adjust cache buckets based on observed memory usage patterns (a rough sketch follows below).
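
A skeletal sketch of what such a manager could look like (hypothetical shapes only; the real class in buffer_manager.cc differs in detail, e.g. it implements the IBufferCacheManager interface):

```cpp
#include <cstddef>
#include <unordered_map>
#include <vector>

using BufferHandle = void*;  // stand-in for a WGPUBuffer; illustrative only

// Hypothetical sketch; see buffer_manager.cc in the PR for the real class.
class DynamicBucketCacheManager {
 public:
  // Record every requested (aligned) size during a run so the bucket keys
  // can be rebuilt from the observed memory usage pattern.
  size_t CalculateBufferSize(size_t request_size) {
    size_t aligned = (request_size + 15) & ~static_cast<size_t>(15);
    ++pending_sizes_[aligned];
    return aligned;
  }

  void OnRunStart() { pending_sizes_.clear(); }

  // After a run, rebuild the bucket table from the sizes actually observed,
  // so later runs reuse buffers through fine-grained, up-to-date keys.
  void OnRunEnd() {
    buckets_.clear();
    for (const auto& [size, count] : pending_sizes_) {
      buckets_[size].reserve(count);
    }
  }

 private:
  std::unordered_map<size_t, size_t> pending_sizes_;               // size -> request count
  std::unordered_map<size_t, std::vector<BufferHandle>> buckets_;  // size -> cached buffers
};
```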

Reviewed Changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| onnxruntime/core/providers/webgpu/webgpu_provider_options.h | Added a new constant for the dynamic bucket cache mode. |
| onnxruntime/core/providers/webgpu/webgpu_provider_factory.cc | Added a branch to return DynamicBucket mode. |
| onnxruntime/core/providers/webgpu/webgpu_execution_provider.cc | Added hooks for buffer manager run start/end calls. |
| onnxruntime/core/providers/webgpu/buffer_manager.h | Updated enum, documentation, and added run lifecycle hooks for buffer caches. |
| onnxruntime/core/providers/webgpu/buffer_manager.cc | Introduced DynamicBucketCacheManager with methods for memory pattern tracking and bucket adjustments. |
Comments suppressed due to low confidence (1)

onnxruntime/core/providers/webgpu/buffer_manager.cc:286

  • Verify that the CalculateBufferSize function is declared as a virtual method in the IBufferCacheManager interface, since it is marked override here. This ensures consistency across cache manager implementations.

```cpp
size_t CalculateBufferSize(size_t request_size) override {
```

```cpp
// Adjust buckets based on the collected memory patterns every 2 runs.
// This allows the cache to adapt to the memory usage patterns of previous
// runs of the last completed token generation session.
static size_t run_id = 0;
```
Copilot AI commented on Jun 26, 2025:

Consider whether the use of a static non-atomic run_id in OnRunEnd might lead to thread-safety issues if the provider is used in a multi-threaded context.
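
One way to address this (a sketch assuming OnRunEnd can be reached from multiple threads; member names are hypothetical) is an atomic member counter instead of the function-local static. Note this only removes the counter's data race; the run-interleaving issue raised below is a separate design problem:

```cpp
#include <atomic>
#include <cstddef>

class DynamicBucketCacheManager {
 public:
  void OnRunEnd() {
    // fetch_add atomically increments and returns the previous value,
    // avoiding the data race of a non-atomic `static size_t run_id`.
    size_t run_id = run_id_.fetch_add(1, std::memory_order_relaxed);
    if ((run_id + 1) % 2 == 0) {
      AdjustBuckets();  // adjust buckets every 2 runs, as in the PR's comment
    }
  }

 private:
  void AdjustBuckets() { /* rebuild buckets from collected memory patterns */ }
  std::atomic<size_t> run_id_{0};
};
```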

@fs-eire (Contributor) commented on Jun 27, 2025

The cache algorithm looks good to me, but I have concerns about OnRunStart() and OnRunEnd(). Let me explain.

Background

When ORT is running, there are many different kinds of state, managed by different components. For example, a SessionState manages the graph and all initializers, an ExecutionFrame manages all intermediate values, and a WebGpuContext manages the WebGPU Device and all Dawn objects created from it, including the buffers.

Each state has its own lifecycle. For example, an initializer is created during session initialization and lives until the session is released. An intermediate value, however, has a shorter lifecycle: it is created at the start of a session run and destroyed no later than the end of that run.

From the lifecycle perspective, there are 3 different layers of lifecycles:

  • global or global-like: for example, OrtEnv. It can simply be assumed to be always available.
  • per-session: the lifecycle is bound to the session; for example, a SessionState.
  • per-run: the lifecycle is bound to an inference run call; for example, an ExecutionFrame.

For WebGPU, there is one additional lifecycle layer:

  • per-logical-device: the WebGpuContext. It's bound to the lifecycle of the WGPUDevice object.
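
Summarized as a hypothetical enum (for illustration only; not an actual ORT type):

```cpp
// Illustrative summary of the lifecycle layers described above; not ORT code.
enum class LifecycleLayer {
  kGlobal,            // e.g. OrtEnv: can be assumed to always be available
  kPerSession,        // e.g. SessionState: bound to one inference session
  kPerRun,            // e.g. ExecutionFrame: bound to a single Run() call
  kPerLogicalDevice,  // e.g. WebGpuContext: bound to the WGPUDevice object
};
```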

Problem

In the main branch, the buffer manager and the cache manager are both per-logical-device. Since one WebGpuContext instance can be used by multiple sessions, the buffer manager and the buffer cache manager are also shared across multiple sessions.

In the current PR, the new cache algorithm exposes new requirements:

  1. It needs to know when an inference run starts and ends.
  2. It requires that, between that start and end, no buffer operations from a different inference run happen.

While (1) is quite straightforward, (2) is actually missing from the current implementation.

Per the current design, WebGpuContext is not bound to any session state. This assumption is important because it safely allows multiple inference runs of different sessions to happen at the same time. And this is exactly the problem: (2) is never guaranteed.
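
A hypothetical interleaving (constructed for illustration, not observed) shows how (2) breaks when two sessions share one WebGpuContext and therefore one cache manager:

```cpp
// Two sessions, one shared per-device cache manager:
//
//   Session A: OnRunStart()
//   Session B: OnRunStart()      // same cache manager, different run
//   Session A: GetBuffer(1024)   // recorded as part of the "current" run
//   Session B: GetBuffer(4096)   // pollutes A's observed pattern
//   Session A: OnRunEnd()        // adjusts buckets using B's sizes as well
//   Session B: OnRunEnd()
//
// Between A's start and end, buffer operations from B's run occur, so
// requirement (2) is never guaranteed by the current design.
```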

Solution

There are a few things that need to change in the framework to support the new cache algorithm. Considering the latest requirement for graph capture, this needs careful consideration to ensure there is no design flaw or regression.

I will think about it and come up with a design that works with both the buffer manager optimization and the graph capture feature. I will update this thread when it's ready.

Labels
ep:WebGPU ort-web webgpu provider
3 participants