
Move buffer release or cache from OnRefresh to ReleaseBuffer in BucketCacheManager #25276


Open

wants to merge 2 commits into base: main

Conversation

feich-ms (Contributor) commented Jul 3, 2025

Description

This PR moves buffer release/caching from OnRefresh to ReleaseBuffer in BucketCacheManager.

Motivation and Context

OnRefresh is only executed after a batch of 16 EP runs, so within a batch a released buffer cannot actually be reused, which wastes GPU buffer resources. This PR proposes a straightforward optimization: release or cache the buffer early, in ReleaseBuffer instead of OnRefresh, to improve buffer cache/release efficiency and thereby reduce peak and average GPU memory usage. The experimental results below show a measurable memory reduction without performance regressions.

Phi3

| Optimization Strategy | Peak Memory (MB) | Avg Memory (MB) | Token Gen Latency (ms) | Tokens/sec |
|---|---|---|---|---|
| Default Bucket | 3603.83 | 3127.05 | 7.17 | 139.50 |
| Default Bucket with Early Release Optimization | 3534.77 (+1.92%) | 3073.97 (+1.70%) | 7.14 (+0.36%) | 140.01 (+0.36%) |

Deepseek-R1

| Optimization Strategy | Peak Memory (MB) | Avg Memory (MB) | Token Gen Latency (ms) | Tokens/sec |
|---|---|---|---|---|
| Default Bucket | 2089.03 | 1716.15 | 6.07 | 164.67 |
| Default Bucket with Early Release Optimization | 2034.00 (+2.63%) | 1674.49 (+2.43%) | 6.09 (-0.20%) | 164.34 (-0.20%) |

LLama3.2-1B

| Optimization Strategy | Peak Memory (MB) | Avg Memory (MB) | Token Gen Latency (ms) | Tokens/sec |
|---|---|---|---|---|
| Default Bucket | 1736.03 | 1424.64 | 3.37 | 296.53 |
| Default Bucket with Early Release Optimization | 1659.78 (+4.39%) | 1366.78 (+4.06%) | 3.41 (-1.09%) | 293.34 (-1.08%) |


feich-ms commented Jul 3, 2025

Hi @fs-eire, @guschmue, this improves buffer reuse across batched runs in BucketCacheMode; could you help review? Cc @qjia7.

@guschmue guschmue added the ep:WebGPU ort-web webgpu provider label Jul 3, 2025

qjia7 commented Jul 4, 2025

@guschmue Please help check the changes' correctness against your 90 models. Compared with before, storage buffers are now reused within a single batch (16 dispatches).

fs-eire previously approved these changes Jul 4, 2025

fs-eire commented Jul 4, 2025

/azp run Linux QNN CI Pipeline, Win_TRT_Minimal_CUDA_Test_CI, Windows ARM64 QNN CI Pipeline, Windows GPU Doc Gen CI Pipeline, Windows x64 QNN CI Pipeline


Azure Pipelines successfully started running 5 pipeline(s).


guschmue commented Jul 7, 2025

I can run some tests on it.


fs-eire commented Jul 8, 2025

/azp run Linux QNN CI Pipeline, Win_TRT_Minimal_CUDA_Test_CI, Windows ARM64 QNN CI Pipeline, Windows GPU Doc Gen CI Pipeline, Windows x64 QNN CI Pipeline


Azure Pipelines successfully started running 5 pipeline(s).
