
Conversation

@IAMDAVID0920 (Contributor) commented Nov 12, 2025

📋 PR Title Format

The PR title should follow the format:

type(scope): concise message (max 50 chars)

Where:

  • type is one of: feat, fix, docs, refactor, perf, test, chore.
  • scope is optional and describes the part of the codebase affected (e.g., auth, ui, api).
  • concise message is a short description of the change (max 50 chars).

📝 Change Type

Please select the type of change this PR introduces (choose one or more):

  • feat: New feature.
  • fix: Bug fix.
  • docs: Documentation only changes.
  • refactor: A code change that neither fixes a bug nor adds a feature.
  • perf: Performance improvement.
  • test: Adding missing tests or correcting existing tests.
  • chore: Maintenance tasks (e.g., updating dependencies).

💡 Description

Fixes RTX 4070 OOM during CUDA graph capture by reserving 2GB for CUDA graph overhead in capacity calculations.
Example calculation

RTX 4070 (12GB):
├─ Assigned: 3 layers (~8GB weights)
├─ KV cache: ~3.6GB (at the default 0.3 ratio)
├─ CUDA graphs: ~2GB needed
└─ Total: ~13.6GB > 12GB → OOM!
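
The same arithmetic, as a tiny illustrative Python snippet (numbers copied straight from the breakdown above):

total_vram_gb = 12.0   # RTX 4070
weights_gb    = 8.0    # ~3 assigned layers, per the estimate above
kv_cache_gb   = 0.3 * total_vram_gb   # default 0.3 KV-cache ratio -> ~3.6GB
cuda_graph_gb = 2.0    # CUDA graph capture overhead

needed_gb = weights_gb + kv_cache_gb + cuda_graph_gb
print(f"need ~{needed_gb:.1f}GB vs {total_vram_gb:.0f}GB available")  # ~13.6GB vs 12GB -> OOM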

My naive fix is to reserve 2GB before calculating capacity on CUDA devices; I'm not sure whether this needs to scale with model size for very large models. Please let me know if there are better implementation ideas!

if self.hardware.device == "cuda":
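    # Reserve a fixed budget for CUDA graph capture before computing layer capacity.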
    cuda_graph_overhead_gb = 2.0
    available_memory_bytes -= cuda_graph_overhead_gb * 1024 * 1024 * 1024
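
On the scaling question above, one purely hypothetical direction (the function name, the assigned_weight_gb parameter, and the 1.0/0.1/4.0 constants are all made up for illustration and not part of this PR) would be to grow the reserve with the per-device weight footprint instead of using a flat 2GB:

def cuda_graph_reserve_gb(assigned_weight_gb: float) -> float:
    # Hypothetical: grow the reserve with the weights assigned to this device,
    # starting from 1GB and capped at 4GB; the constants are untuned placeholders.
    return min(4.0, 1.0 + 0.1 * assigned_weight_gb)

# e.g. ~1.8GB for an 8GB shard, capped at 4GB for a 40GB shard
print(cuda_graph_reserve_gb(8.0), cuda_graph_reserve_gb(40.0))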

This fix also reduces the layer capacity for ALL CUDA GPUs:

GPU         Old Capacity   New Capacity   Change
A100-80GB   13 layers      12 layers      -1
A100-40GB   6 layers       6 layers       0
RTX 5090    5 layers       4 layers       -1
RTX 4090    4 layers       3 layers       -1
RTX 4070    2 layers       1 layer        -1

Something is wrong with the commented test; I will share more updates tomorrow.

Key Changes

🔗 Related Issues

List any issues this PR closes or relates to:

✅ Checklist

Please ensure the following points are addressed before merging:

  • I have performed a self-review of my own code.
  • I have added/updated tests that prove my fix or feature works (if applicable).
  • I have updated the documentation (if necessary).
  • My code follows the project's style guidelines.

@RWL-Dittrich mentioned this pull request Nov 12, 2025
@IAMDAVID0920 (Contributor, Author) commented:
It seems my fix makes no sense; I probably misread the code.
It is good to know that fixing the import ordering resolves the issue, since that makes the patched version get used.

@TianyiZhao1437 (Collaborator) commented:

Hi, the root cause of #188 is that the sglang monkey patches are not taking effect, which is fixed in #228. Actually, the weights of 3 layers of gpt-oss-20b only take ~1.5GB.

The default memory ratios are 0.3 for the KV cache, 0.5 for parameters, and 0.2 for activations + CUDA graphs. We will keep looking into this, though, since it is still a naive design.
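
As a rough illustration of those defaults (back-of-the-envelope arithmetic only, assuming a 12GB card):

total_gb = 12.0                 # e.g. an RTX 4070
kv_cache_gb  = 0.3 * total_gb   # ~3.6GB for the KV cache
params_gb    = 0.5 * total_gb   # ~6.0GB for parameters
act_graph_gb = 0.2 * total_gb   # ~2.4GB for activations + CUDA graphs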

@IAMDAVID0920 (Contributor, Author) commented:


Appreciate your explanation; my naive thought was just to reserve some memory for CUDA graphs and then work from the default memory ratios. Learned something new today!


Development

Successfully merging this pull request may close these issues.

[Bug]: GPU OOM when capturing graph.
