
Conversation

@IAMDAVID0920 (Contributor) commented Nov 12, 2025

📋 PR Title Format

The PR title should follow the format:

type(scope): concise message (max 50 chars)

Where:

  • type is one of: feat, fix, docs, refactor, perf, test, chore.
  • scope is optional and describes the part of the codebase affected (e.g., auth, ui, api).
  • concise message is a short description of the change (max 50 chars).

📝 Change Type

Please select the type of change this PR introduces (choose one or more):

  • feat: New feature.
  • fix: Bug fix.
  • docs: Documentation only changes.
  • refactor: A code change that neither fixes a bug nor adds a feature.
  • perf: Performance improvement.
  • test: Adding missing tests or correcting existing tests.
  • chore: Maintenance tasks (e.g., updating dependencies).

💡 Description

Fixes RTX 4070 OOM during CUDA graph capture by reserving 2GB for CUDA graph overhead in capacity calculations.
Example calculation

RTX 4070 (12GB):
├─ Assigned: 3 layers (~8GB weights)
├─ KV cache: ~3.6GB (at the default 0.3 ratio)
├─ CUDA graphs: ~2GB needed
└─ Total: ~13.6GB > 12GB → OOM!
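
The same arithmetic, as a tiny illustrative Python snippet (numbers copied straight from the breakdown above):

total_vram_gb = 12.0   # RTX 4070
weights_gb    = 8.0    # ~3 assigned layers, per the estimate above
kv_cache_gb   = 0.3 * total_vram_gb   # default 0.3 KV-cache ratio -> ~3.6GB
cuda_graph_gb = 2.0    # CUDA graph capture overhead

needed_gb = weights_gb + kv_cache_gb + cuda_graph_gb
print(f"need ~{needed_gb:.1f}GB vs {total_vram_gb:.0f}GB available")  # ~13.6GB vs 12GB -> OOM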

My naive fix is to reserve 2GB before calculating capacity on CUDA devices; I'm not sure whether this needs to scale with model size for very large models. Please let me know if there are better implementation ideas!

if self.hardware.device == "cuda":
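    # Reserve a fixed budget for CUDA graph capture before computing layer capacity.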
    cuda_graph_overhead_gb = 2.0
    available_memory_bytes -= cuda_graph_overhead_gb * 1024 * 1024 * 1024
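
On the scaling question above, one purely hypothetical direction (the function name, the assigned_weight_gb parameter, and the 1.0/0.1/4.0 constants are all made up for illustration and not part of this PR) would be to grow the reserve with the per-device weight footprint instead of using a flat 2GB:

def cuda_graph_reserve_gb(assigned_weight_gb: float) -> float:
    # Hypothetical: grow the reserve with the weights assigned to this device,
    # starting from 1GB and capped at 4GB; the constants are untuned placeholders.
    return min(4.0, 1.0 + 0.1 * assigned_weight_gb)

# e.g. ~1.8GB for an 8GB shard, capped at 4GB for a 40GB shard
print(cuda_graph_reserve_gb(8.0), cuda_graph_reserve_gb(40.0))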

This fix also reduces the layer capacity for ALL CUDA GPUs:

GPU         Old Capacity   New Capacity   Change
A100-80GB   13 layers      12 layers      -1
A100-40GB   6 layers       6 layers       0
RTX 5090    5 layers       4 layers       -1
RTX 4090    4 layers       3 layers       -1
RTX 4070    2 layers       1 layer        -1

Something is wrong with the commented test; I will share more updates tomorrow.

Key Changes

🔗 Related Issues

List any issues this PR closes or relates to:

✅ Checklist

Please ensure the following points are addressed before merging:

  • I have performed a self-review of my own code.
  • I have added/updated tests that prove my fix or feature works (if applicable).
  • I have updated the documentation (if necessary).
  • My code follows the project's style guidelines.

@RWL-Dittrich mentioned this pull request Nov 12, 2025
@IAMDAVID0920 (Contributor, Author) commented:
It seems my fix makes no sense; I probably misread the code.
It is good to know that fixing the import ordering resolves the issue, since that makes the patched version get used.

@TianyiZhao1437 (Collaborator) commented:

Hi, the root cause of #188 is that the sglang monkey patches are not taking effect, which is fixed in #228. Actually, the weights of 3 layers of gpt-oss-20b only take ~1.5GB.

The default memory ratios are 0.3 for the KV cache, 0.5 for parameters, and 0.2 for activations + CUDA graphs. We will keep looking into this, though, since it is still a naive design.
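
As a rough illustration of those defaults (back-of-the-envelope arithmetic only, assuming a 12GB card):

total_gb = 12.0                 # e.g. an RTX 4070
kv_cache_gb  = 0.3 * total_gb   # ~3.6GB for the KV cache
params_gb    = 0.5 * total_gb   # ~6.0GB for parameters
act_graph_gb = 0.2 * total_gb   # ~2.4GB for activations + CUDA graphs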

@IAMDAVID0920 (Contributor, Author) commented:


Appreciate your explanation; my naive thought was just to reserve some memory for CUDA graphs and then work from the default memory ratios. Learned something new today!


Development

Successfully merging this pull request may close these issues.

[Bug]: GPU OOM when capturing graph.
