fix(bug): Reserve 2GB for CUDA graph overhead, prevent GPU OOM #225
📋 PR Title Format
The PR title should follow the format:
Where:
- `type` is one of: `feat`, `fix`, `docs`, `refactor`, `perf`, `test`, `chore`.
- `scope` is optional and describes the part of the codebase affected (e.g., `auth`, `ui`, `api`).
- `concise message` is a short description of the change (max 50 chars).

📝 Change Type
Please select the type of change this PR introduces (choose one or more):
💡 Description
Fixes an out-of-memory (OOM) error on the RTX 4070 during CUDA graph capture by reserving 2GB of GPU memory for CUDA graph overhead in the layer capacity calculation.
Example calculation
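As a sketch of the arithmetic involved, the formula below assumes that layer capacity is computed as free GPU memory divided by a fixed per-layer memory cost. The free-memory and per-layer figures in the worked example are hypothetical placeholders, not measurements from this PR.

$$
N_{\text{layers}} = \left\lfloor \frac{M_{\text{free}} - M_{\text{reserve}}}{M_{\text{layer}}} \right\rfloor,
\qquad
M_{\text{reserve}} =
\begin{cases}
2\ \text{GB} & \text{on CUDA devices} \\
0 & \text{otherwise}
\end{cases}
$$

$$
\text{e.g. } M_{\text{free}} = 11\ \text{GB},\ M_{\text{layer}} = 0.5\ \text{GB}:\quad
\left\lfloor \tfrac{11}{0.5} \right\rfloor = 22 \ \text{(no reserve)},\qquad
\left\lfloor \tfrac{11 - 2}{0.5} \right\rfloor = 18 \ \text{(with reserve)}
$$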
My simple fix is to reserve 2GB before calculating capacity on CUDA devices. I'm not sure whether the reserve needs to scale with model size for very large models; please let me know if there are better implementation ideas!
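A minimal sketch of the "reserve before computing capacity" idea, assuming capacity is simply usable bytes divided by a per-layer byte cost. The names `layer_capacity` and `CUDA_GRAPH_RESERVE_BYTES` are hypothetical and do not come from this project's code.

```python
# Hypothetical sketch: hold back a fixed reserve on CUDA before computing layer capacity.
# Names are illustrative only; they are not this project's real API.

GIB = 1024 ** 3
CUDA_GRAPH_RESERVE_BYTES = 2 * GIB  # fixed reserve for CUDA graph capture overhead


def layer_capacity(free_bytes: int, bytes_per_layer: int, device_type: str) -> int:
    """Return how many layers fit in free_bytes, holding back a reserve on CUDA."""
    if bytes_per_layer <= 0:
        raise ValueError("bytes_per_layer must be positive")
    reserve = CUDA_GRAPH_RESERVE_BYTES if device_type == "cuda" else 0
    usable = max(free_bytes - reserve, 0)  # clamp so small GPUs never go negative
    return usable // bytes_per_layer


if __name__ == "__main__":
    free = 11 * GIB          # hypothetical free memory
    per_layer = GIB // 2     # hypothetical per-layer cost
    print(layer_capacity(free, per_layer, "cuda"))  # 18
    print(layer_capacity(free, per_layer, "cpu"))   # 22
```

In a real PyTorch setup, `free_bytes` could come from `torch.cuda.mem_get_info()`, which returns `(free, total)` in bytes. Scaling the reserve with model size would only mean replacing the constant with a computed value.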
This fix also reduces the layer capacity for ALL CUDA GPUs:
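To show why every CUDA GPU loses capacity, here is a sketch that compares capacity with and without the fixed reserve for a few free-memory sizes. The free-memory values and the 0.5 GiB per-layer cost are hypothetical. With a fixed 2 GiB reserve that is an exact multiple of the layer size, each card loses the same number of layers regardless of total VRAM, so the relative loss is largest on small cards.

```python
# Hypothetical comparison: layers lost to a fixed 2 GiB CUDA reserve.
GIB = 1024 ** 3
RESERVE_BYTES = 2 * GIB
BYTES_PER_LAYER = GIB // 2  # assumed per-layer cost, illustrative only


def capacity(free_bytes: int, reserve_bytes: int = 0) -> int:
    """Layers that fit after setting aside reserve_bytes."""
    return max(free_bytes - reserve_bytes, 0) // BYTES_PER_LAYER


def capacity_loss(free_gib: int) -> tuple[int, int, int]:
    """Return (capacity without reserve, capacity with reserve, layers lost)."""
    free = free_gib * GIB
    before = capacity(free)
    after = capacity(free, RESERVE_BYTES)
    return before, after, before - after


if __name__ == "__main__":
    for free_gib in (8, 12, 24, 80):  # hypothetical free-memory sizes
        before, after, lost = capacity_loss(free_gib)
        print(f"{free_gib:>3} GiB free: {before:>3} -> {after:>3} layers (-{lost})")
```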
Something is wrong with the commented-out test; I will share more updates tomorrow.
Key Changes
🔗 Related Issues
List any issues this PR closes or relates to:
✅ Checklist
Please ensure the following points are addressed before merging: