Conversation

@Qubitium
Collaborator

No description provided.

Signed-off-by: Qubitium <Qubitium@modelcloud.ai>
@Qubitium
Collaborator Author

Qubitium commented Sep 27, 2025

@avtc Ignore the PR title. This adds a rough (estimated) memory-deallocation check and counter so we only call torch.cuda.empty_cache when we detect that roughly 1/4 of your GPU's max VRAM has been deallocated recently. The default value is "auto", set at the bottom of memory.py. Right now "auto" resolves to min GPU VRAM / 4. You can override this value if "auto" doesn't work for you.

The goal is to call it as rarely as possible, since it is extremely slow.

Pass the DEBUG=1 env var before you call gptqmodel to get the memory dealloc/alloc debug logs.
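The threshold logic described above can be sketched roughly as follows. This is a minimal illustration, not the actual gptqmodel implementation: the class name `DeallocTracker`, the `on_dealloc` hook, and the injectable `empty_cache` callback are all hypothetical, chosen so the counter logic can be shown (and tested) without a CUDA device. The real code would call `torch.cuda.empty_cache()` directly.

```python
class DeallocTracker:
    """Hypothetical sketch: batch expensive cache flushes behind a
    byte counter, flushing only after ~1/4 of max VRAM is freed."""

    def __init__(self, max_vram_bytes, threshold="auto", empty_cache=None):
        # "auto" mirrors the PR's default: min GPU VRAM / 4.
        self.threshold = max_vram_bytes // 4 if threshold == "auto" else threshold
        self.freed = 0  # bytes deallocated since the last flush
        # Injectable for testing; in practice this would be
        # torch.cuda.empty_cache.
        self._empty_cache = empty_cache or (lambda: None)

    def on_dealloc(self, nbytes):
        """Record freed bytes; flush once the threshold is reached."""
        self.freed += nbytes
        if self.freed >= self.threshold:
            self._empty_cache()  # the slow call, now amortized
            self.freed = 0
            return True  # cache was flushed
        return False
```

Example use: with a 16 GiB GPU, small deallocations accumulate silently, and only once 4 GiB total has been freed does the (slow) flush fire.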

@Qubitium Qubitium marked this pull request as ready for review September 27, 2025 05:27
@Qubitium Qubitium merged commit f0491ff into main Sep 27, 2025
5 checks passed
@Qubitium Qubitium deleted the input branch September 27, 2025 05:27