Qubitium (Collaborator) commented Oct 1, 2025

@avtc This PR will fix your PVML env crash. I reproduced it on my system, and the change now bypasses Accelerate's thread-unsafe cache-clearing path entirely. The core issue is that accelerate.utils.modeling.clear_device_cache is not thread-safe because it:

  1. calls the device-level torch.cuda.empty_cache() without a proper thread context, and without checking whether the calling ops (code paths) actually involve CUDA
  2. mutates os.environ without holding a lock
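A minimal sketch of the kind of guard these two points suggest, assuming the fix amounts to (1) flushing the allocator only on CUDA code paths, under a lock and with the target device made current for the calling thread, and (2) serializing writes to os.environ. This is not the code merged in this PR; the helper names safe_clear_device_cache and set_env_locked and the module-level locks are hypothetical.

```python
import os
import threading
from typing import Optional

# Hypothetical module-level locks: one serializes allocator cache flushes,
# the other serializes writes to os.environ. They only protect callers that
# go through these helpers; code that writes os.environ directly is not covered.
_CACHE_LOCK = threading.Lock()
_ENV_LOCK = threading.Lock()


def safe_clear_device_cache(device_type: str, device_index: Optional[int] = None) -> bool:
    """Hypothetical helper: flush the CUDA caching allocator only on CUDA paths.

    Returns True if torch.cuda.empty_cache() was actually called.
    """
    # Skip non-CUDA call paths (e.g. "cpu", "mps", "xpu") instead of
    # touching CUDA state unconditionally.
    if device_type != "cuda":
        return False
    try:
        import torch  # imported lazily so the sketch runs without torch installed
    except ImportError:
        return False
    if not torch.cuda.is_available():
        return False
    with _CACHE_LOCK:
        if device_index is None:
            torch.cuda.empty_cache()
        else:
            # The CUDA "current device" is per host thread, so make the
            # target device current for this thread before flushing.
            with torch.cuda.device(device_index):
                torch.cuda.empty_cache()
    return True


def set_env_locked(key: str, value: str) -> None:
    """Hypothetical helper: write an environment variable under a process-wide lock."""
    with _ENV_LOCK:
        os.environ[key] = value
```

Note that a lock around os.environ only coordinates callers that agree to use it; writes made inside a third-party library are still unguarded, which is one reason to bypass the library's code path entirely rather than wrap it.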

Signed-off-by: Qubitium <Qubitium@modelcloud.ai>
Qubitium marked this pull request as ready for review October 1, 2025 01:41
Qubitium merged commit 32dbaf0 into main Oct 1, 2025
5 checks passed
Qubitium deleted the bypass-accelerate branch October 1, 2025 02:22