-
Notifications
You must be signed in to change notification settings - Fork 55
Fix memory leak bug #534
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Fix memory leak bug #534
Conversation
Also, I wonder if the MR registered in IB should also be properly unregistered in RegisteredMemory's destructor? mscclpp/src/registered_memory.cc Line 90 in 7263bb7
|
For MR and QP, thinking if we could wrap it into |
WARN("Attempted to remove a memory that is not registered or already removed: %u", memoryId); | ||
return; | ||
} | ||
memories_[memoryId] = RegisteredMemory(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need a resource lock so that the removal doesn't happen while there exists unflushed triggers on the flight in the proxy. Also, we need a mechanism that prevents proxies on remote ranks from processing requests on this RegisteredMemory.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think users are responsible for safely releasing the buffer. Before removing it, ensure that no peers or device-side operations are still accessing this memory. Adding spinlock here to prevent race conditions inside the proxy_channel.
/azp run |
Azure Pipelines successfully started running 3 pipeline(s). |
/azp run |
Azure Pipelines successfully started running 3 pipeline(s). |
Fix #533