Memory Leak in 2.10.4 release #604
Yes, the memory leak was due to the intermediate GPU IPC memory not being released on NVB-based topologies during communicator destruction. It does not occur on NVSwitch-based systems. I have extensively tested the Communicator Destroy and Abort calls, but so far have not detected any additional memory leaks. Do you need to see the code diffs for the NVB memory leak fix?
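A minimal sketch of the pattern being described (not from the NCCL source; helper names are hypothetical): every peer mapping imported with cudaIpcOpenMemHandle() during connect must be released with cudaIpcCloseMemHandle() during destroy, or the mapping of the intermediate GPU buffer outlives the communicator.

```c
#include <cuda_runtime.h>
#include <stddef.h>

/* Hypothetical helpers illustrating the open/close pairing. */
static void* importPeerBuffer(cudaIpcMemHandle_t handle) {
  void* ptr = NULL;
  /* Maps the peer GPU's allocation into this process. */
  cudaIpcOpenMemHandle(&ptr, handle, cudaIpcMemLazyEnablePeerAccess);
  return ptr;
}

static void releasePeerBuffer(void* ptr) {
  /* Skipping this at communicator destroy leaks the mapping,
     which is what happened for the intermediate NVB buffer. */
  if (ptr) cudaIpcCloseMemHandle(ptr);
}
```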
@AddyLaddy sure, the code diff would be great! Also, is there a unit test or something that exercises destroy/abort that we can try out?
I think these are all the changes you need to fix the NVB memory leak:
diff --git a/src/bootstrap.cc b/src/bootstrap.cc
index f5e9f565..ae9da9be 100644
--- a/src/bootstrap.cc
+++ b/src/bootstrap.cc
@@ -202,7 +202,7 @@ struct unexConn {
struct remAllocState {
int cudaDev;
int listenFd;
- int stop;
+ volatile int stop;
};
struct extState {
@@ -257,7 +257,7 @@ void* ncclRemoteMemAllocationService(void* args) {
for (int s=0; s<MAX_SEGMENTS; s++) segments[s] = NULL;
for (int s=0; s<MAX_SEGMENTS; s++) {
pollfds[s].fd = -1;
- pollfds[s].events = POLLHUP;
+ pollfds[s].events = POLLIN;
}
pollfds[MAX_SEGMENTS].fd = state->listenFd;
pollfds[MAX_SEGMENTS].events = POLLIN;
@@ -285,7 +285,7 @@ void* ncclRemoteMemAllocationService(void* args) {
}
}
for (int s=0; s<MAX_SEGMENTS; s++) {
- if (pollfds[s].revents & POLLHUP) {
+ if (pollfds[s].revents & (POLLIN|POLLHUP)) {
if (cudaFree(segments[s]) != cudaSuccess) {
WARN("[Rem Allocator] cudaFree %p failed", segments[s]);
}
diff --git a/src/transport/p2p.cc b/src/transport/p2p.cc
index 38ac57dc..5bd92b11 100644
--- a/src/transport/p2p.cc
+++ b/src/transport/p2p.cc
@@ -21,6 +21,7 @@ struct p2pSendResources {
void* ipcPtr;
int remoteId;
int memRank;
+ void* remIpcPtr;
void* bootstrap;
};
@@ -29,6 +30,7 @@ struct p2pRecvResources {
void* ipcPtr;
int remoteId;
int memRank;
+ void* remIpcPtr;
void* bootstrap;
};
@@ -252,7 +254,7 @@ static ncclResult_t p2pSendConnect(struct ncclComm* comm, struct ncclConnect* co
struct ncclRecvMem* remDevMem;
struct p2pConnectInfo* info = (struct p2pConnectInfo*)connectInfo;
- NCCLCHECK(p2pMap(comm->peerInfo+rank, comm->peerInfo+info->rank, info, (void**)&remDevMem, &resources->ipcPtr));
+ NCCLCHECK(p2pMap(comm->peerInfo+rank, comm->peerInfo+info->rank, info, (void**)&remDevMem, &resources->remIpcPtr));
int offset = 0;
for (int p=0; p<NCCL_NUM_PROTOCOLS; p++) {
@@ -276,7 +278,7 @@ ncclResult_t p2pRecvConnect(struct ncclComm* comm, struct ncclConnect* connectIn
struct ncclSendMem* remDevMem;
struct p2pConnectInfo* info = (struct p2pConnectInfo*)connectInfo;
- NCCLCHECK(p2pMap(comm->peerInfo+rank, comm->peerInfo+info->rank, info, (void**)&remDevMem, &resources->ipcPtr));
+ NCCLCHECK(p2pMap(comm->peerInfo+rank, comm->peerInfo+info->rank, info, (void**)&remDevMem, &resources->remIpcPtr));
int offset = 0;
for (int p=0; p<NCCL_NUM_PROTOCOLS; p++) {
@@ -298,6 +300,8 @@ ncclResult_t p2pSendFree(void* resources) {
struct p2pSendResources* sendRes = (struct p2pSendResources*)resources;
if (sendRes->ipcPtr)
CUDACHECK(cudaIpcCloseMemHandle(sendRes->ipcPtr));
+ if (sendRes->remIpcPtr)
+ CUDACHECK(cudaIpcCloseMemHandle(sendRes->remIpcPtr));
if (sendRes->remoteId != -1) {
NCCLCHECK(bootstrapRemFree(sendRes->remoteId, sendRes->memRank, sendRes->bootstrap));
sendRes->devMem = NULL;
@@ -311,6 +315,8 @@ ncclResult_t p2pRecvFree(void* resources) {
struct p2pRecvResources* recvRes = (struct p2pRecvResources*)resources;
if (recvRes->ipcPtr)
CUDACHECK(cudaIpcCloseMemHandle(recvRes->ipcPtr));
+ if (recvRes->remIpcPtr)
+ CUDACHECK(cudaIpcCloseMemHandle(recvRes->remIpcPtr));
if (recvRes->remoteId != -1) {
NCCLCHECK(bootstrapRemFree(recvRes->remoteId, recvRes->memRank, recvRes->bootstrap));
recvRes->devMem = NULL;
Thanks @AddyLaddy! I wrote a simple program that creates and destroys an NCCL communicator in a loop. Here is an output with 2.11.4, run with 4 interconnected GPUs on a DGX-1-like machine (so there should be no NVB leak):
It looks like 50 MiB is not freed after the first iteration, but the free memory is stable after that. Another finding is that this amount varies from one NCCL version to another. I still need to think about where the 50 MiB comes from, but it seems related to some static initialization, since it does not grow after the first iteration. I'd appreciate your help in understanding this!
Yes, there is some memory consumed by the CUDA runtime that is not released, perhaps the CUDA malloc heap and other context info. We have a test that performs a similar check. Due to the 'background' memory usage of the CUDA runtime, I sample the CUDA memory stats after a warmup communicator Alloc/Destroy, so that I am only looking for NCCL memory leaks.
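A minimal sketch of such a check, assuming a machine with NCCL installed and the GPUs visible (this is not the exact test from the thread): iteration 0 acts as the warmup that absorbs the CUDA runtime's one-time allocations, and free memory reported by cudaMemGetInfo() should stay flat across the remaining iterations if NCCL itself leaks nothing.

```c
#include <stdio.h>
#include <cuda_runtime.h>
#include <nccl.h>

int main(void) {
  int ndev = 0;
  if (cudaGetDeviceCount(&ndev) != cudaSuccess || ndev == 0) return 1;
  if (ndev > 8) ndev = 8;  /* bound the comms array below */

  for (int iter = 0; iter < 5; iter++) {
    /* One communicator per visible GPU, then tear them all down. */
    ncclComm_t comms[8];
    if (ncclCommInitAll(comms, ndev, NULL) != ncclSuccess) return 1;
    for (int i = 0; i < ndev; i++) ncclCommDestroy(comms[i]);

    size_t freeB = 0, totalB = 0;
    cudaSetDevice(0);
    cudaMemGetInfo(&freeB, &totalB);
    /* Iteration 0 is the warmup; only compare later iterations. */
    printf("iter %d: free %zu MiB\n", iter, freeB >> 20);
  }
  return 0;
}
```

Requires GPU hardware plus the CUDA and NCCL runtimes, so it is shown as an untested sketch here.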
Just want to seek some clarification on the memory leak issue that was fixed in the 2.11 release (per the release notes). Can you give us some details about the leak? We're noticing two potential memory leaks in the NCCL 2.10 release, both of which seem to be related to communicator initialize + abort.