segfault at transportDestroyProxy line 241 #191
Comments
Thanks for the report. How many nodes and GPUs per node are you running with?
The above system is a single node with 2x GPUs (K80s):

    # lspci -tv
    -+
     |
     \-[0000:00]-+
                 +
                 +-02.0-[06-0a]----00.0-[07-09]--+-08.0-[08]----00.0  NVIDIA Corporation GK210GL [Tesla K80]
                                                 \-10.0-[09]----00.0  NVIDIA Corporation GK210GL [Tesla K80]

We subsequently switched to a Power8 system (still single node) running RHEL 7.5, CUDA 10.1, Driver 418.39, NCCL 2.4.2-1; this system has 4xK80:

    # lspci -tv
    +-[0002:00]---00.0-[01-04]----00.0-[02-04]--+-08.0-[03]----00.0  NVIDIA Corporation GK210GL [Tesla K80]
    |                                           \-10.0-[04]----00.0  NVIDIA Corporation GK210GL [Tesla K80]
    +
    \-[0000:00]---00.0-[01-04]----00.0-[02-04]--+-08.0-[03]----00.0  NVIDIA Corporation GK210GL [Tesla K80]
                                                \-10.0-[04]----00.0  NVIDIA Corporation GK210GL [Tesla K80]

I added a bit of additional logging (the infoproxyState pointer and the INFO(...) calls below) to /.../src/transport.cu:

    ncclResult_t transportAllocateProxyArgs(struct ncclComm* comm, struct ncclProxyArgs** argsptr) {
      ...
      elem->next = elem->nextPeer = NULL;
      *argsptr = elem;
      if (state) INFO(NCCL_INIT, "END transportAllocateProxyArgs(): state = %p ; state->pools = %p ; state->pools->next = %p", state, state->pools, state->pools->next);
      return ncclSuccess;
    }

    ncclResult_t transportDestroyProxy(struct ncclComm* comm) {
      struct ncclProxyState* state = &comm->proxyState;
      ...
      struct ncclProxyState* infoproxyState = &comm->proxyState;
      if (infoproxyState) INFO(NCCL_INIT, "PRE free(): proxyState = %p ; proxyState->pools = %p", infoproxyState, infoproxyState->pools);
      // Free off any memory allocated for the proxy arg pools
      pthread_mutex_lock(&state->mutex);
      struct ncclProxyState* proxyState = &comm->proxyState;
      while (proxyState->pools != NULL) {
        struct ncclProxyPool *next = proxyState->pools->next;
        free(proxyState->pools);
        proxyState->pools = next;
      }
      pthread_mutex_unlock(&state->mutex);
      if (infoproxyState) INFO(NCCL_INIT, "POST free(): proxyState = %p ; proxyState->pools = %p", infoproxyState, infoproxyState->pools);
      return ncclSuccess;
    }

Relevant output from the above:

/var/tmp/dbg/nccldebug-cit1077.114378.out

    cit1077:114378:114378 [0] NCCL INFO PRE free(): proxyState = 0x1005a4430a0 ; proxyState->pools = (nil)
    cit1077:114378:114378 [0] NCCL INFO POST free(): proxyState = 0x1005a4430a0 ; proxyState->pools = (nil)
    cit1077:114378:114378 [0] NCCL INFO Destroyed comm 0x1005a440f70 rank 1
    cit1077:114378:114378 [0] NCCL INFO PRE free(): proxyState = 0x1005a4515a0 ; proxyState->pools = (nil)
    cit1077:114378:114378 [0] NCCL INFO POST free(): proxyState = 0x1005a4515a0 ; proxyState->pools = (nil)
    cit1077:114378:114378 [0] NCCL INFO Destroyed comm 0x1005a44f470 rank 1
    cit1077:114378:114378 [0] NCCL INFO PRE free(): proxyState = 0x1005a4430a0 ; proxyState->pools = 0x2860

/var/tmp/dbg/nccldebug-cit1077.114455.out

    cit1077:114455:114455 [0] NCCL INFO PRE free(): proxyState = 0x100933a2ce0 ; proxyState->pools = (nil)
    cit1077:114455:114455 [0] NCCL INFO POST free(): proxyState = 0x100933a2ce0 ; proxyState->pools = (nil)
    cit1077:114455:114455 [0] NCCL INFO Destroyed comm 0x100933a0bb0 rank 0
    cit1077:114455:114455 [0] NCCL INFO PRE free(): proxyState = 0x10093385c60 ; proxyState->pools = (nil)
    cit1077:114455:114455 [0] NCCL INFO POST free(): proxyState = 0x10093385c60 ; proxyState->pools = (nil)
    cit1077:114455:114455 [0] NCCL INFO Destroyed comm 0x10093383b30 rank 0
    cit1077:114455:114455 [0] NCCL INFO PRE free(): proxyState = 0x100933a2ce0 ; proxyState->pools = 0x2350

/var/tmp/dbg/nccldebug-cit1077.17388.out

    cit1077:17388:17388 [0] NCCL INFO PRE free(): proxyState = 0x10076a97660 ; proxyState->pools = (nil)
    cit1077:17388:17388 [0] NCCL INFO POST free(): proxyState = 0x10076a97660 ; proxyState->pools = (nil)
    cit1077:17388:17388 [0] NCCL INFO Destroyed comm 0x10076a95530 rank 0
    cit1077:17388:17388 [0] NCCL INFO PRE free(): proxyState = 0x10076aa6020 ; proxyState->pools = (nil)
    cit1077:17388:17388 [0] NCCL INFO POST free(): proxyState = 0x10076aa6020 ; proxyState->pools = (nil)
    cit1077:17388:17388 [0] NCCL INFO Destroyed comm 0x10076aa3ef0 rank 0
    cit1077:17388:17388 [0] NCCL INFO PRE free(): proxyState = 0x10076a97660 ; proxyState->pools = 0x2350

/var/tmp/dbg/nccldebug-cit1077.17393.out

    cit1077:17393:17393 [0] NCCL INFO PRE free(): proxyState = 0x100938b38f0 ; proxyState->pools = (nil)
    cit1077:17393:17393 [0] NCCL INFO POST free(): proxyState = 0x100938b38f0 ; proxyState->pools = (nil)
    cit1077:17393:17393 [0] NCCL INFO Destroyed comm 0x100938b17c0 rank 1
    cit1077:17393:17393 [0] NCCL INFO PRE free(): proxyState = 0x10093896360 ; proxyState->pools = (nil)
    cit1077:17393:17393 [0] NCCL INFO POST free(): proxyState = 0x10093896360 ; proxyState->pools = (nil)
    cit1077:17393:17393 [0] NCCL INFO Destroyed comm 0x10093894230 rank 1
    cit1077:17393:17393 [0] NCCL INFO PRE free(): proxyState = 0x100938b38f0 ; proxyState->pools = 0x2350

In each instance above, the final proxyState->pools value (logged just before the crash with SIGSEGV) looks like an invalid address. Note also that the final PRE free() line in each file reports the same proxyState address as the first PRE free() line in that file.
After reviewing additional logging, it appeared that our code called ncclCommDestroy() twice with the same handle. When the code was changed to remove the duplicate call, processing continued correctly to completion. We would like to understand the difference in behavior between NCCL 2.3.x and NCCL 2.4.2.
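One way application code can avoid a duplicate destroy of the kind described above is to destroy through a pointer to the handle and clear it afterwards. The following is a minimal sketch, not NCCL code: `fake_comm_t`, `fake_comm_create`, `fake_comm_destroy`, and `destroy_once` are hypothetical stand-ins for an opaque communicator handle and its create/destroy calls.

```c
#include <stdlib.h>

/* Hypothetical stand-in for an opaque communicator handle (assumption, not NCCL's type). */
typedef struct fake_comm { int rank; } *fake_comm_t;

/* Counts how many times the underlying destroy actually ran (for illustration only). */
static int destroy_calls = 0;

/* Hypothetical stand-in for a library create call. Returns NULL on allocation failure. */
static fake_comm_t fake_comm_create(int rank) {
    fake_comm_t c = malloc(sizeof(*c));
    if (c != NULL) c->rank = rank;
    return c;
}

/* Hypothetical stand-in for a library destroy call: releases the handle's memory. */
static void fake_comm_destroy(fake_comm_t comm) {
    destroy_calls++;
    free(comm);
}

/* Destroy the communicator at most once through this variable.
 * Returns 1 if the handle was destroyed, 0 if it was already cleared. */
static int destroy_once(fake_comm_t *comm) {
    if (comm == NULL || *comm == NULL) return 0;
    fake_comm_destroy(*comm);
    *comm = NULL;  /* a second call through the same variable becomes a no-op */
    return 1;
}
```

This only helps when every destroy goes through the same variable; a copy of the handle stored elsewhere would still dangle after the first destroy.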
OK, thanks for the update, Ashu. Glad you've been able to find a resolution at your end. I'll make some enhancements to NCCL to try to catch this user error next time. As for the differences between 2.3.x and 2.4: the linked list in question didn't exist in 2.3.x, so I think we were just unlucky that the free() code was 'corrupting' the freed memory in a location that was assumed to be a linked list pointer.
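To illustrate the point about the linked list, here is a minimal sketch of a pool-freeing loop with the same shape as the one quoted from transportDestroyProxy earlier in the thread. The types `demo_pool` and `demo_state` and the helper names are hypothetical, and their fields are assumptions. The sketch shows that repeating the loop is harmless while the owning structure is still alive, because the head pointer ends up NULL. In the reported case the head pointer lived inside the communicator, so once the communicator itself had been freed, a second pass would read whatever the allocator had left in that memory.

```c
#include <stdlib.h>

/* Hypothetical stand-ins for the pool list and its owner (field layout is an assumption). */
struct demo_pool  { struct demo_pool *next; };
struct demo_state { struct demo_pool *pools; };

/* Push one newly allocated pool onto the list. Returns 0 on success, -1 on failure. */
static int demo_add_pool(struct demo_state *s) {
    struct demo_pool *p = malloc(sizeof(*p));
    if (p == NULL) return -1;
    p->next = s->pools;
    s->pools = p;
    return 0;
}

/* Free every pool, leaving s->pools == NULL. Returns the number of pools freed. */
static int demo_free_pools(struct demo_state *s) {
    int freed = 0;
    while (s->pools != NULL) {
        struct demo_pool *next = s->pools->next;  /* read next before freeing the node */
        free(s->pools);
        s->pools = next;
        freed++;
    }
    return freed;
}
```

Calling `demo_free_pools` twice on a live `demo_state` frees the pools once and then does nothing; the same guarantee does not hold if the memory holding `demo_state` has already been released.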
Closing since I think this issue has been solved.
We get a segfault when we destroy the current ncclComm:
in transportDestroyProxy:
attempting to dereference:
ENV
Lenovo x3650 M5 (Intel x86_64)
48 cores / 377 GB
Ubuntu 18.04.2 LTS
uname -r: 4.15.0-45-generic
CUDA: 10.1
NCCL: 2.4.2-1
Note: the same code works fine with NCCL 2.3.5-5
(We built a version of the NCCL libraries with ENABLE_TRACE; however, we do not see any additional output beyond that enabled by "NCCL_DEBUG=INFO NCCL_DEBUG_SUBSYS=ALL".)