segfault at transportDestroyProxy line 241 #191

Closed
ashuatibm opened this issue Mar 6, 2019 · 5 comments

@ashuatibm

We get a segfault when we destroy the current ncclComm:

#0 0x00007fdba8a61838 in transportDestroyProxy (comm=comm@entry=0x5604b2781850) from /opt/anaconda3/envs/dlipy2/lib/python2.7/site-packages/torch/lib/../../../../libnccl.so.2
#1 0x00007fdba8a5d82b in commDestroy (comm=0x5604b2781850) from /opt/anaconda3/envs/dlipy2/lib/python2.7/site-packages/torch/lib/../../../../libnccl.so.2
#2 ncclCommDestroy (comm=0x5604b2781850) at /tmp/tmpxft_0000756e_00000000-5_init.compute_70.cudafe1.cpp:1158

in transportDestroyProxy:

  while (proxyState->pools != NULL) {
    struct ncclProxyPool *next = proxyState->pools->next;
    free(proxyState->pools);
    proxyState->pools = next;
  }

attempting to dereference:

(gdb) p * & comm->proxyState->pools->next
Cannot access memory at address 0x21f0
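
For reference, here is a minimal, self-contained sketch of the same loop. It assumes hypothetical stand-in structs (`pool`, `proxy_state`) and helper names (`push_pools`, `free_pools`) rather than NCCL's real definitions. On a well-formed list the loop frees every node and leaves `pools` NULL, which suggests the crash comes from a corrupted `pools` pointer rather than from the loop logic itself.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-ins for the NCCL structs: only the fields the loop touches. */
struct pool { struct pool *next; };
struct proxy_state { struct pool *pools; };

/* Push n freshly allocated pools onto the list. Returns 0 on success, -1 on OOM. */
int push_pools(struct proxy_state *st, int n) {
  for (int i = 0; i < n; i++) {
    struct pool *p = malloc(sizeof *p);
    if (p == NULL) return -1;
    p->next = st->pools;
    st->pools = p;
  }
  return 0;
}

/* Same shape as the loop above: read `next` before freeing the head.
 * Returns the number of pools freed; st->pools is NULL afterwards. */
int free_pools(struct proxy_state *st) {
  int freed = 0;
  while (st->pools != NULL) {
    struct pool *next = st->pools->next;
    free(st->pools);
    st->pools = next;
    freed++;
  }
  return freed;
}
```

With a zero-initialized `struct proxy_state`, calling `push_pools(&st, 3)` and then `free_pools(&st)` frees three nodes, and a second `free_pools(&st)` is a no-op because `pools` is already NULL.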

ENV
    Lenovo x3650 M5 (Intel x86_64)
        48core / 377GB
    Ubuntu 18.04.2 LTS
        uname -r: 4.15.0-45-generic
    CUDA: 10.1
    NCCL: 2.4.2-1
        Note: the same code works fine with NCCL 2.3.5-5

(We built a version of the NCCL libraries with ENABLE_TRACE; however, we do not see any additional output beyond that enabled by "NCCL_DEBUG=INFO NCCL_DEBUG_SUBSYS=ALL".)

@AddyLaddy
Collaborator

Thanks for the report.
If you set NCCL_DEBUG=TRACE you will see the extra logging generated by the TRACE=1 build option.
Perhaps you can add a TRACE() call to the failing function to see which pointer is corrupt?

How many nodes, and how many GPUs per node, are you running with?

@ashuatibm
Author

The above system is a single node with 2x GPUs (K80s):

# lspci -tv
-+
|
 \-[0000:00]-+
             +
             +-02.0-[06-0a]----00.0-[07-09]--+-08.0-[08]----00.0  NVIDIA Corporation GK210GL [Tesla K80]
                                             \-10.0-[09]----00.0  NVIDIA Corporation GK210GL [Tesla K80]

We subsequently switched to a Power8 system (still single node) running RHEL 7.5, CUDA 10.1, Driver 418.39, NCCL 2.4.2-1; this system has 4xK80:

# lspci -tv
+-[0002:00]---00.0-[01-04]----00.0-[02-04]--+-08.0-[03]----00.0 NVIDIA Corporation GK210GL [Tesla K80]
|                                           \-10.0-[04]----00.0 NVIDIA Corporation GK210GL [Tesla K80]
+
\-[0000:00]---00.0-[01-04]----00.0-[02-04]--+-08.0-[03]----00.0 NVIDIA Corporation GK210GL [Tesla K80]
                                            \-10.0-[04]----00.0 NVIDIA Corporation GK210GL [Tesla K80]

I added a bit of additional logging (the INFO() calls below) to /.../src/transport.cu:

ncclResult_t transportAllocateProxyArgs(struct ncclComm* comm, struct ncclProxyArgs** argsptr) {
...
  elem->next = elem->nextPeer = NULL;
  *argsptr = elem;
  if ( state) INFO(NCCL_INIT, "END transportAllocateProxyArgs(): state = %p ; state->pools = %p ; state->pools->next = %p", state, state->pools, state->pools->next);
  return ncclSuccess;
}

ncclResult_t transportDestroyProxy(struct ncclComm* comm) {
  struct ncclProxyState* state = &comm->proxyState;
...
  struct ncclProxyState* infoproxyState = &comm->proxyState;
  if ( infoproxyState) INFO(NCCL_INIT, "PRE free(): proxyState = %p ; proxyState->pools = %p", infoproxyState,infoproxyState->pools);

  // Free off any memory allocated for the proxy arg pools
  pthread_mutex_lock(&state->mutex);
  struct ncclProxyState* proxyState = &comm->proxyState;
  while (proxyState->pools != NULL) {
    struct ncclProxyPool *next = proxyState->pools->next;
    free(proxyState->pools);
    proxyState->pools = next;
  }
  pthread_mutex_unlock(&state->mutex);

  if ( infoproxyState) INFO(NCCL_INIT, "POST free(): proxyState = %p ; proxyState->pools = %p", infoproxyState,infoproxyState->pools);

  return ncclSuccess;
}

Relevant output from the above:

/var/tmp/dbg/nccldebug-cit1077.114378.out
    cit1077:114378:114378 [0] NCCL INFO PRE free(): proxyState = 0x1005a4430a0 ; proxyState->pools = (nil)
    cit1077:114378:114378 [0] NCCL INFO POST free(): proxyState = 0x1005a4430a0 ; proxyState->pools = (nil)
    cit1077:114378:114378 [0] NCCL INFO Destroyed comm 0x1005a440f70 rank 1
    cit1077:114378:114378 [0] NCCL INFO PRE free(): proxyState = 0x1005a4515a0 ; proxyState->pools = (nil)
    cit1077:114378:114378 [0] NCCL INFO POST free(): proxyState = 0x1005a4515a0 ; proxyState->pools = (nil)
    cit1077:114378:114378 [0] NCCL INFO Destroyed comm 0x1005a44f470 rank 1
    cit1077:114378:114378 [0] NCCL INFO PRE free(): proxyState = 0x1005a4430a0 ; proxyState->pools = 0x2860

/var/tmp/dbg/nccldebug-cit1077.114455.out
    cit1077:114455:114455 [0] NCCL INFO PRE free(): proxyState = 0x100933a2ce0 ; proxyState->pools = (nil)
    cit1077:114455:114455 [0] NCCL INFO POST free(): proxyState = 0x100933a2ce0 ; proxyState->pools = (nil)
    cit1077:114455:114455 [0] NCCL INFO Destroyed comm 0x100933a0bb0 rank 0
    cit1077:114455:114455 [0] NCCL INFO PRE free(): proxyState = 0x10093385c60 ; proxyState->pools = (nil)
    cit1077:114455:114455 [0] NCCL INFO POST free(): proxyState = 0x10093385c60 ; proxyState->pools = (nil)
    cit1077:114455:114455 [0] NCCL INFO Destroyed comm 0x10093383b30 rank 0
    cit1077:114455:114455 [0] NCCL INFO PRE free(): proxyState = 0x100933a2ce0 ; proxyState->pools = 0x2350

/var/tmp/dbg/nccldebug-cit1077.17388.out
    cit1077:17388:17388 [0] NCCL INFO PRE free(): proxyState = 0x10076a97660 ; proxyState->pools = (nil)
    cit1077:17388:17388 [0] NCCL INFO POST free(): proxyState = 0x10076a97660 ; proxyState->pools = (nil)
    cit1077:17388:17388 [0] NCCL INFO Destroyed comm 0x10076a95530 rank 0
    cit1077:17388:17388 [0] NCCL INFO PRE free(): proxyState = 0x10076aa6020 ; proxyState->pools = (nil)
    cit1077:17388:17388 [0] NCCL INFO POST free(): proxyState = 0x10076aa6020 ; proxyState->pools = (nil)
    cit1077:17388:17388 [0] NCCL INFO Destroyed comm 0x10076aa3ef0 rank 0
    cit1077:17388:17388 [0] NCCL INFO PRE free(): proxyState = 0x10076a97660 ; proxyState->pools = 0x2350

/var/tmp/dbg/nccldebug-cit1077.17393.out
    cit1077:17393:17393 [0] NCCL INFO PRE free(): proxyState = 0x100938b38f0 ; proxyState->pools = (nil)
    cit1077:17393:17393 [0] NCCL INFO POST free(): proxyState = 0x100938b38f0 ; proxyState->pools = (nil)
    cit1077:17393:17393 [0] NCCL INFO Destroyed comm 0x100938b17c0 rank 1
    cit1077:17393:17393 [0] NCCL INFO PRE free(): proxyState = 0x10093896360 ; proxyState->pools = (nil)
    cit1077:17393:17393 [0] NCCL INFO POST free(): proxyState = 0x10093896360 ; proxyState->pools = (nil)
    cit1077:17393:17393 [0] NCCL INFO Destroyed comm 0x10093894230 rank 1
    cit1077:17393:17393 [0] NCCL INFO PRE free(): proxyState = 0x100938b38f0 ; proxyState->pools = 0x2350

In each instance above, the final proxyState->pools value (logged just before the SIGSEGV) is a small number such as 0x2350 or 0x2860, which does not look like a valid heap address.

@ashuatibm
Author

After reviewing additional logging, it appears that our code called ncclCommDestroy() twice with the same handle. Once we removed the duplicate call, processing continued correctly to completion.
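
One way to guard against this kind of duplicate call on the caller side is to destroy through a pointer to the handle and reset it to NULL afterwards. The sketch below is illustrative only: `comm_handle`, `fake_comm_create`, `fake_comm_destroy`, and `destroy_once` are hypothetical names standing in for an opaque handle such as ncclComm_t and its create/destroy functions, and are not part of NCCL.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for an opaque communicator handle. */
typedef struct fake_comm { int rank; } *comm_handle;

/* Counts how often the underlying destroy actually ran (demonstration only). */
int destroy_calls = 0;

/* Stand-in for the library's create function. Returns NULL on OOM. */
comm_handle fake_comm_create(int rank) {
  comm_handle c = malloc(sizeof *c);
  if (c != NULL) c->rank = rank;
  return c;
}

/* Stand-in for the library's destroy function; assumed to free the handle. */
int fake_comm_destroy(comm_handle comm) {
  destroy_calls++;
  free(comm);
  return 0;
}

/* Destroy *comm at most once. The caller's variable is reset to NULL,
 * so a second call through the same variable does nothing. */
int destroy_once(comm_handle *comm) {
  if (comm == NULL || *comm == NULL) return 0;
  int rc = fake_comm_destroy(*comm);
  *comm = NULL;
  return rc;
}
```

Calling `destroy_once(&c)` twice on the same variable runs the underlying destroy only once. This only protects calls made through that one variable: any other copy of the handle still dangles after the first destroy, so it complements, rather than replaces, clear ownership of the communicator.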

We would like to understand the difference in behavior between NCCL 2.3.x and NCCL 2.4.2.

@AddyLaddy
Collaborator

OK, thanks for the update, Ashu. Glad you've been able to find a resolution at your end. I'll make some enhancements to NCCL to try to catch this user error next time.

As for the differences between 2.3.x and 2.4: the linked list in question didn't exist in 2.3.x. I think we were just unlucky that, on the second destroy call, the free() code had 'corrupted' the already-freed memory at a location that 2.4 assumes holds a linked-list pointer.

@sjeaugey
Member

Closing since I think this issue has been solved.
