
Preliminary abort mechanism and an API for querying asynchronous errors. #140

Closed
wants to merge 1 commit

Conversation

@wojciechwasko (Collaborator) commented Oct 11, 2018

This is a proposed infrastructure and plumbing for enabling recovery from an error condition in NCCL. Users who care about fault tolerance should familiarize themselves with two changes:

  • Calling ncclCommDestroy while the collective operations are running on the GPUs will now result in the collective operations being aborted. Previously the behaviour was unspecified.

  • There is a new API, ncclCommGetAsyncError, which allows the user to inspect whether there have been any asynchronous errors in the communicator (e.g. due to a failing network layer). Note that the only operation possible on a communicator which has experienced an error is ncclCommDestroy.

Please note that this is a preliminary version of the change and some limitations apply. One known issue is that in case of an error condition, it is possible for ncclCommDestroy to hang if multiple GPUs are used per process.

Change-Id: If1f8fadc719b136788609a10416658f3ef76cf35
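
For reference, here is a minimal sketch of the query pattern described above. The exact prototype shown is an assumption based on this description, not necessarily the final header:

/* Assumed prototype for the new query API (based on the description above): */
ncclResult_t ncclCommGetAsyncError(ncclComm_t comm, ncclResult_t* asyncError);

/* Typical usage: if an asynchronous error is reported, the only legal
 * follow-up call on the communicator is ncclCommDestroy. */
ncclResult_t asyncErr;
if (ncclCommGetAsyncError(comm, &asyncErr) != ncclSuccess || asyncErr != ncclSuccess) {
  ncclCommDestroy(comm);
}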
@sjeaugey (Contributor) commented Oct 11, 2018

One way to use this new feature is to replace cudaStreamSynchronize(ncclStream) (waiting on the NCCL stream) with:

ncclResult_t ncclError;
while (cudaStreamQuery(ncclStream) != cudaSuccess) {
  if (ncclCommGetAsyncError(comm, &ncclError) != ncclSuccess || ncclError != ncclSuccess) {
    ncclCommDestroy(comm);
    break; /* abort or recover */
  }
}

One can also implement a timeout inside the loop and call ncclCommDestroy when the timeout is reached.
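
For example, a rough sketch of that timeout variant, assuming comm and ncclStream are the same handles as above; the 30-second value and the use of clock_gettime are arbitrary illustrative choices:

#include <time.h>

ncclResult_t ncclError;
struct timespec start, now;
clock_gettime(CLOCK_MONOTONIC, &start);

while (cudaStreamQuery(ncclStream) != cudaSuccess) {
  if (ncclCommGetAsyncError(comm, &ncclError) != ncclSuccess || ncclError != ncclSuccess) {
    ncclCommDestroy(comm);
    break; /* abort or recover */
  }
  clock_gettime(CLOCK_MONOTONIC, &now);
  double elapsed = (now.tv_sec - start.tv_sec) + (now.tv_nsec - start.tv_nsec) * 1e-9;
  if (elapsed > 30.0) {          /* arbitrary example timeout */
    ncclCommDestroy(comm);
    break;                       /* give up and recover at a higher level */
  }
}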

@kuenishi commented Oct 12, 2018

Is there any way to distinguish between error causes, i.e. whether an operation simply failed for some other reason or was aborted by the application's own ncclCommDestroy call? I think a new ncclResult_t value such as ncclAborted would be nice: since all NCCL function calls are asynchronous, the thread working on the main CUDA stream may or may not be aware that its stream was aborted by another thread. Of course, it would also be fine to manage the aborting/aborted state outside the library, in the application, but that would be a bit hard to keep in sync with NCCL's internal state.

Without deep knowledge of the internals, though, the patch looks great to me.

@wojciechwasko (Collaborator, Author) commented Oct 12, 2018

I presume you're describing the case of multiple threads using the same NCCL communicator, right? In that case if one of the threads detects an asynchronous failure and decides to destroy the communicator, the other thread(s) would have to get the notification about that (out-of-band from NCCL's perspective); otherwise, they'd be using a non-existent communicator.

That means that (with this patch) there is no transient "aborted" state the communicator is in - aborting work in a communicator is (internally) immediately followed by destruction of resources associated with that communicator. Are you saying there'd be value in decoupling the "abort" part and the "destroy" part?

@kuenishi commented Oct 12, 2018

Thanks for the description; I think I now understand how it was designed. Let me clarify my hypothetical use case: say an application has two threads in each of multiple processes. The main thread in every process is actively calling CUDA and NCCL APIs (cuDNN kernels or whatever), and periodically calls allreduce() to synchronize a specific piece of data (in practice, a DNN model). A separate sub-thread intermittently communicates with the other processes to watch their liveness and can detect process failures.
Once one of the remote processes dies during or before an allreduce() call, all the other live processes stay in a busy loop; eventually the sub-thread detects the remote process failure (via Ethernet or whatever) and calls ncclCommDestroy(). I understand that as soon as the destroy call returns, the communicator no longer exists, but there may be a race where the main thread enters another CUDA or NCCL API call before the sub-thread has notified it that the communicator has been destroyed. If I understand correctly, if the next call is an NCCL call, it must return an error because the communicator is already destroyed. But I wonder what return value can be expected if the next call is a CUDA API call on the same (or a different) stream. Possibly a CUDA kernel may run on broken data, but sooner or later the main thread should become aware of the NCCL error and abort the whole computation.

Are you saying there'd be value in decoupling the "abort" part and the "destroy" part?

I may have sounded like that is what I meant, but I didn't (I was probably just not understanding it well).

Anyway your description makes sense, thank you!

@wojciechwasko (Collaborator, Author) commented Oct 12, 2018

In your case, I'd argue that it is the application's responsibility to ensure no race condition occurs when accessing a communicator that might be destroyed in the meantime, e.g. by maintaining a per-communicator lock or some other method of verifying that the shared resource still exists. Also, the ncclCommGetAsyncError call is per-comm, i.e. it would ideally be called by the thread that is handling that particular GPU as opposed to some independent thread.

If the main thread calls NCCL with a handle to a non-existent communicator, it will almost certainly segfault. Please also note that the comm destroy operation is not destructive towards the stream - it only causes NCCL kernels to quit early. The stream itself (and the device) remains usable, but yes, the data that the kernels would operate on would most likely be bogus.
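
For illustration, a minimal sketch of such an application-side guard, assuming a pthread-based setup; GuardedComm, abortComm and launchAllReduce are hypothetical names, not part of NCCL:

#include <pthread.h>
#include <nccl.h>

/* Hypothetical application-side wrapper; not part of NCCL. */
typedef struct {
  pthread_mutex_t lock;
  ncclComm_t comm;
  int destroyed;                 /* set once the communicator has been torn down */
} GuardedComm;

/* Called by the watcher thread when a remote peer is detected as dead. */
void abortComm(GuardedComm* g) {
  pthread_mutex_lock(&g->lock);
  if (!g->destroyed) {
    ncclCommDestroy(g->comm);    /* aborts in-flight NCCL kernels */
    g->destroyed = 1;
  }
  pthread_mutex_unlock(&g->lock);
}

/* The main thread takes the same lock around every use of the communicator. */
int launchAllReduce(GuardedComm* g, const void* sendbuf, void* recvbuf,
                    size_t count, cudaStream_t stream) {
  int ok = 0;
  pthread_mutex_lock(&g->lock);
  if (!g->destroyed) {
    ok = (ncclAllReduce(sendbuf, recvbuf, count, ncclFloat, ncclSum,
                        g->comm, stream) == ncclSuccess);
  }
  pthread_mutex_unlock(&g->lock);
  return ok;
}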

@kuenishi commented Oct 15, 2018

My PoC snippet worked nicely as intended. Thank you for your great work!

@kuenishi commented Nov 12, 2018

I've seen a weird issue when calling ncclCommDestroy(). In a multi-node, multi-process configuration, say we have processes P1 and P2 on node P and Q1 and Q2 on node Q, with ranks 0, 1, 2 and 3 respectively. Both nodes have two GPUs and are connected with InfiniBand, and NCCL normally works fine with good performance.

But, to imitate a hardware or software failure, I killed Q2 (with SIGTERM) while allReduce() and other CUDA kernels were actively being called. Then all other processes were notified of Q2's death through another communication channel and tried to destroy the NCCL communicator in order to abort all further or ongoing communication, as my design intends. But all of the aborting processes got SIGSEGV and dumped core. All of those cores failed at the same position, transport/net.cu:394, as shown below. This seems to be a worker thread spawned by NCCL.

Program terminated with signal SIGSEGV, Segmentation fault.
#0  netSendProxy (args=0x7fb014cfdae0) at transport/net.cu:394
394         } else while (tail < *prevTail) {
[Current thread is 1 (Thread 0x7fb014cfe700 (LWP 715))]

The main thread that runs CUDA kernels and NCCL functions is not affected and usually keeps running. Meanwhile, my killer thread, spawned by the main thread, is usually stuck in the ncclCommDestroy call like this:

(gdb) where
#0  0x00007fb086a506ba in __mmap (addr=0x7fb003200000, len=4263936, prot=<optimized out>, flags=50, fd=<optimized out>, offset=0) at ../sysdeps/unix/sysv/linux/wordsize-64/mmap.c:34
#1  0x00007fb06c193fe2 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
(snip)
#8  0x00007fb06c09f4f2 in cuMemFreeHost () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#9  0x00007fb05171773d in cudart::driverHelper::freeHost(void*) () from /usr/local/lib/libnccl.so.2
#10 0x00007fb0516ec4bc in cudart::cudaApiFreeHost(void*) () from /usr/local/lib/libnccl.so.2
#11 0x00007fb051726c41 in cudaFreeHost () from /usr/local/lib/libnccl.so.2
#12 0x00007fb0516b3c29 in ncclCudaHostFree (ptr=<optimized out>) at include/core.h:364
#13 netSendFree (transportResources=0x4d08ef0) at transport/net.cu:322
#14 0x00007fb051692aa2 in freeRing (ring=ring@entry=0x4d1f730) at ring.cu:65
#15 0x00007fb051686030 in commFree (comm=comm@entry=0x4d1f730) at init.cu:120
#16 0x00007fb051690628 in ncclCommDestroy (comm=0x4d1f730) at init.cu:785
#17 0x00007fb056907e21 in __pyx_f_4cupy_4cuda_4nccl_16NcclCommunicator_destroy (__pyx_v_self=0x7fb03df21690, __pyx_skip_dispatch=__pyx_skip_dispatch@entry=1) at cupy/cuda/nccl.cpp:2666
#18 0x00007fb05690864e in __pyx_pf_4cupy_4cuda_4nccl_16NcclCommunicator_6destroy (__pyx_v_self=<optimized out>) at cupy/cuda/nccl.cpp:2730
#19 __pyx_pw_4cupy_4cuda_4nccl_16NcclCommunicator_7destroy (__pyx_v_self=<optimized out>, unused=<optimized out>) at cupy/cuda/nccl.cpp:2717

I tried introducing a mutex around NCCL function calls but couldn't avoid this SEGV. I currently have no lock in place and no workaround so far... I'll try to put together minimal reproduction code, but do you have any clue about how to solve this? I'd happily provide further information such as NCCL_DEBUG output or the whole core dump file (which at this point seems either useless or too much to me).

Related or not, here is some additional info:

  • This issue does not happen when all processes run on a single node using SHM.
  • Killing any one of the rank 1-3 processes leads to the death of all the others (I just didn't try killing rank 0).
  • This issue does not happen when no CUDA kernels or NCCL calls are in flight (i.e. before and after the training).
  • This issue happens regardless of the transport, either via TCP/IP or via ibverbs.
  • Software versions are as follows (nccl is built from this branch):
Modules:
  cuda      : Yes (version 9020)
  cudnn     : Yes (version 7201)
  nccl      : Yes (version 2305)
  cusolver  : Yes
  nvtx      : Yes
  thrust    : Yes

(update: added detail to the main description)

@sjeaugey (Contributor) commented Nov 12, 2018

@wojciechwasko I assume this is caused by transport threads not being stopped properly. Indeed we were trying to fix the case where transport threads had exited due to an error, but we also need to handle other ranks (which did not exit) properly.

That means active polling in the transport thread also needs to look for the abort flag. Also, it seems the resources are freed before we stop the proxies (https://github.com/NVIDIA/nccl/blob/master/src/ring.cu#L65-L68), hence the crash instead of a hang.

@wojciechwasko (Collaborator, Author) commented Nov 13, 2018

@sjeaugey do you think we need the transport threads to watch the abort flag? ncclCommDestroy joins the transport threads. But it does that improperly (freeing the threads' resources before joining them), and so a SIGSEGV results. This behaviour relies on the transport threads never truly blocking, but IIRC that's the case anyway.

I'll try to reproduce this issue and experiment with reversing the order of the resource free / thread join.

@sjeaugey (Contributor) commented Nov 13, 2018

Yes, I think we need to reverse the free/join (to turn the crash into a hang), and then have the transport threads watch the abort flag.
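
To show the intended ordering in isolation, here is a standalone, hedged sketch of the pattern (set an abort flag, join the worker thread, and only then free its resources); the names are illustrative and mirror the idea, not NCCL's actual internals:

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>
#include <stdlib.h>

struct proxyArgs {
  atomic_int abortFlag;   /* set by the destroyer to stop the busy loop */
  int* resources;         /* stands in for the transport resources      */
};

static void* proxyLoop(void* p) {
  struct proxyArgs* args = (struct proxyArgs*)p;
  /* Busy loop standing in for the proxy's progress loop; it keeps
   * checking the abort flag so that teardown can make it exit. */
  while (!atomic_load(&args->abortFlag)) {
    args->resources[0]++;   /* safe: the free only happens after the join */
  }
  return NULL;
}

int main(void) {
  struct proxyArgs args;
  atomic_init(&args.abortFlag, 0);
  args.resources = malloc(sizeof(int));
  args.resources[0] = 0;

  pthread_t tid;
  pthread_create(&tid, NULL, proxyLoop, &args);

  /* Teardown in the right order: flag, join, free. Freeing before the
   * join is exactly the ordering that produced the SIGSEGV above. */
  atomic_store(&args.abortFlag, 1);
  pthread_join(tid, NULL);
  int iterations = args.resources[0];   /* read before freeing */
  free(args.resources);

  printf("proxy exited cleanly after %d iterations\n", iterations);
  return 0;
}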

@wojciechwasko (Collaborator, Author) commented Nov 14, 2018

@kuenishi just so you know, I cooked up a reproducer for this issue; indeed, reversing the free/join results in a hang instead of a crash. I'm looking into having the transport threads stop properly instead of hanging.

@kuenishi commented Nov 16, 2018

@wojciechwasko @sjeaugey Thanks! I tried the fix and it seems to be working for the moment, but an official fix would be far better.

@sjeaugey (Contributor) commented Nov 16, 2018

@kuenishi I'm not sure which fix you tried ... ? In any case, we're still working on having proxy threads abort properly, which means also making sure the network calls do not block. This is quite a significant change, and it will take us some time to implement.

@kuenishi commented Nov 16, 2018

@sjeaugey It's kind of a dirty hack. I swapped the proxy thread join and free, introduced an abort flag in the ring struct, and added a flag check to every loop I could find (by attaching to the process with gdb and seeing which loop the thread was in). I don't even know whether it's the right fix; probably not, as it may have other races or leaks, but it's enough for my boilerplate experiment so far.

dev/wwasko/abort...kuenishi:my-abort

@kuenishi commented Nov 16, 2018

Of course I'll switch to an official patch (or the 2.4 release) once it is available (no rush).

@sjeaugey (Contributor) commented Nov 16, 2018

Thanks, it makes sense indeed. It should cover most of the cases but not all (for example https://github.com/NVIDIA/nccl/blob/master/src/transport/net_ib.cu#L686).

@kuenishi kuenishi referenced this pull request Feb 21, 2019
@sjeaugey (Contributor) commented Feb 21, 2019

Closing as this has been released in 2.4.2.

@sjeaugey sjeaugey closed this Feb 21, 2019
@sjeaugey sjeaugey deleted the dev/wwasko/abort branch Feb 21, 2019