multiple_communicators branch gets deadlock on Alltoall #5

Open
andrevitorelli opened this issue May 17, 2021 · 11 comments

@andrevitorelli

When running IDRIS-Hackathon fft_benchmark.job, I get the following message:

W <...>/horovod/horovod/common/stall_inspector.cc:105] One or more tensors were submitted to be reduced, gathered or broadcasted by subset of ranks and are waiting for remainder of ranks for more than 60 seconds. This may indicate that different ranks are trying to submit different tensors or that only subset of ranks is submitting tensors, which will cause deadlock.

The job never finishes, but doesn't crash.

@EiffL EiffL self-assigned this May 18, 2021
@EiffL
Member

EiffL commented May 18, 2021

Thanks @andrevitorelli for reporting this!

@andrevitorelli
Author

andrevitorelli commented May 18, 2021

This was missing from the error message above:

Missing ranks:
0: [iFFT3D_2/HorovodAlltoall_iFFT3D_2_stack_0]

@kimchitsigai
Collaborator

kimchitsigai commented May 18, 2021

@andrevitorelli @EiffL I've just tested fft_benchmark (2 nodes / 8 GPUs):

  • by removing everything related to nsys (module load, TMPDIR and the nsys profile command) and running srun python fft_benchmark.py directly. Not much trace, but it looks OK.

  • with nvprof (also removing everything related to nsys, including the TMPDIR link). Looks OK.

With nsys, the results look random... I wonder if nsys and fft_benchmark.py run into race conditions when accessing NCCL.

Can you tell me if you get the same results as me?

@kimchitsigai
Collaborator

@andrevitorelli @EiffL I can randomly reproduce the problem on 1 node / 4 GPUs: "/horovod/common/stall_inspector.cc:105] One or more tensors were submitted to be reduced, gathered or broadcasted by subset of ranks and are waiting for remainder of ranks for more than 60 seconds."...
It looks like a deadlock, but when I look at the /gpfsssd/jobscratch directory, there is some activity.

At 11:32 this morning:

total 65538
drwxrwxrwx 7 shpe033 genhpe   4096 May 20 11:31 nsight_systems
-rw-r----- 1 shpe033 genhpe 115718 May 20 11:32 nsys-report-86e1-748d-0fb1-3554.qdstrm
-rw-r----- 1 shpe033 genhpe 115718 May 20 11:32 nsys-report-9b20-6dde-455a-f0ee.qdstrm
-rw-r----- 1 shpe033 genhpe 115718 May 20 11:32 nsys-report-d92d-3d2f-7d85-d28e.qdstrm
-rw-r----- 1 shpe033 genhpe 115718 May 20 11:32 nsys-report-fd08-6971-007c-fcba.qdstrm
-rw-r----- 1 shpe033 genhpe      0 May 20 11:31 nsys_termination_status_7838
-rw-r----- 1 shpe033 genhpe      0 May 20 11:31 nsys_termination_status_7839
-rw-r----- 1 shpe033 genhpe      0 May 20 11:31 nsys_termination_status_7840
-rw-r----- 1 shpe033 genhpe      0 May 20 11:31 nsys_termination_status_7841
drwx------ 3 shpe033 genhpe   4096 May 20 11:31 ompi.r14i1n1.300328

At 11:40

total 65538
drwxrwxrwx 7 shpe033 genhpe   4096 May 20 11:31 nsight_systems
-rw-r----- 1 shpe033 genhpe 505218 May 20 11:40 nsys-report-86e1-748d-0fb1-3554.qdstrm
-rw-r----- 1 shpe033 genhpe 505218 May 20 11:40 nsys-report-9b20-6dde-455a-f0ee.qdstrm
-rw-r----- 1 shpe033 genhpe 505218 May 20 11:40 nsys-report-d92d-3d2f-7d85-d28e.qdstrm
-rw-r----- 1 shpe033 genhpe 505218 May 20 11:40 nsys-report-fd08-6971-007c-fcba.qdstrm
-rw-r----- 1 shpe033 genhpe      0 May 20 11:31 nsys_termination_status_7838
-rw-r----- 1 shpe033 genhpe      0 May 20 11:31 nsys_termination_status_7839
-rw-r----- 1 shpe033 genhpe      0 May 20 11:31 nsys_termination_status_7840
-rw-r----- 1 shpe033 genhpe      0 May 20 11:31 nsys_termination_status_7841
drwx------ 3 shpe033 genhpe   4096 May 20 11:31 ompi.r14i1n1.300328

At 11:50

total 65538
drwxrwxrwx 7 shpe033 genhpe    4096 May 20 11:31 nsight_systems
-rw-r----- 1 shpe033 genhpe 1054618 May 20 11:50 nsys-report-86e1-748d-0fb1-3554.qdstrm
-rw-r----- 1 shpe033 genhpe 1054618 May 20 11:50 nsys-report-9b20-6dde-455a-f0ee.qdstrm
-rw-r----- 1 shpe033 genhpe 1054618 May 20 11:50 nsys-report-d92d-3d2f-7d85-d28e.qdstrm
-rw-r----- 1 shpe033 genhpe 1054618 May 20 11:50 nsys-report-fd08-6971-007c-fcba.qdstrm
-rw-r----- 1 shpe033 genhpe       0 May 20 11:31 nsys_termination_status_7838
-rw-r----- 1 shpe033 genhpe       0 May 20 11:31 nsys_termination_status_7839
-rw-r----- 1 shpe033 genhpe       0 May 20 11:31 nsys_termination_status_7840
-rw-r----- 1 shpe033 genhpe       0 May 20 11:31 nsys_termination_status_7841
drwx------ 3 shpe033 genhpe    4096 May 20 11:31 ompi.r14i1n1.300328

I wonder if fft_benchmark has actually ended but, for some reason, the step that writes the nsys report to the SCRATCH directory is very slow.

@andrevitorelli
Author

andrevitorelli commented May 20, 2021

@kimchitsigai @EiffL
Some info: comparing a successful run (first) with a deadlocked one, I see that the layout rules appear in a different order. I don't know whether this means anything, but I'm working on understanding it better.

Successful run, quoting from fft_0.log:

INFO:tensorflow:HvdSimdMeshImpl init: Shape[b1=2, b2=2] LayoutRules{('nx', 'b1'), ('ny', 'b2'), ('tny', 'b1'), ('tnz', 'b2')}
I0520 15:14:39.514306 22949178982912 hvd_simd_mesh_impl.py:47] HvdSimdMeshImpl init: Shape[b1=2, b2=2] LayoutRules{('nx', 'b1'), ('ny', 'b2'), ('tny', 'b1'), ('tnz', 'b2')}

Unsuccessful run:

INFO:tensorflow:HvdSimdMeshImpl init: Shape[b1=2, b2=2] LayoutRules{('nx', 'b1'), ('tnz', 'b2'), ('ny', 'b2'), ('tny', 'b1')}
I0520 15:26:35.604007 22840491303424 hvd_simd_mesh_impl.py:47] HvdSimdMeshImpl init: Shape[b1=2, b2=2] LayoutRules{('nx', 'b1'), ('tnz', 'b2'), ('ny', 'b2'), ('tny', 'b1')}

The layout rules are in a different order.

Also, the unsuccessful run was on a Tesla V100-SXM2-16GB while the successful one was on a Tesla V100-SXM2-32GB (different memory sizes). I have, however, also been able to run on a 16GB Tesla V100. I am now testing different/specific nodes.
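
If those LayoutRules come from iterating over a plain Python set (I have not checked the mesh implementation code), the printed order, and perhaps the order in which the Alltoall groups are built, can differ from one process to another, because string hashing is randomized per interpreter unless PYTHONHASHSEED is fixed. A minimal illustration of the effect, not the actual mesh code:

rules = {('nx', 'b1'), ('ny', 'b2'), ('tny', 'b1'), ('tnz', 'b2')}
print(list(rules))    # iteration order depends on string hashing, so it may differ between runs/processes
print(sorted(rules))  # sorting gives the same order everywhere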

@kimchitsigai
Collaborator

@andrevitorelli @EiffL Just in case you didn't know, you can set the Horovod layer's verbosity level with
export HOROVOD_LOG_LEVEL=TRACE in the sbatch file and get messages like:

[2021-05-21 09:27:04.153366: T /tmp/pip-req-build-rxees3rj/horovod/common/controller.cc:913] Created response of size 0
[2021-05-21 09:27:04.153418: T /tmp/pip-req-build-rxees3rj/horovod/common/controller.cc:428] Sending ready responses as FFT3D/HorovodAlltoall_FFT3D_stack_0;
[2021-05-21 09:27:04.153459: T /tmp/pip-req-build-rxees3rj/horovod/common/operations.cc:654] [0]: Performing FFT3D/HorovodAlltoall_FFT3D_stack_0
[2021-05-21 09:27:04.153474: T /tmp/pip-req-build-rxees3rj/horovod/common/operations.cc:655] [0]: Processing 1 tensors
[2021-05-21 09:27:04.188005: T /tmp/pip-req-build-rxees3rj/horovod/common/ops/nccl_operations.cc:593] [NCCLAlltoall::Execute start, communicator_id=]: 2

(I've added the NCCL messages myself.)

@andrevitorelli
Author

Thanks! I'll keep looking into it. I've eliminated the possibility of it being a node-specific issue.
For me, about 70% of the runs fail. A few never return anything, but most get into that deadlock.

@andrevitorelli
Author

Further possibly related insights:
horovod#403

@andrevitorelli
Author

andrevitorelli commented May 21, 2021

What does not work to solve this bug:

  • Using specific nodes
  • Using 32GiB Tesla nodes
  • Asking Horovod to wait more (see the sketch at the end of this comment)

What seems to work:

  • Using more than 1 node (I did not get a deadlock in those runs).

Current hypotheses:

  • Memory issues?
  • Race Condition?
  • Mesh specification?
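
Regarding "asking Horovod to wait more", here is an untested sketch of what I mean, placed before hvd.init(); the environment variable names are taken from my reading of the Horovod docs and should be double-checked:

import os

# Assumed Horovod environment variables (please verify against the Horovod docs):
os.environ["HOROVOD_STALL_CHECK_TIME_SECONDS"] = "300"   # warn after 5 minutes instead of 60 seconds
# os.environ["HOROVOD_STALL_CHECK_DISABLE"] = "1"        # or silence the stall inspector entirely

import horovod.tensorflow as hvd
hvd.init()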

@kimchitsigai
Collaborator

kimchitsigai commented May 23, 2021

@EiffL @andrevitorelli
I've tried the 2 modifications below.

Test results (with module load tensorflow-gpu/py3/2.4.1+cuda-11.2 nvidia-nsight-systems/2021.1.1):

  • 1 node / 4 GPUs (mesh_shape=b1:2, b2:2) => OK.
  • 2 nodes / 8 GPUs (mesh_shape=b1:2, b2:4) => the Max absolute FFT error... with wall time... message is present in the 8 log files and the qdrep files are OK (they can be opened with nsys-ui), but the job does not end and must be scanceled. There are 4 [nsys-launcher] <defunct> processes on the second machine.
  • 4 nodes / 16 GPUs (mesh_shape=b1:4, b2:4) => the Max absolute FFT error... with wall time... message is present in the 16 log files. The qdrep files are sometimes produced, sometimes not. When they are produced, all 16 processes have the Shutting down background thread message in their log file; when they are not, some of the 16 processes have no Shutting down background thread message in their log file. The job does not end and must be scanceled. There are 4 [nsys-launcher] <defunct> processes on each of the last 3 machines.
  • 8 nodes / 32 GPUs (mesh_shape=b1:8, b2:4) => the Max absolute FFT error... with wall time... message is present in the 32 log files and the qdrep files are OK (they can be opened with nsys-ui), but the job does not end and must be scanceled. There are 4 [nsys-launcher] <defunct> processes on each of the last 7 machines.

An attempt to explain the 4-node / 16-GPU problem. This is a representation of the 16 processes and the 2 sub-communicators, as seen in the logs:

process/global_rank   0  1  2  3  4  5  6  7  8  9  10 11 12 13 14 15

devices[comm_id=1]    0  0  0  0  4  4  4  4  8  8  8  8  12 12 12 12
                      1  1  1  1  5  5  5  5  9  9  9  9  13 13 13 13
                      2  2  2  2  6  6  6  6  10 10 10 10 14 14 14 14
                      3  3  3  3  7  7  7  7  11 11 11 11 15 15 15 15

devices[comm_id=2]    0  1  2  3  0  1  2  3  0  1  2  3  0  1  2  3
                      4  5  6  7  4  5  6  7  4  5  6  7  4  5  6  7
                      8  9  10 11 8  9  10 11 8  9  10 11 8  9  10 11
                      12 13 14 15 12 13 14 15 12 13 14 15 12 13 14 15

Let's suppose that:

  • sub-communicator [12,13,14,15] closes first
  • process 15 is the first one to close its second sub-communicator, which is [3,7,11,15]
  • processes 3, 7 and 11 shut down
  • but these 3 processes do not close their second sub-communicator (which is [0,1,2,3] for process 3, for example)

So we get the "Shutting down background thread loop" message only for processes 12, 13, 14, 15, 3, 7 and 11.
The other processes are stuck and the qdrep files are not generated.
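
As a side note, here is a small sketch (my own reconstruction of the table above, nothing more) of the two groups each rank belongs to:

nranks, b2 = 16, 4

for rank in range(nranks):
    # comm_id=1: the ranks in the same block of 4, e.g. rank 15 -> [12, 13, 14, 15]
    group1 = [(rank // b2) * b2 + i for i in range(b2)]
    # comm_id=2: the ranks with the same offset inside their block, e.g. rank 15 -> [3, 7, 11, 15]
    group2 = [rank % b2 + i * b2 for i in range(nranks // b2)]
    print(rank, group1, group2)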

@kimchitsigai
Collaborator

At the end of fft_benchmark.py, replacing:

exit(0)

by

import time
import horovod.tensorflow as hvd

hvd.shutdown()
time.sleep(5)
exit(0)

seems to give better results: the qdrep files are now generated in the 4 test cases above, but the jobs must still be scanceled.
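
A further variation I have not tried, sketched only as an idea: put an explicit collective barrier before hvd.shutdown() so that every rank reaches the shutdown together (this assumes TF2 eager execution and uses a global allreduce purely as a barrier):

import time
import tensorflow as tf
import horovod.tensorflow as hvd

# Barrier: no rank can get past this allreduce until all ranks have entered it
hvd.allreduce(tf.constant(0.0))

hvd.shutdown()
time.sleep(5)
exit(0)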
