multiple_communicators branch gets deadlock on Alltoall #5

Open
andrevitorelli opened this issue May 17, 2021 · 11 comments

@andrevitorelli

When running IDRIS-Hackathon fft_benchmark.job, I get the following message:

W <...>/horovod/horovod/common/stall_inspector.cc:105] One or more tensors were submitted to be reduced, gathered or broadcasted by subset of ranks and are waiting for remainder of ranks for more than 60 seconds. This may indicate that different ranks are trying to submit different tensors or that only subset of ranks is submitting tensors, which will cause deadlock.

The job never finishes, but doesn't crash.

@EiffL EiffL self-assigned this May 18, 2021
@EiffL
Member

EiffL commented May 18, 2021

Thanks @andrevitorelli for reporting this!

@andrevitorelli
Author

andrevitorelli commented May 18, 2021

This was missing from the error message above:

Missing ranks:
0: [iFFT3D_2/HorovodAlltoall_iFFT3D_2_stack_0]

@kimchitsigai
Collaborator

kimchitsigai commented May 18, 2021

@andrevitorelli @EiffL I've just tested fft_benchmark (2 nodes / 8 GPUs):

  • by removing everything related to nsys (module load, TMPDIR and the nsys profile command) and running srun python fft_benchmark.py directly. Not much trace, but it looks OK.

  • with nvprof (also removing everything related to nsys, including the TMPDIR link). Looks OK.

With nsys, the results look random... I wonder if nsys and fft_benchmark.py run into race conditions when accessing NCCL.

Can you tell me if you get the same results as me?

@kimchitsigai
Collaborator

@andrevitorelli @EiffL I can randomly reproduce the problem on 1 node / 4 GPUs: "/horovod/common/stall_inspector.cc:105] One or more tensors were submitted to be reduced, gathered or broadcasted by subset of ranks and are waiting for remainder of ranks for more than 60 seconds."...
It looks like a deadlock, but when I look at the /gpfsssd/jobscratch directory, there is some activity.

At 11:32 this morning:

total 65538
drwxrwxrwx 7 shpe033 genhpe   4096 May 20 11:31 nsight_systems
-rw-r----- 1 shpe033 genhpe 115718 May 20 11:32 nsys-report-86e1-748d-0fb1-3554.qdstrm
-rw-r----- 1 shpe033 genhpe 115718 May 20 11:32 nsys-report-9b20-6dde-455a-f0ee.qdstrm
-rw-r----- 1 shpe033 genhpe 115718 May 20 11:32 nsys-report-d92d-3d2f-7d85-d28e.qdstrm
-rw-r----- 1 shpe033 genhpe 115718 May 20 11:32 nsys-report-fd08-6971-007c-fcba.qdstrm
-rw-r----- 1 shpe033 genhpe      0 May 20 11:31 nsys_termination_status_7838
-rw-r----- 1 shpe033 genhpe      0 May 20 11:31 nsys_termination_status_7839
-rw-r----- 1 shpe033 genhpe      0 May 20 11:31 nsys_termination_status_7840
-rw-r----- 1 shpe033 genhpe      0 May 20 11:31 nsys_termination_status_7841
drwx------ 3 shpe033 genhpe   4096 May 20 11:31 ompi.r14i1n1.300328

At 11:40

total 65538
drwxrwxrwx 7 shpe033 genhpe   4096 May 20 11:31 nsight_systems
-rw-r----- 1 shpe033 genhpe 505218 May 20 11:40 nsys-report-86e1-748d-0fb1-3554.qdstrm
-rw-r----- 1 shpe033 genhpe 505218 May 20 11:40 nsys-report-9b20-6dde-455a-f0ee.qdstrm
-rw-r----- 1 shpe033 genhpe 505218 May 20 11:40 nsys-report-d92d-3d2f-7d85-d28e.qdstrm
-rw-r----- 1 shpe033 genhpe 505218 May 20 11:40 nsys-report-fd08-6971-007c-fcba.qdstrm
-rw-r----- 1 shpe033 genhpe      0 May 20 11:31 nsys_termination_status_7838
-rw-r----- 1 shpe033 genhpe      0 May 20 11:31 nsys_termination_status_7839
-rw-r----- 1 shpe033 genhpe      0 May 20 11:31 nsys_termination_status_7840
-rw-r----- 1 shpe033 genhpe      0 May 20 11:31 nsys_termination_status_7841
drwx------ 3 shpe033 genhpe   4096 May 20 11:31 ompi.r14i1n1.300328

At 11:50

total 65538
drwxrwxrwx 7 shpe033 genhpe    4096 May 20 11:31 nsight_systems
-rw-r----- 1 shpe033 genhpe 1054618 May 20 11:50 nsys-report-86e1-748d-0fb1-3554.qdstrm
-rw-r----- 1 shpe033 genhpe 1054618 May 20 11:50 nsys-report-9b20-6dde-455a-f0ee.qdstrm
-rw-r----- 1 shpe033 genhpe 1054618 May 20 11:50 nsys-report-d92d-3d2f-7d85-d28e.qdstrm
-rw-r----- 1 shpe033 genhpe 1054618 May 20 11:50 nsys-report-fd08-6971-007c-fcba.qdstrm
-rw-r----- 1 shpe033 genhpe       0 May 20 11:31 nsys_termination_status_7838
-rw-r----- 1 shpe033 genhpe       0 May 20 11:31 nsys_termination_status_7839
-rw-r----- 1 shpe033 genhpe       0 May 20 11:31 nsys_termination_status_7840
-rw-r----- 1 shpe033 genhpe       0 May 20 11:31 nsys_termination_status_7841
drwx------ 3 shpe033 genhpe    4096 May 20 11:31 ompi.r14i1n1.300328

I wonder if fft_benchmark has actually ended but, for some reason, the step that writes the nsys report to the SCRATCH directory is very slow.

@andrevitorelli
Author

andrevitorelli commented May 20, 2021

@kimchitsigai @EiffL
Some info: comparing a successful run (first) with a deadlocked one, I see that the layout rules appear in a different order. I don't know whether this means anything, but I'm working on understanding it better.

Successful run, quoting from fft_0.log:

INFO:tensorflow:HvdSimdMeshImpl init: Shape[b1=2, b2=2] LayoutRules{('nx', 'b1'), ('ny', 'b2'), ('tny', 'b1'), ('tnz', 'b2')}
I0520 15:14:39.514306 22949178982912 hvd_simd_mesh_impl.py:47] HvdSimdMeshImpl init: Shape[b1=2, b2=2] LayoutRules{('nx', 'b1'), ('ny', 'b2'), ('tny', 'b1'), ('tnz', 'b2')}

Unsuccessful run:

INFO:tensorflow:HvdSimdMeshImpl init: Shape[b1=2, b2=2] LayoutRules{('nx', 'b1'), ('tnz', 'b2'), ('ny', 'b2'), ('tny', 'b1')}
I0520 15:26:35.604007 22840491303424 hvd_simd_mesh_impl.py:47] HvdSimdMeshImpl init: Shape[b1=2, b2=2] LayoutRules{('nx', 'b1'), ('tnz', 'b2'), ('ny', 'b2'), ('tny', 'b1')}

The layout rules are in a different order.

Also, the unsuccessful run was on a Tesla V100-SXM2-16GB while the successful one was on a Tesla V100-SXM2-32GB (different memory sizes). I have, however, also been able to run on a 16GB Tesla V100. I am now testing different/specific nodes.
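
If those LayoutRules come from iterating over a plain Python set (I have not checked the mesh implementation code), the printed order, and perhaps the order in which the Alltoall groups are built, can differ from one process to another, because string hashing is randomized per interpreter unless PYTHONHASHSEED is fixed. A minimal illustration of the effect, not the actual mesh code:

rules = {('nx', 'b1'), ('ny', 'b2'), ('tny', 'b1'), ('tnz', 'b2')}
print(list(rules))    # iteration order depends on string hashing, so it may differ between runs/processes
print(sorted(rules))  # sorting gives the same order everywhere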

@kimchitsigai
Collaborator

@andrevitorelli @EiffL Just in case you didn't know, you can set the Horovod layer's verbosity level with
export HOROVOD_LOG_LEVEL=TRACE in the sbatch file and get messages like:

[2021-05-21 09:27:04.153366: T /tmp/pip-req-build-rxees3rj/horovod/common/controller.cc:913] Created response of size 0
[2021-05-21 09:27:04.153418: T /tmp/pip-req-build-rxees3rj/horovod/common/controller.cc:428] Sending ready responses as FFT3D/HorovodAlltoall_FFT3D_stack_0;
[2021-05-21 09:27:04.153459: T /tmp/pip-req-build-rxees3rj/horovod/common/operations.cc:654] [0]: Performing FFT3D/HorovodAlltoall_FFT3D_stack_0
[2021-05-21 09:27:04.153474: T /tmp/pip-req-build-rxees3rj/horovod/common/operations.cc:655] [0]: Processing 1 tensors
[2021-05-21 09:27:04.188005: T /tmp/pip-req-build-rxees3rj/horovod/common/ops/nccl_operations.cc:593] [NCCLAlltoall::Execute start, communicator_id=]: 2

(I've added the NCCL messages myself.)

@andrevitorelli
Author

Thanks! I'll keep looking into it. I've eliminated the possibility of it being a node-specific issue.
For me, about 70% of the runs fail. A few never return anything, but most get into that deadlock.

@andrevitorelli
Author

Further possibly related insights:
horovod#403

@andrevitorelli
Author

andrevitorelli commented May 21, 2021

What does not work to solve this bug:

  • Using specific nodes
  • Using 32GiB Tesla nodes
  • Asking Horovod to wait more (see the sketch at the end of this comment)

What seems to work:

  • Using more than 1 node (I did not get a deadlock in those runs).

Current hypotheses:

  • Memory issues?
  • Race Condition?
  • Mesh specification?
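
Regarding "asking Horovod to wait more", here is an untested sketch of what I mean, placed before hvd.init(); the environment variable names are taken from my reading of the Horovod docs and should be double-checked:

import os

# Assumed Horovod environment variables (please verify against the Horovod docs):
os.environ["HOROVOD_STALL_CHECK_TIME_SECONDS"] = "300"   # warn after 5 minutes instead of 60 seconds
# os.environ["HOROVOD_STALL_CHECK_DISABLE"] = "1"        # or silence the stall inspector entirely

import horovod.tensorflow as hvd
hvd.init()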

@kimchitsigai
Collaborator

kimchitsigai commented May 23, 2021

@EiffL @andrevitorelli
I've tried the 2 modifications below.

Test results (with module load tensorflow-gpu/py3/2.4.1+cuda-11.2 nvidia-nsight-systems/2021.1.1):

  • 1 node / 4 GPUs (mesh_shape=b1:2, b2:2) => OK.
  • 2 nodes / 8 GPUs (mesh_shape=b1:2, b2:4) => the Max absolute FFT error... with wall time... message is present in the 8 log files and the qdrep files are OK (they can be opened with nsys-ui), but the job does not end and must be scanceled. There are 4 [nsys-launcher] <defunct> processes on the second machine.
  • 4 nodes / 16 GPUs (mesh_shape=b1:4, b2:4) => the Max absolute FFT error... with wall time... message is present in the 16 log files. The qdrep files are sometimes produced, sometimes not. When they are produced, all 16 processes have the Shutting down background thread message in their log file; when they are not, some of the 16 processes have no Shutting down background thread message in their log file. The job does not end and must be scanceled. There are 4 [nsys-launcher] <defunct> processes on each of the last 3 machines.
  • 8 nodes / 32 GPUs (mesh_shape=b1:8, b2:4) => the Max absolute FFT error... with wall time... message is present in the 32 log files and the qdrep files are OK (they can be opened with nsys-ui), but the job does not end and must be scanceled. There are 4 [nsys-launcher] <defunct> processes on each of the last 7 machines.

An attempt to explain the 4-node / 16-GPU problem. This is a representation of the 16 processes and the 2 sub-communicators, as seen in the logs:

process/global_rank   0  1  2  3  4  5  6  7  8  9  10 11 12 13 14 15

devices[comm_id=1]    0  0  0  0  4  4  4  4  8  8  8  8  12 12 12 12
                      1  1  1  1  5  5  5  5  9  9  9  9  13 13 13 13
                      2  2  2  2  6  6  6  6  10 10 10 10 14 14 14 14
                      3  3  3  3  7  7  7  7  11 11 11 11 15 15 15 15

devices[comm_id=2]    0  1  2  3  0  1  2  3  0  1  2  3  0  1  2  3
                      4  5  6  7  4  5  6  7  4  5  6  7  4  5  6  7
                      8  9  10 11 8  9  10 11 8  9  10 11 8  9  10 11
                      12 13 14 15 12 13 14 15 12 13 14 15 12 13 14 15

Let's suppose that:

  • sub-communicator [12,13,14,15] closes first
  • process 15 is the first one to close its second sub-communicator, which is [3,7,11,15]
  • processes 3, 7 and 11 shut down
  • but these 3 processes do not close their second sub-communicator (which is [0,1,2,3] for process 3, for example)

So we get the "Shutting down background thread loop" message only for processes 12, 13, 14, 15, 3, 7 and 11.
The other processes are stuck and the qdrep files are not generated.
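
As a side note, here is a small sketch (my own reconstruction of the table above, nothing more) of the two groups each rank belongs to:

nranks, b2 = 16, 4

for rank in range(nranks):
    # comm_id=1: the ranks in the same block of 4, e.g. rank 15 -> [12, 13, 14, 15]
    group1 = [(rank // b2) * b2 + i for i in range(b2)]
    # comm_id=2: the ranks with the same offset inside their block, e.g. rank 15 -> [3, 7, 11, 15]
    group2 = [rank % b2 + i * b2 for i in range(nranks // b2)]
    print(rank, group1, group2)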

@kimchitsigai
Collaborator

At the end of fft_benchmark.py, replacing:

exit(0)

by

import time
import horovod.tensorflow as hvd

hvd.shutdown()
time.sleep(5)
exit(0)

seems to give better results: the qdrep files are now generated in the 4 test cases above, but the jobs must still be scanceled.
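
A further variation I have not tried, sketched only as an idea: put an explicit collective barrier before hvd.shutdown() so that every rank reaches the shutdown together (this assumes TF2 eager execution and uses a global allreduce purely as a barrier):

import time
import tensorflow as tf
import horovod.tensorflow as hvd

# Barrier: no rank can get past this allreduce until all ranks have entered it
hvd.allreduce(tf.constant(0.0))

hvd.shutdown()
time.sleep(5)
exit(0)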
