
Initialization sometimes fails on multi-GPU nodes due to race condition #88

Closed · jglaser opened this issue Jan 5, 2022 · 7 comments · May be fixed by #89


jglaser commented Jan 5, 2022

When using PyTorch with the NCCL/RCCL backend on a system with eight GPUs per node, I get initialization failures of the following kind:

347: pthread_mutex_timedlock() returned 110
 347: Failed to initialize RSMI device mutex after 5 seconds. Previous execution may not have shutdown cleanly. To fix problem, stop all rocm_smi programs, and then delete the rocm_smi* shared memory files in /dev/shm.: Success
 348: pthread_mutex_timedlock() returned 110
 348: Failed to initialize RSMI device mutex after 5 seconds. Previous execution may not have shutdown cleanly. To fix problem, stop all rocm_smi programs, and then delete the rocm_smi* shared memory files in /dev/shm.: Success
 757: pthread_mutex_timedlock() returned 110
 757: Failed to initialize RSMI device mutex after 5 seconds. Previous execution may not have shutdown cleanly. To fix problem, stop all rocm_smi programs, and then delete the rocm_smi* shared memory files in /dev/shm.: Success
 350: pthread_mutex_timedlock() returned 110
 350: Failed to initialize RSMI device mutex after 5 seconds. Previous execution may not have shutdown cleanly. To fix problem, stop all rocm_smi programs, and then delete the rocm_smi* shared memory files in /dev/shm.: Success
 753: pthread_mutex_timedlock() returned 110
 753: Failed to initialize RSMI device mutex after 5 seconds. Previous execution may not have shutdown cleanly. To fix problem, stop all rocm_smi programs, and then delete the rocm_smi* shared memory files in /dev/shm.: Success
 351: pthread_mutex_timedlock() returned 110
 351: Failed to initialize RSMI device mutex after 5 seconds. Previous execution may not have shutdown cleanly. To fix problem, stop all rocm_smi programs, and then delete the rocm_smi* shared memory files in /dev/shm.: Success
 756: pthread_mutex_timedlock() returned 110
 756: Failed to initialize RSMI device mutex after 5 seconds. Previous execution may not have shutdown cleanly. To fix problem, stop all rocm_smi programs, and then delete the rocm_smi* shared memory files in /dev/shm.: Success
 758: pthread_mutex_timedlock() returned 110
 758: Failed to initialize RSMI device mutex after 5 seconds. Previous execution may not have shutdown cleanly. To fix problem, stop all rocm_smi programs, and then delete the rocm_smi* shared memory files in /dev/shm.: Success
1050: pthread_mutex_timedlock() returned 110
1050: Failed to initialize RSMI device mutex after 5 seconds. Previous execution may not have shutdown cleanly. To fix problem, stop all rocm_smi programs, and then delete the rocm_smi* shared memory files in /dev/shm.: Success
 347: rsmi_init() failed
1052: pthread_mutex_timedlock() returned 110

The reason is that rocm_smi_lib creates a mutex in /dev/shm whose name is independent of the process id; when several processes initialize the library at once, they race on creating and initializing that mutex.
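
For illustration, here is a minimal sketch of the pattern being described (the names and layout are hypothetical, not rocm_smi_lib's actual internals): every process opens the same fixed-name object in /dev/shm and initializes a process-shared mutex in it, so concurrently starting processes can race on the create/initialize step.

```cpp
// Sketch only: hypothetical names, not rocm_smi_lib's real code.
#include <fcntl.h>
#include <pthread.h>
#include <sys/mman.h>
#include <unistd.h>
#include <cstdio>

int main() {
    // The name has no PID component, so every rank on the node
    // opens (and may concurrently initialize) the same object.
    const char* kName = "/rocm_smi_mutex_sketch";

    int fd = shm_open(kName, O_CREAT | O_RDWR, 0666);
    if (fd < 0) { std::perror("shm_open"); return 1; }
    ftruncate(fd, sizeof(pthread_mutex_t));

    void* mem = mmap(nullptr, sizeof(pthread_mutex_t),
                     PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    auto* mtx = static_cast<pthread_mutex_t*>(mem);

    // Race window: a second process can reach this point and
    // re-initialize the mutex while the first one already holds it.
    pthread_mutexattr_t attr;
    pthread_mutexattr_init(&attr);
    pthread_mutexattr_setpshared(&attr, PTHREAD_PROCESS_SHARED);
    pthread_mutex_init(mtx, &attr);

    pthread_mutex_lock(mtx);
    /* ... critical section ... */
    pthread_mutex_unlock(mtx);
    munmap(mem, sizeof(pthread_mutex_t));
    close(fd);
    return 0;
}
```

(Compile with -lpthread, and -lrt on older glibc.)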

jglaser changed the title from "Initialization fails sometimes on multi-GPU nodes due to race condition" to "Initialization sometimes fails on multi-GPU nodes due to race condition" on Jan 6, 2022
hkasivis (Contributor) commented

Jens Glaser, thanks for the fix. We will pull this in shortly.

bill-shuzhou-liu (Contributor) commented

Some functions in rocm_smi_lib allow only one process access at a time; rocm_smi_lib uses an inter-process mutex to protect them. If another process is using such a function, the caller waits up to 5 seconds in pthread_mutex_timedlock().

The error code 110 means timeout (ETIMEDOUT). If the other process has not released the mutex after 5 seconds, rsmi_init() fails with the errors above.
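
The timeout path looks roughly like this (a sketch, not the library's exact code); on Linux, ETIMEDOUT is 110, which matches the number in the log:

```cpp
// Sketch of the 5-second timed lock described above.
#include <cerrno>
#include <cstdio>
#include <ctime>
#include <pthread.h>

int timed_lock(pthread_mutex_t* mtx) {
    timespec deadline;
    clock_gettime(CLOCK_REALTIME, &deadline);  // absolute deadline
    deadline.tv_sec += 5;                      // wait at most 5 seconds

    int rc = pthread_mutex_timedlock(mtx, &deadline);
    if (rc == ETIMEDOUT)                       // ETIMEDOUT == 110 on Linux
        std::fprintf(stderr, "pthread_mutex_timedlock() returned %d\n", rc);
    return rc;
}
```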

Is the number in the error message the process id (i.e., 347 in the example below)? How many processes are calling rsmi_init(0) at the same time, and how many succeed? If you attach gdb to a successful process, is it blocked in some rocm_smi_lib function?
347: pthread_mutex_timedlock() returned 110

Thanks.

jglaser (Author) commented Jan 26, 2022

Bill, could you specify which function requires the mutex?

Yes, the number in front of the ":" is the global process rank. Eight processes per node call rsmi_init() at the same time. Typically about 80% of nodes make it through, but beyond ~16 nodes there is almost always at least one node that fails.

I'll have to attach the debugger and will let you know.
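
For reference, a fork-based stress test along these lines (a sketch using only the public rsmi_init()/rsmi_shut_down() API; the process count mirrors the eight ranks per node) should exercise the same concurrent-initialization path:

```cpp
// Sketch: N child processes call rsmi_init(0) simultaneously.
// Link against rocm_smi_lib (e.g. -lrocm_smi64).
#include <rocm_smi/rocm_smi.h>
#include <sys/wait.h>
#include <unistd.h>
#include <cstdio>

int main() {
    const int kProcs = 8;  // one process per GPU on the node
    for (int i = 0; i < kProcs; ++i) {
        if (fork() == 0) {
            rsmi_status_t st = rsmi_init(0);
            std::printf("child %d: rsmi_init -> %d\n", i, static_cast<int>(st));
            if (st == RSMI_STATUS_SUCCESS) rsmi_shut_down();
            _exit(st == RSMI_STATUS_SUCCESS ? 0 : 1);
        }
    }
    int failures = 0, status = 0;
    while (wait(&status) > 0)
        if (WIFEXITED(status) && WEXITSTATUS(status) != 0) ++failures;
    std::printf("%d of %d inits failed\n", failures, kProcs);
    return failures != 0;
}
```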

bill-shuzhou-liu (Contributor) commented

Many rocm_smi functions require the mutex; you can find most of them in the unit tests.

I tried to reproduce this issue on my machine (I only have 1 GPU) with 1000 processes, with no luck. You said "beyond ~16 nodes there is always a high probability": do you mean you have 16 computers, each with 8 GPUs? Thanks.


aferoz21 commented Oct 6, 2022

I am also encountering the same issue. The line number points to `static rsmi_status_t status = rsmi_init(0);`.
I am using ROCm 5.2.3. Is there a fix or workaround in place?
It happens only in CI; I have not been able to reproduce it on my machine yet.

I use the four ROCm APIs below for monitoring; when I added the fourth one, it started giving me this error.

1) auto status = rsmi_dev_temp_metric_get(m_smiDeviceIndex, sensorType, metric, &newValue); => OK
2) auto status = rsmi_dev_gpu_clk_freq_get(m_smiDeviceIndex, m_clockMetrics[i], &freq); => OK
3) auto status = rsmi_dev_fan_rpms_get(m_smiDeviceIndex, m_fanMetrics[i], &newValue); => OK
4) auto status = rsmi_dev_gpu_metrics_info_get(m_smiDeviceIndex, &gpuMetrics); => NOT OK

Below is the error message:

pthread_mutex_timedlock() returned 131
Failed to initialize RSMI device mutex after 5 seconds. Previous execution may not have shutdown cleanly. To fix problem, stop all rocm_smi programs, and then delete the rocm_smi* shared memory files in /dev/shm.: Success
terminate called after throwing an instance of 'std::runtime_error'
what(): Error 8 (An error occurred during initialization, during monitor discovery or when initializing internal data structures)
/tmp/.tensile-tox/py3/lib/python3.8/site-packages/Tensile/Source/client/source/HardwareMonitor.cpp:144

jglaser (Author) commented Feb 20, 2023

> Many rocm_smi functions require the mutex; you can find most of them in the unit tests.
>
> I tried to reproduce this issue on my machine (I only have 1 GPU) with 1000 processes, with no luck. You said "beyond ~16 nodes there is always a high probability": do you mean you have 16 computers, each with 8 GPUs? Thanks.

Yes. See here: https://docs.olcf.ornl.gov/systems/crusher_quick_start_guide.html

dmitrii-galantsev (Collaborator) commented

Commits 76b5528 and 160c99d should address it.
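
For context, one common way to close this kind of race (a sketch of the general technique, not necessarily what those commits do): let exactly one process create and initialize the shared object via O_CREAT | O_EXCL, and have everyone else open the existing object and wait for a ready flag.

```cpp
// General technique sketch, not rocm_smi_lib's actual fix.
#include <atomic>
#include <fcntl.h>
#include <pthread.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

struct SharedMutex {
    pthread_mutex_t mtx;
    std::atomic<int> ready;  // 0 until the creator finishes init
};

SharedMutex* open_shared_mutex(const char* name) {
    bool creator = true;
    int fd = shm_open(name, O_CREAT | O_EXCL | O_RDWR, 0666);
    if (fd < 0) {  // lost the creation race: open the existing object
        creator = false;
        fd = shm_open(name, O_RDWR, 0666);
        if (fd < 0) return nullptr;
        struct stat st {};
        while (fstat(fd, &st) == 0 &&
               st.st_size < static_cast<off_t>(sizeof(SharedMutex)))
            usleep(1000);  // wait until the creator has sized the object
    } else {
        ftruncate(fd, sizeof(SharedMutex));  // fresh object is zero-filled
    }
    void* mem = mmap(nullptr, sizeof(SharedMutex),
                     PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);
    auto* sm = static_cast<SharedMutex*>(mem);
    if (creator) {
        pthread_mutexattr_t attr;
        pthread_mutexattr_init(&attr);
        pthread_mutexattr_setpshared(&attr, PTHREAD_PROCESS_SHARED);
        pthread_mutex_init(&sm->mtx, &attr);
        sm->ready.store(1);  // publish: initialization complete
    } else {
        while (sm->ready.load() == 0) usleep(1000);  // wait for creator
    }
    return sm;
}
```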
