
Initialization sometimes fails on multi-GPU nodes due to race condition #88

Closed · jglaser opened this issue Jan 5, 2022 · 7 comments · May be fixed by #89


jglaser commented Jan 5, 2022

When using PyTorch with the NCCL/RCCL backend on a system with eight GPUs per node, I get initialization failures of the following kind:

347: pthread_mutex_timedlock() returned 110
 347: Failed to initialize RSMI device mutex after 5 seconds. Previous execution may not have shutdown cleanly. To fix problem, stop all rocm_smi programs, and then delete the rocm_smi* shared memory files in /dev/shm.: Success
 348: pthread_mutex_timedlock() returned 110
 348: Failed to initialize RSMI device mutex after 5 seconds. Previous execution may not have shutdown cleanly. To fix problem, stop all rocm_smi programs, and then delete the rocm_smi* shared memory files in /dev/shm.: Success
 757: pthread_mutex_timedlock() returned 110
 757: Failed to initialize RSMI device mutex after 5 seconds. Previous execution may not have shutdown cleanly. To fix problem, stop all rocm_smi programs, and then delete the rocm_smi* shared memory files in /dev/shm.: Success
 350: pthread_mutex_timedlock() returned 110
 350: Failed to initialize RSMI device mutex after 5 seconds. Previous execution may not have shutdown cleanly. To fix problem, stop all rocm_smi programs, and then delete the rocm_smi* shared memory files in /dev/shm.: Success
 753: pthread_mutex_timedlock() returned 110
 753: Failed to initialize RSMI device mutex after 5 seconds. Previous execution may not have shutdown cleanly. To fix problem, stop all rocm_smi programs, and then delete the rocm_smi* shared memory files in /dev/shm.: Success
 351: pthread_mutex_timedlock() returned 110
 351: Failed to initialize RSMI device mutex after 5 seconds. Previous execution may not have shutdown cleanly. To fix problem, stop all rocm_smi programs, and then delete the rocm_smi* shared memory files in /dev/shm.: Success
 756: pthread_mutex_timedlock() returned 110
 756: Failed to initialize RSMI device mutex after 5 seconds. Previous execution may not have shutdown cleanly. To fix problem, stop all rocm_smi programs, and then delete the rocm_smi* shared memory files in /dev/shm.: Success
 758: pthread_mutex_timedlock() returned 110
 758: Failed to initialize RSMI device mutex after 5 seconds. Previous execution may not have shutdown cleanly. To fix problem, stop all rocm_smi programs, and then delete the rocm_smi* shared memory files in /dev/shm.: Success
1050: pthread_mutex_timedlock() returned 110
1050: Failed to initialize RSMI device mutex after 5 seconds. Previous execution may not have shutdown cleanly. To fix problem, stop all rocm_smi programs, and then delete the rocm_smi* shared memory files in /dev/shm.: Success
 347: rsmi_init() failed
1052: pthread_mutex_timedlock() returned 110

The reason is that rocm_smi_lib creates a mutex in /dev/shm whose name is independent of the process id; when several processes initialize the library at once, they race on creating and initializing that mutex.
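
For illustration, here is a minimal sketch of the pattern being described (the names and layout are hypothetical, not rocm_smi_lib's actual internals): every process opens the same fixed-name object in /dev/shm and initializes a process-shared mutex in it, so concurrently starting processes can race on the create/initialize step.

```cpp
// Sketch only: hypothetical names, not rocm_smi_lib's real code.
#include <fcntl.h>
#include <pthread.h>
#include <sys/mman.h>
#include <unistd.h>
#include <cstdio>

int main() {
    // The name has no PID component, so every rank on the node
    // opens (and may concurrently initialize) the same object.
    const char* kName = "/rocm_smi_mutex_sketch";

    int fd = shm_open(kName, O_CREAT | O_RDWR, 0666);
    if (fd < 0) { std::perror("shm_open"); return 1; }
    ftruncate(fd, sizeof(pthread_mutex_t));

    void* mem = mmap(nullptr, sizeof(pthread_mutex_t),
                     PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    auto* mtx = static_cast<pthread_mutex_t*>(mem);

    // Race window: a second process can reach this point and
    // re-initialize the mutex while the first one already holds it.
    pthread_mutexattr_t attr;
    pthread_mutexattr_init(&attr);
    pthread_mutexattr_setpshared(&attr, PTHREAD_PROCESS_SHARED);
    pthread_mutex_init(mtx, &attr);

    pthread_mutex_lock(mtx);
    /* ... critical section ... */
    pthread_mutex_unlock(mtx);
    munmap(mem, sizeof(pthread_mutex_t));
    close(fd);
    return 0;
}
```

(Compile with -lpthread, and -lrt on older glibc.)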

jglaser changed the title from "Initialization fails sometimes on multi-GPU nodes due to race condition" to "Initialization sometimes fails on multi-GPU nodes due to race condition" on Jan 6, 2022
hkasivis (Contributor) commented

Jens Glaser, thanks for the fix. We will pull this in shortly.

bill-shuzhou-liu (Contributor) commented

Some functions in rocm_smi_lib allow only one process access at a time; rocm_smi_lib uses an inter-process mutex to protect them. If another process is using such a function, the caller waits up to 5 seconds in pthread_mutex_timedlock().

The error code 110 means timeout (ETIMEDOUT). If the other process has not released the mutex after 5 seconds, rsmi_init() fails with the errors above.
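
The timeout path looks roughly like this (a sketch, not the library's exact code); on Linux, ETIMEDOUT is 110, which matches the number in the log:

```cpp
// Sketch of the 5-second timed lock described above.
#include <cerrno>
#include <cstdio>
#include <ctime>
#include <pthread.h>

int timed_lock(pthread_mutex_t* mtx) {
    timespec deadline;
    clock_gettime(CLOCK_REALTIME, &deadline);  // absolute deadline
    deadline.tv_sec += 5;                      // wait at most 5 seconds

    int rc = pthread_mutex_timedlock(mtx, &deadline);
    if (rc == ETIMEDOUT)                       // ETIMEDOUT == 110 on Linux
        std::fprintf(stderr, "pthread_mutex_timedlock() returned %d\n", rc);
    return rc;
}
```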

Is the number in the error message the process id (i.e., 347 in the example below)? How many processes are calling rsmi_init(0) at the same time, and how many succeed? If you attach gdb to a successful process, is it blocked in some rocm_smi_lib function?
347: pthread_mutex_timedlock() returned 110

Thanks.

jglaser (Author) commented Jan 26, 2022

Bill, could you specify which function requires the mutex?

Yes, the number in front of the ":" is the global process rank. Eight processes per node call rsmi_init() at the same time. Typically about 80% of nodes make it through, but beyond ~16 nodes there is almost always at least one node that fails.

I'll have to attach the debugger and will let you know.
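
For reference, a fork-based stress test along these lines (a sketch using only the public rsmi_init()/rsmi_shut_down() API; the process count mirrors the eight ranks per node) should exercise the same concurrent-initialization path:

```cpp
// Sketch: N child processes call rsmi_init(0) simultaneously.
// Link against rocm_smi_lib (e.g. -lrocm_smi64).
#include <rocm_smi/rocm_smi.h>
#include <sys/wait.h>
#include <unistd.h>
#include <cstdio>

int main() {
    const int kProcs = 8;  // one process per GPU on the node
    for (int i = 0; i < kProcs; ++i) {
        if (fork() == 0) {
            rsmi_status_t st = rsmi_init(0);
            std::printf("child %d: rsmi_init -> %d\n", i, static_cast<int>(st));
            if (st == RSMI_STATUS_SUCCESS) rsmi_shut_down();
            _exit(st == RSMI_STATUS_SUCCESS ? 0 : 1);
        }
    }
    int failures = 0, status = 0;
    while (wait(&status) > 0)
        if (WIFEXITED(status) && WEXITSTATUS(status) != 0) ++failures;
    std::printf("%d of %d inits failed\n", failures, kProcs);
    return failures != 0;
}
```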

bill-shuzhou-liu (Contributor) commented

Many rocm_smi functions require the mutex; you can find most of them in the unit tests.

I tried to reproduce this issue on my machine (I only have 1 GPU) with 1000 processes, with no luck. You said "beyond ~16 nodes there is always a high probability": do you mean you have 16 computers, each with 8 GPUs? Thanks.


aferoz21 commented Oct 6, 2022

I am also encountering the same issue. The line number points to `static rsmi_status_t status = rsmi_init(0);`.
I am using ROCm 5.2.3. Is there a fix or workaround in place?
It happens only in CI; I have not been able to reproduce it on my machine yet.

I use the four ROCm APIs below for monitoring; when I added the fourth one, it started giving me this error.

1) auto status = rsmi_dev_temp_metric_get(m_smiDeviceIndex, sensorType, metric, &newValue); => OK
2) auto status = rsmi_dev_gpu_clk_freq_get(m_smiDeviceIndex, m_clockMetrics[i], &freq); => OK
3) auto status = rsmi_dev_fan_rpms_get(m_smiDeviceIndex, m_fanMetrics[i], &newValue); => OK
4) auto status = rsmi_dev_gpu_metrics_info_get(m_smiDeviceIndex, &gpuMetrics); => NOT OK

Below is the error message:

pthread_mutex_timedlock() returned 131
Failed to initialize RSMI device mutex after 5 seconds. Previous execution may not have shutdown cleanly. To fix problem, stop all rocm_smi programs, and then delete the rocm_smi* shared memory files in /dev/shm.: Success
terminate called after throwing an instance of 'std::runtime_error'
what(): Error 8 (An error occurred during initialization, during monitor discovery or when initializing internal data structures)
/tmp/.tensile-tox/py3/lib/python3.8/site-packages/Tensile/Source/client/source/HardwareMonitor.cpp:144

jglaser (Author) commented Feb 20, 2023

> Many rocm_smi functions require the mutex; you can find most of them in the unit tests.
>
> I tried to reproduce this issue on my machine (I only have 1 GPU) with 1000 processes, with no luck. You said "beyond ~16 nodes there is always a high probability": do you mean you have 16 computers, each with 8 GPUs? Thanks.

Yes. See here: https://docs.olcf.ornl.gov/systems/crusher_quick_start_guide.html

dmitrii-galantsev (Collaborator) commented

Commits 76b5528 and 160c99d should address it.
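
For context, one common way to close this kind of race (a sketch of the general technique, not necessarily what those commits do): let exactly one process create and initialize the shared object via O_CREAT | O_EXCL, and have everyone else open the existing object and wait for a ready flag.

```cpp
// General technique sketch, not rocm_smi_lib's actual fix.
#include <atomic>
#include <fcntl.h>
#include <pthread.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

struct SharedMutex {
    pthread_mutex_t mtx;
    std::atomic<int> ready;  // 0 until the creator finishes init
};

SharedMutex* open_shared_mutex(const char* name) {
    bool creator = true;
    int fd = shm_open(name, O_CREAT | O_EXCL | O_RDWR, 0666);
    if (fd < 0) {  // lost the creation race: open the existing object
        creator = false;
        fd = shm_open(name, O_RDWR, 0666);
        if (fd < 0) return nullptr;
        struct stat st {};
        while (fstat(fd, &st) == 0 &&
               st.st_size < static_cast<off_t>(sizeof(SharedMutex)))
            usleep(1000);  // wait until the creator has sized the object
    } else {
        ftruncate(fd, sizeof(SharedMutex));  // fresh object is zero-filled
    }
    void* mem = mmap(nullptr, sizeof(SharedMutex),
                     PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);
    auto* sm = static_cast<SharedMutex*>(mem);
    if (creator) {
        pthread_mutexattr_t attr;
        pthread_mutexattr_init(&attr);
        pthread_mutexattr_setpshared(&attr, PTHREAD_PROCESS_SHARED);
        pthread_mutex_init(&sm->mtx, &attr);
        sm->ready.store(1);  // publish: initialization complete
    } else {
        while (sm->ready.load() == 0) usleep(1000);  // wait for creator
    }
    return sm;
}
```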
