[Issue]: Deadlock with multiple threads #91
Comments
Are you able to reproduce this problem if you set HSA_ENABLE_SDMA=0 in the environment? That will cause HIP to use blit kernels instead of the SDMA engines for memcpys. If that makes the problem go away, it suggests that the root cause might be related to SDMA.
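One way to try that suggestion without changing the launch command is to set the variable programmatically before the first HIP call. A minimal sketch (not from the thread), assuming the ROCr runtime reads HSA_ENABLE_SDMA when it initializes:

```cpp
// Sketch only: equivalent in effect to running with HSA_ENABLE_SDMA=0 set in the shell.
// Assumption: the variable is read at runtime initialization, so it must be set
// before the first HIP API call.
#include <cstdlib>
#include <hip/hip_runtime.h>

int main() {
    setenv("HSA_ENABLE_SDMA", "0", 1);  // must precede any HIP call
    hipFree(nullptr);                   // forces HIP/ROCr runtime initialization
    // ... run the reproducer as usual ...
    return 0;
}
```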
@evetsso Unfortunately no. Calling
@evetsso Does the code snippet I've posted above allow you to reproduce the deadlock locally?
@torrance I have been able to reproduce the problem. It looks to me like it's related to rocFFT (via hipFFT) doing work on the null stream while other work is happening asynchronously on non-null streams. Your example uses hipStreamPerThread, but I can see the same problem with user-created streams as well (a sketch of that variant follows this comment). This doesn't look specific to rocFFT/hipFFT, however. I can reproduce this with a per-thread null stream workload that doesn't involve the FFT libraries at all:

```cpp
#include <iostream>
#include <thread>
#include <vector>
#include <hip/hip_runtime.h>

const int NTHREADS {64};
const int N = 128;

struct nullStreamData
{
    void* ptr;
};

void nullStreamAllocWork(nullStreamData& data)
{
    // Allocate and fill device memory; the synchronous hipMemcpy runs on the null stream.
    hipMalloc(&data.ptr, N * sizeof(float2));
    std::vector<float2> data_host(N);
    hipMemcpy(data.ptr, data_host.data(), N * sizeof(float2), hipMemcpyHostToDevice);
}

void nullStreamFreeWork(nullStreamData& data)
{
    hipFree(data.ptr);
}

int main() {
    std::vector<std::thread> threads;
    for (int n {}; n < NTHREADS; ++n) {
        threads.emplace_back([n=n] {
            std::cout << "Thread ID=" << n << " starting..." << std::endl;
            nullStreamData data;
            nullStreamAllocWork(data);
            std::vector<float2> xs_host(N * N);
            // Async allocation/free on the per-thread stream, concurrent with the
            // null-stream work being done by the other threads.
            for (int i {}; i < 100; ++i) {
                float2* xs_device {};
                hipMallocAsync(
                    &xs_device,
                    sizeof(float2) * xs_host.size(),
                    hipStreamPerThread
                );
                hipStreamSynchronize(hipStreamPerThread);
                hipFreeAsync(xs_device, hipStreamPerThread);
                hipStreamSynchronize(hipStreamPerThread);
            }
            nullStreamFreeWork(data);
            std::cout << "Thread ID=" << n << " finishing..." << std::endl;
        });
    }
    for (auto& thread : threads) thread.join();
}
```

I can reproduce this with ROCm 6.0 as well as the builds of what will become ROCm 6.1. However, I am not able to reproduce it in the builds of what will become ROCm 6.2 (using either your reproducer or mine). I'll raise an issue internally at least to see if the fix can be included in 6.1.
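For reference, a minimal sketch (not taken from the thread) of the user-created-stream variant mentioned above; the one-stream-per-thread setup and the loop shape are assumptions mirroring the reproducer:

```cpp
// Hypothetical variant of the per-thread loop using a user-created stream
// instead of hipStreamPerThread (error handling omitted for brevity).
#include <vector>
#include <hip/hip_runtime.h>

void perThreadAsyncWork(int N) {
    hipStream_t stream;
    hipStreamCreate(&stream);  // assumed: one stream per thread

    std::vector<float2> xs_host(N * N);
    for (int i = 0; i < 100; ++i) {
        float2* xs_device = nullptr;
        hipMallocAsync(&xs_device, sizeof(float2) * xs_host.size(), stream);
        hipStreamSynchronize(stream);
        hipFreeAsync(xs_device, stream);
        hipStreamSynchronize(stream);
    }

    hipStreamDestroy(stream);
}
```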
@torrance The HIP runtime team has worked out that ROCm/clr@0b0df60 fixes the problem. They will be aiming to get that fix applied to the ROCm 6.1.1 release. I've been able to pick that commit into a local build of clr and confirmed that my reproducer succeeds as well. Building clr is a bit fiddly, but might be an option for you in the immediate term. I'm closing this issue, since it is independent of hipFFT/rocFFT. Please feel free to comment and/or reopen if necessary, but I don't think there's anything to be done in the FFT libraries for this problem.
@evetsso Just wanted to say thanks for pushing this to the core team on my behalf!
@evetsso It looks like ROCm/clr@0b0df60 didn't make it into the ROCm 6.1.1 release, am I right?
Ah, it looks like the fix just missed the cutoff for 6.1.1. But a 6.1.2 is planned, and I can see the fix has been picked to an internal branch that will become 6.1.2.
I see, thanks!
Problem Description
I've had a long-standing issue with random deadlocks occurring in multi-threaded code that uses a combination of HIP and hipFFT with the ROCm backend. Previously I'd only been able to reproduce the issue on an MI250X system, but that HPC environment was running an older version of ROCm (5.3), so we put the cause down to out-of-date code.
Now, having updated to ROCm 6.0.3 on my local machine with a W6800 device, I am able to reproduce the issue locally as well.
The following code is a minimum working example:
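The snippet itself is not reproduced above, so what follows is only a sketch of the kind of program described: an FFT plan created per thread with hipfftPlanMany, plus repeated allocations, transfers, and deallocations on hipStreamPerThread. The plan parameters, sizes, and thread count are assumptions based on the discussion and the stack traces further down.

```cpp
// Reconstruction sketch only: the original snippet is not shown here, so the FFT
// plan parameters, thread count, and use of hipStreamPerThread are assumptions.
#include <thread>
#include <vector>
#include <hip/hip_runtime.h>
#include <hipfft/hipfft.h>

int main() {
    const int NTHREADS = 32;  // assumed; the report says 32 threads deadlock reliably
    const int N = 128;        // assumed FFT length

    std::vector<std::thread> threads;
    for (int t = 0; t < NTHREADS; ++t) {
        threads.emplace_back([N] {
            // Create an FFT plan; hipfftPlanMany is what the deadlocked stack traces show.
            hipfftHandle plan;
            int lengths[] = {N};
            hipfftPlanMany(&plan, 1, lengths, nullptr, 1, N, nullptr, 1, N,
                           HIPFFT_C2C, 1);

            // A bunch of allocations, transfers, and deallocations on the per-thread stream.
            std::vector<float2> host(N);
            for (int i = 0; i < 100; ++i) {
                float2* dev = nullptr;
                hipMallocAsync(&dev, N * sizeof(float2), hipStreamPerThread);
                hipMemcpyAsync(dev, host.data(), N * sizeof(float2),
                               hipMemcpyHostToDevice, hipStreamPerThread);
                hipStreamSynchronize(hipStreamPerThread);
                hipFreeAsync(dev, hipStreamPerThread);
                hipStreamSynchronize(hipStreamPerThread);
            }

            hipfftDestroy(plan);
        });
    }
    for (auto& thread : threads) thread.join();
}
```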
You can see it's not really doing anything at all: it just creates an FFT plan and then performs a bunch of memory allocations, transfers, and deallocations. However, on my own system this will reliably deadlock.
Note that this does not deadlock if:
So it appears to be some kind of interaction between initialising the FFT plan and allocating/deallocating device memory.
I can also only reliably reproduce the deadlock with a large number of threads: 1 thread never deadlocks, and even 12 threads seem to always (?) complete, while 24 threads deadlock often and 32 threads deadlock reliably.
If I run with AMD_LOG_LEVEL=4, there are no errors reported. The last few log lines are shown below:

GDB stack traces

If I run the program under gdb, the stack trace on each thread shows varying HIP API calls (including hipfftPlanMany), but each one is deadlocked waiting on __futex_abstimed_wait_common64. For example:

Thread 2: hipfftPlanMany()
Thread 5: hipfftPlanMany()
Thread 22: hipMallocAsync()
Thread 40: hipMallocAsync()
Operating System
Ubuntu 20.04, in an Ubuntu 22.04 container
CPU
AMD Ryzen 5 5600 6-Core Processor
GPU
AMD Instinct MI250X, AMD Radeon Pro W6800
ROCm Version
ROCm 6.0.0
ROCm Component
hipFFT
Steps to Reproduce
No response
(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support
Additional Information
No response