[Issue]: Deadlock with multiple threads #91

Closed
torrance opened this issue Apr 12, 2024 · 9 comments
@torrance

Problem Description

I've had a long-standing issue with random deadlocks occurring in multi-threaded code that uses a combination of HIP and hipFFT with the ROCm backend. Previously I'd only been able to reproduce the issue on an MI250X; however, that HPC environment was running an older version of ROCm (5.3), so we put the cause down to out-of-date code.

Having since updated to ROCm 6.0.3 on my local machine with a W6800 device, I am now able to reproduce the issue locally as well.

The following code is a minimum working example:

#include <iostream>
#include <thread>
#include <vector>

#include <hip/hip_runtime.h>
#include <hipfft/hipfft.h>
#include <thrust/complex.h>

const int NTHREADS {64};
const int N {128};

int main() {
    std::vector<std::thread> threads;

    for (int n {}; n < NTHREADS; ++n) {
        threads.emplace_back([n=n] {
            std::cout << "Thread ID=" << n << " starting..." << std::endl;

            hipfftHandle plan {};
            int rank[] {N, N};
            hipfftPlanMany(&plan, 2, rank, rank, 1, 1, rank, 1, 1, HIPFFT_C2C, 1);
            hipfftSetStream(plan, hipStreamPerThread);

            std::vector<thrust::complex<float>> xs_host(N * N);
            for (int i {}; i < 100; ++i) {
                thrust::complex<float>* xs_device {};
                hipMallocAsync(
                    &xs_device,
                    sizeof(thrust::complex<float>) * xs_host.size(),
                    hipStreamPerThread
                );
                hipStreamSynchronize(hipStreamPerThread);

                hipFreeAsync(xs_device, hipStreamPerThread);
                hipStreamSynchronize(hipStreamPerThread);
            }

            hipfftDestroy(plan);

            std::cout << "Thread ID=" << n << " finishing..." << std::endl;
        });
    }

    for (auto& thread : threads) thread.join();
}

You can see it's not really doing anything at all: it just creates an FFT plan and then performs a series of memory allocations and deallocations. However, on my own system this will reliably deadlock.

Note that this does not deadlock if:

  • no FFT plan is created
  • hipFreeAsync is not called
  • it is compiled for the CUDA backend instead

So it appears to be some kind of interaction between initialising the FFT plan and allocating/deallocating device memory.

I can also only reliably reproduce the deadlock with a large number of threads: 1 thread never deadlocks, and even 12 threads seem to always (?) complete; 24 threads deadlock often, and 32 threads deadlock reliably.
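
As an aside, the example above omits return-code checks for brevity. A minimal checking helper, sketched below with hypothetical HIP_CHECK / HIPFFT_CHECK macros (not part of the code I actually ran), is one way to rule out a silently failing API call as the cause:

#include <cstdio>
#include <cstdlib>

// Illustrative helpers: abort on the first failing call so a hang cannot be
// confused with an ignored error.
#define HIP_CHECK(call)                                                     \
    do {                                                                    \
        hipError_t err_ = (call);                                           \
        if (err_ != hipSuccess) {                                           \
            std::fprintf(stderr, "HIP error %s at %s:%d\n",                 \
                         hipGetErrorString(err_), __FILE__, __LINE__);      \
            std::abort();                                                   \
        }                                                                   \
    } while (0)

#define HIPFFT_CHECK(call)                                                  \
    do {                                                                    \
        hipfftResult res_ = (call);                                         \
        if (res_ != HIPFFT_SUCCESS) {                                       \
            std::fprintf(stderr, "hipFFT error %d at %s:%d\n",              \
                         static_cast<int>(res_), __FILE__, __LINE__);       \
            std::abort();                                                   \
        }                                                                   \
    } while (0)

// Example usage inside the thread body:
//   HIPFFT_CHECK(hipfftPlanMany(&plan, 2, rank, rank, 1, 1, rank, 1, 1, HIPFFT_C2C, 1));
//   HIP_CHECK(hipMallocAsync(&xs_device, sizeof(thrust::complex<float>) * xs_host.size(), hipStreamPerThread));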

If I run with AMD_LOG_LEVEL=4, no errors are reported. The last few log lines are shown below:

:3:hip_memory.cpp           :661 : 93675951786 us: [pid:2124825 tid:0x7f115effd640]  hipMemcpy ( 0x7f1124313000, 0x7f115effa950, 384, hipMemcpyHostToDevice )
:3:hip_device_runtime.cpp   :622 : 93675951763 us: [pid:2124825 tid:0x7f116663e640]  hipGetDevice ( 0x7f116663bb78 )
:3:hip_device_runtime.cpp   :630 : 93675951793 us: [pid:2124825 tid:0x7f116663e640] hipGetDevice: Returned hipSuccess :
:3:hip_memory.cpp           :661 : 93675951795 us: [pid:2124825 tid:0x7f1166e3f640]  hipMemcpy ( 0x7f1124312000, 0x7f1166e3c950, 384, hipMemcpyHostToDevice )
:3:hip_memory.cpp           :588 : 93675951782 us: [pid:2124825 tid:0x7f1126ffd640] hipMalloc: Returned hipSuccess : 0x7f1124314000: duration: 5886 us
:3:hip_memory.cpp           :661 : 93675951804 us: [pid:2124825 tid:0x7f1126ffd640]  hipMemcpy ( 0x7f1124314000, 0x7f1126ffa950, 384, hipMemcpyHostToDevice )
:3:hip_memory.cpp           :586 : 93675951805 us: [pid:2124825 tid:0x7f116663e640]  hipMalloc ( 0x7f116663bb68, 384 )
:3:hip_memory.cpp           :588 : 93675951812 us: [pid:2124825 tid:0x7f1168e43640] hipMalloc: Returned hipSuccess : 0x7f1124315000: duration: 43 us
:3:hip_memory.cpp           :661 : 93675951816 us: [pid:2124825 tid:0x7f1168e43640]  hipMemcpy ( 0x7f1124315000, 0x7f1168e40950, 384, hipMemcpyHostToDevice )
:3:rocdevice.cpp            :2266: 93675951817 us: [pid:2124825 tid:0x7f115cff9640] device=0x7f116002e0d0, freeMem_ = 0x77efc8b80
:3:hip_memory.cpp           :588 : 93675951825 us: [pid:2124825 tid:0x7f115cff9640] hipMalloc: Returned hipSuccess : 0x7f1124316000: duration: 6102 us
:3:rocdevice.cpp            :2266: 93675951827 us: [pid:2124825 tid:0x7f116663e640] device=0x7f116002e0d0, freeMem_ = 0x77efc8a00
:3:hip_memory.cpp           :661 : 93675951831 us: [pid:2124825 tid:0x7f115cff9640]  hipMemcpy ( 0x7f1124316000, 0x7f115cff6950, 384, hipMemcpyHostToDevice )
:3:hip_memory.cpp           :588 : 93675951832 us: [pid:2124825 tid:0x7f116663e640] hipMalloc: Returned hipSuccess : 0x7f1124317000: duration: 27 us
:3:hip_memory.cpp           :661 : 93675951839 us: [pid:2124825 tid:0x7f116663e640]  hipMemcpy ( 0x7f1124317000, 0x7f116663b950, 384, hipMemcpyHostToDevice )
:3:rocdevice.cpp            :2266: 93675951839 us: [pid:2124825 tid:0x7f1105ffb640] device=0x7f116002e0d0, freeMem_ = 0x77efc8880
:3:hip_memory.cpp           :588 : 93675951849 us: [pid:2124825 tid:0x7f1105ffb640] hipMalloc: Returned hipSuccess : 0x7f1124318000: duration: 6287 us
:3:rocdevice.cpp            :2266: 93675951852 us: [pid:2124825 tid:0x7f115f7fe640] device=0x7f116002e0d0, freeMem_ = 0x77efc8700
:3:hip_memory.cpp           :661 : 93675951856 us: [pid:2124825 tid:0x7f1105ffb640]  hipMemcpy ( 0x7f1124318000, 0x7f1105ff8950, 384, hipMemcpyHostToDevice )
:3:hip_memory.cpp           :588 : 93675951859 us: [pid:2124825 tid:0x7f115f7fe640] hipMalloc: Returned hipSuccess : 0x7f1124319000: duration: 6512 us
:3:hip_memory.cpp           :661 : 93675951864 us: [pid:2124825 tid:0x7f115f7fe640]  hipMemcpy ( 0x7f1124319000, 0x7f115f7fb950, 384, hipMemcpyHostToDevice )
:3:rocdevice.cpp            :2266: 93675951864 us: [pid:2124825 tid:0x7f10e3fff640] device=0x7f116002e0d0, freeMem_ = 0x77efc8580
:3:hip_memory.cpp           :588 : 93675951870 us: [pid:2124825 tid:0x7f10e3fff640] hipMalloc: Returned hipSuccess : 0x7f112431a000: duration: 6715 us
:3:hip_memory.cpp           :661 : 93675951873 us: [pid:2124825 tid:0x7f10e3fff640]  hipMemcpy ( 0x7f112431a000, 0x7f10e3ffc950, 384, hipMemcpyHostToDevice )
:3:rocdevice.cpp            :2768: 93675952130 us: [pid:2124825 tid:0x7f11277fe640] number of allocated hardware queues with low priority: 0, with normal priority: 4, with high priority: 0, maximum per priority is: 4
:3:rocdevice.cpp            :2753: 93675952133 us: [pid:2124825 tid:0x7f11277fe640] selected queue refCount: 0x7f1164526000 (10)
:3:rocdevice.cpp            :2768: 93675952542 us: [pid:2124825 tid:0x7f10e37fe640] number of allocated hardware queues with low priority: 0, with normal priority: 4, with high priority: 0, maximum per priority is: 4
:3:rocdevice.cpp            :2753: 93675952547 us: [pid:2124825 tid:0x7f10e37fe640] selected queue refCount: 0x7f116454e000 (10)
:3:rocdevice.cpp            :2768: 93675952950 us: [pid:2124825 tid:0x7f1168642640] number of allocated hardware queues with low priority: 0, with normal priority: 4, with high priority: 0, maximum per priority is: 4
:3:rocdevice.cpp            :2753: 93675952954 us: [pid:2124825 tid:0x7f1168642640] selected queue refCount: 0x7f116456e000 (10)
:3:rocdevice.cpp            :2768: 93675953345 us: [pid:2124825 tid:0x7f1106ffd640] number of allocated hardware queues with low priority: 0, with normal priority: 4, with high priority: 0, maximum per priority is: 4
:3:rocdevice.cpp            :2753: 93675953349 us: [pid:2124825 tid:0x7f1106ffd640] selected queue refCount: 0x7f116458e000 (10)

GDB stack traces

If I run the program under gdb, the stack traces show the threads inside various HIP API calls (including hipfftPlanMany), but each one is deadlocked waiting on __futex_abstimed_wait_common64. For example:

Thread 2: hipfftPlanMany()

#0  __futex_abstimed_wait_common64 (private=<optimized out>, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x7fffe4014268) at ./nptl/futex-internal.c:57
#1  __futex_abstimed_wait_common (cancel=true, private=<optimized out>, abstime=0x0, clockid=0, expected=0, futex_word=0x7fffe4014268) at ./nptl/futex-internal.c:87
#2  __GI___futex_abstimed_wait_cancelable64 (futex_word=futex_word@entry=0x7fffe4014268, expected=expected@entry=0, clockid=clockid@entry=0, abstime=abstime@entry=0x0, private=<optimized out>) at ./nptl/futex-internal.c:139
#3  0x00007ffff6031bdf in do_futex_wait (sem=sem@entry=0x7fffe4014268, abstime=0x0, clockid=0) at ./nptl/sem_waitcommon.c:111
#4  0x00007ffff6031c78 in __new_sem_wait_slow64 (sem=0x7fffe4014268, abstime=0x0, clockid=0) at ./nptl/sem_waitcommon.c:183
#5  0x00007ffff6815398 in ?? () from /opt/rocm-6.0.3/lib/llvm/bin/../../../lib/libamdhip64.so.6
#6  0x00007ffff6814f8a in ?? () from /opt/rocm-6.0.3/lib/llvm/bin/../../../lib/libamdhip64.so.6
#7  0x00007ffff6742d6e in ?? () from /opt/rocm-6.0.3/lib/llvm/bin/../../../lib/libamdhip64.so.6
#8  0x00007ffff65b1644 in ?? () from /opt/rocm-6.0.3/lib/llvm/bin/../../../lib/libamdhip64.so.6
#9  0x00007ffff66925d5 in ?? () from /opt/rocm-6.0.3/lib/llvm/bin/../../../lib/libamdhip64.so.6
#10 0x00007ffff6694837 in hipMemcpy () from /opt/rocm-6.0.3/lib/llvm/bin/../../../lib/libamdhip64.so.6
#11 0x00007ffff58d44d9 in kargs_create(std::vector<unsigned long, std::allocator<unsigned long> >, std::vector<unsigned long, std::allocator<unsigned long> >, std::vector<unsigned long, std::allocator<unsigned long> >, unsigned long, unsigned long) () from /.singularity.d/libs/librocfft.so.0
#12 0x00007ffff58d5842 in LeafNode::CreateDevKernelArgs() () from /.singularity.d/libs/librocfft.so.0
#13 0x00007ffff58c592a in PlanPowX(ExecPlan&) () from /.singularity.d/libs/librocfft.so.0
#14 0x00007ffff58ad548 in rocfft_plan_create_internal(rocfft_plan_t*, rocfft_result_placement_e, rocfft_transform_type_e, rocfft_precision_e, unsigned long, unsigned long const*, unsigned long, rocfft_plan_description_t*) ()
   from /.singularity.d/libs/librocfft.so.0
#15 0x00007ffff58aebae in rocfft_plan_create () from /.singularity.d/libs/librocfft.so.0
#16 0x00007ffff7fb44e9 in hipfftMakePlan_internal(hipfftHandle_t*, unsigned long, unsigned long*, hipfftIOType, unsigned long, hipfft_plan_description_t*, unsigned long*, bool) ()
   from /opt/rocm-6.0.3/lib/llvm/bin/../../../lib/libhipfft.so.0
#17 0x00007ffff7fb83ca in hipfftResult_t hipfftMakePlanMany_internal<int>(hipfftHandle_t*, int, int*, int*, int, int, int*, int, int, hipfftIOType, int, unsigned long*) ()
   from /opt/rocm-6.0.3/lib/llvm/bin/../../../lib/libhipfft.so.0
#18 0x00007ffff7fb2efb in hipfftPlanMany () from /opt/rocm-6.0.3/lib/llvm/bin/../../../lib/libhipfft.so.0
#19 0x00000000002042e4 in main::{lambda()#1}::operator()() const (this=0x312308) at mwe.cpp:21
#20 0x0000000000204235 in std::__invoke_impl<void, main::{lambda()#1}>(std::__invoke_other, main::{lambda()#1}&&) (__f=...) at /usr/lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/invoke.h:61
#21 0x00000000002041f5 in std::__invoke<main::{lambda()#1}>(main::{lambda()#1}&&) (__fn=...) at /usr/lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/invoke.h:96
#22 0x00000000002041cd in std::thread::_Invoker<std::tuple<main::{lambda()#1}> >::_M_invoke<0ul>(std::_Index_tuple<0ul>) (this=0x312308) at /usr/lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/std_thread.h:259
#23 0x00000000002041a5 in std::thread::_Invoker<std::tuple<main::{lambda()#1}> >::operator()() (this=0x312308) at /usr/lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/std_thread.h:266
#24 0x0000000000204109 in std::thread::_State_impl<std::thread::_Invoker<std::tuple<main::{lambda()#1}> > >::_M_run() (this=0x312300) at /usr/lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/std_thread.h:211
#25 0x00007ffff63a3253 in ?? () from /lib/x86_64-linux-gnu/libstdc++.so.6
#26 0x00007ffff6029ac3 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442
#27 0x00007ffff60bb850 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81

Thread 5: hipfftPlanMany()

#0  __futex_abstimed_wait_common64 (private=<optimized out>, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x7fffd4001ee8) at ./nptl/futex-internal.c:57
#1  __futex_abstimed_wait_common (cancel=true, private=<optimized out>, abstime=0x0, clockid=0, expected=0, futex_word=0x7fffd4001ee8) at ./nptl/futex-internal.c:87
#2  __GI___futex_abstimed_wait_cancelable64 (futex_word=futex_word@entry=0x7fffd4001ee8, expected=expected@entry=0, clockid=clockid@entry=0, abstime=abstime@entry=0x0, private=<optimized out>) at ./nptl/futex-internal.c:139
#3  0x00007ffff6031bdf in do_futex_wait (sem=sem@entry=0x7fffd4001ee8, abstime=0x0, clockid=0) at ./nptl/sem_waitcommon.c:111
#4  0x00007ffff6031c78 in __new_sem_wait_slow64 (sem=0x7fffd4001ee8, abstime=0x0, clockid=0) at ./nptl/sem_waitcommon.c:183
#5  0x00007ffff6815398 in ?? () from /opt/rocm-6.0.3/lib/llvm/bin/../../../lib/libamdhip64.so.6
#6  0x00007ffff6814f8a in ?? () from /opt/rocm-6.0.3/lib/llvm/bin/../../../lib/libamdhip64.so.6
#7  0x00007ffff6742d6e in ?? () from /opt/rocm-6.0.3/lib/llvm/bin/../../../lib/libamdhip64.so.6
#8  0x00007ffff65b1644 in ?? () from /opt/rocm-6.0.3/lib/llvm/bin/../../../lib/libamdhip64.so.6
#9  0x00007ffff66925d5 in ?? () from /opt/rocm-6.0.3/lib/llvm/bin/../../../lib/libamdhip64.so.6
#10 0x00007ffff6694837 in hipMemcpy () from /opt/rocm-6.0.3/lib/llvm/bin/../../../lib/libamdhip64.so.6
#11 0x00007ffff58d44d9 in kargs_create(std::vector<unsigned long, std::allocator<unsigned long> >, std::vector<unsigned long, std::allocator<unsigned long> >, std::vector<unsigned long, std::allocator<unsigned long> >, unsigned long, unsigned long) () from /.singularity.d/libs/librocfft.so.0
#12 0x00007ffff58d5842 in LeafNode::CreateDevKernelArgs() () from /.singularity.d/libs/librocfft.so.0
#13 0x00007ffff58c592a in PlanPowX(ExecPlan&) () from /.singularity.d/libs/librocfft.so.0
#14 0x00007ffff58ad548 in rocfft_plan_create_internal(rocfft_plan_t*, rocfft_result_placement_e, rocfft_transform_type_e, rocfft_precision_e, unsigned long, unsigned long const*, unsigned long, rocfft_plan_description_t*) ()
   from /.singularity.d/libs/librocfft.so.0
#15 0x00007ffff58aebae in rocfft_plan_create () from /.singularity.d/libs/librocfft.so.0
#16 0x00007ffff7fb44e9 in hipfftMakePlan_internal(hipfftHandle_t*, unsigned long, unsigned long*, hipfftIOType, unsigned long, hipfft_plan_description_t*, unsigned long*, bool) ()
   from /opt/rocm-6.0.3/lib/llvm/bin/../../../lib/libhipfft.so.0
#17 0x00007ffff7fb83ca in hipfftResult_t hipfftMakePlanMany_internal<int>(hipfftHandle_t*, int, int*, int*, int, int, int*, int, int, hipfftIOType, int, unsigned long*) ()
   from /opt/rocm-6.0.3/lib/llvm/bin/../../../lib/libhipfft.so.0
#18 0x00007ffff7fb2efb in hipfftPlanMany () from /opt/rocm-6.0.3/lib/llvm/bin/../../../lib/libhipfft.so.0
#19 0x00000000002042e4 in main::{lambda()#1}::operator()() const (this=0x3124c8) at mwe.cpp:21
#20 0x0000000000204235 in std::__invoke_impl<void, main::{lambda()#1}>(std::__invoke_other, main::{lambda()#1}&&) (__f=...) at /usr/lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/invoke.h:61
#21 0x00000000002041f5 in std::__invoke<main::{lambda()#1}>(main::{lambda()#1}&&) (__fn=...) at /usr/lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/invoke.h:96
#22 0x00000000002041cd in std::thread::_Invoker<std::tuple<main::{lambda()#1}> >::_M_invoke<0ul>(std::_Index_tuple<0ul>) (this=0x3124c8) at /usr/lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/std_thread.h:259
#23 0x00000000002041a5 in std::thread::_Invoker<std::tuple<main::{lambda()#1}> >::operator()() (this=0x3124c8) at /usr/lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/std_thread.h:266
#24 0x0000000000204109 in std::thread::_State_impl<std::thread::_Invoker<std::tuple<main::{lambda()#1}> > >::_M_run() (this=0x3124c0) at /usr/lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/std_thread.h:211
#25 0x00007ffff63a3253 in ?? () from /lib/x86_64-linux-gnu/libstdc++.so.6
#26 0x00007ffff6029ac3 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442
#27 0x00007ffff60bb850 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81

Thread 22: hipMallocAsync()

#0  __futex_abstimed_wait_common64 (private=<optimized out>, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x7fff90001ee8) at ./nptl/futex-internal.c:57
#1  __futex_abstimed_wait_common (cancel=true, private=<optimized out>, abstime=0x0, clockid=0, expected=0, futex_word=0x7fff90001ee8) at ./nptl/futex-internal.c:87
#2  __GI___futex_abstimed_wait_cancelable64 (futex_word=futex_word@entry=0x7fff90001ee8, expected=expected@entry=0, clockid=clockid@entry=0, abstime=abstime@entry=0x0, private=<optimized out>) at ./nptl/futex-internal.c:139
#3  0x00007ffff6031bdf in do_futex_wait (sem=sem@entry=0x7fff90001ee8, abstime=0x0, clockid=0) at ./nptl/sem_waitcommon.c:111
#4  0x00007ffff6031c78 in __new_sem_wait_slow64 (sem=0x7fff90001ee8, abstime=0x0, clockid=0) at ./nptl/sem_waitcommon.c:183
#5  0x00007ffff6815398 in ?? () from /opt/rocm-6.0.3/lib/llvm/bin/../../../lib/libamdhip64.so.6
#6  0x00007ffff6814f8a in ?? () from /opt/rocm-6.0.3/lib/llvm/bin/../../../lib/libamdhip64.so.6
#7  0x00007ffff6744275 in ?? () from /opt/rocm-6.0.3/lib/llvm/bin/../../../lib/libamdhip64.so.6
#8  0x00007ffff67443bb in ?? () from /opt/rocm-6.0.3/lib/llvm/bin/../../../lib/libamdhip64.so.6
#9  0x00007ffff6746bb7 in ?? () from /opt/rocm-6.0.3/lib/llvm/bin/../../../lib/libamdhip64.so.6
#10 0x00007ffff6746c65 in ?? () from /opt/rocm-6.0.3/lib/llvm/bin/../../../lib/libamdhip64.so.6
#11 0x00007ffff6746dfa in ?? () from /opt/rocm-6.0.3/lib/llvm/bin/../../../lib/libamdhip64.so.6
#12 0x00007ffff66e11fe in hipMallocAsync () from /opt/rocm-6.0.3/lib/llvm/bin/../../../lib/libamdhip64.so.6
#13 0x0000000000204485 in hipMallocAsync<thrust::complex<float> > (dev_ptr=0x7fffa97f8bc0, size=131072, stream=0x2) at /opt/rocm-6.0.3/include/hip/hip_runtime_api.h:8542
#14 0x0000000000204356 in main::{lambda()#1}::operator()() const (this=0x314668) at mwe.cpp:27
#15 0x0000000000204235 in std::__invoke_impl<void, main::{lambda()#1}>(std::__invoke_other, main::{lambda()#1}&&) (__f=...) at /usr/lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/invoke.h:61
#16 0x00000000002041f5 in std::__invoke<main::{lambda()#1}>(main::{lambda()#1}&&) (__fn=...) at /usr/lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/invoke.h:96
#17 0x00000000002041cd in std::thread::_Invoker<std::tuple<main::{lambda()#1}> >::_M_invoke<0ul>(std::_Index_tuple<0ul>) (this=0x314668) at /usr/lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/std_thread.h:259
#18 0x00000000002041a5 in std::thread::_Invoker<std::tuple<main::{lambda()#1}> >::operator()() (this=0x314668) at /usr/lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/std_thread.h:266
#19 0x0000000000204109 in std::thread::_State_impl<std::thread::_Invoker<std::tuple<main::{lambda()#1}> > >::_M_run() (this=0x314660) at /usr/lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/std_thread.h:211
#20 0x00007ffff63a3253 in ?? () from /lib/x86_64-linux-gnu/libstdc++.so.6
#21 0x00007ffff6029ac3 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442
#22 0x00007ffff60bb850 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81

Thread 40: hipMallocAsync()

#0  __futex_abstimed_wait_common64 (private=<optimized out>, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x7fff34001ee8) at ./nptl/futex-internal.c:57
#1  __futex_abstimed_wait_common (cancel=true, private=<optimized out>, abstime=0x0, clockid=0, expected=0, futex_word=0x7fff34001ee8) at ./nptl/futex-internal.c:87
#2  __GI___futex_abstimed_wait_cancelable64 (futex_word=futex_word@entry=0x7fff34001ee8, expected=expected@entry=0, clockid=clockid@entry=0, abstime=abstime@entry=0x0, private=<optimized out>) at ./nptl/futex-internal.c:139
#3  0x00007ffff6031bdf in do_futex_wait (sem=sem@entry=0x7fff34001ee8, abstime=0x0, clockid=0) at ./nptl/sem_waitcommon.c:111
#4  0x00007ffff6031c78 in __new_sem_wait_slow64 (sem=0x7fff34001ee8, abstime=0x0, clockid=0) at ./nptl/sem_waitcommon.c:183
#5  0x00007ffff6815398 in ?? () from /opt/rocm-6.0.3/lib/llvm/bin/../../../lib/libamdhip64.so.6
#6  0x00007ffff6814f8a in ?? () from /opt/rocm-6.0.3/lib/llvm/bin/../../../lib/libamdhip64.so.6
#7  0x00007ffff6744275 in ?? () from /opt/rocm-6.0.3/lib/llvm/bin/../../../lib/libamdhip64.so.6
#8  0x00007ffff67443bb in ?? () from /opt/rocm-6.0.3/lib/llvm/bin/../../../lib/libamdhip64.so.6
#9  0x00007ffff6746bb7 in ?? () from /opt/rocm-6.0.3/lib/llvm/bin/../../../lib/libamdhip64.so.6
#10 0x00007ffff6746c65 in ?? () from /opt/rocm-6.0.3/lib/llvm/bin/../../../lib/libamdhip64.so.6
#11 0x00007ffff6746dfa in ?? () from /opt/rocm-6.0.3/lib/llvm/bin/../../../lib/libamdhip64.so.6
#12 0x00007ffff66e11fe in hipMallocAsync () from /opt/rocm-6.0.3/lib/llvm/bin/../../../lib/libamdhip64.so.6
#13 0x0000000000204485 in hipMallocAsync<thrust::complex<float> > (dev_ptr=0x7fff4affbbc0, size=131072, stream=0x2) at /opt/rocm-6.0.3/include/hip/hip_runtime_api.h:8542
#14 0x0000000000204356 in main::{lambda()#1}::operator()() const (this=0x3165e8) at mwe.cpp:27
#15 0x0000000000204235 in std::__invoke_impl<void, main::{lambda()#1}>(std::__invoke_other, main::{lambda()#1}&&) (__f=...) at /usr/lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/invoke.h:61
#16 0x00000000002041f5 in std::__invoke<main::{lambda()#1}>(main::{lambda()#1}&&) (__fn=...) at /usr/lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/invoke.h:96
#17 0x00000000002041cd in std::thread::_Invoker<std::tuple<main::{lambda()#1}> >::_M_invoke<0ul>(std::_Index_tuple<0ul>) (this=0x3165e8) at /usr/lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/std_thread.h:259
#18 0x00000000002041a5 in std::thread::_Invoker<std::tuple<main::{lambda()#1}> >::operator()() (this=0x3165e8) at /usr/lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/std_thread.h:266
#19 0x0000000000204109 in std::thread::_State_impl<std::thread::_Invoker<std::tuple<main::{lambda()#1}> > >::_M_run() (this=0x3165e0) at /usr/lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/std_thread.h:211
#20 0x00007ffff63a3253 in ?? () from /lib/x86_64-linux-gnu/libstdc++.so.6
#21 0x00007ffff6029ac3 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442
#22 0x00007ffff60bb850 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81

Operating System

Ubuntu 20.04, in an Ubuntu 22.04 container

CPU

AMD Ryzen 5 5600 6-Core Processor

GPU

AMD Instinct MI250X, AMD Radeon Pro W6800

ROCm Version

ROCm 6.0.0

ROCm Component

hipFFT

Steps to Reproduce

No response

(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support

ROCk module is loaded
=====================
HSA System Attributes
=====================
Runtime Version:         1.1
System Timestamp Freq.:  1000.000000MHz
Sig. Max Wait Duration:  18446744073709551615 (0xFFFFFFFFFFFFFFFF) (timestamp count)
Machine Model:           LARGE
System Endianness:       LITTLE

==========
HSA Agents
==========
*******
Agent 1
*******
  Name:                    AMD Ryzen 5 5600 6-Core Processor
  Uuid:                    CPU-XX
  Marketing Name:          AMD Ryzen 5 5600 6-Core Processor
  Vendor Name:             CPU
  Feature:                 None specified
  Profile:                 FULL_PROFILE
  Float Round Mode:        NEAR
  Max Queue Number:        0(0x0)
  Queue Min Size:          0(0x0)
  Queue Max Size:          0(0x0)
  Queue Type:              MULTI
  Node:                    0
  Device Type:             CPU
  Cache Info:
    L1:                      32768(0x8000) KB
  Chip ID:                 0(0x0)
  ASIC Revision:           0(0x0)
  Cacheline Size:          64(0x40)
  Max Clock Freq. (MHz):   3500
  BDFID:                   0
  Internal Node ID:        0
  Compute Unit:            12
  SIMDs per CU:            0
  Shader Engines:          0
  Shader Arrs. per Eng.:   0
  WatchPts on Addr. Ranges:1
  Features:                None
  Pool Info:
    Pool 1
      Segment:                 GLOBAL; FLAGS: FINE GRAINED
      Size:                    65791724(0x3ebe6ec) KB
      Allocatable:             TRUE
      Alloc Granule:           4KB
      Alloc Alignment:         4KB
      Accessible by all:       TRUE
    Pool 2
      Segment:                 GLOBAL; FLAGS: KERNARG, FINE GRAINED
      Size:                    65791724(0x3ebe6ec) KB
      Allocatable:             TRUE
      Alloc Granule:           4KB
      Alloc Alignment:         4KB
      Accessible by all:       TRUE
    Pool 3
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED
      Size:                    65791724(0x3ebe6ec) KB
      Allocatable:             TRUE
      Alloc Granule:           4KB
      Alloc Alignment:         4KB
      Accessible by all:       TRUE
  ISA Info:
*******
Agent 2
*******
  Name:                    gfx1030
  Uuid:                    GPU-66c31d3d0cc90177
  Marketing Name:          AMD Radeon PRO W6800
  Vendor Name:             AMD
  Feature:                 KERNEL_DISPATCH
  Profile:                 BASE_PROFILE
  Float Round Mode:        NEAR
  Max Queue Number:        128(0x80)
  Queue Min Size:          64(0x40)
  Queue Max Size:          131072(0x20000)
  Queue Type:              MULTI
  Node:                    1
  Device Type:             GPU
  Cache Info:
    L1:                      16(0x10) KB
  Chip ID:                 29603(0x73a3)
  ASIC Revision:           1(0x1)
  Cacheline Size:          64(0x40)
  Max Clock Freq. (MHz):   2555
  BDFID:                   3328
  Internal Node ID:        1
  Compute Unit:            60
  SIMDs per CU:            2
  Shader Engines:          4
  Shader Arrs. per Eng.:   2
  WatchPts on Addr. Ranges:4
  Features:                KERNEL_DISPATCH
  Fast F16 Operation:      TRUE
  Wavefront Size:          32(0x20)
  Workgroup Max Size:      1024(0x400)
  Workgroup Max Size per Dimension:
    x                        1024(0x400)
    y                        1024(0x400)
    z                        1024(0x400)
  Max Waves Per CU:        32(0x20)
  Max Work-item Per CU:    1024(0x400)
  Grid Max Size:           4294967295(0xffffffff)
  Grid Max Size per Dimension:
    x                        4294967295(0xffffffff)
    y                        4294967295(0xffffffff)
    z                        4294967295(0xffffffff)
  Max fbarriers/Workgrp:   32
  Pool Info:
    Pool 1
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED
      Size:                    31440896(0x1dfc000) KB
      Allocatable:             TRUE
      Alloc Granule:           4KB
      Alloc Alignment:         4KB
      Accessible by all:       FALSE
    Pool 2
      Segment:                 GLOBAL; FLAGS:
      Size:                    31440896(0x1dfc000) KB
      Allocatable:             TRUE
      Alloc Granule:           4KB
      Alloc Alignment:         4KB
      Accessible by all:       FALSE
    Pool 3
      Segment:                 GROUP
      Size:                    64(0x40) KB
      Allocatable:             FALSE
      Alloc Granule:           0KB
      Alloc Alignment:         0KB
      Accessible by all:       FALSE
  ISA Info:
    ISA 1
      Name:                    amdgcn-amd-amdhsa--gfx1030
      Machine Models:          HSA_MACHINE_MODEL_LARGE
      Profiles:                HSA_PROFILE_BASE
      Default Rounding Mode:   NEAR
      Default Rounding Mode:   NEAR
      Fast f16:                TRUE
      Workgroup Max Size:      1024(0x400)
      Workgroup Max Size per Dimension:
        x                        1024(0x400)
        y                        1024(0x400)
        z                        1024(0x400)
      Grid Max Size:           4294967295(0xffffffff)
      Grid Max Size per Dimension:
        x                        4294967295(0xffffffff)
        y                        4294967295(0xffffffff)
        z                        4294967295(0xffffffff)
      FBarrier Max Size:       32
*** Done ***

Additional Information

No response

@evetsso
Contributor

evetsso commented Apr 12, 2024

Are you able to reproduce this problem if you set HSA_ENABLE_SDMA=0 in the environment? That will cause HIP to use blit kernels instead of the SDMA engines for memcpys. If that makes the problem go away, it suggests that the root cause might be related to SDMA.

@torrance
Author

@evetsso Unfortunately not. Running HSA_ENABLE_SDMA=0 ./mwe still deadlocks reliably with the minimum working example code above.

@torrance
Author

@evetsso Does the code snippet I've posted above allow you to reproduce the deadlock locally?

@evetsso evetsso self-assigned this Apr 15, 2024
@evetsso
Contributor

evetsso commented Apr 15, 2024

@torrance I have been able to reproduce the problem. It looks to me like it's related to rocFFT (via hipFFT) doing work on the null stream while other work is happening asynchronously on non-null streams. Your example uses hipStreamPerThread, but I can see the same problem with user-created streams as well.

This doesn't look specific to rocFFT/hipFFT, however. I can reproduce this with a per-thread null stream workload that doesn't involve the FFT libraries at all:

#include <iostream>
#include <thread>
#include <vector>

#include <hip/hip_runtime.h>

const int NTHREADS {64};
const int N = 128;

struct nullStreamData
{
  void* ptr;
};

void nullStreamAllocWork(nullStreamData& data)
{
  hipMalloc(&data.ptr, N * sizeof(float2));
  std::vector<float2> data_host(N);
  hipMemcpy(data.ptr, data_host.data(), N * sizeof(float2), hipMemcpyHostToDevice);
}

void nullStreamFreeWork(nullStreamData& data)
{
  hipFree(data.ptr);
}

int main() {
    std::vector<std::thread> threads;

    for (int n {}; n < NTHREADS; ++n) {
        threads.emplace_back([n=n] {
            std::cout << "Thread ID=" << n << " starting..." << std::endl;

            nullStreamData data;
            nullStreamAllocWork(data);
	    
            std::vector<float2> xs_host(N * N);
            for (int i {}; i < 100; ++i) {
                float2* xs_device {};
                hipMallocAsync(
                    &xs_device,
                    sizeof(float2) * xs_host.size(),
                    hipStreamPerThread
                );
                hipStreamSynchronize(hipStreamPerThread);

                hipFreeAsync(xs_device, hipStreamPerThread);
                hipStreamSynchronize(hipStreamPerThread);
            }

            nullStreamFreeWork(data);

            std::cout << "Thread ID=" << n << " finishing..." << std::endl;
        });
    }

    for (auto& thread : threads) thread.join();
}
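
For reference, the user-created-stream case I mentioned above can be exercised by swapping the loop's hipStreamPerThread for an explicitly created stream; a sketch of that variation (illustrative only, not the exact code I ran):

            // Variation: per-thread user-created stream instead of hipStreamPerThread.
            hipStream_t stream {};
            hipStreamCreate(&stream);

            for (int i {}; i < 100; ++i) {
                float2* xs_device {};
                hipMallocAsync(&xs_device, sizeof(float2) * xs_host.size(), stream);
                hipStreamSynchronize(stream);

                hipFreeAsync(xs_device, stream);
                hipStreamSynchronize(stream);
            }

            hipStreamDestroy(stream);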

I can reproduce this with ROCm 6.0 as well as with builds of what will become ROCm 6.1. However, I am not able to reproduce it with builds of what will become ROCm 6.2 (using either your reproducer or mine). I'll raise an issue internally to see if the fix can at least be included in 6.1.

@evetsso
Contributor

evetsso commented Apr 16, 2024

@torrance The HIP runtime team has worked out that ROCm/clr@0b0df60 fixes the problem. They will be aiming to get that fix applied to the ROCm 6.1.1 release.

I've been able to cherry-pick that commit onto a local build of clr and confirmed that my reproducer now succeeds. Building clr is a bit fiddly, but it might be an option for you in the immediate term.
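
If rebuilding clr isn't practical, another stopgap you could experiment with is serializing the suspect calls behind a process-wide mutex. This is purely a sketch under the assumption that the trigger is concurrent null-stream work and stream-ordered alloc/free as described above; I have not verified it against this hang:

#include <cstddef>
#include <mutex>

// Hypothetical guards (assumes <hip/hip_runtime.h> is already included):
// trade away concurrency on the calls suspected of interacting badly.
// Null-stream work (e.g. plan creation in the original report) would need
// the same guard. NOT a verified fix.
static std::mutex g_hip_alloc_mutex;

inline hipError_t guardedMallocAsync(void** ptr, std::size_t bytes, hipStream_t stream)
{
    std::lock_guard<std::mutex> lock(g_hip_alloc_mutex);
    return hipMallocAsync(ptr, bytes, stream);
}

inline hipError_t guardedFreeAsync(void* ptr, hipStream_t stream)
{
    std::lock_guard<std::mutex> lock(g_hip_alloc_mutex);
    return hipFreeAsync(ptr, stream);
}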

I'm closing this issue, since it is independent from hipFFT/rocFFT. Please feel free to comment and/or reopen if necessary, but I don't think there's anything to be done in the FFT libraries for this problem.

@evetsso evetsso closed this as completed Apr 16, 2024
@torrance
Author

torrance commented May 1, 2024

@evetsso Just wanted to say thanks for pushing this to the core team on my behalf!

@pxl-th

pxl-th commented May 9, 2024

@evetsso It looks like ROCm/clr@0b0df60 didn't make it into the ROCm 6.1.1 release, am I right?
If so, is a 6.1.2 release planned anytime soon?

@evetsso
Contributor

evetsso commented May 9, 2024

Ah, it looks like the fix just missed the cutoff for 6.1.1. But a 6.1.2 release is planned, and I can see the fix has been cherry-picked to an internal branch that will become 6.1.2.

@pxl-th

pxl-th commented May 9, 2024

Ah, it looks like the fix just missed the cutoff for 6.1.1. But a 6.1.2 release is planned, and I can see the fix has been cherry-picked to an internal branch that will become 6.1.2.

I see, thanks!
