Application works with UCX_LOG_LEVEL=info (or more verbose levels), but hangs otherwise
#9532
Comments
Forgot to mention it, but the same version of waLBerla works fine on this system (regardless of
|
@bedroge can you pls attach to the hanging process with gdb and post the backtrace of the hang (gdb command is "thread apply all backtrace") |
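For anyone following along, here is a minimal sketch of collecting that backtrace without an interactive session. It assumes gdb is installed and that you are allowed to ptrace the hanging process. The helper names `gdb_backtrace_argv` and `dump_backtraces` are made up for illustration, and running gdb in batch mode (`-p`, `-batch`, `-ex`) is my assumption; the request above only names the gdb command itself.

```python
import subprocess


def gdb_backtrace_argv(pid):
    """Build a gdb command line that attaches to pid, prints every thread's
    backtrace, and exits (hypothetical helper, not part of UCX or gdb)."""
    return [
        "gdb",
        "-p", str(int(pid)),                  # attach to the running process
        "-batch",                             # exit after running the commands
        "-ex", "thread apply all backtrace",  # the command requested above
    ]


def dump_backtraces(pid, timeout=120):
    """Run gdb against pid and return whatever it printed on stdout."""
    result = subprocess.run(
        gdb_backtrace_argv(pid),
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    return result.stdout
```

Calling `dump_backtraces(<pid of the hanging python process>)` should give the same text as typing the command at the gdb prompt.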
Sure! Here it is (for the
And for the
|
A quick search shows it could be this issue: |
Thanks, I looked into this a bit, and I'm not sure if I completely understood that issue. But I tried recompiling OpenBLAS with By the way, I also tried to use the same compiler toolchain but with an older UCX version (1.10.0), and that did work fine. |
It seems like an issue between glibc and UCX: a deadlock between one thread reading a TLS value and another thread calling dlclose(). dlclose() takes the TLS lock and then calls the UCX destructor, which tries to stop a thread that is reading TLS and is therefore stuck waiting on the TLS lock. |
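The lock ordering described in the comment above can be imitated with a small Python sketch. This is only an analogy under stated assumptions: `tls_lock` stands in for the glibc lock, the `with tls_lock` block stands in for dlclose() plus the UCX destructor, and `worker` stands in for the thread reading TLS. None of these names come from glibc or UCX, and the join uses a timeout so the demo reports the stall instead of hanging.

```python
import threading


def demo_tls_deadlock(join_timeout=0.3):
    """Return True if the 'destructor' could not stop the worker while the
    lock was held, which is the shape of the deadlock described above."""
    tls_lock = threading.Lock()        # stand-in for the TLS lock
    lock_is_held = threading.Event()

    def worker():
        # Stand-in for the thread reading a TLS value: it needs the lock.
        lock_is_held.wait()
        with tls_lock:
            pass

    thread = threading.Thread(target=worker, daemon=True)
    thread.start()

    with tls_lock:                     # "dlclose()" takes the lock ...
        lock_is_held.set()
        thread.join(join_timeout)      # ... and the "destructor" waits for the thread
        stuck = thread.is_alive()      # still alive: it is blocked on tls_lock

    thread.join(5)                     # lock released, so the worker can finish now
    return stuck
```

In the real case there is no timeout on the wait, so neither side ever makes progress.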
Describe the bug
I'm running into a weird issue on one particular system where importing the Python interface of waLBerla, which I compiled from source using EasyBuild, hangs:
Then I found that disabling the UCX PML solved the issue:
So I tried again with UCX and some more debugging output, but then suddenly it works:
I've tried it both with the following set of dependencies:
and with some slightly newer versions:
And also with UCX/1.15.0 I'm still seeing this same issue.
These are the last lines of strace output for a run that hangs:
I'm not sure how to get more information, as increasing the verbosity solves the issue. I've included the output for a (successful) run with UCX_LOG_LEVEL at the bottom of this issue.
Steps to Reproduce
mpirun -np 1 python -c "import waLBerla"
ucx_info -v
)UCX_LOG_LEVEL
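Since the hang depends on the log level, a small wrapper that runs the same command with and without UCX_LOG_LEVEL and applies a timeout can make the comparison repeatable. This is a rough sketch: the function name `run_with_log_level` and the 60 second default are arbitrary choices, and a timeout is only a heuristic for "hangs".

```python
import os
import subprocess
import sys


def run_with_log_level(cmd, log_level=None, timeout=60):
    """Run cmd with UCX_LOG_LEVEL set to log_level (or unset when None).

    Returns "ok" on exit code 0, "failed" on any other exit code, and
    "hang" if the command did not finish within timeout seconds.
    """
    env = dict(os.environ)
    if log_level is None:
        env.pop("UCX_LOG_LEVEL", None)   # make sure the variable is not inherited
    else:
        env["UCX_LOG_LEVEL"] = log_level
    try:
        proc = subprocess.run(
            cmd,
            env=env,
            timeout=timeout,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the direct child on timeout; processes started
        # by mpirun itself may need separate cleanup.
        return "hang"
    return "ok" if proc.returncode == 0 else "failed"


# Example with the command from this issue:
#   cmd = ["mpirun", "-np", "1", "python", "-c", "import waLBerla"]
#   for level in (None, "info"):
#       print(level, run_with_log_level(cmd, level))
```

On the affected system the expectation from this report would be a timeout for the unset case and a clean exit for info, but that has to be checked there.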
Setup and versions
OS version (e.g. Linux distro) + CPU architecture (x86_64/aarch64/ppc64le/...)
cat /etc/issue or cat /etc/redhat-release + uname -a
Linux login1 4.18.0-348.12.2.el8_5.x86_64 #1 SMP Wed Jan 19 17:53:40 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
For RDMA/IB/RoCE related issues:
rpm -q rdma-core: rdma-core-35.0-1.el8.x86_64
rpm -q libibverbs: libibverbs-35.0-1.el8.x86_64
ibstat or ibv_devinfo -vv command: ibstat is available, but there's no InfiniBand, hence no output from the command
Additional information (depending on the issue)
ucx_info -d to show transports and devices recognized by UCX:
With UCX_LOG_LEVEL=data things work fine, but here it is anyway: