Description
As described in numpy/numpy#30092 and scipy/scipy#23686, there is a deadlock in OpenBLAS when calling dgetrf_ after a fork. I instrumented the calls to LOCK_COMMAND and UNLOCK_COMMAND in blas_server.c and I think the problem is in exec_blas_async. This is "new" after #5170.
Here is the main() of the test code:
int main() {
    int64_t m = 200, n = 200;
    int64_t lda = m;
    int64_t info;
    int64_t ipiv[200];
    // array is an identity matrix: zero everything, then set the diagonal
    double arr[200*200] = {0};
    for (int i = 0; i < m*n; i += n + 1) {
        arr[i] = 1.0;
    }
    printf("before fork\n");
    pid_t pid = fork();
    printf("after fork\n");
    if (pid == 0) {
        // child exits immediately without calling into OpenBLAS
        printf("inside child\n");
        exit(0);
    } else {
        wait(NULL);
    }
    printf("before dgetrf\n");
    // parent calls into OpenBLAS after the fork; this is where it hangs
    dgetrf_(&m, &n, arr, &lda, ipiv, &info);
    printf("after dgetrf\n");
    return 0;
}
and here is what I see with debug printing (on OpenBLAS HEAD, using ``)
installing atfork handler in memory::openblas_fork_handler 2015
in blas_thread_init 565
in blas_thread_init 567 server_lock locked
in blas_thread_init 615
in blas_thread_init 623
in blas_thread_init 626 server_lock unlocked
before fork
in blas_thread_shutdown
in blas_thread_shutdown 1000 server_lock locked
in blas_thread_shutdown 1042 server_lock unlocked
after fork
after fork
inside child
in blas_thread_shutdown
in blas_thread_shutdown 1000 server_lock locked
in blas_thread_shutdown 1042 server_lock unlocked
before dgetrf
in exec_blas_async 644
in exec_blas_async 647 server_lock locked
in blas_thread_init 565
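Note the call to LOCK_COMMAND in exec_blas_async, followed by the call to blas_thread_init, which tries to take server_lock again via LOCK_COMMAND while the same thread still holds it. Boom: the log stops at line 565 and never reaches the "server_lock locked" message at line 567.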
OpenBLAS/driver/others/blas_server.c, lines 638 to 644 in 0c59ae0:
#ifdef SMP_SERVER
  // Handle lazy re-init of the thread-pool after a POSIX fork
  LOCK_COMMAND(&server_lock);
  if (unlikely(blas_server_avail == 0)) blas_thread_init();
  UNLOCK_COMMAND(&server_lock);
#endif
  BLASLONG i = 0;
I am not sure what the best way is to solve this. Note that the first thing blas_thread_init does is to check blas_server_avail (with no lock), so maybe the lock/unlock in exec_blas_async should be removed?