Deadlock after fork when calling dgetrf_ #5520

@mattip

As described in numpy/numpy#30092 and scipy/scipy#23686, there is a deadlock in OpenBLAS when calling dgetrf_ after a fork. I instrumented the calls to LOCK_COMMAND and UNLOCK_COMMAND in blas_server.c, and I think the problem is in exec_blas_async. This seems to be new since #5170.

Here is the main() of the test code:

int main() {
    int64_t m = 200, n = 200;
    int64_t lda = m;
    int64_t info;
    int64_t ipiv[200];

    // array is an identity matrix: zero-initialize it, then set the diagonal
    double arr[200*200] = {0};
    for (int i = 0; i < m*n; i += n + 1) {
        arr[i] = 1.0;
    }

    printf("before fork\n");
    pid_t pid = fork();
    printf("after fork\n");
    if (pid == 0) {
        printf("inside child\n");
        exit(0);
    } else {
        wait(NULL);
    }

    printf("before dgetrf\n");
    dgetrf_(&m, &n, arr, &lda, ipiv, &info);
    printf("after dgetrf\n");
    return 0;
}

and here is what I see with debug printing (on OpenBLAS HEAD, using ``)

installing atfork handler in memory::openblas_fork_handler 2015
in blas_thread_init 565
in blas_thread_init 567 server_lock locked
in blas_thread_init 615
in blas_thread_init 623
in blas_thread_init 626 server_lock unlocked
before fork
in blas_thread_shutdown
in blas_thread_shutdown 1000 server_lock locked
in blas_thread_shutdown 1042 server_lock unlocked
after fork
after fork
inside child
in blas_thread_shutdown
in blas_thread_shutdown 1000 server_lock locked
in blas_thread_shutdown 1042 server_lock unlocked
before dgetrf
in exec_blas_async 644
in exec_blas_async 647 server_lock locked
in blas_thread_init 565

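Note the call to LOCK_COMMAND in exec_blas_async, and then the call to blas_thread_init, which tries to call LOCK_COMMAND on the same server_lock again while it is still held. The output stops at "in blas_thread_init 565" and never reaches the "server_lock locked" print at 567. Boom.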

#ifdef SMP_SERVER
// Handle lazy re-init of the thread-pool after a POSIX fork
LOCK_COMMAND(&server_lock);
if (unlikely(blas_server_avail == 0)) blas_thread_init();
UNLOCK_COMMAND(&server_lock);
#endif
BLASLONG i = 0;

I am not sure what the best way is to solve this. Note that the first thing blas_thread_init does is to check blas_server_avail (with no lock), so maybe the lock/unlock in exec_blas_async should be removed?
