Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[1.15.x] segfault in IOV GPU recv #9514

Open
raffenet opened this issue Nov 29, 2023 · 2 comments
Open

[1.15.x] segfault in IOV GPU recv #9514

raffenet opened this issue Nov 29, 2023 · 2 comments
Labels

Comments

@raffenet
Copy link
Contributor

raffenet commented Nov 29, 2023

Describe the bug

Segfault when receiving an IOV datatype using GPU device memory.

Steps to Reproduce

  • Configure MPICH main branch (494c4cf50c4a5d95c76d9fcd454b462adeb4d15e or later) with UCX
  • ./configure --prefix=$PWD/i --with-device=ch4:ucx --with-ucx=<path/to/install> --with-cuda=<path/to/cuda> && make -j install
  • UCX version used (v1.15.0) + UCX configure flags (--with-cuda=<path/to/cuda>)
  • Compile and run reproducer
#include <mpi.h>
#include <stdlib.h>
#include <cuda_runtime_api.h>

#define COUNT (64*1024)
#define	BLOCKS (16)
#define BLOCKLEN (4096)
#define	STRIDE (8192)

int main(void) {
    void *sendbuf;
    void *recvbuf;

    MPI_Init(NULL, NULL);

    cudaMalloc(&sendbuf, COUNT*sizeof(int));
    cudaMalloc(&recvbuf, STRIDE*BLOCKS*sizeof(int));

    MPI_Datatype recvtype;
    MPI_Type_vector(BLOCKS, BLOCKLEN, STRIDE, MPI_INT, &recvtype);
    MPI_Type_commit(&recvtype);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
	MPI_Send(sendbuf, COUNT, MPI_INT, 1, 0, MPI_COMM_WORLD);
    }

    if (rank == 1) {
	MPI_Recv(recvbuf, 1, recvtype, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }

    MPI_Type_free(&recvtype);
    cudaFree(sendbuf);
    cudaFree(recvbuf);
    MPI_Finalize();
    return 0;
}

Output:

[pmrs-gpu-240-02:83391:0:83391] Caught signal 11 (Segmentation fault: invalid permissions for mapped object at address 0x7f3985a40000)
==== backtrace (tid:  83391) ====
 0 0x0000000000036400 killpg()  ???:0
 1 0x0000000000064007 ucs_memcpy_relaxed()  /var/tmp/raffenet/mpich/modules/ucx/src/ucs/arch/x86_64/cpu.h:99
 2 0x0000000000064007 ucp_memcpy_pack_unpack()  /var/tmp/raffenet/mpich/modules/ucx/src/ucp/dt/dt.h:111
 3 0x0000000000064259 ucp_dt_iov_scatter()  /var/tmp/raffenet/mpich/modules/ucx/src/ucp/dt/dt_iov.c:75
 4 0x00000000000adbec ucp_request_recv_data_unpack()  /var/tmp/raffenet/mpich/modules/ucx/src/ucp/core/ucp_request.inl:779
 5 0x00000000000adbec ucp_request_process_recv_data()  /var/tmp/raffenet/mpich/modules/ucx/src/ucp/core/ucp_request.inl:915
 6 0x00000000000adbec ucp_rndv_data_handler()  /var/tmp/raffenet/mpich/modules/ucx/src/ucp/rndv/rndv.c:2413
 7 0x000000000001b2a9 uct_iface_invoke_am()  /var/tmp/raffenet/mpich/modules/ucx/src/uct/base/uct_iface.h:893
 8 0x000000000001b2a9 uct_mm_iface_invoke_am()  /var/tmp/raffenet/mpich/modules/ucx/src/uct/sm/mm/base/mm_iface.h:262
 9 0x000000000001b2a9 uct_mm_iface_process_recv()  /var/tmp/raffenet/mpich/modules/ucx/src/uct/sm/mm/base/mm_iface.c:294
10 0x000000000001b2a9 uct_mm_iface_poll_fifo()  /var/tmp/raffenet/mpich/modules/ucx/src/uct/sm/mm/base/mm_iface.c:326
11 0x000000000001b2a9 uct_mm_iface_progress()  /var/tmp/raffenet/mpich/modules/ucx/src/uct/sm/mm/base/mm_iface.c:379
12 0x000000000006031a ucs_callbackq_dispatch()  /var/tmp/raffenet/mpich/modules/ucx/src/ucs/datastruct/callbackq.h:211
13 0x000000000006031a uct_worker_progress()  /var/tmp/raffenet/mpich/modules/ucx/src/uct/api/uct.h:2777
14 0x000000000006031a ucp_worker_progress()  /var/tmp/raffenet/mpich/modules/ucx/src/ucp/core/ucp_worker.c:2889
15 0x0000000000aae7e2 MPIDI_NM_progress()  /var/tmp/raffenet/mpich/./src/mpid/ch4/netmod/include/../ucx/ucx_progress.h:23
16 0x0000000000ab016e MPIDI_progress_test()  /var/tmp/raffenet/mpich/./src/mpid/ch4/src/ch4_progress.h:134
17 0x0000000000ab0d67 MPID_Progress_test()  /var/tmp/raffenet/mpich/./src/mpid/ch4/src/ch4_progress.h:233
18 0x0000000000ab0d81 MPID_Progress_wait()  /var/tmp/raffenet/mpich/./src/mpid/ch4/src/ch4_progress.h:288
19 0x0000000000ab37f3 MPIR_Wait_state()  /var/tmp/raffenet/mpich/src/mpi/request/request_impl.c:736
20 0x0000000000ab1473 MPID_Wait()  /var/tmp/raffenet/mpich/./src/mpid/ch4/src/ch4_wait.h:100
21 0x0000000000ab3a0c MPIR_Wait()  /var/tmp/raffenet/mpich/src/mpi/request/request_impl.c:779
22 0x000000000083bb18 internal_Recv()  /var/tmp/raffenet/mpich/src/binding/c/c_binding.c:61680
23 0x000000000083bc63 PMPI_Recv()  /var/tmp/raffenet/mpich/src/binding/c/c_binding.c:61733
24 0x00000000004009e6 main()  /var/tmp/raffenet/mpich/foo.c:31
25 0x0000000000022555 __libc_start_main()  ???:0
26 0x0000000000400849 _start()  ???:0
=================================

Setup and versions

  • OS version (CentOS Linux release 7.8.2003) + CPU architecture (x86_64)
    • Linux pmrs-gpu-240-01.cels.anl.gov 3.10.0-957.27.2.el7.x86_64 #1 SMP Mon Jul 29 17:46:05 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
  • For GPU related issues:
    • Cuda: neither peer-mem or gdrcopy is loaded
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.65.01    Driver Version: 515.65.01    CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Quadro RTX 4000     Off  | 00000000:5E:00.0 Off |                  N/A |
| 30%   27C    P8     9W / 125W |    593MiB /  8192MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Quadro RTX 4000     Off  | 00000000:D8:00.0 Off |                  N/A |
| 30%   22C    P8     8W / 125W |      3MiB /  8192MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

Additional information (depending on the issue)

There is no segfault if using UCX_PROTO_ENABLE=y

@raffenet raffenet added the Bug label Nov 29, 2023
@brminich
Copy link
Contributor

brminich commented Nov 30, 2023

Can UCX_PROTO_ENABLE=y be used as a workaround?
It is a default option since 1.16.x

@raffenet
Copy link
Contributor Author

I guess I could set it in the environment before initializing ucp inside MPICH, but it seems pretty hacky. Is there a better way to control the behavior programmatically?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants