GPU-aware Open MPI 5.0.1 + ROCm gives UCX ERROR: failed to register address #9589

denisbertini opened this issue Jan 10, 2024 · 33 comments

@denisbertini

When using the [AMReX code](https://amrex-codes.github.io/amrex) with

  • GPU-aware Open MPI (v5.0.1)
  • UCX (1.15.0)
  • ROCm (6.0)
the program crashes immediately with an invalid buffer size:
ERROR ibv_reg_mr(address=0x7f56dd327140, length=6528, access=0x10000f) failed: Invalid argument
[1704917108.050189] [lxbk1120:1097129:0]          ucp_mm.c:62   UCX  ERROR failed to register address 0x7f56dd327140 (rocm) length 6528 on md[4]=mlx5_0: Input/output error (md supports: host|rocm)
[1704917108.050192] [lxbk1120:1097129:0]     ucp_request.c:555  UCX  ERROR failed to register user buffer datatype 0x8 address 0x7f56dd327140 len 6528: Input/output error

Intra-node communication works though.
Any ideas what could be wrong?

@edgargabriel
Contributor

@denisbertini thank you for the report.

What GPU is this, if I may ask?
This looks like an environment setup issue, something like the GPUDirect kernel component or a BIOS setting (IOMMU or similar).

Also, how difficult is it to set up the application/test case for reproducing?
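
For reference, a minimal sketch of the kind of checks I have in mind (generic commands, to be adapted to your nodes):

# kernel boot options related to the IOMMU
cat /proc/cmdline | grep -i iommu

# RDMA / peer-memory / GPU kernel modules currently loaded
lsmod | grep -E 'ib_core|peer_mem|amdgpu'

# IOMMU-related messages in the kernel log
dmesg | grep -iE 'iommu|AMD-Vi'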

@denisbertini
Author

The GPU is an AMD MI100 and the OS is Rocky Linux 8.8.

@edgargabriel
Contributor

@denisbertini is there an easy way to run a simple test that reproduces the issue?

@edgargabriel edgargabriel self-assigned this Jan 16, 2024
@denisbertini
Author

I tried with the GPU-aware OSU benchmarks, but it all works fine, so it seems to be related to the AMReX code itself.

@edgargabriel
Contributor

What is your IOMMU setting, if I may ask? One other thing that comes to mind: is ACS disabled?

@denisbertini
Author

Which command(s) should I use to get this info from UCX?

@edgargabriel
Contributor

For the first one (IOMMU), try:

cat /proc/cmdline | grep iommu
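
For the second one (ACS), something along these lines should show the relevant bits (a sketch; substitute the PCI address of your GPU or HCA):

# ACSCtl reporting 'SrcValid-' etc. means ACS is disabled
lspci -vv -s <pci-address> | grep 'Access Control Services' -A 2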

@denisbertini
Author

IOMMU is now enabled in both the BIOS and the GRUB boot configuration on 2 of our GPU nodes.
Unfortunately, this new setup did not help.
The very first MPI inter-node communication between these 2 nodes immediately breaks with:

[lxbk1115:121377] pml_ucx.c:934  Error: ucx send failed: Input/output error
[lxbk1097:230130] pml_ucx.c:934  Error: ucx send failed: Input/output error
[lxbk1097:00000] *** An error occurred in MPI_Isend
[lxbk1097:00000] *** reported by process [845479936,3]
[lxbk1097:00000] *** on communicator MPI COMM 3 DUP FROM 0
[lxbk1097:00000] *** MPI_ERR_OTHER: known error not in list
[lxbk1097:00000] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[lxbk1097:00000] ***    and MPI will try to terminate your MPI job as well)
[lxbk1115:00000] *** An error occurred in MPI_Isend
[lxbk1115:00000] *** reported by process [845479936,4]
[lxbk1115:00000] *** on communicator MPI COMM 3 DUP FROM 0
[lxbk1115:00000] *** MPI_ERR_OTHER: known error not in list
[lxbk1115:00000] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,

@denisbertini
Author

with debug output:

[1705664408.009086] [lxbk1115:177189:0]          wireup.c:1635 UCX  DEBUG ep 0x7fade005d180: send wireup request (flags=0x40)
[1705664408.009096] [lxbk1115:177189:a]        ib_iface.c:797  UCX  DEBUG iface 0x33635a0: ah_attr dlid=286 sl=0 port=1 src_path_bits=0
[1705664408.009102] [lxbk1115:177189:a]           ud_ep.c:824  UCX  DEBUG simultaneous CREQ ep=0x35e9880(iface=0x33635a0 conn_sn=0 ep_id=4, dest_ep_id=4 rx_psn=1)
[1705664408.009627] [lxbk1115:177189:0]           ib_md.c:309  UCX  ERROR ibv_reg_mr(address=0x7fa3ae20cec0, length=26464, access=0x10000f) failed: Invalid argument
[1705664408.009644] [lxbk1115:177189:0]          rcache.c:933  UCX  DEBUG failed to register region 0x38f9680 [0x7fa3ae20cec0..0x7fa3ae213620]: Input/output error
[1705664408.009647] [lxbk1115:177189:0]          ucp_mm.c:62   UCX  ERROR failed to register address 0x7fa3ae20cec0 (rocm) length 26464 on md[4]=mlx5_0: Input/output error (md supports: host|rocm)
[1705664408.009649] [lxbk1115:177189:0]     ucp_request.c:555  UCX  ERROR failed to register user buffer datatype 0x8 address 0x7fa3ae20cec0 len 26464: Input/output error
[1705664408.004998] [lxbk1097:287547:0]          ucp_mm.c:62   UCX  DIAG  failed to register address 0x7f1d3f40cec0 (rocm) length 2
[1705664408.005935] [lxbk1097:287545:0]          rcache.c:933  UCX  DEBUG failed to register region 0x24df260 [0x7fafa440cec0..0x7fafa4413620]: Input/output erro
[1705664408.004709] [lxbk1115:177187:0]           mm_ep.c:68   UCX  DEBU
[1705664408.005385] [lxbk1115:177188:0]           mm_ep.c:68   UCX  DEBUG mm_ep 0x3977b80: attached remote
[1705664408.004575] [lxbk1115:177190:0]

@edgargabriel
Contributor

@denisbertini what about PCI ACS, is that also disabled on the nodes? I am 99% confident that this is a system setup issue, not a UCX issue, since we are running the same software configuration daily in our internal setup; we just have to identify what is triggering the problem.

@denisbertini
Author

@edgargabriel Answer from our sysadmin colleague:

The PCI ACS is disabled (see one of my previous emails):

lspci -vv -s 03:00.0|grep 'Access Control Services' -A 2
(...)

It reports 'SrcValid-' (for ACSCtl), which shows it is not enabled.

@edgargabriel
Contributor

edgargabriel commented Jan 19, 2024

Do you have a script/recipe that I could use to reproduce the run on one of our internal systems, e.g. how to compile and run the code, what input files are required, etc.?

@edgargabriel
Contributor

What MOFED version are you running on that system, by the way?

@denisbertini
Author

We do not use the official Mellanox MOFED but the Linux rdma-core library.

@edgargabriel
Contributor

We do not use the official Mellanox MOFED but the Linux rdma-core library.

Could you, in that case, check whether the ib_peer_mem kernel module is loaded/used?
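
A minimal check would be something like this (a sketch; exact module names can differ between driver stacks):

# is a peer-memory kernel module loaded (ib_peer_mem with MOFED-style stacks)?
lsmod | grep -i -e ib_peer_mem -e peer_mem

# the amdgpu driver itself must also be loaded for ROCm memory
lsmod | grep -w amdgpu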

@denisbertini
Author

Not simple to reproduce, but you can give it a try:

  • ROCm 6.0 in /opt/rocm
  • UCX 1.15.0 (with ROCm) + Open MPI 5.0.1 installed in /usr/local
  • fetch/compile the WarpX code:
  #
  # WarpX 24.01 with AMREX correction
  #
 
  export WARPX_VERSION=24.01
  echo "INstalling WarpX version: " $WARPX_VERSION
  rm -rf /tmp/warp  
  mkdir -p /tmp/warp
  cd /tmp/warp
  git clone https://github.com/ECP-WarpX/WarpX.git
  cd WarpX
  git checkout $WARPX_VERSION

  mkdir build_1d
  cd build_1d
  CXX=/opt/rocm/bin/hipcc CC=/opt/rocm/bin/amdclang cmake .. -D WarpX_DIMS=1 -D WarpX_COMPUTE=HIP -D AMReX_AMD_ARCH=gfx908  -D AMReX_TINY_PROFILE=FALSE -D MPI_C_COMPILER=/usr/local/bin/mpicc -D MPI_CXX_COMPILER=/usr/local/bin/mpicc -DWarpX_OPENPMD=ON -DCMAKE_INSTALL_PREFIX=/usr/local 
  make -j 2
  make -j 2 install    
  cd ..
  mkdir build_2d
  cd build_2d 
  CXX=/opt/rocm/bin/hipcc CC=/opt/rocm/bin/amdclang cmake .. -D WarpX_DIMS=2 -D WarpX_COMPUTE=HIP -D AMReX_AMD_ARCH=gfx908  -D AMReX_TINY_PROFILE=FALSE -D MPI_C_COMPILER=/usr/local/bin/mpicc -D MPI_CXX_COMPILER=/usr/local/bin/mpicc -DWarpX_OPENPMD=ON -DCMAKE_INSTALL_PREFIX=/usr/local 
  make -j 2
  make -j 2 install  
  cd ..
  mkdir build_3d
  cd build_3d 
  CXX=/opt/rocm/bin/hipcc CC=/opt/rocm/bin/amdclang cmake .. -D WarpX_DIMS=3 -D WarpX_COMPUTE=HIP -D AMReX_AMD_ARCH=gfx908  -D AMReX_TINY_PROFILE=FALSE -D MPI_C_COMPILER=/usr/local/bin/mpicc -D MPI_CXX_COMPILER=/usr/local/bin/mpicc -DWarpX_OPENPMD=ON -DCMAKE_INSTALL_PREFIX=/usr/local 
  make -j 2
  make -j 2 install

After these steps you should have all 3 executables (warpx_1d, warpx_2d, warpx_3d) installed in /usr/local/bin.

After that you can run a simulation:

# GPU-aware MPI optimizations
GPU_AWARE_MPI="amrex.use_gpu_aware_mpi=1"

# executable & inputs file 
EXE=warpx_3d
INPUTS=inputs_3d.txt
srun --export=ALL --cpu-bind=cores singularity  exec --rocm ${CONT} ${EXE} ${INPUTS} 

The inputs_3d.txt file is attached.

inputs_3d.txt

@edgargabriel
Contributor

@denisbertini thank you! I will give it a try, but it might take me a few days until I get to it.

@denisbertini
Author

I can understand that!
If you need any help, do not hesitate to ask!

@edgargabriel
Contributor

@denisbertini I have successfully compiled the application, but I have trouble running it. The system here does not have Singularity installed, but independently of that I get the following error when starting the application:

mpirun --mca pml ucx -np 8 ./warpx.3d.MPI.HIP.DP.PDP.OPMD.QED inputs_3d.txt
Multiple GPUs are visible to each MPI rank, This may lead to incorrect or suboptimal rank-to-GPU mapping.!
Initializing AMReX (24.01)...
MPI initialized with 8 MPI processes
MPI initialized with thread support level 3
Initializing HIP...
HIP initialized with 8 devices.
AMReX (24.01) initialized
PICSAR (23.09)
WarpX (24.01)
...
(skipping some lines)
...
--- INFO    : Writing openPMD file diags/diag2000000
terminate called after throwing an instance of 'openPMD::error::WrongAPIUsage'
  what():  Wrong API usage: openPMD-api built without support for backend 'HDF5'.
SIGABRT

Do I need to compile/provide HDF5 as well for the app?

@denisbertini
Author

Please try again with this new input file:
inputs_3d.txt
I commented out the HDF5 output.

@edgargabriel
Contributor

edgargabriel commented Jan 23, 2024

OK, this worked now, thanks! I ran it on 2 nodes, each with 4 MI100 GPUs + InfiniBand, and it seemed to work without issues (UCX 1.15.0, Open MPI 5.0.1, but ROCm 5.7.1 and MOFED installed on these nodes). I will try to run tomorrow on 2 nodes with 8 GPUs each (since that changes the underlying topology).

[egabriel@t004-003 bin]$ mpirun --mca pml ucx -np 8 ./warpx.3d.MPI.HIP.DP.PDP.OPMD.QED inputs_3d.txt
Multiple GPUs are visible to each MPI rank, This may lead to incorrect or suboptimal rank-to-GPU mapping.!
Initializing AMReX (24.01)...
MPI initialized with 8 MPI processes
MPI initialized with thread support level 3
Initializing HIP...
HIP initialized with 8 devices.
AMReX (24.01) initialized
PICSAR (23.09)
WarpX (24.01)

    __        __             __  __
    \ \      / /_ _ _ __ _ __\ \/ /
     \ \ /\ / / _` | '__| '_ \\  /
      \ V  V / (_| | |  | |_) /  \
       \_/\_/ \__,_|_|  | .__/_/\_\
                        |_|

Level 0: dt = 8.687655226e-16 ; dx = 1.875e-06 ; dy = 1.875e-06 ; dz = 2.65625e-07

Grids Summary:
  Level 0   8 grids  262144 cells  100 % of domain
            smallest grid: 32 x 32 x 32  biggest grid: 32 x 32 x 32

-------------------------------------------------------------------------------
--------------------------- MAIN EM PIC PARAMETERS ----------------------------
-------------------------------------------------------------------------------
Precision:            | DOUBLE
Particle precision:   | DOUBLE
Geometry:             | 3D (XYZ)
Operation mode:       | Electromagnetic
                      | - vacuum
-------------------------------------------------------------------------------
Current Deposition:   | Esirkepov
Particle Pusher:      | Boris
Charge Deposition:    | standard
Field Gathering:      | energy-conserving
Particle Shape Factor:| 3
-------------------------------------------------------------------------------
Maxwell Solver:       | Yee
-------------------------------------------------------------------------------
Moving window:        |    ON
                      |  - moving_window_dir = z
                      |  - moving_window_v = 299792458
-------------------------------------------------------------------------------
For full input parameters, see the file: warpx_used_inputs

STEP 1 starts ...


**** WARNINGS ******************************************************************
* GLOBAL warning list  after  [ FIRST STEP ]
*
* No recorded warnings.
********************************************************************************

STEP 1 ends. TIME = 8.687655226e-16 DT = 8.687655226e-16
Evolve time = 0.038002964 s; This step = 0.038002964 s; Avg. per step = 0.038002964 s

STEP 2 starts ...
STEP 2 ends. TIME = 1.737531045e-15 DT = 8.687655226e-16
Evolve time = 0.047672122 s; This step = 0.009669158 s; Avg. per step = 0.023836061 s

STEP 3 starts ...
STEP 3 ends. TIME = 2.606296568e-15 DT = 8.687655226e-16
Evolve time = 0.052934207 s; This step = 0.005262085 s; Avg. per step = 0.01764473567 s

...

STEP 99 starts ...
STEP 99 ends. TIME = 8.600778674e-14 DT = 8.687655226e-16
Evolve time = 0.602792063 s; This step = 0.005151691 s; Avg. per step = 0.006088808717 s

STEP 100 starts ...
--- INFO    : re-sorting particles
STEP 100 ends. TIME = 8.687655226e-14 DT = 8.687655226e-16
Evolve time = 0.609494703 s; This step = 0.00670264 s; Avg. per step = 0.00609494703 s


**** WARNINGS ******************************************************************
* GLOBAL warning list  after  [ THE END ]
*
* No recorded warnings.
***********************************************************************************

Total Time                     : 0.676633109
Total GPU global memory (MB) spread across MPI: [32752 ... 32752]
Free  GPU global memory (MB) spread across MPI: [7895 ... 7901]
[The         Arena] space (MB) allocated spread across MPI: [24564 ... 24564]
[The         Arena] space (MB) used      spread across MPI: [0 ... 0]
[The Managed Arena] space (MB) allocated spread across MPI: [8 ... 8]
[The Managed Arena] space (MB) used      spread across MPI: [0 ... 0]
[The  Pinned Arena] space (MB) allocated spread across MPI: [8 ... 8]
[The  Pinned Arena] space (MB) used      spread across MPI: [0 ... 0]
AMReX (24.01) finalized

@denisbertini
Author

Could you try with ROCm 6.0?

@edgargabriel
Contributor

Unfortunately not; I cannot update the ROCm version on that cluster, as it is not under my control. However, in my opinion the ROCm version plays a minuscule role in this scenario. If the 8-GPUs-per-node scenario with 2 nodes (16 GPUs) also works, my main suspicion is actually the MOFED vs. rdma-core library. The managers of this cluster also originally tried the rdma-core library, but had to switch to MOFED.

@edgargabriel
Contributor

edgargabriel commented Jan 24, 2024

I can confirm that the 16-process run (2 nodes with 8 MI100 processes/GPUs each) also finished correctly (though the job complained about too many resources and too little work :-)).

I am at this point 99% sure that this is not a UCX/Open MPI issue, but that we are dealing with some configuration aspect of your cluster.

@denisbertini
Author

Could you ask the sysadmins of your cluster the reason(s) why they had to move from rdma-core to the official MOFED library?
It would be very interesting for us to know ...

@edgargabriel
Contributor

It was because of issues getting GPUDirect RDMA to work. We meanwhile have GPUDirect RDMA working with rdma-core on a cluster that uses Broadcom NICs, but for InfiniBand/Mellanox RoCE HCAs we always use MOFED.

@denisbertini
Author

Could you please try the test again with the modified submit script:

# GPU-aware MPI optimizations
GPU_AWARE_MPI="amrex.use_gpu_aware_mpi=1"

# executable & inputs file 
EXE=warpx_3d
INPUTS=inputs_3d.txt
srun --export=ALL --cpu-bind=cores ${EXE} ${INPUTS} ${GPU_AWARE_MPI}

@denisbertini
Author

@edgargabriel
Redoing the same test gives another type of error:

 [lxbk1087:2070485:0:2070485]        rndv.c:1872 Assertion `sreq->send.rndv.lanes_count > 0' failed
==== backtrace (tid:2070485) ====
 0  /usr/local/ucx/lib/libucs.so.0(ucs_handle_error+0x294) [0x7f4132b0f0a4]
 1  /usr/local/ucx/lib/libucs.so.0(ucs_fatal_error_message+0xb0) [0x7f4132b0c070]
 2  /usr/local/ucx/lib/libucs.so.0(+0x2a151) [0x7f4132b0c151]
 3  /usr/local/ucx/lib/libucp.so.0(ucp_rndv_progress_rma_put_zcopy+0x148) [0x7f4133002b68]
 4  /usr/local/ucx/lib/libucp.so.0(+0x7e835) [0x7f4132fff835]
 5  /usr/local/ucx/lib/libucp.so.0(ucp_rndv_atp_handler+0x77) [0x7f4133004d87]
 6  /usr/local/ucx/lib/ucx/libuct_ib.so.0(+0x3e2e5) [0x7f3925fa92e5]
 7  /usr/local/ucx/lib/libucp.so.0(ucp_worker_progress+0x6a) [0x7f4132fcf6aa]
 8  /usr/local/lib/libopen-pal.so.80(opal_progress+0x2c) [0x7f41332880bc]
 9  /usr/local/lib/libopen-pal.so.80(ompi_sync_wait_mt+0x125) [0x7f41332ba275]
10  /usr/local/lib/libmpi.so.40(ompi_request_default_wait_all+0x13c) [0x7f413ddb16bc]
11  /usr/local/lib/libmpi.so.40(PMPI_Waitall+0x6f) [0x7f413de01aff]

This issue also seems to be linked to the upstream OFED drivers in RHEL.

See for example this related issue:

#7882

@edgargabriel
Contributor

@denisbertini I submitted a job with the new syntax and will update the ticket once the job finishes.

The issue that you are pointing to was a problem that we had in UCX 1.13.x, but it was resolved starting with UCX 1.14.x, so it should hopefully not apply here (unless your job accidentally pulls in a wrong UCX library, e.g. because LD_LIBRARY_PATH does not point to the UCX 1.15 lib directory or similar).
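
A quick way to verify which UCX the job actually resolves at runtime (a sketch, assuming the install locations from the recipe above):

# which libucp does the executable pick up?
ldd /usr/local/bin/warpx_3d | grep -i ucp

# version of the UCX found first in the search path
ucx_info -v

# make sure the intended installation comes first
echo $LD_LIBRARY_PATH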

@edgargabriel
Contributor

I reran the code with the changed settings/arguments; the code still finished correctly on our system:

export GPU_AWARE_MPI="amrex.use_gpu_aware_mpi=1"
cd /work1/amd/egabriel/WARPX/bin

/home1/egabriel/OpenMPI/bin/mpirun --mca pml ucx -np 16 -x UCX_LOG_LEVEL=info ./warpx.3d.MPI.HIP.DP.PDP.OPMD.QED inputs_3d.txt ${GPU_AWARE_MPI}

Multiple GPUs are visible to each MPI rank, This may lead to incorrect or suboptimal rank-to-GPU mapping.!
Initializing AMReX (24.01)...
MPI initialized with 16 MPI processes
MPI initialized with thread support level 3
Initializing HIP...
HIP initialized with 16 devices.
AMReX (24.01) initialized
PICSAR (23.09)
WarpX (24.01)

    __        __             __  __
    \ \      / /_ _ _ __ _ __\ \/ /
     \ \ /\ / / _` | '__| '_ \\  /
      \ V  V / (_| | |  | |_) /  \
       \_/\_/ \__,_|_|  | .__/_/\_\
                        |_|

Level 0: dt = 8.687655226e-16 ; dx = 1.875e-06 ; dy = 1.875e-06 ; dz = 2.65625e-07

Grids Summary:
  Level 0   8 grids  262144 cells  100 % of domain
            smallest grid: 32 x 32 x 32  biggest grid: 32 x 32 x 32

-------------------------------------------------------------------------------
--------------------------- MAIN EM PIC PARAMETERS ----------------------------
-------------------------------------------------------------------------------
Precision:            | DOUBLE
Particle precision:   | DOUBLE
Geometry:             | 3D (XYZ)
Operation mode:       | Electromagnetic
                      | - vacuum
-------------------------------------------------------------------------------
Current Deposition:   | Esirkepov
Particle Pusher:      | Boris
Charge Deposition:    | standard
Field Gathering:      | energy-conserving
Particle Shape Factor:| 3
-------------------------------------------------------------------------------
Maxwell Solver:       | Yee
-------------------------------------------------------------------------------
Moving window:        |    ON
                      |  - moving_window_dir = z
                      |  - moving_window_v = 299792458
-------------------------------------------------------------------------------
For full input parameters, see the file: warpx_used_inputs

STEP 1 starts ...


**** WARNINGS ******************************************************************
* GLOBAL warning list  after  [ FIRST STEP ]
*
* --> [!!!] [Performance] [raised 16 times]
*     Too many resources / too little work!
*     It looks like you requested more compute resources than there are total
*     number of boxes of cells available (8). You started with (16) MPI ranks,
*     so (8) rank(s) will have no work.
*     On GPUs, consider using 1-8 boxes per GPU that together fill each GPU's
*     memory sufficiently. If you do not rely on dynamic load-balancing, then
*     one large box per GPU is ideal.
*     Consider decreasing the amr.blocking_factor and amr.max_grid_size
*     parameters and/or using fewer MPI ranks.
*     More information:
*     https://warpx.readthedocs.io/en/latest/usage/workflows/parallelization.html
*     @ Raised by: ALL
*
********************************************************************************

STEP 1 ends. TIME = 8.687655226e-16 DT = 8.687655226e-16
Evolve time = 0.0097388 s; This step = 0.0097388 s; Avg. per step = 0.0097388 s

STEP 2 starts ...
STEP 2 ends. TIME = 1.737531045e-15 DT = 8.687655226e-16
Evolve time = 0.021369073 s; This step = 0.011630273 s; Avg. per step = 0.0106845365 s

STEP 3 starts ...
STEP 3 ends. TIME = 2.606296568e-15 DT = 8.687655226e-16
Evolve time = 0.026867411 s; This step = 0.005498338 s; Avg. per step = 0.008955803667 s

...

STEP 99 starts ...
STEP 99 ends. TIME = 8.600778674e-14 DT = 8.687655226e-16
Evolve time = 0.650830741 s; This step = 0.005815287 s; Avg. per step = 0.006574047889 s

STEP 100 starts ...
--- INFO    : re-sorting particles
STEP 100 ends. TIME = 8.687655226e-14 DT = 8.687655226e-16
Evolve time = 0.658796751 s; This step = 0.00796601 s; Avg. per step = 0.00658796751 s

Total Time                     : 0.740151409
Total GPU global memory (MB) spread across MPI: [32752 ... 32752]
Free  GPU global memory (MB) spread across MPI: [7879 ... 32514]
[The         Arena] space (MB) allocated spread across MPI: [24564 ... 24564]
[The         Arena] space (MB) used      spread across MPI: [0 ... 0]
[The Managed Arena] space (MB) allocated spread across MPI: [8 ... 8]
[The Managed Arena] space (MB) used      spread across MPI: [0 ... 0]
[The  Pinned Arena] space (MB) allocated spread across MPI: [8 ... 8]
[The  Pinned Arena] space (MB) used      spread across MPI: [0 ... 0]
[The   Comms Arena] space (MB) allocated spread across MPI: [8 ... 8]
[The   Comms Arena] space (MB) used      spread across MPI: [0 ... 0]
AMReX (24.01) finalized

@denisbertini
Author

OK, then it is now clear ... thanks!
Preparing the basic OSU benchmarks now ...

@denisbertini
Author

Standard OSU benchmark test:

  • setup: 2 nodes with only 1 MPI rank per node, corresponding to 1 GPU each
  • non-ROCm test: the host-to-host mode works
# OSU MPI Bandwidth Test v7.3
# Size      Bandwidth (MB/s)
# Datatype: MPI_CHAR.
1                       1.88
2                       3.75
4                       7.37
8                      14.63
16                     27.09
32                     56.30
64                    126.23
128                   243.24
256                   492.83
512                   909.62
1024                 1430.86
2048                 2114.66
4096                 2088.23
8192                 5062.87
16384                8186.23
32768                9719.93
65536               10841.10
131072              11071.00
262144              11431.13
524288              12306.31
1048576             11251.29
2097152             11189.57
4194304             11423.66
  • ROCm tests (H D or D D buffer placement) immediately failed with the same errors (see the invocation sketch after the logs):
# OSU MPI-ROCM Bandwidth Test v7.3
# Send Buffer on HOST (H) and Receive Buffer on DEVICE (D)
# Size      Bandwidth (MB/s)
# Datatype: MPI_CHAR.
[1706257302.851919] [lxbk1102:2038794:0]           ib_md.c:309  UCX  ERROR ibv_reg_mr(address=0x7fe35ea00000, length=16, access=0x10000f) failed: Invalid argument
[1706257302.851940] [lxbk1102:2038794:0]          ucp_mm.c:62   UCX  ERROR failed to register address 0x7fe35ea00000 (rocm) length 1 on md[4]=mlx5_0: Input/output error (md supports: host|rocm)
[1706257302.851943] [lxbk1102:2038794:0]     ucp_request.c:555  UCX  ERROR failed to register user buffer datatype 0x8 address 0x7fe35ea00000 len 1: Input/output error
# OSU MPI-ROCM Bandwidth Test v7.3
# Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D)
# Size      Bandwidth (MB/s)
# Datatype: MPI_CHAR.
[1706257305.444694] [lxbk1099:36585:0]           ib_md.c:309  UCX  ERROR ibv_reg_mr(address=0x7ff763a00000, length=16, access=0x10000f) failed: Invalid argument
[1706257305.444712] [lxbk1099:36585:0]          ucp_mm.c:62   UCX  ERROR failed to register address 0x7ff763a00000 (rocm) length 1 on md[4]=mlx5_0: Input/output error (md supports: host|rocm)
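
For reference, the ROCm variants were launched roughly like this (a sketch, not the exact command line; OSU 7.x selects the accelerator with -d and the send/receive buffer placement with positional H/D arguments):

# D D: send and receive buffers in ROCm device memory, 1 rank per node on 2 nodes
srun --export=ALL -N 2 --ntasks-per-node=1 ./osu_bw -d rocm D D

# H D: host send buffer, device receive buffer
srun --export=ALL -N 2 --ntasks-per-node=1 ./osu_bw -d rocm H D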

@edgargabriel
Contributor

OK, so very clearly registration of GPU memory doesn't work on your system, independent of the application. This leads back to the things that I suggested and also mentioned in the email communication:

  • double-check that you have the ib_peer_mem Linux kernel component running on your system if you want to stay with the Linux rdma-core stack (a quick verification sketch follows below)
  • or install MOFED (recommended)

I think these are your two options.
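
To narrow this down on your side, two quick checks (a sketch; ucx_info ships with the UCX installation):

# does UCX report ROCm memory domains/transports at all?
ucx_info -d | grep -i rocm

# per the first option above: is a peer-memory module present on the node?
lsmod | grep -i peer_mem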
