GPU-aware Open MPI 5.0.1 + ROCm gives UCX ERROR: failed to register address #9589

denisbertini opened this issue Jan 10, 2024 · 33 comments

@denisbertini

When using the [AMReX code](https://amrex-codes.github.io/amrex) with

  • GPU-aware Open MPI (v5.0.1)
  • UCX (1.15.0)
  • ROCm (6.0)
the program crashes immediately with an invalid buffer size:
ERROR ibv_reg_mr(address=0x7f56dd327140, length=6528, access=0x10000f) failed: Invalid argument
[1704917108.050189] [lxbk1120:1097129:0]          ucp_mm.c:62   UCX  ERROR failed to register address 0x7f56dd327140 (rocm) length 6528 on md[4]=mlx5_0: Input/output error (md supports: host|rocm)
[1704917108.050192] [lxbk1120:1097129:0]     ucp_request.c:555  UCX  ERROR failed to register user buffer datatype 0x8 address 0x7f56dd327140 len 6528: Input/output error

Intra-node communication works though.
Any ideas what could be wrong?

@edgargabriel
Contributor

@denisbertini thank you for the report.

What GPU is this, if I may ask?
This looks like an environment setup issue, something like the GPUDirect kernel component or a BIOS setting (IOMMU or similar).

Also, how difficult is it to set up the application/test case for reproducing?
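
For reference, a minimal sketch of the kind of checks I have in mind (generic commands, to be adapted to your nodes):

# kernel boot options related to the IOMMU
cat /proc/cmdline | grep -i iommu

# RDMA / peer-memory / GPU kernel modules currently loaded
lsmod | grep -E 'ib_core|peer_mem|amdgpu'

# IOMMU-related messages in the kernel log
dmesg | grep -iE 'iommu|AMD-Vi'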

@denisbertini
Author

The GPU is an AMD MI100 and the OS is Rocky Linux 8.8.

@edgargabriel
Contributor

@denisbertini is there an easy way to run a simple test that reproduces the issue?

@edgargabriel edgargabriel self-assigned this Jan 16, 2024
@denisbertini
Author

I tried with the GPU-aware OSU benchmarks, but it all works fine, so it seems to be related to the AMReX code itself.

@edgargabriel
Contributor

What is your IOMMU setting, if I may ask? One other thing that comes to mind: is ACS disabled?

@denisbertini
Author

Which command(s) should I use to get this info from UCX?

@edgargabriel
Contributor

For the first one (IOMMU), try:

cat /proc/cmdline | grep iommu
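
For the second one (ACS), something along these lines should show the relevant bits (a sketch; substitute the PCI address of your GPU or HCA):

# ACSCtl reporting 'SrcValid-' etc. means ACS is disabled
lspci -vv -s <pci-address> | grep 'Access Control Services' -A 2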

@denisbertini
Author

IOMMU is now enabled in both the BIOS and the GRUB boot configuration on 2 of our GPU nodes.
Unfortunately, this new setup did not help.
The very first MPI inter-node communication between these 2 nodes immediately breaks with:

[lxbk1115:121377] pml_ucx.c:934  Error: ucx send failed: Input/output error
[lxbk1097:230130] pml_ucx.c:934  Error: ucx send failed: Input/output error
[lxbk1097:00000] *** An error occurred in MPI_Isend
[lxbk1097:00000] *** reported by process [845479936,3]
[lxbk1097:00000] *** on communicator MPI COMM 3 DUP FROM 0
[lxbk1097:00000] *** MPI_ERR_OTHER: known error not in list
[lxbk1097:00000] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[lxbk1097:00000] ***    and MPI will try to terminate your MPI job as well)
[lxbk1115:00000] *** An error occurred in MPI_Isend
[lxbk1115:00000] *** reported by process [845479936,4]
[lxbk1115:00000] *** on communicator MPI COMM 3 DUP FROM 0
[lxbk1115:00000] *** MPI_ERR_OTHER: known error not in list
[lxbk1115:00000] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,

@denisbertini
Author

with debug output:

[1705664408.009086] [lxbk1115:177189:0]          wireup.c:1635 UCX  DEBUG ep 0x7fade005d180: send wireup request (flags=0x40)
[1705664408.009096] [lxbk1115:177189:a]        ib_iface.c:797  UCX  DEBUG iface 0x33635a0: ah_attr dlid=286 sl=0 port=1 src_path_bits=0
[1705664408.009102] [lxbk1115:177189:a]           ud_ep.c:824  UCX  DEBUG simultaneous CREQ ep=0x35e9880(iface=0x33635a0 conn_sn=0 ep_id=4, dest_ep_id=4 rx_psn=1)
[1705664408.009627] [lxbk1115:177189:0]           ib_md.c:309  UCX  ERROR ibv_reg_mr(address=0x7fa3ae20cec0, length=26464, access=0x10000f) failed: Invalid argument
[1705664408.009644] [lxbk1115:177189:0]          rcache.c:933  UCX  DEBUG failed to register region 0x38f9680 [0x7fa3ae20cec0..0x7fa3ae213620]: Input/output error
[1705664408.009647] [lxbk1115:177189:0]          ucp_mm.c:62   UCX  ERROR failed to register address 0x7fa3ae20cec0 (rocm) length 26464 on md[4]=mlx5_0: Input/output error (md supports: host|rocm)
[1705664408.009649] [lxbk1115:177189:0]     ucp_request.c:555  UCX  ERROR failed to register user buffer datatype 0x8 address 0x7fa3ae20cec0 len 26464: Input/output error
[1705664408.004998] [lxbk1097:287547:0]          ucp_mm.c:62   UCX  DIAG  failed to register address 0x7f1d3f40cec0 (rocm) length 2
[1705664408.005935] [lxbk1097:287545:0]          rcache.c:933  UCX  DEBUG failed to register region 0x24df260 [0x7fafa440cec0..0x7fafa4413620]: Input/output erro
[1705664408.004709] [lxbk1115:177187:0]           mm_ep.c:68   UCX  DEBU
[1705664408.005385] [lxbk1115:177188:0]           mm_ep.c:68   UCX  DEBUG mm_ep 0x3977b80: attached remote
[1705664408.004575] [lxbk1115:177190:0]

@edgargabriel
Contributor

@denisbertini what about PCI ACS, is that also disabled on the nodes? I am 99% confident that this is a system setup issue, not a UCX issue, since we are running the same software configuration daily in our internal setup; we just have to identify what is triggering the problem.

@denisbertini
Author

@edgargabriel Answer from our sysadmin colleague:

The PCI ACS is disabled (see one of my previous emails):

lspci -vv -s 03:00.0|grep 'Access Control Services' -A 2
(...)

It reports 'SrcValid-' (for ACSCtl), which shows it is not enabled.

@edgargabriel
Contributor

edgargabriel commented Jan 19, 2024

Do you have a script/recipe that I could use to reproduce the run on one of our internal systems, e.g. how to compile and run the code, what input files are required, etc.?

@edgargabriel
Contributor

What MOFED version are you running on that system, by the way?

@denisbertini
Author

We do not use the official Mellanox MOFED but the Linux rdma-core library.

@edgargabriel
Contributor

We do not use the official Mellanox MOFED but the Linux rdma-core library.

Could you, in that case, check whether the ib_peer_mem kernel module is loaded/used?
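
A minimal check would be something like this (a sketch; exact module names can differ between driver stacks):

# is a peer-memory kernel module loaded (ib_peer_mem with MOFED-style stacks)?
lsmod | grep -i -e ib_peer_mem -e peer_mem

# the amdgpu driver itself must also be loaded for ROCm memory
lsmod | grep -w amdgpu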

@denisbertini
Author

Not simple to reproduce, but you can give it a try:

  • ROCm 6.0 in /opt/rocm
  • UCX 1.15.0 (with ROCm) + Open MPI 5.0.1 installed in /usr/local
  • fetch/compile the WarpX code:
  #
  # WarpX 24.01 with AMREX correction
  #
 
  export WARPX_VERSION=24.01
  echo "INstalling WarpX version: " $WARPX_VERSION
  rm -rf /tmp/warp  
  mkdir -p /tmp/warp
  cd /tmp/warp
  git clone https://github.com/ECP-WarpX/WarpX.git
  cd WarpX
  git checkout $WARPX_VERSION

  mkdir build_1d
  cd build_1d
  CXX=/opt/rocm/bin/hipcc CC=/opt/rocm/bin/amdclang cmake .. -D WarpX_DIMS=1 -D WarpX_COMPUTE=HIP -D AMReX_AMD_ARCH=gfx908  -D AMReX_TINY_PROFILE=FALSE -D MPI_C_COMPILER=/usr/local/bin/mpicc -D MPI_CXX_COMPILER=/usr/local/bin/mpicc -DWarpX_OPENPMD=ON -DCMAKE_INSTALL_PREFIX=/usr/local 
  make -j 2
  make -j 2 install    
  cd ..
  mkdir build_2d
  cd build_2d 
  CXX=/opt/rocm/bin/hipcc CC=/opt/rocm/bin/amdclang cmake .. -D WarpX_DIMS=2 -D WarpX_COMPUTE=HIP -D AMReX_AMD_ARCH=gfx908  -D AMReX_TINY_PROFILE=FALSE -D MPI_C_COMPILER=/usr/local/bin/mpicc -D MPI_CXX_COMPILER=/usr/local/bin/mpicc -DWarpX_OPENPMD=ON -DCMAKE_INSTALL_PREFIX=/usr/local 
  make -j 2
  make -j 2 install  
  cd ..
  mkdir build_3d
  cd build_3d 
  CXX=/opt/rocm/bin/hipcc CC=/opt/rocm/bin/amdclang cmake .. -D WarpX_DIMS=3 -D WarpX_COMPUTE=HIP -D AMReX_AMD_ARCH=gfx908  -D AMReX_TINY_PROFILE=FALSE -D MPI_C_COMPILER=/usr/local/bin/mpicc -D MPI_CXX_COMPILER=/usr/local/bin/mpicc -DWarpX_OPENPMD=ON -DCMAKE_INSTALL_PREFIX=/usr/local 
  make -j 2
  make -j 2 install

After these steps you should have all 3 executables (warpx_1d, warpx_2d, warpx_3d) installed in /usr/local/bin.

After that you can run a simulation:

# GPU-aware MPI optimizations
GPU_AWARE_MPI="amrex.use_gpu_aware_mpi=1"

# executable & inputs file 
EXE=warpx_3d
INPUTS=inputs_3d.txt
srun --export=ALL --cpu-bind=cores singularity  exec --rocm ${CONT} ${EXE} ${INPUTS} 

The inputs_3d.txt file is attached.

inputs_3d.txt

@edgargabriel
Contributor

@denisbertini thank you! I will give it a try, but it might take me a few days until I get to it.

@denisbertini
Author

I can understand that!
If you need any help, do not hesitate to ask!

@edgargabriel
Contributor

@denisbertini I have successfully compiled the application, but I have trouble running it. The system here does not have Singularity installed, but independently of that I get the following error when starting the application:

mpirun --mca pml ucx -np 8 ./warpx.3d.MPI.HIP.DP.PDP.OPMD.QED inputs_3d.txt
Multiple GPUs are visible to each MPI rank, This may lead to incorrect or suboptimal rank-to-GPU mapping.!
Initializing AMReX (24.01)...
MPI initialized with 8 MPI processes
MPI initialized with thread support level 3
Initializing HIP...
HIP initialized with 8 devices.
AMReX (24.01) initialized
PICSAR (23.09)
WarpX (24.01)
...
(skipping some lines)
...
--- INFO    : Writing openPMD file diags/diag2000000
terminate called after throwing an instance of 'openPMD::error::WrongAPIUsage'
  what():  Wrong API usage: openPMD-api built without support for backend 'HDF5'.
SIGABRT

Do I need to compile/provide HDF5 as well for the app?

@denisbertini
Author

Please try again with this new input file:
inputs_3d.txt
I commented out the HDF5 output.

@edgargabriel
Contributor

edgargabriel commented Jan 23, 2024

OK, this worked now, thanks! I ran it on 2 nodes, each with 4 MI100 GPUs + InfiniBand, and it seemed to work without issues (UCX 1.15.0, Open MPI 5.0.1, but ROCm 5.7.1 and MOFED installed on these nodes). I will try to run tomorrow on 2 nodes with 8 GPUs each (since that changes the underlying topology).

[egabriel@t004-003 bin]$ mpirun --mca pml ucx -np 8 ./warpx.3d.MPI.HIP.DP.PDP.OPMD.QED inputs_3d.txt
Multiple GPUs are visible to each MPI rank, This may lead to incorrect or suboptimal rank-to-GPU mapping.!
Initializing AMReX (24.01)...
MPI initialized with 8 MPI processes
MPI initialized with thread support level 3
Initializing HIP...
HIP initialized with 8 devices.
AMReX (24.01) initialized
PICSAR (23.09)
WarpX (24.01)

    __        __             __  __
    \ \      / /_ _ _ __ _ __\ \/ /
     \ \ /\ / / _` | '__| '_ \\  /
      \ V  V / (_| | |  | |_) /  \
       \_/\_/ \__,_|_|  | .__/_/\_\
                        |_|

Level 0: dt = 8.687655226e-16 ; dx = 1.875e-06 ; dy = 1.875e-06 ; dz = 2.65625e-07

Grids Summary:
  Level 0   8 grids  262144 cells  100 % of domain
            smallest grid: 32 x 32 x 32  biggest grid: 32 x 32 x 32

-------------------------------------------------------------------------------
--------------------------- MAIN EM PIC PARAMETERS ----------------------------
-------------------------------------------------------------------------------
Precision:            | DOUBLE
Particle precision:   | DOUBLE
Geometry:             | 3D (XYZ)
Operation mode:       | Electromagnetic
                      | - vacuum
-------------------------------------------------------------------------------
Current Deposition:   | Esirkepov
Particle Pusher:      | Boris
Charge Deposition:    | standard
Field Gathering:      | energy-conserving
Particle Shape Factor:| 3
-------------------------------------------------------------------------------
Maxwell Solver:       | Yee
-------------------------------------------------------------------------------
Moving window:        |    ON
                      |  - moving_window_dir = z
                      |  - moving_window_v = 299792458
-------------------------------------------------------------------------------
For full input parameters, see the file: warpx_used_inputs

STEP 1 starts ...


**** WARNINGS ******************************************************************
* GLOBAL warning list  after  [ FIRST STEP ]
*
* No recorded warnings.
********************************************************************************

STEP 1 ends. TIME = 8.687655226e-16 DT = 8.687655226e-16
Evolve time = 0.038002964 s; This step = 0.038002964 s; Avg. per step = 0.038002964 s

STEP 2 starts ...
STEP 2 ends. TIME = 1.737531045e-15 DT = 8.687655226e-16
Evolve time = 0.047672122 s; This step = 0.009669158 s; Avg. per step = 0.023836061 s

STEP 3 starts ...
STEP 3 ends. TIME = 2.606296568e-15 DT = 8.687655226e-16
Evolve time = 0.052934207 s; This step = 0.005262085 s; Avg. per step = 0.01764473567 s

...

STEP 99 starts ...
STEP 99 ends. TIME = 8.600778674e-14 DT = 8.687655226e-16
Evolve time = 0.602792063 s; This step = 0.005151691 s; Avg. per step = 0.006088808717 s

STEP 100 starts ...
--- INFO    : re-sorting particles
STEP 100 ends. TIME = 8.687655226e-14 DT = 8.687655226e-16
Evolve time = 0.609494703 s; This step = 0.00670264 s; Avg. per step = 0.00609494703 s


**** WARNINGS ******************************************************************
* GLOBAL warning list  after  [ THE END ]
*
* No recorded warnings.
***********************************************************************************

Total Time                     : 0.676633109
Total GPU global memory (MB) spread across MPI: [32752 ... 32752]
Free  GPU global memory (MB) spread across MPI: [7895 ... 7901]
[The         Arena] space (MB) allocated spread across MPI: [24564 ... 24564]
[The         Arena] space (MB) used      spread across MPI: [0 ... 0]
[The Managed Arena] space (MB) allocated spread across MPI: [8 ... 8]
[The Managed Arena] space (MB) used      spread across MPI: [0 ... 0]
[The  Pinned Arena] space (MB) allocated spread across MPI: [8 ... 8]
[The  Pinned Arena] space (MB) used      spread across MPI: [0 ... 0]
AMReX (24.01) finalized

@denisbertini
Author

Could you try with ROCm 6.0?

@edgargabriel
Contributor

Unfortunately not; I cannot update the ROCm version on that cluster, as it is not under my control. However, in my opinion the ROCm version plays a minuscule role in this scenario. If the 8-GPUs-per-node scenario with 2 nodes (16 GPUs) also works, my main suspicion is actually the MOFED vs. rdma-core library. The managers of this cluster also originally tried the rdma-core library, but had to switch to MOFED.

@edgargabriel
Contributor

edgargabriel commented Jan 24, 2024

I can confirm that the 16-process run (2 nodes with 8 MI100 processes/GPUs each) also finished correctly (though the job complained about too many resources and too little work :-)).

I am at this point 99% sure that this is not a UCX/Open MPI issue, but that we are dealing with some configuration aspect of your cluster.

@denisbertini
Author

Could you ask the sysadmins of your cluster the reason(s) why they had to move from rdma-core to the official MOFED library?
It would be very interesting for us to know ...

@edgargabriel
Contributor

It was because of issues getting GPUDirect RDMA to work. We meanwhile have GPUDirect RDMA working with rdma-core on a cluster that uses Broadcom NICs, but for InfiniBand/Mellanox RoCE HCAs we always use MOFED.

@denisbertini
Author

Could you please try the test again with the modified submit script:

# GPU-aware MPI optimizations
GPU_AWARE_MPI="amrex.use_gpu_aware_mpi=1"

# executable & inputs file 
EXE=warpx_3d
INPUTS=inputs_3d.txt
srun --export=ALL --cpu-bind=cores ${EXE} ${INPUTS} ${GPU_AWARE_MPI}

@denisbertini
Author

@edgargabriel
Redoing the same test gives another type of error:

 [lxbk1087:2070485:0:2070485]        rndv.c:1872 Assertion `sreq->send.rndv.lanes_count > 0' failed
==== backtrace (tid:2070485) ====
 0  /usr/local/ucx/lib/libucs.so.0(ucs_handle_error+0x294) [0x7f4132b0f0a4]
 1  /usr/local/ucx/lib/libucs.so.0(ucs_fatal_error_message+0xb0) [0x7f4132b0c070]
 2  /usr/local/ucx/lib/libucs.so.0(+0x2a151) [0x7f4132b0c151]
 3  /usr/local/ucx/lib/libucp.so.0(ucp_rndv_progress_rma_put_zcopy+0x148) [0x7f4133002b68]
 4  /usr/local/ucx/lib/libucp.so.0(+0x7e835) [0x7f4132fff835]
 5  /usr/local/ucx/lib/libucp.so.0(ucp_rndv_atp_handler+0x77) [0x7f4133004d87]
 6  /usr/local/ucx/lib/ucx/libuct_ib.so.0(+0x3e2e5) [0x7f3925fa92e5]
 7  /usr/local/ucx/lib/libucp.so.0(ucp_worker_progress+0x6a) [0x7f4132fcf6aa]
 8  /usr/local/lib/libopen-pal.so.80(opal_progress+0x2c) [0x7f41332880bc]
 9  /usr/local/lib/libopen-pal.so.80(ompi_sync_wait_mt+0x125) [0x7f41332ba275]
10  /usr/local/lib/libmpi.so.40(ompi_request_default_wait_all+0x13c) [0x7f413ddb16bc]
11  /usr/local/lib/libmpi.so.40(PMPI_Waitall+0x6f) [0x7f413de01aff]

This issue also seems to be linked to the upstream OFED drivers in RHEL.

See for example this related issue:

#7882

@edgargabriel
Contributor

@denisbertini I submitted a job with the new syntax and will update the ticket once the job finishes.

The issue that you are pointing to was a problem that we had in UCX 1.13.x, but it was resolved starting with UCX 1.14.x, so it should hopefully not apply here (unless your job accidentally pulls in a wrong UCX library, e.g. because LD_LIBRARY_PATH does not point to the UCX 1.15 lib directory or similar).
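
A quick way to verify which UCX the job actually resolves at runtime (a sketch, assuming the install locations from the recipe above):

# which libucp does the executable pick up?
ldd /usr/local/bin/warpx_3d | grep -i ucp

# version of the UCX found first in the search path
ucx_info -v

# make sure the intended installation comes first
echo $LD_LIBRARY_PATH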

@edgargabriel
Contributor

I reran the code with the changed settings/arguments; the code still finished correctly on our system:

export GPU_AWARE_MPI="amrex.use_gpu_aware_mpi=1"
cd /work1/amd/egabriel/WARPX/bin

/home1/egabriel/OpenMPI/bin/mpirun --mca pml ucx -np 16 -x UCX_LOG_LEVEL=info ./warpx.3d.MPI.HIP.DP.PDP.OPMD.QED inputs_3d.txt ${GPU_AWARE_MPI}

Multiple GPUs are visible to each MPI rank, This may lead to incorrect or suboptimal rank-to-GPU mapping.!
Initializing AMReX (24.01)...
MPI initialized with 16 MPI processes
MPI initialized with thread support level 3
Initializing HIP...
HIP initialized with 16 devices.
AMReX (24.01) initialized
PICSAR (23.09)
WarpX (24.01)

    __        __             __  __
    \ \      / /_ _ _ __ _ __\ \/ /
     \ \ /\ / / _` | '__| '_ \\  /
      \ V  V / (_| | |  | |_) /  \
       \_/\_/ \__,_|_|  | .__/_/\_\
                        |_|

Level 0: dt = 8.687655226e-16 ; dx = 1.875e-06 ; dy = 1.875e-06 ; dz = 2.65625e-07

Grids Summary:
  Level 0   8 grids  262144 cells  100 % of domain
            smallest grid: 32 x 32 x 32  biggest grid: 32 x 32 x 32

-------------------------------------------------------------------------------
--------------------------- MAIN EM PIC PARAMETERS ----------------------------
-------------------------------------------------------------------------------
Precision:            | DOUBLE
Particle precision:   | DOUBLE
Geometry:             | 3D (XYZ)
Operation mode:       | Electromagnetic
                      | - vacuum
-------------------------------------------------------------------------------
Current Deposition:   | Esirkepov
Particle Pusher:      | Boris
Charge Deposition:    | standard
Field Gathering:      | energy-conserving
Particle Shape Factor:| 3
-------------------------------------------------------------------------------
Maxwell Solver:       | Yee
-------------------------------------------------------------------------------
Moving window:        |    ON
                      |  - moving_window_dir = z
                      |  - moving_window_v = 299792458
-------------------------------------------------------------------------------
For full input parameters, see the file: warpx_used_inputs

STEP 1 starts ...


**** WARNINGS ******************************************************************
* GLOBAL warning list  after  [ FIRST STEP ]
*
* --> [!!!] [Performance] [raised 16 times]
*     Too many resources / too little work!
*     It looks like you requested more compute resources than there are total
*     number of boxes of cells available (8). You started with (16) MPI ranks,
*     so (8) rank(s) will have no work.
*     On GPUs, consider using 1-8 boxes per GPU that together fill each GPU's
*     memory sufficiently. If you do not rely on dynamic load-balancing, then
*     one large box per GPU is ideal.
*     Consider decreasing the amr.blocking_factor and amr.max_grid_size
*     parameters and/or using fewer MPI ranks.
*     More information:
*     https://warpx.readthedocs.io/en/latest/usage/workflows/parallelization.html
*     @ Raised by: ALL
*
********************************************************************************

STEP 1 ends. TIME = 8.687655226e-16 DT = 8.687655226e-16
Evolve time = 0.0097388 s; This step = 0.0097388 s; Avg. per step = 0.0097388 s

STEP 2 starts ...
STEP 2 ends. TIME = 1.737531045e-15 DT = 8.687655226e-16
Evolve time = 0.021369073 s; This step = 0.011630273 s; Avg. per step = 0.0106845365 s

STEP 3 starts ...
STEP 3 ends. TIME = 2.606296568e-15 DT = 8.687655226e-16
Evolve time = 0.026867411 s; This step = 0.005498338 s; Avg. per step = 0.008955803667 s

...

STEP 99 starts ...
STEP 99 ends. TIME = 8.600778674e-14 DT = 8.687655226e-16
Evolve time = 0.650830741 s; This step = 0.005815287 s; Avg. per step = 0.006574047889 s

STEP 100 starts ...
--- INFO    : re-sorting particles
STEP 100 ends. TIME = 8.687655226e-14 DT = 8.687655226e-16
Evolve time = 0.658796751 s; This step = 0.00796601 s; Avg. per step = 0.00658796751 s

Total Time                     : 0.740151409
Total GPU global memory (MB) spread across MPI: [32752 ... 32752]
Free  GPU global memory (MB) spread across MPI: [7879 ... 32514]
[The         Arena] space (MB) allocated spread across MPI: [24564 ... 24564]
[The         Arena] space (MB) used      spread across MPI: [0 ... 0]
[The Managed Arena] space (MB) allocated spread across MPI: [8 ... 8]
[The Managed Arena] space (MB) used      spread across MPI: [0 ... 0]
[The  Pinned Arena] space (MB) allocated spread across MPI: [8 ... 8]
[The  Pinned Arena] space (MB) used      spread across MPI: [0 ... 0]
[The   Comms Arena] space (MB) allocated spread across MPI: [8 ... 8]
[The   Comms Arena] space (MB) used      spread across MPI: [0 ... 0]
AMReX (24.01) finalized

@denisbertini
Author

OK, then it is now clear ... thanks!
Preparing the basic OSU benchmarks now ...

@denisbertini
Author

Standard OSU benchmark test:

  • setup: 2 nodes with only 1 MPI rank per node, corresponding to 1 GPU each
  • non-ROCm test: the host-to-host mode works
# OSU MPI Bandwidth Test v7.3
# Size      Bandwidth (MB/s)
# Datatype: MPI_CHAR.
1                       1.88
2                       3.75
4                       7.37
8                      14.63
16                     27.09
32                     56.30
64                    126.23
128                   243.24
256                   492.83
512                   909.62
1024                 1430.86
2048                 2114.66
4096                 2088.23
8192                 5062.87
16384                8186.23
32768                9719.93
65536               10841.10
131072              11071.00
262144              11431.13
524288              12306.31
1048576             11251.29
2097152             11189.57
4194304             11423.66
  • ROCm tests (H D or D D buffer placement) immediately failed with the same errors (see the invocation sketch after the logs):
# OSU MPI-ROCM Bandwidth Test v7.3
# Send Buffer on HOST (H) and Receive Buffer on DEVICE (D)
# Size      Bandwidth (MB/s)
# Datatype: MPI_CHAR.
[1706257302.851919] [lxbk1102:2038794:0]           ib_md.c:309  UCX  ERROR ibv_reg_mr(address=0x7fe35ea00000, length=16, access=0x10000f) failed: Invalid argument
[1706257302.851940] [lxbk1102:2038794:0]          ucp_mm.c:62   UCX  ERROR failed to register address 0x7fe35ea00000 (rocm) length 1 on md[4]=mlx5_0: Input/output error (md supports: host|rocm)
[1706257302.851943] [lxbk1102:2038794:0]     ucp_request.c:555  UCX  ERROR failed to register user buffer datatype 0x8 address 0x7fe35ea00000 len 1: Input/output error
# OSU MPI-ROCM Bandwidth Test v7.3
# Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D)
# Size      Bandwidth (MB/s)
# Datatype: MPI_CHAR.
[1706257305.444694] [lxbk1099:36585:0]           ib_md.c:309  UCX  ERROR ibv_reg_mr(address=0x7ff763a00000, length=16, access=0x10000f) failed: Invalid argument
[1706257305.444712] [lxbk1099:36585:0]          ucp_mm.c:62   UCX  ERROR failed to register address 0x7ff763a00000 (rocm) length 1 on md[4]=mlx5_0: Input/output error (md supports: host|rocm)
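
For reference, the ROCm variants were launched roughly like this (a sketch, not the exact command line; OSU 7.x selects the accelerator with -d and the send/receive buffer placement with positional H/D arguments):

# D D: send and receive buffers in ROCm device memory, 1 rank per node on 2 nodes
srun --export=ALL -N 2 --ntasks-per-node=1 ./osu_bw -d rocm D D

# H D: host send buffer, device receive buffer
srun --export=ALL -N 2 --ntasks-per-node=1 ./osu_bw -d rocm H D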

@edgargabriel
Contributor

OK, so very clearly registration of GPU memory doesn't work on your system, independent of the application. This leads back to the things that I suggested and also mentioned in the email communication:

  • double-check that you have the ib_peer_mem Linux kernel component running on your system if you want to stay with the Linux rdma-core stack (a quick verification sketch follows below)
  • or install MOFED (recommended)

I think these are your two options.
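
To narrow this down on your side, two quick checks (a sketch; ucx_info ships with the UCX installation):

# does UCX report ROCm memory domains/transports at all?
ucx_info -d | grep -i rocm

# per the first option above: is a peer-memory module present on the node?
lsmod | grep -i peer_mem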
