Open UCX 1.15.0 with Open MPI 4.1.1 - osu_iallgather/osu_iallgatherv hangs when the message size reaches 65536 #9731

Open
Tobez123 opened this issue Mar 7, 2024 · 6 comments
@Tobez123

Tobez123 commented Mar 7, 2024

Describe the bug

We are running osu_iallgather/osu_iallgatherv with Open UCX 1.15.0 and Open MPI 4.1.1. When the message size reaches 65536, the program hangs: we waited at least 30 minutes and nothing more was printed.
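A hang like this is easier to report reproducibly when the run is bounded by a time limit instead of being watched by hand. The following is a minimal sketch, assuming GNU coreutils `timeout` is available on the launch node. The helper name `run_bounded` and the 1800-second limit (mirroring the 30-minute wait above) are illustrative and not part of the original report.

```shell
#!/bin/sh
# run_bounded LIMIT_SECONDS CMD [ARGS...]
# Runs CMD under coreutils `timeout`. The command's own stdout is forwarded
# to stderr so that stdout carries only the one-word classification:
#   ok            CMD exited with status 0
#   hang          the time limit expired (coreutils timeout exits with 124)
#   fail:<status> CMD exited with any other status
run_bounded() {
    limit=$1
    shift
    status=0
    timeout "$limit" "$@" >&2 || status=$?
    case $status in
        0)   echo "ok" ;;
        124) echo "hang" ;;
        *)   echo "fail:$status" ;;
    esac
}
```

Usage would look like `run_bounded 1800 mpirun ... osu_iallgather -i 2`, with the `mpirun` arguments taken from the reproduction command in this report.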

Things we have tried

  • add '-x UCX_RC_MLX5_RX_QUEUE_LEN=8191', it works!
  • add '-x UCX_RNDV_THRESH=8192', it also works!
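For reference, the two workarounds can be written out as full command lines. This is a sketch under an assumption: it simply prepends the extra `-x` option to the reproduction command given in this report, whereas the exact command lines used are only visible in the screenshots. The helper name `with_ucx_var` is hypothetical. The script only prints the commands and launches nothing.

```shell
#!/bin/sh
# with_ucx_var NAME=VALUE
# Print the reproduction command with one extra `-x NAME=VALUE` option.
# Assumption: the option is added in front of the options already used
# in the reproduction command.
with_ucx_var() {
    printf 'mpirun -x %s -x UCX_TLS=sm,rc_x -x UCX_NET_DEVICES=mlx5_1:1 ' "$1"
    printf '%s\n' '-np 1024 -N 128 --hostfile hostfile_path -mca pml ucx -mca btl ^vader,tcp,openib,uct osu_iallgather -i 2'
}

with_ucx_var UCX_RC_MLX5_RX_QUEUE_LEN=8191   # workaround 1: receive queue length
with_ucx_var UCX_RNDV_THRESH=8192            # workaround 2: rendezvous threshold
```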

Steps to Reproduce

  • Command line
    mpirun -x UCX_TLS=sm,rc_x -x UCX_NET_DEVICES=mlx5_1:1 -np 1024 -N 128 --hostfile hostfile_path -mca pml ucx -mca btl ^vader,tcp,openib,uct osu_iallgather -i 2

  • UCX version used: 1.15.0

  • UCX configure flags (can be checked by ucx_info -v)

Library version: 1.15.0
Library path: /lib/libucs.so.0
API headers version: 1.15.0
Git branch '', revision
Configured with: --disable-logging --disable-debug --disable-assertions --disable-params-check --enable-optimizations --prefix=/openucx --enable-mt

  • Any UCX environment variables used
    • UCX_TLS=sm,rc_x
    • UCX_NET_DEVICES=mlx5_1:1
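Since both workarounds change a UCX setting, it may help to record the values UCX resolves for them on the affected nodes. A minimal sketch, assuming `ucx_info` from the same UCX installation is in `PATH`; `ucx_info -c` prints the configuration as `NAME=value` lines, and the helper name `filter_ucx_vars` is illustrative:

```shell
#!/bin/sh
# Keep only the two settings changed by the workarounds from `ucx_info -c`
# style input (one NAME=value per line). The trailing `=` in the pattern
# excludes other variables that merely share the same prefix.
filter_ucx_vars() {
    grep -E '^UCX_(RNDV_THRESH|RC_MLX5_RX_QUEUE_LEN)='
}

if command -v ucx_info >/dev/null 2>&1; then
    ucx_info -c | filter_ucx_vars || echo "no matching settings found" >&2
else
    echo "ucx_info not found in PATH" >&2
fi
```

Running it once with and once without the workaround variable exported should show whether the override is picked up.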

Setup and versions

  • OS version (e.g Linux distro)
    • Linux 6426-node125 4.19.90-2112.8.0.0131.oe1.aarch64 #1 SMP Fri Dec 31 19:53:20 UTC 2021 aarch64 aarch64 aarch64 GNU/Linux
  • CPU architecture (x86_64/aarch64/ppc64le/...)
    • aarch64
  • For RDMA/IB/RoCE related issues:
    • Driver version:
      • rdma-core-54mlnx1-1.54303.aarch64
      • MLNX_OFED_LINUX-5.4-3.0.3.0
    • HW information from ibstat or ibv_devinfo -vv command

CA 'mlx5_1'
    CA type: MT4121
    Number of ports: 1
    Firmware version: 16.31.2006
    Hardware version: 0
    Node GUID: 0x98039b030071f6e9
    System image GUID: 0x98039b030071f6e8
    Port 1:
        State: Active
        Physical state: LinkUp
        Rate: 100
        Base lid: 0
        LMC: 0
        SM lid: 0
        Capability mask: 0x00010000
        Port GUID: 0x9a039bfffe71f6e9
        Link layer: Ethernet

Additional information (depending on the issue)

  • OpenMPI version
    • Open MPI 4.1.1
  • OSU version
    • osu-micro-benchmarks-7.1-1
  • Output log
    (screenshots: iallgather, iallgatherv)
Tobez123 added the Bug label Mar 7, 2024
@Tobez123
Author

Tobez123 commented Mar 7, 2024

osu_iallgatherv
add -x UCX_RC_MLX5_RX_QUEUE_LEN=8191
(screenshot: add UCX_RC_MLX5_RX_QUEUE_LEN)

@Tobez123
Author

Tobez123 commented Mar 7, 2024

osu_iallgatherv
add -x UCX_RNDV_THRESH=8192
(screenshot: add UCX_RNDV_THRESH)

@Tobez123
Author

Tobez123 commented Mar 7, 2024

osu_iallgather
add -x UCX_RC_MLX5_RX_QUEUE_LEN=8191
(screenshot: iallgather add UCX_RC_MLX5_RX_QUEUE_LEN)

@Tobez123
Author

Tobez123 commented Mar 7, 2024

osu_iallgather
add -x UCX_RNDV_THRESH=8192
(screenshot: iallgather add UCX_RNDV_THRESH)

@rakhmets
Collaborator

rakhmets commented Mar 8, 2024

Hi,

I noticed that when you set UCX_RNDV_THRESH=8192, you didn't set UCX_TLS=sm,rc_x. My guess is that in the UCX_RNDV_THRESH=8192 case the run succeeded because UCX selected a different transport.

Does the program hang if the command line contains UCX_TLS=sm,rc_x along with UCX_RNDV_THRESH=8192?

mpirun -x UCX_RNDV_THRESH=8192 -x UCX_TLS=sm,rc_x -x UCX_NET_DEVICES=mlx5_1:1 -np 1024 -N 128 --hostfile hostfile_path -mca pml ucx -mca btl ^vader,tcp,openib,uct osu_iallgather -i 2

Does the program hang if the command line doesn't contain UCX_TLS=sm,rc_x?

mpirun -x UCX_NET_DEVICES=mlx5_1:1 -np 1024 -N 128 --hostfile hostfile_path -mca pml ucx -mca btl ^vader,tcp,openib,uct osu_iallgather -i 2

Does the program hang if the command line contains UCX_TLS=sm,rc_x,dc?

mpirun -x UCX_TLS=sm,rc_x,dc -x UCX_NET_DEVICES=mlx5_1:1 -np 1024 -N 128 --hostfile hostfile_path -mca pml ucx -mca btl ^vader,tcp,openib,uct osu_iallgather -i 2
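The three checks above share everything except the leading `-x` options, so they can be generated from one list. A sketch that only prints the command lines; the names `TAIL`, `variant_cmd` and `print_variants` are illustrative and not taken from UCX or Open MPI:

```shell
#!/bin/sh
# Part of the command line that is identical in all three checks.
TAIL='-x UCX_NET_DEVICES=mlx5_1:1 -np 1024 -N 128 --hostfile hostfile_path -mca pml ucx -mca btl ^vader,tcp,openib,uct osu_iallgather -i 2'

# variant_cmd "<leading -x options>"
# Print one full mpirun command line; an empty argument means that
# neither UCX_TLS nor UCX_RNDV_THRESH is set on the command line.
variant_cmd() {
    printf 'mpirun %s%s\n' "${1:+$1 }" "$TAIL"
}

# Print the three variants in the order they are asked about.
print_variants() {
    variant_cmd '-x UCX_RNDV_THRESH=8192 -x UCX_TLS=sm,rc_x'
    variant_cmd ''
    variant_cmd '-x UCX_TLS=sm,rc_x,dc'
}

print_variants
```

Each printed line can be pasted into the shell or fed to `sh -c`, ideally under a time limit so that a hanging variant does not block the remaining ones.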

@Tobez123
Author

Thanks for your reply! The following screenshots show the results of the runs I tried.

  1. contains UCX_TLS=sm,rc_x along with UCX_RNDV_THRESH=8192
    (screenshot: iallgather+UCX_RNDV_THRESH)
  2. doesn't contain UCX_TLS=sm,rc_x
    (screenshot: iallgather)
  3. contains UCX_TLS=sm,rc_x,dc
    (screenshot: iallgather+UCX_TLS sm_rc_x_dc)
