Can we allow for site-specific tuning of OpenMPI? #456

Closed
ocaisa opened this issue Jan 17, 2024 · 8 comments · Fixed by #525 or EESSI/test-suite#142

ocaisa commented Jan 17, 2024

I ran into an issue where a simple mpi4py code would not run on a Magic Castle deployment with EESSI (though it works on the same system with the pilot repository):

[ocaisa@node1 ~]$ cat bcast.py 
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.rank

if rank == 0:
    data = {'a':1,'b':2,'c':3}
else:
    data = None

data = comm.bcast(data, root=0)
print('rank %d : %s'% (rank,data))

[ocaisa@login1 ~]$ module purge
[ocaisa@login1 ~]$ echo $MODULEPATH
/cvmfs/software.eessi.io/versions/2023.06/software/linux/x86_64/amd/zen3/modules/all
[ocaisa@login1 ~]$ module use /cvmfs/pilot.eessi-hpc.org/versions/2021.12/software/linux/x86_64/amd/zen3/modules/all
[ocaisa@login1 ~]$ module load SciPy-bundle/2021.05-foss-2021a
[ocaisa@login1 ~]$ mpirun -n 2 python bcast.py 
rank 0 : {'a': 1, 'b': 2, 'c': 3}
rank 1 : {'a': 1, 'b': 2, 'c': 3}

[ocaisa@login1 ~]$ module purge
[ocaisa@login1 ~]$ module unuse /cvmfs/pilot.eessi-hpc.org/versions/2021.12/software/linux/x86_64/amd/zen3/modules/all
[ocaisa@login1 ~]$ module load mpi4py
[ocaisa@login1 ~]$ module list

Currently Loaded Modules:
  1) GCCcore/12.3.0                  5) libpciaccess/0.17-GCCcore-12.3.0   9) UCX/1.14.1-GCCcore-12.3.0        13) OpenMPI/4.1.5-GCC-12.3.0      17) libffi/3.4.4-GCCcore-12.3.0
  2) GCC/12.3.0                      6) hwloc/2.9.1-GCCcore-12.3.0        10) libfabric/1.18.0-GCCcore-12.3.0  14) gompi/2023a                   18) Python/3.11.3-GCCcore-12.3.0
  3) numactl/2.0.16-GCCcore-12.3.0   7) OpenSSL/1.1                       11) PMIx/4.2.4-GCCcore-12.3.0        15) Tcl/8.6.13-GCCcore-12.3.0     19) mpi4py/3.1.4-gompi-2023a
  4) libxml2/2.11.4-GCCcore-12.3.0   8) libevent/2.1.12-GCCcore-12.3.0    12) UCC/1.2.0-GCCcore-12.3.0         16) SQLite/3.42.0-GCCcore-12.3.0

[ocaisa@login1 ~]$ mpirun -n 2 python bcast.py 
login1.int.jetstream2.hpc-carpentry.org:rank0.python: Failed to get eth0 (unit 1) cpu set
login1.int.jetstream2.hpc-carpentry.org:rank0: PSM3 can't open nic unit: 1 (err=23)
login1.int.jetstream2.hpc-carpentry.org:rank1.python: Failed to get eth0 (unit 1) cpu set
login1.int.jetstream2.hpc-carpentry.org:rank1: PSM3 can't open nic unit: 1 (err=23)
login1.int.jetstream2.hpc-carpentry.org:rank1.python: Failed to get eth0 (unit 1) cpu set
login1.int.jetstream2.hpc-carpentry.org:rank1: PSM3 can't open nic unit: 1 (err=23)
login1.int.jetstream2.hpc-carpentry.org:rank0.python: Failed to get eth0 (unit 1) cpu set
login1.int.jetstream2.hpc-carpentry.org:rank0: PSM3 can't open nic unit: 1 (err=23)
(hanging)

It turns out this issue was already "solved" for an EasyBuild use case, which also resolved things for my case.

It does raise the issue, though, that OpenMPI may need to be configured to work correctly on the host site (and indeed this was also raised in #1). @bartoldeman explained how they account for this in Compute Canada:

the way we solve this (for the soft.computecanada.ca stack) is to set an environment variable RSNT_INTERCONNECT using this logic in lmod:

function get_interconnect()
        local posix = require "posix"
        if posix.stat("/sys/module/opa_vnic","type") == 'directory' then
                return "omnipath"
        elseif posix.stat("/sys/module/ib_core","type") == 'directory' then
                return "infiniband"
        end
        return "ethernet"
end

for "ethernet" we have:

OMPI_MCA_btl='^openib,ofi'
OMPI_MCA_mtl='^ofi'
OMPI_MCA_osc='^ucx'
OMPI_MCA_pml='^ucx'

so libfabric (OFI) isn't used by Open MPI, which eliminates any use of PSM3 as well; it basically forces Open MPI to use the tcp or vader (shm) + self btl with the ob1 pml, with no runtime use of UCX or OFI.
I'm not sure if EESSI still compiles Open MPI with support for openib; if not, the first one could be OMPI_MCA_btl='^ofi'.
for "infiniband" it's:

OMPI_MCA_btl='^openib,ofi'
OMPI_MCA_mtl='^ofi'

to eliminate libfabric as well; Open MPI will use UCX through its priority mechanism.
Lastly for "omnipath" :

OMPI_MCA_btl='^openib'
OMPI_MCA_osc='^ucx'
OMPI_MCA_pml='^ucx'

where we do allow OFI, though the priority mechanism will select the cm pml with the psm2 mtl.
So basically:

  • always exclude openib (the only use case we have for it is DDT, which is why it's compiled in)
  • infiniband excludes libfabric
  • omnipath excludes UCX
  • ethernet excludes both libfabric and UCX

We set the envvars via a configuration file included in the module, specifically with a modluafooter in the easyconfig:

assert(loadfile("/cvmfs/soft.computecanada.ca/config/lmod/openmpi_custom.lua"))("4.1")
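Putting those pieces together, a minimal sketch of what such a site-side Lua snippet could look like when executed from the OpenMPI module (the detection logic is the one quoted above; the structure and the exact file are illustrative, not the actual openmpi_custom.lua):

-- Sketch only: meant to run in modulefile context (e.g. pulled in via a
-- modluafooter), where Lmod's setenv() is available. The values follow the
-- per-interconnect settings described above; the file layout is hypothetical.
local posix = require "posix"

local function get_interconnect()
    if posix.stat("/sys/module/opa_vnic", "type") == "directory" then
        return "omnipath"
    elseif posix.stat("/sys/module/ib_core", "type") == "directory" then
        return "infiniband"
    end
    return "ethernet"
end

local interconnect = get_interconnect()
if interconnect == "ethernet" then
    -- no runtime use of UCX or OFI: tcp or vader (shm) + self btl, ob1 pml
    setenv("OMPI_MCA_btl", "^openib,ofi")
    setenv("OMPI_MCA_mtl", "^ofi")
    setenv("OMPI_MCA_osc", "^ucx")
    setenv("OMPI_MCA_pml", "^ucx")
elseif interconnect == "infiniband" then
    -- exclude libfabric; Open MPI selects UCX via its priority mechanism
    setenv("OMPI_MCA_btl", "^openib,ofi")
    setenv("OMPI_MCA_mtl", "^ofi")
else  -- omnipath
    -- keep OFI so the cm pml with the psm2 mtl gets selected; exclude UCX
    setenv("OMPI_MCA_btl", "^openib")
    setenv("OMPI_MCA_osc", "^ucx")
    setenv("OMPI_MCA_pml", "^ucx")
end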
ocaisa commented Jan 17, 2024

We need something similar, but we also need to allow for other fabrics (like EFA).

boegel commented Jan 18, 2024

This is another reason to let sites provide a script that is sourced along with initializing the EESSI environment, although in this case we should probably try and auto-detect which environment variables we should set?

It's a bit silly that we have to do that, though. Why isn't OpenMPI doing that itself?

boegel added the 2023.06-software.eessi.io and bug labels on Jan 18, 2024
bartoldeman commented:

We could file bugs with Open MPI for some subcases, I presume.
But much has to do with the initialization code and priority mechanisms: sometimes the hardware probe itself messes things up or shows warnings that are confusing to users, and sometimes the priority mechanism isn't what we want.

When Open MPI is compiled with loadable runtime plugins, there's also some benefit in memory use and start-up time from not loading the .so files and the associated libraries they link to (libfabric.so etc.); that can't be achieved otherwise.

casparvl commented Mar 4, 2024

I've run into this with the OSU tests with CUDA support as well:

source /cvmfs/software.eessi.io/versions/2023.06/init/bash
module load OSU-Micro-Benchmarks/7.2-gompi-2023a-CUDA-12.1.1
$ mpirun -np 2 osu_bw -m 4194304 -x 10 -i 1000 -c -d cuda D D
gcn11.local.snellius.surf.nl:rank0.osu_bw: Failed to modify UD QP to INIT on mlx5_0: Invalid argument
gcn11.local.snellius.surf.nl:rank0.osu_bw: Failed to modify UD QP to INIT on mlx5_0: Invalid argument
gcn11.local.snellius.surf.nl:rank0.osu_bw: Failed to modify UD QP to INIT on mlx5_1: Invalid argument
gcn11.local.snellius.surf.nl:rank0.osu_bw: Failed to modify UD QP to INIT on mlx5_1: Invalid argument
gcn11.local.snellius.surf.nl:rank0.osu_bw: Failed to modify UD QP to INIT on mlx5_0: Invalid argument
gcn11.local.snellius.surf.nl:rank0.osu_bw: Failed to modify UD QP to INIT on mlx5_1: Invalid argument
gcn11.local.snellius.surf.nl:rank0.osu_bw: Failed to modify UD QP to INIT on mlx5_0: Invalid argument
gcn11.local.snellius.surf.nl:rank0.osu_bw: Failed to modify UD QP to INIT on mlx5_0: Invalid argument
gcn11.local.snellius.surf.nl:rank0.osu_bw: Failed to modify UD QP to INIT on mlx5_1: Invalid argument
gcn11.local.snellius.surf.nl:rank0.osu_bw: Failed to modify UD QP to INIT on mlx5_1: Invalid argument
gcn15.local.snellius.surf.nl:rank1.osu_bw: Failed to modify UD QP to INIT on mlx5_0: Invalid argument
gcn15.local.snellius.surf.nl:rank1.osu_bw: Failed to modify UD QP to INIT on mlx5_0: Invalid argument
gcn15.local.snellius.surf.nl:rank1.osu_bw: Failed to modify UD QP to INIT on mlx5_1: Invalid argument
gcn15.local.snellius.surf.nl:rank1.osu_bw: Failed to modify UD QP to INIT on mlx5_1: Invalid argument
gcn15.local.snellius.surf.nl:rank1.osu_bw: Failed to modify UD QP to INIT on mlx5_0: Invalid argument
gcn15.local.snellius.surf.nl:rank1.osu_bw: Failed to modify UD QP to INIT on mlx5_1: Invalid argument
gcn15.local.snellius.surf.nl:rank1.osu_bw: Failed to modify UD QP to INIT on mlx5_0: Invalid argument
[1709563752.198867] [gcn11:2419930:0]           ib_md.c:1406 UCX  WARN  IB: ibv_fork_init() was disabled or failed, yet a fork() has been issued.
[1709563752.198878] [gcn11:2419930:0]           ib_md.c:1407 UCX  WARN  IB: data corruption might occur when using registered memory.
gcn15.local.snellius.surf.nl:rank1.osu_bw: Failed to modify UD QP to INIT on mlx5_0: Invalid argument
gcn15.local.snellius.surf.nl:rank1.osu_bw: Failed to modify UD QP to INIT on mlx5_1: Invalid argument
[1709563752.241205] [gcn11:2419930:0]           ib_md.c:1406 UCX  WARN  IB: ibv_fork_init() was disabled or failed, yet a fork() has been issued.
[1709563752.241215] [gcn11:2419930:0]           ib_md.c:1407 UCX  WARN  IB: data corruption might occur when using registered memory.
gcn15.local.snellius.surf.nl:rank1.osu_bw: Failed to modify UD QP to INIT on mlx5_1: Invalid argument
[1709563752.377782] [gcn15:3387467:0]           ib_md.c:1406 UCX  WARN  IB: ibv_fork_init() was disabled or failed, yet a fork() has been issued.
[1709563752.377793] [gcn15:3387467:0]           ib_md.c:1407 UCX  WARN  IB: data corruption might occur when using registered memory.
[1709563752.421707] [gcn15:3387467:0]           ib_md.c:1406 UCX  WARN  IB: ibv_fork_init() was disabled or failed, yet a fork() has been issued.
[1709563752.421717] [gcn15:3387467:0]           ib_md.c:1407 UCX  WARN  IB: data corruption might occur when using registered memory.
# OSU MPI-CUDA Bandwidth Test v7.2
# Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D)
# Size      Bandwidth (MB/s)        Validation
# Datatype: MPI_CHAR.
[1709563752.578672] [gcn15:3387467:0]            sock.c:323  UCX  ERROR   connect(fd=129, dest_addr=XXX.XXX.XXX.XXX:47549) failed: No route to host
[gcn15.local.snellius.surf.nl:3387467] pml_ucx.c:424  Error: ucp_ep_create(proc=0) failed: Destination is unreachable
[1709563752.578766] [gcn11:2419930:0]            sock.c:323  UCX  ERROR   connect(fd=129, dest_addr=XXX.XXX.XXX.XXX:58101) failed: No route to host
[gcn15.local.snellius.surf.nl:3387467] pml_ucx.c:477  Error: Failed to resolve UCX endpoint for rank 0
[gcn11.local.snellius.surf.nl:2419930] pml_ucx.c:424  Error: ucp_ep_create(proc=1) failed: Destination is unreachable
[gcn11.local.snellius.surf.nl:2419930] pml_ucx.c:477  Error: Failed to resolve UCX endpoint for rank 1

There is more than one thing going on here by the way. Setting:

export OMPI_MCA_mtl='^ofi'
export OMPI_MCA_btl='^ofi'

I get fewer errors, but the code still doesn't run:

$ mpirun -np 2 osu_bw -m 4194304 -x 10 -i 1000 -c -d cuda D D
# OSU MPI-CUDA Bandwidth Test v7.2
# Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D)
# Size      Bandwidth (MB/s)        Validation
# Datatype: MPI_CHAR.
[1709563901.194239] [gcn11:2420575:0]            sock.c:323  UCX  ERROR   connect(fd=121, dest_addr=XXX.XXX.XXX.XXX:41161) failed: No route to host
[gcn11.local.snellius.surf.nl:2420575] pml_ucx.c:424  Error: ucp_ep_create(proc=1) failed: Destination is unreachable
[gcn11.local.snellius.surf.nl:2420575] pml_ucx.c:477  Error: Failed to resolve UCX endpoint for rank 1
[1709563901.194189] [gcn15:3387940:0]            sock.c:323  UCX  ERROR   connect(fd=121, dest_addr=XXX.XXX.XXX.XXX:56985) failed: No route to host
[gcn15.local.snellius.surf.nl:3387940] pml_ucx.c:424  Error: ucp_ep_create(proc=0) failed: Destination is unreachable
[gcn15.local.snellius.surf.nl:3387940] pml_ucx.c:477  Error: Failed to resolve UCX endpoint for rank 0

We ran into the same issue in our local module stack, and found that we have to set:

export UCX_TLS='^tcp'

before this thing runs as it should:

$ mpirun -np 2 osu_bw -m 4194304 -x 10 -i 1000 -c -d cuda D D
# OSU MPI-CUDA Bandwidth Test v7.2
# Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D)
# Size      Bandwidth (MB/s)        Validation
# Datatype: MPI_CHAR.
1                       2.27              Pass
2                       4.52              Pass
4                       9.06              Pass
8                      17.96              Pass
16                     35.96              Pass
32                     71.84              Pass
64                    135.83              Pass
128                   274.39              Pass
256                   538.57              Pass
512                  1018.69              Pass
1024                 1897.69              Pass
2048                 3352.07              Pass
4096                 5433.91              Pass
8192                 7222.91              Pass
16384                9498.64              Pass
32768               10453.94              Pass
... etc

@satishskamath is a bit deeper into the details, but the gist of that UCX issue is that it tries to use the TCP interface on the public IP address of our nodes. This fails because the firewall blocks that traffic. Similar to the OpenMPI case above, I guess this could be considered a bug in the initialization and priority mechanisms: maybe it shouldn't have picked TCP to begin with, but even if it does, one would hope that if that initialization fails it would fall back to another mechanism.

I'll ask Satish to report this upstream, but I guess the bottom line from an EESSI point of view is: we will have these system-specific settings that a hosting site might want to change, whether they are unresolved bugs, tuning parameters, or otherwise. I'm wondering what the most convenient way to handle that is. I think we should have a good discussion on this, and then document it.

From Kenneth's comment:

This is another reason to let sites provide a script that is sourced along with initializing the EESSI environment, although in this case we should probably try and auto-detect which environment variables we should set?

A script that gets sourced would allow some global config, but some of this may be specific to which version of e.g. UCX or OpenMPI gets loaded.

I'm no expert in LMOD hooks, but I'm wondering if these shouldn't be the solution here. From what I've seen when implementing our GPU support, I think one can control how specific they are (from "set X for all modules named Y" to "set X for modules named Y if their version is Z"). Aside from resolving bugs, the same mechanism could even facilitate site-specific OpenMPI tuning.
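To make that concrete, here is a minimal sketch of a load hook that matches on module name and version (the matching follows the usual Lmod hook pattern; the action is just a message placeholder, since how to best apply the actual settings is exactly the open question):

local hook = require("Hook")

-- hypothetical site-specific hook, e.g. defined in a site-provided SitePackage.lua
local function site_openmpi_load_hook(t)
    -- t.modFullName is something like "OpenMPI/4.1.5-GCC-12.3.0"
    local name, version = string.match(t.modFullName, "^([^/]+)/(.*)$")
    if name == "OpenMPI" and version ~= nil and version:match("^4%.1%.") then
        -- placeholder: a real hook would apply the site's interconnect-specific
        -- OMPI_MCA_* settings here
        LmodMessage("site hook: applying site-specific Open MPI 4.1.x settings")
    end
end

hook.register("load", site_openmpi_load_hook)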

What do you think? Would LMOD hooks be a good practice? If so, we can use the current use case as an example of how to write such a site-specific LMOD hook, how we should make sure it gets picked up, and document that procedure.

casparvl commented Mar 4, 2024

Just checked our hook. It is defined in the LMOD_RC file:

$ module --config 2>&1 | grep LMOD_RC
LMOD_RC (LMOD_RC)                                             /cvmfs/software.eessi.io/versions/2023.06/software/linux/x86_64/amd/zen2/.lmod/lmodrc.lua

That's where we define the hooks currently:

$ cat /cvmfs/software.eessi.io/versions/2023.06/software/linux/x86_64/amd/zen2/.lmod/lmodrc.lua | grep hook
local hook = require("Hook")
local function cuda_enabled_load_hook(t)
local function openmpi_load_hook(t)
hook.register("load", cuda_enabled_load_hook)
hook.register("load", openmpi_load_hook)

I think there can only be one LMOD_RC file, but we could source something from host_injections in that LMOD_RC file to make it extensible? That would allow host sites to add hooks themselves.
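A rough sketch of what the end of that EESSI-provided file could then look like (the host_injections path is illustrative):

local posix = require "posix"

-- sketch: optionally pull in site-provided Lmod configuration from host_injections
local site_file = "/cvmfs/software.eessi.io/host_injections/lmod/SitePackage.lua"  -- hypothetical path
if posix.stat(site_file, "type") == "regular" then
    dofile(site_file)
end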

casparvl commented Mar 4, 2024

Good find from @ocaisa: https://lmod.readthedocs.io/en/latest/145_properties.html?highlight=lmod_rc#the-properties-file-lmodrc-lua talks about the search order of Lmod RC files. More importantly, it says:

If there are more than one of these files exist then they are merged and not a replacement. So a site can (and should) leave the first file as is and create another one to specify site properties and Lmod will merge the information into one.

That means that configuration can be set at multiple levels, and will be merged. The only thing we should consider is what we want to happen in conflicting situations. The search order is:

  1. /apps/lmod/X.Y.Z/init/lmodrc.lua
  2. /apps/lmod/etc/lmodrc.lua
  3. $LMOD_CONFIG_DIR/lmodrc.lua (default /etc/lmod/lmodrc.lua)
  4. /etc/lmodrc.lua
  5. $HOME/.lmodrc.lua
  6. $LMOD_RC

meaning that if $LMOD_RC sets a hook called cuda_hook, it would probably overwrite hooks called cuda_hook set at earlier levels. Right now, we set $LMOD_RC. We should probably change that to setting $LMOD_CONFIG_DIR instead, so that host sites still have some options to overwrite what we do.

Also important: we should probably prefix anything we do with an eessi_ prefix, to reduce the chance of accidental name collisions. It is fine if a host site intentionally overwrites one of our hooks, but you wouldn't want that to happen unintentionally.
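For instance (names purely illustrative):

local hook = require("Hook")

-- EESSI-provided hook: carries an eessi_ prefix
local function eessi_openmpi_load_hook(t)
    -- EESSI defaults would go here
end
hook.register("load", eessi_openmpi_load_hook)

-- a site-provided hook can then coexist without clashing by accident
local function site_openmpi_load_hook(t)
    -- site-specific overrides would go here
end
hook.register("load", site_openmpi_load_hook)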

casparvl commented Mar 6, 2024

To be kept in mind: hooks should be defined in SitePackage.lua, not in lmodrc. See #491

casparvl commented:

Just to log somewhere: I'm running into the same issue as #456 (comment) on the Karolina HPC cluster as well. The job:

#!/bin/bash
#SBATCH --job-name="rfm_EESSI_QuantumESPRESSO_PW_42db3ef7"
#SBATCH --ntasks=2
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=1
#SBATCH --output=rfm_job.out
#SBATCH --error=rfm_job.err
#SBATCH --time=0:30:0
#SBATCH -p qcpu
#SBATCH -A DD-23-96
#SBATCH --export=None
source /cvmfs/software.eessi.io/versions/2023.06/init/bash
export SLURM_EXPORT_ENV=ALL
export OMPI_MCA_pml=ucx
module load QuantumESPRESSO/7.2-foss-2022b
export OMP_NUM_THREADS=1
wget -q http://pseudopotentials.quantum-espresso.org/upf_files/Si.pbe-n-kjpaw_psl.1.0.0.UPF
mpirun -np 2 pw.x -in Si.scf.in

(A test from this PR to the EESSI test suite.)

And the output file is:

cn355.karolina.it4i.cz:rank1.pw.x: Failed to modify UD QP to INIT on mlx5_0: Operation not permitted
cn355.karolina.it4i.cz:rank1.pw.x: Failed to modify UD QP to INIT on mlx5_0: Operation not permitted
cn354.karolina.it4i.cz:rank0.pw.x: Failed to modify UD QP to INIT on mlx5_0: Operation not permitted
cn354.karolina.it4i.cz:rank0.pw.x: Failed to modify UD QP to INIT on mlx5_0: Operation not permitted

We should contact the Karolina staff (and probably also the Vega staff) to see if they can provision some (permanent) storage for us to host the host_injections stuff, ask them to point the variant symlink for host_injections there (through their CVMFS config), and then do all of this system-specific configuration ourselves.

casparvl pushed a commit to casparvl/test-suite that referenced this issue on May 13, 2024: "…fix it through an LMOD hook in host_injections"