add Lmod hook to set $OMPI_MCA_btl to ^smcuda when loading OpenMPI module to work around bug #473

Conversation

boegel
Contributor

@boegel boegel commented Feb 12, 2024

Temporary workaround for hangs/crashes in OpenMPI caused by a bug, see https://gitlab.com/eessi/support/-/issues/41

Tested, works like a charm, for example:

$ module load foss/2023a
Adding '^smcuda' to $OMPI_MCA_btl to work around bug in OpenMPI (see https://gitlab.com/eessi/support/-/issues/41)

$ echo $OMPI_MCA_btl
^smcuda
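
The real hook is Lua code run by Lmod, and it is not reproduced in this conversation. As a rough illustration of the logic only, here is a minimal bash sketch. The function name `exclude_smcuda_btl` is hypothetical, and how it treats a pre-existing `$OMPI_MCA_btl` value is an assumption: the transcript above only shows the case where the variable was not set.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not the code from this PR): compute a value for
# $OMPI_MCA_btl that keeps the smcuda BTL component out of the selection.
exclude_smcuda_btl() {
    local current="${1:-}"
    case "$current" in
        "")
            # Variable unset or empty: exclude only smcuda (the case shown above).
            printf '%s\n' '^smcuda'
            ;;
        ^*)
            # Already an exclusion list: append smcuda unless it is listed.
            case ",${current#^}," in
                *,smcuda,*) printf '%s\n' "$current" ;;
                *)          printf '%s\n' "${current},smcuda" ;;
            esac
            ;;
        *)
            # Inclusion list: left untouched here (assumption: whoever set an
            # explicit component list wants it kept as-is).
            printf '%s\n' "$current"
            ;;
    esac
}
```

It could be used as `export OMPI_MCA_btl="$(exclude_smcuda_btl "${OMPI_MCA_btl:-}")"`. Open MPI treats a leading `^` as negating the whole comma-separated list, which is why the sketch appends to an existing exclusion list instead of adding a second `^`.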

@boegel boegel added the 2023.06-software.eessi.io label (2023.06 version of software.eessi.io) Feb 12, 2024

eessi-bot-aws bot commented Feb 12, 2024

Instance eessi-bot-mc-aws is configured to build:

  • arch x86_64/generic for repo eessi-hpc.org-2023.06-compat
  • arch x86_64/generic for repo eessi-hpc.org-2023.06-software
  • arch x86_64/generic for repo eessi.io-2023.06-compat
  • arch x86_64/generic for repo eessi.io-2023.06-software
  • arch x86_64/intel/haswell for repo eessi-hpc.org-2023.06-compat
  • arch x86_64/intel/haswell for repo eessi-hpc.org-2023.06-software
  • arch x86_64/intel/haswell for repo eessi.io-2023.06-compat
  • arch x86_64/intel/haswell for repo eessi.io-2023.06-software
  • arch x86_64/intel/skylake_avx512 for repo eessi-hpc.org-2023.06-compat
  • arch x86_64/intel/skylake_avx512 for repo eessi-hpc.org-2023.06-software
  • arch x86_64/intel/skylake_avx512 for repo eessi.io-2023.06-compat
  • arch x86_64/intel/skylake_avx512 for repo eessi.io-2023.06-software
  • arch x86_64/amd/zen2 for repo eessi-hpc.org-2023.06-compat
  • arch x86_64/amd/zen2 for repo eessi-hpc.org-2023.06-software
  • arch x86_64/amd/zen2 for repo eessi.io-2023.06-compat
  • arch x86_64/amd/zen2 for repo eessi.io-2023.06-software
  • arch x86_64/amd/zen3 for repo eessi-hpc.org-2023.06-compat
  • arch x86_64/amd/zen3 for repo eessi-hpc.org-2023.06-software
  • arch x86_64/amd/zen3 for repo eessi.io-2023.06-compat
  • arch x86_64/amd/zen3 for repo eessi.io-2023.06-software
  • arch aarch64/generic for repo eessi-hpc.org-2023.06-compat
  • arch aarch64/generic for repo eessi-hpc.org-2023.06-software
  • arch aarch64/generic for repo eessi.io-2023.06-compat
  • arch aarch64/generic for repo eessi.io-2023.06-software
  • arch aarch64/neoverse_n1 for repo eessi-hpc.org-2023.06-compat
  • arch aarch64/neoverse_n1 for repo eessi-hpc.org-2023.06-software
  • arch aarch64/neoverse_n1 for repo eessi.io-2023.06-compat
  • arch aarch64/neoverse_n1 for repo eessi.io-2023.06-software
  • arch aarch64/neoverse_v1 for repo eessi-hpc.org-2023.06-compat
  • arch aarch64/neoverse_v1 for repo eessi-hpc.org-2023.06-software
  • arch aarch64/neoverse_v1 for repo eessi.io-2023.06-compat
  • arch aarch64/neoverse_v1 for repo eessi.io-2023.06-software

@boegel boegel force-pushed the 2023.06-software.eessi.io_OpenMPI_disable_smcuda branch from c4dadde to 0d03852 Compare February 12, 2024 20:52
@EESSI EESSI deleted a comment from eessi-bot-aws bot Feb 12, 2024
@EESSI EESSI deleted a comment from eessi-bot-aws bot Feb 12, 2024
Contributor Author

boegel commented Feb 12, 2024

bot: build repo:eessi.io-2023.06-software arch:aarch64/neoverse_v1

eessi-bot-aws bot commented Feb 12, 2024

Updates by the bot instance eessi-bot-mc-aws (click for details)

eessi-bot-aws bot commented Feb 12, 2024

New job on instance eessi-bot-mc-aws for architecture aarch64-neoverse_v1 for repository eessi.io-2023.06-software in job dir /project/def-users/SHARED/jobs/2024.02/pr_473/5835

date | job status | comment
Feb 12 20:53:37 UTC 2024 | submitted | job id 5835 awaits release by job manager
Feb 12 20:53:46 UTC 2024 | released | job awaits launch by Slurm scheduler
Feb 12 20:57:05 UTC 2024 | finished | 😁 SUCCESS
  Details
    ✅ job output file slurm-5835.out
    ✅ no message matching ERROR:
    ✅ no message matching FAILED:
    ✅ no message matching required modules missing:
    ✅ found message(s) matching No missing installations
    ✅ found message matching .tar.gz created!
  Artefacts
    eessi-2023.06-software-linux-aarch64-neoverse_v1-1707771408.tar.gz
      size: 0 MiB (162071 bytes)
      entries: 4
      modules under 2023.06/software/linux/aarch64/neoverse_v1/modules/all
        no module files in tarball
      software under 2023.06/software/linux/aarch64/neoverse_v1/software
        no software packages in tarball
      other under 2023.06/software/linux/aarch64/neoverse_v1
        .lmod/cache/spiderT.lua
        .lmod/cache/spiderT.luac_5.1
        .lmod/cache/timestamp
        .lmod/lmodrc.lua
Feb 12 20:57:05 UTC 2024 | test result | (no tests yet)
Feb 13 10:43:08 UTC 2024 | uploaded | transfer of eessi-2023.06-software-linux-aarch64-neoverse_v1-1707771408.tar.gz to S3 bucket succeeded
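
The checks listed in the bot's report can be approximated by hand against a job output file. The following is a minimal bash sketch, not the bot's implementation: the function name `check_job_output` is hypothetical, and the use of fixed-string matching via `grep -F` is an assumption (the bot may well use regular expressions or other logic). Only the five message patterns are taken from the report above.

```shell
#!/usr/bin/env bash
# Hypothetical re-implementation of the checks shown in the bot report.
# Usage: check_job_output slurm-5835.out
check_job_output() {
    local out="$1" status=0 pattern

    # Messages that must NOT appear in the job output.
    for pattern in 'ERROR:' 'FAILED:' 'required modules missing:'; do
        if grep -qF -- "$pattern" "$out"; then
            echo "FAIL: found message matching '$pattern'"
            status=1
        fi
    done

    # Messages that MUST appear in the job output.
    for pattern in 'No missing installations' '.tar.gz created!'; do
        if ! grep -qF -- "$pattern" "$out"; then
            echo "FAIL: no message matching '$pattern'"
            status=1
        fi
    done

    return "$status"
}
```

The function returns 0 only when all five conditions hold, mirroring the five check marks in the report.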

Contributor Author

boegel commented Feb 12, 2024

bot: build repo:eessi.io-2023.06-software arch:x86_64/generic
bot: build repo:eessi.io-2023.06-software arch:x86_64/intel/haswell
bot: build repo:eessi.io-2023.06-software arch:x86_64/intel/skylake_avx512
bot: build repo:eessi.io-2023.06-software arch:x86_64/amd/zen2
bot: build repo:eessi.io-2023.06-software arch:x86_64/amd/zen3
bot: build repo:eessi.io-2023.06-software arch:aarch64/generic
bot: build repo:eessi.io-2023.06-software arch:aarch64/neoverse_n1

eessi-bot-aws bot commented Feb 12, 2024

Updates by the bot instance eessi-bot-mc-aws (click for details)

eessi-bot-aws bot commented Feb 12, 2024

New job on instance eessi-bot-mc-aws for architecture x86_64-generic for repository eessi.io-2023.06-software in job dir /project/def-users/SHARED/jobs/2024.02/pr_473/5836

date | job status | comment
Feb 12 21:29:36 UTC 2024 | submitted | job id 5836 awaits release by job manager
Feb 12 21:30:18 UTC 2024 | released | job awaits launch by Slurm scheduler
Feb 12 21:36:29 UTC 2024 | finished | 😁 SUCCESS
  Details
    ✅ job output file slurm-5836.out
    ✅ no message matching ERROR:
    ✅ no message matching FAILED:
    ✅ no message matching required modules missing:
    ✅ found message(s) matching No missing installations
    ✅ found message matching .tar.gz created!
  Artefacts
    eessi-2023.06-software-linux-x86_64-generic-1707773740.tar.gz
      size: 0 MiB (162292 bytes)
      entries: 4
      modules under 2023.06/software/linux/x86_64/generic/modules/all
        no module files in tarball
      software under 2023.06/software/linux/x86_64/generic/software
        no software packages in tarball
      other under 2023.06/software/linux/x86_64/generic
        .lmod/cache/spiderT.lua
        .lmod/cache/spiderT.luac_5.1
        .lmod/cache/timestamp
        .lmod/lmodrc.lua
Feb 12 21:36:29 UTC 2024 | test result | (no tests yet)
Feb 13 10:43:27 UTC 2024 | uploaded | transfer of eessi-2023.06-software-linux-x86_64-generic-1707773740.tar.gz to S3 bucket succeeded

eessi-bot-aws bot commented Feb 12, 2024

New job on instance eessi-bot-mc-aws for architecture x86_64-intel-haswell for repository eessi.io-2023.06-software in job dir /project/def-users/SHARED/jobs/2024.02/pr_473/5837

date | job status | comment
Feb 12 21:29:39 UTC 2024 | submitted | job id 5837 awaits release by job manager
Feb 12 21:30:20 UTC 2024 | released | job awaits launch by Slurm scheduler
Feb 12 21:35:27 UTC 2024 | finished | 😁 SUCCESS
  Details
    ✅ job output file slurm-5837.out
    ✅ no message matching ERROR:
    ✅ no message matching FAILED:
    ✅ no message matching required modules missing:
    ✅ found message(s) matching No missing installations
    ✅ found message matching .tar.gz created!
  Artefacts
    eessi-2023.06-software-linux-x86_64-intel-haswell-1707773711.tar.gz
      size: 0 MiB (162538 bytes)
      entries: 4
      modules under 2023.06/software/linux/x86_64/intel/haswell/modules/all
        no module files in tarball
      software under 2023.06/software/linux/x86_64/intel/haswell/software
        no software packages in tarball
      other under 2023.06/software/linux/x86_64/intel/haswell
        .lmod/cache/spiderT.lua
        .lmod/cache/spiderT.luac_5.1
        .lmod/cache/timestamp
        .lmod/lmodrc.lua
Feb 12 21:35:27 UTC 2024 | test result | (no tests yet)
Feb 13 10:43:47 UTC 2024 | uploaded | transfer of eessi-2023.06-software-linux-x86_64-intel-haswell-1707773711.tar.gz to S3 bucket succeeded

eessi-bot-aws bot commented Feb 12, 2024

New job on instance eessi-bot-mc-aws for architecture x86_64-intel-skylake_avx512 for repository eessi.io-2023.06-software in job dir /project/def-users/SHARED/jobs/2024.02/pr_473/5838

date | job status | comment
Feb 12 21:29:43 UTC 2024 | submitted | job id 5838 awaits release by job manager
Feb 12 21:30:22 UTC 2024 | released | job awaits launch by Slurm scheduler
Feb 12 21:36:30 UTC 2024 | finished | 😁 SUCCESS
  Details
    ✅ job output file slurm-5838.out
    ✅ no message matching ERROR:
    ✅ no message matching FAILED:
    ✅ no message matching required modules missing:
    ✅ found message(s) matching No missing installations
    ✅ found message matching .tar.gz created!
  Artefacts
    eessi-2023.06-software-linux-x86_64-intel-skylake_avx512-1707773768.tar.gz
      size: 0 MiB (162854 bytes)
      entries: 4
      modules under 2023.06/software/linux/x86_64/intel/skylake_avx512/modules/all
        no module files in tarball
      software under 2023.06/software/linux/x86_64/intel/skylake_avx512/software
        no software packages in tarball
      other under 2023.06/software/linux/x86_64/intel/skylake_avx512
        .lmod/cache/spiderT.lua
        .lmod/cache/spiderT.luac_5.1
        .lmod/cache/timestamp
        .lmod/lmodrc.lua
Feb 12 21:36:30 UTC 2024 | test result | (no tests yet)
Feb 13 10:44:06 UTC 2024 | uploaded | transfer of eessi-2023.06-software-linux-x86_64-intel-skylake_avx512-1707773768.tar.gz to S3 bucket succeeded

eessi-bot-aws bot commented Feb 12, 2024

New job on instance eessi-bot-mc-aws for architecture x86_64-amd-zen2 for repository eessi.io-2023.06-software in job dir /project/def-users/SHARED/jobs/2024.02/pr_473/5839

date | job status | comment
Feb 12 21:29:47 UTC 2024 | submitted | job id 5839 awaits release by job manager
Feb 12 21:30:15 UTC 2024 | released | job awaits launch by Slurm scheduler
Feb 12 21:35:24 UTC 2024 | finished | 😁 SUCCESS
  Details
    ✅ job output file slurm-5839.out
    ✅ no message matching ERROR:
    ✅ no message matching FAILED:
    ✅ no message matching required modules missing:
    ✅ found message(s) matching No missing installations
    ✅ found message matching .tar.gz created!
  Artefacts
    eessi-2023.06-software-linux-x86_64-amd-zen2-1707773684.tar.gz
      size: 0 MiB (162383 bytes)
      entries: 4
      modules under 2023.06/software/linux/x86_64/amd/zen2/modules/all
        no module files in tarball
      software under 2023.06/software/linux/x86_64/amd/zen2/software
        no software packages in tarball
      other under 2023.06/software/linux/x86_64/amd/zen2
        .lmod/cache/spiderT.lua
        .lmod/cache/spiderT.luac_5.1
        .lmod/cache/timestamp
        .lmod/lmodrc.lua
Feb 12 21:35:24 UTC 2024 | test result | (no tests yet)
Feb 13 10:44:25 UTC 2024 | uploaded | transfer of eessi-2023.06-software-linux-x86_64-amd-zen2-1707773684.tar.gz to S3 bucket succeeded

eessi-bot-aws bot commented Feb 12, 2024

New job on instance eessi-bot-mc-aws for architecture x86_64-amd-zen3 for repository eessi.io-2023.06-software in job dir /project/def-users/SHARED/jobs/2024.02/pr_473/5840

date | job status | comment
Feb 12 21:29:50 UTC 2024 | submitted | job id 5840 awaits release by job manager
Feb 12 21:30:16 UTC 2024 | released | job awaits launch by Slurm scheduler
Feb 12 21:35:26 UTC 2024 | finished | 😁 SUCCESS
  Details
    ✅ job output file slurm-5840.out
    ✅ no message matching ERROR:
    ✅ no message matching FAILED:
    ✅ no message matching required modules missing:
    ✅ found message(s) matching No missing installations
    ✅ found message matching .tar.gz created!
  Artefacts
    eessi-2023.06-software-linux-x86_64-amd-zen3-1707773683.tar.gz
      size: 0 MiB (162378 bytes)
      entries: 4
      modules under 2023.06/software/linux/x86_64/amd/zen3/modules/all
        no module files in tarball
      software under 2023.06/software/linux/x86_64/amd/zen3/software
        no software packages in tarball
      other under 2023.06/software/linux/x86_64/amd/zen3
        .lmod/cache/spiderT.lua
        .lmod/cache/spiderT.luac_5.1
        .lmod/cache/timestamp
        .lmod/lmodrc.lua
Feb 12 21:35:26 UTC 2024 | test result | (no tests yet)
Feb 13 10:44:45 UTC 2024 | uploaded | transfer of eessi-2023.06-software-linux-x86_64-amd-zen3-1707773683.tar.gz to S3 bucket succeeded

eessi-bot-aws bot commented Feb 12, 2024

New job on instance eessi-bot-mc-aws for architecture aarch64-generic for repository eessi.io-2023.06-software in job dir /project/def-users/SHARED/jobs/2024.02/pr_473/5841

date | job status | comment
Feb 12 21:29:54 UTC 2024 | submitted | job id 5841 awaits release by job manager
Feb 12 21:30:11 UTC 2024 | released | job awaits launch by Slurm scheduler
Feb 12 22:36:37 UTC 2024 | running | job 5841 is running
Feb 12 22:37:38 UTC 2024 | finished | 😁 SUCCESS
  Details
    ✅ job output file slurm-5841.out
    ✅ no message matching ERROR:
    ✅ no message matching FAILED:
    ✅ no message matching required modules missing:
    ✅ found message(s) matching No missing installations
    ✅ found message matching .tar.gz created!
  Artefacts
    eessi-2023.06-software-linux-aarch64-generic-1707777406.tar.gz
      size: 0 MiB (162055 bytes)
      entries: 4
      modules under 2023.06/software/linux/aarch64/generic/modules/all
        no module files in tarball
      software under 2023.06/software/linux/aarch64/generic/software
        no software packages in tarball
      other under 2023.06/software/linux/aarch64/generic
        .lmod/cache/spiderT.lua
        .lmod/cache/spiderT.luac_5.1
        .lmod/cache/timestamp
        .lmod/lmodrc.lua
Feb 12 22:37:38 UTC 2024 | test result | (no tests yet)
Feb 13 10:45:04 UTC 2024 | uploaded | transfer of eessi-2023.06-software-linux-aarch64-generic-1707777406.tar.gz to S3 bucket succeeded

eessi-bot-aws bot commented Feb 12, 2024

New job on instance eessi-bot-mc-aws for architecture aarch64-neoverse_n1 for repository eessi.io-2023.06-software in job dir /project/def-users/SHARED/jobs/2024.02/pr_473/5842

date | job status | comment
Feb 12 21:29:58 UTC 2024 | submitted | job id 5842 awaits release by job manager
Feb 12 21:30:13 UTC 2024 | released | job awaits launch by Slurm scheduler
Feb 12 21:35:23 UTC 2024 | finished | 😁 SUCCESS
  Details
    ✅ job output file slurm-5842.out
    ✅ no message matching ERROR:
    ✅ no message matching FAILED:
    ✅ no message matching required modules missing:
    ✅ found message(s) matching No missing installations
    ✅ found message matching .tar.gz created!
  Artefacts
    eessi-2023.06-software-linux-aarch64-neoverse_n1-1707773691.tar.gz
      size: 0 MiB (162081 bytes)
      entries: 4
      modules under 2023.06/software/linux/aarch64/neoverse_n1/modules/all
        no module files in tarball
      software under 2023.06/software/linux/aarch64/neoverse_n1/software
        no software packages in tarball
      other under 2023.06/software/linux/aarch64/neoverse_n1
        .lmod/cache/spiderT.lua
        .lmod/cache/spiderT.luac_5.1
        .lmod/cache/timestamp
        .lmod/lmodrc.lua
Feb 12 21:35:23 UTC 2024 | test result | (no tests yet)
Feb 13 10:45:24 UTC 2024 | uploaded | transfer of eessi-2023.06-software-linux-aarch64-neoverse_n1-1707773691.tar.gz to S3 bucket succeeded

Collaborator

@bedroge bedroge left a comment

Lgtm

@bedroge bedroge added the bot:deploy label (Ask bot to deploy missing software installations to EESSI) Feb 13, 2024
@boegel boegel changed the title from "add Lmod hook to set $OMPI_MCA_btl to '^smcuda' when loading OpenMPI module" to "add Lmod hook to set $OMPI_MCA_btl to ^smcuda when loading OpenMPI module to work around bug" Feb 13, 2024
Collaborator

bedroge commented Feb 13, 2024

The staging PRs have been merged, tarballs have been injected, and this is now available on the Stratum servers.

@bedroge bedroge merged commit 3c00a95 into EESSI:2023.06-software.eessi.io Feb 13, 2024
33 checks passed
@boegel boegel deleted the 2023.06-software.eessi.io_OpenMPI_disable_smcuda branch February 13, 2024 12:54
@boegel boegel mentioned this pull request Feb 13, 2024
5 tasks
Member

Neves-P commented Feb 14, 2024

Following up on #404 (comment), which still hangs: it appears the Lmod hook is not available to the bot. @bedroge found that $LMOD_RC, which points Lmod at the hook, gets set in init/eessi_environment_variables, which the bot doesn't source (it sources init/minimal_eessi_env instead). We want the hook to apply in the build environment as well as when the module is loaded. Should we also set this in init/minimal_eessi_env, or do you think a different approach is better?

Contributor Author

boegel commented Feb 14, 2024

@Neves-P I don't mind moving the setting of $LMOD_RC to minimal_eessi_env, but since that requires $EESSI_SOFTWARE_PATH, I don't think that's going to be so easy.

Do note that eessi_environment_variables is being sourced in EESSI-install-software.sh, so that means that $LMOD_RC should be set correctly by the time that EasyBuild is being run, no?
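
One way to settle whether `$LMOD_RC` is actually set in the bot's build environment would be to run a small check just before EasyBuild starts. The following is a minimal bash sketch under stated assumptions: the function name `check_lmod_rc` is hypothetical and not part of EESSI, and it assumes `$LMOD_RC` holds a single file path.

```shell
#!/usr/bin/env bash
# Hypothetical diagnostic (not part of EESSI): report whether $LMOD_RC is set
# and points at a readable file. Assumes $LMOD_RC holds a single path.
check_lmod_rc() {
    if [ -z "${LMOD_RC:-}" ]; then
        echo "LMOD_RC is not set"
        return 1
    fi
    if [ ! -r "$LMOD_RC" ]; then
        echo "LMOD_RC is set to '$LMOD_RC', but that file is not readable"
        return 1
    fi
    echo "LMOD_RC points at readable file '$LMOD_RC'"
}
```

A non-zero return from `check_lmod_rc` at that point would suggest that Lmod never sees the hook during the build, which would match the behaviour reported above.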
