{2023.06}[system] cuDNN/8.9.2.26-CUDA-12.1.1 #581

Open
wants to merge 10 commits into
base: 2023.06-software.eessi.io

Conversation

trz42
Collaborator

@trz42 trz42 commented May 17, 2024

Attempt to add cuDNN, which is a dependency of other packages such as TensorFlow and PyTorch.

Major additions/changes:

  • scripts/gpu_support/nvidia/install_cuda_and_libraries.sh with
    scripts/gpu_support/nvidia/eessi-2023.06-cuda-and-libraries.yml
    • script to install CUDA and cuDNN packages under .../host_injections
  • EESSI-install-software.sh
    • use scripts/gpu_support/nvidia/install_cuda_and_libraries.sh with
      scripts/gpu_support/nvidia/eessi-2023.06-cuda-and-libraries.yml to install CUDA, cuDNN under .../host_injections
  • eb_hooks.py
    • move the code that iterates over all files, replacing non-distributable ones with
      symlinks into host_injections, into a common function
      (replace_non_distributable_files_with_symlinks)
    • additional post_sanitycheck_hook which replaces files with symlinks into corresponding paths under .../host_injections for all files that cannot be redistributed
    • demote the dependency on cuDNN to a build dependency (see inject_gpu_property)
  • create_lmodsitepackage.py
    • consolidate eessi_{cuda,cudnn}_enabled_load_hook functions in a single one
      (eessi_cuda_and_libraries_enabled_load_hook)
    • the remaining hook is prepared to easily add new modules, e.g., cuTENSOR
  • install_scripts.sh
    • add files to copy to CVMFS (see nvidia_files)
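The symlink replacement described for eb_hooks.py can be sketched roughly as follows. This is a minimal illustration and not the code from this PR: the function name `replace_with_symlinks`, the `keep` predicate, and the directory layout used in `demo` are assumptions made up for the example. Only the substitution of `versions` by `host_injections` in the path is taken from the hook discussed in the review thread.

```python
import os
import tempfile


def replace_with_symlinks(installdir, keep):
    """Replace regular files rejected by `keep` with symlinks into host_injections.

    `keep` is a hypothetical predicate: it receives a file name and returns
    True when the file may stay in place (i.e. may be redistributed).
    """
    replaced = []
    for dir_path, _, files in os.walk(installdir):
        for filename in files:
            full_path = os.path.join(dir_path, filename)
            # only real files matter; existing symlinks are left untouched
            if os.path.islink(full_path) or keep(filename):
                continue
            target = full_path.replace("versions", "host_injections")
            # a link pointing at itself would be useless, so refuse to create it
            if target == full_path:
                raise ValueError("source and target are identical: %s" % full_path)
            os.remove(full_path)
            os.symlink(target, full_path)
            replaced.append(full_path)
    return sorted(replaced)


def demo():
    """Run the sketch on a throwaway directory tree (the layout is made up)."""
    root = tempfile.mkdtemp()
    libdir = os.path.join(root, "versions", "2023.06", "software", "cuDNN", "lib")
    os.makedirs(libdir)
    for name in ("LICENSE", "libcudnn.so"):
        with open(os.path.join(libdir, name), "w") as handle:
            handle.write("placeholder\n")
    replaced = replace_with_symlinks(os.path.join(root, "versions"),
                                     lambda name: name == "LICENSE")
    return libdir, replaced
```

Calling `demo()` leaves `LICENSE` as a regular file and turns `libcudnn.so` into a (dangling) symlink whose target lies under the sibling `host_injections` tree, which is where the install script is expected to place the real file.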

@trz42 trz42 added 2023.06-software.eessi.io 2023.06 version of software.eessi.io gpu labels May 17, 2024

eessi-bot-aws bot commented May 17, 2024

Instance eessi-bot-mc-aws is configured to build:

  • arch x86_64/generic for repo eessi-hpc.org-2023.06-compat
  • arch x86_64/generic for repo eessi-hpc.org-2023.06-software
  • arch x86_64/generic for repo eessi.io-2023.06-compat
  • arch x86_64/generic for repo eessi.io-2023.06-software
  • arch x86_64/intel/haswell for repo eessi-hpc.org-2023.06-compat
  • arch x86_64/intel/haswell for repo eessi-hpc.org-2023.06-software
  • arch x86_64/intel/haswell for repo eessi.io-2023.06-compat
  • arch x86_64/intel/haswell for repo eessi.io-2023.06-software
  • arch x86_64/intel/skylake_avx512 for repo eessi-hpc.org-2023.06-compat
  • arch x86_64/intel/skylake_avx512 for repo eessi-hpc.org-2023.06-software
  • arch x86_64/intel/skylake_avx512 for repo eessi.io-2023.06-compat
  • arch x86_64/intel/skylake_avx512 for repo eessi.io-2023.06-software
  • arch x86_64/amd/zen2 for repo eessi-hpc.org-2023.06-compat
  • arch x86_64/amd/zen2 for repo eessi-hpc.org-2023.06-software
  • arch x86_64/amd/zen2 for repo eessi.io-2023.06-compat
  • arch x86_64/amd/zen2 for repo eessi.io-2023.06-software
  • arch x86_64/amd/zen3 for repo eessi-hpc.org-2023.06-compat
  • arch x86_64/amd/zen3 for repo eessi-hpc.org-2023.06-software
  • arch x86_64/amd/zen3 for repo eessi.io-2023.06-compat
  • arch x86_64/amd/zen3 for repo eessi.io-2023.06-software
  • arch aarch64/generic for repo eessi-hpc.org-2023.06-compat
  • arch aarch64/generic for repo eessi-hpc.org-2023.06-software
  • arch aarch64/generic for repo eessi.io-2023.06-compat
  • arch aarch64/generic for repo eessi.io-2023.06-software
  • arch aarch64/neoverse_n1 for repo eessi-hpc.org-2023.06-compat
  • arch aarch64/neoverse_n1 for repo eessi-hpc.org-2023.06-software
  • arch aarch64/neoverse_n1 for repo eessi.io-2023.06-compat
  • arch aarch64/neoverse_n1 for repo eessi.io-2023.06-software
  • arch aarch64/neoverse_v1 for repo eessi-hpc.org-2023.06-compat
  • arch aarch64/neoverse_v1 for repo eessi-hpc.org-2023.06-software
  • arch aarch64/neoverse_v1 for repo eessi.io-2023.06-compat
  • arch aarch64/neoverse_v1 for repo eessi.io-2023.06-software

eessi-bot-aws bot commented May 17, 2024

Instance eessi-bot-mc-azure is configured to build:

  • arch x86_64/amd/zen4 for repo eessi-hpc.org-2023.06-compat
  • arch x86_64/amd/zen4 for repo eessi-hpc.org-2023.06-software
  • arch x86_64/amd/zen4 for repo eessi.io-2023.06-compat
  • arch x86_64/amd/zen4 for repo eessi.io-2023.06-software

@trz42
Collaborator Author

trz42 commented May 17, 2024

bot: build inst:aws repo:eessi.io-2023.06-software arch:x86_64/amd/zen2

eessi-bot-aws bot commented May 17, 2024

Updates by the bot instance eessi-bot-mc-aws (click for details)
  • received bot command build inst:aws repo:eessi.io-2023.06-software arch:x86_64/amd/zen2 from trz42

    • expanded format: build instance:aws repository:eessi.io-2023.06-software architecture:x86_64/amd/zen2
  • handling command build instance:aws repository:eessi.io-2023.06-software architecture:x86_64/amd/zen2 resulted in:

eessi-bot-aws bot commented May 17, 2024

Updates by the bot instance eessi-bot-mc-azure (click for details)
  • received bot command build inst:aws repo:eessi.io-2023.06-software arch:x86_64/amd/zen2 from trz42

    • expanded format: build instance:aws repository:eessi.io-2023.06-software architecture:x86_64/amd/zen2
  • handling command build instance:aws repository:eessi.io-2023.06-software architecture:x86_64/amd/zen2 resulted in:

    • no jobs were submitted
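The relation between the abbreviated command and the "expanded format" echoed by the bot can be modelled with a few lines of Python. This is only a sketch based on the two strings visible in this thread; the name `expand_bot_command` and the `ABBREVIATIONS` table are hypothetical and do not describe how the bot is actually implemented.

```python
# Abbreviations as observed in this thread; the real bot may accept others.
ABBREVIATIONS = {"inst": "instance", "repo": "repository", "arch": "architecture"}


def expand_bot_command(line):
    """Expand 'bot: build inst:aws ...' into the long form echoed by the bot."""
    prefix, _, command = line.partition(":")
    if prefix.strip() != "bot":
        raise ValueError("not a bot command: %r" % line)
    # the first word is the action, the rest are key:value filters
    action, *filters = command.split()
    expanded = [action]
    for item in filters:
        key, _, value = item.partition(":")
        expanded.append("%s:%s" % (ABBREVIATIONS.get(key, key), value))
    return " ".join(expanded)
```

Applied to the command used in this PR, the function returns the same string as the bot's "expanded format" line.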

eessi-bot-aws bot commented May 17, 2024

New job on instance eessi-bot-mc-aws for architecture x86_64-amd-zen2 for repository eessi.io-2023.06-software in job dir /project/def-users/SHARED/jobs/2024.05/pr_581/10940

date | job status | comment
May 17 09:26:27 UTC 2024 | submitted | job id 10940 awaits release by job manager
May 17 09:27:22 UTC 2024 | released | job awaits launch by Slurm scheduler
May 17 09:32:24 UTC 2024 | running | job 10940 is running
May 17 09:40:32 UTC 2024 | finished | 😢 FAILURE (click triangle for details)
Details
✅ job output file slurm-10940.out
❌ found message matching ERROR:
✅ no message matching FAILED:
✅ no message matching required modules missing:
✅ found message(s) matching No missing installations
✅ found message matching .tar.gz created!
Artefacts
eessi-2023.06-software-linux-x86_64-amd-zen2-1715938433.tar.gz
size: 698 MiB (732495131 bytes)
entries: 74
modules under 2023.06/software/linux/x86_64/amd/zen2/modules/all
cuDNN/8.9.2.26-CUDA-12.1.1.lua
software under 2023.06/software/linux/x86_64/amd/zen2/software
cuDNN/8.9.2.26-CUDA-12.1.1
other under 2023.06/software/linux/x86_64/amd/zen2
2023.06/init/easybuild/eb_hooks.py
2023.06/scripts/gpu_support/nvidia/install_cudnn_host_injections.sh
.lmod/SitePackage.lua
May 17 09:40:32 UTC 2024 | test result | 😁 SUCCESS (click triangle for details)
ReFrame Summary
[ PASSED ] Ran 10/10 test case(s) from 10 check(s) (0 failure(s), 0 skipped, 0 aborted)
Details
✅ job output file slurm-10940.out
❌ found message matching ERROR:
✅ no message matching [\s*FAILED\s*].*Ran .* test case

eb_hooks.py Outdated
Comment on lines 663 to 688
        # iterate over all files in the CUDA installation directory
        for dir_path, _, files in os.walk(self.installdir):
            for filename in files:
                full_path = os.path.join(dir_path, filename)
                # we only really care about real files, i.e. not symlinks
                if not os.path.islink(full_path):
                    # check if the current file is part of the allowlist
                    basename = filename.split('.')[0]
                    if '.' in filename:
                        extension = '.' + filename.split('.')[1]
                    if basename in allowlist:
                        self.log.debug("%s is found in allowlist, so keeping it: %s", basename, full_path)
                    elif '.' in filename and extension in allowlist:
                        self.log.debug("%s is found in allowlist, so keeping it: %s", extension, full_path)
                    else:
                        self.log.debug("%s is not found in allowlist, so replacing it with symlink: %s",
                                       filename, full_path)
                        # if it is not in the allowlist, delete the file and create a symlink to host_injections
                        host_inj_path = full_path.replace('versions', 'host_injections')
                        # make sure source and target of symlink are not the same
                        if full_path == host_inj_path:
                            raise EasyBuildError("Source (%s) and target (%s) are the same location, are you sure you "
                                                 "are using this hook for a NESSI installation?",
                                                 full_path, host_inj_path)
                        remove_file(full_path)
                        symlink(host_inj_path, full_path)
Member

Isn't this identical to what is done for CUDA? We should probably just create a function that takes the installdir and allowlist as arguments and does this.

Collaborator Author

Good point.

Collaborator Author

@trz42 trz42 May 17, 2024

Actually there are subtle differences. For CUDA, the EULA/README lists files you can distribute. For cuDNN the LICENSE lists what type of files you can distribute. These differences require small modifications. For example, in the hook for CUDA we have:

                    basename = filename.split('.')[0]
                    if basename in allowlist:
                        self.log.debug("%s is found in allowlist, so keeping it: %s", basename, full_path)
                    else:
                        self.log.debug("%s is not found in allowlist, so replacing it with symlink: %s",
                                       basename, full_path)

For cuDNN, we have

                    basename = filename.split('.')[0]
                    if '.' in filename:
                        extension = '.' + filename.split('.')[1]
                    if basename in allowlist:
                        self.log.debug("%s is found in allowlist, so keeping it: %s", basename, full_path)
                    elif '.' in filename and extension in allowlist:
                        self.log.debug("%s is found in allowlist, so keeping it: %s", extension, full_path)
                    else:
                        self.log.debug("%s is not found in allowlist, so replacing it with symlink: %s",
                                       filename, full_path)

Anyhow, the differences are relatively small, so a function would require a parameter that allows it to distinguish between CUDA and cuDNN (and in the future maybe other packages such as cuTENSOR).
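One way to fold both variants into a single helper is a flag that switches extension matching on. The sketch below is an assumption about how such a parameter could look, not the function that was eventually committed; the name `keep_file`, the `match_extension` flag, and the allowlist entries in the usage note are made up for illustration and are not taken from either license.

```python
def keep_file(filename, allowlist, match_extension=False):
    """Return True if `filename` is covered by `allowlist` and may stay in place."""
    parts = filename.split(".")
    # CUDA-style allowlists name files, so compare the part before the first dot
    if parts[0] in allowlist:
        return True
    # cuDNN-style allowlists name file types, so also try the second entry
    if match_extension and len(parts) > 1:
        return "." + parts[1] in allowlist
    return False
```

With a hypothetical allowlist `{"LICENSE", ".h"}`, `keep_file("cudnn.h", allowlist)` is false, while `keep_file("cudnn.h", allowlist, match_extension=True)` is true; a file such as `libcudnn.so.8` is rejected in both modes.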

Member

For the extension part, perhaps we should split on all . and look for the last non-numeric entry (which should be the extension)? I can imagine there could be files like libcuda.so.520.12.1

Member

Scratch that, you already have a good solution, you are taking the second entry which is virtually guaranteed to be the extension
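For comparison, the two ways of picking the extension mentioned in this exchange can be written side by side. Both helper names are hypothetical, and `libfoo-1.2.so` is an invented file name used only to show where the two approaches diverge.

```python
def second_entry(filename):
    """Extension as the second dot-separated entry (the approach in the hook)."""
    parts = filename.split(".")
    return "." + parts[1] if len(parts) > 1 else None


def last_non_numeric(filename):
    """Extension as the last dot-separated entry that is not purely numeric."""
    parts = filename.split(".")
    # skip the base name, then walk backwards past version components
    for part in reversed(parts[1:]):
        if not part.isdigit():
            return "." + part
    return None
```

For `libcuda.so.520.12.1` both helpers return `.so`. They only disagree when the base name itself contains a dot followed by digits: for `libfoo-1.2.so`, `second_entry` gives `.2` whereas `last_non_numeric` gives `.so`.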

Collaborator Author

Added a function that implements the suggestion in 74a9a55

eb_hooks.py Outdated
ec_dict['builddependencies'].append(dep)
value = '\n'.join([value, 'setenv("EESSICUDNNVERSION","%s")' % cudnn_version])
if key in ec_dict:
    if not value in ec_dict[key]:
Member

This check is probably no longer good enough, we're looking for the exact string, but that is not likely to exist (even though the add_property("arch","gpu") most likely does exist since the applications also should have a CUDA dep). What we really need to do is

  • Grab what is there already
  • Split it on \n
  • Add any missing elements
  • Put it back together again and replace it

Either this, or only modify/add the modluafooter once in the entire function
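The four steps listed above could look roughly like this. It is a sketch under the assumption that the footer is a plain newline-separated string; `merge_footer` is a hypothetical helper name, not a function from eb_hooks.py.

```python
def merge_footer(existing, wanted):
    """Append each line of `wanted` that is not already a line of `existing`."""
    # grab what is there already and split it on newlines
    lines = existing.split("\n") if existing else []
    # add any missing elements, keeping the original order
    for line in wanted:
        if line not in lines:
            lines.append(line)
    # put it back together again
    return "\n".join(lines)
```

Because the comparison is done line by line, calling the helper twice with the same `wanted` list leaves the footer unchanged, so a hook could apply it once for the CUDA part and once for the cuDNN part without duplicating `add_property("arch","gpu")`.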

Collaborator Author

You mean the if not value in ec_dict[key] is not good enough?

Member

@ocaisa ocaisa May 17, 2024

No, because the value is a composite string of property and the setenv, and the property will already (very likely) exist from the CUDA part of this hook

Collaborator Author

The module file for cuDNN contains the following

-- Built with EasyBuild version 4.9.1

add_property("arch","gpu")
setenv("EESSICUDAVERSION","12.1.1")

For something that builds on top of cuDNN, we would have the above and something like

setenv("EESSICUDNNVERSION","8.9.2.26")

Member

@ocaisa ocaisa May 17, 2024

As currently implemented, for something that builds on top of cuDNN I believe you will have

-- Built with EasyBuild version 4.9.1

add_property("arch","gpu")
setenv("EESSICUDAVERSION","12.1.1")
add_property("arch","gpu")
setenv("EESSICUDNNVERSION","8.9.2.26")

as it will see if the entire string add_property("arch","gpu")\nsetenv("EESSICUDNNVERSION","8.9.2.26") is in the footer
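The duplication can be reproduced in isolation. The snippet assumes that the hook appends `value` to the footer whenever the membership test fails; the append itself is not visible in the excerpt under review, so `append_if_missing` is a stand-in written for this illustration.

```python
def append_if_missing(footer, value):
    """Mimic the check under review: append `value` unless it occurs verbatim."""
    if value not in footer:
        footer = "\n".join([footer, value])
    return footer


# footer as produced by the CUDA part of the hook
cuda_footer = 'add_property("arch","gpu")\nsetenv("EESSICUDAVERSION","12.1.1")'
# composite string built by the cuDNN part
cudnn_value = 'add_property("arch","gpu")\nsetenv("EESSICUDNNVERSION","8.9.2.26")'

result = append_if_missing(cuda_footer, cudnn_value)
duplicates = result.count('add_property("arch","gpu")')
```

The composite string never occurs verbatim in the CUDA footer because the `setenv` line sits between the two pieces, so the test fails and `add_property("arch","gpu")` ends up in `result` twice.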

Collaborator Author

Ack. Working on something to implement the desired footer (and avoiding duplication of code).

Collaborator Author

Updated the function. It still produces the same footer for cuDNN. I guess a real test would be a build that uses cuDNN. @ocaisa can you check if the function looks better now?

@trz42
Collaborator Author

trz42 commented May 17, 2024

Retry after fixing args to cuDNN install script...

bot: build inst:aws repo:eessi.io-2023.06-software arch:x86_64/amd/zen2

eessi-bot-aws bot commented May 17, 2024

Updates by the bot instance eessi-bot-mc-aws (click for details)
  • received bot command build inst:aws repo:eessi.io-2023.06-software arch:x86_64/amd/zen2 from trz42

    • expanded format: build instance:aws repository:eessi.io-2023.06-software architecture:x86_64/amd/zen2
  • handling command build instance:aws repository:eessi.io-2023.06-software architecture:x86_64/amd/zen2 resulted in:

eessi-bot-aws bot commented May 17, 2024

Updates by the bot instance eessi-bot-mc-azure (click for details)
  • received bot command build inst:aws repo:eessi.io-2023.06-software arch:x86_64/amd/zen2 from trz42

    • expanded format: build instance:aws repository:eessi.io-2023.06-software architecture:x86_64/amd/zen2
  • handling command build instance:aws repository:eessi.io-2023.06-software architecture:x86_64/amd/zen2 resulted in:

    • no jobs were submitted

eessi-bot-aws bot commented May 17, 2024

New job on instance eessi-bot-mc-aws for architecture x86_64-amd-zen2 for repository eessi.io-2023.06-software in job dir /project/def-users/SHARED/jobs/2024.05/pr_581/10941

date | job status | comment
May 17 10:45:01 UTC 2024 | submitted | job id 10941 awaits release by job manager
May 17 10:45:40 UTC 2024 | released | job awaits launch by Slurm scheduler
May 17 10:49:42 UTC 2024 | running | job 10941 is running
May 17 10:59:52 UTC 2024 | finished | 😁 SUCCESS (click triangle for details)
Details
✅ job output file slurm-10941.out
✅ no message matching ERROR:
✅ no message matching FAILED:
✅ no message matching required modules missing:
✅ found message(s) matching No missing installations
✅ found message matching .tar.gz created!
Artefacts
eessi-2023.06-software-linux-x86_64-amd-zen2-1715943174.tar.gz
size: 698 MiB (732493432 bytes)
entries: 74
modules under 2023.06/software/linux/x86_64/amd/zen2/modules/all
cuDNN/8.9.2.26-CUDA-12.1.1.lua
software under 2023.06/software/linux/x86_64/amd/zen2/software
cuDNN/8.9.2.26-CUDA-12.1.1
other under 2023.06/software/linux/x86_64/amd/zen2
2023.06/init/easybuild/eb_hooks.py
2023.06/scripts/gpu_support/nvidia/install_cudnn_host_injections.sh
.lmod/SitePackage.lua
May 17 10:59:52 UTC 2024 | test result | 😁 SUCCESS (click triangle for details)
ReFrame Summary
[ PASSED ] Ran 10/10 test case(s) from 10 check(s) (0 failure(s), 0 skipped, 0 aborted)
Details
✅ job output file slurm-10941.out
✅ no message matching ERROR:
✅ no message matching [\s*FAILED\s*].*Ran .* test case

@ocaisa
Member

ocaisa commented May 17, 2024

@trz42 The installation looks suspiciously large at 700MB, are you sure your hook is cleaning out the files it should?

@trz42
Collaborator Author

trz42 commented May 17, 2024

> @trz42 The installation looks suspiciously large at 700MB, are you sure your hook is cleaning out the files it should?

Full package is 1.4 GB.

@trz42
Collaborator Author

trz42 commented May 17, 2024

Rebuild after changing hook function that handles dependencies and creates modluafooter entries...

bot: build inst:aws repo:eessi.io-2023.06-software arch:x86_64/amd/zen2

eessi-bot-aws bot commented May 17, 2024

Updates by the bot instance eessi-bot-mc-aws (click for details)
  • received bot command build inst:aws repo:eessi.io-2023.06-software arch:x86_64/amd/zen2 from trz42

    • expanded format: build instance:aws repository:eessi.io-2023.06-software architecture:x86_64/amd/zen2
  • handling command build instance:aws repository:eessi.io-2023.06-software architecture:x86_64/amd/zen2 resulted in:

eessi-bot-aws bot commented May 17, 2024

Updates by the bot instance eessi-bot-mc-azure (click for details)
  • received bot command build inst:aws repo:eessi.io-2023.06-software arch:x86_64/amd/zen2 from trz42

    • expanded format: build instance:aws repository:eessi.io-2023.06-software architecture:x86_64/amd/zen2
  • handling command build instance:aws repository:eessi.io-2023.06-software architecture:x86_64/amd/zen2 resulted in:

    • no jobs were submitted

eessi-bot-aws bot commented May 17, 2024

New job on instance eessi-bot-mc-aws for architecture x86_64-amd-zen2 for repository eessi.io-2023.06-software in job dir /project/def-users/SHARED/jobs/2024.05/pr_581/10942

date | job status | comment
May 17 12:54:38 UTC 2024 | submitted | job id 10942 awaits release by job manager
May 17 12:55:03 UTC 2024 | released | job awaits launch by Slurm scheduler
May 17 13:00:06 UTC 2024 | running | job 10942 is running
May 17 13:05:11 UTC 2024 | finished | 😢 FAILURE (click triangle for details)
Details
✅ job output file slurm-10942.out
❌ found message matching ERROR:
✅ no message matching FAILED:
✅ no message matching required modules missing:
❌ no message matching No missing installations
✅ found message matching .tar.gz created!
Artefacts
eessi-2023.06-software-linux-x86_64-amd-zen2-1715950816.tar.gz
size: 0 MiB (15041 bytes)
entries: 3
modules under 2023.06/software/linux/x86_64/amd/zen2/modules/all
no module files in tarball
software under 2023.06/software/linux/x86_64/amd/zen2/software
no software packages in tarball
other under 2023.06/software/linux/x86_64/amd/zen2
2023.06/init/easybuild/eb_hooks.py
2023.06/scripts/gpu_support/nvidia/install_cudnn_host_injections.sh
.lmod/SitePackage.lua
May 17 13:05:11 UTC 2024 | test result | 😁 SUCCESS (click triangle for details)
ReFrame Summary
[ PASSED ] Ran 10/10 test case(s) from 10 check(s) (0 failure(s), 0 skipped, 0 aborted)
Details
✅ job output file slurm-10942.out
❌ found message matching ERROR:
✅ no message matching [\s*FAILED\s*].*Ran .* test case

@trz42
Collaborator Author

trz42 commented May 17, 2024

One more time...

bot: build inst:aws repo:eessi.io-2023.06-software arch:x86_64/amd/zen2

eessi-bot-aws bot commented May 17, 2024

Updates by the bot instance eessi-bot-mc-aws (click for details)
  • received bot command build inst:aws repo:eessi.io-2023.06-software arch:x86_64/amd/zen2 from trz42

    • expanded format: build instance:aws repository:eessi.io-2023.06-software architecture:x86_64/amd/zen2
  • handling command build instance:aws repository:eessi.io-2023.06-software architecture:x86_64/amd/zen2 resulted in:

eessi-bot-aws bot commented May 17, 2024

Updates by the bot instance eessi-bot-mc-azure (click for details)
  • received bot command build inst:aws repo:eessi.io-2023.06-software arch:x86_64/amd/zen2 from trz42

    • expanded format: build instance:aws repository:eessi.io-2023.06-software architecture:x86_64/amd/zen2
  • handling command build instance:aws repository:eessi.io-2023.06-software architecture:x86_64/amd/zen2 resulted in:

    • no jobs were submitted

eessi-bot-aws bot commented May 17, 2024

New job on instance eessi-bot-mc-aws for architecture x86_64-amd-zen2 for repository eessi.io-2023.06-software in job dir /project/def-users/SHARED/jobs/2024.05/pr_581/10943

date | job status | comment
May 17 13:14:32 UTC 2024 | submitted | job id 10943 awaits release by job manager
May 17 13:15:15 UTC 2024 | released | job awaits launch by Slurm scheduler
May 17 13:16:17 UTC 2024 | running | job 10943 is running
May 17 13:24:26 UTC 2024 | finished | 😁 SUCCESS (click triangle for details)
Details
✅ job output file slurm-10943.out
✅ no message matching ERROR:
✅ no message matching FAILED:
✅ no message matching required modules missing:
✅ found message(s) matching No missing installations
✅ found message matching .tar.gz created!
Artefacts
eessi-2023.06-software-linux-x86_64-amd-zen2-1715951838.tar.gz
size: 698 MiB (732495999 bytes)
entries: 74
modules under 2023.06/software/linux/x86_64/amd/zen2/modules/all
cuDNN/8.9.2.26-CUDA-12.1.1.lua
software under 2023.06/software/linux/x86_64/amd/zen2/software
cuDNN/8.9.2.26-CUDA-12.1.1
other under 2023.06/software/linux/x86_64/amd/zen2
2023.06/init/easybuild/eb_hooks.py
2023.06/scripts/gpu_support/nvidia/install_cudnn_host_injections.sh
.lmod/SitePackage.lua
May 17 13:24:26 UTC 2024 | test result | 😁 SUCCESS (click triangle for details)
ReFrame Summary
[ PASSED ] Ran 10/10 test case(s) from 10 check(s) (0 failure(s), 0 skipped, 0 aborted)
Details
✅ job output file slurm-10943.out
✅ no message matching ERROR:
✅ no message matching [\s*FAILED\s*].*Ran .* test case

@ocaisa
Member

ocaisa commented May 20, 2024

@trz42 I will take your updated host_injections script for a test drive tomorrow, I think I have a few suggestions there and will open a PR to your branch

@ocaisa
Member

ocaisa commented May 20, 2024

I also get the feeling that if we are going to move to easystack files (a good idea) then we should probably ship the ones we expect people to use

@trz42
Collaborator Author

trz42 commented May 21, 2024

> @trz42 I will take your updated host_injections script for a test drive tomorrow, I think I have a few suggestions there and will open a PR to your branch

Just updated the script with some improvements/fixes after my own testing.

truib added 3 commits May 22, 2024 18:21
…-layer into 2023.06-software.eessi.io-cuDNN-8.9.2.26-system
- `EESSI-install-software.sh`
  - use `scripts/gpu_support/nvidia/install_cuda_and_libraries.sh` with
    `scripts/gpu_support/nvidia/eessi-2023.06-cuda-and-libraries.yml`
- `create_lmodsitepackage.py`
  - consolidate `eessi_{cuda,cudnn}_enabled_load_hook` functions in a single one
    (`eessi_cuda_and_libraries_enabled_load_hook`)
  - the remaining hook is prepared to easily add new modules, e.g., cuTENSOR
- `eb_hooks.py`
  - put code that iterates over all files replacing non-distributable ones with
    symlinks into `host_injections` with a common function
    (`replace_non_distributable_files_with_symlinks`)
- `install_scripts.sh`
  - add files to copy to CVMFS (see `nvidia_files`)
- `scripts/gpu_support/nvidia/install_cuda_and_libraries.sh`
  - improved creation of tmp directory
@trz42
Collaborator Author

trz42 commented May 23, 2024

Run another build after several changes...

bot: build inst:aws repo:eessi.io-2023.06-software arch:x86_64/amd/zen2

eessi-bot-aws bot commented May 23, 2024

Updates by the bot instance eessi-bot-mc-aws (click for details)
  • received bot command build inst:aws repo:eessi.io-2023.06-software arch:x86_64/amd/zen2 from trz42

    • expanded format: build instance:aws repository:eessi.io-2023.06-software architecture:x86_64/amd/zen2
  • handling command build instance:aws repository:eessi.io-2023.06-software architecture:x86_64/amd/zen2 resulted in:

eessi-bot-aws bot commented May 23, 2024

Updates by the bot instance eessi-bot-mc-azure (click for details)
  • received bot command build inst:aws repo:eessi.io-2023.06-software arch:x86_64/amd/zen2 from trz42

    • expanded format: build instance:aws repository:eessi.io-2023.06-software architecture:x86_64/amd/zen2
  • handling command build instance:aws repository:eessi.io-2023.06-software architecture:x86_64/amd/zen2 resulted in:

    • no jobs were submitted

eessi-bot-aws bot commented May 23, 2024

New job on instance eessi-bot-mc-aws for architecture x86_64-amd-zen2 for repository eessi.io-2023.06-software in job dir /project/def-users/SHARED/jobs/2024.05/pr_581/11284

date | job status | comment
May 23 09:28:36 UTC 2024 | submitted | job id 11284 awaits release by job manager
May 23 09:29:06 UTC 2024 | released | job awaits launch by Slurm scheduler
May 23 09:30:09 UTC 2024 | running | job 11284 is running
May 23 09:42:29 UTC 2024 | finished | 😁 SUCCESS (click triangle for details)
Details
✅ job output file slurm-11284.out
✅ no message matching ERROR:
✅ no message matching FAILED:
✅ no message matching required modules missing:
✅ found message(s) matching No missing installations
✅ found message matching .tar.gz created!
Artefacts
eessi-2023.06-software-linux-x86_64-amd-zen2-1716456951.tar.gz
size: 698 MiB (732492073 bytes)
entries: 75
modules under 2023.06/software/linux/x86_64/amd/zen2/modules/all
cuDNN/8.9.2.26-CUDA-12.1.1.lua
software under 2023.06/software/linux/x86_64/amd/zen2/software
cuDNN/8.9.2.26-CUDA-12.1.1
other under 2023.06/software/linux/x86_64/amd/zen2
2023.06/init/easybuild/eb_hooks.py
2023.06/scripts/gpu_support/nvidia/eessi-2023.06-cuda-and-libraries.yml
2023.06/scripts/gpu_support/nvidia/install_cuda_and_libraries.sh
.lmod/SitePackage.lua
May 23 09:42:29 UTC 2024 | test result | 😁 SUCCESS (click triangle for details)
ReFrame Summary
[ PASSED ] Ran 10/10 test case(s) from 10 check(s) (0 failure(s), 0 skipped, 0 aborted)
Details
✅ job output file slurm-11284.out
✅ no message matching ERROR:
✅ no message matching [\s*FAILED\s*].*Ran .* test case

Labels
2023.06-software.eessi.io 2023.06 version of software.eessi.io gpu

3 participants