{2023.06}[foss/2023a] PyTorch v2.1.2 with CUDA/12.1.1 #369
Just a first test on a single architecture...
bot: build inst:AWS-MC-NESSI repo:nessi-2023.06-swl-deb11 arch:zen2
…into nessi-2023.06-PyTorch-2.1.2-2023a-CUDA-12.1.1
Next try after #370 (fix for GPU check)...
bot: build inst:AWS-MC-NESSI repo:nessi-2023.06-swl-deb11 arch:zen2
Next try after putting the source file under the shared source path...
bot: build inst:AWS-MC-NESSI repo:nessi-2023.06-swl-deb11 arch:zen2
…into nessi-2023.06-PyTorch-2.1.2-2023a-CUDA-12.1.1
…into nessi-2023.06-PyTorch-2.1.2-2023a-CUDA-12.1.1
Try job on AWS and with different container (on eX3)...
bot: build inst:AWS-MC-NESSI repo:nessi-2023.06-swl-deb11 arch:aarch64/generic
Try again
bot: build inst:eX3-NESSI repo:nessi-2023.06-swl-deb10 arch:aarch64/generic
Try running tests sequentially...
bot: build inst:eX3-NESSI repo:nessi-2023.06-swl-deb10 arch:aarch64/generic
Checklist before starting deployment (setting
We need to be very careful about which staging PRs are approved and which are rejected. Also any tarball we ingest may not include the updated
Target architectures
Checklist for deployment/ingestion
command & log
command
BASE_DIR=/cvmfs/pilot.nessi.no/versions/2023.06/software/linux \
ARCHS=() \
ARCHS+=("aarch64/generic") ; \
ARCHS+=("x86_64/generic") ; \
ARCHS+=("x86_64/amd/zen2") ; \
ARCHS+=("x86_64/intel/broadwell") ; \
ARCHS+=("x86_64/intel/skylake_avx512") ; \
for arch in "${ARCHS[@]}"; do \
ls -l \
${BASE_DIR}/${arch}/{software,modules/all}/magma/2.7.2-foss-2023a-CUDA-12.1.1* \
${BASE_DIR}/${arch}/{software,modules/all}/PyTorch/2.1.2-foss-2023a-CUDA-12.1.1* ; \
done ; \
ls -l \
${BASE_DIR}/../../init/easybuild/eb_hooks.py
log - BEFORE ingestion
ls: cannot access '/cvmfs/pilot.nessi.no/versions/2023.06/software/linux/aarch64/generic/software/magma/2.7.2-foss-2023a-CUDA-12.1.1*': No such file or directory
ls: cannot access '/cvmfs/pilot.nessi.no/versions/2023.06/software/linux/aarch64/generic/modules/all/magma/2.7.2-foss-2023a-CUDA-12.1.1*': No such file or directory
ls: cannot access '/cvmfs/pilot.nessi.no/versions/2023.06/software/linux/aarch64/generic/software/PyTorch/2.1.2-foss-2023a-CUDA-12.1.1*': No such file or directory
ls: cannot access '/cvmfs/pilot.nessi.no/versions/2023.06/software/linux/aarch64/generic/modules/all/PyTorch/2.1.2-foss-2023a-CUDA-12.1.1*': No such file or directory
ls: cannot access '/cvmfs/pilot.nessi.no/versions/2023.06/software/linux/x86_64/generic/software/magma/2.7.2-foss-2023a-CUDA-12.1.1*': No such file or directory
ls: cannot access '/cvmfs/pilot.nessi.no/versions/2023.06/software/linux/x86_64/generic/modules/all/magma/2.7.2-foss-2023a-CUDA-12.1.1*': No such file or directory
ls: cannot access '/cvmfs/pilot.nessi.no/versions/2023.06/software/linux/x86_64/generic/software/PyTorch/2.1.2-foss-2023a-CUDA-12.1.1*': No such file or directory
ls: cannot access '/cvmfs/pilot.nessi.no/versions/2023.06/software/linux/x86_64/generic/modules/all/PyTorch/2.1.2-foss-2023a-CUDA-12.1.1*': No such file or directory
ls: cannot access '/cvmfs/pilot.nessi.no/versions/2023.06/software/linux/x86_64/amd/zen2/software/magma/2.7.2-foss-2023a-CUDA-12.1.1*': No such file or directory
ls: cannot access '/cvmfs/pilot.nessi.no/versions/2023.06/software/linux/x86_64/amd/zen2/modules/all/magma/2.7.2-foss-2023a-CUDA-12.1.1*': No such file or directory
ls: cannot access '/cvmfs/pilot.nessi.no/versions/2023.06/software/linux/x86_64/amd/zen2/software/PyTorch/2.1.2-foss-2023a-CUDA-12.1.1*': No such file or directory
ls: cannot access '/cvmfs/pilot.nessi.no/versions/2023.06/software/linux/x86_64/amd/zen2/modules/all/PyTorch/2.1.2-foss-2023a-CUDA-12.1.1*': No such file or directory
ls: cannot access '/cvmfs/pilot.nessi.no/versions/2023.06/software/linux/x86_64/intel/broadwell/software/magma/2.7.2-foss-2023a-CUDA-12.1.1*': No such file or directory
ls: cannot access '/cvmfs/pilot.nessi.no/versions/2023.06/software/linux/x86_64/intel/broadwell/modules/all/magma/2.7.2-foss-2023a-CUDA-12.1.1*': No such file or directory
ls: cannot access '/cvmfs/pilot.nessi.no/versions/2023.06/software/linux/x86_64/intel/broadwell/software/PyTorch/2.1.2-foss-2023a-CUDA-12.1.1*': No such file or directory
ls: cannot access '/cvmfs/pilot.nessi.no/versions/2023.06/software/linux/x86_64/intel/broadwell/modules/all/PyTorch/2.1.2-foss-2023a-CUDA-12.1.1*': No such file or directory
ls: cannot access '/cvmfs/pilot.nessi.no/versions/2023.06/software/linux/x86_64/intel/skylake_avx512/software/magma/2.7.2-foss-2023a-CUDA-12.1.1*': No such file or directory
ls: cannot access '/cvmfs/pilot.nessi.no/versions/2023.06/software/linux/x86_64/intel/skylake_avx512/modules/all/magma/2.7.2-foss-2023a-CUDA-12.1.1*': No such file or directory
ls: cannot access '/cvmfs/pilot.nessi.no/versions/2023.06/software/linux/x86_64/intel/skylake_avx512/software/PyTorch/2.1.2-foss-2023a-CUDA-12.1.1*': No such file or directory
ls: cannot access '/cvmfs/pilot.nessi.no/versions/2023.06/software/linux/x86_64/intel/skylake_avx512/modules/all/PyTorch/2.1.2-foss-2023a-CUDA-12.1.1*': No such file or directory
-rw-rw-r-- 1 cvmfs cvmfs 43825 May 22 23:12 /cvmfs/pilot.nessi.no/versions/2023.06/software/linux/../../init/easybuild/eb_hooks.py
log - AFTER ingestion
-rw-rw-r-- 1 cvmfs cvmfs 1425 May 29 19:05 /cvmfs/pilot.nessi.no/versions/2023.06/software/linux/aarch64/generic/modules/all/magma/2.7.2-foss-2023a-CUDA-12.1.1.lua
-rw-rw-r-- 1 cvmfs cvmfs 3238 May 30 04:34 /cvmfs/pilot.nessi.no/versions/2023.06/software/linux/aarch64/generic/modules/all/PyTorch/2.1.2-foss-2023a-CUDA-12.1.1.lua
/cvmfs/pilot.nessi.no/versions/2023.06/software/linux/aarch64/generic/software/magma/2.7.2-foss-2023a-CUDA-12.1.1:
total 14
dr-xr-xr-x 3 cvmfs cvmfs 4096 May 29 19:05 easybuild
dr-xr-xr-x 2 cvmfs cvmfs 4096 May 29 19:05 include
dr-xr-xr-x 3 cvmfs cvmfs 4096 May 29 19:05 lib
lrwxrwxrwx 1 cvmfs cvmfs 3 May 29 19:05 lib64 -> lib
/cvmfs/pilot.nessi.no/versions/2023.06/software/linux/aarch64/generic/software/PyTorch/2.1.2-foss-2023a-CUDA-12.1.1:
total 14
dr-xr-xr-x 2 cvmfs cvmfs 4096 May 30 04:34 bin
dr-xr-xr-x 3 cvmfs cvmfs 4096 May 30 04:35 easybuild
dr-xr-xr-x 3 cvmfs cvmfs 4096 May 30 04:32 lib
lrwxrwxrwx 1 cvmfs cvmfs 3 May 30 04:34 lib64 -> lib
-rw-rw-r-- 1 cvmfs cvmfs 1424 May 27 11:29 /cvmfs/pilot.nessi.no/versions/2023.06/software/linux/x86_64/generic/modules/all/magma/2.7.2-foss-2023a-CUDA-12.1.1.lua
-rw-rw-r-- 1 cvmfs cvmfs 3237 May 27 20:54 /cvmfs/pilot.nessi.no/versions/2023.06/software/linux/x86_64/generic/modules/all/PyTorch/2.1.2-foss-2023a-CUDA-12.1.1.lua
/cvmfs/pilot.nessi.no/versions/2023.06/software/linux/x86_64/generic/software/magma/2.7.2-foss-2023a-CUDA-12.1.1:
total 14
dr-xr-xr-x 3 cvmfs cvmfs 4096 May 27 11:29 easybuild
dr-xr-xr-x 2 cvmfs cvmfs 4096 May 27 11:29 include
dr-xr-xr-x 3 cvmfs cvmfs 4096 May 27 11:29 lib
lrwxrwxrwx 1 cvmfs cvmfs 3 May 27 11:29 lib64 -> lib
/cvmfs/pilot.nessi.no/versions/2023.06/software/linux/x86_64/generic/software/PyTorch/2.1.2-foss-2023a-CUDA-12.1.1:
total 14
dr-xr-xr-x 2 cvmfs cvmfs 4096 May 27 20:54 bin
dr-xr-xr-x 3 cvmfs cvmfs 4096 May 27 20:55 easybuild
dr-xr-xr-x 3 cvmfs cvmfs 4096 May 27 20:52 lib
lrwxrwxrwx 1 cvmfs cvmfs 3 May 27 20:54 lib64 -> lib
-rw-rw-r-- 1 cvmfs cvmfs 1425 May 26 23:06 /cvmfs/pilot.nessi.no/versions/2023.06/software/linux/x86_64/amd/zen2/modules/all/magma/2.7.2-foss-2023a-CUDA-12.1.1.lua
-rw-rw-r-- 1 cvmfs cvmfs 3238 May 27 09:23 /cvmfs/pilot.nessi.no/versions/2023.06/software/linux/x86_64/amd/zen2/modules/all/PyTorch/2.1.2-foss-2023a-CUDA-12.1.1.lua
/cvmfs/pilot.nessi.no/versions/2023.06/software/linux/x86_64/amd/zen2/software/magma/2.7.2-foss-2023a-CUDA-12.1.1:
total 14
dr-xr-xr-x 3 cvmfs cvmfs 4096 May 26 23:07 easybuild
dr-xr-xr-x 2 cvmfs cvmfs 4096 May 26 23:06 include
dr-xr-xr-x 3 cvmfs cvmfs 4096 May 26 23:06 lib
lrwxrwxrwx 1 cvmfs cvmfs 3 May 26 23:06 lib64 -> lib
/cvmfs/pilot.nessi.no/versions/2023.06/software/linux/x86_64/amd/zen2/software/PyTorch/2.1.2-foss-2023a-CUDA-12.1.1:
total 14
dr-xr-xr-x 2 cvmfs cvmfs 4096 May 27 09:22 bin
dr-xr-xr-x 3 cvmfs cvmfs 4096 May 27 09:24 easybuild
dr-xr-xr-x 3 cvmfs cvmfs 4096 May 27 09:20 lib
lrwxrwxrwx 1 cvmfs cvmfs 3 May 27 09:22 lib64 -> lib
-rw-rw-r-- 1 cvmfs cvmfs 1432 May 27 11:14 /cvmfs/pilot.nessi.no/versions/2023.06/software/linux/x86_64/intel/broadwell/modules/all/magma/2.7.2-foss-2023a-CUDA-12.1.1.lua
-rw-rw-r-- 1 cvmfs cvmfs 3245 May 28 07:06 /cvmfs/pilot.nessi.no/versions/2023.06/software/linux/x86_64/intel/broadwell/modules/all/PyTorch/2.1.2-foss-2023a-CUDA-12.1.1.lua
/cvmfs/pilot.nessi.no/versions/2023.06/software/linux/x86_64/intel/broadwell/software/magma/2.7.2-foss-2023a-CUDA-12.1.1:
total 14
dr-xr-xr-x 3 cvmfs cvmfs 4096 May 27 11:14 easybuild
dr-xr-xr-x 2 cvmfs cvmfs 4096 May 27 11:14 include
dr-xr-xr-x 3 cvmfs cvmfs 4096 May 27 11:14 lib
lrwxrwxrwx 1 cvmfs cvmfs 3 May 27 11:14 lib64 -> lib
/cvmfs/pilot.nessi.no/versions/2023.06/software/linux/x86_64/intel/broadwell/software/PyTorch/2.1.2-foss-2023a-CUDA-12.1.1:
total 14
dr-xr-xr-x 2 cvmfs cvmfs 4096 May 28 07:06 bin
dr-xr-xr-x 3 cvmfs cvmfs 4096 May 28 07:08 easybuild
dr-xr-xr-x 3 cvmfs cvmfs 4096 May 28 07:02 lib
lrwxrwxrwx 1 cvmfs cvmfs 3 May 28 07:06 lib64 -> lib
-rw-rw-r-- 1 cvmfs cvmfs 1437 May 26 09:03 /cvmfs/pilot.nessi.no/versions/2023.06/software/linux/x86_64/intel/skylake_avx512/modules/all/magma/2.7.2-foss-2023a-CUDA-12.1.1.lua
-rw-rw-r-- 1 cvmfs cvmfs 3250 May 27 03:02 /cvmfs/pilot.nessi.no/versions/2023.06/software/linux/x86_64/intel/skylake_avx512/modules/all/PyTorch/2.1.2-foss-2023a-CUDA-12.1.1.lua
/cvmfs/pilot.nessi.no/versions/2023.06/software/linux/x86_64/intel/skylake_avx512/software/magma/2.7.2-foss-2023a-CUDA-12.1.1:
total 14
dr-xr-xr-x 3 cvmfs cvmfs 4096 May 26 09:03 easybuild
dr-xr-xr-x 2 cvmfs cvmfs 4096 May 26 09:02 include
dr-xr-xr-x 3 cvmfs cvmfs 4096 May 26 09:02 lib
lrwxrwxrwx 1 cvmfs cvmfs 3 May 26 09:02 lib64 -> lib
/cvmfs/pilot.nessi.no/versions/2023.06/software/linux/x86_64/intel/skylake_avx512/software/PyTorch/2.1.2-foss-2023a-CUDA-12.1.1:
total 14
dr-xr-xr-x 2 cvmfs cvmfs 4096 May 27 03:01 bin
dr-xr-xr-x 3 cvmfs cvmfs 4096 May 27 03:03 easybuild
dr-xr-xr-x 3 cvmfs cvmfs 4096 May 27 01:03 lib
lrwxrwxrwx 1 cvmfs cvmfs 3 May 27 03:01 lib64 -> lib
-rw-rw-r-- 1 cvmfs cvmfs 45378 May 26 08:15 /cvmfs/pilot.nessi.no/versions/2023.06/software/linux/../../init/easybuild/eb_hooks.py
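The before/after comparison in these logs could also be condensed into a present/missing summary per path. The following is only a minimal sketch, assuming the directory layout visible in the logs (`software/<name>/<version>` directories and `modules/all/<name>/<version>.lua` files); `check_installs` is a hypothetical helper and not part of the bot or of the NESSI/EESSI tooling.

```shell
#!/usr/bin/env bash
# Hypothetical helper (an assumption, not existing tooling): report which of the
# expected installation paths exist below a given base directory.
check_installs() {
    local base_dir="$1"
    # Target architectures and packages copied from the checklist command.
    local archs=("aarch64/generic" "x86_64/generic" "x86_64/amd/zen2"
                 "x86_64/intel/broadwell" "x86_64/intel/skylake_avx512")
    local pkgs=("magma/2.7.2-foss-2023a-CUDA-12.1.1"
                "PyTorch/2.1.2-foss-2023a-CUDA-12.1.1")
    local missing=0 arch pkg path
    for arch in "${archs[@]}"; do
        for pkg in "${pkgs[@]}"; do
            # One software directory and one Lua module file per package.
            for path in "${base_dir}/${arch}/software/${pkg}" \
                        "${base_dir}/${arch}/modules/all/${pkg}.lua"; do
                if [ -e "${path}" ]; then
                    echo "present ${path}"
                else
                    echo "MISSING ${path}"
                    missing=$((missing + 1))
                fi
            done
        done
    done
    echo "missing: ${missing}"
    # Exit status is 0 only if every expected path exists.
    [ "${missing}" -eq 0 ]
}
```

It would be called as `check_installs /cvmfs/pilot.nessi.no/versions/2023.06/software/linux`. Before ingestion all paths would be reported as missing; after ingestion the function should return 0. Unlike the `ls -l` command it does not show timestamps, so the `eb_hooks.py` check would still need to be done separately.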
Not an easy one.
We'll have to redo it once it is clear how to build such modules in EESSI and when the bot is ready for it.
In the meantime we can test it on Saga, Betzy and eX3.
Add PyTorch/2.1.2-foss-2023a-CUDA-12.1.1 to NESSI.
SPDX license identifier: BSD-style
Missing packages: