bedroge
Contributor

@bedroge bedroge commented Oct 9, 2025

And undo the limit for CP2K (introduced in #104): I don't think that was required, and it didn't solve the issue in EESSI/software-layer#1220 (comment). Newer QE versions removed the maxparallel=1, and it looks like this makes the build run out of memory.

To be sure, I'll add an easystack here that builds both QE and CP2K, just to confirm that both build without issues now.
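
A minimal sketch of the kind of limit described above, assuming the build parallelism for QuantumESPRESSO is halved on a64fx (a later comment in this thread mentions 12 cores by default versus 6) and CP2K is left at the default. The function name `limit_parallel`, the constants, and the halving factor are hypothetical and do not reproduce the actual code in eb_hooks.py.

```python
# Hypothetical sketch, not the actual hook in eb_hooks.py.
A64FX_TARGET = "aarch64/a64fx"

# Packages whose build parallelism is reduced on a64fx.
# CP2K is deliberately absent: this PR undoes its limit.
LIMITED_PACKAGES = {"QuantumESPRESSO"}


def limit_parallel(name, cpu_target, parallel):
    """Return the number of parallel build processes to use for `name`."""
    if cpu_target == A64FX_TARGET and name in LIMITED_PACKAGES:
        # Assumed factor: halve the default, but never go below 1.
        return max(1, parallel // 2)
    return parallel


print(limit_parallel("QuantumESPRESSO", A64FX_TARGET, 12))  # 6
print(limit_parallel("CP2K", A64FX_TARGET, 12))             # 12
```

In a real hook the name, CPU target, and parallelism would come from the build framework rather than from function arguments.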

@bedroge bedroge added the a64fx label Oct 9, 2025
@bedroge
Contributor Author

bedroge commented Oct 9, 2025

bot: build repo:eessi.io-2023.06-software instance:eessi-bot-deucalion for:arch=aarch64/a64fx

@eessi-bot-deucalion

eessi-bot-deucalion bot commented Oct 9, 2025

New job on instance eessi-bot-deucalion for repository eessi.io-2023.06-software
Building on: a64fx
Building for: aarch64/a64fx
Job dir: /home/eessibot/new-bot/jobs/2025.10/pr_106/581452

date job status comment
Oct 09 13:55:22 UTC 2025 submitted job id 581452 awaits release by job manager
Oct 09 13:55:28 UTC 2025 released job awaits launch by Slurm scheduler
Oct 09 13:56:34 UTC 2025 running job 581452 is running
Oct 09 14:05:13 UTC 2025 finished
😢 FAILURE (click triangle for details)
Details
✅ job output file slurm-581452.out
✅ no message matching FATAL:
❌ found message matching ERROR:
✅ no message matching FAILED:
✅ no message matching required modules missing:
❌ no message matching No missing installations
✅ found message matching .tar.gz created!
Artefacts
eessi-2023.06-software-linux-aarch64-a64fx-17600183290.tar.gz
size: 0 MiB (21567 bytes)
entries: 1
modules under 2023.06/software/linux/aarch64/a64fx/modules/all
no module files in tarball
software under 2023.06/software/linux/aarch64/a64fx/software
no software packages in tarball
reprod directories under 2023.06/software/linux/aarch64/a64fx/reprod
no reprod directories in tarball
other under 2023.06/software/linux/aarch64/a64fx
2023.06/init/easybuild/eb_hooks.py
Oct 09 14:05:13 UTC 2025 test result
😁 SUCCESS (click triangle for details)
ReFrame Summary
[ SKIP ] ( 1/10) Skipping test: nodes in this partition only have 30720 MiB memory available (per node) accodring to the current ReFrame configuration, but 49152 MiB is needed
[ SKIP ] ( 2/10) Skipping test: nodes in this partition only have 30720 MiB memory available (per node) accodring to the current ReFrame configuration, but 49152 MiB is needed
[ SKIP ] ( 3/10) Skipping test: nodes in this partition only have 30720 MiB memory available (per node) accodring to the current ReFrame configuration, but 49152 MiB is needed
[ SKIP ] ( 4/10) Skipping test: nodes in this partition only have 30720 MiB memory available (per node) accodring to the current ReFrame configuration, but 49152 MiB is needed
[ OK ] ( 5/10) EESSI_LAMMPS_lj %device_type=cpu %module_name=LAMMPS/29Aug2024-foss-2023b-kokkos %scale=1_node /aeb2d9df @BotBuildTests:aarch64_a64fx+default
P: perf: 582.844 timesteps/s (r:0, l:None, u:None)
[ OK ] ( 6/10) EESSI_LAMMPS_lj %device_type=cpu %module_name=LAMMPS/2Aug2023_update2-foss-2023a-kokkos %scale=1_node /04ff9ece @BotBuildTests:aarch64_a64fx+default
P: perf: 583.466 timesteps/s (r:0, l:None, u:None)
[ OK ] ( 7/10) EESSI_OSU_pt2pt_CPU %benchmark_info=mpi.pt2pt.osu_latency %module_name=OSU-Micro-Benchmarks/7.2-gompi-2023b %scale=1_node /15cad6c4 @BotBuildTests:aarch64_a64fx+default
P: latency: 1.64 us (r:0, l:None, u:None)
[ OK ] ( 8/10) EESSI_OSU_pt2pt_CPU %benchmark_info=mpi.pt2pt.osu_latency %module_name=OSU-Micro-Benchmarks/7.1-1-gompi-2023a %scale=1_node /6672deda @BotBuildTests:aarch64_a64fx+default
P: latency: 1.64 us (r:0, l:None, u:None)
[ OK ] ( 9/10) EESSI_OSU_pt2pt_CPU %benchmark_info=mpi.pt2pt.osu_bw %module_name=OSU-Micro-Benchmarks/7.2-gompi-2023b %scale=1_node /2a9a47b1 @BotBuildTests:aarch64_a64fx+default
P: bandwidth: 8496.74 MB/s (r:0, l:None, u:None)
[ OK ] (10/10) EESSI_OSU_pt2pt_CPU %benchmark_info=mpi.pt2pt.osu_bw %module_name=OSU-Micro-Benchmarks/7.1-1-gompi-2023a %scale=1_node /1b24ab8e @BotBuildTests:aarch64_a64fx+default
P: bandwidth: 7820.35 MB/s (r:0, l:None, u:None)
[ PASSED ] Ran 6/10 test case(s) from 10 check(s) (0 failure(s), 4 skipped, 0 aborted)
Details
✅ job output file slurm-581452.out
❌ found message matching ERROR:
✅ no message matching [\s*FAILED\s*].*Ran .* test case
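
The pass/fail list in the bot report above can be read as a set of pattern checks on the job output, some of which must match and some of which must not. The following is a rough sketch of that idea only. The function `failed_checks` and the `CHECKS` table are hypothetical, the bot's real implementation is not shown in this thread, and the sample input is synthetic rather than a real Slurm log.

```python
import re

# (pattern, should_be_present) pairs, taken from the check list above.
CHECKS = [
    ("FATAL:", False),
    ("ERROR:", False),
    ("FAILED:", False),
    ("required modules missing:", False),
    ("No missing installations", True),
    (".tar.gz created!", True),
]


def failed_checks(output):
    """Return the patterns whose presence or absence violates the expectation."""
    failed = []
    for pattern, should_be_present in CHECKS:
        # Patterns are treated as literal strings, hence re.escape.
        found = re.search(re.escape(pattern), output) is not None
        if found != should_be_present:
            failed.append(pattern)
    return failed


# Synthetic input containing only marker strings, not a real job log.
sample = "ERROR:\n.tar.gz created!\n"
print(failed_checks(sample))  # ['ERROR:', 'No missing installations']
```

With this input the two failing checks are the same two marked ❌ in the build report for job 581452.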

@bedroge
Contributor Author

bedroge commented Oct 9, 2025

bot: build repo:eessi.io-2023.06-software instance:eessi-bot-deucalion for:arch=aarch64/a64fx

@eessi-bot-deucalion

eessi-bot-deucalion bot commented Oct 9, 2025

New job on instance eessi-bot-deucalion for repository eessi.io-2023.06-software
Building on: a64fx
Building for: aarch64/a64fx
Job dir: /home/eessibot/new-bot/jobs/2025.10/pr_106/581603

date job status comment
Oct 09 15:22:02 UTC 2025 submitted job id 581603 awaits release by job manager
Oct 09 15:22:17 UTC 2025 released job awaits launch by Slurm scheduler
Oct 09 15:23:23 UTC 2025 running job 581603 is running
Oct 10 10:18:32 UTC 2025 finished
😁 SUCCESS (click triangle for details)
Details
✅ job output file slurm-581603.out
✅ no message matching FATAL:
✅ no message matching ERROR:
✅ no message matching FAILED:
✅ no message matching required modules missing:
✅ found message(s) matching No missing installations
✅ found message matching .tar.gz created!
Artefacts
eessi-2023.06-software-linux-aarch64-a64fx-17600891280.tar.gz
size: 4512 MiB (4731615076 bytes)
entries: 30510
modules under 2023.06/software/linux/aarch64/a64fx/modules/all
CP2K/2023.1-foss-2023a.lua
Libint/2.7.2-GCC-12.3.0-lmax-6-cp2k.lua
QuantumESPRESSO/7.3.1-foss-2023a.lua
libvori/220621-GCCcore-12.3.0.lua
software under 2023.06/software/linux/aarch64/a64fx/software
CP2K/2023.1-foss-2023a
Libint/2.7.2-GCC-12.3.0-lmax-6-cp2k
QuantumESPRESSO/7.3.1-foss-2023a
libvori/220621-GCCcore-12.3.0
reprod directories under 2023.06/software/linux/aarch64/a64fx/reprod
no reprod directories in tarball
other under 2023.06/software/linux/aarch64/a64fx
2023.06/init/easybuild/eb_hooks.py
Oct 10 10:18:32 UTC 2025 test result
😁 SUCCESS (click triangle for details)
ReFrame Summary
[ SKIP ] ( 1/11) Skipping test: nodes in this partition only have 30720 MiB memory available (per node) accodring to the current ReFrame configuration, but 48332.8 MiB is needed
[ SKIP ] ( 2/11) Skipping test: nodes in this partition only have 30720 MiB memory available (per node) accodring to the current ReFrame configuration, but 49152 MiB is needed
[ SKIP ] ( 3/11) Skipping test: nodes in this partition only have 30720 MiB memory available (per node) accodring to the current ReFrame configuration, but 49152 MiB is needed
[ SKIP ] ( 4/11) Skipping test: nodes in this partition only have 30720 MiB memory available (per node) accodring to the current ReFrame configuration, but 49152 MiB is needed
[ SKIP ] ( 5/11) Skipping test: nodes in this partition only have 30720 MiB memory available (per node) accodring to the current ReFrame configuration, but 49152 MiB is needed
[ OK ] ( 6/11) EESSI_LAMMPS_lj %device_type=cpu %module_name=LAMMPS/29Aug2024-foss-2023b-kokkos %scale=1_node /aeb2d9df @BotBuildTests:aarch64_a64fx+default
P: perf: 583.705 timesteps/s (r:0, l:None, u:None)
[ OK ] ( 7/11) EESSI_LAMMPS_lj %device_type=cpu %module_name=LAMMPS/2Aug2023_update2-foss-2023a-kokkos %scale=1_node /04ff9ece @BotBuildTests:aarch64_a64fx+default
P: perf: 551.251 timesteps/s (r:0, l:None, u:None)
[ OK ] ( 8/11) EESSI_OSU_pt2pt_CPU %benchmark_info=mpi.pt2pt.osu_latency %module_name=OSU-Micro-Benchmarks/7.2-gompi-2023b %scale=1_node /15cad6c4 @BotBuildTests:aarch64_a64fx+default
P: latency: 1.71 us (r:0, l:None, u:None)
[ OK ] ( 9/11) EESSI_OSU_pt2pt_CPU %benchmark_info=mpi.pt2pt.osu_latency %module_name=OSU-Micro-Benchmarks/7.1-1-gompi-2023a %scale=1_node /6672deda @BotBuildTests:aarch64_a64fx+default
P: latency: 1.74 us (r:0, l:None, u:None)
[ OK ] (10/11) EESSI_OSU_pt2pt_CPU %benchmark_info=mpi.pt2pt.osu_bw %module_name=OSU-Micro-Benchmarks/7.2-gompi-2023b %scale=1_node /2a9a47b1 @BotBuildTests:aarch64_a64fx+default
P: bandwidth: 8851.13 MB/s (r:0, l:None, u:None)
[ OK ] (11/11) EESSI_OSU_pt2pt_CPU %benchmark_info=mpi.pt2pt.osu_bw %module_name=OSU-Micro-Benchmarks/7.1-1-gompi-2023a %scale=1_node /1b24ab8e @BotBuildTests:aarch64_a64fx+default
P: bandwidth: 8744.57 MB/s (r:0, l:None, u:None)
[ PASSED ] Ran 6/11 test case(s) from 11 check(s) (0 failure(s), 5 skipped, 0 aborted)
Details
✅ job output file slurm-581603.out
✅ no message matching ERROR:
✅ no message matching [\s*FAILED\s*].*Ran .* test case

@bedroge
Contributor Author

bedroge commented Oct 9, 2025

Job is still running, but the QE build just completed. The max memory usage reported by Slurm is only 1953600K, so I don't understand why it didn't work with the default settings (which should be 12 cores instead of 6?).

@bedroge
Contributor Author

bedroge commented Oct 10, 2025

bot: build repo:eessi.io-2023.06-software instance:eessi-bot-deucalion for:arch=aarch64/a64fx
bot: build repo:eessi.io-2025.06-software instance:eessi-bot-deucalion for:arch=aarch64/a64fx

@eessi-bot-deucalion

eessi-bot-deucalion bot commented Oct 10, 2025

New job on instance eessi-bot-deucalion for repository eessi.io-2023.06-software
Building on: a64fx
Building for: aarch64/a64fx
Job dir: /home/eessibot/new-bot/jobs/2025.10/pr_106/582025

date job status comment
Oct 10 10:24:33 UTC 2025 submitted job id 582025 awaits release by job manager
Oct 10 10:24:46 UTC 2025 released job awaits launch by Slurm scheduler
Oct 10 10:25:50 UTC 2025 running job 582025 is running
Oct 10 10:33:33 UTC 2025 finished
😁 SUCCESS (click triangle for details)
Details
✅ job output file slurm-582025.out
✅ no message matching FATAL:
✅ no message matching ERROR:
✅ no message matching FAILED:
✅ no message matching required modules missing:
✅ found message(s) matching No missing installations
✅ found message matching .tar.gz created!
Artefacts
eessi-2023.06-software-linux-aarch64-a64fx-17600920600.tar.gz
size: 0 MiB (21566 bytes)
entries: 1
modules under 2023.06/software/linux/aarch64/a64fx/modules/all
no module files in tarball
software under 2023.06/software/linux/aarch64/a64fx/software
no software packages in tarball
reprod directories under 2023.06/software/linux/aarch64/a64fx/reprod
no reprod directories in tarball
other under 2023.06/software/linux/aarch64/a64fx
2023.06/init/easybuild/eb_hooks.py
Oct 10 10:33:33 UTC 2025 test result
😁 SUCCESS (click triangle for details)
ReFrame Summary
[ SKIP ] ( 1/10) Skipping test: nodes in this partition only have 30720 MiB memory available (per node) accodring to the current ReFrame configuration, but 49152 MiB is needed
[ SKIP ] ( 2/10) Skipping test: nodes in this partition only have 30720 MiB memory available (per node) accodring to the current ReFrame configuration, but 49152 MiB is needed
[ SKIP ] ( 3/10) Skipping test: nodes in this partition only have 30720 MiB memory available (per node) accodring to the current ReFrame configuration, but 49152 MiB is needed
[ SKIP ] ( 4/10) Skipping test: nodes in this partition only have 30720 MiB memory available (per node) accodring to the current ReFrame configuration, but 49152 MiB is needed
[ OK ] ( 5/10) EESSI_LAMMPS_lj %device_type=cpu %module_name=LAMMPS/29Aug2024-foss-2023b-kokkos %scale=1_node /aeb2d9df @BotBuildTests:aarch64_a64fx+default
P: perf: 581.326 timesteps/s (r:0, l:None, u:None)
[ OK ] ( 6/10) EESSI_LAMMPS_lj %device_type=cpu %module_name=LAMMPS/2Aug2023_update2-foss-2023a-kokkos %scale=1_node /04ff9ece @BotBuildTests:aarch64_a64fx+default
P: perf: 580.767 timesteps/s (r:0, l:None, u:None)
[ OK ] ( 7/10) EESSI_OSU_pt2pt_CPU %benchmark_info=mpi.pt2pt.osu_latency %module_name=OSU-Micro-Benchmarks/7.2-gompi-2023b %scale=1_node /15cad6c4 @BotBuildTests:aarch64_a64fx+default
P: latency: 1.67 us (r:0, l:None, u:None)
[ OK ] ( 8/10) EESSI_OSU_pt2pt_CPU %benchmark_info=mpi.pt2pt.osu_latency %module_name=OSU-Micro-Benchmarks/7.1-1-gompi-2023a %scale=1_node /6672deda @BotBuildTests:aarch64_a64fx+default
P: latency: 1.72 us (r:0, l:None, u:None)
[ OK ] ( 9/10) EESSI_OSU_pt2pt_CPU %benchmark_info=mpi.pt2pt.osu_bw %module_name=OSU-Micro-Benchmarks/7.2-gompi-2023b %scale=1_node /2a9a47b1 @BotBuildTests:aarch64_a64fx+default
P: bandwidth: 8461.09 MB/s (r:0, l:None, u:None)
[ OK ] (10/10) EESSI_OSU_pt2pt_CPU %benchmark_info=mpi.pt2pt.osu_bw %module_name=OSU-Micro-Benchmarks/7.1-1-gompi-2023a %scale=1_node /1b24ab8e @BotBuildTests:aarch64_a64fx+default
P: bandwidth: 8110.44 MB/s (r:0, l:None, u:None)
[ PASSED ] Ran 6/10 test case(s) from 10 check(s) (0 failure(s), 4 skipped, 0 aborted)
Details
✅ job output file slurm-582025.out
✅ no message matching ERROR:
✅ no message matching [\s*FAILED\s*].*Ran .* test case
Oct 10 14:59:35 UTC 2025 uploaded transfer of eessi-2023.06-software-linux-aarch64-a64fx-17600920600.tar.gz to S3 bucket succeeded

@eessi-bot-deucalion

eessi-bot-deucalion bot commented Oct 10, 2025

New job on instance eessi-bot-deucalion for repository eessi.io-2025.06-software
Building on: a64fx
Building for: aarch64/a64fx
Job dir: /home/eessibot/new-bot/jobs/2025.10/pr_106/582030

date job status comment
Oct 10 10:24:38 UTC 2025 submitted job id 582030 awaits release by job manager
Oct 10 10:24:43 UTC 2025 released job awaits launch by Slurm scheduler
Oct 10 10:25:52 UTC 2025 running job 582030 is running
Oct 10 10:29:09 UTC 2025 finished
😁 SUCCESS (click triangle for details)
Details
✅ job output file slurm-582030.out
✅ no message matching FATAL:
✅ no message matching ERROR:
✅ no message matching FAILED:
✅ no message matching required modules missing:
✅ found message(s) matching No missing installations
✅ found message matching .tar.gz created!
Artefacts
eessi-2025.06-software-linux-aarch64-a64fx-17600919250.tar.gz
size: 0 MiB (21566 bytes)
entries: 1
modules under 2025.06/software/linux/aarch64/a64fx/modules/all
no module files in tarball
software under 2025.06/software/linux/aarch64/a64fx/software
no software packages in tarball
reprod directories under 2025.06/software/linux/aarch64/a64fx/reprod
no reprod directories in tarball
other under 2025.06/software/linux/aarch64/a64fx
2025.06/init/easybuild/eb_hooks.py
Oct 10 10:29:09 UTC 2025 test result
😁 SUCCESS (click triangle for details)
ReFrame Summary
[ PASSED ] Ran 0/0 test case(s) from 0 check(s) (0 failure(s), 0 skipped, 0 aborted)
Details
✅ job output file slurm-582030.out
✅ no message matching ERROR:
✅ no message matching [\s*FAILED\s*].*Ran .* test case
Oct 10 14:59:27 UTC 2025 uploaded transfer of eessi-2025.06-software-linux-aarch64-a64fx-17600919250.tar.gz to S3 bucket succeeded

@trz42
Contributor

trz42 commented Oct 10, 2025

> Job is still running, but the QE build just completed. The max memory usage reported by Slurm is only 1953600K, so I don't understand why it didn't work with the default settings (which should be 12 cores instead of 6?).

Nodes have about 29 G free memory for jobs. So if 6 use 19 G (6 * 2.5 G + 4 G), using 38 G for 12 cores (or say 12 * 2.5 G + 4 G = 34 G) would be too much.

Anyhow, the true culprit has been found.
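
The estimate above can be written out as a small calculation. This is only a sketch of the arithmetic in this comment: the per-core and baseline figures (2.5 G and 4 G) and the 29 G node limit are the numbers quoted here, not measurements, and the names are made up for illustration.

```python
NODE_MEM_G = 29.0   # approximate free memory per node, as stated above
PER_CORE_G = 2.5    # assumed memory per build process
BASE_G = 4.0        # assumed fixed overhead


def estimated_mem_g(cores):
    """Estimated memory use in G for a build with `cores` parallel processes."""
    return cores * PER_CORE_G + BASE_G


for cores in (6, 12):
    need = estimated_mem_g(cores)
    verdict = "fits" if need <= NODE_MEM_G else "too much"
    print(f"{cores} cores: {need} G ({verdict})")

# Largest core count that stays within the node limit under this model.
max_cores = int((NODE_MEM_G - BASE_G) // PER_CORE_G)
print(f"max cores under this model: {max_cores}")
```

Under these assumptions 6 cores need 19 G and fit, 12 cores need 34 G and do not, and the model would allow at most 10 cores.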

@bedroge
Contributor Author

bedroge commented Oct 10, 2025

> So if 6 use 19 G (6 * 2.5 G + 4 G), using 38 G for 12 cores (or say 12 * 2.5 G + 4 G = 34 G) would be too much.

True, but 1953600 K is only 1.9 GB 😉
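
For reference, a quick conversion of the value reported by Slurm, assuming the "K" suffix means KiB (1024 bytes). That unit interpretation is an assumption here, and the variable names are illustrative.

```python
# Max memory usage quoted earlier in this thread.
MAX_RSS_K = 1953600

# Assumption: "K" means KiB (1024 bytes).
max_rss_bytes = MAX_RSS_K * 1024
max_rss_gib = max_rss_bytes / 1024**3  # binary gigabytes
max_rss_gb = max_rss_bytes / 1000**3   # decimal gigabytes

print(f"{max_rss_gib:.2f} GiB, {max_rss_gb:.2f} GB")  # 1.86 GiB, 2.00 GB
```

Either way the reported peak is about 2 G, roughly a tenth of the 19 G estimated for 6 cores.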

Member

@ocaisa ocaisa left a comment

LGTM

@boegel
Contributor

boegel commented Oct 10, 2025

staging PR merged

@boegel boegel merged commit 3a3ea5b into EESSI:main Oct 10, 2025
66 of 68 checks passed
@bedroge bedroge deleted the qe_numcores_a64fx branch October 10, 2025 18:13