Limit QuantumESPRESSO builds on A64FX to max 6 cores #106

bedroge · 2025-10-09T13:52:07Z

This also undoes the limit for CP2K (introduced in #104): I don't think it was required, and it didn't solve the issue in EESSI/software-layer#1220 (comment). Newer QE versions removed the maxparallel=1 setting, and it looks like the resulting parallel build runs out of memory.

To be sure, I'll add an easystack here that builds both QE and CP2K, just to confirm that both build without issues now.
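The change described above amounts to a per-software cap on build parallelism for one CPU target. A minimal sketch of that idea in plain Python follows. The function name `capped_cores`, the `CORE_LIMITS` table and the target string format are illustrative assumptions; they are not taken from the actual `eb_hooks.py`, which works through EasyBuild's hook interface.

```python
# Hypothetical sketch: cap the number of build cores per software name on a
# given CPU target. All names and the data layout are assumptions.

# Maximum cores per (cpu_target, software name). Only the QE limit of 6 on
# A64FX comes from this PR; CP2K is deliberately absent (its limit is undone).
CORE_LIMITS = {
    ("aarch64/a64fx", "QuantumESPRESSO"): 6,
}


def capped_cores(cpu_target: str, name: str, requested: int) -> int:
    """Return the number of cores to build with, honouring any cap."""
    limit = CORE_LIMITS.get((cpu_target, name))
    if limit is None:
        # No cap registered for this combination: keep the requested value.
        return requested
    return min(requested, limit)


if __name__ == "__main__":
    for software in ("QuantumESPRESSO", "CP2K"):
        print(software, capped_cores("aarch64/a64fx", software, 12))
```

With 12 requested cores, this sketch would build QuantumESPRESSO with 6 and leave CP2K at 12.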

bedroge · 2025-10-09T13:55:15Z

bot: build repo:eessi.io-2023.06-software instance:eessi-bot-deucalion for:arch=aarch64/a64fx

eessi-bot-deucalion · 2025-10-09T13:55:23Z

New job on instance eessi-bot-deucalion for repository eessi.io-2023.06-software
Building on: a64fx
Building for: aarch64/a64fx
Job dir: /home/eessibot/new-bot/jobs/2025.10/pr_106/581452

date	job status	comment
Oct 09 13:55:22 UTC 2025	submitted	job id `581452` awaits release by job manager
Oct 09 13:55:28 UTC 2025	released	job awaits launch by Slurm scheduler
Oct 09 13:56:34 UTC 2025	running	job `581452` is running
Oct 09 14:05:13 UTC 2025	finished	😢 FAILURE (click triangle for details)
    Details
    ✅ job output file `slurm-581452.out`
    ✅ no message matching `FATAL:`
    ❌ found message matching `ERROR:`
    ✅ no message matching `FAILED:`
    ✅ no message matching `required modules missing:`
    ❌ no message matching `No missing installations`
    ✅ found message matching `.tar.gz created!`
    Artefacts
    `eessi-2023.06-software-linux-aarch64-a64fx-17600183290.tar.gz`
    size: 0 MiB (21567 bytes)
    entries: 1
    modules under 2023.06/software/linux/aarch64/a64fx/modules/all
        no module files in tarball
    software under 2023.06/software/linux/aarch64/a64fx/software
        no software packages in tarball
    reprod directories under 2023.06/software/linux/aarch64/a64fx/reprod
        no reprod directories in tarball
    other under 2023.06/software/linux/aarch64/a64fx
        `2023.06/init/easybuild/eb_hooks.py`
Oct 09 14:05:13 UTC 2025	test result	😁 SUCCESS (click triangle for details)
    ReFrame Summary
    [ SKIP ] ( 1/10) Skipping test: nodes in this partition only have 30720 MiB memory available (per node) accodring to the current ReFrame configuration, but 49152 MiB is needed
    [ SKIP ] ( 2/10) Skipping test: nodes in this partition only have 30720 MiB memory available (per node) accodring to the current ReFrame configuration, but 49152 MiB is needed
    [ SKIP ] ( 3/10) Skipping test: nodes in this partition only have 30720 MiB memory available (per node) accodring to the current ReFrame configuration, but 49152 MiB is needed
    [ SKIP ] ( 4/10) Skipping test: nodes in this partition only have 30720 MiB memory available (per node) accodring to the current ReFrame configuration, but 49152 MiB is needed
    [ OK ] ( 5/10) EESSI_LAMMPS_lj %device_type=cpu %module_name=LAMMPS/29Aug2024-foss-2023b-kokkos %scale=1_node /aeb2d9df @BotBuildTests:aarch64_a64fx+default P: perf: 582.844 timesteps/s (r:0, l:None, u:None)
    [ OK ] ( 6/10) EESSI_LAMMPS_lj %device_type=cpu %module_name=LAMMPS/2Aug2023_update2-foss-2023a-kokkos %scale=1_node /04ff9ece @BotBuildTests:aarch64_a64fx+default P: perf: 583.466 timesteps/s (r:0, l:None, u:None)
    [ OK ] ( 7/10) EESSI_OSU_pt2pt_CPU %benchmark_info=mpi.pt2pt.osu_latency %module_name=OSU-Micro-Benchmarks/7.2-gompi-2023b %scale=1_node /15cad6c4 @BotBuildTests:aarch64_a64fx+default P: latency: 1.64 us (r:0, l:None, u:None)
    [ OK ] ( 8/10) EESSI_OSU_pt2pt_CPU %benchmark_info=mpi.pt2pt.osu_latency %module_name=OSU-Micro-Benchmarks/7.1-1-gompi-2023a %scale=1_node /6672deda @BotBuildTests:aarch64_a64fx+default P: latency: 1.64 us (r:0, l:None, u:None)
    [ OK ] ( 9/10) EESSI_OSU_pt2pt_CPU %benchmark_info=mpi.pt2pt.osu_bw %module_name=OSU-Micro-Benchmarks/7.2-gompi-2023b %scale=1_node /2a9a47b1 @BotBuildTests:aarch64_a64fx+default P: bandwidth: 8496.74 MB/s (r:0, l:None, u:None)
    [ OK ] (10/10) EESSI_OSU_pt2pt_CPU %benchmark_info=mpi.pt2pt.osu_bw %module_name=OSU-Micro-Benchmarks/7.1-1-gompi-2023a %scale=1_node /1b24ab8e @BotBuildTests:aarch64_a64fx+default P: bandwidth: 7820.35 MB/s (r:0, l:None, u:None)
    [ PASSED ] Ran 6/10 test case(s) from 10 check(s) (0 failure(s), 4 skipped, 0 aborted)
    Details
    ✅ job output file `slurm-581452.out`
    ❌ found message matching `ERROR:`
    ✅ no message matching `[\sFAILED\s].Ran . test case`
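The report above is a set of presence and absence checks on the job output file. A rough sketch of that logic is given below. The pattern strings are copied from the report; the function names are hypothetical, and plain substring matching is an assumption, since the report does not show how the bot actually performs the matching.

```python
# Hypothetical sketch of the pass/fail checks listed in the bot's job report.
# Patterns come from the report; everything else is illustrative.

# Patterns that must NOT occur in the job output.
MUST_BE_ABSENT = ["FATAL:", "ERROR:", "FAILED:", "required modules missing:"]
# Patterns that MUST occur in the job output.
MUST_BE_PRESENT = ["No missing installations", ".tar.gz created!"]


def check_job_output(text: str) -> dict:
    """Map each pattern to True if its check passes for the given output."""
    results = {}
    for pattern in MUST_BE_ABSENT:
        results[pattern] = pattern not in text
    for pattern in MUST_BE_PRESENT:
        results[pattern] = pattern in text
    return results


def job_succeeded(text: str) -> bool:
    """A job counts as successful only if every single check passes."""
    return all(check_job_output(text).values())
```

Under this sketch, an output that contains `ERROR:` fails even when the tarball was created, which matches the FAILURE status of job `581452`.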

bedroge · 2025-10-09T14:48:26Z

bot: build repo:eessi.io-2023.06-software instance:eessi-bot-deucalion for:arch=aarch64/a64fx

eessi-bot-deucalion · 2025-10-09T15:22:04Z

New job on instance eessi-bot-deucalion for repository eessi.io-2023.06-software
Building on: a64fx
Building for: aarch64/a64fx
Job dir: /home/eessibot/new-bot/jobs/2025.10/pr_106/581603

date	job status	comment
Oct 09 15:22:02 UTC 2025	submitted	job id `581603` awaits release by job manager
Oct 09 15:22:17 UTC 2025	released	job awaits launch by Slurm scheduler
Oct 09 15:23:23 UTC 2025	running	job `581603` is running
Oct 10 10:18:32 UTC 2025	finished	😁 SUCCESS (click triangle for details)
    Details
    ✅ job output file `slurm-581603.out`
    ✅ no message matching `FATAL:`
    ✅ no message matching `ERROR:`
    ✅ no message matching `FAILED:`
    ✅ no message matching `required modules missing:`
    ✅ found message(s) matching `No missing installations`
    ✅ found message matching `.tar.gz created!`
    Artefacts
    `eessi-2023.06-software-linux-aarch64-a64fx-17600891280.tar.gz`
    size: 4512 MiB (4731615076 bytes)
    entries: 30510
    modules under 2023.06/software/linux/aarch64/a64fx/modules/all
        `CP2K/2023.1-foss-2023a.lua`
        `Libint/2.7.2-GCC-12.3.0-lmax-6-cp2k.lua`
        `QuantumESPRESSO/7.3.1-foss-2023a.lua`
        `libvori/220621-GCCcore-12.3.0.lua`
    software under 2023.06/software/linux/aarch64/a64fx/software
        `CP2K/2023.1-foss-2023a`
        `Libint/2.7.2-GCC-12.3.0-lmax-6-cp2k`
        `QuantumESPRESSO/7.3.1-foss-2023a`
        `libvori/220621-GCCcore-12.3.0`
    reprod directories under 2023.06/software/linux/aarch64/a64fx/reprod
        no reprod directories in tarball
    other under 2023.06/software/linux/aarch64/a64fx
        `2023.06/init/easybuild/eb_hooks.py`
Oct 10 10:18:32 UTC 2025	test result	😁 SUCCESS (click triangle for details)
    ReFrame Summary
    [ SKIP ] ( 1/11) Skipping test: nodes in this partition only have 30720 MiB memory available (per node) accodring to the current ReFrame configuration, but 48332.8 MiB is needed
    [ SKIP ] ( 2/11) Skipping test: nodes in this partition only have 30720 MiB memory available (per node) accodring to the current ReFrame configuration, but 49152 MiB is needed
    [ SKIP ] ( 3/11) Skipping test: nodes in this partition only have 30720 MiB memory available (per node) accodring to the current ReFrame configuration, but 49152 MiB is needed
    [ SKIP ] ( 4/11) Skipping test: nodes in this partition only have 30720 MiB memory available (per node) accodring to the current ReFrame configuration, but 49152 MiB is needed
    [ SKIP ] ( 5/11) Skipping test: nodes in this partition only have 30720 MiB memory available (per node) accodring to the current ReFrame configuration, but 49152 MiB is needed
    [ OK ] ( 6/11) EESSI_LAMMPS_lj %device_type=cpu %module_name=LAMMPS/29Aug2024-foss-2023b-kokkos %scale=1_node /aeb2d9df @BotBuildTests:aarch64_a64fx+default P: perf: 583.705 timesteps/s (r:0, l:None, u:None)
    [ OK ] ( 7/11) EESSI_LAMMPS_lj %device_type=cpu %module_name=LAMMPS/2Aug2023_update2-foss-2023a-kokkos %scale=1_node /04ff9ece @BotBuildTests:aarch64_a64fx+default P: perf: 551.251 timesteps/s (r:0, l:None, u:None)
    [ OK ] ( 8/11) EESSI_OSU_pt2pt_CPU %benchmark_info=mpi.pt2pt.osu_latency %module_name=OSU-Micro-Benchmarks/7.2-gompi-2023b %scale=1_node /15cad6c4 @BotBuildTests:aarch64_a64fx+default P: latency: 1.71 us (r:0, l:None, u:None)
    [ OK ] ( 9/11) EESSI_OSU_pt2pt_CPU %benchmark_info=mpi.pt2pt.osu_latency %module_name=OSU-Micro-Benchmarks/7.1-1-gompi-2023a %scale=1_node /6672deda @BotBuildTests:aarch64_a64fx+default P: latency: 1.74 us (r:0, l:None, u:None)
    [ OK ] (10/11) EESSI_OSU_pt2pt_CPU %benchmark_info=mpi.pt2pt.osu_bw %module_name=OSU-Micro-Benchmarks/7.2-gompi-2023b %scale=1_node /2a9a47b1 @BotBuildTests:aarch64_a64fx+default P: bandwidth: 8851.13 MB/s (r:0, l:None, u:None)
    [ OK ] (11/11) EESSI_OSU_pt2pt_CPU %benchmark_info=mpi.pt2pt.osu_bw %module_name=OSU-Micro-Benchmarks/7.1-1-gompi-2023a %scale=1_node /1b24ab8e @BotBuildTests:aarch64_a64fx+default P: bandwidth: 8744.57 MB/s (r:0, l:None, u:None)
    [ PASSED ] Ran 6/11 test case(s) from 11 check(s) (0 failure(s), 5 skipped, 0 aborted)
    Details
    ✅ job output file `slurm-581603.out`
    ✅ no message matching `ERROR:`
    ✅ no message matching `[\sFAILED\s].Ran . test case`

bedroge · 2025-10-09T18:39:08Z

Job is still running, but the QE build just completed. The max memory usage reported by Slurm is only 1953600K, so I don't understand why it didn't work with the default settings (which should be 12 cores instead of 6?).

bedroge · 2025-10-10T10:24:26Z

bot: build repo:eessi.io-2023.06-software instance:eessi-bot-deucalion for:arch=aarch64/a64fx
bot: build repo:eessi.io-2025.06-software instance:eessi-bot-deucalion for:arch=aarch64/a64fx
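The `bot: build` lines above follow a simple `key:value` layout. As an illustration only, such a line could be split into its arguments as sketched below. The function `parse_build_command` is hypothetical, and the grammar is inferred solely from the commands in this conversation, so the real bot may accept more (or different) syntax.

```python
# Hypothetical sketch: split a "bot: build ..." comment line into key/value
# arguments. The grammar is inferred from the commands in this PR only.

def parse_build_command(line: str) -> dict:
    """Parse e.g. 'bot: build repo:X instance:Y for:arch=Z' into a dict."""
    prefix = "bot: build"
    if not line.startswith(prefix):
        raise ValueError(f"not a build command: {line!r}")
    args = {}
    for token in line[len(prefix):].split():
        # Split on the first colon only, so 'for:arch=aarch64/a64fx'
        # yields key 'for' and value 'arch=aarch64/a64fx'.
        key, sep, value = token.partition(":")
        if not sep:
            raise ValueError(f"malformed argument: {token!r}")
        args[key] = value
    return args


if __name__ == "__main__":
    comment = (
        "bot: build repo:eessi.io-2023.06-software "
        "instance:eessi-bot-deucalion for:arch=aarch64/a64fx"
    )
    print(parse_build_command(comment))
```

A comment with two such lines, like the one above, would be handled by calling the function once per line.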

eessi-bot-deucalion · 2025-10-10T10:24:34Z

New job on instance eessi-bot-deucalion for repository eessi.io-2023.06-software
Building on: a64fx
Building for: aarch64/a64fx
Job dir: /home/eessibot/new-bot/jobs/2025.10/pr_106/582025

date	job status	comment
Oct 10 10:24:33 UTC 2025	submitted	job id `582025` awaits release by job manager
Oct 10 10:24:46 UTC 2025	released	job awaits launch by Slurm scheduler
Oct 10 10:25:50 UTC 2025	running	job `582025` is running
Oct 10 10:33:33 UTC 2025	finished	😁 SUCCESS (click triangle for details)
    Details
    ✅ job output file `slurm-582025.out`
    ✅ no message matching `FATAL:`
    ✅ no message matching `ERROR:`
    ✅ no message matching `FAILED:`
    ✅ no message matching `required modules missing:`
    ✅ found message(s) matching `No missing installations`
    ✅ found message matching `.tar.gz created!`
    Artefacts
    `eessi-2023.06-software-linux-aarch64-a64fx-17600920600.tar.gz`
    size: 0 MiB (21566 bytes)
    entries: 1
    modules under 2023.06/software/linux/aarch64/a64fx/modules/all
        no module files in tarball
    software under 2023.06/software/linux/aarch64/a64fx/software
        no software packages in tarball
    reprod directories under 2023.06/software/linux/aarch64/a64fx/reprod
        no reprod directories in tarball
    other under 2023.06/software/linux/aarch64/a64fx
        `2023.06/init/easybuild/eb_hooks.py`
Oct 10 10:33:33 UTC 2025	test result	😁 SUCCESS (click triangle for details)
    ReFrame Summary
    [ SKIP ] ( 1/10) Skipping test: nodes in this partition only have 30720 MiB memory available (per node) accodring to the current ReFrame configuration, but 49152 MiB is needed
    [ SKIP ] ( 2/10) Skipping test: nodes in this partition only have 30720 MiB memory available (per node) accodring to the current ReFrame configuration, but 49152 MiB is needed
    [ SKIP ] ( 3/10) Skipping test: nodes in this partition only have 30720 MiB memory available (per node) accodring to the current ReFrame configuration, but 49152 MiB is needed
    [ SKIP ] ( 4/10) Skipping test: nodes in this partition only have 30720 MiB memory available (per node) accodring to the current ReFrame configuration, but 49152 MiB is needed
    [ OK ] ( 5/10) EESSI_LAMMPS_lj %device_type=cpu %module_name=LAMMPS/29Aug2024-foss-2023b-kokkos %scale=1_node /aeb2d9df @BotBuildTests:aarch64_a64fx+default P: perf: 581.326 timesteps/s (r:0, l:None, u:None)
    [ OK ] ( 6/10) EESSI_LAMMPS_lj %device_type=cpu %module_name=LAMMPS/2Aug2023_update2-foss-2023a-kokkos %scale=1_node /04ff9ece @BotBuildTests:aarch64_a64fx+default P: perf: 580.767 timesteps/s (r:0, l:None, u:None)
    [ OK ] ( 7/10) EESSI_OSU_pt2pt_CPU %benchmark_info=mpi.pt2pt.osu_latency %module_name=OSU-Micro-Benchmarks/7.2-gompi-2023b %scale=1_node /15cad6c4 @BotBuildTests:aarch64_a64fx+default P: latency: 1.67 us (r:0, l:None, u:None)
    [ OK ] ( 8/10) EESSI_OSU_pt2pt_CPU %benchmark_info=mpi.pt2pt.osu_latency %module_name=OSU-Micro-Benchmarks/7.1-1-gompi-2023a %scale=1_node /6672deda @BotBuildTests:aarch64_a64fx+default P: latency: 1.72 us (r:0, l:None, u:None)
    [ OK ] ( 9/10) EESSI_OSU_pt2pt_CPU %benchmark_info=mpi.pt2pt.osu_bw %module_name=OSU-Micro-Benchmarks/7.2-gompi-2023b %scale=1_node /2a9a47b1 @BotBuildTests:aarch64_a64fx+default P: bandwidth: 8461.09 MB/s (r:0, l:None, u:None)
    [ OK ] (10/10) EESSI_OSU_pt2pt_CPU %benchmark_info=mpi.pt2pt.osu_bw %module_name=OSU-Micro-Benchmarks/7.1-1-gompi-2023a %scale=1_node /1b24ab8e @BotBuildTests:aarch64_a64fx+default P: bandwidth: 8110.44 MB/s (r:0, l:None, u:None)
    [ PASSED ] Ran 6/10 test case(s) from 10 check(s) (0 failure(s), 4 skipped, 0 aborted)
    Details
    ✅ job output file `slurm-582025.out`
    ✅ no message matching `ERROR:`
    ✅ no message matching `[\sFAILED\s].Ran . test case`
Oct 10 14:59:35 UTC 2025	uploaded	transfer of `eessi-2023.06-software-linux-aarch64-a64fx-17600920600.tar.gz` to S3 bucket succeeded

eessi-bot-deucalion · 2025-10-10T10:24:39Z

New job on instance eessi-bot-deucalion for repository eessi.io-2025.06-software
Building on: a64fx
Building for: aarch64/a64fx
Job dir: /home/eessibot/new-bot/jobs/2025.10/pr_106/582030

date	job status	comment
Oct 10 10:24:38 UTC 2025	submitted	job id `582030` awaits release by job manager
Oct 10 10:24:43 UTC 2025	released	job awaits launch by Slurm scheduler
Oct 10 10:25:52 UTC 2025	running	job `582030` is running
Oct 10 10:29:09 UTC 2025	finished	😁 SUCCESS (click triangle for details)
    Details
    ✅ job output file `slurm-582030.out`
    ✅ no message matching `FATAL:`
    ✅ no message matching `ERROR:`
    ✅ no message matching `FAILED:`
    ✅ no message matching `required modules missing:`
    ✅ found message(s) matching `No missing installations`
    ✅ found message matching `.tar.gz created!`
    Artefacts
    `eessi-2025.06-software-linux-aarch64-a64fx-17600919250.tar.gz`
    size: 0 MiB (21566 bytes)
    entries: 1
    modules under 2025.06/software/linux/aarch64/a64fx/modules/all
        no module files in tarball
    software under 2025.06/software/linux/aarch64/a64fx/software
        no software packages in tarball
    reprod directories under 2025.06/software/linux/aarch64/a64fx/reprod
        no reprod directories in tarball
    other under 2025.06/software/linux/aarch64/a64fx
        `2025.06/init/easybuild/eb_hooks.py`
Oct 10 10:29:09 UTC 2025	test result	😁 SUCCESS (click triangle for details)
    ReFrame Summary
    [ PASSED ] Ran 0/0 test case(s) from 0 check(s) (0 failure(s), 0 skipped, 0 aborted)
    Details
    ✅ job output file `slurm-582030.out`
    ✅ no message matching `ERROR:`
    ✅ no message matching `[\sFAILED\s].Ran . test case`
Oct 10 14:59:27 UTC 2025	uploaded	transfer of `eessi-2025.06-software-linux-aarch64-a64fx-17600919250.tar.gz` to S3 bucket succeeded

trz42 · 2025-10-10T13:04:36Z

> Job is still running, but the QE build just completed. The max memory usage reported by Slurm is only 1953600K, so I don't understand why it didn't work with the default settings (which should be 12 cores instead of 6?).

Nodes have about 29 G of free memory for jobs. So if 6 cores use 19 G (6 * 2.5 G + 4 G), 12 cores would need 38 G (or, say, 12 * 2.5 G + 4 G = 34 G), which would be too much.

Anyhow, the true culprit has been found.
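The back-of-the-envelope estimate in this comment can be written out as a small calculation. The figures (2.5 G per core, 4 G base, about 29 G available) are the ones quoted in the comment; the linear model and the function name `estimated_memory_g` are just an illustrative reading of that estimate, not a measured memory profile.

```python
# Sketch of the rough memory estimate from the comment above. The constants
# are the commenter's figures; the linear model itself is an assumption.

PER_CORE_G = 2.5    # assumed memory per build process, in G
BASE_G = 4.0        # assumed fixed overhead, in G
AVAILABLE_G = 29.0  # approximate free memory per node, in G


def estimated_memory_g(cores: int) -> float:
    """Estimated peak memory for a parallel build with `cores` processes."""
    return cores * PER_CORE_G + BASE_G


if __name__ == "__main__":
    for cores in (6, 12):
        need = estimated_memory_g(cores)
        verdict = "fits" if need <= AVAILABLE_G else "too much"
        print(f"{cores} cores: {need} G ({verdict})")
```

Under these assumed figures, 6 cores come out at 19 G and 12 cores at 34 G, so only the 6-core build would fit in 29 G.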

bedroge · 2025-10-10T13:07:20Z

> So if 6 use 19 G (6 * 2.5 G + 4 G), using 38 G for 12 cores (or say 12 * 2.5 G + 4 G = 34 G) would be too much.

True, but 1953600 K is only 1.9 GB 😉
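For reference, the conversion behind that figure is sketched below. Whether the `K` suffix in the Slurm report means 1000 or 1024 bytes is treated as an open assumption here, so the hypothetical helper `k_to_gb` takes it as a parameter; either way the result is close to 2 GB and far below the estimates quoted above.

```python
# Sketch of the unit conversion for the reported peak memory usage.
# The meaning of "K" (1000 vs 1024 bytes) is an assumption, hence a parameter.

def k_to_gb(value_k: int, bytes_per_k: int = 1000) -> float:
    """Convert a value in 'K' units to gigabytes (10**9 bytes)."""
    return value_k * bytes_per_k / 1e9


if __name__ == "__main__":
    reported_k = 1953600  # value quoted in the comment above
    print(round(k_to_gb(reported_k, 1000), 2))  # decimal kilobytes
    print(round(k_to_gb(reported_k, 1024), 2))  # binary kibibytes
```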

ocaisa

LGTM

eb_hooks.py

boegel · 2025-10-10T17:21:20Z

staging PR merged

use max 6 cores for QE instead of CP2K

bd88835

bedroge added the a64fx label Oct 9, 2025

bedroge added 2 commits October 9, 2025 15:53

easystack with CP2K and QE

7e51280

Merge branch 'main' of github.com:EESSI/software-layer-scripts into qe_numcores_a64fx

70bc32d

add missing comma

b19f7ac

bedroge added 2 commits October 10, 2025 12:23

remove easystack

9457ce4

Merge branch 'qe_numcores_a64fx' of github.com:bedroge/software-layer-scripts into qe_numcores_a64fx

6068f82

bedroge added the ready-to-deploy label Oct 10, 2025

bedroge mentioned this pull request Oct 10, 2025

{2023.06}[2023a,a64fx] apps originally built with EB 4.9.2 - QuantumESPRESSO 7.3.1 and CP2K 2023.1 EESSI/software-layer#1233

Open

ocaisa approved these changes Oct 10, 2025

ocaisa added bot:deploy and removed ready-to-deploy labels Oct 10, 2025

boegel reviewed Oct 10, 2025

eb_hooks.py (resolved)

boegel merged commit 3a3ea5b into EESSI:main Oct 10, 2025
66 of 68 checks passed

bedroge deleted the qe_numcores_a64fx branch October 10, 2025 18:13
