Try to retrigger FFTW.MPI build issue for 2022a #374

Closed


casparvl
Collaborator

See if we can trigger a rebuild without the FFTW.MPI test hook, to see if we can retrigger the issue on the new Magic Castle cluster

DON'T MERGE THIS PR!

…e if we can retrigger the issue on the new Magic Castle cluster
@eessi-bot-aws

eessi-bot-aws bot commented Oct 26, 2023

Instance eessi-bot-mc-aws is configured to build:

  • arch x86_64/generic for repo eessi-hpc.org-2023.06-compat
  • arch x86_64/generic for repo eessi-hpc.org-2023.06-software
  • arch x86_64/intel/haswell for repo eessi-hpc.org-2023.06-compat
  • arch x86_64/intel/haswell for repo eessi-hpc.org-2023.06-software
  • arch x86_64/intel/skylake_avx512 for repo eessi-hpc.org-2023.06-compat
  • arch x86_64/intel/skylake_avx512 for repo eessi-hpc.org-2023.06-software
  • arch x86_64/amd/zen2 for repo eessi-hpc.org-2023.06-compat
  • arch x86_64/amd/zen2 for repo eessi-hpc.org-2023.06-software
  • arch x86_64/amd/zen3 for repo eessi-hpc.org-2023.06-compat
  • arch x86_64/amd/zen3 for repo eessi-hpc.org-2023.06-software
  • arch aarch64/generic for repo eessi-hpc.org-2023.06-compat
  • arch aarch64/generic for repo eessi-hpc.org-2023.06-software
  • arch aarch64/neoverse_n1 for repo eessi-hpc.org-2023.06-compat
  • arch aarch64/neoverse_n1 for repo eessi-hpc.org-2023.06-software
  • arch aarch64/neoverse_v1 for repo eessi-hpc.org-2023.06-compat
  • arch aarch64/neoverse_v1 for repo eessi-hpc.org-2023.06-software

@casparvl
Collaborator Author

bot: build repo:eessi-hpc.org-2023.06-software arch:aarch64/neoverse_v1

@eessi-bot-aws

eessi-bot-aws bot commented Oct 26, 2023

Updates by the bot instance eessi-bot-mc-aws (click for details)
  • received bot command build repo:eessi-hpc.org-2023.06-software arch:aarch64/neoverse_v1 from casparvl

    • expanded format: build repository:eessi-hpc.org-2023.06-software architecture:aarch64/neoverse_v1
  • handling command build repository:eessi-hpc.org-2023.06-software architecture:aarch64/neoverse_v1 resulted in:
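The bot's reply shows the short keys `repo:` and `arch:` being rewritten to `repository:` and `architecture:`. A minimal sketch of that expansion follows; the function name `expand_bot_command` and the `ALIASES` table are hypothetical and are not taken from the bot's code base.

```python
# Hypothetical sketch: this is NOT the EESSI bot's actual parser.
# Short keys and their expanded names, as seen in the bot's reply.
ALIASES = {"repo": "repository", "arch": "architecture"}


def expand_bot_command(comment):
    """Turn 'bot: build repo:X arch:Y' into 'build repository:X architecture:Y'."""
    prefix = "bot:"
    if not comment.startswith(prefix):
        raise ValueError("not a bot command")
    # First word after the prefix is the action, the rest are key:value filters.
    action, *filters = comment[len(prefix):].split()
    expanded = []
    for item in filters:
        key, _, value = item.partition(":")
        # Unknown keys are passed through unchanged.
        expanded.append(f"{ALIASES.get(key, key)}:{value}")
    return " ".join([action, *expanded])


print(expand_bot_command(
    "bot: build repo:eessi-hpc.org-2023.06-software arch:aarch64/neoverse_v1"
))
```

Passing unknown keys through unchanged is a design choice of this sketch only; the real bot may reject them.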

@eessi-bot-aws

eessi-bot-aws bot commented Oct 26, 2023

New job on instance eessi-bot-mc-aws for architecture aarch64-neoverse_v1 for repository eessi-hpc.org-2023.06-software in job dir /project/def-users/SHARED/jobs/2023.10/pr_374/372

| date | job status | comment |
|---|---|---|
| Oct 26 16:53:10 UTC 2023 | submitted | job id 372 awaits release by job manager |
| Oct 26 16:53:21 UTC 2023 | released | job awaits launch by Slurm scheduler |
| Oct 26 16:57:23 UTC 2023 | running | job 372 is running |
| Oct 26 17:01:27 UTC 2023 | finished | 😢 FAILURE (click triangle for details) |
Details
✅ job output file slurm-372.out
❌ found message matching ERROR:
❌ found message matching FAILED:
✅ no message matching required modules missing:
✅ found message(s) matching No missing installations
✅ found message matching .tar.gz created!
Artefacts
eessi-2023.06-software-linux-aarch64-neoverse_v1-1698339627.tar.gz
size: 0 MiB (231526 bytes)
entries: 3
modules under 2023.06/software/linux/aarch64/neoverse_v1/modules/all
no module files in tarball
software under 2023.06/software/linux/aarch64/neoverse_v1/software
no software packages in tarball
other under 2023.06/software/linux/aarch64/neoverse_v1
.lmod/cache/spiderT.lua
.lmod/cache/spiderT.luac_5.1
.lmod/cache/timestamp
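The Details checklist in the report above amounts to a handful of pattern checks on the job output file. A rough sketch of that logic, under the assumption that plain substring matching is enough: the pattern strings come from the report, while the function names and the split into "must be absent" and "must be present" are my own reading, not the bot's implementation.

```python
# Hypothetical re-creation of the checks listed in the bot's report.
# The patterns are copied from the report; everything else is an assumption.
MUST_BE_ABSENT = ["ERROR:", "FAILED:", "required modules missing:"]
MUST_BE_PRESENT = ["No missing installations", ".tar.gz created!"]


def check_job_output(text):
    """Map each pattern to True (check passed) or False (check failed)."""
    results = {}
    for pattern in MUST_BE_ABSENT:
        results[pattern] = pattern not in text
    for pattern in MUST_BE_PRESENT:
        results[pattern] = pattern in text
    return results


def job_succeeded(text):
    """A job counts as successful only if every single check passes."""
    return all(check_job_output(text).values())
```

Under this reading, job 372 is a FAILURE because the `ERROR:` and `FAILED:` checks fail, even though a tarball was created.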

@casparvl
Collaborator Author

Ok, so it is not so easy to retrigger a build that has already been ingested...

(last error: [Errno 13] Permission denied: '/cvmfs/pilot.eessi-hpc.org/versions/2023.06/software/linux/aarch64/neoverse_v1/software/FFTW.MPI/3.3.10-gompi-2022a/lib/libfftw3.a') (at easybuild/tools/filetools.py:1835 in adjust_permissions)

I guess that's because that file exists. Thus it doesn't end up in the writeable overlay, but is actually trying to change the existing (read-only) file.
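To illustrate the failure mode: changing permissions on a file that lives on a read-only mount is refused by the OS, and Python surfaces that as an `OSError` such as the `[Errno 13]` above. The helper below is a hypothetical sketch (it is not EasyBuild's `adjust_permissions`); the injectable `chmod` argument exists only so the behaviour can be tried without a read-only file system.

```python
import errno
import os


def try_adjust_permissions(path, mode, chmod=os.chmod):
    """Return None on success, or a short reason string if chmod is refused.

    Hypothetical helper for illustration; `chmod` is injectable for testing.
    """
    try:
        chmod(path, mode)
    except OSError as err:
        # EACCES is Errno 13; EPERM and EROFS are related refusals.
        if err.errno in (errno.EACCES, errno.EPERM, errno.EROFS):
            return f"{errno.errorcode[err.errno]}: {path}"
        raise
    return None


def refuse(path, mode):
    """Stand-in for a chmod call that the OS rejects with Errno 13."""
    raise PermissionError(errno.EACCES, "Permission denied", path)


print(try_adjust_permissions("/some/read-only/file", 0o644, chmod=refuse))
```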

Does anyone have suggestions on how to retrigger this build with the bot...?

…to make the current installation writeable
@casparvl
Collaborator Author

bot: build repo:eessi-hpc.org-2023.06-software arch:aarch64/neoverse_v1

@eessi-bot-aws

eessi-bot-aws bot commented Oct 27, 2023

Updates by the bot instance eessi-bot-mc-aws (click for details)
  • received bot command build repo:eessi-hpc.org-2023.06-software arch:aarch64/neoverse_v1 from casparvl

    • expanded format: build repository:eessi-hpc.org-2023.06-software architecture:aarch64/neoverse_v1
  • handling command build repository:eessi-hpc.org-2023.06-software architecture:aarch64/neoverse_v1 resulted in:

@eessi-bot-aws

eessi-bot-aws bot commented Oct 27, 2023

New job on instance eessi-bot-mc-aws for architecture aarch64-neoverse_v1 for repository eessi-hpc.org-2023.06-software in job dir /project/def-users/SHARED/jobs/2023.10/pr_374/373

| date | job status | comment |
|---|---|---|
| Oct 27 07:38:46 UTC 2023 | submitted | job id 373 awaits release by job manager |
| Oct 27 07:39:28 UTC 2023 | released | job awaits launch by Slurm scheduler |
| Oct 27 07:44:31 UTC 2023 | running | job 373 is running |
| Oct 28 07:44:46 UTC 2023 | finished | 🤷 UNKNOWN (click triangle for detailed information) |
  • Job results file _bot_job373.result does not exist in job directory or reading it failed.
  • No artefacts were found/reported.

@casparvl
Collaborator Author

bot: build repo:eessi-hpc.org-2023.06-software arch:aarch64/neoverse_v1

@eessi-bot-aws

eessi-bot-aws bot commented Oct 28, 2023

Updates by the bot instance eessi-bot-mc-aws (click for details)
  • received bot command build repo:eessi-hpc.org-2023.06-software arch:aarch64/neoverse_v1 from casparvl

    • expanded format: build repository:eessi-hpc.org-2023.06-software architecture:aarch64/neoverse_v1
  • handling command build repository:eessi-hpc.org-2023.06-software architecture:aarch64/neoverse_v1 resulted in:

@eessi-bot-aws

eessi-bot-aws bot commented Oct 28, 2023

New job on instance eessi-bot-mc-aws for architecture aarch64-neoverse_v1 for repository eessi-hpc.org-2023.06-software in job dir /project/def-users/SHARED/jobs/2023.10/pr_374/374

| date | job status | comment |
|---|---|---|
| Oct 28 17:15:04 UTC 2023 | submitted | job id 374 awaits release by job manager |
| Oct 28 17:15:27 UTC 2023 | released | job awaits launch by Slurm scheduler |
| Oct 28 17:20:29 UTC 2023 | running | job 374 is running |
| Oct 28 19:26:25 UTC 2023 | finished | 😢 FAILURE (click triangle for details) |
Details
✅ job output file slurm-374.out
❌ found message matching ERROR:
❌ found message matching FAILED:
❌ found message matching required modules missing:
❌ no message matching No missing installations
✅ found message matching .tar.gz created!
Artefacts
No artefacts were created or found.

@casparvl
Collaborator Author

casparvl commented Nov 1, 2023

Hm...

--------------------------------------------------------------
     MPI FFTW transforms passed 10 tests, 1 CPU
--------------------------------------------------------------
perl -w ../tests/check.pl --verbose --random --maxsize=10000 -c=10  --mpi "mpirun -np 2 `pwd`/mpi-bench"
Executing "mpirun -np 2 /tmp/bot/easybuild/build/FFTWMPI/3.3.10/gompi-2022a/fftw-3.3.10/mpi/mpi-bench --verbose=1   --verify 'obr[40x7' --verify 'ibr[40x7' --verify 'obc[40x7' --verify 'ibc[40x7' --verify 'ofc[40x7' --verify 'ifc[40x7' --verify 'ofr]8x10x11x7' --verify 'ifr]8x10x11x7' --verify 'obc]8x10x11x7' --verify 'ibc]8x10x11x7' --verify 'ofc]8x10x11x7' --verify 'ifc]8x10x11x7' --verify 'okd8o11x4o11x12o10v1' --verify 'ikd8o11x4o11x12o10v1' --verify 'obr[2x5x11' --verify 'ibr[2x5x11' --verify 'obc[2x5x11' --verify 'ibc[2x5x11' --verify 'ofc[2x5x11' --verify 'ifc[2x5x11' --verify 'ok[5e11x6e11x6e01x8e00v3' --verify 'ik[5e11x6e11x6e01x8e00v3' --verify 'okd9e11x13e10x7hx9o11' --verify 'ikd9e11x13e10x7hx9o11' --verify 'obr[4x6x5x8' --verify 'ibr[4x6x5x8' --verify 'obc[4x6x5x8' --verify 'ibc[4x6x5x8' --verify 'ofc[4x6x5x8' --verify 'ifc[4x6x5x8'"
obr[40x7 3.56943e-16 6.36946e-16 3.62614e-16
ibr[40x7 3.71849e-16 6.36946e-16 3.4114e-16
obc[40x7 2.69428e-16 8.49261e-16 3.69264e-16
ibc[40x7 3.18407e-16 6.36946e-16 4.38655e-16
ofc[40x7 4.01251e-16 6.36946e-16 3.78697e-16
ifc[40x7 3.21605e-16 6.36946e-16 4.37138e-16
ofr]8x10x11x7 3.64956e-16 2.53488e-15 5.81605e-16
ifr]8x10x11x7 3.77969e-16 2.89701e-15 5.1962e-16
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 1 in communicator MPI_COMM_WORLD
with errorcode 1.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
FAILED mpirun -np 2 /tmp/bot/easybuild/build/FFTWMPI/3.3.10/gompi-2022a/fftw-3.3.10/mpi/mpi-bench:  --verify 'obr[40x7' --verify 'ibr[40x7' --verify 'obc[40x7' --verify 'ibc[40x7' --verify 'ofc[40x7' --verify 'ifc[40x7' --verify 'ofr]8x10x11x7' --verify 'ifr]8x10x11x7' --verify 'obc]8x10x11x7' --verify 'ibc]8x10x11x7' --verify 'ofc]8x10x11x7' --verify 'ifc]8x10x11x7' --verify 'okd8o11x4o11x12o10v1' --verify 'ikd8o11x4o11x12o10v1' --verify 'obr[2x5x11' --verify 'ibr[2x5x11' --verify 'obc[2x5x11' --verify 'ibc[2x5x11' --verify 'ofc[2x5x11' --verify 'ifc[2x5x11' --verify 'ok[5e11x6e11x6e01x8e00v3' --verify 'ik[5e11x6e11x6e01x8e00v3' --verify 'okd9e11x13e10x7hx9o11' --verify 'ikd9e11x13e10x7hx9o11' --verify 'obr[4x6x5x8' --verify 'ibr[4x6x5x8' --verify 'obc[4x6x5x8' --verify 'ibc[4x6x5x8' --verify 'ofc[4x6x5x8' --verify 'ifc[4x6x5x8'
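One way to narrow down where the run died is to compare the `--verify` cases on the FAILED line with the cases that actually printed a result line. The sketch below does that; the function name is hypothetical. It assumes each result line starts with its case name, and it cannot rule out that output from a completed case was lost when Open MPI killed the processes.

```python
import re


def first_unreported_case(failed_line, output_lines):
    """Return the first --verify case without a result line, or None."""
    # Case names are the single-quoted arguments following each --verify.
    cases = re.findall(r"--verify '([^']+)'", failed_line)
    # A result line looks like "<case> <err1> <err2> <err3>".
    reported = {line.split()[0] for line in output_lines if line.strip()}
    for case in cases:
        if case not in reported:
            return case
    return None
```

Applied to the log above, eight cases reported results, so the first unreported one is `obc]8x10x11x7`. That is only a hint at where to look, not proof that this case is the one that aborted.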

@casparvl
Collaborator Author

casparvl commented Nov 1, 2023

bot: build repo:eessi-hpc.org-2023.06-software arch:aarch64/neoverse_v1


eessi-bot-aws bot commented Nov 1, 2023

Updates by the bot instance eessi-bot-mc-aws (click for details)
  • received bot command build repo:eessi-hpc.org-2023.06-software arch:aarch64/neoverse_v1 from casparvl

    • expanded format: build repository:eessi-hpc.org-2023.06-software architecture:aarch64/neoverse_v1
  • handling command build repository:eessi-hpc.org-2023.06-software architecture:aarch64/neoverse_v1 resulted in:


eessi-bot-aws bot commented Nov 1, 2023

New job on instance eessi-bot-mc-aws for architecture aarch64-neoverse_v1 for repository eessi-hpc.org-2023.06-software in job dir /project/def-users/SHARED/jobs/2023.11/pr_374/419

| date | job status | comment |
|---|---|---|
| Nov 01 15:01:37 UTC 2023 | submitted | job id 419 awaits release by job manager |
| Nov 01 15:02:28 UTC 2023 | released | job awaits launch by Slurm scheduler |
| Nov 01 15:07:30 UTC 2023 | running | job 419 is running |
| Nov 02 15:07:36 UTC 2023 | finished | 🤷 UNKNOWN (click triangle for detailed information) |
  • Job results file _bot_job419.result does not exist in job directory or reading it failed.
  • No artefacts were found/reported.

@boegel
Contributor

boegel commented Nov 7, 2023

@casparvl Did we reach a conclusion here?

@casparvl
Collaborator Author

Not really. As you can see, the bot is often reporting 'UNKNOWN' after 24h. Not sure what's going wrong in those cases. Does it lose sight of the job somehow?

The one time it did run there was some error in the test, but not like I've seen before (I think). It's... strange...

@boegel
Contributor

boegel commented Nov 14, 2023

The "unknown" state is what the bot will currently report when the build job has timed out after 24h, so based on the timestamps that's likely what's going on here...
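The timestamps in the bot's status tables can be checked directly. The sketch below computes the time between the "running" and "finished" rows for each job in this PR. The 24h limit is taken from the comment above, the timestamps are copied from the tables, and the helper name `runtime` is hypothetical. Note that `%b` parsing assumes an English locale, and that the "running" row marks when the bot noticed the job running, which may lag the real start.

```python
from datetime import datetime, timedelta

FMT = "%b %d %H:%M:%S UTC %Y"     # matches e.g. "Oct 27 07:44:31 UTC 2023"
TIME_LIMIT = timedelta(hours=24)  # assumed limit, per the comment above


def runtime(running, finished):
    """Elapsed time between the 'running' and 'finished' status rows."""
    return datetime.strptime(finished, FMT) - datetime.strptime(running, FMT)


# (running, finished) timestamps copied from the bot's status tables.
jobs = {
    372: ("Oct 26 16:57:23 UTC 2023", "Oct 26 17:01:27 UTC 2023"),
    373: ("Oct 27 07:44:31 UTC 2023", "Oct 28 07:44:46 UTC 2023"),
    374: ("Oct 28 17:20:29 UTC 2023", "Oct 28 19:26:25 UTC 2023"),
    419: ("Nov 01 15:07:30 UTC 2023", "Nov 02 15:07:36 UTC 2023"),
}

for job_id, (start, end) in jobs.items():
    elapsed = runtime(start, end)
    verdict = "likely timed out" if elapsed >= TIME_LIMIT else "ended on its own"
    print(job_id, elapsed, verdict)
```

Jobs 373 and 419, the two reported as UNKNOWN, both come out at just over 24 hours, while jobs 372 and 374 finished well within the limit.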

@boegel boegel changed the base branch from 2023.06 to pilot.eessi-hpc.org-2023.06 November 21, 2023 21:20
@casparvl
Collaborator Author

casparvl commented Apr 2, 2024

I'm closing this PR. It still targeted the pilot, so it's not so relevant anymore. We could try to see if we can get rid of the FFTW.MPI hook now that we reinstalled OpenMPI in software.eessi.io. I'd love to see if that's possible, but since the bot currently doesn't work, I guess it'll have to wait.

@casparvl casparvl closed this Apr 2, 2024
TopRichard added a commit to TopRichard/bot-software-layer1 that referenced this pull request May 26, 2024
….2-foss/2023a

{2023.06}[foss/2023a] WhatsHap V2.2