script to more easily inspect a build job #317

Merged: 15 commits merged into EESSI:2023.06 on Oct 17, 2023

Collaborator

@trz42 trz42 commented Aug 22, 2023

Provides the script bot/inspect.sh which should be called with the full path to a tarball containing the state of a build job's temporary storage.

Before calling the script, launch an interactive job, e.g.,

srun --time=8:00:00 -C shape=c4.2xlarge --pty /bin/bash

Then do the following to enter the build job's environment

./bot/inspect.sh --resume /mnt/shared/home/bot/eessi-bot-software-layer/jobs/2023.07/pr_304/5793/previous_tmp/build_step/eessi-2023.06-software-1689005270.tgz

This follows all the setup done for a build job until the step where eb --easystack ... is called. A message such as

Ready for inspection of build job:
 - job directory is /eessi_bot_job ($HOME), check for slurm-*.out file
 - temporary data of job available at /tmp
 - Note, prefix /cvmfs/pilot.nessi.no/versions/2023.06 is writable
 - EasyBuild v4.7.2 is available

is printed when it's ready to be used.

Testing & feedback welcome. Would like to keep it as WIP for a few days to allow for polishing and improvements.

eessi-bot-aws bot commented Aug 22, 2023

Instance eessi-bot-citc-aws is configured to build:

  • arch x86_64/generic for repo eessi-2021.12
  • arch x86_64/generic for repo eessi-2023.06-compat
  • arch x86_64/generic for repo eessi-2023.06-software
  • arch x86_64/intel/haswell for repo eessi-2021.12
  • arch x86_64/intel/haswell for repo eessi-2023.06-compat
  • arch x86_64/intel/haswell for repo eessi-2023.06-software
  • arch x86_64/intel/skylake_avx512 for repo eessi-2021.12
  • arch x86_64/intel/skylake_avx512 for repo eessi-2023.06-compat
  • arch x86_64/intel/skylake_avx512 for repo eessi-2023.06-software
  • arch x86_64/amd/zen2 for repo eessi-2021.12
  • arch x86_64/amd/zen2 for repo eessi-2023.06-compat
  • arch x86_64/amd/zen2 for repo eessi-2023.06-software
  • arch x86_64/amd/zen3 for repo eessi-2021.12
  • arch x86_64/amd/zen3 for repo eessi-2023.06-compat
  • arch x86_64/amd/zen3 for repo eessi-2023.06-software
  • arch aarch64/generic for repo eessi-2021.12
  • arch aarch64/generic for repo eessi-2023.06-compat
  • arch aarch64/generic for repo eessi-2023.06-software
  • arch aarch64/neoverse_n1 for repo eessi-2021.12
  • arch aarch64/neoverse_n1 for repo eessi-2023.06-compat
  • arch aarch64/neoverse_n1 for repo eessi-2023.06-software
  • arch aarch64/neoverse_v1 for repo eessi-2021.12
  • arch aarch64/neoverse_v1 for repo eessi-2023.06-compat
  • arch aarch64/neoverse_v1 for repo eessi-2023.06-software

@trz42 trz42 marked this pull request as draft August 22, 2023 13:08
@casparvl
Collaborator

I've played with this a bit, but wanted to do so in my own homedir. I think it would be good to see what the minimal 'unit' is we need to give to someone to make this work. From my attempt in #310 (comment) I'm assuming it needs the tarball, and the config dir.

The issue is that in this script, you make the assumption that the complete build tree looks the same. E.g.

job_dir=$(dirname $(dirname $(dirname ${resume_tgz})))

It would be nice if that were simply configurable. Then we (people who have access to the bot account) could give others (without access) only the necessary things: config dir & tarball. And from there, they could resume.

Though I haven't yet had the time to read through the code in detail, I expect the only thing this would need is an argument to point to the location of the cfg dir.

But maybe it's not so easy...?
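To make the suggestion concrete, here is a minimal sketch of how the job directory could be made overridable while keeping the current behaviour as the default. The function names and the idea of passing an explicit directory as a second argument are assumptions for illustration only; they are not taken from bot/inspect.sh.

```shell
#!/usr/bin/env bash
# Sketch only: resolve_job_dir and default_job_dir are hypothetical helpers,
# not functions that exist in bot/inspect.sh.

# Derive the job directory from the tarball path, assuming the bot's layout
# <job_dir>/previous_tmp/build_step/<tarball>.tgz
default_job_dir() {
  local resume_tgz="$1"
  dirname "$(dirname "$(dirname "${resume_tgz}")")"
}

# Prefer an explicitly given job directory; otherwise fall back to the
# directory derived from the tarball location.
resolve_job_dir() {
  local resume_tgz="$1"
  local explicit_job_dir="${2:-}"
  if [[ -n "${explicit_job_dir}" ]]; then
    echo "${explicit_job_dir}"
  else
    default_job_dir "${resume_tgz}"
  fi
}
```

With something like this, a user who only received the tarball and a copy of the config dir could point the script at their own directory, while the bot account keeps using the derived path.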

@casparvl
Collaborator

casparvl commented Aug 28, 2023

Ok, let me log what I do here, just to have written down somewhere how I debug #310 as an example:

  • Check out your feature branch from the repo, and in that checkout run ./bot/inspect.sh --resume /mnt/shared/home/bot/eessi-bot-software-layer/jobs/2023.08/pr_310/6355/previous_tmp/build_step/eessi-2023.06-software-1691495231.tgz
  • I checked the EasyBuild logs at /tmp/eb-vnem_h5_/easybuild-1ni1a4ed.log. This requires a bit of searching, but they are always in /tmp/eb-*, so should be doable to find them.
  • Dump the environment script for the failing software with eb FFTW.MPI-3.3.10-gompi-2022a.eb --dump-env-script, as it is failing in the test suite. Similarly, if I wanted to mimic the build, I could have just run eb FFTW.MPI-3.3.10-gompi-2022a.eb, or eb --easystack eessi-2023.06-eb-4.7.2-2022a.yml (though this may take a bit longer as it needs to figure out what was installed already).
  • Check which test step command failed, e.g. from the bottom we search for Starting test step and find the command that is being run:
running cmd  export OMPI_MCA_rmaps_base_oversubscribe=true &&   make check  (in /tmp/bot/easybuild/build/FFTWMPI/3.3.10/gompi-2022a/fftw-3.3.10)
  • cd /tmp/bot/easybuild/build/FFTWMPI/3.3.10/gompi-2022a/fftw-3.3.10, source FFTW.MPI-3.3.10-gompi-2022a.env and run the test command export OMPI_MCA_rmaps_base_oversubscribe=true && make check

(and, of course, everything runs like a charm now... so much for reproducing my issue :D)
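The log-searching part of the steps above could be scripted roughly as follows. This is a sketch under assumptions: it relies only on what is described here (logs living under /tmp/eb-*, and the strings "Starting test step" and "running cmd" appearing in the log), and the helper names latest_eb_log and last_test_cmd are made up for illustration.

```shell
#!/usr/bin/env bash
# Sketch only: both helpers are hypothetical, not part of bot/inspect.sh.

# Print the most recently modified EasyBuild log below a tmp root
# (default /tmp), matching eb-*/easybuild-*.log.
latest_eb_log() {
  local tmp_root="${1:-/tmp}"
  ls -t "${tmp_root}"/eb-*/easybuild-*.log 2>/dev/null | head -n 1
}

# Print the first "running cmd" line that follows the last
# "Starting test step" line in the given log.
last_test_cmd() {
  local log="$1"
  awk '
    /Starting test step/ { seen = 1; cmd = "" }
    seen && /running cmd/ && cmd == "" { cmd = $0 }
    END { if (cmd != "") print cmd }
  ' "${log}"
}

# Possible usage inside the inspection environment:
#   log="$(latest_eb_log /tmp)"
#   last_test_cmd "${log}"
```

The printed line contains both the command and the build directory it ran in, which is what is needed to cd there and rerun it by hand.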

@casparvl casparvl mentioned this pull request Aug 28, 2023
bot/inspect.sh Outdated
echo_green ">> MODULEPATH set up: ${MODULEPATH}"
fi

eb_version='4.7.2'
Contributor

Ideally this is an option to the script, since we'll soon be using a newer EasyBuild version for 2023.06.

Maybe we shouldn't be loading an EasyBuild module at all here (at least by default).

Collaborator Author

Agree. Should not be set to a specific version because different ones may be used by the same job.

I think it would be best to skip this, and change the comments with instructions to load EasyBuild.
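One way this could look, as a rough sketch: keep the version empty by default so no EasyBuild module is loaded, and only act on a version when the user passes one. The option name --easybuild-version and the helper names are assumptions for illustration, and the module name EasyBuild/<version> is assumed to follow the usual naming.

```shell
#!/usr/bin/env bash
# Sketch only: the option name and helpers below are hypothetical.

eb_version=""   # empty by default: do not load any EasyBuild module

# Parse only the hypothetical EasyBuild option; other arguments are skipped
# here (the real script would handle them in its own case branches).
parse_args() {
  while [[ $# -gt 0 ]]; do
    case "$1" in
      -e|--easybuild-version)
        if [[ $# -lt 2 ]]; then
          echo "ERROR: $1 requires a version argument" >&2
          return 1
        fi
        eb_version="$2"
        shift 2
        ;;
      *)
        shift
        ;;
    esac
  done
}

# Print either the module command to run or an instruction for the user.
print_eb_hint() {
  if [[ -n "${eb_version}" ]]; then
    echo "module load EasyBuild/${eb_version}"
  else
    echo "No EasyBuild module loaded; load the version used by the job manually"
  fi
}
```

This keeps the script independent of any particular EasyBuild release, which matters if one job uses several versions.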

Collaborator

@casparvl casparvl left a comment

As mentioned, it might be useful if one can also drive this script in a batch job. For that to be useful, we need to be able to run commands in the container. I added:

  echo "  -c | --command COMMAND -  command to execute inside the container, in the prefix environment"
...
    -c|--command)
      export run_in_prefix="${2}"
      shift 2
      ;;
...
./eessi_container.sh "${CMDLINE_ARGS[@]}" \
                     -- "${EESSI_COMPAT_LAYER_DIR}/startprefix" <<< "${run_in_prefix}"

which works, but I haven't tested how it behaves if no --command is passed. Might have to do some more elegant if-else there to avoid passing in ${run_in_prefix} if it isn't defined.
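A possible shape for that if-else, as a sketch rather than a tested change to the script: wrap the container call in a small helper that feeds the command on stdin only when one was given, and otherwise leaves stdin alone so the session stays interactive. The helper name run_in_container is an assumption for illustration.

```shell
#!/usr/bin/env bash
# Sketch only: run_in_container is a hypothetical helper.
# First argument: command string to run in the prefix (may be empty).
# Remaining arguments: the program to launch, with its own arguments.
run_in_container() {
  local run_in_prefix="${1:-}"
  shift || true
  if [[ -n "${run_in_prefix}" ]]; then
    # A command was given: pass it on stdin via a here-string.
    "$@" <<< "${run_in_prefix}"
  else
    # No command given: keep stdin untouched for interactive use.
    "$@"
  fi
}

# Possible usage in the script (not executed here):
#   run_in_container "${run_in_prefix:-}" \
#     ./eessi_container.sh "${CMDLINE_ARGS[@]}" \
#     -- "${EESSI_COMPAT_LAYER_DIR}/startprefix"
```

Using "${run_in_prefix:-}" in the call keeps it safe even if the variable was never set, for example when the script runs with set -u.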

Caspar van Leeuwen and others added 7 commits September 26, 2023 10:03
@trz42 trz42 marked this pull request as ready for review September 29, 2023 19:46
@trz42 trz42 changed the title [WIP] script to more easily inspect a build job script to more easily inspect a build job Sep 29, 2023
Collaborator

@casparvl casparvl left a comment

I've used this a couple of times by now. Works like a charm for me. Lgtm!

@casparvl casparvl merged commit 3b55b9e into EESSI:2023.06 Oct 17, 2023
121 checks passed
trz42 pushed a commit to trz42/software-layer that referenced this pull request Apr 12, 2024
let bot run EESSI test suite after building

4 participants