Slurm test suite

Vision: This repo contains a Slurm test suite that can be used to test new hardware automatically. This includes checking the availability of modules and, at a later stage, testing more complex programs and parallelization.

Project structure

slurm_test_suite
-all_tests (contains a folder for each individual module to test, e.g. java, python, gcc, etc.)
-logs (contains the Slurm output files of the individual modules' Slurm test jobs)
-logs_archive (contains the .tar archives of the log files from the logs folder)
-module_output.txt (pass or fail information from the Slurm scripts of the individual modules is stored here after each run)
-test-runner-script.sh (the main bash script and the starting point of the project)
-config.cfg (configuration file for sbatch parameters and for selecting which modules to test)

1. Main bash runner script

The main bash script that runs all subsequent jobs is named test-runner-script.sh. It submits the Slurm jobs that check whether the individual modules are present. The script is run as a normal bash script; each module (java, python, etc.) is tested sequentially by its own Slurm batch file, and each Slurm batch test file produces a pass/fail output. test-runner-script.sh also checks whether the logs folder is full; if it is, all logs inside are archived into the logs_archive folder and the logs folder is emptied. Finally, the script takes command line arguments to specify whether only specific modules or all modules should be run. A rough sketch of this flow is shown below.
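The following is a minimal sketch of what such a runner loop could look like. It is illustrative only: the module list, the log-count threshold, and the all_tests/<module>/<module>.sbatch file layout are assumptions, not the actual implementation.

#!/bin/bash
# Hypothetical sketch of the runner loop, not the actual test-runner-script.sh.

MODULES="java python gcc"   # assumption: really derived from config.cfg or CLI arguments

# Archive and empty the logs folder when it is considered full (threshold is an assumption)
if [ "$(ls logs | wc -l)" -ge 100 ]; then
    tar -cf "logs_archive/logs_$(date +%Y%m%d%H%M%S).tar" logs
    rm -f logs/*
fi

for mod in $MODULES; do
    # Submit the module's Slurm batch file and capture the job id
    job_id=$(sbatch --parsable "all_tests/$mod/$mod.sbatch")
    echo "Submitted $mod as Slurm job $job_id"
done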

After running this file, some basic information is displayed first, such as the hostname of the frontend from which it was invoked and whether the logs folder still has space. Then the information for each module test is displayed: the id of the submitted Slurm job, the compute node where the job ran, and whether the job passed or failed. All failed tests are listed in the command line output. For detailed information about a module test, check the corresponding log file in the logs folder.

A sample of the command line output after the gcc module passed all of its tests is shown in image.png.

2. How to run the project

The entire project can be run by just calling test-runner-script.sh from the command line:

sh test-runner-script.sh

If no command line arguments are specified, the configuration file (config.cfg) is read line by line and all non-empty keys starting with sbatch are added as parameters to the sbatch command used for testing the modules. Every module whose key starting with module is set to the value 1 (all other values are ignored) will be executed; if the modules_all key is set to 1, all modules are tested and the remaining module keys are ignored. A sketch of this filtering is given below.
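As a rough illustration only, the key filtering described above could be implemented along these lines, assuming key=value lines in config.cfg; the variable names and the exact parsing logic are assumptions, not the actual script:

# Hypothetical sketch of the config filtering, not the actual implementation
SBATCH_ARGS=""
MODULES=""
while IFS='=' read -r key value; do
    case "$key" in
        modules_all) [ "$value" = "1" ] && MODULES="all" ;;   # overrides the individual module keys
        module_*)    [ "$value" = "1" ] && MODULES="$MODULES ${key#module_}" ;;
        sbatch*)     [ -n "$value" ] && SBATCH_ARGS="$SBATCH_ARGS $value" ;;
    esac
done < config.cfg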

sh test-runner-script.sh --default

For this command, the configuration file (config.cfg) is read line by line and all non-empty keys starting with def_sbatch are added as parameters to the sbatch command used for testing all the modules.

sh test-runner-script.sh java python git

Specific modules can also be tested quickly by passing them as command line arguments. With the command above, only the java, python and git modules will be run, in that order. The command line argument value for each module is listed in section 4 below.

3. Configuration file

All non-empty keys having sbatch as part of the name are added as parameters to the sbatch command used for testing the modules (only the def_sbatch keys are used with the --default command line argument; see the section above). Every module whose key starting with module is set to the value 1 (all other values are ignored) will be executed; if the modules_all key is set to 1, all modules are tested and the remaining module keys are ignored (with the --default command line argument, all modules are tested irrespective of the existing key-value pairs in the config file). An example is sketched below.
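A hypothetical config.cfg excerpt illustrating this scheme; the key names and values below are assumptions derived from the description above, not copied from the actual file:

# sbatch parameters for a normal run (non-empty values are passed to sbatch)
sbatch_partition=--partition=medium
sbatch_time=--time=00:10:00

# sbatch parameters used only when --default is given
def_sbatch_partition=--partition=medium
def_sbatch_time=--time=00:05:00

# module selection: 1 = test this module, any other value = skip
modules_all=0
module_java=1
module_python=1
module_gcc=0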

4. Available module test and command-line-argument values for each module

All the modules that are checked within this project are listed below. The name of each module is also its command line argument value (e.g. the command line argument for Java is java):

  • gcc
  • python
  • java
  • matlab
  • git
  • go
  • r
  • jq (just avail test)
  • octave
  • rclone (just avail test)
  • samtools
  • spark
  • singularity
  • valgrind
  • gromacs
  • bwa
  • bowtie2
  • gatk
  • mafft

5. Output file

The text file named module_output.txt stores pass/fail information for each individual module after it is run from the respective bash files within the module folders (the files inside the all_tests folder). The file is maintained so that multiple pieces of information can be propagated from the child bash files to the parent bash script. It has a key-value structure; the keys and their associated values are as follows:

  1. hostname: The compute node hostname where the Slurm job for a particular module runs
  2. module_check_status: A binary string recording which tests failed and which passed. A 1 or 0 at the nth position denotes that the nth test for the module failed or passed, respectively (see the example below).
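For illustration, after a module with three tests has run, the file might contain something like the following (the hostname and the key=value formatting are assumptions; only the key names come from the list above):

hostname=node042
module_check_status=010

Here 010 would mean that the second of three tests failed while the first and third passed.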

6. Logs

After running test-runner-script.sh, logs are generated for each individual module inside the logs folder.

7. Timing

This is not implemented yet. In the future, the execution times of more complex programs could be measured.

8. Parallelisation

This is not implemented yet. In the future, the work done in scalable-ai could be integrated as complex test cases.

9. Problems with some modules

During the development of the tests for some modules, certain problems were encountered that blocked progress. The affected modules are discussed below:

  1. git: The folder containing the git repository could not be deleted even though the repository itself was deleted (the .git directory was removed). The error message showed that the folder was still in use, and a '.nfs' file had been created inside it.
  2. octave: Executing this module was problematic because the usual way of running it kept freezing. An Octave test named basicscript.m was created and executed with octave basicscript.m, but the execution froze every time after the command was entered.
  3. spark: The official GWDG documentation is out of date, so no documentation could be found describing how to set up Spark and run a .scala file.
  4. openfoam: The module cannot be loaded with module load openfoam.
  5. gromacs: After loading the module with module load gromacs, the gromacs functionality cannot be used via the gmx command. All documentation found online uses gmx, but here it does not seem to work. Also, no proper GWDG-specific documentation was found.
  6. openmpi: A segmentation fault was encountered while running a basic OpenMPI program that prints the name and rank of the available processors. The error only occurs when gcc and openmpi are loaded together; loading openmpi on its own works:
     module load gcc openmpi -> mpicc --version = 10.2 -> segmentation fault in the basic OpenMPI program
     module load openmpi -> mpicc --version = 9.3.0 -> works
