Trouble running Cactus on a cluster with Slurm #261
Comments
If you can attach the stack traces from where the write errors occurred, we can probably offer advice.
We don't have Slurm here, so we don't know the details of how to use it; however, sinfo is supposed to be part of Slurm, so I have no idea why it isn't found.
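As a quick sanity check, it can help to confirm that the Slurm client commands Toil shells out to (such as sinfo and sbatch) are on the PATH of whatever environment actually launches cactus; inside a container they usually won't be. A minimal check, assuming a standard Slurm client install:

which sinfo sbatch squeue sacct
sinfo --version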
Brett Chapman <notifications@github.com> writes:
Hi
I currently have access to a cluster on the cloud managed by OpenStack (https://nimbus.pawsey.org.au/)
I've been trying to get Cactus to run with Slurm across all nodes, without much success.
I have been following examples such as this one from Harvard, which uses Singularity: https://informatics.fas.harvard.edu/cactus-on-the-fasrc-cluster.html. It runs successfully on a single node, but when I modify it to run across multiple nodes with my own larger dataset (I'm currently testing with only 3 nodes: -N 3 -n 24), I get errors because multiple nodes write to the same mkfs.ext3 image, which isn't allowed.
I have also tried running using only Docker, with the following sbatch script:
#!/bin/bash
#SBATCH --nodes=1
# allow use of all the memory on the node
#SBATCH --ntasks-per-node=8
#SBATCH --mem=0
# request all CPU cores on the node
#SBATCH --exclusive
# Customize --time --partition as appropriate
#SBATCH --partition=debug
set -o nounset -o errexit -o xtrace
readonly workingPath=/data
srun -n 1 sudo docker run -v ${workingPath}:/data --rm quay.io/comparative-genomics-toolkit/cactus:latest cactus jobStore /data/evolverMammals.txt /data/evolverMammals.hal --workDir /data/tmp/ --binariesMode local --batchSystem Slurm
But in this case I have write permission problems.
Then, if I bypass sbatch and srun and simply run on the master node, I get file-not-found errors when I specify Slurm as the batch system:
FileNotFoundError: [Errno 2] No such file or directory: 'sinfo': 'sinfo'
Does anyone have any ideas on what I'm doing wrong or some of their own sbatch scripts they would be willing to share?
Thank you for any help you can provide.
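A note on the write-permission problems from the Docker run quoted above: by default the container runs as root, and depending on how /data is exported (for example NFS root squashing) root inside the container may not be able to write to the bind mount, or the files it does write end up root-owned. One thing that sometimes helps, untested with the Cactus image, is running the container as the invoking user:

srun -n 1 sudo docker run --user "$(id -u):$(id -g)" -v ${workingPath}:/data --rm quay.io/comparative-genomics-toolkit/cactus:latest cactus jobStore /data/evolverMammals.txt /data/evolverMammals.hal --workDir /data/tmp/ --binariesMode local --batchSystem Slurm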
Thanks. I'll paste my sbatch script and the stack trace output for both the Singularity error and the error from running Docker on the master node.

sbatch script (modified from https://informatics.fas.harvard.edu/cactus-on-the-fasrc-cluster.html, using Singularity):

#!/bin/bash
# allow use of all the memory on the node
#SBATCH --ntasks-per-node=8
# request all CPU cores on the node
#SBATCH --exclusive
# Customize --time --partition as appropriate
#SBATCH --partition=debug
set -o nounset -o errexit -o xtrace
readonly CACTUS_IMAGE=/data/cactus_builds/cactus_latest.sif
readonly CACTUS_SCRATCH=/data/scratch/cactus-${SLURM_JOB_ID}
if [ ! -e "${JOBSTORE_IMAGE}" ]
mkdir -m 700 -p ${CACTUS_SCRATCH}/tmp
sudo singularity exec --overlay ${JOBSTORE_IMAGE} ${CACTUS_IMAGE} sudo mkdir -p /cactus/workDir
rm -rf ${CACTUS_SCRATCH} jobStore.img

Stack trace:
Allocating group tables: done
The next error is from running Cactus within Docker directly on the master node with the batch system set to Slurm. Just before running it, I run sinfo to show that Slurm is set up correctly and running:

ubuntu@node-0:/data$ sinfo

The first error, with Singularity, appears to be an issue with how Singularity's overlay option handles concurrent writes to an image from multiple nodes, so I'm not sure how to work around it other than to skip the image and write to a directory instead. The /data directory is shared across all nodes and is mounted on the master node. However, I'm not very familiar with Singularity; I've been more accustomed to using Docker.

The next error, with Docker, may be because I'm running within Docker rather than from a local install. I'm not sure how the Docker version of Cactus interacts with the locally installed job management system; it looks like it can't see Slurm outside of Docker.

Thanks for your help.
I don't know Singularity, but if jobStore.img contains the jobStore, a lot of processes will be writing to it.
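One way to avoid funnelling every jobStore write through a single overlay image would be to point the jobStore (the first positional argument to cactus) at a directory on the shared filesystem instead; a sketch only, assuming /data is the mount shared by all nodes and Cactus is otherwise invoked as before:

cactus /data/jobStore /data/evolverMammals.txt /data/evolverMammals.hal --workDir /data/tmp/ --batchSystem Slurm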
Thanks. I managed to sort out the Slurm issue: I had not set up SlurmDBD, and it appears Cactus expected to access the database. Once SlurmDBD was set up, the errors went away.

Regarding concurrent writes to an ext3 image, I received feedback on the issue I posted on the Singularity GitHub repo. They suggested not using Toil with Slurm, as is also mentioned on the Cactus repo; the recommendation is to use a single node with enough cores, since most file systems cannot handle so many writes to the jobStore. However, I do not have access to a single node that is big enough. The biggest node I can use right now has 16 cores and 64 GB RAM, and I have previously run Cactus on it with 3 large genomes (~5 Gbp each); the job failed due to the limited number of processors.

Does anyone have a suggestion for a workaround? I could limit the number of additional nodes to around 3 to 5 to reduce the stress on the file system, and I could also try aligning only portions of the genomes in each Cactus run (say chr1, chr2, chr3, etc. per job). The genomes are of the same species (different cultivars), so the differences between genomes will be limited to SNPs and indels across the same chromosomes; there are unlikely to be very large structural differences, as the divergence time between the cultivars is <10,000 years.
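If the worry is how many jobs hit the shared filesystem at once, Toil's resource caps can limit concurrency; a hedged sketch, assuming the standard Toil --maxCores option (which Cactus passes through to Toil) and three worker nodes of 16 cores each:

cactus jobStore /data/evolverMammals.txt /data/evolverMammals.hal --batchSystem Slurm --binariesMode local --maxCores 48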
I'm still getting errors when I try to run Cactus from Docker and specify Slurm as the batch system:

ubuntu@node-0:/data$ docker run -v $(pwd):/data --rm -it quay.io/comparative-genomics-toolkit/cactus:latest cactus jobStore /data/evolverMammals.txt /data/evolverMammals.hal --workDir /data/tmp/ --binariesMode local --batchSystem Slurm

When I run the locally installed Cactus and specify docker as the binariesMode, it runs, but then complains that the batch system doesn't support auto-deployment. I'll try installing Cactus on all nodes. Regarding running in Docker with Slurm as the batch system, is there a way to get Docker to see the host system and executables such as sacct and sinfo?
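In principle the host's Slurm client tools could be bind-mounted into the container, but this is fragile: the binaries also need the host's Slurm and munge libraries, the slurm.conf, and the munge socket, and none of this has been tested with the Cactus image. A purely hypothetical sketch, with paths assumed from a typical Ubuntu Slurm install:

docker run --rm -it \
  -v $(pwd):/data \
  -v /usr/bin/sinfo:/usr/bin/sinfo:ro \
  -v /usr/bin/sbatch:/usr/bin/sbatch:ro \
  -v /usr/bin/squeue:/usr/bin/squeue:ro \
  -v /usr/bin/sacct:/usr/bin/sacct:ro \
  -v /etc/slurm:/etc/slurm:ro \
  -v /run/munge:/run/munge \
  quay.io/comparative-genomics-toolkit/cactus:latest \
  cactus jobStore /data/evolverMammals.txt /data/evolverMammals.hal --batchSystem Slurm

The more common pattern is the reverse: run cactus natively on the host, where sinfo and sacct already work, and let it launch its tools in containers via --binariesMode docker or singularity, which is what the later comments here move towards.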
Hi, I've installed Cactus on all nodes, and when I run Cactus it still complains about the batch system not supporting auto-deployment:

cactus jobStore /data/evolverMammals.txt /data/evolverMammals.hal --batchSystem Slurm --binariesMode docker --root mr

Running directly on the master node gives the same result, as does running with sbatch and this script:

#!/bin/bash
# allow use of all the memory on the node
#SBATCH --ntasks-per-node=8
# request all CPU cores on the node
#SBATCH --exclusive
# Customize --time --partition as appropriate
#SBATCH --partition=debug
set -o nounset -o errexit -o xtrace
source ~/cactus_env/bin/activate
cactus jobStore /data/evolverMammals.txt /data/evolverMammals.hal --batchSystem Slurm --binariesMode docker --root mr

You can see cactus is on every node:

(cactus_env) ubuntu@node-0:/data$ which cactus

Whether or not I set up a virtual environment (cactus_env) to load on each node makes no difference. I also checked that the directory and scripts Toil looks for in the error message exist. They do exist on every node:

(cactus_env) ubuntu@node-0:/data$ ls -l /usr/local/lib/python3.8/dist-packages/cactus/progressive/cactus_progressive*
(cactus_env) ubuntu@node-0:/data$ pdsh -a ls -l /usr/local/lib/python3.8/dist-packages/cactus/progressive/cactus_progressive*
node-1: -rw-r--r-- 1 root staff 22182 Jun 16 02:33 /usr/local/lib/python3.8/dist-packages/cactus/progressive/cactus_progressive.py
node-2: -rw-r--r-- 1 root staff 22182 Jun 16 02:33 /usr/local/lib/python3.8/dist-packages/cactus/progressive/cactus_progressive.py
I've managed to get Cactus working with Slurm by ignoring the batch system parameter and instead splitting the work up per node, with one job per chromosome. It isn't ideal, but it's likely the only way I'll get it running. Each chromosome is roughly 600 Mbp and there are 7 chromosomes. I'm comparing 3 genomes, but this will likely scale up to many more genomes in the future, which probably won't be feasible with my current limited compute resources.

I submit the job to a single node using sbatch, and repeat for subsequent chromosomes:

#!/bin/bash
source ~/cactus_env/bin/activate
cactus jobStore /data/wild_barley/chr1/wildBarley_chr1.txt /data/wild_barley/chr1/wildBarley_chr1.hal --workDir /data/tmp/ --binariesMode docker --root Anc0
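For what it's worth, the per-chromosome submissions could be scripted with a small loop; a sketch only, assuming one seqFile and output HAL per chromosome laid out as in the paths above, and a separate jobStore per job so they don't collide:

#!/bin/bash
# submit one single-node Cactus job per chromosome
for chr in chr1 chr2 chr3 chr4 chr5 chr6 chr7; do
  sbatch --nodes=1 --exclusive --partition=debug --wrap \
    "source ~/cactus_env/bin/activate && cactus jobStore_${chr} /data/wild_barley/${chr}/wildBarley_${chr}.txt /data/wild_barley/${chr}/wildBarley_${chr}.hal --workDir /data/tmp/ --binariesMode docker --root Anc0"
done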