Trouble running Cactus on a cluster with Slurm #261
Comments
If you can attach the stack traces from where the write errors occurred, we can probably offer advice.
We don't have Slurm here, so we don't know the details of how to use it; however, sinfo is supposed to be part of Slurm, so I have no idea why it isn't found.
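As a quick sanity check, it can help to confirm that the Slurm client commands Toil shells out to (such as sinfo and sbatch) are on the PATH of whatever environment actually launches cactus; inside a container they usually won't be. A minimal check, assuming a standard Slurm client install:

which sinfo sbatch squeue sacct
sinfo --version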
Brett Chapman <notifications@github.com> writes:
Hi
I currently have access to a cluster on the cloud managed by OpenStack (https://nimbus.pawsey.org.au/)
I've been trying to get Cactus to run with Slurm across all nodes, without much success.
I have been following examples such as this one from Harvard, which uses Singularity: https://informatics.fas.harvard.edu/cactus-on-the-fasrc-cluster.html. It runs successfully on a single node, but when I modify it to run across multiple nodes with my own larger dataset (I'm currently testing with only 3 nodes: -N 3 -n 24), I get errors because multiple nodes write to the same mkfs.ext3 image, which isn't allowed.
I have also tried running using only Docker, with the following sbatch script:
#!/bin/bash
#SBATCH --nodes=1
# allow use of all the memory on the node
#SBATCH --ntasks-per-node=8
#SBATCH --mem=0
# request all CPU cores on the node
#SBATCH --exclusive
# Customize --time --partition as appropriate
#SBATCH --partition=debug
set -o nounset -o errexit -o xtrace
readonly workingPath=/data
srun -n 1 sudo docker run -v ${workingPath}:/data --rm quay.io/comparative-genomics-toolkit/cactus:latest cactus jobStore /data/evolverMammals.txt /data/evolverMammals.hal --workDir /data/tmp/ --binariesMode local --batchSystem Slurm
But in this case I have write permission problems.
Then, if I bypass sbatch and srun and simply run on the master node, I get file-not-found errors when I specify Slurm as the batch system:
FileNotFoundError: [Errno 2] No such file or directory: 'sinfo': 'sinfo'
Does anyone have any ideas on what I'm doing wrong or some of their own sbatch scripts they would be willing to share?
Thank you for any help you can provide.
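A note on the write-permission problems from the Docker run quoted above: by default the container runs as root, and depending on how /data is exported (for example NFS root squashing) root inside the container may not be able to write to the bind mount, or the files it does write end up root-owned. One thing that sometimes helps, untested with the Cactus image, is running the container as the invoking user:

srun -n 1 sudo docker run --user "$(id -u):$(id -g)" -v ${workingPath}:/data --rm quay.io/comparative-genomics-toolkit/cactus:latest cactus jobStore /data/evolverMammals.txt /data/evolverMammals.hal --workDir /data/tmp/ --binariesMode local --batchSystem Slurm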
Thanks. I'll paste my sbatch script and the stack trace output for both the Singularity error and the error from running Docker on the master node.

sbatch script (modified from https://informatics.fas.harvard.edu/cactus-on-the-fasrc-cluster.html, using Singularity):

#!/bin/bash
# allow use of all the memory on the node
#SBATCH --ntasks-per-node=8
# request all CPU cores on the node
#SBATCH --exclusive
# Customize --time --partition as appropriate
#SBATCH --partition=debug
set -o nounset -o errexit -o xtrace
readonly CACTUS_IMAGE=/data/cactus_builds/cactus_latest.sif
readonly CACTUS_SCRATCH=/data/scratch/cactus-${SLURM_JOB_ID}
if [ ! -e "${JOBSTORE_IMAGE}" ]
mkdir -m 700 -p ${CACTUS_SCRATCH}/tmp
sudo singularity exec --overlay ${JOBSTORE_IMAGE} ${CACTUS_IMAGE} sudo mkdir -p /cactus/workDir
rm -rf ${CACTUS_SCRATCH} jobStore.img

Stack trace:
Allocating group tables: done
The next error is from running Cactus within Docker directly on the master node with the batch system set to Slurm. Just before running it, I run sinfo to show that Slurm is set up correctly and running:

ubuntu@node-0:/data$ sinfo

The first error, with Singularity, appears to be an issue with how Singularity's overlay option handles concurrent writes to an image from multiple nodes, so I'm not sure how to work around it other than to skip the image and write to a directory instead. The /data directory is shared across all nodes and is mounted on the master node. However, I'm not very familiar with Singularity; I've been more accustomed to using Docker.

The next error, with Docker, may be because I'm running within Docker rather than from a local install. I'm not sure how the Docker version of Cactus interacts with the locally installed job management system; it looks like it can't see Slurm outside of Docker.

Thanks for your help.
I don't know Singularity, but if jobStore.img contains the jobStore, a lot of processes will be writing to it.
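One way to avoid funnelling every jobStore write through a single overlay image would be to point the jobStore (the first positional argument to cactus) at a directory on the shared filesystem instead; a sketch only, assuming /data is the mount shared by all nodes and Cactus is otherwise invoked as before:

cactus /data/jobStore /data/evolverMammals.txt /data/evolverMammals.hal --workDir /data/tmp/ --batchSystem Slurm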
Thanks. I managed to sort out the Slurm issue: I had not set up SlurmDBD, and it appears Cactus expected to access the database. Once SlurmDBD was set up, the errors went away.

Regarding concurrent writes to an ext3 image, I received feedback on the issue I posted on the Singularity GitHub repo. They suggested not using Toil with Slurm, as is also mentioned on the Cactus repo; the recommendation is to use a single node with enough cores, since most file systems cannot handle so many writes to the jobStore. However, I do not have access to a single node that is big enough. The biggest node I can use right now has 16 cores and 64 GB RAM, and I have previously run Cactus on it with 3 large genomes (~5 Gbp each); the job failed due to the limited number of processors.

Does anyone have a suggestion for a workaround? I could limit the number of additional nodes to around 3 to 5 to reduce the stress on the file system, and I could also try aligning only portions of the genomes in each Cactus run (say chr1, chr2, chr3, etc. per job). The genomes are of the same species (different cultivars), so the differences between genomes will be limited to SNPs and indels across the same chromosomes; there are unlikely to be very large structural differences, as the divergence time between the cultivars is <10,000 years.
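If the worry is how many jobs hit the shared filesystem at once, Toil's resource caps can limit concurrency; a hedged sketch, assuming the standard Toil --maxCores option (which Cactus passes through to Toil) and three worker nodes of 16 cores each:

cactus jobStore /data/evolverMammals.txt /data/evolverMammals.hal --batchSystem Slurm --binariesMode local --maxCores 48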
I'm still getting errors when I try to run Cactus from Docker and specify Slurm as the batch system:

ubuntu@node-0:/data$ docker run -v $(pwd):/data --rm -it quay.io/comparative-genomics-toolkit/cactus:latest cactus jobStore /data/evolverMammals.txt /data/evolverMammals.hal --workDir /data/tmp/ --binariesMode local --batchSystem Slurm

When I run the locally installed Cactus and specify docker as the binariesMode, it runs, but then complains that the batch system doesn't support auto-deployment. I'll try installing Cactus on all nodes. Regarding running in Docker with Slurm as the batch system, is there a way to get Docker to see the host system and executables such as sacct and sinfo?
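In principle the host's Slurm client tools could be bind-mounted into the container, but this is fragile: the binaries also need the host's Slurm and munge libraries, the slurm.conf, and the munge socket, and none of this has been tested with the Cactus image. A purely hypothetical sketch, with paths assumed from a typical Ubuntu Slurm install:

docker run --rm -it \
  -v $(pwd):/data \
  -v /usr/bin/sinfo:/usr/bin/sinfo:ro \
  -v /usr/bin/sbatch:/usr/bin/sbatch:ro \
  -v /usr/bin/squeue:/usr/bin/squeue:ro \
  -v /usr/bin/sacct:/usr/bin/sacct:ro \
  -v /etc/slurm:/etc/slurm:ro \
  -v /run/munge:/run/munge \
  quay.io/comparative-genomics-toolkit/cactus:latest \
  cactus jobStore /data/evolverMammals.txt /data/evolverMammals.hal --batchSystem Slurm

The more common pattern is the reverse: run cactus natively on the host, where sinfo and sacct already work, and let it launch its tools in containers via --binariesMode docker or singularity, which is what the later comments here move towards.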
Hi, I've installed Cactus on all nodes, and when I run Cactus it still complains about the batch system not supporting auto-deployment:

cactus jobStore /data/evolverMammals.txt /data/evolverMammals.hal --batchSystem Slurm --binariesMode docker --root mr

Running directly on the master node gives the same result, as does running with sbatch and this script:

#!/bin/bash
# allow use of all the memory on the node
#SBATCH --ntasks-per-node=8
# request all CPU cores on the node
#SBATCH --exclusive
# Customize --time --partition as appropriate
#SBATCH --partition=debug
set -o nounset -o errexit -o xtrace
source ~/cactus_env/bin/activate
cactus jobStore /data/evolverMammals.txt /data/evolverMammals.hal --batchSystem Slurm --binariesMode docker --root mr

You can see cactus is on every node:

(cactus_env) ubuntu@node-0:/data$ which cactus

Whether or not I set up a virtual environment (cactus_env) to load on each node makes no difference. I also checked that the directory and scripts Toil looks for in the error message exist. They do exist on every node:

(cactus_env) ubuntu@node-0:/data$ ls -l /usr/local/lib/python3.8/dist-packages/cactus/progressive/cactus_progressive*
(cactus_env) ubuntu@node-0:/data$ pdsh -a ls -l /usr/local/lib/python3.8/dist-packages/cactus/progressive/cactus_progressive*
node-1: -rw-r--r-- 1 root staff 22182 Jun 16 02:33 /usr/local/lib/python3.8/dist-packages/cactus/progressive/cactus_progressive.py
node-2: -rw-r--r-- 1 root staff 22182 Jun 16 02:33 /usr/local/lib/python3.8/dist-packages/cactus/progressive/cactus_progressive.py
I've managed to get Cactus working with Slurm by ignoring the batch system parameter and instead splitting the work up per node, with one job per chromosome. It isn't ideal, but it's likely the only way I'll get it running. Each chromosome is roughly 600 Mbp and there are 7 chromosomes. I'm comparing 3 genomes, but this will likely scale up to many more genomes in the future, which probably won't be feasible with my current limited compute resources.

I submit the job to a single node using sbatch, and repeat for subsequent chromosomes:

#!/bin/bash
source ~/cactus_env/bin/activate
cactus jobStore /data/wild_barley/chr1/wildBarley_chr1.txt /data/wild_barley/chr1/wildBarley_chr1.hal --workDir /data/tmp/ --binariesMode docker --root Anc0
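For what it's worth, the per-chromosome submissions could be scripted with a small loop; a sketch only, assuming one seqFile and output HAL per chromosome laid out as in the paths above, and a separate jobStore per job so they don't collide:

#!/bin/bash
# submit one single-node Cactus job per chromosome
for chr in chr1 chr2 chr3 chr4 chr5 chr6 chr7; do
  sbatch --nodes=1 --exclusive --partition=debug --wrap \
    "source ~/cactus_env/bin/activate && cactus jobStore_${chr} /data/wild_barley/${chr}/wildBarley_${chr}.txt /data/wild_barley/${chr}/wildBarley_${chr}.hal --workDir /data/tmp/ --binariesMode docker --root Anc0"
done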