Singularity

IBEXCluster edited this page Oct 28, 2022 · 33 revisions

Welcome to Rice Variant Calling!

This demo session will be used for the hands-on workshop at ISRFG '22.

Principal investigator:

Prof. Rod A. Wing, Director, Center for Desert Agriculture, 4700 King Abdullah University of Science and Technology, Thuwal 23955-6900, KSA

Presenters:

Nagarajan Kathiresan {nagarajan.kathiresan@kaust.edu.sa} and Yong Zhou {yong.zhou@kaust.edu.sa}

Demo on the Ibex cluster using Singularity.

Prerequisite:

  1. Singularity should be installed in your cluster or High-Performance Computing environment.

  2. Download the demo rice genome dataset and pipeline scripts:

    Download link

  3. Untar the downloaded file:

    tar -xzvf rice_pipeline_demo.tar.gz

  4. Change to the rice_pipeline_demo directory. This directory will be your working directory for this demo exercise.

    cd rice_pipeline_demo

  5. Your directory structure for the demo exercise will be:

     ~/Rice_pipeline/rice_pipeline_demo$ tree -L 2
       ├── input
            ├── ERS467814_ERR614071_1.fastq.gz
            ├── ERS467814_ERR614071_2.fastq.gz
            ├── ERS467814_ERR614072_1.fastq.gz
            ├── ERS467814_ERR614072_2.fastq.gz
            ├── ERS467860_ERR615300_1.fastq.gz
            ├── ERS467860_ERR615300_2.fastq.gz
            ├── ERS467860_ERR615301_1.fastq.gz
            └── ERS467860_ERR615301_2.fastq.gz
       ├── output
       ├── ref
            ├── Nipponbare_chr.dict
            ├── Nipponbare_chr.fasta
            ├── Nipponbare_chr.fasta.amb
            ├── Nipponbare_chr.fasta.ann
            ├── Nipponbare_chr.fasta.bwt
            ├── Nipponbare_chr.fasta.fai
            ├── Nipponbare_chr.fasta.pac
            └── Nipponbare_chr.fasta.sa
       ├── scripts
            ├── Phase1
            ├── Phase2
            ├── Phase3
            └── Phase4
       └── tmp
    
  6. Download the Singularity image. The example script below builds the SIF image from the Docker image ibexcluster/bioapps:v1.0:

    #!/bin/bash
    module load singularity
    # Export the variables so that the singularity command can see them
    export SINGULARITY_CACHEDIR=$HOME/singularity/cache
    export SINGULARITY_PULLFOLDER=$HOME/singularity/images
    export SINGULARITY_TMPDIR=$HOME/singularity/singularity/tmp
    mkdir -p ${SINGULARITY_CACHEDIR} ${SINGULARITY_PULLFOLDER} ${SINGULARITY_TMPDIR}
    singularity build BioApps.sif docker://ibexcluster/bioapps:v1.0
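
Before submitting any jobs, it can help to confirm that the build actually produced an image. The following is a minimal sketch and is not part of the demo scripts: the function name `check_sif` is hypothetical, and it calls `singularity inspect` only when the `singularity` command is on the PATH.

```shell
#!/bin/bash
# Hypothetical helper (not part of the demo scripts): verify that a SIF image
# exists and is non-empty, then print its metadata if singularity is available.
check_sif() {
    local sif="$1"
    if [ ! -s "$sif" ]; then
        echo "Image not found or empty: $sif" >&2
        return 1
    fi
    if command -v singularity >/dev/null 2>&1; then
        # 'singularity inspect' prints the labels stored in the image
        singularity inspect "$sif" || return 1
    fi
    return 0
}
```

For example, `check_sif BioApps.sif` run in the directory where the image was built.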
    

Input and Reference data preparation:

We provide two rice genome samples (ERS467814 and ERS467860). To improve quality, each sample was resequenced twice. The resulting files are summarized below (all of these *.fastq.gz files are in the input directory):

   Sample #1
  ├── ERS467814_ERR614071_1.fastq.gz
  ├── ERS467814_ERR614071_2.fastq.gz
  ├── ERS467814_ERR614072_1.fastq.gz
  └── ERS467814_ERR614072_2.fastq.gz
   Sample #2
  ├── ERS467860_ERR615300_1.fastq.gz
  ├── ERS467860_ERR615300_2.fastq.gz
  ├── ERS467860_ERR615301_1.fastq.gz
  └── ERS467860_ERR615301_2.fastq.gz
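
Each run above comes as a `_1.fastq.gz` file and a matching `_2.fastq.gz` file. A quick way to confirm that no mate file went missing during download is sketched below. The function name `check_pairs` is an assumption made for illustration and does not appear in the demo scripts.

```shell
#!/bin/bash
# Hypothetical helper: for every *_1.fastq.gz file in a directory, check that
# the matching *_2.fastq.gz exists. Returns non-zero if any mate is missing.
check_pairs() {
    local dir="$1" fwd rev missing=0
    for fwd in "$dir"/*_1.fastq.gz; do
        [ -e "$fwd" ] || continue            # glob matched nothing
        rev="${fwd%_1.fastq.gz}_2.fastq.gz"  # swap the _1 suffix for _2
        if [ -e "$rev" ]; then
            echo "OK      $(basename "$fwd")"
        else
            echo "MISSING $(basename "$rev")"
            missing=1
        fi
    done
    return "$missing"
}
```

From the working directory this would be called as `check_pairs input`.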

We provide the Nipponbare rice reference genome along with all the required index files. They are available in the ref directory.

  ├── Nipponbare_chr.dict
  ├── Nipponbare_chr.fasta
  ├── Nipponbare_chr.fasta.amb
  ├── Nipponbare_chr.fasta.ann
  ├── Nipponbare_chr.fasta.bwt
  ├── Nipponbare_chr.fasta.fai
  ├── Nipponbare_chr.fasta.pac
  └── Nipponbare_chr.fasta.sa

Phase 1: Genome mapping

Step 1: Prepare your input files, i.e., write the list of unique forward sequence files from the input/ directory into the scripts/Phase1/Phase1.txt file.

cd input/

ls -lrta *_1.fastq.gz | awk '{print $9}' > ../scripts/Phase1/Phase1.txt
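
The same list can also be built without parsing the columns of `ls -l`. The sketch below is one possible alternative, not the authors' method. The function name `write_forward_list` is hypothetical, and it lists files in name order rather than by modification time, on the assumption that the order of entries does not matter to the Phase 1 wrapper.

```shell
#!/bin/bash
# Sketch: write the basenames of all *_1.fastq.gz files in a directory to a
# list file, one per line, and print how many entries were written.
# In the demo the arguments would be input and scripts/Phase1/Phase1.txt.
write_forward_list() {
    local indir="$1" listfile="$2" f
    : > "$listfile"                      # create or truncate the list
    for f in "$indir"/*_1.fastq.gz; do
        [ -e "$f" ] || continue          # skip if nothing matched
        basename "$f" >> "$listfile"
    done
    wc -l < "$listfile" | tr -d ' '
}
```

Run from the working directory (not from input/), for example `write_forward_list input scripts/Phase1/Phase1.txt`. The printed count is the number of entries in the list.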

Step 2: Execute the MPI wrapper program for Phase 1:

  • One core is assigned to each sample entry (by default).

  • The number of MPI processes (-np) should equal the number of entries listed in the Phase1.txt file (4 in this demo, one per forward sequence file).
    An example job submission script follows.
    (This script uses 2 nodes with 2 cores per node, totaling 4 cores.)

      #!/bin/bash 
      #SBATCH --ntasks-per-node=2 
      #SBATCH -N 2 
      #SBATCH --mem=16GB 
      #SBATCH -J Singularity 
      #SBATCH --error=STDERR.Singularity.%J.err 
      #SBATCH --output=STDOUT.Singularity.%J.out 
      #SBATCH --time=10:00 
      #SBATCH -A ibex-cs
    
     ## User environment 
      export DOCKER_MOUNT=/ibex/scratch/projects/c2072/work/Singularity/rice_genome/rice_pipeline_demo ;
      export SIF=/ibex/scratch/projects/c2072/work/Singularity/rice_genome/BioApps.sif ;
    
     ## Module file
      module load mpich/3.3/gnu-6.4.0 singularity 
      export SINGULARITY_BIND="$DOCKER_MOUNT,$PWD,/sw"
    
     ## Create required files and directories
     scontrol show hostnames > hostfile
    
     mpicc ${DOCKER_MOUNT}/scripts/Phase1/Phase1.c -o ${DOCKER_MOUNT}/scripts/Phase1/Phase1.exe
     mpiexec -np 4 -hostfile ./hostfile singularity exec $SIF ${DOCKER_MOUNT}/scripts/Phase1/Phase1.exe
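
The script above hard-codes `-np 4`. A small sketch of deriving that value from the list file instead is shown here. The function name `count_samples` is hypothetical and is not part of the pipeline.

```shell
#!/bin/bash
# Sketch: count the non-empty lines of a sample list. The result is the
# value to pass to 'mpiexec -np'.
count_samples() {
    grep -c -v '^[[:space:]]*$' "$1"
}
```

Inside the job script this could be used as `NP=$(count_samples ${DOCKER_MOUNT}/scripts/Phase1/Phase1.txt)` followed by `mpiexec -np "$NP" ...`. The SLURM allocation (nodes times tasks per node) still has to provide at least that many tasks.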
    

Phase 2: Variant discovery

Step 1: Prepare for Phase 2 execution.

  1. List the sample names and write them to Phase2.prefix.txt:

    ls -ld output/tmpBAM/* | awk -F'/' '{print $NF}' > scripts/Phase2/Phase2.prefix.txt

  2. List the sample directories and write them to Phase2.directory.txt:

    ls -ld $PWD/output/tmpBAM/* | awk '{print $9} ' > scripts/Phase2/Phase2.directory.txt

  3. Find the number of MPI processes (-np) for multi-core runs:

    cat scripts/Phase2/Phase2.prefix.txt | wc -l

    (Use this number as the -np argument for multi-core runs.)
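
The three preparation commands can also be combined into a single pass, which keeps the prefix list and the directory list in the same order. This is a sketch under assumptions, not the workshop's own tooling: the function name `write_phase2_lists` is hypothetical, and every entry under the BAM directory is treated as one sample.

```shell
#!/bin/bash
# Sketch: for each entry under a BAM directory, write its name to a prefix
# list and its absolute path to a directory list, in the same order.
# Prints the number of entries, i.e. the value for 'mpiexec -np'.
write_phase2_lists() {
    local bamdir="$1" prefixfile="$2" dirfile="$3" d abs
    abs=$(cd "$bamdir" && pwd) || return 1   # resolve to an absolute path
    : > "$prefixfile"
    : > "$dirfile"
    for d in "$abs"/*; do
        [ -e "$d" ] || continue              # skip if the directory is empty
        basename "$d" >> "$prefixfile"
        echo "$d" >> "$dirfile"
    done
    wc -l < "$prefixfile" | tr -d ' '
}
```

From the working directory: `write_phase2_lists output/tmpBAM scripts/Phase2/Phase2.prefix.txt scripts/Phase2/Phase2.directory.txt`.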

Step 2: Execute the MPI wrappers

   (This script uses 2 nodes with 1 core per node, totaling 2 cores.)

    #!/bin/bash 
    #SBATCH --ntasks-per-node=1 
    #SBATCH -N 2 
    #SBATCH --mem=16GB 
    #SBATCH -J Singularity 
    #SBATCH --error=STDERR.Singularity.%J.err 
    #SBATCH --output=STDOUT.Singularity.%J.out 
    #SBATCH --time=1:00:00 
    #SBATCH -A ibex-cs
 
   ## User environment
    export DOCKER_MOUNT=/ibex/scratch/projects/c2072/work/Singularity/rice_genome/rice_pipeline_demo ;
    export SIF=/ibex/scratch/projects/c2072/work/Singularity/rice_genome/BioApps.sif ;

   ## Module file
    module load mpich/3.3/gnu-6.4.0 singularity 
    export SINGULARITY_BIND="$DOCKER_MOUNT,$PWD,/sw"

   ## Create required files and directories
   scontrol show hostnames > hostfile
   mpicc ${DOCKER_MOUNT}/scripts/Phase2/Phase2.c -o ${DOCKER_MOUNT}/scripts/Phase2/Phase2.exe
   mpiexec -np 2 -hostfile ./hostfile singularity exec $SIF ${DOCKER_MOUNT}/scripts/Phase2/Phase2.exe

Phase 3: Call set refinement and combining variants

Step 1: Execute the prerequisite data distribution script, passing the number of cores available in your HPC/cluster as its argument. The example below passes 250.

sh scripts/Phase3/Phase3.prerequisite.optimized.sh 250

This script reports the estimated optimal number of cores. Example output of sh ./Phase3.prerequisite.optimized.sh 250:

 *************************************************************
  Max size of Chromosome in the given reference is: 43270923
  Total no. of Chromosomes in the given reference is: 12
  No. of optimal CPUs will be calculated as follows:
  Please use: -np 178
*************************************************************
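
The first two figures in this report can be cross-checked against the reference index. The sketch below assumes that ref/Nipponbare_chr.fasta.fai follows the usual tab-separated FASTA index layout, in which the second column is the sequence length. Whether the prerequisite script computes its values this way is not stated here, and the sketch does not reproduce the -np estimate. The function name `fai_stats` is hypothetical.

```shell
#!/bin/bash
# Sketch: report the number of sequences and the longest sequence length
# from a FASTA index (.fai) whose second tab-separated column is the length.
fai_stats() {
    awk -F'\t' '
        $2 + 0 > max + 0 { max = $2 }
        END { printf "sequences=%d max_length=%d\n", NR, max }
    ' "$1"
}
```

For example, `fai_stats ref/Nipponbare_chr.fasta.fai` run from the working directory.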

Step 2: Launch the MPI wrapper script for Phase 3.

(This script requests 23 nodes with 8 tasks per node, i.e. 184 slots, which covers the 178 MPI processes suggested in Step 1.)

#!/bin/bash 
#SBATCH --ntasks-per-node=8 
#SBATCH -N 23 
#SBATCH --mem=16GB 
#SBATCH -J Singularity 
#SBATCH --error=STDERR.Singularity.%J.err 
#SBATCH --output=STDOUT.Singularity.%J.out 
#SBATCH --time=1:00:00 
#SBATCH -A ibex-cs

 ## User environment 
  export DOCKER_MOUNT=/ibex/scratch/projects/c2072/work/Singularity/rice_genome/rice_pipeline_demo ;
  export SIF=/ibex/scratch/projects/c2072/work/Singularity/rice_genome/BioApps.sif ;

 ## Module file
  module load mpich/3.3/gnu-6.4.0 singularity 
  export SINGULARITY_BIND="$DOCKER_MOUNT,$PWD,/sw"

 ## Create required files and directories
   scontrol show hostnames > hostfile
   mpicc ${DOCKER_MOUNT}/scripts/Phase3/Phase3.c -o ${DOCKER_MOUNT}/scripts/Phase3/Phase3.exe
   mpiexec -np 178 -hostfile ./hostfile singularity exec $SIF ${DOCKER_MOUNT}/scripts/Phase3/Phase3.exe 
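
The node count in the job header follows from the process count suggested in Step 1: 178 processes at 8 tasks per node need 23 nodes. A minimal sketch of that arithmetic is given below, using a hypothetical helper name `nodes_needed`.

```shell
#!/bin/bash
# Sketch: smallest number of nodes that provides at least NP task slots
# when each node runs PER_NODE tasks (integer ceiling division).
nodes_needed() {
    local np="$1" per_node="$2"
    echo $(( (np + per_node - 1) / per_node ))
}
```

`nodes_needed 178 8` gives the value to use for `#SBATCH -N` when `--ntasks-per-node=8`. The same helper applies to the other phases if their process counts change.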

Phase 4: Variant matrices

Step 1: Run the prerequisite script:

sh scripts/Phase4/Phase4.prerequisite.sh

Step 2: Run the variant caller program

(This script uses 1 node and 12 cores per node)

#!/bin/bash 
#SBATCH --ntasks-per-node=12 
#SBATCH -N 1 
#SBATCH --mem=16GB 
#SBATCH -J Singularity 
#SBATCH --error=STDERR.Singularity.%J.err 
#SBATCH --output=STDOUT.Singularity.%J.out 
#SBATCH --time=1:00:00 
#SBATCH -A ibex-cs

  #User environment 
   export DOCKER_MOUNT=/ibex/scratch/projects/c2072/work/Singularity/rice_genome/rice_pipeline_demo ;
   export SIF=/ibex/scratch/projects/c2072/work/Singularity/rice_genome/BioApps.sif ;
  #Module file
   module load mpich/3.3/gnu-6.4.0 singularity 
   export SINGULARITY_BIND="$DOCKER_MOUNT,$PWD,/sw"

  #Create required files and directories
    scontrol show hostnames > hostfile
    mpicc ${DOCKER_MOUNT}/scripts/Phase4/Phase4.c -o ${DOCKER_MOUNT}/scripts/Phase4/Phase4.exe
    mpiexec -np 12 -hostfile ./hostfile singularity exec $SIF ${DOCKER_MOUNT}/scripts/Phase4/Phase4.exe 
