Docker

Welcome to Rice Variant Calling!

This demo session will be used for the hands-on workshop at ISRFG '22.

Principal investigator:

Prof. Rod A. Wing, Director, Center for Desert Agriculture, 4700 King Abdullah University of Science and Technology, Thuwal 23955-6900, KSA

Developers:

Nagarajan Kathiresan {nagarajan.kathiresan@kaust.edu.sa} and Yong Zhou {yong.zhou@kaust.edu.sa}

Demo on a workstation or laptop using Docker

Prerequisite:

Docker installation on your laptop/workstation.
Download the demo rice genome dataset and pipeline scripts:

Download link
Untar the downloaded file:

tar -xzvf rice_pipeline_demo.tar.gz
Change to the rice_pipeline_demo directory. This directory will be your working directory for the demo exercise.

cd rice_pipeline_demo

Your directory structure for the demo exercise will be:

~/Rice_pipeline/rice_pipeline_demo$ tree -L 2
.
├── input
│   ├── ERS467814_ERR614071_1.fastq.gz
│   ├── ERS467814_ERR614071_2.fastq.gz
│   ├── ERS467814_ERR614072_1.fastq.gz
│   ├── ERS467814_ERR614072_2.fastq.gz
│   ├── ERS467860_ERR615300_1.fastq.gz
│   ├── ERS467860_ERR615300_2.fastq.gz
│   ├── ERS467860_ERR615301_1.fastq.gz
│   └── ERS467860_ERR615301_2.fastq.gz
├── output
├── ref
│   ├── Nipponbare_chr.dict
│   ├── Nipponbare_chr.fasta
│   ├── Nipponbare_chr.fasta.amb
│   ├── Nipponbare_chr.fasta.ann
│   ├── Nipponbare_chr.fasta.bwt
│   ├── Nipponbare_chr.fasta.fai
│   ├── Nipponbare_chr.fasta.pac
│   └── Nipponbare_chr.fasta.sa
├── scripts
│   ├── Phase1
│   ├── Phase2
│   ├── Phase3
│   └── Phase4
└── tmp

Download the docker image:

sudo docker pull ibexcluster/biohpc:v1.0
Ensure the Docker image is available on your workstation/laptop:

sudo docker images
Start a container from the Docker image and bind-mount the working directory into it (source $PWD, destination /demo):

sudo docker run -d -it --name biohpc --mount type=bind,source="$(pwd)",target=/demo ibexcluster/biohpc:v1.0
Ensure the container is running on your workstation:

sudo docker ps -a
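
Because docker ps -a also lists stopped containers, it can help to check specifically that biohpc is in the running state. The following is a minimal sketch, not part of the demo scripts: container_running is a hypothetical helper, and DOCKER is a hypothetical override variable that defaults to the "sudo docker" form used in this guide.

```shell
# DOCKER is a hypothetical override (not part of the demo); it defaults to the
# "sudo docker" form used throughout this guide.
DOCKER="${DOCKER:-sudo docker}"

# Succeeds only if a container with exactly this name is currently running.
# "docker ps" without -a lists running containers only; --filter narrows the
# list by name (substring match), and grep -x then enforces an exact match.
container_running() {
    $DOCKER ps --filter "name=$1" --format '{{.Names}}' | grep -qxF "$1"
}
```

For example, container_running biohpc && echo "biohpc is running". If the container was started from the working directory, sudo docker exec biohpc ls /demo should then show the input, output, ref, scripts and tmp directories (assuming ls is available in the image).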

Input and Reference data preparation:

We provide two rice genome samples (ERS467814 and ERS467860). To improve quality, each sample was sequenced twice. The resulting files are summarized below (all of these *.fastq.gz files are in the input directory):

Sample #1
├── ERS467814_ERR614071_1.fastq.gz
├── ERS467814_ERR614071_2.fastq.gz
├── ERS467814_ERR614072_1.fastq.gz
└── ERS467814_ERR614072_2.fastq.gz
Sample #2
├── ERS467860_ERR615300_1.fastq.gz
├── ERS467860_ERR615300_2.fastq.gz
├── ERS467860_ERR615301_1.fastq.gz
└── ERS467860_ERR615301_2.fastq.gz
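
Each run in the listing above appears as a pair of files ending in _1 and _2. A quick check that no file of a pair went missing during download can be sketched as follows. missing_mates is a hypothetical helper, and it assumes the _1/_2 naming shown above.

```shell
# Print every *_1.fastq.gz file in a directory whose *_2.fastq.gz mate is absent.
# Returns non-zero if at least one mate is missing.
missing_mates() {
    dir=$1
    status=0
    for fwd in "$dir"/*_1.fastq.gz; do
        [ -e "$fwd" ] || continue              # the glob matched nothing
        rev="${fwd%_1.fastq.gz}_2.fastq.gz"    # swap the _1 suffix for _2
        if [ ! -e "$rev" ]; then
            printf 'missing mate for %s\n' "$fwd"
            status=1
        fi
    done
    return "$status"
}
```

Run from the working directory as missing_mates input. It prints nothing when every pair is complete.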

We provide the Nipponbare rice reference genome along with all the required index files in the ref directory.

  ├── Nipponbare_chr.dict
  ├── Nipponbare_chr.fasta
  ├── Nipponbare_chr.fasta.amb
  ├── Nipponbare_chr.fasta.ann
  ├── Nipponbare_chr.fasta.bwt
  ├── Nipponbare_chr.fasta.fai
  ├── Nipponbare_chr.fasta.pac
  └── Nipponbare_chr.fasta.sa
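
The extensions above are the ones conventionally written by bwa index (.amb, .ann, .bwt, .pac, .sa), samtools faidx (.fai) and a sequence-dictionary tool (.dict). Since a truncated download tends to show up as a missing or empty file, a small check before starting Phase 1 may save time. The sketch below is not part of the demo scripts, and missing_ref_files is a hypothetical helper name.

```shell
# Report any reference or index file from the ref/ listing that is absent or empty.
# Returns non-zero if at least one file fails the check.
missing_ref_files() {
    ref_dir=$1
    status=0
    for f in Nipponbare_chr.dict Nipponbare_chr.fasta \
             Nipponbare_chr.fasta.amb Nipponbare_chr.fasta.ann \
             Nipponbare_chr.fasta.bwt Nipponbare_chr.fasta.fai \
             Nipponbare_chr.fasta.pac Nipponbare_chr.fasta.sa; do
        if [ ! -s "$ref_dir/$f" ]; then        # -s: file exists and is non-empty
            printf 'missing or empty: %s\n' "$ref_dir/$f"
            status=1
        fi
    done
    return "$status"
}
```

Run from the working directory as missing_ref_files ref.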

Phase 1: Genome mapping

Step 1: Prepare your input files, i.e., write the list of forward sequence files (*_1.fastq.gz) from the input/ directory into the scripts/Phase1/Phase1.txt file.

cd input/

ls -lrta *_1.fastq.gz | awk '{print $9}' > ../scripts/Phase1/Phase1.txt
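
The command above takes the ninth column of ls -l output, which depends on the column layout of ls and breaks on file names containing spaces. A more defensive variant can be sketched with a plain glob. list_forward_reads is a hypothetical helper, and it emits names in alphabetical (glob) order rather than the time order produced by ls -lrta. Whether Phase1.exe depends on the order is not stated here, so treat this as an alternative to try, not a drop-in replacement.

```shell
# Print the names of the forward-read files (one per line) found in a directory.
# Names come out in glob (alphabetical) order, not the time order of "ls -lrta".
list_forward_reads() {
    for f in "$1"/*_1.fastq.gz; do
        [ -e "$f" ] || continue    # the glob matched nothing
        basename "$f"
    done
}
```

From the rice_pipeline_demo working directory this would be used as list_forward_reads input > scripts/Phase1/Phase1.txt.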

Step 2: Execute the MPI wrapper program for Phase 1:

One core is assigned to each entry in Phase1.txt by default.
The number of MPI processes (-np) should equal the number of entries listed in the Phase1.txt file (4 forward-read files in this demo).

sudo docker exec -ti biohpc sh -c "mpirun --allow-run-as-root -np 4 /demo/scripts/Phase1/Phase1.exe"
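
Since -np has to match the number of entries in Phase1.txt, the value can be derived from the file instead of being typed by hand. The following is a minimal sketch: count_entries and phase1_command are hypothetical helpers, and phase1_command only prints the launch command so it can be inspected before it is run.

```shell
# Count the non-empty lines of a list file; this is the value to pass to -np.
count_entries() {
    grep -c . "$1" || true    # grep exits non-zero on a count of 0; keep going
}

# Print (rather than run) the Phase 1 launch command for a given list file.
phase1_command() {
    np=$(count_entries "$1")
    printf 'sudo docker exec -ti biohpc sh -c "mpirun --allow-run-as-root -np %s /demo/scripts/Phase1/Phase1.exe"\n' "$np"
}
```

For example, phase1_command scripts/Phase1/Phase1.txt, run from the working directory, prints the command with -np filled in.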

Phase 2: Variant discovery

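Step 1: Prepare for Phase 2 execution. Run the commands in this step from the rice_pipeline_demo working directory (run cd .. first if you are still in input/).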

List the sample names and write them to Phase2.prefix.txt:

ls -ld output/tmpBAM/* | awk -F'/' '{print $NF}' > scripts/Phase2/Phase2.prefix.txt

List the sample directories and write them to Phase2.directory.txt:

ls -ld output/tmpBAM/* | awk '{print "/demo/"$9} ' > scripts/Phase2/Phase2.directory.txt

Find the number of MPI processes (-np) for multi-core runs:

cat scripts/Phase2/Phase2.prefix.txt | wc -l

(Use this number as the -np argument in Step 2.)
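
The two listings in this step can also be produced in one pass without parsing ls -l columns. The sketch below is not part of the demo scripts. write_phase2_lists is a hypothetical helper, and it assumes that output/tmpBAM contains one sub-directory per sample and that the first argument is given relative to the working directory that is mounted at /demo.

```shell
# Write the sample names and their in-container paths for Phase 2.
# $1: directory holding one sub-directory per sample (e.g. output/tmpBAM),
#     given relative to the working directory mounted at /demo
# $2: directory that receives Phase2.prefix.txt and Phase2.directory.txt
write_phase2_lists() {
    bam_dir=${1%/}
    list_dir=${2%/}
    : > "$list_dir/Phase2.prefix.txt"        # start from empty files
    : > "$list_dir/Phase2.directory.txt"
    for d in "$bam_dir"/*/; do
        [ -d "$d" ] || continue              # the glob matched nothing
        name=$(basename "$d")
        printf '%s\n' "$name" >> "$list_dir/Phase2.prefix.txt"
        # /demo is the bind-mount target chosen when the container was started
        printf '/demo/%s/%s\n' "$bam_dir" "$name" >> "$list_dir/Phase2.directory.txt"
    done
}
```

Used as write_phase2_lists output/tmpBAM scripts/Phase2. The line count of Phase2.prefix.txt is then the -np value, as before.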

Step 2: Execute the MPI wrapper for Phase 2, passing the count from Step 1 as -np (2 in this demo):

sudo docker exec -ti biohpc sh -c "mpirun --allow-run-as-root -np 2 /demo/scripts/Phase2/Phase2.exe"

Phase 3: Call set refinement and combining variants

Step 1: Execute the prerequisite data distribution script, passing the number of cores available on your workstation as its argument. The example below uses 112 cores.

sudo docker exec -ti biohpc sh -c "sh /demo/scripts/Phase3/Phase3.prerequisite.optimized.sh 112"
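
The value 112 is specific to the author's workstation. A portable way to look up the core count on your own machine can be sketched as follows. core_count is a hypothetical helper that tries nproc first and falls back to getconf or sysctl on systems without it.

```shell
# Print the number of online CPU cores, trying a few common tools in turn.
core_count() {
    nproc 2>/dev/null ||
        getconf _NPROCESSORS_ONLN 2>/dev/null ||
        sysctl -n hw.ncpu
}
```

The result can be substituted for 112 in the command above. Note that this reports the host's cores. Where Docker runs inside a virtual machine (for example Docker Desktop), the container may see fewer cores than the host, so checking from inside the container with sudo docker exec biohpc nproc is a reasonable cross-check, assuming nproc is present in the image.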

Step 2: Launch the MPI wrapper script for Phase 3.

sudo docker exec -ti biohpc sh -c "mpirun --allow-run-as-root --oversubscribe -np 84 /demo/scripts/Phase3/Phase3.exe"

Phase 4: Variant matrices

Step 1: Run the prerequisite script:

sudo docker exec -ti biohpc sh -c "sh /demo/scripts/Phase4/Phase4.prerequisite.sh"

Step 2: Run the variant caller program:

sudo docker exec -ti biohpc sh -c "mpirun --allow-run-as-root --oversubscribe -np 12 /demo/scripts/Phase4/Phase4.exe"

Results

Results from successfully completed runs are available here:
Phase #1: Phase #1 Results
Phase #2: Phase #2 Results
