# The Notebook -  by GATK-Spark 

## Motivation

We demonstrate how, by embedding Docker containers, we can distribute Jupyter notebooks that run complicated workflows interactively in the cloud.

The example is a GATK variant-calling pipeline that runs in a distributed manner using Spark on an auto-generated AWS Kubernetes cluster.


### Download the fastq input files 

##### This step will take approximately 30 minutes, depending on your bandwidth.

 

The fastq files are publicly available at the following links:

- https://trace.ncbi.nlm.nih.gov/Traces/sra/?run=SRR493366
- https://trace.ncbi.nlm.nih.gov/Traces/sra/?run=SRR493367
- https://trace.ncbi.nlm.nih.gov/Traces/sra/?run=SRR493368
- https://trace.ncbi.nlm.nih.gov/Traces/sra/?run=SRR493369
- https://trace.ncbi.nlm.nih.gov/Traces/sra/?run=SRR493370
- https://trace.ncbi.nlm.nih.gov/Traces/sra/?run=SRR493371

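To download SRA runs we can use the parallel-fastq-dump container, a Python wrapper around the fastq-dump utility that downloads chunks of each file in parallel. The command below instead uses wget in the biodepot/alpine-tools container to fetch the paired-end reads for run ERR013140 from the 1000 Genomes FTP site.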

```bash
# Download both paired-end read files for run ERR013140 into /data/fastq.
docker run --rm -i -v /.nbdocker:/data biodepot/alpine-tools bash -c 'mkdir -p /data/fastq && cd /data/fastq && \
wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/phase3/data/HG00100/sequence_read/ERR013140_1.filt.fastq.gz && \
wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/phase3/data/HG00100/sequence_read/ERR013140_2.filt.fastq.gz'
```

{nbdocker#0}
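
Before moving on, it can be worth confirming that both downloads completed. The following is a minimal sketch and not part of the original notebook: the helper name `check_fastq_gz` is hypothetical, and it assumes only that `gzip` is available on the host and that the files are reachable under the host side of the `/.nbdocker` mount.

```bash
# Hypothetical helper: verify that each gzipped fastq file exists, is
# non-empty, and passes gzip's built-in integrity test.
check_fastq_gz() {
  local f rc=0
  for f in "$@"; do
    if [ ! -s "$f" ]; then
      echo "missing or empty: $f" >&2
      rc=1
    elif ! gzip -t "$f" 2>/dev/null; then
      echo "corrupt gzip stream: $f" >&2
      rc=1
    else
      echo "ok: $f"
    fi
  done
  return "$rc"
}
```

For example, `check_fastq_gz /.nbdocker/fastq/ERR013140_1.filt.fastq.gz /.nbdocker/fastq/ERR013140_2.filt.fastq.gz` should print one `ok:` line per file and return a non-zero status if either download was truncated.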

### Download the reference and create the indices
##### Running two containers will take approximately 10 minutes.

We download the reference and generate the indices for bwa.

Download the reference human genome:
```bash
docker run --rm -i -v /.nbdocker:/data biodepot/alpine-tools bash -c 'mkdir -p /data/reference && wget -qO- ftp://ftp.ncbi.nlm.nih.gov/refseq/H_sapiens/annotation/GRCh38_latest/refseq_identifiers/GRCh38_latest_genomic.fna.gz | gunzip -c > /data/reference/GCh38.fa'
```
{nbdocker#1}

Build indices using bwa:

```bash
docker run --rm -i -v /.nbdocker:/data biodepot/alpine-bwa:3.7-0.7.15 bash -c 'cd /data/reference && bwa index GCh38.fa'
```
{nbdocker#2}
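
Indexing is the slow part of this step, so a quick check that it finished can save a failed alignment later. As a rough sketch (the helper name `bwa_index_complete` is hypothetical and not part of the original notebook), the function below tests for the five files that `bwa index` writes next to the reference: `.amb`, `.ann`, `.bwt`, `.pac`, and `.sa`.

```bash
# Hypothetical helper: succeed only if all five bwa index files exist
# and are non-empty next to the given reference fasta.
bwa_index_complete() {
  local ref="$1" ext
  for ext in amb ann bwt pac sa; do
    if [ ! -s "${ref}.${ext}" ]; then
      echo "missing index file: ${ref}.${ext}" >&2
      return 1
    fi
  done
  echo "bwa index complete for ${ref}"
}
```

On the host this could be called as `bwa_index_complete /.nbdocker/reference/GCh38.fa`, assuming the same volume mount as in the commands above.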



## Alignment using bwa-mem to create BAM files

##### Running alignment for each fastq input file will take approximately 30 minutes.  There are a total of 6 of these jobs to run. So, a total of ~36 minutes.

```bash
docker run --rm -i -e NTHREADS='3' \
 -e file1='/data/fastq/ERR013140_1.filt.fastq.gz' \
 -e file2='/data/fastq/ERR013140_2.filt.fastq.gz' \
 -e bamfile='/data/gatk/bams/ERR013140.bam' \
 -e genomeFile='/data/reference/GCh38.fa' \
 -v /.nbdocker:/data biodepot/alpine-bwa-samtools:3.7-0.7.15-1.9-52-g651bf14 \
 bash -c 'mkdir -p /data/gatk/bams && bwa mem -t $NTHREADS \
 $genomeFile $file1 $file2 | samtools sort -@${NTHREADS} -o $bamfile -'
```

{nbdocker#3}
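
Since each input gets its own alignment job, the command above can be wrapped in a function and called once per sample. This is only a sketch under stated assumptions: the function name `align_sample` and the `DOCKER` variable are hypothetical, and the reads are assumed to be named `<sample>_1.filt.fastq.gz` and `<sample>_2.filt.fastq.gz` under `/data/fastq`, as in the download step.

```bash
# Hypothetical wrapper around the alignment command shown above.
# DOCKER defaults to the real docker binary; set DOCKER=echo for a dry run
# that prints the arguments instead of starting a container.
DOCKER="${DOCKER:-docker}"

align_sample() {
  local sample="$1" threads="${2:-3}"
  $DOCKER run --rm -i \
    -e NTHREADS="$threads" \
    -e file1="/data/fastq/${sample}_1.filt.fastq.gz" \
    -e file2="/data/fastq/${sample}_2.filt.fastq.gz" \
    -e bamfile="/data/gatk/bams/${sample}.bam" \
    -e genomeFile='/data/reference/GCh38.fa' \
    -v /.nbdocker:/data biodepot/alpine-bwa-samtools:3.7-0.7.15-1.9-52-g651bf14 \
    bash -c 'mkdir -p /data/gatk/bams && bwa mem -t $NTHREADS $genomeFile $file1 $file2 | samtools sort -@${NTHREADS} -o $bamfile -'
}
```

A loop such as `for s in ERR013140; do align_sample "$s"; done` would then run one job per sample name in the list. Running it first with `DOCKER=echo` is a cheap way to inspect the paths before committing to a long alignment.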


Results will be in /.nbdocker/gatk/bams/ on the host (mounted as /data/gatk/bams/ inside the container).


## GATK (non-Spark steps)

Not all components of GATK use Spark.


```bash
docker run --rm -i -v /.nbdocker:/data \
-e bamfile='/data/gatk/bams/ERR013140.bam' \
-e reference='ref/ref.fasta' \
-e outputFile='/data/variants/variants.vcf' \
broadinstitute/gatk:4.0.5.1 \
/bin/bash -c 'mkdir -p /data/variants && \
              cd /gatk/gatk_data/germline && \
              gatk HaplotypeCaller -R $reference -I $bamfile -O $outputFile'
```

{nbdocker#4}
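
Once HaplotypeCaller has written its VCF, a quick sanity check is to count the variant records. The sketch below is not part of the original notebook; the helper name `count_variants` is hypothetical, and it assumes an uncompressed VCF in which header lines start with `#`.

```bash
# Hypothetical helper: count the non-header, non-blank lines of an
# uncompressed VCF file, i.e. the number of variant records.
count_variants() {
  local vcf="$1"
  if [ ! -r "$vcf" ]; then
    echo "cannot read: $vcf" >&2
    return 1
  fi
  awk '!/^#/ && NF { n++ } END { print n + 0 }' "$vcf"
}
```

With the volume mount used above, `count_variants /.nbdocker/variants/variants.vcf` would print the record count on the host. A count of 0 usually means something went wrong upstream and is worth investigating before continuing.
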

The BAM file can also be checked with ValidateSamFile:

```bash
docker run --rm -i -v /.nbdocker:/data \
-e bamfile='/data/gatk/bams/ERR013140.bam' \
broadinstitute/gatk:4.0.5.1 \
/bin/bash -c 'gatk ValidateSamFile -I $bamfile -MODE SUMMARY'
```

{nbdocker#5}


## GATK (Spark steps)
##### About 1 minute

```bash
docker run --rm -i -v /.nbdocker:/data \
-e reference='ref.fasta' \
-e bamfile='/data/gatk/bams/ERR013140.bam' \
-e dedupFile='/data/gatk/bams/ERR013140_dedup.bam' \
-e metricsFile='/data/gatk/bams/metrics.txt' \
broadinstitute/gatk:4.0.5.1 \
/bin/bash -c 'gatk --java-options "-Xmx6G" MarkDuplicatesSpark \
-R $reference \
-I $bamfile \
-O $dedupFile \
-M $metricsFile \
-- \
--spark-master "local[*]"'
```

{nbdocker#6}




## Visualization steps

Something something the force...