# The GATK-Spark Notebook

## Motivation

We demonstrate how embedding Docker containers lets us distribute Jupyter notebooks that run complicated workflows interactively on the cloud.

The example is a GATK variant calling pipeline that runs in a distributed manner using Spark on an auto-generated AWS Kubernetes cluster.


### Download the fastq input files 

##### This step will take approximately 30 minutes, depending on your bandwidth.

 

The fastq files are publicly available at the following links:

- https://trace.ncbi.nlm.nih.gov/Traces/sra/?run=SRR493366
- https://trace.ncbi.nlm.nih.gov/Traces/sra/?run=SRR493367
- https://trace.ncbi.nlm.nih.gov/Traces/sra/?run=SRR493368
- https://trace.ncbi.nlm.nih.gov/Traces/sra/?run=SRR493369
- https://trace.ncbi.nlm.nih.gov/Traces/sra/?run=SRR493370
- https://trace.ncbi.nlm.nih.gov/Traces/sra/?run=SRR493371

To download the files we can use the parallel-fastq-dump container. This is a Python wrapper around the fastq-dump utility that splits each file into chunks and downloads the chunks in parallel.
```bash
docker run --rm -it biodepot/parallel-fastq-dump /bin/bash -c 'parallel-fastq-dump --sra-id SRR493366 SRR493367 SRR493368 SRR493369 SRR493370 SRR493371  --threads 16 --outdir /.nbdocker/kallisto/output --split-files --gzip'
```

{nbdocker#0}
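
Before moving on, it can help to confirm that the download produced every file the later steps expect. The sketch below assumes that `--split-files` on these paired-end runs yields `<accession>_1.fastq.gz` and `<accession>_2.fastq.gz` (the naming used by the later commands in this notebook) and that the files sit in a directory you pass as an argument. The function name `check_fastq_outputs` is hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical helper: report which expected paired-end fastq files are missing.
# Assumes --split-files produced <accession>_1.fastq.gz and <accession>_2.fastq.gz.

ACCESSIONS="SRR493366 SRR493367 SRR493368 SRR493369 SRR493370 SRR493371"

check_fastq_outputs() {
    local outdir="$1"
    local missing=0
    local acc mate f
    for acc in $ACCESSIONS; do
        for mate in 1 2; do
            f="${outdir}/${acc}_${mate}.fastq.gz"
            # -s is true only if the file exists and is non-empty
            if [ ! -s "$f" ]; then
                echo "missing: $f"
                missing=$((missing + 1))
            fi
        done
    done
    echo "missing files: ${missing}"
    # Return a non-zero status when anything is missing
    [ "$missing" -eq 0 ]
}
```

Call it with the host directory that holds the downloads, for example `check_fastq_outputs /path/to/output`. A non-zero exit status means at least one file is absent or empty.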

### Download the reference and create the indices
##### Running two containers will take approximately 10 minutes.

We download the reference and generate the indices for bwa.

Download the reference human transcriptome:
```bash
docker run --rm -it biodepot/kallisto bash -c 'kallisto index -i /.nbdocker/kallisto/annotation/human_trans.kidx /.nbdocker/kallisto/annotation/human_trans.fa.gz'
```
{nbdocker#1}

Build indices using bwa:

```bash
docker run --rm -it biodepot/kallisto bash -c 'kallisto index -i /.nbdocker/kallisto/annotation/human_trans.kidx /.nbdocker/kallisto/annotation/human_trans.fa.gz'
```

{nbdocker#2}

## Alignment using bwa-mem to create bamfiles

##### Running alignment takes approximately 30 minutes per fastq input file. There are 6 of these jobs to run, for a total of ~36 minutes.

#### We run this using a shell script that executes the alignment and samtools conversion to bamfiles in parallel

```bash
docker run --rm -it biodepot/kallisto bash -c 'kallisto quant -i /.nbdocker/kallisto/annotation/human_trans.kidx -b 30 --bias -t 8 -o /.nbdocker/kallisto/SRR493366 /.nbdocker/kallisto/output/SRR493366_1.fastq.gz /.nbdocker/kallisto/output/SRR493366_2.fastq.gz' 
```

{nbdocker#3}
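
As a rough sketch of the kind of parallel shell script this section describes, the following builds one `bwa mem` plus `samtools sort` pipeline per accession and launches the pipelines as background jobs. The reference path, directory names, thread count, and the `DRY_RUN` switch are assumptions for illustration, not the script shipped with the notebook. With `DRY_RUN=1` the commands are only printed.

```bash
#!/usr/bin/env bash
# Sketch only: align each paired-end sample with bwa mem and write a sorted BAM.
# REF, FASTQ_DIR, BAM_DIR and THREADS are hypothetical values; adjust to your layout.

REF="ref/ref.fasta"
FASTQ_DIR="output"
BAM_DIR="bams"
THREADS=8
DRY_RUN=1   # 1 = print the commands instead of running them

align_cmd() {
    # Print the alignment pipeline for one accession.
    local acc="$1"
    echo "bwa mem -t ${THREADS} ${REF} ${FASTQ_DIR}/${acc}_1.fastq.gz ${FASTQ_DIR}/${acc}_2.fastq.gz | samtools sort -o ${BAM_DIR}/${acc}.bam -"
}

align_all() {
    local acc
    for acc in "$@"; do
        if [ "$DRY_RUN" = "1" ]; then
            align_cmd "$acc"
        else
            # Run each pipeline in the background so samples align in parallel
            bash -c "$(align_cmd "$acc")" &
        fi
    done
    # Block until every background job has finished
    wait
}

align_all SRR493366 SRR493367 SRR493368 SRR493369 SRR493370 SRR493371
```

Launching all six jobs at once with 8 threads each only makes sense on a machine with enough cores and memory; on a smaller machine, lower `THREADS` or pass fewer accessions per call.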


Results will be in: /home/jovyan/work/kallisto/


## GATK (non-Spark steps)

Not all the components of GATK use Spark. The two steps below, HaplotypeCaller and ValidateSamFile, run without it.


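```bash
# Mount the host data at /gatk/gatk_data so that the cd below resolves inside the container.
# The continuation backslash must be followed directly by the next line of the command.
docker run --rm -i -v /home/jovyan/work/gatk_data:/gatk/gatk_data broadinstitute/gatk:4.0.5.1 \
/bin/bash -c 'cd /gatk/gatk_data/germline && gatk HaplotypeCaller -R ref/ref.fasta -I bams/mother.bam -O sandbox/variants.vcf'
```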

{nbdocker#4}
```bash
# Same mount as above: host data directory mapped to /gatk/gatk_data in the container.
docker run --rm -i -v /home/jovyan/work/gatk_data:/gatk/gatk_data broadinstitute/gatk:4.0.5.1 \
/bin/bash -c 'cd /gatk/gatk_data/germline && gatk ValidateSamFile -I bams/mother.bam -MODE SUMMARY'
```

{nbdocker#5}
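
The GATK commands in this section repeat the same `docker run` prefix. A small wrapper, sketched here, keeps the mount and image in one place. The function name `gatk_in_docker` and the `DRY_RUN` switch are hypothetical, and the sketch assumes the host data directory is mounted at `/gatk/gatk_data`, the path the commands `cd` into.

```bash
#!/usr/bin/env bash
# Hypothetical wrapper around the docker invocation used for the GATK steps.
HOST_DATA="/home/jovyan/work/gatk_data"
CONTAINER_DATA="/gatk/gatk_data"   # assumption: mount point matching the cd target
IMAGE="broadinstitute/gatk:4.0.5.1"
DRY_RUN=1                          # 1 = print the docker command only

gatk_in_docker() {
    # All arguments form one gatk command line, run from the germline directory.
    local inner="cd ${CONTAINER_DATA}/germline && gatk $*"
    if [ "$DRY_RUN" = "1" ]; then
        echo "docker run --rm -i -v ${HOST_DATA}:${CONTAINER_DATA} ${IMAGE} /bin/bash -c '${inner}'"
    else
        docker run --rm -i -v "${HOST_DATA}:${CONTAINER_DATA}" "${IMAGE}" /bin/bash -c "${inner}"
    fi
}

gatk_in_docker ValidateSamFile -I bams/mother.bam -MODE SUMMARY
```

Because the arguments are joined into a single string, this sketch is only suitable for arguments without embedded spaces or quotes, such as the relative paths used in this notebook.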


## GATK (Spark steps)
##### About 1 minute

```bash
# Host data directory is mounted at /gatk/gatk_data so that the cd below resolves.
# Arguments after the bare "--" are passed to Spark rather than to the tool itself.
docker run --rm -i -v /home/jovyan/work/gatk_data:/gatk/gatk_data broadinstitute/gatk:4.0.5.1 \
/bin/bash -c 'cd /gatk/gatk_data/germline && \
gatk --java-options "-Xmx6G" MarkDuplicatesSpark \
-R ref/ref.fasta \
-I bams/mother.bam \
-O sandbox/mother_dedup.bam \
-M sandbox/metrics.txt \
-- \
--spark-master local[*]'
```

{nbdocker#6}
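
The `--spark-master local[*]` argument runs Spark in local mode on all available cores, while `local[N]` limits it to N worker threads. The sketch below builds that argument from an optional core count; the helper name `spark_master_arg` is hypothetical and covers local mode only, not a cluster master URL.

```bash
#!/usr/bin/env bash
# Hypothetical helper: build the Spark master argument for local mode.
spark_master_arg() {
    local cores="${1:-all}"
    if [ "$cores" = "all" ]; then
        # local[*] uses every core visible to the container
        echo "--spark-master local[*]"
    else
        # local[N] caps Spark at N worker threads
        echo "--spark-master local[${cores}]"
    fi
}

spark_master_arg
spark_master_arg 4
```

The printed value goes after the bare `--` separator on the gatk command line, in place of the fixed `--spark-master local[*]`.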




## Visualization steps

The text file "sample_info.tsv" indicates whether each sample is a control or corresponds to the HOXA1 knockdown. It also specifies the path of the transcript abundance file for each sample.

```r
s2c <- read.table("sample_info.tsv", header = TRUE, stringsAsFactors = FALSE)
s2c
```

```
“cannot open file 'sample_info.tsv': No such file or directory”
ERROR: Error in file(file, "rt"): cannot open the connection
```
