# Fusera variant calling pipeline
---

In this tutorial, we will use [Fusera](https://github.com/mitre/fusera) (a cloud extension to the [NCBI Sequence Read Archive](https://www.ncbi.nlm.nih.gov/sra)) to access publicly available test data from the 1000 Genomes project. Then we will follow GATK best practices to do variant calling on the test data.

**Note:** If you have not installed [Fusera](https://github.com/mitre/fusera) and [bcbio](https://bcbio-nextgen.readthedocs.io/en/latest/index.html), please do so before proceeding.

---

## Part I: Accessing data with Fusera

First, we will download the publicly available test `.ngc` file. If there are other datasets you want to use, please submit a data access request to [dbGaP](https://www.ncbi.nlm.nih.gov/gap).

In [None]:
# Download test .ngc file (publicly available)
cd ~
wget ftp://ftp.ncbi.nlm.nih.gov/sra/examples/decrypt_examples/prj_phs710EA_test.ngc

Next, let's set up the directories for this tutorial. The directory `cloud_mountpoint` must be empty; we will mount our cloud instance on it to access the data. The `local_results` directory will store the output files from running the GATK variant calling pipeline.

In [9]:
# This is the directory we will use to mount the cloud instance and access data
mkdir -p ~/cloud_mountpoint # This directory must be empty for Fusera to work!

# We will use this local directory to store our analysis output
mkdir -p ~/local_results
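
Since Fusera needs both the `.ngc` file and an empty mountpoint, it can help to check both before mounting. The following is a minimal sketch, not part of Fusera itself: `preflight_check` is a hypothetical helper name, and the checks assume that a usable `.ngc` file is simply one that exists and is non-empty.

```bash
# Hypothetical helper (not part of Fusera): check the inputs before mounting.
# Usage: preflight_check <ngc_file> <mountpoint_dir>
preflight_check() {
    local ngc_file=$1 mountpoint_dir=$2

    # The .ngc file must exist and be non-empty
    # (an interrupted download can leave an empty file behind).
    if [ ! -s "$ngc_file" ]; then
        echo "missing or empty .ngc file: $ngc_file" >&2
        return 1
    fi

    # The mountpoint must be an existing directory.
    if [ ! -d "$mountpoint_dir" ]; then
        echo "mountpoint is not a directory: $mountpoint_dir" >&2
        return 1
    fi

    # The mountpoint must be empty; ls -A also lists hidden files.
    if [ -n "$(ls -A "$mountpoint_dir")" ]; then
        echo "mountpoint is not empty: $mountpoint_dir" >&2
        return 1
    fi

    echo "ok"
}

# Example call with the paths used in this tutorial:
#   preflight_check ~/prj_phs710EA_test.ngc ~/cloud_mountpoint
```

The function prints `ok` and returns 0 only when all three checks pass, so it can be chained in front of the mount command with `&&`.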

In [10]:
ls

cloud_mountpoint             output.log
config_GCF_gatk_variant.yaml prj_phs710EA_test.ngc
fusera_variant_calling.ipynb test
local_results


Now, we will mount the cloud instance to access our test data. **Important:** You must unmount the cloud instance after you are done. You can use the command `fusera unmount ~/cloud_mountpoint`. I have also included a code block at the end of this tutorial to unmount.

If you would like to access other data, you can find valid input for the `--location` flag by going to [NCBI SRA](https://www.ncbi.nlm.nih.gov/sra) and opening the SRA Run Selector. Search for your run (e.g. SRR1219902) and look at its RunInfo Table; the `DATASTORE_region` column holds the value to pass to `--location`. For the example used in this tutorial, the RunInfo Table can be found [on this page](https://trace.ncbi.nlm.nih.gov/Traces/study/?acc=SRR121990&go=go).

In [16]:
# You will have to change the file paths depending on your directory structure.
# The .ngc path uses $HOME because bash does not expand ~ inside quotes.
fusera mount --ngc "$HOME/prj_phs710EA_test.ngc" --accession "SRR1219902" --location s3.us-east-1 ~/cloud_mountpoint > output.log 2>&1 &


[1] 10203
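
Because the mount runs in the background with its output redirected to `output.log`, a failed mount is easy to miss. The sketch below polls the mountpoint until something appears in it or a timeout expires. `wait_for_contents` is a hypothetical helper name, and it assumes that a successful mount makes the accession's files visible as contents of the mountpoint.

```bash
# Hypothetical helper (not part of Fusera): poll until a directory has
# contents, or give up after a timeout.
# Usage: wait_for_contents <dir> [timeout_seconds]
wait_for_contents() {
    local dir=$1 timeout=${2:-30} waited=0

    # ls -A prints nothing for an empty (or not yet mounted) directory.
    while [ -z "$(ls -A "$dir" 2>/dev/null)" ]; do
        if [ "$waited" -ge "$timeout" ]; then
            echo "nothing appeared in $dir after ${timeout}s" >&2
            return 1
        fi
        sleep 1
        waited=$((waited + 1))
    done

    echo "ready"
}

# Example: wait up to 60 seconds, and show the log if the mount never appears.
#   wait_for_contents ~/cloud_mountpoint 60 || tail output.log
```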



Let's take a look at what is inside of `cloud_mountpoint`.

In [18]:
cd ~/cloud_mountpoint
ls

---

## Part II: GATK variant calling

Next, we will run `bcbio`. We have already filled out the config file, `config_GCF_gatk_variant.yaml`, which stores the parameters for our variant calling pipeline. If you would like to experiment with the parameters for this tutorial, `config_GCF_gatk_variant.yaml` lives in the same directory as this IPython notebook.

If you would like to use a different variant calling pipeline, `bcbio` comes with many configuration file templates. You can also use the following command to automatically create a sample configuration file for your favorite pipeline:

```bash
bcbio_nextgen.py -w template gatk-variant project1 sample1.bam sample2_1.fq sample2_2.fq
```

You can find a list of supported pipelines [here](https://bcbio-nextgen.readthedocs.io/en/latest/contents/pipelines.html).

Please read [`bcbio`'s getting started page](https://bcbio-nextgen.readthedocs.io/en/latest/contents/testing.html) for more details.

In [None]:
# The -n flag distributes this across 8 local cores
bcbio_nextgen.py config_GCF_gatk_variant.yaml -n 8
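
The value 8 above is specific to the machine used for this tutorial. As a rough sketch, the core count can be detected instead of hard-coded. `detect_cores` is a hypothetical helper name; the block only prints the resulting command rather than running `bcbio`, and using every available core is an assumption that may not suit a shared machine.

```bash
# Hypothetical helper: print the number of online CPU cores.
detect_cores() {
    if command -v nproc >/dev/null 2>&1; then
        nproc                          # GNU coreutils
    else
        getconf _NPROCESSORS_ONLN      # fallback where nproc is unavailable
    fi
}

cores=$(detect_cores)

# Print the command that would be run, with the detected core count.
echo "bcbio_nextgen.py config_GCF_gatk_variant.yaml -n $cores"
```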

---

## Important: Unmount Fusera

In [None]:
# Remove the background mount job (job 1) from this shell's job table
disown %1

In [None]:
# Leave the mountpoint before unmounting; change this path to your own notebook directory
cd ~/jupyter_notebooks/fusera_vc
fusera unmount ~/cloud_mountpoint
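
The `cd` above matters because a FUSE filesystem cannot be unmounted while a shell's working directory is still inside it. The following sketch makes that step independent of your directory layout. `leave_dir` is a hypothetical helper name; it moves to the parent of the given directory only when the shell is currently inside it.

```bash
# Hypothetical helper: if the shell is inside the given directory tree,
# move to that directory's parent; otherwise stay put.
# Usage: leave_dir <dir>
leave_dir() {
    local target
    # Resolve the directory to an absolute physical path.
    target=$(cd "$1" && pwd -P) || return 1

    # The trailing slash prevents /a/mnt2 from matching /a/mnt.
    case "$(pwd -P)/" in
        "$target"/*) cd "$(dirname "$target")" || return 1 ;;
    esac
}

# Example:
#   leave_dir ~/cloud_mountpoint && fusera unmount ~/cloud_mountpoint
```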