## A brief primer on variant calling

The DNA sequences of two individuals in a population will be different because of crossover between parents' genes and because of random mutations. For a wide variety of applications, it has become important to know an individual's genetic composition (or genotype). But do we need to store the whole sequence?

It turns out that, instead of storing the whole sequence, we can simply store the differences between an individual's DNA sequence and a reference sequence, as illustrated in this image from the [European Bioinformatics Institute](https://www.ebi.ac.uk/training/online/course/human-genetic-variation-i-introduction/variant-identification-and-analysis/what-variant):
![Generate Variants](https://www.ebi.ac.uk/training/online/sites/ebi.ac.uk.training.online/files/resize/GenVar_Fig_CRAM_file-750x162.png)
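
To make the storage idea concrete, here is a minimal sketch in Python. It assumes the simplest possible case, where the individual's sequence has the same length as the reference and differs only by single-letter substitutions. The sequences and the function names `diff_against_reference` and `reconstruct` are made up for illustration.

```python
def diff_against_reference(reference, sample):
    """Return {position: base} for each position where sample differs from reference."""
    if len(reference) != len(sample):
        # Insertions and deletions need a real alignment; this toy skips them.
        raise ValueError("this toy example only handles substitutions")
    return {i: s for i, (r, s) in enumerate(zip(reference, sample)) if r != s}


def reconstruct(reference, differences):
    """Rebuild the individual's sequence from the reference plus the stored differences."""
    bases = list(reference)
    for position, base in differences.items():
        bases[position] = base
    return "".join(bases)


reference = "ACGTACGTAC"   # hypothetical reference sequence
sample = "ACGTTCGTAA"      # hypothetical individual sequence
differences = diff_against_reference(reference, sample)
print(differences)                                   # only two positions are stored
print(reconstruct(reference, differences) == sample)  # the full sequence is recoverable
```

Only the positions that differ are kept, yet the full sequence can still be rebuilt as long as the reference is available.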

In the image, the gray lines are "reads" from the sequencing instrument. The red letters indicate the positions where the reads differ from the reference DNA sequence (depicted in color). These differences are called variants, and the variants are stored in a tab-delimited text file called a Variant Call Format (VCF) file.
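
Because a VCF file is tab-delimited text, its records can be read with ordinary string handling. The sketch below is illustrative only: the two data lines are invented for this example, and `VcfRecord` and `parse_vcf_line` are hypothetical names rather than part of any library. It relies on the first five VCF columns being CHROM, POS, ID, REF and ALT, with multiple alternate alleles separated by commas.

```python
from typing import List, NamedTuple


class VcfRecord(NamedTuple):
    chrom: str       # chromosome or contig name
    pos: int         # 1-based position on the reference
    ref: str         # reference allele
    alts: List[str]  # one or more alternate alleles


def parse_vcf_line(line):
    """Parse the first five tab-separated columns of a VCF data line."""
    fields = line.rstrip("\n").split("\t")
    chrom, pos, _variant_id, ref, alt = fields[:5]
    return VcfRecord(chrom, int(pos), ref, alt.split(","))


# A tiny, made-up VCF: header lines start with '#', data lines do not.
example_vcf = "\n".join([
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "20\t14370\t.\tG\tA\t29\tPASS\t.",
    "20\t1234567\t.\tGTC\tG,GTCT\t50\tPASS\t.",
])

records = [
    parse_vcf_line(line)
    for line in example_vcf.splitlines()
    if not line.startswith("#")
]
for record in records:
    print(record)
```

For real files, a dedicated parser is a safer choice, since it also handles the header metadata and the per-sample genotype columns that this sketch ignores.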

Identifying variants requires a sophisticated tool such as GATK or DeepVariant. For one thing, the reads from the instrument have to be aligned to the reference DNA sequence, and a read may not exactly match any part of the reference precisely because it contains variants. Even though the image above shows differences in just one letter (a single nucleotide polymorphism, or SNP), variants can be more complex. Entire subsequences might be deleted, inserted, inverted, duplicated, or duplicated repeatedly:

![Types of Variants](https://codelabs.developers.google.com/codelabs/genomics-deepvariant/img/f7142c3c5224fe9c.png)

*Image from European Bioinformatics Institute.*
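
For the simplest of these variant types, the kind of change can be read off by comparing the lengths of the reference and alternate alleles. The following is a rough sketch under that assumption; `classify_variant` is a hypothetical helper, and it deliberately does not try to recognize inversions or duplications, which need more context than two short strings provide.

```python
def classify_variant(ref, alt):
    """Classify a simple variant by comparing reference and alternate alleles."""
    if ref == alt:
        raise ValueError("ref and alt are identical, so this is not a variant")
    if len(ref) == 1 and len(alt) == 1:
        return "SNP"                     # single-nucleotide change
    if len(ref) < len(alt):
        return "insertion"               # alt carries extra bases
    if len(ref) > len(alt):
        return "deletion"                # alt is missing bases
    return "multi-base substitution"     # same length, several letters differ


for ref, alt in [("G", "A"), ("G", "GTCT"), ("GTC", "G"), ("AC", "GT")]:
    print(ref, alt, classify_variant(ref, alt))
```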

Variants are significant, not just because they allow more efficient storage, but because genetic differences between individuals can lead to differences in an individual's phenotype (such as their height or risk of developing a disease).

Variant calling is the process of going from sequence reads (typically stored in BAM files) to the variants between an individual and a reference genome (typically stored in a VCF file). Once we have the variants, we can analyze them to look for specific genetic markers.
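
The overall idea can be sketched with a toy "pileup" caller: stack the aligned reads over the reference, count the bases seen at each position, and report positions where the reads consistently disagree with the reference. This is not how GATK or DeepVariant work internally. It assumes the reads are already aligned without gaps, and the function `call_variants`, its thresholds, and the data are all invented for illustration.

```python
from collections import Counter, defaultdict


def call_variants(reference, aligned_reads, min_depth=2, min_fraction=0.8):
    """Toy substitution caller.

    aligned_reads is a list of (start, sequence) pairs, where start is the
    0-based reference position of the first base of an ungapped read.
    Returns a list of (position, reference_base, observed_base) tuples.
    """
    pileup = defaultdict(Counter)
    for start, sequence in aligned_reads:
        for offset, base in enumerate(sequence):
            pileup[start + offset][base] += 1

    variants = []
    for position in sorted(pileup):
        counts = pileup[position]
        depth = sum(counts.values())
        base, count = counts.most_common(1)[0]
        supported = depth >= min_depth and count / depth >= min_fraction
        if supported and base != reference[position]:
            variants.append((position, reference[position], base))
    return variants


reference = "ACGTACGTAC"  # hypothetical reference
reads = [(0, "ACGTT"), (2, "GTTCG"), (3, "TTCGT"), (5, "CGTAC")]
print(call_variants(reference, reads))
```

Note that the sketch uses 0-based positions for convenience, whereas VCF positions are 1-based. Real callers also have to weigh base quality, mapping quality, and sequencing error, which is much of what makes the problem hard.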

### Test your understanding:

1. DeepVariant is an algorithm to do _________________.
2. There are _______ inputs to a variant caller.
3. The inputs to a variant caller are _______________ and _________________.
4. The output of a variant caller is __________________.

### Answers to "Test your understanding":

1. Variant calling.
2. Two (see the next answer).
3. A reference genome and sequence reads from an individual's tissue sample. The reference could belong to the species or to the same individual organism (such as from the organism a few years prior, or from a part of the body unaffected by disease).
4. A VCF file containing the variants between the reference genome and the genotype of the individual's tissue sample.



In [None]:
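import htsget

# Fetch the reads for sample NA20276 over the htsget protocol
# and write them to a local file named NA20276.bam.
url = "http://htsnexus-server:48444/v1/reads/1000genomes_low_coverage/NA20276"
with open("NA20276.bam", "wb") as output:
    htsget.get(url, output)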

In [1]:
!git clone https://github.com/dnanexus-rnd/DeepVariant-GLnexus-WDL.git

Cloning into 'DeepVariant-GLnexus-WDL'...
remote: Enumerating objects: 157, done.
remote: Total 157 (delta 0), reused 0 (delta 0), pack-reused 157
Receiving objects: 100% (157/157), 40.23 KiB | 2.12 MiB/s, done.
Resolving deltas: 100% (90/90), done.


In [2]:
!wget https://github.com/broadinstitute/cromwell/releases/download/36.1/womtool-36.1.jar

--2019-03-04 02:52:15--  https://github.com/broadinstitute/cromwell/releases/download/36.1/womtool-36.1.jar
Resolving github.com (github.com)... 192.30.253.113, 192.30.253.112
Connecting to github.com (github.com)|192.30.253.113|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://github-production-release-asset-2e65be.s3.amazonaws.com/34136406/1f5d6500-36ac-11e9-9d75-d285946e7f6e?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAIWNJYAX4CSVEH53A%2F20190304%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20190304T025215Z&X-Amz-Expires=300&X-Amz-Signature=bc337b6a400fb7ac1fce15fb225d310b786e26c5f2afb9ce4c92a97bd16b8f17&X-Amz-SignedHeaders=host&actor_id=0&response-content-disposition=attachment%3B%20filename%3Dwomtool-36.1.jar&response-content-type=application%2Foctet-stream [following]
--2019-03-04 02:52:15--  https://github-production-release-asset-2e65be.s3.amazonaws.com/34136406/1f5d6500-36ac-11e9-9d75-d285946e7f6e?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-A

In [3]:
! docker run -it broadinstitute/womtool:36-858f647 sh

/bin/sh: 1: docker: not found
