## A Brief Primer on Variant Calling

The DNA sequences of two individuals in a population will be different because of crossover between parents' genes and because of random mutations. For a wide variety of applications, it has become important to know an individual's genetic composition (or genotype). But do we need to store the whole sequence?

It turns out that, instead of storing the whole sequence, we can simply store the difference between an individual's DNA sequence and a reference individual, as illustrated in this image from the [European Bioinformatics Institute](https://www.ebi.ac.uk/training/online/course/human-genetic-variation-i-introduction/variant-identification-and-analysis/what-variant):
![Generate Variants](https://www.ebi.ac.uk/training/online/sites/ebi.ac.uk.training.online/files/resize/GenVar_Fig_CRAM_file-750x162.png)

In the image, the gray lines are "reads" from the sequencing instrument. The red letters mark the positions where the reads differ from the reference DNA sequence (depicted in color). These differences are called variants, and the variants are stored in a tab-delimited text file called a Variant Call Format (VCF) file.
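To make the "tab-delimited text file" concrete, here is a minimal sketch that reads the first five fixed columns (CHROM, POS, ID, REF, ALT) of one VCF data line. The `Variant` type, the `parse_vcf_line` helper, and the example record are illustrative assumptions, not part of any workflow in this repository; a real pipeline would more likely use an established VCF library.

```python
from typing import List, NamedTuple


class Variant(NamedTuple):
    """The fixed VCF columns this sketch cares about (hypothetical type)."""
    chrom: str
    pos: int          # 1-based position on the reference
    ref: str          # reference allele
    alts: List[str]   # one or more alternate alleles


def parse_vcf_line(line: str) -> Variant:
    """Parse CHROM, POS, ID, REF and ALT from one tab-delimited VCF data line."""
    fields = line.rstrip("\n").split("\t")
    chrom, pos, _id, ref, alt = fields[:5]
    # Multiple alternate alleles are comma-separated in the ALT column.
    return Variant(chrom, int(pos), ref, alt.split(","))


# A made-up record: reference base C replaced by T at chr20:10000117.
example = "chr20\t10000117\t.\tC\tT\t50\tPASS\t.\tGT\t0/1"
variant = parse_vcf_line(example)
print(variant)
```

Header lines (those starting with `#`) are not handled here and would need to be skipped before calling the helper.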

Identifying variants requires a sophisticated tool such as GATK or DeepVariant. For one thing, the reads from the instrument have to be aligned to the reference DNA sequence, and a read may not exactly match any part of the reference precisely because it contains variants. Although the image above shows differences of just one letter (single nucleotide polymorphisms), variants can be more complex: entire subsequences might be deleted, inserted, inverted, duplicated, or duplicated repeatedly:

![Types of Variants](https://codelabs.developers.google.com/codelabs/genomics-deepvariant/img/f7142c3c5224fe9c.png)

*Image from European Bioinformatics Institute.*
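As a rough illustration of how these variant types show up in a VCF record, the sketch below labels a variant by comparing the lengths of its REF and ALT alleles. The `classify_variant` function and its category names are assumptions made for this example. It is a simplification: larger structural changes such as inversions and duplications are often written with symbolic alleles like `<INV>` or `<DUP>`, which this sketch only flags rather than interprets.

```python
def classify_variant(ref: str, alt: str) -> str:
    """Roughly label a variant from its REF and ALT alleles (illustrative only)."""
    if alt.startswith("<"):
        # Symbolic alleles such as <DEL>, <INV> or <DUP> describe structural variants.
        return "symbolic"
    if len(ref) == 1 and len(alt) == 1:
        return "SNP"
    if len(ref) < len(alt):
        return "insertion"
    if len(ref) > len(alt):
        return "deletion"
    # Same length but longer than one base: several adjacent substitutions.
    return "MNP or complex"


for ref, alt in [("C", "T"), ("A", "ATG"), ("ATG", "A"), ("AT", "GC"), ("N", "<DEL>")]:
    print(ref, alt, classify_variant(ref, alt))
```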

Variants are significant, not just because they allow more efficient storage, but because genetic differences between individuals can lead to differences in an individual's phenotype (such as their height or risk of developing a disease).

Variant calling is the process of going from sequence reads (typically stored in BAM files) to the variants between an individual and a reference genome (typically stored in a VCF file). Once we have the variants, we can analyze them to look for specific genetic markers.

**Test your understanding:**

    DeepVariant is an algorithm to do _________________
    There are _______ inputs to a variant caller.
    The inputs to a variant caller are _______________ and _________________.
    The output of a variant caller is __________________

**Answers to "Test your understanding":**

    Variant calling.
    Two. (see below)
    A reference genome and sequence reads belonging to an individual tissue. The reference could belong to the species or to the same individual organism (such as from the organism a few years prior, or from a part of the body unaffected by disease).
    A VCF file containing the variants between the reference genome and the individual tissue's genotype.



## DeepVariant

DeepVariant is an analysis pipeline that uses a deep neural network to call genetic variants from next-generation DNA sequencing data. It uses a TensorFlow-based image classification model ([Inception-v3](https://arxiv.org/abs/1512.00567)) to assign genotype likelihoods from the experimental data produced by the instrument. Read additional information on the [Google Research blog](https://ai.googleblog.com/2017/12/deepvariant-highly-accurate-genomes.html).

![DeepVariant Workflow](https://github.com/google/deepvariant/blob/r0.7/docs/DeepVariant-workflow-figure.png?raw=true)



## Using DeepVariant to Analyze Variants

These portable WDL workflows use DeepVariant to call variants from whole-genome sequencing (WGS) read alignments, followed by GLnexus to merge the resulting Genome VCF (gVCF) files for several samples into a Project VCF (pVCF). The sequential workflow below generates a gVCF from a given BAM file and genomic range.

```
             +----------------------------------------------------------------------------+
             |                                                                            |
             |  DeepVariant.wdl                                                           |
             |                                                                            |
             |  +-----------------+    +-----------------+    +------------------------+  |
sample.bam   |  |                 |    |                 |    |                        |  |
 genome.fa ----->  make_examples  |---->  call_variants  |---->  postprocess_variants  |-----> gVCF
     range   |  |                 |    |                 |    |                        |  |
             |  +-----------------+    +--------^--------+    +------------------------+  |
             |                                  |                                         |
             |                                  |                                         |
             +----------------------------------|-----------------------------------------+
                                                |
                                       DeepVariant Model
```

`make_examples` and `call_variants` internally parallelize across CPUs on the machine they run on. The tasks use the Docker image published by the [DeepVariant](https://github.com/google/deepvariant) team.
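The hand-off between the three stages in the diagram can be sketched as a list of command lines, where each stage consumes the file the previous one wrote. This is a sketch, not the contents of `DeepVariant.wdl`: the `deepvariant_commands` helper and all intermediate file names are hypothetical, and the flags follow the DeepVariant command-line documentation as best recalled, so check them against the release you actually run.

```python
from typing import List


def deepvariant_commands(ref: str, bam: str, region: str,
                         model: str, prefix: str) -> List[List[str]]:
    """Build (but do not run) the three DeepVariant stages for one range."""
    # Intermediate file names are assumptions chosen for this example.
    examples = f"{prefix}.examples.tfrecord.gz"
    gvcf_records = f"{prefix}.gvcf.tfrecord.gz"
    called = f"{prefix}.call_variants.tfrecord.gz"
    return [
        # Stage 1: turn read pileups in the region into candidate examples.
        ["make_examples", "--mode", "calling", "--ref", ref, "--reads", bam,
         "--regions", region, "--examples", examples, "--gvcf", gvcf_records],
        # Stage 2: score the examples with the DeepVariant model.
        ["call_variants", "--examples", examples, "--checkpoint", model,
         "--outfile", called],
        # Stage 3: convert the model output into VCF and gVCF files.
        ["postprocess_variants", "--ref", ref, "--infile", called,
         "--outfile", f"{prefix}.vcf.gz",
         "--nonvariant_site_tfrecord_path", gvcf_records,
         "--gvcf_outfile", f"{prefix}.g.vcf.gz"],
    ]


commands = deepvariant_commands("genome.fa", "sample.bam", "chr20",
                                "model.ckpt", "out/chr20")
for command in commands:
    print(" ".join(command))
```

Keeping the commands as argument lists makes the data flow easy to inspect before anything is executed inside the container.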

## htsget + DeepVariant + WDL
For each range, this workflow fetches a BAM slice using the GA4GH htsget client in samtools 1.7+, given an htsget server endpoint and sample ID, and runs the DeepVariant workflow above on that slice. Finally, it concatenates the per-range gVCFs into the complete sample gVCF.

```
             +--------------------------------------------------------------------------------+
             |                                                                                |
             |  htsget_DeepVariant.wdl                                                        |
             |                                                                                |
             |       +-----------------+    +-------------------+                             |
             |       |                 |    |                   |  range gVCF                 |
             |   +--->  htsget client  |---->  DeepVariant.wdl  |---+                         |
             |   |   |  (samtools)     |    |                   |   |                         |
             |   |   |                 |    +-------------------+   |                         |
sample ID    |   |   +-----------------+                            |  +-------------------+  |
             |   |                                                  +-->                   |  |
   ranges -------+---> ...                  ...                 ... --->  bcftools concat  +-----> sample gVCF
    (e.g.    |   |                                                  +-->                   |  |
     chr1    |   |   +-----------------+                            |  +-------------------+  |
     chr2    |   |   |                 |    +-------------------+   |                         |
     ...)    |   +--->  htsget client  |    |                   |   |                         |
             |       |  (samtools)     |---->  DeepVariant.wdl  |---+                         |
             |       |                 |    |                   |  range gVCF                 |
             |       +------------^----+    +-------------------+                             |
             |            |       |                                                           |
             |            |       |                                                           |
             +------------|-------|-----------------------------------------------------------+
                          |       |
               sample ID  |       |
                   range  |       |  range BAM
                          |       |
                     +----v------------+
                     |                 |
                     |  htsget server  |
                     |                 |
                     +-----------------+
```

By using htsget, the workflow scatters across the ranges without first having to download and slice up a monolithic BAM file.
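The scatter-and-gather shape of this workflow can be imitated in plain Python, purely as an analogy to the WDL scatter. In the sketch below, `scatter_gather` and `fake_call_range` are hypothetical names; the per-range function stands in for "fetch the BAM slice, then run DeepVariant" and returns only a file name. The gather step builds a `bcftools concat` command rather than running it.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence


def scatter_gather(ranges: Sequence[str],
                   call_range: Callable[[str], str],
                   output: str,
                   max_workers: int = 4) -> List[str]:
    """Process each range independently, then build the concat command."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() yields results in input order, so the per-range gVCFs stay
        # in genomic order even if the ranges finish at different times.
        range_gvcfs = list(pool.map(call_range, ranges))
    # -O z requests compressed VCF output; -o names the output file.
    return ["bcftools", "concat", "-O", "z", "-o", output, *range_gvcfs]


def fake_call_range(genomic_range: str) -> str:
    """Placeholder for the per-range work; returns the gVCF it would write."""
    return f"{genomic_range}.g.vcf.gz"


concat_command = scatter_gather(["chr1", "chr2", "chr3"], fake_call_range,
                                "sample.g.vcf.gz")
print(" ".join(concat_command))
```

Preserving input order matters because the concatenation step expects its inputs in the order they should appear in the combined file.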