# GA4GH Variation Representation Overview

The GA4GH Variation Representation Specification consists of two components: a JSON schema that describes the structure of data, and conventions for how to use that structure to improve the consistency of sequence variation shared in the community.  A reference implementation in Python accompanies the specification and is demonstrated in this notebook.

GA4GH VR was formerly known as the Variation Modelling Collaboration (VMC).

## Using the Reference Implementation
Publicly available functionality is accessed by importing from `ga4gh.vr` (with `ga4gh_digest` imported from `ga4gh.core`), as shown below.

In [1]:
from ga4gh.core import ga4gh_digest
from ga4gh.vr import __version__, ga4gh_identify, ga4gh_serialize, models, normalize

<br>
<div>
    <div style="border-radius: 10px; width: 80%; margin: 0 auto; padding: 5px; border: 2pt solid #660000; color: #660000; background: #f4cccc;">
    <span style="font-size: 200%">⚠</span> Import only from <code>ga4gh.vr</code>.
            Submodules contain implementation details that are likely to change without notice.
    </div>
</div>

In [2]:
# You can see the version of ga4gh.vr like so:
__version__

'0.2.1.dev49+g5384bae.d20190618'

## Top-Down View of VR Schema Classes

The top-level VR classes are Location and Variation.  A Location describes *where* an event occurs.  Variation describes an event at a Location. Location and Variation are *abstract* objects: their purpose is to provide a framework for the way we think about variation, but they don't represent any particular instance themselves.

Currently, there is only one Location class: SequenceLocation, which defines a precise span on a named sequence. Future Location classes will include Cytoband Locations, Gene Locations, as well as SequenceLocations using fuzzy and/or intronic coordinates.

There are two Variation subclasses:
* Text -- a blob of text, used when a textual representation is not (yet) parseable
* Allele -- contiguous state of a sequence or conceptual region

Future kinds of Variation will support haplotypes, genotypes, and translocations/fusions.

The top-level classes in VR are *identifiable*, meaning that VR prescribes how implementations can compute globally-consistent identifiers from the data.

See <a href="ga4gh-vr-schema.svg">this figure</a> for a schematic representation.


### Locations
A Location is an *abstract* object that refers to contiguous regions of biological sequences. Concrete types of Locations are shown below.

The most common Location is a SequenceLocation, i.e., a Location based on a named sequence and an Interval on that sequence. Locations may also be conceptual or symbolic locations, such as a cytoband region or a gene.
Any of these may be used as the Location for Variation.

#### SimpleInterval

In [3]:
simple_interval = models.SimpleInterval(start=42, end=43)
simple_interval.as_dict()

{'end': 43, 'start': 42, 'type': 'SimpleInterval'}

#### NestedInterval
Document conversion with ranged format

In [4]:
nested_interval = models.NestedInterval(
    inner=models.SimpleInterval(start=29,end=30),
    outer=models.SimpleInterval(start=20,end=39))
nested_interval.as_dict()

{'inner': {'end': 30, 'start': 29, 'type': 'SimpleInterval'},
 'outer': {'end': 39, 'start': 20, 'type': 'SimpleInterval'},
 'type': 'NestedInterval'}

#### SequenceLocation

In [5]:
# A SequenceLocation based on a SimpleInterval
sequence_location_si = models.SequenceLocation(
    sequence_id="refseq:NM_0001234.5",
    interval=simple_interval)
ga4gh_identify(sequence_location_si)
sequence_location_si.as_dict()

{'id': 'ga4gh:SL/UqdjWOolIz8Vxd5b14eVND0vw88q0vqr',
 'interval': {'end': 43, 'start': 42, 'type': 'SimpleInterval'},
 'sequence_id': 'refseq:NM_0001234.5',
 'type': 'SequenceLocation'}

In [6]:
# A SequenceLocation based on a NestedInterval
sequence_location_ni = models.SequenceLocation(sequence_id="refseq:NM_0001234.5", 
                                               interval=nested_interval)
ga4gh_identify(sequence_location_ni)
sequence_location_ni.as_dict()

{'id': 'ga4gh:SL/2MSDt5nZgycxhOVYPfzViwH4XzqgbU1_',
 'interval': {'inner': {'end': 30, 'start': 29, 'type': 'SimpleInterval'},
  'outer': {'end': 39, 'start': 20, 'type': 'SimpleInterval'},
  'type': 'NestedInterval'},
 'sequence_id': 'refseq:NM_0001234.5',
 'type': 'SequenceLocation'}
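
The nesting of intervals inside a SequenceLocation can be pictured with plain Python dataclasses. This is only an illustrative sketch, not the `models` classes from `ga4gh.vr`. The class definitions and the validation rules (non-negative coordinates, `start <= end`, inner interval contained in the outer one) are assumptions made for the example.

```python
from dataclasses import asdict, dataclass, field
from typing import Union


@dataclass
class SimpleInterval:
    start: int
    end: int
    type: str = field(default="SimpleInterval", init=False)

    def __post_init__(self):
        # Assumption: coordinates are non-negative and start never exceeds end.
        if not 0 <= self.start <= self.end:
            raise ValueError("expected 0 <= start <= end")


@dataclass
class NestedInterval:
    inner: SimpleInterval
    outer: SimpleInterval
    type: str = field(default="NestedInterval", init=False)

    def __post_init__(self):
        # Assumption: the inner interval lies within the outer interval.
        if not (self.outer.start <= self.inner.start
                and self.inner.end <= self.outer.end):
            raise ValueError("inner interval must lie within outer interval")


@dataclass
class SequenceLocation:
    sequence_id: str
    interval: Union[SimpleInterval, NestedInterval]
    type: str = field(default="SequenceLocation", init=False)


location = SequenceLocation(
    sequence_id="refseq:NM_0001234.5",
    interval=NestedInterval(
        inner=SimpleInterval(start=29, end=30),
        outer=SimpleInterval(start=20, end=39),
    ),
)
# asdict() recurses into nested dataclasses, mirroring the as_dict() output above
# (without the computed 'id', which this sketch does not generate).
location_dict = asdict(location)
print(location_dict)
```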

### Text Variation

In order to support variation descriptions that cannot be parsed, or cannot be parsed yet, the VR provides a Text schema object. The intention is to provide ids for *any* variation, particularly human descriptions of variation.

In [7]:
text_variation = models.Text(definition="PTEN loss")
text_variation.as_dict()

{'definition': 'PTEN loss', 'type': 'Text'}

### Alleles

An Allele is an assertion of a SequenceState at a Location. The many possible Location and SequenceState classes enable the representation of many kinds of Variation.

#### "Simple" sequence replacements
This case covers any "ref-alt" style variation, which includes SNVs, MNVs, del, ins, and delins.

In [8]:
sequence_state = models.SequenceState(sequence="A")
allele = models.Allele(location=sequence_location_si, state=sequence_state)
ga4gh_identify(allele)
allele.as_dict()

{'id': 'ga4gh:VA/C0e28xlAfc9LVvCj_2092gF28UbtP3oX',
 'location': {'id': 'ga4gh:SL/UqdjWOolIz8Vxd5b14eVND0vw88q0vqr',
  'interval': {'end': 43, 'start': 42, 'type': 'SimpleInterval'},
  'sequence_id': 'refseq:NM_0001234.5',
  'type': 'SequenceLocation'},
 'state': {'sequence': 'A', 'type': 'SequenceState'},
 'type': 'Allele'}

---
## Functions
Conventions in the VR specification are implemented through several algorithmic functions. They are:

* `normalize`: Implements sequence normalization for ins and del variation.
* `ga4gh_digest`: Implements a convention for constructing and formatting digests for an object.
* `ga4gh_serialize`: Implements object serialization based on a canonical form of JSON.
* `ga4gh_identify`: Generates a computed identifier for an identifiable object.


### normalize()
The VR specification RECOMMENDS that variation be reported as "expanded" alleles. Expanded alleles capture the entire region of insertion/deletion ambiguity, thereby facilitating comparisons that would otherwise require on-the-fly computations.

In [9]:
# Define a dinucleotide insertion on the following sequence at interbase (13, 13)
sequence = "CCCCCCCCACACACACACTAGCAGCAGCA"
#    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9
#     C C C C C C C C A C A C A C A C A C T A G C A G C A G C A
#                              ^ insert CA here
interval = (13, 13)
alleles = (None, "CA")
args = dict(sequence=sequence, interval=interval, alleles=alleles, bounds=(0,len(sequence)))

In [10]:
# The expanded allele sequences
normalize(**args, mode="EXPAND")

((7, 18), ('CACACACACAC', 'CACACACACACAC'))

In [11]:
# For comparison, the left and right shuffled alleles
normalize(**args, mode="LEFTSHUFFLE")

((7, 7), ('', 'CA'))

In [12]:
normalize(**args, mode="RIGHTSHUFFLE")

((18, 18), ('', 'AC'))
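
To make the three modes concrete, the following sketch reproduces the results above for the special case of a pure insertion. It is not the library's `normalize` implementation. The helper names `roll_left`, `roll_right`, and `expand_insertion` are hypothetical, and the code assumes a non-empty inserted sequence and interbase coordinates.

```python
def roll_left(sequence, pos, inserted):
    """Shift an insertion leftward while the result stays equivalent."""
    while pos > 0 and sequence[pos - 1] == inserted[-1]:
        # Rotate the inserted sequence: its last base moves to the front.
        inserted = inserted[-1] + inserted[:-1]
        pos -= 1
    return pos, inserted


def roll_right(sequence, pos, inserted):
    """Shift an insertion rightward while the result stays equivalent."""
    while pos < len(sequence) and sequence[pos] == inserted[0]:
        # Rotate the inserted sequence: its first base moves to the end.
        inserted = inserted[1:] + inserted[0]
        pos += 1
    return pos, inserted


def expand_insertion(sequence, pos, inserted):
    """Report the insertion over the whole region of ambiguity."""
    left, left_inserted = roll_left(sequence, pos, inserted)
    right, _ = roll_right(sequence, pos, inserted)
    ref = sequence[left:right]
    return (left, right), (ref, left_inserted + ref)


sequence = "CCCCCCCCACACACACACTAGCAGCAGCA"
print(roll_left(sequence, 13, "CA"))         # (7, 'CA')
print(roll_right(sequence, 13, "CA"))        # (18, 'AC')
print(expand_insertion(sequence, 13, "CA"))  # ((7, 18), ('CACACACACAC', 'CACACACACACAC'))
```

The expanded interval is simply the span between the leftmost and rightmost equivalent positions, which is why two differently shuffled representations of the same insertion expand to the same allele.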

### ga4gh_digest()
`ga4gh_digest` is a convention for constructing unique identifiers from binary objects (such as serialized data) using the well-known SHA-512 hash function and base64url encoding (the URL-safe variant of Base64).

In [13]:
ga4gh_digest(b"")

'z4PhNX7vuL3xVChQ1m2AB9Yg5AULVxXc'

In [14]:
ga4gh_digest(b"ACGT")

'aKF498dAxcJAqme6QYQ7EZ07-fiw8Kw2'
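
The digest can be approximated with the standard library alone. The sketch below assumes the digest is the SHA-512 hash truncated to its first 24 bytes and then base64url-encoded, which is consistent with the 32-character outputs above. The function name `sha512t24u` is chosen for this example and is not imported from `ga4gh.core`.

```python
import base64
import hashlib


def sha512t24u(blob: bytes) -> str:
    """SHA-512, truncated to 24 bytes, base64url-encoded (assumed scheme)."""
    digest = hashlib.sha512(blob).digest()
    # 24 bytes encode to exactly 32 base64 characters, so no '=' padding appears.
    return base64.urlsafe_b64encode(digest[:24]).decode("ascii")


print(sha512t24u(b""))      # z4PhNX7vuL3xVChQ1m2AB9Yg5AULVxXc
print(sha512t24u(b"ACGT"))  # aKF498dAxcJAqme6QYQ7EZ07-fiw8Kw2
```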

### ga4gh_serialize()
Serialization is the process of converting an object to a *binary* representation for transmission or communication. In the context of generating GA4GH identifiers, serialization is a process to generate a *canonical* JSON form in order to generate a digest. The VR serialization is based on a JSON canonicalization scheme consistent with several existing proposals. See the spec for details.

Because the serialization and digest methods are well-defined, groups with the same data will generate the same digests and computed identifiers.

GA4GH serialization replaces inline identifiable objects with their digests in order to create a well-defined ordering. See the `location` property in the `Allele` example below.

<br>
<div>
    <div style="border-radius: 10px; width: 80%; margin: 0 auto; padding: 5px; border: 2pt solid #660000; color: #660000; background: #f4cccc;">
        <span style="font-size: 200%">⚠</span> Although JSON serialization and GA4GH canonical JSON serialization appear similar, they are NOT interchangeable and will generate different digests. GA4GH identifiers are defined <i>only</i> when used with the GA4GH serialization process.
    </div>
</div>

In [15]:
# This is the "simple" allele defined above, repeated here for readability
# Note that the location data is inlined
allele.as_dict()

{'id': 'ga4gh:VA/C0e28xlAfc9LVvCj_2092gF28UbtP3oX',
 'location': {'id': 'ga4gh:SL/UqdjWOolIz8Vxd5b14eVND0vw88q0vqr',
  'interval': {'end': 43, 'start': 42, 'type': 'SimpleInterval'},
  'sequence_id': 'refseq:NM_0001234.5',
  'type': 'SequenceLocation'},
 'state': {'sequence': 'A', 'type': 'SequenceState'},
 'type': 'Allele'}

In [16]:
# This is the serialized form. Notice that the inline `Location` instance was replaced with
# its digest (the identifier without the `ga4gh:SL/` prefix) and that the Allele id is not included.
ga4gh_serialize(allele)

b'{"location":"UqdjWOolIz8Vxd5b14eVND0vw88q0vqr","state":{"sequence":"A","type":"SequenceState"},"type":"Allele"}'
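
The shape of the serialized form can be imitated with the standard `json` module. This is a rough sketch, not the `ga4gh_serialize` implementation. It assumes that canonical form means sorted keys, no insignificant whitespace, and UTF-8 encoding. It also takes a shortcut by reading the location digest from the already-computed `id` rather than digesting the location's own serialization. The helper names are hypothetical.

```python
import json


def canonical_json(obj: dict) -> bytes:
    """Assumed canonical form: sorted keys, compact separators, UTF-8 bytes."""
    return json.dumps(obj, sort_keys=True, separators=(",", ":")).encode("utf-8")


def serialize_allele_sketch(allele: dict) -> bytes:
    # Drop the computed identifier; it is derived from the data, not part of it.
    reduced = {key: value for key, value in allele.items() if key != "id"}
    # Replace the inline location with the digest portion of its identifier.
    reduced["location"] = allele["location"]["id"].split("/", 1)[1]
    return canonical_json(reduced)


allele_dict = {
    "id": "ga4gh:VA/C0e28xlAfc9LVvCj_2092gF28UbtP3oX",
    "location": {
        "id": "ga4gh:SL/UqdjWOolIz8Vxd5b14eVND0vw88q0vqr",
        "interval": {"end": 43, "start": 42, "type": "SimpleInterval"},
        "sequence_id": "refseq:NM_0001234.5",
        "type": "SequenceLocation",
    },
    "state": {"sequence": "A", "type": "SequenceState"},
    "type": "Allele",
}
print(serialize_allele_sketch(allele_dict))
```

Because the keys are sorted and whitespace is fixed, two groups holding the same data produce byte-identical output regardless of the order in which their dictionaries were built.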

### ga4gh_identify()
VR computed identifiers are constructed from digests on serialized objects by prefixing a VR digest with a type-specific code.

In [17]:
# applying ga4gh_digest to the serialized allele returns a base64url-encoded digest
ga4gh_digest( ga4gh_serialize(allele) )

'C0e28xlAfc9LVvCj_2092gF28UbtP3oX'

In [18]:
# ga4gh_identify() uses this digest to construct a CURIE-formatted identifier.
# The VA prefix identifies this object as a Variation Allele.
ga4gh_identify(allele)

'ga4gh:VA/C0e28xlAfc9LVvCj_2092gF28UbtP3oX'
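
The composition of an identifier from a namespace, a type prefix, and a digest can be sketched as follows. The prefix table lists only the two codes that appear in this notebook (`VA` and `SL`). The digest function assumes truncated SHA-512 with base64url encoding, and the helper names are hypothetical rather than part of `ga4gh.vr`.

```python
import base64
import hashlib

# Type-specific codes seen in this notebook; other types are omitted on purpose.
TYPE_PREFIXES = {"Allele": "VA", "SequenceLocation": "SL"}


def digest_sketch(blob: bytes) -> str:
    """Assumed digest: SHA-512 truncated to 24 bytes, base64url-encoded."""
    return base64.urlsafe_b64encode(hashlib.sha512(blob).digest()[:24]).decode("ascii")


def make_identifier(object_type: str, serialized: bytes) -> str:
    """Build '<namespace>:<type prefix>/<digest>' from a serialized object."""
    return f"ga4gh:{TYPE_PREFIXES[object_type]}/{digest_sketch(serialized)}"


def split_identifier(identifier: str):
    """Split an identifier into (namespace, type prefix, digest)."""
    namespace, rest = identifier.split(":", 1)
    prefix, digest = rest.split("/", 1)
    return namespace, prefix, digest


print(split_identifier("ga4gh:VA/C0e28xlAfc9LVvCj_2092gF28UbtP3oX"))
# ('ga4gh', 'VA', 'C0e28xlAfc9LVvCj_2092gF28UbtP3oX')
```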

---
## ga4gh.vr.extras

### Data Proxy
VR implementations will need access to sequences and sequence identifiers. Sequences are used during normalization and during conversions with other formats. Sequence identifiers are necessary in order to translate identifiers from common forms to a digest-based identifier.

The VR specification leaves the choice of those data sources to the implementations.  In vr-python, `ga4gh.vr.extras.dataproxy` provides an abstract base class as a basis for data source adapters.  One source is [SeqRepo](https://github.com/biocommons/biocommons.seqrepo/), which is used below.  (An adapter based on the GA4GH refget specification exists, but is pending necessary changes to the refget interface to provide accession-based lookups.)

SeqRepo: [github](https://github.com/biocommons/biocommons.seqrepo/) | [data snapshots](http://dl.biocommons.org/seqrepo/) | [rest API github](https://github.com/biocommons/seqrepo-rest-service) | [rest API docker images](https://cloud.docker.com/u/biocommons/repository/docker/biocommons/seqrepo-rest-service)

In [19]:
# Requires that the SeqRepo REST interface is running at this URL (e.g., using the docker image)
seqrepo_rest_service_url = "http://localhost:5000/seqrepo"

from ga4gh.vr.extras.dataproxy import SeqRepoRESTDataProxy
dp = SeqRepoRESTDataProxy(base_url=seqrepo_rest_service_url)

In [20]:
dp.get_metadata("refseq:NM_000551.3")

2019-06-18 10:49:28 snafu ga4gh.vr.extras.dataproxy[14578] INFO Fetching http://localhost:5000/seqrepo/1/metadata/RefSeq:NM_000551.3


{'added': '2016-08-24T05:03:11Z',
 'aliases': ['MD5:215137b1973c1a5afcf86be7d999574a',
  'RefSeq:NM_000551.3',
  'SEGUID:T12L0p2X5E8DbnL0+SwI4Wc1S6g',
  'SHA1:4f5d8bd29d97e44f036e72f4f92c08e167354ba8',
  'VMC:GS_v_QTc1p-MUYdgrRv4LMT6ByXIOsdw3C_',
  'ga4gh:SQ/v_QTc1p-MUYdgrRv4LMT6ByXIOsdw3C_',
  'TRUNC512:bff413735a7e31461d82b46fe0b313e81c9720eb1dc370bf',
  'gi:319655736'],
 'alphabet': 'ACGT',
 'length': 4560}

In [21]:
dp.get_sequence("ga4gh:SQ/v_QTc1p-MUYdgrRv4LMT6ByXIOsdw3C_", start=0, end=50) + "..."

2019-06-18 10:49:28 snafu ga4gh.vr.extras.dataproxy[14578] INFO Fetching http://localhost:5000/seqrepo/1/sequence/VMC:GS_v_QTc1p-MUYdgrRv4LMT6ByXIOsdw3C_


'CCTCGCCTCCGTTACAACGGCCTACGGTGCTGGAGGATCCTTCTGCGCAC...'
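
For testing without a running SeqRepo service, the adapter idea can be imitated with an in-memory stand-in. The sketch below does not subclass the real base class in `ga4gh.vr.extras.dataproxy`. The class names `DataProxySketch` and `InMemoryDataProxy`, the identifier `demo:seq1`, and the metadata fields returned are assumptions modelled on the two calls shown above.

```python
from abc import ABC, abstractmethod
from typing import Dict, Optional


class DataProxySketch(ABC):
    """Hypothetical interface exposing the two calls used in this notebook."""

    @abstractmethod
    def get_sequence(self, identifier: str,
                     start: Optional[int] = None,
                     end: Optional[int] = None) -> str:
        ...

    @abstractmethod
    def get_metadata(self, identifier: str) -> dict:
        ...


class InMemoryDataProxy(DataProxySketch):
    """Serves sequences from a dictionary instead of a REST service."""

    def __init__(self, sequences: Dict[str, str]):
        self._sequences = sequences

    def get_sequence(self, identifier, start=None, end=None):
        # Interbase coordinates map directly onto Python slicing.
        return self._sequences[identifier][start:end]

    def get_metadata(self, identifier):
        sequence = self._sequences[identifier]
        return {
            "aliases": [identifier],
            "alphabet": "".join(sorted(set(sequence))),
            "length": len(sequence),
        }


demo_proxy = InMemoryDataProxy({"demo:seq1": "CCTCGCCTCC"})
print(demo_proxy.get_sequence("demo:seq1", start=0, end=4))
print(demo_proxy.get_metadata("demo:seq1"))
```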

### Format translator
ga4gh.vr.extras.translator translates various formats into VR representations.

<br>
<div>
    <div style="border-radius: 10px; width: 80%; margin: 0 auto; padding: 5px; background: #d9ead3; border: 2pt solid #274e13; color: #274e13">
    <span style="font-size: 200%">🚀</span> The examples below use the same variant in four formats: HGVS, beacon, SPDI, and VCF/gnomAD. Notice that the resulting Allele objects and computed identifiers are identical.
    </div>
</div>


In [22]:
from ga4gh.vr.extras.translator import Translator
tlr = Translator(data_proxy=dp)

2019-06-18 10:49:28 snafu hgvs[14578] INFO hgvs 1.3.0.post0; released: False


In [23]:
a = tlr.from_hgvs("NC_000013.11:g.32936732G>C")
a.as_dict()

2019-06-18 10:49:28 snafu ga4gh.vr.extras.translator[14578] INFO Creating  parser
2019-06-18 10:49:29 snafu ga4gh.vr.extras.dataproxy[14578] INFO Fetching http://localhost:5000/seqrepo/1/metadata/RefSeq:NC_000013.11


{'id': 'ga4gh:VA/cxH2DtdGp35-S0Eqj9Jf1-cH_yZiaf0U',
 'location': {'id': 'ga4gh:SL/HTlh4NTzBViUK0P5ZMinx-YJuNvlBmht',
  'interval': {'end': 32936732, 'start': 32936731, 'type': 'SimpleInterval'},
  'sequence_id': 'ga4gh:SQ/_0wi-qoDrvram155UmcSC-zA5ZK4fpLT',
  'type': 'SequenceLocation'},
 'state': {'sequence': 'C', 'type': 'SequenceState'},
 'type': 'Allele'}

In [24]:
# from_beacon: Translate from beacon's form
a = tlr.from_beacon("13 : 32936732 G > C")
a.as_dict()

2019-06-18 10:49:29 snafu ga4gh.vr.extras.dataproxy[14578] INFO Fetching http://localhost:5000/seqrepo/1/metadata/GRCh38:13


{'id': 'ga4gh:VA/cxH2DtdGp35-S0Eqj9Jf1-cH_yZiaf0U',
 'location': {'id': 'ga4gh:SL/HTlh4NTzBViUK0P5ZMinx-YJuNvlBmht',
  'interval': {'end': 32936732, 'start': 32936731, 'type': 'SimpleInterval'},
  'sequence_id': 'ga4gh:SQ/_0wi-qoDrvram155UmcSC-zA5ZK4fpLT',
  'type': 'SequenceLocation'},
 'state': {'sequence': 'C', 'type': 'SequenceState'},
 'type': 'Allele'}

In [25]:
# SPDI uses 0-based coordinates
a = tlr.from_spdi("NC_000013.11:32936731:1:C")
a.as_dict()

{'id': 'ga4gh:VA/cxH2DtdGp35-S0Eqj9Jf1-cH_yZiaf0U',
 'location': {'id': 'ga4gh:SL/HTlh4NTzBViUK0P5ZMinx-YJuNvlBmht',
  'interval': {'end': 32936732, 'start': 32936731, 'type': 'SimpleInterval'},
  'sequence_id': 'ga4gh:SQ/_0wi-qoDrvram155UmcSC-zA5ZK4fpLT',
  'type': 'SequenceLocation'},
 'state': {'sequence': 'C', 'type': 'SequenceState'},
 'type': 'Allele'}

In [26]:
a = tlr.from_vcf("13-32936732-G-C")   # gnomAD-style expression
a.as_dict()

{'id': 'ga4gh:VA/cxH2DtdGp35-S0Eqj9Jf1-cH_yZiaf0U',
 'location': {'id': 'ga4gh:SL/HTlh4NTzBViUK0P5ZMinx-YJuNvlBmht',
  'interval': {'end': 32936732, 'start': 32936731, 'type': 'SimpleInterval'},
  'sequence_id': 'ga4gh:SQ/_0wi-qoDrvram155UmcSC-zA5ZK4fpLT',
  'type': 'SequenceLocation'},
 'state': {'sequence': 'C', 'type': 'SequenceState'},
 'type': 'Allele'}
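
The four translations agree because each format's coordinates reduce to the same interbase interval. The sketch below shows only that coordinate arithmetic for the gnomAD-style and SPDI expressions. It is not the `Translator` implementation: it skips sequence identifier lookup and reference validation, and the helper names are hypothetical.

```python
def interbase_from_gnomad(expression: str):
    """Parse 'chrom-pos-ref-alt', where pos is 1-based."""
    chrom, pos, ref, alt = expression.split("-")
    start = int(pos) - 1  # a 1-based position becomes a 0-based interbase start
    return chrom, start, start + len(ref), alt


def interbase_from_spdi(expression: str):
    """Parse 'sequence:position:deletion:insertion', where position is 0-based."""
    sequence_id, pos, deletion, insertion = expression.split(":")
    start = int(pos)
    # The deletion field may be a length or the deleted sequence itself.
    deleted_length = int(deletion) if deletion.isdigit() else len(deletion)
    return sequence_id, start, start + deleted_length, insertion


print(interbase_from_gnomad("13-32936732-G-C"))
# ('13', 32936731, 32936732, 'C')
print(interbase_from_spdi("NC_000013.11:32936731:1:C"))
# ('NC_000013.11', 32936731, 32936732, 'C')
```

Both results carry the interval `(32936731, 32936732)` and the state `C`, matching the `interval` and `state` fields of the Allele objects above.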