# GA4GH Variation Representation Overview

This notebook provides an overview of all objects in the GA4GH VR schema. (GA4GH VR was formerly known as VMC. The rename is still in progress, so the code below still uses the vmc package name.)

## Top-down view of the VR Schema

The highest-level objects of VR are Variation and VariationSet. Both are *abstract* objects.

Conceptually, Variation can be any of the following:
* Text -- a blob of text, used when a textual representation is not (yet) parseable
* Allele -- contiguous state of a sequence or conceptual region
* Haplotype -- a set of Alleles known to be in phase
* Genotype -- a set of Haplotypes

A VariationSet is an arbitrary set of anything.

👉 The above hierarchy is being reevaluated.

An Allele consists of a Location and State, which are both abstract concepts.

Kinds of Locations:
* SequenceLocation
* CytobandLocation
* GeneLocation

Kinds of State:
* SequenceState
* CNVState
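
The Location/State pairing can be sketched with plain dataclasses. This is only an illustration of the shape of the model, not the reference implementation: the class definitions and the `Location` and `State` aliases below are hypothetical stand-ins, with field names taken from the objects shown later in this notebook.

```python
from dataclasses import asdict, dataclass
from typing import Union


@dataclass(frozen=True)
class SimpleRegion:
    start: int
    end: int
    type: str = "SimpleRegion"


@dataclass(frozen=True)
class SequenceLocation:
    sequence_id: str
    region: SimpleRegion
    type: str = "SequenceLocation"


@dataclass(frozen=True)
class GeneLocation:
    gene: str
    type: str = "GeneLocation"


@dataclass(frozen=True)
class SequenceState:
    sequence: str
    type: str = "SequenceState"


@dataclass(frozen=True)
class CNVState:
    min_copies: int
    max_copies: int
    copy_measure: str
    type: str = "CNVState"


# Hypothetical aliases standing in for the abstract Location and State concepts.
Location = Union[SequenceLocation, GeneLocation]
State = Union[SequenceState, CNVState]


@dataclass(frozen=True)
class Allele:
    location: Location
    state: State
    type: str = "Allele"


# Any Location may be paired with any State.
allele = Allele(
    location=SequenceLocation("NM_0001234.5", SimpleRegion(42, 43)),
    state=SequenceState("A"),
)
allele_dict = asdict(allele)
```

Because an Allele only requires some Location and some State, the same class also covers a gene-level copy number change, e.g. `Allele(GeneLocation("HGNC:MSH2"), CNVState(3, 5, "RELATIVE"))`.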



---
# Reference Implementation

In [1]:
import vmc

---
# Functions
VR defines functions that operate on some objects. They are demonstrated below.

## Digest
The VMC digest is merely a convention for applying well-known, existing technology to generate a unique fingerprint of a string object.

In [2]:
vmc.digest("")

'z4PhNX7vuL3xVChQ1m2AB9Yg5AULVxXc'

In [3]:
vmc.digest("", digest_size=12)

'z4PhNX7vuL3xVChQ'

In [4]:
vmc.digest("ACGT")

'aKF498dAxcJAqme6QYQ7EZ07-fiw8Kw2'
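
The outputs above are consistent with a SHA-512 digest truncated to its first `digest_size` bytes and encoded as URL-safe base64. The sketch below assumes that construction. `truncated_digest` is a hypothetical name and is not part of vmc; it takes bytes rather than a string.

```python
import base64
import hashlib


def truncated_digest(data: bytes, digest_size: int = 24) -> str:
    """Return URL-safe base64 of the first digest_size bytes of SHA-512(data)."""
    full = hashlib.sha512(data).digest()
    return base64.urlsafe_b64encode(full[:digest_size]).decode("ascii")


empty_24 = truncated_digest(b"")
empty_12 = truncated_digest(b"", digest_size=12)
```

Under this assumption, `empty_24` and `empty_12` match the two empty-string digests shown above. Sizes that are a multiple of 3 bytes encode without base64 padding, which may be why 24 and 12 appear in the examples.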

## Serialization

## Computed Identifiers
Generating a computed identifier involves running the digest algorithm above on a serialized object. Because it requires a VR object, the computed identifier is demonstrated below.
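
As a rough sketch of that pipeline (serialize, digest, then prefix), assuming a canonical JSON serialization and a SHA-512 digest truncated to 24 bytes: the function name and the type-to-prefix table are hypothetical, with prefixes inferred from the identifiers that appear in this notebook. This sketch is not expected to reproduce the ids returned by `vmc.computed_id`, whose exact serialization may differ.

```python
import base64
import hashlib
import json

# Hypothetical type-to-prefix table, inferred from identifiers shown in this notebook.
TYPE_PREFIXES = {
    "Text": "GT",
    "SequenceLocation": "GL",
    "Allele": "GA",
    "Haplotype": "GH",
    "Genotype": "GG",
}


def sketch_computed_id(obj: dict, namespace: str = "VMC") -> str:
    """Serialize obj canonically, digest it, and prepend namespace and type prefix."""
    # Assumption: sorted keys and compact separators give a stable byte string.
    blob = json.dumps(obj, sort_keys=True, separators=(",", ":")).encode("utf-8")
    digest = base64.urlsafe_b64encode(hashlib.sha512(blob).digest()[:24]).decode("ascii")
    return f"{namespace}:{TYPE_PREFIXES[obj['type']]}_{digest}"


text_id = sketch_computed_id({"definition": "PTEN loss", "type": "Text"})
```

The useful property is that the identifier depends only on the object's content, so two parties that build the same object independently arrive at the same id.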

---
# Text Variation

To support variation descriptions that cannot be parsed, or cannot be parsed yet, VR provides a Text schema object. The intention is to provide ids for *any* variation, particularly human-written descriptions of variation.

In [3]:
v = vmc.models.Text(definition="PTEN loss")
v.as_dict()

{'definition': 'PTEN loss', 'type': 'Text'}

In [4]:
vmc.computed_id(v)

'VMC:GT_VX60NSGLem4X3Q8gnOSx48pZDCmJVSUk'

---
# Locations
A Location is an *abstract* object that refers to a contiguous region of a biological sequence. Concrete types of Locations are shown below.

The most common Location is a SequenceLocation, i.e., a Region on a named sequence.
Locations may also be more conceptual, such as a cytoband region or a gene.
Any of these may be used as the Location for Variation.

## Regions

Regions refer to contiguous spans of an implied sequence.
Regions are not identifiable objects, so they have no computed identifier defined.

### SimpleRegion

In [5]:
sr = vmc.models.SimpleRegion(start=42, end=43)
sr.as_dict()

{'end': 43, 'start': 42, 'type': 'SimpleRegion'}

In [6]:
vmc.serialize(sr)

b'{"end":43,"start":42,"type":"SimpleRegion"}'
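
The serialized form above looks like JSON with sorted keys and no insignificant whitespace. A minimal sketch using only the standard library, assuming that is the whole convention (`canonical_json` is a hypothetical helper, not a vmc function):

```python
import json


def canonical_json(obj: dict) -> bytes:
    """Serialize with sorted keys and compact separators so equal objects give equal bytes."""
    return json.dumps(obj, sort_keys=True, separators=(",", ":")).encode("utf-8")


# Key order in the input does not affect the serialized bytes.
serialized = canonical_json({"type": "SimpleRegion", "start": 42, "end": 43})
```

A stable byte string matters because any digest taken over the serialization changes if key order or whitespace changes.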

### NestedRegion

In [7]:
nr = vmc.models.NestedRegion(
    inner=vmc.models.SimpleRegion(start=29,end=30),
    outer=vmc.models.SimpleRegion(start=30,end=39))
nr.as_dict()

{'inner': {'end': 30, 'start': 29, 'type': 'SimpleRegion'},
 'outer': {'end': 39, 'start': 30, 'type': 'SimpleRegion'},
 'type': 'NestedRegion'}

In [8]:
vmc.serialize(nr)

b'{"inner":{"end":30,"start":29,"type":"SimpleRegion"},"outer":{"end":39,"start":30,"type":"SimpleRegion"},"type":"NestedRegion"}'

### RangedRegion

In [9]:
rr = vmc.models.RangedRegion(
    start=vmc.models.SimpleRegion(start=20,end=29),
    end=vmc.models.SimpleRegion(start=30,end=39))
rr.as_dict()

{'end': {'end': 39, 'start': 30, 'type': 'SimpleRegion'},
 'start': {'end': 29, 'start': 20, 'type': 'SimpleRegion'},
 'type': 'RangedRegion'}

In [10]:
vmc.serialize(rr)

b'{"end":{"end":39,"start":30,"type":"SimpleRegion"},"start":{"end":29,"start":20,"type":"SimpleRegion"},"type":"RangedRegion"}'

### Future: ComplexRegion
A complex region will support intronic offsets and relative locations (a la FALDO).

### SequenceLocation

In [11]:
slsr = vmc.models.SequenceLocation(sequence_id="NM_0001234.5", region=sr)
slsr.id = vmc.computed_id(slsr)
slsr.as_dict()

{'id': 'VMC:GL_yKxYQ4j-D1f43mjYbobEKk74CVfwSEQj',
 'region': {'end': 43, 'start': 42, 'type': 'SimpleRegion'},
 'sequence_id': 'NM_0001234.5',
 'type': 'SequenceLocation'}

In [12]:
slnr = vmc.models.SequenceLocation(sequence_id="NM_0001234.5", region=nr)
slnr.id = vmc.computed_id(slnr)
slnr.as_dict()

{'id': 'VMC:GL_y1KI63endgw2MbqVIy5n4Ef2BWbnEVH2',
 'region': {'inner': {'end': 30, 'start': 29, 'type': 'SimpleRegion'},
  'outer': {'end': 39, 'start': 30, 'type': 'SimpleRegion'},
  'type': 'NestedRegion'},
 'sequence_id': 'NM_0001234.5',
 'type': 'SequenceLocation'}

In [13]:
slrr = vmc.models.SequenceLocation(sequence_id="NM_0001234.5", region=rr)
slrr.id = vmc.computed_id(slrr)
slrr.as_dict()

{'id': 'VMC:GL_SS97Ti3Tjad9M8GW1F0miTgS06RJ6fKB',
 'region': {'end': {'end': 39, 'start': 30, 'type': 'SimpleRegion'},
  'start': {'end': 29, 'start': 20, 'type': 'SimpleRegion'},
  'type': 'RangedRegion'},
 'sequence_id': 'NM_0001234.5',
 'type': 'SequenceLocation'}

### CytobandLocation

In [14]:
cbl = vmc.models.CytobandLocation(chr="11", start="q22.3", end="q23.1")
cbl.id = vmc.computed_id(cbl)
cbl.as_dict()

{'chr': '11',
 'end': 'q23.1',
 'id': 'VMC:GL_R2RiNOcD_3F-NNEQUrIst3M84LTsVQWF',
 'start': 'q22.3',
 'type': 'CytobandLocation'}

### GeneLocation

In [15]:
gl = vmc.models.GeneLocation(gene="HGNC:MSH2")
gl.id = vmc.computed_id(gl)
gl.as_dict()

{'gene': 'HGNC:MSH2',
 'id': 'VMC:GL_HUswIoUpNqPZa2rBwJR_32At9A3wnWJJ',
 'type': 'GeneLocation'}

# Alleles

An Allele is essentially just a pair of Location and State. The many possible Location and State types permit representing many flavors of variation.

## "Simple" sequence replacements
This case covers any "ref-alt" style variation, which includes SNVs, MNVs, del, ins, and delins.

In [16]:
ss = vmc.models.SequenceState(sequence="A")
a = vmc.models.Allele(location=slsr, state=ss)
a.id = vmc.computed_id(a)
a.as_dict()

{'id': 'VMC:GA_9Mhs4r1-HHJA29T1vccB13MyocdlVOsm',
 'location': {'id': 'VMC:GL_yKxYQ4j-D1f43mjYbobEKk74CVfwSEQj',
  'region': {'end': 43, 'start': 42, 'type': 'SimpleRegion'},
  'sequence_id': 'NM_0001234.5',
  'type': 'SequenceLocation'},
 'state': {'sequence': 'A', 'type': 'SequenceState'},
 'type': 'Allele'}

## CNV of a simple SequenceLocation, copy location unknown/unspecified

In [17]:
sr = vmc.models.SimpleRegion(start=20,end=30)
sl = vmc.models.SequenceLocation(sequence_id="NM_0001234.5", region=sr)

cnvstate = vmc.models.CNVState(min_copies=3, max_copies=5, copy_measure="TOTAL")

a = vmc.models.Allele(location=sl, state=cnvstate)
a.id = vmc.computed_id(a)
a.as_dict()

{'id': 'VMC:GA_jfxtN9iWJDjh1qt8i7sOuFAcSvhDhdvp',
 'location': {'region': {'end': 30, 'start': 20, 'type': 'SimpleRegion'},
  'sequence_id': 'NM_0001234.5',
  'type': 'SequenceLocation'},
 'state': {'copy_measure': 'TOTAL',
  'max_copies': 5,
  'min_copies': 3,
  'type': 'CNVState'},
 'type': 'Allele'}

## Same CNV, now with known location

In [18]:
sr = vmc.models.SimpleRegion(start=20,end=30)
sl = vmc.models.SequenceLocation(sequence_id="NM_0001234.5", region=sr)

# 👉 Note addition of location in CNVState
# When CNV.location == Allele.location, CN is total copy number and copies are tandem
cnvstate = vmc.models.CNVState(min_copies=3, max_copies=5, copy_measure="TOTAL", location=sl)

a = vmc.models.Allele(location=sl, state=cnvstate)
a.id = vmc.computed_id(a)
a.as_dict()

{'id': 'VMC:GA_deCg4f_LrHcsfDTAg9y7AnWGKaPzhXby',
 'location': {'region': {'end': 30, 'start': 20, 'type': 'SimpleRegion'},
  'sequence_id': 'NM_0001234.5',
  'type': 'SequenceLocation'},
 'state': {'copy_measure': 'TOTAL',
  'location': {'region': {'end': 30, 'start': 20, 'type': 'SimpleRegion'},
   'sequence_id': 'NM_0001234.5',
   'type': 'SequenceLocation'},
  'max_copies': 5,
  'min_copies': 3,
  'type': 'CNVState'},
 'type': 'Allele'}

## CNV at a Gene Location
Because any Location may be used to define an Allele, it's straightforward to define gene copy number.

In [19]:
gl = vmc.models.GeneLocation(gene="HGNC:MSH2")

cnvstate = vmc.models.CNVState(min_copies=3, max_copies=5, copy_measure="RELATIVE")

a = vmc.models.Allele(location=gl, state=cnvstate)
a.id = vmc.computed_id(a)
a.as_dict()

{'id': 'VMC:GA_TraGwt0_Ks5VSR7_DeLJnuIwHWEpJTov',
 'location': {'gene': 'HGNC:MSH2', 'type': 'GeneLocation'},
 'state': {'copy_measure': 'RELATIVE',
  'max_copies': 5,
  'min_copies': 3,
  'type': 'CNVState'},
 'type': 'Allele'}

### Gene Deletion

🤔 Deletion could be a SequenceState(""), or, in principle, CNVState(min_copies=0, max_copies=0). Need to provide guidance about which. For now, I'm using SequenceState(""). 

In [20]:
a = vmc.models.Allele(location=gl, state=vmc.models.SequenceState(sequence=""))
a.id = vmc.computed_id(a)
a.as_dict()

{'id': 'VMC:GA_DBkbidc-OqU7-L4_p3b7StzPZ9jnSxPN',
 'location': {'gene': 'HGNC:MSH2', 'type': 'GeneLocation'},
 'state': {'sequence': '', 'type': 'SequenceState'},
 'type': 'Allele'}
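
One reason guidance is needed: a computed identifier is derived from the serialized object, so the two encodings of the same deletion serialize differently and would receive different ids. The sketch below uses plain dicts and a hypothetical `canonical_json` helper rather than vmc itself. The `copy_measure` value in the CNV variant is an assumption, since the note above does not specify one.

```python
import json


def canonical_json(obj: dict) -> bytes:
    """Hypothetical canonical form: sorted keys, compact separators."""
    return json.dumps(obj, sort_keys=True, separators=(",", ":")).encode("utf-8")


gene_location = {"gene": "HGNC:MSH2", "type": "GeneLocation"}

# Encoding 1: deletion as an empty replacement sequence.
deletion_as_sequence = {
    "location": gene_location,
    "state": {"sequence": "", "type": "SequenceState"},
    "type": "Allele",
}

# Encoding 2: deletion as zero copies (copy_measure="TOTAL" is an assumption).
deletion_as_cnv = {
    "location": gene_location,
    "state": {"copy_measure": "TOTAL", "max_copies": 0, "min_copies": 0, "type": "CNVState"},
    "type": "Allele",
}

same_serialization = canonical_json(deletion_as_sequence) == canonical_json(deletion_as_cnv)
```

Here `same_serialization` is `False`, so without a stated preference two systems could assign different identifiers to what is conceptually the same gene deletion.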

# Haplotypes
A Haplotype is a collection of allele_ids, with optional specification of the covered location and completeness.

In [21]:
h = vmc.models.Haplotype(
    location_id=slsr.id,
    allele_ids=[
    'BOGUS:XX_WMv1y-3Q460hi_S3ND5N5Ct2Ci58TOZd',
    'BOGUS:XX_jW7bSR3Obmx3IewIRSJkJMf6t7b73LVU',
    'BOGUS:XX_23DL4svp8FvWdMrkhuOckbyjM-0I1Dov',
    'BOGUS:XX_n363FutAEo79HhjNl7wea61SGc_tU40j'],
    completeness="PARTIAL")
h.id = vmc.computed_id(h)
h.as_dict()

{'allele_ids': ['BOGUS:XX_WMv1y-3Q460hi_S3ND5N5Ct2Ci58TOZd',
  'BOGUS:XX_jW7bSR3Obmx3IewIRSJkJMf6t7b73LVU',
  'BOGUS:XX_23DL4svp8FvWdMrkhuOckbyjM-0I1Dov',
  'BOGUS:XX_n363FutAEo79HhjNl7wea61SGc_tU40j'],
 'completeness': 'PARTIAL',
 'id': 'VMC:GH_d-arw1iegJ7WghhHiKzv90K0diZf-YTY',
 'location_id': 'VMC:GL_yKxYQ4j-D1f43mjYbobEKk74CVfwSEQj',
 'type': 'Haplotype'}

# Genotypes
A Genotype is a collection of haplotype_ids, with optional specification of completeness.

In [22]:
g = vmc.models.Genotype(
    haplotype_ids=[
    'BOGUS:XX_WMv1y-3Q460hi_S3ND5N5Ct2Ci58TOZd',
    'BOGUS:XX_jW7bSR3Obmx3IewIRSJkJMf6t7b73LVU',
    'BOGUS:XX_23DL4svp8FvWdMrkhuOckbyjM-0I1Dov',
    'BOGUS:XX_n363FutAEo79HhjNl7wea61SGc_tU40j'],
    completeness="PARTIAL")
g.id = vmc.computed_id(g)
g.as_dict()

{'completeness': 'PARTIAL',
 'haplotype_ids': ['BOGUS:XX_WMv1y-3Q460hi_S3ND5N5Ct2Ci58TOZd',
  'BOGUS:XX_jW7bSR3Obmx3IewIRSJkJMf6t7b73LVU',
  'BOGUS:XX_23DL4svp8FvWdMrkhuOckbyjM-0I1Dov',
  'BOGUS:XX_n363FutAEo79HhjNl7wea61SGc_tU40j'],
 'id': 'VMC:GG_5IARhH263DsFl6P7WfCLAfqkcnCbl4tn',
 'type': 'Genotype'}

# VariationSet
A VariationSet is just a bucket of ids; the objects those ids refer to need not even exist.

In [23]:
vs = vmc.models.VariationSet(member_ids=[
    'BOGUS:XX_WMv1y-3Q460hi_S3ND5N5Ct2Ci58TOZd',
    'BOGUS:XX_jW7bSR3Obmx3IewIRSJkJMf6t7b73LVU',
    'BOGUS:XX_23DL4svp8FvWdMrkhuOckbyjM-0I1Dov',
    'BOGUS:XX_n363FutAEo79HhjNl7wea61SGc_tU40j',
    'BOGUS:XX_pel3HzoNSMCEvPoQQD-AOBE8I8s0eCn9',
    'BOGUS:XX_X2x6a4Xvil365Ea-Po8WcuuQPWx973U8',
    'BOGUS:XX_QHDx_0DbssgtGljy-K1q7WAcNkqD5TY-',
    'BOGUS:XX_3x2p-8eCIc0pU-if_6CFBKGLziZRSWdz',
    'BOGUS:XX_RXF8gSNDDyPQ0opTA8ordEE6hGGm2GYJ',
    'BOGUS:XX_7tDaRPzXL4rfOLoYtRUUGCTy65ptDs8J'])
vs.id = vmc.computed_id(vs)
vs.as_dict()

{'id': 'VMC:??_6m4w9eXBvYrs3FKOZq62E_955VZ7Oo7V',
 'member_ids': ['BOGUS:XX_WMv1y-3Q460hi_S3ND5N5Ct2Ci58TOZd',
  'BOGUS:XX_jW7bSR3Obmx3IewIRSJkJMf6t7b73LVU',
  'BOGUS:XX_23DL4svp8FvWdMrkhuOckbyjM-0I1Dov',
  'BOGUS:XX_n363FutAEo79HhjNl7wea61SGc_tU40j',
  'BOGUS:XX_pel3HzoNSMCEvPoQQD-AOBE8I8s0eCn9',
  'BOGUS:XX_X2x6a4Xvil365Ea-Po8WcuuQPWx973U8',
  'BOGUS:XX_QHDx_0DbssgtGljy-K1q7WAcNkqD5TY-',
  'BOGUS:XX_3x2p-8eCIc0pU-if_6CFBKGLziZRSWdz',
  'BOGUS:XX_RXF8gSNDDyPQ0opTA8ordEE6hGGm2GYJ',
  'BOGUS:XX_7tDaRPzXL4rfOLoYtRUUGCTy65ptDs8J'],
 'type': 'VariationSet'}

---
# vmc.extras

## Format translator
vmc.extras.translator translates various formats into VR representations.

In [1]:
from vmc.extras.translator import Translator
tlr = Translator()

In [2]:
a = tlr.from_beacon("13 : 32936732 G > C")
a.as_dict()

{'location': {'region': {'end': 32936732,
   'start': 32936731,
   'type': 'SimpleRegion'},
  'sequence_id': 'GRCh38:13 ',
  'type': 'SequenceLocation'},
 'state': {'sequence': 'C', 'type': 'SequenceState'},
 'type': 'Allele'}

In [3]:
a = tlr.from_gnomad("1-55516888-G-GA")
a.as_dict()

{'location': {'region': {'end': 55516888,
   'start': 55516887,
   'type': 'SimpleRegion'},
  'sequence_id': 'GRCh38:1',
  'type': 'SequenceLocation'},
 'state': {'sequence': 'GA', 'type': 'SequenceState'},
 'type': 'Allele'}
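
The coordinate arithmetic behind this output can be sketched without the library. The 1-based position becomes a 0-based, end-exclusive span covering the reference allele, and the alternate allele becomes the state. `parse_gnomad_style` and its `assembly` default are hypothetical and are not the Translator's actual implementation.

```python
def parse_gnomad_style(variant: str, assembly: str = "GRCh38") -> dict:
    """Convert 'chrom-pos-ref-alt' into an Allele-shaped dict with a 0-based span."""
    chrom, pos, ref, alt = variant.split("-")
    start = int(pos) - 1       # 1-based position -> 0-based start
    end = start + len(ref)     # span covers the reference allele
    return {
        "location": {
            "region": {"end": end, "start": start, "type": "SimpleRegion"},
            "sequence_id": f"{assembly}:{chrom}",
            "type": "SequenceLocation",
        },
        "state": {"sequence": alt, "type": "SequenceState"},
        "type": "Allele",
    }


parsed = parse_gnomad_style("1-55516888-G-GA")
```

Note that the insertion is expressed as replacing the one-base span `G` with `GA`; no normalization is attempted in this sketch.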

In [4]:
a = tlr.from_hgvs("NM_012345.6:c.22A>T")
a.as_dict()



{'location': {'region': {'end': 22, 'start': 21, 'type': 'SimpleRegion'},
  'sequence_id': 'refseq:NM_012345.6',
  'type': 'SequenceLocation'},
 'state': {'sequence': 'T', 'type': 'SequenceState'},
 'type': 'Allele'}

In [5]:
a = tlr.from_spdi("NM_012345.6:21:1:T")
a.as_dict()

{'location': {'region': {'end': 22, 'start': 21, 'type': 'SimpleRegion'},
  'sequence_id': 'refseq:NM_012345.6',
  'type': 'SequenceLocation'},
 'state': {'sequence': 'T', 'type': 'SequenceState'},
 'type': 'Allele'}
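
The HGVS and SPDI inputs above yield the same Allele because SPDI positions are already 0-based, whereas the HGVS position 22 is 1-based. A minimal SPDI sketch follows, assuming the deletion field may be either a length or a literal sequence. `parse_spdi` and its `namespace` default are hypothetical names chosen to mirror the output shown.

```python
def parse_spdi(spdi: str, namespace: str = "refseq") -> dict:
    """Convert 'accession:position:deletion:insertion' into an Allele-shaped dict."""
    accession, position, deletion, insertion = spdi.split(":")
    start = int(position)  # SPDI positions are 0-based
    # The deletion field is either a length ("1") or the deleted sequence ("A").
    deleted_length = int(deletion) if deletion.isdigit() else len(deletion)
    return {
        "location": {
            "region": {"end": start + deleted_length, "start": start, "type": "SimpleRegion"},
            "sequence_id": f"{namespace}:{accession}",
            "type": "SequenceLocation",
        },
        "state": {"sequence": insertion, "type": "SequenceState"},
        "type": "Allele",
    }


from_length = parse_spdi("NM_012345.6:21:1:T")
from_sequence = parse_spdi("NM_012345.6:21:A:T")
```

Both spellings of the deletion produce the same span, 21 to 22, which is the same region the translator reports for `c.22A>T`.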

## Translating sequence identifiers to VMC sequence identifiers
Sequence lookup services are required to implement VMC operations, but the exact implementation is up to the implementer. The most important need is to translate sequence identifiers from RefSeq or other sources into VMC sequence identifiers.

In [27]:
from vmc.extras.seqrepo import get_vmc_sequence_identifier
get_vmc_sequence_identifier("RefSeq:NC_000019.10")

'VMC:GS_IIB53T8CNeJJdUqzn9V_JnRtQadwWCbl'
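
Since the lookup mechanism is left to the implementer, the simplest possible service is a table. The class below is a hypothetical in-memory stand-in, not the seqrepo-backed function imported above. Its only entry is the mapping shown in the preceding output.

```python
class InMemorySequenceLookup:
    """Hypothetical lookup service backed by a plain dict."""

    def __init__(self, table: dict):
        self._table = dict(table)

    def get_vmc_sequence_identifier(self, identifier: str) -> str:
        """Return the VMC sequence identifier for a namespaced identifier."""
        try:
            return self._table[identifier]
        except KeyError:
            raise KeyError(f"no VMC sequence identifier known for {identifier!r}") from None


lookup = InMemorySequenceLookup(
    {"RefSeq:NC_000019.10": "VMC:GS_IIB53T8CNeJJdUqzn9V_JnRtQadwWCbl"}
)
vmc_seq_id = lookup.get_vmc_sequence_identifier("RefSeq:NC_000019.10")
```

A real service would more likely sit in front of a sequence database, but anything exposing the same method could be swapped in.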