#2450 - gene/transcripts etc FAQ
davmlaw committed May 7, 2020
1 parent 78f6f3b commit 0169801
Showing 3 changed files with 36 additions and 3 deletions.
3 changes: 2 additions & 1 deletion docs/index.rst
@@ -26,7 +26,7 @@ If you are viewing classifications, fixing issues, resolving discordances
site/classification_form
site/classification_listing

If you are working on the technical connection to Sharaint
If you are working on the technical connection to Shariant
----------------------------------------------------------

.. toctree::
@@ -39,6 +39,7 @@ If you are working on the technical connection to Sharaint
integration/basics/sharing
integration/basics/variant_matching_overview
integration/basics/variant_matching_technical
integration/basics/gene_transcripts_builds_faq

.. toctree::
:maxdepth: 1
30 changes: 30 additions & 0 deletions docs/integration/basics/gene_transcripts_builds_faq.md
@@ -0,0 +1,30 @@
# Gene / Transcript / Build issues FAQ

## Q. How do genes and symbols work?

RefSeq and Ensembl curate genes and transcripts, giving them stable identifiers (such as ENSG00000139618) and versions.

Gene symbols (e.g. BRCA2) are decided by nomenclature committees, e.g. HUGO / HGNC for human. Each gene version has an assigned symbol, but the symbol for a gene can change from one version to the next.
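
As a rough illustration of the identifier-plus-version idea, the sketch below splits a versioned identifier written in the common `ACCESSION.VERSION` form. The function and type names are hypothetical and are not part of Shariant.

```python
from typing import NamedTuple, Optional


class VersionedId(NamedTuple):
    accession: str          # stable identifier, e.g. "ENSG00000164199"
    version: Optional[int]  # None when no version suffix was given


def parse_versioned_id(identifier: str) -> VersionedId:
    """Split 'ACCESSION.VERSION' into its parts (hypothetical helper)."""
    accession, sep, version = identifier.partition(".")
    if sep and version.isdigit():
        return VersionedId(accession, int(version))
    # No usable numeric suffix: treat the whole string as the accession.
    return VersionedId(identifier, None)


gene = parse_versioned_id("ENSG00000164199.11")
unversioned = parse_versioned_id("ENSG00000139618")
```

Keeping the version as a separate field makes it explicit when two records refer to the same stable identifier but to different versions of it.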

## Q. Why are some gene/transcripts only available in certain genome builds?

Transcript releases are made frequently (RefSeq is on version 99 and Ensembl on version 100 as of May 2020), but new transcripts are only aligned to the latest build. Here are the build release dates:

| Build | Released | Last Update |
|------|----------|-------------|
| GRCh37 | 2009/02/27 | 2013/06/28 (p.13) |
| GRCh38 | 2013/12/17 | 2019/02/28 (p.13) |

So gene/transcript versions released after 2013/06/28 are not available for GRCh37.

Obsolete transcript versions are not mapped to new builds, so gene/transcript versions replaced between 2009 and 2013 are available in GRCh37 but not in GRCh38.
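
The two rules above can be sketched as a small date check. This is a simplification for illustration only: the function name is hypothetical, the dates come from the table above, and real availability depends on what each annotation release actually contains.

```python
from datetime import date
from typing import List, Optional

# Dates taken from the build table above.
GRCH37_LAST_UPDATE = date(2013, 6, 28)
GRCH38_RELEASED = date(2013, 12, 17)


def builds_with_version(released: date, replaced: Optional[date] = None) -> List[str]:
    """Estimate which builds carry a gene/transcript version (hypothetical helper).

    released: when the version was first released.
    replaced: when a newer version superseded it, or None if still current.
    """
    builds = []
    # New versions are only aligned to the latest build, so anything released
    # after the last GRCh37 update is assumed to be missing from GRCh37.
    if released <= GRCH37_LAST_UPDATE:
        builds.append("GRCh37")
    # Obsolete versions are not mapped to new builds, so anything replaced
    # before GRCh38 existed is assumed to be missing from GRCh38.
    if replaced is None or replaced >= GRCH38_RELEASED:
        builds.append("GRCh38")
    return builds
```

For example, a version released in 2010 and replaced in 2012 would be reported for GRCh37 only, while a version released in 2015 would be reported for GRCh38 only.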

## Q. Why do gene names change between genome builds?

A gene version has a fixed gene symbol, independent of build, e.g. ENSG00000164199 version 11 has the symbol 'GPR98', while version 18 has the symbol 'ADGRV1'.
However, as described above, the versions of genes and transcripts available for a build differ, so ENSG00000164199 is 'GPR98' in GRCh37 but 'ADGRV1' in GRCh38.
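
A minimal sketch of this two-step lookup, using the example from this answer. The table and function names are hypothetical, and pairing version 11 with GRCh37 and version 18 with GRCh38 is an assumption made for illustration; the actual version depends on the annotation release used for each build.

```python
# Symbol is a property of the gene version, not of the build.
SYMBOL_BY_GENE_VERSION = {
    ("ENSG00000164199", 11): "GPR98",
    ("ENSG00000164199", 18): "ADGRV1",
}

# Assumption for illustration: which gene version each build's annotation carries.
GENE_VERSION_BY_BUILD = {
    ("ENSG00000164199", "GRCh37"): 11,
    ("ENSG00000164199", "GRCh38"): 18,
}


def symbol_for_build(gene_id: str, build: str) -> str:
    """Resolve build -> gene version -> symbol (hypothetical helper)."""
    version = GENE_VERSION_BY_BUILD[(gene_id, build)]
    return SYMBOL_BY_GENE_VERSION[(gene_id, version)]
```

The symbol only appears to depend on the build because the build determines which gene version is looked up.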

## Q. Why does the same gene/transcript version have different exons in different genome builds?

Transcripts are determined as mRNA sequences, then aligned back against each genome build to find the exon coordinates. If the builds differ at that gene, for instance through inserted or deleted sequence, or base changes that alter splice sites, then exon lengths or boundaries can change.

6 changes: 4 additions & 2 deletions docs/integration/basics/variant_matching_overview.md
@@ -77,19 +77,21 @@ HGVS nomenclature specifies right alignment. The table below (copied from ClinGe

**Shariant Solution**: HGVS coordinates are converted into VCF coordinates, then all VCF (or VCF from HGVS) coordinates are run through VT normalise before linking to a variant.

## Cross mapping between different genome builds
## Conversion between genome builds

It is not always possible to lift over every variant to another build; success rates of ~97% are commonly reported.

Newer builds have resolved some difficult sequences and introduced additional haplotypes (different reference sequences for regions of the genome that vary between human populations). Therefore, it may be possible that two distinct variants in one genome build will lift over to become a single variant in another genome build, or vice versa. A naive lift-over of two separate classifications from the same genome build may make those same classifications discordant on another genome build.
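
The merging case described above can be detected by grouping lift-over results by their target coordinates. The sketch below assumes the lift-over has already been run by some external tool and its output collected into a dictionary; the function name and the coordinates are made up for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Tuple

# (chromosome, position, ref, alt)
Variant = Tuple[str, int, str, str]


def find_merged_variants(
    lifted: Dict[Variant, Optional[Variant]],
) -> Dict[Variant, List[Variant]]:
    """Return target variants that two or more source variants lifted over to.

    lifted maps each source-build variant to its target-build variant,
    or to None when the lift-over failed.
    """
    sources_by_target = defaultdict(list)
    for source, target in lifted.items():
        if target is not None:
            sources_by_target[target].append(source)
    return {t: s for t, s in sources_by_target.items() if len(s) > 1}


# Made-up coordinates: two distinct source variants collapse onto one target,
# and a third fails to lift over.
example = {
    ("1", 100, "A", "G"): ("1", 150, "A", "G"),
    ("1", 120, "A", "G"): ("1", 150, "A", "G"),
    ("2", 50, "C", "T"): None,
}
merged = find_merged_variants(example)
```

Any target returned here would need its source classifications reviewed together, since they now refer to the same variant.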

**Shariant Solution**: The ClinGen Allele Registry will be used to solve this problem by providing a globally unique ID (CAid) which can link variants across different genome builds. When a classification is imported which resolves to a novel variant in Shariant, an API request will be made to the ClinGen Allele Registry to retrieve or create an CAid for this variant.
**Shariant Solution**: The [ClinGen Allele Registry](http://reg.clinicalgenome.org/redmine/projects/registry/genboree_registry/landing) will be used to solve this problem by providing a globally unique ID (CAid) which can link variants across different genome builds. When a classification is imported which resolves to a novel variant in Shariant, an API request will be made to the ClinGen Allele Registry to retrieve or create a CAid for this variant.

The variant classification discordance process implemented in Shariant will work against these CAids (please refer to Shariant Technical Overview).

CAids can also be used as Evidence Keys to provide unambiguous linking of classifications to variants, and simplify submission to ClinVar.
References

The HGVS for a classification may change after liftover; see the [Gene/Transcripts/Builds FAQ](gene_transcripts_builds_faq.md).

## References

[1] Dunnen, J. T., Dalgleish, R. , Maglott, D. R., Hart, R. K., Greenblatt, M. S., McGowan‐Jordan, J. , Roux, A. , Smith, T. , Antonarakis, S. E. and Taschner, P. E. (2016), HGVS Recommendations for the Description of Sequence Variants: 2016 Update. Human Mutation, 37: 564-569. doi:10.1002/humu.22981