cdot provides transcripts for the two most popular Python HGVS libraries (Biocommons HGVS and PyHGVS).
It works by:
- Converting RefSeq/Ensembl GTFs to JSON
- Providing loaders for the HGVS libraries, via JSON.gz files or a REST API (via cdot_rest)
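As a rough illustration of the first step, the sketch below groups GTF exon records by transcript and dumps them as JSON. It is not cdot's converter: the function names (`parse_gtf_attributes`, `gtf_exons_to_dict`), the output layout, and the example records are all made up for illustration, and the real cdot JSON schema is different.

```python
import json
from collections import defaultdict


def parse_gtf_attributes(field):
    """Parse the GTF attribute column ('key "value"; key "value";') into a dict.

    Simplification: assumes no semicolons appear inside quoted values.
    """
    attributes = {}
    for part in field.strip().split(";"):
        part = part.strip()
        if not part:
            continue
        key, _, value = part.partition(" ")
        attributes[key] = value.strip('"')
    return attributes


def gtf_exons_to_dict(gtf_lines):
    """Group exon features by transcript_id (hypothetical output layout)."""
    transcripts = defaultdict(lambda: {"contig": None, "strand": None, "exons": []})
    for line in gtf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments
        columns = line.rstrip("\n").split("\t")
        if len(columns) != 9 or columns[2] != "exon":
            continue  # only exon features are used in this sketch
        contig, _source, _feature, start, end, _score, strand, _frame, attrs = columns
        transcript_id = parse_gtf_attributes(attrs).get("transcript_id")
        if transcript_id is None:
            continue
        record = transcripts[transcript_id]
        record["contig"] = contig
        record["strand"] = strand
        # GTF coordinates are 1-based and inclusive; they are kept unchanged here
        record["exons"].append([int(start), int(end)])
    for record in transcripts.values():
        record["exons"].sort()
    return dict(transcripts)


# Made-up records, not taken from a real annotation release
example_gtf = [
    'chr1\tdemo\texon\t300\t400\t.\t+\t.\tgene_id "G1"; transcript_id "TX1";\n',
    'chr1\tdemo\texon\t100\t200\t.\t+\t.\tgene_id "G1"; transcript_id "TX1";\n',
]
print(json.dumps(gtf_exons_to_dict(example_gtf), indent=2))
```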
We currently support 1.58 million transcript/genome alignments (vs ~141k in UTA v.20210129)
See changelog
2024-08-15:
- 'data_release' helper code
- Many minor updates to data (see changelog)
2023-07-05:
- BioCommons HGVS DataProvider fixes
- Support for mouse transcripts (Mus Musculus GRCm38 and GRCm39)
2023-04-03:
- #41 - Support for T2T CHM13v2.0 example code
pip install cdot
Biocommons HGVS example:
import hgvs.parser  # explicit submodule import, needed for hgvs.parser.Parser() below
from hgvs.assemblymapper import AssemblyMapper
from cdot.hgvs.dataproviders import JSONDataProvider, RESTDataProvider
hdp = RESTDataProvider() # Uses API server at cdot.cc
# hdp = JSONDataProvider(["./cdot-0.2.14.refseq.grch37.json.gz"]) # Uses local JSON file
am = AssemblyMapper(hdp,
                    assembly_name='GRCh37',
                    alt_aln_method='splign', replace_reference=True)
hp = hgvs.parser.Parser()
var_c = hp.parse_hgvs_variant('NM_001637.3:c.1582G>A')
am.c_to_g(var_c)
PyHGVS example:
import pyhgvs
from pysam.libcfaidx import FastaFile
from cdot.pyhgvs.pyhgvs_transcript import JSONPyHGVSTranscriptFactory, RESTPyHGVSTranscriptFactory
genome = FastaFile("/data/annotation/fasta/GCF_000001405.25_GRCh37.p13_genomic.fna.gz")
factory = RESTPyHGVSTranscriptFactory()
# factory = JSONPyHGVSTranscriptFactory(["./cdot-0.2.14.refseq.grch37.json.gz"]) # Uses local JSON file
pyhgvs.parse_hgvs_name('NM_001637.3:c.1582G>A', genome, get_transcript=factory.get_transcript_grch37)
- UTA public DB: 1-1.5 seconds per transcript
- cdot REST service: 10 transcripts/second
- cdot JSON.gz: 500-1,000 transcripts/second
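To get comparable numbers on your own machine, a simple timing loop is enough. This is a minimal sketch: `transcripts_per_second` is a hypothetical helper, not part of cdot, and it times any callable that takes a transcript ID, so results will depend on network latency, caching and hardware.

```python
import time


def transcripts_per_second(get_transcript, transcript_ids):
    """Call get_transcript once per ID and return the observed throughput."""
    transcript_ids = list(transcript_ids)
    if not transcript_ids:
        return 0.0
    start = time.perf_counter()
    for transcript_id in transcript_ids:
        get_transcript(transcript_id)
    # Guard against a zero interval for extremely fast callables
    elapsed = max(time.perf_counter() - start, 1e-9)
    return len(transcript_ids) / elapsed


# Stand-in lookup for demonstration; replace with a real transcript loader
rate = transcripts_per_second(lambda transcript_id: transcript_id, ["NM_001637.3"] * 1000)
print(f"{rate:.0f} transcripts/second")
```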
Download the JSON.gz files from GitHub releases: RefSeq (GRCh37/38) is 72M, Ensembl (GRCh37/38) is 61M.
Details on what the files contain here
cdot and UTA (Universal Transcript Archive) share the goal of providing transcripts for resolving HGVS, but they approach it in different ways:
- UTA aligns sequences, then stores coordinates in an SQL database.
- cdot converts existing Ensembl/RefSeq GTFs into JSON.
See wiki page for the format.
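Because the data files are plain gzip-compressed JSON, they can be inspected with the Python standard library alone. The sketch below assumes only that the top level of the file is a JSON object; `load_json_gz` and `summarize` are hypothetical helper names, and no particular keys are assumed.

```python
import gzip
import json


def load_json_gz(path):
    """Read a gzip-compressed JSON file into Python objects."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        return json.load(handle)


def summarize(data):
    """Map each top-level key to the size of its value, or the value itself for scalars."""
    summary = {}
    for key, value in data.items():
        summary[key] = len(value) if isinstance(value, (dict, list)) else value
    return summary


# Usage, once a release file has been downloaded:
# print(summarize(load_json_gz("./cdot-0.2.14.refseq.grch37.json.gz")))
```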
We think a standard for JSON gene/transcript information would be a great thing, and are keen to collaborate to make it happen!
cdot, pronounced "see dot", is a play on the HGVS coding sequence prefix :c.
But if you want a backronym, it's "Complete Dict Of Transcripts".
This was developed for the Australian Genomics Shariant project, due to the need to load historical HGVS from lab archives.