cdot vs UTA

Dave Lawrence edited this page Feb 3, 2022 · 2 revisions

cdot and the Universal Transcript Archive (UTA) share the goal of providing transcripts for resolving HGVS, but they approach it in different ways:

  • UTA aligns sequences, then stores coordinates in an SQL database.
  • cdot converts existing Ensembl/RefSeq GTFs into JSON.

Alignment Gaps

RefSeq transcript sequences can differ from the genome sequence, which means they can align with gaps. Prior to v105 (GRCh37.p13), RefSeq did not provide alignment gap information, so UTA had to perform its own alignments to obtain the CIGAR strings needed to handle these gaps correctly.

From v105 onwards, RefSeq provides these gaps, making it possible to use the GFFs directly.
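
To make the idea of an alignment gap concrete, here is a minimal sketch that parses a GFF3-style `Gap` attribute and works out how many bases the alignment covers on each side. The helper names `parse_gap` and `aligned_lengths` are hypothetical, only the `M`, `I` and `D` operations are handled, and the example string comes from the GFF3 specification rather than from a real RefSeq record. This is not how cdot or UTA implement gap handling.

```python
import re

# GFF3 "Gap" operations handled here:
#   M = match, I = bases present only in the target (e.g. the transcript),
#   D = bases present only in the reference (e.g. the genome).
_GAP_OP = re.compile(r"^([MID])(\d+)$")


def parse_gap(gap):
    """Split a Gap attribute such as 'M8 D3 M6 I1 M6' into (op, length) pairs."""
    ops = []
    for token in gap.split():
        match = _GAP_OP.match(token)
        if match is None:
            raise ValueError(f"Unsupported Gap operation: {token!r}")
        ops.append((match.group(1), int(match.group(2))))
    return ops


def aligned_lengths(ops):
    """Return (reference_length, target_length) covered by the alignment."""
    reference_length = sum(n for op, n in ops if op in "MD")
    target_length = sum(n for op, n in ops if op in "MI")
    return reference_length, target_length


ops = parse_gap("M8 D3 M6 I1 M6")
print(ops)
print(aligned_lengths(ops))
```

When the two lengths differ, transcript coordinates can no longer be converted to genome coordinates by simple offsets within an exon, which is why the gap information matters.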

Advantages of aligning sequences

  • UTA can map GRCh37 sequences to GRCh38 and vice-versa
  • UTA can account for alignment gaps in earlier RefSeq releases (cdot uses these UTA transcripts - thanks!)

Advantages of using existing GTFs

  • Drastically simpler workflow - meaning we can load more transcripts
  • Alignments exactly match those in official releases

JSON vs SQL

There's a bit of redundancy in JSON, but:

  • You can copy flat files around without dealing with Docker/PostgreSQL/database schemas etc.
  • It's trivial to write a REST server and the client already consumes JSON
  • It's lightning fast to load into RAM in Python
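
As a rough illustration of the flat-file workflow, the sketch below writes a small JSON file and loads it back using only the Python standard library. The record layout (`transcripts`, `gene_name`, `exons` and the `NM_EXAMPLE.1` accession) is invented for this example and is not cdot's actual JSON schema.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical transcript records; the field names are invented for
# illustration and are not cdot's actual JSON schema.
example = {
    "transcripts": {
        "NM_EXAMPLE.1": {
            "gene_name": "GENE1",
            "contig": "chr1",
            "strand": "+",
            "exons": [[100, 200], [300, 450]],
        },
    }
}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "transcripts.json"
    path.write_text(json.dumps(example))

    # Loading is a single call; the result is plain dicts and lists in RAM.
    with path.open() as handle:
        data = json.load(handle)

# Lookup by accession is an ordinary dictionary access.
transcript = data["transcripts"]["NM_EXAMPLE.1"]
exon_length = sum(end - start for start, end in transcript["exons"])
print(transcript["gene_name"], exon_length)
```

No database server or schema migration is involved: copying the file is enough to move the data to another machine.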