Standardize gene representation #381
Comments
Here are some other examples:

The Arabidopsis Information Resource
See GFF3

Braker
Braker is a popular genome annotation program. Output depends on the settings. For one of our GFF files from Braker 2 we get these types:
Note that it includes: See GFF3

Tomato
Source: https://solgenomics.net/ftp/tomato_genome/annotation/ITAG4.0_release/ITAG4.0_gene_models.gff
There is nothing unusual here. All features have a unique identifier. Genes have: CDS, exon, five_prime_UTR, gene, mRNA, three_prime_UTR.
See GFF3
@kyostiebi Attached here is a GFF3 that has genes in several different formats. Currently the changes that load data in the new feature model are only guaranteed to work with the first gene format in this file. Could you update the importing code in the new feature model branch you've been working on so that it handles all the cases in the attached GFF3? All cases in this file should end up with the same gene model (just with the position offset by 10000 bases).
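One way to check the "same gene model, just offset" expectation is sketched below. It assumes a flat, hypothetical `Feature` record and a known constant offset between cases; `Feature`, `shifted`, and `same_model` are illustrative names, not part of Apollo's code, and the parent/child hierarchy is ignored for brevity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Feature:
    """Hypothetical flattened feature: type, 1-based inclusive span, strand."""
    type: str
    start: int
    end: int
    strand: str


def shifted(features, offset):
    """Return the features as a set, with `offset` subtracted from every coordinate."""
    return {Feature(f.type, f.start - offset, f.end - offset, f.strand) for f in features}


def same_model(reference, other, offset):
    """True if `other` equals `reference` once its coordinates are shifted back by `offset`."""
    return set(reference) == shifted(other, offset)
```

Comparing sets rather than lists means the order in which the importer emits features does not affect the result.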
Up until now we've basically preserved exactly what is in the imported GFF3, with only minor formatting changes to store it internally. However, this has led to places in our code that handle things differently depending on how the GFF3 is formatted. A big example is the `CanonicalGeneGlyph` and the `ImplicitExonGeneGlyph`. There have also been GFF3s we've tried uploading where neither of these glyphs works. I also noticed this behavior when looking at the Transcript Details Widget: certain things only worked if the GFF3 was formatted in a certain way.

I think the way we need to handle this going forward is to standardize the GFF3 data on import, specifically for genes, so Apollo can always expect a single format. This means a potential loss of data. For example, if a `five_prime_UTR` in a GFF3 has an `ID`, but we decide to drop UTRs from the data when standardizing it (since the location of UTRs can be calculated based on the locations of other features), we'd lose the UTR's ID. I think this is unavoidable, though, and can also be somewhat mitigated by having a robust GFF3 export system.

Here are some GFF3s that I found that illustrate how GFF3s format genes:
Sequence Ontology GFF3 Spec
See GFF3
`mRNA`, `exon`, and `CDS`. `CDS`s have multiple locations under the same ID.

Ensembl GRCh38
See GFF3
`mRNA`, `exon`, `CDS`, `five_prime_UTR`, and `three_prime_UTR`. `CDS`s have multiple locations under the same ID.

RefSeq GRCh38
See GFF3
`mRNA`, `exon`, and `CDS`. `CDS`s have multiple locations under the same ID. (Matches Sequence Ontology spec)

WormBase C. elegans
See GFF3
`mRNA`, `exon`, `CDS`, `five_prime_UTR`, `three_prime_UTR`, and `intron`. `CDS`s have multiple locations under the same ID.

PlasmoDB P. falciparum
See GFF3
`mRNA`, `exon`, `CDS`, `five_prime_UTR`, and `three_prime_UTR`. Each `CDS` location has a unique ID.

We need to figure out what our standard internal representation will be so that we can start figuring out how to standardize the data.
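To illustrate the point above that UTR locations can be calculated from other features, here is a minimal sketch that derives UTR spans for one transcript from its exon and CDS spans. The function name `derive_utrs`, the tuple-based representation, and the coordinates in the usage example are assumptions for illustration only, not a proposal for the actual internal format.

```python
def derive_utrs(exons, cds, strand):
    """Derive UTR spans for one transcript.

    exons, cds: lists of (start, end) spans in 1-based inclusive GFF3 coordinates.
    strand: "+" or "-".
    Returns a dict with "five_prime_UTR" and "three_prime_UTR" span lists.
    """
    cds_start = min(start for start, _ in cds)
    cds_end = max(end for _, end in cds)
    left, right = [], []  # exonic sequence left / right of the coding region
    for start, end in sorted(exons):
        if start < cds_start:
            left.append((start, min(end, cds_start - 1)))
        if end > cds_end:
            right.append((max(start, cds_end + 1), end))
    # On the minus strand the 5' end is at the higher coordinates.
    if strand == "-":
        return {"five_prime_UTR": right, "three_prime_UTR": left}
    return {"five_prime_UTR": left, "three_prime_UTR": right}


# Illustrative coordinates for a four-exon transcript.
exons = [(1050, 1500), (3000, 3902), (5000, 5500), (7000, 9000)]
cds = [(1201, 1500), (3000, 3902), (5000, 5500), (7000, 7600)]
utrs = derive_utrs(exons, cds, "+")
# utrs["five_prime_UTR"] is [(1050, 1200)]; utrs["three_prime_UTR"] is [(7601, 9000)]
```

A derivation like this recovers UTR coordinates on export, but not attributes such as a UTR's original `ID`, which is the data loss described above.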