Add ncRNA type to gpi file #63

Closed
AntonPetrov opened this Issue Oct 11, 2016 · 12 comments

Member

AntonPetrov commented Oct 11, 2016

Need to map between INSDC ncRNA types and SO ncRNAs.

More information here:
geneontology/go-site#234

AntonPetrov added this to the Release 6 milestone Oct 11, 2016

AntonPetrov self-assigned this Oct 11, 2016

AntonPetrov modified the milestones: Release 6, Release 7 Jan 4, 2017

cmungall commented Feb 14, 2017

Any update on this? Thanks!

blakesweeney self-assigned this Feb 15, 2017

Member

blakesweeney commented Feb 15, 2017

Hi, just to let you know we've been doing some work improving ncRNA type (and related issues) in #111. Once that is merged, fixing this will be one of our priorities. Thanks!

blakesweeney added a commit that referenced this issue Mar 3, 2017

Initial addition of a mapping dict
The goal here is to create a mapping from INSDC terms to the correct SO
terms, as seen in #63. This is my initial start at this. I don't think I
will just leave a random dict in some file, but I need to put this
somewhere for now. Future work will fill out this dict more as well as
move it around (probably). For now this is just to fill out the mapping.

Member

blakesweeney commented Mar 6, 2017

Hi @cmungall, I've done some work on creating a mapping from the INSDC types to SO terms, and below is my current mapping. Most of the mappings are very straightforward; however, there are a few where I had to make a judgement call. I've included some notes on why I selected the SO term I did. Let me know if you have any comments or changes you'd like me to make. Thanks!

Member

blakesweeney commented Apr 24, 2017

Hi, @cmungall have you had a chance to look over this mapping? Thanks!

cmungall commented Apr 24, 2017

The ones I looked at seemed fine. Good point about misc_RNA. I made an issue in the SO tracker (linked above).

murphyte commented Apr 25, 2017

FYI, I just attached the INSDC:SO mappings that we use at NCBI to this ticket:
The-Sequence-Ontology/SO-Ontologies#378

I didn't double-check it vs. your mapping above, but most of them are pretty obvious and I'm sure they're consistent.

For precursor_RNA, it is an overstatement to map INSDC precursor_RNA specifically to pre-miRNA. That is the most common usage within RefSeq, but there are a few other places where precursor_RNA is used, so at best the mapping requires some additional analysis.

For misc_RNA, we also map to the general SO term "transcript", or "pseudogenic_transcript" if marked up with a /pseudo qualifier. I wouldn't say "transcript" is wrong, it's just not as informative as you might like.
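
The misc_RNA rule described in this comment could be sketched roughly as follows. This is only an illustration: the function name `so_term_for_misc_rna` and the way qualifiers are passed in (a list of strings such as `"/pseudo"`) are assumptions made for the example, not code from NCBI or RNAcentral. Only the SO term names "transcript" and "pseudogenic_transcript" come from the comment itself.

```python
from typing import Iterable


def so_term_for_misc_rna(qualifiers: Iterable[str]) -> str:
    """Pick an SO term name for an INSDC misc_RNA feature.

    `qualifiers` is a hypothetical representation: raw qualifier strings
    such as "/pseudo" or '/product="example"'.
    """
    # Reduce each qualifier to its bare name: drop the leading slash and
    # anything after "=" so that "/pseudo" becomes "pseudo".
    names = {q.lstrip("/").split("=", 1)[0] for q in qualifiers}
    if "pseudo" in names:
        return "pseudogenic_transcript"
    # Without /pseudo, fall back to the general (less informative) term.
    return "transcript"


print(so_term_for_misc_rna(["/pseudo"]))
print(so_term_for_misc_rna(['/product="example"']))
```

A resource that does not import pseudogenes would only ever take the second branch, so for it the rule collapses to a plain misc_RNA to "transcript" mapping.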

murphyte commented Apr 25, 2017

Also note that in addition to the INSDC feature type misc_RNA, there is the /mol_type="transcribed RNA", which is used extensively within TSA:
https://www.ncbi.nlm.nih.gov/nuccore/GDAY02000562.1
https://www.ncbi.nlm.nih.gov/nuccore/IABX01000001.1

For the first case, there's a CDS feature so you can deduce that mRNA would be a more appropriate type. For the second, there is no feature annotated on the RNA, but it's reasonable to treat it the same as if it had a misc_RNA feature spanning the full length. I think we don't have any cases like that in RefSeq, and it's rare in INSDC outside of TSA so you may not need to deal with it for RNAcentral, but I figured I'd point out the wrinkle for you just in case you do need to account for it.
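
The two TSA cases above could be handled with a small helper along these lines. This is a hedged sketch, not anything from RNAcentral or NCBI: the `Record` class, its fields, and `infer_feature_type` are hypothetical names, and `feature_types` is assumed to list the annotated features other than the mandatory `source` feature.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Record:
    """Hypothetical minimal view of an INSDC record."""
    mol_type: str
    # Annotated feature keys, assumed to exclude the "source" feature.
    feature_types: List[str] = field(default_factory=list)


def infer_feature_type(record: Record) -> Optional[str]:
    """Suggest an RNA type for /mol_type="transcribed RNA" records."""
    if record.mol_type != "transcribed RNA":
        return None  # outside the scope of this sketch
    if "CDS" in record.feature_types:
        # A coding region implies mRNA is the more appropriate type.
        return "mRNA"
    if not record.feature_types:
        # No feature on the RNA: treat it like a full-length misc_RNA.
        return "misc_RNA"
    return None  # other features present: defer to their own types


print(infer_feature_type(Record("transcribed RNA", ["CDS"])))
print(infer_feature_type(Record("transcribed RNA")))
```

A pipeline that only wants non-coding RNA could use the `"mRNA"` result as a signal to skip the record rather than import it.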

Member

blakesweeney commented Apr 25, 2017

Hi, @murphyte thanks for the feedback. I'll look over the mapping you gave us and make changes if needed.

Thank you for the comments about those specific annotations. We do a bit of work to select the best annotated RNA type for each sequence. Hopefully this means our precursor_RNA annotations are reliable, but I will have to check and change our mapping if needed.

In general we do not import pseudogenes, so I will leave misc_RNA as transcript. I'll also make a note to use the CDS feature and the "transcribed RNA" mol_type to exclude things from import into RNAcentral. I don't think it's an issue, but I haven't checked extensively. Thanks!

keilbeck commented Apr 26, 2017

Chris, Terrance & Blake,
Nicole (@nicoleruiz) and I are looking over the mappings now.
We would add a new term to map to misc_RNA if you all think it is necessary, and we will make sure that our definitions are inclusive of your meanings.

We will also add xrefs if they are missing.

Member

blakesweeney commented Apr 27, 2017

Hi @keilbeck and @nicoleruiz, thank you for checking over the mappings. I will happily use the misc_RNA term if it exists, but I am OK without having it. From the RNAcentral perspective it is a very uninformative term, and we are hoping to cut down on the number of sequences annotated as misc_RNA. Thanks again!

Member

blakesweeney commented Apr 27, 2017

Hi, I've looked a bit more at the things we label as precursor_RNA and I agree with @murphyte that pre-miRNA may be an overstatement. So I will move to mapping to primary_transcript like RefSeq does. One of our goals is to have a more accurate mapping so we will look into having more precise labels. Thanks for the feedback!

blakesweeney added a commit that referenced this issue Apr 27, 2017

Change precursor_RNA to map to primary_transcript
As discussed in #63, the mapping to pre-miRNA is too precise unless we
have additional information. At the moment the mapping is based only
on the terms and does not consider any additional information. Because
of this, I use a more general term for now. We can use a more precise one
once the mappings consider other things. For example, if the data is
coming from miRBase it is probably a pre-miRNA; however, we don't do that
yet, so we will stick with a more generic mapping.
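
One way the term-only mapping and a possible later source-aware refinement might look is sketched below. All names here (`INSDC_TO_SO`, `SOURCE_OVERRIDES`, `so_term`, the `use_source` flag) are hypothetical and are not taken from the RNAcentral code base. The dict holds only the two mappings settled in this thread, and the miRBase override is the kind of refinement the commit message says is not done yet.

```python
from typing import Dict, Optional, Tuple

# Term-only mapping: just the two entries agreed on in this thread.
INSDC_TO_SO: Dict[str, str] = {
    "precursor_RNA": "primary_transcript",
    "misc_RNA": "transcript",
}

# Hypothetical refinement keyed on (source database, INSDC type).
# "pre_miRNA" stands in for the pre-miRNA term discussed above.
SOURCE_OVERRIDES: Dict[Tuple[str, str], str] = {
    ("miRBase", "precursor_RNA"): "pre_miRNA",
}


def so_term(insdc_type: str, source: Optional[str] = None,
            use_source: bool = False) -> str:
    """Map an INSDC RNA type to an SO term name.

    With use_source=False (the behaviour described in the commit), only
    the INSDC term is considered. Unknown types raise KeyError.
    """
    if use_source and source is not None:
        override = SOURCE_OVERRIDES.get((source, insdc_type))
        if override is not None:
            return override
    return INSDC_TO_SO[insdc_type]


print(so_term("precursor_RNA", source="miRBase"))
print(so_term("precursor_RNA", source="miRBase", use_source=True))
```

Keeping the overrides in a separate table leaves the generic mapping untouched, so the more precise labels can be switched on per source once they have been checked.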

Member

AntonPetrov commented May 16, 2017

The RNAcentral GPI file now contains ncRNA types:
ftp://ftp.ebi.ac.uk/pub/databases/RNAcentral/releases/7.0/gpi/

@tonysawfordebi, sorry it took us so long!

Feel free to reopen this issue or create a new one if you notice any problems.
