Relating PRO to UniProt #165

Open
nataled opened this issue Oct 2, 2019 · 6 comments

@nataled
Collaborator

commented Oct 2, 2019

This issue is a continuation of the discussion here:
geneontology/neo#34

This thread will focus on:

  1. What do the UniProt PURLs denote: database entry, protein class, or sequence?
  2. How does PRO relate to UniProt?

Interested parties (so far):
@JervenBolleman
@cmungall
@goodb
@alanruttenberg

@cmungall

commented Oct 2, 2019

I would very much like there to be a single URI for a concept like "human Shh protein" (or at least two equivalent interchangeable URIs).

@nataled

Collaborator Author

commented Oct 2, 2019

This will be possible once we find out just what the UniProt PURLs are intended to mean. I recall @JervenBolleman saying he considers them to mean the same as PRO when he gives talks, but I'm not sure there's agreement on that (several people on the previous thread, myself included, indicated that they consider them as referring to database entries). In PRO we consider them exactly that: database entries that are about some protein class (for example, http://purl.uniprot.org/uniprot/P05067 is_about http://purl.obofoundry.org/obo/PR_P05067).
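To make the "database entry is_about protein class" reading concrete, here is a minimal sketch that writes the example as a single N-Triples statement. It assumes that `is_about` means the IAO relation `IAO_0000136`; the thread only writes "is_about", so that mapping is an assumption. The two subject/object URIs are copied as written in this comment, and the helper name `is_about_triple` is hypothetical.

```python
# Assumption: "is_about" is taken to be IAO 'is about' (IAO_0000136).
IS_ABOUT = "http://purl.obolibrary.org/obo/IAO_0000136"

# URI prefixes copied as written in this thread.
UNIPROT_ENTRY_BASE = "http://purl.uniprot.org/uniprot/"
PRO_CLASS_BASE = "http://purl.obofoundry.org/obo/PR_"


def is_about_triple(accession: str) -> str:
    """Return one N-Triples line: <UniProt entry> is_about <PRO class>."""
    entry = f"{UNIPROT_ENTRY_BASE}{accession}"
    protein_class = f"{PRO_CLASS_BASE}{accession}"
    return f"<{entry}> <{IS_ABOUT}> <{protein_class}> ."


if __name__ == "__main__":
    print(is_about_triple("P05067"))
```

Under this reading the two URIs stay distinct: the UniProt PURL names an information entity and the PRO PURL names the protein class it describes.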

My main concern is that the UniProt PURLs might be overloaded in meaning. That is, some people consider them to refer to classes of proteins, some say they refer to database entries, and others might consider them as referring to sequences. If they are database entries, fine, but for PRO purposes we'll need a way to refer to the sequence. If they are protein classes, fine, we'll provide the appropriate equivalency statements, but we'll still need a way to refer to the sequence. If they are sequences, fine, we'll make the appropriate connection. I recall @cmungall suggesting that for the sequences we use a URL such as https://www.uniprot.org/uniprot/P05067.fasta?version=1. That would be fine, but there are also isoform PURLs such as http://purl.uniprot.org/isoforms/P05067-1. I asked whether that PURL is intended to represent the (current) sequence or the class of proteins derived from that isoform. I did not get an answer.
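The three kinds of identifier mentioned above can be laid side by side with a small sketch. The URL patterns are copied from the examples in this comment; whether each one resolves, and what each one denotes, is exactly the open question here. The names `UniProtRefs` and `uniprot_refs` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UniProtRefs:
    entry: str            # entry PURL (database entry? protein class? sequence?)
    isoform: str          # isoform PURL (current sequence? isoform class?)
    versioned_fasta: str  # URL of one specific sequence version


def uniprot_refs(accession: str, isoform: int = 1, version: int = 1) -> UniProtRefs:
    """Build the three URL forms discussed in the thread for one accession."""
    return UniProtRefs(
        entry=f"http://purl.uniprot.org/uniprot/{accession}",
        isoform=f"http://purl.uniprot.org/isoforms/{accession}-{isoform}",
        versioned_fasta=(
            f"https://www.uniprot.org/uniprot/{accession}.fasta?version={version}"
        ),
    )


if __name__ == "__main__":
    refs = uniprot_refs("P05067")
    print(refs.entry)
    print(refs.isoform)
    print(refs.versioned_fasta)
```

Only the versioned FASTA URL pins down a fixed sequence by construction; the other two depend on the answer UniProt gives.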

@cmungall


commented Oct 2, 2019

[broken record]
I think the whole question of referring to database entries is a red herring. http://purl.obolibrary.org/obo/GO_0097194 refers to a database entry, for a term in GO. It has databasey properties like identifiers, xrefs, and information about which curator created it. But it's also a representation of a repeatable thing in nature. Ultimately we're all in the business of representing things in nature here, and at the same time doing database/ontology curation.

Our IDs can do dual duties as representing database entities and things in nature. There is no need to get meta and introduce an extra layer of indirection. Or at least I am not aware of such a use case, where someone really needs to track both these things and keep them distinct.
[/broken record]

I think the sequence vs. protein molecule aspect is a bit more nuanced.

@nataled

Collaborator Author

commented Oct 2, 2019

I believe you missed my point. It isn't that I am introducing a layer. The question is "What kind of entity does UniProt consider its entries to be?" And one possible answer is..."Database entries."

@nataled

Collaborator Author

commented Oct 3, 2019

@cmungall asked "What are the semantics of a non-GCRP trembl ID according to PRO?"

TrEMBL entries fall into the following types:

A) If there already exists a Swiss-Prot entry describing the products of some gene G (SP_of_G), then the TrEMBL entry describing a product of the same gene (Tr_of_G) can be:

  1. A sequence variant (allele) of G. These would be Tr_of_G is_a SP_of_G
  2. An isoform of G. These would be Tr_of_G is_a SP_of_G

B) If no Swiss-Prot entry describes the products of the TrEMBL gene, then the TrEMBL entry describing a product of that gene (Tr_of_G) can be:

  1. The 'proto-canonical' sequence (either because there is no other entry describing a product of that gene, or because it has the longest sequence among all TrEMBL entries with that gene). We'll call these TrC_of_G. In this case TrC_of_G is_a protein (or whatever level is appropriate). I describe this only for completeness; these are (or should be) part of the GCRP set.
  2. A sequence variant (allele) of that gene (TrV_of_G). Then, TrV_of_G is_a TrC_of_G.
  3. An isoform of G (TrI_of_G). Then, TrI_of_G is_a TrC_of_G.

C) If no gene is indicated in the TrEMBL entry (call it TrX), then...

  1. TrX is_a protein (if no species non-specific parent can be found).
  2. TrX is_a [species non-specific parent] (otherwise).

Technically speaking, TrEMBL entries (like some Swiss-Prot entries) can also describe fragments.
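The case analysis above can be sketched as a small decision function. This is only an illustration of the A/B/C branching, not PRO's actual pipeline: the `TrEMBLEntry` fields, the lookup dictionaries, and the function name `is_a_parent` are all hypothetical, and fragments are not handled.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class TrEMBLEntry:
    accession: str
    gene: Optional[str] = None        # None means no gene is indicated (case C)
    is_proto_canonical: bool = False  # True for case B.1 (TrC_of_G)


def is_a_parent(
    entry: TrEMBLEntry,
    swissprot_by_gene: Dict[str, str],
    proto_canonical_by_gene: Dict[str, str],
    nonspecific_parent: Optional[str] = None,
) -> str:
    """Return the is_a parent for a TrEMBL entry, following cases A, B, C."""
    # Case C: no gene indicated; use the species non-specific parent if known.
    if entry.gene is None:
        return nonspecific_parent if nonspecific_parent else "protein"
    # Case A: a Swiss-Prot entry exists for the gene (variant or isoform).
    if entry.gene in swissprot_by_gene:
        return swissprot_by_gene[entry.gene]
    # Case B.1: the entry itself is the proto-canonical sequence.
    if entry.is_proto_canonical:
        return "protein"
    # Cases B.2 and B.3: variant or isoform of the proto-canonical entry.
    return proto_canonical_by_gene[entry.gene]


if __name__ == "__main__":
    variant = TrEMBLEntry("Tr_of_G", gene="G")
    print(is_a_parent(variant, {"G": "SP_of_G"}, {}))
```

Cases A.1 and A.2 share one branch because both yield the same parent, as do B.2 and B.3.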

@cmungall


commented Oct 4, 2019

I'm going to post a strawman proposal:

PRO gene-level protein classes and UniProt canonical/GCRP entries are to be considered equivalent in the strict OWL sense. Ergo, the URIs could be collapsed with no loss of logical entailment and no introduction of inconsistency. This would be a win, as the community would not have to make an arbitrary selection between two distinct PURLs/CURIEs.

Ontologically these are protein classes, which are material entity classes (as is currently the case in PRO)

(The UniProt docs talk about these as sequences, which is perfectly valid as the main use case for these involves treating them as sequences, but in the ontological treatment, the sequence would be a property of the material entity.)

They are the superclasses of isoform classes (as they are now, in PRO)

The isoform-level classes in PRO would be equivalent to the UniProt isoform entries (e.g. P12345-1)
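As a rough sketch of what the proposal so far would assert, the function below emits N-Triples lines for one accession: an `owl:equivalentClass` axiom between the gene-level PRO class and the UniProt entry, one per isoform pair, and `rdfs:subClassOf` from each isoform to the gene-level class. The PRO prefix is copied as written earlier in this thread, and the PRO isoform IRI pattern (`PR_<accession>-<n>`) is an assumption made for illustration; `strawman_axioms` is a hypothetical name.

```python
from typing import Iterable, List

OWL_EQUIVALENT_CLASS = "http://www.w3.org/2002/07/owl#equivalentClass"
RDFS_SUBCLASS_OF = "http://www.w3.org/2000/01/rdf-schema#subClassOf"

# Prefixes copied from URLs that appear in this thread.
PRO_BASE = "http://purl.obofoundry.org/obo/PR_"
UNIPROT_ENTRY_BASE = "http://purl.uniprot.org/uniprot/"
UNIPROT_ISOFORM_BASE = "http://purl.uniprot.org/isoforms/"


def _triple(s: str, p: str, o: str) -> str:
    return f"<{s}> <{p}> <{o}> ."


def strawman_axioms(accession: str, isoforms: Iterable[int]) -> List[str]:
    """Emit the equivalence and subclass axioms sketched in the proposal."""
    pro_gene = f"{PRO_BASE}{accession}"
    lines = [
        _triple(pro_gene, OWL_EQUIVALENT_CLASS, f"{UNIPROT_ENTRY_BASE}{accession}")
    ]
    for n in isoforms:
        # Assumption: PRO isoform IRIs follow the PR_<accession>-<n> pattern.
        pro_iso = f"{PRO_BASE}{accession}-{n}"
        uni_iso = f"{UNIPROT_ISOFORM_BASE}{accession}-{n}"
        lines.append(_triple(pro_iso, OWL_EQUIVALENT_CLASS, uni_iso))
        lines.append(_triple(pro_iso, RDFS_SUBCLASS_OF, pro_gene))
    return lines


if __name__ == "__main__":
    for line in strawman_axioms("P05067", [1, 2]):
        print(line)
```

If both sides accepted these axioms, a reasoner would treat either URI of an equivalent pair as interchangeable, which is the "collapse" the proposal describes.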

There could be some kind of has-canonical-form relationship between the main class and isoform-1 (see http://purl.obolibrary.org/obo/RO_0002214)

Note that at the database level, the canonical entry will have annotations for things such as protein domains, functions, etc. At the ontological level this will not be taken to mean that all instances of that protein have those properties. Otherwise we end up with logical inconsistencies. Instead, it will be read as a some-some statement: at least some instances of that protein have those properties.
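The difference between the two readings can be shown with a toy check over instances. The instance data below is invented purely for illustration (it is not real annotation data), and the function names are hypothetical.

```python
from typing import Callable, Iterable, Set

Instance = Set[str]  # a protein instance, modelled as the set of features it has


def all_some(instances: Iterable[Instance], has: Callable[[Instance], bool]) -> bool:
    """All-some reading: every instance of the class has the property."""
    return all(has(i) for i in instances)


def some_some(instances: Iterable[Instance], has: Callable[[Instance], bool]) -> bool:
    """Some-some reading: at least one instance of the class has the property."""
    return any(has(i) for i in instances)


if __name__ == "__main__":
    # Toy instances of one gene-level class: the second lacks "domain_X",
    # e.g. because it comes from an isoform without that region.
    molecules = [{"domain_X", "domain_Y"}, {"domain_Y"}]

    def has_domain_x(m: Instance) -> bool:
        return "domain_X" in m

    print(all_some(molecules, has_domain_x))   # the strict reading fails here
    print(some_some(molecules, has_domain_x))  # the weaker reading holds
```

With the all-some reading, a single isoform lacking an annotated domain contradicts the axiom; the some-some reading stays consistent.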

Note that neither resource needs to make any changes to implement this. It would be a semantic MOU about ontological commitment of PURLs. And both would agree not to publish logical axioms that introduce logical inconsistencies.

However, if both parties agree, then there is a strong case for PRO switching from PRO PURLs for gene-level classes to UniProt PURLs.

2 participants