Using the PRO-proteoform-std line instead of the definition line by default #248

nataled opened this issue Jun 5, 2021 · 1 comment



nataled commented Jun 5, 2021

  • There are 16186 cases where a term has the proteoform notation on both the definition and the PRO-proteoform-std lines.
  • There are 944 non-obsolete cases where a term has the proteoform notation on only the definition line. Of these, 917 are at the organism-neutral level; these are expected to lack the PRO-proteoform-std line. The others are organism-seqgroup (27 cases). All the organism-seqgroup cases are somewhat odd because the "Example: UniProtKB..." is not intended to serve the same purpose as PRO-proteoform-std. Rather, it just indicates one example of what would be a child term from a more-specific taxon (similar to the cases we have for S. pombe vs S. pombe 972h-).
  • There are 89 cases where a term has the proteoform notation on only the PRO-proteoform-std line. Of these, 78 are organism-seqgroup terms for some HLA subtype--these all have very complicated proteoform notations. The other 11 are cases where there are two EXACT synonyms for PRO-proteoform-std, each using a different UniProtKB identifier, so it isn't clear which to use. I could possibly figure out a way to decide that isn't completely arbitrary.
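The three buckets above could be reproduced with something like the following. This is only a minimal sketch, not the script used to get the counts: `classify_stanza` and the `NOTATION` pattern are hypothetical, and the pattern assumes the proteoform notation always looks like `UniProtKB:<accession>, <something>` inside the quoted text. Only the quoted part of the def line is checked, so the dbxref list in square brackets is ignored.

```python
import re

# ASSUMPTION: proteoform notation looks like "UniProtKB:<accession>, <something>".
NOTATION = re.compile(r"UniProtKB:[A-Z0-9]+(?:-\d+)?,\s*\S")
# Quoted text of the def line only; the trailing [dbxref] list is not captured.
DEF_TEXT = re.compile(r'^def: "((?:[^"\\]|\\.)*)"')
# Quoted text of synonyms typed PRO-proteoform-std.
STD_SYNONYM = re.compile(r'^synonym: "([^"]*)" EXACT PRO-proteoform-std\b')


def classify_stanza(lines):
    """Return 'both', 'def-only', 'std-only' or 'neither' for one term stanza."""
    in_def = False
    in_std = False
    for line in lines:
        m = DEF_TEXT.match(line)
        if m and NOTATION.search(m.group(1)):
            in_def = True
        m = STD_SYNONYM.match(line)
        if m and NOTATION.search(m.group(1)):
            in_std = True
    if in_def and in_std:
        return "both"
    if in_def:
        return "def-only"
    if in_std:
        return "std-only"
    return "neither"
```

Note that a def containing "Example: UniProtKB..." would also be counted as def-only under this pattern, so the organism-seqgroup cases would need separate handling.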

While the number of PRO-proteoform-std only lines is relatively small, all the terms like that are quite important for immunology or disease (HLA and SARS-CoV-2).

We should review the rules currently in use for obtaining the sequence information displayed in the alignment.

@nataled nataled self-assigned this Jun 5, 2021

nataled commented Mar 24, 2023

@Julie-Cowart here are the 11 cases. I think all are for SARS-CoV-2 proteins. The virus has two genomic polyproteins, each of which is processed to yield the indicated proteins. The two polyproteins have separate accessions in UniProtKB, so basically each of the cases listed can come from two accessions. Despite this, as mentioned, the sequences are identical so both are given. After the list is a sample stanza. Interestingly, there is no PRO-proteoform-std-like sentence in the def line (probably because it was unclear how to do it).


id: PR:000050270
name: host translation inhibitor nsp1 (SARS-CoV-2)
def: "A rep gene proteolytic cleavage product (SARS-CoV-2) that is the N-terminal portion produced by the enzymatic action of PL-PRO proteinase (nsp3) when it cleaves the precursor rep gene translation product (SARS-CoV-2) between residues 180-181." [PRO:DAN, GenPept:YP_009725297, UniProtKB:P0DTC1, UniProtKB:P0DTD1]
comment: Category=organism-modification. Note: 180 aa protein. Proteinase inferred from PMID:15564471.
synonym: "rep/Clv:nsp1 (SARS2)" EXACT PRO-short-label [PRO:DAN]
synonym: "leader protein (SARS-CoV-2)" EXACT [UniProtKB:P0DTC1, UniProtKB:P0DTD1]
synonym: "non-structural protein 1 (SARS-CoV-2)" EXACT [UniProtKB:P0DTC1, UniProtKB:P0DTD1]
synonym: "nsp1 (SARS-CoV-2)" EXACT [UniProtKB:P0DTC1, UniProtKB:P0DTD1]
synonym: "PRO_0000449619" EXACT PRO-proteoform-ftid [PRO:DAN]
synonym: "PRO_0000449635" EXACT PRO-proteoform-ftid [PRO:DAN]
synonym: "UniProtKB:P0DTC1, 1-180" EXACT PRO-proteoform-std [PRO:DAN]
synonym: "UniProtKB:P0DTD1, 1-180" EXACT PRO-proteoform-std [PRO:DAN]
xref: GenPept:YP_009725297
xref: GenPept:YP_009742608
is_a: PR:000050281 ! rep gene proteolytic cleavage product (SARS-CoV-2)
relationship: only_in_taxon NCBITaxon:2697049 ! Severe acute respiratory syndrome coronavirus 2
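
Stanzas like the one above, with more than one PRO-proteoform-std synonym, could be flagged with something along these lines. Again just a sketch under assumptions: `group_std_synonyms` and `is_ambiguous` are hypothetical names, and the pattern assumes each synonym has the form `UniProtKB:<accession>, <rest>`. It only detects the ambiguity and groups accessions by the shared remainder of the notation; it does not decide which accession to use.

```python
import re
from collections import defaultdict

# ASSUMPTION: each PRO-proteoform-std synonym reads "UniProtKB:<accession>, <rest>".
STD = re.compile(
    r'^synonym: "UniProtKB:([A-Z0-9]+(?:-\d+)?), ([^"]*)" EXACT PRO-proteoform-std\b'
)


def group_std_synonyms(lines):
    """Map the notation after the accession (e.g. '1-180') to its accessions."""
    groups = defaultdict(list)
    for line in lines:
        m = STD.match(line)
        if m:
            accession, remainder = m.groups()
            groups[remainder].append(accession)
    return dict(groups)


def is_ambiguous(lines):
    """True when a stanza has more than one PRO-proteoform-std synonym."""
    return sum(len(accs) for accs in group_std_synonyms(lines).values()) > 1
```

For the nsp1 stanza this groups both accessions under the same remainder, `{'1-180': ['P0DTC1', 'P0DTD1']}`, which matches the situation described: two accessions, identical notation otherwise.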
