S. pombe protein complex terms (many) #106

Open
nataled opened this issue Mar 25, 2015 · 16 comments

@nataled
Collaborator

commented Mar 25, 2015

Hello,

For GO annotations and network representation (we're using esyN - www.esyn.org), we would find it very useful to have a set of PRO entries for S. pombe complexes. We maintain a list of GO cellular component complex terms and annotated genes that we hope is a good starting point.

May we have PRO terms/ids/etc. for the pombe versions of the complexes in this list (one per GO term)? For us, the ones with PMID references in the "source" column are higher priority than the rest.

ftp://ftp.ebi.ac.uk/pub/databases/pombase/pombe/Complexes/Complex_annotation

(with explanation ftp://ftp.ebi.ac.uk/pub/databases/pombase/pombe/Complexes/README)

If you need us to attach or email a copy of the file, or if you have any problems or questions, please let us know.

Thanks!
Midori (and the rest of the PomBase curators)

Reported by: mah11

@nataled

Collaborator Author

commented Mar 25, 2015

Hi Midori,

This can be done, even automated, but I see several questionable cases that lead me to think the list isn't complete in terms of complex components, and this is not even counting the lack of cardinality or modification information. For example, how likely is it that GO:0071014 "post-mRNA release spliceosomal complex" contains only a single type of protein? I see other complexes with similar issues. Technically speaking, we don't need to even indicate the subunits, and don't even need to indicate all the components, so the terms can be made, but probably you want more than just "protein complex X (S. pombe)" and probably want to avoid the misleading look of failing to indicate all.

Please let me know how you'd like to proceed.

Darren

Original comment by: nataled

@nataled

Collaborator Author

commented Mar 25, 2015

Hi Darren,

Thanks for commenting so promptly on this request, and for the excellent questions. I'll have to ask Val to address your concerns, because the complex inventory file is something she's been curating sort-of-manually from PomBase's GO cellular component annotation set. I therefore don't have nearly as good a sense as she would of how complete the inventory is, and in particular, of the balance between incompleteness that reflects incomplete curation and incompleteness that reflects incomplete knowledge available for us to curate.

We may be able to provide a shorter list of complexes that are more nearly completely characterized, and that we most want to see represented in PRO (for example, I have a GO annotation I could hang on an S. pombe RFC complex ID from a paper I was just reading an hour ago).

Midori

Original comment by: mah11

@nataled

Collaborator Author

commented Mar 25, 2015

Sorry, we should have only sent you the experimental ones for starters, that is, anything in the file with a "PMID". So ignore anything with PomBase GO_REF:0000002/IEA or GO_REF:0000024/ISO.

All of the ISO data are manually curated, and I usually try to identify every subunit in a complex when I do ISO from SGD. However, for some things (like spliceosomal subcomplexes) I have not done this thoroughly if there is a splicing complex grouping term. I will annotate the other subunits of the 'spliceosomal disassembly complex' tomorrow.

I also just noticed that the final 2 column headers are incorrect:
xref_dbname source
should be
source xref_dbname
I thought that this was corrected, so we need to check that the file was correctly updated. I am pretty certain that it wasn't, as the new version of the file should have "|" separated PMIDs if there are multiple papers, and I don't detect any pipes in the file on our ftp site....

Apologies....

Val

Original comment by: ValWood

@nataled

Collaborator Author

commented Mar 25, 2015

The version of the file is correct; the PMIDs are comma separated, not pipe separated,
e.g. PMID:16079914,PMID:16079914
It is just the headers that are incorrect. I will get this fixed.

Val

Original comment by: ValWood
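
The comma-separated reference field described in the comment above could be read along these lines. This is only a minimal sketch: the function names `split_source` and `pmids` are hypothetical, and it assumes nothing about the file beyond references being joined with commas.

```python
def split_source(field):
    """Split a comma-separated source field into unique references, keeping order."""
    seen = []
    for ref in field.split(","):
        ref = ref.strip()
        if ref and ref not in seen:
            seen.append(ref)
    return seen


def pmids(field):
    """Keep only the references that look like PubMed IDs."""
    return [ref for ref in split_source(field) if ref.startswith("PMID:")]


# The example value quoted in the comment repeats the same PMID twice.
print(pmids("PMID:16079914,PMID:16079914"))  # ['PMID:16079914']
```

Collapsing repeats while keeping first-seen order means the cleaned field can be written back out without reshuffling references.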

@nataled

Collaborator Author

commented Mar 25, 2015

I'm pretty sure the file headers are correct; they're just a bit cryptic. "Source" means the reference.

Original comment by: mah11

@nataled

Collaborator Author

commented Mar 26, 2015

You are correct. Will clarify the column headers.
Sorry for the confusion.
Val

Original comment by: ValWood

@nataled

Collaborator Author

commented Mar 26, 2015

Val, I'm not sure I understood your message about which to ignore. On the one hand you say to ignore GO_REF:0000024/ISO, but then you say you manually verified these (by the way, I also see GO_REF:0000024 associated with ISM). If they are verified, should they not be included?

Another question: how should I handle cases where, say, one component is IEA but all others have "good" codes? For example, hcr1 in GO:0070993 is IEA while others are experimental (note: I'm ignoring the IEA part of those that have multiple codes; not sure why there are things like 'IEA,IEA').

Original comment by: nataled

@nataled

Collaborator Author

commented Mar 26, 2015

Some numbers: There are 446 GO complexes listed. Of these, if we ignore those that have any IEA or ISO component, we lose about one-fourth. If we ignore those that do not have a PMID for all components, we lose half.

Original comment by: nataled
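
Counts like the ones in the comment above could be obtained with something like the sketch below. Everything in it is an assumption for illustration: the row layout `(go_id, gene, evidence, source)`, the function names, and the toy rows are invented and are not taken from the Complex_annotation file. A member counts as "inferred only" when every evidence code it carries is IEA or ISO, which is one possible reading of the earlier note about ignoring the IEA part of entries with multiple codes.

```python
from collections import defaultdict


def group_by_complex(rows):
    """Group (go_id, gene, evidence, source) rows by GO complex ID."""
    complexes = defaultdict(list)
    for go_id, gene, evidence, source in rows:
        codes = {c.strip() for c in evidence.split(",") if c.strip()}
        refs = {r.strip() for r in source.split(",") if r.strip()}
        complexes[go_id].append((gene, codes, refs))
    return complexes


def no_inferred_only(members, excluded=frozenset({"IEA", "ISO"})):
    """True if no member is supported solely by excluded codes.

    A member with no codes at all is also treated as inferred only.
    """
    return all(not codes <= excluded for _, codes, _ in members)


def all_have_pmid(members):
    """True if every member has at least one PMID reference."""
    return all(any(r.startswith("PMID:") for r in refs) for _, _, refs in members)


def count_kept(rows, keep):
    """Count the complexes whose member list satisfies the keep predicate."""
    return sum(1 for members in group_by_complex(rows).values() if keep(members))


# Invented toy rows, not real annotations.
rows = [
    ("complex_A", "gene1", "IDA", "PMID:1"),
    ("complex_A", "gene2", "IEA", "GO_REF:0000002"),
    ("complex_B", "gene3", "IDA,IEA", "PMID:2"),
    ("complex_B", "gene4", "IPI", "PMID:3"),
    ("complex_C", "gene5", "ISO", "GO_REF:0000024"),
    ("complex_D", "gene6", "ISO", "PMID:4"),
]

print(count_kept(rows, no_inferred_only), count_kept(rows, all_have_pmid))  # 1 2
```

Passing the filter in as a predicate makes it easy to compare how many complexes survive under each rule before settling on one.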

@nataled

Collaborator Author

commented Mar 26, 2015

Hi Darren,

Sorry I wasn't clear. I hope the following clears up some of the confusion.

  1. I had spotted the duplicated evidence codes and have already reported this.

  2. All of the ISO/ISM annotations are manually curated, but inferred from sequence similarity. There is likely no experimental data for these as yet, but so far, for complexes in S. cerevisiae where the members are conserved 1:1 in pombe, the complexes have had identical composition.

  3. For some EXP described complexes we also have cardinality data, but we have not exported this. We can make this available to you too.

  4. There are only 112 IEAs. We will try to resolve these over the next couple of months by manually annotating the ones which are split between experimental and IEA codes, and suppressing some which are to 'generic' grouping complex or component terms.

  5. What type of modification data do you include in the complex entries?

There is no hurry for this; we just wanted to get this in motion so we could make annotations to complexes and create complex pages in PomBase.

We can clean up this file over the next couple of months and take it from there.

Best

Val

Original comment by: ValWood

@nataled

Collaborator Author

commented Mar 26, 2015

Cross-posted; my comments should address this one too.

Original comment by: ValWood

@nataled

Collaborator Author

commented Mar 26, 2015

Some comments on your list of comments:

  1. The duplicate codes don't bother me; already wrote a script to clean them.

  2. IMO anything manually verified is good to go. In PRO we do have complexes that are inferred by comparison with those in other organisms.

  3. Cardinality would be excellent!

  4. Not sure what you mean by "generic grouping complex or component terms." Of the 112 IEAs, only 33 are associated with complexes with otherwise 'better' evidences.

  5. We can include all kinds of modifications. For example, you might want to specify that a particular complex contains a phosphorylated form of a protein. See PR:000037300 for an example. It would be no problem to make the complexes first, then change the components to something more specific later, if you'd like.

Consider it in motion! I won't make a further move on it until I get the word from you. My preference is to do all the eligible ones at once rather than something like "only those with PMIDs first, then ISOs later." However, if the need for a specific complex (or limited set of them) arises before the bulk are ready, we'll make them right away.

Original comment by: nataled
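
The cleanup script mentioned in point 1 above is not shown in the thread. A minimal sketch of the same idea follows, together with a way to pick out IEA-only members of complexes that otherwise have better evidence, as counted in point 4. The row layout `(go_id, gene, evidence)`, the function names, and the toy rows are all hypothetical.

```python
from collections import defaultdict


def parse_codes(field):
    """Return the unique evidence codes in a comma-separated field, in first-seen order."""
    return list(dict.fromkeys(c.strip() for c in field.split(",") if c.strip()))


def clean_codes(field):
    """Collapse repeated codes, e.g. 'IEA,IEA' becomes 'IEA'."""
    return ",".join(parse_codes(field))


def iea_among_better(rows):
    """List (go_id, gene) pairs that are IEA-only in a complex where
    at least one other member has some non-IEA evidence."""
    by_complex = defaultdict(list)
    for go_id, gene, evidence in rows:
        by_complex[go_id].append((gene, parse_codes(evidence)))
    hits = []
    for go_id, members in by_complex.items():
        has_better = any(codes and codes != ["IEA"] for _, codes in members)
        if has_better:
            hits.extend((go_id, gene) for gene, codes in members if codes == ["IEA"])
    return hits


# Invented toy rows, not real annotations.
rows = [
    ("complex_A", "gene1", "IDA"),
    ("complex_A", "gene2", "IEA,IEA"),
    ("complex_B", "gene3", "IEA"),
]

print(clean_codes("IEA,IEA"))   # IEA
print(iea_among_better(rows))   # [('complex_A', 'gene2')]
```

In this toy data, complex_B is not reported because it has no member with better evidence; such complexes would need a separate decision.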

@nataled

Collaborator Author

commented Mar 26, 2015

Re 4)

Too general / not a ‘specific complex’ / not sure that they are a complex in pombe / or will be replaced by a more specific annotation in the term set below; will filter:
GO:0000015 phosphopyruvate hydratase complex
GO:0000148 1,3-beta-D-glucan synthase complex
GO:0000159 protein phosphatase type 2A complex
GO:0000786 nucleosome
GO:0002178 palmitoyltransferase complex
GO:0005891 voltage-gated calcium channel complex
GO:0005952 cAMP-dependent protein kinase complex
GO:0030118 clathrin coat
GO:0030119 AP-type membrane coat adaptor complex
GO:0030130 clathrin coat of trans-Golgi network vesicle
GO:0030131 clathrin adaptor complex
GO:0031515 tRNA (m1A) methyltransferase complex
GO:0032300 mismatch repair complex
GO:0032301 MutSalpha complex
GO:0032302 MutSbeta complex
GO:0033573 high-affinity iron permease complex
GO:0034703 cation channel complex
GO:0034704 calcium channel complex
GO:0034707 chloride channel complex
GO:0042765 GPI-anchor transamidase complex
GO:0043527 tRNA methyltransferase complex
GO:0071010 prespliceosome
GO:1902562 H4 histone acetyltransferase complex
GO:0097346 INO80-type complex
GO:0043189 H4/H2A histone acetyltransferase complex
GO:0031332 RNAi effector complex

Will manually annotate (more specifically in some cases), or remove:
GO:0000930 gamma-tubulin complex
GO:0000932 cytoplasmic mRNA processing body
GO:0000346 transcription export complex
GO:0000347 THO complex
GO:0000444 MIS12/MIND type complex
GO:0000930 gamma-tubulin complex
GO:0000932 cytoplasmic mRNA processing body
GO:0005643 nuclear pore
GO:0005663 DNA replication factor C complex
GO:0005665 DNA-directed RNA polymerase II, core complex
GO:0005680 anaphase-promoting complex
GO:0005684 U2-type spliceosomal complex
GO:0005760 gamma DNA polymerase complex
GO:0005852 eukaryotic translation initiation factor 3 complex
GO:0005960 glycine cleavage complex
GO:0008180 COP9 signalosome
GO:0008280 cohesin core heterodimer
GO:0008622 epsilon DNA polymerase complex
GO:0016282 eukaryotic 43S preinitiation complex
GO:0016442 RISC complex
GO:0016591 DNA-directed RNA polymerase II, holoenzyme
GO:0016602 CCAAT-binding factor complex
GO:0022627 cytosolic small ribosomal subunit
GO:0030119 AP-type membrane coat adaptor complex
GO:0030688 preribosome, small subunit precursor
GO:0030870 Mre11 complex
GO:0031011 Ino80 complex
GO:0031515 tRNA (m1A) methyltransferase complex
GO:0032040 small-subunit processome
GO:0033290 eukaryotic 48S preinitiation complex
GO:0035267 NuA4 histone acetyltransferase complex
GO:0043564 Ku70:Ku80 complex
GO:0043599 nuclear DNA replication factor C complex
GO:0070390 transcription export complex 2
GO:0070993 translation preinitiation complex
GO:0071004 U2-type prespliceosome
GO:1990077 primosome complex
GO:0000812 Swr1 complex
GO:0042575 DNA polymerase complex

Unsure, but one of the above will happen with these:
GO:0009316 3-isopropylmalate dehydratase complex
GO:0009331 glycerol-3-phosphate dehydrogenase complex
GO:0009349 riboflavin synthase complex
GO:0032777 Piccolo NuA4 histone acetyltransferase complex
GO:0032797 SMN complex
GO:0035339 SPOTS complex (I don’t know what this is)
GO:0097361 CIA complex

Original comment by: ValWood

@nataled

Collaborator Author

commented Mar 26, 2015

Re 1) The duplicate codes don't bother me; already wrote a script to clean them.

This should now be fixed in our next release.

Re 3) Cardinality

What we have is along these lines:
heteromeric(2) Thakurta AG et al. (2004)
So we do not say explicitly which units these apply to, but the info is there, and at least we know which publication it is in. We will extend this so we also know which complex it applies to if there are multiple complexes. A small oversight!

Re 5) We will arrange for the modification data to be exported, this has been on the to do list for a while. It will be in this format:
http://www.pombase.org/submit-data/modification-bulk-upload-file-format

Give us a while to tidy the IEAs and we will send you a new version of the file.

Thanks for your speed and attention, as always!

Val

Original comment by: ValWood
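
A cardinality string of the form quoted above could be split into its parts as in the sketch below. The pattern is a guess based on that single example; the field names `kind`, `count`, and `reference` are hypothetical, and reading the number in parentheses as a subunit count is an assumption.

```python
import re

# Assumed layout: <kind>(<count>) <free-text reference>
CARDINALITY = re.compile(r"^(?P<kind>\w+)\((?P<count>\d+)\)\s+(?P<reference>.+)$")


def parse_cardinality(text):
    """Parse a cardinality string; return None if it does not fit the assumed layout."""
    match = CARDINALITY.match(text.strip())
    if match is None:
        return None
    return {
        "kind": match["kind"],
        "count": int(match["count"]),
        "reference": match["reference"],
    }


print(parse_cardinality("heteromeric(2) Thakurta AG et al. (2004)"))
# {'kind': 'heteromeric', 'count': 2, 'reference': 'Thakurta AG et al. (2004)'}
```

Returning `None` for strings that do not fit keeps unexpected formats visible rather than silently misread.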

@nataled

Collaborator Author

commented Mar 22, 2018

  • status: open --> pending
  • assigned_to: Darren Natale
  • Group: -->

Original comment by: nataled

@nataled

Collaborator Author

commented Mar 22, 2018

Going through old requests and closing those that were finished long ago or marking as "pending" those that await input from the requester. If your request is marked Pending, please advise as to whether the request has been satisfactorily addressed or is no longer needed.

Original comment by: nataled

@nataled

Collaborator Author

commented Mar 23, 2018

I think we would still like to have the requested terms eventually, but it isn't urgent for us. "Pending" status fits our situation just fine.

Original comment by: mah11

@nataled nataled self-assigned this May 23, 2019

@nataled nataled added Pending and removed sourceforge labels May 23, 2019
