Refine and finalize metadata and CV terms #7

ypriverol · 2018-04-19T20:40:48Z

The current MSP and other spectral library formats capture only the metadata around each entry in the library (clusters, consensus spectra, peptides, small molecules), not the way the spectral library was generated. We need to define a general metadata section at the beginning of the file to hold this metadata. Similar to mzTab, I think it would be great to have something like the example below.

The MTD prefix tells readers that the line is a metadata field. The second column is the key of the metadata attribute and the third is the value of the metadata field.

The following fields can be reused from mzTab:

MTD   mzL-version	1.0.0      
MTD   title  Spectral Library Human from Peptide Atlas 
MTD   id     PXL00000001 
MTD   description Some description that can be used for example in the web about the library
MTD   instrument [MS, MS:1000703, LTQ Orbitrap,]
MTD   instrument [MS, MS:1000008, Velos Orbitrap,]
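
A minimal sketch of how a reader could collect such lines, assuming whitespace-separated columns and repeatable keys as in the example above. The function name `parse_mtd_lines` is hypothetical and not part of any agreed specification.

```python
from collections import defaultdict


def parse_mtd_lines(text):
    """Collect MTD lines into a dict mapping each key to a list of values."""
    metadata = defaultdict(list)
    for line in text.splitlines():
        # Split into at most three parts: prefix, key, and the rest as the value.
        parts = line.split(None, 2)
        if len(parts) == 3 and parts[0] == "MTD":
            metadata[parts[1]].append(parts[2].strip())
    return dict(metadata)


example = """\
MTD   mzL-version  1.0.0
MTD   title  Spectral Library Human from Peptide Atlas
MTD   instrument [MS, MS:1000703, LTQ Orbitrap,]
MTD   instrument [MS, MS:1000008, Velos Orbitrap,]
"""

metadata = parse_mtd_lines(example)
print(metadata["title"])
print(len(metadata["instrument"]))
```

Keeping a list per key is one way to allow repeated attributes such as `instrument` without inventing numbered keys.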

Can we add to this issue all the fields we think are interesting or important to trace?


ypriverol · 2018-04-19T20:53:12Z

My list of attributes to define at the level of the library. I will add an * where I think the attribute is mandatory.

title*: Title to be used by visualization components.
description*: To be used by visualization components and for search purposes.
id*: Unique identifier across resources, for example across ProteomeXchange.
instruments: A list of instruments included in the library. We can RECOMMEND this as optional, to be provided only if the library is filtered by instrument. I think PeptideAtlas releases instrument-based libraries, but PRIDE does not.
modifications: A list of modifications included in the library. As with the instrument list, we RECOMMEND adding the modifications only if the library is dedicated to, or filtered by, modified peptides. One or more peptide modifications can be listed.
Fragmentation mode: Again, a list of fragmentation modes if the library has been built for a specific type.
contacts: Contact names for the people or resources that developed the library.
Comments: General comments about the library, similar to mzTab and other file formats.
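
As an illustration of the mandatory/optional split proposed above, here is a small sketch that reports which starred attributes are missing from a metadata dictionary. The names `MANDATORY_KEYS` and `missing_mandatory` are hypothetical, and the set of mandatory keys simply mirrors the starred items in this comment, not a finalized rule.

```python
# Keys marked with * in the list above; an assumption, not a finalized rule.
MANDATORY_KEYS = ("title", "description", "id")


def missing_mandatory(metadata):
    """Return the mandatory keys that are absent or empty in a metadata dict."""
    return [key for key in MANDATORY_KEYS if not metadata.get(key)]


library = {
    "title": "Spectral Library Human from Peptide Atlas",
    "id": "PXL00000001",
}
print(missing_mandatory(library))  # the description is missing here
```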

schymane · 2018-04-20T10:13:36Z

I'm not sure this is the right place for this comment but note NIST have the MSP but also the SDF format that stores their spectral information (I do not see this mentioned here yet) - although MSP seems to be the more common exchange one, the SDF has the advantage that the full structure AND the spectrum can be in it ... and this paper is a great example of using SDF to do NMR exchange:
https://onlinelibrary.wiley.com/doi/abs/10.1002/mrc.4737

rsalek · 2018-04-20T10:20:56Z

It might be better to keep SDF (structure) separate, as the format should cover proteomics, metabolomics, and potentially other MS-based applications.

schymane · 2018-04-20T10:23:38Z

See parallel conversation for similar comments! MassBank/MassBank-web#110
I am missing background to the discussions here for sure; have plenty of thoughts for small molecule side but not much idea of how proteomics handles this.

mwang87 · 2018-04-21T07:03:36Z

I'm not sure if this is addressed in the MassBank format or other formats, but one thing we try to track on the GNPS side is the provenance: the filename and scan number of where the reference spectrum came from. Though it's not perfect in the record, and maybe it is more appropriate for it to be tracked externally (which is done at GNPS), with those records referenced through an accession number.

ypriverol · 2018-04-21T07:14:31Z

@mwang87 this issue is to capture what we are planning to trace for the complete spectral library, not for the individual spectra. I have created another issue for the individual spectra and clusters: #9

henryhlam · 2018-04-22T07:36:35Z

Let's make the largest list possible first (with each field marked as required/optional) and we can whittle it down. Maybe it is easier to edit a Google Doc together?

My quick thoughts below.

At library level, we need:

Format version (e.g. mzl 1.0)

A universal library identifier (similar to the universal spectrum identifier), e.g. mzlib:PXL0000100:NIST_cow_2018

Publisher/source, including Contact (e.g. NIST)

Publishing date (or library version or serial number)

Library name/descriptor

Software generating the library and version

Organisms (do we need this? If yes, we have to allow multiple organisms, or none at all)

All Modifications (do we need this at the library level? probably only necessary to define special mods not already in PSI-MOD or UNIMOD, or to shorten the tags in each library entry by defining them here)

Instrumentation/Fragmentation (similar. We need this at spectrum level anyway, as many libraries contain mixture of spectra from different instruments. Do we need it here?)

Comments

Provenance
(From my experience, it is often necessary to modify/merge/filter libraries to create new custom ones. It would be nice to have a place to keep track of what has been done to the library, e.g. "This library was created from the NIST 2014 one by filtering for all tryptic peptides and merging it with a decoy library...". This can be put into the Comments field, but it may be useful to have a separate "Provenance" field.)
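
One way to picture the library-level fields listed above, including a separate provenance record, is the following sketch. The class name, the field names, and the choice of which fields are required are all assumptions made for illustration; nothing here is an agreed schema.

```python
from dataclasses import dataclass, field, replace
from typing import List, Optional


@dataclass
class LibraryMetadata:
    # Treated as required in this sketch only.
    format_version: str
    library_id: str
    name: str
    # Optional fields; organisms may legitimately be empty.
    publisher: Optional[str] = None
    publishing_date: Optional[str] = None
    software: Optional[str] = None
    organisms: List[str] = field(default_factory=list)
    comments: List[str] = field(default_factory=list)
    # Ordered list of processing steps applied to reach this library.
    provenance: List[str] = field(default_factory=list)

    def derive(self, new_id, step):
        """Return metadata for a derived library with one more provenance step."""
        return replace(self, library_id=new_id, provenance=self.provenance + [step])


base = LibraryMetadata("mzl 1.0", "mzlib:PXL0000100:NIST_cow_2018", "NIST cow")
custom = base.derive("my_custom_library", "filtered for tryptic peptides")
print(custom.provenance)
```

Recording provenance as an ordered list rather than free text in Comments keeps each modify/merge/filter step separately addressable.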

For spectrum level, it is a lot more complicated. Maybe we should have a separate thread/doc for this.

schymane · 2018-04-22T08:08:18Z

Re: organisms - it should be designed flexibly to allow extra metadata, but not be too biologically focused, for instance. There are a lot of people who use spectral libraries who do not have any organism context. The MassBank requirement for a "natural / not natural" tag has caused many headaches for us environmental people because we never have the context (caffeine, e.g., is a natural product, but for us it is a chemical found in the environment) and such classifications are extremely hard to auto-classify from the wrong context... (i.e. please do not force people to provide information they may not have, so that they fill in "something" that is likely incorrect just to fill a field).

ypriverol · 2018-04-22T08:24:35Z

The google document is this one:
https://docs.google.com/document/d/1LgSGtR_t5IcUS9rV7YtsLveDVX9X8KsOpU-5NJ4vuYI/edit?usp=sharing

This issue is to discuss the metadata at the level of the library; we have another issue, #9, to discuss the metadata for the individual spectra.

schymane · 2018-04-22T20:29:14Z

My comment stands for both the individual spectrum and library level ... many spectral libraries will likewise not come from an organism ... although some may and in this case it would be valuable information to be captured. A more generic description may be more flexible?
I added some comments to the doc.

ypriverol · 2018-04-23T06:37:52Z

@schymane The idea of the organisms, instruments, and modifications in the library metadata is for dedicated libraries, where for example the library has been created/filtered for those properties. If that is not the case, then those properties should be captured at the spectrum level, because the number of species, instruments, and especially modifications in one library can be huge.

ypriverol · 2018-04-23T20:17:22Z

@henryhlam @edeutsch @sneumann @schymane I have updated the document with the new fields provided by you guys that are needed to capture at the level of the library. Please have a look here https://docs.google.com/document/d/1LgSGtR_t5IcUS9rV7YtsLveDVX9X8KsOpU-5NJ4vuYI/edit

vrkosk · 2018-04-24T09:08:48Z

When a spectral library is searched, there are two fragment tolerances to consider: the user specifies the tolerance of the input data, and something must specify the tolerance of peaks in the library spectra. These two tolerances could easily differ (e.g. search 10ppm data against a 0.5Da library). It would be nice if fragment tolerance were a file-level attribute.
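
To make the two-tolerance situation concrete, the sketch below converts a ppm tolerance to Da at a given m/z and compares a query peak with a library peak. The function names are hypothetical, and using the wider of the two windows is only one possible matching policy, chosen here for illustration.

```python
def tolerance_in_da(mz, value, unit):
    """Express a tolerance as an absolute window in Da at the given m/z."""
    if unit == "ppm":
        return mz * value / 1e6
    if unit == "Da":
        return value
    raise ValueError(f"unknown tolerance unit: {unit}")


def peaks_match(query_mz, library_mz, query_tol, library_tol):
    """Match if the m/z difference is within the wider of the two tolerances.

    Each tolerance is a (value, unit) tuple, e.g. (10, "ppm") or (0.5, "Da").
    """
    window = max(
        tolerance_in_da(query_mz, *query_tol),
        tolerance_in_da(library_mz, *library_tol),
    )
    return abs(query_mz - library_mz) <= window


# 10 ppm at m/z 500 is 0.005 Da, far tighter than a 0.5 Da library tolerance.
print(tolerance_in_da(500.0, 10, "ppm"))
print(peaks_match(500.0, 500.3, (10, "ppm"), (0.5, "Da")))
```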

Also, it's very valuable to allow library entries to override file-level attributes. This allows specifying defaults (e.g. default instrument or default organism). It reduces metadata clutter and still lets you mix entries from different sources in the same file.
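
The override behaviour described here can be sketched with a layered lookup, where an entry's own attributes take precedence over file-level defaults. The attribute names and values are made up for the example, and `effective_attributes` is a hypothetical helper.

```python
from collections import ChainMap


def effective_attributes(entry_attributes, file_defaults):
    """Look up attributes on the entry first, then fall back to file defaults."""
    return ChainMap(entry_attributes, file_defaults)


file_defaults = {"instrument": "LTQ Orbitrap", "organism": "Homo sapiens"}
entry = {"organism": "Bos taurus"}  # this entry overrides only the organism

attrs = effective_attributes(entry, file_defaults)
print(attrs["organism"])    # taken from the entry
print(attrs["instrument"])  # inherited from the file-level defaults
```

With this layout, entries from different sources can share one file while only stating what differs from the defaults.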

edeutsch · 2018-04-24T18:16:58Z

yes indeed. I put in precursor mass accuracy and fragment mass accuracy as desirable attributes in the library. Mass tolerance strikes me more as a software parameter and a user preference than an inherent property of a library or a spectrum. But I can see other opinions, too.

edeutsch · 2018-04-24T18:20:31Z

Hi everyone, I have updated the document to reflect notes that I have been taking and the overall direction of the document, which is metadata at all levels, not just the spectrum level or library level. I defined FOUR levels of metadata:

collection (library) level
spectrum level
peak level
peak annotation level (since one peak may have several possible annotations)

The spectrum level is somewhat further divided into merged spectra, individual spectra, and common to both merged and individual spectra.
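
The four levels could be pictured as nested containers, each with its own attribute dictionary. This is only a sketch of the nesting; the class names, field names, and the example annotations are assumptions and do not come from the document.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class PeakAnnotation:
    label: str
    attributes: Dict[str, str] = field(default_factory=dict)


@dataclass
class Peak:
    mz: float
    intensity: float
    # One peak may carry several candidate annotations.
    annotations: List[PeakAnnotation] = field(default_factory=list)
    attributes: Dict[str, str] = field(default_factory=dict)


@dataclass
class Spectrum:
    attributes: Dict[str, str] = field(default_factory=dict)
    peaks: List[Peak] = field(default_factory=list)


@dataclass
class Library:
    attributes: Dict[str, str] = field(default_factory=dict)
    spectra: List[Spectrum] = field(default_factory=list)


peak = Peak(375.2, 1200.0, [PeakAnnotation("y3"), PeakAnnotation("b4-H2O")])
library = Library({"title": "demo"}, [Spectrum({"charge": "2"}, [peak])])
print(len(library.spectra[0].peaks[0].annotations))
```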

I reorganized the document a little. I hope I didn't mess up anything in anyone's view. Have a look and see what you think.

vrkosk · 2018-04-25T13:47:27Z

I'm glad to see chimeric spectra taken into account. What about intact crosslinked peptides, e.g. disulfide bonds, or looplinked or cyclic peptides?

Protein-level data is inherently problematic in a peptide-centric spectral library. Example problems: there could be no parent protein (e.g. de novo identification); accession format must be restricted (if it's not restricted, it's a free text field); peptide sequence could appear in several locations in a protein; peptide is usually found in more than one protein; consensus spectra could have conflicting parent proteins; accession formats could differ between library entries and between libraries (and will differ if you compare or merge results with a database search). Shouldn't you also record the FASTA name and version if it was a database search? What about the protein description?

I realise this may be an unpopular view, but how about prohibiting protein-level metadata? Or at least move them out of peptide metadata and into the experiment-level metadata. Protein attributes help explain how the peptide was identified rather than being an inherent part of a peptide identification.

jgriss · 2018-04-25T19:27:40Z

I second @vrkosk's view that the protein-level information should not be added directly. It is to be expected that spectral libraries will be merged quite heavily, and protein-level information will then definitely cause issues. If we do not add the protein-level information from the start, users and search engines will expect that they need to supply a FASTA database. In my opinion, this is the cleanest solution.

ypriverol · 2018-04-25T20:05:27Z

I also agree. I guess for biological entities all information should remain at the peptide level, without protein information.

edeutsch · 2018-04-25T20:13:33Z

I agree that we should make sure that cross-linked peptides are supported. I think it is mostly already there with multiple simultaneous identifications already supported. But I suppose we need a flag to distinguish cases when the multiple peptides are chimeric vs. cross-linked. I will add that.

Regarding the encoding of proteins, I would disagree with the prevailing thought that we should prohibit protein information. I certainly agree that it should not be required, and I agree it could get a bit complex. But I suspect some people would like to encode that information and it seems to me that providing a standardized optional way of doing that is a better choice than attempting prohibition.

henryhlam · 2018-04-26T02:26:05Z

All,

My idealistic and pedantic side would agree with all of you that protein information should not be included. I also struggled with these inconsistency issues when I developed SpectraST. However, I am with Eric in that keeping the protein information as an optional field is the way to go.

In designing a format for everyone to use, we should value continuity and practical utility of the format over semantic purity. Many users of these formats are less into these issues and just want a format that serves their needs. After all, that's why we decided at last year's PSI to see how we can evolve an existing format into something better, rather than tearing it up and starting from philosophical principles about what a library should be. The fear is that if we define a "perfect" format that is too far from the existing ones, no one will want to rewrite all their existing code just to fit the new format.

So I would advocate a more flexible format with many optional fields, which can accommodate most use cases and give all existing tools an easier time switching over. This means we want it to capture most of the useful features of the existing formats, and not dismiss them so easily. For instance, I am sure NIST puts all those hard-to-decipher fields in there for a reason: they are there to support some functionality in their tools. If we tell them, sorry, you can't have them any more because they are not supposed to be there, they will just not use our format, or they will find all kinds of back-door ways to stuff the information back in. That's not what we want to see.

Back to the specific point of the protein field. The argument for having a protein field is convenience and efficiency. Efficiency is important! Typically, users who search a peptide spectral library will want to know what proteins their IDs map to. If the search step is followed by another tool that does the peptide-protein mapping, then all is well. (This step would require the user to supply a FASTA file.) But sometimes it is not. Remember that a sequence search engine will naturally provide that protein information, and that's the benchmark that spectral library engines are held up to. From my point of view as the developer of SpectraST, I cannot really tell users that no, a library search is not supposed to tell you that, you need to install another tool. So practically speaking, the spectral search engine will need to do the mapping post-search every time a search is done, not to mention the awkwardness of asking the user to always specify a FASTA file to accompany the library.

The other use is for filtering. Often a user would want to filter his/her library by protein(s). If a protein field is present, then it is a simple thing. If not, then again the user has to look up the protein sequence, get all the possible peptide sequences of that protein, and then do a search by peptide.

By the way, SpectraST already has a function to re-map all library entries to proteins, based on a given FASTA file. If a user downloads a library but would like to use their own set of protein identifiers, it can do the re-mapping. It can be used to fix errors in the mapping, or to update to a new FASTA file. But if you don't allow me to store the protein somewhere in the library file, then I have to do this mapping every time a search is done!

The reality is that most peptides map to a small number of proteins, and the mapping is quite stable. We are here to deal with 95% of the cases, not the 5%. As long as the field is not mandatory, and we allow multiple proteins, it will serve all purposes and not break anything. Ultimately we have to trust the tools to use these fields wisely. It will make the tools run faster and minimize unnecessary repeated tasks.

I understand the argument that the protein is really not part of the analyte -- it is merely where it occurs in the natural world -- so it should not be stored with the library entry. We are saying, essentially, that the source or any auxiliary information about the analyte should not be stored. But then what about organism? What about target/decoy? (The tool can figure that out from trying to map it to the FASTA! No need to store that field either.) What about natural/synthetic (the metabolomics people will want this field)? Oh, look it up in some online database instead -- none of the business of the library. Synonyms of metabolites? Too messy, just store the InChI key and let the users look it up themselves. None of this has anything to do with the one-to-one correspondence between the analyte and its characteristic fragmentation pattern, which is, in a pure sense, what a library entry should be about. Are we really going to go down the road of cutting out anything that should not be part of this correspondence?

I think our overriding concern, at this point of the exercise, should be to ensure that all existing tools are willing to switch over. If we make it too hard on the tool developer or the user, then we may have a beautiful and well-designed format that no one will use.

Henry
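
For readers unfamiliar with the re-mapping step mentioned above, here is a deliberately naive sketch of mapping peptides to proteins from FASTA text by substring search. It is not SpectraST's implementation; the function names and the toy sequences are invented for the example, and real tools would also have to handle details such as I/L ambiguity and large databases.

```python
def parse_fasta(text):
    """Parse FASTA text into a dict of accession -> sequence."""
    proteins = {}
    accession = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first whitespace-delimited token of the header as accession.
            accession = line[1:].split()[0]
            proteins[accession] = ""
        elif accession is not None:
            proteins[accession] += line
    return proteins


def map_peptides(peptides, proteins):
    """Map each peptide to every protein whose sequence contains it."""
    return {
        peptide: [acc for acc, seq in proteins.items() if peptide in seq]
        for peptide in peptides
    }


fasta = """\
>P1 toy protein
MKTAYIAK
QRQISFVK
>P2 toy protein
GGQISFVKAA
"""
mapping = map_peptides(["QISFVK", "TAYIAK", "WWWW"], parse_fasta(fasta))
print(mapping)
```

Storing the resulting accessions as an optional, multi-valued field would let a search tool skip this step on every search, which is the efficiency argument made above.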


schymane · 2018-04-26T06:15:34Z

I agree wholeheartedly. We should be flexible and able to store all existing information from existing libraries (even if we don't necessarily see the use), and by allowing appropriate optional (not mandatory) fields we should be able to achieve this.

ypriverol · 2018-04-26T06:37:16Z

By design, every file format from PSI has the option to add additional properties, in some cases by using CVParams and UserParams (mzIdentML, mzML) or optional columns (mzTab). In the current document, we are specifying which fields we want to capture, with their cardinality, and we should guarantee that every section has a mechanism to add additional fields as CVParams. In the specification document, we can add a section about how to report protein information.

By design, we should have the flexibility to add the information and protein information is one of those cases.
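
As a sketch of how extra properties could travel as CV parameters, the snippet below parses the bracketed form used in the opening example of this issue, assuming the mzTab-style layout `[CV label, accession, name, value]`. The `CVParam` type and `parse_cv_param` function are hypothetical, and names or values containing commas are not handled.

```python
from typing import NamedTuple


class CVParam(NamedTuple):
    cv_label: str
    accession: str
    name: str
    value: str


def parse_cv_param(text):
    """Parse '[label, accession, name, value]' into a CVParam."""
    inner = text.strip()
    if not (inner.startswith("[") and inner.endswith("]")):
        raise ValueError(f"not a bracketed parameter: {text!r}")
    fields = [part.strip() for part in inner[1:-1].split(",")]
    if len(fields) != 4:
        raise ValueError(f"expected 4 fields, got {len(fields)}: {text!r}")
    return CVParam(*fields)


param = parse_cv_param("[MS, MS:1000703, LTQ Orbitrap,]")
print(param.accession, param.name)
```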

javizca · 2018-04-26T07:22:11Z

In my view, there would need to be a way to encode protein level information, since some people/tools may need it. What I would avoid is to capture all the underlying complexity related to protein inference. That would ideally go somewhere else.
Analogous concepts would be applicable also to analytes coming from metabolomics, lipidomics, etc.

RalfG · 2019-11-18T10:10:57Z

To update this issue:

All metadata is listed in the following document https://docs.google.com/document/d/1rN5DJSowp2micxlwJQlPxlv39ZiaLEfv/edit.

When editing, ALWAYS make sure you are using the "Suggesting" mode. To activate this click on View > Mode > Suggesting.

sneumann · 2019-11-19T11:30:31Z

Hi, can you open the document in "Comment" mode for everyone with the link? That should allow "Suggesting" mode editing. And/or respond to the "request permission" notification I sent? Thanks, Yours, Steffen


edeutsch · 2019-11-19T13:45:08Z

Switched the document so that anyone with the link can suggest, as requested.

edeutsch · 2023-09-29T15:58:44Z

Newer document: https://docs.google.com/document/d/1o11m7grfHvMzfbTozvDY0twJ1g2dzk6I/edit

But @RalfG is working on a system to encode this in JSON.
See ongoing work in #73

The next time we have @RalfG on the Friday call, we should spend some time with this table of information.

ypriverol assigned jgriss, edeutsch and henryhlam Apr 19, 2018

ypriverol assigned sneumann Apr 19, 2018

sneumann mentioned this issue Apr 23, 2018

Decision on new or existing format #10

Closed

This was referenced Apr 24, 2018

Metadata information fo each spectra in the library #9

Closed

Definition of the attributes and terms for Spectral library. #4

Closed

RalfG changed the title ~~We need to capture the metadata around the Spectral library~~ Refine and finalize metadata and CV terms Jun 6, 2020
