Refine and finalize metadata and CV terms #7

Open
ypriverol opened this issue Apr 19, 2018 · 27 comments

@ypriverol
Contributor

The current MSP and other spectral library formats capture only the metadata around each entry in the library (clusters, consensus spectra, peptides, small molecules), but not how the spectral library was generated. We need to define a general metadata section at the beginning of the file to hold this information. Similar to mzTab, I think it would be great to have something like:

The MTD prefix tells readers that the line is a metadata field. The second column is the key of the metadata attribute and the third is its value.

The following fields can be reused from mzTab:

MTD   mzL-version	1.0.0      
MTD   title  Spectral Library Human from Peptide Atlas 
MTD   id     PXL00000001 
MTD   description Some description that can be used for example in the web about the library
MTD   instrument [MS, MS:1000703, LTQ Orbitrap,]
MTD   instrument [MS, MS:1000008, Velos Orbitrap,]
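A minimal sketch of how a reader could consume such MTD lines, assuming columns are separated by tabs or runs of spaces and that bracketed values follow the mzTab-style `[cv, accession, name, value]` layout. The function names `parse_value` and `parse_mtd` are hypothetical and not part of any existing tool.

```python
import re
from collections import defaultdict

# Example lines copied from the proposal above; columns may be separated
# by tabs or runs of spaces, so we split on any whitespace.
EXAMPLE = """\
MTD   mzL-version\t1.0.0
MTD   title  Spectral Library Human from Peptide Atlas
MTD   id     PXL00000001
MTD   instrument [MS, MS:1000703, LTQ Orbitrap,]
MTD   instrument [MS, MS:1000008, Velos Orbitrap,]
"""

CV_PARAM = re.compile(r"^\[(.*)\]$")


def parse_value(raw):
    """Return a CV-param dict for '[cv, accession, name, value]', else the raw string."""
    match = CV_PARAM.match(raw)
    if not match:
        return raw
    parts = [p.strip() for p in match.group(1).split(",")]
    parts += [""] * (4 - len(parts))  # pad params that omit trailing fields
    cv, accession, name, value = parts[:4]
    return {"cv": cv, "accession": accession, "name": name, "value": value}


def parse_mtd(text):
    """Collect MTD lines into {key: [values]}; keys may repeat (e.g. instrument)."""
    metadata = defaultdict(list)
    for line in text.splitlines():
        fields = line.strip().split(None, 2)  # prefix, key, rest of line
        if len(fields) == 3 and fields[0] == "MTD":
            metadata[fields[1]].append(parse_value(fields[2].strip()))
    return dict(metadata)


library = parse_mtd(EXAMPLE)
print(library["title"][0])
print([inst["name"] for inst in library["instrument"]])
```

Keeping every key as a list is one way to allow repeated keys such as `instrument` without special cases.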

Can we add to this issue all the fields we think are interesting or important to trace?

@ypriverol
Contributor Author

ypriverol commented Apr 19, 2018

Here is my list of attributes to define at the level of the library. I add an * where I think the attribute should be mandatory.

  • title*: Title to be used by visualization components.
  • description*: To be used by visualization components and for search purposes.
  • id*: Unique identifier across resources, for example those of ProteomeXchange.
  • instruments: A list of instruments included in the library. We can make this optional but RECOMMENDED, provided only if the library is filtered by instrument. I think PeptideAtlas releases instrument-based libraries, but PRIDE does not.
  • modifications: A list of modifications included in the library. As with the instrument list, it is RECOMMENDED to add the modifications only if the library is dedicated to, or filtered for, modified peptides. A library can list one or more peptide modifications.
  • Fragmentation mode: Again, a list of fragmentation modes if the library has been built for a specific type.
  • contacts: Contact names of the people or resources that developed the library.
  • Comments: General comments about the library, similar to mzTab and other file formats.
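To make the mandatory/optional split concrete, here is a small validation sketch. The key spellings (for example `fragmentation_mode`) and the function `check_library_metadata` are assumptions for illustration; the split itself mirrors the asterisks in the list above and is only a proposal.

```python
# Attribute names follow the list above; spellings are illustrative.
MANDATORY = {"title", "description", "id"}
OPTIONAL = {"instruments", "modifications", "fragmentation_mode", "contacts", "comments"}


def check_library_metadata(metadata):
    """Return a list of human-readable problems; an empty list means the header passes."""
    problems = []
    for key in sorted(MANDATORY):
        if not metadata.get(key):
            problems.append(f"missing mandatory attribute: {key}")
    for key in sorted(set(metadata) - MANDATORY - OPTIONAL):
        problems.append(f"unknown attribute: {key}")
    return problems


header = {"title": "Spectral Library Human from Peptide Atlas", "id": "PXL00000001"}
print(check_library_metadata(header))
```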

@schymane

I'm not sure this is the right place for this comment, but note that NIST has not only the MSP format but also the SDF format for storing spectral information (I do not see this mentioned here yet). Although MSP seems to be the more common exchange format, SDF has the advantage that the full structure AND the spectrum can be in it, and this paper is a great example of using SDF for NMR data exchange:
https://onlinelibrary.wiley.com/doi/abs/10.1002/mrc.4737

@rsalek

rsalek commented Apr 20, 2018

It might be better to keep SDF (structure) separate, as the format should cover proteomics, metabolomics, and potentially other MS-based applications.

@schymane

See parallel conversation for similar comments! MassBank/MassBank-web#110
I am certainly missing background on the discussions here; I have plenty of thoughts on the small-molecule side but not much idea of how proteomics handles this.

@mwang87
Contributor

mwang87 commented Apr 21, 2018

I'm not sure if this is addressed in the MassBank format or other formats, but one thing we try to track on the GNPS side is provenance: the filename and scan number that the reference spectrum came from. It's not perfect in the record, though, and maybe it is more appropriate to track it externally (which is done at GNPS) and reference those records through an accession number.

@ypriverol
Contributor Author

@mwang87 this issue is to capture what we are planning to trace for the complete spectral library, not for the individual spectra. I have created another issue, #9, for the individual spectra and clusters.

@henryhlam
Contributor

Let's make the largest list possible first (with each field marked as required/optional) and we can whittle it down. Maybe it is easier to edit a Google Doc together?

My quick thoughts below.

At library level, we need:

Format version (e.g. mzl 1.0)

A universal library identifier (similar to the universal spectrum identifier), e.g. mzlib:PXL0000100:NIST_cow_2018

Publisher/source, including Contact (e.g. NIST)

Publishing date (or library version or serial number)

Library name/descriptor

Software generating the library and version

Organisms (do we need this? If yes, we have to allow multiple organisms, or none at all)

All Modifications (do we need this at the library level? probably only necessary to define special mods not already in PSI-MOD or UNIMOD, or to shorten the tags in each library entry by defining them here)

Instrumentation/Fragmentation (similar. We need this at spectrum level anyway, as many libraries contain mixture of spectra from different instruments. Do we need it here?)

Comments

Provenance
(From my experience, it is often necessary to modify, merge, or filter libraries to create new custom ones, and it would be nice to have a place to keep track of what has been done to the library, e.g. "This library was created from the NIST 2014 one by filtering for all tryptic peptides and merging the result with a decoy library...". This could go into the Comments field, but a separate "Provenance" field may be useful.)
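As a rough illustration of the universal library identifier in the list above, the example `mzlib:PXL0000100:NIST_cow_2018` could be split as follows. The three-part `prefix:accession:name` layout is only inferred from that single example, and `LibraryIdentifier` and `parse_library_identifier` are hypothetical names.

```python
from typing import NamedTuple


class LibraryIdentifier(NamedTuple):
    prefix: str      # e.g. "mzlib"
    accession: str   # e.g. "PXL0000100"
    name: str        # e.g. "NIST_cow_2018"


def parse_library_identifier(text):
    """Split 'prefix:accession:name'; the three-part layout is an assumption."""
    parts = text.split(":", 2)
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected prefix:accession:name, got {text!r}")
    return LibraryIdentifier(*parts)


print(parse_library_identifier("mzlib:PXL0000100:NIST_cow_2018"))
```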

For spectrum level, it is a lot more complicated. Maybe we should have a separate thread/doc for this.

@schymane

schymane commented Apr 22, 2018 via email

@ypriverol
Contributor Author

ypriverol commented Apr 22, 2018

The google document is this one:
https://docs.google.com/document/d/1LgSGtR_t5IcUS9rV7YtsLveDVX9X8KsOpU-5NJ4vuYI/edit?usp=sharing

This issue is to discuss the metadata at the level of the library; we have another issue, #9, to discuss the metadata of the individual spectra.

@schymane

My comment stands for both the individual spectrum and library level ... many spectral libraries will likewise not come from an organism ... although some may and in this case it would be valuable information to be captured. A more generic description may be more flexible?
I added some comments to the doc.

@ypriverol
Contributor Author

@schymane The idea of having organisms, instruments, and modifications in the library metadata is for dedicated libraries, for example where the library has been created or filtered for those properties. If that is not the case, those properties should be captured at the spectrum level, because the number of species, instruments, and especially modifications in one library can be huge.

@ypriverol
Contributor Author

@henryhlam @edeutsch @sneumann @schymane I have updated the document with the new fields you guys provided that need to be captured at the level of the library. Please have a look here https://docs.google.com/document/d/1LgSGtR_t5IcUS9rV7YtsLveDVX9X8KsOpU-5NJ4vuYI/edit

@vrkosk

vrkosk commented Apr 24, 2018

When a spectral library is searched, there are two fragment tolerances to consider: the user specifies the tolerance of the input data, and something must specify the tolerance of peaks in the library spectra. These two tolerances could easily differ (e.g. search 10ppm data against a 0.5Da library). It would be nice if fragment tolerance were a file-level attribute.
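To illustrate the two tolerances, here is a sketch that converts ppm to Da at a given m/z and then matches a query peak against a library peak. Taking the wider of the two windows is just one possible policy, chosen here for illustration; `tolerance_in_da` and `peaks_match` are hypothetical helpers.

```python
def tolerance_in_da(mz, value, unit):
    """Convert a tolerance to Da at a given m/z; unit is 'ppm' or 'Da'."""
    if unit == "ppm":
        return mz * value / 1e6
    if unit == "Da":
        return value
    raise ValueError(f"unsupported tolerance unit: {unit!r}")


def peaks_match(query_mz, library_mz, query_tol, library_tol):
    """Match if the peaks differ by no more than the wider of the two tolerances.

    Using the wider window is one possible policy, not an agreed rule.
    """
    window = max(
        tolerance_in_da(query_mz, *query_tol),
        tolerance_in_da(library_mz, *library_tol),
    )
    return abs(query_mz - library_mz) <= window


# 10 ppm query data searched against a 0.5 Da library, as in the example above.
print(tolerance_in_da(500.0, 10, "ppm"))
print(peaks_match(500.0, 500.3, (10, "ppm"), (0.5, "Da")))
print(peaks_match(500.0, 500.3, (10, "ppm"), (10, "ppm")))
```

With a file-level fragment tolerance, the second argument pair would come from the library header rather than from the user.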

Also, it's very valuable to allow library entries to override file-level attributes. This allows specifying defaults (e.g. default instrument or default organism). It reduces metadata clutter and still lets you mix entries from different sources in the same file.
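A minimal sketch of entry-level overrides on top of file-level defaults, using `collections.ChainMap` from the Python standard library. The attribute names, the organism value, and the entry names are placeholders, not proposed terms.

```python
from collections import ChainMap

# File-level defaults; the attribute names are illustrative.
library_defaults = {"instrument": "LTQ Orbitrap", "organism": "Homo sapiens"}

# Each entry stores only what differs from the defaults.
entries = [
    {"name": "PEPTIDEK/2"},
    {"name": "ELVISLIVESK/2", "instrument": "Velos Orbitrap"},
]


def resolve(entry, defaults):
    """Entry-level attributes take precedence; missing ones fall back to the file level."""
    return dict(ChainMap(entry, defaults))


for entry in entries:
    print(resolve(entry, library_defaults))
```

The first entry inherits both defaults, while the second keeps the default organism but reports its own instrument.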

@edeutsch
Contributor

yes indeed. I put in precursor mass accuracy and fragment mass accuracy as desirable attributes in the library. Mass tolerance strikes me more as a software parameter and a user preference than an inherent property of a library or a spectrum. But I can see other opinions, too.

@edeutsch
Contributor

Hi everyone, I have updated the document to reflect notes that I have been taking and the overall direction of the document, which is metadata at all levels, not just the spectrum level or library level. I defined FOUR levels of metadata:

  • collection (library) level
  • spectrum level
  • peak level
  • peak annotation level (since one peak may have several possible annotations)

The spectrum level is somewhat further divided into merged spectra, individual spectra, and common to both merged and individual spectra.
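The four levels could be pictured as nested containers, roughly as below. This is only a sketch under assumptions: the class and field names are made up, free-form attribute dictionaries stand in for the real terms, and a single `is_merged` flag stands in for the merged/individual split.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class PeakAnnotation:
    attributes: Dict[str, str] = field(default_factory=dict)


@dataclass
class Peak:
    mz: float
    intensity: float
    attributes: Dict[str, str] = field(default_factory=dict)
    # One peak may carry several candidate annotations.
    annotations: List[PeakAnnotation] = field(default_factory=list)


@dataclass
class Spectrum:
    attributes: Dict[str, str] = field(default_factory=dict)
    is_merged: bool = False  # merged (consensus) vs individual spectrum
    peaks: List[Peak] = field(default_factory=list)


@dataclass
class Library:
    attributes: Dict[str, str] = field(default_factory=dict)
    spectra: List[Spectrum] = field(default_factory=list)


# Placeholder values, only to show the nesting.
library = Library(attributes={"title": "Spectral Library Human from Peptide Atlas"})
spectrum = Spectrum(attributes={"peptide": "PEPTIDEK"}, is_merged=True)
peak = Peak(mz=500.0, intensity=1.0)
peak.annotations.append(PeakAnnotation({"ion": "y4"}))
peak.annotations.append(PeakAnnotation({"ion": "b5"}))
spectrum.peaks.append(peak)
library.spectra.append(spectrum)
print(len(library.spectra[0].peaks[0].annotations))
```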

I reorganized the document a little. I hope I didn't mess up anything in anyone's view. Have a look and see what you think.

@vrkosk

vrkosk commented Apr 25, 2018

I'm glad to see chimeric spectra taken into account. What about intact crosslinked peptides, e.g. disulfide bonds, or looplinked or cyclic peptides?

Protein-level data is inherently problematic in a peptide-centric spectral library. Example problems: there could be no parent protein (e.g. de novo identification); accession format must be restricted (if it's not restricted, it's a free text field); peptide sequence could appear in several locations in a protein; peptide is usually found in more than one protein; consensus spectra could have conflicting parent proteins; accession formats could differ between library entries and between libraries (and will differ if you compare or merge results with a database search). Shouldn't you also record the FASTA name and version if it was a database search? What about the protein description?

I realise this may be an unpopular view, but how about prohibiting protein-level metadata? Or at least move them out of peptide metadata and into the experiment-level metadata. Protein attributes help explain how the peptide was identified rather than being an inherent part of a peptide identification.

@jgriss

jgriss commented Apr 25, 2018

I second @vrkosk's view that protein-level information should not be added directly. Spectral libraries can be expected to be merged quite heavily, and protein-level information will then definitely cause issues. If we leave it out from the start, users and search engines will expect that they need to supply a FASTA database. In my opinion, this is the cleanest solution.

@ypriverol
Contributor Author

I also agree. I guess all information for biological entities should remain at the peptide level, without protein information.

@edeutsch
Contributor

I agree that we should make sure that cross-linked peptides are supported. I think it is mostly already there with multiple simultaneous identifications already supported. But I suppose we need a flag to distinguish cases when the multiple peptides are chimeric vs. cross-linked. I will add that.
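One way such a flag could look, sketched with an enumeration. The names `MultiAnalyteKind` and `Identification` are hypothetical, and requiring the flag whenever several peptides are present is an assumption, not something agreed here.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class MultiAnalyteKind(Enum):
    CHIMERIC = "chimeric"          # independent peptides co-fragmented in one spectrum
    CROSS_LINKED = "cross-linked"  # peptides covalently joined into one precursor


@dataclass
class Identification:
    peptides: List[str]
    kind: Optional[MultiAnalyteKind] = None  # only meaningful for several peptides

    def __post_init__(self):
        # Assumed rule: several peptides must say how they relate to each other.
        if len(self.peptides) > 1 and self.kind is None:
            raise ValueError("multiple peptides need a chimeric/cross-linked flag")


# Placeholder sequences, only to show usage.
linked = Identification(["PEPTIDEK", "ELVISLIVESK"], MultiAnalyteKind.CROSS_LINKED)
print(linked.kind.value)
```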

Regarding the encoding of proteins, I would disagree with the prevailing thought that we should prohibit protein information. I certainly agree that it should not be required, and I agree it could get a bit complex. But I suspect some people would like to encode that information and it seems to me that providing a standardized optional way of doing that is a better choice than attempting prohibition.

@henryhlam
Contributor

henryhlam commented Apr 26, 2018 via email

@schymane

I agree wholeheartedly. We should be flexible and able to store all information from existing libraries (even if we don't necessarily see the use); allowing appropriate optional (not mandatory) fields should let us achieve this.

@ypriverol
Contributor Author

By design, every PSI file format has a way to add further properties: CVParams and UserParams in some cases (mzIdentML, mzML), or optional columns (mzTab). In the current document, we are specifying which fields we want to capture, with their cardinality, and we should guarantee that every section has a mechanism to add additional fields as CVParams. In the specification document, we can add a section about how to report protein information.
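As a sketch of that extension mechanism, a section could carry its fixed fields plus an open list of parameters, where a parameter without a CV reference acts as a user parameter. The `Param` and `Section` classes and the placeholder value are assumptions for illustration only.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Param:
    name: str
    value: str = ""
    cv: Optional[str] = None         # None marks a user-defined parameter
    accession: Optional[str] = None

    @property
    def is_cv_param(self):
        return self.cv is not None and self.accession is not None


@dataclass
class Section:
    title: str                                        # a fixed, specified field
    extra: List[Param] = field(default_factory=list)  # open-ended additions


section = Section(title="Spectral Library Human from Peptide Atlas")
section.extra.append(Param("LTQ Orbitrap", cv="MS", accession="MS:1000703"))
# Placeholder user parameter, e.g. for optional protein information.
section.extra.append(Param("protein accession", value="EXAMPLE_ACCESSION"))
print([p.is_cv_param for p in section.extra])
```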

By design, we should have the flexibility to add such information, and protein information is one of those cases.

@javizca

javizca commented Apr 26, 2018

In my view, there would need to be a way to encode protein level information, since some people/tools may need it. What I would avoid is to capture all the underlying complexity related to protein inference. That would ideally go somewhere else.
Analogous concepts would be applicable also to analytes coming from metabolomics, lipidomics, etc.

@RalfG
Collaborator

RalfG commented Nov 18, 2019

To update this issue:

All metadata is listed in the following document https://docs.google.com/document/d/1rN5DJSowp2micxlwJQlPxlv39ZiaLEfv/edit.

When editing, ALWAYS make sure you are using the "Suggesting" mode. To activate this click on View > Mode > Suggesting.

@sneumann
Member

sneumann commented Nov 19, 2019 via email

@edeutsch
Contributor

Switched the document so that anyone with the link can suggest, as requested.

@RalfG changed the title from "We need to capture the metadata around the Spectral library" to "Refine and finalize metadata and CV terms" on Jun 6, 2020
@edeutsch
Contributor

Newer document: https://docs.google.com/document/d/1o11m7grfHvMzfbTozvDY0twJ1g2dzk6I/edit

But @RalfG is working on a system to encode this in JSON.
See ongoing work in #73

Next time we have @RalfG on the Friday call, we should spend some time with this table of information.
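For discussion purposes only, library-level metadata could be written to JSON along these lines. This is not the encoding being developed in #73; the key names are assumptions, and the values are taken from the example at the top of this issue.

```python
import json

# Key names are illustrative, not the schema from #73.
library_metadata = {
    "format_version": "1.0.0",
    "title": "Spectral Library Human from Peptide Atlas",
    "id": "PXL00000001",
    "attributes": [
        {"cv": "MS", "accession": "MS:1000703", "name": "LTQ Orbitrap", "value": ""},
    ],
}

encoded = json.dumps(library_metadata, indent=2, sort_keys=True)
print(encoded)

# Round trip: decoding gives back the same structure.
decoded = json.loads(encoded)
```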
