Native scan identification and ion mobility spectra #1

Open
chambm opened this issue Dec 9, 2019 · 4 comments

@chambm

chambm commented Dec 9, 2019

ACTION: Everyone: Consider the following TripleTOF spectrum and see how to support it

mzspec:PXD013210:TTB20160722_ISBHJOMXX001879_r01:scan:19809:SITS[phospho]PTTLYDR/2
https://db.systemsbiology.net/sbeams/cgi/PeptideAtlas/ShowObservedSpectrum?usi=mzspec:PXD013210:TTB20160722_ISBHJOMXX001879_r01:scan:19809:SITS[phospho]PTTLYDR/2

Scan numbers should absolutely not be used to identify WIFF spectra. There's simply no reliable way to get back from a scan number to the <sample, period, cycle, experiment> tuple that is necessary to actually pinpoint a spectrum in a WIFF file. Limiting the id to a single number makes the "universal" modifier rather inaccurate. :) The same goes for Waters spectra, where function and scan are orthogonal and both are needed to pinpoint a spectrum in the .raw data.

Index is also unsuitable for maintaining a link back to the native spectrum, especially for multi-dimension formats (WIFF and Waters .raw), because the enumeration order of the dimensions is not guaranteed, nor is it clear that the indexes used for any format are based on a completely unfiltered enumeration of the data. In other words, someone generating USIs from a DDA mzML that has been filtered to only MS1s will get different indices than someone looking them up in an unfiltered file. It's simply not worth the potential for confusion!
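
The index mismatch is easy to reproduce with a toy run. The sketch below uses made-up spectra (the nativeIDs and MS levels are illustrative, not taken from a real file) and shows the same spectrum landing on a different index once the enumeration is filtered:

```python
# Toy run: (nativeID, MS level) pairs in acquisition order. Values are illustrative only.
run = [
    ("scan=1", 1),
    ("scan=2", 2),
    ("scan=3", 2),
    ("scan=4", 1),
    ("scan=5", 2),
]


def index_by_native_id(spectra):
    """Map each nativeID to its 0-based position in the given enumeration."""
    return {native_id: i for i, (native_id, _) in enumerate(spectra)}


unfiltered = index_by_native_id(run)
ms1_only = index_by_native_id([s for s in run if s[1] == 1])

# Same spectrum, same nativeID, different index depending on the filtering.
print(unfiltered["scan=4"])  # 3
print(ms1_only["scan=4"])    # 1
```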

We already solved this problem a decade ago with mzML and nativeIDs. Since they can be a bit verbose in a USI which is already quite long, I suggest we use an abbreviated format. Instead of "controllerType=0 controllerNumber=1 scan=123" we can put "MS:1000768:0.1.123" which is the combination of the Thermo nativeID accession and the abbreviated nativeID. Likewise:

  • WIFF: MS:1000770:1.1.123.2
  • Waters: MS:1000769:1.0.123
  • Bruker BAF: MS:1000772:123
  • Bruker FID: MS:1000773:_x0031_00_x0020_fmol_x0020_BSA_x002f_0_B1_x002f_1_x002f_1SRef_x002f_fid (this is an encoded version of 100 fmol BSA/0_B1/1/1SRef/fid because IDREF is the datatype)
  • MGF: MS:1000774:123
  • mzXML: MS:1000776:123
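
A minimal sketch of the abbreviation, assuming the nativeID is a space-separated list of key=value pairs (true for the Thermo, WIFF and Waters formats, but not for ids whose values may contain spaces). The function names are hypothetical; the second helper undoes the `_xHHHH_` escapes seen in the Bruker FID example.

```python
import re


def abbreviate_native_id(accession, native_id):
    """Join the values of a 'key=value key=value' nativeID with '.' and
    prefix the CV accession of the nativeID format."""
    values = [pair.split("=", 1)[1] for pair in native_id.split()]
    return f"{accession}:{'.'.join(values)}"


def decode_xml_name(encoded):
    """Replace each _xHHHH_ escape with the character it encodes."""
    return re.sub(r"_x([0-9A-Fa-f]{4})_", lambda m: chr(int(m.group(1), 16)), encoded)


print(abbreviate_native_id("MS:1000768", "controllerType=0 controllerNumber=1 scan=123"))
# MS:1000768:0.1.123
print(abbreviate_native_id("MS:1000770", "sample=1 period=1 cycle=123 experiment=2"))
# MS:1000770:1.1.123.2
print(decode_xml_name("_x0031_00_x0020_fmol_x0020_BSA_x002f_0_B1_x002f_1_x002f_1SRef_x002f_fid"))
# 100 fmol BSA/0_B1/1/1SRef/fid
```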

The WIFF nativeID also solves another problem described here: the sample index in the WIFF file which can contain multiple samples which are NOT necessarily named uniquely. For a WIFF file, the "run name" part of the USI should refer ONLY to the WIFF filename, not the sample name.

However, there is an unresolved discussion about nativeIDs in the soon-to-be-recommended 3-array representation for ion mobility spectra in mzML. That discussion should apply to USIs as well, probably even more urgently because USIs may be paired with a spectrum interpretation. A single 3-array diaPASEF (or Agilent/Waters full IM frame) spectrum may correspond with multiple peptides. When the peptides are separated in the IM dimension, then creating a combined spectrum actually combines evidence that could otherwise be kept separate and combined for each peptide individually (using a unique range of mobility scans).

For example, let's say there is a Waters IM frame, which has 200 mobility scans (they all have the same retention time but cover a range of drift times). One peptide at drift time 5ms is supported by scans 50-60, and another peptide at drift time 10ms is supported by scans 120-130. If the combined spectrum was the entire frame of 200 scans (as @edeutsch suggested in email), then that evidence would all be combined in the same spectrum, and USIs to the spectrum would be ambiguous (kind of like a chimeric spectrum). When reading/converting the raw data, there's no interpretation of course, so a reader/converter can't know that the spectra should be separated by drift time. I was going to suggest that the raw spectra be given the full range of drift scans explicitly, like frame=123 scanStart=1 scanEnd=200 and the interpreting software can make a USI with a subset of the start/end range to refer to a specific subset of mobility scans. But I feel that's too complex if accessing the full combined spectrum in mzML. I think it makes more sense to make sure the USIs for ion mobility identifications include the IM window so reader code can do its own filtering (similar to using the peptide sequence to infer the precursor and product m/zs). The same logic would apply for diaPASEF, but not ddaPASEF. The latter can be easily separated into combined spectra with just the subset of the mobility range relevant to a specific precursor (e.g. frame=123 scanStart=456 scanEnd=567 for precursor 678.9). It's worth noting that ddaPASEF spectra are usually further merged (between frames) for searching purposes, and I think representing that is outside the scope of nativeIds. So those spectra, if searched, could only be tracked back to the mzML or MGF file (a merged=123 spectrum).
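
To make the Waters example concrete, here is a rough sketch of reader-side filtering by mobility scan window. The frame layout (a dict from mobility scan number to peaks), the function name and all peak values are assumptions for illustration, and summing on exact m/z equality stands in for real tolerance-based binning.

```python
from collections import defaultdict


def combine_mobility_scans(frame, scan_start, scan_end):
    """Sum intensities per m/z over mobility scans in [scan_start, scan_end]."""
    combined = defaultdict(float)
    for scan_number, peaks in frame.items():
        if scan_start <= scan_number <= scan_end:
            for mz, intensity in peaks:
                combined[mz] += intensity
    return dict(combined)


# Toy frame: mobility scan number -> list of (m/z, intensity). Values are made up.
frame = {
    55: [(500.25, 10.0), (600.5, 4.0)],
    56: [(500.25, 12.0)],
    125: [(700.75, 8.0)],
    126: [(700.75, 9.0), (500.25, 1.0)],
}

whole_frame = combine_mobility_scans(frame, 1, 200)   # evidence for both peptides mixed
peptide_a = combine_mobility_scans(frame, 50, 60)     # only scans 50-60
peptide_b = combine_mobility_scans(frame, 120, 130)   # only scans 120-130

print(whole_frame)
print(peptide_a)
print(peptide_b)
```

With the window carried in the USI, the two identifications resolve to different peak lists even though they share one frame.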

@edeutsch
Contributor

Hi Matt, thanks for these thoughtful comments.

  • You are correct that my attempt at using scan was foolish.
  • In the scan/index field, instead of MS:1000770, can we just use nativeId? Once a receiver of a USI has determined the right file, is the term obvious?
    (I suggest this because asking ordinary users to juggle MS:1000770 seems difficult.)
  • I suggest we still keep scan for Thermo files to avoid complication.
  • For WIFF, can we use nativeId:1,1,123,2?
  • Instead of the scanStart scanEnd thing, can we use a range in the string (nativeId:123,456-567)?
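
For reference, the range idea could be parsed along these lines. This is only a sketch of a syntax still under discussion; the function name and the choice to return ranges as tuples are assumptions.

```python
def parse_native_id_fields(value):
    """Split a comma-separated id; 'a-b' components become (a, b) ranges."""
    fields = []
    for part in value.split(","):
        if "-" in part:
            start, end = part.split("-", 1)
            fields.append((int(start), int(end)))
        else:
            fields.append(int(part))
    return fields


print(parse_native_id_fields("123,456-567"))  # [123, (456, 567)]
print(parse_native_id_fields("1,1,123,2"))    # [1, 1, 123, 2]
```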

@chambm
Author

chambm commented Jan 13, 2020

  1. Just knowing the filename doesn't tell you what kind of nativeId it is. You can't key off file extension: Bruker and Agilent both use .d as an extension. I guess you could require users to open the file before parsing the nativeId, at which point they should be able to infer the nativeId type, but to me it would feel underspecified without an explicit nativeId type.

  2. Can you elaborate on that? I don't think 0.1.123 is much more complicated than 123. It's an unambiguous abbreviation of the proper nativeId.

  3. That would depend on the final answer to issue #2 ("At least one USI example in the USI publication is not working"). But if nativeId was indeed used without a specific type then that WIFF id would be correct.

  4. I lean toward keeping discrete key-value pairs because it'll be easier for existing parsers to deal with. But using these combined scan nativeId formats at all defers to the other discussion about how to combine IMS scans which has not yet been resolved. The formats will need to be added to the CV for example.

@edeutsch
Contributor

edeutsch commented Feb 3, 2020

  1. My concern here is that we are hoping that USIs are something that can be used by all users in the community. Making users put 'MS:1000770' into the USI seems like adding a needless element of mysterious black magic into the USI. Just having "nativeId" seems to me like the limit of what users will put up with. It seems like ProteoWizard can, given a filename, open it and convert it to mzML where it tells you what the nativeId format is. Surely ProteoWizard should be able to open an arbitrary file (be it Bruker or Agilent or mzML) and determine what the nativeID format would be if it were to convert that file (it already does it somehow) and then interpret the user input "nativeId:1,1,123,2" in that context. In fact, for a WIFF file, it can only be one thing (I think?), so if a user provides "MS:1000769:1,1,123,2" and that is the wrong nativeId for a WIFF file, then ProteoWizard might return "nativeId type for this kind of file should be MS:1000770". Why make the user tell the software what the software can determine for itself? Maybe this adds an extra "check digit" confirmation, but it really seems like a needless turn-off for users.

  2. It is true that "0,1,123" is not much more complicated than "123", but we're trying to make this as simple as possible for ordinary users, and I will suggest that the vast majority of users are used to designating Thermo spectra by their "scan number". Why introduce the concept of controllerNumber and controllerType if they're always going to be "0,1"? I will posit that for 99% of Thermo users out there, if you ask them "Please open up file X in Xcalibur and show me scan:123", they'll know exactly what to do. If you ask them "Please open up file X in Xcalibur and show me MS:1000768:0,1,123", you will only get a blank stare. And you and I can write software that can handle either, so why go with the complicated solution?

  3. I am neutral on this one. Creating a whole new set of terms including xxStart and xxEnd doesn't sound ideal, but I'm not against it. Using a range like 1,1,120-130,2 does feel like a bit of a hack. Parsers will need to be updated in either case when faced with such input. For a 4-part key like 1,1,123,2, would we need to have separate terms for each permutation of what could be rangeable? Could you ever have 1,1,120-130,2-3? Presumably not, but if there were, the one term would amplify into 4 terms. Not a big deal, I guess. I'm fine either way.
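
The "software determines the type itself" idea in point 1 might look roughly like the sketch below. The format labels, table and function name are hypothetical (this is not a ProteoWizard API), and the detected format is assumed to come from actually opening the file, since Bruker and Agilent both use .d.

```python
# nativeID format accessions quoted in this thread, keyed by a hypothetical
# vendor-format label that a reader would obtain by opening the file.
NATIVE_ID_ACCESSION = {
    "Thermo RAW": "MS:1000768",
    "Waters RAW": "MS:1000769",
    "WIFF": "MS:1000770",
}


def check_native_id_type(detected_format, claimed_accession=None):
    """Return the accession implied by the file format; if the user supplied
    one explicitly, treat it as a consistency check."""
    expected = NATIVE_ID_ACCESSION[detected_format]
    if claimed_accession is None or claimed_accession == expected:
        return expected
    raise ValueError(f"nativeId type for this kind of file should be {expected}")


print(check_native_id_type("WIFF"))  # MS:1000770
try:
    check_native_id_type("WIFF", "MS:1000769")
except ValueError as err:
    print(err)  # nativeId type for this kind of file should be MS:1000770
```

An explicit accession then behaves like the optional "check digit": harmless when it matches, and a clear error when it does not.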

@chambm
Author

chambm commented Feb 13, 2020

  1. It's true that, other than displaying the id in a more user-friendly way, I can't think of a reason to parse most abbreviated ids back into their original format before opening the file. However, there are some exceptions:
    a. What about mzML files? They could have scans like "merged=123" as well as nativeIDs. Granted, we don't have a CV term for that format. So in a USI it might be merged:123, although this would only work for referencing the mzML file which has that specific id (i.e. not the original raw file)
    b. What about MGF files with their just-won't-go-away spectrum titles (which by the way are not necessarily unique)? If you want to talk about user unfriendliness, ask a user to find "index=1234" in an MGF file. :) But then ask them to find the spectrum title when it's not unique. Ah, the wonders of MGF.
    c. What about 3-array ion mobility formats? When we start suggesting (but not mandating, right?) combined 3-array spectra as the recommended representation, the nativeId format will be different depending on whether the spectra are in the 3-array representation or not. So for Waters, uncombined id would be:
    nativeId:1,0,123 (function=1 process=0 scan=123)
    and a combined id would be:
    nativeId:1,1,200 (function=1 scanStart=1 scanEnd=200)
    We could possibly disambiguate this as:
    nativeId:1,0,123
    combinedNativeId:1,1,200

  2. I can only say 2 things about this one:
    a. I don't think USIs are easily human readable (nor do I think they need to be). Taking the scan number out of a big USI mzspec:PXD013210:TTB20160722_ISBHJOMXX001879_r01:scan:19809 doesn't make the USI as a whole easily human readable.
    b. I don't really like giving Thermo special treatment. We have that 0,1 in the id for non-MS spectra. For MS spectra it's always going to be 0,1 (AFAIK). But if a user opened a file with both MS and PDA detectors, and maybe the PDA spectra showed up first instead of MS spectra, saying "find scan:123" is ambiguous. Of course in the contexts we work in it's almost certainly meant to be MS scan 123, but I don't see why Thermo should get special license to be ambiguous. Unfortunately our Thermo nativeID format uses numbers instead of strings to identify the detector (e.g. controllerType=0 instead of controllerType=MS, controllerType=5 instead of controllerType=PDA), but that's water under the bridge.

  3. So far each vendor would only have 1 dimension with a start/end. We discussed on the email thread that combining between scan times (frames or blocks) should be considered a processing step rather than a raw data representation. But it's also true that with enough arrays (with repeated values where necessary), an entire run could just be represented as a single spectrum. However, in either case (using ranges or separate start/end terms), new native ID formats might be needed for the combined mobility formats. We really ought to finalize that recommendation.
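
As a rough illustration of the Waters disambiguation from point 1c, a distinct prefix lets a parser pick the right key names for the same three-number shape. The combined format is not in the CV yet, so the key table and function name here are purely hypothetical.

```python
# Key names per prefix, following the Waters examples in point 1c.
WATERS_KEYS = {
    "nativeId": ("function", "process", "scan"),
    "combinedNativeId": ("function", "scanStart", "scanEnd"),
}


def expand_waters_id(usi_field):
    """Expand 'prefix:a,b,c' into the verbose 'key=value' nativeID string."""
    prefix, values = usi_field.split(":", 1)
    keys = WATERS_KEYS[prefix]
    parts = values.split(",")
    if len(parts) != len(keys):
        raise ValueError(f"expected {len(keys)} values for {prefix}, got {len(parts)}")
    return " ".join(f"{key}={value}" for key, value in zip(keys, parts))


print(expand_waters_id("nativeId:1,0,123"))          # function=1 process=0 scan=123
print(expand_waters_id("combinedNativeId:1,1,200"))  # function=1 scanStart=1 scanEnd=200
```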

@edeutsch edeutsch transferred this issue from another repository Feb 24, 2020