Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improvements to Bruker PASEF support #50

Merged
merged 17 commits into from
May 19, 2018

Conversation

chambm
Copy link
Member

@chambm chambm commented May 3, 2018

  • added combineIonMobilitySpectra support for Bruker TIMS data; it has special behavior compared to Waters and Agilent IMS data, and on PASEF MS2 data it's even more special:
    • for non-PASEF data, all mobility scans are merged in each frame, but the mobility values are kept as a 3rd binary data array for that merged spectrum; whenever m/z values would overlap between mobility scans, a small jitter (1e-8) is added to the duplicate values so they can be preserved without violating the m/z uniqueness constraint; thus all mobility information is preserved but in a much more compact representation
    • for PASEF data, MS1 frames are merged the same way as above, but MS2 frames are merged only within PASEF precursors (i.e. each precursor gets its own merged scan); MS2 scans that do not belong to a PASEF precursor are dropped entirely
    • merged spectrum ids are like 'merged=1234'; for PASEF data, the constituent ids (frame and scan) are preserved in the scanList element (but not for full frame merges, that would be too verbose)
  • added IonMobility field to SpectrumList_TitleMaker
  • fixed some bugs in SpectrumList_ScanSummer and gave it the ability to combine peaks (within an individual spectrum) that occur in very close proximity (1e-7); (the intent is that SpectrumList_ScanSummer will be used to prepare PASEF data for identification, so at this point the mobility dimension for fragment peaks is collapsed)
  • updated threshold filter to work on all arrays with the same size as defaultArrayLength, not just the two primary arrays (usually m/z and intensity)
  • fixed SeeMS issues with showing 'merged=1234' ids, and showing IMS heatmaps for files created with combineIonMobilitySpectra
  • added MSData::getArrayByCVID() convenience function for retrieving extra array identified by CVID
  • added attributes to msdata::TextWriter
  • removed runtime-debugging=off restriction for using Bruker API
  • fixed building outside a git clone or without git available
  • added a for --without-compassxtract builds

Ping @hroest

…s special behavior compared to Waters and Agilent IMS data, and on PASEF MS2 data it's even more special:

  -- for non-PASEF data, all mobility scans are merged in each frame, but the mobility values are kept as a 3rd binary data array for that merged spectrum; whenever m/z values would overlap between mobility scans, a small jitter (1e-8) is added to the duplicate values so they can be preserved without violating the m/z uniqueness constraint; thus all mobility information is preserved but in a much more compact representation
  -- for PASEF data, MS1 frames are merged the same way as above, but MS2 frames are merged only within PASEF precursors (i.e. each precursor gets its own merged scan); MS2 scans that do not belong to a PASEF precursor are dropped entirely
  -- merged spectrum ids are like 'merged=1234'; for PASEF data, the constituent ids (frame and scan) are preserved in the scanList element (for full frame merges, that would be too verbose)
- added IonMobility field to SpectrumList_TitleMaker
- fixed some bugs in SpectrumList_ScanSummer and gave it the ability to combine peaks (within an individual spectrum) that occur in very close proximity (1e-7); (the intent is that SpectrumList_ScanSummer will be used to prepare PASEF data for identification, so at this point the mobility dimension for fragment peaks is collapsed)
- updated threshold filter to work on all arrays with the same size as defaultArrayLength, not just the two primary arrays (usually m/z and intensity)
- fixed SeeMS issues with showing 'merged=1234' ids, and showing IMS heatmaps for files created with combineIonMobilitySpectra
* added MSData::getArrayByCVID() convenience function for retrieving extra array identified by CVID
* added <scan> attributes to msdata::TextWriter
* removed runtime-debugging=off restriction for using Bruker API
* fixed building outside a git clone or without git available
* added a <location-prefix> for --without-compassxtract builds
@chambm chambm requested review from bspratt and brendanx67 May 3, 2018 17:42
@chambm
Copy link
Member Author

chambm commented May 3, 2018

NB: mz5 currently ignores extra binaryDataArrays, so it will drop the mobility array.

Copy link
Member

@bspratt bspratt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

seems good - was the 1/k0 output for mgf in a different commit? Also, what's the jamroot change about? And we'd discussed adding a mechanism for Skyline to know whether a file is PASEF data or not, I don't think I see that here.

@chambm
Copy link
Member Author

chambm commented May 3, 2018

For MGFs: added IonMobility field to SpectrumList_TitleMaker
Jamroot change: fixed building outside a git clone or without git available

I forgot about the PASEF determination from Skyline. I'll hack it in there.

@bspratt
Copy link
Member

bspratt commented May 3, 2018

I was naively expecting to see a literal "1/k0=" string the in the code but I suppose it makes more sense to do it more generally - though it will need to declare units (or perhaps I'm missing something and it already does). It looks like this will get us into trouble if a 3rd type of ion mobility ever is introduced.

In any event it looks like I'll need to go back into BlibBuild to handle whatever we come up with as a 4th variant for Mascot results parsing.

@chambm
Copy link
Member Author

chambm commented May 3, 2018

TitleMaker allows maximum flexibility. The user can make the format whatever they want. Including:
--filter "titleMaker 1/K0=<IonMobility>"

I think it's fair to leave it up to the user to put the correct quantity type there although there's nothing difficult about adding extra fields like <IonMobilityQuantity> and <IonMobilityUnits>. Is it necessary? The MSData interface will provide units if you're reading straight from the raw data for MS1 quant.

@bspratt
Copy link
Member

bspratt commented May 3, 2018

I see - that should suffice. The units do matter in the context of BiblioSpec, but they're easily inferred from suitable formatted data as with your example --filter "titleMaker 1/K0=".

@chambm
Copy link
Member Author

chambm commented May 3, 2018

From Brendan email:

Doesn’t that seem a little brittle to make this depend on users getting a format string right?

For our users, I would prefer to have them just use MSConvertGUI and choose MGF and have that work.

I think you are imagining command line users.

The idea of "prepare this PASEF data for search by Mascot", which includes merging across retention time (using the scanSummer filter), is not something that should just happen by default by selecting MGF. That's too much magic for a generic conversion tool's default behavior. But I could see that kind of capability being baked into MSConvertGUI "presets", which would set the filters and other options in a certain way. But yeah, currently the GUI doesn't have combineIonMobilitySpectra, scanSummer, or a customizable titleMaker...it only uses it for the 'TPP compatibility' mode. So I'll need to add those at the very least for this to work from the GUI.

@chambm
Copy link
Member Author

chambm commented May 3, 2018

From Brian email:

I agree that some kind of reasonable default for including available ion mobility in the MGF in the absence of a titleMaker filter would be highly desirable. That's what I was initially imaging, which is why I was confused.

It's certainly do-able but doesn't solve the more significant issue with PASEF of merging the MS2s properly. But I'm also pretty sure we could ask 10 people what they think the default MGF title should be and get 10 different answers. This is why we have standards...

@brendanx67
Copy link
Contributor

brendanx67 commented May 3, 2018 via email

…o 1e-2 to more closely approximate the official Bruker peak merging, but eventually this threshold should be configurable
@@ -327,8 +360,24 @@ protected override void OnShown(EventArgs e)

double scanTime = (double) dgv[scanTimeColumn.Index, i].Value;
double ionMobility = (double) dgv[ionMobilityColumn.Index, i].Value;
if (ionMobility == 0)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bspratt At this point, when ionMobility == 0, I check whether the spectrum has a mobility array. If it does, then I add all the mobility bins from that array. For an MS1, all heatmap data comes from a single spectrum with 3 arrays rather than 1000 spectra with 2 arrays. The number of data points is that same but the performance is much better with 1 spectrum due to data locality.

bounds.MaxY = Math.Max(bounds.MaxY, bin.IonMobility);

heatmapPoints.Add(new Point3D(mz, bin.IonMobility, intensity));
var mobilityArray = mobilityBDA.data;
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bspratt Here's where I am adding each mz/mobility/intensity point to the heatmap. The heatmap has to know all the mobility bins in both cases, but with the combined representation, it populates them all from a single spectrum (for MS1; for PASEF MS2, there will still be multiple spectra, but far fewer).

Copy link
Member

@bspratt bspratt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW, in TitleMaker ionmobility might also reasonably include FAIMS compensation voltage

@chambm
Copy link
Member Author

chambm commented May 8, 2018

Agreed. I've got a request in for a CV change to add parent terms for ion mobility concepts so I can just look for a child of a single term. Also it'll allow putting those terms in mzIdentML files, so search engines can start putting it in structured output rather than title strings. The CV change seems unopposed so should be available soon. Title strings give me the heebie jeebies, but they're certainly a necessary evil right now because it's currently necessary from pepXML/mzIdentML to go back to the raw data (e.g. mzML) to get the ion mobility value.

chambm added 5 commits May 8, 2018 17:22
…that the client only wants to enumerate a specific MS level (currently only Bruker TDF uses this, for performance reasons) (from Brian)

* added sqlite3pp::has_table()
* reduced memory footprint of TDF index by using on-the-fly logic in SpectrumList_Bruker::find()
* moved Bruker TDF ion mobility metadata from Instant level to Fast, and merged scan numbers to Full
* added hasPASEF() overload for non-MSVC builds
@chambm
Copy link
Member Author

chambm commented May 15, 2018

@bspratt Did your ion mobility changes cause the Skyline test failures?

@bspratt
Copy link
Member

bspratt commented May 15, 2018 via email

@chambm
Copy link
Member Author

chambm commented May 15, 2018

My last push includes most of the patch you sent me, refactored from preferOmitMsLevel to preferOnlyMsLevel.

@bspratt
Copy link
Member

bspratt commented May 15, 2018 via email

… ignored the need to TIC collection in MS1 data
@bspratt
Copy link
Member

bspratt commented May 16, 2018 via email

@bspratt
Copy link
Member

bspratt commented May 16, 2018 via email

@bspratt
Copy link
Member

bspratt commented May 16, 2018 via email

@chambm
Copy link
Member Author

chambm commented May 16, 2018

The bug fix I mentioned was in my last push.

> filepath
"D:\\test\\Bruker\\tims\\BSA_50fmol_TIMS_InfusionESI_10prec.d\\analysis.tdf"
> var msd = new MSDataFile(filepath);
> var sl = msd.run.spectrumList;
> sl.size()
464013
> sl.spectrumIdentity(300000)
{pwiz.CLI.msdata.SpectrumIdentity}
    base_: 0x0000000027d87440
    id: "frame=306 scan=796"
    index: 300000
    owner_: {pwiz.CLI.msdata.SpectrumList}
    sourceFilePosition: 18446744073709551615
    spotID: ""
> sl.findAbbreviated("306.796")
300000

How do I run your failing test myself?

@bspratt
Copy link
Member

bspratt commented May 16, 2018 via email

@chambm
Copy link
Member Author

chambm commented May 16, 2018

Any ideas for how I figure out what aspect of non-PASEF features is the culprit? Non-PASEF functionality should be unaffected by the changes in this PR. The TDF file for the unit test had an out of date mzML (i.e. TDF now has a dedicated nativeID CV term, and using inverse reduced mobility CV term instead of drift time), but it's mostly the same as the mzML I generated today. It has the same number of scans and format of the native ID.

@bspratt
Copy link
Member

bspratt commented May 16, 2018 via email

@bspratt
Copy link
Member

bspratt commented May 16, 2018 via email

@chambm
Copy link
Member Author

chambm commented May 16, 2018 via email

* fixed TimsData::getSpectrumIndex() when preferOnlyMsLevel is non-zero (refactored vector of frames to a flat_map so we no longer use frame id as an index into a vector)
* updated VendorReaderTestHarness to allow testing vendor files with non-default Reader::Configs
* added PASEF data set to Bruker unit tests, and added Config tests for preferOnlyMsLevel 1/2 and combineIonMobilitySpectra on/off
* fixed CompassDataTest build dependencies
@chambm
Copy link
Member Author

chambm commented May 17, 2018

OK, pull the latest changes. The find() issues should be fixed. I also updated VendorReaderTestHarness to allow testing with non-default Reader configs, so this kind of thing will be caught in the core tests in the future.
I've also added a PASEF file to the unit tests by trimming down a large tdf and tdf_bin. I edited the TDF SQLite file to delete frames with id > 6 (and references to them), set the number of scans in each frame to smaller number (i.e. from 985 to 350 for ms2, and 100 for ms1), then used Process Monitor to see what file offsets were being accessed when enumerating the full file. I was able to simply truncate the file after the last accessed offset (plus 100kb of buffer after the last offset). I'm pretty happy with the result: I may try this technique for other vendors to see if we can expand the test coverage for them.

@bspratt
Copy link
Member

bspratt commented May 17, 2018 via email

@bspratt
Copy link
Member

bspratt commented May 17, 2018 via email

@chambm
Copy link
Member Author

chambm commented May 17, 2018

ARGH! Of course. Because the ms1 filtered file only has 1 frame! 😱

@bspratt
Copy link
Member

bspratt commented May 17, 2018 via email

@chambm
Copy link
Member Author

chambm commented May 18, 2018

Hmm, false alarm. I can't reproduce the find() bug anymore. Either on the full HeLa file or the trimmed one (or a newer trimmed one where I included 17 frames instead of 6). Can you give more detail about the error? There are some other issues I need to take care of with the unit test, but those are related to mz5 and out of date test mzMLs.

@bspratt
Copy link
Member

bspratt commented May 18, 2018 via email

@chambm
Copy link
Member Author

chambm commented May 18, 2018

Has Skyline cached the spectrum index somewhere rather than the abbreviated id? Those indexes certainly won't be the same with/without preferOnlyMsLevel set.

@bspratt
Copy link
Member

bspratt commented May 18, 2018 via email

@chambm
Copy link
Member Author

chambm commented May 18, 2018

The trouble seems to be that MsDataFileImpl::GetSpectrumIndex(string id)
returns a different value depending on whether MS2 prefiltering is on

That's correct. SpectrumList.find(SpectrumList.spectrumIdentity(SpectrumList.size()-1).id) will return a different number depending on how preferOnlyMsLevel is set, and it should always be less than SpectrumList.size() (which would mean the id wasn't found).

and MsDataFileImpl:: GetSpectrum(int spectrumIndex) seems not to be dealing
what that difference.

This is what I don't understand and haven't been able to reproduce since the find() fix. Is there a SpectrumList wrapper in play here?

So probably one or the other is dealing or failing to deal with the gaps in the indexing?

@bspratt
Copy link
Member

bspratt commented May 18, 2018 via email

@bspratt
Copy link
Member

bspratt commented May 18, 2018 via email

@bspratt
Copy link
Member

bspratt commented May 18, 2018 via email

bspratt and others added 2 commits May 18, 2018 11:12
…eImpl which could result in the ion mobility peak detector asking for MS2 data after promising not to do so.
…endorReaderTestHarness for non-mzML conversions

- fixed bogus references to CompassXtract API for BAF and TDF files (which now use Baf2Sql and TIMS SDK, respectively)
- removed redundant scan times and ranges from combined PASEF spectra's scanLists (for all but the first scan)
@bspratt
Copy link
Member

bspratt commented May 18, 2018 via email

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants