Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add sampled noise arrays from Thermo RAW files #165

Closed
wants to merge 19 commits into from

Conversation

rfellers
Copy link

This PR follows up on an email conversation I had with @chambm over a year ago about adding noise support for Thermo RAW files. Specifically, I'm populating 3 binary arrays with "sampled" noise arrays, which are substantially shorter than the m/z and intensity arrays. These arrays are consistent with the way RAW files store noise and are useful for spectral visualization and other downstream processing.

I have done some initial performance comparisons with and without the modification:

File RAW Size mzML Size mzML Size w/ Noise % Increase
Internal MS/MS standard 242MB 290MB 292MB 0.7
PT4530-1 from here 2.55GB 2.36GB 2.58GB 9.3
File Benchmark (avg. 3 runs)* Benchmark w/ Noise (avg. 3 runs)* % Increase
Internal MS/MS standard 9.8 13.6 38.8
PT4530-1 from here 97.4 116.8 19.9

* Real time in seconds from msbenchmark.

I ran the following commands to generate the data above:
msconvert.exe .\file.raw -n
msbenchmark.exe spectra full-data .\file.raw

What are your thoughts given this outcome? Thanks!

@bspratt bspratt requested a review from chambm August 22, 2018 17:09
@chambm
Copy link
Member

chambm commented Aug 22, 2018

Thanks for the PR Ryan. Unfortunately I am in the process of updating the 64-bit pwiz to use RawFileReader for Thermo files. So this code will need to be updated when that gets merged. Doubly unfortunately, in order to continue 32-bit compatibility, RawFile.cpp will become a mess of #ifdef WIN64 to choose between the RawFileReader syntax and MSFileReader syntax.

Other than that, the performance implications are moot as long as you add an "opt-in" mechanism. This could either be an extra DetailLevel above DetailLevel_BinaryData (e.g. DetailLevel_BinaryDataWithExtras, or DetailLevel_VerboseBinaryData), or a flag passed to Reader::Config (e.g. Reader::Config::addNoiseData).

float baseline;
float intensity;
};

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest using three vector since that's what gets stored in the MSData model.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I understand what you're saying. I was just copying the example of MZIntensityPair. Are you suggesting that I use 3 specific vectors (noise m/z, noise intensity, noise baseline) instead? Perhaps it would be even better if SpectrumList_Thermo.cpp called a generic "setBinaryArray" method on MSData 3 times instead. If other readers ever want to respect the --addNoiseData flag, they may want to set a different number or types of binary arrays for noise.

Copy link
Member

@chambm chambm Aug 23, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think a setBinaryDataArray is a good idea (which takes a CVID and creates/updates the corresponding BinaryDataArray). Also swapBinaryDataArray. And yes, do that 3 times with 3 vector<double>s.

@rfellers
Copy link
Author

Thanks for your comments. No worries on updating the 64 bit support for RAW files to use RawFileReader. Actually, we've been using that internally for years, so I'm even more familiar with the API. When the time comes I'll merge your changes and fix up RawFile.cpp for noise handling.

I've added a new --addNoiseData flag for msconvert in my latest commit.

@@ -560,6 +567,9 @@ struct PWIZ_API_DECL Spectrum : public SpectrumIdentity, public ParamContainer
/// set binary data arrays
void setMZIntensityPairs(const MZIntensityPair* input, size_t size, CVID intensityUnits);

/// set sampled noise data arrays
void setNoiseInfo(const NoiseDataInfo* input, size_t size);

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With set/swapBinaryDataArray you can get rid of setNoiseInfo. It'll be easy enough to do all the calls from SpectrumList_Thermo.

@rfellers
Copy link
Author

I've got this working for 64 bit, but I still need to test for 32 bit (didn't have MSFileReader installed on my development box). I'll test it tomorrow on another VM to make sure 32 bit is OK.

@rfellers
Copy link
Author

rfellers commented Jan 2, 2019

I was having trouble building x86 in early December, but now things are working for me locally. I've built msconvert in x86 and x64 and tested that it creates noise arrays when --addNoiseData is passed. A couple things:

  • The Core Windows x86 CI build failed. Things work for me locally and I don't see much in the output, maybe the problem is
    [09:19:43][Step 1/1] ...failed common.mkdir Z:\pwiz\build-nt-x86\pwiz\data\vendor_readers\ABI\Reader_ABI_Test.test\msvc-14.1\rls\async-excpt-on\lnk-sttc\thrd-mlt...
  • I haven't added any unit tests. If you'd like that, could you point me in the right direction? I'd guess Reader_Thermo_Test.cpp, but it looks like I'd have to modify more to actually test noise reading

@chambm
Copy link
Member

chambm commented Jan 4, 2019

A unit test would be good. You would have to modify VendorReaderTestHarness.cpp too since there's a new Reader::Config variable. The problem is we need a small test file with some FT scans. Our current file is only IT scans. Are you able to get such a file? If so, then after making the code modifications you'd just run that generate_thermo_mzml.bat script to regenerate the test data tarball. You'd want to manually check the new mzML to make sure it has what you expect to be there.

@brendanx67
Copy link
Contributor

Can we wrap this up? Either merge or close the pull request. At least it is on a branch if we want to come back to it, but it is our oldest pull request.

@rfellers
Copy link
Author

rfellers commented Nov 7, 2019

Apologies all. I got stuck on the unit test, put it down for a bit, and then lost track of it. Should we simply merge without a test to get this cleared? Or I'm glad to give the unit test one more go. Either way, I'm back up to date with your master branch.

@brendanx67
Copy link
Contributor

In my opinion, we should not merge without a test. Rather we should drop it from the pull request list and raise it as a PR once it is ready to go. If you can get it ready to Matt's satisfaction on this next push, then we can merge it. Otherwise, it can stay on a branch, but not as a PR.

@chambm
Copy link
Member

chambm commented Nov 7, 2019

Do you have a small FT file to make a unit test with? If so I'll take a crack at adding the test myself.

@rfellers
Copy link
Author

rfellers commented Nov 7, 2019

Thanks for offering to create the unit test! I have a standards file here that is FTMS/ITMS MS1 with a Top1 MS2, 95MB. Off hand I can't think of anything smaller. If that is too big I'm happy to make as many single scan RAW files as you see fit.

@chambm
Copy link
Member

chambm commented Nov 7, 2019

We'd like something less than 2mb. So it can be a few scans or cycles rather than just 1, but not anything like a useful-for-analytical-chemists length. If you could create such a file that'd be wonderful! If you can put some MSX and SPS in there, even better. 🥇 The FT should be profile (or at least one of them should be) but the IT can be all centroid.

…eList() to RawFileThreadImpl, removed unneeded casting and addNoiseData assignment
@rfellers
Copy link
Author

rfellers commented Nov 7, 2019

Sadly, using Thermo's QualBrowser, I can't merge multiple scans into a single file, only export 1 scan at a time. So, the attached ZIP contains 3 RAW files (ITMS MS1, FTMS MS1, FTMS MS2), totaling under 1MB in size. Sorry, no SPS ... don't think I've ever seen a file with that. I included the ITMS because it is an example of a scan that doesn't contain noise data (and therefore shouldn't include anything in the resulting file).

ThermoRAWScans.zip

@chambm
Copy link
Member

chambm commented Nov 7, 2019

You can create RAW files with QualBrowser?!

Mind...
blown...

Must check this out.

@rfellers
Copy link
Author

rfellers commented Nov 7, 2019

Yup, right click on bottom panel, Export ->Write to RAW file

image

@chambm
Copy link
Member

chambm commented Nov 7, 2019

Hmph. Would be a lot more useful if I could create a cycle of scans. But this will certainly work for testing noise extraction and FT centroiding. Thanks for pointing it out!

@chambm chambm mentioned this pull request Nov 22, 2019
@chambm
Copy link
Member

chambm commented Nov 22, 2019

Superseded by #737

@chambm chambm closed this Nov 22, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants