Timsrust sage error #15

Closed
tomthun opened this issue Feb 10, 2024 · 21 comments

@tomthun

tomthun commented Feb 10, 2024

TimsTOF .d directories are now natively supported by sage: lazear/sage#117 (comment), using timsrust to load them.
I just ran some quick tests on some test data and got the following:

(base) PS D:\Data\tools\SAGE> sage .\current_config.json
[2024-02-09T11:03:06Z INFO sage] generated 111120583 fragments, 5992882 peptides in 7285ms
[2024-02-09T11:03:06Z INFO sage] processing files 0 .. 1
thread 'main' panicked at C:\Users\runneradmin\.cargo\registry\src\index.crates.io-6f17d22bba15001f\timsrust-0.2.0\src\file_readers\common\sql_reader.rs:30:62:
called `Result::unwrap()` on an `Err` value: SqliteFailure(Error { code: Unknown, extended_code: 1 }, Some("incomplete input"))
note: run with RUST_BACKTRACE=1 environment variable to display a backtrace

I use the current_config.json for sage.
The developer of sage suggests that this is an issue in Bruker's timsrust library, hence I wanted to ask here if there is a solution?

Edit: The raw data can be found here. Sorry, I had the wrong settings for the file, but it should now work.

@sander-willems-bruker
Collaborator

Dear @tomthun, this seems to be diaPASEF data rather than ddaPASEF. You are correct that this is indeed not yet supported. We might look into it in the future, but please note that this is not a trivial task, as the data format is vastly different.

@tomthun
Author

tomthun commented Feb 16, 2024

@sander-willems-bruker Thank you very much for looking into this. Support for diaPASEF would be much appreciated!

Edit: Do you have a rough estimate when it could be available?

@jspaezp
Contributor

jspaezp commented Feb 20, 2024

@tomthun I am not 100% sure what you mean by that. Reading diaPASEF is supported in timsrust (albeit with some limitations and at a reasonably low level). If, on the other hand, you are asking for diaPASEF support in sage, I am unsure whether there are any plans in the near future to support DIA data in general in sage.

Would you mind elaborating?

@tomthun
Author

tomthun commented Feb 20, 2024

I actually thought sage would already support DIA (only roughly, but with the chimeric deconvolution and dynamic tolerance features I thought it would be possible). It is also listed under the features:

I am currently a bit confused about what is actually supported and what is not. To me it seems that DIA is generally supported. Maybe @lazear could elaborate.

I would be very happy if DIA got full support, if possible! :)

@lazear
Contributor

lazear commented Feb 20, 2024

There is nothing preventing you from theoretically converting your diaPASEF files to mzML/MGF and searching them with Sage - Sage has no concept of PASEF internally though, so I don't know how the results will look. You can also go and search Thermo/Agilent/etc data in DIA/WWA mode... Bruker is just a bit special 😉

Currently, only reading ddaPASEF .d files is supported, since they can easily be converted to the same internal representation of a spectrum as data from the other vendors.

@jspaezp
Contributor

jspaezp commented Feb 20, 2024

Thanks for clarifying @tomthun!

You are right @lazear, when I mentioned supporting DIA I meant more the dia-umpire or peptide-centric search side of things (sorry for disregarding your contribution to open search MS 😉 )

Going into why Bruker is special and why converting ddaPASEF is trivial but diaPASEF is not (correct me if I am wrong ... this is more my intuition of the problem than a hard answer): ddaPASEF, by virtue of being targeted in the ion mobility space, can be converted to a regular "spectrum" by just collapsing the mobility dimension, thus generating a "profile" scan that can be centroided. On the other hand, if the ion mobility is not targeted, collapsing the ion mobility dimension would lead to a lot of information loss, which makes this conversion less practical (and in my experience noisy enough not to be usable in practice).
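To make the "collapsing" concrete, here is a minimal, hypothetical Rust sketch (not the timsrust API, which works on flat offset arrays instead of a map) that sums intensities per TOF index across all scans of a frame:

use std::collections::BTreeMap;

// Hypothetical illustration: collapse the ion mobility dimension of one
// frame by summing, per TOF index, the intensities observed across all of
// its scans. `scans` holds (tof_index, intensity) pairs per mobility scan.
fn collapse_mobility(scans: &[Vec<(u32, u64)>]) -> Vec<(u32, u64)> {
    let mut summed: BTreeMap<u32, u64> = BTreeMap::new();
    for scan in scans {
        for &(tof, intensity) in scan {
            *summed.entry(tof).or_insert(0) += intensity;
        }
    }
    // BTreeMap iterates in key order, so the output is sorted by TOF
    // index, i.e. by m/z.
    summed.into_iter().collect()
}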

I would love to hear your thoughts on "centroiding approaches" viable for diaPASEF!

Schematic of "reading diaPASEF data"
[image]

@GeorgWa
Collaborator

GeorgWa commented Feb 20, 2024

It might come as some surprise, but there are actually packages for this: https://github.com/MannLabs/timspeak for clustering and https://github.com/MannLabs/alphasynchro for pseudo-MSMS generation. The packages work both for synchroPASEF and diaPASEF. The workflow was never published as part of a paper, but it is based on very rigorous work from @swillems.

@KlemensFroehlich

KlemensFroehlich commented Feb 23, 2024

Could you collapse an entire PASEF window into a 2D centroided information table? Basically, just create one giant mass spectrum that ultimately adds up hundreds of individual mass scans?

[image]

I have no idea what the data structure looks like or whether this would make sense, but from a technological point of view:

I think the information of the ion mobility axis can definitely be compressed / information can be lost, as ion mobility inherently has a very low separation power for different ions compared to the m/z axis.

So I think it would be totally okay to collapse all ions of a PASEF window into just one single MS scan with centroided m/z info and centroided ion mobility info.

This should also drastically reduce file size, correct?

I am probably misunderstanding something here, so please feel free to ignore me ;) Or is this already happening with the clustering in timspeak? Does timspeak actually support writing output to mzML format?

@lazear

There is nothing preventing you from theoretically converting your diaPASEF files to mzML/MGF and searching them with Sage - Sage has no concept of PASEF internally though, so I don't know how the results will look.

A 5 minute DIA run is 1.4 GB in size as a .d folder. Converted with standard MSConvert settings it is 25 GB... I could of course activate gzip on top next time, but for a 30 min gradient it would still be a huge file.
I am currently running SAGE on the 25 GB mzML. Will update later how that looks.

There is the timsconvert package https://github.com/gtluu/timsconvert which somehow generates smaller mzML files, but when I push those through sage, it cannot find any precursors with q_val < 0.01. It does find around 25,000 precursors in the 5 min gradient, but all of them have a q_val > 0.11... I guess they also use a custom annotation of ion mobility, which is probably not supported by sage?

I know this is not a priority of yours, but I would still love to hear your thoughts! I can of course also share the data if you are interested!

Best, Klemens

Edit: a somewhat clearer illustration.

@jspaezp
Contributor

jspaezp commented Feb 23, 2024

@KlemensFroehlich I believe we are saying the same thing, same as @GeorgWa. But the point I was trying to make (and apparently didn't convey correctly) was that the exact way in which this centroiding needs to happen is not a trivial decision (there are 19 hyperparameters in the timspeak implementation). I would love to know if @sander-willems-bruker has any guidelines on which method and parameters would be "good enough" for most purposes, or more specifically why the centroiding has been implemented only for ddaPASEF MS2 scans.

best!

@GeorgWa
Collaborator

GeorgWa commented Feb 23, 2024

Yes @KlemensFroehlich, what you get is in principle a table, which is represented as a list of spectra with defined mobility and RT.

The workflow itself is very robust but, as you note @jspaezp, it's not a full end-to-end search engine. Ideally, there would be a minimal required parameter set, and all other parameters would be optimized in a feedback loop with the search.

Based on my experience, centroiding is quite hard to get right and there are no universal parameters. It usually works best if done as a well-controlled optimization task.

@KlemensFroehlich

KlemensFroehlich commented Feb 24, 2024

hi everyone,

thanks for clearing things up for me!

@GeorgWa if you aim for an iterative approach of centroiding based on the search results, this means it is not your intention to "just" centroid (and provide an mzML after doing so). It would rather be a long-term goal to provide an end-to-end solution from .d folder to search/quant results?
Can I ask again whether timspeak can export mzML? I just see other formats as output.

Going on a rant here, please again ignore me if I ask very stupid questions:
Again, I am not an expert, but I can hardly believe that high accuracy is needed for the ion mobility centroiding... I would actually bet that low accuracy in the ion mobility centroiding would only lead to a minimal loss of identifications (if at all). Ion mobility has a REALLY low resolution / peak capacity.

[image]

Just looking at a background signal here of a highly abundant precursor, which roughly spans from 0.66 to 0.76: most relevant peptide ions are located between 0.7 and 1.3 1/K0.
Even assuming that a regular peptide only spans 0.05 1/K0, this effectively gives us a peak capacity in the dozens (roughly 0.6 / 0.05 = 12).
Chromatography gives us peak capacity in the hundreds,
mass resolution in the thousands.

So please forgive my naivety when I ask:
Do you really need 19 hyperparameters, or even an iterative approach, to perfect centroiding when one dimension has a really low resolution?

Has anyone tried simple binning of the ion mobility dimension? Maybe with overlap of the bin boundaries? This would also be a potentially super quick approach, would reduce file size tremendously, and would still preserve the ion mobility info.

This seems to be implemented in MSConvert. Is that not sufficient here, or what are your thoughts on this?
[image]
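For illustration, a hypothetical Rust sketch of such overlapped binning (not an existing implementation; the bin width and stride are made-up parameters): each peak is copied into every bin whose 1/K0 range covers it, so a stride smaller than the width gives overlapping boundaries.

// Hypothetical overlapped binning of the ion mobility dimension.
// Each peak (1/K0, tof_index, intensity) is copied into every bin whose
// [start, start + width) range covers its 1/K0 value.
struct ImBin {
    start: f64,
    width: f64,
    peaks: Vec<(u32, u64)>, // (tof_index, intensity)
}

fn bin_mobility(
    peaks: &[(f64, u32, u64)], // (1/K0, tof_index, intensity)
    im_min: f64,
    im_max: f64,
    width: f64,  // e.g. 0.05 1/K0
    stride: f64, // stride < width gives overlapping bin boundaries
) -> Vec<ImBin> {
    let mut bins: Vec<ImBin> = Vec::new();
    let mut start = im_min;
    while start < im_max {
        bins.push(ImBin { start, width, peaks: Vec::new() });
        start += stride;
    }
    for &(im, tof, intensity) in peaks {
        for bin in bins.iter_mut() {
            if im >= bin.start && im < bin.start + bin.width {
                bin.peaks.push((tof, intensity));
            }
        }
    }
    bins
}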

Best Klemens

@GeorgWa
Collaborator

GeorgWa commented Feb 26, 2024

Hi Klemens,

Yes, one would fine-tune the hyperparameters based on confident identifications. I'm speaking from a purely theoretical perspective though; we are not planning this at the moment.

To my knowledge, timspeak only supports MGF.

I share your intuition about the resolution of the ion mobility dimension. I think the main benefit manifests on the fragment level, where the strong correlation between precursor m/z and mobility is not an issue.

Let's do two things: I will ping Sander and ask him whether there is a way to perform centroiding only on the ion mobility level, not on RT, and without connecting clusters. I'm also happy to discuss the general matter in more detail on Zoom if you like. @jspaezp, @tomthun, and whoever else is interested are of course invited to join.

Best,
Georg

@sander-willems-bruker
Collaborator

Hi all. It's been a while since this topic was active, but I finally managed to make some headway on this issue. In the upcoming timsrust 0.3.0, DIA windows can be read as if they were spectra. Annotating with sage directly seems to work out of the box with this! Results are not really impressive compared to peptide-centric analyses, but that might also be because I use a very slim (10-20 aa, no missed cleavages, ...) sage/fasta config for quick testing. The good news is that it takes less than a minute to find ~15K peptides in a 22 min HeLa DIA gradient directly from the raw .d folder (1.7 GB), so it is very useful for some quick QC!

@sander-willems-bruker
Collaborator

PR for sage now available: lazear/sage#140

@jspaezp
Contributor

jspaezp commented Jul 8, 2024

Took me a while to track down how it's handled (since so much changed from 0.2 -> 0.3)... so for anyone else who might want to know more, the implementation is here:

impl RawSpectrumReaderTrait for DIARawSpectrumReader {
    fn get(&self, index: usize) -> RawSpectrum {
        // Each "spectrum" corresponds to one expanded quadrupole setting,
        // i.e. one isolation window within one frame.
        let quad_settings = &self.expanded_quadrupole_settings[index];
        let collision_energy = quad_settings.collision_energy[0];
        let isolation_mz = quad_settings.isolation_mz[0];
        let isolation_width = quad_settings.isolation_width[0];
        let scan_start = quad_settings.scan_starts[0];
        let scan_end = quad_settings.scan_ends[0];
        let frame_index = quad_settings.index - 1;
        let frame = self.frame_reader.get(frame_index);
        // Slice out all peaks of the scans covered by this isolation window.
        let offset_start = frame.scan_offsets[scan_start] as usize;
        let offset_end = frame.scan_offsets[scan_end] as usize;
        let tof_indices = &frame.tof_indices[offset_start..offset_end];
        let intensities = &frame.intensities[offset_start..offset_end];
        // Collapse the mobility dimension: sum intensities per TOF index.
        let (raw_tof_indices, raw_intensities) = group_and_sum(
            tof_indices.iter().map(|x| *x).collect(),
            intensities.iter().map(|x| *x as u64).collect(),
        );
        let raw_spectrum = RawSpectrum {
            tof_indices: raw_tof_indices,
            intensities: raw_intensities,
            index: index,
            collision_energy,
            isolation_mz,
            isolation_width,
        };
        raw_spectrum
    }
}

aaaand ...

pub fn group_and_sum<T: Ord + Copy, U: std::ops::Add<Output = U> + Copy>(
    groups: Vec<T>,
    values: Vec<U>,
) -> (Vec<T>, Vec<U>) {
    if groups.len() == 0 {
        return (vec![], vec![]);
    }
    // Sort indices by group key, then walk the sorted order and sum the
    // values of consecutive entries that share the same group.
    let order: Vec<usize> = argsort(&groups);
    let mut new_groups: Vec<T> = vec![];
    let mut new_values: Vec<U> = vec![];
    let mut current_group: T = groups[order[0]];
    let mut current_value: U = values[order[0]];
    for &index in &order[1..] {
        let group: T = groups[index];
        let value: U = values[index];
        if group != current_group {
            // New group encountered: flush the finished one.
            new_groups.push(current_group);
            new_values.push(current_value);
            current_group = group;
            current_value = value;
        } else {
            current_value = current_value + value;
        };
    }
    new_groups.push(current_group);
    new_values.push(current_value);
    (new_groups, new_values)
}

So if I understand correctly... all scans within a frame that share a quad isolation window will be combined using a simple addition of intensities for every one of the TOF indices, right @sander-willems-bruker?
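For concreteness, a tiny hypothetical usage example of the group_and_sum quoted above (assuming argsort returns the indices that would sort its input ascending): duplicate TOF indices are merged and their intensities summed, and the output comes back sorted by TOF index (hence by m/z).

// Hypothetical example; these values are made up for illustration.
let tof_indices: Vec<u32> = vec![500, 100, 500, 100, 250];
let intensities: Vec<u64> = vec![10, 1, 20, 2, 5];
let (merged_tof, merged_int) = group_and_sum(tof_indices, intensities);
assert_eq!(merged_tof, vec![100, 250, 500]);
assert_eq!(merged_int, vec![3, 5, 30]);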

also ... would you mind providing the sage config you used?

@sander-willems-bruker
Collaborator

Fully correct @jspaezp! To the best of my knowledge, this is how it is "traditionally" also done for Sciex and Thermo instruments that do not have an IMS dimension. As far as this topic goes, the illustration from @KlemensFroehlich is as close as it gets, although there is no IM centroiding; everything is just straight up projected onto the TOF (m/z) axis.

(quoting @KlemensFroehlich above: "could you collapse an entire pasef window into a 2D centroided information table? [...] This should also drastically reduce file size, correct?")

@sander-willems-bruker
Collaborator

sander-willems-bruker commented Jul 9, 2024

The config I use for functionality testing of DIA (not to get the highest numbers out):
config_dia.json.txt.

Side note, since @KlemensFroehlich actually mentioned the use case specifically: short gradients tend to yield overly complex DIA "spectra". I can get some results on a 5 min gradient, but longer gradients definitely yield less complex spectra, which are far easier to annotate (at least with the search settings I use).

@wfondrie

wfondrie commented Jul 9, 2024

Hi all 👋 - I'm just curious whether we've explored "centroiding" (I use that term loosely here) the collapsed spectrum along the TOF indices. I could imagine that the same ion measured across multiple spectra in the ion mobility domain could have slightly different TOF indices. If we combined these into a single peak in the collapsed spectrum, it could potentially boost signal-to-noise.

There are a few ways we could do it, but since we're already dealing with a discrete representation of m/z, perhaps something like a simplified version of the peak-picking algorithm used by PTM-Shepherd for aggregating mass shifts in open-modification searches would work well:

PTM-Shepherd picks peaks based on a mixture of peak prominence and signal-to-noise remainder (SNR) as measures of quality and quantification, respectively. A peak's prominence is calculated as the ratio of its apex to the more intense of either its left or right shoulder, found by following a peak downward monotonically (Fig. 1). To improve monotonicity for this procedure, adjacent histogram bins are temporarily grouped into small sets and flattened to the minimum bin height within the set, with set size internally calculated based on the total number of histogram bins. Peaks are called when their prominence exceeds 0.3 (by default). A peak's SNR is calculated with a 0.004 Da sliding window (by default) against a background of 0.005 Da on either side (scaling linearly with peak picking width). The average height per histogram bin is computed for the signal and noise regions, and then the signal remainder is calculated by subtracting off the noise. From this list of peaks, the top 500 by SNR (by default) are sent to downstream processing. Peak boundaries are considered to be either the observed peak boundary or the defined precursor tolerance, whichever is closer to the apex.
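As a strawman, a hypothetical sketch of the simplest form of this idea: merging runs of peaks whose TOF indices lie within a small tolerance into one intensity-weighted centroid. PTM-Shepherd's prominence/SNR machinery would sit on top of something like this; the tolerance is a made-up parameter.

// Merge peaks whose TOF indices are within `tol` of their neighbour into a
// single peak at the intensity-weighted mean index. Input must be sorted
// by TOF index, as produced by the collapse step quoted earlier.
fn merge_tof_neighbours(peaks: &[(u32, u64)], tol: u32) -> Vec<(u32, u64)> {
    let mut out: Vec<(u32, u64)> = Vec::new();
    let mut run: Vec<(u32, u64)> = Vec::new();
    for &(tof, intensity) in peaks {
        if let Some(&(last_tof, _)) = run.last() {
            if tof - last_tof > tol {
                // Gap too large: close the current run as one centroid.
                out.push(weighted_centroid(&run));
                run.clear();
            }
        }
        run.push((tof, intensity));
    }
    if !run.is_empty() {
        out.push(weighted_centroid(&run));
    }
    out
}

fn weighted_centroid(run: &[(u32, u64)]) -> (u32, u64) {
    let total: u64 = run.iter().map(|&(_, i)| i).sum();
    let weighted: u64 = run.iter().map(|&(t, i)| t as u64 * i).sum();
    ((weighted / total.max(1)) as u32, total)
}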

Feel free to ignore this comment if it's out-of-scope 😃

@KlemensFroehlich

Hi all,
this sounds really promising. Thank you so much for working on this. I will definitely try this later this week, and maybe see whether we can use this for QC (one has to be careful with these quick-and-dirty searches though: sometimes differences in quick-and-dirty analyses are not reflected in a "deep" analysis and therefore may not be indicative of changes in system performance).

Over the last couple of months I have thought about this at length, and I actually think I have an idea of how to improve this spectra aggregation vastly (says the uninformed non-informatician who does not know about nitty-gritty details such as pesky programming or the reality thereof :P).

Okay so the "problem" I see with my own suggestion and the current implementation is noise.

[image]

I propose to only use a fraction of the scans with the same quadrupole isolation margins.
Judging from some contamination peaks, I would say typical peaks have elution profiles of around 0.05 to 0.1 1/K0.

[image]

with a "rolling average" scheme like so:

[image]

the idea being that we capture practically all features nicely while still reducing the number of MS2 scans tremendously.
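A hypothetical sketch of that rolling scheme (window and step are made-up parameters; step < window gives the overlap from the illustration): instead of collapsing all scans of an isolation window into one spectrum, emit one summed spectrum per overlapping block of scans.

// Hypothetical rolling aggregation: emit one summed spectrum per
// overlapping block of `window` consecutive mobility scans, advancing by
// `step` scans each time.
fn rolling_spectra(
    scans: &[Vec<(u32, u64)>], // per-scan (tof_index, intensity) pairs
    window: usize,
    step: usize,
) -> Vec<Vec<(u32, u64)>> {
    let mut spectra = Vec::new();
    let mut start = 0;
    while start + window <= scans.len() {
        let mut summed = std::collections::BTreeMap::new();
        for scan in &scans[start..start + window] {
            for &(tof, intensity) in scan {
                *summed.entry(tof).or_insert(0u64) += intensity;
            }
        }
        // Sorted by TOF index thanks to the BTreeMap.
        spectra.push(summed.into_iter().collect());
        start += step;
    }
    spectra
}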

Now, I am saying "aggregated scans" because I stumbled over this cool publication a while back:
https://doi.org/10.1002/pmic.202300234

Would it be possible to implement this or a similar algorithm here?

If I understand correctly, the slight changes in m/z should be taken care of and outliers should be rejected (which would be random noise that is not repeatedly observed between adjacent MS2 spectra).

Will's idea might also work here?

Best Klemens

@sander-willems-bruker
Collaborator

sander-willems-bruker commented Jul 10, 2024

(quoting @wfondrie above: "I'm just curious if we've explored 'centroiding' [...] the collapsed spectrum along the TOF indices. [...]")

Dear @wfondrie, this is actually being done to some extent in the TOF/m/z domain. @jspaezp's message (#15 (comment)) only shows how the initial raw spectrum is created in the DIA case. After that creation, it is always smoothed and centroided, regardless of whether it is DIA or DDA:
https://github.com/MannLabs/timsrust/blob/main/src/io/readers/spectrum_reader/tdf.rs#L59
However, I do agree we could come up with alternative centroiding algorithms. I have some simple ideas, but am more than happy to take PRs implementing this as well ;)
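For intuition only, a toy smooth-then-centroid pass over a dense, TOF-indexed intensity vector (explicitly not the timsrust algorithm; see the linked tdf.rs for the real implementation, which differs in its details):

// Toy sketch: moving-average smoothing followed by local-maximum picking
// over a dense intensity vector indexed by TOF bin.
fn smooth_and_centroid(intensities: &[u64], half_window: usize) -> Vec<(usize, u64)> {
    let n = intensities.len();
    // Moving-average smoothing.
    let mut smooth = vec![0u64; n];
    for i in 0..n {
        let lo = i.saturating_sub(half_window);
        let hi = (i + half_window + 1).min(n);
        let sum: u64 = intensities[lo..hi].iter().sum();
        smooth[i] = sum / (hi - lo) as u64;
    }
    // Keep local maxima of the smoothed trace as centroids, reporting the
    // raw (unsmoothed) intensity at each apex.
    let mut centroids = Vec::new();
    for i in 1..n.saturating_sub(1) {
        if smooth[i] > 0 && smooth[i] >= smooth[i - 1] && smooth[i] > smooth[i + 1] {
            centroids.push((i, intensities[i]));
        }
    }
    centroids
}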

@sander-willems-bruker
Collaborator

(quoting @KlemensFroehlich above: "I propose to only use a fraction of the scans with the same quadrupole isolation margins [...] with a 'rolling average' scheme [...] Would it be possible to implement this or a similar algorithm here?")

I have had comparable ideas as well. It is actually not that difficult to implement, and we could play around with this reasonably well. My main concern with this would be transparency and user-friendliness of the code. It will require at least a parameter to determine the aggregation width (although I agree there might be some trivial options, this still needs to be transparent for the end user), and perhaps even more (e.g. for overlapping scans such as scan 2, which has 50% overlap with 1 and 3 in your illustration). That said, I do want to emphasize that timsrust is only intended to be a quick library for TIMSTOF data reading. If we want to enable more elaborate processing steps, I would be in favor of doing this in a separate crate altogether. I would not even be opposed to moving all spectrum processing as it is currently implemented to such a crate completely, since in my opinion this is already not raw data anymore.
