
[FFID][tweak] thoughts on settings/scores #7130

Closed

jpfeuffer wants to merge 6 commits into develop from jpfeuffer-patch-7

Conversation

@jpfeuffer
Contributor

@jpfeuffer jpfeuffer commented Oct 15, 2023

Added some thoughts on improvements with settings and scores.
TODO check ElutionModelFitter at the end of FFID regarding imputation/regression of intensities for unfittable features in low concentration environments.

Description

Checklist

  • Make sure that you are listed in the AUTHORS file
  • Add relevant changes and new features to the CHANGELOG file
  • I have commented my code, particularly in hard-to-understand areas
  • New and existing unit tests pass locally with my changes
  • Updated or added python bindings for changed or new classes (Tick if no updates were necessary.)

How can I get additional information on failed tests during CI?

If your PR is failing, you can check out:
  • The details of the action statuses at the end of the PR or the "Checks" tab.
  • http://cdash.openms.de/index.php?project=OpenMS and look for your PR. Use the "Show filters" capability on the top right to search for your PR number.
    If you click in the column that lists the failed tests you will get detailed error messages.

Advanced commands (admins / reviewer only)

  • /reformat (experimental) applies the clang-format style changes as additional commit. Note: your branch must have a different name (e.g., yourrepo:feature/XYZ) than the receiving branch (e.g., OpenMS:develop). Otherwise, reformat fails to push.
  • setting the label "NoJenkins" will skip tests for this PR on Jenkins (saves resources, e.g., on edits that do not affect tests)
  • commenting with rebuild jenkins will retrigger Jenkins-based CI builds

⚠️ Note: Once you have opened a PR, try to minimize the number of pushes to it, as every push will trigger CI (automated builds and tests) and is rather heavy on our infrastructure (e.g., if several pushes per day are performed).

@jpfeuffer jpfeuffer marked this pull request as draft October 15, 2023 16:50
@timosachsenberg
Contributor

Thanks!
I see quite a lot of tests with changed intensities (usually lower, but in ProteomicsLFQ they seem to be higher) or a slightly lower number of features.
Do you know the reason for that? For lower-intensity features, I would suspect the baseline filter?
Or do you know if the other parameters affect feature extraction / fitting in a way such that intensities are different?
https://github.com/OpenMS/OpenMS/pull/7130/files#diff-8398c3d124cd05cff73da64570f549b25172b2feb680f2773e6292f91b69e32dR436-R439

Also pinging @cbielow @jcharkow @hroest. Maybe someone can provide some input here, as I am not super familiar with the OpenSWATH parameters.

@jpfeuffer
Contributor Author

jpfeuffer commented Oct 15, 2023

I just added some more notes. I don't think my settings have a lot of impact if elution_model is still on.
FFID will fit a combined model with all traces per feature in the end and overwrite the SWATH intensities.
PeakIntegrator "recently" received an update for SmartPeak. Maybe those intensities, together with background subtraction, are better now than when FFID was originally written.
Maybe ElutionModelFitter should therefore be disabled (although the joint fitting seems interesting). ElutionModelFitter is also the algorithm that imputes intensities for features where the combined model fit does not succeed: it does a regression/interpolation between the above-mentioned intensities from SWATH's PeakIntegrator (the input) and the integrated areas of the combined EGH model from ElutionModelFitter where the fit succeeded.
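For illustration, the regression-based imputation idea could look roughly like this (a hypothetical, simplified sketch; struct and function names are my own inventions, not the actual ElutionModelFitter code):

```cpp
// Hypothetical sketch of regression-based intensity imputation.
// NOT the actual OpenMS/ElutionModelFitter implementation; all names
// here are invented for illustration.
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

struct Feature {
  double raw_intensity;   // intensity from peak integration (always present)
  double model_intensity; // area of the fitted elution model; < 0 == fit failed
};

// Ordinary least squares: model_intensity ~ a + b * raw_intensity,
// estimated only from features where the model fit succeeded.
void fit_line(const std::vector<Feature>& fs, double& a, double& b) {
  double sx = 0, sy = 0, sxx = 0, sxy = 0;
  double n = 0;
  for (const auto& f : fs) {
    if (f.model_intensity < 0) continue; // skip failed fits
    sx += f.raw_intensity;
    sy += f.model_intensity;
    sxx += f.raw_intensity * f.raw_intensity;
    sxy += f.raw_intensity * f.model_intensity;
    n += 1;
  }
  b = (n * sxy - sx * sy) / (n * sxx - sx * sx);
  a = (sy - b * sx) / n;
}

// Replace missing model intensities by the regression prediction.
void impute(std::vector<Feature>& fs) {
  double a = 0, b = 0;
  fit_line(fs, a, b);
  for (auto& f : fs)
    if (f.model_intensity < 0)
      f.model_intensity = a + b * f.raw_intensity;
}
```

Features with a failed fit thus get a model intensity predicted from the linear relationship between raw (integrator) and fitted-model intensities of the successful features.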

@jpfeuffer
Contributor Author

jpfeuffer commented Oct 15, 2023

If my settings indeed had an effect, it must mean that, e.g., the elution_model_score is now missing and the combined SWATH LDA prescore is affected (it includes the elution_model_score). Apparently the pre-scoring then has an effect on which peak groups are turned into features (e.g., through a cutoff, or through the quality sorting when multiple peak groups are found in a transition group).

However, I don't think it makes sense to hypothesise on the number of features or absolute differences. It can only be compared reliably on a ground-truth dataset.

@jpfeuffer
Contributor Author

jpfeuffer commented Oct 15, 2023

It is also a bit unfortunate that we have 3 different fitting steps in the pipeline, even though fitting is the computationally most expensive step 🥲 I disabled 2 of 3 for now. It would be great if the results were stored and re-used (although ElutionModelFitter fits multiple traces at the same time and is therefore different).

The problem with disabling "only imputation" in ElutionModelFitter (even though that is possible) is that, IMO, the intensities of features without a successful model fit will not really be comparable to the rest.

@timosachsenberg
Contributor

> It is also a bit unfortunate that we have 3 different fitting steps in the pipeline despite being the computationally most expensive step 🥲 I disabled 2 of 3 for now. Would be great if the results were stored and re-used (although ElutionModelFitter fits multiple traces at the same time and is therefore different).
>
> The problem with disabling "only imputation" in ElutionModelFitter (despite being possible) is that IMO intensities of features without a successful model fit will not really be comparable to the rest.

OK, this sounds to me like we might keep the interpolation but maybe have a better way to detect features below the LOQ?
I currently try to do this using the target/decoy approach in FFID plus an SVM outside FFID.
There seem to be multiple ways to achieve this and many parameters (e.g., what is the impact of the baseline estimation?). Any idea, @cbielow @hroest, how to move forward here?

Some background:

The main reason is that we report too many features, even ones below the LOQ.

@jpfeuffer
Contributor Author

jpfeuffer commented Oct 16, 2023

Yes, I think filtering out some of the low-intensity features is probably more helpful, but what's a bit unfortunate is that the ElutionModelFitter does not support background subtraction (which I think would help tremendously with the remaining ones).
It therefore basically defeats any previously performed background subtraction from OpenSwath (except for failed features, where the OpenSwath intensity still plays a role in imputation).

Maybe adding the algorithms from PeakIntegrator into EMF would be worthwhile.
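For context, a minimal form of such a background subtraction (an assumed simplification of the idea, not the actual PeakIntegrator code) subtracts a linear baseline drawn between the peak boundaries from the trapezoidal peak area:

```cpp
// Hypothetical sketch of linear-baseline background subtraction.
// NOT the actual PeakIntegrator code; an assumed simplification of the idea.
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Trapezoidal peak area minus the area under the straight line connecting
// the two peak boundary points (a simple linear baseline).
double background_subtracted_area(const std::vector<double>& rt,
                                  const std::vector<double>& intensity) {
  double area = 0;
  for (std::size_t i = 1; i < rt.size(); ++i)
    area += 0.5 * (intensity[i] + intensity[i - 1]) * (rt[i] - rt[i - 1]);
  double baseline = 0.5 * (intensity.front() + intensity.back()) *
                    (rt.back() - rt.front());
  return area - baseline;
}
```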

@jcharkow
Collaborator

> It is also a bit unfortunate that we have 3 different fitting steps in the pipeline despite being the computationally most expensive step 🥲 I disabled 2 of 3 for now. Would be great if the results were stored and re-used (although ElutionModelFitter fits multiple traces at the same time and is therefore different).
>
> The problem with disabling "only imputation" in ElutionModelFitter (despite being possible) is that IMO intensities of features without a successful model fit will not really be comparable to the rest.
>
> ok this sounds to me we might keep the interpolation but maybe have a better way to detect features below LOQ? I currently try to do this using the target/decoy approach in FFID + a SVM outside FFID. There seems to be multiple ways to achieve this and many parameters (e.g., what is the impact of the baseline estimation?). Any idea @cbielow @hroest how to move forward here.
>
> Some background:
>
> Main reason is that we report too many features even if below the LOQ.

Interesting findings. I'm looking at single-cell/dilution-series DIA data and seeing something similar: many features are below the LOQ. Most are filtered out with FDR control; however, it's not great that they are present to begin with.

// TODO I wonder if the following parameters would be enough.
// In theory we only care for one feature per one set of extracted chromatograms (transition group)
//params.setValue("stop_report_after_feature", 1); // best by quality, after scoring
//params.setValue("TransitionGroupPicker:stop_after_feature", 1); // best by intensity, after picking, before scoring
Contributor


Hmm, I did not look into it, but if we have one feature with smaller intensity but much lower RT error, it could get lost. @hendrikweisser, do you recall?

Contributor


Having only one feature candidate certainly wouldn't work with the SVM-based rescoring/FDR estimation approach in FFId. If you're not using this functionality you could extract only one feature, but I agree with Timo that RT deviation is often the most important criterion. That certainly holds if you detect features in the same file where the peptide IDs were generated: then the feature candidate overlapping the ID is always assumed to be the correct one. (So a nice optimisation may be to detect a single peak/feature starting at the ID position and moving outward.)

@hendrikweisser
Contributor

Some additional comments from reading this thread:

> ElutionModelFitter fits multiple traces at the same time and is therefore different

At least in theory this is important, because it should reduce the impact of "interference", where something high-intensity overlaps with one of the mass traces of a feature. The "raw" intensity of that mass trace may then be significantly wrong, but when fitting over multiple traces this should be evened out.
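A toy sketch of this evening-out effect (my own illustration, not OpenMS code): if one feature abundance is estimated jointly from several traces with known expected ratios, a distorted trace only shifts the estimate partially instead of dominating it:

```cpp
// Toy illustration (not OpenMS code): least-squares estimate of a single
// feature abundance A from several traces observed as y_i ≈ A * r_i,
// where r_i are the expected relative trace ratios.
// Minimizing sum_i (y_i - A * r_i)^2 gives A = sum(r_i*y_i) / sum(r_i^2),
// so a spike in one trace only shifts A proportionally to that trace's
// weight instead of being taken at face value.
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

double joint_abundance(const std::vector<double>& ratios,
                       const std::vector<double>& observed) {
  double num = 0, den = 0;
  for (std::size_t i = 0; i < ratios.size(); ++i) {
    num += ratios[i] * observed[i];
    den += ratios[i] * ratios[i];
  }
  return num / den;
}
```

With expected ratios {1.0, 0.5, 0.25} and a large spike on the third trace, the joint estimate deviates far less than reading the spiked trace at face value would.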

> The problem with disabling "only imputation" in ElutionModelFitter (despite being possible) is that IMO intensities of features without a successful model fit will not really be comparable to the rest.

Intuitively this imputation step should be computationally very cheap (it's just a linear regression!) compared to the rest of the feature detection, so I'm surprised it's even a consideration for optimisation.

@timosachsenberg
Contributor

timosachsenberg commented Nov 12, 2024

In test FFID 5 this feature for peptide LC(Carbamidomethyl)VLHEK/2 is missing because of
params.setValue("TransitionGroupPicker:background_subtraction", "exact"); :
[screenshots]

/home/sachsenb/Development/OpenMS-build/bin/FeatureFinderIdentification "-test" "-in" "/home/sachsenb/Development/OpenMS/src/tests/topp/FeatureFinderIdentification_1_input.mzML" "-id" "/home/sachsenb/Development/OpenMS/src/tests/topp/FeatureFinderIdentification_1_input.idXML" "-out" "FeatureFinderIdentification_5.tmp.featureXML" "-candidates_out" "FeatureFinderIdentification_5_candidates.tmp.featureXML" "-extract:mz_window" "0.1" "-extract:batch_size" "10" "-detect:peak_width" "60" "-model:type" "none"

Peptide LC(Carbamidomethyl)VLHEK/2 (m/z: 449.744):
PeakPickerChromatogram.cpp(79):  ====  Picking chromatogram LC(Carbamidomethyl)VLHEK/2_i1 with 224 peaks (start at RT 1657.05 to RT 2087.49) using method 'corrected'
PeakPickerChromatogram.cpp(79):  ====  Picking chromatogram LC(Carbamidomethyl)VLHEK/2_i2 with 224 peaks (start at RT 1657.05 to RT 2087.49) using method 'corrected'
MRMFeatureFinderScoring.cpp(572): Scoring feature RT: 1782.34 MZ: 449.744 INT: 1.48314e+06 == LC(Carbamidomethyl)VLHEK/2 [ expected RT 1656.05 / 1656.05 ] with 2 transitions and 2 chromatograms
MRMFeatureFinderScoring.cpp(572): Scoring feature RT: 1961.13 MZ: 449.744 INT: 217331 == LC(Carbamidomethyl)VLHEK/2 [ expected RT 1656.05 / 1656.05 ] with 2 transitions and 2 chromatograms
MRMFeatureFinderScoring.cpp(572): Scoring feature RT: 1920.52 MZ: 449.744 INT: 6375.41 == LC(Carbamidomethyl)VLHEK/2 [ expected RT 1656.05 / 1656.05 ] with 2 transitions and 2 chromatograms

This feature seems to elute multiple times...
[screenshot]

@jpfeuffer
Contributor Author

jpfeuffer commented Nov 12, 2024

Seems to elute a long time to me 😅

@timosachsenberg
Contributor

How should we deal with those?
I think I would be fine with removing those (via background correction) if it improves results on UPS.

@jpfeuffer
Contributor Author

Looks like something that would distort quantities if not correctly normalized.

@timosachsenberg
Contributor

timosachsenberg commented Dec 16, 2024

From @pjones using the default options

| Replicate | Source | Cond  | Baseline | PR7130 | Diff |
|-----------|--------|-------|----------|--------|------|
| 1         | UPS    | 12500 | 45       | 45     | 0    |
| 1         | YEAST  | 12500 | 828      | 827    | -1   |
| 2         | YEAST  | 125   | 865      | 862    | -3   |
| 3         | UPS    | 25000 | 47       | 47     | 0    |
| 3         | YEAST  | 25000 | 833      | 836    | 3    |
| 4         | UPS    | 2500  | 16       | 16     | 0    |
| 4         | YEAST  | 2500  | 838      | 833    | -5   |
| 5         | YEAST  | 250   | 857      | 856    | -1   |
| 6         | UPS    | 50000 | 48       | 48     | 0    |
| 6         | YEAST  | 50000 | 803      | 801    | -2   |
| 7         | UPS    | 5000  | 32       | 32     | 0    |
| 7         | YEAST  | 5000  | 819      | 822    | 3    |
| 8         | YEAST  | 500   | 833      | 830    | -3   |
| 9         | UPS    | 50    | 1        | 1      | 0    |
| 9         | YEAST  | 50    | 937      | 932    | -5   |

Here are some plots from an older version (probably without the SVM filter), from bigbio/quantms#301.

@jpfeuffer
Contributor Author

jpfeuffer commented Dec 17, 2024

Nice to see there is progress.
I'd suggest ordering by concentration and filling missing values with 0.

And lastly, what I did back then: create a grid of pairwise fold changes and compare it to the expected fold changes (according to the concentration ratios). That also validates quantification, not just identification. I did it with two box plots per grid cell, but you could do it differently.
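Such a grid could be sketched like this (a hypothetical helper, not the original analysis code; all names are made up):

```cpp
// Hypothetical helper (not the original analysis code): grid of observed
// minus expected pairwise log2 fold changes across conditions of a
// dilution series; cells near 0 mean quantification tracks the known
// concentration ratios.
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

struct Condition {
  double concentration;      // known spiked-in concentration
  double measured_intensity; // summarized intensity (e.g. per protein)
};

std::vector<std::vector<double>>
fold_change_error_grid(const std::vector<Condition>& conds) {
  const std::size_t n = conds.size();
  std::vector<std::vector<double>> grid(n, std::vector<double>(n, 0.0));
  for (std::size_t i = 0; i < n; ++i)
    for (std::size_t j = 0; j < n; ++j) {
      const double observed = std::log2(conds[i].measured_intensity /
                                        conds[j].measured_intensity);
      const double expected = std::log2(conds[i].concentration /
                                        conds[j].concentration);
      grid[i][j] = observed - expected;
    }
  return grid;
}
```

Per-cell box plots over peptides (as described above) would then summarize the distribution of these errors for each condition pair.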

@jpfeuffer
Contributor Author

jpfeuffer commented Dec 17, 2024

Ideally, and I never got to it, we could do one grid for every UPS protein with data points for each (found) peptide.
This would be better for debugging than aggregated on protein level. But maybe this is the next step.

@timosachsenberg
Contributor

Closing this long-running research PR. The key actionable items have been captured in #8886 (redundant elution model fitting, background subtraction exposure, score evaluation).

The branch is preserved if anyone needs to reference the code. Thanks @jpfeuffer for the analysis and discussion — the insights about fitting redundancy, background subtraction inconsistency, and score evaluation are well-documented in the new issue.


4 participants