
Feature/dynamic main score #110

Merged
merged 16 commits into master on Dec 15, 2022

Conversation


@singjc singjc commented Dec 13, 2022

Dynamic selection of the main score during the first pass of semi-supervised learning.

Error when scoring using XGBoost

pyprophet score --in /data/yanliu_I170114_040_PhosNoco10_SW.osw --out /data/test/yanliu_I170114_040_PhosNoco10_SW.osw --classifier XGBoost --xeval_num_iter 3 --ss_num_iter 3 --level ms1ms2 --threads 3 --ss_initial_fdr 0.15 --ss_iteration_fdr 0.05
Info: Enable number of transitions & precursor / product charge scores for XGBoost-based classifier
Info: Learn and apply classifier from input data.
Warning: Column var_mi_ratio_score contains only invalid/missing values. Column will be dropped.
Warning: Column var_elution_model_fit_score contains only invalid/missing values. Column will be dropped.
Warning: Column var_im_xcorr_shape contains only invalid/missing values. Column will be dropped.
Warning: Column var_im_xcorr_coelution contains only invalid/missing values. Column will be dropped.
Warning: Column var_im_delta_score contains only invalid/missing values. Column will be dropped.
Warning: Column var_sonar_lag contains only invalid/missing values. Column will be dropped.
Warning: Column var_sonar_shape contains only invalid/missing values. Column will be dropped.
Warning: Column var_sonar_log_sn contains only invalid/missing values. Column will be dropped.
Warning: Column var_sonar_log_diff contains only invalid/missing values. Column will be dropped.
Warning: Column var_sonar_log_trend contains only invalid/missing values. Column will be dropped.
Warning: Column var_sonar_rsq contains only invalid/missing values. Column will be dropped.
Warning: Column var_ms1_im_ms1_delta_score contains only invalid/missing values. Column will be dropped.
Info: Data set contains 7854 decoy and 7993 target groups.
Info: Summary of input data:
Info: 78574 peak groups
Info: 15847 group ids
Info: 37 scores including main score
Info: Semi-supervised learning of weights:
Info: Start learning on 3 folds using 3 processes.
Info: Learning on cross-validation fold.
Info: Learning on cross-validation fold.
Info: Learning on cross-validation fold.
Error: The estimated pi0 <= 0. Check that you have valid p-values or use a different range of lambda. Current lambda range: [0.1  0.15 0.2  0.25 0.3  0.35 0.4  0.45]
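For context, the failing pi0 estimate is a Storey-style estimator over the lambda range printed in the error. A minimal sketch (not pyprophet's exact implementation, which additionally smooths and extrapolates the raw values) shows where the estimate comes from:

```python
import numpy as np

def estimate_pi0(p_values, lambdas=np.arange(0.1, 0.5, 0.05)):
    """Storey-style raw pi0 estimates: the fraction of p-values above
    each lambda threshold, rescaled by (1 - lambda). pyprophet then
    smooths/extrapolates these raw values; when a weak main score
    piles the p-values up near 0 or 1, the extrapolated estimate
    can fall to <= 0, which triggers the error above."""
    p = np.asarray(p_values)
    raw = np.array([np.mean(p > lam) / (1.0 - lam) for lam in lambdas])
    return lambdas, raw
```

With roughly uniform p-values the raw estimates sit near 1; a score that barely separates targets from decoys skews them, and the smoothed estimate can dip to or below zero.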

I usually run into this issue when trying to score with the XGBoost classifier. The usual suggestion is to change --ss_main_score, but to find a better starting main score you either have to test each score one by one, or adjust --ss_initial_fdr and --ss_iteration_fdr until the error no longer occurs.

I found that this issue occurs when the top decoy data and best target data do not have enough separation, or are very unbalanced after applying the cutoff on the top target scores (in select_train_peaks). The resulting classifier score used in the iterative semi-supervised learning then fails to separate decoys from targets. For example:

Info: For Training main_var_xcorr_shape - td_peaks ((3928, 43)) and bt_peaks ((787, 43))

Score Distribution input for start_semi_supervised_learning


Score Distribution input for iter_semi_supervised_learning


Even if you supply --ss_main_score with the score that usually performs best (main_var_mi_weighted_score), it can still fail depending on the fold and the data split.

Second Learning Fold when setting --ss_main_score var_mi_weighted_score


Solution

I tried a couple of different approaches, but I found the approach below to be the best:

  1. When calling start_semi_supervised_learning:
    1. Before training on the default/selected main score, first permute through the candidate scores to find the one that gives the most balanced data between top decoy peaks and best target peaks (i.e. the smallest difference in the number of rows) after applying the cutoff in select_train_peaks.
    2. Use the score with the smallest difference between the number of best targets and top decoys to generate scores for training.
    3. Train on these scores.
  2. After the initial semi-supervised learning:
    1. Update the training dataframe and the experiment dataframe to set the best identified score (from 1.ii) as the main score.
  3. Proceed with the iterations of semi-supervised learning.
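The selection in step 1 can be sketched roughly as follows; the column names, the quantile-based cutoff, and the balance criterion are simplified stand-ins for pyprophet's actual select_train_peaks logic:

```python
import numpy as np
import pandas as pd

def pick_balanced_main_score(df, score_cols, fdr=0.15):
    """Hypothetical sketch of the selection described above: for each
    candidate score, apply a select_train_peaks-style cutoff and keep
    the score whose top-decoy / best-target training sets are most
    balanced in size."""
    best_col, best_diff = None, np.inf
    for col in score_cols:
        # top-scoring peak group per precursor id, split by label
        top = df.loc[df.groupby("group_id")[col].idxmax()]
        td = top[top["decoy"] == 1]
        # crude stand-in for the FDR-based target cutoff: keep targets
        # scoring above the (1 - fdr) quantile of the top decoys
        cutoff = td[col].quantile(1.0 - fdr)
        bt = top[(top["decoy"] == 0) & (top[col] > cutoff)]
        diff = abs(len(td) - len(bt))
        if diff < best_diff:
            best_col, best_diff = col, diff
    return best_col
```

The score minimizing the size difference between the two training subsets is then promoted to main score before the first training round.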

I added an additional boolean parameter to the score module (--main_score_selection_report, default False, i.e. no report is generated) to optionally save a PDF report of the individual score distributions and p-value distributions (like the plots above) computed in error_statistics. The individual score reports are appended to the same file using PyPDF2. If the user runs the fold learning across multiple threads, N separate reports are generated instead, since writing to a single file would cause write conflicts. I find these reports useful for debugging purposes.
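The threading caveat can be sketched like this: rather than letting parallel folds append to one PDF via PyPDF2 (which would race on the file), each process writes its own report. The naming scheme below is a hypothetical illustration, not the PR's exact code:

```python
import os

def fold_report_path(base="main_score_selection_report.pdf"):
    """Give each learning process its own report file so that parallel
    folds never append to the same PDF at once; with --threads N this
    yields N report files instead of one."""
    stem, ext = os.path.splitext(base)
    return f"{stem}_pid{os.getpid()}{ext}"
```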

Comparisons

I ran some tests on a single file from the IPF paper: yanliu_I170114_040_PhosNoco10_SW.osw

I compare LDA and XGBoost + main score=var_mi_weighted_score for PyProphet (v2.1.12) to LDA, XGBoost and XGBoost + main score=var_mi_weighted_score for this implementation (vdynamic). I couldn't test XGBoost for PyProphet (v2.1.12) because it always errors out during pi0 estimation.


Total numbers of IDs are computed from pyprophet statistics, averaged over 20 runs of the scoring, IPF and context workflows (the error bars show the standard deviation). The changes don't apply to LDA, so I tested it to make sure the LDA results are still consistent. Numbers are fairly consistent between XGBoost + main score (v2.1.12), XGBoost (vdynamic) and XGBoost + main score (vdynamic).

Mean Feature Importance


Updated parameter in XGBClassifier

The warning below occurs because silent is no longer a parameter in newer versions of XGBoost; it has been replaced by verbosity, which, when set to 0, is equivalent to silent.

[05:21:28] WARNING: ../src/learner.cc:767: 
Parameters: { "silent" } are not used.
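The change amounts to swapping the parameter when configuring the classifier; a minimal illustration (the dict is a hypothetical excerpt, not pyprophet's full parameter set):

```python
# `silent` was removed from recent XGBoost versions and only produces
# the "Parameters: { "silent" } are not used" warning; verbosity=0 is
# the equivalent replacement.
xgb_params = {"verbosity": 0}  # previously: {"silent": True}
```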

@grosenberger

Fantastic work, thank you! This is a very useful extension for many applications.

@grosenberger grosenberger merged commit 265fca4 into PyProphet:master Dec 15, 2022