
Feature/dynamic main score #110

Merged
merged 16 commits into master on Dec 15, 2022

Conversation


@singjc singjc commented Dec 13, 2022

Dynamic selection of the main score during the first pass of semi-supervised learning.

Error when scoring using XGBoost

pyprophet score --in /data/yanliu_I170114_040_PhosNoco10_SW.osw --out /data/test/yanliu_I170114_040_PhosNoco10_SW.osw --classifier XGBoost --xeval_num_iter 3 --ss_num_iter 3 --level ms1ms2 --threads 3 --ss_initial_fdr 0.15 --ss_iteration_fdr 0.05
Info: Enable number of transitions & precursor / product charge scores for XGBoost-based classifier
Info: Learn and apply classifier from input data.
Warning: Column var_mi_ratio_score contains only invalid/missing values. Column will be dropped.
Warning: Column var_elution_model_fit_score contains only invalid/missing values. Column will be dropped.
Warning: Column var_im_xcorr_shape contains only invalid/missing values. Column will be dropped.
Warning: Column var_im_xcorr_coelution contains only invalid/missing values. Column will be dropped.
Warning: Column var_im_delta_score contains only invalid/missing values. Column will be dropped.
Warning: Column var_sonar_lag contains only invalid/missing values. Column will be dropped.
Warning: Column var_sonar_shape contains only invalid/missing values. Column will be dropped.
Warning: Column var_sonar_log_sn contains only invalid/missing values. Column will be dropped.
Warning: Column var_sonar_log_diff contains only invalid/missing values. Column will be dropped.
Warning: Column var_sonar_log_trend contains only invalid/missing values. Column will be dropped.
Warning: Column var_sonar_rsq contains only invalid/missing values. Column will be dropped.
Warning: Column var_ms1_im_ms1_delta_score contains only invalid/missing values. Column will be dropped.
Info: Data set contains 7854 decoy and 7993 target groups.
Info: Summary of input data:
Info: 78574 peak groups
Info: 15847 group ids
Info: 37 scores including main score
Info: Semi-supervised learning of weights:
Info: Start learning on 3 folds using 3 processes.
Info: Learning on cross-validation fold.
Info: Learning on cross-validation fold.
Info: Learning on cross-validation fold.
Error: The estimated pi0 <= 0. Check that you have valid p-values or use a different range of lambda. Current lambda range: [0.1  0.15 0.2  0.25 0.3  0.35 0.4  0.45]
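For context, the failing pi0 estimate is a Storey-style estimator over the lambda range printed in the error. A minimal sketch (not pyprophet's exact implementation, which additionally smooths and extrapolates the raw values) shows where the estimate comes from:

```python
import numpy as np

def estimate_pi0(p_values, lambdas=np.arange(0.1, 0.5, 0.05)):
    """Storey-style raw pi0 estimates: the fraction of p-values above
    each lambda threshold, rescaled by (1 - lambda). pyprophet then
    smooths/extrapolates these raw values; when a weak main score
    piles the p-values up near 0 or 1, the extrapolated estimate
    can fall to <= 0, which triggers the error above."""
    p = np.asarray(p_values)
    raw = np.array([np.mean(p > lam) / (1.0 - lam) for lam in lambdas])
    return lambdas, raw
```

With roughly uniform p-values the raw estimates sit near 1; a score that barely separates targets from decoys skews them, and the smoothed estimate can dip to or below zero.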

I usually run into this issue when trying to score with the XGBoost classifier. The usual suggestion is to change --ss_main_score, but to find a better starting main score you either have to test each score one by one, or adjust --ss_initial_fdr and --ss_iteration_fdr until the error no longer occurs.

I found that this issue occurs when the top decoy data and best target data do not have enough separation, or are very unbalanced after applying the cutoff on the top target scores (in select_train_peaks). The resulting classifier score used in the iterative semi-supervised learning then fails to separate decoys from targets. For example:

Info: For Training main_var_xcorr_shape - td_peaks ((3928, 43)) and bt_peaks ((787, 43))

Score Distribution input for start_semi_supervised_learning


Score Distribution input for iter_semi_supervised_learning


Even if you supply --ss_main_score with the score that usually performs best (main_var_mi_weighted_score), it can still fail depending on the fold and the data split.

Second Learning Fold when setting --ss_main_score var_mi_weighted_score


Solution

I tried a couple of different approaches, but I found the approach below to be the best:

  1. When calling start_semi_supervised_learning:
    1. Before training on the default/selected main score, first permute through the candidate scores to find the one that gives the most balanced data between top decoy peaks and best target peaks (i.e. the smallest difference in the number of rows) after applying the cutoff in select_train_peaks.
    2. Use the score with the smallest difference between the number of best targets and top decoys to generate scores for training.
    3. Train on these scores.
  2. After the initial semi-supervised learning:
    1. Update the training dataframe and the experiment dataframe to set the best identified score (from 1.ii) as the main score.
  3. Proceed with the iterations of semi-supervised learning.
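The selection in step 1 can be sketched roughly as follows; the column names, the quantile-based cutoff, and the balance criterion are simplified stand-ins for pyprophet's actual select_train_peaks logic:

```python
import numpy as np
import pandas as pd

def pick_balanced_main_score(df, score_cols, fdr=0.15):
    """Hypothetical sketch of the selection described above: for each
    candidate score, apply a select_train_peaks-style cutoff and keep
    the score whose top-decoy / best-target training sets are most
    balanced in size."""
    best_col, best_diff = None, np.inf
    for col in score_cols:
        # top-scoring peak group per precursor id, split by label
        top = df.loc[df.groupby("group_id")[col].idxmax()]
        td = top[top["decoy"] == 1]
        # crude stand-in for the FDR-based target cutoff: keep targets
        # scoring above the (1 - fdr) quantile of the top decoys
        cutoff = td[col].quantile(1.0 - fdr)
        bt = top[(top["decoy"] == 0) & (top[col] > cutoff)]
        diff = abs(len(td) - len(bt))
        if diff < best_diff:
            best_col, best_diff = col, diff
    return best_col
```

The score minimizing the size difference between the two training subsets is then promoted to main score before the first training round.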

I added an additional boolean parameter to the score module (--main_score_selection_report, default False, i.e. no report is generated) to optionally save a PDF report of the individual score distributions and p-value distributions (like the plots above) computed in error_statistics. The individual score reports are appended to the same file using PyPDF2. If the user runs the fold learning across multiple threads, N separate reports are generated instead, since writing to a single file would cause write conflicts. I find these reports useful for debugging purposes.
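The threading caveat can be sketched like this: rather than letting parallel folds append to one PDF via PyPDF2 (which would race on the file), each process writes its own report. The naming scheme below is a hypothetical illustration, not the PR's exact code:

```python
import os

def fold_report_path(base="main_score_selection_report.pdf"):
    """Give each learning process its own report file so that parallel
    folds never append to the same PDF at once; with --threads N this
    yields N report files instead of one."""
    stem, ext = os.path.splitext(base)
    return f"{stem}_pid{os.getpid()}{ext}"
```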

Comparisons

I ran some tests on a single file from the IPF paper: yanliu_I170114_040_PhosNoco10_SW.osw

I compare LDA and XGBoost + main score=var_mi_weighted_score for PyProphet (v2.1.12) to LDA, XGBoost and XGBoost + main score=var_mi_weighted_score for this implementation (vdynamic). I couldn't test XGBoost for PyProphet (v2.1.12) because it always errors out during pi0 estimation.


Total numbers of IDs are computed from pyprophet statistics, averaged over 20 runs of the scoring, IPF and context workflows (the error bars show the standard deviation). The changes don't apply to LDA, so I tested it to make sure the LDA results are still consistent. Numbers are fairly consistent between XGBoost + main score (v2.1.12), XGBoost (vdynamic) and XGBoost + main score (vdynamic).

Mean Feature Importance


Updated parameter in XGBClassifier

The warning below occurs because silent is no longer a parameter in newer versions of XGBoost; it has been replaced by verbosity, which, when set to 0, is equivalent to silent.

[05:21:28] WARNING: ../src/learner.cc:767: 
Parameters: { "silent" } are not used.
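The change amounts to swapping the parameter when configuring the classifier; a minimal illustration (the dict is a hypothetical excerpt, not pyprophet's full parameter set):

```python
# `silent` was removed from recent XGBoost versions and only produces
# the "Parameters: { "silent" } are not used" warning; verbosity=0 is
# the equivalent replacement.
xgb_params = {"verbosity": 0}  # previously: {"silent": True}
```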

@grosenberger

Fantastic work, thank you! This is a very useful extension for many applications.

@grosenberger grosenberger merged commit 265fca4 into PyProphet:master Dec 15, 2022