Feature/dynamic main score #110
Merged: grosenberger merged 16 commits into PyProphet:master from singjc:feature/dynamic_main_score on Dec 15, 2022
…rain_peaks that results in a balanced td_peaks to bt_peaks
… to verbosity param
…ing iteration learning
…ing iteration learning
Fantastic work, thank you! This is a very useful extension that will be useful for many applications.
Dynamic selection of main score during initial first pass semi-supervised learning
Error when scoring using XGBoost
I usually run into this issue when trying to score using the XGBoost classifier. The usual suggestion is to change `--ss_main_score`, but to find a better starting main score you either need to test each score one by one, or adjust `ss_initial_fdr` and `ss_iteration_fdr` until the error no longer occurs.
I found that this issue occurs when the top decoy data and best target data are not well separated, or are very unbalanced after applying the cutoff of the top target scores (in `select_train_peaks`). The resulting classifier score used in iterative semi-supervised learning then fails to separate decoys from targets. For example:
[Figure: Score distribution input for start_semi_supervised_learning]
[Figure: Score distribution input for iter_semi_supervised_learning]
Even if you supply `--ss_main_score` with the score that usually performs best (`main_var_mi_weighted_score`), it might still fail, depending on the fold and the data split.
[Figure: Second learning fold when setting `--ss_main_score var_mi_weighted_score`]
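The failure mode described above (poor separation or a lopsided training set after the cutoff) can be checked numerically. The following is only a minimal sketch under stated assumptions: the function name `training_set_diagnostics`, the plain-list inputs, and the two summary measures (a size ratio and a standardized mean difference) are hypothetical illustrations, not PyProphet's implementation of `select_train_peaks`.

```python
from statistics import mean, pstdev


def training_set_diagnostics(target_scores, decoy_scores, cutoff):
    """Summarize how usable a main score looks for the first learning pass.

    Hypothetical helper: keeps targets scoring at or above `cutoff` as the
    "best targets" and compares them with all supplied decoy scores.
    """
    best_targets = [s for s in target_scores if s >= cutoff]
    n_bt, n_td = len(best_targets), len(decoy_scores)

    # Balance: 1.0 means equally sized sets, values near 0 mean very unbalanced.
    larger = max(n_bt, n_td)
    balance = min(n_bt, n_td) / larger if larger else 0.0

    # Separation: difference of means divided by the pooled standard deviation.
    if n_bt < 2 or n_td < 2:
        separation = 0.0
    else:
        pooled = ((pstdev(best_targets) ** 2 + pstdev(decoy_scores) ** 2) / 2) ** 0.5
        separation = (mean(best_targets) - mean(decoy_scores)) / pooled if pooled else 0.0

    return {
        "n_best_targets": n_bt,
        "n_top_decoys": n_td,
        "balance": balance,
        "separation": separation,
    }
```

A low `balance` or a `separation` close to zero would correspond to the situation in which the classifier trained in the first pass has little to learn from.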
Solution
I tried a couple of different approaches, but I found the approach below to be the best:
start_semi_supervised_learning
select_train_peaks
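The details of the approach are not spelled out here, so the following is only a rough sketch of how a dynamic main score selection could work, not the code merged in this PR. It assumes that each candidate score is tried in turn, that a cutoff is chosen from an empirical decoy/target ratio at a given FDR, and that the candidate giving the most balanced best-target and top-decoy sets wins. The names `fdr_cutoff` and `pick_main_score` and the balance criterion are assumptions made for illustration.

```python
def fdr_cutoff(target_scores, decoy_scores, fdr):
    """Lowest target score whose empirical decoy/target ratio stays within `fdr`.

    Returns None when no threshold satisfies the requested FDR.
    """
    cutoff = None
    for s in sorted(set(target_scores), reverse=True):
        n_targets = sum(t >= s for t in target_scores)
        n_decoys = sum(d >= s for d in decoy_scores)
        if n_decoys / n_targets <= fdr:
            cutoff = s  # keep lowering the cutoff while the ratio is acceptable
    return cutoff


def pick_main_score(scores_by_name, is_decoy, fdr):
    """Pick the candidate score with the most balanced training sets (assumed criterion)."""
    ranked = []
    for name, values in scores_by_name.items():
        targets = [v for v, d in zip(values, is_decoy) if not d]
        decoys = [v for v, d in zip(values, is_decoy) if d]
        cutoff = fdr_cutoff(targets, decoys, fdr)
        n_bt = 0 if cutoff is None else sum(t >= cutoff for t in targets)
        n_td = len(decoys)
        balance = min(n_bt, n_td) / max(n_bt, n_td) if n_bt and n_td else 0.0
        ranked.append((balance, name))
    # Ties on balance are broken by name, which is arbitrary but deterministic.
    return max(ranked)[1]
```

Choosing by balance alone is a simplification; a real implementation would likely also look at how well the chosen score separates targets from decoys.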
I added an additional boolean parameter (`--main_score_selection_report`, default `False`, i.e. no report is generated) to the `score` module to optionally save a PDF report of the individual score distributions and p-value distributions (like the plots above) computed in `error_statistics`. The individual score reports are appended to the same file using PyPDF2. If the user threads the computation for fold learning, N separate reports are generated instead, since writing to the same file from several threads would cause write conflicts. I find these reports useful for debugging purposes.
Comparisons
I ran some tests on a single file from the IPF paper: yanliu_I170114_040_PhosNoco10_SW.osw
I compared LDA and XGBoost + main score=var_mi_weighted_score for PyProphet (v2.1.12) against LDA, XGBoost, and XGBoost + main score=var_mi_weighted_score for this implementation (vdynamic). I couldn't test XGBoost alone for PyProphet (v2.1.12) because it always errors out during pi0 estimation.
The total number of IDs is computed from `pyprophet statistics`, and I report the average after running the scoring, ipf and context workflow 20 times (the error bars show the standard deviation). The changes don't apply to LDA, so I tested it to make sure the LDA results are still consistent. Numbers are fairly consistent between XGBoost + main score (v2.1.12), XGBoost (vdynamic) and XGBoost + main score (vdynamic).
[Figure: Mean Feature Importance]
Updated parameter in XGBClassifier
The warning message occurs because `silent` is no longer a parameter in newer versions of XGBoost; it has been replaced by `verbosity`, which is equivalent to silent when set to 0.
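As an illustration of this parameter change, a small helper could translate an old-style parameter dictionary before it is handed to the classifier. This is a sketch only: `migrate_silent` is a hypothetical name, and mapping a falsy `silent` to `verbosity=1` (XGBoost's warning level) is an assumption about the intended behavior, not something taken from this PR.

```python
def migrate_silent(params):
    """Return a copy of `params` with the legacy `silent` key replaced by `verbosity`.

    XGBoost verbosity levels: 0 (silent), 1 (warning), 2 (info), 3 (debug).
    An explicit `verbosity` already present in `params` takes precedence.
    """
    updated = dict(params)
    if "silent" in updated:
        silent = updated.pop("silent")
        updated.setdefault("verbosity", 0 if silent else 1)
    return updated


# Possible usage, assuming xgboost is installed:
#   from xgboost import XGBClassifier
#   clf = XGBClassifier(**migrate_silent({"silent": 1, "max_depth": 6}))
```

Working on a copy leaves the caller's dictionary untouched, so the same parameter set can still be passed to code that expects the old key.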