✨ change intermediate split format to csv (for R methods) #44
Merged
Conversation
enryH (Member) commented on Apr 11, 2023
- increase interoperability by moving from the pkl default to csv (implemented before, now set as the default to integrate NAguideR)
- create empty data folder (done by snakemake, but also ensured on plain notebook execution)
- remove some obsolete imports
- add model specific config ("train_{model}.yaml")
- add scikit-learn KNNImputer to comparison
- 🐛 set data splits format to csv (-> for R based methods to work)
- 🎨 remove unused imports
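The format switch amounts to a pandas round-trip like the following; the file name, column names, and index layout are illustrative, not the workflow's actual ones:

```python
import pandas as pd

# hypothetical data split with missing values, as produced by the splitting step
train_split = pd.DataFrame(
    {"prot_A": [1.2, None, 3.4], "prot_B": [0.5, 0.7, None]},
    index=pd.Index(["sample_1", "sample_2", "sample_3"], name="Sample ID"),
)

# csv instead of pickle: readable from R (e.g. read.csv), at the cost of
# having to restore the index and dtypes on the way back in
train_split.to_csv("train_X.csv")
roundtrip = pd.read_csv("train_X.csv", index_col="Sample ID")

assert roundtrip.equals(train_split)
```

For purely numeric intensity tables the round-trip is lossless, which is what makes csv a workable interchange default here.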
- after playing a lot with R over the last week, add some NAGuideR methods
-> installation issues remain to be fixed for some others
- Basic R integration:
- transfer data to format expected by NAGuideR (01_0_transform_data_to_wide_format.ipynb)
- run selected R methods based on function provided in NAGuideR shiny app
(01_1_train_NAGuideR_methods.ipynb)
- transfer results back into the format expected by the workflow
(01_1_transfer_NAGuideR_pred.ipynb)
- workflow updated:
- base intermediate results on csv format
- query for predictions
- add R methods to config
- environment: try to install some packages using conda
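The wide-format transfer step can be sketched roughly as below; the column names and the exact layout expected by NAGuideR are assumptions here:

```python
import pandas as pd

# hypothetical long-format intensities as used inside the workflow
long_df = pd.DataFrame({
    "Sample ID": ["s1", "s1", "s2", "s2"],
    "protein group": ["pg1", "pg2", "pg1", "pg2"],
    "intensity": [1.0, 2.0, 3.0, 4.0],
})

# pivot to one row per sample, one column per protein group,
# then dump to csv for the R notebook to pick up
wide = long_df.pivot(index="Sample ID", columns="protein group", values="intensity")
wide.to_csv("data_wide_format.csv")
```

The reverse transfer would read the R methods' predictions back with `pd.read_csv` and melt them into the long format again.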
- conda-forge has to be the default channel (first), otherwise R packages might not work: conda-forge/r-stringi-feedstock#13
- restrict pandas to version 1
- set channel priority to strict
- check some R imports
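In `.condarc` terms, the channel setup described above would look roughly like this (a sketch, not the repository's actual config; the pandas pin would live in the environment file as `pandas<2`):

```yaml
# ~/.condarc sketch
channels:
  - conda-forge   # must come first, otherwise R packages might not resolve
  - defaults
channel_priority: strict
```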
- ✨ add NAGuideR methods which need local installation
- 🐛 fix Snakefile to consider all models in comparison
- 🎨 adapt plots (legends) and notebook descriptions
Some packages are not available via conda. Install manually (as with the locally provided packages before):
- impseq, impseqrob -> rrcovNA
- qrilc, mindet, minprob
- single train config per model
- added some NAGuideR methods (some failed)
- collect runtime of entire notebook using the benchmark directive from snakemake (comparison distorted, as notebooks partly extend the analysis)
- set patience as an input parameter for early stopping
- collect data dumps and figures saved to disk
- save more data used for figures to disk
- only plot performance annotation on bar plots if there is a bar
-> prepare for manual aggregation of best-models plots
- pick best models as defined in the default workflow
- needed to change some data handling
- repeat splitting the data with different seeds (so not fold splitting, but random splitting)
- added Snakefile for that
- move collecting of metrics to a separate notebook (easier to debug)
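Repeated random splitting (as opposed to K-fold splitting) can be sketched like this; function and parameter names are made up for illustration:

```python
import numpy as np

def random_splits(n_samples: int, n_repeats: int, frac_train: float = 0.8):
    """Yield (train_idx, test_idx) for n_repeats independent random splits,
    one seed per repetition (unlike folds, splits may overlap)."""
    for seed in range(n_repeats):
        rng = np.random.default_rng(seed)   # fresh generator per repetition
        perm = rng.permutation(n_samples)
        n_train = int(frac_train * n_samples)
        yield perm[:n_train], perm[n_train:]

splits = list(random_splits(n_samples=10, n_repeats=3))
```

Unlike K-fold splitting, every repetition draws its own train/test partition, so test sets across repetitions can overlap; metrics are then aggregated over repetitions.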
- permute protein data per feature
-> best model is not the median model
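Per-feature permutation shuffles each column independently, keeping each feature's value distribution but destroying cross-feature structure; a minimal sketch with hypothetical names:

```python
import numpy as np
import pandas as pd

def permute_per_feature(df: pd.DataFrame, seed: int = 42) -> pd.DataFrame:
    """Shuffle each column (feature) independently: per-feature value
    distributions are preserved, correlations between features are not."""
    rng = np.random.default_rng(seed)
    permuted = df.copy()
    for col in permuted.columns:
        permuted[col] = rng.permutation(permuted[col].to_numpy())
    return permuted
```

On such data, models that exploit correlations between features lose their advantage, which is what makes the permuted data a useful control.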
- in order to pass "None" as an argument, papermill needs the interpreted parameter for the metadata ("-p")
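For context: papermill's `-p` flag coerces common literals (so `-p param None` arrives in the notebook as Python `None`), whereas `-r` passes raw strings. A rough sketch of that kind of coercion, not papermill's actual implementation:

```python
def resolve_type(value: str):
    """Rough sketch of the literal coercion an interpreted parameter flag
    applies: named literals first, then numeric casts, else the raw string."""
    literals = {"None": None, "True": True, "False": False}
    if value in literals:
        return literals[value]
    for cast in (int, float):   # try numeric literals in order of strictness
        try:
            return cast(value)
        except ValueError:
            pass
    return value                # fall back to the raw string
```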
- for both peptides and evidence, fewer than 50 samples are retained (-> lower threshold of sample completeness?)
- datasets are the different levels: protein group, peptide and precursor
- change legend of binned-errors plot
- for peptides and precursors -> allow fewer features per sample -> retain all 50 samples -> see second step of Fig. S1
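The completeness threshold boils down to a row filter like this (function and parameter names are illustrative):

```python
import pandas as pd

def retain_samples(df: pd.DataFrame, min_feat_per_sample: int) -> pd.DataFrame:
    """Keep only samples (rows) with at least min_feat_per_sample
    observed (non-missing) features."""
    observed = df.notna().sum(axis=1)
    return df.loc[observed >= min_feat_per_sample]
```

Lowering `min_feat_per_sample` for the peptide and precursor levels is then what retains all 50 samples.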
- actually give a better impression of figure size (dpi setting)
- set fonts explicitly for Fig. 2
- add metadata to figures: N samples, M features
The idea is to run it for each dataset for diagnostics.
- create data visualization plots
- group plotting functions
- make baseline model choice available
- change params and output folder structure
- diff analysis should be run one by one, then aggregation happens in 10_3_ald_compare_methods.py
So far done for features that were shared between the approaches and that had competing outcomes in the diff. analysis.
- bin features by median (by the median's integer values)
- plot prop. missing for the features in each bin
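The binning step might look as follows; a sketch under the assumption that features are columns and the bin is the integer part of each feature's median intensity:

```python
import pandas as pd

def prop_missing_by_median_bin(df: pd.DataFrame) -> pd.Series:
    """Bin features by the integer part of their median and return the
    average proportion of missing values per bin (assumes no all-NaN column)."""
    medians = df.median().astype(int)   # integer bin per feature (column)
    prop_missing = df.isna().mean()     # share of NaNs per feature
    return prop_missing.groupby(medians).mean()
```

Such a plot makes the intensity-dependence of missingness visible: low-abundance bins typically show higher proportions of missing values.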
- different kinds of data visualizations
- default: example data provided with the package
- two configs for two analyzed datasets
- structure and format notebook
- add clustermaps of hierarchical clustering (seaborn)
- add heatmaps (based on hierarchical clustering)
-> todo: unify plotting layout (check which defaults are set; rather pick up from rc)
(no imputation added)
- 10_1 aggregation of scores removed, run one by one
- compare two imputation choices against each other (10_2), aggregate there
- add the number of samples the linear regression is based on to the differential analysis
- dump results of logistic regression -> allow custom plots later

Having a different setup in the ALD study than in the others is still a bit tricky. Next is to generate visualizations for intensities (add more models).
- add more methods to histogram and swarmplot of measured vs imputed values
- next: update workflow, limit models to requested?
- adapt workflow and configs to changes in the notebooks' setup

Next is to add the last notebook.
- for now used for ALD data, but should be relatively general already to be used with other data
- new rule added for last notebook (10_4_ald_compare_single_pg)
- small adaptions performed
- rules ordered by notebook number
- add optional feature annotations on scores (10_1_ald_diff_analysis.ipynb)
- remove annotations part from 10_2_ald_compare_methods.ipynb
- dump qvalues and rejection of the null hypothesis (equality) for further analysis in 10_4_ald_compare_single_pg.ipynb
- add new outputs to workflow(s)
- use snakemake to execute snakemake
- aggregate results of repeated workflow runs
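Aggregating the repeated runs could look like the following; the per-run file layout (one `metrics.csv` per run folder) is an assumption, not the workflow's actual one:

```python
from pathlib import Path
import pandas as pd

def aggregate_metrics(run_dirs, fname="metrics.csv") -> pd.DataFrame:
    """Concatenate per-run metric tables, keyed by run folder name."""
    frames = {Path(d).name: pd.read_csv(Path(d) / fname) for d in run_dirs}
    return pd.concat(frames, names=["run"])

# demo with two hypothetical workflow runs written to disk
for run, mae in [("run_0", 0.5), ("run_1", 0.7)]:
    Path(run).mkdir(exist_ok=True)
    pd.DataFrame({"model": ["KNN"], "MAE": [mae]}).to_csv(
        Path(run) / "metrics.csv", index=False
    )

agg = aggregate_metrics(["run_0", "run_1"])
```

Keying the concatenated frame by run name keeps the provenance of every metric row, so downstream notebooks can group by run or by model as needed.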
- option to not have only a few labels shown in case of high-dimensional data
- other customizations
…nd_comparison
- remove conflict in data_splitting notebook (removed comments of other configurations)