Update main branch #46
Merged
- Ubuntu latest (22.04) seems not to be compatible - long installations fail: conda-incubator/setup-miniconda#116 - try setting up manually (to avoid updating env)
- test whether the mamba implementation works on the runner instance - runner instance seems to run into a memory issue (Kubernetes pod error hints at that)
- improve cmd interface for two key notebooks (as scripts) - mamba (replacement for conda) handles large environments better
- index is wrapped into an iterable (e.g. for 0 as integer indicating the index column) - remove "%%time" cell magic -> it lets papermill miss errors in the cell - remove old input - drop deprecated parameter from read_csv
- option for csv would need to be specified - raise an error when a column index (level) has no name
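The index handling above can be sketched as follows; `read_indexed_csv` is a hypothetical helper for illustration, not a function from the repository:

```python
import io
import pandas as pd

def read_indexed_csv(buffer, index_col=0):
    """Read a CSV with index_col always wrapped into a list, then
    require every index level to carry a name."""
    if not isinstance(index_col, (list, tuple)):
        index_col = [index_col]  # wrap a scalar (e.g. 0) into an iterable
    df = pd.read_csv(buffer, index_col=list(index_col))
    if any(name is None for name in df.index.names):
        raise ValueError(f"Unnamed index level(s): {list(df.index.names)}")
    return df

csv = "Sample ID,protA,protB\nS1,1.0,2.0\nS2,3.0,4.0\n"
df = read_indexed_csv(io.StringIO(csv), index_col=0)
```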
- format should be updated eventually (reading a seq of configs and metrics with same schema)
- exit before DISEASES part -> should ideally be separated into a new script
- functionality copied from API description: https://www.uniprot.org/help/id_mapping - example provided
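A minimal sketch of the ID-mapping workflow from the linked UniProt help page; the endpoint and field names follow the public API description, but treat the details here as assumptions:

```python
API = "https://rest.uniprot.org/idmapping"

def build_job_payload(ids, from_db="UniProtKB_AC-ID", to_db="Gene_Name"):
    """Form payload for POST {API}/run; the service expects the
    identifiers as one comma-separated string."""
    return {"from": from_db, "to": to_db, "ids": ",".join(ids)}

def submit_job(ids, **kwargs):
    import requests  # only needed when actually calling the service
    resp = requests.post(f"{API}/run", data=build_job_payload(ids, **kwargs))
    resp.raise_for_status()
    # poll f"{API}/status/{job_id}" until finished, then fetch the results
    return resp.json()["jobId"]
```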
- disentangle preprocessing from analysis
- metadata expected for long-wide format transformations
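The long-wide transformation and the metadata it needs (the index and column names) can be illustrated with a toy table; all names here are made up:

```python
import pandas as pd

# hypothetical long-format table: one row per (sample, feature) intensity
long = pd.DataFrame({
    "Sample ID": ["S1", "S1", "S2"],
    "protein group": ["A", "B", "A"],
    "intensity": [1.0, 2.0, 3.0],
})

# long -> wide: samples as rows, features as columns
wide = long.pivot(index="Sample ID", columns="protein group",
                  values="intensity")

# wide -> long again; the metadata required are exactly the index
# and column names of the wide table
recovered = (wide.stack()
                 .dropna()
                 .rename("intensity")
                 .reset_index())
```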
start with single dev dataset - name parameters consistently - model_key: save and use as given (easier for connecting) - model abbrev.: RSN, CF, DAE, VAE (make consistent) - move helper function
- update parameter parsing - collect dump figure paths - move one function to package - pick up sample index name automatically
- model_key and model needed currently for grid search - "id" (constructed on loading config and metric files) + "model" name should be a unique combination in future - CF model: batch_size (not batch_size_collab) - ToDo: model pred should be saved by default (as currently done)
- add separate rules for interpolation and median imputation -> more separation - possibility: Optimize models one by one, write results and configs to shared database
- scripts need to be further adapted
- remove parameter fig from plotting functions
- and yet another plot to inspect errors - plot histogram of intensities vs predictions of top 4 models - use TARGET_COL everywhere
- small fixes and improvements - labeling - plot sizes - data prepared for NAGuideR package in separate workflow step -> data format is fixed
- add back frequency of protein group (PG) in data without imputation
- dump pvalues next to qvalues
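Dumping p-values next to q-values can be sketched as below; Benjamini-Hochberg adjustment is assumed here, the notes do not name the method:

```python
import numpy as np
import pandas as pd

def bh_qvalues(pvalues):
    """Benjamini-Hochberg adjusted p-values ("q-values")."""
    p = np.asarray(pvalues, dtype=float)
    n = p.size
    order = np.argsort(p)
    scaled = p[order] * n / np.arange(1, n + 1)
    # enforce monotonicity from the largest p-value downwards
    scaled = np.minimum.accumulate(scaled[::-1])[::-1]
    q = np.empty(n)
    q[order] = np.clip(scaled, 0.0, 1.0)
    return q

pvalues = [0.01, 0.02, 0.03, 0.5]
# dump both columns side by side, as the notes describe
scores = pd.DataFrame({"pvalue": pvalues, "qvalue": bh_qvalues(pvalues)})
```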
- distinguish between old and new PGs for repeated analysis
(could be done for single ALD study comparison)
- update configs to submitted version of paper
- produce plots of imputed values only optionally in repeated analysis
("make_plots" set to False)
- remove ids from data nb
-> configfile loading error after snakemake-minimal update to yte, see snakemake/snakemake#2276
- comp.: baseline on x-axis, target on y-axis - ROC and PRC -> make plot size smaller
- in original study RSN was done on a per sample basis (mean and std. of sample used to define dist. from drawing random replacements)
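The per-sample RSN described above can be sketched as follows; the shift and scale defaults are Perseus-style conventions, assumed rather than taken from the notes:

```python
import numpy as np

def impute_rsn_per_sample(sample, shift=1.8, scale=0.3, rng=None):
    """Replace NaNs in a single sample by draws from a normal that is
    down-shifted relative to that sample's own mean and std.
    shift=1.8 and scale=0.3 are assumed defaults, not from the PR."""
    rng = rng or np.random.default_rng(42)
    x = np.asarray(sample, dtype=float).copy()
    mask = np.isnan(x)
    mu, sd = np.nanmean(x), np.nanstd(x)
    # random replacements drawn from the sample-specific distribution
    x[mask] = rng.normal(mu - shift * sd, scale * sd, size=mask.sum())
    return x
```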
- count significant features for old and new features - new features that are significant: make plot
- 348 samples with all clinical variables non-missing (10 missing, 358 with kleiner score) - add model_key for edge case (for now testing only) - ToDo: Use in 10_4 the sample mask from here: mask_sample_with_complete_clinical_data
- script to select 80% of the data MAR - config files for reduced dataset for workflows - add option of reference score for comparison of "None" (preliminary implementation)
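Selecting 80% of the data can be sketched as below; this is purely random (MCAR-style) masking of observed entries, whereas the project's MAR selection may condition on intensity and differ from this:

```python
import numpy as np

def subselect_observed(data, frac=0.8, random_state=0):
    """Keep a random `frac` of the observed entries and set the rest
    to NaN; returns a new array, the input is left untouched."""
    rng = np.random.default_rng(random_state)
    data = np.asarray(data, dtype=float).copy()
    flat = data.ravel()                        # view into the copy
    observed = np.flatnonzero(~np.isnan(flat))
    n_drop = observed.size - int(round(frac * observed.size))
    flat[rng.choice(observed, size=n_drop, replace=False)] = np.nan
    return data
```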
- restrict to 3 new models (CF, DAE, VAE) - 🐛 pass datasplits file format - use seaborn facetgrid to plot data - update configs and workflow files
- try to plot predictions (imputed values) by assigned model color - formatting of code - target naming - plot x for observed (measured) values - 🐛 None has no predictions, but still DA has to be computed
- 🎨 minor styling changes and updates
- install packages on the fly if not found
(conda and manual installation fails on Windows)
- update installation instructions
- document available methods
- 🐛 fix transfer predictions for NAGuideR
(use dumps argument if it is set)
- select smallest among top 3 models on validation data - update configs and document which R based models worked
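Selecting the smallest among the top 3 models on validation data can be sketched with a toy results table; the column names and the size proxy (`latent_dim`) are assumptions:

```python
import pandas as pd

# hypothetical grid-search results on the validation split
results = pd.DataFrame({
    "model": ["DAE", "VAE", "CF", "DAE", "VAE"],
    "latent_dim": [512, 256, 64, 32, 16],
    "val_MAE": [0.20, 0.21, 0.22, 0.23, 0.30],
})

top3 = results.nsmallest(3, "val_MAE")            # top 3 by validation error
chosen = top3.nsmallest(1, "latent_dim").iloc[0]  # smallest model among them
```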
split_data:
- ✨ option to use randomly subselected data
- 🐛 fix also validation and test splits in data notebook
(random_state)
- 🎨 missing values prop. on training data
performance_plots:
- 🎨 remove old plots
- ✨ write test data, add comparison between splits
(to excel output)
both
- 🎨 remove plots, move functions to top (to be moved)
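Fixing the validation and test splits via `random_state` can be sketched as below; `split_indices` is a hypothetical helper, not the repository's split function:

```python
import numpy as np

def split_indices(n_samples, frac_val=0.1, frac_test=0.1, random_state=42):
    """Deterministic train/val/test split of sample indices; a fixed
    random_state makes validation and test sets reproducible across
    notebook runs."""
    rng = np.random.default_rng(random_state)
    idx = rng.permutation(n_samples)
    n_val = int(n_samples * frac_val)
    n_test = int(n_samples * frac_test)
    val = idx[:n_val]
    test = idx[n_val:n_val + n_test]
    train = idx[n_val + n_test:]
    return train, val, test
```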
- update models for smaller HeLa dataset based on grid search using N=50 HeLa samples (use smallest of top 3 models) - 🎨 format two notebooks
Uses the config from N=50 for simplicity
- split data: check that for small N features are not only in val and test (~< 3-4 intensities for peptide and precursor level)
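The check can be sketched as below; `sparse_features` is a hypothetical helper with samples as rows, and the threshold mirrors the ~3-4 intensities mentioned above:

```python
import pandas as pd

def sparse_features(train_wide, min_obs=4):
    """Features (columns) with fewer than `min_obs` observed
    intensities in the training split - such features would
    effectively live only in the val/test splits."""
    counts = train_wide.notna().sum(axis=0)  # samples are rows
    return counts[counts < min_obs].index.tolist()

# toy training split: feature "B" has only one observed intensity
train = pd.DataFrame({"A": [1.0, 2.0, 3.0, 4.0],
                      "B": [1.0, None, None, None]})
```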
- combine tables for one dataset with different sample number per run
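Combining tables for one dataset across runs with different sample numbers can be sketched with pandas `concat` keys; the table contents here are made up:

```python
import pandas as pd

# hypothetical per-run metric tables with differing sample numbers
run_n50 = pd.DataFrame({"val_MAE": [0.30]}, index=["DAE"])
run_n500 = pd.DataFrame({"val_MAE": [0.20, 0.25]}, index=["DAE", "VAE"])

# concat with keys records the sample number per run as an index level
combined = pd.concat({50: run_n50, 500: run_n500}, names=["N", "model"])
```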
- upper-case keys enforced - only calculate matrices for selected models -> notebook to be made concise eventually
- start rewrite to allow application-only use without early-stopping benchmarking (-> no control of training procedure and performance evaluation)