Update main branch #46

enryH · 2023-05-18T12:42:03Z

improvements for revised analysis
R packages added to comparison listed in NAGuideR (see table in README)

- Ubuntu latest (22.04) seems not to be compatible - long installations fail: conda-incubator/setup-miniconda#116 - try setting up manually (to avoid updating env)

- try if mamba implementation works on runner instance - runner instance seems to run into memory issue (Kubernetes pod error hints at that)

- improve cmd interface for two key notebooks (as scripts) - mamba (replacement for conda) better for large environment

- index is wrapped into iterable (e.g. for 0 as integer indicating index columns) - remove "%%time" cell magic -> lets papermill miss errors in the cell - remove old input - drop deprecated parameter from read_csv

- option for csv would need to be specified - raise error when column Index (level) has no name.

- format should be updated eventually (reading a seq of configs and metrics with same schema)

- exit before DISEASES part -> should be at best separated into new script

- functionality copied from API description: https://www.uniprot.org/help/id_mapping - example provided

- disentangle preprocessing from analysis

- metadata expected for long-wide format transformations

start with single dev dataset - name parameters consistently - model_key: save and use as given (easier for connecting) - model abbrev.: RSN, CF, DAE, VAE (make consistent) - move helper function

- update parameter parsing - collect dump figure paths - move one function to package - pick up sample index name automatically

- model_key and model needed currently for grid search - "id" (constructed on loading config and metric files) + "model" name should be a unique combination in future - CF model: batch_size (not batch_size_collab) - ToDo: model pred should be saved by default (as currently done)

- add separate rules for interpolation and median imputation -> more separation - possibility: Optimize models one by one, write results and configs to shared database

- scripts need to be further adapted

- remove parameter fig from plotting functions

- and yet another plot to inspect errors - plot histogram of intensities vs pred of top 4 models - use TARGET_COL everywhere

- small fixes and improvements - labeling - plot sizes - data prepared for NAGuideR package in separate workflow step -> data format is fixed

- add back frequency of protein group (PG) in data without imputation - dump pvalues next to qvalues - distinguish between old and new PGs for repeated analysis (could be done for single ALD study comparison) - update configs to submitted version of paper - produce plots of imputed values only optionally in repeated analysis ("make_plots" set to False)

- remove ids from data nb

-> configfile loading error after snakemake-minimal update to yte, see snakemake/snakemake#2276

- comp.: baseline on x-axis, target on y-axis - ROC and PRC -> make plot size smaller

- in the original study RSN was done on a per sample basis (mean and std. of sample used to define the dist. for drawing random replacements)

- count significant features for old and new features - new features that are significant: make plot

- 348 samples with all clinical variables non-missing (10 missing, 358 with kleiner score) - add model_key for edge case (for now testing only) - ToDo: Use in 10_4 the sample mask from here: mask_sample_with_complete_clinical_data

- script to select 80% of the data MAR - config files for reduced dataset for workflows - add option of reference score for comparison of "None" (preliminary implementation)

- restrict to 3 new models (CF, DAE, VAE) - 🐛 pass datasplits file format - use seaborn facetgrid to plot data - update configs and workflow files

- try to plot predictions (imputed values) by assigned model color - formatting of code - target naming - plot x for observed (measured) values - 🐛 None has no predictions, but still DA has to be computed

- 🎨 minor styling changes and updates

- install packages on the fly if not found (conda and manual installation fails on Windows) - update installation instructions - document available methods - 🐛 fix transfer predictions for NAGuideR (use dumps argument if it is set)

- select smallest among top 3 models on validation data - update configs and document which R based models worked

split_data: - ✨ option to use randomly subselected data - 🐛 fix also validation and test splits in data notebook (random_state) - 🎨 missing values prop. on training data performance_plots: - 🎨 remove old plots - ✨ write test data, add comparison betw. splits (to excel output) both: - 🎨 remove plots, move functions to top (to be moved)

for smaller HeLa dataset - update models based on grid search using N=50 HeLa samples (use smallest of top3 models) - 🎨 format two notebooks

Uses the config from N=50 for simplicity - split data: check that for small N features are not only in val and test (~< 3-4 intensities for peptide and precursor level) - combine tables for one dataset with different sample number per run

- upper keys enforced - only calculated matrices for selected models -> notebook to be made concise eventually

- start re-write to allow only application without early stopping benchmarking (-> no control of training procedure and performance evaluation)

Henry added 30 commits January 12, 2023 18:10

🐛 uses and runs cannot be in one named entity

8abee20

🐛 fix cicd

55cd9be

- Ubuntu latest (22.04) seems not to be compatible - long installations fail: conda-incubator/setup-miniconda#116 - try setting up manually (to avoid updating env)

🐛 large environment -> memory error -> try mamba

0a6be2e

- try if mamba implementation works on runner instance - runner instance seems to run into memory issue (Kubernetes pod error hints at that)

📝 document execution and setup

4c82e6e

- improve cmd interface for two key notebooks (as scripts) - mamba (replacement for conda) better for large environment

📝 preprint link

242bb1d

🐛 remove warning, make index more robust

5cd3686

- index is wrapped into iterable (e.g. for 0 as integer indicating index columns) - remove "%%time" cell magic -> lets papermill miss errors in the cell - remove old input - drop deprecated parameter from read_csv
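
A minimal sketch of the index wrapping described in this commit, assuming the goal is to accept either a single column specifier (such as the integer 0) or a sequence of them. The helper name `as_index_cols` is hypothetical and not taken from the repository.

```python
from collections.abc import Iterable


def as_index_cols(index_col):
    """Return index column specifiers as a list (hypothetical helper).

    A bare int or str (e.g. 0) is wrapped into a one-element list;
    any other iterable is converted to a list unchanged.
    """
    # Strings are iterable, but a column name should be treated as one item.
    if isinstance(index_col, (str, bytes)) or not isinstance(index_col, Iterable):
        return [index_col]
    return list(index_col)
```

Downstream code can then loop over the result without special-casing a single integer.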

🐛 workflow expects pickle format

ef7c2b8

- option for csv would need to be specified - raise error when column Index (level) has no name.

🐛 update interface and execution logic for both AE

c420e21

🐛 change default freq name

8cc58f2

♻️ unify metric and config loading

622aeda

- format should be updated eventually (reading a seq of configs and metrics with same schema)

🚧 Allow execution without annotation

8688caa

- exit before DISEASES part -> should be at best separated into new script

✨ update to new Uniprot API

9100126

- functionality copied from API description: https://www.uniprot.org/help/id_mapping - example provided

🎨 move preprocessing of clinical meta data

04b8b55

- disentangle preprocessing from analysis

✅ axis name for index and columns

1c4d64a

- metadata expected for long-wide format transformations

🐛 swapped horizontal lines

3f9f215

🚧✅ prepare more testing

67e666c

🚧 streamline workflow

8cfecc4

start with single dev dataset - name parameters consistently - model_key: save and use as given (easier for connecting) - model abbrev.: RSN, CF, DAE, VAE (make consistent) - move helper function

🚧 streamline workflow

69b07bc

start with single dev dataset - name parameters consistently - model_key: save and use as given (easier for connecting) - model abbrev.: RSN, CF, DAE, VAE (make consistent) - move helper function

🎨 single comp. script update

42a9272

- update parameter parsing - collect dump figure paths - move one function to package - pick up sample index name automatically

🎨 single comp. script update

b6a0a14

- update parameter parsing - collect dump figure paths - move one function to package - pick up sample index name automatically

🎨 align training notebooks for DAE and VAE

62248db

🎨 align training notebooks for DAE and VAE

d0c7d4c

🎨 make model explicit in grid search

9449986

- add separate rules for interpolation and median imputation -> more separation - possibility: Optimize models one by one, write results and configs to shared database

🎨 make model explicit in grid search

1e7c967

- add separate rules for interpolation and median imputation -> more separation - possibility: Optimize models one by one, write results and configs to shared database

🎨 format workflows using snakefmt

a0ae39f

🎨 format workflows using snakefmt

6c94d6c

🚧 adapt execution order of grid search

f758598

- scripts need to be further adapted

📝🐛 update markdown parser for sphinx

6ce33a1

Henry added 29 commits May 24, 2023 11:01

🎨 fix sizes of and colors in plots, dump stats

a30949e

- remove parameter fig from plotting functions

🎨 error by training data medians of feat + histogram of pred

ea9b6e2

- and yet another plot to inspect errors - plot histogram of intensities vs pred of top 4 models - use TARGET_COL everywhere

✨ Add ALD run configuration, add back outputs

6df37b3

- small fixes and improvements - labeling - plot sizes - data prepared for NAGuideR package in separate workflow step -> data format is fixed

🎨🐛 small bug fixes (typos)

080b9d8

- remove ids from data nb

🐛 restrict snakemake-minimal for now

14cad97

-> configfile loading error after snakemake-minimal update to yte, see snakemake/snakemake#2276

🎨 axis order and plot size

2f39631

- comp.: baseline on x-axis, target on y-axis - ROC and PRC -> make plot size smaller

🐛 impute per default per sample for RSN, not per feat

fc841fc

- in the original study RSN was done on a per sample basis (mean and std. of sample used to define the dist. for drawing random replacements)
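
A rough sketch of per-sample imputation as described in this commit: each sample's own mean and standard deviation define the normal distribution from which replacements are drawn. The function name, the data layout (a dict of lists with `None` for missing values) and the `shift` and `scale` defaults are illustrative assumptions, not values taken from the repository.

```python
import random
import statistics


def impute_shifted_normal_per_sample(samples, shift=1.8, scale=0.3, seed=None):
    """Replace None in each sample by draws from a down-shifted normal.

    samples: dict mapping sample name -> list of intensities (None = missing).
    The distribution is defined per sample from its observed values:
    mean - shift * std as location and scale * std as spread.
    shift and scale are assumed defaults for illustration only.
    Each sample needs at least two observed values to estimate std.
    """
    rng = random.Random(seed)
    imputed = {}
    for name, values in samples.items():
        observed = [v for v in values if v is not None]
        mean = statistics.mean(observed)
        std = statistics.stdev(observed)
        imputed[name] = [
            rng.gauss(mean - shift * std, scale * std) if v is None else v
            for v in values
        ]
    return imputed
```

Imputing per feature instead would compute the mean and standard deviation across samples for each feature, which is the behaviour this commit moves away from as the default.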

🎨 add counts and plots

05f4737

- count significant features for old and new features - new features that are significant: make plot

🎨 dump binned errors for reporting

fb7246c

✨ Add performance on reduced ALD dataset

591ea1a

- script to select 80% of the data MAR - config files for reduced dataset for workflows - add option of reference score for comparison of "None" (preliminary implementation)

🎨 make better grid search plot

4c0ab0e

- restrict to 3 new models (CF, DAE, VAE) - 🐛 pass datasplits file format - use seaborn facetgrid to plot data - update configs and workflow files

🎨 Update visualizations

48a570a

- try to plot predictions (imputed values) by assigned model color - formatting of code - target naming - plot x for observed (measured) values - 🐛 None has no predictions, but still DA has to be computed

✨ Interpretation of repeated execution of ALD data

a869b19

- 🎨 minor styling changes and updates

🎨 improve visibility, parameterize

e263084

🐛📝 Install missing R pkgs on the fly fix #47

797944e

- install packages on the fly if not found (conda and manual installation fails on Windows) - update installation instructions - document available methods - 🐛 fix transfer predictions for NAGuideR (use dumps argument if it is set)

🎨 select best, smallest models consistently from grid search

2c3c087

- select smallest among top 3 models on validation data - update configs and document which R based models worked
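
The selection rule in this commit can be sketched as follows, assuming each grid-search result carries a validation loss (lower is better) and a parameter count. The function name and the keys `id`, `val_loss` and `n_params` are hypothetical.

```python
def select_smallest_of_top(models, top_n=3):
    """Pick the model with the fewest parameters among the top_n by validation loss.

    models: list of dicts with the hypothetical keys
    'id', 'val_loss' (lower is better) and 'n_params'.
    """
    # Rank by validation performance, keep the best top_n candidates.
    top = sorted(models, key=lambda m: m["val_loss"])[:top_n]
    # Among these, prefer the smallest model.
    return min(top, key=lambda m: m["n_params"])
```

The design choice is to trade a small amount of validation performance for a smaller model, while never picking a model outside the best few.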

🔧 update config for small grid search

2660fea

📝 update configs using grid search (small)

bc2d6c3

for smaller HeLa dataset - update models based on grid search using N=50 HeLa samples (use smallest of top3 models) - 🎨 format two notebooks

🔧 add all NAGuideR methods to ALD data comp.

22952ca

✨ add small N analysis workflow

ff22b16

Uses the config from N=50 for simplicity - split data: check that for small N features are not only in val and test (~< 3-4 intensities for peptide and precursor level) - combine tables for one dataset with different sample number per run

🐛 use floor value of float for integer bin

980f042
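
The commit title only states that the floor of a float is used as the integer bin. A small sketch of that idea, assuming the binned quantity is a float such as an intensity; the function name is hypothetical.

```python
import math


def integer_bin(value):
    """Assign a float to the integer bin given by its floor (hypothetical helper)."""
    return math.floor(value)
```

Unlike `round`, which would place 25.7 in bin 26, and unlike `int`, which truncates -0.5 to 0, `math.floor` always maps a value to the largest integer not greater than it.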

🎨 make it more resilient

6842ae3

- upper keys enforced - only calculated matrices for selected models -> notebook to be made concise eventually

🚧 adhoc scripts for supp. data

17f7723

🔥 start removing old code

7a442e2

🚧 allow non-benchmarking mode

635dbcf

- start re-write to allow only application without early stopping benchmarking (-> no control of training procedure and performance evaluation)

👷 only test one method per package

3891f77

enryH merged commit 95b6e9e into main Jul 26, 2023
