✨ change intermediate split format to csv (for R methods) #44
Merged
Conversation
enryH (Member) commented on Apr 11, 2023
- increase interoperability by moving from the pkl default to csv (implemented before, now set as the default to integrate NAguideR)
- create empty data folder (done by snakemake, but also ensured on plain notebook execution)
- remove some obsolete imports
- add model specific config ("train_{model}.yaml")
- add scikit-learn KNNImputer to comparison
- 🐛 set data splits format to csv (-> for R based methods to work)
- 🎨 remove unused imports
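The format switch amounts to a pandas round-trip like the following; the file name, column names, and index layout are illustrative, not the workflow's actual ones:

```python
import pandas as pd

# hypothetical data split with missing values, as produced by the splitting step
train_split = pd.DataFrame(
    {"prot_A": [1.2, None, 3.4], "prot_B": [0.5, 0.7, None]},
    index=pd.Index(["sample_1", "sample_2", "sample_3"], name="Sample ID"),
)

# csv instead of pickle: readable from R (e.g. read.csv), at the cost of
# having to restore the index and dtypes on the way back in
train_split.to_csv("train_X.csv")
roundtrip = pd.read_csv("train_X.csv", index_col="Sample ID")

assert roundtrip.equals(train_split)
```

For purely numeric intensity tables the round-trip is lossless, which is what makes csv a workable interchange default here.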
- after playing a lot with R over the last week, add some NAGuideR methods
-> installation issues remain to be fixed for some others
- Basic R integration:
- transfer data to format expected by NAGuideR (01_0_transform_data_to_wide_format.ipynb)
- run selected R methods based on function provided in NAGuideR shiny app
(01_1_train_NAGuideR_methods.ipynb)
- transfer results back into the format expected by the workflow
(01_1_transfer_NAGuideR_pred.ipynb)
- workflow updated:
- base intermediate results on csv format
- query for predictions
- add R methods to config
- environment: try to install some packages using conda
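The wide-format transfer step can be sketched roughly as below; the column names and the exact layout expected by NAGuideR are assumptions here:

```python
import pandas as pd

# hypothetical long-format intensities as used inside the workflow
long_df = pd.DataFrame({
    "Sample ID": ["s1", "s1", "s2", "s2"],
    "protein group": ["pg1", "pg2", "pg1", "pg2"],
    "intensity": [1.0, 2.0, 3.0, 4.0],
})

# pivot to one row per sample, one column per protein group,
# then dump to csv for the R notebook to pick up
wide = long_df.pivot(index="Sample ID", columns="protein group", values="intensity")
wide.to_csv("data_wide_format.csv")
```

The reverse transfer would read the R methods' predictions back with `pd.read_csv` and melt them into the long format again.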
- conda-forge has to be the default channel (first), otherwise R packages might not work: conda-forge/r-stringi-feedstock#13
- restrict pandas to version 1
- set channel priority to strict
- check some R imports
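In `.condarc` terms, the channel setup described above would look roughly like this (a sketch, not the repository's actual config; the pandas pin would live in the environment file as `pandas<2`):

```yaml
# ~/.condarc sketch
channels:
  - conda-forge   # must come first, otherwise R packages might not resolve
  - defaults
channel_priority: strict
```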
- ✨ add NAGuideR methods which need local installation
- 🐛 fix Snakefile to consider all models in comparison
- 🎨 adapt plots (legends) and notebook descriptions
Some packages are not available via conda. Install manually (as with the locally provided packages before):
- impseq, impseqrob -> rrcovNA
- qrilc, mindet, minprob
- single train config per model
- added some NAGuideR methods (some failed)
- collect runtime of entire notebook using the benchmark directive from snakemake (comparison distorted, as notebooks partly extend the analysis)
- set patience as an input parameter for early stopping
- collect data dumps and figures saved to disk
- save more data used for figures to disk
- only plot performance annotation on bar plots if there is a bar
-> prepare for manual aggregation of best-models plots
- pick best models as defined in the default workflow
- needed to change some data handling
- repeat splitting the data with different seeds (so not fold splitting, but random splitting)
- added Snakefile for that
- move collecting of metrics to a separate notebook (easier to debug)
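Repeated random splitting (as opposed to K-fold splitting) can be sketched like this; function and parameter names are made up for illustration:

```python
import numpy as np

def random_splits(n_samples: int, n_repeats: int, frac_train: float = 0.8):
    """Yield (train_idx, test_idx) for n_repeats independent random splits,
    one seed per repetition (unlike folds, splits may overlap)."""
    for seed in range(n_repeats):
        rng = np.random.default_rng(seed)   # fresh generator per repetition
        perm = rng.permutation(n_samples)
        n_train = int(frac_train * n_samples)
        yield perm[:n_train], perm[n_train:]

splits = list(random_splits(n_samples=10, n_repeats=3))
```

Unlike K-fold splitting, every repetition draws its own train/test partition, so test sets across repetitions can overlap; metrics are then aggregated over repetitions.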
- permute protein data per feature
-> best model is not the median model
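Per-feature permutation shuffles each column independently, keeping each feature's value distribution but destroying cross-feature structure; a minimal sketch with hypothetical names:

```python
import numpy as np
import pandas as pd

def permute_per_feature(df: pd.DataFrame, seed: int = 42) -> pd.DataFrame:
    """Shuffle each column (feature) independently: per-feature value
    distributions are preserved, correlations between features are not."""
    rng = np.random.default_rng(seed)
    permuted = df.copy()
    for col in permuted.columns:
        permuted[col] = rng.permutation(permuted[col].to_numpy())
    return permuted
```

On such data, models that exploit correlations between features lose their advantage, which is what makes the permuted data a useful control.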
- in order to pass "None" as an argument, papermill needs the interpreted parameter for the metadata ("-p")
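For context: papermill's `-p` flag coerces common literals (so `-p param None` arrives in the notebook as Python `None`), whereas `-r` passes raw strings. A rough sketch of that kind of coercion, not papermill's actual implementation:

```python
def resolve_type(value: str):
    """Rough sketch of the literal coercion an interpreted parameter flag
    applies: named literals first, then numeric casts, else the raw string."""
    literals = {"None": None, "True": True, "False": False}
    if value in literals:
        return literals[value]
    for cast in (int, float):   # try numeric literals in order of strictness
        try:
            return cast(value)
        except ValueError:
            pass
    return value                # fall back to the raw string
```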
- for both peptides and evidence, fewer than 50 samples are retained (-> lower threshold of sample completeness?)
- datasets are the different levels: protein group, peptide and precursor
- change legend of binned-errors plot
- for peptides and precursors -> allow fewer features per sample -> retain all 50 samples -> see second step of Fig. S1
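The completeness threshold boils down to a row filter like this (function and parameter names are illustrative):

```python
import pandas as pd

def retain_samples(df: pd.DataFrame, min_feat_per_sample: int) -> pd.DataFrame:
    """Keep only samples (rows) with at least min_feat_per_sample
    observed (non-missing) features."""
    observed = df.notna().sum(axis=1)
    return df.loc[observed >= min_feat_per_sample]
```

Lowering `min_feat_per_sample` for the peptide and precursor levels is then what retains all 50 samples.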
- actually give a better impression of figure size (dpi setting)
- set fonts explicitly for Fig. 2
- add metadata to figures: N samples, M features
The idea is to run it for each dataset for diagnostics.
- create data visualization plots
- group plotting functions
- make baseline model choice available
- change params and output folder structure
- diff analysis should be run one by one, then aggregation happens in 10_3_ald_compare_methods.py
So far done for features that were shared between the approaches and that had competing outcomes in the diff. analysis.
- bin features by median (by the median's integer values)
- plot prop. missing for the features in each bin
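The binning step might look as follows; a sketch under the assumption that features are columns and the bin is the integer part of each feature's median intensity:

```python
import pandas as pd

def prop_missing_by_median_bin(df: pd.DataFrame) -> pd.Series:
    """Bin features by the integer part of their median and return the
    average proportion of missing values per bin (assumes no all-NaN column)."""
    medians = df.median().astype(int)   # integer bin per feature (column)
    prop_missing = df.isna().mean()     # share of NaNs per feature
    return prop_missing.groupby(medians).mean()
```

Such a plot makes the intensity-dependence of missingness visible: low-abundance bins typically show higher proportions of missing values.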
- different kinds of data visualizations
- default: example data provided with the package
- two configs for two analyzed datasets
- structure and format notebook
- add clustermaps of hierarchical clustering (seaborn)
- add heatmaps (based on hierarchical clustering)
-> todo: unify plotting layout (check which defaults are set; rather pick up from rc)
(no imputation added)
- 10_1 aggregation of scores removed, run one by one
- compare two imputation choices against each other (10_2), aggregate there
- add the number of samples the linear regression is based on to the differential analysis
- dump results of logistic regression -> allow custom plots later

Having a different setup in the ALD study than in the others is still a bit tricky. Next is to generate visualizations for intensities (add more models).
- add more methods to histogram and swarmplot of measured vs imputed values
- next: update workflow, limit models to requested?
- adapt workflow and configs to changes in the notebooks' setup

Next is to add the last notebook.
- for now used for ALD data, but should be relatively general already to be used with other data
- new rule added for last notebook (10_4_ald_compare_single_pg)
- small adaptions performed
- rules ordered by notebook number
- add optional feature annotations on scores (10_1_ald_diff_analysis.ipynb)
- remove annotations part from 10_2_ald_compare_methods.ipynb
- dump qvalues and rejection of the null hypothesis (equality) for further analysis in 10_4_ald_compare_single_pg.ipynb
- add new outputs to workflow(s)
- use snakemake to execute snakemake
- aggregate results of repeated workflow runs
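Aggregating the repeated runs could look like the following; the per-run file layout (one `metrics.csv` per run folder) is an assumption, not the workflow's actual one:

```python
from pathlib import Path
import pandas as pd

def aggregate_metrics(run_dirs, fname="metrics.csv") -> pd.DataFrame:
    """Concatenate per-run metric tables, keyed by run folder name."""
    frames = {Path(d).name: pd.read_csv(Path(d) / fname) for d in run_dirs}
    return pd.concat(frames, names=["run"])

# demo with two hypothetical workflow runs written to disk
for run, mae in [("run_0", 0.5), ("run_1", 0.7)]:
    Path(run).mkdir(exist_ok=True)
    pd.DataFrame({"model": ["KNN"], "MAE": [mae]}).to_csv(
        Path(run) / "metrics.csv", index=False
    )

agg = aggregate_metrics(["run_0", "run_1"])
```

Keying the concatenated frame by run name keeps the provenance of every metric row, so downstream notebooks can group by run or by model as needed.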
- option to not have only a few labels shown in case of high-dimensional data
- other customizations
…nd_comparison
- remove conflict in data_splitting notebook (removed comments of other configurations)