@enryH enryH commented May 18, 2023

  • improvements for revised analysis
  • R packages added to comparison listed in NAGuideR (see table in README)

Henry added 30 commits January 12, 2023 18:10
- Ubuntu latest (22.04) seems not to be compatible
- long installations fail: conda-incubator/setup-miniconda#116
- try setting up manually (to avoid updating env)
- try if mamba implementation works on runner instance
- runner instance seems to run into memory issue
  (Kubernetes pod error hints at that)
- improve cmd interface for two key notebooks (as scripts)
- mamba (replacement for conda) better for large environment
- index is wrapped into iterable
  (e.g. for 0 as integer indicating index columns)
- remove "%%time" cell magic -> it lets papermill
  miss errors in a cell.
- remove old input
- drop deprecated parameter from read_csv
- option for csv would need to be specified
- raise error when column Index (level) has no name.
- format should be updated eventually
  (reading a seq of configs and metrics with same schema)
- exit before DISEASES part
  -> should be at best separated into new script
- functionality copied from API description:
  https://www.uniprot.org/help/id_mapping
- example provided
- disentangle preprocessing from analysis
- metadata expected for long-wide format transformations
start with single dev dataset

- name parameters consistently
- model_key: save and use as given (easier for connecting
- model abbrev.: RSN, CF, DAE, VAE (make consistent)
- move helper function
- update parameter parsing
- collect dump figure paths
- move one function to package
- pick up sample index name automatically
- model_key and model needed currently for grid search
- "id" (constructed on loading config and metric files)
 + "model" name should be unique combination in future
- CF model: batch_size (not batch_size_collab)

- ToDo: model pred should be saved by default (as currently done)
- add separate rules for interpolation and median imputation
  -> more separation
- possibility: Optimize models one by one,
  write results and configs to shared database
- scripts need to be further adapted
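
Two of the notes above concern index handling when reading tables: the index specification is wrapped into an iterable (so that a bare `0` works as an index column), and an error is raised when an index level has no name. A minimal sketch of both checks in plain Python follows; the helper names `as_index_cols` and `check_level_names` are hypothetical and not taken from the package.

```python
from collections.abc import Iterable


def as_index_cols(index_col):
    """Wrap a single index column spec (e.g. 0 or 'Sample ID') into a list."""
    # Strings are iterable, but here they denote a single column name.
    if isinstance(index_col, (str, bytes)) or not isinstance(index_col, Iterable):
        return [index_col]
    return list(index_col)


def check_level_names(names):
    """Raise if any index level name is missing (None or empty string)."""
    missing = [i for i, name in enumerate(names) if name is None or name == ""]
    if missing:
        raise ValueError(f"Index level(s) without a name: {missing}")
    return list(names)
```

With this, `as_index_cols(0)` and `as_index_cols([0, 1])` can be treated the same way downstream, and unnamed levels fail early instead of producing ambiguous columns after a long-wide transformation.
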
Henry added 29 commits May 24, 2023 11:01
- remove parameter fig from plotting functions
- and yet another plot to inspect errors
- plot histogram of intensities vs pred
   of top 4 models
- use TARGET_COL everywhere
- small fixes and improvements
   - labeling
   - plot sizes

- data prepared for NAGuideR package in separate workflow step
  -> data format is fixed
- add back frequency of protein group (PG) in data without imputation
- dump pvalues next to qvalues
- distinguish between old and new PGs for repeated analysis
  (could be done for single ALD study comparison)
- update configs to submitted version of paper
- produce plots of imputed values only optionally in repeated analysis
  ("make_plots" set to False)
- remove ids from data nb
-> configfile loading error after snakemake-minimal update
    to yte, see snakemake/snakemake#2276
- comp.: baseline on x-axis, target on y-axis
- ROC and PRC -> make plot size smaller
- in original study RSN was done on a per sample basis
  (mean and std. of sample used to define the dist. from which
   random replacements are drawn)
- count significant features for old and new features
- new features that are significant: make plot
- 348 samples with all clinical variables non-missing
  (10 missing, 358 with kleiner score)
- add model_key for edge case (for now testing only)
- ToDo: Use in 10_4 the sample mask from
  here: mask_sample_with_complete_clinical_data
- script to select 80% of the data MAR
- config files for reduced dataset for workflows
- add option of reference score for comparison of "None"
  (preliminary implementation)
- restrict to 3 new models (CF, DAE, VAE)
- 🐛 pass datasplits file format
- use seaborn facetgrid to plot data
- update configs and workflow files
- try to plot predictions (imputed values) by assigned model color
- formatting of code
- target naming
- plot x for observed (measured) values
- 🐛 None has no predictions, but still
   DA has to be computed
- 🎨 minor styling changes and updates
- install packages on the fly if not found
  (conda and manual installation fails on Windows)
- update installation instructions
- document available methods
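
The per-sample RSN note above (mean and std. of a sample define the distribution for random replacements) could be sketched as follows. This assumes RSN means drawing replacements from a down-shifted normal distribution; the function name and the `shift` and `scale` defaults are illustrative assumptions, not values taken from this PR.

```python
import random
import statistics


def impute_rsn_per_sample(sample, shift=1.8, scale=0.3, seed=None):
    """Replace None entries with draws from a shifted normal distribution.

    The distribution is defined by the mean and std. of the observed values
    of this one sample. `shift` and `scale` are assumed defaults.
    """
    observed = [v for v in sample if v is not None]
    if len(observed) < 2:
        raise ValueError("Need at least two observed values per sample.")
    mean = statistics.mean(observed)
    std = statistics.stdev(observed)
    rng = random.Random(seed)
    # Observed values are kept; missing ones are drawn around a down-shifted mean.
    return [
        v if v is not None else rng.gauss(mean - shift * std, scale * std)
        for v in sample
    ]


sample = [25.1, None, 27.3, 26.0, None]
imputed = impute_rsn_per_sample(sample, seed=42)
```

Calling the function once per sample (row) keeps the replacement distribution specific to that sample, in contrast to using one distribution for the whole data matrix.
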

- 🐛 fix transfer predictions for NAGuideR
           (use dumps argument if it is set)
- select smallest among top 3 models on validation data
- update configs and document which R based models worked
split_data:
- ✨ option to use randomly subselected data
- 🐛 fix also validation and test splits in data notebook
           (random_state)
- 🎨 missing values prop. on training data

performance_plots
- 🎨 remove old plots
- ✨ write test data, add comparison between splits
   (to excel output)

both
- 🎨 remove plots, move functions to top (to be moved)
for smaller HeLa dataset

- update models based on grid search using N=50 HeLa samples
  (use smallest of top3 models)
- 🎨 format two notebooks
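
The rule "use smallest of top 3 models" from the grid search notes above can be illustrated with a short sketch. The dictionary keys (`id`, `val_loss`, `n_params`) and the function name are hypothetical; the actual metric files may use a different schema.

```python
def select_smallest_of_top(models, top_n=3):
    """Pick the model with the fewest parameters among the `top_n` best.

    `models` is a list of dicts with hypothetical keys:
    'id', 'val_loss' (lower is better) and 'n_params'.
    """
    if not models:
        raise ValueError("No models to select from.")
    # Rank by validation loss, then choose the smallest model among the best.
    top = sorted(models, key=lambda m: m["val_loss"])[:top_n]
    return min(top, key=lambda m: m["n_params"])


models = [
    {"id": "a", "val_loss": 0.30, "n_params": 500},
    {"id": "b", "val_loss": 0.31, "n_params": 100},
    {"id": "c", "val_loss": 0.32, "n_params": 300},
    {"id": "d", "val_loss": 0.40, "n_params": 10},
]
best = select_smallest_of_top(models)
```

In this example model `d` is the smallest overall but is not among the top 3 on validation data, so it is not considered.
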
Uses the config from N=50 for simplicity

- split data: check that for small N features are
  not only in val and test (~< 3-4 intensities
  for peptide and precursor level)
- combine tables for one dataset with different sample number per run
- upper keys enforced
- only calculated matrices for selected models
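
The split check mentioned above (for small N, features should not end up only in the validation and test splits) might look like the following sketch. It assumes each split is given as (sample, feature) pairs of observed intensities; this layout and the function name are assumptions for illustration.

```python
def features_missing_in_train(train, val, test):
    """Return features observed in val or test but never in train.

    Each split is an iterable of (sample, feature) pairs for observed
    intensities; this layout is an assumption for illustration.
    """
    train_features = {feature for _, feature in train}
    held_out = {feature for _, feature in val} | {feature for _, feature in test}
    return sorted(held_out - train_features)


train = [("s1", "P1"), ("s2", "P1"), ("s1", "P2")]
val = [("s3", "P2"), ("s3", "P3")]
test = [("s4", "P1"), ("s4", "P4")]
only_held_out = features_missing_in_train(train, val, test)
```

A non-empty result would indicate features a model never sees during training, which is more likely when a feature has only a few measured intensities.
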

-> notebook to be made concise eventually
- start re-write to allow only application
  without early stopping  benchmarking
  (-> no control of training procedure and performance evaluation)
@enryH enryH merged commit 95b6e9e into main Jul 26, 2023