## Running hyperparameter optimization - Part 2

This notebook goes through part two of the steps and code for identifying the optimal hyperparameter settings for the Variational Autoencoder (VAE) framework that integrates multi-omics and clinical data spanning both categorical and continuous variables. <br>

The optimal settings are identified in multiple steps, considering both the reconstruction accuracy on the test and training sets and the stability/similarity of the latent space across repeated training runs. Part one focuses on the test and training reconstruction accuracies, using <i>MOVE_hyperparameter_optimization_reconstruction.ipynb</i>. The best-performing combinations from those results are then tested in this notebook for the stability of the latent space across repeated training.

In [1]:
from hydra import initialize, compose
from omegaconf import OmegaConf

from move.training.train import optimize_stability
from move.utils.data_utils import get_data, get_list_value, merge_configs
from move.utils.visualization_utils import draw_boxplot
from move.utils.analysis import get_top10_stability, calculate_latent, make_and_save_best_stability_params

The cell above imports the functions used for reading the data and performing the calculations.

The next part reads in the data. This example uses the data types included in the MOVE publication, which consist of three categorical and seven continuous data types. NOTE: the data are not available for testing.

For this part we use all the data, contrary to part 1 where it was divided into training and test sets, and investigate how similar the latent space is between repeated runs. Below we define the selected hyperparameter settings with equal or close to equal performance based on part 1. For plotting purposes we only vary three "types" of hyperparameter here: the size of the hidden layer (nHiddens), the size of the latent space (nLatents) and the dropout (nDropout). The number of hidden layers is set to 1 (nLayers=1). We repeat the training 5 times.
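To make the size of this search concrete, the sketch below enumerates every combination of the three varied hyperparameters together with the repeats. The candidate values and the helper name `enumerate_runs` are hypothetical and chosen only for illustration; the real values are read from `tuning_stability.yaml`, and `optimize_stability` may iterate over them in a different order.

```python
from itertools import product

# Hypothetical candidate values, for illustration only.
# The real values come from tuning_stability.yaml.
grid = {
    "num_hidden": [500, 1000],
    "num_latent": [50, 100],
    "dropout": [0.1, 0.3],
}
num_layers = 1  # kept fixed, as described in the text
repeats = 5     # each combination is trained this many times


def enumerate_runs(grid, repeats):
    """Yield one dict per (hyperparameter combination, repeat index)."""
    names = list(grid)
    for values in product(*(grid[name] for name in names)):
        for repeat in range(repeats):
            yield {**dict(zip(names, values)), "repeat": repeat}


runs = list(enumerate_runs(grid, repeats))
print(len(runs))  # 2 * 2 * 2 combinations x 5 repeats = 40
```

Because every combination is trained `repeats` times, the number of training runs grows multiplicatively with each extra candidate value, which is one reason to restrict part 2 to the best combinations from part 1.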

In [2]:
# Initializing the default config 
with initialize(version_base=None, config_path="src/move/conf"):
    base_config = compose(config_name="main")


def main(base_config=base_config): 
    
    # Merging the user defined data.yaml, model.yaml and tuning_stability.yaml 
    # with the base_config to override it.
    print('Overriding the default config with configs from data.yaml, model.yaml and tuning_stability.yaml')
    cfg = merge_configs(base_config=base_config, 
                        config_types=['data', 'model', 'tuning_stability'])
    
    #Getting the variables used in the notebook
    path = cfg.data.processed_data_path   
    data_of_interest = cfg.data.data_of_interest
    categorical_names = cfg.data.categorical_names
    continuous_names = cfg.data.continuous_names
    categorical_weights = cfg.data.categorical_weights
    continuous_weights = cfg.data.continuous_weights
    
    seed = cfg.model.seed
    cuda = cfg.model.cuda
    lrate = cfg.model.lrate
    kld_steps = cfg.model.kld_steps
    batch_steps = cfg.model.batch_steps
    
    nHiddens = cfg.tuning_stability.num_hidden
    nLatents = cfg.tuning_stability.num_latent
    nLayers = cfg.tuning_stability.num_layers
    nDropout = cfg.tuning_stability.dropout
    nBeta = cfg.tuning_stability.beta
    batch_sizes = cfg.tuning_stability.batch_sizes
    repeat = cfg.tuning_stability.repeats
    nepochs = cfg.tuning_stability.tuned_num_epochs
    
    # Raise an error if more than one batch size is given
    if len(batch_sizes)==1:
        batch_sizes = batch_sizes[0]
    elif len(batch_sizes)>1:
        raise ValueError('Currently the code is implemented to take only one value for batch_size')
    
    #Getting the data
    cat_list, con_list, cat_names, con_names, headers_all, drug, drug_h = get_data(path, categorical_names, continuous_names, data_of_interest)
    
    #Performing hyperparameter tuning
    embeddings, latents, con_recons, cat_recons, recon_acc = optimize_stability(nHiddens, nLatents, 
                                                                                nDropout, nBeta, repeat,
                                                                                nepochs, nLayers,
                                                                                batch_sizes, lrate, 
                                                                                kld_steps, batch_steps, 
                                                                                cuda, path, 
                                                                                con_list, cat_list,
                                                                                continuous_weights, categorical_weights,
                                                                                seed)
    
    # Getting stability results 
    stability_top10, stability_top10_df = get_top10_stability(nHiddens, nLatents, nDropout, nLayers, repeat, latents, batch_sizes, nBeta)
    
    stability_total, rand_index = calculate_latent(nHiddens, nLatents, nDropout, repeat, nLayers, nBeta, latents) # TODO: add printing of the results
    
    # Plotting the results 
    try: 
        draw_boxplot(path=path,
                     df=stability_top10,
                     title_text='Difference across replications in cosine similarity of ten closest neighbours in first iteration',
                     y_label_text="Average change",
                     save_fig_name="stability_top10")

        draw_boxplot(path=path,
                     df=stability_total,
                     title_text='Difference across replications in cosine similarity compared to first iteration',
                     y_label_text="Average change",
                     save_fig_name="stability_all")

        draw_boxplot(path=path,
                     df=rand_index,
                     title_text='Rand index across replications compared to first iteration',
                     y_label_text="Rand index",
                     save_fig_name="rand_index_all")
        print('Visualizing the hyperparameter tuning results\n')
        
    except Exception:
        print('Could not visualize the results\n')
    
    # Getting best set of hyperparameters
    hyperparams_names = ['num_hidden', 'num_latent', 'num_layers', 'dropout', 'beta', 'batch_sizes']
    make_and_save_best_stability_params(stability_top10_df, hyperparams_names, nepochs)

    return()
    
if __name__ == "__main__":
    main()


Overriding the default config with configs from data.yaml, model.yaml and tuning_stability.yaml

Configuration used: 
---
data:
  user_config: data.yaml
  na_value: 'nan'
  raw_data_path: data/
  interim_data_path: data/
  processed_data_path: data/
  version: v1
  ids_file_name: baseline_ids
  categorical_inputs:
  - name: diabetes_genotypes
    weight: 1
  - name: baseline_drugs
    weight: 1
  - name: baseline_categorical
    weight: 1
  continuous_inputs:
  - name: baseline_continuous
    weight: 2
  - name: baseline_transcriptomics
    weight: 1
  - name: baseline_diet_wearables
    weight: 1
  - name: baseline_proteomic_antibodies
    weight: 1
  - name: baseline_target_metabolomics
    weight: 1
  - name: baseline_untarget_metabolomics
    weight: 1
  - name: baseline_metagenomics
    weight: 1
  data_of_interest: baseline_drugs
  categorical_names: ${names:${data.categorical_inputs}}
  continuous_names: ${names:${data.continuous_inputs}}
  categorical_weights: ${weights:${data.

KeyboardInterrupt: 

Below we run the full grid search. Here we also save the UMAP embeddings to allow for a visual investigation of the results.

Below is the calculation and visualisation focusing only on the 10 closest neighbours of each individual.

The next part compares the runs based on the entire latent space. Furthermore, it includes code for calculating cluster stability in case the latent space is to be used for clustering (clustering is not used by MOVE in the paper; only cosine similarity on the latent space is included).
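As a rough illustration of the idea behind the top-10 comparison, the sketch below finds each individual's ten nearest neighbours by cosine similarity in the first run and measures how much the similarity to those same neighbours changes in a repeated run. This is not the implementation of `get_top10_stability` in MOVE. The function names `cosine_similarity_matrix` and `top_k_neighbour_change` and the random latent vectors are assumptions made for this example.

```python
import numpy as np


def cosine_similarity_matrix(z):
    """Pairwise cosine similarity between the rows of z (samples x latent dims)."""
    norms = np.linalg.norm(z, axis=1, keepdims=True)
    unit = z / np.clip(norms, 1e-12, None)  # guard against zero-length rows
    return unit @ unit.T


def top_k_neighbour_change(z_first, z_repeat, k=10):
    """Mean absolute change in cosine similarity to each sample's k nearest
    neighbours, where the neighbours are defined in the first run."""
    sim_first = cosine_similarity_matrix(z_first)
    sim_repeat = cosine_similarity_matrix(z_repeat)

    masked = sim_first.copy()
    np.fill_diagonal(masked, -np.inf)           # a sample is not its own neighbour
    top_idx = np.argsort(-masked, axis=1)[:, :k]

    rows = np.arange(z_first.shape[0])[:, None]
    diff = np.abs(sim_first[rows, top_idx] - sim_repeat[rows, top_idx])
    return diff.mean(axis=1)                    # one value per individual


# Illustrative data: a "repeat" that is a slightly perturbed copy of the first run.
rng = np.random.default_rng(0)
z_run1 = rng.normal(size=(50, 8))
z_run2 = z_run1 + 0.05 * rng.normal(size=(50, 8))
change = top_k_neighbour_change(z_run1, z_run2)
print(change.shape, float(change.mean()))
```

Cosine similarity is unchanged when the whole latent space is rotated or rescaled, so this measure compares the relative arrangement of individuals rather than the raw coordinates, which can differ freely between training runs.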

![stability_all.png](attachment:stability_all.png)
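For the cluster stability mentioned above, a minimal sketch of the (unadjusted) Rand index is given below: the fraction of pairs of individuals on which two clusterings agree, meaning that both place the pair in the same cluster or both place it in different clusters. The function name `rand_index` and the toy labels are assumptions for this example, and `calculate_latent` in MOVE may compute the score differently.

```python
import numpy as np


def rand_index(labels_a, labels_b):
    """Fraction of sample pairs on which two clusterings agree."""
    a = np.asarray(labels_a)
    b = np.asarray(labels_b)
    same_a = a[:, None] == a[None, :]      # pair shares a cluster in clustering A
    same_b = b[:, None] == b[None, :]      # pair shares a cluster in clustering B
    upper = np.triu_indices(len(a), k=1)   # count each unordered pair once
    return float(np.mean(same_a[upper] == same_b[upper]))


# Same partition with swapped label names: perfect agreement.
print(rand_index([0, 0, 1, 1], [1, 1, 0, 0]))  # 1.0
```

The index depends only on which individuals are grouped together and not on the cluster labels themselves, which suits repeated runs where cluster numbering is arbitrary.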

From here, the optimal setting for further analysis can be selected. Such analyses include the framework used in MOVE for identifying drug and multi-omics associations, as described in the notebook <i>identify_drug_assosiation.ipynb</i>, comparing the latent space integration to other methods (PCA) using the notebook <i>latent_space_analysis.ipynb</i>, or other types of analysis such as clustering of the latent space (not included here).