# Introduction

This notebook contains all of the scripts used to tune and apply Uniform Manifold Approximation and Projection (UMAP) to the datasets cleaned and merged in the data_preprocessing_DR.ipynb notebook. The following 4 steps are taken in this notebook:

Step 1: Import the necessary libraries and datasets for the dimensionality reduction. Each dataset is defined by a combination of cohort (MS or ALL subjects), imputation method, and session configuration (unique or combined sessions), for a total of 18 training datasets. Before running the imports, the `umap-learn` package (imported as `umap`) needs to be installed with pip, conda, or another package manager.
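
As a quick check of the dataset count, the 18 training datasets can be enumerated from the three factors listed in Step 1. This is an illustrative sketch only: the labels below are placeholders chosen to resemble the variable names used in this notebook, not the actual file names.

```python
from itertools import product

# Hypothetical labels for the three factors that define a training dataset
cohorts = ["MS", "ALL"]                          # pwMS only, or HC + pwMS
imputations = ["imp_all", "imp_type", "imp_nb"]  # the three imputation methods
sessions = ["t1", "t2", "combined"]              # two unique sessions plus the combined set

dataset_names = [f"{cohort}_{session}_{imputation}"
                 for imputation, cohort, session in product(imputations, cohorts, sessions)]
print(len(dataset_names))  # 18
```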

Step 2: Tune the hyperparameters (hps) of the UMAP algorithm (number of neighbors, minimum distance, and metric) with 3-fold cross-validation, using the make_K_folds and UMAP_gridsearch functions. The number of neighbors (UMAP's n_neighbors) is referred to as perplexity throughout the code. Each dataset is split into 3 folds. For the unique sessions datasets, the subjects are split into train and test subjects, which are then used to obtain the train and test datasets. For the combined sessions datasets, the train subjects at Y00 and Y05 are used for training and the test subjects at Y05 are used for testing. The current train fold is used to fit UMAP and K-means; the fitted UMAP then embeds the test fold, and K-means assigns the test embeddings to clusters. The adjusted Rand index (ARI) is then used to evaluate these cluster assignments against the true test labels. This sequence is repeated for every fold, after which the average ARI (AARI) is obtained for the given hp combination. These steps are then repeated for every other hp combination. A dataframe of the optimal hp values per dataset is then created with the make_gridsearch_tbl function.
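
To make the evaluation metric concrete, the sketch below computes the ARI from the contingency counts of two labelings using only the standard library. It is a stand-in written for illustration (the notebook itself calls scikit-learn's `adjusted_rand_score`), and the function name `adjusted_rand_index` is an assumption. It shows why ARI suits K-means output: the score depends only on which samples are grouped together, not on the arbitrary cluster numbers.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels_true, labels_pred):
    """Illustrative ARI: pair-counting agreement corrected for chance."""
    n = len(labels_true)
    if n < 2:
        return 1.0  # a single sample is trivially in agreement

    # Number of same-cluster pairs in the joint table and in each labeling
    sum_cells = sum(comb(c, 2) for c in Counter(zip(labels_true, labels_pred)).values())
    sum_true = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_pred = sum(comb(c, 2) for c in Counter(labels_pred).values())

    expected = sum_true * sum_pred / comb(n, 2)   # agreement expected by chance
    max_index = 0.5 * (sum_true + sum_pred)       # agreement of a perfect match
    if max_index == expected:
        return 1.0  # degenerate case, e.g. both labelings form one cluster
    return (sum_cells - expected) / (max_index - expected)

# Swapping the cluster numbers does not change the score
same_grouping = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])
# A grouping that cuts across the true labels scores below zero
crossed_grouping = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])
print(same_grouping, crossed_grouping)
```

A score of 1 means the two groupings are identical, a score near 0 means agreement is no better than chance, and negative values indicate less agreement than chance.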

Step 3: The best hp values per dataset found during step 2 are used to make the training UMAP embeddings with the apply_UMAP function, which returns the embedded arrays for the unique sessions and combined sessions datasets. A two-dimensional plot of the embedded data is produced by the group_plot_UMAP function.

Step 4: The embedded UMAP arrays are saved and reserved for later use.

These datasets were used to assess the performance of UMAP in comparison with PCA, tSNE, and TPHATE. The statistical outcomes of part 1 of the project can be found in the 'Dimensionality Reduction' subsection of the results.

# Step 1: Import Libraries & Datasets

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.cluster import KMeans
from sklearn.model_selection import KFold
from sklearn.metrics import adjusted_rand_score
from sklearn.preprocessing import StandardScaler
from umap import UMAP

In [None]:
# Import the combined sessions datasets
MS_imp_all = pd.read_excel('prepro_data/all_imp/MStrain_ia.xlsx')
ALL_imp_all = pd.read_excel('prepro_data/all_imp/ALLtrain_ia.xlsx')
MS_imp_type = pd.read_excel('prepro_data/type_imp/MStrain_it.xlsx')
ALL_imp_type = pd.read_excel('prepro_data/type_imp/ALLtrain_it.xlsx')
MS_imp_nb = pd.read_excel('prepro_data/nb_imp/MStrain_in.xlsx')
ALL_imp_nb = pd.read_excel('prepro_data/nb_imp/ALLtrain_in.xlsx')

# Import the unique sessions datasets (for the Time imputation method)
MS_t1_imp_all = pd.read_excel('prepro_data/all_imp/MStrain_ia00.xlsx')
MS_t2_imp_all = pd.read_excel('prepro_data/all_imp/MStrain_ia05.xlsx')
ALL_t1_imp_all = pd.read_excel('prepro_data/all_imp/ALLtrain_ia00.xlsx')
ALL_t2_imp_all = pd.read_excel('prepro_data/all_imp/ALLtrain_ia05.xlsx')

# Import the unique sessions datasets (for the Time + Type imputation method)
MS_t1_imp_type = pd.read_excel('prepro_data/type_imp/MStrain_it00.xlsx')
MS_t2_imp_type = pd.read_excel('prepro_data/type_imp/MStrain_it05.xlsx')
ALL_t1_imp_type = pd.read_excel('prepro_data/type_imp/ALLtrain_it00.xlsx')
ALL_t2_imp_type = pd.read_excel('prepro_data/type_imp/ALLtrain_it05.xlsx')

# Import the unique sessions datasets (for the Time + Neighbor imputation method)
MS_t1_imp_nb = pd.read_excel('prepro_data/nb_imp/MStrain_in00.xlsx')
MS_t2_imp_nb = pd.read_excel('prepro_data/nb_imp/MStrain_in05.xlsx')
ALL_t1_imp_nb = pd.read_excel('prepro_data/nb_imp/ALLtrain_in00.xlsx')
ALL_t2_imp_nb = pd.read_excel('prepro_data/nb_imp/ALLtrain_in05.xlsx')

# Group the datasets by imputation methods and then by unique session 
time_set_all = [MS_t1_imp_all, MS_t2_imp_all, ALL_t1_imp_all, ALL_t2_imp_all]
time_set_type = [MS_t1_imp_type, MS_t2_imp_type, ALL_t1_imp_type, ALL_t2_imp_type]
time_set_nb = [MS_t1_imp_nb,  MS_t2_imp_nb, ALL_t1_imp_nb, ALL_t2_imp_nb]
time_set_ls = [time_set_all, time_set_type, time_set_nb]

# Group the datasets by imputation methods and then by combined sessions
complete_set_ls = [[MS_imp_all, ALL_imp_all], [MS_imp_type, ALL_imp_type], [MS_imp_nb, ALL_imp_nb]]

# Step 2: Hyperparameter tuning for UMAP (perplexity, minimum distance, metric)

In [None]:
def make_K_folds(df):
    """
    INPUT: Dataframe
    OUTPUT: Nested list [x_train_list, x_test_list, y_test_list] with one entry per fold
    DESCRIPTION: Use k-fold to split the input data into 3 folds, and return the x_train, x_test, and y_test for each of the
    folds as separate lists.
    """
    # Get the subject IDs & number of unique time points
    subjects = df['index'].unique()
    num_ses = len(df['Time'].unique())
    
    # Get the true labels
    label_col = [col for col in df.columns if col.startswith('MS')]
    labels = df[label_col[0]]
    
    # Normalize the dataframe
    norm_df = df.drop(columns=['EDSS', 'BL_Avg_cognition', 'Time', 'index', 'HC_CI_CP'] + label_col, axis=1)
    norm_df = StandardScaler().fit_transform(norm_df)
    
    # Initialize KFold for subjects
    kfold = KFold(n_splits = 3, shuffle = True, random_state = 42)

    # Initialize output list
    x_train_list, x_test_list, y_test_list = [], [], []
    
    if num_ses == 1:  # For unique session datasets
        for train_indices, test_indices in kfold.split(norm_df):
            # Get the training and test data for fold
            x_train, x_test = norm_df[train_indices], norm_df[test_indices]
            y_test = labels.iloc[test_indices]

            # Add the fold's training and test data to the corresponding list
            x_train_list.append(x_train)
            x_test_list.append(x_test)
            y_test_list.append(y_test)
        
    else: # For combined sessions datasets
        for train_idx, test_idx in kfold.split(subjects):
            # Get the subject IDs for the training and test sets
            train_subjects, test_subjects = subjects[train_idx], subjects[test_idx]

            # Get the row positions: train subjects at both time points, test subjects at time point 2 only
            train_rows = np.flatnonzero(df['index'].isin(train_subjects).to_numpy())
            test_rows = np.flatnonzero((df['index'].isin(test_subjects) & (df['Time'] == 2)).to_numpy())

            # Get the training and test data
            x_train, x_test = norm_df[train_rows], norm_df[test_rows]
            y_test = labels.iloc[test_rows]

            # Add the fold's training and test data to the corresponding list
            x_train_list.append(x_train)
            x_test_list.append(x_test)
            y_test_list.append(y_test)
    
    return [x_train_list, x_test_list, y_test_list]
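
The subject-level split for the combined sessions datasets can be illustrated without pandas or scikit-learn. The sketch below is a simplified stand-in, not the function above: the helper names `subject_folds` and `split_rows`, the toy subject IDs, and the use of `random` in place of `KFold` are all assumptions made for illustration. It shows the property the split is meant to guarantee: no subject contributes rows to both the training and the test set of a fold, and only second-session rows are tested.

```python
import random

def subject_folds(subject_ids, n_splits=3, seed=42):
    """Shuffle the unique subject IDs and deal them into n_splits test folds."""
    ids = sorted(set(subject_ids))
    random.Random(seed).shuffle(ids)
    return [ids[i::n_splits] for i in range(n_splits)]

def split_rows(rows, test_subjects, test_time=2):
    """rows is a list of (subject, time) tuples; return train and test row positions."""
    test_set = set(test_subjects)
    # Training rows: every session of every subject outside the test fold
    train = [i for i, (subject, _) in enumerate(rows) if subject not in test_set]
    # Test rows: only the chosen session of the test subjects
    test = [i for i, (subject, time) in enumerate(rows)
            if subject in test_set and time == test_time]
    return train, test

# Toy data: six subjects, each measured at time points 1 and 2
rows = [(subject, time) for subject in "ABCDEF" for time in (1, 2)]
folds = subject_folds([subject for subject, _ in rows])

for test_subjects in folds:
    train, test = split_rows(rows, test_subjects)
    print(f"test subjects {test_subjects}: {len(train)} train rows, {len(test)} test rows")
```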

In [None]:
def UMAP_gridsearch(ls_ls_df, hp_dic):
    """
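    INPUT: 
    ls_ls_df : (Nested lists of dataframes) datasets grouped by imputation method
    hp_dic : (dict) hp values to search, with the keys 'perplexities', 'min_dist', and 'metrics'
    OUTPUT: Nested lists of arrays (Average adjusted rand index scores per hp combination)
    DESCRIPTION: Compute the AARI of every perplexity (n_neighbors), minimum distance, and metric combination for 
    each of the df, so that the optimal hp values can be selected.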
    """
    # Define the number of runs per fold and hp combination
    n_runs = 2
    
    # Initialize the output list to maintain the same structure as the input list
    output_rand_lists = []

    # Iterate over the imputation methods lists
    for m, sublist in enumerate(ls_ls_df):
        sublist_rand_indices = []
        for n, df in enumerate(sublist):
            # Get the number of clusters needed for the df
            n_clusters = 2 if 'MS' in df.columns else 4
            
            # Get k-fold splits of the data
            k_fold_ls = make_K_folds(df)
            
            # Initializes the rand indices array (per df)           
            rand_indices = np.zeros((len(hp_dic['perplexities']), len(hp_dic['min_dist']), len(hp_dic['metrics'])))

            # Perform grid search
            for i, perplexity in enumerate(hp_dic['perplexities']):
                for j, min_dist in enumerate(hp_dic['min_dist']):
                    for k, metric in enumerate(hp_dic['metrics']):
                        temp_ARIs = []
                        for l in range(0, len(k_fold_ls[0])):
                            x_train = k_fold_ls[0][l]
                            x_test = k_fold_ls[1][l]
                            y_test = k_fold_ls[2][l]

                            for run in range(n_runs):
                                # Initialise UMAP and fit/transform the x data
                                # Seed with the run index so that the repeated runs are not identical
                                umap_model = UMAP(n_components = 2, n_neighbors = perplexity, 
                                                  min_dist = min_dist, metric = metric, random_state = run)
                                x_train_umap = umap_model.fit_transform(x_train)
                                x_test_umap = umap_model.transform(x_test)

                                # Fit & apply K-means clustering
                                kmeans = KMeans(n_clusters = n_clusters, random_state = 42, n_init = 'auto')
                                kmeans.fit(x_train_umap)
                                y_pred = kmeans.predict(x_test_umap)
                                temp_ARIs.append(adjusted_rand_score(y_test, y_pred))

                        # Get average ARI score
                        rand_indices[i, j, k] = np.mean(temp_ARIs)
                        
                print(f"{i + 1}/{len(hp_dic['perplexities'])} perplexity values completed")
                
            # Add AARI per hps (for df) to sublist (for imputation type)
            sublist_rand_indices.append(rand_indices)
            print(f'Gridsearch for dataset {n} of type list {m} is completed')
        
        # Add (imputation type) sublist to output list
        output_rand_lists.append(sublist_rand_indices)
        
    return output_rand_lists

In [None]:
def make_row_names():
    """
    INPUT: None
    OUTPUT: List of strings
    DESCRIPTION: Creates the row names corresponding to the different datasets included in the make_gridsearch_tbl 
    function, in the order of the gridsearch results (unique sessions datasets first, then combined sessions datasets)
    """
    ls_row_names = []
    ls_types = ['imp_all', 'imp_type', 'imp_nb']
    ls_subjects = ['MS_', 'ALL_']
    
    # Unique sessions datasets (e.g. MS_t1_imp_all)
    for str3 in ls_types:
        for str1 in ls_subjects:
            for str2 in ['t1_', 't2_']:
                name = str1 + str2 + str3
                ls_row_names.append(name)
    # Combined sessions datasets (e.g. MS_imp_all)
    for str2 in ls_types:
        for str1 in ls_subjects:
            name = str1 + str2
            ls_row_names.append(name)   
    
    return ls_row_names

In [None]:
def make_gridsearch_tbl(time_RI, comp_RI, hp_dic):
    """
    INPUT: 
    time_RI : (nested lists of arrays) arrays AARI float for time split datasets
    comp_RI : (nested lists of arrays) arrays AARI float for time combined datasets
    OUTPUT: Dataframe
    DESCRIPTION: Create a table (df) with dataset names in column 1, best perplexity values in column 2, best 
    minimum distance values in column 3, best metrics in column 4, and the corresponding gridsearch AARI scores in column 5. 
    """
    # Make row names
    dataset_names = make_row_names()
    
    perplexities = hp_dic['perplexities']
    min_dist_values = hp_dic['min_dist']
    metrics = hp_dic['metrics']

    # Initialize lists to store maximum values, row indices, and column indices
    max_values = []
    max_perp_idx = [] # rows
    max_MD_idx = [] #col1
    max_met_idx = [] # col2

    # Iterate over time list (time_RI)
    for type_list in time_RI:
        for array in type_list:
            # Append the maximum value and its indices to the respective lists
            max_values.append(array.max())

            max_idx = np.argwhere(array == array.max())
            max_perp_idx.append(max_idx[0][0]) # first occurrence | perplexity index
            max_MD_idx.append(max_idx[0][1]) # first occurrence | minimum distance index
            max_met_idx.append(max_idx[0][2]) # first occurrence | metric index
    
    # Iterate over combined time list (comp_RI)
    for type_list in comp_RI:
        for array in type_list:
            # Append the maximum value and its indices to the respective lists
            max_values.append(array.max())
            max_idx = np.argwhere(array == array.max())
            max_perp_idx.append(max_idx[0][0]) # first occurrence | perplexity index
            max_MD_idx.append(max_idx[0][1]) # first occurrence | minimum distance index
            max_met_idx.append(max_idx[0][2]) # first occurrence | metric index
    
    # Create the DataFrame
    output_df = pd.DataFrame({
        'Dataset': dataset_names,
        'Best_Perplexity': [perplexities[i] for i in max_perp_idx],
        'Best_Min_Distance': [min_dist_values[j] for j in max_MD_idx],
        'Best_Metric': [metrics[k] for k in max_met_idx],
        'Best_ARI_Score': max_values})
     
    output_df.to_excel('updated_data/UMAP/best_gridsearch_per_dataset_tbl.xlsx', index = False)
    return output_df
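
The lookup performed for each dataset above can be shown on a small grid. The sketch below is a pure-Python illustration under stated assumptions: the helper name `best_combination` is hypothetical, and the toy scores are made-up numbers, not results from this project. Like the `np.argwhere` call above, it keeps the first combination (in perplexity, minimum distance, metric order) when several share the highest score.

```python
from itertools import product

def best_combination(scores, perplexities, min_dists, metrics):
    """Return the first hp combination with the highest score, plus that score."""
    index_grid = product(range(len(perplexities)), range(len(min_dists)), range(len(metrics)))
    # max() keeps the first maximal item, so ties resolve to the earliest grid position
    i, j, k = max(index_grid, key=lambda idx: scores[idx[0]][idx[1]][idx[2]])
    return perplexities[i], min_dists[j], metrics[k], scores[i][j][k]

# Toy 2 x 2 x 2 grid of made-up AARI values, indexed as [perplexity][min_dist][metric]
toy_scores = [[[0.10, 0.25], [0.30, 0.05]],
              [[0.30, 0.20], [0.15, 0.12]]]

best = best_combination(toy_scores, [5, 15], [0.0, 0.5], ["euclidean", "cosine"])
print(best)  # (5, 0.5, 'euclidean', 0.3)
```

Because ties are resolved by position, the order of the values in the hp dictionary can decide which combination is reported when two combinations score equally.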

In [None]:
def plot_UMAP_gridsearch(time_RI, comp_RI, hp_dic):
    """
    INPUT: 
    time_RI : (Nested lists of floats) nested lists of gridsearch AARI scores for time split datasets
    comp_RI : (Nested lists of floats) nested lists of gridsearch AARI scores for time combined datasets
    OUTPUT: 6 figures
    DESCRIPTION: Plot the AARI scores for each UMAP gridsearch run. In each figure, each row corresponds to a 
    dataset (t1, t2, combined) and each column to a different metric. There are 2 figures per imputation style,
    corresponding to the pwMS datasets and the all-subjects (HC + pwMS) datasets. 
    """
    # Define the hyperparameters
    perplexities = hp_dic['perplexities']
    min_dist_values = hp_dic['min_dist']
    metrics = hp_dic['metrics']
    
    # Define the lists to iterate over for plotting & naming. 
    type_counter = 0
    imp_types = ['All Imputation', 'Type Imputation', 'Neighbor Imputation']
    file_name = ['all', 'type', 'neighbor']
    MS_col_ls = [['#FFC3D7', '#FF90B5', '#EA608D', '#B72253'], 
                 ['#D7BDE2', '#A569BD', '#7D3C98', '#4A235A'],
                 ['#D6EAF8', '#5DADE2', '#2E86C1', '#1B4F72']]
    ALL_col_ls = [['#F9E79F', '#F1C40F', '#F8C471', '#F39C12'],
                  ['#A9DFBF', '#27AE60', '#1E8449', '#145A32'],
                  ['#D6EAF8', '#5DADE2', '#2E86C1', '#1B4F72']]

    # Make 6 main figures (imputation types)
    for fig_num in range(1, 7):
        plt.figure(figsize=(24, 24))

        # Define the colors, participants and index range for plotting
        coloring = MS_col_ls if fig_num % 2 != 0 else ALL_col_ls 
        df_type = 'pwMS' if fig_num % 2 != 0 else 'HC + pwMS'
        df_ind_range = [0, 0, 0, 1, 1, 1, 0, 0, 0] if fig_num % 2 != 0 else [2, 2, 2, 3, 3, 3, 1, 1, 1]      

        # Make 9 main plots (MS & ALL)
        for plot_num in range(1, 10):
            plt.subplot(3, 3, plot_num)

            if plot_num <= 3: #tp1
                df_idx = df_ind_range[plot_num - 1]
                m_idx = plot_num - 1
                plot_title = [df_type, 1, metrics[m_idx]]

                for i, md in enumerate(min_dist_values):
                    plt.plot(perplexities, time_RI[type_counter][df_idx][:, i, m_idx], 
                             label=f'min distance={md}', color = coloring[m_idx][i])

            elif plot_num > 3 and plot_num <= 6: # tp2
                df_idx = df_ind_range[plot_num - 1]
                m_idx = plot_num - 4
                plot_title = (df_type, 2, metrics[m_idx])

                for i, md in enumerate(min_dist_values):
                    plt.plot(perplexities, time_RI[type_counter][df_idx][:, i, m_idx], 
                             label=f'min distance={md}', color = coloring[m_idx][i])

            elif plot_num > 6: # combined
                df_idx = df_ind_range[plot_num - 1]
                m_idx = plot_num - 7
                plot_title = (df_type, 'Combined', metrics[m_idx])

                for i, md in enumerate(min_dist_values):
                    plt.plot(perplexities, comp_RI[type_counter][df_idx][:, i, m_idx], 
                             label=f'min distance={md}', color = coloring[m_idx][i])

            # Add plot visuals
            plt.xlabel('Perplexity')
            plt.ylabel('Average Adjusted Rand Index')
            plt.title(f'{imp_types[type_counter]} for {plot_title[0]} at Time Point {plot_title[1]} and Metric {plot_title[2]}')
            plt.legend(title = 'Minimum Distance')
            plt.grid(True)
        
        plt.tight_layout()
        plt.savefig(f'output/UMAP/{file_name[type_counter]}/Gridsearch_plots_figure_{file_name[type_counter]}_{plot_title[1]}_{fig_num}.png')
        
        if fig_num % 2 == 0: # Even figures (HC + pwMS datasets)
            type_counter += 1
            
        plt.show()

In [None]:
# Define the hps to use in gridsearch
param_dict = {
    'perplexities': [2, 5, 10, 15, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120],
    'min_dist': [0.0, 0.1, 0.5, 0.8],
    'metrics': ['euclidean', 'cosine', 'canberra']
}

# Run the UMAP gridsearch
time_gridsearch = UMAP_gridsearch(time_set_ls, param_dict)
comp_gridsearch = UMAP_gridsearch(complete_set_ls, param_dict)

# Make the table with best hps and corresponding ARI score
best_gridsearch_df = make_gridsearch_tbl(time_gridsearch, comp_gridsearch, param_dict)

# Plot the gridsearch runs
plot_UMAP_gridsearch(time_gridsearch, comp_gridsearch, param_dict)

# Step 3: Apply UMAP with Gridsearch Results (+ plot)

In [None]:
def group_plot_UMAP(time_arrays, time_ls, comp_arrays, comp_ls, best_GS_df):
    """
    INPUT: 
    time_arrays : (nested lists of arrays) nested lists with arrays of the UMAP embeddings for the time separated datasets
    time_ls : (nested lists of dataframes) nested lists with dataframes for the time separated datasets
    comp_arrays : (nested lists of arrays) nested lists with arrays of the UMAP embeddings for the time combined datasets
    comp_ls : (nested lists of dataframes) nested lists with dataframes for the time combined datasets
    best_GS_df : (dataframe) dataframe of the gridsearch outcomes
    OUTPUT: 3 figures of 2 by 3 subplots
    DESCRIPTION: Creates two-dimensional plots of the UMAP embedded dataframes
    """
    # Make list of file names for saving, and list to order the plots within the figure
    file_name = ['all_imp', 'type_imp', 'neighbor_imp']
    ordered_ls = [0,1,0,2,3,1]
    GS_ints = [0,1,12,2,3,13]
    
    # Make n main figures (imputation types)
    for fig_num in range(1, len(time_ls) + 1):
        plt.figure(figsize=(24, 12))

        # Make m main plots
        num_plots = len(time_arrays[0]) + len(comp_arrays[0])
        for plot_num, df_ind in enumerate(ordered_ls):
            plt.subplot(2, int(num_plots/2), plot_num + 1)
            
            # Assign a df and array to plot
            label_df = comp_ls[fig_num - 1][df_ind] if plot_num == len(GS_ints)/2 - 1 or plot_num == len(GS_ints) - 1 else time_ls[fig_num - 1][df_ind]
            plotting_array = comp_arrays[fig_num - 1][df_ind] if plot_num == len(GS_ints)/2 - 1 or plot_num == len(GS_ints) - 1 else time_arrays[fig_num - 1][df_ind]
            
            # Define the plot colours & label column
            label_col = [col for col in label_df.columns if col.startswith('MS')]
            color_map = {0: 'pink', 1: 'orange', 2: 'purple'} if label_col[0] == 'MStype' else {0: 'green', 1: 'purple'}
            legend_labels = {0: 'PPMS', 1: 'SPMS', 2: 'RRMS'} if label_col[0] == 'MStype' else {0: 'HC', 1: 'MS'}
            mapped_colors = label_df['MStype'].map(color_map) if label_col[0] == 'MStype' else label_df['MS'].map(color_map)  
            
            # Make the plots
            for category, color in color_map.items():
                indices = label_df[label_col[0]] == category
                plt.scatter(plotting_array[indices, 0], plotting_array[indices, 1], 
                            c = color, label = legend_labels[category], alpha=0.7)
            
            # Make plot labels
            df_name = best_GS_df.iloc[GS_ints[plot_num], 0]
            perplexity = best_GS_df.iloc[GS_ints[plot_num], 1]
            min_distance = best_GS_df.iloc[GS_ints[plot_num], 2]
            metric = best_GS_df.iloc[GS_ints[plot_num], 3]
            plt.xlabel('UMAP Component 1')
            plt.ylabel('UMAP Component 2')
            plt.title(f'{df_name} (Perplexity={perplexity}, Min Distance={min_distance}, Metric={metric})')
            plt.legend()
            plt.grid(True)  
        
        GS_ints = [x + 2 if i == len(GS_ints)/2 - 1 or i == len(GS_ints) - 1 else x + 4 for i, x in enumerate(GS_ints)]

        # Make figures and save
        plt.tight_layout()
        plt.savefig(f'output/UMAP/{file_name[fig_num - 1]}/best_param_UMAP_plots_{file_name[fig_num - 1]}_multicolor.png')
        plt.show()

In [None]:
def apply_UMAP(time_ls, comp_ls, best_GS_df, plot_param):
    """
    INPUT:
    time_ls : (nested lists of dataframes) nested lists with dataframe for the unique sessions datasets
    comp_ls : (nested lists of dataframes) nested lists with dataframe for the combined sessions datasets
    best_GS_df : (dataframe) dataframe of the gridsearch outcomes
    plot_param : (Boolean) True/False make a 2-dimensional plot of the UMAP embeddings
    OUTPUT: 2 nested lists of arrays (UMAP embeddings of the unique sessions and combined sessions datasets)
    DESCRIPTION: Applies UMAP fitting to each of the dataframes in the given nested lists of dataframes, based on
    its optimal perplexity (n_neighbors), minimum distance, and metric values. Plots the output arrays of the UMAP 
    fittings if plot_param is True.
    """
    # Initialise output lists
    output_time_ls, output_comp_ls = [], []
    
    # Needed to iterate through best_GS_df 
    counter = 0

    # Iterate through the unique sessions dataset
    for sublist in time_ls:
        type_list = []
        
        for df in sublist:
            # Get name, perplexity, minimum distance, metric and label column name for the df
            df_name = best_GS_df.iloc[counter, 0]
            perplexity = best_GS_df.iloc[counter, 1]
            min_dist = best_GS_df.iloc[counter, 2]
            metric = best_GS_df.iloc[counter, 3]
            label_col = [col for col in df.columns if col.startswith('MS')]

            # Remove target variables (same columns as in make_K_folds) and normalise the df
            norm_df = df.drop(columns = ['EDSS', 'BL_Avg_cognition', 'Time', 'index', 'HC_CI_CP'] + label_col)
            norm_df = StandardScaler().fit_transform(norm_df) 

            # Run the UMAP model
            umap_model = UMAP(n_components = 2, n_neighbors = perplexity, min_dist = min_dist, 
                              metric = metric, random_state = 42)
            umap_array = umap_model.fit_transform(norm_df)
 
            type_list.append(umap_array)

            # Update the counter
            counter += 1

        output_time_ls.append(type_list)
        

    for sublist in comp_ls:
        type_list = []
        
        for df in sublist:
            # Get name, perplexity, minimum distance, metric and label column name for the df
            df_name = best_GS_df.iloc[counter, 0]
            perplexity = best_GS_df.iloc[counter, 1]
            min_dist = best_GS_df.iloc[counter, 2]
            metric = best_GS_df.iloc[counter, 3]
            label_col = [col for col in df.columns if col.startswith('MS')]

            # Remove target variables (same columns as in make_K_folds) and normalise the df
            norm_df = df.drop(columns = ['EDSS', 'BL_Avg_cognition', 'Time', 'index', 'HC_CI_CP'] + label_col)
            norm_df = StandardScaler().fit_transform(norm_df) 

            # Run the UMAP model
            umap_model = UMAP(n_components = 2, n_neighbors = perplexity, min_dist = min_dist, 
                              metric = metric, random_state = 42)
            umap_array = umap_model.fit_transform(norm_df)

            type_list.append(umap_array)

            # Update the counter
            counter += 1

        output_comp_ls.append(type_list)
                       
    # Plotting condition (if true, plots are generated)
    if plot_param:
        group_plot_UMAP(output_time_ls, time_ls, output_comp_ls, comp_ls, best_GS_df)
     
    return output_time_ls, output_comp_ls

In [None]:
# Run step 3 (apply_UMAP)
time_UMAP_arrays, comp_UMAP_arrays = apply_UMAP(time_set_ls, complete_set_ls, best_gridsearch_df, True)                          

# Step 4: Save the UMAP embedded dataset

In [None]:
def export_UMAP_embedings(ls_ls_umap_array, ls_ls_matching_df):
    """
    INPUT:
    ls_ls_umap_array : (nested list of np.array) nested list of array of the UMAP embedding
    ls_ls_matching_df : (nested list of pd.dataframe) nested list of original pre embedding dataframe
    OUTPUT: None (the embeddings are written to Excel files)
    DESCRIPTION: Exports the UMAP embeddings as dataframes with the same index/subject IDs as their original dataset. 
    """
    imp_file = ['all', 'type', 'neighbor']
    imp_df = ['ia', 'it', 'in']

    for imp_idx, imp_ls in enumerate(ls_ls_umap_array):
        for emb_idx, umap_emb in enumerate(imp_ls):
            # Make a dataframe from the array
            output_df = pd.DataFrame(umap_emb, columns = [f'UMAP{i+1}' for i in range(umap_emb.shape[1])])

            # Reintroduce the participant ID (index)
            output_df['index'] = ls_ls_matching_df[imp_idx][emb_idx]['index'].reset_index(drop = True)
            columns = ['index'] + [col for col in output_df.columns if col != 'index']
            
            # Reorder the columns such that index is first
            output_df = output_df[columns]

            # Check for time split list or not
            if len(ls_ls_matching_df[0]) > 3:
                sub_type = 'MStrain_' if emb_idx < 2 else 'ALLtrain_'
                year = '00' if emb_idx % 2 == 0 else '05'

                # Save the new df as an excel file
                output_df.to_excel(f'output/UMAP/{imp_file[imp_idx]}/UMAP_{sub_type}{imp_df[imp_idx]}{year}.xlsx', index=False)
            
            else:
                sub_type = 'MStrain_' if emb_idx == 0 else 'ALLtrain_'

                # Save the new df as an excel file
                output_df.to_excel(f'output/UMAP/{imp_file[imp_idx]}/UMAP_{sub_type}{imp_df[imp_idx]}.xlsx', index=False)

In [None]:
# Run the export_UMAP_embedings function for the unique sessions and the combined sessions.
export_UMAP_embedings(time_UMAP_arrays, time_set_ls)
export_UMAP_embedings(comp_UMAP_arrays, complete_set_ls)