# Introduction

This notebook contains all of the scripts used to tune and apply the temporal potential of heat-diffusion for affinity-based trajectory embedding (TPHATE) to the datasets cleaned and merged in the data_preprocessing_DR.ipynb notebook. The notebook follows 4 steps:

Step 1: Import the libraries and datasets needed for the dimensionality reduction. The datasets cover the combinations of MS/ALL, imputation method, and unique/combined sessions, for a total of 18 training datasets. Before loading the datasets, install the 'tphate' library with pip, conda, or any other package installer.

Step 2: Tune the hyperparameters (hps) of the TPHATE algorithm, namely the number of principal components (PCs) and the diffusion time step (t), using k-fold cross-validation in the make_K_folds and TPHATE_gridsearch functions. Each dataset is split into 3 folds. For the unique sessions datasets, the subjects are split into train and test subjects, which are then used to obtain the train and test datasets. For the combined sessions datasets, the train subjects at Y00 and Y05 are used for training and the test subjects at Y05 are used for testing. The current train fold is used to fit the TPHATE and KMeans algorithms; the fitted TPHATE is then used to embed the test fold, and KMeans predicts cluster labels for those test embeddings. The adjusted Rand index (ARI) is used to evaluate the predictions against the true test labels. This sequence is repeated for every fold, after which the average ARI (AARI) is obtained for the given hp combination. These steps are then repeated for the other hyperparameter (hp) combinations. A dataframe of the optimal hp values per dataset is then created with the make_gridsearch_tbl function.

Step 3: The best hp values per dataset found during step 2 are used to make the training TPHATE embeddings with the apply_TPHATE function, which returns the embedded arrays for the unique sessions and combined sessions datasets. A 2-dimensional plot of the embedded data (which therefore needs at least 2 PCs) is produced by the group_plot_TPHATE function.

Step 4: The embedded TPHATE arrays are saved and reserved for later use.

These datasets were used to assess the performance of TPHATE in comparison with PCA, tSNE, and UMAP. The statistical outcomes of part 1 of the project can be found in the 'Dimensionality Reduction' subsection of the results.

# Step 1: Import Libraries & Datasets

In [None]:
import tphate
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.cluster import KMeans
from sklearn.model_selection import KFold
from sklearn.metrics import adjusted_rand_score
from sklearn.preprocessing import StandardScaler

In [None]:
# Import the combined sessions datasets
MS_imp_all = pd.read_excel('prepro_data/all_imp/MStrain_ia.xlsx')
ALL_imp_all = pd.read_excel('prepro_data/all_imp/ALLtrain_ia.xlsx')
MS_imp_type = pd.read_excel('prepro_data/type_imp/MStrain_it.xlsx')
ALL_imp_type = pd.read_excel('prepro_data/type_imp/ALLtrain_it.xlsx')
MS_imp_nb = pd.read_excel('prepro_data/nb_imp/MStrain_in.xlsx')
ALL_imp_nb = pd.read_excel('prepro_data/nb_imp/ALLtrain_in.xlsx')

# Import the unique sessions datasets (for the Time imputation method)
MS_t1_imp_all = pd.read_excel('prepro_data/all_imp/MStrain_ia00.xlsx')
MS_t2_imp_all = pd.read_excel('prepro_data/all_imp/MStrain_ia05.xlsx')
ALL_t1_imp_all = pd.read_excel('prepro_data/all_imp/ALLtrain_ia00.xlsx')
ALL_t2_imp_all = pd.read_excel('prepro_data/all_imp/ALLtrain_ia05.xlsx')

# Import the unique sessions datasets (for the Time + Type imputation method)
MS_t1_imp_type = pd.read_excel('prepro_data/type_imp/MStrain_it00.xlsx')
MS_t2_imp_type = pd.read_excel('prepro_data/type_imp/MStrain_it05.xlsx')
ALL_t1_imp_type = pd.read_excel('prepro_data/type_imp/ALLtrain_it00.xlsx')
ALL_t2_imp_type = pd.read_excel('prepro_data/type_imp/ALLtrain_it05.xlsx')

# Import the unique sessions datasets (for the Time + Neighbor imputation method)
MS_t1_imp_nb = pd.read_excel('prepro_data/nb_imp/MStrain_in00.xlsx')
MS_t2_imp_nb = pd.read_excel('prepro_data/nb_imp/MStrain_in05.xlsx')
ALL_t1_imp_nb = pd.read_excel('prepro_data/nb_imp/ALLtrain_in00.xlsx')
ALL_t2_imp_nb = pd.read_excel('prepro_data/nb_imp/ALLtrain_in05.xlsx')

# Group the datasets by imputation methods and then by unique session 
time_set_all = [MS_t1_imp_all, MS_t2_imp_all, ALL_t1_imp_all, ALL_t2_imp_all]
time_set_type = [MS_t1_imp_type, MS_t2_imp_type, ALL_t1_imp_type, ALL_t2_imp_type]
time_set_nb = [MS_t1_imp_nb,  MS_t2_imp_nb, ALL_t1_imp_nb, ALL_t2_imp_nb]
time_set_ls = [time_set_all, time_set_type, time_set_nb]

# Group the datasets by imputation methods and then by combined sessions
complete_set_ls = [[MS_imp_all, ALL_imp_all], [MS_imp_type, ALL_imp_type], [MS_imp_nb, ALL_imp_nb]]

# Step 2: Hyperparameter tuning for TPHATE (number of PCs and t)
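
The fold-level evaluation used in this step (fit the embedding and KMeans on the train fold, embed the test fold, predict clusters, score with ARI, then average) can be sketched on synthetic data. This is only an illustration under stated assumptions, not the tuning code of this notebook: `PCA` from scikit-learn stands in for the TPHATE embedding so the sketch runs without the `tphate` package, and the toy data, the helper name `fold_aari`, and its parameters are hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA  # stand-in for tphate.TPHATE (assumption)
from sklearn.metrics import adjusted_rand_score
from sklearn.model_selection import KFold


def fold_aari(x, y, n_components, n_clusters, n_splits=3, seed=42):
    """Return the average ARI over the folds for one hyperparameter setting."""
    kfold = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    fold_aris = []
    for train_idx, test_idx in kfold.split(x):
        # Fit the embedding on the train fold only, then project the test fold
        embedder = PCA(n_components=n_components)
        train_emb = embedder.fit_transform(x[train_idx])
        test_emb = embedder.transform(x[test_idx])

        # Cluster the train embedding, then assign test points to the nearest centroid
        kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
        kmeans.fit(train_emb)
        y_pred = kmeans.predict(test_emb)

        # Compare the predicted clusters with the true test labels
        fold_aris.append(adjusted_rand_score(y[test_idx], y_pred))
    return float(np.mean(fold_aris))


# Toy data (hypothetical): two well-separated groups of 30 samples, 5 features each
rng = np.random.default_rng(0)
x_toy = np.vstack([rng.normal(0, 1, (30, 5)), rng.normal(6, 1, (30, 5))])
y_toy = np.array([0] * 30 + [1] * 30)

# Score a small grid of component counts and keep the best setting
scores = {n: fold_aari(x_toy, y_toy, n_components=n, n_clusters=2) for n in (1, 2, 3)}
best_n = max(scores, key=scores.get)
print(scores, best_n)
```

The embedding is refitted inside every fold so that the test fold never influences the fitted model, which is the same separation the grid search in this step relies on.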

In [None]:
def make_K_folds(df):
    """
    INPUT: Dataframe
    OUTPUT: Nested list containing the train arrays, test arrays, and test labels per fold
    DESCRIPTION: Use k folds to split the input data into 3 splits, and return the x_train, x_test, and y_test for each of the
    folds as unique lists.
    """
    # Get the subject IDs & number of unique time points
    subjects = df['index'].unique()
    num_ses = len(df['Time'].unique())
    
    # Get the true labels
    label_col = [col for col in df.columns if col.startswith('MS')]
    labels = df[label_col[0]]
    
    # Normalize the dataframe
    norm_df = df.drop(columns=['EDSS', 'BL_Avg_cognition', 'Time', 'index', 'HC_CI_CP'] + label_col, axis=1)
    norm_df = StandardScaler().fit_transform(norm_df)
    
    # Initialize KFold for subjects
    kfold = KFold(n_splits = 3, shuffle = True, random_state = 42)

    # Initialize output list
    x_train_list, x_test_list, y_test_list = [], [], []
    
    if num_ses == 1:  # For unique session datasets
        for train_indices, test_indices in kfold.split(norm_df):
            # Get the training and test data for fold
            x_train, x_test = norm_df[train_indices], norm_df[test_indices]
            y_test = labels[test_indices]

            # Add the fold's training and test data to the corresponding list
            x_train_list.append(x_train)
            x_test_list.append(x_test)
            y_test_list.append(y_test)
        
    else: # For combined sessions datasets
        for train_idx, test_idx in kfold.split(subjects):
            # Get the subject IDs for the training and test sets
            train_subjects, test_subjects = subjects[train_idx], subjects[test_idx]

            # Get the row positions: train subjects at both time points, test subjects at time point 2 only
            train_rows = np.flatnonzero(df['index'].isin(train_subjects).to_numpy())
            test_rows = np.flatnonzero((df['index'].isin(test_subjects) & (df['Time'] == 2)).to_numpy())

            # Get the training and test data
            x_train, x_test = norm_df[train_rows], norm_df[test_rows]
            y_test = labels.iloc[test_rows]

            # Add the fold's training and test data to the corresponding list
            x_train_list.append(x_train)
            x_test_list.append(x_test)
            y_test_list.append(y_test)
    
    return [x_train_list, x_test_list, y_test_list]

In [None]:
def TPHATE_gridsearch(ls_ls_df, n_runs, param_grid):
    """
    INPUT: 
    ls_ls_df : (Nested list of dataframes)
    n_runs : (integer) number of runs per kfold
    param_grid : (dictionary) values to search for 'n_components' and 't'
    OUTPUT: Nested list of floats (best AARI), nested list of dictionaries (best hps), nested list of arrays (with AARIs per hps).
    DESCRIPTION: Find the optimal PC numbers & diffusion time steps (t) to include for each of the df based on 
    their obtained average adjusted rand index
    """
    # Initialise output lists for score and parameters
    output_scores, output_params, output_rand = [], [], []

    for ls_df in ls_ls_df:
        # Initialise sub-output lists for score and parameters
        sub_scores, sub_params, sub_rand = [], [], []
        
        for df_num, df in enumerate(ls_df):
            best_score = -np.inf
            best_params = None
            output_rand_ls = []
        
            # Get the number of clusters needed for the df
            n_clusters = 2 if 'MS' in df.columns else 4
    
            # Get k-fold splits of the data
            k_fold_ls = make_K_folds(df)
    
            # Initializes the rand indices array (per df) 
            rand_indices = np.zeros((len(param_grid['n_components']), len(param_grid['t'])))
            
            # Perform grid search
            for i, n_components in enumerate(param_grid['n_components']): # Iterate over number of PCs
                for j, t in enumerate(param_grid['t']): # Iterate over diffusion time steps
                    temp_ARIs = []
                    for k in range(0, len(k_fold_ls[0])): # Iterate over the K folds
                        x_train = k_fold_ls[0][k]
                        x_test = k_fold_ls[1][k]
                        y_test = k_fold_ls[2][k]
                        
                        for run in range(n_runs): # Repeat for the number of runs per fold
                            tphate_model = tphate.TPHATE(verbose = 0, n_jobs = -1, n_landmark = x_train.shape[0],
                                                          n_components = n_components, t = t)
        
                            train_transformed = tphate_model.fit_transform(x_train)
                            test_transformed = tphate_model.transform(x_test)
        
                            # Evaluate the performance of the hps at the given run of Kfold with Kmeans & ARI
                            kmeans = KMeans(n_clusters = n_clusters, random_state = 42)
                            kmeans.fit(train_transformed)
                            y_pred = kmeans.predict(test_transformed)
                            temp_ARIs.append(adjusted_rand_score(y_test, y_pred))
        
                    # Get average ARI score
                    AARI = np.mean(temp_ARIs)
                    rand_indices[i,j] = AARI
        
                    if AARI > best_score:
                        best_score = AARI
                        best_params = {'n_components' : n_components, 't': t}

            sub_scores.append(best_score)
            sub_params.append(best_params)
            sub_rand.append(rand_indices)
        
        # Update the output lists
        output_scores.append(sub_scores)
        output_params.append(sub_params)
        output_rand.append(sub_rand)
        
    return output_scores, output_params, output_rand

In [None]:
def make_row_names():
    """
    INPUT: 
    OUTPUT: List of strings
    DESCRIPTION: Creates the row names corresponding to the different datasets included in the make_gridsearch_tbl 
    function
    """
    ls_row_names = []
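    ls_types = ['imp_all', 'imp_type', 'imp_nb']
    ls_subjects = ['MS_', 'ALL_']
    
    # Unique sessions datasets first (same order as time_set_ls), then combined sessions datasets
    for str3 in ls_types:
        for str1 in ls_subjects:
            for str2 in ['t1_', 't2_']: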
                name = str1 + str2 + str3
                ls_row_names.append(name)
    for str2 in ls_types:
        for str1 in ls_subjects:
            name = str1 + str2
            ls_row_names.append(name)   
    
    return ls_row_names

In [None]:
def make_gridsearch_tbl(time_best_param, time_best_score, comp_best_param, comp_best_score):
    """
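    INPUT: 
    time_best_param : (nested lists of dictionaries) best hps for the time split datasets
    time_best_score : (nested lists of floats) best AARI for the time split datasets
    comp_best_param : (nested lists of dictionaries) best hps for the time combined datasets
    comp_best_score : (nested lists of floats) best AARI for the time combined datasets
    OUTPUT: Dataframe
    DESCRIPTION: Create a table (df) with dataset names in column 1, best number of PCs values in column 2, best 
    diffusion time steps in column 3, and gridsearch AARI scores in column 4. 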
    """
    # Make row names
    dataset_names = make_row_names()
    
    # Initialize lists to store maximum values, row indices, and column indices
    max_ari, max_components, max_t = [], [], []

    # Iterate over the time split results
    for i, sublist in enumerate(time_best_param):
        for j, params in enumerate(sublist):
            # Append the best AARI and its hyperparameters to the respective lists
            max_ari.append(time_best_score[i][j])
            max_components.append(params['n_components'])
            max_t.append(params['t'])
    
    # Iterate over the time combined results
    for i, sublist in enumerate(comp_best_param):
        for j, params in enumerate(sublist):
            # Append the best AARI and its hyperparameters to the respective lists
            max_ari.append(comp_best_score[i][j])
            max_components.append(params['n_components'])
            max_t.append(params['t'])
    
    # Create the DataFrame
    output_df = pd.DataFrame({
        'Dataset': dataset_names,
        'Best_n_components': max_components,
        'Best_diffusion_time_step': max_t,
        'Best_ARI_Score': max_ari})
     
    output_df.to_excel('updated_data/TPHATE/best_gridsearch_per_dataset_tbl.xlsx', index = False)
    return output_df

In [None]:
def plot_TPHATE_gridsearch(time_gridsearch, comp_gridsearch):
    """
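    INPUT: 2 nested lists of 2D numpy arrays (AARI per hp combination) for the time split and time combined datasets
    OUTPUT: 3 figures (with 3x2 subplots)
    DESCRIPTION: Plot the AARI scores for each TPHATE gridsearch run. Figures 1-3 correspond to the different 
    imputation types; within each figure, the left column shows the MS only datasets and the right column shows 
    the all patients datasets.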
    """
        
    # Define the number of PCs and diffusion time step ranges, plus the plot labels and colours
    n_components = list(range(1,16))
    n_t = [1, 3, 5, 10, 20, 50]
    imp_types = ['All Imputation', 'Type Imputation', 'Neighbor Imputation']
    time_point = ['1', '1', '2', '2', 'Combined']
    file_name = ['all_imp', 'type_imp', 'neighbor_imp']
    df_ind_range = [0, 2, 1, 3]
    MS_col_ls = ['#EBDEF0','#D2B4DE', '#AF7AC5', '#8E44AD','#76448A', '#4A235A']
    ALL_col_ls = ['#D4EFDF', '#A9DFBF', '#7DCEA0', '#27AE60', '#1E8449', '#145A32']  


    # Make 3 main figures (imputation types)
    for fig_num in range(1, 4):
        plt.figure(figsize=(18, 24))

        # Make 6 main plots (MS & ALL)
        for plot_num in range(1, 7):
            plt.subplot(3, 2, plot_num)

            if plot_num % 2 != 0 and plot_num != 5: #Odd & not 5
                df_ind = df_ind_range[plot_num - 1]
                plot_title = ('MS Patients Only', time_point[plot_num - 1])
                for i, t in enumerate(n_t): 
                    plt.plot(n_components, time_gridsearch[fig_num - 1][df_ind][:,i], label = f't = {t}', color = MS_col_ls[i])

            elif plot_num % 2 == 0 and plot_num != 6: #Even & not 6
                df_ind = df_ind_range[plot_num - 1]
                plot_title = ('All Patients', time_point[plot_num - 1])
                for i, t in enumerate(n_t):
                    plt.plot(n_components, time_gridsearch[fig_num - 1][df_ind][:,i], label = f't = {t}', color = ALL_col_ls[i])
                
            elif plot_num == 5:
                plot_title = ('MS Patients Only', time_point[4])
                for i, t in enumerate(n_t):
                    plt.plot(n_components, comp_gridsearch[fig_num - 1][plot_num - 5][:,i], label = f't = {t}', color = MS_col_ls[i])
                    
            elif plot_num == 6:
                plot_title = ('All Patients', time_point[4])
                for i, t in enumerate(n_t):
                    plt.plot(n_components, comp_gridsearch[fig_num - 1][plot_num - 5][:,i], label = f't = {t}', color = ALL_col_ls[i])
                
            plt.xlabel('Number of PCs for affinity matrix')
            plt.ylabel('Average Adjusted Rand Index')
            plt.title(f'{imp_types[fig_num - 1]} for {plot_title[0]} Dataset at Time Point {plot_title[1]}')
            plt.legend(title = 'Diffusion time step (t)')
            plt.grid(True)

        plt.tight_layout()
        plt.savefig(f'output/TPHATE/{file_name[fig_num - 1]}/Gridsearch_plots_figure_{file_name[fig_num - 1]}_{plot_title[1]}.png')
        plt.show()

In [None]:
# Define the parameter grid
param_grid = {
    'n_components': list(range(1,16)),
    't': [1, 3, 5, 10, 20, 50]
}

# Run the TPHATE search
time_best_scores, time_best_params, time_gridsearch = TPHATE_gridsearch(time_set_ls, 2, param_grid)
comp_best_scores, comp_best_params, comp_gridsearch = TPHATE_gridsearch(complete_set_ls, 2, param_grid)

# Make the table with best hps and corresponding ARI score
best_gridsearch_df = make_gridsearch_tbl(time_best_params, time_best_scores, comp_best_params, comp_best_scores)

# Plot the gridsearch runs
plot_TPHATE_gridsearch(time_gridsearch, comp_gridsearch)

# Step 3: Apply TPHATE with Gridsearch Results (+ plot)
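
The hyperparameter lookup used in this step can be sketched with a toy table. The column names follow the table built in Step 2, but the values are placeholders rather than results from this project, and the helper name `plot_ready_params` is hypothetical. The sketch shows why the number of components is raised to at least 2: a 2-dimensional scatter plot needs two embedding columns even when the grid search selected a single component.

```python
import pandas as pd

# Placeholder grid search table (values are illustrative only, not project results)
toy_gs = pd.DataFrame({
    'Dataset': ['MS_t1_imp_all', 'MS_t2_imp_all'],
    'Best_n_components': [1, 5],
    'Best_diffusion_time_step': [10, 3],
    'Best_ARI_Score': [0.0, 0.0],
})


def plot_ready_params(gs_df, row):
    """Return (dataset name, n_components >= 2, t) for one row of the table."""
    name = gs_df.iloc[row, 0]
    # Use at least 2 components so the embedding can be drawn as a 2D scatter plot
    n_components = max(2, int(gs_df.iloc[row, 1]))
    t = int(gs_df.iloc[row, 2])
    return name, n_components, t


for row in range(len(toy_gs)):
    print(plot_ready_params(toy_gs, row))
```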

In [None]:
def group_plot_TPHATE(time_arrays, time_ls, comp_arrays, comp_ls, best_GS_df):
    """
    INPUT: 
    time_arrays : (nested lists of arrays) nested lists with arrays of the TPHATE embeddings for the time separated datasets
    time_ls : (nested lists of dataframes) nested lists with dataframe for the time separated datasets
    comp_arrays : (nested lists of arrays) nested lists with arrays of the TPHATE embeddings for the time combined datasets
    comp_ls : (nested lists of dataframes) nested lists with dataframe for the time combined datasets
    best_GS_df : (dataframe) dataframe of the gridsearch outcomes
    OUTPUT: 3 figures of 2 by 3 subplots
    DESCRIPTION: creates 2 dimensional plots of the TPHATE embedded dataframes
    """
    # Make list of file names for saving, and list to order the plots within the figure
    file_name = ['all_imp', 'type_imp', 'neighbor_imp']
    ordered_ls = [0,2,1,3,0,1]
    GS_ints = [0,2,1,3,12,13]
    
    # Make n main figures (imputation types)
    for fig_num in range(1, len(time_ls) + 1):
        plt.figure(figsize=(18, 24))

        # Make m main plots
        num_plots = len(time_arrays[0]) + len(comp_arrays[0])
        for plot_num, df_ind in enumerate(ordered_ls):
            plt.subplot(int(num_plots/2), 2, plot_num + 1)
            
            # Assign a df and array to plot
            label_df = time_ls[fig_num - 1][df_ind] if plot_num + 1 <= len(time_arrays[0]) else comp_ls[fig_num - 1][df_ind]
            plotting_array = time_arrays[fig_num - 1][df_ind] if plot_num + 1 <= len(time_arrays[0]) else comp_arrays[fig_num - 1][df_ind] 
                
            # Define the plot colours & label column
            label_col = [col for col in label_df.columns if col.startswith('MS')]
            color_map = {0: 'pink', 1: 'orange', 2: 'purple'} if label_col[0] == 'MStype' else {0: 'green', 1: 'purple'}
            legend_labels = {0: 'PPMS', 1: 'SPMS', 2: 'RRMS'} if label_col[0] == 'MStype' else {0: 'HC', 1: 'MS'}
            mapped_colors = label_df['MStype'].map(color_map) if label_col[0] == 'MStype' else label_df['MS'].map(color_map)  
            
            # Make the plots
            for category, color in color_map.items():
                indices = label_df[label_col[0]] == category
                plt.scatter(plotting_array[indices, 0], plotting_array[indices, 1], c = color, label = legend_labels[category], alpha=0.7)
            
            # Make plot labels
            df_name = best_GS_df.iloc[GS_ints[plot_num], 0]
            n_components = 2 if best_GS_df.iloc[GS_ints[plot_num], 1] < 2 else best_GS_df.iloc[GS_ints[plot_num], 1]
            n_t = best_GS_df.iloc[GS_ints[plot_num], 2]
            plt.xlabel('TPHATE Component 1')
            plt.ylabel('TPHATE Component 2')
            plt.title(f'2D TPHATE for {df_name} Dataset (Components={n_components}, diffusion time step={n_t})')
            plt.legend()
            plt.grid(True)  
        
        GS_ints = [x + 4 if i < 4 else x + 2 for i, x in enumerate(GS_ints)]

        # Make figures and save
        plt.tight_layout()
        plt.savefig(f'output/TPHATE/{file_name[fig_num - 1]}/best_param_TPHATE_plots_{file_name[fig_num - 1]}_multicolor.png')
        plt.show()

In [None]:
def apply_TPHATE(time_ls, comp_ls, best_GS_df, plot_param):
    """
    INPUT:
    time_ls : (nested lists of dataframes) nested lists with dataframe for the time seperated datasets
    comp_ls : (nested lists of dataframes) nested lists with dataframe for the time combined datasets
    best_GS_df : (dataframe) dataframe of the gridsearch outcomes
    plot_param : (Boolean) True/False make a 2-dimensional plot of the TPHATE embeddings
    OUTPUT: 2 nested lists of arrays (TPHATE embeddings for the time separated and time combined datasets)
    DESCRIPTION: Applies TPHATE fitting to each of the dataframes in the given list of nested dataframes, based on
    its optimal number of components and diffusion time step. Plots the output arrays of the TPHATE fittings if 
    plot_param is True.
    """
    # Initialise output lists
    output_time_ls, output_comp_ls = [], []
    
    # Needed to iterate through best_GS_df 
    counter = 0

    for sublist in time_ls:
        type_list = []
        
        for df in sublist:
            # Get name, hps, label, and index column name for the current df
            df_name = best_GS_df.iloc[counter, 0]
            n_components = 2 if best_GS_df.iloc[counter, 1] < 2 else best_GS_df.iloc[counter, 1]
            n_t = best_GS_df.iloc[counter, 2]
            label_col = [col for col in df.columns if col.startswith('MS')]
            idx_col = df.columns[0]
    
            # Remove target variables and normalise the df
            norm_df = df.drop(columns = ['EDSS', 'BL_Avg_cognition', 'Time', idx_col] + label_col, axis = 1)
            norm_df = StandardScaler().fit_transform(norm_df) 
    
            # Run the TPHATE model
            TPHATE_array = tphate.TPHATE(verbose = 0, n_jobs = -1, n_landmark = norm_df.shape[0], 
                              n_components = n_components, t = n_t).fit_transform(norm_df)
            
            type_list.append(TPHATE_array)

            # Update the counter
            counter += 1
    
        output_time_ls.append(type_list)
        
    for sublist in comp_ls:
        type_list = []
        
        for df in sublist:
            
            # Get name, hps, label, and index column name for the current df
            df_name = best_GS_df.iloc[counter, 0]
            n_components = 2 if best_GS_df.iloc[counter, 1] < 2 else best_GS_df.iloc[counter, 1]
            n_t = best_GS_df.iloc[counter, 2]
            label_col = [col for col in df.columns if col.startswith('MS')]
            idx_col = df.columns[0]

            # Remove target variables and normalise the df
            norm_df = df.drop(columns = ['EDSS', 'BL_Avg_cognition', 'Time', idx_col] + label_col, axis = 1)
            norm_df = StandardScaler().fit_transform(norm_df) 

            # Run the TPHATE model
            TPHATE_array = tphate.TPHATE(verbose = 0, n_jobs = -1, n_landmark = norm_df.shape[0], 
                                          n_components = n_components, t = n_t).fit_transform(norm_df)
    
            type_list.append(TPHATE_array)

            # Update the counter
            counter += 1

        output_comp_ls.append(type_list)

    # Plotting condition (if true, plots are generated)
    if plot_param:
        group_plot_TPHATE(output_time_ls, time_ls, output_comp_ls, comp_ls, best_GS_df)
     
    return output_time_ls, output_comp_ls

In [None]:
# Run step 3 (apply_TPHATE)
time_TPHATE_arrays, comp_TPHATE_arrays = apply_TPHATE(time_set_ls, complete_set_ls, best_gridsearch_df, True)                          

# Step 4: Save the TPHATE embedded dataset
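
The reshaping done before saving (embedding array to dataframe, with the subject IDs restored as the first column) can be sketched as follows. The helper name `embedding_to_frame` and the toy IDs are hypothetical, and the sketch stops short of writing a file.

```python
import numpy as np
import pandas as pd


def embedding_to_frame(embedding, subject_ids):
    """Return the embedding as a dataframe with the subject IDs as first column."""
    frame = pd.DataFrame(embedding, columns=[f'PC{i + 1}' for i in range(embedding.shape[1])])
    # Reset the ID index so the IDs align with the embedding rows by position
    ids = pd.Series(subject_ids).reset_index(drop=True)
    frame.insert(0, 'index', ids)
    return frame


# Toy embedding with 3 subjects and 2 components (illustrative values only)
toy_embedding = np.arange(6, dtype=float).reshape(3, 2)
toy_ids = pd.Series(['sub-01', 'sub-02', 'sub-03'], index=[7, 8, 9])
toy_frame = embedding_to_frame(toy_embedding, toy_ids)
print(toy_frame)
```

Resetting the index matters when the IDs come from a filtered dataframe: without it, pandas would align the IDs by label rather than by row position and could leave missing values.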

In [None]:
def export_TPHATE_embedings(ls_ls_tphate_array, ls_ls_matching_df):
    """
    INPUT:
    ls_ls_tphate_array : (nested list of np.array) nested list of array of the TPHATE embedding
    ls_ls_matching_df : (nested list of pd.dataframe) nested list of original pre embedding dataframe
    OUTPUT: None (the embeddings are written to Excel files)
    DESCRIPTION: Exports the TPHATE embeddings as dataframes with the same index/subject IDs as their original dataset. 
    """
    imp_file = ['all', 'type', 'neighbor']
    imp_df = ['ia', 'it', 'in']

    for imp_idx, imp_ls in enumerate(ls_ls_tphate_array):
        for emb_idx, tphate_emb in enumerate(imp_ls):
            # Make a df from the array
            output_df = pd.DataFrame(tphate_emb, columns = [f'PC{i+1}' for i in range(tphate_emb.shape[1])])

            # Reintroduce the patients ID (index)
            output_df['index'] = ls_ls_matching_df[imp_idx][emb_idx]['index'].reset_index(drop = True)
            columns = ['index'] + [col for col in output_df.columns if col != 'index']
            
            # Reorder the columns such that index is first
            output_df = output_df[columns]

            # Check for time split list or not
            if len(ls_ls_matching_df[0]) > 3:
                sub_type = 'MStrain_' if emb_idx < 2 else 'ALLtrain_'
                year = '00' if emb_idx % 2 == 0 else '05'

                # Save the new df as an excel file
                output_df.to_excel(f'output/TPHATE/{imp_file[imp_idx]}/TPHATE_{sub_type}{imp_df[imp_idx]}{year}.xlsx', index=False)
            
            else:
                sub_type = 'MStrain_' if emb_idx == 0 else 'ALLtrain_'

                # Save the new df as an excel file
                output_df.to_excel(f'output/TPHATE/{imp_file[imp_idx]}/TPHATE_{sub_type}{imp_df[imp_idx]}.xlsx', index=False)

In [None]:
# Run the export_TPHATE_embedings function for the unique sessions and the combined sessions.
export_TPHATE_embedings(time_TPHATE_arrays, time_set_ls)
export_TPHATE_embedings(comp_TPHATE_arrays, complete_set_ls)