# Introduction

This notebook contains all of the scripts used to tune and apply principal component analysis (PCA) to the datasets cleaned and merged in the data_preprocessing_DR.ipynb notebook. The following 5 steps are taken in this notebook:

Step 1: Import the necessary libraries and datasets for the dimensionality reduction. The datasets involve a combination of MS/ALL, imputation methods, and unique/combined sessions. There is a total of 18 training datasets.

Step 2: Tune the PCA hyperparameter (hp), i.e. the number of principal components (PCs), using K-fold cross-validation via the make_K_folds and PCA_gridsearch functions. Each dataset is split into 3 folds. For unique sessions datasets, the subjects are split into train and test subjects, which define the train and test sets. For combined sessions datasets, the train subjects at Y00 and Y05 are used for training and the test subjects at Y05 are used for testing. In each fold, PCA and K-means are fitted on the training set; the fitted PCA then transforms the test set, and K-means predicts cluster labels for the resulting test embeddings. The adjusted Rand index (ARI) scores these predictions against the true test labels. This sequence is repeated for every fold, after which the ARI is averaged across folds (AARI) for the given number of PCs. The whole procedure is then repeated for the other hp values. Finally, the make_PC_ARI_table function builds a dataframe with the optimal hp value per dataset.

Step 3: The best hp value per dataset found in step 2 is used to compute the training PCA embeddings with the apply_PCA function. A 2-dimensional plot (using the first 2 PCs) of each embedding is also produced. The apply_PCA function returns the embedded arrays for the unique sessions and combined sessions datasets, the per-PC and total explained variance ratios for each dataset, and the feature loadings for each dataset.

Step 4: The total variance explained obtained in step 3 is added to the existing gridsearch dataframe, which is updated and saved. The top feature loadings per dataset are saved as separate files.

Step 5: The embedded PCA arrays are saved and reserved for later use.

These datasets were used to assess the performance of PCA in comparison with tSNE, UMAP, and TPHATE. The statistical outcomes of part 1 of the project can be found in the 'Dimensionality Reduction' subsection of the results.

# Step 1: Import Libraries & Datasets

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.cluster import KMeans
from sklearn.model_selection import KFold
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import adjusted_rand_score
from sklearn.decomposition import PCA

In [None]:
# Import the combined sessions datasets
MS_imp_all = pd.read_excel('prepro_data/all_imp/MStrain_ia.xlsx')
ALL_imp_all = pd.read_excel('prepro_data/all_imp/ALLtrain_ia.xlsx')
MS_imp_type = pd.read_excel('prepro_data/type_imp/MStrain_it.xlsx')
ALL_imp_type = pd.read_excel('prepro_data/type_imp/ALLtrain_it.xlsx')
MS_imp_nb = pd.read_excel('prepro_data/nb_imp/MStrain_in.xlsx')
ALL_imp_nb = pd.read_excel('prepro_data/nb_imp/ALLtrain_in.xlsx')

# Import the unique sessions datasets (for the Time imputation method)
MS_t1_imp_all = pd.read_excel('prepro_data/all_imp/MStrain_ia00.xlsx')
MS_t2_imp_all = pd.read_excel('prepro_data/all_imp/MStrain_ia05.xlsx')
ALL_t1_imp_all = pd.read_excel('prepro_data/all_imp/ALLtrain_ia00.xlsx')
ALL_t2_imp_all = pd.read_excel('prepro_data/all_imp/ALLtrain_ia05.xlsx')

# Import the unique sessions datasets (for the Time + Type imputation method)
MS_t1_imp_type = pd.read_excel('prepro_data/type_imp/MStrain_it00.xlsx')
MS_t2_imp_type = pd.read_excel('prepro_data/type_imp/MStrain_it05.xlsx')
ALL_t1_imp_type = pd.read_excel('prepro_data/type_imp/ALLtrain_it00.xlsx')
ALL_t2_imp_type = pd.read_excel('prepro_data/type_imp/ALLtrain_it05.xlsx')

# Import the unique sessions datasets (for the Time + Neighbor imputation method)
MS_t1_imp_nb = pd.read_excel('prepro_data/nb_imp/MStrain_in00.xlsx')
MS_t2_imp_nb = pd.read_excel('prepro_data/nb_imp/MStrain_in05.xlsx')
ALL_t1_imp_nb = pd.read_excel('prepro_data/nb_imp/ALLtrain_in00.xlsx')
ALL_t2_imp_nb = pd.read_excel('prepro_data/nb_imp/ALLtrain_in05.xlsx')

# Group the datasets by imputation methods and then by unique session 
time_set_all = [MS_t1_imp_all, MS_t2_imp_all, ALL_t1_imp_all, ALL_t2_imp_all]
time_set_type = [MS_t1_imp_type, MS_t2_imp_type, ALL_t1_imp_type, ALL_t2_imp_type]
time_set_nb = [MS_t1_imp_nb,  MS_t2_imp_nb, ALL_t1_imp_nb, ALL_t2_imp_nb]
time_set_ls = [time_set_all, time_set_type, time_set_nb]

# Group the datasets by imputation methods and then by combined sessions
complete_set_ls = [[MS_imp_all, ALL_imp_all], [MS_imp_type, ALL_imp_type], [MS_imp_nb, ALL_imp_nb]]
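
As a quick sanity check on the grouping above, the total of 18 training datasets follows from 2 cohorts x 3 imputation methods x 3 session configurations. The sketch below only enumerates name strings (the naming scheme is an assumption made for illustration) and does not load any file.

```python
from itertools import product

# Illustrative labels only; they loosely mirror the variable names used in this notebook
cohorts = ['MS', 'ALL']
imputations = ['imp_all', 'imp_type', 'imp_nb']
sessions = ['t1', 't2', 'combined']   # two unique sessions + the combined sessions

dataset_names = [f'{c}_{s}_{i}' for i, s, c in product(imputations, sessions, cohorts)]

# Unique sessions datasets are the ones that are not built from both sessions
n_unique = sum('combined' not in name for name in dataset_names)
n_combined = len(dataset_names) - n_unique
print(len(dataset_names), n_unique, n_combined)
```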

# Step 2: Hyperparameter tuning for PCA (n_components)
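
The fold-wise evaluation used in this step can be sketched on synthetic data as follows. The data (generated with make_blobs), the number of clusters, and n_init=10 are assumptions made for illustration only; the actual datasets and settings are the ones defined in the cells of this notebook.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.metrics import adjusted_rand_score
from sklearn.model_selection import KFold

# Synthetic stand-in for a normalised dataset: 150 samples, 6 features, 3 well-separated groups
centers = np.array([[0.0] * 6, [10.0] * 6, [-10.0] * 6])
X, y = make_blobs(n_samples=150, centers=centers, cluster_std=1.0, random_state=0)

fold_aris = []
for train_rows, test_rows in KFold(n_splits=3, shuffle=True, random_state=42).split(X):
    # Fit PCA on the training fold only, then project the held-out fold
    pca = PCA(n_components=2)
    train_emb = pca.fit_transform(X[train_rows])
    test_emb = pca.transform(X[test_rows])

    # Fit K-means on the training embedding and predict clusters for the test embedding
    kmeans = KMeans(n_clusters=3, random_state=42, n_init=10)
    kmeans.fit(train_emb)
    y_pred = kmeans.predict(test_emb)

    # ARI does not depend on how the cluster labels are numbered
    fold_aris.append(adjusted_rand_score(y[test_rows], y_pred))

mean_ari = float(np.mean(fold_aris))
print(f'Average ARI over {len(fold_aris)} folds: {mean_ari:.3f}')
```

Fitting PCA and K-means on the training fold only keeps the held-out fold unseen until prediction time, which is the point of the train/test separation described for this step.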

In [None]:
def make_K_folds(df):
    """
    INPUT: Dataframe
    OUTPUT: Nested list [x_train_list, x_test_list, y_test_list], with one entry per fold in each inner list
    DESCRIPTION: Use k-fold to split the input data into 3 folds, and return the x_train, x_test, and y_test of each
    fold as separate lists.
    """
    # Get the subject IDs & number of unique time points
    subjects = df['index'].unique()
    num_ses = len(df['Time'].unique())
    
    # Get the true labels
    label_col = [col for col in df.columns if col.startswith('MS')]
    labels = df[label_col[0]]
    
    # Normalize the dataframe
    norm_df = df.drop(columns=['EDSS', 'BL_Avg_cognition', 'Time', 'index'] + label_col, axis=1)
    norm_df = StandardScaler().fit_transform(norm_df)
    
    # Initialize KFold for subjects
    kfold = KFold(n_splits = 3, shuffle = True, random_state = 42)

    # Initialize output list
    x_train_list, x_test_list, y_test_list = [], [], []
    
    if num_ses == 1: # For unique session datasets

        for train_indices, test_indices in kfold.split(norm_df):
            # Get the training and test data for fold
            x_train, x_test = norm_df[train_indices], norm_df[test_indices]
            y_test = labels.iloc[test_indices]  # positional indexing, matching the rows of norm_df

            # Add the fold's training and test data to the corresponding list
            x_train_list.append(x_train)
            x_test_list.append(x_test)
            y_test_list.append(y_test)
        
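    else: # For combined sessions datasets
        # Get the subject IDs at the first time point (one row per subject)
        t1_subjects = df.loc[df['Time'] == 1, 'index'].to_numpy()

        for train_idx, test_idx in kfold.split(t1_subjects):
            # Get the subject IDs for the training and test sets
            train_subjects, test_subjects = t1_subjects[train_idx], t1_subjects[test_idx]

            # Get the row positions: both time points for training, second time point only for testing
            train_rows = np.flatnonzero(df['index'].isin(train_subjects).to_numpy())
            test_rows = np.flatnonzero((df['index'].isin(test_subjects) & (df['Time'] == 2)).to_numpy())

            # Get the training and test data
            x_train, x_test = norm_df[train_rows], norm_df[test_rows]
            y_test = labels.iloc[test_rows]

            # Add the fold's training and test data to the corresponding list
            x_train_list.append(x_train)
            x_test_list.append(x_test)
            y_test_list.append(y_test)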
    
    return [x_train_list, x_test_list, y_test_list]
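
The subject-level split described for the combined sessions datasets can be sketched on a toy table as follows. The subject IDs and the table layout are invented for illustration; only the 'index' and 'Time' column names are taken from this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

# Toy long-format table: 6 hypothetical subjects, each measured at Time 1 and Time 2
toy = pd.DataFrame({
    'index': list(range(6)) * 2,
    'Time': [1] * 6 + [2] * 6,
})

subject_ids = toy.loc[toy['Time'] == 1, 'index'].to_numpy()
kfold = KFold(n_splits=3, shuffle=True, random_state=42)

splits = []
for train_idx, test_idx in kfold.split(subject_ids):
    train_subjects, test_subjects = subject_ids[train_idx], subject_ids[test_idx]

    # Training rows: both sessions of the training subjects
    train_rows = np.flatnonzero(toy['index'].isin(train_subjects).to_numpy())
    # Test rows: only the second session of the held-out subjects
    test_rows = np.flatnonzero((toy['index'].isin(test_subjects) & (toy['Time'] == 2)).to_numpy())
    splits.append((train_rows, test_rows))

    print(sorted(test_subjects.tolist()), len(train_rows), len(test_rows))
```

Splitting on subjects rather than on rows keeps both sessions of a subject on the same side of the split, so no subject contributes to training and testing at once.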

In [None]:
def PCA_gridsearch(lst, hp_dict):
    """
    INPUT: 
    lst : (Nested lists of dataframes)
    hp_dict : (dictionary) hyperparameter values to test, with the candidate numbers of PCs under the 'n_pc' key
    OUTPUT: Nested lists of floats (AARI score per number of PCs), and nested lists of tuples
    (best number of PCs, best AARI score) per dataset
    DESCRIPTION: Find the optimal number of PCs to include for each of the df based on their obtained AARI
    """
    # Initialize the output lists
    output_rand_lists = []
    best_rands = []

    # Iterate over the imputation methods lists
    for sublst in lst:
        ls_rand_indices = []
        best_rands_sublist = []
        for i, df in enumerate(sublst):
            # Set the number of clusters based on the label column of the df
            n_clusters = 2 if 'MS' in df.columns else 4
            
            # Get k-fold splits of the data
            k_fold_ls = make_K_folds(df)
            
            rand_indices = []
                
            # For each number of PCs
            for n_comp in hp_dict['n_pc']:
                temp_ARIs = []
                for k in range(0, len(k_fold_ls[0])):
                    x_train = k_fold_ls[0][k]
                    x_test = k_fold_ls[1][k]
                    y_test = k_fold_ls[2][k]

                    # Initialise PCA and fit/transform the x data
                    pca = PCA(n_components = n_comp)
                    x_train_pca = pca.fit_transform(x_train)
                    x_test_pca = pca.transform(x_test)

                    # Fit & apply K-means clustering
                    kmeans = KMeans(n_clusters = n_clusters, random_state = 42, n_init = 'auto')
                    kmeans.fit(x_train_pca)
                    y_pred = kmeans.predict(x_test_pca)
                    temp_ARIs.append(adjusted_rand_score(y_test, y_pred))
                                
                # Get average ARI score
                rand_indices.append(np.mean(temp_ARIs))
            
            # Add AARI per PC number (for df) to sublist (for imputation type)
            ls_rand_indices.append(rand_indices)
            
            # Find best PC number (looked up in the tested values) and corresponding AARI
            best_rand_value = max(rand_indices)
            best_pc_number = list(hp_dict['n_pc'])[rand_indices.index(best_rand_value)]
            best_rands_sublist.append((best_pc_number, best_rand_value))
        
        output_rand_lists.append(ls_rand_indices)
        best_rands.append(best_rands_sublist)
    
    return output_rand_lists, best_rands
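
The selection of the best number of PCs from a list of AARI scores can be sketched as follows. The scores are invented for illustration; the point is that the position of the best score has to be mapped back to the tested hp value.

```python
import numpy as np

# Hypothetical AARI scores, one per candidate number of PCs (values invented for illustration)
pc_values = list(range(1, 12))
aari_scores = [0.10, 0.22, 0.31, 0.29, 0.35, 0.35, 0.30, 0.28, 0.27, 0.25, 0.24]

# np.argmax returns the position of the first maximum, so ties resolve to the smaller number of PCs
best_pos = int(np.argmax(aari_scores))
best_pc, best_aari = pc_values[best_pos], aari_scores[best_pos]
print(best_pc, best_aari)
```

Looking the value up in pc_values, instead of adding 1 to the position, keeps the result correct if the tested range does not start at 1.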

In [None]:
def make_row_names():
    """
    INPUT: None
    OUTPUT: List of strings
    DESCRIPTION: Creates the row names corresponding to the different datasets included in the make_PC_ARI_table 
    function
    """
    ls_row_names = []
    ls_types = ['imp_all', 'imp_type', 'imp_neighbor']
    ls_subjects = ['MS_', 'ALL_']
    
    for str3 in ls_types:
        for str1 in ls_subjects:
            for str2 in ['t1_', 't2_']:
                name = str1 + str2 + str3
                ls_row_names.append(name)
    for str2 in ls_types:
        for str1 in ls_subjects:
            name = str1 + str2
            ls_row_names.append(name)   
    
    return ls_row_names

In [None]:
def make_PC_ARI_table(ls_best_ARI_time, ls_best_ARI_comp):
    """
    INPUT: 
    ls_best_ARI_time : (nested lists of tuples) tuples with integer and float (number of PCs, AARI score) for 
    unique sessions datasets
    ls_best_ARI_comp : (nested lists of tuples) tuples with integer and float (number of PCs, AARI score) for 
    combined sessions datasets
    OUTPUT: Dataframe
    DESCRIPTION: Create a dataframe with dataset names in column 1, best number of PCs in column 2, and gridsearch
    AARI scores in column 3. 
    """
    # Initialize table
    table_data = []
    
    # Flatten the list
    flattened_best_indices = [item for sublist in ls_best_ARI_time + ls_best_ARI_comp for item in sublist]
    
    # Make row names
    dataset_names = make_row_names()
    
    # Fill the table data
    for dataset, (PC, ARI) in zip(dataset_names, flattened_best_indices):
        table_data.append([dataset, PC, ARI])
        
    # Make DataFrame
    PC_ARI_df = pd.DataFrame(table_data, columns=["Dataset", "Best_PC_Number", "Best_ARI_Score"])
    
    # Save the output df
    PC_ARI_df.to_excel('updated_data/DR/PCA/bestPC_per_dataset_tbl.xlsx', index = False)
    
    return PC_ARI_df
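
The flattening step used to build the table can be sketched on a smaller, invented example. The scores and row names below are hypothetical; the explicit length check is an addition for illustration, since zip silently drops unmatched items when the two lists differ in length.

```python
import pandas as pd

# Hypothetical gridsearch output: 2 imputation methods, as (number of PCs, AARI) tuples
best_time = [[(3, 0.41), (5, 0.38)], [(2, 0.44), (4, 0.36)]]
best_comp = [[(6, 0.40)], [(3, 0.39)]]
names = ['A_t1', 'A_t2', 'B_t1', 'B_t2', 'A_comb', 'B_comb']   # invented row names

# Concatenate the nested lists, then flatten one level so the tuples line up with the names
flat = [pair for sublist in best_time + best_comp for pair in sublist]
assert len(names) == len(flat), 'row names and results must have the same length'

table = pd.DataFrame(
    [(name, n_pc, aari) for name, (n_pc, aari) in zip(names, flat)],
    columns=['Dataset', 'Best_PC_Number', 'Best_ARI_Score'],
)
print(table)
```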

In [None]:
def plot_PCA_ARI(time_RI, comp_RI, hp_dict):
    """
    INPUT: 
    time_RI : (Nested lists of floats) nested lists of gridsearch AARI scores for the unique sessions datasets
    comp_RI : (Nested lists of floats) nested lists of gridsearch AARI scores for the combined sessions datasets
    hp_dict : (dictionary) hyperparameter values tested, with the numbers of PCs under the 'n_pc' key
    OUTPUT: 3 figures
    DESCRIPTION: Plot the AARI scores for each PCA gridsearch run. Figures 1-3 correspond to the different 
    imputation types, and plots 1 and 2 within each figure show the MS-only and the all-patients datasets.
    """
    # Define the number of principal components tested (1 to 11)
    pc_range = hp_dict['n_pc']
    imp_types = ['All Imputation', 'Type Imputation', 'Neighbor Imputation']
    data_types = ['MS Patients Only', 'All Patients']


    # Make 3 main figures (imputation types)
    for fig_num in range(1, 4):
        plt.figure(figsize=(12, 6))

        # Make 2 main plots (MS & ALL)
        for plot_num in range(1, 3):
            plt.subplot(1, 2, plot_num)

            if plot_num == 1:
                plt.plot(pc_range, time_RI[fig_num - 1][0], label='Time point 1', color = '#D7BDE2')
                plt.plot(pc_range, time_RI[fig_num - 1][1], label='Time point 2', color = '#9B59B6')
                plt.plot(pc_range, comp_RI[fig_num - 1][0], linestyle='--', label='Time point 1 & 2', color = '#5B2C6F')
            elif plot_num == 2:
                plt.plot(pc_range, time_RI[fig_num - 1][2], label='Time point 1', color = '#A9DFBF')
                plt.plot(pc_range, time_RI[fig_num - 1][3], label='Time point 2', color = '#27AE60')
                plt.plot(pc_range, comp_RI[fig_num - 1][1], linestyle='--', label='Time point 1 & 2', color = '#196F3D')

            plt.xlabel('Number of Principal Components')
            plt.ylabel('ARI Score')
            plt.title(f'{imp_types[fig_num - 1]} for {data_types[plot_num - 1]} dataset')
            plt.legend()
            plt.grid(True)

        plt.tight_layout()
        plt.savefig(f'output/PCA/PC_v_ARI_plots_figure_{fig_num}.png')
        plt.show()

In [None]:
param_dict = {
    'n_pc': range(1, 12)
}
# Run the PCA gridsearch
time_rand_indices, time_best_indices = PCA_gridsearch(time_set_ls, param_dict)
comp_rand_indices, comp_best_indices = PCA_gridsearch(complete_set_ls, param_dict)

# Make table with best PC number and corresponding ARI score
PC_ARI_tbl = make_PC_ARI_table(time_best_indices, comp_best_indices)

# Plot the gridsearch runs
plot_PCA_ARI(time_rand_indices, comp_rand_indices, param_dict)

# Step 3: Apply PCA with Gridsearch Results

In [None]:
def get_feature_loadings(df, n_comp, pca_model):
    """
    INPUT: 
    df : (dataframe)
    n_comp : (integer) number of PCs
    pca_model : PCA model for the input df
    OUTPUT: dataframe
    DESCRIPTION: Create a dataframe of the feature loadings of the given dataframe for the specified number of PCs.
    """
    # Create feature loading dataframe
    feat_load = pd.DataFrame(pca_model.components_[:n_comp], columns = df.columns)
    
    # Structure the dataframe
    feat_load = feat_load.transpose()                                     # Get PCs as columns
    feat_load.columns = [f'PC {n}' for n in range(1, n_comp + 1)]         # Rename the columns
    feat_load.index.name = 'Feature'                                      # Name the index (feature names)
    feat_load.reset_index(inplace=True)                                   # Move the feature names into a 'Feature' column
    
    return feat_load
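
How a fitted PCA model maps to a feature loading table can be sketched on random data. The feature names and data below are invented; the table is built from the components_ attribute, which is also what get_feature_loadings reads.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
toy = pd.DataFrame(rng.normal(size=(50, 4)), columns=['f1', 'f2', 'f3', 'f4'])   # invented features

n_comp = 2
pca = PCA(n_components=n_comp).fit(StandardScaler().fit_transform(toy))

# components_ has shape (n_components, n_features); transpose to get one row per feature
loadings = pd.DataFrame(pca.components_.T, index=toy.columns,
                        columns=[f'PC {n}' for n in range(1, n_comp + 1)])
loadings = loadings.rename_axis('Feature').reset_index()
print(loadings.round(3))

# Each principal axis is a unit vector, so its squared weights sum to 1
axis_norms = (pca.components_ ** 2).sum(axis=1)
print(axis_norms)
```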

In [None]:
def group_plot_PCA(time_arrays, time_ls, comp_arrays, comp_ls, best_GS_df, exp_var_ls):
    """
    INPUT: 
    time_arrays : (nested lists of arrays) nested lists with arrays of the pca embeddings for the time separated datasets
    time_ls : (nested lists of dataframes) nested lists with dataframe for the time separated datasets
    comp_arrays : (nested lists of arrays) nested lists with arrays of the pca embeddings for the time combined datasets
    comp_ls : (nested lists of dataframes) nested lists with dataframe for the time combined datasets
    best_GS_df : (dataframe) dataframe of the gridsearch outcomes
    exp_var_ls : (list of floats) variance explained by the first 2 PCs of each dataset, in the row order of best_GS_df
    OUTPUT: 3 figures of 3 by 2 subplots
    DESCRIPTION: creates 2 dimensional plots of the PCA embedded dataframes
    """
    # Make list of file names for saving, and list to order the plots within the figure
    file_name = ['all_imp', 'type_imp', 'neighbor_imp']
    ordered_ls = [0,2,1,3,0,1]
    GS_ints = [0,2,1,3,12,13]
    
    # Make n main figures (imputation types)
    for fig_num in range(1, len(time_ls) + 1):
        plt.figure(figsize=(18, 24))

        # Make m main plots
        num_plots = len(time_arrays[0]) + len(comp_arrays[0])
        for plot_num, df_ind in enumerate(ordered_ls):
            plt.subplot(int(num_plots/2), 2, plot_num + 1)
            
            # Assign a df and array to plot
            label_df = time_ls[fig_num - 1][df_ind] if plot_num < len(time_arrays[0]) else comp_ls[fig_num - 1][df_ind]
            plotting_array = time_arrays[fig_num - 1][df_ind] if plot_num < len(time_arrays[0]) else comp_arrays[fig_num - 1][df_ind]

            # Define the plot colours & label column
            label_col = [col for col in label_df.columns if col.startswith('MS')]
            color_map = {0: 'pink', 1: 'orange', 2: 'purple'} if label_col[0] == 'MStype' else {0: 'green', 1: 'purple'}
            legend_labels = {0: 'PPMS', 1: 'SPMS', 2: 'RRMS'} if label_col[0] == 'MStype' else {0: 'HC', 1: 'MS'}
            mapped_colors = label_df['MStype'].map(color_map) if label_col[0] == 'MStype' else label_df['MS'].map(color_map)            
            
            # Make plot
            for category, color in color_map.items():
                indices = label_df[label_col[0]] == category
                plt.scatter(plotting_array[indices, 0], plotting_array[indices, 1], 
                            c = color, label = legend_labels[category], alpha=0.7)
            
            # Make plot labels            
            df_name = best_GS_df.iloc[GS_ints[plot_num], 0]
            plt.xlabel('Principal Component 1')
            plt.ylabel('Principal Component 2')
            plt.title(f'2D PCA for {df_name} Dataset (2PC Variance Explained = {round(exp_var_ls[GS_ints[plot_num]] * 100, 2)}%)')
            plt.legend()
            plt.grid(True)
            
        GS_ints = [x + 4 if i < 4 else x + 2 for i, x in enumerate(GS_ints)] 

        # Make figures and save
        plt.tight_layout()
        plt.savefig(f'output/PCA/{file_name[fig_num - 1]}/2PCA_plots_{file_name[fig_num - 1]}_multicolor.png')
        plt.show()

In [None]:
def apply_PCA(time_ls, comp_ls, best_GS_df, plot_param):
    """
    INPUT:
    time_ls : (nested lists of dataframes) nested lists with dataframe for the unique sessions datasets
    comp_ls : (nested lists of dataframes) nested lists with dataframe for the combined sessions datasets
    best_GS_df : (dataframe) dataframe of the gridsearch outcomes
    plot_param : (Boolean) True/False make a 2-dimensional plot of the PCA embeddings
    OUTPUT: 2 nested lists of (number of PCs, embedding array) tuples, 4 nested lists of explained variance ratios
    (per PC and total), and 2 nested lists of feature loading dataframes
    DESCRIPTION: Fits PCA to each of the dataframes in the given nested lists of dataframes, using the optimal number
    of PCs found in the gridsearch. Plots the output arrays of the PCA fittings if plot_param is True.
    """
    # Initialise output lists
    output_time_ls, output_comp_ls = [], []
    plot_time_ls, plot_comp_ls = [], []
    exp_var_plot_ls = []
    time_var_ratios, comp_var_ratios = [], []
    time_feature_loadings, comp_feature_loadings = [], []
    time_total_var_ratios, comp_total_var_ratios = [], []
    
    # Needed to iterate through best_GS_df 
    counter = 0

    # Iterate through the unique sessions dataset
    for sublist in time_ls:
        type_list = []
        plot_list = []
        sub_var_ratios = []
        subtotal_var_ratios = []
        sub_feat_loads = []
        
        for df in sublist:
            # Get name, label column name, and number of pc for the df
            df_name = best_GS_df.iloc[counter, 0]
            label_col = [col for col in df.columns if col.startswith('MS')]
            n_comp = best_GS_df.iloc[counter, 1]

            # Remove target variables and normalise the df
            dropped_df = df.drop(columns = ['EDSS', 'BL_Avg_cognition', 'Time', 'index'] + label_col , axis = 1)
            norm_df = StandardScaler().fit_transform(dropped_df) 

            # Run PCA models (Table & plot)
            temp_pca_model = PCA(n_components = n_comp)
            pca_embedding = temp_pca_model.fit_transform(norm_df)
            
            plot_pca_model = PCA(n_components = 2)
            plot_pca_embedding = plot_pca_model.fit_transform(norm_df)

            # Store pca information (table & plot)
            type_list.append((n_comp, pca_embedding))
            sub_var_ratios.append(temp_pca_model.explained_variance_ratio_)
            subtotal_var_ratios.append(sum(temp_pca_model.explained_variance_ratio_))
            sub_feat_loads.append(get_feature_loadings(dropped_df, n_comp, temp_pca_model))
            
            exp_var_plot_ls.append(np.sum(plot_pca_model.explained_variance_ratio_[:2]))
            plot_list.append(plot_pca_embedding)

            # Update the counter
            counter += 1

        # Add the new information to the output lists
        output_time_ls.append(type_list)
        time_var_ratios.append(sub_var_ratios)
        time_total_var_ratios.append(subtotal_var_ratios)
        time_feature_loadings.append(sub_feat_loads)
        plot_time_ls.append(plot_list)

    # Iterate through the combined sessions dataset
    for sublist in comp_ls:
        type_list = []
        plot_list = []
        sub_var_ratios = []
        subtotal_var_ratios = []
        sub_feat_loads = []
        
        for df in sublist:
            # Get name, number of pc, and label column name for the df
            df_name = best_GS_df.iloc[counter, 0]
            n_comp = best_GS_df.iloc[counter, 1]
            label_col = [col for col in df.columns if col.startswith('MS')]

            # Remove target variables and normalise the df
            dropped_df = df.drop(columns = ['EDSS', 'BL_Avg_cognition', 'Time', 'index'] + label_col , axis = 1)
            norm_df = StandardScaler().fit_transform(dropped_df) 

            # Run PCA models (Table & plot)
            temp_pca_model = PCA(n_components = n_comp)
            pca_embedding = temp_pca_model.fit_transform(norm_df)
            
            plot_pca_model = PCA(n_components = 2)
            plot_pca_embedding = plot_pca_model.fit_transform(norm_df)

            # Store pca information (table & plot)
            type_list.append((n_comp, pca_embedding))
            sub_var_ratios.append(temp_pca_model.explained_variance_ratio_)
            subtotal_var_ratios.append(sum(temp_pca_model.explained_variance_ratio_))
            sub_feat_loads.append(get_feature_loadings(dropped_df, n_comp, temp_pca_model))
            
            exp_var_plot_ls.append(np.sum(plot_pca_model.explained_variance_ratio_[:2]))
            plot_list.append(plot_pca_embedding)

            # Update the counter
            counter += 1

        # Add the new information to the output lists
        output_comp_ls.append(type_list)
        comp_var_ratios.append(sub_var_ratios)
        comp_total_var_ratios.append(subtotal_var_ratios)
        comp_feature_loadings.append(sub_feat_loads)
        plot_comp_ls.append(plot_list)
                       
    # Plotting condition (if true, plots are generated)
    if plot_param:
        group_plot_PCA(plot_time_ls, time_ls, plot_comp_ls, comp_ls, best_GS_df, exp_var_plot_ls)
     
    return output_time_ls, output_comp_ls, time_var_ratios, comp_var_ratios, time_total_var_ratios, comp_total_var_ratios, time_feature_loadings, comp_feature_loadings

In [None]:
# Run step 3 (apply_PCA)
time_pca_embed, comp_pca_embed, time_var_ratios, comp_var_ratios, time_total_var_ratios, comp_total_var_ratios, time_feat_load, comp_feat_load  = apply_PCA(time_set_ls, complete_set_ls, PC_ARI_tbl, True)
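
The per-PC and total explained variance ratios returned by apply_PCA can be illustrated on invented data. The sketch below assumes nothing about the real datasets; it only shows how explained_variance_ratio_ relates to the cumulative total.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
# Invented data: 5 features with decreasing spread
X = rng.normal(size=(200, 5)) * np.array([5.0, 3.0, 2.0, 1.0, 0.5])

pca = PCA(n_components=5).fit(X)
ratios = pca.explained_variance_ratio_       # one ratio per PC, sorted in decreasing order
cumulative = np.cumsum(ratios)               # total variance explained by the first k PCs

for k, (r, c) in enumerate(zip(ratios, cumulative), start=1):
    print(f'PC {k}: ratio = {r:.3f}, cumulative = {c:.3f}')
```

When all components are kept the ratios sum to 1; with fewer components, their sum is the share of variance retained by the embedding.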

# Step 4: Variance Explained & Feature loading

In [None]:
def add_total_exp_var(PC_ari_df, tot_var_ratios_time, tot_var_ratios_comp):
    """
    INPUT: 
    PC_ari_df : (dataframe) gridsearch dataframe
    tot_var_ratios_time : (nested lists of floats) nested list of total variance ratios for the time split datasets
    tot_var_ratios_comp : (nested lists of floats) nested list of total variance ratios for the time combined datasets
    OUTPUT: dataframe
    DESCRIPTION: Adds the 'total_variance_exp' column with corresponding values to the input dataframe.
    """
    # Initialise the column name and get the data
    col_name = 'total_variance_exp'
    flattened_time = [num for sublist in tot_var_ratios_time for num in sublist]
    flattened_comp = [num for sublist in tot_var_ratios_comp for num in sublist]
    
    col_values = flattened_time + flattened_comp

    # Add the column to the gridsearch dataset
    PC_ari_df[col_name] = col_values
    
    return PC_ari_df

In [None]:
def get_significant_feat_loadings(ls_ls_df):
    """
    INPUT: 
    ls_ls_df : (Nested lists of dataframes) nested lists of feature loading dataframes
    OUTPUT: Nested lists of dataframes
    DESCRIPTION: For each feature loading dataframe, keep only the features that rank among the 10 largest absolute
    loadings of at least one PC.
    """
    # Initialise output list
    output_df_ls = []

    # Iterate through the imputation lists
    for sublst in ls_ls_df:
        ft_load_sublist = []
        for df in sublst:
    
            # Initialize the list of best scoring indexes
            top_indexes = []
            
            # Get the absolute values of feature loadings
            abs_df = df.iloc[:,1:].abs()

            # Iterate over the columns of abs_df
            for col in abs_df.columns:
                # Get the top 10 indexes
                top_10_idx = abs_df[col].nlargest(10).index
                top_indexes.extend(top_10_idx)

            # Remove duplicates (sorted to keep the original feature order)
            unique_indexes = sorted(set(top_indexes))

            # Make output df with selected rows
            output_df = df.loc[unique_indexes]
            
            ft_load_sublist.append(output_df)

        output_df_ls.append(ft_load_sublist)
    
    return output_df_ls
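
The top-loading selection can be sketched on a small, invented loading table. Here top_k = 2 is used instead of 10 so that the result is easy to check by hand.

```python
import pandas as pd

# Invented loadings for 5 features and 2 PCs
loadings = pd.DataFrame({
    'Feature': ['a', 'b', 'c', 'd', 'e'],
    'PC 1': [0.70, -0.10, 0.05, -0.65, 0.20],
    'PC 2': [0.10, 0.80, -0.55, 0.05, -0.15],
})

top_k = 2
abs_load = loadings.drop(columns='Feature').abs()

# Collect the row labels of the top_k largest absolute loadings of every PC
selected = set()
for pc in abs_load.columns:
    selected.update(abs_load[pc].nlargest(top_k).index)

# Sort the labels so the output keeps the original feature order
top_features = loadings.loc[sorted(selected)]
print(top_features)
```

Ranking on absolute values keeps strongly negative loadings (feature 'd' on PC 1 in this example), which contribute to a PC as much as positive ones.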

In [None]:
def save_significant_feature_loadings(ls_ls_df):
    """
    INPUT: 
    ls_ls_df : (Nested list of dataframes) nested lists of feature loading dataframes
    OUTPUT: N/A
    DESCRIPTION: Saves the dataframes with the significant feature loadings
    """
    # Initialize naming lists and print statement variable
    imp_types = ['all_imp', 'type_imp', 'neighbor_imp']
    data_types = ['MS', 'MS', 'ALL', 'ALL']
    time_types = ['t1', 't2', 't1', 't2']
    list_type_name = ''
    
    # For time split data
    if len(ls_ls_df[0]) == 4:
        list_type_name = 'time split'
        for i, sublst in enumerate(ls_ls_df):
            for j, df in enumerate(sublst):
                df.to_excel(f'output/PCA/{imp_types[i]}/top_feature_loading_{data_types[j]}_{imp_types[i]}_{time_types[j]}.xlsx', index = False)
   
    # For complete dataset
    else:
        list_type_name = 'complete'
        for i, sublst in enumerate(ls_ls_df):
            for j, df in enumerate(sublst):
                df.to_excel(f'output/PCA/{imp_types[i]}/top_feature_loading_{data_types[j + 1]}_{imp_types[i]}.xlsx', index = False)
        
    
    print(f'All data frames in the {list_type_name} input list have been successfully saved')


In [None]:
# Update the dataframe with the total explained variance
PC_ARI_TVR_tbl = add_total_exp_var(PC_ARI_tbl, time_total_var_ratios, comp_total_var_ratios)

# Update the bestPC table
PC_ARI_TVR_tbl.to_excel('output/PCA/bestPC_per_dataset_tbl.xlsx', index = False)

# Get the feature loadings for the unique sessions and combined session
top_time_feat_load = get_significant_feat_loadings(time_feat_load)
top_comp_feat_load = get_significant_feat_loadings(comp_feat_load)

# Save the feature loadings for the unique sessions and combined session
save_significant_feature_loadings(top_time_feat_load)
save_significant_feature_loadings(top_comp_feat_load)

# Step 5: Save PCA embedded datasets

In [None]:
def export_pca_embedings(ls_ls_pca_embedding, ls_ls_matching_df):
    """
    INPUT:
    ls_ls_pca_embedding : (nested list of tuples) nested list of (number of PCs, np.array of the PCA embedding) tuples
    ls_ls_matching_df : (nested list of pd.DataFrame) nested list of the original pre-embedding dataframes
    OUTPUT: None (the embeddings are saved as Excel files)
    DESCRIPTION: Exports the PCA embeddings as dataframes with the same index/subject IDs as their original dataset. 
    """
    imp_file = ['all_imp', 'type_imp', 'neighbor_imp']
    imp_df = ['ia', 'it', 'in']

    for imp_idx, imp_ls in enumerate(ls_ls_pca_embedding):
        for emb_idx, pca_emb in enumerate(imp_ls):
            # Make a dataframe from the array
            output_df = pd.DataFrame(pca_emb[1], columns = [f'PC{i+1}' for i in range(pca_emb[1].shape[1])])

            # Reintroduce the participant ID (index)
            output_df['index'] = ls_ls_matching_df[imp_idx][emb_idx]['index'].reset_index(drop = True)
            columns = ['index'] + [col for col in output_df.columns if col != 'index']
            # Reorder the columns such that index is first
            output_df = output_df[columns]

            sub_type = ''
            year = ''
            # Check for time split list or not
            if len(ls_ls_pca_embedding[0]) > 3:
                print('entered time list statement')
                if emb_idx < 2:
                    sub_type = 'MStrain_'
                else: 
                    sub_type = 'ALLtrain_'
                
                if emb_idx % 2 == 0:
                    year = '00'
                else:
                    year = '05'

                # Save the new df as an excel file
                output_df.to_excel(f'output/PCA/{imp_file[imp_idx]}/PCA_{sub_type}{imp_df[imp_idx]}{year}.xlsx', index=False)
            
            else:
                print('entered no split list statement')
                if emb_idx == 0:
                    sub_type = 'MStrain_'
                else: 
                    sub_type = 'ALLtrain_'

                # Save the new df as an excel file
                output_df.to_excel(f'output/PCA/{imp_file[imp_idx]}/PCA_{sub_type}{imp_df[imp_idx]}.xlsx', index=False)
          

In [None]:
# Run the export_pca_embedings function for the unique sessions and the combined sessions.
export_pca_embedings(time_pca_embed, time_set_ls)
export_pca_embedings(comp_pca_embed, complete_set_ls)
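
The step that reattaches the subject IDs to an embedding can be sketched without writing any file. The array, the IDs, and the shuffled source index below are invented for illustration.

```python
import numpy as np
import pandas as pd

# Invented stand-ins: a 4 x 3 embedding and the dataframe it was computed from
embedding = np.arange(12, dtype=float).reshape(4, 3)
source_df = pd.DataFrame({'index': [101, 102, 103, 104]}, index=[7, 3, 9, 1])

pc_cols = [f'PC{i + 1}' for i in range(embedding.shape[1])]
out = pd.DataFrame(embedding, columns=pc_cols)

# reset_index(drop=True) aligns the IDs by row position rather than by the old index labels
out.insert(0, 'index', source_df['index'].reset_index(drop=True))
print(out)
```

Without the reset, pandas would align the IDs on the index labels of the source dataframe, which can leave missing values when those labels are not 0 to n-1.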