# Introduction

This notebook contains all of the scripts necessary to preprocess the raw datasets for the dimensionality reduction (DR) part of the project. It proceeds in four steps:

Step 1: Isolates the features and participants of interest from the datasets containing the clinical, demographic, sMRI, fMRI, and dMRI data. See the participants section of the methodology for the inclusion and exclusion criteria, and Table 1 for more information on the features. The functions clean_dMRI, clean_sMRI_vol, clean_sMRI_thickness, and clean_fECM under the 'Cleaning Functions' header clean the individual datasets; make_base_df combines them to create the base HC & pwMS and pwMS-only datasets.

Step 2: Splits the newly created dataframes into train and test subjects (80% and 20%, respectively). A stratified subject split is used to obtain a representative distribution of HC and pwMS: a temporary variable with the classes HC, RRMS, SPMS, and PPMS is created and used for stratification. The functions get_test_subjects and make_train_test_split obtain the train and test subject IDs such that there is no data leakage between the train and test datasets and only participants belonging to one of these classes are selected for the test set.

Step 3: Missing values are imputed with one of three methods (Time, Time + Type, and Time + Neighbor); see the 'Missing Values' subsection of the Methodology for more information on these imputation methods. The function split_df_by_time splits the previously created train and test base dataframes into one dataframe per session and returns them in a list. Each single-session dataframe is then passed to the impute_df function, which fills the missing values with the specified imputation method. The imputed sessions of the test datasets are then merged to obtain a dataframe containing all sessions. Together with the imputed single-session dataframes, these merged test dataframes are saved and reserved for later use in the dimensionality reduction methods, as they will not undergo oversampling.

Step 4: The class imbalance between the MS phenotypes and between HC and pwMS is addressed using the Synthetic Minority Over-sampling Technique (SMOTE); more information on SMOTE can be found in the 'Class Imbalance' subsubsection of the methodology. The lists of single-session dataframes created in the imputation step are converted from X rows per participant (one per session) to one row per participant. The class_rebalancing_SMOTE function then synthesizes new participants with SMOTE, based on the MS phenotype column for the MS datasets and on the combined HC + MS phenotype classes (HC, RRMS, SPMS, PPMS) for the HC + pwMS datasets, and adds them to the existing participants. The reformat_SMOTE_dfs function converts the oversampled dataframes back to the original format of X rows per participant for the X sessions; these are then saved. A combined-session version of the oversampled dataset is also created and saved. This is repeated for all versions of the MS and ALL train datasets obtained from the different imputation methods.

These datasets will then be used in part 1 of the project for dimensionality reduction with PCA, tSNE, UMAP, and TPHATE.

## Import Libraries

In [None]:
import pandas as pd
import numpy as np
import re
import pickle
import os

from sklearn.impute import KNNImputer
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split

# Step 1: Prepare & Merge the Datasets

## Cleaning Functions

In [None]:
def clean_dMRI(df):
    """
    INPUT: 
    df :(dataframe) dataset with dMRI data
    OUTPUT: dataframe
    DESCRIPTION: Sets up the dMRI dataset for merging with others in the make_base_df function. Isolates the no-lesion z-scores
    (FD, FDC, and logFC), the subject IDs, and the session numbers.
    """
    # Create a list with all columns beginning with FD or Logfc in the df
    all_dMRI_cols = [col for col in df.columns if col.startswith('FD') or col.startswith('Logfc')]
    
    # Keep only columns with _nolesion_zscores
    keep_dMRI_cols = [col for col in all_dMRI_cols if col.endswith('_nolesion_zscores')]

    # Get the columns to remove based on wanted columns
    rm_cols = [col for col in df.columns if col not in ['PRESGENE_ID', 'Time'] + keep_dMRI_cols]

    # Remove the unwanted columns from the df
    output_df = df.drop(columns=rm_cols)

    return output_df

In [None]:
def clean_sMRI_vol(df):
    """
    INPUT: 
    df :(dataframe) dataset with demographic, clinical and structural MRI dataset.
    OUTPUT: dataframe
    DESCRIPTION: Sets up the sMRI (volumes) dataset for merging with others in the make_base_df function. 
    Retains the clinical, demographic, and sMRI volume features. Removes columns with at least 90% and rows with at least 80% missing values or zeros.
    """
    # Make a copy of the input df to prevent editing the original df
    working_df = df.copy()
    
    # Keep only the clinical and demographic features listed below (raw cross-sectional measures are dropped)
    clin_demo = ['id_presgene','programs_label_code','edss', 'mstype_code', 'subject_type_code',
                 'age', 'sex', 'average_cognition', 'HC_CI_CP']
    
    vol_long_corr = [col for col in working_df.columns if 'Vol_long_all_corrected' in col and 'Mask' not in col and 'BrainSeg' not in col]

    working_df = working_df[[col for col in working_df.columns if col in clin_demo + vol_long_corr + ['lesion_volume_mm3']]].copy()

    # Replace other missing type notation with the most common MStype
    working_df['mstype_code'] = working_df['mstype_code'].apply(lambda x: 3 if x > 8 else x)

    # Remove columns filled with at least 90% missing values or zeros:
    nan_prct = working_df.isnull().mean()
    zr_prct = (working_df == 0).mean()
    prct = nan_prct+zr_prct
    nozr_df = working_df.loc[:, prct < 0.9]

    zr_col = prct[prct >= 0.9].index.tolist()
    print("Columns removed:", zr_col)

    # Remove rows with at least 80% missing values or zeros, together with every other row of the same participant:
    na_prct_row = nozr_df.isnull().mean(axis=1)
    zr_prct_row = (nozr_df == 0).mean(axis=1)
    prct_row = na_prct_row + zr_prct_row
    zr_row = nozr_df.loc[prct_row >= 0.8, 'id_presgene'].tolist()
    # Copy the filtered frame so the column assignments below do not act on a slice
    nozr_df = nozr_df[~nozr_df['id_presgene'].isin(zr_row)].copy()

    print("Rows with ID removed:", zr_row)

    # Encode the HC_CI_CP to numerical labels
    nozr_df['HC_CI_CP'] = nozr_df['HC_CI_CP'].fillna(9999)
    nozr_df['HC_CI_CP'] = nozr_df['HC_CI_CP'].replace({'HC': 0, 'CP': 1, 'CI': 2})
    nozr_df['HC_CI_CP'] = nozr_df['HC_CI_CP'].replace(9999, np.nan)
    
    # Rename columns for clarity
    rename_dict = {'id_presgene':'PRESGENE_ID',
                   'programs_label_code': 'Time',
                   'edss': 'EDSS',
                   'mstype_code':'MStype',
                   'subject_type_code':'MS',
                   'age': 'Age',
                   'sex': 'Sex',
                   'average_cognition':'BL_Avg_cognition'}
                          
    nozr_df = nozr_df.rename(columns = rename_dict)

    print(f"The sMRI volumes dataset is ready for merging")
    
    return nozr_df

In [None]:
def format_to_PRESGENE(df_list):
    """
    INPUT: 
    df_list : (list of dataframes) list containing 2 sMRI (thickness) datasets representing left & right hemisphere measurements
    OUTPUT: 2 dataframes of left and right hemisphere data.
    DESCRIPTION: Reformats the ID column to match that of the PRESGENE_ID, and identifies the left and right hemisphere dataset.
    """
    lh_df = None
    rh_df = None
    for df in df_list:
        old_id = df.columns[0]
        id_col = 'PRESGENE_ID'
        # Restructure ID column
        df = df.rename(columns = {df.columns[0] : id_col})
        df[id_col] = df[id_col].astype(str)
        df[id_col] = df[id_col].replace('^sub-', '', regex=True)
        df[id_col] = df[id_col].str[:4]+ '_' + df[id_col].str[4:]

        # Remove the BrainSegVolNotVent & eTIV columns:
        df.drop(columns = ['BrainSegVolNotVent', 'eTIV'], inplace=True)

        if 'lh' in old_id:
            lh_df = df
        elif 'rh' in old_id:
            rh_df = df
        else:
            print('Neither left nor right hemisphere files were provided.')
        
    return lh_df, rh_df

In [None]:
def clean_sMRI_thickness(df_list, num_ses):
    """
    INPUT:
    df_list : (nested list of dfs) Each session has its own sublist containing the left and right hemisphere sMRI thickness 
    data within the main list.
    num_ses : (integer) number of sessions to include in the merging of the sMRI thickness datasets
    OUTPUT: 1 dataframe
    DESCRIPTION: Sets up the sMRI (thickness) individual datasets (per session and brain hemisphere) for merging with others in the 
    make_base_df function. Uses the parameter num_ses session to combine the left and right hemisphere data up to the desired session
    (baseline = 1, five year follow-up = 2, ten year follow-up = 3).
    """
    # Initialise the output variable
    output_df = None

    # Iterate over the list to reformat the ID column and merge the left and right hemisphere measurements up to desired session
    for i in range(0, num_ses):
        # Reformat
        presgene_lh, presgene_rh = format_to_PRESGENE(df_list[i])
        
        # merge lh & rh datasets & add time column
        lh_rh_df = pd.merge(presgene_lh, presgene_rh, on=['PRESGENE_ID'], how='inner')
        lh_rh_df['Time'] = i + 1

        # Update output_df to baseline session for the first iteration
        if i == 0:
            # if baseline (_00)
            output_df = lh_rh_df
        
        # Combine previous output_df with left & right hemisphere obtained for current session
        else:
            # Find common participant IDs between the 2 sessions
            common_ids = pd.merge(output_df[['PRESGENE_ID']], lh_rh_df[['PRESGENE_ID']], on='PRESGENE_ID')

            # Filter the dataframes to only include the selected participants
            base_filtered = output_df[output_df['PRESGENE_ID'].isin(common_ids['PRESGENE_ID'])]
            new_filtered = lh_rh_df[lh_rh_df['PRESGENE_ID'].isin(common_ids['PRESGENE_ID'])]

            # Concatenate the previous dataframe with the new session dataset
            output_df = pd.concat([base_filtered, new_filtered], ignore_index=True)
        
    return output_df

In [None]:
def clean_fECM(df):
    """
    INPUT: 
    df : (dataframe) dataset with fECM data
    OUTPUT: 1 dataframe
    DESCRIPTION: Isolates the z-scores features and creates 'Time' column (session indicator) based on feature name (bl = 1, fu = 2).
    """
    # Remove all non-zscore columns from the df
    rm_non_z = [col for col in df.columns if not col.startswith('Zscore_') and col != 'PRESGENE_ID']
    temp_df =  df.drop(columns = rm_non_z)
    
    # Make baseline and future columns list
    bl_col_names = temp_df.filter(regex='_bl_').columns.tolist()
    fu_col_names = temp_df.filter(regex='_fu_').columns.tolist()

    # Reformat the baseline and future columns
    df_BL = temp_df[bl_col_names + ['PRESGENE_ID']]
    df_FU = temp_df[fu_col_names + ['PRESGENE_ID']]

    df_BL.columns = df_BL.columns.str.replace('_bl_', '_')
    df_FU.columns = df_FU.columns.str.replace('_fu_', '_')

    # Create the Time column to label the sessions
    df_BL.insert(0, 'Time', 1, False)
    df_FU.insert(0, 'Time', 2, False)

    # Combine the baseline and future sessions
    output_df = pd.concat([df_BL, df_FU], ignore_index=True)

    return output_df
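
The wide-to-long reshaping performed by clean_fECM can be illustrated on a toy table. The sketch below is only an illustration under assumptions: the column name `Zscore_bl_regionA` and the participant IDs are hypothetical and do not come from the real fECM file.

```python
import pandas as pd

# Hypothetical toy table: one row per participant, with baseline (_bl_)
# and follow-up (_fu_) versions of the same z-score feature
wide = pd.DataFrame({
    "PRESGENE_ID": ["MS_01", "MS_02"],
    "Zscore_bl_regionA": [0.1, -0.4],
    "Zscore_fu_regionA": [0.3, -0.2],
})

sessions = []
for time, tag in enumerate(["_bl_", "_fu_"], start=1):
    # Keep the ID plus the columns of the current session
    cols = [c for c in wide.columns if tag in c]
    part = wide[["PRESGENE_ID"] + cols].copy()
    # Drop the session tag so both sessions share the same column names
    part.columns = part.columns.str.replace(tag, "_", regex=False)
    # Label the session (baseline = 1, follow-up = 2)
    part.insert(0, "Time", time)
    sessions.append(part)

# Stack the sessions: one row per participant per session
long = pd.concat(sessions, ignore_index=True)
print(long)
```

Each participant now contributes one row per session, which is the layout the merge on `['PRESGENE_ID', 'Time']` in make_base_df relies on.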

## Merging

In [None]:
def make_base_df(dMRI, sMRI_vol, sMRI_ls, fECM, num_ses):
    """
    INPUT:
    dMRI: (dataframe) diffusion MRI dataset.
    sMRI_vol: (dataframe) structural MRI dataset with volumetric, clinical and demographic data.
    sMRI_ls : (nested list of dataframes) sublists containing the thickness structural MRIs at one of the sessions for the left and 
    right hemisphere of the brain
    fECM : (dataframe) functional MRI dataset.
    num_ses: (integer) the number of the highest session desired in the dataset. (1-3)
    OUTPUT: 2 dataframes. The first with pwMS only and the second with HC + pwMS participants.
    DESCRIPTION: Cleans and merges the sMRI thickness, sMRI volumes, dMRI and fMRI datasets up to the desired session number for the 
    base MS (pwMS only) and ALL participants (HC + pwMS) datasets.
    """
    # Clean the datasets
    dMRI_df = clean_dMRI(dMRI)
    vol_df = clean_sMRI_vol(sMRI_vol)
    fECM_df = clean_fECM(fECM)
    thickness_df = clean_sMRI_thickness(sMRI_ls, num_ses)

    # Merge the dMRI, fMRI & volume sMRI measures:
    dsMRI_df = pd.merge(vol_df, dMRI_df, on=['PRESGENE_ID', 'Time'])
    dsfMRI_df = pd.merge(dsMRI_df, fECM_df, on=['PRESGENE_ID', 'Time'])

    # Find the subjects and corresponding sessions shared with the thickness dataset
    common_ids = pd.merge(dsfMRI_df[['PRESGENE_ID', 'Time']], thickness_df[['PRESGENE_ID', 'Time']], on=['PRESGENE_ID', 'Time'])
    
    # Filter the data frames to include only common subjects
    dsfMRI_filtered = dsfMRI_df[dsfMRI_df[['PRESGENE_ID', 'Time']].apply(tuple, 1).isin(common_ids.apply(tuple, 1))]
    thick_filtered = thickness_df[thickness_df[['PRESGENE_ID', 'Time']].apply(tuple, 1).isin(common_ids.apply(tuple, 1))]

    # Merge the combined volume, dMRI & fMRI data with the thickness data
    output_df = pd.merge(dsfMRI_filtered, thick_filtered, on=['PRESGENE_ID','Time'], how = 'inner')

    # Create the ALL participants base dataframe & remove pwMS specific features
    output_all = output_df.drop(columns = ['MStype','lesion_volume_mm3'])

    # Create the MS participants base dataframe & remove irrelevant features for pwMS
    output_ms = output_df[output_df['MS'] == 1]
    output_ms = output_ms.drop(columns = 'MS')
    
    print(f'Merging completed')
    return output_ms, output_all

### Import the datasets

Import the datasets containing the sMRI (thickness and volumes), clinical, demographic, dMRI, and fMRI data. The sMRI thickness files are grouped by session based on their file names.

In [None]:
# Load the diffusion MRI (dMRI) dataset
dMRI = pd.read_excel('data/Presgene_wholeBL_forFBA_metrics_JHU_forLME_zscores.xlsx')

# Load the structural MRI dataset
volume_df = pd.read_excel('updated_data/prograMS_database_v1.9_reduced_kayna.xlsx')

# Load the functional MRI (fECM) dataset
fECM = pd.read_excel('data/Presgene_longitudinal_marijn_fECM.xlsx')

# Load the structural MRI (thickness) datasets
# Create the sublists for session grouping
Y00, Y05, Y10 = list(), list(), list()

sMRIs_path = 'data/sMRIs'
for filename in os.listdir(sMRIs_path):
    # Read and store each file in its appropriate list
    file_path = os.path.join(sMRIs_path, filename)
    df = pd.read_csv(file_path, sep='\t')
    
    if 'Y00' in filename:
        Y00.append(df)
    elif 'Y05' in filename:
        Y05.append(df)
    elif 'Y10' in filename:
        Y10.append(df)

# Create the final sMRI list with all sessions
sMRI_ls = [Y00, Y05, Y10]

### Run the merging function (optional saving)

The cells below run the Step 1 functions and optionally save the newly created dataframes.

In [None]:
# Run the base function
MS_base2, ALL_base2 = make_base_df(dMRI, volume_df, sMRI_ls, fECM, 2)

In [None]:
# Optionally save the base dataframes
for df in [MS_base2, ALL_base2]:
    # Find the highest included session
    ses = max(df['Time'].value_counts().index)

    # Identify the participants in the df
    label_col = [col for col in df.columns if col.startswith('MS')]

    # Save accordingly as csv files
    if label_col[0] == 'MStype':   
        df.to_csv(f'updated_data/base/MS_no_dMRI_base{ses}.csv', index = False)
    else:
        df.to_csv(f'updated_data/base/ALL_no_dMRI_base{ses}.csv', index = False)

# Step 2: Train-Test Split

## Generate the train and test subject IDs

In [None]:
def get_test_subjects(ms_df, all_df, file_path):
    """
    INPUT: 
    ms_df : (dataframe) base pwMS dataset
    all_df : (dataframe) base HC + pwMS dataset
    file_path : (string) file path for saving the test subjects arrays
    OUTPUT: 2 series of PRESGENE IDs (strings) for pwMS and HC + pwMS participants
    DESCRIPTION: Splits the subjects into train and test subjects with stratified sampling, using an engineered feature combining the
    MS phenotypes and HC/pwMS features for the HC + pwMS dataset and the MS phenotype feature for the pwMS dataset.
    """
    # find the max session value in the dataset 
    max_ses = max(ms_df['Time'].value_counts().index)

    # Isolate the HC participants at the max session from the HC + pwMS dataset
    hc_subset = all_df[all_df['MS'] == 2]
    hc_subset = hc_subset[hc_subset['Time'] == max_ses]
    hc_subset = hc_subset[['PRESGENE_ID', 'Time', 'MS']]
    
    # Isolate the MS participants at the max session from the pwMS dataset
    ms_subset = ms_df[['PRESGENE_ID', 'Time', 'MStype']]
    ms_subset = ms_subset[ms_subset['Time'] == max_ses]
    
    # Remove pwMS with missing values or incorrect MStypes
    ms_subset = ms_subset.dropna(subset=['MStype'])
    ms_subset = ms_subset[ms_subset['MStype'] <= 3]
    
    # Concatenate the HC and pwMS subsets, fill missing values in the MStype column (HC rows) with 0
    working_df = pd.concat([hc_subset, ms_subset], axis=0)
    working_df['MStype'] = working_df['MStype'].fillna(0)

    # Run train_test_split() function on the IDs, 80% train  20% test
    train_id_all, test_id_all, _, _ = train_test_split(working_df['PRESGENE_ID'], working_df['MStype'], 
                                                        test_size = 0.2, stratify = working_df['MStype'], random_state = 42)

    train_id_ms, test_id_ms, _, _ = train_test_split(ms_subset['PRESGENE_ID'], ms_subset['MStype'], 
                                                    test_size = 0.2, stratify = ms_subset['MStype'], random_state = 42)

    # Indicate if participants of the train series appear in the test series.
    print(f'There are {train_id_all.isin(test_id_all).sum()} subjects of the ALL train split in the ALL test split and {test_id_all.isin(train_id_all).sum()} subjects of the ALL test split in the ALL train split.')
    print(f'There are {train_id_ms.isin(test_id_ms).sum()} subjects of the MS train split in the MS test split and {test_id_ms.isin(train_id_ms).sum()} subjects of the MS test split in the MS train split.')
    
    # Export the test arrays of subjects as pickle files
    test_id_all.to_pickle(f'{file_path}/test_PRESGENE_ID_ALL{max_ses}.pkl')
    test_id_ms.to_pickle(f'{file_path}/test_PRESGENE_ID_MS{max_ses}.pkl')
    
    return test_id_ms, test_id_all

## Split the dataset between train and test

In [None]:
def make_train_test_split(df, test_subs, file_path):
    """
    INPUT:
    df : (dataframe) base dataset (pwMS only or HC + pwMS) to split
    test_subs : (series of strings or None) test subject IDs obtained from the get_test_subjects function; if None, the IDs are loaded from file_path
    file_path : (string) folder containing the saved test subject .pkl files (only used when test_subs is None)
    OUTPUT: 2 subsets of the input df, containing only the train or only the test subjects
    DESCRIPTION: Splits the input dataframe into the training and testing datasets.
    """
    # Get the max session of the df of interest
    max_ses = max(df['Time'].value_counts().index)
    label_col = [col for col in df.columns if col.startswith('MS')]

    if test_subs is None:
        # Get the appropriate .pkl file
        sub = 'ALL' if label_col[0] == 'MS' else 'MS'
        
        # Load the test ids .pkl file corresponding to the session & dataset
        with open (f'{file_path}/test_PRESGENE_ID_{sub}{max_ses}.pkl', 'rb') as file:
            test_subs = pickle.load(file)
    
    # Get the train and test subsets
    test_df = df[df['PRESGENE_ID'].isin(test_subs)]
    train_df = df[~df['PRESGENE_ID'].isin(test_subs)]

    # Reset the indexing of the train & test dfs
    test_df = test_df.reset_index(drop=True)
    train_df = train_df.reset_index(drop=True)

    # Indicate if participants of the train dataset appear in the test dataset.
    # Compare the ID values as Series: DataFrame.isin(DataFrame) aligns on index and column labels instead.
    n_train_in_test = train_df['PRESGENE_ID'].isin(test_df['PRESGENE_ID']).sum()
    n_test_in_train = test_df['PRESGENE_ID'].isin(train_df['PRESGENE_ID']).sum()
    print(f"There are {n_train_in_test} subjects of the train split in the test split and {n_test_in_train} subjects of the test split in the train split.")

    return train_df, test_df

## Run the train test split functions

First obtain the test participant IDs, then get the train and test datasets based on the test participant IDs.
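
The idea behind the subject-level stratified split can be sketched on synthetic data. This is a minimal illustration, assuming a hypothetical cohort of 40 participants with a `group` column (0 = HC, 1 to 3 = MS phenotypes); it is not the notebook's actual data or class coding.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy cohort: one row per participant
subjects = pd.DataFrame({
    "PRESGENE_ID": [f"SUB_{i:02d}" for i in range(40)],
    "group": [0] * 10 + [1] * 10 + [2] * 10 + [3] * 10,
})

# Split the participant IDs (not the rows), stratified on the class label
train_ids, test_ids = train_test_split(
    subjects["PRESGENE_ID"], test_size=0.2,
    stratify=subjects["group"], random_state=42)

# Long-format table with two sessions per participant
long_df = pd.concat([subjects.assign(Time=1), subjects.assign(Time=2)],
                    ignore_index=True)

# Every session of a test participant goes to the test set
test_df = long_df[long_df["PRESGENE_ID"].isin(test_ids)]
train_df = long_df[~long_df["PRESGENE_ID"].isin(test_ids)]

# Leakage check on ID values
overlap = set(train_df["PRESGENE_ID"]) & set(test_df["PRESGENE_ID"])
test_counts = (test_df.loc[test_df["Time"] == 1, "group"]
               .value_counts().sort_index().to_dict())
print(len(overlap), test_counts)
```

Splitting on IDs before filtering the long-format table keeps all sessions of a participant on the same side of the split, which is what prevents leakage between train and test.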

In [None]:
# Get the test subject IDs
testid_MS2, testid_ALL2 = get_test_subjects(MS_base2, ALL_base2, 'updated_data/test')

# Get the train and test dataset for the ALL and MS datasets (2 methods shown below)
ALL_train_base2, ALL_test_base2 = make_train_test_split(ALL_base2, testid_ALL2, None)
MS_train_base2, MS_test_base2 = make_train_test_split(MS_base2, None, 'updated_data/test')

# Step 3: Imputing the dataframes (all/type/neighbor)

In [None]:
def split_df_by_time(df):
    """
    INPUT: 
    df : (dataframe) dataset with data of 2 or more sessions
    OUTPUT: list of dataframes, each with the data of a unique session
    DESCRIPTION: Splits the input dataset into unique sessions, up to its max session
    """
    # Find the max session
    max_ses = max(df['Time'].value_counts().index)
    
    # Initialise output variables
    output_list = []
    
    # Get session 1
    t1_df = df[df['Time'] == 1]
    output_list.append(t1_df)
    
    # Get session 2
    t2_df = df[df['Time'] == 2]
    output_list.append(t2_df)
    
    # Create the datasets for session 3 and higher if present
    if max_ses >= 3:
        for i in range(3, max_ses+1):
            # Select session i (not only the max session)
            t_df = df[df['Time'] == i]
            output_list.append(t_df)
    
    return output_list

In [None]:
def impute_df(df, impute_param):
    """
    INPUT:
    df : (DataFrame)
    impute_param : (string) imputation type, possible values are: 'all', 'type', or 'neighbor'
    OUTPUT: DataFrame with no missing values
    DESCRIPTION: Imputes the missing values of categorical and numerical features based on the chosen imputation method. 
    """
    # Identify label column, categorical and continuous columns
    label_col = [col for col in df.columns if col.startswith('MS')][0]
    cat_cols = ['Sex', 'HC_CI_CP']
    num_cols = [col for col in df.columns if col not in ['Time', label_col, 'PRESGENE_ID', 'Age_group'] + cat_cols]

    # Save and remove ID column
    PRESGENE_IDs = df[['PRESGENE_ID']]
    noID_df = df.drop(columns='PRESGENE_ID')

    # Ensure numerical columns are numeric
    for col in num_cols:
        noID_df[col] = pd.to_numeric(noID_df[col], errors='coerce')
    
    # Handle HC-specific logic
    if df['PRESGENE_ID'].str.startswith('HC').any():
        noID_df.loc[noID_df['MS'] == 2, 'EDSS'] = 999

    # Copy DataFrame for imputation
    imputed_df = noID_df.copy()

    if impute_param == 'all': # Impute missing values (per session only).
        # Impute NaN for categorical columns with mode.
        for col in cat_cols:
            if imputed_df[col].isna().any():
                imputed_df[col] = noID_df[col].transform(lambda x: x.fillna(x.mode()[0]))
        
        # Impute NaN for continuous columns with the mean.
        for col in num_cols:
            if imputed_df[col].isna().any():
                imputed_df[col] = noID_df[col].transform(lambda x: x.fillna(x.mean()))
            

    elif impute_param == 'type': # Impute missing values per session and label group.
        # Impute categorical columns with mode by group
        for col in cat_cols:
            if imputed_df[col].isna().any():
                global_mode = noID_df[col].mode().iloc[0]  # Fallback mode
                imputed_df[col] = imputed_df.groupby(label_col)[col].transform(
                    lambda x: x.fillna(x.mode().iloc[0] if not x.mode().empty else global_mode))
                 
        # Impute numerical columns with mean by group
        for col in num_cols:
            if imputed_df[col].isna().any():
                global_mean = noID_df[col].mean()  # Fallback mean
                imputed_df[col] = imputed_df.groupby(label_col)[col].transform(
                    lambda x: x.fillna(x.mean()).fillna(global_mean))

    elif impute_param == 'neighbor': # Impute missing values per session and neighbour.
        # Initialize KNNImputers
        mode_imputer = KNNImputer(n_neighbors=5, weights="distance", metric="nan_euclidean") # Mode
        mean_imputer = KNNImputer(n_neighbors=5) # Mean
        
        # Impute NaN 
        imputed_cats_df = pd.DataFrame(mode_imputer.fit_transform(imputed_df[cat_cols]), columns = cat_cols, index=imputed_df.index)
        imputed_df[cat_cols] = imputed_cats_df
        
        imputed_nums_df = pd.DataFrame(mean_imputer.fit_transform(imputed_df[num_cols]), columns = num_cols, index=imputed_df.index)
        imputed_df[num_cols] = imputed_nums_df

    # Snap EDSS to valid values (0, 1, then steps of 0.5) and keep the 999 placeholder assigned to HC
    imputed_df['EDSS'] = imputed_df['EDSS'].apply(
        lambda x: 0 if x < 0.5 else (1 if x < 1.5 else round(x * 2) / 2 if x != 999 else 999))
    imputed_df[['Time', 'Sex', label_col, 'HC_CI_CP']] = imputed_df[['Time', 'Sex', label_col, 'HC_CI_CP']].round().astype(int)

    # Reintroduce ID column
    output_df = pd.concat([PRESGENE_IDs, imputed_df], axis=1)
    return output_df
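
The difference between the 'all' and 'type' imputation branches can be seen on a toy single-session table. The sketch below is an illustration only; the `volume` column and its values are hypothetical and stand in for any numerical feature.

```python
import numpy as np
import pandas as pd

# Hypothetical single-session table with a label column and one numerical feature
df = pd.DataFrame({
    "MStype": [1, 1, 1, 2, 2, 3],
    "volume": [10.0, np.nan, 14.0, 20.0, 22.0, np.nan],
})

# 'all': a single mean computed over the whole session
global_mean = df["volume"].mean()
all_imputed = df["volume"].fillna(global_mean)

# 'type': mean within each label group, falling back to the session
# mean when a group has no observed value at all (here MStype == 3)
type_imputed = (df.groupby("MStype")["volume"]
                  .transform(lambda s: s.fillna(s.mean()))
                  .fillna(global_mean))

print(pd.DataFrame({"all": all_imputed, "type": type_imputed}))
```

The fallback matters for small classes: a group whose values are all missing has no mean of its own, so without it the gap would survive the imputation.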

In [None]:
def merge_time_imputed(ls_df):
    """
    INPUT: 
    ls_df : (list of dataframes) list of dataframes with unique sessions values
    OUTPUT: 1 Dataframe
    DESCRIPTION: Merges the unique sessions dataset to obtain 1 dataset with all sessions in it.
    """
    # Initialise the output dataframe
    output_df = None

    # Find the number of sessions to merge
    max_ses = len(ls_df)

    # Identify common columns between the first two dataframes
    common_columns = ls_df[0].columns.intersection(ls_df[1].columns)
    temp_df1 = ls_df[0][common_columns]
    temp_df2 = ls_df[1][common_columns]
    
    # Check if there are more than 2 dataframes to combine
    if max_ses > 2:
        concat_extras = []
        for i in range(2, max_ses):
            # Find common columns between the existing common columns and the nth dataframe
            common_columns = common_columns.intersection(ls_df[i].columns)
            temp_df = ls_df[i][common_columns]
            concat_extras.append(temp_df)
        # Combine all subsets
        output_df = pd.concat([temp_df1, temp_df2]+concat_extras, axis=0)
    
    else: # if only 2 dataframes to combine
        # Combine only the first two subsets
        output_df = pd.concat([temp_df1, temp_df2], axis=0)
    
    return output_df

## Run impute_df and save for the test dataframes

In [None]:
save_file_path = 'updated_data/test'

# Iterate over the test datasets for imputation and saving
for test_df in [ALL_test_base2, MS_test_base2]:
    for imp_method in ['all', 'type', 'neighbor']:
        # Split by sessions
        unique_tp_ls = split_df_by_time(test_df)

    
        # Initialise unique sessions list & impute missing values for each df
        imp_ls = []
        for tp_df in unique_tp_ls:
            print('unique_tp_ls', tp_df.shape)
            imp_tp_df = impute_df(tp_df, imp_method)
            imp_ls.append(imp_tp_df)
    
        # Merge the unique sessions imputed datasets
        imp_df = merge_time_imputed(imp_ls)
        saving_ls = [imp_df] + imp_ls
    
        # Save the imputed test datasets to .csv (no oversampling for test dfs)
        sub = 'ALL' if 'MS' in saving_ls[0].columns else 'MS'
        max_ses = len(imp_ls)
        ses = ['', '00', '05', '10']
        for i, df in enumerate(saving_ls):
            df.to_csv(f'{save_file_path}/{imp_method}/{sub}test_{ses[i]}{max_ses}_{imp_method}.csv', index = False)

# Step 4: Handling Class Imbalances with SMOTE
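
SMOTE creates a synthetic sample by interpolating between a minority-class sample and one of its nearest minority-class neighbours. The NumPy sketch below illustrates only that interpolation idea; it is a simplified stand-in and not the imblearn implementation used in this notebook. The function name `smote_like_samples` and the toy points are hypothetical.

```python
import numpy as np

def smote_like_samples(X_minority, n_new, k=3, seed=42):
    """Create n_new synthetic rows by interpolating between minority samples
    and one of their k nearest minority neighbours (simplified illustration)."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X_minority, dtype=float)

    # Pairwise Euclidean distances within the minority class
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)  # a sample is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]

    # Pick a random base sample and one of its k neighbours for each new row
    base_idx = rng.integers(0, len(X), size=n_new)
    neigh_idx = neighbours[base_idx, rng.integers(0, k, size=n_new)]

    # Place the new point somewhere on the segment between the two
    gap = rng.random((n_new, 1))
    return X[base_idx] + gap * (X[neigh_idx] - X[base_idx])

# Four minority samples on the corners of the unit square
X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])

# The number of neighbours cannot exceed the class size minus one
synthetic = smote_like_samples(X_min, n_new=6, k=min(3, len(X_min) - 1))
print(synthetic)
```

Because every synthetic point lies on a segment between two existing samples, the number of neighbours has to stay below the size of the smallest class, which is why the neighbour count is capped in the same way when SMOTE is configured below.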

In [None]:
# Merge the datasets session row wise per subjects (1 row per subjects)
def merge_time_by_rows(ls_df):
    """
    INPUT: 
    ls_df : (nested list of dataframes) nested list containing the imputed datasets, the first list for the MS unique sessions datasets 
    and a second for ALL unique sessions datasets.
    OUTPUT: 2 dataframes (the first for MS, the second for ALL), each with 1 row per subject
    DESCRIPTION: Combines the unique sessions dataframes of each sublist of the input list to obtain 1 row per subject.
    """
    # Initialise the output list 
    output_ls = []

    # Iterate over the participant specific lists (MS & ALL)
    for idls, ls in enumerate(ls_df):
        # Define the base (baseline session) dataset and add the session suffix to its session-specific columns
        base_df = ls[0]
        base_ses = int(base_df['Time'].unique()[0])
        base_df = base_df.rename(columns={col: f'{col}_ses{base_ses}' for col in base_df.columns if col not in ['PRESGENE_ID', 'Sex']})

        # Iterate over the range of dfs in the sublist
        for idx in range(1, len(ls)):
            working_df = ls[idx]
            working_ses = int(list(working_df['Time'].unique())[0])
            
            # Remove the sex column as it does not change over the sessions
            working_df = working_df[[col for col in working_df.columns if col != 'Sex']]

            # Rename the columns in common
            working_df = working_df.rename(columns={col: col + f'_ses{str(working_ses)}' for col in working_df.columns if col != 'PRESGENE_ID'})

            # Combine the base and the new time point
            base_df = pd.merge(base_df, working_df, on='PRESGENE_ID', how='left')

        # Add the newly reformatted df to the output list
        output_ls.append(base_df)

    return output_ls[0], output_ls[1]
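
The round trip between the long format (one row per participant per session) and the one-row-per-participant format can be sketched on a toy table. This is an illustration under assumptions: the data are hypothetical, and only the `_ses<N>` suffix convention is taken from the functions in this step.

```python
import pandas as pd

# Hypothetical long-format table: one row per participant per session
long_df = pd.DataFrame({
    "PRESGENE_ID": ["A", "B", "A", "B"],
    "Time": [1, 1, 2, 2],
    "Sex": [0, 1, 0, 1],
    "EDSS": [2.0, 3.5, 2.5, 4.0],
})

static_cols = ["PRESGENE_ID", "Sex"]  # columns that do not change over sessions

# Long -> wide: suffix the session-specific columns and merge on the ID
wide = None
for ses, ses_df in long_df.groupby("Time"):
    renamed = ses_df.rename(
        columns={c: f"{c}_ses{ses}" for c in ses_df.columns if c not in static_cols})
    if wide is None:
        wide = renamed
    else:
        wide = wide.merge(renamed.drop(columns="Sex"), on="PRESGENE_ID", how="left")

# Wide -> long: select each session's columns and strip the suffix again
parts = []
for ses in sorted(long_df["Time"].unique()):
    suffix = f"_ses{ses}"
    ses_cols = [c for c in wide.columns if c.endswith(suffix)]
    part = wide[static_cols + ses_cols].copy()
    part.columns = [c[: -len(suffix)] if c.endswith(suffix) else c for c in part.columns]
    parts.append(part)
restored = pd.concat(parts, ignore_index=True)

print(wide)
print(restored)
```

The wide layout gives SMOTE one feature vector per participant, so a synthetic participant receives values for every session at once; the reverse step restores the per-session rows afterwards.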

In [None]:
def reformat_SMOTE_dfs(df):
    """
    INPUT: 
    df : (dataframe) dataframe with the 1 row per participant format created by the function merge_time_by_rows
    OUTPUT: list with first a dataframe of the merged sessions with X rows per participant for X sessions, followed by the unique sessions
    datasets.
    DESCRIPTION: Splits the input dataframe into unique sessions, removes the session specific suffix from the column names, and returns a 
    list of the unique sessions datasets and a version with the merged sessions.
    """
    # Reset index to use it as 'PRESGENE_ID' Substitute
    df_idx = df.reset_index(drop = False)

    # Ensure MStype or MS (categorical columns) are whole values
    label_cols = [col for col in df_idx.columns if col.startswith('MS')]
    for col in label_cols:
        df_idx[col] = df_idx[col].round().astype(int)

    # Find the number of sessions included in the dataset
    sessions = [col for col in df.columns if col.startswith('Time')]

    # Initialise the output list for unique session datasets
    unique_ses_ls = []

    # Iterate over the sessions to get the unique sessions dataframes
    for ses in range(1, len(sessions) + 1):
        ses_cols = [col for col in df_idx.columns if f'_ses{ses}' in col]
        working_df = df_idx[['index', 'Sex'] + ses_cols]
        working_df.columns = working_df.columns.str.replace(f'_ses{ses}', '', regex=False)
        
        unique_ses_ls.append(working_df)

    # Combine the unique session data frames into 1 multi-session dataframe
    combined_df = pd.concat(unique_ses_ls, ignore_index=True)

    return [combined_df] + unique_ses_ls

In [None]:
def class_rebalancing_SMOTE(ls_ls_df):
    """
    INPUT:
    ls_ls_df : (nested list of dataframes) nested list containing the imputed datasets, the first list for the MS unique sessions datasets 
    and a second for ALL unique sessions datasets.
    OUTPUT: lists of dataframes for the MS and ALL unique sessions datasets
    DESCRIPTION: Uses SMOTE to fix class imbalance in the MS phenotypes and HC + MS phenotypes in the pwMS dataset and HC + pwMS dataset,
    respectively. 
    """
    # Combine the unique sessions row-wise (1 row per subjects)
    MS_df, ALL_df = merge_time_by_rows(ls_ls_df)

    # Set the per-class target size to twice the size of the largest class (RRMS, MStype = 2)
    resample_size = MS_df['MStype_ses1'].value_counts().max() * 2
    
    ## OVERSAMPLING pwMS DATASET
    
    # Make the sampling strategy for pwMS
    sampling_strategy_ms = {label: resample_size for label in MS_df['MStype_ses1'].unique()}

    # Remove the ID col and isolate the stratifying feature col
    X_ms = MS_df.drop(['MStype_ses1', 'PRESGENE_ID'], axis=1)
    y_ms = MS_df['MStype_ses1']
    
    # Initialise & apply SMOTE with a dynamic number of neighbours (3, or the smallest class size minus 1 if that is lower)
    min_samples_class_ms = y_ms.value_counts().min()
    smote_MS = SMOTE(sampling_strategy=sampling_strategy_ms, k_neighbors=min(3, min_samples_class_ms - 1), random_state=42) 
    X_SMOTE_ms, y_SMOTE_ms = smote_MS.fit_resample(X_ms, y_ms)

    # Create the output MS df with the SMOTE oversampled data
    output_MS = pd.DataFrame(X_SMOTE_ms, columns = X_ms.columns)
    output_MS = pd.concat([output_MS, pd.Series(y_SMOTE_ms, name='MStype_ses1')], axis=1)

    ## OVERSAMPLING ALL PARTICIPANTS DATASET
    
    # Add the MStype column to the ALL participants df for resampling of MStypes & HC/MS. 
    temp_ALL_mstype = ALL_df.merge(MS_df[['PRESGENE_ID', 'MStype_ses1']], on='PRESGENE_ID', how = 'left')
    temp_ALL_mstype['MStype_ses1'] = temp_ALL_mstype['MStype_ses1'].fillna(0)

    # Make the sampling strategy for ALL
    sampling_strategy_all = {label: resample_size for label in temp_ALL_mstype['MStype_ses1'].unique()}

    # Remove the ID col and isolate the stratifying feature col
    X_all = temp_ALL_mstype.drop(['MStype_ses1', 'PRESGENE_ID'], axis=1)
    y_all = temp_ALL_mstype['MStype_ses1']
    
    # Initialise & apply SMOTE with a dynamic number of neighbours (3, or the smallest class size minus 1 if that is lower)
    min_samples_class_all = y_all.value_counts().min()
    smote_ALL = SMOTE(sampling_strategy=sampling_strategy_all, k_neighbors=min(3, min_samples_class_all - 1), random_state=42) 
    X_SMOTE_all, y_SMOTE_all = smote_ALL.fit_resample(X_all, y_all)

    # Create the output ALL df with the SMOTE oversampled data
    output_ALL = pd.DataFrame(X_SMOTE_all, columns = X_all.columns)

    # Reformat the extrapolated EDSS values: values >= 10.5 (including the sentinel 999) map to 999,
    # values below 1.5 snap to 0 or 1, and everything else is rounded to the nearest 0.5
    def snap_edss(x):
        if x >= 10.5:
            return 999
        if x < 0.5:
            return 0
        if x < 1.5:
            return 1
        return round(x * 2) / 2

    edss_cols = [col for col in output_MS.columns if col.startswith('EDSS')]
    for col in edss_cols:
        output_MS[col] = output_MS[col].apply(snap_edss)
        output_ALL[col] = output_ALL[col].apply(snap_edss)

    # Convert the stratifying features back to integers (in case they were converted to floats)
    output_MS['MStype_ses1'] = output_MS['MStype_ses1'].round().astype(int)
    output_ALL['MS_ses1'] = output_ALL['MS_ses1'].round().astype(int)

    # Reformat the dataset from 1 row per subject to 1 row per session per subject.
    MS_output_ls = reformat_SMOTE_dfs(output_MS)
    ALL_output_ls = reformat_SMOTE_dfs(output_ALL)

    return MS_output_ls, ALL_output_ls
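
The two SMOTE settings used in `class_rebalancing_SMOTE` are derived from the class counts alone, so they can be checked without running the oversampling. The sketch below reproduces that arithmetic in plain Python. `smote_settings` is a hypothetical helper and the label counts are invented for illustration; only RRMS = 2 comes from the comments above, and the codes 3 and 4 are assumptions.

```python
from collections import Counter

def smote_settings(labels, max_neighbors=3, factor=2):
    """Return (sampling_strategy, k_neighbors) computed as in the function above."""
    counts = Counter(labels)

    # Every class is oversampled to `factor` times the size of the largest class.
    target = max(counts.values()) * factor
    sampling_strategy = {label: target for label in counts}

    # SMOTE interpolates between a sample and its neighbours within the same class,
    # so a class with n samples offers at most n - 1 neighbours.
    k_neighbors = min(max_neighbors, min(counts.values()) - 1)
    if k_neighbors < 1:
        raise ValueError('The smallest class needs at least 2 samples for SMOTE.')
    return sampling_strategy, k_neighbors

# Invented example: 10 subjects with label 2 (RRMS), 4 with label 3, 3 with label 4.
labels = [2] * 10 + [3] * 4 + [4] * 3
strategy, k = smote_settings(labels)
```

Here every class is resampled to 20 rows and `k` is 2, because the smallest class has only 3 members. The explicit check is an addition: the function above would pass `k_neighbors=0` to SMOTE if a class had a single subject.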

## Run the SMOTE oversampling on the train datasets & save them

In [None]:
save_file_path = 'updated_data/oversampled'
all_train_base = ALL_train_base2
ms_train_base = MS_train_base2

# Iterate over the 3 possible imputation methods
for imp_method in ['all', 'type', 'neighbor']:
    # Iterate over the MS and ALL train datasets for imputation
    train_imp_tp_ls = []
    for run, train_df in enumerate([ms_train_base, all_train_base]):
        # Split by sessions
        unique_tp_ls = split_df_by_time(train_df)
    
        # Initialise unique sessions list & impute missing values for each df
        imp_ls = []
        for tp_df in unique_tp_ls:
            imp_tp_df = impute_df(tp_df, imp_method)
            imp_ls.append(imp_tp_df)
    
        # Add the imputations to the MS or ALL list 
        train_imp_tp_ls.append(imp_ls)

    # Use SMOTE for class rebalancing of the MS and ALL datasets
    MS_smote_ls, ALL_smote_ls = class_rebalancing_SMOTE(train_imp_tp_ls)

    # Save the merged and unique sessions class rebalanced train datasets
    max_ses = len(unique_tp_ls)
    ses = ['', '00', '05', '10']
    for idx, df in enumerate(MS_smote_ls):
        if ses[idx] == '':
            df.to_csv(f'{save_file_path}/{imp_method}/MStrain_{max_ses}_{imp_method}.csv', index = False)
            ALL_smote_ls[idx].to_csv(f'{save_file_path}/{imp_method}/ALLtrain_{max_ses}_{imp_method}.csv', index = False)
        else:
            df.to_csv(f'{save_file_path}/{imp_method}/MStrain_{ses[idx]}{max_ses}_{imp_method}.csv', index = False)
            ALL_smote_ls[idx].to_csv(f'{save_file_path}/{imp_method}/ALLtrain_{ses[idx]}{max_ses}_{imp_method}.csv', index = False)
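
To make the naming scheme of the saved files explicit, the sketch below rebuilds the output paths without touching the disk. `train_file_names` is a hypothetical helper, and `max_ses = 3` is an assumption based on the three session labels in `ses`. Because the merged dataset has an empty session label, a single f-string covers both branches of the `if`/`else` above.

```python
def train_file_names(save_file_path, imp_method, max_ses,
                     ses_labels=('', '00', '05', '10')):
    """List the CSV paths written for one imputation method."""
    names = []
    for prefix in ('MS', 'ALL'):
        for label in ses_labels:
            # An empty label yields e.g. 'MStrain_3_type.csv' (merged sessions);
            # otherwise the label is prepended directly to max_ses.
            names.append(
                f'{save_file_path}/{imp_method}/'
                f'{prefix}train_{label}{max_ses}_{imp_method}.csv'
            )
    return names

paths = train_file_names('updated_data/oversampled', 'type', 3)
```

Note that the session label and `max_ses` are concatenated without a separator, so under this assumption the session `'00'` file is named `MStrain_003_type.csv`.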