# Run a Mendelian Randomization

### Authors: Calvin Howard.

#### Last updated: July 6, 2023

Use this notebook to run a two-stage least squares Mendelian randomization (MR) on data stored in a spreadsheet.

Notes:
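- To use this notebook well, you must understand MR and its three core assumptions: the instrument is associated with the exposure (relevance), it shares no unmeasured causes with the outcome (independence), and it affects the outcome only through the exposure (exclusion restriction).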
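As a rough illustration of why an instrument helps, the sketch below simulates data in which an unmeasured confounder biases the ordinary regression of outcome on exposure, while the instrumental-variable (Wald ratio) estimate recovers the true effect. All variable names and the simulated effect size are assumptions made for this example; none of it comes from the study data.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20_000
true_effect = 0.5  # assumed causal effect of exposure on outcome (simulation only)

z = rng.normal(size=n)                        # instrument: independent of the confounder
u = rng.normal(size=n)                        # unmeasured confounder
x = z + u + rng.normal(size=n)                # exposure: driven by instrument and confounder
y = true_effect * x + u + rng.normal(size=n)  # outcome: instrument acts only through x

# Naive regression slope of y on x, biased upward by the shared confounder u
ols_estimate = np.cov(x, y)[0, 1] / np.var(x, ddof=1)

# Wald ratio: instrument-outcome covariance divided by instrument-exposure covariance
iv_estimate = np.cov(z, y)[0, 1] / np.cov(z, x)[0, 1]

print(f"OLS estimate: {ols_estimate:.3f}")
print(f"IV estimate:  {iv_estimate:.3f}")
```

With a single instrument and no covariates, the Wald ratio equals the two-stage least squares point estimate.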

# 00 - Import CSV with All Data
**The CSV is expected to be in this format**
- The ID column and absolute paths to the NIfTI files are critical
```
+-----+----------------------------+--------------+--------------+--------------+
| ID  | Nifti_File_Path            | Covariate_1  | Covariate_2  | Covariate_3  |
+-----+----------------------------+--------------+--------------+--------------+
| 1   | /path/to/file1.nii.gz      | 0.5          | 1.2          | 3.4          |
| 2   | /path/to/file2.nii.gz      | 0.7          | 1.4          | 3.1          |
| 3   | /path/to/file3.nii.gz      | 0.6          | 1.5          | 3.5          |
| 4   | /path/to/file4.nii.gz      | 0.9          | 1.1          | 3.2          |
| ... | ...                        | ...          | ...          | ...          |
+-----+----------------------------+--------------+--------------+--------------+
```
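A minimal sketch of reading a table in this layout with pandas and checking that the two critical columns are present. The column names `ID` and `Nifti_File_Path` are taken from the table above; the helper `load_master_table` and the inline sample rows are hypothetical, and this is not a description of how `CalvinStatsmodelsPalm` reads the file.

```python
import io
import pandas as pd

REQUIRED_COLUMNS = ["ID", "Nifti_File_Path"]

def load_master_table(source):
    """Read a CSV (path or file-like object) and verify the required columns exist."""
    df = pd.read_csv(source)
    missing = [col for col in REQUIRED_COLUMNS if col not in df.columns]
    if missing:
        raise ValueError(f"Missing required columns: {missing}")
    return df

# Inline sample standing in for a real file on disk
sample_csv = io.StringIO(
    "ID,Nifti_File_Path,Covariate_1,Covariate_2,Covariate_3\n"
    "1,/path/to/file1.nii.gz,0.5,1.2,3.4\n"
    "2,/path/to/file2.nii.gz,0.7,1.4,3.1\n"
)
table = load_master_table(sample_csv)
print(table)
```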

Prep Output Directory

In [34]:
# Specify where you want to save your results to
out_dir = '/Users/cu135/Library/CloudStorage/OneDrive-Personal/OneDrive_Documents/Research/2023/subiculum_cognition_and_age/figures/Figures/odds_ratios'

Import Data

In [35]:
# Specify the path to your CSV or Excel file (and the sheet name, for Excel)
input_csv_path = '/Users/cu135/Dropbox (Partners HealthCare)/studies/cognition_2023/metadata/master_list_proper_subjects.xlsx'
sheet = 'master_list_proper_subjects'

In [36]:
from calvin_utils.permutation_analysis_utils.statsmodels_palm import CalvinStatsmodelsPalm
# Instantiate the CalvinStatsmodelsPalm class
cal_palm = CalvinStatsmodelsPalm(input_csv_path=input_csv_path, output_dir=out_dir, sheet=sheet)
# Read the spreadsheet and display its contents
data_df = cal_palm.read_and_display_data()

Unnamed: 0,subject,Age,Normalized_Percent_Cognitive_Improvement,Z_Scored_Percent_Cognitive_Improvement_By_Origin_Group,Z_Scored_Percent_Cognitive_Improvement,Percent_Cognitive_Improvement,Z_Scored_Subiculum_T_By_Origin_Group_,Z_Scored_Subiculum_Connectivity_T,Subiculum_Connectivity_T,Amnesia_Lesion_T_Map,...,Abs_Cognitive_Improve,Cognitive_Improve,Z_Scored_Cognitive_Baseline,Z_Scored_Cognitive_Baseline__Lower_is_Better_,Min_Max_Normalized_Baseline,MinMaxNormBaseline_Higher_is_Better,ROI_to_Alz_Max,ROI_to_PD_Max,Standardzied_AD_Max,Standardized_PD_Max
0,101,62.0,-0.392857,0.314066,0.314066,-21.428571,-1.282630,-1.282630,56.864683,0.447264,...,-6.0,No,1.518764,-1.518764,0.72,0.28,12.222658,14.493929,-1.714513,-1.227368
1,102,77.0,-0.666667,0.013999,0.013999,-36.363636,-1.760917,-1.760917,52.970984,0.436157,...,-8.0,No,0.465551,-0.465551,0.48,0.52,14.020048,15.257338,-1.155843,-1.022243
2,103,76.0,-1.447368,-0.841572,-0.841572,-78.947368,-0.595369,-0.595369,62.459631,0.497749,...,-15.0,No,-0.061056,0.061056,0.36,0.64,15.118727,17.376384,-0.814348,-0.452865
3,104,65.0,-2.372549,-1.855477,-1.855477,-129.411765,-0.945206,-0.945206,59.611631,0.432617,...,-22.0,No,-0.412127,0.412127,0.28,0.72,13.112424,15.287916,-1.437954,-1.014027
4,105,50.0,-0.192982,0.533109,0.533109,-10.526316,-1.151973,-1.151973,57.928350,0.193389,...,-2.0,No,-0.061056,0.061056,0.36,0.64,15.086568,12.951426,-0.824344,-1.641831
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
194,211,58.7,,,,,-0.415745,-0.189000,19.900000,,...,,Yes,,,,,,,,
195,152,69.4,,,,,-0.701419,-0.455000,17.900000,,...,,Yes,,,,,,,,
196,208,79.2,,,,,-0.929958,-0.669000,16.300000,,...,,Yes,,,,,,,,
197,223,71.1,,,,,-0.829972,-0.575000,17.000000,,...,,Yes,,,,,,,,


# 01 - Preprocess Your Data

**Handle NaNs**
- Set drop_nans=True if you would like to remove NaNs from the data
- Provide a column name or a list of column names to remove NaNs from
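The effect of dropping NaNs from selected columns can be sketched with plain pandas. The helper `drop_nans` and the toy data are hypothetical and only approximate what `drop_nans_from_columns` is expected to do: a row is removed when any of the listed columns is missing, while NaNs in other columns are left alone.

```python
import numpy as np
import pandas as pd

def drop_nans(df, columns):
    """Drop rows that have a NaN in any of the given columns."""
    if isinstance(columns, str):
        columns = [columns]  # accept a single column name as well as a list
    return df.dropna(subset=columns)

# Toy data for illustration only
example = pd.DataFrame({
    "Age": [62.0, np.nan, 50.0],
    "City": ["Wurzburg", "Other", None],
    "Score": [np.nan, 1.0, 2.0],
})

# Only the first row survives: its NaN is in "Score", which is not checked
cleaned = drop_nans(example, ["Age", "City"])
print(cleaned)
```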

In [37]:
data_df.columns

Index(['subject', 'Age', 'Normalized_Percent_Cognitive_Improvement',
       'Z_Scored_Percent_Cognitive_Improvement_By_Origin_Group',
       'Z_Scored_Percent_Cognitive_Improvement',
       'Percent_Cognitive_Improvement',
       'Z_Scored_Subiculum_T_By_Origin_Group_',
       'Z_Scored_Subiculum_Connectivity_T', 'Subiculum_Connectivity_T',
       'Amnesia_Lesion_T_Map', 'Memory_Network_T', 'Z_Scored_Memory_Network_R',
       'Memory_Network_R', 'Subiculum_Grey_Matter', 'Subiculum_White_Matter',
       'Subiculum_CSF', 'Subiculum_Total', 'Standardized_Age',
       'Standardized_Percent_Improvement',
       'Standardized_Subiculum_Connectivity',
       'Standardized_Subiculum_Grey_Matter',
       'Standardized_Subiculum_White_Matter', 'Standardized_Subiculum_CSF',
       'Standardized_Subiculum_Total', 'Disease', 'Cohort', 'City',
       'Inclusion_Cohort', 'Age_Group', 'Age_And_Disease',
       'Age_Disease_and_Cohort', 'Subiculum_Group_By_Z_Score_Sign',
       'Subiculum_Group_By_Infl

In [44]:
drop_list = ['City', 'StimMatch', 'Cognitive_Improve', 'Age']

In [45]:
data_df = cal_palm.drop_nans_from_columns(columns_to_drop_from=drop_list)
display(data_df)

Unnamed: 0,subject,Age,Normalized_Percent_Cognitive_Improvement,Z_Scored_Percent_Cognitive_Improvement_By_Origin_Group,Z_Scored_Percent_Cognitive_Improvement,Percent_Cognitive_Improvement,Z_Scored_Subiculum_T_By_Origin_Group_,Z_Scored_Subiculum_Connectivity_T,Subiculum_Connectivity_T,Amnesia_Lesion_T_Map,...,Abs_Cognitive_Improve,Cognitive_Improve,Z_Scored_Cognitive_Baseline,Z_Scored_Cognitive_Baseline__Lower_is_Better_,Min_Max_Normalized_Baseline,MinMaxNormBaseline_Higher_is_Better,ROI_to_Alz_Max,ROI_to_PD_Max,Standardzied_AD_Max,Standardized_PD_Max
46,1,57.0,-2.609929,-1.372562,-1.372562,-5.673759,1.080695,1.080695,30.376565,-0.113151,...,-8.0,0,-0.115295,-0.115295,0.625,0.625,22.0203,20.46784,1.056258,0.508516
47,2,50.0,0.992806,1.331414,1.331414,2.158273,-0.930548,-0.930548,16.29587,-0.502484,...,3.0,1,-0.935174,-0.935174,0.375,0.375,11.487188,4.94297,-0.759238,-1.629565
48,3,62.0,-0.638889,0.106772,0.106772,-1.388889,1.155469,1.155469,30.900051,-0.398033,...,-2.0,0,1.114522,1.114522,1.0,1.0,23.013479,22.145924,1.227443,0.739621
49,4,50.0,-0.985714,-0.153533,-0.153533,-2.142857,-0.228971,-0.228971,21.207602,-0.426115,...,-3.0,0,-0.525235,-0.525235,0.5,0.5,12.198485,18.933435,-0.636638,0.297198
50,6,60.0,-0.323944,0.343149,0.343149,-0.704225,0.109572,0.109572,23.577739,-0.454075,...,-1.0,0,0.294644,0.294644,0.75,0.75,17.634088,18.128314,0.300247,0.186317
51,7,73.0,-0.326241,0.341424,0.341424,-0.70922,1.977842,1.977842,36.657479,-0.177886,...,-1.0,0,-0.115295,-0.115295,0.625,0.625,24.162033,27.503198,1.425409,1.477423
52,9,64.0,-0.985714,-0.153533,-0.153533,-2.142857,-0.407778,-0.407778,19.955774,-0.494405,...,-3.0,0,-0.525235,-0.525235,0.5,0.5,10.782803,14.964053,-0.880646,-0.249464
53,11,62.0,-0.319444,0.346526,0.346526,-0.694444,-1.093332,-1.093332,15.15622,-0.507962,...,-1.0,0,1.114522,1.114522,1.0,1.0,9.653427,12.002916,-1.075306,-0.657271
54,12,54.0,0.0,0.58628,0.58628,0.0,0.788134,0.788134,28.328345,-0.220427,...,0.0,1,-2.164991,-2.164991,0.0,0.0,21.521001,21.243697,0.970198,0.615367
55,14,49.0,0.321678,0.82771,0.82771,0.699301,-0.45588,-0.45588,19.619016,-0.440579,...,1.0,1,0.704583,0.704583,0.875,0.875,10.881447,15.224677,-0.863644,-0.213571


**Drop Row Based on Value of Column**

Define the column, condition, and value for dropping rows:
- column = 'your_column_name'
- condition = 'above'  # Options: 'equal', 'above', 'below', 'not'
- value = 'your_value'  # The value each row is compared against
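One way such a filter could work is sketched below. The function `split_rows` is hypothetical, and its semantics are an assumption: rows for which the condition holds are dropped, and both the kept and the dropped rows are returned. The real `drop_rows_based_on_value` method may behave differently.

```python
import operator
import pandas as pd

# Map each condition name to a comparison (assumed semantics)
CONDITIONS = {
    "equal": operator.eq,
    "above": operator.gt,
    "below": operator.lt,
    "not": operator.ne,
}

def split_rows(df, column, condition, value):
    """Return (kept, dropped); rows where the condition holds are dropped."""
    mask = CONDITIONS[condition](df[column], value)
    return df[~mask], df[mask]

# Toy data for illustration only
example = pd.DataFrame({
    "City": ["Wurzburg", "Other", "Wurzburg"],
    "Age": [57.0, 71.0, 50.0],
})

# Under the assumed semantics, 'not' drops every row whose City differs from the value
kept, dropped = split_rows(example, "City", "not", "Wurzburg")
print(kept)
```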

In [40]:
data_df.columns

Index(['subject', 'Age', 'Normalized_Percent_Cognitive_Improvement',
       'Z_Scored_Percent_Cognitive_Improvement_By_Origin_Group',
       'Z_Scored_Percent_Cognitive_Improvement',
       'Percent_Cognitive_Improvement',
       'Z_Scored_Subiculum_T_By_Origin_Group_',
       'Z_Scored_Subiculum_Connectivity_T', 'Subiculum_Connectivity_T',
       'Amnesia_Lesion_T_Map', 'Memory_Network_T', 'Z_Scored_Memory_Network_R',
       'Memory_Network_R', 'Subiculum_Grey_Matter', 'Subiculum_White_Matter',
       'Subiculum_CSF', 'Subiculum_Total', 'Standardized_Age',
       'Standardized_Percent_Improvement',
       'Standardized_Subiculum_Connectivity',
       'Standardized_Subiculum_Grey_Matter',
       'Standardized_Subiculum_White_Matter', 'Standardized_Subiculum_CSF',
       'Standardized_Subiculum_Total', 'Disease', 'Cohort', 'City',
       'Inclusion_Cohort', 'Age_Group', 'Age_And_Disease',
       'Age_Disease_and_Cohort', 'Subiculum_Group_By_Z_Score_Sign',
       'Subiculum_Group_By_Infl

Set the parameters for dropping rows

In [41]:
column = 'City'  # The column you'd like to evaluate
condition = 'not'  # The condition to check ('equal', 'above', 'below', 'not')
value = 'Wurzburg' # The value to drop if found

In [42]:
data_df, other_df = cal_palm.drop_rows_based_on_value(column, condition, value)
display(data_df)

Unnamed: 0,subject,Age,Normalized_Percent_Cognitive_Improvement,Z_Scored_Percent_Cognitive_Improvement_By_Origin_Group,Z_Scored_Percent_Cognitive_Improvement,Percent_Cognitive_Improvement,Z_Scored_Subiculum_T_By_Origin_Group_,Z_Scored_Subiculum_Connectivity_T,Subiculum_Connectivity_T,Amnesia_Lesion_T_Map,...,Abs_Cognitive_Improve,Cognitive_Improve,Z_Scored_Cognitive_Baseline,Z_Scored_Cognitive_Baseline__Lower_is_Better_,Min_Max_Normalized_Baseline,MinMaxNormBaseline_Higher_is_Better,ROI_to_Alz_Max,ROI_to_PD_Max,Standardzied_AD_Max,Standardized_PD_Max
46,1,57.0,-2.609929,-1.372562,-1.372562,-5.673759,1.080695,1.080695,30.376565,-0.113151,...,-8.0,No,-0.115295,-0.115295,0.625,0.625,22.0203,20.46784,1.056258,0.508516
47,2,50.0,0.992806,1.331414,1.331414,2.158273,-0.930548,-0.930548,16.29587,-0.502484,...,3.0,Yes,-0.935174,-0.935174,0.375,0.375,11.487188,4.94297,-0.759238,-1.629565
48,3,62.0,-0.638889,0.106772,0.106772,-1.388889,1.155469,1.155469,30.900051,-0.398033,...,-2.0,No,1.114522,1.114522,1.0,1.0,23.013479,22.145924,1.227443,0.739621
49,4,50.0,-0.985714,-0.153533,-0.153533,-2.142857,-0.228971,-0.228971,21.207602,-0.426115,...,-3.0,No,-0.525235,-0.525235,0.5,0.5,12.198485,18.933435,-0.636638,0.297198
50,6,60.0,-0.323944,0.343149,0.343149,-0.704225,0.109572,0.109572,23.577739,-0.454075,...,-1.0,No,0.294644,0.294644,0.75,0.75,17.634088,18.128314,0.300247,0.186317
51,7,73.0,-0.326241,0.341424,0.341424,-0.70922,1.977842,1.977842,36.657479,-0.177886,...,-1.0,No,-0.115295,-0.115295,0.625,0.625,24.162033,27.503198,1.425409,1.477423
52,9,64.0,-0.985714,-0.153533,-0.153533,-2.142857,-0.407778,-0.407778,19.955774,-0.494405,...,-3.0,No,-0.525235,-0.525235,0.5,0.5,10.782803,14.964053,-0.880646,-0.249464
53,11,62.0,-0.319444,0.346526,0.346526,-0.694444,-1.093332,-1.093332,15.15622,-0.507962,...,-1.0,No,1.114522,1.114522,1.0,1.0,9.653427,12.002916,-1.075306,-0.657271
54,12,54.0,0.0,0.58628,0.58628,0.0,0.788134,0.788134,28.328345,-0.220427,...,0.0,Yes,-2.164991,-2.164991,0.0,0.0,21.521001,21.243697,0.970198,0.615367
55,14,49.0,0.321678,0.82771,0.82771,0.699301,-0.45588,-0.45588,19.619016,-0.440579,...,1.0,Yes,0.704583,0.704583,0.875,0.875,10.881447,15.224677,-0.863644,-0.213571


**Standardize Data**
- List any columns you do not want to standardize
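Standardizing here is taken to mean z-scoring: subtracting each column's mean and dividing by its standard deviation. The sketch below shows this under that assumption. The helper `standardize` is hypothetical, it uses the sample standard deviation (pandas default, `ddof=1`), and `standardize_columns` may differ in these details.

```python
import pandas as pd

def standardize(df, cols_not_to_standardize=None):
    """Z-score every numeric column except those listed to be skipped."""
    skip = set(cols_not_to_standardize or [])
    out = df.copy()  # leave the input DataFrame untouched
    for col in out.select_dtypes(include="number").columns:
        if col in skip:
            continue
        out[col] = (out[col] - out[col].mean()) / out[col].std()
    return out

# Toy data for illustration only
example = pd.DataFrame({
    "Age": [50.0, 60.0, 70.0],
    "Score": [1.0, 2.0, 3.0],
})

# "Age" is left as-is; "Score" is rescaled to mean 0 and unit standard deviation
z_scored = standardize(example, cols_not_to_standardize=["Age"])
print(z_scored)
```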

In [None]:
# Remove anything you don't want to standardize
cols_not_to_standardize = None # ['Z_Scored_Percent_Cognitive_Improvement_By_Origin_Group', 'Z_Scored_Subiculum_T_By_Origin_Group_'] #['Age']

In [None]:
data_df = cal_palm.standardize_columns(cols_not_to_standardize)
data_df

In [None]:
# for col in data_df.columns:
#     if 'CSF' and 'eh' not in col:
#         data_df[col] = data_df[col] * -1

Mendelian Randomization Analysis

In [46]:
import pandas as pd
import statsmodels.api as sm
from statsmodels.tools import add_constant
import patsy

class MendelianRandomizationAnalysis:
    def __init__(self, data, outcome_cols, exposure_col, iv_col, covariate_cols=None):
        """
        Initializes the class with the data and variable specifications.
        """
        self.data = data.copy()  # Work on a copy so label encoding does not modify the caller's DataFrame
        self.outcome_cols = outcome_cols
        self.exposure_col = exposure_col
        self.iv_col = iv_col
        self.covariate_cols = covariate_cols if covariate_cols else []

    def label_encode_variables(self):
        """
        Label encode categorical variables if needed.
        """
        categorical_columns = self.data.select_dtypes(include=['object', 'category']).columns
        for col in categorical_columns:
            self.data[col] = pd.factorize(self.data[col])[0]

    def run_regression(self, y_col, X_cols):
        """
        Runs a regression and returns the model.
        """
        X = add_constant(self.data[X_cols])  # Add a constant term for the intercept
        y = self.data[y_col]
        model = sm.OLS(y, X).fit()
        return model

    def first_stage_regression(self):
        """
        First-stage regression of the IV analysis: regress exposure on IV.
        """
        return self.run_regression(self.exposure_col, [self.iv_col] + self.covariate_cols)

    def second_stage_regression(self, outcome_col, predicted_exposure):
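        """
        Second-stage regression of the IV analysis: regress outcome on the predicted exposure.
        Note: the standard errors of this plain OLS fit ignore first-stage uncertainty,
        so they are not valid 2SLS standard errors.
        """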
        return self.run_regression(outcome_col, [predicted_exposure] + self.covariate_cols)

    def mendelian_randomization(self):
        """
        Executes the two-stage least squares Mendelian Randomization.
        """
        self.label_encode_variables()
        first_stage_model = self.first_stage_regression()
        self.data['predicted_exposure'] = first_stage_model.predict()

        results = {}
        for outcome in self.outcome_cols:
            second_stage_model = self.second_stage_regression(outcome, 'predicted_exposure')
            results[outcome] = second_stage_model

        return results
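
Before interpreting the second stage, it is common to check that the instrument actually predicts the exposure. The sketch below computes the first-stage F-statistic for the simplest case of one instrument and no covariates, using only NumPy. The function name and simulated data are assumptions for illustration. With the class above, the fitted statsmodels first-stage model should expose a comparable value as `fvalue`, which equals this statistic only when there are no covariates. An F-statistic above about 10 is a commonly cited rule of thumb for adequate instrument strength, not a guarantee.

```python
import numpy as np

def first_stage_f_statistic(instrument, exposure):
    """F-statistic for a simple regression of exposure on a single instrument."""
    instrument = np.asarray(instrument, dtype=float)
    exposure = np.asarray(exposure, dtype=float)
    n = instrument.size
    r_squared = np.corrcoef(instrument, exposure)[0, 1] ** 2
    # One regressor: F = (R^2 / 1) / ((1 - R^2) / (n - 2))
    return r_squared / (1.0 - r_squared) * (n - 2)

# Simulated data for illustration only
rng = np.random.default_rng(1)
z = rng.normal(size=500)
strong_f = first_stage_f_statistic(z, 1.0 * z + rng.normal(size=500))
weak_f = first_stage_f_statistic(z, 0.01 * z + rng.normal(size=500))

print(f"Strong instrument F: {strong_f:.1f}")
print(f"Weak instrument F:   {weak_f:.1f}")
```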

In [47]:
data_df.columns

Index(['subject', 'Age', 'Normalized_Percent_Cognitive_Improvement',
       'Z_Scored_Percent_Cognitive_Improvement_By_Origin_Group',
       'Z_Scored_Percent_Cognitive_Improvement',
       'Percent_Cognitive_Improvement',
       'Z_Scored_Subiculum_T_By_Origin_Group_',
       'Z_Scored_Subiculum_Connectivity_T', 'Subiculum_Connectivity_T',
       'Amnesia_Lesion_T_Map', 'Memory_Network_T', 'Z_Scored_Memory_Network_R',
       'Memory_Network_R', 'Subiculum_Grey_Matter', 'Subiculum_White_Matter',
       'Subiculum_CSF', 'Subiculum_Total', 'Standardized_Age',
       'Standardized_Percent_Improvement',
       'Standardized_Subiculum_Connectivity',
       'Standardized_Subiculum_Grey_Matter',
       'Standardized_Subiculum_White_Matter', 'Standardized_Subiculum_CSF',
       'Standardized_Subiculum_Total', 'Disease', 'Cohort', 'City',
       'Inclusion_Cohort', 'Age_Group', 'Age_And_Disease',
       'Age_Disease_and_Cohort', 'Subiculum_Group_By_Z_Score_Sign',
       'Subiculum_Group_By_Infl

In [48]:
# Usage example:
# data_df is the preprocessed pandas DataFrame from the cells above.
mr_analysis = MendelianRandomizationAnalysis(
    data=data_df,
    outcome_cols=['Percent_Cognitive_Improvement'],
    exposure_col='StimMatch',
    iv_col='Age',
    covariate_cols=None
)

mr_results = mr_analysis.mendelian_randomization()
for outcome, model in mr_results.items():
    print(f'Results for {outcome}:')
    print(model.summary())


Results for Percent_Cognitive_Improvement:
                                  OLS Regression Results                                 
Dep. Variable:     Percent_Cognitive_Improvement   R-squared:                       0.004
Model:                                       OLS   Adj. R-squared:                 -0.038
Method:                            Least Squares   F-statistic:                   0.09291
Date:                           Sat, 09 Mar 2024   Prob (F-statistic):              0.763
Time:                                   12:32:34   Log-Likelihood:                -62.426
No. Observations:                             26   AIC:                             128.9
Df Residuals:                                 24   BIC:                             131.4
Df Model:                                      1                                         
Covariance Type:                       nonrobust                                         
                         coef    std err          t      