# Run Instrumental Variable Analysis

### Authors: Calvin Howard.

- Similar to a mediation analysis, this removes the influence of confounds from an independent variable to attempt to isolate the causal effect upon the dependent variable. 
- For further information, Causal Inference by Scott Cunningham has an excellent chapter on IV analysis. 


# 00 - Import CSV with All Data
**The CSV is expected to be in this format**
- ID and absolute paths to niftis are critical
```
+-----+----------------------------+--------------+--------------+--------------+
| ID  | Nifti_File_Path            | Covariate_1  | Covariate_2  | Covariate_3  |
+-----+----------------------------+--------------+--------------+--------------+
| 1   | /path/to/file1.nii.gz      | 0.5          | 1.2          | 3.4          |
| 2   | /path/to/file2.nii.gz      | 0.7          | 1.4          | 3.1          |
| 3   | /path/to/file3.nii.gz      | 0.6          | 1.5          | 3.5          |
| 4   | /path/to/file4.nii.gz      | 0.9          | 1.1          | 3.2          |
| ... | ...                        | ...          | ...          | ...          |
+-----+----------------------------+--------------+--------------+--------------+
```

Prep Output Direction

In [None]:
# Specify where you want to save your results to
out_dir = '/Users/cu135/Library/CloudStorage/OneDrive-Personal/OneDrive_Documents/Research/2023/subiculum_cognition_and_age/figures/Figures/joint_distribution_calculus/validation_cohort'

Import Data

In [None]:
# Specify the path to your CSV file containing NIFTI paths
input_csv_path = '/Users/cu135/Partners HealthCare Dropbox/Calvin Howard/studies/cognition_2023/metadata/master_list_proper_subjects.xlsx'
# Specify sheet name as a string if using Excel, otherwise set to None 
sheet = 'master_list_proper_subjects'

In [None]:
from calvin_utils.permutation_analysis_utils.statsmodels_palm import CalvinStatsmodelsPalm
# Instantiate the PalmPrepararation class
cal_palm = CalvinStatsmodelsPalm(input_csv_path=input_csv_path, output_dir=out_dir, sheet=sheet)
# Call the process_nifti_paths method
data_df = cal_palm.read_and_display_data()


# 01 - Preprocess Your Data

**Handle NANs**
- Set drop_nans=True is you would like to remove NaNs from data
- Provide a column name or a list of column names to remove NaNs from

In [None]:
data_df.columns

Enter names of columns you'd like to drop nans from

In [None]:
drop_list = ['Age', 'Z_Scored_Percent_Cognitive_Improvement']

In [None]:
data_df = cal_palm.drop_nans_from_columns(columns_to_drop_from=drop_list)
data_df

**Drop Row Based on Value of Column**

Define the column, condition, and value for dropping rows
- column = 'your_column_name'
- condition = 'above'  # Options: 'equal', 'above', 'below'

In [None]:
data_df.columns

Set the parameters for dropping rows

In [34]:
column = 'City'  # The column you'd like to evaluate
condition = 'not'  # The condition to check ('equal', 'above', 'below', 'not')
value = 'Toronto' # The value to drop if T

In [35]:
data_df, other_df = cal_palm.drop_rows_based_on_value(column, condition, value)
display(data_df)

Unnamed: 0,subject,Age,Normalized_Percent_Cognitive_Improvement,Z_Scored_Percent_Cognitive_Improvement_By_Origin_Group,Z_Scored_Percent_Cognitive_Improvement,Percent_Cognitive_Improvement,Z_Scored_Subiculum_T_By_Origin_Group_,Z_Scored_Subiculum_Connectivity_T,Subiculum_Connectivity_T_Redone,Subiculum_Connectivity_T,...,DECLINE,Cognitive_Improve,Z_Scored_Cognitive_Baseline,Z_Scored_Cognitive_Baseline__Lower_is_Better_,Min_Max_Normalized_Baseline,MinMaxNormBaseline_Higher_is_Better,ROI_to_Alz_Max,ROI_to_PD_Max,Standardzied_AD_Max,Standardized_PD_Max
0,101,62.0,-0.392857,0.314066,0.314066,-21.428571,-1.28263,-1.28263,21.150595,56.864683,...,1.0,No,1.518764,-1.518764,0.72,0.28,12.222658,14.493929,-1.714513,-1.227368
1,102,77.0,-0.666667,0.013999,0.013999,-36.363636,-1.760917,-1.760917,19.702349,52.970984,...,1.0,No,0.465551,-0.465551,0.48,0.52,14.020048,15.257338,-1.155843,-1.022243
2,103,76.0,-1.447368,-0.841572,-0.841572,-78.947368,-0.595369,-0.595369,23.231614,62.459631,...,1.0,No,-0.061056,0.061056,0.36,0.64,15.118727,17.376384,-0.814348,-0.452865
3,104,65.0,-2.372549,-1.855477,-1.855477,-129.411765,-0.945206,-0.945206,22.172312,59.611631,...,1.0,No,-0.412127,0.412127,0.28,0.72,13.112424,15.287916,-1.437954,-1.014027
4,105,50.0,-0.192982,0.533109,0.533109,-10.526316,-1.151973,-1.151973,21.546222,57.92835,...,0.0,No,-0.061056,0.061056,0.36,0.64,15.086568,12.951426,-0.824344,-1.641831
5,106,66.0,-0.705128,-0.028151,-0.028151,-38.461538,-0.489205,-0.489205,23.553077,63.323903,...,1.0,No,-1.114269,1.114269,0.12,0.88,15.816634,17.617107,-0.597423,-0.388183
6,107,64.0,-0.282051,0.435498,0.435498,-15.384615,-1.718309,-1.718309,19.831365,53.317851,...,0.0,No,-1.114269,1.114269,0.12,0.88,15.524025,13.452311,-0.688373,-1.507246
7,108,60.0,-0.534722,0.158596,0.158596,-29.166667,-1.145694,-1.145694,21.565235,57.979468,...,1.0,No,0.816622,-0.816622,0.56,0.44,16.546984,13.932696,-0.370413,-1.378169
8,109,72.0,-0.557971,0.133118,0.133118,-30.434783,-0.043697,-0.043697,24.902068,66.950749,...,1.0,No,0.641086,-0.641086,0.52,0.48,19.669539,21.341523,0.600149,0.612551
9,110,72.0,-1.551282,-0.955451,-0.955451,-84.615385,0.240855,0.240855,25.763689,69.267271,...,1.0,No,-1.114269,1.114269,0.12,0.88,18.295718,19.263977,0.173133,0.054323


**Standardize Data**
- Enter Columns you Don't want to standardize into a list

In [None]:
# Remove anything you don't want to standardize
cols_not_to_standardize = ['Ordinal_Target_Type', 'Ordinal_Epilepsy_Type']

In [None]:
data_df = cal_palm.standardize_columns(cols_not_to_standardize)
data_df

In [None]:
data_df.columns

Regress out Covariate

In [None]:
from calvin_utils.statistical_utils.regression_utils import RegressOutCovariates
# use this code block to regress out covariates. Generally better to just include as covariates in a model..
dependent_variable_list = ['dependent_variable_column']
regressors = ['Age', 'Sex']

data_df, adjusted_dep_vars_list = RegressOutCovariates.run(df=data_df, dependent_variable_list=dependent_variable_list, covariates_list=regressors)
print(adjusted_dep_vars_list)

# 01B - Import Directly from a CSV

In [None]:
import pandas as pd
data_df = pd.read_csv('path/to/your/csv.csv')

# 02 - Evaluate Instrumental Variable
- There are 3 assumptions to an IV analysis. This will go over them. 

In [40]:
import statsmodels.api as sm
import patsy
from scipy.stats import pearsonr

def relevance_assumption(data_df, iv, second_var):
    formula = f'{second_var} ~ {iv}'
    y, X = patsy.dmatrices(formula, data=data_df, return_type='dataframe')
    first_stage = sm.OLS(y, X).fit()
    f_stat = first_stage.fvalue
    print(f"Relevance Assumption F-Statistic: {f_stat}")
    if f_stat > 10:
        print("The instrument is acceptable (F-statistic > 10).")
    else:
        print("The instrument is weak (F-statistic ≤ 10). Suggest not using this instrument")
    return f_stat

def exogeneity_assumption(data_df, iv, second_var, dependent_var):
    formula = f'{dependent_var} ~ {second_var}'
    y, X = patsy.dmatrices(formula, data=data_df, return_type='dataframe')
    ols_reg = sm.OLS(y, X).fit()
    residuals = ols_reg.resid
    r_value, p_value = pearsonr(data_df[iv], residuals)
    print(f"Exogeneity Assumption: Pearson correlation between residuals and IV: R = {r_value}, P = {p_value}")
    if p_value > 0.05:
        print("Exogeneity assumption has been met (P > 0.05).")
    else:
        print("Exogeneity assumption failed (P ≤ 0.05).")
    return r_value, p_value

def exclusion_assumption():
    print("Reminder: The exclusion assumption states that the instrumental variable should not be influenced by confounders.")
    print("This assumption must be proven through theory and domain knowledge, not through statistical tests.")
    
def check_iv_assumptions(data_df, iv, second_var, dependent_var):
    print("Checking Relevance Assumption:")
    relevance_assumption(data_df, iv, second_var)
    
    print("\nChecking Exogeneity Assumption:")
    exogeneity_assumption(data_df, iv, second_var, dependent_var)
    
    print("\nChecking Exclusion Assumption:")
    exclusion_assumption()

In [41]:
data_df.columns

Index(['subject', 'Age', 'Normalized_Percent_Cognitive_Improvement',
       'Z_Scored_Percent_Cognitive_Improvement_By_Origin_Group',
       'Z_Scored_Percent_Cognitive_Improvement',
       'Percent_Cognitive_Improvement',
       'Z_Scored_Subiculum_T_By_Origin_Group_',
       'Z_Scored_Subiculum_Connectivity_T', 'Subiculum_Connectivity_T_Redone',
       'Subiculum_Connectivity_T', 'Amnesia_Lesion_T_Map', 'Memory_Network_T',
       'Z_Scored_Memory_Network_R', 'Memory_Network_R',
       'Subiculum_Grey_Matter', 'Subiculum_White_Matter', 'Subiculum_CSF',
       'Subiculum_Total', 'Standardized_Age',
       'Standardized_Percent_Improvement',
       'Standardized_Subiculum_Connectivity',
       'Standardized_Subiculum_Grey_Matter',
       'Standardized_Subiculum_White_Matter', 'Standardized_Subiculum_CSF',
       'Standardized_Subiculum_Total', 'Disease', 'Cohort', 'City',
       'Inclusion_Cohort', 'Categorical_Age_Group', 'Age_Group',
       'Age_And_Disease', 'Age_Disease_and_Cohort',

In [83]:
instrumental_variable = 'Age'
independent_variable = 'Subiculum_Grey_Matter'
dependent_var = 'Standardized_Percent_Improvement'

In [84]:
check_iv_assumptions(data_df, iv=instrumental_variable, second_var=independent_variable, dependent_var=dependent_var)

Checking Relevance Assumption:
Relevance Assumption F-Statistic: 3.5279868358107076
The instrument is weak (F-statistic ≤ 10). Suggest not using this instrument

Checking Exogeneity Assumption:
Exogeneity Assumption: Pearson correlation between residuals and IV: R = 0.0769097259238258, P = 0.611435651758564
Exogeneity assumption has been met (P > 0.05).

Checking Exclusion Assumption:
Reminder: The exclusion assumption states that the instrumental variable should not be influenced by confounders.
This assumption must be proven through theory and domain knowledge, not through statistical tests.


# 03 - Run IV analysis

In [85]:
import pandas as pd
import statsmodels.api as sm
import patsy

def iv_analysis(data_df, iv, second_var, dependent_var):
    # 1. OLS regression using patsy formula
    ols_formula = f'{dependent_var} ~ {second_var}'
    y_ols, X_ols = patsy.dmatrices(ols_formula, data=data_df, return_type='dataframe')
    ols_model = sm.OLS(y_ols, X_ols).fit()
    
    # 2. First stage regression (IV as the predictor for second_var)
    first_stage_formula = f'{second_var} ~ {iv}'
    y_first_stage, X_first_stage = patsy.dmatrices(first_stage_formula, data=data_df, return_type='dataframe')
    first_stage_model = sm.OLS(y_first_stage, X_first_stage).fit()
    data_df[second_var] = first_stage_model.resid
    
    # 3. IV (2SLS) regression using predicted values from first stage
    iv_formula = f'{dependent_var} ~ {second_var}'
    y_iv, X_iv = patsy.dmatrices(iv_formula, data=data_df, return_type='dataframe')
    iv_model = sm.OLS(y_iv, X_iv).fit()
    
    # 4. Create DataFrame for the results
    results_df = pd.DataFrame({
        'Coefficient (IV)': iv_model.params,
        'Standard Error (IV)': iv_model.bse,
        'P-Value (IV)': iv_model.pvalues,
        'Coefficient (OLS)': ols_model.params,
        'Standard Error (OLS)': ols_model.bse,
        'P-Value (OLS)': ols_model.pvalues
    })
    
    # Formatting the coefficients and standard errors
    results_df['Coefficient (IV)'] = results_df.apply(
        lambda row: f"{row['Coefficient (IV)']:.4f} ({row['Standard Error (IV)']:.4f})", axis=1
    )
    results_df['Coefficient (OLS)'] = results_df.apply(
        lambda row: f"{row['Coefficient (OLS)']:.4f} ({row['Standard Error (OLS)']:.4f})", axis=1
    )
    
    # Reordering the columns for readability
    results_df = results_df[['Coefficient (IV)', 'P-Value (IV)', 'Coefficient (OLS)', 'P-Value (OLS)']]
    
    # Explanation of the comparison
    print("--Comparison between OLS and 2SLS coefficients--")
    print("If the OLS is unbiased, the coefficient from the IV analysis should be different from the OLS analysis. This coefficient is the 'causal effect' of the Dep var on the Indep var")
    return results_df



In [86]:
iv_analysis(data_df, iv=instrumental_variable, second_var=independent_variable, dependent_var=dependent_var)

--Comparison between OLS and 2SLS coefficients--
If the OLS is unbiased, the coefficient from the IV analysis should be different from the OLS analysis. This coefficient is the 'causal effect' of the Dep var on the Indep var


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  data_df[second_var] = first_stage_model.resid


Unnamed: 0,Coefficient (IV),P-Value (IV),Coefficient (OLS),P-Value (OLS)
Intercept,-0.0306 (0.1515),0.840704,-0.0030 (0.1732),0.986302
Subiculum_Grey_Matter,0.0608 (0.1313),0.645339,0.0419 (0.1265),0.74226
