# Mediation, Moderation, and Conditional Process Analyses

Mediation and moderation are two distinct concepts in statistics that help in understanding the relationships among three or more variables. Mediation deals with explaining the underlying process through which an independent variable (X) influences a dependent variable (Y). Moderation, on the other hand, addresses how the relationship between X and Y changes based on the level of a third variable, known as the moderator (W).

**Mediation Analysis**:

In mediation analysis, we are interested in whether the relationship between X and Y is explained (or mediated) through another variable, called the mediator (M).

- **Path a**: The independent variable (X) affects the mediator (M). This is the "a" path.
- **Path b**: The mediator (M) affects the dependent variable (Y). This is the "b" path.
- **Path c'**: There might also be a direct effect of X on Y that doesn't go through the mediator. This is the "c'" path or direct effect.

```
X ---(a)---> M ---(b)---> Y
 \                        /
  ---------(c')----------
```

The indirect effect is a*b, which is the portion of the effect of X on Y that is transmitted through the mediator M. The total effect of X on Y is the sum of the indirect effect and the direct effect (c').

**Moderated Mediation Analysis**:

Moderated mediation incorporates the idea that the mediation process itself can be contingent upon the levels of another variable (W), known as the moderator. In other words, the indirect effect of X on Y through M may change depending on the level of W.

```
X ---(a)---> M ---(b)---> Y
|            | 
|------(W)---|
 \                        /
  ---------(c')----------
```

**Interpreting the Results**:

In the results of a moderated mediation analysis, you'll encounter several terms:

- ACME (control/treated): Average Causal Mediation Effects. This represents the average effect of the independent variable on the dependent variable through the mediator.
- ADE (control/treated): Average Direct Effects. This represents the average direct effect of the independent variable on the dependent variable not through the mediator.
- Total effect: Sum of ACME and ADE, representing the total effect of the independent variable on the dependent variable.
- Prop. mediated (control/treated): Proportion of the total effect that is mediated by the mediator.
- The 'control' and 'treated' distinction in ACME and ADE accounts for possible treatment or intervention that could affect the mediation. If there's no treatment/intervention involved, they will be the same.
- Lower CI bound and Upper CI bound represent the lower and upper bounds of the confidence interval for the estimates.
- P-value: Indicates the statistical significance of the estimates. A lower p-value (<0.05) typically indicates a statistically significant effect.

- Control: baseline regression without effect of the mediator
- Treatment: mediated regression wherein the mediator is accounted for
 

**Identifying Partial Mediations**:

In partial mediation, the mediator explains some, but not all, of the relationship between the independent variable (exposure) and the dependent variable (outcome). Here's how you can identify partial mediation:

Significant Direct Effect (ADE): In partial mediation, the direct effect of the independent variable on the dependent variable remains significant even after accounting for the mediator. This means that there is still a direct path from the independent variable to the dependent variable that is not explained by the mediator.

Significant Indirect Effect (ACME): The indirect effect through the mediator must also be significant. This means that the mediator is explaining part of the relationship between the independent variable and the dependent variable.

Total Effect > Direct Effect: The total effect of the independent variable on the dependent variable (before the mediator is added to the model) is generally greater than the direct effect (after accounting for the mediator). This shows that the mediator is explaining some portion of the effect.

In the results you usually get from mediation analysis, check for the following:

ADE (average direct effect) should be significant.
ACME (average causal mediation effect) should be significant.
The total effect should be greater than the direct effect, indicating that some portion of the effect is being accounted for by the mediator.
In summary, partial mediation is present when both the direct and indirect paths are significant, but the inclusion of the mediator in the model reduces the direct effect of the independent variable on the dependent variable (without rendering it non-significant).

**Identifying a Mediated Exposure**:

If the exposure was being mediated, you would expect to see:

A significant ACME (Average Causal Mediation Effect) - This indicates that the mediator is having a significant effect in the relationship between the exposure and the outcome.

A significant proportion mediated - This tells you the proportion of the total effect that is mediated. If this is significant, it implies that a noteworthy part of the relationship between the exposure and the outcome is occurring through the mediator.

Also, keep in mind that the proportion mediated can be positive or negative, and this would indicate whether the mediator is increasing or decreasing the effect of the exposure on the outcome.

In [2]:
#Imports
import os
import glob as glob
from pathlib import Path

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, mean_absolute_error

import numpy as np
np.set_printoptions(precision=3, suppress=True)
import pandas as pd

from matplotlib import pyplot as plt
import seaborn as sns
#Calculate Correlation
from scipy.stats import pearsonr
import os
import glob
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
import platform

In [50]:
analysis = 'mediated_moderation_analysis'
data_path = r'/Users/cu135/Dropbox (Partners HealthCare)/resources/datasets/BIDS_AD_DBS_FORNIX/study_metadata/derivative_metadata/quantitative_atrophy/grey_matter_damage_score_and_outcomes/grey_matter_damage_score_and_outcomes.csv'
out_dir = os.path.join(data_path.split('.')[0], f'{analysis}')

save = True
if os.path.exists(out_dir):
    pass
else:
    os.makedirs(out_dir)

In [51]:
#----------------------------------------------------------------user input above----------------------------------------------------------------
from calvin_utils.dataframe_utilities import remove_column_spaces, add_prefix_to_numeric_cols, replace_hyphens
data_df = pd.read_csv(data_path)
data_df = remove_column_spaces(data_df.reset_index(drop=True))
data_df = add_prefix_to_numeric_cols(data_df)
data_df = replace_hyphens(data_df)

# 
data_df.head(1)

Unnamed: 0,"Patient_#_CDR,_ADAS",Age,%_Change_from_baseline_(ADAS_Cog11),Subiculum_Connectivity,Subiculum_Damage_Score,Hippocampus_Damage_Score,Temporal_Damage_Score,Frontal_Damage_Score,Parietal_Damage_Score,Cerebellum_Damage_Score,Insula_Damage_Score,Occipital_Damage_Score
0,101,62,-21.428571,56.864683,-22.341183,-19032.58951,-6811.832831,-3390.513012,-6514.359274,-1733.125937,-133.189922,-4146.07423


Data Cleaning

In [52]:
#Select specific subgroup
# outlier_index = (data_df['percent_change_adascog11'] <= -50)
# data_df = data_df.loc[outlier_index, :]

#Remove outlier
outlier_index=[11, 47, 48, 49]
data_df = data_df.drop(index=outlier_index)
data_df.reset_index(drop=True, inplace=True)

data_df.head(2)

Unnamed: 0,"Patient_#_CDR,_ADAS",Age,%_Change_from_baseline_(ADAS_Cog11),Subiculum_Connectivity,Subiculum_Damage_Score,Hippocampus_Damage_Score,Temporal_Damage_Score,Frontal_Damage_Score,Parietal_Damage_Score,Cerebellum_Damage_Score,Insula_Damage_Score,Occipital_Damage_Score
0,101,62,-21.428571,56.864683,-22.341183,-19032.58951,-6811.832831,-3390.513012,-6514.359274,-1733.125937,-133.189922,-4146.07423
1,102,77,-36.363636,52.970984,-40.309051,-184720.0251,-12864.56597,-6136.065565,-4548.582996,-5422.573735,-622.591915,-3965.251674


In [53]:

#Rename the outcome variable
outcome_variable = data_df.pop('%_Change_from_baseline_(ADAS_Cog11)')
data_df['outcome'] = outcome_variable

In [54]:
#Standardize the data
preserved_df = data_df.copy()
from sklearn.preprocessing import StandardScaler


# Remove anything you don't want to standardize
cols_not_to_standardize = ['Patient_#_CDR,_ADAS', 'Disease']
# Select the columns to be standardized
if cols_not_to_standardize[0] is not None:
    cols_to_standardize = [col for col in data_df.columns if col not in cols_not_to_standardize]
else:
    print('Will not standardize')
# Standardize
scaler = StandardScaler()
data_df[cols_to_standardize] = scaler.fit_transform(data_df[cols_to_standardize])




#Drop NAN
# data_df = data_df.dropna()
data_df

Unnamed: 0,"Patient_#_CDR,_ADAS",Age,Subiculum_Connectivity,Subiculum_Damage_Score,Hippocampus_Damage_Score,Temporal_Damage_Score,Frontal_Damage_Score,Parietal_Damage_Score,Cerebellum_Damage_Score,Insula_Damage_Score,Occipital_Damage_Score,outcome
0,101,-0.61812,-1.296803,0.653972,0.753393,0.32856,0.792594,-0.171774,0.700376,1.052696,-0.145209,0.317537
1,102,1.277448,-1.780375,0.232903,-1.704389,-1.027169,0.203411,0.344164,-1.01024,-0.904186,-0.071388,0.014153
2,103,1.151077,-0.601947,0.713868,0.362441,0.377884,-0.212724,0.370698,-0.16846,-0.913512,0.521784,-0.850871
3,104,-0.239006,-0.95565,0.185493,0.13869,-0.766276,0.180233,0.170034,0.007509,-0.695748,0.349383,-1.87598
4,105,-2.134575,-1.164703,1.12746,0.947191,0.810029,0.140502,0.346378,0.626054,0.388695,0.702233,0.539
5,106,-0.112635,-0.494611,0.562663,0.59855,1.298898,1.064987,0.79222,0.74766,1.104923,0.640275,-0.028462
6,107,-0.365378,-1.737296,0.154203,-0.71766,-0.754325,0.008127,-0.664721,0.409022,-0.873005,-0.968922,0.440311
7,108,-0.870863,-1.158354,-0.287968,0.04263,-0.593088,0.548781,0.020373,-0.267579,0.53064,0.103376,0.160349
8,109,0.645592,-0.04418,0.658735,-0.080832,-0.666064,0.395602,0.302561,0.960952,0.539807,0.066044,0.134589
9,110,0.645592,0.243516,1.024138,0.436538,0.261359,-0.546263,0.192885,0.739912,-2.555798,0.584495,-0.966009


In [7]:
# One-hot encode as needed
# data_df['interaction'] = data_df['Subiculum_Grey_Matter']*data_df['Subiculum_Connectivity']
data_df['Disease'] = np.where(data_df['Disease'] == 'Alzheimer', 1, 0)
# data_df['Age'] = np.where(data_df['Age'] <= 65, 0, 1)

data_df

KeyError: 'Disease'

Select Specific Rows

In [None]:
#Drop a specific set of rows
# data_df = data_df[data_df['Age'] > 65]
data_df = data_df[data_df['Disease'] == 1]
data_df

KeyError: 'Disease'

Select Specific Columns

In [8]:
print(data_df.columns)


Index(['subject_id', 'Age', 'Subiculum_Connectivity', 'Subiculum',
       'Hippocampus', 'Cerebellum', 'Frontal', 'Insula', 'Parietal',
       'Temporal', 'Occipital', 'outcome'],
      dtype='object')


In [9]:
# Tailor the Dataframe
# data_df = data_df.loc[:, ['Age', 'Mesial_Temporal_Grade', 'Frontal', 'Temporal',
#        'Parietal', 'Occipital', 'Cerebellar', 'Subiculum_Connectivity',
#        'Hippocampal_CSF', 'Hippocampal_Grey_Matter',
#        'Hippocampal_White_Matter', 'outcome']]
data_df

Unnamed: 0,subject_id,Age,Subiculum_Connectivity,Subiculum,Hippocampus,Cerebellum,Frontal,Insula,Parietal,Temporal,Occipital,outcome
0,-1.668629,-0.61812,-1.296803,1.965037,1.065597,1.47325,1.319769,1.966055,0.391704,0.682651,-0.134016,0.317537
1,-1.600823,1.277448,-1.780375,-0.07334,-1.308898,-1.410433,-0.380956,-1.285416,-0.130805,-1.179813,-0.71699,0.014153
2,-1.533016,1.151077,-0.601947,0.791373,-0.086283,-0.720202,-0.208736,-0.293358,0.736211,0.353463,0.973817,-0.850871
3,-1.46521,-0.239006,-0.95565,0.324392,-0.086953,-0.289663,0.045841,-0.871558,-0.203976,-0.883183,0.067608,-1.87598
4,-1.397403,-2.134575,-1.164703,0.487747,1.149742,0.280917,-0.520442,-0.059779,-0.427872,0.511637,0.773541,0.539
5,-1.329597,-0.112635,-0.494611,1.618798,0.126147,1.301574,2.052424,1.234574,1.165726,2.030365,0.588945,-0.028462
6,-1.26179,-0.365378,-1.737296,-1.209717,-0.979023,-0.106361,-0.745628,-0.899863,-1.293667,-0.787715,-1.389839,0.440311
7,-1.193984,-0.870863,-1.158354,0.308412,-0.120924,-0.890921,1.255638,1.537913,0.284082,-0.692759,-0.083297,0.160349
8,-1.126177,0.645592,-0.04418,0.339485,-0.273011,1.009738,1.22981,0.966419,0.64041,-0.576135,-0.083443,0.134589
9,-1.058371,0.645592,0.243516,1.635665,0.517817,0.773014,-0.264691,-1.480153,0.41046,0.533005,1.452487,-0.966009


Single Variable Mediation Analysis

Model Shape:
```
 Age ---(an)---> Atrophy Region n ---(bn)---> Outcome
  \                                          /
   -------------------(c')-------------------
```

Description
- This uses partial regressions to essentially (explain in a reductive manner) to what degree an independent variable's relationship with a dependent variable can be replaced by a 'mediator variable'. 
- Said another way, how much is the effect of x upon y mediated by z?

In [70]:
import pandas as pd
import statsmodels.api as sm
from statsmodels.genmod.generalized_linear_model import GLM
from statsmodels.genmod.families import Gaussian
from statsmodels.stats.mediation import Mediation

def perform_mediation_analysis(dataframe, exposure, mediator, dependent_variable):
    # Step 1: Fit the mediator model
    # Mediator ~ Independent Variable
    mediator_model = GLM.from_formula(f"{mediator} ~ {exposure}", data=dataframe, family=Gaussian())
    
    # Step 2: Fit the outcome model
    # Dependent Variable ~ Independent Variable + Mediator
    outcome_model = GLM.from_formula(f"{dependent_variable} ~ {exposure} + {mediator}", data=dataframe, family=Gaussian())
    
    # Step 3: Perform mediation analysis
    med = Mediation(outcome_model, mediator_model, exposure=exposure, mediator=mediator)
    try:
        med_result = med.fit()
        # Step 4: Print the results
        print(f'Mediation of {exposure} by {mediator}:')
        print(med_result.summary())
    except:
        print(f"Mediation failed, perfect separation detected. Aborting assessment of {mediator}.")

# Example usage:
# Assuming your DataFrame is named df, and the columns of interest are 'age', 'brain_atrophy', and 'dependent_variable'
# perform_mediation_analysis(data_df, 
                        #    exposure = 'age', 
                        #    mediator = 'brain_atrophy', 
                        #    dependent_variable='dependent_variable')


In [71]:
data_df.columns

Index(['subject_id', 'Age', 'Subiculum_Connectivity', 'Subiculum',
       'Hippocampus', 'Cerebellum', 'Frontal', 'Insula', 'Parietal',
       'Temporal', 'Occipital', 'outcome'],
      dtype='object')

In [72]:
perform_mediation_analysis(data_df, 
                           exposure = 'Age', 
                           mediator = 'Hippocampus', 
                           dependent_variable='outcome')

Mediation of Age by Hippocampus:
                          Estimate  Lower CI bound  Upper CI bound  P-value
ACME (control)            0.045023       -0.086831        0.229144    0.526
ACME (treated)            0.045023       -0.086831        0.229144    0.526
ADE (control)             0.014344       -0.321610        0.347742    0.944
ADE (treated)             0.014344       -0.321610        0.347742    0.944
Total effect              0.059368       -0.246481        0.379931    0.730
Prop. mediated (control)  0.063145       -6.434341        6.164112    0.860
Prop. mediated (treated)  0.063145       -6.434341        6.164112    0.860
ACME (average)            0.045023       -0.086831        0.229144    0.526
ADE (average)             0.014344       -0.321610        0.347742    0.944
Prop. mediated (average)  0.063145       -6.434341        6.164112    0.860


Run the above code on an entire spreadsheet

In [73]:
[perform_mediation_analysis(data_df, exposure='Age', mediator=f'{col}', dependent_variable='outcome') for col in data_df.columns]

Mediation of Age by subject_id:
                          Estimate  Lower CI bound  Upper CI bound  P-value
ACME (control)            0.000753       -0.099939        0.108028    0.976
ACME (treated)            0.000753       -0.099939        0.108028    0.976
ADE (control)             0.062537       -0.249145        0.376276    0.678
ADE (treated)             0.062537       -0.249145        0.376276    0.678
Total effect              0.063290       -0.251904        0.389449    0.678
Prop. mediated (control)  0.018449       -2.513944        2.850845    0.846
Prop. mediated (treated)  0.018449       -2.513944        2.850845    0.846
ACME (average)            0.000753       -0.099939        0.108028    0.976
ADE (average)             0.062537       -0.249145        0.376276    0.678
Prop. mediated (average)  0.018449       -2.513944        2.850845    0.846
Mediation failed, perfect separation detected. Aborting assessment of Age.
Mediation of Age by Subiculum_Connectivity:
             

__________
**Multivariate Mediation Analysis (Independent)**

**NOTE**
This is not a proper 'multivariate mediation analysis'.
It does something slightly different: it performs individual mediation analyses for each mediator, while controlling for the exposure. This is different from assessing the conglomerate mediation effect of all mediators acting in unison.


To investigate whether the effect of an independent variable on an outcome variable is fully mediated through multiple variables (e.g., atrophy in different brain regions), you would want to conduct a multiple mediator analysis.

In a multiple mediator analysis, you examine the indirect effects of an independent variable (in this case, age) on a dependent variable through several mediators simultaneously. This allows you to estimate the portion of the effect of the independent variable that is transmitted through each mediator and whether, when taken together, these mediators account for the entirety of the effect.

In your case, the different atrophy regions will serve as multiple mediators. Here’s how this might look diagrammatically:

```
 Age ---(a1)---> Atrophy Region 1 ---(b1)---> Outcome
  |           |
 Age ---(a2)---> Atrophy Region 2 ---(b2)---> Outcome
  |           |
   ...       ...
 Age ---(an)---> Atrophy Region n ---(bn)---> Outcome
  \                                          /
   -------------------(c')-------------------
```

In this example, each atrophy region (1 through n) is a mediator. The indirect effect of age on the outcome through each atrophy region is a_i * b_i, and the total indirect effect is the sum of all these individual indirect effects. If this total indirect effect is significant and accounts for most of the effect of age on the outcome, and the direct effect (c') is non-significant, it indicates full mediation.

You can perform multiple mediator analysis using the same statistical tools you use for single mediator analysis, but you will need to include multiple mediators in your model. This can be done using statistical software such as R, SAS, or Python's statsmodels. I have implemented this below using statsmodels.

Here's how the mediation model is constructed in conceptual terms:

1. Fit a model for each mediator (Atrophy Region 1, Atrophy Region 2, ..., Atrophy Region n) as a function of the independent variable, Age.
2. Fit a model for the outcome as a function of Age and all atrophy regions.
3. Estimate the indirect effect of Age on the outcome through each atrophy region and sum them to get the total indirect effect.
4. Compare the total indirect effect to the total effect of Age on the outcome to determine the proportion mediated.

By performing this analysis, you can ascertain whether the combined atrophy in different regions fully mediates the effect of age on the outcome variable. If you also have a moderation effect to consider (such as subiculum connectivity), this would be a moderated multiple mediation analysis. In this case, the indirect effects and the direct effect may change at different levels of the moderator.

I have generated code which will run mutliple mediation analyses independently, but will therefor not be able to tell you how they sum to contribute to the proportion of the independent variable mediated by them. this is perform_individual_multiple_mediator_analysis. Python does not have a combined multiple mediator analysis supported, so I have developed one. It is perform_combined_multiple_mediator_analysis. Please use with caution. 

In [10]:
import pandas as pd
import statsmodels.api as sm
from statsmodels.genmod.generalized_linear_model import GLM
from statsmodels.genmod.families import Gaussian
from statsmodels.stats.mediation import Mediation

def perform_individual_multiple_mediator_analysis(dataframe, exposure, mediator_list, outcome):
    # Check if mediator_cols is a list, if not, convert it to a list
    if not isinstance(mediator_list, list):
        mediator_list = [mediator_list]
    
    # Convert mediator columns into formula expression
    mediator_expression = ' + '.join(mediator_list)
    
    # Step 1: Fit the mediator models
    # Mediators ~ Independent Variable
    mediator_models = [GLM.from_formula(f"{mediator} ~ {exposure}", data=dataframe, family=Gaussian()) for mediator in mediator_list]
    
    # Step 2: Fit the outcome model
    # Dependent Variable ~ Independent Variable + Mediators
    outcome_model = GLM.from_formula(f"{outcome} ~ {exposure} + {mediator_expression}", data=dataframe, family=Gaussian())
    
    # Step 3: Perform mediation analysis for each mediator
    for mediator, mediator_model in zip(mediator_list, mediator_models):
        med = Mediation(outcome_model, mediator_model, exposure=exposure, mediator=mediator)
        med_result = med.fit()
        
        # Step 4: Print the results for each mediator
        print(f"Mediation analysis for mediator: {mediator}")
        print(med_result.summary())
        print("\n")

In [11]:
data_df.columns

Index(['subject_id', 'Age', 'Subiculum_Connectivity', 'Subiculum',
       'Hippocampus', 'Cerebellum', 'Frontal', 'Insula', 'Parietal',
       'Temporal', 'Occipital', 'outcome'],
      dtype='object')

In [12]:
perform_individual_multiple_mediator_analysis(data_df, 
                                   exposure='Age', 
                                   mediator_list=['Hippocampus', 'Cerebellum', 'Frontal', 'Parietal',
       'Temporal', 'Occipital'], 
                                   outcome='outcome')

Mediation analysis for mediator: Hippocampus
                          Estimate  Lower CI bound  Upper CI bound  P-value
ACME (control)            0.050512       -0.086087        0.279076    0.512
ACME (treated)            0.050512       -0.086087        0.279076    0.512
ADE (control)             0.059724       -0.341859        0.495938    0.796
ADE (treated)             0.059724       -0.341859        0.495938    0.796
Total effect              0.110236       -0.265746        0.512699    0.582
Prop. mediated (control)  0.078495       -3.920691        4.380159    0.818
Prop. mediated (treated)  0.078495       -3.920691        4.380159    0.818
ACME (average)            0.050512       -0.086087        0.279076    0.512
ADE (average)             0.059724       -0.341859        0.495938    0.796
Prop. mediated (average)  0.078495       -3.920691        4.380159    0.818


Mediation analysis for mediator: Cerebellum
                          Estimate  Lower CI bound  Upper CI bound  P-val

_____
**Multivariate Mediation Analysis**
Based on Preacher and Hayes 2008, Asymptotic Asymptotic and resampling strategies for assessing and comparing indirect effects in multiple mediator models

**NOTE**: This is a proper multivariate mediation analysis

Model Explanation:
In a multiple mediation analysis, you are interested in investigating the indirect effects of an independent variable (exposure) on a dependent variable (outcome) through more than one mediator simultaneously. You are also interested in the direct effect of the independent variable on the dependent variable, controlling for the mediators. This analysis helps in understanding how different mediators may carry the influence of an independent variable to a dependent variable.

Model Form:
```
Independent Variable (IV) -----(c')------> Dependent Variable (DV)
    |        |        |                       ^      ^         ^
    |        |        | (a1)                  |      |         |
    |        |        v                       |      |         |
    |        |        Mediator 1 --(b1)-------|      |         |
    |        |                                       |         |
    |        | (a2)                                  |         |
    |        v                                       |         |
    |    Mediator 2 --------(b2)---------------------|         |
    |                                                          |
    ...                                                        |
    |                                                          |
    | (ak)                                                     |
    v                                                          |
Mediator k --------(bk)----------------------------------------|
```
In this diagram:

IV represents the independent variable.
Mediator 1, Mediator 2, ..., Mediator k represent the k mediators in the model.
DV represents the dependent variable.
The path labeled a1 represents the effect of the IV on Mediator 1. Similarly, a2 represents the effect of the IV on Mediator 2, and so on.
The path labeled b1 represents the effect of Mediator 1 on the DV, controlling for other mediators and the IV. Similarly, b2 represents the effect of Mediator 2 on the DV, controlling for other mediators and the IV, and so on.
The path labeled c' represents the direct effect of the IV on the DV, controlling for all the mediators.

**Technical Notes**

This is not formally implemented in Python as well as in R, so this is Calvin's custom code. Use with care.
This is a more 'true' multivariate mediation analysis than the one above

**Technical Explanation**
The perform_multiple_mediation_analysis function is designed to perform a multiple mediation analysis to estimate the joint indirect effects of an exposure variable through multiple mediators on a dependent variable. This is achieved through the use of bootstrapping.

Bootstrapping is a statistical technique where resampling of the dataset is done with replacement to estimate the sampling distribution of a statistic. In the case of multiple mediation analysis, it's used to estimate the joint indirect effects of mediators.

In the function, for each bootstrap sample, the dataset is resampled with replacement. For each mediator, a simple linear regression is performed to estimate the path from the exposure variable to the mediator, denoted as 'a' paths. The product of these 'a' paths is computed, as this represents the joint effect of the exposure on all the mediators.

Next, a multiple regression is performed with the dependent variable as the response, and the mediators and exposure as predictors. The coefficients for the mediators represent the effect of the mediators on the dependent variable, controlling for the exposure, denoted as 'b' paths. The sum of these 'b' paths is computed.

The indirect effect for the bootstrap sample is then calculated as the product of the 'a' paths and the 'b' path sum. This process is repeated for a specified number of bootstrap samples (default is 5000), and the mean indirect effect is calculated from these bootstrap samples.

Finally, the function prints the mean indirect effect and its 95% confidence interval, which is estimated using the 2.5th and 97.5th percentiles of the bootstrap indirect effects.

This approach provides an approximate estimate of the joint indirect effect through multiple mediators and allows for the calculation of confidence intervals around this effect.

In [13]:
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols

def calculate_confidence_intervals(ab_paths, total_indirect_effects, mediators):
    """
    Calculates the confidence intervals and p-value based on the bootstrapped samples.

    Parameters:
    - ab_paths: list of lists containing the bootstrapped ab paths for each mediator.
    - total_indirect_effects: list of bootstrapped summed ab paths.
    - mediators: list of mediator names.

    Returns:
    - DataFrame with the mean indirect effect, confidence intervals, and p-values for each mediator and the total indirect effect.
    """

    ab_path_values = np.array(ab_paths)
    total_indirect_effects = np.array(total_indirect_effects)

    # Calculate mean indirect effect and confidence intervals for each mediator
    mean_ab_paths = np.mean(ab_path_values, axis=0)
    lower_bounds = np.percentile(ab_path_values, 2.5, axis=0)
    upper_bounds = np.percentile(ab_path_values, 97.5, axis=0)

    # Calculate p-values for each mediator
    ab_path_p_values = [np.mean(np.sign(mean_ab_paths[i]) * ab_path_values[:, i] <= 0) for i in range(len(mean_ab_paths))]


    # Calculate mean indirect effect and confidence intervals for the total indirect effect
    mean_total_indirect_effect = np.mean(total_indirect_effects)
    lower_bound_total = np.percentile(total_indirect_effects, 2.5)
    upper_bound_total = np.percentile(total_indirect_effects, 97.5)
    p_value_total = np.mean(total_indirect_effects > 0) if np.mean(total_indirect_effects) < 0 else np.mean(total_indirect_effects <= 1)

    # Create DataFrame to store the results
    result_df = pd.DataFrame({
        'Point Estimate': np.concatenate((mean_ab_paths, [mean_total_indirect_effect])),
        '2.5th Percentile': np.concatenate((lower_bounds, [lower_bound_total])),
        '97.5th Percentile': np.concatenate((upper_bounds, [upper_bound_total])),
        'P-value': ab_path_p_values + [p_value_total]
    }, index=mediators + ['Total Indirect Effect'])

    return result_df


def perform_multiple_mediation_analysis(dataframe, exposure, mediators, dependent_variable, bootstrap_samples=5000):
    """
    Performs a multiple mediation analysis by estimating the joint indirect effects of an exposure variable
    through multiple mediators on a dependent variable using bootstrapping.

    Parameters:
    - dataframe: DataFrame containing the data.
    - exposure: str, column name of the exposure variable.
    - mediators: list, column names of the mediator variables.
    - dependent_variable: str, column name of the dependent variable.
    - bootstrap_samples: int, optional, number of bootstrap samples to be used (default is 5000).

    Returns:
    - DataFrame with the mean indirect effect, confidence intervals, and p-values for each mediator and the total indirect effect.

    Example Usage:
    result_df = perform_multiple_mediation_analysis(data_df, exposure='Age',
                                                    mediators=['Brain_Lobe1', 'Brain_Lobe2'],
                                                    dependent_variable='outcome')
    """

    # Perform multiple mediation analysis
    ab_paths, total_indirect_effects = [], []

    # Loop over each bootstrap sample
    for i in range(bootstrap_samples):
        # Resample the data with replacement
        sample = dataframe.sample(frac=1, replace=True)

        # Paths from exposure to mediators and from mediators to DV
        ab_paths_sample = [ols(f"{mediator} ~ {exposure}", data=sample).fit().params[exposure] *
                           ols(f"{dependent_variable} ~ {mediator} + {exposure}", data=sample).fit().params[mediator]
                           for mediator in mediators]

        # Sum the individual indirect effects for this bootstrap sample
        total_indirect_effect = sum(ab_paths_sample)

        # Append the ab paths and total indirect effect to the lists
        ab_paths.append(ab_paths_sample)
        total_indirect_effects.append(total_indirect_effect)

    # Calculate confidence intervals and p-values
    result_df = calculate_confidence_intervals(ab_paths, total_indirect_effects, mediators)

    return result_df



In [45]:
data_df.columns

Index(['Patient_#_CDR,_ADAS', 'Age', 'Subiculum_Connectivity',
       'Subiculum_Damage_Score', 'Hippocampus_Damage_Score',
       'Temporal_Damage_Score', 'Frontal_Damage_Score',
       'Parietal_Damage_Score', 'Cerebellum_Damage_Score',
       'Insula_Damage_Score', 'Occipital_Damage_Score', 'outcome'],
      dtype='object')

In [14]:
result_df = perform_multiple_mediation_analysis(data_df, 
                                    exposure='Age', 
                                    mediators=['Hippocampus', 'Cerebellum', 'Frontal', 'Parietal',
       'Temporal', 'Occipital']
                                               , 
                                    dependent_variable='outcome',
                                    bootstrap_samples=10000)
result_df

Unnamed: 0,Point Estimate,2.5th Percentile,97.5th Percentile,P-value
Hippocampus,0.037183,-0.111408,0.169448,0.2533
Cerebellum,0.005732,-0.030423,0.060915,0.4135
Frontal,-0.000936,-0.056249,0.062174,0.4516
Parietal,-0.039603,-0.165577,0.080719,0.2171
Temporal,0.009167,-0.053368,0.097143,0.3998
Occipital,-0.002851,-0.092412,0.074276,0.4825
Total Indirect Effect,0.008691,-0.276208,0.378237,0.9996


# First Stage Model Moderated Mediation Analyses

Model Structure:
```
 #Age ---(a1)------> Mediator ---(b1)---> Outcome
  |       |                                                                        
   ---(Moderator)
```
This is a 'first-stage model'

First Stage Moderated Mediation: basically just checks if the mediation is dependent on a moderator relating the independent variable to the mediator. Essentially, it's an interaction effect. 

In [2]:
import pandas as pd
import statsmodels.api as sm
from statsmodels.genmod.generalized_linear_model import GLM
from statsmodels.genmod.families import Gaussian
from statsmodels.stats.mediation import Mediation

def first_stage_moderated_mediation(dataframe, exposure, moderator, mediator, dependent_variable):
    """
    This function performs a moderated_mediation analysis on the provided data.
    
    Parameters:
    - dataframe: DataFrame containing the data.
    - exposure: str, column name of the independent variable (e.g., 'Age').
    - moderator: str, column name of the moderator variable, the variable interacting with the independent variable (e.g., 'Subiculum_Connectivity').
    - mediator: str, column name of the mediator variable, the variable acting through the independent variable (e.g., 'Hippocampus').
    - dependent_variable: str, column name of the dependent variable (e.g., 'outcome').
       
    Example Usage:
    perform_moderation_analysis(data_df, independent_variable='Age', moderator='Subiculum_Connectivity', mediator='Hippocampus', dependent_variable='outcome')
    """
    
    # Step 1: Fit the mediator model with interaction term
    # Mediator ~ IV + Moderator + IV*Moderator
    mediator_model = GLM.from_formula(f"{mediator} ~ {exposure} + {moderator} + {exposure}:{moderator}", data=dataframe, family=Gaussian())
    
    # Step 2: Fit the outcome model with mediator
    # Dependent Variable ~ IV + Moderator + IV*Moderator + Mediator
    outcome_model = GLM.from_formula(f"{dependent_variable} ~ {exposure} + {moderator} + {exposure}:{moderator} + {mediator}", data=dataframe, family=Gaussian())
    
    # Step 3: Perform mediation analysis
    med = Mediation(outcome_model, mediator_model, exposure=exposure, mediator=mediator)
    try:
        med_result = med.fit()
        print(f'Exposure {exposure} moderated by {moderator} mediated by {mediator}')
        print(med_result.summary())
        return med_result
    except:
        print(f'Perfect separation detected, aborting {mediator}')
        return


In [None]:
data_df.columns

Index(['subject_id', 'Age', 'Mesial_Temporal_Grade', 'Frontal', 'Temporal',
       'Parietal', 'Occipital', 'Cerebellar', 'Subiculum_Connectivity',
       'Hippocampal_CSF', 'Hippocampal_Grey_Matter',
       'Hippocampal_White_Matter', 'outcome'],
      dtype='object')

In [49]:
# Example usage:
# Assuming your DataFrame is named data_df, and you're investigating if 'Hippocampus' mediates the relationship between 'Age' and 'outcome', with 'Subiculum_Connectivity' as a moderator.
mediation_results = first_stage_moderated_mediation(data_df, 
                                     exposure='Age', 
                                     moderator='Subiculum_Connectivity', 
                                     mediator='Hippocampus_Damage_Score', 
                                     dependent_variable='outcome')
mediation_results.summary()

Exposure Age moderated by Subiculum_Connectivity mediated by Hippocampus_Damage_Score
                          Estimate  Lower CI bound  Upper CI bound  P-value
ACME (control)            0.052079       -0.059894        0.223023    0.422
ACME (treated)            0.052079       -0.059894        0.223023    0.422
ADE (control)             0.072082       -0.216863        0.370910    0.634
ADE (treated)             0.072082       -0.216863        0.370910    0.634
Total effect              0.124162       -0.169802        0.440604    0.414
Prop. mediated (control)  0.191037       -3.096974        4.897974    0.576
Prop. mediated (treated)  0.191037       -3.096974        4.897974    0.576
ACME (average)            0.052079       -0.059894        0.223023    0.422
ADE (average)             0.072082       -0.216863        0.370910    0.634
Prop. mediated (average)  0.191037       -3.096974        4.897974    0.576


Unnamed: 0,Estimate,Lower CI bound,Upper CI bound,P-value
ACME (control),0.052079,-0.059894,0.223023,0.422
ACME (treated),0.052079,-0.059894,0.223023,0.422
ADE (control),0.072082,-0.216863,0.37091,0.634
ADE (treated),0.072082,-0.216863,0.37091,0.634
Total effect,0.124162,-0.169802,0.440604,0.414
Prop. mediated (control),0.191037,-3.096974,4.897974,0.576
Prop. mediated (treated),0.191037,-3.096974,4.897974,0.576
ACME (average),0.052079,-0.059894,0.223023,0.422
ADE (average),0.072082,-0.216863,0.37091,0.634
Prop. mediated (average),0.191037,-3.096974,4.897974,0.576


In [None]:
[first_stage_moderated_mediation(data_df, exposure='Age', moderator='Subiculum_Connectivity', mediator=f'{col}', dependent_variable='outcome') for col in data_df.columns]

Exposure Age moderated by Subiculum_Connectivity mediated by subject_id
                          Estimate  Lower CI bound  Upper CI bound  P-value
ACME (control)           -0.001905       -0.091951        0.087114    0.940
ACME (treated)           -0.001905       -0.091951        0.087114    0.940
ADE (control)             0.128545       -0.154193        0.406621    0.368
ADE (treated)             0.128545       -0.154193        0.406621    0.368
Total effect              0.126640       -0.164800        0.403956    0.396
Prop. mediated (control)  0.007420       -1.408721        1.889678    0.924
Prop. mediated (treated)  0.007420       -1.408721        1.889678    0.924
ACME (average)           -0.001905       -0.091951        0.087114    0.940
ADE (average)             0.128545       -0.154193        0.406621    0.368
Prop. mediated (average)  0.007420       -1.408721        1.889678    0.924
Perfect separation detected, aborting Age
Exposure Age moderated by Subiculum_Connectivity m

[<statsmodels.stats.mediation.MediationResults at 0x187f08c40>,
 None,
 <statsmodels.stats.mediation.MediationResults at 0x187f68d30>,
 <statsmodels.stats.mediation.MediationResults at 0x1878d5c60>,
 <statsmodels.stats.mediation.MediationResults at 0x187d8c490>,
 <statsmodels.stats.mediation.MediationResults at 0x187bb4370>,
 <statsmodels.stats.mediation.MediationResults at 0x187b69690>,
 <statsmodels.stats.mediation.MediationResults at 0x187c79210>,
 None,
 <statsmodels.stats.mediation.MediationResults at 0x187bc5270>,
 <statsmodels.stats.mediation.MediationResults at 0x111cd7df0>,
 <statsmodels.stats.mediation.MediationResults at 0x187b65ab0>,
 None]

_____________________
Second Stage Model Moderated Mediation

The moderation is on the path from the independent variable (age) to the mediator (atrophy region). This type of model is often termed as a first-stage moderated mediation. The moderation in this case can show if the relationship between age and the atrophy region varies at different levels of another variable (the moderator).

This is model is not natively supported in Python, thus I have developed it myself. 
Highly complex and experimental model. Use with caution. 


Model Structure:
```
 #Age ---(a1)------> Mediator ---(b1)---> Outcome
  |                                  |
  |                                  |
   |                                 |
   ---(Moderator)---------------------
```
Second Stage Moderated Mediation: When the moderation is occurring in the second part of the mediation process, it is called second stage moderated mediation. In this case, the relationship between the mediator (M) and the dependent variable (DV) depends on the moderator. This is sometimes referred to as "moderated b path."


In [50]:
import pandas as pd
import statsmodels.api as sm
from statsmodels.genmod.generalized_linear_model import GLM
from statsmodels.genmod.families import Gaussian
from statsmodels.stats.mediation import Mediation

import pandas as pd
import statsmodels.api as sm
from statsmodels.genmod.generalized_linear_model import GLM
from statsmodels.genmod.families import Gaussian
from statsmodels.stats.mediation import Mediation

def second_stage_moderated_mediation(dataframe, exposure, moderator, mediator, dependent_variable):
    """
    This function performs a moderated mediation analysis on the provided data.
    
    Parameters:
    - dataframe: DataFrame containing the data.
    - independent_variable: str, column name of the independent variable (e.g., 'Age').
    - moderator: str, column name of the moderator variable, the variable interacting with the independent variable (e.g., 'Subiculum_Connectivity').
    - mediator: str, column name of the mediator variable, the variable acting through the independent variable (e.g., 'Hippocampus').
    - dependent_variable: str, column name of the dependent variable (e.g., 'outcome').
       
    Example Usage:
    perform_moderated_mediation_analysis(data_df, independent_variable='Age', moderator='Subiculum_Connectivity', mediator='Hippocampus', dependent_variable='outcome')
    """
    
    # Step 1: Fit the mediator model with interaction term
    # Mediator ~ IV
    mediator_model = GLM.from_formula(f"{mediator} ~ {exposure}", data=dataframe, family=Gaussian())
    
    # Step 2: Fit the outcome model with mediator
    # Dependent Variable ~ IV + Moderator + IV*Moderator + Mediator
    outcome_model = GLM.from_formula(f"{dependent_variable} ~ {exposure} + {moderator} + {mediator}:{moderator} + {mediator}", data=dataframe, family=Gaussian())

    # Step 3: Perform mediation analysis
    med = Mediation(outcome_model, mediator_model, exposure=exposure, mediator=mediator)
    try:
        med_result = med.fit()
        print(f'Exposure {exposure} mediated by {mediator} and then moderated by {moderator}')
        print(med_result.summary())
        return med_result
    except:
        print(f'Perfect separation detected, aborting {mediator}')
        return np.NaN
     

In [51]:
med_result = second_stage_moderated_mediation(data_df, 
                                exposure='Age', 
                                moderator='Subiculum_Connectivity', 
                                mediator='Hippocampus_Damage_Score', 
                                dependent_variable='outcome')
med_result.summary()

Exposure Age mediated by Hippocampus_Damage_Score and then moderated by Subiculum_Connectivity
                          Estimate  Lower CI bound  Upper CI bound  P-value
ACME (control)            0.048508       -0.108348        0.255114    0.550
ACME (treated)            0.048508       -0.108348        0.255114    0.550
ADE (control)             0.020545       -0.301028        0.344528    0.878
ADE (treated)             0.020545       -0.301028        0.344528    0.878
Total effect              0.069053       -0.238494        0.406320    0.660
Prop. mediated (control)  0.193687       -3.697825        4.363509    0.702
Prop. mediated (treated)  0.193687       -3.697825        4.363509    0.702
ACME (average)            0.048508       -0.108348        0.255114    0.550
ADE (average)             0.020545       -0.301028        0.344528    0.878
Prop. mediated (average)  0.193687       -3.697825        4.363509    0.702


Unnamed: 0,Estimate,Lower CI bound,Upper CI bound,P-value
ACME (control),0.048508,-0.108348,0.255114,0.55
ACME (treated),0.048508,-0.108348,0.255114,0.55
ADE (control),0.020545,-0.301028,0.344528,0.878
ADE (treated),0.020545,-0.301028,0.344528,0.878
Total effect,0.069053,-0.238494,0.40632,0.66
Prop. mediated (control),0.193687,-3.697825,4.363509,0.702
Prop. mediated (treated),0.193687,-3.697825,4.363509,0.702
ACME (average),0.048508,-0.108348,0.255114,0.55
ADE (average),0.020545,-0.301028,0.344528,0.878
Prop. mediated (average),0.193687,-3.697825,4.363509,0.702


Run the code above for an entire spreadsheet

In [23]:
[second_stage_moderated_mediation(data_df, exposure='Age', moderator='Subiculum_Connectivity', mediator=f'{col}', dependent_variable='outcome') for col in data_df.columns]

Exposure Age mediated by subject_id and then moderated by Subiculum_Connectivity
                          Estimate  Lower CI bound  Upper CI bound  P-value
ACME (control)           -0.008432       -0.153197        0.113779    0.890
ACME (treated)           -0.008432       -0.153197        0.113779    0.890
ADE (control)             0.070484       -0.229515        0.384556    0.650
ADE (treated)             0.070484       -0.229515        0.384556    0.650
Total effect              0.062052       -0.252095        0.367010    0.712
Prop. mediated (control)  0.032750       -3.912281        3.556485    0.898
Prop. mediated (treated)  0.032750       -3.912281        3.556485    0.898
ACME (average)           -0.008432       -0.153197        0.113779    0.890
ADE (average)             0.070484       -0.229515        0.384556    0.650
Prop. mediated (average)  0.032750       -3.912281        3.556485    0.898
Perfect separation detected, aborting Age
Exposure Age mediated by Mesial_Temporal_

[<statsmodels.stats.mediation.MediationResults at 0x187947ac0>,
 None,
 <statsmodels.stats.mediation.MediationResults at 0x1879853c0>,
 <statsmodels.stats.mediation.MediationResults at 0x187f5bbe0>,
 <statsmodels.stats.mediation.MediationResults at 0x187afd540>,
 <statsmodels.stats.mediation.MediationResults at 0x111cd7700>,
 <statsmodels.stats.mediation.MediationResults at 0x187c285e0>,
 <statsmodels.stats.mediation.MediationResults at 0x1879850c0>,
 <statsmodels.stats.mediation.MediationResults at 0x188108040>,
 <statsmodels.stats.mediation.MediationResults at 0x111b9a320>,
 <statsmodels.stats.mediation.MediationResults at 0x187d39720>,
 <statsmodels.stats.mediation.MediationResults at 0x187d39630>,
 None]

# Mediated Moderation

# Conditional Process Analysis
Based on Preacher and Hayes 2008, Asymptotic Asymptotic and resampling strategies for assessing and comparing indirect effects in multiple mediator models

__________  
Multiple Mediators and Moderation Analysis
Also known as a conditional process analysis
the two code segments below will vary the moderator, 
1) allowing the moderator to interact with the exposure upon mediators,
or 
2) allowing the moderator to interact with the mediators upon outcome. 

**Multiple Mediator Analysis with First Stager Moderator**:

This code estimates the joint indirect effects of an exposure variable through multiple mediators on a dependent variable while considering the moderation effect of a specified moderator variable.

In this code, similar to the previous code, the function loops over each bootstrap sample and computes the indirect effects for each mediator, considering the moderation effect of the specified moderator variable in the first stage. It calculates the total indirect effect as the sum of the individual indirect effects for each mediator. The mean indirect effect, 95% confidence interval, and p-value are then calculated based on the bootstrap samples.

The inclusion of the moderator variable and its interaction with the exposure variable in the first stage allows for examining how the mediation effects may vary based on different levels of the moderator. By considering the moderation effect at the first stage, this analysis investigates the conditional indirect effects of the exposure through the mediators.


In [15]:
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

def perform_multiple_mediation_first_stage_moderator_analysis(dataframe, exposure, mediators, moderator, dependent_variable, bootstrap_samples=5000):
    """
    Performs a multiple mediation analysis by estimating the joint indirect effects of an exposure variable
    through multiple mediators on a dependent variable using bootstrapping.

    Parameters:
    - dataframe: DataFrame containing the data.
    - exposure: str, column name of the exposure variable (e.g., 'Age').
    - mediators: list, column names of the mediator variables (e.g., ['Brain_Lobe1', 'Brain_Lobe2']).
    - moderator: str, column name of the moderator variable.
    - dependent_variable: str, column name of the dependent variable (e.g., 'outcome').
    - bootstrap_samples: int, optional, number of bootstrap samples to be used (default is 5000).

    Returns:
    - None. Prints the mean indirect effect and 95% confidence interval.

    Example Usage:
    perform_multiple_mediation_analysis(data_df, exposure='Age', mediators=['Brain_Lobe1', 'Brain_Lobe2'], dependent_variable='outcome')
    """

    ab_paths, total_indirect_effects = [], []
    
    # Loop over each bootstrap sample
    for i in range(bootstrap_samples):
        # Resample the data with replacement
        sample = dataframe.sample(frac=1, replace=True)
        # Paths from exposure to mediators and from mediators to DV
        ab_paths_sample = [ols(f"{mediator} ~ {exposure} + {moderator} + {exposure}:{moderator}", data=sample).fit().params[exposure] * 
                    ols(f"{dependent_variable} ~ {exposure} + {moderator} + {exposure}:{moderator} + {mediator}", data=sample).fit().params[mediator]
                    for mediator in mediators]
        
        # Sum the individual indirect effects for this bootstrap sample
        total_indirect_effect = sum(ab_paths_sample)
        ab_paths.append(ab_paths_sample)
        total_indirect_effects.append(total_indirect_effect)
    
    # Calculate confidence intervals and p-values
    result_df = calculate_confidence_intervals(ab_paths, total_indirect_effects, mediators)

    return result_df


In [53]:
print(data_df.columns)

Index(['Patient_#_CDR,_ADAS', 'Age', 'Subiculum_Connectivity',
       'Subiculum_Damage_Score', 'Hippocampus_Damage_Score',
       'Temporal_Damage_Score', 'Frontal_Damage_Score',
       'Parietal_Damage_Score', 'Cerebellum_Damage_Score',
       'Insula_Damage_Score', 'Occipital_Damage_Score', 'outcome'],
      dtype='object')


In [16]:
perform_multiple_mediation_first_stage_moderator_analysis(data_df, 
                                                    exposure='Age', 
                                                    moderator='Subiculum_Connectivity', 
                                                    mediators= ['Hippocampus', 'Cerebellum', 'Frontal', 'Parietal',
       'Temporal', 'Occipital'], 
                                                    dependent_variable='outcome',
                                                    bootstrap_samples=10000)

Unnamed: 0,Point Estimate,2.5th Percentile,97.5th Percentile,P-value
Hippocampus,0.04356,-0.098567,0.184438,0.2259
Cerebellum,0.002942,-0.036885,0.053274,0.4538
Frontal,0.001818,-0.043764,0.054531,0.4819
Parietal,-0.044634,-0.184069,0.062024,0.2034
Temporal,-0.001114,-0.076069,0.069663,0.4919
Occipital,-0.007869,-0.113032,0.073998,0.4478
Total Indirect Effect,-0.005297,-0.33525,0.326807,0.4902


--------------------
Moderated Mulitple Mediation Analysis, but with all mediators combined instead of run individually
- This is not supported natively in Python libraries, so I have developed it myself. 

Experimental, please use with caution. 

_____
**Multiple Mediator Analysis with Second Stage Moderator**


In this code, the function loops over each bootstrap sample and computes the indirect effects for each mediator, taking into account the moderation effect of the specified moderator variable. It calculates the total indirect effect as the sum of the individual indirect effects for each mediator. The mean indirect effect, 95% confidence interval, and p-value are then calculated based on the bootstrap samples.

The addition of the moderator variable in the second-stage analysis allows for assessing how the mediation effects may vary across different levels of the moderator. By including the moderator variable and its interaction with the mediators, the analysis investigates whether the indirect effects of the exposure through the mediators differ depending on the levels of the moderator.

Overall, this code provides a way to examine multiple mediation effects while considering a second-stage moderation analysis to explore potential moderation effects.

In [17]:
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

def perform_multiple_mediation_second_stage_moderator_analysis(dataframe, exposure, mediators, moderator, dependent_variable, bootstrap_samples=5000):
    """
    Performs a multiple mediation analysis by estimating the joint indirect effects of an exposure variable
    through multiple mediators on a dependent variable using bootstrapping.

    Parameters:
    - dataframe: DataFrame containing the data.
    - exposure: str, column name of the exposure variable (e.g., 'Age').
    - mediators: list, column names of the mediator variables (e.g., ['Brain_Lobe1', 'Brain_Lobe2']).
    - moderator: str, column name of the moderator variable.
    - dependent_variable: str, column name of the dependent variable (e.g., 'outcome').
    - bootstrap_samples: int, optional, number of bootstrap samples to be used (default is 5000).

    Returns:
    - None. Prints the mean indirect effect and 95% confidence interval.

    Example Usage:
    perform_multiple_mediation_analysis(data_df, exposure='Age', mediators=['Brain_Lobe1', 'Brain_Lobe2'], dependent_variable='outcome')
    """

    ab_paths, total_indirect_effects = [], []
    
    # Loop over each bootstrap sample
    for i in range(bootstrap_samples):
        # Resample the data with replacement
        sample = dataframe.sample(frac=1, replace=True)
        # Paths from exposure to mediators and from mediators to DV
        ab_paths_sample = [ols(f"{mediator} ~ {exposure}", data=sample).fit().params[exposure] * 
                    ols(f"{dependent_variable} ~ {exposure} + {moderator} + {mediator}:{moderator} + {mediator}", data=sample).fit().params[mediator]
                    for mediator in mediators]
        
        # Sum the individual indirect effects for this bootstrap sample
        total_indirect_effect = sum(ab_paths_sample)
        ab_paths.append(ab_paths_sample)
        total_indirect_effects.append(total_indirect_effect)
    
    # Calculate confidence intervals and p-values
    result_df = calculate_confidence_intervals(ab_paths, total_indirect_effects, mediators)


    return result_df


In [18]:
perform_multiple_mediation_second_stage_moderator_analysis(data_df, 
                                               exposure = 'Age', 
                                               moderator = 'Subiculum_Connectivity', 
                                               mediators= ['Hippocampus', 'Cerebellum', 'Frontal', 'Parietal',
       'Temporal', 'Occipital'], 
                                                    dependent_variable='outcome',
                                                    bootstrap_samples=10000)

Unnamed: 0,Point Estimate,2.5th Percentile,97.5th Percentile,P-value
Hippocampus,0.01791,-0.149766,0.161816,0.3612
Cerebellum,0.004932,-0.027596,0.054425,0.4064
Frontal,-0.003222,-0.0603,0.057315,0.4221
Parietal,-0.05308,-0.191232,0.066778,0.1601
Temporal,-0.00136,-0.08235,0.071492,0.4932
Occipital,-0.008725,-0.119704,0.071328,0.4387
Total Indirect Effect,-0.043545,-0.367476,0.312034,0.3399


**Mediated Moderation**

Based on Edwards and Lambert 2007, Methods for integrating moderation and mediation: A general analytical framework using moderated path analysis.

Author: Calvin Howard

NOTE: This is a proper mediated moderation analysis

Model Explanation:
A mediated moderation analysis explores how the effect of an independent variable (exposure) on a dependent variable (outcome) is moderated by a third variable, and this moderation effect is mediated by a fourth variable. Essentially, it helps to understand how the interaction between an independent variable and a moderator influences a dependent variable through a mediator. In short: is the moderation (interaction effect) attributable to something else (mediator).

Model Form:
```
Independent Variable (IV) ---(a)--> Mediator (M) ---(b)--> Dependent Variable (DV)
                      |                            ^
                      |                            |
                      |                            |
                      |---(c)----> Moderator (Mod) |
                      |                            |
                      |-----------------(d)--------|
```
In this diagram:

IV represents the independent variable.
M represents the mediator in the model.
Mod represents the moderator in the model.
DV represents the dependent variable.
The path labeled (a) represents the effect of the IV on M.
The path labeled (b) represents the effect of M on the DV, controlling for the IV and Mod.
The path labeled (c) represents the effect of the IV on Mod.
The path labeled (d) represents the interaction between IV and Mod affecting the DV, controlling for M.

Technical Notes

This is not formally implemented in Python as well as in R, so this is custom code. Use with care.

Technical Explanation
The perform_mediated_moderation_analysis function is designed to perform a mediated moderation analysis to estimate the joint indirect effects of an exposure variable through a mediator on a dependent variable, considering the moderating effect of another variable. This is achieved through the use of bootstrapping.

Bootstrapping is a statistical technique where resampling of the dataset is done with replacement to estimate the sampling distribution of a statistic. In the case of mediated moderation analysis, it's used to estimate the joint indirect effects.

In the function, for each bootstrap sample, the dataset is resampled with replacement. Then, a simple linear regression model is fitted to estimate the path from the moderator variable to the mediator, denoted as 'a' path. The interaction between exposure and mediator is estimated using a multiple linear regression model with the dependent variable as the response, and the exposure, mediator, moderator, interaction of exposure and mediator, and interaction of exposure and moderator as predictors. The product of the 'a' path and the interaction effect represent the indirect effect for this bootstrap sample.

This process is repeated for a specified number of bootstrap samples (default is 5000), and the mean indirect effect is calculated from these bootstrap samples.

Finally, the function prints the mean indirect effect and its 95% confidence interval, which is estimated using the 2.5th and 97.5th percentiles of the bootstrap indirect effects.

This approach provides an approximate estimate of the joint indirect effect through a mediator and allows for the calculation of confidence intervals around this effect.

In [40]:
def calculate_confidence_intervals(ab_paths, mediators):
    """
    Calculates the confidence intervals and p-value based on the bootstrapped samples.

    Parameters:
    - ab_paths: list of lists containing the bootstrapped ab paths for each mediator.
    - total_indirect_effects: list of bootstrapped summed ab paths.
    - mediators: list of mediator names.

    Returns:
    - DataFrame with the mean indirect effect, confidence intervals, and p-values for each mediator and the total indirect effect.
    """
    ab_path_values = np.array(ab_paths)

    # Check if there's only one mediator
    if isinstance(mediators, str):
        mediators = [mediators]

    # Calculate mean indirect effect and confidence intervals for each mediator
    mean_ab_paths = np.mean(ab_path_values, axis=0)
    lower_bounds = np.percentile(ab_path_values, 2.5, axis=0)
    upper_bounds = np.percentile(ab_path_values, 97.5, axis=0)

    # Calculate p-values for each mediator
    ab_path_p_values = [np.mean(np.sign(mean_ab_paths) * ab_path_values <= 0)]

    # Create DataFrame to store the results
    result_df = pd.DataFrame({
        'Point Estimate': mean_ab_paths,
        '2.5th Percentile': lower_bounds,
        '97.5th Percentile': upper_bounds,
        'P-value': ab_path_p_values
    }, index=mediators)

    return result_df


In [55]:
import numpy as np
import pandas as pd
from tqdm import tqdm
from statsmodels.formula.api import ols
def perform_mediated_moderation_analysis(dataframe, exposure, mediator, moderator, dependent_variable, bootstrap_samples=5000):
    """
    Performs a mediated moderation analysis by estimating the joint indirect effects of an exposure variable
    through a mediator on a dependent variable using bootstrapping, considering the moderating effect of another variable.

    Parameters:
    - dataframe: DataFrame containing the data.
    - exposure: str, column name of the exposure variable.
    - mediator: str, column name of the mediator variable.
    - moderator: str, column name of the moderator variable.
    - dependent_variable: str, column name of the dependent variable.
    - bootstrap_samples: int, optional, number of bootstrap samples to be used (default is 5000).

    Returns:
    - DataFrame with the mean indirect effect, confidence intervals, and p-values for the indirect effect.

    Example Usage:
    result_df = perform_mediated_moderation_analysis(data_df, exposure='Age',
                                                     mediator='Brain_Lobe',
                                                     moderator='Stimulation',
                                                     dependent_variable='Outcome')
    """

    ab_paths = []

    # Loop over each bootstrap sample
    for i in tqdm(range(bootstrap_samples)):
        # Resample the data with replacement
        sample = dataframe.sample(frac=1, replace=True)

        # Fit the models and calculate the indirect effect for this bootstrap sample
        model_M = ols(f"{mediator} ~ {moderator}", data=sample).fit()
        model_Y = ols(f"{dependent_variable} ~ {exposure} + {mediator} + {moderator} + {exposure}:{mediator} + {exposure}:{moderator}", data=sample).fit()

        indirect_effect = model_M.params[moderator] * model_Y.params[f'{exposure}:{mediator}']

        # Append the indirect effect to the list
        ab_paths.append(indirect_effect)
    print(len(ab_paths))
    # Calculate confidence intervals and p-values
    result_df = calculate_confidence_intervals(ab_paths, mediators=mediator)

    return result_df


In [56]:
data_df.columns

Index(['Patient_#_CDR,_ADAS', 'Age', 'Subiculum_Connectivity',
       'Subiculum_Damage_Score', 'Hippocampus_Damage_Score',
       'Temporal_Damage_Score', 'Frontal_Damage_Score',
       'Parietal_Damage_Score', 'Cerebellum_Damage_Score',
       'Insula_Damage_Score', 'Occipital_Damage_Score', 'outcome'],
      dtype='object')

In [59]:
results_df = perform_mediated_moderation_analysis(dataframe = data_df,
                                                  exposure = 'Subiculum_Connectivity', 
                                                  mediator = 'Subiculum_Damage_Score', 
                                                  moderator = 'Age', 
                                                  dependent_variable ='outcome', 
                                                  bootstrap_samples=5000)
results_df

100%|██████████| 5000/5000 [00:29<00:00, 167.22it/s]

5000





Unnamed: 0,Point Estimate,2.5th Percentile,97.5th Percentile,P-value
Subiculum_Damage_Score,0.005794,-0.088131,0.124588,0.4806
