# Mediation, Moderation, and Conditional Process Analyses

Mediation and moderation are two distinct concepts in statistics that help in understanding the relationships among three or more variables. Mediation deals with explaining the underlying process through which an independent variable (X) influences a dependent variable (Y). Moderation, on the other hand, addresses how the relationship between X and Y changes based on the level of a third variable, known as the moderator (W).

**Mediation Analysis**:

In mediation analysis, we are interested in whether the relationship between X and Y is explained (or mediated) through another variable, called the mediator (M).

- **Path a**: The independent variable (X) affects the mediator (M). This is the "a" path.
- **Path b**: The mediator (M) affects the dependent variable (Y). This is the "b" path.
- **Path c'**: There might also be a direct effect of X on Y that doesn't go through the mediator. This is the "c'" path or direct effect.

```
X ---(a)---> M ---(b)---> Y
 \                        /
  ---------(c')----------
```

The indirect effect is a*b, which is the portion of the effect of X on Y that is transmitted through the mediator M. The total effect of X on Y is the sum of the indirect effect and the direct effect (c').

**Moderated Mediation Analysis**:

Moderated mediation incorporates the idea that the mediation process itself can be contingent upon the levels of another variable (W), known as the moderator. In other words, the indirect effect of X on Y through M may change depending on the level of W.

```
X ---(a)---> M ---(b)---> Y
|            | 
|------(W)---|
 \                        /
  ---------(c')----------
```

**Interpreting the Results**:

In the results of a moderated mediation analysis, you'll encounter several terms:

- ACME (control/treated): Average Causal Mediation Effects. This represents the average effect of the independent variable on the dependent variable through the mediator.
- ADE (control/treated): Average Direct Effects. This represents the average direct effect of the independent variable on the dependent variable not through the mediator.
- Total effect: Sum of ACME and ADE, representing the total effect of the independent variable on the dependent variable.
- Prop. mediated (control/treated): Proportion of the total effect that is mediated by the mediator.
- The 'control' and 'treated' distinction in ACME and ADE accounts for possible treatment or intervention that could affect the mediation. If there's no treatment/intervention involved, they will be the same.
- Lower CI bound and Upper CI bound represent the lower and upper bounds of the confidence interval for the estimates.
- P-value: Indicates the statistical significance of the estimates. A lower p-value (<0.05) typically indicates a statistically significant effect.

- Control: baseline regression without effect of the mediator
- Treatment: mediated regression wherein the mediator is accounted for
 

**Identifying Partial Mediations**:

In partial mediation, the mediator explains some, but not all, of the relationship between the independent variable (exposure) and the dependent variable (outcome). Here's how you can identify partial mediation:

Significant Direct Effect (ADE): In partial mediation, the direct effect of the independent variable on the dependent variable remains significant even after accounting for the mediator. This means that there is still a direct path from the independent variable to the dependent variable that is not explained by the mediator.

Significant Indirect Effect (ACME): The indirect effect through the mediator must also be significant. This means that the mediator is explaining part of the relationship between the independent variable and the dependent variable.

Total Effect > Direct Effect: The total effect of the independent variable on the dependent variable (before the mediator is added to the model) is generally greater than the direct effect (after accounting for the mediator). This shows that the mediator is explaining some portion of the effect.

In the results you usually get from mediation analysis, check for the following:

ADE (average direct effect) should be significant.
ACME (average causal mediation effect) should be significant.
The total effect should be greater than the direct effect, indicating that some portion of the effect is being accounted for by the mediator.
In summary, partial mediation is present when both the direct and indirect paths are significant, but the inclusion of the mediator in the model reduces the direct effect of the independent variable on the dependent variable (without rendering it non-significant).

**Identifying a Mediated Exposure**:

If the exposure was being mediated, you would expect to see:

A significant ACME (Average Causal Mediation Effect) - This indicates that the mediator is having a significant effect in the relationship between the exposure and the outcome.

A significant proportion mediated - This tells you the proportion of the total effect that is mediated. If this is significant, it implies that a noteworthy part of the relationship between the exposure and the outcome is occurring through the mediator.

Also, keep in mind that the proportion mediated can be positive or negative, and this would indicate whether the mediator is increasing or decreasing the effect of the exposure on the outcome.

In [1]:
# Specify where you want to save your results to
out_dir = '/Users/cu135/Partners HealthCare Dropbox/Calvin Howard/studies/atrophy_seeds_2023/Figures/supplement_parametric_regression_to_adascog'

Import Data

In [2]:
# Specify the path to your CSV file containing NIFTI paths
input_csv_path = '/Users/cu135/Partners HealthCare Dropbox/Calvin Howard/studies/cognition_2023/metadata/master_list_proper_subjects.xlsx'
sheet = 'master_list_proper_subjects' # 'master_list_proper_subjects'

In [None]:
from calvin_utils.permutation_analysis_utils.statsmodels_palm import CalvinStatsmodelsPalm
# Instantiate the PalmPrepararation class
cal_palm = CalvinStatsmodelsPalm(input_csv_path=input_csv_path, output_dir=out_dir, sheet=sheet)
# Call the process_nifti_paths method
data_df = cal_palm.read_and_display_data()


# 01 - Preprocess Your Data

**Handle NANs**
- Set drop_nans=True is you would like to remove NaNs from data
- Provide a column name or a list of column names to remove NaNs from

In [None]:
data_df.columns

In [69]:
drop_list = ['Age', 'Z_Scored_Percent_Cognitive_Improvement', 'Hippocampus_GM_Vol', 'Z_Scored_Subiculum_T_By_Origin_Group_']

In [None]:
data_df = cal_palm.drop_nans_from_columns(columns_to_drop_from=drop_list)
data_df

**Drop Row Based on Value of Column**

Define the column, condition, and value for dropping rows
- column = 'your_column_name'
- condition = 'above'  # Options: 'equal', 'above', 'below'

In [None]:
data_df.columns

Set the parameters for dropping rows

In [11]:
column = 'City'  # The column you'd like to evaluate
condition = 'equal'  # The condition to check ('equal', 'above', 'below', 'not')
value = 'Queensland' # The value to drop if T

In [None]:
data_df, other_df = cal_palm.drop_rows_based_on_value(column, condition, value)
display(data_df)

**Standardize Data**
- Enter Columns you Don't want to standardize into a list

In [None]:
# Remove anything you don't want to standardize
cols_not_to_standardize = ['Ordinal_Target_Type', 'Ordinal_Epilepsy_Type'] #['Age']

In [None]:
data_df = cal_palm.standardize_columns(cols_not_to_standardize)
data_df

Single Variable Mediation Analysis

Model Shape:
```
 Age ---(an)---> Atrophy Region n ---(bn)---> Outcome
  \                                          /
   -------------------(c')-------------------
```

Description
- This uses partial regressions to essentially (explain in a reductive manner) to what degree an independent variable's relationship with a dependent variable can be replaced by a 'mediator variable'. 
- Said another way, how much is the effect of x upon y mediated by z?

In [30]:
import pandas as pd
import statsmodels.api as sm
from statsmodels.genmod.generalized_linear_model import GLM
from statsmodels.genmod.families import Gaussian
from statsmodels.stats.mediation import Mediation

def perform_mediation_analysis(dataframe, exposure, mediator, dependent_variable):
    # Step 1: Fit the mediator model
    # Mediator ~ Independent Variable
    mediator_model = GLM.from_formula(f"{mediator} ~ {exposure}", data=dataframe, family=Gaussian())
    
    # Step 2: Fit the outcome model
    # Dependent Variable ~ Independent Variable + Mediator
    outcome_model = GLM.from_formula(f"{dependent_variable} ~ {exposure} + {mediator}", data=dataframe, family=Gaussian())
    
    # Step 3: Perform mediation analysis
    med = Mediation(outcome_model, mediator_model, exposure=exposure, mediator=mediator)
    try:
        med_result = med.fit()
        # Step 4: Print the results
        print(f'Mediation of {exposure} by {mediator}:')
        print(med_result.summary())
    except:
        print(f"Mediation failed, perfect separation detected. Aborting assessment of {mediator}.")


In [None]:
data_df.columns

In [None]:
perform_mediation_analysis(data_df, 
                           exposure = 'Age', 
                           mediator = 'Hippocampus', 
                           dependent_variable='outcome')

Run the above code on an entire spreadsheet

In [None]:
[perform_mediation_analysis(data_df, exposure='Age', mediator=f'{col}', dependent_variable='outcome') for col in data_df.columns]

__________
**Multivariate Mediation Analysis (Independent)**

**NOTE**
This is not a proper 'multivariate mediation analysis'.
It does something slightly different: it performs individual mediation analyses for each mediator, while controlling for the exposure. This is different from assessing the conglomerate mediation effect of all mediators acting in unison.


To investigate whether the effect of an independent variable on an outcome variable is fully mediated through multiple variables (e.g., atrophy in different brain regions), you would want to conduct a multiple mediator analysis.

In a multiple mediator analysis, you examine the indirect effects of an independent variable (in this case, age) on a dependent variable through several mediators simultaneously. This allows you to estimate the portion of the effect of the independent variable that is transmitted through each mediator and whether, when taken together, these mediators account for the entirety of the effect.

In your case, the different atrophy regions will serve as multiple mediators. Here’s how this might look diagrammatically:

```
 Age ---(a1)---> Atrophy Region 1 ---(b1)---> Outcome
  |           |
 Age ---(a2)---> Atrophy Region 2 ---(b2)---> Outcome
  |           |
   ...       ...
 Age ---(an)---> Atrophy Region n ---(bn)---> Outcome
  \                                          /
   -------------------(c')-------------------
```

In this example, each atrophy region (1 through n) is a mediator. The indirect effect of age on the outcome through each atrophy region is a_i * b_i, and the total indirect effect is the sum of all these individual indirect effects. If this total indirect effect is significant and accounts for most of the effect of age on the outcome, and the direct effect (c') is non-significant, it indicates full mediation.

You can perform multiple mediator analysis using the same statistical tools you use for single mediator analysis, but you will need to include multiple mediators in your model. This can be done using statistical software such as R, SAS, or Python's statsmodels. I have implemented this below using statsmodels.

Here's how the mediation model is constructed in conceptual terms:

1. Fit a model for each mediator (Atrophy Region 1, Atrophy Region 2, ..., Atrophy Region n) as a function of the independent variable, Age.
2. Fit a model for the outcome as a function of Age and all atrophy regions.
3. Estimate the indirect effect of Age on the outcome through each atrophy region and sum them to get the total indirect effect.
4. Compare the total indirect effect to the total effect of Age on the outcome to determine the proportion mediated.

By performing this analysis, you can ascertain whether the combined atrophy in different regions fully mediates the effect of age on the outcome variable. If you also have a moderation effect to consider (such as subiculum connectivity), this would be a moderated multiple mediation analysis. In this case, the indirect effects and the direct effect may change at different levels of the moderator.

I have generated code which will run mutliple mediation analyses independently, but will therefor not be able to tell you how they sum to contribute to the proportion of the independent variable mediated by them. this is perform_individual_multiple_mediator_analysis. Python does not have a combined multiple mediator analysis supported, so I have developed one. It is perform_combined_multiple_mediator_analysis. Please use with caution. 

In [None]:
import pandas as pd
import statsmodels.api as sm
from statsmodels.genmod.generalized_linear_model import GLM
from statsmodels.genmod.families import Gaussian
from statsmodels.stats.mediation import Mediation

def perform_individual_multiple_mediator_analysis(dataframe, exposure, mediator_list, outcome):
    # Check if mediator_cols is a list, if not, convert it to a list
    if not isinstance(mediator_list, list):
        mediator_list = [mediator_list]
    
    # Convert mediator columns into formula expression
    mediator_expression = ' + '.join(mediator_list)
    
    # Step 1: Fit the mediator models
    # Mediators ~ Independent Variable
    mediator_models = [GLM.from_formula(f"{mediator} ~ {exposure}", data=dataframe, family=Gaussian()) for mediator in mediator_list]
    
    # Step 2: Fit the outcome model
    # Dependent Variable ~ Independent Variable + Mediators
    outcome_model = GLM.from_formula(f"{outcome} ~ {exposure} + {mediator_expression}", data=dataframe, family=Gaussian())
    
    # Step 3: Perform mediation analysis for each mediator
    for mediator, mediator_model in zip(mediator_list, mediator_models):
        med = Mediation(outcome_model, mediator_model, exposure=exposure, mediator=mediator)
        med_result = med.fit()
        
        # Step 4: Print the results for each mediator
        print(f"Mediation analysis for mediator: {mediator}")
        print(med_result.summary())
        print("\n")

In [None]:
data_df.columns

In [None]:
perform_individual_multiple_mediator_analysis(data_df, 
                                   exposure='Age', 
                                   mediator_list=['Hippocampus', 'Cerebellum', 'Frontal', 'Parietal',
       'Temporal', 'Occipital'], 
                                   outcome='outcome')

_____
**Multivariate Mediation Analysis**
Based on Preacher and Hayes 2008, Asymptotic Asymptotic and resampling strategies for assessing and comparing indirect effects in multiple mediator models

**NOTE**: This is a proper multivariate mediation analysis

Model Explanation:
In a multiple mediation analysis, you are interested in investigating the indirect effects of an independent variable (exposure) on a dependent variable (outcome) through more than one mediator simultaneously. You are also interested in the direct effect of the independent variable on the dependent variable, controlling for the mediators. This analysis helps in understanding how different mediators may carry the influence of an independent variable to a dependent variable.

Model Form:
```
Independent Variable (IV) -----(c')------> Dependent Variable (DV)
    |        |        |                       ^      ^         ^
    |        |        | (a1)                  |      |         |
    |        |        v                       |      |         |
    |        |        Mediator 1 --(b1)-------|      |         |
    |        |                                       |         |
    |        | (a2)                                  |         |
    |        v                                       |         |
    |    Mediator 2 --------(b2)---------------------|         |
    |                                                          |
    ...                                                        |
    |                                                          |
    | (ak)                                                     |
    v                                                          |
Mediator k --------(bk)----------------------------------------|
```
In this diagram:

IV represents the independent variable.
Mediator 1, Mediator 2, ..., Mediator k represent the k mediators in the model.
DV represents the dependent variable.
The path labeled a1 represents the effect of the IV on Mediator 1. Similarly, a2 represents the effect of the IV on Mediator 2, and so on.
The path labeled b1 represents the effect of Mediator 1 on the DV, controlling for other mediators and the IV. Similarly, b2 represents the effect of Mediator 2 on the DV, controlling for other mediators and the IV, and so on.
The path labeled c' represents the direct effect of the IV on the DV, controlling for all the mediators.

**Technical Notes**

This is not formally implemented in Python as well as in R, so this is Calvin's custom code. Use with care.
This is a more 'true' multivariate mediation analysis than the one above

**Technical Explanation**
The perform_multiple_mediation_analysis function is designed to perform a multiple mediation analysis to estimate the joint indirect effects of an exposure variable through multiple mediators on a dependent variable. This is achieved through the use of bootstrapping.

Bootstrapping is a statistical technique where resampling of the dataset is done with replacement to estimate the sampling distribution of a statistic. In the case of multiple mediation analysis, it's used to estimate the joint indirect effects of mediators.

In the function, for each bootstrap sample, the dataset is resampled with replacement. For each mediator, a simple linear regression is performed to estimate the path from the exposure variable to the mediator, denoted as 'a' paths. The product of these 'a' paths is computed, as this represents the joint effect of the exposure on all the mediators.

Next, a multiple regression is performed with the dependent variable as the response, and the mediators and exposure as predictors. The coefficients for the mediators represent the effect of the mediators on the dependent variable, controlling for the exposure, denoted as 'b' paths. The sum of these 'b' paths is computed.

The indirect effect for the bootstrap sample is then calculated as the product of the 'a' paths and the 'b' path sum. This process is repeated for a specified number of bootstrap samples (default is 5000), and the mean indirect effect is calculated from these bootstrap samples.

Finally, the function prints the mean indirect effect and its 95% confidence interval, which is estimated using the 2.5th and 97.5th percentiles of the bootstrap indirect effects.

This approach provides an approximate estimate of the joint indirect effect through multiple mediators and allows for the calculation of confidence intervals around this effect.

In [None]:
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols

def calculate_confidence_intervals(ab_paths, total_indirect_effects, mediators):
    """
    Calculates the confidence intervals and p-value based on the bootstrapped samples.

    Parameters:
    - ab_paths: list of lists containing the bootstrapped ab paths for each mediator.
    - total_indirect_effects: list of bootstrapped summed ab paths.
    - mediators: list of mediator names.

    Returns:
    - DataFrame with the mean indirect effect, confidence intervals, and p-values for each mediator and the total indirect effect.
    """

    ab_path_values = np.array(ab_paths)
    total_indirect_effects = np.array(total_indirect_effects)

    # Calculate mean indirect effect and confidence intervals for each mediator
    mean_ab_paths = np.mean(ab_path_values, axis=0)
    lower_bounds = np.percentile(ab_path_values, 2.5, axis=0)
    upper_bounds = np.percentile(ab_path_values, 97.5, axis=0)

    # Calculate p-values for each mediator
    ab_path_p_values = [np.mean(np.sign(mean_ab_paths[i]) * ab_path_values[:, i] <= 0) for i in range(len(mean_ab_paths))]


    # Calculate mean indirect effect and confidence intervals for the total indirect effect
    mean_total_indirect_effect = np.mean(total_indirect_effects)
    lower_bound_total = np.percentile(total_indirect_effects, 2.5)
    upper_bound_total = np.percentile(total_indirect_effects, 97.5)
    p_value_total = np.mean(total_indirect_effects > 0) if np.mean(total_indirect_effects) < 0 else np.mean(total_indirect_effects <= 1)

    # Create DataFrame to store the results
    result_df = pd.DataFrame({
        'Point Estimate': np.concatenate((mean_ab_paths, [mean_total_indirect_effect])),
        '2.5th Percentile': np.concatenate((lower_bounds, [lower_bound_total])),
        '97.5th Percentile': np.concatenate((upper_bounds, [upper_bound_total])),
        'P-value': ab_path_p_values + [p_value_total]
    }, index=mediators + ['Total Indirect Effect'])

    return result_df


def perform_multiple_mediation_analysis(dataframe, exposure, mediators, dependent_variable, bootstrap_samples=5000):
    """
    Performs a multiple mediation analysis by estimating the joint indirect effects of an exposure variable
    through multiple mediators on a dependent variable using bootstrapping.

    Parameters:
    - dataframe: DataFrame containing the data.
    - exposure: str, column name of the exposure variable.
    - mediators: list, column names of the mediator variables.
    - dependent_variable: str, column name of the dependent variable.
    - bootstrap_samples: int, optional, number of bootstrap samples to be used (default is 5000).

    Returns:
    - DataFrame with the mean indirect effect, confidence intervals, and p-values for each mediator and the total indirect effect.

    Example Usage:
    result_df = perform_multiple_mediation_analysis(data_df, exposure='Age',
                                                    mediators=['Brain_Lobe1', 'Brain_Lobe2'],
                                                    dependent_variable='outcome')
    """

    # Perform multiple mediation analysis
    ab_paths, total_indirect_effects = [], []

    # Loop over each bootstrap sample
    for i in range(bootstrap_samples):
        # Resample the data with replacement
        sample = dataframe.sample(frac=1, replace=True)

        # Paths from exposure to mediators and from mediators to DV
        ab_paths_sample = [ols(f"{mediator} ~ {exposure}", data=sample).fit().params[exposure] *
                           ols(f"{dependent_variable} ~ {mediator} + {exposure}", data=sample).fit().params[mediator]
                           for mediator in mediators]

        # Sum the individual indirect effects for this bootstrap sample
        total_indirect_effect = sum(ab_paths_sample)

        # Append the ab paths and total indirect effect to the lists
        ab_paths.append(ab_paths_sample)
        total_indirect_effects.append(total_indirect_effect)

    # Calculate confidence intervals and p-values
    result_df = calculate_confidence_intervals(ab_paths, total_indirect_effects, mediators)

    return result_df



In [None]:
data_df.columns

In [None]:
result_df = perform_multiple_mediation_analysis(data_df, 
                                    exposure='Age', 
                                    mediators=['Hippocampus', 'Cerebellum', 'Frontal', 'Parietal',
       'Temporal', 'Occipital']
                                               , 
                                    dependent_variable='outcome',
                                    bootstrap_samples=10000)
result_df

# Mediated Moderation at any/all points along the path

**1) Total Mediated Moderation**

**Edwards and Lambert 2007, DOI: 10.1037/1082-989X.12.1.1**

Author - Calvin Howard


Model Structure:
```
               IV ------------------------> Outcome
             ->|                         | <-------Moderator
            |   --------MEDIATOR---------              |
            |              ^                           |
            |              |                           |
             ------------------------------------------
```
____
- Direct
  - IV to Outcome (Y = B1 + B2X + B3Z + B4XZ + error_B) 
- Indirect
  - IV through mediator to outcome 
    - First stage (Y = B1 + B2X + B3M + B4XM + B5Z + error_B) 
      - Where M = A0 + A1X + A2Z  + A3XZ + error_A
    - Second Stage (Y = B1 + B2X + B3M + B4Z + B5ZM + error_B) 
      - Where M = A0 + A1X
- Total
  - Inlfuence of IV on Outcome through direct and indirect (all paths)
    - Combined accounting of all equations (EQUATION 1)
      - Y = B1 + B2X + B3(A0 + A1X + A2Z  + A3XZ + error_A) + B4X(A0 + A1X + A2Z  + A3XZ + error_A) + B5Z + B6XZ + error_B
- Moderation
  - Is occuring at the first stage (IV->M)
  - Is occuring at the second stage (M->Y)
  - Is occuring at the direct stage (IV->Y)
____
- Estimation of Effects, from estimates in EQUATION 1
  - Direct = B0 + B1Z
  - Indirect = (A1 + A2Z)(B3 + B4Z)
  - Total = B0 + B1Z + (A1 + A2Z)(B3 + B4Z)
____
- Estimation of Significance
  - Resampled bootstrap
  - p = percentage of times bootstrap was opposite sign of the mean bootstrapped value 

In [173]:
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols
from tqdm import tqdm

class MediatedModerationAnalysis:
    def __init__(self, dataframe, iv, mediator, moderator, dv, moderate_stage_1=False, moderate_stage_2=False, moderate_direct=False, moderator_value=None):
        """
        A class for performing mediated moderation analysis with flexibility to handle any 
        combination of moderation at different stages of the mediation process, including cases with no moderation (simple mediation). 
        This class is designed based on the framework provided by Edwards and Lambert (2007), 
        allowing researchers and students to analyze complex relationships between variables with clarity and precision.

        **How It Works:**
        The model analyzes effects in this DAG, where you control where the moderator is applied. 
               IV -----------------------> DV
                |                         ^            
             -->|                         | <-------Moderator
            |    -------Mediator----------             |
            |              ^                           |
            |              |                           |
             ------------------------------------------

        - **Model Fitting:**
            - **First-Stage Model (Mediator Model):**
                - Predicts the mediator (`M`) from the independent variable (`X`).
                - Includes an interaction term between `X` and the moderator (`Z`) if `moderate_stage_1` is `True`, allowing for moderation at the first stage (X → M).
            - **Outcome Model:**
                - Predicts the dependent variable (`Y`) from `X`, `M`, and `Z`.
                - Includes interaction terms between `M` and `Z` and/or `X` and `Z` based on the `moderate_stage_2` and `moderate_direct` flags, allowing for moderation at the second stage (M → Y) and moderation of the direct effect (X → Y), respectively.

        - **Effect Computation:**
            - **Direct Effect:** The effect of `X` on `Y`, accounting for any moderation specified.
            - **Indirect Effect:** The effect of `X` on `Y` through `M`, considering moderation at the first and second stages.
            - **Total Effect:** The sum of the direct and indirect effects.
            - Effects can be computed at a specific value of the moderator (`moderator_value`), as recommended by Edwards and Lambert (2007). If `moderator_value` is `None`, default coefficients are used.

        - **Bootstrapping:**
            - Performs resampling with replacement to create bootstrap samples.
            - For each bootstrap sample, fits the specified models and computes the effects.
            - Estimates confidence intervals and p-values for the effects based on the distribution of bootstrap estimates.

        **Parameters:**

        - **dataframe (pd.DataFrame):** The dataset containing all variables required for the analysis.
        - **iv (str):** Name of the independent variable (`X`).
        - **mediator (str):** Name of the mediator variable (`M`).
        - **moderator (str):** Name of the moderator variable (`Z`).
        - **dv (str):** Name of the dependent variable (`Y`).
        - **moderate_stage_1 (bool):** Indicates whether to include moderation at the first stage (X → M). Default is `False`.
        - **moderate_stage_2 (bool):** Indicates whether to include moderation at the second stage (M → Y). Default is `False`.
        - **moderate_direct (bool):** Indicates whether to include moderation of the direct effect (X → Y). Default is `False`.
        - **moderator_value (float, optional):** The specific value of the moderator (`Z`) at which to evaluate the effects. According to Edwards and Lambert (2007), setting this value allows testing effects at particular levels of `Z`. If `None`, effects are computed without specifying a `Z` value, typically using mean values or default coefficients.

        **Usage Example:**

        ```python
        # Initialize the analysis with desired settings
        analysis = MediatedModerationAnalysis(
            dataframe=data_df,
            iv='Age',
            mediator='Entorhinal_Cortex_GM_Vol',
            moderator='Z_Scored_Subiculum_Connectivity_T',
            dv='Z_Scored_Percent_Cognitive_Improvement',
            moderate_stage_1=True,
            moderate_stage_2=False,
            moderate_direct=True,
            moderator_value=0  # Example value for Z
        )

        # Fit the first-stage (mediator) model
        analysis.fit_first_stage()

        # Fit the outcome model
        analysis.fit_outcome_model()

        # Compute the effects
        effects = analysis.compute_effects()
        print("Computed Effects:", effects)

        # Perform bootstrapping to estimate confidence intervals and p-values
        analysis.bootstrap(n_bootstraps=10000)

        # Display a summary of the bootstrap results
        summary_df = analysis.summary()
        print(summary_df)
        ```

        **Methods:**

        - **fit_first_stage():**
            - Fits the mediator model based on the specified moderation at the first stage.
            - Updates internal coefficients used in effect computation.

        - **fit_outcome_model():**
            - Fits the outcome model based on the specified moderation at the second stage and/or direct effect.
            - Updates internal coefficients used in effect computation.

        - **compute_effects():**
            - Calculates the direct, indirect, and total effects using the fitted models.
            - Considers the specified `moderator_value` when computing moderated effects.

        - **bootstrap(n_bootstraps=5000):**
            - Performs bootstrapping to estimate the distribution of the effects.
            - Uses the specified number of bootstrap samples (`n_bootstraps`).
            - Updates internal results with bootstrap estimates.

        - **summary():**
            - Returns a pandas DataFrame summarizing the bootstrap results, including point estimates, confidence intervals, and p-values for the indirect, direct, and total effects.

        **Additional Notes:**

        - **Flexibility:** The class can handle any combination of moderation scenarios, including:
            - No moderation (simple mediation).
            - Moderation at the first stage only.
            - Moderation at the second stage only.
            - Moderation of the direct effect only.
            - Any combination of the above.

        - **Data Requirements:** Ensure that the dataset (`dataframe`) contains all the variables specified and that they are properly formatted (e.g., numeric types for continuous variables).

        - **Statistical Assumptions:** The analysis assumes linear relationships between variables and that the data meet the assumptions of regression analysis (e.g., homoscedasticity, normality of residuals).

        **References:**

        - Edwards, J. R., & Lambert, L. S. (2007). **Methods for integrating moderation and mediation: A general analytical framework using moderated path analysis**. *Psychological Methods*, 12(1), 1–22. DOI: [10.1037/1082-989X.12.1.1](https://doi.org/10.1037/1082-989X.12.1.1)

        """
        self.dataframe = dataframe
        self.iv = iv                # Independent Variable (X)
        self.mediator = mediator    # Mediator Variable (M)
        self.moderator = moderator  # Moderator Variable (Z)
        self.dv = dv                # Dependent Variable (Y)
        
        # Moderation Value
        self.z_value = moderator_value
        
        # Moderation Instructions
        self.moderate_stage_1 = moderate_stage_1
        self.moderate_stage_2 = moderate_stage_2
        self.moderate_direct =moderate_direct
        
        # Placeholders for storing results
        self.stage_1_coeffs = None
        self.stage_2_coeffs = None
        self.direct_coeffs = None
        self.indirect_effect = None
        self.direct_effect = None
        self.total_effect = None
        self.bootstrap_results = None

    def fit_first_stage(self):
        """
        Fits the mediator model with first-stage moderation (IV → M).

        Returns:
        - model: Fitted statsmodels OLS regression model.
        """
        # Mediator Model (First-Stage Moderation)
        if self.moderate_stage_1:                                             # Moderate IV->MV
            formula = f"{self.mediator} ~ {self.iv} + {self.moderator} + {self.iv}:{self.moderator}"
            model = ols(formula, data=self.dataframe).fit()
            
            BETA_IV = model.params[[self.iv, f"{self.iv}:{self.moderator}"]]
            BETA_IVZ = model.params[[f"{self.iv}:{self.moderator}"]]
            if self.z_value is not None: 
                self.a_coeffs = BETA_IV + BETA_IVZ*self.z_value
            else:
                self.a_coeffs = BETA_IV + BETA_IVZ
        else:                                                                 # Do not moderate IV->MV
            formula = f"{self.mediator} ~ {self.iv}"
            model = ols(formula, data=self.dataframe).fit()
            self.a_coeffs = model.params[[self.iv]]
        return model

    def fit_outcome_model(self):
        """
        Fits the outcome model

        Returns:
        - model: Fitted statsmodels OLS regression model.
        """
        
        if self.moderate_stage_2 and  self.moderate_direct:                     # Moderate IV->DV and M->DV
            formula = (
            f"{self.dv} ~ {self.iv} + {self.mediator} + {self.moderator} + "
            f"{self.iv}:{self.moderator} + {self.mediator}:{self.moderator}"
            )
            model = ols(formula, data=self.dataframe).fit()
            
            if self.z_value is None:
                self.b_coeffs = model.params[[self.mediator, f"{self.mediator}:{self.moderator}"]]
                self.c_prime_coeff = model.params[[self.iv, f"{self.iv}:{self.moderator}"]]
            else: 
                BETA_M = model.params[[self.mediator, f"{self.mediator}:{self.moderator}"]]
                BETA_MZ = model.params[[f"{self.mediator}:{self.moderator}"]]
                self.b_coeffs = BETA_M + BETA_MZ*self.z_value
                
                BETA_IV = self.c_prime_coeff = model.params[[self.iv]]
                BETA_IVZ = self.c_prime_coeff = model.params[[f"{self.iv}:{self.moderator}"]]
                self.c_prime_coeff = BETA_IV + BETA_IVZ*self.z_value
            
        elif self.moderate_stage_2 and not self.moderate_direct:                # Moderate M->DV only
            formula = (
            f"{self.dv} ~ {self.iv} + {self.mediator} + {self.moderator} + "
            f"{self.mediator}:{self.moderator}"
            )
            model = ols(formula, data=self.dataframe).fit()
            if self.z_value is None:
                self.b_coeffs = model.params[[self.mediator, f"{self.mediator}:{self.moderator}"]]
                self.c_prime_coeff = model.params[[self.iv]]
            else:
                BETA_M = model.params[[self.mediator]]
                BETA_MZ = model.params[[f"{self.mediator}:{self.moderator}"]]
                self.b_coeffs = BETA_M + BETA_MZ*self.z_value
                
                BETA_IV = self.c_prime_coeff = model.params[[self.iv]]
                self.c_prime_coeff = BETA_IV
            
        elif not self.moderate_stage_2 and self.moderate_direct:                # Moderate IV->DV only
            formula = (
            f"{self.dv} ~ {self.iv} + {self.mediator} + {self.moderator} + "
            f"{self.iv}:{self.moderator}"
            )
            model = ols(formula, data=self.dataframe).fit()
            
            if self.z_value is None:
                self.b_coeffs = model.params[[self.mediator]]
                self.c_prime_coeff = model.params[[self.iv, f"{self.iv}:{self.moderator}"]]
            else:
                BETA_M = model.params[[self.mediator]]
                self.b_coeffs = BETA_M
                
                BETA_IV = self.c_prime_coeff = model.params[[self.iv]]
                BETA_IVZ = self.c_prime_coeff = model.params[[f"{self.iv}:{self.moderator}"]]
                self.c_prime_coeff = BETA_IV + BETA_IVZ*self.z_value
            
        else:                                                                   # Moderate neither.
            formula = (
            f"{self.dv} ~ {self.iv} + {self.mediator}"
            )
            model = ols(formula, data=self.dataframe).fit()
            
            self.b_coeffs = model.params[[self.mediator]]
            self.c_prime_coeff = model.params[[self.iv]]
        return model

    def compute_effects(self):
        """
        Computes the direct, indirect, and total effects at a given level of the moderator.

        Parameters:
        - z_value: float, value of the moderator variable (Z) at which to evaluate the effects.

        Returns:
        - dict containing 'indirect_effect', 'direct_effect', and 'total_effect'.
        """
        # Indirect Effect, Accounting for First Stage (a path) and Second Stage (b path)
        self.indirect_effect = np.sum(self.a_coeffs*self.b_coeffs)
        self.direct_effect = self.a_coeffs
        self.total_effect = self.indirect_effect + self.direct_effect

        return {
            'indirect_effect': self.indirect_effect,
            'direct_effect': self.direct_effect,
            'total_effect': self.total_effect
            }

    def bootstrap(self, n_bootstraps=10000):
        """
        Performs bootstrapping to estimate confidence intervals and p-values
        for the direct, indirect, and total effects.

        Parameters:
        - n_bootstraps: int, number of bootstrap samples (default is 10000).

        Returns:
        - dict containing bootstrap results for 'Indirect Effect', 'Direct Effect', and 'Total Effect'.
        """
        indirect_effects = []
        direct_effects = []
        total_effects = []
        successful_samples = 0

        with tqdm(total=n_bootstraps) as pbar:
            while successful_samples < n_bootstraps:
                # Resample the data with replacement
                sample = self.dataframe.sample(frac=1, replace=True)

                # Temporarily replace the dataframe with the bootstrap sample
                original_dataframe = self.dataframe
                self.dataframe = sample

                # Fit models and compute effects using the existing methods
                first_stage_model = self.fit_first_stage()
                outcome_model = self.fit_outcome_model()
                effects = self.compute_effects()
                
                if self._has_invalid_params(first_stage_model, outcome_model):
                    self.dataframe = original_dataframe
                    continue

                # Collect the effects
                indirect_effects.append(effects['indirect_effect'])
                direct_effects.append(effects['direct_effect'])
                total_effects.append(effects['total_effect'])

                successful_samples += 1
                pbar.update(1)

                # Restore the original dataframe
                self.dataframe = original_dataframe

        # Convert to numpy arrays
        indirect_effects = np.array(indirect_effects)
        direct_effects = np.array(direct_effects)
        total_effects = np.array(total_effects)

        # Compute statistics
        results = {}
        for effect_name, effects_array in zip(
            ['Indirect Effect', 'Direct Effect', 'Total Effect'],
            [indirect_effects, direct_effects, total_effects]
        ):
            mean_effect = np.mean(effects_array)
            ci_lower = np.percentile(effects_array, 2.5)
            ci_upper = np.percentile(effects_array, 97.5)
            p_value = np.mean(np.sign(mean_effect) * effects_array <= 0)
            results[effect_name] = {
                'Point Estimate': mean_effect,
                '2.5th Percentile': ci_lower,
                '97.5th Percentile': ci_upper,
                'P-value': p_value
            }

        self.bootstrap_results = results
        return results
    
    def _has_invalid_params(self, debug=False):
        """
        Checks if any of the model parameters or computed effects contain NaNs or infinite values.

        Returns:
        - bool: True if invalid values are present, False otherwise.
        """
        # Check computed effects
        if any([
            np.isnan(self.indirect_effect).any(), np.isinf(self.indirect_effect).any(),
            np.isnan(self.direct_effect).any(), np.isinf(self.direct_effect).any(),
            np.isnan(self.total_effect).any(), np.isinf(self.total_effect).any(),
            ]):
            if debug: 
                print("Indirect effect: ", self.indirect_effect)
                print("Direct effect: ", self.direct_effect)
                print("Total effect: ", self.total_effect)
                print("First Stage Params: ", self.a_coeffs)
                print("Second Stage Params: ", self.b_coeffs)
                print("Direct Params: ", self.c_prime_coeff)
            return True

        return False

    def summary(self):
        """
        Returns a summary DataFrame of the bootstrap results.

        Returns:
        - pandas DataFrame with the bootstrap results.
        """
        if self.bootstrap_results is None:
            print("No bootstrap results available. Please run the bootstrap method first.")
            return
        print("\nIV->MV Effects: \n", self.a_coeffs)
        print("\nMV->DV Effects: \n", self.b_coeffs)
        print("\nIV->DV Effects: \n", self.c_prime_coeff)
        print("Done. Call the .results object to visualize your results.")
        self.results = pd.DataFrame(self.bootstrap_results).T
    
    def run(self):
        self.bootstrap()
        self.summary()


In [174]:
data_df.columns

Index(['subject', 'Age', 'Hippocampus_GM_Vol', 'Parahippocampal_Gyrus_GM_Vol',
       'Entorhinal_Cortex_GM_Vol', 'Normalized_Percent_Cognitive_Improvement',
       'Z_Scored_Percent_Cognitive_Improvement_By_Origin_Group',
       'Z_Scored_Percent_Cognitive_Improvement',
       'Percent_Cognitive_Improvement',
       'Z_Scored_Subiculum_T_By_Origin_Group_',
       'Z_Scored_Subiculum_Connectivity_T', 'Subiculum_Connectivity_T_Redone',
       'Subiculum_Connectivity_T', 'Amnesia_Lesion_T_Map', 'Memory_Network_T',
       'Z_Scored_Memory_Network_R', 'Memory_Network_R',
       'Subiculum_Grey_Matter', 'Subiculum_White_Matter', 'Subiculum_CSF',
       'Subiculum_Total', 'Standardized_Age',
       'Standardized_Percent_Improvement',
       'Standardized_Subiculum_Connectivity',
       'Standardized_Subiculum_Grey_Matter',
       'Standardized_Subiculum_White_Matter', 'Standardized_Subiculum_CSF',
       'Standardized_Subiculum_Total', 'Disease', 'Cohort', 'City',
       'Inclusion_Cohort', 

In [175]:
medmod = MediatedModerationAnalysis(
    dataframe=data_df, 
    iv='Subiculum_Connectivity_T', 
    mediator='Hippocampus_GM_Vol', 
    moderator='Age', 
    dv='Z_Scored_Percent_Cognitive_Improvement', 
    moderate_stage_1=True, 
    moderate_stage_2=True, 
    moderate_direct=True, 
    moderator_value=None
    )
medmod.run()
display(medmod.results)

  0%|          | 0/10000 [00:40<?, ?it/s]


KeyboardInterrupt: 

In [103]:
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols
from tqdm import tqdm
sample = data_df.sample(frac=1, replace=True)
formula = f"Hippocampus_GM_Vol ~ Subiculum_Connectivity_T + Age + Subiculum_Connectivity_T:Age"
model = ols(formula, data=sample).fit()
model.params
                                                   

Intercept                       19.963265
Subiculum_Connectivity_T        -0.173782
Age                             -0.209350
Subiculum_Connectivity_T:Age     0.002357
dtype: float64

Old

In [48]:
def calculate_confidence_intervals(ab_paths, mediators):
    """
    Calculates the confidence intervals and p-value based on the bootstrapped samples.

    Parameters:
    - ab_paths: list of lists containing the bootstrapped ab paths for each mediator.
    - total_indirect_effects: list of bootstrapped summed ab paths.
    - mediators: list of mediator names.

    Returns:
    - DataFrame with the mean indirect effect, confidence intervals, and p-values for each mediator and the total indirect effect.
    """
    ab_path_values = np.array(ab_paths)

    # Check if there's only one mediator
    if isinstance(mediators, str):
        mediators = [mediators]

    # Calculate mean indirect effect and confidence intervals for each mediator
    mean_ab_paths = np.mean(ab_path_values, axis=0)
    lower_bounds = np.percentile(ab_path_values, 2.5, axis=0)
    upper_bounds = np.percentile(ab_path_values, 97.5, axis=0)

    # Calculate p-values for each mediator
    ab_path_p_values = [np.mean(np.sign(mean_ab_paths) * ab_path_values <= 0)]

    # Create DataFrame to store the results
    result_df = pd.DataFrame({
        'Point Estimate': mean_ab_paths,
        '2.5th Percentile': lower_bounds,
        '97.5th Percentile': upper_bounds,
        'P-value': ab_path_p_values
    }, index=mediators)

    return result_df
import numpy as np
import pandas as pd
from tqdm import tqdm
from statsmodels.formula.api import ols
def perform_partial_mediated_moderation_analysis(dataframe, exposure, mediator, moderator, dependent_variable, bootstrap_samples=5000):
    """
    Performs a partial mediated moderation analysis by estimating the joint indirect effects of an exposure variable
    through a mediator on a dependent variable using bootstrapping, considering the moderating effect of another variable.

    Parameters:
    - dataframe: DataFrame containing the data.
    - exposure: str, column name of the exposure variable.
    - mediator: str, column name of the mediator variable.
    - moderator: str, column name of the moderator variable.
    - dependent_variable: str, column name of the dependent variable.
    - bootstrap_samples: int, optional, number of bootstrap samples to be used (default is 5000).

    Returns:
    - DataFrame with the mean indirect effect, confidence intervals, and p-values for the indirect effect.

    Example Usage:
    result_df = perform_mediated_moderation_analysis(data_df, exposure='Age',
                                                     mediator='Brain_Lobe',
                                                     moderator='Stimulation',
                                                     dependent_variable='Outcome')
    """

    ab_paths = []

    # Loop over each bootstrap sample
    for i in tqdm(range(bootstrap_samples)):
        # Resample the data with replacement
        sample = dataframe.sample(frac=1, replace=True)

        # Fit the models and calculate the indirect effect for this bootstrap sample
        model_M = ols(f"{mediator} ~ {moderator}", data=sample).fit()
        model_Y = ols(f"{dependent_variable} ~ {exposure} + {mediator} + {moderator} + {exposure}:{mediator} + {exposure}:{moderator}", data=sample).fit()

        indirect_effect = model_M.params[moderator] * model_Y.params[f'{exposure}:{mediator}']

        # Append the indirect effect to the list
        ab_paths.append(indirect_effect)
    print(len(ab_paths))
    # Calculate confidence intervals and p-values
    result_df = calculate_confidence_intervals(ab_paths, mediators=mediator)

    return result_df

def calculate_direct_effects(
    dataframe, exposure, mediator, moderator, dependent_variable, bootstrap_samples=5000
):
    """
    Calculates the direct effect of the exposure on the dependent variable while controlling for the mediator and moderator,
    using bootstrapping to estimate confidence intervals and p-values.

    Parameters:
    - dataframe: DataFrame containing the data.
    - exposure: str, column name of the exposure variable.
    - mediator: str, column name of the mediator variable.
    - moderator: str, column name of the moderator variable.
    - dependent_variable: str, column name of the dependent variable.
    - bootstrap_samples: int, number of bootstrap samples to be used.

    Returns:
    - DataFrame with the mean direct effect, confidence intervals, and p-values.
    """

    direct_effects = []

    # Loop over each bootstrap sample
    for _ in tqdm(range(bootstrap_samples)):
        # Resample the data with replacement
        sample = dataframe.sample(frac=1, replace=True)

        # Fit the model for the dependent variable
        model_Y = ols(
            f"{dependent_variable} ~ {exposure} + {mediator} + {moderator} + {exposure}:{mediator} + {exposure}:{moderator}",
            data=sample
        ).fit()

        # Direct effect of exposure on dependent variable (c')
        direct_effect = model_Y.params[exposure]

        # Append the direct effect to the list
        direct_effects.append(direct_effect)

    # Calculate confidence intervals and p-values
    result_df = calculate_confidence_intervals(direct_effects, mediators='Direct Effect')

    return result_df


In [None]:
data_df.columns

In [None]:
perform_partial_mediated_moderation_analysis(dataframe=data_df, 
                                             exposure='Age', 
                                             mediator='Entorhinal_Cortex_GM_Vol', 
                                             moderator='Z_Scored_Subiculum_Connectivity_T', 
                                             dependent_variable='Z_Scored_Percent_Cognitive_Improvement', 
                                             bootstrap_samples=10000)

In [None]:
direct_result = calculate_direct_effects(
    dataframe=data_df,
    exposure='Age',
    mediator='Hippocampus_GM_Vol',
    moderator='Z_Scored_Subiculum_Connectivity_T',
    dependent_variable='Z_Scored_Percent_Cognitive_Improvement',
    bootstrap_samples=10000
)
print(direct_result)

# Conditional Process Analysis
Based on Preacher and Hayes 2008, Asymptotic Asymptotic and resampling strategies for assessing and comparing indirect effects in multiple mediator models

__________  
Multiple Mediators and Moderation Analysis
Also known as a conditional process analysis
the two code segments below will vary the moderator, 
1) allowing the moderator to interact with the exposure upon mediators,
or 
2) allowing the moderator to interact with the mediators upon outcome. 

**Multiple Mediator Analysis with First Stager Moderator**:

This code estimates the joint indirect effects of an exposure variable through multiple mediators on a dependent variable while considering the moderation effect of a specified moderator variable.

In this code, similar to the previous code, the function loops over each bootstrap sample and computes the indirect effects for each mediator, considering the moderation effect of the specified moderator variable in the first stage. It calculates the total indirect effect as the sum of the individual indirect effects for each mediator. The mean indirect effect, 95% confidence interval, and p-value are then calculated based on the bootstrap samples.

The inclusion of the moderator variable and its interaction with the exposure variable in the first stage allows for examining how the mediation effects may vary based on different levels of the moderator. By considering the moderation effect at the first stage, this analysis investigates the conditional indirect effects of the exposure through the mediators.


In [83]:
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols
def calculate_confidence_intervals(ab_paths, total_indirect_effects, mediators):
    """
    Calculates the confidence intervals and p-value based on the bootstrapped samples.

    Parameters:
    - ab_paths: list of lists containing the bootstrapped ab paths for each mediator.
    - total_indirect_effects: list of bootstrapped summed ab paths.
    - mediators: list of mediator names.

    Returns:
    - DataFrame with the mean indirect effect, confidence intervals, and p-values for each mediator and the total indirect effect.
    """

    ab_path_values = np.array(ab_paths)
    total_indirect_effects = np.array(total_indirect_effects)

    # Calculate mean indirect effect and confidence intervals for each mediator
    mean_ab_paths = np.mean(ab_path_values, axis=0)
    lower_bounds = np.percentile(ab_path_values, 2.5, axis=0)
    upper_bounds = np.percentile(ab_path_values, 97.5, axis=0)

    # Calculate p-values for each mediator
    ab_path_p_values = [np.mean(np.sign(mean_ab_paths[i]) * ab_path_values[:, i] <= 0) for i in range(len(mean_ab_paths))]


    # Calculate mean indirect effect and confidence intervals for the total indirect effect
    mean_total_indirect_effect = np.mean(total_indirect_effects)
    lower_bound_total = np.percentile(total_indirect_effects, 2.5)
    upper_bound_total = np.percentile(total_indirect_effects, 97.5)
    p_value_total = np.mean(total_indirect_effects > 0) if np.mean(total_indirect_effects) < 0 else np.mean(total_indirect_effects <= 1)

    # Create DataFrame to store the results
    result_df = pd.DataFrame({
        'Point Estimate': np.concatenate((mean_ab_paths, [mean_total_indirect_effect])),
        '2.5th Percentile': np.concatenate((lower_bounds, [lower_bound_total])),
        '97.5th Percentile': np.concatenate((upper_bounds, [upper_bound_total])),
        'P-value': ab_path_p_values + [p_value_total]
    }, index=mediators + ['Total Indirect Effect'])

    return result_df

def perform_multiple_mediation_first_stage_moderator_analysis(dataframe, exposure, mediators, moderator, dependent_variable, bootstrap_samples=5000):
    """
    Performs a multiple mediation analysis by estimating the joint indirect effects of an exposure variable
    through multiple mediators on a dependent variable using bootstrapping.

    Parameters:
    - dataframe: DataFrame containing the data.
    - exposure: str, column name of the exposure variable (e.g., 'Age').
    - mediators: list, column names of the mediator variables (e.g., ['Brain_Lobe1', 'Brain_Lobe2']).
    - moderator: str, column name of the moderator variable.
    - dependent_variable: str, column name of the dependent variable (e.g., 'outcome').
    - bootstrap_samples: int, optional, number of bootstrap samples to be used (default is 5000).

    Returns:
    - None. Prints the mean indirect effect and 95% confidence interval.

    Example Usage:
    perform_multiple_mediation_analysis(data_df, exposure='Age', mediators=['Brain_Lobe1', 'Brain_Lobe2'], dependent_variable='outcome')
    """

    ab_paths, total_indirect_effects = [], []
    
    # Loop over each bootstrap sample
    for i in tqdm(range(bootstrap_samples)):
        # Resample the data with replacement
        sample = dataframe.sample(frac=1, replace=True)
        # Paths from exposure to mediators and from mediators to DV
        ab_paths_sample = [ols(f"{mediator} ~ {exposure} + {moderator} + {exposure}:{moderator}", data=sample).fit().params[exposure] * 
                    ols(f"{dependent_variable} ~ {exposure} + {moderator} + {exposure}:{moderator} + {mediator}", data=sample).fit().params[mediator]
                    for mediator in mediators]
        
        # Sum the individual indirect effects for this bootstrap sample
        total_indirect_effect = sum(ab_paths_sample)
        ab_paths.append(ab_paths_sample)
        total_indirect_effects.append(total_indirect_effect)
    
    # Calculate confidence intervals and p-values
    result_df = calculate_confidence_intervals(ab_paths, total_indirect_effects, mediators)

    return result_df


In [None]:
print(data_df.columns)

In [None]:
perform_multiple_mediation_first_stage_moderator_analysis(data_df, 
                                                    exposure='Z_Scored_Subiculum_T_By_Origin_Group_', 
                                                    moderator='Age', 
                                                    mediators= ['Subiculum_Grey_Matter'],
                                                    dependent_variable='Z_Scored_Percent_Cognitive_Improvement',
                                                    bootstrap_samples=10000)

--------------------
Moderated Mulitple Mediation Analysis, but with all mediators combined instead of run individually
- This is not supported natively in Python libraries, so I have developed it myself. 

Experimental, please use with caution. 

_____
**Multiple Mediator Analysis with Second Stage Moderator**


In this code, the function loops over each bootstrap sample and computes the indirect effects for each mediator, taking into account the moderation effect of the specified moderator variable. It calculates the total indirect effect as the sum of the individual indirect effects for each mediator. The mean indirect effect, 95% confidence interval, and p-value are then calculated based on the bootstrap samples.

The addition of the moderator variable in the second-stage analysis allows for assessing how the mediation effects may vary across different levels of the moderator. By including the moderator variable and its interaction with the mediators, the analysis investigates whether the indirect effects of the exposure through the mediators differ depending on the levels of the moderator.

Overall, this code provides a way to examine multiple mediation effects while considering a second-stage moderation analysis to explore potential moderation effects.

In [86]:
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols
def calculate_confidence_intervals(ab_paths, total_indirect_effects, mediators):
    """
    Calculates the confidence intervals and p-value based on the bootstrapped samples.

    Parameters:
    - ab_paths: list of lists containing the bootstrapped ab paths for each mediator.
    - total_indirect_effects: list of bootstrapped summed ab paths.
    - mediators: list of mediator names.

    Returns:
    - DataFrame with the mean indirect effect, confidence intervals, and p-values for each mediator and the total indirect effect.
    """

    ab_path_values = np.array(ab_paths)
    total_indirect_effects = np.array(total_indirect_effects)

    # Calculate mean indirect effect and confidence intervals for each mediator
    mean_ab_paths = np.mean(ab_path_values, axis=0)
    lower_bounds = np.percentile(ab_path_values, 2.5, axis=0)
    upper_bounds = np.percentile(ab_path_values, 97.5, axis=0)

    # Calculate p-values for each mediator
    ab_path_p_values = [np.mean(np.sign(mean_ab_paths[i]) * ab_path_values[:, i] <= 0) for i in range(len(mean_ab_paths))]


    # Calculate mean indirect effect and confidence intervals for the total indirect effect
    mean_total_indirect_effect = np.mean(total_indirect_effects)
    lower_bound_total = np.percentile(total_indirect_effects, 2.5)
    upper_bound_total = np.percentile(total_indirect_effects, 97.5)
    p_value_total = np.mean(total_indirect_effects > 0) if np.mean(total_indirect_effects) < 0 else np.mean(total_indirect_effects <= 1)

    # Create DataFrame to store the results
    result_df = pd.DataFrame({
        'Point Estimate': np.concatenate((mean_ab_paths, [mean_total_indirect_effect])),
        '2.5th Percentile': np.concatenate((lower_bounds, [lower_bound_total])),
        '97.5th Percentile': np.concatenate((upper_bounds, [upper_bound_total])),
        'P-value': ab_path_p_values + [p_value_total]
    }, index=mediators + ['Total Indirect Effect'])

    return result_df

def perform_multiple_mediation_second_stage_moderator_analysis(dataframe, exposure, mediators, moderator, dependent_variable, bootstrap_samples=5000):
    """
    Performs a multiple mediation analysis by estimating the joint indirect effects of an exposure variable
    through multiple mediators on a dependent variable using bootstrapping.

    Parameters:
    - dataframe: DataFrame containing the data.
    - exposure: str, column name of the exposure variable (e.g., 'Age').
    - mediators: list, column names of the mediator variables (e.g., ['Brain_Lobe1', 'Brain_Lobe2']).
    - moderator: str, column name of the moderator variable.
    - dependent_variable: str, column name of the dependent variable (e.g., 'outcome').
    - bootstrap_samples: int, optional, number of bootstrap samples to be used (default is 5000).

    Returns:
    - None. Prints the mean indirect effect and 95% confidence interval.

    Example Usage:
    perform_multiple_mediation_analysis(data_df, exposure='Age', mediators=['Brain_Lobe1', 'Brain_Lobe2'], dependent_variable='outcome')
    """

    ab_paths, total_indirect_effects = [], []
    
    # Loop over each bootstrap sample
    for i in tqdm(range(bootstrap_samples)):
        # Resample the data with replacement
        sample = dataframe.sample(frac=1, replace=True)
        # Paths from exposure to mediators and from mediators to DV
        ab_paths_sample = [ols(f"{mediator} ~ {exposure}", data=sample).fit().params[exposure] * 
                    ols(f"{dependent_variable} ~ {exposure} + {moderator} + {mediator}:{moderator} + {mediator}", data=sample).fit().params[mediator]
                    for mediator in mediators]
        
        # Sum the individual indirect effects for this bootstrap sample
        total_indirect_effect = sum(ab_paths_sample)
        ab_paths.append(ab_paths_sample)
        total_indirect_effects.append(total_indirect_effect)
    
    # Calculate confidence intervals and p-values
    result_df = calculate_confidence_intervals(ab_paths, total_indirect_effects, mediators)


    return result_df


In [None]:
data_df.columns

In [None]:
perform_multiple_mediation_second_stage_moderator_analysis(data_df, 
                                               exposure = 'Subiculum_Connectivity_T', 
                                               moderator = 'Age', 
                                               mediators= ['Subiculum_Grey_Matter'],
                                               dependent_variable='Z_Scored_Percent_Cognitive_Improvement',
                                               bootstrap_samples=10000)

In [None]:
data_df.columns

In [None]:
results_df = perform_mediated_moderation_analysis(dataframe = data_df,
                                                  exposure = 'Subiculum_Connectivity', 
                                                  mediator = 'Subiculum_Damage_Score', 
                                                  moderator = 'Age', 
                                                  dependent_variable ='outcome', 
                                                  bootstrap_samples=5000)
results_df

In [None]:
np.sqrt(.23)

# Complex Models

**Mediated Moderator Analysis AKA Second Stage Moderation Mediation**

The indirect effect of the moderator on the outcome through the mediator is conditional on the exposure and its interactions with the mediator and the moderator.
_____

Let's break this down:

Similar to my previous examples, we will use the same variables. This model suggests that:

Age (the initial moderator) interacts with DBS (the exposure) to influence the outcome.
However, the moderation effect of age might not be the "true" or direct moderation. Instead, it might be that something underlying age (another variable) is what's truly driving this moderation.
In terms of a pathway:

DBS influences the outcome.
The effect of DBS on the outcome varies based on age.
However, the moderating effect of age is itself influenced (or mediated) by another variable. This other variable could be a more direct measure or something that captures the biological, psychological, or physiological changes that come with age.
Here's a step-by-step breakdown of the model:

Mediator Model (for the "true" underlying mediator of age's moderation):
- Mediator is predicted by age.

Outcome Model:
- Outcome is predicted by DBS, the mediator, age, and interactions between DBS and both age and the mediator.
This model seeks to understand if the interaction between age and DBS is not just because of chronological age itself but due to some underlying factor associated with age.

Is it just a mediated moderation?
It's close but slightly different. In a typical mediated moderation:

The exposure affects the mediator.
The mediator, the exposure, and their interaction influence the outcome.In the Second Stage Moderation Mediation:

Age (the moderator) affects the mediator.
The mediator, the exposure (DBS), age, and their interactions influence the outcome.

In [None]:
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols
from statsmodels.stats.mediation import Mediation

def mediated_moderator_analysis(dataframe, exposure, mediator, moderator, dependent_variable, bootstrap_samples=5000):
    """
    Performs a mediated moderator analysis.

    Returns:
    - DataFrame with the mean indirect effect, confidence intervals, and p-values for the indirect effect.

    Example Usage:
    result_df = mediated_moderator_analysis(data_df, exposure='DBS',
                                            mediator='UnderlyingVariable',
                                            moderator='Age',
                                            dependent_variable='Outcome')
    """

    ab_paths = []

    # Loop over each bootstrap sample
    for i in range(bootstrap_samples):
        # Resample the data with replacement
        sample = dataframe.sample(frac=1, replace=True)

        # Fit the mediator model: Mediator ~ Moderator
        model_M = ols(f"{mediator} ~ {moderator}", data=sample).fit()

        # Fit the outcome model: Outcome ~ Exposure + Mediator + Moderator + Exposure:Mediator + Exposure:Moderator
        model_Y = ols(f"{dependent_variable} ~ {exposure} + {mediator} + {moderator} + {exposure}:{mediator} + {exposure}:{moderator}", data=sample).fit()

        # Calculate the indirect effect for this bootstrap sample
        indirect_effect = model_M.params[moderator] * (model_Y.params[exposure] + model_Y.params[f'{exposure}:{mediator}'])

        # Append the indirect effect to the list
        ab_paths.append(indirect_effect)

    # Calculate confidence intervals and p-values
    result_df = calculate_confidence_intervals(ab_paths, mediators=mediator)

    return result_df


In [None]:
data_df.columns

In [None]:
mediated_moderator_analysis(dataframe=data_df, 
                            exposure='Subiculum_Connectivity', 
                            mediator='Whole_Brain', 
                            moderator='Age', 
                            dependent_variable='outcome', 
                            bootstrap_samples=5000)