# Run A Mixed Effects Model

### Authors: Calvin Howard.

#### Last updated: July 6, 2023

Use this to assess whether a predictor's relationship to the outcome differs between groups.

Notes:
- To best use this notebook, you should be familiar with mixed effects models

# 00 - Import CSV with All Data
**The CSV is expected to be in this format**
- ID and absolute paths to niftis are critical
```
+-----+----------------------------+--------------+--------------+--------------+
| ID  | Nifti_File_Path            | Covariate_1  | Covariate_2  | Covariate_3  |
+-----+----------------------------+--------------+--------------+--------------+
| 1   | /path/to/file1.nii.gz      | 0.5          | 1.2          | 3.4          |
| 2   | /path/to/file2.nii.gz      | 0.7          | 1.4          | 3.1          |
| 3   | /path/to/file3.nii.gz      | 0.6          | 1.5          | 3.5          |
| 4   | /path/to/file4.nii.gz      | 0.9          | 1.1          | 3.2          |
| ... | ...                        | ...          | ...          | ...          |
+-----+----------------------------+--------------+--------------+--------------+
```
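A quick sanity check of the table before running the rest of the notebook might look like the sketch below. It assumes pandas is available and uses the illustrative column names from the table above (`ID`, `Nifti_File_Path`). The helper `check_master_table` is hypothetical and is not part of `calvin_utils`.

```python
import os
import pandas as pd

def check_master_table(df, id_col="ID", path_col="Nifti_File_Path"):
    """Return a list of problems found in the ID and NIfTI path columns."""
    problems = []
    for col in (id_col, path_col):
        if col not in df.columns:
            problems.append(f"missing column: {col}")
    if problems:
        return problems
    if df[id_col].duplicated().any():
        problems.append("duplicate IDs")
    # Paths must be absolute so they resolve regardless of the working directory
    relative = [p for p in df[path_col] if not os.path.isabs(str(p))]
    if relative:
        problems.append(f"{len(relative)} relative path(s)")
    return problems

# Toy table: the third path is deliberately relative
example = pd.DataFrame({
    "ID": [1, 2, 3],
    "Nifti_File_Path": ["/path/to/file1.nii.gz", "/path/to/file2.nii.gz", "file3.nii.gz"],
    "Covariate_1": [0.5, 0.7, 0.6],
})
print(check_master_table(example))
```

An empty list means the two critical columns passed these basic checks.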

In [None]:
# Specify the path to your CSV or Excel file containing NIfTI paths
input_csv_path = '/Users/cu135/Dropbox (Partners HealthCare)/studies/cognition_2023/metadata/master_list_proper_subjects.xlsx'

In [None]:
# Specify where you want to save your results to
out_dir = '/Users/cu135/Library/CloudStorage/OneDrive-Personal/OneDrive_Documents/Research/2023/subiculum_cognition_and_age/figures/Figures/retrospective_cohorts_figure/mixed_effect_analyses'

In [None]:
from calvin_utils.permutation_analysis_utils.statsmodels_palm import CalvinStatsmodelsPalm
# Instantiate the CalvinStatsmodelsPalm class
cal_palm = CalvinStatsmodelsPalm(input_csv_path=input_csv_path, output_dir=out_dir, sheet='master_list_proper_subjects')
# Call the read_and_display_data method
data_df = cal_palm.read_and_display_data()


# 01 - Preprocess Your Data

**Handle NANs**
- Set drop_nans=True if you would like to remove NaNs from data
- Provide a column name or a list of column names to remove NaNs from
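For intuition, dropping NaNs from selected columns can be sketched in plain pandas as below. This is an assumed equivalent using `DataFrame.dropna(subset=...)` on a toy dataframe, not necessarily what `drop_nans_from_columns` does internally.

```python
import numpy as np
import pandas as pd

toy_df = pd.DataFrame({
    "Age": [61.0, np.nan, 70.0, 55.0],
    "Score": [0.2, 0.4, np.nan, 0.9],
    "Notes": [None, "ok", "ok", "ok"],  # missing values here are ignored below
})

# Drop rows with a NaN in any of the listed columns only
columns_to_check = ["Age", "Score"]
clean_df = toy_df.dropna(subset=columns_to_check).reset_index(drop=True)
print(f"Kept {len(clean_df)} of {len(toy_df)} rows")
```

Restricting the check to the model's columns avoids discarding subjects because of gaps in unrelated columns.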

In [None]:
data_df.columns

In [None]:
drop_list = ['Z_Scored_Subiculum_Connectivity_T', 'Age', 'Z_Scored_Percent_Cognitive_Improvement']

In [None]:
data_df = cal_palm.drop_nans_from_columns(columns_to_drop_from=drop_list)
display(data_df)

**Drop Row Based on Value of Column**

Define the column, condition, and value for dropping rows
- column = 'your_column_name'
- condition = 'above'  # Options: 'equal', 'above', 'below'
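The three conditions can be sketched as boolean masks in pandas. The function below is a hypothetical stand-in for illustration. It assumes 'above' and 'below' are strict comparisons and that matching rows are the ones dropped, which may differ from the `calvin_utils` implementation.

```python
import pandas as pd

def split_rows_by_condition(df, column, condition, value):
    """Return (kept_df, dropped_df); rows matching the condition are dropped."""
    if condition == "equal":
        mask = df[column] == value
    elif condition == "above":
        mask = df[column] > value   # assumed strict comparison
    elif condition == "below":
        mask = df[column] < value   # assumed strict comparison
    else:
        raise ValueError("condition must be 'equal', 'above', or 'below'")
    return df[~mask].copy(), df[mask].copy()

toy_df = pd.DataFrame({"City": ["Boston", "Toronto", "Boston"], "Age": [60, 72, 81]})
kept, dropped = split_rows_by_condition(toy_df, "City", "equal", "Boston")
print(kept)
```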

In [None]:
data_df.columns

Set the parameters for dropping rows

In [None]:
column = 'City'  # The column you'd like to evaluate
condition = 'equal'  # The condition to check ('equal', 'above', 'below')
value = 'Boston'  # The value to compare against

In [None]:
data_df, dropped_df = cal_palm.drop_rows_based_on_value(column, condition, value)

**Standardize Data**
- Enter the columns you don't want to standardize into a list
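Standardizing here is taken to mean z-scoring (subtract the mean, divide by the standard deviation). The sketch below shows that idea on a toy dataframe. The helper `zscore_columns` is hypothetical, and the use of the sample standard deviation is an assumption rather than a statement about `standardize_columns`.

```python
import pandas as pd

def zscore_columns(df, skip=None):
    """Z-score every numeric column except those named in `skip`."""
    skip = set(skip or [])
    out = df.copy()
    for col in out.select_dtypes(include="number").columns:
        if col in skip:
            continue
        # pandas' .std() uses the sample standard deviation (ddof=1) by default
        out[col] = (out[col] - out[col].mean()) / out[col].std()
    return out

toy_df = pd.DataFrame({"Age": [50.0, 60.0, 70.0], "ID": [1, 2, 3], "City": ["A", "B", "C"]})
scaled = zscore_columns(toy_df, skip=["ID"])
print(scaled)
```

Identifier columns such as `ID` are numeric but meaningless to scale, which is why a skip list is useful.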

In [None]:
# Remove anything you don't want to standardize
cols_not_to_standardize = None #['']

In [None]:
data_df = cal_palm.standardize_columns(cols_not_to_standardize)
data_df

# 02 - Define Your Formula

This is the formula relating the outcome to the predictors. It takes the form:
- y = B0 + B1*x1 + B2*x2 + B3*x3 + . . . + BN*xN

It is defined using the columns of your dataframe instead of the variables above:
- 'Apples_Picked ~ hours_worked + owns_apple_picking_machine'

____
Use the printout below to design your formula. 
- Left of the "~" symbol is the outcome to be predicted. 
- Right of the "~" symbol are the predictors. 
- ":" indicates an interaction between two predictors. 
- "*" indicates an interaction AND includes the simple (main) effects of each predictor too. 
- "+" adds another predictor. 
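To make the operators concrete, the sketch below expands a formula string into its individual terms using only the Python standard library. It is a toy parser written for illustration (two-variable `*` products only) and is not how statsmodels or patsy parse formulas.

```python
def expand_term(term):
    """Expand 'a*b' into ['a', 'b', 'a:b']; leave other terms unchanged.

    Handles two-variable products only, which is all this sketch needs.
    """
    if "*" in term:
        left, right = [part.strip() for part in term.split("*")]
        return [left, right, f"{left}:{right}"]
    return [term.strip()]

def describe_formula(formula):
    """Split a formula into its outcome and a flat list of predictor terms."""
    outcome, rhs = [side.strip() for side in formula.split("~")]
    terms = []
    for term in rhs.split("+"):
        for expanded in expand_term(term):
            if expanded not in terms:
                terms.append(expanded)
    return outcome, terms

outcome, terms = describe_formula("Apples_Picked ~ hours_worked * owns_apple_picking_machine")
print(outcome, terms)
```

The `*` operator therefore yields three terms: both simple effects plus their interaction.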

In [None]:
data_df.columns

In [None]:
formula = "Z_Scored_Percent_Cognitive_Improvement ~ Age*Z_Scored_Subiculum_Connectivity_T"

# 03 - Visualize Your Design Matrix

This is the explanatory variable half of your regression formula
_______________________________________________________
Create Design Matrix: Use the create_design_matrix method. You can provide a list of formula variables which correspond to column names in your dataframe.

- design_matrix = palm.create_design_matrix(formula_vars=["var1", "var2", "var1*var2"])
- To include interaction terms, use * between variables, like "var1*var2".
- By default, an intercept will be added unless you set intercept=False
- **don't explicitly add the 'intercept' column. I'll do it for you.**
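As a rough picture of what a design matrix for `var1*var2` contains, the sketch below builds one by hand with pandas on toy data. The column names follow the usual formula convention (`Intercept`, `var1:var2`), and the actual output of the notebook's method may be labeled differently.

```python
import pandas as pd

toy_df = pd.DataFrame({"var1": [1.0, 2.0, 3.0], "var2": [0.5, -0.5, 1.5]})

design = pd.DataFrame({
    "Intercept": 1.0,                              # constant column for B0
    "var1": toy_df["var1"],
    "var2": toy_df["var2"],
    "var1:var2": toy_df["var1"] * toy_df["var2"],  # interaction = elementwise product
})
print(design)
```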

In [None]:
# Define the design matrix
outcome_matrix, design_matrix = cal_palm.define_design_matrix(formula, data_df)
design_matrix

# 04 - Visualize Your Dependent Variable

I have generated this for you based on the formula you provided

In [None]:
outcome_matrix

# 05 - Define your Groups to Assess Between

In [None]:
data_df.columns

In [None]:
groups = 'City'

# 06 - Are You Allowing Random Intercepts?
- Set this to False if you do not want to do this. However, it is generally best to define a random intercept in a mixed effects model.

In [None]:
random_intercepts = True

# 07 - What Columns Would You Like to Perform Random Slopes On?
- Set this to None if you would not like to set random slopes.
- Set to a list of column names that you would like to test.

In [None]:
design_matrix.columns

In [None]:
random_slopes = None

# 07 - Run The Model

In [None]:
import statsmodels.formula.api as smf

try:
    print('Running original mixed effects model.')
    result = cal_palm.run_mixed_effects_model(y=outcome_matrix, X=design_matrix, groups=groups, random_intercepts=random_intercepts, random_slopes=random_slopes)
except Exception as err:
    print(f'Original model failed ({err}). Falling back to statsmodels.')
    try:
        # Build the random-effects formula from one column name or a list of names
        slopes = [random_slopes] if isinstance(random_slopes, str) else list(random_slopes)
        re_formula = "~" + " + ".join(slopes)
        print(f'Running with: \n - random slopes for: {slopes} \n - random intercepts for group: {groups}')
        mixed_lm = smf.mixedlm(formula, data_df, groups=groups, re_formula=re_formula)
        result = mixed_lm.fit(method=["lbfgs"])
    except Exception:
        # Reached when random_slopes is None or the random-slopes model fails
        print(f'Running with: \n - no random slopes \n - random intercepts for group: {groups}')
        mixed_lm = smf.mixedlm(formula, data_df, groups=groups)
        result = mixed_lm.fit()

**Manual Mixed Effects**

- Only run this section if the cell above failed.

- **Mixed Effects Formula Structure**
- outcome ~ regressor_1 + regressor_2 + (1 + regressor_2 | group)
    - The ( | ) statement is the random effects statement. 
    - Its leading 1 requests a random intercept; a 0 keeps the intercept fixed. 
    - regressor_2 requests a random slope for that regressor. 
    - group is the grouping variable, so the random effects are estimated per (|) group. 
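Written out for observation $i$ in group $j$, the example formula above corresponds to the model below. The symbols ($\beta$ for fixed effects, $u$ for group-level random effects, $\varepsilon$ for residual error) are conventional notation introduced here for illustration, not names used elsewhere in this notebook.

$$
y_{ij} = \beta_0 + \beta_1 x_{1,ij} + \beta_2 x_{2,ij} + u_{0j} + u_{2j}\, x_{2,ij} + \varepsilon_{ij}
$$

Here $u_{0j}$ is the random intercept and $u_{2j}$ is the random slope on regressor_2 for group $j$.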

Statsmodels Command Structure

Each argument maps to one part of outcome ~ regressor_1 + regressor_2 + (1 + regressor_2 | group):
- formula is the fixed-effects string: "outcome ~ regressor_1 + regressor_2"
- random intercepts are defined with "groups=group"
- random slopes are defined with "re_formula=~regressor_2"
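The mapping above can be sketched as a small string-splitting helper. It is a hypothetical convenience function, not part of statsmodels, and it assumes a single random-effects term written at the end of the formula.

```python
import re

def split_lmer_formula(lmer_formula):
    """Split 'y ~ a + b + (1 + b | g)' into statsmodels-style pieces.

    Supports a single random-effects term; returns (formula, groups, re_formula).
    """
    match = re.search(r"\+\s*\(([^|]+)\|([^)]+)\)", lmer_formula)
    if match is None:
        raise ValueError("no '( ... | group)' random-effects term found")
    random_part, groups = match.group(1).strip(), match.group(2).strip()
    formula = (lmer_formula[:match.start()] + lmer_formula[match.end():]).strip()
    # A lone '1' means random intercepts only, which needs no re_formula
    re_terms = [t.strip() for t in random_part.split("+")]
    if re_terms == ["1"]:
        re_formula = None
    else:
        slopes = [t for t in re_terms if t != "1"]
        re_formula = "~" + " + ".join(slopes)
    return formula, groups, re_formula

print(split_lmer_formula("outcome ~ regressor_1 + regressor_2 + (1 + regressor_2 | group)"))
```

The three returned values line up with the `formula`, `groups`, and `re_formula` arguments used in the cells below.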

In [None]:
backup_formula = "Z_Scored_Percent_Cognitive_Improvement ~ Age*Z_Scored_Subiculum_Connectivity_T"

In [None]:
groups='City'

Set to None if you would like to remove random slopes

In [None]:
re_formula=None

In [None]:
data_df

In [None]:
import statsmodels.formula.api as smf
mixed_lm = smf.mixedlm(backup_formula, data_df, groups=data_df[groups], re_formula=re_formula)
result = mixed_lm.fit()

# 08 - View Results
- If "Converged: No" is reported in the results below, be extremely cautious in interpretation. 
    - I would suggest simplifying the model until it converges. 

In [None]:
print(result.summary())

# Plot Random Effects

In [None]:
for k, v in result.random_effects.items():
    print(k)
    print(v)

Select one of the random effects above to plot. 

In [None]:
effect_to_plot='Age:Z_Scored_Subiculum_Connectivity_T'

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# Create an empty list to store interaction effects
interaction_effects = []

# Loop through the groups and extract interaction effects
for group, effects in result.random_effects.items():
    interaction_effect = effects[effect_to_plot]
    interaction_effects.append(interaction_effect)

# Create a boxplot with groups side-by-side
plt.figure(figsize=(10, 6))
sns.boxplot(x=list(result.random_effects.keys()), y=interaction_effects)

plt.xlabel('Group')
plt.ylabel(f'Random Effect: {effect_to_plot}')
plt.title('Distribution of Random Effects by Group')

plt.xticks(rotation=45)
plt.show()



Additional plots: fixed effects coefficients and random intercepts

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd

# Assuming 'result' is the output from your mixedlm fit
model_result = result  # Use your model fit result

# 1. Fixed Effects Coefficients Plot
fe_params = model_result.params
conf_int = model_result.conf_int()
errors = conf_int[1] - fe_params

plt.errorbar(fe_params.index, fe_params, yerr=errors, fmt='o')
plt.axhline(0, color='black', linestyle='--')
plt.title('Fixed Effects Coefficients')
plt.xticks(rotation=45)
plt.show()

# 2. Random Effects Plot
# Plots the first random effect per group (the random intercept in an intercept-only model)
re_params = pd.DataFrame([dict(re) for re in model_result.random_effects.values()])
re_col = re_params.columns[0]  # use the first random-effect column rather than assuming its name
re_params['group'] = list(model_result.random_effects.keys())

sns.stripplot(x='group', y=re_col, data=re_params)
plt.title('Random Intercepts per Group')
plt.xticks(rotation=45)
plt.show()


# Generate a Profile Plot
- This plot shows estimated marginal means across a set of categories/factors.
- Set up `marginal_scenarios_dict` carefully so that it includes every predictor you wish to analyze. Continuous variables should have the value 'continuous', and categorical variables should list every category you wish to iterate over.

Note: I suspect the profile plot struggles with mixed effects models because it does not set the random effects appropriately. 
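One way to picture how a scenarios dictionary turns into individual scenarios is to take the Cartesian product of the listed levels, as in the sketch below. This is an illustrative guess at the idea, not the `ProfilePlot` implementation. It assumes continuous predictors are marked with `['continuous']`, and the key `Connectivity` is a shortened placeholder name.

```python
from itertools import product

def expand_scenarios(scenarios):
    """Return one dict per combination of the listed (non-continuous) levels."""
    fixed = {k: v for k, v in scenarios.items() if v != ["continuous"]}
    continuous = [k for k, v in scenarios.items() if v == ["continuous"]]
    names = list(fixed)
    combos = [dict(zip(names, levels)) for levels in product(*fixed.values())]
    return combos, continuous

scenarios = {"City": ["Wurzburg", "Toronto"], "Age": [47, 83], "Connectivity": ["continuous"]}
combos, continuous = expand_scenarios(scenarios)
print(len(combos), continuous)
```

Each combination would correspond to one line in the profile plot, drawn across the range of the continuous predictor.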

In [None]:
marginal_scenarios_dict = {'City': ['Wurzburg', 'Toronto', 'Queensland'], 'Age': [47, 83], 'Z_Scored_Subiculum_Connectivity_T':['continuous']}

In [None]:
formula

In [None]:
from calvin_utils.statistical_utils.statistical_measurements import ProfilePlot
factor_plot = ProfilePlot(formula, data_df, model=result, data_range=(-3, 3), marginal_scenarios_dict=marginal_scenarios_dict, marginal_method='mean' )
factor_plot.run()


# Run it in R
- rpy2 is a mess, so it is easier to work in RStudio directly: fit with `lmer` from the lme4 package, plot with ggplot2, and use emmeans for the analysis

In [None]:
import rpy2
import rpy2.robjects as robjects


In [None]:
import rpy2.robjects.packages as rpackages

# Utility function to check for and install R packages
def install_r_packages(package_names):
    utils = rpackages.importr('utils')
    for package in package_names:
        if not rpackages.isinstalled(package):
            utils.install_packages(package)

# Install R packages required for your analysis
# install_r_packages(['lme4', 'emmeans', 'ggplot2'])
install_r_packages(['lazyeval'])



In [None]:
import rpy2.robjects as ro
# Define and execute the R command to get the version string
ro.r('''
R_version <- R.version.string
''')

# Retrieve the version string from R's global environment and print it
R_version = ro.r['R_version'][0]
print(f"The R version used by rpy2 is: {R_version}")


In [None]:
from rpy2.robjects import pandas2ri
# Convert the DataFrame to an R dataframe
pandas2ri.activate()
r_dataframe = pandas2ri.py2rpy(data_df)
ro.r.assign('r_df', r_dataframe)

In [None]:
# Define and fit the model in R
import rpy2.robjects as ro

# Set the CRAN repository to ensure you're getting the packages from CRAN
ro.r('''
options(repos = "https://cran.r-project.org/")
''')

# Reinstall 'Matrix' from sources. This step ensures you have the latest version compatible with your R setup.
ro.r('''
install.packages("Matrix", type = "source")
''')

# Reinstall 'lme4' from sources. This is crucial since 'lme4' depends on 'Matrix' and must be compatible with its ABI.
ro.r('''
install.packages("lme4", type = "source")
''')

print("Reinstallation of 'Matrix' and 'lme4' from source completed.")

ro.r('''
library(lme4)  # provides lmer
model <- lmer(Z_Scored_Percent_Cognitive_Improvement ~ Age * Z_Scored_Subiculum_Connectivity_T + (1|City), data = r_df)
''')


In [None]:
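import rpy2.robjects.lib.ggplot2 as ggplot2
from rpy2.robjects.packages import importr, data
mtcars = data(importr('datasets')).fetch('mtcars')['mtcars']  # R's built-in example dataset
gp = ggplot2.ggplot(mtcars)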


In [None]:
import rpy2.robjects as ro
from rpy2.robjects.packages import importr
import rpy2.robjects.lib.ggplot2 as ggplot2

# Automatic pandas-to-R conversion was already activated above with pandas2ri.activate()

# Import R packages
utils = importr('utils')
base = importr('base')

# Install R packages (if not already installed)
# utils.install_packages('lme4')
utils.install_packages('emmeans')
# utils.install_packages('ggplot2')
# utils.install_packages('lazyeval')

# Import R packages
lme4 = importr('lme4')
emmeans = importr('emmeans')
# ggplot2 = importr('ggplot2')

# Load your DataFrame here
# df = your_dataframe




# Calculate EMMs using the 'emmeans' package
# 'category' and 'category2' are placeholders; replace them with factors from your model
ro.r('''
emm_res <- emmeans(model, specs = ~ category * category2)
''')

# Plot the EMMs using 'ggplot2'
ro.r('''
plot <- plot(emm_res) + ggtitle("Estimated Marginal Means with CI") + theme_minimal()
print(plot)
''')


In [None]:
# Calculate EMMs using the 'emmeans' package
ro.r('''
emm_res <- emmeans(model, specs = pairwise ~ Age * Z_Scored_Subiculum_Connectivity_T)
''')

# Plot the EMMs using 'ggplot2'
ro.r('''
plot <- plot(emm_res) + ggtitle("Estimated Marginal Means with CI") + theme_minimal()
print(plot)
''')


In [None]:
data_df.columns

In [None]:
data_df['test'].isna().sum()

In [None]:
import pingouin as pg

pg.mixed_anova(data=data_df, dv='Z_Scored_Percent_Cognitive_Improvement', between='Age_Disease_and_Cohort', within='Subiculum_Group_By_Inflection_Point', subject='test')