# Run Any Kind of OLS Regression (ANOVA, GLM, Logit, etc.)

### Authors: Calvin Howard.

#### Last updated: May 5, 2024

Use this to run/test a statistical model (e.g., regression or T-tests) on a spreadsheet containing covariates and brain image (nii/gii) paths. 

Notes:
- This notebook requires the package to be installed in the environment where you run it. Run:
```
> git clone https://github.com/Calvinwhow/Research.git
> cd Research
> pip install -e .
```
- To get the most out of this notebook, you should be familiar with GLM design and contrast matrix design. See this webpage to get started:
[FSL's GLM page](https://fsl.fmrib.ox.ac.uk/fsl/fslwiki/GLM)

# 00 - Import CSV with All Data
**The CSV is expected to be in this format**
- ID and absolute paths to niftis are critical
```
+-----+----------------------------+--------------+--------------+--------------+
| ID  | Nifti_File_Path            | Covariate_1  | Covariate_2  | Covariate_3  |
+-----+----------------------------+--------------+--------------+--------------+
| 1   | /path/to/file1.nii.gz      | 0.5          | 1.2          | 3.4          |
| 2   | /path/to/file2.nii.gz      | 0.7          | 1.4          | 3.1          |
| 3   | /path/to/file3.nii.gz      | 0.6          | 1.5          | 3.5          |
| 4   | /path/to/file4.nii.gz      | 0.9          | 1.1          | 3.2          |
| ... | ...                        | ...          | ...          | ...          |
+-----+----------------------------+--------------+--------------+--------------+
```
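
A minimal sketch of building and sanity-checking a spreadsheet in this layout with pandas. The column names mirror the table above; the rows and paths are placeholders, and the checks are illustrative rather than part of the package.

```python
import os
import pandas as pd

# Placeholder rows mirroring the table above; swap in your own IDs, paths, and covariates.
df = pd.DataFrame({
    "ID": [1, 2, 3],
    "Nifti_File_Path": [
        "/path/to/file1.nii.gz",
        "/path/to/file2.nii.gz",
        "/path/to/file3.nii.gz",
    ],
    "Covariate_1": [0.5, 0.7, 0.6],
})

# Check that IDs are unique and that every image path is absolute.
ids_unique = df["ID"].is_unique
paths_absolute = df["Nifti_File_Path"].map(os.path.isabs).all()
print(ids_unique, paths_absolute)
```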

Prep Output Directory

In [1]:
# Specify where you want to save your results to
out_dir = '/Users/cu135/Partners HealthCare Dropbox/Calvin Howard/studies/raynor_network_mapping/results/corbetta_cluster/umapsV2/cluster_maps'

Import Data

In [2]:
# Specify the path to your CSV file containing NIFTI paths
input_csv_path = '/Users/cu135/Partners HealthCare Dropbox/Calvin Howard/studies/raynor_network_mapping/metadata/mergedNotCerebellumExtensive_CLEAN.csv'
sheet = None

In [3]:
from calvin_utils.permutation_analysis_utils.statsmodels_palm import CalvinStatsmodelsPalm
# Instantiate the CalvinStatsmodelsPalm class
cal_palm = CalvinStatsmodelsPalm(input_csv_path=input_csv_path, output_dir=out_dir, sheet=sheet)
# Read the CSV and return it as a DataFrame
data_df = cal_palm.read_and_display_data()
data_df

Unnamed: 0,MotorL_acute,MotorR_acute,Motor_IC_acute,MotorL_3month,MotorR_3month,Motor_IC_3month,MotorL_1year,MotorR_1year,Motor_IC_1year,motorl_f_acute,...,gdss_12_1year,gdss_13_1year,gdss_14_1year,gdss_15_1year,gdss_score_1year,clock_acute,mes_tot_miss_acute,conn_path,roi_path,focal_cerebellum
0,,,,-1.3652,0.5878,-0.7725,-1.0227,0.2821,-0.5117,,...,1.0,0.0,0.0,0.0,3.0,6.0,45.0,/Volumes/HowExp/datasets/02a_Corbetta_Stroke_L...,/Volumes/HowExp/datasets/02a_Corbetta_Stroke_L...,0
1,,,,,,,,,,,...,,,,,,,,/Volumes/HowExp/datasets/02a_Corbetta_Stroke_L...,/Volumes/HowExp/datasets/02a_Corbetta_Stroke_L...,1
2,,,,0.6425,-0.5805,0.0808,0.7433,0.3375,0.7515,,...,0.0,0.0,0.0,0.0,0.0,12.0,2.0,/Volumes/HowExp/datasets/02a_Corbetta_Stroke_L...,/Volumes/HowExp/datasets/02a_Corbetta_Stroke_L...,0
3,,,,,,,,,,,...,,,,,,13.0,2.0,/Volumes/HowExp/datasets/02a_Corbetta_Stroke_L...,/Volumes/HowExp/datasets/02a_Corbetta_Stroke_L...,0
4,,,,0.3583,0.5973,0.7771,-0.1659,0.6261,0.2651,,...,1.0,1.0,0.0,1.0,13.0,12.0,31.0,/Volumes/HowExp/datasets/02a_Corbetta_Stroke_L...,/Volumes/HowExp/datasets/02a_Corbetta_Stroke_L...,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
130,0.7975,0.5861,0.8521,0.7416,0.4959,0.9150,,,,0.794393,...,,,,,,3.0,,/Volumes/HowExp/datasets/02a_Corbetta_Stroke_L...,/Volumes/HowExp/datasets/02a_Corbetta_Stroke_L...,0
131,,,,,,,,,,,...,,,,,,,,/Volumes/HowExp/datasets/02a_Corbetta_Stroke_L...,/Volumes/HowExp/datasets/02a_Corbetta_Stroke_L...,1
132,,,,,,,,,,,...,,,,,,,,/Volumes/HowExp/datasets/02a_Corbetta_Stroke_L...,/Volumes/HowExp/datasets/02a_Corbetta_Stroke_L...,0
133,-2.3491,0.0868,-1.7040,,,,,,,-2.333489,...,,,,,,10.0,38.0,/Volumes/HowExp/datasets/02a_Corbetta_Stroke_L...,/Volumes/HowExp/datasets/02a_Corbetta_Stroke_L...,0



# 01 - Preprocess Your Data

**Handle NaNs**
- Set drop_nans=True if you would like to remove NaNs from the data
- Provide a column name or a list of column names to remove NaNs from
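
For intuition, here is a plain-pandas sketch of dropping NaNs from selected columns only. It assumes `drop_nans_from_columns` behaves like `DataFrame.dropna(subset=...)`; that equivalence is an assumption, and the toy data is made up.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "conn_path": ["/a.nii", None, "/c.nii"],
    "clock_acute": [6.0, np.nan, np.nan],
})

# Drop rows only where the listed columns are missing; NaNs in other columns are kept.
drop_list = ["conn_path"]
cleaned = df.dropna(subset=drop_list)
print(len(df), "->", len(cleaned))
```

Restricting the drop to the columns your model actually uses avoids discarding subjects because of missing values in unrelated columns.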

In [4]:
data_df.columns

Index(['MotorL_acute', 'MotorR_acute', 'Motor_IC_acute', 'MotorL_3month',
       'MotorR_3month', 'Motor_IC_3month', 'MotorL_1year', 'MotorR_1year',
       'Motor_IC_1year', 'motorl_f_acute',
       ...
       'gdss_12_1year', 'gdss_13_1year', 'gdss_14_1year', 'gdss_15_1year',
       'gdss_score_1year', 'clock_acute', 'mes_tot_miss_acute', 'conn_path',
       'roi_path', 'focal_cerebellum'],
      dtype='object', length=196)

In [None]:
drop_list = ['conn_path']

In [None]:
data_df = cal_palm.drop_nans_from_columns(columns_to_drop_from=drop_list)

**Drop Row Based on Value of Column**

Define the column, condition, and value for dropping rows
- column = 'your_column_name'
- condition = 'above'  # Options: 'equal', 'above', 'below', 'not'
- value = the value to compare against; rows that meet the condition are dropped
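
A rough pandas sketch of this kind of row filter. The helper `drop_rows` is hypothetical, and the meaning assigned to each condition (in particular, that 'not' drops rows whose value differs from `value`) is an assumption about the package's behavior, not a description of it.

```python
import pandas as pd

def drop_rows(df, column, condition, value):
    """Hypothetical helper: return (kept_rows, dropped_rows)."""
    masks = {
        "equal": df[column] == value,   # drop rows equal to value
        "above": df[column] > value,    # drop rows greater than value
        "below": df[column] < value,    # drop rows less than value
        "not": df[column] != value,     # assumed: drop rows that differ from value
    }
    drop_mask = masks[condition]
    return df[~drop_mask], df[drop_mask]

df = pd.DataFrame({"focal_cerebellum": [0, 1, 0, 0, 1]})
kept, dropped = drop_rows(df, "focal_cerebellum", "not", 0)
print(len(kept), len(dropped))
```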

In [None]:
data_df.columns

Set the parameters for dropping rows

In [None]:
column = 'focal_cerebellum'  # The column you'd like to evaluate
condition = 'not'  # The condition to check ('equal', 'above', 'below', 'not')
value = 0 # The value to drop if found

In [None]:
data_df, other_df = cal_palm.drop_rows_based_on_value(column, condition, value)
display(data_df)

In [None]:
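# Backward-fill remaining NaNs: each missing value takes the next valid value below it in the same column
data_df = data_df.bfill()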
data_df

# 02 - Define Your Formula

This is the formula relating the outcome to the predictors. It takes the form:
- y = B0 + B1*X1 + B2*X2 + B3*X3 + ... + BN*XN

It is written using the column names of your dataframe in place of the symbols above:
- 'Apples_Picked ~ hours_worked + owns_apple_picking_machine'

____
**ANOVA**
- Tests differences in means for one categorical variable.
- formula = 'Outcome ~ C(Group1)'

**2-Way ANOVA**
- Tests differences in means for two categorical variables without interaction.
- formula = 'Outcome ~ C(Group1) + C(Group2)'

**2-Way ANOVA with Interaction**
- Tests for interaction effects between two categorical variables.
- formula = 'Outcome ~ C(Group1) * C(Group2)'

**ANCOVA**
- Similar to ANOVA, but includes a covariate to control for its effect.
- formula = 'Outcome ~ C(Group1) + Covariate'

**2-Way ANCOVA**
- Extends ANCOVA with two categorical variables and their interaction, controlling for a covariate.
- formula = 'Outcome ~ C(Group1) * C(Group2) + Covariate'

**Multiple Regression**
- Assesses the impact of multiple predictors on an outcome.
- formula = 'Outcome ~ Predictor1 + Predictor2'

**Simple Linear Regression**
- Assesses the impact of a single predictor on an outcome.
- formula = 'Outcome ~ Predictor'

**MANOVA**
- Assesses multiple dependent variables across groups.
- Note: Not fit through the OLS formula interface used here. statsmodels provides a separate class for it (statsmodels.multivariate.manova.MANOVA).
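
To make the formulas above concrete, here is a small sketch that fits the ANCOVA formula with statsmodels on simulated data. The data, group labels, and effect sizes are invented for illustration; this is plain statsmodels, not the notebook's own pipeline.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "Group1": np.repeat(["a", "b", "c"], 20),
    "Covariate": rng.normal(size=60),
})
# Simulated outcome with a group effect plus noise.
group_shift = df["Group1"].map({"a": 0.0, "b": 0.5, "c": 1.0})
df["Outcome"] = group_shift + rng.normal(size=60)

# ANCOVA: one categorical factor plus a continuous covariate.
model = smf.ols("Outcome ~ C(Group1) + Covariate", data=df).fit()
table = anova_lm(model, typ=2)
print(table)
```

Swapping the formula string for any of the others listed above (for example 'Outcome ~ C(Group1)') changes the test without changing the rest of the code.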

____
Use the printout below to design your formula.
- Left of the "~" symbol is the outcome to be predicted.
- Right of the "~" symbol are the predictors.
- ":" indicates an interaction between two terms.
- "*" indicates an interaction AND also includes the simple effects of each term.
- "+" indicates that you want to add another predictor.
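
The operators can be checked by asking a formula parser which columns it would build. The sketch below uses patsy directly, on the assumption that the formula is parsed with patsy-style syntax; the column names `age` and `dose` are made up.

```python
import pandas as pd
from patsy import dmatrix

df = pd.DataFrame({"age": [60.0, 72.0, 55.0, 68.0], "dose": [1.0, 2.0, 3.0, 4.0]})

# "+" adds predictors, ":" adds only the interaction, "*" adds both.
main_only = dmatrix("age + dose", df).design_info.column_names
interaction_only = dmatrix("age:dose", df).design_info.column_names
full = dmatrix("age * dose", df).design_info.column_names

print(main_only)
print(interaction_only)
print(full)
```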

In [None]:
data_df.columns

In [None]:
x = ' + '.join(str(c) for c in data_df.columns)
print(x)

In [None]:
lst = x.split(' + ')
lst

In [None]:
formula = "MotorL_acute + MotorR_acute + Motor_IC_acute + MotorL_3month + MotorR_3month + Motor_IC_3month + MotorL_1year + MotorR_1year + Motor_IC_1year + motorl_f_acute + motorr_f_acute + motoric_within_acute + motoryn_acute + motor_battery_complete_acute + motoryn_3month + motor_battery_complete_3month + fim_motor_3month + motoryn_1year + motor_battery_complete_1year + fim_motor_1year + nihss_hospital_basic + nihssyn_acute + nih1a_acute + nih1b_acute + nih1c_acute + nih2_acute + nih3_acute + nih4_acute + nih5a_acute + nih5b_acute + nih6a_acute + nih6b_acute + nih7_acute + nih8_acute + nih9_acute + nih10_acute + nih11_acute + nih_total_acute + nih_stroke_scale_complete_acute + nihssyn_3month + nih1a_3month + nih1b_3month + nih1c_3month + nih2_3month + nih3_3month + nih4_3month + nih5a_3month + nih5b_3month + nih6a_3month + nih6b_3month + nih7_3month + nih8_3month + nih9_3month + nih10_3month + nih11_3month + nih_total_3month + nih_stroke_scale_complete_3month + nihssyn_1year + nih1a_1year + nih1b_1year + nih1c_1year + nih2_1year + nih3_1year + nih4_1year + nih5a_1year + nih5b_1year + nih6a_1year + nih6b_1year + nih7_1year + nih8_1year + nih9_1year + nih10_1year + nih11_1year + nih_total_1year + nih_stroke_scale_complete_1year + animal_raw_acute + nonword_acute + bvmt_im_acute + bvmt_learn_acute + bvmt_delay_acute + bvmt_perc_acute + bvmt_hit_acute + bvmt_fa_acute + bvmt_discrim_acute + bvmt_bias_acute + bvmt_imt_acute + bvmt_delayt_acute + bvmt_index_ile_acute + bvmt_im_3month + bvmt_learn_3month + bvmt_delay_3month + bvmt_perc_3month + bvmt_hit_3month + bvmt_fa_3month + bvmt_discrim_3month + bvmt_bias_3month + bvmt_imt_3month + bvmt_delayt_3month + bvmt_im_1year + bvmt_learn_1year + bvmt_delay_1year + bvmt_perc_1year + bvmt_hit_1year + bvmt_fa_1year + bvmt_discrim_1year + bvmt_bias_1year + bvmt_imt_1year + bvmt_delayt_1year + hvlt_im_acute + hvlt_learn_acute + hvlt_delay_acute + hvlt_perc_acute + hvlt_hit_acute + hvlt_fa1_acute + hvlt_fa2_acute + hvlt_fa3_acute + 
hvlt_discrim_acute + hvlt_imt_acute + hvlt_delayt_acute + hvlt_discrimt_acute + hvlt_im_3month + hvlt_learn_3month + hvlt_delay_3month + hvlt_perc_3month + hvlt_hit_3month + hvlt_fa1_3month + hvlt_fa2_3month + hvlt_fa3_3month + hvlt_discrim_3month + hvlt_imt_3month + hvlt_delayt_3month + hvlt_discrimt_3month + hvlt_im_1year + hvlt_learn_1year + hvlt_delay_1year + hvlt_perc_1year + hvlt_hit_1year + hvlt_fa1_1year + hvlt_fa2_1year + hvlt_fa3_1year + hvlt_discrim_1year + hvlt_imt_1year + hvlt_delayt_1year + hvlt_discrimt_1year + gds_1_acute + gds_2_acute + gds_3_acute + gds_4_acute + gds_5_acute + gds_6_acute + gds_7_acute + gds_8_acute + gds_9_acute + gds_10_acute + gds_11_acute + gds_12_acute + gds_13_acute + gds_14_acute + gds_15_acute + gdss_1_3month + gdss_2_3month + gdss_3_3month + gdss_4_3month + gdss_5_3month + gdss_6_3month + gdss_7_3month + gdss_8_3month + gdss_9_3month + gdss_10_3month + gdss_11_3month + gdss_12_3month + gdss_13_3month + gdss_14_3month + gdss_15_3month + gdss_score_3month + gdss_1_1year + gdss_2_1year + gdss_3_1year + gdss_4_1year + gdss_5_1year + gdss_6_1year + gdss_7_1year + gdss_8_1year + gdss_9_1year + gdss_10_1year + gdss_11_1year + gdss_12_1year + gdss_13_1year + gdss_14_1year + gdss_15_1year + gdss_score_1year + clock_acute + mes_tot_miss_acute ~ conn_path"

In [None]:
# formula = "MotorL_acute + MotorR_acute + gdss_1_3month + gdss_2_3month + gdss_3_3month + gdss_4_3month + gdss_5_3month + gdss_6_3month + gdss_7_3month + gdss_8_3month + gdss_9_3month + gdss_10_3month + gdss_11_3month + gdss_12_3month + gdss_13_3month + gdss_14_3month + gdss_15_3month + gdss_score_3month + gdss_1_1year ~ conn_path"

# 02 - Visualize Your Design Matrix

This is the explanatory variable half of your regression formula
_______________________________________________________
Create Design Matrix: Use the define_design_matrix method. The variables in your formula must correspond to column names in your dataframe.

- voxelwise_variable_list = A list containing the name of each voxelwise variable. Plainly, the variables that represent niftis.
- By default, an intercept will be added unless you set add_intercept=False
- **Don't explicitly add the 'intercept' column. I'll do it for you.**
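
As a rough picture of what a design matrix looks like, the pandas sketch below splits a toy dataframe into an outcome and a design matrix with an intercept column. It assumes voxelwise variables are carried as file paths in the design matrix; the column names are placeholders and this is not the package's implementation.

```python
import pandas as pd

df = pd.DataFrame({
    "outcome": [1.0, 2.0, 3.0],
    "age": [60, 72, 55],
    "conn_path": ["/a.nii", "/b.nii", "/c.nii"],  # voxelwise variable (paths to niftis)
})
predictors = ["age", "conn_path"]

outcome = df[["outcome"]]
design = df[predictors].copy()
design.insert(0, "Intercept", 1.0)  # constant column that estimates B0
print(design.columns.tolist())
```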

In [None]:
voxelwise_variable_list=['conn_path']

If you want to run voxelwise INTERACTIONS, then you should specify the exact terms, exactly as specified in your above formula, here. 
- For example, if the formula is 'outcome ~ voxelwise_var1 * age + dog_number', then voxelwise_interaction_terms is ['voxelwise_var1 * age']
- Set voxelwise_interaction_terms = None if you do not want to specify any interaction terms. 

In [None]:
voxelwise_interaction_terms = None

Make sure ANY voxelwise variables are in formula. 

In [None]:
# Define the design matrix
outcome_df, design_matrix = cal_palm.define_design_matrix(formula, data_df, add_intercept=False,
                                                          voxelwise_variable_list=voxelwise_variable_list, 
                                                          voxelwise_interaction_terms=voxelwise_interaction_terms)
design_matrix

# 03 - Visualize Your Dependent Variable

I have generated this for you based on the formula you provided

In [None]:
outcome_df

# 04 - Generate Contrasts

Generate a Contrast Matrix
- This is different from the contrast matrices used in cell-means regressions such as in PALM, but it is much more powerful. 



For more information on contrast matrices, please refer to this: https://cran.r-project.org/web/packages/codingMatrices/vignettes/codingMatrices.pdf

Generally, these drastically affect the results of an ANOVA. However, they are merely a nuisance for a regression.
In essence, they assess whether coefficients are significantly different.

________________________________________________________________
A coding matrix (a contrast matrix if it sums to zero) is simply a way of defining which coefficients to evaluate and how to evaluate them.
If a coefficient is set to 1 and everything else is set to zero, we are assessing whether that coefficient significantly deviates from zero, i.e., we are checking whether it had a significant impact on the ability to predict the dependent variable.
If one coefficient is set to 1, another to -1, and the others to 0, we are assessing how the two coefficients deviate from each other.
If several coefficients are 1 and several others are -1, we are assessing how the two groups of coefficients deviate from each other at the group level.
If a group of coefficients is 1, a group is -1, and a group is 0, we are only assessing how the +1 and -1 groups differ.

1: This value indicates that the corresponding variable's coefficient in the model is included in the contrast. It means you are interested in estimating the effect of that variable.

0: This value indicates that the corresponding variable's coefficient in the model is not included in the contrast. It means you are not interested in estimating the effect of that variable.

-1: This value indicates that the corresponding variable's coefficient in the model is included in the contrast, but with an opposite sign. It means you are interested in estimating the negative effect of that variable.

----------------------------------------------------------------
The contrast matrix is typically a matrix with dimensions (number of contrasts) x (number of regression coefficients). Each row of the contrast matrix represents a contrast or comparison you want to test.

For example, let's say you have the following regression coefficients in your model:

Intercept, Age, connectivity, Age_interaction_connectivity
A contrast matrix has dimensions of [n_contrasts, n_coefficients], where each row is one contrast

If you want to test the hypothesis that the effect of Age is significant, you can set up a contrast matrix with a row that specifies this contrast (actually an averaging vector):
```
[0,1,0,0]. This is an averaging vector because it sums to 1
```
This contrast will test the coefficient corresponding to the Age variable against zero.


If you want to test the hypothesis that the effect of Age is different from the effect of connectivity, you can set up a contrast matrix with a single row:
```
[0,1,-1,0]. This is a contrast because it sums to 0
```
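
The two vectors above can be turned into test statistics with the textbook OLS formula t = c·β / sqrt(σ² · c (XᵀX)⁻¹ cᵀ). The NumPy sketch below does this on simulated data; the data and effect sizes are invented, and this is the standard formula rather than the package's own code.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
age = rng.normal(size=n)
conn = rng.normal(size=n)
# Columns: Intercept, Age, connectivity, Age_interaction_connectivity
X = np.column_stack([np.ones(n), age, conn, age * conn])
y = 1.0 + 0.8 * age + 0.2 * conn + rng.normal(scale=0.5, size=n)

# Ordinary least squares fit and residual variance.
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta
dof = n - X.shape[1]
sigma2 = resid @ resid / dof
xtx_inv = np.linalg.inv(X.T @ X)

def contrast_t(c):
    """t-statistic for the hypothesis c @ beta == 0."""
    c = np.asarray(c, dtype=float)
    return (c @ beta) / np.sqrt(sigma2 * (c @ xtx_inv @ c))

t_age = contrast_t([0, 1, 0, 0])     # Age coefficient vs. zero
t_diff = contrast_t([0, 1, -1, 0])   # Age coefficient vs. connectivity coefficient
print(t_age, t_diff)
```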

Thus, if you want to see whether any given effect is significant compared to the intercept (average), you can use the following matrix:
```
[1,0,0,0]
[-1,1,0,0]
[-1,0,1,0]
[-1,0,0,1]
```
Because the first row sums to 1 rather than 0, this is strictly a coding matrix rather than a contrast matrix.

The first row tests the intercept against zero. Each remaining row tests whether one coefficient (Age, connectivity, or the interaction) differs from the intercept.
_____
You can define any number of contrasts in the contrast matrix to test different hypotheses or comparisons of interest in your regression analysis.
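
One way to keep several contrasts organized is a labeled DataFrame with one row per hypothesis and one column per coefficient. This is only an illustrative sketch: the row labels are hypothetical, and the coefficient names come from the example above.

```python
import numpy as np
import pandas as pd

coefficients = ["Intercept", "Age", "connectivity", "Age_interaction_connectivity"]
contrasts = pd.DataFrame(
    [[0, 1, 0, 0],
     [0, 1, -1, 0]],
    index=["Age_vs_zero", "Age_vs_connectivity"],  # hypothetical labels
    columns=coefficients,
)

# One column per regression coefficient, one row per hypothesis.
n_contrasts, n_coefficients = contrasts.shape

# Rows that sum to 0 are contrasts in the strict sense; other rows are not.
row_sums = contrasts.sum(axis=1)
kinds = np.where(row_sums == 0, "contrast", "not a contrast")
print(dict(zip(contrasts.index, kinds)))
```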

It's important to note that the specific contrasts you choose depend on your research questions and hypotheses. You should carefully consider the comparisons you want to make and design the contrast matrix accordingly.

- Examples:
    - [Two Sample T-Test](https://fsl.fmrib.ox.ac.uk/fsl/fslwiki/GLM#Two-Group_Difference_.28Two-Sample_Unpaired_T-Test.29)
    - [One Sample with Covariate](https://fsl.fmrib.ox.ac.uk/fsl/fslwiki/GLM#Single-Group_Average_with_Additional_Covariate)

In [None]:
contrast_matrix = cal_palm.generate_basic_contrast_matrix(design_matrix)

In [None]:
contrast_matrix = [
    [1]
    ]

In [None]:
contrast_matrix_df = cal_palm.finalize_contrast_matrix(design_matrix=design_matrix, 
                                                    contrast_matrix=contrast_matrix) 
contrast_matrix_df

# 05 - Generate Files
Standardization during regression is critical.
- data_transform_method='standardize' will ensure the voxelwise values are standardized
    - If your design matrix has a column called 'Dataset', values will be standardized within each dataset individually, which is how it should normally be done.
    - If you set data_transform_method='standardize' without a 'Dataset' column in your design matrix, the entire collection of images will be standardized together. This is potentially dangerous and misleading. Be careful, and consider not standardizing at all, or going back and adding a 'Dataset' column.
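
To illustrate why the 'Dataset' column matters, the sketch below z-scores simulated voxel values within each dataset. It assumes standardization means z-scoring each voxel across subjects; how the package actually standardizes is not shown here, and the voxel names and data are made up.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Rows are subjects, columns are (hypothetical) voxels; dataset "B" is on a different scale.
voxels = pd.DataFrame(
    np.vstack([rng.normal(0, 1, size=(5, 3)), rng.normal(10, 5, size=(5, 3))]),
    columns=["vox0", "vox1", "vox2"],
)
dataset = pd.Series(["A"] * 5 + ["B"] * 5, name="Dataset")

# z-score every voxel across subjects, separately within each dataset.
standardized = voxels.groupby(dataset).transform(
    lambda v: (v - v.mean()) / v.std(ddof=0)
)

group_means = standardized.groupby(dataset).mean()
print(group_means.round(6))
```

After within-dataset standardization both datasets are centered at zero, so scale differences between scanners or cohorts no longer masquerade as an effect.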

Mask Path
- set mask_path to the path of your local brain mask which matches the resolution of the files you have collected. Typically this is an MNI 152 brain mask. 
    - download one here: https://nilearn.github.io/dev/modules/generated/nilearn.datasets.load_mni152_brain_mask.html
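
The resolution requirement exists because a mask is applied voxel by voxel. The NumPy sketch below uses small arrays as stand-ins for a loaded image and mask (no files are read) to show the shape check and the masking step.

```python
import numpy as np

# Stand-ins for arrays loaded from a NIfTI image and a brain mask on the same grid.
image = np.arange(27, dtype=float).reshape(3, 3, 3)
mask = np.zeros((3, 3, 3), dtype=bool)
mask[1, :, :] = True  # pretend the middle slab is "brain"

# The mask only makes sense if it shares the image's voxel grid.
if image.shape != mask.shape:
    raise ValueError(f"Mask shape {mask.shape} does not match image shape {image.shape}")

in_brain = image[mask]  # 1-D vector of the voxels inside the mask
print(in_brain.shape)
```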

In [None]:
mask_path = '/Users/cu135/hires_backdrops/MNI/MNI152_T1_2mm_brain_mask.nii'
data_transform_method=None

Define exchangeability block
- Set to None if you don't know
- If you are running multiple cohorts, set exchangeability block to be the column which has each group in it, with groups being indicated by integers. 
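
Exchangeability blocks conventionally restrict permutations so that observations are only shuffled within their own group. The sketch below shows that idea in NumPy; `permute_within_blocks` is a hypothetical helper, and whether the package permutes exactly this way is an assumption.

```python
import numpy as np

def permute_within_blocks(values, blocks, rng):
    """Shuffle `values` only among observations that share a block label."""
    values = np.asarray(values)
    blocks = np.asarray(blocks)
    permuted = values.copy()
    for label in np.unique(blocks):
        idx = np.flatnonzero(blocks == label)
        permuted[idx] = values[rng.permutation(idx)]
    return permuted

rng = np.random.default_rng(0)
outcome = np.array([1, 2, 3, 10, 20, 30])
blocks = np.array([1, 1, 1, 2, 2, 2])  # integer cohort labels
shuffled = permute_within_blocks(outcome, blocks, rng)
print(shuffled)
```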

In [None]:
exchangeability_col = None

In [None]:
from calvin_utils.permutation_analysis_utils.voxelwise_regression_prep import RegressionPrep
# Pass the settings defined in the cells above rather than hard-coded values
preparer = RegressionPrep(design_matrix=design_matrix, 
                          contrast_matrix=contrast_matrix_df, 
                          outcome_df=outcome_df, 
                          out_dir=out_dir,
                          voxelwise_variables=voxelwise_variable_list, 
                          voxelwise_interactions=voxelwise_interaction_terms,
                          mask_path=mask_path, 
                          exchangeability_block=exchangeability_col, 
                          data_transform_method=data_transform_method,
                          weights=None)
dataset_dict, json_path = preparer.run()

# 06 - Map and Cluster the Brains

In [None]:
from calvin_utils_project.calvin_utils.ml_utils.umap_regression import UmapRegression
umr = UmapRegression(json_path, mask_path, formula, out_dir)
umr.run(0)

That's all

-Calvin