# Run Any Kind of OLS Regression (ANOVA, GLM, etc.)

### Authors: Calvin Howard.

#### Last updated: July 6, 2023

Use this to run/test a statistical model (e.g., regression or T-tests) on a spreadsheet.

Notes:
- To best use this notebook, you should be familiar with GLM design and contrast matrix design. See [FSL's GLM page](https://fsl.fmrib.ox.ac.uk/fsl/fslwiki/GLM) to get started.

# 00 - Import CSV with All Data
**The CSV is expected to be in this format**
- ID and absolute paths to niftis are critical
```
+-----+----------------------------+--------------+--------------+--------------+
| ID  | Nifti_File_Path            | Covariate_1  | Covariate_2  | Covariate_3  |
+-----+----------------------------+--------------+--------------+--------------+
| 1   | /path/to/file1.nii.gz      | 0.5          | 1.2          | 3.4          |
| 2   | /path/to/file2.nii.gz      | 0.7          | 1.4          | 3.1          |
| 3   | /path/to/file3.nii.gz      | 0.6          | 1.5          | 3.5          |
| 4   | /path/to/file4.nii.gz      | 0.9          | 1.1          | 3.2          |
| ... | ...                        | ...          | ...          | ...          |
+-----+----------------------------+--------------+--------------+--------------+
```
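
A quick sanity check before loading can catch a malformed spreadsheet early. The sketch below is not part of calvin_utils. It builds a toy table in the layout shown above and checks the two stated requirements (an ID column and absolute NIfTI paths). The helper name `check_master_list` and its default column names are illustrative assumptions.

```python
import os
import pandas as pd

def check_master_list(df, id_col="ID", path_col="Nifti_File_Path"):
    """Hypothetical helper: return a list of problems (empty list means it looks fine)."""
    problems = []
    for col in (id_col, path_col):
        if col not in df.columns:
            problems.append(f"missing column: {col}")
    if path_col in df.columns:
        # Flag any path that is not absolute
        relative = [p for p in df[path_col] if not os.path.isabs(str(p))]
        if relative:
            problems.append(f"{len(relative)} path(s) are not absolute")
    return problems

toy = pd.DataFrame({
    "ID": [1, 2, 3],
    "Nifti_File_Path": ["/path/to/file1.nii.gz", "/path/to/file2.nii.gz", "file3.nii.gz"],
    "Covariate_1": [0.5, 0.7, 0.6],
})
problems = check_master_list(toy)
print(problems)
```

Here the third row uses a relative path on purpose, so the check reports it.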

Prep Output Directory

In [None]:
# Specify where you want to save your results to
out_dir = '/Users/cu135/Dropbox (Partners HealthCare)/studies/voxelwise_lin_reg'

Import Data

In [None]:
# Specify the path to your CSV file containing NIFTI paths
input_csv_path = '/Users/cu135/Dropbox (Partners HealthCare)/studies/voxelwise_lin_reg/experimental_group_master_list.csv'
sheet = None

In [None]:
from calvin_utils.permutation_analysis_utils.statsmodels_palm import CalvinStatsmodelsPalm
# Instantiate the CalvinStatsmodelsPalm class
cal_palm = CalvinStatsmodelsPalm(input_csv_path=input_csv_path, output_dir=out_dir, sheet=sheet)
# Read the CSV and display its contents
data_df = cal_palm.read_and_display_data()


# 01 - Preprocess Your Data

**Handle NANs**
- Set drop_nans=True if you would like to remove NaNs from the data
- Provide a column name or a list of column names to remove NaNs from

In [None]:
data_df.columns

In [None]:
drop_list = ['Age', 'TOTALMOD']

In [None]:
data_df = cal_palm.drop_nans_from_columns(columns_to_drop_from=drop_list)
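
For readers who want to see what this step amounts to, here is a plain-pandas sketch of dropping rows that have NaNs in selected columns. It is an assumption about the behavior of `drop_nans_from_columns`, not its actual implementation, and the toy dataframe is made up.

```python
import numpy as np
import pandas as pd

toy_df = pd.DataFrame({
    "Age": [61.0, np.nan, 70.0, 58.0],
    "TOTALMOD": [12.0, 15.0, np.nan, 9.0],
    "Sex": ["M", "F", "F", None],
})

drop_list = ["Age", "TOTALMOD"]

# Keep only rows that have values in every column of drop_list;
# NaNs in other columns (here, Sex) are left alone
cleaned = toy_df.dropna(subset=drop_list).reset_index(drop=True)
print(cleaned)
```

Only the columns named in `drop_list` are inspected, so a row with a missing `Sex` value survives.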

**Drop Row Based on Value of Column**

Define the column, condition, and value for dropping rows
- column = 'your_column_name'
- condition = 'above'  # Options: 'equal', 'above', 'below', 'not'
- value = your_value  # The value to compare against

In [None]:
data_df.columns

Set the parameters for dropping rows

In [None]:
column = 'City'  # The column you'd like to evaluate
condition = 'not'  # The condition to check ('equal', 'above', 'below', 'not')
value = 'Toronto' # The value to drop if found

In [None]:
data_df, other_df = cal_palm.drop_rows_based_on_value(column, condition, value)
display(data_df)
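
The sketch below shows one way such a row filter could work in plain pandas. The helper `split_rows` is hypothetical, and its semantics are an assumption: rows for which the condition holds are dropped and returned as the second dataframe. Check the behavior of `drop_rows_based_on_value` itself before relying on this reading.

```python
import operator
import pandas as pd

# Map each condition keyword to a comparison.
# Assumed semantics: rows where the comparison is True are dropped.
CONDITIONS = {
    "equal": operator.eq,
    "above": operator.gt,
    "below": operator.lt,
    "not": operator.ne,
}

def split_rows(df, column, condition, value):
    """Hypothetical helper: return (kept_rows, dropped_rows)."""
    if condition not in CONDITIONS:
        raise ValueError(f"condition must be one of {sorted(CONDITIONS)}")
    to_drop = CONDITIONS[condition](df[column], value)
    return df[~to_drop].copy(), df[to_drop].copy()

toy_df = pd.DataFrame({"City": ["Toronto", "Boston", "Toronto"], "Age": [61, 70, 58]})
kept, dropped = split_rows(toy_df, "City", "not", "Toronto")
print(kept["City"].tolist(), dropped["City"].tolist())
```

Under this reading, `condition='not'` with `value='Toronto'` keeps only the Toronto rows.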

**Standardize Data**
- List the columns you don't want to standardize

In [None]:
# Remove anything you don't want to standardize
cols_not_to_standardize = None # ['Z_Scored_Percent_Cognitive_Improvement_By_Origin_Group', 'Z_Scored_Subiculum_T_By_Origin_Group_'] #['Age']

In [None]:
data_df = cal_palm.standardize_columns(cols_not_to_standardize)
data_df
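
As a rough illustration of what standardizing means here, the sketch below z-scores every numeric column except those excluded. The helper `zscore_columns` is hypothetical, and the use of the population standard deviation (`ddof=0`) is an assumption. `standardize_columns` may differ in both respects.

```python
import pandas as pd

def zscore_columns(df, cols_not_to_standardize=None):
    """Hypothetical helper: z-score every numeric column except the excluded ones."""
    excluded = set(cols_not_to_standardize or [])
    out = df.copy()
    for col in out.select_dtypes(include="number").columns:
        if col in excluded:
            continue
        std = out[col].std(ddof=0)
        # Leave constant columns untouched to avoid dividing by zero
        if std > 0:
            out[col] = (out[col] - out[col].mean()) / std
    return out

toy_df = pd.DataFrame({
    "Age": [50.0, 60.0, 70.0],
    "TOTALMOD": [10.0, 20.0, 30.0],
    "Sex": ["M", "F", "M"],
})
standardized = zscore_columns(toy_df, cols_not_to_standardize=["TOTALMOD"])
print(standardized)
```

Non-numeric columns such as `Sex` pass through unchanged, as do any columns named in the exclusion list.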

In [None]:
data_df.columns

# 02 - Define Your Formula

This is the formula relating the outcome to the predictors, and takes the form:
- y = B0 + B1*X1 + B2*X2 + B3*X3 + . . . + BN*XN

It is defined using the columns of your dataframe instead of the variables above:
- 'Apples_Picked ~ hours_worked + owns_apple_picking_machine'

____
**ANOVA**
- Tests differences in means for one categorical variable.
- formula = 'Outcome ~ C(Group1)'

**2-Way ANOVA**
- Tests differences in means for two categorical variables without interaction.
- formula = 'Outcome ~ C(Group1) + C(Group2)'

**2-Way ANOVA with Interaction**
- Tests for interaction effects between two categorical variables.
- formula = 'Outcome ~ C(Group1) * C(Group2)'

**ANCOVA**
- Similar to ANOVA, but includes a covariate to control for its effect.
- formula = 'Outcome ~ C(Group1) + Covariate'

**2-Way ANCOVA**
- Extends ANCOVA with two categorical variables and their interaction, controlling for a covariate.
- formula = 'Outcome ~ C(Group1) * C(Group2) + Covariate'

**Multiple Regression**
- Assesses the impact of multiple predictors on an outcome.
- formula = 'Outcome ~ Predictor1 + Predictor2'

**Simple Linear Regression**
- Assesses the impact of a single predictor on an outcome.
- formula = 'Outcome ~ Predictor'

**MANOVA**
- Assesses multiple dependent variables across groups.
- Note: Not typically set up with a formula in statsmodels. Requires specialized functions.

____
Use the printout below to design your formula. 
- Left of the "~" symbol is the thing to be predicted. 
- Right of the "~" symbol are the predictors. 
- ":" indicates an interaction between two things. 
- "*" indicates an interaction AND it accounts for the simple effects too. 
- "+" indicates that you want to add another predictor. 
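
To see what these operators expand to, the sketch below asks patsy for the column names each formula produces. Formula strings in this style are the kind patsy parses (statsmodels has traditionally relied on it), but whether this notebook's helper uses patsy internally is an assumption. The toy columns `a` and `b` are made up.

```python
import pandas as pd
from patsy import dmatrix

toy_df = pd.DataFrame({"a": [1.0, 2.0, 3.0, 4.0], "b": [0.5, 0.1, 0.9, 0.3]})

# ":" adds only the interaction term
interaction_only = dmatrix("a:b", toy_df).design_info.column_names
# "*" expands to both simple effects plus the interaction
full_factorial = dmatrix("a * b", toy_df).design_info.column_names
# "+" adds predictors with no interaction
additive = dmatrix("a + b", toy_df).design_info.column_names

print(interaction_only)
print(full_factorial)
print(additive)
```

In other words, `a * b` is shorthand for `a + b + a:b`.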

In [None]:
data_df.columns

In [None]:
formula = "TOTALMOD ~ local_z6_csf_paths + Age + Sex"

# 02 - Visualize Your Design Matrix

This is the explanatory variable half of your regression formula
_______________________________________________________
Create Design Matrix: Use the define_design_matrix method. The variables in your formula must correspond to column names in your dataframe.

- voxelwise_variable = name of the variable in your formula which contains nifti paths.
- By default, an intercept will be added unless you set intercept=False
- **Don't explicitly add the 'intercept' column. I'll do it for you.**

In [None]:
voxelwise_variable='local_z6_csf_paths'

In [None]:
# Define the design matrix
outcome_matrix, design_matrix = cal_palm.define_design_matrix(formula, data_df, voxelwise_variable=voxelwise_variable)
design_matrix
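
For intuition about what a formula-derived design matrix looks like, here is a small sketch using patsy directly on a toy dataframe. It is not the implementation of `define_design_matrix`. It only illustrates the automatic intercept column and the dummy coding of a categorical variable, which yields column names such as `Sex[T.M]`.

```python
import pandas as pd
from patsy import dmatrices

toy_df = pd.DataFrame({
    "TOTALMOD": [12.0, 15.0, 9.0, 20.0],
    "Age": [61.0, 66.0, 70.0, 58.0],
    "Sex": ["M", "F", "F", "M"],
})

# Left of "~" becomes the outcome matrix, right of "~" the design matrix
outcome, design = dmatrices("TOTALMOD ~ Age + Sex", toy_df, return_type="dataframe")
print(design)
```

The `Sex[T.M]` column is 1 for rows where `Sex` is `M` and 0 otherwise, with `F` acting as the reference level.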

# 03 - Visualize Your Dependent Variable

I have generated this for you based on the formula you provided

In [None]:
outcome_matrix

# 04 - Generate Nifti Files

In [None]:
cal_palm.generate_voxelwise_cov_4d_nifti(design_matrix, voxelwise_variable)
cal_palm.generate_univariate_4d_niftis(design_matrix, voxelwise_variable)
cal_palm.generate_4d_dependent_variable_nifti(outcome_matrix, design_matrix, voxelwise_variable)
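
A scalar covariate such as age has one value per patient, so storing it as a 4D image means repeating that value across every voxel. The numpy sketch below illustrates the idea on a tiny stand-in volume. The helper `covariate_to_4d` is hypothetical, and it is an assumption that the `generate_*` methods above work this way.

```python
import numpy as np

def covariate_to_4d(values, volume_shape):
    """Hypothetical helper: repeat one scalar per patient across every voxel.

    Returns an array of shape volume_shape + (n_patients,).
    """
    values = np.asarray(values, dtype=float)
    return np.broadcast_to(values, tuple(volume_shape) + values.shape).copy()

age = [61.0, 70.0, 58.0]                  # one value per patient
age_4d = covariate_to_4d(age, (2, 2, 2))  # tiny stand-in for a brain volume
print(age_4d.shape)

# Flattening the spatial axes gives a (voxels, patients) layout
age_2d = age_4d.reshape(-1, age_4d.shape[-1])
print(age_2d.shape)
```

Every row of `age_2d` holds the same three ages, one row per voxel.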

# 05 - Run the Regression
- This section uses a manual linear regression so that the computation can be vectorized for speed.
    - The vectorized version should be compared against a voxelwise for-loop regression, because a complete matrix multiplication may be too computationally intensive to run, at least locally.

**Slow and cheap for-loop version (about 90 s per evaluation)**
- This is likely ideal given the nature of the CPUs on the cluster: with small RAM and many CPUs, the cluster will tolerate this method well, and it should be easy to get lots of resources allocated.

In [None]:
## Manual Code Exposed Prior to Cleaned OOP 
import numpy as np
import nibabel as nib
import os
from sklearn.linear_model import LinearRegression
from tqdm import tqdm


# 1 - IMPORT THE ARRAYS
## this will be output by organized lists returned from the 4D array generator step previously. 
Y_arr = nib.load('/Users/cu135/Dropbox (Partners HealthCare)/studies/voxelwise_lin_reg/4d_niftis/TOTALMOD.nii').get_fdata()
X0 = nib.load('/Users/cu135/Dropbox (Partners HealthCare)/studies/voxelwise_lin_reg/4d_niftis/Intercept.nii').get_fdata()
X1 = nib.load('/Users/cu135/Dropbox (Partners HealthCare)/studies/voxelwise_lin_reg/4d_niftis/local_z6_csf_paths.nii').get_fdata()
X2 = nib.load('/Users/cu135/Dropbox (Partners HealthCare)/studies/voxelwise_lin_reg/4d_niftis/Age.nii').get_fdata()
X3 = nib.load('/Users/cu135/Dropbox (Partners HealthCare)/studies/voxelwise_lin_reg/4d_niftis/Sex[T.M].nii').get_fdata()

# 2 - RESHAPE ARRAYS INTO ARRAYS OF SHAPE = (VOXELS, PATIENTS) 
X0 = X0.reshape(-1, X0.shape[-1])
X1 = X1.reshape(-1, X1.shape[-1])
X2 = X2.reshape(-1, X2.shape[-1])
X3 = X3.reshape(-1, X3.shape[-1])
Y = Y_arr.reshape(-1, Y_arr.shape[-1])

# 2.5 - MASK
mask = nib.load('/Users/cu135/Library/CloudStorage/OneDrive-Personal/OneDrive_Documents/Work/Software/Research/nimlab/nimlab/data/MNI152_T1_2mm_brain_mask.nii').get_fdata().reshape(-1, 1)
mask_indices = mask > 0
mask_indices = mask_indices.flatten()

# Apply mask
X0 = X0[mask_indices]
X1 = X1[mask_indices]
X2 = X2[mask_indices]
X3 = X3[mask_indices]
Y = Y[mask_indices]
print(X0.shape,X1.shape,X2.shape,X3.shape,Y.shape)

# 3 - COMBINE THEM IN THE 3RD DIMENSION (3D design matrix), SHAPE = [VOXELS, PATIENTS, COVARIATES]
X_ARR = np.concatenate((X0[:, :, np.newaxis], X1[:, :, np.newaxis], X2[:, :, np.newaxis], X3[:, :, np.newaxis]), axis=2)
print(X_ARR.shape)

# 4 - REGRESS
# Initialize an empty array to store coefficients for each voxel
coefficients = np.zeros((X_ARR.shape[0], X_ARR.shape[2]))
t_values = np.zeros((X_ARR.shape[0], X_ARR.shape[2]))

# Iterate over each voxel and perform regression
for voxel_idx in tqdm(range(X_ARR.shape[0]), desc='Voxels'):
    X_voxel = X_ARR[voxel_idx, :, :]
    Y_voxel = Y[voxel_idx, :].reshape(-1, 1)  # Reshape Y_voxel to have two dimensions

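    # Perform regression. The design matrix already contains an Intercept column,
    # so scikit-learn must not add a second one.
    reg = LinearRegression(fit_intercept=False).fit(X_voxel, Y_voxel)

    # Coefficients
    coefficients[voxel_idx, :] = reg.coef_.flatten()
    try:
        # Residual variance: sum of squared residuals / (n_patients - n_covariates)
        residuals = Y_voxel - reg.predict(X_voxel)
        dof = X_voxel.shape[0] - X_voxel.shape[1]
        sigma2 = np.sum(residuals ** 2) / dof

        # Standard errors of coefficients: sqrt(sigma^2 * diag((X'X)^-1))
        std_err = np.sqrt(sigma2 * np.diag(np.linalg.inv(X_voxel.T @ X_voxel)))

        # Calculate t-values
        t_values[voxel_idx, :] = coefficients[voxel_idx, :] / std_err
    # X'X can be singular (e.g., a covariate that is constant at this voxel), so the inverse can fail
    except np.linalg.LinAlgError:
        t_values[voxel_idx, :] = 0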
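
For reference, these are the standard OLS expressions for the coefficients and t-values at a single voxel, where $X$ is that voxel's design matrix with $n$ rows (patients) and $p$ columns (covariates, including the intercept) and $y$ is the outcome vector. The symbols are introduced here for illustration and are not names used in the code.

$$
\hat{\beta} = (X^\top X)^{-1} X^\top y, \qquad
\hat{\sigma}^2 = \frac{\lVert y - X\hat{\beta} \rVert^2}{n - p}, \qquad
t_j = \frac{\hat{\beta}_j}{\sqrt{\hat{\sigma}^2 \, \left[(X^\top X)^{-1}\right]_{jj}}}
$$

Note that $X (X^\top X)^{-1} X^\top$ is the hat matrix, whose diagonal has one entry per patient. The coefficient variances come from the diagonal of $(X^\top X)^{-1}$, which has one entry per covariate.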
    

**Super fast, memory-expensive version (< 60 s)**

In [None]:
## Manual Code Exposed Prior to Cleaned OOP 
import numpy as np
import nibabel as nib
import os



# 1 - IMPORT THE ARRAYS
## this will be output by organized lists returned from the 4D array generator step previously. 
Y_arr = nib.load('/Users/cu135/Dropbox (Partners HealthCare)/studies/voxelwise_lin_reg/4d_niftis/TOTALMOD.nii').get_fdata()
X0 = nib.load('/Users/cu135/Dropbox (Partners HealthCare)/studies/voxelwise_lin_reg/4d_niftis/Intercept.nii').get_fdata()
X1 = nib.load('/Users/cu135/Dropbox (Partners HealthCare)/studies/voxelwise_lin_reg/4d_niftis/local_z6_csf_paths.nii').get_fdata()
X2 = nib.load('/Users/cu135/Dropbox (Partners HealthCare)/studies/voxelwise_lin_reg/4d_niftis/Age.nii').get_fdata()
X3 = nib.load('/Users/cu135/Dropbox (Partners HealthCare)/studies/voxelwise_lin_reg/4d_niftis/Sex[T.M].nii').get_fdata()

# 2 - RESHAPE ARRAYS INTO ARRAYS OF SHAPE = (VOXELS, PATIENTS) 
X0 = X0.reshape(-1, X0.shape[-1])
X1 = X1.reshape(-1, X1.shape[-1])
X2 = X2.reshape(-1, X2.shape[-1])
X3 = X3.reshape(-1, X3.shape[-1])
Y = Y_arr.reshape(-1, Y_arr.shape[-1])

# 2.5 - MASK
mask = nib.load('/Users/cu135/Library/CloudStorage/OneDrive-Personal/OneDrive_Documents/Work/Software/Research/nimlab/nimlab/data/MNI152_T1_2mm_brain_mask.nii').get_fdata().reshape(-1, 1)
mask_indices = mask > 0
mask_indices = mask_indices.flatten()

# Apply mask
X0 = X0[mask_indices]
X1 = X1[mask_indices]
X2 = X2[mask_indices]
X3 = X3[mask_indices]
Y = Y[mask_indices]

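# 3 - COMBINE THEM IN THE 3RD DIMENSION (3D design matrix), SHAPE = [VOXELS, PATIENTS, COVARIATES]
X_ARR = np.stack((X0, X1, X2, X3), axis=2)

# 4 - PERFORM A 3-DIMENSIONAL NORMAL EQUATION CALCULATION OF BETAS. SUPER FAST. LOTS OF MEMORY.
# Batched over voxels: XtX has SHAPE = [VOXELS, COVARIATES, COVARIATES], XtY has SHAPE = [VOXELS, COVARIATES, 1]
Xt = X_ARR.transpose(0, 2, 1)
XtX = Xt @ X_ARR
XtY = Xt @ Y[:, :, np.newaxis]
# pinv instead of inv so that voxels with a singular design do not raise
betas = (np.linalg.pinv(XtX) @ XtY)[:, :, 0]  # SHAPE = [VOXELS, COVARIATES]

print("Coefficients for the first voxel:", betas[0])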
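
The batched normal-equation idea can be checked on synthetic data before running it on real images. The sketch below uses made-up array sizes and random inputs, computes betas and t-values for all voxels at once, and cross-checks the betas against numpy's least-squares solver. It is an illustration under those assumptions, not the notebook's production code.

```python
import numpy as np

rng = np.random.default_rng(0)
n_voxels, n_patients, n_covariates = 50, 30, 4

# Synthetic stand-ins: X is [VOXELS, PATIENTS, COVARIATES], Y is [VOXELS, PATIENTS]
X = rng.normal(size=(n_voxels, n_patients, n_covariates))
X[:, :, 0] = 1.0  # intercept column
true_betas = rng.normal(size=(n_voxels, n_covariates))
Y = np.einsum("vpc,vc->vp", X, true_betas) + 0.1 * rng.normal(size=(n_voxels, n_patients))

# Batched normal equations: matmul broadcasts over the leading voxel axis
Xt = X.transpose(0, 2, 1)
XtX_inv = np.linalg.inv(Xt @ X)                   # [VOXELS, COVARIATES, COVARIATES]
betas = (XtX_inv @ Xt @ Y[:, :, None])[:, :, 0]   # [VOXELS, COVARIATES]

# Residual variance and t-values per voxel
residuals = Y - np.einsum("vpc,vc->vp", X, betas)
sigma2 = np.sum(residuals ** 2, axis=1) / (n_patients - n_covariates)
std_err = np.sqrt(sigma2[:, None] * np.diagonal(XtX_inv, axis1=1, axis2=2))
t_values = betas / std_err

# Cross-check one voxel against numpy's least-squares solver
check, *_ = np.linalg.lstsq(X[0], Y[0], rcond=None)
print(np.allclose(betas[0], check))
```

Memory use scales with the number of voxels times patients times covariates, which is the trade-off against the for-loop version.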

# 06 - Permutation
- Convert the code above to a script.
- Send the 4D niftis to the server using
     - calvin_utils/notebooks/neuroimaging_notebooks/server_submission_notebooks/submit_multiprocessed_voxelwise_analysis.ipynb
- Run 10K permutations on the server using that script.
    - Save each T-value output to a tmp dir.
    - Concatenate the 10K t-value niftis.
    - Compute P-values (permutations stacked along axis 0) with
        - p_vals = np.mean(t_val_empiric >= t_val_observed, axis=0)  # fraction of permuted t-values at least as large as the observed one, per voxel
        - sig_mask = p_vals < 0.05
        - thresh_t_vals = np.where(sig_mask, t_val_observed, 0)
- Done.
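
The p-value step can be sketched with numpy on stand-in arrays. The function name `permutation_p_values`, the array shapes, and the random "permuted" t-values are assumptions for illustration. The test is one-sided, matching the comparison in the list above, and adding 1 to the numerator and denominator is a common convention that keeps p-values above zero. No multiple-comparison correction is applied here.

```python
import numpy as np

def permutation_p_values(t_observed, t_permuted):
    """Hypothetical helper: voxelwise one-sided permutation p-values.

    t_observed: shape [VOXELS]; t_permuted: shape [PERMUTATIONS, VOXELS].
    """
    exceed = t_permuted >= t_observed[None, :]
    # Count the observed map as one permutation so p is never exactly 0
    return (exceed.sum(axis=0) + 1) / (t_permuted.shape[0] + 1)

rng = np.random.default_rng(0)
t_permuted = rng.normal(size=(999, 5))                 # stand-in for concatenated permuted t-maps
t_observed = np.array([0.0, 0.5, 1.0, 10.0, -10.0])    # stand-in for the observed t-map

p_vals = permutation_p_values(t_observed, t_permuted)
significant = p_vals < 0.05
thresh_t_vals = np.where(significant, t_observed, 0.0)
print(p_vals)
print(thresh_t_vals)
```

Only the voxel with an observed t-value of 10 survives the threshold in this toy example.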