# Run Any Kind of OLS Regression (ANOVA, GLM, Logit, etc.)

### Authors: Calvin Howard.

#### Last updated: May 5, 2024

Use this to run/test a statistical model (e.g., regression or T-tests) on a spreadsheet containing covariates and brain image (nii/gii) paths. 

Notes:
- For this to work, the `calvin_utils` package must be installed in the environment where you want to run it. To install it, run:
```
> git clone https://github.com/Calvinwhow/Research.git
> cd nimlab/calvin_utils/calvin_utils
> pip install -e .
```
- To best use this notebook, you should be familiar with GLM design and contrast matrix design. See this webpage to get started:
[FSL's GLM page](https://fsl.fmrib.ox.ac.uk/fsl/fslwiki/GLM)

# 00 - Import CSV with All Data
**The CSV is expected to be in this format**
- ID and absolute paths to niftis are critical
```
+-----+----------------------------+--------------+--------------+--------------+
| ID  | Nifti_File_Path            | Covariate_1  | Covariate_2  | Covariate_3  |
+-----+----------------------------+--------------+--------------+--------------+
| 1   | /path/to/file1.nii.gz      | 0.5          | 1.2          | 3.4          |
| 2   | /path/to/file2.nii.gz      | 0.7          | 1.4          | 3.1          |
| 3   | /path/to/file3.nii.gz      | 0.6          | 1.5          | 3.5          |
| 4   | /path/to/file4.nii.gz      | 0.9          | 1.1          | 3.2          |
| ... | ...                        | ...          | ...          | ...          |
+-----+----------------------------+--------------+--------------+--------------+
```
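As a minimal sketch of the layout above, the snippet below writes a tiny CSV in that shape and checks that every NIfTI path is absolute. The rows, the file name `example_input.csv`, and the placeholder paths are assumptions made up for illustration; they are not part of the notebook's data.

```python
import csv
import os
import tempfile

# Hypothetical rows mirroring the layout above; the paths are placeholders.
rows = [
    {"ID": 1, "Nifti_File_Path": "/path/to/file1.nii.gz", "Covariate_1": 0.5},
    {"ID": 2, "Nifti_File_Path": "/path/to/file2.nii.gz", "Covariate_1": 0.7},
]

# Write the rows to a temporary CSV file.
csv_path = os.path.join(tempfile.mkdtemp(), "example_input.csv")
with open(csv_path, "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)

# Read it back and collect any path that is not absolute.
with open(csv_path, newline="") as f:
    loaded = list(csv.DictReader(f))
relative_paths = [
    r["Nifti_File_Path"] for r in loaded if not os.path.isabs(r["Nifti_File_Path"])
]
print(f"{len(loaded)} rows, {len(relative_paths)} relative paths")
```

Running a check like this before the analysis can catch relative paths early, since they would only resolve from one working directory.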

Prep Output Directory

In [2]:
# Specify where you want to save your results to
out_dir = '/Users/cu135/Partners HealthCare Dropbox/Calvin Howard/studies/cognition_2023/revisions/notebook06/ongoing_ungodly_amount_of_edits'

Import Data

In [3]:
# Specify the path to your CSV file containing NIFTI paths
input_csv_path = '/Users/cu135/Partners HealthCare Dropbox/Calvin Howard/studies/cognition_2023/metadata/revisionsdata.csv'
sheet = None

In [4]:
from calvin_utils.permutation_analysis_utils.statsmodels_palm import CalvinStatsmodelsPalm
# Instantiate the CalvinStatsmodelsPalm class
cal_palm = CalvinStatsmodelsPalm(input_csv_path=input_csv_path, output_dir=out_dir, sheet=sheet)
# Read the CSV and display the data
data_df = cal_palm.read_and_display_data()
data_df

Unnamed: 0,Dataset,Subject,Nifti_File_Path,age,z_scored_improvement,sbc_conn,sex,DatasetInt,Baseline_Cognitive_Score,Frequency,Pulse_Width__uS_,Amperage__mA_
0,AD Fornix DBS,150,/Users/cu135/Partners HealthCare Dropbox/Calvi...,71,14.362311,73.488381,m,1,17.0,130,90,3.5
1,AD Fornix DBS,149,/Users/cu135/Partners HealthCare Dropbox/Calvi...,77,-89.274052,62.007555,m,1,19.0,130,90,3.5
2,AD Fornix DBS,148,/Users/cu135/Partners HealthCare Dropbox/Calvi...,51,-206.966360,75.739873,m,1,13.0,130,90,3.5
3,AD Fornix DBS,147,/Users/cu135/Partners HealthCare Dropbox/Calvi...,59,-4.035957,69.447270,m,1,13.0,130,90,3.5
4,AD Fornix DBS,146,/Users/cu135/Partners HealthCare Dropbox/Calvi...,76,-53.819507,46.331586,m,1,24.0,130,90,3.5
...,...,...,...,...,...,...,...,...,...,...,...,...
75,PD STN DBS,MDST05,,60,0.503206,,m,2,143.0,150,50,3.5
76,PD STN DBS,MDST04,/Users/cu135/Partners HealthCare Dropbox/Calvi...,50,-0.282257,21.207602,m,2,,130,60,3.5
77,PD STN DBS,MDST03,/Users/cu135/Partners HealthCare Dropbox/Calvi...,62,-0.005890,30.900051,f,2,0.0,130,60,3.5
78,PD STN DBS,MDST02,/Users/cu135/Partners HealthCare Dropbox/Calvi...,50,1.294321,16.295870,f,2,2.0,130,60,3.5



# 01 - Preprocess Your Data

**Handle NANs**
- Set drop_nans=True if you would like to remove NaNs from the data
- Provide a column name or a list of column names to remove NaNs from
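The effect of dropping NaNs from selected columns can be sketched with plain pandas. This is an illustration of the idea only, assuming behavior similar to `DataFrame.dropna(subset=...)`; it is not necessarily how `drop_nans_from_columns` is implemented, and the toy values are made up.

```python
import numpy as np
import pandas as pd

# Toy data; the values are illustrative only.
toy_df = pd.DataFrame({
    "age": [71, np.nan, 51, 60],
    "z_scored_improvement": [14.4, -89.3, np.nan, 0.5],
    "sbc_conn": [73.5, 62.0, 75.7, np.nan],
})

# A row is dropped only if it has a NaN in one of the listed columns.
toy_drop_list = ["age", "z_scored_improvement"]
cleaned = toy_df.dropna(subset=toy_drop_list)
print(cleaned)
```

The last row keeps its NaN in `sbc_conn` because that column is not in the list, which matches the output further down where a row with a missing `Baseline_Cognitive_Score` survives.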

In [5]:
data_df.columns

Index(['Dataset', 'Subject', 'Nifti_File_Path', 'age', 'z_scored_improvement',
       'sbc_conn', 'sex', 'DatasetInt', 'Baseline_Cognitive_Score',
       'Frequency', 'Pulse_Width__uS_', 'Amperage__mA_'],
      dtype='object')

In [6]:
drop_list = ['age', 'z_scored_improvement', 'Nifti_File_Path']

In [7]:
data_df = cal_palm.drop_nans_from_columns(columns_to_drop_from=drop_list)

**Drop Row Based on Value of Column**

Define the column, condition, and value for dropping rows
- column = 'your_column_name'
- condition = 'above'  # Options: 'equal', 'above', 'below', 'not'
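A rough sketch of what condition-based row dropping could look like is shown here. The helper `drop_rows_sketch` is hypothetical and is not the library's `drop_rows_based_on_value`. It assumes the second returned frame holds the dropped rows, and that `'not'` means "drop rows whose value differs from `value`".

```python
import pandas as pd

def drop_rows_sketch(df, column, condition, value):
    """Return (kept, dropped) frames. Hypothetical helper, not the library's method."""
    if condition == "equal":
        to_drop = df[column] == value
    elif condition == "above":
        to_drop = df[column] > value
    elif condition == "below":
        to_drop = df[column] < value
    elif condition == "not":
        # Assumed meaning: drop every row that does NOT match the value
        to_drop = df[column] != value
    else:
        raise ValueError(f"Unknown condition: {condition}")
    return df[~to_drop], df[to_drop]

# Toy frame with a made-up 'City' column
toy_df = pd.DataFrame({"City": ["Boston", "Toronto", "Boston"], "age": [71, 77, 51]})
kept, dropped = drop_rows_sketch(toy_df, "City", "equal", "Boston")
print(kept)
```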

In [8]:
data_df.columns

Index(['Dataset', 'Subject', 'Nifti_File_Path', 'age', 'z_scored_improvement',
       'sbc_conn', 'sex', 'DatasetInt', 'Baseline_Cognitive_Score',
       'Frequency', 'Pulse_Width__uS_', 'Amperage__mA_'],
      dtype='object')

Set the parameters for dropping rows

In [9]:
column = 'City'  # The column you'd like to evaluate
condition = 'equal'  # The condition to check ('equal', 'above', 'below', 'not')
value = 'Boston' # The value to drop if found

In [10]:
# data_df, other_df = cal_palm.drop_rows_based_on_value(column, condition, value)
display(data_df)

Unnamed: 0,Dataset,Subject,Nifti_File_Path,age,z_scored_improvement,sbc_conn,sex,DatasetInt,Baseline_Cognitive_Score,Frequency,Pulse_Width__uS_,Amperage__mA_
0,AD Fornix DBS,150,/Users/cu135/Partners HealthCare Dropbox/Calvi...,71,14.362311,73.488381,m,1,17.0,130,90,3.5
1,AD Fornix DBS,149,/Users/cu135/Partners HealthCare Dropbox/Calvi...,77,-89.274052,62.007555,m,1,19.0,130,90,3.5
2,AD Fornix DBS,148,/Users/cu135/Partners HealthCare Dropbox/Calvi...,51,-206.966360,75.739873,m,1,13.0,130,90,3.5
3,AD Fornix DBS,147,/Users/cu135/Partners HealthCare Dropbox/Calvi...,59,-4.035957,69.447270,m,1,13.0,130,90,3.5
4,AD Fornix DBS,146,/Users/cu135/Partners HealthCare Dropbox/Calvi...,76,-53.819507,46.331586,m,1,24.0,130,90,3.5
...,...,...,...,...,...,...,...,...,...,...,...,...
74,PD STN DBS,MDST06,/Users/cu135/Partners HealthCare Dropbox/Calvi...,60,0.245073,23.577739,m,2,143.0,150,50,3.5
76,PD STN DBS,MDST04,/Users/cu135/Partners HealthCare Dropbox/Calvi...,50,-0.282257,21.207602,m,2,,130,60,3.5
77,PD STN DBS,MDST03,/Users/cu135/Partners HealthCare Dropbox/Calvi...,62,-0.005890,30.900051,f,2,0.0,130,60,3.5
78,PD STN DBS,MDST02,/Users/cu135/Partners HealthCare Dropbox/Calvi...,50,1.294321,16.295870,f,2,2.0,130,60,3.5


**Standardize Data**
- List any columns you do not want to standardize

In [11]:
# # Remove anything you don't want to standardize
# cols_not_to_standardize = None # ['Z_Scored_Percent_Cognitive_Improvement_By_Origin_Group', 'Z_Scored_Subiculum_T_By_Origin_Group_'] #['Age']

In [12]:
# data_df = cal_palm.standardize_columns(cols_not_to_standardize)
# data_df

In [13]:
# data_df.columns

# 02 - Define Your Formula

This is the formula relating outcome to predictors, and takes the form:
- y = B0 + B1*X1 + B2*X2 + B3*X3 + . . . + BN*XN

It is defined using the columns of your dataframe instead of the variables above:
- 'Apples_Picked ~ hours_worked + owns_apple_picking_machine'
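To make the link between the formula and the coefficients concrete, here is a small sketch that fits the apple-picking example by ordinary least squares with NumPy. The data are simulated with made-up coefficients; the notebook itself builds the design matrix from the formula string, so this only illustrates the underlying arithmetic.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
hours_worked = rng.uniform(1, 8, size=n)
owns_apple_picking_machine = np.tile([0.0, 1.0], n // 2)

# Simulated outcome with known coefficients: B0 = 2, B1 = 3, B2 = 10
apples_picked = (
    2.0
    + 3.0 * hours_worked
    + 10.0 * owns_apple_picking_machine
    + rng.normal(0, 0.1, size=n)
)

# Design matrix: a column of ones (intercept) plus one column per predictor
X = np.column_stack([np.ones(n), hours_worked, owns_apple_picking_machine])

# Least-squares estimate of [B0, B1, B2]
beta, *_ = np.linalg.lstsq(X, apples_picked, rcond=None)
print(beta)
```

The recovered coefficients should land close to the simulated values, since the noise is small.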

____
**ANOVA**
- Tests differences in means for one categorical variable.
- formula = 'Outcome ~ C(Group1)'

**2-Way ANOVA**
- Tests differences in means for two categorical variables without interaction.
- formula = 'Outcome ~ C(Group1) + C(Group2)'

**2-Way ANOVA with Interaction**
- Tests for interaction effects between two categorical variables.
- formula = 'Outcome ~ C(Group1) * C(Group2)'

**ANCOVA**
- Similar to ANOVA, but includes a covariate to control for its effect.
- formula = 'Outcome ~ C(Group1) + Covariate'

**2-Way ANCOVA**
- Extends ANCOVA with two categorical variables and their interaction, controlling for a covariate.
- formula = 'Outcome ~ C(Group1) * C(Group2) + Covariate'

**Multiple Regression**
- Assesses the impact of multiple predictors on an outcome.
- formula = 'Outcome ~ Predictor1 + Predictor2'

**Simple Linear Regression**
- Assesses the impact of a single predictor on an outcome.
- formula = 'Outcome ~ Predictor'

**MANOVA**
- Assesses multiple dependent variables across groups.
- Note: Not typically set up with a formula in statsmodels. Requires specialized functions.

____
Use the printout below to design your formula. 
- Left of the "~" symbol is the thing to be predicted. 
- Right of the "~" symbol are the predictors. 
- ":" indicates an interaction between two things. 
- "*" indicates an interaction AND also includes the simple (main) effects of each term. 
- "+" indicates that you want to add another predictor. 
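The operators above can be unpacked by building the design-matrix columns by hand. The sketch below assumes treatment (dummy) coding for `C(...)`, which is what the `[T.…]` column name in the design matrix printed later suggests. The variable names and values are illustrative.

```python
import numpy as np

group = np.array(["A", "A", "B", "B"])
age = np.array([71.0, 77.0, 51.0, 59.0])

# C(group): treatment (dummy) coding with "A" as the reference level
group_is_b = (group == "B").astype(float)

# 'y ~ C(group) + age'  ->  intercept, group dummy, age
X_additive = np.column_stack([np.ones(len(age)), group_is_b, age])

# 'y ~ C(group) * age'  ->  the additive columns plus the group:age interaction
X_interaction = np.column_stack([X_additive, group_is_b * age])
print(X_interaction)
```

In other words, `a * b` is shorthand for `a + b + a:b`, where the `a:b` column is the elementwise product of the two terms.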

In [14]:
data_df.columns

Index(['Dataset', 'Subject', 'Nifti_File_Path', 'age', 'z_scored_improvement',
       'sbc_conn', 'sex', 'DatasetInt', 'Baseline_Cognitive_Score',
       'Frequency', 'Pulse_Width__uS_', 'Amperage__mA_'],
      dtype='object')

In [19]:
formula = "z_scored_improvement ~ Nifti_File_Path * age + Dataset"

# 02 - Visualize Your Design Matrix

This is the explanatory variable half of your regression formula
_______________________________________________________
Create Design Matrix: Use the define_design_matrix method. You can provide a list of formula variables which correspond to column names in your dataframe.

- voxelwise_variable_list = A list containing the names of each variable that has voxelwise variables. Plainly, the variables that represent niftis. 
- By default, an intercept will be added unless you set intercept=False
- **don't explicitly add the 'intercept' column. I'll do it for you.**

In [20]:
voxelwise_variable_list=['Nifti_File_Path']

If you want to run voxelwise INTERACTIONS, then you should specify the exact terms, exactly as specified in your above formula, here. 
- For example, if the formula is outcome ~ voxelwise_var1 * age + dog_number, then voxelwise_interaction_terms is ['voxelwise_var1 * age']
- Set voxelwise_interaction_terms = None if you do not want to specify any interaction terms. 
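A voxelwise interaction amounts to multiplying each subject's image values by that subject's scalar covariate. The sketch below uses random numbers in place of real images and assumes a (voxels, subjects) layout, which is the layout the `RegressionPrep` class later in this notebook appears to use after masking.

```python
import numpy as np

rng = np.random.default_rng(0)
n_voxels, n_subjects = 5, 4
voxel_values = rng.normal(size=(n_voxels, n_subjects))  # (voxels, subjects)
age = np.array([71.0, 77.0, 51.0, 59.0])                # (subjects,)

# Broadcasting multiplies each subject's column of voxel values by that subject's age
voxelwise_interaction = voxel_values * age
print(voxelwise_interaction.shape)
```

The result has one interaction value per voxel per subject, so the interaction regressor differs at every voxel.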

In [21]:
voxelwise_interaction_terms = ['Nifti_File_Path * age']

In [22]:
# Define the design matrix
outcome_df, design_matrix = cal_palm.define_design_matrix(formula, data_df, voxelwise_variable_list=voxelwise_variable_list, voxelwise_interaction_terms=voxelwise_interaction_terms)
design_matrix

Unnamed: 0,Intercept,Dataset[T.PD STN DBS],age,Nifti_File_Path,Nifti_File_Path*age
0,1.0,0.0,71.0,/Users/cu135/Partners HealthCare Dropbox/Calvi...,voxelwise_interaction
1,1.0,0.0,77.0,/Users/cu135/Partners HealthCare Dropbox/Calvi...,voxelwise_interaction
2,1.0,0.0,51.0,/Users/cu135/Partners HealthCare Dropbox/Calvi...,voxelwise_interaction
3,1.0,0.0,59.0,/Users/cu135/Partners HealthCare Dropbox/Calvi...,voxelwise_interaction
4,1.0,0.0,76.0,/Users/cu135/Partners HealthCare Dropbox/Calvi...,voxelwise_interaction
...,...,...,...,...,...
74,1.0,1.0,60.0,/Users/cu135/Partners HealthCare Dropbox/Calvi...,voxelwise_interaction
76,1.0,1.0,50.0,/Users/cu135/Partners HealthCare Dropbox/Calvi...,voxelwise_interaction
77,1.0,1.0,62.0,/Users/cu135/Partners HealthCare Dropbox/Calvi...,voxelwise_interaction
78,1.0,1.0,50.0,/Users/cu135/Partners HealthCare Dropbox/Calvi...,voxelwise_interaction


# 03 - Visualize Your Dependent Variable

I have generated this for you based on the formula you provided

In [23]:
outcome_df

Unnamed: 0,z_scored_improvement
0,14.362311
1,-89.274052
2,-206.966360
3,-4.035957
4,-53.819507
...,...
74,0.245073
76,-0.282257
77,-0.005890
78,1.294321


# 04 - Generate Contrasts

Generate a Contrast Matrix
- This is different from the contrast matrices used in cell-means regressions such as in PALM, but it is much more powerful. 



For more information on contrast matrices, please refer to this: https://cran.r-project.org/web/packages/codingMatrices/vignettes/codingMatrices.pdf

Generally, these drastically affect the results of ANOVA. However, they are merely a nuisance for a regression.
In essence, they assess if coefficients are significantly different

________________________________________________________________
A coding matrix (a contrast matrix if it sums to zero) is simply a way of defining what coefficients to evaluate and how to evaluate them. 
If a coefficient is set to 1 and everything else is set to zero, we are assessing whether that coefficient deviates significantly from zero, i.e., we are checking if it had a significant impact on the ability to predict the dependent variable.
If a coefficient is set to 1, another is -1, and others are 0, we are assessing how the two coefficients deviate from each other. 
If several coefficients are 1 and several others are -1, we are assessing how the group-level means of the two sets of coefficients deviate from each other.
If a group of coefficients are 1, a group is -1, and a group is 0, we are only assessing how the +1 and -1 groups differ in their means. 

1: This value indicates that the corresponding variable's coefficient in the model is included in the contrast. It means you are interested in estimating the effect of that variable.

0: This value indicates that the corresponding variable's coefficient in the model is not included in the contrast. It means you are not interested in estimating the effect of that variable.

-1: This value indicates that the corresponding variable's coefficient in the model is included in the contrast, but with an opposite sign. It means you are interested in estimating the negative effect of that variable.

----------------------------------------------------------------
The contrast matrix is typically a matrix with dimensions (number of contrasts) x (number of regression coefficients). Each row of the contrast matrix represents a contrast or comparison you want to test.

For example, let's say you have the following regression coefficients in your model:

Intercept, Age, connectivity, Age_interaction_connectivity
A contrast matrix has dimensions of [n_contrasts, n_predictors], where each row is one contrast (one experiment).

If you want to test the hypothesis that the effect of Age is significant, you can set up a contrast matrix with a row that specifies this contrast (actually an averaging vector):
```
[0,1,0,0]. This is an averaging vector because it sums to 1
```
This contrast will test the coefficient corresponding to the Age variable against zero.


If you want to test the hypothesis that the effect of Age is different from the effect of connectivity, you can set up a contrast matrix with a single row:
```
[0,1,−1,0]. This is a contrast because it sums to 0
```
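The two contrasts above can be turned into t-values with a few lines of NumPy. This sketch uses simulated data with made-up effects and the same formula that appears in the `apply_contrasts` method later in this notebook, `t = (C @ BETA) / sqrt(diag(C @ XtX_inv @ C.T) * MSE)`. The coefficient order (Intercept, Age, connectivity, interaction) follows the example in the text.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 40
age = rng.normal(size=n)
connectivity = rng.normal(size=n)

# Columns: Intercept, Age, connectivity, Age_interaction_connectivity
X = np.column_stack([np.ones(n), age, connectivity, age * connectivity])
# Simulated outcome: only Age has a true effect
y = 1.0 + 2.0 * age + rng.normal(size=n)

XtX_inv = np.linalg.pinv(X.T @ X)
beta = XtX_inv @ X.T @ y
residuals = y - X @ beta
mse = np.sum(residuals**2) / (n - X.shape[1])  # residual variance

C = np.array([
    [0, 1, 0, 0],    # Age against zero
    [0, 1, -1, 0],   # Age minus connectivity
])
t_values = (C @ beta) / np.sqrt(np.diag(C @ XtX_inv @ C.T) * mse)
print(t_values)
```

Each row of `C` yields one t-value, so adding a row adds a hypothesis test without refitting the model.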

Thus, if you want to see if any given effect is significant compared to the intercept (average), you can use the following contrast matrix:
```
[1,0,0,0]
[-1,1,0,0]
[-1,0,1,0]
[-1,0,0,1] actually a coding matrix of averaging vectors
```

The first row tests the intercept against zero. Each of the remaining rows tests whether one coefficient (Age, connectivity, or their interaction) differs from the intercept.
_____
You can define any number of contrasts in the contrast matrix to test different hypotheses or comparisons of interest in your regression analysis.

It's important to note that the specific contrasts you choose depend on your research questions and hypotheses. You should carefully consider the comparisons you want to make and design the contrast matrix accordingly.

- Examples:
    - [Two Sample T-Test](https://fsl.fmrib.ox.ac.uk/fsl/fslwiki/GLM#Two-Group_Difference_.28Two-Sample_Unpaired_T-Test.29)
    - [One Sample with Covariate](https://fsl.fmrib.ox.ac.uk/fsl/fslwiki/GLM#Single-Group_Average_with_Additional_Covariate)

In [24]:
contrast_matrix = cal_palm.generate_basic_contrast_matrix(design_matrix)

Here is a basic contrast matrix set up to evaluate the significance of each variable.
Here is an example of what your contrast matrix looks like as a dataframe: 


Unnamed: 0,Intercept,Dataset[T.PD STN DBS],age,Nifti_File_Path,Nifti_File_Path*age
0,1,0,0,0,0
1,0,1,0,0,0
2,0,0,1,0,0
3,0,0,0,1,0
4,0,0,0,0,1


Below is the same contrast matrix, but as an array.
Copy it into a cell below and edit it for more control over your analysis.
[
    [1, 0, 0, 0, 0],
    [0, 1, 0, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 0, 1, 0],
    [0, 0, 0, 0, 1],
]


In [25]:
contrast_matrix = [
    [0, 0, 0, 0, 1]
    ]

In [26]:
contrast_matrix_df = cal_palm.finalize_contrast_matrix(design_matrix=design_matrix, 
                                                    contrast_matrix=contrast_matrix) 
contrast_matrix_df

Unnamed: 0,Intercept,Dataset[T.PD STN DBS],age,Nifti_File_Path,Nifti_File_Path*age
0,0,0,0,0,1


# 05 - Generate Files
Standardization during regression is critical. 
- data_transform_method='standardize' will ensure the voxelwise values are standardized
    - If your design matrix has a column called 'Dataset', the values will be standardized within each dataset individually, which is normally the right approach.
    - If you call data_transform_method='standardize' without having a 'Dataset' column in your design matrix, the entire collection of images will be standardized. This is potentially dangerous and misleading. Be careful, and consider not standardizing at all, or going back and adding a 'Dataset' column. 
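The idea of standardizing within each dataset can be sketched as follows. This is an illustration under assumptions, not the library's implementation: `standardize_within_groups` is a hypothetical helper, the data are random, and the layout is assumed to be (voxels, subjects). Note that the `_standardize` method in the `RegressionPrep` cell below z-scores across all subjects at once, so check which behavior applies to your run.

```python
import numpy as np

def standardize_within_groups(values, groups, eps=1e-8):
    """Z-score each voxel across subjects, separately within each group.

    values: (voxels, subjects) array; groups: (subjects,) array of labels.
    Hypothetical helper for illustration only.
    """
    out = np.empty_like(values, dtype=float)
    for g in np.unique(groups):
        cols = groups == g
        block = values[:, cols]
        mean = block.mean(axis=1, keepdims=True)
        std = block.std(axis=1, keepdims=True)
        out[:, cols] = (block - mean) / (std + eps)
    return out

rng = np.random.default_rng(0)
# Two hypothetical datasets measured on very different scales
values = np.hstack([
    rng.normal(50, 10, size=(3, 6)),
    rng.normal(0, 1, size=(3, 4)),
])
groups = np.array([1] * 6 + [2] * 4)
z = standardize_within_groups(values, groups)
print(z.round(2))
```

Standardizing the pooled data instead would leave the first dataset's values far from the second's, so dataset membership would dominate the regressor.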

Mask Path
- set mask_path to the path of your local brain mask which matches the resolution of the files you have collected. Typically this is an MNI 152 brain mask. 
    - download one here: https://nilearn.github.io/dev/modules/generated/nilearn.datasets.load_mni152_brain_mask.html
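The role of the mask can be shown with a small synthetic volume. The sketch below uses random numbers and a made-up cubic mask rather than a real MNI152 mask. It shows why the mask must match the image resolution: the boolean mask indexes the spatial axes directly, so the shapes have to agree.

```python
import numpy as np

rng = np.random.default_rng(0)
volume_shape = (4, 4, 4)
n_subjects = 3
stack = rng.normal(size=volume_shape + (n_subjects,))  # (x, y, z, subjects)

# Hypothetical binary mask: a 2x2x2 cube in the middle of the volume
mask = np.zeros(volume_shape, dtype=bool)
mask[1:3, 1:3, 1:3] = True

# Masking flattens the in-mask voxels into rows: (voxels_in_mask, subjects)
masked = stack[mask, :]

# Unmasking puts one subject's values back into volume shape, zeros elsewhere
restored = np.zeros(volume_shape)
restored[mask] = masked[:, 0]
print(masked.shape, restored.shape)
```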

In [27]:
mask_path = '/Users/cu135/hires_backdrops/MNI/MNI152_T1_2mm_brain_mask.nii'
data_transform_method='standardize'

In [28]:
import os
import numpy as np
import nibabel as nib
import json
from tqdm import tqdm

class RegressionPrep:
    def __init__(self, design_matrix, contrast_matrix, outcome_df, out_dir,
                 voxelwise_variables=None, voxelwise_interactions=None,
                 mask_path=None, exchangeability_block=None, 
                 data_transform_method='standardize'):
        """
        Initializes the RegressionPrep class.

        Parameters:
        - design_matrix (pd.DataFrame): The design matrix containing scalar and voxelwise variables.
        - contrast_matrix (np.ndarray or list): The contrast matrix specifying the contrasts of interest.
        - outcome_df (pd.DataFrame): The outcome data, either scalar or voxelwise.
        - out_dir (str): The output directory where processed data will be saved.
        - voxelwise_variables (list, optional): List of voxelwise variable names. Defaults to None.
        - voxelwise_interactions (list, optional): List of interaction terms involving voxelwise variables. Defaults to None.
        - mask_path (str, optional): Path to the brain mask NIfTI file. Defaults to None.
        - exchangeability_block (np.ndarray, optional): Exchangeability block for permutation testing. Defaults to None. Should be of shape (n_subjects,), and composed of integers which indicate exchangeability blocks.
        - data_transform_method (str, optional): Method to transform data ('standardize' or other). Defaults to 'standardize'.
        """
        self.design_matrix = design_matrix
        self.contrast_matrix = contrast_matrix
        self.outcome_df = outcome_df
        self.voxelwise_variables = voxelwise_variables or []
        self.voxelwise_interactions = [interaction.replace(' ', '') for interaction in (voxelwise_interactions or [])]
        self.out_dir = out_dir
        self.mask_path = mask_path
        self.exchangeability_block = exchangeability_block
        self.data_transform_method = data_transform_method

        os.makedirs(self.out_dir, exist_ok=True)
    ### setters and getters ###
    def _load_nifti_stack(self, paths):
        data = [nib.load(p).get_fdata() for p in tqdm(paths, desc="Loading NIFTIs")]
        return np.stack(data, axis=-1)

    def _mask_data(self, data):
        mask = nib.load(self.mask_path).get_fdata() > 0 if self.mask_path else None
        return data[mask, :] if mask is not None else data

    def _standardize(self, data):
        mean, std = np.mean(data, axis=-1, keepdims=True), np.std(data, axis=-1, keepdims=True)
        return (data - mean) / (std + 1e-8)

    def _handle_nans(self, arr, value=0):
        """Handles NaNs by replacing them (and pos/neg inf) with finite values."""
        max_val = np.nanmax(arr)
        min_val = np.nanmin(arr)
        arr = np.nan_to_num(arr, nan=value, posinf=max_val, neginf=min_val)
        return arr
    
    def _prepare_voxelwise_terms(self):
        '''Grabs the initial voxelwise data and applies the mask'''
        voxelwise_data = {}
        for term in self.voxelwise_variables:
            if term == self.outcome_df.columns[0]: # don't use the dependent variable
                continue
            paths = self.design_matrix[term].values
            stacked = self._load_nifti_stack(paths)
            stacked = self._mask_data(stacked)
            stacked = self._handle_nans(stacked)
            if self.data_transform_method == 'standardize':
                stacked = self._standardize(stacked)
            voxelwise_data[term] = stacked
        return voxelwise_data

    def _apply_interactions(self, voxelwise_data):
        '''Grabs the voxelwise data and multiplies it by the scalar term'''
        for col in  self.voxelwise_interactions:
            term1, term2 = [x.strip() for x in (col.split(':') if ':' in col else col.split('*'))]
            voxel_term = term1 if term1 in self.voxelwise_variables else term2
            scalar_term = term2 if voxel_term == term1 else term1            
            interaction_values = self.design_matrix[scalar_term].values.astype(float)
            voxelwise_data[col] = voxelwise_data[voxel_term] * interaction_values
        return voxelwise_data

    def _prepare_outcome_data(self):
        '''Handle outcome data, with ability to accept voxelwise (4d) and 2d data'''
        outcome_data = self.outcome_df.values
        if self.outcome_df.columns[0] in self.voxelwise_variables:
            outcome_data = self._load_nifti_stack(outcome_data)
            outcome_data = self._mask_data(outcome_data)
            if self.data_transform_method == 'standardize':
                outcome_data = self._standardize(outcome_data)
            outcome_data = self._handle_nans(outcome_data)
        return outcome_data
    
    def _prepare_design_matrix(self, voxelwise_data):
        """
        Creates a design tensor of shape (observations, parameters, voxels).
        Voxelwise terms vary per voxel; scalar terms repeat or use placeholders.
        """
        num_obs, num_params = self.design_matrix.shape
        example_voxelwise_term = next(iter(voxelwise_data.values()))
        num_voxels = example_voxelwise_term.shape[0]

        design_tensor = np.full((num_obs, num_params, num_voxels), np.nan, dtype=np.float32)
        for idx, col in enumerate(self.design_matrix.columns):
            if (col in voxelwise_data) or (col in self.voxelwise_interactions):
                design_tensor[:, idx, :] = voxelwise_data[col].T        #(voxels, subjects) -> transpose to (subjects, voxels)
            else:
                col_values = self.design_matrix[col].values[:, np.newaxis]      # scalar regressor: broadcast values across voxels
                design_tensor[:, idx, :] = np.repeat(col_values, num_voxels, axis=1)
        return design_tensor

    def _save_dataset(self):
        dataset_dict = {
            'voxelwise_regression': {
                "design_matrix": f"{self.out_dir}/design_matrix.npy",
                "contrast_matrix": f"{self.out_dir}/contrast_matrix.npy",
                "outcome_data": f"{self.out_dir}/outcome_data.npy"
            }
        }
        
        np.save(f"{self.out_dir}/design_matrix.npy", self.design_tensor)
        np.save(f"{self.out_dir}/contrast_matrix.npy", self.contrast_matrix)
        np.save(f"{self.out_dir}/outcome_data.npy", self.outcome_data)
        
        if self.exchangeability_block is not None:
            dataset_dict['voxelwise_regression']["exchangeability_block"] = f"{self.out_dir}/exchangeability_block.npy"
            np.save(f"{self.out_dir}/exchangeability_block.npy", self.exchangeability_block)
        
        with open(f"{self.out_dir}/dataset_dict.json", "w") as f:
            json.dump(dataset_dict, f, indent=4)
        return dataset_dict, f"{self.out_dir}/dataset_dict.json"

    def run(self):
        voxelwise_data = self._prepare_voxelwise_terms()
        voxelwise_data = self._apply_interactions(voxelwise_data)
        print(voxelwise_data.keys())
        self.outcome_data = self._prepare_outcome_data()
        self.design_tensor = self._prepare_design_matrix(voxelwise_data)
        dataset_dict, json_path = self._save_dataset()
        return dataset_dict, json_path


Define exchangeability block
- Set to None if you don't know
- If you are running multiple cohorts, set exchangeability block to be the column which has each group in it, with groups being indicated by integers. 
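An exchangeability block restricts permutations so that observations are only shuffled with others from the same group. The sketch below shows one way to build such a permutation index; `permute_within_blocks` is a hypothetical helper and the block labels are made up.

```python
import numpy as np

def permute_within_blocks(block_labels, rng):
    """Return an index array that shuffles observations only within their own block."""
    block_labels = np.asarray(block_labels).ravel()
    idx = np.arange(block_labels.size)
    for block in np.unique(block_labels):
        members = np.where(block_labels == block)[0]
        # Positions belonging to this block receive a shuffled copy of themselves
        idx[members] = rng.permutation(members)
    return idx

rng = np.random.default_rng(0)
blocks = np.array([1, 1, 1, 2, 2, 2, 2])  # integer group label per subject
resample_idx = permute_within_blocks(blocks, rng)
print(resample_idx)
```

Indexing the outcome with `resample_idx` then shuffles outcomes across subjects without ever moving a subject's outcome into a different cohort.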

In [31]:
exchangeability_col = None
# exchangeability_blocks = data_df[exchangeability_col].values

In [32]:
# from calvin_utils.ccm_utils.npy_utils import RegressionNPYPreparer
preparer = RegressionPrep(design_matrix=design_matrix, 
                          contrast_matrix=contrast_matrix_df, 
                          outcome_df=outcome_df, 
                          out_dir=out_dir,
                          voxelwise_variables=voxelwise_variable_list, 
                          voxelwise_interactions=voxelwise_interaction_terms,
                          mask_path=mask_path, 
                          exchangeability_block=None, 
                          data_transform_method='standardize')

dataset_dict, json_path = preparer.run()


Loading NIFTIs: 100%|██████████| 72/72 [00:00<00:00, 99.76it/s] 


dict_keys(['Nifti_File_Path', 'Nifti_File_Path*age'])


# 06 - Run the Voxelwise Regression

In [33]:
import os
import json
import numpy as np
import nibabel as nib
from tqdm import tqdm
from scipy.stats import t
from calvin_utils.ccm_utils.npy_utils import DataLoader

class SimpleVoxelwiseRegression:
    def __init__(self, json_path, mask_path=None, out_dir=None):
        self.json_path = json_path
        self.mask_path = mask_path
        self.out_dir = out_dir
        self.data_loader = DataLoader(self.json_path)
        self.load_data()
        self.set_variables()
        
    #### Setter/Getter methods ####
    def load_data(self):
        with open(self.json_path, 'r') as f:
            paths = json.load(f)['voxelwise_regression']
        self.design_tensor = np.load(paths['design_matrix'])  # shape: (observations, predictors, voxels)
        self.contrast_matrix = np.load(paths['contrast_matrix'])  # shape: (contrasts, predictors)
        self.outcome_data = np.load(paths['outcome_data'])  # shape: (observations, voxels)
        self.exchangeability_blocks = np.load(paths["exchangeability_block"]) if "exchangeability_block" in paths else None

    def set_variables(self):
        self.n_obs, self.n_preds, self.n_voxels = self.design_tensor.shape
        self.n_contrasts = self.contrast_matrix.shape[0]
        
    #### Nifti Saving Methods ####
    def _unmask_array(self, data_array):
        """
        Unmasks a vectorized image to full-brain shape using self.mask_path.
        Returns:
            unmasked_array: full-brain NIfTI-like array
            mask_affine: affine transformation from mask
        """
        if self.mask_path is None:
            raise ValueError("Mask path is not provided. Provide the mask used to create the data_array.")
        else:
            mask = nib.load(self.mask_path)
            mask_data = mask.get_fdata()
            mask_indices = mask_data.flatten() > 0  # Assuming mask is binary
            unmasked_array = np.zeros(mask_indices.shape)
            unmasked_array[mask_indices] = data_array.flatten()
        return unmasked_array.reshape(mask_data.shape), mask.affine

    def _save_map(self, map_data, file_name):
        """
        Saves unmasked NIfTI image to disk.
        """
        if self.out_dir is None:
            return
        
        unmasked_map, mask_affine = self._unmask_array(map_data)
        img = nib.Nifti1Image(unmasked_map, affine=mask_affine)
        file_path = os.path.join(self.out_dir, file_name)
        os.makedirs(os.path.dirname(file_path), exist_ok=True)
        nib.save(img, file_path)
        return img
    
    def _save_nifti_maps(self):
        """
        Example method that unmasks and saves NIfTI maps for BETA & T.
        Assumes you have self._unmask_array() and self._save_map() in place.
        """
        if not self.out_dir or not self.mask_path:
            return
        
        # Save betas: shape (n_preds, n_voxels)
        if hasattr(self, 'BETA'):
            for i in range(self.n_preds):
                beta_name = f"beta_predictor_{i}.nii.gz"
                self._save_map(self.BETA[i, :], beta_name)
        # Save T-values for Betas: shape (n_preds, n_voxels)
        if hasattr(self, 'T'):
            for c in range(self.n_contrasts):
                self._save_map(self.T[c, :], f"contrast_{c}_tval.nii.gz")
        # Save overall R2 (measure of overall model fit)
        if hasattr(self, 'R2'):
            self._save_map(self.R2, f"R2_vals.nii.gz")

        # Save FWE-corrected significance masks if we have permutation results
        if hasattr(self, 'Tp'):
            for c in range(self.n_contrasts):
                sig_mask = (self.Tp[c, :] < 0.05)
                sig_tvals = np.where(sig_mask, self.T[c, :], np.nan)
                self._save_map(sig_tvals, f"contrast_{c}_tval_FWE.nii.gz")
                self._save_map(self.Tp[c, :], f"contrast_{c}_pval_FWE.nii.gz")
                
        # Save FWE-corrected significance masks if we have permutation results
        if hasattr(self, 'R2p'):
            sig_mask = (self.R2p < 0.05)
            sig_r2vals = np.where(sig_mask, self.R2, np.nan)
            self._save_map(sig_r2vals, f"R2_FWE.nii.gz")
            self._save_map(self.R2p, f"R2_pval_FWE.nii.gz")
            
    #### Regression Methods ####
    def get_r2(self, Y, Y_HAT, e=1e-6):
        """
        Calculate R-squared value from 1 - SS_Residuals/SS_TotalObservations
        Y is the observed data
        Y_HAT is the predicted data
        """
        SS_residual = np.sum((Y - Y_HAT)**2)
        SS_total = np.sum((Y - np.mean(Y))**2)
        return 1 - (SS_residual / (SS_total+e))
    
    def apply_contrasts(self, XtX_inv, BETA, MSE, e=1e-6, get_p=False):
        """
        t = (C @ BETA) / sqrt(diag(C @ XtX_inv @ C.T) * MSE)

        'C' is (c, p)
        'BETA' is (p,) or (p, 1)
        'XtX_inv' is (p, p)
        'MSE' is scalar (residual variance)
        """
        C = self.contrast_matrix
        NUM = C @ BETA
        var_diag = np.diag(C @ XtX_inv @ C.T)  
        DEN = np.sqrt(var_diag * MSE)      
        T = NUM / (DEN+e)
        if get_p:
            df = self.n_obs - self.n_preds
            P = 2 * t.sf(np.abs(T), df=df)
        return T

    def _run_regression(self, X, Y):
        '''Runs a standard linear regression. Gets Betas and T values.'''
        XtX_inv = np.linalg.pinv(X.T @ X)                               # (preds × preds)
        BETA = XtX_inv @ X.T @ Y
        Y_HAT  = X @ BETA                                               # (obs,) <- (obs, preds) @ (preds,)
        residuals = Y.T - Y_HAT                                         # (n_sub, n_voxels) <- (n_sub, n_voxels) - (n_sub, n_voxels)
        dof = self.n_obs - self.n_preds                                 # scalar  <- (n_sub, ) - (n_cov, )
        mse = np.sum(residuals**2, axis=0) / dof                        # (n_voxels,) <- summed (n_sub, n_voxels) along n_sub
        T = self.apply_contrasts(XtX_inv, BETA, mse)                 # (n_cov, n_voxels) <- (n_cov, n_voxels) / (n_cov, n_voxels)
        R2 = self.get_r2(Y, Y_HAT)                                        # (n_voxels,) <- (n_sub, n_voxels) - (n_sub, n_voxels)
        return BETA, T, R2                                               # beta is (predictors,), while T and P are (contrasts,)
    
    def _get_targets(self, permutation):
        regressor = self.design_tensor
        if permutation:
            if self.exchangeability_blocks is None:
                resample_idx = np.random.permutation(self.outcome_data.shape[0])
                regressand = self.outcome_data[resample_idx, :]
            else:
                # Shuffle observations only within their own exchangeability block
                block_labels = self.exchangeability_blocks.ravel()
                resample_idx = np.arange(block_labels.size)
                for block in np.unique(block_labels):
                    block_indices = np.where(block_labels == block)[0]
                    resample_idx[block_indices] = np.random.permutation(block_indices)
                regressand = self.outcome_data[resample_idx, :]
        else:
            regressand = self.outcome_data
        return regressor, regressand
    
    def voxelwise_regression(self, permutation=False):
        '''Fits OLS at each voxel via BETA = (X'X)^-1 @ X' @ Y and returns betas, contrast t-values, and R2'''
        BETA = np.zeros((self.n_preds, self.n_voxels))
        T = np.zeros((self.n_contrasts, self.n_voxels))
        R2 = np.zeros((1, self.n_voxels))
        
        regressor, regressand = self._get_targets(permutation)
        for idx in (range(self.n_voxels) if permutation else tqdm(range(self.n_voxels), desc='Running voxelwise regressions')):
            X = regressor[:, :, idx]                                
            Y = regressand.flatten() if regressand.shape[1] == 1 else regressand[:, idx]
            BETA[:,idx], T[:,idx], R2[:,idx] = self._run_regression(X, Y)
        return BETA, T, R2
    
    def _get_max_stat(self, arr, pseudo_var_smooth=True, t=75):
        """Return the t-th percentile (default: 75th) of the absolute values in each row of arr, or the raw row maximum if pseudo_var_smooth is False (the raw maximum is sensitive to noise)."""
        if pseudo_var_smooth:        
            return np.nanpercentile(np.abs(arr), t, axis=1)  # Calculate along rows, ignoring NaNs
        else: 
            return np.nanmax(np.abs(arr), axis=1)  # Calculate along rows

    def run_permutation(self, n_permutations):
        if n_permutations < 1:
            print("No permutations requested.")
            return
        
        Tp = np.zeros_like(self.T)        
        R2p = np.zeros_like(self.R2) 
        for i in tqdm(range(n_permutations), desc='running permutations'):
            _, permT, permR2 = self.voxelwise_regression(permutation=True)
            max_statsT = self._get_max_stat(permT)
            max_statsR2 = self._get_max_stat(permR2)
            Tp += (max_statsT[:, None] > np.abs(self.T)).astype(int)  # max stat is already an absolute value; compare against |T| for a two-sided test.
            R2p += (max_statsR2 > self.R2).astype(int)                # R2 needs no absolute value; the test is inherently one-sided.
        self.Tp = Tp / n_permutations 
        self.R2p = R2p / n_permutations
                    
    #### Orchestration Code ####
    def run(self, n_permutations=0):
        self.BETA, self.T, self.R2 = self.voxelwise_regression()
        self.run_permutation(n_permutations)
        self._save_nifti_maps()

In [34]:
regression = SimpleVoxelwiseRegression(json_path, mask_path=mask_path, out_dir=out_dir)
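# run() saves the NIfTI maps to out_dir and returns None; results are stored on the object (e.g., regression.T)
regression.run(n_permutations=1)  # a single permutation only smoke-tests the pipeline; use many more for real inference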

Running voxelwise regressions: 100%|██████████| 228483/228483 [00:11<00:00, 19843.19it/s]
running permutations: 100%|██████████| 1/1 [00:11<00:00, 11.15s/it]


That's all

-Calvin