# Run Any Kind of OLS Regression (ANOVA, GLM, etc.)

### Authors: Calvin Howard.

#### Last updated: July 6, 2023

Use this to run/test a statistical model (e.g., regression or t-tests) on a spreadsheet.

Notes:
- To best use this notebook, you should be familiar with GLM design and contrast matrix design. See [FSL's GLM page](https://fsl.fmrib.ox.ac.uk/fsl/fslwiki/GLM) to get started.

# 00 - Import CSV with All Data
**The CSV is expected to be in this format**
- The ID column and absolute paths to the NIfTI files are critical
```
+-----+----------------------------+--------------+--------------+--------------+
| ID  | Nifti_File_Path            | Covariate_1  | Covariate_2  | Covariate_3  |
+-----+----------------------------+--------------+--------------+--------------+
| 1   | /path/to/file1.nii.gz      | 0.5          | 1.2          | 3.4          |
| 2   | /path/to/file2.nii.gz      | 0.7          | 1.4          | 3.1          |
| 3   | /path/to/file3.nii.gz      | 0.6          | 1.5          | 3.5          |
| 4   | /path/to/file4.nii.gz      | 0.9          | 1.1          | 3.2          |
| ... | ...                        | ...          | ...          | ...          |
+-----+----------------------------+--------------+--------------+--------------+
```
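
Before importing, it can help to confirm that the two critical columns are present. The sketch below uses plain pandas on an in-memory CSV. The column names `ID` and `Nifti_File_Path` come from the example table above, and `check_required_columns` is a hypothetical helper for illustration, not part of `calvin_utils`.

```python
import io
import pandas as pd

# Small in-memory CSV that mimics the layout above (values are illustrative only)
csv_text = """ID,Nifti_File_Path,Covariate_1,Covariate_2,Covariate_3
1,/path/to/file1.nii.gz,0.5,1.2,3.4
2,/path/to/file2.nii.gz,0.7,1.4,3.1
"""

def check_required_columns(df, required=("ID", "Nifti_File_Path")):
    """Return the required columns that are missing from df (hypothetical helper)."""
    return [col for col in required if col not in df.columns]

example_df = pd.read_csv(io.StringIO(csv_text))
missing = check_required_columns(example_df)

# Absolute paths start with "/" on macOS/Linux; collect the IDs of any that do not
is_absolute = example_df["Nifti_File_Path"].str.startswith("/")
relative_ids = example_df.loc[~is_absolute, "ID"].tolist()

print("Missing columns:", missing)
print("IDs with non-absolute paths:", relative_ids)
```

Both lists should be empty for a well-formed file; anything they report is worth fixing before moving on.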

Prep Output Directory

In [None]:
# Specify where you want to save your results to
out_dir = '/Users/cu135/Library/CloudStorage/OneDrive-Personal/OneDrive_Documents/Research/2023/subiculum_cognition_and_age/figures/Figures/joint_distribution_calculus/analyses'

Import Data

In [None]:
# Specify the path to your spreadsheet (CSV or Excel) containing NIFTI paths
input_csv_path = '/Users/cu135/Dropbox (Partners HealthCare)/studies/cognition_2023/metadata/master_list_proper_subjects.xlsx'

In [None]:
from calvin_utils.permutation_analysis_utils.statsmodels_palm import CalvinStatsmodelsPalm
# Instantiate the CalvinStatsmodelsPalm class
cal_palm = CalvinStatsmodelsPalm(input_csv_path=input_csv_path, output_dir=out_dir, sheet='master_list_proper_subjects')
# Read the spreadsheet and display the resulting dataframe
data_df = cal_palm.read_and_display_data()


# 01 - Preprocess Your Data

**Handle NaNs**
- Set drop_nans=True if you would like to remove NaNs from data
- Provide a column name or a list of column names to remove NaNs from
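
For reference, the same idea in plain pandas is `dropna` with a `subset`. This is a minimal sketch on a toy dataframe with made-up column names, and it is not necessarily how `drop_nans_from_columns` works internally.

```python
import numpy as np
import pandas as pd

toy_df = pd.DataFrame({
    "Age": [61.0, np.nan, 70.0, 66.0],
    "Score": [0.2, 0.5, np.nan, -0.1],
    "Notes": ["a", None, "c", None],  # missing values here are ignored below
})

# Only rows with a NaN in one of the listed columns are removed
columns_to_check = ["Age", "Score"]
cleaned_df = toy_df.dropna(subset=columns_to_check).reset_index(drop=True)
print(cleaned_df)
```

Rows with a missing `Notes` value survive because `Notes` is not in the list, which is the reason to pass specific columns rather than dropping every row containing any NaN.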

In [None]:
data_df.columns

In [None]:
drop_list = ['Age', 'Z_Scored_Percent_Cognitive_Improvement_By_Origin_Group', 'Z_Scored_Subiculum_T_By_Origin_Group_']

In [None]:
data_df = cal_palm.drop_nans_from_columns(columns_to_drop_from=drop_list)
display(data_df)

**Drop Row Based on Value of Column**

Define the column, condition, and value for dropping rows
- column = 'your_column_name'
- condition = 'above'  # Options: 'equal', 'above', 'below', 'not'
- value = the value that the condition is compared against
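
A rough sketch of this kind of filtering in plain pandas follows. It assumes that rows where the condition holds are the ones dropped and that both the kept and dropped rows are returned. That is an assumption about the semantics, so check it against the output of the cells below. `split_rows_by_value` is a hypothetical name.

```python
import operator
import pandas as pd

# Map each condition keyword to a comparison (assumed semantics, see the text above)
CONDITIONS = {
    "equal": operator.eq,
    "above": operator.gt,
    "below": operator.lt,
    "not": operator.ne,
}

def split_rows_by_value(df, column, condition, value):
    """Return (kept_df, dropped_df); rows where the condition holds are dropped."""
    mask = CONDITIONS[condition](df[column], value)
    return df[~mask], df[mask]

toy_df = pd.DataFrame({"City": ["Toronto", "Boston", "Toronto"], "Age": [60, 72, 65]})
kept_df, dropped_df = split_rows_by_value(toy_df, "Age", "above", 70)
print(kept_df)
```

Returning the dropped rows as well makes it easy to inspect what was excluded.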

In [None]:
data_df.columns

Set the parameters for dropping rows

In [None]:
column = 'City'  # The column you'd like to evaluate
condition = 'not'  # The condition to check ('equal', 'above', 'below', 'not')
value = 'Toronto' # The value to drop if found

In [None]:
data_df, other_df = cal_palm.drop_rows_based_on_value(column, condition, value)
display(data_df)

**Standardize Data**
- Enter the columns you don't want to standardize into a list
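
Standardizing here means z-scoring: subtract the column mean and divide by the column standard deviation. The sketch below shows this in plain pandas on toy data. The column names are made up, and the use of the sample standard deviation (`ddof=1`) is an assumption rather than a statement about what `standardize_columns` does.

```python
import pandas as pd

toy_df = pd.DataFrame({
    "Age": [60.0, 70.0, 80.0],
    "Connectivity": [1.0, 2.0, 3.0],
    "Already_Z_Scored": [-1.0, 0.0, 1.0],
})

cols_to_skip = ["Already_Z_Scored"]
cols_to_scale = [c for c in toy_df.columns if c not in cols_to_skip]

standardized_df = toy_df.copy()
# z-score: subtract the column mean, divide by the sample standard deviation (ddof=1)
standardized_df[cols_to_scale] = (
    toy_df[cols_to_scale] - toy_df[cols_to_scale].mean()
) / toy_df[cols_to_scale].std(ddof=1)
print(standardized_df)
```

Columns that are already z-scored are left out so they are not rescaled a second time.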

In [None]:
# Remove anything you don't want to standardize
cols_not_to_standardize = ['Z_Scored_Percent_Cognitive_Improvement_By_Origin_Group', 'Z_Scored_Subiculum_T_By_Origin_Group_'] #['Age']

In [None]:
data_df = cal_palm.standardize_columns(cols_not_to_standardize)
data_df

# 02 - Define Your Formula

This is the formula relating outcome to predictors, and takes the form:
- y = B0 + B1*x1 + B2*x2 + B3*x3 + . . . + BN*xN

It is defined using the column names of your dataframe in place of the variables above:
- 'Apples_Picked ~ hours_worked + owns_apple_picking_machine'

____
**ANOVA**
- Tests differences in means for one categorical variable.
- formula = 'Outcome ~ C(Group1)'

**2-Way ANOVA**
- Tests differences in means for two categorical variables without interaction.
- formula = 'Outcome ~ C(Group1) + C(Group2)'

**2-Way ANOVA with Interaction**
- Tests for interaction effects between two categorical variables.
- formula = 'Outcome ~ C(Group1) * C(Group2)'

**ANCOVA**
- Similar to ANOVA, but includes a covariate to control for its effect.
- formula = 'Outcome ~ C(Group1) + Covariate'

**2-Way ANCOVA**
- Extends ANCOVA with two categorical variables and their interaction, controlling for a covariate.
- formula = 'Outcome ~ C(Group1) * C(Group2) + Covariate'

**Multiple Regression**
- Assesses the impact of multiple predictors on an outcome.
- formula = 'Outcome ~ Predictor1 + Predictor2'

**Simple Linear Regression**
- Assesses the impact of a single predictor on an outcome.
- formula = 'Outcome ~ Predictor'

**MANOVA**
- Assesses multiple dependent variables across groups.
- Note: Not typically set up with a formula in statsmodels. Requires specialized functions.

____
Use the printout below to design your formula. 
- Left of the "~" symbol is the thing to be predicted. 
- Right of the "~" symbol are the predictors. 
- ":" indicates an interaction between two things. 
- "*" indicates an interaction AND includes the simple (main) effects of each variable too. 
- "+" indicates that you want to add another predictor. 
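
To see what each operator expands to, you can ask patsy (the formula engine behind statsmodels formulas) for the column names it would build. This is a small sketch on a toy dataframe with made-up columns `x` and `z`. It does not go through `calvin_utils`.

```python
import pandas as pd
from patsy import dmatrix

toy_df = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0], "z": [0.5, 0.1, 0.9, 0.3]})

# Only the right-hand side is expanded here, so no outcome column is needed
interaction_only = dmatrix("x:z", toy_df).design_info.column_names
full_factorial = dmatrix("x*z", toy_df).design_info.column_names
additive = dmatrix("x + z", toy_df).design_info.column_names

print(interaction_only)
print(full_factorial)
print(additive)
```

`x*z` yields the columns `Intercept`, `x`, `z`, and `x:z`, while `x:z` alone yields only `Intercept` and `x:z`. patsy adds the intercept on its own in every case.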

In [None]:
data_df.columns

In [None]:
formula = "Z_Scored_Percent_Cognitive_Improvement ~ Age*Subiculum_Connectivity_T"

# 02 - Visualize Your Design Matrix

This is the explanatory variable half of your regression formula
_______________________________________________________
Create Design Matrix: The cell below calls the define_design_matrix method with your formula and your dataframe. It returns the outcome matrix and the design matrix.

- outcome_matrix, design_matrix = cal_palm.define_design_matrix(formula, data_df)
- To include interaction terms, use * between variables, like "var1*var2".
- By default, an intercept will be added unless you set intercept=False
- **don't explicitly add the 'intercept' column. I'll do it for you.**

In [None]:
# Define the design matrix
outcome_matrix, design_matrix = cal_palm.define_design_matrix(formula, data_df)
design_matrix

# 03 - Visualize Your Dependent Variable

I have generated this for you based on the formula you provided

In [None]:
outcome_matrix

# 04 - Define Exchangeability Blocks (Optional)

Optional - Exchangeability Blocks
- This is optional and for when you are doing a meta-analysis
- Not yet implemented

In [None]:
### This is just an example, you will have to edit to adapt to your data, 
### but it should be integers, starting with 1,2,3....

# coding_key = {"Prosopagnosia_w_Yeo1000": 1,
#              "Corbetta_Lesions": 1,
#              "DBS_dataset": 2
#              }

# eb_matrix = pd.DataFrame()
# eb_matrix = clean_df['dataset'].replace(coding_key)
# display(eb_matrix)

# 05 - Run the Regression

Regression Results Are Displayed Below

In [None]:
import statsmodels.api as sm
# Fit the regression model
model = sm.OLS(outcome_matrix, design_matrix)
results = model.fit()
print(results.summary2())

Visualize the Regression as a Forest Plot
- This will probably look poor if you ran a regression without standardizing your data. 

In [None]:
from calvin_utils.statistical_utils.statistical_measurements import ForestPlot
forest = ForestPlot(model=results, sig_digits=2, out_dir=out_dir, table=False)
forest.run()

Visualize The Model's Fit

In [None]:
from calvin_utils.statistical_utils.statistical_measurements import model_diagnostics
model_diagnostics(results)

Visualize the Partial Regression Plots

In [None]:
from calvin_utils.statistical_utils.statistical_measurements import PartialRegressionPlot
partial_plot = PartialRegressionPlot(model=results, design_matrix=design_matrix, out_dir=out_dir, palette='Reds')
partial_plot = partial_plot.run()

# 06 - Find First Partial Derivative of Each Regressor

**Partial Derivative Explanation for the Equation $ y = B_1x + B_2z + B_3xz $**

When taking the partial derivative of the equation $ y = B_1x + B_2z + B_3xz $ with respect to $ x $, the logic is as follows:

- Treat $ z $ as a constant since we are differentiating with respect to $ x $. 
- Derivatives of constants are zero. The derivative of a first-order term ($ x $) is one. 
- All terms with $ z $ are treated as constants.
    - This means both $ B_2z $ and $ B_3z $ are considered constants.
    - When differentiated with respect to $ x $:
        - $ B_2z $ does not contain $ x $. Thus its derivative is zero.
        - $ B_3z $ multiplies $ x $ in the term $ B_3xz $, thus the derivative of that term is the constant $ B_3z $. 
            - This is the special case of the product rule in which the derivative of a constant times a differentiable variable is the constant times the derivative of that variable.

Hence, differentiating term by term, the partial derivative of $ y $ with respect to $ x $ is given by:

$$ \frac{\partial y}{\partial x} = \frac{\partial}{\partial x}(B_1x) + \frac{\partial}{\partial x}(B_2z) + \frac{\partial}{\partial x}(B_3xz) $$

Applying the product rule to the interaction term gives:

$$ \frac{\partial y}{\partial x} = \frac{\partial}{\partial x}(B_1x) + \frac{\partial}{\partial x}(B_2z) + B_3z\frac{\partial}{\partial x}(x) + B_3x\frac{\partial}{\partial x}(z) $$

The derivative of $ x $ with respect to $ x $ is one, and the derivative of a constant ($ z $) is zero, so:

$$ \frac{\partial y}{\partial x} = B_1 + 0 + B_3z \cdot 1 + B_3x \cdot 0 $$

Simplifying this, we get:

$$ \frac{\partial y}{\partial x} = B_1 + 0 + B_3z $$

Therefore, the resulting equation for the partial derivative is:

$$ \frac{\partial y}{\partial x} = B_1 + B_3z $$

This equation represents the rate of change of $ y $ with respect to $ x $, while holding $ z $ constant.
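
As a quick sanity check on the result $ B_1 + B_3z $, the sketch below compares it against a finite-difference slope. The coefficients are arbitrary illustrative numbers, not fitted values from this notebook, and the helper names are hypothetical.

```python
def y(x, z, b1, b2, b3):
    """Model from the derivation: y = B1*x + B2*z + B3*x*z."""
    return b1 * x + b2 * z + b3 * x * z

def slope_in_x(x, z, b1, b2, b3, step=1e-6):
    """Central finite-difference estimate of dy/dx with z held fixed."""
    return (y(x + step, z, b1, b2, b3) - y(x - step, z, b1, b2, b3)) / (2 * step)

# Arbitrary illustrative coefficients and evaluation point
b1, b2, b3 = 0.4, -1.3, 0.25
x0, z0 = 2.0, 3.0

numeric = slope_in_x(x0, z0, b1, b2, b3)
analytic = b1 + b3 * z0  # B1 + B3*z
print(numeric, analytic)
```

The two numbers should agree to within numerical error, and the slope should not change if you move `x0`, because the derivative depends only on $ z $.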

In [None]:
coefficients = results.params
coefficients

Use the Above Coefficients With This Equation to Obtain Zero Point

$$ 0 = B_1 + B_3z $$

The effect of $ x $ on $ y $ is zero (and switches sign) at this value of $ z $:

$$ z = -B_1 / B_3 $$
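
A worked example with made-up coefficients may help. The function name `zero_point` is hypothetical, and the numbers are chosen only to make the arithmetic easy to follow.

```python
def zero_point(main_effect, interaction):
    """Solve 0 = B1 + B3*z for z, i.e. z = -B1 / B3."""
    if interaction == 0:
        raise ValueError("Interaction coefficient is zero, so the slope does not depend on z.")
    return -main_effect / interaction

# Illustrative coefficients only (not results from this notebook)
b1, b3 = 0.5, -0.25
z_star = zero_point(b1, b3)
print(z_star)
```

With $ B_1 = 0.5 $ and $ B_3 = -0.25 $, the slope of $ x $ is $ 0.5 - 0.25z $, which is positive below $ z = 2 $ and negative above it.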

In [None]:
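# Value of Age at which the slope of Subiculum_Connectivity_T is zero
print('Zero point of Age: ', -coefficients['Subiculum_Connectivity_T']/coefficients['Age:Subiculum_Connectivity_T'])
# Value of Subiculum_Connectivity_T at which the slope of Age is zero
print('Zero point of Subiculum Connectivity: ', -coefficients['Age']/coefficients['Age:Subiculum_Connectivity_T'])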

**At this point, I suggest visualizing the distribution of the data to ensure you did not screw up your math and create a nonsensical number**

In [None]:
import seaborn as sns
pair_df = data_df.loc[:, ['Age', 'Subiculum_Connectivity_T', 'Z_Scored_Percent_Cognitive_Improvement']]
sns.pairplot(pair_df)

# 07 - Visualize the Partials

Suggest using regression_visualization.ipynb for the joint distribution and visualization of its gradient. 

**Note**
- This gets more complicated with more interaction terms. Each equation must be solved by you. 

# Compare Quadrants Defined by the Partials

Enjoy.

-- Calvin