# Run Any Kind of Logistic Regression (Binomial, Multinomial, etc.)

### Authors: Calvin Howard.

#### Last updated: March 16, 2024

Use this to run/test a statistical model on a spreadsheet.

Notes:
- To best use this notebook, you should be familiar with GLM design and contrast matrix design. See this webpage to get started:
[FSL's GLM page](https://fsl.fmrib.ox.ac.uk/fsl/fslwiki/GLM)

# 00 - Import CSV with All Data
**The CSV is expected to be in this format**
- ID and absolute paths to niftis are critical
```
+-----+----------------------------+--------------+--------------+--------------+
| ID  | Nifti_File_Path            | Covariate_1  | Covariate_2  | Covariate_3  |
+-----+----------------------------+--------------+--------------+--------------+
| 1   | /path/to/file1.nii.gz      | 0.5          | 1.2          | 3.4          |
| 2   | /path/to/file2.nii.gz      | 0.7          | 1.4          | 3.1          |
| 3   | /path/to/file3.nii.gz      | 0.6          | 1.5          | 3.5          |
| 4   | /path/to/file4.nii.gz      | 0.9          | 1.1          | 3.2          |
| ... | ...                        | ...          | ...          | ...          |
+-----+----------------------------+--------------+--------------+--------------+
```

Prep Output Directory

In [None]:
# Specify where you want to save your results to
out_dir = '/Users/cu135/Dropbox (Partners HealthCare)/studies/collaborations/barotono_disease_classification'

Import Data

In [None]:
# Specify the path to your CSV file containing NIFTI paths
input_csv_path = '/Users/cu135/Dropbox (Partners HealthCare)/studies/collaborations/barotono_disease_classification/metadata/Cort_Thick_Spatial_Rs.csv'
sheet = None

In [None]:
from calvin_utils.permutation_analysis_utils.statsmodels_palm import CalvinStatsmodelsPalm
# Instantiate the CalvinStatsmodelsPalm class
cal_palm = CalvinStatsmodelsPalm(input_csv_path=input_csv_path, output_dir=out_dir, sheet=sheet)
# Read the CSV and display the data
data_df = cal_palm.read_and_display_data()

# 01 - Preprocess Your Data

**Handle NANs**
- Set drop_nans=True if you would like to remove NaNs from the data
- Provide a column name or a list of column names to remove NaNs from

In [None]:
data_df.columns

In [None]:
drop_list = ['Unnamed__11', 'Unnamed__12']

In [None]:
# data_df = cal_palm.drop_nans_from_columns(columns_to_drop_from=drop_list)
display(data_df)

Drop Columns which Have NaNs

In [None]:
# axis=1 drops every COLUMN that contains at least one NaN (rows are kept)
data_df.dropna(inplace=True, axis=1)
data_df
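
The two NaN-handling cells above behave differently, which is easy to miss. A minimal sketch on a toy DataFrame, assuming the commented-out `drop_nans_from_columns` helper removes rows much like pandas' `dropna(subset=...)` (that is an assumption, not confirmed by this notebook). The column names are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy data; column names are hypothetical.
toy_df = pd.DataFrame({
    "dx": ["AD", "CN", "AD", "CN"],
    "age": [70.0, np.nan, 65.0, 72.0],
    "thickness": [2.1, 2.4, 2.0, 2.5],
})

# Row-wise: drop subjects that are missing a value in the listed columns.
rows_dropped = toy_df.dropna(subset=["age"])

# Column-wise: drop every column that contains at least one NaN.
cols_dropped = toy_df.dropna(axis=1)

print(rows_dropped.shape)  # one subject removed, all columns kept
print(cols_dropped.shape)  # all subjects kept, the 'age' column removed
```

If a predictor you need disappears from `data_df.columns` after the column-wise drop, a single missing value in that column is the likely cause.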

**Drop Row Based on Value of Column**

Define the column, condition, and value for dropping rows
- column = 'your_column_name'
- condition = 'above'  # Options: 'equal', 'above', 'below', 'not'
- value = 'your_value'

In [None]:
data_df.columns

Set the parameters for dropping rows

In [None]:
column = 'dx'  # The column you'd like to evaluate
condition = 'equal'  # The condition to check ('equal', 'above', 'below', 'not')
value = 'PPMI' # The value to drop if found

In [None]:
data_df, other_df = cal_palm.drop_rows_based_on_value(column, condition, value)
display(data_df)

**Standardize Data**
- List the columns you don't want to standardize

In [None]:
# Remove anything you don't want to standardize
cols_not_to_standardize = ['cat_dx'] # ['Z_Scored_Percent_Cognitive_Improvement_By_Origin_Group', 'Z_Scored_Subiculum_T_By_Origin_Group_'] #['Age']


In [None]:
data_df = cal_palm.standardize_columns(cols_not_to_standardize)
data_df
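
For intuition, standardizing usually means z-scoring each column: subtract its mean and divide by its standard deviation. The sketch below does this in plain pandas on toy data. It is not the `standardize_columns` implementation, and the choice of the sample standard deviation (pandas' default, `ddof=1`) is an assumption.

```python
import pandas as pd

# Toy data; 'thickness' is a hypothetical continuous predictor.
toy_df = pd.DataFrame({
    "cat_dx": [0, 1, 0, 1],
    "thickness": [2.0, 2.4, 2.2, 2.6],
})
cols_not_to_standardize = ["cat_dx"]

# Z-score every column except the ones explicitly excluded.
cols = [c for c in toy_df.columns if c not in cols_not_to_standardize]
z_df = toy_df.copy()
z_df[cols] = (toy_df[cols] - toy_df[cols].mean()) / toy_df[cols].std()

print(z_df)
```

Categorical codes such as `cat_dx` are left untouched because their numeric values are labels, not magnitudes.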

# 00 - Define Your Formula

This is the formula relating the outcome to the predictors, and takes the form:
- y = B0 + B1*X1 + B2*X2 + B3*X3 + . . . + BN*XN

It is defined using the columns of your dataframe instead of the variables above:
- 'Apples_Picked ~ hours_worked + owns_apple_picking_machine'

____
**Binomial Logistic**
- Assesses the impact of multiple predictors on a binary outcome.
- formula = 'Binary Outcome ~ Predictor1 + Predictor2'

**Multinomial Logistic**
- Assesses the impact of multiple predictors on an outcome with more than two classes.
- formula = 'Categorical Outcome ~ Predictor1 + Predictor2'

____
Use the printout below to design your formula. 
- Left of the "~" symbol is the thing to be predicted. 
- Right of the "~" symbol are the predictors. 
- ":" indicates an interaction between two things. 
- "*" indicates an interaction AND it accounts for the simple effects too. 
- "+" indicates that you want to add another predictor. 
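
To make the difference between ":" and "*" concrete, the sketch below rewrites each `a*b` term as `a + b + a:b`. The helper names are hypothetical and this is only an illustration of the shorthand for two-variable terms. It is not how the formula parser itself works.

```python
def expand_star(term):
    """Expand 'a*b' into ['a', 'b', 'a:b']; return other terms unchanged.

    Hypothetical helper for illustration; handles two-variable terms only.
    """
    if "*" not in term:
        return [term]
    left, right = (part.strip() for part in term.split("*"))
    return [left, right, f"{left}:{right}"]


def expand_formula(formula):
    """Rewrite every 'a*b' on the right-hand side as 'a + b + a:b'."""
    outcome, predictors = (side.strip() for side in formula.split("~"))
    terms = []
    for term in predictors.split("+"):
        for expanded in expand_star(term.strip()):
            if expanded not in terms:  # avoid listing a term twice
                terms.append(expanded)
    return f"{outcome} ~ " + " + ".join(terms)


# Hypothetical column names.
print(expand_formula("dx ~ age*sex + education"))
```

So a formula written with `age*sex` fits three terms, while `age:sex` alone fits only the interaction.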

In [None]:
data_df.columns

In [None]:
formula = "dx ~ svPPA + bvFTD + AD + CN + nfaPPA + CBS + PSP"

# 02 - Visualize Your Design Matrix

This is the explanatory variable half of your regression formula
_______________________________________________________
Create Design Matrix: Use the create_design_matrix method. You can provide a list of formula variables which correspond to column names in your dataframe.

- design_matrix = palm.create_design_matrix(formula_vars=["var1", "var2", "var1*var2"])
- To include interaction terms, use * between variables, like "var1*var2".
- By default, an intercept will be added unless you set intercept=False
- **don't explicitly add the 'intercept' column. I'll do it for you.**

In [None]:
# Define the design matrix
outcome_matrix, design_matrix = cal_palm.define_design_matrix(formula, data_df)
design_matrix

# 03 - Visualize Your Dependent Variable

I have generated this for you based on the formula you provided

In [None]:
outcome_matrix

**CRITICAL IN MULTINOMIAL LOGISTIC REGRESSION**
- A multinomial logistic regression reports results RELATIVE TO A REFERENCE class. 
- The reference class is the first classification the multinomial encounters.
- **Especially if you are running a multinomial logistic regression, set your reference class below**

In [None]:
reference = 'dx[CN]'


In [None]:
# ref_col = outcome_matrix.pop(reference)
# outcome_matrix.insert(loc=0, column=reference, value=ref_col)
# outcome_matrix

# Or completely reorganize the columns
outcome_matrix = outcome_matrix.loc[:, ['dx[AD]', 'dx[SV]', 'dx[PNFA]', 'dx[BV]', 'dx[PSP]', 'dx[CBS]', 'dx[CN]']]
outcome_matrix

In [None]:
# Multicollinearity check (variance inflation factor)
from calvin_utils.statistical_utils.statistical_measurements import calculate_vif
calculate_vif(design_matrix)

# 04 - Run the Regression

Regression Results Are Displayed Below

- This will run a binomial or a multinomial logit depending on your outcome matrix. 
- A multinomial logit will display N-1 categories, where N is the number of potential classifications you have. This occurs because everything is set in reference to the reference class. 
- So, the reference will either be the first column in your outcome_matrix, or you can manually set it first.
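
Why N-1 sets of coefficients are enough: each non-reference class gets a log-odds against the reference, and the reference has an implicit log-odds of 0 against itself. A minimal sketch of turning those N-1 log-odds back into N probabilities follows. The numbers are made up and this is not the fitting code used by `LogisticRegression`.

```python
import math


def multinomial_probs(logits_vs_reference):
    """Convert N-1 log-odds (each class vs. the reference) into N probabilities.

    The reference class has a log-odds of 0 against itself, so it is
    prepended before applying the softmax.
    """
    logits = [0.0] + list(logits_vs_reference)
    exps = [math.exp(z) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical log-odds for two non-reference classes for one subject:
# class 1 is twice as likely as the reference, class 2 half as likely.
probs = multinomial_probs([math.log(2.0), math.log(0.5)])
print(probs)  # reference class first, then the two other classes
```

Changing the reference class changes every reported coefficient, but not the predicted probabilities.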

In [None]:
from calvin_utils.statistical_utils.logistic_regression import LogisticRegression
logreg = LogisticRegression(outcome_matrix, design_matrix)
results = logreg.run()
results.summary2()

# 05 - Get Classification Metrics
**A) Confusion Matrix**
- The classifications here represent the current threshold, not the optimal threshold which may be identified by ROC. 
- The index of the maximal prediction corresponds to the choice as ordered by outcome_matrix. 
- When normalizing by ground truth, large off-diagonal values may appear for rare classes that are not easily distinguished from others, because each row is divided by a small number of true cases. 
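
The points above can be sketched in plain Python: take the argmax of each row of predicted probabilities, count predictions per true class, then divide each row by its total. The function name and data are hypothetical, and this is not the `ClassificationEvaluation` implementation.

```python
def confusion_matrix_true_normalized(probabilities, true_labels, n_classes):
    """Row-normalized confusion matrix from argmax predictions.

    Rows are true classes, columns are predicted classes; each non-empty
    row sums to 1.
    """
    counts = [[0] * n_classes for _ in range(n_classes)]
    for probs, truth in zip(probabilities, true_labels):
        predicted = max(range(n_classes), key=lambda k: probs[k])  # argmax
        counts[truth][predicted] += 1
    normalized = []
    for row in counts:
        total = sum(row)
        normalized.append([c / total if total else 0.0 for c in row])
    return normalized


# Toy predicted probabilities, columns ordered like the outcome columns.
probabilities = [
    [0.7, 0.2, 0.1],
    [0.6, 0.3, 0.1],
    [0.4, 0.5, 0.1],
    [0.5, 0.1, 0.4],  # the only class-2 subject, predicted as class 0
]
true_labels = [0, 0, 1, 2]

cm = confusion_matrix_true_normalized(probabilities, true_labels, n_classes=3)
for row in cm:
    print(row)
```

With only one true case in class 2, a single miss fills that row's off-diagonal cell completely, which is the rare-class effect described above.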

In [None]:
from calvin_utils.statistical_utils.classification_statistics import ClassificationEvaluation
classification_results = ClassificationEvaluation(results, outcome_matrix, normalization='true', thresholds=None)
classification_results.run()

# 05 B) - Prove Why You need a Model
- Use the simplest possible model-free method as a baseline. 
- If all values are normalized/standardized, you can simply enter predictions_df=design_matrix. 
    - This will just take the largest value in each row as the prediction.

Drop the Columns which do not Include your regressors of interest

In [None]:
dummy_predictions = design_matrix.copy()
dummy_predictions.columns
dummy_predictions.pop('Intercept')

In [None]:
from calvin_utils.statistical_utils.classification_statistics import ClassificationEvaluation
dummy_results = ClassificationEvaluation(results, predictions_df=dummy_predictions, observation_df=outcome_matrix, normalization='true', thresholds=None)
dummy_results.run()

Random Chance of Being Correct

In [None]:
import pandas as pd

def calculate_p_correct(outcome_sums):
    """
    Calculate the probability of making a correct choice by chance,
    assuming guesses are drawn in proportion to the prevalence of each outcome.

    Parameters:
    outcome_sums (pd.Series): A Pandas Series with the sum of outcomes,
                              representing the prevalence of each choice.

    Returns:
    float: The probability of making a correct choice by chance.
    """
    # Calculate the proportion of each outcome in the total
    total = outcome_sums.sum()
    prevalences = outcome_sums / total
    
    # Square each prevalence and sum to get the probability of a correct choice by chance
    p_correct = (prevalences ** 2).sum()
    
    return p_correct

# Example usage:
# Assuming 'outcome_matrix' is your DataFrame of outcomes
outcome_sums = outcome_matrix.sum()
p_correct = calculate_p_correct(outcome_sums)
print(f"The probability of making a correct choice by chance is: {p_correct}")


**B) Receiver Operating Characteristic**
- The ROC considers classifications across ALL POSSIBLE PROBABILITY THRESHOLDS, demonstrating what is ultimately achievable at the best possible threshold

- First curve is the ROC for classification of each class with respect to all other classes
- Second Curve (Macro Average) is basically a meta-analytic ROC with equal weight per class.
- Third Curve (Micro Average) is basically a meta-analytic ROC with weight proportional to class sample size
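
The difference between macro and micro averaging is easiest to see at a single threshold. The sketch below averages the one-vs-rest true-positive rate both ways using made-up counts. A full ROC curve repeats this at every threshold (and does the same for the false-positive rate). This is an illustration only, not the `ComprehensiveMulticlassROC` implementation.

```python
def macro_micro_tpr(true_positives, positives):
    """Macro- and micro-averaged true-positive rate over classes.

    true_positives[k] and positives[k] are one-vs-rest counts for class k.
    """
    per_class = [tp / p for tp, p in zip(true_positives, positives)]
    macro = sum(per_class) / len(per_class)        # equal weight per class
    micro = sum(true_positives) / sum(positives)   # weight by class size
    return macro, micro


# Hypothetical counts: a large class detected well, a small class detected poorly.
macro, micro = macro_micro_tpr(true_positives=[90, 2], positives=[100, 10])
print(f"macro={macro:.3f}, micro={micro:.3f}")
```

When classes are imbalanced, the micro average is dominated by the large classes, so a gap between the two curves hints that small classes are being classified differently from large ones.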

In [None]:
from calvin_utils.statistical_utils.classification_statistics import ComprehensiveMulticlassROC
evaluator = ComprehensiveMulticlassROC(fitted_model=results, observation_df=outcome_matrix)
evaluator.run()

# 06 - Visualize the Regression as a Forest Plot
- This will probably look poor if you ran a regression without standardizing your data. 

In [None]:
from calvin_utils.statistical_utils.statistical_measurements import MultinomialForestPlot

multinomial_forest = MultinomialForestPlot(model=results, sig_digits=2, out_dir=None, table=False)
multinomial_forest.run()

# 07 - Generate Partial Dependence Plots

In [None]:
from calvin_utils.statistical_utils.statistical_measurements import PartialDependencePlot
pdp = PartialDependencePlot(formula=formula, data_df=data_df, model=results, design_matrix=design_matrix, outcomes_df=outcome_matrix, data_range=(-1,1), out_dir=None, marginal_method='mean', debug=False)
pdp.run()

# 08 - Visualize the Partial Regression Plots

In [None]:
from calvin_utils.statistical_utils.statistical_measurements import PartialRegressionPlot
partial_plot = PartialRegressionPlot(model=results, design_matrix=design_matrix, out_dir=out_dir, palette=None)
partial_plot = partial_plot.run()

# 09 - LOOCV

In [None]:
from calvin_utils.statistical_utils.logistic_regression import LogisticRegression

loocv_metrics = LogisticRegression.run_loocv(outcome_matrix, design_matrix)
print(loocv_metrics)
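
For readers new to leave-one-out cross-validation (LOOCV): each subject is held out once, the model is fit on everyone else, and the held-out subject is then predicted. The sketch below shows that loop with a toy nearest-class-mean classifier standing in for the logistic regression. All names and data are hypothetical, and this is not the `run_loocv` implementation.

```python
def loocv_accuracy(features, labels, fit, predict):
    """Leave-one-out CV: hold out each sample once, train on the rest."""
    correct = 0
    for i in range(len(features)):
        train_x = features[:i] + features[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        model = fit(train_x, train_y)
        correct += predict(model, features[i]) == labels[i]
    return correct / len(features)


def fit_class_means(xs, ys):
    """Toy 'model': the mean feature value of each class."""
    return {c: sum(x for x, y in zip(xs, ys) if y == c) / ys.count(c)
            for c in set(ys)}


def predict_nearest_mean(means, x):
    """Predict the class whose mean is closest to x."""
    return min(means, key=lambda c: abs(means[c] - x))


# Toy one-feature dataset with two well-separated classes.
features = [1.0, 1.2, 0.9, 3.0, 3.1, 2.8]
labels = ["CN", "CN", "CN", "AD", "AD", "AD"]
accuracy = loocv_accuracy(features, labels, fit_class_means, predict_nearest_mean)
print(accuracy)
```

Because the held-out subject never contributes to the fit, LOOCV metrics are a fairer estimate of out-of-sample performance than the in-sample metrics from section 05.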

Enjoy.

-- Calvin