# Run Any Kind of Logistic Regression (Binomial, Multinomial, etc.)

### Authors: Calvin Howard.

#### Last updated: March 16, 2024

Use this to run/test a statistical model on a spreadsheet.

Notes:
- To best use this notebook, you should be familiar with GLM design and contrast matrix design. See this webpage to get started:
[FSL's GLM page](https://fsl.fmrib.ox.ac.uk/fsl/fslwiki/GLM)

# 00 - Import CSV with All Data
**The CSV is expected to be in this format**
- ID and absolute paths to niftis are critical
```
+-----+----------------------------+--------------+--------------+--------------+
| ID  | Nifti_File_Path            | Covariate_1  | Covariate_2  | Covariate_3  |
+-----+----------------------------+--------------+--------------+--------------+
| 1   | /path/to/file1.nii.gz      | 0.5          | 1.2          | 3.4          |
| 2   | /path/to/file2.nii.gz      | 0.7          | 1.4          | 3.1          |
| 3   | /path/to/file3.nii.gz      | 0.6          | 1.5          | 3.5          |
| 4   | /path/to/file4.nii.gz      | 0.9          | 1.1          | 3.2          |
| ... | ...                        | ...          | ...          | ...          |
+-----+----------------------------+--------------+--------------+--------------+
```

Prep Output Directory

In [None]:
# Specify where you want to save your results to
out_dir = '/Users/cu135/Partners HealthCare Dropbox/Calvin Howard/studies/atrophy_seeds_2023/shared_analysis/diagnostic_analysis/csf'

Import Data

In [None]:
# Specify the path to your CSV file containing NIFTI paths
input_csv_path = '/Volumes/OneTouch/datasets/adni/metadata/updated_master_list/train_test_splits/train_data_csf.csv'
sheet = None

In [None]:
from calvin_utils.permutation_analysis_utils.statsmodels_palm import CalvinStatsmodelsPalm
# Instantiate the CalvinStatsmodelsPalm class
cal_palm = CalvinStatsmodelsPalm(input_csv_path=input_csv_path, output_dir=out_dir, sheet=sheet)
# Read the CSV and display the data
data_df = cal_palm.read_and_display_data()
data_df

# 01 - Preprocess Your Data

**Handle NANs**
- Set drop_nans=True if you would like to remove NaNs from the data
- Provide a column name or a list of column names to remove NaNs from
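
As a rough illustration of what dropping NaNs from selected columns does, here is a minimal pandas sketch. It uses a toy dataframe with made-up values and plain `DataFrame.dropna`; it is not the `drop_nans_from_columns` method itself, whose internals are not shown in this notebook.

```python
import numpy as np
import pandas as pd

# Toy dataframe (hypothetical values) with a missing entry in 'peak_atrophy'
toy_df = pd.DataFrame({
    "ID": [1, 2, 3],
    "peak_atrophy": [0.4, np.nan, 0.9],
    "Age": [70, 65, np.nan],
})

drop_list = ["peak_atrophy"]

# Drop only the rows that have a NaN in one of the listed columns;
# NaNs in other columns (here 'Age') are left alone
cleaned_df = toy_df.dropna(subset=drop_list).reset_index(drop=True)
print(cleaned_df)
```

Restricting the drop to a list of columns keeps subjects who are only missing values you do not plan to model.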

In [None]:
drop_list = ['peak_atrophy']

In [None]:
# data_df = cal_palm.drop_nans_from_columns(columns_to_drop_from=drop_list)
display(data_df)

**Drop Row Based on Value of Column**

Define the column, condition, and value for dropping rows
- column = 'your_column_name'
- condition = 'above'  # Options: 'equal', 'above', 'below', 'not'
- value = the value to compare each row against
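
To make the column/condition/value logic concrete, here is a minimal pandas sketch. The helper name `split_rows_by_condition` is hypothetical, and returning the dropped rows as a second dataframe is an assumption about what `drop_rows_based_on_value` does, based only on the fact that it returns two dataframes.

```python
import pandas as pd

def split_rows_by_condition(df, column, condition, value):
    """Hypothetical helper: return (kept_df, dropped_df).

    Rows that match the condition are the ones that get dropped.
    """
    if condition == "equal":
        mask = df[column] == value
    elif condition == "above":
        mask = df[column] > value
    elif condition == "below":
        mask = df[column] < value
    elif condition == "not":
        mask = df[column] != value
    else:
        raise ValueError(f"Unknown condition: {condition}")
    return df[~mask].copy(), df[mask].copy()

# Toy data with made-up values
toy_df = pd.DataFrame({
    "DX_BASELINE": ["CN", "MCI", "AD", "MCI"],
    "Age": [70, 72, 80, 68],
})
kept_df, dropped_df = split_rows_by_condition(toy_df, "DX_BASELINE", "equal", "MCI")
print(kept_df)
```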

In [None]:
data_df.columns

Set the parameters for dropping rows

In [None]:
column = 'DX_BASELINE'  # The column you'd like to evaluate
condition = 'equal'  # The condition to check ('equal', 'above', 'below', 'not')
value = 'MCI' # The value to drop if found

In [None]:
data_df, other_df = cal_palm.drop_rows_based_on_value(column, condition, value)
display(data_df)

**Standardize Data**
- Enter the columns you don't want to standardize into a list
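
For intuition, standardizing usually means z-scoring each column (subtract the mean, divide by the standard deviation). The sketch below does that in plain pandas on a toy dataframe and skips the excluded columns. Whether `standardize_columns` uses the sample or the population standard deviation is not stated here, so the sample version (pandas' default) is an assumption.

```python
import pandas as pd

# Toy dataframe with made-up values
toy_df = pd.DataFrame({
    "Age": [60, 70, 80],
    "Hippocampus__sum_csf": [1.0, 2.0, 3.0],
})
cols_not_to_standardize = ["Age"]

standardized_df = toy_df.copy()
for col in standardized_df.columns:
    if col in cols_not_to_standardize:
        continue  # leave excluded columns in their original units
    # z-score: subtract the column mean, divide by the sample standard deviation
    standardized_df[col] = (toy_df[col] - toy_df[col].mean()) / toy_df[col].std()

print(standardized_df)
```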

In [None]:
data_df.columns

In [None]:
# Remove anything you don't want to standardize
cols_not_to_standardize = ['Age', 'Male', 'DX_BASELINE'] # ['Z_Scored_Percent_Cognitive_Improvement_By_Origin_Group', 'Z_Scored_Subiculum_T_By_Origin_Group_'] #['Age']


In [None]:
data_df = cal_palm.standardize_columns(cols_not_to_standardize)
data_df

Convert Categorical Column to Ordinal

In [None]:
data_df.columns

In [None]:
from calvin_utils.file_utils.dataframe_utilities import convert_to_ordinal
# data_df, map = convert_to_ordinal(data_df, ['DX_BASELINE'])

# 01 - Define Your Formula

This is the formula relating outcome to predictors, and takes the form:
- y = B0 + B1*X1 + B2*X2 + B3*X3 + . . . + BN*XN

It is defined using the columns of your dataframe instead of the variables above:
- 'Apples_Picked ~ hours_worked + owns_apple_picking_machine'

____
**Binomial Logistic**
- Assesses the impact of one or more predictors on a binary outcome.
- formula = 'Binary Outcome ~ Predictor1 + Predictor2'

**Multinomial Logistic**
- Assesses the impact of one or more predictors on an outcome with more than two categories.
- formula = 'Ordinal Outcome ~ Predictor1 + Predictor2'

____
Use the printout below to design your formula. 
- Left of the "~" symbol is the thing to be predicted. 
- Right of the "~" symbol are the predictors. 
- ":" indicates an interaction between two things. 
- "*" indicates an interaction AND it includes the simple effects of each variable too. 
- "+" indicates that you want to add another predictor. 
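
The difference between ":" and "*" is easiest to see by expanding a formula into its columns. The sketch below assumes a patsy-style formula parser (the one statsmodels formulas have traditionally used); the notebook does not say which parser `define_design_matrix` relies on, so treat this as illustrative only. The data values are made up.

```python
import pandas as pd
from patsy import dmatrix

# Toy data with made-up values
toy_df = pd.DataFrame({
    "hours_worked": [1.0, 2.0, 3.0, 4.0],
    "owns_apple_picking_machine": [0.0, 1.0, 0.0, 1.0],
})

# ':' adds only the interaction column
interaction_only = dmatrix(
    "hours_worked:owns_apple_picking_machine", toy_df, return_type="dataframe"
)
# '*' expands to both simple effects plus their interaction
full_factorial = dmatrix(
    "hours_worked*owns_apple_picking_machine", toy_df, return_type="dataframe"
)

print(list(interaction_only.columns))
print(list(full_factorial.columns))
```

In both cases patsy adds an `Intercept` column automatically unless the formula ends with `- 1`.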

In [None]:
data_df.columns

In [None]:
vars = ['Age', 'Male', 'Fusiform__sum_csf', 'Temporal_Pole_Mid__sum_csf',
       'Occipital_Sup__sum_csf', 'Postcentral__sum_csf',
       'Cerebelum_Crus2__sum_csf', 'Temporal_Inf__sum_csf',
       'Rolandic_Oper__sum_csf', 'Cerebelum_9__sum_csf', 'Rectus__sum_csf',
       'Temporal_Sup__sum_csf', 'Cerebelum_8__sum_csf', 'Precuneus__sum_csf',
       'Occipital_Inf__sum_csf', 'OFCpost__sum_csf', 'Cingulate_Mid__sum_csf',
       'Cerebelum_4_5__sum_csf', 'Vermis_10__sum_csf', 'OFClat__sum_csf',
       'Olfactory__sum_csf', 'Cingulate_Post__sum_csf',
       'Frontal_Sup_2__sum_csf', 'Angular__sum_csf', 'Putamen__sum_csf',
       'Vermis_6__sum_csf', 'Heschl__sum_csf', 'OFCmed__sum_csf',
       'Pallidum__sum_csf', 'Cuneus__sum_csf', 'Cerebelum_3__sum_csf',
       'Cerebelum_Crus1__sum_csf', 'Vermis_7__sum_csf', 'Insula__sum_csf',
       'Paracentral_Lobule__sum_csf', 'Hippocampus__sum_csf',
       'ParaHippocampal__sum_csf', 'SupraMarginal__sum_csf',
       'Precentral__sum_csf', 'Occipital_Mid__sum_csf',
       'Temporal_Pole_Sup__sum_csf', 'Lingual__sum_csf', 'Caudate__sum_csf',
       'Amygdala__sum_csf', 'Frontal_Inf_Tri__sum_csf',
       'Supp_Motor_Area__sum_csf', 'Parietal_Inf__sum_csf',
       'Frontal_Med_Orb__sum_csf', 'Vermis_1_2__sum_csf', 'Vermis_3__sum_csf',
       'Temporal_Mid__sum_csf', 'Calcarine__sum_csf', 'Cerebelum_6__sum_csf',
       'Parietal_Sup__sum_csf', 'Cerebelum_10__sum_csf',
       'Cerebelum_7b__sum_csf', 'Frontal_Sup_Medial__sum_csf',
       'Vermis_8__sum_csf', 'Vermis_4_5__sum_csf', 'Thalamus__sum_csf',
       'OFCant__sum_csf', 'Vermis_9__sum_csf', 'Frontal_Mid_2__sum_csf',
       'Frontal_Inf_Orb_2__sum_csf', 'Frontal_Inf_Oper__sum_csf',
       'Cingulate_Ant__sum_csf']
t = ' + '.join(vars)
t

In [None]:
formula = "DX_BASELINE ~ Age + Male + Fusiform__sum_csf + Temporal_Pole_Mid__sum_csf + Occipital_Sup__sum_csf + Postcentral__sum_csf + Cerebelum_Crus2__sum_csf + Temporal_Inf__sum_csf + Rolandic_Oper__sum_csf + Cerebelum_9__sum_csf + Rectus__sum_csf + Temporal_Sup__sum_csf + Cerebelum_8__sum_csf + Precuneus__sum_csf + Occipital_Inf__sum_csf + OFCpost__sum_csf + Cingulate_Mid__sum_csf + Cerebelum_4_5__sum_csf + Vermis_10__sum_csf + OFClat__sum_csf + Olfactory__sum_csf + Cingulate_Post__sum_csf + Frontal_Sup_2__sum_csf + Angular__sum_csf + Putamen__sum_csf + Vermis_6__sum_csf + Heschl__sum_csf + OFCmed__sum_csf + Pallidum__sum_csf + Cuneus__sum_csf + Cerebelum_3__sum_csf + Cerebelum_Crus1__sum_csf + Vermis_7__sum_csf + Insula__sum_csf + Paracentral_Lobule__sum_csf + Hippocampus__sum_csf + ParaHippocampal__sum_csf + SupraMarginal__sum_csf + Precentral__sum_csf + Occipital_Mid__sum_csf + Temporal_Pole_Sup__sum_csf + Lingual__sum_csf + Caudate__sum_csf + Amygdala__sum_csf + Frontal_Inf_Tri__sum_csf + Supp_Motor_Area__sum_csf + Parietal_Inf__sum_csf + Frontal_Med_Orb__sum_csf + Vermis_1_2__sum_csf + Vermis_3__sum_csf + Temporal_Mid__sum_csf + Calcarine__sum_csf + Cerebelum_6__sum_csf + Parietal_Sup__sum_csf + Cerebelum_10__sum_csf + Cerebelum_7b__sum_csf + Frontal_Sup_Medial__sum_csf + Vermis_8__sum_csf + Vermis_4_5__sum_csf + Thalamus__sum_csf + OFCant__sum_csf + Vermis_9__sum_csf + Frontal_Mid_2__sum_csf + Frontal_Inf_Orb_2__sum_csf + Frontal_Inf_Oper__sum_csf + Cingulate_Ant__sum_csf"

# 02 - Visualize Your Design Matrix

This is the explanatory variable half of your regression formula
_______________________________________________________
Create Design Matrix: Use the define_design_matrix method. Provide the formula you defined above and your dataframe; every variable in the formula must correspond to a column name in your dataframe.

- outcome_matrix, design_matrix = cal_palm.define_design_matrix(formula, data_df, add_intercept=True)
- To include interaction terms, use * between variables, like "var1*var2".
- An intercept will be added when add_intercept=True
- **Don't explicitly add the 'intercept' column. I'll do it for you.**

In [None]:
# Define the design matrix
outcome_matrix, design_matrix = cal_palm.define_design_matrix(formula, data_df, add_intercept=True)
design_matrix

Check multicollinearity in design matrix
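
For reference, the variance inflation factor of a predictor is 1 / (1 - R²), where R² comes from regressing that predictor on all the other columns of the design matrix. The numpy sketch below is a hand-rolled illustration of that definition, not the `calculate_vif` implementation. The helper name `vif_table` and the intercept column name `Intercept` are assumptions, and the data are synthetic.

```python
import numpy as np
import pandas as pd

def vif_table(design, intercept_name="Intercept"):
    """Hypothetical helper: VIF_j = 1 / (1 - R^2_j) for each non-intercept column."""
    rows = []
    for col in design.columns:
        if col == intercept_name:
            continue
        y = design[col].to_numpy(dtype=float)
        X = design.drop(columns=[col]).to_numpy(dtype=float)
        # Regress column j on all remaining columns (including the intercept)
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        resid = y - X @ beta
        ss_tot = np.sum((y - y.mean()) ** 2)
        r2 = 1.0 - (resid @ resid) / ss_tot
        rows.append({"variable": col, "VIF": 1.0 / (1.0 - r2)})
    return pd.DataFrame(rows)

# Synthetic design matrix: 'a_copy' is nearly a duplicate of 'a'
rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = rng.normal(size=200)
toy_design = pd.DataFrame({
    "Intercept": 1.0,
    "a": a,
    "b": b,
    "a_copy": a + 0.05 * rng.normal(size=200),
})
vifs = vif_table(toy_design)
print(vifs)
```

A common rule of thumb treats VIF values above roughly 5 to 10 as a sign that a predictor is largely redundant with the others.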

In [None]:
# Multicollinearity check
from calvin_utils.statistical_utils.statistical_measurements import calculate_vif
calculate_vif(design_matrix)

# 03 - Visualize Your Dependent Variable

I have generated this for you based on the formula you provided

In [None]:
# outcome_matrix = outcome_matrix.iloc[:, [0]]
outcome_matrix

outcome_matrix.sum()

# 04 - Run the Regression

Regression Results Are Displayed Below

- This will run a binomial or a multinomial logit depending on your outcome matrix. 
- A multinomial logit will display N-1 categories, where N is the number of possible classes. This occurs because every category is expressed relative to a reference class. 
- The reference class will be the first column in your outcome_matrix; to change it, manually move the class you want to the first column.
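
The N-1 point can be shown with a small numpy sketch of how a multinomial logit turns coefficients into probabilities. The reference class has its logit fixed at 0, so only N-1 coefficient sets are estimated. The function name and all numbers below are hypothetical, and placing the reference class in column 0 is a convention chosen for this illustration.

```python
import numpy as np

def multinomial_probabilities(X, coefs):
    """Hypothetical helper.

    X:     (n, k) design matrix
    coefs: (k, N-1) coefficients, one column per non-reference class
    Returns an (n, N) array of probabilities with the reference class in column 0.
    """
    logits = X @ coefs
    # The reference class has its logit fixed at 0
    logits = np.hstack([np.zeros((X.shape[0], 1)), logits])
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    exp = np.exp(logits)
    return exp / exp.sum(axis=1, keepdims=True)

# Two subjects, intercept + one predictor (made-up values)
X = np.array([[1.0, 0.5],
              [1.0, -1.0]])
# Two coefficient sets, so three classes in total
coefs = np.array([[0.2, -0.3],
                  [1.0, 0.5]])
probs = multinomial_probabilities(X, coefs)
print(probs)
```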

In [None]:
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import classification_report
import pandas as pd

# Standardize features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(design_matrix)

# Train SVM
y = outcome_matrix.idxmax(axis=1)
svm = SVC(probability=True, kernel='linear', random_state=42)
svm.fit(X_scaled, y)

# Predict probabilities on training data
probabilities = svm.predict_proba(X_scaled)
predictions_df = pd.DataFrame(probabilities, columns=svm.classes_)

# Output
print(predictions_df)

# Optional: evaluate performance on training data
y_pred = svm.predict(X_scaled)
print(classification_report(y, y_pred))


In [None]:
from calvin_utils.statistical_utils.logistic_regression import LogisticRegression
logreg = LogisticRegression(outcome_matrix, design_matrix)
results = logreg.run()
results.summary2()

# 05 - Receiver Operating Characteristic
- The ROC considers classifications across ALL POSSIBLE PROBABILITY THRESHOLDS, demonstrating what is ultimately achievable at the best possible threshold

- First curve is the ROC for classification of each class with respect to all other classes
- Second curve (Macro Average) is basically a meta-analytic ROC with equal weight per class.
- Third curve (Micro Average) is basically a meta-analytic ROC with weight proportional to the sample size of each class
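
To see how the three summaries differ numerically, here is a minimal sketch using scikit-learn's `roc_auc_score` on one-hot observations. This is not the `ComprehensiveMulticlassROC` implementation; the observations and probabilities are made up.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical one-hot observations (rows = subjects, columns = classes)
y_onehot = np.array([
    [1, 0, 0], [1, 0, 0], [1, 0, 0], [1, 0, 0],
    [0, 1, 0], [0, 1, 0],
    [0, 0, 1], [0, 0, 1],
])
# Hypothetical predicted probabilities for the same subjects
probs = np.array([
    [0.7, 0.2, 0.1],
    [0.6, 0.3, 0.1],
    [0.4, 0.5, 0.1],
    [0.5, 0.2, 0.3],
    [0.3, 0.6, 0.1],
    [0.5, 0.4, 0.1],
    [0.2, 0.2, 0.6],
    [0.3, 0.3, 0.4],
])

# One-vs-rest AUC for each class
per_class_auc = roc_auc_score(y_onehot, probs, average=None)
# Macro: unweighted mean of the per-class AUCs
macro_auc = roc_auc_score(y_onehot, probs, average="macro")
# Micro: pool every (subject, class) decision into one ROC
micro_auc = roc_auc_score(y_onehot, probs, average="micro")

print(per_class_auc, macro_auc, micro_auc)
```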

In [None]:
from calvin_utils.statistical_utils.classification_statistics import ComprehensiveMulticlassROC
evaluator = ComprehensiveMulticlassROC(fitted_model=results, predictions_df=None, observation_df=outcome_matrix, normalization='pred', thresholds=None, out_dir=out_dir+'/train_results')
evaluator.run()

Visualize OVR CIs
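
As background, a confidence interval on an AUC is commonly obtained by resampling subjects with replacement and taking percentiles of the resampled AUCs. The sketch below shows that idea for a single one-vs-rest comparison. The helper name `bootstrap_auc_ci` is hypothetical, the data are synthetic, and the percentile method is an assumption; `bootstrap_ovr_auroc` may work differently.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def bootstrap_auc_ci(y_true, y_score, n_iterations=1000, alpha=0.05, seed=0):
    """Hypothetical helper: percentile bootstrap CI for a binary AUC."""
    rng = np.random.default_rng(seed)
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score)
    aucs = []
    for _ in range(n_iterations):
        # Resample subjects with replacement
        idx = rng.integers(0, len(y_true), size=len(y_true))
        if len(np.unique(y_true[idx])) < 2:
            continue  # AUC is undefined when only one class was drawn
        aucs.append(roc_auc_score(y_true[idx], y_score[idx]))
    lower, upper = np.percentile(aucs, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return float(np.mean(aucs)), float(lower), float(upper)

# Synthetic one-vs-rest example: 50 "rest" subjects and 50 "class" subjects
rng = np.random.default_rng(1)
y_true = np.repeat([0, 1], 50)
y_score = y_true * 1.0 + rng.normal(scale=1.0, size=100)

mean_auc, lower_ci, upper_ci = bootstrap_auc_ci(y_true, y_score, n_iterations=500)
print(f"Mean AUC: {mean_auc:.3f}, 95% CI: ({lower_ci:.3f}, {upper_ci:.3f})")
```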

In [None]:
df, bootstrap  = evaluator.bootstrap_ovr_auroc(raw_observations=evaluator.raw_observations, raw_predictions=evaluator.raw_predictions, outcome_matrix_cols=evaluator.outcome_matrix.columns)
ComprehensiveMulticlassROC.plot_ovr_auc_with_ci(df, out_dir=out_dir+'/train_auc_per_diagnosis')

ADVANCED
- code specific manual thresholds to intervene upon classifications

Step 1: relate integer (index) to class

In [None]:
# evaluator.relate_index_to_class()

Step 2: in a dictionary of the indices (corresponding to class), key in the lambda function to edit the probability. 
- Code from left to right, giving priority to each method. 
- Example:
```
thresholds = {
            0: lambda probs: 0 if probs[0] > 0.5 else (1 if probs[0] > 0.25 else 2),  # Adjust class_0 predictions
            1: lambda probs: None,  # No threshold adjustment for class_1
            2: lambda probs: None   # No threshold adjustment for class_2
        }
```
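
Here is one possible reading of how such a dictionary could be applied to a matrix of predicted probabilities. This is a guess for illustration, not the logic inside `ComprehensiveMulticlassROC`: the helper `apply_thresholds` is hypothetical, and it assumes the rule for a class is consulted only when that class is the argmax prediction, with `None` meaning "leave the prediction unchanged".

```python
import numpy as np

def apply_thresholds(probs, thresholds):
    """Hypothetical helper (assumed semantics, see the text above)."""
    final = []
    for row in probs:
        predicted = int(np.argmax(row))       # default: most probable class
        rule = thresholds.get(predicted)      # rule keyed by the predicted class
        override = rule(row) if rule is not None else None
        final.append(predicted if override is None else override)
    return np.array(final)

thresholds = {
    0: lambda probs: 0 if probs[0] > 0.5 else (1 if probs[0] > 0.25 else 2),
    1: lambda probs: None,
    2: lambda probs: None,
}

# Made-up probabilities for three subjects
probs = np.array([
    [0.60, 0.30, 0.10],   # class 0 with p > 0.5, stays class 0
    [0.40, 0.35, 0.25],   # class 0 with 0.25 < p <= 0.5, reassigned to class 1
    [0.20, 0.50, 0.30],   # class 1, no adjustment
])
adjusted = apply_thresholds(probs, thresholds)
print(adjusted)
```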

In [None]:
# thresholds = {
#     0: lambda prob: 0,  # Always keep class 0
#     1: lambda prob: 1,  # Always keep class 1
#     2: lambda prob: 2 if prob[2] > 0.5 else (1 if prob[1] > 0.3 else 0)  # Conditional adjustment for class 2
# }


Step 3: Check the effect

In [None]:
# from calvin_utils.statistical_utils.classification_statistics import ComprehensiveMulticlassROC
# evaluator = ComprehensiveMulticlassROC(fitted_model=results, observation_df=outcome_matrix, normalization='pred', thresholds=thresholds, out_dir=out_dir)
# evaluator.run()

Step 4: YOU MUST LOOCV AND VALIDATE IN OUT-OF-SAMPLE DATA.
- add thresholds as an argument to any further calls to ComprehensiveMulticlassROC

Bootstrap the Micro Average AUC

In [None]:
# import matplotlib
# from calvin_utils.statistical_utils.classification_statistics import bootstrap_auc
# matplotlib.use('Agg')  # Use a non-interactive backend

# mean_auc, lower_ci, upper_ci = bootstrap_auc(outcome_matrix, design_matrix, n_iterations=1000)
# print(f'Mean AUC: {mean_auc}, 95% CI: ({lower_ci}, {upper_ci})')

Permutation Test Two Different Formulas by Comparing Their AUCs

In [None]:
data_df.columns

In [None]:
# f1 = "Diagnosis ~ CerebellumCSF + ParietalCSF + MTLCSF + OccipitalCSF + FrontalCSF + temp_ins_csf + SubcortexCSF"
# f2 = "Diagnosis ~ CerebellumGM + ParietalGM + MTLGM + OccipitalGM + FrontalGM + temp_ins_gm + SubcortexGM"

In [None]:
# import matplotlib
# matplotlib.use('Agg')  # Use a non-interactive backend
# from calvin_utils.statistical_utils.classification_statistics import permute_auc_difference
# obs_diff, lower_ci, upper_ci, p_value = permute_auc_difference(data_df, formula1=f1, 
#                                                                   formula2=f2,
#                                                                   cal_palm=cal_palm, n_iterations=1000)
# print(f'Observed AUC Difference: {obs_diff}, 95% CI: ({lower_ci}, {upper_ci}), p-value: {p_value}')

# 06 - Visualize the Regression as a Forest Plot
- This will probably look poor if you ran a regression without standardizing your data. 

In [None]:
from calvin_utils.statistical_utils.statistical_measurements import MultinomialForestPlot

# multinomial_forest = MultinomialForestPlot(model=results, sig_digits=2, out_dir=out_dir+'/forest_plots', table=False)
# multinomial_forest.run()

# 07 - Generate Partial Dependence Plots

In [None]:
from calvin_utils.statistical_utils.statistical_measurements import PartialDependencePlot
# pdp = PartialDependencePlot(formula=formula, data_df=data_df, model=results, design_matrix=design_matrix, outcomes_df=outcome_matrix, data_range=[-1,1], out_dir=out_dir+'/partial_dep_plots', marginal_method='mean', debug=False)
# pdp.run()

# 08 - Visualize the Partial Regression Plots

In [None]:
from calvin_utils.statistical_utils.statistical_measurements import PartialRegressionPlot
# partial_plot = PartialRegressionPlot(model=results, design_matrix=design_matrix, out_dir=out_dir+'/partial_regression_plot', palette=None)
# partial_plot = partial_plot.run()

# 09 - LOOCV
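
For readers new to leave-one-out cross-validation: each subject is predicted by a model fit on all the other subjects, so every prediction is out-of-sample. The sketch below shows the idea with scikit-learn, which is swapped in here for illustration. Note that `sklearn.linear_model.LogisticRegression` is a different class from the `calvin_utils` `LogisticRegression` used elsewhere in this notebook, and the data are synthetic.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_predict

# Synthetic data: 40 subjects, 2 predictors, binary outcome driven by the first predictor
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 2))
y = (X[:, 0] + 0.5 * rng.normal(size=40) > 0).astype(int)

# Each row of loo_probs comes from a model that never saw that subject
loo_probs = cross_val_predict(
    LogisticRegression(), X, y, cv=LeaveOneOut(), method="predict_proba"
)
loo_pred = loo_probs.argmax(axis=1)
accuracy = (loo_pred == y).mean()
print(f"LOOCV accuracy: {accuracy:.2f}")
```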

In [None]:
# import pandas as pd
# from calvin_utils.statistical_utils.logistic_regression import LogisticRegression
# from calvin_utils.statistical_utils.classification_statistics import ComprehensiveMulticlassROC
# y_true, y_pred, test_prob = LogisticRegression.run_loocv(outcome_matrix, design_matrix)
# loocv_evaluator = ComprehensiveMulticlassROC(fitted_model=None, predictions_df=pd.DataFrame(test_prob, columns=outcome_matrix.columns), observation_df=outcome_matrix, normalization='true', thresholds=None, out_dir=out_dir+'/loocv_results')
# loocv_evaluator.run()

In [None]:
# df, bootstrap  = loocv_evaluator.bootstrap_ovr_auroc(raw_observations=loocv_evaluator.raw_observations, raw_predictions=loocv_evaluator.raw_predictions, outcome_matrix_cols=loocv_evaluator.outcome_matrix.columns)
# ComprehensiveMulticlassROC.plot_ovr_auc_with_ci(df, out_dir=out_dir+'/loocv_auc_per_diagnosis')

# 10 - Predict Unseen Data
- Unseen data is expected to be in a held-out CSV with the exact same naming conventions used by the training data

In [None]:
new_csv_path='/Volumes/OneTouch/datasets/adni/metadata/updated_master_list/train_test_splits/test_data_csf.csv'

Get New Data

In [None]:
from calvin_utils.permutation_analysis_utils.statsmodels_palm import CalvinStatsmodelsPalm
new_palm = CalvinStatsmodelsPalm(input_csv_path=new_csv_path, output_dir=out_dir+'/test_results', sheet=sheet)
other_df = new_palm.read_and_display_data()
other_df

In [None]:
column = 'DX_BASELINE'  # The column you'd like to evaluate
condition = 'equal'  # The condition to check ('equal', 'above', 'below', 'not')
value = 'MCI' # The value to drop if found

In [None]:
other_df, _ = new_palm.drop_rows_based_on_value(column, condition, value)
display(other_df)

In [None]:
# Find the minimum count among the categories in DX_BASELINE
category_counts = other_df['DX_BASELINE'].value_counts()
min_count = category_counts.min()

# Downsample each category to the minimum count
other_df_balanced = (
    other_df.groupby('DX_BASELINE', group_keys=False)
    .apply(lambda x: x.sample(min_count, random_state=42))
    .reset_index(drop=True)
)

# Display the balanced dataframe
other_df_balanced['DX_BASELINE'].value_counts()

Prepare Data

In [None]:
import pandas as pd
other_outcome_matrix, other_design_matrix = new_palm.define_design_matrix(formula, other_df, add_intercept=True)

# Ensure both matrices have the same columns
if len(other_outcome_matrix.columns) != len(outcome_matrix.columns):
    # Create a zero-filled DataFrame with the same columns as outcome_matrix
    zero_df = pd.DataFrame(0, index=other_outcome_matrix.index, columns=outcome_matrix.columns)
    
    # Fill zero_df with values from other_outcome_matrix where columns exist
    common_columns = other_outcome_matrix.columns.intersection(outcome_matrix.columns)
    zero_df.loc[:, common_columns] = other_outcome_matrix.loc[:, common_columns]
    
    other_outcome_matrix = zero_df

other_design_matrix
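
The column-alignment step above can also be written with `DataFrame.reindex`, which adds any missing class columns filled with zeros and puts the columns in the training order. A minimal sketch with made-up class names:

```python
import pandas as pd

# Column order of the training outcome matrix (hypothetical class names)
train_cols = ["AD", "CN", "MCI"]

# Toy test outcome matrix that is missing the 'MCI' column
toy_test_outcomes = pd.DataFrame({"AD": [1, 0], "CN": [0, 1]})

# Add missing columns as zeros and match the training column order
aligned = toy_test_outcomes.reindex(columns=train_cols, fill_value=0)
print(aligned)
```

Unlike comparing the number of columns, this also handles the case where the test matrix has the same number of columns as the training matrix but in a different order.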

Predict

In [None]:
import pandas as pd

# Choose which fitted model to predict with: 'SVM' or 'Logistic'
choice = 'Logistic'

if choice == 'SVM':
    # Reuse the scaler fitted on the training design matrix; do not refit it on test data
    testX_scaled = scaler.transform(other_design_matrix)
    probabilities = svm.predict_proba(testX_scaled)
    predictions_df = pd.DataFrame(probabilities, columns=svm.classes_)
elif choice == 'Logistic':
    # The logistic model was fit on the unscaled design matrix, so predict on the unscaled test matrix
    predictions_df = results.predict(other_design_matrix)
else:
    raise ValueError("Invalid choice. Please select either 'SVM' or 'Logistic'.")

In [None]:
# thresholds = {
#     0: lambda prob: 0 if prob < 0.33 else 1,
#     1: lambda prob: 1 if prob > 0.33 else 0
# }

In [None]:
from calvin_utils.statistical_utils.classification_statistics import ComprehensiveMulticlassROC
loocv_evaluator = ComprehensiveMulticlassROC(fitted_model=None, predictions_df=predictions_df, observation_df=other_outcome_matrix, normalization='true', thresholds=None, out_dir=out_dir+'/test_results')
loocv_evaluator.run() 

In [None]:
loocv_evaluator.save_dataframes()

Get One Vs. All Confidence Intervals on AUC

In [None]:
df, bootstrap = ComprehensiveMulticlassROC.bootstrap_ovr_auroc(raw_observations=loocv_evaluator.raw_observations, raw_predictions=loocv_evaluator.raw_predictions, outcome_matrix_cols=loocv_evaluator.outcome_matrix.columns)
ComprehensiveMulticlassROC.plot_ovr_auc_with_ci(df, out_dir=out_dir+'/test_auc_per_diagnosis')

In [None]:
display(df)

Get Confidence Intervals on Sensitivity, Specificity, NPV, PPV, and Accuracy for Each Class
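
For context, Youden's J statistic is sensitivity + specificity - 1 (equivalently, TPR - FPR), and the threshold that maximizes it is a common choice for turning probabilities into class calls. The sketch below computes it for one class against the rest using scikit-learn's `roc_curve`. The data are made up, and this is not the `calculate_youden_and_metrics` implementation.

```python
import numpy as np
from sklearn.metrics import roc_curve

# Made-up one-vs-rest labels and predicted probabilities
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1])
y_score = np.array([0.10, 0.20, 0.35, 0.60, 0.40, 0.65, 0.80, 0.90, 0.95])

fpr, tpr, thresholds = roc_curve(y_true, y_score, drop_intermediate=False)

# Youden's J = sensitivity + specificity - 1 = TPR - FPR
youden_j = tpr - fpr
best_threshold = thresholds[int(np.argmax(youden_j))]

# Classify at the Youden-optimal threshold and compute the usual metrics
y_pred = (y_score >= best_threshold).astype(int)
tp = int(np.sum((y_pred == 1) & (y_true == 1)))
tn = int(np.sum((y_pred == 0) & (y_true == 0)))
fp = int(np.sum((y_pred == 1) & (y_true == 0)))
fn = int(np.sum((y_pred == 0) & (y_true == 1)))

sensitivity = tp / (tp + fn)
specificity = tn / (tn + fp)
accuracy = (tp + tn) / len(y_true)
print(best_threshold, sensitivity, specificity, accuracy)
```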

In [None]:
from calvin_utils.statistical_utils.classification_statistics import calculate_youden_and_metrics, save_dfs
dfs, youden_dict = calculate_youden_and_metrics(raw_observations=loocv_evaluator.raw_observations, 
                                                raw_predictions=loocv_evaluator.raw_predictions, 
                                                outcome_matrix_cols=loocv_evaluator.outcome_matrix.columns,
                                                out_dir=out_dir+'/metrics_per_diagnosis')
save_dfs(dfs, out_dir=out_dir+'/metrics_per_diagnosis')

In [None]:
ComprehensiveMulticlassROC.generate_all_plots(dfs, out_dir=out_dir+'/metrics_per_diagnosis')

Get Overall Micro Average AUC

In [None]:
loocv_evaluator.get_micro_auc()

That's all

-Calvin