# Run An Odds Ratio

### Authors: Calvin Howard.

#### Last updated: July 6, 2023

Use this to run/test a statistical model on a spreadsheet.

Notes:
- To best use this notebook, you should be familiar with GLM design and contrast matrix design. See this webpage to get started:
[FSL's GLM page](https://fsl.fmrib.ox.ac.uk/fsl/fslwiki/GLM)
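
As a small illustration of the design-matrix and contrast ideas mentioned in the note above, the sketch below fits an ordinary least-squares GLM with NumPy and evaluates one contrast. The data, the column layout, and every variable name here are made up for illustration and are not part of this notebook's pipeline.

```python
import numpy as np

# Hypothetical design matrix: intercept, group indicator (0/1), one covariate
X = np.array([
    [1, 0, 0.5],
    [1, 0, 0.7],
    [1, 1, 0.6],
    [1, 1, 0.9],
    [1, 0, 0.4],
    [1, 1, 0.8],
], dtype=float)

# Hypothetical outcome generated from known coefficients (no noise)
true_beta = np.array([2.0, 1.5, -0.5])
y = X @ true_beta

# Ordinary least-squares estimate of the GLM coefficients
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

# Contrast vector that isolates the group effect (second regressor)
contrast = np.array([0.0, 1.0, 0.0])
group_effect = contrast @ beta_hat
```

Each row of a contrast matrix plays the role of `contrast` here: it selects the linear combination of coefficients being tested.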

# 00 - Import CSV with All Data
**The CSV is expected to be in this format**
- ID and absolute paths to niftis are critical
```
+-----+----------------------------+--------------+--------------+--------------+
| ID  | Nifti_File_Path            | Covariate_1  | Covariate_2  | Covariate_3  |
+-----+----------------------------+--------------+--------------+--------------+
| 1   | /path/to/file1.nii.gz      | 0.5          | 1.2          | 3.4          |
| 2   | /path/to/file2.nii.gz      | 0.7          | 1.4          | 3.1          |
| 3   | /path/to/file3.nii.gz      | 0.6          | 1.5          | 3.5          |
| 4   | /path/to/file4.nii.gz      | 0.9          | 1.1          | 3.2          |
| ... | ...                        | ...          | ...          | ...          |
+-----+----------------------------+--------------+--------------+--------------+
```
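
A quick sanity check on a table of this shape might look like the sketch below. The rows are placeholders copied from the layout above, and `required`, `missing_cols`, and `relative_ids` are hypothetical names, not functions or attributes of `calvin_utils`.

```python
from pathlib import PurePosixPath

import pandas as pd

# Hypothetical rows mirroring the layout shown in the table above
example_df = pd.DataFrame({
    "ID": [1, 2, 3],
    "Nifti_File_Path": [
        "/path/to/file1.nii.gz",
        "/path/to/file2.nii.gz",
        "/path/to/file3.nii.gz",
    ],
    "Covariate_1": [0.5, 0.7, 0.6],
})

# The two columns described as critical must be present
required = ["ID", "Nifti_File_Path"]
missing_cols = [c for c in required if c not in example_df.columns]

# Collect the IDs of any rows whose path is not absolute (POSIX-style paths assumed)
is_absolute = example_df["Nifti_File_Path"].map(lambda p: PurePosixPath(p).is_absolute())
relative_ids = example_df.loc[~is_absolute, "ID"].tolist()
```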

Prep Output Directory

In [None]:
# Specify where you want to save your results to
out_dir = '/Users/cu135/Library/CloudStorage/OneDrive-Personal/OneDrive_Documents/Research/2023/subiculum_cognition_and_age/figures/Figures/suplements_3_cohort_age_optimized/unstandardized_data'

Import Data

In [None]:
# Specify the path to your CSV or Excel file containing NIFTI paths
input_csv_path = '/Users/cu135/Dropbox (Partners HealthCare)/studies/cognition_2023/metadata/master_list_proper_subjects.xlsx'
sheet = 'master_list_proper_subjects'

In [None]:
from calvin_utils.permutation_analysis_utils.statsmodels_palm import CalvinStatsmodelsPalm
# Instantiate the CalvinStatsmodelsPalm class
cal_palm = CalvinStatsmodelsPalm(input_csv_path=input_csv_path, output_dir=out_dir, sheet=sheet)
# Read the spreadsheet and display its contents
data_df = cal_palm.read_and_display_data()

# 01 - Preprocess Your Data

**Handle NANs**
- Set drop_nans=True if you would like to remove NaNs from data
- Provide a column name or a list of column names to remove NaNs from
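
The internals of `drop_nans_from_columns` are not shown in this notebook. As a rough plain-pandas equivalent, assuming it removes any row with a NaN in one of the listed columns, the idea can be sketched on a toy table (all values below are invented):

```python
import numpy as np
import pandas as pd

toy_df = pd.DataFrame({
    "Age": [61.0, np.nan, 70.0, 66.0],
    "City": ["Toronto", "Queensland", None, "Toronto"],
    "Notes": [None, "a", "b", None],  # missing values here are ignored
})

# Only these columns are inspected for missing values
cols_to_check = ["Age", "City"]
cleaned_df = toy_df.dropna(subset=cols_to_check)
```

Rows 1 and 2 are dropped because they are missing `Age` or `City`; rows missing only `Notes` are kept.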

In [None]:
data_df.columns

In [None]:
drop_list = ['Z_Scored_Percent_Cognitive_Improvement', 'Subiculum_Group_By_24', 'City', 'Age_Group']

In [None]:
data_df = cal_palm.drop_nans_from_columns(columns_to_drop_from=drop_list)
display(data_df)

**Drop Row Based on Value of Column**

Define the column, condition, and value for dropping rows
- column = 'your_column_name'
- condition = 'above'  # Options: 'equal', 'above', 'below', 'not'
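
The behaviour of `drop_rows_based_on_value` is not shown here. The sketch below is a hypothetical stand-in that assumes the method returns the kept rows first and the dropped rows second, and that `'not'` means "not equal to the value". The function name and toy data are illustrative only.

```python
import pandas as pd

def drop_rows_sketch(df, column, condition, value):
    """Hypothetical stand-in: return (kept_rows, dropped_rows)."""
    if condition == "equal":
        drop_mask = df[column] == value
    elif condition == "above":
        drop_mask = df[column] > value
    elif condition == "below":
        drop_mask = df[column] < value
    elif condition == "not":
        drop_mask = df[column] != value
    else:
        raise ValueError(f"Unknown condition: {condition}")
    return df[~drop_mask], df[drop_mask]

toy_df = pd.DataFrame({
    "City": ["Toronto", "Queensland", "Toronto"],
    "Age": [61, 70, 66],
})
kept_df, dropped_df = drop_rows_sketch(toy_df, "City", "equal", "Queensland")
```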

In [None]:
data_df.columns

Set the parameters for dropping rows

In [None]:
column = 'City'  # The column you'd like to evaluate
condition = 'equal'  # The condition to check ('equal', 'above', 'below', 'not')
value = 'Queensland' # The value to drop if found

In [None]:
data_df, other_df = cal_palm.drop_rows_based_on_value(column, condition, value)
display(data_df)

**Invert Distributions**

In [None]:
from calvin_utils.statistical_utils.distribution_statistics import invert_distribution
mask = data_df['City'] == 'Toronto'
data_df.loc[mask, ['Cognitive_Baseline']] = invert_distribution(data_df.loc[mask, ['Cognitive_Baseline']])

mask = data_df['City'] == 'Queensland'
data_df.loc[mask, ['Cognitive_Baseline']] = invert_distribution(data_df.loc[mask, ['Cognitive_Baseline']])

mask = data_df['City'] == 'Toronto'
data_df.loc[mask, ['Cognitive_Score_1_Yr']] = invert_distribution(data_df.loc[mask, ['Cognitive_Score_1_Yr']])

mask = data_df['City'] == 'Queensland'
data_df.loc[mask, ['Cognitive_Score_1_Yr']] = invert_distribution(data_df.loc[mask, ['Cognitive_Score_1_Yr']])


**Standardize Data**
- List any columns you do not want to standardize
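
For orientation, standardizing every numeric column except an exclusion list could be sketched in plain pandas as below. This is an assumption about what `standardize_columns` does, not its actual implementation; in particular, whether it uses the sample (ddof=1) or population (ddof=0) standard deviation is not known from this notebook. The toy values and `cols_to_skip` are hypothetical.

```python
import pandas as pd

toy_df = pd.DataFrame({
    "Age": [60.0, 70.0, 80.0],
    "Cognitive_Baseline": [10.0, 20.0, 30.0],
    "City": ["Toronto", "Toronto", "Queensland"],
})
cols_to_skip = ["Age"]  # hypothetical exclusion list

# Only numeric columns that are not excluded get standardized
numeric_cols = toy_df.select_dtypes(include="number").columns
cols_to_scale = [c for c in numeric_cols if c not in cols_to_skip]

standardized_df = toy_df.copy()
# z-score: subtract the column mean, divide by the sample standard deviation (ddof=1)
standardized_df[cols_to_scale] = (
    toy_df[cols_to_scale] - toy_df[cols_to_scale].mean()
) / toy_df[cols_to_scale].std()
```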

In [None]:
# Remove anything you don't want to standardize
cols_not_to_standardize = None #['Z_Scored_Percent_Cognitive_Improvement_By_Origin_Group', 'Z_Scored_Subiculum_T_By_Origin_Group_'] #['Age']

In [None]:
data_df = cal_palm.standardize_columns(cols_not_to_standardize)
data_df

Standardize Columns by Mask

In [None]:
import pandas as pd
import numpy as np
from scipy.stats import zscore

def mask_and_zscore(df, mask_column, zscore_columns, reference_column=None):
    """
    For a given DataFrame, create a mask based on the unique values of a specified column. 
    Then, for each column in a provided list, replace the values with z-scored counterparts 
    using only the indices corresponding to the mask.

    Parameters:
    - df (pandas.DataFrame): The DataFrame to operate on.
    - mask_column (str): The column name to use for creating the mask based on its unique values.
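    - zscore_columns (list): A list of column names for which the values will be replaced with their z-scored counterparts.
    - reference_column (str, optional): If given, each cohort's values are z-scored against the mean and standard deviation of this column within the cohort, rather than against each column's own statistics.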

    Returns:
    - pandas.DataFrame: The modified DataFrame with specified columns z-scored within the mask.
    """

    # Create a mask from unique values in the specified column
    unique_values = df[mask_column].unique()
    if reference_column is not None:
        for cohort in unique_values:
            mask = df[mask_column] == cohort

            for column in zscore_columns:
                if column in df.columns:
                    # Drop NaNs so they do not affect the mean and standard deviation
                    cohort_values = df.loc[mask, column].dropna()
                    reference_values = df.loc[mask, reference_column].dropna() if reference_column else cohort_values

                    mean_reference = reference_values.mean()
                    std_reference = reference_values.std()

                    if std_reference > 0:  # Ensuring standard deviation is not zero
                        z_scores = (cohort_values - mean_reference) / std_reference
                        df.loc[mask, column] = z_scores
                    else:
                        # Handle the case where std is 0, if needed, such as assigning a default value
                        pass
                else:
                    print(f"Column '{column}' not found in DataFrame.")

        # for cohort in unique_values:
        #     mask = df[mask_column] == cohort

        #     # For each column in the list, replace values with z-scored counterparts within the mask
        #     for column in zscore_columns:
        #         if column in df.columns:
        #             # Compute z-scores for the masked subset of the column
        #             z_scores = (df.loc[mask, [column]] - np.mean(df.loc[mask, [reference_column]])) / np.std(df.loc[mask, [reference_column]])
        #             # Replace the original values with z-scores within the mask
        #             df.loc[mask, column] = z_scores
        #         else:
        #             print(f"Column '{column}' not found in DataFrame.")
        
    else:
        for cohort in unique_values:
            mask = df[mask_column] == cohort

            # For each column in the list, replace values with z-scored counterparts within the mask
            for column in zscore_columns:
                if column in df.columns:
                    # Compute z-scores for the masked subset of the column
                    # (population std, ddof=0, matching np.std's default)
                    col_values = df.loc[mask, column]
                    z_scores = (col_values - col_values.mean()) / col_values.std(ddof=0)
                    # Replace the original values with z-scores within the mask
                    df.loc[mask, column] = z_scores
                else:
                    print(f"Column '{column}' not found in DataFrame.")

    return df

In [None]:
# Print both counts; a notebook cell only displays its last expression
print(data_df['Cognitive_Baseline'].isna().sum())
print(data_df['Cognitive_Score_1_Yr'].isna().sum())

In [None]:
df2 = mask_and_zscore(data_df.copy(), mask_column='City', zscore_columns=['Cognitive_Score_1_Yr'])#, reference_column='Cognitive_Baseline')
df2['Cognitive_Score_1_Yr']
df2 = mask_and_zscore(df2, mask_column='City', zscore_columns=['Cognitive_Baseline'])
df2['Cognitive_Baseline']
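
The same within-group z-scoring can also be written with `groupby(...).transform`, shown here on invented data. The sketch assumes the population standard deviation (ddof=0); pass `ddof=1` instead for the sample estimate. The column `Cognitive_Baseline_z` is a hypothetical name.

```python
import pandas as pd

toy_df = pd.DataFrame({
    "City": ["Toronto", "Toronto", "Toronto",
             "Queensland", "Queensland", "Queensland"],
    "Cognitive_Baseline": [1.0, 2.0, 3.0, 10.0, 20.0, 30.0],
})

# z-score within each city so every cohort ends up with mean 0 and unit spread
toy_df["Cognitive_Baseline_z"] = toy_df.groupby("City")["Cognitive_Baseline"].transform(
    lambda s: (s - s.mean()) / s.std(ddof=0)
)
```

Although the two cities sit on very different raw scales, their z-scored values coincide, which is the point of standardizing per cohort.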

Normalize Data

In [None]:
def min_max_normalize_minus_one_to_one(series, reference_series=None):
    """
    Normalize a series to the range [-1, 1]. If a reference series is provided,
    use its min and max values for normalization; otherwise, use the series' own min and max values.

    Parameters:
    series (pd.Series): The series to be normalized.
    reference_series (pd.Series, optional): The reference series to use for normalization.

    Returns:
    pd.Series: The normalized series with values in the range [-1, 1].
    """
    # Use the reference series' range when given, otherwise the series' own range
    source = series if reference_series is None else reference_series
    min_val = source.min()
    max_val = source.max()
    return 2 * (series - min_val) / (max_val - min_val) - 1
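
A short worked example of the rescaling formula above, `2 * (x - min) / (max - min) - 1`, using made-up numbers. `scores` and `baseline` are hypothetical series, with `baseline` standing in for the reference.

```python
import pandas as pd

scores = pd.Series([10.0, 15.0, 20.0])    # hypothetical series to normalize
baseline = pd.Series([0.0, 10.0, 20.0])   # hypothetical reference series

# Own range (min 10, max 20): the extremes map to -1 and 1
own_scaled = 2 * (scores - scores.min()) / (scores.max() - scores.min()) - 1

# Reference range (min 0, max 20): values are placed relative to the baseline's range
ref_scaled = 2 * (scores - baseline.min()) / (baseline.max() - baseline.min()) - 1
```

With its own range, `scores` becomes -1, 0, 1. Against the reference range it becomes 0, 0.5, 1, so values only reach -1 or 1 if they match the reference extremes.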


In [None]:
grouping_col = 'City'  # Normalization is computed separately within each category of this column
col_to_normalize = 'Cognitive_Baseline'
reference_col = 'Cognitive_Baseline'

# Apply the normalization using the reference series
data_df[f'{col_to_normalize}_normalized'] = data_df.groupby(grouping_col).apply(
    lambda group: min_max_normalize_minus_one_to_one(group[col_to_normalize], group[reference_col])
).reset_index(level=0, drop=True)

Invert a Distribution

In [None]:
def invert_distribution(series):
    """
    Invert the distribution of a series.

    Parameters:
    series (pd.Series): The series to be inverted.

    Returns:
    pd.Series: The series with its distribution inverted.
    """
    max_val = series.max()
    return max_val - series
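
Because the function subtracts each value from the series maximum, the result depends on which values are passed in together. The sketch below repeats the same arithmetic under a hypothetical name, `invert_sketch`, on invented numbers.

```python
import pandas as pd

def invert_sketch(series):
    # Same arithmetic as invert_distribution above: reflect values about the maximum
    return series.max() - series

group_scores = pd.Series([1.0, 4.0, 10.0])
inverted = invert_sketch(group_scores)      # 10 - [1, 4, 10]
single = invert_sketch(pd.Series([4.0]))    # a one-element series is its own maximum
```

Here `inverted` is 9, 6, 0, while `single` is 0. Inverting values one at a time therefore always yields 0, so the function should be given a whole group at once.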

In [None]:
data_df.columns

In [None]:
grouping_col = 'City'  # Column that defines the groups
specific_group_to_flip = 'Toronto'
col_to_normalize = 'Cognitive_Baseline_normalized'
# Invert only the rows where City == 'Toronto'. The whole subset is passed at once so that
# invert_distribution subtracts from the group maximum; inverting one value at a time would always return 0.
mask = data_df[grouping_col] == specific_group_to_flip
data_df.loc[mask, col_to_normalize] = invert_distribution(data_df.loc[mask, col_to_normalize])

Categorize Values

In [None]:
import numpy as np

# Define conditions
conditions = [
    df2['Cognitive_Baseline'] > 2,  # Values over 2
    df2['Cognitive_Baseline'] < -2  # Values under -2
]

# Define choices corresponding to the conditions
choices = [
    1,  # Choice for values over 2
    -1  # Choice for values under -2
]

# Apply conditions and choices, default value is 0 for values between -2 and 2
df2['Cognitive_Baseline'] = np.select(conditions, choices, default=0)
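
To see how the thresholds behave at the boundaries, here is the same `np.select` pattern on a few invented z-scores. The series `z` and the name `labels` are illustrative.

```python
import numpy as np
import pandas as pd

z = pd.Series([-3.1, -2.0, 0.4, 2.0, 2.5])  # hypothetical z-scored values

# Strict inequalities: exactly 2 or -2 falls into the default category
labels = np.select([z > 2, z < -2], [1, -1], default=0)
```

The result is -1, 0, 0, 0, 1: values of exactly 2 or -2 are coded 0 because both comparisons are strict.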


Pivot a Dataframe

In [None]:
data_df.columns

In [None]:
def pivot_dataframe(df, concat_col, category_col):
    # Create a new DataFrame where each unique category becomes a column
    # and the values from concat_col are listed under these category columns
    # First, ensure that the index is reset for the DataFrame to avoid issues during pivoting
    df.reset_index(drop=True, inplace=True)

    # Create a new DataFrame where each row will have the category as a column and the corresponding values
    # from concat_col under that category
    pivoted_df = df.pivot(columns=category_col, values=concat_col)
    
    return pivoted_df
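
On a toy table, this kind of pivot spreads one column of values into one column per category, leaving NaN where a row belongs to a different category. The category labels `A` and `B` and the `Score` column below are placeholders.

```python
import pandas as pd

toy_df = pd.DataFrame({
    "StimMatch": ["A", "B", "A", "B"],   # hypothetical category labels
    "Score": [0.1, 0.2, 0.3, 0.4],
})

# One output column per category; the original row index is kept
pivoted = toy_df.pivot(columns="StimMatch", values="Score")
```

Each row has a value in exactly one of the new columns, and summary methods such as `describe()` skip the NaN entries, so they report per-category statistics.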

In [None]:
pdf = pivot_dataframe(data_df, 'Cognitive_Score_1_Yr_normalized','StimMatch')
pdf

In [None]:
pdf.describe()

RCT Plotter

In [None]:
data_df.columns

In [None]:
import numpy as np
np.sum(data_df['Cognitive_Baseline_normalized'] == -1)
data_df['City'].unique()

In [None]:
from calvin_utils.statistical_utils.rct import RCTPlotter

# Initialize the RCTPlotter
plotter = RCTPlotter(data=data_df, obs_cols=['Cognitive_Baseline_normalized', 'Cognitive_Score_1_Yr_normalized'], arm_col='StimMatch', category_col=None, out_dir=out_dir)

# Run the RCTPlotter and display the plot
plotter.run()

Difference in Differences Plotter

In [None]:
data_df.columns

In [None]:
from calvin_utils.statistical_utils.rct import DiDAnalysis
analysis = DiDAnalysis(data=data_df, obs_cols=['Cognitive_Baseline', 'Cognitive_Score_1_Yr'], arm_col='StimMatch', category_col='City')

# Run the DiDAnalysis and display the plot
analysis.run()
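
The internals of `DiDAnalysis` are not shown here. For reference, the basic difference-in-differences estimate is the mean change in one arm minus the mean change in the other, which can be sketched on invented data as follows. The arm labels and column names are hypothetical.

```python
import pandas as pd

toy_df = pd.DataFrame({
    "arm": ["treated", "treated", "control", "control"],
    "baseline": [10.0, 12.0, 10.0, 12.0],
    "followup": [15.0, 17.0, 11.0, 13.0],
})

# Change from baseline to follow-up for each subject
change = toy_df["followup"] - toy_df["baseline"]

# Mean change per arm, then the difference between arms
mean_change = change.groupby(toy_df["arm"]).mean()
did_estimate = mean_change["treated"] - mean_change["control"]
```

In this toy example the treated arm improves by 5 on average and the control arm by 1, giving an estimate of 4.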

Propensity Stratification Match

In [None]:
data_df.columns

In [None]:
from calvin_utils.statistical_utils.rct import PropensityStratifiedRCTPlotter
ps_rct_plotter = PropensityStratifiedRCTPlotter(data=data_df, obs_cols=['Cognitive_Baseline', 'Cognitive_Score_1_Yr'], arm_col='StimMatch', covariate_cols=['Age', 'Cognitive_Baseline'], n_strata=2)
ps_rct_plotter.run()