# Run Any Kind of OLS Regression (ANOVA, GLM, etc.)

### Authors: Calvin Howard.

#### Last updated: July 6, 2023

Use this to run or test a statistical model (e.g., a regression or t-test) on a spreadsheet.

Notes:
- To get the most out of this notebook, you should be familiar with GLM design and contrast matrix design. To get started, see
[FSL's GLM page](https://fsl.fmrib.ox.ac.uk/fsl/fslwiki/GLM)

# 00 - Import CSV with All Data
**The CSV is expected to be in this format**
- The ID column and absolute paths to the NIfTI files are critical
```
+-----+----------------------------+--------------+--------------+--------------+
| ID  | Nifti_File_Path            | Covariate_1  | Covariate_2  | Covariate_3  |
+-----+----------------------------+--------------+--------------+--------------+
| 1   | /path/to/file1.nii.gz      | 0.5          | 1.2          | 3.4          |
| 2   | /path/to/file2.nii.gz      | 0.7          | 1.4          | 3.1          |
| 3   | /path/to/file3.nii.gz      | 0.6          | 1.5          | 3.5          |
| 4   | /path/to/file4.nii.gz      | 0.9          | 1.1          | 3.2          |
| ... | ...                        | ...          | ...          | ...          |
+-----+----------------------------+--------------+--------------+--------------+
```
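Before loading the spreadsheet, it can help to confirm that the two critical columns are filled in and that every path is absolute. The following is a minimal sketch using only the standard library. The helper name `check_rows` is hypothetical (it is not part of `calvin_utils`), and the check assumes POSIX-style paths like those in the table above.

```python
import csv
import io
from pathlib import PurePosixPath

# Column names taken from the example table above
REQUIRED_COLUMNS = ("ID", "Nifti_File_Path")

def check_rows(rows):
    """Return a list of human-readable problems found in the rows (hypothetical helper)."""
    problems = []
    seen_ids = set()
    for line_no, row in enumerate(rows, start=2):  # line 1 is the header
        for col in REQUIRED_COLUMNS:
            if not row.get(col):
                problems.append(f"line {line_no}: missing {col}")
        path = row.get("Nifti_File_Path") or ""
        # Assumption: paths are POSIX-style, so "absolute" means a leading slash
        if path and not PurePosixPath(path).is_absolute():
            problems.append(f"line {line_no}: path is not absolute: {path}")
        subject_id = row.get("ID")
        if subject_id:
            if subject_id in seen_ids:
                problems.append(f"line {line_no}: duplicate ID {subject_id}")
            seen_ids.add(subject_id)
    return problems

# Small in-memory CSV standing in for a real file
example_csv = io.StringIO(
    "ID,Nifti_File_Path,Covariate_1\n"
    "1,/path/to/file1.nii.gz,0.5\n"
    "2,relative/file2.nii.gz,0.7\n"
)
problems = check_rows(csv.DictReader(example_csv))
print(problems)
```

For a real file, replace the `io.StringIO` object with `open(path, newline="")`.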

Prep Output Directory

In [None]:
# Specify where you want to save your results to
out_dir = '/Users/cu135/Library/CloudStorage/OneDrive-Personal/OneDrive_Documents/Research/2023/roca/figures/survey'

Import Data

In [None]:
# Specify the path to your spreadsheet (CSV or Excel); `sheet` names the Excel sheet to read
input_csv_path = '/Users/cu135/Library/CloudStorage/OneDrive-Personal/OneDrive_Documents/Work/KiTH_Solutions/Research/Clinical Trial/study_metadata/all_performances.xlsx'
sheet = 'survey'

In [None]:
from calvin_utils.permutation_analysis_utils.statsmodels_palm import CalvinStatsmodelsPalm
# Instantiate the CalvinStatsmodelsPalm class
cal_palm = CalvinStatsmodelsPalm(input_csv_path=input_csv_path, output_dir=out_dir, sheet=sheet)
# Read the spreadsheet and display its contents
data_df = cal_palm.read_and_display_data()

# 01 - Preprocess Your Data

**Handle NaNs**
- Set drop_nans=True if you would like to remove NaNs from the data
- Provide a column name or a list of column names to remove NaNs from

In [None]:
data_df.columns

In [None]:
drop_list = ['Age', 'Z_Scored_Percent_Cognitive_Improvement_By_Origin_Group', 'Z_Scored_Subiculum_T_By_Origin_Group_']

In [None]:
data_df = cal_palm.drop_nans_from_columns(columns_to_drop_from=drop_list)
display(data_df)

**Drop Row Based on Value of Column**

Define the column, condition, and value for dropping rows
- column = 'your_column_name'
- condition = 'above'  # Options: 'equal', 'above', 'below', 'not'
- value = 'your_value'  # Rows that meet the condition against this value are dropped
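The idea of dropping rows by a column, condition, and value can be sketched in plain pandas as follows. This is a hypothetical stand-in, not the code behind `drop_rows_based_on_value`. The function name `split_rows` is invented for the example, and only the 'equal', 'above', and 'below' conditions are covered. The 'not' option listed in the parameter cell is left out because its semantics are not spelled out in this notebook.

```python
import operator
import pandas as pd

# Maps each condition name to the comparison that marks a row for dropping
CONDITIONS = {
    "equal": operator.eq,
    "above": operator.gt,
    "below": operator.lt,
}

def split_rows(df, column, condition, value):
    """Return (kept, dropped): rows that fail the condition, then rows that meet it."""
    if condition not in CONDITIONS:
        raise ValueError(f"Unknown condition: {condition!r}")
    mask = CONDITIONS[condition](df[column], value)
    return df[~mask], df[mask]

example = pd.DataFrame({
    "Disease_Status": ["MCI", "Normal", "AD", "MCI"],
    "Age": [70, 65, 80, 75],
})

# Drop every row whose Disease_Status equals 'MCI'
kept, dropped = split_rows(example, "Disease_Status", "equal", "MCI")
print(kept)
print(dropped)
```

Returning the dropped rows alongside the kept ones makes it possible to analyse the excluded group separately later.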

In [None]:
data_df.columns

Set the parameters for dropping rows

In [None]:
column = 'Disease_Status'  # The column you'd like to evaluate
condition = 'equal'  # The condition to check ('equal', 'above', 'below', 'not')
value = 'MCI' # The value to drop if found

In [None]:
data_df, other_df = cal_palm.drop_rows_based_on_value(column, condition, value)
display(data_df)

**Standardize Data**
- List any columns you do not want to standardize
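For reference, standardizing a column usually means converting it to z-scores: subtract the column mean and divide by its standard deviation. The sketch below is a plain-pandas illustration under stated assumptions, not the implementation of `standardize_columns`. The helper name `zscore_columns`, the use of the population standard deviation (`ddof=0`), and the choice to skip non-numeric columns are all assumptions.

```python
import pandas as pd

def zscore_columns(df, skip=None):
    """Z-score every numeric column except those listed in `skip` (hypothetical helper)."""
    skip = set(skip or [])
    out = df.copy()
    for col in out.select_dtypes(include="number").columns:
        if col in skip:
            continue
        # Assumption: population standard deviation (ddof=0)
        out[col] = (out[col] - out[col].mean()) / out[col].std(ddof=0)
    return out

example = pd.DataFrame({
    "Age": [60.0, 70.0, 80.0],
    "Score": [1.0, 2.0, 3.0],
    "Group": ["a", "b", "a"],
})

# 'Age' is excluded explicitly; 'Group' is skipped because it is not numeric
scaled = zscore_columns(example, skip=["Age"])
print(scaled)
```

Columns that are already z-scored, or that encode categories as numbers, are typical candidates for the exclusion list.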

In [None]:
# Remove anything you don't want to standardize
cols_not_to_standardize = None # ['Z_Scored_Percent_Cognitive_Improvement_By_Origin_Group', 'Z_Scored_Subiculum_T_By_Origin_Group_'] #['Age']

In [None]:
data_df = cal_palm.standardize_columns(cols_not_to_standardize)
data_df

In [None]:
# Optional: flip the sign of every column whose name contains neither 'CSF' nor 'eh'
# for col in data_df.columns:
#     if 'CSF' not in col and 'eh' not in col:
#         data_df[col] = data_df[col] * -1

# 02 - Compare Central Tendencies Across Multiple Groups within a Supergroup

In [None]:
data_df.columns


Select Columns

In [None]:
data_df = data_df.loc[:, ['Q1', 'Q2', 'Q3', 'Q4', 'Q5', 'Q6', 'Q7', 'Q8', 'Q9', 'Q10', 'Q11',
       'Q12', 'Q14', 'TOTAL11', 'TOTALMOD']]

Alter Dataframe columns

In [None]:
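import numpy as np
# Collapse Disease_Status into 'Normal' vs. 'Impaired'. Skipped if the column was
# removed by the column selection above.
if 'Disease_Status' in data_df.columns:
    data_df['Disease_Status'] = np.where(data_df['Disease_Status'] == 'Normal', 'Normal', 'Impaired')
data_df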

Melt the Dataframe

In [None]:
import pandas as pd

def melt_dataframe(df, var_name='group', value_name='value'):
    """
    Melts a wide-format DataFrame into a long format.

    Parameters:
    df (DataFrame): The wide-format DataFrame to be melted.
    var_name (str): The name to be given to the 'variable' column in the melted DataFrame.
    value_name (str): The name to be given to the 'value' column in the melted DataFrame.

    Returns:
    DataFrame: The melted long-format DataFrame.
    """
    melted_df = df.reset_index().melt(id_vars='index', var_name=var_name, value_name=value_name)
    melted_df = melted_df.drop(columns='index')  # Remove the 'index' column if not needed
    return melted_df
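To make the wide-to-long reshape concrete, here is a small self-contained sketch on made-up data. It calls pandas' `DataFrame.melt` directly rather than the helper above. The column names 'question' and 'score' follow the values used in this notebook, and the toy numbers are invented.

```python
import pandas as pd

# Wide format: one row per respondent, one column per question
wide = pd.DataFrame({"Q1": [1, 2], "Q2": [3, 4]})

# Long format: one row per (question, score) pair
long_df = wide.melt(var_name="question", value_name="score")
print(long_df)

# Every cell of the wide table becomes one row of the long table
print(len(long_df) == wide.shape[0] * wide.shape[1])
```

The long format is what seaborn-style plotting functions expect when a single axis has to carry several questions.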


In [None]:
value = 'score'
variable = 'question'

In [None]:
melted_df = melt_dataframe(data_df, var_name=variable, value_name=value)

Normalize Data if Desired

# Plot Violin Plots for Each Group Across Categories
plot_violin_strip_combined(data, factors, group_col, out_dir=None)

This function overlays strip plots on violin plots to visualize the distribution of one or more continuous variables across groups. The DataFrame is expected to be in wide format, with one column per variable listed in factors and one grouping column (group_col). If out_dir is provided, the figure is saved there as central_tendency_plot.svg.
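A numeric summary can complement the plot when comparing central tendencies. The sketch below is an optional companion under stated assumptions, not part of the plotting function. It reshapes a toy wide table to long format with a grouping column and then computes the median, mean, and count per factor and group. The 'Factor' and 'Value' names mirror the plotting code; the data are invented.

```python
import pandas as pd

# Toy wide-format data: one grouping column plus one column per factor
wide = pd.DataFrame({
    "Group": ["Control", "Control", "Patient", "Patient"],
    "Q1": [1.0, 3.0, 5.0, 7.0],
    "Q2": [2.0, 2.0, 4.0, 8.0],
})

# Reshape to long format, keeping the grouping column as an identifier
long_df = pd.melt(
    wide, id_vars=["Group"], value_vars=["Q1", "Q2"],
    var_name="Factor", value_name="Value",
)

# Central tendency and sample size for every factor/group combination
summary = long_df.groupby(["Factor", "Group"])["Value"].agg(["median", "mean", "count"])
print(summary)
```

Reading the table next to the violins helps separate a real shift in the median from a difference in spread or sample size.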

In [None]:
import os 
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

def plot_violin_strip_combined(data, factors, group_col, out_dir=None):
    """
    Plot violin and strip plots for multiple factors in the same plot, distinguishing between groups.
    
    Parameters:
        data (DataFrame): Input DataFrame containing the data.
        factors (list of str): List of columns in the DataFrame to plot.
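        group_col (str): Name of the column for grouping (e.g., 'Alzheimer' vs. 'Control').
        out_dir (str, optional): Directory to save 'central_tendency_plot.svg' in. If None, the figure is not saved.
    """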
    # Melt the DataFrame from wide to long format
    long_df = pd.melt(data, id_vars=[group_col], value_vars=factors, var_name='Factor', value_name='Value')
    
    # Initialize the matplotlib figure
    if len(factors) < 3:
        length = 6
    else:
        length = 1.3*(len(factors))
    plt.figure(figsize=(length, 6))
    sns.set_palette('tab10')  # Default palette, so the strip plot colors match the violins
    
    # Plot violin plot
    sns.violinplot(x='Factor', y='Value', hue=group_col, data=long_df, dodge=True, palette='tab10', cut=.0010)
    
    # Add strip plot on top of the violin plot to show individual data points
    sns.stripplot(x='Factor', y='Value', hue=group_col, data=long_df, dodge=True, size=5, marker='o', edgecolor='k', linewidth=1, facecolors='none', legend=False)
    
    sns.despine()
    
    # Improve the legend
    plt.legend(title=group_col, loc=2)
    
    # Set titles and labels
    plt.title('Distribution of Factors by Group')
    plt.xlabel('Factors')
    plt.ylabel('Values')

    plt.tight_layout()
    if out_dir is not None:
        out_path = os.path.join(out_dir, 'central_tendency_plot.svg')
        plt.savefig(out_path)
        print("Saved to: ", out_path)
    plt.show()
    


In [None]:
data_df.columns

In [None]:
factors = ['What_is_your_age_']
group_col = 'What_is_your_sex_'

Plot

In [None]:
plot_violin_strip_combined(data_df, factors, group_col, out_dir=out_dir)