### Main research question
What are the relationships between specific Alzheimer's disease biomarkers and cognitive decline in the dataset, and are the correlations positive or negative?

For statistical purposes, the null hypothesis is that there is no correlation, positive or negative, between the chosen biomarkers and the cognitive decline indicators.
### Sub research questions with themes 
##### 1. Data Security  
   - a)  Does the encryption and decryption process using a SHA-256-derived Fernet key ensure secure and reliable access to sensitive Alzheimer’s disease datasets, while maintaining data integrity for subsequent statistical analysis?

##### 2. Dataset Stratification on Core Datasets  
   *(sea-ad_cohort_mtg-tissue_extractions-luminex_data and sea-ad_cohort_donor_metadata_072524)*  
   - a) Are the four Alzheimer’s disease biomarkers sufficiently distinct to enable robust statistical analysis, as evidenced by principal component analysis (PCA) retaining a significant proportion of total variance (e.g., ≥95%) and facilitating k-means clustering to identify separable patient subgroups with potential cognitive decline trajectories?  
   - b) Do biomarker distributions significantly differ between demented and non-demented groups when analyzed using the Mann-Whitney U test, providing a non-parametric validation of the PCA and clustering results?  

##### 3. Correlational Analysis  
   - a) Do individual Alzheimer’s disease biomarkers (ABeta40, ABeta42, tTAU, pTAU) extracted using RIPA and GuHCl buffer methods exhibit statistically significant correlations with cognitive decline?  
   - b) Do these biomarkers also show significant correlations with MMSE test scores?  
   - c) Do volumetric measurements of key brain regions in Alzheimer's (left/right hippocampus, left/right entorhinal cortex) significantly correlate with dementia status, as assessed by point-biserial correlation?  
   - d) Is there a statistically significant association between years of education and the age of dementia diagnosis, as assessed by Spearman correlation in both the base dataset and the combined dataset (base + Kaggle data, for increased statistical power), indicating potential cognitive reserve effects?  
   - e) Does age significantly correlate with dementia diagnosis status in the Kaggle dataset, as determined by point-biserial correlation, providing statistical evidence for age as a risk factor in Alzheimer’s disease?  

#### Extra details for improved notebook clarity
In this notebook, the sign of a correlation (positive or negative) is meaningful only in the context of the variables analyzed. For example, a Pearson-type (point-biserial) correlation between ABeta42 (pg/ug.1) and dementia status yields r = 0.3693 with p = 0.0005, leading to rejection of the null hypothesis (p < 0.001, strong evidence). Conversely, a Kendall's tau correlation between ABeta42 (pg/ug.1) and MMSE produces tau = -0.2475 with p = 0.0017, also rejecting the null hypothesis (p < 0.01, strong evidence). The opposing signs are consistent once the variables' scales are considered: MMSE scores range from 30 (no impairment) down to 0 (severe cognitive impairment), so a biomarker that rises with dementia falls with MMSE.
Additionally, different measures of cognitive decline, such as MMSE score and dementia status, capture the same symptom (dementia, the core symptom of Alzheimer's disease) in different ways.
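
To make the sign convention concrete, here is a minimal sketch on made-up numbers (hypothetical values, not taken from the SEA-AD data). A biomarker that is higher in demented donors correlates positively with a 0/1 dementia flag and negatively with MMSE, because a lower MMSE score means more impairment.

```python
from scipy import stats

# Hypothetical toy values, for illustration only (not from the dataset)
biomarker = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
has_dementia = [0, 0, 0, 1, 1, 1]   # 1 = dementia
mmse = [30, 29, 27, 20, 15, 10]     # lower score = more impairment

# Binary outcome: point-biserial correlation
r_status, _ = stats.pointbiserialr(has_dementia, biomarker)
# Ordinal outcome: Kendall's tau
tau_mmse, _ = stats.kendalltau(mmse, biomarker)

print(f"biomarker vs dementia status: r = {r_status:.3f}")
print(f"biomarker vs MMSE:            tau = {tau_mmse:.3f}")
```

Both coefficients describe the same underlying relationship; only the direction of the outcome scale differs.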

In [14]:
import pandas as pd
import matplotlib.pyplot as plt
from scipy import stats

## Step 1 Data Security 

In [15]:
# Reading the encrypted data
import hashlib
import base64
from io import BytesIO
from cryptography.fernet import Fernet

# In a real application this key would be stored in a .env file and NOT included in the notebook.
# Because this is only an example of how to use encryption for data security, we left the key in the notebook for convenience.
key = "poqwjoifj2398wesd"

def decrypt_file(file_path, key):
    fernet_key = base64.urlsafe_b64encode(hashlib.sha256(key.encode()).digest())
    cipher = Fernet(fernet_key)

    with open(file_path, "rb") as encrypted_file:
        encrypted_data = encrypted_file.read()

    decrypted_data = cipher.decrypt(encrypted_data)
    return decrypted_data

alzheimers_prediction_dataset = pd.read_csv(BytesIO(decrypt_file("./data/alzheimers_prediction_dataset.csv", key)))
sea_ad_cohort_donor_metadata = pd.read_excel(BytesIO(decrypt_file("./data/sea-ad_cohort_donor_metadata_072524 (1).xlsx", key)))
sea_ad_cohort_mri_volumetrics = pd.read_excel(BytesIO(decrypt_file("./data/sea-ad_cohort_mri_volumetrics (2).xlsx", key)))
sea_ad_cohort_mtg_tissue_extractions_luminex_data = pd.read_excel(BytesIO(decrypt_file("./data/sea-ad_cohort_mtg-tissue_extractions-luminex_data (1).xlsx", key)), header=1)
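
The key derivation used in `decrypt_file` can be looked at in isolation. The sketch below uses only the standard library and shows why the SHA-256 step fits Fernet: the digest is exactly 32 bytes, which is the key length Fernet expects once it is url-safe base64-encoded. The environment variable name `DATA_PASSPHRASE` and the fallback value are assumptions made for this example, not part of the notebook.

```python
import base64
import hashlib
import os


def derive_fernet_key(passphrase: str) -> bytes:
    """Hash the passphrase with SHA-256 and base64-encode the 32-byte digest."""
    digest = hashlib.sha256(passphrase.encode()).digest()
    return base64.urlsafe_b64encode(digest)


# DATA_PASSPHRASE is a hypothetical variable name; the fallback is a dummy value
passphrase = os.environ.get("DATA_PASSPHRASE", "example-passphrase")
fernet_key = derive_fernet_key(passphrase)

# Decoding the key gives back the raw 32-byte digest
print(len(base64.urlsafe_b64decode(fernet_key)))
```

Two caveats apply. A single SHA-256 pass is fast to brute-force for a weak passphrase, and the `cryptography` documentation suggests a dedicated key derivation function such as PBKDF2 or Scrypt for password-based keys. On the integrity side, Fernet tokens are authenticated, so `cipher.decrypt` raises `InvalidToken` if the file was modified or the key is wrong.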

## Step 2 Data evaluation
Evaluating whether the data is diverse enough to be suitable for statistical analysis.

### 1. **Understanding data analysis feasibility via principal component analysis and k-means clustering**
Dimensionality reduction is needed because the biomarker count (four biomarkers across two extraction methods) gives a dimensionality of 8.
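
The research question above refers to the share of variance that PCA retains. A minimal sketch of that check follows, run on synthetic data standing in for four biomarker columns (the random values and the names `X_demo`, `pca_full`, and `n_needed` are assumptions for this example).

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for four biomarker columns (not SEA-AD data)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 4))
X_demo[:, 1] += 0.8 * X_demo[:, 0]  # introduce some correlation between two columns

X_scaled = StandardScaler().fit_transform(X_demo)
pca_full = PCA().fit(X_scaled)  # keep all components
cumulative = np.cumsum(pca_full.explained_variance_ratio_)

# Smallest number of components whose cumulative variance reaches 95%
n_needed = int(np.searchsorted(cumulative, 0.95) + 1)
print(cumulative.round(3), n_needed)
```

For a model fitted with `PCA(n_components=2)`, as in the cells below, the retained share is `explained_variance_ratio_.sum()` on the fitted object.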

In [None]:
# CODE 1: PCA and k-means clustering using Cognitive Status solely as a visualization label
import seaborn as sns
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

metadata_dementia = sea_ad_cohort_donor_metadata.copy()
markers = sea_ad_cohort_mtg_tissue_extractions_luminex_data.copy()
markers_metadata = markers.merge(metadata_dementia, on="Donor ID", how="inner")


# Define biomarker column sets
biomarker_cols_extr1 = ["ABeta40 pg/ug", "ABeta42 pg/ug", "tTAU pg/ug", "pTAU pg/ug"]
biomarker_cols_extr2 = ["ABeta40 pg/ug.1", "ABeta42 pg/ug.1", "tTAU pg/ug.1", "pTAU pg/ug.1"]

# Extraction 1)
data_clean_extr1 = markers_metadata.dropna(subset=biomarker_cols_extr1)
quantiles_extr1 = {col: data_clean_extr1[col].quantile(0.90) for col in biomarker_cols_extr1}
data_clean_extr1 = data_clean_extr1[(data_clean_extr1[biomarker_cols_extr1] < pd.Series(quantiles_extr1)).all(axis=1)]
scaler_extr1 = StandardScaler()
X_extr1 = scaler_extr1.fit_transform(data_clean_extr1[biomarker_cols_extr1])

# PCA computed only on biomarker features, target (Cognitive Status) is used only in visualization.
pca_extr1 = PCA(n_components=2)
X_pca_extr1 = pca_extr1.fit_transform(X_extr1)
cog_status_extr1 = data_clean_extr1["Cognitive Status"].apply(lambda x: 1 if x == "Dementia" else 0)
kmeans_extr1 = KMeans(n_clusters=2, random_state=42)
clusters_extr1 = kmeans_extr1.fit_predict(X_extr1)

# Extraction 2) 
data_clean_extr2 = markers_metadata.dropna(subset=biomarker_cols_extr2)
quantiles_extr2 = {col: data_clean_extr2[col].quantile(0.95) for col in biomarker_cols_extr2}
data_clean_extr2 = data_clean_extr2[(data_clean_extr2[biomarker_cols_extr2] < pd.Series(quantiles_extr2)).all(axis=1)]
scaler_extr2 = StandardScaler()
X_extr2 = scaler_extr2.fit_transform(data_clean_extr2[biomarker_cols_extr2])
pca_extr2 = PCA(n_components=2)
X_pca_extr2 = pca_extr2.fit_transform(X_extr2)
cog_status_extr2 = data_clean_extr2["Cognitive Status"].apply(lambda x: 1 if x == "Dementia" else 0)
kmeans_extr2 = KMeans(n_clusters=2, random_state=42)
clusters_extr2 = kmeans_extr2.fit_predict(X_extr2)

# Plotting 3) 
fig, axs = plt.subplots(2, 2, figsize=(12, 10))

# Extraction 1: PCA colored by Cognitive Status (visual) 
sns.scatterplot(x=X_pca_extr1[:, 0], y=X_pca_extr1[:, 1],
                hue=cog_status_extr1, palette="viridis", style=cog_status_extr1, s=80, ax=axs[0, 0])
axs[0, 0].set_title("Extraction 1: PCA by Cognitive Status")
axs[0, 0].set_xlabel("PC1")
axs[0, 0].set_ylabel("PC2")

# Extraction 1: PCA colored by k-means clusters
sns.scatterplot(x=X_pca_extr1[:, 0], y=X_pca_extr1[:, 1],
                hue=clusters_extr1, palette="Set1", style=clusters_extr1, s=80, ax=axs[0, 1])
axs[0, 1].set_title("Extraction 1: PCA with KMeans Clusters")
axs[0, 1].set_xlabel("PC1")
axs[0, 1].set_ylabel("PC2")

# Extraction 2: PCA colored by Cognitive Status
sns.scatterplot(x=X_pca_extr2[:, 0], y=X_pca_extr2[:, 1],
                hue=cog_status_extr2, palette="viridis", style=cog_status_extr2, s=80, ax=axs[1, 0])
axs[1, 0].set_title("Extraction 2: PCA by Cognitive Status")
axs[1, 0].set_xlabel("PC1")
axs[1, 0].set_ylabel("PC2")

# Extraction 2: PCA colored by k-means clusters
sns.scatterplot(x=X_pca_extr2[:, 0], y=X_pca_extr2[:, 1],
                hue=clusters_extr2, palette="Set1", style=clusters_extr2, s=80, ax=axs[1, 1])
axs[1, 1].set_title("Extraction 2: PCA with KMeans Clusters")
axs[1, 1].set_xlabel("PC1")
axs[1, 1].set_ylabel("PC2")

plt.tight_layout()
plt.show()


In [None]:
# CODE 2: PCA and k-means clustering using MMSE Score only for visuals 
import seaborn as sns
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler



# Extraction 1)
# Drop rows missing any biomarker or MMSE value 
data_clean_extr1 = markers_metadata.dropna(subset=biomarker_cols_extr1 + ["Last MMSE Score"])
quantiles_extr1 = {col: data_clean_extr1[col].quantile(0.90) for col in biomarker_cols_extr1}
data_clean_extr1 = data_clean_extr1[(data_clean_extr1[biomarker_cols_extr1] < pd.Series(quantiles_extr1)).all(axis=1)]
scaler_extr1 = StandardScaler()
X_extr1 = scaler_extr1.fit_transform(data_clean_extr1[biomarker_cols_extr1])
# PCA computed only on biomarkers 
pca_extr1 = PCA(n_components=2)
X_pca_extr1 = pca_extr1.fit_transform(X_extr1)
mmse_extr1 = data_clean_extr1["Last MMSE Score"].astype(float)
kmeans_extr1 = KMeans(n_clusters=2, random_state=42)
clusters_extr1 = kmeans_extr1.fit_predict(X_extr1)

# Extraction 2)
data_clean_extr2 = markers_metadata.dropna(subset=biomarker_cols_extr2 + ["Last MMSE Score"])
quantiles_extr2 = {col: data_clean_extr2[col].quantile(0.90) for col in biomarker_cols_extr2}
data_clean_extr2 = data_clean_extr2[(data_clean_extr2[biomarker_cols_extr2] < pd.Series(quantiles_extr2)).all(axis=1)]
scaler_extr2 = StandardScaler()
X_extr2 = scaler_extr2.fit_transform(data_clean_extr2[biomarker_cols_extr2])
pca_extr2 = PCA(n_components=2)
X_pca_extr2 = pca_extr2.fit_transform(X_extr2)
mmse_extr2 = data_clean_extr2["Last MMSE Score"].astype(float)
kmeans_extr2 = KMeans(n_clusters=2, random_state=42)
clusters_extr2 = kmeans_extr2.fit_predict(X_extr2)

# Plotting 3) 
fig, axs = plt.subplots(2, 2, figsize=(12, 10))

# Extraction 1: PCA colored by MMSE Score
sc1 = axs[0, 0].scatter(X_pca_extr1[:, 0], X_pca_extr1[:, 1],
                        c=mmse_extr1, cmap="viridis", s=80)
axs[0, 0].set_title("Extraction 1: PCA by MMSE Score")
axs[0, 0].set_xlabel("PC1")
axs[0, 0].set_ylabel("PC2")
fig.colorbar(sc1, ax=axs[0, 0], label="MMSE Score")

# Extraction 1: PCA colored by k-means clusters
sns.scatterplot(x=X_pca_extr1[:, 0], y=X_pca_extr1[:, 1],
                hue=clusters_extr1, palette="Set1", s=80, ax=axs[0, 1])
axs[0, 1].set_title("Extraction 1: PCA with KMeans Clusters")
axs[0, 1].set_xlabel("PC1")
axs[0, 1].set_ylabel("PC2")

# Extraction 2: PCA colored by MMSE Score
sc2 = axs[1, 0].scatter(X_pca_extr2[:, 0], X_pca_extr2[:, 1],
                        c=mmse_extr2, cmap="viridis", s=80)
axs[1, 0].set_title("Extraction 2: PCA by MMSE Score")
axs[1, 0].set_xlabel("PC1")
axs[1, 0].set_ylabel("PC2")
fig.colorbar(sc2, ax=axs[1, 0], label="MMSE Score")

# Extraction 2: PCA colored by k-means clusters
sns.scatterplot(x=X_pca_extr2[:, 0], y=X_pca_extr2[:, 1],
                hue=clusters_extr2, palette="Set1", s=80, ax=axs[1, 1])
axs[1, 1].set_title("Extraction 2: PCA with KMeans Clusters")
axs[1, 1].set_xlabel("PC1")
axs[1, 1].set_ylabel("PC2")

plt.tight_layout()
plt.show()


### 2. **Understanding whether a significant difference between demented and non-demented people exists in the data**
A second check of data feasibility, in case method 1 fails.
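
As a small illustration of how the Mann-Whitney U test behaves, the sketch below uses two toy samples (hypothetical values, not from the dataset). The groups are completely separated, so U takes its extreme value, yet with only three observations per group the two-sided p-value stays above 0.05. Small groups limit what the test can detect.

```python
from scipy import stats

# Toy samples, for illustration only
group_a = [1, 2, 3]
group_b = [4, 5, 6]

# Two-sided test: are the two distributions shifted relative to each other?
u_stat, p_value = stats.mannwhitneyu(group_a, group_b, alternative="two-sided")
print(f"U = {u_stat}, p = {p_value:.3f}")
```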

In [None]:
has_dementia = markers_metadata["Cognitive Status"] == "Dementia"
group_with_dementia = markers_metadata[has_dementia]
group_without_dementia = markers_metadata[~has_dementia]

biomarker_cols_extr1 = ["ABeta40 pg/ug", "ABeta42 pg/ug", "tTAU pg/ug", "pTAU pg/ug"]
biomarker_cols_extr2 = ["ABeta40 pg/ug.1", "ABeta42 pg/ug.1", "tTAU pg/ug.1", "pTAU pg/ug.1"]

# Combine both biomarker lists
biomarkers = biomarker_cols_extr1 + biomarker_cols_extr2

# Significance threshold for P value (alpha)
alpha = 0.05

# Compute Mann–Whitney U test for each biomarker 
results = []
for biomarker in biomarkers:
    # Missing value handling 
    data_with = group_with_dementia[biomarker].dropna()
    data_without = group_without_dementia[biomarker].dropna()
    stat, p_value = stats.mannwhitneyu(data_with, data_without, alternative="two-sided")
    results.append((biomarker, stat, p_value))

# Sort biomarkers by ascending p-value (lowest p indicates strongest evidence of group difference)
results_sorted = sorted(results, key=lambda x: x[2])

# Results in a nice format
print("Biomarker\tU Statistic\tP-value\t\tDecision")
for biomarker, stat, p_value in results_sorted:
    decision = ("Reject null hypothesis: statistically significant difference"
                if p_value < alpha
                else "Fail to reject null hypothesis: insufficient evidence for a difference")
    print(f"{biomarker:20s}\t{stat:10.3f}\t{p_value:10.4f}\t-> {decision}")


## Step 3: Correlation Analysis

In [None]:
### Data preparation and viewing 
### Merging of the datasets with correct patient ID filtering 
metadata_dementia = sea_ad_cohort_donor_metadata.copy()
markers = sea_ad_cohort_mtg_tissue_extractions_luminex_data.copy()
markers_metadata = markers.merge(metadata_dementia, on="Donor ID", how="inner")
markers_metadata.head()

### 1. **Correlating Single Biomarker Variables with Cognitive Decline**  
   - Example: ABeta40 and cognitive decline, ABeta42 and cognitive decline, etc.  
   - **Variable Groups**:  
     - **RIPA Buffer Extraction**: ABeta40, ABeta42, tTAU, pTAU  
     - **GuHCl (Guanidine Hydrochloride) Buffer Tissue Extraction**: ABeta40, ABeta42, tTAU, pTAU  
   - **Total Correlations**: 8 (each variable correlated with cognitive decline)



**Type of correlation**: Point-biserial correlation coefficient with the associated p-value (for hypothesis validation), because one variable is binary (dementia status) and the other is continuous (biomarker concentration).

The function we use is two-tailed, meaning it can detect both negative and positive correlations through the sign of the r coefficient. Our main null hypothesis is therefore two-sided as well.
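
The point-biserial coefficient is mathematically the same as Pearson's r with the binary variable coded as 0/1. The sketch below checks this on toy data (hypothetical values) and shows one way to drop missing biomarker values pairwise before correlating. The pairwise mask is an assumption about how missing data could be handled, not necessarily the notebook's approach.

```python
import numpy as np
from scipy import stats

# Toy data, for illustration only
status = np.array([0, 0, 0, 1, 1, 1, 0, 1])
values = np.array([1.2, 0.8, 1.5, 2.9, 3.1, 2.2, 1.9, np.nan])

# Drop pairs where the continuous value is missing before correlating
mask = ~np.isnan(values)
r_pb, p_pb = stats.pointbiserialr(status[mask], values[mask])
r_pearson, p_pearson = stats.pearsonr(status[mask], values[mask])

print(f"point-biserial r = {r_pb:.4f}, Pearson r = {r_pearson:.4f}")
```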

In [None]:
"""
    1) Correlating single variables with single outcomes, e.g. ABeta40 with cognitive decline, ABeta42 with cognitive decline, and so on.
    (full list 1: ABeta40, ABeta42, tTAU, pTAU in RIPA buffer extraction, individually with cognitive decline)
    (full list 2: ABeta40, ABeta42, tTAU, pTAU in GuHCl (Guanidine Hydrochloride) buffer tissue extractions, individually with cognitive decline)
    (total of 8 correlations)
    Type of correlation: point-biserial correlation coefficient and the associated p-value (for hypothesis validation), because a binary and a continuous variable are involved.
    The test is two-tailed, so it detects both negative and positive correlations through the sign of r; the null hypothesis is two-sided accordingly.
"""
# Define the binary dementia indicator and prepare the correlation tests
has_dementia = markers_metadata["Cognitive Status"] == "Dementia"
stat_tests = []

# Correlate each biomarker with dementia status.
# The ".1" suffix marks the second extraction method, i.e. GuHCl (Guanidine Hydrochloride) buffer tissue extraction.
biomarker_names = [
    "ABeta40 pg/ug", "ABeta40 pg/ug.1",
    "ABeta42 pg/ug", "ABeta42 pg/ug.1",
    "tTAU pg/ug", "tTAU pg/ug.1",
    "pTAU pg/ug", "pTAU pg/ug.1",
]
for name in biomarker_names:
    # Skip donors with a missing value for this biomarker so NaNs cannot propagate into r and p
    valid = markers_metadata[name].notna()
    stat_tests.append((name, stats.pointbiserialr(has_dementia[valid], markers_metadata.loc[valid, name])))


# Sort tests by p-value in ascending order
stat_tests.sort(key=lambda x: x[1].pvalue)
alpha = 0.05  # significance level for p value

# Printing correlation with added result interpretation
for marker_name, stat_result in stat_tests:
    decision = ("Reject null hypothesis: evidence of a statistically significant correlation"
                if stat_result.pvalue < alpha 
                else "Fail to reject null hypothesis: insufficient evidence to infer a correlation")
    print(f"{marker_name}: r = {stat_result.statistic:.4f}, p = {stat_result.pvalue:.4f} -> {decision}")

### 2. **Correlating Single Biomarker Variables with MMSE Test Scores**  
   - Example: ABeta40 and MMSE test scores, ABeta42 and MMSE test scores, etc.  
   - **Variable Groups**:  
     - **RIPA Buffer Extraction**: ABeta40, ABeta42, tTAU, pTAU (each correlated with MMSE test scores)  
     - **GuHCl (Guanidine Hydrochloride) Buffer Tissue Extraction**: ABeta40, ABeta42, tTAU, pTAU (each correlated with MMSE test scores)  
   - **Total Correlations**: 8 (each variable correlated with MMSE test scores)  
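
MMSE is an integer score, so tied values are common. SciPy's `kendalltau` computes the tau-b variant by default, which adjusts for ties, and `nan_policy="omit"` drops pairs with a missing value. A minimal sketch on toy data (hypothetical values, not from the dataset):

```python
from scipy import stats

# Toy data, for illustration only; MMSE contains tied scores and one missing value
mmse_demo = [30, 30, 28, 25, 25, 18, float("nan")]
marker_demo = [0.5, 0.7, 0.9, 1.4, 1.1, 2.0, 1.0]

# The pair containing NaN is dropped before tau-b is computed
tau, p_value = stats.kendalltau(mmse_demo, marker_demo, nan_policy="omit")
print(f"tau = {tau:.4f}, p = {p_value:.4f}")
```

Here the marker rises as MMSE falls, so tau is negative. The tied MMSE scores keep it from reaching exactly -1.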
  

In [None]:
""""
    2) Correlating single variables with single outcomes, e.g. ABeta40 with MMSE test scores, ABeta42 with MMSE test scores, and so on.
    (full list 1: ABeta40, ABeta42, tTAU, pTAU in RIPA buffer extraction, individually with MMSE test scores)
    (full list 2: ABeta40, ABeta42, tTAU, pTAU in GuHCl (Guanidine Hydrochloride) buffer tissue extractions, individually with MMSE test scores)
    (total of 8 correlations)
    Type of correlation: Kendall's tau, reporting the tau statistic and the associated p-value (for hypothesis validation).
    We chose this type of correlation because MMSE is an ordinal score (0 to 30) that can contain ties, while the biomarkers are continuous.
    The test is two-tailed, so it detects both negative and positive correlations through the sign of tau; the null hypothesis is two-sided accordingly.
"""

# Define MMSE scores and prepare correlation tests
MMSE = markers_metadata["Last MMSE Score"]
stat_tests_2 = []

# Apply Kendall's tau correlation for each biomarker versus MMSE (omitting NaNs)
biomarker_names = [
    "ABeta40 pg/ug", "ABeta40 pg/ug.1",
    "ABeta42 pg/ug", "ABeta42 pg/ug.1",
    "tTAU pg/ug", "tTAU pg/ug.1",
    "pTAU pg/ug", "pTAU pg/ug.1",
]
for name in biomarker_names:
    stat_tests_2.append((name, stats.kendalltau(MMSE, markers_metadata[name], nan_policy="omit")))

# Sort tests by p-value in ascending order
stat_tests_2.sort(key=lambda x: x[1].pvalue)
alpha = 0.05  # significance level for p value 

# Print the correlation with added result interpretation
for marker_name, stat_result in stat_tests_2:
    decision = ("Reject null hypothesis: statistically significant correlation"
                if stat_result.pvalue < alpha
                else "Fail to reject null hypothesis: insufficient evidence to infer a correlation")
    print(f"{marker_name}: tau = {stat_result.correlation:.4f}, p = {stat_result.pvalue:.4f} -> {decision}")


### 3. **Correlating Brain Region Volumes with Dementia Diagnosis (yes/no)**  
   - **Regions Analyzed:**  
     - Left Hippocampus Volume  
     - Right Hippocampus Volume  
     - Left Entorhinal Cortex Volume  
     - Right Entorhinal Cortex Volume  
   - **Methodology:**  
     - Extract volumetric data for each brain region.  
     - Compute **point-biserial correlation** between each brain region’s volume and dementia diagnosis.  
   - **Significance Threshold:**  
     - p < 0.05 is considered **statistically significant** evidence of association.  
   - **Purpose:**  
     - Identify whether atrophy in specific brain regions is significantly associated with dementia status.  

In [None]:
"""
    3) Correlating regional volumetric measures with dementia status (binary outcome) e.g. Left Hippocampus Volume vs Dementia presence, Right Entorhinal Cortex Volume vs Dementia presence, etc.
    (Full list: Left Hippocampus Volume, Right Hippocampus Volume, Left Entorhinal Cortex Volume, Right Entorhinal Cortex Volume - 4 correlations total)
    Type of correlation: Point-biserial correlation with r coefficient and associated p-value (for hypothesis validation) due to continuous (volume) and binary (dementia status) variables involved.
"""

regions = [
    "Left Hippocampus Volume", 
    "Right Hippocampus Volume", 
    "Left Entorhinal Cortex Volume", 
    "Right Entorhinal Cortex Volume"
]

# Work on copies of the metadata and volumetric data that were decrypted in Step 1
metadata_dementia = sea_ad_cohort_donor_metadata.copy()
volumetric = sea_ad_cohort_mri_volumetrics.copy()

for region in regions:

    # Extract region-specific data and remove rows with missing values
    region_vol = volumetric[["Donor ID", region]].dropna()
    merged_df = region_vol.merge(metadata_dementia, on="Donor ID", how="inner")
    
    # Create boolean indicator: True for "Dementia", False otherwise
    has_dementia = merged_df["Cognitive Status"] == "Dementia"
    
    # Calculate the point-biserial correlation; p < 0.05 is treated as statistically significant
    correlation = stats.pointbiserialr(has_dementia, merged_df[region])
    significance = "significant" if correlation.pvalue < 0.05 else "not significant"
    
    # Display results with detailed statistical values
    print(f"Region: {region}")
    print(f"Data shape (rows, columns): {merged_df.shape}")
    print(f"Point-biserial correlation: r = {correlation.correlation:.3f}, p = {correlation.pvalue:.3f} ({significance})")
    print("-" * 40)

### 4. **Correlating Years of Education with Age of Dementia Diagnosis**  
   - **Datasets Used:**  
     - Main dataset (sea-ad_cohort_donor_metadata)  
     - Additional dataset (Kaggle - alzheimers_prediction_dataset)  
   - **Methodology:**  
     - Extract **years of education** and **age at dementia diagnosis** from both datasets.  
     - Compute **Spearman correlation** to assess the relationship between education level and dementia onset age.  
     - Analyze correlations separately for Main dataset and combined datasets.  
   - **Significance Threshold:**  
     - p < 0.05 is considered **statistically significant** evidence of association.  
   - **Purpose:**  
     - Determine whether higher education levels are associated with delayed onset of dementia, providing insights into cognitive reserve effects.  
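
Spearman's rho is Pearson's r computed on the ranks of the data, which is why it suits ordinal variables and non-normal distributions. The sketch below checks this equivalence on toy data (hypothetical values, not from either dataset).

```python
import numpy as np
from scipy import stats

# Toy data, for illustration only
education_years = np.array([8, 12, 12, 16, 18, 20])
diagnosis_age = np.array([70, 72, 75, 74, 80, 83])

rho, p_value = stats.spearmanr(diagnosis_age, education_years)

# Same coefficient obtained by ranking first (ties receive average ranks)
r_ranks, _ = stats.pearsonr(stats.rankdata(diagnosis_age),
                            stats.rankdata(education_years))
print(f"rho = {rho:.4f}, Pearson on ranks = {r_ranks:.4f}")
```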

In [None]:
"""
    4) Correlating ordinal education metrics with continuous dementia onset age variables: "Years of education" vs "Age of Dementia diagnosis" in both base cohort data and combined cohort+external Alzheimer’s patient data (2 total correlations).
    Type of correlation: Spearman’s rank-order correlation with rho coefficient and associated p-value (for non-parametric hypothesis validation).
    We chose this correlation type due to the ordinal nature of education years and non-normal distribution assumptions.
    The analysis compares results from:
    - Original data
    - Augmented dataset combining original data with filtered external Alzheimer’s cases (Alzheimer’s Diagnosis == "Yes").
"""

metadata_dementia = sea_ad_cohort_donor_metadata.copy()
volumetric = sea_ad_cohort_mri_volumetrics.copy()

base_dataset_metadata = metadata_dementia[["Years of education", "Age of Dementia diagnosis"]].dropna()

spearman_base = stats.spearmanr(
    base_dataset_metadata["Age of Dementia diagnosis"],
    base_dataset_metadata["Years of education"]
)
significance_base = "significant" if spearman_base.pvalue < 0.05 else "not significant"

print("Base Data:")
print(f"Spearman correlation: rho = {spearman_base.correlation:.3f}, "
      f"p = {spearman_base.pvalue:.3f} ({significance_base})")
print("-" * 40)

# Load and preprocess additional data
kaggle_data = alzheimers_prediction_dataset.copy()
kaggle_data = kaggle_data[kaggle_data["Alzheimer’s Diagnosis"] == "Yes"]

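# Rename columns to match the base dataset.
# Note: the Kaggle "Age" column is treated here as a stand-in for age at dementia diagnosis.
kaggle_dataset = kaggle_data[["Education Level", "Age"]].rename(columns={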
    "Education Level": "Years of education",
    "Age": "Age of Dementia diagnosis"
})

# Combine datasets
combined_data = pd.concat([base_dataset_metadata, kaggle_dataset])

# Compute Spearman correlation for combined data
spearman_combined = stats.spearmanr(
    combined_data["Age of Dementia diagnosis"],
    combined_data["Years of education"]
)
significance_combined = "significant" if spearman_combined.pvalue < 0.05 else "not significant"

print("Combined Data:")
print(f"Spearman correlation: rho = {spearman_combined.correlation:.3f}, "
      f"p = {spearman_combined.pvalue:.3f} ({significance_combined})")
print("-" * 40)


### 5. **Correlating Age with Dementia Diagnosis (yes/no)**  
   - **Dataset Used:**  
     - Alzheimer's Prediction Dataset (Kaggle - alzheimers_prediction_dataset)  
   - **Methodology:**  
     - Extract **age** and **dementia diagnosis status**.  
     - Convert dementia diagnosis into a boolean indicator (Yes = True, No = False).  
     - Compute **point-biserial correlation** between age and dementia diagnosis status.  
   - **Significance Threshold:**  
     - p < 0.05 is considered **statistically significant** evidence of association.  
   - **Purpose:**  
     - Assess whether increasing age is significantly associated with dementia diagnosis, reinforcing age as a key risk factor for Alzheimer’s disease.  

In [None]:
"""
  5) Correlating continuous age values with binary dementia diagnosis status (presence/absence of Alzheimer’s diagnosis).
  Type of correlation: Point-biserial correlation with r coefficient and associated p-value (for hypothesis validation) due to continuous (age) and binary (diagnosis status) variable pairing.
"""
# Load additional data
kaggle_data = alzheimers_prediction_dataset.copy()

# Create boolean indicator for dementia diagnosis
has_dementia = kaggle_data["Alzheimer’s Diagnosis"] == "Yes"

# Compute point-biserial correlation between age and dementia diagnosis
pb_corr = stats.pointbiserialr(has_dementia, kaggle_data["Age"])
significance = "significant" if pb_corr.pvalue < 0.05 else "not significant"

# Output detailed results
print("Age vs. Dementia Diagnosis:")
print(f"Data shape: {kaggle_data.shape}")
print(f"Point-biserial correlation: r = {pb_corr.correlation:.3f}, p = {pb_corr.pvalue:.3f} ({significance})")
