### Research questions 
1) What are the relationships between specific Alzheimer's disease biomarkers (e.g., β-amyloid, tau proteins) and cognitive decline in the dataset, including the direction of any correlation (positive or negative) and potential causal inferences?
    For statistical purposes, the null hypothesis is: there is no correlation, positive or negative, between the chosen biomarkers and the cognitive decline 
    indicators. The analysis is first run separately for each biomarker, so there are multiple sub-null hypotheses, each of which is either rejected or not rejected depending on its p-value. 

2) How can diverse Alzheimer's disease datasets be harmonized and integrated to facilitate comprehensive analyses (data lake + warehouse), including integration of different sources such as spatial transcriptomics and DICOM data?

3) How can a data lake + warehouse be designed to best fit these medical datasets? Encryption is critical for any medical data and fits naturally as a data-handling task. We may add further analysis tools, such as regression models or deep neural networks, to enrich the data analysis. The end goal is a fully working pipeline.

### Extra details for improved notebook clarity
In this notebook, the sign of a correlation (positive or negative) is meaningful only in the context of the variables analyzed. For example, the point-biserial correlation between ABeta42 (pg/ug.1) and dementia status yields r = 0.3693 with p = 0.0005, leading to rejection of the null hypothesis (p < 0.001, strong evidence). Conversely, Kendall's tau between ABeta42 (pg/ug.1) and the last MMSE score is tau = -0.2475 with p = 0.0017, also rejecting the null hypothesis (p < 0.01, strong evidence). These opposite signs are consistent once the variables' scales are taken into account: MMSE scores range from 30 (no impairment) down to 0 (severe cognitive impairment), so a lower MMSE score means more severe impairment. 
Additionally, different metrics of cognitive decline, such as MMSE, all measure dementia (which is tightly related to Alzheimer's disease as its core symptom), meaning they 
measure the same symptom but in different ways. 
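
The sign argument above can be checked on a small made-up example. The sketch below is only an illustration: the six values are invented, and the helpers `pearson_r` and `kendall_tau_a` are hypothetical stand-ins written with the standard library (the notebook itself uses `scipy.stats`). It relies on the fact that a point-biserial correlation is the Pearson correlation with the binary variable coded as 0/1.

```python
from itertools import combinations
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation; with a 0/1 variable this is the point-biserial r."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def kendall_tau_a(x, y):
    """Kendall's tau-a: (concordant - discordant) / number of pairs, no tie correction."""
    pairs = list(combinations(range(len(x)), 2))
    score = 0
    for i, j in pairs:
        product = (x[i] - x[j]) * (y[i] - y[j])
        score += (product > 0) - (product < 0)
    return score / len(pairs)


# Invented toy values, NOT taken from the SEA-AD data
biomarker = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
has_dementia = [0, 0, 0, 1, 1, 1]   # 1 = dementia
mmse = [30, 29, 27, 22, 18, 10]     # lower score = more impairment

r = pearson_r(has_dementia, biomarker)
tau = kendall_tau_a(mmse, biomarker)
print(f"r = {r:.4f}, tau = {tau:.4f}")
```

In this toy case r is positive and tau is negative, yet both describe the same pattern: higher biomarker values go together with worse cognition. Note that `scipy.stats.kendalltau` computes tau-b by default, which differs from tau-a only when there are ties.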

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
from scipy import stats

In [10]:
# Reading the encrypted data
import hashlib
import base64
from io import BytesIO
from cryptography.fernet import Fernet

key = "poqwjoifj2398wesd"


def decrypt_file(file_path, key):
    fernet_key = base64.urlsafe_b64encode(hashlib.sha256(key.encode()).digest())
    cipher = Fernet(fernet_key)

    with open(file_path, "rb") as encrypted_file:
        encrypted_data = encrypted_file.read()

    decrypted_data = cipher.decrypt(encrypted_data)
    return decrypted_data



alzheimers_prediction_dataset = pd.read_csv(BytesIO(decrypt_file("./data/alzheimers_prediction_dataset.csv", key)))
sea_ad_cohort_donor_metadata = pd.read_excel(BytesIO(decrypt_file("./data/sea-ad_cohort_donor_metadata_072524 (1).xlsx", key)))
sea_ad_cohort_mri_volumetrics = pd.read_excel(BytesIO(decrypt_file("./data/sea-ad_cohort_mri_volumetrics (2).xlsx", key)))
sea_ad_cohort_mtg_tissue_extractions_luminex_data = pd.read_excel(BytesIO(decrypt_file("./data/sea-ad_cohort_mtg-tissue_extractions-luminex_data (1).xlsx", key)))

Unnamed: 0,Country,Age,Gender,Education Level,BMI,Physical Activity Level,Smoking Status,Alcohol Consumption,Diabetes,Hypertension,...,Dietary Habits,Air Pollution Exposure,Employment Status,Marital Status,Genetic Risk Factor (APOE-ε4 allele),Social Engagement Level,Income Level,Stress Levels,Urban vs Rural Living,Alzheimer’s Diagnosis
0,Spain,90,Male,1,33.0,Medium,Never,Occasionally,No,No,...,Healthy,High,Retired,Single,No,Low,Medium,High,Urban,No
1,Argentina,72,Male,7,29.9,Medium,Former,Never,No,No,...,Healthy,Medium,Unemployed,Widowed,No,High,Low,High,Urban,No
2,South Africa,86,Female,19,22.9,High,Current,Occasionally,No,Yes,...,Average,Medium,Employed,Single,No,Low,Medium,High,Rural,No
3,China,53,Male,17,31.2,Low,Never,Regularly,Yes,No,...,Healthy,Medium,Retired,Single,No,High,Medium,Low,Rural,No
4,Sweden,58,Female,3,30.0,High,Former,Never,Yes,No,...,Unhealthy,High,Employed,Married,No,Low,Medium,High,Rural,No
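
The key derivation inside `decrypt_file` can be looked at in isolation. The sketch below uses only the standard library and a hypothetical passphrase (not the one used in this notebook) to show why hashing with SHA-256 and then applying URL-safe base64 produces a value of the shape that `Fernet` expects, namely 32 bytes encoded as URL-safe base64.

```python
import base64
import hashlib

# Hypothetical passphrase used only for this illustration
passphrase = "example-passphrase"

# SHA-256 always yields 32 bytes, regardless of the passphrase length
digest = hashlib.sha256(passphrase.encode()).digest()

# URL-safe base64 turns those 32 bytes into a 44-character key
fernet_key = base64.urlsafe_b64encode(digest)

print(len(digest), len(fernet_key))  # prints "32 44"
```

A plain hash is the simplest way to stretch a passphrase to the right length; the `cryptography` documentation describes a key derivation function such as PBKDF2 as the usual choice when the key comes from a password.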


## Step 1: Data evaluation
Evaluating whether the data is suitable for statistical analysis and sufficiently diverse. 



## Step 2: Correlation Analysis

### Substeps:

1. **Main Group of Correlations 1: Correlating Single Variables with Cognitive Decline**  
   - Example: ABeta40 and cognitive decline, ABeta42 and cognitive decline, etc.  
   - **Variable Groups**:  
     - **RIPA Buffer Extraction**: ABeta40, ABeta42, tTAU, pTAU  
     - **GuHCl (Guanidine Hydrochloride) Buffer Tissue Extraction**: ABeta40, ABeta42, tTAU, pTAU  
   - **Total Correlations**: 8 (each variable correlated with cognitive decline)

2. **Main Group of Correlations 2: Correlating Single Variables with MMSE Test Scores**  
   - Example: ABeta40 and MMSE test scores, ABeta42 and MMSE test scores, etc.  
   - **Variable Groups**:  
     - **RIPA Buffer Extraction**: ABeta40, ABeta42, tTAU, pTAU (each correlated with MMSE test scores)  
     - **GuHCl (Guanidine Hydrochloride) Buffer Tissue Extraction**: ABeta40, ABeta42, tTAU, pTAU (each correlated with MMSE test scores)  
   - **Total Correlations**: 8 (each variable correlated with MMSE test scores)  
  
3. **Complex correlation analysis**: TODO
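
The eight biomarker columns used in both correlation groups can be generated rather than typed out one by one. This is a minimal sketch, assuming the column names that appear in the merged table further down (`pg/ug` and `pg/ug.1`); which suffix belongs to which extraction buffer is not stated here, so the code does not label them. The name `biomarker_columns` is introduced only for this illustration.

```python
from itertools import product

biomarkers = ["ABeta40", "ABeta42", "tTAU", "pTAU"]

# pandas appends ".1" to the second occurrence of a duplicated column name,
# so each biomarker appears once per extraction buffer.
suffixes = ["pg/ug", "pg/ug.1"]

biomarker_columns = [f"{name} {suffix}" for name, suffix in product(biomarkers, suffixes)]

for column in biomarker_columns:
    print(column)
```

Looping over such a list would let one `append` call replace the eight near-identical ones in the correlation cells.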

In [2]:
### Data preparation and viewing 
### Merging of the datasets with correct patient ID filtering 
metadata_dementia = pd.read_excel("./data/sea-ad_cohort_donor_metadata_072524 (1).xlsx")
markers = pd.read_excel("./data/sea-ad_cohort_mtg-tissue_extractions-luminex_data (1).xlsx", header=1)
markers_metadata = markers.merge(metadata_dementia, on="Donor ID", how="inner")
markers_metadata.head()

Unnamed: 0,Donor ID,ABeta40 pg/ug,ABeta42 pg/ug,tTAU pg/ug,pTAU pg/ug,ABeta40 pg/ug.1,ABeta42 pg/ug.1,tTAU pg/ug.1,pTAU pg/ug.1,Primary Study Name,...,CERAD score,Overall CAA Score,Highest Lewy Body Disease,Total Microinfarcts (not observed grossly),Total microinfarcts in screening sections,Atherosclerosis,Arteriolosclerosis,LATE,RIN,Severely Affected Donor
0,H20.33.045,981.444,142.778,1122.432229,5.415789,2179.336,1737.483712,27.065263,2.638947,ADRC Clinical Core,...,Frequent,Not identified,Not Identified (olfactory bulb not assessed),4,4,Moderate,Severe,LATE Stage 2,8.15,
1,H20.33.044,0.007088,0.245263,7005.543158,5.630526,0.203116,0.311579,109.728421,1.957895,ACT,...,Absent,Not identified,Not Identified (olfactory bulb not assessed),3,1,Mild,Moderate,LATE Stage 1,9.2,
2,H21.33.045,21.423158,53.878947,147.565263,11.489474,46.231579,954.656984,14.405263,1.678947,ADRC Clinical Core,...,Frequent,Moderate,Limbic (Transitional),0,0,Moderate,Moderate,LATE Stage 3,6.55,Y
3,H20.33.046,25.295789,69.988421,283.436842,15.917895,46.929474,1103.876312,17.046316,2.871579,ACT,...,Frequent,Moderate,Not Identified (olfactory bulb not assessed),1,1,Moderate,Severe,LATE Stage 2,5.67,Y
4,H20.33.014,0.526168,16.137895,258.624211,3.398947,1.675758,97.642105,83.395789,1.68,ADRC Clinical Core,...,Sparse,Mild,Olfactory bulb only,1,1,,Mild,Not Identified,8.67,


In [3]:
"""
    1) Correlating single variables with a single outcome, e.g. ABeta40 with cognitive decline, ABeta42 with cognitive decline, and so on.
    (full list 1: ABeta40, ABeta42, tTAU, pTAU in RIPA buffer extraction, each correlated with cognitive decline)
    (full list 2: ABeta40, ABeta42, tTAU, pTAU in GuHCl (Guanidine Hydrochloride) buffer tissue extractions, each correlated with cognitive decline)
    (total of 8 correlations)
    Type of correlation: point-biserial correlation coefficient with its associated p-value (for hypothesis testing), because one variable is binary (dementia yes/no) and the other is continuous.
    The test is two-tailed, so the r coefficient can indicate either a negative or a positive correlation; our main null hypothesis is two-sided accordingly.
"""

# Define the binary dementia indicator and prepare correlation tests
has_dementia = markers_metadata["Cognitive Status"] == "Dementia"
stat_tests = []

# Correlate each biomarker column with dementia status
stat_tests.append(("ABeta40 pg/ug", stats.pointbiserialr(has_dementia, markers_metadata["ABeta40 pg/ug"])))
stat_tests.append(("ABeta40 pg/ug.1", stats.pointbiserialr(has_dementia, markers_metadata["ABeta40 pg/ug.1"])))

stat_tests.append(("ABeta42 pg/ug", stats.pointbiserialr(has_dementia, markers_metadata["ABeta42 pg/ug"])))
stat_tests.append(("ABeta42 pg/ug.1", stats.pointbiserialr(has_dementia, markers_metadata["ABeta42 pg/ug.1"])))

stat_tests.append(("tTAU pg/ug", stats.pointbiserialr(has_dementia, markers_metadata["tTAU pg/ug"])))
stat_tests.append(("tTAU pg/ug.1", stats.pointbiserialr(has_dementia, markers_metadata["tTAU pg/ug.1"])))


stat_tests.append(("pTAU pg/ug", stats.pointbiserialr(has_dementia, markers_metadata["pTAU pg/ug"])))
stat_tests.append(("pTAU pg/ug.1", stats.pointbiserialr(has_dementia, markers_metadata["pTAU pg/ug.1"])))


# Sort tests by p-value in ascending order
stat_tests.sort(key=lambda x: x[1].pvalue)
alpha = 0.05  # significance level for p value

# Printing correlation with added result interpretation
for marker_name, stat_result in stat_tests:
    decision = ("Reject null hypothesis: evidence of a statistically significant correlation"
                if stat_result.pvalue < alpha 
                else "Fail to reject null hypothesis: insufficient evidence to infer a correlation")
    print(f"{marker_name}: r = {stat_result.statistic:.4f}, p = {stat_result.pvalue:.4f} -> {decision}")


ABeta42 pg/ug.1: r = 0.3693, p = 0.0005 -> Reject null hypothesis: evidence of a statistically significant correlation
ABeta40 pg/ug: r = 0.1923, p = 0.0796 -> Fail to reject null hypothesis: insufficient evidence to infer a correlation
ABeta42 pg/ug: r = 0.1893, p = 0.0846 -> Fail to reject null hypothesis: insufficient evidence to infer a correlation
ABeta40 pg/ug.1: r = 0.1861, p = 0.0902 -> Fail to reject null hypothesis: insufficient evidence to infer a correlation
pTAU pg/ug: r = 0.1752, p = 0.1110 -> Fail to reject null hypothesis: insufficient evidence to infer a correlation
tTAU pg/ug.1: r = -0.1653, p = 0.1330 -> Fail to reject null hypothesis: insufficient evidence to infer a correlation
tTAU pg/ug: r = -0.1221, p = 0.2685 -> Fail to reject null hypothesis: insufficient evidence to infer a correlation
pTAU pg/ug.1: r = -0.0660, p = 0.5507 -> Fail to reject null hypothesis: insufficient evidence to infer a correlation


In [4]:
"""
    2) Correlating single variables with a single outcome, e.g. ABeta40 with MMSE test scores, ABeta42 with MMSE test scores, and so on.
    (full list 1: ABeta40, ABeta42, tTAU, pTAU in RIPA buffer extraction, each correlated with MMSE test scores)
    (full list 2: ABeta40, ABeta42, tTAU, pTAU in GuHCl (Guanidine Hydrochloride) buffer tissue extractions, each correlated with MMSE test scores)
    (total of 8 correlations)
    Type of correlation: Kendall's tau with its associated p-value (for hypothesis testing).
    We chose this rank-based correlation because MMSE is an ordinal score and the biomarkers are continuous.
    The test is two-tailed, so the tau coefficient can indicate either a negative or a positive correlation; our main null hypothesis is two-sided accordingly.
"""

# Define MMSE scores and prepare correlation tests
MMSE = markers_metadata["Last MMSE Score"]
stat_tests_2 = []

# Applying Kendall's tau correlation for each biomarker versus MMSE (omitting NaNs)
stat_tests_2.append(("ABeta40 pg/ug", stats.kendalltau(MMSE, markers_metadata["ABeta40 pg/ug"], nan_policy="omit")))
stat_tests_2.append(("ABeta40 pg/ug.1", stats.kendalltau(MMSE, markers_metadata["ABeta40 pg/ug.1"], nan_policy="omit")))
stat_tests_2.append(("ABeta42 pg/ug", stats.kendalltau(MMSE, markers_metadata["ABeta42 pg/ug"], nan_policy="omit")))
stat_tests_2.append(("ABeta42 pg/ug.1", stats.kendalltau(MMSE, markers_metadata["ABeta42 pg/ug.1"], nan_policy="omit")))
stat_tests_2.append(("tTAU pg/ug", stats.kendalltau(MMSE, markers_metadata["tTAU pg/ug"], nan_policy="omit")))
stat_tests_2.append(("tTAU pg/ug.1", stats.kendalltau(MMSE, markers_metadata["tTAU pg/ug.1"], nan_policy="omit")))
stat_tests_2.append(("pTAU pg/ug", stats.kendalltau(MMSE, markers_metadata["pTAU pg/ug"], nan_policy="omit")))
stat_tests_2.append(("pTAU pg/ug.1", stats.kendalltau(MMSE, markers_metadata["pTAU pg/ug.1"], nan_policy="omit")))

# Sort tests by p-value in ascending order
stat_tests_2.sort(key=lambda x: x[1].pvalue)
alpha = 0.05  # significance level for p value 

# Print the correlation with added result interpretation
for marker_name, stat_result in stat_tests_2:
    decision = ("Reject null hypothesis: statistically significant correlation"
                if stat_result.pvalue < alpha
                else "Fail to reject null hypothesis: insufficient evidence to infer a correlation")
    print(f"{marker_name}: tau = {stat_result.correlation:.4f}, p = {stat_result.pvalue:.4f} -> {decision}")


ABeta42 pg/ug.1: tau = -0.2475, p = 0.0017 -> Reject null hypothesis: statistically significant correlation
ABeta40 pg/ug.1: tau = -0.2314, p = 0.0033 -> Reject null hypothesis: statistically significant correlation
ABeta40 pg/ug: tau = -0.2068, p = 0.0087 -> Reject null hypothesis: statistically significant correlation
pTAU pg/ug: tau = -0.2057, p = 0.0090 -> Reject null hypothesis: statistically significant correlation
ABeta42 pg/ug: tau = -0.1696, p = 0.0313 -> Reject null hypothesis: statistically significant correlation
tTAU pg/ug.1: tau = 0.1249, p = 0.1127 -> Fail to reject null hypothesis: insufficient evidence to infer a correlation
tTAU pg/ug: tau = 0.1058, p = 0.1789 -> Fail to reject null hypothesis: insufficient evidence to infer a correlation
pTAU pg/ug.1: tau = -0.0763, p = 0.3328 -> Fail to reject null hypothesis: insufficient evidence to infer a correlation
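
Because each of the two listings above reports eight separate tests at alpha = 0.05, it can be useful to see which decisions survive a multiple-comparison adjustment. The sketch below is an optional check that is not part of the original analysis: it applies Holm's step-down method to the printed p-values (copied by hand, rounded to four decimals). The function name `holm_reject` is hypothetical.

```python
def holm_reject(p_values, alpha=0.05):
    """Return the markers whose null hypothesis is rejected under Holm's method."""
    ordered = sorted(p_values.items(), key=lambda item: item[1])
    m = len(ordered)
    rejected = []
    for rank, (name, p) in enumerate(ordered):
        # The k-th smallest p-value (k = rank + 1) is compared with alpha / (m - k + 1);
        # the procedure stops at the first p-value that exceeds its threshold.
        if p > alpha / (m - rank):
            break
        rejected.append(name)
    return rejected


# p-values copied from the point-biserial listing (dementia status)
point_biserial_p = {
    "ABeta42 pg/ug.1": 0.0005, "ABeta40 pg/ug": 0.0796,
    "ABeta42 pg/ug": 0.0846, "ABeta40 pg/ug.1": 0.0902,
    "pTAU pg/ug": 0.1110, "tTAU pg/ug.1": 0.1330,
    "tTAU pg/ug": 0.2685, "pTAU pg/ug.1": 0.5507,
}

# p-values copied from the Kendall's tau listing (MMSE)
kendall_p = {
    "ABeta42 pg/ug.1": 0.0017, "ABeta40 pg/ug.1": 0.0033,
    "ABeta40 pg/ug": 0.0087, "pTAU pg/ug": 0.0090,
    "ABeta42 pg/ug": 0.0313, "tTAU pg/ug.1": 0.1127,
    "tTAU pg/ug": 0.1789, "pTAU pg/ug.1": 0.3328,
}

print(holm_reject(point_biserial_p))
print(holm_reject(kendall_p))
```

With these inputs, ABeta42 pg/ug.1 remains significant in both listings and ABeta40 pg/ug.1 remains significant for MMSE; the other unadjusted rejections do not survive the adjustment.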


In [None]:
regions = ["Left Hippocampus Volume", "Right Hippocampus Volume", "Left Entorhinal Cortex Volume", "Right Entorhinal Cortex Volume"]

# Load both tables once instead of re-reading the Excel files on every iteration
metadata_dementia = pd.read_excel("./data/sea-ad_cohort_donor_metadata_072524 (1).xlsx")
volumetric = pd.read_excel("./data/sea-ad_cohort_mri_volumetrics (2).xlsx")

for region in regions:
    # Keep only donors with a measured volume for this region, then attach the metadata
    region_vol = volumetric[["Donor ID", region]].dropna()
    region_vol_metadata = region_vol.merge(metadata_dementia, on="Donor ID", how="inner")

    has_dementia = region_vol_metadata["Cognitive Status"] == "Dementia"

    # Table shape after merging, followed by the point-biserial result for this region
    print(region_vol_metadata.shape)
    print(stats.pointbiserialr(has_dementia, region_vol_metadata[region]))

In [None]:
# Correlation between years of education and age of dementia diagnosis,
# first for the SEA-AD metadata alone and then combined with the Kaggle data

metadata_dementia = pd.read_excel("./data/sea-ad_cohort_donor_metadata_072524 (1).xlsx")
c = metadata_dementia[["Years of education", "Age of Dementia diagnosis"]].dropna()
print(stats.spearmanr(
    c["Age of Dementia diagnosis"],
    c["Years of education"],
))

more_data = pd.read_csv("./data/alzheimers_prediction_dataset.csv")
more_data = more_data[more_data['Alzheimer’s Diagnosis'] == "Yes"]
d = more_data[["Education Level", 'Age']]
# Assumption: Age is the Age of Dementia diagnosis and Education Level is Years of education
d = d.rename(columns={"Education Level": "Years of education", "Age": "Age of Dementia diagnosis"})

e = pd.concat([c, d])
print(stats.spearmanr(
    e["Age of Dementia diagnosis"],
    e["Years of education"],
))

In [None]:
# Correlate age with whether the person has dementia (here: an Alzheimer's diagnosis)


more_data = pd.read_csv("./data/alzheimers_prediction_dataset.csv")
has_dementia = more_data['Alzheimer’s Diagnosis'] == "Yes"

stats.pointbiserialr(has_dementia, more_data["Age"])