# Statistical Data Analysis
## ADNI Alzheimer's Data
- Functions/algorithms used for calculating the statistics are in the sda module

In [1]:
# import packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

# import wrangle_adni to import/wrangle the data
from adnidatawrangling import wrangle_adni

# import custom modules eda, sda
# eda: exploratory data analysis module for additional functions
# sda: statistical data analysis module
import eda, sda

# set seaborn defaults
sns.set()

In [2]:
# import data, clean, and extract data
adni_comp, clin_data, scan_data = wrangle_adni()

In [3]:
# extract final exam data: only the last exam for each patient
final_exam = eda.get_final_exam(adni_comp)

# calculate the change in variables over the course of the study
eda.calc_deltas(final_exam)

### Statistical Questions:
- Change in biomarkers
    - Is there a difference between males and females in the amount of change observed in biomarkers with progression towards Alzheimer's Disease (AD)?
        - Further questions will be split by males/females if a difference is found
        - Permutation sampling will be used here to test if males/females come from the same distribution with regards to the amount of change observed
    - Which biomarkers show a statistically significant change as a person develops AD?
    - In biomarkers that show a change, what amount of change is correlated with progression towards AD?
- Biomarker baseline values as predictors of AD
    - Are there statistically significant thresholds for baseline values of a biomarker that suggest a person will develop AD?
        - How does this vary when including/excluding certain diagnosis groups?
            1. Looking at the entire sample of patients (all baseline diagnoses, including those with AD already)
            2. Only patients that were cognitively normal (CN) or had mild cognitive impairment (MCI) at baseline
            3. Including only patients that were CN at baseline
        - Exploratory data analysis suggests that there may be different thresholds for different genders

#### Change in Biomarkers
- From the exploratory data analysis, the following biomarkers were revealed as good candidates for statistical analysis
    - Clinical tests: CDRSB, ADAS11, ADAS13, MMSE, RAVLT_immediate
    - Brain scans: Hippocampus, Ventricles, WholeBrain, Entorhinal MidTemp
- Approach
    - As sample sizes are not very large, bootstrapping will be used to generate a distribution for the change in each biomarker for patients that showed no change in diagnosis during the study
    - The null hypothesis is that when patients are divided into groups based on their change in diagnosis (CN to MCI, MCI to AD, CN to AD) all groups will have the same distribution as the group that ended the study with no change
        - May have to examine whether patients with no change in diagnosis that are not CN (MCI to MCI and AD to AD) impact the results
    - The alternative hypothesis is that the distributions for each group will be different enough that threshold values can be identified to signify beginning early treatment for MCI/AD or raising concerns about progression to AD
        - A 'statistically significant' result of p < 0.05 is not necessarily needed to indicate that a change in a biomarker can be used to initiate early treatment, further monitoring, or further testing
        - Instead, this analysis is looking for probability levels that show reasons for actions
            - If someone told you that you had an 80% chance of developing AD, would you start doing brain exercises and puzzles, painting, or listening to classical music if you thought it might help at all?

In [4]:
# divide the data by gender

# replace this code with 
# fe_males, fe_females = divide_genders(final_exam)

fe_males = final_exam[final_exam.PTGENDER == 'Male']
fe_females = final_exam[final_exam.PTGENDER == 'Female']

In [5]:
# test whether or not males and females have the same distribution for the change in each biomarker

c_arr = np.array(final_exam['CDRSB_bl'])
r_arr = np.random.permutation(c_arr)


In [6]:
n_arr1 = r_arr[:len(final_exam[final_exam.PTGENDER == 'Male'])]

In [7]:
n_arr2 = r_arr[len(final_exam[final_exam.PTGENDER == 'Female']):]

In [12]:
np.mean(n_arr1)

1.455607476635514

In [13]:
np.mean(n_arr2)

1.4065420560747663

In [14]:
np.mean(c_arr)

1.4482905982905983

In [8]:
p = np.sum(np.mean(c_arr) >= abs((np.mean(n_arr1) - np.mean(n_arr2)))) / len(c_arr)
print('p-value: ', p)

p-value:  0.0008547008547008547


In [4]:
sda.test_gender_deltas(final_exam, 'CDRSB_bl')

Distribution Test for Males/Females
Variable:  CDRSB_bl
If p < 0.05, then split the data by gender
p-value:  0.0008547008547008547


In [5]:
sda.test_gender_deltas(final_exam, 'ADAS11_bl')

Distribution Test for Males/Females
Variable:  ADAS11_bl
If p < 0.05, then split the data by gender
p-value:  0.0008547008547008547


#### CDRSB Change

In [16]:
# is the amount of change in each biomarker consistent between males/females



In [8]:
# divide data into groups based on change in diagnosis

# replace this codeblock with 
# no_change, cn_mci, mci_ad, cn_ad = sda.get_deltadx_groups()

# isolate patients with no diagnosis change
no_change = final_exam[final_exam['DX'] == final_exam['DX_bl2']]
    
# isolate patients who progressed from 'CN' to 'AD'
cn_mci = final_exam[(final_exam['DX'] == 'MCI') & (final_exam['DX_bl2'] == 'CN')]
    
# isolate patients who progressed from 'MCI' to 'AD'
mci_ad = final_exam[(final_exam['DX'] == 'AD') & (final_exam['DX_bl2'] == 'MCI')]
    
# isolate patients who progressed from 'CN' to 'AD'
cn_ad = final_exam[(final_exam['DX'] == 'AD') & (final_exam['DX_bl2'] == 'CN')]

#### Baseline Values for Predicting Alzheimer's Disease
- From the exploratory data analysis, the following biomarkers emerged as good candidates for statistical testing
    - Clinical tests: ADAS11 and ADAS13
    - Brain scans: Hippocampus and MidTemp
- Approach
    - The data will be divided by final diagnosis into two groups: those that ended the study with AD and those that didn't
    - Because the sample sizes are relatively small, bootstrap distributions will be generated for each group
    - The null hypothesis is that both the group that ended with AD and the non AD group will have the same distribution of baseline values
    - The alternative hypothesis is that the group that ended the study with AD will have a different baseline distribution for the analyzed biomarkers
        - Again, 'statistical significance' of p < 0.05 is not necessarily needed to find value in using baseline values for a biomarker to predict progression to AD
        - Intead, the probability of having certain baseline values will be examined, taking into account the probability that someone will not progress to AD with a certain baseline value and the probability that someone will progress to AD with their observed baseline value