# COGS 108 - Data Checkpoint

# Names

- Shivangi Gupta
- Joseph Hwang
- Zijun Yang
- Johnny Gonzales
- Tanishq Rathore

# Research Question

Using clinical MRI data and personal details of an individual, can we predict, via a machine learning model, whether that individual will develop Alzheimer's disease? Features the model will be trained on include the Mini Mental State Examination (MMSE) score, visit number, Clinical Dementia Rating (CDR), gender, age, years of education, socioeconomic status, Estimated Total Intracranial Volume (eTIV), Normalized Whole Brain Volume (nWBV), and Atlas Scaling Factor (ASF).

## Background and Prior Work

Advancements in healthcare, improvements in living conditions, and
breakthroughs in medicine have collectively contributed to longer life
expectancies worldwide; simultaneously, developed countries are also
experiencing declining fertility
rates.[<sup>1</sup>](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4255510/)
The combination of these two circumstances has caused the
proportion of older people within populations to increase steadily. The
World Health Organization (WHO) reported that “in 2020, the number of
people aged 60 and older outnumbered children younger than 5
years”.[<sup>2</sup>](https://www.who.int/news-room/fact-sheets/detail/ageing-and-health)
In addition, they also state that “between 2015 and 2050, the proportion
of the world’s population over 60 years will nearly double from 12% to
22%”. As a result, it is reasonable that we examine common health
conditions associated with older age, one being Alzheimer’s disease.

So what is Alzheimer’s disease? Alzheimer’s disease is a progressive
neurodegenerative brain disorder that impairs memory and cognitive
functions. It is the most common cause of dementia and affects about 6.5
million people in the United States who are aged 65 and
older.[<sup>3</sup>](https://www.mayoclinic.org/diseases-conditions/alzheimers-disease/symptoms-causes/syc-20350447)
At the moment, there are no cures for the disease but medicines may
improve or slow the progression of
symptoms.[<sup>3</sup>](https://www.mayoclinic.org/diseases-conditions/alzheimers-disease/symptoms-causes/syc-20350447)
As such, our project aims to create a model that is able to predict
Alzheimer’s disease based on clinical data that include factors
indicating risk and progression of the disease.

There are several other projects that have asked similar questions and
approached similar problems for other diseases. For instance, one study
tried to use machine learning methods to predict risk of cardiovascular
disease based on major contributing
factors.[<sup>4</sup>](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10036320/)
Similarly, another paper used machine learning and ranker-based feature
selection methods to predict eye diseases based on
symptoms.[<sup>5</sup>](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9854513/)
Lastly, there was a paper that predicted thyroid disease using selective
features and machine learning
techniques.[<sup>6</sup>](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9405591/)
All three papers appear to have been relatively successful in predicting
the disease based on distinct features. Evidently, training machine
learning models on datasets that contain factors and indicators
for a given disease is not a novel approach; we
hope to achieve similar results for Alzheimer’s disease.

<u>In-Depth Study Analysis</u>

Our group analyzed two studies available through the National Institutes of
Health (NIH) PubMed Central database. The first study[<sup>7</sup>](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8927715/) employed
machine learning models to predict early-stage Alzheimer's Disease using
Open Access Series of Imaging Studies (OASIS) data, focusing on metrics
like precision, recall, accuracy, and F1-score. The authors, with
backgrounds in technology and health research, aimed to enhance early
diagnosis, potentially lowering Alzheimer's mortality rates. The study
demonstrated that machine learning techniques such as decision trees,
random forests, SVM, gradient boosting, and voting classifiers can
effectively predict early-stage Alzheimer's Disease with an accuracy of
up to 83%. This achievement highlights the critical role of data science
in identifying Alzheimer's at an early phase, leveraging feature
selection and advanced algorithms to enhance diagnostic accuracy. Early
detection is crucial for timely intervention, potentially mitigating the
disease's progression and impact on patients and their families (Kavitha
et al.).
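The four metrics named above can all be derived from the counts in a binary confusion matrix. The following is a minimal sketch of that arithmetic; the function name `classification_metrics` and the counts passed to it are hypothetical and chosen only for illustration, not taken from the study.

```python
def classification_metrics(tp: int, fp: int, fn: int, tn: int) -> dict[str, float]:
    """Compute accuracy, precision, recall, and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    # Guard against division by zero when a class is never predicted or never present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical counts for illustration only (not results from any study).
metrics = classification_metrics(tp=40, fp=10, fn=5, tn=45)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

In a screening context, recall (the share of true cases that are caught) is often weighted more heavily than precision, which is one reason to report all four numbers rather than accuracy alone.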

The second study[<sup>8</sup>](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6138240/) titled
“Application of machine learning methods for diagnosis of dementia based
on the 10/66 battery of cognitive function tests in south India”
investigated the use of machine learning for diagnosing dementia in
South India, employing the culturally and educationally fair 10/66
battery of cognitive function tests designed for use in low and
middle-income countries. Through the analysis of neuropsychological
data, demographic information, and normative data, the research applied
the JRip classification algorithm, among others, achieving high diagnostic
accuracy. This approach demonstrates the potential to streamline the
diagnostic process, making it quicker and more accessible for clinicians
and patients in India, thereby addressing the significant healthcare
challenge of efficiently identifying dementia in community settings
(Bhagyashree et al.).

<u>In-Depth Analysis of Similar Projects</u>

Our group also examined Kaggle projects that are directly
associated with the dataset we’ve chosen to use, covering EDA and
prediction models built with Scikit-Learn and TensorFlow. The first
project[<sup>9</sup>](https://www.kaggle.com/code/shreyaspj/alzheimer-s-analysis-using-mri)
starts with an introduction to Alzheimer's disease and
the problem statement of estimating the Clinical Dementia Rating (CDR)
using MRI dataset features. It progresses through data loading and
preprocessing, including null value handling and normalization, and
employs machine learning techniques, specifically mentioning model
training with hyperparameter tuning for XGBClassifier and
GradientBoostingClassifier. The notebook concludes with predictions and
performance evaluation, indicated by confusion matrix and classification
report visualizations, and the model was able to reach a final accuracy
of \~80%.

The second project[<sup>10</sup>](https://www.kaggle.com/code/andrew32bit/predict-alzheimer-disease-sl-and-tf) tried to predict the Clinical Rating of Alzheimer's disease (CRA) by
integrating data loading, visualization, and extensive machine learning,
including the use of TensorFlow for neural network models. It explored
various machine learning models, with a special emphasis on model
training and evaluation, culminating in the finding that the
DecisionTreeClassifier performed the best among the models tested. The
conclusion stressed the need for more data to enhance the precision of
Alzheimer's disease predictions, highlighting the challenge of data
scarcity in achieving accurate diagnostic models.

**References**

1.  [^](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4255510/) Nargund G. (2009) Declining birth rate in Developed Countries: A
     radical policy re-think is required. *Facts, views & vision in
     ObGyn, 1(3), 191–193.*
     [<u>https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4255510/</u>](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4255510/)

2.  [^](https://www.who.int/news-room/fact-sheets/detail/ageing-and-health) World Health Organization. (1 Oct 2022) Ageing and health. *World
     Health Organization*.
     [<u>https://www.who.int/news-room/fact-sheets/detail/ageing-and-health</u>](https://www.who.int/news-room/fact-sheets/detail/ageing-and-health)

3.  [^](https://www.mayoclinic.org/diseases-conditions/alzheimers-disease/symptoms-causes/syc-20350447) Mayo Foundation for Medical Education and Research. (30
     August 2023) Alzheimer’s disease. *Mayo Clinic*.
     [<u>https://www.mayoclinic.org/diseases-conditions/alzheimers-disease/symptoms-causes/syc-20350447</u>](https://www.mayoclinic.org/diseases-conditions/alzheimers-disease/symptoms-causes/syc-20350447)

4.  [^](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10036320/) Peng, M., Hou, F., Cheng, Z., Shen, T., Liu, K., Zhao, C., &
     Zheng, W. (23 Mar 2023) Prediction of cardiovascular disease risk
     based on major contributing features. *Scientific reports, 13(1),
     4778*.
     [<u>https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10036320/</u>](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10036320/)

5.  [^](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9854513/) Marouf, A. A., Mottalib, M. M., Alhajj, R., Rokne, J., &
     Jafarullah, O. (24 Dec 2022) An Efficient Approach to Predict Eye
     Diseases from Symptoms Using Machine Learning and Ranker-Based
     Feature Selection Methods. *Bioengineering (Basel, Switzerland),
     10(1), 25*.
     [<u>https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9854513/</u>](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9854513/)

6.  [^](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9405591/) Chaganti, R., Rustam, F., De La Torre Díez, I., Mazón, J. L. V.,
     Rodríguez, C. L., & Ashraf, I. (13 Aug 2022). Thyroid Disease
     Prediction Using Selective Features and Machine Learning
     Techniques. *Cancers, 14(16), 3914*.
     [<u>https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9405591/</u>](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9405591/)

7.  [^](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8927715/) Kavitha C, Mani V, Srividhya SR, Khalaf OI, Tavera Romero CA.
     Early-Stage Alzheimer's Disease Prediction Using Machine Learning
     Models. Front Public Health. 2022 Mar 3;10:853294. doi:
     10.3389/fpubh.2022.853294. PMID: 35309200; PMCID: PMC8927715.
     [<u>https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8927715/</u>](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8927715/)

8.  [^](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6138240/) Bhagyashree SIR, Nagaraj K, Prince M, Fall CHD, Krishna M.
     Diagnosis of Dementia by Machine learning methods in
     Epidemiological studies: a pilot exploratory study from south
     India. Soc Psychiatry Psychiatr Epidemiol. 2018 Jan;53(1):77-86.
     doi: 10.1007/s00127-017-1410-0. Epub 2017 Jul 11. PMID: 28698926;
     PMCID: PMC6138240.
     [<u>https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6138240/</u>](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6138240/)

9.  [^](https://www.kaggle.com/code/shreyaspj/alzheimer-s-analysis-using-mri) Reddy, Shreyas. 2021. Alzheimer's analysis using MRI, February
     8, 2024.
     [<u>https://www.kaggle.com/code/shreyaspj/alzheimer-s-analysis-using-mri</u>](https://www.kaggle.com/code/shreyaspj/alzheimer-s-analysis-using-mri)

10. [^](https://www.kaggle.com/code/andrew32bit/predict-alzheimer-disease-sl-and-tf) Andrew. 2017. Predict alzheimer disease sl and tf, February 8, 2024.
     [<u>https://www.kaggle.com/code/andrew32bit/predict-alzheimer-disease-sl-and-tf</u>](https://www.kaggle.com/code/andrew32bit/predict-alzheimer-disease-sl-and-tf)

# Hypothesis


Our project's hypothesis is the following: "It is possible to predict the onset of Alzheimer's disease based on the combination of (1) clinical data* and (2) personal features such as gender, age, years of education, and socioeconomic status." We believe that we will be able to train a model that predicts the onset of Alzheimer's disease because, as mentioned in the background portion of the proposal, there have been numerous successful machine learning models trained on clinical data to predict the onset of a disease. The clinical data provide variables that capture both the cognitive and structural changes associated with the disease's progression. Incorporating personal features such as gender and age in the prediction model is justified by extensive research indicating that these factors can influence the risk and progression rate of Alzheimer's disease. We believe that, taken together, these variables will be enough to train a model to predict the onset of Alzheimer's disease.

*Clinical data include Mini Mental State Examination (MMSE), visit number, Clinical Dementia Rating (CDR), Estimated Total Intracranial Volume (eTIV), Normalized Whole Brain Volume (nWBV), and the Atlas Scaling Factor (ASF).

# Data

## Data overview

- Dataset #1
  - Dataset Name: OASIS-2: Longitudinal MRI Data in Nondemented and Demented Older Adults
  - Link to the dataset: https://www.oasis-brains.org/#data
  - Number of observations: 373
  - Number of variables: 15

The dataset provided by the Open Access Series of Imaging Studies (OASIS) contains a collection of 150 subjects aged 60 to 96. Each subject was scanned on 2 or more visits, separated by at least one year, for a total of 373 imaging sessions. The subjects include both men and women, and all are right-handed. 72 subjects were characterized as nondemented throughout the study. 64 subjects were characterized as demented at their initial visits and remained so for subsequent scans. Lastly, 14 subjects were characterized as nondemented at their initial visit but were subsequently characterized as demented at a later visit. Important variables are years of education (EDUC), socioeconomic status (SES; 1 to 5), mini mental state examination (MMSE score), clinical dementia rating (CDR rating scale), estimated total intracranial volume (eTIV), normalized whole brain volume (nWBV), and atlas scaling factor (ASF).
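The group sizes above determine the class balance the model will face. As a rough sketch of that arithmetic, assuming converted subjects are counted as positive cases (which matches the relabeling done in this notebook's cleaning cell), the majority-class baseline works out as follows. The variable names are ours and purely illustrative.

```python
# Subject counts reported in the OASIS-2 description above.
nondemented = 72
demented = 64
converted = 14

# Assumption: converted subjects are counted as positive (label 1),
# matching the relabeling done in this notebook's cleaning cell.
positive = demented + converted
negative = nondemented
total = positive + negative

# Accuracy of a trivial model that always predicts the larger class.
majority_baseline = max(positive, negative) / total
print(f"positive={positive}, negative={negative}, total={total}")
print(f"majority-class baseline accuracy: {majority_baseline:.2f}")
```

Under that assumption the classes are close to balanced (78 versus 72), so a trained model would need to exceed roughly 52% accuracy to be doing better than always guessing the majority class.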

The variables may be proxies for Alzheimer's disease or cognitive state (conjecture):
- __EDUC__: Represents years of education which may proxy for cognitive reserve. Higher education levels may result in better cognitive function and reduced risk of dementia. 
- __SES__: Ranging from 1 to 5, is a proxy for income, education level, and occupation, which represents environmental influences on cognitive health. 
- __MMSE__: A widely used screening tool for cognitive impairment. Scores range from 0 to 30, with higher scores indicating better cognitive function. The examination assesses domains such as orientation, memory, attention, and language.
- __CDR__: CDR scale is also commonly used to assess severity of dementia symptoms. The scale ranges from 0 to 3 where 0 indicates no dementia, 0.5 is questionable dementia, 1 is mild dementia, 2 is moderate dementia, and 3 is severe dementia.
- __eTIV__: Estimates the total volume inside the skull (including brain tissue, cerebrospinal fluid, etc.). It is measured in cubic centimeters (cc) and can help establish baseline brain size. Smaller or varying volumes may be a proxy for cognitive function and Alzheimer's risk.
- __nWBV__: Represents the volume of the brain normalized to the subject's eTIV (nWBV = brain volume / eTIV). In this dataset it is stored as a proportion between 0 and 1 (for example, 0.696), reflecting the share of the eTIV occupied by brain tissue. Decreases in nWBV may indicate brain atrophy, which is a common feature of Alzheimer's and dementia.
- __ASF__: A scaling factor used in brain imaging to adjust for individual differences in brain size and shape. It accounts for variability in brain morphology allowing for accurate comparisons of brain structures.
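Because ASF adjusts for head size, it is worth checking how closely it tracks eTIV before treating both as independent features. The sketch below multiplies the two columns for the first five rows of the raw table shown later in this notebook. This is an observation on a handful of rows, not a documented definition of either variable.

```python
# (eTIV, ASF) pairs copied from the first five rows of the raw table in this notebook.
samples = [
    (1987, 0.883),
    (2004, 0.876),
    (1678, 1.046),
    (1738, 1.010),
    (1698, 1.034),
]

# If ASF is roughly inversely proportional to eTIV, the product stays nearly constant.
products = [etiv * asf for etiv, asf in samples]
for (etiv, asf), product in zip(samples, products):
    print(f"eTIV={etiv}, ASF={asf:.3f}, eTIV*ASF={product:.1f}")

spread = max(products) - min(products)
print(f"spread of products: {spread:.1f}")
```

On these rows the product stays within about one unit of 1755, which suggests the two columns carry nearly the same information. If that holds across the full dataset, it may be reasonable to keep only one of them as a model feature.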

The dataset provided by OASIS is relatively clean and organized. However, some cleaning is required: converting M/F and Group to numeric values, and dropping columns such as MRI ID, Visit, and Hand, which will not be used in our predictions. All of this is done by manipulating the dataframe.

## OASIS-2: Longitudinal MRI Data in Nondemented and Demented Older Adults

In [44]:
import pandas as pd 
pd.options.mode.copy_on_write = True

df = pd.read_csv('oasis_longitudinal.csv')
df

| | Subject ID | MRI ID | Group | Visit | MR Delay | M/F | Hand | Age | EDUC | SES | MMSE | CDR | eTIV | nWBV | ASF |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | OAS2_0001 | OAS2_0001_MR1 | Nondemented | 1 | 0 | M | R | 87 | 14 | 2.0 | 27.0 | 0.0 | 1987 | 0.696 | 0.883 |
| 1 | OAS2_0001 | OAS2_0001_MR2 | Nondemented | 2 | 457 | M | R | 88 | 14 | 2.0 | 30.0 | 0.0 | 2004 | 0.681 | 0.876 |
| 2 | OAS2_0002 | OAS2_0002_MR1 | Demented | 1 | 0 | M | R | 75 | 12 | NaN | 23.0 | 0.5 | 1678 | 0.736 | 1.046 |
| 3 | OAS2_0002 | OAS2_0002_MR2 | Demented | 2 | 560 | M | R | 76 | 12 | NaN | 28.0 | 0.5 | 1738 | 0.713 | 1.010 |
| 4 | OAS2_0002 | OAS2_0002_MR3 | Demented | 3 | 1895 | M | R | 80 | 12 | NaN | 22.0 | 0.5 | 1698 | 0.701 | 1.034 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 368 | OAS2_0185 | OAS2_0185_MR2 | Demented | 2 | 842 | M | R | 82 | 16 | 1.0 | 28.0 | 0.5 | 1693 | 0.694 | 1.037 |
| 369 | OAS2_0185 | OAS2_0185_MR3 | Demented | 3 | 2297 | M | R | 86 | 16 | 1.0 | 26.0 | 0.5 | 1688 | 0.675 | 1.040 |
| 370 | OAS2_0186 | OAS2_0186_MR1 | Nondemented | 1 | 0 | F | R | 61 | 13 | 2.0 | 30.0 | 0.0 | 1319 | 0.801 | 1.331 |
| 371 | OAS2_0186 | OAS2_0186_MR2 | Nondemented | 2 | 763 | F | R | 63 | 13 | 2.0 | 30.0 | 0.0 | 1327 | 0.796 | 1.323 |


In [45]:
# Only use data from each subject's first visit.
# (Copy-on-write was enabled in the first cell, so the assignments below are safe.)
df = df.loc[df['Visit'] == 1]

# Convert M/F to numeric values (M = 1, F = 0).
df['M/F'] = df['M/F'].map({'M': 1, 'F': 0})

# Treat 'Converted' subjects (nondemented at first visit, demented later) as 'Demented'.
df['Group'] = df['Group'].replace('Converted', 'Demented')

# Convert Group to numeric values (Demented = 1, Nondemented = 0).
df['Group'] = df['Group'].map({'Demented': 1, 'Nondemented': 0})

# Drop variables that will not be used in predictions.
df = df.drop(columns=['MRI ID', 'Visit', 'Hand'])

df = df.reset_index(drop=True)

In [46]:
df

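| | Subject ID | Group | MR Delay | M/F | Age | EDUC | SES | MMSE | CDR | eTIV | nWBV | ASF |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | OAS2_0001 | 0 | 0 | 1 | 87 | 14 | 2.0 | 27.0 | 0.0 | 1987 | 0.696 | 0.883 |
| 1 | OAS2_0002 | 1 | 0 | 1 | 75 | 12 | NaN | 23.0 | 0.5 | 1678 | 0.736 | 1.046 |
| 2 | OAS2_0004 | 0 | 0 | 0 | 88 | 18 | 3.0 | 28.0 | 0.0 | 1215 | 0.710 | 1.444 |
| 3 | OAS2_0005 | 0 | 0 | 1 | 80 | 12 | 4.0 | 28.0 | 0.0 | 1689 | 0.712 | 1.039 |
| 4 | OAS2_0007 | 1 | 0 | 1 | 71 | 16 | NaN | 28.0 | 0.5 | 1357 | 0.748 | 1.293 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 145 | OAS2_0182 | 1 | 0 | 1 | 73 | 12 | NaN | 23.0 | 0.5 | 1661 | 0.698 | 1.056 |
| 146 | OAS2_0183 | 0 | 0 | 0 | 66 | 13 | 2.0 | 30.0 | 0.0 | 1495 | 0.746 | 1.174 |
| 147 | OAS2_0184 | 1 | 0 | 0 | 72 | 16 | 3.0 | 24.0 | 0.5 | 1354 | 0.733 | 1.296 |
| 148 | OAS2_0185 | 1 | 0 | 1 | 80 | 16 | 1.0 | 28.0 | 0.5 | 1704 | 0.711 | 1.030 |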
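The cleaned output above still shows empty SES entries, so missing values will need to be handled before modeling. The following is a small sketch of how they could be counted and filled, using only the first five rows shown above rather than the full dataset. Median imputation is just one possible strategy and is an assumption here, not a method this notebook has committed to.

```python
import pandas as pd

# First five rows of the cleaned table above (a small sample, not the full dataset).
sample = pd.DataFrame(
    {
        "Subject ID": ["OAS2_0001", "OAS2_0002", "OAS2_0004", "OAS2_0005", "OAS2_0007"],
        "Group": [0, 1, 0, 0, 1],
        "SES": [2.0, None, 3.0, 4.0, None],
        "MMSE": [27.0, 23.0, 28.0, 28.0, 28.0],
    }
)

# Count missing values per column.
missing_counts = sample.isna().sum()
print(missing_counts)

# One possible strategy (an assumption, not the notebook's chosen method):
# fill missing SES with the median of the observed values.
ses_median = sample["SES"].median()
imputed = sample.assign(SES=sample["SES"].fillna(ses_median))
print(imputed)
```

In this sample both missing SES values belong to subjects labeled as demented, so dropping those rows could shift the class balance. Checking whether the same pattern holds in the full dataset would help decide between dropping and imputing.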


# Ethics & Privacy

Ethics & Privacy Considerations:

Biases/Privacy/Terms of Use Issues with Proposed Data:

1.  The potential datasets considered, such as OASIS MRI, UK Biobank,
     and NACC, may have biases and privacy considerations. For
     instance, the OASIS MRI project's terms of use and participant
     selection could introduce biases. UK Biobank, despite its size,
     might not include a diverse representation of certain populations,
     leading to potential biases in the dataset.

Potential Biases in Dataset Composition and Collection:

2.  Biases may arise in dataset composition and collection, affecting
     the equitable analysis of Alzheimer's prediction. For example, if
     the data predominantly includes participants from specific
     demographic groups, it could introduce biases in the model.
     Additionally, variations in data collection methods across
     different research centers, as in the case of NACC, may impact
     standardization, potentially leading to biases.

Detection and Mitigation of Biases:

3.  To detect biases, the group will conduct a thorough review of the
     dataset sources, including participant demographics and
     recruitment methods. During data preprocessing, the team will
     analyze variables for potential biases, ensuring a balanced
     representation. The group plans to collaborate with experts in the
     field and seek external input to validate the fairness and
     inclusivity of the dataset.

Other Issues Related to Privacy and Equitable Impact:

4.  Privacy concerns arise from the sensitive nature of medical data,
     especially in Alzheimer's research. Ensuring participant anonymity
     and adhering to privacy regulations are paramount. Equitable
     impact considerations involve understanding if the model's
     predictions could disproportionately affect certain groups. It is
     essential to communicate findings responsibly, avoiding
     reinforcing existing biases or stigmatizing specific populations.

Handling Identified Issues:

5.  The group commits to transparently communicating any identified
     biases throughout the research process. Mitigation strategies will
     be implemented during data preprocessing and model development.
     The team will consider alternative datasets or additional sampling
     methods if biases persist. Ethical review boards will be
     consulted, and the group aims to publish findings with a clear
     acknowledgment of potential limitations and biases, promoting
     responsible and equitable use of the predictive model.

In summary, the group is dedicated to addressing ethical concerns
comprehensively, from data collection to analysis and post-analysis.
Transparency, collaboration with experts, and continuous evaluation of
potential biases will guide the research, ensuring responsible and
ethical development of the Alzheimer's prediction model.

# Team Expectations 

1. Clear communication and reasonably prompt responses to messages.
2. Shared Responsibility and Accountability: finish all tasked work by the designated due date.
3. Maintain quality work and attention to detail.
4. Attendance and Participation: everyone should be present on designated meeting days unless they have given prior notice.

# Project Timeline Proposal

| Meeting Date  | Meeting Time| Completed Before Meeting  | Discuss at Meeting |
|---|---|---|---|
| 2/24  | 6 PM  | Import & Wrangle Data (Joseph) | Review/Edit wrangling; Discuss Analysis Plan and EDA   |
| 3/9  | 12 PM  | Finalize wrangling/EDA; Begin Analysis (Shivangi; Zijun) | Discuss/edit Analysis; Complete project check-in |
| 3/13  | 12 PM  | Complete analysis; Draft results/conclusion/discussion (Johnny; Tanishq)| Discuss/edit full project |
| 3/20  | Before 11:59 PM  | NA | Turn in Final Project & Group Project Surveys |