#  **🧠Alzheimer’s Disease Prediction**

This dataset contains health information on 2,149 patients with unique IDs from 4751 to 6899. It covers demographic details, lifestyle factors, medical history, clinical measurements, cognitive and functional assessments, symptoms, and Alzheimer's Disease diagnoses, plus one confidential column identifying the doctor in charge. The dataset is valuable for researchers studying Alzheimer's risk factors, developing predictive models, and conducting statistical analyses.
Artificial Intelligence (AI) is revolutionizing healthcare by enhancing diagnostics, operational efficiency, and personalized treatments. As a data scientist, I will support a research company that uses AI technologies, particularly machine learning and natural language processing, to analyze complex medical data for Alzheimer's Disease (AD).
Using this dataset, I'll train two machine learning models: one to cluster patients (unsupervised learning) and one to predict AD (supervised learning) from each patient's features.

### **📝Data Dictionary:**

**PatientID:** A unique identifier assigned to each patient (4751 to 6899).
_______________________________________________________________________________________________________________________________
Demographic Details:

**Age:** The age of the patients ranges from 60 to 90 years.

**Gender:** Gender of the patients, where 0 represents Male and 1 represents Female.

**Ethnicity:** The ethnicity of the patients, coded as follows:
(0: Caucasian, 1: African American, 2: Asian, 3: Other)

**EducationLevel:** The education level of the patients, coded as follows:
(0: None, 1: High School, 2: Bachelor's, 3: Higher)
_______________________________________________________________________________________________________________________________
Lifestyle Factors:

**BMI:** Body Mass Index of the patients, ranging from 15 to 40.

**Smoking:** Smoking status, where 0 indicates No and 1 indicates Yes.

**AlcoholConsumption:** Weekly alcohol consumption in units, ranging from 0 to 20.

**PhysicalActivity:** Weekly physical activity in hours, ranging from 0 to 10.

**DietQuality:** Diet quality score, ranging from 0 to 10.

**SleepQuality:** Sleep quality score, ranging from 4 to 10.
_______________________________________________________________________________________________________________________________
Medical History:

**FamilyHistoryAlzheimers:** Family history of Alzheimer's Disease, where 0 indicates No and 1 indicates Yes.

**CardiovascularDisease:** Presence of cardiovascular disease, where 0 indicates No and 1 indicates Yes.

**Diabetes:** Presence of diabetes, where 0 indicates No and 1 indicates Yes.

**Depression:** Presence of depression, where 0 indicates No and 1 indicates Yes.

**HeadInjury:** History of head injury, where 0 indicates No and 1 indicates Yes.

**Hypertension:** Presence of hypertension, where 0 indicates No and 1 indicates Yes.
_______________________________________________________________________________________________________________________________
Clinical Measurements:

**SystolicBP:** Systolic blood pressure, ranging from 90 to 180 mmHg.

**DiastolicBP:** Diastolic blood pressure, ranging from 60 to 120 mmHg.

**CholesterolTotal:** Total cholesterol levels, ranging from 150 to 300 mg/dL.

**CholesterolLDL:** Low-density lipoprotein cholesterol levels, ranging from 50 to 200 mg/dL.

**CholesterolHDL:** High-density lipoprotein cholesterol levels, ranging from 20 to 100 mg/dL.

**CholesterolTriglycerides:** Triglycerides levels, ranging from 50 to 400 mg/dL.
_______________________________________________________________________________________________________________________________
Cognitive and Functional Assessments:

**MMSE:** Mini-Mental State Examination score, ranging from 0 to 30. Lower scores indicate cognitive impairment.

**FunctionalAssessment:** Functional assessment score, ranging from 0 to 10. Lower scores indicate greater impairment.

**MemoryComplaints:** Presence of memory complaints, where 0 indicates No and 1 indicates Yes.

**BehavioralProblems:** Presence of behavioral problems, where 0 indicates No and 1 indicates Yes.

**ADL:** Activities of Daily Living score, ranging from 0 to 10. Lower scores indicate greater impairment.
_______________________________________________________________________________________________________________________________
Symptoms:

**Confusion:** Presence of confusion, where 0 indicates No and 1 indicates Yes.

**Disorientation:** Presence of disorientation, where 0 indicates No and 1 indicates Yes.

**PersonalityChanges:** Presence of personality changes, where 0 indicates No and 1 indicates Yes.

**DifficultyCompletingTasks:** Presence of difficulty completing tasks, where 0 indicates No and 1 indicates Yes.

**Forgetfulness:** Presence of forgetfulness, where 0 indicates No and 1 indicates Yes.
_______________________________________________________________________________________________________________________________
Diagnosis Information:

**Diagnosis:** Diagnosis status for Alzheimer's Disease, where 0 indicates No and 1 indicates Yes.
_______________________________________________________________________________________________________________________________
Confidential Information:

**DoctorInCharge:** This column contains confidential information about the doctor in charge, with "XXXConfid" as the value for all patients.
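For plots and summary tables it can help to turn the integer codes back into readable labels. The sketch below uses the code-to-label mappings from the column descriptions above; the helper name `decode_columns` and the two-row sample frame are illustrative assumptions, not part of the dataset or the original notebook.

```python
import pandas as pd

# Code-to-label mappings copied from the column descriptions above.
CODE_LABELS = {
    "Gender": {0: "Male", 1: "Female"},
    "Ethnicity": {0: "Caucasian", 1: "African American", 2: "Asian", 3: "Other"},
    "EducationLevel": {0: "None", 1: "High School", 2: "Bachelor's", 3: "Higher"},
}


def decode_columns(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of `frame` with coded columns replaced by readable labels."""
    decoded = frame.copy()
    for column, labels in CODE_LABELS.items():
        if column in decoded.columns:
            decoded[column] = decoded[column].map(labels)
    return decoded


# Tiny hand-made frame standing in for the real data (illustrative values only).
sample = pd.DataFrame({"Gender": [0, 1], "Ethnicity": [0, 3], "EducationLevel": [2, 1]})
decoded_sample = decode_columns(sample)
print(decoded_sample)
```

The decoded copy is meant for inspection and plotting only; the models below still need the numeric codes.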

Reference:
10.34740/kaggle/dsv/8668279

https://github.com/koiralasandeep/Alzheimers_disease/blob/main/alzheimers_disease_data.csv

### **Import Libraries**

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.metrics import classification_report, roc_auc_score, silhouette_score
from sklearn.preprocessing import PolynomialFeatures
from sklearn.feature_selection import SelectKBest, f_classif
import itertools
import warnings

In [2]:
warnings.filterwarnings("ignore", category=DeprecationWarning)

### **Load and Exploring the dataset**

In [44]:
df=pd.read_csv('alzheimers_disease_data.csv')
df.head()

| | PatientID | Age | Gender | Ethnicity | EducationLevel | BMI | Smoking | AlcoholConsumption | PhysicalActivity | DietQuality | ... | MemoryComplaints | BehavioralProblems | ADL | Confusion | Disorientation | PersonalityChanges | DifficultyCompletingTasks | Forgetfulness | Diagnosis | DoctorInCharge |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 4751 | 73 | 0 | 0 | 2 | 22.927749 | 0 | 13.297218 | 6.327112 | 1.347214 | ... | 0 | 0 | 1.725883 | 0 | 0 | 0 | 1 | 0 | 0 | XXXConfid |
| 1 | 4752 | 89 | 0 | 0 | 0 | 26.827681 | 0 | 4.542524 | 7.619885 | 0.518767 | ... | 0 | 0 | 2.592424 | 0 | 0 | 0 | 0 | 1 | 0 | XXXConfid |
| 2 | 4753 | 73 | 0 | 3 | 1 | 17.795882 | 0 | 19.555085 | 7.844988 | 1.826335 | ... | 0 | 0 | 7.119548 | 0 | 1 | 0 | 1 | 0 | 0 | XXXConfid |
| 3 | 4754 | 74 | 1 | 0 | 1 | 33.800817 | 1 | 12.209266 | 8.428001 | 7.435604 | ... | 0 | 1 | 6.481226 | 0 | 0 | 0 | 0 | 0 | 0 | XXXConfid |
| 4 | 4755 | 89 | 0 | 0 | 0 | 20.716974 | 0 | 18.454356 | 6.310461 | 0.795498 | ... | 0 | 0 | 0.014691 | 0 | 0 | 1 | 1 | 0 | 0 | XXXConfid |


In [27]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2149 entries, 0 to 2148
Data columns (total 35 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   PatientID                  2149 non-null   int64  
 1   Age                        2149 non-null   int64  
 2   Gender                     2149 non-null   int64  
 3   Ethnicity                  2149 non-null   int64  
 4   EducationLevel             2149 non-null   int64  
 5   BMI                        2149 non-null   float64
 6   Smoking                    2149 non-null   int64  
 7   AlcoholConsumption         2149 non-null   float64
 8   PhysicalActivity           2149 non-null   float64
 9   DietQuality                2149 non-null   float64
 10  SleepQuality               2149 non-null   float64
 11  FamilyHistoryAlzheimers    2149 non-null   int64  
 12  CardiovascularDisease      2149 non-null   int64  
 13  Diabetes                   2149 non-null   int64

In this phase, the data is examined to gain insights that help with data preprocessing, feature engineering, and model building. The dataset consists of **2149 rows** and **35 columns**, mixing categorical codes and numerical measurements. The dataset would generally be considered **moderate in size**. It's not too small, but it's also not large enough to be classified as "big data."

In [28]:
df.describe().T

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| PatientID | 2149.0 | 5825.0 | 620.507185 | 4751.0 | 5288.0 | 5825.0 | 6362.0 | 6899.0 |
| Age | 2149.0 | 74.908795 | 8.990221 | 60.0 | 67.0 | 75.0 | 83.0 | 90.0 |
| Gender | 2149.0 | 0.506282 | 0.500077 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 |
| Ethnicity | 2149.0 | 0.697534 | 0.996128 | 0.0 | 0.0 | 0.0 | 1.0 | 3.0 |
| EducationLevel | 2149.0 | 1.286645 | 0.904527 | 0.0 | 1.0 | 1.0 | 2.0 | 3.0 |
| BMI | 2149.0 | 27.655697 | 7.217438 | 15.008851 | 21.611408 | 27.823924 | 33.869778 | 39.992767 |
| Smoking | 2149.0 | 0.288506 | 0.453173 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 |
| AlcoholConsumption | 2149.0 | 10.039442 | 5.75791 | 0.002003 | 5.13981 | 9.934412 | 15.157931 | 19.989293 |
| PhysicalActivity | 2149.0 | 4.920202 | 2.857191 | 0.003616 | 2.570626 | 4.766424 | 7.427899 | 9.987429 |
| DietQuality | 2149.0 | 4.993138 | 2.909055 | 0.009385 | 2.458455 | 5.076087 | 7.558625 | 9.998346 |


This table presents a statistical summary of various health-related metrics for the 2,149 patients in this dataset. Key insights include:

Age and Gender: Patients have **a mean age of approximately 74.9 years**, with **a balanced gender distribution (50.6% female)**.

Health Indicators: **The average BMI is around 27.7**, indicating a tendency towards overweight. Common health issues include **hypertension (14.9% reported)** and **diabetes (15.1%)**.

Lifestyle Factors: **Alcohol consumption shows a mean of 10.04 units per week**, while **smoking prevalence is about 28.9%**.

Cognitive Assessment: **MMSE scores average 14.75**, suggesting potential cognitive impairments in this population.

Assessment Indicators: Variability in psychological factors such as **depression and memory complaints indicates a diverse range of mental health issues** among patients.

Overall, the dataset highlights significant health challenges, particularly related to **age, cognitive function, and lifestyle factors**, which could guide targeted interventions and resource allocation.
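The percentages quoted above follow directly from the `mean` column of `describe()`: for a column coded 0/1, the mean is the share of rows coded 1. A minimal sketch on a made-up five-row frame (the values are illustrative, not taken from the dataset):

```python
import pandas as pd

# Hypothetical 0/1 columns; in the notebook these would come from df.
toy = pd.DataFrame({"Gender": [0, 1, 1, 0, 1], "Smoking": [0, 0, 1, 0, 0]})

# Mean of a binary column = fraction of rows equal to 1.
prevalence_pct = (toy.mean() * 100).round(1)
print(prevalence_pct)
```

Applied to the real data, a Gender mean of 0.506 therefore reads as 50.6% of patients coded 1 (Female).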

### **Data Preprocessing**

In [29]:
df.isnull().sum()

| | 0 |
|---|---|
| PatientID | 0 |
| Age | 0 |
| Gender | 0 |
| Ethnicity | 0 |
| EducationLevel | 0 |
| BMI | 0 |
| Smoking | 0 |
| AlcoholConsumption | 0 |
| PhysicalActivity | 0 |
| DietQuality | 0 |


There are no null values in our columns. Had there been any, I would either drop the affected rows with `df.dropna()` or fill the missing values with the column's mean, median, or mode using an imputer.
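Neither option is needed for this dataset, but as a rough sketch of what they would look like: the toy frame below is invented for illustration, and `SimpleImputer` from scikit-learn is one possible choice of imputer, not necessarily the one the notebook would have used.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Made-up frame with gaps; the real dataset has none.
toy = pd.DataFrame({
    "BMI": [22.0, np.nan, 30.0, 26.0],
    "Smoking": [0.0, 0.0, np.nan, 1.0],
})

# Option 1: drop every row that contains a missing value.
dropped = toy.dropna()

# Option 2: fill gaps per column. Median suits a continuous measurement,
# the most frequent value suits a 0/1 flag.
bmi_filled = SimpleImputer(strategy="median").fit_transform(toy[["BMI"]])
smoking_filled = SimpleImputer(strategy="most_frequent").fit_transform(toy[["Smoking"]])

print(len(dropped), bmi_filled.ravel(), smoking_filled.ravel())
```

In a real workflow the imputer should be fitted on the training split only and then applied to the test split, for the same leakage reasons that motivate the train/test split.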

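### **Splitting the data into train and test sets:**

We split the data into "Train" and "Test" sets before any feature selection or model fitting, so that no information from the test set reaches the training steps. This is done **to prevent data leakage.**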

In [45]:
X = df.drop(['PatientID', 'Diagnosis', 'DoctorInCharge'], axis=1)
y = df['Diagnosis']

In [46]:
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size = 0.2, random_state=42)
#test_size = 0.2: 20% of the data will be used for testing, and the remaining 80% will be used for training.
#random_state=42: ensures that the split is reproducible

### **Feature Engineering**

In [41]:
#Feature Selection (Remove irrelevant features)
selector = SelectKBest(score_func=f_classif, k=20)  # Keep top 20 features
X_train_selected = selector.fit_transform(X_train, y_train)

print("Selected features shape:", X_train_selected.shape)

Selected features shape: (1719, 20)


In [42]:
selected_cols = X.columns[selector.get_support()]

In [50]:
X_train_selected = X_train[selected_cols]
X_test_selected = X_test[selected_cols]
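One optional refinement, not part of the original notebook: when cross-validation or grid search is run later on already-selected features, the selector has seen every training fold, which can make the fold scores slightly optimistic. Placing `SelectKBest` inside the pipeline refits it within each fold. The sketch below assumes synthetic data from `make_classification` as a stand-in for the patient table, and the names `X_demo`, `y_demo`, and `leak_free` are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline

# Synthetic stand-in for the 32 feature columns (an assumption for this sketch).
X_demo, y_demo = make_classification(
    n_samples=400, n_features=32, n_informative=6, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Feature selection is a pipeline step, so it is refitted on each CV training fold.
leak_free = make_pipeline(
    SelectKBest(score_func=f_classif, k=20),
    GradientBoostingClassifier(n_estimators=50, random_state=42),
)
cv_auc = cross_val_score(leak_free, X_tr, y_tr, cv=5, scoring="roc_auc")

leak_free.fit(X_tr, y_tr)
n_kept = int(leak_free.named_steps["selectkbest"].get_support().sum())
print(f"Mean CV AUC: {cv_auc.mean():.4f}, features kept: {n_kept}")
```

With this layout the same pipeline object can be passed to `GridSearchCV`, and `k` itself becomes tunable via the `selectkbest__k` parameter name.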

# **Model Training (Supervised Mode)**
The goal is to train models on labeled data (the `Diagnosis` column) so they can predict the diagnosis for patients they have not seen.

In [52]:
#Model selection
rf = make_pipeline(StandardScaler(), RandomForestClassifier(n_estimators=100, random_state=42))
gb = make_pipeline(StandardScaler(), GradientBoostingClassifier(n_estimators=100, random_state=42))
models = {
    'RandomForest': rf,
    'GradientBoosting': gb
}

results = {}
for name, model in models.items():
    model.fit(X_train_selected, y_train)
    # Score with ROC AUC on the predicted probability of the positive class (AD = 1)
    results[name] = roc_auc_score(y_test, model.predict_proba(X_test_selected)[:, 1])

In [53]:
best_model = max(results, key=results.get)
print("\nModel AUC Scores:")
print(results)
print(f"Best Model based on AUC: {best_model}")


Model AUC Scores:
{'RandomForest': np.float64(0.9521365706330667), 'GradientBoosting': np.float64(0.9505792690120574)}
Best Model based on AUC: RandomForest


The code above compares models on a binary classification task using ROC AUC, which measures how well the predicted probabilities rank AD patients above non-AD patients. I trained several machine learning models, including Logistic Regression (0.8944), SVM (0.8971), KNN (0.7491), Decision Tree (0.8927), Random Forest (0.9499), and Gradient Boosting (0.9553). These scores come from a separate run and differ slightly from the output above, where Random Forest reached 0.9521 and Gradient Boosting 0.9506. After also evaluating precision, recall, and F1-score, I identified **Gradient Boosting** and **Random Forest** as the top two models: both score about 0.95 AUC, well ahead of the other four.
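AUC and accuracy answer different questions, which is why both are worth reporting. AUC looks only at how the scores rank positives against negatives, while accuracy depends on the decision threshold. A small hand-made example (the labels and scores are invented for illustration):

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.6, 0.7, 0.8])  # predicted probability of class 1

# Every positive is scored above every negative, so the ranking is perfect.
auc = roc_auc_score(y_true, y_score)

# At the default 0.5 threshold the second negative (0.6) is misclassified.
acc = accuracy_score(y_true, (y_score >= 0.5).astype(int))

print(f"AUC = {auc:.2f}, accuracy at 0.5 threshold = {acc:.2f}")
```

Here the AUC is 1.0 while accuracy is 0.75, so a high AUC alone does not guarantee that the default threshold produces good class labels.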

### **Hyperparameter Tuning**

In [55]:
# Cross-validation comparison
print("\nCross-Validation Scores:")
for name, model in models.items():
    scores = cross_val_score(model, X_train_selected, y_train, cv=5)  # 5-fold cross-validation
    print(f"{name}: Mean Accuracy = {scores.mean():.4f}")


Cross-Validation Scores:
RandomForest: Mean Accuracy = 0.9401
GradientBoosting: Mean Accuracy = 0.9395


The 5-fold cross-validation on the training set shows nearly identical mean accuracy for

**the best performers: Random Forest: 0.9401, Gradient Boosting: 0.9395**

Both Gradient Boosting and Random Forest demonstrate strong predictive power and reliability, making them excellent choices for further analysis.
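By default `cross_val_score` reports accuracy for classifiers, whereas the earlier comparison used AUC. A sketch of how both metrics, with their spread across folds, could be collected in one pass using `cross_validate`; the synthetic data is an assumption standing in for `X_train_selected` and `y_train`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate

# Synthetic stand-in data (an assumption for this sketch).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=20, n_informative=5, random_state=42
)

model = RandomForestClassifier(n_estimators=50, random_state=42)
cv_scores = cross_validate(model, X_demo, y_demo, cv=5, scoring=["accuracy", "roc_auc"])

# cross_validate returns one array per metric under the key "test_<metric>".
for metric in ("accuracy", "roc_auc"):
    fold_scores = cv_scores[f"test_{metric}"]
    print(f"{metric}: mean = {fold_scores.mean():.4f}, std = {fold_scores.std():.4f}")
```

When two models differ by less than the fold-to-fold standard deviation, as is likely for a gap of 0.0006 in mean accuracy, the ranking between them should not be treated as settled.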

In [56]:
#Gradient Boosting Tuning (Best Model)
param_grid = {
    'gradientboostingclassifier__n_estimators': [50, 100, 200],
    'gradientboostingclassifier__learning_rate': [0.01, 0.1, 0.2],
    'gradientboostingclassifier__max_depth': [3, 4, 5]
}

In [58]:
grid_gb = GridSearchCV(estimator=gb, param_grid=param_grid, cv=5, scoring='roc_auc', n_jobs=-1)
grid_gb.fit(X_train_selected, y_train)

print("Best Gradient Boosting Params:", grid_gb.best_params_)
print("Best Score:", grid_gb.best_score_)

Best Gradient Boosting Params: {'gradientboostingclassifier__learning_rate': 0.01, 'gradientboostingclassifier__max_depth': 5, 'gradientboostingclassifier__n_estimators': 100}
Best Score: 0.9550168146041049
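Beyond the single best combination, the fitted search object keeps the score of every combination in `cv_results_`, which shows how close the runners-up are. A minimal sketch on synthetic data with a deliberately small grid; the data and the names `pipe` and `small_grid` are assumptions for illustration.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption for this sketch).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=20, n_informative=5, random_state=42
)

pipe = make_pipeline(StandardScaler(), GradientBoostingClassifier(random_state=42))
small_grid = {
    "gradientboostingclassifier__n_estimators": [25, 50],
    "gradientboostingclassifier__learning_rate": [0.05, 0.1],
}
grid = GridSearchCV(pipe, small_grid, cv=3, scoring="roc_auc")
grid.fit(X_demo, y_demo)

# One row per parameter combination, best-ranked first.
results_table = (
    pd.DataFrame(grid.cv_results_)[
        ["params", "mean_test_score", "std_test_score", "rank_test_score"]
    ]
    .sort_values("rank_test_score")
    .reset_index(drop=True)
)
print(results_table)
```

The same table built from `grid_gb.cv_results_` would list all 27 combinations of the grid defined above.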


### **Final Evaluation on Test Set to Predict AD (Supervised Learning)**

In [59]:
#predicted probabilities
y_pred_best = grid_gb.predict(X_test_selected)
y_pred_proba = grid_gb.predict_proba(X_test_selected)[:, 1]

In [60]:
print("\nFinal Model Evaluation Results:")
print("Classification Report:")
print(classification_report(y_test, y_pred_best))


Final Model Evaluation Results:
Classification Report:
              precision    recall  f1-score   support

           0       0.94      0.97      0.96       277
           1       0.94      0.90      0.92       153

    accuracy                           0.94       430
   macro avg       0.94      0.93      0.94       430
weighted avg       0.94      0.94      0.94       430



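The tuned **Gradient Boosting** model performs well on the held-out test set, reaching **94% accuracy** with the selected features and tuned hyperparameters. For the AD class (label 1) precision is 0.94 and recall is 0.90, so roughly one in ten AD cases in the test set is missed.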
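The precision and recall figures in a classification report come from the four counts of the confusion matrix. The sketch below works through that arithmetic on eight made-up labels (not the notebook's predictions):

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Invented labels for illustration: 1 = AD, 0 = no AD.
y_true = [0, 0, 0, 1, 1, 1, 1, 0]
y_pred = [0, 0, 1, 1, 1, 0, 1, 0]

# For binary labels, ravel() yields the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)  # of the predicted AD cases, how many are real
recall = tp / (tp + fn)     # of the real AD cases, how many were found
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"tn={tn} fp={fp} fn={fn} tp={tp}")
print(f"precision={precision:.2f} recall={recall:.2f} accuracy={accuracy:.2f}")
```

Running `confusion_matrix(y_test, y_pred_best)` on the real predictions would show the false negatives directly, which matter most in a screening setting because they are missed AD cases.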