# Alzheimer's Detection

This notebook develops a machine learning system that detects signs of Alzheimer’s disease (AD) from handwriting and drawing patterns. Classification models are trained on the DARWIN dataset to predict whether a subject shows early signs of AD.

The DARWIN dataset contains 174 labeled samples, each covering 25 cognitive and motor tasks such as drawing, writing, and retracing forms. Each task contributes 18 kinematic features, for 450 features per sample. These include features like:
- total task time (`total_time`)
- pen pressure (`pressure_mean`, `pressure_var`)
- acceleration (`mean_acc_in_air`, `mean_acc_on_paper`)
- air time (`air_time`)

These features help the models identify patterns associated with cognitive decline. The goal is to train and evaluate classification models, starting with simple baselines and progressing toward more advanced models, to assess the feasibility of handwriting as a non-invasive diagnostic tool for early AD detection. The results will help determine whether handwriting-based screening can complement traditional diagnostic tools in real-world clinical settings.

### Import required libraries

In [1]:
import zipfile
import tempfile
import pandas as pd 
import matplotlib.pyplot as plt 
import seaborn as sns 
from pathlib import Path

# Preprocessing tools
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA 

# Hyperparameter Tuning and Pipelines
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split, GridSearchCV

# Models to train and classify with
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier

# Scoring and Evaluation
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

### Import the Data

In [2]:
# Define OS-independent path to the zipped dataset
zip_path = Path.cwd() / 'data' / 'darwin-alzheimers.zip'

# Create a temp directory to extract to
with tempfile.TemporaryDirectory() as temp_output_dir:
    # Extract the zip file
    with zipfile.ZipFile(zip_path, 'r') as zip_ref:
        zip_ref.extractall(temp_output_dir)

    # Get data CSV filepath
    file_path = Path(temp_output_dir) / 'data.csv'

    # Load into pandas DataFrame while the temp directory still exists
    data = pd.read_csv(file_path)

In [3]:
# Preview the first five rows
data.head()

|  | ID | air_time1 | disp_index1 | gmrt_in_air1 | gmrt_on_paper1 | max_x_extension1 | max_y_extension1 | mean_acc_in_air1 | mean_acc_on_paper1 | mean_gmrt1 | ... | mean_jerk_in_air25 | mean_jerk_on_paper25 | mean_speed_in_air25 | mean_speed_on_paper25 | num_of_pendown25 | paper_time25 | pressure_mean25 | pressure_var25 | total_time25 | class |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | id_1 | 5160 | 1.3e-05 | 120.804174 | 86.853334 | 957 | 6601 | 0.3618 | 0.217459 | 103.828754 | ... | 0.141434 | 0.024471 | 5.596487 | 3.184589 | 71 | 40120 | 1749.278166 | 296102.7676 | 144605 | P |
| 1 | id_2 | 51980 | 1.6e-05 | 115.318238 | 83.448681 | 1694 | 6998 | 0.272513 | 0.14488 | 99.383459 | ... | 0.049663 | 0.018368 | 1.665973 | 0.950249 | 129 | 126700 | 1504.768272 | 278744.285 | 298640 | P |
| 2 | id_3 | 2600 | 1e-05 | 229.933997 | 172.761858 | 2333 | 5802 | 0.38702 | 0.181342 | 201.347928 | ... | 0.178194 | 0.017174 | 4.000781 | 2.392521 | 74 | 45480 | 1431.443492 | 144411.7055 | 79025 | P |
| 3 | id_4 | 2130 | 1e-05 | 369.403342 | 183.193104 | 1756 | 8159 | 0.556879 | 0.164502 | 276.298223 | ... | 0.113905 | 0.01986 | 4.206746 | 1.613522 | 123 | 67945 | 1465.843329 | 230184.7154 | 181220 | P |
| 4 | id_5 | 2310 | 7e-06 | 257.997131 | 111.275889 | 987 | 4732 | 0.266077 | 0.145104 | 184.63651 | ... | 0.121782 | 0.020872 | 3.319036 | 1.680629 | 92 | 37285 | 1841.702561 | 158290.0255 | 72575 | P |


In [4]:
# Get descriptive statistics on data
data.describe()

|  | air_time1 | disp_index1 | gmrt_in_air1 | gmrt_on_paper1 | max_x_extension1 | max_y_extension1 | mean_acc_in_air1 | mean_acc_on_paper1 | mean_gmrt1 | mean_jerk_in_air1 | ... | mean_gmrt25 | mean_jerk_in_air25 | mean_jerk_on_paper25 | mean_speed_in_air25 | mean_speed_on_paper25 | num_of_pendown25 | paper_time25 | pressure_mean25 | pressure_var25 | total_time25 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| count | 174.0 | 174.0 | 174.0 | 174.0 | 174.0 | 174.0 | 174.0 | 174.0 | 174.0 | 174.0 | ... | 174.0 | 174.0 | 174.0 | 174.0 | 174.0 | 174.0 | 174.0 | 174.0 | 174.0 | 174.0 |
| mean | 5664.166667 | 1e-05 | 297.666685 | 200.504413 | 1977.965517 | 7323.896552 | 0.416374 | 0.179823 | 249.085549 | 0.067556 | ... | 221.360646 | 0.148286 | 0.019934 | 4.472643 | 2.871613 | 85.83908 | 43109.712644 | 1629.585962 | 163061.76736 | 164203.3 |
| std | 12653.772746 | 3e-06 | 183.943181 | 111.629546 | 1648.306365 | 2188.290512 | 0.381837 | 0.064693 | 132.698462 | 0.074776 | ... | 63.762013 | 0.062207 | 0.002388 | 1.501411 | 0.852809 | 27.485518 | 19092.024337 | 324.142316 | 56845.610814 | 496939.7 |
| min | 65.0 | 2e-06 | 28.734515 | 29.935835 | 754.0 | 561.0 | 0.067748 | 0.096631 | 41.199445 | 0.011861 | ... | 69.928033 | 0.030169 | 0.014987 | 1.323565 | 0.950249 | 32.0 | 15930.0 | 474.049462 | 26984.92666 | 29980.0 |
| 25% | 1697.5 | 8e-06 | 174.153023 | 136.524742 | 1362.5 | 6124.0 | 0.218209 | 0.146647 | 161.136182 | 0.029523 | ... | 178.798382 | 0.107732 | 0.018301 | 3.485934 | 2.401199 | 66.0 | 32803.75 | 1499.112088 | 120099.0468 | 59175.0 |
| 50% | 2890.0 | 9e-06 | 255.791452 | 176.494494 | 1681.0 | 6975.5 | 0.275184 | 0.163659 | 224.445268 | 0.039233 | ... | 217.431621 | 0.140483 | 0.019488 | 4.510578 | 2.830672 | 81.0 | 37312.5 | 1729.38501 | 158236.7718 | 76115.0 |
| 75% | 4931.25 | 1.1e-05 | 358.917885 | 234.05256 | 2082.75 | 8298.5 | 0.442706 | 0.188879 | 294.392298 | 0.071057 | ... | 264.310776 | 0.199168 | 0.021134 | 5.212794 | 3.335828 | 101.5 | 46533.75 | 1865.626974 | 200921.078475 | 127542.5 |
| max | 109965.0 | 2.8e-05 | 1168.328276 | 865.210522 | 18602.0 | 15783.0 | 2.772566 | 0.62735 | 836.784702 | 0.543199 | ... | 437.373267 | 0.375078 | 0.029227 | 10.416715 | 5.602909 | 209.0 | 139575.0 | 1999.775983 | 352981.85 | 5704200.0 |


In [6]:
# Check for missing values
missing = data.isna().sum()

missing[missing > 0]

Series([], dtype: int64)

In [9]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 174 entries, 0 to 173
Columns: 452 entries, ID to class
dtypes: float64(300), int64(150), object(2)
memory usage: 614.6+ KB
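
The 452 columns reported above are consistent with an `ID` column, a `class` column, and 18 features for each of 25 tasks, where the trailing number in a column name (for example `air_time1` or `total_time25`) identifies the task. A minimal sketch of how that layout could be checked is shown below. The helper `group_columns_by_task` is hypothetical and not part of the notebook, and the example uses only a handful of column names taken from the output above.

```python
import re
from collections import defaultdict


def group_columns_by_task(columns):
    """Map task number -> list of feature names, using the trailing digits."""
    pattern = re.compile(r"^(?P<feature>.+?)(?P<task>\d+)$")
    by_task = defaultdict(list)
    for col in columns:
        match = pattern.match(col)
        if match:  # columns without a trailing number ("ID", "class") are skipped
            by_task[int(match.group("task"))].append(match.group("feature"))
    return dict(by_task)


# A few column names copied from the outputs above (not the full set)
example_columns = ["ID", "air_time1", "disp_index1",
                   "total_time25", "pressure_mean25", "class"]
grouped = group_columns_by_task(example_columns)
print(grouped)
```

On the full DataFrame, calling `group_columns_by_task(data.columns)` would be expected to yield 25 keys with 18 feature names each if the dataset matches the description in the introduction.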


### Preprocess Data
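
A minimal sketch of one possible preprocessing step, assuming the `ID` column carries no predictive information and the `class` column is encoded as a binary target. The helper `split_features_and_labels` is hypothetical. Only the label `P` appears in the output above, so the `H` label in the toy frame is an assumption made for illustration.

```python
import pandas as pd


def split_features_and_labels(df):
    """Return the numeric feature matrix and a 0/1 target (1 = class "P")."""
    X = df.drop(columns=["ID", "class"])
    y = (df["class"] == "P").astype(int)
    return X, y


# Toy frame with the same column layout as the real data
toy = pd.DataFrame({
    "ID": ["id_1", "id_2", "id_3"],
    "air_time1": [5160, 51980, 2600],
    "total_time25": [144605, 298640, 79025],
    "class": ["P", "P", "H"],  # "H" is an assumed label for the non-patient class
})

X_toy, y_toy = split_features_and_labels(toy)
print(X_toy.columns.tolist(), y_toy.tolist())
```

Feature scaling is left out of this step on purpose: fitting a `StandardScaler` inside a `Pipeline` after the train/test split keeps test-set statistics from leaking into training.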

### Generate Train/Test splits
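
A minimal sketch of a stratified split using the imported `train_test_split`. The 80/20 ratio and `random_state=42` are assumptions, and the arrays are synthetic stand-ins with the same shape as the DARWIN feature matrix (174 samples, 450 features).

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data; the real X and y would come from the DataFrame
rng = np.random.default_rng(0)
X = rng.normal(size=(174, 450))
y = np.array([1] * 87 + [0] * 87)  # assumed class balance, for illustration only

# stratify=y keeps the class proportions similar in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
print(X_train.shape, X_test.shape)
```

With only 174 samples, stratifying helps keep both classes represented in the small test set.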

### Train Models
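
A minimal sketch of training a few baseline classifiers from the imports above, each wrapped in a pipeline that standardizes the features first. The data is synthetic (`make_classification`), and the model settings are assumptions rather than tuned values.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the handwriting features
X, y = make_classification(n_samples=174, n_features=50,
                           n_informative=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Baseline models with assumed, untuned settings
models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "knn": KNeighborsClassifier(n_neighbors=5),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=42),
}

test_accuracy = {}
for name, model in models.items():
    # The scaler is fit on the training split only, inside the pipeline
    clf = Pipeline([("scaler", StandardScaler()), ("model", model)])
    clf.fit(X_train, y_train)
    test_accuracy[name] = clf.score(X_test, y_test)

for name, acc in test_accuracy.items():
    print(f"{name}: {acc:.3f}")
```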

### Evaluation
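
A minimal sketch of how the imported metrics could be applied to a model's predictions. The label vectors are made up for illustration, and the names `healthy` and `patient` are assumed display names for classes 0 and 1.

```python
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Made-up labels: 1 = patient, 0 = healthy (assumed encoding)
y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 1]

acc = accuracy_score(y_true, y_pred)
# Rows are true classes, columns are predicted classes, in label order [0, 1]
cm = confusion_matrix(y_true, y_pred)
report = classification_report(y_true, y_pred,
                               target_names=["healthy", "patient"])

print(f"accuracy: {acc:.3f}")
print(cm)
print(report)
```

For a screening task, the per-class recall in the report is worth reading alongside accuracy, since a missed patient and a false alarm have different costs.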

### Pipeline for Preprocessing and Hyperparameter Tuning
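
A minimal sketch of combining the imported `StandardScaler`, `PCA`, and `SVC` in a `Pipeline` and tuning it with `GridSearchCV`. The parameter grid, the 5-fold cross-validation, and the synthetic data are all assumptions chosen to keep the example small.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the handwriting features
X, y = make_classification(n_samples=174, n_features=50,
                           n_informative=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

pipe = Pipeline([
    ("scaler", StandardScaler()),  # standardize each feature
    ("pca", PCA()),                # reduce dimensionality
    ("svc", SVC()),                # classifier under tuning
])

# Step-name prefixes route each parameter to the matching pipeline step
param_grid = {
    "pca__n_components": [5, 10],
    "svc__C": [0.1, 1, 10],
    "svc__kernel": ["linear", "rbf"],
}

search = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy")
search.fit(X_train, y_train)

print(search.best_params_)
print(f"held-out accuracy: {search.score(X_test, y_test):.3f}")
```

Because scaling and PCA sit inside the pipeline, they are refit on each cross-validation training fold, so the validation folds never influence the preprocessing.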

### Model Selection

### Final Results