In [None]:
from ucimlrepo import fetch_ucirepo 
  
# fetch dataset 
darwin = fetch_ucirepo(id=732) 
  
# data (as pandas dataframes) 
X = darwin.data.features 
y = darwin.data.targets 
  
# metadata 
print(darwin.metadata) 
  
# variable information 
print(darwin.variables) 

In [27]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

## The DARWIN Dataset

We'll be working with the DARWIN dataset, which contains handwriting data from 174 participants. The task is to classify the participants, distinguishing Alzheimer's disease patients from healthy people.

Let's have a look at the dataset.

In [None]:
darwin.variables

Unnamed: 0,name,role,type,demographic,description,units,missing_values
0,ID,Feature,Categorical,,,,no
1,air_time1,Feature,Integer,,,,no
2,disp_index1,Feature,Categorical,,,,no
3,gmrt_in_air1,Feature,Continuous,,,,no
4,gmrt_on_paper1,Feature,Continuous,,,,no
...,...,...,...,...,...,...,...
447,paper_time25,Feature,Integer,,,,no
448,pressure_mean25,Feature,Continuous,,,,no
449,pressure_var25,Feature,Continuous,,,,no
450,total_time25,Feature,Integer,,,,no


### What the hell are these features?

Damn... 450 is a lot of features. But wait, isn't the dataset composed of handwritten data? What kind of features are these? "gmrt_in_air1" and "paper_time25" are not attributes I associate with that kind of data. As someone who initially thought this might be a computer vision problem, some sort of MNIST 2.0, I am quite lost.

### Don't lose hope

Ok, let's try to interpret these features on our own. Later we can look up what they really are.

In [42]:
feature_names = X.columns.tolist()
for i in range(1, 25):
    print(feature_names[i])

air_time1
disp_index1
gmrt_in_air1
gmrt_on_paper1
max_x_extension1
max_y_extension1
mean_acc_in_air1
mean_acc_on_paper1
mean_gmrt1
mean_jerk_in_air1
mean_jerk_on_paper1
mean_speed_in_air1
mean_speed_on_paper1
num_of_pendown1
paper_time1
pressure_mean1
pressure_var1
total_time1
air_time2
disp_index2
gmrt_in_air2
gmrt_on_paper2
max_x_extension2
max_y_extension2


### What do they mean?
Two things stand out from the features. First, they look like measurements of a pen on paper: the acceleration of the pen on the paper (mean_acc_on_paper), the mean speed of the pen on the paper (mean_speed_on_paper) and in the air (mean_speed_in_air), the pressure of the pen on the paper (pressure_mean), and so on. All of this can be captured with a digital tablet, so it seems something like that was used to collect the data.
Second, every feature name ends in a number, and the names repeat: air_time1, air_time2, gmrt_on_paper1, gmrt_on_paper2. We can count 18 distinct features for number 1. Let's see if this pattern repeats itself.



The pattern repeats up to number 25, so the conclusion would be that the participants were asked to write 25 sentences or words, and the same features were measured each time.


In [None]:
feature_names = X.columns.tolist()[1:]  # drop the ID column
n_features, n_tasks = 18, 25
largest_word = len(max(feature_names, key=len))
# One row per feature, one column per task (columns are stored task by task)
for feature in range(n_features):
    row = ""
    for task in range(n_tasks):
        name = feature_names[task * n_features + feature]
        row += name.ljust(largest_word) + " | "
    print(row)
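
The same pattern can also be checked without reading the grid by eye. The sketch below splits each column name into a base name and a trailing task number and collects the task numbers seen for each base name. The helpers `split_feature_name` and `tasks_per_feature` and the short `example_columns` list are made up for illustration; with the real data one would pass `X.columns[1:]` instead, assuming every column except `ID` follows the `<name><task number>` pattern.

```python
import re
from collections import defaultdict

def split_feature_name(column):
    """Split 'air_time25' into ('air_time', 25). Hypothetical helper."""
    match = re.fullmatch(r"(.+?)(\d+)", column)
    if match is None:
        raise ValueError(f"no trailing task number in {column!r}")
    return match.group(1), int(match.group(2))

def tasks_per_feature(columns):
    """Map each base feature name to the sorted task numbers it appears with."""
    tasks = defaultdict(set)
    for column in columns:
        base, task = split_feature_name(column)
        tasks[base].add(task)
    return {base: sorted(numbers) for base, numbers in tasks.items()}

# Toy input built from a few of the column names printed above (not the full dataset).
example_columns = ["air_time1", "disp_index1", "total_time1",
                   "air_time2", "disp_index2", "total_time2"]
summary = tasks_per_feature(example_columns)
print(len(summary), "base features")
for base, numbers in summary.items():
    print(base, numbers)
```

On the full dataset, 18 base features that each list tasks 1 through 25 would confirm the guess.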

These seem like good assumptions. Now let's search the web for some answers.

### The answers...
The article [Handwriting Task-Selection based on the Analysis of Patterns in Classification Results on Alzheimer Dataset](https://ceur-ws.org/Vol-3521/paper2.pdf) has a good description of how the data and experiments were set up. The article tries to do exactly what we're trying to do here, so reading all of it feels like cheating. Let's only read a few things about the features to understand them better. We were really close with our assumptions: participants were asked to perform 25 handwriting tasks, and from each task 18 features were extracted using a digital tablet (WACOM Bamboo Folio).

Extracted from the article: "The extracted features are related to the time spent to complete the task (total time); such time were also divided in on-paper time and in-air time; the average speed, acceleration and jerk computed separately for on-paper movements and in-air movements; mean and variance of the pressure; the Generalization of the Mean Relative Tremor (GMRT), which consists of the average of the sum of the differences between the i-th point and its d-th predecessor, firstly divided on-paper and in-air values, and the averaging the previous values; the maximal extension about x and y axis; and finally the Dispersion Index which consists of dividing the paper sheet in fixed-size boxes (e.g. 3x3) then counting how many boxes are covered by the handwriting traits and successively dividing that number with the total amount of the boxes; in this way, the coverage ratio of the paper sheet is computed."
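
To make the Dispersion Index description more concrete, here is a minimal sketch of one way to read it. This is not the authors' code: the function name `dispersion_index`, the `(x, y)` sample format, the sheet dimensions, and the box size are all assumptions chosen for illustration.

```python
import numpy as np

def dispersion_index(x, y, width, height, box_size):
    """Fraction of fixed-size boxes on the sheet touched by at least one pen sample.

    Hypothetical reading of the quoted description; all parameters are assumptions.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n_cols = int(np.ceil(width / box_size))
    n_rows = int(np.ceil(height / box_size))
    # Box index of every sample, clipped so points on the far edge stay on the sheet
    cols = np.clip((x // box_size).astype(int), 0, n_cols - 1)
    rows = np.clip((y // box_size).astype(int), 0, n_rows - 1)
    covered = len(set(zip(rows.tolist(), cols.tolist())))
    return covered / (n_rows * n_cols)

# Toy stroke on a sheet split into 3x3 boxes: the samples fall into 2 of the 9 boxes.
ratio = dispersion_index(x=[0.5, 1.5, 0.6], y=[0.5, 0.5, 0.4],
                         width=3, height=3, box_size=1)
print(ratio)
```

The result is a coverage ratio between 0 and 1, which matches the article's description of the feature as the share of the sheet covered by handwriting.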

In [17]:
X.head()

Unnamed: 0,ID,air_time1,disp_index1,gmrt_in_air1,gmrt_on_paper1,max_x_extension1,max_y_extension1,mean_acc_in_air1,mean_acc_on_paper1,mean_gmrt1,...,mean_gmrt25,mean_jerk_in_air25,mean_jerk_on_paper25,mean_speed_in_air25,mean_speed_on_paper25,num_of_pendown25,paper_time25,pressure_mean25,pressure_var25,total_time25
0,id_1,5160,1.3e-05,120.804174,86.853334,957,6601,0.3618,0.217459,103.828754,...,249.729085,0.141434,0.024471,5.596487,3.184589,71,40120,1749.278166,296102.7676,144605
1,id_2,51980,1.6e-05,115.318238,83.448681,1694,6998,0.272513,0.14488,99.383459,...,77.258394,0.049663,0.018368,1.665973,0.950249,129,126700,1504.768272,278744.285,298640
2,id_3,2600,1e-05,229.933997,172.761858,2333,5802,0.38702,0.181342,201.347928,...,193.667018,0.178194,0.017174,4.000781,2.392521,74,45480,1431.443492,144411.7055,79025
3,id_4,2130,1e-05,369.403342,183.193104,1756,8159,0.556879,0.164502,276.298223,...,163.065803,0.113905,0.01986,4.206746,1.613522,123,67945,1465.843329,230184.7154,181220
4,id_5,2310,7e-06,257.997131,111.275889,987,4732,0.266077,0.145104,184.63651,...,147.094679,0.121782,0.020872,3.319036,1.680629,92,37285,1841.702561,158290.0255,72575


# Missing Data

In [14]:
total = X.isnull().sum().sort_values(ascending=False)
percent = (X.isnull().sum()/X.isnull().count()).sort_values(ascending=False)
missing_data = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
missing_data.head(5)

Unnamed: 0,Total,Percent
ID,0,0.0
num_of_pendown19,0,0.0
disp_index18,0,0.0
air_time18,0,0.0
total_time17,0,0.0
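
When the only question is whether anything is missing at all, a single boolean can be enough. The sketch below runs on a small made-up frame (`toy`) rather than on `X`, so the values are illustrative only.

```python
import numpy as np
import pandas as pd

# Made-up frame with one missing value, for illustration only.
toy = pd.DataFrame({"air_time1": [5160, 51980, 2600],
                    "pressure_mean1": [1749.3, np.nan, 1431.4]})

has_missing = bool(toy.isna().any().any())  # True if any cell is NaN
missing_share = toy.isna().mean()           # fraction of missing values per column

print(has_missing)
print(missing_share)
```

Applied to the real data, `X.isna().any().any()` should agree with the table above, where every count is zero.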


# No missing data

Unnamed: 0,ID,air_time1,disp_index1,gmrt_in_air1,gmrt_on_paper1,max_x_extension1,max_y_extension1,mean_acc_in_air1,mean_acc_on_paper1,mean_gmrt1,...,mean_gmrt25,mean_jerk_in_air25,mean_jerk_on_paper25,mean_speed_in_air25,mean_speed_on_paper25,num_of_pendown25,paper_time25,pressure_mean25,pressure_var25,total_time25
0,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
1,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
2,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
3,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
4,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
169,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
170,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
171,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
172,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
