# Modelling data

- Select features for training and testing the model
- Normalize values on a *per-column* basis to the range $[0, 1]$

*Do features need to be dimensionally reduced?*
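
One way to approach this question is to check how much variance a handful of principal components captures. The sketch below is an illustration only, not part of the original pipeline: it runs on synthetic stand-in data, and `cumulative_explained_variance`, `X_demo` and the 95% threshold are hypothetical names and choices.

```python
import numpy as np

def cumulative_explained_variance(X):
    """Cumulative fraction of total variance captured by the leading principal components."""
    X_centered = X - X.mean(axis=0)
    # Squared singular values of the centered matrix are proportional
    # to the variance along each principal component.
    singular_values = np.linalg.svd(X_centered, compute_uv=False)
    variances = singular_values ** 2
    return np.cumsum(variances) / variances.sum()

# Synthetic stand-in: 100 samples, 10 features driven by 2 latent factors plus small noise
rng = np.random.default_rng(0)
latent = rng.normal(size=(100, 2))
mixing = rng.normal(size=(2, 10))
X_demo = latent @ mixing + 0.01 * rng.normal(size=(100, 10))

cum_var = cumulative_explained_variance(X_demo)
# Smallest number of components whose cumulative variance reaches 95%
n_components_95 = int(np.searchsorted(cum_var, 0.95) + 1)
print(n_components_95, "components reach 95% of the variance")
```

If the same check on the real feature matrix needed far fewer components than there are features, that would suggest dimensionality reduction is worth trying.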

In [27]:
import numpy as np
import pandas as pd

import os
import pickle

In [28]:
df = pd.read_csv('working/combined.csv')

### Re-order columns

Separate the columns into risk factors, cognitive features, and MRI features.

In [29]:
df.shape

(564, 197)

In [30]:
# reorder columns
cols = df.columns.to_list()

mri_cols = cols[1:178]
rest_cols = cols[178:]
risk_cols = [
    'RID', 'AGE', 'PTGENDER',
    'PTEDUCAT', 'MOTHDEM', 'FATHDEM'
]
cognitive_cols = [c for c in rest_cols if c not in risk_cols]

In [31]:
cols = risk_cols + cognitive_cols + mri_cols
df = df.reindex(columns=cols)

In [32]:
df.head()

Unnamed: 0,RID,AGE,PTGENDER,PTEDUCAT,MOTHDEM,FATHDEM,PHC_MEM,PHC_EXF,PHC_LAN,AD_LABEL,...,wm-rh-superiorfrontal,wm-rh-superiorparietal,wm-rh-superiortemporal,wm-rh-supramarginal,wm-rh-frontalpole,wm-rh-temporalpole,wm-rh-transversetemporal,wm-rh-insula,wm-lh-Unsegmented,wm-rh-Unsegmented
0,21,84.8186,2,18.0,0.0,0.0,1.173,-0.15,0.666,1.0,...,1.134314,1.018792,1.385129,1.172204,1.206521,1.469661,1.18214,1.214626,1.614727,1.587751
1,31,91.3073,2,18.0,0.0,0.0,1.038,-0.318,0.269,1.0,...,1.27884,1.128747,1.351128,1.211336,1.434194,1.470025,1.788683,1.324016,0.730924,0.814179
2,56,83.7563,2,13.0,0.0,0.0,0.349,0.09,0.155,2.0,...,1.260969,1.316324,1.271024,1.296633,1.447026,1.191524,1.020604,1.384523,1.323594,1.33008
3,59,84.9665,2,13.0,0.0,0.0,0.485,0.405,0.236,2.0,...,1.266842,1.245742,1.207279,1.322977,1.284196,1.209476,1.008124,1.253794,1.447955,1.407817
4,69,87.0582,1,16.0,0.0,0.0,0.087,1.025,1.06,2.0,...,1.381038,1.41893,1.426585,1.400109,1.566343,1.650883,1.144346,1.416124,0.863427,0.926379


### Gender

Recode gender as a binary value in $\{0, 1\}$, where:
- $0 \rightarrow$ Male
- $1 \rightarrow$ Female

In [33]:
df['PTGENDER'] = df['PTGENDER'] - 1
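
Subtracting 1 silently produces other values if a row carries an unexpected code. A minimal sketch of a stricter recoding follows, assuming the raw coding is 1 = male and 2 = female (which is what the subtraction implies); the `demo` frame and `gender_map` are hypothetical.

```python
import pandas as pd

# Toy frame standing in for the real data; assumes raw codes 1 = male, 2 = female
demo = pd.DataFrame({'PTGENDER': [2, 1, 2, 2, 1]})

gender_map = {1: 0, 2: 1}
recoded = demo['PTGENDER'].map(gender_map)

# .map turns any code missing from gender_map into NaN, so fail on those
if recoded.isna().any():
    raise ValueError('unexpected PTGENDER code')

demo['PTGENDER'] = recoded.astype(int)
```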

### Negative columns

Shift every column that contains negative values so that its minimum becomes 0 before normalization.

See the section [normalizing negative data](https://people.revoledu.com/kardi/tutorial/Similarity/Normalization.html) of the linked article.

In [34]:
negative_cols = [k for k,v in df.items() if v.min() < 0]

In [35]:
# shift each negative column up by |min| so that its minimum becomes 0
df[negative_cols] = df[negative_cols] - df[negative_cols].min()
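
It may help to see what the shift does and does not change. Min-max scaling subtracts the column minimum anyway, so a shifted column scales to the same values as the unshifted one. The shift only affects the final result for columns that are later skipped by a `max > 1` filter. The sketch below checks this on the `PHC_EXF` values shown in `df.head()`; the helper `min_max` is a hypothetical name.

```python
import numpy as np

def min_max(x):
    """Scale a 1-D array to [0, 1]."""
    return (x - x.min()) / (x.max() - x.min())

# PHC_EXF values from the first five rows shown by df.head()
x = np.array([-0.15, -0.318, 0.09, 0.405, 1.025])

# Same shift as in the cell above: add |min| so the minimum becomes 0
shifted = x + abs(x.min())

# Min-max scaling gives the same result with or without the shift
same = np.allclose(min_max(x), min_max(shifted))
```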

### Normalize columns

Now that all columns contain only non-negative values, normalize
all features where $\max > 1$.

Normalize the features using *min-max scaling*.

In [36]:
# NOTE: if you remove the CDR score you won't have to adjust
# your array indexes when exporting later
excluded = ['RID', 'AD_LABEL']
cols_to_normalize = [k for k, v in df.items() if v.max() > 1 and k not in excluded]

In [37]:
normalized_df = df[cols_to_normalize]
numer = normalized_df - normalized_df.min()
denom = normalized_df.max() - normalized_df.min()

df[cols_to_normalize] = (numer / denom)
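
Min-max scaling maps each value $x$ in a column to $(x - \min) / (\max - \min)$. A constant column has $\max = \min$, and the division then yields `NaN`. The following is a minimal sketch of a guarded version on toy data; `min_max_scale` and the `CONST` column are hypothetical, and mapping constant columns to 0 is one possible convention rather than the notebook's.

```python
import numpy as np
import pandas as pd

def min_max_scale(frame):
    """Scale every column of `frame` to [0, 1]; constant columns map to 0."""
    col_min = frame.min()
    col_range = frame.max() - col_min
    # A zero range would give 0/0 = NaN; divide by 1 instead so the column becomes 0
    safe_range = col_range.replace(0, 1)
    return (frame - col_min) / safe_range

demo = pd.DataFrame({
    'AGE': [84.8186, 91.3073, 83.7563],   # values from df.head()
    'PTEDUCAT': [18.0, 18.0, 13.0],       # values from df.head()
    'CONST': [5.0, 5.0, 5.0],             # hypothetical constant column
})
scaled = min_max_scale(demo)
```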

### Export

- Convert dataframe to numpy array (excl. rid and label)
- Create data dictionary
- Split into training and test (80%, 20%)

In [38]:
# convert dataframe into numpy array (excl. rid and label)
cols_to_keep = [c for c in df.columns if c not in excluded]
arr = df.loc[:, cols_to_keep].values

In [39]:
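keys = ['CN', 'MCI', 'AD']
data_dict = {k: [] for k in keys}

# group subjects by diagnosis; AD_LABEL is 1-based (1 = CN, 2 = MCI, 3 = AD)
for i, subj in enumerate(arr):
    # use positional indexing so row i of arr matches row i of df
    key = int(df['AD_LABEL'].iloc[i]) - 1
    item = np.expand_dims(subj, axis=0)  # shape (1, n_features)
    data_dict[keys[key]].append(item)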
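
The same grouping can be sketched with boolean masks instead of a Python loop. This is an alternative, not the notebook's method: it assumes the labels are coded 1 = CN, 2 = MCI, 3 = AD (as the `- 1` offset implies) and it stores one 2-D array per class rather than a list of `(1, n_features)` arrays. `features` and `labels` are toy stand-ins.

```python
import numpy as np

keys = ['CN', 'MCI', 'AD']

# Toy stand-ins for the feature array and the AD_LABEL column
features = np.arange(12, dtype=float).reshape(6, 2)
labels = np.array([1.0, 1.0, 2.0, 2.0, 3.0, 1.0])

# One boolean mask per class; label code k selects keys[k - 1]
grouped = {
    name: features[labels == code]
    for code, name in enumerate(keys, start=1)
}
```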

In [40]:
# separate data dictionary into training and test sets
train_dict = {}
test_dict = {}
for k, v in data_dict.items():
    no_train_samples = round(len(v) * .80)
    train_dict[k] = v[:no_train_samples]
    test_dict[k] = v[no_train_samples:]
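
The split above takes the first 80% of each class in file order. If the rows are sorted in some meaningful way, shuffling within each class first may be preferable. A minimal sketch with a fixed seed follows; `split_per_class`, `train_fraction`, `seed` and the toy `demo` dictionary are hypothetical.

```python
import numpy as np

def split_per_class(data, train_fraction=0.8, seed=0):
    """Shuffle each class independently, then split it into train and test lists."""
    rng = np.random.default_rng(seed)
    train, test = {}, {}
    for name, items in data.items():
        order = rng.permutation(len(items))
        n_train = round(len(items) * train_fraction)
        train[name] = [items[i] for i in order[:n_train]]
        test[name] = [items[i] for i in order[n_train:]]
    return train, test

# Toy stand-in for data_dict
demo = {'CN': list(range(10)), 'AD': list(range(100, 105))}
train_demo, test_demo = split_per_class(demo)
```

Fixing the seed keeps the split reproducible across runs.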

In [41]:
# create the output directory if it does not exist yet
os.makedirs('data', exist_ok=True)

In [42]:
np.save('data/ad_class_train', train_dict, allow_pickle=True)

In [43]:
np.save('data/ad_class_test', test_dict, allow_pickle=True)

### Log

Keep track of the data shape and where each feature group sits, for future use.

**The order of features in the array is as follows:**

*Risk factors* $\rightarrow$ *Cognitive factors* $\rightarrow$ *MRI factors*

In [44]:
print([(k, len(v)) for k,v in train_dict.items()])
print([(k, len(v)) for k,v in test_dict.items()])

# we have removed rid from risk cols and ad_label from cognitive cols
# NOTE: if we decide to remove CDR then cognitive cols - 2
print(f"""
    Risk features (excl. rid): {len(risk_cols) - 1},
    Cognitive features (excl. label): {len(cognitive_cols) - 1},
    MRI cols: {len(mri_cols)}
""")

[('CN', 294), ('MCI', 110), ('AD', 47)]
[('CN', 73), ('MCI', 28), ('AD', 12)]

    Risk features (excl. rid): 5,
    Cognitive features (excl. label): 13,
    MRI cols: 177
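
Given the counts printed above, the column range of each feature group in the exported arrays can be derived rather than hard-coded. The sketch below assumes the order risk, cognitive, MRI stated earlier; `feature_slices` and `layout` are hypothetical names.

```python
import numpy as np

def feature_slices(counts):
    """Turn ordered group sizes into column slices, e.g. {'risk': slice(0, 5), ...}."""
    slices, start = {}, 0
    for name, n in counts.items():
        slices[name] = slice(start, start + n)
        start += n
    return slices

# Group sizes taken from the log output above
layout = feature_slices({'risk': 5, 'cognitive': 13, 'mri': 177})

# Each exported subject has shape (1, 195); select its MRI block
subject = np.zeros((1, 195))
mri_block = subject[:, layout['mri']]
```

If a feature is dropped later (the CDR note above), only the counts need to change.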



### Data loading test

In [45]:
test_root = 'data/ad_class_train.npy'
with open(test_root, 'rb') as f:
    data_dict = np.load(f, allow_pickle=True)

# np.load returns a 0-d object array wrapping the dict;
# index it with [()] to get the dict back, e.g. data_dict[()]['AD']
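
The `[()]` indexing is needed because `np.save` wraps a Python dictionary in a zero-dimensional object array. A small round trip through a temporary directory illustrates this; `payload` and the file name `demo_split` are hypothetical.

```python
import os
import tempfile
import numpy as np

# Toy dictionary shaped like train_dict: class name -> list of (1, n_features) arrays
payload = {'CN': [np.zeros((1, 3))], 'AD': [np.ones((1, 3))]}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'demo_split')
    np.save(path, payload, allow_pickle=True)        # np.save appends '.npy'
    loaded = np.load(path + '.npy', allow_pickle=True)

# `loaded` is a 0-d object array; .item() or [()] unwraps the dictionary
restored = loaded.item()
```

Note that `allow_pickle=True` should only be used on files from a trusted source, since unpickling can execute arbitrary code.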

In [50]:
data_dict[()]['AD'][0]

array([[0.6934069 , 0.        , 0.83333333, 0.        , 0.        ,
        0.30626023, 0.61769956, 0.40131437, 0.30733206, 0.76850211,
        0.75      , 0.84615385, 0.        , 0.04918033, 0.56521739,
        0.47826087, 0.31225047, 0.49602302, 0.48687168, 0.55034526,
        0.23328005, 0.25137255, 0.66739609, 0.14982578, 0.3935899 ,
        0.28211031, 0.60698206, 0.36265916, 0.49857326, 0.30694524,
        0.52610557, 0.51844572, 0.08747958, 0.44354742, 0.60496959,
        0.24449593, 0.14651356, 0.59559876, 0.11961297, 0.2970036 ,
        0.47124603, 0.34294008, 0.54375861, 0.07236322, 0.34860891,
        0.        , 0.        , 0.        , 0.17561781, 0.61345251,
        0.12282124, 0.29954476, 0.18425925, 0.2721465 , 0.10019129,
        0.3935871 , 0.32728262, 0.07167696, 0.27436288, 0.19766599,
        0.06163092, 0.24838172, 0.21448134, 0.41819001, 0.1813811 ,
        0.56666207, 0.32984083, 0.21373426, 0.1252416 , 0.11028754,
        0.20612522, 0.25791234, 0.1931159 , 0.24