## 02 Data preprocessing

In this notebook we split the data into train and test subsets and describe the validation strategy.

### 1. Data loading and adjusting

We load the data and rename the columns as described in 01_exploratory_data_analysis.ipynb to standardize the column names.

We know that there are 4 entries with invalid values: 1 for obesity and 3 for comorbidity. Obesity did not show a significant correlation with gallstone status, and other metrics (e.g. BMI) likely carry similar biological information. We could therefore leave the obesity outlier in the dataset, but we decided to remove it to keep the data as clean as possible. Comorbidity, on the other hand, is likely to be correlated with gallstone status, so we exclude entries with comorbidity above 1 (the variable is expected to be binary).


In [423]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.model_selection import RepeatedStratifiedKFold


path_train_data = "data/train_data_fold_split.csv"
path_train_data_strat = "data/train_data_fold_split_strat.csv"
path_test_data = "data/test_data.csv"
path_test_data_strat = "data/test_data_strat.csv"
save_files = True # set to False to avoid overwriting the train and test data files

In [424]:
df_orig = pd.read_csv("data/gallstone_.csv")
col_map = {
    'Gallstone Status': 'gallstone',
    'Age': 'age',
    'Gender': 'gender',
    'Comorbidity': 'comorbidity',
    'Coronary Artery Disease (CAD)': 'cad',
    'Hypothyroidism': 'hypothyroidism',
    'Hyperlipidemia': 'hyperlipidemia',
    'Diabetes Mellitus (DM)': 'diabetes',
    'Height': 'height',
    'Weight': 'weight',
    'Body Mass Index (BMI)': 'bmi',
    'Total Body Water (TBW)': 'tbw',
    'Extracellular Water (ECW)': 'ecw',
    'Intracellular Water (ICW)': 'icw',
    'Extracellular Fluid/Total Body Water (ECF/TBW)': 'ecf_tbw',
    'Total Body Fat Ratio (TBFR) (%)': 'tbfr',
    'Lean Mass (LM) (%)': 'lm',
    'Body Protein Content (Protein) (%)': 'protein',
    'Visceral Fat Rating (VFR)': 'vfr',
    'Bone Mass (BM)': 'bm',
    'Muscle Mass (MM)': 'mm',
    'Obesity (%)': 'obesity',
    'Total Fat Content (TFC)': 'tfc',
    'Visceral Fat Area (VFA)': 'vfa',
    'Visceral Muscle Area (VMA) (Kg)': 'vma',
    'Hepatic Fat Accumulation (HFA)': 'hfa',
    'Glucose': 'glucose',
    'Total Cholesterol (TC)': 'tc',
    'Low Density Lipoprotein (LDL)': 'ldl',
    'High Density Lipoprotein (HDL)': 'hdl',
    'Triglyceride': 'triglyceride',
    'Aspartat Aminotransferaz (AST)': 'ast',
    'Alanin Aminotransferaz (ALT)': 'alt',
    'Alkaline Phosphatase (ALP)': 'alp',
    'Creatinine': 'creatinine',
    'Glomerular Filtration Rate (GFR)': 'gfr',
    'C-Reactive Protein (CRP)': 'crp',
    'Hemoglobin (HGB)': 'hgb',
    'Vitamin D': 'vitamin_d'
}

df = df_orig.rename(columns=col_map)

df = df[df["comorbidity"]<2]
df = df[df["obesity"]<1900]
df = df.reset_index(drop=True)  # assign the result, otherwise the index is left unchanged

print(f"The dataset shape after excluding outliers: {df.shape}")

The dataset shape after excluding outliers: (315, 39)
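
To make the effect of each filter explicit, one could count how many rows each keep-condition rejects. The sketch below is illustrative only: `count_removed` is a hypothetical helper, and the toy frame uses made-up values rather than the gallstone data. Only the two thresholds mirror the ones used above.

```python
import pandas as pd

def count_removed(keep_masks: dict) -> pd.Series:
    """Return, per named filter, the number of rows that fail the keep-condition."""
    return pd.Series({name: int((~mask).sum()) for name, mask in keep_masks.items()})

# Toy data (hypothetical values, not taken from the gallstone dataset)
toy = pd.DataFrame({
    "comorbidity": [0, 1, 2, 3, 0, 1],
    "obesity": [24.0, 1954.0, 18.5, 30.1, 12.0, 40.2],
})

removed = count_removed({
    "comorbidity < 2": toy["comorbidity"] < 2,
    "obesity < 1900": toy["obesity"] < 1900,
})
print(removed)
```

Applied to the renamed dataframe before filtering, the same call would show whether the counts agree with the 3 comorbidity entries and 1 obesity entry mentioned above.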


### 2. Data Split Concept

We use repeated k-fold cross-validation (k = 5, 3 repetitions) for both the regression and the neural network models. Although k-fold is uncommon for large NNs, it helps to obtain more reliable performance estimates on our small dataset (315 samples).

A hold-out test set (30%) is kept for the final evaluation. EDA showed that gallstone status is not evenly distributed across genders. To experiment with this, we propose two stratification strategies:
- stratification by gallstone status, to ensure that the disease status is balanced in the train and test datasets. Such stratification normally produces more stable models;
- stratification by the combination of gallstone status and gender, to preserve the joint distribution of gallstone status and gender. In this way, both sets are likely to reflect a realistic patient distribution. On the other hand, since the dataset contains only 315 samples, such a stratification can hide natural variability in the data.

The same stratification is used for the train/test split and for the fold splitting.
Train and test files are saved separately, with the stratification column and the k-fold assignments stored in the train file for reproducibility.

```
Full Dataset (315 samples)
│
├── 70% Train (220 samples) (train_data_fold_split.csv) ──> K-Fold (k=5)
│      ├── Fold 1: Train / Validation
│      ├── Fold 2: Train / Validation
│      ├── Fold 3: Train / Validation
│      ├── Fold 4: Train / Validation
│      └── Fold 5: Train / Validation
│
└── 30% Test (95 samples) (test_data.csv) ──> Final evaluation
```
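
The sample counts in the diagram can be reproduced with a few lines of arithmetic. This is a sketch under two stated assumptions: the test size is rounded up with the remainder going to train, and the train size is divisible by the number of folds. Both are consistent with the 95 / 220 and 176 / 44 counts reported in this notebook.

```python
import math

n_samples = 315
test_size = 0.3
n_splits = 5

# Assumption: the test size is rounded up and the remainder goes to train
n_test = math.ceil(n_samples * test_size)
n_train = n_samples - n_test

# Assumption: n_train is divisible by n_splits, so all folds have the same size
n_val_per_fold = n_train // n_splits
n_train_per_fold = n_train - n_val_per_fold

print(n_test, n_train, n_val_per_fold, n_train_per_fold)  # prints: 95 220 44 176
```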

### 3. Splitting the data into train and test sets

The gender distribution differs between the gallstone groups. To account for this imbalance, we create a second split that is stratified by gender in addition to gallstone status.

In [425]:
df["strat"] = df["gallstone"].astype(str) +  "_"  + df["gender"].astype(str)

train_data, test_data, y_train, y_test = train_test_split(df, df["gallstone"], test_size=0.3, stratify = df["gallstone"], random_state = 37)
train_data_strat, test_data_strat, y_train_strat, y_test_strat = train_test_split(df, df["gallstone"], test_size=0.3, stratify=df["strat"], random_state = 37)
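
Combining two labels into one stratification column produces more, and therefore smaller, strata than stratifying on gallstone status alone. Before splitting, it can be worth checking that the smallest stratum is large enough to be spread over the test set and over all folds. The sketch below is an illustration on made-up toy data; `stratum_sizes` is a hypothetical helper and `n_splits = 2` is chosen only to fit the toy frame.

```python
import pandas as pd

def stratum_sizes(df: pd.DataFrame, cols: list) -> pd.Series:
    """Build a combined label from `cols` (joined with '_') and count rows per stratum."""
    label = df[cols].astype(str).agg("_".join, axis=1)
    return label.value_counts().sort_index()

# Toy data (hypothetical values, not taken from the gallstone dataset)
toy = pd.DataFrame({
    "gallstone": [0, 0, 0, 1, 1, 1, 1, 0],
    "gender":    [0, 1, 0, 1, 1, 0, 1, 0],
})

sizes = stratum_sizes(toy, ["gallstone", "gender"])
print(sizes)

n_splits = 2  # hypothetical value for the toy data
enough = bool(sizes.min() >= n_splits)
print("every stratum has at least n_splits members:", enough)
```

In the toy frame two strata contain a single row, so the check fails; on the real data the same call on `df` with `["gallstone", "gender"]` would report the four stratum sizes.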


Let's verify what the gender distribution looks like after a split stratified by gallstone status only and after a split stratified by gallstone status + gender.

In [426]:
def check_stratification_on_gallstone(df, feature, data_label = "train"):
    ct = pd.crosstab(df['gallstone'], df[feature])
    if feature == "gender":
        ct = ct.rename(columns = {0: "male", 1: "female"})
    ct_norm = ct.div(ct.sum(axis=1), axis=0).add_suffix(" (%)")* 100

    ct_side_by_side = pd.concat([ct, ct_norm.round(0).astype(int)], axis=1)

    display(ct_side_by_side.style.set_caption(f"{feature} distribution on gallstone status for {data_label} data"))

check_stratification_on_gallstone(df = train_data, feature = "gender", data_label="train")
check_stratification_on_gallstone(df = test_data, feature = "gender", data_label="test")

check_stratification_on_gallstone(df = train_data_strat, feature = "gender", data_label="train stratified")
check_stratification_on_gallstone(df = test_data_strat, feature = "gender", data_label="test stratified")



gender distribution on gallstone status for train data

| gallstone | male | female | male (%) | female (%) |
|---|---|---|---|---|
| 0 | 63 | 47 | 57 | 43 |
| 1 | 52 | 58 | 47 | 53 |

gender distribution on gallstone status for test data

| gallstone | male | female | male (%) | female (%) |
|---|---|---|---|---|
| 0 | 28 | 20 | 58 | 42 |
| 1 | 16 | 31 | 34 | 66 |

gender distribution on gallstone status for train stratified data

| gallstone | male | female | male (%) | female (%) |
|---|---|---|---|---|
| 0 | 64 | 47 | 58 | 42 |
| 1 | 47 | 62 | 43 | 57 |

gender distribution on gallstone status for test stratified data

| gallstone | male | female | male (%) | female (%) |
|---|---|---|---|---|
| 0 | 27 | 20 | 57 | 43 |
| 1 | 21 | 27 | 44 | 56 |


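With the split stratified by gallstone status + gender, the gender distribution within each gallstone group is nearly identical in the train and test sets (58/42% vs. 57/43% male/female for gallstone = 0, and 43/57% vs. 44/56% for gallstone = 1). With stratification by gallstone status only, the test set deviates noticeably for gallstone = 1 (34% male vs. 47% in the train set). We can now proceed with k-fold splitting.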
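
The visual comparison of the four tables can be condensed into one number per split: the largest absolute difference between the train and test gender shares within a gallstone group. The sketch below rebuilds toy frames from the counts printed above; `from_counts`, `gender_share_by_status` and `max_share_gap` are hypothetical helpers introduced only for this illustration.

```python
import pandas as pd

def from_counts(counts: dict) -> pd.DataFrame:
    """Expand {(gallstone, gender): n} into a frame with n rows per combination."""
    rows = [
        {"gallstone": g, "gender": s}
        for (g, s), n in counts.items()
        for _ in range(n)
    ]
    return pd.DataFrame(rows)

def gender_share_by_status(df: pd.DataFrame) -> pd.DataFrame:
    """Share of each gender within each gallstone status (rows sum to 1)."""
    return pd.crosstab(df["gallstone"], df["gender"], normalize="index")

def max_share_gap(train: pd.DataFrame, test: pd.DataFrame) -> float:
    """Largest absolute train/test difference over all status/gender cells."""
    gap = gender_share_by_status(train) - gender_share_by_status(test)
    return float(gap.abs().to_numpy().max())

# Counts copied from the tables above; gender 0 = male, 1 = female
train_plain = from_counts({(0, 0): 63, (0, 1): 47, (1, 0): 52, (1, 1): 58})
test_plain = from_counts({(0, 0): 28, (0, 1): 20, (1, 0): 16, (1, 1): 31})
train_strat = from_counts({(0, 0): 64, (0, 1): 47, (1, 0): 47, (1, 1): 62})
test_strat = from_counts({(0, 0): 27, (0, 1): 20, (1, 0): 21, (1, 1): 27})

gap_plain = max_share_gap(train_plain, test_plain)
gap_strat = max_share_gap(train_strat, test_strat)
print(f"gallstone only: {gap_plain:.3f}, gallstone + gender: {gap_strat:.3f}")
```

With these counts the gap is about 0.13 for the gallstone-only split and below 0.01 for the gallstone + gender split.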

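### 4. Splitting the train data into k folds

As described above, the folds use the same stratification as the corresponding train/test split: gallstone status for `train_data` and gallstone status + gender for `train_data_strat`. We use k = 5 folds and 3 repetitions.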

In [427]:
# make stratified splits and add columns "FoldX_RepY" to the dataframes where values are "train" or "val"
# in this way we can save the splits to a file together with the data for better reproducibility

train_data.reset_index(drop = True, inplace = True)
train_data_strat.reset_index(drop = True, inplace = True)

n_splits = 5
n_repeats = 3
skf = RepeatedStratifiedKFold(n_splits=n_splits, n_repeats = n_repeats, random_state=42)

splits = {} # save splits for a check later
splits_strat = {} # save splits for a check later

# folds stratified by gallstone status only
for fold, (train_idx, val_idx) in enumerate(skf.split(train_data, train_data["gallstone"])):
    fold_num = fold % n_splits + 1   # fold number within a repetition (1..n_splits)
    rep_num = fold // n_splits + 1   # repetition number (1..n_repeats)
    fold_col = f"Fold{fold_num}_Rep{rep_num}"
    print(f"{fold_col}: Train {len(train_idx)}, Val {len(val_idx)}")
    train_data[fold_col] = "train"
    train_data.loc[val_idx, fold_col] = "val"
    splits[fold_col] = {"train": train_idx.tolist(), "val": val_idx.tolist()}

# folds stratified by gallstone status + gender
for fold, (train_idx, val_idx) in enumerate(skf.split(train_data_strat, train_data_strat["strat"])):
    fold_num = fold % n_splits + 1
    rep_num = fold // n_splits + 1
    fold_col = f"Fold{fold_num}_Rep{rep_num}"
    #print(f"{fold_col}: Train {len(train_idx)}, Val {len(val_idx)}")
    train_data_strat[fold_col] = "train"
    train_data_strat.loc[val_idx, fold_col] = "val"
    splits_strat[fold_col] = {"train": train_idx.tolist(), "val": val_idx.tolist()}

Fold1_Rep1: Train 176, Val 44
Fold2_Rep1: Train 176, Val 44
Fold3_Rep1: Train 176, Val 44
Fold4_Rep1: Train 176, Val 44
Fold5_Rep1: Train 176, Val 44
Fold1_Rep2: Train 176, Val 44
Fold2_Rep2: Train 176, Val 44
Fold3_Rep2: Train 176, Val 44
Fold4_Rep2: Train 176, Val 44
Fold5_Rep2: Train 176, Val 44
Fold1_Rep3: Train 176, Val 44
Fold2_Rep3: Train 176, Val 44
Fold3_Rep3: Train 176, Val 44
Fold4_Rep3: Train 176, Val 44
Fold5_Rep3: Train 176, Val 44


We can now save the train and test data to files. Note that we keep the gallstone status, the stratification column and the fold assignments together with the features for better traceability.

In [428]:
# save train data with splits to a file

if save_files:
    train_data.to_csv(path_train_data, index = False)
    test_data.to_csv(path_test_data, index = False)
    train_data_strat.to_csv(path_train_data_strat, index = False)
    test_data_strat.to_csv(path_test_data_strat, index = False)

We quickly check what the saved data looks like.

In [429]:
train_data_loaded = pd.read_csv(path_train_data)
train_data_strat_loaded = pd.read_csv(path_train_data_strat)

display(train_data_loaded.head(10).style.format())

Unnamed: 0,gallstone,age,gender,comorbidity,cad,hypothyroidism,hyperlipidemia,diabetes,height,weight,bmi,tbw,ecw,icw,ecf_tbw,tbfr,lm,protein,vfr,bm,mm,obesity,tfc,vfa,vma,hfa,glucose,tc,ldl,hdl,triglyceride,ast,alt,alp,creatinine,gfr,crp,hgb,vitamin_d,strat,Fold1_Rep1,Fold2_Rep1,Fold3_Rep1,Fold4_Rep1,Fold5_Rep1,Fold1_Rep2,Fold2_Rep2,Fold3_Rep2,Fold4_Rep2,Fold5_Rep2,Fold1_Rep3,Fold2_Rep3,Fold3_Rep3,Fold4_Rep3,Fold5_Rep3
0,0,44,0,0,0,0,0,0,178,86.4,27.3,49.2,18.7,29.5,40.0,20.2,79.75,16.98,8,3.4,65.5,24.0,17.5,10.6,35.9,2,103.0,194.0,127.0,39.0,180.0,22.0,40.0,73.0,0.82,116.75,0.0,15.7,25.15,0_0,train,train,train,val,train,val,train,train,train,train,train,train,val,train,train
1,0,32,0,0,0,0,0,0,170,64.3,22.2,39.8,16.5,23.3,41.0,15.3,84.76,16.67,4,2.7,51.8,1.1,9.8,5.9,27.6,1,91.0,185.0,125.0,35.0,165.0,23.0,22.0,69.0,0.67,127.22,0.5,16.0,22.925,0_0,val,train,train,train,train,train,train,train,val,train,train,train,train,val,train
2,1,57,0,0,0,0,0,0,186,118.1,34.1,57.4,22.0,35.0,38.3,29.89,70.11,17.16,16,2.0,78.7,33.87,35.3,21.63,40.5,2,83.0,209.0,139.0,44.0,148.0,22.0,33.0,80.0,0.64,110.4,0.5,13.6,18.7,1_0,val,train,train,train,train,train,train,train,train,val,train,train,train,train,val
3,0,51,1,1,0,1,0,0,166,59.3,21.5,33.0,14.1,18.9,43.0,21.6,78.41,17.9,4,2.4,44.1,2.1,12.8,5.2,25.8,0,99.0,254.0,158.0,64.0,107.0,21.0,21.0,61.0,0.7,93.34,0.0,13.5,46.9,0_1,val,train,train,train,train,train,train,val,train,train,val,train,train,train,train
4,1,55,0,0,0,0,0,0,177,106.3,33.9,50.6,21.9,28.7,43.0,32.2,67.83,16.02,18,3.6,68.5,54.3,34.2,30.8,37.1,0,94.5,240.0,148.0,55.5,141.5,17.0,15.0,70.5,0.715,100.7,0.65,14.15,17.75,1_0,val,train,train,train,train,train,train,train,val,train,val,train,train,train,train
5,1,57,0,1,0,0,0,0,169,104.2,36.5,55.0,23.0,32.0,41.81,26.78,73.22,15.9,16,3.8,72.6,43.07,27.9,16.55,37.4,3,98.0,151.0,95.0,41.0,121.0,28.0,48.0,41.0,0.83,101.4,0.0,15.0,5.7,1_0,train,train,train,train,val,train,train,val,train,train,train,train,train,val,train
6,1,35,0,0,0,0,0,0,171,83.8,28.7,47.1,19.0,28.0,40.33,22.17,77.83,16.01,8,2.3,62.0,22.09,18.6,10.88,33.3,2,94.0,207.0,139.0,48.0,108.0,21.0,39.0,66.0,0.85,116.2,8.1,16.5,11.2,1_0,train,train,train,val,train,train,train,val,train,train,train,train,train,val,train
7,1,39,1,1,0,0,0,0,165,74.0,27.2,33.7,15.1,18.6,45.0,36.4,63.6,13.46,6,2.4,44.7,23.5,26.9,13.1,25.9,0,103.0,219.0,136.0,63.0,94.0,19.0,19.0,49.0,0.58,117.9,0.0,11.9,9.5,1_1,train,train,train,val,train,val,train,train,train,train,train,train,train,train,val
8,1,51,1,0,0,0,0,0,158,66.5,26.6,32.0,12.0,20.0,37.5,28.8,71.13,18.6,6,2.4,44.9,8.73,19.2,8.1,26.5,0,96.0,356.0,262.0,62.0,79.0,14.0,15.0,54.0,0.75,96.3,9.0,12.9,21.1,1_1,train,train,train,train,val,train,train,train,train,val,train,train,train,train,val
9,0,36,1,0,0,0,0,0,173,78.1,26.1,38.6,16.6,22.0,43.0,31.0,69.01,14.55,4,2.7,51.2,18.7,24.2,11.7,29.6,0,97.0,265.0,186.0,54.0,159.0,21.0,21.0,50.0,0.58,108.81,0.0,14.4,46.0,0_1,train,train,train,val,train,train,train,train,val,train,train,train,train,train,val


### 5. Extracting fold indices from the loaded data

We can now extract indices for train and validation set for each fold.


In [430]:
folds = train_data_loaded[[c for c in train_data_loaded.columns if c.startswith("Fold")]]
folds_strat = train_data_strat_loaded[[c for c in train_data_strat_loaded.columns if c.startswith("Fold")]]

splits_from_file = {} # to check if our splits are read correctly
splits_strat_from_file = {} 

for fold in folds.columns:
    train_idx =  folds.index[folds[fold] == "train"].to_numpy()
    val_idx = folds.index[folds[fold] == "val"].to_numpy()
    splits_from_file[fold] = {"train": train_idx.tolist(), "val": val_idx.tolist()}

for fold in folds_strat.columns:
    train_idx =  folds_strat.index[folds_strat[fold] == "train"].to_numpy()
    val_idx = folds_strat.index[folds_strat[fold] == "val"].to_numpy()
    splits_strat_from_file[fold] = {"train": train_idx.tolist(), "val": val_idx.tolist()}
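
For downstream notebooks it may be more convenient to iterate over the fold columns and obtain the train and validation rows directly, instead of index lists. The generator below is a minimal sketch of that idea; `iter_folds` is a hypothetical helper and the toy frame contains made-up values. It assumes the fold columns start with "Fold" and hold the values "train" or "val", as in the saved files.

```python
import pandas as pd

def iter_folds(df: pd.DataFrame):
    """Yield (fold_name, train_rows, val_rows) for every column starting with 'Fold'."""
    fold_cols = [c for c in df.columns if c.startswith("Fold")]
    data = df.drop(columns=fold_cols)  # everything except the fold assignments
    for col in fold_cols:
        yield col, data[df[col] == "train"], data[df[col] == "val"]

# Toy data (hypothetical values, not taken from the gallstone dataset)
toy = pd.DataFrame({
    "gallstone": [0, 1, 0, 1],
    "age": [44, 57, 32, 51],
    "Fold1_Rep1": ["train", "val", "train", "train"],
    "Fold2_Rep1": ["val", "train", "train", "train"],
})

fold_sizes = [(name, len(tr), len(va)) for name, tr, va in iter_folds(toy)]
print(fold_sizes)
```

Note that the yielded frames would still contain the target and, for the real train files, the `strat` column, so these need to be dropped before fitting a model.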


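A sanity check: verify that the splits obtained from RepeatedStratifiedKFold in the notebook are identical to the splits loaded from the files. We also check whether the two stratification variants produce the same fold assignments.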

In [431]:
print(f"Sanity check 1: are splits from RepeatedStratifiedKFold and from the train data file identical? {splits == splits_from_file}")
print(f"Sanity check 2: are splits_strat from RepeatedStratifiedKFold and from the train data strat file identical? {splits_strat == splits_strat_from_file}")
print(f"Sanity check 3: are splits and splits_strat identical? {splits == splits_strat}")

Sanity check 1: are splits from RepeatedStratifiedKFold and from the train data file identical? True
Sanity check 2: are splits_strat from RepeatedStratifiedKFold and from the train data strat file identical? True
Sanity check 3: are splits and splits_strat identical? False
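
The checks above confirm that the splits survive the round trip through the files. A further property one could verify is that, within each repetition, every training row is assigned to validation exactly once. The sketch below illustrates such a check on made-up toy data; `val_counts_per_repetition` is a hypothetical helper that assumes the `Fold<k>_Rep<r>` column naming used in this notebook.

```python
import pandas as pd

def val_counts_per_repetition(df: pd.DataFrame) -> pd.DataFrame:
    """For each repetition, count how often every row is marked 'val'."""
    fold_cols = [c for c in df.columns if c.startswith("Fold")]
    counts = {}
    for col in fold_cols:
        rep = col.split("_")[1]  # e.g. "Rep1" from "Fold3_Rep1"
        counts[rep] = counts.get(rep, 0) + (df[col] == "val").astype(int)
    return pd.DataFrame(counts)

# Toy fold assignments (hypothetical): 4 rows, 2 folds, 2 repetitions
toy = pd.DataFrame({
    "Fold1_Rep1": ["val", "val", "train", "train"],
    "Fold2_Rep1": ["train", "train", "val", "val"],
    "Fold1_Rep2": ["val", "train", "val", "train"],
    "Fold2_Rep2": ["train", "val", "train", "val"],
})

counts = val_counts_per_repetition(toy)
all_once = bool((counts == 1).to_numpy().all())
print(counts)
print("each row is validated exactly once per repetition:", all_once)
```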


### 6. Summary
In this notebook we
- split the data into train and test sets (70% train, 30% test, i.e. 220 train samples and 95 test samples) using two stratification variants: by gallstone status only, and by gallstone status + gender
- saved the splits with the fold information and the stratification column to files for better traceability