## Feature Engineering

Now that we have covered the EDA phase, let's move on to some simple feature engineering tasks prior to modeling.

A few things to note:

* It is good practice to split the data before transforming it (imputing, scaling), so that the test set does not influence the fitted statistics.
* I'll drop the RID column for now.
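
The first point can be illustrated with a minimal sketch on synthetic data (the array names and sizes are illustrative assumptions, not the ADNI data): the scaler learns its mean and standard deviation from the training rows only, and the test rows are then transformed with those same statistics.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a feature matrix (not the ADNI data)
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=50.0, scale=10.0, size=(200, 3))
y_demo = rng.integers(0, 2, size=200)

# Split first, so the test rows never influence the fitted statistics
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0
)

scaler = StandardScaler().fit(X_tr)   # statistics come from the training rows only
X_tr_scaled = scaler.transform(X_tr)
X_te_scaled = scaler.transform(X_te)  # reuses the training mean and std

print(X_tr_scaled.mean(axis=0).round(3))  # approximately zero by construction
print(X_te_scaled.mean(axis=0).round(3))  # close to zero, but generally not exactly
```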

In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler 
from sklearn.pipeline import Pipeline
from feature_engine.imputation import MeanMedianImputer

# Import dataset
dataset = pd.read_csv("C:/Users/steve/Desktop/Notebooks/Thesis-Project/ADNI(Rawdata).csv")
dataset.head()

Unnamed: 0,RID,Gender,Ageatscreening,Diagnosis,MMSE0m,HipsASMbaseline,HipsContrastbaseline,HipsCorelationbaseline,HipsVariancebaseline,HipsSumAveragebaseline,...,ERCsContrastbaseline,ERCsCorelationbaseline,ERCsVariancebaseline,ERCsSumAveragebaseline,ERCsSumVariancebaseline,ERCsEntropybaseline,ERCsClusterShadebaseline,ERCs_thicknessbaseline,ERCsVolumebaseline,HipposcampusVolumebaseline
0,3,0,81.3479,3,20.0,,158.27,0.63,218.3,28.37,...,253.1,0.4,208.65,23.39,581.5,,-2568.19,2.31,1176.0,3047.0
1,4,0,67.6904,1,27.0,0.06,147.64,0.55,173.64,44.72,...,220.88,0.48,215.7,33.74,641.9,3.33,4113.01,2.76,1942.0,3449.0
2,5,0,73.8027,0,29.0,0.1,199.66,0.55,222.27,41.18,...,220.37,0.54,232.18,29.18,708.36,2.87,-1388.41,3.18,2044.0,3441.0
3,8,1,84.5945,0,28.0,0.08,184.21,0.53,201.55,43.04,...,198.42,0.54,220.48,26.68,683.5,2.77,-2506.55,2.68,1959.0,2875.0
4,10,1,73.9726,3,24.0,0.11,233.02,0.48,229.88,39.46,...,196.55,0.53,210.63,26.6,645.95,2.72,-1164.02,2.64,1397.0,2700.0


In [7]:
# Drop the RID column
dataset.drop(labels = "RID", axis = 1, inplace = True)

# Separate into training and testing sets (30% held out for testing)
X_train, X_test, y_train, y_test = train_test_split(
    dataset.drop("Diagnosis", axis=1),  
    dataset["Diagnosis"],  
    test_size=0.3,  
    random_state=0,  
)

X_train.shape, X_test.shape

((425, 22), (183, 22))

### Pipeline: Data imputation & feature scaling

Let's extract all columns with missing values.

In [8]:
na_columns = dataset.columns[dataset.isnull().sum() > 0]
na_columns

Index(['MMSE0m', 'HipsASMbaseline', 'HipsContrastbaseline',
       'HipsCorelationbaseline', 'HipsVariancebaseline',
       'HipsSumAveragebaseline', 'HipsSumVariancebaseline',
       'HipsEntropybaseline', 'HipsClusterShadebaseline', 'ERCsASMbaseline',
       'ERCsContrastbaseline', 'ERCsCorelationbaseline',
       'ERCsVariancebaseline', 'ERCsSumAveragebaseline',
       'ERCsSumVariancebaseline', 'ERCsEntropybaseline',
       'ERCsClusterShadebaseline', 'ERCs_thicknessbaseline',
       'ERCsVolumebaseline', 'HipposcampusVolumebaseline'],
      dtype='object')
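
As a hedged alternative, the same list could be derived programmatically (ideally from the training split alone) instead of being typed out by hand. The sketch below uses a small hypothetical DataFrame (`demo`, with made-up column names and values) and plain pandas to show both the column selection and what mean imputation does to the selected columns.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; column names and values are illustrative only
demo = pd.DataFrame({
    "age": [70.0, 81.0, 65.0, 77.0],
    "mmse": [29.0, np.nan, 27.0, 24.0],
    "volume": [3047.0, 3449.0, np.nan, np.nan],
})

# Columns with at least one missing value, as a plain Python list
na_cols = demo.columns[demo.isnull().any()].tolist()
print(na_cols)  # ['mmse', 'volume']

# Mean imputation by hand: fill each column's gaps with that column's mean
col_means = demo[na_cols].mean()
demo_imputed = demo.fillna(col_means)
print(demo_imputed)
```

A list built this way could be passed as the `variables` argument of the imputer, which avoids keeping a hand-written list in sync with the data.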

In [9]:
pipe = Pipeline([
    ("imputer", MeanMedianImputer(
        imputation_method="mean", 
        variables=[
            'MMSE0m', 'HipsASMbaseline', 'HipsContrastbaseline',
            'HipsCorelationbaseline', 'HipsVariancebaseline',
            'HipsSumAveragebaseline', 'HipsSumVariancebaseline',
            'HipsEntropybaseline', 'HipsClusterShadebaseline', 
            'ERCsASMbaseline', 'ERCsContrastbaseline', 
            'ERCsCorelationbaseline', 'ERCsVariancebaseline', 
            'ERCsSumAveragebaseline', 'ERCsSumVariancebaseline',
            'ERCsEntropybaseline', 'ERCsClusterShadebaseline', 
            'ERCs_thicknessbaseline', 'ERCsVolumebaseline', 
            'HipposcampusVolumebaseline'
        ]
    )),
    ("scaler", StandardScaler().set_output(transform="pandas")),
])

# Fit the imputer and scaler on the training set only
pipe.fit(X_train)

# Transform both splits using the statistics learned from the training set
X_train_scaled = pipe.transform(X_train)
X_test_scaled = pipe.transform(X_test)

**Note**

* Typically, after imputing the dataset, we analyze and visualize the data to assess whether it has been significantly affected. However, for the sake of simplicity, we will temporarily skip this step.
* A split analysis will be conducted after the first iteration.
* Additionally, no feature selection will be performed until we complete the first iteration.
* Variables have not been dropped.
* Hyperparameter optimization has not been performed.
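
For reference, the check skipped in the first point could look something like the sketch below. It uses a synthetic column (not the ADNI data) and compares summary statistics before and after mean imputation: the mean is preserved, while the standard deviation shrinks because every filled value sits exactly at the mean.

```python
import numpy as np
import pandas as pd

# Synthetic feature with roughly 20% of values removed at random
rng = np.random.default_rng(42)
values = pd.Series(rng.normal(loc=100.0, scale=15.0, size=500))
missing_mask = rng.random(500) < 0.2
observed = values.mask(missing_mask)      # NaN where missing_mask is True

# Mean imputation, using the mean of the observed values
imputed = observed.fillna(observed.mean())

# Side-by-side summary of the column before and after imputation
summary = pd.DataFrame({
    "before": observed.describe(),
    "after": imputed.describe(),
})
print(summary.loc[["count", "mean", "std"]])
```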

## Model Selection

We will experiment with the models that were previously mentioned in the paper:

* Logistic regression
* Support vector machine
* Decision tree
* Random forest

I will focus on only these four models for now, though I would love to see how a simple ANN performs here. I'll try that afterwards.

The scikit-learn documentation describes each of the models below; for example, the `LogisticRegression` reference:
https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html

In [12]:
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

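# Note: `multi_class` is deprecated in recent scikit-learn releases, where
# the lbfgs solver already fits a multinomial model for multiclass targets.
lg = LogisticRegression(multi_class = "multinomial",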
                        solver = "lbfgs",
                        max_iter = 1000,
                        random_state = 42)

svm = SVC(kernel ='rbf', 
          decision_function_shape ='ovo',
          probability = True,
          random_state = 42)

dt = DecisionTreeClassifier(
     criterion ='gini',      
     max_depth = 5,           
     min_samples_split = 10,  
     min_samples_leaf = 5,    
     max_features = 'sqrt',    
     random_state = 42)

rf = RandomForestClassifier(
     n_estimators = 100,     
     criterion = 'gini',     
     max_depth = 5,           
     min_samples_split = 10, 
     min_samples_leaf = 5,   
     max_features = 'sqrt',   
     bootstrap = True,        
     random_state = 42)

The paper reports using 5-fold cross-validation, so we will replicate that approach.
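
One caveat worth flagging: when the imputer and scaler are fitted on the whole training set before cross-validation, each validation fold has already contributed to those statistics. A possible way to avoid this, sketched below on synthetic data, is to hand the cross-validation routine a pipeline that contains both the preprocessing steps and the model, so every fold refits them. scikit-learn's `SimpleImputer` stands in for `MeanMedianImputer` here, and `StratifiedKFold` is shown as an optional variant that preserves class proportions; neither choice comes from the paper.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic 4-class problem with some values knocked out (not the ADNI data)
X_syn, y_syn = make_classification(
    n_samples=400, n_features=10, n_informative=6,
    n_classes=4, random_state=0,
)
rng = np.random.default_rng(0)
X_syn[rng.random(X_syn.shape) < 0.05] = np.nan

# Preprocessing lives inside the pipeline, so it is refitted on each training fold
model = Pipeline([
    ("imputer", SimpleImputer(strategy="mean")),
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
cv_results = cross_validate(model, X_syn, y_syn, cv=cv, scoring="accuracy")
print(cv_results["test_score"].mean())
```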

We will compute ROC AUC along with accuracy, precision, recall, and F1.
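
A small sketch of how these multiclass metrics behave, using hand-made labels and probabilities (illustrative only). One property worth knowing when reading the reports: support-weighted recall is mathematically identical to accuracy, so those two figures will always match.

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score, roc_auc_score

# Hand-made 3-class example (illustrative labels and probabilities)
y_true = np.array([0, 0, 0, 1, 1, 2, 2, 2, 2, 1])
y_pred = np.array([0, 0, 1, 1, 2, 2, 2, 2, 0, 1])

# Each row is a predicted probability distribution over the 3 classes
y_proba = np.array([
    [0.7, 0.2, 0.1], [0.6, 0.3, 0.1], [0.4, 0.5, 0.1], [0.2, 0.6, 0.2],
    [0.1, 0.4, 0.5], [0.1, 0.2, 0.7], [0.2, 0.2, 0.6], [0.1, 0.1, 0.8],
    [0.5, 0.2, 0.3], [0.3, 0.5, 0.2],
])

acc = accuracy_score(y_true, y_pred)
weighted_recall = recall_score(y_true, y_pred, average="weighted", zero_division=0)
weighted_precision = precision_score(y_true, y_pred, average="weighted", zero_division=0)
auc_ovr = roc_auc_score(y_true, y_proba, multi_class="ovr")  # macro-averaged by default

print(acc, weighted_recall)   # identical: weighted recall reduces to accuracy
print(weighted_precision, auc_ovr)
```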

In [30]:
from sklearn.metrics import make_scorer, precision_score, recall_score, f1_score, roc_auc_score
from sklearn.model_selection import KFold, cross_validate

kf = KFold(n_splits = 5, 
           shuffle = True, 
           random_state = 42)

# Define metrics to evaluate
scoring_metrics = {
    'accuracy': 'accuracy',
    'precision': make_scorer(precision_score, average = 'weighted', zero_division=0),
    'recall': make_scorer(recall_score, average = 'weighted', zero_division=0),
    'f1': make_scorer(f1_score, average = 'weighted', zero_division=0),
    'roc_auc': make_scorer(roc_auc_score, multi_class='ovr', response_method = "predict_proba")
}

lg_results = cross_validate(
    lg,
    X_train_scaled,
    y_train,
    scoring = scoring_metrics,
    return_train_score = True,
    cv = kf)

svm_results = cross_validate(
    svm,
    X_train_scaled,
    y_train,
    scoring = scoring_metrics,
    return_train_score = True,
    cv = kf)

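# Note: the tree-based models are given the raw X_train (not imputed, not scaled).
# Scaling does not matter for trees, but the missing values require a
# scikit-learn version whose tree models handle NaN natively.
dt_results = cross_validate(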
    dt,
    X_train,
    y_train,
    scoring = scoring_metrics,
    return_train_score = True,
    cv = kf)

rf_results = cross_validate(
    rf,
    X_train,
    y_train,
    scoring = scoring_metrics,
    return_train_score = True,
    cv = kf)

Now let's check the metrics.

In [32]:
# Print the mean ± standard deviation of every metric for each model
model_results = {
    "Logistic Regression": lg_results,
    "Support Vector Machine": svm_results,
    "Decision Tree": dt_results,
    "Random Forest": rf_results,
}

# Maps each key suffix in the cross_validate output to its printed label
metric_labels = {
    "accuracy": "set accuracy",
    "precision": "precision",
    "recall": "recall",
    "f1": "F1 score",
    "roc_auc": "ROC AUC",
}

for model_name, results in model_results.items():
    print("------------------------------------------------------")
    print(model_name)
    for metric, label in metric_labels.items():
        for split in ("train", "test"):
            scores = results[f"{split}_{metric}"]
            print(f"Mean {split} {label}:", np.mean(scores), "±", np.std(scores))
    print()

------------------------------------------------------
Logistic Regression
Mean train set accuracy: 0.7176470588235294 ± 0.010685824779167581
Mean test set accuracy: 0.628235294117647 ± 0.021820278812931068
Mean train precision: 0.7137715549459933 ± 0.009994613394817947
Mean test precision: 0.6346294513584386 ± 0.016375364642513202
Mean train recall: 0.7176470588235294 ± 0.010685824779167581
Mean test recall: 0.628235294117647 ± 0.021820278812931068
Mean train F1 score: 0.7141321948311611 ± 0.01008676011818872
Mean test F1 score: 0.6264786751157274 ± 0.018573953162226077
Mean train ROC AUC: 0.9092337235320909 ± 0.005358305500420103
Mean test ROC AUC: 0.8598573511351006 ± 0.021712265716621503 

------------------------------------------------------
Support Vector Machine
Mean train set accuracy: 0.8023529411764706 ± 0.005060191333554468
Mean test set accuracy: 0.5999999999999999 ± 0.034899757586332535
Mean train precision: 0.8076762897810934 ± 0.004279107619041377
Mean test precision: 0

Judging by accuracy alone, Random Forest is the best model, with a mean test accuracy of 0.6329, while Decision Tree is the worst, with a mean test accuracy of 0.4847. Note that "test" here refers to the held-out folds of the cross-validation, not to the separate test set.
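
Comparisons like this are easier to read, and to extend to the other metrics, when the cross-validation outputs are gathered into one table. The sketch below is one possible way to do that. It runs on scikit-learn's bundled iris data purely for illustration, so its numbers say nothing about the ADNI results, and the `summarize` helper is a hypothetical name.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, cross_validate
from sklearn.tree import DecisionTreeClassifier

X_iris, y_iris = load_iris(return_X_y=True)
folds = KFold(n_splits=5, shuffle=True, random_state=42)

def summarize(results):
    """Mean and std of accuracy on the train and validation folds, plus their gap."""
    return {
        "train_mean": results["train_score"].mean(),
        "test_mean": results["test_score"].mean(),
        "test_std": results["test_score"].std(),
        "gap": results["train_score"].mean() - results["test_score"].mean(),
    }

candidates = {
    "Decision Tree": DecisionTreeClassifier(max_depth=5, random_state=42),
    "Random Forest": RandomForestClassifier(n_estimators=100, max_depth=5, random_state=42),
}

rows = {
    name: summarize(cross_validate(est, X_iris, y_iris, cv=folds,
                                   scoring="accuracy", return_train_score=True))
    for name, est in candidates.items()
}

# One row per model, best mean validation accuracy first
table = pd.DataFrame.from_dict(rows, orient="index").sort_values("test_mean", ascending=False)
print(table.round(4))
```

A large `gap` hints at overfitting. In the output above, for instance, the SVM's training accuracy (about 0.80) sits well above its validation accuracy (about 0.60).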