## Feature Engineering

* It is good practice to split the data before augmenting or transforming it, so that no statistics from the test set leak into training.
* A pipeline will be created in the second iteration.
* I'll drop the RID for now.

In [39]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler 
from sklearn.pipeline import Pipeline
from feature_engine.imputation import MeanMedianImputer

# Import dataset
dataset = pd.read_csv("C:/Users/steve/Desktop/Notebooks/Thesis-Project/ADNI(Rawdata).csv")
dataset.head()

Unnamed: 0,RID,Gender,Ageatscreening,Diagnosis,MMSE0m,HipsASMbaseline,HipsContrastbaseline,HipsCorelationbaseline,HipsVariancebaseline,HipsSumAveragebaseline,...,ERCsContrastbaseline,ERCsCorelationbaseline,ERCsVariancebaseline,ERCsSumAveragebaseline,ERCsSumVariancebaseline,ERCsEntropybaseline,ERCsClusterShadebaseline,ERCs_thicknessbaseline,ERCsVolumebaseline,HipposcampusVolumebaseline
0,3,0,81.3479,3,20.0,,158.27,0.63,218.3,28.37,...,253.1,0.4,208.65,23.39,581.5,,-2568.19,2.31,1176.0,3047.0
1,4,0,67.6904,1,27.0,0.06,147.64,0.55,173.64,44.72,...,220.88,0.48,215.7,33.74,641.9,3.33,4113.01,2.76,1942.0,3449.0
2,5,0,73.8027,0,29.0,0.1,199.66,0.55,222.27,41.18,...,220.37,0.54,232.18,29.18,708.36,2.87,-1388.41,3.18,2044.0,3441.0
3,8,1,84.5945,0,28.0,0.08,184.21,0.53,201.55,43.04,...,198.42,0.54,220.48,26.68,683.5,2.77,-2506.55,2.68,1959.0,2875.0
4,10,1,73.9726,3,24.0,0.11,233.02,0.48,229.88,39.46,...,196.55,0.53,210.63,26.6,645.95,2.72,-1164.02,2.64,1397.0,2700.0


In [40]:
# let's separate into training and testing set
X_train, X_test, y_train, y_test = train_test_split(
    dataset.drop("Diagnosis", axis=1),  
    dataset["Diagnosis"],  
    test_size=0.3,  
    random_state=0,  
)

X_train.shape, X_test.shape

((425, 23), (183, 23))
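
The split above still carries the `Unnamed: 0` and `RID` columns (hence 23 features). A minimal sketch of dropping such identifier-like columns before splitting is shown here, assuming both are pure row identifiers; the `toy` frame and the `id_columns` name are hypothetical, with a few values copied from the preview above.

```python
import pandas as pd

# Toy frame standing in for the ADNI table (illustration only).
toy = pd.DataFrame({
    "Unnamed: 0": [0, 1, 2, 3],
    "RID": [3, 4, 5, 8],
    "Gender": [0, 0, 0, 1],
    "MMSE0m": [20.0, 27.0, 29.0, 28.0],
    "Diagnosis": [3, 1, 0, 0],
})

# Assumption: these columns only identify rows and carry no diagnostic signal.
id_columns = ["Unnamed: 0", "RID"]

# Features: everything except the identifiers and the target.
X = toy.drop(columns=id_columns + ["Diagnosis"])
y = toy["Diagnosis"]

print(X.columns.tolist())  # ['Gender', 'MMSE0m']
```

The same `drop(columns=...)` call could be applied to `dataset` before `train_test_split`, which would also keep the identifiers out of the scaler.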

### Pipeline: Data imputation & feature scaling

Let's extract all columns with missing values.

In [41]:
na_columns = dataset.columns[dataset.isnull().sum() > 0]
na_columns

Index(['MMSE0m', 'HipsASMbaseline', 'HipsContrastbaseline',
       'HipsCorelationbaseline', 'HipsVariancebaseline',
       'HipsSumAveragebaseline', 'HipsSumVariancebaseline',
       'HipsEntropybaseline', 'HipsClusterShadebaseline', 'ERCsASMbaseline',
       'ERCsContrastbaseline', 'ERCsCorelationbaseline',
       'ERCsVariancebaseline', 'ERCsSumAveragebaseline',
       'ERCsSumVariancebaseline', 'ERCsEntropybaseline',
       'ERCsClusterShadebaseline', 'ERCs_thicknessbaseline',
       'ERCsVolumebaseline', 'HipposcampusVolumebaseline'],
      dtype='object')
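
The list above is computed on the full dataset. As a hedged alternative, the same check can be run on the training split only and converted to a plain list, so that it can be reused instead of typing the column names by hand. The frame `X_train_toy` and its values are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy training split (made-up values) with gaps in two of three columns.
X_train_toy = pd.DataFrame({
    "Gender": [0, 1, 0, 1],
    "MMSE0m": [20.0, np.nan, 29.0, 28.0],
    "HipsASMbaseline": [np.nan, 0.06, 0.10, 0.08],
})

# Columns with at least one missing value, found on the training data only.
na_columns_train = X_train_toy.columns[X_train_toy.isnull().any()].tolist()

print(na_columns_train)  # ['MMSE0m', 'HipsASMbaseline']
```

A list built this way could be passed as the `variables` argument of `MeanMedianImputer`.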

In [42]:
pipe = Pipeline([
    ("imputer", MeanMedianImputer(
        imputation_method="mean", 
        variables=[
            'MMSE0m', 'HipsASMbaseline', 'HipsContrastbaseline',
            'HipsCorelationbaseline', 'HipsVariancebaseline',
            'HipsSumAveragebaseline', 'HipsSumVariancebaseline',
            'HipsEntropybaseline', 'HipsClusterShadebaseline', 
            'ERCsASMbaseline', 'ERCsContrastbaseline', 
            'ERCsCorelationbaseline', 'ERCsVariancebaseline', 
            'ERCsSumAveragebaseline', 'ERCsSumVariancebaseline',
            'ERCsEntropybaseline', 'ERCsClusterShadebaseline', 
            'ERCs_thicknessbaseline', 'ERCsVolumebaseline', 
            'HipposcampusVolumebaseline'
        ]
    )),
    ("scaler", StandardScaler().set_output(transform="pandas")),
])

pipe.fit(X_train)

# let's transform the data with the pipeline
X_train_scaled = pipe.transform(X_train)
X_test_scaled = pipe.transform(X_test)

**Note**

* Usually, after imputing the dataset, we would analyze and visualize the data to check whether imputation has changed it substantially. For simplicity's sake, we will temporarily skip this step.
* A split analysis will be done after the first iteration.
* No feature selection until the first iteration is done.
* No variables were dropped.
* No hyperparameter optimization was done.
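
The post-imputation check mentioned in the first note can be as simple as comparing summary statistics before and after filling the gaps. The sketch below uses a hypothetical column and plain pandas `fillna` as a stand-in for the pipeline's imputer; it is not the analysis planned for the next iteration.

```python
import numpy as np
import pandas as pd

# Hypothetical column with two missing entries.
before = pd.Series([20.0, 27.0, np.nan, 28.0, np.nan, 24.0], name="MMSE0m")

# Mean imputation, done with pandas here as a stand-in for the imputer step.
after = before.fillna(before.mean())

# Side-by-side summary: the mean is preserved, the spread shrinks.
summary = pd.DataFrame({
    "before": before.describe(),
    "after": after.describe(),
})
print(summary.loc[["count", "mean", "std"]])
```

Mean imputation keeps the column mean unchanged but reduces its standard deviation, which is the kind of effect such a check is meant to reveal.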

## Model Selection

We will experiment with several models that were previously mentioned in the paper:

* Logistic regression
* Support vector machine
* Decision tree
* Random forest

I will focus on only these four models for now, though I would love to see how a simple ANN performs here. I'll try that afterwards.

For simplicity's sake, I will not do any hyperparameter optimization yet.
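
One way to compare the four models listed above is to place the preprocessing steps inside the cross-validated estimator, so that imputation means and scaling statistics are re-estimated on each training fold. The sketch below is an illustration under assumptions, not the notebook's method: it uses synthetic data (`X_demo`, `y_demo`), scikit-learn's `SimpleImputer` in place of `MeanMedianImputer`, and built-in scorer names; `models` and `cv_scores` are hypothetical names.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_validate
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic multi-class data standing in for the real features (assumption).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=10, n_informative=6,
    n_classes=4, random_state=0,
)
# Blank out roughly 5% of the values to mimic missing measurements.
rng = np.random.default_rng(0)
X_demo[rng.random(X_demo.shape) < 0.05] = np.nan

models = {
    "Logistic regression": LogisticRegression(max_iter=1000, random_state=42),
    "Support vector machine": SVC(probability=True, random_state=42),
    "Decision tree": DecisionTreeClassifier(max_depth=5, random_state=42),
    "Random forest": RandomForestClassifier(max_depth=5, random_state=42),
}

kf_demo = KFold(n_splits=5, shuffle=True, random_state=42)

cv_scores = {}
for name, model in models.items():
    # Imputer and scaler are fit on the training part of each fold only.
    fold_pipe = Pipeline([
        ("imputer", SimpleImputer(strategy="mean")),
        ("scaler", StandardScaler()),
        ("model", model),
    ])
    res = cross_validate(
        fold_pipe, X_demo, y_demo, cv=kf_demo,
        scoring=["accuracy", "f1_weighted", "roc_auc_ovr"],
    )
    cv_scores[name] = {
        "accuracy": res["test_accuracy"].mean(),
        "f1": res["test_f1_weighted"].mean(),
        "roc_auc": res["test_roc_auc_ovr"].mean(),
    }

for name, scores in cv_scores.items():
    print(name, {k: round(v, 3) for k, v in scores.items()})
```

Scaling is not needed for the tree-based models, but keeping one pipeline shape for all four keeps the comparison uniform.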

The scikit-learn documentation covers each of the models below; for example, LogisticRegression:
https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html

In [85]:
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, cross_validate

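# With the lbfgs solver, a multinomial model is already fit for multi-class
# targets by default; the explicit multi_class argument is deprecated in
# recent scikit-learn releases, so it is omitted here.
lg = LogisticRegression(solver = "lbfgs",
                        max_iter = 1000,
                        random_state = 42)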

svm = SVC(kernel ='rbf', 
          decision_function_shape ='ovo',
          probability = True,
          random_state = 42)

dt = DecisionTreeClassifier(
     criterion ='gini',      # 'gini' or 'entropy'
     max_depth = 5,           # Set depth to prevent overfitting
     min_samples_split = 10,  # Minimum samples required to split a node
     min_samples_leaf = 5,    # Minimum samples required at a leaf node
     max_features = 'sqrt',    # Use square root of features
     random_state = 42)

rf = RandomForestClassifier(
     n_estimators = 100,      # Number of trees in the forest
     criterion = 'gini',      # 'gini' or 'entropy'
     max_depth = 5,           # Set depth to prevent overfitting
     min_samples_split = 10,  # Minimum samples required to split a node
     min_samples_leaf = 5,    # Minimum samples required at a leaf node
     max_features = 'sqrt',    # Use square root of features
     bootstrap = True,        # Use bootstrap sampling
     random_state = 42)

In [86]:
from sklearn.metrics import make_scorer, precision_score, recall_score, f1_score, roc_auc_score

kf = KFold(n_splits = 5, 
           shuffle = True, 
           random_state = 42)

# Define metrics to evaluate
scoring_metrics = {
    'accuracy': 'accuracy',
    'precision': make_scorer(precision_score, average = 'weighted', zero_division=0),
    'recall': make_scorer(recall_score, average = 'weighted', zero_division=0),
    'f1': make_scorer(f1_score, average = 'weighted', zero_division=0),
    'roc_auc': make_scorer(roc_auc_score, multi_class='ovr', response_method = "predict_proba")
}

lg_results = cross_validate(
    lg,
    X_train_scaled,
    y_train,
    scoring = scoring_metrics,
    return_train_score = True,
    cv = kf)

svm_results = cross_validate(
    svm,
    X_train_scaled,
    y_train,
    scoring = scoring_metrics,
    return_train_score = True,
    cv = kf)

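# Tree-based models are insensitive to feature scaling, so they are fit on the
# raw X_train. Note that X_train still contains missing values, so this relies
# on the native NaN handling of tree estimators in recent scikit-learn versions.
dt_results = cross_validate(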
    dt,
    X_train,
    y_train,
    scoring = scoring_metrics,
    return_train_score = True,
    cv = kf)

rf_results = cross_validate(
    rf,
    X_train,
    y_train,
    scoring = scoring_metrics,
    return_train_score = True,
    cv = kf)

# Print the mean ± standard deviation of every metric for each model
all_results = {
    "Logistic Regression": lg_results,
    "Support Vector Machine": svm_results,
    "Decision Tree": dt_results,
    "Random Forest": rf_results,
}

metric_labels = {
    'train_accuracy': 'Mean train set accuracy:',
    'test_accuracy': 'Mean test set accuracy:',
    'test_precision': 'Mean precision:',
    'test_recall': 'Mean recall:',
    'test_f1': 'Mean F1 score:',
    'test_roc_auc': 'Mean ROC AUC:',
}

for model_name, results in all_results.items():
    print(model_name)
    for key, label in metric_labels.items():
        print(label, np.mean(results[key]), '±', np.std(results[key]))
    print()  # blank line between models

Logistic Regression
Mean train set accuracy: 0.7276470588235295 ± 0.013745084053585797
Mean test set accuracy: 0.6423529411764706 ± 0.038375309247764916
Mean precision: 0.6516870783681201 ± 0.03412220878151855
Mean recall: 0.6423529411764706 ± 0.038375309247764916
Mean F1 score: 0.6399057838266943 ± 0.03730497417504237
Mean ROC AUC: 0.8614475185771248 ± 0.022785370156711542 

Support Vector Machine
Mean train set accuracy: 0.8194117647058823 ± 0.008442764761416076
Mean test set accuracy: 0.6188235294117647 ± 0.03690444033260735
Mean precision: 0.6261895072083623 ± 0.05357789052945798
Mean recall: 0.6188235294117647 ± 0.03690444033260735
Mean F1 score: 0.601580918420586 ± 0.03826163961470308
Mean ROC AUC: 0.8386040607469702 ± 0.022730705036446718 

Decision Tree
Mean train set accuracy: 0.6882352941176471 ± 0.013542193450848648
Mean test set accuracy: 0.5341176470588236 ± 0.07609864295904109
Mean precision: 0.5376986290470461 ± 0.07457026661839881
Mean recall: 0.5341176470588236 ± 0.076