## Modelling

In [1]:
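import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from feature_engine.imputation import MeanMedianImputer  # third-party feature_engine package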

In [None]:
X_train_data = pd.read_csv("C:/Users/steve/Desktop/Notebooks/Thesis-Project/datasets/processed/filtered_features/all_groups/train_processed.csv")
X_test_data = pd.read_csv("C:/Users/steve/Desktop/Notebooks/Thesis-Project/datasets/processed/filtered_features/all_groups/test_processed.csv")

y_train = X_train_data["Diagnosis"]
y_test = X_test_data["Diagnosis"]
X_train = X_train_data.drop("Diagnosis", axis=1)
X_test = X_test_data.drop("Diagnosis", axis=1)

X_train.shape, X_test.shape, y_train.shape, y_test.shape

| Unnamed: 0 | MMSE0m | HipsASMbaseline | HipsContrastbaseline | HipsCorelationbaseline | HipsSumVariancebaseline | HipsEntropybaseline | HipsClusterShadebaseline | ERCsContrastbaseline | ERCsCorelationbaseline | ERCsVariancebaseline | ERCsSumAveragebaseline | ERCsSumVariancebaseline | ERCsClusterShadebaseline | ERCs_thicknessbaseline | ERCsVolumebaseline | HipposcampusVolumebaseline |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 24.0 | 0.09 | 170.02 | 0.45 | 615.39 | 3.60 | 16693.64 | 257.59 | 0.43 | 224.58 | 29.39 | 640.73 | 57.73 | 2.53 | 1278.0 | 2448.0 |
| 1 | 26.0 | 0.08 | 165.75 | 0.51 | 550.26 | 3.59 | 22784.56 | 217.25 | 0.51 | 217.43 | 28.37 | 652.46 | 2072.42 | 2.61 | 1027.0 | 2349.0 |
| 2 | 20.0 | 0.15 | 147.66 | 0.57 | 575.46 | 3.49 | 7233.57 | 287.61 | 0.39 | 233.07 | 29.95 | 644.68 | -467.36 | 2.45 | 1819.0 | 3631.0 |
| 3 | 29.0 | 0.12 | 184.28 | 0.58 | 677.96 | 3.55 | 10587.17 | 204.92 | 0.52 | 213.05 | 29.07 | 654.06 | 1421.62 | 3.48 | 2002.0 | 3400.0 |
| 4 | 28.0 | 0.14 | 217.01 | 0.53 | 696.61 | 2.83 | -507.05 | 217.01 | 0.53 | 228.40 | 29.23 | 696.61 | -507.05 | 3.36 | 1945.0 | 4210.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 420 | 30.0 | 0.11 | 182.35 | 0.57 | 666.08 | 3.75 | 10796.82 | 209.84 | 0.49 | 209.96 | 25.64 | 630.00 | -2180.39 | 2.38 | 1199.0 | 2999.0 |
| 421 | 30.0 | 0.07 | 166.95 | 0.49 | 516.98 | 3.83 | 25330.58 | 235.69 | 0.47 | 222.23 | 29.77 | 653.24 | 1565.56 | 2.71 | 1446.0 | 3163.0 |
| 422 | 24.0 | 0.13 | 178.95 | 0.58 | 713.98 | 3.51 | 11179.20 | 204.70 | 0.53 | 214.28 | 24.44 | 652.40 | -3077.22 | 2.33 | 1327.0 | 2918.0 |
| 423 | 27.0 | 0.08 | 168.89 | 0.52 | 563.85 | 3.72 | 21153.11 | 207.24 | 0.51 | 216.75 | 29.69 | 659.77 | 1771.64 | 2.81 | 1801.0 | 3543.0 |


In [26]:
# Drop the RID column, then split into training (70%) and test (30%) sets
dataset.drop(labels="RID", axis=1, inplace=True)

X_train, X_test, y_train, y_test = train_test_split(
    dataset.drop("Diagnosis", axis=1),  # features
    dataset["Diagnosis"],  # target
    test_size=0.3,
    random_state=0,
)

X_train.shape, X_test.shape

((425, 22), (183, 22))

### Pipeline: Data imputation & feature scaling

Let's extract all columns with missing values.

In [27]:
na_columns = dataset.columns[dataset.isnull().sum() > 0]
na_columns

Index(['MMSE0m', 'HipsASMbaseline', 'HipsContrastbaseline',
       'HipsCorelationbaseline', 'HipsVariancebaseline',
       'HipsSumAveragebaseline', 'HipsSumVariancebaseline',
       'HipsEntropybaseline', 'HipsClusterShadebaseline', 'ERCsASMbaseline',
       'ERCsContrastbaseline', 'ERCsCorelationbaseline',
       'ERCsVariancebaseline', 'ERCsSumAveragebaseline',
       'ERCsSumVariancebaseline', 'ERCsEntropybaseline',
       'ERCsClusterShadebaseline', 'ERCs_thicknessbaseline',
       'ERCsVolumebaseline', 'HipposcampusVolumebaseline'],
      dtype='object')
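To see how much is missing, not only which columns are affected, the same check can report counts and percentages. The sketch below runs on a small made-up frame; the name `toy` and all of its values are illustrative and are not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real data; values are invented for illustration.
toy = pd.DataFrame({
    "MMSE0m": [24.0, np.nan, 20.0, 29.0],
    "ERCsVolumebaseline": [1278.0, 1027.0, np.nan, np.nan],
    "Diagnosis": [0, 1, 2, 1],
})

# Count missing values per column and express them as a share of all rows.
na_counts = toy.isnull().sum()
na_summary = pd.DataFrame({
    "n_missing": na_counts,
    "pct_missing": 100 * na_counts / len(toy),
})

# Keep only the columns that actually contain missing values.
na_summary = na_summary[na_summary["n_missing"] > 0]
print(na_summary)
```

Applied to `dataset`, a table like this would show whether a few columns account for most of the missing values, which may matter when choosing an imputation method.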

In [28]:
pipe = Pipeline([
    ("imputer", MeanMedianImputer(
        imputation_method="mean", 
        variables=[
            'MMSE0m', 'HipsASMbaseline', 'HipsContrastbaseline',
            'HipsCorelationbaseline', 'HipsVariancebaseline',
            'HipsSumAveragebaseline', 'HipsSumVariancebaseline',
            'HipsEntropybaseline', 'HipsClusterShadebaseline', 
            'ERCsASMbaseline', 'ERCsContrastbaseline', 
            'ERCsCorelationbaseline', 'ERCsVariancebaseline', 
            'ERCsSumAveragebaseline', 'ERCsSumVariancebaseline',
            'ERCsEntropybaseline', 'ERCsClusterShadebaseline', 
            'ERCs_thicknessbaseline', 'ERCsVolumebaseline', 
            'HipposcampusVolumebaseline'
        ]
    )),
    ("scaler", StandardScaler().set_output(transform="pandas")),
])

pipe.fit(X_train)

# let's transform the data with the pipeline
X_train_scaled = pipe.transform(X_train)
X_test_scaled = pipe.transform(X_test)

**Note**

* Typically, after imputing the dataset, we analyze and visualize the data to assess whether it has been significantly affected. However, for the sake of simplicity, we will temporarily skip this step.
* A split analysis will be conducted after the first iteration.
* Additionally, no feature selection will be performed until we complete the first iteration.
* Variables have not been dropped.
* Hyperparameter optimization has not been performed.
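
The post-imputation check mentioned in the first point could start with something as simple as comparing summary statistics before and after. This is only a sketch on invented values: `SimpleImputer` from scikit-learn is used as a stand-in for `MeanMedianImputer`, and the column name `feature` is hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Invented values for one feature; NaN marks a missing measurement.
raw = pd.DataFrame({"feature": [2.0, 4.0, np.nan, 6.0, np.nan, 8.0]})

# Mean imputation (SimpleImputer stands in for MeanMedianImputer here).
imputer = SimpleImputer(strategy="mean")
imputed = pd.DataFrame(imputer.fit_transform(raw), columns=raw.columns)

# Compare summary statistics before and after imputation.
comparison = pd.DataFrame({
    "before": raw["feature"].agg(["mean", "std"]),
    "after": imputed["feature"].agg(["mean", "std"]),
})
print(comparison)
```

Mean imputation leaves the mean unchanged but shrinks the standard deviation, because every filled value sits exactly at the centre of the distribution. A large drop in spread for a column would be one sign that imputation has noticeably altered it.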

## Model Selection

We will experiment with several models that were mentioned in the paper:

* Logistic regression
* Support vector machine
* Decision tree
* Random forest

I will focus on only these four models for now, though I would love to see how a simple ANN performs here. I'll try that afterwards.

The scikit-learn documentation covers each of the models below; for example, the page for LogisticRegression:
https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html

In [29]:
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

# With the lbfgs solver a multinomial model is fitted for multiclass targets;
# the explicit multi_class argument is deprecated in recent scikit-learn releases.
lg = LogisticRegression(multi_class="multinomial", solver="lbfgs", max_iter=1000, random_state=42)

# probability=True enables predict_proba, which the ROC AUC scorer needs
svm = SVC(kernel="rbf", decision_function_shape="ovo", probability=True, random_state=42)

dt = DecisionTreeClassifier(criterion="gini", max_depth=5, min_samples_split=10,
                            min_samples_leaf=5, max_features="sqrt", random_state=42)

rf = RandomForestClassifier(n_estimators=100, criterion="gini", max_depth=5, min_samples_split=10,
                            min_samples_leaf=5, max_features="sqrt", bootstrap=True, random_state=42)

The paper states that 5-fold cross-validation was used, so we will replicate that approach.

ROC AUC is reported here along with accuracy, precision, recall, and F1.
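
To make the 5-fold setup concrete, the sketch below shows how `KFold` would partition a training set of 425 rows, the size printed for `X_train` earlier. The array `placeholder_X` and the name `kf_demo` are illustrative; only the number of rows matters for the split.

```python
import numpy as np
from sklearn.model_selection import KFold

# 425 matches the training-set size printed earlier; the rows are placeholders.
n_samples = 425
placeholder_X = np.zeros((n_samples, 1))

kf_demo = KFold(n_splits=5, shuffle=True, random_state=42)

# Record (training rows, validation rows) for each of the five folds.
fold_sizes = []
for train_idx, val_idx in kf_demo.split(placeholder_X):
    fold_sizes.append((len(train_idx), len(val_idx)))

print(fold_sizes)
```

Each fold trains on 340 rows and validates on 85. If class proportions differ noticeably between diagnoses, `StratifiedKFold` accepts the same arguments and keeps those proportions similar across folds; whether the paper stratified its folds is not stated here.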

In [34]:
from sklearn.metrics import make_scorer, accuracy_score, precision_score, recall_score, f1_score, roc_auc_score
from sklearn.model_selection import KFold, cross_validate

kf = KFold(n_splits = 5, shuffle = True, random_state = 42)

# Define metrics to evaluate
scoring_metrics = {
    'accuracy': make_scorer(accuracy_score),
    'precision': make_scorer(precision_score, average = 'weighted', zero_division=0),
    'recall': make_scorer(recall_score, average = 'weighted', zero_division=0),
    'f1': make_scorer(f1_score, average = 'weighted', zero_division=0),
    'roc_auc': make_scorer(roc_auc_score, multi_class='ovr', response_method = "predict_proba")
}

models = {
    "Logistic Regression": lg,
    "Support Vector Machine": svm,
    "Decision Tree": dt,
    "Random Forest": rf,
}

model_data_mapping = {
    'Logistic Regression': X_train_scaled,
    'Support Vector Machine': X_train_scaled,
    'Decision Tree': X_train,
    'Random Forest': X_train
}
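
One caveat with this mapping: `X_train_scaled` comes from a pipeline fitted on the whole training set, so within each fold the validation rows have already contributed to the imputation means and scaling statistics. The effect is often small, but a possible alternative is to pass a pipeline that includes the model to `cross_validate`, so preprocessing is refit on the training part of every fold. The sketch below uses synthetic data and scikit-learn's `SimpleImputer` in place of `MeanMedianImputer`; the names `X_demo`, `y_demo`, `lg_pipe`, and `cv_demo` are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic three-class data standing in for the real features.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=10, n_informative=5,
    n_classes=3, random_state=0,
)

# Imputer and scaler are refit on the training part of every fold.
lg_pipe = make_pipeline(
    SimpleImputer(strategy="mean"),
    StandardScaler(),
    LogisticRegression(max_iter=1000, random_state=42),
)

cv_demo = KFold(n_splits=5, shuffle=True, random_state=42)
demo_results = cross_validate(lg_pipe, X_demo, y_demo, scoring="accuracy", cv=cv_demo)
print(demo_results["test_score"].mean())
```

With this layout the same raw `X_train` could be passed for every model, and the scaled and unscaled variants would differ only in which steps their pipeline contains.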

Now let's check the metrics.

In [35]:
for model_name, model in models.items():

    X_train_to_use = model_data_mapping[model_name]
    
    results = cross_validate(model, 
                             X_train_to_use, 
                             y_train, 
                             scoring = scoring_metrics,
                             return_train_score = True,
                             cv = kf)
    print("------------------------------------------------------")
    print(model_name)
    for metric in scoring_metrics:
        print(f"Mean train {metric}:", np.mean(results[f"train_{metric}"]), "±", np.std(results[f"train_{metric}"]))
        print(f"Mean test {metric}:", np.mean(results[f"test_{metric}"]), "±", np.std(results[f"test_{metric}"]))

------------------------------------------------------
Logistic Regression
Mean train accuracy: 0.7176470588235294 ± 0.010685824779167581
Mean test accuracy: 0.628235294117647 ± 0.021820278812931068
Mean train precision: 0.7137715549459933 ± 0.009994613394817947
Mean test precision: 0.6346294513584386 ± 0.016375364642513202
Mean train recall: 0.7176470588235294 ± 0.010685824779167581
Mean test recall: 0.628235294117647 ± 0.021820278812931068
Mean train f1: 0.7141321948311611 ± 0.01008676011818872
Mean test f1: 0.6264786751157274 ± 0.018573953162226077
Mean train roc_auc: 0.9092337235320909 ± 0.005358305500420103
Mean test roc_auc: 0.8598573511351006 ± 0.021712265716621503
------------------------------------------------------
Support Vector Machine
Mean train accuracy: 0.8023529411764706 ± 0.005060191333554468
Mean test accuracy: 0.5999999999999999 ± 0.034899757586332535
Mean train precision: 0.8076762897810934 ± 0.004279107619041377
Mean test precision: 0.5951572225624752 ± 0.06058994
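
Printed line by line, these results are hard to compare across models. One option is to collect the means and standard deviations into a table. The helper `summarise_cv` below is hypothetical, and the fold scores in `fake_results` are made up to mimic the dictionary that `cross_validate` returns with `return_train_score=True`.

```python
import numpy as np
import pandas as pd

def summarise_cv(results, metrics):
    """Return mean and std of train/test scores per metric (hypothetical helper)."""
    rows = []
    for metric in metrics:
        for split in ("train", "test"):
            scores = np.asarray(results[f"{split}_{metric}"])
            rows.append({
                "metric": metric,
                "split": split,
                "mean": scores.mean(),
                "std": scores.std(),
            })
    return pd.DataFrame(rows)

# Made-up fold scores in the shape cross_validate returns.
fake_results = {
    "train_accuracy": np.array([0.70, 0.72, 0.71, 0.73, 0.69]),
    "test_accuracy": np.array([0.60, 0.64, 0.62, 0.66, 0.58]),
}
summary = summarise_cv(fake_results, ["accuracy"])
print(summary.round(3))
```

Inside the loop above, calling `summarise_cv(results, scoring_metrics)` for each model and concatenating the frames would give one table covering all four models.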

Judging by accuracy alone, Random Forest is the best model, with a mean test accuracy of 0.6329, and Decision Tree is the worst, with a mean test accuracy of 0.4847.
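
The ranking can be made explicit with a few lines of plain Python. The accuracies below are the mean test values quoted in the text and output above, rounded to four decimals; the variable names are illustrative.

```python
# Mean test accuracies quoted in the text and output above (rounded).
mean_test_accuracy = {
    "Logistic Regression": 0.6282,
    "Support Vector Machine": 0.6000,
    "Decision Tree": 0.4847,
    "Random Forest": 0.6329,
}

# Sort from best to worst by mean test accuracy.
ranking = sorted(mean_test_accuracy.items(), key=lambda item: item[1], reverse=True)
for name, acc in ranking:
    print(f"{name}: {acc:.4f}")

# Gap between the top two models.
best_name, best_acc = ranking[0]
runner_up_name, runner_up_acc = ranking[1]
gap = best_acc - runner_up_acc
print(f"{best_name} leads {runner_up_name} by {gap:.4f}")
```

The gap between Random Forest and Logistic Regression is about 0.005, which is smaller than the fold-to-fold standard deviation reported for Logistic Regression (about 0.022), so the ordering of the top two models should be treated as tentative.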