## Modeling

In [12]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler 

# Import the processed train/test splits
data_dir = "C:/Users/steve/Desktop/Notebooks/Thesis-Project/Datasets/processed"
X_train = pd.read_csv(f"{data_dir}/X_train_processed.csv")
X_test = pd.read_csv(f"{data_dir}/X_test_processed.csv")
y_train = pd.read_csv(f"{data_dir}/y_train_processed.csv")
y_test = pd.read_csv(f"{data_dir}/y_test_processed.csv")

In [13]:
sc = StandardScaler()

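# Fit the scaler on the training set only, then apply the same transform to the test set
X_train_scaled = sc.fit_transform(X_train)
X_test_scaled = sc.transform(X_test)

# Flatten the single-column target frames into 1-D arrays, as scikit-learn expects
y_train = np.ravel(y_train)
y_test = np.ravel(y_test)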

X_train_scaled.shape, X_test_scaled.shape, y_train.shape, y_test.shape

((425, 16), (183, 16), (425,), (183,))

### Pipeline: Data imputation & feature scaling

Let's extract all columns with missing values.
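As a minimal sketch of that extraction step, the helper below counts missing values per column and keeps only the columns that have at least one. The function name `columns_with_missing` and the toy frame (with made-up column names) are assumptions for illustration; on the real data the same call would be applied to `X_train`.

```python
import numpy as np
import pandas as pd


def columns_with_missing(df: pd.DataFrame) -> pd.Series:
    """Return missing-value counts for columns that contain at least one NaN."""
    counts = df.isna().sum()
    return counts[counts > 0].sort_values(ascending=False)


# Toy frame standing in for X_train (hypothetical column names)
toy = pd.DataFrame({
    "age": [25, np.nan, 40, 31],
    "score": [1.0, 2.0, 3.0, 4.0],
    "income": [np.nan, np.nan, 50.0, 60.0],
})

missing = columns_with_missing(toy)
print(missing)
```

Columns without missing values (here `score`) are dropped from the result, so the output doubles as the list of columns that need imputation.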

**Note**

* Typically, after imputing the dataset, we analyze and visualize the data to assess whether it has been significantly affected. However, for the sake of simplicity, we will temporarily skip this step.
* A split analysis will be conducted after the first iteration.
* Additionally, no feature selection will be performed until we complete the first iteration.
* Variables have not been dropped.
* Hyperparameter optimization has not been performed.
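One possible way to combine imputation and scaling is a scikit-learn `Pipeline`, which refits both steps on the training portion of each cross-validation fold so that validation rows do not influence the imputation or scaling statistics. The sketch below runs on synthetic data from `make_classification`; the three-class target, the median imputation strategy, and the names `X_demo`, `y_demo`, and `pipe` are all assumptions, not the notebook's actual setup.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the thesis data (16 features, 3 classes assumed)
X_demo, y_demo = make_classification(
    n_samples=200, n_features=16, n_informative=6, n_classes=3, random_state=42
)

# Knock out roughly 5% of the values to simulate missing data
rng = np.random.default_rng(42)
X_demo[rng.random(X_demo.shape) < 0.05] = np.nan

# Imputation and scaling are fitted inside each training fold only
pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000, random_state=42)),
])

cv = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(pipe, X_demo, y_demo, cv=cv, scoring="accuracy")
print(scores.mean(), scores.std())
```

Tree-based models do not need the scaling step, so for those the `"scale"` entry could simply be left out of the pipeline.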

## Model Selection

We will experiment with the models that were mentioned in the paper:

* Logistic regression
* Support vector machine
* Decision tree
* Random forest

I will focus on only these four models for now. I would also like to see how a simple ANN performs here, and I'll try that afterwards.

The scikit-learn documentation covers each of the models below; for example, the reference page for logistic regression:
https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html

In [14]:
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

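# Note: multi_class is deprecated in recent scikit-learn releases (1.5+);
# the lbfgs solver already fits a multinomial model for multiclass targets.
lg = LogisticRegression(multi_class = "multinomial", solver = "lbfgs", max_iter = 1000, random_state = 42)

# probability = True enables predict_proba, which the ROC AUC scorer needs
svm = SVC(kernel ='rbf', decision_function_shape ='ovo', probability = True, random_state = 42)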

dt = DecisionTreeClassifier(criterion ='gini', max_depth = 5, min_samples_split = 10, 
                            min_samples_leaf = 5, max_features = 'sqrt', random_state = 42)

rf = RandomForestClassifier(n_estimators = 100, criterion = 'gini', max_depth = 5, min_samples_split = 10, 
                            min_samples_leaf = 5, max_features = 'sqrt', bootstrap = True, random_state = 42)

The paper states that 5-fold cross-validation was used, so we will replicate that approach.
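To make the fold arithmetic concrete, the short sketch below splits 425 row indices (the training-set size reported earlier in this notebook) with the same `KFold` settings and prints the train/validation size of each fold. The names `n_rows`, `kf_demo`, and `fold_sizes` are illustrative only.

```python
import numpy as np
from sklearn.model_selection import KFold

n_rows = 425  # number of training rows reported earlier in the notebook
kf_demo = KFold(n_splits=5, shuffle=True, random_state=42)

# Only the number of rows matters for the split, so a placeholder array is enough
placeholder = np.zeros((n_rows, 1))
fold_sizes = [(len(train_idx), len(val_idx)) for train_idx, val_idx in kf_demo.split(placeholder)]
print(fold_sizes)
# → [(340, 85), (340, 85), (340, 85), (340, 85), (340, 85)]
```

Each model is therefore trained on 340 rows and validated on 85 rows per fold. Plain `KFold` does not preserve class proportions; `StratifiedKFold` would be an alternative if class balance across folds matters.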

ROC AUC, along with the other metrics mentioned, will be computed here.
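For a multiclass target, ROC AUC is computed from class probabilities rather than hard predictions. The sketch below illustrates the one-vs-rest variant on synthetic data: `roc_auc_score` receives the `(n_samples, n_classes)` output of `predict_proba` and, by default, macro-averages the per-class AUCs. The three-class synthetic dataset and all variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic three-class stand-in for the thesis data
X_demo, y_demo = make_classification(
    n_samples=300, n_features=16, n_informative=6, n_classes=3, random_state=42
)
X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=42
)

clf = LogisticRegression(max_iter=1000, random_state=42).fit(X_tr, y_tr)

# One probability column per class; each row sums to 1
proba = clf.predict_proba(X_va)

# One-vs-rest AUC, macro-averaged over the classes by default
auc_ovr = roc_auc_score(y_va, proba, multi_class="ovr")
print(proba.shape, round(auc_ovr, 3))
```

This is why every model passed to a probability-based ROC AUC scorer needs a working `predict_proba`.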

In [15]:
from sklearn.metrics import make_scorer, accuracy_score, precision_score, recall_score, f1_score, roc_auc_score
from sklearn.model_selection import KFold, cross_validate

kf = KFold(n_splits = 5, shuffle = True, random_state = 42)

# Define metrics to evaluate
scoring_metrics = {
    'accuracy': make_scorer(accuracy_score),
    'precision': make_scorer(precision_score, average = 'weighted', zero_division=0),
    'recall': make_scorer(recall_score, average = 'weighted', zero_division=0),
    'f1': make_scorer(f1_score, average = 'weighted', zero_division=0),
    'roc_auc': make_scorer(roc_auc_score, multi_class='ovr', response_method = "predict_proba")
}

models  = {"Logistic Regression": lg, 
           "Support Vector Machine": svm, 
           "Decision Tree": dt, 
           "Random Forest": rf
}

model_data_mapping = {
    'Logistic Regression': X_train_scaled,
    'Support Vector Machine': X_train_scaled,
    'Decision Tree': X_train,
    'Random Forest': X_train
}

Now let's check the metrics.

In [16]:
for model_name, model in models.items():

    X_train_to_use = model_data_mapping[model_name]
    
    results = cross_validate(model, 
                             X_train_to_use, 
                             y_train, 
                             scoring = scoring_metrics,
                             return_train_score = True,
                             cv = kf)
    print("------------------------------------------------------")
    print(model_name)
    # Report mean ± standard deviation across the folds for every metric
    for metric in scoring_metrics:
        train_scores = results[f'train_{metric}']
        test_scores = results[f'test_{metric}']
        print(f'Mean train {metric}:', np.mean(train_scores), '±', np.std(train_scores))
        print(f'Mean test {metric}:', np.mean(test_scores), '±', np.std(test_scores))

------------------------------------------------------
Logistic Regression
Mean train accuracy: 0.6941176470588235 ± 0.01146681687624584
Mean test accuracy: 0.64 ± 0.03121529214452141
Mean train precision: 0.6871284262244898 ± 0.012349662904269539
Mean test precision: 0.6531650575161011 ± 0.018219520864726575
Mean train recall: 0.6941176470588235 ± 0.01146681687624584
Mean test recall: 0.64 ± 0.03121529214452141
Mean train f1: 0.6875366459443062 ± 0.011608383632557891
Mean test f1: 0.6393745655926393 ± 0.027455935848831044
Mean train roc_auc: 0.8920854430888395 ± 0.004100231014048017
Mean test roc_auc: 0.8532891781281544 ± 0.01923201734459709
------------------------------------------------------
Support Vector Machine
Mean train accuracy: 0.7752941176470587 ± 0.009593827311941234
Mean test accuracy: 0.583529411764706 ± 0.038375309247764895
Mean train precision: 0.7749055251185821 ± 0.010843378163222764
Mean test precision: 0.5800207410258946 ± 0.031212381300747563
Mean train recall: 0

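Based on the accuracy metric alone, Random Forest is reported as the best model, with a mean test accuracy of 0.6329, and Decision Tree as the worst, with a mean test accuracy of 0.4847.
Note, however, that the Logistic Regression output above shows a mean test accuracy of 0.64, which is slightly higher than 0.6329, so this ranking should be rechecked against the latest run.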
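Reading the ranking off printed output is error-prone, so one option is to collect the fold means into a table and sort it. The sketch below does this on synthetic data with two of the model types used above; the dataset, the reduced model set, and the names `demo_models` and `summary` are assumptions for illustration, and the numbers it prints say nothing about the thesis data.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, cross_validate
from sklearn.tree import DecisionTreeClassifier

# Synthetic three-class stand-in for the thesis data
X_demo, y_demo = make_classification(
    n_samples=300, n_features=16, n_informative=6, n_classes=3, random_state=42
)

demo_models = {
    "Decision Tree": DecisionTreeClassifier(max_depth=5, random_state=42),
    "Random Forest": RandomForestClassifier(n_estimators=100, max_depth=5, random_state=42),
}

cv = KFold(n_splits=5, shuffle=True, random_state=42)

rows = {}
for name, estimator in demo_models.items():
    res = cross_validate(estimator, X_demo, y_demo, cv=cv, scoring=["accuracy", "f1_weighted"])
    rows[name] = {
        "test_accuracy_mean": res["test_accuracy"].mean(),
        "test_accuracy_std": res["test_accuracy"].std(),
        "test_f1_weighted_mean": res["test_f1_weighted"].mean(),
    }

# One row per model, best mean test accuracy first
summary = pd.DataFrame.from_dict(rows, orient="index").sort_values(
    "test_accuracy_mean", ascending=False
)
print(summary.round(3))
```

Keeping the standard deviation next to the mean helps judge whether a gap between two models is larger than the fold-to-fold variation.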