## Model comparison and selection

1. Evaluate all the saved models on the validation set.
2. Select the best model based on its performance on the validation set.
3. Evaluate that model on the holdout test set.

In real-world scenarios, a separate test set is often unavailable because it is meant to represent new, unseen data. In such cases, the literature commonly recommends dividing the available data into training and validation sets using a 70/30 or 80/20 ratio.
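As a minimal sketch of such a split, the example below uses scikit-learn's `train_test_split` on synthetic stand-in data. The arrays `X` and `y`, the seed, and the use of stratification are assumptions made for illustration; they are not taken from this notebook's dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (hypothetical): 100 samples, 4 features, balanced binary labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = np.array([0, 1] * 50)

# 80/20 split; pass test_size=0.3 instead for a 70/30 split.
# stratify=y keeps the class proportions similar in both parts.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(X_train.shape, X_val.shape)
```

Fixing `random_state` makes the split reproducible, which helps when several models are compared on the same validation rows.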

## Libraries

In [2]:
import joblib
import pandas as pd
from sklearn.metrics import accuracy_score, precision_score, recall_score
from time import time

## Import Validation and Test Data

In [3]:
val_features = pd.read_csv('val_features.csv')
val_labels = pd.read_csv('val_labels.csv')

test_features = pd.read_csv('test_features.csv')
test_labels = pd.read_csv('test_labels.csv')

## Import Models

In [6]:
models = {}
for mdl in ['LR', 'SVC', 'MLP', 'RF', 'GB']:
    models[mdl] = joblib.load(f'{mdl}_model.pkl')

In [7]:
models

{'LR': LogisticRegression(C=1, class_weight=None, dual=False, fit_intercept=True,
                    intercept_scaling=1, l1_ratio=None, max_iter=1000,
                    multi_class='auto', n_jobs=None, penalty='l2',
                    random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                    warm_start=False),
 'SVC': SVC(C=0.1, break_ties=False, cache_size=200, class_weight=None, coef0=0.0,
     decision_function_shape='ovr', degree=3, gamma='scale', kernel='linear',
     max_iter=-1, probability=False, random_state=None, shrinking=True,
     tol=0.001, verbose=False),
 'MLP': MLPClassifier(activation='tanh', alpha=0.0001, batch_size='auto', beta_1=0.9,
               beta_2=0.999, early_stopping=False, epsilon=1e-08,
               hidden_layer_sizes=(50,), learning_rate='invscaling',
               learning_rate_init=0.001, max_fun=15000, max_iter=1000,
               momentum=0.9, n_iter_no_change=10, nesterovs_momentum=True,
               power_t=0.5, ran

## Function to evaluate models

In [8]:
def evaluate_model(name, model, features, labels):
    """Print accuracy, precision, recall, and prediction latency for one model."""
    # Time only the prediction step
    start = time()
    pred = model.predict(features)
    end = time()

    # Round each metric to three decimals for a compact report
    accuracy = round(accuracy_score(labels, pred), 3)
    precision = round(precision_score(labels, pred), 3)
    recall = round(recall_score(labels, pred), 3)

    print(f'{name} -- Accuracy: {accuracy} / Precision: {precision} / Recall: {recall} / Latency: {round((end-start)*1000, 1)}ms')

## Evaluate models on the validation set

![Evaluation Metrics](img/eval_metrics.png)

In [9]:
for name, mdl in models.items():
    evaluate_model(name, mdl, val_features, val_labels)

LR -- Accuracy: 0.775 / Precision: 0.712 / Recall: 0.646 / Latency: 5.0ms
SVC -- Accuracy: 0.747 / Precision: 0.672 / Recall: 0.6 / Latency: 2.0ms
MLP -- Accuracy: 0.753 / Precision: 0.684 / Recall: 0.6 / Latency: 50.2ms
RF -- Accuracy: 0.809 / Precision: 0.792 / Recall: 0.646 / Latency: 98.9ms
GB -- Accuracy: 0.809 / Precision: 0.804 / Recall: 0.631 / Latency: 0.0ms
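A latency of 0.0ms, as reported for GB above, may reflect the limited resolution of `time()` on some platforms rather than a truly instantaneous prediction. One possible way to get steadier numbers is to time with `perf_counter` and take the median over several runs. The sketch below assumes a hypothetical helper `measure_latency_ms` and a stand-in `dummy_predict` function in place of a real `model.predict`.

```python
from statistics import median
from time import perf_counter


def measure_latency_ms(predict, features, repeats=20):
    """Return the median wall-clock time of predict(features) in milliseconds."""
    timings = []
    for _ in range(repeats):
        start = perf_counter()  # high-resolution monotonic clock
        predict(features)
        timings.append((perf_counter() - start) * 1000)
    return median(timings)


# Stand-in for model.predict (hypothetical): flags rows with a positive sum.
def dummy_predict(rows):
    return [sum(row) > 0 for row in rows]


rows = [[0.1, -0.2, 0.3]] * 1000
latency = measure_latency_ms(dummy_predict, rows)
print(f'Latency: {latency:.3f}ms')
```

With the notebook's models, the call would look like `measure_latency_ms(models['GB'].predict, val_features)`.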


## Evaluate best model on test set

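Based on the validation results, Random Forest and Gradient Boosting perform best, tying on accuracy (0.809). Gradient Boosting has slightly higher precision (0.804 vs. 0.792), while Random Forest has slightly higher recall (0.646 vs. 0.631). The choice between them should depend on which metric matters most for the application, so both are evaluated on the test set here.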

In [10]:
evaluate_model('Random Forest', models['RF'], test_features, test_labels)
evaluate_model('Gradient Boosting', models['GB'], test_features, test_labels)

Random Forest -- Accuracy: 0.804 / Precision: 0.836 / Recall: 0.671 / Latency: 44.6ms
Gradient Boosting -- Accuracy: 0.816 / Precision: 0.852 / Recall: 0.684 / Latency: 1.7ms
