# Model Selection Protocol

Model selection is performed as follows:
1. Define the machine learning algorithms used to fit models on the given data:
-- Logistic Regression
-- SVM Classifier
-- Decision Tree Classifier
-- Random Forest Classifier
2. Define suitable ranges of hyperparameter values to test for each algorithm
3. Choose a suitable performance score to measure the individual models' performance
-- Micro-averaged F1 was chosen.
-- This is due to the imbalance of the classes
4. Perform grid-search cross-validation to obtain the optimal hyperparameter combinations w.r.t. the defined ranges, data, and performance measure
5. Analyze the optimal hyperparameter configurations for each algorithm

In [1]:
import pandas as pd
from pandas import DataFrame

from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
import multiprocessing

from sklearn.model_selection import train_test_split

# Load the Data from 02_Data_Preprocessing
Two datasets are loaded:
1. The original (unscaled) data for the Decision Tree and Random Forest
2. The min-max-scaled data for Logistic Regression and the SVM

In [2]:
path_train_data: str = r"D:\Documents\GitHub\UNI_Stellar_Classification\Data\star_classification_preprocessed_train_data.csv"
path_test_data: str = r"D:\Documents\GitHub\UNI_Stellar_Classification\Data\star_classification_preprocessed_test_data.csv"
path_scaled_train_data: str = r"D:\Documents\GitHub\UNI_Stellar_Classification\Data\star_classification_preprocessed_train_data_min_max_scale.csv"
path_scaled_test_data: str = r"D:\Documents\GitHub\UNI_Stellar_Classification\Data\star_classification_preprocessed_test_data_min_max_scale.csv"

train_data: DataFrame = pd.read_csv(path_train_data, index_col="index")
test_data: DataFrame = pd.read_csv(path_test_data, index_col="index")
scaled_train_data: DataFrame = pd.read_csv(path_scaled_train_data, index_col="index")
scaled_test_data: DataFrame = pd.read_csv(path_scaled_test_data, index_col="index")

In [3]:
# Features are DataFrames; selecting the single "target" column yields a Series
x_train: DataFrame = train_data.drop(["target"], axis=1)
x_test: DataFrame = test_data.drop(["target"], axis=1)
y_train: pd.Series = train_data["target"]
y_test: pd.Series = test_data["target"]

x_train_scaled: DataFrame = scaled_train_data.drop(["target"], axis=1)
x_test_scaled: DataFrame = scaled_test_data.drop(["target"], axis=1)
y_train_scaled: pd.Series = scaled_train_data["target"]
y_test_scaled: pd.Series = scaled_test_data["target"]

print(f"num training samples: {len(x_train)}, shape: {x_train.shape}        num scaled training samples: {len(x_train_scaled)}, shape: {x_train_scaled.shape}")
print(f"num testing samples:  {len(x_test)}, shape: {x_test.shape}        num scaled testing samples: {len(x_test_scaled)}, shape: {x_test_scaled.shape}")

num training samples: 74999, shape: (74999, 7)        num scaled training samples: 74999, shape: (74999, 7)
num testing samples:  25000, shape: (25000, 7)        num scaled testing samples: 25000, shape: (25000, 7)


# Set up a hyperparameter tuning pipeline for different classifiers with different hyperparameter configurations

In [5]:
classification_models: dict = {
    "log_reg": LogisticRegression, # Internally performs a binary classification OvR
    "random_forrest": RandomForestClassifier,
    "decision_tree": DecisionTreeClassifier,
    "svm": LinearSVC,
}

hyper_parameter_config: dict = {
    "log_reg": {
        "penalty": ["l1", "l2"], # regularization
        "C": [1.0, 0.8, 0.5, 0.2], # inverse of the regularization strength
        "class_weight": ["balanced"],
        "max_iter": [1000, 2000, 3000],
        "solver": ["saga"], # set the solver to saga. According to doc it is faster for large data and enables elastic net and l1 reg
        "multi_class": ["ovr"] # set the multiclass mode to ovr
    },
    "random_forrest": {
        "n_estimators": [200, 300, 500],
        "max_depth": [50, 100, 200],
        "min_samples_leaf": [2, 5, 10],
        "class_weight": ["balanced"],
        "criterion": ["gini", "entropy", "log_loss"]
    },
    "decision_tree": {
        "min_samples_leaf": [2, 5, 10],
        "max_depth": [5, 10, 20],
        "class_weight": ["balanced"],
        "criterion": ["gini", "entropy", "log_loss"]
    },
    "svm": {
        "loss": ["hinge", "squared_hinge"],
        "penalty": ["l2"], # regularization
        "multi_class": ["ovr"],
        "class_weight": ["balanced"],
        "max_iter": [1000, 2000, 3000],
        "C": [1.0, 0.8, 0.5], # inverse of the regularization strength
    }
}

model_training_data: dict = {
    "log_reg": (x_train_scaled, y_train_scaled),
    "random_forrest": (x_train, y_train),
    "decision_tree": (x_train, y_train),
    "svm": (x_train_scaled, y_train_scaled),
}

# Perform k-fold Cross-Validation

Since this project deals with a multi-class classification problem, binary scores such as plain ROC AUC or F1 cannot be used directly as the model score during cross-validation.
However, there are ways to use the F1 score when dealing with more than one positive class.

In general, precision describes how reliable a classifier is when it labels a sample as positive, and recall describes the share of all actual positive samples that the classifier found.
Both are defined for a binary setup; multi-class classification is a bit different.
A multi-class classification can, however, be treated as a set of binary classifications:

ovr: one versus rest, compares one class against all the others, where all the others are treated as a single class.
This reduces the multi-class classification problem to n binary classifications, where n is the number of classes.

For those n binary classification problems it is possible to determine precision and recall and to compute the F1 score.
There are different ways to merge the n scores. Here micro averaging is used: it pools the true positives, false positives, and false negatives of all classes before computing the score, so every class contributes in proportion to its frequency.
From the data analysis we know that there is an imbalance in the data.
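
The averaging described above can be made concrete with a small sketch. The labels below are made-up toy values (not the project data), and the variable names are chosen for illustration only. The sketch counts one-vs-rest true positives, false positives, and false negatives per class, pools them for the micro average, and compares the result with scikit-learn's f1_score. The macro average is computed alongside for contrast.

```python
import numpy as np
from sklearn.metrics import f1_score

# Toy labels for three imbalanced classes (illustrative values, not project data)
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 2])
y_pred = np.array([0, 0, 0, 0, 0, 1, 1, 1, 0, 0])

classes = np.unique(y_true)

# One-vs-rest counts: each class in turn is "positive", all others "negative"
tp = np.array([np.sum((y_pred == c) & (y_true == c)) for c in classes])
fp = np.array([np.sum((y_pred == c) & (y_true != c)) for c in classes])
fn = np.array([np.sum((y_pred != c) & (y_true == c)) for c in classes])

# Micro averaging: pool the counts of all classes, then compute the score once
micro_precision = tp.sum() / (tp.sum() + fp.sum())
micro_recall = tp.sum() / (tp.sum() + fn.sum())
micro_f1 = 2 * micro_precision * micro_recall / (micro_precision + micro_recall)

# Macro averaging (for contrast): per-class F1, then the unweighted mean
per_class_f1 = 2 * tp / (2 * tp + fp + fn)
macro_f1 = per_class_f1.mean()

# Cross-check against scikit-learn
sk_micro_f1 = f1_score(y_true, y_pred, average="micro")
sk_macro_f1 = f1_score(y_true, y_pred, average="macro", zero_division=0)
accuracy = np.mean(y_true == y_pred)

print(f"micro F1: {micro_f1:.4f} (sklearn: {sk_micro_f1:.4f}), accuracy: {accuracy:.4f}")
print(f"macro F1: {macro_f1:.4f} (sklearn: {sk_macro_f1:.4f})")
```

When every sample has exactly one label, the pooled false positives equal the pooled false negatives, so the micro-averaged F1 coincides with plain accuracy. The macro average weights every class equally and therefore reacts more strongly to errors on rare classes.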


# Treat the imbalance in the data
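To treat the imbalance in the data, class weights are assigned inversely proportional to the class frequencies in the input data, computed as n_samples / (n_classes * np.bincount(y)). This is the behavior of class_weight="balanced", which is set for every model in the hyperparameter configuration above.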
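
As a minimal sketch of this formula, the snippet below computes the weights by hand and compares them with scikit-learn's compute_class_weight helper. The label vector is a toy example, and it assumes integer-encoded class labels starting at 0, since np.bincount requires non-negative integers.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Toy label vector with a 6 : 3 : 1 class imbalance (illustrative, not project data)
y = np.array([0] * 6 + [1] * 3 + [2] * 1)

n_samples = len(y)
n_classes = len(np.unique(y))

# Weight of each class is inversely proportional to its frequency
manual_weights = n_samples / (n_classes * np.bincount(y))

# The same weights as computed by scikit-learn for class_weight="balanced"
sklearn_weights = compute_class_weight(
    class_weight="balanced", classes=np.unique(y), y=y
)

print("manual: ", manual_weights)
print("sklearn:", sklearn_weights)
```

The rarest class receives the largest weight, so misclassifying one of its samples costs more during training than misclassifying a sample of a frequent class.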

In [6]:
from sklearn.model_selection import GridSearchCV
grid_search_results: list = []

for model_name in classification_models.keys():
    # get model from model dictionary
    classification_model = classification_models[model_name]
    model = classification_model()

    # get all hyperparameter configurations for the model
    hyper_parameter = hyper_parameter_config[model_name]

    # get training/cross validation data
    x, y = model_training_data[model_name]

    # initialize the Gridsearch and perform Gridsearch
    grid_cv = GridSearchCV(estimator=model, param_grid=hyper_parameter, cv=5, verbose=1, n_jobs=multiprocessing.cpu_count(), scoring="f1_micro")
    grid_cv.fit(x, y)

    grid_search_results.append(grid_cv)

Fitting 5 folds for each of 24 candidates, totalling 120 fits
Fitting 5 folds for each of 81 candidates, totalling 405 fits
Fitting 5 folds for each of 27 candidates, totalling 135 fits
Fitting 5 folds for each of 18 candidates, totalling 90 fits


In [7]:
"""
    Store the results of all hyperparameter combinations in a dataframe for further analysis
"""
result_tables: dict = {}
for i, key in enumerate(classification_models.keys()):
    cv_data: DataFrame = pd.concat(
        [
            DataFrame(grid_search_results[i].cv_results_["params"]),
            DataFrame(
                grid_search_results[i].cv_results_["mean_test_score"],
                columns=["mean_test_score"],
            ),
        ],
        axis=1
    )

    result_tables.update({key: cv_data})

# Model Selection
To find the best-performing hyperparameter configuration for each model, the results obtained by the grid-search cross-validation are analyzed in the following.

## Visualize Results

In [48]:
import plotly.graph_objects as go
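def show_hyperparameter_combinations(model_name: str, parameters: list[str], title_labels: list[str], target_name: str, size: tuple, font_color: str = "white", save: bool = True):
    """Plot all grid-search results of one model as a parallel-coordinates chart.

    Each line is one hyperparameter combination; its color encodes `target_name`.
    `parameters` lists the columns of `result_tables[model_name]` to draw as axes,
    `title_labels` holds the matching axis labels, and `size` is (width, height) in pixels.
    """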
    df: DataFrame = result_tables[model_name]

    line = dict(color = df[target_name],
                showscale = True,
                colorscale = "Agsunset",
                cmin = df[target_name].min(),
                cmax = df[target_name].max()
    )

    dimensions: list = [
        dict(
            range = [df[parameter].min(),df[parameter].max()],
            constraintrange = [0, 500],
            label = title_labels[i], values = df[parameter]
        )
        for i, parameter in enumerate(parameters)
    ]

    fig = go.Figure(data=go.Parcoords(line = line,dimensions = dimensions), layout=go.Layout(
        autosize=False,
        width=size[0],
        height=size[1],
        paper_bgcolor='rgba(0,0,0,0)',
        plot_bgcolor='rgba(0,0,0,0)',
        font_color=font_color,
    )
    )

    if save:
        fig.write_html(fr"D:\Documents\GitHub\UNI_Stellar_Classification\Plots\{model_name}.html")
    fig.show()

## Decision Tree Results for the Gridsearch Cross Validation

In [49]:
model_name: str = "decision_tree"
target_name: str = "mean_test_score"
parameters: list = ["min_samples_leaf", "max_depth", "mean_test_score"]
title_labels: list = [ "Min Samples Leaf", "Max Depth", "F1 Score"]

show_hyperparameter_combinations(model_name=model_name, parameters=parameters, title_labels=title_labels, target_name=target_name, size=(1000, 500), font_color="white", save=False)

This plot depicts the performance of all hyperparameter combinations of the grid-search cross-validation.
The color indicates the achieved score, which can be read off the color bar on the right.
Following a line from left to right, one can read the hyperparameter combination of one particular model and its achieved cross-validation score.

The hyperparameter with the most impact is the depth of the tree: all yellow lines pass through the value 10 on the max-depth axis.
This indicates that 10 is the best value within the tested range.

## Random Forest Results

In [52]:
model_name: str = "random_forrest"
target_name: str = "mean_test_score"
parameters: list = ["n_estimators", "max_depth", "min_samples_leaf", "mean_test_score"]
title_labels: list = [ "Number of Estimators", "Max Depth", "Min Samples per Leaf", "F1 Score"]

show_hyperparameter_combinations(model_name=model_name, parameters=parameters, title_labels=title_labels, target_name=target_name, size=(1000, 500), font_color="white", save=False)

For the Random Forest, the minimum number of samples per leaf seems to have a strong impact on the overall performance. This hyperparameter can be used to control
over-fitting, since it prevents the trees from forming leaves with only a few samples. However, in this case the score was better the fewer samples were required per leaf.
Overall, the range of achieved F1 scores is very narrow, so tuning the hyperparameters changes the performance only slightly.

## Logistic Regression Results

In [54]:
model_name: str = "log_reg"
target_name: str = "mean_test_score"
parameters: list = ["max_iter", "C",  "mean_test_score"]
title_labels: list = ["Max Iterations", "Inverse Regularization Strength (C)", "F1 Score"]

show_hyperparameter_combinations(model_name=model_name, parameters=parameters, title_labels=title_labels, target_name=target_name, size=(1000, 500), font_color="white", save=False)

## SVM Results

In [57]:
model_name: str = "svm"
target_name: str = "mean_test_score"
parameters: list = ["max_iter", "C",  "mean_test_score"]
title_labels: list = ["Max Iterations", "Inverse Regularization Strength (C)", "F1 Score"]
show_hyperparameter_combinations(model_name=model_name, parameters=parameters, title_labels=title_labels, target_name=target_name, size=(1000, 500), font_color="white", save=False)

# Best Hyperparameter combination for each model

In [58]:
result_tables["svm"][result_tables["svm"]["mean_test_score"] == result_tables["svm"].mean_test_score.max()]

| | C | class_weight | loss | max_iter | multi_class | penalty | mean_test_score |
|---|---|---|---|---|---|---|---|
| 3 | 1.0 | balanced | squared_hinge | 1000 | ovr | l2 | 0.904399 |
| 4 | 1.0 | balanced | squared_hinge | 2000 | ovr | l2 | 0.904399 |
| 5 | 1.0 | balanced | squared_hinge | 3000 | ovr | l2 | 0.904399 |


The three best configurations differ only in max_iter and reach exactly the same score, which indicates that the solver converged before the smallest iteration limit of 1000 was reached.
It is no longer part of the notebook, but SVM and Logistic Regression were previously trained on non-normalized data, where it took 10000 iterations to reach this score.
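
One way to check this kind of convergence claim directly is to inspect the fitted estimator's n_iter_ attribute, which reports how many iterations the solver actually ran. The sketch below is only an illustration under stated assumptions: it uses synthetic data from make_classification instead of the stellar data, scales it with MinMaxScaler, and sets dual=False explicitly (valid for squared hinge loss with an l2 penalty) so that it behaves the same across scikit-learn versions.

```python
from sklearn.datasets import make_classification
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import LinearSVC

# Synthetic 3-class data with 7 features (assumption: stands in for the real data)
X, y = make_classification(
    n_samples=500, n_features=7, n_informative=5, n_redundant=0,
    n_classes=3, random_state=0,
)
X_scaled = MinMaxScaler().fit_transform(X)

max_iter = 1000
svm = LinearSVC(
    C=1.0, loss="squared_hinge", penalty="l2", class_weight="balanced",
    dual=False, max_iter=max_iter,
)
svm.fit(X_scaled, y)

# n_iter_ is the maximum number of iterations run across all OvR sub-problems
converged = svm.n_iter_ < max_iter
print(f"iterations used: {svm.n_iter_} of {max_iter}, converged: {converged}")
```

If the iteration limit is hit, scikit-learn additionally emits a ConvergenceWarning during fit, so a grid search that runs without such warnings is another hint that max_iter was not the limiting factor.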

In [59]:
result_tables["random_forrest"][result_tables["random_forrest"]["mean_test_score"] == result_tables["random_forrest"].mean_test_score.max()]

| | class_weight | criterion | max_depth | min_samples_leaf | n_estimators | mean_test_score |
|---|---|---|---|---|---|---|
| 45 | balanced | entropy | 200 | 2 | 200 | 0.978826 |


In [60]:
result_tables["decision_tree"][result_tables["decision_tree"]["mean_test_score"] == result_tables["decision_tree"].mean_test_score.max()]

| | class_weight | criterion | max_depth | min_samples_leaf | mean_test_score |
|---|---|---|---|---|---|
| 3 | balanced | gini | 10 | 2 | 0.970773 |


In [61]:
result_tables["log_reg"][result_tables["log_reg"]["mean_test_score"] == result_tables["log_reg"].mean_test_score.max()]

| | C | class_weight | max_iter | multi_class | penalty | solver | mean_test_score |
|---|---|---|---|---|---|---|---|
| 2 | 1.0 | balanced | 2000 | ovr | l1 | saga | 0.931652 |
| 4 | 1.0 | balanced | 3000 | ovr | l1 | saga | 0.931652 |


As for the SVM, increasing the number of iterations beyond 2000 does not seem to improve the performance.

# Discussion about Model Hyperparameter Selection:

For the Random Forest as well as for the Decision Tree, the micro F1 score differed only slightly across all hyperparameter combinations.

E.g., the worst-performing Random Forest classifier scored only 0.006 lower in terms of micro F1 than the best one.
However, those are only the risk estimates from the cross-validation. Testing all hyperparameter combinations on the test data to see which combination performs best would violate the model selection protocol: the test data would then have been used for selection, and the test score would no longer be a pessimistic risk estimate. Therefore, the best-performing hyperparameter combination of each model will be used for the evaluation protocol.
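
The separation between selection and evaluation argued for above can be sketched as follows. This is an illustration on synthetic data, not the evaluation notebook itself: the data set, the reduced parameter grid, and all variable names are assumptions. Hyperparameters are chosen by cross-validation on the training split only, and the held-out test split is scored exactly once with the selected model.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced 3-class data (assumption: stands in for the real data)
X, y = make_classification(
    n_samples=600, n_features=7, n_informative=5, n_redundant=0,
    n_classes=3, weights=[0.6, 0.3, 0.1], random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Model selection: cross-validation sees the training split only
param_grid = {"max_depth": [5, 10, 20], "min_samples_leaf": [2, 5, 10]}
grid_cv = GridSearchCV(
    DecisionTreeClassifier(class_weight="balanced", random_state=0),
    param_grid=param_grid, cv=5, scoring="f1_micro",
)
grid_cv.fit(X_train, y_train)

# Evaluation: the refitted best model is scored once on the untouched test split
best_model = grid_cv.best_estimator_
test_f1 = f1_score(y_test, best_model.predict(X_test), average="micro")

print("selected hyperparameters:", grid_cv.best_params_)
print(f"cross-validation micro F1: {grid_cv.best_score_:.4f}")
print(f"held-out test micro F1:    {test_f1:.4f}")
```

Because the test split plays no role in choosing max_depth or min_samples_leaf, the final score is not inflated by the selection step.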


# Save the Test Data and Cross Validation Data to use in the final model Evaluation

In [62]:
save_path_cross_val_data: str = r"D:\Documents\GitHub\UNI_Stellar_Classification\Data\cross_validation_data.csv"
save_path_test_data: str = r"D:\Documents\GitHub\UNI_Stellar_Classification\Data\test_data.csv"

cross_validation_data = pd.concat((x_train, y_train), axis=1)
test_data = pd.concat((x_test, y_test), axis=1)
cross_validation_data.to_csv(save_path_cross_val_data)
test_data.to_csv(save_path_test_data)

# Save the splits for scaled data
save_path_scaled_cross_val_data: str = r"D:\Documents\GitHub\UNI_Stellar_Classification\Data\scaled_cross_validation_data.csv"
save_path_scaled_test_data: str = r"D:\Documents\GitHub\UNI_Stellar_Classification\Data\scaled_test_data.csv"

scaled_cross_validation_data = pd.concat((x_train_scaled, y_train_scaled), axis=1)
test_data_scaled = pd.concat((x_test_scaled, y_test_scaled), axis=1)
scaled_cross_validation_data.to_csv(save_path_scaled_cross_val_data)
test_data_scaled.to_csv(save_path_scaled_test_data)