## **Notebook for the Hyperparamter optimization**
I just use bayesian optimization (BO) (again ^^).
I encode the hyperparameters into one array (hyperparameters) and call the optimization function just with that.
As metric I use cross-entropy again, but this time actually the one for more than 2 classes.
In the optimization function I use 5-fold cross-validation and avarage the cross-entropy-scores I get as the score to be optimized.
Since our bayesian-optimization maximises the funtions, I multipy the score by -1 so the score gets minimized.
Candidate values for the BO are just all combinations of hyperparameters.
Thats kinda it.
Results for me were lowest score: 0.58042515 best parameters: 2 hiddenlayers 33 hiddenunits alpha: 1e-03 learningrate: 1e-02 (see below for all, took ~1Hour)
Scores were actually more spread than I would have thought but on the good side many Models performed relativly equal.
The problem here is probably again the dataset and the very high portion of adults.
For a higher number of evals the results would be more meaningful.

## **Hypothesis test for Modelselection**
I would have also performed a hypothesis test to see of one model is better than another.
Problems I saw here were that comparing more than 2 models would need more sophisticated tests than the ones we implemented (Friedman -> Nemenyi for example).
In the case of the Friedman test the distribution is only close to the chi squared distribution if the number of models that are compared is greater than 5 and more than 5 datasets are used (Statistical Comparisons of Classifiers over Multiple Data Sets, Janez Demsar).
For lower numbers there are special values computed but I think this would be out of the scope of this project.
Another (smaller) Problem is the size of the dataset (1480 samples).
Splitting the dataset to aquire a meaningful number of measurements (>5) of the performance would probably signifcantly hurt the performance.

In [4]:
import pandas as pd
import numpy as np
import warnings
from e2ml.models._gaussian_process_regression_solution import GaussianProcessRegression
from e2ml.experimentation._bayesian_optimization_solution import perform_bayesian_optimization
from sklearn.metrics import log_loss
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from e2ml.evaluation._error_estimation_solution import cross_validation
from e2ml.preprocessing import StandardScaler

def cross_entropy_loss(y_true, y_pred):
    """
    Calculate the multiclass cross entropy for the given model for every input
    (Note: This is the function I should have used)

    Parameters:
        y_true: np.array true label either as index or onehot encoded
        y_pred: np.array containing the predicted labels of the data as probabilities
    """
    #if y contains the information of which class is true as index
    if(len(y_true.shape) == 1):
        return np.array([np.log(y_pred[i]) for i in y_true])
    #if onehot encoded
    elif(len(y_true.shape) == 2):
        return np.array([-(y_true[i] * np.log(y_pred[i])).sum() for i in range(len(y_pred))])

def loadFullData() -> pd.DataFrame:
    """
    Get full data available
    """
    initial_molluscs_data = pd.read_csv('../data/initial_molluscs_data.csv')
    full_data = initial_molluscs_data
    first_batch = pd.read_csv("../data/batch1_labels.csv")
    full_data = pd.concat((full_data, first_batch))
    second_batch = pd.read_csv("../data/batch2_labels.csv")
    full_data = pd.concat((full_data, second_batch))
    third_batch = pd.read_csv("../data/batch3_labels.csv")
    full_data = pd.concat((full_data, third_batch))
    third_batch = pd.read_csv("../data/batch3_labels.csv")
    full_data = pd.concat((full_data, third_batch))
    third_batch = pd.read_csv("../data/batch4_labels.csv")
    full_data = pd.concat((full_data, third_batch))
    return full_data

def getOneHotEncoding(data:np.array, values:np.array=None):
    """
    One hot encode data

    Parameters:
        data: np.array containing the (probably categorical) data to be one-hot encoded
        values: np.array containing the possible values of the data. Important to ensure the data is encoded the same everytime and can be decoded acordingly
    """
    if(values == None):
        values = np.sort(np.unique(data))
    enc = np.zeros((len(data), len(values)))
    for i, x in enumerate(data):
        enc[i, np.where(values == x)[0][0]] = 1
    return enc

def softmax(x):
    """
    implements the softmax function. Used to normalize the probabiility outputs of the models
    
    Parameters:
        x: array-like that should be normalized
    """
    x = np.array(x)
    return np.exp(x) / np.exp(x).sum(axis=1).reshape(x.shape[0],-1)

full_data = loadFullData()
scaler = StandardScaler().fit(full_data.values[:,1:-1])
x = np.concatenate((getOneHotEncoding(full_data["Sex"]), scaler.transform(full_data.values[:,1:-1])), axis=1)
y = getOneHotEncoding(full_data["Stage of Life"])

def objective_function_MLP(hyperparameters):
    num_hiddenlayers = hyperparameters[0]
    num_hiddenunits = hyperparameters[1]
    alpha = hyperparameters[2]
    learning_rate_init = hyperparameters[3]
    hidden_layer_sizes = [int(num_hiddenunits) for _ in range(int(num_hiddenlayers))]
    model = MLPClassifier(hidden_layer_sizes=hidden_layer_sizes, alpha=alpha, learning_rate_init=learning_rate_init, max_iter=1000)
    sample_indices = np.arange(len(full_data), dtype=int)
    n_folds = 5
    train, test = cross_validation(sample_indices=sample_indices, n_folds=n_folds, y=None, random_state=2)
    scores = 0
    for i in range(n_folds):
        with warnings.catch_warnings():#to get rid of the annoying ConverganceWarning
            warnings.simplefilter("ignore")
            model.fit(x[train[i]], y[train[i]])
            y_pred = softmax(model.predict_proba(x[test[i]]))
            scores += cross_entropy_loss(y[test[i]], y_pred).mean()
    return -scores/n_folds,

def perform_hyperparameter_optimization_MLP():
    nums_hiddenlayers = np.array([0,1,2,3,4])
    nums_hiddenunits = np.array([10,33,50,100,150,200,250,300,350,400])
    alpha = np.linspace(0.001, 0.00001, 20)
    learning_rates = np.linspace(0.01, 0.0001, 20)
    candidates = np.array(np.meshgrid(nums_hiddenlayers, nums_hiddenunits, alpha, learning_rates)).T.reshape(-1,4)
    metrics_dict = {'gamma': None, 'metric': 'rbf'}
    gpr = GaussianProcessRegression(beta=1.e-3, metrics_dict=metrics_dict)
    best_params, best_scores = perform_bayesian_optimization(candidates, gpr, acquisition_func="ei", obj_func=objective_function_MLP, n_evals=30, n_random_init=2)
    print(best_params, best_scores)
    return best_params, best_scores

best_params, best_scores = perform_hyperparameter_optimization_MLP()

[[0.45       0.345      0.12       ... 0.1655     0.095      0.135     ]
 [0.475      0.38       0.145      ... 0.167      0.118      0.187     ]
 [0.61       0.485      0.17       ... 0.419      0.2405     0.36      ]
 ...
 [0.64668016 0.4351138  0.25650057 ... 0.58246501 0.28639193 0.44137179]
 [0.64307546 0.43226801 0.25525184 ... 0.57601198 0.28321823 0.43654118]
 [0.63947075 0.42942222 0.25400312 ... 0.56955895 0.28004452 0.43171058]]
1480
0
1


config: num_hiddenlayers, num_hiddenuntis, learning_rate, alpha
[[1.00000000e+00 1.00000000e+02 7.91578947e-03 7.91578947e-04]
 [2.00000000e+00 1.00000000e+02 5.31052632e-03 1.14210526e-04]
 [0.00000000e+00 1.00000000e+01 1.00000000e-02 1.00000000e-03]
 [0.00000000e+00 3.30000000e+01 1.00000000e-02 1.00000000e-03]
 [0.00000000e+00 5.00000000e+01 1.00000000e-02 1.00000000e-03]
 [0.00000000e+00 1.50000000e+02 1.00000000e-02 1.00000000e-03]
 [0.00000000e+00 2.00000000e+02 1.00000000e-02 1.00000000e-03]
 [0.00000000e+00 2.50000000e+02 1.00000000e-02 1.00000000e-03]
 [0.00000000e+00 3.00000000e+02 1.00000000e-02 1.00000000e-03]
 [0.00000000e+00 3.50000000e+02 1.00000000e-02 1.00000000e-03]
 [0.00000000e+00 4.00000000e+02 1.00000000e-02 1.00000000e-03]
 [4.00000000e+00 3.50000000e+02 1.00000000e-04 1.00000000e-05]
 [4.00000000e+00 1.50000000e+02 1.00000000e-04 1.00000000e-05]
 [4.00000000e+00 1.00000000e+01 1.00000000e-04 1.00000000e-05]
 [4.00000000e+00 4.00000000e+02 1.00000000e-04 1.00000000e-05]
 [4.00000000e+00 3.30000000e+01 1.00000000e-04 1.00000000e-05]
 [4.00000000e+00 2.00000000e+02 1.00000000e-04 1.00000000e-05]
 [4.00000000e+00 5.00000000e+01 1.00000000e-04 1.00000000e-05]
 [4.00000000e+00 3.00000000e+02 1.00000000e-04 1.00000000e-05]
 [4.00000000e+00 2.50000000e+02 1.00000000e-04 1.00000000e-05]
 [4.00000000e+00 1.00000000e+02 1.00000000e-04 1.00000000e-03]
 [2.00000000e+00 4.00000000e+02 1.00000000e-04 1.00000000e-05]
 [2.00000000e+00 3.50000000e+02 1.00000000e-04 1.00000000e-05]
 [2.00000000e+00 3.00000000e+02 1.00000000e-04 1.00000000e-05]
 [2.00000000e+00 2.50000000e+02 1.00000000e-04 1.00000000e-05]
 [2.00000000e+00 2.00000000e+02 1.00000000e-04 1.00000000e-05]
 [2.00000000e+00 1.50000000e+02 1.00000000e-04 1.00000000e-05]
 [2.00000000e+00 5.00000000e+01 1.00000000e-04 1.00000000e-05]
 [2.00000000e+00 3.30000000e+01 1.00000000e-02 1.00000000e-03]
 [2.00000000e+00 1.00000000e+01 1.00000000e-02 1.00000000e-03]] 
 
 scores: higher = better
 [-0.59396657 -0.59218798 -0.70476239 -0.70483637 -0.70496765 -0.7046212
 -0.70485052 -0.7054535  -0.70507148 -0.70429256 -0.70483418 -0.58633322
 -0.60650235 -0.95359764 -0.58568454 -0.73858945 -0.59787859 -0.66547465
 -0.59069548 -0.59327332 -0.58153428 -0.62986713 -0.63434779 -0.64003818
 -0.64626333 -0.65748636 -0.67012698 -0.75528564 -0.58042515 -0.59160392]