# Wi-Fi localization. Classification models

Sîrbu Matei-Dan, _group 10LF383_

<i>Dataset source:</i> http://archive.ics.uci.edu/ml/datasets/Wireless+Indoor+Localization

<i>Dataset description:</i> [DOI 10.1007/978-981-10-3322-3_27 via ResearchGate](Docs/chp_10.1007_978-981-10-3322-3_27.pdf)

<i>Synopsis:</i> The _Wireless Indoor Localization_ dataset contains 2000 measurements of the signal strength (in dBm) received from the routers of an office in Pittsburgh. The office has seven routers and four rooms; a user with a smartphone records, once per second, the strength of the signals coming from the seven routers, and each record is labeled with the room the user was in at the time of the measurement (1, 2, 3 or 4).

The figure below shows a sample from the dataset: <br><br>
![Sample](./Images/wifi_localization_sample.png)

In what follows, the Class column (the room) is denoted by y, and the columns WS1 - WS7 (features: the signal strength from each router) by X.

In [1]:
import numpy as np
import pandas as pd
from IPython.display import display, HTML

In [2]:
header = ['WS1', 'WS2', 'WS3', 'WS4', 'WS5', 'WS6', 'WS7', 'Class']
data_wifi = pd.read_csv("./Datasets/wifi_localization.txt", names=header, sep='\t')
display(HTML("<i>Dataset overview:</i>"))
display(data_wifi)
X = data_wifi.values[:, :7]
y = data_wifi.values[:, -1]
folds = 5

Unnamed: 0,WS1,WS2,WS3,WS4,WS5,WS6,WS7,Class
0,-64,-56,-61,-66,-71,-82,-81,1
1,-68,-57,-61,-65,-71,-85,-85,1
2,-63,-60,-60,-67,-76,-85,-84,1
3,-61,-60,-68,-62,-77,-90,-80,1
4,-63,-65,-60,-63,-77,-81,-87,1
...,...,...,...,...,...,...,...,...
1995,-59,-59,-48,-66,-50,-86,-94,4
1996,-59,-56,-50,-62,-47,-87,-90,4
1997,-62,-59,-46,-65,-45,-87,-88,4
1998,-62,-58,-52,-61,-41,-90,-85,4


# <u>The classification models _in a nutshell_</u>
To classify the data in the _Wi-Fi localization_ dataset we will use the following classification models: kNN, Decision Tree, MLP, Gaussian NB and Random Forest. The next paragraphs explain how each algorithm works and highlight the hyperparameters and formulas behind them.
<div style="text-align:center"><img src="./Images/xkcd_machine_learning.png"><br>"the hyperparameters and formulas behind them"<br>source: <a href="https://xkcd.com/1838/">xkcd 1838: Machine Learning</a></div>

### [1. <i>k</i>-nearest neighbors classifier](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html#sklearn.neighbors.KNeighborsClassifier)

class `sklearn.neighbors.KNeighborsClassifier`<i>(n_neighbors=5, weights='uniform', algorithm='auto', leaf_size=30, p=2, metric='minkowski', metric_params=None, n_jobs=None, **kwargs)</i>

In a classification problem, the kNN (_k_-nearest neighbors) algorithm finds, in the training set, the _k_ nearest neighbors of each unclassified item, regardless of their labels. The class of an unclassified item is decided by voting: the class to which most of its neighbors belong is taken as the item's class.
<div style="text-align:center"><img style="width: 500px" src="./Images/knn_example.png"><br>Example: classifying item c with 3NN. The vote determines the class of c: <b>o</b>.<br>source: <a href="http://youtu.be/UqYde-LULfs">YouTube (<i>How kNN algorithm works</i> by Thales Sehn Körting)</a></div><br>
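The voting step can be sketched in a few lines of plain Python. This is only an illustration and not how Scikit-learn implements it: the function name `knn_predict` and the toy points are assumptions made up for the example, and the distance used is plain Euclidean.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every training point
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    # keep the k closest neighbors, whatever their labels are
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # the most common label among those neighbors wins the vote
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# toy data, made up for illustration (not taken from the dataset)
train_points = [(1.0, 1.0), (1.5, 2.0), (3.0, 4.0), (5.0, 7.0), (3.5, 5.0), (4.5, 5.0)]
train_labels = ["a", "a", "o", "o", "o", "o"]

predicted = knn_predict(train_points, train_labels, query=(3.2, 4.1), k=3)
print(predicted)
```

With these toy points the three nearest neighbors of the query all carry the label `o`, so the vote is unanimous.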

Several metrics can be used to measure the distance between items. Scikit-learn accepts any Python function as a metric, but uses the _Minkowski_ metric by default. Here are a few metrics commonly used with kNN:

- _Minkowski distance_: $d_{st} = \sqrt[p]{\sum_{j=1}^n |x_{sj} - y_{tj}|^p}$  (_Note_: p is a hyperparameter used by Scikit-learn)
- _Euclidean distance_: $d(\textbf{x},\textbf{y}) = \sqrt{\sum_{i=1}^n (y_i - x_i)^2}$
- _Manhattan (City block) distance_: $d_{st} = \sum_{j=1}^n |x_{sj} - y_{tj}|$
- _Mahalanobis distance_ (shown here for a diagonal covariance matrix, where it reduces to the standardized Euclidean distance): $d(\textbf{x},\textbf{y}) = \sqrt{\sum_{i=1}^n \frac{(x_i - y_i)^2}{s_i^2}}$, where $s_i$ is the standard deviation of feature $i$ over the sample
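As a small worked example, the sketch below evaluates these metrics on the first two rows of the dataset sample shown earlier. The helper names (`minkowski`, `manhattan`, `euclidean`, `standardized_euclidean`) are hypothetical and written for this illustration; they are not Scikit-learn functions.

```python
def minkowski(x, y, p=2):
    """Minkowski distance of order p between two equal-length vectors."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1 / p)

def manhattan(x, y):
    # Minkowski distance with p = 1
    return minkowski(x, y, p=1)

def euclidean(x, y):
    # Minkowski distance with p = 2
    return minkowski(x, y, p=2)

def standardized_euclidean(x, y, std):
    """Euclidean distance after dividing each difference by that feature's standard deviation."""
    return sum(((a - b) / s) ** 2 for a, b, s in zip(x, y, std)) ** 0.5

# rows 0 and 1 of the dataset sample (WS1 ... WS7)
row_0 = (-64, -56, -61, -66, -71, -82, -81)
row_1 = (-68, -57, -61, -65, -71, -85, -85)

d_manhattan = manhattan(row_0, row_1)       # sum of |differences| = 4+1+0+1+0+3+4
d_euclidean = euclidean(row_0, row_1)       # square root of 43
d_minkowski_3 = minkowski(row_0, row_1, 3)  # cube root of 157
print(d_manhattan, d_euclidean, d_minkowski_3)
```

For the same pair of points the distance shrinks as p grows, which is why the choice of p can change which neighbors count as nearest.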

### [2. Decision tree classifier](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html#sklearn.tree.DecisionTreeClassifier)

class `sklearn.tree.DecisionTreeClassifier`<i>(criterion='gini', splitter='best', max_depth=None, min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features=None, random_state=None, max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, class_weight=None, presort='deprecated', ccp_alpha=0.0)</i>

A _decision tree_ is a flowchart-like tree structure in which an internal node represents a feature, a branch represents a decision rule, and each leaf is an outcome, a classification. The Decision tree algorithm selects the best feature using an ASM (_Attribute Selection Measure_) metric, turns that feature node into a decision node, and splits the dataset into subsets. The process runs recursively until the tree contains only decision nodes and leaf nodes. The deeper the tree, the more complex its decision rules and the more closely it fits the training data, which raises training accuracy but also the risk of overfitting.

<div style="text-align:center"><img style="width: 600px" src="./Images/dt_diagram.png"><br>The structure of a decision tree.<br>source: <a href="https://www.kdnuggets.com/2020/01/decision-tree-algorithm-explained.html">KDnuggets (Decision Tree Algorithm, Explained)</a></div>

<br>To measure the quality of a split, Scikit-learn uses two ASM metrics:

- _Gini impurity_ (how often a randomly chosen element would be mislabeled if it were labeled at random according to the label distribution in a subset): <br>$Gini(p) = 1 - \sum_{j=1}^c p_j^2$ <br>
- _entropy_ (similar to Gini impurity, but computationally more expensive because of the logarithm): <br>$H(p) = - \sum_{j=1}^c p_j \log p_j$

(where c is the number of classes (labels) and $p_j$ is the fraction of items in the subset labeled with class j, with $j \in \{1, 2, ..., c\}$).
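Both measures are easy to compute by hand on a list of labels. The sketch below is a plain-Python illustration; the function names are hypothetical, and the base-2 logarithm is an assumption, since the formula above does not fix the base. `split_impurity` shows how a candidate split could be scored as the size-weighted impurity of its two children.

```python
import math
from collections import Counter

def class_proportions(labels):
    """Return p_j, the fraction of items carrying each class label."""
    total = len(labels)
    return [count / total for count in Counter(labels).values()]

def gini(labels):
    # Gini(p) = 1 - sum_j p_j^2
    return 1.0 - sum(p ** 2 for p in class_proportions(labels))

def entropy(labels):
    # H(p) = -sum_j p_j * log2(p_j); base 2 is an assumption of this sketch
    return -sum(p * math.log2(p) for p in class_proportions(labels))

def split_impurity(left, right, impurity=gini):
    """Size-weighted impurity of the two subsets produced by a split."""
    n = len(left) + len(right)
    return len(left) / n * impurity(left) + len(right) / n * impurity(right)

pure = [1, 1, 1, 1]      # one class only: no impurity
balanced = [1, 2, 3, 4]  # four equally likely classes: maximum impurity
print(gini(pure), gini(balanced))        # 0.0 0.75
print(entropy(pure), entropy(balanced))  # 0.0 (possibly shown as -0.0) and 2.0
print(split_impurity([1, 1], [2, 2]))    # a split that separates the classes perfectly
```

A split is better the lower its weighted impurity is; a split that isolates each class in its own subset scores 0.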

### [3. Multilayer perceptron (MLP) classifier](https://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPClassifier.html#sklearn.neural_network.MLPClassifier)

class `sklearn.neural_network.MLPClassifier`<i>(hidden_layer_sizes=(100, ), activation='relu', solver='adam', alpha=0.0001, batch_size='auto', learning_rate='constant', learning_rate_init=0.001, power_t=0.5, max_iter=200, shuffle=True, random_state=None, tol=0.0001, verbose=False, warm_start=False, momentum=0.9, nesterovs_momentum=True, early_stopping=False, validation_fraction=0.1, beta_1=0.9, beta_2=0.999, epsilon=1e-08, n_iter_no_change=10, max_fun=15000)</i>

_Perceptrons_ are a class of classifiers used in supervised learning, built as a mathematical model of a biological neuron. In particular, _multilayer perceptrons (MLP)_ form neural networks with several layers of perceptrons: an input layer, one or more intermediate (hidden) layers, and an output layer.

<div style="text-align:center"><img style="width: 500px" src="./Images/mlp_diagram.png"><br>A multilayer perceptron, illustrated.<br>source: <a href="https://github.com/ledell/sldm4-h2o/blob/master/sldm4_h2o_oct2016.pdf">GitHub (ledell/sldm4-h2o)</a></div>

<br>In a neural network, an _activation function_ defines the output of a perceptron given its set of inputs. In its simplest form, the function returns a binary result (a step function with output 0 or 1): by analogy with the biological neuron, whether or not an electrical impulse travels along its axon. In modern neural networks with several layers of perceptrons, activation functions are usually continuous and non-linear. Scikit-learn's MLP classifier supports one linear activation function (the identity) and three non-linear ones:
- _identity function_: $f(x) = x$
- _logistic sigmoid_: $f(x) = \frac{1}{1 + \exp(-x)}$
- _hyperbolic tangent_: $f(x) = \tanh(x) = \frac{\sinh(x)}{\cosh(x)} = \frac{e^x - e^{-x}}{e^x + e^{-x}}$
- _Rectified Linear Unit (ReLU)_: $f(x) = \max(0, x) = \begin{cases} 0 & \text{if } x \leq 0 \\ x & \text{if } x > 0 \end{cases}$
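The four activation functions can be written directly from their formulas. The sketch below uses only the standard library; `neuron_output` is a hypothetical helper that shows where the activation sits in a single perceptron (weighted sum of the inputs plus a bias, then the activation), with weights chosen arbitrarily for the example.

```python
import math

def identity(x):
    return x

def logistic(x):
    # logistic sigmoid: squashes any real input into (0, 1)
    return 1.0 / (1.0 + math.exp(-x))

def tanh(x):
    # hyperbolic tangent: squashes any real input into (-1, 1)
    return math.tanh(x)

def relu(x):
    # Rectified Linear Unit: 0 for non-positive inputs, x otherwise
    return max(0.0, x)

def neuron_output(inputs, weights, bias, activation):
    """Output of one perceptron: activation(weighted sum of inputs + bias)."""
    z = sum(w * v for w, v in zip(weights, inputs)) + bias
    return activation(z)

# arbitrary example values: 0.5*1.0 + 0.25*2.0 - 1.0 = 0.0 before activation
for activation in (identity, logistic, tanh, relu):
    print(activation.__name__, neuron_output([1.0, 2.0], [0.5, 0.25], -1.0, activation))
```

Only the activation changes between the four calls; the weighted sum feeding it is the same.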

The MLP classifier in Scikit-learn also offers several weight optimization algorithms (solvers): _LBFGS_ (a quasi-Newton algorithm), _SGD_ (stochastic gradient descent) and _Adam_ (an algorithm derived from SGD, created by Diederik P. Kingma and Jimmy Lei Ba).

### [4. Gaussian Naïve Bayes classifier](https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.GaussianNB.html#sklearn.naive_bayes.GaussianNB)

class `sklearn.naive_bayes.GaussianNB`<i>(priors=None, var_smoothing=1e-09)</i>

The _Gaussian Naïve Bayes_ classification algorithm belongs to the family of _Naïve Bayes_ classifiers, which assume that, within a class, the value of one feature is not affected by the values of the other features; in short, the features contribute independently to the probability of belonging to a class. In particular, _Gaussian Naïve Bayes_ assumes that the likelihood of each feature follows the probability density function (PDF) of a normal (Gaussian) distribution:
$$\large P(x_i | y) = \frac{1}{\sqrt{2 \pi \sigma_y^2}}\exp\bigg(-\frac{(x_i - \mu_y)^2}{2 \sigma_y^2}\bigg),$$
where the parameters $\sigma_y$ and $\mu_y$, the standard deviation and the mean, are determined by maximum likelihood estimation (MLE), a method for estimating the parameters of a PDF by maximizing a likelihood function (how well a sample fits a statistical model).
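A minimal plain-Python sketch of this idea follows. It is an illustration under stated assumptions, not Scikit-learn's implementation: the function names are hypothetical, the mean and variance are estimated separately for every class and every feature, the small constant added to the variance is an arbitrary guard against division by zero, and the toy rows are made up.

```python
import math
from collections import defaultdict

def fit_gaussian_nb(X, y):
    """Estimate, per class, the prior and the per-feature mean and variance (MLE)."""
    rows_by_class = defaultdict(list)
    for row, label in zip(X, y):
        rows_by_class[label].append(row)
    model = {}
    for label, rows in rows_by_class.items():
        n = len(rows)
        means = [sum(column) / n for column in zip(*rows)]
        # the MLE variance divides by n; 1e-9 is an assumed guard against zero variance
        variances = [sum((v - m) ** 2 for v in column) / n + 1e-9
                     for column, m in zip(zip(*rows), means)]
        model[label] = (n / len(y), means, variances)
    return model

def log_gaussian(x, mean, var):
    # logarithm of the normal PDF from the formula above
    return -0.5 * math.log(2 * math.pi * var) - (x - mean) ** 2 / (2 * var)

def predict_gaussian_nb(model, row):
    """Pick the class with the highest log prior plus summed log likelihoods."""
    scores = {
        label: math.log(prior) + sum(log_gaussian(x, m, v)
                                     for x, m, v in zip(row, means, variances))
        for label, (prior, means, variances) in model.items()
    }
    return max(scores, key=scores.get)

# toy signal strengths for two routers and two rooms (made up for illustration)
X_toy = [(-64, -56), (-68, -57), (-63, -60), (-40, -70), (-42, -72), (-38, -69)]
y_toy = [1, 1, 1, 2, 2, 2]
nb_model = fit_gaussian_nb(X_toy, y_toy)
print(predict_gaussian_nb(nb_model, (-65, -58)), predict_gaussian_nb(nb_model, (-41, -71)))
```

Summing log likelihoods across features is the independence assumption in action: each feature contributes its own term, with no interaction between them.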

### [5. Random Forest classifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html#sklearn.ensemble.RandomForestClassifier)

class `sklearn.ensemble.RandomForestClassifier`<i>(n_estimators=100, criterion='gini', max_depth=None, min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features='auto', max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, bootstrap=True, oob_score=False, n_jobs=None, random_state=None, verbose=0, warm_start=False, class_weight=None, ccp_alpha=0.0, max_samples=None)</i>

A _Random forest_ classifier relies on the hypotheses produced by several randomized decision trees (_random trees_), obtained through _random splits_. A random forest is built by growing one random tree for each training subset (by default, a bootstrap sample of the training data). These trees work as an ensemble; every model in the ensemble is applied to each input, and the final result is obtained by aggregating their outputs through voting. A random forest is thus a _meta-estimator_: one prediction is derived from many predictions.
<div style="text-align:center"><img style="width: 400px" src="./Images/rf_diagram.png"><br>A random forest model making a prediction; the vote yields the result 1.<br>source: <a href="https://towardsdatascience.com/understanding-random-forest-58381e0602d2">Medium (Towards Data Science: Understanding Random Forest)</a></div>
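The ensemble mechanics (bootstrap samples, one model per sample, majority vote) can be sketched without any library. To keep the example short, each "tree" here is deliberately simplified to a one-split stump on a single randomly chosen feature; that is a stand-in chosen for this sketch, not what Scikit-learn grows. All function names and the toy data are assumptions.

```python
import random
from collections import Counter

def majority(labels):
    return Counter(labels).most_common(1)[0][0]

def fit_stump(X, y, rng):
    """Fit a one-split stand-in for a random tree on one randomly chosen feature."""
    feature = rng.randrange(len(X[0]))
    best = None
    for threshold in sorted({row[feature] for row in X}):
        left = [label for row, label in zip(X, y) if row[feature] <= threshold]
        right = [label for row, label in zip(X, y) if row[feature] > threshold]
        if not left or not right:
            continue
        left_label, right_label = majority(left), majority(right)
        # number of training items this split would misclassify
        errors = sum(l != left_label for l in left) + sum(l != right_label for l in right)
        if best is None or errors < best[0]:
            best = (errors, threshold, left_label, right_label)
    if best is None:  # no valid split: always predict the majority label
        return feature, float("inf"), majority(y), majority(y)
    return (feature,) + best[1:]

def predict_stump(stump, row):
    feature, threshold, left_label, right_label = stump
    return left_label if row[feature] <= threshold else right_label

def fit_forest(X, y, n_estimators=25, seed=0):
    rng = random.Random(seed)
    forest = []
    for _ in range(n_estimators):
        # bootstrap sample: draw len(X) row indices with replacement
        idx = [rng.randrange(len(X)) for _ in range(len(X))]
        forest.append(fit_stump([X[i] for i in idx], [y[i] for i in idx], rng))
    return forest

def predict_forest(forest, row):
    # aggregate the individual predictions by majority vote
    return majority([predict_stump(stump, row) for stump in forest])

# toy, well-separated data (made up for illustration)
X_toy = [(1, 1), (2, 1), (1, 2), (2, 2), (8, 8), (9, 8), (8, 9), (9, 9)]
y_toy = [0, 0, 0, 0, 1, 1, 1, 1]
forest = fit_forest(X_toy, y_toy)
print(predict_forest(forest, (1, 1)), predict_forest(forest, (9, 9)))
```

An individual stump trained on an unlucky bootstrap sample can be wrong, but the vote across the ensemble smooths such cases out, which is the point of the meta-estimator.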

<br>As with the _Decision Tree classifier_, Scikit-learn uses two metrics to measure the quality of a split:

- _Gini impurity_: $Gini(p) = 1 - \sum_{j=1}^c p_j^2$ <br>
- _entropy_: $H(p) = - \sum_{j=1}^c p_j \log p_j$

(where c is the number of classes (labels) and $p_j$ is the fraction of items in the subset labeled with class j, with $j \in \{1, 2, ..., c\}$).

# <u>Testing the classification algorithms on the dataset with Scikit-learn</u>

In [3]:
# just a pretty printing function, don't mind me...
def print_stats_cv(model_cv_stats):
    print(f"Test accuracy for each fold: {model_cv_stats['test_accuracy']} \n=> Average test accuracy: {round(model_cv_stats['test_accuracy'].mean() * 100, 3)}%")
    print(f"Train accuracy for each fold: {model_cv_stats['train_accuracy']} \n=> Average train accuracy: {round(model_cv_stats['train_accuracy'].mean() * 100, 3)}%")
    print(f"Test F1 score for each fold: {model_cv_stats['test_f1_macro']} \n=> Average test F1 score: {round(model_cv_stats['test_f1_macro'].mean() * 100, 3)}%")
    print(f"Train F1 score for each fold: {model_cv_stats['train_f1_macro']} \n=> Average train F1 score: {round(model_cv_stats['train_f1_macro'].mean() * 100, 3)}%")

### 1. <i>k</i>-nearest neighbors classifier

In [4]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_validate
from sklearn.preprocessing import MinMaxScaler

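# hyperparameters
knn_neighbors = 4
knn_minkowski_p = 3

# scale the data
# note: the scaler is fitted on all of X before cross-validation, so test-fold
# statistics leak into training; wrapping scaler and model in a Pipeline would avoid this
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)

# kNN model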
model = KNeighborsClassifier(n_neighbors=knn_neighbors, p=knn_minkowski_p)
model_cv_stats = cross_validate(model, X_scaled, y, cv=folds, scoring=('accuracy', 'f1_macro'), return_train_score=True)

# statistics
display(HTML(f"<h4>{folds}-fold cross validation for {knn_neighbors}-nearest neighbors classification:</h4>"))
print_stats_cv(model_cv_stats)

Test accuracy for each fold: [0.9675 0.985  0.9675 0.9875 0.9775] 
=> Average test accuracy: 97.7%
Train accuracy for each fold: [0.99     0.99     0.9925   0.99     0.990625] 
=> Average train accuracy: 99.062%
Test F1 score for each fold: [0.96747175 0.98498649 0.96707199 0.98754412 0.97761589] 
=> Average test F1 score: 97.694%
Train F1 score for each fold: [0.98999996 0.99002075 0.99250155 0.9900046  0.99063789] 
=> Average train F1 score: 99.063%


### 2. Decision Tree classifier

In [5]:
from sklearn.tree import DecisionTreeClassifier

# hyperparameters
dt_criterion = 'gini'
dt_splitter = 'best'

# Decision Tree model
model = DecisionTreeClassifier(criterion=dt_criterion, splitter=dt_splitter, random_state=42)
model_cv_stats = cross_validate(model, X, y, cv=folds, scoring=('accuracy', 'f1_macro'), return_train_score=True)

# statistics
display(HTML(f"<h4>{folds}-fold cross validation for Decision Trees classification:</h4>"))
print_stats_cv(model_cv_stats)

Test accuracy for each fold: [0.96   0.9325 0.9275 0.9825 0.9725] 
=> Average test accuracy: 95.5%
Train accuracy for each fold: [1. 1. 1. 1. 1.] 
=> Average train accuracy: 100.0%
Test F1 score for each fold: [0.95972496 0.93246313 0.92562893 0.98251804 0.97253289] 
=> Average test F1 score: 95.457%
Train F1 score for each fold: [1. 1. 1. 1. 1.] 
=> Average train F1 score: 100.0%


### 3. Multilayer Perceptron (MLP) classifier

In [6]:
from sklearn.neural_network import MLPClassifier

# hyperparameters
mlp_solver = 'adam'
mlp_activation = 'relu'
mlp_alpha = 1e-3
mlp_hidden_layer_sizes = (50, 50)

# MLP model
model = MLPClassifier(solver=mlp_solver, activation=mlp_activation, alpha=mlp_alpha, hidden_layer_sizes=mlp_hidden_layer_sizes, random_state=42)
model_cv_stats = cross_validate(model, X, y, cv=folds, scoring=('accuracy', 'f1_macro'), return_train_score=True)

# statistics
display(HTML(f"<h4>{folds}-fold cross validation for MLP classification</h4>"))
display(HTML(f"using hardcoded hyperparameters - Solver: <b>{mlp_solver}</b>, Activation function: <b>{mlp_activation}</b>, Parameter for regularization (α): <b>{mlp_alpha}</b>, Hidden layer sizes: <b>{mlp_hidden_layer_sizes}</b>"))
print_stats_cv(model_cv_stats)

Test accuracy for each fold: [0.9725 0.9975 0.95   0.975  0.9775] 
=> Average test accuracy: 97.45%
Train accuracy for each fold: [0.98125  0.971875 0.983125 0.98125  0.981875] 
=> Average train accuracy: 97.988%
Test F1 score for each fold: [0.97244414 0.99749994 0.94983905 0.97514624 0.97746665] 
=> Average test F1 score: 97.448%
Train F1 score for each fold: [0.98124085 0.97171261 0.98314044 0.98122406 0.9819253 ] 
=> Average train F1 score: 97.985%


### 4. Gaussian Naïve Bayes classifier

In [7]:
from sklearn.naive_bayes import GaussianNB

# Gaussian NB model
model = GaussianNB()
model_cv_stats = cross_validate(model, X, y, cv=folds, scoring=('accuracy', 'f1_macro'), return_train_score=True)

# statistics
display(HTML(f"<h4>{folds}-fold cross validation for Gaussian NB classification</h4>"))
print_stats_cv(model_cv_stats)

Test accuracy for each fold: [0.99   0.9725 0.98   0.98   0.985 ] 
=> Average test accuracy: 98.15%
Train accuracy for each fold: [0.98125  0.983125 0.9875   0.985    0.983125] 
=> Average train accuracy: 98.4%
Test F1 score for each fold: [0.99001188 0.97255265 0.9799862  0.98006098 0.9849985 ] 
=> Average test F1 score: 98.152%
Train F1 score for each fold: [0.98127765 0.98313802 0.98751983 0.98501642 0.9831722 ] 
=> Average train F1 score: 98.402%


### 5. Random Forest classifier

In [8]:
from sklearn.ensemble import RandomForestClassifier

# hyperparameters
rfc_n_estimators = 150
rfc_criterion = 'gini'

# Random Forest model (no random_state is set, so results vary between runs)
model = RandomForestClassifier(n_estimators=rfc_n_estimators, criterion=rfc_criterion)
model_cv_stats = cross_validate(model, X, y, cv=folds, scoring=('accuracy', 'f1_macro'), return_train_score=True)

display(HTML(f"<h4>{folds}-fold cross validation for Random Forest classification</h4>"))
print_stats_cv(model_cv_stats)

Test accuracy for each fold: [0.9825 0.96   0.975  0.985  0.9875] 
=> Average test accuracy: 97.8%
Train accuracy for each fold: [1. 1. 1. 1. 1.] 
=> Average train accuracy: 100.0%
Test F1 score for each fold: [0.98244899 0.95974235 0.97494678 0.9850456  0.98749969] 
=> Average test F1 score: 97.794%
Train F1 score for each fold: [1. 1. 1. 1. 1.] 
=> Average train F1 score: 100.0%


# <u>Nested Cross Validation for hyperparameter optimization</u>

In [9]:
from sklearn.model_selection import KFold
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RandomizedSearchCV
from sklearn.metrics import accuracy_score

# CVs configuration
inner_cv = KFold(n_splits=4, shuffle=True)
outer_cv = KFold(n_splits=5, shuffle=True)

# outer CV folds:
print("5-fold cross validation: split overview")
splits = outer_cv.split(range(data_wifi.index.size))
subsets = pd.DataFrame(columns=['Fold', 'Train row indices', 'Test row indices'])
for i, split_data in enumerate(splits):
    subsets.loc[i]=[i + 1, split_data[0], split_data[1]]
display(subsets)

5-fold cross validation: split overview


Unnamed: 0,Fold,Train row indices,Test row indices
0,1,"[0, 1, 2, 4, 5, 6, 8, 9, 10, 11, 12, 13, 14, 1...","[3, 7, 25, 29, 34, 46, 56, 63, 64, 65, 67, 69,..."
1,2,"[0, 1, 2, 3, 6, 7, 8, 9, 10, 11, 12, 13, 14, 1...","[4, 5, 16, 40, 44, 51, 52, 55, 59, 73, 77, 87,..."
2,3,"[0, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 14, 15...","[1, 13, 18, 19, 22, 27, 28, 32, 35, 36, 39, 41..."
3,4,"[1, 3, 4, 5, 7, 8, 13, 14, 16, 18, 19, 21, 22,...","[0, 2, 6, 9, 10, 11, 12, 15, 17, 20, 23, 24, 3..."
4,5,"[0, 1, 2, 3, 4, 5, 6, 7, 9, 10, 11, 12, 13, 15...","[8, 14, 21, 26, 30, 33, 37, 45, 50, 60, 61, 62..."


### 1. <i>k</i>-nearest neighbors classifier

In [10]:
outer_cv_acc = []
# note: linspace(1, 5, num=4) cast to int yields p in {1, 2, 3, 5}, so p=4 is never tried
param_candidates = {'n_neighbors': np.linspace(start=1, stop=30, num=30, dtype=int),
                    'p': np.linspace(start=1, stop=5, num=4, dtype=int)} 
param_search = RandomizedSearchCV(estimator=KNeighborsClassifier(), param_distributions=param_candidates, scoring='accuracy', cv=inner_cv, random_state=42)
    
for fold in range(5):
    X_train = X_scaled[subsets.loc[subsets.index[fold],'Train row indices']]
    y_train = y[subsets.loc[subsets.index[fold],'Train row indices']]
    X_test = X_scaled[subsets.loc[subsets.index[fold],'Test row indices']]
    y_test = y[subsets.loc[subsets.index[fold],'Test row indices']]
    
    param_search.fit(X_train, y_train)
    y_estimated = param_search.predict(X_test)
    accuracy = accuracy_score(y_test, y_estimated)
    outer_cv_acc.append(accuracy)
    
    print(f'Outer fold {fold+1}, optimal hyperparameters after inner 4-fold CV: {param_search.best_params_}')
    print(f'kNN model accuracy with optimal hyperparameters: {accuracy}')
    
print(f'\nAverage model accuracy: {round(np.mean(outer_cv_acc) * 100, 2)}%')

Outer fold 1, optimal hyperparameters after inner 4-fold CV: {'p': 3, 'n_neighbors': 7}
kNN model accuracy with optimal hyperparameters: 0.9775
Outer fold 2, optimal hyperparameters after inner 4-fold CV: {'p': 3, 'n_neighbors': 3}
kNN model accuracy with optimal hyperparameters: 0.9775
Outer fold 3, optimal hyperparameters after inner 4-fold CV: {'p': 3, 'n_neighbors': 3}
kNN model accuracy with optimal hyperparameters: 0.9925
Outer fold 4, optimal hyperparameters after inner 4-fold CV: {'p': 3, 'n_neighbors': 3}
kNN model accuracy with optimal hyperparameters: 0.9925
Outer fold 5, optimal hyperparameters after inner 4-fold CV: {'p': 3, 'n_neighbors': 3}
kNN model accuracy with optimal hyperparameters: 0.98

Average model accuracy: 98.4%


### 2. Decision Tree classifier

In [11]:
outer_cv_acc = []
param_candidates = {'criterion': ['gini', 'entropy'],
                    'splitter': ['best', 'random']} 
param_search = GridSearchCV(estimator=DecisionTreeClassifier(), param_grid=param_candidates, scoring='accuracy', cv=inner_cv)

for fold in range(5):
    X_train = X[subsets.loc[subsets.index[fold],'Train row indices']]
    y_train = y[subsets.loc[subsets.index[fold],'Train row indices']]
    X_test = X[subsets.loc[subsets.index[fold],'Test row indices']]
    y_test = y[subsets.loc[subsets.index[fold],'Test row indices']]
    
    param_search.fit(X_train, y_train)
    y_estimated = param_search.predict(X_test)
    accuracy = accuracy_score(y_test, y_estimated)
    outer_cv_acc.append(accuracy)
    
    print(f'Outer fold {fold+1}, optimal hyperparameters after inner 4-fold CV: {param_search.best_params_}')
    print(f'Decision Tree model accuracy with optimal hyperparameters: {accuracy}')
    
print(f'\nAverage model accuracy: {round(np.mean(outer_cv_acc) * 100, 2)}%')

Outer fold 1, optimal hyperparameters after inner 4-fold CV: {'criterion': 'entropy', 'splitter': 'best'}
Decision Tree model accuracy with optimal hyperparameters: 0.965
Outer fold 2, optimal hyperparameters after inner 4-fold CV: {'criterion': 'gini', 'splitter': 'best'}
Decision Tree model accuracy with optimal hyperparameters: 0.9675
Outer fold 3, optimal hyperparameters after inner 4-fold CV: {'criterion': 'gini', 'splitter': 'best'}
Decision Tree model accuracy with optimal hyperparameters: 0.97
Outer fold 4, optimal hyperparameters after inner 4-fold CV: {'criterion': 'entropy', 'splitter': 'best'}
Decision Tree model accuracy with optimal hyperparameters: 0.9825
Outer fold 5, optimal hyperparameters after inner 4-fold CV: {'criterion': 'gini', 'splitter': 'best'}
Decision Tree model accuracy with optimal hyperparameters: 0.9525

Average model accuracy: 96.75%


### 3. Multilayer Perceptron (MLP) classifier

In [12]:
outer_cv_acc = []
param_candidates = {'alpha': np.linspace(start=0, stop=1e-1, num=500),
                    'activation': ['identity', 'logistic', 'tanh', 'relu']}
param_search = RandomizedSearchCV(estimator=MLPClassifier(max_iter=1000), param_distributions=param_candidates, scoring='accuracy', cv=inner_cv)

for fold in range(5):
    X_train = X[subsets.loc[subsets.index[fold],'Train row indices']]
    y_train = y[subsets.loc[subsets.index[fold],'Train row indices']]
    X_test = X[subsets.loc[subsets.index[fold],'Test row indices']]
    y_test = y[subsets.loc[subsets.index[fold],'Test row indices']]
    
    param_search.fit(X_train, y_train)
    y_estimated = param_search.predict(X_test)
    accuracy = accuracy_score(y_test, y_estimated)
    outer_cv_acc.append(accuracy)
     
    print(f'Outer fold {fold+1}, optimal hyperparameters after inner 4-fold CV: {param_search.best_params_}')
    print(f'MLP model accuracy with optimal hyperparameters: {accuracy}')
     
print(f'\nAverage model accuracy: {round(np.mean(outer_cv_acc) * 100, 2)}%')

Outer fold 1, optimal hyperparameters after inner 4-fold CV: {'alpha': 0.0468937875751503, 'activation': 'logistic'}
MLP model accuracy with optimal hyperparameters: 0.9825
Outer fold 2, optimal hyperparameters after inner 4-fold CV: {'alpha': 0.04809619238476954, 'activation': 'relu'}
MLP model accuracy with optimal hyperparameters: 0.97
Outer fold 3, optimal hyperparameters after inner 4-fold CV: {'alpha': 0.04609218436873747, 'activation': 'relu'}
MLP model accuracy with optimal hyperparameters: 0.98
Outer fold 4, optimal hyperparameters after inner 4-fold CV: {'alpha': 0.1, 'activation': 'logistic'}
MLP model accuracy with optimal hyperparameters: 0.9775
Outer fold 5, optimal hyperparameters after inner 4-fold CV: {'alpha': 0.002004008016032064, 'activation': 'tanh'}
MLP model accuracy with optimal hyperparameters: 0.9625

Average model accuracy: 97.45%


### 4. Gaussian Naïve Bayes classifier

In [13]:
outer_cv_acc = []
param_candidates = {'var_smoothing': np.linspace(start=1e-9, stop=1e-2, num=500)} 
param_search = RandomizedSearchCV(estimator=GaussianNB(), param_distributions=param_candidates, scoring='accuracy', cv=inner_cv)

for fold in range(5):
    X_train = X[subsets.loc[subsets.index[fold],'Train row indices']]
    y_train = y[subsets.loc[subsets.index[fold],'Train row indices']]
    X_test = X[subsets.loc[subsets.index[fold],'Test row indices']]
    y_test = y[subsets.loc[subsets.index[fold],'Test row indices']]
    
    param_search.fit(X_train, y_train)
    y_estimated = param_search.predict(X_test)
    accuracy = accuracy_score(y_test, y_estimated)
    outer_cv_acc.append(accuracy)
    
    print(f'Outer fold {fold+1}, optimal hyperparameters after inner 4-fold CV: {param_search.best_params_}')
    print(f'Gaussian Naïve Bayes model accuracy with optimal hyperparameters: {accuracy}')
    
print(f'\nAverage model accuracy: {round(np.mean(outer_cv_acc) * 100, 2)}%')

Outer fold 1, optimal hyperparameters after inner 4-fold CV: {'var_smoothing': 0.0072545092925851715}
Gaussian Naïve Bayes model accuracy with optimal hyperparameters: 0.98
Outer fold 2, optimal hyperparameters after inner 4-fold CV: {'var_smoothing': 0.0027054115511022047}
Gaussian Naïve Bayes model accuracy with optimal hyperparameters: 0.9825
Outer fold 3, optimal hyperparameters after inner 4-fold CV: {'var_smoothing': 1e-09}
Gaussian Naïve Bayes model accuracy with optimal hyperparameters: 0.9775
Outer fold 4, optimal hyperparameters after inner 4-fold CV: {'var_smoothing': 0.007875751715430862}
Gaussian Naïve Bayes model accuracy with optimal hyperparameters: 0.99
Outer fold 5, optimal hyperparameters after inner 4-fold CV: {'var_smoothing': 0.008897795701402806}
Gaussian Naïve Bayes model accuracy with optimal hyperparameters: 0.985

Average model accuracy: 98.3%


### 5. Random Forest classifier

In [14]:
outer_cv_acc = []
param_candidates = {'criterion': ['gini', 'entropy'],
                   'n_estimators': np.linspace(start=1, stop=500, num=500, dtype=int)} 
param_search = RandomizedSearchCV(estimator=RandomForestClassifier(), param_distributions=param_candidates, scoring='accuracy', cv=inner_cv)

for fold in range(5):
    X_train = X[subsets.loc[subsets.index[fold],'Train row indices']]
    y_train = y[subsets.loc[subsets.index[fold],'Train row indices']]
    X_test = X[subsets.loc[subsets.index[fold],'Test row indices']]
    y_test = y[subsets.loc[subsets.index[fold],'Test row indices']]
    
    param_search.fit(X_train, y_train)
    y_estimated = param_search.predict(X_test)
    accuracy = accuracy_score(y_test, y_estimated)
    outer_cv_acc.append(accuracy)
    
    print(f'Outer fold {fold+1}, optimal hyperparameters after inner 4-fold CV: {param_search.best_params_}')
    print(f'Random Forest Classifier model accuracy with optimal hyperparameters: {accuracy}')
    
print(f'\nAverage model accuracy: {round(np.mean(outer_cv_acc) * 100, 2)}%')

Outer fold 1, optimal hyperparameters after inner 4-fold CV: {'n_estimators': 344, 'criterion': 'entropy'}
Random Forest Classifier model accuracy with optimal hyperparameters: 0.9875
Outer fold 2, optimal hyperparameters after inner 4-fold CV: {'n_estimators': 138, 'criterion': 'entropy'}
Random Forest Classifier model accuracy with optimal hyperparameters: 0.9825
Outer fold 3, optimal hyperparameters after inner 4-fold CV: {'n_estimators': 239, 'criterion': 'entropy'}
Random Forest Classifier model accuracy with optimal hyperparameters: 0.98
Outer fold 4, optimal hyperparameters after inner 4-fold CV: {'n_estimators': 336, 'criterion': 'gini'}
Random Forest Classifier model accuracy with optimal hyperparameters: 0.98
Outer fold 5, optimal hyperparameters after inner 4-fold CV: {'n_estimators': 15, 'criterion': 'gini'}
Random Forest Classifier model accuracy with optimal hyperparameters: 0.9775

Average model accuracy: 98.15%
