# Echocardiogram. Modele de clasificare

Sîrbu Matei-Dan, _grupa 10LF383_

<i>Sursă dataset:</i> http://archive.ics.uci.edu/ml/datasets/Echocardiogram

<i>Synopsis:</i> Datasetul _Echocardiogram_ cuprinde 132 de înregistrări de date obținute din ecocardiografiile unor pacienți care au suferit un atac de cord în trecut. Datele sunt utilizate pentru a prezice dacă un pacient va supraviețui cel puțin un an după un atac de cord.

În cele ce urmează, atributul **alive-at-1** este reprezentată de y, iar coloanele **survival, still-alive, age-at-heart-attack, pericardial-effusion, fractional-shortening, epss, lvdd, wall-motion-score, wall-motion-index, mult**, de X.

_Verbose attribute info:_
1. survival -- the number of months patient survived (has survived, if patient is still alive). Because all the patients had their heart attacks at different times, it is possible that some patients have survived less than one year but they are still alive. Check the second variable to confirm this. Such patients cannot be used for the prediction task mentioned above.
2. still-alive -- a binary variable. 0=dead at end of survival period, 1 means still alive
3. age-at-heart-attack -- age in years when heart attack occurred
4. pericardial-effusion -- binary. Pericardial effusion is fluid around the heart. 0=no fluid, 1=fluid
5. fractional-shortening -- a measure of contracility around the heart lower numbers are increasingly abnormal
6. epss -- E-point septal separation, another measure of contractility. Larger numbers are increasingly abnormal.
7. lvdd -- left ventricular end-diastolic dimension. This is a measure of the size of the heart at end-diastole. Large hearts tend to be sick hearts.
8. wall-motion-score -- a measure of how the segments of the left ventricle are moving
9. wall-motion-index -- equals wall-motion-score divided by number of segments seen. Usually 12-13 segments are seen in an echocardiogram. Use this variable INSTEAD of the wall motion score.
10. mult -- a derivate var which can be ignored
11. name -- the name of the patient (I have replaced them with "name")
12. group -- meaningless, ignore it
13. alive-at-1 -- Boolean-valued. Derived from the first two attributes. 0 means patient was either dead after 1 year or had been followed for less than 1 year. 1 means patient was alive at 1 year.

In [1]:
import numpy as np
import pandas as pd
from IPython.display import display, HTML

In [2]:
header = ['survival', 'still-alive', 'age-at-heart-attack', 'pericardial-effusion', 'fractional-shortening', 'epss', 'lvdd', 'wall-motion-score', 'wall-motion-index', 'mult', 'name', 'group', 'alive-at-1']
data_echo = pd.read_csv("./Datasets/echocardiogram.csv", names=header)
display(HTML("<i>Dataset overview:</i>"))
display(data_echo)
# ignoring irrelevant columns, as per verbose attribute info:
del data_echo['mult']
del data_echo['name']
del data_echo['group']
folds = 5

Unnamed: 0,survival,still-alive,age-at-heart-attack,pericardial-effusion,fractional-shortening,epss,lvdd,wall-motion-score,wall-motion-index,mult,name,group,alive-at-1
0,11,0,71,0,0.260,9,4.600,14,1,1,name,1,0
1,19,0,72,0,0.380,6,4.100,14,1.700,0.588,name,1,0
2,16,0,55,0,0.260,4,3.420,14,1,1,name,1,0
3,57,0,60,0,0.253,12.062,4.603,16,1.450,0.788,name,1,0
4,19,1,57,0,0.160,22,5.750,18,2.250,0.571,name,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
127,7.5,1,64,0,0.24,12.9,4.72,12,1,0.857,name,?,?
128,41,0,64,0,0.28,5.40,5.47,11,1.10,0.714,name,?,?
129,36,0,69,0,0.20,7.00,5.05,14.5,1.21,0.857,name,?,?
130,22,0,57,0,0.14,16.1,4.36,15,1.36,0.786,name,?,?


După cum se poate observa în subsolul tabelului, despre unii pacienți nu se știe dacă au supraviețuit la 1 an de la atacul de cord, exact informația pe care vrem să o prezicem după antrenarea modelelor de clasificare. Aceste ecocardiograme se vor elimina din dataset.

In [3]:
data_echo = data_echo[(data_echo['alive-at-1'] == '0') | (data_echo['alive-at-1'] == '1')]
display(HTML("<i>Last records in sanitized dataset:</i>"))
data_echo.tail()

Unnamed: 0,survival,still-alive,age-at-heart-attack,pericardial-effusion,fractional-shortening,epss,lvdd,wall-motion-score,wall-motion-index,alive-at-1
104,1.25,1,63,0,0.3,6.9,3.52,18.16,1.51,1
105,24.0,0,59,0,0.17,14.3,5.49,13.5,1.5,0
106,25.0,0,57,0,0.228,9.7,4.29,11.0,1.0,0
108,0.75,1,78,0,0.23,40.0,6.23,14.0,1.4,1
109,3.0,1,62,0,0.26,7.6,4.42,14.0,1.0,1


De asemenea, unele înregistrări au valori lipsă, valori pe care va trebui să le improvizăm prin _missing value imputation_. Pentru a decide modul în care vom imputa date, trebuie să aflăm modul în care lipsesc datele; există trei tipuri de _missingness mechanisms_: 
- Missingness completely at random
- Missingness at random
- Missingness that depends on unobserved predictors
- Missingness that depends on the missing value itself.

# TODO: pick best data imputation technique and explain choice; 'mean' strategy tested below.

In [4]:
data_echo.replace(to_replace='?', value=np.nan, inplace=True)
display(HTML("<i>Dataset extract with missing values:</i>"))
data_echo[33:39]

Unnamed: 0,survival,still-alive,age-at-heart-attack,pericardial-effusion,fractional-shortening,epss,lvdd,wall-motion-score,wall-motion-index,alive-at-1
43,46.0,0,56,0,0.33,,3.59,14.0,1.0,0
46,19.5,1,81,0,0.12,,,9.0,1.25,0
47,20.0,1,59,0,0.03,21.3,6.29,17.0,1.31,0
48,0.25,1,63,1,,,,23.0,2.3,1
50,2.0,1,56,1,0.04,14.0,5.0,,,1
51,7.0,1,61,1,0.27,,,9.0,1.5,1


In [28]:
from sklearn.impute import SimpleImputer
header = ['survival', 'still-alive', 'age-at-heart-attack', 'pericardial-effusion', 'fractional-shortening', 'epss', 'lvdd', 'wall-motion-score', 'wall-motion-index', 'alive-at-1']
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
imputer.fit(data_echo)
data_echo = pd.DataFrame(data=imputer.transform(data_echo), columns=header)
display(HTML("<i>Dataset extract with values imputed:</i>"))
display(data_echo[33:39])
X = data_echo.values[:, :9]
y = data_echo.values[:, -1]

Unnamed: 0,survival,still-alive,age-at-heart-attack,pericardial-effusion,fractional-shortening,epss,lvdd,wall-motion-score,wall-motion-index,alive-at-1
33,46.0,0.0,56.0,0.0,0.33,12.576636,3.59,14.0,1.0,0.0
34,19.5,1.0,81.0,0.0,0.12,12.576636,4.785912,9.0,1.25,0.0
35,20.0,1.0,59.0,0.0,0.03,21.3,6.29,17.0,1.31,0.0
36,0.25,1.0,63.0,1.0,0.219057,12.576636,4.785912,23.0,2.3,1.0
37,2.0,1.0,56.0,1.0,0.04,14.0,5.0,15.348082,1.433795,1.0
38,7.0,1.0,61.0,1.0,0.27,12.576636,4.785912,9.0,1.5,1.0


# <u>Modelele de clasificare _in a nutshell_</u>
Pentru clasificarea datelor din datasetul _Echocardiogram_ vom utiliza următoarele modele de clasificare: kNN, Decision Tree, MLP, Gaussian NB și Random Forest. În paragraful următor vă vom explica modul de funcționare al fiecărui algoritm și evidenția hiperparametrii și formulele care stau la baza acestora.
<div style="text-align:center"><img src="./Images/xkcd_machine_learning.png"><br>"hiperparametrii și formulele care stau la baza acestora"<br>sursă: <a href="https://xkcd.com/1838/">xkcd 1838: Machine Learning</a></div>

### [1. <i>k</i>-nearest neighbors classifier](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html#sklearn.neighbors.KNeighborsClassifier)

class `sklearn.neighbors.KNeighborsClassifier`<i>(n_neighbors=5, weights='uniform', algorithm='auto', leaf_size=30, p=2, metric='minkowski', metric_params=None, n_jobs=None, **kwargs)</i>

Într-o problemă de clasificare, algoritmul kNN (_k_-nearest neighbors) identifică cei mai apropiați _k_ vecini ai fiecărui item neclasificat - indiferent de etichetele acestora - vecini localizați în setul de antrenare. Determinarea claselor din care fac parte itemii neclasificați se face prin votare: clasa în care aparțin majoritatea vecinilor se consideră clasa itemului.
<div style="text-align:center"><img style="width: 500px" src="./Images/knn_example.png"><br>Exemplu: clasificarea itemului c cu 3NN. În urma votării se determină clasa lui c: <b>o</b>.<br>sursă: <a href="http://youtu.be/UqYde-LULfs">YouTube (<i>How kNN algorithm works</i> de Thales Sehn Körting)</a></div><br>

Pentru determinarea distanței dintre itemi se pot utiliza mai multe metrici. Scikit-learn admite orice funcție Python ca și metrică, dar implicit folosește metrica _Minkowski_. Iată câteva exemple de metrici des utilizate în kNN:

- _distanța Minkowski_: $d_{st} = \sqrt[p]{\sum_{j=1}^n |x_{sj} - y_{tj}|^p}$  (_Obs._: p este un hiperparametru utilizat de Scikit-learn)
- _distanța Euclideană_: $d(\textbf{x},\textbf{y}) = \sqrt{\sum_{i=1}^n (y_i - x_i)^2}$
- _distanța Manhattan (City block)_: $d_{st} = \sum_{j=1}^n |x_{sj} - y_{tj}|$
- _distanța Mahalanobis_: $d(\textbf{x},\textbf{y}) = \sqrt{\sum_{i=1}^n \frac{(x_i - y_i)^2}{s_i^2}}$, unde $s_i$ este deviația standard a lui $x_i$ și $y_i$ în sample

### [2. Decision tree classifier](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html#sklearn.tree.DecisionTreeClassifier)

class `sklearn.tree.DecisionTreeClassifier`<i>(criterion='gini', splitter='best', max_depth=None, min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features=None, random_state=None, max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, class_weight=None, presort='deprecated', ccp_alpha=0.0)</i>

Un arbore de decizie (_decision tree_) este o structură arborescentă tip flowchart unde un nod intern reprezintă un feature, ramura este un criteriu de decizie, iar fiecare frunză este un rezultat, o clasificare. Algoritmul Decision tree selectează cel mai bun feature folosind o metrică ASM (_Attribute Selection Measure_), convertește un nod feature la un nod tip criteriu de decizie, și partiționează (splits) datasetul în subseturi. Procesul se execută recursiv până arborele conține numai noduri criterii de decizie și noduri frunză rezultat. Cu cât arborele este mai adânc, cu atât sunt mai complexe criteriile de decizie și modelul are o acuratețe mai mare. 

<div style="text-align:center"><img style="width: 600px" src="./Images/dt_diagram.png"><br>Structura unui arbore de decizie.<br>sursă: <a href="https://www.kdnuggets.com/2020/01/decision-tree-algorithm-explained.html">KDnuggets (Decision Tree Algorithm, Explained)</a></div>

<br>Pentru măsurarea calității unui split, Scikit-learn utilizează două metrici ASM:

- _impuritatea Gini_ (cât de des este etichetat greșit un element ales aleator dacă a fost etichetat folosind distribuția etichetelor dintr-un subset; poate determina overfitting-ul modelului): <br>$Gini(p) = 1 - \sum_{j=1}^c p_j^2$ <br>
- _entropia_ (similar cu Gini impurity, mai intensă d.p.d.v. computațional din cauza funcției logaritmice): <br>$H(p) = - \sum_{j=1}^c p_j \log p_j$

(unde c este numărul de clase (etichete), iar $p_j$ este subsetul etichetat cu clasă i, unde $j \in \{1, 2, ..., c\}$).

### [3. Multilayer perceptron (MLP) classifier](https://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPClassifier.html#sklearn.neural_network.MLPClassifier)

class `sklearn.neural_network.MLPClassifier`<i>(hidden_layer_sizes=(100, ), activation='relu', solver='adam', alpha=0.0001, batch_size='auto', learning_rate='constant', learning_rate_init=0.001, power_t=0.5, max_iter=200, shuffle=True, random_state=None, tol=0.0001, verbose=False, warm_start=False, momentum=0.9, nesterovs_momentum=True, early_stopping=False, validation_fraction=0.1, beta_1=0.9, beta_2=0.999, epsilon=1e-08, n_iter_no_change=10, max_fun=15000)</i>

_Perceptronii_ sunt o clasă de clasificatori utilizați în învățarea supervizată, fiind un model matematic al unui neuron biologic. În particular, _perceptronii multistrat (MLP)_ formează rețele neuronale cu mai multe straturi de perceptroni: un strat de intrare, unul sau mai multe straturi intermediare (ascunse), și un strat de ieșire.

<div style="text-align:center"><img style="width: 500px" src="./Images/mlp_diagram.png"><br>Un perceptron multistrat ilustrat.<br>sursă: <a href="https://github.com/ledell/sldm4-h2o/blob/master/sldm4_h2o_oct2016.pdf">GitHub (ledell/sldm4-h2o)</a></div>

<br>Într-o rețea neuronală, o _funcție de activare_ definește ieșirea unui perceptron după ce este supus unui set de intrare. În forma lui cea mai simplă, funcția poate returna un rezultat binar (funcție liniară, output 0 sau 1): făcând analogie cu neuronul biologic, dacă trece un impuls electric prin axonul acestuia sau nu. În cazul rețelelor neuronale moderne care utilizează mai multe straturi de perceptroni, funcțiile de activare pot fi și non-binare (non-liniare). Scikit-learn admite funcții de activare de ambele tipuri în implementarea MLP classifier:
- _funcția identitate_: $f(x) = x$
- _sigmoida logistică_: $f(x) = \frac{1}{1 + \exp(-x)}$
- _tangenta hiperbolică_: $f(x) = \tanh(x) = \frac{\sinh(x)}{\cosh(x)} = \frac{e^x - e^{-x}}{e^x + e^{-x}}$
- _Rectified Linear Unit (ReLU)_: $f(x) = \max(0, x) = \begin{cases} 0 & \text{dacă } x \leq 0 \\ x & \text{dacă } x > 0 \end{cases}$

De asemenea, clasificatorul MLP din Scikit-learn utilizează și algoritmi de optimizare a ponderilor (solvers): _LBFGS_ (algoritm Quasi-Newton), _SGD_ (stochastic gradient descent) și _Adam_ (algoritm derivat din SGD, creat de Diederik P. Kingma și Jimmy Lei Ba).

### [4. Gaussian Naïve Bayes classifier](https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.GaussianNB.html#sklearn.naive_bayes.GaussianNB)

class `sklearn.naive_bayes.GaussianNB`<i>(priors=None, var_smoothing=1e-09)<i>

Algoritmul de clasificare _Gaussian Naïve Bayes_ aparține familiei de clasificatori _Naïve Bayes_, care presupun că prezența unui feature într-o clasă nu este afectată de prezența altor features; pe scurt, proprietățile contribuie independent la probabilitatea apartenenței la o clasă. În particular, algoritmul _Gaussian Naïve Bayes_ urmează funcția de probabilitate (PDF) a unei distribuții normale (Gaussiene):
$$\large P(x_i | y) = \frac{1}{\sqrt{2 \pi \sigma_y^2}}\exp\bigg(-\frac{(x_i - \mu_y)^2}{2 \sigma_y^2}\bigg),$$
unde parametrii $\sigma_y$ și $\mu_y$, deviația standard și media, sunt determinați folosind maximum likelihood estimation (MLE), o metodă de estimare a parametrilor unei PDF prin maximizarea unei funcții de likelihood (cât de bine se potrivește un sample cu un model statistic).

### [5. Random Forest classifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html#sklearn.ensemble.RandomForestClassifier)

class `sklearn.ensemble.RandomForestClassifier`<i>(n_estimators=100, criterion='gini', max_depth=None, min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features='auto', max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, bootstrap=True, oob_score=False, n_jobs=None, random_state=None, verbose=0, warm_start=False, class_weight=None, ccp_alpha=0.0, max_samples=None)<i>

Un clasificator _Random forest_ se folosește de ipotezele emise de mai mulți arbori de decizie aleatori (_random trees_), obținuți în urma unui _random split_. Un random forest se obține prin construirea unui random tree pentru fiecare set de antrenare. Acești arbori funcționează ca un ansamblu; pentru fiecare dată de intrare se aplică modelele din ansamblu, și rezultatul final se obține agregând rezultatele prin votare. Astfel, un random forest este un _meta-estimator_: se obține o predicție în urma mai multor predicții.
<div style="text-align:center"><img style="width: 400px" src="./Images/rf_diagram.png"><br>Un model random forest făcând o predicție; în urma votării se obține rezultatul 1.<br>sursă: <a href="https://towardsdatascience.com/understanding-random-forest-58381e0602d2">Medium (Towards Data Science: Understanding Random Forest)</a></div>

<br>La fel ca la _Decision Tree classifier_, pentru măsurarea calității unui split, Scikit-learn utilizează două metrici:

- _impuritatea Gini_: $Gini(p) = 1 - \sum_{j=1}^c p_j^2$ <br>
- _entropia_: $H(p) = - \sum_{j=1}^c p_j \log p_j$

(unde c este numărul de clase (etichete), iar $p_j$ este subsetul etichetat cu clasă i, unde $j \in \{1, 2, ..., c\}$).

# <u>Testarea algoritmilor de clasificare pe setul de date cu Scikit-learn</u>

In [8]:
# just a pretty printing function, don't mind me...
def print_stats_cv(model_cv_stats):
    print(f"Test accuracy for each fold: {model_cv_stats['test_accuracy']} \n=> Average test accuracy: {round(model_cv_stats['test_accuracy'].mean() * 100, 3)}%")
    print(f"Train accuracy for each fold: {model_cv_stats['train_accuracy']} \n=> Average train accuracy: {round(model_cv_stats['train_accuracy'].mean() * 100, 3)}%")
    print(f"Test F1 score for each fold: {model_cv_stats['test_f1_macro']} \n=> Average test F1 score: {round(model_cv_stats['test_f1_macro'].mean() * 100, 3)}%")
    print(f"Train F1 score for each fold: {model_cv_stats['train_f1_macro']} \n=> Average train F1 score: {round(model_cv_stats['train_f1_macro'].mean() * 100, 3)}%")

### 1. <i>k</i>-nearest neighbors classifier

In [10]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_validate
from sklearn.preprocessing import MinMaxScaler

# hiperparametri
knn_neighbors = 4
knn_minkowski_p = 3

# scalare date
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)

# implementare KNN
model = KNeighborsClassifier(n_neighbors=knn_neighbors, p=knn_minkowski_p)
model_cv_stats = cross_validate(model, X_scaled, y, cv=folds, scoring=('accuracy', 'f1_macro'), return_train_score=True)

# statistici
display(HTML(f"<h4>{folds}-fold cross validation for {knn_neighbors}-nearest neighbors classification:</h4>"))
print_stats_cv(model_cv_stats)

Test accuracy for each fold: [0.93333333 0.93333333 0.93333333 1.         0.85714286] 
=> Average test accuracy: 93.143%
Train accuracy for each fold: [0.98305085 0.96610169 0.98305085 0.94915254 0.95      ] 
=> Average train accuracy: 96.627%
Test F1 score for each fold: [0.92822967 0.92063492 0.92822967 1.         0.78787879] 
=> Average test F1 score: 91.299%
Train F1 score for each fold: [0.98031365 0.96118421 0.98085037 0.94094094 0.94442729] 
=> Average train F1 score: 96.154%


### 2. Decision Tree classifier

In [11]:
from sklearn.tree import DecisionTreeClassifier

# hiperparametri
dt_criterion = 'gini'
dt_splitter = 'best'

# implementare Decision Tree
model = DecisionTreeClassifier(criterion=dt_criterion, splitter=dt_splitter, random_state=42)
model_cv_stats = cross_validate(model, X, y, cv=folds, scoring=('accuracy', 'f1_macro'), return_train_score=True)

# statistici
display(HTML(f"<h4>{folds}-fold cross validation for Decision Trees classification:</h4>"))
print_stats_cv(model_cv_stats)

Test accuracy for each fold: [1.         1.         0.93333333 0.93333333 1.        ] 
=> Average test accuracy: 97.333%
Train accuracy for each fold: [1. 1. 1. 1. 1.] 
=> Average train accuracy: 100.0%
Test F1 score for each fold: [1.         1.         0.92822967 0.92063492 1.        ] 
=> Average test F1 score: 96.977%
Train F1 score for each fold: [1. 1. 1. 1. 1.] 
=> Average train F1 score: 100.0%


### 3. Multilayer Perceptron (MLP) classifier

In [15]:
from sklearn.neural_network import MLPClassifier

# hiperparametri
mlp_solver = 'adam'
mlp_activation = 'relu'
mlp_alpha=1e-3
mlp_hidden_layer_sizes = (50,50)

# implementare MLP
model = MLPClassifier(solver=mlp_solver, activation=mlp_activation, alpha=mlp_alpha, hidden_layer_sizes=mlp_hidden_layer_sizes, random_state=42, max_iter=1000)
model_cv_stats = cross_validate(model, X, y, cv=folds, scoring=('accuracy', 'f1_macro'), return_train_score=True)

# statistici
display(HTML(f"<h4>{folds}-fold cross validation for MLP classification</h4>"))
display(HTML(f"using hardcoded hyperparameters - Solver: <b>{mlp_solver}</b>, Activation function: <b>{mlp_activation}</b>, Parameter for regularization (α): <b>{mlp_alpha}</b>, Hidden layer sizes: <b>{mlp_hidden_layer_sizes}</b>"))
print_stats_cv(model_cv_stats)

Test accuracy for each fold: [1.         0.93333333 1.         0.93333333 1.        ] 
=> Average test accuracy: 97.333%
Train accuracy for each fold: [1. 1. 1. 1. 1.] 
=> Average train accuracy: 100.0%
Test F1 score for each fold: [1.         0.92063492 1.         0.92063492 1.        ] 
=> Average test F1 score: 96.825%
Train F1 score for each fold: [1. 1. 1. 1. 1.] 
=> Average train F1 score: 100.0%


### 4. Gaussian Naïve Bayes classifier

In [13]:
from sklearn.naive_bayes import GaussianNB

# implementare GNB
model = GaussianNB()
model_cv_stats = cross_validate(model, X, y, cv=folds, scoring=('accuracy', 'f1_macro'), return_train_score=True)

# statistici
display(HTML(f"<h4>{folds}-fold cross validation for Gaussian NB classification</h4>"))
print_stats_cv(model_cv_stats)

Test accuracy for each fold: [0.93333333 0.93333333 0.86666667 0.93333333 1.        ] 
=> Average test accuracy: 93.333%
Train accuracy for each fold: [0.94915254 0.94915254 0.96610169 1.         0.93333333] 
=> Average train accuracy: 95.955%
Test F1 score for each fold: [0.92822967 0.92822967 0.86111111 0.92063492 1.        ] 
=> Average test F1 score: 92.764%
Train F1 score for each fold: [0.94393411 0.94393411 0.96217949 1.         0.92822967] 
=> Average train F1 score: 95.566%


### 5. Random Forest classifier

In [14]:
from sklearn.ensemble import RandomForestClassifier

# hiperparametri
rfc_n_estimators = 150
rfc_criterion = 'gini'

# implementare Random Forest
model = RandomForestClassifier(n_estimators=rfc_n_estimators, criterion=rfc_criterion)
model_cv_stats = cross_validate(model, X, y, cv=folds, scoring=('accuracy', 'f1_macro'), return_train_score=True)

display(HTML(f"<h4>{folds}-fold cross validation for Random Forest classification</h4>"))
print_stats_cv(model_cv_stats)

Test accuracy for each fold: [1.         1.         1.         0.93333333 1.        ] 
=> Average test accuracy: 98.667%
Train accuracy for each fold: [1. 1. 1. 1. 1.] 
=> Average train accuracy: 100.0%
Test F1 score for each fold: [1.         1.         1.         0.92063492 1.        ] 
=> Average test F1 score: 98.413%
Train F1 score for each fold: [1. 1. 1. 1. 1.] 
=> Average train F1 score: 100.0%


# <u>Nested Cross Validation pentru optimizarea hiperparametrilor</u>

In [18]:
from sklearn.model_selection import KFold
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RandomizedSearchCV
from sklearn.metrics import accuracy_score

# CVs configuration
inner_cv = KFold(n_splits=4, shuffle=True)
outer_cv = KFold(n_splits=5, shuffle=True)

# outer CV folds:
print("5-fold cross validation: split overview")
splits = outer_cv.split(range(data_echo.index.size))
subsets = pd.DataFrame(columns=['Fold', 'Train row indices', 'Test row indices'])
for i, split_data in enumerate(splits):
    subsets.loc[i]=[i + 1, split_data[0], split_data[1]]
display(subsets)

5-fold cross validation: split overview


Unnamed: 0,Fold,Train row indices,Test row indices
0,1,"[0, 1, 3, 4, 6, 8, 9, 10, 11, 14, 15, 16, 18, ...","[2, 5, 7, 12, 13, 17, 20, 25, 31, 35, 42, 55, ..."
1,2,"[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 12, 13, 14,...","[11, 18, 19, 22, 43, 48, 52, 54, 57, 58, 59, 6..."
2,3,"[1, 2, 3, 4, 5, 7, 8, 9, 10, 11, 12, 13, 14, 1...","[0, 6, 21, 26, 27, 34, 37, 41, 45, 46, 47, 50,..."
3,4,"[0, 2, 3, 5, 6, 7, 9, 11, 12, 13, 14, 16, 17, ...","[1, 4, 8, 10, 15, 28, 29, 30, 39, 44, 49, 53, ..."
4,5,"[0, 1, 2, 4, 5, 6, 7, 8, 10, 11, 12, 13, 15, 1...","[3, 9, 14, 16, 23, 24, 32, 33, 36, 38, 40, 69,..."


### 1. <i>k</i>-nearest neighbors classifier

In [19]:
outer_cv_acc = []
param_candidates = {'n_neighbors': np.linspace(start=1, stop=30, num=30, dtype=int),
                    'p': np.linspace(start=1, stop=5, num=4, dtype=int)} 
param_search = RandomizedSearchCV(estimator=KNeighborsClassifier(), param_distributions=param_candidates, scoring='accuracy', cv=inner_cv, random_state=42)
    
for fold in range(5):
    X_train = X_scaled[subsets.loc[subsets.index[fold],'Train row indices']]
    y_train = y[subsets.loc[subsets.index[fold],'Train row indices']]
    X_test = X_scaled[subsets.loc[subsets.index[fold],'Test row indices']]
    y_test = y[subsets.loc[subsets.index[fold],'Test row indices']]
    
    param_search.fit(X_train, y_train)
    y_estimated = param_search.predict(X_test)
    accuracy = accuracy_score(y_test, y_estimated)
    outer_cv_acc.append(accuracy)
    
    print(f'Outer fold {fold+1}, optimal hyperparameters after inner 4-fold CV: {param_search.best_params_}')
    print(f'kNN model accuracy with optimal hyperparameters: {accuracy}')
    
print(f'\nAverage model accuracy: {round(np.mean(outer_cv_acc) * 100, 2)}%')

Outer fold 1, optimal hyperparameters after inner 4-fold CV: {'p': 1, 'n_neighbors': 12}
kNN model accuracy with optimal hyperparameters: 0.8666666666666667
Outer fold 2, optimal hyperparameters after inner 4-fold CV: {'p': 1, 'n_neighbors': 12}
kNN model accuracy with optimal hyperparameters: 0.9333333333333333
Outer fold 3, optimal hyperparameters after inner 4-fold CV: {'p': 1, 'n_neighbors': 12}
kNN model accuracy with optimal hyperparameters: 0.9333333333333333
Outer fold 4, optimal hyperparameters after inner 4-fold CV: {'p': 5, 'n_neighbors': 12}
kNN model accuracy with optimal hyperparameters: 0.9333333333333333
Outer fold 5, optimal hyperparameters after inner 4-fold CV: {'p': 1, 'n_neighbors': 12}
kNN model accuracy with optimal hyperparameters: 0.9285714285714286

Average model accuracy: 91.9%


### 2. Decision Tree classifier

In [20]:
outer_cv_acc = []
param_candidates = {'criterion': ['gini', 'entropy'],
                    'splitter': ['best', 'random']} 
param_search = GridSearchCV(estimator=DecisionTreeClassifier(), param_grid=param_candidates, scoring='accuracy', cv=inner_cv)

for fold in range(5):
    X_train = X[subsets.loc[subsets.index[fold],'Train row indices']]
    y_train = y[subsets.loc[subsets.index[fold],'Train row indices']]
    X_test = X[subsets.loc[subsets.index[fold],'Test row indices']]
    y_test = y[subsets.loc[subsets.index[fold],'Test row indices']]
    
    param_search.fit(X_train, y_train)
    y_estimated = param_search.predict(X_test)
    accuracy = accuracy_score(y_test, y_estimated)
    outer_cv_acc.append(accuracy)
    
    print(f'Outer fold {fold+1}, optimal hyperparameters after inner 4-fold CV: {param_search.best_params_}')
    print(f'Decision Tree model accuracy with optimal hyperparameters: {accuracy}')
    
print(f'\nAverage model accuracy: {round(np.mean(outer_cv_acc) * 100, 2)}%')

Outer fold 1, optimal hyperparameters after inner 4-fold CV: {'criterion': 'gini', 'splitter': 'best'}
Decision Tree model accuracy with optimal hyperparameters: 1.0
Outer fold 2, optimal hyperparameters after inner 4-fold CV: {'criterion': 'entropy', 'splitter': 'random'}
Decision Tree model accuracy with optimal hyperparameters: 0.9333333333333333
Outer fold 3, optimal hyperparameters after inner 4-fold CV: {'criterion': 'gini', 'splitter': 'random'}
Decision Tree model accuracy with optimal hyperparameters: 0.9333333333333333
Outer fold 4, optimal hyperparameters after inner 4-fold CV: {'criterion': 'gini', 'splitter': 'best'}
Decision Tree model accuracy with optimal hyperparameters: 0.8666666666666667
Outer fold 5, optimal hyperparameters after inner 4-fold CV: {'criterion': 'gini', 'splitter': 'best'}
Decision Tree model accuracy with optimal hyperparameters: 1.0

Average model accuracy: 93.29%


### 3. Multilayer Perceptron (MLP) classifier

In [27]:
outer_cv_acc = []
param_candidates = {'alpha': np.linspace(start=0, stop=1e-1, num=500),
                    'activation': ['identity', 'logistic', 'tanh', 'relu']}
param_search = RandomizedSearchCV(estimator=MLPClassifier(max_iter=1000), param_distributions=param_candidates, scoring='accuracy', cv=inner_cv)

for fold in range(5):
    X_train = X[subsets.loc[subsets.index[fold],'Train row indices']]
    y_train = y[subsets.loc[subsets.index[fold],'Train row indices']]
    X_test = X[subsets.loc[subsets.index[fold],'Test row indices']]
    y_test = y[subsets.loc[subsets.index[fold],'Test row indices']]
    
    param_search.fit(X_train, y_train)
    y_estimated = param_search.predict(X_test)
    accuracy = accuracy_score(y_test, y_estimated)
    outer_cv_acc.append(accuracy)
     
    print(f'Outer fold {fold+1}, optimal hyperparameters after inner 4-fold CV: {param_search.best_params_}')
    print(f'MLP model accuracy with optimal hyperparameters: {accuracy}')
     
print(f'\nAverage model accuracy: {round(np.mean(outer_cv_acc) * 100, 2)}%')

Outer fold 1, optimal hyperparameters after inner 4-fold CV: {'alpha': 0.06773547094188377, 'activation': 'identity'}
MLP model accuracy with optimal hyperparameters: 0.9333333333333333
Outer fold 2, optimal hyperparameters after inner 4-fold CV: {'alpha': 0.09859719438877755, 'activation': 'identity'}
MLP model accuracy with optimal hyperparameters: 1.0
Outer fold 3, optimal hyperparameters after inner 4-fold CV: {'alpha': 0.0028056112224448897, 'activation': 'logistic'}
MLP model accuracy with optimal hyperparameters: 0.9333333333333333
Outer fold 4, optimal hyperparameters after inner 4-fold CV: {'alpha': 0.011222444889779559, 'activation': 'relu'}
MLP model accuracy with optimal hyperparameters: 0.8666666666666667
Outer fold 5, optimal hyperparameters after inner 4-fold CV: {'alpha': 0.0935871743486974, 'activation': 'logistic'}
MLP model accuracy with optimal hyperparameters: 0.9285714285714286

Average model accuracy: 94.46%


### 4. Gaussian Naïve Bayes classifier

In [23]:
outer_cv_acc = []
param_candidates = {'var_smoothing': np.linspace(start=1e-9, stop=1e-2, num=500)} 
param_search = RandomizedSearchCV(estimator=GaussianNB(), param_distributions=param_candidates, scoring='accuracy', cv=inner_cv)

for fold in range(5):
    X_train = X[subsets.loc[subsets.index[fold],'Train row indices']]
    y_train = y[subsets.loc[subsets.index[fold],'Train row indices']]
    X_test = X[subsets.loc[subsets.index[fold],'Test row indices']]
    y_test = y[subsets.loc[subsets.index[fold],'Test row indices']]
    
    param_search.fit(X_train, y_train)
    y_estimated = param_search.predict(X_test)
    accuracy = accuracy_score(y_test, y_estimated)
    outer_cv_acc.append(accuracy)
    
    print(f'Outer fold {fold+1}, optimal hyperparameters after inner 4-fold CV: {param_search.best_params_}')
    print(f'Gaussian Naïve Bayes model accuracy with optimal hyperparameters: {accuracy}')
    
print(f'\nAverage model accuracy: {round(np.mean(outer_cv_acc) * 100, 2)}%')

Outer fold 1, optimal hyperparameters after inner 4-fold CV: {'var_smoothing': 0.0030260528016032066}
Gaussian Naïve Bayes model accuracy with optimal hyperparameters: 1.0
Outer fold 2, optimal hyperparameters after inner 4-fold CV: {'var_smoothing': 0.00022044185971943888}
Gaussian Naïve Bayes model accuracy with optimal hyperparameters: 1.0
Outer fold 3, optimal hyperparameters after inner 4-fold CV: {'var_smoothing': 0.006472946244488979}
Gaussian Naïve Bayes model accuracy with optimal hyperparameters: 0.8666666666666667
Outer fold 4, optimal hyperparameters after inner 4-fold CV: {'var_smoothing': 0.008056112418837675}
Gaussian Naïve Bayes model accuracy with optimal hyperparameters: 0.8666666666666667
Outer fold 5, optimal hyperparameters after inner 4-fold CV: {'var_smoothing': 0.0027054115511022047}
Gaussian Naïve Bayes model accuracy with optimal hyperparameters: 0.8571428571428571

Average model accuracy: 93.54%


### 5. Random Forest classifier

In [24]:
outer_cv_acc = []
param_candidates = {'criterion': ['gini', 'entropy'],
                   'n_estimators': np.linspace(start=1, stop=500, num=500, dtype=int)} 
param_search = RandomizedSearchCV(estimator=RandomForestClassifier(), param_distributions=param_candidates, scoring='accuracy', cv=inner_cv)

for fold in range(5):
    X_train = X[subsets.loc[subsets.index[fold],'Train row indices']]
    y_train = y[subsets.loc[subsets.index[fold],'Train row indices']]
    X_test = X[subsets.loc[subsets.index[fold],'Test row indices']]
    y_test = y[subsets.loc[subsets.index[fold],'Test row indices']]
    
    param_search.fit(X_train, y_train)
    y_estimated = param_search.predict(X_test)
    accuracy = accuracy_score(y_test, y_estimated)
    outer_cv_acc.append(accuracy)
    
    print(f'Outer fold {fold+1}, optimal hyperparameters after inner 4-fold CV: {param_search.best_params_}')
    print(f'Random Forest Classifier model accuracy with optimal hyperparameters: {accuracy}')
    
print(f'\nAverage model accuracy: {round(np.mean(outer_cv_acc) * 100, 2)}%')

Outer fold 1, optimal hyperparameters after inner 4-fold CV: {'n_estimators': 69, 'criterion': 'entropy'}
Random Forest Classifier model accuracy with optimal hyperparameters: 1.0
Outer fold 2, optimal hyperparameters after inner 4-fold CV: {'n_estimators': 236, 'criterion': 'entropy'}
Random Forest Classifier model accuracy with optimal hyperparameters: 1.0
Outer fold 3, optimal hyperparameters after inner 4-fold CV: {'n_estimators': 308, 'criterion': 'entropy'}
Random Forest Classifier model accuracy with optimal hyperparameters: 1.0
Outer fold 4, optimal hyperparameters after inner 4-fold CV: {'n_estimators': 241, 'criterion': 'gini'}
Random Forest Classifier model accuracy with optimal hyperparameters: 0.8666666666666667
Outer fold 5, optimal hyperparameters after inner 4-fold CV: {'n_estimators': 47, 'criterion': 'gini'}
Random Forest Classifier model accuracy with optimal hyperparameters: 1.0

Average model accuracy: 94.17%
