# Lab 6: classification models

Sîrbu Matei-Dan, _group 10LF383_ <br>
Tătaru Dragoș-Cătălin, _group 10LF383_

The datasets used to train the classification algorithms:

- Wi-Fi localization
- Echocardiogram
- Seeds
- Dermatology

# <u>Classification models in a nutshell</u>
To classify the data in the datasets listed above, we will use the following classification models: kNN, Decision Tree, MLP, Gaussian NB and Random Forest. The sections below explain how each algorithm works and highlight the hyperparameters and formulas behind them.
<div style="text-align:center"><img src="./Images/xkcd_machine_learning.png"><br>"the hyperparameters and formulas behind them"<br>source: <a href="https://xkcd.com/1838/">xkcd 1838: Machine Learning</a></div>

### [1. <i>k</i>-nearest neighbors classifier](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html#sklearn.neighbors.KNeighborsClassifier)

class `sklearn.neighbors.KNeighborsClassifier`<i>(n_neighbors=5, weights='uniform', algorithm='auto', leaf_size=30, p=2, metric='minkowski', metric_params=None, n_jobs=None, **kwargs)</i>

In a classification problem, the kNN (_k_-nearest neighbors) algorithm finds, for each unclassified item, its _k_ nearest neighbors in the training set, regardless of their labels. The class of an unclassified item is then decided by voting: the class to which the majority of its neighbors belong is taken as the class of the item.
<div style="text-align:center"><img style="width: 500px" src="./Images/knn_example.png"><br>Example: classifying item c with 3NN. The vote determines the class of c: <b>o</b>.<br>source: <a href="http://youtu.be/UqYde-LULfs">YouTube (<i>How kNN algorithm works</i> by Thales Sehn Körting)</a></div><br>
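The voting step described above can be sketched in a few lines of plain Python. This is only an illustration, not Scikit-learn's implementation: the function name `knn_predict` and the toy points are assumptions made for the example, and the distance used is the Euclidean one.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every training point
    distances = [math.dist(point, query) for point in train_points]
    # indices of the k closest training points
    nearest = sorted(range(len(train_points)), key=lambda i: distances[i])[:k]
    # majority vote over the labels of those neighbors
    votes = Counter(train_labels[i] for i in nearest)
    return votes.most_common(1)[0][0]

# toy data with two classes, 'o' and 'a' (hypothetical values)
train_points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.5), (6.0, 6.0), (6.5, 5.5)]
train_labels = ['o', 'o', 'o', 'a', 'a']

predicted = knn_predict(train_points, train_labels, query=(2.0, 2.0), k=3)
print(predicted)
```

All three nearest neighbors of the query point carry the label `o`, so the vote is unanimous here; with mixed neighbors the most frequent label wins.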

Several metrics can be used to measure the distance between items. Scikit-learn accepts any Python function as a metric, but uses the _Minkowski_ metric by default. Some metrics commonly used with kNN:

- _Minkowski distance_: $d_{st} = \sqrt[p]{\sum_{j=1}^n |x_{sj} - y_{tj}|^p}$  (_Note_: p is a hyperparameter exposed by Scikit-learn)
- _Euclidean distance_ (Minkowski with p = 2): $d(\textbf{x},\textbf{y}) = \sqrt{\sum_{i=1}^n (y_i - x_i)^2}$
- _Manhattan (city block) distance_ (Minkowski with p = 1): $d_{st} = \sum_{j=1}^n |x_{sj} - y_{tj}|$
- _Mahalanobis distance_: $d(\textbf{x},\textbf{y}) = \sqrt{(\textbf{x} - \textbf{y})^T S^{-1} (\textbf{x} - \textbf{y})}$, where $S$ is the covariance matrix of the sample; when $S$ is diagonal this reduces to $\sqrt{\sum_{i=1}^n \frac{(x_i - y_i)^2}{s_i^2}}$, where $s_i$ is the standard deviation of feature $i$ in the sample
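To make the formulas concrete, here is a minimal sketch of these distances in plain Python. The function names are chosen for this example only; the last function covers just the diagonal-covariance case of the Mahalanobis distance (each squared difference divided by the feature's variance).

```python
import math

def minkowski_distance(x, y, p=2):
    """Minkowski distance of order p between two equal-length vectors."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1 / p)

def manhattan_distance(x, y):
    """Minkowski distance with p = 1."""
    return minkowski_distance(x, y, p=1)

def euclidean_distance(x, y):
    """Minkowski distance with p = 2."""
    return minkowski_distance(x, y, p=2)

def standardized_euclidean_distance(x, y, std):
    """Differences scaled by each feature's standard deviation (diagonal covariance)."""
    return math.sqrt(sum(((a - b) / s) ** 2 for a, b, s in zip(x, y, std)))

x, y = (0.0, 0.0), (3.0, 4.0)
print(manhattan_distance(x, y))     # 7.0
print(euclidean_distance(x, y))     # 5.0
print(minkowski_distance(x, y, p=3))
print(standardized_euclidean_distance(x, y, std=(1.0, 2.0)))
```

Raising p gives more weight to the largest coordinate difference, which is why the p = 3 distance falls between the largest single difference (4) and the Euclidean distance (5).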

### [2. Decision tree classifier](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html#sklearn.tree.DecisionTreeClassifier)

class `sklearn.tree.DecisionTreeClassifier`<i>(criterion='gini', splitter='best', max_depth=None, min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features=None, random_state=None, max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, class_weight=None, presort='deprecated', ccp_alpha=0.0)</i>

A decision tree is a flowchart-like tree structure in which an internal node represents a feature, a branch represents a decision rule, and each leaf represents an outcome, i.e. a classification. The Decision tree algorithm selects the best feature using an ASM (_Attribute Selection Measure_) metric, turns that feature into a decision node, and splits the dataset into subsets. The process is repeated recursively until the tree contains only decision nodes and leaf nodes. The deeper the tree, the more complex the decision rules and the more closely the model fits the training data; this raises training accuracy, but a very deep tree can overfit and generalize poorly.

<div style="text-align:center"><img style="width: 600px" src="./Images/dt_diagram.png"><br>The structure of a decision tree.<br>source: <a href="https://www.kdnuggets.com/2020/01/decision-tree-algorithm-explained.html">KDnuggets (Decision Tree Algorithm, Explained)</a></div>

<br>To measure the quality of a split, Scikit-learn uses two ASM metrics:

- _Gini impurity_ (how often a randomly chosen element would be mislabeled if it were labeled according to the distribution of labels in the subset): <br>$Gini(p) = 1 - \sum_{j=1}^c p_j^2$ <br>
- _entropy_ (similar to Gini impurity, but computationally more expensive because of the logarithm): <br>$H(p) = - \sum_{j=1}^c p_j \log p_j$

(where c is the number of classes (labels) and $p_j$ is the proportion of items in the subset labeled with class j, with $j \in \{1, 2, ..., c\}$).
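Both measures can be computed directly from the labels of a subset. The sketch below is an illustration with hypothetical helper names; it assumes a base-2 logarithm for the entropy, so the result is expressed in bits.

```python
import math
from collections import Counter

def class_proportions(labels):
    """Proportion p_j of each class among the given labels."""
    n = len(labels)
    return [count / n for count in Counter(labels).values()]

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    return 1.0 - sum(p ** 2 for p in class_proportions(labels))

def entropy(labels):
    """Shannon entropy in bits (base-2 logarithm assumed)."""
    return -sum(p * math.log2(p) for p in class_proportions(labels))

# a subset in which the four rooms are equally represented (hypothetical labels)
mixed = [1, 1, 2, 2, 3, 3, 4, 4]
# a pure subset: every item belongs to the same room
pure = [1, 1, 1, 1]

print(gini(mixed), entropy(mixed))  # 0.75 2.0
print(gini(pure), entropy(pure))
```

A pure subset scores zero on both measures, and a split is considered good when it produces children with lower impurity than the parent node.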

### [3. Multilayer perceptron (MLP) classifier](https://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPClassifier.html#sklearn.neural_network.MLPClassifier)

class `sklearn.neural_network.MLPClassifier`<i>(hidden_layer_sizes=(100, ), activation='relu', solver='adam', alpha=0.0001, batch_size='auto', learning_rate='constant', learning_rate_init=0.001, power_t=0.5, max_iter=200, shuffle=True, random_state=None, tol=0.0001, verbose=False, warm_start=False, momentum=0.9, nesterovs_momentum=True, early_stopping=False, validation_fraction=0.1, beta_1=0.9, beta_2=0.999, epsilon=1e-08, n_iter_no_change=10, max_fun=15000)</i>

_Perceptrons_ are a class of classifiers used in supervised learning, modeled mathematically on a biological neuron. In particular, _multilayer perceptrons (MLP)_ form neural networks with several layers of perceptrons: an input layer, one or more intermediate (hidden) layers, and an output layer.

<div style="text-align:center"><img style="width: 500px" src="./Images/mlp_diagram.png"><br>A multilayer perceptron, illustrated.<br>source: <a href="https://github.com/ledell/sldm4-h2o/blob/master/sldm4_h2o_oct2016.pdf">GitHub (ledell/sldm4-h2o)</a></div>

<br>In a neural network, an _activation function_ defines the output of a perceptron for a given set of inputs. In its simplest form, the function returns a binary result (a step function with output 0 or 1): by analogy with the biological neuron, whether or not an electrical impulse travels along its axon. In modern neural networks with several layers of perceptrons, activation functions are usually continuous and non-linear. Scikit-learn's MLP classifier supports one linear activation function (the identity) and three non-linear ones:
- _identity function_: $f(x) = x$
- _logistic sigmoid_: $f(x) = \frac{1}{1 + \exp(-x)}$
- _hyperbolic tangent_: $f(x) = \tanh(x) = \frac{\sinh(x)}{\cosh(x)} = \frac{e^x - e^{-x}}{e^x + e^{-x}}$
- _Rectified Linear Unit (ReLU)_: $f(x) = \max(0, x) = \begin{cases} 0 & \text{if } x \leq 0 \\ x & \text{if } x > 0 \end{cases}$
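The four activation functions listed above are short enough to write out in plain Python. This is only a sketch for scalar inputs; the function names mirror the values accepted by the `activation` parameter, but the code is not taken from Scikit-learn.

```python
import math

def identity(x):
    """Linear activation: returns the input unchanged."""
    return x

def logistic(x):
    """Logistic sigmoid: squashes the input into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def tanh(x):
    """Hyperbolic tangent: squashes the input into (-1, 1)."""
    return math.tanh(x)

def relu(x):
    """Rectified Linear Unit: zero for negative inputs, identity otherwise."""
    return max(0.0, x)

for value in (-2.0, 0.0, 2.0):
    print(value, identity(value), logistic(value), tanh(value), relu(value))
```

Only the identity is linear; stacking layers that use it collapses the network into a single linear model, which is why hidden layers normally use one of the other three.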

The MLP classifier in Scikit-learn also offers several weight optimization algorithms (solvers): _LBFGS_ (a quasi-Newton algorithm), _SGD_ (stochastic gradient descent) and _Adam_ (an algorithm derived from SGD, created by Diederik P. Kingma and Jimmy Lei Ba).

### [4. Gaussian Naïve Bayes classifier](https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.GaussianNB.html#sklearn.naive_bayes.GaussianNB)

class `sklearn.naive_bayes.GaussianNB`<i>(priors=None, var_smoothing=1e-09)</i>

The _Gaussian Naïve Bayes_ classification algorithm belongs to the family of _Naïve Bayes_ classifiers, which assume that, within a class, the value of one feature is unaffected by the values of the other features; in short, the features contribute independently to the probability of belonging to a class. In particular, _Gaussian Naïve Bayes_ assumes that the likelihood of each feature follows the probability density function (PDF) of a normal (Gaussian) distribution:
$$\large P(x_i | y) = \frac{1}{\sqrt{2 \pi \sigma_y^2}}\exp\bigg(-\frac{(x_i - \mu_y)^2}{2 \sigma_y^2}\bigg),$$
where the parameters $\sigma_y$ and $\mu_y$, the standard deviation and the mean of feature $x_i$ within class $y$, are determined by maximum likelihood estimation (MLE), a method for estimating the parameters of a PDF by maximizing a likelihood function (a measure of how well a sample fits a statistical model).
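A minimal sketch of the idea, under stated assumptions: the helper names and the toy signal strengths are hypothetical, the density is evaluated in log form to avoid numerical underflow, and there is no variance smoothing (the role played by `var_smoothing` in Scikit-learn), so every feature must have non-zero variance within each class.

```python
import math
from statistics import fmean, pvariance

def gaussian_log_pdf(x, mean, variance):
    """Natural logarithm of the normal PDF given above."""
    return -0.5 * math.log(2 * math.pi * variance) - (x - mean) ** 2 / (2 * variance)

def fit_gaussian_nb(X, y):
    """Estimate, for each class, the prior and the MLE mean/variance of every feature."""
    params = {}
    for label in sorted(set(y)):
        rows = [row for row, target in zip(X, y) if target == label]
        prior = len(rows) / len(X)
        # pvariance is the population variance, i.e. the maximum likelihood estimate
        stats = [(fmean(column), pvariance(column)) for column in zip(*rows)]
        params[label] = (prior, stats)
    return params

def predict_gaussian_nb(params, x):
    """Return the class with the highest log-posterior for the feature vector x."""
    def log_posterior(label):
        prior, stats = params[label]
        # independence assumption: per-feature log-likelihoods simply add up
        return math.log(prior) + sum(
            gaussian_log_pdf(value, mean, variance)
            for value, (mean, variance) in zip(x, stats)
        )
    return max(params, key=log_posterior)

# toy signal strengths in dBm for two rooms (hypothetical values)
X_toy = [(-64.0, -56.0), (-68.0, -57.0), (-63.0, -60.0),
         (-40.0, -70.0), (-42.0, -72.0), (-38.0, -69.0)]
y_toy = [1, 1, 1, 2, 2, 2]

params = fit_gaussian_nb(X_toy, y_toy)
print(predict_gaussian_nb(params, (-65.0, -58.0)))
print(predict_gaussian_nb(params, (-41.0, -71.0)))
```

Summing log-likelihoods is equivalent to multiplying the per-feature probabilities, which is exactly where the independence assumption enters the model.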

### [5. Random Forest classifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html#sklearn.ensemble.RandomForestClassifier)

class `sklearn.ensemble.RandomForestClassifier`<i>(n_estimators=100, criterion='gini', max_depth=None, min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features='auto', max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, bootstrap=True, oob_score=False, n_jobs=None, random_state=None, verbose=0, warm_start=False, class_weight=None, ccp_alpha=0.0, max_samples=None)</i>

A _Random forest_ classifier combines the hypotheses of several randomized decision trees (_random trees_), each grown using _random splits_. A random forest is obtained by building one random tree per training subset (with `bootstrap=True`, a bootstrap sample of the training set). These trees work as an ensemble: every model in the ensemble is applied to each input, and the final result is obtained by aggregating the individual results through voting. A random forest is therefore a _meta-estimator_: one prediction is obtained from many predictions.
<div style="text-align:center"><img style="width: 400px" src="./Images/rf_diagram.png"><br>A random forest model making a prediction; the vote yields the result 1.<br>source: <a href="https://towardsdatascience.com/understanding-random-forest-58381e0602d2">Medium (Towards Data Science: Understanding Random Forest)</a></div>
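The two ingredients of the ensemble, drawing a training subset for each tree and aggregating the trees' answers, can be sketched as follows. The helper names and the list of votes are hypothetical. Note also that Scikit-learn's `RandomForestClassifier` averages the trees' predicted class probabilities rather than counting hard votes, so this shows the classic voting scheme, not the library's exact behavior.

```python
import random
from collections import Counter

def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement: the training subset of one tree."""
    return [rng.choice(rows) for _ in rows]

def majority_vote(predictions):
    """Return the class predicted by the largest number of trees."""
    return Counter(predictions).most_common(1)[0][0]

rng = random.Random(42)
rows = list(range(10))                 # stand-in for the row indices of a training set
sample = bootstrap_sample(rows, rng)   # some rows repeat, others are left out
print(sample)

# predictions of nine trees for a single input (hypothetical votes)
tree_predictions = [1, 0, 1, 1, 0, 1, 1, 1, 0]
forest_prediction = majority_vote(tree_predictions)
print(forest_prediction)  # 1
```

Because each tree sees a different sample, their individual errors are only partly correlated, and aggregating them tends to reduce the variance of a single deep tree.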

<br>As with the _Decision Tree classifier_, Scikit-learn uses two metrics to measure the quality of a split:

- _Gini impurity_: $Gini(p) = 1 - \sum_{j=1}^c p_j^2$ <br>
- _entropy_: $H(p) = - \sum_{j=1}^c p_j \log p_j$

(where c is the number of classes (labels) and $p_j$ is the proportion of items in the subset labeled with class j, with $j \in \{1, 2, ..., c\}$).

# <u>1. Wi-Fi localization</u>

Sîrbu Matei-Dan, _group 10LF383_

<i>Dataset source:</i> http://archive.ics.uci.edu/ml/datasets/Wireless+Indoor+Localization

<i>Dataset description:</i> [DOI 10.1007/978-981-10-3322-3_27 via ResearchGate](Docs/chp_10.1007_978-981-10-3322-3_27.pdf)

<i>Synopsis:</i> The _Wireless Indoor Localization_ dataset contains 2000 measurements of the signal strength (in dBm) received from the routers of an office in Pittsburgh. The office has seven routers and four rooms; using a smartphone, a user records once per second the strength of the signals coming from the seven routers, and each record is labeled with the room the user was in at the time of the measurement (1, 2, 3 or 4).

The figure below shows a sample from the dataset: <br><br>
![Sample](./Images/wifi_localization_sample.png)

In what follows, the Class column (the room) is denoted by y, and the columns WS1 - WS7 (features: the signal strength from each router) by X.

In [1]:
import numpy as np
import pandas as pd
from IPython.display import display, HTML

In [2]:
header = ['WS1', 'WS2', 'WS3', 'WS4', 'WS5', 'WS6', 'WS7', 'Class']
data_wifi = pd.read_csv("./Datasets/wifi_localization.txt", names=header, sep='\t')
display(HTML("<i>Dataset overview:</i>"))
display(data_wifi)
X = data_wifi.values[:, :7]
y = data_wifi.values[:, -1]
folds = 5

Unnamed: 0,WS1,WS2,WS3,WS4,WS5,WS6,WS7,Class
0,-64,-56,-61,-66,-71,-82,-81,1
1,-68,-57,-61,-65,-71,-85,-85,1
2,-63,-60,-60,-67,-76,-85,-84,1
3,-61,-60,-68,-62,-77,-90,-80,1
4,-63,-65,-60,-63,-77,-81,-87,1
...,...,...,...,...,...,...,...,...
1995,-59,-59,-48,-66,-50,-86,-94,4
1996,-59,-56,-50,-62,-47,-87,-90,4
1997,-62,-59,-46,-65,-45,-87,-88,4
1998,-62,-58,-52,-61,-41,-90,-85,4


# Testing the classification algorithms on the dataset with Scikit-learn

In [3]:
# just a pretty printing function, don't mind me...
def print_stats_cv(model_cv_stats):
    print(f"Test accuracy for each fold: {model_cv_stats['test_accuracy']} \n=> Average test accuracy: {round(model_cv_stats['test_accuracy'].mean() * 100, 3)}%")
    print(f"Train accuracy for each fold: {model_cv_stats['train_accuracy']} \n=> Average train accuracy: {round(model_cv_stats['train_accuracy'].mean() * 100, 3)}%")
    print(f"Test F1 score for each fold: {model_cv_stats['test_f1_macro']} \n=> Average test F1 score: {round(model_cv_stats['test_f1_macro'].mean() * 100, 3)}%")
    print(f"Train F1 score for each fold: {model_cv_stats['train_f1_macro']} \n=> Average train F1 score: {round(model_cv_stats['train_f1_macro'].mean() * 100, 3)}%")

### 1. <i>k</i>-nearest neighbors classifier

In [4]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_validate
from sklearn.preprocessing import MinMaxScaler

# hyperparameters
knn_neighbors = 4
knn_minkowski_p = 3

# scale the data
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)

# kNN implementation
model = KNeighborsClassifier(n_neighbors=knn_neighbors, p=knn_minkowski_p)
model_cv_stats = cross_validate(model, X_scaled, y, cv=folds, scoring=('accuracy', 'f1_macro'), return_train_score=True)

# statistics
display(HTML(f"<h4>{folds}-fold cross validation for {knn_neighbors}-nearest neighbors classification:</h4>"))
print_stats_cv(model_cv_stats)

Test accuracy for each fold: [0.9675 0.985  0.9675 0.9875 0.9775] 
=> Average test accuracy: 97.7%
Train accuracy for each fold: [0.99     0.99     0.9925   0.99     0.990625] 
=> Average train accuracy: 99.062%
Test F1 score for each fold: [0.96747175 0.98498649 0.96707199 0.98754412 0.97761589] 
=> Average test F1 score: 97.694%
Train F1 score for each fold: [0.98999996 0.99002075 0.99250155 0.9900046  0.99063789] 
=> Average train F1 score: 99.063%


### 2. Decision Tree classifier

In [5]:
from sklearn.tree import DecisionTreeClassifier

# hyperparameters
dt_criterion = 'gini'
dt_splitter = 'best'

# Decision Tree implementation
model = DecisionTreeClassifier(criterion=dt_criterion, splitter=dt_splitter, random_state=42)
model_cv_stats = cross_validate(model, X, y, cv=folds, scoring=('accuracy', 'f1_macro'), return_train_score=True)

# statistics
display(HTML(f"<h4>{folds}-fold cross validation for Decision Trees classification:</h4>"))
print_stats_cv(model_cv_stats)

Test accuracy for each fold: [0.96   0.9325 0.9275 0.9825 0.9725] 
=> Average test accuracy: 95.5%
Train accuracy for each fold: [1. 1. 1. 1. 1.] 
=> Average train accuracy: 100.0%
Test F1 score for each fold: [0.95972496 0.93246313 0.92562893 0.98251804 0.97253289] 
=> Average test F1 score: 95.457%
Train F1 score for each fold: [1. 1. 1. 1. 1.] 
=> Average train F1 score: 100.0%


### 3. Multilayer Perceptron (MLP) classifier

In [6]:
from sklearn.neural_network import MLPClassifier

# hyperparameters
mlp_solver = 'adam'
mlp_activation = 'relu'
mlp_alpha = 1e-3
mlp_hidden_layer_sizes = (50, 50)

# MLP implementation
model = MLPClassifier(solver=mlp_solver, activation=mlp_activation, alpha=mlp_alpha, hidden_layer_sizes=mlp_hidden_layer_sizes, random_state=42)
model_cv_stats = cross_validate(model, X, y, cv=folds, scoring=('accuracy', 'f1_macro'), return_train_score=True)

# statistics
display(HTML(f"<h4>{folds}-fold cross validation for MLP classification</h4>"))
display(HTML(f"using hardcoded hyperparameters - Solver: <b>{mlp_solver}</b>, Activation function: <b>{mlp_activation}</b>, Parameter for regularization (α): <b>{mlp_alpha}</b>, Hidden layer sizes: <b>{mlp_hidden_layer_sizes}</b>"))
print_stats_cv(model_cv_stats)

Test accuracy for each fold: [0.9725 0.9975 0.95   0.975  0.9775] 
=> Average test accuracy: 97.45%
Train accuracy for each fold: [0.98125  0.971875 0.983125 0.98125  0.981875] 
=> Average train accuracy: 97.988%
Test F1 score for each fold: [0.97244414 0.99749994 0.94983905 0.97514624 0.97746665] 
=> Average test F1 score: 97.448%
Train F1 score for each fold: [0.98124085 0.97171261 0.98314044 0.98122406 0.9819253 ] 
=> Average train F1 score: 97.985%


### 4. Gaussian Naïve Bayes classifier

In [7]:
from sklearn.naive_bayes import GaussianNB

# GNB implementation
model = GaussianNB()
model_cv_stats = cross_validate(model, X, y, cv=folds, scoring=('accuracy', 'f1_macro'), return_train_score=True)

# statistics
display(HTML(f"<h4>{folds}-fold cross validation for Gaussian NB classification</h4>"))
print_stats_cv(model_cv_stats)

Test accuracy for each fold: [0.99   0.9725 0.98   0.98   0.985 ] 
=> Average test accuracy: 98.15%
Train accuracy for each fold: [0.98125  0.983125 0.9875   0.985    0.983125] 
=> Average train accuracy: 98.4%
Test F1 score for each fold: [0.99001188 0.97255265 0.9799862  0.98006098 0.9849985 ] 
=> Average test F1 score: 98.152%
Train F1 score for each fold: [0.98127765 0.98313802 0.98751983 0.98501642 0.9831722 ] 
=> Average train F1 score: 98.402%


### 5. Random Forest classifier

In [8]:
from sklearn.ensemble import RandomForestClassifier

# hyperparameters
rfc_n_estimators = 150
rfc_criterion = 'gini'

# Random Forest implementation
model = RandomForestClassifier(n_estimators=rfc_n_estimators, criterion=rfc_criterion)
model_cv_stats = cross_validate(model, X, y, cv=folds, scoring=('accuracy', 'f1_macro'), return_train_score=True)

display(HTML(f"<h4>{folds}-fold cross validation for Random Forest classification</h4>"))
print_stats_cv(model_cv_stats)

Test accuracy for each fold: [0.9825 0.9525 0.9775 0.985  0.9875] 
=> Average test accuracy: 97.7%
Train accuracy for each fold: [1. 1. 1. 1. 1.] 
=> Average train accuracy: 100.0%
Test F1 score for each fold: [0.98244899 0.95206741 0.97743338 0.9850456  0.98749969] 
=> Average test F1 score: 97.69%
Train F1 score for each fold: [1. 1. 1. 1. 1.] 
=> Average train F1 score: 100.0%


# Nested Cross Validation for hyperparameter optimization

In [9]:
from sklearn.model_selection import KFold
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RandomizedSearchCV
from sklearn.metrics import accuracy_score

# CVs configuration
inner_cv = KFold(n_splits=4, shuffle=True)
outer_cv = KFold(n_splits=5, shuffle=True)

# outer CV folds:
print("5-fold cross validation: split overview")
splits = outer_cv.split(range(data_wifi.index.size))
subsets = pd.DataFrame(columns=['Fold', 'Train row indices', 'Test row indices'])
for i, split_data in enumerate(splits):
    subsets.loc[i]=[i + 1, split_data[0], split_data[1]]
display(subsets)

5-fold cross validation: split overview


Unnamed: 0,Fold,Train row indices,Test row indices
0,1,"[0, 1, 3, 4, 6, 7, 8, 9, 10, 11, 12, 13, 14, 1...","[2, 5, 16, 20, 21, 32, 42, 50, 53, 55, 57, 62,..."
1,2,"[0, 1, 2, 3, 4, 5, 7, 8, 9, 11, 12, 13, 14, 16...","[6, 10, 15, 18, 30, 40, 45, 48, 49, 56, 58, 66..."
2,3,"[0, 2, 3, 4, 5, 6, 8, 9, 10, 12, 13, 14, 15, 1...","[1, 7, 11, 17, 23, 24, 25, 29, 31, 34, 35, 37,..."
3,4,"[1, 2, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 1...","[0, 3, 19, 38, 41, 44, 54, 64, 72, 79, 83, 90,..."
4,5,"[0, 1, 2, 3, 5, 6, 7, 10, 11, 15, 16, 17, 18, ...","[4, 8, 9, 12, 13, 14, 22, 26, 27, 28, 33, 36, ..."


### 1. <i>k</i>-nearest neighbors classifier

In [10]:
outer_cv_acc = []
param_candidates = {'n_neighbors': np.linspace(start=1, stop=30, num=30, dtype=int),
                    'p': np.linspace(start=1, stop=5, num=4, dtype=int)} 
param_search = RandomizedSearchCV(estimator=KNeighborsClassifier(), param_distributions=param_candidates, scoring='accuracy', cv=inner_cv, random_state=42)
    
for fold in range(5):
    X_train = X_scaled[subsets.loc[subsets.index[fold],'Train row indices']]
    y_train = y[subsets.loc[subsets.index[fold],'Train row indices']]
    X_test = X_scaled[subsets.loc[subsets.index[fold],'Test row indices']]
    y_test = y[subsets.loc[subsets.index[fold],'Test row indices']]
    
    param_search.fit(X_train, y_train)
    y_estimated = param_search.predict(X_test)
    accuracy = accuracy_score(y_test, y_estimated)
    outer_cv_acc.append(accuracy)
    
    print(f'Outer fold {fold+1}, optimal hyperparameters after inner 4-fold CV: {param_search.best_params_}')
    print(f'kNN model accuracy with optimal hyperparameters: {accuracy}')
    
print(f'\nAverage model accuracy: {round(np.mean(outer_cv_acc) * 100, 2)}%')

Outer fold 1, optimal hyperparameters after inner 4-fold CV: {'p': 3, 'n_neighbors': 7}
kNN model accuracy with optimal hyperparameters: 0.98
Outer fold 2, optimal hyperparameters after inner 4-fold CV: {'p': 3, 'n_neighbors': 3}
kNN model accuracy with optimal hyperparameters: 0.985
Outer fold 3, optimal hyperparameters after inner 4-fold CV: {'p': 3, 'n_neighbors': 3}
kNN model accuracy with optimal hyperparameters: 0.985
Outer fold 4, optimal hyperparameters after inner 4-fold CV: {'p': 1, 'n_neighbors': 11}
kNN model accuracy with optimal hyperparameters: 0.985
Outer fold 5, optimal hyperparameters after inner 4-fold CV: {'p': 1, 'n_neighbors': 12}
kNN model accuracy with optimal hyperparameters: 0.985

Average model accuracy: 98.4%


### 2. Decision Tree classifier

In [11]:
outer_cv_acc = []
param_candidates = {'criterion': ['gini', 'entropy'],
                    'splitter': ['best', 'random']} 
param_search = GridSearchCV(estimator=DecisionTreeClassifier(), param_grid=param_candidates, scoring='accuracy', cv=inner_cv)

for fold in range(5):
    X_train = X[subsets.loc[subsets.index[fold],'Train row indices']]
    y_train = y[subsets.loc[subsets.index[fold],'Train row indices']]
    X_test = X[subsets.loc[subsets.index[fold],'Test row indices']]
    y_test = y[subsets.loc[subsets.index[fold],'Test row indices']]
    
    param_search.fit(X_train, y_train)
    y_estimated = param_search.predict(X_test)
    accuracy = accuracy_score(y_test, y_estimated)
    outer_cv_acc.append(accuracy)
    
    print(f'Outer fold {fold+1}, optimal hyperparameters after inner 4-fold CV: {param_search.best_params_}')
    print(f'Decision Tree model accuracy with optimal hyperparameters: {accuracy}')
    
print(f'\nAverage model accuracy: {round(np.mean(outer_cv_acc) * 100, 2)}%')

Outer fold 1, optimal hyperparameters after inner 4-fold CV: {'criterion': 'entropy', 'splitter': 'best'}
Decision Tree model accuracy with optimal hyperparameters: 0.975
Outer fold 2, optimal hyperparameters after inner 4-fold CV: {'criterion': 'gini', 'splitter': 'best'}
Decision Tree model accuracy with optimal hyperparameters: 0.9525
Outer fold 3, optimal hyperparameters after inner 4-fold CV: {'criterion': 'entropy', 'splitter': 'best'}
Decision Tree model accuracy with optimal hyperparameters: 0.96
Outer fold 4, optimal hyperparameters after inner 4-fold CV: {'criterion': 'entropy', 'splitter': 'best'}
Decision Tree model accuracy with optimal hyperparameters: 0.9825
Outer fold 5, optimal hyperparameters after inner 4-fold CV: {'criterion': 'entropy', 'splitter': 'best'}
Decision Tree model accuracy with optimal hyperparameters: 0.9675

Average model accuracy: 96.75%


### 3. Multilayer Perceptron (MLP) classifier

In [12]:
outer_cv_acc = []
param_candidates = {'alpha': np.linspace(start=0, stop=1e-1, num=500),
                    'activation': ['identity', 'logistic', 'tanh', 'relu']}
param_search = RandomizedSearchCV(estimator=MLPClassifier(max_iter=1000), param_distributions=param_candidates, scoring='accuracy', cv=inner_cv)

for fold in range(5):
    X_train = X[subsets.loc[subsets.index[fold],'Train row indices']]
    y_train = y[subsets.loc[subsets.index[fold],'Train row indices']]
    X_test = X[subsets.loc[subsets.index[fold],'Test row indices']]
    y_test = y[subsets.loc[subsets.index[fold],'Test row indices']]
    
    param_search.fit(X_train, y_train)
    y_estimated = param_search.predict(X_test)
    accuracy = accuracy_score(y_test, y_estimated)
    outer_cv_acc.append(accuracy)
     
    print(f'Outer fold {fold+1}, optimal hyperparameters after inner 4-fold CV: {param_search.best_params_}')
    print(f'MLP model accuracy with optimal hyperparameters: {accuracy}')
     
print(f'\nAverage model accuracy: {round(np.mean(outer_cv_acc) * 100, 2)}%')

Outer fold 1, optimal hyperparameters after inner 4-fold CV: {'alpha': 0.001002004008016032, 'activation': 'logistic'}
MLP model accuracy with optimal hyperparameters: 0.975
Outer fold 2, optimal hyperparameters after inner 4-fold CV: {'alpha': 0.08857715430861723, 'activation': 'logistic'}
MLP model accuracy with optimal hyperparameters: 0.97
Outer fold 3, optimal hyperparameters after inner 4-fold CV: {'alpha': 0.08336673346693386, 'activation': 'logistic'}
MLP model accuracy with optimal hyperparameters: 0.975
Outer fold 4, optimal hyperparameters after inner 4-fold CV: {'alpha': 0.021042084168336674, 'activation': 'tanh'}
MLP model accuracy with optimal hyperparameters: 0.9925
Outer fold 5, optimal hyperparameters after inner 4-fold CV: {'alpha': 0.06472945891783567, 'activation': 'tanh'}
MLP model accuracy with optimal hyperparameters: 0.9675

Average model accuracy: 97.6%


### 4. Gaussian Naïve Bayes classifier

In [13]:
outer_cv_acc = []
param_candidates = {'var_smoothing': np.linspace(start=1e-9, stop=1e-2, num=500)} 
param_search = RandomizedSearchCV(estimator=GaussianNB(), param_distributions=param_candidates, scoring='accuracy', cv=inner_cv)

for fold in range(5):
    X_train = X[subsets.loc[subsets.index[fold],'Train row indices']]
    y_train = y[subsets.loc[subsets.index[fold],'Train row indices']]
    X_test = X[subsets.loc[subsets.index[fold],'Test row indices']]
    y_test = y[subsets.loc[subsets.index[fold],'Test row indices']]
    
    param_search.fit(X_train, y_train)
    y_estimated = param_search.predict(X_test)
    accuracy = accuracy_score(y_test, y_estimated)
    outer_cv_acc.append(accuracy)
    
    print(f'Outer fold {fold+1}, optimal hyperparameters after inner 4-fold CV: {param_search.best_params_}')
    print(f'Gaussian Naïve Bayes model accuracy with optimal hyperparameters: {accuracy}')
    
print(f'\nAverage model accuracy: {round(np.mean(outer_cv_acc) * 100, 2)}%')

Outer fold 1, optimal hyperparameters after inner 4-fold CV: {'var_smoothing': 0.00274549170741483}
Gaussian Naïve Bayes model accuracy with optimal hyperparameters: 0.985
Outer fold 2, optimal hyperparameters after inner 4-fold CV: {'var_smoothing': 0.002404810378757515}
Gaussian Naïve Bayes model accuracy with optimal hyperparameters: 0.9825
Outer fold 3, optimal hyperparameters after inner 4-fold CV: {'var_smoothing': 0.0012024056893787576}
Gaussian Naïve Bayes model accuracy with optimal hyperparameters: 0.985
Outer fold 4, optimal hyperparameters after inner 4-fold CV: {'var_smoothing': 2.0041078156312626e-05}
Gaussian Naïve Bayes model accuracy with optimal hyperparameters: 0.98
Outer fold 5, optimal hyperparameters after inner 4-fold CV: {'var_smoothing': 0.002725451629258517}
Gaussian Naïve Bayes model accuracy with optimal hyperparameters: 0.9875

Average model accuracy: 98.4%


### 5. Random Forest classifier

In [14]:
outer_cv_acc = []
param_candidates = {'criterion': ['gini', 'entropy'],
                   'n_estimators': np.linspace(start=1, stop=500, num=500, dtype=int)} 
param_search = RandomizedSearchCV(estimator=RandomForestClassifier(), param_distributions=param_candidates, scoring='accuracy', cv=inner_cv)

for fold in range(5):
    X_train = X[subsets.loc[subsets.index[fold],'Train row indices']]
    y_train = y[subsets.loc[subsets.index[fold],'Train row indices']]
    X_test = X[subsets.loc[subsets.index[fold],'Test row indices']]
    y_test = y[subsets.loc[subsets.index[fold],'Test row indices']]
    
    param_search.fit(X_train, y_train)
    y_estimated = param_search.predict(X_test)
    accuracy = accuracy_score(y_test, y_estimated)
    outer_cv_acc.append(accuracy)
    
    print(f'Outer fold {fold+1}, optimal hyperparameters after inner 4-fold CV: {param_search.best_params_}')
    print(f'Random Forest Classifier model accuracy with optimal hyperparameters: {accuracy}')
    
print(f'\nAverage model accuracy: {round(np.mean(outer_cv_acc) * 100, 2)}%')

Outer fold 1, optimal hyperparameters after inner 4-fold CV: {'n_estimators': 404, 'criterion': 'entropy'}
Random Forest Classifier model accuracy with optimal hyperparameters: 0.9875
Outer fold 2, optimal hyperparameters after inner 4-fold CV: {'n_estimators': 386, 'criterion': 'entropy'}
Random Forest Classifier model accuracy with optimal hyperparameters: 0.9675
Outer fold 3, optimal hyperparameters after inner 4-fold CV: {'n_estimators': 246, 'criterion': 'gini'}
Random Forest Classifier model accuracy with optimal hyperparameters: 0.9825
Outer fold 4, optimal hyperparameters after inner 4-fold CV: {'n_estimators': 52, 'criterion': 'gini'}
Random Forest Classifier model accuracy with optimal hyperparameters: 0.99
Outer fold 5, optimal hyperparameters after inner 4-fold CV: {'n_estimators': 333, 'criterion': 'entropy'}
Random Forest Classifier model accuracy with optimal hyperparameters: 0.985

Average model accuracy: 98.25%


# <u>2. Echocardiogram</u>

Sîrbu Matei-Dan, _group 10LF383_

<i>Dataset source:</i> http://archive.ics.uci.edu/ml/datasets/Echocardiogram

<i>Synopsis:</i> The _Echocardiogram_ dataset contains 132 records of data obtained from the echocardiograms of patients who have suffered a heart attack in the past. The data is used to predict whether a patient will survive at least one year after a heart attack.

In what follows, the **alive-at-1** attribute is denoted by y, and the columns **survival, still-alive, age-at-heart-attack, pericardial-effusion, fractional-shortening, epss, lvdd, wall-motion-score, wall-motion-index** by X; the columns **mult**, **name** and **group** are dropped, as the attribute info below recommends.

_Verbose attribute info:_
1. survival -- the number of months patient survived (has survived, if patient is still alive). Because all the patients had their heart attacks at different times, it is possible that some patients have survived less than one year but they are still alive. Check the second variable to confirm this. Such patients cannot be used for the prediction task mentioned above.
2. still-alive -- a binary variable. 0=dead at end of survival period, 1 means still alive
3. age-at-heart-attack -- age in years when heart attack occurred
4. pericardial-effusion -- binary. Pericardial effusion is fluid around the heart. 0=no fluid, 1=fluid
5. fractional-shortening -- a measure of contractility around the heart. Lower numbers are increasingly abnormal.
6. epss -- E-point septal separation, another measure of contractility. Larger numbers are increasingly abnormal.
7. lvdd -- left ventricular end-diastolic dimension. This is a measure of the size of the heart at end-diastole. Large hearts tend to be sick hearts.
8. wall-motion-score -- a measure of how the segments of the left ventricle are moving
9. wall-motion-index -- equals wall-motion-score divided by number of segments seen. Usually 12-13 segments are seen in an echocardiogram. Use this variable INSTEAD of the wall motion score.
10. mult -- a derived variable which can be ignored
11. name -- the name of the patient (I have replaced them with "name")
12. group -- meaningless, ignore it
13. alive-at-1 -- Boolean-valued. Derived from the first two attributes. 0 means patient was either dead after 1 year or had been followed for less than 1 year. 1 means patient was alive at 1 year.

In [15]:
import numpy as np
import pandas as pd
from IPython.display import display, HTML

In [16]:
header = ['survival', 'still-alive', 'age-at-heart-attack', 'pericardial-effusion', 'fractional-shortening', 'epss', 'lvdd', 'wall-motion-score', 'wall-motion-index', 'mult', 'name', 'group', 'alive-at-1']
data_echo = pd.read_csv("./Datasets/echocardiogram.csv", names=header)
display(HTML("<i>Dataset overview:</i>"))
display(data_echo)
# ignoring irrelevant columns, as per verbose attribute info:
del data_echo['mult']
del data_echo['name']
del data_echo['group']
folds = 5

Unnamed: 0,survival,still-alive,age-at-heart-attack,pericardial-effusion,fractional-shortening,epss,lvdd,wall-motion-score,wall-motion-index,mult,name,group,alive-at-1
0,11,0,71,0,0.260,9,4.600,14,1,1,name,1,0
1,19,0,72,0,0.380,6,4.100,14,1.700,0.588,name,1,0
2,16,0,55,0,0.260,4,3.420,14,1,1,name,1,0
3,57,0,60,0,0.253,12.062,4.603,16,1.450,0.788,name,1,0
4,19,1,57,0,0.160,22,5.750,18,2.250,0.571,name,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
127,7.5,1,64,0,0.24,12.9,4.72,12,1,0.857,name,?,?
128,41,0,64,0,0.28,5.40,5.47,11,1.10,0.714,name,?,?
129,36,0,69,0,0.20,7.00,5.05,14.5,1.21,0.857,name,?,?
130,22,0,57,0,0.14,16.1,4.36,15,1.36,0.786,name,?,?


As the bottom of the table shows, for some patients it is not known whether they survived one year after the heart attack, which is exactly the information we want to predict after training the classification models. These echocardiograms are removed from the dataset.

In [17]:
data_echo = data_echo[(data_echo['alive-at-1'] == '0') | (data_echo['alive-at-1'] == '1')]
display(HTML("<i>Last records in sanitized dataset:</i>"))
data_echo.tail()

Unnamed: 0,survival,still-alive,age-at-heart-attack,pericardial-effusion,fractional-shortening,epss,lvdd,wall-motion-score,wall-motion-index,alive-at-1
104,1.25,1,63,0,0.3,6.9,3.52,18.16,1.51,1
105,24.0,0,59,0,0.17,14.3,5.49,13.5,1.5,0
106,25.0,0,57,0,0.228,9.7,4.29,11.0,1.0,0
108,0.75,1,78,0,0.23,40.0,6.23,14.0,1.4,1
109,3.0,1,62,0,0.26,7.6,4.42,14.0,1.0,1


In addition, some records have missing values, which we have to estimate through _missing value imputation_. An effective _imputer_ we can use is `KNNImputer` from `sklearn.impute`: it approximates the missing values of an echocardiogram by looking at its _k_ nearest neighbors, where closeness is measured on the attributes that are present. For example, if **epss** is unknown but **fractional-shortening** is known, the former can be approximated by comparing the latter with that of the neighbors, because both are measures of contractility.
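
To make the idea concrete, here is a minimal pure-Python sketch of k-nearest-neighbor imputation. It is a simplification and not scikit-learn's implementation: the helper names `partial_distance` and `knn_impute` are hypothetical, the toy records are made up, and the distance is assumed to be a Euclidean distance over the attributes present in both records, rescaled by the share of usable attributes.

```python
import math

def partial_distance(a, b):
    """Euclidean distance over the attributes present in both records,
    rescaled by (total attributes / usable attributes)."""
    pairs = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    if not pairs:
        return math.inf
    squared = sum((x - y) ** 2 for x, y in pairs)
    return math.sqrt(len(a) / len(pairs) * squared)

def knn_impute(rows, k):
    """Return a copy of rows in which each None is replaced by the mean of
    that attribute over the k nearest records that have it."""
    imputed = [list(r) for r in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            donors = [(partial_distance(row, other), other[j])
                      for m, other in enumerate(rows)
                      if m != i and other[j] is not None]
            donors.sort(key=lambda d: d[0])
            nearest = [v for _, v in donors[:k]]
            imputed[i][j] = sum(nearest) / len(nearest)
    return imputed

# made-up toy records: (fractional-shortening, epss); None marks a missing value
records = [
    (0.26, 9.0),
    (0.25, 10.0),
    (0.04, 22.0),
    (0.27, None),
]
filled = knn_impute(records, k=2)
print(filled[3])
```

The last record is closest to the first two on fractional-shortening, so its missing epss becomes the mean of their epss values rather than being pulled toward the outlying third record.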

In [18]:
data_echo = data_echo.replace(to_replace='?', value=np.nan)  # reassign instead of inplace: data_echo is a filtered slice
display(HTML("<i>Dataset extract with missing values:</i>"))
data_echo[33:39]

Unnamed: 0,survival,still-alive,age-at-heart-attack,pericardial-effusion,fractional-shortening,epss,lvdd,wall-motion-score,wall-motion-index,alive-at-1
43,46.0,0,56,0,0.33,,3.59,14.0,1.0,0
46,19.5,1,81,0,0.12,,,9.0,1.25,0
47,20.0,1,59,0,0.03,21.3,6.29,17.0,1.31,0
48,0.25,1,63,1,,,,23.0,2.3,1
50,2.0,1,56,1,0.04,14.0,5.0,,,1
51,7.0,1,61,1,0.27,,,9.0,1.5,1


In [19]:
from sklearn.impute import KNNImputer
header = ['survival', 'still-alive', 'age-at-heart-attack', 'pericardial-effusion', 'fractional-shortening', 'epss', 'lvdd', 'wall-motion-score', 'wall-motion-index', 'alive-at-1']
# each missing value is replaced by the mean of that attribute over the 5 nearest records
imputer = KNNImputer(missing_values=np.nan, n_neighbors=5)
data_echo = pd.DataFrame(data=imputer.fit_transform(data_echo), columns=header)
display(HTML("<i>Dataset extract with values imputed using KNNImputer:</i>"))
display(data_echo[33:39])
# features: the first 9 columns; label: alive-at-1
# note: X keeps 'survival' and 'still-alive', from which 'alive-at-1' is derived (see attribute 13)
X = data_echo.values[:, :9]
y = data_echo.values[:, -1]

Unnamed: 0,survival,still-alive,age-at-heart-attack,pericardial-effusion,fractional-shortening,epss,lvdd,wall-motion-score,wall-motion-index,alive-at-1
33,46.0,0.0,56.0,0.0,0.33,9.62,3.59,14.0,1.0,0.0
34,19.5,1.0,81.0,0.0,0.12,9.18,4.61,9.0,1.25,0.0
35,20.0,1.0,59.0,0.0,0.03,21.3,6.29,17.0,1.31,0.0
36,0.25,1.0,63.0,1.0,0.182,16.98,5.124,23.0,2.3,1.0
37,2.0,1.0,56.0,1.0,0.04,14.0,5.0,17.534,1.67,1.0
38,7.0,1.0,61.0,1.0,0.27,13.0,4.776,9.0,1.5,1.0


# Testing the classification algorithms on the dataset with Scikit-learn

### 1. <i>k</i>-nearest neighbors classifier

In [20]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_validate
from sklearn.preprocessing import MinMaxScaler

# hyperparameters
knn_neighbors = 4
knn_minkowski_p = 3

# scale the data
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)

# kNN model
model = KNeighborsClassifier(n_neighbors=knn_neighbors, p=knn_minkowski_p)
model_cv_stats = cross_validate(model, X_scaled, y, cv=folds, scoring=('accuracy', 'f1_macro'), return_train_score=True)

# statistics
display(HTML(f"<h4>{folds}-fold cross validation for {knn_neighbors}-nearest neighbors classification:</h4>"))
print_stats_cv(model_cv_stats)

Test accuracy for each fold: [0.93333333 0.93333333 0.93333333 0.93333333 0.85714286] 
=> Average test accuracy: 91.81%
Train accuracy for each fold: [0.98305085 0.94915254 0.98305085 0.96610169 0.95      ] 
=> Average train accuracy: 96.627%
Test F1 score for each fold: [0.92822967 0.92063492 0.92822967 0.92063492 0.78787879] 
=> Average test F1 score: 89.712%
Train F1 score for each fold: [0.98031365 0.94094094 0.98085037 0.96118421 0.94442729] 
=> Average train F1 score: 96.154%
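
The `f1_macro` scorer used above averages the per-class F1 scores with equal weight for each class, which is why it can fall below accuracy when the smaller class is predicted less well. The sketch below is a pure-Python illustration; `macro_f1` is a hypothetical helper and the labels are made up. They happen to reproduce the 0.9333 accuracy and 0.9206 F1 pair seen in several folds, which suggests, but does not prove, a similar confusion pattern in those folds.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    classes = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

# made-up fold of 15 samples: 10 of class 0, 5 of class 1, one class-1 sample missed
y_true = [0] * 10 + [1] * 5
y_pred = [0] * 10 + [0] + [1] * 4
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(round(accuracy, 4), round(macro_f1(y_true, y_pred), 4))
```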


### 2. Decision Tree classifier

In [21]:
from sklearn.tree import DecisionTreeClassifier

# hyperparameters
dt_criterion = 'gini'
dt_splitter = 'best'

# Decision Tree model
model = DecisionTreeClassifier(criterion=dt_criterion, splitter=dt_splitter, random_state=42)
model_cv_stats = cross_validate(model, X, y, cv=folds, scoring=('accuracy', 'f1_macro'), return_train_score=True)

# statistics
display(HTML(f"<h4>{folds}-fold cross validation for Decision Trees classification:</h4>"))
print_stats_cv(model_cv_stats)

Test accuracy for each fold: [1.         1.         0.93333333 0.93333333 1.        ] 
=> Average test accuracy: 97.333%
Train accuracy for each fold: [1. 1. 1. 1. 1.] 
=> Average train accuracy: 100.0%
Test F1 score for each fold: [1.         1.         0.92822967 0.92063492 1.        ] 
=> Average test F1 score: 96.977%
Train F1 score for each fold: [1. 1. 1. 1. 1.] 
=> Average train F1 score: 100.0%


### 3. Multilayer Perceptron (MLP) classifier

In [22]:
from sklearn.neural_network import MLPClassifier

# hyperparameters
mlp_solver = 'adam'
mlp_activation = 'relu'
mlp_alpha = 1e-3
mlp_hidden_layer_sizes = (50, 50)

# MLP model
model = MLPClassifier(solver=mlp_solver, activation=mlp_activation, alpha=mlp_alpha, hidden_layer_sizes=mlp_hidden_layer_sizes, random_state=42, max_iter=1000)
model_cv_stats = cross_validate(model, X, y, cv=folds, scoring=('accuracy', 'f1_macro'), return_train_score=True)

# statistics
display(HTML(f"<h4>{folds}-fold cross validation for MLP classification</h4>"))
display(HTML(f"using hardcoded hyperparameters - Solver: <b>{mlp_solver}</b>, Activation function: <b>{mlp_activation}</b>, Parameter for regularization (α): <b>{mlp_alpha}</b>, Hidden layer sizes: <b>{mlp_hidden_layer_sizes}</b>"))
print_stats_cv(model_cv_stats)

Test accuracy for each fold: [1.         1.         1.         0.93333333 1.        ] 
=> Average test accuracy: 98.667%
Train accuracy for each fold: [1. 1. 1. 1. 1.] 
=> Average train accuracy: 100.0%
Test F1 score for each fold: [1.         1.         1.         0.92063492 1.        ] 
=> Average test F1 score: 98.413%
Train F1 score for each fold: [1. 1. 1. 1. 1.] 
=> Average train F1 score: 100.0%


### 4. Gaussian Naïve Bayes classifier

In [23]:
from sklearn.naive_bayes import GaussianNB

# Gaussian NB model
model = GaussianNB()
model_cv_stats = cross_validate(model, X, y, cv=folds, scoring=('accuracy', 'f1_macro'), return_train_score=True)

# statistics
display(HTML(f"<h4>{folds}-fold cross validation for Gaussian NB classification</h4>"))
print_stats_cv(model_cv_stats)

Test accuracy for each fold: [0.93333333 0.93333333 0.86666667 0.93333333 1.        ] 
=> Average test accuracy: 93.333%
Train accuracy for each fold: [0.93220339 0.93220339 0.96610169 1.         0.91666667] 
=> Average train accuracy: 94.944%
Test F1 score for each fold: [0.92822967 0.92822967 0.86111111 0.92063492 1.        ] 
=> Average test F1 score: 92.764%
Train F1 score for each fold: [0.92435897 0.92435897 0.96217949 1.         0.90939293] 
=> Average train F1 score: 94.406%


### 5. Random Forest classifier

In [24]:
from sklearn.ensemble import RandomForestClassifier

# hyperparameters
rfc_n_estimators = 150
rfc_criterion = 'gini'

# Random Forest model
model = RandomForestClassifier(n_estimators=rfc_n_estimators, criterion=rfc_criterion)
model_cv_stats = cross_validate(model, X, y, cv=folds, scoring=('accuracy', 'f1_macro'), return_train_score=True)

display(HTML(f"<h4>{folds}-fold cross validation for Random Forest classification</h4>"))
print_stats_cv(model_cv_stats)

Test accuracy for each fold: [0.93333333 1.         1.         0.93333333 1.        ] 
=> Average test accuracy: 97.333%
Train accuracy for each fold: [1. 1. 1. 1. 1.] 
=> Average train accuracy: 100.0%
Test F1 score for each fold: [0.92822967 1.         1.         0.92063492 1.        ] 
=> Average test F1 score: 96.977%
Train F1 score for each fold: [1. 1. 1. 1. 1.] 
=> Average train F1 score: 100.0%


# Nested Cross Validation for hyperparameter optimization

In [25]:
from sklearn.model_selection import KFold
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RandomizedSearchCV
from sklearn.metrics import accuracy_score

# CVs configuration
inner_cv = KFold(n_splits=4, shuffle=True)
outer_cv = KFold(n_splits=5, shuffle=True)

# outer CV folds:
print("5-fold cross validation: split overview")
splits = outer_cv.split(range(data_echo.index.size))
subsets = pd.DataFrame(columns=['Fold', 'Train row indices', 'Test row indices'])
for i, split_data in enumerate(splits):
    subsets.loc[i]=[i + 1, split_data[0], split_data[1]]
display(subsets)

5-fold cross validation: split overview


Unnamed: 0,Fold,Train row indices,Test row indices
0,1,"[0, 1, 2, 3, 4, 5, 6, 7, 10, 11, 12, 13, 14, 1...","[8, 9, 16, 22, 29, 33, 36, 39, 45, 56, 61, 62,..."
1,2,"[1, 3, 4, 5, 6, 7, 8, 9, 11, 12, 13, 14, 15, 1...","[0, 2, 10, 23, 25, 28, 43, 49, 51, 53, 54, 58,..."
2,3,"[0, 2, 3, 4, 5, 7, 8, 9, 10, 12, 14, 16, 17, 2...","[1, 6, 11, 13, 15, 18, 19, 20, 24, 32, 35, 40,..."
3,4,"[0, 1, 2, 3, 5, 6, 8, 9, 10, 11, 13, 14, 15, 1...","[4, 7, 12, 26, 38, 41, 44, 46, 48, 50, 59, 63,..."
4,5,"[0, 1, 2, 4, 6, 7, 8, 9, 10, 11, 12, 13, 15, 1...","[3, 5, 14, 17, 21, 27, 30, 31, 34, 37, 42, 47,..."
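
The sketch below shows how the outer 5-fold split and the inner 4-fold split partition the records by size. It is pure Python with a hypothetical helper `kfold_sizes`, and it assumes 74 records: that figure is inferred from the fold scores above (for example 0.9333 = 14/15 and 0.8571 = 12/14), not stated in the text. The sizing rule, where the first `n % k` folds get one extra sample, is the one `KFold` uses; shuffling changes which rows land in a fold, not the fold sizes.

```python
def kfold_sizes(n_samples, n_splits):
    """Test-fold sizes for a k-fold split: the first n_samples % n_splits
    folds receive one extra sample."""
    base, extra = divmod(n_samples, n_splits)
    return [base + 1 if i < extra else base for i in range(n_splits)]

n_records = 74  # assumption: inferred from the fold scores, not stated in the text
outer_sizes = kfold_sizes(n_records, 5)
for outer_test in outer_sizes:
    outer_train = n_records - outer_test
    inner_validation = kfold_sizes(outer_train, 4)
    print(f"outer test={outer_test}, outer train={outer_train}, "
          f"inner validation folds={inner_validation}")
```

Each hyperparameter candidate is therefore scored on validation folds of about 15 records, so a single misclassified record moves the inner score by roughly 7 percentage points.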


### 1. <i>k</i>-nearest neighbors classifier

In [26]:
outer_cv_acc = []
# note: np.linspace(start=1, stop=5, num=4, dtype=int) yields [1, 2, 3, 5], so p=4 is never tried
param_candidates = {'n_neighbors': np.linspace(start=1, stop=30, num=30, dtype=int),
                    'p': np.linspace(start=1, stop=5, num=4, dtype=int)} 
param_search = RandomizedSearchCV(estimator=KNeighborsClassifier(), param_distributions=param_candidates, scoring='accuracy', cv=inner_cv, random_state=42)
    
for fold in range(5):
    X_train = X_scaled[subsets.loc[subsets.index[fold],'Train row indices']]
    y_train = y[subsets.loc[subsets.index[fold],'Train row indices']]
    X_test = X_scaled[subsets.loc[subsets.index[fold],'Test row indices']]
    y_test = y[subsets.loc[subsets.index[fold],'Test row indices']]
    
    param_search.fit(X_train, y_train)
    y_estimated = param_search.predict(X_test)
    accuracy = accuracy_score(y_test, y_estimated)
    outer_cv_acc.append(accuracy)
    
    print(f'Outer fold {fold+1}, optimal hyperparameters after inner 4-fold CV: {param_search.best_params_}')
    print(f'kNN model accuracy with optimal hyperparameters: {accuracy}')
    
print(f'\nAverage model accuracy: {round(np.mean(outer_cv_acc) * 100, 2)}%')

Outer fold 1, optimal hyperparameters after inner 4-fold CV: {'p': 3, 'n_neighbors': 7}
kNN model accuracy with optimal hyperparameters: 1.0
Outer fold 2, optimal hyperparameters after inner 4-fold CV: {'p': 1, 'n_neighbors': 12}
kNN model accuracy with optimal hyperparameters: 1.0
Outer fold 3, optimal hyperparameters after inner 4-fold CV: {'p': 5, 'n_neighbors': 12}
kNN model accuracy with optimal hyperparameters: 0.8666666666666667
Outer fold 4, optimal hyperparameters after inner 4-fold CV: {'p': 3, 'n_neighbors': 3}
kNN model accuracy with optimal hyperparameters: 0.8666666666666667
Outer fold 5, optimal hyperparameters after inner 4-fold CV: {'p': 5, 'n_neighbors': 12}
kNN model accuracy with optimal hyperparameters: 0.9285714285714286

Average model accuracy: 93.24%
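
One detail of the candidate grid is worth checking: building integer candidates with `np.linspace(..., dtype=int)` spaces the values evenly as floats and then drops the fractional part. The pure-Python sketch below mimics that behavior with a hypothetical helper `int_linspace` (an emulation, not NumPy itself) to show which values of `p` are actually searched.

```python
def int_linspace(start, stop, num):
    """Evenly spaced values from start to stop (inclusive), floored to int."""
    return [int(start + i * (stop - start) / (num - 1)) for i in range(num)]

p_candidates = int_linspace(1, 5, 4)     # mirrors the 'p' grid above
k_candidates = int_linspace(1, 30, 30)   # mirrors the 'n_neighbors' grid above
print(p_candidates)
print(k_candidates[:5], "...", k_candidates[-1])
```

With 30 points between 1 and 30 the step is exactly 1, so every integer appears. With 4 points between 1 and 5 the step is 4/3, so `p = 4` is skipped; `range(1, 6)` would cover all five values.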


### 2. Decision Tree classifier

In [27]:
outer_cv_acc = []
param_candidates = {'criterion': ['gini', 'entropy'],
                    'splitter': ['best', 'random']} 
param_search = GridSearchCV(estimator=DecisionTreeClassifier(), param_grid=param_candidates, scoring='accuracy', cv=inner_cv)

for fold in range(5):
    X_train = X[subsets.loc[subsets.index[fold],'Train row indices']]
    y_train = y[subsets.loc[subsets.index[fold],'Train row indices']]
    X_test = X[subsets.loc[subsets.index[fold],'Test row indices']]
    y_test = y[subsets.loc[subsets.index[fold],'Test row indices']]
    
    param_search.fit(X_train, y_train)
    y_estimated = param_search.predict(X_test)
    accuracy = accuracy_score(y_test, y_estimated)
    outer_cv_acc.append(accuracy)
    
    print(f'Outer fold {fold+1}, optimal hyperparameters after inner 4-fold CV: {param_search.best_params_}')
    print(f'Decision Tree model accuracy with optimal hyperparameters: {accuracy}')
    
print(f'\nAverage model accuracy: {round(np.mean(outer_cv_acc) * 100, 2)}%')

Outer fold 1, optimal hyperparameters after inner 4-fold CV: {'criterion': 'gini', 'splitter': 'best'}
Decision Tree model accuracy with optimal hyperparameters: 1.0
Outer fold 2, optimal hyperparameters after inner 4-fold CV: {'criterion': 'entropy', 'splitter': 'random'}
Decision Tree model accuracy with optimal hyperparameters: 1.0
Outer fold 3, optimal hyperparameters after inner 4-fold CV: {'criterion': 'gini', 'splitter': 'random'}
Decision Tree model accuracy with optimal hyperparameters: 1.0
Outer fold 4, optimal hyperparameters after inner 4-fold CV: {'criterion': 'gini', 'splitter': 'random'}
Decision Tree model accuracy with optimal hyperparameters: 0.8666666666666667
Outer fold 5, optimal hyperparameters after inner 4-fold CV: {'criterion': 'gini', 'splitter': 'best'}
Decision Tree model accuracy with optimal hyperparameters: 1.0

Average model accuracy: 97.33%


### 3. Multilayer Perceptron (MLP) classifier

In [28]:
outer_cv_acc = []
param_candidates = {'alpha': np.linspace(start=0, stop=1e-1, num=500),
                    'activation': ['identity', 'logistic', 'tanh', 'relu']}
param_search = RandomizedSearchCV(estimator=MLPClassifier(max_iter=1000), param_distributions=param_candidates, scoring='accuracy', cv=inner_cv)

for fold in range(5):
    X_train = X[subsets.loc[subsets.index[fold],'Train row indices']]
    y_train = y[subsets.loc[subsets.index[fold],'Train row indices']]
    X_test = X[subsets.loc[subsets.index[fold],'Test row indices']]
    y_test = y[subsets.loc[subsets.index[fold],'Test row indices']]
    
    param_search.fit(X_train, y_train)
    y_estimated = param_search.predict(X_test)
    accuracy = accuracy_score(y_test, y_estimated)
    outer_cv_acc.append(accuracy)
     
    print(f'Outer fold {fold+1}, optimal hyperparameters after inner 4-fold CV: {param_search.best_params_}')
    print(f'MLP model accuracy with optimal hyperparameters: {accuracy}')
     
print(f'\nAverage model accuracy: {round(np.mean(outer_cv_acc) * 100, 2)}%')

Outer fold 1, optimal hyperparameters after inner 4-fold CV: {'alpha': 0.03547094188376754, 'activation': 'tanh'}
MLP model accuracy with optimal hyperparameters: 1.0
Outer fold 2, optimal hyperparameters after inner 4-fold CV: {'alpha': 0.04208416833667335, 'activation': 'tanh'}
MLP model accuracy with optimal hyperparameters: 1.0
Outer fold 3, optimal hyperparameters after inner 4-fold CV: {'alpha': 0.09458917835671343, 'activation': 'relu'}
MLP model accuracy with optimal hyperparameters: 1.0
Outer fold 4, optimal hyperparameters after inner 4-fold CV: {'alpha': 0.06653306613226453, 'activation': 'tanh'}
MLP model accuracy with optimal hyperparameters: 0.9333333333333333
Outer fold 5, optimal hyperparameters after inner 4-fold CV: {'alpha': 0.025851703406813628, 'activation': 'relu'}
MLP model accuracy with optimal hyperparameters: 1.0

Average model accuracy: 98.67%


### 4. Gaussian Naïve Bayes classifier

In [29]:
outer_cv_acc = []
param_candidates = {'var_smoothing': np.linspace(start=1e-9, stop=1e-2, num=500)} 
param_search = RandomizedSearchCV(estimator=GaussianNB(), param_distributions=param_candidates, scoring='accuracy', cv=inner_cv)

for fold in range(5):
    X_train = X[subsets.loc[subsets.index[fold],'Train row indices']]
    y_train = y[subsets.loc[subsets.index[fold],'Train row indices']]
    X_test = X[subsets.loc[subsets.index[fold],'Test row indices']]
    y_test = y[subsets.loc[subsets.index[fold],'Test row indices']]
    
    param_search.fit(X_train, y_train)
    y_estimated = param_search.predict(X_test)
    accuracy = accuracy_score(y_test, y_estimated)
    outer_cv_acc.append(accuracy)
    
    print(f'Outer fold {fold+1}, optimal hyperparameters after inner 4-fold CV: {param_search.best_params_}')
    print(f'Gaussian Naïve Bayes model accuracy with optimal hyperparameters: {accuracy}')
    
print(f'\nAverage model accuracy: {round(np.mean(outer_cv_acc) * 100, 2)}%')

Outer fold 1, optimal hyperparameters after inner 4-fold CV: {'var_smoothing': 0.0002404819378757515}
Gaussian Naïve Bayes model accuracy with optimal hyperparameters: 1.0
Outer fold 2, optimal hyperparameters after inner 4-fold CV: {'var_smoothing': 0.00048096287575150303}
Gaussian Naïve Bayes model accuracy with optimal hyperparameters: 1.0
Outer fold 3, optimal hyperparameters after inner 4-fold CV: {'var_smoothing': 0.006332665697394791}
Gaussian Naïve Bayes model accuracy with optimal hyperparameters: 0.8666666666666667
Outer fold 4, optimal hyperparameters after inner 4-fold CV: {'var_smoothing': 0.008056112418837675}
Gaussian Naïve Bayes model accuracy with optimal hyperparameters: 0.8666666666666667
Outer fold 5, optimal hyperparameters after inner 4-fold CV: {'var_smoothing': 0.007795591402805612}
Gaussian Naïve Bayes model accuracy with optimal hyperparameters: 0.9285714285714286

Average model accuracy: 93.24%


### 5. Random Forest classifier

In [30]:
outer_cv_acc = []
param_candidates = {'criterion': ['gini', 'entropy'],
                   'n_estimators': np.linspace(start=1, stop=500, num=500, dtype=int)} 
param_search = RandomizedSearchCV(estimator=RandomForestClassifier(), param_distributions=param_candidates, scoring='accuracy', cv=inner_cv)

for fold in range(5):
    X_train = X[subsets.loc[subsets.index[fold],'Train row indices']]
    y_train = y[subsets.loc[subsets.index[fold],'Train row indices']]
    X_test = X[subsets.loc[subsets.index[fold],'Test row indices']]
    y_test = y[subsets.loc[subsets.index[fold],'Test row indices']]
    
    param_search.fit(X_train, y_train)
    y_estimated = param_search.predict(X_test)
    accuracy = accuracy_score(y_test, y_estimated)
    outer_cv_acc.append(accuracy)
    
    print(f'Outer fold {fold+1}, optimal hyperparameters after inner 4-fold CV: {param_search.best_params_}')
    print(f'Random Forest Classifier model accuracy with optimal hyperparameters: {accuracy}')
    
print(f'\nAverage model accuracy: {round(np.mean(outer_cv_acc) * 100, 2)}%')

Outer fold 1, optimal hyperparameters after inner 4-fold CV: {'n_estimators': 436, 'criterion': 'gini'}
Random Forest Classifier model accuracy with optimal hyperparameters: 1.0
Outer fold 2, optimal hyperparameters after inner 4-fold CV: {'n_estimators': 53, 'criterion': 'entropy'}
Random Forest Classifier model accuracy with optimal hyperparameters: 0.9333333333333333
Outer fold 3, optimal hyperparameters after inner 4-fold CV: {'n_estimators': 164, 'criterion': 'gini'}
Random Forest Classifier model accuracy with optimal hyperparameters: 1.0
Outer fold 4, optimal hyperparameters after inner 4-fold CV: {'n_estimators': 299, 'criterion': 'gini'}
Random Forest Classifier model accuracy with optimal hyperparameters: 0.9333333333333333
Outer fold 5, optimal hyperparameters after inner 4-fold CV: {'n_estimators': 192, 'criterion': 'entropy'}
Random Forest Classifier model accuracy with optimal hyperparameters: 1.0

Average model accuracy: 97.33%


# <u>3. Seeds</u>

Tătaru Dragoș-Cătălin, _grupa 10LF383_

_Dataset source:_ http://archive.ics.uci.edu/ml/datasets/Seeds

_Relevant paper:_ M. Charytanowicz, J. Niewczas, P. Kulczycki, P.A. Kowalski, S. Lukasik, S. Zak, 'A Complete Gradient Clustering Algorithm for Features Analysis of X-ray Images', in: Information Technologies in Biomedicine, Ewa Pietka, Jacek Kawa (eds.), Springer-Verlag, Berlin-Heidelberg, 2010, pp. 15-24.

_Short description:_ The dataset was built by scanning images of wheat kernels belonging to three different varieties. It contains 210 entries (70 per class) and 7 attributes obtained by quantifying parameters of the kernel images.

Parameters:
- Area
- Perimeter
- Compactness
- Length
- Width
- Asymmetry
- Groove
- Class

<div style="text-align:center"><img style="height: 350px" src="./Images/Seeds_Atributes.png">  <img style="height: 350px" src="./Images/Seeds_X_Ray.png"><br>(images taken from the source paper cited above.)</div>

In [31]:
import numpy as np
import pandas as pd

In [32]:
from IPython.display import display, HTML
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import cross_validate,cross_val_score, train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedKFold

## Split the data

In [33]:
header = ['Area', 'Perimeter', 'Compactness', 'Length', 'Width', 'Asymmetry', 'Groove', 'Class']
data_seeds = pd.read_csv("./Datasets/seeds_dataset.txt", names=header, sep='\t')
display(HTML("<h3><b>Seeds Dataset</b></h3>"))
X = data_seeds.values[:, :7]
y = data_seeds.values[:, -1]
display(data_seeds)

# check for missing values
if not np.isnan(data_seeds.values).any():
    print("The dataset does NOT contain missing values")
else:
    print("The dataset contains missing values")

# check for infinite values (np.isinf flags +inf and -inf)
if not np.isinf(data_seeds.values).any():
    print("The dataset does NOT contain infinite values")
else:
    print("The dataset contains infinite values")

Unnamed: 0,Area,Perimeter,Compactness,Length,Width,Asymmetry,Groove,Class
0,15.26,14.84,0.8710,5.763,3.312,2.221,5.220,1
1,14.88,14.57,0.8811,5.554,3.333,1.018,4.956,1
2,14.29,14.09,0.9050,5.291,3.337,2.699,4.825,1
3,13.84,13.94,0.8955,5.324,3.379,2.259,4.805,1
4,16.14,14.99,0.9034,5.658,3.562,1.355,5.175,1
...,...,...,...,...,...,...,...,...
205,12.19,13.20,0.8783,5.137,2.981,3.631,4.870,3
206,11.23,12.88,0.8511,5.140,2.795,4.325,5.003,3
207,13.20,13.66,0.8883,5.236,3.232,8.315,5.056,3
208,11.84,13.21,0.8521,5.175,2.836,3.598,5.044,3


The dataset does NOT contain missing values
The dataset does NOT contain infinite values


## The print function

In [34]:
def print_function(data_set: dict):
    # per-fold scores fill the rows; the scalar averages are repeated in every row
    df_print = pd.DataFrame({"Test accuracy for each fold":data_set['test_accuracy'], 
                    "Train accuracy for each fold": data_set['train_accuracy'], 
                    "Average test accuracy %": round(data_set['test_accuracy'].mean() * 100, 4),
                    "Average train accuracy %": round(data_set['train_accuracy'].mean() * 100, 4),
                    "Test F1 score for each fold": data_set['test_f1_macro'],
                    "Train F1 score for each fold": data_set['train_f1_macro'],
                    "Average test F1 score %": round(data_set['test_f1_macro'].mean() * 100, 4),
                    "Average train F1 score %":round(data_set['train_f1_macro'].mean() * 100, 4)
                   })
    display(HTML(df_print.to_html()))

# Running the algorithms with hardcoded hyperparameters

## K-Nearest Neighbors Classifier

In [35]:
# hyperparameters
knn_neighbors = 5
knn_minkowski_p = 3

# scale the data
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)

# kNN model
model = KNeighborsClassifier(n_neighbors=knn_neighbors, p=knn_minkowski_p)
model_cv_stats = cross_validate(model, X_scaled, y, cv=5, scoring=('accuracy', 'f1_macro'), return_train_score=True)

# statistics
display(HTML(f"<h4>5-fold cross validation for {knn_neighbors}-nearest neighbors classification:</h4>"))
print_function(model_cv_stats)

Unnamed: 0,Test accuracy for each fold,Train accuracy for each fold,Average test accuracy %,Average train accuracy %,Test F1 score for each fold,Train F1 score for each fold,Average test F1 score %,Average train F1 score %
0,0.97619,0.940476,92.8571,95.5952,0.97616,0.940248,92.89,95.5886
1,0.952381,0.946429,92.8571,95.5952,0.952137,0.94642,92.89,95.5886
2,0.952381,0.946429,92.8571,95.5952,0.95137,0.946319,92.89,95.5886
3,0.952381,0.958333,92.8571,95.5952,0.952381,0.958349,92.89,95.5886
4,0.809524,0.988095,92.8571,95.5952,0.812454,0.988095,92.89,95.5886
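
The kNN cell above rescales every attribute to [0, 1] and measures closeness with a Minkowski distance of order `p`. The pure-Python sketch below illustrates both steps; `min_max_scale` and `minkowski` are hypothetical helpers rather than the scikit-learn implementations, and the three rows are the Area and Perimeter values of the first three records in the table.

```python
def min_max_scale(rows):
    """Rescale each column to [0, 1] using its own minimum and maximum."""
    columns = list(zip(*rows))
    lows = [min(c) for c in columns]
    highs = [max(c) for c in columns]
    return [[(v - lo) / (hi - lo) if hi > lo else 0.0
             for v, lo, hi in zip(row, lows, highs)] for row in rows]

def minkowski(a, b, p):
    """Minkowski distance of order p (p=1 is Manhattan, p=2 is Euclidean)."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)

# (Area, Perimeter) of records 0, 1 and 2
rows = [(15.26, 14.84), (14.88, 14.57), (14.29, 14.09)]
scaled = min_max_scale(rows)
distances = {p: minkowski(scaled[0], scaled[1], p) for p in (1, 2, 3)}
for p, d in distances.items():
    print(p, round(d, 4))
```

For the same pair of points the distance shrinks as `p` grows, because larger orders let the single largest coordinate difference dominate the sum.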


## Decision Tree Classifier

In [36]:
# hyperparameters
dt_criterion = 'gini'  # default value
dt_splitter = 'best'   # choose the question (split) that reduces uncertainty the most
# Gini impurity: the amount of uncertainty at a single node, i.e. how mixed the classes in the child nodes are after the node's question

# Decision Tree model
model = DecisionTreeClassifier(criterion=dt_criterion, splitter=dt_splitter)
model_dc_stats = cross_validate(model, X, y, cv=5, scoring=('accuracy', 'f1_macro'), return_train_score=True)

# statistics
# print a few per-class means to see whether the attributes differ by class

print(data_seeds.groupby('Class')['Area'].mean())
print(data_seeds.groupby('Class')['Perimeter'].mean())
print(data_seeds.groupby('Class')['Asymmetry'].mean())

display(HTML(f"<h4>5-fold cross validation for Decision Trees classification:</h4>"))
print_function(model_dc_stats)

Class
1    14.334429
2    18.334286
3    11.873857
Name: Area, dtype: float64
Class
1    14.294286
2    16.135714
3    13.247857
Name: Perimeter, dtype: float64
Class
1    2.667403
2    3.644800
3    4.788400
Name: Asymmetry, dtype: float64


Unnamed: 0,Test accuracy for each fold,Train accuracy for each fold,Average test accuracy %,Average train accuracy %,Test F1 score for each fold,Train F1 score for each fold,Average test F1 score %,Average train F1 score %
0,0.952381,1.0,90.0,100.0,0.95137,1.0,89.8762,100.0
1,0.928571,1.0,90.0,100.0,0.927742,1.0,89.8762,100.0
2,0.857143,1.0,90.0,100.0,0.853236,1.0,89.8762,100.0
3,0.880952,1.0,90.0,100.0,0.879979,1.0,89.8762,100.0
4,0.880952,1.0,90.0,100.0,0.881481,1.0,89.8762,100.0
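
The two split criteria available to the decision tree, `gini` and `entropy`, both measure how mixed the class labels in a node are. The pure-Python sketch below computes them for a node and for a candidate split; the helper names `gini`, `entropy` and `split_impurity` and the toy labels are illustrative assumptions, not part of the notebook.

```python
import math
from collections import Counter

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def entropy(labels):
    """Shannon entropy of the class proportions, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def split_impurity(left, right, measure):
    """Size-weighted impurity of the two child nodes produced by a split."""
    n = len(left) + len(right)
    return len(left) / n * measure(left) + len(right) / n * measure(right)

parent = [1, 1, 1, 1, 2, 2, 2, 2]       # toy node with two balanced classes
good_split = ([1, 1, 1, 1], [2, 2, 2, 2])
poor_split = ([1, 1, 2, 2], [1, 1, 2, 2])

print(gini(parent), entropy(parent))
print(split_impurity(*good_split, gini), split_impurity(*poor_split, gini))
```

A split is attractive when the weighted impurity of the children is much lower than the impurity of the parent; the `best` splitter picks the candidate with the largest such reduction.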


## Random Forest Classifier

In [37]:
# hyperparameters
rfc_n_estimators = 150
rfc_criterion = 'gini'

# Random Forest model
model = RandomForestClassifier(n_estimators=rfc_n_estimators, criterion=rfc_criterion)
model_rfc_stats = cross_validate(model, X, y, cv=5, scoring=('accuracy', 'f1_macro'), return_train_score=True)

display(HTML(f"<h4>5-fold cross validation for Random Forest classification</h4>"))
print_function(model_rfc_stats) 

Unnamed: 0,Test accuracy for each fold,Train accuracy for each fold,Average test accuracy %,Average train accuracy %,Test F1 score for each fold,Train F1 score for each fold,Average test F1 score %,Average train F1 score %
0,0.904762,1.0,89.0476,100.0,0.904061,1.0,89.2476,100.0
1,0.928571,1.0,89.0476,100.0,0.927742,1.0,89.2476,100.0
2,0.97619,1.0,89.0476,100.0,0.97616,1.0,89.2476,100.0
3,0.97619,1.0,89.0476,100.0,0.97616,1.0,89.2476,100.0
4,0.666667,1.0,89.0476,100.0,0.678255,1.0,89.2476,100.0


## Multilayer Perceptron Classifier

In [38]:
# hyperparameters
mlp_solver = 'adam'
mlp_activation = 'logistic'
mlp_alpha = 1e-3
mlp_hidden_layer_sizes = (50, 50)
max_iter = 10000

# MLP model
model = MLPClassifier(solver=mlp_solver, activation=mlp_activation, alpha=mlp_alpha, hidden_layer_sizes=mlp_hidden_layer_sizes, max_iter=max_iter, random_state=0)
model_mpc_stats = cross_validate(model, X, y, cv=5, scoring=('accuracy', 'f1_macro'), return_train_score=True)
print_function(model_mpc_stats)

Unnamed: 0,Test accuracy for each fold,Train accuracy for each fold,Average test accuracy %,Average train accuracy %,Test F1 score for each fold,Train F1 score for each fold,Average test F1 score %,Average train F1 score %
0,0.97619,0.970238,91.4286,97.2619,0.97616,0.970289,91.5605,97.2596
1,0.952381,0.97619,91.4286,97.2619,0.952137,0.976136,91.5605,97.2596
2,0.952381,0.97619,91.4286,97.2619,0.952351,0.97619,91.5605,97.2596
3,0.952381,0.964286,91.4286,97.2619,0.952381,0.964286,91.5605,97.2596
4,0.738095,0.97619,91.4286,97.2619,0.744997,0.976079,91.5605,97.2596


## Gaussian Naive Bayes Classifier

In [39]:
# Gaussian NB model
model = GaussianNB()
model_gnb_stats = cross_validate(model, X, y, cv=5, scoring=('accuracy', 'f1_macro'), return_train_score=True)

# statistics
display(HTML(f"<h4>5-fold cross validation for Gaussian NB classification</h4>"))
print_function(model_gnb_stats)

Unnamed: 0,Test accuracy for each fold,Train accuracy for each fold,Average test accuracy %,Average train accuracy %,Test F1 score for each fold,Train F1 score for each fold,Average test F1 score %,Average train F1 score %
0,0.880952,0.904762,88.5714,91.1905,0.880307,0.904598,88.7466,91.1489
1,0.928571,0.89881,88.5714,91.1905,0.927742,0.898591,88.7466,91.1489
2,0.952381,0.910714,88.5714,91.1905,0.952137,0.910378,88.7466,91.1489
3,0.97619,0.904762,88.5714,91.1905,0.97616,0.903406,88.7466,91.1489
4,0.690476,0.940476,88.5714,91.1905,0.700985,0.940474,88.7466,91.1489


# Hyperparameter optimization

## Split the data

In [50]:
header = ['Area', 'Perimeter', 'Compactness', 'Length', 'Width', 'Asymmetry', 'Groove', 'Class']
data_seeds = pd.read_csv("./Datasets/seeds_dataset.txt", names=header,sep='\t')
X = data_seeds.values[:, :7]
y = data_seeds.values[:, -1]

## K-Nearest Neighbors Classifier

In [51]:
pipe = Pipeline([('scaler', MinMaxScaler()), ('knn', KNeighborsClassifier())])
parameter_grid = {'knn__n_neighbors': list(range(1, 10)), 'knn__p': list(range(1, 5))}
strat_k_fold = StratifiedKFold(n_splits=5, shuffle=True)
grid_search = GridSearchCV(pipe, param_grid=parameter_grid, scoring='accuracy', cv=4, return_train_score=True)

# outer loop: stratified 5-fold CV wrapped around the inner 4-fold grid search
scores = cross_val_score(grid_search, X, y, cv=strat_k_fold)
print("Scores from 5-fold cross validation:", scores)
print("Mean score:", scores.mean())

# refit the grid search on the full dataset to report a single best parameter set
grid_search.fit(X, y)
print("Best parameter set:", grid_search.best_params_)

cv_results = pd.DataFrame(grid_search.cv_results_)
display(HTML(cv_results.to_html()))

Scores from 5-fold cross validation: [0.9047619  0.92857143 0.88095238 0.92857143 1.        ]
Mean score: 0.9285714285714286
Best parameter set: {'knn__n_neighbors': 9, 'knn__p': 3}


Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_knn__n_neighbors,param_knn__p,params,split0_test_score,split1_test_score,split2_test_score,split3_test_score,mean_test_score,std_test_score,rank_test_score,split0_train_score,split1_train_score,split2_train_score,split3_train_score,mean_train_score,std_train_score
0,0.000737,0.00016,0.00209,0.000607,1,1,"{'knn__n_neighbors': 1, 'knn__p': 1}",0.962264,0.981132,0.903846,0.711538,0.889695,0.106732,34,1.0,1.0,1.0,1.0,1.0,0.0
1,0.000584,3.1e-05,0.001701,9e-06,1,2,"{'knn__n_neighbors': 1, 'knn__p': 2}",0.962264,0.981132,0.923077,0.75,0.904118,0.091411,27,1.0,1.0,1.0,1.0,1.0,0.0
2,0.000604,3.5e-05,0.002173,4.8e-05,1,3,"{'knn__n_neighbors': 1, 'knn__p': 3}",0.962264,0.962264,0.923077,0.769231,0.904209,0.079555,25,1.0,1.0,1.0,1.0,1.0,0.0
3,0.000577,2e-06,0.002238,0.000183,1,4,"{'knn__n_neighbors': 1, 'knn__p': 4}",0.962264,0.943396,0.923077,0.769231,0.899492,0.076472,29,1.0,1.0,1.0,1.0,1.0,0.0
4,0.000698,0.000133,0.001919,0.000128,2,1,"{'knn__n_neighbors': 2, 'knn__p': 1}",0.924528,0.962264,0.961538,0.673077,0.880352,0.120639,36,0.968153,0.974522,0.968354,0.974684,0.971428,0.003176
5,0.000799,0.000156,0.002064,0.000338,2,2,"{'knn__n_neighbors': 2, 'knn__p': 2}",0.981132,0.962264,0.903846,0.692308,0.884888,0.114779,35,0.968153,0.974522,0.981013,0.974684,0.974593,0.004547
6,0.00074,9.3e-05,0.002464,0.000128,2,3,"{'knn__n_neighbors': 2, 'knn__p': 3}",0.981132,0.943396,0.903846,0.730769,0.889786,0.095789,33,0.949045,0.974522,0.981013,0.974684,0.969816,0.012275
7,0.000671,2.3e-05,0.002636,0.000314,2,4,"{'knn__n_neighbors': 2, 'knn__p': 4}",0.962264,0.943396,0.923077,0.75,0.894684,0.084675,31,0.955414,0.980892,0.987342,0.974684,0.974583,0.011938
8,0.000725,7.5e-05,0.002126,4.3e-05,3,1,"{'knn__n_neighbors': 3, 'knn__p': 1}",0.924528,0.943396,0.961538,0.769231,0.899673,0.07644,28,0.961783,0.955414,0.974684,0.968354,0.965059,0.007198
9,0.000651,1.2e-05,0.002161,6.5e-05,3,2,"{'knn__n_neighbors': 3, 'knn__p': 2}",0.943396,0.962264,0.884615,0.769231,0.889877,0.075312,32,0.961783,0.961783,0.981013,1.0,0.976145,0.015853


## Decision Tree Classifier

In [52]:
pipe = Pipeline([('dtc', DecisionTreeClassifier())])

# Inner loop: grid search with 4-fold CV selects the hyperparameters
parameter_grid = {'dtc__criterion': ['gini', 'entropy'], 'dtc__splitter': ['best', 'random']}
grid_search = GridSearchCV(pipe, param_grid=parameter_grid, scoring='accuracy', cv=4)

# Outer loop: stratified 5-fold CV scores the whole tuning procedure
strat_k_fold = StratifiedKFold(n_splits=5, shuffle=True)
scores = cross_val_score(grid_search, X, y, cv=strat_k_fold)

print("Scores from 5-fold cross-validation:", scores)
print("Mean score:", scores.mean())

# Refit the search on all the data to report the best parameters
grid_search.fit(X, y)
print("Best parameter set:", grid_search.best_params_)
cv_results = pd.DataFrame(grid_search.cv_results_)
display(HTML(cv_results.to_html()))

Scores from 5-fold cross-validation: [0.83333333 0.85714286 0.9047619  0.88095238 0.95238095]
Mean score: 0.8857142857142858
Best parameter set: {'dtc__criterion': 'entropy', 'dtc__splitter': 'best'}


Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_dtc__criterion,param_dtc__splitter,params,split0_test_score,split1_test_score,split2_test_score,split3_test_score,mean_test_score,std_test_score,rank_test_score
0,0.000892,0.000353,0.000246,2.257403e-05,gini,best,"{'dtc__criterion': 'gini', 'dtc__splitter': 'best'}",0.962264,0.924528,0.903846,0.692308,0.870737,0.105124,3
1,0.00042,4e-06,0.000226,1.881091e-06,gini,random,"{'dtc__criterion': 'gini', 'dtc__splitter': 'random'}",0.962264,0.962264,0.826923,0.692308,0.86094,0.111945,4
2,0.000699,2.3e-05,0.000237,6.743496e-07,entropy,best,"{'dtc__criterion': 'entropy', 'dtc__splitter': 'best'}",0.886792,0.924528,0.903846,0.807692,0.880715,0.044226,1
3,0.000445,2.2e-05,0.000229,2.979636e-06,entropy,random,"{'dtc__criterion': 'entropy', 'dtc__splitter': 'random'}",0.924528,0.90566,0.923077,0.75,0.875816,0.073019,2


## Random Forest Classifier

In [53]:
pipe = Pipeline([('rfc', RandomForestClassifier())])

# Inner loop: grid search with 4-fold CV over criterion and forest size
parameter_grid = {'rfc__criterion': ['gini', 'entropy'],
                  'rfc__n_estimators': np.linspace(start=1, stop=150, num=25, dtype=int)}
grid_search = GridSearchCV(pipe, param_grid=parameter_grid, scoring='accuracy', cv=4)

# Outer loop: stratified 5-fold CV scores the whole tuning procedure
strat_k_fold = StratifiedKFold(n_splits=5, shuffle=True)
scores = cross_val_score(grid_search, X, y, cv=strat_k_fold)

print("Scores from 5-fold cross-validation:", scores)
print("Mean score:", scores.mean())

# Refit the search on all the data to report the best parameters
grid_search.fit(X, y)
print("Best parameter set:", grid_search.best_params_)
cv_results = pd.DataFrame(grid_search.cv_results_)
display(HTML(cv_results.to_html()))

Scores from 5-fold cross-validation: [0.92857143 0.92857143 0.85714286 0.88095238 0.97619048]
Mean score: 0.9142857142857143
Best parameter set: {'rfc__criterion': 'gini', 'rfc__n_estimators': 32}


Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_rfc__criterion,param_rfc__n_estimators,params,split0_test_score,split1_test_score,split2_test_score,split3_test_score,mean_test_score,std_test_score,rank_test_score
0,0.002051,0.000365,0.000525,1.8e-05,gini,1,"{'rfc__criterion': 'gini', 'rfc__n_estimators': 1}",0.867925,0.943396,0.884615,0.692308,0.847061,0.093641,49
1,0.008417,0.000242,0.001087,0.000142,gini,7,"{'rfc__criterion': 'gini', 'rfc__n_estimators': 7}",0.962264,0.924528,0.961538,0.653846,0.875544,0.128904,20
2,0.01506,0.000112,0.001415,8e-05,gini,13,"{'rfc__criterion': 'gini', 'rfc__n_estimators': 13}",0.924528,0.943396,0.961538,0.596154,0.856404,0.150824,48
3,0.020253,0.000473,0.001722,5.9e-05,gini,19,"{'rfc__criterion': 'gini', 'rfc__n_estimators': 19}",0.90566,0.924528,0.980769,0.692308,0.875816,0.109492,11
4,0.026627,0.000726,0.002124,7.4e-05,gini,25,"{'rfc__criterion': 'gini', 'rfc__n_estimators': 25}",0.924528,0.924528,0.961538,0.634615,0.861303,0.131747,43
5,0.033078,0.000272,0.002687,0.000302,gini,32,"{'rfc__criterion': 'gini', 'rfc__n_estimators': 32}",0.924528,0.943396,0.980769,0.711538,0.890058,0.105037,1
6,0.04034,0.00127,0.002925,4.9e-05,gini,38,"{'rfc__criterion': 'gini', 'rfc__n_estimators': 38}",0.924528,0.924528,0.980769,0.634615,0.86611,0.135611,35
7,0.046144,0.001151,0.003304,2e-05,gini,44,"{'rfc__criterion': 'gini', 'rfc__n_estimators': 44}",0.924528,0.924528,0.980769,0.615385,0.861303,0.143825,43
8,0.052326,0.000922,0.003739,5.8e-05,gini,50,"{'rfc__criterion': 'gini', 'rfc__n_estimators': 50}",0.924528,0.943396,0.980769,0.634615,0.870827,0.137871,25
9,0.057427,0.000676,0.004118,5e-05,gini,56,"{'rfc__criterion': 'gini', 'rfc__n_estimators': 56}",0.924528,0.943396,0.980769,0.673077,0.880443,0.121421,7
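
The `rfc__n_estimators` values in the table above (1, 7, 13, 19, ...) follow from the `np.linspace(..., dtype=int)` call in the parameter grid. The sketch below reproduces that grid on its own and counts the candidate configurations; the names `n_estimators_grid` and `n_candidates` are introduced here purely for illustration.

```python
import numpy as np
from sklearn.model_selection import ParameterGrid

# Same call as in the grid above: 25 evenly spaced values between 1 and 150,
# converted to integers (the fractional part is dropped)
n_estimators_grid = np.linspace(start=1, stop=150, num=25, dtype=int)
print(n_estimators_grid[:6])

# Every criterion is combined with every forest size
param_grid = {'rfc__criterion': ['gini', 'entropy'],
              'rfc__n_estimators': n_estimators_grid}
n_candidates = len(ParameterGrid(param_grid))
print(n_candidates)
```

With 2 criteria and 25 forest sizes, the search evaluates 50 candidates on each of the 4 inner folds.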


## Gaussian Naive Bayes Classifier

In [54]:
pipe = Pipeline([('gnb', GaussianNB())])

# Inner loop: grid search with 4-fold CV over the variance smoothing term
parameter_grid = {'gnb__var_smoothing': np.linspace(start=1e-9, stop=1e-2, num=100)}
grid_search = GridSearchCV(pipe, param_grid=parameter_grid, scoring='accuracy', cv=4)

# Outer loop: stratified 5-fold CV scores the whole tuning procedure
strat_k_fold = StratifiedKFold(n_splits=5, shuffle=True)
scores = cross_val_score(grid_search, X, y, cv=strat_k_fold)

print("Scores from 5-fold cross-validation:", scores)
print("Mean score:", scores.mean())

# Refit the search on all the data to report the best parameters
grid_search.fit(X, y)
print("Best parameter set:", grid_search.best_params_)
cv_results = pd.DataFrame(grid_search.cv_results_)
display(HTML(cv_results.to_html()))

Scores from 5-fold cross-validation: [0.92857143 0.88095238 0.97619048 0.95238095 0.78571429]
Mean score: 0.9047619047619048
Best parameter set: {'gnb__var_smoothing': 0.0028282835454545457}


Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_gnb__var_smoothing,params,split0_test_score,split1_test_score,split2_test_score,split3_test_score,mean_test_score,std_test_score,rank_test_score
0,0.000963,0.0003626476,0.000386,9.582039e-05,1e-09,{'gnb__var_smoothing': 1e-09},0.90566,0.943396,0.942308,0.673077,0.86611,0.112478,100
1,0.000552,2.57599e-05,0.000315,1.33156e-05,0.000101011,{'gnb__var_smoothing': 0.00010101109090909092},0.90566,0.943396,0.942308,0.711538,0.875726,0.096003,95
2,0.000522,2.678896e-06,0.000301,2.468381e-06,0.000202021,{'gnb__var_smoothing': 0.00020202118181818183},0.90566,0.943396,0.942308,0.711538,0.875726,0.096003,95
3,0.000521,1.512596e-06,0.000308,1.741966e-05,0.000303031,{'gnb__var_smoothing': 0.0003030312727272728},0.90566,0.943396,0.942308,0.711538,0.875726,0.096003,95
4,0.000521,4.289464e-06,0.000304,7.693157e-06,0.000404041,{'gnb__var_smoothing': 0.0004040413636363637},0.90566,0.943396,0.942308,0.711538,0.875726,0.096003,95
5,0.000521,1.611531e-06,0.000299,1.227335e-06,0.000505051,{'gnb__var_smoothing': 0.0005050514545454546},0.90566,0.943396,0.942308,0.711538,0.875726,0.096003,95
6,0.000534,2.27593e-05,0.000347,8.535568e-05,0.000606062,{'gnb__var_smoothing': 0.0006060615454545455},0.90566,0.943396,0.942308,0.730769,0.880533,0.08779,94
7,0.000522,1.31671e-06,0.000299,1.097438e-06,0.000707072,{'gnb__var_smoothing': 0.0007070716363636365},0.90566,0.943396,0.942308,0.75,0.885341,0.079602,72
8,0.000524,1.617033e-06,0.000318,2.967833e-05,0.000808082,{'gnb__var_smoothing': 0.0008080817272727274},0.90566,0.943396,0.942308,0.75,0.885341,0.079602,72
9,0.000524,1.88769e-06,0.0003,1.135621e-06,0.000909092,{'gnb__var_smoothing': 0.0009090918181818183},0.90566,0.943396,0.942308,0.75,0.885341,0.079602,72
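
The table above shows that the linear grid jumps from `1e-09` straight to about `1e-04`, so the small values of `var_smoothing` are barely explored. As a possible alternative (not what the notebook ran), a logarithmic grid spreads the candidates evenly across orders of magnitude. The sketch below only compares the two grids; `log_grid` and the `5e-5` threshold are illustrative choices.

```python
import numpy as np

# Grid used in the notebook: 100 linearly spaced values
linear_grid = np.linspace(start=1e-9, stop=1e-2, num=100)

# Hypothetical alternative: one value per order of magnitude, 1e-9 ... 1e-2
log_grid = np.logspace(start=-9, stop=-2, num=8)

# Count how many candidates fall below 5e-5 in each grid
n_small_linear = int((linear_grid < 5e-5).sum())
n_small_log = int((log_grid < 5e-5).sum())
print(n_small_linear, n_small_log)
```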


## Multilayer Perceptron Classifier

In [55]:
pipe = Pipeline([('mlp', MLPClassifier(max_iter=10000))])

# Inner loop: grid search with 4-fold CV over L2 penalty and activation function
parameter_grid = {'mlp__alpha': np.linspace(start=0, stop=1e-1, num=50),
                  'mlp__activation': ['identity', 'logistic', 'tanh', 'relu']}
grid_search = GridSearchCV(pipe, param_grid=parameter_grid, scoring='accuracy', cv=4)

# Outer loop: stratified 5-fold CV scores the whole tuning procedure
strat_k_fold = StratifiedKFold(n_splits=5, shuffle=True)
scores = cross_val_score(grid_search, X, y, cv=strat_k_fold)

print("Scores from 5-fold cross-validation:", scores)
print("Mean score:", scores.mean())

# Refit the search on all the data to report the best parameters
grid_search.fit(X, y)
print("Best parameter set:", grid_search.best_params_)
cv_results = pd.DataFrame(grid_search.cv_results_)
display(HTML(cv_results.to_html()))

Scores from 5-fold cross-validation: [0.88095238 0.92857143 0.95238095 0.95238095 0.88095238]
Mean score: 0.919047619047619
Best parameter set: {'mlp__activation': 'identity', 'mlp__alpha': 0.030612244897959186}


Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_mlp__activation,param_mlp__alpha,params,split0_test_score,split1_test_score,split2_test_score,split3_test_score,mean_test_score,std_test_score,rank_test_score
0,0.351683,0.020719,0.000381,1.2e-05,identity,0.0,"{'mlp__activation': 'identity', 'mlp__alpha': 0.0}",0.981132,0.981132,0.903846,0.730769,0.89922,0.102245,145
1,0.27746,0.166691,0.00038,4.5e-05,identity,0.00204082,"{'mlp__activation': 'identity', 'mlp__alpha': 0.0020408163265306124}",0.981132,0.981132,0.096154,0.788462,0.71172,0.363997,196
2,0.275396,0.154407,0.000358,1.5e-05,identity,0.00408163,"{'mlp__activation': 'identity', 'mlp__alpha': 0.004081632653061225}",0.622642,0.981132,0.903846,0.788462,0.82402,0.134976,160
3,0.350732,0.040702,0.000369,4e-06,identity,0.00612245,"{'mlp__activation': 'identity', 'mlp__alpha': 0.006122448979591837}",0.981132,0.981132,0.903846,0.788462,0.913643,0.078861,3
4,0.384516,0.010422,0.000371,2e-06,identity,0.00816327,"{'mlp__activation': 'identity', 'mlp__alpha': 0.00816326530612245}",0.981132,0.981132,0.903846,0.788462,0.913643,0.078861,3
5,0.389373,0.026157,0.000371,3e-06,identity,0.0102041,"{'mlp__activation': 'identity', 'mlp__alpha': 0.010204081632653062}",0.981132,0.981132,0.903846,0.730769,0.89922,0.102245,145
6,0.271325,0.151787,0.00036,1.9e-05,identity,0.0122449,"{'mlp__activation': 'identity', 'mlp__alpha': 0.012244897959183675}",0.981132,0.113208,0.903846,0.730769,0.682239,0.340807,197
7,0.276706,0.16446,0.000358,2.5e-05,identity,0.0142857,"{'mlp__activation': 'identity', 'mlp__alpha': 0.014285714285714287}",0.981132,0.981132,0.903846,0.153846,0.754989,0.348501,185
8,0.379784,0.058511,0.000392,3.4e-05,identity,0.0163265,"{'mlp__activation': 'identity', 'mlp__alpha': 0.0163265306122449}",0.981132,0.981132,0.903846,0.788462,0.913643,0.078861,3
9,0.362626,0.047718,0.00037,7e-06,identity,0.0183673,"{'mlp__activation': 'identity', 'mlp__alpha': 0.018367346938775512}",0.981132,0.981132,0.903846,0.788462,0.913643,0.078861,3


# <u>4. Dermatology</u>

Tătaru Dragoș-Cătălin, _group 10LF383_

_Dataset source_: http://archive.ics.uci.edu/ml/datasets/Dermatology

_Relevant article_: G. Demiroz, H. A. Guvenir, and N. Ilter, "Learning Differential Diagnosis of Erythemato-Squamous Diseases using Voting Feature Intervals", Artificial Intelligence in Medicine

_Short description_: The dataset holds the clinical and histopathological features used for the differential diagnosis of erythemato-squamous skin diseases. It contains 366 entries and 34 attributes. Most attributes are graded from 0 to 3, indicating the severity of the feature; the exceptions are `family history` (0 or 1) and `Age` (in years). There are 6 classes, coded with the digits 1 to 6 in the order listed below.

_Classes_:
- psoriasis
- seborrheic dermatitis
- lichen planus
- pityriasis rosea
- chronic dermatitis
- pityriasis rubra pilaris
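
Since the classes are stored as the codes 1 to 6, a small lookup table makes predictions easier to read. The sketch below assumes the codes follow the order of the list above; `class_names` and `decoded` are illustrative names, not part of the notebook.

```python
# Assumption: class code k corresponds to the k-th entry of the list above
class_names = {
    1: "psoriasis",
    2: "seborrheic dermatitis",
    3: "lichen planus",
    4: "pityriasis rosea",
    5: "chronic dermatitis",
    6: "pityriasis rubra pilaris",
}

# Class values become floats after imputation, so cast them back to int
labels = [2.0, 5.0, 3.0]
decoded = [class_names[int(code)] for code in labels]
print(decoded)
```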

In [58]:
import numpy as np
import pandas as pd

In [59]:
from IPython.display import display, HTML
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import cross_validate,cross_val_score, train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedKFold
from sklearn.impute import SimpleImputer

## Load the data

In [61]:
header =['erythema','scaling','definite borders','itching','koebner phenomenon','polygonal papules',
'follicular papules','oral mucosal involvement','knee and elbow involvement','scalp involvement','family history','melanin incontinence',
'eosinophils in the infiltrate','PNL infiltrate','fibrosis of the papillary dermis','exocytosis','acanthosis','hyperkeratosis',
'parakeratosis','clubbing of the rete ridges','elongation of the rete ridges','thinning of the suprapapillary epidermis','spongiform pustule','munro microabcess','focal hypergranulosis',
'disappearance of the granular layer','vacuolisation and damage of basal layer','spongiosis','saw-tooth appearance of retes','follicular horn plug','perifollicular parakeratosis','inflammatory monoluclear inflitrate',
'band-like infiltrate','Age','Class']

print(len(header))
# Build a list of strings that may represent missing data in our dataset
missingValues = ["n/a", "na", "--"," ","?"]
data_spam = pd.read_csv("./Datasets/dermatology.data",names=header,sep=',',na_values=missingValues)

35
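
The `na_values` argument used above tells pandas which strings to read as missing. The sketch below shows the effect on a made-up three-row sample with only two columns; the sample data and the names `sample_csv` and `sample` are assumptions for illustration.

```python
import io
import pandas as pd

# Hypothetical sample in the same comma-separated layout; "?" marks a missing age
sample_csv = io.StringIO("55,2\n?,1\n26,3\n")
missing_markers = ["n/a", "na", "--", " ", "?"]

sample = pd.read_csv(sample_csv, names=["Age", "Class"], sep=",",
                     na_values=missing_markers)

# The "?" is parsed as NaN, which also turns the column into floats
n_missing_age = int(sample["Age"].isnull().sum())
print(n_missing_age, sample["Age"].dtype)
```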


Some data is missing: every column should have 366 entries, which can be checked against the dataset summary below:

In [62]:
display(data_spam.info())
# Print the number of missing values per column
print(data_spam.isnull().sum())
# Print the total number of missing values
print(data_spam.isnull().sum().sum())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 366 entries, 0 to 365
Data columns (total 35 columns):
 #   Column                                    Non-Null Count  Dtype  
---  ------                                    --------------  -----  
 0   erythema                                  366 non-null    int64  
 1   scaling                                   366 non-null    int64  
 2   definite borders                          366 non-null    int64  
 3   itching                                   366 non-null    int64  
 4   koebner phenomenon                        366 non-null    int64  
 5   polygonal papules                         366 non-null    int64  
 6   follicular papules                        366 non-null    int64  
 7   oral mucosal involvement                  366 non-null    int64  
 8   knee and elbow involvement                366 non-null    int64  
 9   scalp involvement                         366 non-null    int64  
 10  family history                        

None

erythema                                    0
scaling                                     0
definite borders                            0
itching                                     0
koebner phenomenon                          0
polygonal papules                           0
follicular papules                          0
oral mucosal involvement                    0
knee and elbow involvement                  0
scalp involvement                           0
family history                              0
melanin incontinence                        0
eosinophils in the infiltrate               0
PNL infiltrate                              0
fibrosis of the papillary dermis            0
exocytosis                                  0
acanthosis                                  0
hyperkeratosis                              0
parakeratosis                               0
clubbing of the rete ridges                 0
elongation of the rete ridges               0
thinning of the suprapapillary epi

In [63]:
display(HTML("<h3><b>Dermatology Dataset</b></h3>"))
display(HTML(data_spam[5:27].to_html()))

Unnamed: 0,erythema,scaling,definite borders,itching,koebner phenomenon,polygonal papules,follicular papules,oral mucosal involvement,knee and elbow involvement,scalp involvement,family history,melanin incontinence,eosinophils in the infiltrate,PNL infiltrate,fibrosis of the papillary dermis,exocytosis,acanthosis,hyperkeratosis,parakeratosis,clubbing of the rete ridges,elongation of the rete ridges,thinning of the suprapapillary epidermis,spongiform pustule,munro microabcess,focal hypergranulosis,disappearance of the granular layer,vacuolisation and damage of basal layer,spongiosis,saw-tooth appearance of retes,follicular horn plug,perifollicular parakeratosis,inflammatory monoluclear inflitrate,band-like infiltrate,Age,Class
5,2,3,2,0,0,0,0,0,0,0,0,0,2,1,0,2,2,0,2,0,0,0,1,0,0,0,0,2,0,0,0,1,0,41.0,2
6,2,1,0,2,0,0,0,0,0,0,0,0,0,0,3,1,3,0,0,0,2,0,0,0,0,0,0,0,0,0,0,2,0,18.0,5
7,2,2,3,3,3,3,0,2,0,0,0,2,0,0,0,2,3,0,0,0,0,0,0,0,0,2,2,3,2,0,0,3,3,57.0,3
8,2,2,1,0,2,0,0,0,0,0,0,0,0,0,0,2,1,0,1,0,0,0,0,0,0,0,0,2,0,0,0,2,0,22.0,4
9,2,2,1,0,1,0,0,0,0,0,0,0,0,0,0,3,2,0,2,0,0,0,0,0,0,0,0,2,0,0,0,2,0,30.0,4
10,3,3,2,1,1,0,0,0,2,2,1,0,0,0,0,0,3,2,3,2,2,2,1,1,0,0,0,0,0,0,0,1,0,20.0,1
11,2,2,0,3,0,0,0,0,0,0,0,0,0,2,0,2,2,0,0,0,0,0,1,0,0,0,0,3,0,0,0,1,0,21.0,2
12,3,3,1,2,0,0,0,0,0,1,0,0,0,2,0,3,1,0,1,0,0,0,0,0,0,0,0,2,0,0,0,1,0,22.0,2
13,2,3,3,0,0,0,0,0,1,1,1,0,0,1,0,0,2,1,2,1,2,3,0,2,0,0,0,0,0,0,0,2,0,10.0,1
14,2,2,3,3,0,3,0,2,0,0,0,2,0,0,0,1,1,1,1,0,0,0,0,0,2,0,3,0,3,0,0,1,3,65.0,3


## Replace the missing data:

In [64]:
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
print(imputer)
imputer.fit(data_spam)
data_spam = pd.DataFrame(data=imputer.transform(data_spam), columns=header)

display(HTML("<i>Dataset extract with values imputed:</i>"))
display(HTML(data_spam[5:20].to_html()))

X = data_spam.values[:, :34]
y = data_spam.values[:, -1]

display(X)
display(np.unique(y))

SimpleImputer(add_indicator=False, copy=True, fill_value=None,
              missing_values=nan, strategy='mean', verbose=0)


Unnamed: 0,erythema,scaling,definite borders,itching,koebner phenomenon,polygonal papules,follicular papules,oral mucosal involvement,knee and elbow involvement,scalp involvement,family history,melanin incontinence,eosinophils in the infiltrate,PNL infiltrate,fibrosis of the papillary dermis,exocytosis,acanthosis,hyperkeratosis,parakeratosis,clubbing of the rete ridges,elongation of the rete ridges,thinning of the suprapapillary epidermis,spongiform pustule,munro microabcess,focal hypergranulosis,disappearance of the granular layer,vacuolisation and damage of basal layer,spongiosis,saw-tooth appearance of retes,follicular horn plug,perifollicular parakeratosis,inflammatory monoluclear inflitrate,band-like infiltrate,Age,Class
5,2.0,3.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,1.0,0.0,2.0,2.0,0.0,2.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,2.0,0.0,0.0,0.0,1.0,0.0,41.0,2.0
6,2.0,1.0,0.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3.0,1.0,3.0,0.0,0.0,0.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,0.0,18.0,5.0
7,2.0,2.0,3.0,3.0,3.0,3.0,0.0,2.0,0.0,0.0,0.0,2.0,0.0,0.0,0.0,2.0,3.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,2.0,3.0,2.0,0.0,0.0,3.0,3.0,57.0,3.0
8,2.0,2.0,1.0,0.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,0.0,0.0,0.0,2.0,0.0,22.0,4.0
9,2.0,2.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3.0,2.0,0.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,0.0,0.0,0.0,2.0,0.0,30.0,4.0
10,3.0,3.0,2.0,1.0,1.0,0.0,0.0,0.0,2.0,2.0,1.0,0.0,0.0,0.0,0.0,0.0,3.0,2.0,3.0,2.0,2.0,2.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,20.0,1.0
11,2.0,2.0,0.0,3.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,0.0,2.0,2.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,3.0,0.0,0.0,0.0,1.0,0.0,21.0,2.0
12,3.0,3.0,1.0,2.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,2.0,0.0,3.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,0.0,0.0,0.0,1.0,0.0,22.0,2.0
13,2.0,3.0,3.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,0.0,1.0,0.0,0.0,2.0,1.0,2.0,1.0,2.0,3.0,0.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,0.0,10.0,1.0
14,2.0,2.0,3.0,3.0,0.0,3.0,0.0,2.0,0.0,0.0,0.0,2.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,2.0,0.0,3.0,0.0,3.0,0.0,0.0,1.0,3.0,65.0,3.0


array([[ 2.,  2.,  0., ...,  1.,  0., 55.],
       [ 3.,  3.,  3., ...,  1.,  0.,  8.],
       [ 2.,  1.,  2., ...,  2.,  3., 26.],
       ...,
       [ 3.,  2.,  2., ...,  2.,  3., 28.],
       [ 2.,  1.,  3., ...,  2.,  3., 50.],
       [ 3.,  2.,  2., ...,  3.,  0., 35.]])

array([1., 2., 3., 4., 5., 6.])
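
To make the `strategy='mean'` imputation concrete, here is a minimal sketch on a made-up 3×2 array (the values are illustrative, not taken from the dataset). Each `NaN` is replaced by the mean of the non-missing values in its column.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy data: the first column has one missing value
toy = np.array([[20.0, 1.0],
                [np.nan, 2.0],
                [40.0, 3.0]])

imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
imputed = imputer.fit_transform(toy)

# The missing entry becomes (20 + 40) / 2 = 30
print(imputed)
```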

## The print Function

In [65]:
def print_function(data_set:dict):
    """Display per-fold and average accuracy and macro F1 scores from a cross_validate result."""
    df_print = pd.DataFrame({"Test accuracy for each fold":data_set['test_accuracy'], 
                    "Train accuracy for each fold": data_set['train_accuracy'], 
                    "Average test accuracy %": round(data_set['test_accuracy'].mean() * 100, 4),
                    "Average train accuracy %": round(data_set['train_accuracy'].mean() * 100, 4),
                    "Test F1 score for each fold": data_set['test_f1_macro'],
                    "Train F1 score for each fold": data_set['train_f1_macro'],
                    "Average test F1 score %": round(data_set['test_f1_macro'].mean() * 100, 4),
                    "Average train F1 score %":round(data_set['train_f1_macro'].mean() * 100, 4)
                   })
    display(HTML(df_print.to_html()))
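
`print_function` expects the dictionary returned by `cross_validate` when it is called with two scorers and `return_train_score=True`. The sketch below shows which keys that dictionary holds, using synthetic data from `make_classification` as a stand-in for the real dataset; `X_demo`, `y_demo` and `stats` are illustrative names.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_validate
from sklearn.tree import DecisionTreeClassifier

# Synthetic 3-class problem, used only to illustrate the output format
X_demo, y_demo = make_classification(n_samples=120, n_features=8, n_informative=4,
                                     n_classes=3, random_state=0)

stats = cross_validate(DecisionTreeClassifier(random_state=0), X_demo, y_demo, cv=5,
                       scoring=('accuracy', 'f1_macro'), return_train_score=True)

# One array of length 5 (one value per fold) for each key
print(sorted(stats.keys()))
```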

# Running the algorithms with hardcoded hyperparameters:

## K-Nearest Neighbors Classifier

In [66]:
# hyperparameters
knn_neighbors = 5
knn_minkowski_p = 3

# scale the data
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)

# kNN implementation
model = KNeighborsClassifier(n_neighbors=knn_neighbors, p=knn_minkowski_p)
model_cv_stats = cross_validate(model, X_scaled, y, cv=5, scoring=('accuracy', 'f1_macro'), return_train_score=True)

# statistics
display(HTML(f"<h4>5-fold cross validation for {knn_neighbors}-nearest neighbors classification:</h4>"))
print_function(model_cv_stats)

Unnamed: 0,Test accuracy for each fold,Train accuracy for each fold,Average test accuracy %,Average train accuracy %,Test F1 score for each fold,Train F1 score for each fold,Average test F1 score %,Average train F1 score %
0,0.945946,0.982877,95.0833,97.5415,0.946673,0.981964,94.7936,97.3208
1,0.945205,0.976109,95.0833,97.5415,0.942044,0.97373,94.7936,97.3208
2,0.958904,0.972696,95.0833,97.5415,0.953684,0.970766,94.7936,97.3208
3,0.986301,0.976109,95.0833,97.5415,0.988188,0.974982,94.7936,97.3208
4,0.917808,0.969283,95.0833,97.5415,0.909091,0.964597,94.7936,97.3208
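
In the cell above the scaler is fitted on all of `X` before cross-validation, so each test fold influences the scaling. One common way to avoid this is to put the scaler inside a `Pipeline`, so that it is refitted on the training part of every fold. The sketch below illustrates the idea on synthetic data (an assumption made so that it runs on its own); it is not the code that produced the results above.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_validate
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the real feature matrix and labels
X_demo, y_demo = make_classification(n_samples=120, n_features=8, n_informative=4,
                                     n_classes=3, random_state=0)

# The scaler is now part of the model, so it only ever sees training folds
knn_pipe = Pipeline([('scaler', MinMaxScaler()),
                     ('knn', KNeighborsClassifier(n_neighbors=5, p=3))])

pipe_stats = cross_validate(knn_pipe, X_demo, y_demo, cv=5,
                            scoring=('accuracy', 'f1_macro'), return_train_score=True)
print(pipe_stats['test_accuracy'].mean())
```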


## Decision Tree Classifier

In [67]:
# hyperparameters
dt_criterion = 'gini'  # default value
dt_splitter = 'best'   # choose the question (split) that reduces uncertainty the most
# Gini impurity: the amount of uncertainty at a single node, i.e. how mixed the classes are in the child nodes after the node's question

# Decision Tree implementation
model = DecisionTreeClassifier(criterion=dt_criterion, splitter=dt_splitter)
model_dc_stats = cross_validate(model, X, y, cv=5, scoring=('accuracy', 'f1_macro'), return_train_score=True)

# statistics
display(HTML("<h4>5-fold cross validation for Decision Trees classification:</h4>"))
print_function(model_dc_stats)

Unnamed: 0,Test accuracy for each fold,Train accuracy for each fold,Average test accuracy %,Average train accuracy %,Test F1 score for each fold,Train F1 score for each fold,Average test F1 score %,Average train F1 score %
0,0.918919,1.0,92.625,100.0,0.905717,1.0,92.2922,100.0
1,0.986301,1.0,92.625,100.0,0.986581,1.0,92.2922,100.0
2,0.863014,1.0,92.625,100.0,0.883445,1.0,92.2922,100.0
3,0.917808,1.0,92.625,100.0,0.899474,1.0,92.2922,100.0
4,0.945205,1.0,92.625,100.0,0.939394,1.0,92.2922,100.0
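
The two split criteria used in this notebook, `'gini'` and `'entropy'`, both measure how mixed the class labels at a node are. The sketch below computes them by hand for two small label lists, using the standard definitions (Gini impurity as 1 − Σ pᵢ², entropy as −Σ pᵢ log₂ pᵢ). The helper names are illustrative, and scikit-learn's internal implementation is different.

```python
import math
from collections import Counter

def class_proportions(labels):
    """Fraction of samples per class; assumes a non-empty list of labels."""
    n = len(labels)
    return [count / n for count in Counter(labels).values()]

def gini_impurity(labels):
    return 1.0 - sum(p * p for p in class_proportions(labels))

def entropy(labels):
    return -sum(p * math.log2(p) for p in class_proportions(labels))

pure = [1, 1, 1, 1]    # a single class: no uncertainty
mixed = [1, 1, 2, 2]   # two classes in equal parts: maximum uncertainty for 2 classes

print(gini_impurity(pure), gini_impurity(mixed))
print(entropy(pure), entropy(mixed))
```

A split is considered good when the child nodes have a lower weighted impurity than the parent.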


## Random Forest Classifier

In [68]:
# hyperparameters
rfc_n_estimators = 150
rfc_criterion = 'gini'

# Random Forest implementation
model = RandomForestClassifier(n_estimators=rfc_n_estimators, criterion=rfc_criterion)
model_rfc_stats = cross_validate(model, X, y, cv=5, scoring=('accuracy', 'f1_macro'), return_train_score=True)

# statistics
display(HTML("<h4>5-fold cross validation for Random Forest classification</h4>"))
print_function(model_rfc_stats)

Unnamed: 0,Test accuracy for each fold,Train accuracy for each fold,Average test accuracy %,Average train accuracy %,Test F1 score for each fold,Train F1 score for each fold,Average test F1 score %,Average train F1 score %
0,0.972973,1.0,97.5417,100.0,0.974704,1.0,97.3621,100.0
1,1.0,1.0,97.5417,100.0,1.0,1.0,97.3621,100.0
2,0.972603,1.0,97.5417,100.0,0.969444,1.0,97.3621,100.0
3,0.986301,1.0,97.5417,100.0,0.984561,1.0,97.3621,100.0
4,0.945205,1.0,97.5417,100.0,0.939394,1.0,97.3621,100.0


## Multilayer Perceptron Classifier

In [69]:
# hyperparameters
mlp_solver = 'adam'
mlp_activation = 'logistic'
mlp_alpha = 1e-3
mlp_hidden_layer_sizes = (50, 50)
max_iter = 1000

# MLP implementation
model = MLPClassifier(solver=mlp_solver, activation=mlp_activation, alpha=mlp_alpha, hidden_layer_sizes=mlp_hidden_layer_sizes, max_iter=max_iter, random_state=0)
model_mpc_stats = cross_validate(model, X, y, cv=5, scoring=('accuracy', 'f1_macro'), return_train_score=True)

# statistics
print_function(model_mpc_stats)

Unnamed: 0,Test accuracy for each fold,Train accuracy for each fold,Average test accuracy %,Average train accuracy %,Test F1 score for each fold,Train F1 score for each fold,Average test F1 score %,Average train F1 score %
0,0.959459,1.0,96.9974,100.0,0.95619,1.0,96.6856,100.0
1,1.0,1.0,96.9974,100.0,1.0,1.0,96.6856,100.0
2,0.972603,1.0,96.9974,100.0,0.969444,1.0,96.6856,100.0
3,0.986301,1.0,96.9974,100.0,0.984561,1.0,96.6856,100.0
4,0.931507,1.0,96.9974,100.0,0.924086,1.0,96.6856,100.0


## Gaussian Naive Bayes Classifier

In [70]:
# GNB implementation
model = GaussianNB()
model_gnb_stats = cross_validate(model, X, y, cv=5, scoring=('accuracy', 'f1_macro'), return_train_score=True)

# statistics
display(HTML("<h4>5-fold cross validation for Gaussian NB classification</h4>"))
print_function(model_gnb_stats)

Unnamed: 0,Test accuracy for each fold,Train accuracy for each fold,Average test accuracy %,Average train accuracy %,Test F1 score for each fold,Train F1 score for each fold,Average test F1 score %,Average train F1 score %
0,0.905405,0.900685,87.4232,89.9591,0.892797,0.882689,83.8773,88.1502
1,0.849315,0.904437,87.4232,89.9591,0.795756,0.888485,83.8773,88.1502
2,0.849315,0.901024,87.4232,89.9591,0.787987,0.88478,83.8773,88.1502
3,0.876712,0.890785,87.4232,89.9591,0.848276,0.867591,83.8773,88.1502
4,0.890411,0.901024,87.4232,89.9591,0.869048,0.883966,83.8773,88.1502


# Hyperparameter optimization

## K-Nearest Neighbors Classifier

In [71]:
pipe = Pipeline([('scaler', MinMaxScaler()), ('knn', KNeighborsClassifier())])

# Inner loop: grid search with 4-fold CV over k and the Minkowski exponent p
parameter_grid = {'knn__n_neighbors': list(range(1, 10)), 'knn__p': list(range(1, 5))}
grid_search = GridSearchCV(pipe, param_grid=parameter_grid, scoring='accuracy', cv=4, return_train_score=True)

# Outer loop: stratified 5-fold CV scores the whole tuning procedure
strat_k_fold = StratifiedKFold(n_splits=5, shuffle=True)
scores = cross_val_score(grid_search, X, y, cv=strat_k_fold)
print("Scores from 5-fold cross-validation:", scores)
print("Mean score:", scores.mean())

# Refit the search on all the data to report the best parameters
grid_search.fit(X, y)
print("Best parameter set:", grid_search.best_params_)

cv_results = pd.DataFrame(grid_search.cv_results_)
display(HTML(cv_results.to_html()))

Scores from 5-fold cross-validation: [0.98648649 0.95890411 0.95890411 1.         0.95890411]
Mean score: 0.972639763050722
Best parameter set: {'knn__n_neighbors': 6, 'knn__p': 1}


Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_knn__n_neighbors,param_knn__p,params,split0_test_score,split1_test_score,split2_test_score,split3_test_score,mean_test_score,std_test_score,rank_test_score,split0_train_score,split1_train_score,split2_train_score,split3_train_score,mean_train_score,std_train_score
0,0.001045,0.000248,0.004319,0.000277,1,1,"{'knn__n_neighbors': 1, 'knn__p': 1}",0.978261,0.967391,0.945055,0.934066,0.956193,0.017508,11,1.0,1.0,1.0,1.0,1.0,0.0
1,0.00089,1.6e-05,0.003866,0.000252,1,2,"{'knn__n_neighbors': 1, 'knn__p': 2}",0.967391,0.978261,0.934066,0.923077,0.950699,0.022792,23,1.0,1.0,1.0,1.0,1.0,0.0
2,0.001258,0.000267,0.01524,0.000518,1,3,"{'knn__n_neighbors': 1, 'knn__p': 3}",0.934783,0.967391,0.923077,0.923077,0.937082,0.01814,30,1.0,1.0,1.0,1.0,1.0,0.0
3,0.000954,5.6e-05,0.014879,0.000483,1,4,"{'knn__n_neighbors': 1, 'knn__p': 4}",0.923913,0.967391,0.912088,0.923077,0.931617,0.021175,34,1.0,1.0,1.0,1.0,1.0,0.0
4,0.001198,0.000433,0.004684,0.001454,2,1,"{'knn__n_neighbors': 2, 'knn__p': 1}",0.945652,0.978261,0.945055,0.945055,0.953506,0.014294,14,0.985401,0.970803,0.989091,0.992727,0.984506,0.008324
5,0.001077,0.000208,0.005241,0.001595,2,2,"{'knn__n_neighbors': 2, 'knn__p': 2}",0.956522,0.98913,0.934066,0.934066,0.953446,0.02255,16,0.985401,0.974453,0.981818,0.992727,0.9836,0.006584
6,0.000903,4.6e-05,0.015663,0.000398,2,3,"{'knn__n_neighbors': 2, 'knn__p': 3}",0.934783,0.978261,0.912088,0.923077,0.937052,0.025109,31,0.985401,0.974453,0.981818,0.992727,0.9836,0.006584
7,0.000986,7.5e-05,0.015031,0.000837,2,4,"{'knn__n_neighbors': 2, 'knn__p': 4}",0.934783,0.956522,0.923077,0.923077,0.934365,0.013656,33,0.985401,0.974453,0.981818,0.985455,0.981782,0.004481
8,0.000838,3.3e-05,0.003641,0.000107,3,1,"{'knn__n_neighbors': 3, 'knn__p': 1}",0.978261,0.956522,0.978022,0.934066,0.961718,0.018242,5,0.978102,0.981752,0.978182,0.992727,0.982691,0.005979
9,0.000771,7e-06,0.003527,8.2e-05,3,2,"{'knn__n_neighbors': 3, 'knn__p': 2}",0.978261,0.978261,0.934066,0.934066,0.956163,0.022097,13,0.974453,0.981752,0.978182,0.992727,0.981778,0.006828
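
The results table above holds one row per parameter combination, and `rank_test_score` orders the rows by `mean_test_score`. The sketch below shows one way to pull out the winning row and a few readable columns. It uses synthetic data and a smaller grid so that it runs on its own; the names `search`, `results` and `best_row` are illustrative.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the real feature matrix and labels
X_demo, y_demo = make_classification(n_samples=120, n_features=8, n_informative=4,
                                     n_classes=3, random_state=0)

pipe = Pipeline([('scaler', MinMaxScaler()), ('knn', KNeighborsClassifier())])
search = GridSearchCV(pipe, param_grid={'knn__n_neighbors': [1, 3, 5], 'knn__p': [1, 2]},
                      scoring='accuracy', cv=4)
search.fit(X_demo, y_demo)

results = pd.DataFrame(search.cv_results_)

# The first row with the lowest rank is the one reported as best_params_
best_row = results.loc[results['rank_test_score'].idxmin()]

columns = ['param_knn__n_neighbors', 'param_knn__p',
           'mean_test_score', 'std_test_score', 'rank_test_score']
print(results.sort_values('rank_test_score')[columns])
```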


## Decision Tree Classifier

In [72]:
pipe = Pipeline([('dtc', DecisionTreeClassifier())])

# Inner loop: grid search with 4-fold CV selects the hyperparameters
parameter_grid = {'dtc__criterion': ['gini', 'entropy'], 'dtc__splitter': ['best', 'random']}
grid_search = GridSearchCV(pipe, param_grid=parameter_grid, scoring='accuracy', cv=4)

# Outer loop: stratified 5-fold CV scores the whole tuning procedure
strat_k_fold = StratifiedKFold(n_splits=5, shuffle=True)
scores = cross_val_score(grid_search, X, y, cv=strat_k_fold)

print("Scores from 5-fold cross-validation:", scores)
print("Mean score:", scores.mean())

# Refit the search on all the data to report the best parameters
grid_search.fit(X, y)
print("Best parameter set:", grid_search.best_params_)
cv_results = pd.DataFrame(grid_search.cv_results_)
display(HTML(cv_results.to_html()))

Scores from 5-fold cross-validation: [0.97297297 0.93150685 0.94520548 0.95890411 0.93150685]
Mean score: 0.9480192521288411
Best parameter set: {'dtc__criterion': 'entropy', 'dtc__splitter': 'random'}


Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_dtc__criterion,param_dtc__splitter,params,split0_test_score,split1_test_score,split2_test_score,split3_test_score,mean_test_score,std_test_score,rank_test_score
0,0.001097,0.000413,0.000307,6.813087e-05,gini,best,"{'dtc__criterion': 'gini', 'dtc__splitter': 'best'}",0.913043,0.945652,0.934066,0.967033,0.939949,0.019523,2
1,0.00066,3e-05,0.000245,5.556675e-06,gini,random,"{'dtc__criterion': 'gini', 'dtc__splitter': 'random'}",0.923913,0.880435,0.923077,0.967033,0.923614,0.030619,3
2,0.000821,3.8e-05,0.000269,4.59599e-05,entropy,best,"{'dtc__criterion': 'entropy', 'dtc__splitter': 'best'}",0.836957,0.945652,0.945055,0.945055,0.91818,0.046895,4
3,0.000652,3.5e-05,0.00024,8.658075e-07,entropy,random,"{'dtc__criterion': 'entropy', 'dtc__splitter': 'random'}",0.869565,0.98913,0.956044,0.978022,0.94819,0.04693,1
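
Because `StratifiedKFold(shuffle=True)` and `DecisionTreeClassifier()` are created without a `random_state`, rerunning the cell may give different scores and possibly a different best parameter set. The sketch below shows how fixing the seeds makes a run repeatable. It uses synthetic data, and the helper `run_once` is a hypothetical name introduced for this example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the real feature matrix and labels
X_demo, y_demo = make_classification(n_samples=120, n_features=8, n_informative=4,
                                     n_classes=3, random_state=0)

def run_once(seed):
    """Score a decision tree with a seeded splitter and a seeded model."""
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
    model = DecisionTreeClassifier(criterion='entropy', splitter='random', random_state=seed)
    return cross_val_score(model, X_demo, y_demo, cv=cv)

first = run_once(seed=0)
second = run_once(seed=0)

# Same seed, same folds, same trees: the scores match exactly
print(np.array_equal(first, second))
```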


## Random Forest Classifier

In [73]:
pipe = Pipeline([('rfc', RandomForestClassifier())])

# Inner loop: grid search with 4-fold CV over criterion and forest size
parameter_grid = {'rfc__criterion': ['gini', 'entropy'],
                  'rfc__n_estimators': np.linspace(start=1, stop=150, num=25, dtype=int)}
grid_search = GridSearchCV(pipe, param_grid=parameter_grid, scoring='accuracy', cv=4)

# Outer loop: stratified 5-fold CV scores the whole tuning procedure
strat_k_fold = StratifiedKFold(n_splits=5, shuffle=True)
scores = cross_val_score(grid_search, X, y, cv=strat_k_fold)

print("Scores from 5-fold cross-validation:", scores)
print("Mean score:", scores.mean())

# Refit the search on all the data to report the best parameters
grid_search.fit(X, y)
print("Best parameter set:", grid_search.best_params_)
cv_results = pd.DataFrame(grid_search.cv_results_)
display(HTML(cv_results.to_html()))

Scores from 5-fold cross-validation: [0.98648649 0.98630137 0.95890411 0.94520548 0.98630137]
Mean score: 0.9726397630507219
Best parameter set: {'rfc__criterion': 'gini', 'rfc__n_estimators': 69}


| | mean_fit_time | std_fit_time | mean_score_time | std_score_time | param_rfc__criterion | param_rfc__n_estimators | params | split0_test_score | split1_test_score | split2_test_score | split3_test_score | mean_test_score | std_test_score | rank_test_score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.002058 | 0.000822 | 0.000491 | 1.7e-05 | gini | 1 | `{'rfc__criterion': 'gini', 'rfc__n_estimators': 1}` | 0.880435 | 0.891304 | 0.879121 | 0.813187 | 0.866012 | 0.030863 | 50 |
| 1 | 0.008636 | 0.000515 | 0.001155 | 0.00011 | gini | 7 | `{'rfc__criterion': 'gini', 'rfc__n_estimators': 7}` | 0.967391 | 0.967391 | 0.912088 | 0.956044 | 0.950729 | 0.022785 | 47 |
| 2 | 0.014436 | 0.000186 | 0.001377 | 1.3e-05 | gini | 13 | `{'rfc__criterion': 'gini', 'rfc__n_estimators': 13}` | 0.945652 | 0.978261 | 0.945055 | 0.978022 | 0.961747 | 0.016396 | 43 |
| 3 | 0.021043 | 0.000608 | 0.001814 | 5.5e-05 | gini | 19 | `{'rfc__criterion': 'gini', 'rfc__n_estimators': 19}` | 0.967391 | 0.98913 | 0.978022 | 0.945055 | 0.9699 | 0.016274 | 20 |
| 4 | 0.028297 | 0.000808 | 0.002329 | 9.1e-05 | gini | 25 | `{'rfc__criterion': 'gini', 'rfc__n_estimators': 25}` | 0.967391 | 1.0 | 0.956044 | 0.956044 | 0.96987 | 0.018002 | 22 |
| 5 | 0.034309 | 0.000134 | 0.002706 | 4.3e-05 | gini | 32 | `{'rfc__criterion': 'gini', 'rfc__n_estimators': 32}` | 0.978261 | 1.0 | 0.978022 | 0.945055 | 0.975334 | 0.019628 | 7 |
| 6 | 0.041022 | 0.000506 | 0.003172 | 5.2e-05 | gini | 38 | `{'rfc__criterion': 'gini', 'rfc__n_estimators': 38}` | 0.98913 | 0.978261 | 0.956044 | 0.945055 | 0.967123 | 0.017451 | 35 |
| 7 | 0.047801 | 0.001984 | 0.003838 | 0.000495 | gini | 44 | `{'rfc__criterion': 'gini', 'rfc__n_estimators': 44}` | 0.978261 | 1.0 | 0.934066 | 0.967033 | 0.96984 | 0.023813 | 27 |
| 8 | 0.053266 | 0.000356 | 0.003978 | 3.3e-05 | gini | 50 | `{'rfc__criterion': 'gini', 'rfc__n_estimators': 50}` | 0.967391 | 0.98913 | 0.945055 | 0.956044 | 0.964405 | 0.016314 | 39 |
| 9 | 0.061237 | 0.001793 | 0.004907 | 0.000617 | gini | 56 | `{'rfc__criterion': 'gini', 'rfc__n_estimators': 56}` | 0.978261 | 1.0 | 0.945055 | 0.956044 | 0.96984 | 0.021126 | 27 |
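
Passing a `GridSearchCV` object to `cross_val_score` amounts to nested cross-validation: each outer fold runs its own inner grid search on the training part and is scored on the held-out part. The sketch below spells that loop out by hand. It is an illustration under assumptions, not the notebook's code: `load_wine` stands in for the lab dataset, the grid is shrunk to keep it fast, and the fixed `random_state` values and the `*_demo` names are additions for reproducibility.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline

# Stand-in data (assumption): the notebook uses its own X and y.
X_demo, y_demo = load_wine(return_X_y=True)

pipe = Pipeline([("rfc", RandomForestClassifier(random_state=0))])
# Deliberately small grid so the example runs quickly.
param_grid = {"rfc__criterion": ["gini", "entropy"],
              "rfc__n_estimators": [5, 20]}
inner_search = GridSearchCV(pipe, param_grid=param_grid,
                            scoring="accuracy", cv=4)
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

outer_scores = []
for train_idx, test_idx in outer_cv.split(X_demo, y_demo):
    # Fresh, unfitted copy of the search for every outer fold.
    search = clone(inner_search)
    # Inner 4-fold search sees only the outer training part.
    search.fit(X_demo[train_idx], y_demo[train_idx])
    # The refitted best estimator is scored on the untouched outer test part.
    outer_scores.append(search.score(X_demo[test_idx], y_demo[test_idx]))

outer_scores = np.array(outer_scores)
print("Outer-fold accuracies:", outer_scores)
print("Mean accuracy:", outer_scores.mean())
```

Because the hyperparameters are chosen without access to the outer test folds, the mean of these scores estimates the performance of the whole tuning procedure rather than of one fixed parameter set.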


## Gaussian Naive Bayes Classifier

In [74]:
pipe = Pipeline([('gnb', GaussianNB())])

# Inner loop: 4-fold grid search over 100 linearly spaced var_smoothing values
parameter_grid = {'gnb__var_smoothing': np.linspace(start=1e-9, stop=1e-2, num=100)}
grid_search = GridSearchCV(pipe, param_grid=parameter_grid, scoring='accuracy', cv=4)
# Outer loop: stratified 5-fold cross-validation of the whole search (nested CV)
strat_k_fold = StratifiedKFold(n_splits=5, shuffle=True)
scores = cross_val_score(grid_search, X, y, cv=strat_k_fold)

print("Scores from 5-fold cross-validation:", scores)
print("Mean score:", scores.mean())
# Refit the search on the full dataset to report the best parameters
grid_search.fit(X, y)
print("Best parameter set:", grid_search.best_params_)
# Keep the fitted search object intact; store the results table under its own name
cv_results = pd.DataFrame(grid_search.cv_results_)
display(HTML(cv_results.to_html()))

Scores from 5-fold cross-validation: [0.95945946 0.93150685 0.98630137 0.94520548 1.        ]
Mean score: 0.9644946316179193
Best parameter set: {'gnb__var_smoothing': 0.0004040413636363637}


| | mean_fit_time | std_fit_time | mean_score_time | std_score_time | param_gnb__var_smoothing | params | split0_test_score | split1_test_score | split2_test_score | split3_test_score | mean_test_score | std_test_score | rank_test_score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.001019 | 0.0004165698 | 0.000602 | 0.0002599635 | 1e-09 | `{'gnb__var_smoothing': 1e-09}` | 0.913043 | 0.869565 | 0.857143 | 0.901099 | 0.885213 | 0.022692 | 100 |
| 1 | 0.000756 | 1.222157e-05 | 0.000448 | 1.361532e-05 | 0.000101011 | `{'gnb__var_smoothing': 0.00010101109090909092}` | 0.967391 | 0.978261 | 0.923077 | 0.956044 | 0.956193 | 0.020671 | 30 |
| 2 | 0.000773 | 4.195604e-05 | 0.00044 | 1.819651e-06 | 0.000202021 | `{'gnb__var_smoothing': 0.00020202118181818183}` | 0.978261 | 0.98913 | 0.934066 | 0.956044 | 0.964375 | 0.021176 | 8 |
| 3 | 0.000751 | 3.456556e-06 | 0.00044 | 2.440153e-06 | 0.000303031 | `{'gnb__var_smoothing': 0.0003030312727272728}` | 0.978261 | 0.98913 | 0.934066 | 0.956044 | 0.964375 | 0.021176 | 8 |
| 4 | 0.000772 | 4.034032e-05 | 0.000449 | 1.610991e-05 | 0.000404041 | `{'gnb__var_smoothing': 0.0004040413636363637}` | 0.98913 | 0.98913 | 0.934066 | 0.956044 | 0.967093 | 0.023368 | 1 |
| 5 | 0.000749 | 4.904995e-06 | 0.000438 | 1.572479e-06 | 0.000505051 | `{'gnb__var_smoothing': 0.0005050514545454546}` | 0.978261 | 0.978261 | 0.934066 | 0.956044 | 0.961658 | 0.018331 | 10 |
| 6 | 0.000748 | 1.784161e-06 | 0.000438 | 1.819651e-06 | 0.000606062 | `{'gnb__var_smoothing': 0.0006060615454545455}` | 0.978261 | 0.978261 | 0.934066 | 0.956044 | 0.961658 | 0.018331 | 10 |
| 7 | 0.000793 | 7.751886e-05 | 0.000443 | 4.975471e-06 | 0.000707072 | `{'gnb__var_smoothing': 0.0007070716363636365}` | 0.978261 | 0.978261 | 0.934066 | 0.967033 | 0.964405 | 0.018106 | 2 |
| 8 | 0.000744 | 6.35817e-06 | 0.000432 | 2.827925e-06 | 0.000808082 | `{'gnb__var_smoothing': 0.0008080817272727274}` | 0.978261 | 0.978261 | 0.934066 | 0.967033 | 0.964405 | 0.018106 | 2 |
| 9 | 0.000761 | 2.766394e-05 | 0.000536 | 0.0001344804 | 0.000909092 | `{'gnb__var_smoothing': 0.0009090918181818183}` | 0.978261 | 0.978261 | 0.934066 | 0.956044 | 0.961658 | 0.018331 | 10 |
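
The `var_smoothing` grid is linearly spaced between `1e-9` and `1e-2`, so it jumps from `1e-9` straight to roughly `1e-4` and never samples the orders of magnitude in between. The sketch below, assuming only NumPy, makes that visible and shows a logarithmic grid as one possible alternative. The `np.logspace` grid is a suggestion for comparison, not something the notebook uses, and `n_below_1e4` is an illustrative name.

```python
import numpy as np

# The grid used for GaussianNB: 100 linearly spaced values.
linear_grid = np.linspace(start=1e-9, stop=1e-2, num=100)

# A possible alternative (assumption): one value per decade, 1e-9 ... 1e-2.
log_grid = np.logspace(-9, -2, num=8)

# How many linear grid points fall below 1e-4?
n_below_1e4 = int((linear_grid < 1e-4).sum())

print("First linear values:", linear_grid[:3])
print("Linear points below 1e-4:", n_below_1e4)
print("Log grid:", log_grid)
```

With the linear grid only the first point lies below `1e-4`, which is also the worst-ranked candidate in the table, so a finer look at the range between `1e-9` and `1e-4` might be informative.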


## Multilayer Perceptron Classifier

In [75]:
pipe = Pipeline([('mlp', MLPClassifier(max_iter=1000))])

# Inner loop: 4-fold grid search over the L2 penalty (alpha) and the activation function
parameter_grid = {'mlp__alpha': np.linspace(start=0, stop=1e-1, num=50),
                  'mlp__activation': ['identity', 'logistic', 'tanh', 'relu']}

grid_search = GridSearchCV(pipe, param_grid=parameter_grid, scoring='accuracy', cv=4)
# Outer loop: stratified 5-fold cross-validation of the whole search (nested CV)
strat_k_fold = StratifiedKFold(n_splits=5, shuffle=True)
scores = cross_val_score(grid_search, X, y, cv=strat_k_fold)

print("Scores from 5-fold cross-validation:", scores)
print("Mean score:", scores.mean())
# Refit the search on the full dataset to report the best parameters
grid_search.fit(X, y)
print("Best parameter set:", grid_search.best_params_)
# Keep the fitted search object intact; store the results table under its own name
cv_results = pd.DataFrame(grid_search.cv_results_)
display(HTML(cv_results.to_html()))

Scores from 5-fold cross-validation: [0.98648649 0.98630137 0.97260274 0.95890411 0.95890411]
Mean score: 0.9726397630507219
Best parameter set: {'mlp__activation': 'identity', 'mlp__alpha': 0.010204081632653062}


| | mean_fit_time | std_fit_time | mean_score_time | std_score_time | param_mlp__activation | param_mlp__alpha | params | split0_test_score | split1_test_score | split2_test_score | split3_test_score | mean_test_score | std_test_score | rank_test_score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.481432 | 0.066465 | 0.000399 | 5.095064e-06 | identity | 0.0 | `{'mlp__activation': 'identity', 'mlp__alpha': 0.0}` | 0.978261 | 0.978261 | 0.956044 | 0.945055 | 0.964405 | 0.01439 | 135 |
| 1 | 0.427364 | 0.066748 | 0.000425 | 4.944596e-05 | identity | 0.00204082 | `{'mlp__activation': 'identity', 'mlp__alpha': 0.0020408163265306124}` | 0.98913 | 0.978261 | 0.956044 | 0.945055 | 0.967123 | 0.017451 | 79 |
| 2 | 0.39837 | 0.069465 | 0.000425 | 4.520732e-05 | identity | 0.00408163 | `{'mlp__activation': 'identity', 'mlp__alpha': 0.004081632653061225}` | 0.98913 | 0.98913 | 0.956044 | 0.945055 | 0.96984 | 0.019678 | 37 |
| 3 | 0.433742 | 0.041603 | 0.000441 | 7.289304e-05 | identity | 0.00612245 | `{'mlp__activation': 'identity', 'mlp__alpha': 0.006122448979591837}` | 0.98913 | 0.98913 | 0.956044 | 0.945055 | 0.96984 | 0.019678 | 37 |
| 4 | 0.382429 | 0.046235 | 0.000425 | 3.599938e-05 | identity | 0.00816327 | `{'mlp__activation': 'identity', 'mlp__alpha': 0.00816326530612245}` | 0.978261 | 0.978261 | 0.956044 | 0.945055 | 0.964405 | 0.01439 | 135 |
| 5 | 0.41063 | 0.044936 | 0.000455 | 5.547047e-05 | identity | 0.0102041 | `{'mlp__activation': 'identity', 'mlp__alpha': 0.010204081632653062}` | 1.0 | 1.0 | 0.956044 | 0.945055 | 0.975275 | 0.025029 | 1 |
| 6 | 0.372168 | 0.058613 | 0.000509 | 9.389506e-05 | identity | 0.0122449 | `{'mlp__activation': 'identity', 'mlp__alpha': 0.012244897959183675}` | 0.98913 | 0.978261 | 0.956044 | 0.945055 | 0.967123 | 0.017451 | 79 |
| 7 | 0.410256 | 0.04923 | 0.000421 | 4.190961e-05 | identity | 0.0142857 | `{'mlp__activation': 'identity', 'mlp__alpha': 0.014285714285714287}` | 0.98913 | 0.98913 | 0.956044 | 0.945055 | 0.96984 | 0.019678 | 37 |
| 8 | 0.438304 | 0.090868 | 0.000438 | 3.962091e-05 | identity | 0.0163265 | `{'mlp__activation': 'identity', 'mlp__alpha': 0.0163265306122449}` | 0.98913 | 0.98913 | 0.956044 | 0.945055 | 0.96984 | 0.019678 | 37 |
| 9 | 0.406581 | 0.054361 | 0.000432 | 5.221993e-05 | identity | 0.0183673 | `{'mlp__activation': 'identity', 'mlp__alpha': 0.018367346938775512}` | 0.978261 | 0.978261 | 0.956044 | 0.945055 | 0.964405 | 0.01439 | 135 |
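
The MLP cell is by far the slowest one, and a quick count of model fits shows why. The sketch below uses only the standard library and assumes the default `refit=True` of `GridSearchCV`, which adds one extra fit per search; the variable names are illustrative.

```python
from itertools import product

# The MLP grid: 50 alpha values times 4 activation functions.
alphas = [i * 0.1 / 49 for i in range(50)]  # same points as linspace(0, 0.1, 50)
activations = ["identity", "logistic", "tanh", "relu"]
n_candidates = len(list(product(activations, alphas)))

inner_folds = 4  # cv=4 inside GridSearchCV
outer_folds = 5  # StratifiedKFold(n_splits=5) passed to cross_val_score

# One search = every candidate on every inner fold, plus one refit of the best.
fits_per_search = n_candidates * inner_folds + 1

# cross_val_score runs one search per outer fold; the final
# grid_search.fit(X, y) on the full data runs one more.
total_fits = (outer_folds + 1) * fits_per_search

print(n_candidates, fits_per_search, total_fits)  # 200 801 4806
```

At roughly 0.4 s per fit, as reported in `mean_fit_time`, this comes to about half an hour of sequential training, so setting `n_jobs` on `GridSearchCV` or coarsening the `alpha` grid may be worth considering.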
