# Wi-Fi localization. Classification models

Sîrbu Matei-Dan, _group 10LF383_

<i>Dataset source:</i> http://archive.ics.uci.edu/ml/datasets/Wireless+Indoor+Localization

<i>Dataset description:</i> [DOI 10.1007/978-981-10-3322-3_27 via ResearchGate](Docs/chp_10.1007_978-981-10-3322-3_27.pdf)

<i>Synopsis:</i> The _Wireless Indoor Localization_ dataset contains 2000 measurements of the signal strength (in dBm) received from the routers of an office in Pittsburgh. The office has seven routers and four rooms; using a smartphone, a user records the strength of the signals from the seven routers once per second, and each record is labeled with the room the user was in at the time of the measurement (1, 2, 3 or 4).

The figure below shows a sample from the dataset: <br><br>
![Sample](./Images/wifi_localization_sample.png)

In what follows, the Class column (the room) is denoted by y, and the columns WS1 - WS7 (the features: the signal strength from each router) by X.

In [1]:
import numpy as np
import pandas as pd
from IPython.display import display, HTML

In [2]:
header = ['WS1', 'WS2', 'WS3', 'WS4', 'WS5', 'WS6', 'WS7', 'Class']
data_wifi = pd.read_csv("./Datasets/wifi_localization.txt", names=header, sep='\t')
display(HTML("<i>Dataset overview:</i>"))
display(data_wifi)
X = data_wifi.values[:, :7]
y = data_wifi.values[:, -1]
folds = 5

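| | WS1 | WS2 | WS3 | WS4 | WS5 | WS6 | WS7 | Class |
|---|---|---|---|---|---|---|---|---|
| 0 | -64 | -56 | -61 | -66 | -71 | -82 | -81 | 1 |
| 1 | -68 | -57 | -61 | -65 | -71 | -85 | -85 | 1 |
| 2 | -63 | -60 | -60 | -67 | -76 | -85 | -84 | 1 |
| 3 | -61 | -60 | -68 | -62 | -77 | -90 | -80 | 1 |
| 4 | -63 | -65 | -60 | -63 | -77 | -81 | -87 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1995 | -59 | -59 | -48 | -66 | -50 | -86 | -94 | 4 |
| 1996 | -59 | -56 | -50 | -62 | -47 | -87 | -90 | 4 |
| 1997 | -62 | -59 | -46 | -65 | -45 | -87 | -88 | 4 |
| 1998 | -62 | -58 | -52 | -61 | -41 | -90 | -85 | 4 |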


# Classification models
### [1. <i>k</i>-nearest neighbors classifier](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html#sklearn.neighbors.KNeighborsClassifier)

class `sklearn.neighbors.KNeighborsClassifier`<i>(n_neighbors=5, weights='uniform', algorithm='auto', leaf_size=30, p=2, metric='minkowski', metric_params=None, n_jobs=None, **kwargs)</i>

In [3]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_validate
from sklearn.preprocessing import StandardScaler

# hyperparameters
knn_neighbors = 4

# KNN implementation
model = KNeighborsClassifier(n_neighbors=knn_neighbors)
model_acc = cross_validate(model, X, y, cv=folds, scoring='accuracy', return_train_score=True)
model_f1 = cross_validate(model, X, y, cv=folds, scoring='f1_macro', return_train_score=True)

# statistics
display(HTML(f"<h4>{folds}-fold cross validation for {knn_neighbors}-nearest neighbors classification:</h4>"))
print(f"Test accuracy: {model_acc['test_score']} \n=> Average test accuracy: {round(model_acc['test_score'].mean() * 100, 3)}%")
print(f"Train accuracy: {model_acc['train_score']} \n=> Average train accuracy: {round(model_acc['train_score'].mean() * 100, 3)}%")
print(f"Test F1 score: {model_f1['test_score']} \n=> Average test F1 score: {round(model_f1['test_score'].mean() * 100, 3)}%")
print(f"Train F1 score: {model_f1['train_score']} \n=> Average train F1 score: {round(model_f1['train_score'].mean() * 100, 3)}%")

Test accuracy: [0.965  0.98   0.975  0.9825 0.985 ] 
=> Average test accuracy: 97.75%
Train accuracy: [0.993125 0.991875 0.993125 0.99     0.99125 ] 
=> Average train accuracy: 99.188%
Test F1 score: [0.96489838 0.97996795 0.97488398 0.98255407 0.98503657] 
=> Average test F1 score: 97.747%
Train F1 score: [0.99311709 0.99188648 0.99312342 0.98998978 0.99124912] 
=> Average train F1 score: 99.187%
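
The cell above imports `StandardScaler` without using it and fits the model twice, once per metric. As a minimal sketch, assuming one wants scaling inside each fold and both metrics from a single run, `cross_validate` accepts a list of scorers and a pipeline. The synthetic arrays `X_demo` and `y_demo` are hypothetical stand-ins, not the real dataset, and `knn_pipeline` is an illustrative name.

```python
import numpy as np
from sklearn.model_selection import cross_validate
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the Wi-Fi data (assumption, NOT the real dataset):
# 4 rooms, 7 signal strengths drawn in a dBm-like range.
rng = np.random.default_rng(0)
room_means = rng.uniform(-90, -40, size=(4, 7))
X_demo = np.vstack([rng.normal(m, 3.0, size=(100, 7)) for m in room_means])
y_demo = np.repeat([1, 2, 3, 4], 100)

# The scaler is refit on the training part of every fold, so no
# information from the test fold leaks into the scaling.
knn_pipeline = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=4))

# One call, two metrics: result keys become 'test_<metric>' / 'train_<metric>'.
scores = cross_validate(knn_pipeline, X_demo, y_demo, cv=5,
                        scoring=['accuracy', 'f1_macro'],
                        return_train_score=True)

print(f"Average test accuracy: {scores['test_accuracy'].mean() * 100:.3f}%")
print(f"Average test F1 score: {scores['test_f1_macro'].mean() * 100:.3f}%")
```

Because both metrics come from the same fitted models, the accuracy and F1 values of a fold always describe the same predictions.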


### [2. Decision tree classifier](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html#sklearn.tree.DecisionTreeClassifier)

class `sklearn.tree.DecisionTreeClassifier`<i>(criterion='gini', splitter='best', max_depth=None, min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features=None, random_state=None, max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, class_weight=None, presort='deprecated', ccp_alpha=0.0)</i>

In [4]:
from sklearn.tree import DecisionTreeClassifier

# decision tree implementation
model = DecisionTreeClassifier(random_state=42)
model_acc = cross_validate(model, X, y, cv=folds, scoring='accuracy', return_train_score=True)
model_f1 = cross_validate(model, X, y, cv=folds, scoring='f1_macro', return_train_score=True)

display(HTML(f"<h4>{folds}-fold cross validation for Decision Trees classification:</h4>"))
print(f"Test accuracy: {model_acc['test_score']} \n=> Average test accuracy: {round(model_acc['test_score'].mean() * 100, 3)}%")
print(f"Train accuracy: {model_acc['train_score']} \n=> Average train accuracy: {round(model_acc['train_score'].mean() * 100, 3)}%")
print(f"Test F1 score: {model_f1['test_score']} \n=> Average test F1 score: {round(model_f1['test_score'].mean() * 100, 3)}%")
print(f"Train F1 score: {model_f1['train_score']} \n=> Average train F1 score: {round(model_f1['train_score'].mean() * 100, 3)}%")

Test accuracy: [0.96   0.9325 0.9275 0.9825 0.9725] 
=> Average test accuracy: 95.5%
Train accuracy: [1. 1. 1. 1. 1.] 
=> Average train accuracy: 100.0%
Test F1 score: [0.95972496 0.93246313 0.92562893 0.98251804 0.97253289] 
=> Average test F1 score: 95.457%
Train F1 score: [1. 1. 1. 1. 1.] 
=> Average train F1 score: 100.0%


### [3. Multilayer perceptron (MLP) classifier](https://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPClassifier.html#sklearn.neural_network.MLPClassifier)

class `sklearn.neural_network.MLPClassifier`<i>(hidden_layer_sizes=(100, ), activation='relu', solver='adam', alpha=0.0001, batch_size='auto', learning_rate='constant', learning_rate_init=0.001, power_t=0.5, max_iter=200, shuffle=True, random_state=None, tol=0.0001, verbose=False, warm_start=False, momentum=0.9, nesterovs_momentum=True, early_stopping=False, validation_fraction=0.1, beta_1=0.9, beta_2=0.999, epsilon=1e-08, n_iter_no_change=10, max_fun=15000)</i>

In [5]:
from sklearn.neural_network import MLPClassifier

# Solvers for weight optimization:
# - lbfgs (quasi-Newton optimizer)
# - sgd (stochastic gradient descent)
# - adam (stochastic gradient-based optimizer proposed by Diederik Kingma and Jimmy Ba)
#
# Activation functions for the hidden layer:
# - identity (linear)
# - logistic (sigmoid)
# - tanh (hyperbolic tangent)
# - relu (rectified linear unit)

# hyperparameters
mlp_solver = 'adam'
mlp_activation = 'relu'
mlp_alpha = 1e-3
mlp_hidden_layer_sizes = (50,50)

model = MLPClassifier(solver=mlp_solver, activation=mlp_activation, alpha=mlp_alpha, hidden_layer_sizes=mlp_hidden_layer_sizes, random_state=42)

model_acc = cross_validate(model, X, y, cv=folds, scoring='accuracy', return_train_score=True)
model_f1 = cross_validate(model, X, y, cv=folds, scoring='f1_macro', return_train_score=True)

display(HTML(f"<h4>{folds}-fold cross validation for MLP classification</h4>"))
display(HTML(f"using hyperparameters - Solver: <b>{mlp_solver}</b>, Activation function: <b>{mlp_activation}</b>, Parameter for regularization (α): <b>{mlp_alpha}</b>, Hidden layer sizes: <b>{mlp_hidden_layer_sizes}</b>"))
print(f"\nTest accuracy: {model_acc['test_score']} \n=> Average test accuracy: {round(model_acc['test_score'].mean() * 100, 3)}%")
print(f"Train accuracy: {model_acc['train_score']} \n=> Average train accuracy: {round(model_acc['train_score'].mean() * 100, 3)}%")
print(f"Test F1 score: {model_f1['test_score']} \n=> Average test F1 score: {round(model_f1['test_score'].mean() * 100, 3)}%")
print(f"Train F1 score: {model_f1['train_score']} \n=> Average train F1 score: {round(model_f1['train_score'].mean() * 100, 3)}%")


Test accuracy: [0.9725 0.9975 0.95   0.975  0.9775] 
=> Average test accuracy: 97.45%
Train accuracy: [0.98125  0.971875 0.983125 0.98125  0.981875] 
=> Average train accuracy: 97.988%
Test F1 score: [0.97244414 0.99749994 0.94983905 0.97514624 0.97746665] 
=> Average test F1 score: 97.448%
Train F1 score: [0.98124085 0.97171261 0.98314044 0.98122406 0.9819253 ] 
=> Average train F1 score: 97.985%


### [4. Gaussian Naïve Bayes classifier](https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.GaussianNB.html#sklearn.naive_bayes.GaussianNB)

class `sklearn.naive_bayes.GaussianNB`<i>(priors=None, var_smoothing=1e-09)</i>

The _Gaussian Naïve Bayes_ classification algorithm belongs to the family of _Naïve Bayes_ classifiers, which assume that the presence of a feature in a class is not affected by the presence of other features; in short, the features contribute independently to the probability of belonging to a class. In particular, _Gaussian Naïve Bayes_ models the likelihood of each feature with the probability density function (PDF) of a normal (Gaussian) distribution:
$$\large P(x_i | y) = \frac{1}{\sqrt{2 \pi \sigma_y^2}} \exp \bigg(-\frac{(x_i - \mu_y)^2}{2 \sigma_y^2}\bigg),$$
where the parameters $\sigma_y$ and $\mu_y$, the standard deviation and the mean, are determined using maximum likelihood estimation (MLE), a method of estimating the parameters of a PDF by maximizing a likelihood function (how well a sample fits a statistical model).
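
To make the formula concrete, here is a minimal NumPy sketch of the same idea, assuming one mean and one variance per class and per feature, estimated by MLE. The function names `fit_gaussian_nb` and `predict_gaussian_nb` and the toy arrays are hypothetical, and the sketch omits the `var_smoothing` term that `GaussianNB` adds to the variances.

```python
import numpy as np

def fit_gaussian_nb(X, y):
    """Estimate per-class priors, means and variances (MLE, ddof=0)."""
    classes = np.unique(y)
    means = np.array([X[y == c].mean(axis=0) for c in classes])
    variances = np.array([X[y == c].var(axis=0) for c in classes])
    priors = np.array([(y == c).mean() for c in classes])
    return classes, means, variances, priors

def predict_gaussian_nb(X, classes, means, variances, priors):
    """Pick the class maximizing log P(y) + sum_i log P(x_i | y)."""
    diff = X[:, None, :] - means[None, :, :]            # (samples, classes, features)
    log_pdf = -0.5 * (np.log(2 * np.pi * variances)[None, :, :]
                      + diff ** 2 / variances[None, :, :])
    log_posterior = np.log(priors)[None, :] + log_pdf.sum(axis=2)
    return classes[np.argmax(log_posterior, axis=1)]

# Toy data (assumption, not the Wi-Fi dataset): two rooms, two routers.
rng = np.random.default_rng(0)
X_demo = np.vstack([rng.normal(-60, 3, size=(50, 2)),
                    rng.normal(-45, 3, size=(50, 2))])
y_demo = np.repeat([1, 2], 50)

params = fit_gaussian_nb(X_demo, y_demo)
y_pred = predict_gaussian_nb(X_demo, *params)
train_accuracy = (y_pred == y_demo).mean()
print(f"Train accuracy of the sketch: {train_accuracy:.3f}")
```

Summing log-densities instead of multiplying densities avoids numerical underflow when there are many features.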

In [23]:
from sklearn.naive_bayes import GaussianNB

model = GaussianNB()
model_acc = cross_validate(model, X, y, cv=folds, scoring='accuracy', return_train_score=True)
model_f1 = cross_validate(model, X, y, cv=folds, scoring='f1_macro', return_train_score=True)

display(HTML(f"<h4>{folds}-fold cross validation for Gaussian NB classification</h4>"))
print(f"\nTest accuracy: {model_acc['test_score']} \n=> Average test accuracy: {round(model_acc['test_score'].mean() * 100, 3)}%")
print(f"Train accuracy: {model_acc['train_score']} \n=> Average train accuracy: {round(model_acc['train_score'].mean() * 100, 3)}%")
print(f"Test F1 score: {model_f1['test_score']} \n=> Average test F1 score: {round(model_f1['test_score'].mean() * 100, 3)}%")
print(f"Train F1 score: {model_f1['train_score']} \n=> Average train F1 score: {round(model_f1['train_score'].mean() * 100, 3)}%")


Test accuracy: [0.99   0.9725 0.98   0.98   0.985 ] 
=> Average test accuracy: 98.15%
Train accuracy: [0.98125  0.983125 0.9875   0.985    0.983125] 
=> Average train accuracy: 98.4%
Test F1 score: [0.99001188 0.97255265 0.9799862  0.98006098 0.9849985 ] 
=> Average test F1 score: 98.152%
Train F1 score: [0.98127765 0.98313802 0.98751983 0.98501642 0.9831722 ] 
=> Average train F1 score: 98.402%


### [5. Random Forest Classifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html#sklearn.ensemble.RandomForestClassifier)

class `sklearn.ensemble.RandomForestClassifier`<i>(n_estimators=100, criterion='gini', max_depth=None, min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features='auto', max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, bootstrap=True, oob_score=False, n_jobs=None, random_state=None, verbose=0, warm_start=False, class_weight=None, ccp_alpha=0.0, max_samples=None)</i>

A _Random forest_ classifier relies on the hypotheses produced by several _random trees_ (randomized decision trees), each obtained from a _random split_. A _random forest_ is built by constructing one _random tree_ for each training set (with `bootstrap=True`, a bootstrap sample of the training data). These trees work as an ensemble; each input is run through every model in the ensemble, and the final result is obtained by aggregating the individual results through voting.
<div style="text-align:center"><img style="width: 400px" src="./Images/wifi_rf_diagram.png"><br>A random forest model making a prediction; the vote yields the result 1.<br>source: Medium (Towards Data Science: Understanding Random Forest)</div>
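
The voting scheme described above can be sketched by hand, assuming bootstrap samples, a random feature subset at each split (`max_features='sqrt'`) and a hard majority vote. The synthetic data and the names `trees`, `votes` and `forest_pred` are hypothetical. Note that scikit-learn's `RandomForestClassifier` averages the trees' predicted class probabilities rather than counting hard votes, so this is an illustration of the idea, not of that implementation.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the Wi-Fi data (assumption, NOT the real dataset).
rng = np.random.default_rng(0)
room_means = rng.uniform(-90, -40, size=(4, 7))
X_demo = np.vstack([rng.normal(m, 3.0, size=(100, 7)) for m in room_means])
y_demo = np.repeat([1, 2, 3, 4], 100)

n_trees = 25
trees = []
for seed in range(n_trees):
    # Bootstrap sample: draw len(X_demo) row indices with replacement.
    idx = rng.integers(0, len(X_demo), size=len(X_demo))
    tree = DecisionTreeClassifier(max_features='sqrt', random_state=seed)
    trees.append(tree.fit(X_demo[idx], y_demo[idx]))

# Each row holds the predictions of one tree: shape (n_trees, n_samples).
votes = np.array([tree.predict(X_demo) for tree in trees]).astype(int)

# Majority vote per sample: the most frequent label in each column wins.
forest_pred = np.array([np.bincount(column).argmax() for column in votes.T])
vote_accuracy = (forest_pred == y_demo).mean()
print(f"Accuracy of the hand-built ensemble: {vote_accuracy:.3f}")
```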

In [26]:
from sklearn.ensemble import RandomForestClassifier

rfc_n_estimators = 150
rfc_criterion = 'gini' # 'gini' or 'entropy'

model = RandomForestClassifier(n_estimators=rfc_n_estimators, criterion=rfc_criterion)
model_acc = cross_validate(model, X, y, cv=folds, scoring='accuracy', return_train_score=True)
model_f1 = cross_validate(model, X, y, cv=folds, scoring='f1_macro', return_train_score=True)

display(HTML(f"<h4>{folds}-fold cross validation for Random Forest classification</h4>"))
print(f"\nTest accuracy: {model_acc['test_score']} \n=> Average test accuracy: {round(model_acc['test_score'].mean() * 100, 3)}%")
print(f"Train accuracy: {model_acc['train_score']} \n=> Average train accuracy: {round(model_acc['train_score'].mean() * 100, 3)}%")
print(f"Test F1 score: {model_f1['test_score']} \n=> Average test F1 score: {round(model_f1['test_score'].mean() * 100, 3)}%")
print(f"Train F1 score: {model_f1['train_score']} \n=> Average train F1 score: {round(model_f1['train_score'].mean() * 100, 3)}%")


Test accuracy: [0.985  0.95   0.975  0.985  0.9875] 
=> Average test accuracy: 97.65%
Train accuracy: [1. 1. 1. 1. 1.] 
=> Average train accuracy: 100.0%
Test F1 score: [0.9874995  0.96228787 0.97494678 0.9850456  0.98749969] 
=> Average test F1 score: 97.946%
Train F1 score: [1. 1. 1. 1. 1.] 
=> Average train F1 score: 100.0%


### <i> TODO: classification algorithms descriptions and nested cross validation for optimizing hyperparameters </i>
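
One possible shape for the nested cross validation mentioned above, as a rough sketch only: an inner `GridSearchCV` picks the hyperparameters on each outer training fold, and an outer loop scores the tuned model on data the search never saw. The grid of `n_neighbors` values, the fold counts and the synthetic data are all assumptions chosen for illustration.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the Wi-Fi data (assumption, NOT the real dataset).
rng = np.random.default_rng(0)
room_means = rng.uniform(-90, -40, size=(4, 7))
X_demo = np.vstack([rng.normal(m, 3.0, size=(100, 7)) for m in room_means])
y_demo = np.repeat([1, 2, 3, 4], 100)

# Inner loop: choose n_neighbors; outer loop: estimate generalization.
inner_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)

search = GridSearchCV(KNeighborsClassifier(),
                      param_grid={'n_neighbors': [1, 3, 5, 7]},  # hypothetical grid
                      cv=inner_cv, scoring='accuracy')

# cross_val_score clones and refits the whole search on every outer fold.
nested_scores = cross_val_score(search, X_demo, y_demo, cv=outer_cv, scoring='accuracy')
print(f"Nested CV accuracy: {nested_scores.mean() * 100:.3f}%")
```

Keeping the hyperparameter search inside the outer loop means the reported score is not biased by the tuning itself.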