# Localization System

## Table of Contents
<ul>
<li><a href="#intro">Introduction</a></li>
<li><a href="#wrangling">Data Wrangling</a></li>
<!-- <li><a href="#eda">Exploratory Data Analysis</a></li> -->
<li><a href="#model">Building the Model</a></li>
</ul>

<a id='intro'></a>
## Introduction 
This notebook builds an indoor localization system from machine learning classifiers. The data was collected at the Biomedical Engineering Department, Faculty of Engineering, Cairo University, and consists of RSSI readings (in dBm) for the Wi-Fi networks available in the department. RSSI measurements represent the relative quality of the signal a device receives: they indicate the power level after any loss at the antenna and cable. The higher the RSSI value, the stronger the signal. Because readings are negative, a number closer to zero usually means a better signal. For example, -50 dBm is a pretty good signal, -75 dBm is fairly reasonable, and -100 dBm is no signal at all.

Identifying a patient's location in a hospital is useful for many reasons: it helps spot points of congestion so that hospital services can be rearranged, and it shows how many patients are in the hospital.

We now dive into the gathered data and investigate our findings.
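
To make the RSSI scale above concrete, here is a minimal sketch that buckets a reading into a rough quality label. The thresholds come from the examples in the introduction; the function name `rssi_quality` and the label strings are illustrative assumptions, not part of the project code.

```python
def rssi_quality(rssi_dbm: float) -> str:
    """Map an RSSI reading in dBm to a rough quality label.

    The cut-offs follow the examples in the text (-50, -75, -100 dBm);
    they are illustrative, not a standard.
    """
    if rssi_dbm >= -50:
        return "good"
    if rssi_dbm >= -75:
        return "fair"
    if rssi_dbm > -100:
        return "weak"
    return "no signal"


for reading in (-47, -75, -88, -100):
    print(reading, rssi_quality(reading))
```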

In [1]:
# Basic imports
import pandas as pd 
import numpy as np

<a id="wrangling"></a>
## Data Wrangling 
Let's dive into the gathered data and uncover its secrets.

In [2]:
# Read our datasets 
df_ta = pd.read_csv('esp_csv_only/csv/Ta.csv')
df_lab = pd.read_csv('esp_csv_only/csv/Lab.csv')
df_ts = pd.read_csv('esp_csv_only/csv/ts.csv')
df_hall_4 = pd.read_csv('esp_csv_only/csv/hall_4.csv')
df_hall_5 = pd.read_csv('esp_csv_only/csv/hall_5.csv')
df_hall_6 = pd.read_csv('esp_csv_only/csv/hall_6.csv')
df_main_hall = pd.read_csv('esp_csv_only/csv/main_hall.csv')
# Show summary statistics
df_lab.describe()

Unnamed: 0,StudBME1,STUDBME2,SBME_STAFF3,SBME_STAFF,CUFE,RehabLab,lab001,BMEStudentLab3,CMP_LAB,CMP_LAB1,CMP_LAB2
count,149.0,149.0,149.0,149.0,149.0,149.0,149.0,149.0,149.0,149.0,149.0
mean,-81.624161,-57.053691,-87.214765,-77.194631,-68.946309,-87.946309,-84.315436,0.0,0.0,-58.57047,-60.677852
std,12.310442,4.658304,2.255815,27.815919,35.480826,2.189611,4.30444,0.0,0.0,30.234948,31.482356
min,-89.0,-73.0,-93.0,-89.0,-90.0,-94.0,-92.0,0.0,0.0,-85.0,-87.0
25%,-85.0,-60.0,-89.0,-88.0,-88.0,-89.0,-87.0,0.0,0.0,-74.0,-77.0
50%,-84.0,-58.0,-87.0,-87.0,-87.0,-88.0,-85.0,0.0,0.0,-73.0,-74.0
75%,-82.0,-54.0,-86.0,-86.0,-85.0,-87.0,-82.0,0.0,0.0,-71.0,-71.0
max,0.0,-47.0,-80.0,0.0,0.0,-78.0,-71.0,0.0,0.0,0.0,0.0


In [3]:
df_ta.describe()

Unnamed: 0,StudBME1,STUDBME2,SBME_STAFF3,SBME_STAFF,CUFE,RehabLab,lab001,BMEStudentLab3,CMP_LAB,CMP_LAB1,CMP_LAB2
count,151.0,151.0,151.0,151.0,151.0,151.0,151.0,151.0,151.0,151.0,151.0
mean,-78.980132,-69.589404,-71.913907,-84.940397,-83.238411,-68.847682,-86.317881,0.0,0.0,-77.874172,-75.317881
std,10.841568,9.522795,35.96942,10.426397,19.977557,37.952909,10.471785,0.0,0.0,20.161781,25.352021
min,-92.0,-85.0,-95.0,-95.0,-91.0,-93.0,-95.0,0.0,0.0,-90.0,-89.0
25%,-84.0,-73.0,-91.0,-88.0,-91.0,-90.0,-89.0,0.0,0.0,-86.5,-86.0
50%,-80.0,-70.0,-91.0,-86.0,-87.0,-90.0,-88.0,0.0,0.0,-82.0,-83.0
75%,-76.0,-67.0,-86.0,-85.0,-85.0,-86.5,-86.0,0.0,0.0,-80.0,-81.0
max,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


A first look at the data reveals the following:

- BMEStudentLab3 and CMP_LAB are out of range (every reading in the summaries above is zero), so their values won't benefit our model.
- There are a lot of zeros in the recorded data.

### The Zeros Problem
The recorder writes a zero whenever one of the specified Wi-Fi networks is not found or is out of range at that moment. According to the documentation of the ESP module, the RSSI of a Wi-Fi network ranges from -100 to -50 dBm: -100 for bad network connectivity (weak signal) and -50 for good network connectivity.
<img src="sources/wifi.jpeg">
So we can treat a zero as a mis-recorded value (it seems to be an error in the chip) and replace it with any method we like. We will stick with replacing it with the column mean for now.
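
The idea can be sketched on a toy table before applying it to the real recordings. The column names and readings below are made up for illustration; only the approach (mark zeros as missing, then fill with the column mean rounded to one decimal) follows what this notebook does.

```python
import numpy as np
import pandas as pd

# Toy readings (hypothetical values); 0 marks a network that was not seen.
toy = pd.DataFrame({
    "net_a": [-80, 0, -84, -82],
    "net_b": [-60, -58, 0, 0],
})

# Treat zeros as missing, then fill each column with its own mean.
cleaned = toy.replace(0, np.nan)
cleaned = cleaned.fillna(cleaned.mean().round(1))

print(cleaned)
```

The mean is pulled towards extreme readings; swapping `cleaned.mean()` for `cleaned.median()` is a drop-in alternative if that becomes a concern.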

## Imputing Mis-recorded Values

In [4]:
# Drop the two bad columns
bad_cols = [df_lab.BMEStudentLab3.name, df_lab.CMP_LAB.name]

# One DataFrame per location; the list order sets classes 1 to 7
dfs = [df_ta, df_lab, df_ts, df_hall_4, df_hall_5, df_hall_6, df_main_hall]

for df in dfs:
    df.drop(bad_cols, axis=1, inplace=True)

imp_cols = ['StudBME1', 'STUDBME2', 'SBME_STAFF3', 'SBME_STAFF', 'CUFE',
       'RehabLab', 'lab001', 'CMP_LAB1', 'CMP_LAB2']

# Check 
print([(df.columns.values == imp_cols).all() for df in dfs])

[True, True, True, True, True, True, True]


In [5]:
(df.columns.values == imp_cols).all()

True

In [6]:
# Replace zeros with NaN so they count as missing values
cols = df_lab.columns

for df in dfs:
    df[cols] = df[cols].replace({0: np.nan})

# Check 
print([(df[imp_cols].isnull().sum().values > 0).any() for df in dfs])

[True, True, False, False, False, False, True]


In [7]:
# Collect, for each DataFrame, the columns that contain missing values
miss = []

for df in dfs:
    missing = [col for col in df if df[col].isnull().sum() > 0]
    miss.append(missing)

miss

[['StudBME1',
  'STUDBME2',
  'SBME_STAFF3',
  'SBME_STAFF',
  'CUFE',
  'RehabLab',
  'lab001',
  'CMP_LAB1',
  'CMP_LAB2'],
 ['StudBME1', 'SBME_STAFF', 'CUFE', 'CMP_LAB1', 'CMP_LAB2'],
 [],
 [],
 [],
 [],
 ['CUFE', 'RehabLab', 'lab001']]

In [61]:
# Replace hall 4 with the data from the v2 CSV file
df_hall_4 = pd.read_csv('esp_csv_only/csv/hall_4_v2.csv')
dfs[3] = df_hall_4

In [62]:
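# Replace missing values with the column mean (rounded to one decimal)
for idx, df_cols in enumerate(miss):
    for df_col in df_cols:
        col_mean = round(dfs[idx][df_col].mean(), 1)
        # Assign back instead of chained inplace fillna, which newer pandas warns about
        dfs[idx][df_col] = dfs[idx][df_col].fillna(col_mean)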

In [63]:
# Set Classes for each location
for idx, df in enumerate(dfs):
    df['location'] = idx+1

In [64]:
df_lab.describe()

Unnamed: 0,StudBME1,STUDBME2,SBME_STAFF3,SBME_STAFF,CUFE,RehabLab,lab001,CMP_LAB1,CMP_LAB2,location
count,149.0,149.0,149.0,149.0,149.0,149.0,149.0,149.0,149.0,149.0
mean,-83.301342,-57.053691,-87.214765,-87.132215,-87.067785,-87.946309,-84.315436,-73.966443,-76.614765,2.0
std,3.704216,4.658304,2.255815,1.044824,1.275087,2.189611,4.30444,2.613417,4.165372,0.0
min,-89.0,-73.0,-93.0,-89.0,-90.0,-94.0,-92.0,-85.0,-87.0,2.0
25%,-85.0,-60.0,-89.0,-88.0,-88.0,-89.0,-87.0,-74.0,-77.0,2.0
50%,-84.0,-58.0,-87.0,-87.1,-87.0,-88.0,-85.0,-73.0,-76.0,2.0
75%,-82.0,-54.0,-86.0,-86.0,-87.0,-87.0,-82.0,-73.0,-74.0,2.0
max,-73.0,-47.0,-80.0,-85.0,-84.0,-78.0,-71.0,-69.0,-69.0,2.0


In [65]:
df_ta.describe()

Unnamed: 0,StudBME1,STUDBME2,SBME_STAFF3,SBME_STAFF,CUFE,RehabLab,lab001,CMP_LAB1,CMP_LAB2,location
count,151.0,151.0,151.0,151.0,151.0,151.0,151.0,151.0,151.0,151.0
mean,-80.039735,-70.523179,-89.735099,-86.080795,-87.895364,-89.615894,-87.476821,-82.809272,-83.622517,1.0
std,5.766433,5.024718,1.741846,3.349661,2.983764,0.84586,2.996853,4.424076,3.588763,0.0
min,-92.0,-85.0,-95.0,-95.0,-91.0,-93.0,-95.0,-90.0,-89.0,1.0
25%,-84.0,-73.0,-91.0,-88.0,-91.0,-90.0,-89.0,-86.5,-86.0,1.0
50%,-80.0,-70.5,-91.0,-86.0,-87.9,-90.0,-88.0,-82.8,-83.6,1.0
75%,-76.5,-68.0,-88.0,-85.0,-85.0,-89.6,-86.5,-80.0,-82.0,1.0
max,-62.0,-54.0,-86.0,-71.0,-82.0,-86.0,-78.0,-69.0,-70.0,1.0


In [66]:
df_ts.head()

Unnamed: 0,StudBME1,STUDBME2,SBME_STAFF3,SBME_STAFF,CUFE,RehabLab,lab001,CMP_LAB1,CMP_LAB2,location
0,-81,-55,-55,-88,-73,-50,-86,-62,-63,3
1,-83,-55,-58,-88,-72,-61,-86,-62,-63,3
2,-84,-70,-60,-88,-72,-55,-86,-58,-63,3
3,-84,-72,-58,-88,-71,-47,-86,-58,-63,3
4,-84,-71,-59,-88,-71,-57,-86,-67,-63,3


In [67]:
df_hall_4.head()

Unnamed: 0,StudBME1,STUDBME2,SBME_STAFF3,SBME_STAFF,CUFE,RehabLab,lab001,CMP_LAB1,CMP_LAB2,location
0,-64,-46,-84,-79,-76,-86,-79,-66,-67,4
1,-64,-54,-86,-79,-76,-84,-69,-66,-67,4
2,-63,-50,-86,-69,-76,-88,-69,-69,-67,4
3,-63,-48,-85,-74,-76,-86,-69,-69,-67,4
4,-63,-52,-87,-69,-76,-87,-69,-69,-69,4


In [68]:
df_hall_5.head()

Unnamed: 0,StudBME1,STUDBME2,SBME_STAFF3,SBME_STAFF,CUFE,RehabLab,lab001,CMP_LAB1,CMP_LAB2,location
0,-60,-55,-63,-86,-73,-72,-86,-55,-58,5
1,-60,-54,-70,-74,-68,-79,-86,-51,-53,5
2,-60,-54,-76,-74,-76,-72,-86,-53,-53,5
3,-60,-54,-68,-74,-68,-71,-88,-50,-53,5
4,-61,-57,-80,-78,-71,-72,-84,-53,-56,5


In [69]:
df_hall_6.head()

Unnamed: 0,StudBME1,STUDBME2,SBME_STAFF3,SBME_STAFF,CUFE,RehabLab,lab001,CMP_LAB1,CMP_LAB2,location
0,-74,-59,-84,-68,-73,-81,-83,-67,-67,6
1,-66,-58,-75,-71,-70,-81,-83,-56,-56,6
2,-66,-56,-77,-74,-65,-75,-83,-60,-61,6
3,-66,-55,-78,-74,-63,-80,-83,-60,-52,6
4,-73,-59,-76,-74,-68,-82,-86,-60,-52,6


In [70]:
df = pd.concat(dfs, ignore_index=True)
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1053 entries, 0 to 1052
Data columns (total 10 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   StudBME1     1053 non-null   float64
 1   STUDBME2     1053 non-null   float64
 2   SBME_STAFF3  1053 non-null   float64
 3   SBME_STAFF   1053 non-null   float64
 4   CUFE         1053 non-null   float64
 5   RehabLab     1053 non-null   float64
 6   lab001       1053 non-null   float64
 7   CMP_LAB1     1053 non-null   float64
 8   CMP_LAB2     1053 non-null   float64
 9   location     1053 non-null   int64  
dtypes: float64(9), int64(1)
memory usage: 82.4 KB


In [71]:
# Shuffle the rows
df = df.sample(frac=1)

In [72]:
df.head()

Unnamed: 0,StudBME1,STUDBME2,SBME_STAFF3,SBME_STAFF,CUFE,RehabLab,lab001,CMP_LAB1,CMP_LAB2,location
909,-51.0,-56.0,-87.0,-59.0,-83.0,-87.9,-90.0,-72.0,-70.0,7
934,-65.0,-66.0,-89.0,-75.0,-87.0,-88.0,-87.0,-76.0,-72.0,7
203,-82.0,-49.0,-85.0,-88.0,-84.0,-88.0,-77.0,-75.0,-74.0,2
558,-61.0,-50.0,-85.0,-72.0,-76.0,-82.0,-71.0,-58.0,-58.0,4
266,-78.0,-52.0,-86.0,-88.0,-85.0,-85.0,-85.0,-71.0,-69.0,2


<a id='model'></a>
## Building the Model 

In [73]:
# Imports
from sklearn.model_selection import train_test_split, GridSearchCV, cross_validate
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn import datasets
from sklearn.svm import SVC
from sklearn.metrics import precision_score, recall_score, confusion_matrix, roc_auc_score, classification_report

In [74]:
# Get the feature and target variables
X = df.iloc[:, :-1].values
y = df.iloc[:, -1].values

In [75]:
classifiers = {
    'knn': KNeighborsClassifier(5),
    'NB': GaussianNB(),
    'tree': DecisionTreeClassifier(max_depth=10),
    'forest': RandomForestClassifier(n_estimators=10, max_depth=10),
    'SV': SVC(probability=True, gamma=0.001),
    'LR': LogisticRegression(solver='newton-cg', max_iter=400)
}

In [76]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [77]:
# Check class balance
pd.Series(y_train).value_counts()

1    130
7    124
5    124
6    120
3    116
4    115
2    113
dtype: int64

In [78]:
X_train

array([[-63. , -51. , -76. , ..., -78. , -60. , -63. ],
       [-66. , -49. , -79. , ..., -87. , -66. , -66. ],
       [-87. , -80. , -57. , ..., -89. , -70. , -69. ],
       ...,
       [-68. , -73. , -91. , ..., -87. , -72. , -85. ],
       [-82. , -78. , -91. , ..., -89. , -87. , -86. ],
       [-83. , -75. , -89.7, ..., -87. , -83. , -82. ]])

In [81]:
# Metrics 
metrics = ['accuracy', 'f1_weighted', 'roc_auc_ovr_weighted', 'precision_weighted', 'recall_weighted']

# Cross Validate the models 
scores = {}
for name, clf in classifiers.items():
    print(f'Training {name} model .. ')
    res = cross_validate(clf, X_train, y_train, cv=10, scoring=metrics, return_estimator=True)
#     print(res)
    
    score = {"test_f1_weighted": np.mean(res['test_f1_weighted']), 
             'test_roc_auc_ovr_weighted': np.mean(res['test_roc_auc_ovr_weighted']), 
             'test_precision_weighted': np.mean(res['test_precision_weighted']),
             'test_recall_weighted': np.mean(res['test_recall_weighted']),
             'test_accuracy': np.mean(res['test_accuracy']),
             'estimator': res['estimator'][0]
            }
    
    print("test_f1_weighted", score['test_f1_weighted'])
    print('test_roc_auc_ovr', score['test_roc_auc_ovr_weighted'])
    print('test_precicison', score['test_precision_weighted'])
    print('test_recall', score['test_recall_weighted'])
    print('accuracy', score['test_accuracy'])
    print('\n')
    
    # Store this model's scores
    scores[name] = score

Training knn model .. 
test_f1_weighted 0.9365594299780587
test_roc_auc_ovr 0.9919528622301013
test_precicison 0.9416065196086205
test_recall 0.937016806722689
accuracy 0.937016806722689


Training NB model .. 
test_f1_weighted 0.9261603537774465
test_roc_auc_ovr 0.9954856396316742
test_precicison 0.9300739351684729
test_recall 0.9263165266106442
accuracy 0.9263165266106442


Training tree model .. 
test_f1_weighted 0.9147062440119681
test_roc_auc_ovr 0.9548992782464554
test_precicison 0.9214045903703589
test_recall 0.9157002801120449
accuracy 0.9157002801120449


Training forest model .. 
test_f1_weighted 0.9524873842370951
test_roc_auc_ovr 0.9952076366927981
test_precicison 0.9559917743321107
test_recall 0.9524929971988796
accuracy 0.9524929971988796


Training SV model .. 
test_f1_weighted 0.9472434117200589
test_roc_auc_ovr 0.9971417057601105
test_precicison 0.950362274047148
test_recall 0.9476610644257702
accuracy 0.9476610644257702


Training LR model .. 
test_f1_weighted 0.93524

### Decision

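For our application we decided to choose the model by a single weighted metric. We set out to use the F1 score; note that the cell below indexes `metrics[3]`, which is weighted precision. The random forest ranks first on both weighted F1 and weighted precision in the cross-validation results, so the selected model is the same either way.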
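
Weighted F1 is the per-class F1 averaged with each class weighted by how often it occurs. A minimal sketch on made-up labels (not taken from the dataset) shows that a manual computation agrees with scikit-learn's `f1_score(average="weighted")`:

```python
import numpy as np
from sklearn.metrics import f1_score

# Hypothetical labels for three locations; not taken from the dataset.
y_true = np.array([1, 1, 1, 2, 2, 3])
y_pred = np.array([1, 1, 2, 2, 2, 3])

# Per-class F1, then average weighted by how often each class occurs.
per_class = f1_score(y_true, y_pred, average=None)
support = np.bincount(y_true)[1:]  # counts for classes 1, 2, 3
manual = np.sum(per_class * support) / support.sum()

print(per_class)
print(manual, f1_score(y_true, y_pred, average="weighted"))
```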

In [82]:
# Find the best estimator according to one metric
# (metrics[3] is 'precision_weighted'; metrics[1] would select by weighted F1)
choosen = 'test_' + metrics[3]
best_metric = [(name, res[choosen]) for name, res in scores.items()]
best_estimator = max(best_metric, key=lambda x: x[1])
print(f"The best estimator is {best_estimator[0]} with Score {best_estimator[1]} ({choosen})")

The best estimator is forest with Score 0.9559917743321107 (test_precision_weighted)


In [83]:
# Fetch Estimator Object
# choosen_estimator = scores[best_estimator[0]]['estimator']
choosen_estimator = classifiers[best_estimator[0]]
choosen_estimator

RandomForestClassifier(max_depth=10, n_estimators=10)

In [84]:
# Train the chosen model
choosen_estimator.fit(X_train, y_train)

RandomForestClassifier(max_depth=10, n_estimators=10)

In [85]:
# Predict Using the choosen estimator 
res = choosen_estimator.predict(X_test)

#### Reporting the results 

In [86]:
# Print the confusion matrix
confusion_matrix(y_test, res)

array([[20,  0,  0,  0,  0,  0,  1],
       [ 0, 36,  0,  0,  0,  0,  0],
       [ 0,  0, 35,  0,  0,  0,  0],
       [ 0,  0,  0, 33,  3,  0,  0],
       [ 0,  0,  0,  0, 25,  2,  0],
       [ 0,  0,  0,  0,  1, 29,  0],
       [ 0,  0,  0,  0,  0,  0, 26]])

In [87]:
# Show The classification Report
print(classification_report(y_test, res))

              precision    recall  f1-score   support

           1       1.00      0.95      0.98        21
           2       1.00      1.00      1.00        36
           3       1.00      1.00      1.00        35
           4       1.00      0.92      0.96        36
           5       0.86      0.93      0.89        27
           6       0.94      0.97      0.95        30
           7       0.96      1.00      0.98        26

    accuracy                           0.97       211
   macro avg       0.97      0.97      0.97       211
weighted avg       0.97      0.97      0.97       211
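
The figures in the classification report can be recomputed from the confusion matrix above, which is a useful sanity check. The sketch below copies that matrix by hand and assumes scikit-learn's layout: rows are true classes, columns are predicted classes.

```python
import numpy as np

# Confusion matrix from the test set above (rows: true class, columns: predicted).
cm = np.array([
    [20,  0,  0,  0,  0,  0,  1],
    [ 0, 36,  0,  0,  0,  0,  0],
    [ 0,  0, 35,  0,  0,  0,  0],
    [ 0,  0,  0, 33,  3,  0,  0],
    [ 0,  0,  0,  0, 25,  2,  0],
    [ 0,  0,  0,  0,  1, 29,  0],
    [ 0,  0,  0,  0,  0,  0, 26],
])

correct = np.diag(cm)
recall = correct / cm.sum(axis=1)      # correct / true count per class
precision = correct / cm.sum(axis=0)   # correct / predicted count per class
accuracy = correct.sum() / cm.sum()

for label, (p, r) in enumerate(zip(precision, recall), start=1):
    print(f"location {label}: precision={p:.2f} recall={r:.2f}")
print(f"accuracy={accuracy:.2f}")
```

The off-diagonal entries show where the model errs: most mistakes are between locations 4, 5 and 6.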



In [88]:
# Save the model 
from joblib import dump

In [89]:
dump(choosen_estimator, 'child/model.joblib')

['child/model.joblib']
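
For completeness, here is a hedged sketch of the round trip: saving a model with `dump` and restoring it with `load` before predicting. It trains a stand-in forest on synthetic data and writes to a temporary directory, so nothing here touches `child/model.joblib`; the variable names are illustrative.

```python
import os
import tempfile

import numpy as np
from joblib import dump, load
from sklearn.ensemble import RandomForestClassifier

# Stand-in model trained on synthetic data (hypothetical, for illustration only).
rng = np.random.default_rng(0)
X_demo = rng.integers(-95, -45, size=(60, 9)).astype(float)
y_demo = rng.integers(1, 8, size=60)
model = RandomForestClassifier(n_estimators=10, max_depth=10, random_state=0)
model.fit(X_demo, y_demo)

# Save to a temporary file and load it back.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "model.joblib")
    dump(model, path)
    restored = load(path)

# The restored model expects one row of 9 RSSI values per prediction.
sample = X_demo[:1]
print(restored.predict(sample), model.predict(sample))
```

A joblib file should only be loaded from a trusted source, ideally with the same scikit-learn version that produced it.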

#### Tuning

In [90]:
hp_sv = {
    'C': [0.1, 0.01, 0.001],
    'kernel': ['linear', 'poly', 'rbf', 'sigmoid'],
    'gamma': ['scale', 'auto'],
    'max_iter': [1000, 2000, -1],
    
}
hp_sv

In [91]:
# hp_sv holds SVC hyperparameters, so the grid is paired with the SVC.
# Passing the random forest instead raises "Invalid parameter C".
gridres = GridSearchCV(classifiers['SV'], hp_sv, scoring=metrics, n_jobs=3, refit=metrics[3])

In [92]:
gridres.fit(X_train, y_train)
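
A parameter grid only works with an estimator that exposes every key in it, which `estimator.get_params().keys()` reports. A minimal sketch of a pre-check that can be run before fitting a grid search; the helper name `unknown_grid_keys` and the shortened grid are assumptions for illustration.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC


def unknown_grid_keys(estimator, param_grid):
    """Return the grid keys that the estimator does not accept."""
    valid = estimator.get_params().keys()
    return sorted(key for key in param_grid if key not in valid)


# Shortened, hypothetical SVC grid.
svc_grid = {"C": [0.1, 0.01], "kernel": ["linear", "rbf"], "gamma": ["scale", "auto"]}

print(unknown_grid_keys(RandomForestClassifier(), svc_grid))
print(unknown_grid_keys(SVC(), svc_grid))
```

An empty list means the grid and the estimator match.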

In [None]:
gridres.cv_results_

In [None]:
best_estimator = gridres.best_estimator_

In [None]:
pred = best_estimator.predict(X_test)

# Show the classification report for the tuned model's predictions
print(classification_report(y_test, pred))

In [None]:
dump(best_estimator, 'child/model.joblib')