# Localization System

## Table of Contents
<ul>
<li><a href="#intro">Introduction</a></li>
<li><a href="#wrangling">Data Wrangling</a></li>
<!-- <li><a href="#eda">Exploratory Data Analysis</a></li> -->
<li><a href="#model">Building the Model</a></li>
</ul>

<a id='intro'></a>
## Introduction 
This localization system is built using machine learning classifiers. The data was collected at the Biomedical Engineering Department, Faculty of Engineering, Cairo University. It is a collection of RSSI readings (in dBm) from the WiFi networks available in the department. RSSI measurements represent the relative quality of a received signal on a device: RSSI indicates the power level being received after any possible loss at the antenna and cable level. The higher the RSSI value, the stronger the signal. Because the values are negative, a number closer to zero usually means a better signal. For example, -50 is a pretty good signal, -75 is fairly reasonable, and -100 is no signal at all. Identifying a patient's location in a hospital is useful for many reasons, such as finding points of congestion so that hospital services can be rearranged, and counting how many patients are in the hospital.

So we are going to dive into our gathered data and investigate our findings.
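As a rough illustration of the RSSI scale described above, the helper below maps a reading to a qualitative label. The function name `describe_rssi` and the exact cut-offs are assumptions for illustration, derived from the example values in the text (-50, -75 and -100 dBm); it is not part of the notebook's pipeline.

```python
def describe_rssi(rssi_dbm: float) -> str:
    """Map an RSSI reading in dBm to a rough quality label (illustrative cut-offs)."""
    if rssi_dbm >= -50:
        return "good"
    if rssi_dbm >= -75:
        return "fair"
    if rssi_dbm > -100:
        return "weak"
    return "no signal"


for reading in (-50, -75, -90, -100):
    print(reading, describe_rssi(reading))
```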

In [1]:
# Basic imports
import pandas as pd 
import numpy as np

<a id="wrangling"></a>
## Data Wrangling 
Let's dive into our gathered data and find its secrets.

In [2]:
# Read our datasets 
df_ta = pd.read_csv('esp_csv_only/csv/Ta.csv')
df_lab = pd.read_csv('esp_csv_only/csv/Lab.csv')
df_ts = pd.read_csv('esp_csv_only/csv/ts.csv')
df_hall_4 = pd.read_csv('esp_csv_only/csv/hall_4.csv')
df_hall_5 = pd.read_csv('esp_csv_only/csv/hall_5.csv')
df_hall_6 = pd.read_csv('esp_csv_only/csv/hall_6.csv')
df_main_hall = pd.read_csv('esp_csv_only/csv/main_hall.csv')
# Show summary statistics
df_lab.describe()

|  | StudBME1 | STUDBME2 | SBME_STAFF3 | SBME_STAFF | CUFE | RehabLab | lab001 | BMEStudentLab3 | CMP_LAB | CMP_LAB1 | CMP_LAB2 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 149.0 | 149.0 | 149.0 | 149.0 | 149.0 | 149.0 | 149.0 | 149.0 | 149.0 | 149.0 | 149.0 |
| mean | -81.624161 | -57.053691 | -87.214765 | -77.194631 | -68.946309 | -87.946309 | -84.315436 | 0.0 | 0.0 | -58.57047 | -60.677852 |
| std | 12.310442 | 4.658304 | 2.255815 | 27.815919 | 35.480826 | 2.189611 | 4.30444 | 0.0 | 0.0 | 30.234948 | 31.482356 |
| min | -89.0 | -73.0 | -93.0 | -89.0 | -90.0 | -94.0 | -92.0 | 0.0 | 0.0 | -85.0 | -87.0 |
| 25% | -85.0 | -60.0 | -89.0 | -88.0 | -88.0 | -89.0 | -87.0 | 0.0 | 0.0 | -74.0 | -77.0 |
| 50% | -84.0 | -58.0 | -87.0 | -87.0 | -87.0 | -88.0 | -85.0 | 0.0 | 0.0 | -73.0 | -74.0 |
| 75% | -82.0 | -54.0 | -86.0 | -86.0 | -85.0 | -87.0 | -82.0 | 0.0 | 0.0 | -71.0 | -71.0 |
| max | 0.0 | -47.0 | -80.0 | 0.0 | 0.0 | -78.0 | -71.0 | 0.0 | 0.0 | 0.0 | 0.0 |


In [3]:
df_ta.describe()

|  | StudBME1 | STUDBME2 | SBME_STAFF3 | SBME_STAFF | CUFE | RehabLab | lab001 | BMEStudentLab3 | CMP_LAB | CMP_LAB1 | CMP_LAB2 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 151.0 | 151.0 | 151.0 | 151.0 | 151.0 | 151.0 | 151.0 | 151.0 | 151.0 | 151.0 | 151.0 |
| mean | -78.980132 | -69.589404 | -71.913907 | -84.940397 | -83.238411 | -68.847682 | -86.317881 | 0.0 | 0.0 | -77.874172 | -75.317881 |
| std | 10.841568 | 9.522795 | 35.96942 | 10.426397 | 19.977557 | 37.952909 | 10.471785 | 0.0 | 0.0 | 20.161781 | 25.352021 |
| min | -92.0 | -85.0 | -95.0 | -95.0 | -91.0 | -93.0 | -95.0 | 0.0 | 0.0 | -90.0 | -89.0 |
| 25% | -84.0 | -73.0 | -91.0 | -88.0 | -91.0 | -90.0 | -89.0 | 0.0 | 0.0 | -86.5 | -86.0 |
| 50% | -80.0 | -70.0 | -91.0 | -86.0 | -87.0 | -90.0 | -88.0 | 0.0 | 0.0 | -82.0 | -83.0 |
| 75% | -76.0 | -67.0 | -86.0 | -85.0 | -85.0 | -86.5 | -86.0 | 0.0 | 0.0 | -80.0 | -81.0 |
| max | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |


A first look at the data reveals the following:

- BMEStudentLab3 and CMP_LAB are out of range (every value in the tables above is zero), so their values won't benefit our model.
- There are a lot of zeros in the recorded data.

### The Zeros Problem
The data is recorded in such a way that if one of the specified WiFi networks is not found or is out of range, the value for that moment is recorded as zero. According to the documentation of the ESP module, the RSSI of a WiFi network ranges from -100 to -50 dBm: -100 for bad network connectivity (signal strength) and -50 for good network connectivity.
<img src="sources/wifi.jpeg">
So we can treat a zero as a misrecorded value (it seems to be an error in the chip) and replace it with any method we want; we will stick with replacing it with the mean for now.
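To see why the zeros matter, here is a minimal sketch on made-up readings; the column name `net_a` and its values are hypothetical and not taken from the dataset. It marks zeros as missing and then fills them with the mean of the real readings, which is the same idea the cells below apply to each location's DataFrame.

```python
import numpy as np
import pandas as pd

# Toy readings for one hypothetical network; 0 marks a scan where it was not found.
toy = pd.DataFrame({"net_a": [-80.0, 0.0, -84.0, -82.0]})

# With the zero included, the mean is pulled toward 0 (a falsely "strong" signal).
mean_with_zeros = toy["net_a"].mean()

# Mark zeros as missing, then fill them with the mean of the real readings.
cleaned = toy.replace({0: np.nan})
mean_without_zeros = cleaned["net_a"].mean()  # NaN values are skipped by default
imputed = cleaned.fillna(cleaned.mean().round(1))

print(mean_with_zeros, mean_without_zeros)
print(imputed["net_a"].tolist())
```

Leaving the zero in place would shift the average of this toy column by more than 20 dBm, which is why the zeros are converted to NaN before any statistic is computed.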

## Imputing Misrecorded Values

In [4]:
# Drop the two bad columns
bad_cols = ['BMEStudentLab3', 'CMP_LAB']

# One DataFrame per location, in class order 1 to 7
dfs = [df_ta, df_lab, df_ts, df_hall_4, df_hall_5, df_hall_6, df_main_hall]

for df in dfs:
    df.drop(bad_cols, axis=1, inplace=True)

imp_cols = ['StudBME1', 'STUDBME2', 'SBME_STAFF3', 'SBME_STAFF', 'CUFE',
       'RehabLab', 'lab001', 'CMP_LAB1', 'CMP_LAB2']
# Check 
print([(df.columns.values == imp_cols).all() for df in dfs])


[True, True, True, True, True, True, True]


In [5]:
# Replace zeros with NaN so they can be imputed later
cols = df_lab.columns

for df in dfs:
    df[cols] = df[cols].replace({0: np.nan})

# Check 
print([(df[imp_cols].isnull().sum().values > 0).any() for df in dfs])

[True, True, False, False, False, False, True]


In [6]:
# Find the columns with missing values in each DataFrame
miss = []

for df in dfs:
    missing = [col for col in df if df[col].isnull().sum() > 0]
    miss.append(missing)

miss

[['StudBME1',
  'STUDBME2',
  'SBME_STAFF3',
  'SBME_STAFF',
  'CUFE',
  'RehabLab',
  'lab001',
  'CMP_LAB1',
  'CMP_LAB2'],
 ['StudBME1', 'SBME_STAFF', 'CUFE', 'CMP_LAB1', 'CMP_LAB2'],
 [],
 [],
 [],
 [],
 ['CUFE', 'RehabLab', 'lab001']]

In [7]:
# Replace missing values with the column mean (rounded to one decimal)
for idx, df_cols in enumerate(miss):
    for df_col in df_cols:
        col_mean = round(dfs[idx][df_col].mean(), 1)
        # Assign the result back; calling fillna(inplace=True) on a selected
        # column is chained assignment and may not update the DataFrame.
        dfs[idx][df_col] = dfs[idx][df_col].fillna(col_mean)


In [8]:
# Set Classes for each location
for idx, df in enumerate(dfs):
    df['location'] = idx+1
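The class ids are just positions in the `dfs` list, so a lookup table makes later outputs easier to read. The sketch below builds one from the CSV file stems; using the file stems as display names, and the variable names `location_files` and `location_names`, are assumptions for illustration.

```python
# File stems in the same order as the `dfs` list in the notebook.
location_files = ["Ta", "Lab", "ts", "hall_4", "hall_5", "hall_6", "main_hall"]

# Class ids start at 1 because the notebook assigns `idx + 1`.
location_names = {idx + 1: name for idx, name in enumerate(location_files)}

print(location_names)
```

With this mapping, a predicted class such as `7` can be reported as `main_hall`.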


In [9]:
df_lab.head()

|  | StudBME1 | STUDBME2 | SBME_STAFF3 | SBME_STAFF | CUFE | RehabLab | lab001 | CMP_LAB1 | CMP_LAB2 | location |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -83.3 | -65.0 | -88.0 | -87.1 | -87.1 | -87.0 | -84.0 | -74.0 | -76.6 | 2 |
| 1 | -83.3 | -61.0 | -90.0 | -87.1 | -87.1 | -89.0 | -87.0 | -74.0 | -76.6 | 2 |
| 2 | -83.3 | -55.0 | -90.0 | -87.1 | -87.1 | -88.0 | -87.0 | -74.0 | -76.6 | 2 |
| 3 | -87.0 | -55.0 | -88.0 | -87.1 | -87.1 | -87.0 | -91.0 | -74.0 | -76.6 | 2 |
| 4 | -89.0 | -60.0 | -89.0 | -87.1 | -87.1 | -87.0 | -87.0 | -74.0 | -76.6 | 2 |


In [10]:
df_ta.head()

|  | StudBME1 | STUDBME2 | SBME_STAFF3 | SBME_STAFF | CUFE | RehabLab | lab001 | CMP_LAB1 | CMP_LAB2 | location |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -80.0 | -70.5 | -89.7 | -86.1 | -87.9 | -89.6 | -87.5 | -81.0 | -86.0 | 1 |
| 1 | -81.0 | -70.5 | -89.7 | -84.0 | -88.0 | -89.6 | -87.5 | -83.0 | -83.0 | 1 |
| 2 | -69.0 | -69.0 | -88.0 | -82.0 | -88.0 | -89.6 | -82.0 | -82.0 | -84.0 | 1 |
| 3 | -73.0 | -73.0 | -90.0 | -84.0 | -88.0 | -89.6 | -90.0 | -82.0 | -84.0 | 1 |
| 4 | -73.0 | -67.0 | -90.0 | -76.0 | -89.0 | -93.0 | -87.0 | -82.0 | -84.0 | 1 |


In [11]:
df_ts.head()

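|  | StudBME1 | STUDBME2 | SBME_STAFF3 | SBME_STAFF | CUFE | RehabLab | lab001 | CMP_LAB1 | CMP_LAB2 | location |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -81 | -55 | -55 | -88 | -73 | -50 | -86 | -62 | -63 | 3 |
| 1 | -83 | -55 | -58 | -88 | -72 | -61 | -86 | -62 | -63 | 3 |
| 2 | -84 | -70 | -60 | -88 | -72 | -55 | -86 | -58 | -63 | 3 |
| 3 | -84 | -72 | -58 | -88 | -71 | -47 | -86 | -58 | -63 | 3 |
| 4 | -84 | -71 | -59 | -88 | -71 | -57 | -86 | -67 | -63 | 3 |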


In [12]:
df_hall_4.head()

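|  | StudBME1 | STUDBME2 | SBME_STAFF3 | SBME_STAFF | CUFE | RehabLab | lab001 | CMP_LAB1 | CMP_LAB2 | location |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -58 | -46 | -84 | -60 | -83 | -87 | -70 | -69 | -65 | 4 |
| 1 | -58 | -51 | -84 | -69 | -83 | -94 | -79 | -65 | -65 | 4 |
| 2 | -65 | -47 | -84 | -69 | -79 | -89 | -74 | -65 | -65 | 4 |
| 3 | -65 | -46 | -86 | -80 | -79 | -87 | -68 | -58 | -58 | 4 |
| 4 | -65 | -47 | -81 | -66 | -72 | -85 | -72 | -60 | -53 | 4 |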


In [13]:
df_hall_5.head()

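|  | StudBME1 | STUDBME2 | SBME_STAFF3 | SBME_STAFF | CUFE | RehabLab | lab001 | CMP_LAB1 | CMP_LAB2 | location |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -60 | -55 | -63 | -86 | -73 | -72 | -86 | -55 | -58 | 5 |
| 1 | -60 | -54 | -70 | -74 | -68 | -79 | -86 | -51 | -53 | 5 |
| 2 | -60 | -54 | -76 | -74 | -76 | -72 | -86 | -53 | -53 | 5 |
| 3 | -60 | -54 | -68 | -74 | -68 | -71 | -88 | -50 | -53 | 5 |
| 4 | -61 | -57 | -80 | -78 | -71 | -72 | -84 | -53 | -56 | 5 |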


In [14]:
df_hall_6.head()

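|  | StudBME1 | STUDBME2 | SBME_STAFF3 | SBME_STAFF | CUFE | RehabLab | lab001 | CMP_LAB1 | CMP_LAB2 | location |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -74 | -59 | -84 | -68 | -73 | -81 | -83 | -67 | -67 | 6 |
| 1 | -66 | -58 | -75 | -71 | -70 | -81 | -83 | -56 | -56 | 6 |
| 2 | -66 | -56 | -77 | -74 | -65 | -75 | -83 | -60 | -61 | 6 |
| 3 | -66 | -55 | -78 | -74 | -63 | -80 | -83 | -60 | -52 | 6 |
| 4 | -73 | -59 | -76 | -74 | -68 | -82 | -86 | -60 | -52 | 6 |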


In [15]:
df = pd.concat(dfs, ignore_index=True)
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1053 entries, 0 to 1052
Data columns (total 10 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   StudBME1     1053 non-null   float64
 1   STUDBME2     1053 non-null   float64
 2   SBME_STAFF3  1053 non-null   float64
 3   SBME_STAFF   1053 non-null   float64
 4   CUFE         1053 non-null   float64
 5   RehabLab     1053 non-null   float64
 6   lab001       1053 non-null   float64
 7   CMP_LAB1     1053 non-null   float64
 8   CMP_LAB2     1053 non-null   float64
 9   location     1053 non-null   int64  
dtypes: float64(9), int64(1)
memory usage: 82.4 KB


In [16]:
# Shuffle the rows so the locations are no longer grouped together
df = df.sample(frac=1)

In [17]:
df.head()

|  | StudBME1 | STUDBME2 | SBME_STAFF3 | SBME_STAFF | CUFE | RehabLab | lab001 | CMP_LAB1 | CMP_LAB2 | location |
|---|---|---|---|---|---|---|---|---|---|---|
| 601 | -68.0 | -50.0 | -77.0 | -79.0 | -81.0 | -85.0 | -79.0 | -70.0 | -68.0 | 4 |
| 430 | -84.0 | -72.0 | -54.0 | -87.0 | -80.0 | -58.0 | -90.0 | -58.0 | -55.0 | 3 |
| 647 | -68.0 | -47.0 | -79.0 | -71.0 | -82.0 | -81.0 | -86.0 | -60.0 | -59.0 | 5 |
| 812 | -74.0 | -58.0 | -66.0 | -80.0 | -70.0 | -73.0 | -85.0 | -48.0 | -46.0 | 6 |
| 602 | -60.0 | -55.0 | -63.0 | -86.0 | -73.0 | -72.0 | -86.0 | -55.0 | -58.0 | 5 |
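The shuffle above is unseeded, so the row order (and the head shown here) changes on every run. If repeatable results are wanted, one option is to pass a seed, as in this sketch on a toy frame; the seed value 42 and the toy data are arbitrary assumptions.

```python
import pandas as pd

toy = pd.DataFrame({"rssi": [-80, -70, -60, -50], "location": [1, 1, 2, 2]})

# A fixed random_state makes the shuffle repeatable;
# reset_index(drop=True) discards the old row labels.
shuffled_a = toy.sample(frac=1, random_state=42).reset_index(drop=True)
shuffled_b = toy.sample(frac=1, random_state=42).reset_index(drop=True)

print(shuffled_a.equals(shuffled_b))
```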


<a id='model'></a>
## Building the Model 

In [18]:
# Imports
from sklearn.model_selection import train_test_split, GridSearchCV, cross_validate
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn import datasets
from sklearn.svm import SVC
from sklearn.metrics import precision_score, recall_score, confusion_matrix, roc_auc_score, classification_report

In [19]:
# Get the feature and target variables
X = df.iloc[:, :-1].values
y = df.iloc[:, -1].values

In [20]:
classifiers = {
    'knn': KNeighborsClassifier(5),
    'NB': GaussianNB(),
    'tree': DecisionTreeClassifier(max_depth=5),
    'forest': RandomForestClassifier(n_estimators=10, max_depth=5),
    'SV': SVC(probability=True, gamma=0.001),
    'LR': LogisticRegression(solver='newton-cg', max_iter=300)
}

In [21]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [22]:
# Check class balance
pd.Series(y_train).value_counts()

4    126
7    121
1    121
5    120
3    120
6    118
2    116
dtype: int64
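The counts above are close to balanced but not identical, because the split is purely random. If exactly proportional classes are preferred, `train_test_split` accepts a `stratify` argument. The sketch below uses synthetic stand-in data, not the notebook's `X` and `y`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 70 samples, 7 classes with 10 samples each.
X_toy = np.arange(140).reshape(70, 2)
y_toy = np.repeat(np.arange(1, 8), 10)

# stratify=y_toy keeps the class proportions the same in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

# Per-class counts for classes 1..7 in the train and test sets.
print(np.bincount(y_tr)[1:], np.bincount(y_te)[1:])
```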

In [23]:
X_train

array([[-66., -48., -81., ..., -74., -65., -68.],
       [-63., -49., -85., ..., -72., -67., -66.],
       [-72., -58., -77., ..., -87., -53., -51.],
       ...,
       [-68., -59., -55., ..., -84., -51., -50.],
       [-65., -57., -72., ..., -79., -50., -46.],
       [-66., -53., -90., ..., -80., -66., -66.]])

In [24]:
# Metrics 
metrics = ['accuracy', 'f1_weighted', 'roc_auc_ovr_weighted', 'precision_weighted', 'recall_weighted']

# Cross Validate the models 
scores = {}
for name, clf in classifiers.items():
    print(f'Training {name} model .. ')
    res = cross_validate(clf, X_train, y_train, cv=10, scoring=metrics, return_estimator=True)
#     print(res)
    
    score = {"test_f1_weighted": np.mean(res['test_f1_weighted']), 
             'test_roc_auc_ovr_weighted': np.mean(res['test_roc_auc_ovr_weighted']), 
             'test_precision_weighted': np.mean(res['test_precision_weighted']),
             'test_recall_weighted': np.mean(res['test_recall_weighted']),
             'test_accuracy': np.mean(res['test_accuracy']),
             'estimator': res['estimator'][0]
            }
    
    print("test_f1_weighted", score['test_f1_weighted'])
    print('test_roc_auc_ovr', score['test_roc_auc_ovr_weighted'])
    print('test_precicison', score['test_precision_weighted'])
    print('test_recall', score['test_recall_weighted'])
    print('accuracy', score['test_accuracy'])
    print('\n')
    
    # Add each model's scores to scores
    scores[name] = score

Training knn model .. 
test_f1_weighted 0.9281919901034748
test_roc_auc_ovr 0.9902318746472781
test_precicison 0.931598036417364
test_recall 0.9287254901960784
accuracy 0.9287254901960784


Training NB model .. 
test_f1_weighted 0.9160060416662805
test_roc_auc_ovr 0.9932245770786716
test_precicison 0.9204393097565365
test_recall 0.9168347338935574
accuracy 0.9168347338935574


Training tree model .. 
test_f1_weighted 0.8782088122945835
test_roc_auc_ovr 0.9724831008950898
test_precicison 0.8884397871036528
test_recall 0.8776890756302521
accuracy 0.8776890756302521


Training forest model .. 
test_f1_weighted 0.9099732416739077
test_roc_auc_ovr 0.9931860427052431
test_precicison 0.9153927421201582
test_recall 0.9109103641456582
accuracy 0.9109103641456582


Training SV model .. 
test_f1_weighted 0.9352879971900251
test_roc_auc_ovr 0.9948248824437667
test_precicison 0.9376417583350356
test_recall 0.9358263305322128
accuracy 0.9358263305322128


Training LR model .. 




test_f1_weighted 0.9211663567288338
test_roc_auc_ovr 0.9946802213001666
test_precicison 0.9253511045443817
test_recall 0.9215686274509803
accuracy 0.9215686274509803
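The loop above reports only the mean of each metric over the 10 folds. Since several models score within about a percentage point of each other, the spread across folds can help judge whether the differences are meaningful. This is a minimal sketch on synthetic data generated with `make_classification`, not on the RSSI dataset.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_validate
from sklearn.neighbors import KNeighborsClassifier

# Synthetic 3-class data standing in for the nine RSSI features.
X_toy, y_toy = make_classification(
    n_samples=300, n_features=9, n_informative=5, n_classes=3, random_state=0
)

res = cross_validate(
    KNeighborsClassifier(5), X_toy, y_toy, cv=10,
    scoring=["accuracy", "f1_weighted"],
)

# Report the standard deviation across folds alongside the mean.
for key in ("test_accuracy", "test_f1_weighted"):
    print(f"{key}: {np.mean(res[key]):.3f} +/- {np.std(res[key]):.3f}")
```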




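### Decision

For our application, we choose the model with the highest weighted one-vs-rest ROC AUC, which is the metric the next cell uses. Ranking the cross-validation results above by weighted F1 score gives the same choice.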

In [25]:
# Find the best estimator according to one metric
choosen = 'test_roc_auc_ovr_weighted'
best_metric = [(name, res[choosen]) for name, res in scores.items()]
best_estimator = max(best_metric, key=lambda x: x[1])
print(f"The best estimator is {best_estimator[0]} with Score {best_estimator[1]} ({choosen})")

The best estimator is SV with Score 0.9948248824437667 (test_roc_auc_ovr_weighted)
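As a quick cross-check, the sketch below repeats the selection with two different metrics. The numbers are the mean cross-validation scores printed earlier, rounded to four decimals; the names `reported` and `best_by` are introduced here for illustration only.

```python
# Mean cross-validation scores copied from the training output, rounded to four decimals.
reported = {
    "knn":    {"f1_weighted": 0.9282, "roc_auc_ovr_weighted": 0.9902},
    "NB":     {"f1_weighted": 0.9160, "roc_auc_ovr_weighted": 0.9932},
    "tree":   {"f1_weighted": 0.8782, "roc_auc_ovr_weighted": 0.9725},
    "forest": {"f1_weighted": 0.9100, "roc_auc_ovr_weighted": 0.9932},
    "SV":     {"f1_weighted": 0.9353, "roc_auc_ovr_weighted": 0.9948},
    "LR":     {"f1_weighted": 0.9212, "roc_auc_ovr_weighted": 0.9947},
}


def best_by(metric: str) -> str:
    """Return the name of the model with the highest score for `metric`."""
    return max(reported, key=lambda name: reported[name][metric])


print(best_by("f1_weighted"), best_by("roc_auc_ovr_weighted"))
```

Both metrics select the same model here, although the ROC AUC margin over logistic regression is very small.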


In [26]:
# Fetch Estimator Object
# choosen_estimator = scores[best_estimator[0]]['estimator']
choosen_estimator = classifiers[best_estimator[0]]
choosen_estimator

SVC(gamma=0.001, probability=True)

In [27]:
# Train the chosen model
choosen_estimator.fit(X_train, y_train)

SVC(gamma=0.001, probability=True)

In [28]:
# Predict using the chosen estimator
res = choosen_estimator.predict(X_test)

#### Reporting the results 

In [29]:
# Print the confusion matrix
confusion_matrix(y_test, res)

array([[28,  1,  0,  0,  0,  0,  1],
       [ 0, 33,  0,  0,  0,  0,  0],
       [ 0,  0, 31,  0,  0,  0,  0],
       [ 0,  0,  0, 22,  2,  0,  1],
       [ 0,  0,  0,  4, 26,  1,  0],
       [ 0,  0,  0,  0,  1, 31,  0],
       [ 1,  0,  0,  1,  0,  0, 27]])
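The matrix above follows scikit-learn's convention of true classes in rows and predicted classes in columns. As a sketch of how to read it, the per-class precision and recall can be recomputed directly from the matrix with NumPy.

```python
import numpy as np

# Confusion matrix copied from the output above
# (rows: true class, columns: predicted class, classes 1..7).
cm = np.array([
    [28,  1,  0,  0,  0,  0,  1],
    [ 0, 33,  0,  0,  0,  0,  0],
    [ 0,  0, 31,  0,  0,  0,  0],
    [ 0,  0,  0, 22,  2,  0,  1],
    [ 0,  0,  0,  4, 26,  1,  0],
    [ 0,  0,  0,  0,  1, 31,  0],
    [ 1,  0,  0,  1,  0,  0, 27],
])

recall = cm.diagonal() / cm.sum(axis=1)     # correct / all samples of the true class
precision = cm.diagonal() / cm.sum(axis=0)  # correct / all samples predicted as the class
accuracy = cm.diagonal().sum() / cm.sum()

for label, (p, r) in enumerate(zip(precision, recall), start=1):
    print(f"class {label}: precision={p:.2f} recall={r:.2f}")
print(f"accuracy={accuracy:.2f}")
```

The largest off-diagonal entry is the 4 in row 5, column 4: four samples from class 5 were predicted as class 4, which is the most frequent single confusion in this test set.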

In [30]:
# Show the classification report
print(classification_report(y_test, res))

              precision    recall  f1-score   support

           1       0.97      0.93      0.95        30
           2       0.97      1.00      0.99        33
           3       1.00      1.00      1.00        31
           4       0.81      0.88      0.85        25
           5       0.90      0.84      0.87        31
           6       0.97      0.97      0.97        32
           7       0.93      0.93      0.93        29

    accuracy                           0.94       211
   macro avg       0.94      0.94      0.94       211
weighted avg       0.94      0.94      0.94       211



In [31]:
# Save the model 
from joblib import dump

In [35]:
dump(choosen_estimator, 'child/model.joblib')

['child/model.joblib']
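A saved model can later be restored with `joblib.load` and used to predict the location of a new scan. The sketch below is self-contained, so it trains a tiny stand-in `SVC` on made-up two-network readings and writes it to a temporary directory; the data, the path, and the variable names are assumptions, not the notebook's actual model or file.

```python
import os
import tempfile

import numpy as np
from joblib import dump, load
from sklearn.svm import SVC

# Tiny synthetic stand-in for the trained model: two well-separated "locations".
X_toy = np.array([[-50.0, -90.0], [-52.0, -88.0], [-90.0, -50.0], [-88.0, -52.0]])
y_toy = np.array([1, 1, 2, 2])
model = SVC(gamma=0.001).fit(X_toy, y_toy)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "model.joblib")  # hypothetical path
    dump(model, path)
    restored = load(path)

# A new scan must list the networks in the same column order used for training.
new_scan = np.array([[-51.0, -89.0]])
print(restored.predict(new_scan))
```

For the real model, the input row would hold the nine RSSI values in the order of `imp_cols`, with any zero readings imputed the same way as during training.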