# Localization System

## Table of Contents
<ul>
<li><a href="#intro">Introduction</a></li>
<li><a href="#wrangling">Data Wrangling</a></li>
<!-- <li><a href="#eda">Exploratory Data Analysis</a></li> -->
<li><a href="#model">Building the Model</a></li>
</ul>

<a id='intro'></a>
## Introduction 
A localization system built using machine learning classifiers. The data was collected at the Biomedical Engineering Department, Faculty of Engineering, Cairo University. It consists of the RSSI values (in dBm) of the WiFi networks available in the department. An RSSI measurement represents the relative quality of a signal received by a device: it indicates the power level received after any possible loss at the antenna and cable level. The higher the RSSI value, the stronger the signal. Because the values are negative, a number closer to zero usually means a better signal. For example, -50 is a pretty good signal, -75 is fairly reasonable, and -100 is no signal at all. Identifying a patient's location in a hospital is useful for many reasons, such as finding points of congestion so that the hospital services can be rearranged, and counting how many patients are in the hospital.
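To make the RSSI scale concrete, here is a minimal sketch that maps a reading in dBm to a rough quality label. The anchor values (-50, -75, -100) come from the examples above; the function name `rssi_quality`, the label names, and the exact cut-offs between labels are assumptions made for illustration only.

```python
def rssi_quality(rssi_dbm: float) -> str:
    """Map an RSSI reading (dBm) to a rough, illustrative quality label."""
    if rssi_dbm >= -50:
        return "good"
    if rssi_dbm >= -75:
        return "fair"
    if rssi_dbm > -100:
        return "weak"
    return "no signal"


# A few hypothetical readings, from strong to absent
labels = [rssi_quality(v) for v in (-47, -60, -88, -100)]
print(labels)
```

The classifiers later in this notebook work on the raw RSSI values, not on such labels; the function is only a reading aid.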

So we are going to dive into our gathered data and investigate our findings.

In [1]:
# Basic imports
import pandas as pd 
import numpy as np

<a id="wrangling"></a>
## Data Wrangling 
Let's dive into our gathered data and find its secrets.

In [14]:
# Read our datasets 
df_ta = pd.read_csv('esp_csv_only/csv/Ta.csv')
df_lab = pd.read_csv('esp_csv_only/csv/Lab.csv')
df_ts = pd.read_csv('esp_csv_only/csv/ts.csv')
df_hall_4 = pd.read_csv('esp_csv_only/csv/hall_4.csv')
df_hall_5 = pd.read_csv('esp_csv_only/csv/hall_5.csv')
df_hall_6 = pd.read_csv('esp_csv_only/csv/hall_6.csv')
df_main_hall = pd.read_csv('esp_csv_only/csv/main_hall.csv')
# Show summary statistics
df_lab.describe()

Unnamed: 0,StudBME1,STUDBME2,SBME_STAFF3,SBME_STAFF,CUFE,RehabLab,lab001,BMEStudentLab3,CMP_LAB,CMP_LAB1,CMP_LAB2
count,149.0,149.0,149.0,149.0,149.0,149.0,149.0,149.0,149.0,149.0,149.0
mean,-81.624161,-57.053691,-87.214765,-77.194631,-68.946309,-87.946309,-84.315436,0.0,0.0,-58.57047,-60.677852
std,12.310442,4.658304,2.255815,27.815919,35.480826,2.189611,4.30444,0.0,0.0,30.234948,31.482356
min,-89.0,-73.0,-93.0,-89.0,-90.0,-94.0,-92.0,0.0,0.0,-85.0,-87.0
25%,-85.0,-60.0,-89.0,-88.0,-88.0,-89.0,-87.0,0.0,0.0,-74.0,-77.0
50%,-84.0,-58.0,-87.0,-87.0,-87.0,-88.0,-85.0,0.0,0.0,-73.0,-74.0
75%,-82.0,-54.0,-86.0,-86.0,-85.0,-87.0,-82.0,0.0,0.0,-71.0,-71.0
max,0.0,-47.0,-80.0,0.0,0.0,-78.0,-71.0,0.0,0.0,0.0,0.0


In [15]:
df_ta.describe()

Unnamed: 0,StudBME1,STUDBME2,SBME_STAFF3,SBME_STAFF,CUFE,RehabLab,lab001,BMEStudentLab3,CMP_LAB,CMP_LAB1,CMP_LAB2
count,151.0,151.0,151.0,151.0,151.0,151.0,151.0,151.0,151.0,151.0,151.0
mean,-78.980132,-69.589404,-71.913907,-84.940397,-83.238411,-68.847682,-86.317881,0.0,0.0,-77.874172,-75.317881
std,10.841568,9.522795,35.96942,10.426397,19.977557,37.952909,10.471785,0.0,0.0,20.161781,25.352021
min,-92.0,-85.0,-95.0,-95.0,-91.0,-93.0,-95.0,0.0,0.0,-90.0,-89.0
25%,-84.0,-73.0,-91.0,-88.0,-91.0,-90.0,-89.0,0.0,0.0,-86.5,-86.0
50%,-80.0,-70.0,-91.0,-86.0,-87.0,-90.0,-88.0,0.0,0.0,-82.0,-83.0
75%,-76.0,-67.0,-86.0,-85.0,-85.0,-86.5,-86.0,0.0,0.0,-80.0,-81.0
max,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


A first look at the data reveals the following:

- BMEStudentLab3 and CMP_LAB are out of range (every recorded value is zero), so their values won't benefit our model.
- There are a lot of zeros in the recorded data.

### The zeros problem
The data is recorded in such a way that, if one of the specified WiFi networks is not found or is out of range, the value for that moment is recorded as zero. According to the documentation of the ESP module, the RSSI of a WiFi network ranges from -100 to -50 dBm: -100 for bad network connectivity (signal strength) and -50 for good network connectivity.
<img src="sources/wifi.jpeg">
So we can treat a zero as a mis-recorded value (it seems to be an error in the chip) and replace it using any imputation method we want. We will stick with replacing it with the mean for now.
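The effect of the zeros is easiest to see on a small example. The sketch below uses made-up readings for a single network (the values are hypothetical, not taken from the dataset) and shows how the zeros pull the mean towards 0, and how marking them as missing before averaging avoids that.

```python
import numpy as np
import pandas as pd

# Hypothetical readings for one network; 0 marks a scan where it was not found
readings = pd.Series([-88, -87, 0, -86, 0, -87], dtype=float)

naive_mean = readings.mean()            # zeros drag the mean towards 0
cleaned = readings.replace(0, np.nan)   # treat zeros as missing values
clean_mean = cleaned.mean()             # NaN values are skipped by default
imputed = cleaned.fillna(round(clean_mean, 1))

print(naive_mean, clean_mean)
print(imputed.tolist())
```

Mean imputation keeps the column average unchanged but shrinks its spread, which is one reason to treat it as a first approximation.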

## Imputing mis-recorded values

In [16]:
# Drop the columns we will not use: the two all-zero networks and CUFE
bad_cols = ['BMEStudentLab3', 'CMP_LAB', 'CUFE']

# One DataFrame per location; classes 1 to 7 follow this order
dfs = [df_ta, df_lab, df_ts, df_hall_4, df_hall_5, df_hall_6, df_main_hall]

for df in dfs:
    df.drop(bad_cols, axis=1, inplace=True)
    print(df.columns)

imp_cols = ['StudBME1', 'STUDBME2', 'SBME_STAFF3', 'SBME_STAFF',
       'RehabLab', 'lab001', 'CMP_LAB1', 'CMP_LAB2']
# Check 
print([(df.columns.values == imp_cols).all() for df in dfs])


Index(['StudBME1', 'STUDBME2', 'SBME_STAFF3', 'SBME_STAFF', 'RehabLab',
       'lab001', 'CMP_LAB1', 'CMP_LAB2'],
      dtype='object')
Index(['StudBME1', 'STUDBME2', 'SBME_STAFF3', 'SBME_STAFF', 'RehabLab',
       'lab001', 'CMP_LAB1', 'CMP_LAB2'],
      dtype='object')
Index(['StudBME1', 'STUDBME2', 'SBME_STAFF3', 'SBME_STAFF', 'RehabLab',
       'lab001', 'CMP_LAB1', 'CMP_LAB2'],
      dtype='object')
Index(['StudBME1', 'STUDBME2', 'SBME_STAFF3', 'SBME_STAFF', 'RehabLab',
       'lab001', 'CMP_LAB1', 'CMP_LAB2'],
      dtype='object')
Index(['StudBME1', 'STUDBME2', 'SBME_STAFF3', 'SBME_STAFF', 'RehabLab',
       'lab001', 'CMP_LAB1', 'CMP_LAB2'],
      dtype='object')
Index(['StudBME1', 'STUDBME2', 'SBME_STAFF3', 'SBME_STAFF', 'RehabLab',
       'lab001', 'CMP_LAB1', 'CMP_LAB2'],
      dtype='object')
Index(['StudBME1', 'STUDBME2', 'SBME_STAFF3', 'SBME_STAFF', 'RehabLab',
       'lab001', 'CMP_LAB1', 'CMP_LAB2'],
      dtype='object')
[True, True, True, True, True, True, True]


In [17]:
df_lab.columns

Index(['StudBME1', 'STUDBME2', 'SBME_STAFF3', 'SBME_STAFF', 'RehabLab',
       'lab001', 'CMP_LAB1', 'CMP_LAB2'],
      dtype='object')

In [18]:
# Mark zeros as missing (NaN); they are filled with the mean in a later cell
cols = df_lab.columns

for df in dfs:
    df[cols] = df[cols].replace({0: np.nan})

# Check 
print([(df[imp_cols].isnull().sum().values > 0).any() for df in dfs])

[True, True, False, False, False, False, True]


In [19]:
# For each DataFrame, list the columns that contain missing values
miss = []

for df in dfs:
    missing = [col for col in df if df[col].isnull().sum() > 0]
    miss.append(missing)

        
# for i in df_lab.columns:
#     missing_lab = df_lab[i].isnull().sum()
#     missing_ta = df_ta[i].isnull().sum()
    
#     if missing_lab >0 :
#         miss_lab.append(i)
        
#     if missing_ta >0:
#         miss_ta.append(i)
# print(miss_lab)
# print(miss_ta)
miss

[['StudBME1',
  'STUDBME2',
  'SBME_STAFF3',
  'SBME_STAFF',
  'RehabLab',
  'lab001',
  'CMP_LAB1',
  'CMP_LAB2'],
 ['StudBME1', 'SBME_STAFF', 'CMP_LAB1', 'CMP_LAB2'],
 [],
 [],
 [],
 [],
 ['RehabLab', 'lab001']]

In [20]:
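# Replace missing values with the column mean (rounded to 1 decimal)
for idx, df_cols in enumerate(miss):
    for df_col in df_cols:
        col_mean = round(dfs[idx][df_col].mean(), 1)
        # Assign back instead of calling fillna(inplace=True) on a column selection
        dfs[idx][df_col] = dfs[idx][df_col].fillna(col_mean)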
# for col_lab in miss_lab:
#     df_lab[col_lab].fillna(round(df_lab[col_lab].mean(),1), inplace=True)

# for col_ta in miss_ta:
#     df_ta[col_ta].fillna(round(df_ta[col_ta].mean(), 1), inplace=True)


In [21]:
# Set Classes for each location
for idx, df in enumerate(dfs):
    df['location'] = idx+1
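Since the class labels are just positions in the `dfs` list plus one, a small lookup table helps when reading the results later. This is a sketch: the name strings are assumptions derived from the DataFrame variable names, and `describe_location` is a hypothetical helper, not part of the original notebook.

```python
# Order matches the `dfs` list above: label = position in the list + 1
location_names = ["ta", "lab", "ts", "hall_4", "hall_5", "hall_6", "main_hall"]
label_to_name = {idx + 1: name for idx, name in enumerate(location_names)}


def describe_location(label: int) -> str:
    """Translate a numeric class label back to its (assumed) location name."""
    return label_to_name.get(label, "unknown")


print(describe_location(2))
print(describe_location(7))
```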


In [47]:
df_lab.describe()

Unnamed: 0,StudBME1,STUDBME2,SBME_STAFF3,SBME_STAFF,RehabLab,lab001,CMP_LAB1,CMP_LAB2,location
count,149.0,149.0,149.0,149.0,149.0,149.0,149.0,149.0,149.0
mean,-83.301342,-57.053691,-87.214765,-87.132215,-87.946309,-84.315436,-73.966443,-76.614765,2.0
std,3.704216,4.658304,2.255815,1.044824,2.189611,4.30444,2.613417,4.165372,0.0
min,-89.0,-73.0,-93.0,-89.0,-94.0,-92.0,-85.0,-87.0,2.0
25%,-85.0,-60.0,-89.0,-88.0,-89.0,-87.0,-74.0,-77.0,2.0
50%,-84.0,-58.0,-87.0,-87.1,-88.0,-85.0,-73.0,-76.0,2.0
75%,-82.0,-54.0,-86.0,-86.0,-87.0,-82.0,-73.0,-74.0,2.0
max,-73.0,-47.0,-80.0,-85.0,-78.0,-71.0,-69.0,-69.0,2.0


In [48]:
df_ta.describe()

Unnamed: 0,StudBME1,STUDBME2,SBME_STAFF3,SBME_STAFF,RehabLab,lab001,CMP_LAB1,CMP_LAB2,location
count,151.0,151.0,151.0,151.0,151.0,151.0,151.0,151.0,151.0
mean,-80.039735,-70.523179,-89.735099,-86.080795,-89.615894,-87.476821,-82.809272,-83.622517,1.0
std,5.766433,5.024718,1.741846,3.349661,0.84586,2.996853,4.424076,3.588763,0.0
min,-92.0,-85.0,-95.0,-95.0,-93.0,-95.0,-90.0,-89.0,1.0
25%,-84.0,-73.0,-91.0,-88.0,-90.0,-89.0,-86.5,-86.0,1.0
50%,-80.0,-70.5,-91.0,-86.0,-90.0,-88.0,-82.8,-83.6,1.0
75%,-76.5,-68.0,-88.0,-85.0,-89.6,-86.5,-80.0,-82.0,1.0
max,-62.0,-54.0,-86.0,-71.0,-86.0,-78.0,-69.0,-70.0,1.0


In [24]:
df_ts.head()

Unnamed: 0,StudBME1,STUDBME2,SBME_STAFF3,SBME_STAFF,RehabLab,lab001,CMP_LAB1,CMP_LAB2,location
0,-81,-55,-55,-88,-50,-86,-62,-63,3
1,-83,-55,-58,-88,-61,-86,-62,-63,3
2,-84,-70,-60,-88,-55,-86,-58,-63,3
3,-84,-72,-58,-88,-47,-86,-58,-63,3
4,-84,-71,-59,-88,-57,-86,-67,-63,3


In [25]:
df_hall_4.head()

Unnamed: 0,StudBME1,STUDBME2,SBME_STAFF3,SBME_STAFF,RehabLab,lab001,CMP_LAB1,CMP_LAB2,location
0,-58,-46,-84,-60,-87,-70,-69,-65,4
1,-58,-51,-84,-69,-94,-79,-65,-65,4
2,-65,-47,-84,-69,-89,-74,-65,-65,4
3,-65,-46,-86,-80,-87,-68,-58,-58,4
4,-65,-47,-81,-66,-85,-72,-60,-53,4


In [26]:
df_hall_5.head()

Unnamed: 0,StudBME1,STUDBME2,SBME_STAFF3,SBME_STAFF,RehabLab,lab001,CMP_LAB1,CMP_LAB2,location
0,-60,-55,-63,-86,-72,-86,-55,-58,5
1,-60,-54,-70,-74,-79,-86,-51,-53,5
2,-60,-54,-76,-74,-72,-86,-53,-53,5
3,-60,-54,-68,-74,-71,-88,-50,-53,5
4,-61,-57,-80,-78,-72,-84,-53,-56,5


In [27]:
df_hall_6.head()

Unnamed: 0,StudBME1,STUDBME2,SBME_STAFF3,SBME_STAFF,RehabLab,lab001,CMP_LAB1,CMP_LAB2,location
0,-74,-59,-84,-68,-81,-83,-67,-67,6
1,-66,-58,-75,-71,-81,-83,-56,-56,6
2,-66,-56,-77,-74,-75,-83,-60,-61,6
3,-66,-55,-78,-74,-80,-83,-60,-52,6
4,-73,-59,-76,-74,-82,-86,-60,-52,6


In [28]:
df = pd.concat(dfs, ignore_index=True)
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1053 entries, 0 to 1052
Data columns (total 9 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   StudBME1     1053 non-null   float64
 1   STUDBME2     1053 non-null   float64
 2   SBME_STAFF3  1053 non-null   float64
 3   SBME_STAFF   1053 non-null   float64
 4   RehabLab     1053 non-null   float64
 5   lab001       1053 non-null   float64
 6   CMP_LAB1     1053 non-null   float64
 7   CMP_LAB2     1053 non-null   float64
 8   location     1053 non-null   int64  
dtypes: float64(8), int64(1)
memory usage: 74.2 KB


In [29]:
# Shuffle the rows so the locations are no longer grouped together
df = df.sample(frac=1)

In [30]:
df.head()

Unnamed: 0,StudBME1,STUDBME2,SBME_STAFF3,SBME_STAFF,RehabLab,lab001,CMP_LAB1,CMP_LAB2,location
379,-91.0,-82.0,-60.0,-89.0,-58.0,-89.0,-69.0,-70.0,3
156,-89.0,-53.0,-89.0,-87.1,-87.0,-89.0,-74.0,-76.6,2
857,-69.0,-54.0,-62.0,-75.0,-61.0,-88.0,-42.0,-42.0,6
959,-85.0,-77.0,-87.0,-84.0,-86.0,-88.0,-68.0,-68.0,7
509,-62.0,-50.0,-85.0,-64.0,-79.0,-74.0,-70.0,-69.0,4


<a id='model'></a>
## Building the Model 

In [31]:
# Imports
from sklearn.model_selection import train_test_split, GridSearchCV, cross_validate
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn import datasets
from sklearn.svm import SVC
from sklearn.metrics import precision_score, recall_score, confusion_matrix, roc_auc_score, classification_report

In [32]:
# Get the feature and target variables
X = df.iloc[:, :-1].values
y = df.iloc[:, -1].values

In [33]:
classifiers = {
    'knn': KNeighborsClassifier(5),
    'NB': GaussianNB(),
    'tree': DecisionTreeClassifier(max_depth=5),
    'forest': RandomForestClassifier(n_estimators=10, max_depth=5),
    'SV': SVC(probability=True, gamma=0.001),
    'LR': LogisticRegression(solver='newton-cg', max_iter=300)
}

In [34]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [35]:
# Check Balancing 
pd.Series(y_train).value_counts()

4    127
3    123
7    121
5    120
1    120
2    117
6    114
dtype: int64

In [36]:
X_train

array([[-69. , -53. , -89. , ..., -67. , -67. , -60. ],
       [-57. , -51. , -89. , ..., -67. , -68. , -66. ],
       [-63. , -70. , -91. , ..., -87. , -72. , -85. ],
       ...,
       [-69. , -50. , -77. , ..., -88. , -52. , -52. ],
       [-82. , -71. , -89.7, ..., -87. , -82. , -85. ],
       [-64. , -46. , -88. , ..., -71. , -68. , -63. ]])

In [37]:
# Metrics 
metrics = ['accuracy', 'f1_weighted', 'roc_auc_ovr_weighted', 'precision_weighted', 'recall_weighted']

# Cross Validate the models 
scores = {}
for name, clf in classifiers.items():
    print(f'Training {name} model .. ')
    res = cross_validate(clf, X_train, y_train, cv=10, scoring=metrics, return_estimator=True)
#     print(res)
    
    score = {"test_f1_weighted": np.mean(res['test_f1_weighted']), 
             'test_roc_auc_ovr_weighted': np.mean(res['test_roc_auc_ovr_weighted']), 
             'test_precision_weighted': np.mean(res['test_precision_weighted']),
             'test_recall_weighted': np.mean(res['test_recall_weighted']),
             'test_accuracy': np.mean(res['test_accuracy']),
             'estimator': res['estimator'][0]
            }
    
    print("test_f1_weighted", score['test_f1_weighted'])
    print('test_roc_auc_ovr', score['test_roc_auc_ovr_weighted'])
    print('test_precicison', score['test_precision_weighted'])
    print('test_recall', score['test_recall_weighted'])
    print('accuracy', score['test_accuracy'])
    print('\n')
    
    # Add each model's scores to scores
    scores[name] = score

Training knn model .. 
test_f1_weighted 0.9117340062227834
test_roc_auc_ovr 0.9842096744935246
test_precicison 0.9164250288304909
test_recall 0.9120868347338936
accuracy 0.9120868347338936


Training NB model .. 
test_f1_weighted 0.8835826966008187
test_roc_auc_ovr 0.9904888606567029
test_precicison 0.8866601129962476
test_recall 0.8847619047619049
accuracy 0.8847619047619049


Training tree model .. 
test_f1_weighted 0.8360668392205598
test_roc_auc_ovr 0.9582075595452508
test_precicison 0.8532576999742718
test_recall 0.834873949579832
accuracy 0.834873949579832


Training forest model .. 
test_f1_weighted 0.8896888266037163
test_roc_auc_ovr 0.98751434159693
test_precicison 0.8940057727053524
test_recall 0.8906862745098039
accuracy 0.8906862745098039


Training SV model .. 
test_f1_weighted 0.9062503770259159
test_roc_auc_ovr 0.992108876398331
test_precicison 0.9110448833768162
test_recall 0.9061904761904762
accuracy 0.9061904761904762


Training LR model .. 




test_f1_weighted 0.8993858720833693
test_roc_auc_ovr 0.9904969975642194
test_precicison 0.9040491089209578
test_recall 0.9002521008403361
accuracy 0.9002521008403361




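### Decision

We choose our model according to a single cross-validation metric. The cell below uses the weighted one-vs-rest ROC AUC (test_roc_auc_ovr_weighted), under which the SVC ranks first; by weighted F1 score the KNN model would rank first instead (0.912 vs. 0.906).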

In [38]:
# Find the best estimator according to one metric
choosen = 'test_roc_auc_ovr_weighted'
best_metric = [(name, res[choosen]) for name, res in scores.items()]
best_estimator = max(best_metric, key=lambda x: x[1])
print(f"The best estimator is {best_estimator[0]} with Score {best_estimator[1]} ({choosen})")

The best estimator is SV with Score 0.992108876398331 (test_roc_auc_ovr_weighted)
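The choice of metric matters here, because the models do not rank the same way under every score. The sketch below copies the cross-validation means printed above (rounded to four decimals) into a plain dictionary and ranks the models by two metrics; `cv_scores` and `rank_by` are hypothetical names introduced for this illustration.

```python
# Cross-validation means copied from the output above, rounded to 4 decimals
cv_scores = {
    "knn":    {"f1_weighted": 0.9117, "roc_auc_ovr_weighted": 0.9842},
    "NB":     {"f1_weighted": 0.8836, "roc_auc_ovr_weighted": 0.9905},
    "tree":   {"f1_weighted": 0.8361, "roc_auc_ovr_weighted": 0.9582},
    "forest": {"f1_weighted": 0.8897, "roc_auc_ovr_weighted": 0.9875},
    "SV":     {"f1_weighted": 0.9063, "roc_auc_ovr_weighted": 0.9921},
    "LR":     {"f1_weighted": 0.8994, "roc_auc_ovr_weighted": 0.9905},
}


def rank_by(metric: str) -> list:
    """Return model names sorted from best to worst on the given metric."""
    return sorted(cv_scores, key=lambda name: cv_scores[name][metric], reverse=True)


best_by_auc = rank_by("roc_auc_ovr_weighted")[0]
best_by_f1 = rank_by("f1_weighted")[0]
print(best_by_auc, best_by_f1)
```

The gap between the top models is small on both metrics, so either choice is defensible; what matters is that the stated criterion and the one used in the code agree.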


In [39]:
# Fetch Estimator Object
# choosen_estimator = scores[best_estimator[0]]['estimator']
choosen_estimator = classifiers[best_estimator[0]]
choosen_estimator

SVC(gamma=0.001, probability=True)

In [40]:
# Train the chosen model on the full training set
choosen_estimator.fit(X_train, y_train)

SVC(gamma=0.001, probability=True)

In [41]:
# Predict using the chosen estimator
res = choosen_estimator.predict(X_test)

#### Reporting the results 

In [42]:
# Print the confusion matrix
confusion_matrix(y_test, res)

array([[30,  0,  0,  0,  0,  0,  1],
       [ 0, 32,  0,  0,  0,  0,  0],
       [ 0,  0, 28,  0,  0,  0,  0],
       [ 0,  0,  0, 20,  3,  0,  1],
       [ 0,  0,  0,  4, 26,  1,  0],
       [ 0,  0,  0,  0, 10, 26,  0],
       [ 0,  1,  0,  4,  0,  0, 24]])
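Per-class recall and precision can be read directly off this matrix. In scikit-learn's convention, rows are the true classes and columns are the predicted classes, so the sketch below divides the diagonal by the row sums (recall) and by the column sums (precision). The matrix is copied from the output above.

```python
import numpy as np

# Confusion matrix from the cell above (rows: true class, columns: predicted)
cm = np.array([
    [30,  0,  0,  0,  0,  0,  1],
    [ 0, 32,  0,  0,  0,  0,  0],
    [ 0,  0, 28,  0,  0,  0,  0],
    [ 0,  0,  0, 20,  3,  0,  1],
    [ 0,  0,  0,  4, 26,  1,  0],
    [ 0,  0,  0,  0, 10, 26,  0],
    [ 0,  1,  0,  4,  0,  0, 24],
])

correct = np.diag(cm)
recall = correct / cm.sum(axis=1)      # share of each true class found
precision = correct / cm.sum(axis=0)   # share of each prediction that is right
accuracy = correct.sum() / cm.sum()

for label, (p, r) in enumerate(zip(precision, recall), start=1):
    print(f"class {label}: precision={p:.2f} recall={r:.2f}")
print(f"accuracy={accuracy:.2f}")
```

Most of the errors sit between classes 4, 5 and 6 (the three halls); in particular, 10 samples of class 6 are predicted as class 5.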

In [43]:
# Show The classification Report
print(classification_report(y_test, res))

              precision    recall  f1-score   support

           1       1.00      0.97      0.98        31
           2       0.97      1.00      0.98        32
           3       1.00      1.00      1.00        28
           4       0.71      0.83      0.77        24
           5       0.67      0.84      0.74        31
           6       0.96      0.72      0.83        36
           7       0.92      0.83      0.87        29

    accuracy                           0.88       211
   macro avg       0.89      0.88      0.88       211
weighted avg       0.90      0.88      0.88       211



In [44]:
# Save the model 
from joblib import dump

In [45]:
dump(choosen_estimator, 'child/model.joblib')

['child/model.joblib']
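A saved model can be restored with `joblib.load` and used for prediction. The sketch below is self-contained, so it does not read `child/model.joblib`: it trains a small SVC on synthetic data (two hypothetical, well-separated locations), saves it to a temporary directory, and loads it back. The synthetic data and variable names are assumptions for illustration.

```python
import tempfile
from pathlib import Path

import numpy as np
from joblib import dump, load
from sklearn.svm import SVC

# Synthetic stand-in for the RSSI features: two well-separated "locations"
rng = np.random.default_rng(0)
X_demo = np.vstack([
    rng.normal(-60, 2, size=(30, 8)),   # hypothetical location 1
    rng.normal(-85, 2, size=(30, 8)),   # hypothetical location 2
])
y_demo = np.array([1] * 30 + [2] * 30)

model = SVC(gamma=0.001).fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "model.joblib"
    dump(model, path)        # save the fitted estimator
    restored = load(path)    # load it back, e.g. in another process

# One new scan: eight RSSI readings in the same column order used for training
scan = np.full((1, 8), -61.0)
predicted_location = restored.predict(scan)[0]
print(predicted_location)
```

When loading a real model, the new readings have to go through the same preprocessing as the training data (same eight columns, zeros imputed), and the scikit-learn version should match the one used for saving.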

In [51]:
from sklearn.feature_selection import RFE

In [59]:
selector = RFE(LogisticRegression(solver='newton-cg'), n_features_to_select=5, verbose=1)

In [60]:
res= selector.fit(X_train, y_train)

Fitting estimator with 8 features.




Fitting estimator with 7 features.




Fitting estimator with 6 features.




In [61]:
res

RFE(estimator=LogisticRegression(solver='newton-cg'), n_features_to_select=5,
    verbose=1)

In [62]:
res.support_

array([ True, False,  True, False,  True,  True, False,  True])

In [68]:
df_lab.columns[:-1][res.support_]

Index(['StudBME1', 'SBME_STAFF3', 'RehabLab', 'lab001', 'CMP_LAB2'], dtype='object')
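The boolean mask in `support_` can also be applied to the feature matrix itself, not only to the column names. A minimal sketch with NumPy, using the mask printed above and the first row of the `df_ts` sample shown earlier:

```python
import numpy as np

feature_names = np.array(['StudBME1', 'STUDBME2', 'SBME_STAFF3', 'SBME_STAFF',
                          'RehabLab', 'lab001', 'CMP_LAB1', 'CMP_LAB2'])

# Mask copied from the RFE output above
support = np.array([True, False, True, False, True, True, False, True])

# First row of df_ts.head(), without the location column
sample = np.array([[-81, -55, -55, -88, -50, -86, -62, -63]])

selected = feature_names[support].tolist()   # names of the kept features
reduced = sample[:, support]                 # same row, kept features only

print(selected)
print(reduced)
```

On a fitted selector, `selector.transform(X)` performs the same column selection.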