# Localization System

## Table of Contents
<ul>
<li><a href="#intro">Introduction</a></li>
<li><a href="#wrangling">Data Wrangling</a></li>
<!-- <li><a href="#eda">Exploratory Data Analysis</a></li> -->
<li><a href="#model">Building the Model</a></li>
</ul>

<a id='intro'></a>
## Introduction 
This notebook builds a localization system using machine learning classifiers. The data was collected at the Biomedical Engineering Department, Faculty of Engineering, Cairo University. It is a collection of WiFi signal strengths (dBm) of 3 WiFi networks available in the department. Identifying a patient's location in a hospital is useful for many reasons: it helps locate points of congestion so that hospital services can be rearranged, and it can be used to count how many patients are in the hospital.

So we are going to dive into our gathered data and investigate our findings.

In [1]:
# Basic imports
import pandas as pd 
import numpy as np

<a id="wrangling"></a>
## Data Wrangling 
Let's dive into our gathered data and uncover its secrets.

In [2]:
# Read our datasets 
df_ta = pd.read_csv('esp_csv_only/csv/Ta.csv')
df_lab = pd.read_csv('esp_csv_only/csv/Lab.csv')

# Show summary statistics
df_lab.describe()

Unnamed: 0,StudBME1,STUDBME2,SBME_STAFF3,SBME_STAFF,CUFE,RehabLab,lab001,BMEStudentLab3,CMP_LAB,CMP_LAB1,CMP_LAB2
count,149.0,149.0,149.0,149.0,149.0,149.0,149.0,149.0,149.0,149.0,149.0
mean,-81.624161,-57.053691,-87.214765,-77.194631,-68.946309,-87.946309,-84.315436,0.0,0.0,-58.57047,-60.677852
std,12.310442,4.658304,2.255815,27.815919,35.480826,2.189611,4.30444,0.0,0.0,30.234948,31.482356
min,-89.0,-73.0,-93.0,-89.0,-90.0,-94.0,-92.0,0.0,0.0,-85.0,-87.0
25%,-85.0,-60.0,-89.0,-88.0,-88.0,-89.0,-87.0,0.0,0.0,-74.0,-77.0
50%,-84.0,-58.0,-87.0,-87.0,-87.0,-88.0,-85.0,0.0,0.0,-73.0,-74.0
75%,-82.0,-54.0,-86.0,-86.0,-85.0,-87.0,-82.0,0.0,0.0,-71.0,-71.0
max,0.0,-47.0,-80.0,0.0,0.0,-78.0,-71.0,0.0,0.0,0.0,0.0


In [3]:
df_ta.describe()

Unnamed: 0,StudBME1,STUDBME2,SBME_STAFF3,SBME_STAFF,CUFE,RehabLab,lab001,BMEStudentLab3,CMP_LAB,CMP_LAB1,CMP_LAB2
count,151.0,151.0,151.0,151.0,151.0,151.0,151.0,151.0,151.0,151.0,151.0
mean,-78.980132,-69.589404,-71.913907,-84.940397,-83.238411,-68.847682,-86.317881,0.0,0.0,-77.874172,-75.317881
std,10.841568,9.522795,35.96942,10.426397,19.977557,37.952909,10.471785,0.0,0.0,20.161781,25.352021
min,-92.0,-85.0,-95.0,-95.0,-91.0,-93.0,-95.0,0.0,0.0,-90.0,-89.0
25%,-84.0,-73.0,-91.0,-88.0,-91.0,-90.0,-89.0,0.0,0.0,-86.5,-86.0
50%,-80.0,-70.0,-91.0,-86.0,-87.0,-90.0,-88.0,0.0,0.0,-82.0,-83.0
75%,-76.0,-67.0,-86.0,-85.0,-85.0,-86.5,-86.0,0.0,0.0,-80.0,-81.0
max,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


A first look at the data reveals the following:

- BMEStudentLab3 and CMP_LAB contain only zeros (the networks were out of range), so their values won't benefit our model.
- There are many zeros in the rest of the recorded data.
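
Both observations can be checked programmatically. The following is a minimal sketch on a hypothetical four-row table (the values are made up for illustration and are not taken from the dataset); it lists the columns that contain only zeros and counts the zero readings in every column.

```python
import pandas as pd

# Hypothetical miniature of the RSSI table; values are illustrative only.
sample = pd.DataFrame({
    "StudBME1": [-83, 0, -87, -89],
    "STUDBME2": [-65, -61, -55, -60],
    "BMEStudentLab3": [0, 0, 0, 0],
    "CMP_LAB": [0, 0, 0, 0],
})

# Columns where every reading is zero carry no information for a classifier.
all_zero_cols = [col for col in sample.columns if (sample[col] == 0).all()]

# The number of zero readings per column shows how widespread the problem is.
zero_counts = (sample == 0).sum()

print(all_zero_cols)
print(zero_counts.to_dict())
```

The same two expressions should work on `df_lab` and `df_ta` directly, assuming they are loaded as in the cells above.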

### The Zeros Problem
The data is recorded in such a way that, if one of the specified WiFi networks is not found or is out of range, the value for that moment is recorded as zero. According to the documentation of the ESP module, the strength of a WiFi network ranges from -100 to -50 dBm: -100 indicates poor signal strength and -50 indicates good signal strength.
<img src="sources/wifi.jpeg">
So we can treat a zero as a misrecorded value (it seems to be an error in the chip) and replace it using any imputation method we want. We will stick with replacing it with the mean for now.
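
To see why the zeros have to be handled before computing the mean, here is a small worked sketch on a hypothetical series of five readings (made-up values, not from the dataset). Leaving the zeros in pulls the mean far outside the valid range, while converting them to `NaN` first gives a mean based only on real readings.

```python
import numpy as np
import pandas as pd

# Hypothetical RSSI readings (dBm) for one network; 0 marks a missed scan.
rssi = pd.Series([-88.0, 0.0, -86.0, 0.0, -87.0])

naive_mean = rssi.mean()               # zeros drag the mean toward 0
cleaned = rssi.replace(0, np.nan)      # treat zeros as missing values
valid_mean = cleaned.mean()            # NaN entries are skipped by default
imputed = cleaned.fillna(valid_mean)   # mean imputation

print(naive_mean, valid_mean)
print(imputed.tolist())
```

With these numbers the naive mean is -52.2 dBm, which is not even a plausible reading for this network, whereas the mean of the valid readings is -87.0 dBm.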

## Imputing Misrecorded Values

In [4]:
# Drop the two bad columns
bad_cols = [df_lab.BMEStudentLab3.name, df_lab.CMP_LAB.name]
df_lab.drop(bad_cols, axis=1,inplace=True)
df_ta.drop(bad_cols, axis=1, inplace=True)

In [5]:
# Mark zeros as missing (NaN) so they can be replaced with the mean later
cols = df_lab.columns
df_lab[cols] = df_lab[cols].replace({0:np.nan})
df_ta[cols] = df_ta[cols].replace({0:np.nan})

In [6]:
df_lab.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 149 entries, 0 to 148
Data columns (total 9 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   StudBME1     146 non-null    float64
 1   STUDBME2     149 non-null    float64
 2   SBME_STAFF3  149 non-null    float64
 3   SBME_STAFF   132 non-null    float64
 4   CUFE         118 non-null    float64
 5   RehabLab     149 non-null    float64
 6   lab001       149 non-null    float64
 7   CMP_LAB1     118 non-null    float64
 8   CMP_LAB2     118 non-null    float64
dtypes: float64(9)
memory usage: 10.6 KB


In [7]:
df_ta.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 151 entries, 0 to 150
Data columns (total 9 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   StudBME1     149 non-null    float64
 1   STUDBME2     149 non-null    float64
 2   SBME_STAFF3  121 non-null    float64
 3   SBME_STAFF   149 non-null    float64
 4   CUFE         143 non-null    float64
 5   RehabLab     116 non-null    float64
 6   lab001       149 non-null    float64
 7   CMP_LAB1     142 non-null    float64
 8   CMP_LAB2     136 non-null    float64
dtypes: float64(9)
memory usage: 10.7 KB


In [8]:
# Find the columns with missing values in each dataset
miss_lab = []
miss_ta = []

for i in df_lab.columns:
    missing_lab = df_lab[i].isnull().sum()
    missing_ta = df_ta[i].isnull().sum()
    
    if missing_lab >0 :
        miss_lab.append(i)
        
    if missing_ta >0:
        miss_ta.append(i)
print(miss_lab)
print(miss_ta)

['StudBME1', 'SBME_STAFF', 'CUFE', 'CMP_LAB1', 'CMP_LAB2']
['StudBME1', 'STUDBME2', 'SBME_STAFF3', 'SBME_STAFF', 'CUFE', 'RehabLab', 'lab001', 'CMP_LAB1', 'CMP_LAB2']


In [9]:
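# Replace missing values with the column mean (rounded to one decimal).
# Assigning the result back avoids an in-place fillna on a selected column,
# which recent pandas versions warn about.
for col_lab in miss_lab:
    df_lab[col_lab] = df_lab[col_lab].fillna(round(df_lab[col_lab].mean(), 1))

for col_ta in miss_ta:
    df_ta[col_ta] = df_ta[col_ta].fillna(round(df_ta[col_ta].mean(), 1))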


In [10]:
df_lab.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 149 entries, 0 to 148
Data columns (total 9 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   StudBME1     149 non-null    float64
 1   STUDBME2     149 non-null    float64
 2   SBME_STAFF3  149 non-null    float64
 3   SBME_STAFF   149 non-null    float64
 4   CUFE         149 non-null    float64
 5   RehabLab     149 non-null    float64
 6   lab001       149 non-null    float64
 7   CMP_LAB1     149 non-null    float64
 8   CMP_LAB2     149 non-null    float64
dtypes: float64(9)
memory usage: 10.6 KB


In [11]:
df_ta.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 151 entries, 0 to 150
Data columns (total 9 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   StudBME1     151 non-null    float64
 1   STUDBME2     151 non-null    float64
 2   SBME_STAFF3  151 non-null    float64
 3   SBME_STAFF   151 non-null    float64
 4   CUFE         151 non-null    float64
 5   RehabLab     151 non-null    float64
 6   lab001       151 non-null    float64
 7   CMP_LAB1     151 non-null    float64
 8   CMP_LAB2     151 non-null    float64
dtypes: float64(9)
memory usage: 10.7 KB


In [12]:
# Set Classes for each location
df_ta['location'] = 1
df_lab['location']= 2

In [13]:
df_lab.head()

Unnamed: 0,StudBME1,STUDBME2,SBME_STAFF3,SBME_STAFF,CUFE,RehabLab,lab001,CMP_LAB1,CMP_LAB2,location
0,-83.3,-65.0,-88.0,-87.1,-87.1,-87.0,-84.0,-74.0,-76.6,2
1,-83.3,-61.0,-90.0,-87.1,-87.1,-89.0,-87.0,-74.0,-76.6,2
2,-83.3,-55.0,-90.0,-87.1,-87.1,-88.0,-87.0,-74.0,-76.6,2
3,-87.0,-55.0,-88.0,-87.1,-87.1,-87.0,-91.0,-74.0,-76.6,2
4,-89.0,-60.0,-89.0,-87.1,-87.1,-87.0,-87.0,-74.0,-76.6,2


In [14]:
df_ta.head()

Unnamed: 0,StudBME1,STUDBME2,SBME_STAFF3,SBME_STAFF,CUFE,RehabLab,lab001,CMP_LAB1,CMP_LAB2,location
0,-80.0,-70.5,-89.7,-86.1,-87.9,-89.6,-87.5,-81.0,-86.0,1
1,-81.0,-70.5,-89.7,-84.0,-88.0,-89.6,-87.5,-83.0,-83.0,1
2,-69.0,-69.0,-88.0,-82.0,-88.0,-89.6,-82.0,-82.0,-84.0,1
3,-73.0,-73.0,-90.0,-84.0,-88.0,-89.6,-90.0,-82.0,-84.0,1
4,-73.0,-67.0,-90.0,-76.0,-89.0,-93.0,-87.0,-82.0,-84.0,1


In [15]:
df = pd.concat([df_lab, df_ta], ignore_index=True)
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 300 entries, 0 to 299
Data columns (total 10 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   StudBME1     300 non-null    float64
 1   STUDBME2     300 non-null    float64
 2   SBME_STAFF3  300 non-null    float64
 3   SBME_STAFF   300 non-null    float64
 4   CUFE         300 non-null    float64
 5   RehabLab     300 non-null    float64
 6   lab001       300 non-null    float64
 7   CMP_LAB1     300 non-null    float64
 8   CMP_LAB2     300 non-null    float64
 9   location     300 non-null    int64  
dtypes: float64(9), int64(1)
memory usage: 23.6 KB


<a id='model'></a>
## Building the Model 

In [46]:
# Imports
from sklearn.model_selection import train_test_split, GridSearchCV, cross_validate
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn import datasets
from sklearn.svm import SVC
from sklearn.metrics import precision_score, recall_score, confusion_matrix, roc_auc_score, classification_report

In [47]:
# Get feature and target variables
X = df.iloc[:, :-1].values
y = df.iloc[:, -1].values

In [18]:
classifiers = {
    'knn': KNeighborsClassifier(5),
    'NB': GaussianNB(),
    'tree': DecisionTreeClassifier(max_depth=5),
    'forest': RandomForestClassifier(n_estimators=10, max_depth=5),
    'SV': SVC(probability=True),
    'LR': LogisticRegression(solver='newton-cg')
}

In [19]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [20]:
# Check class balance in the training set
pd.Series(y_train).value_counts()

2    120
1    120
dtype: int64

In [23]:
X_train

array([[-86. , -72. , -91. , ..., -86. , -80. , -86. ],
       [-78. , -65. , -87. , ..., -85. , -74. , -76. ],
       [-89. , -59. , -87. , ..., -87. , -74. , -76.6],
       ...,
       [-78. , -55. , -91. , ..., -82. , -73. , -84. ],
       [-74. , -69. , -91. , ..., -94. , -87. , -86. ],
       [-84. , -60. , -89. , ..., -80. , -73. , -84. ]])

In [50]:
# Metrics 
metrics = ['accuracy', 'f1_weighted', 'roc_auc_ovr', 'precision_weighted', 'recall_weighted']

# Cross Validate the models 
scores = {}
for name, clf in classifiers.items():
    print(f'Training {name} model .. ')
    res = cross_validate(clf, X_train, y_train, cv=10, scoring=metrics, return_estimator=True)
#     print(res)
    
    score = {"test_f1_weighted": np.mean(res['test_f1_weighted']), 
             'test_roc_auc_ovr': np.mean(res['test_roc_auc_ovr']), 
             'test_precision_weighted': np.mean(res['test_precision_weighted']),
             'test_recall_weighted': np.mean(res['test_recall_weighted']),
             'test_accuracy': np.mean(res['test_accuracy']),
             'estimator': res['estimator'][0]
            }
    
    print("test_f1_weighted", score['test_f1_weighted'])
    print('test_roc_auc_ovr', score['test_roc_auc_ovr'])
    print('test_precicison', score['test_precision_weighted'])
    print('test_recall', score['test_recall_weighted'])
    print('accuracy', score['test_accuracy'])
    print('\n')
    
    # Add each model's scores to scores
    scores[name] = score

Training knn model .. 
test_f1_weighted 0.9833043478260871
test_roc_auc_ovr 0.9958333333333333
test_precicison 0.9846153846153847
test_recall 0.9833333333333334
accuracy 0.9833333333333334


Training NB model .. 
test_f1_weighted 0.9749710144927537
test_roc_auc_ovr 0.9979166666666666
test_precicison 0.9762820512820513
test_recall 0.9750000000000002
accuracy 0.9750000000000002


Training tree model .. 
test_f1_weighted 0.9707388263910003
test_roc_auc_ovr 0.9708333333333334
test_precicison 0.9736263736263735
test_recall 0.9708333333333334
accuracy 0.9708333333333334


Training forest model .. 
test_f1_weighted 0.979144927536232
test_roc_auc_ovr 0.9958333333333332
test_precicison 0.9801282051282051
test_recall 0.9791666666666667
accuracy 0.9791666666666667


Training SV model .. 
test_f1_weighted 0.9833043478260871
test_roc_auc_ovr 0.9958333333333332
test_precicison 0.9846153846153847
test_recall 0.9833333333333334
accuracy 0.9833333333333334


Training LR model .. 
test_f1_weighted 0.974

### Decision

For our application, we decided to select the model by its weighted F1 score.

In [54]:
# Find the best estimator according to one metric
choosen = 'test_f1_weighted'
best_metric = [(name, res[choosen]) for name, res in scores.items()]
best_estimator = max(best_metric, key=lambda x: x[1])
print(f"The best estimator is {best_estimator[0]} with Score {best_estimator[1]} ({choosen})")

The best estimator is knn with Score 0.9833043478260871 (test_f1_weighted)


In [58]:
# Fetch Estimator Object
choosen_estimator = scores[best_estimator[0]]['estimator']
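
Note that the `estimator` stored for each model is the one fitted during the first cross-validation fold, so it has seen only about nine tenths of the training rows. One possible alternative, sketched here on synthetic data (the arrays `X_demo` and `y_demo` and the names `template`, `fold_model` and `final_model` are made up for illustration), is to clone the selected model and refit it on the whole training set before evaluating on the test set.

```python
import numpy as np
from sklearn.base import clone
from sklearn.model_selection import cross_validate
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the RSSI features: two well-separated clusters.
rng = np.random.default_rng(0)
X_demo = np.vstack([rng.normal(-85, 2, size=(50, 3)),
                    rng.normal(-65, 2, size=(50, 3))])
y_demo = np.array([1] * 50 + [2] * 50)

template = KNeighborsClassifier(5)
res_demo = cross_validate(template, X_demo, y_demo, cv=5,
                          scoring="f1_weighted", return_estimator=True)

# Each returned estimator was fitted on one fold's training portion only.
fold_model = res_demo["estimator"][0]

# clone() copies the hyperparameters without the fitted state,
# so the copy can be refitted on all available training rows.
final_model = clone(template).fit(X_demo, y_demo)

print(fold_model.n_samples_fit_, final_model.n_samples_fit_)
```

In this sketch the fold model is fitted on 80 of the 100 rows and the refitted model on all 100. Whether refitting changes the test results reported in this notebook has not been checked here.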

In [59]:
# Predict using the chosen estimator
res = choosen_estimator.predict(X_test)

#### Reporting the results 

In [62]:
# Print the confusion matrix
confusion_matrix(y_test, res)

array([[30,  1],
       [ 0, 29]])

In [63]:
# Show the classification report
print(classification_report(y_test, res))

              precision    recall  f1-score   support

           1       1.00      0.97      0.98        31
           2       0.97      1.00      0.98        29

    accuracy                           0.98        60
   macro avg       0.98      0.98      0.98        60
weighted avg       0.98      0.98      0.98        60
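
The figures in the report can be reproduced by hand from the confusion matrix printed earlier. The short sketch below assumes the usual scikit-learn layout, where rows are true classes and columns are predicted classes, and recomputes precision, recall, F1 and accuracy with NumPy.

```python
import numpy as np

# Confusion matrix reported above: rows are true classes (1, 2),
# columns are predicted classes (1, 2).
cm = np.array([[30, 1],
               [0, 29]])

true_pos = np.diag(cm)
precision = true_pos / cm.sum(axis=0)   # column sums = predicted counts
recall = true_pos / cm.sum(axis=1)      # row sums = true counts (support)
f1 = 2 * precision * recall / (precision + recall)
accuracy = true_pos.sum() / cm.sum()

for label, p, r, f in zip((1, 2), precision, recall, f1):
    print(f"class {label}: precision={p:.2f} recall={r:.2f} f1={f:.2f}")
print(f"accuracy={accuracy:.2f}")
```

The single error is a sample from location 1 that was predicted as location 2, which is why class 1 loses recall (30/31) and class 2 loses precision (29/30).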



In [65]:
# Save the model 
from joblib import dump

In [66]:
dump(choosen_estimator, 'model.joblib')

['model.joblib']
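
A saved model can later be restored with `joblib.load` and used on new scans. The following is a minimal sketch of that round trip; it trains a stand-in classifier on made-up readings and writes to a temporary directory so that it runs on its own, whereas in practice the file would be the `model.joblib` saved above. The names `X_demo`, `y_demo`, `new_scan` and `predicted_location` are hypothetical.

```python
import os
import tempfile

import numpy as np
from joblib import dump, load
from sklearn.neighbors import KNeighborsClassifier

# Stand-in model trained on made-up readings for nine networks;
# in the notebook this would be the estimator saved as 'model.joblib'.
rng = np.random.default_rng(1)
X_demo = np.vstack([rng.normal(-85, 2, size=(20, 9)),
                    rng.normal(-65, 2, size=(20, 9))])
y_demo = np.array([1] * 20 + [2] * 20)
model = KNeighborsClassifier(5).fit(X_demo, y_demo)

# Save the model to disk and load it back.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "model.joblib")
    dump(model, path)
    restored = load(path)

# One new scan: a row of nine RSSI values in the training column order.
new_scan = np.full((1, 9), -84.0)
predicted_location = restored.predict(new_scan)[0]
print(predicted_location)
```

A new scan should go through the same cleaning as the training data (zeros replaced before prediction), and its columns must be in the same order as the features used for training.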