# Minodes Fingerprint to Zone Classification Challenge (v3)

## Introduction

At Minodes, we provide insights to retailers using WiFi Analytics. Insights are derived by analyzing the location of people's smartphones during their visits.

How does it work? We install a set of WiFi routers (so-called "nodes") with custom firmware inside or close to the areas to be monitored. Nodes collect the signal strength (`RSSI`) of WiFi probe requests sent regularly by smartphones (so-called "observations"). Observations are then used to estimate the position of smartphones. Let `RSSI(N,X)` be a function that returns the signal strength of probe request `X` observed from node `N`, measured in dBm: `0` is the strongest signal, `-100` is the weakest, and the extremes are seldom reached. The set of observations `RSSI(Ni,X)` over all nodes `i` is called a "fingerprint". It constitutes a radio signature of the smartphone that is correlated (non-linearly) with its position in space: a distance of a few meters translates to a signal strength within `[-30, 0]`. Stores are manually partitioned into regions (so-called "zones"), e.g. "entrance" and "checkout area". Fingerprints and zones are used as features and prediction classes, respectively.
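To make the non-linear relation between distance and signal strength concrete, here is a minimal sketch of the textbook log-distance path loss model. It is an illustration only: the function name, the reference strength at 1 m and the path loss exponent are hypothetical values, not parameters measured in this deployment.

```python
import math

def rssi_at_distance(distance_m, rssi_at_1m=-30.0, path_loss_exponent=3.0):
    """Log-distance path loss model: RSSI falls with log10 of the distance.

    rssi_at_1m and path_loss_exponent are hypothetical illustration values.
    """
    if distance_m <= 0:
        raise ValueError("distance_m must be positive")
    return rssi_at_1m - 10.0 * path_loss_exponent * math.log10(distance_m)

# Each doubling of the distance costs the same number of dB,
# so equal RSSI steps correspond to very different distance steps.
for d in (1, 2, 4, 8, 16):
    print(d, round(rssi_at_distance(d), 1))
```

Under this model a change of a few dB near a node corresponds to well under a meter, while the same change far from the node corresponds to many meters, which is one reason a classifier over the whole fingerprint is used rather than a per-node distance estimate.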


## Dataset

This archive contains a dataset in CSV format located at `data/fingerprints_gt_ver3.csv`. The dataset represents a large deployment in a mall (three floors), and has been generated by walking inside each zone with a set of phones, labeling fingerprints manually with their corresponding zone. Nodes are located at the exterior of the entrance of the stores, and not inside the stores. Zones map to shops, aisles, and surrounding areas. The dataset is structured as follows:

| fr_observation_time  | fr_values  | fr_mac_address_id | zo_name  |
| -------------------- | ---------- | ----------------- | ---------|  
| 2015-12-08 10:00:13  | {'9': '-83', '13': '-67', '33': '-62', '101': ...  | 3192369 | Zone 355  |
| 2015-12-08 10:00:13  | {'12': '-69', '33': '-61', '128': '-68', '276'...  | 2002427 | Zone 355  |

Description of the fields:

* `fr_observation_time`: first timestamp in ascending order of the observations aggregated into the fingerprint (aggregation spans 1 second by default, to accommodate time desynchronization issues on the nodes)
* `fr_values`: fingerprint, represented as a dictionary {`node_id`: `signal_strength`, ...}. If a node ID is not present, you can assume that its associated signal strength is `-100`
* `fr_mac_address_id`: unique ID of the MAC address of the phone that emitted the corresponding probe request
* `zo_name`: zone name (class to be predicted). Zone names are strings containing numbers and do not have a direct semantic meaning.
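The field descriptions above imply a simple conversion from one `fr_values` string to a fixed-length feature vector. The sketch below shows that conversion for a single row; the helper name `fingerprint_to_vector` and the short node list are made up for illustration.

```python
import ast

def fingerprint_to_vector(fr_values, node_ids, missing=-100):
    """Turn one `fr_values` string into a dense list ordered like `node_ids`."""
    observed = ast.literal_eval(fr_values)  # safe parsing of the dict literal
    # Nodes that did not hear the probe request get the weakest value
    return [int(observed.get(node, missing)) for node in node_ids]

# Hypothetical node list and fingerprint, for illustration only
node_ids = ['9', '12', '13', '33']
row = "{'9': '-83', '13': '-67', '33': '-62'}"
print(fingerprint_to_vector(row, node_ids))  # [-83, -100, -67, -62]
```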

Some statistics about the dataset:

* `343449` unique fingerprints (rows in the CSV file, excluding header line)
* `19` unique MAC address IDs, i.e. 19 different phones were used to collect the dataset
* `449` unique zones
* `261` unique nodes
* On average, each node is present in `23224` fingerprints


## Problem

Given a fingerprint, predict its corresponding zone with a classifier.
The dataset must be used for training and testing the classifier.
Assess your solution by reporting confusion matrix, precision, recall and F1 score.
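As a reminder of how the requested metrics relate to each other, the following sketch derives per-class precision, recall and F1 directly from a confusion matrix. It assumes the scikit-learn layout (rows are true classes, columns are predicted classes); the function name and the toy matrix are illustrative.

```python
import numpy as np

def per_class_metrics(cm):
    """Precision, recall and F1 per class from a confusion matrix.

    Rows are true classes and columns are predicted classes.
    """
    cm = np.asarray(cm, dtype=float)
    true_pos = np.diag(cm)
    predicted = cm.sum(axis=0)   # column sums: how often each class was predicted
    actual = cm.sum(axis=1)      # row sums: how often each class truly occurs
    # Classes that were never predicted (or never occur) get 0 instead of NaN
    precision = np.divide(true_pos, predicted,
                          out=np.zeros_like(true_pos), where=predicted > 0)
    recall = np.divide(true_pos, actual,
                       out=np.zeros_like(true_pos), where=actual > 0)
    denom = precision + recall
    f1 = np.divide(2 * precision * recall, denom,
                   out=np.zeros_like(denom), where=denom > 0)
    return precision, recall, f1

# Toy 3-class confusion matrix (made-up numbers)
cm = [[5, 1, 0],
      [2, 3, 0],
      [0, 0, 4]]
precision, recall, f1 = per_class_metrics(cm)
print(precision.round(3), recall.round(3), f1.round(3))
print("macro F1:", f1.mean().round(3))
```

With 449 zones of very different sizes, the unweighted (macro) average treats every zone equally, whereas plain accuracy is dominated by the large zones, so reporting both is usually more informative.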


## Solution

In [1]:
import numpy as np
import pandas as pd
import tensorflow as tf
import matplotlib.pyplot as plt
import seaborn as sbn
%matplotlib inline

In [2]:
pwd

'C:\\Users\\Tiny Ants\\Documents'

In [3]:
fingerprints = pd.read_csv('fingerprints_gt_ver3.csv')

In [4]:
fingerprints.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 343449 entries, 0 to 343448
Data columns (total 4 columns):
fr_observation_time    343449 non-null object
fr_values              343449 non-null object
fr_mac_address_id      343449 non-null int64
zo_name                343449 non-null object
dtypes: int64(1), object(3)
memory usage: 10.5+ MB


In [5]:
fingerprints.describe()

| | fr_mac_address_id |
| --- | --- |
| count | 343449.0 |
| mean | 5012072.0 |
| std | 4236813.0 |
| min | 390442.0 |
| 25% | 906632.0 |
| 50% | 3192369.0 |
| 75% | 9819582.0 |
| max | 9819586.0 |


In [6]:
fingerprints.head()

| | fr_observation_time | fr_values | fr_mac_address_id | zo_name |
| --- | --- | --- | --- | --- |
| 0 | 2015-12-08 10:00:13 | {'12': '-69', '33': '-61', '128': '-68', '276'... | 2002427 | Zone 355 |
| 1 | 2015-12-08 10:00:13 | {'9': '-83', '13': '-67', '33': '-62', '101': ... | 3192369 | Zone 355 |
| 2 | 2015-12-08 10:00:14 | {'9': '-83', '10': '-77', '11': '-85', '12': '... | 2002427 | Zone 355 |
| 3 | 2015-12-08 10:00:14 | {'9': '-86', '10': '-83', '11': '-87', '12': '... | 3192369 | Zone 355 |
| 4 | 2015-12-08 10:00:15 | {'10': '-76', '11': '-86', '12': '-65', '13': ... | 480806 | Zone 355 |


In [7]:
fingerprints.columns

Index(['fr_observation_time', 'fr_values', 'fr_mac_address_id', 'zo_name'], dtype='object')

## Before Train-Test Split

Random seed = 101

Before splitting the data into training and test sets, we inspect the features and do some pre-processing, such as adding or removing features. Once the data has a proper structure, we split it with the help of the sklearn library.

In [8]:
from sklearn.model_selection import train_test_split

In [9]:
x_data_dict = fingerprints.drop('zo_name', axis=1)

In [10]:
x_data_dict.shape

(343449, 3)

In [11]:
x_data_dict.head()

| | fr_observation_time | fr_values | fr_mac_address_id |
| --- | --- | --- | --- |
| 0 | 2015-12-08 10:00:13 | {'12': '-69', '33': '-61', '128': '-68', '276'... | 2002427 |
| 1 | 2015-12-08 10:00:13 | {'9': '-83', '13': '-67', '33': '-62', '101': ... | 3192369 |
| 2 | 2015-12-08 10:00:14 | {'9': '-83', '10': '-77', '11': '-85', '12': '... | 2002427 |
| 3 | 2015-12-08 10:00:14 | {'9': '-86', '10': '-83', '11': '-87', '12': '... | 3192369 |
| 4 | 2015-12-08 10:00:15 | {'10': '-76', '11': '-86', '12': '-65', '13': ... | 480806 |


## Preparing the X Input Features:
Extract the x features from `fr_values` and concatenate them with the remaining features.
Since the output y labels depend only on the signal strengths (i.e. `fr_values`), we drop the other features.

In [12]:
import ast
# ast.literal_eval parses the dict literal without executing arbitrary code
x_data_dict['fr_values'] = x_data_dict['fr_values'].apply(ast.literal_eval)
temp = x_data_dict['fr_values'].apply(pd.Series)
temp;

In [13]:
x_data_dict = pd.concat([x_data_dict, temp], axis=1).drop('fr_values', axis=1)
x_data_dict;
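Expanding the dictionaries with `apply(pd.Series)` can be slow on several hundred thousand rows. A possible alternative, sketched here on two made-up rows, is to parse all strings into a list of dictionaries and hand that list to the `DataFrame` constructor, which aligns the keys into columns in one pass. The variable names `raw`, `records` and `features` are illustrative.

```python
import ast
import pandas as pd

# Two made-up rows in the same shape as the `fr_values` column
raw = pd.Series([
    "{'9': '-83', '13': '-67'}",
    "{'12': '-69', '13': '-61'}",
])

# Parse every string once and convert the signal strengths to integers
records = [
    {node: int(rssi) for node, rssi in ast.literal_eval(s).items()}
    for s in raw
]

# The constructor creates one column per node id; missing nodes become NaN
features = pd.DataFrame(records, index=raw.index)

# Fill the gaps with the weakest value and store as small integers
features = features.fillna(-100).astype('int16')
print(features)
```

Storing the matrix as `int16` instead of the default 64-bit types also reduces memory use, which matters with 261 columns.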

# Final Input Features:

In [14]:
x_data_dict = x_data_dict.drop(['fr_observation_time', 'fr_mac_address_id'], axis=1)

Replace every `NaN` with `-100`: a missing node means that no signal was observed (signal strength = -100).

In [15]:
# The parsed values are strings such as '-83', so cast everything to integers
x_data_dict = x_data_dict.fillna(-100).astype(int)
x_data_dict.head()

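| | 1 | 10 | 101 | 102 | 103 | 104 | 105 | 107 | 109 | 11 | ... | 89 | 9 | 90 | 91 | 92 | 93 | 94 | 97 | 98 | 99 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | -100 | -100 | -100 | -100 | -100 | -100 | -100 | -100 | -100 | -100 | ... | -100 | -100 | -100 | -100 | -100 | -100 | -100 | -100 | -100 | -100 |
| 1 | -100 | -100 | -64 | -100 | -100 | -100 | -100 | -100 | -100 | -100 | ... | -100 | -83 | -100 | -100 | -100 | -100 | -100 | -100 | -100 | -100 |
| 2 | -100 | -77 | -69 | -71 | -100 | -100 | -100 | -100 | -100 | -85 | ... | -100 | -83 | -100 | -100 | -100 | -100 | -100 | -100 | -79 | -74 |
| 3 | -100 | -83 | -65 | -100 | -100 | -100 | -100 | -100 | -100 | -87 | ... | -100 | -86 | -100 | -100 | -100 | -100 | -100 | -100 | -100 | -80 |
| 4 | -100 | -76 | -65 | -100 | -100 | -100 | -100 | -100 | -100 | -86 | ... | -100 | -100 | -100 | -100 | -100 | -100 | -100 | -100 | -100 | -100 |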


In [16]:
cols = x_data_dict.columns
cols

Index(['1', '10', '101', '102', '103', '104', '105', '107', '109', '11',
       ...
       '89', '9', '90', '91', '92', '93', '94', '97', '98', '99'],
      dtype='object', length=261)

## Y_labels
Convert the zone-name strings to integers.

In [17]:
# "Zone 355" -> 355 (the slice keeps at most three digits after "Zone ")
fingerprints['zones'] = fingerprints.zo_name.str[5:8].astype(int)
# for label encoding
y_labels1 = fingerprints[['zones']].copy()
# for one hot encoding, for later use
y_labels = fingerprints[['zones']].copy()

In [18]:
y_labels1.shape

(343449, 1)

In [19]:
y_labels1.head()

| | zones |
| --- | --- |
| 0 | 355 |
| 1 | 355 |
| 2 | 355 |
| 3 | 355 |
| 4 | 355 |


In [20]:
from sklearn.preprocessing import LabelEncoder
dummy = LabelEncoder()

In [21]:

# Label encode: map the zone numbers to consecutive class ids 0..448
y_labels1['zones'] = dummy.fit_transform(y_labels1['zones'])
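For readers unfamiliar with `LabelEncoder`, this small sketch shows what the encoding does and how predictions can be mapped back to zone numbers with `inverse_transform`. The zone numbers are made up.

```python
from sklearn.preprocessing import LabelEncoder

zones = [355, 12, 355, 401, 12]   # made-up zone numbers
encoder = LabelEncoder()
encoded = encoder.fit_transform(zones)

# Classes are sorted, so the smallest zone number becomes class 0
print(encoded.tolist())                             # [1, 0, 1, 2, 0]
print(encoder.classes_.tolist())                    # [12, 355, 401]
print(encoder.inverse_transform(encoded).tolist())  # [355, 12, 355, 401, 12]
```

Keeping the fitted encoder around is what makes it possible to report results per original zone name rather than per class id.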

In [22]:
y_labels1;

In [23]:
y_labels1.shape

(343449, 1)

In [24]:
# converting y labels to data frame if needed
#df = pd.DataFrame(y_labels, index=range(y_labels.shape[0]), columns=range(y_labels.shape[1]))

## Test and Train Split:

In [25]:
x_train, x_test, y_train, y_test = train_test_split(x_data_dict, y_labels1, test_size=0.3, random_state=101)
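A plain random split does not guarantee that small zones appear in both sets with the same proportions. One option, sketched below on made-up imbalanced labels, is the `stratify` argument of `train_test_split`. Note that stratification needs at least two samples per class, so whether it applies to every zone in this dataset is an assumption that would have to be checked.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up imbalanced labels: 80 samples of class 0, 20 of class 1
features = np.arange(100).reshape(-1, 1)
labels = np.array([0] * 80 + [1] * 20)

x_tr, x_te, y_tr, y_te = train_test_split(
    features, labels, test_size=0.3, random_state=101, stratify=labels)

# Both sets keep the 80/20 class ratio
print(np.bincount(y_tr), np.bincount(y_te))
```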

In [26]:
x_train.head()

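| | 1 | 10 | 101 | 102 | 103 | 104 | 105 | 107 | 109 | 11 | ... | 89 | 9 | 90 | 91 | 92 | 93 | 94 | 97 | 98 | 99 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 76780 | -100 | -81 | -100 | -58 | -50 | -69 | -64 | -82 | -100 | -81 | ... | -100 | -87 | -100 | -100 | -100 | -100 | -100 | -100 | -100 | -100 |
| 318851 | -80 | -51 | -100 | -100 | -100 | -100 | -100 | -100 | -100 | -66 | ... | -100 | -65 | -100 | -100 | -100 | -100 | -100 | -77 | -73 | -70 |
| 327610 | -100 | -100 | -100 | -100 | -100 | -100 | -100 | -100 | -100 | -100 | ... | -85 | -100 | -69 | -72 | -61 | -58 | -57 | -100 | -100 | -100 |
| 168927 | -100 | -100 | -100 | -100 | -100 | -100 | -100 | -100 | -100 | -100 | ... | -100 | -100 | -100 | -100 | -100 | -100 | -100 | -100 | -100 | -100 |
| 327537 | -100 | -100 | -100 | -100 | -100 | -100 | -100 | -100 | -100 | -100 | ... | -100 | -100 | -100 | -100 | -100 | -100 | -100 | -100 | -100 | -100 |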


In [27]:
x_train.shape

(240414, 261)

In [28]:
y_train.shape

(240414, 1)

In [29]:
y_test.shape

(103035, 1)

# Import the Confusion Matrix and Classification Report

In [30]:
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report

## Naive Bayes Classifier

In [31]:
from sklearn.naive_bayes import GaussianNB

In [32]:
# ravel() passes the labels as a 1-D array, the shape scikit-learn expects
gauss_naive = GaussianNB().fit(x_train, y_train.values.ravel())


In [33]:
gauss_naive_accuracy = gauss_naive.score(x_test, y_test)
gauss_naive_accuracy

0.19483670597369826

In [34]:
gauss_naive_predictions = gauss_naive.predict(x_test)

In [35]:
gauss_naive_cm = confusion_matrix(y_test, gauss_naive_predictions)
gauss_naive_cm

array([[ 1,  0,  0, ...,  0,  0,  0],
       [37, 37, 13, ...,  0,  0,  0],
       [ 1,  0, 11, ...,  0,  0,  0],
       ..., 
       [ 0,  0,  0, ..., 45,  0,  0],
       [ 0,  0,  0, ...,  0, 21,  0],
       [ 0,  0,  0, ...,  0,  0, 50]], dtype=int64)
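A 449 x 449 matrix is hard to read as a whole. One way to get something actionable out of it, sketched here on a made-up 3 x 3 matrix, is to list the largest off-diagonal entries, i.e. the pairs of zones that are confused most often. The helper name `top_confusions` is hypothetical.

```python
import numpy as np

def top_confusions(cm, k=3):
    """Return the k largest off-diagonal entries as (true, predicted, count)."""
    cm = np.array(cm, dtype=int)        # copy so the caller's matrix is untouched
    np.fill_diagonal(cm, 0)             # ignore correct predictions
    flat_order = np.argsort(cm, axis=None)[::-1][:k]
    rows, cols = np.unravel_index(flat_order, cm.shape)
    # Drop entries with a zero count in case fewer than k confusions exist
    return [(int(r), int(c), int(cm[r, c]))
            for r, c in zip(rows, cols) if cm[r, c] > 0]

# Toy matrix (made-up numbers); rows are true classes, columns are predictions
cm = [[50, 2, 0],
      [37, 11, 13],
      [1, 0, 45]]
print(top_confusions(cm))  # [(1, 0, 37), (1, 2, 13), (0, 1, 2)]
```

Applied to the real matrix, the class ids in the result could be mapped back to zone numbers with the fitted `LabelEncoder`.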

In [36]:
gauss_naive_report = classification_report(y_test, gauss_naive_predictions)
gauss_naive_report

  'precision', 'predicted', average, warn_for)
  'recall', 'true', average, warn_for)


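| class | precision | recall | f1-score | support |
| --- | --- | --- | --- | --- |
| 0 | 0.02 | 1.00 | 0.03 | 1 |
| 1 | 0.17 | 0.42 | 0.24 | 89 |
| 2 | 0.03 | 0.92 | 0.05 | 12 |
| 3 | 0.05 | 0.11 | 0.07 | 195 |
| 4 | 0.27 | 0.50 | 0.35 | 316 |
| 5 | 0.24 | 0.09 | 0.13 | 374 |
| 6 | 0.35 | 0.31 | 0.33 | 102 |
| 7 | 0.01 | 0.56 | 0.02 | 16 |
| 8 | 0.34 | 0.14 | 0.20 | 147 |
| 9 | 0.47 | 0.12 | 0.19 | 136 |
| 10 | 0.28 | 0.28 | 0.28 | 360 |
| 11 | 0.49 | 0.06 | 0.11 | 318 |
| 12 | 0.22 | 0.01 | 0.01 | 363 |
| 13 | 0.49 | 0.13 | 0.21 | 397 |
| 14 | 0.20 | 0.49 | 0.29 | 347 |
| 15 | 0.37 | 0.09 | 0.15 | 359 |
| 16 | 0.50 | 0.00 | 0.01 | 320 |
| 17 | 0.20 | ... | ... | ... |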

## Decision-Tree Classifier

In [37]:
from sklearn.tree import DecisionTreeClassifier

In [38]:
dtree_model = DecisionTreeClassifier(max_depth= 261, min_samples_leaf=499, min_samples_split=259, random_state=101)
dtree_model.fit(x_train, y_train.values.ravel())

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=261,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=499, min_samples_split=259,
            min_weight_fraction_leaf=0.0, presort=False, random_state=101,
            splitter='best')

In [39]:
dtree_predictions = dtree_model.predict(x_test)

In [40]:
dtree_accuracy = dtree_model.score(x_test, y_test)
print(dtree_accuracy)

0.382200223225


In [41]:
dtree_cm = confusion_matrix(y_test, dtree_predictions)
dtree_cm

array([[  0,   0,   0, ...,   0,   0,   0],
       [  0,   0,   0, ...,   0,   0,   0],
       [  0,   0,   0, ...,   0,   0,   0],
       ..., 
       [  0,   0,   0, ..., 114,   0,   0],
       [  0,   0,   0, ...,   0,  88,   0],
       [  0,   0,   0, ...,   0,   0,  92]], dtype=int64)

To see a heat map of the confusion matrix, remove the triple quotes and run the cell below. It is commented out because rendering the full matrix is very slow on my laptop.

In [42]:
'''
plt.figure(figsize=(15,15))
sbn.heatmap(dtree_cm, annot=True, fmt='.3f', linewidths= 0.5, square = True, cmap= 'Blues_r');
all_sample_title = 'Accuracy score: {0}'.format(dtree_accuracy)
plt.title(all_sample_title, size= 15);
'''

In [43]:
dtree_report = classification_report(y_test, dtree_predictions)
dtree_report;

  'precision', 'predicted', average, warn_for)


## Random Forest

In [44]:
from sklearn.ensemble import RandomForestClassifier

In [45]:
rand_forest = RandomForestClassifier(n_estimators=100, max_depth=261, min_samples_leaf=499, random_state=101)

In [46]:
rand_forest.fit(x_train, y_train.values.ravel())


RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=261, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=499, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=1,
            oob_score=False, random_state=101, verbose=0, warm_start=False)

In [47]:
rand_forest_predictions = rand_forest.predict(x_test)

In [48]:
#l = rand_forest.predict_proba(x_test)[0:10]
#l;

In [49]:
rand_forest_accuracy = rand_forest.score(x_test, y_test)
rand_forest_accuracy

0.50177124278157903

In [50]:
rand_forest_cm = confusion_matrix(y_test, rand_forest_predictions)
rand_forest_cm

array([[  0,   0,   0, ...,   0,   0,   0],
       [  0,   0,   0, ...,   0,   0,   0],
       [  0,   0,   0, ...,   0,   0,   0],
       ..., 
       [  0,   0,   0, ...,  53,   0,   0],
       [  0,   0,   0, ...,   0, 194,   0],
       [  0,   0,   0, ...,   0,   1,  66]], dtype=int64)

In [51]:
rand_forest_report = classification_report(y_test, rand_forest_predictions)
rand_forest_report;

  'precision', 'predicted', average, warn_for)


To see a heat map of the confusion matrix, remove the triple quotes and run the cell below. It is commented out because rendering the full matrix is very slow on my laptop.

In [52]:
'''
plt.figure(figsize=(15,15))
sbn.heatmap(rand_forest_cm, annot=True, fmt='.3f', linewidths= 0.5, square = True, cmap= 'Blues_r');
all_sample_title = 'Accuracy score: {0}'.format(rand_forest_accuracy)
plt.title(all_sample_title, size= 15);
'''

# Layers API based Dense Neural Network Classifier

## Preparing one hot encoded data

In [53]:
x_data_dict.head();

In [54]:
# for one_hot encoding - reshaping
y_labels = y_labels.values.reshape(-1, 1)
y_labels.shape

(343449, 1)

In [55]:
from sklearn.preprocessing import OneHotEncoder
# The label matrix has a single column, so the default (encode every column) is enough
dummy = OneHotEncoder()

In [56]:
# one_hot encode
y_labels = dummy.fit_transform(y_labels).toarray()

In [57]:
y_labels.shape

(343449, 449)
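The one-hot step can also be illustrated with plain NumPy. The sketch below indexes an identity matrix with label-encoded class ids and then recovers the ids with `argmax`; the labels and class count are made up.

```python
import numpy as np

encoded = np.array([1, 0, 2, 1])          # made-up label-encoded zones
num_classes = 3

# Row i of the identity matrix is the one-hot vector for class i
one_hot = np.eye(num_classes, dtype=np.float32)[encoded]
print(one_hot)

# argmax along the class axis recovers the integer labels
recovered = one_hot.argmax(axis=1)
print(recovered.tolist())   # [1, 0, 2, 1]
```

A dense `(343449, 449)` array of 64-bit floats takes roughly 1.2 GB, so a smaller dtype such as `float32`, or a loss that accepts integer class ids directly, may be worth considering.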

In [58]:
y_labels;

## Train-Test Split

In [59]:
x_train, x_test, y_train, y_test = train_test_split(x_data_dict, y_labels, test_size=0.3, random_state=101)

In [60]:
x_train.head()

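| | 1 | 10 | 101 | 102 | 103 | 104 | 105 | 107 | 109 | 11 | ... | 89 | 9 | 90 | 91 | 92 | 93 | 94 | 97 | 98 | 99 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 76780 | -100 | -81 | -100 | -58 | -50 | -69 | -64 | -82 | -100 | -81 | ... | -100 | -87 | -100 | -100 | -100 | -100 | -100 | -100 | -100 | -100 |
| 318851 | -80 | -51 | -100 | -100 | -100 | -100 | -100 | -100 | -100 | -66 | ... | -100 | -65 | -100 | -100 | -100 | -100 | -100 | -77 | -73 | -70 |
| 327610 | -100 | -100 | -100 | -100 | -100 | -100 | -100 | -100 | -100 | -100 | ... | -85 | -100 | -69 | -72 | -61 | -58 | -57 | -100 | -100 | -100 |
| 168927 | -100 | -100 | -100 | -100 | -100 | -100 | -100 | -100 | -100 | -100 | ... | -100 | -100 | -100 | -100 | -100 | -100 | -100 | -100 | -100 | -100 |
| 327537 | -100 | -100 | -100 | -100 | -100 | -100 | -100 | -100 | -100 | -100 | ... | -100 | -100 | -100 | -100 | -100 | -100 | -100 | -100 | -100 | -100 |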


In [61]:
x_train.shape

(240414, 261)

In [62]:
y_train.shape

(240414, 449)

In [63]:
y_test.shape

(103035, 449)

Algorithm starts

In [64]:
num_feat = 261
num_hidden1 = 200
num_hidden2 = 200
num_hidden3 = 300
num_hidden4 = 300
num_hidden5 = 250
num_hidden6 = 250
num_hidden7 = 100
num_outputs = 449

In [65]:
alpha = 0.01

In [66]:
from tensorflow.contrib.layers import fully_connected

In [67]:
x = tf.placeholder(tf.float32, shape=[None,num_feat])

In [68]:
y = tf.placeholder(tf.float32, shape=[None,num_outputs])

In [69]:
activ_fn = tf.nn.relu

In [70]:
hidden1 = fully_connected(x, num_hidden1, activation_fn=activ_fn)

In [71]:
hidden2 = fully_connected(hidden1, num_hidden2, activation_fn=activ_fn)

In [72]:
hidden3 = fully_connected(hidden2, num_hidden3, activation_fn=activ_fn)

In [73]:
hidden4 = fully_connected(hidden3, num_hidden4, activation_fn=activ_fn)

In [74]:
hidden5 = fully_connected(hidden4, num_hidden5, activation_fn=activ_fn)

In [75]:
hidden6 = fully_connected(hidden5, num_hidden6, activation_fn=activ_fn)

In [76]:
hidden7 = fully_connected(hidden6, num_hidden7, activation_fn=activ_fn)

In [77]:
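# activation_fn=None keeps the output as raw logits; the default is ReLU,
# which would clip negative logits before the softmax cross-entropy
output = fully_connected(hidden7, num_outputs, activation_fn=None)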

In [78]:
loss = tf.losses.softmax_cross_entropy(onehot_labels=y, logits=output)
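To show what this loss measures, here is a NumPy sketch of softmax cross-entropy on raw logits, averaged over the batch. It is meant as an illustration of the formula, not as a reproduction of TensorFlow's implementation; the function name and the example values are made up.

```python
import numpy as np

def softmax_cross_entropy(logits, onehot_labels):
    """Mean cross-entropy between one-hot labels and softmax(logits)."""
    logits = np.asarray(logits, dtype=np.float64)
    onehot_labels = np.asarray(onehot_labels, dtype=np.float64)
    # Subtract the row maximum before exponentiating for numerical stability
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    # Only the log-probability of the true class contributes per sample
    return float(-(onehot_labels * log_probs).sum(axis=1).mean())

# Made-up logits for two samples and three classes
logits = [[2.0, 0.5, -1.0],
          [0.0, 0.0, 0.0]]
labels = [[1, 0, 0],
          [0, 0, 1]]
print(softmax_cross_entropy(logits, labels))
```

With all-zero logits the loss equals the log of the number of classes, which for 449 zones is about 6.1. That value is a useful reference point for judging whether training has started to make progress.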

In [79]:
optimizer = tf.train.AdamOptimizer(learning_rate=alpha)

In [80]:
train = optimizer.minimize(loss)

In [81]:
init = tf.global_variables_initializer()

In [82]:
batch = 1000
batch_size = 300

In [83]:
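def next_batch(num, data, labels):
    '''
    Return a total of `num` random samples and labels.
    '''
    idx = np.random.permutation(len(data))[:num]
    # Select rows by position: data[i] on a DataFrame looks up a column
    # label instead of a row and raises a KeyError
    data_batch = data.iloc[idx] if hasattr(data, 'iloc') else data[idx]
    labels_batch = labels.iloc[idx] if hasattr(labels, 'iloc') else labels[idx]

    return np.asarray(data_batch, dtype=np.float32), np.asarray(labels_batch)

The network reached up to 68% accuracy before mini-batching was introduced. Sampling batches with `data[i]` then raised a `KeyError`, because indexing a pandas DataFrame that way looks up a column label instead of a row position; `next_batch` therefore selects rows by position.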
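Drawing a fresh random subset at every step means some rows may never be seen while others are seen many times. A common alternative, sketched below with NumPy only, is to shuffle once per epoch and walk through the data in consecutive slices. The generator name `iterate_minibatches` and the toy data are illustrative.

```python
import numpy as np

def iterate_minibatches(data, labels, batch_size, rng):
    """Yield shuffled (data, labels) batches that cover every row once."""
    data = np.asarray(data)
    labels = np.asarray(labels)
    order = rng.permutation(len(data))
    for start in range(0, len(data), batch_size):
        idx = order[start:start + batch_size]
        yield data[idx], labels[idx]

# Made-up data: 10 rows with 2 features each, labels equal to the row number
rng = np.random.RandomState(101)
data = np.arange(20).reshape(10, 2)
labels = np.arange(10)
for xb, yb in iterate_minibatches(data, labels, batch_size=4, rng=rng):
    print(xb.shape, yb.tolist())
```

Converting the DataFrame to a NumPy array once before training, rather than inside every batch call, keeps the per-step cost low.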

In [None]:
# Build the evaluation ops once instead of adding new ones at every step
pred = tf.argmax(output, axis=1)
true = tf.argmax(y, axis=1)
accuracy_layers = tf.reduce_mean(tf.cast(tf.equal(pred, true), tf.float32))

with tf.Session() as sess:
    sess.run(init)

    for i in range(batch):
        # Keep the full training set intact: draw each batch into new names
        x_batch, y_batch = next_batch(batch_size, x_train, y_train)
        sess.run(train, feed_dict={x: x_batch, y: y_batch})

        if i % 100 == 0:
            print('accuracy on {} step:'.format(i))
            print(sess.run(accuracy_layers, feed_dict={x: x_test, y: y_test}))
            print('\n')

# Summary:

* Naive Bayes: 19% accuracy
* Decision tree: 38% accuracy
* Random forest: 50% accuracy
* Dense neural network: 68% accuracy (should improve with more training)

The KNN algorithm did not work well for such a large dataset.

Logistic regression took a long time to train even though the data was split into tiny batches, so I could not get a good accuracy score in the given time period.

I also tried the entire project with the Estimator API. It generated system errors which, according to Stack Overflow, are caused by a bug in the TensorFlow version.