# BAMBOO - Data Playground

This notebook explores the dataset and estimates how many bits each field requires, as groundwork for using pairwise boosting to build a storage-efficient fingerprint for Wi-Fi Probe Requests.

## Libraries and Configurations

Import configuration files

In [1]:
from configparser import ConfigParser

config = ConfigParser()
config.read("../config.ini")

['../config.ini']

Import **data libraries**

In [2]:
import pandas as pd

Import **other libraries**

In [3]:
from rich.progress import Progress
from rich import traceback

traceback.install()

<bound method InteractiveShell.excepthook of <ipykernel.zmqshell.ZMQInteractiveShell object at 0x7f1ce659f990>>

Custom helper scripts

In [4]:
%cd ..
from scripts import plotHelper, encodingHelper
%cd data_exploration_cleaning

/home/bacci/COMPACT/notebooks
/home/bacci/COMPACT/notebooks/data_exploration_cleaning


## Import Data

In [5]:
# Combined dataframe
combined_df_csv = (
    config["DEFAULT"]["interim_path"] + "dissected/std_burst_dissected_df.csv"
)
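
The string concatenation above only works if `interim_path` ends with a path separator. A minimal sketch of a more forgiving alternative with `pathlib`; the `interim_path` value below is a hypothetical stand-in, not the one stored in the project's `config.ini`:

```python
from configparser import ConfigParser
from pathlib import Path

config = ConfigParser()
# Hypothetical configuration, inlined so the sketch runs without config.ini
config.read_string("[DEFAULT]\ninterim_path = ../data/interim\n")

# Path's "/" operator inserts separators, so a trailing slash is not required
csv_path = (
    Path(config["DEFAULT"]["interim_path"])
    / "dissected"
    / "std_burst_dissected_df.csv"
)
print(csv_path.as_posix())
```

`pd.read_csv` accepts a `Path` object directly, so the rest of the notebook would not need to change.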

In [6]:
combined_df = pd.read_csv(combined_df_csv, index_col=0)

In [7]:
combined_df

Unnamed: 0,MAC Address,Channel,DS Channel,Vendor Specific Tags,Length,Label,Supported Rates 1,Supported Rates 2,Supported Rates 3,Supported Rates 4,...,TIM_Broadcast,BSS_Transition,Multiple_BSSID,Timing_Measurement,SSID_List,DMS,Interworking,QoS_Map,WNM_Notification,Operating_Mode_Notification
0,00:0f:00:6a:68:8b,1,,2,279,SamsungJ6_K,65.0,66.0,69.5,75.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,02:00:00:00:00:00,11,9.0,11,123,SamsungM31_A,1.0,2.0,5.5,11.0,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
3,02:00:00:00:3e:b2,11,11.0,62,132,iPhone11_C,65.0,66.0,69.5,75.0,...,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0
4,02:00:3a:5e:a1:f4,11,10.0,62,132,iPhone11_B,65.0,66.0,69.5,75.0,...,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0
5,02:02:70:30:b6:43,1,3.0,62,143,iPhone12_W,65.0,66.0,69.5,75.0,...,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4410,fe:f9:ac:47:0d:b7,11,11.0,62,131,iPhone12_W,65.0,66.0,69.5,75.0,...,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0
4411,fe:f9:fc:fb:83:9e,6,1.0,1,156,iPhone6_N,1.0,2.0,5.5,11.0,...,0.0,1.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,1.0
4412,fe:fc:07:34:10:69,1,1.0,62,132,iPhone11_C,65.0,66.0,69.5,75.0,...,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0
4413,fe:fc:aa:d1:89:d1,1,2.0,62,143,iPhone12_W,65.0,66.0,69.5,75.0,...,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0


In [8]:
column_names = combined_df.columns.tolist()
column_names.remove("MAC Address")
column_names.remove("Label")

min_max_df = combined_df[column_names].describe().loc[["min", "max"]]

In [9]:
max_bits_df = pd.DataFrame(
    {
        "Column": column_names,
        "Max Bits": [int(combined_df[col].max()).bit_length() for col in column_names],
    }
)
max_bits_df

Unnamed: 0,Column,Max Bits
0,Channel,4
1,DS Channel,4
2,Vendor Specific Tags,6
3,Length,9
4,Supported Rates 1,7
5,Supported Rates 2,7
6,Supported Rates 3,7
7,Supported Rates 4,7
8,Extended Supported Rates 1,5
9,Extended Supported Rates 2,6


MAC addresses are **48 bits** long. According to the `feature_selection_forward_RF_std` notebook, the selected features, together with the MAC Address, require 48 + 9 + 3 = **60 bits** in total, not counting the Vendor Specific Tags. Most of that space is taken by the MAC Address.

Since the `Vendor Specific Tags` length is not specified by the standard, we can use the UJI Dataset as a reference, looking for the longest tag.

The longest Vendor Specific Tag in the UJI dataset that does not belong to a malformed packet is 248 Bytes long, which requires 248 × 8 = **1984 bits**.
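
The bit budget quoted above can be checked with a few lines of arithmetic. This is only a worked example using the figures stated in the text; the constant names are illustrative:

```python
MAC_BITS = 48                  # length of a MAC address
SELECTED_FEATURE_BITS = 9 + 3  # selected features, as quoted in the text
UJI_MAX_TAG_BYTES = 248        # longest well-formed tag in the UJI dataset

fixed_bits = MAC_BITS + SELECTED_FEATURE_BITS
tag_bits = UJI_MAX_TAG_BYTES * 8

print(fixed_bits)             # 60
print(tag_bits)               # 1984
print(fixed_bits + tag_bits)  # 2044
```

Under these figures, a worst-case Vendor Specific Tag would take far more space than all the other fields combined.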


## Vendor Specific Tags Length

We now check the maximum length of the raw Vendor Specific Tag across the devices in our dataset.

In [11]:
# Raw dissected dataframe
raw_df_csv = config["DEFAULT"]["interim_path"] + "dissected/dissected_df_raw.csv"

In [12]:
raw_df = pd.read_csv(raw_df_csv, index_col=0)


In [13]:
raw_df.dropna(subset=["Vendor Specific Tags"], inplace=True)

We now remove the rows belonging to a `MAC Address` that we consider noise in the Pintor et al. dataset. If this address is kept, the maximum Vendor Specific Tag length is 334 Bytes.

In [14]:
raw_df = raw_df[raw_df["MAC Address"] != "00:0f:00:6a:68:8b"]
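
If more than one address turns out to be noise, the same filter can be written with `isin`. A minimal sketch on a toy frame; the tag values and the `noise_macs` list are made up for illustration and are not taken from the dataset:

```python
import pandas as pd

toy_df = pd.DataFrame(
    {
        "MAC Address": [
            "00:0f:00:6a:68:8b",
            "22:e4:72:fb:91:70",
            "da:a1:19:45:40:f0",
        ],
        # Toy tags: the first row is artificially long to mimic the noisy device
        "Vendor Specific Tags": ["aa" * 167, "0050f208002800", "0050f208006200"],
    }
)

noise_macs = ["00:0f:00:6a:68:8b"]  # hypothetical list of addresses to drop
filtered_df = toy_df[~toy_df["MAC Address"].isin(noise_macs)]

# Longest remaining tag, measured in hex characters
print(filtered_df["Vendor Specific Tags"].str.len().max())  # 14
```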

In [15]:
raw_df

Unnamed: 0_level_0,MAC Address,Channel,DS Channel,Vendor Specific Tags,SSID,VHT Capabilities,HE Capabilities,Length,Label,Supported Rates 1,...,Channel_Schedule_Management,Geodatabase_Inband_Enabling_Signal,Network_Channel_Control,White_Space_Map,Channel_Availability_Query,FTM_Responder,FTM_Initiator,Reserved_6,ESM_Capability,Future_Channel_Guidance
Timestamp,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
2023-10-11 12:20:14.689429045,22:e4:72:fb:91:70,1,1.0,0050f208002800,,92f19033faff6203faff6203,020046,155,OppoFindX3Neo_A,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
2023-10-11 12:20:14.691803932,22:e4:72:fb:91:70,1,1.0,0050f208002800,wlan_saltuaria,92f19033faff6203faff6203,020045,169,OppoFindX3Neo_A,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
2023-10-11 12:20:14.693077087,22:e4:72:fb:91:70,1,1.0,0050f208002800,Anto_HotSpot,92f19033faff6203faff6203,020045,167,OppoFindX3Neo_A,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
2023-10-11 12:20:14.693079948,22:e4:72:fb:91:70,1,1.0,0050f208002800,BBBELL-0BCF,92f19033faff6203faff6203,020045,166,OppoFindX3Neo_A,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
2023-10-11 12:20:14.694402933,22:e4:72:fb:91:70,1,1.0,0050f208002800,SantaDomitillaWiFi,92f19033faff6203faff6203,020045,173,OppoFindX3Neo_A,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2021-06-03 13:54:07.334428072,da:a1:19:45:40:f0,11,13.0,0050f208006200,,,,119,XiaomiRedmi4_B,1.0,...,,,,,,,,,,
2021-06-03 13:54:07.340886116,da:a1:19:45:40:f0,11,13.0,0050f208006200,1117sx,,,125,XiaomiRedmi4_B,1.0,...,,,,,,,,,,
2021-06-03 13:54:07.341959000,da:a1:19:45:40:f0,11,13.0,0050f208006200,!op0ssum@,,,128,XiaomiRedmi4_B,1.0,...,,,,,,,,,,
2021-06-03 13:54:07.343002081,da:a1:19:45:40:f0,11,13.0,0050f208006200,Vodafone,,,127,XiaomiRedmi4_B,1.0,...,,,,,,,,,,


In [16]:
max_length_vendor_tag = raw_df.loc[
    raw_df["Vendor Specific Tags"].str.len().idxmax(), "Vendor Specific Tags"
]
print(max_length_vendor_tag)

00904c0408bf0c7678910ffaff0000faff0020
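
One caveat with the lookup above: `idxmax` returns an index label, and `raw_df` is indexed by timestamp. If two rows happened to share a timestamp, `.loc` would return a Series rather than a single string. A positional lookup avoids that; this is a sketch on a toy Series with deliberately duplicated labels, not a statement about the actual data:

```python
import pandas as pd

tags = pd.Series(
    [
        "0050f208002800",
        "00904c0408bf0c7678910ffaff0000faff0020",
        "0050f208006200",
    ],
    index=["t0", "t0", "t1"],  # non-unique labels on purpose
)

# argmax on the underlying array yields a position, which is always unique
longest_pos = tags.str.len().to_numpy().argmax()
longest = tags.iloc[longest_pos]
print(longest)
```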


The longest `Vendor Specific Tags` value is stored as a hexadecimal string, so every byte takes two characters. Its length in bytes is:

In [17]:
print(len(max_length_vendor_tag) // 2, "Bytes")

19 Bytes


Converting to bits, we need:

In [18]:
print(len(max_length_vendor_tag) // 2 * 8, "bits")

152 bits


In [19]:
max_length_vendor_tag

'00904c0408bf0c7678910ffaff0000faff0020'
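
Because the tag is a hexadecimal string, its character count is twice its byte count. A short sketch that decodes the string with `bytes.fromhex` and measures the decoded length directly:

```python
tag_hex = "00904c0408bf0c7678910ffaff0000faff0020"
tag_bytes = bytes.fromhex(tag_hex)  # two hex characters decode to one byte

print(len(tag_hex))        # 38 hex characters
print(len(tag_bytes))      # 19 bytes
print(len(tag_bytes) * 8)  # 152 bits
```

Measuring the decoded bytes instead of the string avoids overestimating the storage requirement by a factor of two.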

In [20]:
max_length_vendor_tag_number = int(max_length_vendor_tag, 16)
print(max_length_vendor_tag_number)

12570035996384752485514106064852224451739680


In [21]:
tokenized_tag = [
    max_length_vendor_tag[i : i + 2] for i in range(0, len(max_length_vendor_tag), 2)
]

In [22]:
tokenized_tag

['00',
 '90',
 '4c',
 '04',
 '08',
 'bf',
 '0c',
 '76',
 '78',
 '91',
 '0f',
 'fa',
 'ff',
 '00',
 '00',
 'fa',
 'ff',
 '00',
 '20']

In [23]:
tokenized_tag_int = [int(tag, 16) for tag in tokenized_tag]

In [24]:
tokenized_tag_int

[0, 144, 76, 4, 8, 191, 12, 118, 120, 145, 15, 250, 255, 0, 0, 250, 255, 0, 32]
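
The same per-byte list can be obtained in one step with `bytes.fromhex`, and the byte form keeps information that the single large integer loses. A minimal sketch, assuming the tag is always a well-formed hex string of even length:

```python
tag_hex = "00904c0408bf0c7678910ffaff0000faff0020"

# Iterating over a bytes object yields one integer (0-255) per byte
tag_ints = list(bytes.fromhex(tag_hex))
print(tag_ints)

# The byte list and the big integer encode the same value...
assert int.from_bytes(bytes(tag_ints), "big") == int(tag_hex, 16)

# ...but only the byte list preserves the leading zero byte
assert bytes(tag_ints).hex() == tag_hex
assert format(int(tag_hex, 16), "x") != tag_hex
```

If the tag is ever stored as one integer, its byte length has to be stored alongside it, otherwise leading `00` bytes cannot be recovered.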