# Generate Testing Subsets

In [9]:
# TODO: find a good balance between the number of combinations and the generalization

To test the performance of the proposed methods at different device population sizes, we adopt the following strategy: for each target population size $p = 2, \dots, P$, where $P$ is the maximum number of devices in the original dataset, we produce up to $d = 10$ distinct subsets by selecting $p$ devices at random without replacement.

## Libraries and Configurations

Import configuration files

In [10]:
from configparser import ConfigParser

config = ConfigParser()
config.read("../config.ini")

['../config.ini']

Import **data libraries**

In [11]:
import pandas as pd

Import **other libraries**

In [12]:
from rich.progress import Progress
from rich import traceback

traceback.install()


In [13]:
from itertools import combinations

import random

random.seed(42)

Custom helper scripts

In [14]:
%cd ..
from scripts import plotHelper, encodingHelper
%cd data_exploration_cleaning

/home/bacci/COMPACT/notebooks
/home/bacci/COMPACT/notebooks/data_exploration_cleaning


## Import Data

We use the burst view of the dataset as input. **Important:** the burst view of the dataset does not include devices that do not perform MAC address randomization. Because bursts are built with a `groupby` on the MAC address, such a device would collapse into a single `Label` with only one row.
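To illustrate why a device with a fixed MAC address ends up with a single row, here is a minimal sketch on a toy frame. The column name `MAC Address`, the `Frames` column, and all values are hypothetical and are not taken from the dataset; the grouping shown is only an assumption about how the burst view is built.

```python
import pandas as pd

# Toy probe-request log (hypothetical values, for illustration only).
toy = pd.DataFrame(
    {
        "Label": ["phone_A", "phone_A", "phone_A", "phone_B", "phone_B", "phone_B"],
        "MAC Address": ["aa:01", "aa:02", "aa:03", "bb:01", "bb:01", "bb:01"],
    }
)

# One row per (Label, MAC Address) pair, mimicking a groupby on the MAC address.
bursts = toy.groupby(["Label", "MAC Address"]).size().reset_index(name="Frames")

# phone_A randomizes its MAC and yields three rows; phone_B keeps one MAC and yields one.
rows_per_label = bursts.groupby("Label").size()
print(rows_per_label)
```

In this toy case `phone_B` contributes a single row, which is the situation the burst view avoids by leaving such devices out.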

In [15]:
# Combined dataframe
balanced_df_csv = (
    config["DEFAULT"]["interim_path"] + "dissected/std_burst_dissected_df.csv"
)

In [16]:
df = pd.read_csv(balanced_df_csv, index_col=0)

We are only interested in the unique labels of the dataset

In [17]:
unique_labels = df["Label"].unique()

In [18]:
for label in unique_labels:
    print(label, " ", end="")

SamsungJ6_K  SamsungM31_A  iPhone11_C  iPhone11_B  iPhone12_W  SamsungS7_I  iPhone6_N  iPhone11_M  iPhone12_M  iPhone7_X  iPhoneXR_L  GooglePixel3A_V  iPhoneXSMax_M  GooglePixel3A_L  XiaomiRedmiNote9S_T  OnePlusNord_O  XiaomiA2_E  iPhoneXR_A  S21Ultra_M  iPhone11_F  iPhoneXR_U  iPhone7_F  OppoFindX3Neo_A  HuaweiHonor9_R  XiaomiRedmiNote7_S  XiaomiRedmi5_J  XiaomiRedmi4_B  

In [19]:
max_devices = len(unique_labels)
print("Number of devices in the dataset:", max_devices)

Number of devices in the dataset: 27


## Create Random Subsets

In [20]:
random_combinations = []

Number of subsets to draw for each cardinality (fewer remain when not enough distinct subsets exist)

In [21]:
n_subsets = 10

In [22]:
labels = unique_labels.tolist()
for r in range(2, max_devices + 1):
    for _ in range(n_subsets):
        # Draw r distinct devices without replacement
        random_combinations.append(random.sample(labels, r))
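As a rough check of when the requested number of subsets cannot be reached, the sketch below counts how many distinct subsets exist for each cardinality. It assumes the 27 devices reported above and the same `n_subsets`; the names `available` and `limited` are introduced here for illustration only.

```python
from math import comb

P = 27          # number of devices reported above
n_subsets = 10  # requested subsets per cardinality

# Number of distinct subsets that exist for each cardinality r.
available = {r: comb(P, r) for r in range(2, P + 1)}

# Cardinalities where fewer than n_subsets distinct subsets exist.
limited = {r: n for r, n in available.items() if n < n_subsets}
print(limited)  # {27: 1}
```

Under these assumptions only the full set of 27 devices is limited, so every draw at that cardinality is a permutation of the same subset.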

Remove subsets that contain the same devices and differ only in the order of their elements

In [23]:
unique_combinations = []
for subset in random_combinations:
    subset.sort()
    if subset not in unique_combinations:
        unique_combinations.append(subset)

unique_combinations.sort(key=len)
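The list membership test above rescans all previously kept subsets for every candidate. An alternative sketch, not the notebook's method, tracks already-seen subsets in a set of `frozenset` keys. The helper name `dedupe_subsets` is hypothetical, and the approach assumes no label repeats within a subset, which holds for `random.sample` over unique labels.

```python
def dedupe_subsets(subsets):
    """Return subsets with permutation-duplicates removed, sorted by size."""
    seen = set()
    unique = []
    for subset in subsets:
        key = frozenset(subset)  # order-insensitive identity of the subset
        if key not in seen:
            seen.add(key)
            unique.append(sorted(subset))
    # sorted() is stable, so first-seen order is kept within each size.
    return sorted(unique, key=len)


example = [["b", "a"], ["a", "b"], ["c", "a", "b"], ["a", "c"]]
print(dedupe_subsets(example))  # [['a', 'b'], ['a', 'c'], ['a', 'b', 'c']]
```

Unlike the in-place `subset.sort()` above, this version leaves the input lists unmodified.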

Number of unique combinations per cardinality

In [24]:
subset_counts = []
for i in range(2, max_devices + 1):
    count = len([x for x in unique_combinations if len(x) == i])
    subset_counts.append({"Cardinality": i, "Count": count})

df_subset_counts = pd.DataFrame(subset_counts)
df_subset_counts

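|   | Cardinality | Count |
|---|-------------|-------|
| 0 | 2           | 10    |
| 1 | 3           | 10    |
| 2 | 4           | 10    |
| 3 | 5           | 10    |
| 4 | 6           | 10    |
| 5 | 7           | 10    |
| 6 | 8           | 10    |
| 7 | 9           | 10    |
| 8 | 10          | 10    |
| 9 | 11          | 10    |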


Export combinations to file

In [25]:
# export unique_combinations to csv
reports_path = config["DEFAULT"]["reports_path"]

df_unique_combinations = pd.DataFrame(unique_combinations)
df_unique_combinations.to_csv(
    reports_path + "/CSV/subset_combinations/unique_combinations.csv", index=False
)
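Because the subsets have different lengths, the exported frame is rectangular and shorter rows are padded with missing values. The sketch below shows one possible way to load such a file back into plain lists. It uses an in-memory buffer and toy labels instead of the real path, and the names `buffer`, `loaded`, and `restored` are illustrative.

```python
import io

import pandas as pd

subsets = [["a", "b"], ["a", "b", "c"]]

# Ragged lists become a rectangular frame; short rows are padded with missing values.
buffer = io.StringIO()
pd.DataFrame(subsets).to_csv(buffer, index=False)
buffer.seek(0)

# Drop the padding when loading so each row is a plain list of labels again.
loaded = pd.read_csv(buffer)
restored = [row.dropna().tolist() for _, row in loaded.iterrows()]
print(restored)  # [['a', 'b'], ['a', 'b', 'c']]
```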