# BAMBOO: Binary descriptor based on AsymMetric pairwise BOOsting

This notebook implements the BAMBOO descriptor, which provides a compact binary representation of probe requests.

## Libraries and Configurations

Logger

In [1]:
import logging
from rich.logging import RichHandler

logging.getLogger("scapy.runtime").setLevel(logging.CRITICAL)

FORMAT = "%(message)s"
logging.basicConfig(
    level="NOTSET",
    format=FORMAT,
    datefmt="[%X]",
    handlers=[RichHandler(rich_tracebacks=True)],
)

log = logging.getLogger("rich")

log.setLevel("DEBUG")

Import configuration files

In [2]:
from configparser import ConfigParser

config = ConfigParser()
config.read("../config.ini")

['../config.ini']

Import **data libraries**

In [3]:
import pandas as pd

Import **other libraries**

In [4]:
from rich.progress import Progress
from rich import traceback

traceback.install()

from tqdm.notebook import tqdm

In [5]:
import numpy as np
import math

## Import Data

Importing **concatenated columns** and **pairs** datasets

In [6]:
pairs_df = pd.read_csv("../../data/interim/pairs_df.csv", index_col=0)

In [7]:
pairs_df

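| Unnamed: 0 | Item 1 | Item 2 | Equality |
|---|---|---|---|
| 0 | 0 | 1 | 1 |
| 1 | 0 | 2 | 1 |
| 2 | 0 | 3 | 1 |
| 3 | 0 | 4 | 1 |
| 4 | 0 | 5 | 1 |
| ... | ... | ... | ... |
| 11493610 | 4791 | 4793 | 1 |
| 11493611 | 4791 | 4794 | 1 |
| 11493612 | 4792 | 4793 | 1 |
| 11493613 | 4792 | 4794 | 1 |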


In [8]:
strings_df = pd.read_csv("../../data/interim/string_df.csv", index_col=0)
strings_df = strings_df.rename(columns={strings_df.columns[0]: "Probes"})

In [9]:
strings_df

| Unnamed: 0 | Probes |
|---|---|
| 0 | 0000000000000100000000100000010000001011000101... |
| 1 | 0000000000000100000000100000010000001011000101... |
| 2 | 0000000000000100000000100000010000001011000101... |
| 3 | 0000000000000100000000100000010000001011000101... |
| 4 | 0000000000000100000000100000010000001011000101... |
| ... | ... |
| 4790 | 0000110000000100000000100000010000001011000101... |
| 4791 | 0000101100000100000000100000010000001011000101... |
| 4792 | 0001001000000100000000100000010000001011000101... |
| 4793 | 0000100100000100000000100000010000001011000101... |


Importing bitmask **filters**

In [10]:
filters_df = pd.read_csv("../../data/filters/bitmasks.csv", index_col=0)

Extracting the column that holds the bitmask filters

In [11]:
filters = filters_df["Bitmask"]

In [12]:
def generate_thresholds(bitmasks):
    """
    Generate thresholds for each bitmask in a set.

    Parameters:
        bitmasks (iterable of str): The bitmask strings, e.g. a pandas Series.

    Returns:
        dict: A dictionary where keys are bitmasks and values are sets of thresholds.
    """
    thresholds_dict = {}
    for bitmask in bitmasks:
        max_ones = bitmask.count("1")
        thresholds = set(range(max_ones + 1))
        thresholds_dict[bitmask] = thresholds
        log.debug(f"Bitmask {bitmask}, Thresholds {thresholds}")
    return thresholds_dict

Generating thresholds from bitmask filters

In [None]:
thresholds_dict = generate_thresholds(filters)
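
To make the threshold generation concrete, here is a minimal sketch on two made-up bitmasks (the values are hypothetical and not taken from `bitmasks.csv`). It mirrors the rule used by `generate_thresholds`: a bitmask with $k$ ones admits the thresholds $0, 1, \ldots, k$.

```python
# Hypothetical toy bitmasks, chosen only for illustration
toy_bitmasks = ["0011", "1110"]

# A bitmask with k ones can produce filter responses 0..k,
# so those are the candidate thresholds
toy_thresholds = {
    mask: set(range(mask.count("1") + 1)) for mask in toy_bitmasks
}

for mask, thresholds in toy_thresholds.items():
    print(mask, sorted(thresholds))
```

The number of (filter, threshold) candidates therefore grows with the number of ones in each bitmask, which is what the boosting loop later iterates over.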

### Functions

The **bitwise AND** function computes the bitwise AND of two binary strings and zero-pads the result to the length of the longer input

In [14]:
def bitwise_and(bit_str1, bit_str2):
    # Convert bit strings to integers
    int1 = int(bit_str1, 2)
    int2 = int(bit_str2, 2)

    # Perform bitwise AND operation
    result = int1 & int2

    # Convert result back to binary string
    result_str = bin(result)[2:]  # [2:] to remove '0b' prefix

    # Return result
    return result_str.zfill(max(len(bit_str1), len(bit_str2)))

The **sum filter** takes a binary string and sums its bits, i.e. it counts the ones

In [15]:
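def sumFilter(bit_string: str) -> int:
    # Sum the bits of the string, i.e. count its "1" characters.
    # The parameter no longer shadows the bitwise_and function or the builtin sum.
    return sum(int(bit) for bit in bit_string)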
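
As a quick illustration of how masking and summing combine into a filter response, the sketch below applies a made-up 8-bit mask to a made-up 8-bit probe. The helper `toy_bitwise_and` and both bit strings are assumptions introduced for this example. It is a compact stand-in for the notebook's `bitwise_and`, not a replacement.

```python
def toy_bitwise_and(bit_str1: str, bit_str2: str) -> str:
    """AND two binary strings and zero-pad to the longer input's length."""
    width = max(len(bit_str1), len(bit_str2))
    return format(int(bit_str1, 2) & int(bit_str2, 2), "b").zfill(width)


probe = "00101101"  # hypothetical probe bits
mask = "00001111"   # hypothetical bitmask keeping the last four bits

masked = toy_bitwise_and(probe, mask)
filter_response = masked.count("1")  # same value the sum filter returns

print(masked, filter_response)
```

The zero-padding matters because `bin()` drops leading zeros. Without it, the masked string would be shorter than the probe.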

The **sign function** returns -1 for negative values and +1 otherwise (zero included)

In [16]:
def sign(number: int) -> int:
    # Zero maps to +1, so a response equal to the threshold counts as non-negative
    return -1 if number < 0 else 1

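The **weak classifier** applies a filter to both probes of a pair and, given a threshold, returns +1 when the two filter responses fall on the same side of the threshold and -1 otherwise

In [17]:
def weak_classifier(pair: tuple, threshold: int, filter: str) -> int:
    log.debug(f"Pair {pair}, Threshold {threshold}, Filter {filter}")
    # Filter response: number of ones left after masking each probe
    filtered1 = sumFilter(bitwise_and(pair[0], filter))
    filtered2 = sumFilter(bitwise_and(pair[1], filter))
    # The product is negative only when the responses straddle the threshold
    return sign((filtered1 - threshold) * (filtered2 - threshold))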
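
A small worked example may help here. The sketch below re-implements the weak classifier in a self-contained form (the names `filter_response` and `toy_weak_classifier`, and all bit strings, are hypothetical) and evaluates it on three made-up probes.

```python
def filter_response(probe: str, mask: str) -> int:
    """Number of ones left after AND-ing the probe with the mask."""
    return bin(int(probe, 2) & int(mask, 2)).count("1")


def toy_weak_classifier(pair: tuple, threshold: int, mask: str) -> int:
    """+1 if both responses lie on the same side of the threshold, else -1."""
    r1 = filter_response(pair[0], mask)
    r2 = filter_response(pair[1], mask)
    # As in the notebook's sign function, a zero product maps to +1
    return -1 if (r1 - threshold) * (r2 - threshold) < 0 else 1


mask = "00001111"
probe_a = "00101101"  # response 3
probe_b = "00100111"  # response 3
probe_c = "00100001"  # response 1

same_side = toy_weak_classifier((probe_a, probe_b), 2, mask)
opposite_sides = toy_weak_classifier((probe_a, probe_c), 2, mask)
on_threshold = toy_weak_classifier((probe_a, probe_c), 3, mask)

print(same_side, opposite_sides, on_threshold)
```

The last case shows a consequence of mapping zero to +1: when one response equals the threshold, the pair is classified as matching regardless of the other response.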

Implementation of the **delta** (indicator) function: it returns 1 when the prediction differs from the ground truth and 0 otherwise

In [18]:
def delta(prediction: int, ground_truth: int) -> int:
    if prediction != ground_truth:
        return 1
    else:
        return 0

The **get error** function computes a pair's contribution to the weighted error of a filter: the pair's weight if the prediction differs from the ground truth, and 0 otherwise

In [19]:
def get_error(weight: float, prediction: int, ground_truth: int) -> float:
    error = weight * delta(prediction, ground_truth)

    log.debug(
        f"Weight {weight}, Prediction {prediction}, Ground Truth {ground_truth}"
    )

    return error
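
Summed over all pairs, these per-pair contributions give the weighted error of one (filter, threshold) candidate. The following sketch shows the same quantity computed in one vectorized step with NumPy. The weights, predictions, and labels are made-up values, and the vectorized form is an alternative to the per-pair loop, not the notebook's method.

```python
import numpy as np

# Hypothetical values for four pairs
toy_weights = np.array([0.25, 0.25, 0.25, 0.25])
toy_predictions = np.array([+1, -1, +1, +1])
toy_labels = np.array([+1, +1, -1, +1])

# Boolean mask of misclassified pairs; True counts as 1 in the product
misclassified = toy_predictions != toy_labels
weighted_error = float(np.sum(toy_weights * misclassified))

print(weighted_error)
```

With uniform weights the weighted error is simply the misclassification rate, here two pairs out of four.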

### BAMBOO

Input:
- Ground truth relationships $\langle x_{a(n)}, x_{b(n)}; y_n\rangle$
  - $n = 1, \ldots, N$
  - $y_n \in \{+1, -1\}$
- A set of filters $\mathcal{H} = \{h_1, \ldots, h_F\}$
- A set of binarization thresholds $\mathcal{T} = \{t_1, \ldots, t_T\}$

Output:
- A set of $M<F$ filters $[h_{i(1)}, \ldots, h_{i(M)}]$
- Corresponding set of binarization thresholds $[t_{j(1)}, \ldots, t_{j(M)}]$
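
For reference, the quantities computed by the helper functions in this notebook can be written compactly as follows. The symbols $d$ (weak classifier), $w_n$ (pair weight), $\epsilon$ (weighted error) and $c$ (confidence) are notation assumed here for illustration. The formulas restate what the code does rather than quoting an external definition.

$$
d(x_a, x_b; h, t) = \operatorname{sign}\big((h(x_a) - t)\,(h(x_b) - t)\big)
$$

$$
\epsilon(h, t) = \sum_{n=1}^{N} w_n \, \delta\big(d(x_{a(n)}, x_{b(n)}; h, t) \neq y_n\big),
\qquad
c = \log\frac{1 - \epsilon}{\epsilon}
$$

Here $h(x)$ is the number of ones left after masking $x$ with the filter, $\delta(\cdot)$ is 1 when its argument holds and 0 otherwise, and $\operatorname{sign}(0) = +1$ as in the `sign` function above.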

Define **BAMBOO input**

In [20]:
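# Input
dataset = pairs_df.copy()
# `filters` and `thresholds_dict` were defined in the cells above
M = 10  # number of boosting iterations, i.e. filters to select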

# Initial weights
weights = np.ones(len(dataset)) / len(dataset)

# Errors per iteration
errors = {}

Algorithm implementation

In [21]:
# Assumption: "Item 1" and "Item 2" are row labels of strings_df
probes = strings_df["Probes"]
pair_rows = dataset.to_numpy()  # columns: Item 1, Item 2, Equality (label)
positives = pair_rows[:, 2] == +1
selected = []  # (filter, threshold, confidence) chosen at each iteration

for m in range(M):  # iterations
    errors = {}  # reset so errors computed with old weights are not reused
    for filter, thresholds in thresholds_dict.items():  # for each filter
        for threshold in thresholds:  # for each threshold
            error = 0
            for pair, (item1, item2, label) in enumerate(pair_rows):  # for each pair
                prediction = weak_classifier(
                    (probes[item1], probes[item2]), threshold, filter
                )
                error += get_error(weights[pair], prediction, label)
            errors[(filter, threshold)] = error
    best_filter, best_threshold = min(errors, key=errors.get)

    print("Best Filter:", best_filter)
    print("Best Threshold:", best_threshold)

    min_error = errors[(best_filter, best_threshold)]
    # Clip the error away from 0 and 1 so the logarithm stays finite
    # (an error of exactly 0 would otherwise raise ZeroDivisionError)
    clipped_error = min(max(min_error, 1e-10), 1 - 1e-10)
    confidence = math.log((1 - clipped_error) / clipped_error)
    print("Confidence:", confidence)
    selected.append((best_filter, best_threshold, confidence))

    # Asymmetric weight update: only misclassified positive pairs are boosted
    for pair, (item1, item2, label) in enumerate(pair_rows):
        if label == +1 and weak_classifier(
            (probes[item1], probes[item2]), best_threshold, best_filter
        ) != label:
            weights[pair] *= math.exp(confidence)

    # Normalise the positive weights using a sum computed once, before rescaling
    weights[positives] /= weights[positives].sum()
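
To see the whole selection loop run end to end, the sketch below applies the same steps (weighted error per candidate, best candidate, confidence, asymmetric update of positive pairs) to a tiny made-up dataset. All probes, pairs, masks, and helper names are hypothetical. The `1e-10` clipping of the error is an assumption added so that the logarithm stays finite when a candidate classifies every pair correctly.

```python
import math
import numpy as np

# Hypothetical toy data: four 4-bit probes and labelled pairs of probe indices
toy_probes = ["1100", "1101", "0011", "0010"]
toy_pairs = [(0, 1, +1), (2, 3, +1), (0, 2, -1), (1, 3, -1)]
toy_masks = ["1100", "0011", "1111"]


def response(probe: str, mask: str) -> int:
    """Number of ones left after masking the probe."""
    return bin(int(probe, 2) & int(mask, 2)).count("1")


def predict_all(mask: str, threshold: int) -> np.ndarray:
    """Weak-classifier output (+1 or -1) for every toy pair."""
    out = []
    for i, j, _ in toy_pairs:
        product = (response(toy_probes[i], mask) - threshold) * (
            response(toy_probes[j], mask) - threshold
        )
        out.append(-1 if product < 0 else 1)
    return np.array(out)


labels = np.array([y for _, _, y in toy_pairs])
positives = labels == 1
toy_weights = np.ones(len(toy_pairs)) / len(toy_pairs)
selected = []

for _ in range(2):  # two boosting rounds
    # Weighted error of every (mask, threshold) candidate under current weights
    errors = {
        (mask, t): float(np.sum(toy_weights * (predict_all(mask, t) != labels)))
        for mask in toy_masks
        for t in range(mask.count("1") + 1)
    }
    best = min(errors, key=errors.get)

    # Clip the error so log((1 - eps) / eps) stays finite when it is exactly 0
    eps = min(max(errors[best], 1e-10), 1 - 1e-10)
    confidence = math.log((1 - eps) / eps)
    selected.append((best[0], best[1], confidence))

    # Asymmetric update: boost misclassified positive pairs, then
    # renormalise the positive weights so they sum to 1
    missed = positives & (predict_all(*best) != labels)
    toy_weights[missed] *= math.exp(confidence)
    toy_weights[positives] /= toy_weights[positives].sum()

print(selected[0][:2])
```

On this toy data the mask `1100` with threshold 1 separates the pairs perfectly, so its weighted error is 0 and the clipping step is what keeps the confidence finite.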

In [None]:
print("Best Filter:", best_filter)
print("Best Threshold:", best_threshold)
print("Min error", min_error)

Best Filter: 000000001111000000000000
Best Threshold: 2
Min error 0.0
