First, let's run the cell below to import all the packages that we will need.

In [95]:
import itertools
import math
import random
import warnings
from collections import OrderedDict
from datetime import date, datetime
from itertools import permutations
from math import ceil
from typing import Dict, List, Tuple

import mmh3
import numpy as np
import pandas as pd
from bitarray import bitarray

warnings.filterwarnings("ignore")

Define Topologies: We start by defining our three topologies: line, intersect, and multi-path.

Set Up a Loop: We then create a loop that iterates through a list of these topologies. The line and intersect topologies go through this loop; the multi-path topology is handled separately further down.

Assign Variables: In each iteration, we assign the current topology's stations to STATIONS and set LINE and GROUND_TRUTH from the same topology.

Process Each Topology: Inside the loop, we run the logic that processes the topology.

In [96]:
line_topology = ['N_1', 'N_2', 'N_3','N_4', 'N_5']
intersect_topology = ['N_1', 'N_5', 'N_7','N_8']
stations_line3 = ['N_1', 'N_5', 'N_9','N_10']
stations_line13 = ['N_1', 'N_11', 'N_12','N_13', 'N_10']
multi_path_topology = [
    {"name": "Line 3", "stations": stations_line3, "line": "\\16003", "ground_truth": "\\16003"},
    {"name": "Line13", "stations": stations_line13, "line": "\\16013", "ground_truth": "\\16013"}
]

topologies = [
    {"name": "Line Topology", "stations": line_topology, "line": "\\16002", "ground_truth": "\\16002"},
    {"name": "Intersect Topology", "stations": intersect_topology, "line": "\\16005", "ground_truth": "\\16005"}
]

This section describes the implementation details of the Bloom Filter, including the calculation of the bit array size, the number of hash functions, and the class definition.


In [97]:
FALSEPROB = 0.001 # False positive probability for the Bloom Filter

def get_size(n, p):
	'''
	Return the size of the bit array (m) for n expected items and
	false positive probability p, using the formula
	m = -(n * ln(p)) / (ln(2)^2)
	'''
	m = -(n * math.log(p))/(math.log(2)**2)
	return int(m)


def get_hash_count(m, n):
	'''
	Return the number of hash functions (k) for a bit array of
	size m and n expected items, using the formula
	k = (m / n) * ln(2)
	'''
	k = (m/n) * math.log(2)
	return int(k)

class BloomFilter ( object ):

	'''
		Class for Bloom filter, using murmur3 hash function
	'''
	def __init__(self, length, numOfHashes):
		'''
		length : int
			Size of the bit array (number of bits)
		numOfHashes : int
			Number of hash functions to use
		'''
		# Size of bit array to use
		self.size = length
		# number of hash functions to use
		self.hash_count = numOfHashes
		# Bit array of given size
		self.bit_array = bitarray(self.size)
		# initialize all bits as 0
		self.bit_array.setall(0)
		
	def add(self, item):
		'''
		Add an item in the filter
		'''
		digests = []
		for i in range(self.hash_count):
			# create digest for given item.
			# i work as seed to mmh3.hash() function
			# With different seed, digest created is different
			digest = mmh3.hash(item, i) % self.size
			digests.append(digest)
			# set the bit True in bit_array
			self.bit_array[digest] = True
		return self.bit_array

	def check(self, item):
		'''
		Check for existence of an item in filter
		'''
		for i in range(self.hash_count):
			digest = mmh3.hash(item, i) % self.size
			if not self.bit_array[digest]:
				return False
		return True
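
To get a feel for the sizing formulas used by `get_size` and `get_hash_count`, the sketch below recomputes them with the standard library only. The names `bloom_size`, `bloom_hash_count`, and `expected_fp_rate`, and the example value of 1000 items, are illustrative assumptions and not part of the notebook; the false positive expression is the usual approximation (1 - e^(-k*n/m))^k.

```python
import math

def bloom_size(n: int, p: float) -> int:
    """Bit-array size m = -n * ln(p) / (ln 2)^2, truncated to an int."""
    return int(-(n * math.log(p)) / (math.log(2) ** 2))

def bloom_hash_count(m: int, n: int) -> int:
    """Number of hash functions k = (m / n) * ln 2, truncated to an int."""
    return int((m / n) * math.log(2))

def expected_fp_rate(m: int, n: int, k: int) -> float:
    """Usual approximation of the false positive rate: (1 - e^(-k*n/m))^k."""
    return (1 - math.exp(-k * n / m)) ** k

n_items = 1000      # hypothetical number of card IDs
target_fp = 0.001   # same value as FALSEPROB
m = bloom_size(n_items, target_fp)
k = bloom_hash_count(m, n_items)
print(m, k, expected_fp_rate(m, n_items, k))
```

Because both helpers truncate rather than round, k comes out as 9 here, and the approximate false positive rate ends up slightly above the 0.001 target.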

The `transfer_to_bfs` function converts a dictionary of card IDs into a dictionary of Bloom filters. Each Bloom filter represents the set of card IDs associated with a specific key (e.g., a station name). Note that the function stores only each filter's bit array, not the `BloomFilter` object itself.


card_ids_dict: A dictionary where keys are identifiers (station names) and values are lists of card IDs.

BFLEN: The length of the Bloom filter bit array. It is read as a global variable.

NHASH: The number of hash functions used in the Bloom filter, which affects the false positive rate. It is also read as a global variable.

In [98]:

def transfer_to_bfs(card_ids_dict):
    card_id_bfs = {}
    for key, values in card_ids_dict.items():
        bloomf = BloomFilter(BFLEN, NHASH)
        for value in values:
            bloomf.add(str(value))
        card_id_bfs[key] = bloomf.bit_array

    return card_id_bfs

The `bitwise_or` function performs a bitwise OR between two bit arrays: a result bit is 1 if the bit at that position is 1 in at least one of the arrays. For Bloom filters built with the same size and hash functions, the result is the filter of the union of the two sets.

[0,0,0,1,1,0,1] OR [0,1,0,1,0,0,1] = [0,1,0,1,1,0,1]

In [99]:
def bitwise_or(a, b):
    result_or = a.copy()
    result_or |= b
    return result_or

The `bitwise_and` function performs a bitwise AND between two bit arrays: a result bit is 1 only if the bits at that position are 1 in both arrays. It is used to approximate the intersection of the sets represented by the two Bloom filters.

Parameters

a: The first bit array.

b: The second bit array.

Return Value

Returns a new bit array resulting from the bitwise AND operation on a and b.




In [100]:
def bitwise_and(a, b):
    result_arr = a.copy()
    result_arr &= b
    return result_arr
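
The worked example given for `bitwise_or` can be replayed with plain Python lists. This is only a sketch: the notebook operates on `bitarray` objects with `|=` and `&=`, whereas the hypothetical helpers `or_bits` and `and_bits` below use list comprehensions so that the snippet needs no third-party packages.

```python
def or_bits(a, b):
    """Element-wise OR of two equal-length 0/1 lists."""
    return [x | y for x, y in zip(a, b)]

def and_bits(a, b):
    """Element-wise AND of two equal-length 0/1 lists."""
    return [x & y for x, y in zip(a, b)]

a = [0, 0, 0, 1, 1, 0, 1]
b = [0, 1, 0, 1, 0, 0, 1]
print(or_bits(a, b))   # [0, 1, 0, 1, 1, 0, 1]
print(and_bits(a, b))  # [0, 0, 0, 1, 0, 0, 1]
```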

The `cardinality` function estimates the number of unique elements in a set represented by a Bloom filter. It is particularly useful for estimating the size of a set after unions or intersections have been performed on the Bloom filters.

The function counts the number of bits set to 1 (t) and combines it with the length of the Bloom filter (m = BFLEN) and the number of hash functions (k = NHASH) to estimate the number of unique elements as n ≈ -(m / k) · ln(1 - t / m). This estimate assumes that the hash functions spread the bits roughly uniformly over the array. It is undefined when every bit is set (t = m), because the logarithm's argument becomes 0.

Returns an integer estimate of the number of unique elements in the set.


In [101]:
def cardinality(union_lst):
    t = sum(int(bit) for bit in union_lst)
    return int(-1 * (BFLEN / NHASH) * math.log(1 - t/BFLEN))
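
As a rough check of the estimator, the sketch below fills a small filter with a known number of items and applies the same formula. It is an illustration under stated assumptions: SHA-256 from the standard library stands in for `mmh3`, a plain list stands in for `bitarray`, and the names `bit_positions` and `estimate_cardinality` as well as the parameters m = 8192 and k = 5 are made up for this example.

```python
import hashlib
import math

def bit_positions(item: str, m: int, k: int):
    """Yield k positions in [0, m) derived from SHA-256 (stand-in for mmh3)."""
    for seed in range(k):
        digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
        yield int.from_bytes(digest[:8], "big") % m

def estimate_cardinality(bits, k: int) -> int:
    """n ~ -(m / k) * ln(1 - t / m), where t is the number of set bits."""
    m = len(bits)
    t = sum(bits)
    return int(-(m / k) * math.log(1 - t / m))

m, k = 8192, 5                  # hypothetical filter parameters
bits = [0] * m
for card_id in range(500):      # 500 distinct, made-up card IDs
    for pos in bit_positions(str(card_id), m, k):
        bits[pos] = 1

print(estimate_cardinality(bits, k))  # close to 500; the exact value depends on the hashes
```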

The `calculate_accuracy` function computes the accuracy of the estimates against the ground truth (GT). We use it to evaluate how well our algorithms estimate the number of travelers.

Parameters

gt: A list containing the ground truth values for the size of sets.

estimated: A list containing the sizes estimated from the Bloom filters.

Return Value

Returns a list of accuracy values, calculated as 1 - (absolute difference between estimated and ground truth / ground truth), clipped at 0 and rounded to two decimal places. Perfect accuracy is represented as 1.0, and 0 indicates no accuracy. When the ground truth is 0, the accuracy is 1.0 if the estimate is also 0 and 0.0 otherwise.


In [102]:
def calculate_accuracy(gt, estimated):
    accuracy = []
    for i in range(len(gt)):
        if gt[i] == 0:
            if estimated[i] == 0:
                acc = 1.0  # Perfect accuracy if both are zero
            else:
                acc = 0.0  # No accuracy if gt is zero but estimated is not
        else:
            acc = max(1 - (abs(estimated[i] - gt[i]) / gt[i]), 0)
        accuracy.append(round(acc, 2))  # Format the accuracy value to two decimal places
    return accuracy
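
The accuracy formula can be checked by hand against the Line Topology figures reported in the output further down. The sketch below is a compact restatement of the same rule; the name `relative_accuracy` is hypothetical and used only here.

```python
def relative_accuracy(gt, estimated):
    """1 - |estimate - truth| / truth, clipped at 0 and rounded to two decimals."""
    scores = []
    for truth, est in zip(gt, estimated):
        if truth == 0:
            score = 1.0 if est == 0 else 0.0
        else:
            score = max(1 - abs(est - truth) / truth, 0)
        scores.append(round(score, 2))
    return scores

gt = [145, 247, 355, 529]
print(relative_accuracy(gt, [152, 258, 363, 528]))   # [0.95, 0.96, 0.98, 1.0]
print(relative_accuracy(gt, [190, 285, 375, 1059]))  # [0.69, 0.85, 0.94, 0]
```

An estimate that is off by more than 100% of the ground truth, such as 1059 against 529, is clipped to 0 rather than reported as a negative accuracy.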

The `split_df_by_table_in_table_out` function divides the DataFrame containing route information into two separate tables: one for check-in data and another for check-out data. This separation facilitates analyses that require distinct handling of check-in and check-out events.

Return Value

Returns a dictionary with two keys: 'check_in' and 'check_out'. Each key maps to a DataFrame where:

The check_in DataFrame contains columns for card ID, check-in time (epoch), check-in location (station), and route (line).

The check_out DataFrame contains columns for card ID, check-out time (epoch), check-out location (station), and route (line).



In [103]:
def split_df_by_table_in_table_out(filtered_routes):
    # This table selects only the columns related to check-in (card_ID, check-in time, location of check-in, and the route)
    check_in_table = filtered_routes[['card_ID', 'check_in', 'loc1', 'Routes']]
    check_in_table = check_in_table.rename({"card_id": "card_ID", "loc1": "station", "check_in": "epoch", 'Routes':"line"}, axis=1)
    
    # Create the check-out table
    check_out_table = filtered_routes[['card_ID', 'check_out', 'loc2', 'Routes']]
    check_out_table = check_out_table.rename({"card_id": "card_ID", "loc2": "station", "check_out": "epoch", 'Routes':"line"}, axis=1)
    # Combine the check-in and check-out tables into a dictionary for easy access
    dfs_in_out = {'check_in': check_in_table, 'check_out': check_out_table}    # The dictionary keys 'check_in' and 'check_out' can be used to retrieve the respective tables
    return dfs_in_out
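
The effect of the split is easiest to see on a single record. The sketch below mirrors the column mapping on plain dictionaries instead of a DataFrame; the helper name `split_in_out` and the sample record are assumptions made for illustration.

```python
def split_in_out(rows):
    """Split trip records into check-in and check-out events with a shared schema."""
    check_in = [{"card_ID": r["card_ID"], "epoch": r["check_in"],
                 "station": r["loc1"], "line": r["Routes"]} for r in rows]
    check_out = [{"card_ID": r["card_ID"], "epoch": r["check_out"],
                  "station": r["loc2"], "line": r["Routes"]} for r in rows]
    return {"check_in": check_in, "check_out": check_out}

# One hypothetical trip record
rows = [{"card_ID": 42, "check_in": 1000, "check_out": 1600,
         "loc1": "N_1", "loc2": "N_3", "Routes": "\\16002"}]
events = split_in_out(rows)
print(events["check_in"][0]["station"], events["check_out"][0]["station"])  # N_1 N_3
```

Both event lists share the keys `card_ID`, `epoch`, `station`, and `line`, which is what lets later steps treat check-ins and check-outs uniformly.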

The `extract_card_ids` function collects the card IDs recorded at each station in a given list of station names (one list per topology). This is needed to identify the cards (i.e., travelers) that passed through specific stations.

Returns a dictionary where each key is a station name from the STATIONS list, and the corresponding value is a list of card IDs that were recorded at that station.


In [104]:
def extract_card_ids(STATIONS, filtered_df):
    card_ids_dict = {}

    for station in STATIONS:
        # Check which columns are present to decide how to filter the DataFrame.
        has_loc1 = 'loc1' in filtered_df.columns
        has_loc2 = 'loc2' in filtered_df.columns
        if has_loc1 and has_loc2:
            # Detection scenario: the raw table has both 'loc1' and 'loc2', so take
            # card IDs where the station appears in either column.
            card_ids = filtered_df[(filtered_df['loc1'] == station) | (filtered_df['loc2'] == station)]['card_ID'].tolist()
        else:
            # Check-in or check-out table: the location column was renamed to 'station'
            # by split_df_by_table_in_table_out, so filter on that column directly.
            card_ids = filtered_df[filtered_df['station'] == station]['card_ID'].tolist()
        card_ids_dict[station] = card_ids

    return card_ids_dict 
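
For the check-in and check-out tables, the grouping amounts to collecting IDs per station. A minimal sketch on plain dictionaries, assuming hypothetical events and a made-up helper name `card_ids_by_station`:

```python
def card_ids_by_station(stations, events):
    """Map each station to the list of card IDs recorded there (empty if none)."""
    return {s: [e["card_ID"] for e in events if e["station"] == s]
            for s in stations}

# Hypothetical events
events = [{"card_ID": 1, "station": "N_1"},
          {"card_ID": 2, "station": "N_2"},
          {"card_ID": 3, "station": "N_1"}]
print(card_ids_by_station(["N_1", "N_2", "N_3"], events))
# {'N_1': [1, 3], 'N_2': [2], 'N_3': []}
```

A station with no records still gets a key with an empty list, so every station later receives a Bloom filter, even if it is all zeros.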

The `calculate_checkout_counts` function calculates the actual number of check-outs (ground truth) at each station in a given list of stations, counting only trips whose check-in occurred at an earlier station in the list. The first station therefore has no entry. This ground truth is used to validate the estimates of travelers and station usage within each topology.


In [105]:
def calculate_checkout_counts(df, STATIONS):
    checkout_counts = []
    
    for index, station in enumerate(STATIONS[1:], start=1):
        # Get all previous stations as a list of possible check-in locations
        possible_check_in_stations = STATIONS[:index]
        # Filter dataframe for rows where loc2 is the current station
        # and loc1 is in the list of possible check-in stations
        filtered_df = df[(df['loc2'] == station) & (df['loc1'].isin(possible_check_in_stations))]
        # Count the number of check-outs at the current station
        checkout_count = len(filtered_df)
        # Append the count to the list (one entry per station after the first)
        checkout_counts.append(checkout_count)

    return checkout_counts
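
The counting rule is direction-sensitive: a trip only counts at a station if it started at an earlier station in the list. The sketch below applies the same rule to a handful of hypothetical (check-in, check-out) pairs; the helper name `checkout_counts` and the sample trips are assumptions.

```python
def checkout_counts(trips, stations):
    """For each station after the first, count trips that check out there
    after checking in at any earlier station in the list."""
    counts = []
    for index, station in enumerate(stations[1:], start=1):
        earlier = set(stations[:index])
        counts.append(sum(1 for loc1, loc2 in trips
                          if loc2 == station and loc1 in earlier))
    return counts

# Hypothetical (check-in, check-out) pairs
trips = [("N_1", "N_2"), ("N_1", "N_3"), ("N_2", "N_3"),
         ("N_3", "N_2"), ("N_2", "N_4")]
print(checkout_counts(trips, ["N_1", "N_2", "N_3", "N_4"]))  # [1, 2, 1]
```

The trip from N_3 to N_2 runs against the station order and is therefore not counted at N_2.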

In the first scenario, we consider only detection data: a card counts as detected at a station if it checked in or checked out there. We use Bloom filters to represent the set of card IDs detected at each station and then estimate the size of the intersection between each station's set and the union of the sets of all prior stations in the sequence. For a station F preceded by stations A to E, this is:

(A ∪ B ∪ C ∪ D ∪ E) ∩ F

The `output_IDS_bfs` list contains, for each station from the second one onward, the estimated number of travelers detected at that station and at any prior station, based on the unions and intersections of Bloom filters.

The `first_result` variable holds the accuracy scores of these estimates against the ground truth, indicating how effective Bloom filters are for this type of analysis.

In [106]:
def calculate_first_scenario(STATIONS: List[str], filtered_df: pd.DataFrame, BFLEN: int, GT: List[int]) -> Tuple[List[int], List[float]]:
    # Extract card IDs and transfer them to Bloom filters for each station
    card_ids_dict = extract_card_ids(STATIONS, filtered_df)
    card_id_bfs = transfer_to_bfs(card_ids_dict)

    output_IDS_bfs = []
    for station in reversed(STATIONS):
        method_1 = []
        union = bitarray([False] * BFLEN)
        for prev_station_bfs in STATIONS[:STATIONS.index(station)]:
            union = bitwise_or(union, card_id_bfs[prev_station_bfs])
        inter_bfs = bitwise_and(card_id_bfs[station], union)
        method_1.append(cardinality(inter_bfs))
        
        if station == STATIONS[0]:  # Stop if it's the first station
            break
        output_IDS_bfs.extend(method_1)
    output_IDS_bfs.reverse()  # Reverse to maintain the correct order
    first_result = calculate_accuracy(GT, output_IDS_bfs)
    return output_IDS_bfs, first_result
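
The quantity that the Bloom filters approximate in this scenario can be computed exactly with Python sets, which is handy as a reference on small inputs. This is a sketch with exact sets in place of Bloom filters; the helper name `exact_overlaps` and the detections are hypothetical.

```python
def exact_overlaps(stations, ids_by_station):
    """For each station after the first, size of
    (union of IDs seen at earlier stations) ∩ (IDs seen at this station)."""
    sizes = []
    seen_before = set()
    for prev, station in zip(stations, stations[1:]):
        seen_before |= ids_by_station[prev]
        sizes.append(len(seen_before & ids_by_station[station]))
    return sizes

ids_by_station = {            # hypothetical detections
    "A": {1, 2, 3},
    "B": {2, 3, 4},
    "C": {1, 4, 5},
}
print(exact_overlaps(["A", "B", "C"], ids_by_station))  # [2, 2]
```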

The second scenario distinguishes between check-ins and check-outs, giving a more nuanced view of traveler flow. It simulates a more realistic situation in which passengers check in at the start of their journey and check out at their destination: for each station, the check-out filter is intersected with the union of the check-in filters of all prior stations.


In [107]:
def calculate_second_scenario(filtered_df, STATIONS, BFLEN, GT):
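    # Derive the check-in and check-out tables from filtered_df instead of relying on globals
    tables = split_df_by_table_in_table_out(filtered_df)
    card_ids_dict_in = extract_card_ids(STATIONS, tables['check_in'])
    card_ids_dict_out = extract_card_ids(STATIONS, tables['check_out'])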
    card_id_bfs_in = transfer_to_bfs(card_ids_dict_in)
    card_id_bfs_out = transfer_to_bfs(card_ids_dict_out)

    output_IDS_bfs_2 = []
    for station in reversed(STATIONS):
        scenario_2 = []
        union = bitarray([False] * BFLEN)
        for prev_station_bfs in STATIONS[:STATIONS.index(station)]:
            union = bitwise_or(union, card_id_bfs_in[prev_station_bfs])
        inter_bfs_2 = bitwise_and(card_id_bfs_out[station], union)
        scenario_2.append(cardinality(inter_bfs_2))
        
        if station == STATIONS[0]:  # If it's the first station, stop
            break
        output_IDS_bfs_2.extend(scenario_2)
    output_IDS_bfs_2.reverse()  # Reverse to maintain the correct order
    
    second_scenario_result = calculate_accuracy(GT, output_IDS_bfs_2)
    return output_IDS_bfs_2, second_scenario_result

The third scenario refines the analysis further by restricting the check-in and check-out records to specific lines within the network. This gives a detailed view of traveler flows on a particular line and is closer to real-world operations.


In [108]:
def calculate_third_scenario(check_in_table, check_out_table, STATIONS, BFLEN, LINE, GT):
    # Filter the check-in and check-out tables for records in the selected LINE
    check_in_table_line = check_in_table[check_in_table['line'].isin(LINE)]
    check_out_table_line = check_out_table[check_out_table['line'].isin(LINE)]
    # Extract card IDs and convert them to Bloom filters
    card_ids_dict_in_line = extract_card_ids(STATIONS, check_in_table_line)
    card_ids_dict_out_line = extract_card_ids(STATIONS, check_out_table_line)
    card_id_bfs_in_line = transfer_to_bfs(card_ids_dict_in_line)
    card_id_bfs_out_line = transfer_to_bfs(card_ids_dict_out_line)

    output_IDS_bfs_3 = []
    for station in reversed(STATIONS):
        scenario_3 = []
        union = bitarray([False] * BFLEN)
        for prev_station_bfs in STATIONS[:STATIONS.index(station)]:
            union = bitwise_or(union, card_id_bfs_in_line[prev_station_bfs])
        inter_bfs_3 = bitwise_and(card_id_bfs_out_line[station], union)
        scenario_3.append(cardinality(inter_bfs_3))
        
        if station == STATIONS[0]:  # Stop if it's the first station
            break
        output_IDS_bfs_3.extend(scenario_3)
    output_IDS_bfs_3.reverse()  # Reverse to maintain the correct order
    third_scenario_result = calculate_accuracy(GT, output_IDS_bfs_3)    
    return output_IDS_bfs_3, third_scenario_result

Function: calculate_scenario_accuracy

The `calculate_scenario_accuracy` function computes the estimated cardinality and the accuracy of a given scenario within the multi-path network topology analysis. It takes the union of the supplied station Bloom filters, intersects it with the filter of the station of interest, estimates the size of that intersection, and evaluates the accuracy against the provided ground truth.

In [109]:
def calculate_scenario_accuracy(card_id_bfs_4, boeierbrug_value, GT_line3_13, BFLEN):
    union = bitarray([False] * BFLEN)
    for station in card_id_bfs_4:
        union = bitwise_or(union, card_id_bfs_4[station])
    inter_bfs_line3_line13_scenario1 = bitwise_and(boeierbrug_value, union)
    scenario1_cardinality = [cardinality(inter_bfs_line3_line13_scenario1)]
    multi_paths_scenario_1_acc = calculate_accuracy(GT_line3_13, scenario1_cardinality)

    return scenario1_cardinality, multi_paths_scenario_1_acc

In this section, we load our dataset and prepare it for analysis by assigning unique identifiers to each traveler. This process is crucial for anonymizing the data and ensuring that each record can be individually tracked through the transportation network without revealing personal information.


We start by loading the dataset from a CSV file. This dataset, read from `synthetic_topology_data.csv` in the code below, contains records of travelers' movements within the Lelystad transportation network. The data is separated by semicolons (`;`), a common format for datasets where commas may appear within individual data fields.

Before employing Bloom Filters in our analysis, it's necessary to initialize their parameters. These parameters are crucial for optimizing the filter's efficiency and accuracy in representing the set of items (in our case card IDs) without taking up too much space.



In [110]:
########################################################
                   # Main
########################################################
df = pd.read_csv('synthetic_topology_data.csv', sep=';')
travelerIDSet = random.sample(range(100000000000000), len(df)) # Generate random traveler IDs
df['card_ID']=travelerIDSet # Assign random IDs to 'card_ID' column
BFLEN = get_size(len(df), FALSEPROB)
NHASH = get_hash_count(BFLEN, len(df))
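
A short note on the ID assignment: `random.sample` draws without replacement, so every row receives a distinct ID. The sketch below illustrates this on a hypothetical row count; `n_rows` is an assumption standing in for `len(df)`.

```python
import random

n_rows = 1000                                # hypothetical number of records
ids = random.sample(range(10**14), n_rows)   # sampling without replacement
print(len(ids), len(set(ids)))               # both equal n_rows
```

Since the IDs are assigned per row, each record is treated as a separate traveler in the analysis that follows.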


1- The code snippet below demonstrates how to iterate through a list of topologies, each represented as a dictionary with keys for the topology's name, the stations involved, the associated line, and the ground truth data. For each topology, the DataFrame is filtered to include only rows where the location of the event (`loc1` for check-ins or `loc2` for check-outs) matches one of the stations in the topology.

2- To evaluate the effectiveness of our analysis, it's crucial to establish a ground truth against which we can compare our estimates. The ground truth in this context refers to the actual number of check-outs at each station, which we calculate based on the filtered dataset. 

3- The filter_line DataFrame is a subset of the original data, filtered to include only transactions that occurred on the routes defined by LINE for each topology.

4- GT is a list containing the actual number of check-outs at each station within the STATIONS list, providing a basis for assessing the accuracy of our later analyses.

5- The first scenario results.

6- The second scenario results.

7- The third scenario results.



In [111]:
# Filter DataFrame for rows where either 'loc1' or 'loc2' matches the stations in the selected topology
for topology in topologies:
    STATIONS = topology["stations"]
    LINE = [topology["line"]]
    GROUND_TRUTH = [topology["ground_truth"]]
    
    print(f"\nRunning analysis for {topology['name']}")
 
    filtered_df = df[df['loc1'].isin(STATIONS) | df['loc2'].isin(STATIONS)] 

    ########################### Calculate ground truth (GT) for the checkout counts at each station###################################
    filter_line = filtered_df[filtered_df['Routes'].isin(LINE)]
    GT = calculate_checkout_counts(filter_line, STATIONS)
    print("GT:",GT)
    tables_dict = split_df_by_table_in_table_out(filtered_df)
    check_in_table = tables_dict['check_in']
    check_out_table = tables_dict['check_out']
    
    # first scenario
    first_scenario_estimated_sizes, first_scenario_accuracies = calculate_first_scenario(STATIONS, filtered_df, BFLEN, GT)
    print("First Scenario Estimated Sizes:", first_scenario_estimated_sizes)
    print("First Scenario Accuracies:", first_scenario_accuracies)
    
    # second scenario
    estimated_sizes_second_scenario, accuracies_second_scenario = calculate_second_scenario(filtered_df, STATIONS, BFLEN, GT)
    print("Second Scenario Estimated Sizes:", estimated_sizes_second_scenario)
    print("Second scenario Accuracies:", accuracies_second_scenario)
    
    # third scenario
    estimated_sizes_third_scenario, accuracies_third_scenario = calculate_third_scenario(check_in_table, check_out_table, STATIONS, BFLEN, LINE, GT)
    print("Third Scenario Estimated Sizes:", estimated_sizes_third_scenario)
    print("Third scenario Accuracies:", accuracies_third_scenario)


Running analysis for Line Topology
GT: [145, 247, 355, 529]
First Scenario Estimated Sizes: [421, 645, 816, 2181]
First Scenario Accuracies: [0, 0, 0, 0]
Second Scenario Estimated Sizes: [190, 285, 375, 1059]
Second scenario Accuracies: [0.69, 0.85, 0.94, 0]
Third Scenario Estimated Sizes: [152, 258, 363, 528]
Third scenario Accuracies: [0.95, 0.96, 0.98, 1.0]

Running analysis for Intersect Topology
GT: [194, 401, 613]
First Scenario Estimated Sizes: [1480, 980, 1244]
First Scenario Accuracies: [0, 0, 0]
Second Scenario Estimated Sizes: [626, 439, 614]
Second scenario Accuracies: [0, 0.91, 1.0]
Third Scenario Estimated Sizes: [202, 412, 614]
Third scenario Accuracies: [0.96, 0.97, 1.0]


The multi-path topology analysis starts by calculating the ground truth checkout counts at each station for a set of paths. Each path includes its stations and a specific line identifier. The process is as follows:

Iterate over each path in multi_path_topology.
For each path, identify the stations and line involved.
Filter the DataFrame to include only the relevant stations for the current path.
Calculate the ground truth (GT) for the checkout counts at each station.
Sum the last ground truth values from both paths to obtain a combined ground truth for the destination (the node of interest) in the multi-path scenario.

In [112]:
print("\nRunning analysis for multi_path_topology")
# Ground truth
gt_values = [] 
for path in multi_path_topology:
    STATIONS = path["stations"]
    LINE = [path["line"]]
    GROUND_TRUTH = [path["ground_truth"]]
    filtered_df = df[df['loc1'].isin(STATIONS) | df['loc2'].isin(STATIONS)]  # keep only trips that touch the stations of interest, not the whole network
    ########################### Calculate ground truth (GT) for the checkout counts at each station###################################
    filter_line = filtered_df[filtered_df['Routes'].isin(LINE)]
    GT = calculate_checkout_counts(filter_line, STATIONS)
    gt_values.append(GT[-1])   
GT_line3_13 = [sum(gt_values)]  # Summing the last GT values from both paths
print("Combined GT:", GT_line3_13) 


Running analysis for multi_path_topology
Combined GT: [1144]


Scenario 1 Analysis

In the first scenario, we combine the stations of two different lines, stations_line3 and stations_line13, into a unified list of stations. This list is then used to filter our dataset for relevant records. The goal is to calculate the estimated size and accuracy for the multi-path topology over these combined stations, focusing on one station of interest as the destination node: 'Lls Boeierbrug', whose Bloom filter is popped under the key 'N_11' in the code below.

The steps involved are:

Merge stations_line3 and stations_line13 lists to form a unified set of unique stations.

Filter the DataFrame to include only records related to these stations.

Extract and transform these stations into Bloom filters.

Remove the Bloom filter for 'Lls Boeierbrug' to use it in a separate intersection analysis.

Create a union of all remaining station Bloom filters.

Calculate the cardinality of the intersection between this union and the 'Lls Boeierbrug' filter.

Compute the accuracy of this scenario by comparing the cardinality against the combined ground truth values.

In [113]:
# Scenario 1
STATIONS = list(set(stations_line3 + stations_line13))
# Keep only trips that touch the stations of interest, not the whole network
filtered_df = df[df['loc1'].isin(STATIONS) | df['loc2'].isin(STATIONS)]
card_ids_dict_4 = extract_card_ids(STATIONS, filtered_df)
card_id_bfs_4 = transfer_to_bfs(card_ids_dict_4)
# Separate the filter of the station of interest from the remaining stations
boeierbrug_value = card_id_bfs_4.pop('N_11', None)
scenario1_cardinality, multi_paths_scenario_1_acc = calculate_scenario_accuracy(card_id_bfs_4, boeierbrug_value, GT_line3_13,BFLEN)
print("First Scenario Estimated Sizes Multi paths:", scenario1_cardinality)
print("First scenario Accuracies Multi paths:", multi_paths_scenario_1_acc)

First Scenario Estimated Sizes Multi paths: [1044]
First scenario Accuracies Multi paths: [0.91]


Scenario 2 Analysis

Scenario 2 extends the analysis by distinguishing between check-in and check-out activities at the stations, using the same unified list of stations as Scenario 1. The check-out filter of the 'Lls Boeierbrug' station is intersected with the union of the check-in filters.

In [114]:
#scenario2
tables_dict_3_13 = split_df_by_table_in_table_out(filtered_df)
check_in_table_3_13 = tables_dict_3_13['check_in']
check_out_table_3_13 = tables_dict_3_13['check_out']
card_ids_dict_in_s2 = extract_card_ids(STATIONS, check_in_table_3_13)
card_ids_dict_out_s2 = extract_card_ids(STATIONS, check_out_table_3_13)
card_id_bfs_in_s2 = transfer_to_bfs(card_ids_dict_in_s2)
card_id_bfs_out_s2 = transfer_to_bfs(card_ids_dict_out_s2)
boeierbrug_value_s2 = card_id_bfs_out_s2.pop('N_11', None)
# Note: the keys here are the N_* node names, so popping 'Lls Boeierbrug' removes nothing
card_id_bfs_in_s2.pop('Lls Boeierbrug', None)
scenario2_cardinality, multi_paths_scenario_2_acc = calculate_scenario_accuracy(card_id_bfs_in_s2, boeierbrug_value_s2, GT_line3_13,BFLEN)
print("second Scenario Estimated Sizes Multi paths:", scenario2_cardinality)
print("second scenario Accuracies Multi paths:", multi_paths_scenario_2_acc)


second Scenario Estimated Sizes Multi paths: [511]
second scenario Accuracies Multi paths: [0.45]


In Scenario 3, we restrict the analysis to two specific lines: Line 03 and Line 13. This scenario examines the check-in and check-out records at the stations belonging to these lines, again with 'Lls Boeierbrug' as the station of interest.

The analytical steps for this scenario include:

Filtering the check-in and check-out data tables to retain only those records associated with the targeted lines, Line 03 and Line 13.

From these filtered tables, extracting card IDs and transforming them into Bloom filters for a more refined analysis.

Isolating the 'Lls Boeierbrug' station's Bloom filter from both the check-in and check-out sets to allow for an individualized analysis of its interaction patterns.

Calculating the union of all "in" station Bloom filters, excluding 'Lls Boeierbrug', to gauge the cumulative interactions across these stations.

Conducting a bitwise AND operation between the 'Lls Boeierbrug' filter (from the check-out set) and the union of "in" Bloom filters to estimate the interaction scale.

Determining the cardinality from this interaction to estimate the scenario's size.

Comparing the estimated size with the combined ground truth for Lines 03 and 13 to measure accuracy.

In [115]:
#Third scenario
LINE = ["\\16003", "\\16013"]
check_in_table_line = check_in_table_3_13[check_in_table_3_13['line'].isin(LINE)]
check_out_table_line = check_out_table_3_13[check_out_table_3_13['line'].isin(LINE)]
card_ids_dict_in_s3 = extract_card_ids(STATIONS, check_in_table_line)
card_ids_dict_out_s3 = extract_card_ids(STATIONS, check_out_table_line)
card_id_bfs_in_s3 = transfer_to_bfs(card_ids_dict_in_s3)
card_id_bfs_out_s3 = transfer_to_bfs(card_ids_dict_out_s3)
boeierbrug_value_s3 = card_id_bfs_out_s3.pop('N_11', None)
card_id_bfs_in_s3.pop('N_11', None)
scenario3_cardinality, multi_paths_scenario_3_acc = calculate_scenario_accuracy(card_id_bfs_in_s3, boeierbrug_value_s3, GT_line3_13,BFLEN)
print("Third Scenario Estimated Sizes Multi paths:", scenario3_cardinality)
print("Third scenario Accuracies Multi paths:", multi_paths_scenario_3_acc)

Third Scenario Estimated Sizes Multi paths: [511]
Third scenario Accuracies Multi paths: [0.45]
