# Set Membership

The cell below defines two **abstract classes**. The first represents a set and its basic insert/search operations; you will need to implement this API four times, using (1) sequential search, (2) a binary search tree, (3) a balanced search tree, and (4) a Bloom filter. The second defines the synthetic data generator you will need to implement as part of your experimental framework.

**Do NOT modify the next cell**; use the dedicated cells further below for your implementation instead.

In [2]:
# DO NOT MODIFY THIS CELL

from abc import ABC, abstractmethod  

# abstract class to represent a set and its insert/search operations
class AbstractSet(ABC):
    
    # constructor
    @abstractmethod
    def __init__(self):
        pass           
        
    # inserts "element" in the set
    # returns "True" after successful insertion, "False" if the element is already in the set
    # element : str
    # inserted : bool
    @abstractmethod
    def insertElement(self, element):     
        inserted = False
        return inserted   
    
    # checks whether "element" is in the set
    # returns "True" if it is, "False" otherwise
    # element : str
    # found : bool
    @abstractmethod
    def searchElement(self, element):
        found = False
        return found    
    
    
    
# abstract class to represent a synthetic data generator
class AbstractTestDataGenerator(ABC):
    
    # constructor
    @abstractmethod
    def __init__(self):
        pass           
        
    # creates and returns a list of length "size" of strings
    # size : int
    # data : list<str>
    @abstractmethod
    def generateData(self, size):     
        data = [""]*size
        return data
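
When testing the four implementations, it can help to have a trivially correct reference to compare their answers against. The sketch below wraps Python's built-in `set` behind the same two operations. `ReferenceSet` is a hypothetical name that is not part of the assignment, and the abstract class is repeated in condensed form only so that the example runs on its own.

```python
from abc import ABC, abstractmethod

# Condensed copy of the abstract API, repeated so the sketch is self-contained.
class AbstractSet(ABC):
    @abstractmethod
    def __init__(self):
        pass

    @abstractmethod
    def insertElement(self, element):
        return False

    @abstractmethod
    def searchElement(self, element):
        return False

# Hypothetical reference implementation backed by Python's built-in set.
class ReferenceSet(AbstractSet):
    def __init__(self):
        self.items = set()

    def insertElement(self, element):
        # Report False when the element is already present, as the API requires.
        if element in self.items:
            return False
        self.items.add(element)
        return True

    def searchElement(self, element):
        return element in self.items

reference = ReferenceSet()
first_insert = reference.insertElement("whale")   # new element
second_insert = reference.insertElement("whale")  # duplicate
```

Comparing the return values of any of the four implementations against such a reference on the same input is a cheap way to catch logic errors before measuring performance.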


Use the cell below to define any auxiliary data structures and Python functions you may need. Leave the implementation of the main API to the next code cells instead.

In [3]:
# ADD AUXILIARY DATA STRUCTURE DEFINITIONS AND HELPER CODE HERE

class TreeNode_BinarySearchTree():
    def __init__(self, value):
        self.value = value
        self.left = None
        self.right = None

class TreeNode_BalancedSearchTree():
    def __init__(self, value):
        self.value = value
        self.left = None
        self.right = None
        self.top = None
        self.color = "red"

# Helper function for the Bloom filter: returns the bit index for hash number "seed".
# Uses double hashing, index = (hash1 + seed * hash2) % size, so that different
# seeds generally map the same string to different positions.
# (hash_count is unused but kept so that existing calls keep working.)
# seed : int, string : str, hash_count : int, size : int
# value : int
def hashFunction(seed, string, hash_count, size):
    hash1 = hash(string)
    hash2 = hash(str(hash1) + string)
    value = (hash1 + seed * hash2) % size
    return value

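# Helper function to estimate the false positive rate of a Bloom filter.
# "test_dataset" must only contain elements that were never inserted,
# so that every positive search result is a false positive.
def falsePositiveRate(bloom_filter, test_dataset):
    false_positives = 0
    for element in test_dataset:
        if bloom_filter.searchElement(element):
            false_positives += 1
    return false_positives / len(test_dataset)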
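
The helper above derives Bloom filter positions from Python's built-in `hash`, which is salted per process for strings, so the positions are only stable within a single run. Below is a minimal sketch of double hashing that yields `hash_count` probe positions per string and gives the same result in every run. It assumes the two base hashes are taken from one SHA-256 digest, and `double_hash_indices` is an illustrative name rather than part of the notebook.

```python
import hashlib

def double_hash_indices(element, hash_count, size):
    """Return hash_count bit positions for element using double hashing."""
    digest = hashlib.sha256(element.encode("utf-8")).digest()
    # Two independent-looking 64-bit values taken from the same digest.
    hash1 = int.from_bytes(digest[:8], "big")
    hash2 = int.from_bytes(digest[8:16], "big")
    # Keep the step in [1, size - 1] so consecutive probes never coincide.
    step = hash2 % (size - 1) + 1
    return [(hash1 + i * step) % size for i in range(hash_count)]

indices = double_hash_indices("whale", 13, 191701)
```

Computing the digest once and deriving all positions from it also avoids rehashing the string for every probe.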

Use the cell below to implement the requested API by means of **sequential search**.

In [18]:
class SequentialSearchSet(AbstractSet):
    def __init__(self):
        # ADD YOUR CODE HERE
        self.data = []        

    def insertElement(self, element):
        inserted = False
        # ADD YOUR CODE HERE
        if element in self.data:
            return inserted
        else:
            self.data.append(element)
            inserted = True
            return inserted

    def searchElement(self, element):     
        found = False
        # ADD YOUR CODE HERE
        for item in self.data:
            if item == element:
                found = True
                break
        return found
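
Because every insertion first scans the list for a duplicate, inserting n distinct elements costs n(n-1)/2 comparisons in total, which is why this implementation is expected to have the slowest inserts of the four. The sketch below counts those comparisons explicitly; `count_insert_comparisons` is a hypothetical helper written only for illustration.

```python
def count_insert_comparisons(words):
    """Count equality comparisons made by list-based insertion with a duplicate check."""
    data = []
    comparisons = 0
    for word in words:
        found = False
        for item in data:
            comparisons += 1
            if item == word:
                found = True
                break
        if not found:
            data.append(word)
    return comparisons

distinct_words = [str(i) for i in range(100)]
total = count_insert_comparisons(distinct_words)  # 100 * 99 / 2 comparisons
```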

Use the cell below to implement the requested API by means of **binary search tree**.

In [5]:
class BinarySearchTreeSet(AbstractSet):
    def __init__(self):
        self.root = None
        
    def insertElement(self, element):
        inserted = False
        # ADD YOUR CODE HERE
        new_node = TreeNode_BinarySearchTree(element)
        if self.root is None:
            self.root = new_node
            inserted = True
        else:
            current = self.root
            while True:
                if element < current.value:
                    # the element to insert is smaller: look at the left child
                    if current.left is None:
                        current.left = new_node
                        inserted = True
                        break
                    else:
                        # move to the left
                        current = current.left
                elif element > current.value:
                    # the element to insert is larger: look at the right child
                    if current.right is None:
                        current.right = new_node
                        inserted = True
                        break
                    else:
                        # move to the right
                        current = current.right
                else:
                    # element is already in the tree
                    break
        return inserted
    
    def searchElement(self, element):
        found = False
        # ADD YOUR CODE HERE
        current = self.root
        while current is not None:
            if element == current.value:
                found = True
                break
            elif element < current.value:
                # move to the left if the element being searched for is smaller
                current = current.left
            else:
                # move to the right if the element being searched for is larger
                current = current.right
        return found
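
An unbalanced binary search tree is only fast while the insertion order keeps it shallow: inserting already sorted data degenerates the tree into a linked list. The sketch below illustrates this with a stripped-down tree. `Node`, `insert`, `height` and `build` are stand-ins written for this example, not the classes defined above.

```python
import random

class Node:
    def __init__(self, value):
        self.value = value
        self.left = None
        self.right = None

def insert(root, value):
    """Insert value iteratively and return the (possibly new) root."""
    if root is None:
        return Node(value)
    current = root
    while True:
        if value < current.value:
            if current.left is None:
                current.left = Node(value)
                break
            current = current.left
        elif value > current.value:
            if current.right is None:
                current.right = Node(value)
                break
            current = current.right
        else:
            break  # duplicate: nothing to insert
    return root

def height(root):
    """Return the number of nodes on the longest root-to-leaf path."""
    deepest = 0
    stack = [(root, 1)] if root is not None else []
    while stack:
        node, depth = stack.pop()
        deepest = max(deepest, depth)
        for child in (node.left, node.right):
            if child is not None:
                stack.append((child, depth + 1))
    return deepest

def build(values):
    root = None
    for value in values:
        root = insert(root, value)
    return root

sorted_values = list(range(500))
shuffled_values = sorted_values[:]
random.Random(0).shuffle(shuffled_values)

sorted_height = height(build(sorted_values))      # one node per level
shuffled_height = height(build(shuffled_values))  # typically far smaller
```

Measuring the height this way on the real and synthetic inputs is one way to explain differences between the plain and the balanced tree timings.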

Use the cell below to implement the requested API by means of **balanced search tree**.

In [6]:
class BalancedSearchTreeSet(AbstractSet):
    
    def __init__(self):
        self.root = None       
    
    def insertElement(self, element):
        inserted = False
        if self.root is None:
            self.root = TreeNode_BalancedSearchTree(element)
            inserted = True
            return inserted
        else:
            currentNode = self.root
            while True:
                if currentNode.value == element:
                    return inserted
                elif currentNode.value > element:
                    if currentNode.left is None:
                        currentNode.left = TreeNode_BalancedSearchTree(element)
                        currentNode.left.top = currentNode
                        inserted = True
                        break
                    else:
                        currentNode = currentNode.left
                else:
                    if currentNode.right is None:
                        currentNode.right = TreeNode_BalancedSearchTree(element)
                        currentNode.right.top = currentNode
                        inserted = True
                        break
                    else:
                        currentNode = currentNode.right
            if inserted:
                while currentNode is not None:
                    if (currentNode.left is None or currentNode.left.color == "black") and currentNode.right is not None and currentNode.right.color == "red":
                        # left rotation: a red link leans right, so the right child moves up
                        tempNode = currentNode.right
                        if tempNode.left is not None:
                            currentNode.right = tempNode.left
                            tempNode.left.top = currentNode
                        else:
                            currentNode.right = None
                        # re-attach the rotated subtree to the parent (or make it the new root)
                        if currentNode.top is not None:
                            topNode = currentNode.top
                            if topNode.left == currentNode:
                                topNode.left = tempNode
                            else:
                                topNode.right = tempNode
                        elif currentNode == self.root:
                            self.root = tempNode
                        tempNode.left = currentNode
                        tempNode.top = currentNode.top
                        currentNode.top = tempNode
                        # the node moving up takes over the old color
                        tempNode.color = currentNode.color
                        currentNode.color = "red"
                    if currentNode.top is not None and currentNode.color == "red" and currentNode.left is not None and currentNode.left.color == "red":
                        # right rotation: two red links in a row on the left, so currentNode moves above its parent
                        rightNode = currentNode.top
                        currentNode.color = rightNode.color
                        rightNode.color = "red"
                        if rightNode.top is not None:
                            topNode = rightNode.top
                            if topNode.left == rightNode:
                                topNode.left = currentNode
                            else:
                                topNode.right = currentNode
                        if rightNode == self.root:
                            self.root = currentNode
                        rightNode.left = currentNode.right
                        if currentNode.right is not None:
                            currentNode.right.top = rightNode
                        currentNode.right = rightNode
                        currentNode.top = rightNode.top
                        rightNode.top = currentNode
                    if currentNode.left is not None and currentNode.left.color == "red" and currentNode.right is not None and currentNode.right.color == "red":
                        # color flip: both children are red, so pass the red link up to currentNode
                        currentNode.left.color = "black"
                        currentNode.right.color = "black"
                        currentNode.color = "red"
                    currentNode = currentNode.top
            return inserted

    def searchElement(self, element):     
        found = False
        currentNode = self.root
        while currentNode is not None:
            if currentNode.value == element:
                found = True
                return found
            elif currentNode.value > element:
                currentNode = currentNode.left
            else:
                currentNode = currentNode.right
        return found 
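
The rotations and color flips above follow the pattern of a left-leaning red-black tree. One way to sanity-check such a tree after a batch of insertions is to verify its invariants directly. The sketch below is a minimal checker under stated assumptions: it uses a hypothetical `Node` class with `left`, `right` and `color` attributes, treats the root as black regardless of its stored color, and has only been tried on hand-built trees, not on the class above.

```python
class Node:
    def __init__(self, value, color="red", left=None, right=None):
        self.value = value
        self.color = color
        self.left = left
        self.right = right

def is_red(node):
    return node is not None and node.color == "red"

def black_height(node, is_root=True):
    """Return the black height of the subtree; raise ValueError on a violated invariant."""
    if node is None:
        return 1
    node_is_red = is_red(node) and not is_root  # the root counts as black
    if is_red(node.right):
        raise ValueError("red link leans right")
    if node_is_red and is_red(node.left):
        raise ValueError("two red links in a row")
    left_height = black_height(node.left, False)
    right_height = black_height(node.right, False)
    if left_height != right_height:
        raise ValueError("unequal black heights")
    return left_height + (0 if node_is_red else 1)

# A valid two-element tree: black root with a red left child.
valid_tree = Node("m", "black", left=Node("c", "red"))
valid_height = black_height(valid_tree)

# An invalid tree: the red link leans to the right.
invalid_tree = Node("m", "black", right=Node("t", "red"))
try:
    black_height(invalid_tree)
    invalid_detected = False
except ValueError:
    invalid_detected = True
```

The checker is recursive, so for very deep trees an iterative version would be safer.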

Use the cell below to implement the requested API by means of **bloom filter**.

In [7]:
class BloomFilterSet(AbstractSet):
    
    def __init__(self):
        # ADD YOUR CODE HERE
        self.size = 191701
        self.hash_count = 13
        self.bit_array = [0] * self.size

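    def insertElement(self, element):
        inserted = False
        # ADD YOUR CODE HERE
        # Set the bit for each hash. If every bit was already set, the element is
        # (probably) already in the set, so report that nothing was inserted.
        for seed in range(self.hash_count):
            result = hashFunction(seed, element, self.hash_count, self.size)
            if self.bit_array[result] == 0:
                self.bit_array[result] = 1
                inserted = True
        return inserted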

    def searchElement(self, element):
        found = True
        # ADD YOUR CODE HERE
        for seed in range(self.hash_count):
            result = hashFunction(seed, element, self.hash_count, self.size)
            if self.bit_array[result] == 0:
                found = False
                break
        return found
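
The parameters chosen above (191701 bits, 13 hash functions) can be related to the expected false positive rate through the standard approximation `(1 - exp(-k * n / m)) ** k`, where `m` is the number of bits, `k` the number of hash functions and `n` the number of inserted elements. The sketch below evaluates it, assuming the `k` probes behave like independent hash functions; both function names are illustrative.

```python
import math

def expected_false_positive_rate(num_elements, num_bits, num_hashes):
    """Approximate false positive rate after num_elements insertions."""
    fill = 1.0 - math.exp(-num_hashes * num_elements / num_bits)
    return fill ** num_hashes

def optimal_hash_count(num_elements, num_bits):
    """Number of hash functions that minimizes the rate for a given load."""
    return (num_bits / num_elements) * math.log(2)

rate_at_10000 = expected_false_positive_rate(10000, 191701, 13)
best_k_at_10000 = optimal_hash_count(10000, 191701)
```

Under this approximation, 13 hash functions are close to optimal for about 10,000 elements, with a rate on the order of 1e-4. Inserting considerably more elements than that raises the rate quickly.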

Use the cell below to implement the **synthetic data generator** as part of your experimental framework.

In [8]:
import string
import random

class TestDataGenerator(AbstractTestDataGenerator):
    
    def __init__(self):
        # ADD YOUR CODE HERE
        pass           
        
    def generateData(self, size):     
        # ADD YOUR CODE HERE
        data = []
        # generate "size" random strings
        for i in range(size):
            # each string has a random length between 1 and 10 and consists of ASCII letters
            rand_len_str = random.randint(1, 10)
            rand_str = ''.join(random.choices(string.ascii_letters, k = rand_len_str))
            # add the string to the list (duplicates are possible)
            data.append(rand_str)
        # data.sort()
        return data
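
Since the generator draws from the global random state, every run produces different data, which makes timing results hard to reproduce. A possible variation, sketched below, threads an explicit seed through a private `random.Random` instance. The function name and the `seed`, `min_len` and `max_len` parameters are assumptions made for this example and are not part of the required API.

```python
import random
import string

def generate_strings(size, seed=None, min_len=1, max_len=10):
    """Return a list of `size` random ASCII-letter strings, reproducible for a given seed."""
    rng = random.Random(seed)  # private generator: leaves the global random state untouched
    data = []
    for _ in range(size):
        length = rng.randint(min_len, max_len)
        data.append("".join(rng.choices(string.ascii_letters, k=length)))
    return data

run_a = generate_strings(1000, seed=42)
run_b = generate_strings(1000, seed=42)
```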

Use the cells below for the Python code needed to **fully evaluate your implementations**, first on real data and subsequently on synthetic data (i.e., read data from test files / generate synthetic data, instantiate each of the 4 set implementations in turn, then thoroughly experiment with insert/search operations and measure their performance).

In [26]:
import os
import timeit

# Reads a file from the "testfiles" folder and returns its whitespace-separated words.
# os.path.join keeps the path portable across operating systems.

def read_words(filename):
    with open(os.path.join("testfiles", filename), "r", encoding = "utf-8") as current_file:
        return current_file.read().split()

# search_words_list stores the words to be searched for during the experiment.

search_words_list = read_words("test-search.txt")

# The test<index>_words_list lists store the words to be inserted during the experiment.

file_list = ["test1-mobydick.txt", "test2-warpeace.txt", "test3-dickens.txt"]
test1_words_list = read_words(file_list[0])
test2_words_list = read_words(file_list[1])
test3_words_list = read_words(file_list[2])

# Function to insert words stored in given words_list to a given data structure.

def test_insert(algorithms, words_list):
    for word in words_list:
        algorithms.insertElement(word)

# Function to search words stored in search_words_list in a given data structure.

def test_search(algorithms):
    for word in search_words_list:
        algorithms.searchElement(word)

# The following lists store the results of insert and search time of different files.
# ss is SequentialSearch
# bst is BinarySearchTree
# bal is BalancedSearchTree
# bf is BloomFilter
# [0] is the data for file "test1-mobydick.txt"
# [1] is the data for file "test2-warpeace.txt"
# [2] is the data for file "test3-dickens.txt"

ss_insert_time_data = []
ss_search_time_data = []

bst_insert_time_data = []
bst_search_time_data = []

bal_insert_time_data = []
bal_search_time_data = []

bf = BloomFilterSet()
bf_insert_time_data = []
bf_search_time_data = []

# Run the same insert/search measurement for every file and every implementation.
# Each entry: (display name, class, list of insert times, list of search times).
set_implementations = [
    ("Sequential Search", SequentialSearchSet, ss_insert_time_data, ss_search_time_data),
    ("Binary Search Tree", BinarySearchTreeSet, bst_insert_time_data, bst_search_time_data),
    ("Balanced Search Tree", BalancedSearchTreeSet, bal_insert_time_data, bal_search_time_data),
    ("Bloom Filter", BloomFilterSet, bf_insert_time_data, bf_search_time_data),
]
real_tests = [("test1", test1_words_list), ("test2", test2_words_list), ("test3", test3_words_list)]

for test_index, (test_name, words_list) in enumerate(real_tests):
    print(f"{test_name}:")
    for name, set_class, insert_times, search_times in set_implementations:
        # a fresh, empty structure for every file
        current_set = set_class()
        insert_time = timeit.timeit(lambda: test_insert(current_set, words_list), number = 1)
        insert_times.append(insert_time)
        print(f"{name} total insert time: {insert_time} seconds")
        search_time = timeit.timeit(lambda: test_search(current_set), number = 1)
        search_times.append(search_time)
        print(f"{name} total search time: {search_time} seconds")
    if test_index < len(real_tests) - 1:
        print("\n")
print("Real data test end.")

test1:
Sequential Search total insert time: 10.85253439983353 seconds
Sequential Search total search time: 0.2856450998224318 seconds
Binary Search Tree total insert time: 0.5241820001974702 seconds
Binary Search Tree total search time: 0.0019842996262013912 seconds
Balanced Search Tree total insert time: 0.5636499999091029 seconds
Balanced Search Tree total search time: 0.0014055999927222729 seconds
Bloom Filter total insert time: 2.196078199893236 seconds
Bloom Filter total search time: 0.005100099835544825 seconds


test2:
Sequential Search total insert time: 19.312011800240725 seconds
Sequential Search total search time: 0.3174239997752011 seconds
Binary Search Tree total insert time: 1.3722150004468858 seconds
Binary Search Tree total search time: 0.002388200256973505 seconds
Balanced Search Tree total insert time: 1.1934756999835372 seconds
Balanced Search Tree total search time: 0.0013227001763880253 seconds
Bloom Filter total insert time: 5.452361700125039 seconds
Bloom Filter 

KeyboardInterrupt: 
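
Each measurement above comes from a single run (`number = 1`), so it is sensitive to whatever else the machine was doing at the time. A possible refinement, sketched below under the assumption that rebuilding the structure for every repetition is affordable, is to repeat the measurement with `timeit.repeat` and keep the minimum. `time_inserts` and its `make_set` parameter are hypothetical names, and Python's built-in `set` stands in for the four implementations so that the sketch runs on its own.

```python
import timeit

def time_inserts(make_set, words, repeats=5):
    """Return the best-of-repeats time to build a fresh set and insert all words."""
    def run():
        target = make_set()
        for word in words:
            target.add(word)  # the notebook's classes would call insertElement here
    # timeit.repeat returns one total time per repetition; the minimum is the least disturbed.
    return min(timeit.repeat(run, number=1, repeat=repeats))

words = ["w%d" % i for i in range(1000)]
best_time = time_inserts(set, words)
```

Note that the timed function includes the construction of the empty structure, which is negligible for most implementations but not for a large pre-allocated bit array.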

In [33]:
import timeit

# ADD YOUR TEST CODE HERE TO WORK ON SYNTHETIC DATA

# Pre-fills a given data structure with the first "limit" words of words_list, so that the
# insert time can then be measured on a structure that already holds a known amount of data.

def prepare_test_insert(algorithms, words_list, limit):
    for word_index in range(limit):
        algorithms.insertElement(words_list[word_index])

# Inserts the word at position index_word_test of words_list into a given data structure.
# This is the single insertion that gets timed during the synthetic test.

def test_insert_Synthetic_Test(algorithms, words_list, index_word_test):
    algorithms.insertElement(words_list[index_word_test])

# Searches for every word of words_list in a given data structure, during the synthetic test.

def test_total_search_Synthetic_Test(algorithms, words_list):
    for word in words_list:
        algorithms.searchElement(word)

SyntheticTest = TestDataGenerator()

# The following lists store the results, one entry per data size from 1 to max_size.

ss_insert_time_data_for_synthetic = []
ss_average_search_time_data_for_synthetic = []
bst_insert_time_data_for_synthetic = []
bst_average_search_time_data_for_synthetic = []
bal_insert_time_data_for_synthetic = []
bal_average_search_time_data_for_synthetic = []
bf_insert_time_data_for_synthetic = []
bf_average_search_time_data_for_synthetic = []

# Test: for every data size, pre-fill each structure with all but the last word,
# time the insertion of the last word, then time a search for every word and
# store the average time per search.
max_size = 10000
for num_data in range(1, max_size + 1):
    data = SyntheticTest.generateData(num_data)
    print(num_data)
    # Each entry: (fresh structure, list of insert times, list of average search times).
    experiments = [
        (SequentialSearchSet(), ss_insert_time_data_for_synthetic, ss_average_search_time_data_for_synthetic),
        (BinarySearchTreeSet(), bst_insert_time_data_for_synthetic, bst_average_search_time_data_for_synthetic),
        (BalancedSearchTreeSet(), bal_insert_time_data_for_synthetic, bal_average_search_time_data_for_synthetic),
        (BloomFilterSet(), bf_insert_time_data_for_synthetic, bf_average_search_time_data_for_synthetic),
    ]
    for current_set, insert_times, average_search_times in experiments:
        # with num_data == 1 this inserts nothing, so the timed insert hits an empty structure
        prepare_test_insert(current_set, data, num_data - 1)
        insert_time = timeit.timeit(lambda: test_insert_Synthetic_Test(current_set, data, num_data - 1), number = 1)
        insert_times.append(insert_time)
        total_search_time = timeit.timeit(lambda: test_total_search_Synthetic_Test(current_set, data), number = 1)
        average_search_times.append(total_search_time / num_data)

# # Create an object of class TestDataGenerator and use the object to generateData
# SyntheticTest = TestDataGenerator()
# num_data = 1000
# data = SyntheticTest.generateData(num_data)

# ss = SequentialSearchSet()
# ss_insert_time_data_for_synthetic = []
# ss_average_search_time_data_for_synthetic = []
# ss_worst_search_time_data_for_synthetic = []
# for iteration in range(len(data)):
#     ss_insert_start_time = timeit.default_timer()
#     ss.insertElement(data[iteration])
#     ss_insert_end_time = timeit.default_timer()
#     ss_insert_time = ss_insert_end_time - ss_insert_start_time
#     ss_insert_time_data_for_synthetic.append(ss_insert_time)
#     ss_worst_search_time = 0
#     ss_tot_search_time = 0
#     for id in range(iteration + 1):
#         value = data[id]
#         ss_search_start_time = timeit.default_timer()
#         ss.searchElement(value)
#         ss_search_end_time = timeit.default_timer()
#         ss_search_time = ss_search_end_time - ss_search_start_time
#         if ss_search_time > ss_worst_search_time:
#             ss_worst_search_time = ss_search_time
#         ss_tot_search_time += ss_search_time
#     ss_worst_search_time_data_for_synthetic.append(ss_worst_search_time)
#     ss_average_search_time = ss_tot_search_time / (iteration + 1)
#     ss_average_search_time_data_for_synthetic.append(ss_average_search_time)


# bst = BinarySearchTreeSet()
# bst_insert_time_data_for_synthetic = []
# bst_average_search_time_data_for_synthetic = []
# bst_worst_search_time_data_for_synthetic = []
# for iteration in range(len(data)):
#     bst_insert_start_time = timeit.default_timer()
#     bst.insertElement(data[iteration])
#     bst_insert_end_time = timeit.default_timer()
#     bst_insert_time = bst_insert_end_time - bst_insert_start_time
#     bst_insert_time_data_for_synthetic.append(bst_insert_time)
#     bst_worst_search_time = 0
#     bst_tot_search_time = 0
#     for id in range(iteration + 1):
#         value = data[id]
#         bst_search_start_time = timeit.default_timer()
#         bst.searchElement(value)
#         bst_search_end_time = timeit.default_timer()
#         bst_search_time = bst_search_end_time - bst_search_start_time
#         if bst_search_time > bst_worst_search_time:
#             bst_worst_search_time = bst_search_time
#         bst_tot_search_time += bst_search_time
#     bst_worst_search_time_data_for_synthetic.append(bst_worst_search_time)
#     bst_average_search_time = bst_tot_search_time / (iteration + 1)
#     bst_average_search_time_data_for_synthetic.append(bst_average_search_time)

# bal = BalancedSearchTreeSet()
# bal_insert_time_data_for_synthetic = []
# bal_average_search_time_data_for_synthetic = []
# bal_worst_search_time_data_for_synthetic = []
# for iteration in range(len(data)):
#     bal_insert_start_time = timeit.default_timer()
#     bal.insertElement(data[iteration])
#     bal_insert_end_time = timeit.default_timer()
#     bal_insert_time = bal_insert_end_time - bal_insert_start_time
#     bal_insert_time_data_for_synthetic.append(bal_insert_time)
#     bal_worst_search_time = 0
#     bal_tot_search_time = 0
#     for id in range(iteration + 1):
#         value = data[id]
#         bal_search_start_time = timeit.default_timer()
#         bal.searchElement(value)
#         bal_search_end_time = timeit.default_timer()
#         bal_search_time = bal_search_end_time - bal_search_start_time
#         if bal_search_time > bal_worst_search_time:
#             bal_worst_search_time = bal_search_time
#         bal_tot_search_time += bal_search_time
#     bal_worst_search_time_data_for_synthetic.append(bal_worst_search_time)
#     bal_average_search_time = bal_tot_search_time / (iteration + 1)
#     bal_average_search_time_data_for_synthetic.append(bal_average_search_time)

# bf = BloomFilterSet()
# bf_insert_time_data_for_synthetic = []
# bf_average_search_time_data_for_synthetic = []
# bf_worst_search_time_data_for_synthetic = []
# for iteration in range(num_data):
#     bf_insert_start_time = timeit.default_timer()
#     bf.insertElement(data[iteration])
#     bf_insert_end_time = timeit.default_timer()
#     bf_insert_time = bf_insert_end_time - bf_insert_start_time
#     bf_insert_time_data_for_synthetic.append(bf_insert_time)
#     bf_worst_search_time = 0
#     bf_tot_search_time = 0
#     for id in range(iteration + 1):
#         value = data[id]
#         bf_search_start_time = timeit.default_timer()
#         bf.searchElement(value)
#         bf_search_end_time = timeit.default_timer()
#         bf_search_time = bf_search_end_time - bf_search_start_time
#         if bf_search_time > bf_worst_search_time:
#             bf_worst_search_time = bf_search_time
#         bf_tot_search_time += bf_search_time
#     bf_worst_search_time_data_for_synthetic.append(bf_worst_search_time)
#     bf_average_search_time = bf_tot_search_time / (iteration + 1)
#     bf_average_search_time_data_for_synthetic.append(bf_average_search_time)
    
# #-------------------------- Accuracy test for bloom filter-----------------------#

# # Data for the bloom filter
# BFTest = TestDataGenerator()
# BFdata = BFTest.generateData(1000)

# bf2 = BloomFilterSet()

# # insert 500 integers in the Bloom Filter
# for i in range(500):
#     bf2.insertElement(str(i))

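# # Test the Bloom Filter with 1000 integers: 0-499 were inserted, 500-999 were not
# false_positives = 0
# for i in range(1000):
#     if i < 500:
#         # inserted elements should always be found (a Bloom filter has no false negatives)
#         if not bf2.searchElement(str(i)):
#             pass
#     else:
#         # elements that were never inserted but are reported as present are false positives
#         if bf2.searchElement(str(i)):
#             false_positives += 1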

# print(f"False positives: {false_positives}") 

# # Print results
# # for num_data_have in range(num_data):
# #     print(f"When there are {num_data_have + 1} elements")
# #     print(f"Sequential Search insert time: {ss_insert_time_data_for_synthetic[num_data_have]:.6f} seconds")
# #     print(f"Sequential Search average search time: {ss_average_search_time_data_for_synthetic[num_data_have]:.6f} seconds")
# #     print(f"Sequential Search worst search time: {ss_worst_search_time_data_for_synthetic[num_data_have]:.6f} seconds")
# #     print(f"Binary Search Tree insert time: {bst_insert_time_data_for_synthetic[num_data_have]:.6f} seconds")
# #     print(f"Binary Search Tree average search time: {bst_average_search_time_data_for_synthetic[num_data_have]:.6f} seconds")
# #     print(f"Binary Search Tree worst search time: {bst_worst_search_time_data_for_synthetic[num_data_have]:.6f} seconds")
# #     print(f"Balanced Search Tree insert time: {bal_insert_time_data_for_synthetic[num_data_have]:.6f} seconds")
# #     print(f"Balanced Search Tree average search time: {bal_average_search_time_data_for_synthetic[num_data_have]:.6f} seconds")
# #     print(f"Balanced Search Tree worst search time: {bal_worst_search_time_data_for_synthetic[num_data_have]:.6f} seconds")
# #     print(f"Bloom Filter insert time: {bf_insert_time_data_for_synthetic[num_data_have]:.6f} seconds")
# #     print(f"Bloom Filter average search time: {bf_average_search_time_data_for_synthetic[num_data_have]:.6f} seconds")
# #     print(f"Bloom Filter worst search time: {bf_worst_search_time_data_for_synthetic[num_data_have]:.6f} seconds")

1
Sequential Search insert time: 6.6999346017837524e-06 seconds
2
Sequential Search insert time: 2.800021320581436e-06 seconds
3
Sequential Search insert time: 2.299901098012924e-06 seconds
4
Sequential Search insert time: 3.899913281202316e-06 seconds
5
Sequential Search insert time: 4.500150680541992e-06 seconds
6
Sequential Search insert time: 5.3998082876205444e-06 seconds
7
Sequential Search insert time: 6.899703294038773e-06 seconds
8
Sequential Search insert time: 6.600283086299896e-06 seconds
9
Sequential Search insert time: 5.900394171476364e-06 seconds
10
Sequential Search insert time: 4.8996880650520325e-06 seconds
11
Sequential Search insert time: 4.800036549568176e-06 seconds
12
Sequential Search insert time: 4.600267857313156e-06 seconds
13
Sequential Search insert time: 6.6999346017837524e-06 seconds
14
Sequential Search insert time: 5.200039595365524e-06 seconds
15
Sequential Search insert time: 4.600267857313156e-06 seconds
16
Sequential Search insert time: 6.000045686

KeyboardInterrupt: 
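The timing loops above repeat the same pattern for every set implementation. A minimal sketch of how that pattern could be factored into a single helper follows. `benchmark_set` and `ListSet` are hypothetical names introduced here for illustration only; `ListSet` is a stand-in for any class that provides `insertElement` and `searchElement`. Note that re-searching every stored element after each insertion makes the total work grow quadratically with the data size, which may be why the run above had to be interrupted.

```python
import timeit


class ListSet:
    """Hypothetical stand-in for any AbstractSet implementation."""

    def __init__(self):
        self._items = []

    def insertElement(self, element):
        if element in self._items:
            return False
        self._items.append(element)
        return True

    def searchElement(self, element):
        return element in self._items


def benchmark_set(set_obj, data):
    """Time each insertion, then time a search for every element inserted so far.

    Returns three lists, one entry per inserted element:
    insert times, average search times, worst search times (all in seconds).
    """
    insert_times = []
    average_search_times = []
    worst_search_times = []

    for count, element in enumerate(data, start=1):
        # time a single insertion
        start = timeit.default_timer()
        set_obj.insertElement(element)
        insert_times.append(timeit.default_timer() - start)

        # time a search for each of the `count` elements inserted so far
        worst = 0.0
        total = 0.0
        for value in data[:count]:
            start = timeit.default_timer()
            set_obj.searchElement(value)
            elapsed = timeit.default_timer() - start
            worst = max(worst, elapsed)
            total += elapsed

        worst_search_times.append(worst)
        average_search_times.append(total / count)

    return insert_times, average_search_times, worst_search_times


# usage with the stand-in class and a small synthetic input
sample = [str(i) for i in range(100)]
ins, avg, worst = benchmark_set(ListSet(), sample)
print(len(ins), len(avg), len(worst))
```

With a helper of this shape, each of the four implementations could be measured by one call instead of a copy of the loop, assuming they all follow the `AbstractSet` interface.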

In [1]:
import numpy as np
from matplotlib import pyplot as plt

#-------------------------- Plot-----------------------#

# Template reused for each plot: change the title, xlabel, ylabel and the data passed to scatter, polyfit and plot as needed.
x = np.arange(1, 1001)
plt.title("Binary Search Tree Space Complexity")
plt.xlabel("Size")
plt.ylabel("Space")
plt.scatter(x, bst_space_data_for_synthetic)
slope, intercept = np.polyfit(x, bst_space_data_for_synthetic, deg = 1)
plt.plot(x, intercept + slope * x, color = "k", lw = 2, label = 'y={:.8f}x + {:.8f}'.format(slope, intercept))
plt.legend(fontsize = 9)
plt.show()

ModuleNotFoundError: No module named 'numpy'
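The plot cell fits a straight line with `np.polyfit(..., deg = 1)`. As a sanity check that does not depend on NumPy being installed, the same degree-1 least-squares fit can be sketched with the closed-form formulas for slope and intercept. `linear_fit` is a hypothetical helper name, and the data below is made up purely to exercise it.

```python
def linear_fit(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need two sequences of equal length with at least 2 points")

    mean_x = sum(xs) / n
    mean_y = sum(ys) / n

    # sum of squared deviations of x, and of co-deviations of x and y
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    if sxx == 0:
        raise ValueError("all x values are identical; slope is undefined")

    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# illustrative data lying exactly on y = 2x + 1
xs = list(range(1, 11))
ys = [2 * x + 1 for x in xs]
slope, intercept = linear_fit(xs, ys)
print(slope, intercept)
```

The returned pair has the same order as the `slope, intercept` unpacking used in the plot cell.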

In [8]:
import math

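def calculate_bloom_filter_params(num_elements, false_pos_rate):
    # optimal bit-array size: m = -n * ln(p) / (ln 2)^2
    size = - (num_elements * math.log(false_pos_rate)) / (math.log(2)**2)
    # optimal number of hash functions: k = (m / n) * ln 2
    hash_count = (size / num_elements) * math.log(2)
    # int() truncates, so both values are rounded down
    return int(size), int(hash_count)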

calculate_bloom_filter_params(10000, 0.0001)

(191701, 13)
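Because both parameters are truncated to integers, it can be useful to check what false positive rate they imply. The sketch below uses the standard approximation p ≈ (1 - e^(-k·n/m))^k for a Bloom filter with m bits, k hash functions and n inserted elements. `expected_false_positive_rate` is a hypothetical helper name, not part of the assignment API.

```python
import math


def expected_false_positive_rate(size, hash_count, num_elements):
    """Approximate false positive rate of a Bloom filter.

    size: number of bits (m)
    hash_count: number of hash functions (k)
    num_elements: number of inserted elements (n)
    """
    # probability that a given bit is still 0 after all insertions
    prob_bit_zero = math.exp(-hash_count * num_elements / size)
    # a false positive needs all k probed bits to be 1
    return (1 - prob_bit_zero) ** hash_count


# parameters returned by calculate_bloom_filter_params(10000, 0.0001)
rate = expected_false_positive_rate(191701, 13, 10000)
print(f"{rate:.6f}")
```

For the parameters above the result should come out close to the 0.0001 target, which suggests the truncation has little effect at this size.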