# Set Membership

The cell below defines two **abstract classes**. The first represents a set and its basic insert/search operations; you will need to implement this API four times, using (1) sequential search, (2) a binary search tree, (3) a balanced search tree, and (4) a Bloom filter. The second defines the synthetic data generator you will need to implement as part of your experimental framework. <br><br>**Do NOT modify the next cell** - use the dedicated cells further below for your implementation instead. <br>

In [1]:
# DO NOT MODIFY THIS CELL

from abc import ABC, abstractmethod  

# abstract class to represent a set and its insert/search operations
class AbstractSet(ABC):
    
    # constructor
    @abstractmethod
    def __init__(self):
        pass           
        
    # inserts "element" in the set
    # returns "True" after successful insertion, "False" if the element is already in the set
    # element : str
    # inserted : bool
    @abstractmethod
    def insertElement(self, element):     
        inserted = False
        return inserted   
    
    # checks whether "element" is in the set
    # returns "True" if it is, "False" otherwise
    # element : str
    # found : bool
    @abstractmethod
    def searchElement(self, element):
        found = False
        return found    
    
    
    
# abstract class to represent a synthetic data generator
class AbstractTestDataGenerator(ABC):
    
    # constructor
    @abstractmethod
    def __init__(self):
        pass           
        
    # creates and returns a list of length "size" of strings
    # size : int
    # data : list<str>
    @abstractmethod
    def generateData(self, size):     
        data = [""]*size
        return data   
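
As a point of reference for the contract described above, here is a minimal sketch of a concrete class that satisfies the insert/search API by delegating to Python's built-in `set`. The names `SetAPI` and `BuiltinSetOracle` are hypothetical: `SetAPI` is a trimmed stand-in for `AbstractSet` so that the snippet runs on its own, and the oracle is only meant as a correctness check for the four real implementations, not as one of them.

```python
from abc import ABC, abstractmethod


class SetAPI(ABC):
    """Hypothetical, trimmed stand-in for AbstractSet so this sketch runs alone."""

    @abstractmethod
    def insertElement(self, element):
        ...

    @abstractmethod
    def searchElement(self, element):
        ...


class BuiltinSetOracle(SetAPI):
    """Reference implementation backed by the built-in set (hypothetical name)."""

    def __init__(self):
        self.items = set()

    def insertElement(self, element):
        # False signals "already present", True signals a successful insertion
        if element in self.items:
            return False
        self.items.add(element)
        return True

    def searchElement(self, element):
        return element in self.items


oracle = BuiltinSetOracle()
print(oracle.insertElement("whale"))   # True: first insertion
print(oracle.insertElement("whale"))   # False: duplicate
print(oracle.searchElement("ship"))    # False: never inserted

# A class with unimplemented abstract methods cannot be instantiated
try:
    SetAPI()
except TypeError as error:
    print("cannot instantiate:", type(error).__name__)
```

Comparing the answers of each implementation against such an oracle on the same input is one simple way to catch mistakes before measuring performance.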


Use the cell below to define any auxiliary data structures and Python functions you may need. Leave the implementation of the main API to the next code cells instead.

In [2]:
# ADD AUXILIARY DATA STRUCTURE DEFINITIONS AND HELPER CODE HERE

# Bloom filter helper code:


In [3]:
class Node:
    def __init__(self, data=None, next=None):
        self.data = data
        self.next = next

class SequentialSearchSet(AbstractSet):
    def __init__(self):
        self.head = None

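    def insertElement(self, element):
        # a set stores each element once, so reject duplicates up front
        if self.searchElement(element):
            return False
        # prepend in O(1): the new node becomes the head of the list
        self.head = Node(element, self.head)
        return True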
    
    #def insertElement(self, element):
        #if self.head is None:
            #self.head = Node(element, None)
            #return
            
        
        #node = self.head
        #while node.next is not None:
            #node = node.next
        
        #node.next = Node(element, None)


    def searchElement(self, element):
        search_node = self.head
        while search_node is not None:
            if search_node.data == element:
                return True
            search_node = search_node.next
        return False
    
    def print(self):
        if self.head is None:
            print("Empty linked list")
            return

        node = self.head # Iterator will start at the beginning of the linked list
        linked_list_string = '' # Linked list string to print out the list of elements
        while node is not None:
            linked_list_string += str(node.data) + '-->'
            node = node.next # Following the linked list and iterating through elements one by one

        print(linked_list_string)


In [4]:
linked_list = SequentialSearchSet()
linked_list.insertElement(5)
linked_list.insertElement(2)
linked_list.print()
linked_list.searchElement(5)

2-->5-->


True

Use the cell below to implement the requested API by means of **sequential search**.

In [5]:
class SequentialSearchSet(AbstractSet):
    
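    def __init__(self):
        # backing store: a plain Python list that is scanned linearly
        self.arr = []

    def insertElement(self, element):
        # a set holds no duplicates, so scan first and refuse repeats
        if self.searchElement(element):
            return False
        self.arr.append(element)
        return True

    def searchElement(self, element):
        # sequential search: compare each stored value, stop at the first match
        for value in self.arr:
            if value == element:
                return True
        return False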

In [6]:
arrayList = SequentialSearchSet()
arrayList.insertElement(5)
arrayList.insertElement(2)
print(arrayList.arr)
arrayList.searchElement(5)

[5, 2]


True

Use the cell below to implement the requested API by means of **binary search tree**.

In [7]:
class BinarySearchTreeSet(AbstractSet):
    
    def __init__(self):
        # ADD YOUR CODE HERE

        
        pass           
     
    
        
    def insertElement(self, element):
        inserted = False
        # ADD YOUR CODE HERE
      
        
        return inserted
    
    

    def searchElement(self, element):     
        found = False
        # ADD YOUR CODE HERE

        
        return found    
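
One possible way to fill in the template above is sketched below. It is a minimal, unbalanced binary search tree written as a standalone class; the names `SketchBSTSet` and `_BSTNode` are hypothetical, and in the notebook the same methods would live inside `BinarySearchTreeSet`. The loops are iterative on purpose: sorted input degenerates the tree into a chain, and a recursive version could then hit Python's recursion limit.

```python
class _BSTNode:
    """One tree node holding a key and links to its two children."""

    def __init__(self, key):
        self.key = key
        self.left = None
        self.right = None


class SketchBSTSet:
    """Unbalanced binary search tree (hypothetical standalone sketch)."""

    def __init__(self):
        self.root = None

    def insertElement(self, element):
        if self.root is None:
            self.root = _BSTNode(element)
            return True
        node = self.root
        while True:
            if element == node.key:
                return False                      # already in the set
            if element < node.key:
                if node.left is None:
                    node.left = _BSTNode(element)
                    return True
                node = node.left                  # keep descending left
            else:
                if node.right is None:
                    node.right = _BSTNode(element)
                    return True
                node = node.right                 # keep descending right

    def searchElement(self, element):
        node = self.root
        while node is not None:
            if element == node.key:
                return True
            node = node.left if element < node.key else node.right
        return False


bst = SketchBSTSet()
for word in ["moby", "dick", "whale", "ahab", "moby"]:
    print(word, bst.insertElement(word))   # the second "moby" prints False
print(bst.searchElement("whale"))          # True
print(bst.searchElement("ishmael"))        # False
```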

Use the cell below to implement the requested API by means of **balanced search tree**.

In [8]:
class BalancedSearchTreeSet(AbstractSet):
    
    def __init__(self):
        # ADD YOUR CODE HERE

        
        pass           
     
    
        
    def insertElement(self, element):
        inserted = False
        # ADD YOUR CODE HERE
      
        
        return inserted
    
    

    def searchElement(self, element):     
        found = False
        # ADD YOUR CODE HERE

        
        return found    
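
The template above does not prescribe a particular balancing scheme. As one option, the sketch below uses a left-leaning red-black tree, swapped in here purely for illustration because its insertion logic is compact; an AVL tree or a 2-3 tree would serve equally well. All names (`SketchLLRBSet`, `_RBNode`, and the helper functions) are hypothetical.

```python
RED, BLACK = True, False


class _RBNode:
    def __init__(self, key):
        self.key = key
        self.left = None
        self.right = None
        self.color = RED          # new nodes are attached with a red link


def _is_red(node):
    return node is not None and node.color == RED


def _rotate_left(h):
    x = h.right
    h.right = x.left
    x.left = h
    x.color = h.color
    h.color = RED
    return x


def _rotate_right(h):
    x = h.left
    h.left = x.right
    x.right = h
    x.color = h.color
    h.color = RED
    return x


def _flip_colors(h):
    h.color = RED
    h.left.color = BLACK
    h.right.color = BLACK


class SketchLLRBSet:
    """Left-leaning red-black tree (hypothetical standalone sketch)."""

    def __init__(self):
        self.root = None
        self._inserted = False

    def insertElement(self, element):
        self._inserted = False
        self.root = self._insert(self.root, element)
        self.root.color = BLACK   # the root is always black
        return self._inserted

    def _insert(self, h, key):
        if h is None:
            self._inserted = True
            return _RBNode(key)
        if key < h.key:
            h.left = self._insert(h.left, key)
        elif key > h.key:
            h.right = self._insert(h.right, key)
        else:
            return h              # duplicate: nothing changes
        # restore the left-leaning invariants on the way back up
        if _is_red(h.right) and not _is_red(h.left):
            h = _rotate_left(h)
        if _is_red(h.left) and _is_red(h.left.left):
            h = _rotate_right(h)
        if _is_red(h.left) and _is_red(h.right):
            _flip_colors(h)
        return h

    def searchElement(self, element):
        node = self.root
        while node is not None:
            if element == node.key:
                return True
            node = node.left if element < node.key else node.right
        return False


tree = SketchLLRBSet()
for word in sorted(["ahab", "dick", "moby", "queequeg", "whale"]):
    tree.insertElement(word)              # sorted input does not degrade the tree
print(tree.insertElement("moby"))         # False: duplicate
print(tree.searchElement("queequeg"))     # True
```

Because the height stays logarithmic in the number of keys, the recursive insert is safe here even for sorted input, unlike in an unbalanced tree.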

Use the cell below to implement the requested API by means of **bloom filter**.

In [9]:
class BloomFilterSet(AbstractSet):
    
    def __init__(self):
        # ADD YOUR CODE HERE
        self.size = 9161520       # number of bit positions in the filter
        self.hash_count = 6       # number of hash functions applied per element
        self.bit_array = bytearray(self.size)   # one byte per bit position, all zero initially
        self.hash_functions = self.generate_hash_functions()
    
    def generate_hash_functions(self): # Generate a list of distinct hash functions, one per seed
        hash_functions = []
        for i in range(1, self.hash_count + 1):
            hash_functions.append(self.generate_hash_function(i))
        return hash_functions
    
    def generate_hash_function(self, seed): # Generate a hash function that depends on the given seed value
        def hash_function(value):
            # salt the string with the seed so each function maps to different positions
            return hash(value + str(seed)) % self.size
        return hash_function
        
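    def insertElement(self, element):
        inserted = False
        # ADD YOUR CODE HERE
        for hash_function in self.hash_functions:
            index = hash_function(element)
            # a bit flipping from 0 to 1 proves the element was not stored before
            if self.bit_array[index] == 0:
                self.bit_array[index] = 1
                inserted = True
        # if every bit was already set, the element is reported as already present
        # (this may be a false positive, as with any Bloom filter lookup)
        return inserted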

    def searchElement(self, element):     
        found = False
        # ADD YOUR CODE HERE
        for hash_function in self.hash_functions:
            if self.bit_array[hash_function(element)] == 0:
                return found
        found = True
        return found    
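
The filter above hard-codes 9161520 bit positions and 6 hash functions. To judge how such a configuration behaves for a given number of stored elements, the standard approximation for the false positive rate, p ≈ (1 - e^(-kn/m))^k, can be evaluated directly. The sketch below does that; the function names and the workload sizes are hypothetical and chosen only for illustration.

```python
import math


def expected_false_positive_rate(bits, hashes, items):
    """Standard approximation p = (1 - e^(-k*n/m))^k for a Bloom filter."""
    return (1.0 - math.exp(-hashes * items / bits)) ** hashes


def optimal_hash_count(bits, items):
    """k = (m/n) * ln 2, rounded to a whole number of hash functions (at least 1)."""
    return max(1, round(bits / items * math.log(2)))


bits, hashes = 9161520, 6                                 # values used by BloomFilterSet
for items in (100_000, 500_000, 1_000_000, 2_000_000):    # hypothetical workloads
    rate = expected_false_positive_rate(bits, hashes, items)
    best_k = optimal_hash_count(bits, items)
    print(f"{items:>9} items: estimated false positive rate {rate:.6f}, "
          f"optimal hash count {best_k}")
```

The estimate assumes independent, uniformly distributed hash functions, which salting Python's built-in `hash` only approximates, so measured rates may differ.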

Use the cell below to implement the **synthetic data generator** as part of your experimental framework.

In [10]:
import string
import random

class TestDataGenerator(AbstractTestDataGenerator):
    
    def __init__(self, length, lower=True, upper=True, digits=True, punctuation=False):
        self.length = length
        self.lower = lower
        self.upper = upper
        self.digits = digits
        self.punctuation = punctuation

    def getCharacters(self):
        character = ''
        if self.lower:
            character += string.ascii_lowercase
        if self.upper:
            character += string.ascii_uppercase
        if self.digits:
            character += string.digits
        if self.punctuation:
            character += string.punctuation

        return character

        
    def generateData(self, size):     
        data = [""] * size
        characters = self.getCharacters()

        for i in range(size):
            # build one random string of self.length characters
            data[i] = ''.join(random.choice(characters) for _ in range(self.length))

        # return the list, as the abstract API requires, instead of printing it
        return data


test = TestDataGenerator(20)
for value in test.generateData(10):
    print(value)



f3svWp3HScpEmzjJNDSc
rv02ABXidLO69WJ7nyYl
cxSvkqMGHdOLZh76Fskh
jqz5J8R95wNaTqPJUpAX
wrnZLIO9SfCqvCo0f3hD
CJMqiXWYq2i09WssyTn0
e2wiFZdW2UBrQiSb3zJt
CfY8K4uMV3ApCyjz9c8H
GmdRrR99zT3qWhVvLuam
F8UCcJ7fVfxrbQAm7xm8
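
For experiments that need to be repeatable, it can help to draw the random strings from a seeded generator, so that every set implementation is measured on exactly the same data. The following is a minimal sketch under that assumption; `generate_strings` and its parameters are hypothetical names, not part of the required API.

```python
import random
import string


def generate_strings(size, length, seed=None,
                     alphabet=string.ascii_letters + string.digits):
    """Return `size` random strings of `length` characters each."""
    rng = random.Random(seed)     # private generator: does not disturb global state
    return ["".join(rng.choices(alphabet, k=length)) for _ in range(size)]


first = generate_strings(5, 20, seed=42)
second = generate_strings(5, 20, seed=42)
print(first == second)             # True: same seed, same data
print(len(first), len(first[0]))   # 5 20
```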


Use the cells below for the Python code needed to **fully evaluate your implementations**, first on real data and subsequently on synthetic data (i.e., read data from the test files or generate synthetic data, instantiate each of the 4 set implementations in turn, then thoroughly experiment with insert/search operations and measure their performance).

In [18]:
import timeit

# ADD YOUR TEST CODE HERE TO WORK ON REAL DATA
iterations = 1

# Read every file once, up front: a file handle is exhausted after a single
# pass, so reusing it would leave later tests iterating over nothing, and
# reading here also keeps disk I/O out of the timed sections.
def read_words(path):
    with open(path, "r") as file:
        return [word for line in file for word in line.split()]

test_words_1 = read_words("./testfiles/test1-mobydick.txt")
test_words_2 = read_words("./testfiles/test2-warpeace.txt")
test_words_3 = read_words("./testfiles/test3-dickens.txt")

# the search file holds one query per line
with open("./testfiles/test-search.txt", "r") as file:
    search_words = [line.strip() for line in file]


def bloomfilter_insert(bloomfilter, words):
    for word in words:
        bloomfilter.insertElement(word)

def bloomfilter_search(bf):
    for word in search_words:
        if bf.searchElement(word):
            print("found word: " + word)


def bf_test(words, label):
    # time inserting all words, then time looking up every search word
    bloomfilter = BloomFilterSet()
    bf_insert_time_taken = timeit.timeit(lambda: bloomfilter_insert(bloomfilter, words), number=iterations)
    print("Bloom filter insert time taken for " + label + ":", bf_insert_time_taken/iterations)
    bf_search_time_taken = timeit.timeit(lambda: bloomfilter_search(bloomfilter), number=iterations)
    print("Bloom filter search time taken for " + label + ":", bf_search_time_taken/iterations)

def bf_test_1():
    bf_test(test_words_1, "file 1")

def bf_test_2():
    bf_test(test_words_2, "file 2")

def bf_test_3():
    bf_test(test_words_3, "file 3")

bf_test_1()
print("--------------------------------------------------")
bf_test_2()
print("--------------------------------------------------")
bf_test_3()
    
            






Bloom filter insert time taken for file 1: 0.3095694170333445
found word: able
found word: about
found word: above
found word: according
found word: accordingly
found word: across
found word: actually
found word: after
found word: afterwards
found word: again
found word: against
found word: ain't
found word: all
found word: allow
found word: almost
found word: alone
found word: along
found word: already
found word: also
found word: although
found word: always
found word: am
found word: among
found word: amongst
found word: an
found word: and
found word: another
found word: any
found word: anybody
found word: anyhow
found word: anyone
found word: anything
found word: anyway
found word: anyways
found word: anywhere
found word: apart
found word: appear
found word: appropriate
found word: are
found word: around
found word: as
found word: aside
found word: ask
found word: associated
found word: at
found word: available
found word: away
found word: be
found word: became
found word: because
f

In [12]:
import timeit

# ADD YOUR TEST CODE HERE TO WORK ON SYNTHETIC DATA
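
One way to approach the synthetic experiments is sketched below: generate data sets of increasing size, time the insert and search phases separately with `timeit`, and report the average cost per operation. Everything here is an assumption made for illustration. `ListBackedSet` is a hypothetical stand-in so that the snippet runs on its own, and `random_words` and `benchmark` are hypothetical helper names; in the notebook, the four set classes and `TestDataGenerator` would take their place.

```python
import random
import string
import timeit


class ListBackedSet:
    """Hypothetical stand-in exposing the same insert/search API."""

    def __init__(self):
        self.items = []

    def insertElement(self, element):
        if element in self.items:
            return False
        self.items.append(element)
        return True

    def searchElement(self, element):
        return element in self.items


def random_words(size, length, rng):
    alphabet = string.ascii_letters + string.digits
    return ["".join(rng.choices(alphabet, k=length)) for _ in range(size)]


def benchmark(set_factory, sizes, word_length=20, seed=0):
    """Return (size, seconds per insert, seconds per search) for each size."""
    rng = random.Random(seed)
    results = []
    for size in sizes:
        words = random_words(size, word_length, rng)
        # half of the queries are stored words, half are fresh random strings
        queries = words[: size // 2] + random_words(size // 2, word_length, rng)
        target = set_factory()
        insert_time = timeit.timeit(
            lambda: [target.insertElement(w) for w in words], number=1)
        search_time = timeit.timeit(
            lambda: [target.searchElement(q) for q in queries], number=1)
        results.append((size, insert_time / size, search_time / len(queries)))
    return results


results = benchmark(ListBackedSet, sizes=[100, 200, 400])
for size, per_insert, per_search in results:
    print(f"n={size:>4}  insert {per_insert:.2e} s/op  search {per_search:.2e} s/op")
```

Doubling the size at each step makes growth trends easy to read: per-operation cost that roughly doubles suggests linear behavior, while cost that barely moves suggests logarithmic or constant behavior.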



