The goal of this notebook is to build a Bloom filter on movie genres.\
The idea is to choose a genre, for example "western", and then run every movie in the list through the Bloom filter to tell whether or not it is a western. A Bloom filter can return false positives but never false negatives, so a "no" is definitive while a "yes" only means "probably".

In [1]:
# Import the needed modules as always
import csv
import mmh3
# We use the bitarray module for the Bloom filter because of the way Python handles lists:
# each element of a Python list is a full object, far larger than the single bit we need to store
from bitarray import bitarray

In [2]:
# Let's first define the Bloom filter class, since we have to reimplement it from scratch
class BloomFilter:
    # The main parameters of a Bloom filter are the size of the bitmap and the number of hash functions used
    def __init__(self, bitmap_size, hash_count):
        self.size = bitmap_size
        self.hash_count = hash_count
        # Create the bitarray and initialize every bit to 0
        self.bit_array = bitarray(bitmap_size)
        self.bit_array.setall(0)

    # To add a value to the filter, we have to hash it multiple times and set the corresponding bits to 1
    # We therefore need a hashing helper: here it directly returns a list of all the indices to be set to 1
    def _hashes(self, value):
        hash_results = []
        for i in range(self.hash_count):
            # Once again we use the mmh3 hash function since it is faster than SHA, with i as the seed
            # mmh3.hash64 returns a tuple of two 64-bit integers, so we keep only the first one
            # We then take the modulo of the hash value to make sure it fits in the bitmap
            hash_result = mmh3.hash64(value, i)[0] % self.size
            hash_results.append(hash_result)
        return hash_results

    # Adding a value to the filter is nothing but setting all the bits corresponding to the hash values to 1
    def add(self, value):
        for hash_val in self._hashes(value):
            self.bit_array[hash_val] = 1

    # Finally we need a function to "query" the filter, i.e. check if a value is in the filter
    # To do this we need to check if all the bits corresponding to the hash values are 1
    def query(self, value):
        # We use the built-in all() function, which returns True only if every element of its argument is truthy,
        # together with a generator expression to keep the check a one-liner
        return all(self.bit_array[hash_val] for hash_val in self._hashes(value))
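
The two constructor parameters, `bitmap_size` and `hash_count`, are usually derived from the number of items we expect to add and the false-positive rate we are willing to accept. Below is a minimal sketch using the standard sizing formulas, m = -n ln(p) / (ln 2)^2 and k = (m / n) ln 2. The helper name `bloom_parameters` and the example numbers are assumptions for illustration and are not part of the notebook.

```python
import math


def bloom_parameters(expected_items, false_positive_rate):
    """Return (bitmap_size, hash_count) from the standard Bloom filter sizing formulas.

    Hypothetical helper: the name and signature are illustrative only.
    """
    # m = -n * ln(p) / (ln 2)^2, rounded up to a whole number of bits
    bitmap_size = math.ceil(
        -expected_items * math.log(false_positive_rate) / (math.log(2) ** 2)
    )
    # k = (m / n) * ln 2, rounded to the nearest integer and at least 1
    hash_count = max(1, round(bitmap_size / expected_items * math.log(2)))
    return bitmap_size, hash_count


# Example numbers (assumed): 1000 movies and a 1% false-positive target
bitmap_size, hash_count = bloom_parameters(expected_items=1000, false_positive_rate=0.01)
print(bitmap_size, hash_count)
```

The resulting pair could then be passed directly to `BloomFilter(bitmap_size, hash_count)`.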

Now that we have defined the Bloom filter class, we can use it to our heart's content to filter what we want.
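
As a quick illustration of the add/query cycle, here is a small sketch that runs without any third-party package. `TinyBloom` is a hypothetical stand-in with the same interface as the class above. It swaps mmh3 for a salted BLAKE2b digest from `hashlib` and the bitarray for a `bytearray`, so it is not the notebook's implementation. The movie titles are hand-picked examples, not rows from the dataset.

```python
import hashlib


class TinyBloom:
    """Stdlib-only stand-in for BloomFilter (hypothetical, for illustration)."""

    def __init__(self, bitmap_size, hash_count):
        self.size = bitmap_size
        self.hash_count = hash_count
        # One byte per "bit" for simplicity; a real bitarray is 8x more compact
        self.bits = bytearray(bitmap_size)

    def _hashes(self, value):
        # One index per seed: the seed is used as the BLAKE2b salt
        for seed in range(self.hash_count):
            digest = hashlib.blake2b(
                value.encode("utf-8"),
                digest_size=8,
                salt=seed.to_bytes(8, "little"),
            ).digest()
            yield int.from_bytes(digest, "little") % self.size

    def add(self, value):
        for index in self._hashes(value):
            self.bits[index] = 1

    def query(self, value):
        return all(self.bits[index] for index in self._hashes(value))


# Hand-picked western titles (assumed example data)
westerns = ["The Good, the Bad and the Ugly", "Unforgiven", "True Grit"]

western_filter = TinyBloom(bitmap_size=1024, hash_count=3)
for title in westerns:
    western_filter.add(title)

# Every title that was added is always reported as present (no false negatives)
print(all(western_filter.query(title) for title in westerns))

# A title that was never added is most likely reported as absent,
# but a false positive remains possible
print(western_filter.query("Toy Story"))
```

With the notebook's own class, the same loop should work unchanged by replacing `TinyBloom` with `BloomFilter`.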