The goal of this notebook is to build a Bloom filter on movie genres.\
The idea is to pick a genre, for instance Western, and to build a filter containing all the movies that have this genre. This can be done in two ways: either by using the movie ID, or by using the full vector of information about the movie (without the genre attribute).\
The goal is then to "reconstruct" the genre information from the filter alone, without access to the genre attribute.
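
The second option, keying on the full attribute vector, needs a deterministic way to turn a row into a single hashable value. Below is a minimal sketch of one way to do it. The helper `row_key` and the columns `title` and `year` are hypothetical, since the actual columns of the movies file are not shown here.

```python
def row_key(row, exclude=("genre",)):
    """Build a deterministic string key from every attribute except the excluded ones."""
    # Sort the column names so the key does not depend on the order of the columns
    fields = sorted(name for name in row if name not in exclude)
    # Join with a separator that is unlikely to appear in the data (ASCII unit separator)
    return "\x1f".join(f"{name}={row[name]}" for name in fields)


# Hypothetical rows: the same movie with and without its genre attribute
movie = {"id": "42", "title": "Example", "year": "1966", "genre": "Western"}
same_movie_no_genre = {"year": "1966", "id": "42", "title": "Example"}

# Both rows map to the same key, so a filter built from the first can be queried with the second
print(row_key(movie) == row_key(same_movie_no_genre))
```

The resulting string can be passed to the filter exactly like a movie ID.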

In [17]:
# Import the needed modules as always
import csv
import mmh3
import random
import time
# We will be using the bitarray module for the bloom filter because of the way Python handles lists
# Elements of a Python list are objects and are therefore way larger than the bits we need to store
from bitarray import bitarray

In [4]:
# Define the global variables
sampled = False
sr = 0.1 if sampled else 1.0
genres_csv = "data/genres.csv"
movies_csv = "data/movies.csv"

In [11]:
# Let's first define the Bloom filter class, since we have to reimplement it from scratch
class BloomFilter:
    # The main parameters of a bloom filter are : the size of the bitmap and the number of hash functions used
    def __init__(self, bitmap_size, hash_count):
        self.size = bitmap_size
        self.hash_count = hash_count
        # Create the bitarray and set every bit to 0
        self.bit_array = bitarray(bitmap_size)
        self.bit_array.setall(0)

    # To add a value to the filter, we have to hash it multiple times and set the corresponding bits to 1
    # Therefore we need to implement a hash function : in that case the hash function will directly return an array of all the indices to be set to 1
    def _hashes(self, value):
        hash_results = []
        for i in range(self.hash_count):
            # Once again we use the mmh3 hash function (here with 64 bits output) since it is faster than SHA
            # We then take the modulo of the hash value to make sure it fits in the bitmap
            # Note that the hash64 function returns a tuple with two 64 bit integers because it uses the same backend as the hash128, so we need to take only one
            hash_result = mmh3.hash64(value, i)[0] % self.size
            hash_results.append(hash_result)
        return hash_results

    # Adding a value to the filter is nothing but setting all the bits corresponding to the hash values to 1
    def add(self, value):
        for hash_val in self._hashes(value):
            self.bit_array[hash_val] = 1

    # Finally we need a function to "query" the filter, i.e. check if a value is in the filter
    # To do this we need to check if all the bits corresponding to the hash values are 1
    def query(self, value):
        # We use the built-in all() function, which returns True if every element of its argument is true,
        # and a generator expression to keep the check on one line (all() stops at the first 0 bit)
        return all(self.bit_array[hash_val] for hash_val in self._hashes(value))
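
The behaviour described above (no false negatives, occasional false positives) can be checked on a toy example. The sketch below is not the class from this notebook: it is a dependency-free stand-in, with the hypothetical name `TinyBloom`, that swaps `mmh3` for `hashlib.blake2b` and the `bitarray` for a `bytearray`, only so that it runs with the standard library.

```python
import hashlib


class TinyBloom:
    """Standard-library stand-in for a Bloom filter (illustration only)."""

    def __init__(self, size, hash_count):
        self.size = size
        self.hash_count = hash_count
        # One byte per position for simplicity; a real filter would use one bit
        self.bits = bytearray(size)

    def _hashes(self, value):
        for seed in range(self.hash_count):
            # The salt plays the role of the seed: each seed gives a different hash function
            digest = hashlib.blake2b(
                value.encode("utf-8"),
                digest_size=8,
                salt=seed.to_bytes(8, "little"),
            ).digest()
            yield int.from_bytes(digest, "little") % self.size

    def add(self, value):
        for index in self._hashes(value):
            self.bits[index] = 1

    def query(self, value):
        return all(self.bits[index] for index in self._hashes(value))


# Deliberately undersized filter so that collisions are likely
tiny = TinyBloom(size=64, hash_count=3)
members = [str(i) for i in range(20)]
for member in members:
    tiny.add(member)

# Every inserted value is reported as present: there are no false negatives
print(all(tiny.query(member) for member in members))

# Values that were never inserted may still be reported as present
others = [str(i) for i in range(1000, 2000)]
false_hits = sum(tiny.query(other) for other in others)
print(f"{false_hits} of {len(others)} non-members were wrongly reported as present")
```

With such a small bitmap most positions end up set, which is why the size has to be chosen according to the number of elements to insert.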

Now that we have defined the Bloom filter class, we can use it to our heart's content to filter what we want.\
In our case, we will process the genres file, add all the movie IDs that are associated with Western, and then process the movies file, which does not contain the genre information, and try to categorize the movies.\
We can also apply an "exact" approach to evaluate the performance of the filter.
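
One possible form of the "exact" approach is to keep the Western IDs in a Python `set` and compare its answers with the filter's. The sketch below uses small in-memory CSV data made up for the illustration; only the column names `id` and `genre` come from the notebook's files.

```python
import csv
import io

# Made-up stand-in for the genres file: one row per (movie, genre) pair
genres_data = io.StringIO("id,genre\n1,Western\n1,Drama\n2,Comedy\n3,Western\n")
western_ids = {
    row["id"] for row in csv.DictReader(genres_data) if row["genre"] == "Western"
}

# Made-up stand-in for the movies file, which has no genre column
movies_data = io.StringIO("id\n1\n2\n3\n4\n")
found = [row["id"] for row in csv.DictReader(movies_data) if row["id"] in western_ids]

print(found)  # ['1', '3']
```

A set never reports a false positive, but it stores every ID in full, so its memory footprint grows with the number and length of the keys, whereas the bitmap of a Bloom filter has a fixed size.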

In [21]:
# Let's first define a helper function to process the file and "fill up" the bloom filter
def init_filter(file, bloom):
    start = time.process_time()
    with open(file, 'r', encoding='utf-8') as csvfile:
        csvreader = csv.DictReader(csvfile)
        counter = 0
        for row in csvreader:
            # In that case we don't use the sample rate because we want ALL the western movies to be added to the filter
            if row['genre'] == "Western":
                bloom.add(row['id'])
                counter += 1
    print(f"Added {counter} Western movies to the bloom filter")
    print(f"Time taken to initialize the filter : {time.process_time() - start} seconds")
    return counter # We return the number of elements added to the filter for later use in the false positive rate calculation

In [22]:
# Then we can define the main function that will test the bloom filter (process the movies file)
def test_filter(file, bloom, sample_rate=1.0):
    start = time.process_time()
    with open(file, 'r', encoding='utf-8') as csvfile:
        csvreader = csv.DictReader(csvfile)
        counter = 0
        ids = []
        for row in csvreader:
            # We use the sample rate to only check a fraction of the movies
            if random.random() < sample_rate:
                # If the movie id is in the filter, we increment the counter
                if bloom.query(row['id']):
                    counter += 1
                    # We also keep track of the ids that have been found to check for false positives
                    # This is only needed if the set is sampled because if it is not we can just compute the rate with the number found vs real number
                    if sample_rate != 1.0:
                        ids.append(row['id'])
    print(f"Found {counter} Western movies in the set")
    print(f"Time taken to process the dataset : {time.process_time() - start} seconds")
    return counter, ids

In [23]:
# Let's also define a helper function to compute the number of false positives
def count_fp(ids):
    # A set gives constant-time membership tests instead of scanning the whole list for every row
    candidates = set(ids)
    true_westerns = set()
    # We start by collecting the candidate movies that are actually westerns
    with open(genres_csv, 'r', encoding='utf-8') as csvfile:
        csvreader = csv.DictReader(csvfile)
        for row in csvreader:
            if row['genre'] == "Western" and row['id'] in candidates:
                true_westerns.add(row['id'])
    # Every candidate that is not a real western is a false positive
    return sum(1 for movie_id in ids if movie_id not in true_westerns)

In [27]:
# Now that everything is defined, we can start the actual processing
# First we initialize the filter, the size of the bitmap will determine the performance of the filter, depending on the number of elements to add
bloom = BloomFilter(500000, 10)
# Initialize the filter
nb_westerns = init_filter(genres_csv, bloom)
# Test the filter
nb_found, ids = test_filter(movies_csv, bloom, sr)
# Compute the false positive rate
if not sampled:
    # If the set is not sampled, we can directly compute the rate
    # Since there cannot be false negatives, the number found is always >= the real number
    # (this assumes every Western id of the genres file appears exactly once in the movies file)
    false_positives = nb_found - nb_westerns
    flagged = nb_found
else:
    # If the set is sampled, we have to check whether the ids that have been found are really western movies
    false_positives = count_fp(ids)
    flagged = len(ids)
# Note: this rate is the share of flagged movies that are not westerns,
# and we guard against a division by zero when nothing was flagged
false_positive_rate = false_positives / flagged if flagged else 0.0
print(f"False positive rate : {false_positive_rate}")
# Let's quickly compute the memory usage of the filter
# Since we are dealing with a bitmap, each element is one bit, so the length of the bitmap is its size in bits (converted to bytes afterwards)
print(f"Memory usage of the filter : {len(bloom.bit_array) / 8} bytes")

Added 8638 Western movies to the bloom filter
Time taken to initialize the filter : 1.609375 seconds
Found 8638 Western movies in the set
Time taken to process the dataset : 9.640625 seconds
False positive rate : 0.0
Memory usage of the filter : 62500.0 bytes


After a first execution of the code, we can see that there are approximately 9000 Western movies in the dataset (8638 exactly). This helps us estimate how large the bitmap needs to be.\
With 5 hash functions and 9000 movies, we want a fairly sparse bitmap in order to avoid collisions as much as possible. If all the hash values were different, we would get about 45000 ones in the bitmap, so choosing a bitmap 4 to 5 times bigger than that is a good start.\
With 200000 bits in the bitmap, we get a false positive rate of about 3% on the whole dataset, which is already pretty good. The bitmap is about 25 kB and it takes less than 10 seconds to initialize the filter and process the dataset.\
The results can be improved by adding more hash functions and increasing the size of the bitmap. Of course, the processing time and the memory needed then grow as well, so there is a balance to find.\
With a bitmap of 500000 bits and 10 hash functions (about the same ratio as before), the false positive rate drops to 0, while the filter only takes about 62 kB of memory and about 10 seconds to process the whole dataset. Initializing the filter takes less than 2 seconds with this setup!
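
The sizing reasoning above can be cross-checked against the standard approximation for a Bloom filter with `m` bits, `k` hash functions and `n` inserted elements, `(1 - exp(-k*n/m))**k`. The sketch below applies it to the two configurations discussed; the function names are hypothetical. Note that this quantity is the probability that a single non-member is wrongly accepted, which is not the same ratio as the one printed by the notebook (false positives divided by the number of movies flagged), so the two figures are not expected to match.

```python
import math


def expected_fp_probability(n_items, n_bits, n_hashes):
    """Approximate probability that a non-member passes all the hash checks."""
    # Fraction of bits expected to be set after inserting n_items elements
    fill = 1.0 - math.exp(-n_hashes * n_items / n_bits)
    # A non-member is accepted only if all of its n_hashes positions are set
    return fill ** n_hashes


def optimal_hash_count(n_items, n_bits):
    """Number of hash functions that minimises the approximation above."""
    return max(1, round(n_bits / n_items * math.log(2)))


n_westerns = 8638  # number of Western movies reported by the notebook
for n_bits, n_hashes in [(200_000, 5), (500_000, 10)]:
    p = expected_fp_probability(n_westerns, n_bits, n_hashes)
    print(f"{n_bits} bits, {n_hashes} hashes: about {p:.2e} per non-member query")

print(f"Suggested hash count for 500000 bits: {optimal_hash_count(n_westerns, 500_000)}")
```

Multiplying the per-query probability by the number of non-Western movies in the dataset gives a rough estimate of how many false positives to expect for a given configuration.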