# Algorithms for Data Science -- Laboratory 3
Author: Pablo Mollá Chárlez

## Filtering Stream Items

### 1. Preliminaries 

The objective of this lab is to implement algorithms for filtering "good" items on streams. We start with a simple implementation that uses a single hash function; you will then implement the full Bloom filter. We assume a random stream $S$ drawn from a set of $m$ email strings, of which the first $g$ are the good ones, and a bit array $B$ of $n$ bits (for simplicity, implemented here as a Python list).

In [70]:
import random
import math
from string import ascii_lowercase

# Parameters
m = 100
g = 10
stream_size = 10000
n = 512

# Generate some random strings of size 5 + 1 (@) + 5
D = [''.join(random.choice(ascii_lowercase) for _ in range(5)) + '@' + ''.join(random.choice(ascii_lowercase) for _ in range(5)) for _ in range(m)]

print(D)

['wfdnr@pqujz', 'pqeno@hyjmk', 'jrtyf@sdfnw', 'qwsim@yslda', 'figxw@xxeva', 'egott@bedrx', 'bbotw@kythb', 'ggtnb@foirg', 'lfkgj@cwkqv', 'ylypp@ewrkq', 'nwfzb@yproi', 'gubfb@rnbhn', 'gvwbk@rvkrx', 'imqay@darfd', 'ptfjv@zfuvi', 'jjxid@aoowq', 'dtqal@rrtce', 'utubq@ffbqu', 'czxxs@uuxtr', 'mmykx@zznqc', 'xgkfa@cycwc', 'kgovt@ydyhl', 'damxf@jtcxx', 'dckfc@tmhog', 'nnenq@cjoti', 'erdob@ntfqi', 'jbwpv@qnjka', 'rlvqv@mpxte', 'dtwfn@wqwcg', 'fzmim@hjgee', 'wswde@bnizl', 'ikxsd@wjmyl', 'hybgr@kqfal', 'hfwys@ahlqf', 'hdioo@mkarx', 'nvguo@csolw', 'zqzfi@segjz', 'xyrrt@zaxyh', 'ulpex@khadw', 'iajln@kuhkg', 'umbcz@uwcwf', 'gtyur@qnkzi', 'uylvr@sszur', 'yhhmy@aigco', 'czzdv@cwtud', 'hafyg@gayvf', 'ogqsd@rwfjh', 'jleoa@ylplx', 'iwsss@nlgxs', 'cxdcx@eanfj', 'sdjno@qbdzq', 'celaa@ntegu', 'jjxqm@yojhe', 'xjtxp@boyiv', 'ldszv@bnkpq', 'esxju@cplju', 'jwwvz@sbdck', 'vpelg@kvxes', 'mbzlt@nntvi', 'ydbwu@xxgcs', 'xzvsw@epdmt', 'tubhh@lqljz', 'efztu@uhpwk', 'aujbf@zbekc', 'evgot@pyjfv', 'fojgj@xrbdk', 'bsjkr@gf

### 2. Creating a Hash Function, Filtering Items Using a Single Hash 

Below we create a hash function $h(x, n)$ that takes a value $x$ and the array size $n$ and returns an index in $\{0,\dots,n-1\}$. We populate the bit array $B$ with the good emails, then simulate a stream by drawing random values from $D$ and checking whether each one passes the filter. We count true and false positives and negatives, and report the false positive rate.

In [71]:
n = 128

# Hash Function
def h(x,n):
  return hash(x)%n

# Just for checking TP and FP rates
good_set = set(D[:g])
print("Good Set:", good_set)

# Allocate the array of 0s
B = [0] * n

# Fill the bit array: set the position given by each good email's hash
for i in range(g): B[h(D[i],n)] = 1
    
print("Good - Byte Array:", B)

tp = 0 # good items passing = true positive
fp = 0 # bad items passing = false positive
tn = 0 # bad items discarded = true negative
fn = 0 # good items discarded = false negative

# Simulate a stream
for _ in range(stream_size):
  # Take a random email
  s = random.choice(D)
  # Check its hash value
  if B[h(s,n)]==1: # Good
    if s not in good_set:
      fp += 1
    else:
      tp += 1
  else: # Bad 
    if s in good_set:
      fn += 1
    else:
      tn += 1

print('False positive rate: %f'%(float(fp)/float(tn+fp)))

Good Set: {'qwsim@yslda', 'figxw@xxeva', 'pqeno@hyjmk', 'wfdnr@pqujz', 'jrtyf@sdfnw', 'ylypp@ewrkq', 'ggtnb@foirg', 'egott@bedrx', 'bbotw@kythb', 'lfkgj@cwkqv'}
Good - Byte Array: [0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
False positive rate: 0.022065
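
The cell above prints only the false positive rate. As a minimal sketch, the helper below (the name `confusion_rates` is hypothetical, not part of the lab) derives the true positive, false positive, and false negative rates from the four counters, guarding against empty denominators. The counts in the example call are made up for illustration.

```python
def confusion_rates(tp, fp, tn, fn):
    """Return (TPR, FPR, FNR) computed from confusion-matrix counts."""
    positives = tp + fn   # stream items that are truly good
    negatives = fp + tn   # stream items that are truly bad
    tpr = tp / positives if positives else 0.0
    fpr = fp / negatives if negatives else 0.0
    fnr = fn / positives if positives else 0.0
    return tpr, fpr, fnr

# Illustrative counts only (not taken from the run above)
tpr, fpr, fnr = confusion_rates(tp=950, fp=200, tn=8850, fn=0)
print(f"TPR = {tpr:.4f}, FPR = {fpr:.4f}, FNR = {fnr:.4f}")
```

Because every good email sets its own position in $B$ before the stream starts, `fn` stays at 0 in this setup, so the true positive rate is 1 and the false negative rate is 0.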


We may also want a random hash function drawn from a pairwise independent family, which we will need when generating the $k$ hash functions of the Bloom filter.
The following procedure can be used:
* choose a large prime number $p$
* generate two random numbers $a \in \{1,\dots,p-1\}$ and $b \in \{0,\dots,p-1\}$
* the hash is then $h_{a,b}(x)=(ax+b) \bmod p$
* we can also restrict it to $\{0,\dots,n-1\}$ by taking the result modulo $n$
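
To see why this family is useful, one can estimate how often two fixed, distinct keys collide when $a$ and $b$ are redrawn at random; the rate should be close to $1/n$. The sketch below is only an illustration: the function name `collision_rate`, the prime `P = 2**31 - 1`, the two integer keys, and the fixed seed are all assumptions chosen for the example.

```python
import random

P = 2**31 - 1  # a known Mersenne prime, used here only for illustration

def collision_rate(x, y, n, trials=20000, seed=0):
    """Estimate P[h(x) == h(y)] over random draws of the coefficients (a, b)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        a = rng.randrange(1, P)   # a in {1, ..., P-1}
        b = rng.randrange(P)      # b in {0, ..., P-1}
        if ((a * x + b) % P) % n == ((a * y + b) % P) % n:
            hits += 1
    return hits / trials

n_buckets = 128
rate = collision_rate(12345, 67890, n_buckets)
print(f"estimated collision rate: {rate:.4f}  (1/n = {1 / n_buckets:.4f})")
```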

In [72]:
# Large Prime Number
p = 1223543677

# Two random coefficients; a must be nonzero, otherwise h ignores x
a = random.randrange(1, p)
b = random.randrange(p)

# Remark: here we use hash(x) instead of the values to allow for all hashable python types
# e.g., strings, tuples
def h(x,a,b,p,n):
  return ((a*hash(x)+b)%p)%n

# Reinitialize the array, for testing
B = [0] * n

for i in range(g): 
  B[h(D[i],a,b,p,n)] = 1

print(B)

[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0]
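
Note that Python's built-in `hash()` is salted per process for strings unless `PYTHONHASHSEED` is set, so the positions set in $B$ change from one run to the next. If reproducible positions are wanted, one option is to replace `hash(x)` with a digest-based integer. The sketch below assumes string inputs; the names `stable_int` and `h_stable` and the coefficient values are hypothetical.

```python
import hashlib

def stable_int(x):
    """Map a string to a 64-bit integer that is the same in every Python process."""
    digest = hashlib.sha256(x.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")

def h_stable(x, a, b, p, n):
    """Same affine form as h above, with hash(x) replaced by stable_int(x)."""
    return ((a * stable_int(x) + b) % p) % n

# Example with fixed, arbitrary coefficients (assumed values, for illustration)
idx = h_stable("wfdnr@pqujz", a=12345, b=678, p=2**31 - 1, n=128)
print(idx)
```

The trade-off is that this version only accepts strings, whereas `hash(x)` works for any hashable Python object.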


### 3. **TASK** - Bloom Filters


Your task is to implement a Bloom filter as described in the lecture. To do so:

1. Generate $k$ random pairwise independent hash functions (_hint_: use the example shown above)

2. Initialize $B$ by setting $B[h_i(x)]=1$ for every $i\in\{1,\dots,k\}$ and every item $x$ in the good set

3. An item $s$ in the stream is considered good if, for all $i\in\{1,\dots,k\}$, we have $B[h_i(s)]=1$

Measure the true positive and false positive rates for various values of $k$ and compare them with the values obtained when setting $k=\frac{n}{m}\ln 2$ (rounded to the nearest integer). What do you notice?

Rates:

- False positive rate: $\frac{FP}{FP+TN}$

- True positive rate: $\frac{TP}{TP+FN}$

In [73]:
# Generate k pairwise independent hash functions
def generate_hash_functions(k, p, n):
    # List to store all hash_functions
    hash_functions = []
    for _ in range(k):
        # Random integers within the range defined by the prime number
        a = random.randrange(1, p)
        b = random.randrange(0, p)
        
        # Creating and appending of the hash functions
        hash_functions.append(lambda x, a=a, b=b, p=p, n=n: ((a * hash(x) + b) % p) % n)
    return hash_functions


# Bloom filter implementation
def bloom_filter(k, D, good_set, n, stream_size):
    # Generate k hash functions (p is the global prime defined in the previous cell)
    hash_functions = generate_hash_functions(k, p, n)

    # Initialize the bit array
    B = [0] * n

    # Insert "good" emails into the Bloom filter
    for email in good_set:
        for h in hash_functions:
            B[h(email)] = 1

    # Check stream and compute rates
    tp = 0  # true positives
    fp = 0  # false positives
    tn = 0  # true negatives
    fn = 0  # false negatives

    # Simulate the stream
    for _ in range(stream_size):
        s = random.choice(D)
        if all(B[h(s)] == 1 for h in hash_functions):
            if s in good_set:
                tp += 1  # Correctly identified as good (true positive)
            else:
                fp += 1  # Incorrectly identified as good (false positive)
        else:
            if s in good_set:
                fn += 1  # Incorrectly discarded (false negative)
            else:
                tn += 1  # Correctly discarded as bad (true negative)

    # Calculate false positive and true positive rates
    fpr = fp / (fp + tn) if (fp + tn) > 0 else 0
    tpr = tp / (tp + fn) if (tp + fn) > 0 else 0

    return fpr, tpr

print("----------------------- Testing Bloom Filter -----------------------")
# Test with different values of k
for k in [1, 2, 3, 5, 10, 15, 20, 25, 30, 100]:
    fpr, tpr = bloom_filter(k, D, good_set, n, stream_size)
    print(f'k = {k}: False Positive Rate = {fpr:.6f}, True Positive Rate = {tpr:.6f}')
print("---------------------------------------------------------------------\n")


# Value of k given by the task formula k = (n / m) * ln(2), with m = len(D) as defined in the preliminaries
optimal_k = round((n / m) * math.log(2))
fpr, tpr = bloom_filter(optimal_k, D, good_set, n, stream_size)
print(f'Optimal k = {optimal_k}: False Positive Rate = {fpr:.6f}, True Positive Rate = {tpr:.6f}')

----------------------- Testing Bloom Filter -----------------------
k = 1: False Positive Rate = 0.065172, True Positive Rate = 1.000000
k = 2: False Positive Rate = 0.022391, True Positive Rate = 1.000000
k = 3: False Positive Rate = 0.000000, True Positive Rate = 1.000000
k = 5: False Positive Rate = 0.000000, True Positive Rate = 1.000000
k = 10: False Positive Rate = 0.011869, True Positive Rate = 1.000000
k = 15: False Positive Rate = 0.000000, True Positive Rate = 1.000000
k = 20: False Positive Rate = 0.000000, True Positive Rate = 1.000000
k = 25: False Positive Rate = 0.011767, True Positive Rate = 1.000000
k = 30: False Positive Rate = 0.010252, True Positive Rate = 1.000000
k = 100: False Positive Rate = 1.000000, True Positive Rate = 1.000000
---------------------------------------------------------------------

Optimal k = 1: False Positive Rate = 0.074247, True Positive Rate = 1.000000
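
One possible reading of these numbers: the formula $k=\frac{n}{m}\ln 2$ is usually stated with $m$ as the number of items stored in the filter, and here only the $g$ good emails are inserted. The sketch below assumes that reading and uses the standard approximation $(1-e^{-kg/n})^k$ for the false positive rate; the function name `theoretical_fpr` and the variable names are hypothetical, and the values of $n$ and $g$ are copied from the cells above.

```python
import math

def theoretical_fpr(k, n_bits, n_items):
    """Standard approximation (1 - e^{-k * n_items / n_bits})^k of the false positive rate."""
    return (1.0 - math.exp(-k * n_items / n_bits)) ** k

n_bits = 128    # size of B used in the experiments above
n_items = 10    # number of good emails actually inserted (g)

# k suggested by the formula when the inserted-item count g is used
k_opt = max(1, round(n_bits / n_items * math.log(2)))
print(f"k from (n/g) ln 2: {k_opt}")

for k in [1, 2, 3, 5, 10, 15, 20, 25, 30, 100]:
    print(f"k = {k:3d}: approximate FPR = {theoretical_fpr(k, n_bits, n_items):.6f}")
```

The measured rates above are computed over only 90 distinct bad emails and a single random draw of hash functions per $k$, so they fluctuate around the approximation rather than match it exactly. The true positive rate stays at 1 for every $k$ because a Bloom filter never rejects an item that was inserted.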
