Question 2: Suppose we have computed signatures for a number of columns, and each signature
consists of 24 integers, arranged as a column of 24 rows. There are N pairs of signatures that are 50%
similar (i.e., they agree in half of the rows). There are M pairs that are 20% similar, and all other pairs (an
unknown number) are 0% similar.

We can try to find 50%-similar pairs by using Locality-Sensitive Hashing (LSH), and we can do so by
choosing bands of 1, 2, 3, 4, 6, 8, 12, or 24 rows. Calculate approximately, in terms of N and M, the
number of false positives and the number of false negatives for each choice of the number of rows.
Then, suppose that we assign equal cost to false positives and false negatives (an atypical assumption).
Which number of rows would you choose if M:N were in each of the following ratios: 1:1, 10:1, 100:1, and
1000:1? Identify the correct choice from the list below.
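
One way to approach the calculation is sketched below. It assumes the standard banding analysis, in which rows agree independently, so a pair with similarity s becomes a candidate with probability 1 - (1 - s^r)^b for b = 24/r bands of r rows each. False negatives are the 50%-similar pairs that never become candidates, and false positives are the 20%-similar pairs that do; 0%-similar pairs can never agree in a band. The function names are illustrative, and the ratios are read as M:N = 1:1, 10:1, 100:1, and 1000:1, which is an assumption about the question's intent.

```python
# Sketch of the false positive / false negative trade-off for LSH banding.
# Assumption: rows agree independently, so the usual 1 - (1 - s^r)^b formula applies.

ROW_CHOICES = [1, 2, 3, 4, 6, 8, 12, 24]
SIGNATURE_LENGTH = 24


def candidate_probability(s, r, n=SIGNATURE_LENGTH):
    """Probability that a pair with similarity s agrees in at least one band of r rows."""
    b = n // r  # number of bands
    return 1 - (1 - s ** r) ** b


def error_counts(r, N=1.0, M=1.0):
    """Return (false_positives, false_negatives) for bands of r rows."""
    false_negatives = N * (1 - candidate_probability(0.5, r))  # 50% pairs missed
    false_positives = M * candidate_probability(0.2, r)        # 20% pairs flagged
    return false_positives, false_negatives


def best_rows(m_over_n):
    """Rows per band minimizing FP + FN (equal cost), with N fixed at 1 and M = m_over_n."""
    return min(ROW_CHOICES, key=lambda r: sum(error_counts(r, N=1.0, M=m_over_n)))


for r in ROW_CHOICES:
    fp, fn = error_counts(r)
    print(f"r={r:2d}: false positives ~ {fp:.3g}*M, false negatives ~ {fn:.3g}*N")

# Assumed reading of the ratios: M:N = 1:1, 10:1, 100:1, 1000:1.
for ratio in (1, 10, 100, 1000):
    print(f"M:N = {ratio}:1 -> rows per band = {best_rows(ratio)}")
```

Fixing N at 1 is enough here because only the ratio M:N affects which choice minimizes the total cost.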


In [4]:
import numpy as np
from collections import Counter

vectors = [
    np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0]),
    np.array([0, 1, 0, 0, 1, 0, 0, 1, 0, 1]),
    np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 0]),
    np.array([0, 1, 1, 1, 1, 1, 1, 1, 1, 1]),
    np.array([1, 0, 1, 1, 1, 1, 1, 1, 1, 1])
]

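def jaccard_distance(v1, v2):
    """Return the Jaccard distance (1 - |intersection| / |union|) of two 0/1 vectors."""
    v1 = [int(x) for x in v1]
    v2 = [int(x) for x in v2]

    # Count positions where both vectors are 1, and where at least one is 1.
    intersection = sum(1 for x, y in zip(v1, v2) if x == y == 1)
    union = sum(1 for x, y in zip(v1, v2) if x == 1 or y == 1)

    # Two all-zero vectors have an empty union; the index is taken as 0 here.
    jaccard_index = intersection / union if union != 0 else 0

    return 1 - jaccard_index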

distances = {}
for i in range(len(vectors)):
    for j in range(i + 1, len(vectors)):
        distance = jaccard_distance(vectors[i], vectors[j])
        distances[(i, j)] = distance

for pair, distance in distances.items():
    print(f"Jaccard distance vector {pair[0]} and vector {pair[1]}: {distance}")


Jaccard distance vector 0 and vector 1: 0.8571428571428572
Jaccard distance vector 0 and vector 2: 1.0
Jaccard distance vector 0 and vector 3: 0.7
Jaccard distance vector 0 and vector 4: 0.7
Jaccard distance vector 1 and vector 2: 0.8571428571428572
Jaccard distance vector 1 and vector 3: 0.5555555555555556
Jaccard distance vector 1 and vector 4: 0.7
Jaccard distance vector 2 and vector 3: 0.5555555555555556
Jaccard distance vector 2 and vector 4: 0.5555555555555556
Jaccard distance vector 3 and vector 4: 0.19999999999999996
