# Mining Massive Datasets Problem Set 6

Ruben Hartenstein, Taha Erkoc

# Exercise 1

The Jaccard similarity for two sets $C_1, C_2$ is defined as:

$sim(C_1, C_2) = \frac{|C_1 \cap C_2|}{|C_1 \cup C_2|}$

For multi-sets, where an element can be a member more than once, the definition of the intersection and union needs to account for the multiplicities of the elements.

The intersection of two multi-sets $C_1, C_2$ is defined as:

$(C_1 \cap C_2)(x) = min(C_1(x), C_2(x))$

where $C_1(x)$ and $C_2(x)$ denote the number of times the element $x$ occurs in $C_1$ and $C_2$, respectively.

Analogously, the union of two multi-sets is defined as:

$(C_1 \cup C_2)(x) = max(C_1(x), C_2(x))$


Using these two definitions, we can define the Jaccard similarity for multi-sets as:

$sim(C_1, C_2) = \frac{\sum_{x \in C_1 \cup C_2} min(C_1(x), C_2(x))}{\sum_{x \in C_1 \cup C_2} max(C_1(x), C_2(x))}$

With this definition, in the case of two ordinary sets, where each element appears at most once, the formula reduces to its original form:

$sim(C_1, C_2) = \frac{|C_1 \cap C_2|}{|C_1 \cup C_2|}$
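
The multi-set definition above can be sketched in Python. This is only an illustration, not part of the exercise: the function name `multiset_jaccard` and the convention for two empty multi-sets are assumptions. It relies on `collections.Counter`, whose `&` and `|` operators take the element-wise minimum and maximum of counts.

```python
from collections import Counter

def multiset_jaccard(c1, c2):
    """Jaccard similarity of two multi-sets given as iterables of hashable items."""
    a, b = Counter(c1), Counter(c2)
    intersection = sum((a & b).values())  # sum of min(C1(x), C2(x))
    union = sum((a | b).values())         # sum of max(C1(x), C2(x))
    if union == 0:
        return 1.0  # assumed convention: two empty multi-sets count as identical
    return intersection / union

# Multi-sets: min counts sum to 2, max counts sum to 4
print(multiset_jaccard("aab", "abb"))          # 0.5
# Ordinary sets: reduces to |intersection| / |union| = 2 / 4
print(multiset_jaccard({1, 2, 3}, {2, 3, 4}))  # 0.5
```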

# Exercise 3

### a)
Hash functions

$h_1(x) = (2x + 1) \mod{6} $ <br>
$h_2(x) = (3x + 2) \mod{6} $ <br>
$h_3(x) = (5x + 2) \mod{6} $ <br>

Now we compute the hash values for $x \in \{0,1,2,3,4,5\}$

For $h_1(x) = (2x + 1) \mod{6} $:

$h_1(0) = 1$ <br>
$h_1(1) = 3$ <br>
$h_1(2) = 5$ <br>
$h_1(3) = 1$ <br>
$h_1(4) = 3$ <br>
$h_1(5) = 5$ <br>

For $h_2(x) = (3x + 2) \mod{6} $:

$h_2(0) = 2$ <br>
$h_2(1) = 5$ <br>
$h_2(2) = 2$ <br>
$h_2(3) = 5$ <br>
$h_2(4) = 2$ <br>
$h_2(5) = 5$ <br>

For $h_3(x) = (5x + 2) \mod{6}$:

$h_3(0) = 2$ <br>
$h_3(1) = 1$ <br>
$h_3(2) = 0$ <br>
$h_3(3) = 5$ <br>
$h_3(4) = 4$ <br>
$h_3(5) = 3$ <br>

### MinHash Signature for each set:

$S_1 = \{2,5\}$ (rows with non-zero entries in $S_1$):

$h_1(2) = 5, h_1(5) = 5$, $Min = 5$<br>
$h_2(2) = 2, h_2(5) = 5$, $Min = 2$<br>
$h_3(2) = 0, h_3(5) = 3$, $Min = 0$<br>

Applying the same procedure to the sets $S_2, S_3, S_4$ gives the following signature table:

| Set  | $\min h_1$ | $\min h_2$ | $\min h_3$ |
|------|-----------|-----------|----------|
| $S_1$| 5         | 2         | 0        |
| $S_2$| 1         | 2         | 1        |
| $S_3$| 1         | 2         | 4        |
| $S_4$| 1         | 2         | 0        |
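
As a cross-check, the signature of $S_1$ can be recomputed with a few lines of Python. This is a minimal sketch: the helper name `signature_of` is an assumption, and only $S_1 = \{2,5\}$ is used because it is the only set written out explicitly here.

```python
# The three hash functions from part a)
HASHES = [
    lambda x: (2 * x + 1) % 6,  # h1
    lambda x: (3 * x + 2) % 6,  # h2
    lambda x: (5 * x + 2) % 6,  # h3
]

def signature_of(elements, hashes=HASHES):
    """MinHash signature: for each hash function, the minimum hash over the set."""
    return [min(h(x) for x in elements) for h in hashes]

s1 = {2, 5}
print(signature_of(s1))  # [5, 2, 0]
```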




### b)
Only $h_3(x)$ is a true permutation, because its hash values on $\{0,1,2,3,4,5\}$ are all distinct. $h_1(x)$ and $h_2(x)$ are not permutations because they produce collisions.

Collisions:

$h_1(x)$: <br>
$h_1(0) = h_1(3) = 1$ <br>
$h_1(1) = h_1(4) = 3$ <br>
$h_1(2) = h_1(5) = 5$ <br>

$h_2(x)$: <br>
$h_2(0) = h_2(2) = h_2(4) = 2$ <br>
$h_2(1) = h_2(3) = h_2(5)= 5$ <br>
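
The permutation check and the collision groups can also be computed programmatically. The following is a small sketch; the names `small_hashes`, `is_permutation`, and `collisions` are chosen for illustration only.

```python
from collections import defaultdict

small_hashes = {
    "h1": lambda x: (2 * x + 1) % 6,
    "h2": lambda x: (3 * x + 2) % 6,
    "h3": lambda x: (5 * x + 2) % 6,
}

def is_permutation(h, n=6):
    """True if h maps {0, ..., n-1} one-to-one onto {0, ..., n-1}."""
    return sorted(h(x) for x in range(n)) == list(range(n))

def collisions(h, n=6):
    """Group inputs by hash value and keep only values hit more than once."""
    groups = defaultdict(list)
    for x in range(n):
        groups[h(x)].append(x)
    return {value: xs for value, xs in groups.items() if len(xs) > 1}

for name, h in small_hashes.items():
    # Prints the name, whether it is a permutation, and its collision groups
    print(name, is_permutation(h), collisions(h))
```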


### c)
| Pair        | MinHash Similarity | Jaccard Similarity |
|-------------|--------------------|--------------------|
| $S_1$,$S_2$ | 0.33               | 0                  |
| $S_1$,$S_3$ | 0.33               | 0                  |
| $S_1$,$S_4$ | 0.67               | 0.33               |
| $S_2$,$S_3$ | 0.67               | 0                  |
| $S_2$,$S_4$ | 0.67               | 0.33               |
| $S_3$,$S_4$ | 0.67               | 0.33               |

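With only three hash functions, the MinHash similarity can only take the values 0, 1/3, 2/3, and 1, so it is a very coarse estimate; in addition, $h_1$ and $h_2$ are not true permutations. We would need many more hash functions to see the MinHash similarity converge to the true Jaccard similarity.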
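
The MinHash similarity of two sets is the fraction of signature components on which they agree. A minimal sketch of that computation, using the signatures from the table in a) (the names `signatures` and `signature_similarity` are assumptions made for this example):

```python
from itertools import combinations

# Signatures copied from the signature table in part a)
signatures = {
    "S1": [5, 2, 0],
    "S2": [1, 2, 1],
    "S3": [1, 2, 4],
    "S4": [1, 2, 0],
}

def signature_similarity(sig_a, sig_b):
    """Fraction of hash functions whose minimum agrees for both sets."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)

# Print the estimated similarity for every pair of sets
for a, b in combinations(signatures, 2):
    print(a, b, round(signature_similarity(signatures[a], signatures[b]), 2))
```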

# Exercise 2

In [1]:
import pandas as pd
# Stream of integers
stream = [3, 1, 4, 1, 5, 9, 2, 6, 5]

# Hash functions
def h1(x):
    return (2 * x + 1) % 32

def h2(x):
    return (3 * x + 7) % 32

def h3(x):
    return (4 * x) % 32

# Function to count trailing zeros in binary
def count_trailing_zeros(num):
    binary = f"{num:05b}"  # Convert to 5-bit binary
    return len(binary) - len(binary.rstrip('0'))  # Count trailing zeros

# Process each hash function
results = []
hash_functions = [
    (h1, "h1(x) = (2x + 1) mod 32"),
    (h2, "h2(x) = (3x + 7) mod 32"),
    (h3, "h3(x) = (4x) mod 32"),
]

# Iterate through hash functions
for h_func, description in hash_functions:
    hashed_values = [h_func(x) for x in stream]
    tail_lengths = [count_trailing_zeros(v) for v in hashed_values]
    max_tail_length = max(tail_lengths)
    estimate = 2 ** max_tail_length
    results.append({
        "Hash Function": description,
        "Hashed Values": hashed_values,
        "Binary Representations": [f"{x:05b}" for x in hashed_values],
        "Tail Lengths": tail_lengths,
        "Max Tail Length": max_tail_length,
        "Estimate (Distinct Elements)": estimate,
    })

# Display results
df_results = pd.DataFrame(results)
print(df_results)


             Hash Function                        Hashed Values  \
0  h1(x) = (2x + 1) mod 32      [7, 3, 9, 3, 11, 19, 5, 13, 11]   
1  h2(x) = (3x + 7) mod 32  [16, 10, 19, 10, 22, 2, 13, 25, 22]   
2      h3(x) = (4x) mod 32     [12, 4, 16, 4, 20, 4, 8, 24, 20]   

                              Binary Representations  \
0  [00111, 00011, 01001, 00011, 01011, 10011, 001...   
1  [10000, 01010, 10011, 01010, 10110, 00010, 011...   
2  [01100, 00100, 10000, 00100, 10100, 00100, 010...   

                  Tail Lengths  Max Tail Length  Estimate (Distinct Elements)  
0  [0, 0, 0, 0, 0, 0, 0, 0, 0]                0                             1  
1  [4, 1, 0, 1, 1, 1, 0, 0, 1]                4                            16  
2  [2, 2, 4, 2, 2, 2, 3, 3, 2]                4                            16  


### Bonus

The choice of hash functions can lead to a poor distribution of hash values. For example, $h_3(x) = (4x)\mod 32$ only produces multiples of 4, so it has few distinct hash values and many collisions, and every tail length is at least 2. Conversely, $h_1(x) = (2x + 1)\mod 32$ only produces odd values, so its tail length is always 0 and its estimate is always 1, regardless of the stream.

One recommendation is to avoid simple linear functions of the form $(ax+b)\mod 2^k$ unless $a$ and $b$ are chosen carefully to avoid such patterns. Alternatively, one could use a more sophisticated hash function such as MurmurHash, which tends to spread its values much more uniformly.
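
To make the distribution problem concrete, the sketch below counts how many distinct values each of the three hash functions can produce over all inputs $0, \dots, 31$. The helper name `distinct_outputs` is an assumption for this illustration.

```python
def h1(x):
    return (2 * x + 1) % 32

def h2(x):
    return (3 * x + 7) % 32

def h3(x):
    return (4 * x) % 32

def distinct_outputs(h, modulus=32):
    """Number of distinct hash values over the inputs 0, ..., modulus - 1."""
    return len({h(x) for x in range(modulus)})

print(distinct_outputs(h1))  # 16: only the odd residues are reachable
print(distinct_outputs(h2))  # 32: 3 is coprime to 32, so h2 is a bijection
print(distinct_outputs(h3))  # 8: only multiples of 4 are reachable
```

Only $h_2$ reaches every residue, because its multiplier is odd and therefore invertible modulo 32.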

# Exercise 4

### a)

In [11]:
def compute_k_shingles(digits, k):
    # Set to store unique positions
    positions = set()

    # Iterate over the digits
    for i in range(len(digits) - k + 1):
        # Get k-shingle at current position
        shingle = digits[i:i+k]
        # Convert to integer and add to set
        positions.add(int(shingle))

    # Return ordered list of unique positions
    return sorted(positions)

# Test function with example
test_example = "1234567"
k = 4
shingles = compute_k_shingles(test_example, k)
print(shingles)  # Expected: [1234, 2345, 3456, 4567]

[1234, 2345, 3456, 4567]


### b)

In [None]:
from mpmath import mp

# Work with 10000 significant decimal digits (the leading 3 plus 9999 decimals)
mp.dps = 10000

# Digits of pi after the decimal point as a string (drops the leading "3.")
pi_digits = str(mp.pi)[2:]

# Apply shingles function with k = 12
k = 12
shingles_positions = compute_k_shingles(pi_digits, k)

# Save output to text file
with open("k_shingles_pi.txt", "w") as f:
    for pos in shingles_positions:
        f.write(f"{pos}\n")

### c)

In [None]:
import random

def minhash_signature(positions, hash_functions):
    # Initialize signature
    signature = []

    # Iterate over hash functions
    for a, b, p in hash_functions:
        min_hash = float("inf")
        # Compute hash value for each position and track the minimum
        for pos in positions:
            hash_value = ((a * pos + b) % p) % (10**15)
            min_hash = min(min_hash, hash_value)
        signature.append(min_hash)
    return signature

def generate_hash_functions():
    hash_functions = []
    # First hash function
    hash_functions.append((37, 126, 10**15 + 223))
    
    # Generate 4 additional hash functions
    for i in [37, 91, 159, 187]:
        a = random.randint(0, 10**12)
        b = random.randint(0, 10**12)
        p = 10**15 + i
        hash_functions.append((a, b, p))
    
    return hash_functions


# Generate hash functions
hash_functions = generate_hash_functions()

# Compute MinHash signature
signature = minhash_signature(shingles_positions, hash_functions)

# Output the MinHash signature
print("MinHash Signature:", signature)


MinHash Signature: [11610003501, 54862297882, 26611768318, 3324423771, 36766341956]
