# Mining Massive Datasets Problem Set 9

Ruben Hartenstein, Taha Erkoc

# Exercise 1

### a)

In a clique, every node links to every other node, so all nodes have the same in-degree and out-degree. Each node therefore passes the same share of its PageRank to every other node and receives the same share in return. Because of this symmetry, the uniform distribution, in which every node holds the same PageRank, is a fixed point of the PageRank flow.
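
The symmetry argument can be checked numerically. The following is a minimal sketch, assuming a clique of 4 nodes without self-loops and without teleportation; the function name `pagerank_clique` and the starting vector are chosen for illustration only.

```python
def pagerank_clique(n, iterations=50):
    """Power iteration on a clique of n >= 3 nodes (no self-loops, no teleport)."""
    # Column-stochastic M: node j links to each of the other n - 1 nodes.
    M = [[0.0 if i == j else 1.0 / (n - 1) for j in range(n)] for i in range(n)]
    # Start from a deliberately non-uniform distribution.
    r = [1.0] + [0.0] * (n - 1)
    for _ in range(iterations):
        r = [sum(M[i][j] * r[j] for j in range(n)) for i in range(n)]
    return r


ranks = pagerank_clique(4)
print([round(x, 4) for x in ranks])
```

Even though all rank starts at a single node, the iteration ends with every node holding the same share.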

### b)

The general form of matrix A is as follows:

$A = \beta \cdot M + (1-\beta) \cdot \left[\frac{1}{N}\right]_{N \times N}$

When a node is a dead end (has no outgoing links), its corresponding column in $M$ contains only zeros. To make $M$ a valid stochastic matrix, we replace the entire column of each dead-end node with $1/N$, i.e. a uniform probability of jumping to any node. This way, $M$ remains column-stochastic (each column sums to $1$), and random teleport links are followed with probability $1.0$ from dead ends.

With this approach, the teleportation from dead ends is already incorporated in our new matrix $M$, and our teleportation matrix $(1-\beta) \cdot \left[\frac{1}{N}\right]_{N \times N}$ no longer needs to address the dead-end issue. It only serves as a random jump mechanism across all nodes.
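
To make the construction concrete, here is a minimal sketch that builds $M$ with patched dead-end columns and then $A$. The function name `build_google_matrix`, the value $\beta = 0.8$, and the 3-node graph are assumptions for illustration and are not taken from the exercise sheet.

```python
def build_google_matrix(adjacency, beta=0.8):
    """adjacency[j] lists the distinct nodes that node j links to."""
    n = len(adjacency)
    # Column-stochastic M; dead-end columns are replaced by 1/n.
    M = [[0.0] * n for _ in range(n)]
    for j, targets in enumerate(adjacency):
        if targets:
            for i in targets:
                M[i][j] = 1.0 / len(targets)
        else:
            for i in range(n):
                M[i][j] = 1.0 / n
    # A = beta * M + (1 - beta) * [1/n]
    A = [[beta * M[i][j] + (1.0 - beta) / n for j in range(n)] for i in range(n)]
    return M, A


# Hypothetical graph: 0 -> 1, 0 -> 2, 1 -> 2, and node 2 is a dead end.
links = [[1, 2], [2], []]
M, A = build_google_matrix(links)
column_sums = [sum(A[i][j] for i in range(3)) for j in range(3)]
print(column_sums)
```

Every column of $A$ sums to $1$, and the dead-end column of $A$ is exactly $1/N$ in every entry, which matches the statement that teleports are followed with probability $1.0$ from dead ends.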

### c) (in the PDF)


### d)

With random teleports, the surfer jumps to a uniformly random page with probability $1 - \beta$ instead of following a link. Since that page can lie outside the spider trap, the surfer leaves the trap again after a few steps. This prevents the PageRank score from being permanently absorbed by the trap.

For dead ends, random teleports mean that a surfer at a dead end jumps to any page in the graph with equal probability $1/N$. In our formula, this means that we replace the dead-end column in $M$ with uniform probabilities $1/N$, which restores the column-stochastic property. This ensures that our matrix $M$ remains valid for the PageRank calculation and that the algorithm converges to a meaningful result regardless of the graph structure.
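
The effect on a spider trap can be illustrated with a small power iteration. This is a sketch under stated assumptions: the 3-node graph (node 2 links only to itself), the value $\beta = 0.8$, and the function name `pagerank` are hypothetical choices, not part of the exercise.

```python
def pagerank(adjacency, beta, iterations=100):
    """Power iteration with teleports; adjacency[j] lists the targets of node j."""
    n = len(adjacency)
    r = [1.0 / n] * n
    for _ in range(iterations):
        # Teleport share (r always sums to 1 in this scheme).
        new_r = [(1.0 - beta) / n] * n
        for j, targets in enumerate(adjacency):
            # Dead ends spread their rank uniformly over all nodes.
            out = targets if targets else list(range(n))
            share = beta * r[j] / len(out)
            for i in out:
                new_r[i] += share
        r = new_r
    return r


# Hypothetical graph: 0 -> {0, 1}, 1 -> {0, 2}, 2 -> {2} (spider trap).
graph = [[0, 1], [0, 2], [2]]
trapped = pagerank(graph, beta=1.0)   # no teleports
teleport = pagerank(graph, beta=0.8)  # with teleports
print([round(x, 3) for x in trapped])
print([round(x, 3) for x in teleport])
```

Without teleports, node 2 absorbs essentially all of the rank. With $\beta = 0.8$ it still ranks highest, but nodes 0 and 1 keep a non-zero score.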

# Exercise 2

In [None]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf, col, lit, input_file_name, concat_ws, collect_list, size, array_distinct
from pyspark.sql.types import ArrayType, StringType

# Initialize Spark Session
spark = SparkSession.builder.appName("CharacterShingling").getOrCreate()
path = "/grundgesetz/brd_grundgesetz_63_2019-04-03.txt"

# Load the text file at `path` into a DataFrame (one row per line)
documents = spark.read.text(path).withColumnRenamed("value", "text")

# Add column for file name
documents = documents.withColumn("file_name", input_file_name())

# Group documents by file name and concatenate to single block of text
documents_grouped = documents.groupBy("file_name").agg(
    concat_ws(" ", collect_list("text")).alias("full_text")
)


# Function to generate shingles of size k from a text
def generate_shingles(text, k):
    shingles = set()

    # Handle line breaks, hyphens and remove extra spaces
    text = text.replace("\n", " ").replace("- ", "").replace("\r", " ")
    text = " ".join(text.split())

    # Generate shingles
    for i in range(len(text) - k + 1):
        shingle = text[i:i + k]
        shingles.add(shingle)

    return list(shingles)


# Register the function as a UDF
shingles_udf = udf(lambda text, k: generate_shingles(text, k), ArrayType(StringType()))

# Generate shingles for k = 5 and k = 9
results = {}
for k in [5, 9]:
    # Add column for shingles of size k
    shingles_df = documents_grouped.withColumn(f"shingles_{k}", shingles_udf(col("full_text"), lit(k)))

    # Add column for the number of distinct shingles
    shingles_count_df = shingles_df.withColumn(f"distinct_shingles_{k}", size(array_distinct(col(f"shingles_{k}"))))

    # Add results to the dictionary
    results[k] = shingles_count_df.select("file_name", f"distinct_shingles_{k}")

    # Filter results for the Grundgesetz
    grundgesetz_results = results[k].filter(col("file_name").contains("grundgesetz"))

    # Display results
    print(f"\nResults for k={k}:")
    results[k].show(truncate=False)

    print(f"\nGrundgesetz results for k={k}:")
    grundgesetz_results.show(truncate=False)
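
As a quick sanity check of the shingling logic without a Spark session, the same normalization and sliding window can be run on a short string. This is a minimal sketch; the helper name `char_shingles` and the sample inputs are assumptions for illustration.

```python
def char_shingles(text, k):
    """Return the set of distinct character k-shingles of a normalized text."""
    # Same normalization steps as in the UDF above: line breaks, hyphens, extra spaces.
    text = text.replace("\n", " ").replace("- ", "").replace("\r", " ")
    text = " ".join(text.split())
    return {text[i:i + k] for i in range(len(text) - k + 1)}


print(sorted(char_shingles("abcabc", 3)))
print(len(char_shingles("Die Würde des Menschen ist unantastbar.", 5)))
```

Repeated shingles are counted only once, so `"abcabc"` with `k = 3` yields three distinct shingles although there are four window positions.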


# Exercise 3

### a)
Hash functions

$h_1(x) = (2x + 1) \mod{6} $ <br>
$h_2(x) = (3x + 2) \mod{6} $ <br>
$h_3(x) = (5x + 2) \mod{6} $ <br>

Now we compute the hash values for $x \in \{0,1,2,3,4,5\}$

For $h_1(x) = (2x + 1) \mod{6} $:

$h_1(0) = 1$ <br>
$h_1(1) = 3$ <br>
$h_1(2) = 5$ <br>
$h_1(3) = 1$ <br>
$h_1(4) = 3$ <br>
$h_1(5) = 5$ <br>

For $h_2(x) = (3x + 2) \mod{6} $:

$h_2(0) = 2$ <br>
$h_2(1) = 5$ <br>
$h_2(2) = 2$ <br>
$h_2(3) = 5$ <br>
$h_2(4) = 2$ <br>
$h_2(5) = 5$ <br>

For $h_3(x) = (5x + 2) \mod{6}$:

$h_3(0) = 2$ <br>
$h_3(1) = 1$ <br>
$h_3(2) = 0$ <br>
$h_3(3) = 5$ <br>
$h_3(4) = 4$ <br>
$h_3(5) = 3$ <br>

### MinHash Signature for each set:

$S_1 = \{2,5\}$ (rows with non-zero entries in the $S_1$ column):

$h_1(2) = 5, h_1(5) = 5$, $\min = 5$<br>
$h_2(2) = 2, h_2(5) = 5$, $\min = 2$<br>
$h_3(2) = 0, h_3(5) = 3$, $\min = 0$<br>

Repeating this for the sets $S_2$, $S_3$ and $S_4$ gives the following signature table:

| Set  | $\min h_1$ | $\min h_2$ | $\min h_3$ |
|------|-----------|-----------|----------|
| $S_1$| 5         | 2         | 0        |
| $S_2$| 1         | 2         | 1        |
| $S_3$| 1         | 2         | 4        |
| $S_4$| 1         | 2         | 0        |
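
The computation for $S_1$ can be reproduced with a few lines of code. This is a minimal sketch; the names `small_hashes` and `small_minhash_signature` are chosen for illustration, and only $S_1$ is used because it is the only set written out above.

```python
# The three hash functions from part a).
small_hashes = [
    lambda x: (2 * x + 1) % 6,  # h_1
    lambda x: (3 * x + 2) % 6,  # h_2
    lambda x: (5 * x + 2) % 6,  # h_3
]


def small_minhash_signature(elements, hashes):
    """For each hash function, take the minimum hash value over the set's elements."""
    return [min(h(x) for x in elements) for h in hashes]


s1 = {2, 5}
signature_s1 = small_minhash_signature(s1, small_hashes)
print(signature_s1)  # [5, 2, 0]
```

Passing the element sets of $S_2$, $S_3$ and $S_4$ to the same function would give the remaining rows of the table.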




### b)
$h_3(x)$ is a true permutation because all hash values $\{0,1,2,3,4,5\}$ are distinct.

Collisions:

$h_1(x)$: <br>
$h_1(0) = h_1(3) = 1$ <br>
$h_1(1) = h_1(4) = 3$ <br>
$h_1(2) = h_1(5) = 5$ <br>

$h_2(x)$: <br>
$h_2(0) = h_2(2) = h_2(4) = 2$ <br>
$h_2(1) = h_2(3) = h_2(5) = 5$ <br>
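
Whether a hash function of the form $(ax + b) \bmod n$ is a permutation can also be checked programmatically: it is one exactly when $\gcd(a, n) = 1$. A minimal sketch, with the helper name `is_permutation` chosen for illustration:

```python
from math import gcd


def is_permutation(a, b, n):
    """Check by enumeration whether x -> (a*x + b) mod n hits every value in 0..n-1."""
    return sorted((a * x + b) % n for x in range(n)) == list(range(n))


# (a, b) pairs of h_1, h_2 and h_3 with n = 6.
params = {"h_1": (2, 1), "h_2": (3, 2), "h_3": (5, 2)}
checks = {name: is_permutation(a, b, 6) for name, (a, b) in params.items()}
coprime = {name: gcd(a, 6) == 1 for name, (a, b) in params.items()}
print(checks)
print(coprime)
```

Both checks agree: only $h_3$ is a permutation, since $\gcd(2, 6) = 2$ and $\gcd(3, 6) = 3$, while $\gcd(5, 6) = 1$.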


### c)
| Pair        | MinHash Similarity | Jaccard Similarity |
|-------------|--------------------|--------------------|
| $S_1$,$S_2$ | 0.33               | 0                  |
| $S_1$,$S_3$ | 0.33               | 0                  |
| $S_1$,$S_4$ | 0.67               | 0.33               |
| $S_2$,$S_3$ | 0.67               | 0                  |
| $S_2$,$S_4$ | 0.67               | 0.33               |
| $S_3$,$S_4$ | 0.67               | 0.33               |

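With only three hash functions, the estimate can only take the values $0$, $0.33$, $0.67$ and $1$, so it cannot closely approximate the true Jaccard similarity. In addition, $h_1$ and $h_2$ are not true permutations, which distorts the estimate further. With many hash functions that behave like random permutations, the MinHash similarity would converge to the true Jaccard similarity.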
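
The MinHash similarity of a pair is the fraction of signature positions in which the two sets agree. The following minimal sketch recomputes it from the signature table in part a); the names `signatures` and `signature_similarity` are chosen for illustration.

```python
from itertools import combinations

# Signatures (min h_1, min h_2, min h_3) copied from the table in part a).
signatures = {
    "S1": [5, 2, 0],
    "S2": [1, 2, 1],
    "S3": [1, 2, 4],
    "S4": [1, 2, 0],
}


def signature_similarity(sig_a, sig_b):
    """Fraction of positions where both signatures hold the same minimum."""
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)


estimates = {
    (a, b): signature_similarity(signatures[a], signatures[b])
    for a, b in combinations(signatures, 2)
}
for pair, value in estimates.items():
    print(pair, round(value, 2))
```

Since $h_2$ yields the minimum $2$ for all four sets, every pair agrees in at least one position, so no estimate can fall below $0.33$ here.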

# Exercise 4

### a)

In [14]:
def compute_k_shingles(digits, k):
    # Set to store unique positions
    positions = set()

    # Iterate over the digits
    for i in range(len(digits) - k + 1):
        # Get k-shingle at current position
        shingle = digits[i:i+k]
        # Convert to integer and add to set
        positions.add(int(shingle))

    # Return ordered list of unique positions
    return sorted(positions)

# Test function with example
test_example = "1234567"
k = 4
shingles = compute_k_shingles(test_example, k)
print(shingles)  # Expected: [1234, 2345, 3456, 4567]

[1234, 2345, 3456, 4567]


### b)

In [15]:
from mpmath import mp

# Set working precision to 10000 significant decimal digits
mp.dps = 10000

# Get the digits of pi after the decimal point as a string (drops the leading "3.")
pi_digits = str(mp.pi)[2:]

# Apply shingles function with k = 12
k = 12
shingles_positions = compute_k_shingles(pi_digits, k)

# Save output to text file
with open("k_shingles_pi.txt", "w") as f:
    for pos in shingles_positions:
        f.write(f"{pos}\n")

### c)

In [16]:
import random

def minhash_signature(positions, hash_functions):
    # Initialize signature
    signature = []

    # Iterate over hash functions
    for a, b, p in hash_functions:
        min_hash = float("inf")
        # Compute hash value for each position and track the minimum
        for pos in positions:
            hash_value = ((a * pos + b) % p) % (10**15)
            min_hash = min(min_hash, hash_value)
        signature.append(min_hash)
    return signature

def generate_hash_functions():
    hash_functions = []
    # First hash function
    hash_functions.append((37, 126, 10**15 + 223))
    
    # Generate 4 additional hash functions
    for i in [37, 91, 159, 187]:
        a = random.randint(1, 10**12)  # a = 0 would map every position to the same value
        b = random.randint(0, 10**12)
        p = 10**15 + i
        hash_functions.append((a, b, p))
    
    return hash_functions


# Generate hash functions
hash_functions = generate_hash_functions()

# Compute MinHash signature
signature = minhash_signature(shingles_positions, hash_functions)

# Output the MinHash signature
print("MinHash Signature:", signature)


MinHash Signature: [11610003501, 63680740533, 107687383220, 41635782020, 203614208147]
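
Because the four additional hash functions are drawn without a fixed seed, the last four signature entries change from run to run. A minimal sketch of a reproducible variant follows; the function name `generate_seeded_hash_functions` and the seed value `42` are assumptions for illustration, not part of the exercise.

```python
import random


def generate_seeded_hash_functions(seed, offsets=(37, 91, 159, 187)):
    """Same (a, b, p) layout as above, but drawn from a seeded generator."""
    rng = random.Random(seed)  # local generator, leaves the global state untouched
    functions = [(37, 126, 10**15 + 223)]  # fixed first hash function
    for offset in offsets:
        a = rng.randint(1, 10**12)  # start at 1 to avoid a constant hash
        b = rng.randint(0, 10**12)
        functions.append((a, b, 10**15 + offset))
    return functions


first = generate_seeded_hash_functions(seed=42)   # hypothetical seed
second = generate_seeded_hash_functions(seed=42)
print(first == second)  # True
```

The returned list has the same structure as the output of `generate_hash_functions()`, so it could be passed to `minhash_signature` unchanged.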
