# Mining Massive Datasets Problem Set 9

Ruben Hartenstein, Taha Erkoc

# Exercise 2

### a)

In a clique, every node is connected to every other node, so all nodes have the same in-degree and out-degree ($n - 1$ in a clique of $n$ nodes). Each node therefore passes on and receives the same share of PageRank. By this symmetry the PageRank is distributed uniformly across the clique, and every node ends up with a score of $1/n$.
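
As a short check of this argument, assume a clique of $n$ nodes without self-loops, so that $M_{ij} = \frac{1}{n-1}$ for $i \neq j$ and $M_{ii} = 0$. Under that assumption the uniform vector $r_i = \frac{1}{n}$ is a fixed point of $r = Mr$:

$$
(Mr)_i = \sum_{j \neq i} \frac{1}{n-1} \cdot \frac{1}{n} = (n-1) \cdot \frac{1}{n-1} \cdot \frac{1}{n} = \frac{1}{n} = r_i
$$

The same vector stays fixed when teleportation is added, since $\beta \cdot \frac{1}{n} + (1-\beta) \cdot \frac{1}{n} = \frac{1}{n}$.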

### b)

The general form of matrix A is as follows:

$A = \beta M + (1-\beta) \left[\frac{1}{N}\right]_{N \times N}$

When a node is a dead end (it has no outgoing links), its column in $M$ contains only zeros. To make $M$ a valid stochastic matrix, we replace every dead-end column with a column whose entries are all $1/N$, so that a surfer at a dead end moves to any node with uniform probability. With this change $M$ is column-stochastic (each column sums to $1$), and from a dead end the random teleport is followed with probability $1.0$.

With this approach, teleportation from dead ends is already incorporated into the modified matrix $M$, so the teleportation term $(1-\beta) \left[\frac{1}{N}\right]_{N \times N}$ no longer needs to address the dead-end issue. It only serves as the random-jump mechanism across all nodes.
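
The column replacement described above can be sketched as follows. This is a minimal illustration, not part of the original solution: the helper name `patch_dead_ends` and the 3-node example matrix are assumptions made for this sketch.

```python
import numpy as np

def patch_dead_ends(M):
    """Return a copy of M in which every all-zero column is replaced by 1/N."""
    M = np.array(M, dtype=float)  # work on a copy
    N = M.shape[0]
    # A dead end has no outgoing links, so its column sums to zero
    dead_ends = np.isclose(M.sum(axis=0), 0.0)
    M[:, dead_ends] = 1.0 / N
    return M

# Hypothetical example: node 2 is a dead end (third column is all zeros)
M = np.array([
    [0.0, 0.5, 0.0],
    [0.5, 0.0, 0.0],
    [0.5, 0.5, 0.0],
])
M_fixed = patch_dead_ends(M)
print(M_fixed.sum(axis=0))  # every column now sums to 1
```

Only the dead-end column changes; the columns of nodes with outgoing links are left as they were.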

### c) (in the PDF)


### d)

With random teleports, the surfer follows a link with probability $\beta$ and jumps to a uniformly random page with probability $1 - \beta$. Because that page may lie outside the spider trap, the PageRank score can no longer be permanently absorbed by the trap.

For dead ends, a surfer at a dead end is assumed to teleport to any page in the graph with equal probability $1/N$. In our formula this means that we replace each dead-end column in $M$ with uniform probabilities $1/N$, which restores the column-stochastic property. This ensures that $M$ remains valid for the PageRank calculation and that the algorithm converges to a meaningful result regardless of the graph structure.
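
The effect on a spider trap can be illustrated with a small experiment. The sketch below is not taken from the exercise: the 3-node graph, the helper `power_iterate`, and the fixed number of steps are all assumptions chosen for illustration. Node 2 links only to itself and therefore acts as a trap.

```python
import numpy as np

def power_iterate(M, beta, steps=100):
    """Run a fixed number of PageRank iterations, starting from the uniform vector."""
    N = M.shape[0]
    r = np.ones(N) / N
    for _ in range(steps):
        # Follow links with probability beta, teleport uniformly otherwise
        r = beta * (M @ r) + (1 - beta) / N
    return r

# Hypothetical graph: nodes 0 and 1 link to each other and to node 2,
# node 2 links only to itself (spider trap)
M_trap = np.array([
    [0.0, 0.5, 0.0],
    [0.5, 0.0, 0.0],
    [0.5, 0.5, 1.0],
])

r_no_teleport = power_iterate(M_trap, beta=1.0)
r_teleport = power_iterate(M_trap, beta=0.8)
print("beta = 1.0:", r_no_teleport)  # almost all rank ends up in node 2
print("beta = 0.8:", r_teleport)     # nodes 0 and 1 keep a share of about 1/9 each
```

Without teleports the trap absorbs essentially all of the score, while with $\beta = 0.8$ the other two nodes retain a nonzero share.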

# Exercise 3

In [10]:
import numpy as np

def pagerank(M, beta, epsilon):
    """
    Compute the PageRank vector using the Google formulation and the Power Iteration method.

    Parameters:
        M (numpy.ndarray): The dense column-stochastic transition matrix of the graph.
        beta (float): Probability of following a link (1 - beta is the teleport probability).
        epsilon (float): The convergence threshold on the L1 norm of the update.

    Returns:
        numpy.ndarray: The PageRank vector.
    """
    # Number of nodes in the graph
    N = M.shape[0]

    # Compute the Google Matrix A
    teleportation_matrix = np.ones((N, N)) / N  # Dense matrix with all entries 1/N
    A = beta * M + (1 - beta) * teleportation_matrix

    # Initialize the PageRank vector (uniform distribution)
    r = np.ones(N) / N

    # Power Iteration Method
    while True:
        r_new = A @ r  # Matrix-vector multiplication
        # Check for convergence using L1 norm
        if np.linalg.norm(r_new - r, 1) < epsilon:
            break
        r = r_new

    return r
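
As a possible variation on the function above, the dense $N \times N$ teleportation matrix does not have to be built. When $M$ is column-stochastic the entries of $r$ always sum to $1$, so multiplying by $\left[\frac{1}{N}\right]_{N \times N}$ just adds the constant $(1-\beta)/N$ to every entry. The sketch below relies on that assumption. The function name and the `max_iter` safeguard are not part of the original solution, and unlike the function above it returns the newest iterate once the threshold is met.

```python
import numpy as np

def pagerank_without_dense_teleport(M, beta, epsilon, max_iter=1000):
    """Power iteration that adds the teleport term as a constant (assumes column-stochastic M)."""
    N = M.shape[0]
    r = np.ones(N) / N
    for _ in range(max_iter):
        # Equivalent to (beta * M + (1 - beta) * [1/N]) @ r when r sums to 1
        r_new = beta * (M @ r) + (1 - beta) / N
        if np.linalg.norm(r_new - r, 1) < epsilon:
            return r_new
        r = r_new
    return r  # not converged within max_iter; return the last iterate

# Same example matrix as in the next cell, with an assumed beta of 0.8
M_example = np.array([
    [1/3, 1/2, 0],
    [1/3, 0, 1/2],
    [1/3, 1/2, 1/2],
])
r_example = pagerank_without_dense_teleport(M_example, beta=0.8, epsilon=1e-12)
print(r_example, r_example.sum())
```

The result can be checked against the dense formulation by confirming that $A r \approx r$ for $A = \beta M + (1-\beta) \left[\frac{1}{N}\right]_{N \times N}$.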

In [11]:
# Example matrix from exercise 1
M = np.array([
    [1/3, 1/2, 0],
    [1/3, 0, 1/2],
    [1/3, 1/2, 1/2]
])

# Parameters
beta = 1  # Probability of following a link (beta = 1 disables teleportation)
epsilon = 1/12  # Convergence threshold

# Compute PageRank
pagerank_vector = pagerank(M, beta, epsilon)
print("PageRank Vector:", pagerank_vector)

PageRank Vector: [0.23148148 0.31481481 0.4537037 ]


In [12]:
def generate_clique_matrix(n):
    """
    Generate the column-stochastic adjacency matrix for a clique graph with n vertices.

    Parameters:
        n (int): Number of vertices in the clique.

    Returns:
        numpy.ndarray: Column-stochastic adjacency matrix for the clique.
    """
    M = np.ones((n, n))  # Fully connected graph (all entries are 1)
    np.fill_diagonal(M, 0)  # Remove self-loops
    M /= M.sum(axis=0)  # Normalize to make it column-stochastic
    return M

In [13]:
# Generate M(4) and M(6)
M4 = generate_clique_matrix(4)
M6 = generate_clique_matrix(6)

# Compute PageRank vectors
pagerank_M4 = pagerank(M4, beta, epsilon)
pagerank_M6 = pagerank(M6, beta, epsilon)

print("PageRank Vector for M(4):", pagerank_M4)
print("PageRank Vector for M(6):", pagerank_M6)

PageRank Vector for M(4): [0.25 0.25 0.25 0.25]
PageRank Vector for M(6): [0.16666667 0.16666667 0.16666667 0.16666667 0.16666667 0.16666667]


# Exercise 4


# Exercise 5

### a)

In [14]:
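def compute_k_shingles(digits, k):
    """Return the sorted, de-duplicated integer values of all k-shingles in a digit string."""
    # Set to store unique positions (each k-shingle read as an integer)
    positions = set()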

    # Iterate over the digits
    for i in range(len(digits) - k + 1):
        # Get k-shingle at current position
        shingle = digits[i:i+k]
        # Convert to integer and add to set
        positions.add(int(shingle))

    # Return ordered list of unique positions
    return sorted(positions)

# Test function with example
test_example = "1234567"
k = 4
shingles = compute_k_shingles(test_example, k)
print(shingles)  # Expected: [1234, 2345, 3456, 4567]

[1234, 2345, 3456, 4567]
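
One detail worth a quick check is what happens to shingles with leading zeros, since `int()` drops them. The sketch below uses a compact re-implementation under the assumed name `shingle_positions` so that it runs on its own. It suggests that no information is lost for a fixed `k`, because zero-padding the integer back to width `k` recovers the shingle.

```python
def shingle_positions(digits, k):
    """Integer values of all k-length windows, mirroring compute_k_shingles above."""
    return sorted({int(digits[i:i + k]) for i in range(len(digits) - k + 1)})

k = 4
positions = shingle_positions("00123", k)  # windows "0012" and "0123"
print(positions)  # [12, 123]

# Zero-padding back to width k recovers the original shingle strings
print([str(pos).zfill(k) for pos in positions])  # ['0012', '0123']
```

Each integer can thus be read as the shingle's position in the range $0$ to $10^k - 1$.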


### b)

In [15]:
from mpmath import mp

# Set precision to 10000 significant digits (the leading "3" plus 9999 decimals)
mp.dps = 10000

# Get pi as a string and drop the leading "3."
pi_digits = str(mp.pi)[2:]

# Apply shingles function with k = 12
k = 12
shingles_positions = compute_k_shingles(pi_digits, k)

# Save output to text file
with open("k_shingles_pi.txt", "w") as f:
    for pos in shingles_positions:
        f.write(f"{pos}\n")

### c)

In [16]:
import random

def minhash_signature(positions, hash_functions):
    """Return one MinHash value per (a, b, p) hash function over the given positions."""
    # Initialize signature
    signature = []

    # Iterate over hash functions
    for a, b, p in hash_functions:
        min_hash = float("inf")
        # Compute hash value for each position and track the minimum
        for pos in positions:
            hash_value = ((a * pos + b) % p) % (10**15)
            min_hash = min(min_hash, hash_value)
        signature.append(min_hash)
    return signature

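def generate_hash_functions():
    """Return five (a, b, p) triples for hashes of the form ((a * x + b) % p) % 10**15."""
    hash_functions = []
    # First hash function (fixed parameters)
    hash_functions.append((37, 126, 10**15 + 223))

    # Generate 4 additional hash functions with random a and b
    # (no seed is set, so the values differ between runs)
    for offset in [37, 91, 159, 187]:
        a = random.randint(1, 10**12)  # a >= 1, since a = 0 would give a constant hash
        b = random.randint(0, 10**12)
        p = 10**15 + offset
        hash_functions.append((a, b, p))

    return hash_functions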


# Generate hash functions
hash_functions = generate_hash_functions()

# Compute MinHash signature
signature = minhash_signature(shingles_positions, hash_functions)

# Output the MinHash signature
print("MinHash Signature:", signature)


MinHash Signature: [11610003501, 63680740533, 107687383220, 41635782020, 203614208147]
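
To show what such a signature is typically used for, the sketch below estimates the Jaccard similarity of two sets from the fraction of matching signature entries. It is an illustration under assumptions, not part of the exercise: the helper names, the modulus `2**61 - 1`, the seeds, the synthetic sets, and the choice of 200 hash functions are all made up for this example.

```python
import random

P = 2**61 - 1  # a Mersenne prime, chosen for this sketch only

def make_hashes(n, seed=0):
    """Draw n (a, b) pairs for hashes of the form (a * x + b) % P."""
    rng = random.Random(seed)
    return [(rng.randrange(1, P), rng.randrange(0, P)) for _ in range(n)]

def signature(items, hashes):
    """MinHash signature: the minimum hash value per hash function."""
    return [min((a * x + b) % P for x in items) for a, b in hashes]

def estimate_jaccard(sig_a, sig_b):
    """Fraction of positions in which two signatures agree."""
    matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return matches / len(sig_a)

# Two synthetic sets of 200 elements each that share 100 elements
rng = random.Random(1)
universe = rng.sample(range(10**12), 300)
set_a = set(universe[:200])
set_b = set(universe[100:])

hashes = make_hashes(200)
sig_a = signature(set_a, hashes)
sig_b = signature(set_b, hashes)

true_jaccard = len(set_a & set_b) / len(set_a | set_b)
print("true Jaccard:", true_jaccard)
print("estimate:    ", estimate_jaccard(sig_a, sig_b))
```

With only five hash functions, as in the cell above, such an estimate would be very coarse. More hash functions reduce the variance of the estimate.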
