# Question 1

Suppose we have an LSH family h of (d1,d2,.6,.4) hash functions. We can use three functions from h and the AND-construction to form a (d1,d2,w,x) family, and we can use two functions from h and the OR-construction to form a (d1,d2,y,z) family. Calculate w, x, y, and z, and then identify the correct value of one of these in the list below.

<ol>
<li>w = .216
<li>y = .64
<li>y = .936
<li>x = .16
</ol>

In [8]:
import numpy as np

p1 = 0.6
p2 = 0.4

# AND of hash functions
# If H is (d1, d2, p1, p2)-sensitive, then H' is (d1, d2, p1^r, p2^r)-sensitive
w = np.power(p1, 3)
x = np.power(p2, 3)

# OR of hash functions
# If H is (d1, d2, p1, p2)-sensitive, then H' is (d1, d2, 1-(1-p1)^b, 1-(1-p2)^b)-sensitive
y = 1 - np.power(1 - p1, 2)
z = 1 - np.power(1 - p2, 2)

print(w, x, y, z)

0.216 0.064 0.84 0.64
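
The same arithmetic extends to any number of functions. The following is a minimal sketch, assuming the hash functions are independent so that an r-way AND maps a probability p to p^r and a b-way OR maps it to 1 - (1 - p)^b. The helper names `and_construction` and `or_construction` are illustrative and do not come from the course material.

```python
def and_construction(p, r):
    """Probability that all r independent hash functions agree."""
    return p ** r


def or_construction(p, b):
    """Probability that at least one of b independent hash functions agrees."""
    return 1 - (1 - p) ** b


p1, p2 = 0.6, 0.4

# Three functions combined with AND
w = and_construction(p1, 3)
x = and_construction(p2, 3)

# Two functions combined with OR
y = or_construction(p1, 2)
z = or_construction(p2, 2)

print(round(w, 3), round(x, 3), round(y, 3), round(z, 3))
```

AND lowers both probabilities and OR raises both, which is why the two constructions are usually cascaded rather than used alone.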


# Question 2

Here are eight strings that represent sets:

- s1 = abcef
- s2 = acdeg
- s3 = bcdefg
- s4 = adfg
- s5 = bcdfgh
- s6 = bceg
- s7 = cdfg
- s8 = abcd

Suppose our upper limit on Jaccard distance is 0.2, and we use the indexing scheme of Section 3.9.4 based on symbols appearing in the prefix (no position or length information). For each of s1, s3, and s6, determine how many other strings that string will be compared with, if it is used as the probe string. Then, identify the true count from the list below.

<ol>
<li>s1 is compared with 3 other strings.
<li>s1 is compared with 5 other strings.
<li>s1 is compared with 6 other strings.
<li>s3 is compared with 4 other strings.
</ol>

In [77]:
from collections import defaultdict

strings = {
    's1' : 'abcef',
    's2' : 'acdeg',
    's3' : 'bcdefg',
    's4' : 'adfg',
    's5' : 'bcdfgh',
    's6' : 'bceg',
    's7' : 'cdfg',
    's8' : 'abcd',
}

def index(values, J):
    idx = defaultdict(list)
    for v in values:
        n = int(len(v) * J + 1)
        for i in range(n):
            idx[v[i]].append(v)
    return idx
    
def search(idx, J, string):
    res = set()
    n = int(len(string) * J + 1)
    for i in range(n):
        res.update(idx[string[i]])
    return res

idx = index(strings.values(), 0.2)

for v in ['s1', 's3', 's6']:
    val = strings[v]
    res = [x for x in search(idx, 0.2, val) if x != val]
    print(v, len(res))

s1 6
s3 5
s6 3
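
To see where these counts come from, it can help to print the prefixes and the candidate sets themselves. The sketch below assumes, as the cell above does, that a string of length L with distance limit J is indexed under its first floor(J * L) + 1 symbols. The names `prefix`, `buckets`, and `candidates` are illustrative.

```python
from collections import defaultdict

strings = {
    "s1": "abcef",
    "s2": "acdeg",
    "s3": "bcdefg",
    "s4": "adfg",
    "s5": "bcdfgh",
    "s6": "bceg",
    "s7": "cdfg",
    "s8": "abcd",
}
MAX_DISTANCE = 0.2


def prefix(s, max_distance):
    """First floor(max_distance * len(s)) + 1 symbols of s."""
    return s[: int(len(s) * max_distance) + 1]


prefixes = {name: prefix(s, MAX_DISTANCE) for name, s in strings.items()}

# One bucket per symbol, holding the names of strings whose prefix contains it
buckets = defaultdict(set)
for name, p in prefixes.items():
    for symbol in p:
        buckets[symbol].add(name)


def candidates(name):
    """Names of the other strings sharing a bucket with the probe's prefix."""
    found = set()
    for symbol in prefixes[name]:
        found |= buckets[symbol]
    found.discard(name)
    return sorted(found)


for name in ("s1", "s3", "s6"):
    print(name, prefixes[name], candidates(name))
```

Strings of length 4 get a one-symbol prefix, while lengths 5 and 6 get two symbols, so s6 probes only the `b` bucket.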


# Question 3

Consider the link graph

<img style="float:left" src="https://d396qusza40orc.cloudfront.net/mmds/images/otc_pagerank4.gif"/>
<br clear="all"/>

First, construct L, the link matrix, as discussed in Section 5.5 on the HITS algorithm. Then do the following:

<ol>
<li>Start by assuming the hubbiness of each node is 1; that is, the vector h is (the transpose of) [1,1,1,1].
<li>Compute an estimate of the authority vector a = L<sup>T</sup>h.
<li>Normalize a by dividing all values so the largest value is 1.
<li>Compute an estimate of the hubbiness vector h=La.
<li>Normalize h by dividing all values so the largest value is 1.
<li>Repeat steps 2-5.
</ol>

Now, identify in the list below the true statement about the final estimates.

<ol>
<li>The final estimate of the hubbiness of 1 is 1.
<li>The final estimate of the authority of 4 is 1/8.
<li>The final estimate of the authority of 2 is 1/5.
<li>The final estimate of the hubbiness of 3 is 3/5.
</ol>

In [65]:
# Link matrix
L = np.array([
    [0, 1, 1, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 1],
    [0, 0, 1, 0],
])

def normalize(v):
    # Normalize by dividing all values so the largest value is 1
    return v / np.max(v)

h = np.array([[1.0, 1.0, 1.0, 1.0]]).T

for i in range(2):
    # Compute an estimate of the authority vector a = L.T * h
    a = normalize(np.dot(L.T, h))
    # Compute an estimate of the hubbiness vector h = L * a
    h = normalize(np.dot(L, a))

print("Authority", a.flatten())
print("Hubbiness", h.flatten())

Authority [ 0.2  0.6  1.   0.2]
Hubbiness [ 1.     0.125  0.125  0.625]
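
Because the answer choices are stated as fractions, an exact-arithmetic version makes the comparison easier. This is a sketch under the same assumptions as the cell above (the same link matrix and two rounds of the update), using `fractions.Fraction` instead of NumPy. The helper name `scale_to_max_one` is illustrative.

```python
from fractions import Fraction

# Link matrix: L[i][j] = 1 if node i+1 links to node j+1
L = [
    [0, 1, 1, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 1],
    [0, 0, 1, 0],
]
n = len(L)


def scale_to_max_one(v):
    """Divide every entry by the largest one, so the maximum becomes 1."""
    top = max(v)
    return [x / top for x in v]


h = [Fraction(1)] * n
for _ in range(2):
    # a = L^T h: authority of j sums the hubbiness of the nodes linking to j
    a = scale_to_max_one([sum(L[i][j] * h[i] for i in range(n)) for j in range(n)])
    # h = L a: hubbiness of i sums the authority of the nodes i links to
    h = scale_to_max_one([sum(L[i][j] * a[j] for j in range(n)) for i in range(n)])

print("Authority:", [str(x) for x in a])
print("Hubbiness:", [str(x) for x in h])
```

The exact values agree with the decimal output above: 0.125 is 1/8 and 0.625 is 5/8.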


# Question 4

Consider an implementation of the Block-Stripe Algorithm discussed in Section 5.2 to compute page rank on a graph of N nodes (i.e., Web pages). Suppose each page has, on average, 20 links, and we divide the new rank vector into k blocks (and correspondingly, the matrix M into k stripes). Each stripe of M has one line per "source" web page, in the format:

- [source_id, degree, m, dest_1, ..., dest_m]

Notice that we had to add an additional entry, m, to denote the number of destination nodes in this stripe, which of course is no more than the degree of the node. Assume that all entries (scores, degrees, identifiers,...) are encoded using 4 bytes.

There is an additional detail we need to account for, namely, locality of links. As a very simple model, assume that we divide web pages into two disjoint sets:

- Introvert pages, which link only to other pages within the same host as themselves.
- Extrovert pages, which have links to pages across several hosts.

Assume a fraction x of pages (0 < x < 1) are introverts, and the rest are extroverts. Construct a formula that counts the amount of I/O per page rank iteration in terms of N, x, and k. The 4-tuples below list combinations of N, k, x, and I/O (in bytes). Pick the correct combination.

Note: There are some additional optimizations one can think of, such as striping the old score vector or encoding introvert and extrovert pages using different schemes. For the purposes of working this problem, assume we don't do any optimizations beyond the block-stripe algorithm discussed in class.

1. N = 1 billion, k = 2, x = 0.75, 112GB
2. N = 1 billion, k = 2, x = 0.75, 107GB
3. N = 1 billion, k = 3, x = 0.75, 132GB
4. N = 1 billion, k = 3, x = 0.5, 132GB

In [78]:
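def calculate_io(k, x):
    # I/O in bytes per page for one iteration; with N = 1 billion this reads as GB.
    # 21 = 20 destination ids + 1 new-rank entry written; k = old ranks read once per block.
    # 3 * (...) = (source_id, degree, m) per stripe line, assuming an introvert
    # appears in one stripe and an extrovert in all k stripes.
    return 4 * (21 + k + 3 * (x + (1 - x) * k))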

for k, x in [(3, 0.75), (2, 0.5), (2, 0.75), (3, 0.5)]:
    print(calculate_io(k, x))

114.0
110.0
107.0
120.0
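
The formula can be broken into its components to show where each term comes from. The sketch below rests on an assumption inferred from the formula in the cell above, not stated in the question text: each introvert page appears in exactly one stripe and each extrovert page appears in all k stripes. The function name `io_breakdown` and the component labels are illustrative.

```python
def io_breakdown(k, x, links_per_page=20, bytes_per_entry=4):
    """Per-page I/O in bytes for one iteration, split by component.

    ASSUMPTION: an introvert page contributes a line to 1 stripe,
    an extrovert page contributes a line to all k stripes.
    """
    stripe_lines_per_page = x * 1 + (1 - x) * k
    entries = {
        "old rank vector, read once per block": k,
        "new rank vector, written once": 1,
        "destination ids across all stripes": links_per_page,
        "line headers (source_id, degree, m)": 3 * stripe_lines_per_page,
    }
    return {name: bytes_per_entry * count for name, count in entries.items()}


parts = io_breakdown(k=2, x=0.75)
for name, size in parts.items():
    print(f"{name}: {size} bytes per page")

total = sum(parts.values())
print("total:", total, "bytes per page")
```

Multiplying the per-page total by N gives the I/O for the whole graph, so with N = 1 billion the per-page byte count can be read directly as gigabytes.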
