# Week3A (Advanced)

## Question 1

For the following graph:

In [103]:
"""
   C -- D -- E
 / |    |    | \
A  |    |    |  B
 \ |    |    | /
   F -- G -- H
""";

Write the adjacency matrix A, the degree matrix D, and the Laplacian matrix L. For each, find the sum of all entries and the number of nonzero entries. Then identify the true statement from the list below.

<ol>
<li>A has 30 nonzero entries.
<li>A has 22 nonzero entries.
<li>D has 16 nonzero entries.
<li>L has 64 nonzero entries.
</ol>

In [104]:
import numpy as np

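# Adjacency matrix (rows and columns ordered A, B, C, D, E, F, G, H)
A = np.array([[0, 0, 1, 0, 0, 1, 0, 0],   # A: C, F
              [0, 0, 0, 0, 1, 0, 0, 1],   # B: E, H
              [1, 0, 0, 1, 0, 1, 0, 0],   # C: A, D, F
              [0, 0, 1, 0, 1, 0, 1, 0],   # D: C, E, G
              [0, 1, 0, 1, 0, 0, 0, 1],   # E: B, D, H
              [1, 0, 1, 0, 0, 0, 1, 0],   # F: A, C, G
              [0, 0, 0, 1, 0, 1, 0, 1],   # G: D, F, H
              [0, 1, 0, 0, 1, 0, 1, 0]])  # H: B, E, G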

# Degree matrix
D = np.diag([2,2,3,3,3,3,3,3])

# Laplacian Matrix
L = D - A

print("1:", np.count_nonzero(A) == 30)
print("2:", np.count_nonzero(A) == 22)
print("3:", np.count_nonzero(D) == 16)
print("4:", np.count_nonzero(L) == 64)

1: False
2: True
3: False
4: False
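
The question also asks for the sum of all entries and the number of nonzero entries of each matrix. A minimal sketch of that bookkeeping follows, assuming the edge list is read off the drawing above; the names `nodes`, `edges`, `adj`, `deg`, `lap`, and `summary` are illustrative and not part of the original cell.

```python
import numpy as np

# Node order and edge list read off the drawing (assumed, not from the original cell)
nodes = ["A", "B", "C", "D", "E", "F", "G", "H"]
edges = [("A", "C"), ("A", "F"), ("C", "D"), ("D", "E"), ("E", "B"), ("B", "H"),
         ("F", "G"), ("G", "H"), ("C", "F"), ("D", "G"), ("E", "H")]

index = {name: i for i, name in enumerate(nodes)}
adj = np.zeros((len(nodes), len(nodes)), dtype=int)
for u, v in edges:
    # Undirected graph: set both symmetric entries
    adj[index[u], index[v]] = 1
    adj[index[v], index[u]] = 1

deg = np.diag(adj.sum(axis=1))  # degree matrix: row sums on the diagonal
lap = deg - adj                 # Laplacian

# (sum of entries, number of nonzero entries) for each matrix
summary = {name: (int(m.sum()), int(np.count_nonzero(m)))
           for name, m in [("A", adj), ("D", deg), ("L", lap)]}
print(summary)  # {'A': (22, 22), 'D': (22, 8), 'L': (0, 30)}
```

Building the matrix from an edge list keeps it symmetric by construction, which avoids single-entry mistakes when typing an 8x8 matrix by hand.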


## Question 2

You are given the following graph.

In [30]:
"""
   2 ----6
 /  \    |
1    4   |
 \  /  \ |
  3      5 
""";

The goal is to find two clusters in this graph using spectral clustering on the Laplacian matrix. Compute the Laplacian of this graph. Then compute the second eigenvector of the Laplacian (the one corresponding to the second-smallest eigenvalue).

To cluster the points, we split at the mean value. We say that a node is a tie if its value in the eigenvector is exactly equal to the mean value. Assume that if a point is a tie, we choose its cluster at random. Identify the true statement from the list below.

<ol>
<li>2 and 5 can either be in the same cluster or in different clusters (depending on randomness)
<li>4 and 5 are in the same cluster
<li>1 and 6 are in the same cluster
<li>2 and 5 are in different clusters
</ol>

In [107]:
from numpy import linalg as LA

# Adjacency matrix
A = np.array([[0, 1, 1, 0, 0, 0],
              [1, 0, 0, 1, 0, 1],
              [1, 0, 0, 1, 0, 0],
              [0, 1, 1, 0, 1, 0],
              [0, 0, 0, 1, 0, 1],
              [0, 1, 0, 0, 1, 0]])

# Degree matrix
D = np.diag([2,3,2,3,2,2])

# Laplacian Matrix
L = D - A

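# eigh is meant for symmetric matrices and returns eigenvalues in ascending order;
# the eigenvectors are the COLUMNS of the returned matrix
eigenvalues, eigenvectors = LA.eigh(L)
e = eigenvectors[:, 1]  # eigenvector of the second-smallest eigenvalue
mean = e.mean()

def tie(i):
    # Compare with a tolerance: exact float equality is unreliable
    return np.isclose(e[i], mean)

def same(i, j):
    # Both nodes are definitely on the same side of the mean
    return not tie(i) and not tie(j) and (e[i] > mean) == (e[j] > mean)

def different(i, j):
    # Both nodes are definitely on opposite sides of the mean
    return not tie(i) and not tie(j) and (e[i] > mean) != (e[j] > mean)

print("1:", tie(1) or tie(4))
print("2:", same(3, 4))
print("3:", same(0, 5))
print("4:", different(1, 4))

1: True
2: False
3: False
4: False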
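
As a cross-check on the clustering, here is a small sketch that rebuilds the Laplacian from an edge list and lists which nodes fall above, below, or exactly on the mean of the second eigenvector. The edge list is read off the drawing, and the names `edges`, `lap`, `fiedler`, `ties`, `above`, and `below` are illustrative assumptions. The sign of an eigenvector is arbitrary, so the `above` and `below` groups may come out swapped.

```python
import numpy as np

# Edges read off the drawing (assumed); nodes are numbered 1..6
edges = [(1, 2), (1, 3), (2, 4), (3, 4), (2, 6), (4, 5), (5, 6)]
n = 6

adj = np.zeros((n, n))
for u, v in edges:
    adj[u - 1, v - 1] = adj[v - 1, u - 1] = 1

lap = np.diag(adj.sum(axis=1)) - adj

# eigh: symmetric input, ascending eigenvalues, eigenvectors in columns
vals, vecs = np.linalg.eigh(lap)
fiedler = vecs[:, 1]
mean = fiedler.mean()

# Nodes whose entry equals the mean (up to floating-point tolerance) are ties
ties = {i + 1 for i in range(n) if np.isclose(fiedler[i], mean)}
above = {i + 1 for i in range(n) if i + 1 not in ties and fiedler[i] > mean}
below = {i + 1 for i in range(n) if i + 1 not in ties and fiedler[i] < mean}

print("ties:", sorted(ties))
print("clusters:", sorted(above), sorted(below))
```

The graph is a 2x3 ladder, so the second-smallest eigenvalue is 1 and its eigenvector is proportional to (1, 0, 1, 0, -1, -1): nodes 2 and 4 sit exactly on the mean.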


## Question 3

We wish to estimate the surprise number (2nd moment) of a data stream, using the method of AMS. It happens that our stream consists of ten different values, which we'll call 1, 2,..., 10, that cycle repeatedly. That is, at timestamps 1 through 10, the element of the stream equals the timestamp, at timestamps 11 through 20, the element is the timestamp minus 10, and so on. It is now timestamp 75, and a 5 has just been read from the stream. As a start, you should calculate the surprise number for this time.

For our estimate of the surprise number, we shall choose three timestamps at random, and estimate the surprise number from each, using the AMS approach (length of the stream times 2m-1, where m is the number of occurrences of the element of the stream at that timestamp, considering all times from that timestamp on, to the current time). Then, our estimate will be the median of the three resulting values.

You should discover the simple rules that determine the estimate derived from any given timestamp and from any set of three timestamps. Then, identify from the list below the set of three "random" timestamps that give the closest estimate.

<ol>
<li>{17, 43, 51}
<li>{25, 34, 47}
<li>{5, 33, 67}
<li>{14, 35, 42}
</ol>

In [123]:
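def cycle_number(n):
    """Stream element at 1-based timestamp n (the values 1..10 repeat)."""
    return (n - 1) % 10 + 1

def ams(stream, t):
    """AMS estimate from 1-based timestamp t: n * (2m - 1)."""
    window = stream[t - 1:]          # elements from timestamp t up to now
    m = window.count(stream[t - 1])  # occurrences of the element at t, from t on
    return len(stream) * (2 * m - 1)

stream = [cycle_number(x) for x in range(1, 76)]

ams(stream, 44)

525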
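
To compare the four candidate sets, one possible sketch computes the true surprise number and the median of the three AMS estimates for each option. It assumes the estimate is n(2m - 1) with n = 75, as described in the question; the helper names `element_at`, `second_moment`, `ams_estimate`, and `medians` are hypothetical.

```python
from collections import Counter
from statistics import median

def element_at(t):
    """Stream element at 1-based timestamp t (values 1..10 cycle)."""
    return (t - 1) % 10 + 1

stream = [element_at(t) for t in range(1, 76)]

def second_moment(values):
    """True surprise number: sum of squared occurrence counts."""
    return sum(c * c for c in Counter(values).values())

def ams_estimate(values, t):
    """n * (2m - 1), where m counts the element at t from t to the end."""
    m = values[t - 1:].count(values[t - 1])
    return len(values) * (2 * m - 1)

truth = second_moment(stream)
options = [(17, 43, 51), (25, 34, 47), (5, 33, 67), (14, 35, 42)]
medians = [median(ams_estimate(stream, t) for t in option) for option in options]

print("true surprise number:", truth)  # 565
for number, (option, est) in enumerate(zip(options, medians), start=1):
    print(number, option, est, abs(est - truth))
# 1 (17, 43, 51) 525 40
# 2 (25, 34, 47) 675 110
# 3 (5, 33, 67) 675 110
# 4 (14, 35, 42) 675 110
```

The simple rule behind these numbers: m depends only on which block of ten timestamps t falls in (counting back from 75), so timestamps 36 through 45 all give 75 * 7 = 525, timestamps 26 through 35 give 675, and so on.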

## Question 4

We wish to use the Flajolet-Martin algorithm of Section 4.4 to count the number of distinct elements in a stream. Suppose that there are ten possible elements, 1, 2,..., 10, that could appear in the stream, but only four of them have actually appeared. To make our estimate of the count of distinct elements, we hash each element to a 4-bit binary number. The element x is hashed to 3x + 7 (modulo 11). For example, element 8 hashes to 3*8+7 = 31, which is 9 modulo 11 (i.e., the remainder of 31/11 is 9). Thus, the 4-bit string for element 8 is 1001.
A set of four of the elements 1 through 10 could give an estimate that is exact (if the estimate is 4), or too high, or too low. You should figure out under what circumstances a set of four elements falls into each of those categories. Then, identify in the list below the set of four elements that gives the exactly correct estimate.

<ol>
<li>{4, 5, 6, 7}
<li>{4, 6, 9, 10}
<li>{3, 7, 8, 10}
<li>{3, 4, 8, 10}
</ol>

In [None]:
#TODO
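
One way to fill in this cell is sketched below. It assumes the usual Flajolet-Martin estimate 2^R, where R is the largest number of trailing zeros among the hashed values, and it assumes that a hash value of 0 (bit string 0000) counts as four trailing zeros. The names `fm_hash`, `tail_length`, `fm_estimate`, and `estimates` are illustrative.

```python
def fm_hash(x):
    """Hash from the question: 3x + 7 (mod 11)."""
    return (3 * x + 7) % 11

def tail_length(value, width=4):
    """Number of trailing zeros in the width-bit binary form of value."""
    if value == 0:
        return width  # assumption: 0000 counts as `width` trailing zeros
    count = 0
    while value % 2 == 0:
        value //= 2
        count += 1
    return count

def fm_estimate(elements):
    """Flajolet-Martin estimate 2^R, R = longest tail of zeros seen."""
    return 2 ** max(tail_length(fm_hash(x)) for x in elements)

options = [{4, 5, 6, 7}, {4, 6, 9, 10}, {3, 7, 8, 10}, {3, 4, 8, 10}]
estimates = [fm_estimate(option) for option in options]

for number, (option, est) in enumerate(zip(options, estimates), start=1):
    print(number, sorted(option), est)
# 1 [4, 5, 6, 7] 16
# 2 [4, 6, 9, 10] 8
# 3 [3, 7, 8, 10] 4
# 4 [3, 4, 8, 10] 8
```

Under these assumptions the estimate is exactly 4 when the set contains 10 (hash 0100) but neither 4 (hash 1000) nor 5 (hash 0000).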

## Question 5

Suppose we are using the DGIM algorithm of Section 4.6.2 to estimate the number of 1's in suffixes of a sliding window of length 40. The current timestamp is 100, and we have the following buckets stored:

<table style="float:left">
<tr>
    <td><b>End Time</b></td><td>100</td><td>98</td><td>95</td><td>92</td><td>87</td><td>80</td><td>65</td>
</tr>
<tr>
    <td><b>Size</b></td><td>1</td><td>1</td><td>2</td><td>2</td><td>4</td><td>8</td><td>8</td>
</tr>
</table>

<br style="clear:both" />

Note: we are showing timestamps as absolute values, rather than modulo the window size, as DGIM would do.

Suppose that at times 101 through 105, 1's appear in the stream. Compute the set of buckets that would exist in the system at time 105. Then identify one such bucket from the list below. Buckets are represented by pairs (end-time, size).

<ol>
<li>(102,4)
<li>(95,4)
<li>(103,2)
<li>(87,4)
</ol>

In [50]:
#TODO

#Video - Counting 1's
#Book - Section 4.6.5
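
A possible way to fill in the cell above is to simulate the bucket updates directly. The sketch below assumes the standard DGIM rules: a bucket is dropped once its end time is at or before the current time minus the window length, each arriving 1 creates a size-1 bucket, and whenever three buckets share a size the two oldest are merged and keep the more recent end time. The function name `dgim_add_one` and the newest-first list representation are illustrative choices.

```python
def dgim_add_one(buckets, t, window):
    """Process a 1 arriving at time t.

    buckets is a list of (end_time, size) pairs, newest first.
    """
    # Drop buckets that have fallen entirely out of the window
    buckets = [(end, size) for end, size in buckets if end > t - window]
    buckets.insert(0, (t, 1))

    size = 1
    while True:
        same = [i for i, (_, s) in enumerate(buckets) if s == size]
        if len(same) < 3:
            break
        # Merge the two oldest buckets of this size; the merged bucket
        # keeps the end time of the more recent of the two
        i, j = same[-2], same[-1]
        buckets[i:j + 1] = [(buckets[i][0], 2 * size)]
        size *= 2
    return buckets

# Buckets stored at time 100, newest first
buckets = [(100, 1), (98, 1), (95, 2), (92, 2), (87, 4), (80, 8), (65, 8)]
for t in range(101, 106):
    buckets = dgim_add_one(buckets, t, window=40)

print(buckets)  # [(105, 1), (104, 2), (102, 4), (95, 8), (80, 8)]
```

With these rules the bucket ending at time 65 is dropped at time 105, since 105 - 40 = 65, and the cascade of merges at time 105 produces the bucket (102, 4).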