# Week 2A: LSH (Basic)

## Question 1

The edit distance is the minimum number of character insertions and character deletions required to turn one string into another. Compute the edit distance between each pair of the strings he, she, his, and hers. Then, identify which of the following is a true statement about the number of pairs at a certain edit distance.

<ol>
<li>There is 1 pair at distance 1.
<li>There is 1 pair at distance 3.
<li>There are 2 pairs at distance 3.
<li>There are 2 pairs at distance 4.
</ol>

In [17]:
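from itertools import combinations
from nltk.metrics.distance import edit_distance

strings = ["he", "she", "his", "hers"]

# NOTE: edit_distance allows substitutions by default (cost 1), but the question
# only permits insertions and deletions. With substitution_cost=2 a substitution
# costs the same as a deletion plus an insertion, which gives the required distance.
for a, b in combinations(strings, 2):
    print(a, "-", b, edit_distance(a, b, substitution_cost=2))

he - she 1
he - his 3
he - hers 2
she - his 4
she - hers 3
his - hers 3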
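
The definition in the question excludes substitutions, so the distance between two strings equals the sum of their lengths minus twice the length of their longest common subsequence (LCS). The block below is a minimal pure-Python sketch of that identity, meant as a cross-check rather than a reference implementation; the helper name `indel_distance` is an assumption introduced here and is not part of NLTK.

```python
from itertools import combinations

def indel_distance(a, b):
    """Edit distance using insertions and deletions only (no substitutions)."""
    # lcs[i][j] = length of the longest common subsequence of a[:i] and b[:j]
    lcs = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, ca in enumerate(a, start=1):
        for j, cb in enumerate(b, start=1):
            if ca == cb:
                lcs[i][j] = lcs[i - 1][j - 1] + 1
            else:
                lcs[i][j] = max(lcs[i - 1][j], lcs[i][j - 1])
    # Every character outside the common subsequence is deleted or inserted.
    return len(a) + len(b) - 2 * lcs[len(a)][len(b)]

strings = ["he", "she", "his", "hers"]
distances = {(a, b): indel_distance(a, b) for a, b in combinations(strings, 2)}
for (a, b), d in distances.items():
    print(a, "-", b, d)
```

Under this definition there is one pair at distance 1 (he, she), one at distance 2, three at distance 3, and one at distance 4.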


## Question 2

Consider the following matrix:

<table style="float:left">
  <tr>
    <td></td>
    <td><b>C1</b></td>
    <td><b>C2</b></td>
    <td><b>C3</b></td>
    <td><b>C4</b></td>
  </tr>
  <tr>
    <td>R1</td><td>0</td><td>1</td><td>1</td><td>0</td>
  </tr>
  <tr>
    <td>R2</td><td>1</td><td>0</td><td>1</td><td>0</td>
  </tr>
  <tr>
    <td>R3</td><td>0</td><td>1</td><td>0</td><td>1</td>
  </tr>
  <tr>
    <td>R4</td><td>0</td><td>0</td><td>1</td><td>0</td>
  </tr>
  <tr>
    <td>R5</td><td>1</td><td>0</td><td>1</td><td>0</td>
  </tr>
  <tr>
    <td>R6</td><td>0</td><td>1</td><td>0</td><td>0</td>
  </tr>
</table>

<br style="clear:both" />

Perform a minhashing of the data, with the order of rows: R4, R6, R1, R3, R5, R2. Which of the following is the correct minhash value of the stated column? Note: we give the minhash value in terms of the original name of the row, rather than the order of the row in the permutation. These two schemes are equivalent, since we only care whether hash values for two columns are equal, not what their actual values are.

<ol>
<li>The minhash value for C1 is R2
<li>The minhash value for C1 is R5
<li>The minhash value for C2 is R3
<li>The minhash value for C2 is R4
</ol>

In [18]:
import numpy as np

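# Input matrix: rows R1..R6, columns C1..C4
I = [[0, 1, 1, 0],
     [1, 0, 1, 0],
     [0, 1, 0, 1],
     [0, 0, 1, 0],
     [1, 0, 1, 0],
     [0, 1, 0, 0]]

# Row order R4, R6, R1, R3, R5, R2 as 0-based row indices
order = [3, 5, 0, 2, 4, 1]

def min_hash(M, order):
    # 0 marks a column whose minhash value has not been found yet
    res = [0] * len(M[0])
    for row in order:
        for c, col in enumerate(M[row]):
            if col == 1 and res[c] == 0:
                # Record the original 1-based row name of the first 1 in this column
                res[c] = row + 1
    return res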
    
m = min_hash(I, order)

print("Minhash values")
print(m)

Minhash values
[5, 6, 4, 3]
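
The note in the question says that reporting the original row name is equivalent to reporting the row's position in the permuted order. The following sketch illustrates this on the matrix from the table by computing both forms side by side; the function name `minhash` and the string row labels are illustrative choices, not part of the original cell.

```python
# Characteristic matrix from the question: rows R1..R6, columns C1..C4.
matrix = [
    [0, 1, 1, 0],  # R1
    [1, 0, 1, 0],  # R2
    [0, 1, 0, 1],  # R3
    [0, 0, 1, 0],  # R4
    [1, 0, 1, 0],  # R5
    [0, 1, 0, 0],  # R6
]
row_order = ["R4", "R6", "R1", "R3", "R5", "R2"]

def minhash(matrix, row_order):
    """Return, per column, (row name, 1-based position in row_order) of the first 1."""
    result = []
    for col in range(len(matrix[0])):
        for position, name in enumerate(row_order, start=1):
            row_index = int(name[1:]) - 1  # "R4" -> index 3
            if matrix[row_index][col] == 1:
                result.append((name, position))
                break
        else:
            result.append((None, None))  # column containing no 1s
    return result

signature = minhash(matrix, row_order)
by_name = [name for name, _ in signature]
by_position = [pos for _, pos in signature]
print(by_name)      # ['R5', 'R6', 'R4', 'R3']
print(by_position)  # [5, 2, 1, 4]
```

Both lists identify the same rows, so either can serve as the signature. The minhash value for C1 is R5, which matches the second option in the list.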


## Question 3

Here is a matrix representing the signatures of seven columns, C1 through C7.

<table style="float:left">
  <tr>
    <td><b>C1</b></td>
    <td><b>C2</b></td>
    <td><b>C3</b></td>
    <td><b>C4</b></td>
    <td><b>C5</b></td>
    <td><b>C6</b></td>
    <td><b>C7</b></td>
  </tr>
  <tr> <td>1<td>2<td>1<td>1<td>2<td>5<td>4 </tr>
  <tr> <td>2<td>3<td>4<td>2<td>3<td>2<td>2 </tr>
  <tr> <td>3<td>1<td>2<td>3<td>1<td>3<td>2 </tr>
  <tr> <td>4<td>1<td>3<td>1<td>2<td>4<td>4 </tr>
  <tr> <td>5<td>2<td>5<td>1<td>1<td>5<td>1 </tr>
  <tr> <td>6<td>1<td>6<td>4<td>1<td>1<td>4 </tr>
</table>

<br style="clear:both" />

Suppose we use locality-sensitive hashing with three bands of two rows each. Assume there are enough buckets available that the hash function for each band can be the identity function (i.e., columns hash to the same bucket if and only if they are identical in the band). Find all the candidate pairs, and then identify one of them in the list below.

<ol>
<li>C4 and C7
<li>C3 and C4
<li>C3 and C6
<li>C2 and C4
</ol>

In [30]:
# Signature matrix
S = np.array([
     [1, 2, 1, 1, 2, 5, 4],
     [2, 3, 4, 2, 3, 2, 2],
     [3, 1, 2, 3, 1, 3, 2],
     [4, 1, 3, 1, 2, 4, 4],
     [5, 2, 5, 1, 1, 5, 1],
     [6, 1, 6, 4, 1, 1, 4]])

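def candidate_pairs(matrix, rows_per_band=2):
    # Build a fresh list on every call; a mutable default argument would keep
    # accumulating pairs across calls
    r = []
    n_rows, n_cols = matrix.shape
    for start in range(0, n_rows, rows_per_band):
        band = matrix[start:start + rows_per_band]
        for i, j in combinations(range(n_cols), 2):
            # Two columns are a candidate pair if they agree on every row of the band
            if np.array_equal(band[:, i], band[:, j]):
                r.append((i + 1, j + 1))
    return r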

c = candidate_pairs(S)

print("Candidate pairs")
print(c)
print()

print("1:", (4, 7) in c)
print("2:", (3, 4) in c)
print("3:", (3, 6) in c)
print("4:", (2, 4) in c)

Candidate pairs
[(1, 4), (2, 5), (1, 6), (1, 3), (4, 7)]

1: True
2: False
3: False
4: False
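
The question describes the banding step in terms of buckets: within a band, two columns land in the same bucket exactly when their band values are identical. A minimal sketch of that bucket-based formulation follows, using the band's values as the dictionary key to stand in for the identity hash. The function name `lsh_candidates` and its `rows_per_band` parameter are assumptions made for illustration.

```python
from collections import defaultdict
from itertools import combinations

# Signature matrix from the question: one column per C1..C7.
signatures = [
    [1, 2, 1, 1, 2, 5, 4],
    [2, 3, 4, 2, 3, 2, 2],
    [3, 1, 2, 3, 1, 3, 2],
    [4, 1, 3, 1, 2, 4, 4],
    [5, 2, 5, 1, 1, 5, 1],
    [6, 1, 6, 4, 1, 1, 4],
]

def lsh_candidates(signatures, rows_per_band):
    """Collect pairs of columns that share a bucket in at least one band."""
    n_cols = len(signatures[0])
    candidates = set()
    for start in range(0, len(signatures), rows_per_band):
        band = signatures[start:start + rows_per_band]
        buckets = defaultdict(list)
        for col in range(n_cols):
            # Identity hash: the bucket key is the tuple of band values itself.
            key = tuple(row[col] for row in band)
            buckets[key].append(col + 1)  # 1-based column number
        for members in buckets.values():
            candidates.update(combinations(members, 2))
    return candidates

pairs = lsh_candidates(signatures, rows_per_band=2)
print(sorted(pairs))  # [(1, 3), (1, 4), (1, 6), (2, 5), (4, 7)]
```

Grouping columns by bucket avoids comparing every pair of columns in every band, which matters once the number of columns grows beyond a toy example.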


## Question 4

Find the set of 2-shingles for the "document":

<p style="color:blue">ABRACADABRA</p>

and also for the "document":

<p style="color:blue">BRICABRAC</p>

Answer the following questions:

<ol>
<li>How many 2-shingles does ABRACADABRA have?
<li>How many 2-shingles does BRICABRAC have?
<li>How many 2-shingles do they have in common?
<li>What is the Jaccard similarity between the two documents?
</ol>

Then, find the true statement in the list below.

<ol>
<li>BRICABRAC has 6 2-shingles.
<li>ABRACADABRA has 7 2-shingles.
<li>BRICABRAC has 8 2-shingles.
<li>ABRACADABRA has 10 2-shingles.
</ol>

In [35]:
from nltk.util import ngrams

d1 = "ABRACADABRA"
d2 = "BRICABRAC"

def shingles(word, n):
    return set([''.join(w) for w in ngrams(word, n)])
    
s1 = shingles(d1, 2)
s2 = shingles(d2, 2)
i = set.intersection(s1, s2)
u = set.union(s1, s2)
jaccard = len(i) / len(u)

print("2-shingles in d1:", len(s1))
print("2-shingles in d2:", len(s2))
print("Overlap:", len(i))
print("Jaccard Similarity:", round(jaccard, 2))

2-shingles in d1: 7
2-shingles in d2: 7
Overlap: 5
Jaccard Similarity: 0.56
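
To see which shingles are behind these counts, the sets can also be built with plain string slicing. This is a small sketch that does not depend on NLTK; it treats a k-shingle as any substring of k consecutive characters, and the variable names are chosen here for illustration.

```python
def shingles(text, k):
    """Return the set of distinct k-character substrings of text."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}

s1 = shingles("ABRACADABRA", 2)
s2 = shingles("BRICABRAC", 2)
common = s1 & s2
jaccard = len(common) / len(s1 | s2)

print(sorted(s1))          # ['AB', 'AC', 'AD', 'BR', 'CA', 'DA', 'RA']
print(sorted(s2))          # ['AB', 'AC', 'BR', 'CA', 'IC', 'RA', 'RI']
print(sorted(common))      # ['AB', 'AC', 'BR', 'CA', 'RA']
print(round(jaccard, 2))   # 0.56
```

The union has 9 distinct shingles, so the Jaccard similarity is 5/9.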


<strike><h2> Question 5 </h2> </strike>

## Question 6

Suppose we want to assign points to whichever of the points (0,0) or (100,40) is nearer.
Depending on whether we use the L1 or L2 norm, a point (x,y) could be clustered with a different
one of these two points. For this problem, you should work out the conditions under which a point
will be assigned to (0,0) when the L1 norm is used, but assigned to (100,40) when the L2 norm is used.
Identify one of those points from the list below.

<ol>
<li>(54,8)
<li>(50,18)
<li>(53,10)
<li>(57,5)
</ol>

In [57]:
import numpy as np

def l1(a, b):
    # L1 (Manhattan) distance: sum of the absolute coordinate differences
    return np.sum(np.abs(a - b))

def l2(a, b):
    # L2 (Euclidean) distance
    return np.sqrt(np.sum((a - b)**2))

# The points which we calculate the distance to
centroids = np.array([
        [0, 0],
        [100, 40]
    ])

# The points we want to test (the options listed in the question)
points = np.array([
        [54,8],
        [50,18],
        [53,10],
        [57,5]
        ])

winner = None

for p in points:
    # Calculate the distances using the L1 norm
    l1_p1_dist, l1_p2_dist = l1(p, centroids[0]), l1(p, centroids[1])
    # Calculate the distances using the L2 norm
    l2_p1_dist, l2_p2_dist = l2(p, centroids[0]), l2(p, centroids[1])
    # Keep the point that is nearer to (0,0) under L1 but nearer to (100,40) under L2
    if l1_p1_dist < l1_p2_dist and l2_p1_dist > l2_p2_dist:
        winner = p

print("Winner")
print(winner)

Winner
[57  5]
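
The conditions the question asks for can also be worked out in closed form. Assuming 0 ≤ x ≤ 100 and 0 ≤ y ≤ 40, the L1 comparison x + y < (100 - x) + (40 - y) simplifies to x + y < 70, and the L2 comparison x² + y² > (100 - x)² + (40 - y)² simplifies to 200x + 80y > 11600, that is, 5x + 2y > 290. The sketch below applies both inequalities to the four options listed in the question; the helper names are illustrative.

```python
# Candidate points listed in the question.
options = [(54, 8), (50, 18), (53, 10), (57, 5)]

def l1_prefers_origin(x, y):
    # |x| + |y| < |100 - x| + |40 - y| reduces to x + y < 70
    # when 0 <= x <= 100 and 0 <= y <= 40.
    return x + y < 70

def l2_prefers_far_point(x, y):
    # x^2 + y^2 > (100 - x)^2 + (40 - y)^2 reduces to 5x + 2y > 290.
    return 5 * x + 2 * y > 290

matches = [p for p in options if l1_prefers_origin(*p) and l2_prefers_far_point(*p)]
print(matches)  # [(57, 5)]
```

All four options satisfy x + y < 70, so the L2 inequality is what separates them: only (57, 5) gives 5x + 2y = 295 > 290.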
