# Exercise 1

Let $P$ be a set of $n$ bit vectors of length $d$. Give a data structure (for Hamming distance) for which $\texttt{Insert}$ can be implemented in $O(1)$ time and $\texttt{NearestNeighbor}$ in $O(nd)$ time.
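
One way to meet these bounds, as a minimal sketch: keep the vectors in an unsorted dynamic array, so that $\texttt{Insert}$ is an append and $\texttt{NearestNeighbor}$ scans all $n$ vectors at $O(d)$ per Hamming distance. The class and method names below are illustrative assumptions, not part of the exercise.

```python
class LinearScanIndex:
    """Unsorted list of bit strings; all names here are illustrative assumptions."""

    def __init__(self):
        self.points = []

    def insert(self, x):
        # Appending to a Python list takes O(1) amortized time.
        self.points.append(x)

    @staticmethod
    def hamming(a, b):
        # O(d): count the positions where the two strings differ.
        return sum(ca != cb for ca, cb in zip(a, b))

    def nearest_neighbor(self, q):
        # O(nd): one O(d) distance computation per stored string.
        if not self.points:
            return None
        return min(self.points, key=lambda p: self.hamming(p, q))


index = LinearScanIndex()
for s in ["10110011", "00110010", "01001010", "01001000"]:
    index.insert(s)
print(index.nearest_neighbor("01001011"))
```

Note that the append is $O(1)$ amortized; a linked list would give a worst-case $O(1)$ insert if that distinction matters.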

# Exercise 2: LSH Hamming Distance
Let $T_1,T_2$ be hash tables of size $5$ with hash functions:
- $h_1(x)=(3x+4)\text{ mod }5$
- $h_2(x)=(7x+2)\text{ mod }5$

Let:
- $g_1(x)=x_1x_4x_8$
- $g_2(x)=x_1x_7x_7$

Insert the bit strings:
- $x=10110011$
- $y=10001101$
- $z=00110010$
- $u=01001010$
- $v=01001000$

and draw the hash tables.
Compute the results of $\texttt{ApxNearNeighbor}(10111010)$.

In [3]:
from lib.alsh import ALSH
from lib.lsh import LSH
from lib.hash_function import HashFunction
from lib.filter import Filter

X = [
    {
        'name': "x",
        'value': '10110011'
    },
    {
        'name': "y",
        'value': '10001101'
    },
    {
        'name': "z",
        'value': '00110010'
    },
    {
        'name': "u",
        'value': '01001010'
    },
    {
        'name': "v",
        'value': '01001000'
    }
]

alsh = ALSH([
    LSH(5, HashFunction(3, 4, 5), Filter([1, 4, 8]), "Table 1"),
    LSH(5, HashFunction(7, 2, 5), Filter([1, 7, 7]), "Table 2")
])

alsh.insert(X)
alsh.print_tables()

w = {
    'name': 'w',
    'value': '10111010'
}

closest, distance = alsh.apxNearNeighbor(w)
print(f"Closest: {closest}, Distance: {distance}")

Table: "Table 1"
0: {'111': [{'name': 'x', 'value': '10110011'}], '010': [{'name': 'z', 'value': '00110010'}]}
1: {}
2: {}
3: {}
4: {'101': [{'name': 'y', 'value': '10001101'}], '000': [{'name': 'u', 'value': '01001010'}, {'name': 'v', 'value': '01001000'}]}

Table: "Table 2"
0: {'100': [{'name': 'y', 'value': '10001101'}]}
1: {'111': [{'name': 'x', 'value': '10110011'}]}
2: {'000': [{'name': 'v', 'value': '01001000'}]}
3: {'011': [{'name': 'z', 'value': '00110010'}, {'name': 'u', 'value': '01001010'}]}
4: {}

Closest: {'name': 'x', 'value': '10110011'}, Distance: 2
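
The tables above can be re-derived without the notebook's `lib` package. The sketch below assumes, as the printed output suggests, that each $g_i$ concatenates the selected bits (1-indexed), that the result is read as a binary number before $h_i$ is applied, and that a query is compared against every string in its matching buckets. The five strings are the ones from the code cell, and the helper names are hypothetical.

```python
def g(bits, positions):
    # Concatenate the selected bits; positions are 1-indexed as in the exercise.
    return "".join(bits[p - 1] for p in positions)

def h(key, a, b, m=5):
    # Read the key as a binary number, then apply (a*x + b) mod m.
    return (a * int(key, 2) + b) % m

def hamming(s, t):
    return sum(cs != ct for cs, ct in zip(s, t))

# (a, b, positions) for T_1 and T_2, taken from the exercise statement.
tables = [(3, 4, [1, 4, 8]), (7, 2, [1, 7, 7])]
points = {"x": "10110011", "y": "10001101", "z": "00110010",
          "u": "01001010", "v": "01001000"}

def bucket(bits, a, b, positions):
    # A bucket is identified by the table slot and the projected key.
    key = g(bits, positions)
    return h(key, a, b), key

def apx_near_neighbor(query):
    # Collect every stored string sharing a (slot, key) bucket with the query.
    candidates = set()
    for a, b, positions in tables:
        target = bucket(query, a, b, positions)
        for name, bits in points.items():
            if bucket(bits, a, b, positions) == target:
                candidates.add(name)
    if not candidates:
        return None
    # Iterate in name order so that ties are broken deterministically.
    best = min(sorted(candidates), key=lambda n: hamming(points[n], query))
    return best, hamming(points[best], query)

for a, b, positions in tables:
    print({name: bucket(bits, a, b, positions) for name, bits in points.items()})
print(apx_near_neighbor("10111010"))
```

Under these assumptions the query lands in an empty slot of $T_1$ (key $110$, slot $2$) and only meets $x$ in $T_2$, which agrees with the printed result.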


# Exercise 3: $c$-Approximate Closest Pair under Hamming Distance

Assume you have a data structure for bit vectors in which $\texttt{Insert}$ and $\texttt{ApxNearestNeighbor}$ each run in time $T(n)$. The distance metric is the Hamming distance, $n$ is the number of bit strings in the data structure, and $\texttt{ApxNearestNeighbor}(x)$ returns a point at distance no more than $c\cdot\min_{z\in P}d(x,z)$ from $x$.
Give an algorithm that, given a set $P$ of $n$ bit vectors each of length $d$, finds a pair $x,y$ of distinct strings in $P$ in time $O(T(n)\cdot n+dn)$ such that $d(x,y)\le c\cdot\min_{u,v\in P,u\ne v}d(u,v)$.
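
One possible approach, sketched under assumptions: process the strings one at a time, query the structure for the current string before inserting it, and keep the best pair seen. If $(u,v)$ is the true closest pair and $u$ is inserted first, the query for $v$ returns a string within $c\cdot d(u,v)$ of $v$. This makes $n$ inserts, $n-1$ queries, and $n-1$ distance computations of $O(d)$ each, assuming $T$ is nondecreasing. `ExactIndex` is a hypothetical stand-in for the assumed structure (it searches exactly, so $c=1$), and its method names are not from the exercise.

```python
def hamming(a, b):
    return sum(ca != cb for ca, cb in zip(a, b))


class ExactIndex:
    """Hypothetical stand-in for the assumed structure (exact search, so c = 1)."""

    def __init__(self):
        self.points = []

    def insert(self, x):
        self.points.append(x)

    def apx_nearest_neighbor(self, q):
        return min(self.points, key=lambda p: hamming(p, q))


def apx_closest_pair(P, index):
    best_pair, best_dist = None, None
    for i, p in enumerate(P):
        if i > 0:
            # Query before inserting p, so the answer is a different string.
            q = index.apx_nearest_neighbor(p)   # one query, time T(n)
            dist = hamming(p, q)                # O(d)
            if best_dist is None or dist < best_dist:
                best_pair, best_dist = (q, p), dist
        index.insert(p)                         # one insert, time T(n)
    return best_pair, best_dist


P = ["10110011", "10001101", "00110010", "01001010", "01001000"]
print(apx_closest_pair(P, ExactIndex()))
```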

# Exercise 4: Hamming Distance Analysis
From the proof of Claim 1 on the slides: Prove that $Lp_1^k\ge 2$.
*Hint: recall that $k=\dfrac{\log_2(n)}{\log_2(1/p_2)}$.*
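
A sketch of how the hint can be used, assuming the slides set $\rho=\dfrac{\log_2(1/p_1)}{\log_2(1/p_2)}$ and $L=\lceil 2n^{\rho}\rceil$ (these definitions are not restated in this exercise, so treat them as assumptions):

$$p_1^k=2^{-k\log_2(1/p_1)}=2^{-\log_2(n)\cdot\frac{\log_2(1/p_1)}{\log_2(1/p_2)}}=n^{-\rho},\qquad\text{so}\qquad Lp_1^k\ge 2n^{\rho}\cdot n^{-\rho}=2.$$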

# Exercise 5: Hamming Distance Analysis 2
In this exercise we will analyse the LSH scheme for Hamming distance. Recall that in a query we stop after we have checked $6L+1$ strings. Let $F=\{y\in P:d(x,y)>cr\}$ (strings far from $x$) and let $z^*$ be a fixed string with $d(x,z^*)\le r$. We say that $y$ collides with $x$ if $g_j(x)=g_j(y)$ for some $j\in\{1,\ldots,L\}$.

## Part 1
Explain why it is enough to prove that the following two properties hold:
1. The number of strings in $F$ that collide with $x$ is at most $6L$.
2. $z^*$ collides with $x$.

## Part 2
Let $y$ be a string in $F$. Prove that $P[y\text{ collides with }x\text{ in }T_j]\le\dfrac 1n$.
*Hint: recall that $k=\dfrac{\log(n)}{\log(1/p_2)}$.*
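
A sketch of the computation behind the hint, assuming that $g_j$ samples $k$ positions independently and that a string $y\in F$ agrees with $x$ on a single sampled position with probability at most $p_2$:

$$P[g_j(x)=g_j(y)]\le p_2^k=2^{-k\log_2(1/p_2)}=2^{-\log_2(n)}=\frac1n.$$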

## Part 3
Let $X_{y,j}=1$ if $y$ collides with $x$ in $T_j$ and $0$ otherwise, and let $X=\displaystyle\sum_{y\in F}\sum_{j=1}^LX_{y,j}$. Prove that $\mathbb E[X]\le L$.

## Part 4
Use Markov's inequality to show that $P[X>6L]<\dfrac16$.
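
As a sketch, Markov's inequality $P[X\ge a]\le\mathbb E[X]/a$ applies because $X$ is non-negative. Since $X$ is also integer-valued, and using $\mathbb E[X]\le L$ from Part 3, the strict bound can be obtained as follows:

$$P[X>6L]=P[X\ge 6L+1]\le\frac{\mathbb E[X]}{6L+1}\le\frac{L}{6L+1}<\frac16.$$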

## Part 5
Prove that if there exists a string $z^*$ in $P$ with $d(x,z^*)\le r$ then with probability at least $\dfrac23$ we will return some $y$ in $P$ for which $d(x,y)\le cr$.
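
One way the pieces might combine, sketched under assumptions: the query can only fail if one of the two properties from Part 1 fails. Assuming the $L$ tables are built independently and that $z^*$ collides with $x$ in a given table with probability at least $p_1^k$, a union bound together with Part 4 and $Lp_1^k\ge 2$ from Exercise 4 gives

$$P[\text{failure}]\le P[X>6L]+(1-p_1^k)^L<\frac16+e^{-Lp_1^k}\le\frac16+e^{-2}<\frac13,$$

where the last step uses $e^{-2}\approx 0.135<\dfrac16$.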

# Exercise 6: Jaccard distance and Sim Hash

The Jaccard similarity of two sets is defined as $\text{JSIM}(A,B)=\dfrac{|A\cap B|}{|A\cup B|}$. In $\texttt{MinHash}$ you pick a random permutation $\pi$ of the elements in the universe and let $h(A)=\min_{a\in A}\pi(a)$.

## Part 1
Let:
- $S_1=\{a,e\}$
- $S_2=\{b\}$
- $S_3=\{a,c,e\}$
- $S_4=\{b,d,e\}$

Compute the Jaccard similarity of each pair of sets.

In [4]:
from lib.min_hash import MinHash
from lib.utils import jaccard
from itertools import chain

sets = [
    ['a', 'e'],
    ['b'],
    ['a', 'c', 'e'],
    ['b', 'd', 'e']
]

for i in range(len(sets)):
    for j in range(i+1, len(sets)):
        print(f"Jaccard similarity between {sets[i]} and {sets[j]}: {jaccard(sets[i], sets[j])}")

Jaccard similarity between ['a', 'e'] and ['b']: 0.0
Jaccard similarity between ['a', 'e'] and ['a', 'c', 'e']: 0.6666666666666666
Jaccard similarity between ['a', 'e'] and ['b', 'd', 'e']: 0.25
Jaccard similarity between ['b'] and ['a', 'c', 'e']: 0.0
Jaccard similarity between ['b'] and ['b', 'd', 'e']: 0.3333333333333333
Jaccard similarity between ['a', 'c', 'e'] and ['b', 'd', 'e']: 0.2
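
The `jaccard` helper imported from `lib.utils` is not shown in this notebook. A minimal self-contained version, written here as an assumption about what it computes, is sketched below; it uses exact fractions so the results can be compared with a hand calculation.

```python
from fractions import Fraction
from itertools import combinations


def jaccard(a, b):
    # |A ∩ B| / |A ∪ B|, computed exactly; assumes the union is non-empty.
    a, b = set(a), set(b)
    return Fraction(len(a & b), len(a | b))


named_sets = {
    "S1": {"a", "e"},
    "S2": {"b"},
    "S3": {"a", "c", "e"},
    "S4": {"b", "d", "e"},
}

for (n1, s1), (n2, s2) in combinations(named_sets.items(), 2):
    print(f"JSIM({n1}, {n2}) = {jaccard(s1, s2)}")
```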


## Part 2
Let $S_1,S_2,S_3,S_4$ be as above and let the random permutation be $(b,d,e,a,c)$, i.e., $\pi(a)=4$, $\pi(b)=1$, etc.
Compute the min-hash value of each of the sets.

In [46]:
minhash = MinHash(list(set(chain.from_iterable(sets))))
minhash.insert(sets)
minhash.print_table()

Table: "MinHash"
['c', 'e', 'a', 'd', 'b']
0: [['a', 'c', 'e']]
1: [['a', 'e'], ['b', 'd', 'e']]
2: []
3: []
4: [['b']]
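
The table printed above uses the permutation `['c', 'e', 'a', 'd', 'b']` rather than the permutation $(b,d,e,a,c)$ given in the exercise. The sketch below computes the min-hash values for the permutation from the exercise directly, with 1-indexed ranks so that $\pi(b)=1$ and $\pi(a)=4$; the helper name `min_hash` is an assumption.

```python
# Permutation from the exercise: position in this list is the rank pi(.).
order = ["b", "d", "e", "a", "c"]
pi = {element: rank for rank, element in enumerate(order, start=1)}


def min_hash(s):
    # h(A) = min over a in A of pi(a).
    return min(pi[a] for a in s)


named_sets = {
    "S1": {"a", "e"},
    "S2": {"b"},
    "S3": {"a", "c", "e"},
    "S4": {"b", "d", "e"},
}

for name, s in named_sets.items():
    print(f"h({name}) = {min_hash(s)}")
```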


## Part 3
Prove that the probability that the min-hash of two sets is the same is equal to the Jaccard similarity of the two sets, i.e., that $P[h(A)=h(B)]=\dfrac{|A\cap B|}{|A\cup B|}$.
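
Before proving the claim, it can be checked on a small universe. The sketch below enumerates all $5!=120$ permutations of $\{a,b,c,d,e\}$ and counts how often two sets receive the same min-hash. It is a sanity check and not a proof, and the function name is hypothetical.

```python
from fractions import Fraction
from itertools import permutations

universe = ["a", "b", "c", "d", "e"]


def collision_probability(A, B):
    # Exact probability over a uniformly random permutation of the universe.
    hits = total = 0
    for order in permutations(universe):
        pi = {element: rank for rank, element in enumerate(order, start=1)}
        hits += min(pi[a] for a in A) == min(pi[b] for b in B)
        total += 1
    return Fraction(hits, total)


A, B = {"a", "c", "e"}, {"b", "d", "e"}
print(collision_probability(A, B), Fraction(len(A & B), len(A | B)))
```

The two printed values agree because the element of $A\cup B$ with the smallest rank is uniformly distributed over $A\cup B$, and the hashes coincide exactly when that element lies in $A\cap B$.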

## Part 4
The Jaccard distance is defined as $d_J(A,B)=1-\text{JSIM}(A,B)$. Show that the Jaccard distance is a metric, i.e., show that:
1. $d_J(A,B)\ge 0$ for all sets $A$ and $B$
2. $d_J(A,B)=0$ if and only if $A=B$
3. $d_J(A,B)=d_J(B,A)$
4. $d_J(A,B)\le d_J(A,C)+d_J(C,B)$ for all sets $A,B,C$

Hint: for 4. use $P[h(A)=h(B)]=\dfrac{|A\cap B|}{|A\cup B|}$
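
The triangle inequality can be sanity-checked by brute force before attempting the proof. The sketch below tests every triple of non-empty subsets of a four-element universe; the universe and the function name are illustrative assumptions, and empty sets are excluded because $\text{JSIM}(\emptyset,\emptyset)$ is $0/0$.

```python
from fractions import Fraction
from itertools import combinations, product


def jaccard_distance(A, B):
    # d_J(A, B) = 1 - |A ∩ B| / |A ∪ B|, computed exactly.
    return 1 - Fraction(len(A & B), len(A | B))


universe = "abcd"
subsets = [
    frozenset(c)
    for r in range(1, len(universe) + 1)
    for c in combinations(universe, r)
]

# Collect every triple that would violate d(A,B) <= d(A,C) + d(C,B).
violations = [
    (A, B, C)
    for A, B, C in product(subsets, repeat=3)
    if jaccard_distance(A, B) > jaccard_distance(A, C) + jaccard_distance(C, B)
]
print(len(subsets), len(violations))
```

For the proof itself, the hint suggests rewriting $d_J(A,B)$ as $P[h(A)\ne h(B)]$ and observing that $h(A)\ne h(B)$ forces $h(A)\ne h(C)$ or $h(C)\ne h(B)$.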