# 2. Approximate Retrieval

Quickly find nearest neighbors in (very) high dimensions.

Examples:
 * Image search and image completion
 * Song search

## Distance functions

 * $d : S \times S \rightarrow \mathbb{R}$ is a **distance function** iff
     - $\forall s, t \in S : d(s, t) \ge 0$
     - $\forall s \in S : d(s, s) = 0$
     - $\forall s, t \in S : d(s, t) = d(t, s)$ (symmetry)
     - $\forall s, t, r \in S: d(s, t) + d(t, r) \ge d(s, r)$ (triangle inequality)
     - if additionally $\forall s, t \in S: d(s, t) = 0 \implies s = t$, then $d$ satisfies a stronger set of conditions and is called a **metric**
 * We make use of this by representing **objects as vectors**
     - images become feature vectors (see Computer Vision course)
     - documents become bag-of-words or tf-idf representations
 * Many types of distances
     - $\ell_p$, such as the Euclidean distance ($\ell_2$)
     - cosine distance (used a lot in text search)
     - edit distance (expensive)
     - Jaccard distance (for sets)
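
As a rough illustration of the distances listed above, here is a minimal pure-Python sketch. The function names are made up for this example, and the cosine distance follows one common convention (1 minus the cosine similarity); the course may define it via the angle instead.

```python
import math

def lp_distance(x, y, p=2):
    """l_p distance between two equal-length vectors (p=2 is Euclidean)."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)

def cosine_distance(x, y):
    """1 - cos(angle between x and y); assumes neither vector is all zeros."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return 1.0 - dot / (norm_x * norm_y)

def edit_distance(s, t):
    """Levenshtein distance via dynamic programming, O(len(s) * len(t)) time."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            curr.append(min(prev[j] + 1,                # delete cs
                            curr[j - 1] + 1,            # insert ct
                            prev[j - 1] + (cs != ct)))  # substitute if different
        prev = curr
    return prev[-1]

u, v = [1.0, 0.0], [0.0, 1.0]
print("l1:     %.2f" % lp_distance(u, v, p=1))
print("l2:     %.2f" % lp_distance(u, v, p=2))
print("cosine: %.2f" % cosine_distance(u, v))
print("edit:   %d" % edit_distance("kitten", "sitting"))
```

The quadratic table in `edit_distance` is why the list above calls it expensive: every comparison costs time proportional to the product of the two string lengths.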

## Curse of dimensionality
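In very high dimensions, distances concentrate: the minimum distance between any two points gets very close to the maximum distance between any two points. For any fixed $\epsilon > 0$ (under mild assumptions on how the data is distributed):

$\lim_{D \rightarrow \infty} P[d_{max} \le (1 + \epsilon)d_{min}] = 1$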
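
The effect can be observed empirically. The sketch below is only an illustration under assumptions not stated in the notes: points are drawn uniformly from the unit cube, distances are Euclidean, and $d_{min}$ and $d_{max}$ are measured from one random query point rather than over all pairs. The name `max_min_ratio` and the sample sizes are arbitrary.

```python
import math
import random

def max_min_ratio(dim, n_points=200, seed=0):
    """Ratio d_max / d_min of Euclidean distances from a random query to random points."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    dists = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(dim)]
        dists.append(math.sqrt(sum((a - b) ** 2 for a, b in zip(query, point))))
    return max(dists) / min(dists)

for dim in (2, 10, 100, 1000):
    print("D = %4d: d_max / d_min = %.2f" % (dim, max_min_ratio(dim)))
```

As the dimension grows, the printed ratio should shrink towards 1, which is the statement $d_{max} \le (1 + \epsilon) d_{min}$ for ever smaller $\epsilon$.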

## Approximate retrieval
### Input
A data set $S$ and a distance function $d$.

### Problem 1: Nearest neighbor
Given a query $q$, find $s^* = \arg\min_{s \in S} d(q, s)$
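
For reference, the exact solution is a linear scan over $S$. The sketch below uses hypothetical names (`nearest_neighbor`, `euclidean`) and a toy data set; it costs one distance evaluation per element of $S$, which is the baseline that approximate methods try to beat.

```python
import math

def nearest_neighbor(q, S, d):
    """Exact nearest neighbor by linear scan: one distance evaluation per element of S."""
    return min(S, key=lambda s: d(q, s))

def euclidean(x, y):
    """Euclidean (l_2) distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

S = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
print(nearest_neighbor((0.9, 1.2), S, euclidean))
```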

### Problem 2: Near-duplicate detection
Find all pairs $s, s' \in S$ with $d(s, s') \le \epsilon$.

* Represent each document as its set of **shingles** (contiguous runs of $k$ characters or words) and compare these sets with the **Jaccard distance**.
* Shingles can even be hashed to save space.
* Jaccard similarity: $JSim(A, B) = \frac{|A \cap B|}{|A \cup B|}$
* Jaccard distance: $d(A, B) = 1 - JSim(A, B)$
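
A minimal sketch of shingling, assuming character-level shingles of length $k = 3$ (the notes do not fix either choice). `zlib.crc32` stands in for whatever hash function is used in practice, and the function names are illustrative.

```python
import zlib

def shingles(text, k=3):
    """Set of all length-k character substrings (k-shingles) of text."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}

def hashed_shingles(text, k=3):
    """Same set, but each shingle is replaced by its 32-bit CRC to save space."""
    return {zlib.crc32(s.encode("utf-8")) for s in shingles(text, k)}

a = shingles("the quick brown fox")
b = shingles("the quick brown dog")
print("Shared shingles: %d of %d" % (len(a & b), len(a | b)))
print("Jaccard similarity: %.2f" % (float(len(a & b)) / len(a | b)))
```

The two strings differ only in the last word, so they share most of their shingles and their Jaccard distance is small.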

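```python
def jaccard_sim(a, b):
    """Jaccard similarity |a & b| / |a | b| of two sets (undefined if both are empty)."""
    return float(len(a & b)) / len(a | b)

def jaccard_distance(a, b):
    """Jaccard distance, defined as 1 - similarity."""
    return 1 - jaccard_sim(a, b)

x = {1, 5, 6, 10}
y = {2, 5, 6, 20}
print("Similarity: %.2f" % jaccard_sim(x, y))
print("Distance:   %.2f" % jaccard_distance(x, y))
```

```
Similarity: 0.33
Distance:   0.67
```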


* Scale remains problematic: a double loop over all $N$ elements needs $O(N^2)$ distance computations.
* Hashing works well for exact duplicates. Can it also work for near-duplicates?
* **Yes**, we have **locality sensitive hashing** (LSH)
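
To see why ordinary hashing only solves the exact case, consider this small sketch (the function name and the sample documents are made up). It groups identical documents with a single pass over a hash table, but a document that differs in one character lands in a different bucket.

```python
from collections import defaultdict

def exact_duplicate_groups(docs):
    """Group indices of identical documents in one pass, O(N) expected time."""
    buckets = defaultdict(list)
    for i, doc in enumerate(docs):
        buckets[doc].append(i)   # dict lookup hashes doc; equal docs share a bucket
    return [ids for ids in buckets.values() if len(ids) > 1]

docs = ["a rose is a rose", "a rose is a rose", "a rose is a nose"]
print(exact_duplicate_groups(docs))
```

Documents 0 and 1 are reported as duplicates, while the near-duplicate at index 2 is missed. LSH replaces the ordinary hash with one under which similar items are likely to collide.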

## Min-hashing
 * Reorder the rows of the shingle matrix with a random permutation $\pi$
 * $\operatorname{hash}(C) =$ the minimum row number in which the permuted column contains a one ($C$ represents a column, i.e. a document in shingle form)
 * $h(C) = h_\pi(C) = \underset{i:C(i)=1}{\min}\pi(i)$
 * It turns out that the probability of two documents sharing a hash is equal to their Jaccard similarity: $P[h(C_1) = h(C_2)] = JSim(C_1, C_2)$ (trivial but interesting proof; see slides)
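
The collision probability can be checked empirically. The sketch below is an illustration under assumptions of its own: columns are stored as sets of row indices, permutations are materialized explicitly with `random.shuffle` (practical implementations usually substitute cheap hash functions), and all names and sizes are arbitrary.

```python
import random

def minhash_signature(column, permutations):
    """h_pi(C) = min over rows i with C(i) = 1 of pi(i), one value per permutation."""
    return [min(pi[i] for i in column) for pi in permutations]

def estimate_similarity(sig_a, sig_b):
    """Fraction of permutations on which the two min-hashes agree."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / float(len(sig_a))

n_rows = 100                       # number of distinct shingles (rows); illustrative
rng = random.Random(0)
permutations = []
for _ in range(1000):
    pi = list(range(n_rows))
    rng.shuffle(pi)                # pi[i] is the new position of row i
    permutations.append(pi)

c1 = set(range(0, 60))             # rows where document 1 has a one
c2 = set(range(30, 90))            # rows where document 2 has a one
exact = len(c1 & c2) / float(len(c1 | c2))
approx = estimate_similarity(minhash_signature(c1, permutations),
                             minhash_signature(c2, permutations))
print("Exact Jaccard similarity: %.3f" % exact)
print("Min-hash estimate:        %.3f" % approx)
```

The estimate should land close to the exact value of $1/3$. The intuition behind the equality: among the rows where at least one of the two columns has a one, each is equally likely to come first under $\pi$, and the two hashes agree exactly when that first row lies in the intersection.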
 * An alternative is sim-hashing (se