## Review from the last session

- Universal family of hash functions
- Jaccard similarity and distance
- Connection between minhash signatures and Jaccard distance

## Python Imports

In [None]:
import numpy as np
import pandas as pd

rng = np.random.default_rng()

## Universal Hash Family

A family of hash functions with the following property:

> for any pair of distinct keys in the universe of keys (i.e. $U$), a hash
> function drawn uniformly at random from the family will cause the keys
> to collide with probability at most $1/m$, where $m$ is the
> table size.

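Let $p$ be a prime larger than every key in the universe, and let $a, b$ be random integers chosen
uniformly such that $1 \leq a \leq p-1$ and $0 \leq b \leq p-1$. Each such pair defines one
member of the family: $h_{a,b}(k) = ((ak + b) \bmod p) \bmod m$.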

In [None]:
import random
import math
import numpy as np
import pandas as pd

rng = np.random.default_rng()
class UHF:
    """A factory for producing a universal family of hash functions"""

    @staticmethod
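    def is_prime(k):
        """Trial-division primality test"""
        if k < 2:
            return False
        if k % 2 == 0:
            return k == 2
        # Include floor(sqrt(k)) itself so odd squares (9, 25, ...) are rejected
        for i in range(3, math.isqrt(k) + 1, 2):
            if k % i == 0:
                return False
        return True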

    def __init__(self, n):
        """Universe size is n"""
        self.n = n
        if n%2==0:
            m = n+1
        else:
            m = n+2
        while not(UHF.is_prime(m)):
            m = m+2
        self.p = m

    def make_hash(self, m):
        """Return a random hash function

        m: table size
        """
        a = random.randint(1,self.p-1)
        b = random.randint(0,self.p-1)
        return lambda k: ((a*k+b)%self.p)%m

## Test the Universal Hash Family (Exercise)

Set up and evaluate an experiment to verify that the family defined
above is indeed a universal family. Use a reasonable universe of keys,
say integers in `range(1000000)`, and a hash table of size $5000$.

- Fix two keys at random
- Generate a lot of hash functions and see how many of them collide
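
One possible shape for this experiment is sketched below. It is only a starting point, not a reference solution: it re-implements the $((ak+b) \bmod p) \bmod m$ family inline so that it runs on its own, and the helper names `next_prime` and `collision_rate`, the seed, and the number of trials are all illustrative assumptions.

```python
import math
import random

def next_prime(n):
    """Return the smallest prime strictly greater than n (trial division)."""
    def is_prime(k):
        return k >= 2 and all(k % i for i in range(2, math.isqrt(k) + 1))
    k = n + 1
    while not is_prime(k):
        k += 1
    return k

def collision_rate(universe, m, trials, seed=0):
    """Fraction of randomly drawn hash functions under which two fixed keys collide."""
    rand = random.Random(seed)              # seeded only for repeatability
    p = next_prime(universe)                # prime larger than every key
    x, y = rand.sample(range(universe), 2)  # two distinct keys, fixed for the experiment
    collisions = 0
    for _ in range(trials):
        a = rand.randint(1, p - 1)          # draw one member of the family
        b = rand.randint(0, p - 1)
        if ((a * x + b) % p) % m == ((a * y + b) % p) % m:
            collisions += 1
    return collisions / trials

m = 5000
rate = collision_rate(universe=1_000_000, m=m, trials=100_000)
print(f"observed collision rate: {rate:.5f}   1/m: {1 / m:.5f}")
```

Because collisions are rare events, the observed rate is noisy for a single pair of keys; repeating the experiment over several key pairs gives a steadier picture.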

## Data as Feature Matrices

- Each data item is generically referred to as a **document**

- In the simplest case, documents are *characterized* by the presence or
  absence of *binary* features (indicated by 0 and 1), i.e., the data is
  the **characteristic matrix** of the features.

In [None]:
characters = pd.DataFrame({'Hermione': [1,0,0,1,1,0,1],
                           'Harry': [0,0,1,1,1,0,1],
                           'Ron': [1,1,0,0,0,1,1],
                           'Severus': [1,0,0,0,0,1,1]})

## Jaccard Similarity

Jaccard *similarity* between two documents $D_i$ and $D_j$ equals
$ \text{SIM}(i,j) = \frac{|D_i \cap D_j|}{|D_i \cup D_j|}$

**Question:** What is the Jaccard similarity between `Harry` and
`Hermione`? Between `Severus` and `Ron`?
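
As a small sketch of the definition, the function below computes the Jaccard similarity of two 0/1 feature vectors. The name `jaccard_similarity` and the convention of returning 1.0 for two empty documents are assumptions made for illustration; the example deliberately uses a pair that is not part of the question.

```python
import numpy as np

def jaccard_similarity(u, v):
    """Jaccard similarity of two 0/1 feature vectors of equal length."""
    u = np.asarray(u, dtype=bool)
    v = np.asarray(v, dtype=bool)
    union = np.logical_or(u, v).sum()
    if union == 0:
        return 1.0  # assumed convention: two empty documents are identical
    return np.logical_and(u, v).sum() / union

hermione = [1, 0, 0, 1, 1, 0, 1]
ron = [1, 1, 0, 0, 0, 1, 1]
# They share features 0 and 6; together they cover 6 of the 7 features
print(jaccard_similarity(hermione, ron))
```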

## Compressing a document's representation: Minhash Signatures

Choose a *random* permutation $p$ of the features. A **minhash signature**
$\text{sign}_p(i)$ for document $D_i$ using permutation $p$
is obtained as follows:

- recall that the document is a feature vector in the original order of
  features
- **Reorder** the feature vector according to the permutation $p$
- Among all features present in the document, determine the one with the
  **smallest index in the new ordering**! This index is the signature
  for the document.

## Example (contd)

For our example above, consider document `Hermione` with feature vector
$[1,0,0,1,1,0,1]$ (the first column above).

Consider the *permutation* of the **feature indices** given by
$[3,1,0,6,2,5,4]$

This says that feature 3 will be numbered 0, feature 1 is still numbered 1,
feature 0 is numbered 2, and so on. Hence, the permutation will **reorder**
document `Hermione` as $[\underline{1},0,1,1,0,0,1]$; for example, the
underlined 1 at the new index **0** corresponds to the original feature
number 3.

If we apply the permutation across all characters (columns), we get:

In [None]:
perm = [3,1,0,6,2,5,4]
f_matrix = characters.to_numpy()
print(f_matrix)
perm_mat = np.take(f_matrix, perm, axis=0)
perm_mat

## Computing a signature

To determine the **signature** of document `Hermione` according to
permutation `perm` above, we just walk down the *permuted* column until
we find the **first row index** corresponding to a feature that
`Hermione` *contains*, viz. the 1 entry at index **0**.

Doing this for all columns, the signatures for the documents (in order)
according to `perm` are: 0, 0, 1, 2.

In [None]:
np.argmin(1 - perm_mat, axis=0)

## Exercise

- What are the signatures for the documents according to the
  permutation $[6,4,2,0,5,3,1]$?

- What signatures are possible, in theory, for `Harry`?

- What **common signatures** are possible for `Harry` and `Severus`?

## Using numpy to compute document signatures

We can do this in a variety of ways. For example:

- Start with a random permutation $p$ of the row indices. Apply the
  permutation $p$ to the matrix, then determine the signature by using
  `argmin` over the matrix with entries *flipped*.

In [None]:
p_mat = rng.permutation(f_matrix, axis=0)
np.argmin(1 - p_mat, axis=0)

## Alternative

- We need not actually permute the rows of the matrix at all -
    permuting rows can be very expensive with large data. Instead, we
    imagine that the rows are numbered from 1 through $m$, the number of
    features. A permutation can be thought of as supplying **weights**
    for the corresponding features. We multiply each feature by its
    weight, then find the smallest non-zero weighted entry in a column:
    that is the signature!

In [None]:
## Mask all 0 values
n_rows = f_matrix.shape[0]
masked = np.where(f_matrix == 0, n_rows+1, f_matrix)
perm_arr = np.array(perm) + 1  # make elements non-zero
permuted_tr = masked.T * perm_arr  # transpose to do elementwise multiplication
signs = np.min(permuted_tr.T, axis=0)
print(signs)

## Discussion

We used the same permutation `perm` = $[3,1,0,6,2,5,4]$ but seem to get
different signatures. Why is that? Discuss in your group why this is
happening. Understand the code below to frame the
discussion.

In [None]:
perm_inv = np.argsort(perm)
print(perm_inv)
np.argmin(1 - np.take(f_matrix, perm, axis=0), axis=0)

## Relationship between Jaccard similarity and signatures

Given a *random* permutation $p$, the Jaccard similarity,
$\text{SIM}(i,j)$ is related to signatures by the following property:

> $\text{SIM}(i,j) = \text{Pr}\{{\text{sign}}_p(i) = {\text{sign}}_p(j)\} $

Thus, the likelihood of two documents being deemed similar (via their
signatures) will be in accordance with their (mathematical) similarity!

**Proof:** In class!

## Minhashing

Based on the proof, we can see that if we were to calculate a large
number of **different signatures** (each corresponding to a different
permutation), we can approximate the true Jaccard similarity of two
documents as the fraction of permutations with matching signatures.
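
The approximation can be checked empirically. The sketch below draws many random permutations and reports the fraction whose signatures match for two documents; the helper names, the seed, and the number of permutations are illustrative assumptions, and both documents are assumed to contain at least one feature.

```python
import numpy as np

def minhash_signature(doc, order):
    """Index, in the new ordering, of the first feature present in doc.

    order[i] is the original feature placed at new index i.
    """
    return int(np.argmax(doc[order]))

def estimate_similarity(u, v, n_perms=5000, seed=0):
    """Fraction of random permutations under which u and v get the same signature."""
    rng = np.random.default_rng(seed)
    u, v = np.asarray(u), np.asarray(v)
    matches = 0
    for _ in range(n_perms):
        order = rng.permutation(len(u))
        matches += minhash_signature(u, order) == minhash_signature(v, order)
    return matches / n_perms

hermione = [1, 0, 0, 1, 1, 0, 1]
ron = [1, 1, 0, 0, 0, 1, 1]
estimate = estimate_similarity(hermione, ron)
print(estimate)  # should land near the exact Jaccard similarity, 2/6
```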

Random permutations are very expensive to compute: generating each one takes time linear in the
number of features, $n$, and we need many of them. When $n$ is very large, this is
infeasible.

So, we **approximate the signature** by using a **random hash function**
in place of a random permutation!

Use the **range** of a hash function applied to the feature indices as
a **proxy** for permuting the indices of the features.

## Minhash Signatures

Minhashing using just one small signature will unfortunately produce
both **false positives** and **false negatives** for similarity. We try
to mitigate this as follows:

- generate **many** small signatures using different hash functions

*Minhash similarity* between two documents is the expected proportion of
hash functions for which their small signatures agree!
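
A minimal sketch of this idea follows, assuming hash functions of the form $(ak+b) \bmod p$ applied to the feature indices. The function names, the prime `p=101`, the seed, and the number of hash functions are illustrative choices; the small matrix reproduces the `characters` example with features as rows.

```python
import numpy as np

def signature_matrix(mat, n_hashes, p=101, seed=0):
    """Minhash signatures from n_hashes random hash functions.

    mat: 0/1 array of shape (n_features, n_docs)
    p:   a prime larger than n_features (an illustrative choice)
    Returns an array of shape (n_hashes, n_docs).
    """
    rng = np.random.default_rng(seed)
    n_features = mat.shape[0]
    a = rng.integers(1, p, size=n_hashes)   # 1 <= a <= p-1
    b = rng.integers(0, p, size=n_hashes)   # 0 <= b <= p-1
    idx = np.arange(n_features)
    # hashes[i, k] = value of hash function i on feature index k
    hashes = (a[:, None] * idx[None, :] + b[:, None]) % p
    # Absent features get the sentinel p, which exceeds every hash value
    present = mat.T[None, :, :] == 1
    masked = np.where(present, hashes[:, None, :], p)
    return masked.min(axis=2)

def minhash_similarity(sig, i, j):
    """Proportion of hash functions whose signatures agree for documents i and j."""
    return float(np.mean(sig[:, i] == sig[:, j]))

# Columns: Hermione, Harry, Ron, Severus (rows are features)
f_matrix = np.array([[1, 0, 1, 1],
                     [0, 0, 1, 0],
                     [0, 1, 0, 0],
                     [1, 1, 0, 0],
                     [1, 1, 0, 0],
                     [0, 0, 1, 1],
                     [1, 1, 1, 1]])
sig = signature_matrix(f_matrix, n_hashes=200)
print(minhash_similarity(sig, 0, 2))  # estimate for Hermione vs. Ron
```

With so few features the estimate can deviate noticeably from the exact Jaccard similarity; the hash functions only approximate truly random permutations.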

## Jaccard Distance

The Jaccard distance between two documents $i$ and $j$ is given by
$d(i,j)=1 - \text{SIM}(i,j)$.

It is a distance *metric*:

- it equals 0 iff $D_i = D_j$
- it is symmetric: $d(i,j)=d(j,i)$
- it satisfies the *triangle inequality*:
  $d(i,j) \leq d(i,k) + d(k,j)$

## Exercise

- Compute signature matrices for a document using both random
  permutations and hash functions

- Use the matrices to see how accurately they approximate the true
  Jaccard similarity.

In [None]:
def sign(mat, h_fn, p):
    """
    Return array of signatures for all documents

    The shape of the matrix is (m, n) where m = number of documents
    and n = number of features. The hash function has domain equal to
    the column indices 0, 1, ..., n-1

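    mat (np.array): 0/1 matrix
    h_fn (int -> int): universal hash function whose range is 0, ..., p-1
    p (int): exclusive upper bound on values returned by h_fn
    """
    n_features = mat.shape[1]
    # Hash every feature (column) index once
    hashes = np.vectorize(h_fn)(np.arange(n_features))
    # Keep the hash value where a feature is present; absent features get
    # the sentinel p, which exceeds every real hash value
    s = np.where(mat == 1, hashes, p)
    return np.min(s, axis=1)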

tbl_size = 53
uhf = UHF(tbl_size)
h = uhf.make_hash(tbl_size)

signatures = sign(f_matrix.T, h, tbl_size)
signatures

## Locality Sensitive Hashing

Basic Ingredients:

- A *family* $\cal{F}$ of "hash functions" that produce *short signatures* for items. Ideally, we should have a **diversity** of functions.

- A **distance** measure, $d$

Then $\cal{F}$ is a

$(d_1, d_2, p_1, p_2)\text{-sensitive family}$

with $p_1 > p_2$ and $d_1 < d_2$, iff for
any function $h$ in the family and any pair of items $x$ and $y$:

> if $d(x, y) \leq d_1$, then $~\text{Pr}\{h(x) = h(y)\} \geq p_1$

and

> if $d(x, y) \geq d_2$, then $~\text{Pr}\{h(x) = h(y)\} \leq p_2$

## Implications

Thus, the probability of **distant** items getting the same signature is reasonably small, and the probability of items in the same *locality* getting the same signature is reasonably large.

> **False positive match probability** bounded above by $p_2$ and
> **False negative match probability** bounded above by $1 - p_1$.

A Minhash family with Jaccard distance is a $(d_1, d_2, (1-d_1), (1-d_2))$-sensitive family!

## Amplification: AND, OR constructions

Suppose that $s$ is the probability that two documents have the same signatures.

**AND Construction:** Consider $r$ different signatures, and we require **all** to match!

> Pairing Probability: $s^r$

**OR Construction:** Consider $r$ different signatures, and we require **at least one** to match!

> Pairing Probability: $1 - (1-s)^r$

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline
import numpy as np

def set_ax_properties(axis):
    axis.set_xlabel('Similarity')
    axis.set_ylabel('Pairing Probability')
    axis.spines['top'].set_visible(False)
    axis.spines['right'].set_visible(False)
    axis.yaxis.set_ticks_position('none')
    axis.yaxis.tick_left()
    axis.xaxis.set_ticks_position('none')
    axis.xaxis.tick_bottom()

def set_plt_properties(a_plot, title):
    a_plot.title(title)
    a_plot.legend(loc='best')
    a_plot.grid()

## AND Construction

In [None]:
s = np.arange(0.0,1.01,0.01)
ax = plt.gca()
set_ax_properties(ax)
plt.plot(s, s, label='1 row', color='r')
plt.plot(s, s**2, label='2 rows', color='g')
plt.plot(s, s**3, label='3 rows', color='b')
set_plt_properties(plt, "AND-construction curves")
plt.show()

## OR Construction

In [None]:
s = np.arange(0.0,1.01,0.01)
ax = plt.gca()
set_ax_properties(ax)
plt.plot(s, 1 - (1-s), label='1 row', color='r')
plt.plot(s, 1 - (1-s)**2, label='2 rows', color='g')
plt.plot(s, 1 - (1-s)**3, label='3 rows', color='b')
set_plt_properties(plt, "OR-construction curves")
plt.show()

## Side by Side ...

In [None]:
s = np.arange(0.0,1.01,0.01)
fig, axes = plt.subplots(1,2, sharey=True)
set_ax_properties(axes[0])
set_ax_properties(axes[1])
axes[0].plot(s, 1 - (1-s), label='1 row', color='r')
axes[0].plot(s, 1 - (1-s)**2, label='2 rows', color='g')
axes[0].plot(s, 1 - (1-s)**3, label='3 rows', color='b')
axes[0].set_title("OR-Construction")
axes[1].plot(s, s, label='1 row', color='r')
axes[1].plot(s, s**2, label='2 rows', color='g')
axes[1].plot(s, s**3, label='3 rows', color='b')
axes[1].set_title("AND-Construction")
handles, labels = axes[0].get_legend_handles_labels()
fig.legend(handles, labels, loc='upper right')
plt.show()

## AND-OR Construction

In [None]:
s = np.arange(0.0,1.01,0.01)
ax = plt.gca()
set_ax_properties(ax)
plt.plot(s, s, '-.', color='k')
plt.plot(s, 1 - (1-s)**2, label='2 bands with 1 row', color='r')
plt.plot(s, 1 - (1-s**2)**2, label='2 bands with 2 rows', color='g')
plt.plot(s, 1-(1-s**3)**2, label='2 bands with 3 rows', color='b')
plt.plot(s, 1-(1-s**4)**2, label='2 bands with 4 rows', color='y')
set_plt_properties(plt, "AND-OR Construction")
plt.show()

## OR-AND Construction

In [None]:
s = np.arange(0.0,1.01,0.01)
ax = plt.gca()
set_ax_properties(ax)  # label the axes as in the other plots
plt.plot(s, s, '-.', color='k')
plt.plot(s, (1 - (1-s))**2, label='2 bands with 1 row', color='r')
plt.plot(s, (1 - (1-s)**2)**2, label='2 bands with 2 rows', color='g')
plt.plot(s, (1-(1-s)**3)**2, label='2 bands with 3 rows', color='b')
plt.plot(s, (1-(1-s)**4)**2, label='2 bands with 4 rows', color='y')
set_plt_properties(plt, "OR-AND Construction")
plt.show()

##

In [None]:
n = 60
rows = [3,4,5,6,10]
bands = [60//r for r in rows]
colors = 'bgrcm'
s = np.arange(0.0,1.01,0.01)

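def and_or_threshold(r, b):
    """Returns (approximately) the s at which the AND-OR S-curve is 0.5

    The exact value is (1 - 2**(-1/b))**(1/r); this uses the
    approximation 1 - 2**(-1/b) ~ ln(2)/b.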

    Args:
        r (int): number of rows
        b (int): number of bands
    """
    return (np.log(2)/b)**(1.0/r)

def or_and_threshold(r, b):
    """Returns (approximately) the s at which the OR-AND S-curve is 0.5

    Uses the symmetry OR-AND(s) = 1 - AND-OR(1 - s).

    Args:
        r (int): number of rows
        b (int): number of bands
    """
    return 1 - and_or_threshold(r, b)

## S-Curves: AND-OR construction

In [None]:
params = zip(rows,bands,colors)
ax = plt.gca()
plt.title("AND-OR S-curves: 60 signatures")
z = 0.2
for r,b,c in params:
    plt.plot(s, 1-(1-s**r)**b, label='%d rows'%(r,),
             color=c)
    t = and_or_threshold(r,b)
    pair_prob = 1-(1-t**r)**b
    plt.plot([t,t],[0,pair_prob],'o--',color=c)
    ax.annotate('%3.2f'%t, xy=(t,pair_prob),
                xytext=(t+0.1,pair_prob+z),color=c,
                arrowprops=dict(arrowstyle="->",
                                connectionstyle="arc3"))
    z -= 0.03
ax.set_xlabel('Similarity')
ax.set_ylabel('Pairing Probability')
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
ax.yaxis.set_ticks_position('none')
ax.yaxis.tick_left()
ax.xaxis.set_ticks_position('none')
ax.xaxis.tick_bottom()
plt.legend(loc='best')
plt.grid()
plt.show()

## S-Curves: OR-AND Construction

In [None]:
params = zip(rows,bands,colors)
ax = plt.gca()
plt.title("OR-AND S-curves: 60 signatures")
z = 0.2
for r,b,c in params:
    plt.plot(s, (1-(1-s)**r)**b, label='%d rows'%(r,),
             color=c)
    t = or_and_threshold(r,b)
    pair_prob = (1-(1-t)**r)**b
    plt.plot([t,t],[0,pair_prob],'o--',color=c)
    ax.annotate('%3.2f'%t, xy=(t,pair_prob),
                xytext=(t+0.1,pair_prob+z),color=c,
                arrowprops=dict(arrowstyle="->",
                                connectionstyle="arc3"))
    z -= 0.03
ax.set_xlabel('Similarity')
ax.set_ylabel('Pairing Probability')
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
ax.yaxis.set_ticks_position('none')
ax.yaxis.tick_left()
ax.xaxis.set_ticks_position('none')
ax.xaxis.tick_bottom()
plt.legend(loc='best')
plt.grid()
plt.show()

## False Positives and Negatives

In [None]:
plt.figure()
plt.title("False positives vs. False negatives")
plt.xlabel('Similarity')
plt.ylabel('Pairing Probability')
r = 4
b = 15
plt.plot(s, 1-(1-s**r)**b, label='%d rows, %d bands'%(r,b))
t = and_or_threshold(r,b)
pair_prob = 1-(1-t**r)**b
plt.axvline(t,0,1)
s1 = np.arange(t,1.0,0.01)
y1 = 1-(1-s1**r)**b
plt.fill_between(s1, 1.0, y1, facecolor='green')
plt.annotate('False negative', xy=(t,0.9), xytext=(t-0.25,0.8),
             color='green',
            arrowprops=dict(arrowstyle="->",connectionstyle="arc3"))

s2 = np.arange(0.0,t,0.01)
y2 = 1-(1-s2**r)**b
plt.fill_between(s2, 0.0, y2, facecolor='red')
plt.annotate('False positive', xy=(t,0.3), xytext=(t+0.2,0.3),
             color='red',
            arrowprops=dict(arrowstyle="->",connectionstyle="arc3"))
plt.show()

In [None]:
def and_or_curve(s, b, r):
    # AND first (s^r), then OR
    return 1 - (1 - s**r)**b

def or_and_curve(s, b, r):
    # OR first (1 - (1-s)^b), then AND
    # Note: b is now the "width" of the OR, r is the "count" of ANDs
    return (1 - (1 - s)**b)**r


s = np.linspace(0, 1, 100)
b, r = 5, 5  # Using symmetric parameters for fair comparison

plt.figure(figsize=(10, 6))
plt.plot(s, and_or_curve(s, b, r), label=f'AND-OR (b={b}, r={r})', linewidth=3, color='blue')
plt.plot(s, or_and_curve(s, b, r), label=f'OR-AND (b={b}, r={r})', linewidth=3, color='red', linestyle='--')

plt.title("Comparison of Cascading Constructions", fontsize=16)
plt.xlabel("Jaccard Similarity ($s$)", fontsize=14)
plt.ylabel("Probability of Candidate", fontsize=14)
plt.legend(fontsize=12)
plt.grid(True, alpha=0.3)
plt.axvline(0.5, color='gray', linestyle=':', alpha=0.5)
plt.show()

## Minhashing with Multiple Signatures & Amplification!

Consider Jaccard similarity for now:

- Generate 20 different signatures for the documents: signatures matrix has signatures as rows and documents as columns

- Arrange signatures in 5 **bands**, with each band containing 4 **rows**.

- **Candidate Pair:** Any pair of documents whose signatures agree in **every**
    row in **some** band.
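
For this particular layout, the probability that a pair becomes a candidate follows from the AND-OR construction: $1 - (1 - s^4)^5$. The short sketch below evaluates it at a few similarity values; the function name `candidate_probability` is an illustrative assumption.

```python
def candidate_probability(s, rows=4, bands=5):
    """Probability that two documents with similarity s agree on all rows of some band."""
    return 1 - (1 - s**rows) ** bands

for s in (0.2, 0.5, 0.8):
    print(f"s = {s:.1f}  ->  {candidate_probability(s):.3f}")
```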

## Find Candidate Pairs

1. Hash documents to *buckets* such that similar documents are likely to
hash to the same bucket!

2. **Actually compare features** of candidate pairs

Benefits: Instead of comparing all $O(n^2)$ pairs, we hash each of the $n$ documents once
and compare only the candidate pairs to find *approximately* similar documents!
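
The bucketing step can be sketched as follows, assuming a signature matrix with signatures as rows and documents as columns. The function name `candidate_pairs` and the tiny signature matrix are made up for illustration; a Python dictionary keyed by each band's column stands in for the hash table.

```python
from collections import defaultdict
from itertools import combinations

import numpy as np

def candidate_pairs(sig, bands, rows):
    """Pairs of documents whose signatures agree in every row of some band.

    sig: array of shape (bands * rows, n_docs)
    """
    if sig.shape[0] != bands * rows:
        raise ValueError("signature matrix must have bands * rows rows")
    pairs = set()
    for band in range(bands):
        block = sig[band * rows:(band + 1) * rows, :]
        buckets = defaultdict(list)
        for doc in range(sig.shape[1]):
            # Documents with identical values in this band share a bucket
            buckets[tuple(block[:, doc])].append(doc)
        for docs in buckets.values():
            pairs.update(combinations(docs, 2))
    return pairs

# Toy signature matrix: 2 bands of 2 rows, 3 documents
sig = np.array([[1, 1, 5],
                [2, 2, 6],
                [3, 9, 3],
                [4, 8, 4]])
print(sorted(candidate_pairs(sig, bands=2, rows=2)))
```

Documents 0 and 1 agree on the first band and documents 0 and 2 on the second, so both pairs are reported, while the pair (1, 2) is never compared.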

## General Approach

- Choose distance measure for items

- Create a matrix of signatures using an **appropriate LSH family**

- Construct candidate pairs applying the LSH banding technique

- Choose a threshold fraction $t$ for *similarity* of items in order
  for them to be regarded as a *true pair*.

- Check whether the signatures of each candidate pair agree in at least a fraction $t$ of positions; optionally, for pairs whose signatures are sufficiently similar, do a more fine-grained check on the items themselves.

## Cosine Distance

Distance equals the angle (between 0 and $\pi$) between points in Euclidean space

- Can be computed with dot products of the vectors representing the points

- Distance is a metric

> Note: `scipy` defines the cosine distance as **one minus the cosine** of the angle between points. Hence the `scipy` cosine "distance" is not a distance metric (it can violate the triangle inequality)! However, it is convenient because no `acos` (inverse cosine) computations are needed.

## LSH Family for Cosine Distance

- Data is a collection of $n$-dimensional points (treated as vectors)

- A "hash function" in the family corresponds to a **random vector**; the hash function applied to a point is the **sign of the dot product** of the point with the random vector.

> This yields a $(d_1, d_2, (1-\frac{d_1}{\pi}), (1-\frac{d_2}{\pi}))$-sensitive hash family for any distances $d_1 < d_2$.

**Sketch:** A vector with coordinates $\pm 1$.

Sketches are reasonable approximations to the full LSH family but are only plentiful in very high dimensions!
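
As a rough sketch of the random-vector family, the code below draws Gaussian random vectors (one assumed way to get uniformly random directions), hashes two points to the sign of each dot product, and converts the fraction of agreeing signs back into an angle estimate via $\text{Pr}\{h(x) = h(y)\} = 1 - \theta/\pi$. The function name, seed, and number of hash functions are illustrative.

```python
import numpy as np

def hyperplane_signatures(points, n_hashes, seed=0):
    """Sign of the dot product of each point with n_hashes random vectors.

    points: array of shape (n_points, n_dims)
    Returns an array of shape (n_hashes, n_points) with entries +1 or -1.
    """
    rng = np.random.default_rng(seed)
    normals = rng.standard_normal((n_hashes, points.shape[1]))
    return np.sign(normals @ points.T)

points = np.array([[1.0, 0.0],
                   [0.0, 1.0]])          # true angle between them: pi/2
sig = hyperplane_signatures(points, n_hashes=5000)
agree = np.mean(sig[:, 0] == sig[:, 1])  # estimates 1 - angle/pi
estimated_angle = np.pi * (1 - agree)
print(estimated_angle, np.pi / 2)
```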


## Hamming Distance

Distance between 0/1-valued (binary) vectors of length $n$

- equals the number of bit positions that differ, or can also be defined as the *fraction* of bit positions that differ; i.e., it is integer-valued between 0 and $n$ (when un-normalized) or $\leq 1$ (when normalized by $n$).

- it is a metric distance

## LSH Family for Hamming Distance

- Data is a collection of $n$ dimensional **binary** vectors with positions indexed from 0 through $(n-1)$ from right to left.

- A "hash function" in the family corresponds to a *projection*: the projection $h_i$ for $0 \leq i \leq n-1$ extracts the $i^{th}$ (least significant) bit of a vector.

If Hamming distance is *normalized*, this yields a $(d_1, d_2, (1-d_1), (1-d_2))$-sensitive family. Otherwise, it yields a $(d_1, d_2, (1- \frac{d_1}{n}), (1-\frac{d_2}{n}))$-sensitive family.

> Unless the vectors are high-dimensional, this family suffers from a paucity of enough functions to amplify the probabilities!
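
The projection family is small enough to enumerate, which makes the sensitivity claim easy to check on an example. In the sketch below the helper names and the two vectors are made up, and positions are indexed left to right as NumPy does (an assumption that differs from the right-to-left convention above but does not affect the counts).

```python
import numpy as np

def hamming_distance(x, y):
    """Number of bit positions in which x and y differ (un-normalized)."""
    return int(np.sum(x != y))

def project(i):
    """Hash function h_i: extract bit i of a vector."""
    return lambda v: v[i]

x = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y = np.array([1, 1, 1, 0, 0, 0, 1, 1])
n = len(x)

family = [project(i) for i in range(n)]          # all n projections
agree = sum(h(x) == h(y) for h in family) / n    # Pr{h(x) = h(y)} for a uniform h
print(agree, 1 - hamming_distance(x, y) / n)
```

Both printed values coincide: a uniformly chosen projection agrees on the two vectors exactly when it picks one of the positions where they do not differ.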

## Euclidean Distances

* $L_1$ distance (also called **Manhattan distance**): *sum* of the absolute value difference in each of the dimensions:

$L_1(p,q) = \sum_{i=0}^{n-1} |p_i - q_i|$

* $L_2$ distance is the usual square root of the sum of squares of differences along each dimension:

$L_2(p, q) = \sqrt[2]{\sum_{i=0}^{n-1} |p_i - q_i|^2}$

* $L_k$ distance generalizes $L_2$:

$L_k(p, q) = \sqrt[k]{\sum_{i=0}^{n-1} |p_i - q_i|^k}$

* $L_{\infty}$ distance is the maximum absolute value of difference along any one dimension:

$L_{\infty}(p, q) = \max_{i} |p_i - q_i|$
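
The four definitions can be folded into one small function, sketched below. The name `lk_distance` is an illustrative assumption; passing `np.inf` for `k` selects the maximum-difference case.

```python
import numpy as np

def lk_distance(p, q, k):
    """L_k distance between points p and q; k = np.inf gives the maximum difference."""
    diff = np.abs(np.asarray(p, dtype=float) - np.asarray(q, dtype=float))
    if np.isinf(k):
        return float(diff.max())
    return float(np.sum(diff ** k) ** (1.0 / k))

p, q = (0, 0), (3, 4)
for k in (1, 2, 3, np.inf):
    print(k, lk_distance(p, q, k))
```

As $k$ grows, the largest coordinate difference dominates the sum, which is why the $L_k$ values shrink toward the $L_{\infty}$ value.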

## LSH family for Euclidean $L_2$ distance

- Data is a collection of $2$-dimensional points

- A "hash function" in the family corresponds to a **random line** that is divided into numbered segments (**buckets**) of length $a$

- The hash application is the bucket number reached when a **perpendicular** is dropped from the point onto the line!

> This yields a $(\frac{a}{2}, 2a, \frac{1}{2}, \frac{1}{3})$-sensitive hash family.
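
A minimal sketch of this family for points in the plane is given below. The random offset of the bucket boundaries along the line is an assumption added here so that the collision probability depends only on the projected distance; the function names, seed, and number of hash functions are likewise illustrative.

```python
import numpy as np

def make_line_hash(a, rng):
    """Hash function for a random line cut into buckets of length a."""
    theta = rng.uniform(0.0, np.pi)                # random direction of the line
    direction = np.array([np.cos(theta), np.sin(theta)])
    offset = rng.uniform(0.0, a)                   # assumed: random shift of bucket boundaries
    def h(point):
        # Dropping a perpendicular onto the line is a dot product with its direction
        return int(np.floor((np.dot(point, direction) + offset) / a))
    return h

def collision_rate(x, y, a, n_hashes=4000, seed=0):
    """Fraction of random line hashes that put x and y in the same bucket."""
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(n_hashes):
        h = make_line_hash(a, rng)
        hits += h(x) == h(y)
    return hits / n_hashes

a = 1.0
origin = np.array([0.0, 0.0])
near_rate = collision_rate(origin, np.array([a / 2, 0.0]), a)  # distance a/2
far_rate = collision_rate(origin, np.array([2 * a, 0.0]), a)   # distance 2a
print(near_rate, far_rate)
```

The two rates can be compared with the bounds $\frac{1}{2}$ and $\frac{1}{3}$ in the sensitivity statement above.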