# Report - Finding similar items 

In [1]:
s1 = "abcdab"
s2 = "Får får lammdab"

# K-shingles

> A class *Shingling* that constructs *k*-shingles of a given length *k* (e.g., 10) from a given document, computes a hash value for each unique shingle and represents the document in the form of an ordered set of its hashed *k*-shingles.

* Between two documents
* Hash shingles
    * What is a good hash function? It is covered in the book.

In [2]:
def shingle(s, k=5):
    """
    Return the set of unique k-shingles (substrings of length k) of a string.
    Shingles are lower-cased so that matching is case-insensitive.
    """
    return {s[i:i + k].lower() for i in range(len(s) - k + 1)}

Using Python's native `hash` function, reduced modulo `hash_boundary` to limit the range of possible outcomes.
This function returns the set of all unique hashed shingles. I think it is correct, according to [Stanford's slides](http://www.mmds.org/mmds/v2.1/ch03-lsh.pdf), page 22:

> Each **unique shingle** is a dimension
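
One caveat worth noting: Python's built-in `hash` is salted per process for strings (unless `PYTHONHASHSEED` is set), so the hashed shingles change between runs. Below is a minimal sketch of a reproducible variant. It assumes `zlib.crc32` as the hash function, which is my own choice for illustration and not something from the lecture; `stable_hashed_shingles` is a hypothetical name.

```python
import zlib

def stable_hashed_shingles(document, k=5):
    """Hash each unique lower-cased k-shingle to a 32-bit integer with CRC32."""
    shingles = {document[i:i + k].lower() for i in range(len(document) - k + 1)}
    # crc32 works on bytes and returns an unsigned 32-bit value in Python 3
    return {zlib.crc32(s.encode("utf-8")) for s in shingles}

print(len(stable_hashed_shingles("abcdab", k=3)))  # 4 unique shingles
```

The resulting sets can be compared across sessions, which the built-in `hash` does not guarantee.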

In [4]:
def get_hashed_shingles(document, k=5, hash_boundary=2**32):
    shingles = shingle(document, k)
    # hash each unique shingle and fold it into the range [0, hash_boundary)
    return {hash(s) % hash_boundary for s in shingles}

In [5]:
# SANDBOX - setting test variables and testing functions
k = 3

print(shingle(s1, k=k))
h1 = get_hashed_shingles(s1, k=k)
print(h1)

print(shingle(s2, k=k))
h2 = get_hashed_shingles(s2, k=k)
print(h2)

{'bcd', 'cda', 'dab', 'abc'}
{603619785, 286046921, 1196472809, 3673230799}
{'mmd', 'r l', 'r f', 'får', ' la', 'mda', 'dab', 'år ', 'amm', ' få', 'lam'}
{1177382594, 4098313827, 3878788484, 2320198341, 1986240005, 4016560105, 1196472809, 1306261067, 1262776890, 4207179418, 2117958527}


# CompareSets

> A class *CompareSets* computes the Jaccard similarity of two sets of integers – two sets of hashed shingles.

From the lecture:

> Column similarity is the Jaccard similarity of the corresponding sets (rows with value 1) and can be computed using bitwise AND (for intersection) and bitwise OR (for union)

Example:

If $C_1$ and $C_2$ are two columns of the characteristic matrix, written as bit strings, we can use bitwise OR and AND and count the 1 bits in the results to find their Jaccard similarity:

$$ 
C_1 = 011010, \quad C_2 = 101011 \\ 
|C_1 \cup C_2| = |C_1 \lor C_2| = 5 \\
|C_1 \cap C_2| = |C_1 \land C_2| = 2 \\ \\
\text{Jaccard similarity} = \frac{|C_1 \cap C_2|}{|C_1 \cup C_2|} = \frac{2}{5}
$$

Python has native support for comparing sets in such a manner.
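
The bitwise computation from the example can also be checked directly with Python integers. This is a small sketch; `popcount` is a helper name I introduce here.

```python
c1 = int("011010", 2)
c2 = int("101011", 2)

def popcount(x):
    """Count the 1 bits of a non-negative integer."""
    return bin(x).count("1")

intersection_size = popcount(c1 & c2)  # bitwise AND keeps rows set in both columns
union_size = popcount(c1 | c2)         # bitwise OR keeps rows set in either column
print(intersection_size, union_size, intersection_size / union_size)  # 2 5 0.4
```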

In [6]:
# sandbox
import pandas as pd
import numpy as np

intersection = h1 & h2
union = h1 | h2

print(intersection)
print(union)
len(intersection) / len(union)


{1196472809}
{1177382594, 4098313827, 3878788484, 2320198341, 1986240005, 603619785, 286046921, 1196472809, 4016560105, 1306261067, 3673230799, 4207179418, 1262776890, 2117958527}


0.07142857142857142

In [7]:
def similarity(A, B):
    """
    Jaccard similarity between two sets of hashed shingles.
    """
    return len(A & B) / len(A | B)

def distance(A, B):
    """
    Jaccard distance.
    """
    return 1-similarity(A, B)

In [8]:
# SANDBOX - setting test data
d3 = "Jag heter Linus Östlund och är en kille"
d4 = "Heter jag Linus Östlund och varför är jag en kille?"
d5 = "En kille heter Linus Östlund och är jag en kille?"

h3 = get_hashed_shingles(d3)
h4 = get_hashed_shingles(d4)
h5 = get_hashed_shingles(d5)

similarity(h3, h4)



0.3793103448275862

# MinHashing

> A class *MinHashing* that builds a minHash signature (in the form of a vector or a set) of a given length *n* from a given set of integers (a set of hashed shingles).


In [10]:
import numpy as np

def get_characteristic_matrix(documents_of_hashed_shingles):
    """
    Produce the characteristic matrix.
    INPUT: documents_of_hashed_shingles -- a list of sets, where each element is the set of one document's hashed shingles.
    NOTE The rows follow the sorted union of all shingles.
    Returns a (shingles x documents) DataFrame of 0s and 1s.
    TODO is it really necessary to compute this or should it be avoided? Might be very large and poor time complexity.
    TODO the input is not documents. Would it be better to just input the documents?
    """
    # Get the union of all shingles, which are the rows in the characteristic matrix
    union_of_shingles = get_union_of_shingles(documents_of_hashed_shingles)
    # Label the documents, which are the columns in the characteristic matrix
    document_labels = ['D'+str(i) for i in range(len(documents_of_hashed_shingles))]
    # Initiate an empty characteristic matrix and then populate it
    cm = np.zeros((len(union_of_shingles), len(documents_of_hashed_shingles)))
    for i, shingle in enumerate(union_of_shingles):
        for j, hashed_shingles in enumerate(documents_of_hashed_shingles):
            if shingle in hashed_shingles:
                cm[i][j] = 1
    # return a pandas dataframe for easier viewing
    return pd.DataFrame(cm, columns=document_labels, index=union_of_shingles, dtype=int) # can be made boolean with dtype=bool

def get_union_of_shingles(shingles):
    """
    Get the union of all shingles from a list of shingles.
    Output is a sorted list of unique shingles.
    """
    union = set()
    for s in shingles:
        union |= s
    return sorted(union)

From a list of hashed shingles `[shingles of document 1, shingles of document 2, ..]` we generate the *characteristic matrix*:

In [11]:
import pandas as pd

hashed_shingles = [h3, h4, h5]
df = get_characteristic_matrix(hashed_shingles)
df

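|  | D0 | D1 | D2 |
|---|---|---|---|
| 40852693 | 0 | 1 | 1 |
| 209715022 | 0 | 1 | 1 |
| 301143686 | 0 | 1 | 0 |
| 460860852 | 0 | 1 | 0 |
| 594259817 | 1 | 1 | 1 |
| ... | ... | ... | ... |
| 3988809903 | 0 | 0 | 1 |
| 4047504577 | 0 | 1 | 0 |
| 4128545898 | 1 | 1 | 1 |
| 4133237477 | 0 | 1 | 1 |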


# Generating hash functions
From [Stanford's slides, p. 39](http://www.mmds.org/mmds/v2.1/ch03-lsh.pdf):
> How to pick a random hash function $h(x)$?
> * $ h(x) = ( (ax + b) \mod p ) \mod N $, where $a$ and $b$ are random integers and $p$ is a prime number larger than $N$.

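MinHash gives the signature of a set. The signature is then used as a proxy for the characteristic matrix, and allows a computationally cheap estimate of the Jaccard similarity: the probability that two sets share the same minimum hash value equals their Jaccard similarity, so the fraction of agreeing signature rows approaches it as the number of hash functions grows (the law of large numbers kicks in).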
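
The idea that the signature acts as a proxy can be illustrated with a small sketch: the fraction of signature positions on which two sets agree estimates their Jaccard similarity. The names `minhash_signature` and `estimate_similarity`, the prime `2**61 - 1`, the seed and the random test sets are all my own assumptions for illustration.

```python
import random

P = 2**61 - 1  # a Mersenne prime, assumed larger than any hashed shingle

def minhash_signature(hashed_shingles, coefficients):
    """One minimum per hash function h(x) = (a*x + b) mod P."""
    return [min((a * x + b) % P for x in hashed_shingles) for a, b in coefficients]

def estimate_similarity(sig_a, sig_b):
    """Fraction of signature positions where the two signatures agree."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)

rng = random.Random(0)
coefficients = [(rng.randrange(1, P), rng.randrange(0, P)) for _ in range(200)]

# two sets of 100 random integers that share exactly 50 elements
universe = rng.sample(range(10**6), 150)
set_a = set(universe[:100])
set_b = set(universe[50:])

exact = len(set_a & set_b) / len(set_a | set_b)  # 50 / 150
estimate = estimate_similarity(minhash_signature(set_a, coefficients),
                               minhash_signature(set_b, coefficients))
print(exact, estimate)
```

With 200 hash functions the estimate should land close to the exact value of 1/3, and using more hash functions tightens it further.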

We need *k* independent hash functions (**k=100 recommended!**). According to the lecture, a hash function can look like this:

$$
h(x) \equiv (ax + b) \mod c
$$

> Pseudocode is available in the book (and in the lecture, 01:00:00 in)!

You may also use permutations to achieve the same result, but the lecturer recommends hash functions. Not sure which approach is easier to understand.
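
A sketch of how a family of *k* such hash functions could be generated. It assumes a single fixed prime shared by all functions and draws `a` from `1..p-1` so that no function is constant. The prime `2**61 - 1`, the seed and the helper name `make_hash_functions` are assumptions on my part, not taken from the lecture.

```python
import random

P = 2**61 - 1  # Mersenne prime, assumed larger than every value to be hashed

def make_hash_functions(k, n_buckets, seed=None):
    """Return k functions h(x) = ((a*x + b) mod P) mod n_buckets."""
    rng = random.Random(seed)
    functions = []
    for _ in range(k):
        a = rng.randrange(1, P)  # a != 0, otherwise h would be constant
        b = rng.randrange(0, P)
        # bind a and b as defaults so each lambda keeps its own coefficients
        functions.append(lambda x, a=a, b=b: ((a * x + b) % P) % n_buckets)
    return functions

hash_functions_demo = make_hash_functions(k=100, n_buckets=2**32, seed=42)
print(len(hash_functions_demo))  # 100
print(all(0 <= h(12345) < 2**32 for h in hash_functions_demo))  # True
```

Binding `a` and `b` as default arguments avoids the common pitfall where every lambda created in a loop ends up sharing the last pair of coefficients.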

In [12]:
import random

# TODO fix magic number and approach all together

def get_hash_function(N):
    """
    Generate a random hash function h(x) = ((a*x + b) mod p) mod N
    """
    a = random.randint(1, N)  # a must be non-zero, otherwise h is constant
    b = random.randint(0, N)
    p = get_prime(N)
    return lambda x: ((a * x + b) % p) % N

# TODO naive implementation, not efficient, but then again, it's a one off
def get_prime(N):
    """
    Get a prime number larger than N
    """
    n = random.randint(N*2, N*4)
    while not is_prime(n):
        n = random.randint(N*2, N*4)
    return n

def is_prime(n):
    """
    Check if n is a prime number (trial division up to sqrt(n))
    """
    if n < 2:
        return False
    for i in range(2, int(n**0.5) + 1):
        if n % i == 0:
            return False
    return True

In [156]:
# SANDBOX
hash_functions = [get_hash_function(x) for x in range(20, 25)]
len(hash_functions)

5

## Hashing the columns (documents) of the characteristic matrix

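Here, I first generate a `len(df.index)` x `len(hash_functions)` matrix, where each row is a row of the characteristic matrix and each column is a hash function. The value in each cell is $h_j(i)$, the $j$-th hash function applied to the row index $i$. The signature matrix, of shape `len(hash_functions)` x `len(documents)`, is then obtained by taking, for each document, the minimum hash value over the rows where that document has a 1.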

In [157]:
# from here on there is only sandboxing and testing
data = []
for index in range(len(df.index)):
    data.append([f(index) for f in hash_functions])

print(len(data), len(data[0]))

df_hashed = pd.DataFrame(data=data, index=df.index, columns=range(len(hash_functions)))
#df_hashed
df_hashed

62 5


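|  | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| 40852693 | 10 | 2 | 10 | 6 | 19 |
| 209715022 | 11 | 19 | 17 | 12 | 16 |
| 301143686 | 12 | 15 | 2 | 18 | 13 |
| 460860852 | 13 | 0 | 9 | 1 | 3 |
| 594259817 | 14 | 17 | 16 | 7 | 0 |
| ... | ... | ... | ... | ... | ... |
| 3988809903 | 0 | 17 | 14 | 13 | 7 |
| 4047504577 | 1 | 13 | 21 | 19 | 4 |
| 4128545898 | 2 | 9 | 6 | 2 | 1 |
| 4133237477 | 3 | 15 | 13 | 8 | 15 |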


In [158]:
def compute_hashed_minhash(df, hash_functions):
    """
    Hash the row indices of the characteristic matrix.
    df: a dataframe with rows as shingles and columns as documents.
    hash_functions: a list of hash functions.
    Returns a (shingles x hash functions) dataframe where cell (i, j) is hash_functions[j](i).
    """
    data = []
    for index in range(len(df.index)):
        data.append([f(index) for f in hash_functions])
    return pd.DataFrame(data=data, index=df.index, columns=range(len(hash_functions)))

In [159]:
B = compute_hashed_minhash(df, hash_functions).to_numpy()
A = df.to_numpy()
signature = np.full((len(hash_functions), len(df.columns)), np.inf)


for i in range(len(df.columns)):
    for j in range(len(hash_functions)):
        for k in range(len(df.index)):
            if A[k][i] == 1:
                signature[j][i] = min(signature[j][i], B[k][j])

signature


array([[0., 0., 0.],
       [0., 0., 0.],
       [0., 0., 0.],
       [0., 0., 0.],
       [0., 0., 0.]])
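
The all-zero signature is probably not a bug in the loop itself. The hash functions above map row indices into only about 20 buckets, while each document has 35 or more rows set to 1, so nearly every hash function reaches 0 for every document. Below is a sketch of the same computation with a much larger hash range, vectorised with NumPy. The function name `minhash_signature_matrix`, the prime, the seed and the toy matrix are assumptions for illustration, not the notebook's data.

```python
import numpy as np

def minhash_signature_matrix(cm, n_hashes, seed=0):
    """cm: (rows x documents) 0/1 array. Returns an (n_hashes x documents) signature."""
    rng = np.random.default_rng(seed)
    p = 2_147_483_647  # the prime 2**31 - 1; a*row stays well within int64 here
    a = rng.integers(1, p, size=n_hashes)
    b = rng.integers(0, p, size=n_hashes)
    rows = np.arange(cm.shape[0])
    # hashed[i, j] = h_j(i): hash value of row i under hash function j
    hashed = (np.outer(rows, a) + b) % p
    # p is used as a sentinel for documents without any shingle
    signature = np.full((n_hashes, cm.shape[1]), p, dtype=np.int64)
    for d in range(cm.shape[1]):
        mask = cm[:, d] == 1
        if mask.any():
            # column-wise minimum over the rows where document d has a 1
            signature[:, d] = hashed[mask].min(axis=0)
    return signature

toy_cm = np.array([[1, 0, 1],
                   [0, 1, 0],
                   [1, 1, 1],
                   [0, 0, 1]])
toy_signature = minhash_signature_matrix(toy_cm, n_hashes=5)
print(toy_signature.shape)  # (5, 3)
```

With a hash range this large, collisions at the minimum become rare, so the signature columns only agree where the documents actually share rows.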

In [163]:
document = 0
A.T[document]
B.T*A.T[document]



array([[ 0,  0,  0,  0, 14,  0, 16, 17, 18, 19,  0,  0,  2,  3,  4,  0,
         0,  0,  8,  0, 10, 11,  0, 13,  0,  0, 16, 17, 18, 19,  0,  0,
         2,  3,  0,  0,  0,  7,  0,  9, 10,  0, 12, 13, 14,  0, 16,  0,
        18, 19,  0,  0,  0,  3,  4,  0,  6,  0,  0,  2,  0,  4],
       [ 0,  0,  0,  0, 17,  0,  9, 15, 11,  7, 13,  0,  5, 11,  7,  0,
         0,  0,  1,  0,  3, 20,  0,  1,  0,  0, 20, 16,  1, 18, 14,  0,
        16, 12,  0,  0,  0,  6,  0,  8,  4,  0,  6,  2,  8,  0,  0,  0,
         2, 19,  4,  0,  0,  2, 19,  0,  0,  0,  0,  9,  0, 11],
       [ 0,  0,  0,  0, 16,  0,  8, 15,  0,  7,  1,  0, 15,  0,  7,  0,
         0,  0, 13,  0,  5, 12,  0, 13,  0,  0, 12, 19,  4, 11, 18,  0,
        10,  4,  0,  0,  0, 10,  0,  2,  9,  0,  1,  8,  2,  0, 16,  0,
         8, 15,  0,  0,  0, 21,  6,  0,  7,  0,  0,  6,  0, 20],
       [ 0,  0,  0,  0,  7,  0, 19,  2,  8, 14, 20,  0, 11, 17,  0,  0,
         0,  0,  1,  0, 13, 19,  0, 10,  0,  0,  5, 11, 17,  0,  6,  0,
        18,  

In [120]:
df_hashed = compute_hashed_minhash(df, hash_functions)
df_hashed.head()


hashed_data = df_hashed.to_numpy()

data = df_hashed.to_numpy() # data = B
# NOTE: this marks the non-zero hash values, not the rows with a 1 in a column
# (those come from A below); idx is not used afterwards
idx = data > 0

signature = np.full((len(hash_functions), len(df.columns)), np.inf)

A = df.to_numpy()

B = data

print(A.shape, B.shape)
A.T

#np.take(B.T, A > 0)


C = []
for document in range(A.shape[1]):
    row = list(map(lambda x: np.nanmin(x, keepdims=True), B[A[:,document] > 0]))
    C.append(row)
    print(len(row))

C = np.array(C)
C.shape

np.nanmin(np.array([[1,2,3], [np.nan,np.nan,np.nan], [1,2,3]]), axis=1, keepdims=True)

#df_c = pd.DataFrame(C, dtype=int)

#df_c


5
(62, 3) (62, 5)
35
45
41




array([[ 1.],
       [nan],
       [ 1.]])

In [113]:
df_signature = pd.DataFrame(columns=['D'+str(i) for i in range(len(hashed_shingles))], index=range(len(hash_functions)), dtype=int)
df_signature[:] = signature
df_signature

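|  | D0 | D1 | D2 |
|---|---|---|---|
| 0 | 0.0 | 0.0 | 0.0 |
| 1 | 0.0 | 0.0 | 0.0 |
| 2 | 0.0 | 0.0 | 0.0 |
| 3 | 0.0 | 0.0 | 0.0 |
| 4 | 0.0 | 0.0 | 0.0 |
| ... | ... | ... | ... |
| 95 | 0.0 | 0.0 | 0.0 |
| 96 | 0.0 | 0.0 | 0.0 |
| 97 | 0.0 | 0.0 | 0.0 |
| 98 | 0.0 | 0.0 | 0.0 |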


# LSH

With the signature matrix at hand, we are supposed to reduce the number of required comparisons by generating *candidate pairs*, which are likely to be similar. This is done by splitting the signature matrix *M* into bands. Approach according to lecture:

* Divide matrix *M* into *b* bands of *r* rows each
* For each band, hash its portion of each column to a hash table with *k* buckets (make *k* as large as possible)
* Candidate column pairs are those that hash to the same bucket for ≥ 1 band.
* Tune *b* and *r* to catch most similar pairs, but few non-similar pairs.

Recommended *b* and *r* values: for a signature matrix of 100 hashes, *b* = 20 and *r* = 5.
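
The banding steps above can be sketched as follows. The sketch assumes the signature matrix is a NumPy array of shape (hash functions x documents) and uses a Python dict keyed by the band's tuple of values as the hash table, which effectively gives an unbounded number of buckets. The name `lsh_candidate_pairs` and the toy signature are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

import numpy as np

def lsh_candidate_pairs(signature, b, r):
    """Return pairs of column indices that agree on all r rows of at least one band."""
    n_hashes, n_docs = signature.shape
    assert b * r == n_hashes, "b bands of r rows must cover the signature exactly"
    candidates = set()
    for band in range(b):
        buckets = defaultdict(list)
        rows = signature[band * r:(band + 1) * r]
        for doc in range(n_docs):
            # the tuple of r values acts as the bucket key for this band
            buckets[tuple(rows[:, doc])].append(doc)
        for docs in buckets.values():
            # every pair of documents sharing a bucket becomes a candidate
            candidates.update(combinations(docs, 2))
    return candidates

toy_sig = np.array([[1, 1, 5],
                    [2, 2, 6],
                    [3, 9, 7],
                    [4, 8, 7]])
print(lsh_candidate_pairs(toy_sig, b=2, r=2))  # {(0, 1)}
```

Only the candidate pairs then need an exact (or signature-based) similarity check, instead of all pairs of documents.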

> The statistics are covered about 1h 10m into the lecture. I do not understand them.
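
For reference, the standard result is this: if two documents have Jaccard similarity $s$, all $r$ rows of one band agree with probability $s^r$, so the pair becomes a candidate in at least one of the $b$ bands with probability $1 - (1 - s^r)^b$. A small sketch evaluating this for *b* = 20, *r* = 5; the function name `candidate_probability` is my own.

```python
def candidate_probability(s, b=20, r=5):
    """Probability that a pair with Jaccard similarity s shares a bucket in >= 1 band."""
    return 1 - (1 - s**r) ** b

for s in (0.2, 0.4, 0.6, 0.8):
    print(s, round(candidate_probability(s), 4))

# approximate similarity threshold where the curve rises most steeply
print(round((1 / 20) ** (1 / 5), 3))
```

The curve is S-shaped: pairs well below the threshold are rarely candidates, while pairs well above it almost always are.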

In [None]:
def find_candidate_pairs():
    # TODO: not implemented yet
    return None