# Finding Similar Items: Locality Sensitive Hashing

Finding “similar” items in a dataset is a fundamental data-mining problem. Similarity is quantified by a similarity or distance measure, such as Jaccard similarity, Euclidean distance, cosine distance, edit distance, or Hamming distance. In this lecture, we consider the problem of estimating the similarity of textual documents: given a large collection of documents, e.g., the Web, find pairs of documents that share much of their text, e.g., near-duplicate pairs. We will study three essential techniques for finding similar documents:

* Shingling to convert documents to sets of shingles;
* Minhashing to convert large sets to short signatures while preserving similarity;
* Locality-sensitive hashing to find candidate pairs of signatures likely to be from similar documents.
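
To make the first step concrete, here is a minimal sketch of character-level shingling on two short strings, followed by their exact Jaccard similarity. The helper names `char_shingles` and `jaccard` are illustrative only and are not part of the notebook code.

```python
def char_shingles(text, k):
    """Return the set of all length-k character substrings of text."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}


def jaccard(a, b):
    """Exact Jaccard similarity |a & b| / |a | b| of two sets."""
    union = a | b
    # Assumption: two empty sets are treated as identical.
    return len(a & b) / len(union) if union else 1.0


s1 = char_shingles("the cat sat", 3)
s2 = char_shingles("the cat sit", 3)

print(sorted(s1 & s2))   # shingles shared by both strings
print(jaccard(s1, s2))   # 7 shared shingles out of 11 distinct ones
```

The two strings differ in a single character, yet two of the nine shingles in each string change, which is why the similarity drops to 7/11 rather than staying near 1.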

In [13]:
import numpy as np
import pandas as pd
import time

import re

In [14]:
num_docs = 30

# Read the data from csv file
df = pd.read_csv('articles1.csv')[:num_docs]
df.head()

Unnamed: 0.1,Unnamed: 0,id,title,publication,author,date,year,month,url,content
0,0,17283,House Republicans Fret About Winning Their Hea...,New York Times,Carl Hulse,2016-12-31,2016.0,12.0,,WASHINGTON — Congressional Republicans have...
1,1,17284,Rift Between Officers and Residents as Killing...,New York Times,Benjamin Mueller and Al Baker,2017-06-19,2017.0,6.0,,"After the bullet shells get counted, the blood..."
2,2,17285,"Tyrus Wong, ‘Bambi’ Artist Thwarted by Racial ...",New York Times,Margalit Fox,2017-01-06,2017.0,1.0,,"When Walt Disney’s “Bambi” opened in 1942, cri..."
3,3,17286,"Among Deaths in 2016, a Heavy Toll in Pop Musi...",New York Times,William McDonald,2017-04-10,2017.0,4.0,,"Death may be the great equalizer, but it isn’t..."
4,4,17287,Kim Jong-un Says North Korea Is Preparing to T...,New York Times,Choe Sang-Hun,2017-01-02,2017.0,1.0,,"SEOUL, South Korea — North Korea’s leader, ..."


In [15]:
# Shingling function to create shingles of size k
def shingle_document(document, k, HASH = True):
    
    shingles = set()
    for i in range(len(document) - k + 1):
        shingles.add(document[i:i+k])

    if HASH:
        return set([hash(shingle) for shingle in shingles])
    
    return shingles

# Create shingles of size 3 (raw strings and hashed integers)
df['shingles'] = df['content'].apply(lambda x: shingle_document(x, 3, HASH = False))
df['hashed-shingles'] = df['content'].apply(lambda x: shingle_document(x, 3, HASH = True))
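
One caveat about the built-in `hash` used above: in Python 3 the hash of a string is salted per interpreter process unless `PYTHONHASHSEED` is set, so the hashed shingles, and every number derived from them, can differ between runs. A possible alternative, sketched here under the assumption that a 32-bit hash is acceptable, uses `zlib.crc32` from the standard library. The function name `stable_hashed_shingles` is hypothetical.

```python
import zlib


def stable_hashed_shingles(document, k):
    """Hash each k-character shingle with CRC32, which is stable across runs."""
    shingles = {document[i:i + k] for i in range(len(document) - k + 1)}
    # zlib.crc32 works on bytes and returns an unsigned 32-bit integer.
    return {zlib.crc32(s.encode("utf-8")) for s in shingles}


hashed = stable_hashed_shingles("abcabc", 3)
print(len(hashed))  # "abc", "bca", "cab" give three distinct hashes
```

CRC32 is not a cryptographic hash and collisions become likely for very large shingle sets, so this is a reproducibility aid rather than a drop-in improvement.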


In [16]:
# Jaccard similarity of hashed shingles
def jaccard_similarity(shingles1, shingles2):
    return len(shingles1.intersection(shingles2)) / len(shingles1.union(shingles2))

print(jaccard_similarity(set(df['hashed-shingles'][0]), set(df['hashed-shingles'][1])))

0.34420096205237843


In [17]:
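# Linear hash h(x) = (a * x - b) mod (2^32 - 1); each pair (a, b) selects one hash function.
# Note: 2**32 - 1 is not prime; a prime modulus is the more common choice for this family.
def func(a,b,x):
    return (a * x - b) % (2 ** 32 - 1)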

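# Minhashing function: returns a signature of n minhash values for a set of hashed shingles
def minhashing(shingles, n):
    minhashes = []
    # Fixed seed so that every document is hashed with the same n hash functions
    np.random.seed(0)
    for i in range(n):
        a = np.random.randint(1, 10**4)
        b = np.random.randint(1, 10**4)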
        minhash = np.inf
        for shingle in shingles:
            minhash = min(minhash, func(a, b, shingle))
            #minhash = min(minhash, hash((i, shingle)))
        minhashes.append(minhash)
    return minhashes

# Create minhash signatures of length 100
df['minhashes'] = df['hashed-shingles'].apply(lambda x: minhashing(x, 100))

In [18]:
# A function CompareSignatures estimates the similarity of two integer vectors – minhash signatures – as a fraction of components in which they agree.
def CompareSignatures(signature1, signature2):
    count = 0
    for i in range(len(signature1)):
        if signature1[i] == signature2[i]:
            count += 1
    return count / len(signature1)

# Compare the minhashes of two documents
print(CompareSignatures(df['minhashes'][0], df['minhashes'][1]))

0.38
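
The estimate printed above rests on the standard MinHash property. Assuming each hash function behaves like a random permutation of the shingle universe (an idealization that the linear hash used here only approximates), two sets share the same minimum hash value with probability equal to their Jaccard similarity $J$. The fraction of agreeing components in an $n$-component signature is then an unbiased estimate $\hat{J}$ with a binomial standard error:

$$
\Pr\big[\min_{x \in A} h(x) = \min_{y \in B} h(y)\big] = J(A, B),
\qquad
\mathrm{SE}(\hat{J}) = \sqrt{\frac{J\,(1 - J)}{n}}
$$

For $n = 100$ and $J \approx 0.344$ (the exact value printed earlier for documents 0 and 1), this gives $\sqrt{0.344 \cdot 0.656 / 100} \approx 0.048$, so the estimate of 0.38 lies within one standard error of the exact similarity.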


In [19]:
#Similarity matrix
JaccardSim = np.zeros((num_docs, num_docs))
for i in range(num_docs):
    for j in range(num_docs):
        JaccardSim[i][j] = jaccard_similarity(set(df['hashed-shingles'][i]), set(df['hashed-shingles'][j]))


# Apply minhashing to estimate jaccard similarity
start_time = time.time()
minhashing_JaccardSim = np.zeros((num_docs, num_docs))
df['minhashes'] = df['hashed-shingles'].apply(lambda x: minhashing(x, 100))

for row in range(num_docs):
    for col in range(num_docs):
        minhashing_JaccardSim[row][col] = CompareSignatures(df['minhashes'][row], df['minhashes'][col])

comp_time = time.time() - start_time
print("Time taken to compute minhashing similarity matrix: ", comp_time)

Time taken to compute minhashing similarity matrix:  3.4149081707000732
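
The pairwise signature comparison in the loop above can also be expressed with NumPy broadcasting. The following is a minimal sketch on toy data, assuming all signatures have the same length. The function name `signature_agreement_matrix` is hypothetical and not part of the notebook.

```python
import numpy as np


def signature_agreement_matrix(signatures):
    """Fraction of agreeing components for every pair of signatures."""
    sigs = np.asarray(signatures)
    # Compare every signature with every other one: the boolean array has
    # shape (n_docs, n_docs, n_hashes); averaging over the last axis gives
    # the fraction of matching components.
    return (sigs[:, None, :] == sigs[None, :, :]).mean(axis=2)


toy = [[1, 5, 9, 2],
       [1, 5, 0, 2],
       [7, 5, 0, 3]]
M = signature_agreement_matrix(toy)
print(M)
```

In the notebook this would presumably be called as `signature_agreement_matrix(df['minhashes'].tolist())`. The intermediate boolean array grows with the square of the number of documents, which is unproblematic for 30 documents but not for large collections.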


In [20]:
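# LSH function: given a collection of minhash signatures (integer vectors) and a number of bands b,
# returns the pairs of documents that share a bucket in at least one band and whose signature
# similarity exceeds the threshold t = (1/b)^(1/r), together with t.
# A pair can appear more than once in the result if it collides in several bands.
def LSH(minhashes, b):
    r = int(len(minhashes[0]) / b)  # rows per band
    print("r = ", len(minhashes[0]))  # note: this prints the signature length, not the rows per band r
    t = np.power(1/b, 1/r)  # approximate similarity threshold of the banding scheme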
    num_docs = len(minhashes)
    candidate_pairs = []
    for i in range(b):
        buckets = {}
        for j in range(num_docs):
            signature = minhashes[j]
            bucket = hash(tuple(signature[i * r:(i + 1) * r]))
            if bucket in buckets:
                buckets[bucket].append(j)
            else:
                buckets[bucket] = [j]
        for bucket in buckets:
            if len(buckets[bucket]) > 1:
                for doc1 in buckets[bucket]:
                    for doc2 in buckets[bucket]:
                        if doc1 < doc2:
                            if CompareSignatures(minhashes[doc1], minhashes[doc2]) > t:
                                candidate_pairs.append((doc1, doc2))
    return candidate_pairs,t
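
The threshold `t` computed above comes from the usual analysis of the banding technique: if two signatures agree in each component with probability `s`, they fall into the same bucket in at least one of `b` bands of `r` rows with probability `1 - (1 - s^r)^b`, and `(1/b)^(1/r)` approximates the similarity at which this curve rises most steeply. The sketch below tabulates that curve for `b = 25` and `r = 4`, the split of a 100-component signature into 25 bands. The helper names are illustrative. The curve describes only the bucket-collision stage, not the additional `CompareSignatures` filter applied in `LSH`.

```python
def lsh_threshold(b, r):
    """Approximate similarity threshold (1/b)^(1/r) of a banding scheme."""
    return (1.0 / b) ** (1.0 / r)


def candidate_probability(s, b, r):
    """Probability that a pair with per-component agreement s shares a bucket in at least one band."""
    return 1.0 - (1.0 - s ** r) ** b


b, r = 25, 4
print("threshold:", lsh_threshold(b, r))
for s in (0.2, 0.4, 0.6, 0.8):
    print(s, round(candidate_probability(s, b, r), 3))
```

Increasing `r` for a fixed signature length moves the threshold toward 1 and makes the scheme stricter, while increasing `b` moves it toward 0 and admits more candidates.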

In [21]:
# Apply LSH to estimate similarity
start_time = time.time()
candidate_pairs, threshold = LSH(df['minhashes'], 25)
comp_time = time.time() - start_time
print("Time taken to compute LSH: ", comp_time)

# Pairs found by LSH whose signature similarity is greater than the threshold
print("Threshold: ", threshold)
print("Number of pairs with similarity greater than threshold: ", len(set(candidate_pairs)))

r =  100
Time taken to compute LSH:  0.0090179443359375
Threshold:  0.4472135954999579
Number of pairs with similarity greater than threshold:  71


In [22]:
# Count the number of pairs with similarity greater than threshold
count = 0
count_minhash = 0
for i in range(num_docs):
    for j in range(num_docs):
        if minhashing_JaccardSim[i][j] > threshold:
            count_minhash += 1
        if JaccardSim[i][j] > threshold:
            count += 1

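# Exact Jaccard similarity: drop the num_docs diagonal entries and count each unordered pair once
print("Number of pairs with similarity greater than threshold: ", (count-num_docs)/2)
# MinHash estimate of the Jaccard similarity, counted the same way
print("Number of pairs with similarity greater than threshold: ", (count_minhash-num_docs)/2)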

Number of pairs with similarity greater than threshold:  18.0
Number of pairs with similarity greater than threshold:  92.0


In [23]:
JaccardSim

array([[1.        , 0.34420096, 0.36344411, 0.33444909, 0.37790422,
        0.18502203, 0.38711195, 0.36795069, 0.3472177 , 0.34637802,
        0.41345764, 0.39093095, 0.37860662, 0.33714547, 0.35519503,
        0.36386585, 0.41161049, 0.39766302, 0.34258417, 0.28431373,
        0.34410802, 0.36012297, 0.38147448, 0.33399867, 0.43662519,
        0.43896104, 0.35483871, 0.40717075, 0.38004246, 0.37605804],
       [0.34420096, 1.        , 0.46225984, 0.44242424, 0.29666757,
        0.13131313, 0.29429107, 0.48647377, 0.4138093 , 0.35112285,
        0.38289101, 0.43455117, 0.41725739, 0.29740681, 0.40785645,
        0.43487076, 0.43703892, 0.42007624, 0.30567568, 0.21834914,
        0.31793825, 0.32543276, 0.40742625, 0.42511493, 0.41497462,
        0.36731477, 0.38620029, 0.43587224, 0.3589404 , 0.47760165],
       [0.36344411, 0.46225984, 1.        , 0.43957845, 0.31881372,
        0.14383339, 0.31986532, 0.46182918, 0.4304653 , 0.35805022,
        0.3912794 , 0.42009971, 0.4086423 , 0.