### Simhash
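In computer science, SimHash is a technique for quickly estimating how similar two sets are. It maps each input to a fixed-size fingerprint such that similar inputs produce fingerprints that differ in only a few bits. Google uses the algorithm to find near-duplicate web pages. It was created by Moses Charikar.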

Algorithm:

- Choose the hash size, e.g. 32 bits, and initialize a vector `T` of that many counters to zero.
- Break the phrase up into features (shingles), here every distinct 2-character substring:  
  'the cat sat on the mat'  
  -> {"th", "he", "e ", " c", "ca", "at", "t ",  
      " s", "sa", " o", "on", "n ", " t", " m", "ma"}
- Hash each feature with a conventional hash function, e.g. MD5 truncated to 32 bits:  
  "th" -> 10010010...  
  "he" -> 10010110...
- For each hash, treat every 0 bit as -1 and every 1 bit as +1, and add these values position by position into `T`.
- Generate the simhash: bit `i` is 1 if `T[i] > 0`, and 0 otherwise.
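
The steps above can be sketched in plain Python. This is a minimal, unweighted illustration, not the implementation used by the `simhash` package imported below, so it will not reproduce that package's values. The helper names (`shingles`, `feature_hash`, `simhash_sketch`) and the choice to keep the low 32 bits of an MD5 digest are assumptions made for this example.

```python
import hashlib


def shingles(text, width=2):
    """Return the set of distinct character shingles of the given width."""
    return {text[i:i + width] for i in range(max(len(text) - width + 1, 1))}


def feature_hash(feature, bits=32):
    """Hash a feature with MD5 and keep only the low `bits` bits (assumption)."""
    digest = hashlib.md5(feature.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") & ((1 << bits) - 1)


def simhash_sketch(text, bits=32, width=2):
    """Compute an unweighted simhash fingerprint of `text`."""
    totals = [0] * bits  # the vector T, one counter per bit position
    for feature in shingles(text, width):
        h = feature_hash(feature, bits)
        for i in range(bits):
            # A 1 bit votes +1, a 0 bit votes -1.
            totals[i] += 1 if (h >> i) & 1 else -1
    value = 0
    for i, total in enumerate(totals):
        if total > 0:  # bit i is set only when the votes are positive
            value |= 1 << i
    return value


fingerprint = simhash_sketch("the cat sat on the mat")
print(format(fingerprint, "032b"))
```

With a single feature every counter is either +1 or -1, so the fingerprint equals that feature's hash; with many features each bit reflects the majority vote.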

In [228]:
import re
import hashlib
from simhash import Simhash

In [167]:
def make_features(input_str):
    width = 3
    input_str = input_str.lower()
    out_str = re.sub(r'[^\w]+', '', input_str)
    return [out_str[i:i + width] for i in range(max(len(out_str) - width + 1, 1))]

In [165]:
def make_simhash(input_str):
    features = make_features(input_str)
    return Simhash(features).value

In [168]:
make_features("hello world")

['hel', 'ell', 'llo', 'low', 'owo', 'wor', 'orl', 'rld']

In [166]:
make_simhash("hello world")

13548364882372308181

In [224]:
text_1 = "Good job"
text_2 = "Good job, ray"

Simhash(text_1).distance(Simhash(text_2))

14

In [225]:
hash_1 = Simhash(text_1).value
hash_2 = Simhash(text_2).value

In [226]:
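def simhash_diff(hash_1, hash_2):
    """Calculate the Hamming distance between two 64-bit simhash values."""
    # XOR leaves a 1 in every bit position where the two hashes differ.
    x = (hash_1 ^ hash_2) & ((1 << 64) - 1)
    ans = 0
    while x:
        ans += 1
        x &= x - 1  # clear the lowest set bit
    return ans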

In [227]:
simhash_diff(hash_1, hash_2)

14
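
The bit-counting loop in `simhash_diff` can also be written with Python built-ins. The sketch below is an equivalent alternative under the same 64-bit assumption; the name `hamming_distance` and the `bits` parameter are introduced only for this example.

```python
def hamming_distance(hash_1, hash_2, bits=64):
    """Count the bit positions in which two fingerprints differ."""
    # XOR marks the differing bits; the mask keeps only the low `bits` bits.
    x = (hash_1 ^ hash_2) & ((1 << bits) - 1)
    # The binary string of x contains one "1" per differing bit.
    return bin(x).count("1")


print(hamming_distance(0b1011, 0b0010))
```

For the two small values above, the XOR is `0b1001`, so the distance is 2.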

Ref

- Google paper, "Detecting Near-Duplicates for Web Crawling" (WWW 2007): http://www.wwwconference.org/www2007/papers/paper215.pdf
- Implementation: https://github.com/leonsim/simhash