# Locality Sensitive Hashing - Python3 Example

The following Python 3 Jupyter notebook walks through a general example of locality-sensitive hashing (LSH).

### References

* Stack Overflow answer from which much of the code is taken: [How to understand Locality Sensitive Hashing](https://stackoverflow.com/questions/12952729/how-to-understand-locality-sensitive-hashing)
* [Mining of Massive Datasets, Chapter 3: Finding Similar Items](http://infolab.stanford.edu/~ullman/mmds/ch3a.pdf)
* [Locality-Sensitive Hashing for Finding Nearest Neighbors](http://www.slaney.org/malcolm/yahoo/Slaney2008-LSHTutorial.pdf)

In [None]:
import numpy as np
import math

## Signature Generator
LSH signature generation using random projection. Example signatures:

**r1 =** 5258938758107597568730090422086682111341336555356353764484021991907916884238862457752925253205886540063228859967989877341517
3708958214185378191229172581078591707136483695662719666201194820344998416019553425368626333566280373218005140028455336616252
113734411167457082017629313236890406507495424505529953875716 

**r2 =**
4188726764665893952181551897213197536490322759612234627111192449758262888603806182273761430857902365465000582299715182381794
6025553602019495096619768555900041945849259355914615371488201147608435031471657292041543981419502219834960094881943140319526
668290047547998830850975676427533817413288633885065017275967

Where:
* user_vector = the input vector, an array of dim floats
* rand_proj = the random projection matrix of shape (d, dim), one random hyperplane per row; d is the number of bits per signature
* x << y = returns x with the bits shifted to the left by y places (new bits on the right-hand side are zeros). This is the same as multiplying x by 2**y.

**Output:**
* res = an integer whose binary digits record, for each row of rand_proj, whether its dot product with user_vector is non-negative (1) or negative (0)

In [32]:
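def get_signature(user_vector, rand_proj):
    """Pack the signs of the random projections of user_vector into one integer."""
    res = 0
    # one dot product per row (hyperplane) of rand_proj
    val = np.dot(rand_proj, user_vector)
    for v in val:
        res = res << 1  # make room for the next bit
        if v >= 0:
            res += 1  # set the bit when the projection is non-negative
    print(f"def_signature res: {res}")
    return res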
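
As a small illustration of how the shift-and-add loop packs sign bits, here is a minimal sketch on a toy input. The helper name `sign_bits_to_int` and the 3 x 2 matrix are hypothetical and chosen only so the arithmetic can be checked by hand; they are not part of the notebook.

```python
import numpy as np

def sign_bits_to_int(projections):
    """Pack the sign of each projection into one integer, first value in the highest bit."""
    res = 0
    for v in projections:
        # shift left to make room, then set the new lowest bit if v is non-negative
        res = (res << 1) | (1 if v >= 0 else 0)
    return res

# Hypothetical toy input: 3 hyperplanes (rows) in 2 dimensions.
rand_proj = np.array([[1.0, 0.0],
                      [0.0, 1.0],
                      [-1.0, -1.0]])
user_vector = np.array([2.0, -3.0])

projections = rand_proj @ user_vector      # [2.0, -3.0, 1.0]
signature = sign_bits_to_int(projections)  # signs +, -, + give bits 1, 0, 1
print(format(signature, "03b"))            # prints 101
```

Each row contributes exactly one bit, so a signature built from d rows fits in d bits.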

## Binary Tally

Where:
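* num = r1 ^ r2, the XOR of the two signatures; its 1 bits mark the positions where the signatures differ
* x & y = does a "bitwise and". Each bit of the output is 1 if the corresponding bits of x and y are both 1, otherwise it's 0. The expression num & (num-1) clears the lowest 1 bit of num, so the loop runs once per 1 bit.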

In [25]:
# get number of '1's in binary
# running time: O(# of '1's)
def nnz(num):
    print(f"nnz num: {num}")
    if num == 0:
        return 0
    res = 1
    num = num & (num-1)  # clear the lowest set bit
    while num:
        res += 1
        num = num & (num-1)
    print(f"nnz res: {res}")
    return res
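
The bit-clearing trick can be checked against Python's own binary formatting. The sketch below is a print-free variant under a hypothetical name, `popcount_kernighan`, with two made-up 6-bit signatures; the value 17910 is the first `nnz num` shown in the run output further down.

```python
def popcount_kernighan(num):
    """Count the 1 bits of a non-negative integer by clearing the lowest set bit each step."""
    count = 0
    while num:
        num &= num - 1  # drops exactly one 1 bit
        count += 1
    return count

# Hypothetical 6-bit signatures that differ in three positions.
r1, r2 = 0b101101, 0b100011
xor = r1 ^ r2                                # 0b001110
print(popcount_kernighan(xor))               # prints 3
print(popcount_kernighan(17910))             # prints 9
print(bin(17910).count("1"))                 # prints 9, same tally via string formatting
```

The loop body runs once per 1 bit, which is where the O(# of '1's) running time comes from.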

## Cosine Similarity

Cosine similarity is a measure of similarity between two non-zero vectors of an inner product space. It is defined to equal the cosine of the angle between them, which is also the same as the inner product of the same vectors normalized to both have length 1. The cosine of 0° is 1, and it is less than 1 for any angle in the interval (0, π] radians. It is thus a judgment of orientation and not magnitude: two vectors with the same orientation have a cosine similarity of 1, two vectors oriented at 90° relative to each other have a similarity of 0, and two vectors diametrically opposed have a similarity of -1, independent of their magnitude. The cosine similarity is particularly used in positive space, where the outcome is neatly bounded in [0,1].

The function below returns the related angular similarity, 1 - θ/π, where θ is the angle between the two vectors: it is 1 for identical orientation, 0.5 at 90°, and 0 for diametrically opposed vectors.

In [14]:
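# angular similarity using definitions
# http://en.wikipedia.org/wiki/Cosine_similarity
def angular_similarity(a, b):
    dot_prod = np.dot(a, b)
    norm_a = np.linalg.norm(a)  # Euclidean length of a
    norm_b = np.linalg.norm(b)
    cosine = dot_prod / (norm_a * norm_b)  # cosine similarity
    # guard against rounding pushing the value just outside acos's domain
    cosine = min(1.0, max(-1.0, cosine))
    theta = math.acos(cosine)
    return 1.0 - (theta / math.pi)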
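
To make the three reference cases concrete (same orientation, 90°, diametrically opposed), here is a minimal sketch that returns both the cosine and the angular similarity. The helper name `cosine_and_angular` and the 2-D vectors are hypothetical, picked so the angles are obvious.

```python
import math
import numpy as np

def cosine_and_angular(a, b):
    """Return (cosine similarity, angular similarity 1 - theta/pi) for two non-zero vectors."""
    cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    cosine = float(np.clip(cosine, -1.0, 1.0))  # keep acos inside its domain
    theta = math.acos(cosine)
    return cosine, 1.0 - theta / math.pi

# Magnitudes differ on purpose: only orientation matters.
same = cosine_and_angular(np.array([1.0, 0.0]), np.array([3.0, 0.0]))
orth = cosine_and_angular(np.array([1.0, 0.0]), np.array([0.0, 2.0]))
opp = cosine_and_angular(np.array([1.0, 0.0]), np.array([-5.0, 0.0]))

for name, (cos_sim, ang_sim) in [("same", same), ("orthogonal", orth), ("opposite", opp)]:
    print(f"{name}: cosine {cos_sim:+.1f}, angular {ang_sim:.1f}")
```

Two independent random vectors in high dimensions are close to orthogonal, which is why the `true` values in the run output sit near 0.5.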

## Main

User defines:
* dim - number of dimensions per data vector
* d - number of bits per signature
* nruns - number of repetitions

**Outputs:**
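* true_sim = angular similarity between user1 and user2 (1 - θ/π, derived from the cosine similarity)
* hash_sim = fraction of signature bits on which r1 and r2 agree, (d - nnz(r1^r2)) / d
* diff = relative difference, |hash_sim - true_sim| / true_sim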
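
How close hash_sim gets to true_sim depends on the number of bits d. The sketch below is not the notebook's code: it assumes a fixed seed for repeatability, compares boolean sign arrays directly instead of packing them into integers, and uses a hypothetical helper `estimate`. It only illustrates that more hyperplanes tend to give a tighter estimate.

```python
import math
import numpy as np

rng = np.random.default_rng(0)  # fixed seed: an assumption, for repeatability only
dim = 200
user1 = rng.standard_normal(dim)
user2 = rng.standard_normal(dim)

# Angular similarity computed directly from the vectors.
cosine = np.dot(user1, user2) / (np.linalg.norm(user1) * np.linalg.norm(user2))
true_sim = 1.0 - math.acos(float(np.clip(cosine, -1.0, 1.0))) / math.pi

def estimate(d):
    """Fraction of d random hyperplanes that put user1 and user2 on the same side."""
    planes = rng.standard_normal((d, dim))
    bits1 = planes @ user1 >= 0
    bits2 = planes @ user2 >= 0
    return float(np.mean(bits1 == bits2))

results = {d: estimate(d) for d in (16, 256, 4096)}
for d, hash_sim in results.items():
    print(f"d={d}: hash {hash_sim:.4f}, abs error {abs(hash_sim - true_sim):.4f}")
```

With d = 16, as in the example that follows, each differing bit moves hash_sim by 1/16 = 0.0625, so sizeable relative differences are expected.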

**Example:**


In [33]:
if __name__ == '__main__':
    dim = 200 # number of dimensions per data
    d = 2**4 # number of bits per signature
    
    nruns = 24 # repeat times
    
    avg = 0
    for run in range(nruns):
        # two random users and one shared set of d random hyperplanes
        user1 = np.random.randn(dim)
        user2 = np.random.randn(dim)
        randv = np.random.randn(d, dim)
        r1 = get_signature(user1, randv)
        r2 = get_signature(user2, randv)
        xor = r1 ^ r2  # 1 bits mark positions where the signatures differ
        true_sim = angular_similarity(user1, user2)
        hash_sim = (d - nnz(xor)) / d  # fraction of matching bits
        diff = abs(hash_sim - true_sim) / true_sim  # relative difference
        avg += diff
        print('true %.4f, hash %.4f, diff %.4f' % (true_sim, hash_sim, diff))
    print('avg diff', avg / nruns)

def_signature res: 52543
def_signature res: 35017
nnz num: 17910
nnz res: 9
true 0.4923, hash 0.4375, diff 0.1113
def_signature res: 35070
def_signature res: 5516
nnz num: 40306
nnz res: 9
true 0.4952, hash 0.4375, diff 0.1165
def_signature res: 36150
def_signature res: 26306
nnz num: 60404
nnz res: 11
true 0.4836, hash 0.3125, diff 0.3538
def_signature res: 16715
def_signature res: 5694
nnz num: 22389
nnz res: 10
true 0.5068, hash 0.3750, diff 0.2601
def_signature res: 27030
def_signature res: 39351
nnz num: 61473
nnz res: 6
true 0.4596, hash 0.6250, diff 0.3598
def_signature res: 40268
def_signature res: 45167
nnz num: 11555
nnz res: 7
true 0.5007, hash 0.5625, diff 0.1233
def_signature res: 56501
def_signature res: 27758
nnz num: 45275
nnz res: 9
true 0.5231, hash 0.4375, diff 0.1636
def_signature res: 37353
def_signature res: 48296
nnz num: 11585
nnz res: 6
true 0.5014, hash 0.6250, diff 0.2466
def_signature res: 11323
def_signature res: 5619
nnz num: 14792
nnz res: 7
true 0.4725, 