

## **Advances in Data Mining**

Stephan van der Putten | (s1528459) | stvdputtenjur@gmail.com  
Theo Baart | s2370328 | s2370328@student.leidenuniv.nl

### **Assignment 2**
This assignment is concerned with finding the set of similar users in the provided data source. More precisely, the goal is to find all pairs of users whose Jaccard similarity exceeds 0.5. The assignment also compares the "naïve implementation" with the "LSH implementation". The naïve implementation can be found in the file `time_estimate.ipynb` and the LSH implementation in the file `lsh.ipynb`.
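
For reference, the Jaccard similarity of two users is the size of the intersection of their rated-movie sets divided by the size of their union. The sketch below is a minimal, dependency-free illustration of the naïve all-pairs approach. The function names `jaccard` and `naive_similar_pairs` and the toy data are assumptions made for illustration; they are not taken from `time_estimate.ipynb`.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity |a & b| / |a | b| of two sets (0.0 if both are empty)."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def naive_similar_pairs(user_movies, threshold=0.5):
    """Compare every pair of users; keep pairs with similarity above threshold."""
    pairs = []
    for u, v in combinations(sorted(user_movies), 2):
        if jaccard(user_movies[u], user_movies[v]) > threshold:
            pairs.append((u, v))
    return pairs


# Toy data (hypothetical): user id -> set of rated movie ids.
toy = {0: {1, 2, 3, 4}, 1: {1, 2, 3}, 2: {7, 8}, 3: {3, 4}}
print(naive_similar_pairs(toy))  # [(0, 1)]
```

The number of comparisons grows quadratically with the number of users, which is the cost that LSH is meant to avoid.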

Note that all implementations are based on the assignment guidelines, the provided helper files, and the documentation of the functions used.


POSSIBLE SOURCES:
<http://www.hcbravo.org/dscert-mldm/projects/project_1/>
<https://colab.research.google.com/drive/1HetBrWFRYqwUxn0v7wIwS7COBaNmusfD#scrollTo=hzPw8EMoW4i4&forceEdit=true&sandboxMode=true>


#### **LSH Implementation**
This notebook implements LSH in order to find all pairs of users with a Jaccard similarity of more than 0.5. As noted in the assignment instructions, the data file is loaded from `user_movie.npy` and the list of user pairs is written to the file `ans.txt`. Additionally, this implementation supports setting a random seed that determines the permutations used in LSH. The algorithm continually saves its output, because the evaluation criteria only consider the first 15 minutes of the LSH execution.
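
One way to save output continually is to append each batch of confirmed pairs to `ans.txt` as soon as it is found, so that an interrupted run still leaves a usable file. The sketch below assumes one `u1,u2` line per pair with the smaller id first; that file format and the helper name `append_pairs` are assumptions, not requirements stated in this notebook.

```python
import os
import tempfile


def append_pairs(path, pairs):
    """Append user pairs to the answer file, one 'u1,u2' line per pair."""
    # Opening in append mode keeps earlier results; closing the file flushes it.
    with open(path, "a") as f:
        for u1, u2 in pairs:
            lo, hi = sorted((u1, u2))
            f.write(f"{lo},{hi}\n")


# Demo in a temporary directory so that no real ans.txt is touched.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "ans.txt")
    append_pairs(demo_path, [(5, 2)])   # first batch
    append_pairs(demo_path, [(1, 3)])   # later batch is added, not overwritten
    with open(demo_path) as f:
        demo_contents = f.read()
print(demo_contents)
```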
___

The following snippet handles all imports.

In [24]:
import time
import sys
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix, csc_matrix, coo_matrix, lil_matrix, find
from scipy.sparse import identity

### **Program Execution**
This section is concerned with parsing the input arguments and determining the execution flow of the program.

___
The `main` function handles the start of execution from the command line.

In order to do this the function uses the following parameters:
  * `argv` - the command line arguments given to the program
  
The following command line arguments are expected:
  * `seed` - the value to use as random seed
  * `path` - the location of the `user_movie.npy` file
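
The two arguments above could be parsed with the standard `argparse` module, as in the following sketch. The helper name `parse_args` is hypothetical, and converting the seed to an integer is an assumption about how the seed will be used.

```python
import argparse


def parse_args(argv):
    """Parse [seed, path] from a list such as sys.argv[1:]."""
    parser = argparse.ArgumentParser(description="LSH for similar users")
    parser.add_argument("seed", type=int, help="value to use as random seed")
    parser.add_argument("path", help="location of the data file")
    args = parser.parse_args(argv)
    return args.seed, args.path


seed, path = parse_args(["42", "datasets/user_movie.npy"])
print(seed, path)  # 42 datasets/user_movie.npy
```

Compared with indexing `sys.argv` directly, this reports a readable error when an argument is missing or the seed is not a number.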

In [25]:
user_movie = np.load('datasets/user_movie.npy')

In [26]:
%%time
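c = user_movie[:,0]  # user ids (become the columns)
r = user_movie[:,1]  # movie ids (become the rows)
d = np.ones(len(c))
# Using the number of unique ids as the shape assumes ids are contiguous and start at 0.
max_c = len(np.unique(c))
max_r = len(np.unique(r))
# Sparse movies x users matrix in both column- and row-oriented formats.
csc = csc_matrix((d, (r,c)), shape=(max_r, max_c))
csr = csr_matrix((d, (r,c)), shape=(max_r, max_c))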
signature_length = 50

# example = np.array([[1,0,0,1],[0,0,1,0],[0,1,0,1],[1,0,1,0],[0,0,1,0]])
# hash_func = np.array([[4,3,1,2,0], [3,0,4,2,1]])

Wall time: 12.4 s


In [27]:
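def rowminhash(signature_length, hash_func, matrix):
    """Build the min-hash signature matrix, processing one row (movie) at a time."""
    # Start at +inf so the first hash value seen for a column always wins.
    sigm = np.full((signature_length, matrix.shape[1]), np.inf)
    for row in range(matrix.shape[0]):
        # Column indices (users) that have a 1 in this row.
        ones = find(matrix[row, :])[1]
        # Value of every permutation for this row, as a column vector.
        row_hash = hash_func[:, row].reshape(-1, 1)
        # Update only the affected columns instead of copying the whole matrix.
        sigm[:, ones] = np.minimum(sigm[:, ones], row_hash)
    return sigm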
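
To make the row-by-row update concrete, the following dependency-free sketch applies the same idea to the small example matrix and the two permutations that appear as commented-out lines in the cell above. The function name `minhash_signatures` and the use of plain lists instead of NumPy arrays are choices made for this illustration only.

```python
def minhash_signatures(matrix, permutations):
    """Min-hash a dense 0/1 matrix (rows = movies, columns = users).

    permutations[h][row] is the position of `row` under permutation h.
    """
    n_rows, n_cols = len(matrix), len(matrix[0])
    sig = [[float("inf")] * n_cols for _ in permutations]
    for row in range(n_rows):
        for col in range(n_cols):
            if matrix[row][col]:
                # Keep the smallest permuted position seen so far for this column.
                for h, perm in enumerate(permutations):
                    sig[h][col] = min(sig[h][col], perm[row])
    return sig


example = [[1, 0, 0, 1], [0, 0, 1, 0], [0, 1, 0, 1], [1, 0, 1, 0], [0, 0, 1, 0]]
hash_func = [[4, 3, 1, 2, 0], [3, 0, 4, 2, 1]]
print(minhash_signatures(example, hash_func))  # [[2, 1, 0, 1], [2, 4, 0, 3]]
```

For instance, the first column has ones in rows 0 and 3, which the first permutation maps to 4 and 2, so its first signature entry is 2.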

In [148]:
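import scipy.optimize as opt
import math

def choose_nbands(t, n):
    """Pick the number of bands b for a signature of length n so that the
    approximate LSH threshold (1/b)**(1/r), with r = n/b, is close to t."""
    def error_fun(x):
        # With r = n / b the threshold (1/b)**(1/r) equals (1/b)**(b/n).
        cur_t = (1/x[0])**(x[0]/n)
        return (t-cur_t)**2

    opt_res = opt.minimize(error_fun, x0=[10.0], method='Nelder-Mead')
    b = int(math.ceil(opt_res['x'][0]))
    r = round(n / b)
    # Threshold actually obtained after rounding b and r to integers.
    final_t = (1/b)**(1/r)
    return b, final_t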




def do_lsh(sign_matrix, signature_length, threshold):
    return 0

In [139]:
def make_random_hash_fn(p=2**31-1, m=4294967295):
#     def make_random_hash_fn(p=2**33-355, m=4294967295):
    a = np.random.randint(1,p-1)
    b = np.random.randint(0, p-1)
    return lambda x: ((a * x + b) % p) % m

In [140]:
import hashlib
def lsh_r_bucket_to_id(u, ids, b, r):
    """Hash each band of every user's signature into a bucket.

    Returns one dict per band, mapping bucket hash -> list of user ids.
    Note: if b * r exceeds the signature length, the trailing bands are
    short or empty and collapse into very few buckets.
    """
    u = u.T  # one row per user
    ids = np.array(ids)
    number_of_users = u.shape[0]
    hash_buckets = [dict() for _ in range(b)]
    hf = make_random_hash_fn()
    t1 = time.time()

    for i in range(number_of_users):
        if i % 10000 == 0:
            print(str(round(100*i/number_of_users,2))+' percent complete in '+str(round(time.time()-t1,2))+ ' seconds')
        row = u[i,:]
        for j in range(b):
            # The r signature values of band j, serialised to a string for hashing.
            r_signature = str(row[j*r:(j+1)*r])
            r_hash = int(hashlib.sha1(r_signature.encode()).hexdigest()[:8], 16)
            r_hash = hf(r_hash)
            # Append in place instead of rebuilding the list on every insert.
            hash_buckets[j].setdefault(r_hash, []).append(ids[i])
    return hash_buckets

In [22]:
# sigm100 = np.load('datasets/sign_matrix_100.npy')
sigm50 = np.load('datasets/sign_matrix.npy')

In [149]:
# Tests multiple values of b and r and prints those within 0.1 of our ideal t value (0.5)
#TODO make function and auto generate b/r
numhashes = sigm50.shape[0]  # signature length
for b in range(1,15):
    r = int(numhashes / b)
    for r_i in range (r-1,r+2):
        t = (1/b)**(1/r_i)
        if t > 0.4 and t < 0.6:
            print(b,r_i,t)

9 4 0.5773502691896257
10 4 0.5623413251903491
11 3 0.4496443130226092
11 4 0.5491004867761125
12 3 0.43679023236814946
12 4 0.537284965911771
13 3 0.4252903702829902
13 4 0.5266403878479267
14 3 0.41491326668312173
14 4 0.5169731539571706
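
The value printed in the third column is the usual approximation of the LSH threshold, (1/b)^(1/r). It marks roughly where the S-curve 1 - (1 - s^r)^b rises most steeply, where s is the Jaccard similarity of a pair and the curve gives the probability that the pair lands in the same bucket in at least one band. The sketch below evaluates both quantities; the function names and the chosen (b, r) combinations are illustrative assumptions.

```python
def approx_threshold(b, r):
    """Approximate similarity at which the S-curve rises most steeply."""
    return (1.0 / b) ** (1.0 / r)


def candidate_probability(s, b, r):
    """Probability that a pair with Jaccard similarity s shares a bucket in >= 1 band."""
    return 1.0 - (1.0 - s ** r) ** b


for b, r in [(9, 4), (12, 4), (14, 3)]:
    print(b, r,
          round(approx_threshold(b, r), 3),
          round(candidate_probability(0.5, b, r), 3))
```

Lowering r or raising b moves the threshold down, which yields fewer missed pairs at the cost of more candidates to verify.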


In [150]:
sigm = sigm50
threshold=0.57 # overshoot so as to get a similarity matrix closer to 0.5
numhashes = sigm.shape[0]
b, final_t = choose_nbands(threshold, numhashes)
# b = 13
# r = 4
r = round(numhashes / b)

print(threshold,b,r,final_t)
user_ids = np.arange(csr.shape[1])
# r_bucket_to_id = lsh_r_bucket_to_id(sigm,user_ids,b,r)

0.57 12 4 0.537284965911771


In [None]:
r_bucket_to_id

In [None]:
# https://surfdrive.surf.nl/files/index.php/s/WwZqzkkHxg6KLlL

In [None]:
print(short_buckets.values())

In [142]:
%%time
pairs = set()
for b in range(len(r_bucket_to_id)):
    print('band:',b)
    sub_buckets = r_bucket_to_id[b]
    print(len(sub_buckets))
    short_buckets = {k: v for k, v in sub_buckets.items() if len(v) >= 2}
    print(len(short_buckets))
    i = 0
#     for v in short_buckets.values():
# #         print(i,len(v))
#         pairs.update(set(it.combinations(v,2)))
#         i += 1
#         if i == 100:
#             break
print(len(pairs))

band: 0
28232
6982
band: 1
45541
10873
band: 2
35656
8466
band: 3
43230
10535
band: 4
48048
10862
band: 5
46199
10892
band: 6
41774
9706
band: 7
40963
10421
band: 8
33665
8080
band: 9
48973
10395
band: 10
40473
9699
band: 11
43707
10135
band: 12
4445
2500
band: 13
1
1
0
Wall time: 226 ms
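
The remaining step, turning buckets into pairs, could look like the sketch below: every pair of users that shares a bucket in at least one band becomes a candidate, and each candidate is then checked against the exact Jaccard similarity. The function names, the toy data, and the `max_bucket_size` cut-off (which skips degenerate buckets such as the single bucket of the last band above) are assumptions for illustration, not part of the notebook.

```python
from itertools import combinations


def candidate_pairs(band_buckets, max_bucket_size=1000):
    """Collect every pair of users that shares a bucket in at least one band."""
    pairs = set()
    for buckets in band_buckets:
        for users in buckets.values():
            # Singleton buckets yield no pairs; huge ones are skipped (assumption).
            if 2 <= len(users) <= max_bucket_size:
                pairs.update(combinations(sorted(users), 2))
    return pairs


def verify_pairs(pairs, user_movies, threshold=0.5):
    """Keep only candidates whose exact Jaccard similarity exceeds threshold."""
    kept = []
    for u, v in sorted(pairs):
        a, b = user_movies[u], user_movies[v]
        union = len(a | b)
        if union and len(a & b) / union > threshold:
            kept.append((u, v))
    return kept


# Toy data (hypothetical): one dict per band, bucket hash -> user ids.
toy_buckets = [{11: [0, 1], 12: [2]}, {21: [1, 0, 3], 22: [2]}]
toy_movies = {0: {1, 2, 3, 4}, 1: {1, 2, 3}, 2: {7, 8}, 3: {3, 4}}
candidates = candidate_pairs(toy_buckets)
similar = verify_pairs(candidates, toy_movies)
print(sorted(candidates), similar)  # [(0, 1), (0, 3), (1, 3)] [(0, 1)]
```

Storing pairs in a set with the smaller id first prevents the same pair from being counted once per band.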


In [6]:
np.random.seed(42)  # call the function; assigning to it would replace it without seeding

# example = csr
# example = np.array([[1,0,0,1],[0,0,1,0],[0,1,0,1],[1,0,1,0],[0,0,1,0]])
# %time sigm1 = minhash(signature_length,hash_func, example)
# signature_length = 100
# hash_func = np.array([np.random.permutation(csr.shape[0]) for i in range(signature_length)])
# %time sigm1 = rowminhash(signature_length ,hash_func, csr)

signature_length = 50
hash_func = np.array([np.random.permutation(csr.shape[0]) for i in range(signature_length)])
%time sigm2 = rowminhash(signature_length,hash_func, csr)
# print(sigm2)
# np.save('datasets/sign_matrix_100', sigm1)
# np.save('datasets/sign_matrix', sigm2)

CPU times: user 4min 13s, sys: 2min 47s, total: 7min 1s
Wall time: 7min 1s
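
Because the permutations are the only source of randomness in the signature matrix, fixing the seed makes a run reproducible. The sketch below illustrates this with the standard library's `random.Random`; it is a dependency-free stand-in for the NumPy-based cell above, where the analogous step is calling `np.random.seed(seed)` before generating the permutations. The function name `make_permutations` is hypothetical.

```python
import random


def make_permutations(n_rows, signature_length, seed):
    """Return `signature_length` random permutations of range(n_rows)."""
    rng = random.Random(seed)  # private generator; global random state is untouched
    return [rng.sample(range(n_rows), n_rows) for _ in range(signature_length)]


perms_a = make_permutations(5, 2, seed=42)
perms_b = make_permutations(5, 2, seed=42)
print(perms_a == perms_b)  # True: the same seed gives the same permutations
```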


In [None]:
#TODO write out our results ans.txt 
#TODO set seed in our perm hash function
#TODO Find pairs of similar users from buckets
#TODO Elegance aka classes, comments/report/citation
#TODO IMPROVE EFFICIENCY FOR LONGER SIGNATURES Time < 15

In [None]:
def main(argv):
    # argv excludes the script name, so the seed and path are at indices 0 and 1.
    seed = int(argv[0])
    path = argv[1]
    print(seed, path)

The following snippet starts the program and passes the command line arguments to the `main` function.

In [None]:
if __name__ == "__main__":
    main(sys.argv[1:])