

## **Advances in Data Mining**

Stephan van der Putten | (s1528459) | stvdputtenjur@gmail.com  
Theo Baart | s2370328 | s2370328@student.leidenuniv.nl

### **Assignment 2**
This assignment is concerned with finding similar users in the provided data source: specifically, all pairs of users whose Jaccard similarity exceeds 0.5. It also compares the "naïve implementation" with the "LSH implementation". The "naïve implementation" can be found in the file `time_estimate.ipynb` and the "LSH implementation" in the file `lsh.ipynb`.
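
For reference, the Jaccard similarity of two users is the size of the intersection of their rated-movie sets divided by the size of their union. The snippet below is a minimal sketch of that definition; the function name `jaccard` and the two example sets are illustrative and not taken from the notebooks.

```python
def jaccard(a, b):
    """Jaccard similarity |a & b| / |a | b| of two collections of ids."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0  # convention chosen here for two empty sets
    return len(a & b) / len(union)

# Hypothetical users, each described by the set of movie ids they rated.
user_a = {1, 2, 3, 4}
user_b = {2, 3, 4, 5}
print(jaccard(user_a, user_b))  # 3 shared movies out of 5 distinct ones
```

A pair such as this one, with similarity 0.6, would be reported; a pair at exactly 0.5 would not, since the assignment asks for more than 0.5.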

Note: all implementations are based on the assignment guidelines, the helper files provided, and the documentation of the functions used.


POSSIBLE SOURCES:
<http://www.hcbravo.org/dscert-mldm/projects/project_1/>
<https://colab.research.google.com/drive/1HetBrWFRYqwUxn0v7wIwS7COBaNmusfD#scrollTo=hzPw8EMoW4i4&forceEdit=true&sandboxMode=true>


#### **LSH Implementation**
This notebook implements LSH to find all pairs of users with a Jaccard similarity of more than 0.5. As noted in the assignment instructions, the data file is loaded from `user_movie.npy` and the list of user pairs is written to the file `ans.txt`. Additionally, this implementation supports setting a random seed that determines the permutations used in LSH. The algorithm saves its output continually, because the evaluation criteria only consider the first 15 minutes of the LSH execution.
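
One way to save output continually is to append each pair to the results file as soon as it is confirmed. The sketch below illustrates that idea only: the helper name `append_pair` and the `u1,u2` line format are assumptions, not something specified in this notebook.

```python
import os
import tempfile

def append_pair(path, u1, u2):
    """Append one user pair to the results file, smaller id first."""
    lo, hi = sorted((int(u1), int(u2)))
    # Append mode plus an immediate close keeps the file readable
    # even if the run is interrupted afterwards.
    with open(path, "a") as f:
        f.write(f"{lo},{hi}\n")

# Demonstration in a temporary directory rather than the real ans.txt.
demo_path = os.path.join(tempfile.mkdtemp(), "ans.txt")
append_pair(demo_path, 7, 3)
append_pair(demo_path, 1, 2)
with open(demo_path) as f:
    print(f.read())
```

Opening the file once per pair is simple but slow for many pairs; buffering a batch of pairs before each write would be a reasonable variation.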
___

The following snippet handles all imports.

In [1]:
import time
import sys
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix, csc_matrix, coo_matrix, lil_matrix, find
from scipy.sparse import identity
from collections import defaultdict

### **Program Execution**
This section is concerned with parsing the input arguments and determining the execution flow of the program.

___
The `main` function handles the start of execution from the command line.

In order to do this the function uses the following parameters:
  * `argv` - the command line arguments given to the program
  
The following command line arguments are expected:
  * `seed` - the value to use as random seed
  * `path` - the location of the `user_movie.npy` file

In [2]:
user_movie = np.load('datasets/user_movie.npy')

In [None]:
%%time
# Column 0 holds the user ids, column 1 the movie ids.
c = user_movie[:,0]
r = user_movie[:,1]
d = np.ones(len(c))
# Matrix dimensions; this assumes the ids are 0-based and contiguous.
max_c = len(np.unique(c))
max_r = len(np.unique(r))
# Rows are movies and columns are users, stored in both sparse layouts.
csc = csc_matrix((d, (r,c)), shape=(max_r, max_c))
csr = csr_matrix((d, (r,c)), shape=(max_r, max_c))
signature_length = 50

# example = np.array([[1,0,0,1],[0,0,1,0],[0,1,0,1],[1,0,1,0],[0,0,1,0]])
# hash_func = np.array([[4,3,1,2,0], [3,0,4,2,1]])

In [100]:
example = np.array([[1,0,0,1,1,0,0,1],[0,0,1,0,0,0,1,0],[0,0,1,0,0,0,1,0],[0,1,0,1,1,0,0,1],[1,0,1,0,1,0,1,0],[0,1,1,1,1,1,0,1]])
hash_func = np.array([[5,4,3,1,2,0],[3,1,2,0,5,4],[1,2,0,5,4,3],[2,0,5,4,3,1],[0,5,4,3,1,2],[3,0,4,2,1,5],[0,4,2,1,5,3],[4,2,1,5,3,0],[2,1,5,3,0,4]])
c = csr_matrix(example)
s = rowminhash(9, hash_func, c)

In [101]:
display(c.todense())
display(hash_func)
display(s)

matrix([[1, 0, 0, 1, 1, 0, 0, 1],
        [0, 0, 1, 0, 0, 0, 1, 0],
        [0, 0, 1, 0, 0, 0, 1, 0],
        [0, 1, 0, 1, 1, 0, 0, 1],
        [1, 0, 1, 0, 1, 0, 1, 0],
        [0, 1, 1, 1, 1, 1, 0, 1]], dtype=int32)

array([[5, 4, 3, 1, 2, 0],
       [3, 1, 2, 0, 5, 4],
       [1, 2, 0, 5, 4, 3],
       [2, 0, 5, 4, 3, 1],
       [0, 5, 4, 3, 1, 2],
       [3, 0, 4, 2, 1, 5],
       [0, 4, 2, 1, 5, 3],
       [4, 2, 1, 5, 3, 0],
       [2, 1, 5, 3, 0, 4]])

array([[2., 0., 0., 0., 0., 0., 2., 0.],
       [3., 0., 1., 0., 0., 4., 1., 0.],
       [1., 3., 0., 1., 1., 3., 0., 1.],
       [2., 1., 0., 1., 1., 1., 0., 1.],
       [0., 2., 1., 0., 0., 2., 1., 0.],
       [1., 2., 0., 2., 1., 5., 0., 2.],
       [0., 1., 2., 0., 0., 3., 2., 0.],
       [3., 0., 0., 0., 0., 0., 1., 0.],
       [0., 3., 0., 2., 0., 4., 0., 2.]])

In [3]:
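def rowminhash(signature_length, hash_func, matrix):
    # Signature matrix: one row per permutation, one column per user.
    sigm = np.full((signature_length, matrix.shape[1]), np.inf)
    for row in range(matrix.shape[0]):
        # Columns (users) that contain a 1 in this row.
        ones = find(matrix[row, :])[1]
        # Position of this row under each permutation, as a column vector.
        row_hashes = hash_func[:, row].reshape(-1, 1)
        # Keep the smallest position seen so far for the affected columns only.
        sigm[:, ones] = np.minimum(sigm[:, ones], row_hashes)
    return sigm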
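
The reason min-hash signatures are useful is that the fraction of signature rows on which two users agree estimates their Jaccard similarity. The sketch below checks this on two synthetic users. It is self-contained and uses a dense matrix for brevity, so `minhash_signatures` and all the sizes chosen are illustrative assumptions rather than the notebook's own code.

```python
import numpy as np

def minhash_signatures(matrix, perms):
    """Min-hash signatures for a dense 0/1 matrix (rows = items, columns = users)."""
    sig = np.full((perms.shape[0], matrix.shape[1]), np.inf)
    for row in range(matrix.shape[0]):
        cols = np.nonzero(matrix[row])[0]
        # Smallest permuted row position seen so far, per permutation and user.
        sig[:, cols] = np.minimum(sig[:, cols], perms[:, row].reshape(-1, 1))
    return sig

rng = np.random.default_rng(0)
n_items, n_perm = 200, 500
col_a = np.zeros(n_items, dtype=int)
col_a[:60] = 1      # user A rated items 0..59
col_b = np.zeros(n_items, dtype=int)
col_b[20:80] = 1    # user B rated items 20..79
matrix = np.column_stack([col_a, col_b])
perms = np.array([rng.permutation(n_items) for _ in range(n_perm)])

sig = minhash_signatures(matrix, perms)
true_jaccard = 40 / 80                        # 40 shared items, 80 in the union
estimate = np.mean(sig[:, 0] == sig[:, 1])    # fraction of agreeing rows
print(true_jaccard, estimate)
```

With only 50 permutations, as used in this notebook, the estimate is noticeably noisier than with the 500 used here.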

In [4]:
import scipy.optimize as opt
import math

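def choose_nbands(t, n):
    # Find the number of bands b whose approximate threshold (1/b)^(b/n)
    # is closest to the target threshold t, for a signature of length n.
    def error_fun(x):
        cur_t = (1/x[0])**(x[0]/n)
        return (t-cur_t)**2

    opt_res = opt.minimize(error_fun, x0=(10), method='Nelder-Mead')
    b = int(math.ceil(opt_res['x'][0]))
    # Rows per band, rounded so that b*r stays close to n.
    r = round(n / b)
    final_t = (1/b)**(1/r)
    return b, final_t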




def do_lsh(sign_matrix, signature_length, threshold):
    return 0
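
The threshold formula used in `choose_nbands` is an approximation of where the banding "S-curve" rises steeply. With `b` bands of `r` rows each, two users with Jaccard similarity `s` become a candidate pair with probability `1 - (1 - s**r)**b`. The sketch below evaluates both expressions; the helper names are illustrative.

```python
def candidate_probability(s, b, r):
    """Probability that two users with similarity s agree on at least one band."""
    return 1 - (1 - s ** r) ** b

def approx_threshold(b, r):
    """Approximate similarity at which the S-curve rises most steeply."""
    return (1 / b) ** (1 / r)

b, r = 13, 4
print("threshold:", approx_threshold(b, r))
for s in (0.3, 0.5, 0.7):
    print(s, round(candidate_probability(s, b, r), 3))
```

Evaluating the curve at a few similarities shows how many true pairs near 0.5 a given choice of `b` and `r` is expected to miss, which the single threshold value does not reveal.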

In [138]:
def make_random_hash_fn(p=2**31-1, m=4294967295):
# def make_random_hash_fn(p=2**33-355, m=4294967295):
    a = np.random.randint(1,p-1)
    b = np.random.randint(0, p-1)
    return lambda x: ((a * x + b) % p) % m

In [139]:
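hf = make_random_hash_fn()  # call the factory to obtain a hash function

In [150]:
s = hash('hi')
print(s)
h = hf(s)
print(h)

7431932759269154899


ValueError: high is out of bounds for int32

The `ValueError` above appears to come from a run in which `hf` was bound to the factory itself (`hf = make_random_hash_fn`), so that `hf(s)` passed the 64-bit string hash as `p` to `np.random.randint`. With `hf = make_random_hash_fn()`, the call returns a hash value instead.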

In [162]:
from collections import defaultdict

def lsh_r_bucket_to_id(sigm, ids, b, r, numhashes):
    # sigm: signature matrix with one column per user.
    # ids: the user id belonging to each column.
    # b: number of bands, r: rows per band.
    # numhashes is not used here; it is kept so existing calls still work.
    ids = np.array(ids)
    number_of_users = sigm.shape[1]
    hash_buckets = defaultdict(list)
    hf = make_random_hash_fn()
    t1 = time.time()
    for i in range(number_of_users):
        if i % 10000 == 0:
            print(str(round(100*i/number_of_users,2))+' percent complete in '+str(round(time.time()-t1,2))+ ' seconds')
        row = sigm[:,i]
        for j in range(b):
            # The string form of the band slice identifies the bucket.
            r_signature = str(row[j*r:(j+1)*r])
            r_hash = hf(hash(r_signature))
            # NOTE: the key does not include the band index j, so identical
            # slices from different bands end up in the same bucket.
            hash_buckets[r_hash].append(ids[i])
    # Deduplicate the users in each bucket.
    return {k: set(v) for k, v in hash_buckets.items()}
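
The function above keys buckets only on the hashed band slice, so two users can share a bucket because different bands happen to contain the same values. A common variation is to include the band index in the key. The sketch below shows that variation on a toy signature matrix; `band_buckets` and the toy data are illustrative assumptions, not the notebook's implementation.

```python
from collections import defaultdict
import numpy as np

def band_buckets(sigm, b, r):
    """Group columns of a signature matrix by (band index, band contents)."""
    buckets = defaultdict(set)
    for user in range(sigm.shape[1]):
        for band in range(b):
            # A tuple is hashable; including `band` keeps the bands separate.
            key = (band, tuple(sigm[band * r:(band + 1) * r, user]))
            buckets[key].add(user)
    return buckets

# Toy signature matrix: 4 rows, 3 users, split into 2 bands of 2 rows.
toy = np.array([[1, 1, 5],
                [2, 2, 6],
                [3, 7, 1],
                [4, 8, 2]])
result = band_buckets(toy, b=2, r=2)
candidates = [users for users in result.values() if len(users) > 1]
print(candidates)  # only users 0 and 1 agree within the same band
```

User 2 has the values `(1, 2)` in band 1, the same values users 0 and 1 have in band 0. Because the band index is part of the key, user 2 is not grouped with them.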

In [118]:
# sigm100 = np.load('datasets/sign_matrix_100.npy')
sigm50 = np.load('datasets/sign_matrix.npy')

In [None]:
# Tests multiple values of b and r and prints those within 0.1 of our ideal t value
#TODO make function and auto generate b/r
for b in range(1,15):
    r = int(numhashes / b)
    for r_i in range (r-2,r+3):
        t = (1/b)**(1/r_i)
        if t > 0.4 and t < 0.6:
            print(b,r_i,t)

In [163]:
sigm = sigm50
threshold=0.57 # overshoot so as to get a similarity matrix closer to 0.5
numhashes = sigm.shape[0]
# b, _ = choose_nbands(threshold, numhashes)
b = 13
r = 4
# b = 3
# r = 3
print(b*r,'vs',numhashes)
# r = round(numhashes / b)
threshold = (1/b)**(1/r)
print(threshold,b,r)
user_ids = np.array(list(range(sigm.shape[1])))
buckets = lsh_r_bucket_to_id(sigm,user_ids,b,r,numhashes)

52 vs 50
0.5266403878479267 13 4
0.0 percent complete in 0.0 seconds
9.64 percent complete in 16.02 seconds
19.29 percent complete in 33.82 seconds
28.93 percent complete in 49.26 seconds
38.57 percent complete in 65.9 seconds
48.21 percent complete in 82.13 seconds
57.86 percent complete in 101.35 seconds
67.5 percent complete in 116.98 seconds
77.14 percent complete in 133.21 seconds
86.79 percent complete in 154.04 seconds
96.43 percent complete in 170.41 seconds



In [97]:
from collections import defaultdict
n_b = 3  # number of bands
n_r = 3  # rows per band
t = s.T
bu = defaultdict(list)
print('s.t',t)
for i in range(8):
    col = s[:,i]  # signature of user i
    for band in range(n_b):
        r_s = col[band*n_r:(band+1)*n_r]
        # The printed band slice is the bucket key (the band index is not part of it).
        r_h = str(r_s)
        bu[r_h].append(i)
print(bu)
bu_s = {k: set(v) for k,v in bu.items()}
print(bu_s)

s.t [[2. 3. 1. 2. 0. 1. 0. 3. 0.]
 [0. 0. 3. 1. 2. 2. 1. 0. 3.]
 [0. 1. 0. 0. 1. 0. 2. 0. 0.]
 [0. 0. 1. 1. 0. 2. 0. 0. 2.]
 [0. 0. 1. 1. 0. 1. 0. 0. 0.]
 [0. 4. 3. 1. 2. 5. 3. 0. 4.]
 [2. 1. 0. 0. 1. 0. 2. 1. 0.]
 [0. 0. 1. 1. 0. 2. 0. 0. 2.]]
defaultdict(<class 'list'>, {'[2. 3. 1.]': [0], '[2. 0. 1.]': [0], '[0. 3. 0.]': [0], '[0. 0. 3.]': [1], '[1. 2. 2.]': [1], '[1. 0. 3.]': [1], '[0. 1. 0.]': [2, 2, 6], '[2. 0. 0.]': [2], '[0. 0. 1.]': [3, 4, 7], '[1. 0. 2.]': [3, 7], '[0. 0. 2.]': [3, 7], '[1. 0. 1.]': [4], '[0. 0. 0.]': [4], '[0. 4. 3.]': [5], '[1. 2. 5.]': [5], '[3. 0. 4.]': [5], '[2. 1. 0.]': [6, 6]})
{'[2. 3. 1.]': {0}, '[2. 0. 1.]': {0}, '[0. 3. 0.]': {0}, '[0. 0. 3.]': {1}, '[1. 2. 2.]': {1}, '[1. 0. 3.]': {1}, '[0. 1. 0.]': {2, 6}, '[2. 0. 0.]': {2}, '[0. 0. 1.]': {3, 4, 7}, '[1. 0. 2.]': {3, 7}, '[0. 0. 2.]': {3, 7}, '[1. 0. 1.]': {4}, '[0. 0. 0.]': {4}, '[0. 4. 3.]': {5}, '[1. 2. 5.]': {5}, '[3. 0. 4.]': {5}, '[2. 1. 0.]': {6}}


In [160]:
%%time
import itertools as it
pairs = set()
max_l = 0
# for b in range(len(buckets)):
#     print('band:',b)
sub_buckets = buckets
print('no_buckets:',len(sub_buckets))
short_buckets = {k: v for k, v in sub_buckets.items() if len(v) >= 2}
print('no_candidate_buckets',len(short_buckets))
i = 0
# for v in short_buckets.values():
#     if len(v) > max_l:
#         max_l = len(v)
# #         print(i,len(v))
#     pairs.update(set(it.combinations(v,2)))
# #         i += 1
# #         if i == 100:
# #             break
print('max_bucket_size:',max_l)
print('no_pairs:',len(pairs))
print('pairs:',pairs)

no_buckets: 498112
no_candidate_buckets 120311
max_bucket_size: 0
no_pairs: 0
pairs: set()
Wall time: 7min 4s
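
The pair-generation loop in the cell above is commented out, and finding pairs from the buckets is still listed as a TODO. One possible approach is to enumerate pairs inside each candidate bucket and keep only those whose true Jaccard similarity exceeds the threshold. The sketch below assumes a hypothetical `user_sets` mapping from user id to a non-empty set of rated movies; it is not the notebook's method.

```python
from itertools import combinations

def verified_pairs(buckets, user_sets, threshold=0.5):
    """Return pairs sharing a bucket whose Jaccard similarity exceeds threshold."""
    checked, result = set(), set()
    for users in buckets.values():
        for u, v in combinations(sorted(users), 2):
            if (u, v) in checked:
                continue  # this pair was already compared via another bucket
            checked.add((u, v))
            a, b = user_sets[u], user_sets[v]
            if len(a & b) / len(a | b) > threshold:
                result.add((u, v))
    return result

# Hypothetical toy input: bucket key -> user ids, user id -> rated movies.
buckets = {"k1": {0, 1, 2}, "k2": {1, 2}}
user_sets = {0: {1, 2, 3, 4}, 1: {2, 3, 4, 5}, 2: {7, 8, 9}}
print(verified_pairs(buckets, user_sets))
```

The number of pairs grows quadratically with bucket size, so very large buckets dominate the running time; comparing signatures before computing the exact similarity is one way to reduce that cost.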


In [161]:
{k: short_buckets[k] for k in sorted(short_buckets.keys())[:20]}

{-9223094484734418285: {52472, 58753, 84248},
 -9222974050898379094: {457,
  7199,
  8132,
  14416,
  18338,
  21690,
  28909,
  30818,
  35018,
  38277,
  39685,
  40053,
  40379,
  41855,
  50486,
  55436,
  58742,
  66894,
  70341,
  75374,
  76171,
  82890,
  83861,
  85429,
  86123,
  88068,
  93008,
  97156,
  97316},
 -9222930170988853674: {3893,
  8766,
  13493,
  17374,
  22121,
  36759,
  38234,
  45760,
  50028,
  50676,
  53207,
  63354,
  75684,
  77106,
  79117,
  88435,
  90271,
  97334,
  99259},
 -9222693069484048805: {6603, 28523, 49473, 84610},
 -9222503931554967014: {27808, 38363, 73242, 82109, 89393},
 -9222461699450372472: {15302, 101127},
 -9222321469531856744: {55017, 73721},
 -9222321243107868221: {25393, 80165},
 -9221776363504249545: {55556, 94358, 101957},
 -9221712686992875266: {79635, 101160},
 -9221698087874820231: {21745, 64302},
 -9221648556140564222: {7884, 46379},
 -9221601800765036966: {230,
  16395,
  19804,
  24495,
  35111,
  35716,
  37557,
  453

In [None]:
np.random.seed(42)  # call the function; assigning to it would overwrite np.random.seed

# example = csr
# example = np.array([[1,0,0,1],[0,0,1,0],[0,1,0,1],[1,0,1,0],[0,0,1,0]])
# %time sigm1 = minhash(signature_length,hash_func, example)
# signature_length = 100
# hash_func = np.array([np.random.permutation(csr.shape[0]) for i in range(signature_length)])
# %time sigm1 = rowminhash(signature_length ,hash_func, csr)

signature_length = 50
hash_func = np.array([np.random.permutation(csr.shape[0]) for i in range(signature_length)])
%time sigm2 = rowminhash(signature_length,hash_func, csr)
# print(sigm2)
# np.save('datasets/sign_matrix_100', sigm1)
# np.save('datasets/sign_matrix', sigm2)
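
Setting the seed for the permutations is still listed as a TODO. One way to make them reproducible is to draw them from a dedicated NumPy generator created from the seed. The sketch below shows the idea on a small example; `make_permutations` is an illustrative name and the sizes are arbitrary.

```python
import numpy as np

def make_permutations(n_rows, signature_length, seed):
    """Return a (signature_length, n_rows) array of row permutations."""
    rng = np.random.default_rng(seed)  # generator local to this call
    return np.array([rng.permutation(n_rows) for _ in range(signature_length)])

p1 = make_permutations(10, 3, seed=42)
p2 = make_permutations(10, 3, seed=42)
p3 = make_permutations(10, 3, seed=7)
print(np.array_equal(p1, p2))  # same seed, same permutations
print(np.array_equal(p1, p3))
```

A local generator avoids relying on NumPy's global random state, so other code that draws random numbers does not change the permutations.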

In [None]:
#TODO write out our results ans.txt 
#TODO set seed in our perm hash function
#TODO Find pairs of similar users from buckets
#TODO Elegance aka classes, comments/report/citation
#TODO IMPROVE EFFICIENCY FOR LONGER SIGNATURES Time < 15

In [None]:
def main(argv):
    # argv holds the arguments after the script name: seed first, then path.
    seed = argv[0]
    path = argv[1]
    print(seed, path)

The following snippet passes the start of the program and the command line arguments to the `main` function.

In [None]:
if __name__ == "__main__":
    main(sys.argv[1:])