

## **Advances in Data Mining**

Stephan van der Putten | s1528459 | stvdputtenjur@gmail.com  
Theo Baart | s2370328 | s2370328@student.leidenuniv.nl

### **Assignment 2**
This assignment is concerned with finding similar users in the provided data source: specifically, all pairs of users whose Jaccard similarity exceeds 0.5. It also compares the "naïve implementation" with the "LSH implementation". The naïve implementation can be found in the file `time_estimate.ipynb` and the LSH implementation in the file `lsh.ipynb`.
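
For reference, the Jaccard similarity of two users is the number of movies both have rated divided by the number of movies at least one of them has rated. The sketch below is only an illustration on plain Python sets; `jaccard_similarity` and the two example users are hypothetical and do not appear in either notebook.

```python
def jaccard_similarity(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| of two collections of movie ids."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        # Two empty sets: define the similarity as 0 to avoid dividing by zero.
        return 0.0
    return len(a & b) / len(union)


# Two hypothetical users who share 3 of the 5 movies rated by either of them.
user_a = {1, 2, 3, 4}
user_b = {2, 3, 4, 5}
sim = jaccard_similarity(user_a, user_b)  # 3 / 5, above the 0.5 threshold
```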

Note that all implementations are based on the assignment guidelines, the helper files provided, and the documentation of the functions used.


POSSIBLE SOURCES:
<http://www.hcbravo.org/dscert-mldm/projects/project_1/>
<https://colab.research.google.com/drive/1HetBrWFRYqwUxn0v7wIwS7COBaNmusfD#scrollTo=hzPw8EMoW4i4&forceEdit=true&sandboxMode=true>


#### **LSH Implementation**
This notebook implements LSH to find all pairs of users with a Jaccard similarity of more than 0.5. As noted in the assignment instructions, the data file is loaded from `user_movie.npy` and the list of user pairs is written to the file `ans.txt`. Additionally, this implementation supports setting a random seed that determines the permutations used in LSH. The algorithm continually saves its output to accommodate the evaluation criteria, which only consider the first 15 minutes of the LSH execution.
___
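
One way to save output continually is to append every newly found pair to `ans.txt` as soon as it is known, so that an interrupted run still leaves a usable file. The sketch below assumes a line format of `u1,u2` with the smaller id first; that format and the helper name `append_pairs` are assumptions and should be checked against the assignment instructions.

```python
import os
import tempfile


def append_pairs(pairs, path="ans.txt"):
    """Append user pairs to `path`, one 'u1,u2' line per pair (assumed format)."""
    # Opening in append mode keeps everything written by earlier calls.
    with open(path, "a") as handle:
        for u1, u2 in pairs:
            lo, hi = sorted((int(u1), int(u2)))
            handle.write(f"{lo},{hi}\n")


# Demonstration on a temporary file instead of the real ans.txt.
demo_path = os.path.join(tempfile.mkdtemp(), "ans.txt")
append_pairs([(5, 2)], demo_path)
append_pairs([(7, 9)], demo_path)
with open(demo_path) as handle:
    written = handle.read()
```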

The following snippet handles all imports.

In [2]:
import time
import sys
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix, csc_matrix, coo_matrix, lil_matrix, find
from scipy.sparse import identity

### **Program Execution**
This section is concerned with parsing the input arguments and determining the execution flow of the program.

___
The `main` function handles the start of execution from the command line.

In order to do this the function uses the following parameters:
  * `argv` - the command line arguments given to the program
  
The following command line arguments are expected:
  * `seed` - the value to use as random seed
  * `path` - the location of the `user_movie.npy` file
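
As a sketch of how the `seed` argument could make the permutations reproducible, the snippet below draws them from a seeded NumPy generator. `make_permutations` is a hypothetical helper that is not part of the notebook, and using `np.random.default_rng` rather than the global `np.random` state is a design choice made for this example only.

```python
import numpy as np


def make_permutations(n_rows, signature_length, seed):
    """Return a (signature_length, n_rows) array with one row permutation per line."""
    rng = np.random.default_rng(seed)  # all randomness flows from this seed
    return np.array([rng.permutation(n_rows) for _ in range(signature_length)])


# The same seed yields the same permutations on every run.
perms_a = make_permutations(n_rows=5, signature_length=3, seed=42)
perms_b = make_permutations(n_rows=5, signature_length=3, seed=42)
```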

In [14]:
class LSH():
    """Minhash signatures for a (user, movie) rating list."""

    def __init__(self, dataset=None, sparse_matrix=None,
                 signature_length=50, minhash=None, hash_func=None, sigm=None):
        self.dataset = dataset
        self.signature_length = signature_length
        self.sparse_matrix = sparse_matrix
        self.minhash = minhash
        self.hash_func = hash_func
        self.sigm = sigm

    def load(self, path):
        self.dataset = np.load(path)

    def create_matrix(self):
        c = self.dataset[:,0]  # first column -> matrix columns
        r = self.dataset[:,1]  # second column -> matrix rows
        d = np.ones(len(c))
        # assumes the ids are contiguous and start at 0
        max_c = len(np.unique(c))
        max_r = len(np.unique(r))
        self.sparse_matrix = csr_matrix((d,(r,c)), shape=(max_r, max_c))

    def create_hashfunc(self):
        # one random permutation of the rows per signature entry
        self.hash_func = np.array([np.random.permutation(self.sparse_matrix.shape[0])
                                   for i in range(self.signature_length)])

    # Not named `minhash`: the attribute of that name set in __init__ would shadow the method.
    def compute_signatures(self):
        sigm = np.full((self.signature_length, self.sparse_matrix.shape[1]), np.inf)
        for row in range(self.sparse_matrix.shape[0]):
            ones = find(self.sparse_matrix[row, :])[1]
            row_hash = self.hash_func[:,row]
            B = sigm.copy()
            B[:,ones] = 1
            B[:,ones] = np.multiply(B[:,ones], row_hash.reshape((len(row_hash), 1)))
            sigm = np.minimum(sigm, B)
        self.sigm = sigm

In [15]:
user_movie = np.load('datasets/user_movie.npy')
lsh = LSH(dataset=user_movie)


In [3]:
%%time
c = user_movie[:,0]
r = user_movie[:,1]
d = np.ones(len(c))
max_c = len(np.unique(c))
max_r = len(np.unique(r))
# m = csr_matrix((d, (r,c)), shape=(max_r, max_c))
csc = csc_matrix((d, (r,c)), shape=(max_r, max_c))
csr = csr_matrix((d, (r,c)), shape=(max_r, max_c))
signature_length = 50

# example = np.array([[1,0,0,1],[0,0,1,0],[0,1,0,1],[1,0,1,0],[0,0,1,0]])
# hash_func = np.array([[4,3,1,2,0], [3,0,4,2,1]])

CPU times: user 4.6 s, sys: 625 ms, total: 5.22 s
Wall time: 5.22 s


In [4]:
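def rowminhash(signature_length, hash_func, matrix):
    # signature matrix: one row per hash function, one column per matrix column
    sigm = np.full((signature_length, matrix.shape[1]), np.inf)
    for row in range(matrix.shape[0]):
        # columns with a non-zero entry in this row
        ones = find(matrix[row, :])[1]
        # value each permutation assigns to this row
        row_hash = hash_func[:,row]
        B = sigm.copy()
        B[:,ones] = 1
        B[:,ones] = np.multiply(B[:,ones], row_hash.reshape((len(row_hash), 1)))
        # keep the smallest hash value seen so far per (hash function, column)
        sigm = np.minimum(sigm, B)
    return sigm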
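
`rowminhash` copies the full signature matrix once per row, which is likely a large part of its running time. The sketch below is one possible variant that updates only the affected columns in place. It assumes the matrix stores no explicit zeros, so the CSR index arrays list exactly the non-zero entries; `rowminhash_inplace` is a hypothetical name and the function has not been timed on the real dataset.

```python
import numpy as np
from scipy.sparse import csr_matrix


def rowminhash_inplace(hash_func, matrix):
    """Minhash signatures for the columns of a sparse 0/1 matrix.

    hash_func: (signature_length, n_rows) array, one row permutation per line.
    matrix:    anything csr_matrix accepts, with one column per user.
    """
    matrix = csr_matrix(matrix)
    sigm = np.full((hash_func.shape[0], matrix.shape[1]), np.inf)
    for row in range(matrix.shape[0]):
        # Column indices of the stored entries of this row, read from the CSR arrays.
        ones = matrix.indices[matrix.indptr[row]:matrix.indptr[row + 1]]
        if ones.size == 0:
            continue
        # Only these columns can change, so take the minimum there and nowhere else.
        sigm[:, ones] = np.minimum(sigm[:, ones], hash_func[:, row][:, None])
    return sigm


# Small example taken from the commented-out test data in this notebook.
demo_matrix = np.array([[1, 0, 0, 1], [0, 0, 1, 0], [0, 1, 0, 1], [1, 0, 1, 0], [0, 0, 1, 0]])
demo_perms = np.array([[4, 3, 1, 2, 0], [3, 0, 4, 2, 1]])
demo_sig = rowminhash_inplace(demo_perms, demo_matrix)
```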

In [11]:
import scipy.optimize as opt
import math

def choose_nbands(t, n):
    def error_fun(x):
        cur_t = (1/x[0])**(x[0]/n)
        return (t-cur_t)**2

    opt_res = opt.minimize(error_fun, x0=(10), method='Nelder-Mead')
    b = int(math.ceil(opt_res['x'][0]))
    r = int(n / b)
    final_t = (1/b)**(1/r)
    return b, final_t
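
`choose_nbands` relies on the standard banding result: with `b` bands of `r` rows each, two users with Jaccard similarity `s` become a candidate pair with probability `1 - (1 - s**r)**b`, and the steep part of that curve lies near `(1/b)**(1/r)`. The helpers below are a hypothetical illustration of these two formulas and are not part of the notebook.

```python
def candidate_probability(s, b, r):
    """Probability that a pair with similarity s shares a bucket in at least one band."""
    return 1.0 - (1.0 - s ** r) ** b


def approx_threshold(b, r):
    """Similarity at which the candidate probability rises most steeply (approximation)."""
    return (1.0 / b) ** (1.0 / r)


# Setting used later in this notebook: 100 hash functions split into 20 bands of 5 rows.
t = approx_threshold(b=20, r=5)
p_low = candidate_probability(0.5, b=20, r=5)   # pair right at the assignment threshold
p_high = candidate_probability(0.8, b=20, r=5)  # clearly similar pair
```

Pairs right at similarity 0.5 are missed fairly often with this setting, which is worth keeping in mind when choosing `b` and `r`.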




def do_lsh(sign_matrix, signature_length, threshold):
    return 0

In [59]:
from collections import defaultdict
threshold=0.5
numhashes = signature_length
b, _ = choose_nbands(threshold, numhashes)
r = int(numhashes / b)
print(b, r)

n_col = csc.shape[1]  # number of columns of the signature matrix
buckets = []          # one hashtable per band
# template: http://www.hcbravo.org/dscert-mldm/projects/project_1/
# https://colab.research.google.com/drive/1HetBrWFRYqwUxn0v7wIwS7COBaNmusfD#scrollTo=hzPw8EMoW4i4&forceEdit=true&sandboxMode=true
for band in range(b):
    # figure out which rows of minhash signature matrix to hash for this band
    start_index = int(band * r)
    end_index = min(start_index + r, numhashes)

    # initialize hashtable for this band
    cur_buckets = defaultdict(list)

    for j in range(n_col):
        # Assumption: `sigm2` is the signature matrix produced by `rowminhash`.
        # Columns with an identical signature slice in this band share a bucket.
        key = tuple(sigm2[start_index:end_index, j])
        cur_buckets[key].append(j)

    # add this hashtable to the list of hashtables
    buckets.append(cur_buckets)

14 3


In [101]:
def lsh_r_bucket_to_id(u, id, b, r):
    u = u.T  # one row per user
    id = np.array(id)
    number_of_users = u.shape[0]

    # one dict per band: band signature -> list of ids in that bucket
    r_bucket_to_id = [dict() for x in range(b)]

    t1 = time.time()

    for i in range(number_of_users):

        if i % 10000 == 0:
            print(str(round(100*i/number_of_users,2))+' percent complete in '+str(round(time.time()-t1,2))+ ' seconds')

        row = u[i,:]
        for j in range(b):
            r_signature = str(row[j*r:(j+1)*r])

            if r_signature in r_bucket_to_id[j]:
                r_bucket_to_id[j][r_signature].append(id[i])
            else:
                r_bucket_to_id[j][r_signature] = [id[i]]

    return r_bucket_to_id
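
The buckets still have to be turned into user pairs. A minimal sketch of that step, assuming the list-of-dicts shape returned by `lsh_r_bucket_to_id`: every two ids that share a bucket in at least one band form a candidate pair, and each candidate is then checked against the real Jaccard similarity. The names `candidate_pairs`, `verified_pairs` and `movies_by_user` are hypothetical.

```python
from itertools import combinations


def candidate_pairs(band_buckets):
    """Collect unordered id pairs that share a bucket in at least one band."""
    pairs = set()
    for buckets in band_buckets:
        for ids in buckets.values():
            if len(ids) > 1:
                # sorted() makes (u1, u2) canonical, so the set removes duplicates.
                pairs.update(combinations(sorted(ids), 2))
    return pairs


def verified_pairs(pairs, movies_by_user, threshold=0.5):
    """Keep only the candidates whose true Jaccard similarity exceeds the threshold."""
    result = []
    for u1, u2 in sorted(pairs):
        a, b = movies_by_user[u1], movies_by_user[u2]
        union = a | b
        if union and len(a & b) / len(union) > threshold:
            result.append((u1, u2))
    return result


# Toy input: two bands, three users, movie sets given as Python sets.
demo_buckets = [{"x": [0, 2], "y": [1]}, {"z": [0, 1, 2]}]
demo_movies = {0: {1, 2, 3}, 1: {1, 2, 3, 4}, 2: {7, 8}}
demo_candidates = candidate_pairs(demo_buckets)
demo_result = verified_pairs(demo_candidates, demo_movies)
```

Large buckets make `combinations` quadratic in the bucket size, so on the full dataset it may be worth skipping or subdividing very large buckets.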

In [105]:
sigm100 = np.load('datasets/sign_matrix_100.npy')

In [124]:
# to = time.time()

threshold=0.55
numhashes = signature_length
numhashes = 100
b, _ = choose_nbands(threshold, numhashes)
r = int(numhashes / b)
# r_bucket_to_id = lsh_r_bucket_to_id(sigm2,docid,b,r)
# r_bucket_to_id = lsh_r_bucket_to_id(sigm100,docid,b,r)

# print(round(time.time()-to,2))

0.41491326668312173

In [125]:
print(_,b,r)
# r_bucket_to_id

0.5492802716530588 20 5


In [47]:
example = np.array([[1,2,0,1],
                    [0,3,1,0],
                    [0,1,4,1],
                    [1,5,1,0],
                    [0,6,1,0]])
hash_func = np.array([[4,3,1,2,0], [3,0,4,2,1]])
print(np.unique(example[:,2],return_inverse=True))
print(np.vstack((example,example)).T)

(array([0, 1, 4]), array([0, 1, 2, 1, 1]))
[[1 0 0 1 0 1 0 0 1 0]
 [2 3 1 5 6 2 3 1 5 6]
 [0 1 4 1 1 0 1 4 1 1]
 [1 0 1 0 0 1 0 1 0 0]]


In [68]:
docid = np.array(list(range(csr.shape[1])))#.reshape((1,csr.shape[1]))
docid.shape

(103703,)

In [6]:
np.random.seed(42)

# example = csr
# example = np.array([[1,0,0,1],[0,0,1,0],[0,1,0,1],[1,0,1,0],[0,0,1,0]])
# %time sigm1 = minhash(signature_length,hash_func, example)
# signature_length = 100
# hash_func = np.array([np.random.permutation(csr.shape[0]) for i in range(signature_length)])
# %time sigm1 = rowminhash(signature_length ,hash_func, csr)

signature_length = 50
hash_func = np.array([np.random.permutation(csr.shape[0]) for i in range(signature_length)])
%time sigm2 = rowminhash(signature_length,hash_func, csr)
# print(sigm2)
# np.save('datasets/sign_matrix_100', sigm1)
# np.save('datasets/sign_matrix', sigm2)

CPU times: user 4min 13s, sys: 2min 47s, total: 7min 1s
Wall time: 7min 1s


In [None]:
#TODO write out our results ans.txt 
#TODO set seed in our perm hash function
#TODO Find pairs of similar users from buckets
#TODO Elegance aka classes, comments/report/citation
#TODO IMPROVE EFFICIENCY FOR LONGER SIGNATURES Time < 15

In [None]:
def main(argv):
    # argv holds the arguments after the script name: [seed, path]
    seed = int(argv[0])
    path = argv[1]
    print(seed, path)

The following snippet starts the program and passes the command line arguments to the `main` function.

In [None]:
if __name__ == "__main__":
    main(sys.argv[1:])