

## **Advances in Data Mining**

Stephan van der Putten | s1528459 | stvdputtenjur@gmail.com  
Theo Baart | s2370328 | s2370328@student.leidenuniv.nl

### **Assignment 2**
This assignment is concerned with finding similar users in the provided data source, specifically all pairs of users with a Jaccard similarity of more than 0.5. It also compares the "naïve implementation" with the "LSH implementation". The "naïve implementation" can be found in the file `time_estimate.ipynb` and the "LSH implementation" in the file `lsh.ipynb`.

Note that all implementations are based on the assignment guidelines, the provided helper files, and the documentation of the functions used.

#### **Naïve Implementation**
This notebook implements a naïve algorithm to find all pairs of users with a Jaccard similarity of more than 0.5. As noted in the assignment instructions, the algorithm may take far too long to finish on the full data set in a reasonable time. This file therefore runs the algorithm on small samples and extrapolates from their timings to estimate the total run time of a full execution. This estimate is the only output delivered by this notebook.
___

The following snippet handles all imports.

In [267]:
import sys
import numpy as np
import pandas as pd
import timeit

### **Helper Functions**
This section contains the helper functions used to estimate the total runtime of the naïve algorithm for user pair similarity.
___

The `get_sample` function is a helper function which returns a subset of the given array. This function is based on the following: 

In order to do this the function uses the following parameters:
  * `data` - the array for which we want a subset
  * `sample_rate` - the rate at which we want to down-sample (between 0.0 and 1.0). Each row is kept independently with probability `sample_rate`, so a `sample_rate` of 0.25 means that the returned subset contains roughly 25% of the original rows.
  
Additionally, it returns the following value:
  * `sample` - the down-sampled subset of the array. 

In [302]:
def get_sample(data,sample_rate):
    mask = np.random.choice([False, True], len(data), p=[1-sample_rate, sample_rate])
    sample = data[mask]
    return sample
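
As a minimal sketch of how this sampling behaves, the snippet below repeats the function on a made-up array (the `toy` data and the fixed seed are assumptions for illustration only). Because every row is kept independently, the size of the sample varies from run to run and is only close to the requested rate.

```python
import numpy as np

def get_sample(data, sample_rate):
    # Same row sampling as above: each row is kept with probability sample_rate.
    mask = np.random.choice([False, True], len(data), p=[1 - sample_rate, sample_rate])
    return data[mask]

np.random.seed(0)  # fixed seed only so the illustration is repeatable
toy = np.arange(20000).reshape(10000, 2)  # 10,000 hypothetical (user, movie) rows
sample = get_sample(toy, 0.25)
fraction = len(sample) / len(toy)
print(fraction)  # close to 0.25, but generally not exactly 0.25
```

Note that the rows are sampled, not the users, so a sampled user usually keeps only part of their original movie list.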

The `get_columns` function takes an array of (user, movie) rows and returns a dictionary of columns. Each column contains the movies watched by a given user.

In order to do this the function uses the following parameters:
  * `array` - the array from which we want to retrieve the columns
  
Additionally, it returns the following value:
  * `columns` - a dictionary of columns. 

In [269]:
def get_columns(array):  
#     %time df = pd.DataFrame({'User':array[:,0],'Movie':array[:,1]})
#     %time ct = pd.crosstab(df.Movie, df.User)
#     %time matrix = ct.to_numpy()
    users = np.unique(array[:,0])
    columns = {}
    for user in users:
        rows = array[np.where(array[:,0]==user)]
        column = rows[:,1]
        columns[user] = column
    return columns
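
The loop above scans the whole array once per user. As a hedged alternative sketch (not part of the original notebook), the same dictionary could be built by sorting the rows by user once and splitting at the user boundaries. The name `get_columns_sorted` and the `toy` array are hypothetical.

```python
import numpy as np

def get_columns_sorted(array):
    # Hypothetical alternative to get_columns: sort once by user, then split.
    order = np.argsort(array[:, 0], kind="stable")  # stable keeps row order per user
    sorted_rows = array[order]
    # starts[k] is the index of the first row belonging to users[k]
    users, starts = np.unique(sorted_rows[:, 0], return_index=True)
    movie_groups = np.split(sorted_rows[:, 1], starts[1:])
    return dict(zip(users, movie_groups))

# Toy (user, movie) rows, made up for illustration
toy = np.array([[1, 10], [2, 10], [1, 20], [3, 30], [2, 40]])
columns = get_columns_sorted(toy)
for user, movies in columns.items():
    print(user, movies.tolist())
```

For the toy input, user 1 maps to movies 10 and 20, user 2 to movies 10 and 40, and user 3 to movie 30.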

The `get_jaccard` function receives two columns and returns their Jaccard similarity, i.e. the size of the intersection of the two movie sets divided by the size of their union.

In order to do this the function uses the following parameters:
  * `c1` - the first column
  * `c2` - the second column
  
Additionally, it returns the following value:
  * `jaccard` - the Jaccard similarity of `c1` and `c2` 

In [270]:
def get_jaccard(c1,c2):
    s1 = set(c1)
    s2 = set(c2)
    union = s1.union(s2)
    intersection = s1.intersection(s2)
    jaccard = len(intersection) / len(union)
    return jaccard
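
A small worked example may help to make the threshold concrete. The two movie lists below are made up. They share three movies out of five distinct ones, which gives a similarity of 3/5 = 0.6 and would therefore count as a similar pair.

```python
def get_jaccard(c1, c2):
    # Same computation as above, written with set operators.
    s1, s2 = set(c1), set(c2)
    return len(s1 & s2) / len(s1 | s2)

a = [1, 2, 3, 4]  # hypothetical movies of user A
b = [2, 3, 4, 5]  # hypothetical movies of user B
similar = get_jaccard(a, b)          # intersection {2, 3, 4}, union {1, 2, 3, 4, 5}
dissimilar = get_jaccard([1, 2], [2, 3])  # intersection {2}, union {1, 2, 3}
print(similar, dissimilar)
```

The function assumes that at least one of the two columns is non-empty. Two empty columns would lead to a division by zero, which should not occur here because every user in the dictionary has at least one row.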

The `get_pairs` function iterates over every unordered pair of columns and returns all pairs of users whose columns have a Jaccard similarity above 0.5.

In order to do this the function uses the following parameters:
  * `columns` - a dictionary of columns which may potentially be matched
  
Additionally, it returns the following value:
  * `pairs` - a list of user pairs with a Jaccard similarity above 0.5 

In [271]:
def get_pairs(columns):
    pairs = []
    users = list(columns.keys())
    u = len(users)
    for i in range(u):
        user = users[i]
        column = columns[user]
        for j in range(i+1,u):
            other_user = users[j]
            other_column = columns[other_user]
            if get_jaccard(column,other_column) > 0.5:
                pairs.append([user,other_user])
    return pairs
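
The same all-pairs loop can be sketched with `itertools.combinations`, which yields each unordered pair exactly once. The variant below is an illustration under assumptions, not the notebook's implementation: the name `get_pairs_combinations`, the `threshold` parameter and the toy data are hypothetical, and the sets are built once per user instead of once per comparison.

```python
from itertools import combinations

def get_pairs_combinations(columns, threshold=0.5):
    # Hypothetical variant of get_pairs: convert each column to a set once.
    sets = {user: set(column) for user, column in columns.items()}
    pairs = []
    for user, other_user in combinations(sets, 2):
        s1, s2 = sets[user], sets[other_user]
        if len(s1 & s2) / len(s1 | s2) > threshold:
            pairs.append([user, other_user])
    return pairs

# Toy columns, made up for illustration
columns = {"a": [1, 2, 3, 4], "b": [2, 3, 4, 5], "c": [9]}
pairs = get_pairs_combinations(columns)
print(pairs)
```

Only users `a` and `b` exceed the threshold here (similarity 0.6). The number of comparisons is still quadratic in the number of users, which is exactly what makes the naïve approach expensive.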

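The `scale_interim` function estimates the runtime for the full data set from the runtime measured on a sample. It scales the measured runtime by the ratio between the number of pairwise comparisons needed for the full data set and the number needed for the sample.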

In order to do this the function uses the following parameters:
  * `interim` - the run time to be scaled
  * `users_sample` - the number of users in the sample
  * `users_full` - the number of users in the full data set
  
Additionally, it returns the following value:
  * `runtime` - the estimated scaled runtime for the naïve algorithm. 

In [345]:
def scale_interim(interim, users_sample, users_full):
    # get_pairs compares every unordered pair once: n * (n - 1) / 2 comparisons
    full_comparisons = users_full * (users_full - 1) / 2
    interim_comparisons = users_sample * (users_sample - 1) / 2
    scale_factor = full_comparisons / interim_comparisons
    
    runtime = scale_factor * interim
    return runtime
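
The scaling rests on the quadratic growth of the number of pairwise comparisons. The nested loop in `get_pairs` performs exactly n(n-1)/2 comparisons for n users, which approaches n²/2 as n grows. The arithmetic can be sketched with made-up numbers (the user counts and the 2-second timing below are hypothetical, not measurements):

```python
def comparisons(n):
    # Number of unordered pairs among n users.
    return n * (n - 1) / 2

users_sample, users_full = 100, 10_000  # hypothetical user counts
interim = 2.0                           # hypothetical sample runtime in seconds

scale_factor = comparisons(users_full) / comparisons(users_sample)
estimate = scale_factor * interim
print(scale_factor, estimate)
```

With these numbers the full run needs 49,995,000 comparisons against 4,950 for the sample, so the sample runtime is multiplied by 10,100. The estimate only accounts for the number of comparisons. It assumes that a single comparison costs the same in the sample as in the full data set.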

### **Test Execution**
This section is concerned with running multiple tests in order to determine what the estimated total runtime is for the naïve algorithm for user pair similarity.
___

The `time_estimator` function is the main runner for determining the total runtime for the naïve algorithm.

In order to do this the function uses the following parameters:
  * `data` - the raw data from `user_movie.npy`
  * `sample_rate` - the fraction of rows to sample when estimating the total runtime (between 0.0 and 1.0)
  * `sample_count` - the number of samples to take
  
Additionally, it returns the following value:
  * `runtime` - the final estimated runtime for the naïve algorithm. 

In [347]:
def time_estimator(data, sample_rate, sample_count):
    durations = []
    users_samples = []
    for i in range(sample_count):   
        start_run = timeit.default_timer()
        subset = get_sample(data, sample_rate)
        columns = get_columns(subset)
        pairs = get_pairs(columns)
        end_run = timeit.default_timer()
        users_samples.append(len(columns.keys()))
        durations.append((end_run - start_run))
    interim = np.mean(durations)
    users_full = len(np.unique(data[:,0]))
    users_sample = np.mean(users_samples)
    final = scale_interim(interim,users_sample,users_full)
    return final

### **Program Execution**
This section is concerned with parsing the input arguments and determining the execution flow of the program.
___
The `main` function handles the start of execution from the command line.

The following command line arguments are expected:
  * `path` - the location of the `user_movie.npy` file

In [1]:
def main(path):    
    # convert path to matrix
    data = np.load(path)
    
    # execute time estimator
    sample_rate = 0.0001
    sample_count = 4
    estimate = time_estimator(data,sample_rate,sample_count)
    print(estimate)
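
The printed estimate is a plain number of seconds, since it is derived from `timeit.default_timer` differences. As an optional sketch, a small helper could turn it into a more readable duration. The helper name `format_estimate` and the example value are hypothetical and not part of the notebook.

```python
from datetime import timedelta

def format_estimate(seconds):
    # Hypothetical helper: render a duration in seconds as days and h:mm:ss.
    return str(timedelta(seconds=round(seconds)))

readable = format_estimate(200000)  # made-up estimate of 200,000 seconds
print(readable)  # 2 days, 7:33:20
```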

The following snippet is the entry point of the program. It reads the command line argument and passes it to the `main` function.

In [None]:
if __name__ == "__main__":
    path = sys.argv[1]
    main(path)