# Time Estimate

The task was to implement a naïve algorithm that calculates the similarities of all pairs of users, and to run some tests to estimate the total run time of this “exact” algorithm (there is no need to wait until it finishes).
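As a point of reference for the cells below, here is a minimal sketch of the two quantities the estimate depends on: the Jaccard similarity of two sets and the number of distinct user pairs. The function names `jaccard_similarity` and `n_unordered_pairs` are illustrative and do not come from the notebook, and returning 0.0 for two empty sets is a convention assumed here.

```python
def jaccard_similarity(a, b):
    """Jaccard similarity |A & B| / |A | B| of two sets."""
    union = a | b
    if not union:
        return 0.0  # assumed convention for two empty sets
    return len(a & b) / len(union)


def n_unordered_pairs(n_users):
    """Number of distinct user pairs {u, v} with u != v."""
    return n_users * (n_users - 1) // 2


print(jaccard_similarity({1, 2, 3}, {2, 3, 4}))  # 2 shared / 4 total = 0.5
print(n_unordered_pairs(100_000))                # 4999950000
```

The pair count grows quadratically with the number of users, which is why even a fast per-pair computation adds up to a long total run time.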

In [16]:
import numpy as np
import scipy
import pandas as pd
import time
import math
import itertools

In [17]:
data = np.load('user_movie.npy')

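In [25]:
def Jaccard_multiple(data):
    """
    Takes the first 10 users (ids 0-9) and calculates the similarities between them.
    Out: estimated time to calculate similarities of all users.
    """
    # The timing covers both selecting each user's rows and computing the similarity.
    start_time = time.time()

    for i in range(10):

        # Column 0 holds the user id, column 1 the movie id.
        # Only the movie ids go into the set, so user ids do not affect the similarity.
        movie_comp = data[np.where(data[:,0] == i)]
        movie_comp = set(movie_comp[:,1])

        for j in range(i, 10):

            movies = data[np.where(data[:,0] == j)]
            movies = set(movies[:,1])

            similarity = len((movie_comp & movies)) / len((movie_comp | movies))

    # 10000*5000 is the hard-coded factor that scales the timed sample up to the full dataset.
    print('Estimated total time in years: %s'%(((time.time() - start_time)*10000*5000)/(60*60*24*365)))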

In [19]:
def Jaccard(data, iterations):
    """
    Takes one random pair of users and calculates the similarity between them.
    Repeats this for a number of iterations.
    Out: Estimated time to calculate similarities of all possible permutations.
    """
    n = len(np.unique(data[:,0]))
    # Ordered pairs: n!/(n-2)! = n*(n-1). Jaccard similarity is symmetric, so the
    # n*(n-1)/2 unordered pairs would suffice; this count includes each pair twice.
    permutations = n * (n - 1)
    # For computational reasons, sample only from pairs of the first 100 users.
    combi = list(itertools.combinations(range(100), 2))
    end_times = []

    for i in range(iterations):
        rand = np.random.randint(0,len(combi))
        select = combi[rand]

        user1 = data[np.where(data[:,0] == select[0])]
        user2 = data[np.where(data[:,0] == select[1])]
        comb1 = set(user1[:,1])
        comb2 = set(user2[:,1])

        # Only the set operations are timed, not the row selection above.
        start_time = time.time()
        similarity = len((comb1 & comb2)) / len((comb1 | comb2))
        end_times.append(time.time() - start_time)

    # Mean time per pair, scaled to the number of ordered pairs.
    total = sum(end_times)
    out = (total/iterations)*permutations
    return ('Estimated total time in days: %s'% (out/(60*60*24)))

In [26]:
Jaccard_multiple(data)

Estimated total time in years: 9.043563869869207


In [27]:
Jaccard(data, 100)

'Estimated total time in days: 8.446372286527577'

## Notes and results: Jaccard
For this exercise we tried two different approaches. The first calculates multiple similarities for a few users and then scales up the time this takes to estimate the total run time. For this method we timed the whole process, from selecting the users and their data to calculating their similarities. By this estimate the full computation would take extremely long, on the order of several years (about 9 years in the run above).

The other method selects two random users (for computational reasons we only select from the possible combinations of the first 100 users) and calculates their similarity. Only the similarity calculation itself is timed, and the mean time is multiplied by the total number of possible permutations of user pairs in the dataset. This process is repeated for several random combinations to get a more accurate estimate. With this method the calculation for all possible user pairs would take less than 10 days (about 8.4 days in the run above). Because this timing excludes the row selection that the first method includes, the two estimates are not directly comparable.
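The extrapolation described in these notes can be sketched as follows. This is a hedged illustration rather than the notebook's own code: the function `estimate_total_seconds`, its parameter `n_sample_users`, and the synthetic `toy` array are all assumptions, with `toy` standing in for `user_movie.npy` as rows of (user id, movie id). It times every pair among a small sample of users, row selection included, and scales the mean time per pair to the number of unordered pairs in the whole dataset.

```python
import itertools
import time

import numpy as np


def estimate_total_seconds(data, n_sample_users=20):
    """Time all pairs among the first n_sample_users and extrapolate."""
    user_ids = np.unique(data[:, 0])
    sample = user_ids[:n_sample_users]

    start = time.perf_counter()
    pairs_timed = 0
    for u, v in itertools.combinations(sample, 2):
        # Row selection is inside the timed region, as in the first approach.
        movies_u = set(data[data[:, 0] == u, 1])
        movies_v = set(data[data[:, 0] == v, 1])
        _ = len(movies_u & movies_v) / len(movies_u | movies_v)
        pairs_timed += 1
    elapsed = time.perf_counter() - start

    # Scale the mean time per pair to all unordered pairs of users.
    n = len(user_ids)
    total_pairs = n * (n - 1) // 2
    return elapsed / pairs_timed * total_pairs


# Synthetic stand-in for user_movie.npy: rows of (user_id, movie_id).
rng = np.random.default_rng(0)
toy = np.column_stack([rng.integers(0, 50, 2000), rng.integers(0, 300, 2000)])
print('Estimated total time in seconds: %s' % estimate_total_seconds(toy))
```

Counting the pairs actually timed, instead of hard-coding a scale factor, keeps the extrapolation consistent if the sample size changes. The printed value depends on the machine and varies between runs.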