# 1.2 Fingerprint hashing

Using the previously selected data with the features you found pertinent, you have to:

Implement your minhash function from scratch. No ready-made hash functions are allowed. Read the class material and search the internet if you need to. For reference, it may be practical to look at the description of hash functions in the book.

Process the dataset and add each record to the MinHash. The subtask's goal is to try and map each consumer to its bin; to ensure this works well, be sure you understand how MinHash works and choose a matching threshold to use. Before moving on, experiment with different thresholds, explaining your choice.

In [22]:
import pandas as pd
from tqdm import tqdm as tq
import warnings
import numpy as np
warnings.filterwarnings("ignore")

In [23]:
df = pd.read_csv("/Users/giacomo/Desktop/ADM_HW4/data.csv", sep = '\t')

In [24]:
df

Unnamed: 0.1,Unnamed: 0,New_ID,CustGender,CustomerClassAge,Richness,Expenditure
0,0,C1010011F24,F,age_2,richness_7,exp_10
1,1,C1010011M33,M,age_4,richness_9,exp_5
2,2,C1010012M22,M,age_2,richness_6,exp_8
3,3,C1010014F24,F,age_2,richness_7,exp_8
4,4,C1010014M32,M,age_4,richness_9,exp_4
...,...,...,...,...,...,...
1034947,1034947,C9099836M26,M,age_3,richness_9,exp_7
1034948,1034948,C9099877M20,M,age_1,richness_9,exp_4
1034949,1034949,C9099919M23,M,age_2,richness_3,exp_3
1034950,1034950,C9099941M21,M,age_2,richness_7,exp_1


In [25]:
del df['Unnamed: 0']

# 1.2.1 Shingles

First of all we build the shingles from all the unique values per column in the loaded dataset. We skip the first column (`New_ID`, the customer identifier) because it is not a shingle.

In [26]:
shingles = [] #initialize shingles
for column_name in df.columns[1:]: 
    shingles += sorted(list(df[column_name].unique())) 
    
shingles.remove('age_0')

#To avoid grouping together the people labelled with age_0, which corresponds to a Customer DOB with year 1800 
#(nan), we removed age_0 from the shingles, so those people have no 1 for the age in the shingle matrix.
#As a result they are never considered similar to anyone by age, but only by the other fields.

In [27]:
print(shingles)

['F', 'M', 'age_1', 'age_10', 'age_11', 'age_12', 'age_13', 'age_14', 'age_15', 'age_16', 'age_17', 'age_2', 'age_3', 'age_4', 'age_5', 'age_6', 'age_7', 'age_8', 'age_9', 'richness_0', 'richness_1', 'richness_10', 'richness_2', 'richness_3', 'richness_4', 'richness_5', 'richness_6', 'richness_7', 'richness_8', 'richness_9', 'exp_1', 'exp_10', 'exp_2', 'exp_3', 'exp_4', 'exp_5', 'exp_6', 'exp_7', 'exp_8', 'exp_9']


# 1.2.2 Create Shingle Matrix

First of all we create the function that maps each row (customer) into a vector of 0/1 based on the shingles.

In [28]:
def one_hot_vector(data, index):
    """Creates a one hot vector for the row found in the data at the given index based on the shingles.
    
    :args
    data - a pandas dataframe containing the data.
    index - an int which corresponds to the row that will be turned into a one hot vector.
    
    :returns
    a numpy array one hot representation of the row
    """
    
    values = data.loc[index][['CustGender', 'CustomerClassAge', 'Richness', 'Expenditure']].values #extract values
    
    indices = np.where(values.reshape(values.size, 1) == shingles)[1]  #positions of the row's values in the shingles list
    
    vector = np.zeros(len(shingles), dtype = int)  #initialize vector
    
    vector[indices] = 1  #substitute 1 in the correct positions
    
    return vector

Example:

In [29]:
print(shingles)

['F', 'M', 'age_1', 'age_10', 'age_11', 'age_12', 'age_13', 'age_14', 'age_15', 'age_16', 'age_17', 'age_2', 'age_3', 'age_4', 'age_5', 'age_6', 'age_7', 'age_8', 'age_9', 'richness_0', 'richness_1', 'richness_10', 'richness_2', 'richness_3', 'richness_4', 'richness_5', 'richness_6', 'richness_7', 'richness_8', 'richness_9', 'exp_1', 'exp_10', 'exp_2', 'exp_3', 'exp_4', 'exp_5', 'exp_6', 'exp_7', 'exp_8', 'exp_9']


In [30]:
df.loc[1]

New_ID              C1010011M33
CustGender                    M
CustomerClassAge          age_4
Richness             richness_9
Expenditure               exp_5
Name: 1, dtype: object

In [31]:
print(one_hot_vector(df, 1))

[0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0
 0 0 0]


In [32]:
len(df)

1034952

Now we can build the shingle matrix with all the encoded rows (it is stored as a dense NumPy array). We don't need to insert the customer names into the matrix because each customer is linked to its column of the shingle matrix through the index of the dataframe.

In [33]:
shingle_matrix = np.zeros((len(df), 40), dtype = int)

for i in tq(range(len(df))):
    # Append the one hot vectors as rows
    shingle_matrix[df.index[i]] = one_hot_vector(df, i) 

# We need to transpose because for the shuffling, the Shingles need to be the rows
shingle_matrix = shingle_matrix.T

100%|███████████████████████████████| 1034952/1034952 [13:53<00:00, 1242.37it/s]
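The loop above calls `one_hot_vector` once per row and takes almost 14 minutes. As a possible alternative, the sketch below builds the same kind of 0/1 matrix with a single broadcasted comparison. The helper name `one_hot_matrix` and the toy data are illustrative assumptions, not part of the original notebook; on the full dataset the comparison would probably need to be done in chunks to limit memory use.

```python
import numpy as np

def one_hot_matrix(rows, shingles):
    """Return a (n_shingles, n_rows) 0/1 matrix; values that are not shingles stay 0."""
    values = np.array(rows)            # shape (n_rows, n_fields)
    shingles_arr = np.array(shingles)  # shape (n_shingles,)
    # Broadcast to (n_rows, n_fields, n_shingles), then collapse the fields axis.
    hits = (values[:, :, None] == shingles_arr).any(axis=1)
    return hits.astype(int).T          # shingles as rows, customers as columns

# Toy example (hypothetical shingles and rows).
toy_shingles = ['F', 'M', 'age_1', 'age_2', 'exp_1', 'exp_2']
toy_rows = [['M', 'age_2', 'exp_1'],
            ['F', 'age_0', 'exp_2']]   # 'age_0' is not a shingle, so it sets no 1
toy_matrix = one_hot_matrix(toy_rows, toy_shingles)
print(toy_matrix)
```

The result already has the shingles on the rows, so no transpose is needed afterwards.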


In [34]:
%store shingle_matrix

Stored 'shingle_matrix' (ndarray)


In [88]:
%store -r shingle_matrix

# 1.2.3 Create the Signature Matrix
From the Shingle Matrix, we will now create the signature matrix by doing the following:
1. Shuffle the rows of the Shingle Matrix.
1. Create a vector where each element corresponds to a column (customer) and holds the index of the row where the first 1 of that column is found.
1. Append this vector to the Signature Matrix.
1. Repeat $n$ times.

The goal of the MinHash is to replace a large set with a smaller "signature" that still preserves the underlying similarity metric.
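To make the steps above concrete, here is a minimal sketch on a toy shingle matrix. It checks that the fraction of signature rows on which two columns agree approximates their Jaccard similarity. The toy matrix, the function names and the fixed seed are assumptions made for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable

# Toy shingle matrix: rows = shingles, columns = items (hypothetical data).
toy = np.array([
    [1, 1, 0],
    [0, 0, 1],
    [1, 1, 0],
    [0, 1, 1],
    [1, 0, 0],
    [0, 0, 1],
])

def minhash_signature(shingle_matrix, n_permutations, rng):
    """One signature row per permutation: position of the first 1 in each column."""
    signature = np.zeros((n_permutations, shingle_matrix.shape[1]), dtype=int)
    for i in range(n_permutations):
        permuted = shingle_matrix[rng.permutation(shingle_matrix.shape[0])]
        signature[i] = np.argmax(permuted == 1, axis=0)
    return signature

def jaccard(a, b):
    """Exact Jaccard similarity of two 0/1 columns."""
    return np.sum(a & b) / np.sum(a | b)

sig = minhash_signature(toy, 2000, rng)
estimated = np.mean(sig[:, 0] == sig[:, 1])  # fraction of agreeing signature rows
exact = jaccard(toy[:, 0], toy[:, 1])        # true similarity of items 0 and 1
print(exact, estimated)
```

With only 20 permutations, as used in this notebook, the estimate is much noisier than with the 2000 used in the sketch.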

In [36]:
n_permutations = 20 #number of permutations = number of rows of the signature matrix
signature_matrix = np.zeros((n_permutations, shingle_matrix.shape[1]), dtype = int) #initialize signature matrix
seed = np.random.randint(0, 100000) #draw a random seed; it has to be saved (e.g. printed) to reproduce the run
np.random.seed(seed) #set the seed

In [37]:
for i in tq(range(n_permutations)):
    # 1. Shuffle rows
    np.random.shuffle(shingle_matrix)
    
    # 2. Create the vector of indices where the first 1 is found. np.argmax stops at the first occurrence
    signature_row = np.argmax(shingle_matrix == 1, axis=0) + 1
    
    # 3. Add to signature matrix
    signature_matrix[i] = signature_row

100%|███████████████████████████████████████████| 20/20 [00:34<00:00,  1.74s/it]


In [38]:
signature_matrix

array([[ 9, 15, 16, ..., 18, 11,  5],
       [14,  3, 13, ..., 13, 13,  2],
       [ 8, 17,  6, ...,  9,  9,  1],
       ...,
       [15,  3,  3, ...,  3,  3,  3],
       [ 2,  3,  2, ...,  2,  2, 19],
       [15,  1,  3, ..., 13, 23,  1]])

In [39]:
signature_matrix.shape

(20, 1034952)

The column index of the signature matrix corresponds to the index of the initial dataframe, so it can be mapped back to the customer ID:

In [40]:
df

Unnamed: 0,New_ID,CustGender,CustomerClassAge,Richness,Expenditure
0,C1010011F24,F,age_2,richness_7,exp_10
1,C1010011M33,M,age_4,richness_9,exp_5
2,C1010012M22,M,age_2,richness_6,exp_8
3,C1010014F24,F,age_2,richness_7,exp_8
4,C1010014M32,M,age_4,richness_9,exp_4
...,...,...,...,...,...
1034947,C9099836M26,M,age_3,richness_9,exp_7
1034948,C9099877M20,M,age_1,richness_9,exp_4
1034949,C9099919M23,M,age_2,richness_3,exp_3
1034950,C9099941M21,M,age_2,richness_7,exp_1


For example, the first column of the signature matrix refers to the customer C1010011F24.

# 1.2.4 Divide Signature Matrix into Bands

The signature matrix is divided into bands of $b$ rows each, and each band is hashed separately. In the example illustrated below each band spans two rows, which means that any two customers whose signatures share the same first two rows are considered similar. The larger we make $b$, the less likely it is that another customer matches on all the permutations of a band.

![signature_matrix_into_bands](https://storage.googleapis.com/lds-media/images/locality-sensitive-hashing-lsh-buckets.width-1200.png)

The $\textit{probability}$ that the minhash function for a random permutation of rows produces the $\textbf{same values}$ for two sets is equal to the $\textbf{Jaccard similarity}$ of those sets.


The size of the bands controls the probability that two items with a given Jaccard similarity end up in the same bucket. If the band size is larger, you will end up with much smaller buckets. For instance, $b = p$, where $p$ is the number of permutations (i.e. rows in the signature matrix), would in general lead to buckets of only one item, because only items that are perfectly similar across every permutation would share a bucket.
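The effect of the band size can be sketched numerically. Under the standard LSH analysis, and assuming each band is hashed into its own set of buckets, two items with Jaccard similarity $s$ become a candidate pair with probability $1 - (1 - s^{r})^{k}$, where $r$ is the band size and $k$ the number of bands. The function name and the similarity values below are illustrative: two customers described by four shingles each that share 1, 2, 3 or 4 of them have similarity 1/7, 1/3, 3/5 or 1.

```python
def candidate_probability(s, band_size, n_bands):
    """Probability that two items with similarity s agree on every row of at least one band."""
    return 1 - (1 - s ** band_size) ** n_bands

n_permutations = 20  # rows of the signature matrix
for band_size in (2, 4, 5):
    n_bands = n_permutations // band_size
    # Approximate similarity at which the curve rises most steeply.
    threshold = (1 / n_bands) ** (1 / band_size)
    probs = [round(candidate_probability(s, band_size, n_bands), 3)
             for s in (1 / 7, 1 / 3, 3 / 5, 1.0)]
    print(band_size, n_bands, round(threshold, 3), probs)
```

A larger band size pushes the approximate threshold towards 1, so only nearly identical customers are likely to share a bucket.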

To create the buckets we build a dictionary that has the subvectors as keys and, as values, the indexes of the columns that contain each subvector. These indexes will allow us to substitute the related customers:

In [41]:
def create_buckets(b, signature):
    """Group the customers whose signatures agree on a band of b consecutive rows.
    
    :args
    b - an int, the band size (number of signature rows per band).
    signature - a numpy array of shape (n_permutations, n_customers).
    
    :returns
    a dict with the subvectors (tuples) as keys and the list of column indexes containing them as values
    """
    
    indexes = list(range(signature.shape[1])) #create a list of column indexes (one per customer)

    signature_matrix_transposed = signature.T #transpose the matrix so that each row is the signature of a customer

    #initialize the dictionary containing as keys the subvectors and as values the indexes of the customers.
    #Note: the keys do not include the position of the band, so equal subvectors found in different bands share a bucket
    cluster = {}

    for start in tq(range(0, signature.shape[0], b)):  #iterate over the rows with step size equal to the band size
    
        #take the subvectors covering rows start..start+b (one band) for every customer
        mini_vectors = signature_matrix_transposed[:, start:start+b] 
    
        #pair each subvector with the index of its customer and sort, so that equal subvectors become neighbors.
        #We use a tuple instead of a list because tuples are hashable and therefore usable as dictionary keys
        c = [(idx, tuple(v)) for v, idx in sorted(zip(mini_vectors.tolist(), indexes))]
    
        for idx, v in c:  
        
            if v not in cluster: #if the subvector is not a key in the cluster --> initialize it 
            
                cluster[v] = []

            cluster[v].append(idx) #append as values the indexes where that subvector is found 
    
    return cluster
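In `create_buckets` the dictionary keys are the subvectors alone, so equal subvectors coming from different bands land in the same bucket even though they refer to different permutations. A possible variant, sketched below as an assumption rather than as the method used in this notebook, adds the band position to the key. The function name `create_band_buckets` and the toy signature are hypothetical.

```python
from collections import defaultdict

def create_band_buckets(band_size, signature):
    """Variant of create_buckets whose keys are (band index, subvector)."""
    n_rows = len(signature)
    n_cols = len(signature[0])
    buckets = defaultdict(list)
    for band, start in enumerate(range(0, n_rows, band_size)):
        stop = min(start + band_size, n_rows)
        for col in range(n_cols):
            subvector = tuple(signature[r][col] for r in range(start, stop))
            buckets[(band, subvector)].append(col)  # column index = customer index
    return dict(buckets)

# Toy signature: 4 permutations (rows) x 3 customers (columns).
toy_signature = [
    [1, 1, 3],
    [2, 2, 1],
    [3, 5, 1],
    [1, 6, 2],
]
buckets = create_band_buckets(2, toy_signature)
for key, members in buckets.items():
    print(key, members)
```

In the toy example customers 0 and 1 share the subvector `(1, 2)` in the first band, while customer 2 shows the same values only in the second band; with band-aware keys customer 2 is not grouped with the other two.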

We can create a dictionary containing the buckets with band size 4:

In [43]:
cluster_4 = create_buckets(4, signature_matrix)

100%|█████████████████████████████████████████████| 5/5 [00:29<00:00,  5.92s/it]


We can see that the keys are the subvectors:

In [44]:
print(list(cluster_4.keys())[:5])

[(1, 2, 1, 6), (1, 2, 1, 15), (1, 2, 3, 6), (1, 2, 3, 12), (1, 2, 5, 6)]


And the values are the column index where they've been found:

In [47]:
cluster_4[list(cluster_4.keys())[0]][0:5]  #first 5 index referred to the first key of the dictionary:

[571, 38876, 41344, 58949, 62939]

Through them we can recover the names of the customers by defining a function that replaces the values of each key with the corresponding 'New_ID' values in the dataframe:

In [51]:
def substitute_keys(buckets, data):
    
    for key in list(buckets.keys()):
    
        buckets[key] = data.loc[buckets[key]]['New_ID'].to_list() #substitute the indexes with the customer names

    return buckets

In [52]:
cluster_4 = substitute_keys(cluster_4, df)

We can see that the values of each key are now the customer names:

In [57]:
cluster_4[list(cluster_4.keys())[0]][0:5]

['C1011736M55', 'C1311745M55', 'C1319157M53', 'C1434417M52', 'C1474376M51']

We can now visually check how similar the customers in a bucket are:

In [61]:
df[df['New_ID'].isin(cluster_4[list(cluster_4.keys())[0]])] 

Unnamed: 0,New_ID,CustGender,CustomerClassAge,Richness,Expenditure
571,C1011736M55,M,age_8,richness_4,exp_7
38876,C1311745M55,M,age_8,richness_4,exp_7
41344,C1319157M53,M,age_8,richness_4,exp_7
58949,C1434417M52,M,age_8,richness_4,exp_7
62939,C1474376M51,M,age_8,richness_4,exp_7
...,...,...,...,...,...
990528,C8730518M54,M,age_8,richness_4,exp_7
995924,C8784092M54,M,age_8,richness_4,exp_7
1011225,C8915869M54,M,age_8,richness_4,exp_7
1024636,C9017478M52,M,age_8,richness_4,exp_7


And the number of buckets that we have:

In [62]:
len(cluster_4)

5361

We can now visually check what happens for $\textbf{band size = 2}$:

In [63]:
cluster_2 = create_buckets(2, signature_matrix) #create buckets with band size 2

cluster_2 = substitute_keys(cluster_2, df) #substitute index with customer names

print('The number of buckets is ', len(cluster_2))

df[df['New_ID'].isin(cluster_2[list(cluster_2.keys())[0]])] 

100%|███████████████████████████████████████████| 10/10 [01:07<00:00,  6.74s/it]


The number of buckets is  651


Unnamed: 0,New_ID,CustGender,CustomerClassAge,Richness,Expenditure
10,C1010035M24,M,age_2,richness_4,exp_1
17,C1010041F24,F,age_2,richness_2,exp_1
18,C1010041F41,F,age_6,richness_10,exp_9
44,C1010116M22,M,age_2,richness_5,exp_1
55,C1010157M24,M,age_2,richness_4,exp_1
...,...,...,...,...,...
1034927,C9099170M23,M,age_2,richness_1,exp_1
1034930,C9099183M25,M,age_2,richness_4,exp_1
1034938,C9099628M22,M,age_2,richness_0,exp_1
1034939,C9099661F24,F,age_2,richness_5,exp_1


Decreasing the band size decreases the number of buckets, because customers that have different values in some fields are grouped together.

Checking what happens with $\textbf{band size = 5}$:

In [64]:
cluster_5 = create_buckets(5, signature_matrix) #create buckets with band size 5

cluster_5 = substitute_keys(cluster_5, df) #substitute index with customer names

print('The number of buckets is ', len(cluster_5))

df[df['New_ID'].isin(cluster_5[list(cluster_5.keys())[0]])] 

100%|█████████████████████████████████████████████| 4/4 [00:35<00:00,  8.82s/it]


The number of buckets is  6351


Unnamed: 0,New_ID,CustGender,CustomerClassAge,Richness,Expenditure
571,C1011736M55,M,age_8,richness_4,exp_7
38876,C1311745M55,M,age_8,richness_4,exp_7
41344,C1319157M53,M,age_8,richness_4,exp_7
58949,C1434417M52,M,age_8,richness_4,exp_7
62939,C1474376M51,M,age_8,richness_4,exp_7
...,...,...,...,...,...
990528,C8730518M54,M,age_8,richness_4,exp_7
995924,C8784092M54,M,age_8,richness_4,exp_7
1011225,C8915869M54,M,age_8,richness_4,exp_7
1024636,C9017478M52,M,age_8,richness_4,exp_7


There is a larger number of buckets (so fewer customers per bucket) and the customers in the bucket shown all have the same values in the different fields.  

So, the choice of the band size *depends* on how similar you want the customers in a bucket to be. Assuming our requirement is not too restrictive, we choose 4 as the band size to execute the query in the next point. 

# 1.3 Locality Sensitive Hashing

Now that you prepared your algorithm, it's query time!
We have prepared some dummy users for you to work with.

Download this csv and report the most similar users (comparing them against the dataset provided in Kaggle).
Did your hashing method work properly, what scores have you obtained and how long did it take to run? Provide information and analysis about the results

# 1.3.1 Pre-processing Query dataset

First of all we need to pre-process the query dataset in the same way we pre-processed the initial Kaggle dataset:

In [65]:
query = pd.read_csv("/Users/giacomo/Desktop/ADM_HW4/query_users.csv")

In [66]:
query.head()

Unnamed: 0,CustomerDOB,CustGender,CustLocation,CustAccountBalance,TransactionDate,TransactionTime,TransactionAmount (INR)
0,27/7/78,M,DELHI,94695.61,2/9/16,140310,65.0
1,6/11/92,M,PANCHKULA,7584.09,2/9/16,120214,6025.0
2,14/8/91,M,PATNA,7180.6,10/8/16,221732,541.5
3,3/1/87,M,CHENNAI,56847.75,29/8/16,144138,1000.0
4,4/1/95,M,GURGAON,84950.13,25/9/16,233309,80.0


In [67]:
del query['CustLocation'], query['TransactionTime'] #We didn't use those features for the customers 

First of all we convert the remaining columns into classes of age, richness and expenditure.

Starting with the age, we used the same procedure as in point 1.1:

In [68]:
#In order to calculate the ages we need to transform into datetime birthday and year of the transactions

query['CustomerDOB'] = pd.to_datetime(query['CustomerDOB']) 

query['TransactionDate'] = pd.to_datetime(query['TransactionDate']) 

In [69]:
query['CustomerAge'] = 0 #initialize with values zero the ages

In [70]:
#give an age only to queries that have a different YOB from 1800

known_dob = query['CustomerDOB'].dt.year != 1800 #mask of the queries with a valid year of birth

query.loc[known_dob, 'CustomerAge'] = query.loc[known_dob, 'TransactionDate'].dt.year - query.loc[known_dob, 'CustomerDOB'].dt.year 

In [71]:
query.head(5)

Unnamed: 0,CustomerDOB,CustGender,CustAccountBalance,TransactionDate,TransactionAmount (INR),CustomerAge
0,1978-07-27,M,94695.61,2016-02-09,65.0,38
1,1992-06-11,M,7584.09,2016-02-09,6025.0,24
2,1991-08-14,M,7180.6,2016-10-08,541.5,25
3,1987-03-01,M,56847.75,2016-08-29,1000.0,29
4,1995-04-01,M,84950.13,2016-09-25,80.0,21
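The raw query file appears to store dates day first (for example 27/7/78), while the output above shows 6/11/92 parsed as 1992-06-11, i.e. month first. Only the years are used to compute the age, so the classes here are not affected, but the sketch below shows one way to parse such strings with an explicit format using the standard library. The helper name `parse_day_first` is hypothetical. Note that `%y` maps the two-digit years 00 to 68 to 2000-2068, so older dates of birth would need separate handling.

```python
from datetime import datetime

def parse_day_first(text):
    """Parse a 'day/month/two-digit-year' string such as '6/11/92'."""
    return datetime.strptime(text, "%d/%m/%y")

dob = parse_day_first("6/11/92")         # 6 November 1992
transaction = parse_day_first("2/9/16")  # 2 September 2016
age_in_years = transaction.year - dob.year  # same year difference used above
print(dob.date(), transaction.date(), age_in_years)
```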


In [72]:
del query['TransactionDate'], query['CustomerDOB'] #deleting the data that we don't need

We binned the ages into the same classes as in point 1.1:

In [73]:
bins = np.array(list(range(16, 102, 5)))  #bins

def age(age):
    
    class_age = np.digitize(age, bins, right=False)  #return the number of the bin
    
    age = 'age_' + str(class_age) #create a string of class age
        
    return age

query['CustomerClassAge'] = query.CustomerAge.apply(lambda x: age(x)) #build a new column called class age

In [74]:
del query['CustomerAge'] #deleting the column with the ages

In [75]:
query.head()

Unnamed: 0,CustGender,CustAccountBalance,TransactionAmount (INR),CustomerClassAge
0,M,94695.61,65.0,age_5
1,M,7584.09,6025.0,age_2
2,M,7180.6,541.5,age_2
3,M,56847.75,1000.0,age_3
4,M,84950.13,80.0,age_2
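As a quick check of the binning, the sketch below applies `np.digitize` with the same edges to the first five query ages shown earlier (38, 24, 25, 29, 21) plus the placeholder age 0 used for an unknown date of birth. The variable names are illustrative.

```python
import numpy as np

bins = np.array(list(range(16, 102, 5)))  # same edges as above: 16, 21, ..., 101
ages = np.array([38, 24, 25, 29, 21, 0])  # first five query ages plus the placeholder 0

# With right=False an age a falls in class i when bins[i-1] <= a < bins[i].
classes = np.digitize(ages, bins, right=False)
labels = ['age_' + str(c) for c in classes]
print(labels)
```

Any age below 16, including the placeholder 0, ends up in `age_0`, which is the class removed from the shingles.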


Now it's the turn of the richness column. In this case we cannot use the quantiles to bin the account balance of the customers, because the quantiles of the query data would produce classes different from those of point 1.1. For this reason we reuse the bins that we saved in point 1.1 together with the function pd.cut, which bins the values with the edges we pass in and labels them with the labels we want. 

In [76]:
bin_labels = ['richness_0', 'richness_1', 'richness_2', 'richness_3', 'richness_4', 'richness_5', 'richness_6', 'richness_7', 'richness_8', 'richness_9', 'richness_10']

We reloaded the stored bins of the richness classes:

In [77]:
%store -r query_bins_richness  

In [78]:
#create the column with the classes of richness:

query['Richness'] = pd.cut(query.CustAccountBalance, bins = query_bins_richness, labels=bin_labels, right=False)

In [79]:
del query['CustAccountBalance'] #deleting the column that we don't need

In [80]:
query.head() #check the result

Unnamed: 0,CustGender,TransactionAmount (INR),CustomerClassAge,Richness
0,M,65.0,age_5,richness_9
1,M,6025.0,age_2,richness_4
2,M,541.5,age_2,richness_4
3,M,1000.0,age_3,richness_8
4,M,80.0,age_2,richness_9
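One thing worth checking when reusing saved edges: `pd.cut` leaves values outside the outermost edges unassigned (NaN), and with `right=False` a value equal to the last edge is outside as well. The sketch below illustrates this with hypothetical edges, labels and balances; the real edges are the ones stored in `query_bins_richness`.

```python
import pandas as pd

# Hypothetical edges and labels, for illustration only.
edges = [0, 1000, 10000, 100000]
labels = ['low', 'mid', 'high']
balances = pd.Series([500.0, 1000.0, 99999.0, 100000.0, 250000.0])

# right=False builds the intervals [0, 1000), [1000, 10000), [10000, 100000).
classes = pd.cut(balances, bins=edges, labels=labels, right=False)
n_unassigned = int(classes.isna().sum())  # values that fell outside every interval
print(classes.tolist(), n_unassigned)
```

A query with an unassigned class would get no 1 for that field in the shingle matrix, so counting the missing values after the cut is a cheap sanity check.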


We repeated the same procedure used for the richness to build the expenditure class:

In [81]:
%store -r query_bins_expenditure

In [82]:
#create the column expenditure

bin_labels = ['exp_1', 'exp_2', 'exp_3', 'exp_4', 'exp_5', 'exp_6', 'exp_7', 'exp_8', 'exp_9', 'exp_10']

query['Expenditure'] = pd.cut(query['TransactionAmount (INR)'], bins = query_bins_expenditure, labels=bin_labels, right=False)

In [83]:
del query['TransactionAmount (INR)'] #deleting the column that we don't need

In [84]:
query.head() #check the result

Unnamed: 0,CustGender,CustomerClassAge,Richness,Expenditure
0,M,age_5,richness_9,exp_2
1,M,age_2,richness_4,exp_10
2,M,age_2,richness_4,exp_6
3,M,age_3,richness_8,exp_8
4,M,age_2,richness_9,exp_2


The last step is to give a name to each query customer: Query_User_i, where i goes from 0 to 49:

In [85]:
#create two column and combine them to create the name:

query['sub'] = 'Query_User_' #prefix

query['num'] = range(50) #numbers

query['New_ID'] = query['sub'] + query['num'].astype(str) #concatenating prefix and number

del query['num'], query['sub'] #deleting the prefix and the numbers

In [86]:
query = query[['New_ID', 'CustGender', 'CustomerClassAge', 'Richness', 'Expenditure']] #reorder the column

In [87]:
query.head() #checking the resulting query dataset

Unnamed: 0,New_ID,CustGender,CustomerClassAge,Richness,Expenditure
0,Query_User_0,M,age_5,richness_9,exp_2
1,Query_User_1,M,age_2,richness_4,exp_10
2,Query_User_2,M,age_2,richness_4,exp_6
3,Query_User_3,M,age_3,richness_8,exp_8
4,Query_User_4,M,age_2,richness_9,exp_2


# 1.3.2 Execute the query step by step.

To execute the query and return the customers most similar to each query user, we need to apply the same MinHash procedure, with the same shuffling of the rows, to the queries. Since the bucketing algorithm is fast (less than 40 seconds for band size 4 above), we decided to append the shingle matrix of the queries (shingle query) to the shingle matrix of the initial dataset as new columns and to re-run the bucketing procedure. Then we will create a function that takes as input the name of a query user and returns the buckets that contain the users most similar to it.

First of all we define the shingles:

In [95]:
shingles = ['F', 'M', 'age_1', 'age_10', 'age_11', 'age_12', 'age_13', 'age_14', 'age_15', 'age_16', 'age_17', 'age_2', 'age_3', 'age_4', 'age_5', 'age_6', 'age_7', 'age_8', 'age_9', 'richness_0', 'richness_1', 'richness_10', 'richness_2', 'richness_3', 'richness_4', 'richness_5', 'richness_6', 'richness_7', 'richness_8', 'richness_9', 'exp_1', 'exp_10', 'exp_2', 'exp_3', 'exp_4', 'exp_5', 'exp_6', 'exp_7', 'exp_8', 'exp_9']

We build the shingle query using the 'one_hot_vector' function:

In [96]:
shingle_query = np.zeros((len(query), 40), dtype = int)

for i in tq(range(len(query))):
    # Append the one hot vectors as rows
    shingle_query[query.index[i]] = one_hot_vector(query, i) 

# We need to transpose because for the shuffling, the Shingles need to be the rows
shingle_query = shingle_query.T

100%|█████████████████████████████████████████| 50/50 [00:00<00:00, 1026.82it/s]


An example to see if it worked properly:

In [97]:
query.loc[1]

New_ID              Query_User_1
CustGender                     M
CustomerClassAge           age_2
Richness              richness_4
Expenditure               exp_10
Name: 1, dtype: object

In [98]:
print(shingles)

['F', 'M', 'age_1', 'age_10', 'age_11', 'age_12', 'age_13', 'age_14', 'age_15', 'age_16', 'age_17', 'age_2', 'age_3', 'age_4', 'age_5', 'age_6', 'age_7', 'age_8', 'age_9', 'richness_0', 'richness_1', 'richness_10', 'richness_2', 'richness_3', 'richness_4', 'richness_5', 'richness_6', 'richness_7', 'richness_8', 'richness_9', 'exp_1', 'exp_10', 'exp_2', 'exp_3', 'exp_4', 'exp_5', 'exp_6', 'exp_7', 'exp_8', 'exp_9']


In [99]:
shingle_query[:, 1]

array([0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0])

Then we concatenate the shingle query matrix to the initial shingle matrix: 

In [100]:
%store -r shingle_matrix

In [101]:
shingle_matrix.shape

(40, 1034952)

In [102]:
shingle_query.shape

(40, 50)

In [103]:
new_shingle_matrix = np.concatenate((shingle_matrix, shingle_query), axis = 1) #concatenate the two matrix

In [104]:
new_shingle_matrix.shape

(40, 1035002)

Now we can repeat the same MinHash procedure to build the signature matrix: 

In [105]:
n_permutations = 20 #number of permutations = number of rows of the signature matrix

new_signature_matrix = np.zeros((20, new_shingle_matrix.shape[1]), dtype = int) #initialize signature matrix

for i in tq(range(n_permutations)):
    # 1. Shuffle rows
    np.random.shuffle(new_shingle_matrix)
    
    # 2. Create the vector of indices where the first 1 is found. np.argmax stops at the first occurrence
    signature_row = np.argmax(new_shingle_matrix == 1, axis=0) + 1
    
    # 3. Add to signature matrix
    new_signature_matrix[i] = signature_row


100%|███████████████████████████████████████████| 20/20 [00:37<00:00,  1.86s/it]
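Re-shuffling the concatenated matrix recomputes the signatures of every customer. A possible alternative, sketched here under the assumption that the permutations are generated once and kept, is to apply the stored permutations to the query columns only; a query then gets exactly the signature it would have had inside the full matrix. The function names, the seed and the toy data are hypothetical.

```python
import numpy as np

def make_permutations(n_shingles, n_permutations, seed):
    """Generate and return the row permutations so they can be reused later."""
    rng = np.random.default_rng(seed)
    return np.array([rng.permutation(n_shingles) for _ in range(n_permutations)])

def signature_from_permutations(shingle_matrix, permutations):
    """Signature matrix: 1-based position of the first 1 of each column under each permutation."""
    signature = np.zeros((len(permutations), shingle_matrix.shape[1]), dtype=int)
    for i, perm in enumerate(permutations):
        signature[i] = np.argmax(shingle_matrix[perm] == 1, axis=0) + 1
    return signature

# Toy data: 5 shingles (rows) x 2 customers (columns).
toy_data = np.array([[1, 0], [0, 1], [1, 1], [0, 0], [0, 1]])
toy_query = toy_data[:, [0]]  # a query identical to customer 0

perms = make_permutations(n_shingles=5, n_permutations=20, seed=42)
sig_data = signature_from_permutations(toy_data, perms)
sig_query = signature_from_permutations(toy_query, perms)
print((sig_data[:, 0] == sig_query[:, 0]).all())
```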


To use the substitute keys function we need to add to the initial dataframe the data related to the query: 

In [107]:
df = pd.concat([df, query], ignore_index = True) #append query dataframe to the initial dataframe (DataFrame.append was removed in pandas 2.0)

In [108]:
df

Unnamed: 0,New_ID,CustGender,CustomerClassAge,Richness,Expenditure
0,C1010011F24,F,age_2,richness_7,exp_10
1,C1010011M33,M,age_4,richness_9,exp_5
2,C1010012M22,M,age_2,richness_6,exp_8
3,C1010014F24,F,age_2,richness_7,exp_8
4,C1010014M32,M,age_4,richness_9,exp_4
...,...,...,...,...,...
1034997,Query_User_45,F,age_3,richness_10,exp_4
1034998,Query_User_46,M,age_6,richness_6,exp_2
1034999,Query_User_47,M,age_2,richness_10,exp_2
1035000,Query_User_48,F,age_2,richness_4,exp_8


We choose the band size b and then create the dictionary with the buckets:

In [106]:
new_signature_matrix.shape #shape of the new matrix

(20, 1035002)

In [196]:
cluster_query = create_buckets(4, new_signature_matrix) #create buckets with band size 4

cluster_query = substitute_keys(cluster_query, df) #substitute index with customer names

print('The number of buckets is ', len(cluster_query))

df[df['New_ID'].isin(cluster_query[list(cluster_query.keys())[0]])] 

100%|█████████████████████████████████████████████| 5/5 [00:56<00:00, 11.37s/it]


The number of buckets is  4683


Unnamed: 0,New_ID,CustGender,CustomerClassAge,Richness,Expenditure
509,C1011553M30,M,age_3,richness_2,exp_8
735,C1012320M28,M,age_3,richness_2,exp_8
1183,C1013712M28,M,age_3,richness_2,exp_8
1716,C1015271M28,M,age_3,richness_2,exp_8
1817,C1015626M27,M,age_3,richness_2,exp_8
...,...,...,...,...,...
1030455,C9034686F27,F,age_3,richness_2,exp_8
1030557,C9035034F29,F,age_3,richness_2,exp_8
1031062,C9036665M29,M,age_3,richness_2,exp_8
1031803,C9038870M26,M,age_3,richness_2,exp_8


Then we created a function that takes as input the name of the query and gives as result the keys of the buckets in which it's contained: 

In [199]:
def search_bucket(name):
    
    index = [] #initialize the list of keys where the name is found
    
    keys = list(cluster_query.keys()) #list of keys of buckets

    values = list(cluster_query.values()) #list of values of buckets

    for i, v in enumerate(values):
        
        if name in v:
            
            index.append(keys[i]) #append the keys where the name is found
            
    return index 
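`search_bucket` scans every bucket for each query name. If many names have to be looked up, one option is to invert the dictionary once and then read the bucket keys of any name directly. This is a sketch with a hypothetical helper `invert_buckets` and toy buckets, not the approach used in the cells that follow.

```python
from collections import defaultdict

def invert_buckets(buckets):
    """Map each customer name to the list of bucket keys that contain it."""
    name_to_keys = defaultdict(list)
    for key, names in buckets.items():
        for name in set(names):  # count each bucket once per name
            name_to_keys[name].append(key)
    return name_to_keys

# Toy buckets in the same shape as cluster_query (subvector -> customer names).
toy_buckets = {
    (1, 2, 1, 6): ['C1', 'Query_User_0'],
    (1, 2, 3, 6): ['C2'],
    (4, 4, 1, 2): ['Query_User_0', 'C3'],
}
name_index = invert_buckets(toy_buckets)
print(name_index['Query_User_0'])
```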

Now we create a dictionary that stores the names of the query customers as keys and, as values, the list of buckets (keys of 'cluster_query') in which each query customer is found. Through them we can find groups of customers similar to the queries: 

In [200]:
query_buckets = {} #initialize query_bucket

queries_name = list(df.iloc[-50:]['New_ID']) #list of query customer name

for q in queries_name:
    
    query_buckets[q] = search_bucket(q) #keys = query customer name, values = list of bucket where it's found

The work is done! 

To see all the groups of customers similar to a query, we define a function that takes as input the name of the query and gives as output the groups of similar customers:

In [205]:
def similar_to_query(query_name):
    
    keys = query_buckets[query_name] #list of bucket keys of where the name is found
    
    similar_customers = {}  #initialize a dict: keys = name of the group, values = similar customer to the query
    
    for i,k in enumerate(keys):
        
        similar_customers['Group_' + str(i)] = cluster_query[k] #build the dictionary
       
    return similar_customers

In [209]:
print(similar_to_query('Query_User_34'))

{'Group_0': ['Query_User_34'], 'Group_1': ['Query_User_1', 'Query_User_7', 'Query_User_34', 'Query_User_47'], 'Group_2': ['Query_User_34'], 'Group_3': ['C1530761F77', 'C2633565M77', 'C2722384F76', 'C2911021M77', 'C3333537M77', 'C4011090M77', 'C6515330M79', 'C6533526M77', 'C6948022F77', 'C7815329M79', 'C8221631M77', 'C8918537M77', 'Query_User_20', 'Query_User_31', 'Query_User_34'], 'Group_4': ['C1213252F88', 'C7713218F88', 'Query_User_31', 'Query_User_34']}


For example we can take one of the members of a group and compare them to the query user: 

In [210]:
df[df['New_ID'] == 'Query_User_34']

Unnamed: 0,New_ID,CustGender,CustomerClassAge,Richness,Expenditure
1034986,Query_User_34,M,age_2,richness_10,exp_10


In [208]:
df[df['New_ID'].isin(similar_to_query('Query_User_34')['Group_3'])] 

Unnamed: 0,New_ID,CustGender,CustomerClassAge,Richness,Expenditure
70726,C1530761F77,F,age_13,richness_4,exp_10
211764,C2633565M77,M,age_13,richness_5,exp_10
220784,C2722384F76,F,age_13,richness_5,exp_10
242624,C2911021M77,M,age_13,richness_5,exp_10
301367,C3333537M77,M,age_13,richness_5,exp_10
383499,C4011090M77,M,age_13,richness_5,exp_10
703775,C6515330M79,M,age_13,richness_1,exp_10
709690,C6533526M77,M,age_13,richness_5,exp_10
764034,C6948022F77,F,age_13,richness_5,exp_10
870225,C7815329M79,M,age_13,richness_1,exp_10
