# 1.2 Fingerprint hashing

Using the previously selected data with the features you found pertinent, you have to:

Implement your minhash function from scratch. No ready-made hash functions are allowed. Read the class material and search the internet if you need to. For reference, it may be practical to look at the description of hash functions in the book.

Process the dataset and add each record to the MinHash. The subtask's goal is to try and map each consumer to its bin; to ensure this works well, be sure you understand how MinHash works and choose a matching threshold to use. Before moving on, experiment with different thresholds, explaining your choice.

In [1]:
import pandas as pd
from tqdm import tqdm as tq
import warnings
import numpy as np
warnings.filterwarnings("ignore")

In [2]:
df = pd.read_csv("/Users/giacomo/Desktop/ADM_HW4/data.csv", sep = '\t')

In [3]:
df

Unnamed: 0.1,Unnamed: 0,New_ID,CustGender,CustomerClassAge,Richness,Expenditure
0,0,C1010011F24,F,age_2,richness_7,exp_10
1,1,C1010011M33,M,age_4,richness_9,exp_5
2,2,C1010012M22,M,age_2,richness_6,exp_8
3,3,C1010014F24,F,age_2,richness_7,exp_8
4,4,C1010014M32,M,age_4,richness_9,exp_4
...,...,...,...,...,...,...
1034947,1034947,C9099836M26,M,age_3,richness_9,exp_7
1034948,1034948,C9099877M20,M,age_1,richness_9,exp_4
1034949,1034949,C9099919M23,M,age_2,richness_3,exp_3
1034950,1034950,C9099941M21,M,age_2,richness_7,exp_1


In [4]:
del df['Unnamed: 0']

# 1.2.1 Shingles

First of all we build the shingles from all the unique values per column in the loaded dataset. We skip the `New_ID` column because it is an identifier, not a shingle.

In [19]:
shingles = [] #initialize shingles
for column_name in df.columns[1:]: 
    shingles += sorted(list(df[column_name].unique())) 
    
shingles.remove('age_0')

#age_0 marks the customers whose date of birth has the placeholder year 1800 (missing value).
#We remove age_0 from the shingles so that these customers get no 1 for the age field in the shingle matrix.
#For that reason they will not be considered similar to anyone by age, but only by the other fields.

In [20]:
print(shingles)

['F', 'M', 'age_1', 'age_10', 'age_11', 'age_12', 'age_13', 'age_14', 'age_15', 'age_16', 'age_17', 'age_2', 'age_3', 'age_4', 'age_5', 'age_6', 'age_7', 'age_8', 'age_9', 'richness_0', 'richness_1', 'richness_10', 'richness_2', 'richness_3', 'richness_4', 'richness_5', 'richness_6', 'richness_7', 'richness_8', 'richness_9', 'exp_1', 'exp_10', 'exp_2', 'exp_3', 'exp_4', 'exp_5', 'exp_6', 'exp_7', 'exp_8', 'exp_9']


# 1.2.2 Create Shingle Matrix

First of all we create the function that maps each row of the dataframe to a 0/1 vector based on the shingles.

In [21]:
def one_hot_vector(data, index):
    """Creates a one hot vector for the row found in the data at the given index based on the shingles.
    
    :args
    data - a pandas dataframe containing the data.
    index - an int which corresponds to the row that will be turned into a one hot vector.
    
    :returns
    a numpy array one hot representation of the row
    """
    
    values = data.loc[index][['CustGender', 'CustomerClassAge', 'Richness', 'Expenditure']].values #extract values
    
    indices = np.where(values.reshape(values.size, 1) == shingles)[1]  #positions of the row's values in the shingles list
    
    vector = np.zeros(len(shingles), dtype = int)  #initialize vector
    
    vector[indices] = 1  #put a 1 in the matching positions
    
    return vector

Example:

In [22]:
print(shingles)

['F', 'M', 'age_1', 'age_10', 'age_11', 'age_12', 'age_13', 'age_14', 'age_15', 'age_16', 'age_17', 'age_2', 'age_3', 'age_4', 'age_5', 'age_6', 'age_7', 'age_8', 'age_9', 'richness_0', 'richness_1', 'richness_10', 'richness_2', 'richness_3', 'richness_4', 'richness_5', 'richness_6', 'richness_7', 'richness_8', 'richness_9', 'exp_1', 'exp_10', 'exp_2', 'exp_3', 'exp_4', 'exp_5', 'exp_6', 'exp_7', 'exp_8', 'exp_9']


In [23]:
df.loc[1]

New_ID              C1010011M33
CustGender                    M
CustomerClassAge          age_4
Richness             richness_9
Expenditure               exp_5
Name: 1, dtype: object

In [24]:
print(one_hot_vector(df, 1))

[0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0
 0 0 0]


In [25]:
len(df)

1034952

Now we can build the shingle matrix, a mostly-zero 0/1 matrix that holds all the encoded customers. We don't need to store the customer IDs in the matrix because, after the transposition, each column index matches the index of the corresponding row of the dataframe.

In [26]:
shingle_matrix = np.zeros((len(df), len(shingles)), dtype = int) #one row per customer, one column per shingle

for i in tq(range(len(df))):
    # Append the one hot vectors as rows
    shingle_matrix[df.index[i]] = one_hot_vector(df, i) 

# We need to transpose because for the shuffling, the Shingles need to be the rows
shingle_matrix = shingle_matrix.T

100%|███████████████████████████████| 1034952/1034952 [12:47<00:00, 1348.35it/s]
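
The row-by-row loop above takes almost 13 minutes. As a hedged alternative, the same 0/1 matrix could be sketched with one broadcast comparison per column. The helper name `build_shingle_matrix` and the toy dataframe are assumptions made for illustration; they are not part of the original notebook.

```python
import numpy as np
import pandas as pd

def build_shingle_matrix(data, columns, shingles):
    """Return a (len(shingles), len(data)) 0/1 matrix with the shingles as rows."""
    shingle_array = np.array(shingles)
    matrix = np.zeros((len(shingles), len(data)), dtype=int)
    for column in columns:
        values = data[column].to_numpy().astype(str)
        # compare every shingle with every value of this column in one shot
        matrix |= (shingle_array[:, None] == values[None, :]).astype(int)
    return matrix

# toy data (hypothetical): two customers, the second one has the removed class age_0
toy = pd.DataFrame({
    "CustGender": ["F", "M"],
    "CustomerClassAge": ["age_2", "age_0"],
})
toy_shingles = ["F", "M", "age_1", "age_2"]
toy_matrix = build_shingle_matrix(toy, ["CustGender", "CustomerClassAge"], toy_shingles)
print(toy_matrix)
```

The result already has the shingles as rows, so no transposition is needed afterwards. The customer labelled `age_0` gets a 1 only for the gender, as intended.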


In [27]:
%store shingle_matrix

Stored 'shingle_matrix' (ndarray)


In [28]:
%store -r shingle_matrix

# 1.2.3 Create the Signature Matrix
From the Shingle Matrix, we will now create the signature matrix by doing the following:
1. Shuffle the rows of the Shingle Matrix.
1. Create a vector where each element is the index of the first row that contains a 1 in the corresponding column (customer).
1. Append this vector to the Signature Matrix.
1. Repeat $n$ times.

The goal of the MinHash is to replace a large set with a smaller "signature" that still preserves the underlying similarity metric.
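
To see why the signature preserves similarity, here is a minimal sketch on a toy shingle matrix: the share of permutations on which two columns get the same MinHash value approximates their Jaccard similarity. The toy matrix, the seed and the helper names are assumptions made for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, used only for this toy example

# toy shingle matrix: 6 shingles (rows) x 3 items (columns)
toy = np.array([
    [1, 1, 0],
    [1, 1, 0],
    [1, 0, 0],
    [0, 1, 1],
    [0, 0, 1],
    [0, 0, 1],
])

def minhash_signatures(matrix, n_permutations, rng):
    """One signature row per random permutation of the shingle rows."""
    signatures = np.zeros((n_permutations, matrix.shape[1]), dtype=int)
    for p in range(n_permutations):
        permuted = matrix[rng.permutation(matrix.shape[0])]
        # position of the first 1 in each column (argmax returns the first maximum)
        signatures[p] = np.argmax(permuted == 1, axis=0)
    return signatures

def jaccard(a, b):
    """Jaccard similarity of two 0/1 vectors."""
    return np.sum(a & b) / np.sum(a | b)

sig = minhash_signatures(toy, 500, rng)
estimate = np.mean(sig[:, 0] == sig[:, 1])  # share of permutations where items 0 and 1 agree
exact = jaccard(toy[:, 0], toy[:, 1])        # 2 shared shingles out of 4 in the union
print(exact, round(estimate, 2))
```

Items 0 and 2 share no shingle, so their signatures never agree, while the estimate for items 0 and 1 should land close to the exact value of 0.5.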

In [29]:
n_permutations = 20 #number of permutations = number of rows of the signature matrix
signature_matrix = np.zeros((n_permutations, shingle_matrix.shape[1]), dtype = int) #initialize signature matrix
seed = np.random.randint(0, 100000) #the seed is itself drawn at random: save its value to be able to reproduce the run
np.random.seed(seed) #seed the generator used by np.random.shuffle

In [30]:
for i in tq(range(n_permutations)):
    # 1. Shuffle rows
    np.random.shuffle(shingle_matrix)
    
    # 2. Create the vector of indices where the first 1 is found. np.argmax stops at the first occurrence
    signature_row = np.argmax(shingle_matrix == 1, axis=0) + 1
    
    # 3. Add to signature matrix
    signature_matrix[i] = signature_row

100%|███████████████████████████████████████████| 20/20 [00:35<00:00,  1.75s/it]


In [31]:
signature_matrix

array([[ 1, 17,  9, ...,  9,  9,  3],
       [ 5,  4, 15, ..., 12, 10,  4],
       [23,  1,  3, ...,  3,  3,  3],
       ...,
       [ 4,  2, 24, ..., 18, 23,  2],
       [ 9,  5,  7, ..., 10, 13, 12],
       [ 5,  1, 20, ...,  8,  6, 25]])

In [32]:
signature_matrix.shape

(20, 1034952)

Each column index of the signature matrix corresponds to a customer ID through the index of the initial dataframe:

In [33]:
df

Unnamed: 0,New_ID,CustGender,CustomerClassAge,Richness,Expenditure
0,C1010011F24,F,age_2,richness_7,exp_10
1,C1010011M33,M,age_4,richness_9,exp_5
2,C1010012M22,M,age_2,richness_6,exp_8
3,C1010014F24,F,age_2,richness_7,exp_8
4,C1010014M32,M,age_4,richness_9,exp_4
...,...,...,...,...,...
1034947,C9099836M26,M,age_3,richness_9,exp_7
1034948,C9099877M20,M,age_1,richness_9,exp_4
1034949,C9099919M23,M,age_2,richness_3,exp_3
1034950,C9099941M21,M,age_2,richness_7,exp_1


For example, the first column of the signature matrix refers to the customer C1010011F24.

# 1.2.4 Divide Signature Matrix into Bands

The example signature matrix below is divided into bands of consecutive rows, and each band is hashed separately. For this example, each band spans two rows, which means that we consider any two items with the same first two rows to be similar. In the code that follows, `b` denotes the band size, i.e. the number of rows per band: the larger we make `b`, the less likely it is that another customer matches on all the permutations of a band.

![signature_matrix_into_bands](https://storage.googleapis.com/lds-media/images/locality-sensitive-hashing-lsh-buckets.width-1200.png)

Ultimately, the size of the bands controls the probability that two items with a given Jaccard similarity end up in the same bucket. If the band size is larger, you will end up with much smaller buckets. For instance, $b = p$, where $p$ is the number of permutations (i.e. rows in the signature matrix), would place two customers in the same bucket only if they agreed on every single permutation.
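
The effect of the band size can be made concrete with the standard banding analysis: with $r$ rows per band and $n$ bands, two items with Jaccard similarity $s$ agree on at least one band with probability $1 - (1 - s^r)^n$. The sketch below evaluates it for the band sizes tried in this notebook. It assumes independent MinHash rows and one separate set of buckets per band, which is the textbook scheme and not necessarily what the dictionary built below implements; `candidate_probability` is a name chosen for illustration.

```python
def candidate_probability(s, rows_per_band, n_bands):
    """Probability that two items with Jaccard similarity s agree on at least one band."""
    return 1 - (1 - s ** rows_per_band) ** n_bands

n_rows = 20  # number of permutations used in this notebook
for rows_per_band in (2, 4, 5):
    n_bands = n_rows // rows_per_band
    probs = [round(candidate_probability(s, rows_per_band, n_bands), 3) for s in (0.25, 0.5, 0.75)]
    print(rows_per_band, n_bands, probs)
```

For a fixed similarity, the probability drops as the band size grows, which is why larger bands yield stricter buckets.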

In [67]:
b = 4

In [68]:
signature_matrix.T[:, 0:4] #to build the minivector

array([[ 1,  5, 23,  2],
       [17,  4,  1,  5],
       [ 9, 15,  3,  2],
       ...,
       [ 9, 12,  3,  2],
       [ 9, 10,  3,  2],
       [ 3,  4,  3,  3]])

To create the buckets, we build a dictionary whose keys are the subvectors and whose values are the indexes of the columns that contain that subvector. These indexes will later allow us to substitute the related customers:

In [69]:
indexes = list(range(signature_matrix.shape[1])) #create a list of indexes, one per customer

signature_matrix_transposed = signature_matrix.T #transpose the matrix so that each row is a customer's signature

cluster = {} #initialize the dictionary containing as keys the subvector and as values the indexes of the customer

for start in tq(range(0, signature_matrix.shape[0], b)):  #iterate over the rows with step size equal to the band size
    
    #take the subvector made of the rows start, ..., start+b-1 (one band) for every customer
    mini_vectors = signature_matrix_transposed[:, start:start+b] 
    
    # Sort the (subvector, index) pairs so that equal subvectors are adjacent and the keys are inserted in sorted order.
    # We use a tuple instead of a list because tuples are hashable and therefore 
    # usable as keys for dictionaries.
    
    c = [(idx, tuple(v)) for v, idx in sorted(zip(mini_vectors.tolist(), indexes))]
    
    #note: the key is the subvector alone, so equal subvectors coming from different bands end up in the same bucket
    for idx, v in c:  
        
        if v not in cluster: #if the subvector is not a key in the cluster --> initialize it 
            
            cluster[v] = []

        cluster[v].append(idx) #append the index of the customer to the bucket
            

100%|█████████████████████████████████████████████| 5/5 [00:28<00:00,  5.79s/it]
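
In the cell above the bucket key is the subvector alone, so the same subvector found in two different bands falls into one bucket. A hedged variant, closer to the textbook scheme, keys each bucket by the band position as well. The function name `build_buckets` and the toy signature matrix are assumptions for illustration and do not reproduce the bucket counts reported in this notebook.

```python
from collections import defaultdict
import numpy as np

def build_buckets(signature, band_size):
    """Map (band start row, subvector) -> list of column indexes sharing that subvector."""
    buckets = defaultdict(list)
    for start in range(0, signature.shape[0], band_size):
        band = signature[start:start + band_size].T  # one row per item
        for idx, sub in enumerate(map(tuple, band.tolist())):
            buckets[(start, sub)].append(idx)
    return buckets

# toy signature matrix: 4 permutations (rows) x 3 items (columns)
toy_signature = np.array([
    [1, 1, 2],
    [3, 3, 3],
    [2, 5, 2],
    [3, 4, 3],
])
toy_buckets = build_buckets(toy_signature, band_size=2)
print(dict(toy_buckets))
```

Here the subvector `(2, 3)` appears in both bands, but it is kept in two separate buckets: item 2 alone in the first band, items 0 and 2 in the second.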


As an example, we print the first 5 keys (subvectors):

In [70]:
print(list(cluster.keys())[:5])

[(1, 1, 1, 1), (1, 1, 1, 6), (1, 1, 1, 9), (1, 1, 1, 10), (1, 1, 1, 11)]


For example, for the subvector (1, 1, 1, 1) the values are the indexes of the customers that have been mapped to the same bucket:

In [71]:
cluster[(1, 1, 1, 1)]

[3997,
 16727,
 22546,
 50596,
 54957,
 61582,
 74913,
 81269,
 87577,
 101048,
 122981,
 129307,
 129657,
 158264,
 158271,
 159128,
 161268,
 161510,
 167960,
 172795,
 174181,
 175537,
 180180,
 185977,
 189867,
 193851,
 197078,
 198032,
 200351,
 203579,
 216170,
 225865,
 235053,
 237205,
 248076,
 254689,
 257562,
 257877,
 265372,
 273864,
 277430,
 281417,
 312549,
 315738,
 325323,
 334497,
 337051,
 354246,
 368141,
 387175,
 394729,
 400242,
 404212,
 416022,
 418735,
 425453,
 436318,
 442312,
 453108,
 454457,
 472604,
 491404,
 501904,
 507897,
 518301,
 532668,
 550429,
 554896,
 572278,
 596462,
 602312,
 611036,
 633989,
 635792,
 644139,
 648730,
 649847,
 655276,
 661254,
 671868,
 672550,
 673159,
 683112,
 684784,
 688053,
 706393,
 711176,
 714130,
 731421,
 735681,
 748049,
 750563,
 752627,
 757535,
 763587,
 765242,
 771220,
 787765,
 791049,
 807342,
 816022,
 823059,
 823752,
 825122,
 838165,
 844669,
 844777,
 845971,
 851159,
 863841,
 865359,
 880782,
 8

Through them we can recover the customers and visually check their similarity. We noticed that with band size 4 the customers in the same bucket are strongly similar.

In [72]:
df.loc[cluster[(1, 1, 1, 1)]]

Unnamed: 0,New_ID,CustGender,CustomerClassAge,Richness,Expenditure
3997,C1022121F52,F,age_8,richness_9,exp_9
16727,C1121945F54,F,age_8,richness_9,exp_9
22546,C1139926F55,F,age_8,richness_9,exp_9
50596,C1392321F52,F,age_8,richness_9,exp_9
54957,C1422151F52,F,age_8,richness_9,exp_9
...,...,...,...,...,...
1009537,C8910950F53,F,age_8,richness_9,exp_9
1018945,C8938859F52,F,age_8,richness_9,exp_9
1019316,C8940043F53,F,age_8,richness_9,exp_9
1032262,C9040385F51,F,age_8,richness_9,exp_9


We then replace the list of indexes of every key with the list of the related customer IDs:

In [73]:
for key in cluster.keys():
    
    cluster[key] = df.loc[cluster[key]]['New_ID'].to_list() #substitute the indexes with the customer IDs

Now the dictionary cluster contains as keys the name of the bucket (subvector) and as values the CustomerIDs.

In [74]:
cluster[(1, 1, 1, 1)][0:5]

['C1022121F52', 'C1121945F54', 'C1139926F55', 'C1392321F52', 'C1422151F52']

In [75]:
len(cluster.keys())

5767

There are 5767 buckets in the dictionary 

Now let's try other band sizes, repeating the same procedure. With a smaller band size we expect fewer buckets, each containing less similar customers:

In [76]:
b = 2

indexes = list(range(signature_matrix.shape[1])) #create a list of indexes, one per customer

signature_matrix_transposed = signature_matrix.T #transpose the matrix so that each row is a customer's signature

cluster = {} #initialize the dictionary containing as keys the subvector and as values the indexes of the customer

for start in tq(range(0, signature_matrix.shape[0], b)):  #iterate over the rows with step size equal to the band size
    
    #take the subvector made of the rows start, ..., start+b-1 (one band) for every customer
    mini_vectors = signature_matrix_transposed[:, start:start+b] 
    
    # Sort the (subvector, index) pairs so that equal subvectors are adjacent and the keys are inserted in sorted order.
    # We use a tuple instead of a list because tuples are hashable and therefore 
    # usable as keys for dictionaries.
    
    c = [(idx, tuple(v)) for v, idx in sorted(zip(mini_vectors.tolist(), indexes))]
    
    #note: the key is the subvector alone, so equal subvectors coming from different bands end up in the same bucket
    for idx, v in c:  
        
        if v not in cluster: #if the subvector is not a key in the cluster --> initialize it 
            
            cluster[v] = []

        cluster[v].append(idx) #append the index of the customer to the bucket
            

100%|███████████████████████████████████████████| 10/10 [00:57<00:00,  5.75s/it]


In [77]:
for key in cluster.keys():
    
    cluster[key] = df.loc[cluster[key]]['New_ID'].to_list()

In [79]:
cluster[(1, 1)][0:5]

['C1010958F53', 'C1011252F55', 'C1012669F53', 'C1012970F52', 'C1013372F55']

In [80]:
df[df['New_ID'].isin(cluster[(1,1)])]  #print the customers of bucket (1,1)

Unnamed: 0,New_ID,CustGender,CustomerClassAge,Richness,Expenditure
1,C1010011M33,M,age_4,richness_9,exp_5
4,C1010014M32,M,age_4,richness_9,exp_4
6,C1010024M51,M,age_8,richness_9,exp_10
13,C1010037M35,M,age_4,richness_9,exp_10
23,C1010051M0,M,age_0,richness_9,exp_9
...,...,...,...,...,...
1034891,C9098146M27,M,age_3,richness_9,exp_5
1034904,C9098525M0,M,age_0,richness_9,exp_6
1034945,C9099828M20,M,age_1,richness_9,exp_2
1034947,C9099836M26,M,age_3,richness_9,exp_7


In [81]:
len(cluster.keys())

746

As expected, we obtain fewer buckets, each containing less similar customers.

Let's try with a larger band size:

In [82]:
b = 5

indexes = list(range(signature_matrix.shape[1])) #create a list of indexes, one per customer

signature_matrix_transposed = signature_matrix.T #transpose the matrix so that each row is a customer's signature

cluster = {} #initialize the dictionary containing as keys the subvector and as values the indexes of the customer

for start in tq(range(0, signature_matrix.shape[0], b)):  #iterate over the rows with step size equal to the band size
    
    #take the subvector made of the rows start, ..., start+b-1 (one band) for every customer
    mini_vectors = signature_matrix_transposed[:, start:start+b] 
    
    # Sort the (subvector, index) pairs so that equal subvectors are adjacent and the keys are inserted in sorted order.
    # We use a tuple instead of a list because tuples are hashable and therefore 
    # usable as keys for dictionaries.
    
    c = [(idx, tuple(v)) for v, idx in sorted(zip(mini_vectors.tolist(), indexes))]
    
    #note: the key is the subvector alone, so equal subvectors coming from different bands end up in the same bucket
    for idx, v in c:  
        
        if v not in cluster: #if the subvector is not a key in the cluster --> initialize it 
            
            cluster[v] = []

        cluster[v].append(idx) #append the index of the customer to the bucket

100%|█████████████████████████████████████████████| 4/4 [00:30<00:00,  7.71s/it]


In [83]:
for key in cluster.keys():
    
    cluster[key] = df.loc[cluster[key]]['New_ID'].to_list()

In [91]:
list(cluster.keys())[1]

(1, 1, 1, 6, 11)

In [92]:
df[df['New_ID'].isin(cluster[(1, 1, 1, 6, 11)])]  #print the customers of bucket (1, 1, 1, 6, 11)

Unnamed: 0,New_ID,CustGender,CustomerClassAge,Richness,Expenditure
326,C1010958F53,F,age_8,richness_9,exp_1
5553,C1026782F55,F,age_8,richness_9,exp_1
5833,C1027552F52,F,age_8,richness_9,exp_1
15736,C1118939F54,F,age_8,richness_9,exp_1
33560,C1234222F54,F,age_8,richness_9,exp_1
...,...,...,...,...,...
966881,C8536983F54,F,age_8,richness_9,exp_1
976325,C8626736F55,F,age_8,richness_9,exp_1
1002020,C8826764F55,F,age_8,richness_9,exp_1
1004509,C8834279F54,F,age_8,richness_9,exp_1


In [93]:
len(cluster)

7218

As expected, there is a larger number of buckets, and the customers in this bucket share the same values in all the fields.

So, the choice of the band size depends on how similar you want the customers in a bucket to be. Assuming our requirement is not too restrictive, we can choose 2 as the band size to execute the query for the next point.
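
One way to relate the band size to a similarity threshold is the usual rule of thumb from the banding analysis: with $n$ bands of $r$ rows, the similarity at which two items become more likely than not to share a band is roughly $(1/n)^{1/r}$. The sketch below computes it for the three band sizes tried above. It is an approximation that assumes one set of buckets per band, and `approximate_threshold` is a hypothetical helper name.

```python
n_rows = 20  # rows of the signature matrix

def approximate_threshold(rows_per_band, n_rows):
    """(1/n)^(1/r) with n = number of bands and r = rows per band."""
    n_bands = n_rows // rows_per_band
    return (1 / n_bands) ** (1 / rows_per_band)

thresholds = {r: round(approximate_threshold(r, n_rows), 3) for r in (2, 4, 5)}
print(thresholds)  # {2: 0.316, 4: 0.669, 5: 0.758}
```

Under these assumptions, band size 2 corresponds to a loose threshold of about 0.32, while band sizes 4 and 5 require roughly 0.67 and 0.76, which is consistent with the buckets observed above.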

# 1.3 Locality Sensitive Hashing

Now that you prepared your algorithm, it's query time!
We have prepared some dummy users for you to work with.

Download this csv and report the most similar users (comparing them against the dataset provided in Kaggle).
Did your hashing method work properly, what scores have you obtained and how long did it take to run? Provide information and analysis about the results

# 1.3.1 Pre-processing Query dataset

First of all we need to pre-process the query dataset in the same way we pre-processed the initial Kaggle dataset:

In [98]:
query = pd.read_csv("/Users/giacomo/Desktop/ADM_HW4/query_users.csv")

In [99]:
query.head()

Unnamed: 0,CustomerDOB,CustGender,CustLocation,CustAccountBalance,TransactionDate,TransactionTime,TransactionAmount (INR)
0,27/7/78,M,DELHI,94695.61,2/9/16,140310,65.0
1,6/11/92,M,PANCHKULA,7584.09,2/9/16,120214,6025.0
2,14/8/91,M,PATNA,7180.6,10/8/16,221732,541.5
3,3/1/87,M,CHENNAI,56847.75,29/8/16,144138,1000.0
4,4/1/95,M,GURGAON,84950.13,25/9/16,233309,80.0


In [100]:
del query['CustLocation'], query['TransactionTime'] #We didn't use those features for the customers 

First of all we convert the raw values into classes of age, richness and expenditure.

Starting with the age, we used the same procedure as in point 1.1:

In [101]:
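#To compute the ages we convert the date of birth and the transaction date to datetime.
#Note: without dayfirst=True pandas reads ambiguous dates such as 6/11/92 month-first; only the year is used below, so the ages are unaffected.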

query['CustomerDOB'] = pd.to_datetime(query['CustomerDOB']) 

query['TransactionDate'] = pd.to_datetime(query['TransactionDate']) 

In [103]:
query['CustomerAge'] = 0 #initialize with values zero the ages

In [104]:
#give an age only to queries that have a different YOB from 1800

valid_dob = query['CustomerDOB'].dt.year != 1800 #True where the year of birth is not the placeholder 1800

query.loc[valid_dob, 'CustomerAge'] = query.loc[valid_dob, 'TransactionDate'].dt.year - query.loc[valid_dob, 'CustomerDOB'].dt.year

In [105]:
query.head(5)

Unnamed: 0,CustomerDOB,CustGender,CustAccountBalance,TransactionDate,TransactionAmount (INR),CustomerAge
0,1978-07-27,M,94695.61,2016-02-09,65.0,38
1,1992-06-11,M,7584.09,2016-02-09,6025.0,24
2,1991-08-14,M,7180.6,2016-10-08,541.5,25
3,1987-03-01,M,56847.75,2016-08-29,1000.0,29
4,1995-04-01,M,84950.13,2016-09-25,80.0,21


In [106]:
del query['TransactionDate'], query['CustomerDOB'] #deleting the data that we don't need

We binned the ages into the same classes of point 1.1:

In [110]:
bins = np.array(list(range(16, 102, 5)))  #bins

def age(age):
    
    class_age = np.digitize(age, bins, right=False)  #return the number of the bin
    
    age = 'age_' + str(class_age) #create a string of class age
        
    return age

query['CustomerClassAge'] = query.CustomerAge.apply(lambda x: age(x)) #build a new column called class age
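
As a quick check of how `np.digitize` assigns the classes, the sketch below bins the ages of the first five query users in one vectorized call. Only the variable names `ages`, `classes` and `labels` are introduced here for illustration.

```python
import numpy as np

bins = np.arange(16, 102, 5)           # same edges as above: 16, 21, ..., 101
ages = np.array([38, 24, 25, 29, 21])  # ages of the first five query users

# with right=False, np.digitize returns i such that bins[i-1] <= age < bins[i]
classes = np.digitize(ages, bins, right=False)
labels = ['age_' + str(c) for c in classes]
print(labels)  # ['age_5', 'age_2', 'age_2', 'age_3', 'age_2']
```

An age of 0, used for the customers without a valid date of birth, falls below the first edge and therefore becomes `age_0`.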

In [111]:
del query['CustomerAge'] #deleting the column with the ages

In [112]:
query.head()

Unnamed: 0,CustGender,CustAccountBalance,TransactionAmount (INR),CustomerClassAge
0,M,94695.61,65.0,age_5
1,M,7584.09,6025.0,age_2
2,M,7180.6,541.5,age_2
3,M,56847.75,1000.0,age_3
4,M,84950.13,80.0,age_2


Next comes the richness column. Here we cannot recompute the quantiles on the query data to bin the account balances, because the resulting classes would differ from those of point 1.1. For this reason we reuse the bin edges saved in point 1.1 together with `pd.cut`, which bins the values using the edges we pass and assigns the labels we choose.

In [113]:
bin_labels = ['richness_0', 'richness_1', 'richness_2', 'richness_3', 'richness_4', 'richness_5', 'richness_6', 'richness_7', 'richness_8', 'richness_9', 'richness_10']

We reload the stored bin edges of the richness classes:

In [117]:
%store -r query_bins_richness  

In [119]:
#create the column with the classes of richness:

query['Richness'] = pd.cut(query.CustAccountBalance, bins = query_bins_richness, labels=bin_labels, right=False)
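
The idea of learning the edges once and reusing them can be sketched as follows. The balances, the four-class split and the widening of the outer edges to infinity are assumptions made for this example; the edges actually stored in `query_bins_richness` come from point 1.1 and are not reproduced here.

```python
import numpy as np
import pandas as pd

# hypothetical balances standing in for the Kaggle data and for the query users
train_balance = pd.Series([10.0, 50.0, 120.0, 400.0, 900.0, 2500.0, 8000.0, 20000.0])
query_balance = pd.Series([75.0, 5000.0, 1e9])

# learn the quantile edges once on the training data...
_, edges = pd.qcut(train_balance, q=4, retbins=True)
edges = np.array(edges, dtype=float)
edges[0], edges[-1] = -np.inf, np.inf  # so that out-of-range query values still get a class

# ...and reuse them unchanged on the query data
labels = ['richness_0', 'richness_1', 'richness_2', 'richness_3']
query_class = pd.cut(query_balance, bins=edges, labels=labels, right=False)
print(query_class.tolist())
```

Without the infinite outer edges, a query value outside the range seen in the training data would be returned by `pd.cut` as NaN and would then match no shingle.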

In [120]:
del query['CustAccountBalance'] #deleting the column that we don't need

In [121]:
query.head() #check the result

Unnamed: 0,CustGender,TransactionAmount (INR),CustomerClassAge,Richness
0,M,65.0,age_5,richness_9
1,M,6025.0,age_2,richness_4
2,M,541.5,age_2,richness_4
3,M,1000.0,age_3,richness_8
4,M,80.0,age_2,richness_9


We repeated the same procedure of the richness for the expenditure class:

In [123]:
%store -r query_bins_expenditure

In [124]:
#create the column expenditure

bin_labels = ['exp_1', 'exp_2', 'exp_3', 'exp_4', 'exp_5', 'exp_6', 'exp_7', 'exp_8', 'exp_9', 'exp_10']

query['Expenditure'] = pd.cut(query['TransactionAmount (INR)'], bins = query_bins_expenditure, labels=bin_labels, right=False)

In [126]:
del query['TransactionAmount (INR)'] #deleting the column that we don't need

In [127]:
query.head() #check the result

Unnamed: 0,CustGender,CustomerClassAge,Richness,Expenditure
0,M,age_5,richness_9,exp_2
1,M,age_2,richness_4,exp_10
2,M,age_2,richness_4,exp_6
3,M,age_3,richness_8,exp_8
4,M,age_2,richness_9,exp_2


As a last step, we give each query customer a name of the form Query_User_i, where i goes from 0 to 49:

In [130]:
#create two column and combine them to create the name:

query['sub'] = 'Query_User_' #prefix

query['num'] = range(50) #numbers

query['Name'] = query['sub'] + query['num'].astype(str) #concatenating prefix and number

del query['num'], query['sub'] #deleting the prefix and the numbers

In [131]:
query = query[['Name', 'CustGender', 'CustomerClassAge', 'Richness', 'Expenditure']] #reorder the column

In [133]:
query.head() #checking the result

Unnamed: 0,Name,CustGender,CustomerClassAge,Richness,Expenditure
0,Query_User_0,M,age_5,richness_9,exp_2
1,Query_User_1,M,age_2,richness_4,exp_10
2,Query_User_2,M,age_2,richness_4,exp_6
3,Query_User_3,M,age_3,richness_8,exp_8
4,Query_User_4,M,age_2,richness_9,exp_2


# 1.3.2 Minhash on query 

In [267]:
shingles = ['F', 'M', 'age_1', 'age_10', 'age_11', 'age_12', 'age_13', 'age_14', 'age_15', 'age_16', 'age_17', 'age_2', 'age_3', 'age_4', 'age_5', 'age_6', 'age_7', 'age_8', 'age_9', 'richness_0', 'richness_1', 'richness_10', 'richness_2', 'richness_3', 'richness_4', 'richness_5', 'richness_6', 'richness_7', 'richness_8', 'richness_9', 'exp_1', 'exp_10', 'exp_2', 'exp_3', 'exp_4', 'exp_5', 'exp_6', 'exp_7', 'exp_8', 'exp_9']

In [268]:
shingle_query = np.zeros((50, 40), dtype = int)

for i in tq(range(len(query))):
    # Append the one hot vectors as rows
    shingle_query[query.index[i]] = one_hot_vector(query, i) 

# We need to transpose because for the shuffling, the Shingles need to be the rows
shingle_query = shingle_query.T

100%|██████████████████████████████████████████| 50/50 [00:00<00:00, 902.87it/s]


In [272]:
query.loc[1]

Name                Query_User_1
CustGender                     M
CustomerClassAge           age_2
Richness              richness_4
Expenditure               exp_10
Name: 1, dtype: object

In [277]:
shingle_query[:, 1]

array([0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0])

In [279]:
n_permutations = 20 #number of permutations = number of rows of the signature matrix
np.random.seed(145) #note: the signatures are comparable with the dataset ones only if the same row permutations are used for both
signature_query = np.zeros((n_permutations, shingle_query.shape[1]), dtype = int)

for i in tq(range(n_permutations)):
    # 1. Shuffle rows
    np.random.shuffle(shingle_query)
    
    # 2. Create the vector of indices where the first 1 is found. np.argmax stops at the first occurrence
    signature_row = np.argmax(shingle_query == 1, axis=0) + 1
    
    # 3. Add to signature matrix
    signature_query[i] = signature_row

100%|█████████████████████████████████████████| 20/20 [00:00<00:00, 5839.62it/s]
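
MinHash values of the query users can be matched against those of the dataset only when both are computed with the same row permutations. A minimal sketch of one way to guarantee this is to draw the permutations once and apply them to both matrices. The helper names, the toy matrices and the seed below are assumptions for illustration, not the procedure used in the cells above.

```python
import numpy as np

def make_permutations(n_shingles, n_permutations, seed):
    """Draw the row permutations once so they can be reused for every matrix."""
    rng = np.random.default_rng(seed)
    return [rng.permutation(n_shingles) for _ in range(n_permutations)]

def signatures_from(matrix, permutations):
    """MinHash signature (position of the first 1, starting from 1) under each stored permutation."""
    return np.array([np.argmax(matrix[perm] == 1, axis=0) + 1 for perm in permutations])

# toy data: 5 shingles; the single query column equals dataset column 0
dataset = np.array([[1, 0], [0, 1], [1, 1], [0, 0], [0, 1]])
query_toy = dataset[:, [0]]

perms = make_permutations(n_shingles=5, n_permutations=20, seed=42)  # arbitrary seed
sig_dataset = signatures_from(dataset, perms)
sig_query = signatures_from(query_toy, perms)

# identical shingle sets give identical signatures because the permutations are shared
print(np.array_equal(sig_dataset[:, 0], sig_query[:, 0]))  # True
```

Storing the list of permutations (or the seed that generates it) alongside the signature matrix keeps later queries comparable with the buckets already built.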


In [233]:
indexes = list(range(signature_query.shape[1])) #create a list of indexes, one per query user

signature_query_transposed = signature_query.T #transpose so that each row is a query user's signature

cluster_query = {} #dictionary with the subvectors as keys and the names of the query users as values

for start in tq(range(0, signature_query.shape[0], b)):  #iterate over the rows with step size equal to the band size
    
    mini_vectors = signature_query_transposed[:, start:start+b]
    
    #sort the (subvector, index) pairs so that equal subvectors are adjacent
    c = [(idx, tuple(v)) for v, idx in sorted(zip(mini_vectors.tolist(), indexes))]
    
    for idx, v in c:  
        
        if v not in cluster_query: #if the subvector is not a key yet --> initialize it 
            
            cluster_query[v] = []

        cluster_query[v].append(query.loc[idx]['Name']) #append the name of the query user

100%|█████████████████████████████████████████████| 5/5 [00:00<00:00, 29.96it/s]


In [238]:
from collections import defaultdict

In [239]:
dd = defaultdict(list)

for d in (cluster, cluster_query): 
    for key, value in d.items():
        
        dd[key].append(value)

[['Query_User_3']]
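
Once a query user shares a bucket with some customers, those candidates can be scored to report the most similar ones. A hedged sketch is to rank them by the exact Jaccard similarity of their shingle vectors. The function names, the toy matrix and the `top` parameter are assumptions for illustration and were not part of the original analysis.

```python
import numpy as np

def jaccard(a, b):
    """Jaccard similarity of two 0/1 shingle vectors."""
    union = np.sum(a | b)
    return np.sum(a & b) / union if union else 0.0

def rank_candidates(query_vector, candidate_indexes, shingle_matrix, top=3):
    """Return the `top` (index, similarity) pairs among the candidates, most similar first."""
    scored = [(idx, jaccard(query_vector, shingle_matrix[:, idx])) for idx in candidate_indexes]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top]

# toy shingle matrix (rows = shingles, columns = customers) and one query vector
toy_matrix = np.array([
    [1, 1, 0],
    [0, 0, 1],
    [1, 0, 0],
    [0, 1, 1],
])
toy_query = np.array([1, 0, 1, 0])
ranked = rank_candidates(toy_query, [0, 1, 2], toy_matrix)
print(ranked)
```

Customer 0 has exactly the same shingles as the query and ranks first with similarity 1, customer 1 shares one shingle out of three, and customer 2 shares none.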