# 1.2 Fingerprint hashing

Using the previously selected data with the features you found pertinent, you have to:

Implement your minhash function from scratch. No ready-made hash functions are allowed. Read the class material and search the internet if you need to. For reference, it may be practical to look at the description of hash functions in the book.

Process the dataset and add each record to the MinHash. The subtask's goal is to try and map each consumer to its bin; to ensure this works well, be sure you understand how MinHash works and choose a matching threshold to use. Before moving on, experiment with different thresholds, explaining your choice.

In [2]:
import pandas as pd
from tqdm import tqdm as tq
import warnings
import numpy as np
warnings.filterwarnings("ignore")

In [3]:
df = pd.read_csv("/Users/giacomo/Desktop/ADM_HW4/data.csv", sep = '\t')

In [25]:
df

Unnamed: 0.1,Unnamed: 0,New_ID,CustGender,CustomerClassAge,Richness,Expenditure
0,0,C1010011F24,F,age_2,richness_7,exp_10
1,1,C1010011M33,M,age_4,richness_9,exp_5
2,2,C1010012M22,M,age_2,richness_6,exp_8
3,3,C1010014F24,F,age_2,richness_7,exp_8
4,4,C1010014M32,M,age_4,richness_9,exp_4
...,...,...,...,...,...,...
1034947,1034947,C9099836M26,M,age_3,richness_9,exp_7
1034948,1034948,C9099877M20,M,age_1,richness_9,exp_4
1034949,1034949,C9099919M23,M,age_2,richness_3,exp_3
1034950,1034950,C9099941M21,M,age_2,richness_7,exp_1


In [26]:
del df['Unnamed: 0']

# 1.2.1 Shingles

First of all we build the shingles from all the unique values per column in the loaded dataset. We skip the `New_ID` column because it identifies the customer and is not a feature, so it does not contribute any shingle.

In [27]:
shingles = []
for column_name in df.columns[1:]:
    shingles += sorted(list(df[column_name].unique()))
    
shingles.remove('age_0')

# Customers labelled age_0 have a missing date of birth (stored as a Customer DOB with year 1800).
# To avoid grouping them together, we remove age_0 from the shingles, so they get no 1 for the age in the shingle matrix.
# They will therefore never be considered similar to anyone because of the age, but only because of the other fields.

In [28]:
print(shingles)

['F', 'M', 'age_1', 'age_10', 'age_11', 'age_12', 'age_13', 'age_14', 'age_15', 'age_16', 'age_17', 'age_2', 'age_3', 'age_4', 'age_5', 'age_6', 'age_7', 'age_8', 'age_9', 'richness_0', 'richness_1', 'richness_10', 'richness_2', 'richness_3', 'richness_4', 'richness_5', 'richness_6', 'richness_7', 'richness_8', 'richness_9', 'exp_1', 'exp_10', 'exp_2', 'exp_3', 'exp_4', 'exp_5', 'exp_6', 'exp_7', 'exp_8', 'exp_9']


# 1.2.2 Create Shingle Matrix

First of all we create the function which maps each transaction into a vector of 0/1 based on the shingles. 

In [127]:
def one_hot_vector(data, index):
    """Creates a one hot vector for the row found in the data at the given index based on the shingles.
    
    :args
    data - a pandas dataframe containing the data.
    index - an int which corresponds to the row that will be turned into a one hot vector.
    
    :returns
    a numpy array one hot representation of the row
    """
    
    values = data.loc[index][['CustGender', 'CustomerClassAge', 'Richness', 'Expenditure']].values #extract values
    
    indices = np.where(values.reshape(values.size, 1) == shingles)[1]  # positions of the row's values in shingles
    
    vector = np.zeros(len(shingles), dtype = int)  # initialize the vector with zeros
    
    vector[indices] = 1  # set a 1 at each matching position
    
    return vector

Example:

In [30]:
print(shingles)

['F', 'M', 'age_1', 'age_10', 'age_11', 'age_12', 'age_13', 'age_14', 'age_15', 'age_16', 'age_17', 'age_2', 'age_3', 'age_4', 'age_5', 'age_6', 'age_7', 'age_8', 'age_9', 'richness_0', 'richness_1', 'richness_10', 'richness_2', 'richness_3', 'richness_4', 'richness_5', 'richness_6', 'richness_7', 'richness_8', 'richness_9', 'exp_1', 'exp_10', 'exp_2', 'exp_3', 'exp_4', 'exp_5', 'exp_6', 'exp_7', 'exp_8', 'exp_9']


In [31]:
df.loc[1]

New_ID              C1010011M33
CustGender                    M
CustomerClassAge          age_4
Richness             richness_9
Expenditure               exp_5
Name: 1, dtype: object

In [32]:
print(one_hot_vector(df, 1))

[0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0
 0 0 0]


In [33]:
len(df)

1034952

Now we can build a matrix (mostly filled with zeros) with all the encoded customers. We don't need to store the customer IDs in the matrix, because each customer is linked to the shingle matrix through the index of its column.

In [34]:
shingle_matrix = np.zeros((len(df), len(shingles)), dtype = int)  # 1034952 customers x 40 shingles

for i in tq(range(len(df))):
    # Append the one hot vectors as rows
    shingle_matrix[df.index[i]] = one_hot_vector(df, i) 

# We need to transpose because for the shuffling, the Shingles need to be the rows
shingle_matrix = shingle_matrix.T

100%|███████████████████████████████| 1034952/1034952 [13:02<00:00, 1322.44it/s]
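The row-by-row loop above takes about 13 minutes on the full dataset. As a minimal sketch of a possible alternative (not the approach used in this notebook), the same 0/1 matrix can be filled one feature column at a time through a shingle-to-row lookup. The function name `build_shingle_matrix` and the toy dataframe are hypothetical and only serve as an illustration.

```python
import numpy as np
import pandas as pd


def build_shingle_matrix(data, columns, shingles):
    """Return a (n_shingles x n_rows) 0/1 matrix, one column per row of `data`."""
    position = {s: k for k, s in enumerate(shingles)}  # shingle -> row of the matrix
    matrix = np.zeros((len(shingles), len(data)), dtype=np.int8)
    cols = np.arange(len(data))
    for column in columns:
        # Values that are not shingles (e.g. age_0) are mapped to NaN and skipped
        rows = data[column].map(position).to_numpy()
        known = ~pd.isna(rows)
        matrix[rows[known].astype(int), cols[known]] = 1
    return matrix


# Toy data: the second customer has age_0, which is not a shingle
toy_shingles = ['F', 'M', 'age_1', 'age_2', 'exp_1', 'exp_2']
toy_df = pd.DataFrame({
    'CustGender': ['F', 'M', 'M'],
    'CustomerClassAge': ['age_2', 'age_0', 'age_1'],
    'Expenditure': ['exp_1', 'exp_2', 'exp_2'],
})
toy_matrix = build_shingle_matrix(
    toy_df, ['CustGender', 'CustomerClassAge', 'Expenditure'], toy_shingles
)
print(toy_matrix)
```

The result is already oriented with shingles as rows, so no transpose is needed afterwards.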


In [35]:
%store shingle_matrix

Stored 'shingle_matrix' (ndarray)


In [280]:
%store -r shingle_matrix

# 1.2.3 Create the Signature Matrix
From the Shingle Matrix, we will now create the signature matrix by doing the following:
1. Shuffle the rows of the Shingle Matrix.
1. Create a vector where each element corresponds to a column (customer) and holds the index of the first row in which that column contains a 1.
1. Append this vector to the Signature Matrix.
1. Repeat $n$ times.

The goal of the MinHash is to replace a large set with a smaller "signature" that still preserves the underlying similarity metric.
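To make the claim about preserving similarity concrete, here is a small sketch on a toy shingle matrix. It assumes independently drawn permutations (through `np.random.default_rng`) instead of the cumulative in-place shuffle used in this notebook. The helper names `minhash_signatures` and `jaccard` are hypothetical.

```python
import numpy as np


def minhash_signatures(shingle_matrix, n_permutations, seed=0):
    """For each row permutation, store the 1-based position of the first 1 of every column."""
    rng = np.random.default_rng(seed)
    n_shingles, n_items = shingle_matrix.shape
    signatures = np.zeros((n_permutations, n_items), dtype=int)
    for k in range(n_permutations):
        permutation = rng.permutation(n_shingles)
        shuffled = shingle_matrix[permutation]
        signatures[k] = np.argmax(shuffled == 1, axis=0) + 1
    return signatures


def jaccard(a, b):
    """Exact Jaccard similarity of two 0/1 columns."""
    both = np.sum((a == 1) & (b == 1))
    either = np.sum((a == 1) | (b == 1))
    return both / either


# Toy shingle matrix: 8 shingles (rows) x 3 items (columns)
toy = np.array([
    [1, 1, 0],
    [0, 0, 1],
    [1, 1, 0],
    [0, 0, 1],
    [1, 0, 0],
    [0, 1, 1],
    [1, 1, 0],
    [0, 0, 1],
])

signatures = minhash_signatures(toy, n_permutations=500)

# The fraction of signature rows on which two columns agree estimates their Jaccard similarity
estimate_01 = np.mean(signatures[:, 0] == signatures[:, 1])
estimate_02 = np.mean(signatures[:, 0] == signatures[:, 2])
print("exact:", jaccard(toy[:, 0], toy[:, 1]), "estimated:", estimate_01)
print("exact:", jaccard(toy[:, 0], toy[:, 2]), "estimated:", estimate_02)
```

Items 0 and 1 share 3 of their 5 distinct shingles, so the estimate should be close to 0.6. Items 0 and 2 share none, so their signatures never agree.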

In [281]:
n_permutations = 20 #number of permutations = number of rows of the signature matrix
signature_matrix = np.zeros((n_permutations, shingle_matrix.shape[1]), dtype = int) #initialize signature matrix
np.random.seed(1) #set seed for reproducibility

In [282]:
for i in tq(range(n_permutations)):
    # 1. Shuffle rows
    np.random.shuffle(shingle_matrix)
    
    # 2. Create the vector of indices where the first 1 is found. np.argmax stops at the first occurrence
    signature_row = np.argmax(shingle_matrix == 1, axis=0) + 1
    
    # 3. Add to signature matrix
    signature_matrix[i] = signature_row

100%|███████████████████████████████████████████| 20/20 [00:35<00:00,  1.78s/it]


In [283]:
signature_matrix

array([[ 8,  1,  2, ...,  7, 34, 12],
       [11,  1,  9, ...,  8, 11, 15],
       [11,  3, 11, ...,  5, 11,  3],
       ...,
       [ 8,  1,  3, ..., 12, 22, 15],
       [11, 19, 10, ..., 11, 11, 24],
       [12,  4,  4, ...,  2,  4,  4]])

In [284]:
signature_matrix.shape

(20, 1034952)

Each column index of the signature matrix can be mapped back to a customer ID through the index of the initial dataframe:

In [285]:
df

Unnamed: 0.1,Unnamed: 0,New_ID,CustGender,CustomerClassAge,Richness,Expenditure
0,0,C1010011F24,F,age_2,richness_7,exp_10
1,1,C1010011M33,M,age_4,richness_9,exp_5
2,2,C1010012M22,M,age_2,richness_6,exp_8
3,3,C1010014F24,F,age_2,richness_7,exp_8
4,4,C1010014M32,M,age_4,richness_9,exp_4
...,...,...,...,...,...,...
1034947,1034947,C9099836M26,M,age_3,richness_9,exp_7
1034948,1034948,C9099877M20,M,age_1,richness_9,exp_4
1034949,1034949,C9099919M23,M,age_2,richness_3,exp_3
1034950,1034950,C9099941M21,M,age_2,richness_7,exp_1


For example, the first column of the signature matrix refers to the customer C1010011F24.

# 1.2.4 Divide Signature Matrix into Bands

The example signature matrix below is divided into bands of consecutive rows, and each band is hashed separately. In this example each band spans two rows, which means that any two items with the same first two rows are considered similar. The more rows we put in a band, the less likely it is that another customer matches on all of the same permutations. Note that in the code below `b` denotes the band size (the number of rows per band), not the number of bands.

![signature_matrix_into_bands](https://storage.googleapis.com/lds-media/images/locality-sensitive-hashing-lsh-buckets.width-1200.png)

Ultimately, the size of the bands controls the probability that two items with a given Jaccard similarity end up in the same bucket. If each band contains more rows, two items must agree on more hash values to collide, so you will end up with much smaller sets. For instance, a band size of $b = p$, where $p$ is the number of permutations (i.e. rows in the signature matrix), would put two items in the same bucket only if they are perfectly similar across every permutation, which in the extreme case leads to $N$ buckets of only one item.
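The trade-off can be quantified with the standard LSH banding formula. The sketch below assumes that each signature row agrees independently with probability equal to the Jaccard similarity $s$. Two items then share at least one band with probability $1 - (1 - s^r)^n$, for $n$ bands of $r$ rows each. The function names are hypothetical, and the values 4 and 5 correspond to the 20 permutations split into bands of 4 rows used in this notebook.

```python
def candidate_probability(similarity, rows_per_band, n_bands):
    """Probability that two items agree on every row of at least one band."""
    return 1 - (1 - similarity ** rows_per_band) ** n_bands


def approximate_threshold(rows_per_band, n_bands):
    """Similarity around which the probability curve rises most steeply."""
    return (1 / n_bands) ** (1 / rows_per_band)


rows_per_band, n_bands = 4, 5  # 20 signature rows split into 5 bands of 4 rows

for s in (0.2, 0.4, 0.6, 0.8, 1.0):
    p = candidate_probability(s, rows_per_band, n_bands)
    print(f"similarity {s:.1f} -> candidate probability {p:.3f}")

print("approximate threshold:", round(approximate_threshold(rows_per_band, n_bands), 3))
```

Re-running the loop with other band sizes is one way to compare thresholds before fixing a value.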

In [194]:
b = 4  # band size (rows per band): the 20 rows of the signature matrix give 5 bands

In [195]:
signature_matrix.T[:, 0:4]

array([[ 4,  4,  2,  1],
       [11,  3, 16, 15],
       [ 5,  3,  2, 22],
       ...,
       [ 6,  3,  2,  9],
       [ 6,  3,  2,  8],
       [17,  3, 18,  6]])

To create the buckets, we build a dictionary whose keys are the subvectors and whose values are the indexes of the columns that contain each subvector. These indexes will later allow us to substitute the related customers:

In [196]:
indexes = list(range(signature_matrix.shape[1]))  # column index of each customer

signature_matrix_transposed = signature_matrix.T  # transpose: one row per customer, one column per permutation

cluster = {}  # keys: band subvectors (tuples); values: indexes of the customers sharing that subvector

for start in tq(range(0, signature_matrix.shape[0], b)):  # iterate over the bands, b rows at a time
    
    # take the rows start..start+b (one band) of every signature
    mini_vectors = signature_matrix_transposed[:, start:start+b] 
    
    # Sort the subvectors together with the customer indexes, so that equal subvectors become neighbours
    # and the link with the customers is kept. We use tuples instead of lists because tuples are hashable
    # and can therefore be used as dictionary keys.
    pairs = [(idx, tuple(v)) for v, idx in sorted(zip(mini_vectors.tolist(), indexes))]
    
    # NOTE: the key does not include the band position, so equal subvectors
    # coming from different bands end up in the same bucket.
    for idx, v in pairs:  
        
        if v not in cluster:  # first time this subvector is seen --> initialize its bucket
            
            cluster[v] = []

        cluster[v].append(idx)  # append the index of the customer to the bucket
            

100%|█████████████████████████████████████████████| 5/5 [00:32<00:00,  6.58s/it]
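In the dictionary built above, a subvector is used as key regardless of the band it comes from. As a hedged variant (not the method used in this notebook), the band position can be added to the key so that only matches within the same band count as collisions. The helper `build_buckets` and the toy signature matrix are hypothetical.

```python
from collections import defaultdict

import numpy as np


def build_buckets(signature_matrix, rows_per_band):
    """Map (band index, band subvector) to the list of column indexes that share it."""
    buckets = defaultdict(list)
    n_rows, n_items = signature_matrix.shape
    for band, start in enumerate(range(0, n_rows, rows_per_band)):
        band_rows = signature_matrix[start:start + rows_per_band]
        for item in range(n_items):
            # Tuples are hashable, so they can be used as dictionary keys
            key = (band, tuple(band_rows[:, item].tolist()))
            buckets[key].append(item)
    return buckets


# Toy signature matrix: 4 permutations (rows) x 3 items (columns)
toy_signatures = np.array([
    [1, 1, 2],
    [3, 3, 3],
    [2, 5, 1],
    [3, 4, 3],
])

toy_buckets = build_buckets(toy_signatures, rows_per_band=2)
for key, members in toy_buckets.items():
    print(key, members)
```

In the toy matrix the subvector `(2, 3)` appears in band 0 for item 2 and in band 1 for item 0. With the band index in the key these two occurrences stay in separate buckets.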


As an example, we print the first 5 keys (subvectors):

In [197]:
print(list(cluster.keys())[:5])

[(1, 1, 4, 1), (1, 1, 4, 6), (1, 1, 4, 8), (1, 1, 4, 9), (1, 1, 4, 10)]


For example, for the subvector (1, 1, 4, 6) the values are the indexes of the customers that have been mapped to the same bucket:

In [199]:
cluster[(1, 1, 4, 6)]

[118,
 1167,
 1641,
 1761,
 1943,
 1976,
 2181,
 2298,
 2355,
 2440,
 2652,
 2844,
 2871,
 3381,
 3400,
 3412,
 3768,
 3780,
 3966,
 3988,
 4377,
 4530,
 4549,
 4740,
 5400,
 5693,
 7119,
 7675,
 7938,
 8339,
 8677,
 8748,
 9528,
 9640,
 10723,
 10774,
 11450,
 11757,
 12321,
 12506,
 12631,
 12654,
 12813,
 12824,
 13047,
 13821,
 14543,
 14889,
 14975,
 15039,
 15163,
 15174,
 15247,
 15472,
 16119,
 16298,
 16331,
 16447,
 16470,
 16484,
 16505,
 16532,
 16580,
 16817,
 17409,
 17450,
 17928,
 18315,
 18391,
 18611,
 19125,
 19251,
 19528,
 19738,
 19796,
 20008,
 20107,
 20255,
 20507,
 20741,
 20858,
 20982,
 21155,
 21158,
 21357,
 21569,
 21575,
 21696,
 21725,
 22218,
 22465,
 23781,
 23895,
 23897,
 25293,
 25487,
 26025,
 26812,
 27081,
 27356,
 27385,
 27432,
 27585,
 28240,
 28392,
 28418,
 28755,
 28930,
 29164,
 29556,
 29570,
 29679,
 29965,
 30830,
 31164,
 31409,
 31438,
 31538,
 31852,
 32254,
 32658,
 32800,
 33339,
 33647,
 33932,
 34497,
 34598,
 35154,
 35491,
 35

Through these indexes we can recover the customers and visually check their similarity:

In [201]:
df.loc[cluster[(1, 1, 4, 6)]]

Unnamed: 0.1,Unnamed: 0,New_ID,CustGender,CustomerClassAge,Richness,Expenditure
118,118,C1010357F26,F,age_3,richness_7,exp_7
1167,1167,C1013653M27,M,age_3,richness_7,exp_7
1641,1641,C1014989M28,M,age_3,richness_7,exp_7
1761,1761,C1015444M28,M,age_3,richness_7,exp_7
1943,1943,C1015980M27,M,age_3,richness_7,exp_7
...,...,...,...,...,...,...
1033933,1033933,C9069763M29,M,age_3,richness_7,exp_7
1033997,1033997,C9072066M30,M,age_3,richness_7,exp_7
1034200,1034200,C9078355M26,M,age_3,richness_7,exp_7
1034626,1034626,C9090674M27,M,age_3,richness_7,exp_7


We then replace the list of indexes stored under each key with the list of the related customer IDs:

In [202]:
for key in cluster.keys():
    
    cluster[key] = df.loc[cluster[key]]['New_ID'].to_list()

Now the dictionary cluster contains as keys the name of the bucket (subvector) and as values the CustomerIDs.

In [203]:
cluster[(1, 1, 4, 6)]

['C1010357F26',
 'C1013653M27',
 'C1014989M28',
 'C1015444M28',
 'C1015980M27',
 'C1016072F30',
 'C1016680F28',
 'C1017059M26',
 'C1017221M30',
 'C1017516M28',
 'C1018040F26',
 'C1018587M27',
 'C1018679F28',
 'C1020152M27',
 'C1020229F26',
 'C1020264F26',
 'C1021370F26',
 'C1021416M26',
 'C1021990M28',
 'C1022068M26',
 'C1023241M28',
 'C1023716M28',
 'C1023774F26',
 'C1024265F28',
 'C1026288M29',
 'C1027157M28',
 'C1031622F26',
 'C1033326M28',
 'C1034140M29',
 'C1035372M30',
 'C1036376M27',
 'C1036555M27',
 'C1038872F27',
 'C1039192M28',
 'C1042568F26',
 'C1042712M30',
 'C1061633M26',
 'C1070824F26',
 'C1087479M26',
 'C1093011M26',
 'C1096859M28',
 'C1097589M29',
 'C1110237F29',
 'C1110253M28',
 'C1110891M30',
 'C1113338M28',
 'C1115492F27',
 'C1116519M26',
 'C1116827M27',
 'C1116990F26',
 'C1117353F27',
 'C1117385M29',
 'C1117568F29',
 'C1118174F26',
 'C1120083M26',
 'C1120579M27',
 'C1120681M26',
 'C1121070M26',
 'C1121148F30',
 'C1121189F26',
 'C1121263M27',
 'C1121345F26',
 'C11214

# 1.3 Locality Sensitive Hashing

Now that you prepared your algorithm, it's query time!
We have prepared some dummy users for you to work with.

Download this csv and report the most similar users (comparing them against the dataset provided in Kaggle).
Did your hashing method work properly? What scores have you obtained, and how long did it take to run? Provide information and analysis about the results.

# 1.3.1 Pre-processing Query dataset

In [241]:
query = pd.read_csv("/Users/giacomo/Desktop/ADM_HW4/query_users.csv")

In [242]:
query

Unnamed: 0,CustomerDOB,CustGender,CustLocation,CustAccountBalance,TransactionDate,TransactionTime,TransactionAmount (INR)
0,27/7/78,M,DELHI,94695.61,2/9/16,140310,65.0
1,6/11/92,M,PANCHKULA,7584.09,2/9/16,120214,6025.0
2,14/8/91,M,PATNA,7180.6,10/8/16,221732,541.5
3,3/1/87,M,CHENNAI,56847.75,29/8/16,144138,1000.0
4,4/1/95,M,GURGAON,84950.13,25/9/16,233309,80.0
5,10/1/81,M,WORLD TRADE CENTRE BANGALORE,23143.95,11/9/16,192906,303.0
6,20/9/76,F,CHITTOOR,15397.8,28/8/16,92633,20.0
7,10/4/91,M,MOHALI,426.3,2/8/16,203754,50.0
8,19/3/90,M,MOHALI,4609.34,26/8/16,184015,300.0
9,19/12/70,M,SERAMPORE,6695988.46,27/8/16,144030,299.0


In [243]:
del query['CustLocation']

First of all, we convert the raw fields into classes of age, richness and expenditure:

In [244]:
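# The raw dates are written day/month/year, but pd.to_datetime parses ambiguous dates month-first by default
# (e.g. 6/11/92 becomes 1992-06-11). Only the year is used below, so the derived age is not affected.
query['CustomerDOB'] = pd.to_datetime(query['CustomerDOB'])

query['TransactionDate'] = pd.to_datetime(query['TransactionDate']) 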

In [245]:
query.head(5)

Unnamed: 0,CustomerDOB,CustGender,CustAccountBalance,TransactionDate,TransactionTime,TransactionAmount (INR)
0,1978-07-27,M,94695.61,2016-02-09,140310,65.0
1,1992-06-11,M,7584.09,2016-02-09,120214,6025.0
2,1991-08-14,M,7180.6,2016-10-08,221732,541.5
3,1987-03-01,M,56847.75,2016-08-29,144138,1000.0
4,1995-04-01,M,84950.13,2016-09-25,233309,80.0


In [246]:
query['CustomerAge'] = 0

In [247]:
valid_dob = query['CustomerDOB'].dt.year != 1800  # a DOB year of 1800 marks a missing date of birth

query.loc[valid_dob, 'CustomerAge'] = query.loc[valid_dob, 'TransactionDate'].dt.year - query.loc[valid_dob, 'CustomerDOB'].dt.year

In [248]:
query.head(5)

Unnamed: 0,CustomerDOB,CustGender,CustAccountBalance,TransactionDate,TransactionTime,TransactionAmount (INR),CustomerAge
0,1978-07-27,M,94695.61,2016-02-09,140310,65.0,38
1,1992-06-11,M,7584.09,2016-02-09,120214,6025.0,24
2,1991-08-14,M,7180.6,2016-10-08,221732,541.5,25
3,1987-03-01,M,56847.75,2016-08-29,144138,1000.0,29
4,1995-04-01,M,84950.13,2016-09-25,233309,80.0,21


In [249]:
del query['TransactionDate'], query['TransactionTime'], query['CustomerDOB']

In [250]:
bins = np.array(list(range(16, 102, 5)))

def age(age):
    
    class_age = np.digitize(age, bins, right=False)  #return the number of the bin
    
    age = 'age_' + str(class_age)
        
    return age
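As a quick illustration of how `np.digitize` assigns the age classes, the sketch below rebuilds the same bin edges and applies them to a few ages. With `right=False`, an age equal to a bin edge falls into the upper class, and any age below 16 (including the placeholder 0 used for a missing date of birth) becomes `age_0`. The function name `age_class` is hypothetical, chosen to avoid reusing the name `age`.

```python
import numpy as np

bins = np.arange(16, 102, 5)  # bin edges 16, 21, 26, ..., 101


def age_class(age):
    """Return the label of the 5-year class that contains `age`."""
    return 'age_' + str(np.digitize(age, bins, right=False))


for years in (0, 15, 21, 24, 29, 38):
    print(years, '->', age_class(years))
```

The ages 38, 24, 29 and 21 of the first query users map to `age_5`, `age_2`, `age_3` and `age_2`.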

In [251]:
query['CustomerClassAge'] = query.CustomerAge.apply(lambda x: age(x))

In [252]:
del query['CustomerAge']

In [253]:
bin_labels = ['richness_0', 'richness_1', 'richness_2', 'richness_3', 'richness_4', 'richness_5', 'richness_6', 'richness_7', 'richness_8', 'richness_9', 'richness_10']

In [254]:
%store -r query_bins_richness

In [255]:
%store -r query_bins_expenditure

In [256]:
query['Richness'] = pd.cut(query.CustAccountBalance, bins = query_bins_richness, labels=bin_labels, right=False)

In [257]:
del query['CustAccountBalance']

In [258]:
bin_labels = ['exp_1', 'exp_2', 'exp_3', 'exp_4', 'exp_5', 'exp_6', 'exp_7', 'exp_8', 'exp_9', 'exp_10']

In [259]:
query['Expenditure'] = pd.cut(query['TransactionAmount (INR)'], bins = query_bins_expenditure, labels=bin_labels, right=False)

In [260]:
del query['TransactionAmount (INR)']

In [261]:
query['sub'] = 'Query_User_' 

In [262]:
query['num'] = range(50)

In [263]:
query['Name'] = query['sub'] + query['num'].astype(str)

In [264]:
del query['num'], query['sub']

In [265]:
query = query[['Name', 'CustGender', 'CustomerClassAge', 'Richness', 'Expenditure']]

In [278]:
query.head()

Unnamed: 0,Name,CustGender,CustomerClassAge,Richness,Expenditure
0,Query_User_0,M,age_5,richness_9,exp_2
1,Query_User_1,M,age_2,richness_4,exp_10
2,Query_User_2,M,age_2,richness_4,exp_6
3,Query_User_3,M,age_3,richness_8,exp_8
4,Query_User_4,M,age_2,richness_9,exp_2


# 1.3.2 Minhash on query 

In [267]:
shingles = ['F', 'M', 'age_1', 'age_10', 'age_11', 'age_12', 'age_13', 'age_14', 'age_15', 'age_16', 'age_17', 'age_2', 'age_3', 'age_4', 'age_5', 'age_6', 'age_7', 'age_8', 'age_9', 'richness_0', 'richness_1', 'richness_10', 'richness_2', 'richness_3', 'richness_4', 'richness_5', 'richness_6', 'richness_7', 'richness_8', 'richness_9', 'exp_1', 'exp_10', 'exp_2', 'exp_3', 'exp_4', 'exp_5', 'exp_6', 'exp_7', 'exp_8', 'exp_9']

In [268]:
shingle_query = np.zeros((len(query), len(shingles)), dtype = int)  # 50 query users x 40 shingles

for i in tq(range(len(query))):
    # Append the one hot vectors as rows
    shingle_query[query.index[i]] = one_hot_vector(query, i) 

# We need to transpose because for the shuffling, the Shingles need to be the rows
shingle_query = shingle_query.T

100%|██████████████████████████████████████████| 50/50 [00:00<00:00, 902.87it/s]


In [272]:
query.loc[1]

Name                Query_User_1
CustGender                     M
CustomerClassAge           age_2
Richness              richness_4
Expenditure               exp_10
Name: 1, dtype: object

In [277]:
shingle_query[:, 1]

array([0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0])

In [279]:
n_permutations = 20 #number of permutations = number of rows of the signature matrix
# NOTE: this seed differs from the one used for the dataset (seed 1), so the query rows are shuffled with
# different permutations; signatures are comparable only if both matrices go through the same permutations.
np.random.seed(145)
signature_query = np.zeros((n_permutations, shingle_query.shape[1]), dtype = int)

for i in tq(range(n_permutations)):
    # 1. Shuffle rows
    np.random.shuffle(shingle_query)
    
    # 2. Create the vector of indices where the first 1 is found. np.argmax stops at the first occurrence
    signature_row = np.argmax(shingle_query == 1, axis=0) + 1
    
    # 3. Add to signature matrix
    signature_query[i] = signature_row

100%|█████████████████████████████████████████| 20/20 [00:00<00:00, 5839.62it/s]
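MinHash signatures of two matrices can only be compared when both are computed with the same row permutations. A minimal sketch of one way to guarantee this, assuming the permutations are generated once and reused for the dataset and for the query, is shown below. The helper names and toy matrices are hypothetical, and this differs from the in-place shuffling used in the cells above.

```python
import numpy as np


def make_permutations(n_shingles, n_permutations, seed=1):
    """Generate the row permutations once, so they can be reused."""
    rng = np.random.default_rng(seed)
    return [rng.permutation(n_shingles) for _ in range(n_permutations)]


def signatures_from_permutations(shingle_matrix, permutations):
    """Apply the given permutations and store the 1-based position of the first 1 per column."""
    signatures = np.zeros((len(permutations), shingle_matrix.shape[1]), dtype=int)
    for k, permutation in enumerate(permutations):
        signatures[k] = np.argmax(shingle_matrix[permutation] == 1, axis=0) + 1
    return signatures


# Toy dataset (5 shingles x 2 customers) and a query identical to the first customer
toy_data = np.array([
    [1, 0],
    [0, 1],
    [1, 1],
    [0, 0],
    [0, 1],
])
toy_query = toy_data[:, [0]]

permutations = make_permutations(n_shingles=5, n_permutations=10)
data_signatures = signatures_from_permutations(toy_data, permutations)
query_signatures = signatures_from_permutations(toy_query, permutations)

# With shared permutations, identical columns get identical signatures
print(np.array_equal(data_signatures[:, 0], query_signatures[:, 0]))
```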


In [233]:
indexes = list(range(signature_query.shape[1]))  # column index of each query user

signature_query_transposed = signature_query.T  # one row per query user

cluster_query = {}  # keys: band subvectors (tuples); values: names of the query users sharing that subvector

for start in tq(range(0, signature_query.shape[0], b)):  # iterate over the bands, b rows at a time
    
    mini_vectors = signature_query_transposed[:, start:start+b]  # one band of every query signature
    
    # sort the subvectors together with the indexes, so that equal subvectors become neighbours
    pairs = [(idx, tuple(v)) for v, idx in sorted(zip(mini_vectors.tolist(), indexes))]
    
    for idx, v in pairs:  
        
        if v not in cluster_query:  # first time this subvector is seen --> initialize its bucket
            
            cluster_query[v] = []

        cluster_query[v].append(query.loc[idx]['Name'])  # store the name of the query user in the bucket

100%|█████████████████████████████████████████████| 5/5 [00:00<00:00, 29.96it/s]


In [238]:
from collections import defaultdict

In [239]:
dd = defaultdict(list)

for d in (cluster, cluster_query): 
    for key, value in d.items():
        
        dd[key].append(value)

[['Query_User_3']]
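Once a query user and some customers fall into the same bucket, the candidates can be ranked by their exact Jaccard similarity to report the most similar ones. The sketch below assumes that the feature values of every candidate are available in a dictionary. The customer IDs `C1`, `C2`, `C3`, the function `rank_candidates` and the toy features are hypothetical and are not results from this notebook.

```python
def jaccard(a, b):
    """Exact Jaccard similarity between two collections of feature values."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)


def rank_candidates(query_features, candidate_ids, features_by_id, top=3):
    """Score each distinct candidate against the query and return the best ones."""
    scored = [
        (jaccard(query_features, features_by_id[c]), c)
        for c in set(candidate_ids)
    ]
    # Highest similarity first; ties are broken by customer ID
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return scored[:top]


# Toy features of three hypothetical customers
toy_features = {
    'C1': ['M', 'age_3', 'richness_8', 'exp_8'],
    'C2': ['M', 'age_3', 'richness_7', 'exp_8'],
    'C3': ['F', 'age_2', 'richness_1', 'exp_1'],
}
toy_query = ['M', 'age_3', 'richness_8', 'exp_8']

# A candidate may appear in several buckets, so duplicates are removed before scoring
ranking = rank_candidates(toy_query, ['C1', 'C2', 'C3', 'C2'], toy_features)
print(ranking)
```

Timing this step together with the bucket lookup would give the running time asked for in the task.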