# 1.2 Fingerprint hashing

Using the previously selected data with the features you found pertinent, you have to:

Implement your minhash function from scratch. No ready-made hash functions are allowed. Read the class material and search the internet if you need to. For reference, it may be practical to look at the description of hash functions in the book.

Process the dataset and add each record to the MinHash. The subtask's goal is to try and map each consumer to its bin; to ensure this works well, be sure you understand how MinHash works and choose a matching threshold to use. Before moving on, experiment with different thresholds, explaining your choice.

In [6]:
import pandas as pd
from tqdm import tqdm as tq
import warnings
import numpy as np
warnings.filterwarnings("ignore")

In [7]:
df = pd.read_csv("data/data.csv", sep = '\t')

### Note: `CustGender` should be saved as a string in Part 1 instead of being converted here

In [8]:
df.CustGender = df.CustGender.astype(str)

In [32]:
df

Unnamed: 0.1,Unnamed: 0,TransactionID,CustGender,CustomerClassAge,Richness,Expenditure
0,0,T1,0,age_1,richness_6,exp_1
1,1,T2,1,age_4,richness_2,exp_10
2,2,T3,0,age_1,richness_6,exp_6
3,3,T4,0,age_3,richness_10,exp_9
4,4,T5,0,age_2,richness_4,exp_9
...,...,...,...,...,...,...
1041139,1041139,T1048563,1,age_2,richness_4,exp_7
1041140,1041140,T1048564,1,age_1,richness_7,exp_6
1041141,1041141,T1048565,1,age_2,richness_10,exp_7
1041142,1041142,T1048566,1,age_3,richness_4,exp_7


In [9]:
del df['Unnamed: 0']

In [34]:
list(df)

['TransactionID', 'CustGender', 'CustomerClassAge', 'Richness', 'Expenditure']

# 1.2.1 

First, we build the vocabulary:

In [10]:
# CustGender shingles: '0' and '1' (female, male)
vocab1 = ['0', '1']

# CustomerClassAge shingles, without the 0 class age (NaN)
vocab2 = ['age_1', 'age_2', 'age_3', 'age_4', 'age_5', 'age_6']

# Richness shingles
vocab3 = ['richness_1', 'richness_2', 'richness_3', 'richness_4', 'richness_5', 'richness_6', 'richness_7', 'richness_8', 'richness_9', 'richness_10']

# Expenditure shingles
vocab4 = ['exp_1', 'exp_2', 'exp_3', 'exp_4', 'exp_5', 'exp_6', 'exp_7', 'exp_8', 'exp_9', 'exp_10']

vocabulary = vocab1 + vocab2 + vocab3 + vocab4
vocabulary = np.array(vocabulary)

In [36]:
print(vocabulary)

[0, 1, 'age_1', 'age_2', 'age_3', 'age_4', 'age_5', 'age_6', 'richness_1', 'richness_2', 'richness_3', 'richness_4', 'richness_5', 'richness_6', 'richness_7', 'richness_8', 'richness_9', 'richness_10', 'exp_1', 'exp_2', 'exp_3', 'exp_4', 'exp_5', 'exp_6', 'exp_7', 'exp_8', 'exp_9', 'exp_10']


We found several approaches to building the signature matrix, and we decided to show all of them:

# First approach: create a one-hot vector for each transaction

First, we define a function that maps each transaction to a 0/1 vector over the vocabulary:

In [11]:
# Approach 1

def hot_vector(data, index):  # build the 0/1 vector of the record with label `index`
    
    values = data.loc[index][['CustGender', 'CustomerClassAge', 'Richness', 'Expenditure']].values
    # Compare the values, reshaped to (4, 1), against the vocabulary; [1] keeps the matching vocabulary positions
    indices = np.where(values.reshape(values.size, 1) == vocabulary)[1]
    vector = np.zeros(len(vocabulary))
    vector[indices] = 1
    
    return vector

Example: 

In [38]:
print(hot_vector(df, 1))

[0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1]


Now we can build a boolean matrix (mostly zeros) that holds all the encoded transactions:

In [29]:
# Shingle matrix: one row per transaction, one column per vocabulary entry
boolean_matrix = np.zeros((len(df), len(vocabulary)))

for i in tq(range(len(df))):
    # Store the one-hot vectors as rows; the matrix is transposed later
    boolean_matrix[i] = hot_vector(df, df.index[i])
    
#boolean_matrix.T

100%|██████████████████████████████████████████████████████████████████████| 1041144/1041144 [13:10<00:00, 1317.41it/s]
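
The row-by-row loop above took about 13 minutes in this run. As a rough sketch of an alternative, the same 0/1 matrix could be filled one feature column at a time through a dictionary lookup. The names `toy_df`, `toy_vocabulary` and `one_hot_matrix` are hypothetical and the rows are made up; only the column names come from the notebook.

```python
import numpy as np
import pandas as pd

# Toy stand-in for `df`: same column names as the notebook, made-up rows.
toy_df = pd.DataFrame({
    "CustGender": ["0", "1", "0"],
    "CustomerClassAge": ["age_1", "age_4", "age_1"],
    "Richness": ["richness_6", "richness_2", "richness_6"],
    "Expenditure": ["exp_1", "exp_10", "exp_6"],
})
# Reduced vocabulary, just large enough for the toy rows.
toy_vocabulary = ["0", "1", "age_1", "age_4", "richness_2",
                  "richness_6", "exp_1", "exp_6", "exp_10"]


def one_hot_matrix(data, vocab, columns):
    """Build the (n_records, len(vocab)) 0/1 matrix one feature column at a time."""
    position = {str(token): k for k, token in enumerate(vocab)}
    matrix = np.zeros((len(data), len(vocab)), dtype=np.uint8)
    row_ids = np.arange(len(data))
    for column in columns:
        # Vocabulary position of each value; NaN when the value is not in the vocabulary.
        cols = data[column].map(position).to_numpy(dtype=float)
        known = ~np.isnan(cols)
        matrix[row_ids[known], cols[known].astype(int)] = 1
    return matrix


toy_matrix = one_hot_matrix(
    toy_df, toy_vocabulary,
    ["CustGender", "CustomerClassAge", "Richness", "Expenditure"],
)
print(toy_matrix)
```

Each row of `toy_matrix` has exactly four ones, one per feature, which is the same structure `hot_vector` produces for a single record.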


In [30]:
%store boolean_matrix

Stored 'boolean_matrix' (ndarray)


In [38]:
# Transpose so that rows are shingles and columns are transactions
shingle_matrix_copy = np.copy(boolean_matrix).T


In [78]:
shingle_matrix_copy.shape

(28, 1041144)

In [41]:
hot_vector(df, 0)

array([0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 1., 0., 0., 0., 0., 1., 0.,
       0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0.])

In [47]:
# 1. shuffle the matrix
np.random.shuffle(shingle_matrix_copy)

In [77]:
# 2. for each column, find the row where the first one appears
np.argmax(shingle_matrix_copy == 1, axis=0) + 1

array([ 6, 13,  6, ...,  9,  2,  3], dtype=int64)

In [79]:
n_permutations = 12
signature_matrix = np.zeros((n_permutations, shingle_matrix_copy.shape[1]))

In [80]:
for i in tq(range(n_permutations)):
    np.random.shuffle(shingle_matrix_copy)
    signature_row = np.argmax(shingle_matrix_copy == 1, axis=0) + 1
    
    signature_matrix[i] = signature_row

100%|██████████████████████████████████████████████████████████████████████████████████| 12/12 [00:12<00:00,  1.05s/it]


In [81]:
signature_matrix

array([[ 2.,  4.,  2., ...,  7.,  7.,  7.],
       [ 2., 15.,  2., ...,  5., 15.,  5.],
       [ 3.,  4., 11., ...,  5., 10.,  7.],
       ...,
       [ 1.,  2.,  1., ...,  7.,  5., 13.],
       [ 7.,  6., 10., ...,  2.,  6.,  2.],
       [15.,  3., 15., ...,  2.,  2.,  6.]])

With the boolean matrix, building the signature matrix is simple: for each row of the signature matrix we shuffle the shingles (the rows of the boolean matrix) and record, for each column, the position of the first 1.
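
To check that such a signature matrix preserves similarity, the fraction of signature rows on which two columns agree can be compared with their exact Jaccard similarity. The sketch below does this on a made-up shingle matrix. The helper names (`jaccard`, `minhash_signatures`, `estimated_similarity`), the toy data and the choice of 200 permutations are assumptions for illustration, and `rng.permutation` stands in for the in-place `np.random.shuffle` used above.

```python
import numpy as np


def jaccard(col_a, col_b):
    """Exact Jaccard similarity of two 0/1 columns of the shingle matrix."""
    a, b = col_a.astype(bool), col_b.astype(bool)
    union = np.logical_or(a, b).sum()
    return float(np.logical_and(a, b).sum() / union) if union else 0.0


def minhash_signatures(shingles, n_permutations, seed=0):
    """Permute the rows, then take the (1-based) position of the first 1 per column."""
    rng = np.random.default_rng(seed)
    signatures = np.zeros((n_permutations, shingles.shape[1]), dtype=np.int64)
    for p in range(n_permutations):
        permuted = shingles[rng.permutation(shingles.shape[0])]
        signatures[p] = np.argmax(permuted == 1, axis=0) + 1
    return signatures


def estimated_similarity(signatures, i, j):
    """Fraction of signature rows on which columns i and j agree."""
    return float(np.mean(signatures[:, i] == signatures[:, j]))


# Toy shingle matrix: 6 shingles (rows) x 4 records (columns).
toy_shingles = np.array([
    [1, 1, 0, 1],
    [0, 0, 1, 0],
    [1, 1, 0, 1],
    [0, 0, 1, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
])
toy_signatures = minhash_signatures(toy_shingles, n_permutations=200)

print(jaccard(toy_shingles[:, 0], toy_shingles[:, 3]))   # exact value: 0.5
print(estimated_similarity(toy_signatures, 0, 3))        # estimate, close to 0.5
```

Identical columns always agree on every signature row and disjoint columns never do. For partially overlapping columns the estimate fluctuates around the exact value and gets tighter as the number of permutations grows.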

# Second approach: build the boolean matrix by storing the indices of the ones

Instead of storing all the zeros, we define a function that encodes the one-hot vector as a list containing only the indices of the ones.

In [39]:
# Version 1

def position_1(data, index):  # takes the dataframe and the label of the row
    
    # Values of the record, computed once instead of once per vocabulary entry
    values = list(data.loc[index][['CustGender', 'CustomerClassAge', 'Richness', 'Expenditure']])
    
    ind = []  # list that will contain the indices
    
    # Iterate over the vocabulary and append index i whenever a match is found
    for (i, elem) in enumerate(vocabulary):
        
        if elem in values:
            
            ind.append(i)
    
    return ind

In [40]:
# Version 2: same idea with a list comprehension (more compact)

def position_2(data, index):
    
    values = list(data.loc[index][['CustGender', 'CustomerClassAge', 'Richness', 'Expenditure']])
    
    return [i for i, elem in enumerate(vocabulary) if elem in values]

In [41]:
print(hot_vector(df, 1))

[0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1]


In [42]:
print(position_2(df, 1))

[1, 5, 9, 27]


In [43]:
print(position_1(df, 1))

[1, 5, 9, 27]


Now we can map the dataframe into a dictionary: the keys are the transaction IDs and the values are the lists of indices where a 1 appears:

In [None]:
boolean_matrix = dict()  # initialize the dict

for i in tq(range(len(df))):  # iterate over the dataframe
    
    # key: transaction ID, value: list of indices where a 1 appears
    boolean_matrix[df.loc[i, 'TransactionID']] = position_2(df, i)
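
Both `position_1` and `position_2` scan the whole vocabulary for every record. A possible alternative, sketched below, inverts the vocabulary into a token-to-index dictionary so that each value is resolved with a single lookup. The names `vocabulary_list`, `token_to_index` and `positions_by_lookup` are hypothetical, and the sketch assumes the vocabulary is still in its original, unshuffled order.

```python
# Rebuild the unshuffled vocabulary from the notebook as a plain list.
vocabulary_list = (
    ["0", "1"]
    + [f"age_{k}" for k in range(1, 7)]
    + [f"richness_{k}" for k in range(1, 11)]
    + [f"exp_{k}" for k in range(1, 11)]
)

# Inverse mapping: token -> position in the vocabulary.
token_to_index = {token: k for k, token in enumerate(vocabulary_list)}


def positions_by_lookup(record_values, lookup):
    """Return the sorted vocabulary positions of the record's values."""
    return sorted(lookup[v] for v in record_values if v in lookup)


# Values of the record with index 1, as displayed in the notebook.
record_1 = ["1", "age_4", "richness_2", "exp_10"]
print(positions_by_lookup(record_1, token_to_index))  # [1, 5, 9, 27]
```

The result matches the output of `position_1(df, 1)` and `position_2(df, 1)` shown above.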

# Third approach: build the signature matrix directly

The goal of MinHash is to replace a large set with a much smaller "signature" that still preserves the similarity between sets. To create a MinHash signature for each set:

 - Randomly permute the rows of the shingle matrix (i.e. permute the indices).
 
 - For each set, scan the permuted rows from the top and record the position of the first shingle with a 1 in its cell. This number represents the set under that permutation; repeating both steps with several permutations yields the "signature".
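
The two steps above can be sketched for a single set of shingle indices as follows. This is only an illustration under stated assumptions: `minhash_signature` is a hypothetical helper, and the permutations come from a seeded `random.Random` instance rather than from the notebook's own shuffling.

```python
import random


def minhash_signature(shingle_indices, n_shingles, n_permutations, seed=0):
    """Signature of one set: for each permutation, the first permuted row holding a 1."""
    rng = random.Random(seed)
    signature = []
    for _ in range(n_permutations):
        order = list(range(n_shingles))
        rng.shuffle(order)  # order[row] = shingle placed at permuted row `row`
        rank = {shingle: row for row, shingle in enumerate(order)}
        # The first 1 met when scanning from the top is the smallest rank in the set.
        signature.append(min(rank[s] for s in shingle_indices))
    return signature


# Record 1 of the notebook has ones at positions 1, 5, 9 and 27 of the 28-entry vocabulary.
example_signature = minhash_signature([1, 5, 9, 27], n_shingles=28, n_permutations=12)
print(example_signature)
```

Because the seed fixes the sequence of permutations, two sets hashed with the same seed are compared under the same permutations, which is what makes their signatures comparable.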

Here, instead of shuffling the rows of the boolean matrix, we shuffle the vocabulary:

In [12]:
import random as ran #shuffle the elements of a list
ran.seed(7) #set seed for reproducibility

How it works:

In [13]:
print(vocabulary)

['0' '1' 'age_1' 'age_2' 'age_3' 'age_4' 'age_5' 'age_6' 'richness_1'
 'richness_2' 'richness_3' 'richness_4' 'richness_5' 'richness_6'
 'richness_7' 'richness_8' 'richness_9' 'richness_10' 'exp_1' 'exp_2'
 'exp_3' 'exp_4' 'exp_5' 'exp_6' 'exp_7' 'exp_8' 'exp_9' 'exp_10']


In [14]:
ran.shuffle(vocabulary)

In [15]:
print(vocabulary)

['richness_2' 'age_4' 'exp_5' 'richness_1' 'age_6' 'exp_9' 'richness_7'
 'exp_2' 'exp_10' 'exp_7' 'richness_6' 'exp_8' 'richness_8' 'exp_4' '0'
 'age_5' 'richness_9' 'exp_6' 'exp_1' 'richness_4' 'age_2' 'richness_10'
 'age_1' '1' 'exp_3' 'richness_5' 'age_3' 'richness_3']


The logic is the same as in the previous approaches: iterate over the shuffled vocabulary and return the position of the first 1, i.e. the index of the first vocabulary element that matches one of the record's values.

In [16]:
def position_shuffle(data, index):  # takes the dataframe and the label of the row
    
    # Values of the record, computed once instead of once per vocabulary entry
    values = list(data.loc[index][['CustGender', 'CustomerClassAge', 'Richness', 'Expenditure']])
    
    for (i, elem) in enumerate(vocabulary):  # iterate over the (shuffled) vocabulary
        
        if elem in values:  # search for a match
        
            return i  # return the index of the first match

Example:

In [17]:
df.loc[1][['CustGender', 'CustomerClassAge', 'Richness', 'Expenditure']]

CustGender                   1
CustomerClassAge         age_4
Richness            richness_2
Expenditure             exp_10
Name: 1, dtype: object

In [50]:
print(vocabulary)

['richness_2', 'age_4', 'exp_5', 'richness_1', 'age_6', 'exp_9', 'richness_7', 'exp_2', 'exp_10', 'exp_7', 'richness_6', 'exp_8', 'richness_8', 'exp_4', 0, 'age_5', 'richness_9', 'exp_6', 'exp_1', 'richness_4', 'age_2', 'richness_10', 'age_1', 1, 'exp_3', 'richness_5', 'age_3', 'richness_3']


In [21]:
position_shuffle(df, 1)

0

In [22]:
ran.shuffle(vocabulary)

In [23]:
print(vocabulary)

['richness_5' '0' 'exp_8' '1' 'richness_7' 'richness_10' 'richness_2'
 'exp_9' 'richness_9' 'age_5' 'age_1' 'exp_5' 'richness_6' 'age_3'
 'exp_10' 'richness_4' 'richness_1' 'age_2' 'exp_4' 'exp_7' 'age_6'
 'exp_6' 'exp_3' 'exp_2' 'age_4' 'richness_8' 'richness_3' 'exp_1']


In [24]:
position_shuffle(df, 1)

3

We create a list of the transaction IDs to use as the column names of the signature matrix:

In [25]:
TID = list(df['TransactionID']) #transaction names

In [26]:
signature_matrix = pd.DataFrame(columns = TID) #initialize dataframe with transactionID

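We shuffle the vocabulary once per permutation and, at each step, append the resulting row of first-match positions to the signature matrix. The cell below uses 12 permutations, while the matrix displayed afterwards has 10 rows: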

In [28]:
n_permutations = 12
rows = []  # one dict per permutation

for i in tq(range(n_permutations)):
    
    ran.shuffle(vocabulary)  # shuffle the vocabulary
    
    # key: transaction ID, value: position of the first occurrence of a 1
    rows.append({TID[j]: position_shuffle(df, j) for j in range(len(TID))})
    
# DataFrame.append was removed in pandas 2.0, so the rows are collected first
signature_matrix = pd.DataFrame(rows, columns = TID)
    
signature_matrix.to_csv('/Users/giacomo/Desktop/locale/signature_matrix.csv', sep = '\t')

  0%|                                                                                           | 0/12 [00:14<?, ?it/s]


KeyboardInterrupt: 
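
The run above was interrupted because calling `position_shuffle` once per record and per permutation is slow on about a million rows. One possible speed-up, sketched here under assumptions, relies on the fact that the value returned by `position_shuffle` equals the smallest shuffled position among the record's shingles. The function `signature_matrix_from_indices` and the toy index matrix are hypothetical. The sketch also assumes each record has already been encoded as its four vocabulary indices, and it uses NumPy's `Generator.permutation` in place of `ran.shuffle`.

```python
import numpy as np


def signature_matrix_from_indices(index_matrix, n_shingles, n_permutations, seed=7):
    """index_matrix[r] holds the vocabulary indices of record r (one per feature)."""
    rng = np.random.default_rng(seed)
    signatures = np.empty((n_permutations, index_matrix.shape[0]), dtype=np.int64)
    for p in range(n_permutations):
        # rank[s] = position of shingle s in the shuffled vocabulary
        rank = rng.permutation(n_shingles)
        # First match when scanning the shuffled vocabulary = smallest rank in the record
        signatures[p] = rank[index_matrix].min(axis=1)
    return signatures


# Toy input: records 0 and 1 are identical, record 2 shares no shingle with them.
toy_indices = np.array([
    [1, 5, 9, 27],
    [1, 5, 9, 27],
    [0, 2, 13, 18],
])
toy_signature_matrix = signature_matrix_from_indices(toy_indices, n_shingles=28, n_permutations=12)
print(toy_signature_matrix)
```

Each permutation then costs one fancy-indexing operation over the whole dataset rather than one Python call per record.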

In [67]:
signature_matrix

Unnamed: 0,T2,T3,T4,T5,T6,T7,T8,T9,T10,T11,...,T995,T996,T997,T998,T999,T1000,T1001,T1002,T1003,T1004
0,3,6,3,3,0,3,3,0,0,0,...,0,0,0,0,11,3,0,13,6,0
1,12,8,0,13,4,9,0,1,4,4,...,4,4,4,4,2,5,4,1,8,4
2,3,11,0,4,4,1,0,10,1,2,...,14,6,14,12,2,6,13,9,11,3
3,0,4,0,9,10,8,12,1,7,4,...,4,2,2,9,4,2,4,4,4,0
4,9,11,3,5,5,14,3,12,0,2,...,2,9,9,2,4,9,15,12,6,2
5,3,1,7,0,0,5,7,8,5,11,...,11,3,3,2,7,3,6,8,1,11
6,2,16,2,2,2,0,2,8,0,9,...,19,1,14,2,4,1,11,4,5,12
7,5,9,15,1,8,1,15,0,7,2,...,8,5,5,6,2,5,8,0,6,8
8,2,0,2,3,3,10,2,11,10,13,...,5,6,5,13,2,2,1,15,0,13
9,4,0,4,3,3,8,13,0,8,0,...,0,0,0,5,0,6,0,0,0,0
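
With a signature matrix like the one above, experimenting with a matching threshold amounts to counting, for a query column, how many signature rows each other column shares with it. The sketch below is one hedged way to do that comparison. `similar_columns` and the toy signatures are made up for illustration, and this is not necessarily the binning procedure used later in the notebook.

```python
import numpy as np


def similar_columns(signatures, query, threshold):
    """Columns whose share of matching signature rows with `query` is >= threshold."""
    agreement = (signatures == signatures[:, [query]]).mean(axis=0)
    matches = np.flatnonzero(agreement >= threshold)
    return matches[matches != query], agreement


# Toy signature matrix: 4 permutations (rows) x 4 transactions (columns).
toy_sig = np.array([
    [3, 3, 0, 3],
    [1, 1, 4, 1],
    [2, 2, 2, 5],
    [0, 0, 1, 0],
])

matches_07, agreement = similar_columns(toy_sig, query=0, threshold=0.7)
matches_08, _ = similar_columns(toy_sig, query=0, threshold=0.8)
print(agreement)    # share of matching rows with column 0: 1.0, 1.0, 0.25, 0.75
print(matches_07)   # columns 1 and 3
print(matches_08)   # column 1 only
```

Raising the threshold from 0.7 to 0.8 drops column 3, which agrees with the query on three of the four rows. With only a handful of permutations the possible agreement values are coarse, so the number of permutations limits how finely a threshold can be tuned.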
