# 1.2 Fingerprint hashing

Using the previously selected data with the features you found pertinent, you have to:

Implement your own MinHash function from scratch. No ready-made hash functions are allowed. Read the class material and search the internet if you need to. For reference, it may be useful to look at the description of hash functions in the book.

Process the dataset and add each record to the MinHash. The subtask's goal is to try and map each consumer to its bin; to ensure this works well, be sure you understand how MinHash works and choose a matching threshold to use. Before moving on, experiment with different thresholds, explaining your choice.

In [53]:
import pandas as pd
from tqdm import tqdm as tq
import warnings
warnings.filterwarnings("ignore")

In [2]:
df = pd.read_csv("/Users/giacomo/Desktop/locale/data.csv", sep = '\t')

In [3]:
df

Unnamed: 0.1,Unnamed: 0,TransactionID,CustGender,CustomerClassAge,Richness,Expenditure
0,0,T1,0,age_1,richness_6,exp_1
1,1,T2,1,age_4,richness_2,exp_10
2,2,T3,0,age_1,richness_6,exp_6
3,3,T4,0,age_3,richness_10,exp_9
4,4,T5,0,age_2,richness_4,exp_9
...,...,...,...,...,...,...
1041139,1041139,T1048563,1,age_2,richness_4,exp_7
1041140,1041140,T1048564,1,age_1,richness_7,exp_6
1041141,1041141,T1048565,1,age_2,richness_10,exp_7
1041142,1041142,T1048566,1,age_3,richness_4,exp_7


In [4]:
del df['Unnamed: 0']

In [5]:
list(df)

['TransactionID', 'CustGender', 'CustomerClassAge', 'Richness', 'Expenditure']

# 1.2.1 Vocabulary

In [None]:
# Assumption: we use the version of the dataset whose class features are stored as strings (e.g. 'age_1').

First of all we built the vocabulary: 

In [39]:
vocab1 = [0, 1]  # gender shingles: 0, 1 for female, male

# CustomerClassAge shingles; class age 0 (NaN) is not considered
vocab2 = ['age_1', 'age_2', 'age_3', 'age_4', 'age_5', 'age_6']

# Richness shingles
vocab3 = ['richness_1', 'richness_2', 'richness_3', 'richness_4', 'richness_5', 'richness_6', 'richness_7', 'richness_8', 'richness_9', 'richness_10']

# Expenditure shingles
vocab4 = ['exp_1', 'exp_2', 'exp_3', 'exp_4', 'exp_5', 'exp_6', 'exp_7', 'exp_8', 'exp_9', 'exp_10']

vocabulary = vocab1 + vocab2 + vocab3 + vocab4

In [40]:
print(vocabulary)

[0, 1, 'age_1', 'age_2', 'age_3', 'age_4', 'age_5', 'age_6', 'richness_1', 'richness_2', 'richness_3', 'richness_4', 'richness_5', 'richness_6', 'richness_7', 'richness_8', 'richness_9', 'richness_10', 'exp_1', 'exp_2', 'exp_3', 'exp_4', 'exp_5', 'exp_6', 'exp_7', 'exp_8', 'exp_9', 'exp_10']


# 1.2.2 Create one hot vector for each transaction

Next, we define a function that maps each transaction to a 0/1 vector over the vocabulary:

In [8]:
list(df.loc[1][['CustGender', 'CustomerClassAge', 'Richness', 'Expenditure']])

[1, 'age_4', 'richness_2', 'exp_10']

In [9]:
# Approach 1: full one-hot vector

def hot_vector(data, index):  # one-hot vector (0/1) over the vocabulary
    
    # read the row once instead of once per vocabulary entry
    values = list(data.loc[index][['CustGender', 'CustomerClassAge', 'Richness', 'Expenditure']])
    
    vector = [1 if elem in values else 0 for elem in vocabulary]
    
    return vector

Instead of storing all these zeros, we define a function that encodes the one-hot vector as a list containing only the indexes of the 1s.

In [14]:
# Approach 2.1: first attempt (same running time as position_2)

def position_1(data, index):
    
    # read the row once instead of once per vocabulary entry
    values = list(data.loc[index][['CustGender', 'CustomerClassAge', 'Richness', 'Expenditure']])
    
    ind = []
    
    for (i, elem) in enumerate(vocabulary):
        
        if elem in values:
            
            ind.append(i)
    
    return ind

In [11]:
# Approach 2.2: list comprehension (same running time as position_1)

def position_2(data, index):
    
    # read the row once instead of once per vocabulary entry
    values = list(data.loc[index][['CustGender', 'CustomerClassAge', 'Richness', 'Expenditure']])
    
    ind = [i for i, elem in enumerate(vocabulary) if elem in values]
    
    return ind

In [12]:
position_2(df, 1)

[1, 5, 9, 27]

In [15]:
position_1(df, 1)

[1, 5, 9, 27]

In [16]:
print(hot_vector(df, 1))

[0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1]


# All these functions work, but how can we spend less time iterating over the initial dataframe?

## Read the next section to see what this refers to.

# 1.2.3 Mapping each transaction through vocabulary

Now we can map the initial dataset to a dictionary (or dataframe) whose keys are the transaction IDs and whose values are the corresponding one-hot vectors:

In [18]:
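# Attempt 1: correct, but far too slow

boolean_matrix = pd.DataFrame(vocabulary, columns=['Shingles'])  # initialize the shingle matrix, one row per shingle

for i in tq(range(len(df))):
    
    # one column per transaction, keyed by its TransactionID
    boolean_matrix[df.loc[i, 'TransactionID']] = hot_vector(df, i)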

In [26]:
# Attempt 2: correct, but still slow

boolean_matrix = dict()

for i in tq(range(len(df))):  # takes 5-6 hours, but it works
    
    boolean_matrix[df.loc[i, 'TransactionID']] = position_2(df, i)
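
One possible way to reduce the running time is to avoid a `df.loc[i]` lookup for every row and to replace the scan over the vocabulary with a dictionary lookup. The sketch below is only an illustration under stated assumptions: `shingle_index`, `encode_rows` and `toy_rows` are hypothetical names, the rows are copied from the dataframe preview, and the vocabulary is in its unshuffled order.

```python
# Vocabulary in the same (unshuffled) order as in the notebook.
vocabulary = ([0, 1]
              + [f"age_{k}" for k in range(1, 7)]
              + [f"richness_{k}" for k in range(1, 11)]
              + [f"exp_{k}" for k in range(1, 11)])

# shingle -> position in the vocabulary, built once instead of scanning the list per row
shingle_index = {shingle: i for i, shingle in enumerate(vocabulary)}


def encode_rows(rows):
    """rows: iterable of (TransactionID, gender, age class, richness, expenditure).

    Returns {TransactionID: sorted positions of the 1s in the one-hot vector}.
    Values missing from the vocabulary (e.g. a NaN age class) are skipped.
    """
    encoded = {}
    for tid, *features in rows:
        encoded[tid] = sorted(shingle_index[f] for f in features if f in shingle_index)
    return encoded


# Hypothetical toy input, copied from the first rows of the dataframe preview.
# With pandas, such tuples could come from (assumption: same column order):
#   df[['TransactionID', 'CustGender', 'CustomerClassAge',
#       'Richness', 'Expenditure']].itertuples(index=False, name=None)
toy_rows = [
    ("T1", 0, "age_1", "richness_6", "exp_1"),
    ("T2", 1, "age_4", "richness_2", "exp_10"),
    ("T3", 0, "age_1", "richness_6", "exp_6"),
]

encoded = encode_rows(toy_rows)
print(encoded["T2"])  # [1, 5, 9, 27], the same result as position_2(df, 1)
```

Each row is touched once and each feature costs one dictionary lookup, so the work per transaction no longer depends on a pandas label lookup.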

# 1.2.4 Signature matrix

The goal of MinHash is to replace a large set with a much smaller "signature" that still preserves the underlying similarity metric (the Jaccard similarity between sets). To create a MinHash signature for each set:

 - Randomly permute the rows of the shingle matrix (permuting the indexes)
 - For each set, start from the first index and find the position of the first shingle with a 1 in its cell. Use this shingle number to represent the set. This is the "signature".
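
The two steps above can be sketched on a pair of toy transactions. This is a minimal illustration, not the notebook's implementation: `minhash_signature`, `estimated_similarity` and `jaccard` are hypothetical helper names, the two sets are made up, and every set is assumed to contain at least one shingle from the vocabulary.

```python
import random


def minhash_signature(shingle_set, permutations):
    """For each permutation, record the position of the first shingle that is in the set."""
    signature = []
    for perm in permutations:
        for position, shingle in enumerate(perm):
            if shingle in shingle_set:
                signature.append(position)
                break
    return signature


def estimated_similarity(sig_a, sig_b):
    """Fraction of permutations on which the two signatures agree."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def jaccard(a, b):
    """Exact Jaccard similarity, used here only as a reference value."""
    return len(a & b) / len(a | b)


vocabulary = ([0, 1]
              + [f"age_{k}" for k in range(1, 7)]
              + [f"richness_{k}" for k in range(1, 11)]
              + [f"exp_{k}" for k in range(1, 11)])

rng = random.Random(7)
permutations = []
for _ in range(200):          # 200 permutations, chosen only for this illustration
    perm = vocabulary[:]      # copy, so the original order is left untouched
    rng.shuffle(perm)
    permutations.append(perm)

a = {1, "age_4", "richness_2", "exp_10"}
b = {1, "age_4", "richness_2", "exp_7"}   # differs from a only in expenditure

sig_a = minhash_signature(a, permutations)
sig_b = minhash_signature(b, permutations)

print(jaccard(a, b))                       # exact value: 3 shared / 5 total = 0.6
print(estimated_similarity(sig_a, sig_b))  # estimate from the signatures
```

The probability that two sets share the same MinHash value under a random permutation equals their Jaccard similarity, so the fraction of agreeing positions approaches the exact value as the number of permutations grows.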


In [None]:
# The number of permutations must be smaller than the number of shingles:
# the signature matrix is meant to reduce the size of the dataset.

In [29]:
import random as ran  # instead of shuffling the dataset, we shuffle the vocabulary

ran.seed(7)

In [31]:
print(vocabulary)

['age_5', 'exp_7', 'exp_9', 'exp_10', 'richness_1', 'age_1', 'exp_3', 'age_2', 'richness_3', 'exp_1', 'richness_10', 'exp_6', 1, 0, 'exp_8', 'richness_6', 'richness_9', 'richness_2', 'age_3', 'age_4', 'age_6', 'exp_5', 'exp_2', 'exp_4', 'richness_4', 'richness_8', 'richness_7', 'richness_5']


In [32]:
ran.shuffle(vocabulary)

In [33]:
print(vocabulary)

['richness_8', 'exp_8', 'exp_6', 'exp_4', 'exp_3', 'exp_5', 'age_5', 'age_1', 'richness_9', 'richness_6', 'exp_2', 'exp_9', 'richness_10', 'richness_7', 'richness_3', 'age_4', 'exp_10', 'age_6', 0, 'exp_1', 'richness_1', 'richness_2', 'richness_4', 'age_2', 'exp_7', 1, 'richness_5', 'age_3']


# Instead of doing the mapping first and building the signature afterwards, we can probably do both together with an alternative version of the position function:

In [41]:
# This function returns the position, in the current (shuffled) vocabulary,
# of the first shingle present in the transaction: this is the MinHash value.

def position_shuffle(data, index):  # index selects the transaction to analyze
    
    # read the row once instead of once per vocabulary entry
    values = list(data.loc[index][['CustGender', 'CustomerClassAge', 'Richness', 'Expenditure']])
    
    for (i, elem) in enumerate(vocabulary):
        
        if elem in values:
        
            return i

# Example:

In [42]:
df.loc[1][['CustGender', 'CustomerClassAge', 'Richness', 'Expenditure']]

CustGender                   1
CustomerClassAge         age_4
Richness            richness_2
Expenditure             exp_10
Name: 1, dtype: object

In [43]:
print(vocabulary)

[0, 1, 'age_1', 'age_2', 'age_3', 'age_4', 'age_5', 'age_6', 'richness_1', 'richness_2', 'richness_3', 'richness_4', 'richness_5', 'richness_6', 'richness_7', 'richness_8', 'richness_9', 'richness_10', 'exp_1', 'exp_2', 'exp_3', 'exp_4', 'exp_5', 'exp_6', 'exp_7', 'exp_8', 'exp_9', 'exp_10']


In [44]:
position_shuffle(df, 1)

1

In [45]:
ran.shuffle(vocabulary)

In [46]:
print(vocabulary)

['age_1', 'exp_3', 'age_5', 'age_2', 'exp_7', 0, 'exp_2', 'richness_5', 'exp_6', 'richness_9', 'exp_10', 'richness_1', 'age_3', 'exp_9', 1, 'exp_8', 'age_4', 'age_6', 'richness_2', 'richness_4', 'exp_5', 'exp_1', 'richness_7', 'richness_3', 'richness_6', 'richness_10', 'exp_4', 'richness_8']


In [47]:
position_shuffle(df, 1)  # works as expected: 'exp_10' is the first matching shingle, at position 10

10

# Below is probably the shortest and most computationally efficient way to do everything together

## To do that, we have to build a signature matrix:

- each element of the first row contains the position of the first 1 in the encoded vector of the transaction

- shuffle vocabulary --> second row

- shuffle vocabulary --> third row


In [54]:
TID = list(df['TransactionID']) #transaction names

In [55]:
signature_matrix = pd.DataFrame(columns = TID) #initialize dataframe with transaction name

In [None]:
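# For example, we can use 10 permutations:

signature_rows = []  # one dict per permutation

for i in tq(range(10)): 
    
    ran.shuffle(vocabulary)  # shuffle the vocabulary
    
    rows = {}  # row to append: TransactionID -> position of the first 1
    
    for j in range(len(TID)):
        
        rows[TID[j]] = position_shuffle(df, j)
        
    signature_rows.append(rows)  # DataFrame.append was removed in pandas 2.0, so collect the rows first

# build the signature matrix once: one row per permutation, one column per transaction
signature_matrix = pd.DataFrame(signature_rows, columns=TID)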


  0%|                                                    | 0/10 [00:00<?, ?it/s]

In [None]:
# I think the order of complexity is much lower than in all the previous approaches,
# but it still takes a very long time this way.
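
A possible way to speed this up is to compute, for each permutation, the rank of every shingle once, and then take the minimum rank over the shingles of each transaction. This gives the same value as scanning the shuffled vocabulary for the first match. The sketch below is an assumption-laden illustration: `build_signatures` and `toy_records` are hypothetical names, and every shingle of every record is assumed to be in the vocabulary.

```python
import random


def build_signatures(records, vocabulary, n_permutations, seed=7):
    """records: {TransactionID: iterable of shingles}.

    Returns {TransactionID: list with one MinHash value per permutation}.
    """
    rng = random.Random(seed)
    order = list(vocabulary)            # local copy; the caller's list is not modified
    signatures = {tid: [] for tid in records}
    for _ in range(n_permutations):
        rng.shuffle(order)
        # rank[shingle] = position of the shingle in the shuffled vocabulary
        rank = {shingle: pos for pos, shingle in enumerate(order)}
        for tid, shingles in records.items():
            # smallest rank = position of the first 1 under this permutation
            signatures[tid].append(min(rank[s] for s in shingles))
    return signatures


vocabulary = ([0, 1]
              + [f"age_{k}" for k in range(1, 7)]
              + [f"richness_{k}" for k in range(1, 11)]
              + [f"exp_{k}" for k in range(1, 11)])

# Hypothetical toy input, copied from the first rows of the dataframe preview.
toy_records = {
    "T1": [0, "age_1", "richness_6", "exp_1"],
    "T2": [1, "age_4", "richness_2", "exp_10"],
    "T3": [0, "age_1", "richness_6", "exp_6"],
}

signatures = build_signatures(toy_records, vocabulary, n_permutations=10)
print(signatures["T2"])
```

Since this vocabulary allows at most 2 × 6 × 10 × 10 = 1200 distinct feature combinations, the signatures could also be computed once per distinct combination and then shared by all transactions with the same features.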

# 1.2.5 Bucket!

# 1.3 Query

To execute the query, we apply the same encoding process to the query records and look up which bucket each one falls into.
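
The notebook does not show how the buckets are built, so the following is only a sketch of one common option, LSH banding: each signature is split into bands, and two transactions share a bucket when they agree on a whole band. All names (`band_keys`, `build_buckets`, `query`) and the toy signatures are hypothetical. Using tuples as dictionary keys relies on Python's built-in hashing, which may need to be replaced if the "no ready-made hash functions" rule also covers bucketing.

```python
from collections import defaultdict


def band_keys(signature, rows_per_band):
    """Split a signature into consecutive bands; each (band number, band values) is a bucket key."""
    return [(start // rows_per_band, tuple(signature[start:start + rows_per_band]))
            for start in range(0, len(signature), rows_per_band)]


def build_buckets(signatures, rows_per_band):
    """signatures: {TransactionID: list of MinHash values}. Returns {bucket key: set of IDs}."""
    buckets = defaultdict(set)
    for tid, signature in signatures.items():
        for key in band_keys(signature, rows_per_band):
            buckets[key].add(tid)
    return buckets


def query(signature, buckets, rows_per_band):
    """Return every transaction that shares at least one band with the query signature."""
    candidates = set()
    for key in band_keys(signature, rows_per_band):
        candidates |= buckets.get(key, set())
    return candidates


# Made-up signatures of length 4, split into 2 bands of 2 rows each.
signatures = {
    "T1": [3, 0, 5, 2],
    "T2": [3, 0, 1, 7],
    "T3": [4, 6, 1, 7],
}
buckets = build_buckets(signatures, rows_per_band=2)

print(sorted(query([3, 0, 9, 9], buckets, rows_per_band=2)))  # ['T1', 'T2']
```

Fewer rows per band make a match easier and so lower the effective similarity threshold, while more rows per band raise it. This is one way to experiment with the threshold the assignment asks about.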