# 1.2 Fingerprint hashing

Using the previously selected data with the features you found pertinent, you have to:

Implement your own MinHash function from scratch. Ready-made hash functions are not allowed. Consult the class material and search the internet if needed; the description of hash functions in the book is a useful reference.

Process the dataset and add each record to the MinHash. The goal of this subtask is to map each consumer to its bin; for this to work well, make sure you understand how MinHash works and choose a suitable matching threshold. Before moving on, experiment with different thresholds and explain your choice.
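
As a point of reference for the task above, here is a minimal sketch of MinHash built without any ready-made hash function. It is not the solution developed in this notebook: the names `PRIME`, `make_hash_params`, `minhash_signature` and `estimated_similarity` are hypothetical, and the hash family h(x) = (a*x + b) mod p is one common choice assumed here. Each set is represented by the indices of its 1-entries.

```python
import random

# Assumption: a prime larger than any row index; 2**31 - 1 is a Mersenne prime.
PRIME = 2_147_483_647

def make_hash_params(num_hashes, seed=0):
    """Draw (a, b) pairs defining hash functions h(x) = (a*x + b) mod PRIME."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(0, PRIME))
            for _ in range(num_hashes)]

def minhash_signature(active_rows, params):
    """For each hash function, keep the smallest hash over the set's row indices."""
    return [min((a * r + b) % PRIME for r in active_rows) for a, b in params]

def estimated_similarity(sig1, sig2):
    """Fraction of signature positions on which the two signatures agree."""
    matches = sum(1 for x, y in zip(sig1, sig2) if x == y)
    return matches / len(sig1)

params = make_hash_params(num_hashes=200)
set_a = {1, 5, 9, 27}   # indices of the 1-entries of one record
set_b = {1, 5, 9, 20}   # shares 3 of 5 distinct indices with set_a
sig_a = minhash_signature(set_a, params)
sig_b = minhash_signature(set_b, params)

# The estimate approximates the true Jaccard similarity 3/5.
print(estimated_similarity(sig_a, sig_b))
```

Two records would then be treated as a match when the estimated similarity reaches the chosen threshold; trying several thresholds shows how the number of matches changes.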

In [22]:
import pandas as pd
from tqdm import tqdm as tq
import warnings
warnings.filterwarnings("ignore")

In [9]:
df = pd.read_csv("/Users/giacomo/Desktop/locale/data.csv", sep = '\t')

In [10]:
del df['Unnamed: 0']

In [11]:
list(df)

['TransactionID', 'CustGender', 'CustomerClassAge', 'Richness', 'Expenditure']

In [12]:
df

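| | TransactionID | CustGender | CustomerClassAge | Richness | Expenditure |
|---|---|---|---|---|---|
| 0 | T1 | 0 | 1 | 6 | 1 |
| 1 | T2 | 1 | 4 | 2 | 10 |
| 2 | T3 | 0 | 1 | 6 | 6 |
| 3 | T4 | 0 | 3 | 10 | 9 |
| 4 | T5 | 0 | 2 | 4 | 9 |
| ... | ... | ... | ... | ... | ... |
| 1041139 | T1048563 | 1 | 2 | 4 | 7 |
| 1041140 | T1048564 | 1 | 1 | 7 | 6 |
| 1041141 | T1048565 | 1 | 2 | 10 | 7 |
| 1041142 | T1048566 | 1 | 3 | 4 | 7 |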


# 1.2.1 Vocabulary

First, we build the vocabulary: one block of shingles per feature, concatenated in the column order of the dataframe.

In [13]:
vocab1 = [0, 1] # CustGender shingles: 0 and 1 for female, male

vocab2 = list(range(1, 7)) # CustomerClassAge shingles, excluding class 0 (NaN)

vocab3 = list(range(1, 11)) # Richness shingles

vocab4 = list(range(1, 11)) # Expenditure shingles

vocabulary = vocab1 + vocab2 + vocab3 + vocab4

In [14]:
print(vocabulary)

[0, 1, 1, 2, 3, 4, 5, 6, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
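
The printed vocabulary repeats values (for example, `1` appears four times), so a shingle is identified by its position in the list, not by its value. The sketch below makes this explicit by labelling each shingle with its feature. The names `features` and `labelled_vocabulary` are hypothetical and not part of the notebook; the value ranges mirror `vocab1` to `vocab4`.

```python
# Hypothetical labelled version of the vocabulary; ranges mirror vocab1..vocab4.
features = {
    "CustGender": [0, 1],
    "CustomerClassAge": list(range(1, 7)),
    "Richness": list(range(1, 11)),
    "Expenditure": list(range(1, 11)),
}

# One "feature=value" label per shingle, in the same order as `vocabulary`.
labelled_vocabulary = [
    f"{name}={value}"
    for name, values in features.items()
    for value in values
]

print(len(labelled_vocabulary))       # 28
print(len(set(labelled_vocabulary)))  # 28, every label is unique
print(labelled_vocabulary[:3])
```

With labels like these, position `i` of any 0/1 vector built on the vocabulary can be read back as a feature and a value.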


# 1.2.2 Create a one-hot vector for each transaction

Next, we define a function that maps each transaction to a vector of 0/1 values over the vocabulary, with one one-hot block per feature:

In [15]:
df.loc[1][0]

'T2'

In [16]:
def trans_tovector(index): # index: row label passed to df.loc

    row = df.loc[index] # read the row once instead of once per comparison

    vector = []

    # one 0/1 block per feature, in the same order as the vocabulary
    feature_blocks = [('CustGender', vocab1),       # gender
                      ('CustomerClassAge', vocab2), # age
                      ('Richness', vocab3),         # richness
                      ('Expenditure', vocab4)]      # expenditure

    for column, shingles in feature_blocks:

        for shingle in shingles:

            # 1 if the transaction's value matches this shingle, else 0
            vector.append(1 if row[column] == shingle else 0)

    return vector
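
To make the encoding concrete, here is a self-contained sketch that applies the same block-by-block comparison to the values of transaction T2 from the table above (gender 1, age class 4, richness 2, expenditure 10). The names `vocab_blocks` and `encode` are hypothetical stand-ins that avoid depending on `df`.

```python
# Hypothetical stand-ins for vocab1..vocab4, kept as separate blocks.
vocab_blocks = [
    [0, 1],               # CustGender
    list(range(1, 7)),    # CustomerClassAge
    list(range(1, 11)),   # Richness
    list(range(1, 11)),   # Expenditure
]

def encode(values):
    """Map (gender, age, richness, expenditure) to a 28-element 0/1 vector."""
    vector = []
    for value, block in zip(values, vocab_blocks):
        vector.extend(1 if value == shingle else 0 for shingle in block)
    return vector

t2 = encode((1, 4, 2, 10))
print(t2)
# [0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1]
```

Each feature contributes at most one 1, so a complete record has exactly four. A record with age class 0, which the vocabulary excludes, has an all-zero age block and only three.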

# 1.2.3 Mapping each transaction through vocabulary

Now we can map the initial dataset to a new dataframe whose rows are the vocabulary shingles and whose columns are the transactions:

In [26]:
boolean_matrix = pd.DataFrame(vocabulary, columns=['Shingles']) # initialize matrix with shingles

In [None]:
for i in tq(range(len(df))):
    
    boolean_matrix[df.loc[i][0]] = trans_tovector(i)   

  1%|▎                               | 11605/1041144 [02:49<10:13:14, 27.98it/s]
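
The progress bar estimates more than ten hours for the row-by-row loop. One possible alternative, sketched here under assumptions, is to compare each whole column against its shingle values with NumPy broadcasting. The names `columns`, `blocks` and `one_hot` are hypothetical; the stand-in arrays hold the first three rows shown earlier, and in the notebook they would come from `df[column].to_numpy()`. Note that this layout has one row per transaction, which is the transpose of `boolean_matrix`.

```python
import numpy as np

# Stand-in columns for T1, T2, T3; in the notebook: df[column].to_numpy().
columns = {
    "CustGender": np.array([0, 1, 0]),
    "CustomerClassAge": np.array([1, 4, 1]),
    "Richness": np.array([6, 2, 6]),
    "Expenditure": np.array([1, 10, 6]),
}

# Shingle values per feature, in the same order as vocab1..vocab4.
blocks = {
    "CustGender": np.arange(0, 2),
    "CustomerClassAge": np.arange(1, 7),
    "Richness": np.arange(1, 11),
    "Expenditure": np.arange(1, 11),
}

# Broadcasting compares every transaction with every shingle of a feature at once.
parts = [
    (columns[name][:, None] == shingles[None, :]).astype(np.uint8)
    for name, shingles in blocks.items()
]

# Stack the per-feature blocks side by side: one row per transaction.
one_hot = np.hstack(parts)

print(one_hot.shape)  # (3, 28)
print(one_hot[1])     # encoding of T2
```

Storing the result as a compact `uint8` array also avoids adding over a million columns to a dataframe one at a time.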