### This notebook applies the idea of a Siamese neural network. Siamese networks were built to check whether two images are similar; we hope to use the same idea to check whether two records are similar.

The notebook is arranged as follows:
- **Importing packages**: Import the packages needed in this Python environment.
- **Preprocessing steps**: Load the two dataframes, split the dates into day, month and year, and replace missing values in the name columns with an empty string. The names are converted to vectors using each letter's position in the alphabet. For example, 'Jane' --> (10, 1, 14, 5).
 - Each vector has size 15 (possibly the longest name in the dataset; characters beyond this maximum length should be discarded) and is initialized to zeros.
 - The zeros are then filled in as in the 'Jane' example, and the vector is scaled by dividing by 26 (the value for 'z'). We admit that encoding names this way is not ideal, but it is a good start towards a more robust approach. Once the names are encoded, we update the values in the dataframe.
 - Unnecessary columns such as record number, IDs and names are dropped.
- **Generate possible matches**: Generate possible matches using locality-sensitive hashing (LSH) with a random matrix. The possible matches are then passed to the Siamese network for further refinement.
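
The name encoding described in the preprocessing step can be illustrated with a small standalone sketch. The helper name `encode_name` is hypothetical and uses plain Python lists rather than the NumPy arrays used later in the notebook; it only shows the mapping, padding, truncation and scaling on the 'Jane' example.

```python
import string

SIZE = 15  # vector length described above

def encode_name(name, size=SIZE):
    """Map letters to alphabet positions (a=1 ... z=26), pad with zeros, scale by 26."""
    letters = string.ascii_lowercase
    # keep only alphabetic characters and look up their 1-based position
    positions = [letters.index(ch) + 1 for ch in name.lower() if ch in letters]
    positions = positions[:size]                     # discard characters past the maximum length
    positions += [0] * (size - len(positions))       # pad short names with zeros
    return [p / 26 for p in positions]               # scale so that 'z' maps to 1.0

jane = encode_name("Jane")
print([round(x * 26) for x in jane[:4]])  # [10, 1, 14, 5]
print(len(jane))                          # 15
```

Undoing the scaling on the first four entries recovers the positions (10, 1, 14, 5) from the example; the remaining eleven entries stay at zero.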


 
 ## Final thought
 The LSH + Siamese method has great potential for record linkage. More labeled data is needed to train a robust model.

## Import the necessary packages

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import recordlinkage
import datetime as dt
from sklearn.metrics import confusion_matrix,f1_score
import random

import warnings
warnings.filterwarnings('ignore')
##
import keras as k
from keras.layers import *
from keras.models import Sequential, Model
from keras.regularizers import l2
from keras.optimizers import Adam
import tensorflow as tf

## Preprocessing step

In [96]:
#year, month, and day column
def connvert_date(hdss):
    hdss['dob']=pd.to_datetime(hdss['dob'],dayfirst=True)
    hdss['year']=hdss['dob'].dt.year/1970
    hdss['month']=hdss['dob'].dt.month/12
    hdss['day']=hdss['dob'].dt.day/31
    return hdss.drop('dob',axis=1)

In [97]:
#import the dataset, replace Nan with an empty string
#and convert the dates
or_hdss=pd.read_csv('data/synthetic_hdss_v3.csv')
#np.nan is used because the np.NaN alias was removed in NumPy 2.0
or_hdss.replace(np.nan,'',inplace=True)
or_hdss=connvert_date(or_hdss)
or_facility=pd.read_csv('data/synthetic_facility_v3.csv')
or_facility.replace(np.nan,'',inplace=True)
or_facility=connvert_date(or_facility)

In [98]:
#list of the letters of the alphabet
code=list('abcdefghijklmnopqrstuvwxyz')

In [99]:
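#create a function that can be used to convert names to vectors
size=15
def convert_name_to_vector(name):
    name=name.lower()
    initials=np.zeros(size)
    i=0
    for a in name:
        #stop once the vector is full; extra characters are discarded
        if i>=size:
            break
        if a in code:
            #position in the alphabet, starting at 1 for 'a'
            value=code.index(a)+1
            initials[i]=value
            i+=1
    #scale by 26, the value for 'z'
    return (initials/26).flatten()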

In [100]:
#generate the column names for the vectors
first=[]
last=[]
pet=[]
for i in range(size):
    first.append(f'f_{i}')
    last.append(f'l_{i}')
    pet.append(f'p_{i}')

In [101]:
#copy the dataframe, for future use
master_hdss=or_hdss.copy()
master_facility=or_facility.copy()

In [102]:
#convert the names to vectors of size 15 (zero-initialized)
or_hdss[first]=pd.DataFrame(or_hdss['firstname'].apply(convert_name_to_vector).tolist())
or_hdss[last]=pd.DataFrame(or_hdss['lastname'].apply(convert_name_to_vector).tolist())
or_hdss[pet]=pd.DataFrame(or_hdss['petname'].apply(convert_name_to_vector).tolist())

or_facility[first]=pd.DataFrame(or_facility['firstname'].apply(convert_name_to_vector).tolist())
or_facility[last]=pd.DataFrame(or_facility['lastname'].apply(convert_name_to_vector).tolist())
or_facility[pet]=pd.DataFrame(or_facility['petname'].apply(convert_name_to_vector).tolist())

In [356]:
#drop the unnecessary columns
hdss=or_hdss.drop(['recnr','firstname', 'lastname', 'petname', 'hdssid','hdsshhid','nationalid'],axis=1)
facility=or_facility.drop(['recnr', 'firstname', 'lastname', 'petname','nationalid','patientid', 'visitdate',],axis=1)

### Perform LSH

In [459]:
#generate a random matrix
#rows must match the number of feature columns (49); 100 is the number of hash bits
random_matrix=np.random.randn(49,100)

In [None]:
#multiply the hdss data with the random matrix
prod=np.matmul(hdss.values,random_matrix)
#binarize the product by setting the value greater than 0 to one
#otherwise zero
factor=np.where(prod>0,1,0)

In [None]:
#repeat the same process with the facility data
#the same random matrix must be used so the hashes are comparable
prod1=np.matmul(facility.values,random_matrix)
factor1=np.where(prod1>0,1,0)
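
The reason this hashing groups similar records can be seen on toy data. For sign random projections, the probability that two vectors agree on a given bit is 1 − θ/π, where θ is the angle between them, so near-identical vectors share almost all bits while unrelated vectors agree on roughly half. The sketch below is illustrative only: the vectors `record`, `near_copy` and `unrelated` and the helper `signature` are hypothetical and not taken from the datasets in this notebook.

```python
import numpy as np

rng = np.random.default_rng(0)           # fixed seed, for reproducibility only
n_features, n_bits = 49, 100             # same shape as random_matrix above
planes = rng.standard_normal((n_features, n_bits))

def signature(x, planes):
    """One bit per random hyperplane: 1 if the projection is positive, else 0."""
    return (x @ planes > 0).astype(int)

record = rng.random(n_features)                                # hypothetical scaled record
near_copy = record + 0.01 * rng.standard_normal(n_features)    # small perturbation of it
unrelated = rng.standard_normal(n_features)                    # unrelated vector

# number of matching bits out of n_bits
agree_near = int(np.sum(signature(record, planes) == signature(near_copy, planes)))
agree_far = int(np.sum(signature(record, planes) == signature(unrelated, planes)))
print(agree_near, agree_far)
```

The count for the near copy should be close to 100, which is what a high agreement threshold on the bit vectors relies on.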

In [441]:
#this function gets the indices of the facility rows whose hash agrees
#with the given hdss hash on at least 98 of the 100 bits
def get_index(vector,facility):
    equate=list(np.sum(facility==np.array(vector),axis=1)>=98)
    indices=np.where(equate)[0]
    return indices
    

In [451]:
#capture similar entries
#the key to the dict is the hdss index
#and values are the facility index
similar={}
count=0 #count
for i in range(factor.shape[0]):
    vector=factor[i]
    ind=get_index(vector,factor1)
    if len(ind)>0:
        count+=len(ind)
        similar[i]=ind
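
The row-by-row loop above can also be written with matrix products, which may be faster on larger inputs. This is a sketch of an alternative, not the method used in this notebook: the function `candidate_pairs` and the toy arrays are hypothetical. It counts agreeing bits as "both 1" plus "both 0", and it builds the full agreement matrix in memory, which is an assumption that holds for a few thousand records per side.

```python
import numpy as np

def candidate_pairs(bits_a, bits_b, min_agree):
    """Return (i, j) index pairs whose binary signatures agree on >= min_agree bits."""
    a = bits_a.astype(np.int32)
    b = bits_b.astype(np.int32)
    # agreements = positions where both are 1 + positions where both are 0
    agree = a @ b.T + (1 - a) @ (1 - b).T
    rows, cols = np.nonzero(agree >= min_agree)
    return list(zip(rows.tolist(), cols.tolist()))

# toy signatures (hypothetical data): 3 records vs 2 records, 4 bits each
toy_a = np.array([[1, 0, 1, 1], [0, 0, 0, 0], [1, 1, 1, 1]])
toy_b = np.array([[1, 0, 1, 0], [1, 1, 1, 1]])
print(candidate_pairs(toy_a, toy_b, min_agree=3))  # [(0, 0), (0, 1), (2, 1)]
```

With the arrays from this notebook, the equivalent call would be `candidate_pairs(factor, factor1, 98)`, giving (hdss index, facility index) tuples.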

In [453]:
similar

{7: array([762]),
 8: array([2091]),
 12: array([2804]),
 13: array([2798]),
 14: array([772]),
 18: array([1475]),
 19: array([2769]),
 24: array([1989]),
 25: array([1109]),
 27: array([331]),
 28: array([1536]),
 34: array([533]),
 35: array([2258]),
 39: array([720]),
 43: array([1504]),
 44: array([1601]),
 48: array([1222]),
 50: array([796]),
 59: array([2450]),
 72: array([196]),
 76: array([2018]),
 78: array([1910]),
 81: array([1040]),
 82: array([1895]),
 84: array([179]),
 85: array([1562]),
 88: array([531, 547]),
 89: array([380]),
 95: array([1197]),
 98: array([2824]),
 104: array([2480]),
 110: array([216]),
 111: array([2510]),
 112: array([452]),
 116: array([2101]),
 118: array([2322]),
 122: array([1927]),
 129: array([1013]),
 135: array([868]),
 137: array([1316]),
 149: array([2239]),
 154: array([2265]),
 155: array([519]),
 160: array([2692]),
 161: array([1165]),
 163: array([2053]),
 164: array([1569]),
 168: array([2852]),
 169: array([129]),
 175: array([

In [None]:
#create a dataframe with the hdss index in the first column
#and the matching facility index in the second column
keys=list(similar.keys())
values=list(similar.values())
#the name k is avoided here because it is already the keras alias
hdss_idx=[]
facility_idx=[]
for i in range(len(keys)):
    key=keys[i]
    val=values[i]
    for j in range(len(val)):
        hdss_idx.append(key)
        facility_idx.append(val[j])
#these pairs are then used in the Siamese network
#to refine the prediction
pairs=pd.DataFrame(np.array([hdss_idx,facility_idx]).T)
