<p align="center">
 <img src="http://www.di.uoa.gr/themes/corporate_lite/logo_en.png" title="Department of Informatics and Telecommunications - University of Athens"/> </p>

<br>

---

<h3 align="center" > 
  Bachelor Thesis
</h3>

<h1 align="center" > 
  Entity Resolution in Dissimilarity Spaces <br>
  Implementation notebook
</h1>

---

<h3 align="center"> 
 <b>Konstantinos Nikoletos</b>
</h3>

<h4 align="center"> 
 <b>Supervisor: Dr. Alex Delis</b>,  Professor NKUA
</h4>
<br>
<h4 align="center"> 
Athens
</h4>
<h4 align="center"> 
January 2021 - Ongoing
</h4>


---


|  <font size="5"> Contents</font> |
| :--   |
|**1. [Abstract](#Abstract)** |
|**2. [Introduction](#Introduction)**  |
&nbsp;&nbsp;&nbsp;**2.1. [   Entity resolution](#Entity-resolution)** |
&nbsp;&nbsp;&nbsp;**2.2. [   Dissimilatiry space](#Dissimilatiry-Space)** |
|**3. [ A dissimilarity-based space embedding methodology](#scrollTo=DcAYuFQjY2ni)** <br>
&nbsp;&nbsp;&nbsp;**3.1 [String Clustering and Prototype Selection](#3.1-String-Clustering-and-Prototype-Selection)** <br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;**3.1.1. [Edit distance metric](#Edit-distance-metric)** <br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;**3.1.2. [String clustering algorithm](#String-clustering-algorithm)** <br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;**3.1.3. [Algorithm complexity](#Algorithm-complexity)** <br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;**3.1.4. [Prototype selection](#Prototype-selection)** <br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;**3.1.5. [Algorithm-1: The String Clustering and Prototype Selection Algorithm](#Algorithm-1:-The-String-Clustering-and-Prototype-Selection-Algorithm)** <br>
&nbsp;&nbsp;&nbsp;**3.2 [The Vantage Space Embedding and the Chorus of Prototypes Transform Similarity Coefficient](#3.2-The-Vantage-Space-Embedding-and-the-Chorus-of-Prototypes-Transform-Similarity-Coefficient)&nbsp; &nbsp;&nbsp; &nbsp;&nbsp; &nbsp;&nbsp; &nbsp;&nbsp; &nbsp;&nbsp; &nbsp;&nbsp; &nbsp;&nbsp; &nbsp;&nbsp; &nbsp;&nbsp; &nbsp;&nbsp; &nbsp;&nbsp; &nbsp;&nbsp; &nbsp;&nbsp; &nbsp;&nbsp; &nbsp;&nbsp; &nbsp;&nbsp; &nbsp;&nbsp; &nbsp;&nbsp; &nbsp;&nbsp; &nbsp;&nbsp; &nbsp;&nbsp; &nbsp;&nbsp; &nbsp;&nbsp; &nbsp;&nbsp; &nbsp;&nbsp; &nbsp;&nbsp; &nbsp;&nbsp; &nbsp; &nbsp; &nbsp;&nbsp; &nbsp; &nbsp; &nbsp;&nbsp; &nbsp;&nbsp; &nbsp;&nbsp; &nbsp;&nbsp; &nbsp;&nbsp; &nbsp; &nbsp; &nbsp;&nbsp; &nbsp;&nbsp; &nbsp;**  <br>
&nbsp;&nbsp;&nbsp;**3.3 [A Top-k List Approach for Similarity Searching in the Vantage Space](#3.3-A-Top-k-List-Approach-for-Similarity-Searching-in-the-Vantage-Space)**  |
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;**3.3.1. [Abstract Algebra definitions](#Abstract-Algebra-definitions)** <br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;**3.3.2. [Hausdorff metric](#Hausdorff-metric)** <br>
&nbsp;&nbsp;&nbsp;**3.4 [Hashing of Partially Ranked Data for Efficient Similarity Search](#3.4-Hashing-of-Partially-Ranked-Data-for-Efficient-Similarity-Search)** |
|**4. [ Evaluation](#Evaluation)** |
|**5. [References](#References)**  |



# __Implementation__

## __0.0 Install components__

In [38]:
!pip install editdistance



In [39]:
!pip install pandas
import pandas as pd
print(pd.__version__)

1.1.5


In [40]:
!pip install pandas_read_xml



## __0.1 Import libraries__

In [41]:
import pandas as pd
import numpy as np
import collections
import editdistance
import string
import sklearn
import pandas_read_xml as pdx

from tqdm.notebook import tqdm as tqdm
from scipy.spatial.distance import directed_hausdorff,hamming
from scipy.stats._stats import _kendall_dis
from sklearn.metrics import accuracy_score,auc,f1_score,recall_score,precision_score,classification_report
from scipy.sparse import csr_matrix
from scipy import stats 

## __1. Prototype selection algorithm__

In [None]:
#####################################################################
# 1. Prototype selection algorithm                                  #
#####################################################################

'''
Clustering_Prototypes(S,k,d,r,C) 
The String Clustering and Prototype Selection Algorithm
is the main clustering method, that takes as input the intial strings S, 
the max number of clusters to be generated in k,
the maximum allowable distance of a string to join a cluster in var d
and returns the prototype for each cluster in array Prototype
'''
def Clustering_Prototypes(S,k,d,pairDictionary,verbose=False):
    
    # ----------------- Initialization phase ----------------- #
    i = 0
    j = 0
    C = np.empty([S.size], dtype=int)
    r = np.empty([2,k],dtype=object)

    Clusters = [ [] for l in range(0,k)]

    while i < S.size:     # String-clustering phase, for all strings
        while j < k :       # iteration through clusters, for all clusters
            if r[0][j] == None:      # case empty first representative for cluster j
                r[0][j] = S[i]   # init cluster representative with string i
                C[i] = j         # store in C that i-string belongs to cluster j
                Clusters[j].append(S[i])
                break
            elif r[1][j] == None and (EditDistance(S[i],r[0][j]) <= d):  # case empty second representative 
                r[1][j] = S[i]                                             # and ED of representative 1  smaller than i-th string 
                C[i] = j
                Clusters[j].append(S[i])
                break
            elif (r[0][j] != None and r[1][j] != None) and (EditDistance(S[i],r[0][j]) + EditDistance(S[i],r[1][j])) <= d:
                C[i] = j
                Clusters[j].append(S[i])
                break
            else:
                j += 1
        i += 1
    
    # ----------------- Prototype selection phase ----------------- #
        
    Projections = np.empty([k],dtype=object)
    Prototypes = np.empty([k],dtype=int)
    sortedProjections = np.empty([k],dtype=object)
    j = 0

    if verbose:
        print("- - - - - - - - -")
        print("Cluster array:")
        print(C)
        print("- - - - - - - - -")
        print("Represantatives array:")
        print(r)
        print("- - - - - - - - -")  
        print("Clusters:")
        print(Clusters)
        print("- - - - - - - - -")  

    # print("\n\n\n****** Prototype selection phase *********") 
    while j < k and len(Clusters[j])>0:
        
        Projections[j] = Approximated_Projection_Distances_ofCluster(r[1][j], r[0][j], j, Clusters[j],pairDictionary)
        
        # print("\n"+str(j)+"-Projections:")
        # print(Projections[j])
        
        sortedProjections[j] = {k: v for k, v in sorted(Projections[j].items(), key=lambda item: item[1])}

        # print(str(j)+"-sortedProjections:")
        # print(sortedProjections[j])
        
        Prototypes[j] = Median(sortedProjections[j])
        
        # print(".............")
        # print(str(j)+"-Prototypes:")        
        # print(Prototypes[j])
        
        j += 1
    # print("\n****** END *********\n")

    return Prototypes


def Approximated_Projection_Distances_ofCluster(right_rep, left_rep, cluster_id, clusterSet, pairDictionary):

    distances_vector = dict()
    rep_distance     = EditDistance(right_rep,left_rep)

    for str_inCluster in range(0,len(clusterSet)): 

      right_rep_distance = EditDistance(right_rep,clusterSet[str_inCluster])
      left_rep_distance  = EditDistance(left_rep,clusterSet[str_inCluster])
      
      distance = (right_rep_distance**2-rep_distance**2-left_rep_distance**2 ) / (2*rep_distance)
      distances_vector[clusterSet[str_inCluster]] = distance

    return distances_vector

def Median(distances):    
    '''
    Returns the median value of a vector
    '''
    keys = list(distances.keys())
    median_position = int(len(keys)/2)
    median_value = keys[median_position]

    return median_value

## __2. Embeddings based on the Vantage objects__




In [None]:
#####################################################################
#       2. Embeddings based on the Vantage objects                  #
#####################################################################

'''
CreateVantageEmbeddings(S,VantageObjects): Main function for creating the string embeddings based on the Vantage Objects
'''
def CreateVantageEmbeddings(S,VantageObjects, pairDictionary):
    
    # ------- Distance computing ------- #     
    vectors = []
    for s in range(0,S.size):
        string_embedding = []
        for p in range(0,VantageObjects.size): 
            if VantageObjects[p] != None:
                string_embedding.append(DistanceMetric(s,p,S,VantageObjects, pairDictionary))
            
        # --- Ranking representation ---- #
        ranked_string_embedding = stats.rankdata(string_embedding, method='dense')
        
        # ------- Vectors dataset ------- #
        vectors.append(ranked_string_embedding)
    
    return np.array(vectors)
    

'''
DistanceMetric(s,p,S,Prototypes): Implementation of equation (5)
'''
def DistanceMetric(s,p,S,VantageObjects, pairDictionary):
    
    max_distance = None
    
    for pp in range(0,VantageObjects.size):
        if VantageObjects[pp] != None:
            string_distance = EditDistance(S[s],VantageObjects[pp])    # Edit distance String-i -> Vantage Object
            VO_distance     = EditDistance(VantageObjects[p],VantageObjects[pp])    # Edit distance Vantage Object-j -> Vantage Object-i

            abs_diff = abs(string_distance-VO_distance)

            # --- Max distance diff --- #        
            if max_distance == None:
                max_distance = abs_diff
            elif abs_diff > max_distance:
                max_distance = abs_diff
            
    return max_distance

def dropNone(array):
    array = list(filter(None, list(array)))
    return np.array(array)

def topKPrototypes():
    return

## __3. Metrics and Similarity functions__

In [None]:
#####################################################################
#                 3. Similarity function                            # 
#####################################################################
from scipy.spatial.distance import directed_hausdorff
from scipy.spatial.distance import hamming
from scipy.stats._stats import _kendall_dis

def SimilarityEvaluation(buckets,vectors,threshold, maxOnly=None, metric=None):

  numOfVectors = vectors.shape[0]
  vectorDim    = vectors.shape[1]
  mapping = {}

  for v_index in range(0,numOfVectors,1):
    
    for i_index in range(v_index+1,numOfVectors,1):
      if metric == None or metric == 'kendal': 
        tau, p_value = stats.kendalltau(vectors[v_index], vectors[i_index])
      else:
        numOf_discordant_pairs = _kendall_dis(vectors[v_index], vectors[i_index])
        tau = float((2*numOf_discordant_pairs) / ((vectorDim)*(vectorDim-1)))
      
      # print(tau,numOf_discordant_pairs,vectorDim)
      
      if tau > threshold or maxOnly:
        if not maxOnly:
          if v_index not in mapping.keys():
            mapping[v_index] = []
          mapping[v_index].append(i_index)
        else:
          if v_index not in mapping.keys():  
            mapping[v_index] = (i_index,tau)
          else:
            if mapping[v_index][1] < tau:
              mapping[v_index] = (i_index,tau)
 
  return mapping


## __4. Hashing__

In [None]:
#####################################################################
#                        4. Hashing                                 # 
#####################################################################

def WTA(vectors,K,inputDim):
  '''
    Winner Take All hash - Yagnik
    .............................

    m: number of permutations
    K: window size
  '''
  newVectors = []
  buckets = dict()

  numOfVectors = vectors.shape[0]
  vectorDim    = vectors.shape[1]

  C = np.zeros([numOfVectors], dtype=int)
  theta = np.random.permutation(inputDim)
  i=0;j=0;

  for v_index in range(0,numOfVectors,1):
    X_new = permuted(vectors[v_index],theta)
    # print( np.array(X_new[:K]))
    newVectors.append(X_new[:K])
    index_max = max(range(len(X_new)), key=X_new.__getitem__)
    c_i = index_max

    j=0
    for j in range(0,K):
      if X_new[j] > X_new[c_i]:
        c_i = j

    C[i] = c_i
    buckets = bucketInsert(buckets,c_i,i)
    i+=1
  
  return C,buckets,np.array(newVectors)

def permuted(vector,permutation):
  permuted_vector = [vector[x] for x in permutation]
  return permuted_vector 

def bucketInsert(buckets,bucket_id,item):
  if bucket_id not in buckets.keys():
    buckets[bucket_id] = []
  buckets[bucket_id].append(item)

  return buckets

## __Final model__









In [42]:
class RankedWTAHash:

  def __init__(self, max_numberOf_clusters, max_editDistance, windowSize, metric = 'kendal', similarityThreshold=None, maxOnly=None ):
      '''
        Constructor
      '''
      self.max_numberOf_clusters = max_numberOf_clusters
      self.pairDictionary = dict()
      self.max_editDistance = max_editDistance
      self.windowSize = windowSize
      self.S_set = None 
      self.S_index = None 
      self.similarityThreshold = similarityThreshold
      self.maxOnly = maxOnly
      self.metric = metric
  
  def fit(self, X, y):
    """
      Fit the classifier from the training dataset.
      Parameters
      ----------
      X : Training data.
      y : Target values.
      Returns
      -------
      self : The fitted classifier.
    """
    
    if isinstance(X, list):
      input_strings = X
    else:
      input_strings = list(X)

    # print(input_strings)
    self.S_set = np.array(input_strings,dtype=object)
    # print(self.S_set)
    self.S_index = np.arange(0,len(input_strings),1)

    print("\n-----------------\nString positions are:")
    print(self.S_index)
    print("-----------------\n")

    print("\n-----------------\n Finding prototypes and representatices of each cluster:")
    self.prototypeArray = self.Clustering_Prototypes(self.S_index,self.max_numberOf_clusters, self.max_editDistance, self.pairDictionary)
    self.embeddingDim   = self.prototypeArray.size
    print(self.prototypeArray)
    print("-----------------")

    print("\n-----------------\nEmbeddings:")
    self.Embeddings = self.CreateVantageEmbeddings(self.S_index,self.prototypeArray, self.pairDictionary)
    print(self.Embeddings)
    print("-----------------\n")

    print("\n-----------------\nWTA Buckets:")
    self.HashedClusters,self.buckets,self.rankedVectors = self.WTA(self.Embeddings,self.windowSize,self.embeddingDim)
    print(self.HashedClusters)
    print("-----------------\n")
    
    print("\n-----------------\nWTA RankedVectors after permutation:")
    print(self.rankedVectors)
    print("-----------------\n")

    print("\n-----------------\nSimilarity checking:")
    self.mapping,self.mapping_matrix = self.SimilarityEvaluation(self.buckets,self.rankedVectors,self.similarityThreshold,maxOnly=self.maxOnly, metric=self.metric)
    print(self.mapping)
    print("-----------------\n")
    
  
  def EditDistance(self, str1,str2,verbose=False):
      if verbose:
        if str1 == None:
            print("1")
        elif str2 == None:
            print("2")
        print("-> "+str(str1))
        print("--> "+str(str2))
        print(str(editdistance.eval(self.S_set[str1],self.S_set[str2])))
      
      
      # NOTE: Duplicates inside the dictionary     

      if ((str1,str2) or (str2,str1))  in self.pairDictionary.keys():
        return self.pairDictionary[(str1,str2)]
      else:
        # if verbose:
        # print("++++++++++")
        # print(str1,str2)
        # print(self.S_set[str1],self.S_set[str2])
        # print("++++++++++")
        distance = editdistance.eval(self.S_set[str1],self.S_set[str2])
        self.pairDictionary[(str2,str1)] = self.pairDictionary[(str1,str2)] = distance
        return distance

  # ----------------------------------------------------------------------------------------------------------- #

  #####################################################################
  # 1. Prototype selection algorithm                                  #
  #####################################################################

  '''
  Clustering_Prototypes(S,k,d,r,C) 
  The String Clustering and Prototype Selection Algorithm
  is the main clustering method, that takes as input the intial strings S, 
  the max number of clusters to be generated in k,
  the maximum allowable distance of a string to join a cluster in var d
  and returns the prototype for each cluster in array Prototype
  '''
  def Clustering_Prototypes(self,S,k,d,pairDictionary,verbose=False):
      
      # ----------------- Initialization phase ----------------- #
      i = 0
      j = 0
      C = np.empty([S.size], dtype=int)
      r = np.empty([2,k],dtype=object)

      Clusters = [ [] for l in range(0,k)]

      for i in tqdm(range(0,S.size,1)):     # String-clustering phase, for all strings
          while j < k :       # iteration through clusters, for all clusters
              if r[0][j] == None:      # case empty first representative for cluster j
                  r[0][j] = S[i]   # init cluster representative with string i
                  C[i] = j         # store in C that i-string belongs to cluster j
                  Clusters[j].append(S[i])
                  break
              # elif r[1][j] == None:  # !!!!
              elif r[1][j] == None and (self.EditDistance(S[i],r[0][j]) <= d):  # case empty second representative 
                  r[1][j] = S[i]                                             # and ED of representative 1  smaller than i-th string 
                  C[i] = j
                  Clusters[j].append(S[i])
                  break
              elif (r[0][j] != None and r[1][j] != None) and (self.EditDistance(S[i],r[0][j]) + self.EditDistance(S[i],r[1][j])) <= d:
                  C[i] = j
                  Clusters[j].append(S[i])
                  break
              else:
                  j += 1
          i += 1

      # ----------------- Prototype selection phase ----------------- #
          
      Projections = np.empty([k],dtype=object)
      Prototypes = np.empty([k],dtype=int)
      sortedProjections = np.empty([k],dtype=object)
      j = 0

      if verbose:
          print("- - - - - - - - -")
          print("Cluster array:")
          print(C)
          print("- - - - - - - - -")
          print("Represantatives array:")
          print(r)
          print("- - - - - - - - -")  
          print("Clusters:")
          print(Clusters)
          print("- - - - - - - - -")  

      # print("\n\n\n****** Prototype selection phase *********") 
      while j < k and len(Clusters[j])>0:
          
          Projections[j] = self.Approximated_Projection_Distances_ofCluster(r[1][j], r[0][j], j, Clusters[j],pairDictionary)
          
          if Projections[j] == None:
            print("oh no")
            continue
          # print("\n"+str(j)+"-Projections:")
          # print(Projections[j])
          
          sortedProjections[j] = {k: v for k, v in sorted(Projections[j].items(), key=lambda item: item[1])}

          # print(str(j)+"-sortedProjections:")
          # print(sortedProjections[j])
          
          Prototypes[j] = self.Median(sortedProjections[j])
          
          # print(".............")
          # print(str(j)+"-Prototypes:")        
          # print(Prototypes[j])
          
          j += 1
      # print("\n****** END *********\n")

      return Prototypes


  def Approximated_Projection_Distances_ofCluster(self, right_rep, left_rep, cluster_id, clusterSet, pairDictionary):
      # print("here")
      # print(clusterSet)
      # print(right_rep, left_rep)

      distances_vector = dict()

      if len(clusterSet) > 2:
        rep_distance     = self.EditDistance(right_rep,left_rep)
        for str_inCluster in range(0,len(clusterSet)): 
          if clusterSet[str_inCluster] != right_rep and clusterSet[str_inCluster] != left_rep:
            # print(clusterSet[str_inCluster],right_rep,left_rep)
            right_rep_distance = self.EditDistance(right_rep,clusterSet[str_inCluster])
            left_rep_distance  = self.EditDistance(left_rep,clusterSet[str_inCluster])
            
            distance = (right_rep_distance**2-rep_distance**2-left_rep_distance**2 ) / (2*rep_distance)
            distances_vector[clusterSet[str_inCluster]] = distance
      else:
        if left_rep != None:
          distances_vector[0] = left_rep
          # print("l")
        elif right_rep != None:
          distances_vector[0] = right_rep
          # print("r")
        elif left_rep == None and right_rep == None:
          return None
      # print(distances_vector)
      return distances_vector

  def Median(self, distances):    
      '''
      Returns the median value of a vector
      '''
      keys = list(distances.keys())
      if keys == 1:
        return keys[0]

      # print(distances)
      keys = list(distances.keys())
      # print(keys)
      median_position = int(len(keys)/2)
      # print(median_position)
      median_value = keys[median_position]

      return median_value

  #####################################################################
  #       2. Embeddings based on the Vantage objects                  #
  #####################################################################

  '''
  CreateVantageEmbeddings(S,VantageObjects): Main function for creating the string embeddings based on the Vantage Objects
  '''
  def CreateVantageEmbeddings(self, S, VantageObjects, pairDictionary):
      
      # ------- Distance computing ------- #     
      vectors = []
      for s in tqdm(range(0,S.size)):
          string_embedding = []
          for p in range(0,VantageObjects.size): 
              if VantageObjects[p] != None:
                  string_embedding.append(self.DistanceMetric(s,p,S,VantageObjects, pairDictionary))
              
          # --- Ranking representation ---- #
          ranked_string_embedding = stats.rankdata(string_embedding, method='dense')
          
          # ------- Vectors dataset ------- #
          vectors.append(ranked_string_embedding)
      
      return np.array(vectors)
      

  '''
  DistanceMetric(s,p,S,Prototypes): Implementation of equation (5)
  '''
  def DistanceMetric(self, s, p, S, VantageObjects, pairDictionary):
      
      max_distance = None
      
      for pp in range(0,VantageObjects.size):
          if VantageObjects[pp] != None:
              string_distance = self.EditDistance(S[s],VantageObjects[pp])    # Edit distance String-i -> Vantage Object
              VO_distance     = self.EditDistance(VantageObjects[p],VantageObjects[pp])    # Edit distance Vantage Object-j -> Vantage Object-i

              abs_diff = abs(string_distance-VO_distance)

              # --- Max distance diff --- #        
              if max_distance == None:
                  max_distance = abs_diff
              elif abs_diff > max_distance:
                  max_distance = abs_diff
              
      return max_distance

  def dropNone(array):
      array = list(filter(None, list(array)))
      return np.array(array)

  def topKPrototypes():
      return

  #####################################################################
  #                 3. Similarity function                            # 
  #####################################################################

  def SimilarityEvaluation(self, buckets,vectors,threshold, maxOnly=None, metric=None):

    numOfVectors = vectors.shape[0]
    vectorDim    = vectors.shape[1]
    mapping_matrix = np.zeros([numOfVectors,numOfVectors],dtype=np.int8)
    mapping = {}

    for v_index in tqdm(range(0,numOfVectors,1)):
      
      for i_index in range(v_index+1,numOfVectors,1):
        if metric == None or metric == 'kendal': 
          tau, p_value = stats.kendalltau(vectors[v_index], vectors[i_index])
        else:
          numOf_discordant_pairs = _kendall_dis(vectors[v_index], vectors[i_index])
          tau = float((2*numOf_discordant_pairs) / ((vectorDim)*(vectorDim-1)))
                
        if tau > threshold or maxOnly:
          if not maxOnly:
            if v_index not in mapping.keys():
              mapping[v_index] = []
            mapping[v_index].append(i_index)
            mapping_matrix[v_index][i_index] = 1
          else:
            if v_index not in mapping.keys():  
              mapping[v_index] = (i_index,tau)
              mapping_matrix[v_index][i_index] = 1
            else:
              if mapping[v_index][1] < tau:
                mapping[v_index] = (i_index,tau)
                mapping_matrix[v_index][i_index] = 1
  
    return mapping, mapping_matrix

  #####################################################################
  #                        4. Hashing                                 # 
  #####################################################################

  def WTA(self,vectors,K,inputDim):
    '''
      Winner Take All hash - Yagnik
      .............................

      m: number of permutations
      K: window size
    '''
    newVectors = []
    buckets = dict()

    numOfVectors = vectors.shape[0]
    vectorDim    = vectors.shape[1]

    C = np.zeros([numOfVectors], dtype=int)
    theta = np.random.permutation(inputDim)
    i=0;j=0;

    for v_index in tqdm(range(0,numOfVectors,1)):
      X_new = self.permuted(vectors[v_index],theta)
      # print( np.array(X_new[:K]))
      newVectors.append(X_new[:K])
      index_max = max(range(len(X_new)), key=X_new.__getitem__)
      c_i = index_max

      j=0
      for j in range(0,K):
        if X_new[j] > X_new[c_i]:
          c_i = j

      C[i] = c_i
      buckets = self.bucketInsert(buckets,c_i,i)
      i+=1
    
    return C,buckets,np.array(newVectors)

  def permuted(self,vector,permutation):
    permuted_vector = [vector[x] for x in permutation]
    return permuted_vector 

  def bucketInsert(self,buckets,bucket_id,item):
    if bucket_id not in buckets.keys():
      buckets[bucket_id] = []
    buckets[bucket_id].append(item)

    return buckets

  #####################################################################
  #                        5. Evaluation                              # 
  #####################################################################
  def evaluate_cora(self, true_matrix):
    sparce_true = csr_matrix(true_matrix)
    sparce_predicted =  csr_matrix(self.mapping_matrix)

    acc = accuracy_score(true_matrix, self.mapping_matrix)
    f1 =  f1_score(true_matrix, self.mapping_matrix,average='micro')
    recall = recall_score(true_matrix, self.mapping_matrix,average='micro')
    precision = precision_score(true_matrix, self.mapping_matrix,average='micro')

    results_dataframe = pd.DataFrame(columns=['Accuracy','Precision','Recall','F1'])
    results_dataframe.append({'Accuracy':acc,'Precision':precision,'Recall':recall,'F1':f1},ignore_index=True)

    # print(classification_report(true_matrix, self.mapping_matrix))


---
---

# __Evaluation__

In [6]:
# Opening data file
import io
from google.colab import drive

drive.mount('/content/drive',force_remount=True)

Mounted at /content/drive


## __CoRA__

### Load from Drive

In [None]:
fpcites = r"/content/drive/My Drive/ERinDS/cora_cites.csv"
fppaper = r"/content/drive/My Drive/ERinDS/cora_paper.csv"
fpcontent = r"/content/drive/My Drive/ERinDS/cora_content.csv"

cites = pd.read_csv(fpcites,sep=';')
paper = pd.read_csv(fppaper,sep=';')
content = pd.read_csv(fpcontent,sep=';')

### Overview

In [None]:
cites

Unnamed: 0,cited_paper_id,citing_paper_id
0,35,887
1,35,1033
2,35,1688
3,35,1956
4,35,8865
...,...,...
5424,853116,19621
5425,853116,853155
5426,853118,1140289
5427,853155,853118


In [None]:
paper

Unnamed: 0,paper_id,class_label
0,35,Genetic_Algorithms
1,40,Genetic_Algorithms
2,114,Reinforcement_Learning
3,117,Reinforcement_Learning
4,128,Reinforcement_Learning
...,...,...
2703,1154500,Case_Based
2704,1154520,Neural_Networks
2705,1154524,Rule_Learning
2706,1154525,Rule_Learning


In [None]:
content

Unnamed: 0,paper_id,word_cited_id
0,35,word100
1,35,word1152
2,35,word1175
3,35,word1228
4,35,word1248
...,...,...
49211,1155073,word75
49212,1155073,word759
49213,1155073,word789
49214,1155073,word815


### Train-Test-Validation datasets



In [None]:
class_labels = np.unique(paper.class_label.to_numpy())
print(class_labels)
print("Number of classes: "+str(len(class_labels)))

['Case_Based' 'Genetic_Algorithms' 'Neural_Networks'
 'Probabilistic_Methods' 'Reinforcement_Learning' 'Rule_Learning' 'Theory']
Number of classes: 7


## __DBLP/ACM__

In [None]:
acmfp = r"/content/drive/My Drive/ERinDS/ACM.csv"
dblpfp = r"/content/drive/My Drive/ERinDS/DBLP2.csv"
acm_dblp_mapping_fp = r"/content/drive/My Drive/ERinDS/DBLP-ACM_perfectMapping.csv"

acm = pd.read_csv(acmfp)
dblp = pd.read_csv(dblpfp, encoding='latin-1')
perfect_mapping = pd.read_csv(acm_dblp_mapping_fp)

dblp['year'] = dblp['year'].astype(str)
acm['year'] = acm['year'].astype(str)

### Overview

In [None]:
acm

Unnamed: 0,id,title,authors,venue,year
0,304586,The WASA2 object-oriented workflow management ...,"Gottfried Vossen, Mathias Weske",International Conference on Management of Data,1999
1,304587,A user-centered interface for querying distrib...,"Isabel F. Cruz, Kimberly M. James",International Conference on Management of Data,1999
2,304589,"World Wide Database-integrating the Web, CORBA...","Athman Bouguettaya, Boualem Benatallah, Lily H...",International Conference on Management of Data,1999
3,304590,XML-based information mediation with MIX,"Chaitan Baru, Amarnath Gupta, Bertram Lud&#228...",International Conference on Management of Data,1999
4,304582,The CCUBE constraint object-oriented database ...,"Alexander Brodsky, Victor E. Segal, Jia Chen, ...",International Conference on Management of Data,1999
...,...,...,...,...,...
2289,672977,Dual-Buffering Strategies in Object Bases,"Alfons Kemper, Donald Kossmann",Very Large Data Bases,1994
2290,950482,Guest editorial,"Philip A. Bernstein, Yannis Ioannidis, Raghu R...",The VLDB Journal &mdash; The International Jou...,2003
2291,672980,GraphDB: Modeling and Querying Graphs in Datab...,Ralf Hartmut G&#252;ting,Very Large Data Bases,1994
2292,945741,Review of The data warehouse toolkit: the comp...,Alexander A. Anisimov,ACM SIGMOD Record,2003


In [None]:
dblp

Unnamed: 0,id,title,authors,venue,year
0,journals/sigmod/Mackay99,Semantic Integration of Environmental Models f...,D. Scott Mackay,SIGMOD Record,1999
1,conf/vldb/PoosalaI96,Estimation of Query-Result Distribution and it...,"Viswanath Poosala, Yannis E. Ioannidis",VLDB,1996
2,conf/vldb/PalpanasSCP02,Incremental Maintenance for Non-Distributive A...,"Themistoklis Palpanas, Richard Sidle, Hamid Pi...",VLDB,2002
3,conf/vldb/GardarinGT96,Cost-based Selection of Path Expression Proces...,"Zhao-Hui Tang, Georges Gardarin, Jean-Robert G...",VLDB,1996
4,conf/vldb/HoelS95,Benchmarking Spatial Join Operations with Spat...,"Erik G. Hoel, Hanan Samet",VLDB,1995
...,...,...,...,...,...
2611,journals/tods/KarpSP03,A simple algorithm for finding frequent elemen...,"Scott Shenker, Christos H. Papadimitriou, Rich...",ACM Trans. Database Syst.,2003
2612,conf/vldb/LimWV03,SASH: A Self-Adaptive Histogram Set for Dynami...,"Lipyeow Lim, Min Wang, Jeffrey Scott Vitter",VLDB,2003
2613,journals/tods/ChakrabartiKMP02,Locally adaptive dimensionality reduction for ...,"Kaushik Chakrabarti, Eamonn J. Keogh, Michael ...",ACM Trans. Database Syst.,2002
2614,journals/sigmod/Snodgrass01,Chair's Message,Richard T. Snodgrass,SIGMOD Record,2001


In [None]:
perfect_mapping

Unnamed: 0,idDBLP,idACM
0,conf/sigmod/SlivinskasJS01,375678
1,conf/sigmod/ChaudhuriDN01,375694
2,conf/sigmod/RinfretOO01,375669
3,conf/sigmod/BreunigKKS01,375672
4,conf/sigmod/JagadishJOT01,375687
...,...,...
2219,journals/sigmod/Scholl01,604275
2220,journals/sigmod/Rosneblatt94,190649
2221,journals/sigmod/Winslett02b,601871
2222,journals/sigmod/Labrinidis01,604283


In [None]:
acm.loc[acm['id'] == 375678]

Unnamed: 0,id,title,authors,venue,year
301,375678,Adaptable query optimization and evaluation in...,"Giedrius Slivinskas, Christian S. Jensen, Rich...",International Conference on Management of Data,2001


In [None]:
dblp.loc[dblp['id'] == 'conf/sigmod/SlivinskasJS01']

Unnamed: 0,id,title,authors,venue,year
143,conf/sigmod/SlivinskasJS01,Adaptable Query Optimization and Evaluation in...,"Christian S. Jensen, Richard T. Snodgrass, Gie...",SIGMOD Conference,2001


### Preprocess

In [None]:
def preprocess(row):
  # print(row)
  paper_str = " ".join(row)
  paper_str = paper_str.lower()

  return paper_str

### Dataset split

### Model evaluation

Small dataset

In [None]:
text = []
id = []
sameas = []
true_labels = []
data = {'id':[],'text':[],'sameas':[]}
index = 0

for _,row in perfect_mapping.head(10).iterrows():

  # DBLP
  dplp_row = dblp.loc[dblp.id == row['idDBLP'],['title','authors','venue','year']].values.flatten().tolist()
  id.append(row['idDBLP'])
  sameas.append(row['idACM'])
  dplp_row = preprocess(dplp_row)
  text.append(dplp_row)

  # ACM
  acm_row = acm.loc[acm.id == row['idACM'],['title','authors','venue','year']].values.flatten().tolist()
  acm_row = preprocess(acm_row)
  text.append(acm_row)
  id.append(row['idACM'])
  sameas.append(row['idDBLP'])

data['id'] = id
data['text'] = text
data['sameas'] = sameas

dataset=pd.DataFrame(data)
# print(dataset)
dataset

Unnamed: 0,id,text,sameas
0,conf/sigmod/SlivinskasJS01,adaptable query optimization and evaluation in...,375678
1,375678,adaptable query optimization and evaluation in...,conf/sigmod/SlivinskasJS01
2,conf/sigmod/ChaudhuriDN01,"a robust, optimization-based approach for appr...",375694
3,375694,"a robust, optimization-based approach for appr...",conf/sigmod/ChaudhuriDN01
4,conf/sigmod/RinfretOO01,bit-sliced index arithmetic elizabeth j. o'nei...,375669
5,375669,"bit-sliced index arithmetic denis rinfret, pat...",conf/sigmod/RinfretOO01
6,conf/sigmod/BreunigKKS01,data bubbles: quality preserving performance b...,375672
7,375672,data bubbles: quality preserving performance b...,conf/sigmod/BreunigKKS01
8,conf/sigmod/JagadishJOT01,global optimization of histograms h. v. jagadi...,375687
9,375687,global optimization of histograms h. v. jagadi...,conf/sigmod/JagadishJOT01


In [None]:
model = RankedWTAHash(
    max_numberOf_clusters = 10,
    max_editDistance = 140,
    windowSize=5,
    similarityThreshold = 0.8,
    maxOnly=True,
    metric='customkendal'
    )
EditDistance = model.EditDistance
model.fit(dataset['text'],None)

['adaptable query optimization and evaluation in temporal middleware christian s. jensen, richard t. snodgrass, giedrius slivinskas sigmod conference 2001', 'adaptable query optimization and evaluation in temporal middleware giedrius slivinskas, christian s. jensen, richard thomas snodgrass international conference on management of data 2001', 'a robust, optimization-based approach for approximate answering of aggregate queries vivek r. narasayya, gautam das, surajit chaudhuri sigmod conference 2001', 'a robust, optimization-based approach for approximate answering of aggregate queries surajit chaudhuri, gautam das, vivek narasayya international conference on management of data 2001', "bit-sliced index arithmetic elizabeth j. o'neil, denis rinfret, patrick e. o'neil sigmod conference 2001", "bit-sliced index arithmetic denis rinfret, patrick o'neil, elizabeth o'neil international conference on management of data 2001", 'data bubbles: quality preserving performance boosting for hierarch

## __CoRA__ - New


### Load from Drive

In [43]:
fpcora = r"/content/drive/My Drive/ERinDS/CORA.xml"
cora = pdx.read_xml(fpcora,['CORA', 'NEWREFERENCE'],root_is_rows=False)
cora.index += 1 
xml_dataframe = cora
cora

Unnamed: 0,@id,author,title,journal,volume,pages,date,#text,publisher,address,note,booktitle,editor,booktile,tech,institution,Pages,year,type,month
1,1,"M. Ahlskog, J. Paloheimo, H. Stubb, P. Dyrekle...",Inganas and M.R.,"Andersson, J Appl. Phys.,",76,893,(1994).,ahlskog1994a,,,,,,,,,,,,
2,2,"M. Ahlskog, J. Paloheimo, H. Stubb, P. Dyrekle...",,"J Appl. Phys.,",76,893,(1994).,ahlskog1994a,,,,,,,,,,,,
3,3,"M. Ahlskog, J. Paloheimo, H. Stubb, P. Dyrekle...",,"J Appl. Phys.,",76,893,(1994).,ahlskog1994a,,,,,,,,,,,,
4,4,"M. Ahlskog, J. Paloheimo, H. Stubb, P. Dyrekle...",,"J Appl. Phys.,",76,893,(1994).,ahlskog1994a,,,,,,,,,,,,
5,5,"M. Ahlskog, J. Paloheimo, H. Stubb, P. Dyrekle...",,"J Appl. Phys.,",76,893,(1994).,ahlskog1994a,,,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1875,1875,"Richard C. Yee, Sharad Saxena, Paul E. Utgoff,...",Explaining temporal-differences to create usef...,,,,,751\nyee1990,,,,"In Proceedings of AAAI-90,",,,,,,1990.,,
1876,1876,"Q. Zheng,",Real-time Fault-tolerant Communication in Comp...,,,,,752\nzheng1993,,"University of Michigan,",Available via anonymous ftp from ftp.eecs.umic...,,,,,,,1993.,"PhD thesis,",
1877,1877,"Q. Zheng,",Real-time Fault-tolerant Communication in Comp...,,,,,753\nzheng1993,,,PostScript version of the thesis is available ...,,,,,"University of Michigan,",,1993.,"PhD thesis,",
1878,1878,"Q. Zheng,",Real-time Fault-tolerant Communication in Comp...,,,,,754\nzheng1993,,"University of Michigan,",PostScript version of the thesis is available ...,,,,,,,1993.,"PhD thesis,",


In [44]:
fpcora_gold = r"/content/drive/My Drive/ERinDS/cora_gold.csv"
cora_gold = pd.read_csv(fpcora_gold,sep=';')
true_values = cora_gold
cora_gold

Unnamed: 0,id1,id2
0,1,2
1,1,3
2,1,4
3,1,5
4,1,6
...,...,...
64573,1876,1878
64574,1876,1879
64575,1877,1878
64576,1877,1879


### Preprocess

In [45]:
print(xml_dataframe.columns)

Index(['@id', 'author', 'title', 'journal', 'volume', 'pages', 'date', '#text',
       'publisher', 'address', 'note', 'booktitle', 'editor', 'booktile',
       'tech', 'institution', 'Pages', 'year', 'type', 'month'],
      dtype='object')


In [46]:
def preprocess(row):
  # print(row)
  paper_str = " ".join(row)
  paper_str = paper_str.lower()
  paper_str = paper_str.replace("\n", " ").replace("/z", " ").replace("[","").replace("]","")

  return str(paper_str)

In [47]:
shuffled_df = xml_dataframe.sample(frac=1).reset_index(drop=True)
shuffled_df

Unnamed: 0,@id,author,title,journal,volume,pages,date,#text,publisher,address,note,booktitle,editor,booktile,tech,institution,Pages,year,type,month
0,547,D. Kibler and D. W. Aha.,Learning representative exemplars of concepts:...,,,,,17\naha1987,Kaufmann.,"CA,",,Proc. 4th International Workshop on Machine Le...,"In P. Langley, editor,",,,,,1987.,,
1,363,S. E. Fahlman and C. Lebiere.,The cascade-correlation architecture.,,2,pages 524532.,1990.,fahlman1990b,"Morgan Kauf mann,","San Mateo, CA,",,Advances in Neural Information Processing Stru...,"In D. S. Touretsky, editor,",,,,,,,
2,1633,"Utgoff, P.",Incremental induction of decision trees.,"Machine Learning,",4,161-186.,,509\nutgoff1989,,,,,,,,,,(1989).,,
3,1767,"Utgoff, P. E. & Clouse, J. A.",Two kinds of training information for evaluati...,,,"pages 596-600,",,643\nutgoff1991aaai,Morgan Kaufmann.,"San Mateo, CA.",,In Proceedings of the Ninth Annual Conference ...,,,,,,(1991).,,
4,567,"Aha, D. and Kibler, D.",Noise-tolerant instace-based learning algorithms.,,,(pp. 794-799).,,37\naha1989,Morgan Kaufmann.,"Detroit, MI:",,Proceedings of the Eleventh International Join...,,,,,,(1989),,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1874,1753,P. E. Utgoff and C. E. Brodley.,An incremental method for finding multivariate...,,16,"pages 58-65,",,629\nutgoff1990,Morgan Kaufmann.,"Los Altos, CA,",,In Proceedings of the Seventh International Co...,,,,,,1990.,,
1875,566,"Aha, D. W., & Kibler, D.",Noise-tolerant instance-based learning algorit...,,,(pp. 794-799).,,36\naha1989,Morgan Kaufmann.,"Detroit, Michigan:",,Proceedings of the Eleventh International Join...,,,,,,(1989).,,
1876,261,"Fahlman, S. E. & C.","Lebiere (1990), The Cascade-Correlation Learni...",,,pp. 524-532.,,fahlman1990b,"Morgan Kaufmann,","San Mateo,",,in Advances in Neural Information Processing S...,"D.S. Touretzky, ed.,",,,,,,,
1877,937,"D. W. Aha, D. Kibler & M. K. Albert.",Instance-based learning algorithms.,Machine Learning,6(1),37-66.,,407\naha1991,,,,,,,,,,(1991),,


In [48]:
def cora_createDataset(xml_dataframe, true_values, fields):

  rawStr_col = []
  sameEntities_dictionary = {}

  for _, row in xml_dataframe.iterrows():
    # print(row)
    rawStr = []
    for field in fields:    # NAN
      rawStr.append(str(row[field]))

    rawStr_col.append(preprocess(rawStr))

  num_of_records = len(shuffled_df)
  trueValues_matrix = np.zeros([num_of_records,num_of_records],dtype=np.int8)
  
  for _, row in true_values.iterrows():  
    trueValues_matrix[row['id1']-1][row['id2']-1] = 1
    if row['id1'] not in sameEntities_dictionary.keys():
       sameEntities_dictionary[row['id1']] = []
    sameEntities_dictionary[row['id1']].append(row['id2'])

  return rawStr_col,sameEntities_dictionary, trueValues_matrix



fields = ['author', 'title', 'journal', 'volume', 'pages', 'date', '#text',
       'publisher', 'address', 'note', 'booktitle', 'editor', 'booktile',
       'tech', 'institution', 'Pages', 'year', 'type', 'month']

data, labels, true_matrix = cora_createDataset(shuffled_df, true_values, fields)
print(labels)
print(labels.keys())
print(true_matrix)

{1: [2, 3, 4, 5, 6, 7, 8], 2: [3, 4, 5, 6, 7, 8], 3: [4, 5, 6, 7, 8], 4: [5, 6, 7, 8], 5: [6, 7, 8], 6: [7, 8], 7: [8], 12: [13, 14], 13: [14], 16: [17, 18], 17: [18], 19: [20, 21, 22, 23], 20: [21, 22, 23], 21: [22, 23], 22: [23], 27: [28, 29], 28: [29], 30: [31], 32: [33], 34: [35], 40: [41, 42], 41: [42], 43: [44, 45, 46], 44: [45, 46], 45: [46], 47: [48], 49: [50, 51, 52, 53, 426, 427], 50: [51, 52, 53, 426, 427], 51: [52, 53, 426, 427], 52: [53, 426, 427], 53: [426, 427], 426: [427], 54: [55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66], 55: [56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66], 56: [57, 58, 59, 60, 61, 62, 63, 64, 65, 66], 57: [58, 59, 60, 61, 62, 63, 64, 65, 66], 58: [59, 60, 61, 62, 63, 64, 65, 66], 59: [60, 61, 62, 63, 64, 65, 66], 60: [61, 62, 63, 64, 65, 66], 61: [62, 63, 64, 65, 66], 62: [63, 64, 65, 66], 63: [64, 65, 66], 64: [65, 66], 65: [66], 67: [68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 9

### Overview

In [None]:
%%time
model = RankedWTAHash(
    max_numberOf_clusters = 100,
    max_editDistance = 450,
    windowSize=50,
    similarityThreshold = 0.8,
    maxOnly=False,
    # metric='customkendal'
    )
# EditDistance = model.EditDistance
model.fit(data,None)
model.evaluate_cora(true_matrix)


-----------------
String positions are:
[   0    1    2 ... 1876 1877 1878]
-----------------



HBox(children=(FloatProgress(value=0.0, max=1879.0), HTML(value='')))



-----------------
 Finding prototypes and representatices of each cluster:
[  17   25   43    0   69   74   77    0   90    0  105  120    0  129
  155    0    0  169  184    0  197  200  217    0  237  306  309  313
  319  329    0  347  376  386  395    0    0  432  442    0  450    0
  474  505  512  518  536    0  550    0  574    0  595    0  602    0
  644  660    0  665  669  673  689    0  699  709    0  714    0  726
    0  746    0  781  807  836  849  856    0  875    0  901  914  920
    0  937  946  959    0 1008    0 1015    0 1033 1047    0    0 1059
 1083 1099]
-----------------

-----------------
Embeddings:


HBox(children=(FloatProgress(value=0.0, max=1879.0), HTML(value='')))


[[23 10  7 ... 12 13 16]
 [23 11 16 ... 15 26 30]
 [30 31 16 ... 24  3  9]
 ...
 [28  9 21 ... 17 30 29]
 [33 36 19 ... 22 11  2]
 [22  2 24 ... 22 26 25]]
-----------------


-----------------
WTA Buckets:


HBox(children=(FloatProgress(value=0.0, max=1879.0), HTML(value='')))


[27 27 27 ... 27 27 84]
-----------------


-----------------
WTA RankedVectors after permutation:
[[23  1 15 ... 18  1  1]
 [23 14 30 ... 28 14 14]
 [30 42 29 ...  1 42 42]
 ...
 [28 19 30 ... 29 19 19]
 [33 40 41 ...  9 40 40]
 [22  7 18 ... 29  7  7]]
-----------------


-----------------
Similarity checking:


HBox(children=(FloatProgress(value=0.0, max=1879.0), HTML(value='')))

---

# References

[1]   [The dissimilarity representation for pattern recognition, a tutorial
Robert P.W. Duin and Elzbieta Pekalska Delft University of Technology, The Netherlands School of Computer Science, University of Manchester, United Kingdom](http://homepage.tudelft.nl/a9p19/presentations/DisRep_Tutorial_doc.pdf)