<p align="center">
 <img src="http://www.di.uoa.gr/themes/corporate_lite/logo_en.png" title="Department of Informatics and Telecommunications - University of Athens"/> </p>

<br>

---

<h3 align="center" > 
  Bachelor Thesis
</h3>

<h1 align="center" > 
  Entity Resolution in Dissimilarity Spaces <br>
  Implementation notebook
</h1>

---

<h3 align="center"> 
 <b>Konstantinos Nikoletos</b>
</h3>

<h4 align="center"> 
 <b>Supervisor: Dr. Alex Delis</b>,  Professor NKUA
</h4>
<br>
<h4 align="center"> 
Athens
</h4>
<h4 align="center"> 
January 2021 - Ongoing
</h4>


---


|  <font size="5"> Contents</font> |
| :--   |
|**1. [Abstract](#Abstract)** |
|**2. [Introduction](#Introduction)**  |
&nbsp;&nbsp;&nbsp;**2.1. [   Entity resolution](#Entity-resolution)** |
&nbsp;&nbsp;&nbsp;**2.2. [   Dissimilatiry space](#Dissimilatiry-Space)** |
|**3. [ A dissimilarity-based space embedding methodology](#scrollTo=DcAYuFQjY2ni)** <br>
&nbsp;&nbsp;&nbsp;**3.1 [String Clustering and Prototype Selection](#3.1-String-Clustering-and-Prototype-Selection)** <br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;**3.1.1. [Edit distance metric](#Edit-distance-metric)** <br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;**3.1.2. [String clustering algorithm](#String-clustering-algorithm)** <br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;**3.1.3. [Algorithm complexity](#Algorithm-complexity)** <br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;**3.1.4. [Prototype selection](#Prototype-selection)** <br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;**3.1.5. [Algorithm-1: The String Clustering and Prototype Selection Algorithm](#Algorithm-1:-The-String-Clustering-and-Prototype-Selection-Algorithm)** <br>
&nbsp;&nbsp;&nbsp;**3.2 [The Vantage Space Embedding and the Chorus of Prototypes Transform Similarity Coefficient](#3.2-The-Vantage-Space-Embedding-and-the-Chorus-of-Prototypes-Transform-Similarity-Coefficient)&nbsp; &nbsp;&nbsp; &nbsp;&nbsp; &nbsp;&nbsp; &nbsp;&nbsp; &nbsp;&nbsp; &nbsp;&nbsp; &nbsp;&nbsp; &nbsp;&nbsp; &nbsp;&nbsp; &nbsp;&nbsp; &nbsp;&nbsp; &nbsp;&nbsp; &nbsp;&nbsp; &nbsp;&nbsp; &nbsp;&nbsp; &nbsp;&nbsp; &nbsp;&nbsp; &nbsp;&nbsp; &nbsp;&nbsp; &nbsp;&nbsp; &nbsp;&nbsp; &nbsp;&nbsp; &nbsp;&nbsp; &nbsp;&nbsp; &nbsp;&nbsp; &nbsp;&nbsp; &nbsp;&nbsp; &nbsp; &nbsp; &nbsp;&nbsp; &nbsp; &nbsp; &nbsp;&nbsp; &nbsp;&nbsp; &nbsp;&nbsp; &nbsp;&nbsp; &nbsp;&nbsp; &nbsp; &nbsp; &nbsp;&nbsp; &nbsp;&nbsp; &nbsp;**  <br>
&nbsp;&nbsp;&nbsp;**3.3 [A Top-k List Approach for Similarity Searching in the Vantage Space](#3.3-A-Top-k-List-Approach-for-Similarity-Searching-in-the-Vantage-Space)**  |
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;**3.3.1. [Abstract Algebra definitions](#Abstract-Algebra-definitions)** <br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;**3.3.2. [Hausdorff metric](#Hausdorff-metric)** <br>
&nbsp;&nbsp;&nbsp;**3.4 [Hashing of Partially Ranked Data for Efficient Similarity Search](#3.4-Hashing-of-Partially-Ranked-Data-for-Efficient-Similarity-Search)** |
|**4. [ Evaluation](#Evaluation)** |
|**5. [References](#References)**  |



# __Implementation__

## __0.0 Install components__

In [1]:
!pip install editdistance



In [2]:
!pip install pandas
import pandas as pd
print(pd.__version__)

1.1.5


In [3]:
!pip install pandas_read_xml

Collecting pandas_read_xml
  Downloading https://files.pythonhosted.org/packages/dd/67/033ecb058eb44bfabc1f1b4f92e4a80f59c9b423c442255a56e1826776b5/pandas_read_xml-0.3.1-py3-none-any.whl
Collecting xmltodict
  Downloading https://files.pythonhosted.org/packages/28/fd/30d5c1d3ac29ce229f6bdc40bbc20b28f716e8b363140c26eff19122d8a5/xmltodict-0.12.0-py2.py3-none-any.whl
Collecting distlib
[?25l  Downloading https://files.pythonhosted.org/packages/87/26/f6a23dd3e578132cf924e0dd5d4e055af0cd4ab43e2a9f10b7568bfb39d9/distlib-0.3.2-py2.py3-none-any.whl (338kB)
[K     |████████████████████████████████| 348kB 2.8MB/s 
[?25hCollecting zipfile36
  Downloading https://files.pythonhosted.org/packages/fd/8a/3b7da0b0bd87d1ef05b74207827c72d348b56a0d6d83242582be18a81e02/zipfile36-0.1.3-py3-none-any.whl
Collecting urllib3>=1.26.3
[?25l  Downloading https://files.pythonhosted.org/packages/0c/cd/1e2ec680ec7b09846dc6e605f5a7709dfb9d7128e51a026e7154e18a234e/urllib3-1.26.5-py2.py3-none-any.whl (138kB)
[K    

## __0.1 Import libraries__

In [117]:
import pandas as pd
import numpy as np
import collections
import editdistance
import string
import sklearn
import pandas_read_xml as pdx
import time
import warnings
import sys

from tqdm.notebook import tqdm as tqdm
from scipy.spatial.distance import directed_hausdorff,hamming
from scipy.stats._stats import _kendall_dis
from scipy.stats import spearmanr,kendalltau,pearsonr
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.metrics import jaccard_score,accuracy_score,auc,f1_score,recall_score,precision_score,classification_report
from scipy.sparse import csr_matrix
from scipy import stats 

## Not used

### __1. Prototype selection algorithm__

In [None]:
#####################################################################
# 1. Prototype selection algorithm                                  #
#####################################################################

'''
Clustering_Prototypes(S,k,d,r,C) 
The String Clustering and Prototype Selection Algorithm
is the main clustering method, that takes as input the intial strings S, 
the max number of clusters to be generated in k,
the maximum allowable distance of a string to join a cluster in var d
and returns the prototype for each cluster in array Prototype
'''
def Clustering_Prototypes(S,k,d,pairDictionary,verbose=False):
    
    # ----------------- Initialization phase ----------------- #
    i = 0
    j = 0
    C = np.empty([S.size], dtype=int)
    r = np.empty([2,k],dtype=object)

    Clusters = [ [] for l in range(0,k)]

    while i < S.size:     # String-clustering phase, for all strings
        while j < k :       # iteration through clusters, for all clusters
            if r[0][j] == None:      # case empty first representative for cluster j
                r[0][j] = S[i]   # init cluster representative with string i
                C[i] = j         # store in C that i-string belongs to cluster j
                Clusters[j].append(S[i])
                break
            elif r[1][j] == None and (EditDistance(S[i],r[0][j]) <= d):  # case empty second representative 
                r[1][j] = S[i]                                             # and ED of representative 1  smaller than i-th string 
                C[i] = j
                Clusters[j].append(S[i])
                break
            elif (r[0][j] != None and r[1][j] != None) and (EditDistance(S[i],r[0][j]) + EditDistance(S[i],r[1][j])) <= d:
                C[i] = j
                Clusters[j].append(S[i])
                break
            else:
                j += 1
        i += 1
    
    # ----------------- Prototype selection phase ----------------- #
        
    Projections = np.empty([k],dtype=object)
    Prototypes = np.empty([k],dtype=int)
    sortedProjections = np.empty([k],dtype=object)
    j = 0

    if verbose:
        print("- - - - - - - - -")
        print("Cluster array:")
        print(C)
        print("- - - - - - - - -")
        print("Represantatives array:")
        print(r)
        print("- - - - - - - - -")  
        print("Clusters:")
        print(Clusters)
        print("- - - - - - - - -")  

    # print("\n\n\n****** Prototype selection phase *********") 
    while j < k and len(Clusters[j])>0:
        
        Projections[j] = Approximated_Projection_Distances_ofCluster(r[1][j], r[0][j], j, Clusters[j],pairDictionary)
        
        # print("\n"+str(j)+"-Projections:")
        # print(Projections[j])
        
        sortedProjections[j] = {k: v for k, v in sorted(Projections[j].items(), key=lambda item: item[1])}

        # print(str(j)+"-sortedProjections:")
        # print(sortedProjections[j])
        
        Prototypes[j] = Median(sortedProjections[j])
        
        # print(".............")
        # print(str(j)+"-Prototypes:")        
        # print(Prototypes[j])
        
        j += 1
    # print("\n****** END *********\n")

    return Prototypes


def Approximated_Projection_Distances_ofCluster(right_rep, left_rep, cluster_id, clusterSet, pairDictionary):

    distances_vector = dict()
    rep_distance     = EditDistance(right_rep,left_rep)

    for str_inCluster in range(0,len(clusterSet)): 

      right_rep_distance = EditDistance(right_rep,clusterSet[str_inCluster])
      left_rep_distance  = EditDistance(left_rep,clusterSet[str_inCluster])
      
      distance = (right_rep_distance**2-rep_distance**2-left_rep_distance**2 ) / (2*rep_distance)
      distances_vector[clusterSet[str_inCluster]] = distance

    return distances_vector

def Median(distances):    
    '''
    Returns the median value of a vector
    '''
    keys = list(distances.keys())
    median_position = int(len(keys)/2)
    median_value = keys[median_position]

    return median_value

### __2. Embeddings based on the Vantage objects__




In [None]:
#####################################################################
#       2. Embeddings based on the Vantage objects                  #
#####################################################################

'''
CreateVantageEmbeddings(S,VantageObjects): Main function for creating the string embeddings based on the Vantage Objects
'''
def CreateVantageEmbeddings(S,VantageObjects, pairDictionary):
    
    # ------- Distance computing ------- #     
    vectors = []
    for s in range(0,S.size):
        string_embedding = []
        for p in range(0,VantageObjects.size): 
            if VantageObjects[p] != None:
                string_embedding.append(DistanceMetric(s,p,S,VantageObjects, pairDictionary))
            
        # --- Ranking representation ---- #
        ranked_string_embedding = stats.rankdata(string_embedding, method='dense')
        
        # ------- Vectors dataset ------- #
        vectors.append(ranked_string_embedding)
    
    return np.array(vectors)
    

'''
DistanceMetric(s,p,S,Prototypes): Implementation of equation (5)
'''
def DistanceMetric(s,p,S,VantageObjects, pairDictionary):
    
    max_distance = None
    
    for pp in range(0,VantageObjects.size):
        if VantageObjects[pp] != None:
            string_distance = EditDistance(S[s],VantageObjects[pp])    # Edit distance String-i -> Vantage Object
            VO_distance     = EditDistance(VantageObjects[p],VantageObjects[pp])    # Edit distance Vantage Object-j -> Vantage Object-i

            abs_diff = abs(string_distance-VO_distance)

            # --- Max distance diff --- #        
            if max_distance == None:
                max_distance = abs_diff
            elif abs_diff > max_distance:
                max_distance = abs_diff
            
    return max_distance

def dropNone(array):
    array = list(filter(None, list(array)))
    return np.array(array)

def topKPrototypes():
    return

### __3. Metrics and Similarity functions__

In [None]:
#####################################################################
#                 3. Similarity function                            # 
#####################################################################
from scipy.spatial.distance import directed_hausdorff
from scipy.spatial.distance import hamming
from scipy.stats._stats import _kendall_dis

def SimilarityEvaluation(buckets,vectors,threshold, maxOnly=None, metric=None):

  numOfVectors = vectors.shape[0]
  vectorDim    = vectors.shape[1]
  mapping = {}

  for v_index in range(0,numOfVectors,1):
    
    for i_index in range(v_index+1,numOfVectors,1):
      if metric == None or metric == 'kendal': 
        tau, p_value = stats.kendalltau(vectors[v_index], vectors[i_index])
      else:
        numOf_discordant_pairs = _kendall_dis(vectors[v_index], vectors[i_index])
        tau = float((2*numOf_discordant_pairs) / ((vectorDim)*(vectorDim-1)))
      
      # print(tau,numOf_discordant_pairs,vectorDim)
      
      if tau > threshold or maxOnly:
        if not maxOnly:
          if v_index not in mapping.keys():
            mapping[v_index] = []
          mapping[v_index].append(i_index)
        else:
          if v_index not in mapping.keys():  
            mapping[v_index] = (i_index,tau)
          else:
            if mapping[v_index][1] < tau:
              mapping[v_index] = (i_index,tau)
 
  return mapping


### __4. Hashing__

In [None]:
#####################################################################
#                        4. Hashing                                 # 
#####################################################################

def WTA(vectors,K,inputDim):
  '''
    Winner Take All hash - Yagnik
    .............................

    m: number of permutations
    K: window size
  '''
  newVectors = []
  buckets = dict()

  numOfVectors = vectors.shape[0]
  vectorDim    = vectors.shape[1]

  C = np.zeros([numOfVectors], dtype=int)
  theta = np.random.permutation(inputDim)
  i=0;j=0;

  for v_index in range(0,numOfVectors,1):
    X_new = permuted(vectors[v_index],theta)
    # print( np.array(X_new[:K]))
    newVectors.append(X_new[:K])
    index_max = max(range(len(X_new)), key=X_new.__getitem__)
    c_i = index_max

    j=0
    for j in range(0,K):
      if X_new[j] > X_new[c_i]:
        c_i = j

    C[i] = c_i
    buckets = bucketInsert(buckets,c_i,i)
    i+=1
  
  return C,buckets,np.array(newVectors)

def permuted(vector,permutation):
  permuted_vector = [vector[x] for x in permutation]
  return permuted_vector 

def bucketInsert(buckets,bucket_id,item):
  if bucket_id not in buckets.keys():
    buckets[bucket_id] = []
  buckets[bucket_id].append(item)

  return buckets

## __Final model__









In [115]:
#@title Model class
#@markdown This cell contains all the code needed for the requested algorithm  
class RankedWTAHash:

  def __init__(self, max_numberOf_clusters, max_editDistance, windowSize, min_numOfNodes = 2, metric = 'kendal', similarityVectors='ranked', similarityThreshold=None, maxOnly=None ):
    '''
      Constructor
    '''
    self.max_numberOf_clusters = max_numberOf_clusters
    self.pairDictionary = dict()
    self.max_editDistance = max_editDistance
    self.windowSize = windowSize
    self.S_set = None 
    self.S_index = None 
    self.similarityThreshold = similarityThreshold
    self.maxOnly = maxOnly
    self.metric = metric
    self.min_numOfNodes = min_numOfNodes
    self.similarityVectors = similarityVectors
  
  def fit(self, X):
    """
      Fit the classifier from the training dataset.
      Parameters
      ----------
      X : Training data.
      Returns
      -------
      self : The fitted classifier.
    """
    print("\n#####################################################################\n#     .~ RankedWTAHash with Vantage embeddings starts training ~.   #\n#####################################################################\n")

    if isinstance(X, list):
      input_strings = X
    else:
      input_strings = list(X)

    # print(input_strings)
    self.S_set = np.array(input_strings,dtype=object)
    # print(self.S_set)
    self.S_index = np.arange(0,len(input_strings),1)

    # print("\n\nString positions are:")
    # print(self.S_index)
    # print("\n")

    print("###########################################################\n# > 1. Prototype selection phase                          #\n###########################################################\n")
    print("\n-> Finding prototypes and representatives of each cluster:")
    prototypes_time = time.time()
    self.prototypeArray,self.selected_numOfPrototypes = self.Clustering_Prototypes(self.S_index,self.max_numberOf_clusters, self.max_editDistance, self.pairDictionary)
    print("\n- Prototypes selected")
    self.embeddingDim = self.prototypeArray.size
    print(self.prototypeArray)
    print("\n- Final number of prototypes: ",self.selected_numOfPrototypes )
    prototypes_time = time.time() - prototypes_time
    print("\n# Finished in %.6s secs" % (prototypes_time))
    print("\n")

    print("###########################################################\n# > 2. Embeddings based on the Vantage objects            #\n###########################################################\n")
    print("\n-> Creating Embeddings:")
    embeddings_time = time.time()
    self.Embeddings = self.CreateVantageEmbeddings(self.S_index,self.prototypeArray, self.pairDictionary)
    print("- Embeddings created")
    print(self.Embeddings)
    embeddings_time = time.time() - embeddings_time
    print("\n# Finished in %.6s secs" % (embeddings_time))
    print("\n")


    print("###########################################################\n# > 3. WTA Hashing                                        #\n###########################################################\n")
    print("\n-> Creating WTA Buckets:")
    wta_time = time.time()
    self.HashedClusters,self.buckets,self.rankedVectors = self.WTA(self.Embeddings,self.windowSize,self.embeddingDim)
    print("- WTA Buckets created")
    print(self.HashedClusters)
    print("\n- WTA number of buckets: ", len(np.unique(self.HashedClusters)))
    print("\n- WTA RankedVectors after permutation:")
    print(self.rankedVectors)
    wta_time = time.time() - wta_time
    print("\n# Finished in %.6s secs" % (wta_time))
    print("\n")

    print("###########################################################\n# > 4. Similarity checking                                #\n###########################################################\n")
    print("\n-> Similarity checking:")
    similarity_time = time.time()
    if self.similarityVectors == 'ranked':
      self.mapping,self.mapping_matrix = self.SimilarityEvaluation(self.buckets,self.rankedVectors,self.similarityThreshold,maxOnly=self.maxOnly, metric=self.metric)
    elif self.similarityVectors == 'initial':
      self.mapping,self.mapping_matrix = self.SimilarityEvaluation(self.buckets,self.Embeddings,self.similarityThreshold,maxOnly=self.maxOnly, metric=self.metric)      
    print("- Similarity mapping in a dictionary")
    print(self.mapping)
    print("- Similarity mapping in a matrix")
    print(self.mapping_matrix)
    similarity_time = time.time() - similarity_time
    print("\n# Finished in %.6s secs" % (similarity_time))
    print("\n#####################################################################\n#                    .~ End of training ~.                          #\n#####################################################################\n")

    return self

  def EditDistance(self, str1,str2,verbose=False):
      if verbose:
        if str1 == None:
            print("1")
        elif str2 == None:
            print("2")
        print("-> "+str(str1))
        print("--> "+str(str2))
        print(str(editdistance.eval(self.S_set[str1],self.S_set[str2])))
      
      
      # NOTE: Duplicates inside the dictionary     

      if ((str1,str2) or (str2,str1))  in self.pairDictionary.keys():
        return self.pairDictionary[(str1,str2)]
      else:
        # if verbose:
        # print("++++++++++")
        # print(str1,str2)
        # print(self.S_set[str1],self.S_set[str2])
        # print("++++++++++")
        distance = editdistance.eval(self.S_set[str1],self.S_set[str2])
        self.pairDictionary[(str2,str1)] = self.pairDictionary[(str1,str2)] = distance
        return distance

  # ----------------------------------------------------------------------------------------------------------- #

  #####################################################################
  # 1. Prototype selection algorithm                                  #
  #####################################################################

  '''
  Clustering_Prototypes(S,k,d,r,C) 
  The String Clustering and Prototype Selection Algorithm
  is the main clustering method, that takes as input the intial strings S, 
  the max number of clusters to be generated in k,
  the maximum allowable distance of a string to join a cluster in var d
  and returns the prototype for each cluster in array Prototype
  '''
  def Clustering_Prototypes(self,S,k,d,pairDictionary,verbose=False):
      
      # ----------------- Initialization phase ----------------- #
      i = 0
      j = 0
      C = np.empty([S.size], dtype=int)
      r = np.empty([2,k],dtype=object)

      Clusters = [ [] for l in range(0,k)]

      for i in tqdm(range(0,S.size,1)):     # String-clustering phase, for all strings
          while j < k :       # iteration through clusters, for all clusters
              if r[0][j] == None:      # case empty first representative for cluster j
                  r[0][j] = S[i]   # init cluster representative with string i
                  C[i] = j         # store in C that i-string belongs to cluster j
                  Clusters[j].append(S[i])
                  break
              elif r[1][j] == None and (self.EditDistance(S[i],r[0][j]) <= d):  # case empty second representative 
                  r[1][j] = S[i]                                             # and ED of representative 1  smaller than i-th string 
                  C[i] = j
                  Clusters[j].append(S[i])
                  break
              elif (r[0][j] != None and r[1][j] != None) and (self.EditDistance(S[i],r[0][j]) + self.EditDistance(S[i],r[1][j])) <= d:
                  C[i] = j
                  Clusters[j].append(S[i])
                  break
              else:
                  j += 1
          i += 1

      # ----------------- Prototype selection phase ----------------- #
          
      Projections = np.empty([k],dtype=object)
      Prototypes = np.empty([k],dtype=int)
      sortedProjections = np.empty([k],dtype=object)

      Projections = []
      Prototypes = []
      sortedProjections = []

      if verbose:
          print("- - - - - - - - -")
          print("Cluster array:")
          print(C)
          print("- - - - - - - - -")
          print("Represantatives array:")
          print(r)
          print("- - - - - - - - -")  
          print("Clusters:")
          print(Clusters)
          print("- - - - - - - - -")  

      new_numofClusters = k

      # print("\n\n\n****** Prototype selection phase *********") 
      prototype_index = 0
      for j in range(0,k,1):
          
          # IF small cluster
          # print("Len ",len(Clusters[j]))
          if len(Clusters[j]) < self.min_numOfNodes or r[1][j] == None or r[0][j]==None:
            new_numofClusters-=1
            continue

          Projections.append(self.Approximated_Projection_Distances_ofCluster(r[1][j], r[0][j], j, Clusters[j],pairDictionary))         
          # print(Projections[prototype_index])
          sortedProjections.append({new_numofClusters: v for new_numofClusters, v in sorted(Projections[prototype_index].items(), key=lambda item: item[1])})
          
          
          Prototypes.append(self.Median(sortedProjections[prototype_index]))
          # print(Prototypes[prototype_index])

          prototype_index += 1

      # print("\n****** END *********\n")

      return np.array(Prototypes),new_numofClusters


  def Approximated_Projection_Distances_ofCluster(self, right_rep, left_rep, cluster_id, clusterSet, pairDictionary):
      # print("here")
      # print(clusterSet)
      # print(right_rep, left_rep)

      distances_vector = dict()

      if len(clusterSet) > 2:
        rep_distance     = self.EditDistance(right_rep,left_rep)
                 
        for str_inCluster in range(0,len(clusterSet)): 
          if clusterSet[str_inCluster] != right_rep and clusterSet[str_inCluster] != left_rep:
            # print(clusterSet[str_inCluster],right_rep,left_rep)
            right_rep_distance = self.EditDistance(right_rep,clusterSet[str_inCluster])
            left_rep_distance  = self.EditDistance(left_rep,clusterSet[str_inCluster])
            
            if rep_distance == 0: 
              distances_vector[clusterSet[str_inCluster]] = 0
            else:
              distance = (right_rep_distance**2-rep_distance**2-left_rep_distance**2 ) / (2*rep_distance)
              distances_vector[clusterSet[str_inCluster]] = distance
      
      else:
        if left_rep != None:
          distances_vector[left_rep] = left_rep
          # print("l")
        elif right_rep != None:
          distances_vector[right_rep] = right_rep
          # print("r")
        elif left_rep == None and right_rep == None:
          return None
      # print(distances_vector)
      return distances_vector

  def Median(self, distances):    
      '''
      Returns the median value of a vector
      '''
      keys = list(distances.keys())
      if keys == 1:
        return keys[0]

      # print(distances)
      keys = list(distances.keys())
      # print(keys)
      median_position = int(len(keys)/2)
      # print(median_position)
      median_value = keys[median_position]

      return median_value
  

  
  #####################################################################
  #       2. Embeddings based on the Vantage objects                  #
  #####################################################################

  '''
  CreateVantageEmbeddings(S,VantageObjects): Main function for creating the string embeddings based on the Vantage Objects
  '''
  def CreateVantageEmbeddings(self, S, VantageObjects, pairDictionary):
      
      # ------- Distance computing ------- #     
      vectors = []
      for s in tqdm(range(0,S.size)):
          string_embedding = []
          for p in range(0,VantageObjects.size): 
              if VantageObjects[p] != None:
                  string_embedding.append(self.DistanceMetric(s,p,S,VantageObjects, pairDictionary))
              
          # --- Ranking representation ---- #
          ranked_string_embedding = stats.rankdata(string_embedding, method='dense')
          
          # ------- Vectors dataset ------- #
          vectors.append(ranked_string_embedding)
      
      return np.array(vectors)
      

  '''
  DistanceMetric(s,p,S,Prototypes): Implementation of equation (5)
  '''
  def DistanceMetric(self, s, p, S, VantageObjects, pairDictionary):
      
      max_distance = None
      
      for pp in range(0,VantageObjects.size):
          if VantageObjects[pp] != None:
              string_distance = self.EditDistance(S[s],VantageObjects[pp])    # Edit distance String-i -> Vantage Object
              VO_distance     = self.EditDistance(VantageObjects[p],VantageObjects[pp])    # Edit distance Vantage Object-j -> Vantage Object-i

              abs_diff = abs(string_distance-VO_distance)

              # --- Max distance diff --- #        
              if max_distance == None:
                  max_distance = abs_diff
              elif abs_diff > max_distance:
                  max_distance = abs_diff
              
      return max_distance

  def dropNone(array):
      array = list(filter(None, list(array)))
      return np.array(array)

  def topKPrototypes():
      return

  #####################################################################
  #                 3. Similarity checking                            # 
  #####################################################################

  def SimilarityEvaluation(self, buckets,vectors,threshold,maxOnly=None,metric=None):
    
    print(buckets)
    print(vectors)
    numOfVectors = vectors.shape[0]
    vectorDim    = vectors.shape[1]
    mapping_matrix = np.zeros([numOfVectors,numOfVectors],dtype=np.int8)
    mapping = {}

    # Loop for every bucket
    for bucketid in tqdm(buckets.keys()):
      bucket_vectors = buckets[bucketid]
      numOfVectors = len(bucket_vectors)

      # For every vector inside the bucket
      for v_index in range(0,numOfVectors,1):
        v_vector_id = bucket_vectors[v_index]

        # Loop to all the other
        for i_index in range(v_index+1,numOfVectors,1):
          i_vector_id = bucket_vectors[i_index]

          print(vectors[v_vector_id])
          print(vectors[i_vector_id])
          # Simple Kendal tau metric
          if metric == None or metric == 'kendal': 
            similarity_prob, p_value = kendalltau(vectors[v_vector_id], vectors[i_vector_id])
          elif metric == 'customKendal':
            # Custom Kendal tau
            numOf_discordant_pairs = _kendall_dis(vectors[v_vector_id], vectors[i_vector_id])
            similarity_prob = (2*numOf_discordant_pairs) / (vectorDim*(vectorDim-1))
          elif metric == 'jaccard':
            similarity_prob = jaccard_score(vectors[v_vector_id], vectors[i_vector_id], average='micro')
          elif metric == 'cosine':
            similarity_prob = cosine_similarity(np.array(vectors[v_vector_id]).reshape(1, -1), np.array(vectors[i_vector_id]).reshape(1, -1))
          elif metric == 'pearson':
            similarity_prob, _ = pearsonr(vectors[v_vector_id], vectors[i_vector_id])
          elif metric == 'spearman':
            similarity_prob, _ = spearmanr(vectors[v_vector_id], vectors[i_vector_id])
          else:
            print("Please choose metric")
            

          # if v_vector_id == 0:
          #   print(v_vector_id, i_vector_id," : ",similarity_prob )        
          if similarity_prob > threshold or maxOnly:
            if not maxOnly:
              if v_vector_id not in mapping.keys():
                mapping[v_vector_id] = []
              mapping[v_vector_id].append(i_vector_id)  # insert into mapping
              mapping_matrix[v_vector_id][i_vector_id] = 1  # inform prediction matrix
            else:
              if v_vector_id not in mapping.keys():  
                mapping[v_vector_id] = (i_vector_id,similarity_prob)
                mapping_matrix[v_vector_id][i_vector_id] = 1
              else:
                if mapping[v_vector_id][1] < similarity_prob:
                  mapping[v_vector_id] = (i_vector_id,similarity_prob)
                  mapping_matrix[v_vector_id][i_vector_id] = 1
    
    return mapping, mapping_matrix

  #####################################################################
  #                        4. WTA Hashing                             # 
  #####################################################################

  def WTA(self,vectors,K,inputDim):
    '''
      Winner Take All hash - Yagnik
      .............................

      K: window size
    '''
    newVectors = []
    buckets = dict()

    numOfVectors = vectors.shape[0]
    vectorDim    = vectors.shape[1]

    if vectorDim < K:
      K = vectorDim
      warnings.warn("Window size greater than vector dimension")
      
    C = np.zeros([numOfVectors], dtype=int)
    
    i=0;j=0;
    theta = np.random.permutation(inputDim)

    for v_index in tqdm(range(0,numOfVectors,1)):
      # print("Before: ",vectors[v_index])
      X_new = self.permuted(vectors[v_index],theta)
      newVectors.append(X_new[:K])
      X_new = X_new[:K]
      # print("After: ",X_new)
      # print("X_new: ",X_new)
      index_max = max(range(len(X_new)), key=X_new.__getitem__)
      # print("- ",index_max)
      c_i = index_max

      for j in range(0,K,1):
        if X_new[j] > X_new[c_i]:
          c_i = j

      # print("-> ",c_i)
      C[i] = c_i
      buckets = self.bucketInsert(buckets,c_i,i)
      i+=1
    
    return C,buckets,np.array(newVectors)

  def permuted(self,vector,permutation):
    permuted_vector = [vector[x] for x in permutation]
    return permuted_vector 

  def bucketInsert(self,buckets,bucket_id,item):
    if bucket_id not in buckets.keys():
      buckets[bucket_id] = []
    buckets[bucket_id].append(item)

    return buckets

#####################################################################
#                          Evaluation                               # 
#####################################################################
warnings.filterwarnings("ignore", category=sklearn.exceptions.UndefinedMetricWarning)

def evaluate_cora(predicted_matrix, true_matrix, with_classification_report=False ):

  print("#####################################################################\n#                          Evaluation                               #\n#####################################################################\n")
  true_matrix = csr_matrix(true_matrix)
  # print(true_matrix)
  predicted_matrix =  csr_matrix(predicted_matrix)
  # print(predicted_matrix)

  acc = accuracy_score(true_matrix, predicted_matrix)
  f1 =  f1_score(true_matrix, predicted_matrix, average='micro')
  recall = recall_score(true_matrix, predicted_matrix, average='micro')
  precision = precision_score(true_matrix, predicted_matrix, average='micro')

  print("Accuracy:  %3.2f %%" % (acc*100))
  print("F1-Score:  %3.2f %%" % (f1*100))
  print("Recall:    %3.2f %%" % (recall*100))
  print("Precision: %3.2f %%" % (precision*100))

  # results_dataframe = pd.DataFrame(columns=['Accuracy','Precision','Recall','F1'])
  # results_dataframe.loc[len(results_dataframe)+1] = [acc,precision,recall,f1]

  if with_classification_report:
    print(classification_report(true_matrix, predicted_matrix))

  print('\n\n')
  return acc,f1,precision,recall

def GridSearch_cora(data,true_matrix):
  results_dataframe = pd.DataFrame(columns=['max_numberOf_clusters','max_editDistance','similarityThreshold','windowSize','metric','similarityVectors','Accuracy','Precision','Recall','F1','Time'])
  max_numberOf_clusters= [10, 20, 50, 100]
  max_editDistance= [30,50,100,200]
  windowSize= [3, 4, 5, 8, 10, 20]
  similarityThreshold= [0.5, 0.7, 0.9]
  metric= ['kendal', 'customKendal','spearman','jaccard','cosine','pearson']
  similarityVectors= ['ranked','initial']

  for n1 in max_numberOf_clusters:
    for n2 in max_editDistance:
      for n3 in similarityThreshold:
        for n4 in windowSize:
          for n5 in metric:
            for n6 in similarityVectors:
              start = time.time()
              model = RankedWTAHash(
                  max_numberOf_clusters= n1,
                  max_editDistance= n2,
                  windowSize= n4,
                  similarityThreshold= n3,
                  maxOnly= False,
                  metric=n5,
                  similarityVectors=n6
              )
              model = model.fit(data)
              exec_time = time.time() - start
              acc,f1,precision,recall = evaluate_cora(model,true_matrix)
              results_dataframe.loc[len(results_dataframe)+1] = [n1,n2,n3,n4,n5,n6,acc,precision,recall,f1,exec_time]

  return results_dataframe

---
---

# __Evaluation__

In [6]:
# Opening data file
import io
from google.colab import drive

drive.mount('/content/drive',force_remount=True)

Mounted at /content/drive


## __CoRA__

### Load from Drive

In [7]:
fpcites = r"/content/drive/My Drive/ERinDS/cora_cites.csv"
fppaper = r"/content/drive/My Drive/ERinDS/cora_paper.csv"
fpcontent = r"/content/drive/My Drive/ERinDS/cora_content.csv"

cites = pd.read_csv(fpcites,sep=';')
paper = pd.read_csv(fppaper,sep=';')
content = pd.read_csv(fpcontent,sep=';')

### Overview

In [8]:
cites

Unnamed: 0,cited_paper_id,citing_paper_id
0,35,887
1,35,1033
2,35,1688
3,35,1956
4,35,8865
...,...,...
5424,853116,19621
5425,853116,853155
5426,853118,1140289
5427,853155,853118


In [9]:
paper

Unnamed: 0,paper_id,class_label
0,35,Genetic_Algorithms
1,40,Genetic_Algorithms
2,114,Reinforcement_Learning
3,117,Reinforcement_Learning
4,128,Reinforcement_Learning
...,...,...
2703,1154500,Case_Based
2704,1154520,Neural_Networks
2705,1154524,Rule_Learning
2706,1154525,Rule_Learning


In [10]:
content

Unnamed: 0,paper_id,word_cited_id
0,35,word100
1,35,word1152
2,35,word1175
3,35,word1228
4,35,word1248
...,...,...
49211,1155073,word75
49212,1155073,word759
49213,1155073,word789
49214,1155073,word815


### Train-Test-Validation datasets



In [11]:
class_labels = np.unique(paper.class_label.to_numpy())
print(class_labels)
print("Number of classes: "+str(len(class_labels)))

['Case_Based' 'Genetic_Algorithms' 'Neural_Networks'
 'Probabilistic_Methods' 'Reinforcement_Learning' 'Rule_Learning' 'Theory']
Number of classes: 7


## __DBLP/ACM__

In [None]:
acmfp = r"/content/drive/My Drive/ERinDS/ACM.csv"
dblpfp = r"/content/drive/My Drive/ERinDS/DBLP2.csv"
acm_dblp_mapping_fp = r"/content/drive/My Drive/ERinDS/DBLP-ACM_perfectMapping.csv"

acm = pd.read_csv(acmfp)
dblp = pd.read_csv(dblpfp, encoding='latin-1')
perfect_mapping = pd.read_csv(acm_dblp_mapping_fp)

dblp['year'] = dblp['year'].astype(str)
acm['year'] = acm['year'].astype(str)

### Overview

In [None]:
acm

Unnamed: 0,id,title,authors,venue,year
0,304586,The WASA2 object-oriented workflow management ...,"Gottfried Vossen, Mathias Weske",International Conference on Management of Data,1999
1,304587,A user-centered interface for querying distrib...,"Isabel F. Cruz, Kimberly M. James",International Conference on Management of Data,1999
2,304589,"World Wide Database-integrating the Web, CORBA...","Athman Bouguettaya, Boualem Benatallah, Lily H...",International Conference on Management of Data,1999
3,304590,XML-based information mediation with MIX,"Chaitan Baru, Amarnath Gupta, Bertram Lud&#228...",International Conference on Management of Data,1999
4,304582,The CCUBE constraint object-oriented database ...,"Alexander Brodsky, Victor E. Segal, Jia Chen, ...",International Conference on Management of Data,1999
...,...,...,...,...,...
2289,672977,Dual-Buffering Strategies in Object Bases,"Alfons Kemper, Donald Kossmann",Very Large Data Bases,1994
2290,950482,Guest editorial,"Philip A. Bernstein, Yannis Ioannidis, Raghu R...",The VLDB Journal &mdash; The International Jou...,2003
2291,672980,GraphDB: Modeling and Querying Graphs in Datab...,Ralf Hartmut G&#252;ting,Very Large Data Bases,1994
2292,945741,Review of The data warehouse toolkit: the comp...,Alexander A. Anisimov,ACM SIGMOD Record,2003


In [None]:
dblp

Unnamed: 0,id,title,authors,venue,year
0,journals/sigmod/Mackay99,Semantic Integration of Environmental Models f...,D. Scott Mackay,SIGMOD Record,1999
1,conf/vldb/PoosalaI96,Estimation of Query-Result Distribution and it...,"Viswanath Poosala, Yannis E. Ioannidis",VLDB,1996
2,conf/vldb/PalpanasSCP02,Incremental Maintenance for Non-Distributive A...,"Themistoklis Palpanas, Richard Sidle, Hamid Pi...",VLDB,2002
3,conf/vldb/GardarinGT96,Cost-based Selection of Path Expression Proces...,"Zhao-Hui Tang, Georges Gardarin, Jean-Robert G...",VLDB,1996
4,conf/vldb/HoelS95,Benchmarking Spatial Join Operations with Spat...,"Erik G. Hoel, Hanan Samet",VLDB,1995
...,...,...,...,...,...
2611,journals/tods/KarpSP03,A simple algorithm for finding frequent elemen...,"Scott Shenker, Christos H. Papadimitriou, Rich...",ACM Trans. Database Syst.,2003
2612,conf/vldb/LimWV03,SASH: A Self-Adaptive Histogram Set for Dynami...,"Lipyeow Lim, Min Wang, Jeffrey Scott Vitter",VLDB,2003
2613,journals/tods/ChakrabartiKMP02,Locally adaptive dimensionality reduction for ...,"Kaushik Chakrabarti, Eamonn J. Keogh, Michael ...",ACM Trans. Database Syst.,2002
2614,journals/sigmod/Snodgrass01,Chair's Message,Richard T. Snodgrass,SIGMOD Record,2001


In [None]:
perfect_mapping

Unnamed: 0,idDBLP,idACM
0,conf/sigmod/SlivinskasJS01,375678
1,conf/sigmod/ChaudhuriDN01,375694
2,conf/sigmod/RinfretOO01,375669
3,conf/sigmod/BreunigKKS01,375672
4,conf/sigmod/JagadishJOT01,375687
...,...,...
2219,journals/sigmod/Scholl01,604275
2220,journals/sigmod/Rosneblatt94,190649
2221,journals/sigmod/Winslett02b,601871
2222,journals/sigmod/Labrinidis01,604283


In [None]:
acm.loc[acm['id'] == 375678]

Unnamed: 0,id,title,authors,venue,year
301,375678,Adaptable query optimization and evaluation in...,"Giedrius Slivinskas, Christian S. Jensen, Rich...",International Conference on Management of Data,2001


In [None]:
dblp.loc[dblp['id'] == 'conf/sigmod/SlivinskasJS01']

Unnamed: 0,id,title,authors,venue,year
143,conf/sigmod/SlivinskasJS01,Adaptable Query Optimization and Evaluation in...,"Christian S. Jensen, Richard T. Snodgrass, Gie...",SIGMOD Conference,2001


### Preprocess

In [None]:
def preprocess(row):
  # print(row)
  paper_str = " ".join(row)
  paper_str = paper_str.lower()

  return paper_str

### Dataset split

### Model evaluation

Small dataset

In [None]:
text = []
id = []
sameas = []
true_labels = []
data = {'id':[],'text':[],'sameas':[]}
index = 0

for _,row in perfect_mapping.head(10).iterrows():

  # DBLP
  dplp_row = dblp.loc[dblp.id == row['idDBLP'],['title','authors','venue','year']].values.flatten().tolist()
  id.append(row['idDBLP'])
  sameas.append(row['idACM'])
  dplp_row = preprocess(dplp_row)
  text.append(dplp_row)

  # ACM
  acm_row = acm.loc[acm.id == row['idACM'],['title','authors','venue','year']].values.flatten().tolist()
  acm_row = preprocess(acm_row)
  text.append(acm_row)
  id.append(row['idACM'])
  sameas.append(row['idDBLP'])

data['id'] = id
data['text'] = text
data['sameas'] = sameas

dataset=pd.DataFrame(data)
# print(dataset)
dataset

Unnamed: 0,id,text,sameas
0,conf/sigmod/SlivinskasJS01,adaptable query optimization and evaluation in...,375678
1,375678,adaptable query optimization and evaluation in...,conf/sigmod/SlivinskasJS01
2,conf/sigmod/ChaudhuriDN01,"a robust, optimization-based approach for appr...",375694
3,375694,"a robust, optimization-based approach for appr...",conf/sigmod/ChaudhuriDN01
4,conf/sigmod/RinfretOO01,bit-sliced index arithmetic elizabeth j. o'nei...,375669
5,375669,"bit-sliced index arithmetic denis rinfret, pat...",conf/sigmod/RinfretOO01
6,conf/sigmod/BreunigKKS01,data bubbles: quality preserving performance b...,375672
7,375672,data bubbles: quality preserving performance b...,conf/sigmod/BreunigKKS01
8,conf/sigmod/JagadishJOT01,global optimization of histograms h. v. jagadi...,375687
9,375687,global optimization of histograms h. v. jagadi...,conf/sigmod/JagadishJOT01


In [None]:
model = RankedWTAHash(
    max_numberOf_clusters = 10,
    max_editDistance = 140,
    windowSize=5,
    similarityThreshold = 0.8,
    maxOnly=True,
    metric='customkendal'
    )
EditDistance = model.EditDistance
model.fit(dataset['text'],None)

['adaptable query optimization and evaluation in temporal middleware christian s. jensen, richard t. snodgrass, giedrius slivinskas sigmod conference 2001', 'adaptable query optimization and evaluation in temporal middleware giedrius slivinskas, christian s. jensen, richard thomas snodgrass international conference on management of data 2001', 'a robust, optimization-based approach for approximate answering of aggregate queries vivek r. narasayya, gautam das, surajit chaudhuri sigmod conference 2001', 'a robust, optimization-based approach for approximate answering of aggregate queries surajit chaudhuri, gautam das, vivek narasayya international conference on management of data 2001', "bit-sliced index arithmetic elizabeth j. o'neil, denis rinfret, patrick e. o'neil sigmod conference 2001", "bit-sliced index arithmetic denis rinfret, patrick o'neil, elizabeth o'neil international conference on management of data 2001", 'data bubbles: quality preserving performance boosting for hierarch

## __CoRA__ - New


### Load from Drive

In [43]:
fpcora = r"/content/drive/My Drive/ERinDS/CORA.xml"
cora = pdx.read_xml(fpcora,['CORA', 'NEWREFERENCE'],root_is_rows=False)
# cora.index += 1 
xml_dataframe = cora
cora

Unnamed: 0,@id,author,title,journal,volume,pages,date,#text,publisher,address,note,booktitle,editor,booktile,tech,institution,Pages,year,type,month
0,1,"M. Ahlskog, J. Paloheimo, H. Stubb, P. Dyrekle...",Inganas and M.R.,"Andersson, J Appl. Phys.,",76,893,(1994).,ahlskog1994a,,,,,,,,,,,,
1,2,"M. Ahlskog, J. Paloheimo, H. Stubb, P. Dyrekle...",,"J Appl. Phys.,",76,893,(1994).,ahlskog1994a,,,,,,,,,,,,
2,3,"M. Ahlskog, J. Paloheimo, H. Stubb, P. Dyrekle...",,"J Appl. Phys.,",76,893,(1994).,ahlskog1994a,,,,,,,,,,,,
3,4,"M. Ahlskog, J. Paloheimo, H. Stubb, P. Dyrekle...",,"J Appl. Phys.,",76,893,(1994).,ahlskog1994a,,,,,,,,,,,,
4,5,"M. Ahlskog, J. Paloheimo, H. Stubb, P. Dyrekle...",,"J Appl. Phys.,",76,893,(1994).,ahlskog1994a,,,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1874,1875,"Richard C. Yee, Sharad Saxena, Paul E. Utgoff,...",Explaining temporal-differences to create usef...,,,,,751\nyee1990,,,,"In Proceedings of AAAI-90,",,,,,,1990.,,
1875,1876,"Q. Zheng,",Real-time Fault-tolerant Communication in Comp...,,,,,752\nzheng1993,,"University of Michigan,",Available via anonymous ftp from ftp.eecs.umic...,,,,,,,1993.,"PhD thesis,",
1876,1877,"Q. Zheng,",Real-time Fault-tolerant Communication in Comp...,,,,,753\nzheng1993,,,PostScript version of the thesis is available ...,,,,,"University of Michigan,",,1993.,"PhD thesis,",
1877,1878,"Q. Zheng,",Real-time Fault-tolerant Communication in Comp...,,,,,754\nzheng1993,,"University of Michigan,",PostScript version of the thesis is available ...,,,,,,,1993.,"PhD thesis,",


### Import true values

In [44]:
fpcora_gold = r"/content/drive/My Drive/ERinDS/cora_gold.csv"
cora_gold = pd.read_csv(fpcora_gold,sep=';')
true_values = cora_gold
cora_gold.head(40)

Unnamed: 0,id1,id2
0,1,2
1,1,3
2,1,4
3,1,5
4,1,6
5,1,7
6,1,8
7,2,3
8,2,4
9,2,5


In [20]:
cora20 = cora_gold.head(20)

In [51]:
cora20d = cora.head(15)
cora20d[['author',	'title',	'journal', 'date']]

Unnamed: 0,author,title,journal,date
0,"M. Ahlskog, J. Paloheimo, H. Stubb, P. Dyrekle...",Inganas and M.R.,"Andersson, J Appl. Phys.,",(1994).
1,"M. Ahlskog, J. Paloheimo, H. Stubb, P. Dyrekle...",,"J Appl. Phys.,",(1994).
2,"M. Ahlskog, J. Paloheimo, H. Stubb, P. Dyrekle...",,"J Appl. Phys.,",(1994).
3,"M. Ahlskog, J. Paloheimo, H. Stubb, P. Dyrekle...",,"J Appl. Phys.,",(1994).
4,"M. Ahlskog, J. Paloheimo, H. Stubb, P. Dyrekle...",,"J Appl. Phys.,",(1994).
5,"M. Ahlskog, J. Paloheimo, H. Stubb, P. Dyrekle...",,"J Appl. Phys.,",(1994).
6,"M. Ahlskog, J. Paloheimo, H. Stubb, P. Dyrekle...",,"J Appl. Phys.,",(1994).
7,"M. Ahlskog, J. Paloheimo, H. Stubb, P. Dyrekle...",,"Journal of Applied Physics,",(1994).
8,C. Ray Asfahl.,Robots and Manufacturing Automation.,,1992.
9,Steve Benford and Lennart E. Fahlen.,A spatial model of interaction in large virtua...,,"[September, 1993.]"


In [48]:
cora_gold.head(31)

Unnamed: 0,id1,id2
0,1,2
1,1,3
2,1,4
3,1,5
4,1,6
5,1,7
6,1,8
7,2,3
8,2,4
9,2,5


### Preprocess

In [15]:
def preprocess(row):
  # print(row)
  paper_str = " ".join(row)
  paper_str = paper_str.lower()
  paper_str = paper_str.replace("\n", " ").replace("/z", " ").replace("[","").replace("]","")

  return str(paper_str)

### Shuffle data

In [16]:
# print(xml_dataframe.columns)
shuffled_df = xml_dataframe.sample(frac=1).reset_index(drop=True)
shuffled_df

Unnamed: 0,@id,author,title,journal,volume,pages,date,#text,publisher,address,note,booktitle,editor,booktile,tech,institution,Pages,year,type,month
0,471,"Guy L. Steele Jr., Scott E. Fahlman, Richard P...",Common Lisp: The Language.,,,,1984.,steele1984a,"Digital Press,","Burlington, Massachusetts,",,,,,,,,,,
1,1710,"Utgoff, P. E.",Perceptron trees: A case study in hybrid conce...,"Connection Science,",1,377-391.,,586\nutgoff1989pt,,,,,,,,,,(1989b).,,
2,565,"Aha, D. W., & Kibler, D.",Noise-tolerant instance-based learning algorit...,,,(pp. 794-799).,,35\naha1989,Morgan Kaufmann.,"Detroit, Michigan:",,Proceedings of the Eleventh International Join...,,,,,,(1989).,,
3,524,"Weiss, Y., Edelman, S., and Fahle, M.",Models of perceptual learning in vernier hyper...,"Neural Computation,",5,695-718.,(1993).,weiss1993a,,,,,,,,,,,,
4,1048,"Ourston, D. & Mooney, R.",Theory refinement combining analytical and emp...,"Artificial Intelligence,",66(2),273-309.,,518\nourston1994,,,,,,,,,,(1994).,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1874,145,"Scott Fahlman,","""Faster-Learning Variations on Back Propagatio...",,,,1988.,fahlman1988a,"Morgan Kaufmann,",,,In Proceedings of the 1988 Connectionist Model...,,,,,,,,
1875,1029,"D. Kibler, D. W. Aha, and M. K. Albert.",Instance-Based Prediction of Real-Valued Attri...,"Computational Intelligence,",5,"51-57,",,499\nkibler1989,,,,,,,,,,1989.,,
1876,686,"Aha, David W., Dennis Kibler, Marc K. Albert,","Instance-Based Learning Algorithms,","Machine Learning,","vol. 6,",pp. 37-66.,,156\naha1991,,,,,,,,,,(1991).,,
1877,1242,Richard Caruana.,Multitask learning: A knowledge-based of sourc...,,18,"pages 41-48,",,118\ncaruana1993,Morgan Kaufmann.,"San Mateo, CA,",,Proceedings of the Tenth International Confere...,"In Paul E. Utgoff, editor,",,,,,1993.,,


In [79]:
def cora_createDataset(xml_dataframe, true_values, fields):

  rawStr_col = []
  index_to_id_dict = {}
  sameEntities_dictionary = {}

  i=0
  for _, row in tqdm(xml_dataframe.head(15).iterrows()):
    index_to_id_dict[int(row['@id'])] = i

    rawStr = []
    for field in fields:    # NAN
      rawStr.append(str(row[field]))
    i+=1
    rawStr_col.append(preprocess(rawStr))

  num_of_records = len(xml_dataframe)
  num_of_records = 15
  trueValues_matrix = np.zeros([num_of_records,num_of_records],dtype=np.int8)
  
  for _, row in tqdm(true_values.head(31).iterrows()):  
    trueValues_matrix[index_to_id_dict[row['id1']]][index_to_id_dict[row['id2']]] = 1
    if row['id1'] not in sameEntities_dictionary.keys():
       sameEntities_dictionary[row['id1']] = []
    sameEntities_dictionary[row['id1']].append(row['id2'])

  return rawStr_col,sameEntities_dictionary, trueValues_matrix



fields = ['author', 'title', 'journal', 'volume', 'pages', 'date', '#text',
       'publisher', 'address', 'note', 'booktitle', 'editor', 'booktile',
       'tech', 'institution', 'Pages', 'year', 'type', 'month']

fields = ['author', 'title', 'journal','volume', 'date']

data, labels, true_matrix = cora_createDataset(shuffled_df, true_values, fields)

HBox(children=(FloatProgress(value=1.0, bar_style='info', max=1.0), HTML(value='')))




HBox(children=(FloatProgress(value=1.0, bar_style='info', max=1.0), HTML(value='')))




### Evaluation

In [114]:
%%time
model = RankedWTAHash(
    max_numberOf_clusters= 10,
    max_editDistance= 100,
    windowSize= 3,
    similarityThreshold= 0.9,
    maxOnly= False,
    metric='jaccard',
    similarityVectors='ranked'
)
model = model.fit(data)
evaluate_cora(model.mapping_matrix,true_matrix, False)


#####################################################################
#     .~ RankedWTAHash with Vantage embeddings starts training ~.   #
#####################################################################

###########################################################
# > 1. Prototype selection phase                          #
###########################################################


-> Finding prototypes and representatives of each cluster:


HBox(children=(FloatProgress(value=0.0, max=15.0), HTML(value='')))



- Prototypes selected
[ 3  8 10 12]

- Final number of prototypes:  4

# Finished in 0.0447 secs


###########################################################
# > 2. Embeddings based on the Vantage objects            #
###########################################################


-> Creating Embeddings:


HBox(children=(FloatProgress(value=0.0, max=15.0), HTML(value='')))


- Embeddings created
[[1 2 4 3]
 [1 2 4 3]
 [1 2 4 3]
 [1 2 4 3]
 [1 2 4 3]
 [1 2 4 3]
 [1 2 4 3]
 [1 2 4 3]
 [2 1 4 3]
 [4 2 3 1]
 [4 3 1 2]
 [4 3 2 1]
 [4 3 2 1]
 [4 3 2 1]
 [3 4 2 1]]

# Finished in 0.0616 secs


###########################################################
# > 3. WTA Hashing                                        #
###########################################################


-> Creating WTA Buckets:


HBox(children=(FloatProgress(value=0.0, max=15.0), HTML(value='')))


- WTA Buckets created
[2 2 2 2 2 2 2 2 2 1 1 1 1 1 1]

- WTA number of buckets:  2

- WTA RankedVectors after permutation:
[[3 1 4]
 [3 1 4]
 [3 1 4]
 [3 1 4]
 [3 1 4]
 [3 1 4]
 [3 1 4]
 [3 1 4]
 [3 2 4]
 [1 4 3]
 [2 4 1]
 [1 4 2]
 [1 4 2]
 [1 4 2]
 [1 3 2]]

# Finished in 0.0674 secs


###########################################################
# > 4. Similarity checking                                #
###########################################################


-> Similarity checking:
{2: [0, 1, 2, 3, 4, 5, 6, 7, 8], 1: [9, 10, 11, 12, 13, 14]}
[[3 1 4]
 [3 1 4]
 [3 1 4]
 [3 1 4]
 [3 1 4]
 [3 1 4]
 [3 1 4]
 [3 1 4]
 [3 2 4]
 [1 4 3]
 [2 4 1]
 [1 4 2]
 [1 4 2]
 [1 4 2]
 [1 3 2]]


HBox(children=(FloatProgress(value=0.0, max=2.0), HTML(value='')))

[3 1 4]
[3 1 4]
[3 1 4]
[3 1 4]
[3 1 4]
[3 1 4]
[3 1 4]
[3 1 4]
[3 1 4]
[3 1 4]
[3 1 4]
[3 1 4]
[3 1 4]
[3 1 4]
[3 1 4]
[3 2 4]
[3 1 4]
[3 1 4]
[3 1 4]
[3 1 4]
[3 1 4]
[3 1 4]
[3 1 4]
[3 1 4]
[3 1 4]
[3 1 4]
[3 1 4]
[3 1 4]
[3 1 4]
[3 2 4]
[3 1 4]
[3 1 4]
[3 1 4]
[3 1 4]
[3 1 4]
[3 1 4]
[3 1 4]
[3 1 4]
[3 1 4]
[3 1 4]
[3 1 4]
[3 2 4]
[3 1 4]
[3 1 4]
[3 1 4]
[3 1 4]
[3 1 4]
[3 1 4]
[3 1 4]
[3 1 4]
[3 1 4]
[3 2 4]
[3 1 4]
[3 1 4]
[3 1 4]
[3 1 4]
[3 1 4]
[3 1 4]
[3 1 4]
[3 2 4]
[3 1 4]
[3 1 4]
[3 1 4]
[3 1 4]
[3 1 4]
[3 2 4]
[3 1 4]
[3 1 4]
[3 1 4]
[3 2 4]
[3 1 4]
[3 2 4]
[1 4 3]
[2 4 1]
[1 4 3]
[1 4 2]
[1 4 3]
[1 4 2]
[1 4 3]
[1 4 2]
[1 4 3]
[1 3 2]
[2 4 1]
[1 4 2]
[2 4 1]
[1 4 2]
[2 4 1]
[1 4 2]
[2 4 1]
[1 3 2]
[1 4 2]
[1 4 2]
[1 4 2]
[1 4 2]
[1 4 2]
[1 3 2]
[1 4 2]
[1 4 2]
[1 4 2]
[1 3 2]
[1 4 2]
[1 3 2]

- Similarity mapping in a dictionary
{0: [1, 2, 3, 4, 5, 6, 7], 1: [2, 3, 4, 5, 6, 7], 2: [3, 4, 5, 6, 7], 3: [4, 5, 6, 7], 4: [5, 6, 7], 5: [6, 7], 6: [7], 11: [12, 13], 12: [13]}
- 

### GridSearch - Shuffled

In [118]:
results_shuffled = GridSearch_cora(data,true_matrix)


#####################################################################
#     .~ RankedWTAHash with Vantage embeddings starts training ~.   #
#####################################################################

###########################################################
# > 1. Prototype selection phase                          #
###########################################################


-> Finding prototypes and representatives of each cluster:


HBox(children=(FloatProgress(value=0.0, max=15.0), HTML(value='')))



- Prototypes selected
[ 3 11]

- Final number of prototypes:  2

# Finished in 0.0498 secs


###########################################################
# > 2. Embeddings based on the Vantage objects            #
###########################################################


-> Creating Embeddings:


HBox(children=(FloatProgress(value=0.0, max=15.0), HTML(value='')))


- Embeddings created
[[1 2]
 [1 2]
 [1 2]
 [1 2]
 [1 2]
 [1 2]
 [1 2]
 [1 2]
 [1 2]
 [2 1]
 [2 1]
 [2 1]
 [2 1]
 [2 1]
 [2 1]]

# Finished in 0.0915 secs


###########################################################
# > 3. WTA Hashing                                        #
###########################################################


-> Creating WTA Buckets:




HBox(children=(FloatProgress(value=0.0, max=15.0), HTML(value='')))


- WTA Buckets created
[0 0 0 0 0 0 0 0 0 1 1 1 1 1 1]

- WTA number of buckets:  2

- WTA RankedVectors after permutation:
[[2 1]
 [2 1]
 [2 1]
 [2 1]
 [2 1]
 [2 1]
 [2 1]
 [2 1]
 [2 1]
 [1 2]
 [1 2]
 [1 2]
 [1 2]
 [1 2]
 [1 2]]

# Finished in 0.0695 secs


###########################################################
# > 4. Similarity checking                                #
###########################################################


-> Similarity checking:
{0: [0, 1, 2, 3, 4, 5, 6, 7, 8], 1: [9, 10, 11, 12, 13, 14]}
[[2 1]
 [2 1]
 [2 1]
 [2 1]
 [2 1]
 [2 1]
 [2 1]
 [2 1]
 [2 1]
 [1 2]
 [1 2]
 [1 2]
 [1 2]
 [1 2]
 [1 2]]


HBox(children=(FloatProgress(value=0.0, max=2.0), HTML(value='')))

[2 1]
[2 1]
[2 1]
[2 1]
[2 1]
[2 1]
[2 1]
[2 1]
[2 1]
[2 1]
[2 1]
[2 1]
[2 1]
[2 1]
[2 1]
[2 1]
[2 1]
[2 1]
[2 1]
[2 1]
[2 1]
[2 1]
[2 1]
[2 1]
[2 1]
[2 1]
[2 1]
[2 1]
[2 1]
[2 1]
[2 1]
[2 1]
[2 1]
[2 1]
[2 1]
[2 1]
[2 1]
[2 1]
[2 1]
[2 1]
[2 1]
[2 1]
[2 1]
[2 1]
[2 1]
[2 1]
[2 1]
[2 1]
[2 1]
[2 1]
[2 1]
[2 1]
[2 1]
[2 1]
[2 1]
[2 1]
[2 1]
[2 1]
[2 1]
[2 1]
[2 1]
[2 1]
[2 1]
[2 1]
[2 1]
[2 1]
[2 1]
[2 1]
[2 1]
[2 1]
[2 1]
[2 1]
[1 2]
[1 2]
[1 2]
[1 2]
[1 2]
[1 2]
[1 2]
[1 2]
[1 2]
[1 2]
[1 2]
[1 2]
[1 2]
[1 2]
[1 2]
[1 2]
[1 2]
[1 2]
[1 2]
[1 2]
[1 2]
[1 2]
[1 2]
[1 2]
[1 2]
[1 2]
[1 2]
[1 2]
[1 2]
[1 2]

- Similarity mapping in a dictionary
{0: [1, 2, 3, 4, 5, 6, 7, 8], 1: [2, 3, 4, 5, 6, 7, 8], 2: [3, 4, 5, 6, 7, 8], 3: [4, 5, 6, 7, 8], 4: [5, 6, 7, 8], 5: [6, 7, 8], 6: [7, 8], 7: [8], 9: [10, 11, 12, 13, 14], 10: [11, 12, 13, 14], 11: [12, 13, 14], 12: [13, 14], 13: [14]}
- Similarity mapping in a matrix
[[0 1 1 1 1 1 1 1 1 0 0 0 0 0 0]
 [0 0 1 1 1 1 1 1 1 0 0 0 0 0 0]
 [0 0 0 1 1 1

TypeError: ignored

---

# References

1.   [The dissimilarity representation for pattern recognition, a tutorial
Robert P.W. Duin and Elzbieta Pekalska Delft University of Technology, The Netherlands School of Computer Science, University of Manchester, United Kingdom](http://homepage.tudelft.nl/a9p19/presentations/DisRep_Tutorial_doc.pdf)