<p align="center">
 <img src="http://www.di.uoa.gr/themes/corporate_lite/logo_en.png" title="Department of Informatics and Telecommunications - University of Athens"/> </p>

<br>

---

<h3 align="center" > 
  Bachelor Thesis
</h3>

<h1 align="center" > 
  Entity Resolution in Dissimilarity Spaces <br>
  Implementation notebook
</h1>

---

<h3 align="center"> 
 <b>Konstantinos Nikoletos</b>
</h3>

<h4 align="center"> 
 <b>Supervisor: Dr. Alex Delis</b>,  Professor NKUA
</h4>
<br>
<h4 align="center"> 
Athens
</h4>
<h4 align="center"> 
January 2021 - Ongoing
</h4>


---


|  <font size="5"> Contents</font> |
| :--   |
|**1. [Abstract](#Abstract)** |
|**2. [Introduction](#Introduction)**  |
&nbsp;&nbsp;&nbsp;**2.1. [   Entity resolution](#Entity-resolution)** |
&nbsp;&nbsp;&nbsp;**2.2. [   Dissimilarity space](#Dissimilatiry-Space)** |
|**3. [ A dissimilarity-based space embedding methodology](#scrollTo=DcAYuFQjY2ni)** <br>
&nbsp;&nbsp;&nbsp;**3.1 [String Clustering and Prototype Selection](#3.1-String-Clustering-and-Prototype-Selection)** <br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;**3.1.1. [Edit distance metric](#Edit-distance-metric)** <br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;**3.1.2. [String clustering algorithm](#String-clustering-algorithm)** <br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;**3.1.3. [Algorithm complexity](#Algorithm-complexity)** <br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;**3.1.4. [Prototype selection](#Prototype-selection)** <br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;**3.1.5. [Algorithm-1: The String Clustering and Prototype Selection Algorithm](#Algorithm-1:-The-String-Clustering-and-Prototype-Selection-Algorithm)** <br>
&nbsp;&nbsp;&nbsp;**3.2 [The Vantage Space Embedding and the Chorus of Prototypes Transform Similarity Coefficient](#3.2-The-Vantage-Space-Embedding-and-the-Chorus-of-Prototypes-Transform-Similarity-Coefficient)** <br>
&nbsp;&nbsp;&nbsp;**3.3 [A Top-k List Approach for Similarity Searching in the Vantage Space](#3.3-A-Top-k-List-Approach-for-Similarity-Searching-in-the-Vantage-Space)**  |
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;**3.3.1. [Abstract Algebra definitions](#Abstract-Algebra-definitions)** <br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;**3.3.2. [Hausdorff metric](#Hausdorff-metric)** <br>
&nbsp;&nbsp;&nbsp;**3.4 [Hashing of Partially Ranked Data for Efficient Similarity Search](#3.4-Hashing-of-Partially-Ranked-Data-for-Efficient-Similarity-Search)** |
|**4. [ Evaluation](#Evaluation)** |
|**5. [References](#References)**  |



# __Implementation__

## __0. Libraries__

In [6]:
import pandas as pd
import numpy as np
import collections
from scipy import stats 
import editdistance
import string

In [7]:
!pip install editdistance



## __1. Prototype selection algorithm__

In [8]:
#####################################################################
# 1. Prototype selection algorithm                                  #
#####################################################################

'''
Clustering_Prototypes(S,k,d,pairDictionary,verbose)
The String Clustering and Prototype Selection Algorithm
is the main clustering method. It takes as input the initial strings S
(here, their positions in the dataset), the maximum number of clusters k
and the maximum allowable distance d for a string to join a cluster,
and returns the prototype of each cluster in the array Prototypes.
'''
def Clustering_Prototypes(S,k,d,pairDictionary,verbose=False):
    
    # ----------------- Initialization phase ----------------- #
    i = 0
    C = np.full([S.size], -1, dtype=int)   # C[i]: cluster of the i-th string, -1 while unassigned
    r = np.empty([2,k],dtype=object)       # r[0][j], r[1][j]: the two representatives of cluster j

    Clusters = [ [] for l in range(0,k)]

    while i < S.size:     # String-clustering phase, for all strings
        j = 0             # restart the scan from the first cluster for every string
        while j < k :       # iteration through clusters, for all clusters
            if r[0][j] is None:      # case empty first representative for cluster j
                r[0][j] = S[i]   # init cluster representative with string i
                C[i] = j         # store in C that i-string belongs to cluster j
                Clusters[j].append(S[i])
                break
            elif r[1][j] is None and (EditDistance(S[i],r[0][j]) <= d):  # case empty second representative
                r[1][j] = S[i]                                             # and string i within distance d of the first one
                C[i] = j
                Clusters[j].append(S[i])
                break
            elif (r[0][j] is not None and r[1][j] is not None) and (EditDistance(S[i],r[0][j]) + EditDistance(S[i],r[1][j])) <= d:
                C[i] = j
                Clusters[j].append(S[i])
                break
            else:
                j += 1
        i += 1
    
    # ----------------- Prototype selection phase ----------------- #
        
    Projections = np.empty([k],dtype=object)
    Prototypes = np.empty([k],dtype=object)   # stays None for clusters that received no strings
    sortedProjections = np.empty([k],dtype=object)
    j = 0

    if verbose:
        print("- - - - - - - - -")
        print("Cluster array:")
        print(C)
        print("- - - - - - - - -")
        print("Representatives array:")
        print(r)
        print("- - - - - - - - -")  
        print("Clusters:")
        print(Clusters)
        print("- - - - - - - - -")  

    
    while j < k and len(Clusters[j])>0:
        print("\n\n\n****** Prototype selection phase *********") 
        Projections[j] = Approximated_Projection_Distances_ofCluster(r[1][j], r[0][j], j, Clusters[j],pairDictionary)
        # sortedProjections[j] = np.sort(np.array(Projections[j]),kind = 'quicksort' ) 
        
        print("\n"+str(j)+"-Projections:")
        print(Projections[j])
        
        sortedProjections[j] = {k: v for k, v in sorted(Projections[j].items(), key=lambda item: item[1])}

        print(str(j)+"-sortedProjections:")
        print(sortedProjections[j])
        
        Prototypes[j] = Median(sortedProjections[j])
        
        print(".............")
        print(str(j)+"-Prototypes:")        
        print(Prototypes[j])
        
        j += 1
        print("\n****** END *********\n")

    return Prototypes


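def Approximated_Projection_Distances_ofCluster(right_rep, left_rep, cluster_id, clusterSet, pairDictionary):
    '''
    Approximates, for every string of the cluster, its projection on the line
    defined by the two cluster representatives, using only edit distances.
    Returns a dictionary: string -> projection value.
    '''
    # Assumed fallback: with a single representative (or two representatives at
    # distance 0) the line is undefined, so every string gets projection 0.
    if right_rep is None or EditDistance(right_rep,left_rep) == 0:
        return {member: 0 for member in clusterSet}

    distances_vector = dict()
    rep_distance     = EditDistance(right_rep,left_rep)

    for member in clusterSet:

        right_rep_distance = EditDistance(right_rep,member)
        left_rep_distance  = EditDistance(left_rep,member)

        distance = (right_rep_distance**2-rep_distance**2-left_rep_distance**2 ) / (2*rep_distance)
        distances_vector[member] = distance

    return distances_vector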

def Median(distances):    
    '''
    Returns the key (string position) found at the median position of a
    dictionary that is already sorted by projection value
    '''
    keys = list(distances.keys())
    median_position = int(len(keys)/2)
    median_value = keys[median_position]

    return median_value
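
The prototype-selection phase can be illustrated on a toy cluster. The sketch below is only an illustration under stated assumptions: it works on the strings themselves rather than on their positions, it replaces `editdistance.eval` with a small pure-Python Levenshtein function, and the names `levenshtein`, `projection_of` and `median_member` are hypothetical helpers that do not exist in the notebook. The projection expression is the one used in `Approximated_Projection_Distances_ofCluster`, and the median rule is the one used in `Median`.

```python
def levenshtein(a, b):
    """Dynamic-programming edit distance (insertions, deletions, substitutions)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def projection_of(member, left_rep, right_rep):
    """Same expression as in the notebook, computed from three edit distances."""
    rep = levenshtein(left_rep, right_rep)
    left = levenshtein(left_rep, member)
    right = levenshtein(right_rep, member)
    return (right ** 2 - rep ** 2 - left ** 2) / (2 * rep)


def median_member(cluster, left_rep, right_rep):
    """Sort the members by projection value and return the one in the middle."""
    ordered = sorted(cluster, key=lambda m: projection_of(m, left_rep, right_rep))
    return ordered[len(ordered) // 2]


# Hypothetical toy cluster whose representatives are its two extreme strings.
toy_cluster = ["aaaa", "aaab", "aabb", "abbb", "bbbb"]
prototype = median_member(toy_cluster, left_rep="aaaa", right_rep="bbbb")
print(prototype)
```

In this toy cluster the projections are evenly spaced between the two representatives, so the selected prototype is the string halfway between them.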

In [9]:
# pairDictionary = dict()
# input_strings = list(strObjects)
# k = 7 # max_number_of_clusters
# d = 120
# S_set = np.array(input_strings,dtype=object)
# S_index = np.arange(0,len(input_strings),1)
# # S_index_list = list(S_index) 


# # print(S_set)

# print("\n-----------------\nString positions are:")
# print(S_index)
# print("-----------------\n")

# Prototypes = Clustering_Prototypes(S_index,k,d)

# print("\n-----------------\n NumofPrototypes: "+str(k)+" || Prototypes are: ")
# print(Prototypes)
# print("-----------------")
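
The string-clustering phase can also be tried in isolation on a handful of strings. The following is a minimal sketch, not the notebook's function: it assumes that every string scans the clusters starting from the first one, it operates on raw strings instead of positions, and `levenshtein` and `cluster_strings` are hypothetical names introduced only for this example.

```python
def levenshtein(a, b):
    """Dynamic-programming edit distance (insertions, deletions, substitutions)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,
                               current[j - 1] + 1,
                               previous[j - 1] + cost))
        previous = current
    return previous[-1]


def cluster_strings(strings, k, d):
    """Assign each string to the first of k clusters that accepts it.

    A cluster accepts a string if it is empty, if it has one representative
    within distance d, or if the summed distance to its two representatives
    is at most d. Strings rejected by all k clusters stay unassigned.
    """
    reps = [[None, None] for _ in range(k)]
    clusters = [[] for _ in range(k)]
    for s in strings:
        for j in range(k):
            first, second = reps[j]
            if first is None:
                reps[j][0] = s                      # s opens cluster j
            elif second is None:
                if levenshtein(s, first) > d:
                    continue                        # too far, try next cluster
                reps[j][1] = s                      # s becomes second representative
            elif levenshtein(s, first) + levenshtein(s, second) > d:
                continue                            # too far from both representatives
            clusters[j].append(s)
            break
    return reps, clusters


# Hypothetical input: two groups of near-duplicate names.
names = ["john smith", "jon smith", "john smyth", "mary jones", "mary jonas"]
reps, clusters = cluster_strings(names, k=3, d=4)
print(clusters)
```

With these inputs the three "smith" variants share the first cluster and the two "mary" variants share the second, while the third cluster stays empty.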

## __2. Embeddings based on the Vantage objects__




In [10]:
#####################################################################
#       2. Embeddings based on the Vantage objects                  #
#####################################################################

'''
CreateVantageEmbeddings(S,VantageObjects): Main function for creating the string embeddings based on the Vantage Objects
'''
def CreateVantageEmbeddings(S,VantageObjects, pairDictionary):
    
    # ------- Distance computing ------- #     
    vectors = []
    for s in range(0,S.size):
        string_embedding = []
        for p in range(0,VantageObjects.size): 
            if VantageObjects[p] != None:
                string_embedding.append(DistanceMetric(s,p,S,VantageObjects, pairDictionary))
            
        # --- Ranking representation ---- #
        ranked_string_embedding = stats.rankdata(string_embedding, method='dense')
        
        # ------- Vectors dataset ------- #
        vectors.append(ranked_string_embedding)
    
    return np.array(vectors)
    

'''
DistanceMetric(s,p,S,Prototypes): Implementation of equation (5)
'''
def DistanceMetric(s,p,S,VantageObjects, pairDictionary):
    
    max_distance = None
    
    for pp in range(0,VantageObjects.size):
        if VantageObjects[pp] != None:
            string_distance = EditDistance(S[s],VantageObjects[pp])    # Edit distance String-i -> Vantage Object
            VO_distance     = EditDistance(VantageObjects[p],VantageObjects[pp])    # Edit distance Vantage Object-j -> Vantage Object-i

            abs_diff = abs(string_distance-VO_distance)

            # --- Max distance diff --- #        
            if max_distance == None:
                max_distance = abs_diff
            elif abs_diff > max_distance:
                max_distance = abs_diff
            
    return max_distance

def dropNone(array):
    # Keep every entry that is not None; position 0 is a valid entry and must survive
    array = [item for item in array if item is not None]
    return np.array(array)

def topKPrototypes():
    return

In [11]:
# Embeddings = CreateVantageEmbeddings(S_index,Prototypes)
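
A small worked example may help to read `DistanceMetric` and the ranking step. The sketch below is an illustration under assumptions: it uses raw strings, a pure-Python Levenshtein function in place of `editdistance.eval`, and a hand-written dense ranking in place of `stats.rankdata(..., method='dense')`. The names `levenshtein`, `vantage_coordinate`, `embed` and `dense_rank` are hypothetical.

```python
def levenshtein(a, b):
    """Dynamic-programming edit distance (insertions, deletions, substitutions)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,
                               current[j - 1] + 1,
                               previous[j - 1] + cost))
        previous = current
    return previous[-1]


def vantage_coordinate(s, p, prototypes):
    """max over all prototypes q of |d(s, q) - d(p, q)|, as in DistanceMetric."""
    return max(abs(levenshtein(s, q) - levenshtein(p, q)) for q in prototypes)


def embed(s, prototypes):
    """One coordinate per prototype."""
    return [vantage_coordinate(s, p, prototypes) for p in prototypes]


def dense_rank(values):
    """Ranks start at 1, equal values share a rank, and no rank is skipped."""
    rank_of = {v: r for r, v in enumerate(sorted(set(values)), start=1)}
    return [rank_of[v] for v in values]


# Hypothetical prototypes and query string.
prototypes = ["aaaa", "bbbb", "aabb"]
embedding = embed("aaab", prototypes)
ranked = dense_rank(embedding)
print(embedding, ranked)
```

Only the rank order of the coordinates survives the last step, which is what the later rank-correlation and hashing stages operate on.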

## __3. Metrics and Similarity functions__

In [54]:
#####################################################################
#                 3. Similarity function                            # 
#####################################################################
from scipy.spatial.distance import directed_hausdorff
from scipy.spatial.distance import hamming

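def SimilarityEvaluation(buckets,vectors,threshold):
  '''
    Compares every pair of embeddings with Kendall's tau and keeps the pairs
    whose correlation exceeds the threshold.
    Returns a dictionary: vector index -> list of similar vector indices.
    Note: buckets is accepted but not used yet; all pairs are compared.
  '''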

  numOfVectors = vectors.shape[0]
  vectorDim    = vectors.shape[1]
  mapping = {}

  for v_index in range(0,numOfVectors,1):
    
    for i_index in range(v_index+1,numOfVectors,1):
      # print(v_index,i_index)
      tau, p_value = stats.kendalltau(vectors[v_index], vectors[i_index])
      # print(tau)
      
      if tau > threshold:
        if v_index not in mapping.keys():
          mapping[v_index] = []
        mapping[v_index].append(i_index)
        # print("--")
        # print(tau)
        # print(vectors[v_index])
        # print(vectors[i_index])
        # print("--")
  return mapping
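
To see what the threshold on Kendall's tau means for ranked embeddings, the coefficient can be computed by hand on short vectors. This is a sketch of the tau-b variant (the default of `stats.kendalltau`), written in pure Python for illustration only; `kendall_tau_b` is a hypothetical helper and is much slower than the SciPy routine.

```python
from itertools import combinations
from math import sqrt


def kendall_tau_b(x, y):
    """Kendall's tau-b: (C - D) / sqrt((C + D + Tx) * (C + D + Ty)).

    C and D count concordant and discordant pairs; Tx and Ty count pairs
    tied only in x and only in y. Pairs tied in both are ignored.
    """
    concordant = discordant = tied_x = tied_y = 0
    for i, j in combinations(range(len(x)), 2):
        dx = x[i] - x[j]
        dy = y[i] - y[j]
        if dx == 0 and dy == 0:
            continue
        if dx == 0:
            tied_x += 1
        elif dy == 0:
            tied_y += 1
        elif (dx > 0) == (dy > 0):
            concordant += 1
        else:
            discordant += 1
    denominator = sqrt((concordant + discordant + tied_x) *
                       (concordant + discordant + tied_y))
    if denominator == 0:
        return float("nan")   # undefined when one vector is constant
    return (concordant - discordant) / denominator


# Hypothetical rank vectors.
same = kendall_tau_b([1, 2, 3, 4], [1, 2, 3, 4])
swapped = kendall_tau_b([1, 2, 3, 4], [1, 3, 2, 4])
reverse = kendall_tau_b([1, 2, 3, 4], [4, 3, 2, 1])
print(same, swapped, reverse)
```

A single swap in a vector of length four already drops the coefficient to about 0.67, so a threshold of 0.80 on such short vectors accepts only identical orderings.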


## __4. Hashing__

In [16]:
#####################################################################
#                        4. Hashing                                 # 
#####################################################################

def WTA(vectors,K,inputDim):
  '''
    Winner Take All hash - Yagnik
    .............................

    vectors:  array with one ranked embedding per row
    K:        window size
    inputDim: expected embedding dimension (the permutation uses the actual row length)

    A single random permutation is applied to every vector; the hash code is
    the position of the maximum among the first K permuted components.
  '''

  buckets = dict()

  numOfVectors = vectors.shape[0]
  vectorDim    = vectors.shape[1]

  C = np.zeros([numOfVectors], dtype=int)
  theta = np.random.permutation(vectorDim)
  window = min(K, vectorDim)

  for v_index in range(0,numOfVectors,1):
    X_new = permuted(vectors[v_index],theta)

    # winner: index of the largest of the first K permuted components
    c_i = 0
    for j in range(1,window):
      if X_new[j] > X_new[c_i]:
        c_i = j

    C[v_index] = c_i
    buckets = bucketInsert(buckets,c_i,v_index)
  
  return C,buckets

def permuted(vector,permutation):
  permuted_vector = [vector[x] for x in permutation]
  return permuted_vector 

def bucketInsert(buckets,bucket_id,item):
  if bucket_id not in buckets.keys():
    buckets[bucket_id] = []
  buckets[bucket_id].append(item)

  return buckets
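
The key property of Winner-Take-All hashing is that the code depends only on the relative order of the components, not on their magnitudes. The sketch below shows this with a fixed, hand-picked permutation so that the result is reproducible; `wta_code` and `group_by_code` are hypothetical helpers, and the vectors are made up for the illustration.

```python
def wta_code(vector, permutation, window):
    """Permute the vector, keep the first `window` components and
    return the position of the largest one (first position on ties)."""
    head = [vector[p] for p in permutation][:window]
    return max(range(len(head)), key=head.__getitem__)


def group_by_code(vectors, permutation, window):
    """Bucket vector indices by their WTA code."""
    buckets = {}
    for index, vector in enumerate(vectors):
        buckets.setdefault(wta_code(vector, permutation, window), []).append(index)
    return buckets


# Fixed permutation and window size, chosen by hand for this example.
theta = [2, 0, 3, 1]
K = 3

vectors = [
    [1, 4, 2, 3],       # some ordering
    [10, 40, 20, 30],   # same ordering, different magnitudes
    [4, 1, 3, 2],       # a different ordering
]
codes = [wta_code(v, theta, K) for v in vectors]
buckets = group_by_code(vectors, theta, K)
print(codes, buckets)
```

The first two vectors share a bucket because they rank their components identically, while the third one receives a different code.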

## __Final model__









In [58]:
class RankedWTAHash:

  def __init__(self, max_numberOf_clusters, max_editDistance, windowSize ):
      '''
        Constructor
      '''
      self.max_numberOf_clusters = max_numberOf_clusters
      self.pairDictionary = dict()
      self.max_editDistance = max_editDistance
      self.windowSize = windowSize
      self.S_set = None 
      self.S_index = None 
      # self.inputSize = 
  
  def fit(self, X, y):
    """
      Fit the classifier from the training dataset.
      Parameters
      ----------
      X : Training data.
      y : Target values.
      Returns
      -------
      self : The fitted classifier.
    """
    
    input_strings = list(X)
    print(input_strings)
    self.S_set = np.array(input_strings,dtype=object)
    self.S_index = np.arange(0,len(input_strings),1)
    # S_index_list = list(S_index) 
    # self.X = S_index

    # print(S_set)

    print("\n-----------------\nString positions are:")
    print(self.S_index)
    print("-----------------\n")

    self.prototypeArray = Clustering_Prototypes(self.S_index,self.max_numberOf_clusters, self.max_editDistance, self.pairDictionary)
    self.embeddingDim   = self.prototypeArray.size
    print("\n-----------------\n NumofPrototypes: "+str(self.max_numberOf_clusters)+" || Prototypes are: ")
    print(self.prototypeArray)
    print("-----------------")

    self.Embeddings = CreateVantageEmbeddings(self.S_index,self.prototypeArray, self.pairDictionary)

    print("\n-----------------\nEmbeddings:")
    print(self.Embeddings)
    print("-----------------\n")

    self.HashedClusters,self.buckets = WTA(self.Embeddings,self.windowSize,self.embeddingDim)

    print("\n-----------------\nBuckets:")
    print(self.HashedClusters)
    print("-----------------\n")

    self.threshold = 0.80
    self.mapping = SimilarityEvaluation(self.buckets,self.Embeddings,self.threshold)
    print(self.mapping)
    return self
  
  def EditDistance(self, str1,str2,verbose=False):
      if verbose:
          if str1 == None:
              print("1")
          elif str2 == None:
              print("2")
          print("-> "+str(str1))
          print("--> "+str(str2))
          print(str(editdistance.eval(self.S_set[str1],self.S_set[str2])))
      
      
      # NOTE: Duplicates inside the dictionary     

      # Both (str1,str2) and (str2,str1) are stored on insertion, so one lookup suffices
      if (str1,str2) in self.pairDictionary:
          return self.pairDictionary[(str1,str2)]
      else:
          distance = editdistance.eval(self.S_set[str1],self.S_set[str2])
          self.pairDictionary[(str2,str1)] = self.pairDictionary[(str1,str2)] = distance
          return distance

  
  # def predict(self, X):
  # """
  #   Predict the class labels for the provided data.
  #   Parameters
  #   ----------
  #   X : Test samples.
  #   Returns
  #   -------
  #   y : Class labels for each data sample.
  # """
  
  
  # def predict_proba(self, X):
  # """
  #   Return probability estimates for the test data X.
  #   Parameters
  #   ----------
  #   X : Test samples.
    
  #   Returns
  #   -------
  #   p : The class probabilities of the input samples.
  # """


  # # def evaluate():
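
The caching idea behind `EditDistance` can be isolated in a few lines. The sketch below is one possible variant, not the notebook's method: it stores each unordered pair once under a normalized key instead of storing both orders, it uses a pure-Python Levenshtein function in place of `editdistance.eval`, and `CachedEditDistance` and its `computed` counter are hypothetical names added for the illustration.

```python
def levenshtein(a, b):
    """Dynamic-programming edit distance (insertions, deletions, substitutions)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,
                               current[j - 1] + 1,
                               previous[j - 1] + cost))
        previous = current
    return previous[-1]


class CachedEditDistance:
    """Edit distance between strings addressed by position, memoized per pair."""

    def __init__(self, strings):
        self.strings = strings
        self.cache = {}
        self.computed = 0          # how many distances were actually computed

    def __call__(self, i, j):
        key = (i, j) if i <= j else (j, i)   # the distance is symmetric
        if key not in self.cache:
            self.cache[key] = levenshtein(self.strings[i], self.strings[j])
            self.computed += 1
        return self.cache[key]


distance = CachedEditDistance(["kitten", "sitting", "mitten"])
first = distance(0, 1)
second = distance(1, 0)    # served from the cache
print(first, second, distance.computed)
```

Normalizing the key halves the size of the dictionary compared with storing both orders, at the cost of one comparison per lookup.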

---
---

# __Evaluation__

In [27]:
# Opening data file
import io
from google.colab import drive

drive.mount('/content/drive',force_remount=True)

Mounted at /content/drive


## __CoRA__

### Load from Drive

In [28]:
fpcites = r"/content/drive/My Drive/ERinDS/cora_cites.csv"
fppaper = r"/content/drive/My Drive/ERinDS/cora_paper.csv"
fpcontent = r"/content/drive/My Drive/ERinDS/cora_content.csv"

cites = pd.read_csv(fpcites,sep=';')
paper = pd.read_csv(fppaper,sep=';')
content = pd.read_csv(fpcontent,sep=';')

### Overview

In [29]:
cites

Unnamed: 0,cited_paper_id,citing_paper_id
0,35,887
1,35,1033
2,35,1688
3,35,1956
4,35,8865
...,...,...
5424,853116,19621
5425,853116,853155
5426,853118,1140289
5427,853155,853118


In [30]:
paper

Unnamed: 0,paper_id,class_label
0,35,Genetic_Algorithms
1,40,Genetic_Algorithms
2,114,Reinforcement_Learning
3,117,Reinforcement_Learning
4,128,Reinforcement_Learning
...,...,...
2703,1154500,Case_Based
2704,1154520,Neural_Networks
2705,1154524,Rule_Learning
2706,1154525,Rule_Learning


In [31]:
content

Unnamed: 0,paper_id,word_cited_id
0,35,word100
1,35,word1152
2,35,word1175
3,35,word1228
4,35,word1248
...,...,...
49211,1155073,word75
49212,1155073,word759
49213,1155073,word789
49214,1155073,word815


### Train-Test-Validation datasets



In [32]:
class_labels = np.unique(paper.class_label.to_numpy())
print(class_labels)
print("Number of classes: "+str(len(class_labels)))

['Case_Based' 'Genetic_Algorithms' 'Neural_Networks'
 'Probabilistic_Methods' 'Reinforcement_Learning' 'Rule_Learning' 'Theory']
Number of classes: 7


## __DBLP/ACM__

In [33]:
acmfp = r"/content/drive/My Drive/ERinDS/ACM.csv"
dblpfp = r"/content/drive/My Drive/ERinDS/DBLP2.csv"
acm_dblp_mapping_fp = r"/content/drive/My Drive/ERinDS/DBLP-ACM_perfectMapping.csv"

acm = pd.read_csv(acmfp)
dblp = pd.read_csv(dblpfp, encoding='latin-1')
perfect_mapping = pd.read_csv(acm_dblp_mapping_fp)

dblp['year'] = dblp['year'].astype(str)
acm['year'] = acm['year'].astype(str)

### Overview

In [34]:
acm

Unnamed: 0,id,title,authors,venue,year
0,304586,The WASA2 object-oriented workflow management ...,"Gottfried Vossen, Mathias Weske",International Conference on Management of Data,1999
1,304587,A user-centered interface for querying distrib...,"Isabel F. Cruz, Kimberly M. James",International Conference on Management of Data,1999
2,304589,"World Wide Database-integrating the Web, CORBA...","Athman Bouguettaya, Boualem Benatallah, Lily H...",International Conference on Management of Data,1999
3,304590,XML-based information mediation with MIX,"Chaitan Baru, Amarnath Gupta, Bertram Lud&#228...",International Conference on Management of Data,1999
4,304582,The CCUBE constraint object-oriented database ...,"Alexander Brodsky, Victor E. Segal, Jia Chen, ...",International Conference on Management of Data,1999
...,...,...,...,...,...
2289,672977,Dual-Buffering Strategies in Object Bases,"Alfons Kemper, Donald Kossmann",Very Large Data Bases,1994
2290,950482,Guest editorial,"Philip A. Bernstein, Yannis Ioannidis, Raghu R...",The VLDB Journal &mdash; The International Jou...,2003
2291,672980,GraphDB: Modeling and Querying Graphs in Datab...,Ralf Hartmut G&#252;ting,Very Large Data Bases,1994
2292,945741,Review of The data warehouse toolkit: the comp...,Alexander A. Anisimov,ACM SIGMOD Record,2003


In [35]:
dblp

Unnamed: 0,id,title,authors,venue,year
0,journals/sigmod/Mackay99,Semantic Integration of Environmental Models f...,D. Scott Mackay,SIGMOD Record,1999
1,conf/vldb/PoosalaI96,Estimation of Query-Result Distribution and it...,"Viswanath Poosala, Yannis E. Ioannidis",VLDB,1996
2,conf/vldb/PalpanasSCP02,Incremental Maintenance for Non-Distributive A...,"Themistoklis Palpanas, Richard Sidle, Hamid Pi...",VLDB,2002
3,conf/vldb/GardarinGT96,Cost-based Selection of Path Expression Proces...,"Zhao-Hui Tang, Georges Gardarin, Jean-Robert G...",VLDB,1996
4,conf/vldb/HoelS95,Benchmarking Spatial Join Operations with Spat...,"Erik G. Hoel, Hanan Samet",VLDB,1995
...,...,...,...,...,...
2611,journals/tods/KarpSP03,A simple algorithm for finding frequent elemen...,"Scott Shenker, Christos H. Papadimitriou, Rich...",ACM Trans. Database Syst.,2003
2612,conf/vldb/LimWV03,SASH: A Self-Adaptive Histogram Set for Dynami...,"Lipyeow Lim, Min Wang, Jeffrey Scott Vitter",VLDB,2003
2613,journals/tods/ChakrabartiKMP02,Locally adaptive dimensionality reduction for ...,"Kaushik Chakrabarti, Eamonn J. Keogh, Michael ...",ACM Trans. Database Syst.,2002
2614,journals/sigmod/Snodgrass01,Chair's Message,Richard T. Snodgrass,SIGMOD Record,2001


In [36]:
perfect_mapping

Unnamed: 0,idDBLP,idACM
0,conf/sigmod/SlivinskasJS01,375678
1,conf/sigmod/ChaudhuriDN01,375694
2,conf/sigmod/RinfretOO01,375669
3,conf/sigmod/BreunigKKS01,375672
4,conf/sigmod/JagadishJOT01,375687
...,...,...
2219,journals/sigmod/Scholl01,604275
2220,journals/sigmod/Rosneblatt94,190649
2221,journals/sigmod/Winslett02b,601871
2222,journals/sigmod/Labrinidis01,604283


In [37]:
acm.loc[acm['id'] == 375678]

Unnamed: 0,id,title,authors,venue,year
301,375678,Adaptable query optimization and evaluation in...,"Giedrius Slivinskas, Christian S. Jensen, Rich...",International Conference on Management of Data,2001


In [38]:
dblp.loc[dblp['id'] == 'conf/sigmod/SlivinskasJS01']

Unnamed: 0,id,title,authors,venue,year
143,conf/sigmod/SlivinskasJS01,Adaptable Query Optimization and Evaluation in...,"Christian S. Jensen, Richard T. Snodgrass, Gie...",SIGMOD Conference,2001


### Preprocess

In [39]:
def preprocess(row):
  # print(row)
  paper_str = " ".join(row)
  paper_str = paper_str.lower()

  return paper_str

### Dataset split

### Model evaluation

Small dataset

In [59]:
dataset100 = []

# For each of the first 100 ground-truth matches, append the DBLP record
# followed by the matching ACM record, so matches sit at positions (2i, 2i+1)
for _,row in perfect_mapping.head(100).iterrows():

  dblp_row = dblp.loc[dblp.id == row['idDBLP'],['title','authors','venue','year']].values.flatten().tolist()
  acm_row = acm.loc[acm.id == row['idACM'],['title','authors','venue','year']].values.flatten().tolist()

  dataset100.append(preprocess(dblp_row))
  dataset100.append(preprocess(acm_row))


print(dataset100)

['adaptable query optimization and evaluation in temporal middleware christian s. jensen, richard t. snodgrass, giedrius slivinskas sigmod conference 2001', 'adaptable query optimization and evaluation in temporal middleware giedrius slivinskas, christian s. jensen, richard thomas snodgrass international conference on management of data 2001', 'a robust, optimization-based approach for approximate answering of aggregate queries vivek r. narasayya, gautam das, surajit chaudhuri sigmod conference 2001', 'a robust, optimization-based approach for approximate answering of aggregate queries surajit chaudhuri, gautam das, vivek narasayya international conference on management of data 2001', "bit-sliced index arithmetic elizabeth j. o'neil, denis rinfret, patrick e. o'neil sigmod conference 2001", "bit-sliced index arithmetic denis rinfret, patrick o'neil, elizabeth o'neil international conference on management of data 2001", 'data bubbles: quality preserving performance boosting for hierarch

In [66]:
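model = RankedWTAHash(40,150,30)   # max clusters = 40, max edit distance = 150, WTA window = 30
# The module-level functions above call a global EditDistance; bind it to this model's cached method
EditDistance = model.EditDistance
model.fit(dataset100,None)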

['adaptable query optimization and evaluation in temporal middleware christian s. jensen, richard t. snodgrass, giedrius slivinskas sigmod conference 2001', 'adaptable query optimization and evaluation in temporal middleware giedrius slivinskas, christian s. jensen, richard thomas snodgrass international conference on management of data 2001', 'a robust, optimization-based approach for approximate answering of aggregate queries vivek r. narasayya, gautam das, surajit chaudhuri sigmod conference 2001', 'a robust, optimization-based approach for approximate answering of aggregate queries surajit chaudhuri, gautam das, vivek narasayya international conference on management of data 2001', "bit-sliced index arithmetic elizabeth j. o'neil, denis rinfret, patrick e. o'neil sigmod conference 2001", "bit-sliced index arithmetic denis rinfret, patrick o'neil, elizabeth o'neil international conference on management of data 2001", 'data bubbles: quality preserving performance boosting for hierarch
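
Because `dataset100` is built by appending each DBLP record immediately followed by its matching ACM record, the true matches sit at positions (0, 1), (2, 3), (4, 5) and so on. Under that assumption, a mapping of the form returned by `SimilarityEvaluation` could be scored as sketched below. The function `evaluate_mapping` is a hypothetical helper, and the mapping in the example is made up for illustration; it is not an output of the model.

```python
def evaluate_mapping(mapping, num_pairs):
    """Precision and recall of predicted pairs against the (2i, 2i+1) ground truth.

    mapping: dict of the form {index: [similar indices, ...]}
    num_pairs: number of ground-truth matches used to build the dataset
    """
    true_pairs = {(2 * i, 2 * i + 1) for i in range(num_pairs)}
    predicted = {(min(a, b), max(a, b))
                 for a, matches in mapping.items()
                 for b in matches}
    true_positives = len(true_pairs & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(true_pairs) if true_pairs else 0.0
    return precision, recall


# Made-up mapping over 3 ground-truth pairs: two correct pairs and one wrong one.
toy_mapping = {0: [1], 2: [3, 4]}
precision, recall = evaluate_mapping(toy_mapping, num_pairs=3)
print(precision, recall)
```

In the toy case two of the three predicted pairs are correct and two of the three true pairs are found, so both scores equal 2/3.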

---

# References

[1] Robert P.W. Duin and Elzbieta Pekalska. [The dissimilarity representation for pattern recognition, a tutorial](http://homepage.tudelft.nl/a9p19/presentations/DisRep_Tutorial_doc.pdf). Delft University of Technology, The Netherlands; School of Computer Science, University of Manchester, United Kingdom.