<p align="center">
 <img src="http://www.di.uoa.gr/themes/corporate_lite/logo_en.png" title="Department of Informatics and Telecommunications - University of Athens"/> </p>

<br>

---

<h3 align="center" > 
  Bachelor Thesis
</h3>

<h1 align="center" > 
  Entity Resolution in Dissimilarity Spaces
</h1>

---

<h3 align="center"> 
 <b>Konstantinos Nikoletos</b>
</h3>

<h4 align="center"> 
 <b>Supervisor: Alex Delis</b>, Dr. Professor NKUA
</h4>
<br>
<h4 align="center"> 
Athens
</h4>
<h4 align="center"> 
January-September 2021
</h4>

---

# Abstract

---

This notebook presents a dissimilarity-based entity resolution framework that introduces a new, efficient object representation scheme. The framework consists of four parts. The first part is string clustering and prototype selection, which builds the clusters that are afterwards used for the string embedding. The second part embeds the strings into an N-dimensional Vantage Space generated by the prototype selection. The third part presents a distance measure that relies on the Kendall tau correlation coefficient and generalizes the similarity measures and distances presented so far. Finally, the fourth part adds a sparse embedding scheme on this metric in order to minimize the computational cost of the methodology.

The system is evaluated on three databases. Its performance is compared with other well-known entity resolution systems in terms of Recall, Precision, and computational time.




# Introduction

---

Every technique and methodology used in this work that is out of the ordinary is briefly introduced and explained, starting with entity resolution.


 ### Entity resolution
 
__Entity resolution (ER)__, also known as deduplication, is a research theme that has recently received increased interest. ER is the process of creating systematic linkage between disparate data records that represent the same real-world entity, in the absence of a join key. For example, in a previous project of mine, the dataset consisted of camera records from multiple websites (Amazon, AliBaba, etc.) and the task was to find which of these records refer to the same real object. Records may have slightly different names, somewhat different descriptions, maybe similar prices, and totally different unique identifiers. This may not sound like a big deal, but the sheer volume of some datasets and databases makes the task challenging in terms of both accuracy and computational cost. ER applications are now used for multiple reasons: not only for avoiding duplicates in databases, but also for tasks such as finding "similar" social media or email accounts that are connected to criminal actions.

The goal of this project is to build an entity resolution system that performs well in Precision, Recall, and execution time. In this work we embrace an embedding approach: a number of pivot objects are selected to act as prototypes, and the dissimilarity space of proximities is transformed into a reduced set of distances of objects from these prototypes. It is therefore important to make clear what a dissimilarity space is. The definition comes from the fields of Statistics and theoretical Machine Learning.

### Dissimilarity Space

Dissimilarities [1] have been used in pattern recognition for a long time. In one approach, the dissimilarity matrix is considered as a set of row vectors, one for every object. These vectors represent the objects in a vector space constructed from the dissimilarities to the other objects. Usually, this vector space is treated as a Euclidean space and equipped with the standard inner product. Let $X = \{x_1, \ldots, x_n\}$ be a training set. Given a dissimilarity function and/or dissimilarity data, we define a data-dependent mapping $D(\cdot, R) : X \to \mathbb{R}^k$ from $X$ to the so-called __dissimilarity space (DS)__. The $k$-element set $R$ consists of objects that are representative of the problem. This set is called the representation set or __prototype set__, and it may be a subset of $X$. In the dissimilarity space each dimension $D(\cdot, p_i)$ describes a dissimilarity to a prototype $p_i$ from $R$. In this work, we initially choose $R := X$. As a result, every object is described by an $n$-dimensional dissimilarity vector $D(x, X) = [d(x, x_1), \ldots, d(x, x_n)]^T$. The resulting vector space is endowed with the traditional inner product and the Euclidean metric. Any dissimilarity measure $\rho$ can be defined in the DS.
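
As a minimal sketch of this mapping, the snippet below builds $D(x, X)$ for a few strings. The Levenshtein edit distance is only an assumed choice of $d$, and the names `edit_distance`, `dissimilarity_vector`, and the sample strings are illustrative, not part of the framework.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via dynamic programming (an assumed choice of d)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,           # deletion
                            curr[j - 1] + 1,       # insertion
                            prev[j - 1] + cost))   # substitution or match
        prev = curr
    return prev[-1]


def dissimilarity_vector(x, prototypes, d=edit_distance):
    """D(x, R) = [d(x, p_1), ..., d(x, p_k)] for the prototype set R."""
    return [d(x, p) for p in prototypes]


# Hypothetical sample records.
X = ["canon eos 5d", "canon eos5d", "nikon d750"]

# With R := X, every object is described by an n-dimensional vector (n = len(X)).
D = [dissimilarity_vector(x, X) for x in X]
```

Each row of `D` is the representation of one string in the dissimilarity space; choosing a smaller prototype set `R` instead of `X` shortens every row to `len(R)` entries.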




___This project builds on four pillars:___

1. Object partitioning and embedding. More specifically, the embedding techniques mainly used are __Vantage Embedding__ and the __Chorus of Prototypes scheme__.
2. Machine Learning techniques that build __nearest neighbors classification__ models on the selection of prototypes.
3. Various correlation coefficients and distance metrics that are applied to ranked data, as well as a generalization of the well-known __Hausdorff distance metric for partially ranked data__.
4. __Locality Sensitive Hashing (LSH)__ techniques specifically tuned for handling ranking data to render the similarity search process very efficient.





# A dissimilarity-based space embedding methodology
---

The central theme of this methodology is the transformation of the input data into a representation that can easily and accurately circumvent the inherent lack of object features and handle a variety of different data types in a unified way.


The first step is to read and process the input data (strings in this particular work). Section 3.1 presents the idea and the algorithm of string clustering used to create the embeddings. First, however, it is important to define the Vantage Space and the Chorus of Prototypes scheme. These approaches are used to capture the similarity of high-dimensional data efficiently and effectively.

### Vantage Space

The *Vantage Objects* approach maps pairwise distances of input objects into an 𝑁-dimensional space of pivot objects, known as the __Vantage Space__, in such a way that points that lie close to each other in this space correspond to similar objects in the original dissimilarity space. The __Chorus of Prototypes Transform (CT)__, on the other hand, proposes the use of a rank correlation coefficient defined on the data induced by the distances of the input objects from the pivot objects.
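
The two ideas can be sketched as follows, under assumptions that are ours rather than the method's: the string distance is a stand-in built on Python's `difflib`, the pivots are hand-picked, and Kendall tau is used as the rank correlation coefficient. The helper names (`string_distance`, `vantage_embedding`, `kendall_tau`) are hypothetical.

```python
from difflib import SequenceMatcher
from itertools import combinations


def string_distance(a, b):
    """Stand-in dissimilarity in [0, 1]: one minus difflib's similarity ratio."""
    return 1.0 - SequenceMatcher(None, a, b).ratio()


def vantage_embedding(x, pivots, d):
    """Coordinates of x in the Vantage Space: its distance to each pivot."""
    return [d(x, p) for p in pivots]


def kendall_tau(u, v):
    """Kendall tau-a of two equal-length vectors; tied pairs count as neither."""
    if len(u) != len(v) or len(u) < 2:
        raise ValueError("need two vectors of equal length >= 2")
    concordant = discordant = 0
    for i, j in combinations(range(len(u)), 2):
        s = (u[i] - u[j]) * (v[i] - v[j])
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
    n_pairs = len(u) * (len(u) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical pivot objects and query strings.
pivots = ["canon eos 5d", "nikon d750", "sony a7 iii"]
a = vantage_embedding("canon eos5d", pivots, string_distance)
b = vantage_embedding("canon eos 5d mark", pivots, string_distance)

# Rank correlation of the two distance profiles over the same pivots.
tau = kendall_tau(a, b)
```

A value of `tau` close to 1 means that both strings order the pivots from nearest to farthest in the same way, which is the signal the rank-based comparison relies on.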

Having briefly defined the above embedding techniques, we now need to create a set of string prototypes and, according to these prototypes, create the embeddings into an N-dimensional space.
 





## 3.1 String Clustering and Prototype Selection

***The algorithm takes as input:***
 - __𝑆__: the vector of input strings,
 - __𝑘__: the maximum number of clusters to be generated,
 - __d__: the maximum allowable distance of a string to join a cluster.

In the first phase, it produces two arrays:
 - an array variable __𝐶__ that contains the assignment of individual strings to cluster identities, and 
 - the 2D array variable __r__ that maintains the assignment of representatives to clusters. 

In the second phase, it produces the assignment of prototypes to clusters in the array variable __Prototype__.

***Returns:***
- __Prototype__: array
- __𝐶__: array
- __r__: 2D array

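In [None]:
def CLUSTERING_PROTOTYPES(S, k, d, r, C):

    i = 0
    while i < len(S):      # String-clustering phase: visit every input string
        j = 0              # restart the cluster scan for each string
        while j < k:       # consider each of the at most k clusters
            if j < len(r) and r[j]:
                pass       # TODO: handle cluster j, which already has representatives
            j += 1
        i += 1
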

In [None]:
input_strings = ["abcd","efgh","h","i"]
max_number_of_clusters = 2
max_distance = 1
C = []
r = []
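
One possible reading of the two phases described above is sketched below. It is not the thesis algorithm itself: the greedy assignment rule, the medoid-style prototype choice, the `-1` label for strings that fit no cluster, the extra `dist` parameter, and the `edit_distance` helper are all assumptions made for illustration.

```python
from functools import lru_cache


def edit_distance(a, b):
    """Levenshtein distance (assumed string distance for this sketch)."""
    @lru_cache(maxsize=None)
    def go(i, j):
        if i == 0 or j == 0:
            return i + j
        return min(go(i - 1, j) + 1,                            # deletion
                   go(i, j - 1) + 1,                            # insertion
                   go(i - 1, j - 1) + (a[i - 1] != b[j - 1]))   # substitution
    return go(len(a), len(b))


def clustering_prototypes(S, k, d, dist):
    """Greedy two-phase sketch returning (Prototype, C, r).

    Phase 1: a string joins the first cluster whose first representative is
    within distance d; otherwise it opens a new cluster while fewer than k exist.
    Phase 2: each cluster's prototype is the member with the smallest total
    distance to the other members (a medoid).
    """
    C = [-1] * len(S)   # C[i]: cluster id of S[i], or -1 if it fits no cluster
    r = []              # r[j]: list of representatives of cluster j
    for i, s in enumerate(S):
        for j in range(len(r)):
            if dist(s, r[j][0]) <= d:
                r[j].append(s)
                C[i] = j
                break
        else:  # no existing cluster accepted s
            if len(r) < k:
                r.append([s])
                C[i] = len(r) - 1
    prototype = [min(members, key=lambda p: sum(dist(p, q) for q in members))
                 for members in r]
    return prototype, C, r


# Hypothetical input: two spellings each of two cameras.
S = ["canon eos 5d", "canon eos5d", "nikon d750", "nikon d-750"]
prototypes, C, r = clustering_prototypes(S, k=2, d=2, dist=edit_distance)
```

Comparing a string only against the first representative of each cluster keeps the first phase at no more than `k` distance computations per string; comparing against every representative would be stricter but slower.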



## 3.2 The Vantage Space Embedding and the Chorus of Prototypes Transform Similarity Coefficient


## 3.3 A Top-k List Approach for Similarity Searching in the Vantage Space

## 3.4 Hashing of Partially Ranked Data for Efficient Similarity Search

# Evaluation

# References

[1] Robert P.W. Duin and Elzbieta Pekalska. [The dissimilarity representation for pattern recognition, a tutorial](http://homepage.tudelft.nl/a9p19/presentations/DisRep_Tutorial_doc.pdf). Delft University of Technology, The Netherlands; School of Computer Science, University of Manchester, United Kingdom.