<p align="center">
 <img src="http://www.di.uoa.gr/themes/corporate_lite/logo_en.png" title="Department of Informatics and Telecommunications - University of Athens"/> </p>

<br>

---

<h3 align="center" > 
  Bachelor Thesis
</h3>

<h1 align="center" > 
  Entity Resolution in Dissimilarity Spaces
</h1>

---

<h3 align="center"> 
 <b>Konstantinos Nikoletos</b>
</h3>

<h4 align="center"> 
 <b>Supervisor: Dr. Alex Delis</b>,  Professor NKUA
</h4>
<br>
<h4 align="center"> 
Athens
</h4>
<h4 align="center"> 
January-September 2021
</h4>

---


# Contents
---
**1. [Abstract](#Abstract)** \
**2. [Introduction](#Introduction)**  \
&nbsp;&nbsp;&nbsp;**2.1. [   Entity resolution](#Entity-resolution)** \
&nbsp;&nbsp;&nbsp;**2.2. [   Dissimilarity space](#Dissimilatiry-Space)** \
**3. [ A dissimilarity-based space embedding methodology](#A-dissimilarity-based-space-embedding-methodology)** \
&nbsp;&nbsp;&nbsp;**3.1 [String Clustering and Prototype Selection](#3.1-String-Clustering-and-Prototype-Selection)** \
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;**3.1.1. [Edit distance metric](#Edit-distance-metric)** \
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;**3.1.2. [String clustering algorithm](#String-clustering-algorithm)** \
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;**3.1.3. [Algorithm complexity](#Algorithm-complexity)** \
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;**3.1.4. [Prototype selection](#Prototype-selection)** \
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;**3.1.5. [Algorithm-1: The String Clustering and Prototype Selection Algorithm](#Algorithm-1:-The-String-Clustering-and-Prototype-Selection-Algorithm)** \
&nbsp;&nbsp;&nbsp;**3.2 [The Vantage Space Embedding and the Chorus of Prototypes Transform Similarity Coefficient](#3.2-The-Vantage-Space-Embedding-and-the-Chorus-of-Prototypes-Transform-Similarity-Coefficient)** \
&nbsp;&nbsp;&nbsp;**3.3 [A Top-k List Approach for Similarity Searching in the Vantage Space](#3.3-A-Top-k-List-Approach-for-Similarity-Searching-in-the-Vantage-Space)** \
&nbsp;&nbsp;&nbsp;**3.4 [Hashing of Partially Ranked Data for Efficient Similarity Search](#3.4-Hashing-of-Partially-Ranked-Data-for-Efficient-Similarity-Search)** \
**4. [ Evaluation](#Evaluation)** \
**5. [References](#References)** 


# Abstract

---

This notebook presents a dissimilarity-based entity resolution framework that introduces a new, efficient object representation scheme. The framework consists of four parts. The first part is string clustering and prototype selection, in which the clusters that will later be used for the string embedding are formed. The second part is the embedding of the strings into an N-dimensional Vantage space generated by the prototype selection. The third part presents a distance measure that relies on the Kendall tau correlation coefficient and generalizes the similarity measures and distances presented so far. Finally, in the fourth part, a sparse embedding scheme on this metric is added in order to minimize the computational cost of the methodology. 

The system will be evaluated on three databases. Its performance will be compared with other well-known entity resolution systems in terms of Recall, Precision, and computation time. 




# Introduction

---


 ### Entity resolution
 
__Entity resolution (ER)__, also known as deduplication, is a research theme that has recently received growing interest. ER is the process of systematically linking disparate data records that represent the same real-world entity in the absence of a join key. For example, in a previous project of mine, the dataset contained camera records from multiple websites (Amazon, AliBaba, etc.) and the task was to find which of these records refer to the same real object. Records may have slightly different names, somewhat different descriptions, similar prices, and totally different unique identifiers. This may not sound like a big deal, but given the volume of some datasets and databases, it becomes clear how challenging the task is in terms of both accuracy and computational cost. ER applications are now used for multiple purposes: not only for avoiding duplicates in databases, but also for tasks such as finding "similar" social media or email accounts that are connected to criminal activity.

The goal of this project is to build an entity resolution system that performs well in Precision, Recall and execution time. In this work we embrace an embedding approach by selecting a number of pivot objects to act as prototypes for transforming a dissimilarity space of proximities into a reduced set of distances of objects from these prototypes. It is therefore important to make clear what a dissimilarity space is. The definition comes from the fields of Statistics and theoretical Machine Learning. 

### Dissimilatiry Space

Dissimilarities [[1]](#References) have been used in pattern recognition for a long time. In the dissimilarity space approach, the dissimilarity matrix is considered as a set of row vectors, one for every object. They represent the objects in a vector space constructed by the dissimilarities to the other objects. Usually, this vector space is treated as a Euclidean space and equipped with the standard inner product definition. Let $ X = \{ x_1, \ldots, x_n \} $ be a training set. Given a dissimilarity function and/or dissimilarity data, we define a data-dependent mapping $ D(\cdot, R) : X \rightarrow \mathbb{R}^k $ from $ X $ to the so-called __dissimilarity space (DS)__. The $k$-element set $R$ consists of
objects that are representative for the problem. This set is called the representation or __prototype set__ and it may be a subset of $X$. In the dissimilarity space each dimension $ D(\cdot, p_i) $ describes a dissimilarity to a prototype $ p_i $ from $R$. In this work, we initially choose $ R := X $. As a result, every object is described by an n-dimensional dissimilarity vector $ D(x, X) = [d(x, x_1) \ldots d(x, x_n)]^T $. The resulting vector space is endowed with the traditional inner product and the Euclidean metric.
Any dissimilarity measure ρ can be defined in the DS. 
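
As a minimal sketch of the mapping $D(x, R)$, the snippet below builds the dissimilarity vectors of a small training set with the initial choice $R := X$. The function name `dissimilarity_vector`, the toy 2D points and the use of the Euclidean distance `math.dist` as the dissimilarity $d$ are assumptions made for illustration only; the framework itself works on strings.

```python
import math

def dissimilarity_vector(x, prototypes, d):
    """Return D(x, R) = [d(x, p_1), ..., d(x, p_k)] for the prototype set R."""
    return [d(x, p) for p in prototypes]

# Toy training set (hypothetical data) and the initial choice R := X
X = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
R = X

# Each object becomes an n-dimensional vector of dissimilarities to the prototypes
D = [dissimilarity_vector(x, R, math.dist) for x in X]
for x, row in zip(X, D):
    print(x, row)
```

With $R := X$ the rows of `D` form the full dissimilarity matrix, so its diagonal is zero and, for a symmetric $d$, the matrix is symmetric.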




___This project builds on four pillars:___

1. Object partitioning and embedding. More specifically, the embedding techniques mainly used are the __Vantage Embedding__ and the __Chorus of Prototypes scheme__.
2. Machine Learning techniques that build __nearest neighbors classification__ models on the selection of prototypes.
3. Various correlation coefficients and distance metrics that are applied on ranked data, as well as a generalization of the well-known __Hausdorff distance metric for partially ranked data__.
4. __Locality Sensitive Hashing (LSH)__ techniques specifically tuned for handling ranking data to render the similarity search process very efficient.





# A dissimilarity-based space embedding methodology
---

The central theme of this methodology is the transformation of the input data into a representation that can easily and accurately circumvent the inherent lack of features of objects and handle a variety of different data types in a unified way. 


The first step is to read and process the input data (strings in this particular work). Section 3.1 presents the idea and the algorithm of string clustering used to create the embeddings. It is also important to define the Vantage Space and the Chorus of Prototypes scheme, two approaches that are used to efficiently and effectively capture the similarity of high-dimensional data. 


Given these embedding techniques, we need to create a set of string prototypes and, according to these prototypes, create the embeddings into an N-dimensional space. 


## 3.1 String Clustering and Prototype Selection

The first step in this methodology is to cluster the input strings in order to identify the cluster representatives that will be used as prototypes in the embedding process. It is very important to mention that every cluster needs two representative strings, which are selected by the clustering algorithm.

__Why do we need 2 representatives?__

This way we can compute the inner product space where one string serves as the origin and the other as the endpoint of a vector.

After the formation of the clusters, the prototype selection phase follows, in which one of the members in each cluster (not necessarily one of the two cluster representatives), will become its unique prototype and will be used
as the pivot object for the embedding.

Note that based on the assumed dissimilarity representation of the input objects, the considered distance metric for the input strings in this work is the __Edit Distance metric__.


### Edit distance metric
Edit distance is a way of quantifying how dissimilar two strings (e.g., words) are to one another by counting the minimum number of operations required to transform one string into the other.

There is already an implementation of this metric in the `editdistance` library.

Installing the library:

In [17]:
!pip install editdistance



In [23]:
import editdistance

def EditDistance(str1,str2,verbose=None):
    # Treat a missing string as the empty string so that eval() never receives None
    str1 = str1 if str1 is not None else ''
    str2 = str2 if str2 is not None else ''
    distance = editdistance.eval(str1,str2)

    if verbose:
        print(str1+" | "+str2+" : "+str(distance))
    
    return distance


print("Edit distance: "+str(editdistance.eval('banana', 'bahama')))

Edit distance: 2


Alternatively, we can implement it ourselves:

In [3]:
# https://www.geeksforgeeks.org/edit-distance-dp-5/
# A naive recursive Python program to find the minimum number of
# operations to convert str1 to str2.
# It is named EditDistanceRecursive so that it does not shadow the
# two-argument EditDistance used by the clustering algorithm.
 
def EditDistanceRecursive(str1, str2, m, n):
 
    # If first string is empty, the only option is to
    # insert all characters of second string into first
    if m == 0:
        return n

    # If second string is empty, the only option is to
    # remove all characters of first string
    if n == 0:
        return m

    # If last characters of two strings are same, nothing
    # much to do. Ignore last characters and get count for
    # remaining strings.
    if str1[m-1] == str2[n-1]:
        return EditDistanceRecursive(str1, str2, m-1, n-1)

    # If last characters are not same, consider all three
    # operations on last character of first string, recursively
    # compute minimum cost for all three operations and take
    # minimum of three values.
    return 1 + min(EditDistanceRecursive(str1, str2, m, n-1),    # Insert
               EditDistanceRecursive(str1, str2, m-1, n),    # Remove
               EditDistanceRecursive(str1, str2, m-1, n-1))    # Replace
                
str1 = "banana"
str2 = "bahama"
print("Edit distance: "+str(EditDistanceRecursive(str1, str2, len(str1), len(str2))))

Edit distance: 2
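
The naive recursion solves the same subproblems repeatedly, so its running time grows exponentially with the string lengths. A possible remedy, sketched below, is to cache the result of every `(m, n)` subproblem, which brings the work down to one evaluation per pair of prefix lengths. The name `edit_distance_memo` is hypothetical, and the sketch is meant for short strings only, since very long inputs could hit Python's recursion limit.

```python
from functools import lru_cache

def edit_distance_memo(str1, str2):
    """Edit distance with the same recurrence as the naive version, plus caching."""

    @lru_cache(maxsize=None)
    def solve(m, n):
        # One string exhausted: insert or remove the remaining characters
        if m == 0:
            return n
        if n == 0:
            return m
        # Last characters match: nothing to pay for them
        if str1[m - 1] == str2[n - 1]:
            return solve(m - 1, n - 1)
        # Otherwise try insert, remove and replace, and keep the cheapest
        return 1 + min(solve(m, n - 1), solve(m - 1, n), solve(m - 1, n - 1))

    return solve(len(str1), len(str2))

print("Edit distance: " + str(edit_distance_memo("banana", "bahama")))
```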


### String clustering algorithm

The string clustering algorithm produces as its output two vectors that contain, for each discovered cluster, its two representatives, as well as the assignment of individual strings to the closest cluster.

***Functionality:***
The string clustering algorithm iterates through the list of the input strings and, for each input string, loops over the list of the currently discovered clusters. The algorithm starts by making the first string the first representative of the first discovered cluster. The second string is checked against the first representative of the first cluster, and if the distance between these two strings is less than the distance threshold 𝑑 given as input to the algorithm, the second string joins the first cluster and becomes its second representative. If the distance between these two strings is greater
than 𝑑, the algorithm moves on to the next cluster. Given that only one cluster has been created so far, and assuming that the distance between the second string and the first representative is greater than the threshold, the second string is automatically allocated to the second cluster (which is empty at this moment) and becomes its first representative.


### Cluster  membership condition

When a new string is checked for membership to a cluster, the condition that this incoming string must satisfy in order to join the cluster is that the sum of its distances from the two representatives must be less than the distance threshold.

$$
\textbf{String_Sum_Of_Distances < Distance_Threshold}
$$

By comparing the newly arrived string against the two cluster representatives and by ensuring that the cluster
membership condition is satisfied, we can provide distance guarantees for all the pairs of strings that will eventually join the same cluster, meaning that no string in the cluster is more than 𝑑 distance away from any other string in the same cluster.

![](img/fig_1.png)

$$
\textbf{Figure 1: Properties of distances of strings from cluster representatives}
$$


#### Triangle inequality

Figure 1 visualizes the membership condition that must be fulfilled in each cluster. 
In this graph:

- Nodes __A,B,C,D,E,F__ represent the strings
- Edges represent the Edit distances among the strings

__Figure Scenario:__

1. String __A__ joins the cluster first
2. String __B__ joins the cluster second $\Rightarrow$ __A__,__B__ become the 2 representatives
3. The distance between __A__ and __B__ must be less than d $\Rightarrow$ $\textbf{distance(A,B) < d}$
4. String __C__ is checked for entering the cluster. Condition checked: $\sum_{n \in \{A,B\}}\textbf{distance(C,n)} < d$. If the condition holds, string __C__ enters the cluster.
5. Assume that strings __D__ and __E__ are checked for membership to the cluster in a similar fashion, by asking for each of them whether the sum of its distances from the two representatives is less than or equal to the distance threshold. Assume also that these conditions are fulfilled and that both strings join the cluster. It can be proved from the triangle inequality that the distances of the newly joined __C__,__D__,__E__ from each other do not exceed d.

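__Proof sketch.__ Let $X$ and $Y$ be any two strings that satisfy the membership condition with respect to the representatives __A__ and __B__. By the triangle inequality and the symmetry of the distance,

\begin{align*}
    d(X,Y) & \le d(X,A) + d(Y,A)   \\
    d(X,Y) & \le d(X,B) + d(Y,B)
\end{align*}

Adding the two inequalities gives $2 \cdot d(X,Y) \le [d(X,A) + d(X,B)] + [d(Y,A) + d(Y,B)] \le 2d$, and therefore $d(X,Y) \le d$. The same bound holds between a member and a representative, because $d(X,A) \le d(X,A) + d(X,B) \le d$.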
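
The membership test and the pairwise guarantee it implies can be checked on a toy cluster. The sketch below is an illustration only: the names `hamming` and `joins_cluster`, the toy strings and the threshold are assumptions, the test uses `<=` as in the code of Algorithm 1, and the Hamming distance stands in for the edit distance simply to keep the example short (both are metrics, so the triangle inequality applies to either).

```python
from itertools import combinations

def hamming(a, b):
    """Hamming distance between equal-length strings (stand-in metric)."""
    return sum(ca != cb for ca, cb in zip(a, b))

def joins_cluster(s, rep_a, rep_b, threshold, dist=hamming):
    """Membership test: sum of distances to both representatives within the threshold."""
    return dist(s, rep_a) + dist(s, rep_b) <= threshold

rep_a, rep_b, d = "aaaaaa", "aaaaab", 3          # two representatives and threshold
candidates = ["aaaabb", "baaaab", "bbaaaa", "abaaab"]

members = [rep_a, rep_b] + [s for s in candidates if joins_cluster(s, rep_a, rep_b, d)]

# Largest distance between any two members of the cluster
max_pairwise = max(hamming(x, y) for x, y in combinations(members, 2))
print(members, max_pairwise)
```

Only the two representatives are consulted when a string arrives, yet the largest pairwise distance inside the cluster stays within the threshold.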


### Algorithm complexity
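For $n$ strings and $k$ clusters, each string is compared with at most two representatives per cluster, so the clustering phase requires $\textit{O(nk)}$ distance computations.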


### Prototype selection
From among the strings that were allocated to a cluster by the string clustering algorithm, we need to specify one string that will play the role of the cluster prototype. Keeping only two representatives per cluster benefits the complexity of the clustering algorithm, since each incoming string is compared with the 2 representatives and not with the whole cluster. Even though either of the two representatives derived during the clustering process could assume the cluster prototype role, there are other choices that are more precise and can still be computed efficiently. To accomplish this, we rely on the notion of the set median string, which has appeared as a concept in a number of earlier studies. 



#### Median string
The set median is the string 𝑚 that belongs to a set of strings 𝑆 and satisfies the following property
$$
m = \textit{argmin}_{y\in S} \sum_{x \in S} d(x,y)
$$

Briefly, this condition means that it is the string in S whose sum of distances from all the other strings in S is minimum.
This procedure is computationally expensive, as it requires traversing the set of strings multiple times. For this reason, this framework uses another way of finding the cluster prototypes. 
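
For reference, a direct computation of the set median can be sketched as follows. The names `levenshtein` and `set_median` and the toy strings are assumptions for illustration; the helper is a standard two-row dynamic-programming edit distance. The quadratic number of distance evaluations is what makes this approach costly for large clusters.

```python
def levenshtein(a, b):
    """Edit distance between strings a and b (two-row dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]

def set_median(strings, dist=levenshtein):
    """Return the member whose summed distance to all members is smallest."""
    return min(strings, key=lambda y: sum(dist(x, y) for x in strings))

cluster = ["banana", "bahama", "bandana", "cabana"]   # hypothetical cluster
print("Set median:", set_median(cluster))
```

When several strings share the minimal sum, `min` returns the first of them in list order.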


![](img/fig_2.png)

$$
\textbf{Figure 2: Projections of distances of strings from cluster representative A}
$$

Figure 2 illustrates the idea of the string prototype selection for a random cluster of strings. An efficient algorithm for selecting prototypes is demonstrated in the following steps:

1. Calculate the distances of the projections of the strings from the leftmost representative string __A__ onto the line that passes through points __A__ and __B__. An example is the computation of distance $d_{CA}$ 
    - __A__ and __B__ already lie on this line, which means that we do not have to do anything about these two strings.
    - Distance $d_{CA}$, where $C'$ denotes the projection of __C__ onto the line:
        \begin{align*}
            CA^2 &= CC'^2 + d_{CA}^2 \Leftrightarrow   &&  \textit{(Pythagorean Theorem - PT in C'CA triangle)} \\
            CA^2 &= CB^2 - (d_{CA}+AB)^2 + d_{CA}^2 \Leftrightarrow   &&  (\textit{C'CB triangle: } CC'^2 + C'B^2 = CB^2)   \\
            CA^2 &= CB^2 - d_{CA}^2 - AB^2 - 2 \cdot AB \cdot  d_{CA} + d_{CA}^2 \Leftrightarrow && (\textit{Algebraic identity})\\
            CA^2 &= CB^2 - AB^2 - 2 \cdot AB \cdot  d_{CA} \Leftrightarrow && \\
            2 \cdot AB \cdot  d_{CA} &= CB^2 - AB^2 - CA^2 \Leftrightarrow &&  \\  
            d_{CA} &= \frac{CB^2 - AB^2 - CA^2}{2 \cdot AB} &&  \\  
        \end{align*}
        The distances __AB__,__CA__,__CB__ are already known from the clustering phase, so there is no need to compute them again. We can therefore use the formula directly to compute all the required distances.
2. __Sorting phase__:   After we have estimated all the projected distances, we sort them algebraically from the smallest to the largest. This implementation uses the quicksort algorithm from the numpy library.
3. We choose the string with the __median distance__ among those and set it as the prototype of this cluster. 
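
The three steps can be sketched as below, using the relation $d_{CA} = (CB^2 - AB^2 - CA^2) / (2 \cdot AB)$ that follows from the Pythagorean theorem. This is an illustration under stated assumptions, not the notebook's final implementation: the names `projection_offset` and `select_prototype` are hypothetical, the distances to __A__ and __B__ are assumed to be stored from the clustering phase, __A__ and __B__ are assumed to be distinct ($AB > 0$), and Python's built-in `sorted` is used in place of `np.sort(..., kind='quicksort')` to keep the snippet dependency-free.

```python
def projection_offset(ca, cb, ab):
    """Signed offset d_CA of the projection of C from A along the line AB.

    With this sign convention A maps to 0 and B maps to -AB."""
    return (cb ** 2 - ab ** 2 - ca ** 2) / (2.0 * ab)

def select_prototype(members, dist_to_a, dist_to_b, ab):
    """Pick the member whose projected offset is the median of the cluster."""
    offsets = [projection_offset(ca, cb, ab)
               for ca, cb in zip(dist_to_a, dist_to_b)]
    # Sorting phase: order the members algebraically by projected offset
    order = sorted(range(len(members)), key=offsets.__getitem__)
    # Median phase: the middle element of the sorted order becomes the prototype
    return members[order[len(order) // 2]]

# Hypothetical cluster: A and B are the representatives, AB = 4
members   = ["A", "B", "C", "D", "E"]
dist_to_a = [0, 4, 1, 2, 3]
dist_to_b = [4, 0, 3, 2, 1]
print(select_prototype(members, dist_to_a, dist_to_b, ab=4))
```

Because edit distances do not live in a Euclidean plane, the offsets are approximations; they are used only to order the members, not as exact coordinates.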



### ***Algorithm 1:*** The String Clustering and Prototype Selection Algorithm


***The algorithm takes as input:***
 - __𝑆__: a vector with the input strings, 
 - __𝑘__: the maximum number of clusters to be generated, 
 - __d__: the maximum allowable distance of a string to join a cluster

In the first phase it produces two arrays:
 - an array variable __𝐶__ that contains the assignment of individual strings to cluster identities, and 
 - the 2D array variable __r__ that maintains the assignment of representatives to clusters. 

In the second phase, it produces the assignment of prototypes to clusters in the array variable __Prototype__.

***Returns:***
- __Prototype__: Assignment of prototypes to clusters as array

In [25]:
'''
CLUSTERING_PROTOTYPES(S,k,d,r,C) 
The String Clustering and Prototype Selection Algorithm
is the main clustering method. It takes as input the initial strings S, 
the maximum number of clusters to be generated k and
the maximum allowable distance d of a string to join a cluster.
It fills the cluster assignments C and the representatives r in place
and returns the prototype of each cluster in the array Prototype.
'''
def CLUSTERING_PROTOTYPES(S,k,d,r,C):
    
    # ----------------- String-clustering phase ----------------- #
    for i in range(S.size):      # for all strings
        C[i] = -1                # -1 marks a string that joined no cluster
        for j in range(k):       # for all clusters, restarting from the first one
            if r[0][j] is None:      # case empty first representative for cluster j
                r[0][j] = S[i]   # init cluster representative with string i
                C[i] = j         # store in C that i-string belongs to cluster j
                break
            elif r[1][j] is None:    # case empty second representative
                if EditDistance(S[i],r[0][j]) <= d:   # i-th string within d of representative 1
                    r[1][j] = S[i]
                    C[i] = j
                    break
            elif (EditDistance(S[i],r[0][j]) + EditDistance(S[i],r[1][j])) <= d:
                C[i] = j         # membership condition on the sum of distances holds
                break
    
    # ----------------- Prototype selection phase ----------------- #
        
    Prototype = np.empty([k],dtype=object)
    
    for j in range(k):
        Prototype[j] = r[0][j]  # First representative becomes the Prototype (None for unused clusters)
#         Projections[j] = Approximated_Projection_Distances_ofCluster()
#         sortedProjections[j] = np.sort(np.array(Projections[j]),kind = 'quicksort' ) 
#         Prototype[j] = median(sortedProjections[j])
    
    return Prototype

def Approximated_Projection_Distances_ofCluster():
    # Placeholder: the projection distances are not implemented yet
    raise NotImplementedError

def median(distances):
    # Placeholder: the median selection is not implemented yet
    raise NotImplementedError

In [5]:
import numpy as np

In [26]:
input_strings = ["abcd","efgh","h","i","1","2","4","jshsshsjaleebsj"]
max_number_of_clusters = 3
k = max_number_of_clusters
d = 5
S = np.array(input_strings,dtype=object)
C = np.empty([S.size], dtype=int)
r = np.empty([2,k],dtype=object)

# print(r)
# print(str(None))
Prototypes = CLUSTERING_PROTOTYPES(S,k,d,r,C)
# print(C)
# print(r)
# print(S)
# print(Prototypes)

## 3.2 The Vantage Space Embedding and the Chorus of Prototypes Transform Similarity Coefficient

The second step in the methodology is to embed the input strings into the 𝑁-dimensional vantage space which has been generated by the prototype selection algorithm and to demonstrate the significance of using a correlation metric proposed in the context of the Chorus Transform for estimating the similarity of the embedded pairs of strings in the Vantage space.


### Vantage Space

The *Vantage Objects* approach maps pairwise distances of input objects into an 𝑁-dimensional space of pivot objects which is known as __Vantage Space__ in such a way that points that lie close to each other in this space correspond to similar objects in the original dissimilarity space. The __Chorus of Prototypes Transform (CT)__ on the other hand proposes the use of a rank correlation coefficient defined on the data induced by the distances of the input objects from the pivot objects.


__Note:__

- __k__ denotes the max number of expected clusters but
- __N__ is the actual number of produced clusters

#### Definition of a Finite metric space (S,d)

A metric space is an ordered pair __(S,d)__ where:
- __S__ is a set (our object set - strings) and 
- __d__ is a metric on S, i.e., a function $ d:S \times S \rightarrow \mathbb{R} $

such that for any $o_1,o_2,o_3 \in S$, the following holds:

\begin{align*}
    d(o_1,o_2) & \ge 0,   d(o_1,o_2) = 0 \textit{ iff } o_1=o_2   &&  \textit{(non-negativity)} \\   
    d(o_1,o_2) &= d(o_2,o_1)   &&  \textit{(symmetry)} \\
    d(o_1,o_3) & \le d(o_1,o_2) + d(o_2,o_3)   &&  \textit{(triangle inequality)}
\end{align*}


#### Vantage embedding

This work uses a method called vantage embedding, which utilizes a set of 𝑁 objects from S such that
\begin{align*}
    A^* &= \{ A^*_1,A^*_2,...,A^*_N \}    &&   \textbf{Vantage Objects}
\end{align*}

Vantage objects generate $N$ __different one dimensional mappings__ which are used collectively to generate the embedding for each string.


The distance $x_j$ from each object $A_i \in S$ to every ___Vantage Object___ $A_j^* \in A^*$ 
$$
x_j = d(A_i,A_j^*)
$$

is computed, creating in this way an __N-dimensional__ point in the Vantage Space for the object $A_i$:
$$
p_i = \{ x_1,x_2,...,x_N \}
$$

The mapping of an object $A_i$ into the Vantage Space is denoted by $F(A_i)$. It is now useful to define a distance metric $δ$ for this space with the following formula:
$$
    δ(A_i,A_j) = \max_{A^*_v \in A^*} | d(A_i,A^*_v) - d(A_j,A^*_v) | 
$$
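
A small sketch of the embedding $F$ and of the metric $δ$ follows. The names `embed` and `delta`, the toy strings and the vantage objects are assumptions for illustration, and the Hamming distance is used as a stand-in for the edit distance to keep the example compact.

```python
from itertools import combinations

def hamming(a, b):
    """Hamming distance between equal-length strings (stand-in metric d)."""
    return sum(ca != cb for ca, cb in zip(a, b))

def embed(obj, vantage_objects, d=hamming):
    """F(A_i): the vector of distances from obj to every vantage object."""
    return [d(obj, v) for v in vantage_objects]

def delta(p_i, p_j):
    """Largest coordinate-wise gap between two points of the Vantage Space."""
    return max(abs(x - y) for x, y in zip(p_i, p_j))

vantage = ["aaaa", "bbbb"]                       # hypothetical vantage objects
strings = ["aaab", "aabb", "abbb"]
points = {s: embed(s, vantage) for s in strings}

for s, t in combinations(strings, 2):
    print(s, t, "delta =", delta(points[s], points[t]), "d =", hamming(s, t))
```

By the triangle inequality, $|d(A_i,A^*_v) - d(A_j,A^*_v)| \le d(A_i,A_j)$ for every vantage object, so $δ$ never exceeds the original distance and can serve as a cheap lower bound during similarity search.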

#### Kendall’s tau rank coefficient definition

Let $(x_1,y_1), ... , (x_n,y_n)$ be a set of observations of the joint random variables X and Y, such that all the values of $(x_i)$ and $(y_i)$ are unique. Any pair of observations $(x_i,y_i)$ and $(x_j,y_j)$, where $i<j$, is said to be __concordant__ if the sort order of $(x_i,x_j)$ and $(y_i,y_j)$ agrees: that is, if either both $x_i>x_j$ and $y_i>y_j$ hold or both $x_i<x_j$ and $y_i<y_j$ hold. 

Otherwise they are said to be __discordant__.

The Kendall τ coefficient is defined as:
\begin{align*}
    τ &= \frac{\textbf{(number_of_concordant_pairs)} - \textbf{(number_of_discordant_pairs)} }{ {n\choose 2} }  && \\
\end{align*}

![](https://upload.wikimedia.org/wikipedia/commons/thumb/0/0e/Concordant_Points_Kendall_Correlation.svg/330px-Concordant_Points_Kendall_Correlation.svg.png)
$$
\textbf{Figure 3: Example of concordant and discordant points}
$$

In this example all points in the gray area are concordant and all points in the white area are discordant with respect to point $(X_{1},Y_{1})$. With $n=30$ points, there are a total of ${30 \choose 2} = 435$ possible point pairs. In this example there are 395 concordant point pairs and 40 discordant point pairs, leading to a Kendall rank correlation coefficient of 0.816.

Linking this definition to our work:
\begin{align*}
    τ &= \frac{|C| - |D|}{ {n \choose 2} }  && \\
\end{align*}

where 
- $|C|$: concordant pairs of components
- $|D|$: discordant pairs of components
- $  {n\choose 2} = \frac{n(n-1)}{2} $

An equivalent way to compute Kendall's tau correlation coefficient for two vectors $s_1$ and $s_2$ is to sum, over all pairs $i<j$, the product of the signs of the differences between their $i$ and $j$ components. This method is shown in the following equation:

$$
τ = \frac{2}{n(n-1)}\sum_{i<j} \textit{sign}(s_1[i] - s_1[j]) \cdot \textit{sign}(s_2[i] - s_2[j]) 
$$
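
The sign-product formulation translates almost literally into code: for every pair $i<j$ the product $\textit{sign}(s_1[i]-s_1[j]) \cdot \textit{sign}(s_2[i]-s_2[j])$ is $+1$ for a concordant pair and $-1$ for a discordant one. The sketch below uses the hypothetical names `sign` and `kendall_tau` and assumes vectors without ties.

```python
from itertools import combinations

def sign(v):
    """Return -1, 0 or 1 according to the sign of v."""
    return (v > 0) - (v < 0)

def kendall_tau(s1, s2):
    """Kendall's tau as the normalized sum of sign products over all pairs i < j."""
    n = len(s1)
    total = sum(sign(s1[i] - s1[j]) * sign(s2[i] - s2[j])
                for i, j in combinations(range(n), 2))
    return 2.0 * total / (n * (n - 1))

print(kendall_tau([1, 2, 3, 4], [1, 2, 3, 4]))   # identical ordering
print(kendall_tau([1, 2, 3, 4], [4, 3, 2, 1]))   # reversed ordering
print(kendall_tau([1, 2, 3], [1, 3, 2]))         # one discordant pair out of three
```

When ties are present, a tied pair contributes zero to the sum while the denominator stays $n(n-1)/2$, so the value may differ from a tie-corrected variant of the coefficient.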


__SciPy implementation__

Kendall's tau is implemented in the `stats` module of the Python library __SciPy__, and that implementation will be used here. More info: [scipy.stats.kendalltau](https://web.archive.org/web/20181008171919/https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.kendalltau.html)

````
scipy.stats.kendalltau(x, y, initial_lexsort=None, nan_policy='propagate')
````

Use example:

In [28]:
from scipy import stats

x1 = [12, 2, 1, 12, 2]
x2 = [1, 4, 7, 1, 0]

tau, p_value = stats.kendalltau(x1, x2)
print(tau,p_value)

-0.4714045207910316 0.2827454599327748


The method described above must now be implemented and linked with the clustering phase of Section 3.1.


## 3.3 A Top-k List Approach for Similarity Searching in the Vantage Space



### Abstract Algebra definitions

Let us start with groups, taking a group 𝐺 as an example:
- A group 𝐺 is a set of elements based on which a compositional law is defined that obeys certain properties.
- The group 𝐺 is:
    - __closed under composition__, which is associative, and 
    - it contains an __identity element__, as well as 
    - __inverses__ of all the elements in the group.
        
Let S(A) be the set of all permutations on $𝐴 = \{1, 2, 3, · · · , 𝑁\}$ which is a group with respect to the composition of mappings. 

- __Finite group__: If a group 𝐺 has a finite number of elements 
- __|𝐺|__: the number of elements in a group
- $S_N$: The group __𝑆(𝐴)__ is known as the symmetric group of 𝑁 elements
- __Subgroup of 𝐺__: A subset 𝐻 of a group 𝐺 is called a subgroup of 𝐺 if 𝐻 forms a group with respect to the same operation that is defined on 𝐺.  If 𝐻 is a subgroup of 𝐺 and 𝑎 ∈ 𝐺 then:
    - $𝑎𝐻 = \{𝑥 ∈ 𝐺 \mid 𝑥 = 𝑎ℎ \textit{ for some }ℎ ∈ 𝐻\}$ is called a __left coset__ of 𝐻 in 𝐺 while 𝐻𝑎 is called a __right coset__ of 𝐻 in 𝐺
    - The distinct left or right cosets of a subgroup H in a group 𝐺 form a partition of G.
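
The partition property can be verified by brute force on a small symmetric group. The sketch below is an illustration under assumptions: permutations act on $\{0, \ldots, N-1\}$ instead of $\{1, \ldots, N\}$, the product $h\pi$ is read as "apply $\pi$ first, then $h$", and the names `compose` and `right_cosets` are hypothetical. The subgroup $H$ is the set of permutations that leave the first $k$ elements fixed.

```python
from itertools import permutations

def compose(p, q):
    """Return the permutation p∘q (apply q first, then p) on {0, ..., N-1}."""
    return tuple(p[q[i]] for i in range(len(q)))

def right_cosets(N, k):
    """Right cosets H*pi of H = {permutations fixing 0..k-1} in S_N."""
    group = list(permutations(range(N)))
    H = [h for h in group if all(h[i] == i for i in range(k))]
    # Equal cosets collapse into one element of the set of frozensets
    cosets = {frozenset(compose(h, pi) for h in H) for pi in group}
    return group, H, cosets

group, H, cosets = right_cosets(N=4, k=2)
print(len(group), len(H), len(cosets))
```

Every permutation lands in exactly one coset, and the number of cosets equals $|G| / |H| = N! / (N-k)!$.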

### Hausdorff metric
The Hausdorff distance, or Hausdorff metric, also called Pompeiu–Hausdorff distance, measures how far two subsets of a metric space are from each other. It turns the set of non-empty compact subsets of a metric space into a metric space in its own right. Informally, two sets are close in the Hausdorff distance if every point of either set is close to some point of the other set. The Hausdorff distance is the longest distance you can be forced to travel by an adversary who chooses a point in one of the two sets, from where you then must travel to the other set. In other words, it is the greatest of all the distances from a point in one set to the closest point in the other set.


Let $𝑆_{𝑁−𝑘}$ be the subgroup of $𝑆_𝑁$ consisting of all permutations that leave the first 𝑘 integers fixed:
$𝑆_{𝑁−𝑘} = \{𝜋 ∈ 𝑆_𝑁 \mid 𝜋(𝑖) = 𝑖 \textit{  for all } 𝑖 = 1, · · · , 𝑘\}$. A partition of the group $𝑆_𝑁$ can then be generated based on the subgroup $𝑆_{𝑁−𝑘}$ and the right cosets of this subgroup in $𝑆_𝑁$. If we consider an arbitrary group 𝐺 and any subgroup 𝐾 of this group, as well as a metric 𝑑 which is defined on 𝐺, then the metric 𝑑 induces a metric $𝑑^∗$ on the coset space 𝐺/𝐾. It is called the Hausdorff metric induced by 𝑑 and measures the distance between any two right cosets $𝐾𝜋, 𝐾𝜎 ∈ 𝐺/𝐾$ by the formula shown in the following equation.

$$
𝑑^∗(𝐾𝜋, 𝐾𝜎) = \max\{ \max_{𝛽 ∈𝐾𝜎} \min_{𝛼 ∈ 𝐾𝜋} 𝑑(𝛼, 𝛽), \max_{𝛼 ∈ 𝐾𝜋} \min_{𝛽 ∈𝐾𝜎} 𝑑(𝛼, 𝛽)\}
$$
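
For finite sets, the max-min formula can be evaluated directly. The sketch below is generic: the name `hausdorff`, the toy sets of numbers and the absolute difference used as the underlying metric $d$ are assumptions for illustration. In the setting of this section the two sets would be cosets and $d$ a distance between permutations.

```python
def hausdorff(set_a, set_b, d):
    """Hausdorff distance between two finite, non-empty sets under the metric d."""
    # Farthest any point of set_a is from its nearest point in set_b
    forward = max(min(d(a, b) for b in set_b) for a in set_a)
    # Farthest any point of set_b is from its nearest point in set_a
    backward = max(min(d(a, b) for a in set_a) for b in set_b)
    return max(forward, backward)

def abs_diff(x, y):
    """Toy metric on numbers."""
    return abs(x - y)

A = {0, 1}
B = {0, 3}
print(hausdorff(A, B, abs_diff))
```

In this example the point 3 of `B` is at distance 2 from its nearest neighbour in `A`, while every point of `A` is within distance 1 of `B`, so the two directed terms differ and the larger one determines the result.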



## 3.4 Hashing of Partially Ranked Data for Efficient Similarity Search

After having demonstrated the extension of the Kendall tau distance
metric to partially ranked data by using the induced Hausdorff
distance, we proceed to the last step of our methodology, which
is to incorporate a sparse embedding scheme on this metric. The
adopted embedding scheme transforms the feature space of the
ranked distances for its induced distance metric into integer codes
upon which the hashing of the embedded strings will take place.
This transformation has as a result that the computation of the
similar pairs of strings will become as efficient computationally as
possible, without sacrificing in return any of the recall and precision
metrics along the way.

### Winner Take All (WTA) hash
The Winner Take All (WTA) hash produces a collection of binary coded
vectors, each one of which corresponds to the position, within a
size-𝑚 subset of the original ranked data vector, where the maximum
rank value is located. More specifically, the authors propose an
embedding to be applied to the embedded string space that is
sensitive not to the rank values of the features themselves, but
rather to the relative ranking of the values of these features. 
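
The idea can be sketched as follows: each hash draws a random size-𝑚 subset of the vector positions and records where, inside that subset, the largest value sits. The function name `wta_hash`, its parameters `n_hashes` and `seed`, and the toy vector are assumptions for illustration; the codes are kept as plain integers rather than binary vectors.

```python
import random

def wta_hash(x, m, n_hashes, seed=0):
    """Return n_hashes codes; each is the position of the maximum inside a
    random size-m subset of the positions of x."""
    rng = random.Random(seed)            # same seed gives the same subsets for every vector
    codes = []
    for _ in range(n_hashes):
        idx = rng.sample(range(len(x)), m)        # random size-m subset, in random order
        window = [x[i] for i in idx]
        codes.append(max(range(m), key=window.__getitem__))
    return codes

x = [3.0, 1.0, 4.0, 1.5, 9.0, 2.6]
y = [v ** 3 + 10.0 for v in x]           # different values, same relative ranking

print(wta_hash(x, m=3, n_hashes=4))
print(wta_hash(y, m=3, n_hashes=4))
```

Because `y` preserves the ordering of `x`, both vectors receive identical codes, which is the rank-based behaviour described above.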

---
---

# Evaluation

---

# References

[1]   Robert P.W. Duin and Elzbieta Pekalska, [The dissimilarity representation for pattern recognition, a tutorial](http://homepage.tudelft.nl/a9p19/presentations/DisRep_Tutorial_doc.pdf), Delft University of Technology, The Netherlands, and School of Computer Science, University of Manchester, United Kingdom.