## Similarity Functions

#### TOKEN-BASED SIMILARITY

Token-based similarity measures compare two strings by first dividing them into a set of tokens
using a tokenization function, which we denote as tokenize(·). Intuitively, tokens correspond to
substrings of the original string. As a simple example, assume the tokenization function splits a
string into tokens based on whitespace characters. Then, the string *Sean Connery* results in the set
of tokens *{Sean,Connery}*. As we will show throughout our discussion, the main advantage of
token-based similarity measures is that the similarity is less sensitive to word swaps compared to
similarity measures that consider a string as a whole (notably edit-based measures). That is, the
comparison of *Sean Connery* and *Connery Sean* will yield a maximum similarity score because both
strings contain the exact same tokens. On the other hand, typographical errors within tokens are
penalized: for instance, the similarity of *Sean Connery* and *Shawn Conery* will be zero because the two strings share no token.
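
The whitespace tokenization described above can be sketched as follows. This is a minimal illustration, assuming `tokenize` simply splits on whitespace and returns a set; the function body is an assumption for this example, not a definition taken from the text.

```python
def tokenize(text):
    """Split a string on whitespace and return its set of tokens."""
    return set(text.split())


original = tokenize("Sean Connery")
swapped = tokenize("Connery Sean")
misspelled = tokenize("Shawn Conery")

# A word swap leaves the token set unchanged.
print(original == swapped)      # True

# Typos inside tokens leave no token in common.
print(original & misspelled)    # set()
```

Because the tokens are stored in a set, their order and any repetition are discarded, which is exactly why word swaps do not affect the comparison.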

#### JACCARD COEFFICIENT

The Jaccard coefficient is a similarity measure that, in its most general form, compares two sets P
and Q with the following formula:
$$\mathrm{Jaccard}(P,Q) = \frac{|P \cap Q|}{|P \cup Q|}$$
Essentially, the Jaccard coefficient measures the fraction of the data that is shared between P
and Q, compared to all data available in the union of these two sets.

An advantage of the Jaccard coefficient is that it is not sensitive to word swaps. Indeed, the
score of two names *John Smith* and *Smith John* would correspond to the score of exactly equal strings because the Jaccard coefficient considers only whether a token exists in a string, not at which position.
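
As a rough sketch of the formula above, the coefficient can be computed directly with Python sets. The names `jaccard` and `token_jaccard` are hypothetical helpers introduced for this example, whitespace tokenization is assumed, and treating two empty sets as identical is a convention chosen here rather than something the text prescribes.

```python
def jaccard(p, q):
    """Jaccard coefficient |P ∩ Q| / |P ∪ Q| of two collections."""
    p, q = set(p), set(q)
    union = p | q
    if not union:
        return 1.0  # assumption: two empty sets count as identical
    return len(p & q) / len(union)


def token_jaccard(s, t):
    """Jaccard coefficient of the whitespace tokens of two strings."""
    return jaccard(s.split(), t.split())


print(token_jaccard("John Smith", "Smith John"))      # 1.0
print(token_jaccard("Sean Connery", "Shawn Conery"))  # 0.0
```

The first call returns the maximum score because both names yield the same token set; the second returns zero because the misspelled tokens do not match any original token.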

#### COSINE SIMILARITY USING TOKEN FREQUENCY AND INVERSE DOCUMENT FREQUENCY

The cosine similarity is a similarity measure often used in information retrieval. In general, given two n-dimensional vectors V and W, the cosine similarity computes the cosine of the angle $\alpha$ between
these two vectors as
$$\mathrm{CosineSimilarity}(V,W) = \cos(\alpha) = \frac{V \cdot W}{\|V\| \cdot \|W\|}$$
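
The formula can be sketched in plain Python as shown below. Note the assumptions: `term_frequency_vectors` is a hypothetical helper that builds vectors from raw token counts only, so no inverse document frequency weighting is applied, and the similarity involving a zero vector is defined as 0 by convention for this example.

```python
import math
from collections import Counter


def cosine_similarity(v, w):
    """Cosine of the angle between two equal-length numeric vectors."""
    if len(v) != len(w):
        raise ValueError("vectors must have the same dimension")
    dot = sum(a * b for a, b in zip(v, w))
    norm_v = math.sqrt(sum(a * a for a in v))
    norm_w = math.sqrt(sum(b * b for b in w))
    if norm_v == 0 or norm_w == 0:
        return 0.0  # assumption: similarity with a zero vector is 0
    return dot / (norm_v * norm_w)


def term_frequency_vectors(s, t):
    """Hypothetical helper: raw token counts over the shared vocabulary."""
    counts_s, counts_t = Counter(s.split()), Counter(t.split())
    vocabulary = sorted(set(counts_s) | set(counts_t))
    v = [counts_s[token] for token in vocabulary]
    w = [counts_t[token] for token in vocabulary]
    return v, w


v, w = term_frequency_vectors("Sean Connery", "Connery Sean")
print(cosine_similarity(v, w))            # close to 1: same tokens, same counts
print(cosine_similarity([1, 0], [0, 1]))  # 0.0: orthogonal vectors
```

Replacing the raw counts with weighted values would change only how the vectors are built; the cosine computation itself stays the same.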

#### EDIT-BASED SIMILARITY

We now focus on a second family of similarity measures, the so-called edit-based similarity measures.
In contrast to token-based measures, strings are considered as a whole and are not divided into sets
of tokens. However, to account for errors, such as typographical errors, word swaps, and so on,
edit-based similarities allow different edit operations to transform one string into the other, e.g., *insertion* of characters, character *swaps*, *deletion* of characters, or *replacement* of characters.
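
One common way to turn these edit operations into a number is to count the minimum number of operations needed. The sketch below implements the classic Levenshtein distance with dynamic programming. It is an illustration under stated assumptions: the name `levenshtein_distance` is hypothetical, every operation costs 1, and only insertion, deletion, and replacement are counted, so a character swap costs two operations here rather than one.

```python
def levenshtein_distance(s, t):
    """Minimum number of insertions, deletions, and replacements
    needed to transform string s into string t."""
    # prev[j] holds the distance between the processed prefix of s
    # and the first j characters of t.
    prev = list(range(len(t) + 1))
    for i, char_s in enumerate(s, start=1):
        curr = [i]  # turning s[:i] into the empty string takes i deletions
        for j, char_t in enumerate(t, start=1):
            cost = 0 if char_s == char_t else 1
            curr.append(min(
                prev[j] + 1,         # delete char_s
                curr[j - 1] + 1,     # insert char_t
                prev[j - 1] + cost,  # replace, or keep if equal
            ))
        prev = curr
    return prev[-1]


print(levenshtein_distance("banana", "bahama"))  # 2
print(levenshtein_distance("test", "testing"))   # 3
```

Only two rows of the table are kept at any time, so memory grows with the length of `t` rather than with the product of both lengths.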

In [10]:
import re


def edits1(word):
    "All edits that are one edit away from `word`."
    letters    = 'abcdefghijklmnopqrstuvwxyz'
    splits     = [(word[:i], word[i:])    for i in range(len(word) + 1)]
    deletes    = [L + R[1:]               for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R)>1]
    replaces   = [L + c + R[1:]           for L, R in splits if R for c in letters]
    inserts    = [L + c + R               for L, R in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)



In [3]:
correction('korrectud')

'corrected'

In [9]:
import editdistance

In [10]:
editdistance.eval('banana', 'bahama')

2

In [8]:
import stringdist
stringdist.levenshtein('test', 'testing')

3

Thanks to 
https://stackoverflow.com/questions/39008069/r-and-python-in-one-jupyter-notebook

In [1]:
#%load_ext rpy2.ipython