# Sequence comparison - NUMBERS

- Cosine distance
- Euclidean distance
- Manhattan distance

The closeness (similarity, nearness) of two records (vectors) with features $(x_{1},x_{2},...,x_{n})$ and $(u_{1},u_{2},...,u_{n})$ can be measured in different ways:

| Metric | Description | Formula | Limitations |
| - | - | - | - |
| Euclidean distance | Straight-line ("as the crow flies") distance between two points in a multi-dimensional space. | $$ \sqrt{(x_{1}-u_{1})^{2} + (x_{2}-u_{2})^{2} +...+(x_{n}-u_{n})^{2}} $$ | This metric is suitable for continuous numerical features. Because the differences are squared, it is sensitive to outliers and to features measured on larger scales, so features are usually standardized first. It is also affected by the curse of dimensionality: as the number of features increases, the distances between pairs of points become more similar to each other and therefore less meaningful. If the data contains categorical or binary features, other distance metrics such as Hamming distance or Jaccard distance may be more appropriate. |
| Manhattan distance (aka city block / taxicab distance) | Sum of the absolute differences along each dimension (feature), i.e. the distance travelled when moving along one axis at a time. | $$ \|x_{1}-u_{1}\| + \|x_{2}-u_{2}\| + ... + \|x_{n}-u_{n}\| $$ | The Manhattan distance can be used with continuous numerical variables and also with discrete (for example count or ordinal) features. Because the differences are not squared, a large difference in a single feature is penalized less than with the Euclidean distance, which makes it less sensitive to outliers. It often behaves better on high-dimensional data, where it tends to be less affected by the curse of dimensionality. However, it depends on the orientation of the coordinate axes and on the scale of the features: it treats a unit change in every feature as equally important, so the features should be on comparable scales. |
| Hamming distance | "In information theory, the Hamming distance between two strings or vectors of equal length is the number of positions at which the corresponding symbols are different. In other words, it measures the minimum number of substitutions required to change one string into the other, or equivalently, the minimum number of errors that could have transformed one string into the other." | $$ \sum_{i=1}^{n} [x_{i} \neq u_{i}] $$ where the bracket equals 1 if the two symbols differ and 0 otherwise. For example, if A = 101101 and B = 100111, the strings differ in positions 3 and 5, so the Hamming distance is 2. | It is used for categorical and binary features. Both inputs must have the same length. |
| Jaccard distance | A measure of dissimilarity between two sets A and B. It is defined as the size of the symmetric difference of the sets divided by the size of their union, or equivalently as one minus the ratio of the intersection size to the union size. | $$d_{J}(A, B) = 1 - \frac{\| A \cap B \|}{ \| A \cup B \|}$$ | It is defined on sets, so it ignores the order and multiplicity of elements, and it is undefined when both sets are empty. |
| Minkowski distance | A more general distance metric for KNN, which generalizes the Euclidean and Manhattan distances. It is defined by a parameter p (p ≥ 1) that controls how strongly large coordinate differences dominate the result: the larger p is, the more the largest difference dominates. The Minkowski distance can be seen as a family of distance metrics that includes the Manhattan distance (p = 1), the Euclidean distance (p = 2), and, in the limit p → ∞, the Chebyshev distance, which is the maximum of the absolute differences between coordinates. | $$d(x, u) = \left( \sum^{n}_{i=1} \| x_{i} - u_{i} \| ^{p} \right)^{\frac{1}{p}} $$ | Like its special cases, the Minkowski distance is meant for numerical features and is sensitive to their scale. The parameter p adds flexibility but has to be tuned, and its effect differs between data sets and problems, which makes results harder to interpret. It can also be more expensive to compute than the Euclidean or Manhattan special cases. |
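
The formulas in the table can be checked with a few lines of plain Python. The following is a minimal sketch, not a reference implementation: the function names `minkowski` and `jaccard_distance` and the sample vectors are chosen here for illustration only, and returning 0 for two empty sets is an assumed convention.

```python
import math

def minkowski(x, u, p):
    """Minkowski distance of order p between two equal-length numeric vectors."""
    if len(x) != len(u):
        raise ValueError("vectors must have the same length")
    diffs = [abs(a - b) for a, b in zip(x, u)]
    if math.isinf(p):
        # Limit p -> infinity: Chebyshev distance (largest coordinate difference)
        return max(diffs, default=0)
    return sum(d ** p for d in diffs) ** (1 / p)

def jaccard_distance(a, b):
    """Jaccard distance between two collections treated as sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0  # assumed convention: two empty sets count as identical
    return 1 - len(a & b) / len(a | b)

x = [0, 0]
u = [3, 4]
print(minkowski(x, u, 1))          # Manhattan: 7.0
print(minkowski(x, u, 2))          # Euclidean: 5.0
print(minkowski(x, u, math.inf))   # Chebyshev: 4
print(jaccard_distance({1, 2, 3}, {2, 3, 4}))  # 0.5
```

The same two points give three different distances depending on p, which shows how larger values of p let the largest single difference dominate.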

Other distances:
- Mahalanobis
- Pearson
- Levenshtein
- Cosine similarity

# Sequence comparison - TEXT

> An excellent resource: https://yassineelkhal.medium.com/the-complete-guide-to-string-similarity-algorithms-1290ad07c6b7




Algorithms:
- Embeddings
- Levenshtein distance
- Hamming distance
- Smith-Waterman distance






## Edit-based algorithms

> aka distance-based algorithms

These algorithms measure the minimum number of single-character operations (insertions, deletions, or substitutions) required to transform one string into another. The more operations are needed, the greater the distance and the lower the similarity.

Used in:
- Spell checking
- Autocorrection
- DNA sequence analysis


### Hamming

- The number of positions at which two equal-length strings differ.
- Put differently: if you overlay two strings of the same length, it is the number of positions that hold different characters.

Disadvantages:
- The strings must have the same length.

In [7]:
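a = 'crook'
b = 'shook'

def hamming_distance(str1, str2):
    """Count the positions at which two equal-length strings differ."""
    if len(str1) != len(str2):
        raise ValueError('Hamming distance requires strings of equal length')
    # Each mismatching pair of characters adds 1 to the distance
    return sum(c1 != c2 for c1, c2 in zip(str1, str2))

hamming_distance(a, b)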

2

In [13]:
import textdistance as td

a, b = 'book', 'look'
print( td.hamming(a, b) )
print( td.hamming.normalized_similarity(a, b) )
# textdistance accepts strings of different lengths: it effectively pads the
# shorter one ('below' -> 'below_') to match the longer string, because the
# Hamming distance itself is only defined for equal-length strings
c, d = 'below', 'bellow'
print( td.hamming(c, d) )
print( td.hamming.normalized_similarity(c, d) )

1
0.75
3
0.5
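
The outputs above can be reproduced without the library. This is a minimal sketch that assumes the padding behaviour described in the comment (missing positions count as mismatches) and assumes the normalized similarity is one minus the distance divided by the longer length. The names `padded_hamming` and `normalized_similarity` are made up for this example and are not part of `textdistance`.

```python
from itertools import zip_longest

def padded_hamming(s1, s2):
    """Count mismatching positions; positions missing in the shorter string
    are filled with None and therefore always count as mismatches."""
    return sum(c1 != c2 for c1, c2 in zip_longest(s1, s2))

def normalized_similarity(s1, s2):
    """Similarity in [0, 1]: 1 minus the distance divided by the longer length."""
    longest = max(len(s1), len(s2))
    if longest == 0:
        return 1.0  # assumed convention: two empty strings are identical
    return 1 - padded_hamming(s1, s2) / longest

print(padded_hamming('book', 'look'))            # 1
print(normalized_similarity('book', 'look'))     # 0.75
print(padded_hamming('below', 'bellow'))         # 3
print(normalized_similarity('below', 'bellow'))  # 0.5
```

The 'below' / 'bellow' pair shows the weakness of this approach: a single inserted letter shifts every later character, so three positions are reported as different.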


### Levenshtein

The minimum number of single-character edits (insertions, deletions, or substitutions) required to change one string into the other. 

Advantages:
- Strings do not have to be the same length.

In [17]:
import textdistance as td

# one substitution ('b' -> 'l') turns 'book' into 'look', so the distance is 1 and the normalized similarity is (4-1)/4 = 75%
a, b = 'book', 'look'
print( td.levenshtein(a, b) )
print( td.levenshtein.normalized_similarity(a, b) )


1
0.75


In [19]:
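# plain Levenshtein has no swap operation, so exchanging the adjacent letters 'a' and 'c' costs two edits
print( td.levenshtein('act', 'cat') )
print( td.levenshtein.normalized_similarity('act', 'cat') )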

2
0.33333333333333337


In [18]:
# one insertion ('l') turns 'below' into 'bellow', so the distance is 1 and the normalized similarity is (6-1)/6 ≈ 83%
c, d = 'below', 'bellow'
print( td.levenshtein(c, d) )
print( td.levenshtein.normalized_similarity(c, d) )


1
0.8333333333333334
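
To see where these numbers come from, the distance can be computed with the classic dynamic-programming recurrence. The following is a minimal sketch of that textbook approach, keeping only two rows of the table. It is not the code `textdistance` runs internally, and the function name `levenshtein` is simply chosen here.

```python
def levenshtein(s1, s2):
    """Minimum number of insertions, deletions and substitutions
    needed to turn s1 into s2 (row-by-row dynamic programming)."""
    # previous[j] = distance between the empty prefix of s1 and s2[:j]
    previous = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        # current[0] = distance between s1[:i] and the empty string
        current = [i]
        for j, c2 in enumerate(s2, start=1):
            cost = 0 if c1 == c2 else 1
            current.append(min(
                previous[j] + 1,         # delete c1 from s1
                current[j - 1] + 1,      # insert c2 into s1
                previous[j - 1] + cost,  # substitute c1 by c2 (free if equal)
            ))
        previous = current
    return previous[-1]

print(levenshtein('book', 'look'))     # 1
print(levenshtein('act', 'cat'))       # 2
print(levenshtein('below', 'bellow'))  # 1
```

Each cell depends only on its left, upper, and upper-left neighbours, so the running time is proportional to the product of the two string lengths.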


### Damerau-Levenshtein distance

This algorithm is a variation of the Levenshtein distance that adds transposition (swapping two adjacent characters) as a fourth operation. The distance is the minimum number of operations (insertions, deletions, substitutions, or transpositions) required to transform one string into the other.

In [20]:
import textdistance as td

print( td.damerau_levenshtein('act', 'cat') )
print( td.damerau_levenshtein.normalized_similarity('act', 'cat') )

1
0.6666666666666667
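
The transposition step can be added to the Levenshtein recurrence with one extra check. The sketch below implements the restricted variant, often called optimal string alignment, in which no substring is edited more than once. Which variant `textdistance` uses by default is not stated in these notes, so treat this as an illustration of the idea only. The name `osa_distance` is chosen for this example.

```python
def osa_distance(s1, s2):
    """Restricted Damerau-Levenshtein (optimal string alignment) distance:
    insertions, deletions, substitutions and adjacent transpositions."""
    rows, cols = len(s1) + 1, len(s2) + 1
    # d[i][j] = distance between s1[:i] and s2[:j]
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all i characters
    for j in range(cols):
        d[0][j] = j  # insert all j characters
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if s1[i - 1] == s2[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution (free if equal)
            )
            # Transposition: the last two characters are swapped in s2
            if (i > 1 and j > 1
                    and s1[i - 1] == s2[j - 2]
                    and s1[i - 2] == s2[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]

print(osa_distance('act', 'cat'))  # 1
```

With the swap counted as a single operation, 'act' and 'cat' are at distance 1 instead of the 2 reported by plain Levenshtein.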


## Token-based algorithms

## Sequence-based algorithms