# Distance + Similarity

![](img/distance.png)

## Euclidean Distance

- Length of a segment connecting two points
- Calculated from the Cartesian coordinates of the points using the Pythagorean theorem
- $D(x,y)=\sqrt{\sum(x_i-y_i)^2}$
- Disadvantages:
    - Not scale-invariant: features measured in larger units dominate the result, so normalization is recommended
    - Works best in lower-dimensional space; susceptible to the curse of dimensionality
- Advantages:
    - Works well on low-dimensional data
    - Intuitive to use, simple to implement, has great results in many use-cases
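
A minimal sketch of the formula above in plain Python. The function name `euclidean_distance` and the sample points are illustrative assumptions, not part of any library. The second call shows how a feature in large units can dominate the result, which is why normalization is recommended.

```python
import math

def euclidean_distance(x, y):
    """Straight-line distance between two equal-length numeric sequences."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

# Classic 3-4-5 right triangle
print(euclidean_distance((0, 0), (3, 4)))  # 5.0

# Hypothetical (height in m, income in $) points: the income difference
# of 100 swamps the height difference of 0.1 unless the data is scaled.
print(euclidean_distance((1.70, 50000), (1.80, 50100)))
```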

## Manhattan Distance

- Often called Taxicab distance or City Block distance
- Calculates the distance between two vectors if movement were only allowed at right angles, without diagonal movement
- $D(x,y)=\sum|x_i-y_i|$
- Disadvantages:
    - Less intuitive than Euclidean distance
- Advantages:
    - Discrete / binary attributes
    - Works okay for high-dimensional data
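
The sum of absolute differences can be sketched as follows. The name `manhattan_distance` is a hypothetical helper chosen for illustration.

```python
def manhattan_distance(x, y):
    """Sum of absolute coordinate differences (taxicab distance)."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    return sum(abs(xi - yi) for xi, yi in zip(x, y))

# Moving 3 blocks east and 4 blocks north takes 7 blocks in total
print(manhattan_distance((0, 0), (3, 4)))  # 7
```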

## Minkowski Distance

- Generalizes the above distance formulas in n-dimensional real space
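- Distance induced by the p-norm of the difference vector $x-y$
- A norm must satisfy three requirements (the formula below does so for $p \geq 1$):
    - Zero vector - only the zero vector has a length of zero
    - Scalar factor - multiplying a vector by a positive number scales its length by that number while keeping its direction
    - Triangle inequality - the length of a sum of two vectors is at most the sum of their lengths, i.e. the shortest distance between two points is a straight line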
- $D(x,y)=(\sum|x_i-y_i|^p)^{1/p}$
- Common values of $p$ are:
    - $p=1$ - Manhattan distance
    - $p=2$ - Euclidean distance
    - $p=\infty$ - Chebyshev distance
- Disadvantages:
    - Inherits the disadvantages of whichever metric the chosen $p$ corresponds to (see the individual sections)
    - Searching for a good value of $p$ can be computationally expensive
- Advantages:
    - $p$ can be tuned, which gives flexibility to fit the distance metric to your use case
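
As a rough sketch, the general formula and its three common special cases might look like this. `minkowski_distance` is a hypothetical name, and treating `math.inf` as the Chebyshev limit is an implementation choice made here for illustration.

```python
import math

def minkowski_distance(x, y, p):
    """Minkowski distance of order p (p >= 1) between equal-length sequences."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    if p < 1:
        raise ValueError("p must be >= 1 to satisfy the triangle inequality")
    diffs = [abs(xi - yi) for xi, yi in zip(x, y)]
    if math.isinf(p):
        # Limit p -> infinity: only the largest difference survives
        return max(diffs)
    return sum(d ** p for d in diffs) ** (1 / p)

x, y = (0, 0), (3, 4)
print(minkowski_distance(x, y, 1))         # 7.0 (Manhattan)
print(minkowski_distance(x, y, 2))         # 5.0 (Euclidean)
print(minkowski_distance(x, y, math.inf))  # 4   (Chebyshev)
```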

## Chebyshev Distance

- Greatest difference between two vectors along any coordinate dimension
- Maximum distance along one axis
- Also known as chessboard distance
- $D(x,y)=\max_i(|x_i-y_i|)$
- Disadvantages:
    - Only usable in specific use-cases, so it is not a good all-purpose distance metric
- Advantages:
    - Gives the minimum number of moves needed to go from one square to another in games that allow unrestricted 8-way movement (e.g. a king in chess)
    - Warehouse logistics / manufacturing applications
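
A short sketch of the maximum-difference rule, using the chessboard interpretation mentioned above. The name `chebyshev_distance` and the `(file, rank)` encoding of squares are assumptions made for the example.

```python
def chebyshev_distance(x, y):
    """Largest absolute difference along any single coordinate."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    return max(abs(xi - yi) for xi, yi in zip(x, y))

# Squares encoded as (file, rank). A king moving diagonally covers one
# step on both axes at once, so the longer axis decides the move count.
print(chebyshev_distance((0, 0), (5, 2)))  # 5
```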

## Jaccard Similarity / Index

- Measures the similarity between two sets
- Used to compare sets of patterns
- For two sets: $J(A,B)= \frac{|A \cap B|}{|A \cup B|}$
- Similarity will be 0 if the sets don't share any values and 1 if they're identical
- Jaccard distance: $D(A,B)=1-J(A,B)$
- Disadvantages:
    - Highly influenced by the size of the data; large datasets can significantly increase the union while keeping the intersection similar
- Advantages:
    - Object detection with neural networks, where it measures how well a predicted region overlaps the ground truth
    - Text similarity analysis to see how much overlap there is between documents
- Other versions: a variant for asymmetric binary attributes, where positions that are 0 in both arrays are ignored
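
The set formulas above can be sketched with Python's built-in `set` type. The function names are hypothetical, and returning 1.0 for two empty sets is an assumed convention, since the ratio is undefined in that case.

```python
def jaccard_similarity(a, b):
    """Size of the intersection divided by the size of the union."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 1.0  # assumption: two empty sets count as identical
    return len(a & b) / len(union)

def jaccard_distance(a, b):
    """Complement of the Jaccard similarity."""
    return 1.0 - jaccard_similarity(a, b)

doc1 = {"the", "cat", "sat"}
doc2 = {"the", "cat", "ran"}
print(jaccard_similarity(doc1, doc2))  # 0.5 (2 shared words out of 4 distinct)
print(jaccard_distance(doc1, doc2))    # 0.5
```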

## Hamming Distance

- Number of values that are different between two vectors
- Distance between categorical variables
- Typically used to compare two binary strings of equal length
- Disadvantages:
    - Only defined for vectors of equal length, so it is difficult to use when lengths differ
    - Does not take the actual value into account, so it's not advised to use Hamming distance when magnitude is important
- Advantages:
    - Error correction/detection, e.g. detecting distorted bits in a binary word
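
Counting mismatched positions is a one-liner. The sketch below uses a hypothetical `hamming_distance` helper and rejects inputs of unequal length rather than guessing how to align them.

```python
def hamming_distance(x, y):
    """Number of positions at which two equal-length sequences differ."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    return sum(xi != yi for xi, yi in zip(x, y))

# Two 7-bit words that differ in the 3rd and 5th bit
print(hamming_distance("1011101", "1001001"))  # 2
```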

## Cosine Similarity

- Cosine of the angle between two (sometimes sparse) vectors
- Insensitive to vector length, so the vectors don't need to be normalized first
- $C(x,y)=\cos(\theta)=\frac{x \cdot y}{\|x\| \, \|y\|}$
- Two vectors with exactly the same orientation have a cosine similarity of 1, and two diametrically opposed to each other have a similarity of -1
- Cosine distance: $D(x,y)=1-C(x,y)$
- Vectors must be non-zero
- Disadvantages:
    - Magnitude isn't taken into account, only the direction; in practice, this means the differences in values are not fully taken into account
- Advantages:
    - High-dimensional data where the magnitude of the vectors is not important
    - Works well on word counts in text analysis, where the raw number of times a word appears may matter less than the relative mix of words
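
A plain-Python sketch of the formula, with the non-zero requirement enforced explicitly. The name `cosine_similarity` here is a hypothetical helper, not a reference to any library function.

```python
import math

def cosine_similarity(x, y):
    """Cosine of the angle between two non-zero, equal-length vectors."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    dot = sum(xi * yi for xi, yi in zip(x, y))
    norm_x = math.sqrt(sum(xi * xi for xi in x))
    norm_y = math.sqrt(sum(yi * yi for yi in y))
    if norm_x == 0 or norm_y == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_x * norm_y)

print(cosine_similarity((1, 0), (-1, 0)))  # -1.0 (opposite directions)
print(cosine_similarity((1, 0), (0, 1)))   # 0.0  (orthogonal)

# Same direction, different magnitude: the result is close to 1
# (up to floating-point rounding), since only orientation matters.
print(cosine_similarity((1, 2, 3), (2, 4, 6)))
```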

## Honorable Mentions

- Haversine Distance: distance between two points on a sphere given their longitudes and latitudes
- Sørensen-Dice Index: measures the similarity and diversity of sample sets, as the percentage of overlap between two sets
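
Neither formula is given in these notes, so the sketch below uses their standard textbook definitions. The function names, the mean Earth radius of 6371 km, and the convention of returning 1.0 for two empty sets are all assumptions made for illustration.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius

def haversine_distance(lat1, lon1, lat2, lon2, radius=EARTH_RADIUS_KM):
    """Great-circle distance between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    # Clamp to 1.0 to guard against rounding just above the asin domain
    return 2 * radius * math.asin(math.sqrt(min(1.0, a)))

def dice_index(a, b):
    """Twice the intersection size divided by the sum of both set sizes."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # assumption: two empty sets count as identical
    return 2 * len(a & b) / (len(a) + len(b))

# A quarter of the way around the equator: roughly 10007.5 km
print(haversine_distance(0, 0, 0, 90))

# 2 shared items, 3 items in each set: 2 * 2 / 6
print(dice_index({1, 2, 3}, {2, 3, 4}))
```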