# Introduction to vector distances

In <span style="font-size: 11pt; color: steelblue; font-weight: bold">Data Analysis</span> and <span style="font-size: 11pt; color: steelblue; font-weight: bold">Machine Learning</span>, it is <u>crucial to understand how to measure the distance or dissimilarity between vectors</u>.  

Various distance metrics exist, each with its own strengths and limitations. This short guide gives an overview of commonly used vector distance measures: their pros and cons, how to compute them, and where they are applied.  

By exploring these distances, we will gain a deeper understanding of their importance in Data Analysis and Machine Learning tasks.

# Vector Distances

## <span style="font-size: 18pt; color: goldenrod; font-weight: bold">Euclidean Distance</span>:
#### Overview:
Euclidean distance or **L2 Distance** measures the <u>straight-line distance between two points in a multi-dimensional space</u>. It calculates the length of the line segment connecting the two points.

#### Formula:
For two vectors, $A$ and $B$, each with $n$ dimensions, the Euclidean distance is computed as:

$$\sqrt{\sum_{i=1}^{n}(A_i-B_i)^2}$$

#### Pros and Cons:
| Pros                                   | Cons                                                        |
|----------------------------------------|-------------------------------------------------------------|
| Intuitive interpretation               | Sensitive to outliers, since differences are squared        |
| Captures differences in all dimensions | Less discriminative for sparse, high-dimensional data       |
| Widely used in various algorithms      |                                                             |


#### Trivia:
   - The Euclidean distance is a special case of the Minkowski distance with p=2.

#### Computation in Python:
```python
import numpy as np
   
def euclidean_distance(A, B):
    return np.sqrt(np.sum((A - B) ** 2))
```
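
As a quick sanity check, the formula can be applied to a pair of points whose distance is known in advance. The sketch below assumes two illustrative points, `P` and `Q`, that form a 3-4-5 right triangle, and also compares the result with NumPy's built-in L2 norm of the difference vector.

```python
import numpy as np

def euclidean_distance(A, B):
    # Convert inputs so that plain Python lists work as well as arrays.
    A = np.asarray(A, dtype=float)
    B = np.asarray(B, dtype=float)
    return np.sqrt(np.sum((A - B) ** 2))

# Illustrative points (an assumption for this example): a 3-4-5 right triangle.
P = [0.0, 0.0]
Q = [3.0, 4.0]

d = euclidean_distance(P, Q)                 # sqrt(3^2 + 4^2) = 5.0
d_norm = np.linalg.norm(np.subtract(Q, P))   # same value via the L2 norm
```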
***
## <span style="font-size: 18pt; color: goldenrod; font-weight: bold">Manhattan Distance</span>:
#### Overview:
Manhattan distance, also known as the "city block" distance or **L1 Distance**, measures the distance between two points by summing the absolute differences of their corresponding components.

#### Formula:
For two vectors, $A$ and $B$, with $n$ dimensions, the Manhattan distance is calculated as:

$$\sum_{i=1}^{n}|A_i-B_i|$$

#### Pros and Cons:

| Pros                                        | Cons                                        |
|---------------------------------------------|---------------------------------------------|
| Intuitive interpretation as the sum of absolute differences. | Sensitive to feature scale: dimensions with larger ranges dominate. |
| More robust to outliers than Euclidean distance. | Depends on the choice of axes (not rotation-invariant). |
| Suitable for grid-like structures. | |


#### Trivia:
   - The Manhattan distance is a special case of the Minkowski distance with p=1.

#### Computation in Python:
   ```python
   import numpy as np
   
   def manhattan_distance(A, B):
       return np.sum(np.abs(A - B))
   ```
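
To illustrate how the two distances react to one unusually large difference, here is a minimal sketch. The vectors `A` and `B` are hypothetical and chosen only for this example: two small differences and one large, outlier-like one.

```python
import numpy as np

def manhattan_distance(A, B):
    A = np.asarray(A, dtype=float)
    B = np.asarray(B, dtype=float)
    return float(np.sum(np.abs(A - B)))

def euclidean_distance(A, B):
    A = np.asarray(A, dtype=float)
    B = np.asarray(B, dtype=float)
    return float(np.sqrt(np.sum((A - B) ** 2)))

# Hypothetical vectors: differences of 1, 1 and 10.
A = [0.0, 0.0, 0.0]
B = [1.0, 1.0, 10.0]

d1 = manhattan_distance(A, B)   # 1 + 1 + 10 = 12.0
d2 = euclidean_distance(A, B)   # sqrt(1 + 1 + 100) = sqrt(102)

# Share of each total that comes from the single large difference.
l1_share = 10.0 / d1             # 10 / 12
l2_share = 10.0 ** 2 / d2 ** 2   # 100 / 102 (shares of the squared sum)
```

The large coordinate accounts for a smaller share of the Manhattan total than of the squared Euclidean total, which is one way to read the "robust to outliers" entry in the table above.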
***
## <span style="font-size: 18pt; color: goldenrod; font-weight: bold">Cosine Similarity</span>:
#### Overview:
   Cosine similarity measures the cosine of the angle between two vectors. It is often <span style="font-size: 11pt; color: seagreen; font-weight: bold">used to determine the similarity between documents or high-dimensional vectors</span>.

#### Formula:
   For two vectors, $A$ and $B$, the cosine similarity is computed as:

   $$\frac{{A \cdot B}}{{\|A\| \cdot \|B\|}}$$
where $\|A\|$ denotes the norm (magnitude or length) of the vector $A$.

In general, which norm $\|\cdot\|$ refers to depends on the context. For cosine similarity it is the Euclidean norm, also called the **L2 norm**, which for a vector $A = (a_1, a_2, \ldots, a_n)$ in $n$-dimensional space is calculated as:

$$
\|A\| = \sqrt{a_1^2 + a_2^2 + \ldots + a_n^2}
$$

Similarly, for matrices or other mathematical objects, the notation $\|A\|$ represents the norm or magnitude associated with that object, which again will depend on the specific context and definition being used.
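
As a small illustration of how the choice of norm changes the result, the sketch below evaluates three common vector norms with `np.linalg.norm`. The example vector is an assumption made for this illustration.

```python
import numpy as np

# Example vector (chosen for illustration only).
A = np.array([3.0, -4.0])

l1_norm = np.linalg.norm(A, ord=1)         # |3| + |-4| = 7.0
l2_norm = np.linalg.norm(A)                # default for vectors: sqrt(9 + 16) = 5.0
linf_norm = np.linalg.norm(A, ord=np.inf)  # max(|3|, |-4|) = 4.0

# Dividing by the L2 norm yields a unit-length vector with the same direction.
unit = A / l2_norm
```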

#### Pros and Cons:

| Pros                                          | Cons                                        |
|-----------------------------------------------|---------------------------------------------|
| Independent of vector magnitude.              | Doesn't capture magnitude differences.       |
| Measures the similarity rather than distance. | Undefined when either vector is all zeros (division by zero). |
| Effective for high-dimensional and sparse data.|                                             |

#### Trivia:
   - The cosine similarity ranges from -1 to 1, where <u>1 means the vectors point in the same direction, 0 indicates orthogonality, and -1 means they point in opposite directions</u>.

#### Computation in Python:
   ```python
   import numpy as np
   
   def cosine_similarity(A, B):
       dot_product = np.dot(A, B)
       norm_A = np.linalg.norm(A)
       norm_B = np.linalg.norm(B)
       return dot_product / (norm_A * norm_B)
   ```
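
The following sketch shows the three reference values (1, 0 and -1) and one possible way to handle zero vectors. Returning `NaN` for the undefined case is an assumption made for this example, not an established convention; raising an exception would be equally reasonable.

```python
import numpy as np

def cosine_similarity(A, B):
    A = np.asarray(A, dtype=float)
    B = np.asarray(B, dtype=float)
    norm_product = np.linalg.norm(A) * np.linalg.norm(B)
    if norm_product == 0.0:
        # Assumption: signal the undefined case with NaN instead of raising.
        return float("nan")
    return float(np.dot(A, B) / norm_product)

A = [1.0, 2.0, 3.0]
same_direction = cosine_similarity(A, [2.0, 4.0, 6.0])   # scaled copy of A
opposite = cosine_similarity(A, [-1.0, -2.0, -3.0])      # A reversed
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])   # perpendicular axes
undefined = cosine_similarity(A, [0.0, 0.0, 0.0])        # NaN by the convention above
```

Note that the scaled copy of `A` scores the same as `A` itself would, because only the direction of the vectors matters.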
***
## <span style="font-size: 18pt; color: goldenrod; font-weight: bold">Minkowski Distance</span>:
#### Overview:
   The Minkowski distance is a generalization of both the Euclidean and Manhattan distances. A parameter called the "**order**" or "**p**" controls how strongly large differences in individual dimensions dominate the result.

#### Formula:
For two vectors, $A$ and $B$, the Minkowski distance of order $p$ is computed as:

$$\sqrt[p]{\sum_{i=1}^{n}|A_i-B_i|^p}$$

#### Pros and Cons:

| Pros                           | Cons                               |
|--------------------------------|------------------------------------|
|Flexible distance metric that encompasses both Euclidean and Manhattan distances.|Selection of the parameter p can be challenging.|
|The parameter p adjusts how much weight large per-dimension differences receive.|Sensitive to outliers, especially with high values of p.|

#### Trivia:
   - The Manhattan distance corresponds to the Minkowski distance with p=1, and the Euclidean distance corresponds to p=2. As p approaches infinity, the Minkowski distance converges to the Chebyshev distance.

#### Computation in Python:
   ```python
   import numpy as np
   
   def minkowski_distance(A, B, p):
       return np.power(np.sum(np.abs(A - B) ** p), 1/p)
   ```
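
To see how the order changes the result, the sketch below evaluates the same pair of vectors for several values of `p`. The vectors and the choice of `p = 50` as a "large" order are assumptions made for this illustration.

```python
import numpy as np

def minkowski_distance(A, B, p):
    A = np.asarray(A, dtype=float)
    B = np.asarray(B, dtype=float)
    return float(np.power(np.sum(np.abs(A - B) ** p), 1 / p))

# Illustrative vectors with per-dimension differences of 1, 2 and 3.
A = [0.0, 0.0, 0.0]
B = [1.0, 2.0, 3.0]

d_p1 = minkowski_distance(A, B, 1)     # Manhattan: 1 + 2 + 3 = 6
d_p2 = minkowski_distance(A, B, 2)     # Euclidean: sqrt(1 + 4 + 9)
d_p50 = minkowski_distance(A, B, 50)   # close to the largest difference, 3
```

As `p` grows, the distance shrinks toward the largest single difference, so the dimension with the biggest gap increasingly determines the result.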
***
## <span style="font-size: 18pt; color: goldenrod; font-weight: bold">Hamming Distance</span>:
#### Overview:
   Hamming distance is primarily used for comparing equal-length vectors, typically binary strings. It counts the number of positions at which the corresponding elements of the two vectors differ.

#### Formula:
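   For two binary vectors, $A$ and $B$, each with $n$ elements, the Hamming distance is the count of positions at which they differ:

$$\sum_{i=1}^{n}\mathbf{1}[A_i \neq B_i]$$

where $\mathbf{1}[\cdot]$ equals 1 if the condition holds and 0 otherwise.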

#### Pros and Cons:
| Pros                                             | Cons                                        |
|--------------------------------------------------|---------------------------------------------|
| Effective for categorical or binary data.| Requires vectors of equal length.      |
| Useful for error detection and correction.| Limited to binary or categorical data; not suitable for continuous values.|


#### Trivia:
   - The Hamming distance is equivalent to the Manhattan distance for binary vectors.

#### Computation in Python:
   ```python
   import numpy as np

   def hamming_distance(A, B):
       return np.count_nonzero(np.asarray(A) != np.asarray(B))  # number of differing positions
   ```
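
A short usage sketch follows. The explicit shape check is an addition for this example, the input vectors are made up, and the two words are only an illustration of categorical (character) data. For 0/1 vectors the result matches the Manhattan distance, as noted in the trivia above.

```python
import numpy as np

def hamming_distance(A, B):
    A = np.asarray(A)
    B = np.asarray(B)
    if A.shape != B.shape:
        # Hamming distance is only defined for equal-length inputs.
        raise ValueError("Hamming distance requires vectors of equal length")
    return int(np.count_nonzero(A != B))

# Illustrative binary vectors that differ at positions 1, 2 and 5.
A = np.array([1, 0, 1, 1, 0, 1])
B = np.array([1, 1, 0, 1, 0, 0])

d_hamming = hamming_distance(A, B)
d_manhattan = int(np.sum(np.abs(A - B)))   # same count for 0/1 vectors

# Categorical data: compare two equal-length words character by character.
d_words = hamming_distance(list("karolin"), list("kathrin"))
```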
***
## <span style="font-size: 18pt; color: goldenrod; font-weight: bold">Mahalanobis Distance</span>:
#### Overview:
   The Mahalanobis distance measures the distance between a <u>point</u> and a <u>distribution</u>, taking into account the covariance structure of the data. It rescales each direction by the variability of the data and accounts for correlations between features.

#### Formula:
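   For two vectors, $A$ and $B$, and the covariance matrix $\Sigma$ of the underlying distribution, the Mahalanobis distance is computed as:

$$\sqrt{{(A-B)^\top\Sigma^{-1}(A-B)}}$$

Setting $B$ to the mean of the distribution gives the distance between the point $A$ and that distribution.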

#### Pros and Cons:

| Pros                                                       | Cons                                                        |
| ---------------------------------------------------------- | ----------------------------------------------------------- |
| Accounts for covariance structure                           | Requires estimation of the covariance matrix                |
| Useful for correlated, multivariate data                    | Sensitive to assumptions about the data distribution       |
| Well suited to detecting multivariate outliers              | Covariance matrix must be invertible                        |

#### Trivia:
   - The Mahalanobis distance can be used for <span style="font-size: 11pt; color: seagreen; font-weight: normal">**outlier detection and clustering analysis**</span>.

#### Computation in Python:
   ```python
   import numpy as np
   
   def mahalanobis_distance(A, B, covariance_matrix):
       diff = A - B
       inverse_cov = np.linalg.inv(covariance_matrix)
       return np.sqrt(np.dot(np.dot(diff, inverse_cov), diff.T))
   ```
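
In practice the covariance matrix usually has to be estimated from data. The sketch below is one way to do that under stated assumptions: the sample `X` is synthetic (drawn from a made-up correlated Gaussian), the covariance is estimated with `np.cov`, and `np.linalg.solve` is swapped in for the explicit matrix inverse used above, which computes the same quantity.

```python
import numpy as np

def mahalanobis_distance(x, mean, covariance_matrix):
    diff = np.asarray(x, dtype=float) - np.asarray(mean, dtype=float)
    # Solve Sigma @ z = diff instead of forming the inverse explicitly.
    z = np.linalg.solve(covariance_matrix, diff)
    return float(np.sqrt(diff @ z))

# Hypothetical sample: 500 points from a correlated 2-D Gaussian.
rng = np.random.default_rng(0)
true_cov = np.array([[4.0, 1.5], [1.5, 1.0]])
X = rng.multivariate_normal(mean=[0.0, 0.0], cov=true_cov, size=500)

mu = X.mean(axis=0)
sigma = np.cov(X, rowvar=False)   # rows are observations, columns are features

# Distance between a point and the estimated distribution.
d_point = mahalanobis_distance([2.0, 1.0], mu, sigma)

# With the identity covariance, the result is the Euclidean distance.
d_identity = mahalanobis_distance([3.0, 4.0], [0.0, 0.0], np.eye(2))
```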
***
## <span style="font-size: 18pt; color: goldenrod; font-weight: bold">Jaccard Distance</span>:
#### Overview:
   The Jaccard distance measures the dissimilarity between sets or binary vectors. It is defined as the size of the symmetric difference of the two sets divided by the size of their union.

#### Formula:
   For two sets or binary vectors, $A$ and $B$, the Jaccard distance is calculated as:

$$\frac{{|A \cup B|-|A \cap B|}}{{|A \cup B|}} = 1 - \frac{{|A \cap B|}}{{|A \cup B|}}$$

#### Pros and Cons:
| Pros | Cons |
| --- | --- |
| Suitable for comparing sets or binary data. | Limited to binary or categorical data; not applicable to continuous values. |
| Bounded between 0 and 1, which makes values easy to compare. | Undefined when both sets are empty. |
| Works for sets of different sizes. | |

#### Trivia:
   - The Jaccard distance is used in applications like <span style="font-size: 11pt; color: seagreen; font-weight: normal">**text mining, recommendation systems, and clustering.**</span>

#### Computation in Python:
   ```python
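   import numpy as np

   def jaccard_distance(A, B):
       # A and B are treated as sets of elements (duplicates are ignored),
       # not as 0/1 indicator vectors.
       union = np.union1d(A, B)
       intersection = np.intersect1d(A, B)
       return 1 - len(intersection) / len(union)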
   ```
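
For binary (0/1) indicator vectors, the set operations apply to the positions that are switched on rather than to the values themselves. Here is a minimal sketch of that variant. The name `jaccard_distance_binary` is hypothetical, and returning 0 for two all-zero vectors is an assumption made for this example.

```python
import numpy as np

def jaccard_distance_binary(A, B):
    A = np.asarray(A, dtype=bool)
    B = np.asarray(B, dtype=bool)
    union = np.count_nonzero(A | B)          # positions active in either vector
    if union == 0:
        # Assumption: two all-zero vectors are treated as identical.
        return 0.0
    intersection = np.count_nonzero(A & B)   # positions active in both vectors
    return 1.0 - intersection / union

# Illustrative indicator vectors.
A = [1, 1, 0, 1, 0]
B = [1, 0, 0, 1, 1]
d_binary = jaccard_distance_binary(A, B)     # 1 - 2/4

# The same data expressed as sets of active positions.
set_A = {0, 1, 3}
set_B = {0, 3, 4}
d_sets = 1 - len(set_A & set_B) / len(set_A | set_B)
```

Both representations give the same distance, since the indicator vectors and the sets of active positions describe the same data.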
***
## <span style="font-size: 18pt; color: goldenrod; font-weight: bold">Chebyshev Distance</span>:
#### Overview:
The Chebyshev distance, also known as the **L∞ distance**, measures the distance between two points in a vector space as the maximum absolute difference between their corresponding elements. In other words, only the dimension in which the two vectors differ the most determines the distance.

#### Formula:

The Chebyshev distance between two vectors, $A = (a_1, a_2, \ldots, a_n)$ and $B = (b_1, b_2, \ldots, b_n)$, is calculated as:

$$\max\left(|a_1 - b_1|, |a_2 - b_2|, \ldots, |a_n - b_n|\right)$$

#### Pros and Cons:

| Pros                                                  | Cons                                                     |
| ----------------------------------------------------- | -------------------------------------------------------- |
| Simple and cheap to compute                            | Ignores every dimension except the one with the largest difference |
| Useful when the worst-case difference in any single dimension matters | Sensitive to feature scale and to an outlier in a single dimension |
| Natural for grid moves that allow diagonals, such as a king on a chessboard |                                     |

#### Computation in Python:
```python
import numpy as np

def chebyshev_distance(A, B):
    return np.max(np.abs(A - B))
```
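
A brief usage sketch, with vectors chosen only for illustration, is shown below. It also checks the result against NumPy's L∞ norm of the difference vector and reproduces the well-known chessboard interpretation, where a king needs `max(|dx|, |dy|)` moves to reach a square.

```python
import numpy as np

def chebyshev_distance(A, B):
    A = np.asarray(A, dtype=float)
    B = np.asarray(B, dtype=float)
    return float(np.max(np.abs(A - B)))

# Illustrative vectors with per-dimension differences of 3, 4 and 0.5.
A = [1.0, 5.0, 2.0]
B = [4.0, 1.0, 2.5]

d_cheb = chebyshev_distance(A, B)                        # max(3, 4, 0.5) = 4.0
d_inf = np.linalg.norm(np.subtract(A, B), ord=np.inf)    # same value via the L-infinity norm

# Chessboard example: a king moving from (0, 0) to (2, 5).
king_moves = chebyshev_distance([0, 0], [2, 5])          # max(2, 5) = 5.0
```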

#### Trivia:
   - The Chebyshev distance is named after the Russian mathematician Pafnuty Chebyshev. It is commonly used in applications such as <span style="font-size: 11pt; color: seagreen; font-weight: normal">**pattern recognition, computer vision, and outlier detection**</span>.

## Conclusion:

<span style="font-size: 16pt; color: steelblue; font-weight: bold">Understanding vector distances is crucial for various Data Analysis and Machine Learning tasks.</span>  

By grasping the strengths and limitations of different measures like the **Euclidean**, **Manhattan**, **Minkowski**, **Hamming**, **Mahalanobis**, **Jaccard**, and **Chebyshev distances**, as well as **cosine similarity**, we gain valuable tools to <u>assess the similarities or dissimilarities</u> between vectors. 

<u>Each distance metric has its own unique characteristics</u>, making it suitable for specific scenarios. 

<span style="font-size: 11pt; color: goldenrod; font-weight: bold">Expanding our knowledge of vector distances equips us with the ability to choose the most appropriate distance measure for our data analysis needs.</span>