# Distances
## Distance metric
A distance metric is a function that measures how far apart two data points (or data sets) are.
Formally, a distance metric d(x, y) is a function that satisfies the following properties for all x, y, z:
1. d(x, y) ≥ 0 (non-negativity)
2. d(x, y) = 0 if and only if x = y (identity of indiscernibles)
3. d(x, y) = d(y, x) (symmetry)
4. d(x, y) + d(y, z) ≥ d(x, z) (triangle inequality)
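
The four properties can be tested empirically on a small sample of points. The sketch below is only a sanity check, not a proof: the helper name `check_metric_axioms`, the tolerance `tol`, and the sample points are assumptions made for illustration. It compares the Euclidean distance with the squared Euclidean distance, which breaks the triangle inequality.

```python
import itertools
import numpy as np

def check_metric_axioms(d, points, tol=1e-12):
    """Report which of the four metric properties hold on the given sample."""
    results = {"non_negativity": True, "identity": True, "symmetry": True, "triangle": True}
    for x, y in itertools.product(points, repeat=2):
        dxy = d(x, y)
        if dxy < -tol:
            results["non_negativity"] = False
        # d(x, y) should be zero exactly when x and y are the same point.
        if (dxy <= tol) != np.array_equal(x, y):
            results["identity"] = False
        if abs(dxy - d(y, x)) > tol:
            results["symmetry"] = False
    for x, y, z in itertools.product(points, repeat=3):
        if d(x, y) + d(y, z) < d(x, z) - tol:
            results["triangle"] = False
    return results

# Hypothetical sample points, chosen only for illustration.
points = [np.array(p, dtype=float) for p in [(0, 0), (1, 0), (0, 1), (3, 4)]]

euclid = lambda x, y: float(np.linalg.norm(x - y))
squared = lambda x, y: float(np.sum((x - y) ** 2))

euclid_report = check_metric_axioms(euclid, points)
squared_report = check_metric_axioms(squared, points)
print(euclid_report)
print(squared_report)
```

On this sample the squared distance fails only the triangle inequality: going from (0, 0) to (3, 4) directly costs 25, while the detour via (1, 0) costs 1 + 20 = 21.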

## Jaccard distance - sets
The Jaccard distance between two sets is defined as one minus the Jaccard similarity, where the similarity is the size of the intersection divided by the size of the union of the two sets.
$$d_{J}(A,B) = 1 - \frac{|A \cap B|}{|A \cup B|}$$
Applications:
- Text similarity
- Set similarity
- Image similarity
- Finding similar pieces of malware
- Collaborative filtering (e.g. Amazon recommendations)
- ...


In [None]:
import numpy as np
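def jaccard_similarity(A: set, B: set) -> float:
  union = A.union(B)
  intersection = A.intersection(B)
  # Two empty sets are identical; treat their similarity as 1 to avoid dividing by zero.
  if not union:
    return 1.0
  return len(intersection)/len(union)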
def jaccard_distance(A: set, B: set) -> float:
  return 1 - jaccard_similarity(A, B)

## Euclidean distance - vectors
The Euclidean distance between two vectors is the length of the straight line segment connecting them. It is the shortest possible path between the two points, so it is a lower bound on the length of any path that follows a grid.
$$d_{E}(A,B) = \sqrt{\sum_{i=1}^{n} (a_i - b_i)^2}$$
When to use:
- Continuous data
- Not too many columns
- Magnitude matters (the distance depends on the lengths of the vectors, not only on their direction)
When to avoid:
- Discrete data
- Many columns
- Sparse data -> zero values are treated like any other value
- High dimensionality -> curse of dimensionality
- Features on different scales (it is not scale invariant: dist(a, b) ≠ dist(ca, cb) for a constant c with |c| ≠ 1)
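
The scale dependence in the last point can be checked numerically. This is a minimal sketch; the vectors and the factor `c` are made-up values (think of `c` as a change of units).

```python
import numpy as np

a = np.array([1.0, 2.0])
b = np.array([4.0, 6.0])
c = 10.0  # hypothetical rescaling factor, e.g. a change of units

dist = float(np.linalg.norm(a - b))                 # distance in the original units
dist_scaled = float(np.linalg.norm(c * a - c * b))  # distance after rescaling both vectors
print(dist, dist_scaled)
```

The distance grows by the factor `c`, so rescaling a feature (or the whole data set) changes all distances. This is why features are often standardized before Euclidean distances are computed.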


In [None]:
def euclidean_distance(a: np.array, b: np.array) -> float:
  return np.sqrt(np.sum(np.power(a - b, 2)))

## Manhattan distance - vectors
The Manhattan distance between two vectors is the sum of the absolute differences of their coordinates. It is the length of a shortest path that moves only along the grid axes, so it is never smaller than the Euclidean distance.
$$d_{M}(A,B) = \sum_{i=1}^{n} |a_i - b_i|$$
When to use:
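- When outliers are a problem (differences are not squared, so a single large difference weighs less than in the Euclidean distance)
- Incomparable features
- Many dimensions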
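
A minimal sketch of the Manhattan distance, in the same spirit as the other helpers in this notebook. The function name `manhattan_distance` and the example vectors are assumptions. The example also compares how much a single outlying coordinate contributes to the L1 distance and to the squared L2 distance.

```python
import numpy as np

def manhattan_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Sum of absolute coordinate differences (L1 distance)."""
    return float(np.sum(np.abs(a - b)))

a = np.array([0.0, 0.0, 0.0])
b = np.array([1.0, 1.0, 10.0])  # the last coordinate plays the role of an outlier

l1 = manhattan_distance(a, b)              # 1 + 1 + 10
l2 = float(np.sqrt(np.sum((a - b) ** 2)))  # sqrt(1 + 1 + 100)

outlier_share_l1 = 10.0 / l1        # share of the L1 distance due to the outlier
outlier_share_l2 = 100.0 / l2 ** 2  # share of the squared L2 distance due to the outlier
print(l1, l2, outlier_share_l1, outlier_share_l2)
```

The outlier accounts for a larger share of the squared Euclidean distance than of the Manhattan distance, which is the sense in which the Manhattan distance is less sensitive to outliers.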

## Minkowski distance - vectors (generalization of Euclidean and Manhattan)
The Minkowski distance of order p between two vectors is the p-th root of the sum of the p-th powers of the absolute differences of their coordinates.
$$d_{p}(A,B) = \sqrt[p]{\sum_{i=1}^{n} |a_i - b_i|^p}$$

Special cases of the generalized Minkowski distance:
- L0 = number of non-zero coordinate differences (not a true norm)
- L1 = Manhattan distance
- L2 = Euclidean distance
- Linf = Maximum distance / Chebyshev (the limit p → ∞)
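
The two limiting cases cannot be computed by plugging p into the formula, so the sketch below computes them directly and shows numerically that the Minkowski value approaches the maximum difference as p grows. The vectors and the chosen values of p are arbitrary examples.

```python
import numpy as np

a = np.array([0.0, 0.0, 1.0])
b = np.array([3.0, 4.0, 1.0])
diff = np.abs(a - b)

l0 = int(np.count_nonzero(diff))  # number of coordinates that differ
linf = float(np.max(diff))        # largest coordinate difference

# For growing p the Minkowski value approaches the maximum difference.
lp_values = {}
for p in (1, 2, 8, 64):
    lp_values[p] = float(np.sum(diff ** p) ** (1.0 / p))
    print(p, lp_values[p])
print(l0, linf)
```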


In [None]:
def lp_norm(a: np.array, b: np.array, p : int) -> float:
  return np.power(np.sum(np.power(np.abs(a - b), p)), 1/p)

## Maximum (Chebyshev) distance - vectors
The maximum (Chebyshev) distance between two vectors is the largest absolute difference between their coordinates.
$$d_{\infty}(A,B) = \max_{i} |a_i - b_i|$$
When to use:
- Incomparable features
When to avoid:
- when outliers are a problem
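
A minimal sketch of the maximum distance, including a small illustration of its sensitivity to outliers. The function name `chebyshev_distance` and the example values are assumptions.

```python
import numpy as np

def chebyshev_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Largest absolute coordinate difference (L-infinity distance)."""
    return float(np.max(np.abs(a - b)))

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 3.5])
clean = chebyshev_distance(a, b)  # max(1, 2, 0.5)

b_outlier = b.copy()
b_outlier[2] = 103.0              # a single corrupted coordinate
corrupted = chebyshev_distance(a, b_outlier)  # max(1, 2, 100)
print(clean, corrupted)
```

Only the largest difference counts, so one corrupted coordinate determines the whole distance.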

## Hamming distance - vectors
The Hamming distance between two vectors of equal length is the number of positions at which the corresponding symbols are different.
$$d_{H}(A,B) = \sum_{i=1}^{n} \mathbb{1}[a_i \neq b_i]$$
For binary vectors this equals $\sum_{i=1}^{n} |a_i - b_i|$.

In [None]:
def hamming_distance(a: np.array, b: np.array) -> int:
  return np.sum(a != b)

## Cosine distance - vectors
The cosine distance between two vectors is one minus the cosine of the angle between them. It is not a metric (it violates the triangle inequality). It is scale invariant but not translation invariant.
$$d_{C}(A,B) = 1 - \frac{A \cdot B}{\|A\| \|B\|}$$
When to use:
- bag of words (distances do not depend on length of document)
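
The invariance properties can be checked on a small example. This is a sketch with made-up vectors; the scaling factor 3 and the shift 5 are arbitrary.

```python
import numpy as np

def cosine_distance(a: np.ndarray, b: np.ndarray) -> float:
    return float(1 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

a = np.array([1.0, 2.0])
b = np.array([2.0, 1.0])

d = cosine_distance(a, b)
d_scaled = cosine_distance(3 * a, b)       # rescaling one vector keeps the angle
d_shifted = cosine_distance(a + 5, b + 5)  # shifting both vectors changes the angle
print(d, d_scaled, d_shifted)
```

Scaling a vector leaves the distance unchanged, which matches the bag-of-words use case: repeating a document several times multiplies its word counts but not its direction. Adding the same offset to both vectors moves them closer in angle, so the distance changes.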

In [None]:
def cosine_distance(a: np.array, b: np.array) -> float:
  return 1 - (np.dot(a,b)/(np.linalg.norm(a) * np.linalg.norm(b)))

a = np.array([1,1])
b = np.array([1,0])
c = np.array([0,1])

dist_ab = cosine_distance(a,b)
dist_bc = cosine_distance(b,c)
dist_ac = cosine_distance(a,c)
print(dist_bc, dist_ab, dist_ac)
print(dist_bc > dist_ab + dist_ac)

## ISOMAP Distance - vectors
1. Compute the k nearest neighbours (kNN) of each data point.
2. Construct a weighted graph that connects each point to its neighbours: weights = distances.
3. Set ISOMAP distance(p, q) = weight of the shortest path from p to q in that graph.

In [None]:
import heapq as heap
from collections import defaultdict

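def dijkstra(G, startingNode):
  # G[node] is an iterable of (weight, neighbour) pairs.
  visited = set()
  parentsMap = {}
  pq = []
  nodeCosts = defaultdict(lambda: float('inf'))
  nodeCosts[startingNode] = 0
  heap.heappush(pq, (0, startingNode))

  while pq:
    _, node = heap.heappop(pq)
    if node in visited:
      # Stale queue entry: the node was already settled with a lower cost.
      continue
    visited.add(node)

    for weight, adjNode in G[node]:
      if adjNode in visited:
        continue

      newCost = nodeCosts[node] + weight
      if nodeCosts[adjNode] > newCost:
        parentsMap[adjNode] = node
        nodeCosts[adjNode] = newCost
        heap.heappush(pq, (newCost, adjNode))

  return parentsMap, nodeCosts
def ISOMAP_distance(a_index: int, b_index: int, dataset: np.array, k: int) -> float:
  def knn_for_one_point(p_index : int, data: np.array) -> list:
    # The k nearest neighbours of data[p_index] as (distance, index) pairs.
    return sorted([(euclidean_distance(data[p_index], vector), index) for index,vector in enumerate(data) if index != p_index])[:k]
  # kNN is not a symmetric relation, so every edge is added in both directions.
  # Otherwise the shortest-path distance could differ between (a, b) and (b, a).
  graph = defaultdict(list)
  for index, _ in enumerate(dataset):
    for distance, neighbour in knn_for_one_point(index, dataset):
      graph[index].append((distance, neighbour))
      graph[neighbour].append((distance, index))
  # The result is inf if b is not reachable from a in the kNN graph.
  return dijkstra(graph, a_index)[1][b_index]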

## Kullback-Leibler divergence - statistical
The Kullback-Leibler divergence between two probability distributions measures how a probability distribution P differs from a second, reference probability distribution Q.
Discrete case:
$$D_{KL}(P||Q) = \sum_{i=1}^{n} P(i) \log \frac{P(i)}{Q(i)}$$
Continuous case:
$$D_{KL}(P||Q) = \int_{-\infty}^{\infty} P(x) \log \frac{P(x)}{Q(x)} dx$$

- Measures the "surprise" of the distribution P with respect to Q. (P is the true distribution, Q is the estimated distribution.)
- The expected value, taken under P, of the log-likelihood ratio between P and Q.
- The number of extra bits needed to encode a sample from P when the code is built for the model Q (with base-2 logarithms).
- Also called information gain or relative entropy.
- Not a metric (it is not symmetric and does not satisfy the triangle inequality).

Forward KL, $D_{KL}(P||Q)$: wherever P is high, Q should be high as well.
![Row normalization effect](./assets/kl.png)

In [None]:
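def discrete_kl_divergence(P: np.array, Q: np.array) -> float:
  # Terms with P(i) = 0 contribute 0 by convention; the result is in bits (log base 2).
  support = P > 0
  return np.sum(P[support] * np.log2(P[support]/Q[support]))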

In [None]:
discrete_kl_divergence(
    np.array([0.01, 0.1, 0.39, 0.5]),
    np.array([0.1, 0.5, 0.3, 0.1]),
)

In [None]:
discrete_kl_divergence(
    np.array([0.1, 0.5, 0.3, 0.1]),
    np.array([0.01, 0.1, 0.39, 0.5])
)

## Jensen-Shannon divergence - statistical
The Jensen-Shannon divergence is a symmetrized and smoothed version of the Kullback-Leibler divergence: both distributions are compared with their average M.
$$D_{JS}(P||Q) = \frac{1}{2} D_{KL}(P||M) + \frac{1}{2} D_{KL}(Q||M)$$ with $M = \frac{1}{2}(P+Q)$
- Unlike KL divergence, JS divergence is symmetric. The divergence itself is not a metric, but its square root (the Jensen-Shannon distance) is.
- It is always finite, even if one distribution assigns zero probability where the other does not. With base-2 logarithms it lies between 0 and 1.
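
The boundedness can be seen on the extreme case of two distributions with disjoint support, where the KL divergence would be infinite. The sketch below is self-contained; the helper names `kl_bits`, `js_divergence` and the example distributions are assumptions, and zero-probability terms are skipped by the usual convention 0 · log 0 = 0.

```python
import numpy as np

def kl_bits(P: np.ndarray, Q: np.ndarray) -> float:
    """KL divergence in bits; terms with P(i) = 0 contribute 0 by convention."""
    support = P > 0
    return float(np.sum(P[support] * np.log2(P[support] / Q[support])))

def js_divergence(P: np.ndarray, Q: np.ndarray) -> float:
    M = (P + Q) / 2
    return 0.5 * kl_bits(P, M) + 0.5 * kl_bits(Q, M)

# Two distributions that never put mass on the same outcome.
P = np.array([1.0, 0.0])
Q = np.array([0.0, 1.0])

js = js_divergence(P, Q)
js_distance = float(np.sqrt(js))  # the square root is known as the Jensen-Shannon distance
print(js, js_distance)
```

Because M is positive wherever P or Q is positive, neither KL term can blow up, and the divergence reaches its maximum of 1 bit for completely disjoint distributions.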

In [None]:
def jensen_shannon_divergence(P: np.array, Q: np.array) -> float:
  M = (P + Q)/2
  return (discrete_kl_divergence(P, M) + discrete_kl_divergence(Q, M))/2

In [None]:
jensen_shannon_divergence(
    np.array([0.1, 0.5, 0.3, 0.1]),
    np.array([0.01, 0.1, 0.39, 0.5])
) == jensen_shannon_divergence(
    np.array([0.01, 0.1, 0.39, 0.5]),
    np.array([0.1, 0.5, 0.3, 0.1]),
)

## Dynamic Time Warping DTW - sequential
Dynamic time warping is a distance measure for sequential data that allows for differences in the speed and timing of the sequences. It is commonly used in speech recognition, image matching, and time series analysis.
It is not a metric: it does not satisfy the triangle inequality.

In [None]:
def dynamic_time_warping(a: np.array, b: np.array) -> float:
  # Pairwise squared Euclidean costs between every element of a and every element of b.
  distance_matrix = np.array([[euclidean_distance(i,j)**2 for j in b] for i in a])
  result_matrix = np.zeros(distance_matrix.shape)
  for i in range(len(a)):
    for j in range(len(b)):
      # Cheapest way to reach cell (i, j): from above, from the left, or diagonally.
      candidates = []
      if i != 0:
        candidates.append(result_matrix[i-1,j])
      if j != 0:
        candidates.append(result_matrix[i,j-1])
      if i != 0 and j != 0:
        candidates.append(result_matrix[i-1,j-1])
      # The top-left cell has no predecessor, so only its own cost counts.
      result_matrix[i,j] = distance_matrix[i,j] + (min(candidates) if candidates else 0)
  return np.sqrt(result_matrix[len(a)-1, len(b)-1])

## Levenshtein distance - sequential
Levenshtein distance, also known as edit distance, is the minimum number of single-element insertions, deletions, and substitutions required to transform one sequence into another. It is commonly used in natural language processing and spell-checking applications.
It is a metric.
Applications:
- Spell checking
- DNA sequence alignment
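
The definition translates into a short dynamic program. This is a minimal sketch of the standard row-by-row recurrence with unit costs; the function name `levenshtein_distance` and the example words are assumptions.

```python
def levenshtein_distance(a: str, b: str) -> int:
    # previous[j] holds the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]  # turning a[:i] into the empty string takes i deletions
        for j, char_b in enumerate(b, start=1):
            substitution = previous[j - 1] + (char_a != char_b)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]

distance = levenshtein_distance("kitten", "sitting")
print(distance)
```

Only two rows of the table are kept at any time, so the memory use is proportional to the length of `b` rather than to the product of both lengths.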

## Alignment
When working with sequential data, it may be necessary to align the sequences in order to compute a meaningful distance. This can be done using techniques such as dynamic time warping or Needleman-Wunsch alignment.

In [None]:
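def sequence_alignment(a: list, b: list, W_l: float, W_r: float) -> float:
  result_matrix = np.ones((len(a),len(b))) * 1e7
  # The first row and the first column form the boundary of the recurrence and score 0.
  for i in range(len(a)):
    result_matrix[i,0]=0
  for j in range(len(b)):
    result_matrix[0,j]=0

  for i in range(1,len(a)):
    for j in range(1,len(b)):
      result_matrix[i,j] = max(
          0,
          # Diagonal step: pair a[i] with b[j], adding 1 when they differ.
          # The parentheses are needed because + binds tighter than !=.
          result_matrix[i-1,j-1] + (a[i]!=b[j]),
          # Vertical step: skip k elements of a, scored with W_l.
          max([result_matrix[i-k,j] + W_l  for k in range(1, i+1)]),
          # Horizontal step: skip k elements of b, scored with W_r.
          max([result_matrix[i,j-k] + W_r  for k in range(1, j+1)])
      )
  return np.sqrt(result_matrix[len(a)-1, len(b)-1])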