# Statistical Distances with `Python` and `R`

## Index

* [Statistical Distances](#1)
* * [Distance Definition](#2)
* * [Distance Matrix](#3)

* [Distances with quantitative variables](#1)
* * [Euclidean Distance](#2)
* * * [Disadvantages](#3)
* * * [Euclidean Distance in `R`](#4)
* * * [Euclidean Distance in `Python`](#5)

  

  <br>


## Statistical Distances <a class="anchor" id="1"></a>



The concept of distance between the elements of a set $\Omega$ allows us to interpret many classical techniques of multivariate analysis geometrically.

This interpretation is possible with both quantitative and categorical variables, and even when no variables are available, as long as it makes sense to measure the proximity between the elements of $\Omega$.



###  Distance Definition <a class="anchor" id="1"></a>



Given a set of elements $\Omega$, we define the following notions:

#### Quasi-metric <a class="anchor" id="1"></a>


A **quasi-metric** or **dissimilarity** is any mapping $\delta : \Omega \times \Omega \rightarrow \mathbb{R}$ that satisfies the following properties:



1) $\hspace{0.15cm}\delta (i,j) \geq 0 \hspace{0.25cm}, \forall i,j \in \Omega$

2) $\hspace{0.15cm}\delta (i,i) = 0 \hspace{0.25cm}, \forall i \in  \Omega$

3) $\hspace{0.15cm}\delta (i,j) = \delta (j, i) \hspace{0.25cm}, \forall i,j \in \Omega $



#### Semi-metric <a class="anchor" id="1"></a>


A **semi-metric** is any dissimilarity (quasi-metric) that also satisfies the triangle inequality:



4) $\hspace{0.15cm} \delta (i,j) \hspace{0.1 cm}\leq \hspace{0.1 cm} \delta (i,k) + \delta (k,j) \hspace{0.25cm}, \forall i,j,k \in \Omega$



#### Metric <a class="anchor" id="1"></a>


A **metric** is any semi-metric that also satisfies:

5) $\hspace{0.15cm} \delta (i,j)=0 \hspace{0.15cm}\Leftrightarrow\hspace{0.15cm} i=j$




#### Distance <a class="anchor" id="1"></a>

In what follows, a **distance** is either a metric or a semi-metric.
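
The five properties above can be checked numerically on a finite set once the values $\delta(i,j)$ are stored in a square array. The following is only a minimal sketch: the function name `classify_dissimilarity` and the tolerance `tol` are assumptions made for illustration, not part of the definitions.

```python
import numpy as np

def classify_dissimilarity(D, tol=1e-12):
    """Return the strongest label among the definitions above that D satisfies.

    D is a square array with D[i, j] = delta(i, j). Hypothetical helper.
    """
    D = np.asarray(D, dtype=float)

    # Properties 1-3: non-negativity, zero diagonal, symmetry.
    is_dissimilarity = (
        np.all(D >= -tol)
        and np.allclose(np.diag(D), 0.0, atol=tol)
        and np.allclose(D, D.T, atol=tol)
    )
    if not is_dissimilarity:
        return "not a dissimilarity"

    # Property 4: D[i, j] <= D[i, k] + D[k, j] for all i, j, k.
    # sums[i, k, j] = D[i, k] + D[k, j]
    sums = D[:, :, None] + D[None, :, :]
    if not np.all(D[:, None, :] <= sums + tol):
        return "dissimilarity"

    # Property 5: the distance is zero only on the diagonal.
    off_diagonal = ~np.eye(D.shape[0], dtype=bool)
    if np.all(D[off_diagonal] > tol):
        return "metric"
    return "semi-metric"


# Absolute differences between the points 0, 1 and 3 on the real line.
line_points = [[0, 1, 3], [1, 0, 2], [3, 2, 0]]
# The same points with squared differences: the triangle inequality fails.
squared = [[0, 1, 9], [1, 0, 4], [9, 4, 0]]
# Two distinct elements at distance zero from each other.
duplicated = [[0, 0, 1], [0, 0, 1], [1, 1, 0]]

print(classify_dissimilarity(line_points))  # metric
print(classify_dissimilarity(squared))      # dissimilarity
print(classify_dissimilarity(duplicated))   # semi-metric
```

The check of the triangle inequality builds an $n \times n \times n$ array, so this sketch is only practical for small sets.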
 

### Distance Matrix <a class="anchor" id="1"></a>



When $\Omega$ is a finite set with $n$ elements, the distances can be collected in an $n \times n$ distance matrix:



$$
D= \begin{pmatrix}
0 & \delta_{12}&...&\delta_{1n}\\
\delta_{21} & 0&...&\delta_{2n}\\
...&...&...&...\\
\delta_{n1}& \delta_{n2}&...& 0\\
\end{pmatrix}
$$
with $\delta_{ij}=\delta(i,j)$ and $\delta_{ij}=\delta_{ji}$.



We will also use the matrix of squared distances:



$$
D^{(2)}= 
\begin{pmatrix}
0 & \delta^2_{12}&...&\delta^2_{1n}\\
\delta^2_{21} & 0&...&\delta^2_{2n}\\
...&...&...&...\\
\delta^2_{n1}& \delta^2_{n2}&...& 0\\
\end{pmatrix}
$$





This matrix should not be confused with $D^2=D\cdot D$, the ordinary matrix product of $D$ with itself.
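
The difference between $D^{(2)}$ and $D^2$ is easy to see in NumPy, where `**` squares element by element and `@` is the matrix product. The small matrix below is a made-up example, not data from this notebook.

```python
import numpy as np

# Hypothetical 3 x 3 distance matrix (points 0, 1 and 3 on the real line).
D = np.array([[0.0, 1.0, 3.0],
              [1.0, 0.0, 2.0],
              [3.0, 2.0, 0.0]])

D_squared_distances = D ** 2   # D^(2): each entry is delta_ij squared
D_matrix_product = D @ D       # D^2 = D . D: ordinary matrix product

print(D_squared_distances)
print(D_matrix_product)

# The diagonal of D^(2) stays at zero, the diagonal of D . D does not.
print(np.diag(D_squared_distances))  # [0. 0. 0.]
print(np.diag(D_matrix_product))     # [10.  5. 13.]
```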



## Distances with quantitative variables <a class="anchor" id="1"></a>



Let $X_1,...,X_p$ be quantitative variables.

Let $x_i=(x_{i1},...,x_{ip})^t$ and $x_j=(x_{j1},...,x_{jp})^t$ be the values (observations) of the variables $X_1,...,X_p$ for the elements or individuals $i$ and $j$ of the sample $\Omega$.


## Euclidean Distance <a class="anchor" id="1"></a>


 
The Euclidean distance between the elements / individuals $i$ and $j$ of $\Omega$ with respect to the quantitative variables $X_1,...,X_p$ is defined as:



$$
\delta^2(i,j)_{Euclidean} = \sum_{k=1}^{p} (x_{ik} - x_{jk})^2 = (x_i - x_j)^t\cdot (x_i - x_j)
$$



$$
\delta(i,j)_{Euclidean} =\sqrt{\sum_{k=1}^{p} (x_{ik} - x_{jk})^2 }  = \sqrt{(x_i - x_j)^t\cdot (x_i - x_j)}
$$
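
As a small worked example, assume $p=2$ and two hypothetical individuals with $x_i=(1,2)^t$ and $x_j=(4,6)^t$ (values chosen only for illustration):

$$
\delta^2(i,j)_{Euclidean} = (1-4)^2 + (2-6)^2 = 9 + 16 = 25
$$

$$
\delta(i,j)_{Euclidean} = \sqrt{25} = 5
$$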

 


### Disadvantages <a class="anchor" id="1"></a>


 
Although it is one of the most popular distances, it is unsuitable in many cases for the following reasons:

1) It implicitly treats the variables as uncorrelated and with unit variance (the variance problem can be solved by standardizing each variable, that is, dividing it by its standard deviation).

2) It is not invariant to changes of scale (changes of measurement units) of the variables.


 
Let's see what this means in more detail:

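Suppose a change of scale $a\cdot X_k + b$ is applied to every variable $X_k$, $k=1,...,p$, with $a\neq 1$ and $b\neq 0$.

The observations for elements $i$ and $j$ are now $a\cdot x_i + b$ and $a\cdot x_j + b$.

Then the squared Euclidean distance between the elements $i$ and $j$ with respect to the rescaled variables is:

$$
\delta^2(i,j)_{Euclidean} = (a\cdot x_i + b - a\cdot x_j - b)^t\cdot (a\cdot x_i + b - a\cdot x_j - b) = a^2 \cdot (x_i - x_j)^t\cdot (x_i - x_j)
$$

The shift $b$ cancels out, but the distance itself is multiplied by $|a|$, so it depends on the measurement units.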
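
The effect of the rescaling can also be checked numerically. This is only an illustrative sketch: the helper `euclidean`, the two observations and the constants `a` and `b` are all made-up values.

```python
import numpy as np

def euclidean(x, y):
    """Euclidean distance between two observation vectors (hypothetical helper)."""
    diff = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return np.sqrt(diff @ diff)

# Two hypothetical individuals measured on p = 2 variables.
x_i = np.array([1.0, 2.0])
x_j = np.array([4.0, 6.0])

# Hypothetical change of scale a * X + b applied to every variable.
a, b = 100.0, 7.0

d_original = euclidean(x_i, x_j)
d_rescaled = euclidean(a * x_i + b, a * x_j + b)

print(d_original)               # 5.0
print(d_rescaled)               # 500.0
print(d_rescaled / d_original)  # 100.0, that is |a|; b has no effect
```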


## Data-set in  `R` <a class="anchor" id="1"></a>


Working data set with 4 quantitative variables, 3 binary variables and 3 multi-category variables:


```r
set.seed(123)

# Quantitative
X1 <- rnorm(50, mean = 10, sd = 15)
X2 <- rnorm(50, mean = 10, sd = 15)
X3 <- rnorm(50, mean = 10, sd = 15)
X4 <- rnorm(50, mean = 10, sd = 15)

# Binary
X5 <- round(runif(50))
X6 <- round(runif(50))
X7 <- round(runif(50))

# Multi-category
X8 <- round(runif(50, min = 0, max = 4))   # categories: 0,1,2,3,4
X9 <- round(runif(50, min = 0, max = 3))   # categories: 0,1,2,3
X10 <- round(runif(50, min = 0, max = 5))  # categories: 0,1,2,3,4,5
```

```r
mean(X1)
```

```
[1] 10.51605
```


## Data-set in  `Python` <a class="anchor" id="1"></a>


```python
import numpy as np

np.random.seed(123)

# Quantitative
X1 = np.random.normal(loc=10, scale=15, size=50)
X2 = np.random.normal(loc=10, scale=15, size=50)
X3 = np.random.normal(loc=10, scale=15, size=50)
X4 = np.random.normal(loc=10, scale=15, size=50)

# Binary categorical / dummies (categories: 0,1)
X5 = np.random.uniform(low=0.0, high=1.0, size=50).round()
X6 = np.random.uniform(low=0.0, high=1.0, size=50).round()
X7 = np.random.uniform(low=0.0, high=1.0, size=50).round()

# Multi-category
X8 = np.random.uniform(low=0, high=4, size=50).round()   # categories: 0,1,2,3,4
X9 = np.random.uniform(low=0, high=3, size=50).round()   # categories: 0,1,2,3
X10 = np.random.uniform(low=0, high=5, size=50).round()  # categories: 0,1,2,3,4,5
```


### Euclidean Distance in `R` <a class="anchor" id="1"></a>

### Euclidean Distance in `Python` <a class="anchor" id="1"></a>
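
One possible way to obtain the Euclidean distance matrix for the quantitative variables $X_1,...,X_4$ is sketched below using only NumPy broadcasting. The function name `euclidean_distance_matrix` is an assumption made for this sketch, and the data are regenerated with the same seed so that the block runs on its own.

```python
import numpy as np

# Regenerate the four quantitative variables of the Python data set.
np.random.seed(123)
X1 = np.random.normal(loc=10, scale=15, size=50)
X2 = np.random.normal(loc=10, scale=15, size=50)
X3 = np.random.normal(loc=10, scale=15, size=50)
X4 = np.random.normal(loc=10, scale=15, size=50)

# Data matrix: one row per individual, one column per variable.
X = np.column_stack([X1, X2, X3, X4])  # shape (50, 4)

def euclidean_distance_matrix(X):
    """Return the n x n matrix of Euclidean distances between the rows of X.

    Hypothetical helper written for this sketch.
    """
    X = np.asarray(X, dtype=float)
    # diff[i, j, k] = x_ik - x_jk
    diff = X[:, None, :] - X[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

D = euclidean_distance_matrix(X)

print(D.shape)   # (50, 50)
print(D[:3, :3]) # distances among the first three individuals
```

The matrix `D` is symmetric with a zero diagonal, as required of a distance matrix. The intermediate array has shape $(n, n, p)$, which is fine for 50 individuals but may be too large for big samples.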


## Bibliography <a class="anchor" id="1"></a>

https://numpy.org/doc/stable/reference/random/legacy.html