# Statistical Distances with `Python` and `R`

## Index

* [Data-set in `Python`](#1)
* [Data-set in `R`](#2)
  


* [Statistical Distances ](#3)
* * [Distance Definition](#4)
* * * [Almost-metric](#5)
* * * [Semi-metric](#6)
* * * [Metric](#7)
* * * [Distance](#9)


* * [Distance Matrix](#10)

* [ Distances with quantitative variables](#11)
  
* * [Euclidean Distance](#12)
* * * [Disadvantages](#13)
* * * [Euclidean Distance in `R`](#14)
* * * [Euclidean Distance in `Python`](#15)
* * * [Euclidean Distance Matrix in `R`](#16)
* * * [Euclidean Distance Matrix in `Python`](#17)
 

 
* * [Minkowski Distance](#18)
* * * [Disadvantages](#19)
* * * [ Particular cases of the Minkowski distance](#20)
* * * * [Euclidean Distance](#21)
* * * * [Manhattan Distance](#22)
* * * * [Dominant Distance](#23)
* * * [Minkowski Distance in `R`](#24)
* * * [Minkowski Distance in `Python`](#25)
* * * [Minkowski Distance Matrix in `R`](#26)
* * * [Minkowski Distance Matrix in `Python`](#27)
* * * [Dominant Distance in `R`](#28)
* * * [Dominant Distance in `Python`](#29)
* * * [Dominant Distance Matrix in `R`](#30)
* * * [Dominant Distance Matrix in `Python`](#31)


  
* * [Canberra Distance](#32)
* * * [Disadvantages](#33.1)
* * * [Canberra Distance in `R`](#33)
* * * [Canberra Distance in `Python`](#34)
* * * [Canberra Distance Matrix in `R`](#35)
* * * [Canberra Distance Matrix in `Python`](#36)
 

 
* * [Karl Pearson Distance](#37)
* * * [Disadvantages](#38.1)
* * * [Karl Pearson Distance in `R`](#38)
* * * [Karl Pearson Distance in `Python`](#39)
* * * [Karl Pearson Distance Matrix in `R`](#40)
* * * [Karl Pearson Distance Matrix in `Python`](#41)
  


* * [Mahalanobis Distance](#42)
* * * [Advantages](#42.1)
* * * [Mahalanobis Distance in `R`](#43)
* * * [Mahalanobis Distance in `Python`](#44)
* * * [Mahalanobis Distance Matrix in `R`](#45)
* * * [Mahalanobis Distance Matrix in `Python`](#46)

* [ Distances with categorical variables](#47)
* * [Similarity](#48)
* * [Similarity Matrix](#49)
* * [Go from similarity to distance](#50)

* [Similarities with binary categorical variable](#51)
* * * [Arrays with parameters a, b, c and d ](#52)
* * * [Computing the arrays with parameters a, b, c and d  in `R`](#53)
* * * [Computing the arrays with parameters a, b, c and d  in `Python`](#54)
* * [Sokal Similarity](#55)
* * [Sokal Distance](#56)
* * * [ Sokal Similarity in `R`](#57)
* * * [ Sokal Similarity in `Python`](#58)
* * * [ Sokal Similarity Matrix in `R`](#59)
* * * [ Sokal Similarity Matrix in `Python`](#60)
* * * [Sokal Distance in `R`](#61)
* * * [Sokal Distance in `Python`](#62)
* * * [Sokal Distance Matrix in `R`](#63)
* * * [Sokal Distance Matrix in `Python`](#64)
* * [Jaccard Similarity](#65)
* * [Jaccard Distance](#66)
* * * [ Jaccard Similarity in `R`](#67)
* * * [ Jaccard Similarity in `Python`](#68)
* * * [ Jaccard Similarity Matrix in `R`](#69)
* * * [ Jaccard Similarity Matrix in `Python`](#70)
* * * [Jaccard Distance in `R`](#71)
* * * [Jaccard Distance in `Python`](#72)
* * * [Jaccard Distance Matrix in `R`](#73)
* * * [Jaccard Distance Matrix in `Python`](#74)
* * [More similarity coefficients](#75)

* [Similarities with multiclass  categorical variable](#76)
* * [Matches Similarity ](#77)
* * [Matches Distance  ](#78)
* * * [ Matches Similarity in `R`](#79)
* * * [ Matches Similarity in `Python`](#79.1)
* * * [ Matches Distance in `R`](#80)
* * * [ Matches Distance in `Python`](#80.1)
* * * [ Matches Similarity Matrix in `R`](#81)
* * * [ Matches Similarity Matrix in `Python`](#82)
* * * [ Matches Distance Matrix in `R`](#83)
* * * [ Matches Distance Matrix in `Python`](#84)

* [Distances with Mixed Variables](#85)
* * [Gower Similarity Coefficient](#86)
* * [Gower Distance](#87)
* * * [ Gower Similarity in `R`](#88)
* * * [Gower Similarity in `Python`](#89)
* * * [Gower Similarity Matrix in `R`](#90)
* * * [Gower Similarity Matrix in `Python`](#91)
* * * [Gower Distance in `R`](#92)
* * * [Gower Distance in `Python`](#93)
* * * [Gower Distance Matrix in `R`](#94)
* * * [Gower Distance Matrix in `Python`](#95)
* * [Gower-Mahalanobis Similarity](#96)
* * [Gower-Mahalanobis Distance](#97)
* * * [Gower-Mahalanobis Similarity in `R`](#98)
* * * [Gower-Mahalanobis Similarity in `Python`](#99)
*  * * [Gower-Mahalanobis Similarity Matrix in `R`](#100)
*  * * [Gower-Mahalanobis Similarity Matrix in `Python`](#101)
*  * * [Gower-Mahalanobis Distance in `R`](#102)
*  * * [Gower-Mahalanobis Distance in `Python`](#103)
*  * * [Gower-Mahalanobis Distance Matrix in `R`](#104)
*  * * [Gower-Mahalanobis Distance Matrix in `Python`](#105)

## Data-set in  `Python` <a class="anchor" id="1"></a>


In [2]:
import numpy as np

In [3]:
np.random.seed(123)

# Quantitative

X1 = np.random.normal(loc=10, scale=15, size=50)
X2 = np.random.normal(loc=10, scale=15, size=50)
X3 = np.random.normal(loc=10, scale=15, size=50)
X4 = np.random.normal(loc=10, scale=15, size=50)

# Binary Categorical / Dummies ( categories: 0,1)

X5 = np.random.uniform(low=0.0, high=1.0, size=50).round()
X6 = np.random.uniform(low=0.0, high=1.0, size=50).round() 
X7 = np.random.uniform(low=0.0, high=1.0, size=50).round() 


# Multiple categorical

X8 = np.random.uniform(low=0, high=4, size=50).round()   # categories: 0,1,2,3,4
X9 = np.random.uniform(low=0, high=3, size=50).round()   # categories: 0,1,2,3
X10 = np.random.uniform(low=0, high=5, size=50).round()  # categories: 0,1,2,3,4,5

In [4]:
import pandas as pd

In [5]:
Data_set_Python = pd.DataFrame({'X1': X1 , 'X2': X2, 'X3': X3 , 'X4': X4 , 'X5': X5 , 
                         'X6': X6 , 'X7': X7 , 'X8': X8 , 'X9': X9 , 'X10': X10 })

## Data-set in  `R` <a class="anchor" id="2"></a>


In [6]:
import rpy2

%load_ext rpy2.ipython

import rpy2.robjects as robjects



In [7]:
%%R

# Quantitative

X1 <- c(-6.28445905,  24.9601817 ,  14.24467747, -12.59442071,
         1.32099622,  34.77154806, -26.40018865,   3.56631057,
        28.98904388,  -3.00110603,  -0.18329227,   8.57936547,
        32.37084439,   0.41647005,   3.34027061,   3.48473087,
        43.08895124,  42.80179133,  25.06080847,  15.79279599,
        21.06052864,  32.36098042,  -4.03750803,  27.63743567,
        -8.80821002,   0.43372746,  23.60657794, -11.4302105 ,
         7.8989692 ,  -2.92632344,   6.16570944, -31.97883658,
       -16.57299657,  -0.49815852,  23.91193648,   7.39546476,
        10.04268874,  20.32334067,  -3.19304515,  14.25440986,
        -2.08049777, -15.91504241,   4.13650309,  18.60708794,
        15.07883576,   9.82254258,  45.88547899,  16.1936824 ,
        24.68104009,  43.57215008)

X2 <- c(-9.41127985,  -5.58182315,  36.15568338,  -1.97094103,
        10.44524845,  26.03973954,  23.36059587,  36.32329273,
        32.43466206,  26.04089005,  -1.59063071,  21.92294002,
        14.71407992,  -9.8939819 ,  31.2594857 ,  22.10854802,
        10.68235121,   6.50361909,  -7.97451717,  12.9928611 ,
        17.02658679,  -2.46732476,  27.43306074,  -6.4580457 ,
       -21.84650525,  25.59590636,   3.94950943,   8.10955622,
        -2.56275084, -14.08944141,  28.82856062,  -0.33303476,
        34.91428732,  22.10962279,   5.2786278 ,  -6.28853602,
        -0.9869298 ,  -8.18784697,  41.30670039,  12.46661845,
        27.25308314,  -9.01028074,  12.71552694,  27.66792908,
         4.97483857,  25.46671688,  -6.26851868, -10.45207317,
        15.69100918,   4.31235348)

X3 <- c(19.63082034, -19.66831897,  20.68396953,  48.97455891,
         9.63061028,  10.51213193,  12.69324227, -17.92963566,
        16.3921996 , -14.08114616,   3.58480603,  28.64304324,
        -1.02825434,  17.51873484,  25.19108581,  14.18111284,
       -10.56422705,   5.01287087,  39.39117014, -20.37568644,
         5.86320979,   1.71837893,  11.81121045,  21.22323426,
        34.13036452,   5.94651412,  22.18511995,  17.49610217,
        17.11520947,   1.54114103,  -4.95982203,  -6.50064669,
        -1.34655814,  14.82529864,  21.4142409 ,  14.85203272,
         1.76567356,  37.08955165,  32.78298435,   4.68999831,
        -2.35147109,  11.95322431,  29.00947968,  14.99147466,
        18.34823057,   6.81879817,  16.84406343,  33.16816677,
         6.40496828,  12.14961599)

X4 <- c(13.80724715,  14.25588034, -11.17833314, -18.15302984,
        -5.29482606,  12.51913443,  18.30784249,   2.0398816 ,
        30.65886224,   7.85236039,  10.30473997,   7.09054194,
        12.01040189,  20.56711111,  19.98480157,  -3.47634411,
        32.85495665,  -6.42539686,  11.18840521,   5.8840514 ,
        -5.73487516,   8.87319118,  -1.1122066 ,  11.09360865,
        16.04628942,  32.07894053,  14.61076328,   0.8316199 ,
         4.12570284,  12.09967159,  11.40191244,  31.89383902,
        30.93029395,   4.61596111,   1.77036808, -28.35581906,
         1.7661938 ,  -4.67086559,   4.67763313,  15.87376364,
        12.65788494,   9.55047989,  12.99373167,   8.1082334 ,
        12.95528399, -38.46582512,   5.96059765,   8.33723918,
         4.88107426,   6.73080607)



# Binary Categorical / Dummies ( categories: 0,1)

X5 <- c(0., 1., 1., 0., 0., 1., 1., 1., 0., 0., 1., 0., 0., 0., 1., 0., 1.,
       1., 0., 0., 1., 0., 0., 1., 1., 1., 0., 0., 1., 0., 1., 0., 0., 0.,
       0., 0., 0., 1., 0., 0., 0., 0., 1., 0., 1., 0., 1., 0., 1., 0.)

X6 <- c(0., 1., 0., 1., 1., 1., 1., 1., 0., 1., 0., 1., 0., 1., 1., 1., 0.,
       1., 0., 1., 0., 0., 0., 1., 0., 1., 1., 0., 1., 1., 0., 0., 0., 1.,
       1., 1., 1., 1., 0., 0., 1., 0., 0., 0., 0., 1., 0., 1., 1., 1.)    

X7 <- c(1., 1., 0., 0., 0., 1., 0., 0., 1., 1., 0., 1., 0., 0., 0., 0., 1.,
       1., 1., 1., 0., 0., 0., 0., 0., 0., 1., 0., 1., 0., 1., 0., 0., 1.,
       1., 0., 1., 1., 1., 1., 0., 1., 1., 0., 1., 0., 1., 0., 0., 1.) 



# Multiple categorical

X8 <- c(3., 3., 3., 1., 2., 3., 3., 0., 0., 1., 4., 2., 1., 1., 1., 2., 1.,
       3., 2., 1., 4., 1., 4., 0., 3., 2., 2., 0., 0., 1., 4., 2., 1., 2.,
       1., 1., 1., 3., 2., 1., 4., 3., 1., 4., 4., 2., 3., 2., 1., 0.) 

X9 <- c(1., 1., 1., 1., 3., 0., 2., 1., 1., 1., 3., 1., 0., 3., 3., 1., 2.,
       0., 0., 3., 1., 2., 3., 1., 0., 0., 2., 3., 1., 1., 2., 3., 2., 1.,
       2., 0., 0., 2., 1., 2., 1., 1., 3., 0., 2., 0., 2., 1., 0., 2.)       
        
X10 <- c(4., 3., 0., 1., 2., 1., 4., 1., 3., 5., 5., 1., 0., 5., 4., 5., 2.,
       4., 1., 2., 4., 1., 4., 2., 2., 2., 5., 3., 4., 4., 5., 2., 2., 3.,
       2., 5., 2., 2., 3., 2., 5., 0., 1., 2., 1., 1., 4., 3., 0., 5.)        

In [8]:
%%R

Data_set_R <- cbind(X1,X2,X3,X4,X5,X6,X7,X8,X9,X10)

## Statistical Distances <a class="anchor" id="3"></a>



The concept of distance between elements of a set $\varepsilon$ allows us to interpret geometrically many classical techniques of multivariate analysis.

This interpretation is possible both with quantitative and categorical variables, or even when no variables are available, as long as it makes sense to obtain a measure of proximity between the elements of $\varepsilon$



###  Distance Definition <a class="anchor" id="4"></a>



Given a set of elements $\Omega$

#### Almost-metric <a class="anchor" id="1"></a>


It is called **quasi-metric** or **dissimilarity** to any mapping $\delta : \Omega \hspace{0.05cm}x\hspace{0.05cm} \Omega \rightarrow \mathbb{R}$ that satisfies the following properties:



1) $\hspace{0.15cm}\delta (i,j) \geq 0 \hspace{0.25cm}, \forall i,j \in \Omega$

2) $\hspace{0.15cm}\delta (i,i) = 0 \hspace{0.25cm}, \forall i \in  \Omega$

3) $\hspace{0.15cm}\delta (i,j) = \delta (j, i) \hspace{0.25cm}, \forall i,j \in \Omega $



#### Semi-metric <a class="anchor" id="6"></a>


It is called **semi-metric** to any dissimilarity (quasi-metric)  that satisfies the triangular inequality:



4) $\hspace{0.15cm} \delta (i,j) \hspace{0.1 cm}\leq \hspace{0.1 cm} \delta (i,k) + \delta (k,j) \hspace{0.25cm}, \forall i,j,k \in \Omega$



#### Metric <a class="anchor" id="7"></a>


It is called a **metric** to any semi-metric that satisfies:

5) $\hspace{0.15cm} \delta (i,j)=0 \hspace{0.15cm}\Leftrightarrow\hspace{0.15cm} i=j$




#### Distance <a class="anchor" id="9"></a>

A **distance** is a metric or semi-metric
 

### Distance Matrix <a class="anchor" id="10"></a>



When $\varepsilon$ is a finite set, we will have a distance matrix:



$$
D= \begin{pmatrix}
0 & \delta_{12}&...&\delta_{1n}\\
\delta_{21} & 0&...&\delta_{2n}\\
...&...&...&...\\
\delta_{n1}& \delta_{n2}&...& 0\\
\end{pmatrix}
$$
con $\delta_{ij}=\delta_{ji}$



We will also use the matrix of squares of distances:



$$
D^{(2)}= 
\begin{pmatrix}
0 & \delta^2_{12}&...&\delta^2_{1n}\\
\delta^2_{21} & 0&...&\delta^2_{2n}\\
...&...&...&...\\
\delta^2_{n1}& \delta^2_{n2}&...& 0\\
\end{pmatrix}
$$





No debe confundirse con  $D^2=D\cdot D$



## Distances with quantitative variables <a class="anchor" id="11"></a>

Let $X_1,...,X_p$ be quantitative variables,

Let $\hspace{0.1cm} x_i=(x_{i1},...,x_{ip})^t \hspace{0.1cm}$ and $\hspace{0.1cm} x_j=(x_{j1},...,x_{jp})^t \hspace{0.1cm}$ the values ​​(observations) of the variables $X_1,...,X_p$ for the elements or individuals $i$ and $j$ of the sample $\Omega$.


### Euclidean Distance <a class="anchor" id="12"></a>


 
The Euclidean distance between the elements / individuals $i$ and $j$ of $\Omega$ with respect to the quantitative variables $X_1,...,X_p$ is defined as:



 $$
\delta^2(i,j)_{Euclidea} = \sum_{k=1}^{p} (x_{ik} - x_{jk})\hspace{0.05cm}^2 = (x_i - x_j)\hspace{0.05cm}^t\cdot (x_i - x_j) = sum \left( \hspace{0.05cm} (x_i - x_j)^2 \hspace{0.05cm} \right)
$$



$$
\delta(i,j)_{Euclidea} =\sqrt{\sum_{k=1}^{p} (x_{ik} - x_{jk})\hspace{0.05cm}^2  }  = \sqrt{(x_i - x_j)\hspace{0.05cm}^t\cdot (x_i - x_j)} = \sqrt{ sum \left( \hspace{0.05cm} (x_i - x_j)^2  \hspace{0.05cm} \right) }
 $$

 


Where $\hspace{0.05cm} sum \left( \hspace{0.05cm} (x_i - x_j)^2 \hspace{0.05cm} \right) \hspace{0.05cm}$ is a vectorial operation.

**Observation:**

Given two vectors $v=(v_1,...,v_n)^t$ and $w=(w_1,...,w_n)^t$

The Euclidean distance between the vectors v and w is :

$$
\delta^2(v,w)_{Euclidean} \hspace{0.07cm}=\hspace{0.07cm}  sum( (v-w)^2)  \hspace{0.07cm}=\hspace{0.07cm}  \sum_{i=1}^{n} (v_{i} - w_{i})\hspace{0.05cm}^2   
$$


So that, the Euclidean distance between the elements  $i$ and $j$ of $\Omega$ with respect to the quantitative variables $X_1,X_2,...,X_p$ is the Euclidean distance between the vectors $x_i=(x_{i1},x_{i2},...,x_{ip})$ and $x_j=(x_{j1},x_{j2},...,x_{jp})$

#### Disadvantages <a class="anchor" id="13"></a>


 
Although it is one of the most popular distances, it is not suitable in many cases for the following reasons:

1) It assumes that the variables are uncorrelated and with unit variance (although this last problem can be solved by standardizing the variables to unit variance by dividing them by their respective standard deviations).

2) It is not invariant against changes in scale (changes in measurement units) of the variables.


 
Let's see what this means in more detail:

If a change of scale is applied to the variables $a\cdot X_j + b$, with $a\neq 1$ and $b\neq 0$

Now the observations for elements $i$ and $j$ are $a\cdot x_i + b$ and $a\cdot x_j + b$

Then the Euclidean distance between the elements $i$ and $j$ with respect to the scaled variables $a\cdot X_j + b$ is:

$$
\delta^2(i,j)_{Euclidea} = a^2 \cdot (x_i - x_j)^t\cdot (x_i - x_j)
$$


### Euclidean Distance in `R` <a class="anchor" id="14"></a>

In [9]:
%%R 

Dist_Euclidea_R <- function(i,j, Quantitative_Data_set){

Quantitative_Data_set=as.matrix(Quantitative_Data_set)  
  
Dist_Euclidea = sum( (Quantitative_Data_set[i,] - Quantitative_Data_set[j,])^2 )

Dist_Euclidea = sqrt(Dist_Euclidea)

return(Dist_Euclidea)

}

In [10]:
%%R

library(tidyverse)

Data_set_R <- as.data.frame(Data_set_R)

Quantitative_Data_R <- Data_set_R %>% select(1:4)

R[write to console]: -- Attaching packages --------------------------------------- tidyverse 1.3.1 --

R[write to console]: v ggplot2 3.3.6     v purrr   0.3.4
v tibble  3.1.7     v dplyr   1.0.9
v tidyr   1.2.0     v stringr 1.4.0
v readr   2.1.2     v forcats 0.5.1

R[write to console]: -- Conflicts ------------------------------------------ tidyverse_conflicts() --
x dplyr::filter() masks stats::filter()
x dplyr::lag()    masks stats::lag()



In [11]:
%%R 

Dist_Euclidea_R(1,2, Quantitative_Data_R)

[1] 50.35391


### Euclidean Distance in `Python` <a class="anchor" id="15"></a>


In [12]:
def Dist_Euclidea_Python(i, j, Quantitative_Data_set):

    Dist_Euclidea = ( ( Quantitative_Data_set.iloc[i-1, ] - Quantitative_Data_set.iloc[j-1, ] )**2 ).sum()

    Dist_Euclidea = np.sqrt(Dist_Euclidea)

    return Dist_Euclidea

In [13]:
Quantitative_Data_Python = Data_set_Python.iloc[ : , [0,1,2,3] ] 

In [14]:
Dist_Euclidea_Python(1, 2, Quantitative_Data_Python)

50.35390686386084

### Euclidean Distance Matrix in `R` <a class="anchor" id="16"></a>

In [15]:
%%R

Dist_Euclidea_Matrix_R <- function( Quantitative_Data_set ){
  
  Quantitative_Data_set=as.matrix(Quantitative_Data_set)
  
  M <- matrix(NA, ncol =dim(Quantitative_Data_set)[1] , nrow=dim(Quantitative_Data_set)[1] )
  
  for(i in 1:dim(Quantitative_Data_set)[1] ){
    for(j in 1:dim(Quantitative_Data_set)[1]){
    
      M[i,j]=Dist_Euclidea_R(i,j, Quantitative_Data_set)
  
   }
  }
  return(M)
}

In [16]:
%%R

Dist_Euclidea_Matrix_R(Quantitative_Data_R)[1:10,1:10]

          [,1]     [,2]     [,3]     [,4]     [,5]     [,6]     [,7]     [,8]
 [1,]  0.00000 50.35391 55.88544 44.47121 30.28231 55.01982 39.33222 61.13876
 [2,] 50.35391  0.00000 64.28101 84.76773 45.34684 44.83372 67.37388 48.64144
 [3,] 55.88544 64.28101  0.00000 54.98164 31.38217 34.47783 52.43096 42.18753
 [4,] 44.47121 84.76773 54.98164  0.00000 45.39912 73.81242 58.97470 81.31139
 [5,] 30.28231 45.34684 31.38217 45.39912  0.00000 40.99075 38.75224 38.57564
 [6,] 55.01982 44.83372 34.47783 73.81242 40.99075  0.00000 61.54206 44.70198
 [7,] 39.33222 67.37388 52.43096 58.97470 38.75224 61.54206  0.00000 47.62804
 [8,] 61.13876 48.64144 42.18753 81.31139 38.57564 44.70198 47.62804  0.00000
 [9,] 57.35650 55.05364 44.72150 79.73168 50.86690 20.92743 57.58937 51.56037
[10,] 49.39225 43.05869 44.39039 74.35722 31.57531 45.31425 37.16027 14.05185
          [,9]    [,10]
 [1,] 57.35650 49.39225
 [2,] 55.05364 43.05869
 [3,] 44.72150 44.39039
 [4,] 79.73168 74.35722
 [5,] 50.86690 31.5753

In [17]:
%%R

Dist_Euclidea_Matrix_R(Quantitative_Data_R)[1,2]

[1] 50.35391


In [18]:
%%R

Dist_Euclidea_Matrix_R(Quantitative_Data_R)[5,3]

[1] 31.38217


### Euclidean Distance Matrix in `Python` <a class="anchor" id="17"></a>

In [19]:
def Dist_Euclidea_Matrix_Python( Quantitative_Data_set ):

    M = np.zeros((Quantitative_Data_set.shape[0] , Quantitative_Data_set.shape[0]))

    for i in range(0 , Quantitative_Data_set.shape[0]):
        for j in range(0 , Quantitative_Data_set.shape[0]):

            M[i,j]=Dist_Euclidea_Python(i+1,j+1, Quantitative_Data_set)
                 
    return M

In [20]:
np.set_printoptions(threshold=np.inf)

In [21]:
Dist_Euclidea_Matrix_Python(Quantitative_Data_Python)

array([[ 0.        , 50.35390686, 55.88543575, 44.47121042, 30.28230619,
        55.0198165 , 39.33222296, 61.13875643, 57.35650319, 49.39224756,
        19.18668603, 36.45683618, 50.06287534,  9.76176818, 42.61250194,
        37.64782144, 64.1564009 , 57.32179616, 37.17418244, 51.50386721,
        44.92383319, 43.43833526, 40.57450703, 34.19526501, 19.39727407,
        42.32909321, 32.85049264, 22.50294156, 18.65831141, 19.06081274,
        47.19933709, 41.86399342, 52.95147825, 33.684141  , 35.71670108,
        44.692872  , 28.3142343 , 36.82009972, 53.27475603, 33.58550615,
        42.97081537, 13.03742629, 26.20711906, 45.25990645, 25.80158224,
        66.12512498, 52.92367433, 26.82407183, 42.9370094 , 52.72631243],
       [50.35390686,  0.        , 64.28100719, 84.76773428, 45.34683551,
        44.83371544, 67.37388444, 48.6414443 , 55.05364378, 43.05869447,
        34.70508944, 58.39668596, 28.6240421 , 45.20726822, 62.20843455,
        51.84754056, 31.96851235, 38.74606183, 59.

Beware of the Python output:

Each array represents a row, and each element within the array represents a column. Therefore element 3 of array 5 is the element of row 5 and column 3 of the distance matrix.

In [22]:
Dist_Euclidea_Matrix_Python(Quantitative_Data_Python)[1,2]

64.28100718719624

In [23]:
Dist_Euclidea_Matrix_Python(Quantitative_Data_Python)[5,3]

73.81241882537256

### Minkowski Distance <a class="anchor" id="18"></a>



The Minkowski distance with parameter $\hspace{0.1cm} q=1,2,3,... \hspace{0.1cm}$ between the individuals $i$ and $j$ with respect to the quantitative variables $X_1,. ..,X_k$ is:


$$
\delta_q(i,j)_{Minkowski } = \left( \sum_{k=1}^{p}  \mid x_{ik} - x_{jk} \mid  ^q  \right)^{(1/q)} =  sum \left( \hspace{0.1cm} \mid x_i - x_j \mid  ^q \hspace{0.1cm}\right)^{(1/q)}    
$$

**Observation:**

Given two vectors $v=(v_1,...,v_n)^t$ and $w=(w_1,...,w_n)^t$

The Minkowski distance between the vectors v and w is :

$$
\delta_q(v,w)_{Minkowski} \hspace{0.07cm}=\hspace{0.07cm}  sum \left( \hspace{0.1cm} \mid v - w \mid  ^q \hspace{0.1cm}\right)^{(1/q)}  \hspace{0.07cm}=\hspace{0.07cm}  \left( \sum_{i=1}^{n}  \mid v_{i } - w_{i} \mid  ^q  \right)^{(1/q)}    
$$


So that, the Minkowski distance between the elements  $i$ and $j$ of $\Omega$ with respect to the quantitative variables $X_1,X_2,...,X_p$ is the Minkowski distance between the vectors $x_i=(x_{i1},x_{i2},...,x_{ip})$ and $x_j=(x_{j1},x_{j2},...,x_{jp})$

#### Disadvantages <a class="anchor" id="19"></a>


1) **Assumes** that the **variables** are **uncorrelated** and with **unit variance**.

2) It is **not invariant** against changes in scale (changes in measurement units) of the variables.

3) It is **hardly Euclideanizable** (we will see later what this means).


#### Particular cases of the Minkowski distance <a class="anchor" id="20"></a>


##### Euclidean Distance <a class="anchor" id="21"></a>


\begin{gather*}
 \delta_2(i,j)_{Minkowski }=\delta (i,j)_{Euclidea }   \hspace{1cm} (q=2)
 \end{gather*}
 


##### Manhattan Distance <a class="anchor" id="22"></a>



\begin{gather*}
 \delta_1(i,j)_{Minkowski } \hspace{0.1cm}=\hspace{0.1cm} \sum_{k=1}^{p}  \mid x_{ik} - x_{jk}  \mid  \hspace{0.1cm}=\hspace{0.1cm}  sum \left( \hspace{0.1cm} \mid x_i - x_j \mid \hspace{0.1cm} \right) \hspace{1cm} (q=1)
 \end{gather*}



 
##### Dominant Distance <a class="anchor" id="23"></a>



\begin{gather*}
 \delta_{\infty}(i,j)_{Minkowski } \hspace{0.1cm}=\hspace{0.1cm} max \lbrace  \hspace{0.1cm} \mid x_{i1} - x_{j1} \mid \hspace{0.1cm},...,\hspace{0.1cm} \mid x_{ip} - x_{jp} \mid \hspace{0.1cm}  \rbrace \hspace{0.1cm}=\hspace{0.1cm} max \left( \mid x_i - x_j \mid \right) \hspace{1cm} (q\rightarrow \infty)
 \end{gather*}


### Minkowski Distance in `R` <a class="anchor" id="24"></a>

In [24]:
%%R

Dist_Minkowski_R <- function(i,j, q , Quantitative_Data_set){
  
Quantitative_Data_set=as.matrix(Quantitative_Data_set)  

Dist_Minkowski = ( sum( ( abs(Quantitative_Data_set[i,] - Quantitative_Data_set[j,]) )^q ) )^(1/q)
  
return(Dist_Minkowski)
}

Particular cases:

**Euclidean Distance** $\hspace{0.1cm} (q=2)$

In [25]:
%%R

Dist_Minkowski_R(1,2, q=2, Quantitative_Data_R)

[1] 50.35391


  **Manhattan Distance** $\hspace{0.1cm} (q=1)$

In [26]:
%%R

Dist_Minkowski_R(1,2, q=1, Quantitative_Data_R)

[1] 74.82187


### Minkowski Distance in `Python` <a class="anchor" id="25"></a>

In [27]:
def Dist_Minkowski_Python(i,j, q , Quantitative_Data_set):

    Dist_Minkowski = ( ( ( ( Quantitative_Data_set.iloc[i-1, ] - Quantitative_Data_set.iloc[j-1, ] ).abs() )**q ).sum() )**(1/q)

    return Dist_Minkowski

Particular cases:

**Euclidean Distance** $\hspace{0.1cm} (q=2)$

In [28]:
Dist_Minkowski_Python(1,2, 2 , Quantitative_Data_Python)

50.35390686386084

  **Manhattan Distance** $\hspace{0.1cm} (q=1)$

In [29]:
Dist_Minkowski_Python(1,2, 1 , Quantitative_Data_Python)

74.821869942812

### Minkowski Distance Matrix in `R` <a class="anchor" id="26"></a>

In [30]:
%%R

Dist_Minkowski_Matrix_R <- function(q , Quantitative_Data_set ){
  
  Quantitative_Data_set=as.matrix(Quantitative_Data_set)
  
  M<-matrix(NA, ncol =dim(Quantitative_Data_set)[1] , nrow=dim(Quantitative_Data_set)[1] )
  
  for(i in 1:dim(Quantitative_Data_set)[1] ){
    for(j in 1:dim(Quantitative_Data_set)[1]){
    
  M[i,j]=Dist_Minkowski_R(i,j, q , Quantitative_Data_set)
  
   }
  }
 return(M)
}

In [31]:
%%R

Dist_Minkowski_Matrix_R(q=2 , Quantitative_Data_R)[1:10,1:10]

          [,1]     [,2]     [,3]     [,4]     [,5]     [,6]     [,7]     [,8]
 [1,]  0.00000 50.35391 55.88544 44.47121 30.28231 55.01982 39.33222 61.13876
 [2,] 50.35391  0.00000 64.28101 84.76773 45.34684 44.83372 67.37388 48.64144
 [3,] 55.88544 64.28101  0.00000 54.98164 31.38217 34.47783 52.43096 42.18753
 [4,] 44.47121 84.76773 54.98164  0.00000 45.39912 73.81242 58.97470 81.31139
 [5,] 30.28231 45.34684 31.38217 45.39912  0.00000 40.99075 38.75224 38.57564
 [6,] 55.01982 44.83372 34.47783 73.81242 40.99075  0.00000 61.54206 44.70198
 [7,] 39.33222 67.37388 52.43096 58.97470 38.75224 61.54206  0.00000 47.62804
 [8,] 61.13876 48.64144 42.18753 81.31139 38.57564 44.70198 47.62804  0.00000
 [9,] 57.35650 55.05364 44.72150 79.73168 50.86690 20.92743 57.58937 51.56037
[10,] 49.39225 43.05869 44.39039 74.35722 31.57531 45.31425 37.16027 14.05185
          [,9]    [,10]
 [1,] 57.35650 49.39225
 [2,] 55.05364 43.05869
 [3,] 44.72150 44.39039
 [4,] 79.73168 74.35722
 [5,] 50.86690 31.5753

### Minkowski Distance Matrix in `Python` <a class="anchor" id="27"></a>

In [32]:
def Dist_Minkowski_Matrix_Python(q , Quantitative_Data_set):

    M = np.zeros((Quantitative_Data_set.shape[0] , Quantitative_Data_set.shape[0]))

    for i in range(0 , Quantitative_Data_set.shape[0]):
        for j in range(0 , Quantitative_Data_set.shape[0]):

            M[i,j] = Dist_Minkowski_Python(i+1,j+1, q, Quantitative_Data_set)
                 
    return M

In [33]:
Dist_Minkowski_Matrix_Python(2 , Quantitative_Data_Python)

array([[ 0.        , 50.35390686, 55.88543575, 44.47121042, 30.28230619,
        55.0198165 , 39.33222296, 61.13875643, 57.35650319, 49.39224756,
        19.18668603, 36.45683618, 50.06287534,  9.76176818, 42.61250194,
        37.64782144, 64.1564009 , 57.32179616, 37.17418244, 51.50386721,
        44.92383319, 43.43833526, 40.57450703, 34.19526501, 19.39727407,
        42.32909321, 32.85049264, 22.50294156, 18.65831141, 19.06081274,
        47.19933709, 41.86399342, 52.95147825, 33.684141  , 35.71670108,
        44.692872  , 28.3142343 , 36.82009972, 53.27475603, 33.58550615,
        42.97081537, 13.03742629, 26.20711906, 45.25990645, 25.80158224,
        66.12512498, 52.92367433, 26.82407183, 42.9370094 , 52.72631243],
       [50.35390686,  0.        , 64.28100719, 84.76773428, 45.34683551,
        44.83371544, 67.37388444, 48.6414443 , 55.05364378, 43.05869447,
        34.70508944, 58.39668596, 28.6240421 , 45.20726822, 62.20843455,
        51.84754056, 31.96851235, 38.74606183, 59.

### Dominant Distance in `R` <a class="anchor" id="28"></a>

In [34]:
%%R

Dist_Dominant_R <- function(i,j, Quantitative_Data_set){
  
  Quantitative_Data_set=as.matrix(Quantitative_Data_set)  

  Dist_Dominante =  max( abs(Quantitative_Data_set[i,] - Quantitative_Data_set[j,]) )
  
return(Dist_Dominante)
}

In [35]:
%%R

Dist_Dominant_R(1,2, Quantitative_Data_R)

[1] 39.29914


### Dominant Distance in `Python` <a class="anchor" id="29"></a>

In [36]:
def Dist_Dominant_Python(i, j, Quantitative_Data_set):

    Dist_Dominante = ( (Quantitative_Data_set.iloc[i-1,] - Quantitative_Data_set.iloc[j-1,]).abs() ).max()

    return Dist_Dominante

In [37]:
Dist_Dominant_Python(1, 2, Quantitative_Data_Python)

39.299139311884204

### Dominant Distance Matrix in `R` <a class="anchor" id="30"></a>

In [38]:
%%R

Dist_Dominant_Matrix_R <- function( Quantitative_Data_set ){
  
  Quantitative_Data_set=as.matrix(Quantitative_Data_set)
  
  M<-matrix(NA, ncol =dim(Quantitative_Data_set)[1] , nrow=dim(Quantitative_Data_set)[1] )
  
  for(i in 1:dim(Quantitative_Data_set)[1] ){
    for(j in 1:dim(Quantitative_Data_set)[1]){
    
       M[i,j]=Dist_Dominant_R(i,j,  Quantitative_Data_set)
  
   }
  }
 return(M)
}

In [39]:
%%R

Dist_Dominant_Matrix_R(Quantitative_Data_R)[1:10,1:10]

          [,1]     [,2]     [,3]     [,4]     [,5]     [,6]     [,7]     [,8]
 [1,]  0.00000 39.29914 45.56696 31.96028 19.85653 41.05601 32.77188 45.73457
 [2,] 39.29914  0.00000 41.73751 68.64288 29.29893 31.62156 51.36037 41.90512
 [3,] 45.56696 41.73751  0.00000 38.12662 25.71043 23.69747 40.64487 38.61361
 [4,] 31.96028 68.64288 38.12662  0.00000 39.34395 47.36597 36.46087 66.90419
 [5,] 19.85653 29.29893 25.71043 39.34395  0.00000 33.45055 27.72118 27.56025
 [6,] 41.05601 31.62156 23.69747 47.36597 33.45055  0.00000 61.17174 31.20524
 [7,] 32.77188 51.36037 40.64487 36.46087 27.72118 61.17174  0.00000 30.62288
 [8,] 45.73457 41.90512 38.61361 66.90419 27.56025 31.20524 30.62288  0.00000
 [9,] 41.84594 38.01649 41.83720 48.81189 35.95369 18.13973 55.38923 34.32184
[10,] 35.45217 31.62271 34.76512 63.05571 23.71176 37.77265 26.77439 10.28240
          [,9]    [,10]
 [1,] 41.84594 35.45217
 [2,] 38.01649 31.62271
 [3,] 41.83720 34.76512
 [4,] 48.81189 63.05571
 [5,] 35.95369 23.7117

### Dominant Distance Matrix in `Python` <a class="anchor" id="31"></a>

In [40]:
def Dist_Dominant_Matrix_Python(Quantitative_Data_set):

    M = np.zeros((Quantitative_Data_set.shape[0] , Quantitative_Data_set.shape[0]))

    for i in range(0 , Quantitative_Data_set.shape[0]):
        for j in range(0 , Quantitative_Data_set.shape[0]):

            M[i,j] = Dist_Dominant_Python(i+1,j+1, Quantitative_Data_set)
                 
    return M

In [41]:
Dist_Dominant_Matrix_Python(Quantitative_Data_Python)

array([[ 0.        , 39.29913931, 45.56696322, 31.96027699, 19.8565283 ,
        41.05600711, 32.77187572, 45.73457258, 41.8459419 , 35.45216989,
        16.04601431, 31.33421986, 38.65530344,  6.75986396, 40.67076554,
        31.51982787, 49.37341029, 49.08625038, 31.34526752, 40.00650678,
        27.34498769, 38.64543947, 36.84434058, 33.92189472, 14.49954418,
        35.00718621, 29.89103699, 17.52083607, 14.18342825, 18.08967931,
        38.23984047, 26.13146703, 44.32556717, 31.52090264, 30.19639553,
        42.16306621, 17.86514678, 26.60779972, 50.71798024, 21.8778983 ,
        36.66436299,  9.63058336, 22.12680679, 37.07920893, 21.36329481,
        52.27307227, 52.16993804, 22.47814145, 30.96549914, 49.85660913],
       [39.29913931,  0.        , 41.73750653, 68.64287788, 29.29892925,
        31.62156269, 51.36037035, 41.90511588, 38.01648521, 31.6227132 ,
        25.14347397, 48.31136222, 20.29590307, 37.18705382, 44.85940478,
        33.84943181, 18.59907631, 24.68118984, 59.

### Canberra Distance <a class="anchor" id="32"></a>


The Canberra distance between the elements $i$ and $j$ with respect to the quantitative variables $X_1,...,X_p$ is:


\begin{gather*}
 \delta(i,j)_{Canberra} \hspace{0.1cm}= \hspace{0.1cm} \sum_{k=1}^{p} \dfrac{\mid x_{ik} - x_{jk} \mid}{\mid x_{ik} \mid + \mid x_{jk} \mid}  \hspace{0.1cm}= \hspace{0.1cm} sum \left( \dfrac{\mid x_i - x_j \mid }{ \mid x_i \mid + \mid x_j \mid} \right)
 \end{gather*}


**Observation:**

Given two vectors $v=(v_1,...,v_n)^t$ and $w=(w_1,...,w_n)^t$

The Euclidean distance between the vectors $v$ and $w$ is :

$$
\delta (v,w)_{Canberra} \hspace{0.07cm}=\hspace{0.07cm}  sum \left( \dfrac{ \mid v-w \mid }{\mid v \mid + \mid w \mid} \right)  \hspace{0.07cm}=\hspace{0.07cm}  \sum_{i=1}^{n} \dfrac{ \mid v_i - w_i \mid }{\mid v_i \mid + \mid w_i \mid}    
$$


So that, the Canberra distance between the elements  $i$ and $j$ of $\Omega$ with respect to the quantitative variables $X_1,X_2,...,X_p$ is the Canberra distance between the vectors $x_i=(x_{i1},x_{i2},...,x_{ip})$ and $x_j=(x_{j1},x_{j2},...,x_{jp})$


#### Disadvantages <a class="anchor" id="33.1"></a>


1) Assumes that the variables are uncorrelated and with unit variance.

Although it is invariant against changes in the scale of the variables (changes in units of measurement).


#### Canberra Distance in `R` <a class="anchor" id="33"></a>

In [42]:
%%R

Dist_Canberra_R <- function(i,j,   Quantitative_Data_set){

Quantitative_Data_set=as.matrix(Quantitative_Data_set)  

Dist_Canberra =   sum( abs(Quantitative_Data_set[i,] - Quantitative_Data_set[j,])/(abs(Quantitative_Data_set[i,])+ abs(Quantitative_Data_set[j,])) ) 
  
return(Dist_Canberra)
}

In [43]:
%%R

Dist_Canberra_R(1,2, Quantitative_Data_R)

[1] 2.271401


#### Canberra Distance in `Python` <a class="anchor" id="34"></a>

In [44]:
def Dist_Canberra_Python(i,j, Quantitative_Data_set):

    Dist_Canberra =  ( ( Quantitative_Data_set.iloc[i-1, ] - Quantitative_Data_set.iloc[j-1, ] ).abs()  / ( (Quantitative_Data_set.iloc[i-1, ]).abs() + (Quantitative_Data_set.iloc[j-1, ]).abs() ) ).sum()

    return Dist_Canberra

In [45]:
Dist_Canberra_Python(1,2, Quantitative_Data_Python)

2.2714011238550422

#### Canberra Distances Matrix in `R` <a class="anchor" id="35"></a>

In [46]:
%%R

Dist_Canberra_Matrix_R <- function( Quantitative_Data_set ){
  
  Quantitative_Data_set=as.matrix(Quantitative_Data_set)
  
  M <- matrix(NA, ncol =dim(Quantitative_Data_set)[1] , nrow=dim(Quantitative_Data_set)[1] )
  
  for(i in 1:dim(Quantitative_Data_set)[1] ){
    for(j in 1:dim(Quantitative_Data_set)[1]){
    
  M[i,j]=Dist_Canberra_R(i,j,  Quantitative_Data_set)
  
   }
  }
 return(M)
}

In [47]:
%%R

Dist_Canberra_Matrix_R(Quantitative_Data_R)[1:10,1:10]

          [,1]     [,2]     [,3]     [,4]     [,5]      [,6]     [,7]     [,8]
 [1,] 0.000000 2.271401 3.026123 2.415632 3.341754 2.3514434 1.970214 3.742555
 [2,] 2.271401 0.000000 3.273321 3.478087 3.899472 2.2291216 3.124432 2.545852
 [3,] 3.026123 3.273321 0.000000 2.643922 2.103762 1.9074860 2.454391 2.601851
 [4,] 2.415632 3.478087 2.643922 0.000000 3.219714 3.6465720 2.942378 4.000000
 [5,] 3.341754 3.899472 2.103762 3.219714 0.000000 2.3979854 2.519236 3.012739
 [6,] 2.351443 2.229122 1.907486 3.646572 2.397985 0.0000000 1.336006 2.698629
 [7,] 1.970214 3.124432 2.454391 2.942378 2.519236 1.3360055 0.000000 3.016687
 [8,] 3.742555 2.545852 2.601851 4.000000 3.012739 2.6986294 3.016687 0.000000
 [9,] 2.468881 2.439881 1.511044 3.498455 2.685481 0.8387234 1.542040 2.712695
[10,] 2.628528 2.455193 3.162626 3.615132 3.427440 2.2291056 2.249779 1.872681
           [,9]    [,10]
 [1,] 2.4688808 2.628528
 [2,] 2.4398814 2.455193
 [3,] 1.5110440 3.162626
 [4,] 3.4984546 3.615132
 [5,] 

#### Canberra Distances Matrix in `Python` <a class="anchor" id="36"></a>

In [48]:
def Dist_Canberra_Matrix_Python(Quantitative_Data_set):

    M = np.zeros((Quantitative_Data_set.shape[0] , Quantitative_Data_set.shape[0]))

    for i in range(0 , Quantitative_Data_set.shape[0]):
        for j in range(0 , Quantitative_Data_set.shape[0]):

            M[i,j] = Dist_Canberra_Python(i+1,j+1, Quantitative_Data_set)
                 
    return M

In [49]:
Dist_Canberra_Matrix_Python(Quantitative_Data_Python)

array([[0.        , 2.27140112, 3.02612315, 2.41563242, 3.34175397,
        2.35144336, 1.97021427, 3.74255505, 2.4688808 , 2.62852805,
        2.49059882, 2.50809694, 3.06959756, 1.2785115 , 2.3068633 ,
        3.16117705, 3.40820424, 3.59317208, 1.52220832, 3.4023704 ,
        3.54003272, 2.64114513, 2.46638558, 1.33405285, 0.90974886,
        2.93321284, 2.08935924, 2.23435946, 2.18028251, 1.48398078,
        3.09541511, 2.9989188 , 2.83286184, 2.49147232, 2.81615389,
        2.33748768, 3.4183086 , 2.37732022, 2.07101008, 2.68394653,
        2.5459975 , 0.88091472, 2.22317059, 2.39404391, 2.06560516,
        3.48439346, 1.67377591, 1.55580616, 2.98562112, 2.57995549],
       [2.27140112, 0.        , 3.27332082, 3.4780875 , 3.89947207,
        2.22912162, 3.12443178, 2.54585179, 2.4398814 , 2.45519252,
        2.71733421, 2.82407231, 2.11538526, 2.42705311, 2.93125554,
        3.75498389, 1.96233654, 3.26329826, 1.29906823, 1.65829792,
        3.08473692, 1.74877024, 4.        , 1.2

### Karl Pearson Distance <a class="anchor" id="37"></a>

 
The Karl Pearson distance between the elements $i$ and $j$ with respect to the quantitative variables $X_1,...,X_p$ is:


$$
  \delta^2(i,j)_{KP}= \sum_{k=1}^{p} \dfrac{( x_{ik} - x_{jk} )\hspace{0.03cm}^2 }{s\hspace{0.03cm}^2_k} \hspace{0.1cm} = \hspace{0.1cm} (x_i - x_j)\hspace{0.03cm}^t \cdot S_0^{-1} \cdot (x_i - x_j )   \hspace{0.1cm} = \hspace{0.1cm} sum \left( \hspace{0.07cm} \left(  \dfrac{ x_i - x_j  }{ \overrightarrow{s}   }\right)^2 \hspace{0.07cm} \right)
$$

$$
 \delta(i,j)_{KP}= \sqrt{ \sum_{k=1}^{p} \dfrac{( x_{ik} - x_{jk} )\hspace{0.03cm}^2 }{s\hspace{0.03cm}^2_k}} = \sqrt{(x_i - x_j)\hspace{0.03cm}^t \cdot S_0^{-1} \cdot (x_i - x_j )}  \hspace{0.1cm} = \hspace{0.1cm} \sqrt{  sum \left( \hspace{0.07cm} \left(  \dfrac{ x_i - x_j  }{ \overrightarrow{s}   }\right)^2 \hspace{0.07cm} \right)}
  $$

 Where:
 
 $S_0 = diag(s_1 ^2 ,..., s_p ^2)$
 
 $s_k ^2$ is the variance of $X_k$

 $\overrightarrow{s} = ( s_1 ,..., s_p )$



**Observation:**

Given two vectors $v=(v_1,...,v_n)^t$ and $w=(w_1,...,w_n)^t$  

Given the  $2 \hspace{0.05cm} x \hspace{0.05cm} n$ matrix 

$$
M =  \begin{pmatrix}
    v\\
    w
    
    \end{pmatrix} =
     \begin{pmatrix}
    v_1 &  v_2 &...& v_n\\
    w_1 &  w_2 &...& w_n
    
    \end{pmatrix} = 
    \begin{pmatrix}
    C_1 & C2 &...& C_n
    \end{pmatrix}
$$    



The Pearson distance between the vectors $v$ and $w$ is :

$$
\delta (v,w)_{Pearson} \hspace{0.07cm}=\hspace{0.07cm}  \sqrt{  sum \left( \hspace{0.07cm} \left(  \dfrac{ x_i - x_j  }{ \overrightarrow{s}   }\right)^2 \hspace{0.07cm} \right)}   
$$

Where: 

 $\overrightarrow{s} = ( s_1 ,..., s_n )\hspace{0.07cm}$ and $\hspace{0.07cm} s_k \hspace{0.07cm}$ is the standard deviation of $\hspace{0.07cm} C_k = (v_k , w_k)^t$

¿¿¿¿¿¿¿¿¿¿¿¿¿¿¿¿¿¿¿¿¿¿¿¿
So that, the Pearson distance between the elements  $i$ and $j$ of $\Omega$ with respect to the quantitative variables $X_1,X_2,...,X_p$ is the Pearson distance between the vectors $x_i=(x_{i1},x_{i2},...,x_{ip})$ and $x_j=(x_{j1},x_{j2},...,x_{jp})$ ??????????????????'??

**Observation:**

With the Karl Pearson distance, the weight attributed to the difference between individuals with respect to a certain variable increases when the variance of the variable decreases, and vice versa.


#### Disadvantages <a class="anchor" id="38.1"></a>
 
1) Assumes that the variables are uncorrelated and with unit variance.

Although it is invariant against changes in scale (changes in units of measurement) of the variables.

#### Pearson Distance in `R` <a class="anchor" id="38"></a>

In [50]:
%%R

Dist_Pearson_R <- function(i,j, Quantitative_Data_set){

Quantitative_Data_set=as.matrix(Quantitative_Data_set)  
  
Dist_Pearson = sum( ((Quantitative_Data_set[i,] - Quantitative_Data_set[j,])^2)/diag(cov(Quantitative_Data_set)) ) 

Dist_Pearson = sqrt(Dist_Pearson)

return( Dist_Pearson )
}

In [51]:
%%R

Dist_Pearson_R(1,2, Quantitative_Data_R)

[1] 3.112755


#### Pearson Distance in `Python` <a class="anchor" id="39"></a>

In [52]:
def Dist_Pearson_Python(i, j, Quantitative_Data_set):

    Dist_Pearson = ( ( Quantitative_Data_set.iloc[i-1, ] - Quantitative_Data_set.iloc[j-1, ] )**2 / Quantitative_Data_set.var() ).sum()

    Dist_Pearson = np.sqrt(Dist_Pearson)

    return Dist_Pearson

In [53]:
Dist_Pearson_Python(1, 2, Quantitative_Data_Python)

3.1127552054306986

#### Pearson Distance Matrix in `R` <a class="anchor" id="40"></a>

In [54]:
%%R

Dist_Pearson_Matrix_R <- function( Quantitative_Data_set ){
  
  Quantitative_Data_set=as.matrix(Quantitative_Data_set)
  
  M<-matrix(NA, ncol =dim(Quantitative_Data_set)[1] , nrow=dim(Quantitative_Data_set)[1] )
  
  for(i in 1:dim(Quantitative_Data_set)[1] ){
    for(j in 1:dim(Quantitative_Data_set)[1]){
    
  M[i,j]=Dist_Pearson_R(i,j,  Quantitative_Data_set)
  
   }
  }
 return(M)
}

In [55]:
%%R

Dist_Pearson_Matrix_R(Quantitative_Data_R)[1:10,1:10]

          [,1]     [,2]     [,3]     [,4]     [,5]     [,6]     [,7]      [,8]
 [1,] 0.000000 3.112755 3.547553 3.066054 2.011176 3.224333 2.386857 3.8910984
 [2,] 3.112755 0.000000 4.177840 5.489200 2.898739 2.841462 3.990835 2.9955236
 [3,] 3.547553 4.177840 0.000000 3.392456 1.941028 2.255526 3.249619 2.7690344
 [4,] 3.066054 5.489200 3.392456 0.000000 2.949829 4.606760 3.961906 5.2731237
 [5,] 2.011176 2.898739 1.941028 2.949829 0.000000 2.460189 2.444760 2.4775568
 [6,] 3.224333 2.841462 2.255526 4.606760 2.460189 0.000000 3.426500 2.7300081
 [7,] 2.386857 3.990835 3.249619 3.961906 2.444760 3.426500 0.000000 2.9709441
 [8,] 3.891098 2.995524 2.769034 5.273124 2.477557 2.730008 2.970944 0.0000000
 [9,] 3.480684 3.551576 3.162818 5.192758 3.348337 1.462445 3.258870 3.3775867
[10,] 3.152730 2.570858 2.898238 4.891168 2.077519 2.664770 2.315938 0.8839344
          [,9]     [,10]
 [1,] 3.480684 3.1527300
 [2,] 3.551576 2.5708579
 [3,] 3.162818 2.8982378
 [4,] 5.192758 4.8911681
 [5,] 

#### Pearson Distance Matrix in `Python` <a class="anchor" id="41"></a>

In [56]:
def Dist_Pearson_Matrix_Python(Quantitative_Data_set):

    M = np.zeros((Quantitative_Data_set.shape[0] , Quantitative_Data_set.shape[0]))

    for i in range(0 , Quantitative_Data_set.shape[0]):
        for j in range(0 , Quantitative_Data_set.shape[0]):

            M[i,j] = Dist_Pearson_Python(i+1,j+1, Quantitative_Data_set)
                 
    return M

In [57]:
Dist_Pearson_Matrix_Python(Quantitative_Data_Python)

array([[0.        , 3.11275521, 3.54755307, 3.06605427, 2.0111755 ,
        3.22433312, 2.38685702, 3.89109837, 3.48068378, 3.15272998,
        1.2327125 , 2.2470448 , 2.94788998, 0.63124268, 2.64435459,
        2.41249275, 3.85722343, 3.38535513, 2.17794449, 3.26071449,
        2.79568825, 2.50739623, 2.5844437 , 1.90386099, 1.24257151,
        2.72399308, 1.86255996, 1.47280383, 1.14882692, 1.2403834 ,
        2.95588632, 2.64563481, 3.36552558, 2.11542827, 2.10076775,
        3.17133764, 1.79601957, 2.29948959, 3.33503   , 2.03054531,
        2.7046437 , 0.79640872, 1.61301797, 2.7327908 , 1.48762823,
        4.53384449, 2.96169368, 1.58205611, 2.55905853, 2.97982428],
       [3.11275521, 0.        , 4.17783959, 5.48920017, 2.89873938,
        2.84146222, 3.99083519, 2.99552357, 3.55157633, 2.57085786,
        2.09984422, 3.74515837, 1.80937143, 2.84059759, 3.93403035,
        3.30813741, 2.05063559, 2.53082014, 3.87798155, 1.39957659,
        2.62956082, 1.52346416, 3.50458396, 2.6

## Mahalanobis Distance <a class="anchor" id="42"></a>


The Mahalanobis distance between the elements $i$ and $j$ with respect to the quantitative variables $X_1,...,X_p$ is:

$$
\delta^2(i,j)_{Maha}= (x_i - x_j)\hspace{0.03cm}^t \cdot S^{-1} \cdot (x_i - x_j )
$$

 $$
 \delta(i,j)_{Maha}=\sqrt{(x_i - x_j)\hspace{0.03cm}^t \cdot S^{-1} \cdot (x_i - x_j ) }   
$$

Where:

$S$ is the covariance matrix of the data matrix $X=(X_1,...,X_p)$

**Observation:**

Given two vectors $v=(v_1,...,v_n)^t$ and $w=(w_1,...,w_n)^t$  

Given the following  $\hspace{0.05cm}2 \hspace{0.05cm} x \hspace{0.05cm} n\hspace{0.05cm}$ matrix 

$$
M =  \begin{pmatrix}
    v\\
    w
    
    \end{pmatrix} =
     \begin{pmatrix}
    v_1 &  v_2 &...& v_n\\
    w_1 &  w_2 &...& w_n
    
    \end{pmatrix}  
  
$$    



The Mahalanobis distance between the vectors $v$ and $w$ is :

$$
\delta (v,w)_{Mahalanobis} \hspace{0.07cm}=\hspace{0.07cm}  (v - w)\hspace{0.03cm}^t \cdot S^{-1} \cdot (v - w )
$$

Where: 

 ¿¿¿¿¿¿¿¿ $S$ is the covariance matrix of $M$ ????????

### Advantages <a class="anchor" id="42.1"></a>

The Mahalanobis distance is suitable as a measure of discrepancy between data for the following reasons:

1) It is invariant against linear transformations of the variables

2) It takes into account the correlations between the variables. For example, it doesn't increase due to the fact of increasing the number of variables observed, like others such as the Euclidean, but it will only increase when the new variables aren't correlated with the previous one, so only when the new variables are not redundant  with respect to the information provided by the previous ones.

***Observation***

1) The Euclidean distance is equal to the Mahalanobis distance when $\hspace{0.1cm} S=I$

2) The Karl Pearson distance is equal to the Mahalanobis distance when $\hspace{0.1cm} S=diag(s_1^2 ,..., s_p^2)$

#### Mahalanobis Distance in `R` <a class="anchor" id="43"></a>

In [58]:
%%R

Dist_Mahalanobis_R <- function(i,j,   Quantitative_Data_set){
  
Quantitative_Data_set=as.matrix(Quantitative_Data_set)  

Dist_Mahalanobis  =    t( Quantitative_Data_set[i,] - Quantitative_Data_set[j,] )  %*% solve(cov(Quantitative_Data_set)) %*% ( Quantitative_Data_set[i,] - Quantitative_Data_set[j,] ) 
  
Dist_Mahalanobis <- sqrt(Dist_Mahalanobis)

return( Dist_Mahalanobis )
}

In [59]:
%%R

Dist_Mahalanobis_R(1,2, Quantitative_Data_R)

         [,1]
[1,] 3.065756


#### Mahalanobis Distance in `Python` <a class="anchor" id="44"></a>

In [60]:
def Dist_Mahalanobis_Python(i, j, Quantitative_Data_set):

    x = (Quantitative_Data_set.to_numpy()[i-1, ] - Quantitative_Data_set.to_numpy()[j-1, ])

    x = np.array([x]) # necessary step to transpose a 1D array

    S_inv = np.linalg.inv( Quantitative_Data_set.cov() ) # inverse of covariance matrix

    Dist_Maha = np.sqrt( x @ S_inv @ x.T )  # x @ S_inv @ x.T = np.matmul( np.matmul(x , S_inv) , x.T )

    Dist_Maha = float(Dist_Maha)

    return Dist_Maha

In [61]:
Dist_Mahalanobis_Python(1, 2, Quantitative_Data_Python)

3.0657555948686936

#### Mahalanobis Distance Matrix in `R` <a class="anchor" id="45"></a>

In [62]:
%%R

Dist_Mahalanobis_Matrix_R <- function( Quantitative_Data_set ){
  
  Quantitative_Data_set=as.matrix(Quantitative_Data_set)
  
  M<-matrix(NA, ncol =dim(Quantitative_Data_set)[1] , nrow=dim(Quantitative_Data_set)[1] )
  
  for(i in 1:dim(Quantitative_Data_set)[1] ){
    for(j in 1:dim(Quantitative_Data_set)[1]){
    
  M[i,j]=Dist_Mahalanobis_R(i,j,  Quantitative_Data_set)
  
   }
  }
  
 return(M)
}

In [63]:
%%R

Dist_Mahalanobis_Matrix_R(Quantitative_Data_R)[1:10,1:10]

          [,1]     [,2]     [,3]     [,4]     [,5]     [,6]     [,7]      [,8]
 [1,] 0.000000 3.065756 3.659457 2.911593 2.028289 3.313878 2.253666 3.6675947
 [2,] 3.065756 0.000000 4.436090 5.245863 2.881512 3.354759 4.104878 2.9974192
 [3,] 3.659457 4.436090 0.000000 3.164140 2.185135 2.240367 3.213737 2.7465299
 [4,] 2.911593 5.245863 3.164140 0.000000 2.706227 4.320227 3.470171 4.7088505
 [5,] 2.028289 2.881512 2.185135 2.706227 0.000000 2.697824 2.405110 2.1860227
 [6,] 3.313878 3.354759 2.240367 4.320227 2.697824 0.000000 3.431404 2.9277000
 [7,] 2.253666 4.104878 3.213737 3.470171 2.405110 3.431404 0.000000 2.9839682
 [8,] 3.667595 2.997419 2.746530 4.708851 2.186023 2.927700 2.983968 0.0000000
 [9,] 3.748082 4.311350 3.234364 4.978094 3.769488 1.577960 3.479996 3.8924861
[10,] 2.930631 2.564855 2.959137 4.364786 1.807781 2.922094 2.380072 0.8814192
          [,9]     [,10]
 [1,] 3.748082 2.9306312
 [2,] 4.311350 2.5648551
 [3,] 3.234364 2.9591366
 [4,] 4.978094 4.3647858
 [5,] 

#### Mahalanobis Distance Matrix in `Python` <a class="anchor" id="46"></a>

In [64]:
def Dist_Mahalanobis_Matrix_Python(Quantitative_Data_set):

    M = np.zeros((Quantitative_Data_set.shape[0] , Quantitative_Data_set.shape[0]))

    for i in range(0 , Quantitative_Data_set.shape[0]):
        for j in range(0 , Quantitative_Data_set.shape[0]):

            M[i,j] = Dist_Mahalanobis_Python(i+1,j+1, Quantitative_Data_set)
                 
    return M

In [65]:
Dist_Mahalanobis_Matrix_Python(Quantitative_Data_Python)

array([[0.        , 3.06575559, 3.65945678, 2.91159323, 2.02828882,
        3.31387783, 2.2536659 , 3.6675947 , 3.74808193, 2.93063116,
        1.20777713, 2.48924638, 2.86494136, 0.63744005, 2.92199743,
        2.43287692, 3.68741867, 3.3598746 , 2.31702225, 3.1452949 ,
        2.78518099, 2.45834183, 2.57188489, 1.9361286 , 1.16208643,
        2.60123211, 2.0085199 , 1.48198231, 1.14724136, 1.38300327,
        2.75642098, 2.44253703, 3.04833307, 2.1301019 , 2.1528517 ,
        3.24154747, 1.82217706, 2.21775338, 3.62872004, 1.94193856,
        2.49822145, 0.90434336, 1.86265239, 2.83403814, 1.56739763,
        4.61260177, 2.9518337 , 1.59849624, 2.5369823 , 2.99682363],
       [3.06575559, 0.        , 4.43609031, 5.24586333, 2.8815118 ,
        3.35475863, 4.10487849, 2.9974192 , 4.31135044, 2.56485509,
        2.07776841, 4.09745691, 2.12327037, 2.88009606, 4.45812274,
        3.43374003, 2.4125535 , 2.56302348, 4.01420404, 1.3951464 ,
        2.72827177, 1.60001479, 3.64219098, 2.7

## Distances with categorical variables <a class="anchor" id="47"></a>

### Similarity <a class="anchor" id="48"></a>

It is a dual concept  that expresses the proximity or similarity between two elements.

Given a set of elements $\varepsilon$ , we call similarity to all maps $\hspace{0.1cm} s: \varepsilon x \varepsilon \rightarrow \mathbb{R}\hspace{0.1cm}$   such that:

1) $\hspace{0.1cm} 0 \leq s_{ij} \leq   1$

2) $\hspace{0.1cm} s_{ii} = 1$

3)  $\hspace{0.1cm} s_{ij} = s_{ji}$

 


### Similarity Matrix <a class="anchor" id="49"></a>


$$
\mathcal{S}= \begin{pmatrix}
s_{11} & s_{12}&...&s_{1n}\\
s_{21} & s_{22}&...&s_{2n}\\
...&...&...&...\\
s_{n1}& \delta_{n2}&...& s_{nn}\\
\end{pmatrix}
$$

With $s_{ii}=1$ and $s_{ij}=s_{ji}$



### Go from similarity to distance <a class="anchor" id="50"></a>

 

The following transformations allow to go from a measure of similarity to a distance:

1) $\hspace{0.15cm} \delta_{ij}=1-s_{ij} $


2) $\hspace{0.15cm} \delta_{ij}=\sqrt{1-s_{ij}} $

3) Gower transformation: $\hspace{0.15cm} \delta^2_{ij} = s_{ii} + s_{jj} - 2\cdot s_{ij}$

### Similarities with binary categorical variables <a class="anchor" id="51"></a>

Let $X_1,...,X_p$ be binary categorical variables, $ Range(X_j) = \lbrace 0,1 \rbrace$


The main coefficients of similarity between two individuals/elements with respect to binary variables are usually calculated from the following parameters:

 - $a_{ij}\hspace{0.1cm}=\hspace{0.1cm}$ nº of binary variables with response 1 in both elements $i$ and $j$

 - $b_{ij}\hspace{0.1cm}=\hspace{0.1cm}$ nº of binary variables with response 0 in element $i$ and response 1 in $j$

 - $c_{ij}\hspace{0.1cm}=\hspace{0.1cm}$ nº of binary variables with response 1 in element $i$ and response 0 in $j$

 - $d_{ij}\hspace{0.1cm}=\hspace{0.1cm}$ nº of binary variables with response 0 in both elements $i$ and $j$
 


 **Observation:**

$a_{ij} + b_{ij} + c_{ij} +d_{ij} =p$


### Arrays with parameters a, b, c and d <a class="anchor" id="52"></a>

Given a data matrix $\hspace{0.03cm}X=(X_1,...,X_p)\hspace{0.03cm}$ of size $\hspace{0.03cm}n\hspace{0.03cm}x\hspace{0.03cm}p\hspace{0.03cm}$ of **binary categorical variables**, then:

- $ a = X\cdot X^t $

- $ b=(\overrightarrow{1}_{nxp} - X)\cdot X\hspace{0.03cm}^t $

- $ c=b\hspace{0.03cm}^t $

- $ d=(\overrightarrow{1}_{nxp} - X)\cdot(\overrightarrow{1}_{nxp} - X)\hspace{0.03cm}^t $



 Where: $\hspace{0.2cm} p \hspace{0.05cm} = \hspace{0.05cm} $ nº of columns of $X$ , and  $\hspace{0.05cm} n \hspace{0.05cm} = \hspace{0.05cm} $ nº of rows of $X$ 
 
They are the matrices that contain the parameters $\hspace{0.05cm}a_{ij}\hspace{0.05cm}$ , $\hspace{0.05cm}b_{ij}\hspace{0.1cm}$, $\hspace{ 0.05cm}c_{ij}\hspace{0.1cm}$ and $\hspace{0.1cm}d_{ij}\hspace{0.05cm}$ , respectively.


#### Computing the arrays with parameters a, b, c and d  in `R`   <a class="anchor" id="53"></a>

In [66]:
%%R

Binary_Data <- Data_set_R %>% select(5:7)

Binary_Data <- as.matrix(Binary_Data)

In [67]:
%%R

head(Binary_Data)

     X5 X6 X7
[1,]  0  0  1
[2,]  1  1  1
[3,]  1  0  0
[4,]  0  1  0
[5,]  0  1  0
[6,]  1  1  1


In [68]:
%%R

a = Binary_Data %*% t(Binary_Data)

unos<- rep(1, dim(Binary_Data)[2])

Ones_Matrix <- matrix( rep(unos, dim(Binary_Data)[1]), ncol=dim(Binary_Data)[2]) #Matriz de unos de tamano nxp

b = (Ones_Matrix-Binary_Data)%*%t(Binary_Data)

c = t(b)

d = (Ones_Matrix - Binary_Data)%*%t(Ones_Matrix - Binary_Data) 

In [69]:
%%R

a[1:10,1:10]

      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
 [1,]    1    1    0    0    0    1    0    0    1     1
 [2,]    1    3    1    1    1    3    2    2    1     2
 [3,]    0    1    1    0    0    1    1    1    0     0
 [4,]    0    1    0    1    1    1    1    1    0     1
 [5,]    0    1    0    1    1    1    1    1    0     1
 [6,]    1    3    1    1    1    3    2    2    1     2
 [7,]    0    2    1    1    1    2    2    2    0     1
 [8,]    0    2    1    1    1    2    2    2    0     1
 [9,]    1    1    0    0    0    1    0    0    1     1
[10,]    1    2    0    1    1    2    1    1    1     2


In [70]:
%%R

b[1:10,1:10]

      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
 [1,]    0    2    1    1    1    2    2    2    0     1
 [2,]    0    0    0    0    0    0    0    0    0     0
 [3,]    1    2    0    1    1    2    1    1    1     2
 [4,]    1    2    1    0    0    2    1    1    1     1
 [5,]    1    2    1    0    0    2    1    1    1     1
 [6,]    0    0    0    0    0    0    0    0    0     0
 [7,]    1    1    0    0    0    1    0    0    1     1
 [8,]    1    1    0    0    0    1    0    0    1     1
 [9,]    0    2    1    1    1    2    2    2    0     1
[10,]    0    1    1    0    0    1    1    1    0     0


In [71]:
%%R

c[1:10,1:10]

      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
 [1,]    0    0    1    1    1    0    1    1    0     0
 [2,]    2    0    2    2    2    0    1    1    2     1
 [3,]    1    0    0    1    1    0    0    0    1     1
 [4,]    1    0    1    0    0    0    0    0    1     0
 [5,]    1    0    1    0    0    0    0    0    1     0
 [6,]    2    0    2    2    2    0    1    1    2     1
 [7,]    2    0    1    1    1    0    0    0    2     1
 [8,]    2    0    1    1    1    0    0    0    2     1
 [9,]    0    0    1    1    1    0    1    1    0     0
[10,]    1    0    2    1    1    0    1    1    1     0


In [72]:
%%R

d[1:10,1:10]

      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
 [1,]    2    0    1    1    1    0    0    0    2     1
 [2,]    0    0    0    0    0    0    0    0    0     0
 [3,]    1    0    2    1    1    0    1    1    1     0
 [4,]    1    0    1    2    2    0    1    1    1     1
 [5,]    1    0    1    2    2    0    1    1    1     1
 [6,]    0    0    0    0    0    0    0    0    0     0
 [7,]    0    0    1    1    1    0    1    1    0     0
 [8,]    0    0    1    1    1    0    1    1    0     0
 [9,]    2    0    1    1    1    0    0    0    2     1
[10,]    1    0    0    1    1    0    0    0    1     1


#### Computing the arrays with parameters a, b, c and d  in `Python`   <a class="anchor" id="54"></a>

In [73]:
Binary_Data_Py = Data_set_Python.iloc[ : , 4:7]

In [74]:
Binary_Data_Py.head()

Unnamed: 0,X5,X6,X7
0,0.0,0.0,1.0
1,1.0,1.0,1.0
2,1.0,0.0,0.0
3,0.0,1.0,0.0
4,0.0,1.0,0.0


In [75]:
X = Binary_Data_Py

a = X @ X.T

n = X.shape[0]

p = X.shape[1]

ones_matrix = np.ones((n, p)) 

b = (ones_matrix - X) @ X.T

c = b.T

d = (ones_matrix - X) @ (ones_matrix - X).T

In [76]:
a.iloc[0:10 , 0:10]

Unnamed: 0,0,1,2,3,4,5,6,7,8,9
0,1.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,1.0
1,1.0,3.0,1.0,1.0,1.0,3.0,2.0,2.0,1.0,2.0
2,0.0,1.0,1.0,0.0,0.0,1.0,1.0,1.0,0.0,0.0
3,0.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0
4,0.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0
5,1.0,3.0,1.0,1.0,1.0,3.0,2.0,2.0,1.0,2.0
6,0.0,2.0,1.0,1.0,1.0,2.0,2.0,2.0,0.0,1.0
7,0.0,2.0,1.0,1.0,1.0,2.0,2.0,2.0,0.0,1.0
8,1.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,1.0
9,1.0,2.0,0.0,1.0,1.0,2.0,1.0,1.0,1.0,2.0


In [77]:
b.iloc[0:10 , 0:10]

Unnamed: 0,0,1,2,3,4,5,6,7,8,9
0,0.0,2.0,1.0,1.0,1.0,2.0,2.0,2.0,0.0,1.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,1.0,2.0,0.0,1.0,1.0,2.0,1.0,1.0,1.0,2.0
3,1.0,2.0,1.0,0.0,0.0,2.0,1.0,1.0,1.0,1.0
4,1.0,2.0,1.0,0.0,0.0,2.0,1.0,1.0,1.0,1.0
5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,1.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,1.0
7,1.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,1.0
8,0.0,2.0,1.0,1.0,1.0,2.0,2.0,2.0,0.0,1.0
9,0.0,1.0,1.0,0.0,0.0,1.0,1.0,1.0,0.0,0.0


In [78]:
c.iloc[0:10 , 0:10]

Unnamed: 0,0,1,2,3,4,5,6,7,8,9
0,0.0,0.0,1.0,1.0,1.0,0.0,1.0,1.0,0.0,0.0
1,2.0,0.0,2.0,2.0,2.0,0.0,1.0,1.0,2.0,1.0
2,1.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,1.0,1.0
3,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
4,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
5,2.0,0.0,2.0,2.0,2.0,0.0,1.0,1.0,2.0,1.0
6,2.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,2.0,1.0
7,2.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,2.0,1.0
8,0.0,0.0,1.0,1.0,1.0,0.0,1.0,1.0,0.0,0.0
9,1.0,0.0,2.0,1.0,1.0,0.0,1.0,1.0,1.0,0.0


In [79]:
d.iloc[0:10 , 0:10]

Unnamed: 0,0,1,2,3,4,5,6,7,8,9
0,2.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,2.0,1.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,1.0,0.0,2.0,1.0,1.0,0.0,1.0,1.0,1.0,0.0
3,1.0,0.0,1.0,2.0,2.0,0.0,1.0,1.0,1.0,1.0
4,1.0,0.0,1.0,2.0,2.0,0.0,1.0,1.0,1.0,1.0
5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,0.0,0.0,1.0,1.0,1.0,0.0,1.0,1.0,0.0,0.0
7,0.0,0.0,1.0,1.0,1.0,0.0,1.0,1.0,0.0,0.0
8,2.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,2.0,1.0
9,1.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,1.0,1.0


### Sokal Similarity (simple matching coefficient) <a class="anchor" id="55"></a>


The Sokal coefficient of similarity between the elements/individuals $i$ and $j$ with respect to the binary variables $X_1,...,X_p$ is:


\begin{gather*}
S(i,j)_{Sokal}  =\dfrac{a_{ij}+d_{ij}}{a_{ij} + b_{ij} + c_{ij} +d_{ij}} = \dfrac{a_{ij}+d_{ij}}{p} 
\end{gather*}



### Sokal Distance <a class="anchor" id="56"></a>


We get the Sokal distance as:



\begin{gather*}
\delta(i,j)_{Sokal} = \sqrt{S(i,i)_{Sokal} +S(j,j)_{Sokal} - 2\cdot S(i,j)_{Sokal} }
\end{gather*}



####  Sokal Similarity in `R` <a class="anchor" id="57"></a>


In [80]:
%%R

Sokal_Similarity_R <- function(i,j,   Binary_Data_Matrix){
  
  Binary_Data_Matrix=as.matrix(Binary_Data_Matrix)

  
  a= Binary_Data_Matrix %*% t(Binary_Data_Matrix)
  
  Ones <- rep(1, dim(Binary_Data_Matrix)[2])

  #Matriz de unos de tamano nxp :
  Ones_Matrix <- matrix( rep(Ones, dim(Binary_Data_Matrix)[1]), ncol=dim(Binary_Data_Matrix)[2]) 
  
  b=(Ones_Matrix-Binary_Data_Matrix)%*%t(Binary_Data_Matrix)
  
  c= t(b)
  
  d= (Ones_Matrix - Binary_Data_Matrix)%*%t(Ones_Matrix -     
      Binary_Data_Matrix) 


Sokal_Similarity  = (a[i,j] + d[i,j]) /  dim(Binary_Data_Matrix)[2] 
  
return(Sokal_Similarity)
}

In [81]:
%%R

Sokal_Similarity_R(1, 2, Binary_Data)

[1] 0.3333333


In [82]:
%%R

Sokal_Similarity_R(7,8, Binary_Data)

[1] 1


####  Sokal Similarity in `Python` <a class="anchor" id="58"></a>


In [83]:
def Sokal_Similarity_Py(i, j, Binary_Data_Matrix):

    X = Binary_Data_Matrix

    a = X @ X.T

    n = X.shape[0]

    p = X.shape[1]

    ones_matrix = np.ones((n, p)) 

    b = (ones_matrix - X) @ X.T

    c = b.T

    d = (ones_matrix - X) @ (ones_matrix - X).T

    Sokal_Similarity = (a.iloc[i-1,j-1] + d.iloc[i-1,j-1])/p

    return Sokal_Similarity

In [84]:
Sokal_Similarity_Py(1, 2, Binary_Data_Py)

0.3333333333333333

In [85]:
Sokal_Similarity_Py(7, 8, Binary_Data_Py)

1.0

####  Sokal Similarity Matrix in `R` <a class="anchor" id="59"></a>


In [86]:
%%R

Sokal_Similarity_Matrix <- function( Binary_Data_Matrix ){
  
  Binary_Data_Matrix=as.matrix(Binary_Data_Matrix)

  M<-matrix(NA, ncol =dim(Binary_Data_Matrix)[1] , nrow=dim(Binary_Data_Matrix)[1] )
  
  for(i in 1:dim(Binary_Data_Matrix)[1] ){
    for(j in 1:dim(Binary_Data_Matrix)[1]){
    
  M[i,j]=Sokal_Similarity_R(i,j,  Binary_Data_Matrix)
  
   }
  }
 return(M)
}

In [87]:
%%R

Sokal_Similarity_Matrix(Binary_Data)[1:10,1:10]

           [,1]      [,2]      [,3]      [,4]      [,5]      [,6]      [,7]
 [1,] 1.0000000 0.3333333 0.3333333 0.3333333 0.3333333 0.3333333 0.0000000
 [2,] 0.3333333 1.0000000 0.3333333 0.3333333 0.3333333 1.0000000 0.6666667
 [3,] 0.3333333 0.3333333 1.0000000 0.3333333 0.3333333 0.3333333 0.6666667
 [4,] 0.3333333 0.3333333 0.3333333 1.0000000 1.0000000 0.3333333 0.6666667
 [5,] 0.3333333 0.3333333 0.3333333 1.0000000 1.0000000 0.3333333 0.6666667
 [6,] 0.3333333 1.0000000 0.3333333 0.3333333 0.3333333 1.0000000 0.6666667
 [7,] 0.0000000 0.6666667 0.6666667 0.6666667 0.6666667 0.6666667 1.0000000
 [8,] 0.0000000 0.6666667 0.6666667 0.6666667 0.6666667 0.6666667 1.0000000
 [9,] 1.0000000 0.3333333 0.3333333 0.3333333 0.3333333 0.3333333 0.0000000
[10,] 0.6666667 0.6666667 0.0000000 0.6666667 0.6666667 0.6666667 0.3333333
           [,8]      [,9]     [,10]
 [1,] 0.0000000 1.0000000 0.6666667
 [2,] 0.6666667 0.3333333 0.6666667
 [3,] 0.6666667 0.3333333 0.0000000
 [4,] 0.6666667 0.33

####  Sokal Similarity Matrix in `Python` <a class="anchor" id="60"></a>

In [None]:
X = Binary_Data_Matrix

a = X @ X.T

n = X.shape[0]

p = X.shape[1]

ones_matrix = np.ones((n, p)) 

b = (ones_matrix - X) @ X.T

c = b.T

d = (ones_matrix - X) @ (ones_matrix - X).T

In [88]:
def Sim_Sokal_Matrix_Python(Binary_Data_set):

    M = np.zeros((Binary_Data_set.shape[0] , Binary_Data_set.shape[0]))

    for i in range(0 , Binary_Data_set.shape[0]):
        for j in range(0 , Binary_Data_set.shape[0]):

            M[i,j] = (a.iloc[i-1,j-1] + d.iloc[i-1,j-1])/p
                
    return M

In [89]:
Sim_Sokal_Matrix_Python(Binary_Data_Py)

array([[1.        , 0.33333333, 0.33333333, 0.33333333, 0.33333333,
        0.33333333, 0.        , 0.        , 1.        , 0.66666667,
        0.33333333, 0.66666667, 0.66666667, 0.33333333, 0.        ,
        0.33333333, 0.66666667, 0.33333333, 1.        , 0.66666667,
        0.33333333, 0.66666667, 0.66666667, 0.        , 0.33333333,
        0.        , 0.66666667, 0.66666667, 0.33333333, 0.33333333,
        0.66666667, 0.66666667, 0.66666667, 0.66666667, 0.66666667,
        0.33333333, 0.66666667, 0.33333333, 1.        , 1.        ,
        0.33333333, 1.        , 0.66666667, 0.66666667, 0.66666667,
        0.33333333, 0.66666667, 0.33333333, 0.        , 0.66666667],
       [0.33333333, 1.        , 0.33333333, 0.33333333, 0.33333333,
        1.        , 0.66666667, 0.66666667, 0.33333333, 0.66666667,
        0.33333333, 0.66666667, 0.        , 0.33333333, 0.66666667,
        0.33333333, 0.66666667, 1.        , 0.33333333, 0.66666667,
        0.33333333, 0.        , 0.        , 0.6

In this case we go from similarity to distance using the transformation:



\begin{gather*}
 \delta(i,j)_{Sokal}= \sqrt{ S(i,i)_{Sokal} + S(j,j)_{Sokal} - 2\cdot S(i,j)_{Sokal} }
\end{gather*}  


####  Sokal Distance in `R` <a class="anchor" id="61"></a>

In [90]:
%%R

Dist_Sokal_R <- function(i,j,   Binary_Data){

Binary_Data=as.matrix(Binary_Data)
  
Dist_Sokal  = sqrt( Sokal_Similarity_R(i,i,   Binary_Data) + Sokal_Similarity_R(j,j,   Binary_Data) - 2*Sokal_Similarity_R(i,j,   Binary_Data)  )
  
return( Dist_Sokal )
}

In [91]:
%%R

Dist_Sokal_R(1, 2, Binary_Data)

[1] 1.154701


####  Sokal Distance in `Python` <a class="anchor" id="62"></a>

In [None]:
X = Binary_Data_Matrix

a = X @ X.T

n = X.shape[0]

p = X.shape[1]

ones_matrix = np.ones((n, p)) 

b = (ones_matrix - X) @ X.T

c = b.T

d = (ones_matrix - X) @ (ones_matrix - X).T

In [92]:
def Dist_Sokal_Python(i, j, Binary_Data_set):

    dist_Sokal = np.sqrt( 2 - 2*(a.iloc[i-1,j-1] + d.iloc[i-1,j-1])/p )

    return dist_Sokal      


In [93]:
Dist_Sokal_Python(1,2, Binary_Data_Py)

1.1547005383792517

####  Sokal Distance Matrix in `R` <a class="anchor" id="63"></a>

In [94]:
%%R

Dist_Sokal_Matrix_R <- function( Binary_Data ){
  
  Binary_Data=as.matrix(Binary_Data)

  M<-matrix(NA, ncol =dim(Binary_Data)[1] , nrow=dim(Binary_Data)[1] )
  
  for(i in 1:dim(Binary_Data)[1] ){
    for(j in 1:dim(Binary_Data)[1]){
    
  M[i,j]=Dist_Sokal_R(i,j,  Binary_Data)
  
   }
  }
 return(M)
}

In [95]:
%%R

Dist_Sokal_Matrix_R(Binary_Data)[1:10,1:10]

           [,1]      [,2]      [,3]      [,4]      [,5]      [,6]      [,7]
 [1,] 0.0000000 1.1547005 1.1547005 1.1547005 1.1547005 1.1547005 1.4142136
 [2,] 1.1547005 0.0000000 1.1547005 1.1547005 1.1547005 0.0000000 0.8164966
 [3,] 1.1547005 1.1547005 0.0000000 1.1547005 1.1547005 1.1547005 0.8164966
 [4,] 1.1547005 1.1547005 1.1547005 0.0000000 0.0000000 1.1547005 0.8164966
 [5,] 1.1547005 1.1547005 1.1547005 0.0000000 0.0000000 1.1547005 0.8164966
 [6,] 1.1547005 0.0000000 1.1547005 1.1547005 1.1547005 0.0000000 0.8164966
 [7,] 1.4142136 0.8164966 0.8164966 0.8164966 0.8164966 0.8164966 0.0000000
 [8,] 1.4142136 0.8164966 0.8164966 0.8164966 0.8164966 0.8164966 0.0000000
 [9,] 0.0000000 1.1547005 1.1547005 1.1547005 1.1547005 1.1547005 1.4142136
[10,] 0.8164966 0.8164966 1.4142136 0.8164966 0.8164966 0.8164966 1.1547005
           [,8]      [,9]     [,10]
 [1,] 1.4142136 0.0000000 0.8164966
 [2,] 0.8164966 1.1547005 0.8164966
 [3,] 0.8164966 1.1547005 1.4142136
 [4,] 0.8164966 1.15

####  Sokal Distance Matrix in `Python` <a class="anchor" id="64"></a>

In [96]:
def Dist_Sokal_Matrix_Python(Binary_Data_set):

    M = np.zeros((Binary_Data_set.shape[0] , Binary_Data_set.shape[0]))

    for i in range(0 , Binary_Data_set.shape[0]):
        for j in range(0 , Binary_Data_set.shape[0]):

            M[i,j] = Dist_Sokal_Python(i+1, j+1, Binary_Data_set)
                 
    return M

In [97]:
Dist_Sokal_Matrix_Python(Binary_Data_Py)

array([[0.        , 1.15470054, 1.15470054, 1.15470054, 1.15470054,
        1.15470054, 1.41421356, 1.41421356, 0.        , 0.81649658,
        1.15470054, 0.81649658, 0.81649658, 1.15470054, 1.41421356,
        1.15470054, 0.81649658, 1.15470054, 0.        , 0.81649658,
        1.15470054, 0.81649658, 0.81649658, 1.41421356, 1.15470054,
        1.41421356, 0.81649658, 0.81649658, 1.15470054, 1.15470054,
        0.81649658, 0.81649658, 0.81649658, 0.81649658, 0.81649658,
        1.15470054, 0.81649658, 1.15470054, 0.        , 0.        ,
        1.15470054, 0.        , 0.81649658, 0.81649658, 0.81649658,
        1.15470054, 0.81649658, 1.15470054, 1.41421356, 0.81649658],
       [1.15470054, 0.        , 1.15470054, 1.15470054, 1.15470054,
        0.        , 0.81649658, 0.81649658, 1.15470054, 0.81649658,
        1.15470054, 0.81649658, 1.41421356, 1.15470054, 0.81649658,
        1.15470054, 0.81649658, 0.        , 1.15470054, 0.81649658,
        1.15470054, 1.41421356, 1.41421356, 0.8

### Jaccard Similarity  <a class="anchor" id="65"></a>

The Jaccard coefficient of similarity between the elements/individuals $i$ and $j$ with respect to the binary variables $X_1,...,X_p$ is:


\begin{gather*}
S(i,j)_{Jaccard}  = \dfrac{a_{ij} }{a_{ij} + b_{ij}+ c_{ij}} 
\end{gather*}



### Jaccard Distance  <a class="anchor" id="66"></a>


We get the Jaccard distance as:


\begin{gather*}
\delta(i,j)_{Jaccard} = \sqrt{S(i,i)_{Jaccard} +S(j,j)_{Jaccard} - 2\cdot S(i,j)_{Jaccard} }
\end{gather*}


#### Jaccard Similarity in `R` <a class="anchor" id="67"></a>

In [98]:
%%R

Jaccard_Similarity_R <- function(i,j,   Binary_Data_Matrix){
  
  Binary_Data_Matrix=as.matrix(Binary_Data_Matrix)
  
  a= Binary_Data_Matrix %*% t(Binary_Data_Matrix)
  
  Ones <- rep(1, dim(Binary_Data_Matrix)[2])

  #Matriz de unos de tamano nxp :
  Ones_Matrix <- matrix( rep(Ones, dim(Binary_Data_Matrix)[1]), ncol=dim(Binary_Data_Matrix)[2]) 
  
  b=(Ones_Matrix-Binary_Data_Matrix)%*%t(Binary_Data_Matrix)
  
  c= t(b)
  
  d= (Ones_Matrix - Binary_Data_Matrix)%*%t(Ones_Matrix -     
      Binary_Data_Matrix) 


Similaridad_Jaccard  = a[i,j]/(a[i,j] + b[i,j] + c[i,j])
  
return(Similaridad_Jaccard)
}

In [99]:
%%R

Jaccard_Similarity_R(1,2, Binary_Data)

[1] 0.3333333


In [100]:
%%R

Jaccard_Similarity_R(8,9, Binary_Data)

[1] 0


#### Jaccard Similarity in `Python` <a class="anchor" id="68"></a>

In [101]:
def Jaccard_Similarity_Py(i, j, Binary_Data_Matrix):

    X = Binary_Data_Py

    a = X @ X.T

    n = X.shape[0]

    p = X.shape[1]

    ones_matrix = np.ones((n, p)) 

    b = (ones_matrix - X) @ X.T

    c = b.T

    d = (ones_matrix - X) @ (ones_matrix - X).T

    Jaccard_Similarity = a.iloc[i-1,j-1] / (a.iloc[i-1,j-1] + b.iloc[i-1,j-1] + c.iloc[i-1,j-1])

    return Jaccard_Similarity

In [102]:
Jaccard_Similarity_Py(1, 2, Binary_Data_Py)

0.3333333333333333

#### Jaccard Similarity Matrix in `R` <a class="anchor" id="69"></a>

In [103]:
%%R
Sim_Jaccard_Matrix_R <- function( Binary_Data_set ){
  
 Binary_Data_set=as.matrix(Binary_Data_set)

  
  M<-matrix(NA, ncol =dim(Binary_Data_set)[1] , nrow=dim(Binary_Data_set)[1] )
  
  for(i in 1:dim(Binary_Data_set)[1] ){
    for(j in 1:dim(Binary_Data_set)[1]){
    
  M[i,j]=Jaccard_Similarity_R(i,j,  Binary_Data_set)
  
   }
  }
 return(M)
}

In [104]:
%%R

Sim_Jaccard_Matrix_R(Binary_Data)[1:10 , 1:10]

           [,1]      [,2]      [,3]      [,4]      [,5]      [,6]      [,7]
 [1,] 1.0000000 0.3333333 0.0000000 0.0000000 0.0000000 0.3333333 0.0000000
 [2,] 0.3333333 1.0000000 0.3333333 0.3333333 0.3333333 1.0000000 0.6666667
 [3,] 0.0000000 0.3333333 1.0000000 0.0000000 0.0000000 0.3333333 0.5000000
 [4,] 0.0000000 0.3333333 0.0000000 1.0000000 1.0000000 0.3333333 0.5000000
 [5,] 0.0000000 0.3333333 0.0000000 1.0000000 1.0000000 0.3333333 0.5000000
 [6,] 0.3333333 1.0000000 0.3333333 0.3333333 0.3333333 1.0000000 0.6666667
 [7,] 0.0000000 0.6666667 0.5000000 0.5000000 0.5000000 0.6666667 1.0000000
 [8,] 0.0000000 0.6666667 0.5000000 0.5000000 0.5000000 0.6666667 1.0000000
 [9,] 1.0000000 0.3333333 0.0000000 0.0000000 0.0000000 0.3333333 0.0000000
[10,] 0.5000000 0.6666667 0.0000000 0.5000000 0.5000000 0.6666667 0.3333333
           [,8]      [,9]     [,10]
 [1,] 0.0000000 1.0000000 0.5000000
 [2,] 0.6666667 0.3333333 0.6666667
 [3,] 0.5000000 0.0000000 0.0000000
 [4,] 0.5000000 0.00

#### Jaccard Similarity Matrix in `Python` <a class="anchor" id="70"></a>

In [None]:
X = Binary_Data_Matrix

a = X @ X.T

n = X.shape[0]

p = X.shape[1]

ones_matrix = np.ones((n, p)) 

b = (ones_matrix - X) @ X.T

c = b.T

d = (ones_matrix - X) @ (ones_matrix - X).T

In [105]:
def Sim_Jaccard_Matrix_Python(Binary_Data_set):

    M = np.zeros((Binary_Data_set.shape[0] , Binary_Data_set.shape[0]))

    for i in range(0 , Binary_Data_set.shape[0]):
        for j in range(0 , Binary_Data_set.shape[0]):

            M[i,j] =  a.iloc[i-1,j-1] / (a.iloc[i-1,j-1] + b.iloc[i-1,j-1] + c.iloc[i-1,j-1])
                 
    return M

In [106]:
Sim_Jaccard_Matrix_Python(Binary_Data_Py)

  Jaccard_Similarity = a.iloc[i-1,j-1] / (a.iloc[i-1,j-1] + b.iloc[i-1,j-1] + c.iloc[i-1,j-1])


array([[1.        , 0.33333333, 0.        , 0.        , 0.        ,
        0.33333333, 0.        , 0.        , 1.        , 0.5       ,
        0.        , 0.5       , 0.        , 0.        , 0.        ,
        0.        , 0.5       , 0.33333333, 1.        , 0.5       ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.5       , 0.        , 0.33333333, 0.        ,
        0.5       , 0.        , 0.        , 0.5       , 0.5       ,
        0.        , 0.5       , 0.33333333, 1.        , 1.        ,
        0.        , 1.        , 0.5       , 0.        , 0.5       ,
        0.        , 0.5       , 0.        , 0.        , 0.5       ],
       [0.33333333, 1.        , 0.33333333, 0.33333333, 0.33333333,
        1.        , 0.66666667, 0.66666667, 0.33333333, 0.66666667,
        0.33333333, 0.66666667, 0.        , 0.33333333, 0.66666667,
        0.33333333, 0.66666667, 1.        , 0.33333333, 0.66666667,
        0.33333333, 0.        , 0.        , 0.6

#### Jaccard Distance in `R` <a class="anchor" id="71"></a>

In [107]:
%%R

Dist_Jaccard_R <- function(i,j,   Binary_Data_set){

Binary_Data_set=as.matrix(Binary_Data_set)
  
Dist_Jaccard  = sqrt( Jaccard_Similarity_R(i,i,   Binary_Data_set) + Jaccard_Similarity_R(j,j,   Binary_Data_set) - 2*Jaccard_Similarity_R(i,j,   Binary_Data_set)  )
  
return( Dist_Jaccard )
}

In [108]:
%%R

Dist_Jaccard_R(1,2, Binary_Data) 

[1] 1.154701


#### Jaccard Distance in `Python` <a class="anchor" id="72"></a>

In [109]:
def Dist_Jaccard_Python(i, j, Binary_Data_set):

    dist_Jaccard = np.sqrt( 2 - 2*( a.iloc[i-1,j-1] / (a.iloc[i-1,j-1] + b.iloc[i-1,j-1] + c.iloc[i-1,j-1]) ) )

    return dist_Jaccard

In [110]:
Dist_Jaccard_Python(1, 2, Binary_Data_Py)

1.1547005383792517

#### Jaccard Distance Matrix in `R` <a class="anchor" id="73"></a>

In [111]:
%%R

Dist_Jaccard_Matrix_R <- function( Binary_Data_set ){
  
 Binary_Data_set=as.matrix(Binary_Data_set)

  
  M<-matrix(NA, ncol =dim(Binary_Data_set)[1] , nrow=dim(Binary_Data_set)[1] )
  
  for(i in 1:dim(Binary_Data_set)[1] ){
    for(j in 1:dim(Binary_Data_set)[1]){
    
  M[i,j]=Dist_Jaccard_R(i,j,  Binary_Data_set)
  
   }
  }
 return(M)
}

In [112]:
%%R

Dist_Jaccard_Matrix_R(Binary_Data)[1:10 , 1:10]

          [,1]      [,2]     [,3]     [,4]     [,5]      [,6]      [,7]
 [1,] 0.000000 1.1547005 1.414214 1.414214 1.414214 1.1547005 1.4142136
 [2,] 1.154701 0.0000000 1.154701 1.154701 1.154701 0.0000000 0.8164966
 [3,] 1.414214 1.1547005 0.000000 1.414214 1.414214 1.1547005 1.0000000
 [4,] 1.414214 1.1547005 1.414214 0.000000 0.000000 1.1547005 1.0000000
 [5,] 1.414214 1.1547005 1.414214 0.000000 0.000000 1.1547005 1.0000000
 [6,] 1.154701 0.0000000 1.154701 1.154701 1.154701 0.0000000 0.8164966
 [7,] 1.414214 0.8164966 1.000000 1.000000 1.000000 0.8164966 0.0000000
 [8,] 1.414214 0.8164966 1.000000 1.000000 1.000000 0.8164966 0.0000000
 [9,] 0.000000 1.1547005 1.414214 1.414214 1.414214 1.1547005 1.4142136
[10,] 1.000000 0.8164966 1.414214 1.000000 1.000000 0.8164966 1.1547005
           [,8]     [,9]     [,10]
 [1,] 1.4142136 0.000000 1.0000000
 [2,] 0.8164966 1.154701 0.8164966
 [3,] 1.0000000 1.414214 1.4142136
 [4,] 1.0000000 1.414214 1.0000000
 [5,] 1.0000000 1.414214 1.000000

#### Jaccard Distance Matrix in `Python` <a class="anchor" id="74"></a>

In [113]:
def Dist_Jaccard_Matrix_Python(Binary_Data_set):

    M = np.zeros((Binary_Data_set.shape[0] , Binary_Data_set.shape[0]))

    for i in range(0 , Binary_Data_set.shape[0]):
        for j in range(0 , Binary_Data_set.shape[0]):

            M[i,j] = Dist_Jaccard_Python(i+1, j+1, Binary_Data_set)
                 
    return M

In [114]:
Dist_Jaccard_Matrix_Python(Binary_Data_Py)

  Jaccard_Similarity = a.iloc[i-1,j-1] / (a.iloc[i-1,j-1] + b.iloc[i-1,j-1] + c.iloc[i-1,j-1])


array([[0.        , 1.15470054, 1.41421356, 1.41421356, 1.41421356,
        1.15470054, 1.41421356, 1.41421356, 0.        , 1.        ,
        1.41421356, 1.        , 1.41421356, 1.41421356, 1.41421356,
        1.41421356, 1.        , 1.15470054, 0.        , 1.        ,
        1.41421356, 1.41421356, 1.41421356, 1.41421356, 1.41421356,
        1.41421356, 1.        , 1.41421356, 1.15470054, 1.41421356,
        1.        , 1.41421356, 1.41421356, 1.        , 1.        ,
        1.41421356, 1.        , 1.15470054, 0.        , 0.        ,
        1.41421356, 0.        , 1.        , 1.41421356, 1.        ,
        1.41421356, 1.        , 1.41421356, 1.41421356, 1.        ],
       [1.15470054, 0.        , 1.15470054, 1.15470054, 1.15470054,
        0.        , 0.81649658, 0.81649658, 1.15470054, 0.81649658,
        1.15470054, 0.81649658, 1.41421356, 1.15470054, 0.81649658,
        1.15470054, 0.81649658, 0.        , 1.15470054, 0.81649658,
        1.15470054, 1.41421356, 1.41421356, 0.8

### More similarity coefficients   <a class="anchor" id="75"></a>

<img src="similaridades 1.jpg" width=400 height=250 >

<img src="similaridades 2.jpg" width=500 height=280 >

## Similarities with multiclass  categorical variable   <a class="anchor" id="76"></a>

Let $X_1,...,X_p$ be multiple categorical variables with possible different number of categories.


The parameters used to construct the similarity coefficients with multiple categorical variables are:

$\alpha_{ij}$ = number of matches between the variables $p$ for both elements $i$ and $j$

$p-\alpha_{ij}$ = number of mismatches between variables $p$ for both elements $i$ and $j$

Observation:

$$ \alpha_{ij}=a_{ij}+b_{ij}$$

 ### Matches Similarity   <a class="anchor" id="77"></a>

The most common measure of similarity in these cases is the matches similarity coefficient. 


The matches coefficient between the elements/individuals $i$ and $j$ with respect to the multiple  categorical variables $X_1,...,X_p$ is:



\begin{gather*}
S(i,j)_{Matches}= \dfrac{\alpha_{ij}}{p}
\end{gather*}

Observation:

When the variables are binary, the matches similarity coefficient is equal to   Sokal similarity, since $\hspace{0.1cm} \alpha_{ij}=a_{ij}+b_{ij}$

 ### Matches Distance   <a class="anchor" id="78"></a>



\begin{gather*}
\delta(i,j)_{Coincidencias} = \sqrt{S(i,i)_{Coincidencias} +S(j,j)_{Coincidencias} - 2\cdot S(i,j)_{Coincidencias} }
\end{gather*}



 ####  Matches Similarity in `R`   <a class="anchor" id="79"></a>


In [115]:
%%R

Multiple_Categorical_Data_R <- Data_set_R %>% select(8:10)

We create a function to compute the parameter $\hspace{0.1cm}\alpha_{ij}$  

In [116]:
%%R

alpha<- function(i,j, Multiple_Categorical_Data){
  
  X=as.matrix(Multiple_Categorical_Data)
  
  alpha=ifelse( X[i,]==X[j,] , 1 , 0)
  
  # Otra forma de hacer lo mismo, pero menos eficiente:
    
  # alpha=rep(0, dim(Multiple_Categorical_Data)[2])

  # for(k in 1:dim(X)[2]){
  # if( X[i,k]==X[j,k] ){ alpha[k]=1 } else { alpha[k]=0 }
  # }
  
  alpha=sum(alpha)
  
  return(alpha)
}

In [117]:
%%R

alpha(1,2 , Multiple_Categorical_Data_R)

[1] 2


Now we develop a function to compute the matches similarity coefficient:

In [118]:
%%R

Matches_Similarity_R <- function(i,j, Multiple_Categorical_Data){

Multiple_Categorical_Data=as.matrix(Multiple_Categorical_Data)
  
Matches_Similarity  =  alpha(i,j, Multiple_Categorical_Data) / dim(Multiple_Categorical_Data)[2] 
  
return(Matches_Similarity )
}

In [119]:
%%R

Matches_Similarity_R(1,2, Multiple_Categorical_Data_R)

[1] 0.6666667


#### Matches Similarity in `Python`  <a class="anchor" id="79.1"></a>


In [120]:
Multiple_Categorical_Data_Py = Data_set_Python.iloc[: , 7:10]

In [121]:
def alpha_py(i,j, Multiple_Categorical_Data):

    X = Multiple_Categorical_Data

    alpha = np.repeat(0, X.shape[1])

    for k in range(0, X.shape[1]) :

        if X.iloc[i-1, k] == X.iloc[j-1, k] :

            alpha[k] = 1

        else :

            alpha[k] = 0

    alpha = alpha.sum()

    return(alpha)

In [122]:
def matches_similarity_py(i, j, Multiple_Categorical_Data):

    p = Multiple_Categorical_Data.shape[1]

    matches_similarity = alpha_py(i,j, Multiple_Categorical_Data) / p

    return(matches_similarity)

In [123]:
matches_similarity_py(1, 2, Multiple_Categorical_Data_Py)

0.6666666666666666

 ####  Matches Distance in `R`   <a class="anchor" id="80"></a>


In [124]:
%%R

Dist_Matches_R <- function(i,j,   Multiple_Categorical_Data){

Multiple_Categorical_Data=as.matrix(Multiple_Categorical_Data)
  
Dist_Matches  = sqrt( Matches_Similarity_R(i,i,   Multiple_Categorical_Data) + Matches_Similarity_R(j,j,   Multiple_Categorical_Data) - 2*Matches_Similarity_R(i,j,   Multiple_Categorical_Data)  )
  
return( Dist_Matches )
}

In [125]:
%%R

Dist_Matches_R(1,3, Multiple_Categorical_Data_R)

[1] 0.8164966


 ####  Matches Distance in `Python`   <a class="anchor" id="80.1"></a>


In [126]:
def Dist_Matches_Py(i,j, Multiple_Categorical_Data):

    Dist_Matches = np.sqrt( matches_similarity_py(i, i, Multiple_Categorical_Data) +  matches_similarity_py(j, j, Multiple_Categorical_Data) - 2*matches_similarity_py(i, j, Multiple_Categorical_Data) )

    return( Dist_Matches )

In [127]:
Dist_Matches_Py(1,3, Multiple_Categorical_Data_Py)

0.816496580927726

 ####  Matches Similarity Matrix in `R`   <a class="anchor" id="81"></a>


In [128]:
%%R

Matches_Similarity_Matrix_R <- function( Multiple_Categorical_Data ){
  
  Multiple_Categorical_Data=as.matrix(Multiple_Categorical_Data)

  M<-matrix(NA, ncol =dim(Multiple_Categorical_Data)[1] , nrow=dim(Multiple_Categorical_Data)[1] )
  
  for(i in 1:dim(Multiple_Categorical_Data)[1] ){
    for(j in 1:dim(Multiple_Categorical_Data)[1]){
    
  M[i,j]=Matches_Similarity_R(i,j,  Multiple_Categorical_Data)
  
   }
  }
 return(M)
}

In [129]:
%%R

Matches_Similarity_Matrix_R(Multiple_Categorical_Data_R)[1:10 , 1:10]

           [,1]      [,2]      [,3]      [,4] [,5]      [,6]      [,7]
 [1,] 1.0000000 0.6666667 0.6666667 0.3333333    0 0.3333333 0.6666667
 [2,] 0.6666667 1.0000000 0.6666667 0.3333333    0 0.3333333 0.3333333
 [3,] 0.6666667 0.6666667 1.0000000 0.3333333    0 0.3333333 0.3333333
 [4,] 0.3333333 0.3333333 0.3333333 1.0000000    0 0.3333333 0.0000000
 [5,] 0.0000000 0.0000000 0.0000000 0.0000000    1 0.0000000 0.0000000
 [6,] 0.3333333 0.3333333 0.3333333 0.3333333    0 1.0000000 0.3333333
 [7,] 0.6666667 0.3333333 0.3333333 0.0000000    0 0.3333333 1.0000000
 [8,] 0.3333333 0.3333333 0.3333333 0.6666667    0 0.3333333 0.0000000
 [9,] 0.3333333 0.6666667 0.3333333 0.3333333    0 0.0000000 0.0000000
[10,] 0.3333333 0.3333333 0.3333333 0.6666667    0 0.0000000 0.0000000
           [,8]      [,9]     [,10]
 [1,] 0.3333333 0.3333333 0.3333333
 [2,] 0.3333333 0.6666667 0.3333333
 [3,] 0.3333333 0.3333333 0.3333333
 [4,] 0.6666667 0.3333333 0.6666667
 [5,] 0.0000000 0.0000000 0.0000000
 [6

 ####  Matches Similarity Matrix in `Python`   <a class="anchor" id="82"></a>


In [130]:
def Sim_Matches_Matrix_Python(Multiple_Categorical_Data):

    M = np.zeros((Multiple_Categorical_Data.shape[0] , Multiple_Categorical_Data.shape[0]))

    for i in range(0 , Multiple_Categorical_Data.shape[0]):
        for j in range(0 , Multiple_Categorical_Data.shape[0]):

            M[i,j] = matches_similarity_py(i+1, j+1, Multiple_Categorical_Data)
                 
    return M

In [131]:
Sim_Matches_Matrix_Python(Multiple_Categorical_Data_Py)

array([[1.        , 0.66666667, 0.66666667, 0.33333333, 0.        ,
        0.33333333, 0.66666667, 0.33333333, 0.33333333, 0.33333333,
        0.        , 0.33333333, 0.        , 0.        , 0.33333333,
        0.33333333, 0.        , 0.66666667, 0.        , 0.        ,
        0.66666667, 0.        , 0.33333333, 0.33333333, 0.33333333,
        0.        , 0.        , 0.        , 0.66666667, 0.66666667,
        0.        , 0.        , 0.        , 0.33333333, 0.        ,
        0.        , 0.        , 0.33333333, 0.33333333, 0.        ,
        0.33333333, 0.66666667, 0.        , 0.        , 0.        ,
        0.        , 0.66666667, 0.33333333, 0.        , 0.        ],
       [0.66666667, 1.        , 0.66666667, 0.33333333, 0.        ,
        0.33333333, 0.33333333, 0.33333333, 0.66666667, 0.33333333,
        0.        , 0.33333333, 0.        , 0.        , 0.        ,
        0.33333333, 0.        , 0.33333333, 0.        , 0.        ,
        0.33333333, 0.        , 0.        , 0.3

 ####  Matches Distance Matrix in `R`   <a class="anchor" id="83"></a>


In [132]:
%%R

Matches_Dist_Matrix_R <- function( Multiple_Categorical_Data ){
  
  Multiple_Categorical_Data=as.matrix(Multiple_Categorical_Data)

  M<-matrix(NA, ncol =dim(Multiple_Categorical_Data)[1] , nrow=dim(Multiple_Categorical_Data)[1] )
  
  for(i in 1:dim(Multiple_Categorical_Data)[1] ){
    for(j in 1:dim(Multiple_Categorical_Data)[1]){
    
  M[i,j]=Dist_Matches_R(i,j,  Multiple_Categorical_Data)
  
   }
  }
 return(M)
}

In [133]:
%%R

Matches_Dist_Matrix_R(Multiple_Categorical_Data_R)[1:10 , 1:10]

           [,1]      [,2]      [,3]      [,4]     [,5]     [,6]      [,7]
 [1,] 0.0000000 0.8164966 0.8164966 1.1547005 1.414214 1.154701 0.8164966
 [2,] 0.8164966 0.0000000 0.8164966 1.1547005 1.414214 1.154701 1.1547005
 [3,] 0.8164966 0.8164966 0.0000000 1.1547005 1.414214 1.154701 1.1547005
 [4,] 1.1547005 1.1547005 1.1547005 0.0000000 1.414214 1.154701 1.4142136
 [5,] 1.4142136 1.4142136 1.4142136 1.4142136 0.000000 1.414214 1.4142136
 [6,] 1.1547005 1.1547005 1.1547005 1.1547005 1.414214 0.000000 1.1547005
 [7,] 0.8164966 1.1547005 1.1547005 1.4142136 1.414214 1.154701 0.0000000
 [8,] 1.1547005 1.1547005 1.1547005 0.8164966 1.414214 1.154701 1.4142136
 [9,] 1.1547005 0.8164966 1.1547005 1.1547005 1.414214 1.414214 1.4142136
[10,] 1.1547005 1.1547005 1.1547005 0.8164966 1.414214 1.414214 1.4142136
           [,8]      [,9]     [,10]
 [1,] 1.1547005 1.1547005 1.1547005
 [2,] 1.1547005 0.8164966 1.1547005
 [3,] 1.1547005 1.1547005 1.1547005
 [4,] 0.8164966 1.1547005 0.8164966
 [5,] 

 ####  Matches Distance Matrix in `Python`   <a class="anchor" id="84"></a>


In [134]:
def Dist_Matches_Matrix_Python(Multiple_Categorical_Data):

    M = np.zeros((Multiple_Categorical_Data.shape[0] , Multiple_Categorical_Data.shape[0]))

    for i in range(0 , Multiple_Categorical_Data.shape[0]):
        for j in range(0 , Multiple_Categorical_Data.shape[0]):

            M[i,j] = Dist_Matches_Py(i+1, j+1, Multiple_Categorical_Data)
                 
    return M

In [135]:
Dist_Matches_Matrix_Python(Multiple_Categorical_Data_Py)

array([[0.        , 0.81649658, 0.81649658, 1.15470054, 1.41421356,
        1.15470054, 0.81649658, 1.15470054, 1.15470054, 1.15470054,
        1.41421356, 1.15470054, 1.41421356, 1.41421356, 1.15470054,
        1.15470054, 1.41421356, 0.81649658, 1.41421356, 1.41421356,
        0.81649658, 1.41421356, 1.15470054, 1.15470054, 1.15470054,
        1.41421356, 1.41421356, 1.41421356, 0.81649658, 0.81649658,
        1.41421356, 1.41421356, 1.41421356, 1.15470054, 1.41421356,
        1.41421356, 1.41421356, 1.15470054, 1.15470054, 1.41421356,
        1.15470054, 0.81649658, 1.41421356, 1.41421356, 1.41421356,
        1.41421356, 0.81649658, 1.15470054, 1.41421356, 1.41421356],
       [0.81649658, 0.        , 0.81649658, 1.15470054, 1.41421356,
        1.15470054, 1.15470054, 1.15470054, 0.81649658, 1.15470054,
        1.41421356, 1.15470054, 1.41421356, 1.41421356, 1.41421356,
        1.15470054, 1.41421356, 1.15470054, 1.41421356, 1.41421356,
        1.15470054, 1.41421356, 1.41421356, 1.1

Mas coeficentes de similaridad:

 
<img src="similaridades 3.jpg" width=420 height=200 >

 ## Distances with Mixed Variables  <a class="anchor" id="76"></a>


Let $X=(X_1,...,X_p)$ be a mixed data matrix such that:

$X_1,...,X_{p_1} \hspace{0.1cm}$ are quantitative variables

$X_{p_1 + 1},...,X_{p_1 + p_2} \hspace{0.1cm}$ are binary categorical variables

$X_{p_1 + p_2 + 1},...,X_{p_1 + p_2 + p_3} \hspace{0.1cm}$ are multiple categorical variables (non-binary).

Where: $\hspace{0.2cm} p=p_1 + p_2 + p_3$


 ### Gower Similarity Coefficient  <a class="anchor" id="86"></a>




The Gower coefficient of similarity between the elements $i$ and $j$ with respect to the variables $X_1,...,X_p$ is:



\begin{gather*}
S(i,j)_{Gower}=\dfrac{\sum_{k=1}^{p_1} \left(1- \dfrac{\mid x_{ik} - x_{jk} \mid}{G_k} \right) + a_{ij} + \alpha_{ij} }{p_1 + (p_2 - d_{ij}) + p_3}
\end{gather*}



Where:

$p_1 \hspace{0.05cm} $ is the number of quantitative variables

$p_2 \hspace{0.05cm} $ is the number of binary categorical variables

$p_3 \hspace{0.05cm}  $ is the number of multiple (non-binary) categorical variables

$G_k \hspace{0.05cm} $ is the range of the $k$-th quantitative variable  $\hspace{0.1cm} \left( G_k = max(X_k) - min(X_k) \right)$ 

$a_{ij} \hspace{0.05cm} $ is the number of binary variables (there are $\hspace{0.05cm} p_2\hspace{0.05cm} $) for which the answer is 1 in both individuals $i$ and $j$

$d_{ij} \hspace{0.05cm} $ is the number of binary variables (there are $\hspace{0.05cm}  p_2 \hspace{0.05cm} $) for which the response is 0 in both individuals $i$ and $j$

$\alpha_{ij} \hspace{0.05cm} $ is the number of matches (coincidencias) between multiple non-binary categorical variables (there are $\hspace{0.05cm}  p_3 \hspace{0.05cm} $) for individuals $i$ and $j$


 ### Gower Distance   <a class="anchor" id="87"></a>



The Gower distance is obtained as:



\begin{gather*}
\delta(i,j)_{Gower} = \sqrt{1 - S(i,j)_{Gower}}
\end{gather*}



 ### Properties   


- Gower's similarity coefficient is the sum of different coefficients appropriate for each type of variable.

- If we only have quantitative variables, the distance obtained is:

\begin{gather*}
\dfrac{1}{p} \sum_{k=1}^{p } \left(1- \dfrac{\mid x_{ik} - x_{jk} \mid}{G_k} \right)
\end{gather*}

- If we only have binary categorical variables, Gower's coefficient of similarity coincides with Jaccard's.

- If we only have multiclass categorical variables, the Gower d coefficient matches the matches coefficient.




With this idea, other similarity coefficients can be constructed for data of mixed type. Some recommendations for this are the following:

If the resulting coefficient is to have the Euclidean property, all the coefficients that are combined must have it.

For quantitative variables, coefficients should be used that divide each comparison by a normalization factor before adding.

For binary and qualitative variables, those coefficients that take values ​​in [0,1] will be preferred to avoid rescaling the similarities before adding

 #### Gower Similarity in `R`   <a class="anchor" id="88"></a>


In [136]:
%%R
 
Similaridad_Gower <- function(i,j,   Matriz_Datos_Mixtos, p1, p2, p3){

  X=as.matrix(Matriz_Datos_Mixtos) #tienen que estar las variables ordenadas del siguiente modo: las p1 primeras son cuantitativas, las p2 siguientes son binarias, las p3 siguientes son categoricas multiples (no binarias). De modo que p=p1+p2+p3
  
###################### 
  G <- function(k, X){

  G = max(X[ ,k])-min(X[ ,k])

  return(G)
  }
######################
  
  G_vector <- rep(0, p1)
  
  for(r in 1:p1){
  G_vector[r]=G(r, X)
  }

######################

  Matriz_Datos_Binarios = X[ , (p1+1):(p2+p1)]
  
  a = Matriz_Datos_Binarios %*% t(Matriz_Datos_Binarios)
  
  unos <- rep(1, dim(Matriz_Datos_Binarios)[2])

  Matriz_Unos <- matrix( rep(unos, dim(Matriz_Datos_Binarios)[1]), ncol=dim(Matriz_Datos_Binarios)[2])
                
  d = (Matriz_Unos - Matriz_Datos_Binarios)%*%t(Matriz_Unos - Matriz_Datos_Binarios)   
  
  Matriz_Datos_Categoricos_Multiples = X[ , (p1+p2+1):(p1+p2+p3)]
  
  Matriz_Datos_Cuantitativos = X[ , 1:p1]
  
Similaridad_Gower = (  sum( 1 - abs(Matriz_Datos_Cuantitativos[i,] - Matriz_Datos_Cuantitativos[j,])/G_vector ) + a[i,j] + alpha(i,j,Matriz_Datos_Categoricos_Multiples)  ) / (p1+p2- d[i,j] + p3)
  
return(Similaridad_Gower)

}

In [137]:
%%R

Mixed_Data_R <- Data_set_R 

In [138]:
%%R

head(Mixed_Data_R)

          X1        X2        X3         X4 X5 X6 X7 X8 X9 X10
1  -6.284459 -9.411280  19.63082  13.807247  0  0  1  3  1   4
2  24.960182 -5.581823 -19.66832  14.255880  1  1  1  3  1   3
3  14.244677 36.155683  20.68397 -11.178333  1  0  0  3  1   0
4 -12.594421 -1.970941  48.97456 -18.153030  0  1  0  1  1   1
5   1.320996 10.445248   9.63061  -5.294826  0  1  0  2  3   2
6  34.771548 26.039740  10.51213  12.519134  1  1  1  3  0   1


In [139]:
 %%R

 Similaridad_Gower(1,3, Mixed_Data_R, p1=4, p2=3, p3=3)

[1] 0.5165893


 #### Gower Similarity in `Python`   <a class="anchor" id="89"></a>

In [140]:
def Gower_Similarity_Python(i,j, Mixed_Data_Set, p1, p2, p3):

    X = Mixed_Data_Set

   # The data matrix X have to be order in the following way:
   # The p1 first are quantitative, the following p2 are binary categorical, and the following p3 are multiple categorical.

##########################################################################################
    def G(k, X):

        range = X.iloc[:,k].max() - X.iloc[:,k].min() 

        return(range)

    G_vector = np.repeat(0, p1)

    for r in range(0, p1):

        G_vector[r] = G(r, X)
##########################################################################################
    
    ones = np.repeat(1, p1)

    Quantitative_Data = X.iloc[: , 0:p1]
    Binary_Data = X.iloc[: , (p1):(p1+p2)]
    Multiple_Categorical_Data = X.iloc[: , (p1+p2):(p1+p2+p3) ]

    a = Binary_Data @ Binary_Data.T

    ones_matrix = np.ones(( Binary_Data.shape[0] , Binary_Data.shape[1])) 
   
    d = (ones_matrix - Binary_Data) @ (ones_matrix - Binary_Data).T

##########################################################################################

    numerator_part_1 = ( ones - ( (Quantitative_Data.iloc[i,:] - Quantitative_Data.iloc[j,:]).abs() / G_vector ) ).sum() 

    numerator_part_2 = a.iloc[i,j] + alpha_py(i,j, Multiple_Categorical_Data)

    numerator = numerator_part_1 + numerator_part_2

    denominator = p1 + (p2 - d.iloc[i,j]) + p3

    Similarity_Gower = numerator / denominator  

    return(Similarity_Gower)

In [141]:
Mixed_Data_Py = Data_set_Python


In [142]:
Gower_Similarity_Python(1, 3, Mixed_Data_Py, 4, 3, 3)

0.5003674460819078

 #### Gower Similarity Matrix in `R`   <a class="anchor" id="90"></a>


In [143]:
%%R

Matriz_Similaridad_Gower <- function( Matriz_Datos_Mixtos, p1, p2, p3 ){
  
  Matriz_Datos_Mixtos=as.matrix(Matriz_Datos_Mixtos)
  
  M<-matrix(NA, ncol =dim(Matriz_Datos_Mixtos)[1] , nrow=dim(Matriz_Datos_Mixtos)[1] )
  
  for(i in 1:dim(Matriz_Datos_Mixtos)[1] ){
    for(j in 1:dim(Matriz_Datos_Mixtos)[1]){
    
  M[i,j]=Similaridad_Gower(i,j,  Matriz_Datos_Mixtos, p1, p2, p3)
  
   }
  }
 return(M)
}

In [144]:
%%R

Matriz_Similaridad_Gower(Mixed_Data_R, p1=4, p2=3, p3=3)[1:10 , 1:10]

           [,1]      [,2]      [,3]      [,4]      [,5]      [,6]      [,7]
 [1,] 1.0000000 0.5965125 0.5165893 0.4366560 0.3528749 0.4761826 0.5059590
 [2,] 0.5965125 1.0000000 0.5263010 0.4016305 0.3746023 0.6913742 0.5358645
 [3,] 0.5165893 0.5263010 1.0000000 0.3939846 0.3538925 0.5097256 0.5274163
 [4,] 0.4366560 0.4016305 0.3939846 1.0000000 0.4846341 0.3963480 0.3763553
 [5,] 0.3528749 0.3746023 0.3538925 0.4846341 1.0000000 0.4060985 0.4515971
 [6,] 0.4761826 0.6913742 0.5097256 0.3963480 0.4060985 1.0000000 0.6059342
 [7,] 0.5059590 0.5358645 0.5274163 0.3763553 0.4515971 0.6059342 1.0000000
 [8,] 0.3442706 0.5865341 0.5686755 0.5486914 0.4512389 0.5879352 0.5266914
 [9,] 0.5751749 0.5596320 0.4559143 0.3074368 0.2994289 0.4485348 0.2918448
[10,] 0.5363176 0.5969818 0.3850222 0.5732644 0.4634767 0.5094815 0.4124375
           [,8]      [,9]     [,10]
 [1,] 0.3442706 0.5751749 0.5363176
 [2,] 0.5865341 0.5596320 0.5969818
 [3,] 0.5686755 0.4559143 0.3850222
 [4,] 0.5486914 0.30

 #### Gower Similarity Matrix in `Python`   <a class="anchor" id="91"></a>


In [145]:
def Sim_Gower_Matrix_Python(Mixed_Data,  p1, p2, p3):

    M = np.zeros((Mixed_Data.shape[0] , Mixed_Data.shape[0]))

    for i in range(0 , Mixed_Data.shape[0]):
        for j in range(0 , Mixed_Data.shape[0]):

            M[i,j] = Gower_Similarity_Python(i,j, Mixed_Data , p1, p2, p3)
                 
    return M

In [146]:
Sim_Gower_Matrix_Python(Mixed_Data_Py,  4, 3, 3)

array([[1.        , 0.39575689, 0.2936589 , 0.32494839, 0.35245233,
        0.37537932, 0.3054635 , 0.34360304, 0.57417475, 0.53579726,
        0.50163831, 0.56493126, 0.34879079, 0.41994375, 0.40618444,
        0.33893339, 0.48155037, 0.46130776, 0.53085573, 0.40736248,
        0.30560554, 0.38234942, 0.5078159 , 0.34512805, 0.50312948,
        0.29014101, 0.48348808, 0.68017129, 0.55342749, 0.51065685,
        0.42678487, 0.61109375, 0.32720157, 0.58060239, 0.46436508,
        0.45662972, 0.58063687, 0.41217444, 0.60445091, 0.51754421,
        0.44762855, 0.71216731, 0.48511812, 0.3675837 , 0.49595731,
        0.36836414, 0.45796436, 0.49092442, 0.28820019, 0.43628719],
       [0.39575689, 1.        , 0.52552934, 0.50036745, 0.47386154,
        0.5908791 , 0.53475007, 0.68597423, 0.45905967, 0.59637546,
        0.52174574, 0.45495937, 0.42798281, 0.39849702, 0.44036139,
        0.45412472, 0.61124978, 0.59274744, 0.60615754, 0.5457942 ,
        0.39389094, 0.54686846, 0.24266839, 0.6

 #### Gower Distance in `R`   <a class="anchor" id="92"></a>

In [147]:
 %%R 
 
Dist_Gower <- function(i, j, Matriz_Datos_Mixtos , p1, p2, p3) {

Dist_Gower <- sqrt( 1 - Similaridad_Gower(i, j, Matriz_Datos_Mixtos , p1, p2, p3) )

return(Dist_Gower)

}

In [148]:
 %%R 

Dist_Gower(1,3, Mixed_Data_R, p1=4, p2=3, p3=3)

[1] 0.6952774


 #### Gower Distance in `Python`   <a class="anchor" id="93"></a>

In [149]:
def Dist_Gower_Py(i, j, Mixed_Data , p1, p2, p3):

    Dist_Gower = np.sqrt( 1 - Gower_Similarity_Python(i, j, Mixed_Data , p1, p2, p3) )

    return(Dist_Gower)

In [150]:
Dist_Gower_Py(1, 3, Mixed_Data_Py , 4, 3, 3)

0.7068469098171769

 #### Gower Distance Matrix in `R`   <a class="anchor" id="94"></a>

In [151]:
%%R

 Matriz_Dist_Gower <- function( Matriz_Datos_Mixtos, p1, p2, p3 ){
  
  Matriz_Datos_Mixtos=as.matrix(Matriz_Datos_Mixtos)
  
  M<-matrix(NA, ncol =dim(Matriz_Datos_Mixtos)[1] , nrow=dim(Matriz_Datos_Mixtos)[1] )
  
  for(i in 1:dim(Matriz_Datos_Mixtos)[1] ){
    for(j in 1:dim(Matriz_Datos_Mixtos)[1]){
    
  M[i,j]=Dist_Gower(i,j,  Matriz_Datos_Mixtos, p1, p2, p3)
  
   }
  }
 return(M)
}

In [152]:
 %%R

 Matriz_Dist_Gower(Mixed_Data_R, p1=4, p2=3, p3=3)[1:10 , 1:10]

           [,1]      [,2]      [,3]      [,4]      [,5]      [,6]      [,7]
 [1,] 0.0000000 0.6352066 0.6952774 0.7505625 0.8044409 0.7237523 0.7028805
 [2,] 0.6352066 0.0000000 0.6882579 0.7735435 0.7908209 0.5555410 0.6812749
 [3,] 0.6952774 0.6882579 0.0000000 0.7784699 0.8038081 0.7001960 0.6874472
 [4,] 0.7505625 0.7735435 0.7784699 0.0000000 0.7178899 0.7769505 0.7897118
 [5,] 0.8044409 0.7908209 0.8038081 0.7178899 0.0000000 0.7706501 0.7405423
 [6,] 0.7237523 0.5555410 0.7001960 0.7769505 0.7706501 0.0000000 0.6277466
 [7,] 0.7028805 0.6812749 0.6874472 0.7897118 0.7405423 0.6277466 0.0000000
 [8,] 0.8097712 0.6430131 0.6567530 0.6717951 0.7407841 0.6419228 0.6879743
 [9,] 0.6517861 0.6636023 0.7376217 0.8322038 0.8370012 0.7426070 0.8415196
[10,] 0.6809423 0.6348372 0.7842052 0.6532500 0.7324775 0.7003702 0.7665263
           [,8]      [,9]     [,10]
 [1,] 0.8097712 0.6517861 0.6809423
 [2,] 0.6430131 0.6636023 0.6348372
 [3,] 0.6567530 0.7376217 0.7842052
 [4,] 0.6717951 0.83

 #### Gower Distance Matrix in `Python`   <a class="anchor" id="95"></a>

In [153]:
def Dist_Gower_Matrix_Python(Mixed_Data,  p1, p2, p3):

    M = np.zeros((Mixed_Data.shape[0] , Mixed_Data.shape[0]))

    for i in range(0 , Mixed_Data.shape[0]):
        for j in range(0 , Mixed_Data.shape[0]):

            M[i,j] = Dist_Gower_Py(i,j, Mixed_Data , p1, p2, p3)
                 
    return M

In [154]:
Dist_Gower_Matrix_Python(Mixed_Data_Py,  4, 3, 3)

array([[0.        , 0.77733076, 0.84044102, 0.82161524, 0.80470347,
        0.79032948, 0.83338857, 0.81018329, 0.65255287, 0.68132426,
        0.70594737, 0.65959741, 0.80697535, 0.76161424, 0.77059429,
        0.81306003, 0.72003446, 0.73395656, 0.68494107, 0.76982954,
        0.83330334, 0.78590749, 0.70155834, 0.80924159, 0.70489043,
        0.8425313 , 0.71868764, 0.56553401, 0.66826081, 0.69953066,
        0.75710972, 0.62362348, 0.82024291, 0.64760915, 0.73187084,
        0.73713654, 0.64758253, 0.76669783, 0.62892694, 0.69459037,
        0.74321696, 0.53650041, 0.7175527 , 0.79524606, 0.70995964,
        0.79475522, 0.7362307 , 0.71349533, 0.84368229, 0.75080811],
       [0.77733076, 0.        , 0.68881831, 0.70684691, 0.72535402,
        0.63962559, 0.68209232, 0.56038002, 0.73548646, 0.63531452,
        0.69155929, 0.73826867, 0.75631818, 0.77556624, 0.74808997,
        0.73883373, 0.62349838, 0.63816343, 0.62756869, 0.67394792,
        0.77853006, 0.67315046, 0.87024802, 0.6

 ### Gower-Mahalanobis Similarity Coefficient   <a class="anchor" id="96"></a>



The Gower-Mahalanobis similarity coefficient between the elements $i$ and $j$ with respect to the variables $X_1,...,X_p$ is:


\begin{gather*}
S(i,j)_{Gower-Maha}=\dfrac{ \left(1- \dfrac{\delta(i,j)_{Maha}}{max(D_{Maha})} \right) + a_{ij} + \alpha_{ij} }{ \left( p_2 - d_{ij} \right) + p_3}
\end{gather*}


Where:

$p_2 \hspace{0.05cm}$ is the number of binary categorical variables

$p_3 \hspace{0.05cm}$ is the number of multiple (non-binary) categorical variables

$\delta(i,j)_{Maha}  \hspace{0.05cm}$ is the Mahalanobis distance between the individuals with respect to the $p_1$ quantitative variables

$max(D_{Maha})  \hspace{0.05cm}$ is the maximum value of the Mahalanobis distance matrix  between the individuals with respect to the $p_1$ quantitative variables.

$a_{ij}  \hspace{0.05cm}$ is the number of binary variables (there are $p_2$) for which the answer is 1 in both individuals $i$ and $j$

$d_{ij}  \hspace{0.05cm}$ is the number of binary variables (there are $p_2$) for which the response is 0 in both individuals $i$ and $j$

$\alpha_{ij}  \hspace{0.05cm}$ is the number of matches between multiple non-binary categorical variables (there are $p_3$) for individuals $i$ and $j$

 ### Gower-Mahalanobis Distance   <a class="anchor" id="97"></a>



La distancia de Gower-Mahalanobis:



\begin{gather*}
\delta(i,j)_{Gower-Maha} = \sqrt{S(i,i)_{Gower-Maha} +S(j,j)_{Gower-Maha} - 2\cdot S(i,j)_{Gower-Maha} }
\end{gather*}


 #### Gower-Mahalanobis Similarity in `R`   <a class="anchor" id="98"></a>


In [155]:
%%R

Gower_Maha_Similarity_R <- function(i,j, Mixed_Data_Set, p1, p2, p3){

  # X tiene que ser tal que: las p1 primeras variables son cuantitativas, las p2 siguientes son binarias,# tienen que estar las variables ordenadas del siguiente mod las p3 siguientes son categoricas multiples (no binarias). De modo que p=p1+p2+p3

  X=as.matrix(Mixed_Data_Set) 

############################################

  Binary_Data_set = X[ , (p1+1):(p2+p1)]

############################################

  a= Binary_Data_set %*% t(Binary_Data_set)
  
  unos<- rep(1, dim(Binary_Data_set)[2])

  Ones_Matrix <- matrix( rep(unos, dim(Binary_Data_set)[1]),      
                ncol=dim(Binary_Data_set)[2])
                
  d= (Ones_Matrix - Binary_Data_set)%*%t(Ones_Matrix -     
      Binary_Data_set)   

############################################ 

  Multiple_Categorical_Data_set = X[ , (p1+p2+1):(p1+p2+p3)]

############################################

  Quantitative_Data_set  = X[ , 1:p1]

############################################ 

  max_maha = max(Dist_Mahalanobis_Matrix_R(Quantitative_Data_set))
  
############################################
  
  Similaridad_Gower_Mahalanobis = (  1 - Dist_Mahalanobis_R(i,j, Quantitative_Data_set)/max_maha  + a[i,j] + alpha(i,j, Multiple_Categorical_Data_set)  ) / ( p2- d[i,j] + p3 )
  
  return(Similaridad_Gower_Mahalanobis)
} 

In [156]:
%%R  

 Gower_Maha_Similarity_R(1,2, Mixed_Data_R, p1=4, p2=3, p3=3)

          [,1]
[1,] 0.5789689


 #### Gower-Mahalanobis Similarity in `Python`   <a class="anchor" id="99"></a>


In [157]:
def Gower_Maha_Similarity_Python(i,j, Mixed_Data_Set, p1, p2, p3):

    X = Mixed_Data_Set

    # The variable must to be order in the following way: 
    # the p1 first are quantitative, the following p2 are binary categorical, and the following p3 are multiple categorical.

##########################################################################################
    
    ones = np.repeat(1, p1)

    Quantitative_Data = X.iloc[: , 0:p1]
    Binary_Data = X.iloc[: , (p1):(p1+p2)]
    Multiple_Categorical_Data = X.iloc[: , (p1+p2):(p1+p2+p3) ]

    a = Binary_Data @ Binary_Data.T

    ones_matrix = np.ones(( Binary_Data.shape[0] , Binary_Data.shape[1])) 
   
    d = (ones_matrix - Binary_Data) @ (ones_matrix - Binary_Data).T

    max_maha = Dist_Mahalanobis_Matrix_Python(Quantitative_Data).max()

##########################################################################################

    numerator =  1 - ( Dist_Mahalanobis_Python(i,j, Quantitative_Data) / max_maha ) + a.iloc[i,j] + alpha_py(i,j, Multiple_Categorical_Data)

    denominator = (p2 - d.iloc[i,j]) + p3

    Gower_Maha_Similarity = numerator / denominator  

    return(Gower_Maha_Similarity) 

In [158]:
Gower_Maha_Similarity_Python(1,2, Mixed_Data_Py, 4, 3, 3)

0.5789689030957912

 #### Gower-Mahalanobis Similarity Matrix in `R`   <a class="anchor" id="100"></a>


In [159]:
%%R

Sim_Gower_Maha_Matrix_R <- function( Matriz_Datos_Mixtos, p1, p2, p3 ){
  
  Matriz_Datos_Mixtos=as.matrix(Matriz_Datos_Mixtos)
  
  M<-matrix(NA, ncol =dim(Matriz_Datos_Mixtos)[1] , nrow=dim(Matriz_Datos_Mixtos)[1] )
  
  for(i in 1:dim(Matriz_Datos_Mixtos)[1] ){
    for(j in 1:dim(Matriz_Datos_Mixtos)[1]){
    
  M[i,j]=Gower_Maha_Similarity_R(i,j,  Matriz_Datos_Mixtos, p1, p2, p3)
  
   }
  }
 return(M)
}

The computational cost of implement the following code is too large (about 40 minutes)

In [160]:
%%R

# Sim_Gower_Maha_Matrix_R(Mixed_Data_R , p1=4, p2=3, p3=3) 

NULL


 #### Gower-Mahalanobis Similarity Matrix in `Python`   <a class="anchor" id="101"></a>


In [161]:
def Sim_Gower_Maha_Matrix_Python(Mixed_Data,  p1, p2, p3):

    M = np.zeros((Mixed_Data.shape[0] , Mixed_Data.shape[0]))

    for i in range(0 , Mixed_Data.shape[0]):
        for j in range(0 , Mixed_Data.shape[0]):

            M[i,j] = Gower_Maha_Similarity_Python(i,j, Mixed_Data , p1, p2, p3) 
                 
    return M

Again, the computational cost of implement the following code is too large (about 40 minutes)

In [162]:
# Sim_Gower_Maha_Matrix_Python(Mixed_Data_Py, 4, 3, 3)[1:3,1:3]

 #### Gower-Mahalanobis Distance in `R`   <a class="anchor" id="102"></a>


In [165]:
%%R

Dist_Gower_Maha_R <- function(i, j, Mixed_Data , p1, p2, p3) {

Dist_Gower_Mahalanobis <- sqrt( Gower_Maha_Similarity_R(i, i, Mixed_Data , p1, p2, p3) + Gower_Maha_Similarity_R(j, j, Mixed_Data , p1, p2, p3) - 2*Gower_Maha_Similarity_R(i, j, Mixed_Data , p1, p2, p3))

return(Dist_Gower_Mahalanobis)
}

In [167]:
%%R 

Dist_Gower_Maha_R(1,2, Mixed_Data_R , p1=4, p2=3, p3=3)

         [,1]
[1,] 1.121931


 #### Gower-Mahalanobis Distance in `Python`   <a class="anchor" id="103"></a>


In [170]:
def Dist_Gower_Maha_Python(i, j, Mixed_Data, p1, p2, p3):

    Dist_Gower_Mahalanobis = np.sqrt( Gower_Maha_Similarity_Python(i, i, Mixed_Data , p1, p2, p3) + Gower_Maha_Similarity_Python(j, j, Mixed_Data , p1, p2, p3) - 2*Gower_Maha_Similarity_Python(i, j, Mixed_Data , p1, p2, p3) )

    return Dist_Gower_Mahalanobis

In [171]:
Dist_Gower_Maha_Python(1, 2, Mixed_Data_Py, 4, 3, 3)

1.1219308626092273

 #### Gower-Mahalanobis Distance Matrix in `R`   <a class="anchor" id="104"></a>


In [175]:
%%R

 Dist_Gower_Maha_Matrix_R <- function( Mixed_Data, p1, p2, p3 ){
  
  Mixed_Data=as.matrix(Mixed_Data)
  
  M<-matrix(NA, ncol = dim(Mixed_Data)[1] , nrow = dim(Mixed_Data)[1] )
  
  for(i in 1:dim(Mixed_Data)[1] ){
    for(j in 1:dim(Mixed_Data)[1]){
    
  M[i,j]=Dist_Gower_Maha_R(i,j,  Mixed_Data, p1, p2, p3)
  
   }
  }
 return(M)
}

The computational cost of implement this code is too large

In [1]:
%%R 

# Dist_Gower_Maha_Matrix_R(Mixed_Data_R, p1=4, p2=3, p3=3)

UsageError: Cell magic `%%R` not found.


 #### Gower-Mahalanobis Distance Matrix in `Python`   <a class="anchor" id="105"></a>


In [None]:
def Dist_Gower_Maha_Matrix_Python(Mixed_Data,  p1, p2, p3):

    M = np.zeros((Mixed_Data.shape[0] , Mixed_Data.shape[0]))

    for i in range(0 , Mixed_Data.shape[0]):
        for j in range(0 , Mixed_Data.shape[0]):

            M[i,j] = Dist_Gower_Maha_Python(i,j, Mixed_Data , p1, p2, p3) 
                 
    return M

The computational cost of implement this code is too large

In [None]:
# Dist_Gower_Maha_Matrix_Python(Mixed_Data_Py , 4, 3, 3)

## Bibliography <a class="anchor" id="1"></a>

Notes from the professor at UC3M **Aurea Grane Chavez**

https://numpy.org/doc/stable/reference/random/legacy.html

