# Choice of Metric in Machine Learning

A distance metric measures how similar two data points are. A small distance indicates a high degree of similarity, whereas a large distance indicates a low degree of similarity.

Two limiting cases of similarity (where X and Y are two objects):  
If X = Y, similarity = 1  
If X ≠ Y, similarity = 0

Euclidean distance - the straight-line distance between two points in Euclidean space. Equipped with this distance, Euclidean space becomes a metric space. It is the most commonly used distance.
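
As a minimal sketch of the definition, the Euclidean distance can be computed directly from the coordinate differences. The two points `p` and `q` are made up for illustration and are not taken from the dataset used in this notebook.

```python
import numpy as np

# Two illustrative points (hypothetical values, not from the dataset)
p = np.array([1.0, 2.0, 3.0])
q = np.array([4.0, 6.0, 3.0])

# Square root of the sum of squared coordinate differences
manual = np.sqrt(np.sum((p - q) ** 2))

# The norm of the difference vector is the same quantity
via_norm = np.linalg.norm(p - q)

print(manual, via_norm)
```

The coordinate differences here are (-3, -4, 0), so both expressions evaluate to 5.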

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

import warnings
warnings.filterwarnings("ignore")

# fix_yahoo_finance (since renamed to yfinance) is used to fetch data
import fix_yahoo_finance as yf
yf.pdr_override()

In [2]:
# input
symbol = 'AMD'
start = '2014-01-01'
end = '2018-08-27'

# Read data 
dataset = yf.download(symbol,start,end)

# Preview the first rows
dataset.head()

[*********************100%***********************]  1 of 1 downloaded


Date,Open,High,Low,Close,Adj Close,Volume
2014-01-02,3.85,3.98,3.84,3.95,3.95,20548400
2014-01-03,3.98,4.0,3.88,4.0,4.0,22887200
2014-01-06,4.01,4.18,3.99,4.13,4.13,42398300
2014-01-07,4.19,4.25,4.11,4.18,4.18,42932100
2014-01-08,4.23,4.26,4.14,4.18,4.18,30678700


In [3]:
dataset.shape

(1172, 6)

In [4]:
# Reshape each price series into a column vector of shape (n_samples, 1)
X = dataset['Open'].values.reshape(-1, 1)
Y = dataset['Adj Close'].values.reshape(-1, 1)

In [5]:
from sklearn.metrics.pairwise import euclidean_distances

In [6]:
euclidean_distances(X, X) # distance between rows of X

array([[ 0.      ,  0.13    ,  0.16    , ..., 17.340001, 19.06    ,
        21.090001],
       [ 0.13    ,  0.      ,  0.03    , ..., 17.210001, 18.93    ,
        20.960001],
       [ 0.16    ,  0.03    ,  0.      , ..., 17.180001, 18.9     ,
        20.930001],
       ...,
       [17.340001, 17.210001, 17.180001, ...,  0.      ,  1.719999,
         3.75    ],
       [19.06    , 18.93    , 18.9     , ...,  1.719999,  0.      ,
         2.030001],
       [21.090001, 20.960001, 20.930001, ...,  3.75    ,  2.030001,
         0.      ]])

In [7]:
euclidean_distances(X, Y) # distance between each row of X and each row of Y

array([[1.0000000e-01, 1.5000000e-01, 2.8000000e-01, ..., 1.8440001e+01,
        2.0130000e+01, 2.1410000e+01],
       [3.0000000e-02, 2.0000000e-02, 1.5000000e-01, ..., 1.8310001e+01,
        2.0000000e+01, 2.1280000e+01],
       [6.0000000e-02, 1.0000000e-02, 1.2000000e-01, ..., 1.8280001e+01,
        1.9970000e+01, 2.1250000e+01],
       ...,
       [1.7240001e+01, 1.7190001e+01, 1.7060001e+01, ..., 1.1000000e+00,
        2.7899990e+00, 4.0699990e+00],
       [1.8960000e+01, 1.8910000e+01, 1.8780000e+01, ..., 6.1999900e-01,
        1.0700000e+00, 2.3500000e+00],
       [2.0990001e+01, 2.0940001e+01, 2.0810001e+01, ..., 2.6500000e+00,
        9.6000100e-01, 3.1999900e-01]])

In [8]:
from scipy.spatial import distance

In [9]:
distance.euclidean(X.ravel(), Y.ravel()) # scipy expects 1-D vectors

8.051006457582059

In [10]:
from sklearn.neighbors import DistanceMetric

In [11]:
dist = DistanceMetric.get_metric('euclidean')
dist.pairwise(X)

array([[ 0.      ,  0.13    ,  0.16    , ..., 17.340001, 19.06    ,
        21.090001],
       [ 0.13    ,  0.      ,  0.03    , ..., 17.210001, 18.93    ,
        20.960001],
       [ 0.16    ,  0.03    ,  0.      , ..., 17.180001, 18.9     ,
        20.930001],
       ...,
       [17.340001, 17.210001, 17.180001, ...,  0.      ,  1.719999,
         3.75    ],
       [19.06    , 18.93    , 18.9     , ...,  1.719999,  0.      ,
         2.030001],
       [21.090001, 20.960001, 20.930001, ...,  3.75    ,  2.030001,
         0.      ]])

Squared Euclidean distance - the sum of the squares of the differences between corresponding values of two data points, i.e. the Euclidean distance without the final square root.

In [12]:
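# scipy.spatial.distance.pdist returns the condensed pairwise distance vector
from scipy.spatial.distance import pdist

In [13]:
# metric='euclidean' gives plain (unsquared) distances, as in the output below;
# metric='sqeuclidean' would give the squared values
pairwise_euclidean = pdist(X, metric='euclidean')
pairwise_euclidean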

array([0.13    , 0.16    , 0.34    , ..., 1.719999, 3.75    , 2.030001])
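
To make the difference between the plain and the squared distance concrete, here is a small sketch using `pdist` with both `metric='euclidean'` and `metric='sqeuclidean'`. The three one-feature points in `pts` are hypothetical values chosen so the arithmetic is easy to follow.

```python
import numpy as np
from scipy.spatial.distance import pdist

# Three illustrative one-feature points (hypothetical values)
pts = np.array([[0.0], [3.0], [7.0]])

# Condensed output order is the pairs (0,1), (0,2), (1,2)
plain = pdist(pts, metric='euclidean')
squared = pdist(pts, metric='sqeuclidean')

print(plain)
print(squared)
```

The plain distances are 3, 7 and 4, and the squared distances are their squares: 9, 49 and 16.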

Manhattan distance - the sum of the absolute differences between the coordinates of two points. It is also called city block or taxicab distance, because it is the length of a path that moves only along the coordinate axes.

In [14]:
from scipy.spatial import distance

In [15]:
distance.cityblock(X.ravel(), Y.ravel()) # scipy expects 1-D vectors

170.850004

In [16]:
from sklearn.metrics.pairwise import manhattan_distances

In [17]:
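# sum_over_features=False returns per-feature differences for every row pair (parameter removed in newer scikit-learn)
manhattan_distances(X, Y, sum_over_features=False)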

array([[0.1     ],
       [0.15    ],
       [0.28    ],
       ...,
       [2.65    ],
       [0.960001],
       [0.319999]])
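
The Manhattan distance can also be checked by hand. The sketch below compares a manual sum of absolute differences with `distance.cityblock`; the vectors `u` and `v` are illustrative values, not taken from the dataset.

```python
import numpy as np
from scipy.spatial import distance

# Two illustrative vectors (hypothetical values)
u = np.array([1.0, 5.0, 2.0])
v = np.array([4.0, 1.0, 2.0])

# Sum of absolute coordinate differences: |1-4| + |5-1| + |2-2|
manual = np.sum(np.abs(u - v))

# scipy's cityblock implements the same definition for 1-D vectors
via_scipy = distance.cityblock(u, v)

print(manual, via_scipy)
```

Both values are 7 for these inputs.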

Maximum distance (Chebyshev distance) - a metric defined on a vector space where the distance between two vectors is the greatest of their absolute differences along any coordinate dimension.

In [18]:
from scipy.spatial import distance

In [19]:
distance.chebyshev(X.ravel(), Y.ravel()) # scipy expects 1-D vectors

1.4100000000000001

In [20]:
from sklearn.neighbors import DistanceMetric

dist = DistanceMetric.get_metric('chebyshev')
dist

<sklearn.neighbors.dist_metrics.ChebyshevDistance at 0x1cbbccef518>

In [21]:
dist.pairwise(X)

array([[ 0.      ,  0.13    ,  0.16    , ..., 17.340001, 19.06    ,
        21.090001],
       [ 0.13    ,  0.      ,  0.03    , ..., 17.210001, 18.93    ,
        20.960001],
       [ 0.16    ,  0.03    ,  0.      , ..., 17.180001, 18.9     ,
        20.930001],
       ...,
       [17.340001, 17.210001, 17.180001, ...,  0.      ,  1.719999,
         3.75    ],
       [19.06    , 18.93    , 18.9     , ...,  1.719999,  0.      ,
         2.030001],
       [21.090001, 20.960001, 20.930001, ...,  3.75    ,  2.030001,
         0.      ]])
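
The pairwise Chebyshev matrix above is identical to the Euclidean one because `X` has a single feature: with one coordinate, both metrics reduce to the absolute difference. The following sketch illustrates this with `scipy.spatial.distance.cdist` on made-up points and shows that the two metrics differ once there is a second feature.

```python
import numpy as np
from scipy.spatial.distance import cdist

# Illustrative data (hypothetical values): one feature vs. two features
one_d = np.array([[1.0], [4.0], [9.0]])
two_d = np.array([[1.0, 2.0], [4.0, 6.0], [9.0, 2.0]])

# With a single feature the two metrics agree everywhere
same_1d = np.allclose(cdist(one_d, one_d, 'chebyshev'),
                      cdist(one_d, one_d, 'euclidean'))

# With two features they generally differ
cheb_2d = cdist(two_d, two_d, 'chebyshev')
eucl_2d = cdist(two_d, two_d, 'euclidean')

print(same_1d)
print(cheb_2d[0, 1], eucl_2d[0, 1])
```

For the first two rows of `two_d` the coordinate differences are 3 and 4, so the Chebyshev distance is 4 while the Euclidean distance is 5.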

Mahalanobis distance - a measurement of the distance between a point P and a distribution D, taking the covariance of D into account.

https://www.statisticshowto.datasciencecentral.com/mahalanobis-distance/

In [22]:
from sklearn.neighbors import DistanceMetric

In [23]:
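# Rows of X are observations, so estimate the covariance with rowvar=False
dist = DistanceMetric.get_metric('mahalanobis', V=np.atleast_2d(np.cov(X, rowvar=False)))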
dist

<sklearn.neighbors.dist_metrics.MahalanobisDistance at 0x1cbbccef588>

In [24]:
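# rdist_to_dist maps a reduced distance to a true distance; applied to X it returns sqrt(X), not a distance between samples
dist.rdist_to_dist(X)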

array([[1.96214169],
       [1.99499373],
       [2.00249844],
       ...,
       [4.60325982],
       [4.78643918],
       [4.9939965 ]])

In [25]:
# dist_to_rdist is the inverse mapping; applied to X it returns X**2
dist.dist_to_rdist(X)

array([[ 14.8225    ],
       [ 15.8404    ],
       [ 16.0801    ],
       ...,
       [449.01614238],
       [524.8681    ],
       [622.00364988]])
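
For comparison, a Mahalanobis distance between a point and a distribution can be sketched with `scipy.spatial.distance.mahalanobis`, which takes the inverse covariance matrix as its third argument. The two-feature sample below is randomly generated for illustration only and is an assumption of this sketch, as are the names `sample` and `point`.

```python
import numpy as np
from scipy.spatial import distance

# Synthetic correlated two-feature sample (illustrative, not the AMD data)
rng = np.random.default_rng(0)
sample = rng.normal(size=(200, 2)) @ np.array([[2.0, 0.0], [1.0, 0.5]])

# Describe the distribution by its mean and inverse covariance
mean = sample.mean(axis=0)
cov = np.cov(sample, rowvar=False)   # rows are observations
inv_cov = np.linalg.inv(cov)

# Distance from a hypothetical point to the distribution
point = np.array([3.0, 1.0])
via_scipy = distance.mahalanobis(point, mean, inv_cov)

# Same quantity from the definition: sqrt(delta^T * inv_cov * delta)
delta = point - mean
manual = np.sqrt(delta @ inv_cov @ delta)

print(via_scipy, manual)
```

The two printed values should agree up to floating-point error, since the manual expression is the definition that the scipy function evaluates.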

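Jaccard metric - the Jaccard index measures the similarity of two sets as the size of their intersection divided by the size of their union. The Jaccard distance is one minus this index.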

In [27]:
from scipy.spatial import distance

distance.jaccard(X.ravel(), Y.ravel()) # scipy expects 1-D vectors

0.96160409556314

In [29]:
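# jaccard_similarity_score was deprecated in scikit-learn 0.21 and removed in 0.23;
# on integer labels like these it behaves like accuracy_score
from sklearn.metrics import jaccard_similarity_score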

X = dataset['Open'].astype(int)
Y = dataset['Adj Close'].astype(int)
jaccard_similarity_score(X, Y)

0.8447098976109215

In [30]:
jaccard_similarity_score(X, Y, normalize=False)

990
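
The Jaccard index is easiest to see on boolean vectors, where each position marks membership in a set. The sketch below computes intersection over union by hand and compares it with `distance.jaccard`, which returns the dissimilarity. The vectors `a` and `b` are illustrative values, not derived from the price data.

```python
import numpy as np
from scipy.spatial import distance

# Two illustrative membership vectors (hypothetical values)
a = np.array([1, 1, 0, 1, 0], dtype=bool)
b = np.array([1, 0, 0, 1, 1], dtype=bool)

# Jaccard index: |A and B| / |A or B|
intersection = np.sum(a & b)
union = np.sum(a | b)
similarity = intersection / union

# scipy returns the Jaccard dissimilarity, i.e. 1 - index
dissimilarity = distance.jaccard(a, b)

print(similarity, dissimilarity)
```

Here two positions are shared and four are set in at least one vector, so the index and the dissimilarity are both 0.5.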