### Data Distance Measurement

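The closer two data points are in space, the more similar they can be considered. Measuring the distance, or equivalently the similarity, between data points is therefore central to data analysis and modeling. Data distance measurement is the task of deciding, for a reference observation A, which of two observations B and C lies closer.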

**Euclidean Distance**  
Euclidean distance measures the straight-line distance between observations and follows from the Pythagorean theorem. For n-dimensional data it is computed as follows.  
$$d(A,B)=\sqrt{(a_1-b_1)^2 + (a_2-b_2)^2 + \cdots + (a_n-b_n)^2} = \sqrt{\sum_{i=1}^{n}(a_i-b_i)^2}$$  
It is also known as the L2 norm.
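
As a quick check of the formula, the following minimal sketch computes the Euclidean distance by hand with NumPy and compares it with `scipy.spatial.distance.euclidean`. The points `a` and `b` are made-up illustration values (a 3-4-5 right triangle), not taken from any dataset.

```python
import numpy as np
from scipy.spatial import distance

# Hypothetical 2-D points chosen for illustration (a 3-4-5 right triangle)
a = np.array([0.0, 0.0])
b = np.array([3.0, 4.0])

# Square root of the sum of squared coordinate differences
d_manual = np.sqrt(np.sum((a - b) ** 2))

# SciPy's Euclidean distance between two 1-D vectors
d_scipy = distance.euclidean(a, b)

print(d_manual, d_scipy)  # both 5.0
```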

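**Manhattan Distance**  
Manhattan distance is the sum of the absolute differences along each axis. In two dimensions it is the X-axis distance plus the Y-axis distance from point A to point B.  
$$d(A,B) = |a_1-b_1| + |a_2-b_2|+\cdots+|a_n-b_n|$$  
![manhattan distance](https://static.packt-cdn.com/products/9781789956399/graphics/C12628_01_06.jpg)  
It is also known as the L1 norm.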
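
The sketch below applies the same idea to two made-up points: it sums the absolute coordinate differences by hand and compares the result with `scipy.spatial.distance.cityblock`, which is SciPy's name for this metric. The values of `a` and `b` are assumptions for illustration only.

```python
import numpy as np
from scipy.spatial import distance

# Hypothetical 2-D points chosen for illustration
a = np.array([0.0, 0.0])
b = np.array([3.0, 4.0])

# Sum of absolute differences along each axis: |0-3| + |0-4|
d_manual = np.sum(np.abs(a - b))

# SciPy calls the Manhattan metric "cityblock"
d_scipy = distance.cityblock(a, b)

print(d_manual, d_scipy)  # both 7.0
```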

**Minkowski Distance**  
Minkowski distance generalizes the Euclidean formula by replacing the exponent 2 with an adjustable parameter p (the p-norm). Setting p = 1 gives Manhattan distance, and p = 2 gives Euclidean distance.  
$$d(A,B)=\left(\sum_{i=1}^n|a_i-b_i|^p\right)^{\frac{1}{p}}$$
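
To see how p interpolates between the two earlier metrics, here is a small sketch with a hand-written helper, `minkowski_manual` (a hypothetical name introduced only for this example), checked against `scipy.spatial.distance.minkowski`. The points are made-up values.

```python
import numpy as np
from scipy.spatial import distance

def minkowski_manual(a, b, p):
    """Minkowski distance: (sum of |a_i - b_i|^p) ^ (1/p)."""
    return np.sum(np.abs(a - b) ** p) ** (1.0 / p)

# Hypothetical 2-D points chosen for illustration
a = np.array([0.0, 0.0])
b = np.array([3.0, 4.0])

d_p1 = minkowski_manual(a, b, 1)     # same as Manhattan distance
d_p2 = minkowski_manual(a, b, 2)     # same as Euclidean distance
d_p15 = minkowski_manual(a, b, 1.5)  # falls between the two

print(d_p1, d_p2, d_p15)
print(distance.minkowski(a, b, p=1.5))  # SciPy's value for p = 1.5
```

For these points the p = 1.5 value lies between the Euclidean and Manhattan results, which matches the pattern in the notebook output further down.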


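**Chebyshev Distance**  
Chebyshev distance, also called maximum distance or the L-max (L∞) norm, is the limit of Minkowski distance as p goes to infinity. It equals the largest absolute difference along any single coordinate.  
It is used, for example, when the maximum distance between clusters is of interest.  
$$d(A,B)=\max_{i}|a_i-b_i|=\lim_{p\rightarrow \infty}\left [\sum_{i=1}^n|a_i-b_i|^p \right ]^{\frac{1}{p}}$$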
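
The limit can be observed numerically. The sketch below, using made-up points, evaluates Minkowski distance for increasing p and compares it with `scipy.spatial.distance.chebyshev`. The list of p values is an arbitrary choice for illustration.

```python
import numpy as np
from scipy.spatial import distance

# Hypothetical 2-D points chosen for illustration
a = np.array([0.0, 0.0])
b = np.array([3.0, 4.0])

# Largest absolute coordinate difference: max(|0-3|, |0-4|)
d_manual = np.max(np.abs(a - b))
d_scipy = distance.chebyshev(a, b)
print(d_manual, d_scipy)  # both 4.0

# Minkowski distance approaches the Chebyshev value as p grows
for p in [1, 2, 10, 50]:
    print(p, distance.minkowski(a, b, p=p))

d_p50 = distance.minkowski(a, b, p=50)
```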

**Mahalanobis Distance**  
Mahalanobis distance is Euclidean distance adjusted for covariance. It reflects both the variance within each variable and the covariance between variables when computing the distance between A and B. Because it accounts for the distribution of the data, the Euclidean formula is weighted by the inverse of the covariance matrix $\Sigma$.  
$$d(A,B) = \sqrt{(A-B)\,\Sigma^{-1}\,(A-B)^T}$$  
![mahalanobis](https://www.researchgate.net/publication/334035854/figure/fig4/AS:774052423688200@1561559887723/Fig-A1-Example-of-the-Mahalanobis-distance-d-M-for-2-dimensional-case.png)
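
The following sketch works through the formula on a small made-up sample `X` (an assumption for illustration). It estimates the covariance matrix with `np.cov`, inverts it, and compares the hand-computed distance with `scipy.spatial.distance.mahalanobis`, which takes the inverse covariance matrix as its third argument.

```python
import numpy as np
from scipy.spatial import distance

# Hypothetical sample: 5 observations of 2 correlated variables
X = np.array([
    [1.0, 2.0],
    [2.0, 1.0],
    [3.0, 5.0],
    [4.0, 3.0],
    [5.0, 6.0],
])

# Sample covariance matrix (columns are variables) and its inverse
cov = np.cov(X, rowvar=False)
VI = np.linalg.inv(cov)

# Distance between the first two observations, following the formula
u, v = X[0], X[1]
diff = u - v
d_manual = np.sqrt(diff @ VI @ diff)

# SciPy expects the inverse covariance matrix as the third argument
d_scipy = distance.mahalanobis(u, v, VI)
print(d_manual, d_scipy)

# With an identity covariance matrix the result reduces to Euclidean distance
d_identity = distance.mahalanobis(u, v, np.eye(2))
print(d_identity, distance.euclidean(u, v))
```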

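**Cosine Distance**  
Cosine similarity measures the similarity between two points using only the angle between their vectors. A small angle means high similarity, and a large angle means low similarity. In other words, similarity depends on how much the directions differ in the coordinate space, not on the actual distance between the points.  
![cosine similarity](https://www.tyrrell4innovation.ca/wp-content/uploads/2021/06/rsz_jenny_du_miword.png)  
Cosine distance is 1 minus cosine similarity. Cosine similarity ranges from -1 to 1 in general, and from 0 to 1 when all vector components are non-negative; the closer it is to 1, the more similar the vectors. The higher the cosine similarity, the smaller the cosine distance.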
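
Because only the angle matters, scaling a vector does not change its cosine distance to another vector. The sketch below illustrates this with made-up vectors and a small helper, `cosine_similarity` (a hypothetical name defined here, not a SciPy function), alongside `scipy.spatial.distance.cosine`.

```python
import numpy as np
from scipy.spatial import distance

def cosine_similarity(u, v):
    """Cosine of the angle between u and v."""
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Hypothetical vectors chosen for illustration
a = np.array([1.0, 2.0])
same_direction = 10 * a              # longer, but same direction as a
orthogonal = np.array([-2.0, 1.0])   # at a right angle to a
opposite = -a                        # points the opposite way

d_same = distance.cosine(a, same_direction)  # approximately 0
d_orth = distance.cosine(a, orthogonal)      # approximately 1
d_opp = distance.cosine(a, opposite)         # approximately 2

print(d_same, d_orth, d_opp)
print(1 - cosine_similarity(a, orthogonal))  # matches d_orth
```

This also explains the very small cosine distances in the notebook output below: latitude/longitude pairs from one district all point in almost the same direction from the origin, so the angles between them are close to zero.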

In [1]:
from scipy.spatial import distance 
import pandas as pd

In [2]:
# https://www.kaggle.com/datasets/sewonghwang/starbucks-seoul?select=Starbucks_Seoul.csv
df = pd.read_csv("/kaggle/input/starbucks-seoul/Starbucks_Seoul.csv")

In [3]:
df.tail()

Unnamed: 0,name,address,gu_name,latitude,longitude
516,황학사거리,"서울특별시 성동구 왕십리로 410 (하왕십리동, 센트라스)",성동구,37.567644,127.023697
517,황학캐슬,서울특별시 중구 청계천로 400 (황학동) 롯데캐슬베네치아,중구,37.57124,127.022905
518,회기역사거리,서울특별시 동대문구 이문로 37 (회기동),동대문구,37.590784,127.056098
519,회현역,서울특별시 중구 퇴계로 72 (회현동) 리더스뷰남산,중구,37.558981,126.980747
520,효창공원앞역,"서울특별시 용산구 백범로 313 (효창동, 용산 롯데캐슬 센터포레)",용산구,37.539305,126.96306


In [4]:
# Keep only the Starbucks stores in Geumcheon-gu (금천구)
df1 = df[(df['gu_name']=='금천구')]

df1

Unnamed: 0,name,address,gu_name,latitude,longitude
2,W-Mall,서울특별시 금천구 디지털로 188 (가산동),금천구,37.477305,126.887691
7,가산그레이트,"서울특별시 금천구 디지털로9길 32, 갑을그레이트밸리 1층 (가산동)",금천구,37.479449,126.887477
8,가산디지털,서울특별시 금천구 가산디지털1로 168 (가산동),금천구,37.479835,126.8825
9,가산디지털단지역,서울특별시 금천구 벚꽃로 298 (가산동),금천구,37.481324,126.883779
10,가산브이타워,서울특별시 금천구 가산디지털1로 128 (가산동),금천구,37.477221,126.883683
11,가산에스케이,서울특별시 금천구 가산디지털1로 171 (가산동),금천구,37.480613,126.88057
79,금천독산,서울특별시 금천구 두산로 70 (독산동),금천구,37.46914,126.895246
80,금천시흥,서울특별시 금천구 시흥대로 164 (시흥동),금천구,37.448624,126.903183
123,독산사거리,서울특별시 금천구 시흥대로 426 (독산동),금천구,37.471511,126.898775
146,마리오아울렛,서울특별시 금천구 벚꽃로 266 마리오아울렛3관,금천구,37.478701,126.885076


In [5]:
# Euclidean distance

dist_euclidean = distance.cdist(df1[['latitude','longitude']], 
                                df1[['latitude','longitude']], metric='euclidean') 

# Label the columns and the index with the store names
dist_euclidean = pd.DataFrame(data=dist_euclidean, columns=df1['name'])
dist_euclidean.set_index(df1['name'],inplace=True)

dist_euclidean

name,W-Mall,가산그레이트,가산디지털,가산디지털단지역,가산브이타워,가산에스케이,금천독산,금천시흥,독산사거리,마리오아울렛
W-Mall,0.0,0.002155,0.005775,0.005608,0.004009,0.007852,0.011124,0.032598,0.012507,0.002964
가산그레이트,0.002155,0.0,0.004992,0.004146,0.0044,0.007004,0.012909,0.034596,0.013808,0.002515
가산디지털,0.005775,0.004992,0.0,0.001963,0.002869,0.002081,0.016639,0.037442,0.01828,0.002815
가산디지털단지역,0.005608,0.004146,0.001963,0.0,0.004104,0.003287,0.016731,0.038024,0.017921,0.002926
가산브이타워,0.004009,0.0044,0.002869,0.004104,0.0,0.004604,0.014107,0.034613,0.016136,0.002032
가산에스케이,0.007852,0.007004,0.002081,0.003287,0.004604,0.0,0.018628,0.039175,0.020353,0.004895
금천독산,0.011124,0.012909,0.016639,0.016731,0.014107,0.018628,0.0,0.021998,0.004252,0.013959
금천시흥,0.032598,0.034596,0.037442,0.038024,0.034613,0.039175,0.021998,0.0,0.023308,0.035107
독산사거리,0.012507,0.013808,0.01828,0.017921,0.016136,0.020353,0.004252,0.023308,0.0,0.015471
마리오아울렛,0.002964,0.002515,0.002815,0.002926,0.002032,0.004895,0.013959,0.035107,0.015471,0.0


In [7]:
# Manhattan distance

dist_manhattan = distance.cdist(df1[['latitude','longitude']], 
                                df1[['latitude','longitude']], metric='cityblock') 

# Label the columns and the index with the store names
dist_manhattan = pd.DataFrame(data=dist_manhattan, columns=df1['name'])
dist_manhattan.set_index(df1['name'],inplace=True)

dist_manhattan

name,W-Mall,가산그레이트,가산디지털,가산디지털단지역,가산브이타워,가산에스케이,금천독산,금천시흥,독산사거리,마리오아울렛
W-Mall,0.0,0.002358,0.007721,0.007931,0.004092,0.010429,0.01572,0.044173,0.016878,0.004011
가산그레이트,0.002358,0.0,0.005363,0.005573,0.006022,0.008071,0.018078,0.046531,0.019236,0.003149
가산디지털,0.007721,0.005363,0.0,0.002769,0.003797,0.002708,0.023441,0.051894,0.024599,0.00371
가산디지털단지역,0.007931,0.005573,0.002769,0.0,0.004199,0.00392,0.023651,0.052104,0.024809,0.00392
가산브이타워,0.004092,0.006022,0.003797,0.004199,0.0,0.006505,0.019644,0.048097,0.020802,0.002873
가산에스케이,0.010429,0.008071,0.002708,0.00392,0.006505,0.0,0.026149,0.054602,0.027307,0.006418
금천독산,0.01572,0.018078,0.023441,0.023651,0.019644,0.026149,0.0,0.028453,0.0059,0.019731
금천시흥,0.044173,0.046531,0.051894,0.052104,0.048097,0.054602,0.028453,0.0,0.027295,0.048184
독산사거리,0.016878,0.019236,0.024599,0.024809,0.020802,0.027307,0.0059,0.027295,0.0,0.020889
마리오아울렛,0.004011,0.003149,0.00371,0.00392,0.002873,0.006418,0.019731,0.048184,0.020889,0.0


In [8]:
# Minkowski distance

dist_minkowski = distance.cdist(df1[['latitude','longitude']], 
                                df1[['latitude','longitude']],  metric='minkowski', p=1.5) 

# Label the columns and the index with the store names
dist_minkowski = pd.DataFrame(data=dist_minkowski, columns=df1['name'])
dist_minkowski.set_index(df1['name'],inplace=True)

dist_minkowski

name,W-Mall,가산그레이트,가산디지털,가산디지털단지역,가산브이타워,가산에스케이,금천독산,금천시흥,독산사거리,마리오아울렛
W-Mall,0.0,0.002189,0.00631,0.006295,0.004016,0.008554,0.012482,0.035842,0.013725,0.003257
가산그레이트,0.002189,0.0,0.005048,0.004541,0.00486,0.007222,0.01442,0.037907,0.015384,0.002672
가산디지털,0.00631,0.005048,0.0,0.002201,0.003121,0.002247,0.018641,0.041612,0.020034,0.003056
가산디지털단지역,0.006295,0.004541,0.002201,0.0,0.004113,0.003428,0.018776,0.042028,0.019906,0.0032
가산브이타워,0.004016,0.00486,0.003121,0.004113,0.0,0.005165,0.015714,0.038516,0.017351,0.002281
가산에스케이,0.008554,0.007222,0.002247,0.003428,0.005165,0.0,0.020832,0.043657,0.022276,0.005302
금천독산,0.012482,0.01442,0.018641,0.018776,0.015714,0.020832,0.0,0.023688,0.004728,0.015664
금천시흥,0.035842,0.037907,0.041612,0.042028,0.038516,0.043657,0.023688,0.0,0.024159,0.038834
독산사거리,0.013725,0.015384,0.020034,0.019906,0.017351,0.022276,0.004728,0.024159,0.0,0.016982
마리오아울렛,0.003257,0.002672,0.003056,0.0032,0.002281,0.005302,0.015664,0.038834,0.016982,0.0


In [9]:
# Chebyshev distance

dist_chebyshev = distance.cdist(df1[['latitude','longitude']], 
                                df1[['latitude','longitude']],  metric='chebyshev') 

# Label the columns and the index with the store names
dist_chebyshev = pd.DataFrame(data=dist_chebyshev, columns=df1['name'])
dist_chebyshev.set_index(df1['name'],inplace=True)

dist_chebyshev

name,W-Mall,가산그레이트,가산디지털,가산디지털단지역,가산브이타워,가산에스케이,금천독산,금천시흥,독산사거리,마리오아울렛
W-Mall,0.0,0.002144,0.005191,0.004019,0.004008,0.007121,0.008165,0.028681,0.011084,0.002615
가산그레이트,0.002144,0.0,0.004977,0.003698,0.003794,0.006907,0.010309,0.030825,0.011298,0.002401
가산디지털,0.005191,0.004977,0.0,0.001489,0.002614,0.00193,0.012746,0.031211,0.016275,0.002576
가산디지털단지역,0.004019,0.003698,0.001489,0.0,0.004103,0.003209,0.012184,0.0327,0.014996,0.002623
가산브이타워,0.004008,0.003794,0.002614,0.004103,0.0,0.003392,0.011563,0.028597,0.015092,0.00148
가산에스케이,0.007121,0.006907,0.00193,0.003209,0.003392,0.0,0.014676,0.031989,0.018205,0.004506
금천독산,0.008165,0.010309,0.012746,0.012184,0.011563,0.014676,0.0,0.020516,0.003529,0.01017
금천시흥,0.028681,0.030825,0.031211,0.0327,0.028597,0.031989,0.020516,0.0,0.022887,0.030077
독산사거리,0.011084,0.011298,0.016275,0.014996,0.015092,0.018205,0.003529,0.022887,0.0,0.013699
마리오아울렛,0.002615,0.002401,0.002576,0.002623,0.00148,0.004506,0.01017,0.030077,0.013699,0.0


In [10]:
# Mahalanobis distance

dist_mahalanobis = distance.cdist(df1[['latitude','longitude']], 
                                  df1[['latitude','longitude']],  metric='mahalanobis') 

# Label the columns and the index with the store names
dist_mahalanobis = pd.DataFrame(data=dist_mahalanobis, columns=df1['name'])
dist_mahalanobis.set_index(df1['name'],inplace=True)

dist_mahalanobis

name,W-Mall,가산그레이트,가산디지털,가산디지털단지역,가산브이타워,가산에스케이,금천독산,금천시흥,독산사거리,마리오아울렛
W-Mall,0.0,0.412222,1.008327,0.534635,1.137676,1.408515,1.021456,3.181956,2.09265,0.489261
가산그레이트,0.412222,0.0,1.320538,0.70698,1.502236,1.716643,1.09537,3.520033,1.851829,0.817487
가산디지털,1.008327,1.320538,0.0,0.658912,0.317804,0.400714,1.900196,3.251106,3.100708,0.519729
가산디지털단지역,0.534635,0.70698,0.658912,0.0,0.911016,1.035496,1.554849,3.489268,2.550268,0.304243
가산브이타워,1.137676,1.502236,0.317804,0.911016,0.0,0.420369,1.900563,2.976769,3.197236,0.687926
가산에스케이,1.408515,1.716643,0.400714,1.035496,0.420369,0.0,2.271659,3.340112,3.500299,0.920369
금천독산,1.021456,1.09537,1.900196,1.554849,1.900563,2.271659,0.0,2.70965,1.458239,1.434991
금천시흥,3.181956,3.520033,3.251106,3.489268,2.976769,3.340112,2.70965,0.0,3.913709,3.194173
독산사거리,2.09265,1.851829,3.100708,2.550268,3.197236,3.500299,1.458239,3.913709,0.0,2.581894
마리오아울렛,0.489261,0.817487,0.519729,0.304243,0.687926,0.920369,1.434991,3.194173,2.581894,0.0


In [11]:
# Cosine distance

dist_cosine = distance.cdist(df1[['latitude','longitude']], 
                             df1[['latitude','longitude']],  metric='cosine') 

# Label the columns and the index with the store names
dist_cosine = pd.DataFrame(data=dist_cosine, columns=df1['name'])
dist_cosine.set_index(df1['name'],inplace=True)

dist_cosine

name,W-Mall,가산그레이트,가산디지털,가산디지털단지역,가산브이타워,가산에스케이,금천독산,금천시흥,독산사거리,마리오아울렛
W-Mall,0.0,1.279999e-10,4.336818e-10,7.034265e-10,3.177569e-11,7.693567e-10,2.839378e-09,2.905348e-08,2.159871e-09,1.23521e-10
가산그레이트,1.279999e-10,0.0,9.046519e-11,2.312978e-10,3.222489e-11,2.697336e-10,4.173098e-09,3.303833e-08,3.339466e-09,3.974598e-14
가산디지털,4.336818e-10,9.046519e-11,0.0,3.245748e-11,2.306766e-10,4.777923e-11,5.492417e-09,3.658644e-08,4.529214e-09,9.430401e-11
가산디지털단지역,7.034265e-10,2.312978e-10,3.245748e-11,0.0,4.361916e-10,1.476264e-12,6.369318e-09,3.879836e-08,5.328503e-09,2.374124e-10
가산브이타워,3.177569e-11,3.222489e-11,2.306766e-10,4.361916e-10,0.0,4.884231e-10,3.471896e-09,3.100691e-08,2.715597e-09,2.999756e-11
가산에스케이,7.693567e-10,2.697336e-10,4.777923e-11,1.476264e-12,4.884231e-10,0.0,6.564742e-09,3.927851e-08,5.507374e-09,2.763336e-10
금천독산,2.839378e-09,4.173098e-09,5.492417e-09,6.369318e-09,3.471896e-09,6.564742e-09,0.0,1.372763e-08,4.639511e-11,4.147336e-09
금천시흥,2.905348e-08,3.303833e-08,3.658644e-08,3.879836e-08,3.100691e-08,3.927851e-08,1.372763e-08,0.0,1.537015e-08,3.296578e-08
독산사거리,2.159871e-09,3.339466e-09,4.529214e-09,5.328503e-09,2.715597e-09,5.507374e-09,4.639511e-11,1.537015e-08,0.0,3.316424e-09
마리오아울렛,1.23521e-10,3.974598e-14,9.430401e-11,2.374124e-10,2.999756e-11,2.763336e-10,4.147336e-09,3.296578e-08,3.316424e-09,0.0
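
The cells above repeat the same three steps for every metric: call `cdist`, wrap the result in a DataFrame, and label it with store names. One possible way to factor this out is sketched below. The helper name `pairwise_distance_frame` and the `toy` DataFrame are assumptions made for this example; `toy` stands in for `df1` so the snippet runs without the Kaggle file.

```python
import numpy as np
import pandas as pd
from scipy.spatial import distance

def pairwise_distance_frame(points, labels, metric, **kwargs):
    """Return a square DataFrame of pairwise distances labeled by `labels`.

    Extra keyword arguments (for example p=1.5 for 'minkowski')
    are passed through to scipy.spatial.distance.cdist.
    """
    arr = np.asarray(points, dtype=float)
    dist = distance.cdist(arr, arr, metric=metric, **kwargs)
    return pd.DataFrame(dist, index=labels, columns=labels)

# Hypothetical stand-in for df1 with the same column names
toy = pd.DataFrame({
    "name": ["P", "Q", "R"],
    "latitude": [0.0, 3.0, 0.0],
    "longitude": [0.0, 4.0, 1.0],
})
coords = toy[["latitude", "longitude"]]

# One labeled distance table per metric
frames = {
    "euclidean": pairwise_distance_frame(coords, toy["name"], "euclidean"),
    "cityblock": pairwise_distance_frame(coords, toy["name"], "cityblock"),
    "minkowski": pairwise_distance_frame(coords, toy["name"], "minkowski", p=1.5),
    "chebyshev": pairwise_distance_frame(coords, toy["name"], "chebyshev"),
}

print(frames["euclidean"])
```

With the real data, the same call would take `df1[['latitude','longitude']]` and `df1['name']` in place of `coords` and `toy["name"]`.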
