# Tutorial 4 - MPdist (A distance measure for real-world time series datasets)

<div style="text-align: justify">
MPdist is a time series similarity measure developed by UCR researchers as an alternative to the widely used Eucledian distance and Dynamic Time Warping.<br><br>
It can be optimized for speed as well as be implemented on real-world datasets that contain "spikes, dropouts, wandering baseline and missing values$^1$."<br><br>
Further details can be found by reading the paper, "An Ultra-Fast Time Series Distance Measure to
allow Data Mining in more Complex Real-World
Deployments$^1$"<br><br>

In [1]:
#Install all the libraries
import numba
import stumpy
import numpy as np
import pandas as pd

In [2]:
#Data Preprocessing Step
def MPdist(t1,t2):
    #t1 -> Time series 1
    #t2 -> Time series 2
    #sub -> subsequence length
    if t1.ndim!=1 or t2.ndim!=1:
        raise Exception ('t1 & t2 should be univariate series with dimensions equalling 1')
    return t1,t2

<div style="text-align: justify">
The basis of this similarity measure is the matrix profile. This is an array that contains the z-normalized Eucledian distance between a time series subsequence with its nearest neighbour. A facile and quick implementation of the Matrix Profile is possible using the stumpy library (read tutorial_1). <br><br>

The function mat_join takes in 2 time series ($t_{1}$ & $t_{2}$) as arguments and calculates the distance between all  subsequences in $t_{1}$ with its nearest neighbours in $t_{2}$ (AB). It also calculates the distance between every subsequence in $t_{2}$ with its nearest neighbour in $t_{1}$ (BA). <br><br>

The final step consists of generating an array from the two distance profiles created (AB & BA) by concatenating them together $(P_{ABBA})$ <br><br>

The authors of the MPdist paper reasoned that two similar time series will possess multiple similar subsequences within them. This information can be easily extraced from the matrix profile $(P_{ABBA})$ and therefore one of the values of $(P_{ABBA})$ could be used to develop the distance measure. <br><br>





In [12]:
def mat_join(t1,t2):
    m=7
    t1,t2=MPdist(t1,t2)
    AB=stumpy.stump(t1, m, T_B=t2, ignore_trivial=True)
    BA=stumpy.stump(t2, m, T_B=t1, ignore_trivial=True)
    PABBA=np.concatenate((AB, BA), axis=0)
    return t1,t2,PABBA

<div style="text-align: justify">
However the question arises, "what value of $P_{ABBA}$ would best serve as our distance measure?" The largest value in $P_{ABBA}$ would make the measure sensitive to the slightest outlier in either of the time series. Whereas, the smallest value would essentially consider all time series to be the same irrespective of their attributes. Therefore, the $k^{th}$ smallest number in $sorted$ $P_{ABBA}$ was taken to be the value of MPdist. The authors set this value as 5 percent of sum of the length of both time series.<br><br>
However, if the length of the subsequence extracted for comparision via sliding window (for more details, refer tutorial_0) is near the length of the concatenated series, then the $P_{ABBA}$ length reduces lower than 5% of the concatenated length.During such cases, the paper considers the MPdist to equal the maximum value of $sorted$ $P_{ABBA}$.

In [13]:
def calc_MPdist(t1,t2):
    thr = 0.05
    t1,t2,PABBA=mat_join(t1,t2)
    k=int(thr*(t1.size+t2.size))
    PABBA_sorted=np.sort(PABBA)
    if PABBA_sorted.size>k:
        MPdist=PABBA_sorted[k]
    else:
        MPdist=PABBA_sorted[-1]
    return MPdist

In [14]:
t1=np.array([1.0,2,3,4,5,6,6])
t2=np.array([1.0,2,3,4,5,6,6])
calc_MPdist(t1,t2)

array([-1, -1, 0, inf], dtype=object)

[1]. Gharghabi, Shaghayegh, et al. "Matrix Profile XII: MPdist: A Novel Time Series Distance Measure to Allow Data Mining in More Challenging Scenarios." 2018 IEEE International Conference on Data Mining (ICDM). IEEE, 2018.