# DAMP: Discord-Aware Matrix Profile

The authors of [DAMP](https://www.cs.ucr.edu/~eamonn/DAMP_long_version.pdf) present a scalable method for discord detection that can be used in both offline and online modes.

To understand the mechanism behind this method, we first need to distinguish between the full matrix profile and the left matrix profile of a time series `T`. A subsequence of length `m` that starts at index `i`, i.e. `S_i = T[i:i+m]`, has two groups of neighbors: left and right. The left neighbors are the subsequences on the left side of `S_i`, i.e. the subsequences in `T[:i]`, and the right neighbors are the subsequences on the right side of `S_i`, i.e. the subsequences in `T[i+1:]`. The `i`-th element of the full matrix profile is the minimum distance between `S_i` and all of its neighbors, both left and right. In the left matrix profile, by contrast, the `i`-th element is the minimum distance between `S_i` and its left neighbors only.

One can use either the full matrix profile or the left matrix profile to find the top discord, i.e. the subsequence whose distance to its nearest neighbor is larger than that of any other subsequence. However, using the full matrix profile to detect discords can miss the case where two rare subsequences happen to be similar to each other (a case known as "twin freak"), because each one finds the other as a close nearest neighbor. The left matrix profile resolves this problem by capturing the discord at its first occurrence, where no similar subsequence exists yet to its left. Hence, even if there are two or more such discords, we can still capture the first occurrence by using the left matrix profile.
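
The twin-freak effect can be reproduced on a small synthetic series. The following is a minimal brute-force sketch and not the DAMP algorithm itself: the helper `brute_force_profiles`, the synthetic series, and the exclusion zone of `ceil(m / 4)` are all assumptions made for illustration.

```python
import numpy as np


def znorm(x):
    # z-normalize a subsequence (constant subsequences map to zeros)
    std = x.std()
    return (x - x.mean()) / std if std > 0 else np.zeros_like(x)


def brute_force_profiles(T, m, excl_zone):
    # Hypothetical helper: O(k^2) full and left matrix profiles
    k = len(T) - m + 1
    S = np.array([znorm(T[i : i + m]) for i in range(k)])
    D = np.linalg.norm(S[:, None, :] - S[None, :, :], axis=2)
    idx = np.arange(k)
    # Ignore trivial matches inside the (assumed) exclusion zone
    D[np.abs(idx[:, None] - idx[None, :]) <= excl_zone] = np.inf
    P = D.min(axis=1)  # full profile: left and right neighbors
    PL = np.where(idx[None, :] < idx[:, None], D, np.inf).min(axis=1)  # left only
    return P, PL


m = 20
period = np.sin(2 * np.pi * np.arange(m) / m)
T = np.tile(period, 20)  # 400 points of a perfectly periodic signal
freak = np.tile([1.0, -1.0], m // 2)  # a rare pattern of length m
T[100:120] = freak  # first occurrence
T[300:320] = freak  # its twin

P, PL = brute_force_profiles(T, m, excl_zone=int(np.ceil(m / 4)))
print(P[100], PL[100], PL[300])
```

Under these assumptions, `P[100]` is essentially zero because each freak finds its twin, whereas `PL[100]` stays large because the first occurrence has no similar left neighbor. `PL[300]` is essentially zero because the twin at index 100 lies to its left.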

The original `DAMP` algorithm needs a parameter called `split_idx`. For a given `split_idx`, the training part is `T[:split_idx]`, and potential anomalies are expected to come from `T[split_idx:]`. The value of `split_idx` is problem dependent. If `split_idx` is too small, then `T[:split_idx]` may not contain all the different kinds of regular patterns, so we may incorrectly select a regular subsequence as a discord. If `split_idx` is too large, we may miss a discord if that discord and its nearest neighbor are both in `T[:split_idx]`. The following two extreme scenarios help explain the rationale behind `split_idx`.

(1) `split_idx = 0`: In this case, the first subsequence can itself be a discord, as it is a "new" pattern. <br>
(2) `split_idx = len(T) - m`: In this case, the last subsequence is the only one that is analyzed as a potential discord. It is compared against all subsequences except itself.
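
The two extremes can be made concrete by counting discord candidates and left neighbors. This is only an illustrative sketch: `split_summary` is a hypothetical helper, and it ignores any exclusion zone.

```python
def split_summary(n, m, split_idx):
    # Hypothetical helper: n is len(T), m is the window size
    k = n - m + 1  # number of subsequences in T
    candidates = range(split_idx, k)  # start indices checked for discords
    # The first candidate starts at `split_idx`, so it has `split_idx`
    # left neighbors (start indices 0 .. split_idx - 1)
    first_left = split_idx
    return len(candidates), first_left


n, m = 1000, 50
print(split_summary(n, m, 0))      # (951, 0)
print(split_summary(n, m, n - m))  # (1, 950)
```

With `split_idx = 0` the first candidate has no left neighbor at all, and with `split_idx = len(T) - m` a single candidate is compared against every other subsequence.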


# Getting Started

In [2]:
import math

import numpy as np
import matplotlib.pyplot as plt
import stumpy
import time

from stumpy import core
from scipy.io import loadmat

## Naive approach

In [3]:
def naive_DAMP(T, m, split_idx):
    """
    Compute the top-1 discord in `T`, where the discord resides in `T[split_idx:]`
    
    Parameters
    ----------
    T : numpy.ndarray
        A time series for which the top discord will be computed.
        
    m : int
        Window size
    
    split_idx : int
        The split index between train and test. See note below for further details.
    
    Returns
    -------
    PL : numpy.ndarray
        The [exact] left matrix profile. All infinite distances are ignored in computing
        the discord.
        
    discord_dist : float
        The discord's distance, which is the distance between the top discord and its
        left nearest neighbor
        
    discord_idx : int
        The start index of the top discord
        
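    Note
    ----
    Subsequences that start before `split_idx` are never reported as discords;
    they only serve as left neighbors. Discord candidates are the subsequences
    that start at or after `split_idx`.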
    """
    mp = stumpy.stump(T, m)
    IL = mp[:, 2].astype(np.int64)
    
    k = len(T) - m + 1  # len(IL)
    PL = np.full(k, np.inf, dtype=np.float64)
    for i in range(split_idx, k):
        nn_i = IL[i]
        if nn_i >= 0:
            PL[i] = np.linalg.norm(core.z_norm(T[i : i + m]) - core.z_norm(T[nn_i : nn_i + m]))
    
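    # Replace +inf with -inf so that unset entries never win the argmax.
    # `-np.inf` is used because `np.NINF` was removed in NumPy 2.0.
    PL_modified = np.where(PL == np.inf, -np.inf, PL)
    discord_idx = np.argmax(PL_modified)
    discord_dist = PL_modified[discord_idx]
    if discord_dist == -np.inf:
        discord_idx = -1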
        
    return PL, discord_dist, discord_idx

## DAMP

In [95]:
def next_pow2(v):
    """
    Compute the smallest power of two that is greater than or equal to `v`
    
    Parameters
    ----------
    v : float
        A real positive value
    
    Returns
    -------
    out : int
        An integer value that is a power of two and satisfies `out >= v`
    """
    return int(math.pow(2, math.ceil(math.log2(v))))

In [103]:
def naive_get_range_damp(i, m, chunksize_init=None):
    if chunksize_init is None:
        chunksize_init = next_pow2(m)
        
    lst = []
    chunksize = chunksize_init
    stop = i
    for _ in range(i):
        start = stop - chunksize
        start = max(0, start)
        lst.append([start, stop])
        
        if start <= 0:
            break
        
        stop = start
        chunksize = 2 * chunksize
            
    return np.array(lst)


def _get_range_damp(i, m, chunksize_init=None):
    """
    For the given index `i`, segments the array `np.arange(i)` into 
    chunks such that the last chunk has size `chunksize_init`, and 
    the one before that has size `2 * chunksize_init` and so on. The
    output contains the (start, stop) of chunks.
    
    Parameters
    ----------
    i : int
        The stop index
    
    m : int
        The window size
        
    chunksize_init : int
        The initial chunksize, which is the size of the last chunk
        
    Returns
    -------
    out : np.ndarray
        A 2D numpy array, where each row is a (start, stop) pair. Rows are
        ordered from the chunk closest to `i` backward, so the last row has
        start index `0`.
    """
    if chunksize_init is None:
        chunksize_init = next_pow2(m)
    
    n = int(math.ceil(math.log2(i / chunksize_init + 1)))    
    indices = i - (np.power(2, np.arange(n + 1)) - 1) * chunksize_init
    start_indices = indices[1:]
    stop_indices = indices[:-1]
    
    out = np.empty((n, 2), dtype=np.int64)
    out[:, 0] = start_indices
    out[:, 1] = stop_indices
    
    out[-1, 0] = 0
    
    return out


# test 
for i in range(8, 65):
    for m in range(3, 8):
        ref = naive_get_range_damp(i, m)
        cmp = _get_range_damp(i, m)
        np.testing.assert_almost_equal(ref, cmp)
        
        chunksize_init = next_pow2(2 * m)
        ref = naive_get_range_damp(i, m, chunksize_init)
        cmp = _get_range_damp(i, m, chunksize_init)
        np.testing.assert_almost_equal(ref, cmp)
        
        chunksize_init = 1
        ref = naive_get_range_damp(i, m, chunksize_init)
        cmp = _get_range_damp(i, m, chunksize_init)
        np.testing.assert_almost_equal(ref, cmp)
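
To make the chunking concrete, here is a small worked example written as a generator. The name `doubling_chunks` is hypothetical and not part of the notebook; for `i > 0` it is meant to yield the same (start, stop) pairs as the functions above, and it yields nothing for `i <= 0`.

```python
def doubling_chunks(i, chunksize_init):
    # Hypothetical generator: walk backward from `i`, doubling the chunk size
    stop, chunksize = i, chunksize_init
    while stop > 0:
        start = max(0, stop - chunksize)
        yield start, stop
        stop, chunksize = start, 2 * chunksize


# i = 20 with an initial chunk size of 4 (i.e. next_pow2(3))
chunks = list(doubling_chunks(20, 4))
print(chunks)  # [(16, 20), (8, 16), (0, 8)]
```

The chunk nearest to `i` has size 4, the next one has size 8, and the final chunk is clipped so that it starts at index 0.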

In [86]:
def _backward_process(
    T, 
    m, 
    query_idx, 
    M_T, 
    Σ_T, 
    T_subseq_isconstant, 
    bsf,
):
    """
    Compute the (approximate) left matrix profile value that corresponds to the subsequence 
    `T[query_idx:query_idx+m]` and update the best-so-far discord distance.
    
    Parameters
    ----------
    T : numpy.ndarray
        A time series
    
    m : int
        Window size
    
    query_idx : int
        The start index of the query with length `m`, i.e. `T[query_idx:query_idx+m]`
    
    M_T : np.ndarray
        The sliding mean of `T`
        
    Σ_T : np.ndarray
        The sliding standard deviation of `T`
    
    T_subseq_isconstant : numpy.ndarray
        A numpy boolean array whose i-th element indicates whether the subsequence
        `T[i : i+m]` is constant (True)
    
    bsf : float
        The best-so-far discord distance
        
    Returns
    -------
    distance : float
        The [approximate] left matrix profile value that corresponds to 
        the query, `T[query_idx : query_idx + m]`.
    
    bsf : float
        The best-so-far discord distance 
    """
    excl_zone = int(math.ceil(m / stumpy.core.config.STUMPY_EXCL_ZONE_DENOM))
    chunksize = next_pow2(m) 
    
    nn_distance = np.inf
    for (start, stop) in _get_range_damp(query_idx - excl_zone, m, chunksize):
        # The stop index is the last index from which a subsequence starts,
        # and continues till `stop - 1 + m`
        QT = core.sliding_dot_product(
            T[query_idx : query_idx + m], 
            T[start : stop - 1 + m],
        )
        D = core._mass(
            T[query_idx : query_idx + m],
            T[start : stop - 1 + m],
            QT=QT,
            μ_Q=M_T[query_idx],
            σ_Q=Σ_T[query_idx],
            M_T=M_T[start : stop],
            Σ_T=Σ_T[start : stop],
            Q_subseq_isconstant=T_subseq_isconstant[query_idx],
            T_subseq_isconstant=T_subseq_isconstant[start : stop],
        )
        
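        nn_distance = min(nn_distance, np.min(D))
        if nn_distance < bsf:
            # Early abandon: the query cannot beat the best-so-far discord
            break

    else:
        # No early abandon: the query becomes the new best-so-far discord
        bsf = nn_distance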
    
    return nn_distance, bsf
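
The backward search with early abandoning can also be sketched without STUMPY. The version below is a brute-force stand-in under stated assumptions, not the notebook's implementation: `backward_search` is a hypothetical name, distances are computed with a plain loop instead of `core._mass`, the exclusion zone is passed in explicitly, and `bsf` is left unchanged when the query has no left neighbor.

```python
import math

import numpy as np


def znorm(x):
    # z-normalize a subsequence (constant subsequences map to zeros)
    std = x.std()
    return (x - x.mean()) / std if std > 0 else np.zeros_like(x)


def backward_search(T, m, query_idx, bsf, excl_zone):
    # Hypothetical brute-force stand-in for `_backward_process`
    q = znorm(T[query_idx : query_idx + m])
    chunksize = 1 << (m - 1).bit_length()  # smallest power of two >= m
    nn_distance = np.inf
    stop = query_idx - excl_zone  # candidate start indices lie in [0, stop)
    while stop > 0:
        start = max(0, stop - chunksize)
        for j in range(start, stop):
            d = np.linalg.norm(q - znorm(T[j : j + m]))
            nn_distance = min(nn_distance, d)
        if nn_distance < bsf:
            return nn_distance, bsf  # early abandon: not the top discord
        stop, chunksize = start, 2 * chunksize
    if np.isfinite(nn_distance):
        bsf = nn_distance  # all left neighbors checked: new best-so-far
    return nn_distance, bsf


m = 20
T = np.tile(np.sin(2 * np.pi * np.arange(m) / m), 20)
T[100:120] = np.tile([1.0, -1.0], m // 2)  # a single synthetic anomaly
excl_zone = math.ceil(m / 4)  # assumed exclusion zone

d_anom, bsf = backward_search(T, m, 100, 1.0, excl_zone)
d_norm, bsf_after = backward_search(T, m, 200, bsf, excl_zone)
print(d_anom, d_norm)
```

In this example the anomalous query at index 100 scans all of its left neighbors and raises `bsf`, while the regular query at index 200 finds an exact match in its first chunk and abandons immediately.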