# EMD algorithm - quick and dirty implementation

## Maria Inês Silva
## 07/01/2019

***

## Pseudo-code from the paper

The efficient EMD paper uses the following parameters:
* `T_min` - the minimum length of the analysis window (in data points).
* `a` - the SAX alphabet size.
* `w` - the number of SAX symbols in one BS.
* `aw` - the size of the analysis window, measured in number of BSs.

### Algorithm EMD

1. Transform the original time series to its PAA representation, using a sliding window of size `T_min`.
2. Transform the PAA reduced time series to SAX symbol sequence.
3. Transform the SAX symbol sequence to BS sequence.
4. Transform the BS sequence to Modified BS sequence.
5. Find all the motif candidates:
    1. Set the size of analysis window, W, to `T_min`.
    2. Extract all Modified BS subsequences under the analysis window by sliding the window from left to right, and look for a DL pattern among them. If no DL pattern is found and the window has reached the end of the Modified BS sequence, go to step 6.
    3. From the set of all pattern instances found in 5.2, build their distance matrix (using the DTW distance between instances). Call the procedure `Extract_Motif_Candidate` to find the motif candidate from the distance matrix, and calculate the MDL value of the candidate.
    4. Add the motif candidate to the result list along with its MDL value.
    5. Increase the size of the analysis window (i.e., set W := W + 1) and go to 5.2.
6. From the result list, find the motif candidate with the smallest MDL value. This motif is the returned result.
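
The window-sliding part of step 5.2 and the final selection in step 6 can be sketched as follows. This is only an illustration under stated assumptions: `window_occurrences` is a hypothetical helper that groups *identical* windows of `w` BSs, which is a simplification (the actual DL-pattern criterion is defined in the paper and not reproduced here), and the MDL values in `results` are made-up placeholders.

```python
from collections import defaultdict

def window_occurrences(mod_bs_sequence, w):
    """Hypothetical helper: slide a window of w BSs from left to right and
    return the start positions of every window content seen more than once."""
    occurrences = defaultdict(list)
    for start in range(len(mod_bs_sequence) - w + 1):
        key = tuple(mod_bs_sequence[start:start + w])
        occurrences[key].append(start)
    # keep only windows that repeat (a simplification of "DL pattern")
    return {k: v for k, v in occurrences.items() if len(v) > 1}

# toy Modified BS sequence; the window ('a', 'b') occurs at positions 0 and 2
repeats = window_occurrences(['a', 'b', 'a', 'b', 'c'], 2)

# step 6: pick the candidate with the smallest MDL value
# (candidate names and MDL values are placeholders, not real results)
results = [('m1', 12.5), ('m2', 9.0)]
best = min(results, key=lambda item: item[1])
```

Here `repeats` maps each repeated window to its start positions, and `best` is the `(candidate, MDL)` pair with the lowest MDL.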

### Algorithm `Extract_Motif_Candidate`

1. For each TSS in the distance matrix, identify all the other TSSs that are similar to it (i.e., whose distance to it is less than the threshold R).
2. Select as the pattern instances the TSSs that have the highest count of similar subsequences.
3. Among the pattern instances selected in step 2, determine the instance with the smallest sum of distances to all the other instances. This instance is taken as the centre of the pattern (i.e., the motif candidate).
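
A minimal NumPy sketch of these three steps is given below. It follows a literal reading of the pseudo-code, which is an assumption: "pattern instances" are taken to be every TSS tied for the highest neighbour count, and the sum of distances in step 3 is computed only among those instances. The function name and return format are illustrative, not taken from the paper.

```python
import numpy as np

def extract_motif_candidate(dist_matrix, R):
    """Sketch: return (index of the pattern centre, indices of the pattern instances)."""
    dist_matrix = np.asarray(dist_matrix, dtype=float)
    # step 1: for each TSS, mark the other TSSs closer than R
    similar = dist_matrix < R
    np.fill_diagonal(similar, False)  # a TSS is not counted as its own neighbour
    counts = similar.sum(axis=1)
    # step 2: pattern instances are the TSSs with the highest neighbour count
    instances = np.flatnonzero(counts == counts.max())
    # step 3: the centre has the smallest sum of distances to the other instances
    sub = dist_matrix[np.ix_(instances, instances)]
    centre = instances[np.argmin(sub.sum(axis=1))]
    return int(centre), instances

toy = [[0, 1, 2, 9],
       [1, 0, 1, 9],
       [2, 1, 0, 9],
       [9, 9, 9, 0]]
centre, instances = extract_motif_candidate(toy, R=2.5)
```

In the toy matrix, TSSs 0, 1 and 2 are mutually within `R`, TSS 3 is far from everything, and TSS 1 sits in the middle of the group, so it is the centre.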

***

## Data and library imports

In [2]:
import numpy as np
import pandas as pd
import os
from dtaidistance import dtw

In [3]:
cwd = os.getcwd()
data_folder = os.path.abspath(os.path.join(cwd, os.pardir, 'data'))
data_folder

'/Users/misilva73/Documents/Tese/extendedMD/data'

In [4]:
ay = pd.read_csv(os.path.join(data_folder, 'ay.csv'), names=['ay'])
az = pd.read_csv(os.path.join(data_folder, 'az.csv'), names=['az'])

data = ay.assign(az = az['az'])
data.head()

Unnamed: 0,ay,az
0,-0.034,0.013
1,-0.005,0.003
2,0.006,0.01
3,0.004,-0.012
4,-0.012,-0.027


## 1. PCA

**input:** data (pandas DataFrame where each column is a time series; the number of columns is the number of variables PCA is applied to)

**output:** numpy array that represents the 1-dimensional time series resulting from the PCA.

In [5]:
def extract_pca_ts(multi_dim_ts):
    # compute vector with the mean of each time series
    means_vec = multi_dim_ts.agg('mean').values
    # compute the covariance matrix of the multi-dim time-series data
    cov_mat = multi_dim_ts.cov().values
    # extract eigenvalues and eigenvectors (the PCs) of the covariance matrix;
    # np.linalg.eigh returns the eigenvectors as the columns of e_vec
    e_val, e_vec = np.linalg.eigh(cov_mat)
    # get the eigenvector with the highest eigenvalue (i.e. the 1st PC)
    pc1_vec = e_vec[:, np.argmax(e_val)]
    # compute the 1-dim time series as the data's projection on the 1st PC
    ts_1d = np.dot((multi_dim_ts.values - means_vec), pc1_vec)
    return ts_1d

In [6]:
ts_1d = extract_pca_ts(data)
ts_1d

array([ 0.03494258,  0.00693807, -0.00463669, ..., -0.04965133,
       -0.07109884, -0.07675976])
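
One way to sanity-check a first-principal-component projection is to use toy data whose dominant direction is known in advance. The sketch below uses only NumPy and made-up data lying on the line `az = 2 * ay`; note that `np.linalg.eigh` returns the eigenvectors as the *columns* of its second output, so the first PC is selected with `e_vec[:, k]`.

```python
import numpy as np

# toy 2-dimensional series lying exactly on the line az = 2 * ay (made-up data)
x = np.linspace(-1.0, 1.0, 50)
toy = np.column_stack([x, 2.0 * x])

# covariance matrix with variables in columns, as in DataFrame.cov()
cov_mat = np.cov(toy, rowvar=False)
e_val, e_vec = np.linalg.eigh(cov_mat)

# eigenvectors are stored column-wise: column k belongs to e_val[k]
pc1 = e_vec[:, np.argmax(e_val)]

# the first PC should be parallel to (1, 2) / sqrt(5), up to sign
expected = np.array([1.0, 2.0]) / np.sqrt(5.0)
is_aligned = np.isclose(abs(pc1 @ expected), 1.0)

# the variance of the projection equals the largest eigenvalue
projection = (toy - toy.mean(axis=0)) @ pc1
same_variance = np.isclose(projection.var(ddof=1), e_val.max())
```

Both `is_aligned` and `same_variance` should be true; if the projection variance were smaller than the largest eigenvalue, the wrong vector would have been picked.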

## 2. SAX

Code based on the function `sax_via_window` from `saxpy.sax`.

**input:** 1-d time series, sliding window size, PAA representation size, alphabet size and z-normalization threshold (windows whose standard deviation is below this threshold are left un-normalized by `znorm`)

**output:** list of sax words (one word per sliding window)

In [7]:
from saxpy.znorm import znorm
from saxpy.paa import paa
from saxpy.alphabet import cuts_for_asize
from saxpy.sax import ts_to_string

def extract_sax_sequence(ts, win_size, paa_size, alphabet_size=3, z_threshold=0.01):
    # initialize list with sax sequence
    sax_sequence = []
    # get the cuts thresholds based on the gaussian distribution
    cuts = cuts_for_asize(alphabet_size)
    for t in range(0, len(ts) - win_size + 1):
        # define the current window
        ts_win = ts[t:(t+win_size)]
        # normalize the window
        ts_win_normalized = znorm(ts_win, z_threshold)
        # compute PAA representation of normalized window
        paa_rep = paa(ts_win_normalized, paa_size)
        # compute sax representation of PAA representation
        sax_word = ts_to_string(paa_rep, cuts)
        # append sax word to sax sequence list
        sax_sequence.append(sax_word)
    return sax_sequence

In [8]:
win_size=9
paa_size=3
alphabet_size=3

sax_sequence = extract_sax_sequence(ts_1d, win_size, paa_size, alphabet_size)
np.unique(sax_sequence, return_counts=True)

(array(['aac', 'abb', 'abc', 'aca', 'acb', 'acc', 'bab', 'bac', 'bba',
        'bbb', 'bbc', 'bca', 'bcb', 'caa', 'cab', 'cac', 'cba', 'cbb',
        'cca'], dtype='<U3'),
 array([ 291,  158, 1254,  122,  752,  244,  120,  725,  145, 1294,  126,
         736,  103,  263,  794,  100, 1368,  115,  267]))
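
To make clear what happens to a single window inside the loop, here is a NumPy-only sketch of the z-normalization, PAA and symbol-mapping steps. It is not the `saxpy` implementation: it assumes the window length is a multiple of `paa_size`, uses a population standard deviation, and hard-codes approximate breakpoints for a 3-letter alphabet (the standard normal tertiles, about ±0.43). The function name is illustrative.

```python
import numpy as np

def sax_word_for_window(window, paa_size, breakpoints=(-0.4307, 0.4307),
                        z_threshold=0.01):
    """Sketch: convert one window into a SAX word over the alphabet 'a', 'b', 'c'."""
    window = np.asarray(window, dtype=float)
    # z-normalize, unless the window is almost flat (assumed threshold behaviour)
    sd = window.std()
    if sd >= z_threshold:
        window = (window - window.mean()) / sd
    # PAA: mean of paa_size equal-length segments
    # (assumes len(window) is divisible by paa_size)
    paa_rep = window.reshape(paa_size, -1).mean(axis=1)
    # map each PAA value to the index of the interval it falls in
    idx = np.searchsorted(breakpoints, paa_rep)
    return ''.join(chr(ord('a') + int(i)) for i in idx)

rising = sax_word_for_window([0, 0, 0, 1, 1, 1, 2, 2, 2], paa_size=3)
falling = sax_word_for_window([2, 2, 2, 1, 1, 1, 0, 0, 0], paa_size=3)
```

A steadily rising window maps to `'abc'` and its mirror image to `'cba'`, which matches the intuition that a SAX word describes the shape of the window rather than its absolute level.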

## 3. Extract modified BS-sequence

**input:** sax sequence

**output:** a list with the modified BS sequence and a list with the length of each BS in the sequence (if the same SAX word appears twice in a row, the length of the resulting BS is 2)

In [45]:
def get_modified_bs_sequence(sax_sequence):
    # initialize the lists to save the bs and their lengths
    bs_sequence = []
    bs_lengths = []
    # initialize the bs length counter
    curr_len = 1
    for i in range(len(sax_sequence)):
        # set the current bs element
        curr_bs = sax_sequence[i]
        # set the next bs element
        if i < len(sax_sequence) - 1:
            next_bs = sax_sequence[i+1]
        else:  # if the current element is the last, then there's no "next_bs"
            next_bs = ''
        # test if the current bs is equal to the next bs
        if curr_bs == next_bs:
            # if yes, add 1 to the current length counter
            curr_len = curr_len + 1
        else:
            # if no, save the bs and its length in the corresponding lists
            bs_sequence.append(curr_bs)
            bs_lengths.append(curr_len)
            # and reset the length counter
            curr_len = 1
    return bs_sequence, bs_lengths

In [46]:
bs_sequence, bs_lengths = get_modified_bs_sequence(sax_sequence)
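
Collapsing consecutive repeats in this way is a run-length encoding. As a cross-check, an equivalent sketch using `itertools.groupby` from the standard library is shown below; the function name and the toy sequence are illustrative only.

```python
from itertools import groupby

def run_length_encode(symbols):
    """Sketch: collapse runs of equal consecutive symbols into (values, lengths)."""
    values, lengths = [], []
    for symbol, run in groupby(symbols):
        values.append(symbol)
        lengths.append(sum(1 for _ in run))
    return values, lengths

toy_sequence = ['abc', 'abc', 'bca', 'bca', 'bca', 'abc']
values, lengths = run_length_encode(toy_sequence)
# values  -> ['abc', 'bca', 'abc']
# lengths -> [2, 3, 1]
```

Note that a word reappearing later (here `'abc'`) starts a new entry: only *consecutive* repeats are merged.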

## 5. Compute MDL

**input:**

**output:** 

## 6. Extract all motif candidates

**input:**

**output:** 

In [None]:
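def compute_dtw_dist_mat(ts_list, R):
    # pairwise DTW distances; max_dist=R lets the computation stop early
    # for pairs whose distance would exceed R
    dist_matrix_vec = dtw.distance_matrix(ts_list, parallel=True, max_dist=R)
    # mirror the upper triangle to obtain a full symmetric matrix
    dist_matrix = np.triu(dist_matrix_vec) + np.triu(dist_matrix_vec).T
    # the distance from a subsequence to itself is 0
    np.fill_diagonal(dist_matrix, 0)
    return dist_matrix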
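
For reference, the quantity stored in each cell of the distance matrix can be sketched with a plain dynamic-programming DTW. This is a teaching sketch, not the `dtaidistance` implementation: it assumes a squared-difference local cost with a final square root, applies no warping window or early stopping, and both function names are illustrative.

```python
import numpy as np

def dtw_distance(s, t):
    """Sketch of DTW between two 1-d sequences (squared cost, sqrt at the end)."""
    s, t = np.asarray(s, dtype=float), np.asarray(t, dtype=float)
    n, m = len(s), len(t)
    # acc[i, j] = cost of the best warping path aligning s[:i] with t[:j]
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = (s[i - 1] - t[j - 1]) ** 2
            acc[i, j] = cost + min(acc[i - 1, j], acc[i, j - 1], acc[i - 1, j - 1])
    return float(np.sqrt(acc[n, m]))

def symmetric_distance_matrix(series_list):
    """Sketch: full symmetric matrix of pairwise DTW distances, zero diagonal."""
    k = len(series_list)
    mat = np.zeros((k, k))
    for i in range(k):
        for j in range(i + 1, k):
            mat[i, j] = mat[j, i] = dtw_distance(series_list[i], series_list[j])
    return mat

toy_series = [[0, 1, 2], [0, 1, 1, 2], [0, 0]]
toy_matrix = symmetric_distance_matrix(toy_series)
```

The first two toy series differ only by a repeated value, so warping aligns them perfectly and their DTW distance is 0 even though their lengths differ.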

## 4. `Extract_Motif_Candidate`

**input:** 

**output:** 