# Assignment 1

In [1]:
from collections import Counter
import numpy as np
import pandas as pd
from scipy import stats
import string

## a. Translation of time series into SAX representation

We start by defining the time series $x(t):=-3t^2+10$, for all $t\geq0$.

In [2]:
# define time series x(t):=-3t^2+10
def x(t):
    assert np.greater_equal(t,0).all()
    return -3*t^2+10

We then obtain the first $20$ elements $X:=[x(0),\dots,x(19)]$ of this series and normalize it.

In [3]:
ts = x(np.arange(20))
ts_normalized = stats.zscore(ts)

In the next step, the dimensionality of the normalized seris is reduced using $w$-dimensional piecewise aggregate approximation (PAA). Specifically, we use a dimension of $w=10$, thus averaging over 2 points in each frame.

In [4]:
# define function to calculate w-dimensional PAA of a time series ts
def calc_paa(ts, w):
    n = len(ts)
    mesh = np.arange(n * w)
    index_out = mesh // n
    index_in = mesh // w
    _, n_rep = np.unique(index_out, return_counts=True)
    res = np.array([ts[indices].sum() / n for indices in 
           np.split(index_in, n_rep.cumsum())[:-1]])
    return res

In [5]:
# calculate PAA of length w=10, i.e., averaged over 2 points in each frame
ts_paa = calc_paa(ts_normalized, 10)
ts_paa

array([ 1.38207347,  1.07494603,  1.17732185,  0.05118791,  0.15356372,
       -0.15356372, -0.87019441, -0.35831534, -1.07494603, -1.38207347])

The final step consists in mapping each element of the PAA to a SAX symbol. We represent SAX symbols with the Roman alphabet, where the first letter corresponds to the lowest region, and so on.

In [6]:
# define function which maps each element of PAA to a sax-symbol, using n_breaks breakpoints
def map_symbol(x, n_breaks):
    symbols = list(string.ascii_lowercase[:n_breaks])
    breaks = stats.norm.ppf(np.linspace(0, 1, n_breaks+1))
    s = ''
    for i in range(n_breaks):
        if breaks[i] <= x < breaks[i+1]:
            s = symbols[i]
    return s

Using a SAX cardinality of 4 yields the following symbolic SAX representation:

In [7]:
# obtain sax representation of the series
ts_sax = [map_symbol(x,4) for x in ts_paa]
ts_sax

['d', 'd', 'd', 'c', 'c', 'b', 'a', 'b', 'a', 'a']

Finally, the frequencies are as follows:

In [8]:
# compute frequency
for k,v in Counter(ts_sax).items():
    print(k,v)

d 3
c 2
b 2
a 3


## b. Exploring impact of parameters via graphical interface

In this section we provide a graphical interface to evaluate the impact of the PAA dimension and the SAX cardinality.

In [9]:
!pip install jupyter-dash



To create the visualization, run the following cell. Subsequently, the parameters can be modified interactively.

In [10]:
from jupyter_dash import JupyterDash
import dash_core_components as dcc
import dash_html_components as html
from dash.dependencies import Input, Output
import plotly.graph_objects as go

app = JupyterDash(__name__)

app.layout = html.Div([
    dcc.Graph(id='graph-with-slider'),
    html.Label('PAA Dimension'),    
    dcc.Slider(
        id='paa-dim-slider',
        min=2,
        max=10,
        value=10,
        marks={str(i): str(i) for i in range(2,21)},
        step=None
    ),
    html.Label('SAX Cardinality'),
    dcc.Slider(
        id='sax-cardinality-slider',
        min=2,
        max=8,
        value=4,
        marks={str(i): str(i) for i in range(2,21)},
        step=None
    )
])


@app.callback(
    Output('graph-with-slider', 'figure'),
    Input('paa-dim-slider', 'value'),
    Input('sax-cardinality-slider', 'value'))
def update_graph(paa_dim, sax_cardinality):
    ts_paa = calc_paa(ts_normalized, paa_dim)
    ts_sax = '' 
    for x in ts_paa:
        ts_sax+=map_symbol(x,sax_cardinality)
    breaks = stats.norm.ppf(np.linspace(0, 1, sax_cardinality+1))

    x_normalized = np.linspace(0, 19, 20)
    x_min = x_normalized[0]
    x_max = x_normalized[-1]       
    y_min = min(ts_normalized)
    y_max = max(ts_normalized)
    m = len(ts_normalized)/paa_dim
    
    breaks[0] = y_min-1
    breaks[-1] = y_max+1
    
    fig = go.Figure()
    
    fig.add_trace(go.Scatter(
        mode='lines+markers',
        x=x_normalized, 
        y=ts_normalized, 
        name="Normalized Time Series"
    ))
        
    fig.update_xaxes(title_text="t", range=[x_min-1, x_max+1])
    fig.update_yaxes(range=[y_min-1, y_max+1], )
    
    fig.update_layout(margin=dict(l=150,t=10))
    
    fig.add_annotation(dict(font=dict(color="black",size=14),
                            x=x_max-6,
                            y=y_max+0.5,
                            showarrow=False,
                            text='SAX Representation: '+ts_sax,
                            textangle=0,
                            xref="x",
                            yref="y",
                            bordercolor="#F8F8F8",
                            borderwidth=2,
                            borderpad=4,
                            bgcolor="#F8F8F8",
                            opacity=0.8
                           ))

    for i,y_i in enumerate(ts_paa):
        fig.add_trace(go.Scatter(
            mode='lines',
            x=[i*m,(i+1)*m-1],
            y=[y_i,y_i], 
            line_color="green", 
            line_width=2,
            name='Piecewise Aggregate Approximation (PAA)',
            showlegend=True if i==0 else False,
            legendgroup=1
        ))
        
    for i in range(sax_cardinality):
        if i>0:
            fig.add_hline(
                y=breaks[i],
                line_width=1, line_dash="dash", line_color="red"
            )
        fig.add_annotation(dict(font=dict(color="red",size=14),
                        x=-0.1,
                        y=(breaks[i]+breaks[i+1])/2,
                        showarrow=False,
                        text=string.ascii_lowercase[i],
                        textangle=0,
                        xref="paper",
                        yref="y"
                       ))


    return fig

app.run_server(mode='inline', port=8055)

# Assignment 2

-	How to measure the quality of a machine learning model:
    - Define a suitable metric. For instance, classification tasks are often evaluated by calculating the share of correct classifications (i.e., accuracy), whereas regression tasks often employ the (root) mean squared error. The choice of the metric is important and depends on multiple factors. Besides the task type (e.g. classification/regression) the decision can be determined by specific characteristics of the problem. For instance, if there is a high cost of misclassification associated with a specific scenario, standard metrics, such as accuracy, might not be suitable to evaluate model performance. An instructive example often mentioned is misclassifying an ill person as healthy, which is much more detrimental than misclassifying a healthy person. 
    - Usually, the dataset is split into training, test and development set. In most scenarios, assignment to the respective sets occurs at random. There are, however, expections; for instance, in time series analysis the training set usually precedes test and development sets. A model is trained on the training set, while the development set is used to measure the impact of different hyperparameters. The hyperparameters of the final model are chosen based on the development score. The test set is used to estimate the generalization capacity of the final model; it must only be employed after the final model has been selected. Subsequently, the final model is usually re-trained on the full dataset. Note that the estimation of the generalization capacity underlies the assumption that the distribution on new data will be the same as on the seen data.
    - When calculating scores on development and test sets, the associated estimators can have a high variance. Therefore, in practice one often uses methods such as k-fold cross validation, in which the dataset is split into k equally sized sets and scores are obtained separately on each set. By subsequently averaging across all sets, the variance is reduced. The larger k, the less biased the results, but the higher the variance becomes; a good tradeoff often used in practice is k=10. When there are both test and development sets, the described method is performed as a nested loop over both dimensions.

- How to proceed if the model is below expectation:
    - If development results are below expectation, then different hyperparameters or even different models should be considered.
    - If test set results are below expectation, modifying the hyperparameters or selecting a better model until test set results meet expectations will result in an optimistically biased estimate of model performance. Therefore, model selection and hyperparameter tuning must occur prior to obtaining the test set results!