# Time Series
Based on Charu Aggarwal, Data Mining, chapter 14, Time Series.

Time series data have two components:
1. behavioral e.g. the temperature measurement values
1. contextual e.g. the temperature measurement times

SIMILARITY:   
To compare two time series, use:
1. Euclidean distance (requires same number of time points)
1. Edit distance (assume possible to mutate one into the other)
1. Longest Common Subsequence (LCSS), found by dynamic programming (possibly recursive)
1. Difference between their coefficients in wave transform (wavelet, Haar, Fourier)
1. Align and compute distance after Dynamic Time Warp (DTW): measure differences after aligning periods with similar meaning (such as heart-valve-open and heart-valve-closed).
1. Align and compute distance after Piecewise Aggregate Approximation (PAA). This uses the mean or median of each bin.
1. Align and compute distance after Symbolic Aggregate Approximation (SAX). This reduces continuous range to a few values of equal frequency. For example, turns a sine wave into a step function.
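As a rough illustration of the DTW item above, here is a minimal dynamic-programming sketch. The function name `dtw_distance`, the absolute-difference cost, and the two toy series are assumptions made for this example, not taken from Aggarwal.

```python
import math

def dtw_distance(a, b):
    """Cost of the cheapest warping path that aligns sequence a with sequence b."""
    n, m = len(a), len(b)
    # cost[i][j] = cheapest alignment of the first i points of a with the first j of b
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(a[i - 1] - b[j - 1])
            # Either both series advance, or one of them repeats a point (the "warp").
            cost[i][j] = step + min(cost[i - 1][j - 1], cost[i - 1][j], cost[i][j - 1])
    return cost[n][m]

slow = [0, 0, 1, 2, 1, 0]   # same shape as fast, with a longer flat start
fast = [0, 1, 2, 1, 0]
print(dtw_distance(slow, fast))
```

Unlike Euclidean distance, this works even though the two series have different lengths.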

## Data Prep
Ideally, work with consecutive time points and no missing values.

### Missing values
INTERPOLATION:   
Interpolate missing values if required. 
Linear interpolation is usually fine, but polynomial and spline can be used.
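A small sketch of linear interpolation using NumPy's `np.interp`. The readings and the use of `NaN` to mark missing values are assumptions for illustration.

```python
import numpy as np

times = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
values = np.array([10.0, np.nan, 14.0, np.nan, 20.0])   # NaN marks a missing reading

known = ~np.isnan(values)
filled = values.copy()
# Estimate each missing point on the straight line between its known neighbors.
filled[~known] = np.interp(times[~known], times[known], values[known])
print(filled)
```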

### Smoothing and Noise Reduction
PAA = piecewise aggregate approximation i.e. BINNING.
Binning does smoothing and data reduction.
Larger bin sizes provide more smoothing.

Apply non-overlapping windows. 
Replace each window with one statistic.
Mean is more inclusive, but median is less sensitive to outliers.
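The binning described above might be sketched like this. The helper name `paa` and the choice to drop a ragged tail that does not fill a whole bin are assumptions of this sketch.

```python
import numpy as np

def paa(series, bin_size, stat=np.mean):
    """Replace each non-overlapping window of bin_size points with one statistic."""
    series = np.asarray(series, dtype=float)
    n_bins = len(series) // bin_size
    trimmed = series[: n_bins * bin_size]   # assumption: ignore leftover points at the end
    return stat(trimmed.reshape(n_bins, bin_size), axis=1)

y = [1, 2, 3, 4, 5, 100]           # 100 is an outlier
print(paa(y, 3, np.mean))          # the outlier pulls the second bin's mean up
print(paa(y, 3, np.median))        # the median is barely affected
```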

MOVING AVERAGE:   
Moving average smoothing does data smoothing but not data reduction.

Apply overlapping windows e.g. stride=1.
Replace each window with one statistic.
Mean is more inclusive, but median is less sensitive to outliers.

Downsides: 
1. Window effect: You lose the first window of data
1. Lag: Sudden big changes are hidden for a while
1. Inversion: If wavelength is about half the window size, the waves can flip up/down.
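A minimal moving-average sketch with stride 1, assuming a plain (unweighted) mean and NumPy's `np.convolve`. The step-shaped series is made up to show the window effect and the lag.

```python
import numpy as np

def moving_average(series, window):
    """Mean of each overlapping window, stride 1; only full windows are kept."""
    series = np.asarray(series, dtype=float)
    kernel = np.ones(window) / window
    return np.convolve(series, kernel, mode="valid")

step = [0, 0, 0, 9, 9, 9]            # sudden jump at the fourth point
smoothed = moving_average(step, 3)
print(smoothed)                      # shorter than the input, and the jump is spread out
```

The output has `len(series) - window + 1` points, which is the window effect; the jump takes a full window to show up completely, which is the lag.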

EXPONENTIAL SMOOTHING:   
Exponential smoothing takes a weighted average of the most recent value and the previous smoothed value,
so the most recent value can count more or less than the accumulated trend.
Requires a smoothing parameter $\alpha$. 
Larger values of $\alpha$ emphasize the most recent value more.

If $\alpha=\frac{1}{4}$ then   
$\hat{y}_{i} = (\frac{1}{4})(y_{i})+(\frac{3}{4})(\hat{y}_{i-1})$   
The recursion leads to exponential decay of older values.
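A minimal sketch of the recursion, assuming the common form in which the new observation is weighted by $\alpha$ and the previous smoothed value by $1-\alpha$. Starting the recursion from the first observation is an assumption of this sketch; other initializations exist.

```python
def exponential_smoothing(series, alpha):
    """Smooth a sequence with weight alpha on the newest value."""
    smoothed = [series[0]]                    # assumption: seed with the first observation
    for y in series[1:]:
        smoothed.append(alpha * y + (1 - alpha) * smoothed[-1])
    return smoothed

print(exponential_smoothing([0, 4, 4, 4], alpha=0.25))
```

After the jump from 0 to 4, the smoothed values climb toward 4 without reaching it in three steps, which is the lag noted for moving averages as well.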

Notes from [Wikipedia](https://en.wikipedia.org/wiki/Exponential_smoothing).
Exponential smoothing is just a rule of thumb.
It is popular because it is easy to use.
It is a low-pass filter (it passes low frequencies, i.e. slow changes, and attenuates high frequencies, i.e. rapid fluctuations).
For looking ahead one timepoint, it is more reliable than moving average.
It fails to detect trends; for a steadily increasing price, the prediction always lags.
It incorporates all previous timepoints, 
with the coefficient $\alpha(1-\alpha)^n$ for the value n time units ago.

## Transforms

### Normalization
Two ways to normalize.
1. Range-based: $(y_i-\min)/(\max-\min)$
1. Z-score: $(y_i-\text{mean})/\text{stdev}$

Z-score standardization is preferable mathematically but 
range-based is computationally convenient since 
every normalized value falls between 0 and 1.

For multivariate behavioral data on different scales,
normalize each feature (variable) separately.
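Both normalizations, applied per feature, could look like the sketch below. It assumes rows are time points and columns are features, and that no column is constant (which would divide by zero). The readings are invented.

```python
import numpy as np

def range_normalize(data):
    """Scale each column to the interval [0, 1]."""
    data = np.asarray(data, dtype=float)
    lo, hi = data.min(axis=0), data.max(axis=0)
    return (data - lo) / (hi - lo)

def z_normalize(data):
    """Shift each column to mean 0 and scale it to standard deviation 1."""
    data = np.asarray(data, dtype=float)
    return (data - data.mean(axis=0)) / data.std(axis=0)

# Hypothetical readings: temperature (C) and pressure (hPa) on very different scales.
readings = np.array([[10.0, 1000.0], [20.0, 1010.0], [30.0, 1030.0]])
print(range_normalize(readings))
print(z_normalize(readings))
```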

### Differencing
First order differencing: use the difference at consecutive time points. 
Replace each time point value with its delta since the previous time point. 
This converts a series with a steady (linear) trend into a stationary one. 
For example: my age grows every year, but the yearly difference is always one.

Second order differencing: use the difference of consecutive differences.
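A quick sketch with NumPy's `np.diff`; the two toy series are assumptions chosen so the result is easy to check by hand.

```python
import numpy as np

age = np.array([30, 31, 32, 33, 34])        # linear trend
first = np.diff(age)                        # first-order differences
print(first)

squares = np.array([0, 1, 4, 9, 16, 25])    # quadratic trend
second = np.diff(squares, n=2)              # differences of differences
print(second)
```

The linear series becomes constant after one round of differencing; the quadratic one needs two.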

### Log transform
A log transform turns an exponential trend into a linear one, which differencing can then make stationary.
For example: prices incorporate the compounding effects of inflation.
Differencing alone doesn't help because the differences keep increasing.
After the log transform, the differenced series is stationary.
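To make the inflation example concrete, here is a sketch with a hypothetical price that grows 5% per period (the starting price and rate are made up).

```python
import numpy as np

price = 100.0 * 1.05 ** np.arange(6)     # hypothetical price, 5% growth per period

raw_diff = np.diff(price)                # keeps growing: still not stationary
log_diff = np.diff(np.log(price))        # constant: log(1.05) every period

print(raw_diff)
print(log_diff)
```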

## Data Reduction
### DWT = Discrete Wavelet Transform
This transform is also used for noise reduction, data compression, and lossy image compression.

DWT decomposes the time series into waves.
DWT generates a ranked series of coefficients; 
using just the first few provides lossy compression.
Each wavelet captures the difference between consecutive periods.
One wavelet captures first half vs second half; and so on recursively.

Wavelets capture local changes.
Wavelets are better than Fourier for capturing one-time events such as bursts.

The simplest DWT is the Haar Transform, which uses a square wave.
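Each Haar coefficient is half the difference between the average of the left half and the average of the right half of its segment; the overall average is stored as one extra coefficient.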

Computation time is linear.
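A minimal Haar sketch, assuming the unnormalized averaging form (averages and half-differences) and a series whose length is a power of 2. The function name `haar_transform` and the output ordering (overall average first, then coarse to fine) are choices made for this example.

```python
import numpy as np

def haar_transform(series):
    """Return [overall average, coarsest detail, ..., finest details]."""
    current = np.asarray(series, dtype=float)
    n = len(current)
    if n == 0 or n & (n - 1):
        raise ValueError("this sketch needs a length that is a power of 2")
    details = []
    while len(current) > 1:
        left, right = current[0::2], current[1::2]
        details.append((left - right) / 2)   # half of (left average - right average)
        current = (left + right) / 2         # averages feed the next, coarser level
    return np.concatenate([current] + details[::-1])

coeffs = haar_transform([8, 6, 2, 3, 4, 6, 6, 5])
print(coeffs)
```

Each level halves the amount of data, so the total work is n/2 + n/4 + ... which stays below n; that is where the linear running time comes from.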

### Fourier Transform
DFT = Discrete Fourier Transform,
usually calculated by FFT = Fast Fourier Transform
or replaced with DCT = Discrete Cosine Transform.
DFT describes the global data by combinations of sinusoidal waves.
DFT is best for periodic time series similar to sine waves.

The coefficients are complex numbers but, 
for a real-valued series, the imaginary parts cancel out to give a real-valued reconstruction.

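DFT makes for quick time series comparison.
The Euclidean distance between two series equals the Euclidean distance between their DFT coefficient vectors (up to a scaling constant), so comparing only a few coefficients gives a quick approximate distance.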

DFT computation time is quadratic, 
but FFT computation time is log-linear because it recursively splits the transform into smaller ones (equivalently, it factors the DFT matrix into sparse matrices).
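The distance claim can be checked numerically. This sketch assumes NumPy's `np.fft.fft` with `norm="ortho"`, which scales the transform so that Euclidean distances are preserved exactly; the two noisy sine waves and the cutoff `k` are made up.

```python
import numpy as np

rng = np.random.default_rng(0)
t = np.arange(64)
a = np.sin(2 * np.pi * t / 16) + 0.1 * rng.standard_normal(64)
b = np.sin(2 * np.pi * t / 16 + 0.5) + 0.1 * rng.standard_normal(64)

A = np.fft.fft(a, norm="ortho")
B = np.fft.fft(b, norm="ortho")

time_dist = np.linalg.norm(a - b)        # distance between the raw series
freq_dist = np.linalg.norm(A - B)        # distance between all coefficients

k = 8                                    # hypothetical number of coefficients to keep
approx_dist = np.linalg.norm(A[:k] - B[:k])

print(time_dist, freq_dist, approx_dist)
```

Dropping coefficients can only shrink the distance, so the truncated value never overestimates the true one.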

### SAX = Symbolic Aggregate Approximation 
Choose certain values that are symbolic or representative.
Ideally, those values should be equally represented and equally likely.
Example: replace a sine wave with a square wave with 3 values: +1, -1, 0. 
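A rough SAX-style sketch: z-normalize, bin with PAA, then map each bin to a letter. As an assumption of this sketch, the breakpoints are empirical quantiles of the binned values, so the letters come out equally frequent; the published SAX method takes its breakpoints from a standard normal distribution instead. The name `sax_symbols` is hypothetical.

```python
import numpy as np

def sax_symbols(series, bin_size, alphabet="abc"):
    """Turn a numeric series into a short word over the given alphabet."""
    y = np.asarray(series, dtype=float)
    y = (y - y.mean()) / y.std()                                  # z-normalize
    n_bins = len(y) // bin_size
    binned = y[: n_bins * bin_size].reshape(n_bins, bin_size).mean(axis=1)   # PAA
    # Assumption: equal-frequency breakpoints taken from the data itself.
    probs = np.arange(1, len(alphabet)) / len(alphabet)
    breakpoints = np.quantile(binned, probs)
    codes = np.searchsorted(breakpoints, binned, side="right")
    return "".join(alphabet[c] for c in codes)

t = np.arange(48)
word = sax_symbols(np.sin(2 * np.pi * t / 24), bin_size=4)   # two periods of a sine
print(word)
```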

## ARIMA(p,d,q)
For a single-valued time series.

Stationary time series have time-independent mean and variance. Most time series are non-stationary but can be made so. For example, prices might be steady after adjusting for inflation.

### I(d): differencing   
With I(d), we predict not the actual values but the d-order differences. At d=1, we predict first-order differences, which should account for a mean that increases with time:   
$y'_{i} = y_{i}-y_{i-1}$   

At d=2, we predict second-order differences, i.e. differences of differences, which I suppose captures a bit of non-linearity:   
$y''_{i} = y'_{i}-y'_{i-1}$   

Since AR(p) and MA(q) assume stationary data, apply I(d) first.
For stationary data, use I(d=0) or use ARMA without the I.  

### AR(p): autoregression
With AR(p), we predict the current time value by a linear combination of p previous values. 
AR assumes stationarity.

For p=2, the model uses a combination of 2 previous values plus a term for white noise:    
$\hat{y}_{i} = (a_{1})(y_{i-1}) + (a_{2})(y_{i-2}) + \epsilon_i$   

Note that $a_1 = a_2 = \frac{1}{2}$ means use the average of the previous two points.
Other values give a weighted combination of the previous values.

To fit this model to the data and learn the $a_i$ parameters, 
use linear regression and least squares.
Each previous time window provides one linear equation.
Since there are more equations than unknowns,
the system is overdetermined (the equations generally contradict one another), 
so there is no exact solution, only a least-squares estimate.
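The least-squares fit can be sketched with NumPy's `np.linalg.lstsq`. The helper `fit_ar` is hypothetical, it assumes zero-mean data (no intercept term), and the series is simulated with known coefficients 0.6 and -0.3 so the estimate can be compared against them.

```python
import numpy as np

def fit_ar(series, p):
    """Estimate AR(p) coefficients a_1..a_p by ordinary least squares."""
    y = np.asarray(series, dtype=float)
    # One equation per time i >= p: y[i] ~ a_1*y[i-1] + ... + a_p*y[i-p]
    X = np.column_stack([y[p - k : len(y) - k] for k in range(1, p + 1)])
    target = y[p:]
    coeffs, *_ = np.linalg.lstsq(X, target, rcond=None)
    return coeffs

# Simulated AR(2) data; the true coefficients are chosen for this example.
rng = np.random.default_rng(0)
n = 5000
noise = rng.standard_normal(n)
y = np.zeros(n)
for i in range(2, n):
    y[i] = 0.6 * y[i - 1] - 0.3 * y[i - 2] + noise[i]

a_hat = fit_ar(y, p=2)
print(a_hat)   # should land near the true values, not exactly on them
```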

The autocorrelation plot helps choose p.
The X-axis is the "lag"; values 1,2,3 represent the previous 1,2,3 time points. 
The Y-axis is the correlation (range is -1 to 1).

Usually the plot starts at almost 1 because the current time depends heavily on the previous.
Usually it drops close to zero for no more correlation after some time.
If so, choose p where the correlation is still high.
Keep p small to avoid overfitting.

In seasonal data, the autocorrelation plot looks like a sine wave.
Example: in monthly temperature data, correlation is near +1 at 12-month lag,
and AR(12) could use last year to predict this year.
Also, correlation is near -1 at 6-month lag,
and AR(6) could use last summer to predict this winter.
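The values behind such a plot can be computed directly. This sketch assumes the usual sample autocorrelation (which shrinks slightly at longer lags because fewer pairs are available) and uses simulated monthly temperatures invented for illustration.

```python
import numpy as np

def autocorrelation(series, max_lag):
    """Sample autocorrelation at lags 0..max_lag."""
    y = np.asarray(series, dtype=float)
    y = y - y.mean()
    denom = np.dot(y, y)
    return np.array([np.dot(y[lag:], y[: len(y) - lag]) / denom
                     for lag in range(max_lag + 1)])

months = np.arange(240)                                   # 20 simulated years
temps = 15 + 10 * np.sin(2 * np.pi * months / 12)         # hypothetical seasonal cycle
acf = autocorrelation(temps, max_lag=12)

print(acf[6], acf[12])    # strongly negative at 6 months, strongly positive at 12
```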

### MA(q): moving average
With MA(q), we predict the shocks i.e. deviations from the mean.
MA assumes stationarity, predicts the mean, and counts every other value as a deviation.

I guess if you mean-center your data, you must re-insert the mean for valid predictions.

MA assumes previous shocks are predictive of future shocks.
I think it also assumes that shocks come in regular periods.

For q=1, the model predicts the next deviation based on the previous one.
Each prediction does not depend on the previous value, but rather its deviation.

For q=2, the model uses the previous two deviations.
It uses two coefficients to determine their relative importance.
Aggarwal gives this for centered data (epsilon=deviation):   
$\hat{y}_{i} = (b_{1})(\epsilon_{i-1}) + (b_{2})(\epsilon_{i-2}) + \epsilon_i$   

But [Penn State](https://online.stat.psu.edu/stat510/lesson/2/2.1) 
gives this, which incorporates the mean but is otherwise the same (w=deviation):   
$x_t = \mu + w_t + \Theta_1 w_{t-1} + \Theta_2 w_{t-2}$

To fit this model to the data and learn the parameters ($b_i$ or $\Theta_i$), 
use an iterative optimization method like gradient descent.
The shocks are not observed directly and depend recursively on the parameters, so the problem is non-linear and ordinary linear regression does not apply.
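To show why the fit is iterative, here is a sketch for MA(1) on centered data. The shocks are reconstructed recursively for a candidate coefficient, and the coefficient with the smallest sum of squared shocks wins. As assumptions of this sketch: a simple grid search stands in for gradient descent, the shock before the first observation is taken to be zero, and the data are simulated with a true coefficient of 0.5. The function names are hypothetical.

```python
import numpy as np

def ma1_residuals(y, b):
    """Reconstruct shocks for y_i = eps_i + b * eps_{i-1}."""
    eps = np.zeros(len(y))
    prev = 0.0                       # assumption: no shock before the series starts
    for i, value in enumerate(y):
        eps[i] = value - b * prev    # each shock depends on the previous one
        prev = eps[i]
    return eps

def fit_ma1(y, grid=None):
    """Pick the coefficient whose reconstructed shocks have the least squared size."""
    if grid is None:
        grid = np.linspace(-0.99, 0.99, 199)
    losses = [np.sum(ma1_residuals(y, b) ** 2) for b in grid]
    return float(grid[int(np.argmin(losses))])

rng = np.random.default_rng(1)
shocks = rng.standard_normal(2001)
y = shocks[1:] + 0.5 * shocks[:-1]   # simulated MA(1) series, true b = 0.5

b_hat = fit_ma1(y)
print(b_hat)                         # an estimate near 0.5, not an exact recovery
```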
