# Time Series
Based on Charu Aggarwal, Data Mining, chapter 14, Time Series.

Time series data have two components:
1. behavioral e.g. the temperature values
1. contextual e.g. the measurement times

SIMILARITY:   
To compare two time series, use:
1. Euclidean distance (requires same number of time points)
1. Edit distance
1. Longest Common Subsequence (LCSS, found by dynamic programming)
1. Difference between their coefficients in an FFT representation
1. DTW = Dynamic Time Warping: stretch or compress the time axis so that periods with similar meaning (such as valve-open and valve-closed) line up even when their durations differ, then measure the differences
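
As a rough illustration of the DTW idea, here is a minimal sketch of the classic dynamic-programming recurrence. The function name `dtw_distance`, the absolute-difference cost, and the toy series are assumptions made for this example, not taken from the book.

```python
import numpy as np

def dtw_distance(a, b):
    """DTW cost between two 1-D sequences, using |x - y| as the local cost."""
    n, m = len(a), len(b)
    # cost[i, j] = best cost of aligning the first i points of a with the first j of b
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(a[i - 1] - b[j - 1])
            # extend the cheapest of: repeat a point of b, repeat a point of a, advance both
            cost[i, j] = step + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return cost[n, m]

short = [0, 1, 1, 0]
stretched = [0, 1, 1, 1, 1, 0]  # same shape, but the "high" period lasts longer
print(dtw_distance(short, stretched))
```

The two toy series have different lengths, so Euclidean distance does not apply, but DTW can align the longer "high" period with the shorter one at no cost.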

## Data Prep
Ideally, work with consecutive time points and no missing values.

### Missing values
INTERPOLATION:   
Interpolate missing values if required. 
Linear interpolation is usually fine, but polynomial and spline can be used.
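
A small sketch of linear interpolation with NumPy, assuming missing values are marked as NaN (the arrays here are made up for illustration):

```python
import numpy as np

times = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
values = np.array([10.0, np.nan, 14.0, np.nan, 18.0])

known = ~np.isnan(values)  # mask of observed points
filled = values.copy()
# np.interp draws straight lines between the known (time, value) pairs
filled[~known] = np.interp(times[~known], times[known], values[known])
print(filled)
```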

### Noise reduction
PAA = BINNING:   
Binning does smoothing and data reduction.
Replace original values with means of non-overlapping windows.
Use median instead of mean for less sensitivity to outliers.
Larger bin sizes provide more smoothing.
PAA = piecewise aggregate approximation.
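
Binning might be sketched as follows. The helper name `paa`, the `reducer` parameter, and the choice to drop a leftover partial bin are assumptions for this example.

```python
import numpy as np

def paa(series, bin_size, reducer=np.mean):
    """Replace each non-overlapping window of bin_size values with one summary value."""
    series = np.asarray(series, dtype=float)
    n_bins = len(series) // bin_size        # any leftover tail is dropped (assumption)
    trimmed = series[: n_bins * bin_size]
    return reducer(trimmed.reshape(n_bins, bin_size), axis=1)

noisy = [1, 3, 2, 4, 100, 6]                # 100 is an outlier
bin_means = paa(noisy, 2)                   # the outlier drags one bin mean upward
bin_medians = paa(noisy, 3, np.median)      # the median is not affected by it
```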

MOVING AVERAGE:   
Moving average smoothing does data smoothing but not data reduction.
Replace original values with means of overlapping windows e.g. stride=1.
Use median instead of mean for less sensitivity to outliers.
Larger window sizes provide more smoothing.

Downsides: 
1. lose first window of data
1. lag for predictions to catch up to recent big changes
1. can actually invert big, fast oscillation
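
A minimal moving-average sketch (the function name and the step-shaped toy series are assumptions) that also shows the first two downsides: the output is shorter than the input, and it needs several steps to catch up with a sudden jump.

```python
import numpy as np

def moving_average(series, window):
    """Mean of every overlapping window of the given size (stride 1)."""
    series = np.asarray(series, dtype=float)
    kernel = np.ones(window) / window
    # mode="valid" keeps only windows that fit entirely inside the series
    return np.convolve(series, kernel, mode="valid")

step = [0, 0, 0, 9, 9, 9]                  # sudden jump at index 3
smoothed = moving_average(step, window=3)  # ramps up gradually instead of jumping
```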

EXPONENTIAL SMOOTHING:   
Exponential smoothing takes a weighted average of
the most recent value and the previous smoothed value.
Requires the smoothing parameter a, between 0 and 1.
Larger values of a emphasize the most recent value more.

If a=1/4:   
$\hat{y}_{i} = (\frac{1}{4})(y_{i})+(\frac{3}{4})(\hat{y}_{i-1})$   
The recursion leads to exponential decay of older values.
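
The recursion can be sketched in a few lines. Each smoothed value is `alpha` times the newest observation plus `1 - alpha` times the previous smoothed value. Starting the recursion at the first observation is an assumption; other initializations are possible.

```python
def exponential_smoothing(series, alpha):
    """Weighted average of the newest value and the previous smoothed value."""
    smoothed = [series[0]]  # initialize with the first observation (assumption)
    for y in series[1:]:
        smoothed.append(alpha * y + (1 - alpha) * smoothed[-1])
    return smoothed

result = exponential_smoothing([0.0, 4.0, 4.0], alpha=0.25)
```

Unrolling the loop shows the decay: the value from k steps back carries weight `alpha * (1 - alpha) ** k`.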

### Normalization
Normalize multivariate behavioral data on different scales.
1. Range-based: (yi-min)/(max-min)
1. Z-score: (yi-mean)/(stdev)

Z-score standardization is preferable mathematically but 
range-based is computationally convenient since 
every normalized value falls between 0 and 1.
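
Both normalizations as a short sketch. The function names are assumptions, and a constant series would divide by zero in either one.

```python
import numpy as np

def range_normalize(y):
    """Map values to [0, 1] using the observed min and max."""
    y = np.asarray(y, dtype=float)
    return (y - y.min()) / (y.max() - y.min())

def z_score(y):
    """Shift to zero mean and scale to unit standard deviation."""
    y = np.asarray(y, dtype=float)
    return (y - y.mean()) / y.std()

scaled = range_normalize([2, 4, 6])
standardized = z_score([2, 4, 6])
```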

### Log transform
A log transform can help make data stationary.
For example: it turns the compounding effects of inflation on prices into a linear trend.
Differencing after the log transform removes that trend, leaving a stationary series.
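
A toy illustration with a synthetic price series that grows 5% per step (the numbers are made up): raw differences keep growing, while differences of the logs are constant.

```python
import numpy as np

prices = 100.0 * 1.05 ** np.arange(6)  # synthetic compounding growth
raw_diff = np.diff(prices)             # increments grow along with the price level
log_diff = np.diff(np.log(prices))     # every increment equals log(1.05)
```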

## Data Reduction
### DWT = Discrete Wavelet Transform
Generates a series of coefficients ordered from coarse to fine; 
keeping only the first few (or the largest) provides lossy compression.
Each wavelet captures difference between consecutive periods.
Wavelets capture local changes.
Best for one-time events.
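
A minimal sketch of a Haar wavelet decomposition, one common choice of wavelet. It assumes the series length is a power of two and uses plain averages and half-differences without any normalization. The function name and the toy series are made up for this example.

```python
import numpy as np

def haar_transform(series):
    """Return [overall average, coarsest detail, ..., finest details]."""
    approx = np.asarray(series, dtype=float)
    details = []
    while len(approx) > 1:
        pairs = approx.reshape(-1, 2)
        # detail coefficient: half the difference between consecutive halves
        details.append((pairs[:, 0] - pairs[:, 1]) / 2)
        # the averages are passed up to the next, coarser level
        approx = pairs.mean(axis=1)
    return np.concatenate([approx] + details[::-1])

coeffs = haar_transform([8, 6, 2, 3, 4, 6, 6, 5])
```

The first coefficient is the global mean. Each later coefficient describes a change inside a progressively shorter stretch of the series, which is why wavelets localize changes well.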
### Fourier Transform
DFT = Discrete Fourier Transform,
usually calculated by FFT = Fast Fourier Transform
or replaced with DCT = Discrete Cosine Transform.
DFT describes the global data by combinations of sinusoidal waves.
Best for periodic time series.
The coefficients are complex numbers but 
the imaginary terms cancel out to give a real-valued reconstruction.
Euclidean distance between the DFT coefficients of two series equals (up to a constant scale factor) the Euclidean distance between the series themselves.
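
A small NumPy sketch of both points: distance is preserved in the frequency domain, and a periodic series survives truncation to a few coefficients. The signals, the helper name `compress`, and the choice to keep the lowest frequencies are assumptions for illustration.

```python
import numpy as np

t = np.arange(64)
x = np.sin(2 * np.pi * t / 16)
y = np.sin(2 * np.pi * t / 16 + 0.5)

# norm="ortho" scales the transform so that Euclidean lengths are preserved
X = np.fft.fft(x, norm="ortho")
Y = np.fft.fft(y, norm="ortho")
time_dist = np.linalg.norm(x - y)
freq_dist = np.linalg.norm(X - Y)

def compress(series, k):
    """Keep only the k lowest-frequency coefficients, then reconstruct."""
    coeffs = np.fft.rfft(series)
    coeffs[k:] = 0
    return np.fft.irfft(coeffs, n=len(series))

approx = compress(x, 8)  # x is a single sinusoid, so 8 coefficients are plenty
```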
### SAX = Symbolic Aggregate Approximation 
Choose certain values that are symbolic or representative.
Ideally, those values should be equally represented and equally likely.
Example: replace a sine wave with a square wave with 3 values: +1, -1, 0. 
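
One way SAX is commonly sketched: standardize the series, average it in segments, then map each segment mean to a letter. The breakpoints ±0.43 approximately split a standard normal distribution into three equally likely ranges. The function name, the alphabet, and the requirement that the length divides evenly into segments are assumptions for this example.

```python
import numpy as np

def sax(series, n_segments, breakpoints=(-0.43, 0.43), alphabet="abc"):
    """Convert a series to a short word over a 3-letter alphabet."""
    y = np.asarray(series, dtype=float)
    y = (y - y.mean()) / y.std()                       # z-score first
    segments = y.reshape(n_segments, -1).mean(axis=1)  # length must divide evenly
    symbols = np.digitize(segments, breakpoints)       # 0, 1 or 2 per segment
    return "".join(alphabet[s] for s in symbols)

wave = np.sin(2 * np.pi * np.arange(32) / 32)  # one full sine period
word = sax(wave, 8)
```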

## ARIMA(p,d,q)
For a single-valued time series.

Stationary time series have time-independent mean and variance. Most time series are non-stationary but can be made so. For example, prices might be steady after adjusting for inflation.

### I(d): differencing   
With I(d), we predict not the actual values but the d-order differences. At d=1, we predict first-order differences, which should account for a mean that increases with time:   
$y'_{i} = y_{i}-y_{i-1}$   
At d=2, we predict second-order differences, i.e. differences of differences, which I suppose captures a bit of non-linearity:   
$y''_{i} = y'_{i}-y'_{i-1}$   

Since AR(p) and MA(q) assume stationary data, apply I(d) first.
For stationary data, use I(d=0) or use ARMA without the I.  
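
A quick check of what differencing does, using made-up series: first differences flatten a linear trend, and second differences flatten a quadratic one.

```python
import numpy as np

linear = np.array([3, 5, 7, 9, 11])
quadratic = np.array([0, 1, 4, 9, 16, 25])

d1_linear = np.diff(linear, n=1)        # constant: the linear trend is gone
d1_quadratic = np.diff(quadratic, n=1)  # still trending upward
d2_quadratic = np.diff(quadratic, n=2)  # constant after differencing twice
```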

### AR(p): autoregression
With AR(p), we predict the current time value by a linear combination of p previous values. 
AR assumes stationarity.

For p=2, the model uses a combination of 2 previous values plus a term for white noise:    
$\hat{y}_{i} = (a_{1})(y_{i-1}) + (a_{2})(y_{i-2}) + \epsilon_i$   
Note that $a_1 = a_2 = \frac{1}{2}$ means use the average of the previous two points.
Other values give a weighted combination of the previous values.

To fit this model to the data and learn the $a_i$ parameters, 
use linear regression and least squares.
Each previous time window provides one linear equation.
Since there are more equations than unknowns,
the system is overdetermined (the equations contradict each other), 
so there is no exact solution, only a least-squares best fit.
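
The least-squares fit might look like the sketch below. The function name `fit_ar`, the synthetic AR(2) data, and the coefficients 0.5 and 0.3 are assumptions chosen for the example, and no constant term is fitted.

```python
import numpy as np

def fit_ar(series, p):
    """Estimate AR(p) coefficients by ordinary least squares."""
    y = np.asarray(series, dtype=float)
    # Row for time i holds the p previous values [y[i-1], ..., y[i-p]]
    X = np.column_stack([y[p - k : len(y) - k] for k in range(1, p + 1)])
    target = y[p:]
    coeffs, *_ = np.linalg.lstsq(X, target, rcond=None)
    return coeffs

# Synthetic AR(2) series with known coefficients, to see whether the fit recovers them
rng = np.random.default_rng(0)
n = 2000
noise = rng.normal(size=n)
y = np.zeros(n)
for i in range(2, n):
    y[i] = 0.5 * y[i - 1] + 0.3 * y[i - 2] + noise[i]

a_hat = fit_ar(y, p=2)
```

Each row of `X` is one of the linear equations mentioned above. With about 2000 rows and only 2 unknowns, the result is a best fit rather than an exact solution.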

The autocorrelation plot helps choose p.
The X-axis is the "lag"; values 1,2,3 represent the previous 1,2,3 time points. 
The Y-axis is the correlation, between -1 and 1.
Usually the plot starts near one and drops toward zero as the correlation fades.
Choose p within the range of lags where the correlation is still high.
Keep p small to avoid overfitting.

In seasonal data, the autocorrelation plot looks like a sine wave.
In temperature data, correlation is near +1 at 12-month lag,
and near -1 at 6-month lag.
AR(12) could pick up, "After an especially cold winter comes a rather cool summer."
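
The lag correlations can be computed directly. This sketch uses a common sample estimator and a synthetic, noise-free monthly temperature series; both are assumptions for illustration.

```python
import numpy as np

def autocorrelation(series, lag):
    """Sample autocorrelation at a positive lag (lag must be >= 1)."""
    y = np.asarray(series, dtype=float)
    y = y - y.mean()
    return np.dot(y[:-lag], y[lag:]) / np.dot(y, y)

months = np.arange(120)                                  # ten years of monthly data
temperature = 15 + 10 * np.sin(2 * np.pi * months / 12)  # synthetic seasonal cycle

r12 = autocorrelation(temperature, 12)  # same season one year earlier: strongly positive
r6 = autocorrelation(temperature, 6)    # opposite season: strongly negative
```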

### MA(q): moving average
With MA(q), we predict the current value from the q previous shocks, i.e. the deviations of earlier values from their predictions.
MA assumes stationarity.
Obviously, it assumes previous shocks are predictive of future values.
I think it also assumes that a shock's influence dies out after q time steps.

The prediction does not depend on the previous value, only its deviation.
I guess this would work on mean-centered data
and you could re-insert the mean later.
For q=2, the model uses the previous two deviations from the prediction:       
$\hat{y}_{i} = (b_{1})(\epsilon_{i-1}) + (b_{2})(\epsilon_{i-2}) + \epsilon_i$   

Fitting this model to the data to learn the $b_i$ parameters is harder:
the shocks are themselves defined by earlier predictions, so the system is recursive and non-linear, and linear regression is inappropriate.
Use an iterative method like gradient descent.
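
A rough sketch of such an iterative fit for the simplest case, MA(1), on mean-centered synthetic data. The helper names, the zero initial shock, the finite-difference gradient, the learning rate, and the true coefficient 0.6 are all assumptions for this example. Real libraries use more careful estimators.

```python
import numpy as np

def residuals(y, b):
    """Recover the shocks recursively: e_i = y_i - b * e_{i-1}."""
    e = np.zeros(len(y))
    prev = 0.0  # assume the shock before the first observation was zero
    for i, value in enumerate(y):
        e[i] = value - b * prev
        prev = e[i]
    return e

def mse(y, b):
    return np.mean(residuals(y, b) ** 2)

def fit_ma1(y, steps=100, lr=0.1, h=1e-4):
    """Gradient descent on the mean squared shock, with a numerical gradient."""
    b = 0.0
    for _ in range(steps):
        grad = (mse(y, b + h) - mse(y, b - h)) / (2 * h)
        b = float(np.clip(b - lr * grad, -0.99, 0.99))  # keep the recursion stable
    return b

# Synthetic MA(1) data with b = 0.6
rng = np.random.default_rng(0)
shocks = rng.normal(size=1001)
y = shocks[1:] + 0.6 * shocks[:-1]

b_hat = fit_ma1(y)
```

The recursion inside `residuals` is what makes the problem non-linear in `b`: every shock depends on all the earlier ones.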