# Time Series
This notebook was formerly named TimeSeries_02.

See Charu Aggarwal, Data Mining, chapter 14, Time Series.

Time series data have two components:
1. behavioral e.g. the temperature measurement values
1. contextual e.g. the temperature measurement times

SIMILARITY:   
To compare two time series, use:
1. Euclidean distance (requires same number of time points)
1. Edit distance (assume possible to mutate one into the other)
1. Longest Common Subsequence (LCSS), found by dynamic programming (possibly recursive)
1. Difference between their coefficients in wave transform (wavelet, Haar, Fourier)
1. Align and compute distance after Dynamic Time Warp (DTW): measure differences after aligning periods with similar meaning (such as heart-valve-open and heart-valve-closed).
1. Align and compute distance after Piecewise Aggregate Approximation (PAA). This uses mean or median of each bin.
1. Align and compute distance after Symbolic Aggregate Approximation (SAX). This reduces continuous range to a few values of equal frequency. For example, turns a sine wave into a step function.
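
As a rough illustration of two items in the list above, here is a minimal sketch of Euclidean distance and DTW. The function names and the absolute-difference cost are my assumptions, not taken from Aggarwal.

```python
def euclidean_distance(a, b):
    """Point-by-point distance; requires the same number of time points."""
    if len(a) != len(b):
        raise ValueError("series must have the same length")
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def dtw_distance(a, b):
    """Dynamic time warping distance with an absolute-difference cost."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] = cheapest cumulative cost of aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(a[i - 1] - b[j - 1])
            cost[i][j] = step + min(
                cost[i - 1][j],      # b[j-1] is matched to several points of a
                cost[i][j - 1],      # a[i-1] is matched to several points of b
                cost[i - 1][j - 1],  # both series advance together
            )
    return cost[n][m]
```

Unlike Euclidean distance, DTW accepts series of different lengths, because one point may be matched to several points in the other series.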

## Data Prep
Ideally, work with consecutive time points and no missing values.

### Missing values
INTERPOLATION:   
Interpolate missing values if required. 
Linear interpolation is usually fine, but polynomial and spline can be used.
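
A minimal sketch of linear interpolation, assuming missing values are marked with `None` (my convention for this example). Gaps at the very start or end are left unfilled because they have only one known neighbor.

```python
def interpolate_linear(values):
    """Fill interior None gaps on a straight line between known neighbors."""
    filled = list(values)
    known = [i for i, v in enumerate(filled) if v is not None]
    for left, right in zip(known, known[1:]):
        span = right - left
        for i in range(left + 1, right):
            frac = (i - left) / span
            filled[i] = filled[left] + frac * (filled[right] - filled[left])
    return filled
```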

### Smoothing and Noise Reduction
PAA = piecewise aggregate approximation i.e. BINNING.
Binning does smoothing and data reduction.
Larger bin sizes provide more smoothing.
Results can be sensitive to bin size.

Apply non-overlapping windows. 
Replace each window with one statistic.
Mean is more inclusive, but median is less sensitive to outliers.
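
The binning described above could look like this. It is a sketch; the name `paa` and the choice to keep a shorter final bin are my assumptions.

```python
from statistics import mean, median


def paa(values, bin_size, statistic=mean):
    """Replace each non-overlapping window of bin_size points with one statistic."""
    return [statistic(values[i:i + bin_size])
            for i in range(0, len(values), bin_size)]


spiky = [1, 1, 100, 2, 2, 2]
by_mean = paa(spiky, 3)            # the outlier 100 pulls the first bin up
by_median = paa(spiky, 3, median)  # the median ignores the outlier
```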

MOVING AVERAGE:   
Moving average smoothing does data smoothing but not data reduction.

Apply overlapping windows e.g. stride=1.
Replace each window with one statistic.
Mean is more inclusive, but median is less sensitive to outliers.

Downsides: 
1. Window effect: You lose the first window of data
1. Lag: Sudden big changes are hidden for a while
1. Inversion: If wavelength is about half the window size, the waves can flip up/down.
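
A minimal sketch of a trailing moving average with stride 1 (the function name is mine). The step-function example shows both the window effect and the lag.

```python
def moving_average(values, window):
    """Trailing mean over overlapping windows; output is window - 1 points shorter."""
    return [sum(values[i - window + 1:i + 1]) / window
            for i in range(window - 1, len(values))]


step = [0, 0, 0, 9, 9, 9]
smoothed = moving_average(step, 3)
# The jump from 0 to 9 is spread over several outputs (lag),
# and the output has 4 points instead of 6 (window effect).
```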

EXPONENTIAL SMOOTHING:   
Exponential smoothing uses weighted average,
so the most recent value counts more or less than the trend.
Requires a smoothing parameter $\alpha$. 
Larger values of $\alpha$ emphasize the most recent value more.

$\hat{y}_{i} = (\alpha)(y_{i})+(1-\alpha)(\hat{y}_{i-1})$   

If $\alpha=\frac{1}{4}$ then   
$\hat{y}_{i} = (\frac{1}{4})(y_{i})+(\frac{3}{4})(\hat{y}_{i-1})$   

The recursion leads to exponential decay of older values.
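
The recursion above can be sketched as follows. Seeding the first smoothed value with the first observation is my assumption; other initializations are possible.

```python
def exponential_smoothing(values, alpha):
    """Simple exponential smoothing: s[i] = alpha*y[i] + (1 - alpha)*s[i-1]."""
    smoothed = [values[0]]  # assumption: seed with the first observation
    for y in values[1:]:
        smoothed.append(alpha * y + (1 - alpha) * smoothed[-1])
    return smoothed


example = exponential_smoothing([4.0, 8.0, 8.0], alpha=0.25)
```

With $\alpha=\frac{1}{4}$ the smoothed series moves only a quarter of the way toward each new value, so it approaches 8 slowly.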

Notes from [Wikipedia](https://en.wikipedia.org/wiki/Exponential_smoothing).
Exponential smoothing is just a rule of thumb.
It is popular because it is easy to use.
It is a low-pass filter (it passes low-frequency, slow changes but attenuates high-frequency, rapid changes).
For looking ahead one timepoint, it is more reliable than moving average.
It fails to detect trends; for a steadily increasing price, the prediction always lags.
It incorporates infinitely many previous timepoints, 
with the coefficient $\alpha(1-\alpha)^n$ for the value n time units ago.

I have only covered simple exponential smoothing, with one parameter.
It can be extended to have multiple parameters e.g. damping.

## Transforms

### Normalization
Two ways to normalize.
1. Range-based: (yi-min)/(max-min)
1. Z-score: (yi-mean)/(stdev)

Z-score standardization is preferable mathematically but 
range-based is computationally convenient since 
every scaled value lies between 0 and 1.

For multivariate behavioral data on different scales,
normalize each feature (variable) separately.
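
Both normalizations in a short sketch. The function names are mine, and I assume the population standard deviation and a non-constant series (a constant series would divide by zero).

```python
from statistics import mean, pstdev


def min_max_scale(values):
    """Range-based: (y - min) / (max - min), so results lie in [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def z_score(values):
    """Z-score: (y - mean) / stdev, using the population standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]
```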

### Differencing
This captures and erases a trend, leaving a stationary timeseries.

First order differencing removes a linear trend.
Use the difference between consecutive time points. 
Replace each time point value with its delta since the previous time point. 
For example: my age keeps going up, but the difference is 1 every year. 

Second order differencing removes a quadratic trend.
Use the difference of consecutive differences.
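
A minimal sketch of differencing of any order (the function name is mine). Each pass shortens the series by one point.

```python
def difference(values, order=1):
    """Apply consecutive differencing `order` times."""
    for _ in range(order):
        values = [b - a for a, b in zip(values, values[1:])]
    return values


linear = [3, 5, 7, 9]          # linear trend
squares = [0, 1, 4, 9, 16]     # quadratic trend
flat_once = difference(linear)         # constant after one pass
flat_twice = difference(squares, 2)    # constant after two passes
```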

### Log transform
This may turn an exponential trend into a linear one, which differencing can then remove.
For example: prices incorporate the compounding effects of inflation.
Differencing alone doesn't help because the differences keep increasing.
After the log transform, the differenced series is stationary.
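
To illustrate with made-up numbers, assume a price that grows 5% per time step (a toy series, not real data).

```python
import math

prices = [100 * 1.05 ** t for t in range(6)]  # toy series with 5% compounding

# Differences of the raw prices keep growing.
raw_diffs = [b - a for a, b in zip(prices, prices[1:])]

# Differences of the log prices are constant, equal to log(1.05).
log_diffs = [math.log(b) - math.log(a) for a, b in zip(prices, prices[1:])]
```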

## Data Reduction
### DWT = Discrete Wavelet Transform
This transform is used for data reduction, noise reduction, data compression, and lossy image compression.

DWT decomposes the time series into combinations of (coefficient * wave).
Each wavelet captures the difference between consecutive periods.
One wavelet captures first half vs second half, and so on, recursively.
The coefficients are ranked by magnitude; 
discard the smallest coefficients for lossy compression.

Wavelets are better than Fourier for capturing one-time events such as bursts.

The simplest DWT is the Haar Transform, which uses a square wave.
Each coefficient is half the difference between the left-half average and the right-half average of one segment; one extra coefficient stores the overall average.

Computation time is linear.
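
A minimal sketch of the Haar transform, assuming the series length is a power of 2. The output ordering (overall average first, then coarse to fine details) is my choice for this example.

```python
def haar_transform(values):
    """Haar decomposition of a series whose length is a power of 2.

    Returns [overall average, coarsest detail, ..., finest details].
    """
    averages = list(values)
    levels = []  # detail coefficients, finest level first
    while len(averages) > 1:
        left, right = averages[0::2], averages[1::2]
        # Detail: half the difference between neighboring averages.
        levels.append([(a - b) / 2 for a, b in zip(left, right)])
        averages = [(a + b) / 2 for a, b in zip(left, right)]
    coefficients = averages  # one value: the overall average
    for level in reversed(levels):
        coefficients.extend(level)
    return coefficients


coeffs = haar_transform([8, 6, 2, 3, 4, 6, 6, 5])
```

Each pass halves the number of averages, so the total work is n + n/2 + n/4 + ..., which is linear in n.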

### Fourier Transform
This transform is used to describe an oscillation.
DFT = Discrete Fourier Transform.
If the signal were a perfect sine wave, the DFT would describe the wavelength and amplitude.

Like wavelet, DFT decomposes the time series into combinations of (coefficient * wave).
Fourier describes the global data by combinations of sinusoidal waves.
DFT is best for describing periodic time series composed of sine waves on sine waves.

The coefficients are complex numbers but, for a real-valued series, 
the imaginary terms cancel out in the inverse transform to give a real-valued reconstruction.

This transform allows quick time series comparison.
The Euclidean distance between two series' DFT coefficients equals, up to a constant scale factor, the Euclidean distance between the series themselves (Parseval's theorem).

DFT computation time is quadratic, 
but FFT computation time is log-linear by taking advantage of sparse matrices.
DFT is usually calculated by FFT = Fast Fourier Transform,
or replaced by DCT = Discrete Cosine Transform.
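
For illustration only, here is the quadratic-time DFT written directly from its definition. In practice a library FFT would be used instead. The function names and the test signal are mine.

```python
import cmath
import math


def dft(values):
    """Naive O(n^2) discrete Fourier transform."""
    n = len(values)
    return [sum(values[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]


def inverse_dft(coeffs):
    """Naive inverse transform; imaginary parts vanish for real input series."""
    n = len(coeffs)
    return [sum(coeffs[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)) / n
            for t in range(n)]


# A pure sine wave with 2 cycles over 16 samples.
signal = [math.sin(2 * math.pi * 2 * t / 16) for t in range(16)]
magnitudes = [abs(c) for c in dft(signal)]
```

For this signal the magnitudes are essentially zero except at index 2 and its mirror image at index 14, which is how the DFT reports the wavelength and amplitude of a perfect sine wave.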

### SAX = Symbolic Aggregate Approximation 
Choose certain values that are symbolic or representative.  
Ideally, those values should be equally represented and equally likely.  
Example: replace a sine wave with a square wave with 3 values: +1, -1, 0.   
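
A simplified sketch of the symbol assignment, using equal-frequency breakpoints taken from the data itself. This is my simplification: as I understand it, the published SAX method z-normalizes the series, applies PAA, and takes its breakpoints from a Gaussian distribution.

```python
def sax_symbols(values, alphabet="abc"):
    """Map each value to a symbol so that symbols are roughly equally frequent."""
    k = len(alphabet)
    ordered = sorted(values)
    n = len(ordered)
    # Breakpoints at the empirical quantiles 1/k, 2/k, ...
    breakpoints = [ordered[(i * n) // k] for i in range(1, k)]
    # A value's symbol index is the number of breakpoints it reaches.
    return "".join(alphabet[sum(v >= b for b in breakpoints)] for v in values)


word = sax_symbols([5, 1, 4, 2, 6, 3])
```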

## ARIMA(p,d,q)
ARIMA is for forecasting the next value of a single-valued time series.  
For multivariate predictions, predict each feature separately.

Stationary time series have time-independent mean and variance.   
Most time series are non-stationary but can be made stationary for ARIMA.   
For example, prices might be steady after adjusting for inflation, or after applying a log transform.   
Within ARIMA, second-order differencing with I(d=2) might make it stationary.

Jason Brownlee shows a worked toy example in Python 
[link](https://machinelearningmastery.com/arima-for-time-series-forecasting-with-python/).
This stats class lecture gives a much more detailed example in R
[link](https://ademos.people.uic.edu/Chapter23.html).

### I(d): differencing   
With I(d), we predict not the actual values but the d-order differences. 
First, transform the series of measurements into a series of consecutive differences.
Second, predict the next difference.
Third, transform the predicted difference to a predicted measurement.

At d=1, we predict first-order differences, i.e. differences of consecutive measurements.   
$y'_{i} = y_{i}-y_{i-1}$   
First-order differencing models a mean that increases linearly with time.

At d=2, we predict second-order differences, i.e. differences of differences.   
$y''_{i} = y'_{i}-y'_{i-1}$   
Second-order differencing models a mean that increases quadratically with time.   

Since AR(p) and MA(q) assume stationary data, apply differencing first.
If the data are stationary, use I(d=0) or use ARMA, which is ARIMA without the I.  
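
The three steps for d=1 can be sketched as below. The function name and the `predict_next_difference` callback are hypothetical placeholders; in ARIMA that role is played by the fitted AR and MA terms.

```python
def forecast_with_differencing(values, predict_next_difference):
    """Forecast the next measurement by way of first-order differences."""
    # Step 1: transform measurements into consecutive differences.
    diffs = [b - a for a, b in zip(values, values[1:])]
    # Step 2: predict the next difference (any stationary model could go here).
    next_diff = predict_next_difference(diffs)
    # Step 3: transform the predicted difference back into a measurement.
    return values[-1] + next_diff


# Placeholder predictor: the mean of the past differences.
forecast = forecast_with_differencing([10, 12, 14, 16], lambda d: sum(d) / len(d))
```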

### AR(p): autoregression
With AR(p), we predict the current time value by a linear combination of p previous "lagged" values. 
AR assumes stationarity.

For p=1, the model relies on the previous value plus a term for white noise. 
Here, $\epsilon_i$ is an unknown value from a random variable.   
$\hat{y}_{i} = (a_{1})(y_{i-1}) + \epsilon_i$   

For p=2, the model uses a combination of 2 previous values plus a term for white noise:    
$\hat{y}_{i} = (a_{1})(y_{i-1}) + (a_{2})(y_{i-2}) + \epsilon_i$   

If $a_1 = a_2 = \frac{1}{2}$, then AR(2) is just a moving average of window size 2.  
Other values of $a_1, a_2$ take a weighted average of the previous times.  
For example, AR(6) could learn coefficients 0,0,.5,0,0,.5 and thus rely equally on 3 and 6 times ago.

To fit this model to the data and learn the $a_i$ coefficients, 
use linear regression and least squares.  
Each previous time window provides one linear equation.  
Since there are more equations than unknowns,
the system is overdetermined (the equations contradict each other).   
So there is no exact solution; least squares finds the best compromise.  
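
A minimal sketch of fitting AR(2) by least squares, solving the two normal equations by hand. The function name is mine, and the series is assumed to be already mean-centered. The toy series is noiseless, so the fit can recover the coefficients almost exactly; real data would give only an estimate.

```python
def fit_ar2(values):
    """Least-squares estimate of (a1, a2) in y[i] ~ a1*y[i-1] + a2*y[i-2]."""
    # Each time window contributes one equation: (y[i-1], y[i-2]) -> y[i].
    rows = [(values[i - 1], values[i - 2], values[i]) for i in range(2, len(values))]
    s11 = sum(x1 * x1 for x1, _, _ in rows)
    s12 = sum(x1 * x2 for x1, x2, _ in rows)
    s22 = sum(x2 * x2 for _, x2, _ in rows)
    t1 = sum(x1 * y for x1, _, y in rows)
    t2 = sum(x2 * y for _, x2, y in rows)
    # Solve the 2x2 normal equations with Cramer's rule.
    det = s11 * s22 - s12 * s12
    a1 = (t1 * s22 - t2 * s12) / det
    a2 = (t2 * s11 - t1 * s12) / det
    return a1, a2


# Toy series generated with a1 = 0.5 and a2 = 0.3 and no noise.
series = [1.0, 2.0]
for _ in range(10):
    series.append(0.5 * series[-1] + 0.3 * series[-2])
a1, a2 = fit_ar2(series)
```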

### MA(q): moving average
With MA(q), we predict the shocks (aka innovations, deviations, residuals) from the mean.  
MA assumes stationarity, so the mean is a given.  
MA uses the past q deviations from the mean to predict the next deviation from the mean.   

I guess if you mean-center your data, you must re-insert the mean for valid predictions.

MA assumes previous shocks are predictive of future values.
It does not require shocks to come in regular periods; it assumes each shock influences only the next q values.

For q=1, the model predicts the next deviation based on the previous one.
Each prediction does not depend on the previous value, 
but rather its deviation from the mean.
Here, $\epsilon_{i-1}$ is a known quantity, but $\epsilon_i$ is an unknown from a random variable.    
$\hat{y}_{i} = (b_{1})(\epsilon_{i-1}) + \epsilon_i$   

For q=2, the model uses the previous two deviations.
It uses two coefficients to determine their relative importance.
Aggarwal gives this for centered data (epsilon=deviation):   
$\hat{y}_{i} = (b_{1})(\epsilon_{i-1}) + (b_{2})(\epsilon_{i-2}) + \epsilon_i$   

But [Penn State](https://online.stat.psu.edu/stat510/lesson/2/2.1) 
gives this, which incorporates the mean but is otherwise the same (w=deviation):   
$x_t = \mu + w_t + \Theta_1 w_{t-1} + \Theta_2 w_{t-2}$
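
To make the MA(2) equation concrete, here is a sketch that simulates it and makes a one-step forecast. The function names, the Gaussian shocks with unit variance, and the parameter values are my assumptions.

```python
import random


def simulate_ma2(n, mu, theta1, theta2, seed=0):
    """Generate n values of x[t] = mu + w[t] + theta1*w[t-1] + theta2*w[t-2]."""
    rng = random.Random(seed)
    shocks = [rng.gauss(0, 1) for _ in range(n + 2)]
    return [mu + shocks[t] + theta1 * shocks[t - 1] + theta2 * shocks[t - 2]
            for t in range(2, n + 2)]


def forecast_ma2(mu, theta1, theta2, last_shock, prior_shock):
    """One-step forecast: the unknown current shock has expected value zero."""
    return mu + theta1 * last_shock + theta2 * prior_shock


series = simulate_ma2(20000, mu=10.0, theta1=0.5, theta2=0.25)
prediction = forecast_ma2(10, 0.5, 0.25, last_shock=2, prior_shock=4)
```

The simulated series hovers around the mean, since every shock drops out of the model after two time steps.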

## AR vs MA
AR(p) predicts the next value based on p previous values.   
If the values were 6 and 7 previously, AR might learn to predict 8.

MA(q) predicts the next deviation based on q previous deviations.   
If the values were $\mu+1$ and $\mu+2$ previously, MA might learn to predict $\mu+3$.

In theory, either one would suffice on its own, but good predictions might require larger p or q.   
By modeling the value and the deviation separately, and combining predictions, ARIMA is more robust.

## ARIMA parameter tuning
The approach described by Box & Jenkins involves trial & error, iterations, and grid search.

### Auto correlation plot, ACF
STAT 510 at [Penn State](https://online.stat.psu.edu/stat510/lesson/1/1.2) has a good explanation. 
[Duke 411](https://people.duke.edu/~rnau/411arim3.htm) shows good plots.

Auto Correlation Function ACF = covariance / variance (which is similar to Pearson correlation).    
ACF measures correlation of now to previous, aka now to lag 1, lag 2, lag 3.    
If ACF(1) is a large fraction, then lag 1 had a big influence on now.   
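
A minimal sketch of the sample ACF as autocovariance divided by variance (the function name is mine). Plotting libraries compute this for you. The alternating toy series shows a strongly negative lag 1 and a strongly positive lag 2.

```python
def acf(values, lag):
    """Sample autocorrelation at the given lag."""
    n = len(values)
    mu = sum(values) / n
    variance = sum((v - mu) ** 2 for v in values)
    covariance = sum((values[i] - mu) * (values[i - lag] - mu)
                     for i in range(lag, n))
    return covariance / variance


alternating = [1, -1] * 10
lag1 = acf(alternating, 1)
lag2 = acf(alternating, 2)
```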

For an MA(1) model, the lag 1 ACF is $\rho = \frac{\Theta_1}{1 + \Theta_1^2}$

If ACF(1) is large, then each value is influenced by the previous.   
Thus, it means also that lag 2 had a big influence on the lag 1 value.   
This is where partial autocorrelation (PACF) comes in.    
PACF(2) is the correlation of lag 2 with now, after removing the effect of lag 1.   

The X-axis is the "lag"; values 1,2,3 represent the previous 1,2,3 time points, with zero on the left.  
The Y-axis is the (range -1 to 1) auto correlation function (ACF).  
The autocorrelation plots help choose p and q for ARIMA.  
They say to choose p from the PACF and choose q from the ACF.   
Choose the largest X-axis value where Y is still statistically significant.   
But the larger the value, the higher the chance of overfitting.  
On the plots, statistical significance thresholds are indicated by horizontal lines above & below the axis.
The region of no significance may have a cigar shape (narrow on the left).

Usually the correlations are high at X=1 
because the current time value depends heavily on the previous.
Usually the correlations taper to noise after some time.
For negative coefficients, the plot can alternate: +1, -1, +1, -1.
For periodic time series, the plot can look sinusoidal.

In periodic data, the autocorrelation plot looks sinusoidal.   
An algorithm like LOESS can discern the trend and seasonal effect.     
(LOESS = locally weighted scatterplot smoothing, a local regression technique related to the moving average.)   
Example: in monthly temperature data, correlation is near +1 at 12-month lag.
We could:
* Preprocess the timeseries data to subtract the seasonal effect.
* Use AR(12) so last year can predict this year, but it must learn to downweight the months in between.
* Use MA(12) to model the seasonal ups and downs of every month.

If AR(1) is sufficient, the ACF at lag 2 is the square (a smaller fraction) of the lag 1 value. 
A plot of Partial Auto Correlation (PACF) subtracts this out,
leaving the only significant bar at lag 1.
When PACF leaves more than one bar, it is recommended to try AR(2) next.

Penn State shows modeling seasonal data (beer sales).
Their model has terms for time (due to trend) 
and time squared (due to upward curve in the trend) 
plus four terms representing seasons.
Each seasonal term has an indicator function (1 or 0) and a learned coefficient.
Another example uses 12-month (seasonal) differencing, $y_i - y_{i-12}$, to model seasonality.
Note this is differencing at lag 12, not 12th-order differencing I(d=12).

After choosing q, fit the MA(q) model to the data and learn the parameters ($b_i$ or $\Theta_i$) 
by a hill climbing method like gradient descent.
The system is recursive and non-linear, because the past shocks are not observed directly, so ordinary linear regression is inappropriate.

### Goodness of fit

To test goodness of fit, plot the residuals vs time.
If this looks like zero-centered random noise, the model works, and the residuals are unpredictable.

### AIC and BIC
These are statistical alternatives to empirical measurement by cross-validation.  
These statistics measure goodness of fit of model to data.   
Both reward log likelihood of the model and both penalize model complexity.   
They can be used to evaluate time series models.   
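
A short sketch using the standard definitions AIC = 2k - 2 ln(L) and BIC = k ln(n) - 2 ln(L), where k is the number of fitted parameters, n is the number of observations, and ln(L) is the maximized log likelihood. The function names and the example numbers are mine. Lower scores are better.

```python
import math


def aic(log_likelihood, num_params):
    """Akaike information criterion: 2k - 2 ln(L)."""
    return 2 * num_params - 2 * log_likelihood


def bic(log_likelihood, num_params, num_observations):
    """Bayesian information criterion: k ln(n) - 2 ln(L)."""
    return num_params * math.log(num_observations) - 2 * log_likelihood


# Hypothetical comparison: the larger model fits slightly better
# but pays a bigger complexity penalty.
small_model = aic(log_likelihood=-100.0, num_params=3)
large_model = aic(log_likelihood=-99.5, num_params=6)
```

In this made-up comparison the smaller model wins, because the tiny gain in log likelihood does not offset three extra parameters.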
