09-A-time-series.Rmd

---
layout: topic
title: "Time Series Analysis"
author: Lizzie
output: html_document
---

**Suggested Reading:**

Chapter 12 from: Legendre, P. and L. Legendre. 2012. Numerical Ecology. Elsevier. [link] (https://ac.els-cdn.com/B9780444538680500125/1-s2.0-B9780444538680500125-main.pdf?_tid=f5eaeb54-d142-11e7-bee5-00000aab0f6b&acdnat=1511547298_9eabfd069eba80f333a2bc8774adb524)

**Helpful Resources:**

Cochlan, A. Little Book of R for Time Series. 2017. [link](https://a-little-book-of-r-for-time-series.readthedocs.io/en/latest/src/timeseries.html)

Hyndman, R. CRAN Task View: Time Series. [link](https://CRAN.R-project.org/view=TimeSeries)

Hyndman, R. Time Series Data Library (TSDL). [link](https://datamarket.com/data/list/?q=provider:tsdl)

National Ecological Observatory Network (NEON) Data Skills. [link](http://neondataskills.org/time-series/)

National Institute of Standards and Technology (NIST/SEMATECH). Engineering Statistics Handbook. 2012. [link](http://www.itl.nist.gov/div898/handbook/)

```{r include = FALSE}
# This code block sets up the r session when the page is rendered to html
# include = FALSE means that it will not be included in the html document

# Write every code block to the html document 
knitr::opts_chunk$set(echo = TRUE)

# Write the results of every code block to the html document 
knitr::opts_chunk$set(eval = TRUE)

# Define the directory where images generated by knit will be saved
knitr::opts_chunk$set(fig.path = "images/09-A/")

# Set the web address where R will look for files from this repository
# Do not change this address
repo_url <- "https://raw.githubusercontent.com/fukamilab/BIO202/master/"
```

### Key Points

#### Definition 
A time series is a sequence of successive values representing a variable measured at equally spaced time intervals.

#### Assumptions 
* ordered (non-random) sequence 
* measurements taken at equally spaced time intervals
  
#### Purpose 
* identify and describe the nature, internal structure that yield the observed data
+ *e.g.*, autocorrelation, trend, seasonal variation, etc.
* fit a model to forecast 
+ predict future values based on the character of the observed values' sequence
  
#### Identifying Patterns 

##### Pattern analysis: systematic pattern vs. random noise (error) 
* filter out noise to reveal pattern
  
##### Trend analysis: the trend is a general linear or nonlinear component that does not repeat in the sampled time range
1. **smoothing**: averaging the data locally to cancel out nonsystematic variability between individual observations
+ _moving average_ (MA): most common method; replaces each element of the series with either the simple or weighted average (or unweighted median) of n surrounding elements; n is the width of the smoothing "window"
+ _distance-weighted least squares smoothing_ or _negative exponentially weighted smoothing_: lesson common methods, applicable when measurement errors are large; filter out noise and convert data into smooth curves that are relatively unbiased by outliers
+ _bicubic points_: applicable when a series has a few points that are systematically distributed
2. **fitting function** 
+ monotonous linear time series: linear function
+ monotonous nonlinear: transform data using logarithmic, exponential, or polynomial function to remove nonlinearity
  
##### Seasonality analysis: seasonality is similar to trend, except the component repeats in systematic intervals over time
* defined as correlational dependency of order *k* between each *i*th element of the series and the (*i-k*)th element (Kendall, 1976); measured by autocorrelation (*i.e*., correlation between the two terms) 
* *k* = _lag_: if measurement error is not too large, seasonality emerges in the series as a pattern repeating every _k_ elements 
* **autocorrelation function (ACF)**: serial correlation coefficents with their standard errors for successive lags in a specifed range of lags; displayed in autocorrelation correlogram (autocorrelogram)
+ Note: autocorrelations for consecutive lags are interdependent
* **partial autocorrelation function (PACF)**: extension of autocorrelation that removes dependence on (or autocorrelations among) elements within a lag
+ Note: If the lag is 1, there are no intermediate elements within the lag; therefore, the partial  autocorrelation equals the autocorrelation. The partial autocorrelation may provides a "cleaner" picture of serial dependencies for individual lags, unconfounded by other serial dependencies.
* Removing serial dependency: to remove serial dependency within a specific lag, replace each *i*th element of the series with the difference from the (*i-k*)th element
+ two reasons for such transformations
1. identify hidden nature of seasonal dependencies in the series; removing some of the autocorrelations will change other auto correlations, eliminating them or making other seasonalities more salient
2. to make the series stationary, which is necessary for ARIMA (autoregressive integrated moving average) and other models

##### Multiplicative seasonality: data display seasonality and trend

<br>

### Analysis Examples

#### Three time-series datasets: 

1. Atmospheric CO2 levels (ppm) measured at Mauna Loa from 1965 to 1980

2. Total annual rainfall (in) measured in London from 1813 to 1912 ("http://robjhyndman.com/tsdldata/hurst/precip1.dat)

3. Volcanic dust veil index (measure of the environemtal impact of volcanic eruptions’ release of dust and aerosols) in the northern hemisphere, measured from 1500 to 1969 (http://robjhyndman.com/tsdldata/annual/dvi.dat)

##### Read time series data

```{r include = TRUE}
ppm <- read.csv("data/ppmmaunaloa19651980.csv")
```

##### Store data as a time-series object, specifying frequency and start.
```{r include = TRUE}
ppmtimeseries <- ts(ppm, frequency=12, start=c(1965))
ppmtimeseries
```

##### Plot the data.
```{r include = TRUE}
plot.ts(ppmtimeseries)
```
There appears to be seasonal variation in CO2 levels, with troughs every winter and peaks every summer. Since the seasonal and random fluctuations in the data remain nearly constant in size over time, we can likely use an additive model to describe the time series.

In contrast, the London rain data are not seasonal.
```{r include = TRUE}
rain <- scan("http://robjhyndman.com/tsdldata/hurst/precip1.dat",skip=1)
rain
rainseries <- ts(rain,start=c(1813))
plot.ts(rainseries)
```
However, the rain plot's random fluctuations also appear to remain consistent over time, indicating an additive model is probably appropriate.

#### Decomposing Time Series
Decomposing time series dismantles each sequence into its constituents--trend, irregular, and (if applicable) seaonsal components.

##### Decomposing non-seasonal data: trend and irregular components
Since we can describe the rain time series using an additive model, we can estimate the trend component using the smoothing method of simple moving averages (MA).

In the _TTR_ R package, the *SMA()* function applies simple MA to smooth time series: SMA(x, n=10, ...), where n denotes order, or number of periods to average over.

```{r include = TRUE}
#install package TTR()
library("TTR")
rainseriesSMA3 <- SMA(rainseries,n=3)
plot.ts(rainseriesSMA3)
```
```{r include = TRUE}
rainseriesSMA8 <- SMA(rainseries,n=8)
plot.ts(rainseriesSMA8)
```
Increasing the order of the function increases the "smoothness" of the plot, yielding a clearer depiction of the trend component.

#### Decomposing seasonal data: trend, irregular, and seasonal components

If an additive model can describe a time series, the *decompose()* R funtion estimates the trend, seasonal, and irregular components of that time series. Therefore, we can apply *decompose()* to the Mauna Loa time series.

```{r include = TRUE}
ppmtimeseriescomponents <- decompose(ppmtimeseries)
```

*decompose()* returns a list object under which it stores estimates of the seasonal, trend, and irregular components in named elements. Hence, we can inspect the estimated values of each component by specifying its variable.

```{r include = TRUE}
ppmtimeseriescomponents$seasonal
```
The estimated seasonal factors are listed for each month; they remain the same every year. 
October=-3.25194, trough
May=3.00028, peak

Plot each component.
```{r include = TRUE}
plot(ppmtimeseriescomponents)
```

##### Adjusting seasonality
Again, since we can describe the Mauna Loa CO2 time series using an additive model, we can subtract the estimated seasonal component (calculated by *decompose()*) from the original time series to yield an adjusted time series that only contains the trend and irregular components. 

```{r include = TRUE}
ppmtimeseriescomponents <- decompose(ppmtimeseries)
ppmtimeseriesseasonallyadjusted <- ppmtimeseries - ppmtimeseriescomponents$seasonal
plot(ppmtimeseriesseasonallyadjusted)
```

#### Forecasts using exponential smoothing: short-term forecasts for time-series data

#Simple exponential smoothing
We can use simple exponential smoothing to yield short-term forecasts for the London rain time series, for it is nonseasonal and displays constant variance (can be described by an additive model) and level (the mean hovers around 25 inches).

*Holt-Winters()* estimates the level, slope, and seasonal component at a given time point. Three parameters control smoothing: *alpha* estimates the level (mean), *beta* estimates the trend component's slope *b*, and *gamma* estimates the seasonal component. *Alpha*, *beta*, and *gamma* range from 0 and 1; low values indicate that recent observations carry reltively little weight regarding forecasted values.

Thus, we can modify *HoltWinters()* to execute simple exponential smoothing by setting *beta* and *gamma* to FALSE, leaving *alpha* to control smoothing.

```{r include = TRUE}
rainseriesforecasts <- HoltWinters(rainseries, beta=FALSE, gamma=FALSE)
rainseriesforecasts
```
*Alpha* is close to zero, suggesting that recent observations are weighted more than are previous observations.

*HoltWinters* stores forecasts in named element *fitted*. To inspect the fitted values:
```{r include = TRUE}
rainseriesforecasts$fitted
```

To plot the original time series against the forecasts:
```{r include = TRUE}
plot(rainseriesforecasts)
```

To calculate the sum of squared errors for the forecast errors:
```{r include = TRUE}
rainseriesforecasts$SSE
```

To modify the forecast window, we can specify the initial value in *HoltWinters()* by changing the *l.start* parameter. For example, 23.56 inches of rain fell in 1813.
```{r include = TRUE}
HoltWinters(rainseries, beta=FALSE, gamma=FALSE, l.start=23.56)
```

We can also modify the time period from which forecasts are drawn by using *forecast()* (originally *forecast.HoltWinters()*) from the *forecast* R package. Thus, we can extend the original London rainfall data time period (1813-1912) to 1920 by adjusting the *h* parameter to 8, adding 8 years.
```{r include = TRUE}
#install.packages("forecast")
library("forecast")
library("stats")
rainseriesforecasts2 <- forecast(rainseriesforecasts, h=8) #corrected from forecast.HoltWinters()
rainseriesforecasts2
```

```{r include = TRUE}
plot(rainseriesforecasts2) #originally plot.forecast()
```

If the predictive model cannot be improved, then no correlations should exist between forecast errors over successive predictions. If correlations are present, another forecasting technique should be used to improve the simple exponential smoothing forecasts. Accordingly, we can calculate a correlogram of the in-sample forecast errors using *acf()*. To specify the maximum lag investigated, modify the *lag-max* parameter in *acf()*.

```{r include = TRUE}
acf(rainseriesforecasts2$residuals, lag.max=20, na.action = na.omit) 
#need na.action = na.omit because acf works on regularly spaced data, so acf() first expands the time series to a regularly spaced series, inserting NAs as needed
```

To test whether there is significant evidence for non-zero correlations, whether any of a group of autocorrelations of a time series differ from zero, we can run a Ljung-Box test:
```{r include = TRUE}
Box.test(rainseriesforecasts2$residuals, lag=20, type="Ljung-Box") 
```
Results: Ljung-Box test statistic is 17.4, and the p-value is 0.6, suggesting there is little evidence for non-zero autocorrelations in the in-sample forecast errors at lags 1-20; small p-values (*i.e*., p-value < .05) indicates the possibility of non-zero autocorrelation within the first m lags.

Another step to check whether the predictive model can be improved upon is to investigate whether the forecast errors are normally distributed with mean zero and constant variance. Hence, make a time plot of the in-sample forecast errors to view the variance, and plot a histogram of the forecast errors (with an overlaid normal curve that has mean zero and the same standard deviation as the distribution of forecast errors) to check for normal distribution.
```{r include = TRUE}
#Check for constant variance over time:
plot.ts(rainseriesforecasts2$residuals)
#in-sample forecast errors seem to have roughly constant variance over time, although the sizes of the fluctuations early in the time series (1820-1830) may be slightly smaller than those at later dates (eg. 1840-1850)
```

```{r include = TRUE} 
##define an R function “plotForecastErrors()”
plotForecastErrors <- function(forecasterrors)
  {
    # make a histogram of the forecast errors:
  mybinsize <- IQR(forecasterrors, na.rm = TRUE)/4
    mysd   <- sd(forecasterrors, na.rm = TRUE)
    mymin  <- min(forecasterrors, na.rm = TRUE) - mysd*5
    mymax  <- max(forecasterrors, na.rm = TRUE) + mysd*3
    # generate normally distributed data with mean 0 and standard deviation mysd
    mynorm <- rnorm(10000, mean=0, sd=mysd)
    mymin2 <- min(mynorm, na.rm = TRUE)
    mymax2 <- max(mynorm, na.rm = TRUE)
    if (mymin2 < mymin ) { mymin <- mymin2}
    if (mymax2 > mymax) { mymax <- mymax2}
    # make a red histogram of the forecast errors, with the normally distributed data overlaid:
    mybins <- seq(mymin, mymax, mybinsize)
    hist(forecasterrors, col="red", freq=FALSE, breaks=mybins)
    # freq=FALSE ensures the area under the histogram = 1
    # generate normally distributed data with mean 0 and standard deviation mysd
    myhist <- hist(mynorm, plot=FALSE, breaks=mybins)
    # plot the normal curve as a blue line on top of the histogram of forecast errors:
    points(myhist$mids, myhist$density, type="l", col="blue", lwd=2)
}
#plot a histogram (with overlaid normal curve) of the forecast errors for the rainfall predictions
plotForecastErrors(rainseriesforecasts2$residuals)
```

The Ljung-Box test reveals little evdence of non-zero autocorrelations in the in-sample forecast errors, the variances in forecast errors appear roughly constant, and the forecast errors appear normally distributed. Therefore the simple exponential smoothing method likely provides an adequate predictive model for London rainfall.

##### Holt-Winters Exponential Smoothing
Holt-Winters exponential smoothing is used to make short-term forecasts for time series that can be described using an additive model with increasing or decreasing trend and seasonality, such as for the Mauna Loa CO2 time series. Holt-Winters filtering follows the same steps as for exponential smoothing, except *beta* and *gamma* are included to reflect the trend in the seasonal data series. 


#### ARIMA Models

Prediction intervals for forecasting based on exponential smoothing methods require the forecast errors to be uncorrelated and normally distributed with mean zero and constant variance.

In contrast, Autoregressive Integrated Moving Average (ARIMA) models include an explicit statistical model for the irregular component of a time series, allowing non-zero autocorrelations in the irregular component. ARIMA models also require stationary time series.

```{r include = TRUE} 
volcanodust <- scan("http://robjhyndman.com/tsdldata/annual/dvi.dat", skip=1)
volcanodust
plot(volcanodust)
volcanodustseries <- ts(volcanodust,start=c(1500))
plot.ts(volcanodustseries)
```
Given that the random fluctuations are roughly constant in size over time, an additive model is likely appropriate. The time series also appears to be stationary in mean and variance, for the mean and variance are roughly constant over time. Therefore, we can fit an ARIMA model.

To investigate which ARIMA model we should use, plot full and partial correlograms.
```{r include = TRUE} 
acf(volcanodustseries, lag.max=20)             # plot a correlogram
acf(volcanodustseries, lag.max=20, plot=FALSE) # get the values of the autocorrelations
```
```{r include = TRUE} 
pacf(volcanodustseries, lag.max=20)
pacf(volcanodustseries, lag.max=20, plot=FALSE)
```

Since the correlogram approaches zero after lag 4 and the partial correlogram after lag 2, the ARMA(2,0) model is the most appropriate candidate.
The partial correlogram approaches zero too abruptly to use the ARMA(0,3) model.
And the correlogram and partial correlogram both approach zero too abruptly to use the ARMA(p,q) mixed model.

To fit the ARMA(2,0) model [or ARIMA(2,0,0) model]:
```{r include = TRUE}
volcanodustseriesarima <- arima(volcanodustseries, order=c(2,0,0))
volcanodustseriesarima
```

Once we've fitted the ARIMA(2,0,0) model, we can use *forecast()* for predict future values of the volcanic dust veil index. Let's extend the original time range (1500-1969) by 31 years (1970-2000).

```{r include = TRUE}
volcanodustseriesforecasts <- forecast(volcanodustseriesarima, h=31) #edited from forecast.Arima
volcanodustseriesforecasts
```

Plot the orignal time series with the forecasted values.
```{r include = TRUE}
plot(volcanodustseriesforecasts) #edited from plot.forecast
```

Check if the forecast errors are correlated, normally distributed, and have constant variance.
```{r include = TRUE}
acf(volcanodustseriesforecasts$residuals, lag.max=20)
Box.test(volcanodustseriesforecasts$residuals, lag=20, type="Ljung-Box")
```
```{r include = TRUE}
plot.ts(volcanodustseriesforecasts$residuals)            # make time plot of forecast errors
plotForecastErrors(volcanodustseriesforecasts$residuals) # make a histogram
```

The variance appears relatively constant. But is the mean negative?
```{r include = TRUE}
mean(volcanodustseriesforecasts$residuals)
```
Yep. The distribution is also skewed right. Hence, the forecast errors are not normally distributed with mean zero, suggesting that the ARIMA(2,0,0) model is the not the best model for the volcanic-dust-veil-index time series.

### Discussion Questions
1. How can you use time series analysis for your own data? After learning about its applications, what new research questions could you pose using time series analysis?
2. What are the advantages and disadvantages of smoothing data using moving averages versus medians? 
3. How could we improve the ARIMA model for the volcanic-dust-veil-index time series?
4. Can we create a *Tinder*-like app to match ecologists with qualified statisticians for impromptu exponential-smoothing sessions?