### Introducing Time Series

A <b>time series</b> is a set of observations taken sequentially in time. If we keep taking the same observation at different points in time, we get a time series. This book focuses on regular time series - where we have observations coming in at regular intervals of time. 

Time series forecasting is predicting the future values of a time series given its past values, for example predicting the next day's temperature using the last 5 years of temperature data.

In [1]:
import sys
import os

# Get the absolute path of the src folder
src_path = os.path.abspath(os.path.join(os.getcwd(), "./src"))

# Add to sys.path
sys.path.append(src_path)

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import plotly.express as px
import plotly.io as pio
import timesynth as ts
from utils.plotting_utils import *
from synthetic_ts.autoregressive import AutoRegressive

import warnings
warnings.filterwarnings("ignore")

#### Data-generating process (DGP)

Any time series is generated by some kind of <i>mechanism</i>. In statistics, this underlying process is referred to as the <b>DGP</b>. Time series data is produced by stochastic and deterministic processes. The deterministic processes involve quantities that evolve in a predictable manner over time (e.g. the radioactive decay of an element). But the more interesting time series are generated by a stochastic process - a process that changes over time in a random but somewhat predictable manner (e.g. weather).

If we had complete and perfect knowledge of reality, we could write the DGP as a mathematical formula and get the most accurate forecast possible. In practice, we try to approximate the DGP mathematically as closely as we can, so that our imitation of it gives us the best possible forecast. This imitation is called a <b>model</b>. The model is not the DGP, but a useful representation of it.

#### Generating synthetic time series

We can experiment with different tools and techniques using synthetic time series.

<u>White and red noise</u>:
An extreme case of a stochastic process that generates a time series is a <b>white noise</b> process: a sequence of uncorrelated random numbers with zero mean and constant variance.

In [28]:
# Generate the time axis: sequential integers from 0 to 199
time = np.arange(200)

# Sample 200 values from a normal distribution (mean 0, standard deviation 100)
values = np.random.randn(200) * 100
plot_time_series(time, values, "White noise")

Red noise has zero mean and constant variance, but is serially correlated in time. This serial correlation/redness is parameterised by a correlation coefficient r.

In [30]:
# Setting the correlation coefficient
r = 0.4
white_noise = np.random.randn(200) * 100

# Create red noise by introducing correlation between subsequent values in the white noise
values = np.zeros(200)
for index, value in enumerate(white_noise):
    if index == 0:
        values[index] = value
    else:
        values[index] = r*values[index-1] + np.sqrt((1-np.power(r, 2))) * value

plot_time_series(time, values, "Red noise")
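One way to see the difference between the two kinds of noise numerically is to estimate the lag-1 autocorrelation, i.e. the correlation between each value and the one before it. The sketch below is self-contained and not part of the original notebook: the helper name `lag1_autocorrelation`, the fixed seed, and the longer series length are assumptions chosen so the estimate is stable. For white noise the estimate should be close to 0, and for red noise it should be close to r.

```python
import numpy as np

def lag1_autocorrelation(x):
    """Sample correlation between x[t] and x[t-1] (hypothetical helper)."""
    x = np.asarray(x, dtype=float)
    return np.corrcoef(x[:-1], x[1:])[0, 1]

# Assumptions for illustration: fixed seed and a long series for a stable estimate
rng = np.random.default_rng(42)
r = 0.4
n = 20_000

white = rng.standard_normal(n) * 100

# Same recursion as in the red noise cell above
red = np.zeros(n)
red[0] = white[0]
for t in range(1, n):
    red[t] = r * red[t - 1] + np.sqrt(1 - r**2) * white[t]

white_acf1 = lag1_autocorrelation(white)
red_acf1 = lag1_autocorrelation(red)

print(f"White noise lag-1 autocorrelation: {white_acf1:.3f}")
print(f"Red noise lag-1 autocorrelation:   {red_acf1:.3f}")
```

The factor `np.sqrt(1 - r**2)` on the new white noise value is what keeps the variance of the red noise equal to that of the white noise it is built from.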

#### Cyclical or seasonal signals

Cyclical or seasonal signals are among the most common signals you'll see in time series. Note that the two sinusoidal waves below differ in frequency (how fast the time series crosses zero) and amplitude (how far away from zero the time series travels).

In [3]:
signal_1 = ts.signals.Sinusoidal(amplitude=1.5, frequency=0.25)
signal_2 = ts.signals.Sinusoidal(amplitude=1, frequency=0.5)

# Generate time series
samples_1, regular_time_samples, signals_1, errors_1 = generate_timeseries(signal=signal_1)
samples_2, regular_time_samples, signals_2, errors_2 = generate_timeseries(signal=signal_2)

plot_time_series(regular_time_samples,
                 [samples_1, samples_2],
                 "Sinusoidal waves",
                 legends=["Amplitude = 1.5 | Frequency = 0.25",
                          "Amplitude = 1 | Frequency = 0.5"])

#### Autoregressive signals

In an autoregressive (AR) signal, the value of the time series at the current timestep depends on its values at previous timesteps. An AR signal is defined by two things:
- Order of serial correlation (number of previous timesteps the signal is dependent on)
- Coefficients to combine the previous timesteps
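To make the two parameters concrete, here is a minimal NumPy sketch of an AR process. It is not the `AutoRegressive` class used in this notebook: the function name `simulate_ar`, the zero starting values, and the Gaussian noise term are assumptions made for illustration. The order is the number of coefficients, and each new value is a weighted sum of the previous values plus a random shock.

```python
import numpy as np

def simulate_ar(coefficients, n_steps, noise_std=1.0, seed=0):
    """Simulate an AR(p) series (hypothetical helper, not the notebook's class).

    x[t] = c[0]*x[t-1] + c[1]*x[t-2] + ... + c[p-1]*x[t-p] + noise[t]
    """
    rng = np.random.default_rng(seed)
    coefs = np.asarray(coefficients, dtype=float)
    p = len(coefs)  # order of the process

    # Assumption: the p values before the series starts are all zero
    series = np.zeros(n_steps + p)
    noise = rng.normal(0.0, noise_std, size=n_steps + p)

    for t in range(p, n_steps + p):
        # Previous p values, most recent first, to line up with the coefficients
        lags = series[t - p:t][::-1]
        series[t] = coefs @ lags + noise[t]

    # Drop the artificial starting values
    return series[p:]

# Order 2, with the same coefficients as the AR cell in this notebook
ar2 = simulate_ar([1.5, -0.75], n_steps=200)
print(ar2[:5])
```

With the noise switched off (`noise_std=0.0`) and zero starting values the series stays at zero, which shows that the random shocks are what drive an AR signal.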

In [None]:
signal = AutoRegressive(ar_param=[1.5, -0.75])

samples, regular_time_samples, signals, errors = generate_timeseries(signal)
plot_time_series(regular_time_samples,
                 samples,
                 "Auto Regressive")

NameError: name 'signals' is not defined

#### Stationary and non-stationary time series

<b>Stationarity</b> is a key assumption in many modelling approaches, although most real-world time series are non-stationary. Think of the probability or data distribution of a time series. We call a time series stationary when the probability distribution remains the same at every point in time. 

Take a Gaussian distribution as an example: it is defined by two parameters, the mean and the variance. So the stationarity assumption can be broken in two ways:
- Change in mean over time
- Change in variance over time
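A rough way to check for both kinds of change is to split a series into two windows and compare the mean and variance of each. The sketch below uses synthetic data and is only an informal diagnostic, not a statistical test. The helper name `half_stats` and the three example series are assumptions made for illustration.

```python
import numpy as np

def half_stats(x):
    """Mean and variance of the first and second half of a series (hypothetical helper)."""
    first, second = np.array_split(np.asarray(x, dtype=float), 2)
    return {
        "mean": (first.mean(), second.mean()),
        "var": (first.var(), second.var()),
    }

# Assumption for illustration: three synthetic series built from the same noise
rng = np.random.default_rng(0)
n = 1000
noise = rng.standard_normal(n)

stationary = noise                            # same mean and variance throughout
trending = noise + 0.01 * np.arange(n)        # mean drifts upward over time
widening = noise * np.linspace(1.0, 5.0, n)   # variance grows over time

for name, series in [("stationary", stationary),
                     ("trending", trending),
                     ("widening", widening)]:
    stats = half_stats(series)
    print(name,
          "| mean:", np.round(stats["mean"], 2),
          "| var:", np.round(stats["var"], 2))
```

For the stationary series both halves should give similar numbers, the trending series should show a clear shift in mean, and the widening series should show a clear increase in variance.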

#### Change in mean over time

This is the most common way non-stationarity presents itself. If there is an upward or downward trend in the time series, the mean across two windows of time will not be the same. The same holds for seasonality (the mean temperatures of summer and winter are different). Below, you see a definite trend and seasonality, which together make the mean of the data distribution change considerably across different windows of time.

In [15]:
signal = ts.signals.Sinusoidal(amplitude=1, frequency=0.25)

# White noise with standard deviation = 0.3
noise = ts.noise.GaussianNoise(std=0.3)

# Generate the time series
sinusoidal_samples, regular_time_samples, _, _ = generate_timeseries(signal=signal, noise=noise)

# regular_time_samples is a linearly increasing time axis and can be used as a trend
trend = regular_time_samples * 0.4

# Combine the signal and the trend
# (named trended_series so that the imported timesynth module `ts` is not overwritten)
trended_series = sinusoidal_samples + trend
plot_time_series(regular_time_samples,
                 trended_series,
                 "Sinusoidal with trend and white noise")

#### Change in variance over time

In statistics, a change in variance over time is known as <b>heteroscedasticity</b>. As can be seen below, the seasonal peaks get wider and wider as we move through time.

In [6]:
df = pd.read_csv("datasets/AirPassengers.csv")
df.head()

Unnamed: 0,Month,#Passengers
0,1949-01,112
1,1949-02,118
2,1949-03,132
3,1949-04,129
4,1949-05,121


In [8]:
plot_time_series(df["Month"],
                 df["#Passengers"],
                 "Changing variance over time")
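A rolling standard deviation is one simple way to put a number on this widening. The sketch below does not use the AirPassengers file. Instead it builds a synthetic monthly series with 1% growth per month and a multiplicative seasonal pattern, which is an assumption made so the example runs on its own. It also shows a common remedy: taking logs turns the multiplicative pattern into an additive one, so the rolling spread of the logged series stays level.

```python
import numpy as np
import pandas as pd

# Assumption for illustration: 10 years of synthetic monthly data,
# growing 1% per month with a seasonal swing proportional to the level
months = pd.date_range("2000-01", periods=120, freq="MS")
t = np.arange(120)
level = 100 * 1.01 ** t
seasonal = 1 + 0.2 * np.sin(2 * np.pi * t / 12)
series = pd.Series(level * seasonal, index=months)

# 12-month rolling standard deviation of the raw and the logged series
rolling_std = series.rolling(window=12).std().dropna()
log_rolling_std = np.log(series).rolling(window=12).std().dropna()

# Ratio of the spread in the last window to the spread in the first window
raw_ratio = rolling_std.iloc[-1] / rolling_std.iloc[0]
log_ratio = log_rolling_std.iloc[-1] / log_rolling_std.iloc[0]

print(f"Raw series: last/first rolling std = {raw_ratio:.2f}")
print(f"Log series: last/first rolling std = {log_ratio:.2f}")
```

A ratio well above 1 for the raw series indicates growing variance, while a ratio near 1 for the logged series indicates the variance has been stabilised.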

There are three main factors that form a mental model for predictability:
- <u>Understanding the DGP</u>: The better you understand this, the higher the predictability of a time series.
- <u>Amount of data</u>: The more data you have, the better the predictability is.
- <u>Adequately repeating pattern</u>: For any mathematical model to work well, there should be an adequately repeating pattern in the time series.