# Time series forecasting projects

## 5-step forecasting task

1. Problem definition

We need to determine who needs the forecasts and how the forecasts will be used. This is often the most challenging part of the process.

2. Gathering information

The process of collecting historical data to analyze and model. This also involves getting access to domain experts and gathering information that helps to interpret the historical data and, ultimately, the forecasts that will be made.

3. Exploratory data analysis (EDA)

The use of simple tools, like graphing, to better understand the data. Review plots, then summarize and note obvious temporal structures such as trends and seasonality, anomalies such as missing data, corruption, and outliers, and any other structures that may affect forecasting.

4. Choosing and fitting models

The evaluation of two, three, or a whole suite of models of various types on the problem. Models may be chosen based on the assumptions they make and whether the data conforms to those assumptions. Models are then configured and fitted to the historical data.

5. Using and evaluating a forecasting model

The trained model is used to make forecasts, the performance of those forecasts is evaluated, and the skill of the model is estimated.

# Setup

In [1]:
import logging

FORMAT = '%(levelname)s:%(asctime)s:%(message)s'
logging.basicConfig(format=FORMAT, level=logging.INFO, force=True)

# Project 1: Monthly armed robberies in Boston

## Problem description

The problem is to predict the number of armed robberies that happen in Boston every month. We are going to use a public dataset curated by McCleary and Hay. The dataset has 118 observations and it describes the number of monthly armed robberies in Boston from January 1966 to October 1975.


## Test harness

The test harness is developed in two steps:

- Defining a validation dataset

- Developing a method for model evaluation

### Creating the validation dataset

In [2]:
import pandas as pd

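ds = pd.read_csv('https://raw.githubusercontent.com/jbrownlee/Datasets/master/monthly-robberies.csv',
                 header=0,
                 index_col=0,
                 parse_dates=True)
# convert the single-column DataFrame to a Series
# (the squeeze= argument of read_csv was removed in pandas 2.0)
ds = ds.squeeze('columns')
split_point = len(ds) - 12  # grab the last twelve months
                            # to use as the validation set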
train_set, test_set = ds[0: split_point], ds[split_point:]
logging.info(f'Training set: {len(train_set)}, Validation set: {len(test_set)}')

INFO:2022-12-18 08:56:04,221:NumExpr defaulting to 2 threads.
INFO:2022-12-18 08:56:04,552:Training set: 106, Validation set: 12


### Model evaluation

Model evaluation involves two components:

- Performance measure
- Test strategy

**Performance measure**

We are going to use the root mean squared error (RMSE). Because the errors are squared before averaging, RMSE gives more weight to predictions that are grossly wrong, and it has the same units as the original data.
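
As a minimal sketch of the measure (the `rmse` helper and the sample numbers are illustrative and not part of the original notebook), RMSE can be computed as the square root of the mean of the squared errors:

```python
import numpy as np

def rmse(actual, predicted):
    """Root mean squared error, in the same units as the inputs."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))

# Hypothetical values: both forecasts have the same total absolute error (20),
# but the second concentrates it in a single grossly wrong prediction.
actual = [100, 100, 100, 100]
even_errors = [105, 105, 105, 105]    # four errors of 5
one_big_error = [100, 100, 100, 120]  # one error of 20

rmse_even = rmse(actual, even_errors)   # 5.0
rmse_big = rmse(actual, one_big_error)  # 10.0
```

The second forecast scores twice as badly even though the total absolute error is the same, which is the sense in which RMSE penalizes large misses.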

**Test strategy** 

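Candidate models are evaluated using the walk-forward validation method: the model makes a one-step forecast, the true observation for that step is then added to the history, and the process repeats for the next time step.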
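
A generic version of this loop might look like the sketch below. The `walk_forward_rmse` function and its `predict` callback are hypothetical names introduced here for illustration; the toy series is made up.

```python
from math import sqrt

def walk_forward_rmse(train, test, predict):
    """Forecast one step at a time, revealing each true value only after forecasting it."""
    history = list(train)
    squared_errors = []
    for obs in test:
        y_hat = predict(history)  # forecast using only past observations
        squared_errors.append((obs - y_hat) ** 2)
        history.append(obs)       # the true value becomes available for the next step
    return sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical toy series; the forecaster simply repeats the last observed value.
score = walk_forward_rmse(train=[10, 12], test=[13, 15, 14],
                          predict=lambda history: history[-1])
```

Passing the forecaster as a function keeps the evaluation loop identical across candidate models, so their scores are directly comparable.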

## Persistence model

We are going to use the observation from the previous time step as the prediction for the observation at the next time step.

In [3]:
from sklearn.metrics import mean_squared_error
from math import sqrt

# prepare data: work on the raw NumPy values so that val_set[i] below
# is a plain positional lookup rather than a label lookup on the date index
X = train_set.values
# split the train dataset into train and validation sets
train_size = int(len(X) * 0.5)  # 50% of the data
                                # go into the new train set
                                # the other 50% go into the
                                # validation set
new_train_set, val_set = X[0: train_size], X[train_size:]
# walk-forward validation
history = [x for x in new_train_set]
predictions = []
for i in range(len(val_set)):
    y_hat = history[-1]
    predictions.append(y_hat)

    obs = val_set[i]
    history.append(obs)
    logging.info(f'Predicted: {y_hat}, Expected: {obs}')

# report performance
rmse = sqrt(mean_squared_error(val_set, predictions))
logging.info(f'RMSE: {rmse:.3f}')

INFO:2022-12-18 08:55:23,790:Predicted: 98, Expected: 125
INFO:2022-12-18 08:55:23,792:Predicted: 125, Expected: 155
INFO:2022-12-18 08:55:23,795:Predicted: 155, Expected: 190
INFO:2022-12-18 08:55:23,798:Predicted: 190, Expected: 236
INFO:2022-12-18 08:55:23,800:Predicted: 236, Expected: 189
INFO:2022-12-18 08:55:23,803:Predicted: 189, Expected: 174
INFO:2022-12-18 08:55:23,805:Predicted: 174, Expected: 178
INFO:2022-12-18 08:55:23,807:Predicted: 178, Expected: 136
INFO:2022-12-18 08:55:23,809:Predicted: 136, Expected: 161
INFO:2022-12-18 08:55:23,811:Predicted: 161, Expected: 171
INFO:2022-12-18 08:55:23,817:Predicted: 171, Expected: 149
INFO:2022-12-18 08:55:23,819:Predicted: 149, Expected: 184
INFO:2022-12-18 08:55:23,820:Predicted: 184, Expected: 155
INFO:2022-12-18 08:55:23,821:Predicted: 155, Expected: 276
INFO:2022-12-18 08:55:23,823:Predicted: 276, Expected: 224
INFO:2022-12-18 08:55:23,825:Predicted: 224, Expected: 213
INFO:2022-12-18 08:55:23,826:Predicted: 213, Expected: 27

## Data analysis

### Summary statistics

In [4]:
logging.info(train_set.describe())

INFO:2022-12-18 08:55:23,960:count    106.000000
mean     173.103774
std      112.231133
min       29.000000
25%       74.750000
50%      144.500000
75%      271.750000
max      487.000000
Name: Robberies, dtype: float64


From the data summary, we can make several observations:

- The number of observations matches our expectations, meaning we are handling the data correctly.

- The mean is about 173, which we might consider the level of this time series.

- The standard deviation is relatively large at 112 robberies.

- The percentiles, along with the standard deviation, suggest a large spread in the data.

### Line plot

In [5]:
import plotly.graph_objects as go

fig = go.Figure(
    go.Scatter(x=train_set.index,
               y=train_set.values,
    )
)
fig.update_traces(name='Robberies in Boston',
                  showlegend=True)
fig.show()

From the line graph, we can make several observations:

- There is an increasing trend of robberies over time.

- There do not appear to be any obvious outliers.

- There are relatively large fluctuations from year to year, up and down.

- The fluctuations in later years appear larger than the fluctuations in earlier years.

- The trend means the dataset is almost certainly non-stationary, and the apparent change in the size of the fluctuations may also contribute to this.
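
One informal way to check for these two symptoms is to compare the mean and standard deviation of the first and second halves of the series. The sketch below runs on synthetic data (the `series` is made up for illustration; in this notebook one would use `train_set` instead), and it is only a rough check, not a formal stationarity test:

```python
import numpy as np
import pandas as pd

# Synthetic monthly series with an upward trend and growing fluctuations (illustrative only).
rng = np.random.default_rng(0)
n = 96
index = pd.date_range('2000-01-01', periods=n, freq='MS')
trend = np.linspace(50, 400, n)
noise = rng.normal(0, 1, n) * np.linspace(2, 120, n)
series = pd.Series(trend + noise, index=index)

# A stationary series would have roughly the same mean and spread in both halves.
half = len(series) // 2
first, second = series.iloc[:half], series.iloc[half:]
summary = pd.DataFrame({'mean': [first.mean(), second.mean()],
                        'std': [first.std(), second.std()]},
                       index=['first half', 'second half'])
print(summary)
```

A clear difference in the means points to a trend, while a clear difference in the standard deviations points to changing variance.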

### Density plot

In [6]:
import plotly.figure_factory as ff  # create_distplot needs SciPy for the KDE curve

hist_data = [train_set.values]
group_labels = ['kde']

fig = ff.create_distplot(hist_data, group_labels, bin_size=50)
fig.show()

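From the density plot, we can make two observations:

- The distribution is not Gaussian.

- The distribution is left-shifted (most of the mass sits at low values, with a long tail to the right, consistent with the mean of 173 exceeding the median of 144.5) and may be exponential or a double Gaussian.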

### Box and whisker plots

In [7]:
import plotly.graph_objects as go

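# keep only complete calendar years (the training data ends in October 1974)
full_years = train_set[:'1973']
# group by calendar year; this avoids the 'A' frequency alias,
# which is deprecated in recent pandas releases
groups = full_years.groupby(full_years.index.year)

fig = go.Figure()

# one box per year
for year, group in groups:
    fig.add_trace(go.Box(name=str(year),
                         y=group.values))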

fig.show()

The observations suggest that the year-to-year fluctuations may not be systematic and may be hard to model. They also suggest that there may be some benefit in clipping the first two years of data from modeling if they are indeed quite different from the rest. This yearly view of the data is an interesting avenue and could be pursued further by looking at summary statistics for each year and at how those statistics change from year to year.
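
A possible starting point for that follow-up is sketched below. It uses a synthetic series as a stand-in (the data and variable names are illustrative only; in this notebook `train_set[:'1973']` would take the place of `series`):

```python
import numpy as np
import pandas as pd

# Synthetic monthly series standing in for the training data (illustrative only).
rng = np.random.default_rng(1)
index = pd.date_range('2000-01-01', periods=48, freq='MS')
series = pd.Series(rng.integers(50, 500, size=48), index=index)

# Summary statistics for each calendar year.
yearly = series.groupby(series.index.year).agg(['mean', 'median', 'std', 'min', 'max'])

# Change in each statistic from one year to the next (the first row has no predecessor, so it is NaN).
yearly_change = yearly.diff()

print(yearly)
print(yearly_change)
```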

## ARIMA models