# Time series forecasting projects

## 5-step forecasting task

1. Problem definition

We need to determine who needs the forecasts and how the forecasts will be used. This is the most challenging part of the process.

2. Gathering information

The process of collecting historical data to analyze and model. This also involves getting access to domain experts and gathering information that helps to interpret the historical data and, ultimately, the forecasts that will be made.

3. Exploratory data analysis (EDA)

The use of simple tools, like graphing, to better understand the data. Review plots, then summarize and note obvious temporal structures such as trends and seasonality, anomalies such as missing data, corruption, and outliers, and any other structures that may impact forecasting.

4. Choosing and fitting models

The evaluation of two, three, or a whole suite of models of various types on the problem. Models may be chosen based on the assumptions they make and whether the data conforms to those assumptions. Models are configured and fitted on the historical data.

5. Using and evaluating a forecasting model

The trained model is used to make forecasts. The performance of those forecasts is evaluated and the skill of the model is estimated.
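As a rough illustration of step 3, the sketch below runs a few simple checks on a small synthetic monthly series. The data, the variable names, and the 1.5 × IQR outlier rule are all assumptions made for this example; they are not part of the projects that follow.

```python
import numpy as np
import pandas as pd

# Synthetic monthly series for illustration only (not a real dataset).
index = pd.date_range('2000-01-01', periods=24, freq='MS')
values = np.arange(24, dtype='float64') * 2.0 + 50.0  # steady upward trend
values[5] = np.nan    # simulate a missing observation
values[17] = 400.0    # simulate an obvious outlier
series = pd.Series(values, index=index, name='demo')

# Missing data: count the gaps in the series.
n_missing = int(series.isna().sum())

# Outliers: flag points outside 1.5 * IQR of the quartiles
# (one common rule of thumb, assumed here for simplicity).
q1, q3 = series.quantile(0.25), series.quantile(0.75)
iqr = q3 - q1
outliers = series[(series < q1 - 1.5 * iqr) | (series > q3 + 1.5 * iqr)]

# Trend: compare the mean level of each year.
yearly_mean = series.groupby(series.index.year).mean()

print(f'Missing values: {n_missing}')
print(f'Outliers:\n{outliers}')
print(f'Yearly means:\n{yearly_mean}')
```

Checks like these complement plots rather than replace them. Note that the outlier also inflates the second yearly mean, which is one reason to look at anomalies before reading too much into a trend.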

# Setup

In [1]:
import logging

FORMAT = '%(levelname)s:%(asctime)s:%(message)s'
logging.basicConfig(format=FORMAT, level=logging.INFO, force=True)
logging.info('TEST')

INFO:2022-12-17 15:06:10,129:TEST


# Project 1: Monthly armed robberies in Boston

## Problem description

The problem is to predict the number of armed robberies that happen in Boston every month. We are going to use a public dataset curated by McCleary and Hay. The dataset has 118 observations and it describes the number of monthly armed robberies in Boston from January 1966 to October 1975.


## Test harness

- Defining a validation dataset

- Developing a method for model evaluation

### Creating the validation dataset

In [3]:
import pandas as pd

# `squeeze=True` was removed from read_csv in pandas 2.0;
# .squeeze('columns') turns the single-column frame into a Series
ds = pd.read_csv('https://raw.githubusercontent.com/jbrownlee/Datasets/master/monthly-robberies.csv',
                 header=0,
                 index_col=0,
                 parse_dates=True).squeeze('columns')
# hold out the last twelve months as the validation set
split_point = len(ds) - 12
train_set, test_set = ds[:split_point], ds[split_point:]
logging.info(f'Training set: {len(train_set)}, Validation set: {len(test_set)}')

INFO:2022-12-17 15:15:52,601:Training set: 106, Validation set: 12


### Model evaluation

Model evaluation involves two components:

- Performance measure
- Test strategy

**Performance measure**

We are going to use the root mean squared error (RMSE). Because the errors are squared before they are averaged, RMSE gives more weight to predictions that are grossly wrong. It also has the same units as the original data.
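To make the weighting concrete, here is a minimal sketch that compares RMSE with the mean absolute error (MAE) on two made-up sets of predictions. The helper names `rmse` and `mae` and all the numbers are assumptions for this example, not values from the robberies dataset.

```python
from math import sqrt


def rmse(actual, predicted):
    """Root mean squared error of two equal-length sequences."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return sqrt(sum(squared_errors) / len(squared_errors))


def mae(actual, predicted):
    """Mean absolute error of two equal-length sequences."""
    abs_errors = [abs(a - p) for a, p in zip(actual, predicted)]
    return sum(abs_errors) / len(abs_errors)


# Made-up numbers for illustration only.
actual = [100, 100, 100, 100]
evenly_wrong = [90, 110, 90, 110]    # four errors of size 10
one_big_miss = [100, 100, 100, 140]  # a single error of size 40

rmse_even, mae_even = rmse(actual, evenly_wrong), mae(actual, evenly_wrong)
rmse_big, mae_big = rmse(actual, one_big_miss), mae(actual, one_big_miss)

print(f'Evenly wrong: RMSE={rmse_even:.1f}, MAE={mae_even:.1f}')
print(f'One big miss: RMSE={rmse_big:.1f}, MAE={mae_big:.1f}')
```

Both sets of predictions have the same total absolute error, so their MAE is identical, but the RMSE of the set with one large miss is twice as high.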

**Test strategy** 

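Candidate models are evaluated using the walk-forward validation method: the model makes a one-step forecast, the true observation for that step is then added to the history, and the process repeats for every time step in the validation set.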
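The loop behind walk-forward validation can be sketched as a small generic function. The names `walk_forward_validation` and `persistence` and the toy data are assumptions for this example; any function that maps the history to a one-step forecast could be plugged in.

```python
from math import sqrt


def walk_forward_validation(train, test, forecast):
    """Evaluate a one-step forecaster.

    `forecast` is any callable that takes the history (a list of
    past observations) and returns a prediction for the next step.
    Returns the list of predictions and their RMSE.
    """
    history = list(train)
    predictions = []
    for obs in test:
        predictions.append(forecast(history))
        # the true value is revealed only after the prediction is made
        history.append(obs)
    squared_errors = [(o - p) ** 2 for o, p in zip(test, predictions)]
    return predictions, sqrt(sum(squared_errors) / len(squared_errors))


def persistence(history):
    """Naive forecast: repeat the most recent observation."""
    return history[-1]


# Toy data for illustration only.
train = [10, 12, 13]
test = [15, 14, 18]
preds, score = walk_forward_validation(train, test, persistence)
print(f'Predictions: {preds}, RMSE: {score:.3f}')
```

Passing the forecaster in as a function keeps the evaluation loop identical for every candidate model, so their scores are directly comparable.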

## Persistence model

We are going to use the observation from the previous time step as the prediction for the observation at the next time step.

In [4]:
from math import sqrt

from sklearn.metrics import mean_squared_error

# split the training data in half: the first 50% seeds the history,
# the remaining 50% is used for walk-forward validation
# (note: this rebinds train_set to the first half only)
train_size = int(len(train_set) * 0.5)
train_set, val_set = train_set[:train_size], train_set[train_size:]

# walk-forward validation
history = list(train_set)
predictions = []
for i in range(len(val_set)):
    # persistence forecast: repeat the most recent observation
    y_hat = history[-1]
    predictions.append(y_hat)

    # reveal the true value and add it to the history;
    # .iloc gives positional access on the date-indexed Series
    obs = val_set.iloc[i]
    history.append(obs)
    logging.info(f'Predicted: {y_hat}, Expected: {obs}')

# report performance
rmse = sqrt(mean_squared_error(val_set, predictions))
logging.info(f'RMSE: {rmse:.3f}')

INFO:2022-12-17 15:29:39,339:Predicted: 98, Expected: 125
INFO:2022-12-17 15:29:39,342:Predicted: 125, Expected: 155
INFO:2022-12-17 15:29:39,344:Predicted: 155, Expected: 190
INFO:2022-12-17 15:29:39,346:Predicted: 190, Expected: 236
INFO:2022-12-17 15:29:39,350:Predicted: 236, Expected: 189
INFO:2022-12-17 15:29:39,352:Predicted: 189, Expected: 174
INFO:2022-12-17 15:29:39,355:Predicted: 174, Expected: 178
INFO:2022-12-17 15:29:39,357:Predicted: 178, Expected: 136
INFO:2022-12-17 15:29:39,358:Predicted: 136, Expected: 161
INFO:2022-12-17 15:29:39,360:Predicted: 161, Expected: 171
INFO:2022-12-17 15:29:39,361:Predicted: 171, Expected: 149
INFO:2022-12-17 15:29:39,363:Predicted: 149, Expected: 184
INFO:2022-12-17 15:29:39,364:Predicted: 184, Expected: 155
INFO:2022-12-17 15:29:39,366:Predicted: 155, Expected: 276
INFO:2022-12-17 15:29:39,367:Predicted: 276, Expected: 224
INFO:2022-12-17 15:29:39,369:Predicted: 224, Expected: 213
INFO:2022-12-17 15:29:39,370:Predicted: 213, Expected: 27

## Data analysis

### Summary statistics

In [5]:
logging.info(train_set.describe())

INFO:2022-12-17 15:42:35,090:count     53.000000
mean      79.056604
std       35.344685
min       29.000000
25%       49.000000
50%       74.000000
75%      108.000000
max      158.000000
Name: Robberies, dtype: float64
