# Tutorial: Distributed Dataset

This notebook gives a simple introduction to forecasting with the Deep4Cast package. The time series data is taken from the [M4 dataset](https://github.com/M4Competition/M4-methods/tree/master/Dataset), specifically its ``Daily`` subset.

Since most of the content is duplicated from the M4 Daily notebook, we focus here only on how to use the distributed dataset features.

In [1]:
import numpy as np
import os
import pandas as pd

import torch
from torch.utils.data import DataLoader

from deep4cast.forecasters import Forecaster
from deep4cast.models import WaveNet
from deep4cast.datasets import TimeSeriesDataset
import deep4cast.transforms as transforms
import deep4cast.metrics as metrics

# Make RNG predictable
np.random.seed(0)
torch.manual_seed(0)
# Use a gpu if available, otherwise use cpu
device = ('cuda' if torch.cuda.is_available() else 'cpu')

## Dataset
In this section we prepare the dataset, write it into parquet files, and prepare it for easy consumption with PyTorch-based data loaders. Model construction and training will be done in the next section.

In [2]:
if not os.path.exists('data/Daily-train.csv'):
    !wget https://raw.githubusercontent.com/M4Competition/M4-methods/master/Dataset/Train/Daily-train.csv -P data/
if not os.path.exists('data/Daily-test.csv'):
    !wget https://raw.githubusercontent.com/M4Competition/M4-methods/master/Dataset/Test/Daily-test.csv -P data/

In [3]:
df = pd.read_csv('data/Daily-train.csv')
df.head()

,V1,V2,V3,V4,V5,V6,V7,V8,V9,V10,...,V9911,V9912,V9913,V9914,V9915,V9916,V9917,V9918,V9919,V9920
0,D1,1017.1,1019.3,1017.0,1019.2,1018.7,1015.6,1018.5,1018.3,1018.4,...,,,,,,,,,,
1,D2,2793.7,2793.8,2803.7,2805.8,2802.3,2795.0,2806.4,2782.2,2780.3,...,,,,,,,,,,
2,D3,1091.3,1088.5,1085.7,1082.9,1080.1,1077.3,1074.5,1071.7,1068.9,...,,,,,,,,,,
3,D4,1092.0,1078.0,1064.0,1050.0,1036.0,1022.0,1008.0,1092.0,1078.0,...,,,,,,,,,,
4,D5,2938.63,2956.44,2964.41,2972.41,3014.97,3014.23,3024.08,3031.97,3062.7,...,,,,,,,,,,


We transform the data from wide to long format to facilitate partitioning the parquet files on the time series id.

In [4]:
df = df.melt(id_vars='V1')
df = df[df.value.notnull()]
df = df.reset_index(drop=True)
df = df.drop('variable', axis=1)
df.head()

,V1,value
0,D1,1017.1
1,D2,2793.7
2,D3,1091.3
3,D4,1092.0
4,D5,2938.63
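
To make the reshaping concrete, here is a small sketch of the same wide-to-long idea in plain Python, without pandas. The toy table and variable names are made up for illustration. Note that `melt` orders its output column by column rather than series by series as done here, but in both cases grouping by id recovers each series in time order.

```python
import math

# Toy wide table: one row per series, padded with NaN to a common width.
nan = math.nan
wide = {
    "D1": [1.0, 2.0, 3.0],
    "D2": [10.0, 20.0, nan],
}

# Long format: one (series id, value) row per observed time step.
long_rows = [
    (series_id, value)
    for series_id, values in wide.items()
    for value in values
    if not math.isnan(value)
]

# Grouping by id recovers each series with its time order intact.
by_id = {}
for series_id, value in long_rows:
    by_id.setdefault(series_id, []).append(value)
```

Because every row carries its own series id, the long table can be split into one partition per id, which is what the next step relies on.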


We create the parquet files by partitioning on the time series id. This creates one directory per id, each containing parquet files that hold the entirety of that single time series.

In [5]:
df.to_parquet(
    'data/m4/daily/',
    engine='fastparquet',
    partition_cols=['V1'],
    compression=None)

### Data handling

We use the DataLoader object from PyTorch to build batches from the data set.

However, we first need to specify how much history to use in creating a forecast of a given length:
- horizon = time steps to forecast
- lookback = time steps leading up to the period to be forecast
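
To make the roles of these two parameters concrete, here is a minimal sketch of how a single series can be cut into (history, target) pairs with a sliding window. It is plain Python and only illustrates the general technique: `make_windows` is a hypothetical helper, not part of Deep4Cast, and the library's own slicing may differ in detail.

```python
def make_windows(series, lookback, horizon, step=1):
    """Cut one series into (history, target) pairs with a sliding window.

    Hypothetical helper for illustration; not Deep4Cast's implementation.
    """
    window = lookback + horizon
    pairs = []
    # The last valid start index still leaves room for a full window.
    for start in range(0, len(series) - window + 1, step):
        history = series[start:start + lookback]
        target = series[start + lookback:start + window]
        pairs.append((history, target))
    return pairs


series = list(range(10))  # toy series: 0, 1, ..., 9
pairs = make_windows(series, lookback=4, horizon=2)
first_history, first_target = pairs[0]
```

Under this scheme, a series of length 1006 with a lookback of 128 and a horizon of 14 would yield 1006 - 142 + 1 = 865 windows.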

In [6]:
horizon = 14
lookback = 128

transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.LogTransform(targets=[0], offset=1.0),
    transforms.RemoveLast(targets=[0]),
    transforms.Target(targets=[0]),
])

In [7]:
dfg = df.groupby('V1').count()
dfg.to_csv('data/m4/daily/_metadata_partition.csv', header=None)
dfg.head()

V1,value
D1,1006
D10,674
D100,1006
D1000,1052
D1001,1052


`TimeSeriesDataset` inherits from [Torch Datasets](https://pytorch.org/docs/stable/data.html#torch.utils.data.Dataset) for use with [Torch DataLoader](https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader). It handles the creation of the examples used to train the network using `lookback` and `horizon` to partition the time series.

Instead of providing an array of ``numpy`` time series, we here provide a path to the partitioned parquet files as well as a list of file locations containing metadata on the time series ids. The metadata file has the partition name (first column) and the length of the time series (second column). These lengths are used to calculate the number of examples in each time series.
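
One way to see why the series lengths are sufficient: given only (name, length) pairs, a dataset can report its total size and map a global example index to a specific series and offset without opening any parquet file. The sketch below illustrates that idea in plain Python. The window count `length - lookback - horizon + 1` and the helper names are assumptions made for illustration; Deep4Cast's actual bookkeeping (including `step`, `thinning`, and the split) may differ.

```python
import bisect
import csv
import io
from itertools import accumulate

# Toy stand-in for _metadata_partition.csv: partition name, series length.
metadata_csv = "D1,1006\nD10,674\nD100,100\n"
lookback, horizon = 128, 14


def examples_per_series(length, lookback, horizon):
    """Number of full windows in a series (assumed formula, step of 1)."""
    return max(0, length - lookback - horizon + 1)


rows = [(name, int(length)) for name, length in csv.reader(io.StringIO(metadata_csv))]
counts = [examples_per_series(length, lookback, horizon) for _, length in rows]
# Running totals let us find the series that owns a global index.
cumulative = list(accumulate(counts))


def locate(index):
    """Map a global example index to (series name, window start)."""
    series_pos = bisect.bisect_right(cumulative, index)
    previous_total = cumulative[series_pos - 1] if series_pos > 0 else 0
    return rows[series_pos][0], index - previous_total
```

In this toy metadata, `D100` is shorter than one full window and therefore contributes no examples, while `locate` lets a loader open only the partition directory that a requested index belongs to.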

Finally, since the entire time series is stored in the parquet file, we perform a train-test split by setting ``split='train'``, which holds out the final ``horizon`` of each time series from training. Conversely, setting ``split='test'`` provides only the final ``lookback`` and ``horizon`` of each series.
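
The split semantics described here can be sketched on a toy series as follows. `split_series` is a hypothetical helper that mirrors the description in the text; it is not the library's code.

```python
def split_series(series, lookback, horizon, split):
    """Return the part of a series that each split exposes.

    Illustrative only; mirrors the description in the text, not the
    library's code.
    """
    if split == "train":
        # Hold out the final `horizon` steps so they are never trained on.
        return series[:-horizon]
    if split == "test":
        # Keep only the last lookback + horizon steps: one forecast window.
        return series[-(lookback + horizon):]
    raise ValueError(f"unknown split: {split!r}")


series = list(range(20))
train_part = split_series(series, lookback=5, horizon=3, split="train")
test_part = split_series(series, lookback=5, horizon=3, split="test")
```

In this toy case the test window's history (values 12 to 16) overlaps the training data, but its target (values 17 to 19) is exactly the part that training never sees.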

In [8]:
data_train = TimeSeriesDataset(
    path_parquet='data/m4/daily/',
    path_metadata=['data/m4/daily/_metadata_partition.csv'],
    lookback=lookback, 
    horizon=horizon,
    step=1,
    transform=transform,
    thinning=0.1,
    split='train'
)
dataloader_train = DataLoader(
    data_train, 
    batch_size=512, 
    shuffle=True, 
    pin_memory=True,
    num_workers=8
)

## Modeling and Forecasting

In [9]:
model = WaveNet(input_channels=1,
                output_channels=1,
                horizon=horizon, 
                n_layers=7)

if torch.cuda.device_count() > 1:
    model = torch.nn.DataParallel(model)

optim = torch.optim.Adam(model.parameters(), lr=0.001)

loss = torch.distributions.StudentT

In [10]:
forecaster = Forecaster(model, loss, optim, n_epochs=1, device=device)
forecaster.fit(dataloader_train, eval_model=False)

## Evaluation

We need to prepend the training history to the test data so that each series has a ``lookback`` window from which to forecast, and so that the forecasts can be compared to actuals.

In [11]:
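# Load the raw wide tables and drop the id column.
# Separate names avoid overwriting the training dataset `data_train`.
train_values = pd.read_csv('data/Daily-train.csv').iloc[:, 1:].values
test_values = pd.read_csv('data/Daily-test.csv').iloc[:, 1:].values

data_arr = []
for ts_train, ts_test in zip(train_values, test_values):
    # Strip the NaN padding from the training series.
    ts_a = ts_train[~np.isnan(ts_train)]
    ts_b = ts_test
    # Join history and test period, then add a leading axis of size 1.
    ts = np.concatenate([ts_a, ts_b])[None, :]
    data_arr.append(ts)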

Here we provide a list of ``numpy`` arrays containing the train and test time series. ``TimeSeriesDataset`` creates a test split (``split='test'``) providing the final ``lookback`` and ``horizon`` of each time series so that the ``lookback`` can be used to create a forecast.

In [12]:
data_test = TimeSeriesDataset(
    time_series=data_arr,
    lookback=lookback, 
    horizon=horizon, 
    step=1,
    transform=transform,
    split='test'
)
dataloader_test = DataLoader(
    data_test, 
    batch_size=1024, 
    shuffle=False,
    num_workers=8
)

In [13]:
y_test = []
for example in dataloader_test:
    example = dataloader_test.dataset.transform.untransform(example)
    y_test.append(example['y'])
y_test = np.concatenate(y_test)

y_samples = forecaster.predict(dataloader_test, n_samples=100)

In [14]:
test_smape = metrics.smape(y_samples, y_test)

print('SMAPE: {}%'.format(test_smape.mean()))

SMAPE: 3.007478952407837%
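
For reference, the symmetric mean absolute percentage error as defined for the M4 competition is $\frac{200}{H}\sum_{t=1}^{H}\frac{|y_t - \hat{y}_t|}{|y_t| + |\hat{y}_t|}$ for a horizon of $H$ steps. The sketch below computes it in plain Python for one series. Collapsing the forecast samples to a point forecast with the per-step median is an assumption made for illustration, and `metrics.smape` may treat `y_samples` differently.

```python
from statistics import median


def smape(actual, forecast):
    """Symmetric MAPE in percent for one series (M4 definition).

    Undefined when an actual and its forecast are both zero.
    """
    terms = [
        abs(a - f) / (abs(a) + abs(f))
        for a, f in zip(actual, forecast)
    ]
    return 200.0 * sum(terms) / len(terms)


def point_forecast(samples):
    """Collapse forecast samples to one value per step via the median.

    Assumption for illustration: `samples` is a list of sample paths.
    """
    return [median(step_values) for step_values in zip(*samples)]


actual = [100.0, 90.0]
samples = [[90.0, 110.0], [100.0, 120.0], [110.0, 100.0]]
forecast = point_forecast(samples)  # per-step medians
score = smape(actual, forecast)
```

The metric is bounded between 0% and 200%, so the value reported above corresponds to a small average relative error per forecast step.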
