# Prepare data with SageMaker Processing

DeepAR is a supervised learning algorithm for forecasting scalar time series. This notebook demonstrates how to prepare a dataset of time series for training DeepAR and how to use the trained model for inference.

<div style="text-align:center">
    <img src="../media/manual.png" width="800"/>
</div>

## Setup environment

In [None]:
!pip install -q sagemaker==2.16.1

In [None]:
import sagemaker
import matplotlib.pyplot as plt
from sagemaker.sklearn.processing import SKLearnProcessor
from sagemaker.processing import ProcessingInput, ProcessingOutput

sagemaker_session = sagemaker.Session()
role = sagemaker.get_execution_role() # we are using the notebook instance role for training in this example
bucket = sagemaker_session.default_bucket() # you can specify a bucket name here

## Generate and explore data

In this example we want to train a model that can predict the next 48 points of synthetically generated time series. The time series have hourly granularity.

In [None]:
import json
import numpy as np
import pandas as pd
np.random.seed(1)

freq = 'H'
prediction_length = 48

We also need to configure the context_length, which determines how many previous points the model looks at when making a prediction. A typical starting value is about the same size as the prediction_length. In this example we use a longer context_length of 72. Note that, in addition to the context_length, the model also takes into account the values of the time series at typical seasonal lags: for hourly data, for example, it looks at the value of the series 24 hours ago, one week ago, one month ago, and so on. It is therefore not necessary to make the context_length span an entire month if you expect monthly seasonality in your hourly data.

In [None]:
context_length = 72
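
To make the two window lengths concrete, the following sketch slices a toy hourly series into the context window the model conditions on and the window it is asked to forecast. The series values and the names `toy_series`, `context_window`, `target_window`, and `lag_24h_value` are illustrative assumptions. The 24-hour lag lookup only mimics the idea of a seasonal lag; it is not DeepAR's internal implementation.

```python
import numpy as np

prediction_length = 48
context_length = 72

# Toy hourly series: 400 points with a 24-hour period (illustrative values only).
toy_series = np.sin(np.arange(400) * 2 * np.pi / 24)

# The last prediction_length points are the ones we want to forecast.
forecast_start = len(toy_series) - prediction_length

# Context window: the context_length points right before the forecast start.
context_window = toy_series[forecast_start - context_length:forecast_start]
# Target window: the prediction_length points from the forecast start onwards.
target_window = toy_series[forecast_start:]

# A seasonal lag: the value observed 24 hours before the first forecast point.
lag_24h_value = toy_series[forecast_start - 24]

print(len(context_window), len(target_window))
```

Because the toy series repeats every 24 hours, the lagged value is already a good guess for the first forecast point, which is why a context window much shorter than the longest seasonality can still be sufficient.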

For this notebook, we generate 200 noisy time series, each consisting of 400 data points with a seasonality of 24 hours. In this dummy example, all time series start at the same time point t0. When preparing your own data, it is important to use the correct start point for each time series, because the model uses the timestamp as a frame of reference, which enables it to learn, for example, that weekdays behave differently from weekends. Each time series is a noisy sine wave with a random level.

In [None]:
t0 = '2016-01-01 00:00:00'
data_length = 400
num_ts = 200
period = 24

time_series = []
for k in range(num_ts):
    level = 10 * np.random.rand()
    seas_amplitude = (0.1 + 0.3*np.random.rand()) * level
    sig = 0.05 * level # noise parameter (constant in time)
    time_ticks = np.arange(data_length)
    source = level + seas_amplitude*np.sin(time_ticks*(2*np.pi)/period)
    noise = sig*np.random.randn(data_length)
    data = source + noise
    index = pd.date_range(start=t0, freq=freq, periods=data_length)
    time_series.append(pd.Series(data=data, index=index))

time_series[0].plot()
plt.show()

Often one is interested in tuning or evaluating the model by looking at error metrics on a hold-out set. For other machine learning tasks such as classification, one typically does this by randomly separating examples into train/test sets. For forecasting, it is important to do this train/test split in time rather than by series.

In this example, we leave out the last section of each of the time series we just generated and use only the first part as training data. Because we want to predict 48 data points, we remove the trailing 48 points from each time series to define the training set. The test set contains the full range of each time series.

In [None]:
time_series_training = []
for ts in time_series:
    time_series_training.append(ts[:-prediction_length])

time_series[0].plot(label='test')
time_series_training[0].plot(label='train', ls=':')
plt.legend()
plt.show()
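
A quick sanity check can confirm that the split was made in time: each training series should be the corresponding full series minus its last prediction_length points. The sketch below builds a small stand-in series so that it runs on its own; the helper name `split_in_time` and the variables `full` and `train` are hypothetical.

```python
import numpy as np
import pandas as pd

prediction_length = 48


def split_in_time(series, horizon):
    """Return the training part: everything except the last `horizon` points."""
    return series.iloc[:-horizon]


# Hourly index; a Timedelta is used as the frequency so the call does not
# depend on which hourly alias the installed pandas version accepts.
index = pd.date_range(start="2016-01-01 00:00:00", periods=400,
                      freq=pd.Timedelta(hours=1))
full = pd.Series(np.arange(400, dtype=float), index=index)
train = split_in_time(full, prediction_length)

# The training series starts at the same timestamp and ends before the hold-out window.
assert len(train) == len(full) - prediction_length
assert train.index[0] == full.index[0]
assert train.index[-1] < full.index[-prediction_length]
```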

In [None]:
print(time_series_training[0])
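
DeepAR reads its training data as JSON Lines: one JSON object per time series, with a `start` timestamp and a `target` array. A minimal sketch of that conversion follows. The helper names `series_to_obj` and `series_to_jsonline` are illustrative, and the three-point `example` series is a stand-in for an entry of `time_series_training`.

```python
import json

import pandas as pd


def series_to_obj(ts):
    """Convert a pandas Series with a DatetimeIndex into a DeepAR-style dict."""
    return {"start": str(ts.index[0]), "target": [float(x) for x in ts]}


def series_to_jsonline(ts):
    """Serialize one series as a single JSON line."""
    return json.dumps(series_to_obj(ts))


example = pd.Series(
    [1.0, 2.5, 3.0],
    index=pd.date_range("2016-01-01 00:00:00", periods=3,
                        freq=pd.Timedelta(hours=1)),
)
print(series_to_jsonline(example))
```

Writing one such line per series to a file gives the layout that a training channel would consume.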

## Launch data processing job

<div style="text-align:center">
    <img src="../media/processing.png" width="700"/>
</div>

In [None]:
sklearn_processor = SKLearnProcessor(
    role=role,
    instance_count=1,
    instance_type="ml.m5.xlarge",
    framework_version="0.20.0",  # scikit-learn version of the processing container
    volume_size_in_gb=30,
    max_runtime_in_seconds=1200,  # stop the job after 20 minutes at most
    base_job_name='data-processing',
)

In [None]:
output_folder = '/opt/ml/processing/output'

sklearn_processor.run(
    code="prepare_data.py",
    arguments= [
        f'--output={output_folder}'
    ],
    outputs= [
        ProcessingOutput(
            output_name='preprocessed',
            source=output_folder,
            destination=f's3://{bucket}/data-processing/output'  # must be an S3 URI; the prefix is an example
        )
    ]
)
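
The contents of `prepare_data.py` are not shown in this notebook. As a rough sketch of what such a script could look like, the example below parses the `--output` argument passed by `sklearn_processor.run`, regenerates a few noisy sine waves, and writes train and test files as JSON Lines. Everything in it is an assumption made for illustration: the function names, the file names `train.json` and `test.json`, and the reduced number of series. It ends with a local dry run against a temporary folder rather than the container path.

```python
import argparse
import json
import os
import tempfile

import numpy as np


def generate_series(num_ts=5, data_length=400, period=24, seed=1):
    """Generate noisy sine waves as plain lists (hypothetical stand-in for the data step)."""
    rng = np.random.RandomState(seed)
    ticks = np.arange(data_length)
    series = []
    for _ in range(num_ts):
        level = 10 * rng.rand()
        amplitude = (0.1 + 0.3 * rng.rand()) * level
        noise = 0.05 * level * rng.randn(data_length)
        values = level + amplitude * np.sin(ticks * 2 * np.pi / period) + noise
        series.append(values.tolist())
    return series


def write_json_lines(path, series, start="2016-01-01 00:00:00"):
    """Write one JSON object per series, each on its own line."""
    with open(path, "w") as f:
        for target in series:
            f.write(json.dumps({"start": start, "target": target}) + "\n")


def main(argv=None):
    parser = argparse.ArgumentParser()
    parser.add_argument("--output", default="/opt/ml/processing/output")
    args = parser.parse_args(argv)

    prediction_length = 48
    series = generate_series()
    os.makedirs(args.output, exist_ok=True)
    # Training data drops the trailing prediction window; test data keeps it.
    write_json_lines(os.path.join(args.output, "train.json"),
                     [s[:-prediction_length] for s in series])
    write_json_lines(os.path.join(args.output, "test.json"), series)
    return args.output


# Local dry run: write to a temporary folder instead of the container path.
demo_dir = main(["--output", tempfile.mkdtemp()])
print(sorted(os.listdir(demo_dir)))
```

Inside the processing container, the script would instead call `main()` with no arguments so that `argparse` reads the `--output` value from the command line. Files written under that folder are uploaded to the `ProcessingOutput` destination when the job finishes.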