# Bike-Share Demand Forecasting 2a: Modelling with [Amazon Forecast](https://aws.amazon.com/forecast/)

We'll look at 3 ways to tackle the bike-share demand forecasting problem set up previously in the data preparation notebook:

1. Applying an AWS "Managed AI" service ([Amazon Forecast](https://aws.amazon.com/forecast/)), to tackle the scenario as a common/commodity business problem
2. Using a SageMaker built-in algorithm ([DeepAR](https://docs.aws.amazon.com/sagemaker/latest/dg/deepar.html)), to approach it as a common/commodity algorithm in our own data science workbench
3. Using a custom SageMaker algorithm, to take on the core modelling as a value-added differentiator working in our data science workbench.

These approaches represent different cost/control trade-offs that we might make as a business.

**This notebook shows how to apply the Amazon Forecast service *via the AWS console*, although the same actions can all be performed via API instead.**

<img src="BlogImages/amazon_forecast.png">

## Dependencies and configuration

As usual, we start by loading libraries, defining configuration, and connecting to AWS SDKs:

In [None]:
# Basic data configuration is initialised and stored in the Data Preparation notebook
# ...We just retrieve it here:
%store -r
assert bucket, "Variable `bucket` missing from IPython store"

assert data_prefix, "Variable `data_prefix` missing from IPython store"
assert target_train_filename, "Variable `target_train_filename` missing from IPython store"
assert target_test_filename, "Variable `target_test_filename` missing from IPython store"
assert related_filename, "Variable `related_filename` missing from IPython store"

In [None]:
%load_ext autoreload
%autoreload 1

# Built-Ins:
from datetime import datetime, timedelta

# External Dependencies:
import boto3
from IPython.display import display, HTML
import pandas as pd

# Local Dependencies:
%aimport util

Now we connect to our AWS SDKs, and initialise our access role (which may wait a little while to ensure any newly created permissions propagate):

<div class="alert alert-block alert-warning">
    If you haven't already, you'll need to grant this notebook access to Amazon Forecast.
    The simplest way to do this is to click on the "IAM Role ARN" hyperlink in the details page for this Notebook Instance on the SageMaker Console.<br/>
    You can "Attach Policies" and add "AmazonForecastFullAccess", as visible below:
    <img src="BlogImages/ForecastAccessPermissions.png"/>
</div>

In [None]:
session = boto3.Session() 
region = session.region_name
forecast = session.client(service_name="forecast") 
forecast_query = session.client(service_name="forecastquery")
s3 = session.client(service_name="s3")

## Overview

The overall workflow of Amazon Forecast is a typical batch ML model training approach, as summarized below.

Although the `forecast` SDK initialised above supports doing all these steps programmatically, **we'll be using the AWS Console approach** to show you around.

<img src="BlogImages/outline.png">

<img src="BlogImages/forecast_workflow.png">

## Step 1: Selecting the Amazon Forecast domain<a class="anchor" id="prepare"/>

Amazon Forecast defines a set of **domains** (documented [here](https://docs.aws.amazon.com/forecast/latest/dg/howitworks-domains-ds-types.html)), for common forecasting use cases.

The domain provides a **base data schema** and featurizations/model architectures tailored towards that particular use case. We can add custom data fields as well (and we will)... but in general the more advantage we can take of the structure in the out-of-the-box domain model, the better model performance we'll see.

<img src="BlogImages/AmazonForecastDomains.png"/>

In this example we'll use the [`RETAIL`](https://docs.aws.amazon.com/forecast/latest/dg/retail-domain.html) domain, but it could also be argued that [`METRICS`](https://docs.aws.amazon.com/forecast/latest/dg/metrics-domain.html) or even some others might be just as good a fit! If you have time, feel free to experiment with other domains and see if performance can be improved.

## Step 2: Preparing the data

The [domain documentation](https://docs.aws.amazon.com/forecast/latest/dg/retail-domain.html) tells us what mandatory fields we need to provide, so we'll tweak our data slightly and re-upload to S3.

In [None]:
target_train_df = pd.read_csv(f"./data/{target_train_filename}")
target_test_df = pd.read_csv(f"./data/{target_test_filename}")
related_df = pd.read_csv(f"./data/{related_filename}")

Target timeseries must specify `timestamp`, `item_id`, `demand`, and preferably no other fields.

Our canonical data is already really close to this, so we'll just rename the `customer_type` field to the more Forecast-y `item_id`:

In [None]:
target_train_df.rename(columns={ "customer_type": "item_id" }, inplace=True)
target_test_df.rename(columns={ "customer_type": "item_id" }, inplace=True)
target_train_df.head()

Related timeseries in this domain:

1. Must specify `timestamp` (which we already have)
2. Must specify `item_id` (which we don't currently have, as the weather is not specific to `customer_type`)
3. Suggest a number of optional domain fields, but none map very closely to our data set

...and all data sets in general:

4. Must not use any of the documented [reserved field names](https://docs.aws.amazon.com/forecast/latest/dg/reserved-field-names.html) (including `temp`)
5. Can consist of fields with types `string`, `integer`, `float`, or `timestamp` as specified in the user [schema](https://docs.aws.amazon.com/forecast/latest/dg/API_SchemaAttribute.html)

We'll ignore the lack of support for boolean fields, since loading the data as strings will have equivalent results.

Therefore we'll prepare the data by:

* Duplicating our related timeseries data for all item_ids (per point 2.)
* Renaming the `temp` column to `temperature` (per point 4.)

In [None]:
# Duplicate data for each item_id in the target dataframe:
related_peritem_dfs = []
item_ids = target_train_df["item_id"].unique()
for item_id in item_ids:
    df = related_df.copy()
    df["item_id"] = item_id
    related_peritem_dfs.append(df)

related_df = pd.concat(related_peritem_dfs).sort_values(["timestamp", "item_id"]).reset_index(drop=True)

# Rename any reserved columns to keep Forecast happy:
related_df.rename(columns={ "temp": "temperature" }, inplace=True)
related_df.head()

...Now store the data in S3, ready to import into Amazon Forecast:

In [None]:
print("Writing dataframes to file...")
!mkdir -p ./data/amzforecast
target_train_df.to_csv(
    f"./data/amzforecast/{target_train_filename}",
    index=False
)
target_test_df.to_csv(
    f"./data/amzforecast/{target_test_filename}",
    index=False
)
related_df.to_csv(
    f"./data/amzforecast/{related_filename}",
    index=False
)

print("Uploading dataframes to S3...")
s3.upload_file(
    Filename=f"./data/amzforecast/{target_train_filename}",
    Bucket=bucket,
    Key=f"{data_prefix}amzforecast/{target_train_filename}"
)
print(f"s3://{bucket}/{data_prefix}amzforecast/{target_train_filename}")
s3.upload_file(
    Filename=f"./data/amzforecast/{target_test_filename}",
    Bucket=bucket,
    Key=f"{data_prefix}amzforecast/{target_test_filename}"
)
print(f"s3://{bucket}/{data_prefix}amzforecast/{target_test_filename}")
s3.upload_file(
    Filename=f"./data/amzforecast/{related_filename}",
    Bucket=bucket,
    Key=f"{data_prefix}amzforecast/{related_filename}"
)
print(f"s3://{bucket}/{data_prefix}amzforecast/{related_filename}")
print("Done")

## Step 3: Create a Dataset Group

Open up the Amazon Forecast console (in the same `region` that we selected earlier!). You might see the landing page below, or a different dashboard if you've used the service before.

Click "Create Dataset Group" either from the landing page or from the "Dataset Groups" tab of the expandable left-side menu.

<img src="BlogImages/AmazonForecastDashboard.png"/>

Name your dataset group **`bikeshare_dataset_group`** and select the **`Retail`** domain as discussed above.

Click Next to continue
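
If you'd rather script this step, the same action maps onto the `CreateDatasetGroup` API. The following is a minimal sketch, assuming a boto3 `forecast` client like the one initialised earlier in this notebook; the helper names `dataset_group_request` and `create_dataset_group` are our own, not part of the SDK:

```python
def dataset_group_request(name="bikeshare_dataset_group", domain="RETAIL"):
    """Build the keyword arguments for the CreateDatasetGroup API call."""
    return {"DatasetGroupName": name, "Domain": domain}


def create_dataset_group(forecast_client, **overrides):
    """Create the dataset group and return its ARN.

    `forecast_client` is assumed to be a boto3 "forecast" client, such as the
    `forecast` variable initialised earlier in this notebook.
    """
    request = dataset_group_request(**overrides)
    response = forecast_client.create_dataset_group(**request)
    return response["DatasetGroupArn"]


# Inspect the request without calling AWS:
print(dataset_group_request())
```

Calling `create_dataset_group(forecast)` would then return the new dataset group's ARN, which later API calls need.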

## Step 4: Create a Target Dataset

Next you'll be prompted to create the target dataset with a form like the one below (if not, you can choose to create a target dataset from the dashboard).

<img src="BlogImages/CreateDataset.png"/>

First, let's review our dataframe's structure:

In [None]:
target_train_df.head()

You'll need to:

* Give the dataset a name: **`bikeshare_target_dataset`**
* Adjust the granularity to **`hourly`**, matching our data
* **Re-order the columns in the data schema** to match the dataframe above

When you've made the changes, go ahead and click "Next"
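
For reference, the form above corresponds to the `CreateDataset` API. This sketch builds the request with the schema attributes in the same order as the CSV columns. The column order shown is an assumption for illustration only (use `list(target_train_df.columns)` for the real one), and the helper names are our own:

```python
# Attribute types for the mandatory RETAIL-domain target time series fields.
TARGET_FIELD_TYPES = {
    "timestamp": "timestamp",
    "item_id": "string",
    "demand": "float",
}


def build_schema(columns, field_types):
    """Build a Forecast Schema object, preserving the CSV column order."""
    return {
        "Attributes": [
            {"AttributeName": col, "AttributeType": field_types[col]}
            for col in columns
        ]
    }


def target_dataset_request(columns, domain="RETAIL"):
    """Build keyword arguments for CreateDataset (hourly target time series)."""
    return {
        "DatasetName": "bikeshare_target_dataset",
        "Domain": domain,
        "DatasetType": "TARGET_TIME_SERIES",
        "DataFrequency": "H",  # "H" means hourly
        "Schema": build_schema(columns, TARGET_FIELD_TYPES),
    }


# ASSUMED column order, for illustration only:
example_columns = ["timestamp", "item_id", "demand"]
request = target_dataset_request(example_columns)
print(request["Schema"])
# With a boto3 client: forecast.create_dataset(**request)["DatasetArn"]
```

Deriving the schema from the dataframe's columns avoids the most common import failure, which is a mismatch between the schema order and the file's column order.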

## Step 5: Import target timeseries data

Next you'll be prompted to create a *dataset import job* (if not, you can choose to do so from the dashboard).

* Name the import job **`bikeshare_target_import`**
* Check the timestamp format matches our dataframe
* Select "Create a new role", and grant it access to either all buckets or just the one created for this exercise
* Provide the **target training** file S3 URL (hint: We printed it out near the end of step 2)

<img src="BlogImages/ImportTargetTimeseries.png"/>

When you click "Start Import", you'll be taken back to the Forecast dashboard page.

Note that:

* Dataset imports can take several minutes to complete, because Amazon Forecast spins up resources to handle the task in a scalable way and performs validation of the data set.
* You don't need to wait for the target data import to complete to start the related data import (next step)
* It's possible to train a "predictor" (a forecast model) as soon as the target data is imported, but we can achieve better accuracy by waiting for the related data to be imported as well and using it in the model.
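
The import settings above map onto the `CreateDatasetImportJob` API, sketched below. Note the assumptions: the ARNs and S3 path are hypothetical placeholders, the timestamp format is inferred from how this notebook parses its timestamps later on, and the API expects an IAM role that already exists (the "Create a new role" shortcut is a console convenience). The helper name is our own:

```python
def import_job_request(job_name, dataset_arn, s3_uri, role_arn,
                       timestamp_format="yyyy-MM-dd HH:mm:ss"):
    """Build keyword arguments for the CreateDatasetImportJob API call."""
    return {
        "DatasetImportJobName": job_name,
        "DatasetArn": dataset_arn,
        "DataSource": {"S3Config": {"Path": s3_uri, "RoleArn": role_arn}},
        # ASSUMPTION: timestamps in the CSV look like "2012-12-01 00:00:00"
        "TimestampFormat": timestamp_format,
    }


# Hypothetical placeholder values, for illustration only:
request = import_job_request(
    job_name="bikeshare_target_import",
    dataset_arn="arn:aws:forecast:us-east-1:123456789012:dataset/bikeshare_target_dataset",
    s3_uri="s3://example-bucket/example-prefix/amzforecast/target_train.csv",
    role_arn="arn:aws:iam::123456789012:role/ExampleForecastRole",
)
print(request["DataSource"])
# With a boto3 client:
# forecast.create_dataset_import_job(**request)["DatasetImportJobArn"]
```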

## Step 6: Create and import Related Timeseries Dataset

Next, select the option to create/import a related dataset.

Let's review the structure of our related data:

In [None]:
related_df.head()

We'll name the new dataset **`bikeshare_related_dataset`**.

Remember to select **`hourly`** frequency!

This time, we'll have to make a lot more edits to the dataset schema to capture all of our columns.

The API docs give low-level details of what each [SchemaAttribute](https://docs.aws.amazon.com/forecast/latest/dg/API_SchemaAttribute.html) can contain, and the overall [Schema](https://docs.aws.amazon.com/forecast/latest/dg/API_Schema.html) object. For our example, the below should work:

```json
{
    "Attributes": [
        {
            "AttributeName": "timestamp",
            "AttributeType": "timestamp"
        },
        {
            "AttributeName": "season",
            "AttributeType": "float"
        },
        {
            "AttributeName": "holiday",
            "AttributeType": "string"
        },
        {
            "AttributeName": "weekday",
            "AttributeType": "float"
        },
        {
            "AttributeName": "workingday",
            "AttributeType": "string"
        },
        {
            "AttributeName": "weathersit",
            "AttributeType": "float"
        },
        {
            "AttributeName": "temperature",
            "AttributeType": "float"
        },
        {
            "AttributeName": "atemp",
            "AttributeType": "float"
        },
        {
            "AttributeName": "hum",
            "AttributeType": "float"
        },
        {
            "AttributeName": "windspeed",
            "AttributeType": "float"
        },
        {
            "AttributeName": "item_id",
            "AttributeType": "string"
        }
    ]
}
```

Once the dataset is created, we'll create a dataset import job for it:

* Name the import job **`bikeshare_related_import`**
* Check the timestamp format matches our dataframe above
* The IAM role should be pre-populated for you as we created it for the target dataset import
* This time provide the related dataset S3 URL (which we printed out near the end of step 2)

Go ahead and click "Start import" when you're ready, and you should be returned to the dashboard screen while the data is loaded!

## Step 7: While the datasets import...

With the data volumes in our example, import is usually done within a couple of minutes.

In case it's taking longer for you though (especially in group workshops...) - why not make a start on one of the other model fitting notebooks while you wait? e.g. training a model with SageMaker.

Note: Although it should usually update live, sometimes you might need to refresh the page on the Amazon Forecast dashboard to see the latest status.
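
If you'd rather poll from code than refresh the console, a small helper can wait on the `Status` field returned by the Forecast `Describe*` calls. This is a sketch under stated assumptions: the helper name `wait_for_status` and its default timings are our own choices, and any status ending in `FAILED` is treated as a failure:

```python
import time


def wait_for_status(describe, poll_seconds=30, timeout_seconds=3600,
                    sleep=time.sleep):
    """Poll `describe()` until the resource reports status ACTIVE.

    `describe` is any zero-argument callable returning a dict with a
    "Status" key, as the Amazon Forecast Describe* API calls do.
    """
    waited = 0
    while True:
        status = describe()["Status"]
        if status == "ACTIVE":
            return status
        if status.endswith("FAILED"):
            raise RuntimeError(f"Resource ended in status {status}")
        if waited >= timeout_seconds:
            raise TimeoutError(f"Still {status} after {waited} seconds")
        sleep(poll_seconds)
        waited += poll_seconds


# Example usage with a boto3 client (import_job_arn is a placeholder name):
# wait_for_status(
#     lambda: forecast.describe_dataset_import_job(
#         DatasetImportJobArn=import_job_arn
#     )
# )
```

Passing the describe call as a callable keeps the helper reusable for import jobs, predictors and forecasts alike.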

## Step 8: Train a "Prophet" predictor

When your dashboard looks like the below, with both target and related data imported, you're ready to start predictor training.

For our first predictor, we'll train a model using Facebook's [Prophet](https://facebook.github.io/prophet/) algorithm: A highly successful open source framework based on additive-component regression (as described in the [paper](https://peerj.com/preprints/3190/)).

<img src="BlogImages/DashboardDatasetsImported.png"/>

First up, we need to review how much of our target series was chopped out from the end of the data-set as test data:

In [None]:
n_train_samples = len(target_train_df["timestamp"].unique())
n_test_samples = len(target_test_df["timestamp"].unique())
n_related_samples = len(related_df["timestamp"].unique())

print(f"  {n_train_samples} training samples")
print(f"+ {n_test_samples} testing samples")
print(f"= {n_related_samples} total samples (related dataset)")

assert (
    n_train_samples + n_test_samples == n_related_samples
), "Mismatch between target train+test timeseries and related timeseries coverage"

Create your predictor, configured as follows:

* **Predictor name:** **`bikeshare_prophet_predictor`**
* **Forecast horizon:** **336** (2 weeks at 24hrs/day)
* **Forecast frequency:** **1 hour** (matching our source data)
* **Algorithm selection:** **Manual**
* **Algorithm:** **Prophet**
* **Country for holidays:** **United States** (where the Capital Bikeshare scheme operates)
* **Number of backtest windows:** **4**
* **Backtest window offset:** *See below*

The [`BackTestWindowOffset`](https://docs.aws.amazon.com/forecast/latest/dg/API_EvaluationParameters.html#forecast-Type-EvaluationParameters-BackTestWindowOffset) parameter sets where the last forecast validation window starts. It defaults to the value of `ForecastHorizon`, on the assumption that no data has been withheld for external testing.

Since we held out data, we'll need to increase this value by the number of samples removed (see code cell above).

Assuming your configuration is identical, this will be: 336 + 744 = **1,080**

The *NumberOfBacktestWindows* parameter controls how many separate windows Amazon Forecast uses to [evaluate model accuracy](https://docs.aws.amazon.com/forecast/latest/dg/metrics.html): Allowing us to measure performance more robustly than concentrating only on the very end of the data set.
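
For reference, the console settings above map roughly onto the `CreatePredictor` API as sketched below. The helper `predictor_request` is our own, the dataset group ARN is a hypothetical placeholder, and the default of 744 test samples is the figure assumed in the arithmetic above (check it against the output of the code cell):

```python
PROPHET_ARN = "arn:aws:forecast:::algorithm/Prophet"


def predictor_request(name, algorithm_arn, dataset_group_arn,
                      horizon=336, n_test_samples=744, n_backtest_windows=4):
    """Build keyword arguments for the CreatePredictor API call."""
    return {
        "PredictorName": name,
        "AlgorithmArn": algorithm_arn,
        "ForecastHorizon": horizon,
        "PerformAutoML": False,  # "Manual" algorithm selection
        "EvaluationParameters": {
            "NumberOfBacktestWindows": n_backtest_windows,
            # Horizon plus the held-out test samples, as explained above:
            "BackTestWindowOffset": horizon + n_test_samples,
        },
        "InputDataConfig": {
            "DatasetGroupArn": dataset_group_arn,
            # Built-in holiday calendar for the United States:
            "SupplementaryFeatures": [{"Name": "holiday", "Value": "US"}],
        },
        "FeaturizationConfig": {"ForecastFrequency": "H"},  # hourly
    }


# Hypothetical placeholder ARN, for illustration only:
request = predictor_request(
    name="bikeshare_prophet_predictor",
    algorithm_arn=PROPHET_ARN,
    dataset_group_arn="arn:aws:forecast:us-east-1:123456789012:dataset-group/bikeshare_dataset_group",
)
print(request["EvaluationParameters"])
# With a boto3 client: forecast.create_predictor(**request)["PredictorArn"]
```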

## Step 9: Train a "DeepAR+" predictor

DeepAR+ is perhaps the "signature" algorithm of Amazon Forecast: Based on the same neural timeseries modelling [approach](https://docs.aws.amazon.com/sagemaker/latest/dg/deepar_how-it-works.html) behind the [SageMaker DeepAR built-in algorithm](https://docs.aws.amazon.com/sagemaker/latest/dg/deepar.html) - but with some proprietary extensions and improvements implemented in Amazon Forecast.

We **don't need to wait** for the Prophet predictor to finish training before kicking off another predictor training job: simply go to the "Predictors" item in the sidebar menu and click the "Train new predictor" button.

<img src="BlogImages/AmazonForecastPredictorCreateInProgress.png"/>

The configuration should be as above, except for:

* **Predictor name:** **`bikeshare_deeparplus_predictor`**
* **Algorithm:** **Deep_AR_Plus**

Once you've kicked off the training, return to the "Predictors" screen to track the status of the two training predictors.

## Step 10: Create forecasts (and maybe custom predictors?)

If you'd like to fit any other models (e.g. using the AutoML model selection or one of the more baseline architectures like ARIMA), feel free to kick off more training jobs at this point with the same naming conventions and configurations.

Our next step is to create a "forecast" for each predictor: running the model and extracting predicted confidence intervals.

You can kick off forecast creation for each predictor as soon as it has finished training. Prophet trains relatively quickly, so it may already be available.

Since both predictor fitting and forecast creation can take a while, you can make some progress on the other SageMaker model-fitting notebooks if you get blocked, and check back every now and then.

To create the forecasts, go to Forecasts in the sidebar menu and click the "Create a Forecast" button. Configure each forecast as:

* **Name:** e.g. **`bikeshare_prophet_forecast`**, **`bikeshare_deeparplus_forecast`**, etc
* **Predictor:** selected from the dropdown (if your predictor doesn't appear in the dropdown yet, it probably hasn't finished training)
* **Forecast types:** We'll look at **`.10, .50, .90, mean`**

<img src="BlogImages/CreateAForecast.png"/>
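
The equivalent `CreateForecast` API request is small. A minimal sketch, assuming you have the predictor's ARN to hand (the one below is a hypothetical placeholder) and using our own helper name `forecast_request`:

```python
def forecast_request(forecast_name, predictor_arn,
                     forecast_types=("0.10", "0.50", "0.90", "mean")):
    """Build keyword arguments for the CreateForecast API call."""
    return {
        "ForecastName": forecast_name,
        "PredictorArn": predictor_arn,
        # Quantiles are passed as strings, alongside the special value "mean":
        "ForecastTypes": list(forecast_types),
    }


# Hypothetical placeholder ARN, for illustration only:
request = forecast_request(
    "bikeshare_prophet_forecast",
    "arn:aws:forecast:us-east-1:123456789012:predictor/bikeshare_prophet_predictor",
)
print(request["ForecastTypes"])
# With a boto3 client: forecast.create_forecast(**request)["ForecastArn"]
```

The `ForecastArn` in the response is the value to record for each forecast.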

As soon as each forecast creation has been kicked off in the console, you'll be able to select that item from the list, copy its **Forecast ARN**, and enter it in the cell below.

**Note that the forecast ARN is different from the predictor ARN!** You can access the list of created forecasts from the "Forecasts" tab of the sidebar menu:

<img src="BlogImages/ProphetForecastDetails.png"/>

In [None]:
forecast_arns = {
    # Each entry should look something like this:
    # "a_nice_name": "arn:aws:forecast:[REGION?]:[ACCOUNT?]:forecast/[FORECASTNAME?]"
    "bikeshare_prophet_forecast": "",  # TODO
    "bikeshare_deeparplus_forecast": "",  # TODO
    # More entries if you created other forecasts with different settings too?
}

## Step 11: Review model accuracy metrics

Because we generate probabilistic forecasts with *confidence intervals*, evaluating the results is not as simple as comparing RMSE scores: There's a **trade-off** between:

* accuracy (whether the actual values are within the proposed confidence interval / probability distribution), versus
* precision (how narrow the proposed confidence interval is)

Predictor metrics calculated on our training set backtesting windows are available directly through the AWS Console:

* Go to "Predictors" in the sidebar menu
* Select a predictor to review and click to view details
* Scroll down to the "Predictor metrics" section

You'll see (example screenshot below) the RMSE of the mean forecast and the weighted quantile losses at the 10%, 50% and 90% evaluation points, both for each backtest window and summarized as an average.

**Which predictor seems to perform best from these metrics? Are there any patterns in accuracy over the different prediction windows?**

<img src="BlogImages/AmazonForecastPredictorMetrics.png"/>
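
To make these metrics concrete, here is a minimal sketch of RMSE and weighted quantile loss in plain Python, following the formulae in the [Amazon Forecast metrics docs](https://docs.aws.amazon.com/forecast/latest/dg/metrics.html). The function names and the toy numbers are our own, purely for illustration:

```python
import math


def rmse(actuals, preds):
    """Root mean squared error between actual values and point forecasts."""
    squared_errors = [(y - p) ** 2 for y, p in zip(actuals, preds)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def weighted_quantile_loss(actuals, quantile_preds, tau):
    """Weighted quantile loss (wQL) at quantile `tau`, with 0 < tau < 1.

    Under-predictions are penalised by tau and over-predictions by
    (1 - tau); the total is normalised by the sum of absolute actuals.
    """
    loss = sum(
        tau * max(y - q, 0.0) + (1.0 - tau) * max(q - y, 0.0)
        for y, q in zip(actuals, quantile_preds)
    )
    return 2.0 * loss / sum(abs(y) for y in actuals)


# Made-up toy numbers, NOT real forecast output:
toy_actuals = [10.0, 20.0, 30.0]
toy_p50 = [12.0, 18.0, 30.0]
toy_p90 = [15.0, 25.0, 33.0]

print("RMSE (using p50 as the point forecast):", rmse(toy_actuals, toy_p50))
print("wQL[0.5]:", weighted_quantile_loss(toy_actuals, toy_p50, 0.5))
print("wQL[0.9]:", weighted_quantile_loss(toy_actuals, toy_p90, 0.9))
```

The asymmetric weights are what capture the accuracy/precision trade-off: a 90% quantile forecast is penalised much more heavily when the actual value lands above it than when it lands below.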

## Step 12: Visualise and evaluate forecast quality

It's possible (via "Forecast Lookup" in the side-bar menu) to view forecast outputs directly in the AWS console: Feel free to try it out!

Here though, we'll use the Forecast Query API to programmatically download results and plot in our notebook - which would allow you to construct different visualisations, or custom evaluation metrics.

First, note that although Forecast understood our source data timestamps for model training, inference has stricter requirements, so we'll need to generate start and end timestamps in proper ISO format:

In [None]:
first_test_ts = target_test_df["timestamp"].iloc[0]

# Remember we predict to 2 weeks horizon
# [Python 3.6 doesn't have fromisoformat()]
test_end_dt = datetime(
    int(first_test_ts[0:4]),
    int(first_test_ts[5:7]),
    int(first_test_ts[8:10]),
    int(first_test_ts[11:13]),
    int(first_test_ts[14:16]),
    int(first_test_ts[17:])
) + timedelta(days=14, hours=-1)

# Forecast wants a slightly different timestamp format to the dataset:
fcst_start_date = first_test_ts.replace(" ", "T")
fcst_end_date = test_end_dt.isoformat()
print(f"Forecasting\nFrom: {fcst_start_date}\nTo: {fcst_end_date}")
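
As an alternative to slicing the string by hand, `datetime.strptime` (available in Python 3.6) can do the parsing. This is a minimal sketch, assuming the timestamps follow the `YYYY-MM-DD HH:MM:SS` layout implied by the slicing above; the function name `forecast_window` and the example timestamp are our own:

```python
from datetime import datetime, timedelta


def forecast_window(first_ts, horizon_hours=336):
    """Return ISO-format (start, end) strings covering `horizon_hours` hourly steps."""
    # ASSUMPTION: source timestamps look like "2012-12-01 00:00:00"
    start_dt = datetime.strptime(first_ts, "%Y-%m-%d %H:%M:%S")
    # The window includes its first hour, so the last step is horizon - 1 hours on:
    end_dt = start_dt + timedelta(hours=horizon_hours - 1)
    return start_dt.isoformat(), end_dt.isoformat()


# Made-up example timestamp, for illustration only:
start, end = forecast_window("2012-12-01 00:00:00")
print(f"From: {start}\nTo: {end}")
```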

Next, we'll use the `forecast_arns` dictionary you filled out earlier as a basis to download predictions for each of the created forecasts:

In [None]:
forecasts = {
    predictor_name: {
        "forecast_arn": forecast_arn,
        "forecasts": {
            item_id: forecast_query.query_forecast(
                ForecastArn=forecast_arn,
                StartDate=fcst_start_date,
                EndDate=fcst_end_date,
                Filters={ "item_id": item_id }
            )
        for item_id in item_ids }
    }
for (predictor_name, forecast_arn) in forecast_arns.items() }
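
Each `query_forecast` response nests one list of timestamped values per forecast type under `Forecast` → `Predictions`. The sketch below uses a mock response with made-up numbers (not real output) to show that shape and one way to flatten it into rows; the helper name `predictions_to_rows` is our own:

```python
# Mock response with made-up values, shaped like a query_forecast result:
mock_response = {
    "Forecast": {
        "Predictions": {
            "mean": [
                {"Timestamp": "2012-12-01T00:00:00", "Value": 12.0},
                {"Timestamp": "2012-12-01T01:00:00", "Value": 9.5},
            ],
            "p10": [
                {"Timestamp": "2012-12-01T00:00:00", "Value": 5.0},
                {"Timestamp": "2012-12-01T01:00:00", "Value": 4.0},
            ],
            "p90": [
                {"Timestamp": "2012-12-01T00:00:00", "Value": 20.0},
                {"Timestamp": "2012-12-01T01:00:00", "Value": 16.0},
            ],
        }
    }
}


def predictions_to_rows(response):
    """Flatten a query_forecast-style response into one dict per timestamp."""
    predictions = response["Forecast"]["Predictions"]
    names = sorted(predictions)
    rows = []
    for i, point in enumerate(predictions[names[0]]):
        row = {"timestamp": point["Timestamp"]}
        for name in names:
            row[name] = predictions[name][i]["Value"]
        rows.append(row)
    return rows


for row in predictions_to_rows(mock_response):
    print(row)
```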

Since Amazon Forecast and various SageMaker models will produce outputs in different formats, we'll **standardize the results** into a local CSV file to help with cross-system comparisons:

In [None]:
# Collect one dataframe per (predictor, item) pair, then concatenate them.
# (DataFrame.append() was removed in pandas 2.0, so we use pd.concat instead)
result_dfs = []
for predictor_name, predictor_data in forecasts.items():
    for item_id, forecast_data in predictor_data["forecasts"].items():
        predictions = forecast_data["Forecast"]["Predictions"]
        pred_mean_df = pd.DataFrame(predictions["mean"])
        pred_timestamps = pd.to_datetime(pred_mean_df["Timestamp"].apply(lambda s: s.replace("T", " ")))
        
        df = pd.DataFrame()
        df["timestamp"] = pred_timestamps
        df["model"] = f"amzforecast-{predictor_name}"
        df["customer_type"] = item_id
        df["mean"] = pred_mean_df["Value"]
        df["p10"] = pd.DataFrame(predictions["p10"])["Value"]
        df["p50"] = pd.DataFrame(predictions["p50"])["Value"]
        df["p90"] = pd.DataFrame(predictions["p90"])["Value"]
        
        result_dfs.append(df)

clean_results_df = pd.concat(result_dfs, ignore_index=True)

!mkdir -p results/amzforecast
clean_results_df.to_csv(
    "./results/amzforecast/results_clean.csv",
    index=False
)
print("Clean results saved to ./results/amzforecast/results_clean.csv")
clean_results_df.head()

Now finally, we use this standardized format to plot results:

(Using our handy plotting function in the util folder, to avoid cluttering up this notebook)

In [None]:
# First, prepare the actual data (training + test) for easy plotting:
first_plot_dt = test_end_dt - timedelta(days=21)
actuals_df = pd.concat([target_train_df, target_test_df])
actuals_df["timestamp"] = pd.to_datetime(actuals_df["timestamp"])
actuals_plot_df = actuals_df[
    (actuals_df["timestamp"] >= first_plot_dt)
    & (actuals_df["timestamp"] <= test_end_dt)
]
# Assign the result rather than renaming in place on a filtered slice:
actuals_plot_df = actuals_plot_df.rename(columns={ "item_id": "customer_type"})

util.plot_fcst_results(actuals_plot_df, clean_results_df)

...and there you have it! Statistical timeseries forecasts created with the AWS console and downloaded/processed in code: No deep data science knowledge required, but with ability to play around with hyperparameters and model architectures if we wanted to dive deeper.

**Did the graphs visually agree with your assessment of which models looked best from the console metrics?**

## Extension exercises and exploring further

Given the formulae listed on the [Amazon Forecast metrics docs](https://docs.aws.amazon.com/forecast/latest/dg/metrics.html) and the example plotting code above, could you calculate the RMSE and weighted quantile loss scores for this prediction window? How do they compare to the scores Amazon Forecast calculated in training?

Might we get better performance with a different dataset group **domain**? Some domains have different column name and type requirements, so you might need to do some more data manipulation!

Amazon Forecast has some built-in time featurization capability, which is why it's important to provide correct, absolute date/timestamps. Does removing the `workingday` related timeseries feature have much impact on prediction quality? How about `holiday`? What if we offset the timestamps of the whole data-set by one calendar day?

## Thanks for joining in! (Clean-up time)

[Amazon Forecast pricing](https://aws.amazon.com/forecast/pricing/) is by:

* Generated forecasts
* Data storage, and
* Training hours

...So, unlike some services, there are no real-time endpoint compute resources to worry about deleting. It might still be worth cleaning up, though, if the data storage cost is significant for you at these sizes.

You can delete all resources (forecasts, forecast exports if you triggered any, predictors, import jobs, datasets, and dataset groups) through the Amazon Forecast console. Consider also clearing out the S3 bucket, and stopping this notebook instance if running on SageMaker!
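
If you prefer to clean up through the API, the resources need deleting from the most dependent to the least. The sketch below only builds the ordered plan; the helper name `deletion_plan` and the ARNs shown are hypothetical. Deletion is asynchronous, so in practice you may need to wait for each resource to disappear before deleting the resources it depended on:

```python
def deletion_plan(forecast_arns=(), predictor_arns=(), import_job_arns=(),
                  dataset_arns=(), dataset_group_arns=()):
    """Return (client_method_name, kwargs) pairs in a safe deletion order."""
    stages = [
        ("delete_forecast", "ForecastArn", forecast_arns),
        ("delete_predictor", "PredictorArn", predictor_arns),
        ("delete_dataset_import_job", "DatasetImportJobArn", import_job_arns),
        ("delete_dataset", "DatasetArn", dataset_arns),
        ("delete_dataset_group", "DatasetGroupArn", dataset_group_arns),
    ]
    return [
        (method, {arg_name: arn})
        for method, arg_name, arns in stages
        for arn in arns
    ]


# Hypothetical placeholder ARNs, for illustration only:
plan = deletion_plan(
    forecast_arns=["arn:aws:forecast:us-east-1:123456789012:forecast/bikeshare_prophet_forecast"],
    predictor_arns=["arn:aws:forecast:us-east-1:123456789012:predictor/bikeshare_prophet_predictor"],
    dataset_group_arns=["arn:aws:forecast:us-east-1:123456789012:dataset-group/bikeshare_dataset_group"],
)
for method, kwargs in plan:
    print(method, kwargs)
    # With a boto3 client: getattr(forecast, method)(**kwargs)
```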

We hope you've enjoyed this section and any others you're still working on. If you have any feedback on this workshop, please do get in touch via GitHub or your workshop facilitators!