# Amazon Forecast: predicting time-series at scale

Forecasting is used in a variety of applications and business use cases. For example, retailers need to forecast the sales of their products to decide how much stock they need by location; manufacturers need to estimate the number of parts required at their factories to optimize their supply chain; businesses need to estimate their flexible workforce needs; utilities need to forecast electricity consumption in order to run an efficient energy network; and enterprises need to estimate their cloud infrastructure needs.
<img src="https://amazon-forecast-samples.s3-us-west-2.amazonaws.com/common/images/forecast_overview.png" width="98%">

# Notebook Overview

<img src="images/forecast_overview.png" width="100%">

In this notebook we will walk through all the steps listed below.


## Table Of Contents
* Step 1: [Setup Amazon Forecast](#setup)
* Step 2: [Prepare the Datasets](#DataPrep)
    * Step 2a: [Prepare and Save the Target Time Series](#DataPrepTTS)
    * Step 2b: [Prepare and Save the Related Time Series](#DataPrepRTS)
* Step 3: [Create the Dataset Group and Dataset](#DataSet)
* Step 4: [Create the Target Time Series Data Import Job](#DataImportTTS)
* Step 5: [Create the Related Time Series Data Import Job](#DataImportRTS)
* Step 6: [Training a predictor and evaluating its performance](#train)
    * Step 6a: [Train a Predictor](#trainaAutoPred)
    * Step 6b: [Get Predictor Error Metrics from Backtesting](#predictorErrors)
* Step 7: [Create a Forecast](#createForecast)
* Step 8: [Query a Forecast](#queryForecast)
* Step 9: [Export a Forecast](#exportForecast)
* Step 10: [Clean up your Resources](#cleanup)
* [Next Steps](#nextSteps)

For more information about the APIs, please see the [documentation](https://docs.aws.amazon.com/forecast/latest/dg/what-is-forecast.html).


# Step 1: Setup Amazon Forecast<a class="anchor" id="setup"></a>

This section sets up the permissions and relevant endpoints.

In [None]:
import sys
import os

# importing forecast notebook utility from notebooks/common directory
sys.path.insert( 0, os.path.abspath("../../common") )
import util
import util.fcst_utils

%reload_ext autoreload
import boto3
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline 
plt.rcParams['figure.figsize'] = (15.0, 5.0)

<b>Create a new S3 bucket for this lesson</b>
- The cell below will create a new S3 bucket with name ending in "forecast-demo-bike-small"

In [None]:
region = boto3.Session().region_name
account_id = boto3.client('sts').get_caller_identity().get('Account')

# create unique S3 bucket for saving your own data
bucket_name = account_id + '-forecast-demo-bike-small'
if util.create_bucket(bucket_name, region=region):
    print(f"Success! Created bucket {bucket_name}")

In [None]:
# Connect API sessions
session = boto3.Session(region_name=region) 
s3 = session.client(service_name='s3')
forecast = session.client(service_name='forecast') 
forecastquery = session.client(service_name='forecastquery')

<b>Create IAM Role for Forecast</b> <br>
Like many AWS services, Forecast needs to assume an IAM role in order to interact with your S3 resources securely. In the sample notebooks, we use the `get_or_create_iam_role()` utility function to create an IAM role. Please refer to `notebooks/common/util/fcst_utils.py` for the implementation.

In [None]:
# Create the role to provide to Amazon Forecast.
role_name = "ForecastNotebookRole-Basic"
print(f"Creating Role {role_name} ...")
role_arn = util.get_or_create_iam_role( role_name = role_name )

# print only the role name, so the account ID in the ARN is not echoed
print(f"Success! Created role = {role_arn.split('/')[1]}")
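
For orientation, the role that Forecast assumes needs a trust policy naming the Forecast service principal. The snippet below is a minimal sketch of such a policy document, not the actual implementation of `get_or_create_iam_role()`; the helper name `build_forecast_trust_policy` is hypothetical, and the S3 permissions that the real utility attaches are not shown.

```python
import json


def build_forecast_trust_policy():
    """Return a trust policy that lets Amazon Forecast assume the role.

    Hypothetical helper for illustration; the notebook's own utility
    may build its policy differently.
    """
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"Service": "forecast.amazonaws.com"},
                "Action": "sts:AssumeRole",
            }
        ],
    }


trust_policy_json = json.dumps(build_forecast_trust_policy(), indent=2)
print(trust_policy_json)
```

A document like this would typically be passed as `AssumeRolePolicyDocument` when creating the role with the IAM `create_role` call.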

# Step 2: Prepare the Datasets<a class="anchor" id="DataPrep"></a>

In [None]:
bike_df = pd.read_csv("data/train.csv", dtype = object)
bike_df.head()

In [None]:
print(bike_df.datetime.min())
print(bike_df.datetime.max())

In [None]:
bike_df['count'] = bike_df['count'].astype('float')
bike_df['workingday'] = bike_df['workingday'].astype('float')

The dataset happens to span January 01, 2011 to December 31, 2012. We are only going to use about two and a half weeks of hourly data to train Amazon Forecast.

In [None]:
bike_df_small = bike_df[-2*7*24-24*3:].copy()
bike_df_small['item_id'] = "bike_12"
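
To make the slice bound concrete: the negative index keeps the last two weeks plus three days of hourly rows. The sketch below only does the arithmetic; it assumes one row per hour with no gaps, and the variable names are illustrative.

```python
HOURS_PER_DAY = 24

# Same expression as the slice bound used above: 2 weeks + 3 days, in hours
n_rows = 2 * 7 * HOURS_PER_DAY + 3 * HOURS_PER_DAY
n_days = n_rows / HOURS_PER_DAY
n_weeks = n_days / 7

print(f"{n_rows} hourly rows = {n_days:.0f} days = {n_weeks:.2f} weeks")
```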

In [None]:
# save an item_id for querying later
item_id = 'bike_12'

Let us plot the time series first.

In [None]:
bike_df_small.plot(x='datetime', y='count', figsize=(15, 8))

We can see that the target time series seems to drop over weekends. This is a clue for a useful related time series variable. Let's plot both the target time series and a potential related time series variable, `workingday`, which indicates whether a given day is a working day or not.

More precisely, the related variable `workingday` is $r_t = 1$ if $t$ falls on a working day and $r_t = 0$ otherwise.
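
As an illustration of the definition of $r_t$, the sketch below derives a weekday-based indicator from timestamps using only the standard library. It is a simplified stand-in: it treats Monday to Friday as working days and ignores public holidays, whereas the notebook itself uses the dataset's own `workingday` column. The function name `working_day_indicator` is hypothetical.

```python
from datetime import datetime, timedelta


def working_day_indicator(ts):
    """Return 1.0 if ts falls on Monday-Friday, else 0.0 (holidays ignored)."""
    return 1.0 if ts.weekday() < 5 else 0.0


# One timestamp per day at noon, starting on Saturday 2012-12-15
start = datetime(2012, 12, 15, 12)
days = [start + timedelta(days=d) for d in range(4)]
for ts in days:
    print(ts.strftime("%Y-%m-%d %a"), working_day_indicator(ts))
```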

In [None]:
plt.figure(figsize=(15, 8))
ax = plt.gca()
bike_df_small.plot(x='datetime', y='count', ax=ax);
ax2 = ax.twinx()
bike_df_small.plot(x='datetime', y='workingday', color='red', ax=ax2);

## Step 2a: Prepare and Save the Target Time Series<a class="anchor" id="DataPrepTTS"></a>

Below, we specify the key input data and forecast parameters.

In [None]:
# What is your forecast horizon, in number of the time units you've selected?
# e.g. if you're forecasting in hours, how many hours out do you want a forecast?
FORECAST_LENGTH = 24

# What is your forecast time unit granularity?
# Choices are: ^Y|M|W|D|H|30min|15min|10min|5min|1min$ 
DATASET_FREQUENCY = "H"
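# The Forecast API reference documents "yyyy-MM-dd HH:mm:ss" (24-hour clock) for hourly and finer data
TIMESTAMP_FORMAT = "yyyy-MM-dd HH:mm:ss"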
# delimiter = ','

# What name do you want to give this project?  
# We will use this same name for your Forecast Dataset Group name.
PROJECT = 'small_bike_demo'
DATA_VERSION = '00'

In [None]:
target_df = bike_df_small[['item_id', 'datetime', 'count']][:-FORECAST_LENGTH]
target_df.head(5)

Notice in the output above there are 3 columns of data:

1. An Item ID
1. The Timestamp
1. A Value

These are the 3 key required pieces of information to generate a forecast with Amazon Forecast. More can be added but these 3 must always remain present.
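
To make the expected layout concrete, the sketch below writes a couple of rows in the headerless `item_id, timestamp, value` order that the notebook later exports. The demand values are made up for illustration and are not taken from the dataset.

```python
import csv
import io

# Made-up rows in (item_id, timestamp, value) order
rows = [
    ("bike_12", "2012-12-15 00:00:00", 42.0),
    ("bike_12", "2012-12-15 01:00:00", 37.0),
]

buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerows(rows)  # no header row is written

csv_text = buffer.getvalue()
print(csv_text)
```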


## Step 2b: Prepare and Save the Related Time Series <a class="anchor" id="DataPrepRTS"></a>

To use the related time series, we need to ensure that the related time series covers the whole target time series, as well as the future values as specified by the forecast horizon. More precisely, we need to make sure:
```
len(related time series) >= len(target time series) + forecast horizon
```
In other words, for every item the related time series must start at or before the start of that item's target time series, and must extend through the forecast horizon (i.e. the latest end date across all items plus the forecast horizon). Additionally, there should be no missing values in the related time series. The following picture illustrates the desired logic.

<img src="images/rts_viz.png">

For more details regarding how to prepare your Related Time Series dataset, please refer to the public documentation <a href="https://docs.aws.amazon.com/forecast/latest/dg/related-time-series-datasets.html">here</a>. 
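
The coverage rule can be sketched as a small check on timestamps. This is a simplified illustration for a single item at hourly frequency; the function name `related_covers_target` and its arguments are hypothetical, and the cells that follow perform the notebook's actual checks with pandas.

```python
from datetime import datetime, timedelta


def related_covers_target(target_ts, related_ts, horizon, step=timedelta(hours=1)):
    """Check that the related series spans the target series plus the horizon.

    target_ts, related_ts: sorted lists of datetimes for one item.
    horizon: number of future steps to be forecast.
    """
    starts_early_enough = related_ts[0] <= target_ts[0]
    required_end = target_ts[-1] + horizon * step
    ends_late_enough = related_ts[-1] >= required_end
    # No gaps: every consecutive pair is exactly one step apart
    no_gaps = all(b - a == step for a, b in zip(related_ts, related_ts[1:]))
    return starts_early_enough and ends_late_enough and no_gaps


t0 = datetime(2012, 12, 15)
target = [t0 + timedelta(hours=h) for h in range(48)]
related = [t0 + timedelta(hours=h) for h in range(48 + 24)]
print(related_covers_target(target, related, horizon=24))
```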


In [None]:
rts_df = bike_df_small[['item_id', 'datetime', 'workingday']]
rts_df.head(5)

As we can see, the length of the related time series is equal to the length of the target time series plus the forecast horizon. 

In [None]:
print(f"{len(target_df)} + {FORECAST_LENGTH} = {len(rts_df)}")
assert len(target_df) + FORECAST_LENGTH == len(rts_df), "length doesn't match"

Next we check whether there are "holes" in the related time series.  

In [None]:
assert len(rts_df) == len(pd.date_range(
    start=list(rts_df['datetime'])[0],
    end=list(rts_df['datetime'])[-1],
    freq='H'
)), "missing entries in the related time series"

Everything looks fine: the related time series (an indicator of whether the current day is a working day or not) is longer than the target time series, and it does not have any missing values.

The binary working day indicator feature is a good example of a related time series, since it is known at all future time points.  Other examples of related time series include holiday, price, and promotion features.

Now export them to CSV files and place them into your `data` folder.

In [None]:
target_df.to_csv("data/bike_small.csv", index= False, header = False)
rts_df.to_csv("data/bike_small_rts.csv", index= False, header = False)

At this point the data is ready to be sent to S3, where Forecast will use it later. The following cells upload the data to S3.

In [None]:
key = "bike_small"

s3.upload_file(Filename="data/bike_small.csv", Bucket = bucket_name, Key = f"{key}/bike.csv")
s3.upload_file(Filename="data/bike_small_rts.csv", Bucket = bucket_name, Key = f"{key}/bike_rts.csv")

# Step 3: Create the Dataset Group and Dataset<a class="anchor" id="DataSet"></a>
First let's create a dataset group and then update it later to add our datasets.

In Amazon Forecast, a dataset is a collection of one or more files containing data that is relevant for a forecasting task. A dataset must conform to a schema provided by Amazon Forecast. Since data files are imported headerless, it is important to define a schema for your data.

More details about `Domain` and dataset types can be found in the [documentation](https://docs.aws.amazon.com/forecast/latest/dg/howitworks-domains-ds-types.html). For this example, we are using the [RETAIL](https://docs.aws.amazon.com/forecast/latest/dg/retail-domain.html) domain with 3 required attributes: `item_id`, `timestamp` and `demand`.

### Create the Dataset Group

In this task, we define a container name or Dataset Group name, which will be used to keep track of Dataset import files, schema, and all Forecast results which go together.


In [None]:
dataset_group = f"{PROJECT}_{DATA_VERSION}"
print(f"Dataset Group Name = {dataset_group}")

In [None]:
dataset_arns = []
create_dataset_group_response = \
    forecast.create_dataset_group(Domain="RETAIL",
                                  DatasetGroupName=dataset_group,
                                  DatasetArns=dataset_arns)

In [None]:
dataset_group_arn = create_dataset_group_response['DatasetGroupArn']

In [None]:
forecast.describe_dataset_group(DatasetGroupArn=dataset_group_arn)

### Create the Target Schema

Next, we specify the schema of our dataset below. Make sure the order of the attributes (columns) matches the raw data in the files. 

In [None]:
# Specify the schema of your dataset here. Make sure the order of columns matches the raw data files.
ts_schema ={
   "Attributes":[
      {
         "AttributeName":"item_id",
         "AttributeType":"string"
      },
      {
         "AttributeName":"timestamp",
         "AttributeType":"timestamp"
      },
      {
         "AttributeName":"demand",
         "AttributeType":"float"
      }
   ]
}
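
Because the files are headerless, the schema's attribute order has to line up with the column order of the CSV. The sketch below compares the two; the helper `schema_matches_columns` and the renaming map are hypothetical and only illustrate the idea.

```python
def schema_matches_columns(schema, csv_columns, rename=None):
    """Return True if schema attribute names line up with CSV column order.

    rename maps local column names to schema attribute names where they differ.
    """
    rename = rename or {}
    expected = [rename.get(col, col) for col in csv_columns]
    actual = [attr["AttributeName"] for attr in schema["Attributes"]]
    return expected == actual


example_schema = {
    "Attributes": [
        {"AttributeName": "item_id", "AttributeType": "string"},
        {"AttributeName": "timestamp", "AttributeType": "timestamp"},
        {"AttributeName": "demand", "AttributeType": "float"},
    ]
}

# Column order of the exported target file, mapped to the schema's names
columns = ["item_id", "datetime", "count"]
mapping = {"datetime": "timestamp", "count": "demand"}
print(schema_matches_columns(example_schema, columns, rename=mapping))
```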

### Create a Target Dataset 

A target time series dataset is required to use the service.

In [None]:
ts_dataset_name = f"{PROJECT}_{DATA_VERSION}_tts"
print(ts_dataset_name)

In [None]:
response = \
    forecast.create_dataset(Domain="RETAIL",
                            DatasetType='TARGET_TIME_SERIES',
                            DatasetName=ts_dataset_name,
                            DataFrequency=DATASET_FREQUENCY,
                            Schema=ts_schema
                           )

In [None]:
ts_dataset_arn = response['DatasetArn']

In [None]:
forecast.describe_dataset(DatasetArn=ts_dataset_arn)

### Create the Related Schema
Make sure the order of the attributes (columns) matches the raw data in the files. 

In [None]:
# Specify the schema of your dataset here. Make sure the order of columns matches the raw data files.
rts_schema ={
   "Attributes":[
      {
         "AttributeName":"item_id",
         "AttributeType":"string"
      },
      {
         "AttributeName":"timestamp",
         "AttributeType":"timestamp"
      },
      {
         "AttributeName":"workingday",
         "AttributeType":"float"
      }
   ]
}

### Create a Related Dataset 

In this example, we will define a related time series.

In [None]:
rts_dataset_name = f"{PROJECT}_{DATA_VERSION}_rts"
print(rts_dataset_name)

In [None]:
response = \
    forecast.create_dataset(Domain="RETAIL",
                            DatasetType='RELATED_TIME_SERIES',
                            DatasetName=rts_dataset_name,
                            DataFrequency=DATASET_FREQUENCY,
                            Schema=rts_schema
                           )

In [None]:
rts_dataset_arn = response['DatasetArn']

In [None]:
forecast.describe_dataset(DatasetArn=rts_dataset_arn)

### Update the dataset group with the datasets we created 

You can have multiple datasets under the same dataset group. Update it with the datasets we created before.

In [None]:
dataset_arns = []
dataset_arns.append(ts_dataset_arn)
dataset_arns.append(rts_dataset_arn)
forecast.update_dataset_group(DatasetGroupArn=dataset_group_arn, DatasetArns=dataset_arns)

In [None]:
forecast.describe_dataset_group(DatasetGroupArn=dataset_group_arn)

# Step 4: Create the Target Time Series Data Import Job<a class="anchor" id="DataImportTTS"></a>

Now that Forecast knows how to understand the CSV we are providing, the next step is to import the data from S3 into Amazon Forecast.

In [None]:
s3_data_path = f"s3://{bucket_name}/{key}"

In [None]:
ts_s3_data_path = f"{s3_data_path}/bike.csv"

In [None]:
ts_dataset_import_job_response = \
    forecast.create_dataset_import_job(DatasetImportJobName=dataset_group,
                                       DatasetArn=ts_dataset_arn,
                                       DataSource= {
                                         "S3Config" : {
                                             "Path": ts_s3_data_path,
                                             "RoleArn": role_arn
                                         } 
                                       },
                                       TimestampFormat=TIMESTAMP_FORMAT)

In [None]:
ts_dataset_import_job_arn=ts_dataset_import_job_response['DatasetImportJobArn']

Check the status of the dataset import job. When the status changes from **CREATE_IN_PROGRESS** to **ACTIVE**, we can continue to the next steps. Depending on the data size, this can take 5 to 10 minutes.

In [None]:
status = util.wait(lambda: forecast.describe_dataset_import_job(DatasetImportJobArn=ts_dataset_import_job_arn))
assert status
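
The waiting step amounts to polling a `describe_*` call until its `Status` field reaches a terminal value. The sketch below shows that pattern with a stand-in describe function so it runs without AWS access. `wait_for_active` is a hypothetical helper, not the implementation of `util.wait`, and treating any status ending in `FAILED` as terminal is an assumption.

```python
import time


def wait_for_active(describe_fn, sleep_seconds=10, max_polls=360):
    """Poll describe_fn until Status is ACTIVE; return True on success."""
    for _ in range(max_polls):
        status = describe_fn()["Status"]
        if status == "ACTIVE":
            return True
        if status.endswith("FAILED"):
            return False
        time.sleep(sleep_seconds)
    return False  # gave up after max_polls attempts


# Stand-in for a describe call: in progress twice, then active
_responses = iter(["CREATE_IN_PROGRESS", "CREATE_IN_PROGRESS", "ACTIVE"])


def fake_describe():
    return {"Status": next(_responses)}


print(wait_for_active(fake_describe, sleep_seconds=0))
```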

# Step 5: Create the Related Time Series Data Import Job<a class="anchor" id="DataImportRTS"></a>

In [None]:
rts_s3_data_path = f"{s3_data_path}/bike_rts.csv"

In [None]:
rts_dataset_import_job_response = \
    forecast.create_dataset_import_job(DatasetImportJobName=dataset_group,
                                       DatasetArn=rts_dataset_arn,
                                       DataSource= {
                                         "S3Config" : {
                                             "Path": rts_s3_data_path,
                                             "RoleArn": role_arn
                                         } 
                                       },
                                       TimestampFormat=TIMESTAMP_FORMAT)

In [None]:
rts_dataset_import_job_arn=rts_dataset_import_job_response['DatasetImportJobArn']

Check the status of the dataset import job. When the status changes from **CREATE_IN_PROGRESS** to **ACTIVE**, we can continue to the next steps. Depending on the data size, this can take 5 to 10 minutes.

In [None]:
status = util.wait(lambda: forecast.describe_dataset_import_job(DatasetImportJobArn=rts_dataset_import_job_arn))
assert status

# Step 6: Training a predictor and evaluating its performance<a class="anchor" id="train"></a>

Once the datasets are specified with the corresponding schema, Amazon Forecast will automatically aggregate all the relevant pieces of information for each item, such as sales, price, promotions, as well as categorical attributes, and generate the desired dataset. Amazon Forecast creates predictors, which involves applying the optimal combination of algorithms to each time series in your datasets.
ML experts train separate models for different parts of their dataset to improve forecasting accuracy. This process of segmenting your data and applying different algorithms can be very challenging for non-ML experts. Forecast uses ML to learn not only the best algorithm for each item, but the best ensemble of algorithms for each item.

## How to evaluate a forecasting model?

Before moving forward, let's first introduce the notion of a *backtest* for evaluating forecasting models. The key difference between evaluating forecasting algorithms and standard ML applications is that we need to make sure no future information is used when predicting the past. In other words, the procedure needs to be causal.

<img src="https://amazon-forecast-samples.s3-us-west-2.amazonaws.com/common/images/backtest.png" width=70%>
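
A causal backtest can be sketched as follows: for each backtest window, the model sees only observations before a cutoff and is scored on the `horizon` points right after it. The function `backtest_splits` below is an illustrative assumption, not how Amazon Forecast constructs its windows internally.

```python
def backtest_splits(series, horizon, n_windows=1):
    """Yield (train, test) pairs where test always follows train in time.

    The last window ends at the end of the series; earlier windows are
    shifted back by one horizon each.
    """
    n = len(series)
    for w in range(n_windows, 0, -1):
        cutoff = n - w * horizon
        if cutoff <= 0:
            raise ValueError("series too short for the requested windows")
        yield series[:cutoff], series[cutoff:cutoff + horizon]


hourly_values = list(range(10))
for train, test in backtest_splits(hourly_values, horizon=2, n_windows=2):
    print(len(train), test)
```

Every training slice ends strictly before its test slice begins, which is what keeps the evaluation causal.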



## Step 6a: Train a Predictor <a class="anchor" id="trainaAutoPred"></a>

In [None]:
predictor_name = f"{PROJECT}_{DATA_VERSION}_predictor"
print(f"Predictor Name = {predictor_name}")

In [None]:
response = forecast.create_auto_predictor(PredictorName = predictor_name,
                                   ForecastHorizon = FORECAST_LENGTH,
                                   ForecastFrequency = DATASET_FREQUENCY,
                                   DataConfig = {
                                       'DatasetGroupArn': dataset_group_arn, 
                                    },
                                   ExplainPredictor = False)

In [None]:
predictor_arn = response['PredictorArn']

Check the status of the predictor. When the status changes from **CREATE_IN_PROGRESS** to **ACTIVE**, we can continue to the next steps. Depending on data size, model selection and hyperparameter tuning, it can take several hours to become **ACTIVE**.

In [None]:
status = util.wait(lambda: forecast.describe_auto_predictor(PredictorArn=predictor_arn))
assert status

In [None]:
forecast.describe_auto_predictor(PredictorArn=predictor_arn)

## Step 6b: Get Predictor Error Metrics from Backtesting <a class="anchor" id="predictorErrors"></a>

After creating the predictors, we can query the errors given by the backtest scenario and have a quantitative understanding of the performance of the algorithm. In the cells below, we get the predictor error metrics. 

We're not demoing it in this notebook, but there is also an Export Predictor Backtest files job you can trigger.  This will save Predictor Error Metrics and also save Item-level Backtest Forecasts to an S3 bucket of your choice.  This is useful in case you want to use custom metric calculations on particular groups of items.
<a href="https://github.com/aws-samples/amazon-forecast-samples/tree/master/notebooks/advanced/Item_Level_Accuracy" target="_blank">See advanced/Item_Level_Accuracy notebook</a>
<br>
<br>

In [None]:
error_metrics = forecast.get_accuracy_metrics(PredictorArn=predictor_arn)
error_metrics
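
For intuition about the numbers returned, the sketch below computes two common forecast error metrics by hand: the weighted absolute percentage error (WAPE) and the weighted quantile loss (wQL) at a given quantile. The formulas follow the usual definitions; the values are made up, and the exact structure of the `get_accuracy_metrics` response is not reproduced here.

```python
def wape(actuals, forecasts):
    """Weighted absolute percentage error: sum|y - f| / sum|y|."""
    total_abs_error = sum(abs(y - f) for y, f in zip(actuals, forecasts))
    return total_abs_error / sum(abs(y) for y in actuals)


def weighted_quantile_loss(actuals, quantile_forecasts, tau):
    """wQL at quantile tau: 2 * sum(pinball loss) / sum|y|."""
    pinball = sum(
        tau * max(y - q, 0.0) + (1.0 - tau) * max(q - y, 0.0)
        for y, q in zip(actuals, quantile_forecasts)
    )
    return 2.0 * pinball / sum(abs(y) for y in actuals)


# Made-up actuals and forecasts for three time steps
y_true = [10.0, 20.0, 30.0]
y_pred = [12.0, 18.0, 33.0]

print(f"WAPE    = {wape(y_true, y_pred):.4f}")
print(f"wQL@0.5 = {weighted_quantile_loss(y_true, y_pred, 0.5):.4f}")
```

At `tau = 0.5` the weighted quantile loss reduces to the WAPE of the same forecast, which is a handy sanity check.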

# Step 7: Create a Forecast <a class="anchor" id="createForecast"></a>

In [None]:
forecast_name = f"{PROJECT}_{DATA_VERSION}_forecast"
print(f"Forecast Name = {forecast_name}")

In [None]:
response = forecast.create_forecast(ForecastName=forecast_name,PredictorArn=predictor_arn)

In [None]:
forecast_arn = response['ForecastArn']

Check the status of the forecast creation process. When the status changes from **CREATE_IN_PROGRESS** to **ACTIVE**, we can continue to the next steps. Depending on data size, model selection and hyperparameter tuning, it can take several hours to become **ACTIVE**.

In [None]:
status = util.wait(lambda: forecast.describe_forecast(ForecastArn=forecast_arn))
assert status

In [None]:
forecast.describe_forecast(ForecastArn=forecast_arn)

# Step 8: Query a Forecast<a class="anchor" id="queryForecast"></a>

Once the forecast has been created, the results are ready and you can view them.

In [None]:
item_id

In [None]:
response = forecastquery.query_forecast(
    ForecastArn=forecast_arn,
    Filters={"item_id": item_id})
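
The query response groups predictions by quantile. The sketch below flattens such a response into rows; it runs on a hand-written stand-in dictionary whose shape (a `Forecast` entry holding `Predictions`, keyed by quantile name, each a list of `Timestamp`/`Value` pairs) mirrors what `query_forecast` returns, with made-up values. The helper name `flatten_predictions` is hypothetical.

```python
def flatten_predictions(response):
    """Turn a query_forecast-style response into (quantile, timestamp, value) rows."""
    rows = []
    predictions = response["Forecast"]["Predictions"]
    for quantile in sorted(predictions):
        for point in predictions[quantile]:
            rows.append((quantile, point["Timestamp"], point["Value"]))
    return rows


# Hand-written stand-in with made-up values, not real service output
sample_response = {
    "Forecast": {
        "Predictions": {
            "p10": [{"Timestamp": "2012-12-31T00:00:00", "Value": 5.0}],
            "p50": [{"Timestamp": "2012-12-31T00:00:00", "Value": 9.0}],
            "p90": [{"Timestamp": "2012-12-31T00:00:00", "Value": 14.0}],
        }
    }
}

for row in flatten_predictions(sample_response):
    print(row)
```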

In [None]:
fname = 'data/bike_small.csv'
exact = util.load_exact_sol(fname, item_id)

In [None]:
util.plot_forecasts(response, exact)
plt.title("Auto Predictor Forecast");

# Step 9: Export a Forecast<a class="anchor" id="exportForecast"></a>

Forecasts can be exported to your own S3 bucket of choice.  You may need to use these in downstream Supply Chain processes.  Or, perhaps you just want to import them into a BI tool to visualize and socialize the results.

In [None]:
forecast_export_name = f"{PROJECT}_{DATA_VERSION}_forecast_export"
forecast_export_path = f"{s3_data_path}/{forecast_export_name}"

In [None]:
response = forecast.create_forecast_export_job(ForecastExportJobName=forecast_export_name,
                                        ForecastArn=forecast_arn,
                                        Destination={
                                            "S3Config" : {
                                                "Path": forecast_export_path,
                                                "RoleArn": role_arn
                                            }
                                        })
forecast_export_arn = response['ForecastExportJobArn']
forecast_export_arn

# Step 10: Clean up your Resources<a class="anchor" id="cleanup"></a>

Once we have completed the above steps, we can start to clean up the resources we created. All delete jobs, except for `delete_dataset_group`, are asynchronous, so we have added the helpful `wait_till_delete` function.
Resource limits are documented <a href="https://docs.aws.amazon.com/forecast/latest/dg/limits.html">here</a>.
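
Because deletions are asynchronous, a resource may still exist for a while after the delete call returns, so the cleanup waits for each deletion to finish before moving on. The sketch below shows one way a helper like `wait_till_delete` could work: keep checking until the service reports that the resource is gone. It is a guess at the pattern rather than the utility's actual code; `wait_until_gone`, `ResourceGone`, and the fake resource are all hypothetical stand-ins so the example runs without AWS access.

```python
import time


class ResourceGone(Exception):
    """Stand-in for the service's 'resource not found' error."""


def wait_until_gone(check_fn, sleep_seconds=10, max_polls=60):
    """Call check_fn repeatedly until it raises ResourceGone."""
    for _ in range(max_polls):
        try:
            check_fn()
        except ResourceGone:
            return True  # the resource no longer exists
        time.sleep(sleep_seconds)
    return False  # still present after max_polls checks


class FakeResource:
    """Pretends to take a few polls before the deletion completes."""

    def __init__(self, polls_until_gone):
        self.remaining = polls_until_gone

    def describe(self):
        if self.remaining <= 0:
            raise ResourceGone()
        self.remaining -= 1
        return {"Status": "DELETE_IN_PROGRESS"}


resource = FakeResource(polls_until_gone=3)
print(wait_until_gone(resource.describe, sleep_seconds=0))
```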

### This needs to be un-commented for clean-up

In [None]:
# # Delete forecast export jobs
# util.wait_till_delete(lambda: forecast.delete_forecast_export_job(ForecastExportJobArn = forecast_export_arn))

# # Delete forecasts
# util.wait_till_delete(lambda: forecast.delete_forecast(ForecastArn = forecast_arn))

# # Delete predictors
# util.wait_till_delete(lambda: forecast.delete_predictor(PredictorArn = predictor_arn))

# # Delete the target time series and related time series dataset import jobs
# util.wait_till_delete(lambda: forecast.delete_dataset_import_job(DatasetImportJobArn=ts_dataset_import_job_arn))
# util.wait_till_delete(lambda: forecast.delete_dataset_import_job(DatasetImportJobArn=rts_dataset_import_job_arn))

# # Delete the target time series and related time series datasets
# util.wait_till_delete(lambda: forecast.delete_dataset(DatasetArn=ts_dataset_arn))
# util.wait_till_delete(lambda: forecast.delete_dataset(DatasetArn=rts_dataset_arn))

# # Delete dataset group
# util.wait_till_delete(lambda: forecast.delete_dataset_group(DatasetGroupArn=dataset_group_arn))

# # Delete your files in S3 (key is a prefix, so remove every object stored under it)
# boto3.Session().resource('s3').Bucket(bucket_name).objects.filter(Prefix=f"{key}/").delete()


## Next Steps<a class="anchor" id="nextSteps"></a>

Congratulations!! You've trained your first Amazon Forecast model and generated your first forecast!!

To dive deeper, here are a few options for further evaluation:
<ul>
    <li>An example of how to use a notebook and Predictor Backtest Forecasts to evaluate all items at once using custom metrics: <a href="https://github.com/aws-samples/amazon-forecast-samples/tree/master/notebooks/advanced/Item_Level_Accuracy" target="_blank">Item_Level_Accuracy notebook</a></li>
    <li>An example of how to use our built-in, AWS-hosted weather data: <a href="https://github.com/aws-samples/amazon-forecast-samples/blob/master/notebooks/advanced/Weather_index" target="_blank">Training your model with Weather Index</a></li>
    <li>Finally, for a production-level example of how to use Amazon QuickSight to visualize Predictor Backtest Forecasts and/or Forecasts so you can share and socialize the results with others, <a href="https://aws.amazon.com/solutions/implementations/improving-forecast-accuracy-with-machine-learning/?did=sl_card&trk=sl_card" target="_blank">see our automation solution Improving Forecast Accuracy</a></li>
    <li><a href="https://console.aws.amazon.com/cloudformation/home?region=us-east-1#/stacks/new?stackName=forecast-stack&t[…]acy-with-machine-learning-demo.template" target="_blank">Quick launch link for above automation</a></li>
    </ul>