Shipping and Distribution Demand Forecasting

Outline

  1. Introduction
  2. Architecture
  3. Technical details and workflow
  4. Provisioned Resources
  5. Data Schema
  6. Simulated Data
  7. Loading your own data
  8. Solution Customization

Introduction

The Shipping and Distribution Demand Forecasting Solution uses historical demand time series to forecast demand in future periods. For instance, a shipping or delivery company wants to predict the quantities of the different products its customers will commit to at future times. Similarly, a vendor or insurer wants to know the number of products that will be returned due to failure over the course of a year. A company can use these forecasts as input to an allocation tool that optimizes delivery vehicles or routes, or to plan capacity in the longer term.

The following characteristics are common to all of these forecasting cases:

  • There are numerous kinds of items with differing volumes that roll up under one or more category levels.
  • There is a history available for the quantity of the item at each time in the past.
  • The volumes of the items differ widely, with possibly a substantial number that have zero volume at times.
  • The history of items shows both trend and seasonality, possibly at multiple time scales.
  • The quantities committed or returned are not strongly price sensitive. In other words, the delivery company cannot strongly influence quantities by short-term changes in prices, although there may be other determinants that affect volume, such as weather.

Under these conditions we can take advantage of the hierarchy formed among the time series of the different items. By enforcing consistency so that the quantities lower in the hierarchy (e.g. individual product quantities) sum to the quantities above (customer product totals) we improve the accuracy of the overall forecast. The same applies if individual items are grouped into categories, even possibly categories that overlap. For example, one might be interested in forecasting demand of all products in total, by location, by product category, by customer, etc.

This solution computes forecasts at all aggregation levels in the hierarchy for each time period specified. For simplicity, we will refer to both hierarchical and grouped time series as "hierarchical time series."

Automatically installing solutions in the Cloud

A "solution" refers to an assembly of Azure resources, such as predictive services, cloud storage an so on, that consitute an application. The Deploy button on this page runs a service that automates the set of steps to create a runnable copy of this application in your Azure subscription. If you don't already have a subscription, then you need to sign up for one on Azure first.

In addition, if you are a System Integrator or ISV, you can customize this solution to build derivative applications on Azure for your clients' specific needs. You can find the complete source for this "Cortana Intelligence Quick Start" (CIQS) installer in this GitHub repository under Deployment.

The fundamentals of hierarchical forecasting

Time series data can often be disaggregated by attributes of interest to form groups of time series or a hierarchy. For example, one might want to forecast demand for all products in total, by location, by product category, by customer, and so on. Forecasting grouped time series is challenging because the generated forecasts need to satisfy an aggregation requirement: lower-level forecasts must sum up to the higher-level forecasts.

The strength of this solution is that it generates forecasts that satisfy this aggregation requirement while being more accurate than independently generated forecasts. Many approaches have been proposed for this problem, differing in how they reconcile the individual time series forecasts across the groups or the hierarchy. This solution exposes the reconciliation approach as a parameter, offering bottom-up, top-down and an optimal combination approach (Hyndman et al.).
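
To make the reconciliation choices concrete, here is a minimal sketch (not part of the deployed solution) using the 'hts' R package on hypothetical bottom-level series; the method names match the solution parameters described later, bottom-up ("bu") versus optimal combination ("comb"):

# Minimal sketch using the 'hts' package; the series and grouping below are hypothetical.
library(hts)

set.seed(42)
# Four bottom-level monthly demand series: two customers x two product categories, 36 months
bottom <- ts(matrix(rpois(4 * 36, lambda = 20), ncol = 4),
             frequency = 12, start = c(2014, 1))
colnames(bottom) <- c("Contoso_Plastics", "Contoso_Metals",
                      "Protoso_Plastics", "Protoso_Metals")

# Grouping matrix: row 1 assigns each series to a customer, row 2 to a product category
groups <- rbind(Customer = c(1, 1, 2, 2),
                Product  = c(1, 2, 1, 2))
demand_gts <- gts(bottom, groups = groups)

# Bottom-up reconciliation with ARIMA base forecasts, 3-month horizon
fc_bu <- forecast(demand_gts, h = 3, method = "bu", fmethod = "arima")

# Optimal combination (Hyndman et al.), weighted by the number of series at each node
fc_comb <- forecast(demand_gts, h = 3, method = "comb",
                    fmethod = "arima", weights = "nseries")

# Forecasts at every aggregation level: total, per customer, per product, and bottom level
aggts(fc_comb)

The ForecastParameters table described in the Data Schema section exposes exactly these choices through the GTSorHTS, CombiningGTSMethod, UnivariateTSMethod and GTSCombWeights parameters.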

Customizations and limitations

The solution "out of the box" takes the monthly historical data you provide, and predicts the demand broken out by three levels of hierarchy, customer, product and, destination. Therefore, the solution assumes:

  • a specific data schema (described in more detail below in Data Schema section),
  • a monthly demand time series.

Note that the time series forecasting performs only very basic data processing, such as removing time series with missing information or imputing unknown quantities. Beyond that, no aggregation of the data or any other processing is done; the data needs to be aggregated to the monthly level before it is included in the solution.
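
As an illustration, rolling raw order records up to the monthly level could look like the following minimal R sketch (the orders data frame and its values are hypothetical):

# Hypothetical raw order records, one row per order
orders <- data.frame(
  CustomerName    = "Contoso",
  ProductCategory = "Plastics",
  Destination     = "Europe",
  Date            = as.Date(c("2017-01-03", "2017-01-17", "2017-02-02")),
  Quantity        = c(20, 30, 42)
)

# Snap each order date to the first day of its month, then sum quantities per group and month
orders$Date <- as.Date(format(orders$Date, "%Y-%m-01"))
monthly <- aggregate(Quantity ~ CustomerName + ProductCategory + Destination + Date,
                     data = orders, FUN = sum)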

A number of configurations can be made through the exposed solution parameters. These parameters are stored in the SQL table (described in the Data Schema section) and passed on to the Azure Machine Learning web service that generates the forecasts. Some of these parameters are the starting and ending dates of the historical demand, the number of periods to forecast, the structure of the hierarchy or grouping rules, and so on.

If you want to make more extensive customizations to the machine learning web service, we provide the R code used in the web service. For example, modifying the web service to forecast weekly (rather than monthly) demand time series requires changes to the R code. Note that more extensive modifications, such as changes to the web service inputs and outputs, may also require changes to the auxiliary AML files and the SQL data tables where the data resides.
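
The web service's R code is not reproduced here; purely as a hypothetical illustration of the kind of change involved, monthly and weekly series differ in the seasonal frequency used when the series are constructed in R:

# Hypothetical illustration only; the actual web-service code will differ.
# Monthly demand series: 12 periods per seasonal cycle
monthly_series <- ts(rpois(36, lambda = 20), frequency = 12, start = c(2014, 1))

# A weekly version would use 52 periods per cycle, plus matching changes wherever
# the code assumes months (e.g. forecast horizons and date handling)
weekly_series <- ts(rpois(104, lambda = 20), frequency = 52, start = c(2014, 1))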

Architecture

The following chart describes the solution architecture.

Solution Architecture

The solution uses five types of resources hosted and managed in Azure:

  • Azure SQL Server instance (Azure SQL) for persistent storage,
  • Azure Machine Learning (AML) webservice to host the R forecasting code,
  • Azure Blob Storage for intermediate storage of generated forecasts,
  • Azure Data Factory (ADF) that orchestrates regular runs of the AML model,
  • Power BI dashboard to display and drill down on the forecasts.

Technical details and workflow

The deployment goes through several provisioning and setup steps, using a combination of Azure Resource Manager (ARM) templates and Azure Functions. ARM templates are JSON files that define the resources you need to deploy for your solution. Azure Functions is a serverless compute service that enables you to run code on-demand without having to explicitly provision or manage infrastructure. We will describe ARM templates and Azure Functions used in this solution in later sections.

The resources deployed in this solution are provisioned in the following order:

  1. Setup SQL Server Account: The first step in the workflow sets up a SQL server account. The user is prompted for credentials, and asked to specify the following parameters:

    • sqlServerUser: SQL Server username
    • sqlServerPasswd: SQL Server password

    [NOTE] The server admin login and password that you specify here are required to log in to the server and its databases. Remember or record this information for later use.

  2. Deploying SQL Server Resources: In this step, an Azure SQL Server and an Azure SQL Database are provisioned. These resources are described in the SqlDeploy.json ARM template. The Azure SQL Database is used to store the historical demand data and the prediction results received from the Azure Machine Learning service.

  3. Configuring Blob Storage: In this step, an Azure Storage account is configured to store: 1) the R code used for forecasting, which is turned into the Azure Machine Learning web service later in the process, and 2) intermediate forecasting results, before they are loaded into the Azure SQL Database. This step creates a container for ADF intermediate storage and a container for the Azure Machine Learning R module, and copies the module zip file to the latter container. This is accomplished with the BlobStorageSetup Azure Function.

  4. Creating SQL Database Objects: In this step, Azure Function SqlDbSetup creates all necessary database objects: tables and stored procedures. In addition, it populates the demand SQL table with a sample data set, stored in a sample file, ExampleDemandData.csv.

  5. Deploying ML Webservice: Once the R code used for forecasting is copied to Azure Storage, this step creates an Azure Machine Learning service by executing the FcastMlDeploy.json ARM template. This template deploys the R code as a Custom R Module in Azure ML Studio and links the web service inputs and outputs to the relevant storage accounts. The resulting AML webservice is used to generate the demand forecasts.

  6. Retrieving ML Webservice Credentials: In this step of the deployment process, we retrieve the ML webservice API key. This is accomplished with the GetMLApiKey Azure Function. The webservice API key is used later by the Azure Data Factory that schedules the calls to the webservice.

  7. Deploying Data Factory: Azure Data Factory handles the orchestration and scheduling of the monthly demand forecast. In this step of the deployment process, we set up the Data Factory pipeline: we define the resources it orchestrates, provide their credentials, define the scheduling timeline, and so on. This is accomplished with the DataFactoryDeploy.json ARM template. At this point the Data Factory is not active yet; we will activate it as the last step of the deployment process.

  8. Activating Data Factory Pipelines: Finally, the Azure Function StartPipelines activates the Data Factory and triggers the activities it performs, most notably the generation of the monthly forecast on the example data set.

Provisioned Resources

Once the solution is deployed to the subscription, you can see the deployed services by clicking the resource group name on the final deployment screen in CIQS.

CIS resource group link

This shows all the resources under the resource group in the Azure management portal. After successful deployment, the entire solution is started automatically in the cloud. You can monitor its progress from the following resources.

Azure Functions

Several Azure Functions are created during the deployment to set up and start certain Azure services. The tasks performed by these functions are described in the Technical Details and Workflow section above. You can monitor each function's progress by clicking the link on your deployment page.

  • BlobStorageSetup: Sets up Azure blob storage.
  • GetMLApiKey: Retrieves the ML webservice key.
  • SqlDbSetup: Sets up SQL database objects.
  • StartPipelines: Activates data factory pipeline.

Azure Data Factory

Azure Data Factory is used to schedule runs of the machine learning model. You can monitor the data pipelines by clicking the link on your deployment page.

Azure SQL Database

Azure SQL Database is used to store the data and the forecast results. Use the SQL server and database names shown on the last page of the deployment, together with the username and password you set up at the beginning of the deployment, to log in to your database and check the results.
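
For example, one quick way to spot-check the results from R is to query the StatisticalForecast table directly. This is a minimal sketch that assumes the DBI and odbc packages and an installed SQL Server ODBC driver; replace the placeholders with your deployment's server, database, username, and password:

# Minimal sketch; replace the placeholders with your deployment's values
library(DBI)
library(odbc)

con <- dbConnect(odbc::odbc(),
                 Driver   = "ODBC Driver 17 for SQL Server",
                 Server   = "<server>.database.windows.net",
                 Database = "<database>",
                 UID      = "<user>",
                 PWD      = "<passwd>",
                 Port     = 1433)

# Inspect the most recently generated forecasts
dbGetQuery(con, "SELECT TOP 10 * FROM StatisticalForecast ORDER BY ForecastDate DESC")
dbDisconnect(con)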

Azure Machine Learning Web Service

You can view the forecasting model as a machine learning experiment by navigating to your Machine Learning workspace. The machine learning model is deployed as an Azure Web Service that is invoked by the Azure Data Factory. You can view the web service API manual by clicking the link on your deployment page.

Power BI Dashboard

Once the solution is deployed, the forecasting web service generates forecasts for a simulated sample data set, included with this solution. To view the generated forecasts, and drill down by customer, product, and destination, you can click on the Power BI dashboard link on your deployment page.

Data Schema

Here we describe the valid data schemas and fields for SQL tables used by the solution.

The solution uses four database tables: HistoricalOrders, ForecastParameters, StatisticalForecast, and ForecastHistory. It reads historical orders data from the HistoricalOrders table and a set of forecasting parameters from the ForecastParameters table. These two data sources feed the forecasting model, which generates future demand forecasts and a set of evaluation metrics on the historical data. The generated forecasts are written into the StatisticalForecast table, while the evaluation metrics and additional logging information are written into the ForecastHistory table.

The figure below shows the data sources to the forecasting model and the respective SQL tables.

DBSourceFlow

Here we provide more information about the tables used, their columns, valid values and constraints.

Input table 1 - Historical orders

Table HistoricalOrders contains data on historical orders. Field descriptions are provided in the table below. All fields in this table except Date and Quantity are used as disaggregation variables for the grouped (or hierarchical) time series. Date provides the time series time points, and Quantity provides the time series values used for modeling and forecasting.

NOTE: Historical orders need to be aggregated to the monthly level, as the solution currently performs only monthly time series forecasting, and does not perform any aggregation itself.

| Field | Description | Type | Example | Comments |
| --- | --- | --- | --- | --- |
| CustomerName | Individual customer name | Text | Contoso | Key |
| ProductCategory | Product category being distributed | Text | Plastics | Key |
| Destination | Region or country of destination | Text | Europe | Key |
| Date | Date of the order | Date | 01-01-2017 | Must be aggregated to the monthly level. The current solution only allows monthly order data, so dates will always be the first of the month (e.g. 01-01-2017, 01-02-2017, etc.). Key |
| Quantity | Historical order quantity | Numeric | 50 | This is the field the forecasting model will forecast. |

The following columns form the composite primary key for HistoricalOrders table: (CustomerName, ProductCategory, Destination, Date).

Input table 2 - Forecasting parameters

Table ForecastParameters contains parameters that define the forecasting approach used to generate the forecasts. Field descriptions are given in the table below. In particular, the parameters that define the time series approaches used for forecasting originate from the 'hts' R package, which is used for hierarchical or grouped time series forecasting. For a more detailed overview (beyond what is provided in the table below), please refer to the package documentation or the book Forecasting: Principles and Practice (Chapter 9.4).

| Field | Description | Type | Example | Comments |
| --- | --- | --- | --- | --- |
| ForecastParametersId | Unique identifier for each set of forecasts produced by the forecasting model | Text | AUG2017 | Key |
| EarliestOrderHistoryDate | Earliest order date to include | Date | 07-01-2014 | Can be used to filter out noisy data from training (for example, earlier data points that do not reflect the most recent trends and can degrade the time series analysis). |
| LatestOrderHistoryDate | Latest order date to include | Date | 04-01-2017 | Similarly, can be used to filter out later data points from training (for example, the current month can be excluded from training, as its orders are not complete yet). Note that the forecasting model will generate forecasts from this date on. |
| ForecastHorizonMonths | Number of months to forecast (forecasting horizon) | Numeric | 3 | The forecasting model will generate forecasts for ForecastHorizonMonths from the LatestOrderHistoryDate. |
| EvaluationWindow | Number of months of history used to compute evaluation metrics | Numeric | 5 | The forecasting model returns evaluation metrics based on the historical data, using the most recent months specified by this parameter. For example, for EvaluationWindow = 3, forecast evaluation is performed on the last 3 months. |
| GTSorHTS | Whether the data is treated as grouped or hierarchical time series | Text | gts | Valid values: "gts" (grouped time series), "hts" (hierarchical time series). |
| CombiningGTSMethod | Method used to combine forecasts across the groups or hierarchy | Text | bu | Valid values: "bu" (bottom-up approach), "comb" (optimal combination approach as described in Hyndman et al. (2015)). |
| UnivariateTSMethod | Univariate forecasting method used for the individual time series | Text | arima | Valid values: "ets" (exponential smoothing, ETS), "arima" (ARIMA). |
| GTSCombWeights | Weights used by the optimal combination method ("comb") | Text | nseries | Valid values: "ols", "wls", "nseries". "ols" uses an unweighted combination, "wls" uses weights based on forecast variances, and "nseries" uses weights based on the number of series aggregated at each node. |

The following column forms the primary key for ForecastParameters table: ForecastParametersId.
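
For concreteness, here is one hypothetical, self-consistent parameter set expressed as an R data frame. The values mirror the examples above, except that CombiningGTSMethod is set to "comb" so that GTSCombWeights applies:

# Hypothetical parameter set matching the ForecastParameters schema
forecast_parameters <- data.frame(
  ForecastParametersId     = "AUG2017",
  EarliestOrderHistoryDate = as.Date("2014-07-01"),
  LatestOrderHistoryDate   = as.Date("2017-04-01"),
  ForecastHorizonMonths    = 3,          # forecast 3 months past LatestOrderHistoryDate
  EvaluationWindow         = 5,          # evaluate on the last 5 months of history
  GTSorHTS                 = "gts",      # grouped time series
  CombiningGTSMethod       = "comb",     # optimal combination reconciliation
  UnivariateTSMethod       = "arima",    # ARIMA base forecasts
  GTSCombWeights           = "nseries"   # combination weights by number of series per node
)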

Output table 1 - Forecast output

When the forecasting model generates forecasts, they are stored in the StatisticalForecast table. See the table below for field descriptions. Forecasts are generated at the disaggregated level specified by the fields in the input table HistoricalOrders, so this table has a schema very similar to that of the HistoricalOrders table. Note that each time new forecasts are generated, they are appended to this table and are uniquely identified by the ForecastParametersId and ModelVersion parameters (ForecastParametersId links back to the ForecastParameters table).

| Field | Description | Type | Example | Comments |
| --- | --- | --- | --- | --- |
| CustomerName | Individual customer name | Text | Contoso | Key |
| ProductCategory | Product category being distributed | Text | Plastics | Key |
| Destination | Region or country of destination | Text | Europe | Key |
| ForecastDate | Month being forecast | Date | 01-01-2017 | Forecasts are generated at the monthly level, so dates will always be the first of the month (e.g. 01-01-2017, 01-02-2017, etc.). Key |
| Quantity | Forecasted order quantity | Numeric | 50 | This is the quantity predicted by the forecasting model. |
| ForecastParametersId | Unique identifier for each set of forecasts produced by the forecasting model | Text | AUG2017 | Key |
| ModelVersion | Forecasting model version | Text | 1.2.0 | Key |

The following columns form the composite primary key for the StatisticalForecast table: (ForecastParametersId, ModelVersion, CustomerName, ProductCategory, Destination, ForecastDate).

Output table 2 - Forecast history and evaluation

Table ForecastHistory contains information about forecasting runs. This is a reference table that links generated forecasts to the specific parameters and model version used (via ForecastParametersId and ModelVersion). Additionally, this table provides evaluation metrics for the forecasting run (computed on historical data), as well as the log generated by the forecasting model. See the following table for more information about the ForecastHistory fields.

| Field | Description | Type | Example | Comments |
| --- | --- | --- | --- | --- |
| ForecastParametersId | Unique identifier for each set of forecasts produced by the forecasting model | Text | AUG2017 | Key |
| ModelVersion | Forecasting model version | Text | 1.2.0 | Key |
| MLCallDate | DateTime of the call to the forecasting model | Date | 12-01-2017 | |
| RMSE | Root mean squared error | Numeric | 2.34 | |
| MAE | Mean absolute error | Numeric | 1.39 | |
| MPE | Mean percentage error | Numeric | 6.05 | |
| MAPE | Mean absolute percentage error | Numeric | 2.5 | |
| MASE | Mean absolute scaled error | Numeric | 3.45 | |
| SMAPE | Symmetric mean absolute percentage error | Numeric | 2.67 | |
| MLCallLog | Forecasting log | Text | | This log is generated by the forecasting model, and can be used to track down any issues or look up more information about the forecasting run. |

The following columns form the composite primary key for ForecastHistory table: (ForecastParametersId, ModelVersion).
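
The metrics above follow their common definitions. The minimal R sketch below computes them on hypothetical actual and predicted values; the solution's R code may differ in details such as the scaling series used for MASE:

# Hypothetical actuals and predictions over a 5-month evaluation window
actual    <- c(50, 42, 17, 63, 38)
predicted <- c(47, 45, 20, 60, 35)
err <- actual - predicted

rmse  <- sqrt(mean(err^2))                                          # Root mean squared error
mae   <- mean(abs(err))                                             # Mean absolute error
mpe   <- 100 * mean(err / actual)                                   # Mean percentage error
mape  <- 100 * mean(abs(err / actual))                              # Mean absolute percentage error
smape <- 100 * mean(2 * abs(err) / (abs(actual) + abs(predicted)))  # Symmetric MAPE
# MASE divides the absolute errors by the in-sample MAE of a naive benchmark forecast;
# see accuracy() in the 'forecast' R package for one standard implementation.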

Simulated Data

The solution comes with an example demand data set pre-loaded into the provisioned SQL database. This example data set is generated by an R script, and loaded into the SQL table containing historical demand during the deployment process. The R script simulates grouped time series over a number of customers, products and destinations, and can be found here.
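
The exact simulation logic lives in ExampleDataGen.R. Purely as an illustration of the idea, a single grouped monthly demand series with trend and seasonality could be simulated along these lines (hypothetical code, not the script itself):

# Hypothetical illustration: simulate one customer/product/destination combination
months   <- seq(as.Date("2014-01-01"), as.Date("2016-12-01"), by = "month")
n        <- length(months)
trend    <- seq(20, 45, length.out = n)                           # slow upward trend
seasonal <- 10 * sin(2 * pi * (1:n) / 12)                         # yearly seasonality
quantity <- pmax(0, round(trend + seasonal + rnorm(n, sd = 5)))   # no negative demand

simulated <- data.frame(
  CustomerName    = "Contoso",
  ProductCategory = "Plastics",
  Destination     = "China",
  Date            = months,
  Quantity        = quantity
)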

NOTE: these instructions assume you have R installed on your machine.

Generating new simulated data

To load new simulated data, modify the following lines in the ExampleDataGen.R script to include your SQL Server credentials.

# SQL Server credentials
myServer <- "<server>.database.windows.net"
myUser <- "<user>"
myPassword <- "<passwd>"

The SQL Server username and password are the credentials you created during the deployment. You can find the SQL Server name in the Azure portal by navigating to the provisioned Azure SQL Server under the resource group with the deployment name you provided at the start of the deployment.

To generate different data groups, you can modify the following variables:

# Default time series groups
customerList <- c("Contoso","Protoso")
productCategories <- c("Plastics","Metals")
destinationList <- c("China","United States","India")

Once you are done modifying the R script, you can open any R IDE or an R console and execute the script by typing:

setwd("<rscript_directory_path>")
source("ExampleDataGen.R")

where the <rscript_directory_path> refers to the directory where you downloaded the ExampleDataGen.R script.

Resetting Azure Data Factory pipelines

After the new data is loaded into the SQL database, you will need to reset the Azure Data Factory to re-run the demand forecast pipelines on the new data. You can do this from the Azure portal by going to the deployed Azure Data Factory and resetting the pipelines manually. However, we provide a PowerShell script, reset_slices.ps1, that lets you reset the ADF pipelines automatically.

To execute the script, download it to your local machine, open Windows PowerShell, and run the following command:

cd <psscript_directory_path>
.\reset_slices.ps1 <subscription_id> <resource_group_name> <data_factory_name>

where psscript_directory_path is the location of the PowerShell script, subscription_id is your Azure subscription ID, resource_group_name is the resource group name of the solution, and data_factory_name is the name of the Azure Data Factory deployed by the solution. You can find this information in the Azure portal by navigating to the resource group with the deployment name you provided at the start of the deployment.

Loading Your Own Data

If you would like to generate forecasts on your own hierarchical demand data, you can load your data into the Azure SQL database created by the solution. There are many ways to do this; here we highlight two ways to load data into the Azure SQL Database, one using the bcp command-line utility and the other using Azure Data Factory (an R-based alternative is also sketched after the list below).

  • Load data from CSV into Azure SQL Database (flat files)

    You can use the bcp (bulk copy) command-line utility to import data from a CSV file into Azure SQL Database. This document describes how to do that, from creating the destination table (which is already created in the solution), to creating a source data file, to loading the data using the bcp command-line utility.

  • Copy data from Blob Storage to SQL Database using Azure Data Factory

    You can also create a data factory with a pipeline to copy data from Blob storage to SQL database. The Copy Activity performs the data movement in Azure Data Factory. It is powered by a globally available service that can copy data between various data stores in a secure, reliable, and scalable way. This document walks through a simple example of copying some data from Azure blob to Azure SQL table.
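
As a lightweight alternative to either of the methods above, a CSV of monthly historical orders can also be appended to the HistoricalOrders table from R. This is a minimal sketch that assumes the DBI and odbc packages; the connection placeholders and the demand.csv file name are hypothetical:

# Minimal sketch: append a CSV of monthly historical orders to the HistoricalOrders table
library(DBI)
library(odbc)

con <- dbConnect(odbc::odbc(),
                 Driver   = "ODBC Driver 17 for SQL Server",
                 Server   = "<server>.database.windows.net",
                 Database = "<database>",
                 UID      = "<user>",
                 PWD      = "<passwd>",
                 Port     = 1433)

new_orders <- read.csv("demand.csv", stringsAsFactors = FALSE)  # columns must match the HistoricalOrders schema
new_orders$Date <- as.Date(new_orders$Date)                     # dates must already be monthly (first of month)

dbWriteTable(con, "HistoricalOrders", new_orders, append = TRUE)
dbDisconnect(con)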

Note that after the new data is loaded into the SQL database, you will need to reset the Azure Data Factory to re-run the demand forecast pipelines on the new data. You can do so by following the steps outlined in the previous section.

Solution Customization

If you would like to customize the solution to your data and business needs, beyond what can be done through R code customization, you can customize the ARM template files used in the automated deployment.

  • SqlDeploy.json: Deploys Azure SQL Server
  • FcastMlDeploy.json: Deploys ML Webservice
  • DataFactoryDeploy.json: Sets up Data Factory pipelines

We described these templates in some detail in the Technical Details and Workflow section above.

In general, there are four main sections in an ARM template: parameters, variables, resources, and outputs (as shown below). Parameters are values provided when deployment is executed to customize resource deployment. Variables are values used as JSON fragments in the template to simplify template language expressions. The resources section is the core of the template; it lists all resource types that are deployed or updated in the resource group. Outputs are values returned after deployment.

{
  "$schema": "http://schema.management.azure.com/schemas/2015-01-01/deploymentTemplate.json#",
  "contentVersion": "1.0.0.0",
  "parameters": {...},
  "variables": {...},
  "resources": [...],
  "outputs": {...}
}

For example, let's say we'd like to change our Azure Machine Learning Commitment Plan, say, upgrade it to the Standard S2 plan. We can do that by modifying the FcastMlDeploy.json ARM template. The Machine Learning plan is defined under variables and then used under resources to specify the plan to be provisioned. To upgrade from the Standard S1 to the Standard S2 plan, we would change the variable "planSkuName" to S2, as shown below.

"variables": {
  "namePrefix": "[resourceGroup().name]",
  "uniqueNamePrefix": "[toLower(concat(variables('namePrefix'), uniqueString(subscription().subscriptionId)))]",
  "location": "[resourceGroup().location]",
  "planName": "[concat(variables('uniqueNamePrefix'),'plan')]",
  "planSkuName": "S2",
  "planSkuTier": "Standard",
  "mlWebServiceName": "[concat(variables('uniqueNamePrefix'),'fcastmlsvc')]",
  "mlWebServiceTitle": "ML Forecast Service"
},
"resources": [
  {
    "apiVersion": "2016-05-01-preview",
    "name": "[variables('planName')]",
    "type": "Microsoft.MachineLearning/CommitmentPlans",
    "location": "[variables('location')]",
    "sku": {
      "name": "[variables('planSkuName')]",
      "tier": "[variables('planSkuTier')]",
      "capacity": "1"
    },
    "properties": {
   }
  },

For more information about ARM templates, please refer to the Azure Resource Manager documentation page.