Copyright (c) Microsoft Corporation. All rights reserved.

Licensed under the MIT License.


# Automated Machine Learning
**Distributed TCN Forecasting**

---

## Introduction

This notebook demonstrates demand forecasting for the UCI Electricity Dataset using AutoML, where the objective is to predict electricity consumption 24 hours ahead for each station. The original dataset lists electricity consumption every 15 minutes; we aggregated it to an hourly frequency.

AutoML highlights here include using TCNForecaster (a deep learning model for forecasting) in a distributed fashion, Remote Inferencing, and working with the `forecast` function. Please also look at the additional forecasting notebooks, which document lagging, rolling windows, forecast quantiles, other ways to use the forecast function, and forecaster deployment.

**Contents**:

1. Creating an Experiment in an existing Workspace
2. Configuration and remote run of AutoML for a time-series model exploring DNNs
3. Evaluating the fitted model using a rolling evaluation
4. Deploying the model to a web service

## Setup


In [None]:
import pandas as pd
import logging

import azureml.core
from azureml.core import Workspace, Experiment, Dataset
from azureml.train.automl import AutoMLConfig
from azureml.exceptions import UserErrorException
from azureml.automl.core.forecasting_parameters import ForecastingParameters

This notebook is compatible with Azure ML SDK version 1.49.0 or later.

In [None]:
print("You are currently using version", azureml.core.VERSION, "of the Azure ML SDK")

Make sure you already have a <b>Workspace</b> created. If not, please refer to [this link](https://learn.microsoft.com/en-us/azure/machine-learning/quickstart-create-resources#create-the-workspace) to create a workspace. To run AutoML, you also need to create an <b>Experiment</b>. An Experiment corresponds to a prediction problem you are trying to solve, while a Run corresponds to a specific approach to the problem.

In [None]:
ws = Workspace.from_config()

experiment_name = "distributed-tcn"
experiment = Experiment(ws, experiment_name)

datastore = ws.get_default_datastore()

output = {}
output["Subscription ID"] = ws.subscription_id
output["Workspace"] = ws.name
output["Resource Group"] = ws.resource_group
output["Location"] = ws.location
output["Run History Name"] = experiment_name
output["SDK Version"] = azureml.core.VERSION
pd.set_option("display.max_colwidth", None)
outputDf = pd.DataFrame(data=output, index=[""])
outputDf.T

### Using AmlCompute
You will need to create a [compute target](https://docs.microsoft.com/azure/machine-learning/service/concept-azure-machine-learning-architecture#compute-target) for your AutoML run. In this tutorial, you use `AmlCompute` as your training compute resource.

> Note that if you have an AzureML Data Scientist role, you will not have permission to create compute resources. Talk to your workspace or IT admin to create the compute targets described in this section, if they do not already exist.

In [None]:
from azureml.core.compute import ComputeTarget, AmlCompute
from azureml.core.compute_target import ComputeTargetException

# Choose a name for your compute cluster (the VM size used below is a GPU SKU)
cpu_cluster_name = "distributed-tcn-cluster"

# Reuse the cluster if it already exists; otherwise create it
try:
    compute_target = ComputeTarget(workspace=ws, name=cpu_cluster_name)
    print("Found existing cluster, use it.")
except ComputeTargetException:
    compute_config = AmlCompute.provisioning_configuration(
        vm_size="Standard_NC6s_v3", max_nodes=6
    )
    compute_target = ComputeTarget.create(ws, cpu_cluster_name, compute_config)

compute_target.wait_for_completion(show_output=True)

## Data

We will be using the UCI Electricity dataset. The original data can be found [here](https://archive.ics.uci.edu/ml/datasets/ElectricityLoadDiagrams20112014). The data has been aggregated to an hourly frequency, and a subset of it is used. Below is a summary of the sets we will be using:

| Dataset | Start DateTime | End DateTime | No. of Timeseries |
| - | - | - | - |
| Train | 2012-01-01 00:00:00 | 2013-10-19 15:00:00 | 15 |
| Valid | 2013-10-19 16:00:00 | 2014-05-26 19:00:00 | 15 |
| Test | 2014-05-26 20:00:00 | 2014-12-31 23:00:00 | 15 |
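
The hourly aggregation mentioned above can be sketched with pandas. This is a minimal illustration on toy data, not the code used to prepare the actual files: the station names are made up, the column names simply mirror the ones used later in this notebook, and summing the four 15-minute readings per hour is an assumption (a mean would be equally plausible).

```python
import pandas as pd

# Toy 15-minute readings for two hypothetical stations (not the real UCI data).
timestamps = pd.date_range(
    "2012-01-01 00:00:00", periods=8, freq=pd.Timedelta(minutes=15)
)
raw = pd.DataFrame(
    {
        "datetime": timestamps.append(timestamps),
        "station": ["MT_001"] * 8 + ["MT_002"] * 8,
        "target": [float(i) for i in range(16)],
    }
)

# Aggregate each station separately to an hourly frequency.
# Summing is an assumption made for this sketch.
hourly = (
    raw.set_index("datetime")
    .groupby("station")["target"]
    .resample(pd.Timedelta(hours=1))
    .sum()
    .reset_index()
)
print(hourly)
```

Each station ends up with one row per hour, which is the shape the rest of the notebook expects.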

This step expects that the dataset has already been partitioned based on ``time_series_id_column_names`` (column names used to uniquely identify the time series in data that has multiple rows with the same timestamp), which means you should have:
- partitioned train data
- partitioned validation data
- test data

**If this is not the case, please run [this notebook](./data-partition.ipynb) to prepare the data only if the total size of data is in GBs. For smaller datasets, like the ones we are using, this step will handle the partitioning.**
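
To make the idea of partitioning concrete, the sketch below writes one folder per time series for a toy frame. It only illustrates the concept: the `column=value` folder naming and the `part.csv` file name are assumptions for this example and are not necessarily what the `register_dataset` helper or the data-partition notebook produce.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy frame with two hypothetical stations.
df = pd.DataFrame(
    {
        "datetime": pd.to_datetime(["2012-01-01 00:00", "2012-01-01 01:00"] * 2),
        "station": ["MT_001", "MT_001", "MT_002", "MT_002"],
        "target": [1.0, 2.0, 3.0, 4.0],
    }
)

partition_columns = ["station"]
output_dir = Path(tempfile.mkdtemp())

# Write one folder per time series, e.g. <output_dir>/station=MT_001/part.csv.
written = []
for keys, group in df.groupby(partition_columns):
    # Depending on the pandas version, a single-column key is a scalar or a 1-tuple.
    keys = keys if isinstance(keys, tuple) else (keys,)
    folder = output_dir.joinpath(
        *[f"{col}={key}" for col, key in zip(partition_columns, keys)]
    )
    folder.mkdir(parents=True, exist_ok=True)
    group.to_csv(folder / "part.csv", index=False)
    written.append(folder.name)

print(sorted(written))
```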

Once the prepared data is registered in the workspace, we can retrieve it using the ``get_by_name`` method of the ``Dataset`` class.

The following variables are specific to the dataset and must be updated accordingly:

| Variables | Description |
| - | - |
| **dataset_name** | Name for the dataset, used during data registration and when writing to the datastore. |
| **time_column_name** | Name of your time column. |
| **time_series_id_column_names** | The column names used to uniquely identify the time series in data that has multiple rows with the same timestamp. These columns will be used to partition the data. |

In [None]:
dataset_name = "electricity"
partitioned_dataset_name = f"{dataset_name}_partitioned"
time_column_name = "datetime"
time_series_id_column_names = ["station"]
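
A quick way to check that the chosen identifier columns really do identify each series is to confirm that the time column plus the identifier columns never repeat. The following is a small sketch on toy data (the station names are hypothetical); the same check could be applied to a pandas frame loaded from your own data.

```python
import pandas as pd

time_column = "datetime"
id_columns = ["station"]

# Two hypothetical stations that share the same timestamps.
toy = pd.DataFrame(
    {
        "datetime": pd.to_datetime(["2012-01-01 00:00", "2012-01-01 01:00"] * 2),
        "station": ["MT_001", "MT_001", "MT_002", "MT_002"],
        "target": [1.0, 2.0, 3.0, 4.0],
    }
)

# With the identifier columns, every (timestamp, series) pair is unique.
has_duplicates = bool(toy.duplicated(subset=[time_column] + id_columns).any())

# Without them, the same timestamp appears more than once.
duplicates_without_ids = bool(toy.duplicated(subset=[time_column]).any())

print(has_duplicates, duplicates_without_ids)
```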

In [None]:
from helper import register_dataset


try:
    train_dataset = Dataset.get_by_name(
        workspace=ws, name=f"{partitioned_dataset_name}_train"
    )
except UserErrorException:
    train_dataset = register_dataset(
        src_dir="./data/train",
        target=f"{dataset_name}/train",
        name=f"{partitioned_dataset_name}_train",
        partition_column_names=time_series_id_column_names,
    )

try:
    valid_dataset = Dataset.get_by_name(
        workspace=ws, name=f"{partitioned_dataset_name}_valid"
    )
except UserErrorException:
    valid_dataset = register_dataset(
        src_dir="./data/valid",
        target=f"{dataset_name}/valid",
        name=f"{partitioned_dataset_name}_valid",
        partition_column_names=time_series_id_column_names,
    )

try:
    test_dataset = Dataset.get_by_name(workspace=ws, name=f"{dataset_name}_test")
except UserErrorException:
    test_dataset = register_dataset(
        src_dir="./data/test",
        target=f"{dataset_name}/test",
        name=f"{dataset_name}_test",
    )

## Train

We need to create ``ForecastingParameters`` and ``AutoMLConfig`` objects.

#### ``ForecastingParameters`` arguments
To define forecasting parameters for your experiment training, you can use the ``ForecastingParameters`` class. The table below details the forecasting parameters we will be passing into our experiment.

|Property|Description|
|-|-|
|**time_column_name**|The name of your time column.|
|**forecast_horizon**|The forecast horizon is how many periods forward you would like to forecast. This integer horizon is in units of the timeseries frequency (e.g. daily, weekly).|
|**time_series_id_column_names**|The column names used to uniquely identify the time series in data that has multiple rows with the same timestamp. If the time series identifiers are not defined, the data set is assumed to be one time series.|
|**freq**|Forecast frequency. This optional parameter represents the period with which the forecast is desired, for example, daily, weekly, yearly, etc. Use this parameter for the correction of time series containing irregular data points or for padding of short time series. The frequency needs to be a pandas offset alias. Please refer to [pandas documentation](https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#dateoffset-objects) for more information.|
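
As an illustration of how the horizon and the frequency interact, the sketch below lists the timestamps that a 24-period horizon covers at an hourly frequency. It assumes the forecast starts right after the last validation timestamp from the dataset summary table in the Data section, and it uses a `pd.Timedelta` for the step rather than a string alias so that it behaves the same across pandas versions.

```python
import pandas as pd

forecast_horizon = 24
step = pd.Timedelta(hours=1)  # one period at an hourly frequency

# Last timestamp of the validation set, taken from the dataset summary table.
last_observed = pd.Timestamp("2014-05-26 19:00:00")

# The horizon covers the next `forecast_horizon` periods after the last observation.
forecast_index = pd.date_range(
    start=last_observed + step, periods=forecast_horizon, freq=step
)
print(forecast_index[0], "->", forecast_index[-1])
```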

#### ``AutoMLConfig`` arguments
Instantiate an AutoMLConfig object. This defines the settings and data used to run the experiment.

| Property                           | Description|
| :---------------                   | :------------------- |
| **task**                           | forecasting |
| **primary_metric**                 | This is the metric that you want to optimize.<br> Forecasting supports the following primary metrics <br><i>spearman_correlation</i><br><i>normalized_root_mean_squared_error</i><br><i>r2_score</i><br><i>normalized_mean_absolute_error</i> |
| **training_data**                  | Input dataset, containing both features and label column. |
| **validation_data**                | Validation dataset, containing both features and label column. |
| **test_data**                      | Test dataset, containing both features and label column. This will trigger test run at the end of the training and we can use it to view the test metrics. |
| **label_column_name**              | The name of the label column. |
| **compute_target**                 | The remote compute for training. |
| **experiment_timeout_hours**       | Maximum amount of time in hours that each experiment can take before it terminates. This is optional but provides customers with greater control on exit criteria. |
| **enable_early_stopping**          | Flag to enable early termination if the primary metric is no longer improving. |
| **verbosity**                      | The verbosity level for writing to the log file. The default is INFO or 20. Acceptable values are defined in the Python [logging library](https://docs.python.org/3/library/logging.html). **We encourage setting it to ERROR or 40 for larger datasets to avoid generating high volume of logs which could lead to experiment failure.** |
| **iterations**                     | Number of models to train. This is optional but provides customers with greater control on exit criteria. **We recommend using more than 50 completed HyperDrive runs to find the best model**. We are using a small number of iterations to reduce the runtime, potentially at the expense of accuracy.|
| **iteration_timeout_minutes**      | Maximum amount of time in minutes that the model can train. This is optional but provides customers with greater control on exit criteria. |
| **max_concurrent_iterations**      | Number of models to train in parallel. |
| **enable_dnn**                     | Enable Forecasting DNNs. |
| **allowed_models**                 | A list of model names to search for an experiment. If not specified, then all models supported for the task are used minus any specified in ``blocked_models`` or deprecated TensorFlow models. The supported models for each task type are described in the ``azureml.train.automl.constants.SupportedModels`` class. |
| **use_distributed**                | This enables distributed training and must be set to True. |
| **forecasting_parameters**         | The ``ForecastingParameters`` object defined above. |
| **max_nodes**                      | Maximum number of nodes to use in training. We encourage this value to be a multiple of max_concurrent_iterations. The multiple indicates the number of nodes that will be used by each concurrent iteration. Minimum acceptable value to kick off distributed training is 2. |

<br/>

> **Note**: Cross-validation is not currently supported by Distributed TCN, and the data must contain multiple time series.
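
The relationship between ``max_nodes`` and ``max_concurrent_iterations`` described in the table can be checked with a few lines of arithmetic. The helper below is hypothetical and not part of the SDK; it enforces the stated minimum of 2 nodes and only warns when the recommendation about multiples is not followed.

```python
import warnings


def nodes_per_iteration(max_nodes: int, max_concurrent_iterations: int) -> int:
    """Hypothetical helper: nodes available to each concurrent iteration."""
    if max_nodes < 2:
        raise ValueError("Distributed training needs at least 2 nodes.")
    if max_concurrent_iterations < 1:
        raise ValueError("max_concurrent_iterations must be at least 1.")
    if max_nodes % max_concurrent_iterations != 0:
        # The table only recommends a multiple, so this is a warning, not an error.
        warnings.warn("max_nodes is not a multiple of max_concurrent_iterations.")
    return max_nodes // max_concurrent_iterations


# Values used in this notebook: 6 nodes shared by 3 concurrent iterations.
print(nodes_per_iteration(6, 3))
```

With the settings used here, each concurrent iteration gets 2 nodes, which is the minimum needed for distributed training.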

In [None]:
target_column_name = "target"
time_column_name = "datetime"
freq = "H"
forecast_horizon = 24

forecasting_parameters = ForecastingParameters(
    time_column_name=time_column_name,
    forecast_horizon=forecast_horizon,
    time_series_id_column_names=time_series_id_column_names,
    freq=freq,
)

# To only allow the TCNForecaster we set the allowed_models parameter to reflect this.
automl_config = AutoMLConfig(
    task="forecasting",
    primary_metric="normalized_root_mean_squared_error",
    training_data=train_dataset,  # The data has to be partitioned.
    validation_data=valid_dataset,  # The data has to be partitioned.
    test_data=test_dataset,
    label_column_name=target_column_name,
    compute_target=compute_target,
    experiment_timeout_hours=1,
    enable_early_stopping=True,
    verbosity=logging.INFO,  # Set it to logging.ERROR for larger dataset
    iterations=20,
    max_concurrent_iterations=3,
    enable_dnn=True,
    allowed_models=["TCNForecaster"],
    use_distributed=True,
    forecasting_parameters=forecasting_parameters,
    max_nodes=6,
)
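
For intuition about the primary metric chosen above, here is a minimal sketch of a normalized RMSE. It assumes normalization by the range of the actual values; the exact computation AutoML performs, in particular for data with multiple time series, may differ.

```python
import numpy as np


def normalized_rmse(y_true, y_pred):
    """RMSE divided by the range of the actual values (an assumed normalization)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))
    value_range = y_true.max() - y_true.min()
    if value_range == 0:
        raise ValueError("Actual values are constant; the range is zero.")
    return rmse / value_range


score = normalized_rmse([10.0, 20.0, 30.0], [12.0, 18.0, 33.0])
print(round(score, 4))
```

Lower values indicate a better fit, and a perfect forecast scores 0.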

You can now submit a new training run. Depending on the data and the number of iterations, this operation may take several minutes.
When `show_output=True` is set, validation errors and the current status of each iteration are printed to the console and the execution is synchronous.

In [None]:
remote_run = experiment.submit(automl_config, show_output=False)

In [None]:
remote_run.wait_for_completion(show_output=False)

In [None]:
# If you need to retrieve a run that already started, use the following code
# from azureml.train.automl.run import AutoMLRun
# remote_run = AutoMLRun(experiment = experiment, run_id = '<replace with your run id>')

Displaying the run objects gives you links to the visual tools in the Azure Portal. Go try them!

### Retrieve the Best Model for Each Algorithm
Below we select the best pipeline from our iterations. The ``get_output`` method on the AutoML run returns the best run and the fitted model for the last fit invocation. Overloads of ``get_output`` allow you to retrieve the best run and fitted model for any logged metric or a particular iteration.

In [None]:
from helper import get_result_df


summary_df = get_result_df(remote_run)
summary_df
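
The selection of the best run can be tried out on a toy summary frame. The layout below (algorithm names as the index, with `run_id` and `Score` columns) is an assumption inferred from how the summary is used in this notebook, and the run IDs and the second algorithm are made up. It also assumes that a lower score is better, as is the case for normalized RMSE.

```python
import pandas as pd

# Toy stand-in for the summary frame; all values are hypothetical.
toy_summary_df = pd.DataFrame(
    {
        "run_id": ["run_tcn", "run_other"],
        "Score": [0.08, 0.15],
    },
    index=["TCNForecaster", "OtherModel"],
)

# Keep the rows with the lowest score, then look up the run ID by algorithm name.
best_rows = toy_summary_df[toy_summary_df["Score"] == toy_summary_df["Score"].min()]
best_run_id = best_rows["run_id"]["TCNForecaster"]

# Equivalent lookup of the best algorithm via idxmin.
best_algorithm = toy_summary_df["Score"].idxmin()

print(best_algorithm, best_run_id)
```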

In [None]:
from azureml.core.run import Run
from azureml.widgets import RunDetails

forecast_model = "TCNForecaster"
if forecast_model not in summary_df["run_id"]:
    forecast_model = "ForecastTCN"

best_dnn_run_id = summary_df[summary_df["Score"] == summary_df["Score"].min()][
    "run_id"
][forecast_model]
best_dnn_run = Run(experiment, best_dnn_run_id)
model_name = best_dnn_run.properties["model_name"]

In [None]:
best_dnn_run.parent
RunDetails(best_dnn_run.parent).show()

In [None]:
best_dnn_run
RunDetails(best_dnn_run).show()

## Evaluate on Test Data

We now use the best fitted model from the AutoML Run to make forecasts for the test set.  

We always score on the original dataset whose schema matches the training set schema.
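
One way to confirm that the scoring data matches the training schema is to compare column names and dtypes before submitting the inference run. The helper below is hypothetical and operates on pandas frames; the toy frames stand in for small samples of the training and test data.

```python
import pandas as pd


def schema_mismatches(train_df, test_df):
    """Hypothetical helper: columns whose presence or dtype differs between frames."""
    train_schema = train_df.dtypes.astype(str).to_dict()
    test_schema = test_df.dtypes.astype(str).to_dict()
    all_columns = sorted(set(train_schema) | set(test_schema))
    return {
        col: (train_schema.get(col), test_schema.get(col))
        for col in all_columns
        if train_schema.get(col) != test_schema.get(col)
    }


toy_train = pd.DataFrame(
    {
        "datetime": pd.to_datetime(["2012-01-01 00:00"]),
        "station": ["MT_001"],
        "target": [1.0],
    }
)
toy_test = toy_train.copy()

print(schema_mismatches(toy_train, toy_test))  # matching schemas
missing_target = schema_mismatches(toy_train, toy_test.drop(columns=["target"]))
print(missing_target)  # reports the missing column
```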

In [None]:
# preview the first 5 rows of the dataset
test_dataset.take(5).to_pandas_dataframe()

In [None]:
test_experiment = Experiment(ws, experiment_name + "_test")

In [None]:
import os
import shutil


script_folder = os.path.join(os.getcwd(), "inference")
os.makedirs(script_folder, exist_ok=True)
shutil.copy("infer.py", script_folder)

In [None]:
from helper import run_inference


test_run = run_inference(
    test_experiment,
    compute_target,
    script_folder,
    best_dnn_run,
    test_dataset,
    valid_dataset,
    forecast_horizon,
    target_column_name,
    time_column_name,
    freq,
)
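
The idea behind a rolling evaluation is that the test period is longer than one forecast horizon, so it is covered by several consecutive forecast windows. The sketch below shows only that windowing logic on toy timestamps; it is not the implementation used by `infer.py` or `run_inference`, which may roll the forecast origin differently.

```python
import pandas as pd


def rolling_windows(timestamps, horizon):
    """Split sorted timestamps into consecutive windows of at most `horizon` points."""
    ordered = pd.DatetimeIndex(timestamps).sort_values()
    return [
        ordered[start : start + horizon] for start in range(0, len(ordered), horizon)
    ]


# 60 hourly timestamps starting at the first test timestamp.
toy_test_times = pd.date_range(
    "2014-05-26 20:00:00", periods=60, freq=pd.Timedelta(hours=1)
)
windows = rolling_windows(toy_test_times, horizon=24)

for window in windows:
    print(window[0], "->", window[-1], f"({len(window)} points)")
```

The last window is shorter than the horizon whenever the test period is not an exact multiple of it.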

In [None]:
RunDetails(test_run).show()

## Operationalize

*Operationalization* means getting the model into the cloud so that others can run it after you close the notebook. We will create a Docker container running on Azure Container Instances with the model.

In [None]:
description = "AutoML UCI Electricity Forecaster"
tags = None
model = remote_run.register_model(
    model_name=model_name, description=description, tags=tags
)

print(remote_run.model_id)

### Develop the scoring script

For the deployment we need a function which will run the forecast on serialized data. It can be obtained from the best_dnn_run.

In [None]:
script_file_name = "score_forecaster.py"
best_dnn_run.download_file("outputs/scoring_file_v_1_0_0.py", script_file_name)

### Deploy the model as a Web Service on Azure Container Instance

In [None]:
from azureml.core.model import InferenceConfig
from azureml.core.webservice import AciWebservice
from azureml.core.webservice import Webservice
from azureml.core.model import Model


inference_config = InferenceConfig(
    environment=best_dnn_run.get_environment(), entry_script=script_file_name
)

aciconfig = AciWebservice.deploy_configuration(
    cpu_cores=2,
    memory_gb=4,
    tags={"type": "automl-forecasting"},
    description="Automl forecasting sample service",
)

aci_service_name = "automl-uci-elec-forecast-01"
print(aci_service_name)
aci_service = Model.deploy(ws, aci_service_name, [model], inference_config, aciconfig)
aci_service.wait_for_deployment(True)
print(aci_service.state)

In [None]:
aci_service.get_logs()

### Call the service

In [None]:
import json


X_query = test_dataset.take(100).to_pandas_dataframe()
X_query.pop(target_column_name)
# We have to convert datetime to string, because Timestamps cannot be serialized to JSON.
X_query[time_column_name] = X_query[time_column_name].astype(str)
# The Service object accepts a complex dictionary, which is internally converted to a JSON string.
# The section 'data' contains the data frame in the form of dictionary.
test_sample = json.dumps({"data": X_query.to_dict(orient="records")})
response = aci_service.run(input_data=test_sample)

try:
    res_dict = json.loads(response)
    y_fcst_all = pd.DataFrame(res_dict["index"])
    y_fcst_all[time_column_name] = pd.to_datetime(
        y_fcst_all[time_column_name], unit="ms"
    )
    y_fcst_all["forecast"] = res_dict["forecast"]
except Exception:
    # Print the raw response; res_dict is undefined if json.loads itself failed.
    print(response)

In [None]:
y_fcst_all.head()
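
The request and response handling can be tried offline with toy data. In the sketch below the query frame is made up, and the mock response is an assumption shaped to match what the parsing code above expects (an `index` list with millisecond timestamps plus a `forecast` list); the real service output may contain additional fields.

```python
import json

import pandas as pd

# Build the request payload from a toy query frame (hypothetical stations).
toy_query = pd.DataFrame(
    {
        "datetime": pd.to_datetime(["2014-05-26 20:00:00", "2014-05-26 21:00:00"]),
        "station": ["MT_001", "MT_001"],
    }
)
# Timestamps are not JSON serializable, so convert them to strings first.
toy_query["datetime"] = toy_query["datetime"].astype(str)
payload = json.dumps({"data": toy_query.to_dict(orient="records")})

# Mock response in the assumed shape; timestamps are milliseconds since the epoch.
mock_response = json.dumps(
    {
        "index": [
            {"datetime": 1401134400000, "station": "MT_001"},
            {"datetime": 1401138000000, "station": "MT_001"},
        ],
        "forecast": [101.5, 98.2],
    }
)

# Parse the response the same way as the cell above.
res = json.loads(mock_response)
toy_fcst = pd.DataFrame(res["index"])
toy_fcst["datetime"] = pd.to_datetime(toy_fcst["datetime"], unit="ms")
toy_fcst["forecast"] = res["forecast"]
print(toy_fcst)
```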

### Delete the web service if desired

In [None]:
serv = Webservice(ws, aci_service_name)
serv.delete()  # don't do it accidentally