Copyright (c) Microsoft Corporation. All rights reserved.

Licensed under the MIT License.

# Automated Machine Learning
**Distributed TCN Forecasting**

## Contents
1. [Introduction](#Introduction)
1. [Setup](#Setup)
1. [Data](#Data)
1. [Train](#Train)
1. [Evaluate](#Evaluate)

## Introduction
This notebook demonstrates demand forecasting for the Dominick's orange juice (OJ) sales dataset using AutoML.

AutoML highlights here include using Deep Learning forecasts, Arima, Prophet, Remote Execution and Remote Inferencing, and working with the `forecast` function. Please also look at the additional forecasting notebooks, which document lagging, rolling windows, forecast quantiles, other ways to use the forecast function, and forecaster deployment.

Make sure you have executed the [configuration](../../../configuration.ipynb) before running this notebook.

Notebook synopsis:

1. Creating an Experiment in an existing Workspace
2. Configuration and remote run of AutoML for a time-series model exploring Regression learners, Arima, Prophet and DNNs
3. Evaluating the fitted model using a rolling test

In [1]:
import json
import logging

import azureml.core
import pandas as pd
from azureml.automl.core.featurization import FeaturizationConfig
from azureml.core import Experiment, Workspace, Dataset
from azureml.train.automl import AutoMLConfig

In [2]:
print("You are currently using version", azureml.core.VERSION, "of the Azure ML SDK")

You are currently using version 1.39.0 of the Azure ML SDK


In [3]:
ws = Workspace.from_config()

# choose a name for the run history container in the workspace
experiment_name = "automl-ojforecasting"

experiment = Experiment(ws, experiment_name)

output = {}
output["Subscription ID"] = ws.subscription_id
output["Workspace"] = ws.name
output["SKU"] = ws.sku
output["Resource Group"] = ws.resource_group
output["Location"] = ws.location
output["Run History Name"] = experiment_name
pd.set_option("display.max_colwidth", None)
outputDf = pd.DataFrame(data=output, index=[""])
outputDf.T

Unnamed: 0,Unnamed: 1
Subscription ID,ba7979f7-d040-49c9-af1a-7414402bf622
Workspace,yuzhua-rg-easts-2
SKU,Basic
Resource Group,yuzhua-rg-eastus
Location,eastus
Run History Name,automl-ojforecasting


### Using AmlCompute
You will need to create a [compute target](https://docs.microsoft.com/azure/machine-learning/service/concept-azure-machine-learning-architecture#compute-target) for your AutoML run. In this tutorial, you use `AmlCompute` as your training compute resource.

> Note that if you have an AzureML Data Scientist role, you will not have permission to create compute resources. Talk to your workspace or IT admin to create the compute targets described in this section, if they do not already exist.

In [4]:
from azureml.core.compute import ComputeTarget, AmlCompute
from azureml.core.compute_target import ComputeTargetException

# Choose a name for your CPU cluster
amlcompute_cluster_name = "oj-cluster"

# Verify that cluster does not exist already
try:
    compute_target = ComputeTarget(workspace=ws, name=amlcompute_cluster_name)
    print("Found existing cluster, use it.")
except ComputeTargetException:
    compute_config = AmlCompute.provisioning_configuration(
        vm_size="STANDARD_D12_V2", max_nodes=12
    )
    compute_target = ComputeTarget.create(ws, amlcompute_cluster_name, compute_config)

compute_target.wait_for_completion(show_output=True)

Found existing cluster, use it.
Succeeded
AmlCompute wait for completion finished

Minimum number of nodes requested have been provisioned


## Data<a id="data"></a>
You are now ready to load the historical orange juice sales data. We will load the CSV file into a plain pandas DataFrame; the time column in the CSV is called _WeekStarting_, so it will be specially parsed into the datetime type.

In [5]:
time_column_name = "WeekStarting"
data = pd.read_csv("dominicks_OJ.csv", parse_dates=[time_column_name])

# Drop the column 'logQuantity' as it is a leaky feature.
data.drop("logQuantity", axis=1, inplace=True)

data.head()

Unnamed: 0,WeekStarting,Store,Brand,Quantity,Advert,Price,Age60,COLLEGE,INCOME,Hincome150,Large HH,Minorities,WorkingWoman,SSTRDIST,SSTRVOL,CPDIST5,CPWVOL5
0,1990-06-14,2,dominicks,10560,1,1.59,0.232865,0.248935,10.553205,0.463887,0.103953,0.11428,0.303585,2.110122,1.142857,1.92728,0.376927
1,1990-06-14,2,minute.maid,4480,0,3.17,0.232865,0.248935,10.553205,0.463887,0.103953,0.11428,0.303585,2.110122,1.142857,1.92728,0.376927
2,1990-06-14,2,tropicana,8256,0,3.87,0.232865,0.248935,10.553205,0.463887,0.103953,0.11428,0.303585,2.110122,1.142857,1.92728,0.376927
3,1990-06-14,5,dominicks,1792,1,1.59,0.117368,0.321226,10.922371,0.535883,0.103092,0.053875,0.410568,3.801998,0.681818,1.600573,0.736307
4,1990-06-14,5,minute.maid,4224,0,2.99,0.117368,0.321226,10.922371,0.535883,0.103092,0.053875,0.410568,3.801998,0.681818,1.600573,0.736307


Each row in the DataFrame holds a quantity of weekly sales for an OJ brand at a single store. The data also includes the sales price, a flag indicating if the OJ brand was advertised in the store that week, and some customer demographic information based on the store location. For historical reasons, the data also include the logarithm of the sales quantity. The Dominick's grocery data is commonly used to illustrate econometric modeling techniques where logarithms of quantities are generally preferred.    

The task is now to build a time-series model for the _Quantity_ column. It is important to note that this dataset is comprised of many individual time-series - one for each unique combination of _Store_ and _Brand_. To distinguish the individual time-series, we define the **time_series_id_column_names** - the columns whose values determine the boundaries between time-series: 

In [6]:
time_series_id_column_names = ["Store", "Brand"]
nseries = data.groupby(time_series_id_column_names).ngroups
print("Data contains {0} individual time-series.".format(nseries))

Data contains 249 individual time-series.


### Data Splitting
We now split the data into a training and a testing set for later forecast evaluation. The test set will contain the final 12 weeks of observed sales for each time-series. The splits should be stratified by series, so we use a group-by statement on the time series identifier columns.

In [7]:
n_test_periods = 12


def split_last_n_by_series_id(df, n):
    """Group df by series identifiers and split on last n rows for each group."""
    df_grouped = df.sort_values(time_column_name).groupby(  # Sort by ascending time
        time_series_id_column_names, group_keys=False
    )
    df_head = df_grouped.apply(lambda dfg: dfg.iloc[:-n])
    df_tail = df_grouped.apply(lambda dfg: dfg.iloc[-n:])
    return df_head, df_tail


train, test = split_last_n_by_series_id(data, n_test_periods)
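
To see what the split does, here is a minimal sketch on a toy DataFrame. The toy data and the helper name `split_last_n` are made up for illustration; the helper uses `cumcount` rather than `groupby.apply`, which is one alternative way to take the last `n` rows per series and is not the code used above.

```python
import pandas as pd

# Toy data (hypothetical): two series (stores 2 and 5), three weeks each.
toy = pd.DataFrame(
    {
        "WeekStarting": pd.to_datetime(["1990-06-14", "1990-06-21", "1990-06-28"] * 2),
        "Store": [2, 2, 2, 5, 5, 5],
        "Brand": ["dominicks"] * 6,
        "Quantity": [10, 20, 30, 40, 50, 60],
    }
)


def split_last_n(df, time_col, id_cols, n):
    """Return (head, tail) where tail holds the last n rows of every series."""
    df_sorted = df.sort_values(time_col)
    # cumcount(ascending=False) numbers rows from the end of each group: 0 is the last row.
    pos_from_end = df_sorted.groupby(id_cols).cumcount(ascending=False)
    is_tail = pos_from_end < n
    return df_sorted[~is_tail], df_sorted[is_tail]


toy_train, toy_test = split_last_n(toy, "WeekStarting", ["Store", "Brand"], n=1)
print(toy_test[["Store", "WeekStarting", "Quantity"]])
```

Each series contributes exactly its final week to the test set, so no series is missing from either split.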

### Upload data to datastore
The [Machine Learning service workspace](https://docs.microsoft.com/en-us/azure/machine-learning/service/concept-workspace) is paired with a storage account, which contains the default datastore. We will use it to upload the train and test data and create [tabular datasets](https://docs.microsoft.com/en-us/python/api/azureml-core/azureml.data.tabulardataset?view=azure-ml-py) for training and testing. A tabular dataset defines a series of lazily-evaluated, immutable operations to load data from the data source into a tabular representation.

In [8]:
from azureml.data.dataset_factory import TabularDatasetFactory

datastore = ws.get_default_datastore()
train_dataset = TabularDatasetFactory.register_pandas_dataframe(
    train, target=(datastore, "dataset/"), name="dominicks_OJ_train"
)
test_dataset = TabularDatasetFactory.register_pandas_dataframe(
    test, target=(datastore, "dataset/"), name="dominicks_OJ_valid"
)

Validating arguments.
Arguments validated.
Successfully obtained datastore reference and path.
Uploading file to dataset//48ced0c5-75ba-4d42-bc36-88ce97a0a405/
Successfully uploaded file to datastore.
Creating and registering a new dataset.
Successfully created and registered a new dataset.
Validating arguments.
Arguments validated.
Successfully obtained datastore reference and path.
Uploading file to dataset//be03c8a5-902c-410d-958e-f49ec23cb6bd/
Successfully uploaded file to datastore.
Creating and registering a new dataset.
Successfully created and registered a new dataset.


In [9]:
train_dataset.to_pandas_dataframe().tail()

Unnamed: 0,WeekStarting,Store,Brand,Quantity,Advert,Price,Age60,COLLEGE,INCOME,Hincome150,Large HH,Minorities,WorkingWoman,SSTRDIST,SSTRVOL,CPDIST5,CPWVOL5
23962,1992-04-16,137,tropicana,23680,0,3.19,0.209602,0.528362,10.96649,0.860739,0.092996,0.11325,0.330293,6.026484,0.705882,0.77253,0.333761
23963,1992-04-23,137,tropicana,25728,0,2.74,0.209602,0.528362,10.96649,0.860739,0.092996,0.11325,0.330293,6.026484,0.705882,0.77253,0.333761
23964,1992-04-30,137,tropicana,80384,1,2.39,0.209602,0.528362,10.96649,0.860739,0.092996,0.11325,0.330293,6.026484,0.705882,0.77253,0.333761
23965,1992-05-07,137,tropicana,30464,0,3.19,0.209602,0.528362,10.96649,0.860739,0.092996,0.11325,0.330293,6.026484,0.705882,0.77253,0.333761
23966,1992-05-14,137,tropicana,27904,0,3.19,0.209602,0.528362,10.96649,0.860739,0.092996,0.11325,0.330293,6.026484,0.705882,0.77253,0.333761


## Modeling

For forecasting tasks, AutoML uses pre-processing and estimation steps that are specific to time-series. AutoML will undertake the following pre-processing steps:
* Detect time-series sample frequency (e.g. hourly, daily, weekly) and create new records for absent time points to make the series regular. A regular time series has a well-defined frequency and has a value at every sample point in a contiguous time span.
* Impute missing values in the target (via forward-fill) and feature columns (using median column values)
* Create features based on time series identifiers to enable fixed effects across different series
* Create time-based features to assist in learning seasonal patterns
* Encode categorical variables to numeric quantities
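
The first two steps and the time-based features can be sketched with plain pandas. This is only an illustration on a made-up three-row series with one missing week; it is not AutoML's internal implementation, and the `month` feature is just one example of a time-based feature.

```python
import pandas as pd

# Hypothetical weekly series with the week of 1990-06-28 missing.
series = pd.DataFrame(
    {
        "WeekStarting": pd.to_datetime(["1990-06-14", "1990-06-21", "1990-07-05"]),
        "Quantity": [100.0, 120.0, 90.0],
        "Price": [1.5, 1.7, 1.9],
    }
).set_index("WeekStarting")

# Make the series regular: one row every 7 days, NaN where a week was absent.
regular = series.asfreq("7D")

# Impute the target by forward-fill and the feature by the column median.
regular["Quantity"] = regular["Quantity"].ffill()
regular["Price"] = regular["Price"].fillna(series["Price"].median())

# A simple time-based feature.
regular["month"] = regular.index.month

print(regular)
```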

In this notebook, AutoML will train a single, regression-type model across **all** time-series in a given training set. This allows the model to generalize across related series. If you want to train multiple models for different time-series, please see the many-models notebook.

You are almost ready to start an AutoML training job. First, we need to identify the target column by name:

In [10]:
target_column_name = "Quantity"

### Prepare Partition Dataset for Training

To enable distributed DNN training, a partitioned dataset is needed. For this notebook, we use a pipeline step to partition the dataset. The step creates a new partitioned dataset named `oj_train_partitioned`.

In [11]:
partitioned_dataset_name = "oj_train_partitioned"
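
Conceptually, partitioning by the time-series identifier columns means that each unique (Store, Brand) combination ends up in its own chunk of data. The sketch below shows that idea in memory with pandas on made-up rows; it says nothing about how the pipeline step or Azure ML stores partitions, which is handled by `partition.py`.

```python
import pandas as pd

# Hypothetical rows covering three (Store, Brand) combinations.
toy = pd.DataFrame(
    {
        "Store": [2, 2, 5, 5],
        "Brand": ["dominicks", "tropicana", "dominicks", "dominicks"],
        "Quantity": [10, 20, 30, 40],
    }
)

partition_columns = ["Store", "Brand"]

# One DataFrame per unique key; each key is a (Store, Brand) tuple.
partitions = {
    key: group.reset_index(drop=True)
    for key, group in toy.groupby(partition_columns)
}

for key, part in partitions.items():
    print(key, len(part))
```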

#### Define RunConfig for the compute
The pipeline steps also need `pandas`, `scikit-learn`, `automl`, and `pyarrow`. We define a `runconfig` that installs them.

In [None]:
from azureml.core.runconfig import RunConfiguration
from azureml.core.conda_dependencies import CondaDependencies

# Create a new runconfig object
aml_run_config = RunConfiguration()

# Use the compute target you created above.
aml_run_config.target = compute_target

# Enable Docker
aml_run_config.environment.docker.enabled = True

# Use conda_dependencies.yml to create a conda environment in the Docker image for execution
aml_run_config.environment.python.user_managed_dependencies = False

# Specify CondaDependencies obj, add necessary packages
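aml_run_config.environment.python.conda_dependencies = CondaDependencies.create(
    conda_packages=['pandas', 'scikit-learn'],
    # Assumption: the AutoML extra of the SDK is intended here, matching the
    # `automl` package mentioned in the text above.
    pip_packages=['azureml-sdk[automl]', 'pyarrow'])

print("Run configuration created.")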

#### Build and Run the Partition Pipeline

In [None]:
from azureml.pipeline.steps import PythonScriptStep
# Folder that contains partition.py
prepare_data_folder = "scripts"

partition_step = PythonScriptStep(
    name="Partition Dataset",
    script_name="partition.py", 
    arguments=["--partition-columns", time_series_id_column_names,
               "--new-partitioned-dataset-name", partitioned_dataset_name
              ],
    inputs=[train_dataset.as_named_input('raw_data')],
    compute_target=compute_target,
    runconfig=aml_run_config,
    source_directory=prepare_data_folder,
    allow_reuse=True
)


In [12]:
from azureml.pipeline.core import Pipeline
from azureml.widgets import RunDetails

pipeline_steps = [partition_step]

pipeline = Pipeline(workspace = ws, steps=pipeline_steps)
print("Pipeline is built.")

pipeline_run = experiment.submit(pipeline, regenerate_outputs=False)

print("Pipeline submitted for execution.")

Validating arguments.
Arguments validated.
Uploading file to /dominicks_OJ_train_train/35f8f6f0-19e1-4f3a-afa5-8a4f67f2c158/
Successfully uploaded file to datastore.
Creating a new dataset.
Successfully created a new dataset.
registering a new dataset.
Successfully created and registered a new dataset.


In [None]:
RunDetails(pipeline_run).show()

After this pipeline step has finished, we can retrieve the partitioned dataset with the following code:

In [None]:
parititoned_dataset = Dataset.get_by_name(ws, partitioned_dataset_name)

## Train

Instantiate an AutoMLConfig object. This defines the settings and data used to run the experiment.

|Property|Description|
|-|-|
|**task**|forecasting|
|**primary_metric**|This is the metric that you want to optimize.<br> Forecasting supports the following primary metrics: <br><i>spearman_correlation</i><br><i>normalized_root_mean_squared_error</i><br><i>r2_score</i><br><i>normalized_mean_absolute_error</i>|
|**iteration_timeout_minutes**|Time limit in minutes for each iteration.|
|**training_data**|Input dataset, containing both features and label column.|
|**label_column_name**|The name of the label column.|
|**forecasting_dnn_models_only**|The flag that enables distributed featurization and training.|
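
As a rough illustration of the primary metric used in this notebook, the sketch below computes a normalized RMSE with NumPy. It assumes normalization by the range of the actual values; AutoML's exact computation (for example, how it normalizes across multiple series) may differ, and the function name and numbers are made up.

```python
import numpy as np


def normalized_rmse(y_true, y_pred):
    """RMSE divided by the range (max - min) of the actual values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))
    return rmse / (y_true.max() - y_true.min())


# Hypothetical actuals and forecasts.
score = normalized_rmse([10, 20, 30], [12, 18, 33])
print(round(score, 4))
```

Lower values are better, and dividing by the range makes scores comparable across targets with different scales.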

In [13]:
from azureml.automl.core.forecasting_parameters import ForecastingParameters

forecasting_parameters = ForecastingParameters(
    time_column_name=time_column_name,
    forecast_horizon=n_test_periods,
    time_series_id_column_names=time_series_id_column_names
)
automl_config = AutoMLConfig(
    task="forecasting",
    primary_metric="normalized_root_mean_squared_error",
    experiment_timeout_hours=0.5,
    training_data=parititoned_dataset,
    label_column_name=target_column_name,
    verbosity=logging.INFO,
    compute_target=compute_target,
    max_concurrent_iterations=10,
    max_cores_per_iteration=-1,
    enable_dnn=True,
    enable_early_stopping=False,
    forecasting_parameters=forecasting_parameters,
    forecasting_dnn_models_only=True
)

	cv_step_size
	target_lags
	feature_lags
	target_rolling_window_size
	cv based validation settings


In [14]:
from azureml.core import Experiment

experiment = Experiment(ws, 'oj-distributed-tcn')

print('Experiment name: ' + experiment.name)

Experiment name: oj-distributed-tcn


In [None]:
remote_run = experiment.submit(automl_config, show_output=True)



Submitting remote run.
No run_configuration provided, running on oj-cluster with default configuration
Running on remote compute: oj-cluster


Experiment,Id,Type,Status,Details Page,Docs Page
oj-distributed-tcn,AutoML_43760cbd-0bcd-4885-a40c-0d2c77b1b584,automl,NotStarted,Link to Azure Machine Learning studio,Link to Documentation





Displaying the run objects gives you links to the visual tools in the Azure Portal. Go try them!

### Retrieve the Best Model for Each Algorithm
Below we select the best run for each algorithm from our iterations using the `get_result_df` helper, then look up the TCN run by its algorithm name. Alternatively, the `get_output` method on the AutoML run returns the best run and the fitted model for the last fit invocation. There are overloads on `get_output` that allow you to retrieve the best run and fitted model for any logged metric or a particular iteration.

In [None]:
from helper import get_result_df

summary_df = get_result_df(remote_run)
summary_df

In [None]:
from azureml.core.run import Run
from azureml.widgets import RunDetails

forecast_model = "TCNForecaster"
if forecast_model not in summary_df["run_id"]:
    forecast_model = "ForecastTCN"

best_dnn_run_id = summary_df["run_id"][forecast_model]
best_dnn_run = Run(experiment, best_dnn_run_id)
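
The lookup above relies on the fact that `in` on a pandas Series tests the index labels, not the values. The sketch below shows this on a made-up summary table; the run ids and the `OtherModel` label are hypothetical, and the table is assumed to be indexed by algorithm name as the code above implies.

```python
import pandas as pd

# Hypothetical summary indexed by algorithm name.
toy_summary = pd.DataFrame(
    {"run_id": ["AutoML_example_1", "AutoML_example_2"]},
    index=["ForecastTCN", "OtherModel"],
)

run_ids = toy_summary["run_id"]

# `in` checks the index (algorithm names), not the stored run ids.
print("ForecastTCN" in run_ids)
print("AutoML_example_1" in run_ids)
print("AutoML_example_1" in run_ids.values)

# Same fallback pattern as above: prefer one name, fall back to the other.
model_name = "TCNForecaster"
if model_name not in run_ids:
    model_name = "ForecastTCN"
selected_run_id = run_ids[model_name]
```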

In [None]:
best_dnn_run.parent
RunDetails(best_dnn_run.parent).show()

In [None]:
best_dnn_run
RunDetails(best_dnn_run).show()

## Evaluate on Test Data

We now use the best fitted model from the AutoML Run to make forecasts for the test set.  

We always score on the original dataset whose schema matches the training set schema.
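
The synopsis mentions a rolling test. As a generic illustration of that idea, the sketch below forecasts the test period in consecutive windows of `horizon` steps and appends the actual values to the history before moving the forecast origin forward. The function names and the last-value forecaster are made up for this example; the inference script used below may work differently.

```python
import numpy as np


def rolling_origin_forecasts(history, test, horizon, forecaster):
    """Forecast `test` in windows of `horizon` steps, moving the origin forward."""
    history = list(history)
    preds = []
    for start in range(0, len(test), horizon):
        window = test[start:start + horizon]
        preds.extend(forecaster(history, len(window)))
        # Once a window has been forecast, its actuals become part of the history.
        history.extend(window)
    return np.array(preds)


def last_value_forecaster(history, steps):
    """Naive stand-in model: repeat the last observed value."""
    return [history[-1]] * steps


preds = rolling_origin_forecasts([1, 2, 3], [4, 5, 6, 7], horizon=2,
                                 forecaster=last_value_forecaster)
print(preds)
```

When the test period is no longer than the forecast horizon, as with the 12-week test set and 12-week horizon here, this reduces to a single forecast from the end of the training data.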

In [None]:
test_experiment = Experiment(ws, experiment_name + "_test")

In [None]:
import os
import shutil

script_folder = os.path.join(os.getcwd(), "inference")
os.makedirs(script_folder, exist_ok=True)
shutil.copy("infer.py", script_folder)

In [None]:
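from helper import run_inference

# `forecast_horizon` and `freq` are not defined earlier in this notebook, so the
# values below are assumptions: the horizon matches the 12-week test split, and
# the weekly data starts on Thursdays.
forecast_horizon = n_test_periods
freq = "W-THU"
# `valid_dataset` must also be defined before this call; this notebook only
# registers `train_dataset` and `test_dataset`.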

test_run = run_inference(
    test_experiment,
    compute_target,
    script_folder,
    best_dnn_run,
    test_dataset,
    valid_dataset,
    forecast_horizon,
    target_column_name,
    time_column_name,
    freq,
)

In [None]:
RunDetails(test_run).show()

In [None]:
from helper import run_multiple_inferences

summary_df = run_multiple_inferences(
    summary_df,
    experiment,
    test_experiment,
    compute_target,
    script_folder,
    test_dataset,
    valid_dataset,
    forecast_horizon,
    target_column_name,
    time_column_name,
    freq,
)

In [None]:
for run_name, run_summary in summary_df.iterrows():
    print(run_name)
    print(run_summary)
    run_id = run_summary.run_id
    test_run_id = run_summary.test_run_id
    test_run = Run(test_experiment, test_run_id)
    test_run.wait_for_completion()
    test_score = test_run.get_metrics()[run_summary.primary_metric]
    summary_df.loc[summary_df.run_id == run_id, "Test Score"] = test_score
    print("Test Score: ", test_score)

In [None]:
summary_df