# AutoML Forecasting Training and Inferencing using Pipelines

## Introduction

**Requirements** - In order to benefit from this tutorial, you will need:
- A basic understanding of Machine Learning
- An Azure account with an active subscription - [Create an account for free](https://azure.microsoft.com/free/?WT.mc_id=A261C142F)
- An Azure ML workspace with a compute cluster - [Configure workspace](../../configuration.ipynb)
- A Python environment
- The Azure Machine Learning Python SDK v2 installed - [install instructions](../../../README.md) - check the getting started section

**Learning Objectives** - By the end of this tutorial, you should be able to:
- Create a Forecasting AutoML task in a pipeline.

**Motivations** - This notebook explains how to use the Forecasting AutoML task inside a pipeline.

In this notebook, we demonstrate how to use pipelines to train an AutoML Forecasting model and run inference with it. Two pipelines will be created: one that trains the AutoML model, and another that runs inference with it. We'll also demonstrate how to schedule the inference pipeline so you can get inference results periodically (with a refreshed test dataset). Make sure you have executed the configuration notebook before running this notebook. In this notebook you will learn how to:

- Configure AutoML forecasting tasks.
- Create and register an AutoML model using an AzureML pipeline.
- Run inference with the registered model and schedule the inference pipeline.

# 1. Connect to Azure Machine Learning Workspace

The [workspace](https://docs.microsoft.com/en-us/azure/machine-learning/concept-workspace) is the top-level resource for Azure Machine Learning, providing a centralized place to work with all the artifacts you create when you use Azure Machine Learning. In this section we will connect to the workspace in which the job will be run.

## 1.1 Import the required libraries

In [None]:
from azure.identity import DefaultAzureCredential, InteractiveBrowserCredential

from azure.ai.ml import MLClient, Input, command, Output
from azure.ai.ml.constants import AssetTypes
from azure.ai.ml.dsl import pipeline
from azure.ai.ml.automl import forecasting
from azure.ai.ml.entities._job.automl.tabular.forecasting_settings import (
    ForecastingSettings,
)
from azure.ai.ml.entities import Environment

## 1.2 Configure credential

We use `DefaultAzureCredential` to get access to the workspace. 
`DefaultAzureCredential` should be capable of handling most Azure SDK authentication scenarios. 

If it does not work for you, see these references for other available credentials: [configure credential example](../../configuration.ipynb), [azure-identity reference doc](https://docs.microsoft.com/en-us/python/api/azure-identity/azure.identity?view=azure-python).

In [None]:
try:
    credential = DefaultAzureCredential()
    # Check if given credential can get token successfully.
    credential.get_token("https://management.azure.com/.default")
except Exception as ex:
    # Fall back to InteractiveBrowserCredential in case DefaultAzureCredential does not work
    credential = InteractiveBrowserCredential()

## 1.3 Get a handle to the workspace

We use a config file to connect to the workspace. The Azure ML workspace should be configured with a compute cluster. [Check this notebook to configure a workspace](../../configuration.ipynb)

In [None]:
# Reuse the credential created in section 1.2 (it may be the interactive fallback)
ml_client = None
try:
    ml_client = MLClient.from_config(credential)
except Exception as ex:
    print(ex)
    # Enter details of your AML workspace
    subscription_id = "<SUBSCRIPTION_ID>"
    resource_group = "<RESOURCE_GROUP>"
    workspace = "<AML_WORKSPACE_NAME>"

    ml_client = MLClient(credential, subscription_id, resource_group, workspace)

### Show Azure ML Workspace information

In [None]:
import pandas as pd

workspace = ml_client.workspaces.get(name=ml_client.workspace_name)

output = {}
output["Workspace"] = ml_client.workspace_name
output["Subscription ID"] = ml_client.subscription_id
output["Resource Group"] = workspace.resource_group
output["Location"] = workspace.location
pd.set_option("display.max_colwidth", None)
outputDf = pd.DataFrame(data=output, index=[""])
outputDf.T

## Compute 

#### Create or Attach existing AmlCompute

You may need to ask your workspace or IT admin to create the compute target if you don't have permission to do so.

Creating an AmlCompute cluster takes approximately 5 minutes. If an AmlCompute with that name already exists in your workspace, this code skips the creation step.

As with other Azure services, there are limits on certain resources (e.g. AmlCompute) associated with the Azure Machine Learning service. Please read [this article](https://docs.microsoft.com/en-us/azure/machine-learning/service/how-to-manage-quotas) on the default limits and how to request more quota.

In [None]:
from azure.core.exceptions import ResourceNotFoundError
from azure.ai.ml.entities import AmlCompute

cluster_name = "forecast-step-cluster-v2"

try:
    # Retrieve an already attached Azure Machine Learning Compute.
    compute = ml_client.compute.get(cluster_name)
except ResourceNotFoundError as e:
    compute = AmlCompute(
        name=cluster_name,
        size="STANDARD_DS12_V2",
        type="amlcompute",
        min_instances=0,
        max_instances=4,
        idle_time_before_scale_down=120,
    )
    poller = ml_client.begin_create_or_update(compute)
    poller.wait()

## Data
You are now ready to load the historical orange juice sales data. For demonstration purposes, we extract sales time-series for just a few of the stores. We will load the CSV file into a plain pandas DataFrame; the time column in the CSV is called _WeekStarting_, so it will be specially parsed into the datetime type.

In [None]:
time_column_name = "WeekStarting"
train = pd.read_csv("./data/train/oj-train.csv", parse_dates=[time_column_name])

train.head()

Each row in the DataFrame holds a quantity of weekly sales for an orange juice (OJ) brand at a single store. The data also includes the sales price, a flag indicating if the OJ brand was advertised in the store that week, and some customer demographic information based on the store location. For historical reasons, the data also include the logarithm of the sales quantity. The Dominick's grocery data is commonly used to illustrate econometric modeling techniques where logarithms of quantities are generally preferred.    

The task is now to build a time-series model for the _Quantity_ column. It is important to note that this dataset is composed of many individual time-series, one for each unique combination of _Store_ and _Brand_. To distinguish the individual time-series, we define the **time_series_id_column_names**, the columns whose values determine the boundaries between time-series:

In [None]:
time_series_id_column_names = ["Store", "Brand"]
nseries = train.groupby(time_series_id_column_names).ngroups
print("Data contains {0} individual time-series.".format(nseries))

### Test Splitting
The test set contains the final 4 weeks of observed sales for each time-series. Such a split should be stratified by series, for example with a group-by on the time series identifier columns. In this notebook the split has already been made, so we only load the resulting test file.

In [None]:
n_test_periods = 4

test = pd.read_csv("./data/test/oj-test.csv", parse_dates=[time_column_name])
test.head()
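
The test file above is loaded pre-split. As a minimal sketch of how a per-series split of this kind could be produced with a group-by, the snippet below holds out the last `n_test` rows of every series. The helper name `split_last_n_periods` and the small sample frame are hypothetical and are not part of this notebook's data preparation.

```python
import pandas as pd


def split_last_n_periods(df, time_col, id_cols, n_test):
    """Hold out the last n_test rows of every series as the test set."""
    # Sort so that "last rows" means "most recent periods" within each series
    ordered = df.sort_values(id_cols + [time_col]).reset_index(drop=True)
    test_part = ordered.groupby(id_cols).tail(n_test)
    train_part = ordered.drop(test_part.index)
    return train_part, test_part


# Hypothetical sample: two stores, one brand, six weekly observations each
weeks = pd.date_range("2024-01-04", periods=6, freq="W-THU")
sample = pd.DataFrame(
    {
        "WeekStarting": list(weeks) * 2,
        "Store": [2] * 6 + [5] * 6,
        "Brand": ["tropicana"] * 12,
        "Quantity": list(range(12)),
    }
)

train_part, test_part = split_last_n_periods(
    sample, "WeekStarting", ["Store", "Brand"], n_test=4
)
print(len(train_part), len(test_part))
```

Because the split is made per series, every series contributes exactly `n_test` rows to the test set, and each series' training rows all precede its test rows.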

### Upload data to datastore
The [Machine Learning service workspace](https://docs.microsoft.com/en-us/azure/machine-learning/service/concept-workspace) is paired with a storage account, which contains the default datastore. We will use it to upload the training data and create an [Input](https://learn.microsoft.com/en-us/python/api/azure-ai-ml/azure.ai.ml.input?view=azure-python-preview) object.

In [None]:
# Training MLTable defined locally, with local data to be uploaded
train_dataset = Input(type=AssetTypes.MLTABLE, path="./data/train")

The test dataset will be consumed by the inference pipeline, so we define it as a URI folder input; the local folder is uploaded when the pipeline job is submitted.

In [None]:
test_dataset = Input(
    type=AssetTypes.URI_FOLDER,
    path="./data/test",
)

# 2. Build the training pipeline

## 2.1 Modeling

For forecasting tasks, AutoML uses pre-processing and estimation steps that are specific to time-series. AutoML will undertake the following pre-processing steps:
* Detect time-series sample frequency (e.g. hourly, daily, weekly) and create new records for absent time points to make the series regular. A regular time series has a well-defined frequency and has a value at every sample point in a contiguous time span 
* Impute missing values in the target (via forward-fill) and feature columns (using median column values) 
* Create features based on time series identifiers to enable fixed effects across different series
* Create time-based features to assist in learning seasonal patterns
* Encode categorical variables to numeric quantities
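
To make the first two steps concrete, here is a minimal pandas sketch of regularizing a weekly series and imputing the resulting gaps (forward-fill for the target, median for a feature). It only illustrates the idea under the assumption of a `W-THU` frequency; the sample values are hypothetical and this is not AutoML's internal implementation.

```python
import pandas as pd

# Hypothetical weekly series with the week of 2024-01-18 missing
raw = pd.DataFrame(
    {
        "WeekStarting": pd.to_datetime(["2024-01-04", "2024-01-11", "2024-01-25"]),
        "Quantity": [100.0, 120.0, 90.0],
        "Price": [2.0, 2.5, 3.0],
    }
)

# Make the series regular: one row for every Thursday in the observed span
regular = raw.set_index("WeekStarting").asfreq("W-THU")

# Impute the target by forward-fill and the feature by its column median
regular["Quantity"] = regular["Quantity"].ffill()
regular["Price"] = regular["Price"].fillna(regular["Price"].median())

print(regular)
```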

In this notebook, AutoML will train a single, regression-type model across **all** time-series in a given training set. This allows the model to generalize across related series.

You are almost ready to start an AutoML training job. First, we need to define the target column.

In [None]:
target_column_name = "Quantity"

### ForecastingSettings
To define forecasting settings for your experiment training, you can leverage the `ForecastingSettings` class. The table below details the forecasting parameters we will be passing into our experiment.

|Property|Description|
|-|-|
|**time_column_name**|The name of your time column.|
|**time_series_id_column_names**|The column names used to uniquely identify the time series in data that has multiple rows with the same timestamp. If the time series identifiers are not defined, the data set is assumed to be one time series.|
|**forecast_horizon**|The forecast horizon is how many periods forward you would like to forecast. This integer horizon is in units of the timeseries frequency (e.g. daily, weekly).|
|**frequency**|Forecast frequency. This optional parameter represents the period with which the forecast is desired, for example, daily, weekly, yearly, etc. Use this parameter for the correction of time series containing irregular data points or for padding of short time series. The frequency needs to be a pandas offset alias. Please refer to [pandas documentation](https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#dateoffset-objects) for more information.|
|**cv_step_size**|Number of periods between two consecutive cross-validation folds. The default value is `None`, in which case AutoML determines the cross-validation step size automatically. Alternatively, users can specify an integer value.|
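
To illustrate how `forecast_horizon` and `cv_step_size` interact, the following sketch builds generic rolling-origin cross-validation folds over the positions of a single series. The function `rolling_origin_folds` and its arguments are hypothetical names chosen for this illustration; it shows the general technique, not the exact fold construction AutoML uses.

```python
def rolling_origin_folds(n_obs, horizon, n_folds, step_size):
    """Return (train_positions, validation_positions) pairs, oldest fold first.

    Each validation window covers `horizon` periods, and consecutive
    windows are shifted by `step_size` periods.
    """
    folds = []
    for k in range(n_folds):
        valid_end = n_obs - k * step_size
        valid_start = valid_end - horizon
        if valid_start <= 0:
            raise ValueError("Series is too short for the requested folds")
        folds.append((range(0, valid_start), range(valid_start, valid_end)))
    return folds[::-1]


# 20 weekly observations, 4-week horizon, 3 folds that are 2 weeks apart
for train_idx, valid_idx in rolling_origin_folds(20, horizon=4, n_folds=3, step_size=2):
    print(f"train 0..{train_idx[-1]}, validate {valid_idx[0]}..{valid_idx[-1]}")
```

A smaller step size makes the validation windows overlap more, while a step size equal to the horizon makes them adjacent and non-overlapping.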

### forecasting() function parameters:

The `forecasting()` factory function allows the user to configure AutoML for the forecasting task, covering the most common scenarios with the following properties.

|Property|Description|
|-|-|
|**target_column_name**|The name of the label column.|
|**primary_metric**|This is the metric that you want to optimize.<br> Forecasting supports the following primary metrics <br><i>spearman_correlation</i><br><i>normalized_root_mean_squared_error</i><br><i>r2_score</i><br><i>normalized_mean_absolute_error</i>|
|**training_data**|The training data to be used for this experiment. You can use an MLTable registered in the workspace, referenced as `<mltable_name>:<version>`, or a local folder containing an MLTable definition, for example `Input(type=AssetTypes.MLTABLE, path="./data/train")` as in this notebook. In the pipeline below, the training data is the output of the preprocessing step. The parameter `training_data` must always be provided.|
|**compute**|The compute on which the AutoML job will run. In this example the job runs on the pipeline's default compute, the `forecast-step-cluster-v2` cluster created above. You can replace it with any other compute in the workspace.|
|**n_cross_validations**|Number of cross-validation folds to use for model/pipeline selection. This can be set to "auto", in which case AutoML determines the number of cross-validations automatically if a validation set is not provided. Alternatively, users can specify an integer value.|
|**name**|The name of the Job/Run. This is an optional property. If not specified, a random name will be generated.|
|**experiment_name**|The name of the Experiment. An Experiment is like a folder with multiple runs in Azure ML Workspace that should be related to the same logical machine learning experiment. For example, if a user runs this notebook multiple times, there will be multiple runs associated with the same Experiment name.|
|**enable_model_explainability**|If set to true, the explanations such as feature importance for the best model will be generated.|
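
To make the `normalized_root_mean_squared_error` primary metric concrete, here is a minimal sketch that assumes the common definition of RMSE divided by the range of the actual values. The helper name `normalized_rmse` and the sample numbers are hypothetical; AutoML computes the metric internally and its exact normalization may differ.

```python
import math


def normalized_rmse(actual, predicted):
    """RMSE divided by the range (max - min) of the actual values."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("actual and predicted must be non-empty and equal in length")
    mse = sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)
    value_range = max(actual) - min(actual)
    if value_range == 0:
        raise ValueError("actual values have zero range")
    return math.sqrt(mse) / value_range


# Hypothetical weekly quantities and forecasts
score = normalized_rmse([10.0, 20.0, 30.0], [12.0, 18.0, 33.0])
print(round(score, 4))
```

Lower values are better for this metric, and normalizing by the range makes scores comparable across targets measured on different scales.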

### set_limits() parameters:
This is an optional configuration method to configure limits parameters such as timeouts.

|Property|Description|
|-|-|
|**timeout_minutes**|Maximum amount of time in minutes that the whole AutoML job can take before the job terminates. This timeout includes setup, featurization and training runs but does not include the ensembling and model explainability runs at the end of the process since those actions need to happen once all the trials (children jobs) are done. If not specified, the default job's total timeout is 6 days (8,640 minutes). To specify a timeout less than or equal to 1 hour (60 minutes), make sure your dataset's size is not greater than 10,000,000 (rows times columns) or an error results. It is hard to say what the timeout limit should be because the runtimes depend on multiple factors such as the number of unique time series in the dataset, the length of the time series, statistical properties of the data, etc. If your dataset has fewer than 10,000,000 observations, you can try to set the experiment timeout to 1 hour. If you are seeing fewer than 30 child jobs completed in this time frame, increase the timeout limit and re-run the experiment.|
|**trial_timeout_minutes**|Maximum time in minutes that each trial (child job) can run for before it terminates. If not specified, a value of 1 month or 43200 minutes is used.|
|**max_trials**|The maximum number of trials/runs, each with a different combination of algorithm and hyperparameters, to try during an AutoML job. If not specified, the default is 1000 trials. If you set `enable_early_termination=True`, the number of trials may be smaller.|
|**max_concurrent_trials**|Represents the maximum number of trials (children jobs) that would be executed in parallel. It's a good practice to set this number equal to the number of nodes in your cluster.|
|**enable_early_termination**|Whether to enable early termination if the score is not improving over 10 iterations. Early stopping window starts only after first 20 iterations. This means that the first iteration where stopping can occur is the 31st.|
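
The guidance in this table can be turned into a small pre-flight check before submitting a job. The sketch below is only an illustration: `check_limits` and its arguments are hypothetical names, the dataset-size threshold is taken from the table above, and the trial-versus-job timeout comparison is an added sanity check rather than a documented AutoML rule.

```python
def check_limits(
    n_rows,
    n_cols,
    timeout_minutes,
    trial_timeout_minutes,
    max_concurrent_trials=None,
    cluster_max_nodes=None,
):
    """Return a list of warnings about the chosen limit settings."""
    warnings = []
    # From the table: timeouts of 1 hour or less require rows * columns <= 10,000,000
    if timeout_minutes <= 60 and n_rows * n_cols > 10_000_000:
        warnings.append("Dataset is too large for a timeout of 60 minutes or less.")
    # Added sanity check: a single trial cannot usefully outlast the whole job
    if trial_timeout_minutes > timeout_minutes:
        warnings.append("trial_timeout_minutes exceeds timeout_minutes.")
    # From the table: match concurrency to the number of cluster nodes
    if (
        max_concurrent_trials is not None
        and cluster_max_nodes is not None
        and max_concurrent_trials != cluster_max_nodes
    ):
        warnings.append("max_concurrent_trials differs from the number of cluster nodes.")
    return warnings


# Hypothetical dataset size with the limits used in this notebook
print(check_limits(n_rows=10_000, n_cols=18, timeout_minutes=15, trial_timeout_minutes=5))
```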



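### Build a custom environment

The preprocessing step of the pipeline runs in a custom environment created from a Docker image plus a Conda environment file.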

In [None]:
env_docker_conda = Environment(
    image="mcr.microsoft.com/azureml/openmpi4.1.0-ubuntu20.04",
    conda_file="./environment/preprocessing_env.yaml",
    name="pipeline-custom-environment",
    description="Environment created from a Docker image plus Conda environment.",
)
ml_client.environments.create_or_update(env_docker_conda)

In [None]:
model_name_str = "ojmodel"

# Define pipeline
@pipeline(
    description="AutoML Forecasting Pipeline",
)
def automl_forecasting(
    forecasting_train_data,
):
    # define command function for preprocessing the training data
    preprocessing_command_func = command(
        inputs=dict(
            train_data=Input(type="mltable"),
        ),
        outputs=dict(
            preprocessed_train_data=Output(type="mltable"),
        ),
        code="./preprocess.py",
        command="python preprocess.py "
        + "--train_data ${{inputs.train_data}} "
        + "--preprocessed_train_data ${{outputs.preprocessed_train_data}}",
        environment="pipeline-custom-environment@latest",
    )
    preprocess_node = preprocessing_command_func(train_data=forecasting_train_data)

    # define forecasting settings
    forecasting_settings = ForecastingSettings(
        time_column_name=time_column_name,
        forecast_horizon=n_test_periods,
        frequency="W-THU",
    )

    # define the automl forecasting task with automl function
    forecasting_node = forecasting(
        training_data=preprocess_node.outputs.preprocessed_train_data,
        target_column_name=target_column_name,
        primary_metric="normalized_root_mean_squared_error",
        n_cross_validations="auto",
        forecasting_settings=forecasting_settings,
        # currently need to specify the "best_model" output explicitly to reference it in following nodes
        outputs={"best_model": Output(type=AssetTypes.CUSTOM_MODEL)},
    )

    forecasting_node.set_limits(
        timeout_minutes=15,
        trial_timeout_minutes=5,
    )

    # define command function for registering the model
    command_func = command(
        inputs=dict(
            model_input_path=Input(type=AssetTypes.CUSTOM_MODEL),
            model_base_name="forecasting_example_model",
        ),
        code="scripts/register_model.py",
        command="python register_model.py "
        + "--model_path ${{inputs.model_input_path}} "
        + f"--model_base_name {model_name_str}",
        environment="AzureML-sklearn-1.0-ubuntu20.04-py38-cpu:1",
    )
    register_model = command_func(model_input_path=forecasting_node.outputs.best_model)

    return {
        "best_model": forecasting_node.outputs.best_model,
    }


# Create an instance of a pipeline job
pipeline_job_data = automl_forecasting(
    forecasting_train_data=train_dataset,
)

# set pipeline level compute
pipeline_job_data.settings.default_compute = cluster_name

## 2.2 Submit pipeline job
The pipeline will train an AutoML model and register it in the workspace.

In [None]:
pipeline_job = ml_client.jobs.create_or_update(
    pipeline_job_data, experiment_name="pipeline_samples"
)
pipeline_job

In [None]:
# Wait until the job completes
ml_client.jobs.stream(pipeline_job.name)

Download the `best_model` output of the pipeline.

In [None]:
ml_client.jobs.download(pipeline_job.name, download_path=".", output_name="best_model")

Next, we read the ID of the best run from the downloaded `MLmodel` file; we will use it to download the artifacts associated with that run.

In [None]:
import os
import yaml

with open(os.path.join("named-outputs", "best_model", "MLmodel"), "r") as f:
    ml_model = yaml.safe_load(f)
ml_model["run_id"]

Once we know the run ID of the best run, we can retrieve the corresponding MLflow run object.

In [None]:
import mlflow
from mlflow.tracking.client import MlflowClient

# Obtain the tracking URL from MLClient
MLFLOW_TRACKING_URI = ml_client.workspaces.get(
    name=ml_client.workspace_name
).mlflow_tracking_uri

print(MLFLOW_TRACKING_URI)
mlflow.set_tracking_uri(MLFLOW_TRACKING_URI)


# Initialize MLFlow client
mlflow_client = MlflowClient()

mlflow_best_run = mlflow_client.get_run(ml_model["run_id"])

print("Best run: ")
print(mlflow_best_run)

Download the artifacts for this run.

In [None]:
# Create local folder
import os

local_dir = "./artifact_downloads"
os.makedirs(local_dir, exist_ok=True)
# Download run's artifacts/outputs
local_path = mlflow_client.download_artifacts(
    mlflow_best_run.info.run_id, "outputs", local_dir
)
print("Artifacts downloaded in: {}".format(local_path))
print("Artifacts: {}".format(os.listdir(local_path)))

### Get metrics for each run
In this code we list all child runs, i.e., all runs that share the same parent run ID and end in the underscore followed by the order number.

In [None]:
import re
from mlflow.entities import RunStatus

parent_run_id = ml_model["run_id"][: ml_model["run_id"].index("_")]
child_run_regex = re.compile(r"[^_]+_\d+$")

for child_run in filter(
    lambda x: child_run_regex.match(x.name),
    ml_client.jobs.list(parent_job_name=parent_run_id),
):
    mlflow_child_run = mlflow_client.get_run(child_run.name)
    if RunStatus.from_string(mlflow_child_run.info.status) == RunStatus.FINISHED:
        print(
            f"{child_run.name}: "
            f'{mlflow_child_run.data.metrics["normalized_root_mean_squared_error"]}'
        )
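
The name-matching logic in the cell above can be checked offline. The sketch below applies the same slicing and regular expression to a few hypothetical run names (real run IDs will look different) to show which names count as numbered child runs.

```python
import re

# Hypothetical ID of the best run: "<parent run ID>_<order number>"
best_run_id = "abc123_7"
parent_run_id = best_run_id[: best_run_id.index("_")]

# Same pattern as above: a prefix without underscores, "_", then only digits
child_run_regex = re.compile(r"[^_]+_\d+$")

# Hypothetical names of jobs that share the same parent
candidate_names = ["abc123_0", "abc123_7", "abc123_setup", "abc123_featurize", "abc123"]
child_names = [name for name in candidate_names if child_run_regex.match(name)]

print(parent_run_id)
print(child_names)
```

Names whose suffix is not purely numeric, as well as the parent name itself, are filtered out, so only the numbered trial runs are queried for metrics.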

# 3. Inference

There are several ways to run inference; here we demonstrate how to do it with the registered model and a pipeline.

## 3.1 Get Inference Pipeline Environment
This environment can be created from the Conda `yaml` file that we downloaded, along with the best run's other artifacts, into the `artifact_downloads` directory.

In [None]:
from azure.ai.ml.entities import Environment

env = Environment(
    description="environment for automl inference",
    image="mcr.microsoft.com/azureml/openmpi4.1.0-ubuntu20.04:latest",
    conda_file=os.path.join("artifact_downloads", "outputs", "conda_env_v_1_0_0.yml"),
)

## 3.2 Build and submit the inference pipeline

The inference pipeline creates a pipeline output object, which can be downloaded after the pipeline finishes.

In [None]:
output_ds_name = "oj-output"


# Define inference pipeline
@pipeline(
    description="AutoML Inference Pipeline",
)
def automl_forecasting_inference(
    forecasting_inference_data,
):
    # define command function for running inference with the registered model
    inference_func = command(
        inputs=dict(
            test_dataset=Input(type=AssetTypes.URI_FOLDER),
            # name under which the training pipeline registered the model
            model_name=model_name_str,
        ),
        outputs=dict(output_dataset=Output(type=AssetTypes.URI_FOLDER)),
        code="scripts/infer.py",
        command=(
            "python infer.py "
            "--test_dataset ${{inputs.test_dataset}} "
            "--model_name ${{inputs.model_name}} "
            f"--target_column_name {target_column_name} "
            "--output_dataset ${{outputs.output_dataset}} "
            f"--output_dataset_name {output_ds_name}"
        ),
        environment=env,
    )

    call_inferencing = inference_func(test_dataset=forecasting_inference_data)

    return {"output_dataset": call_inferencing.outputs.output_dataset}


pipeline_job_data = automl_forecasting_inference(
    forecasting_inference_data=test_dataset,
)

# set pipeline level compute
pipeline_job_data.settings.default_compute = cluster_name

pipeline_job_data.outputs.output_dataset.mode = "rw_mount"

In [None]:
inference_job = ml_client.jobs.create_or_update(
    pipeline_job_data, experiment_name="pipeline_inference"
)
inference_job

In [None]:
ml_client.jobs.stream(inference_job.name)

## 3.3 Get the predicted data

In [None]:
ml_client.jobs.download(
    inference_job.name, download_path=".", output_name="output_dataset"
)

In [None]:
inference_df = pd.read_csv(
    os.path.join("named-outputs", "output_dataset", f"{output_ds_name}.csv"),
    parse_dates=[time_column_name],
)
inference_df.tail(5)

# 4. Schedule Pipeline

This section shows how to schedule a pipeline for periodic predictions. For more information about pipeline schedules and pipeline endpoints, please follow this [notebook](https://github.com/Azure/azureml-examples/blob/83c67ec408f10e2e07b3a2a3e648023caa09e112/sdk/python/schedules/job-schedule.ipynb).

## 4.1 Define a schedule
If `test_dataset` is updated every 4 weeks on Friday at 16:00 and the objective is to generate a 4-week (`forecast_horizon`) forecast, we can schedule our pipeline to run every 4 weeks on Friday at 16:00 to get fresh inference results. You can refresh your test dataset (a newer version will be created) periodically when new data is available; in that case the target column in the test dataset has values at the beginning, which serve as context data, followed by NaNs to be predicted. The inference pipeline will pick up the context to further improve the forecast accuracy.

In [None]:
from datetime import datetime

from azure.ai.ml.constants import TimeZone
from azure.ai.ml.entities import (
    JobSchedule,
    RecurrenceTrigger,
    RecurrencePattern,
)

schedule_name = "OJ_Inference_schedule"
schedule_start_time = datetime.now()

recurrence_trigger = RecurrenceTrigger(
    frequency="week",
    interval=4,
    schedule=RecurrencePattern(week_days=["Friday"], hours=16, minutes=[0]),
    start_time=schedule_start_time,
    time_zone=TimeZone.UTC,
)

job_schedule = JobSchedule(
    name=schedule_name, trigger=recurrence_trigger, create_job=pipeline_job_data
)
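
To get a feel for when a trigger like the one above fires, the sketch below enumerates run times for an "every 4 weeks, Friday 16:00 UTC" recurrence using only the standard library. The helper `next_runs` is hypothetical, and how the service aligns the first run with `start_time` is an assumption here, so treat the output as an approximation of the real schedule.

```python
from datetime import datetime, timedelta, timezone


def next_runs(start, n_runs, interval_weeks=4, weekday=4, hour=16):
    """List n_runs timestamps: the first matching weekday/hour at or after
    `start`, then one every `interval_weeks` weeks (weekday 4 is Friday)."""
    candidate = start.replace(hour=hour, minute=0, second=0, microsecond=0)
    candidate += timedelta(days=(weekday - candidate.weekday()) % 7)
    if candidate < start:
        # The matching time on the start day has already passed
        candidate += timedelta(days=7)
    return [candidate + timedelta(weeks=interval_weeks * i) for i in range(n_runs)]


# Hypothetical schedule start: Monday 2024-01-01 00:00 UTC
runs = next_runs(datetime(2024, 1, 1, tzinfo=timezone.utc), n_runs=3)
for run in runs:
    print(run.isoformat())
```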

## 4.2 Create schedule

In [None]:
# Keep the schedule returned by the service so the printed state is up to date
job_schedule = ml_client.schedules.begin_create_or_update(
    schedule=job_schedule
).result()
print(job_schedule)

## 4.3 [Optional] Disable and delete the schedule

In [None]:
job_schedule = ml_client.schedules.begin_disable(name=schedule_name).result()
print(job_schedule.is_enabled)

# A schedule must be disabled before it can be deleted
ml_client.schedules.begin_delete(name=schedule_name).wait()

## 4.4 [Optional] Delete the compute cluster

In [None]:
ml_client.compute.begin_delete(name=cluster_name).wait()