Copyright (c) Microsoft Corporation. All rights reserved.

Licensed under the MIT License.


# AutoML: Train a Forecasting model on the GitHub Daily Active Users (DAU) dataset

**Requirements** - In order to benefit from this tutorial, you will need:
- A basic understanding of Machine Learning.
- An Azure account with an active subscription.
- An Azure ML workspace.
- A Compute Cluster.
- A Python environment.
- Installation instructions 

Optionally, you will need a [Compute Instance](https://learn.microsoft.com/en-us/azure/machine-learning/concept-compute-instance) if you want to run this notebook directly in Azure ML Studio (a Compute Cluster cannot be used for this; learn more [here](https://learn.microsoft.com/en-us/azure/machine-learning/concept-compute-target#azure-machine-learning-compute-managed)). In that case, you can also set up [idle shutdown](https://learn.microsoft.com/en-us/azure/machine-learning/how-to-create-manage-compute-instance?tabs=python#enable-idle-shutdown-preview) (a preview feature) to reduce the cost of the compute instance.

**Motivations** - This notebook explains how to set up and run an AutoML forecasting job. This is one of the [nine ML-tasks](https://learn.microsoft.com/en-us/azure/machine-learning/concept-automated-ml#when-to-use-automl-classification-regression-forecasting-computer-vision--nlp) supported by AutoML.

In this example we use the associated GitHub DAU (Daily Active Users) dataset to showcase how you can use AutoML Deep Learning models for a forecasting problem and explore the results. The goal is to predict the number of daily active users for the next 14 days based on historical time-series data.

# 1. Connect to Azure Machine Learning Workspace

The [workspace](https://docs.microsoft.com/en-us/azure/machine-learning/concept-workspace) is the top-level resource for Azure Machine Learning, providing a centralized place to work with all the artifacts you create when you use Azure Machine Learning. In this section we will connect to the workspace in which the job will be run.

## 1.1. Import the required libraries

Please remember to use a Python 3.10 or later kernel, as specified in the env.yml file.

In [None]:
# Import required libraries
from azure.identity import DefaultAzureCredential
from azure.ai.ml import MLClient

from azure.ai.ml.constants import AssetTypes
from azure.ai.ml import automl
from azure.ai.ml import Input

## 1.2. Configure workspace details and get a handle to the workspace

To connect to a workspace, we need three identifiers: a subscription ID, a resource group, and a workspace name. We pass these details to `MLClient` from `azure.ai.ml` to get a handle to the required Azure Machine Learning workspace. This tutorial uses the [default Azure authentication](https://docs.microsoft.com/en-us/python/api/azure-identity/azure.identity.defaultazurecredential?view=azure-python). Check the [configuration notebook](../../configuration.ipynb) for more details on how to configure credentials and connect to a workspace.

In [None]:
credential = DefaultAzureCredential(exclude_shared_token_cache_credential=True)
ml_client = None
try:
    ml_client = MLClient.from_config(credential)
except Exception as ex:
    print(ex)
    # Enter details of your AML workspace
    subscription_id = "subscription_id"
    resource_group = "resource_group"
    workspace = "workspace_name"
    ml_client = MLClient(credential, subscription_id, resource_group, workspace)
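
`MLClient.from_config` in the cell above reads the workspace identifiers from a `config.json` file, searching the current directory and its parents. As a minimal sketch, such a file could be produced as follows. The three values are placeholders, and the temporary folder is used only so that the example has no side effects.

```python
import json
import tempfile
from pathlib import Path

# Placeholder values: replace them with your own workspace identifiers.
workspace_config = {
    "subscription_id": "subscription_id",
    "resource_group": "resource_group",
    "workspace_name": "workspace_name",
}

# Written to a temporary folder for illustration; in practice the file
# would sit next to the notebook or in one of its parent folders.
config_dir = Path(tempfile.mkdtemp())
config_path = config_dir / "config.json"
config_path.write_text(json.dumps(workspace_config, indent=2))

print(config_path.read_text())
```

With such a file in place, the `try` branch succeeds and the hard-coded fallback values in the `except` branch are never used.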

### Show Azure ML Workspace information

In [None]:
workspace = ml_client.workspaces.get(name=ml_client.workspace_name)

output = {}
output["Workspace"] = ml_client.workspace_name
output["Subscription ID"] = ml_client.subscription_id
output["Resource Group"] = workspace.resource_group
output["Location"] = workspace.location
output

# 2. Data

We will use the GitHub daily active user (DAU) count for model training. The data is stored in a tabular format.

With Azure Machine Learning [MLTables](https://learn.microsoft.com/en-us/azure/machine-learning/how-to-mltable?tabs=cli%2Cpandas%2Cadls) you can keep a single copy of data in your storage, easily access data during model training, share data and collaborate with other users. 
Below, we will upload the data by creating an MLTable to be used for training.

In [None]:
from helpers.generate_ml_table import create_ml_table

create_ml_table("github_dau_2011-2018_train.csv", "./data/training-mltable-folder")

# Training MLTable defined locally, with local data to be uploaded
my_training_data_input = Input(
    type=AssetTypes.MLTABLE, 
    path="./data/training-mltable-folder"
)

create_ml_table("github_dau_2011-2018_test.csv", "./data/test-mltable-folder")

# Test MLTable defined locally, with local data to be uploaded
my_test_data_input = Input(
    type=AssetTypes.MLTABLE,
    path="./data/test-mltable-folder",
)

For documentation on creating your own MLTable assets for jobs beyond this notebook:
- https://learn.microsoft.com/en-us/azure/machine-learning/reference-yaml-mltable details how to write MLTable YAMLs (required for each MLTable asset).
- https://learn.microsoft.com/en-us/azure/machine-learning/how-to-create-data-assets?tabs=Python-SDK covers how to work with them in the v2 CLI/SDK.
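
To make the MLTable format more concrete, here is a minimal sketch of writing an `MLTable` definition file for a single CSV. The function `write_mltable` is hypothetical and is not the notebook's `create_ml_table` helper, whose implementation is not shown here. The sketch only writes the YAML, so the CSV itself is assumed to already sit in the same folder.

```python
import tempfile
from pathlib import Path


def write_mltable(csv_file: str, output_dir: str,
                  delimiter: str = ",", encoding: str = "ascii") -> Path:
    """Write a minimal MLTable file that points at one delimited file.

    Hypothetical helper for illustration; the CSV is expected to be
    located in `output_dir` alongside the generated MLTable file.
    """
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    mltable_yaml = (
        "paths:\n"
        f"  - file: ./{csv_file}\n"
        "transformations:\n"
        "  - read_delimited:\n"
        f"      delimiter: '{delimiter}'\n"
        f"      encoding: '{encoding}'\n"
    )
    mltable_path = out / "MLTable"  # the file must be named exactly "MLTable"
    mltable_path.write_text(mltable_yaml)
    return mltable_path


demo_dir = tempfile.mkdtemp()
mltable_file = write_mltable("github_dau_2011-2018_train.csv", demo_dir)
print(mltable_file.read_text())
```

The folder containing this file is what an `Input` of type `AssetTypes.MLTABLE` points to.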

# 3. Create or Attach existing AmlCompute.
Azure Machine Learning Compute is a managed compute infrastructure that lets you easily create single-node or multi-node compute. In this tutorial, you will create an AmlCompute cluster as your training compute resource, or reuse an existing one.

### Creation of AmlCompute takes approximately 5 minutes.
If an AmlCompute cluster with that name already exists in your workspace, this code skips the creation process. As with other Azure services, there are limits on certain resources (e.g. AmlCompute) associated with the Azure Machine Learning service. Please read this [article](https://learn.microsoft.com/en-us/azure/machine-learning/how-to-manage-quotas#azure-machine-learning-compute) on the default limits and how to request more quota.

In [None]:
from azure.core.exceptions import ResourceNotFoundError
from azure.ai.ml.entities import AmlCompute

compute_name = "cpu-cluster"

try:
    # Retrieve an already attached Azure Machine Learning Compute.
    compute = ml_client.compute.get(compute_name)
except ResourceNotFoundError as e:
    compute = AmlCompute(
        name=compute_name,
        size="STANDARD_DS3_V2",
        type="amlcompute",
        min_instances=0,
        max_instances=4,
        idle_time_before_scale_down=120,
    )
    poller = ml_client.begin_create_or_update(compute)
    poller.wait()

# 4. Configure and run the AutoML Forecasting training job
In this section we will configure and run the AutoML job that trains the model.

## 4.1 Configure the job through the forecasting() factory function

### forecasting() function parameters:

The `forecasting()` factory function lets you configure AutoML for the forecasting task in the most common scenarios, using the following properties.

|Property|Description|
|-|-|
|**target_column_name**|The name of the column to target for predictions. It must always be specified. This parameter is applicable to 'training_data', 'validation_data' and 'test_data'.|
|**[primary_metric](https://learn.microsoft.com/en-us/azure/machine-learning/how-to-configure-auto-train#primary-metric)**|The metric that AutoML will optimize for model selection.|
|**training_data**|The data to be used for training. It should contain both training feature columns and a target column. Optionally, this data can be split to set aside a validation or test dataset. You can use a registered MLTable in the workspace, referenced in the format `azureml:<mltable_name>:<version>`, or a local folder as an MLTable, e.g. `Input(type=AssetTypes.MLTABLE, path="azureml:my_mltable:1")` or `Input(type=AssetTypes.MLTABLE, path="./data")`. The parameter 'training_data' must always be provided.|
|**compute**|The compute on which the AutoML job will run. In this example we use the compute named 'cpu-cluster' that is created or retrieved in section 3. You can replace it with any other compute in the workspace.|
|**name**|The name of the Job/Run. This is an optional property. If not specified, a random name will be generated.|
|**experiment_name**|The name of the Experiment. An Experiment is like a folder with multiple runs in Azure ML Workspace that should be related to the same logical machine learning experiment.|
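
The job in this notebook uses `NormalizedRootMeanSquaredError` as its primary metric. As an illustration of what that metric measures, the sketch below computes RMSE divided by the range of the actual values. The numbers are made up, and AutoML's own computation (for example how it normalizes across multiple time series) may differ in detail.

```python
import math


def normalized_rmse(actuals, predictions):
    """RMSE divided by the range (max - min) of the actual values."""
    if not actuals or len(actuals) != len(predictions):
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((a - p) ** 2 for a, p in zip(actuals, predictions)) / len(actuals)
    value_range = max(actuals) - min(actuals)
    if value_range == 0:
        raise ValueError("actual values have zero range")
    return math.sqrt(mse) / value_range


# Made-up values for illustration only.
actuals = [100.0, 120.0, 140.0, 160.0]
predictions = [110.0, 110.0, 150.0, 150.0]

nrmse = normalized_rmse(actuals, predictions)
print(round(nrmse, 4))  # 0.1667
```

Lower values are better. Dividing by the range makes the score comparable across series with different scales.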

### set_limits() parameters:
This optional method configures limit parameters such as timeouts.

|Property|Description|
|-|-|
|**timeout_minutes**|Maximum amount of time in minutes that the whole AutoML job can take before the job terminates. This timeout includes setup, featurization and training runs but does not include the ensembling and model explainability runs at the end of the process since those actions need to happen once all the trials (children jobs) are done. If not specified, the default job's total timeout is 6 days (8,640 minutes).|
|**trial_timeout_minutes**|Maximum time in minutes that each trial (child job) can run for before it terminates. If not specified, a value of 1 month or 43200 minutes is used.|
|**max_trials**|The maximum number of trials/runs each with a different combination of algorithm and hyperparameters to try during an AutoML job. If not specified, the default is 1000 trials. If using 'enable_early_termination' the number of trials used can be smaller.|
|**max_concurrent_trials**|The maximum number of trials (child jobs) that are executed in parallel. We highly recommend setting the number of concurrent trials to the number of nodes in the cluster.|
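
As a rough sanity check on how these limits interact, the sketch below estimates the worst-case time spent in trials for the values used later in this notebook (5 trials, 3 concurrent, 30 minutes per trial, 120 minutes overall). It assumes, as a simplification, that trials run in full batches of `max_concurrent_trials`. The real scheduler may behave differently, and setup and featurization also count toward `timeout_minutes`.

```python
import math

# Values taken from the set_limits() call in this notebook.
max_trials = 5
max_concurrent_trials = 3
trial_timeout_minutes = 30
timeout_minutes = 120

# Simplifying assumption: trials are executed in consecutive batches.
batches = math.ceil(max_trials / max_concurrent_trials)
worst_case_trial_minutes = batches * trial_timeout_minutes
fits_in_job_timeout = worst_case_trial_minutes <= timeout_minutes

print(batches, worst_case_trial_minutes, fits_in_job_timeout)  # 2 60 True
```

Under this assumption, the trials need at most 60 minutes, which leaves headroom within the 120-minute job timeout for setup and featurization.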


### Specialized Forecasting Parameters
To define forecasting parameters for your training job, use the `set_forecast_settings()` method.
The table below details the forecasting parameters we pass to the job.

|Property|Description|
|-|-|
|**time_column_name**|The name of your time column.|
|**forecast_horizon**|The forecast horizon is how many periods forward you would like to forecast. This integer horizon is in units of the timeseries frequency (e.g. daily, weekly).|
|**frequency**|Forecast frequency. This optional parameter represents the period with which the forecast is desired, for example, daily, weekly, yearly, etc. Use this parameter for the correction of time series containing irregular data points or for padding of short time series. The frequency needs to be a pandas offset alias. Please refer to [pandas documentation](https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#dateoffset-objects) for more information.|
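
To illustrate how `forecast_horizon` and `frequency` combine, the sketch below lists the dates covered by a 14-step horizon at daily frequency (`"D"`). The last training date is a hypothetical value chosen for the example and is not taken from the dataset.

```python
from datetime import date, timedelta

forecast_horizon = 14                   # number of periods to forecast
step = timedelta(days=1)                # one period at daily ("D") frequency
last_training_date = date(2018, 1, 1)   # hypothetical, for illustration only

# The horizon starts one period after the last observed date.
forecast_dates = [
    last_training_date + step * i for i in range(1, forecast_horizon + 1)
]

print(forecast_dates[0], forecast_dates[-1])  # 2018-01-02 2018-01-15
```

With a weekly frequency, the same horizon of 14 would instead cover 14 weeks.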

### Training Parameters

Some parameters specific to this training job can be set with the `set_training()` method.

|Property|Description|
|-|-|
|**[allowed_training_algorithms](https://learn.microsoft.com/en-us/azure/machine-learning/how-to-auto-train-forecast#configure-experiment)**|The algorithms that will be allowed to train. All other models will be blocked.|
|**enable_dnn_training**|Whether to enable forecasting DNNs (Deep Neural Networks).|

In [None]:
# general job parameters to predict the next 14 days
max_trials = 5
exp_name = "forecasting-github-dau"

target_column_name = "count"
forecast_horizon = 14
time_column_name = "date"

In [None]:
# Create the AutoML forecasting job with the related factory-function.

forecasting_job = automl.forecasting(
    compute=compute_name,
    experiment_name=exp_name,
    training_data=my_training_data_input,
    # validation_data = my_validation_data_input,
    target_column_name=target_column_name,
    primary_metric="NormalizedRootMeanSquaredError",
    n_cross_validations=10,
)

# Limits are all optional
forecasting_job.set_limits(
    timeout_minutes=120,
    trial_timeout_minutes=30,
    max_trials=max_trials,
    max_concurrent_trials=3,
)

# Specialized properties for Time Series Forecasting training
forecasting_job.set_forecast_settings(
    time_column_name=time_column_name, 
    forecast_horizon=forecast_horizon, 
    frequency="D"
)

# Enable DNN training and allow only the TCNForecaster model
forecasting_job.set_training(
    allowed_training_algorithms=["TCNForecaster"], enable_dnn_training=True
)

## 4.2 Train the AutoML model
Using the `MLClient` created earlier, we will now submit this AutoML job to the workspace.

It is OK if the run ends successfully with a warning: _No scores improved over last 20 iterations, so experiment stopped early. This early stopping behavior can be disabled by setting enable_early_stopping = False in AutoMLConfig for notebook/python SDK runs._ 

In [None]:
# Submit the AutoML job
returned_job = ml_client.jobs.create_or_update(
    forecasting_job
)  # submit the job to the backend

print(f"Created job: {returned_job}")

In [None]:
ml_client.jobs.stream(returned_job.name)

# 5. Registering the model in the registry

We will locate the best run and register its model in our model registry.

In [None]:
from azure.ai.ml.entities import Model
import mlflow

# Use the job submitted above, or paste a job name copied from the Azure ML Studio UI
job_name = returned_job.name
model_name = "github-dau-model"
local_dir = "./artifact_downloads"
output_folder = "outputs"

# Obtain the tracking URL from MLClient
MLFLOW_TRACKING_URI = ml_client.workspaces.get(
    name=ml_client.workspace_name
).mlflow_tracking_uri

# Set the MLFLOW TRACKING URI for Azure ML
mlflow.set_tracking_uri(MLFLOW_TRACKING_URI)
# Initialize MLFlow client
mlflow_client = mlflow.tracking.MlflowClient()

In [None]:
# find the best run and download the artifacts from the best run
# Get the best model's child run
mlflow_parent_run = mlflow_client.get_run(job_name)
best_child_run_id = mlflow_parent_run.data.tags["automl_best_child_run_id"]
print("Found best child run id: ", best_child_run_id)

best_run = mlflow_client.get_run(best_child_run_id)

# Create the local folder if it does not exist yet
import os

os.makedirs(local_dir, exist_ok=True)
# Download run's artifacts/outputs
local_path = mlflow_client.download_artifacts(
    best_run.info.run_id, output_folder, local_dir
)
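
Before registering, it can help to confirm that the download produced an MLflow model folder. A minimal sketch, assuming the standard MLflow layout in which a model directory contains a descriptor file named `MLmodel`. The helper `looks_like_mlflow_model` is hypothetical, and the demonstration uses a temporary folder rather than the real download.

```python
import os
import tempfile


def looks_like_mlflow_model(model_dir: str) -> bool:
    """Return True if model_dir contains the MLmodel descriptor file."""
    return os.path.isfile(os.path.join(model_dir, "MLmodel"))


# Demonstration on a temporary folder that mimics the download layout.
demo_root = tempfile.mkdtemp()
demo_model_dir = os.path.join(demo_root, "outputs", "mlflow-model")
os.makedirs(demo_model_dir)
print(looks_like_mlflow_model(demo_model_dir))  # False

# Create an empty descriptor file to stand in for a real model.
with open(os.path.join(demo_model_dir, "MLmodel"), "w") as descriptor:
    descriptor.write("")
print(looks_like_mlflow_model(demo_model_dir))  # True
```

In the notebook, the same check would be applied to the `mlflow-model` folder under `local_dir` and `output_folder`.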

In [None]:
# Register the downloaded MLflow model in the workspace
model_local_path = os.path.join(local_dir, output_folder, "mlflow-model")
model = ml_client.models.create_or_update(
    Model(name=model_name, path=model_local_path, type=AssetTypes.MLFLOW_MODEL)
)

Now, in Azure ML Studio under Assets --> Models, you should be able to find the newly registered model, which is the best model found by our AutoML training process.