# AutoML: Train "the best" Time-Series Forecasting model for the Energy Demand Dataset.

**Requirements** - In order to benefit from this tutorial, you will need:
- A basic understanding of Machine Learning
- An Azure account with an active subscription. [Create an account for free](https://azure.microsoft.com/free/?WT.mc_id=A261C142F)
- An Azure ML workspace. [Check this notebook for creating a workspace](/sdk/resources/workspace/workspace.ipynb) 
- A Compute Cluster. [Check this notebook to create a compute cluster](/sdk/resources/compute/compute.ipynb)
- A Python environment
- Installed Azure Machine Learning Python SDK v2 - [install instructions](/sdk/README.md#getting-started)

**Learning Objectives** - By the end of this tutorial, you should be able to:
- Connect to your AML workspace from the Python SDK
- Create an `AutoML time-series forecasting Job` with the `forecasting()` factory function.
- Train the model using AmlCompute by submitting/running the AutoML forecasting training job
- Obtain the model and score predictions with it

**Motivations** - This notebook explains how to set up and run an AutoML forecasting job. Forecasting is one of the nine ML tasks supported by AutoML. Other ML tasks include 'regression', 'classification', 'image classification', 'image object detection', and 'nlp text classification', among others.

In this example we use the associated New York City energy demand dataset to showcase how you can use AutoML for a simple forecasting problem and explore the results. The goal is to predict the energy demand for the next 48 hours based on historic time-series data.


# 1. Connect to Azure Machine Learning Workspace

The [workspace](https://docs.microsoft.com/en-us/azure/machine-learning/concept-workspace) is the top-level resource for Azure Machine Learning, providing a centralized place to work with all the artifacts you create when you use Azure Machine Learning. In this section we will connect to the workspace in which the job will be run.

## 1.1. Import the required libraries

In [2]:
# Import required libraries
from azure.identity import DefaultAzureCredential
from azure.identity import InteractiveBrowserCredential
from azure.ml import MLClient

from azure.ml._constants import AssetTypes
from azure.ml import automl
from azure.ml.entities import JobInput

## 1.2. Configure workspace details and get a handle to the workspace

To connect to a workspace, we need identifier parameters - a subscription, resource group and workspace name. We will use these details in the `MLClient` from `azure.ml` to get a handle to the required Azure Machine Learning workspace. We use the default [interactive authentication](https://docs.microsoft.com/en-us/python/api/azure-identity/azure.identity.interactivebrowsercredential?view=azure-python) for this tutorial. More advanced connection methods can be found [here](https://docs.microsoft.com/en-us/python/api/azure-identity/azure.identity?view=azure-python).

In [3]:
credential = InteractiveBrowserCredential()
ml_client = None
try:
    ml_client = MLClient.from_config(credential)
except Exception as ex:
    print(ex)
    # Enter details of your AML workspace
    subscription_id = "<SUBSCRIPTION_ID>"
    resource_group = "<RESOURCE_GROUP>"
    workspace = "<AML_WORKSPACE_NAME>"
    ml_client = MLClient(credential, subscription_id, resource_group, workspace)

Found the config file in: C:\Users\CESARDL.CESARDLSB2-MSFT\GitAMLRepos\azureml-examples-automl-preview-branch\sdk\jobs\automl-standalone-jobs\config.json
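The output above shows that `MLClient.from_config` located a `config.json` file. As a minimal sketch, assuming that file holds the three identifiers under the keys `subscription_id`, `resource_group` and `workspace_name`, the snippet below writes such a file and reads it back. The helper name `write_workspace_config` is hypothetical and is not part of the SDK.

```python
import json
import tempfile
from pathlib import Path


def write_workspace_config(folder, subscription_id, resource_group, workspace_name):
    """Write a config.json holding the three workspace identifiers (hypothetical helper)."""
    config = {
        "subscription_id": subscription_id,
        "resource_group": resource_group,
        "workspace_name": workspace_name,
    }
    path = Path(folder) / "config.json"
    path.write_text(json.dumps(config, indent=2))
    return path


# Write the file into a temporary folder and read it back to check its shape.
with tempfile.TemporaryDirectory() as tmp:
    config_path = write_workspace_config(
        tmp, "<SUBSCRIPTION_ID>", "<RESOURCE_GROUP>", "<AML_WORKSPACE_NAME>"
    )
    loaded = json.loads(config_path.read_text())

print(sorted(loaded))
```

In practice you would place the file in (or above) the folder you run the notebook from, with your real identifiers in place of the placeholders.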


### Show Azure ML Workspace information

In [4]:
import pandas as pd

workspace = ml_client.workspaces.get(name=ml_client.workspace_name)

output = {}
output["Workspace"] = ml_client.workspace_name
output["Subscription ID"] = ml_client.connections._subscription_id
output["Resource Group"] = workspace.resource_group
output["Location"] = workspace.location
pd.set_option("display.max_colwidth", None)
outputDf = pd.DataFrame(data=output, index=[""])
outputDf.T

|Property|Value|
|-|-|
|Workspace|cesardl-automl-centraluseuap-ws|
|Subscription ID|102a16c3-37d3-48a8-9237-4c9b1e8e80e0|
|Resource Group|automlpmdemo|
|Location|centraluseuap|


# 2. Data

We will use energy consumption [data from New York City](http://mis.nyiso.com/public/P-58Blist.htm) for model training. 
The data is stored in a tabular format and includes energy demand and basic weather data at an hourly frequency. 

With Azure Machine Learning MLTables you can keep a single copy of data in your storage, easily access data during model training, share data and collaborate with other users. 
Below, we will upload the data by creating an MLTable to be used for training.

**NOTE:** In this PRIVATE PREVIEW we're defining the MLTable in a separate folder and .YAML file.
In later versions, you'll be able to do it all in Python APIs.

In [5]:
# Training MLTable defined locally, with local data to be uploaded
my_training_data_input = JobInput(
    type=AssetTypes.MLTABLE, path="./data/training-mltable-folder"
)

# Training MLTable defined locally, with local data to be uploaded
my_validation_data_input = JobInput(
    type=AssetTypes.MLTABLE, path="./data/validation-mltable-folder"
)

# WITH REMOTE PATH
# my_training_data_input  = JobInput(type=AssetTypes.MLTABLE, path="azureml://datastores/workspaceblobstore/paths/my-forecasting-mltable")
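The cell above points to separate training and validation folders. For time-series data the validation rows should normally come after the training rows in time. The following is a minimal sketch of such a chronological split, assuming rows are dictionaries with a `timeStamp` column in ISO format; the helper `chronological_split` and the sample rows are hypothetical and are not how the notebook's data folders were actually produced.

```python
from datetime import datetime


def chronological_split(rows, time_key, n_validation):
    """Sort rows by time and hold out the last n_validation rows for validation."""
    if not 0 < n_validation < len(rows):
        raise ValueError("n_validation must be between 1 and len(rows) - 1")
    ordered = sorted(rows, key=lambda r: datetime.fromisoformat(r[time_key]))
    return ordered[:-n_validation], ordered[-n_validation:]


# Hypothetical hourly rows, deliberately out of order.
rows = [
    {"timeStamp": "2017-01-01 02:00:00", "demand": 3},
    {"timeStamp": "2017-01-01 00:00:00", "demand": 1},
    {"timeStamp": "2017-01-01 03:00:00", "demand": 4},
    {"timeStamp": "2017-01-01 01:00:00", "demand": 2},
]

train_rows, validation_rows = chronological_split(rows, "timeStamp", n_validation=1)
print([r["demand"] for r in train_rows], [r["demand"] for r in validation_rows])
```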

# 3. Configure and run the AutoML Forecasting training job
In this section we will configure and run the AutoML job that trains the model.

## 3.1 Configure the job through the forecasting() factory function

### forecasting() function parameters:

The `forecasting()` factory function allows users to configure AutoML for the forecasting task for the most common scenarios with the following properties.

- `target_column_name` - The name of the column to target for predictions. It must always be specified. This parameter is applicable to 'training_data', 'validation_data' and 'test_data'.
- `primary_metric` - The metric that AutoML will optimize for model selection.
- `training_data` - The data to be used for training. It should contain both training feature columns and a target column. Optionally, this data can be split for segregating a validation or test dataset. 
You can use a registered MLTable in the workspace using the format '<mltable_name>:<version>', or you can use a local file or folder as an MLTable. For example, JobInput(mltable='my_mltable:1') or JobInput(mltable=MLTable(local_path="./data")).
The parameter 'training_data' must always be provided.
- `compute` - The compute on which the AutoML job will run. In this example we are using a compute called 'cpu-cluster' present in the workspace. You can replace it with any other compute in the workspace. 
- `name` - The name of the Job/Run. This is an optional property. If not specified, a random name will be generated.
- `experiment_name` - The name of the Experiment. An Experiment is like a folder with multiple runs in Azure ML Workspace that should be related to the same logical machine learning experiment.

### set_limits() parameters:
This is an optional configuration method to configure limits parameters such as timeouts.     
    
- timeout_minutes - Maximum amount of time in minutes that the whole AutoML job can take before the job terminates. This timeout includes setup, featurization and training runs but does not include the ensembling and model explainability runs at the end of the process, since those actions need to happen once all the trials (child jobs) are done. If not specified, the job's default total timeout is 6 days (8,640 minutes). To specify a timeout less than or equal to 1 hour (60 minutes), make sure your dataset's size is not greater than 10,000,000 (rows times columns) or an error results.

- trial_timeout_minutes - Maximum time in minutes that each trial (child job) can run for before it terminates. If not specified, a value of 1 month or 43200 minutes is used.
    
- max_trials - The maximum number of trials/runs each with a different combination of algorithm and hyperparameters to try during an AutoML job. If not specified, the default is 1000 trials. If using 'enable_early_termination' the number of trials used can be smaller.
    
- max_concurrent_trials - The maximum number of trials (child jobs) that will be executed in parallel. It's a good practice to match this number with the number of nodes in your cluster.
    
- enable_early_termination - Whether to enable early termination if the score is not improving in the short term. 
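The figures quoted in the list above can be checked with a few lines of arithmetic. This is only a sketch: the helper names are hypothetical, and reading "1 month" as 30 days is an assumption chosen because it matches the 43,200 minute figure.

```python
MINUTES_PER_DAY = 24 * 60


def days_to_minutes(days):
    """Convert a number of days to minutes."""
    return days * MINUTES_PER_DAY


def small_enough_for_short_timeout(n_rows, n_columns, max_cells=10_000_000):
    """Check the 'rows times columns' size rule quoted for timeouts of 60 minutes or less."""
    return n_rows * n_columns <= max_cells


default_job_timeout = days_to_minutes(6)     # default total job timeout
default_trial_timeout = days_to_minutes(30)  # "1 month", read here as 30 days

print(default_job_timeout, default_trial_timeout)
print(small_enough_for_short_timeout(n_rows=50_000, n_columns=5))
```

A hypothetical table of 50,000 rows and 5 columns has 250,000 cells, well under the 10,000,000 limit, so a short timeout would be acceptable for it under this rule.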
    

## Specialized Forecasting Parameters
To define forecasting parameters for your experiment training, you can leverage the .set_forecast_settings() method. 
The table below details the forecasting parameters we will be passing into our experiment.

|Property|Description|
|-|-|
|**time_column_name**|The name of your time column.|
|**forecast_horizon**|The forecast horizon is how many periods forward you would like to forecast. This integer horizon is in units of the timeseries frequency (e.g. daily, weekly).|
|**frequency**|Forecast frequency. This optional parameter represents the period with which the forecast is desired, for example, daily, weekly, yearly, etc. Use this parameter for the correction of time series containing irregular data points or for padding of short time series. The frequency needs to be a pandas offset alias. Please refer to [pandas documentation](https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#dateoffset-objects) for more information.|
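To make the relationship between `forecast_horizon` and `frequency` concrete, the sketch below lists the timestamps covered by a horizon of 48 periods at an hourly frequency. The function `forecast_timestamps` and the last observed timestamp are hypothetical, chosen only for illustration; they are not taken from the dataset or the SDK.

```python
from datetime import datetime, timedelta


def forecast_timestamps(last_observed, horizon, step):
    """Return the `horizon` timestamps that follow `last_observed`, spaced by `step`."""
    return [last_observed + step * (i + 1) for i in range(horizon)]


# Hypothetical end of the training data; an hourly step mirrors an hourly frequency.
last_observed = datetime(2017, 8, 8, 5, 0)
stamps = forecast_timestamps(last_observed, horizon=48, step=timedelta(hours=1))

print(stamps[0], stamps[-1], len(stamps))
```

With an hourly frequency, a horizon of 48 therefore spans the two days following the last observation.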

# Advanced Forecasting Training <a id="advanced_training"></a>
### Using lags and rolling window features
This training also uses **target lags**, that is, previous values of the target variable, so each prediction is made for a given horizon. We must therefore still specify the `forecast_horizon` that the model will learn to forecast. The `target_lags` keyword specifies how far back we will construct the lags of the target variable, and the `target_rolling_window_size` specifies the size of the rolling window over which we will generate the `max`, `min` and `sum` features.

This notebook uses the `blocked_models` parameter of `.set_training()` to exclude some models that take a longer time to train on this dataset. You can choose to remove models from the `blocked_models` list, but you may need to increase the trial timeout (`trial_timeout_minutes`) to get results.
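As an illustration of what lag and rolling window features look like, here is a minimal plain-Python sketch. It is not AutoML's featurization code, which is not shown in this notebook and may differ in details such as how the forecast horizon shifts the window; the function names and the toy series are hypothetical.

```python
def lag_feature(values, lag):
    """Value of the series `lag` steps earlier, or None where no history exists."""
    return [values[i - lag] if i >= lag else None for i in range(len(values))]


def rolling_features(values, window):
    """min, max and sum over the `window` values preceding each position."""
    rows = []
    for i in range(len(values)):
        if i < window:
            rows.append(None)  # not enough history yet
        else:
            past = values[i - window:i]
            rows.append({"min": min(past), "max": max(past), "sum": sum(past)})
    return rows


# Toy demand series, for illustration only.
demand = [10, 12, 11, 15, 14, 13]

lag2 = lag_feature(demand, lag=2)
roll3 = rolling_features(demand, window=3)

print(lag2)
print(roll3[3], roll3[5])
```

Each window here uses only values before the current position, so the features never include the value being predicted.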

In [6]:
# general job parameters
compute_name = "cpu-cluster"
max_trials = 5
exp_name = "dpv2-forecasting-experiment"

In [7]:
# Create the AutoML forecasting job with the related factory-function.

forecasting_job = automl.forecasting(
    compute=compute_name,
    # name="dpv2-forecasting-job-02",
    experiment_name=exp_name,
    training_data=my_training_data_input,
    # validation_data = my_validation_data_input,
    target_column_name="demand",
    primary_metric="NormalizedRootMeanSquaredError",
    n_cross_validations=3,  
    enable_model_explainability=True,
    tags={"my_custom_tag": "My custom value"}
)

# Limits are all optional
forecasting_job.set_limits(
    timeout=600,  # timeout_minutes
    trial_timeout=20,  # trial_timeout_minutes
    max_trials=max_trials,
    # max_concurrent_trials = 4,
    # max_cores_per_trial=-1,
    enable_early_termination=True,
)

# Specialized properties for Time Series Forecasting training
forecasting_job.set_forecast_settings(
    time_column_name="timeStamp",
    forecast_horizon=48,
    frequency="H",
    target_lags=[12],
    target_rolling_window_size=4,
    # ADDITIONAL FORECASTING TRAINING PARAMS ---
    # time_series_id_column_names=["tid1", "tid2", "tid3"],
    # short_series_handling_config=ShortSeriesHandlingConfiguration.DROP,
    # use_stl="season",
    # seasonality=3,
)

# Training properties are optional
forecasting_job.set_training(blocked_models=["ExtremeRandomTrees"])


## 3.2 Run the AutoML job
Using the `MLClient` created earlier, we will now submit this AutoML forecasting job to the workspace.

In [8]:
# Submit the AutoML job 
returned_job = ml_client.jobs.create_or_update(
    forecasting_job
)  # submit the job to the backend

print(f"Created job: {returned_job}")

Created job: ForecastingJob({'log_verbosity': <LogVerbosity.INFO: 'Info'>, 'task_type': <TaskType.FORECASTING: 'Forecasting'>, 'environment_id': None, 'environment_variables': None, 'outputs': {}, 'display_name': '3221b03b-faeb-455c-959e-d1baf5a5a379', 'type': 'automl', 'status': 'NotStarted', 'log_files': None, 'name': '3221b03b-faeb-455c-959e-d1baf5a5a379', 'description': None, 'tags': {'my_custom_tag': 'My custom value'}, 'properties': {'mlflow.source.git.repoURL': 'git@github.com:Azure/azureml-examples.git', 'mlflow.source.git.branch': 'automl-preview', 'mlflow.source.git.commit': 'e8c2b8571ff0e68a2894be74cbec7cb0c0ed5bdf', 'azureml.git.dirty': 'True'}, 'id': '/subscriptions/102a16c3-37d3-48a8-9237-4c9b1e8e80e0/resourceGroups/automlpmdemo/providers/Microsoft.MachineLearningServices/workspaces/cesardl-automl-centraluseuap-ws/jobs/3221b03b-faeb-455c-959e-d1baf5a5a379', 'base_path': './', 'creation_context': <azure.ml._restclient.v2022_02_01_preview.models._models_py3.SystemData objec

In [9]:
# Wait until AutoML training runs are finished
ml_client.jobs.stream(returned_job.name)

RunId: 3221b03b-faeb-455c-959e-d1baf5a5a379
Web View: https://ml.azure.com/runs/3221b03b-faeb-455c-959e-d1baf5a5a379?wsid=/subscriptions/102a16c3-37d3-48a8-9237-4c9b1e8e80e0/resourcegroups/automlpmdemo/workspaces/cesardl-automl-centraluseuap-ws

Execution Summary
RunId: 3221b03b-faeb-455c-959e-d1baf5a5a379
Web View: https://ml.azure.com/runs/3221b03b-faeb-455c-959e-d1baf5a5a379?wsid=/subscriptions/102a16c3-37d3-48a8-9237-4c9b1e8e80e0/resourcegroups/automlpmdemo/workspaces/cesardl-automl-centraluseuap-ws



# 4. Retrieve the Best Trial (Best Model's trial/run)
Use the `MlflowClient` to access the results (such as models, artifacts, and metrics) of a previously completed AutoML trial.

## Initialize MLFlow Client
The models and artifacts produced by AutoML can be accessed through the MLflow interface. 
Initialize the MLflow client here and set Azure ML as its tracking backend.

**IMPORTANT:** You need to have the latest MLflow packages installed:

    pip install azureml-mlflow

    pip install mlflow

### Obtain the tracking URI for MLFlow

In [10]:
import mlflow

# Obtain the tracking URL from MLClient
MLFLOW_TRACKING_URI = ml_client.workspaces.get(name=ml_client.workspace_name).mlflow_tracking_uri

print(MLFLOW_TRACKING_URI)

azureml://master.api.azureml-test.ms/mlflow/v1.0/subscriptions/102a16c3-37d3-48a8-9237-4c9b1e8e80e0/resourceGroups/automlpmdemo/providers/Microsoft.MachineLearningServices/workspaces/cesardl-automl-centraluseuap-ws


In [11]:
# Set the MLFLOW TRACKING URI

mlflow.set_tracking_uri(MLFLOW_TRACKING_URI)

print("\nCurrent tracking uri: {}".format(mlflow.get_tracking_uri()))


Current tracking uri: azureml://master.api.azureml-test.ms/mlflow/v1.0/subscriptions/102a16c3-37d3-48a8-9237-4c9b1e8e80e0/resourceGroups/automlpmdemo/providers/Microsoft.MachineLearningServices/workspaces/cesardl-automl-centraluseuap-ws


In [12]:
from mlflow.tracking.client import MlflowClient

# Initialize MLFlow client
mlflow_client = MlflowClient()

### Get the AutoML parent Job

In [13]:
job_name = returned_job.name

# Example if providing a specific Job name/ID
# job_name = "591640e8-0f88-49c5-adaa-39b9b9d75531"

# Get the parent run
mlflow_parent_run = mlflow_client.get_run(job_name)

print("Parent Run: ")
print(mlflow_parent_run)

Parent Run: 
<Run: data=<RunData: metrics={'explained_variance': 0.844495269637278,
 'mean_absolute_error': 203.31816622469407,
 'mean_absolute_percentage_error': 3.5692315719145227,
 'median_absolute_error': 205.24665270651454,
 'normalized_mean_absolute_error': 0.023651547883380725,
 'normalized_median_absolute_error': 0.02387588440585763,
 'normalized_root_mean_squared_error': 0.028141755988473488,
 'normalized_root_mean_squared_log_error': 0.030106212284799042,
 'r2_score': 0.8444613380895625,
 'root_mean_squared_error': 241.91779117931344,
 'root_mean_squared_log_error': 0.04177446204350856,
 'spearman_correlation': 0.866464084381684}, params={}, tags={'automl_best_child_run_id': '3221b03b-faeb-455c-959e-d1baf5a5a379_4',
 'fit_time': '',
 'iteration': '',
 'mlflow.rootRunId': '3221b03b-faeb-455c-959e-d1baf5a5a379',
 'mlflow.runName': '3221b03b-faeb-455c-959e-d1baf5a5a379',
 'model_explain_best_run_child_id': '3221b03b-faeb-455c-959e-d1baf5a5a379_4',
 'model_explain_run': 'best_run

In [14]:
# Print parent run tags. 'automl_best_child_run_id' tag should be there.
print(mlflow_parent_run.data.tags)

{'my_custom_tag': 'My custom value', 'model_explain_run': 'best_run', 'pipeline_id': '', 'score': '', 'predicted_cost': '', 'fit_time': '', 'training_percent': '', 'iteration': '', 'run_preprocessor': '', 'run_algorithm': '', 'automl_best_child_run_id': '3221b03b-faeb-455c-959e-d1baf5a5a379_4', 'model_explain_best_run_child_id': '3221b03b-faeb-455c-959e-d1baf5a5a379_4', 'mlflow.rootRunId': '3221b03b-faeb-455c-959e-d1baf5a5a379', 'mlflow.runName': '3221b03b-faeb-455c-959e-d1baf5a5a379'}


## Get the AutoML best child run

In [15]:
# Get the best model's child run

best_child_run_id = mlflow_parent_run.data.tags["automl_best_child_run_id"]
print("Found best child run id: ", best_child_run_id)

best_run = mlflow_client.get_run(best_child_run_id)

print("Best child run: ")
print(best_run)

Found best child run id:  3221b03b-faeb-455c-959e-d1baf5a5a379_4
Best child run: 
<Run: data=<RunData: metrics={'explained_variance': 0.844495269637278,
 'mean_absolute_error': 203.31816622469407,
 'mean_absolute_percentage_error': 3.5692315719145227,
 'median_absolute_error': 205.24665270651454,
 'normalized_mean_absolute_error': 0.023651547883380725,
 'normalized_median_absolute_error': 0.02387588440585763,
 'normalized_root_mean_squared_error': 0.028141755988473488,
 'normalized_root_mean_squared_log_error': 0.030106212284799042,
 'r2_score': 0.8444613380895625,
 'root_mean_squared_error': 241.91779117931344,
 'root_mean_squared_log_error': 0.04177446204350856,
 'spearman_correlation': 0.866464084381684}, params={}, tags={'mlflow.parentRunId': '3221b03b-faeb-455c-959e-d1baf5a5a379',
 'mlflow.rootRunId': '3221b03b-faeb-455c-959e-d1baf5a5a379',
 'mlflow.runName': 'patient_guava_vbr2b8gk',
 'model_explain_run_id': '3221b03b-faeb-455c-959e-d1baf5a5a379_ModelExplain',
 'model_explanation

## Get best model run's metrics

Access the results (such as Models, Artifacts, Metrics) of a previously completed AutoML Run.

In [16]:
import pandas as pd

pd.DataFrame(best_run.data.metrics, index=[0]).T

|Metric|Value|
|-|-|
|r2_score|0.844461|
|mean_absolute_percentage_error|3.569232|
|spearman_correlation|0.866464|
|normalized_mean_absolute_error|0.023652|
|root_mean_squared_log_error|0.041774|
|root_mean_squared_error|241.917791|
|normalized_median_absolute_error|0.023876|
|median_absolute_error|205.246653|
|normalized_root_mean_squared_log_error|0.030106|
|normalized_root_mean_squared_error|0.028142|
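The primary metric for this job is the normalized root mean squared error. The sketch below computes RMSE and a normalized RMSE on toy numbers, assuming the normalization divides by the range (max minus min) of the actual values. That normalization is an assumption made for illustration; the notebook does not show how AutoML computes it, and the toy values are not from the energy dataset.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between two equally long sequences."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def normalized_rmse(actual, predicted):
    """RMSE divided by the range of the actual values (assumed normalization)."""
    value_range = max(actual) - min(actual)
    return rmse(actual, predicted) / value_range


# Toy values, for illustration only.
actual = [100.0, 200.0, 300.0, 400.0]
predicted = [110.0, 190.0, 310.0, 390.0]

print(rmse(actual, predicted), normalized_rmse(actual, predicted))
```

A normalized value makes errors comparable across series with different scales, which is one reason to prefer it as a model selection metric.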


## Download the best model locally

Download the outputs of the best run, including the trained model, to a local folder.

In [17]:
import os

# Create the local folder if it does not exist yet
local_dir = "./artifact_downloads"
os.makedirs(local_dir, exist_ok=True)

In [18]:
# Download run's artifacts/outputs
local_path = mlflow_client.download_artifacts(
    best_run.info.run_id, "outputs", local_dir
)
print("Artifacts downloaded in: {}".format(local_path))
print("Artifacts: {}".format(os.listdir(local_path)))

Artifacts downloaded in: C:\Users\CESARDL.CESARDLSB2-MSFT\GitAMLRepos\azureml-examples-automl-preview-branch\sdk\jobs\automl-standalone-jobs\automl-forecasting-task-energy-demand\artifact_downloads\outputs
Artifacts: ['conda_env_v_1_0_0.yml', 'engineered_feature_names.json', 'env_dependencies.json', 'featurization_summary.json', 'internal_cross_validated_models.pkl', 'mlflow-model', 'model.pkl', 'pipeline_graph.json', 'run_id.txt', 'scoring_file_v_1_0_0.py', 'scoring_file_v_2_0_0.py']


In [19]:
import os

# Show the contents of the MLFlow model folder
os.listdir("./artifact_downloads/outputs/mlflow-model")

['conda.yaml', 'MLmodel', 'model.pkl', 'requirements.txt']

# Next Step: Load the best model and try predictions

Loading the model locally assumes that you are running the notebook in an environment compatible with the model. The list of dependencies expected by the model is specified in the MLflow model produced by AutoML (in the 'conda.yaml' file within the mlflow-model folder).

Because the AutoML model was trained remotely, in an environment whose dependencies differ from those of the local conda environment where you are running this notebook, you have several options for loading the model:

1. A recommended way to locally load the model in memory and try predictions is to create a new/clean conda environment with the dependencies specified in the conda.yaml file within the MLflow model's folder, then use MLflow to load the model and call .predict() as explained in the notebook **mlflow-model-local-inference-test.ipynb** in this same folder.

2. You can install all the packages/dependencies specified in conda.yaml into the conda environment you currently use for the Azure ML SDK and AutoML. The MLflow SDK also has a method to install the dependencies in the current environment. However, this option carries a risk of package version conflicts, depending on what's installed in your current environment.

3. You can also use: mlflow models serve -m 'xxxxxxx'
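Before choosing one of the options above, it can help to see how far your current environment is from what the model expects. The sketch below compares pinned `name==version` lines, such as those you might find in the `requirements.txt` of the mlflow-model folder, against the packages installed locally. It assumes simple pinned lines, the helper names are hypothetical, and the sample text is made up rather than copied from a real model.

```python
from importlib import metadata


def parse_pinned_requirements(text):
    """Collect name -> version for lines of the form 'name==version'."""
    pins = {}
    for line in text.splitlines():
        # Drop comments and environment markers, then keep only pinned lines.
        line = line.split("#", 1)[0].split(";", 1)[0].strip()
        if "==" not in line:
            continue
        name, version = line.split("==", 1)
        pins[name.strip()] = version.strip()
    return pins


def find_mismatches(pins):
    """Return name -> (wanted, installed) where the installed version differs or is missing."""
    mismatches = {}
    for name, wanted in pins.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None
        if installed != wanted:
            mismatches[name] = (wanted, installed)
    return mismatches


# Made-up requirements text; the package name is deliberately one that is not installed.
sample = "# example\nhypothetical-package-not-installed==9.9.9\nunpinned-package>=1.0\n"

pins = parse_pinned_requirements(sample)
mismatches = find_mismatches(pins)
print(mismatches)
```

To use it on a downloaded model, you would read the text from `./artifact_downloads/outputs/mlflow-model/requirements.txt` instead of the sample string.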

# Next Steps
You can see further examples of other AutoML tasks such as Image-Classification, Image-Object-Detection, NLP-Text-Classification, Time-Series-Forecasting, etc.