Copyright (c) Microsoft Corporation. All rights reserved.

Licensed under the MIT License.


# Automated Machine Learning
**BikeShare Demand Forecasting**

## Contents
1. [Introduction](#Introduction)
1. [Setup](#Setup)
1. [Compute](#Compute)
1. [Data](#Data)
1. [Train](#Train)
1. [Featurization](#Featurization)
1. [Evaluate](#Evaluate)

## Introduction
This notebook demonstrates demand forecasting for a bike-sharing service using AutoML.

AutoML highlights here include built-in holiday featurization, accessing engineered feature names, and working with the `forecast` function. Please also look at the additional forecasting notebooks, which document lagging, rolling windows, forecast quantiles, other ways to use the forecast function, and forecaster deployment.

Make sure you have executed the [configuration notebook](../../../configuration.ipynb) before running this notebook.

Notebook synopsis:
1. Creating an Experiment in an existing Workspace
2. Configuration and remote run of AutoML for a time-series model with lag and holiday features
3. Viewing the engineered names for featurized data and featurization summary for all raw features
4. Evaluating the fitted model using a rolling test 

## Setup


In [66]:
%env AZURE_EXTENSION_DIR="E:/Src/git/sdk-cli-v2/src/cli/src"
%env AZURE_ML_CLI_PRIVATE_FEATURES_ENABLED=true

env: AZURE_EXTENSION_DIR="E:/Src/git/sdk-cli-v2/src/cli/src"
env: AZURE_ML_CLI_PRIVATE_FEATURES_ENABLED=true


In [5]:
import azureml.core
import pandas as pd
import numpy as np
import logging

import azure.ml
from azure.ml import MLClient

from azure.core.exceptions import ResourceExistsError

from azure.ml.entities import Workspace
from azure.ml.entities import AmlCompute
from azure.ml.entities import Data

from datetime import datetime

This sample notebook may use features that are not available in previous versions of the Azure ML SDK.

In [73]:
# TODO: Versions need to change
print("This notebook was created using version 1.31.0 of the Azure ML SDK")
print("You are currently using SDK version", azure.ml.version.VERSION, "of the Azure ML SDK")

This notebook was created using version 1.31.0 of the Azure ML SDK
You are currently using SDK version 0.0.88 of the Azure ML SDK


#### TODO: Equivalents for the following may be missing.

Accessing the Azure ML workspace requires authentication with Azure.

The default authentication is interactive authentication using the default tenant.  Executing the `ws = Workspace.from_config()` line in the cell below will prompt for authentication the first time that it is run.

If you have multiple Azure tenants, you can specify the tenant by replacing the `ws = Workspace.from_config()` line in the cell below with the following:

```
from azureml.core.authentication import InteractiveLoginAuthentication
auth = InteractiveLoginAuthentication(tenant_id = 'mytenantid')
ws = Workspace.from_config(auth = auth)
```

If you need to run in an environment where interactive login is not possible, you can use Service Principal authentication by replacing the `ws = Workspace.from_config()` line in the cell below with the following:

```
from azureml.core.authentication import ServicePrincipalAuthentication
auth = ServicePrincipalAuthentication('mytenantid', 'myappid', 'mypassword')
ws = Workspace.from_config(auth = auth)
```
For more details, see [aka.ms/aml-notebook-auth](http://aka.ms/aml-notebook-auth)

### Initialize MLClient

Create an `MLClient` object to interact with Azure ML resources such as computes and jobs.

In [72]:
subscription_id = '381b38e9-9840-4719-a5a0-61d9585e1e91'
resource_group_name = 'yunba_test_rg'
workspace_name = "yunba-test-ws-eastus2" #"gasi_ws_centraleuap"
experiment_name = "sdkv2-auto-ml-forecasting-bike-share"

client = MLClient(subscription_id, resource_group_name, default_workspace_name=workspace_name)

client

<azure.ml._ml_client.MLClient at 0x2c215b7ed68>

### Initialize MLFlowClient

Create an MLflow client to interact with the resources that the AutoML job creates, such as models and metrics.



In [None]:
# !pip install azureml-core azureml-mlflow

In [8]:
import mlflow

########
# TODO: The API to get the tracking URI is not yet available on the Workspace object.
from azureml.core import Workspace as WorkspaceV1
ws = WorkspaceV1(workspace_name=workspace_name, resource_group=resource_group_name, subscription_id=subscription_id)
mlflow.set_tracking_uri(ws.get_mlflow_tracking_uri())
del ws
########

# Not sure why this doesn't work w/o the double + single quotes
# mlflow.set_tracking_uri("azureml://northeurope.experiments.azureml.net/mlflow/v1.0/subscriptions/381b38e9-9840-4719-a5a0-61d9585e1e91/resourceGroups/gasi_rg_neu/providers/Microsoft.MachineLearningServices/workspaces/gasi_ws_neu?")
mlflow.set_experiment(experiment_name)

print("\nCurrent tracking uri: {}".format(mlflow.get_tracking_uri()))

INFO: 'sdkv2-auto-ml-forecasting-bike-share' does not exist. Creating a new experiment

Current tracking uri: azureml://eastus2.experiments.azureml.net/mlflow/v1.0/subscriptions/381b38e9-9840-4719-a5a0-61d9585e1e91/resourceGroups/yunba_test_rg/providers/Microsoft.MachineLearningServices/workspaces/yunba-test-ws-eastus2?


## Create or Attach existing AmlCompute
You will need to create a compute target for your AutoML run. In this tutorial, you create AmlCompute as your training compute resource.

> Note that if you have an AzureML Data Scientist role, you will not have permission to create compute resources. Talk to your workspace or IT admin to create the compute targets described in this section, if they do not already exist.

#### Creation of AmlCompute takes approximately 5 minutes. 
If an AmlCompute cluster with that name already exists in your workspace, this code skips the creation process.
As with other Azure services, there are limits on certain resources (e.g. AmlCompute) associated with the Azure Machine Learning service. Please read [this article](https://docs.microsoft.com/en-us/azure/machine-learning/service/how-to-manage-quotas) on the default limits and how to request more quota.

In [9]:
# Set or create compute

cpu_cluster_name = "cpu-cluster"
compute = AmlCompute(
    name=cpu_cluster_name, size="STANDARD_D11_V2",
    min_instances=0, max_instances=3,
    idle_time_before_scale_down=120
)

# Load directly from YAML file
# compute = Compute.load("./compute.yaml")

try:
    # TODO: This currently results in an exception in Azure ML, please create compute manually.
    client.compute.create(compute)
except ResourceExistsError as re:
    # The cluster already exists, so there is nothing to create.
    print(re)
except Exception as e:
    print("Could not create compute.", str(e))

# Reload the compute target so that `compute` reflects what is in the workspace,
# whether it was just created or already existed.
compute = client.compute.get(cpu_cluster_name)

compute

AzureCliCredential.get_token failed: Please run 'az login' to set up an account


Could not create compute. Cannot deserialize duration object., ISO8601Error: Unable to parse duration string ''


AmlCompute({'type': 'amlcompute', 'created_on': None, 'provisioning_state': 'Succeeded', 'provisioning_errors': None, 'name': 'cpu-cluster', 'description': None, 'tags': {}, 'properties': {}, 'id': '/subscriptions/381b38e9-9840-4719-a5a0-61d9585e1e91/resourceGroups/yunba_test_rg/providers/Microsoft.MachineLearningServices/workspaces/yunba-test-ws-eastus2/computes/cpu-cluster', 'base_path': './', 'creation_context': None, 'location': 'eastus2', 'enable_public_ip': False, 'resource_id': None, 'size': 'STANDARD_D2_V2', 'min_instances': 0, 'max_instances': 6, 'idle_time_before_scale_down': 120.0, 'identity_type': None, 'user_assigned_identities': None, 'admin_username': 'azureuser', 'admin_password': None, 'ssh_key_value': None, 'vnet_name': None, 'subnet': None, 'priority': 'Dedicated'})

## Data

The [Machine Learning service workspace](https://docs.microsoft.com/en-us/azure/machine-learning/service/concept-workspace) is paired with a storage account, which contains the default datastore. We will use it to upload the bike share data and create a [tabular dataset](https://docs.microsoft.com/en-us/python/api/azureml-core/azureml.data.tabulardataset?view=azure-ml-py) for training. A tabular dataset defines a series of lazily-evaluated, immutable operations to load data from the data source into a tabular representation.

### TODO: This doesn't work; ensure the dataset is created via the UI.
#### The ws_tmp setup below should be changed: upload the data to initialize a dataset, then split the data at the chosen time points

In [11]:
# create a ws object to upload data to datastore.
from azureml.core import Workspace
ws_tmp = Workspace(subscription_id=subscription_id, resource_group=resource_group_name, workspace_name=workspace_name)

In [12]:
datastore = ws_tmp.get_default_datastore()
datastore.upload_files(files = ['./bike-no.csv'], target_path = 'dataset/', overwrite = True,show_progress = True)

Uploading an estimated of 1 files
Uploading ./bike-no.csv
Uploaded ./bike-no.csv, 1 files out of an estimated total of 1
Uploaded 1 files


$AZUREML_DATAREFERENCE_02b53e73aaf94fbfb63f5b870e4e6683

In [14]:
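from azureml.core import Dataset

# The time column is needed here, before it is discussed in the Train section.
time_column_name = 'date'

dataset = Dataset.Tabular.from_delimited_files(path = [(datastore, 'dataset/bike-no.csv')]).with_timestamp_columns(fine_grain_timestamp=time_column_name) 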

# Drop the columns 'casual' and 'registered' as these columns are a breakdown of the total and therefore a leak.
dataset = dataset.drop_columns(columns=['casual', 'registered'])

dataset.take(5).to_pandas_dataframe().reset_index(drop=True)

Unnamed: 0,instant,date,season,yr,mnth,weekday,weathersit,temp,atemp,hum,windspeed,cnt
0,1,2011-01-01,1,0,1,6,2,0.344167,0.363625,0.805833,0.160446,985
1,2,2011-01-02,1,0,1,0,2,0.363478,0.353739,0.696087,0.248539,801
2,3,2011-01-03,1,0,1,1,1,0.196364,0.189405,0.437273,0.248309,1349
3,4,2011-01-04,1,0,1,2,1,0.2,0.212122,0.590435,0.160296,1562
4,5,2011-01-05,1,0,1,3,1,0.226957,0.22927,0.436957,0.1869,1600


### Split the data

The first split we make is into train and test sets. Note that we split on time: data before 2012-09-01 is used for training, and data from 2012-09-01 onward is used for testing.
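
As an illustration of the time-based split described above, here is a minimal standard-library sketch. It does not use the Azure ML dataset API; the helper name `split_by_date` is hypothetical, and the four rows are the `date`/`cnt` pairs that appear around the boundary in the previews in this notebook.

```python
from datetime import date

# (date, cnt) pairs around the split boundary, as shown in the data previews.
records = [
    (date(2012, 8, 30), 7713),
    (date(2012, 8, 31), 7350),
    (date(2012, 9, 1), 6140),
    (date(2012, 9, 2), 5810),
]


def split_by_date(rows, first_test_day):
    """Return (train, test): rows strictly before first_test_day, then the rest."""
    train = [row for row in rows if row[0] < first_test_day]
    test = [row for row in rows if row[0] >= first_test_day]
    return train, test


train_rows, test_rows = split_by_date(records, date(2012, 9, 1))
print(len(train_rows), len(test_rows))
```

Splitting on a date rather than at random keeps every test observation later than every training observation, which is what a forecasting evaluation needs.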

In [15]:
# select data that occurs before a specified date
train = dataset.time_before(datetime(2012, 8, 31), include_boundary=True)
train.to_pandas_dataframe().tail(5).reset_index(drop=True)

Unnamed: 0,instant,date,season,yr,mnth,weekday,weathersit,temp,atemp,hum,windspeed,cnt
0,605,2012-08-27,3,1,8,1,1,0.703333,0.654688,0.730417,0.128733,6917
1,606,2012-08-28,3,1,8,2,1,0.728333,0.66605,0.62,0.190925,7040
2,607,2012-08-29,3,1,8,3,1,0.685,0.635733,0.552083,0.112562,7697
3,608,2012-08-30,3,1,8,4,1,0.706667,0.652779,0.590417,0.077117,7713
4,609,2012-08-31,3,1,8,5,1,0.764167,0.6894,0.5875,0.168533,7350


In [16]:
test = dataset.time_after(datetime(2012, 9, 1), include_boundary=True)
test.to_pandas_dataframe().head(5).reset_index(drop=True)

Unnamed: 0,instant,date,season,yr,mnth,weekday,weathersit,temp,atemp,hum,windspeed,cnt
0,610,2012-09-01,3,1,9,6,2,0.753333,0.702654,0.638333,0.113187,6140
1,611,2012-09-02,3,1,9,0,2,0.696667,0.649,0.815,0.064071,5810
2,612,2012-09-03,3,1,9,1,1,0.7075,0.661629,0.790833,0.151121,6034
3,613,2012-09-04,3,1,9,2,1,0.725833,0.686888,0.755,0.236321,6864
4,614,2012-09-05,3,1,9,3,1,0.736667,0.708983,0.74125,0.187808,7112


In [17]:
# Save the CSV file locally, so that it can be uploaded to create a 
# tabular dataset

import os

if not os.path.isdir('data'):
    os.mkdir('data')
    
# Save the train and test data to CSV files to be uploaded to the datastore
train.to_pandas_dataframe().to_csv("data/train_data.csv", index=False)
test.to_pandas_dataframe().to_csv("data/test_data.csv", index=False)

### Init the training and test data for the MLClient object.

In [80]:
# TODO: This doesn't work; ensure the dataset is created via the UI
# Create dataset

dataset_name = "bike_share_train"
dataset_version = 1

try:
    training_data = client.data.get(dataset_name, dataset_version)
#     training_data = Data(name=dataset_name, version=dataset_version, local_path="./data/train")
#     training_data = client.data.create_or_update(training_data)
#     print("Uploaded to path  : ", training_data.path)
#     print("Datastore location: ", training_data.datastore)
except Exception as e:
    print("Could not retrieve dataset. ", str(e))

training_data

Data({'is_anonymous': False, 'name': 'bike_share_train', 'description': None, 'tags': {}, 'properties': {}, 'id': '/subscriptions/381b38e9-9840-4719-a5a0-61d9585e1e91/resourceGroups/yunba_test_rg/providers/Microsoft.MachineLearningServices/workspaces/yunba-test-ws-eastus2/data/bike_share_train/versions/1', 'base_path': './', 'creation_context': <azure.ml._restclient.v2021_03_01_preview.models._models_py3.SystemData object at 0x000002C215BA4F98>, 'version': 1, 'datastore': '/subscriptions/381b38e9-9840-4719-a5a0-61d9585e1e91/resourceGroups/yunba_test_rg/providers/Microsoft.MachineLearningServices/workspaces/yunba-test-ws-eastus2/datastores/workspaceblobstore', 'path': 'UI/08-04-2021_114520_UTC/train_data.csv', 'local_path': None})

In [81]:
test_dataset_name = "bike_share_test"
test_data = client.data.get(test_dataset_name, dataset_version)

test_data

Data({'is_anonymous': False, 'name': 'bike_share_test', 'description': None, 'tags': {}, 'properties': {}, 'id': '/subscriptions/381b38e9-9840-4719-a5a0-61d9585e1e91/resourceGroups/yunba_test_rg/providers/Microsoft.MachineLearningServices/workspaces/yunba-test-ws-eastus2/data/bike_share_test/versions/1', 'base_path': './', 'creation_context': <azure.ml._restclient.v2021_03_01_preview.models._models_py3.SystemData object at 0x000002C215BA4D30>, 'version': 1, 'datastore': '/subscriptions/381b38e9-9840-4719-a5a0-61d9585e1e91/resourceGroups/yunba_test_rg/providers/Microsoft.MachineLearningServices/workspaces/yunba-test-ws-eastus2/datastores/workspaceblobstore', 'path': 'UI/08-04-2021_114650_UTC/test_data.csv', 'local_path': None})

## Forecasting Parameters
To define forecasting parameters for your experiment training, you can leverage the ForecastingParameters class. The table below details the forecasting parameters we will pass into our experiment.

|Property|Description|
|-|-|
|**time_column_name**|The name of your time column.|
|**forecast_horizon**|The forecast horizon is how many periods forward you would like to forecast. This integer horizon is in units of the timeseries frequency (e.g. daily, weekly).|
|**country_or_region_for_holidays**|The country/region used to generate holiday features. These should be ISO 3166 two-letter country/region codes (e.g. 'US', 'GB').|
|**target_lags**|The target_lags specifies how far back we will construct the lags of the target variable.|
|**freq**|Forecast frequency. This optional parameter represents the period with which the forecast is desired, for example, daily, weekly, yearly, etc. Use this parameter for the correction of time series containing irregular data points or for padding of short time series. The frequency needs to be a pandas offset alias. Please refer to [pandas documentation](https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#dateoffset-objects) for more information.|
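
To make the idea behind `target_lags` more concrete, the following sketch builds a lag-1 feature by hand. It only illustrates what a lag of the target is and is not AutoML's implementation; the helper `add_lag` is hypothetical, and the counts are the first five `cnt` values shown in the Data section.

```python
def add_lag(values, lag=1):
    """Pair each value with the value observed `lag` steps earlier (None if unavailable)."""
    return [
        (values[i], values[i - lag] if i >= lag else None)
        for i in range(len(values))
    ]


counts = [985, 801, 1349, 1562, 1600]  # first five `cnt` values from the data preview
lagged = add_lag(counts, lag=1)

for current, previous in lagged:
    print(current, previous)
```

The first row has no earlier observation, so its lag is missing; a real featurizer has to decide how to handle such rows.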

## Train

Instantiate an AutoML job object (`AutoMLJob` in the code below). This defines the settings and data used to run the experiment.

|Property|Description|
|-|-|
|**task**|forecasting|
|**primary_metric**|This is the metric that you want to optimize.<br> Forecasting supports the following primary metrics <br><i>spearman_correlation</i><br><i>normalized_root_mean_squared_error</i><br><i>r2_score</i><br><i>normalized_mean_absolute_error</i>|
|**blocked_models**|Models in blocked_models won't be used by AutoML. All supported models can be found [here](https://docs.microsoft.com/en-us/python/api/azureml-train-automl-client/azureml.train.automl.constants.supportedmodels.forecasting?view=azure-ml-py).|
|**experiment_timeout_hours**|Experimentation timeout in hours.|
|**training_data**|Input dataset, containing both features and label column.|
|**label_column_name**|The name of the label column.|
|**compute_target**|The remote compute for training.|
|**n_cross_validations**|Number of cross validation splits.|
|**enable_early_stopping**|If early stopping is on, training will stop when the primary metric is no longer improving.|
|**forecasting_parameters**|A class that holds all the forecasting related parameters.|
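
The `normalized_root_mean_squared_error` metric listed in the table can be sketched as follows. This assumes the normalization divides the RMSE by the range (max minus min) of the actual values; treat that as an assumption rather than a statement of how AutoML computes it internally. The function name and the sample numbers are hypothetical.

```python
import math


def normalized_rmse(actuals, predictions):
    """RMSE divided by the range (max - min) of the actual values (assumed normalization)."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actuals, predictions)]
    rmse = math.sqrt(sum(squared_errors) / len(squared_errors))
    value_range = max(actuals) - min(actuals)
    return rmse / value_range


# Hypothetical values: every prediction is off by 10 and the actuals span 200.
actuals = [100.0, 200.0, 300.0]
predictions = [110.0, 190.0, 310.0]
score = normalized_rmse(actuals, predictions)
print(score)
```

Normalizing makes the error comparable across series with different scales, which is why lower values are better regardless of how large the raw counts are.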

This notebook uses the blocked_models parameter to exclude some models that take a longer time to train on this dataset. You can choose to remove models from the blocked_models list but you may need to increase the experiment_timeout_hours parameter value to get results.

### Let's set up what we know about the dataset. 

**Target column** is what we want to forecast.

**Time column** is the time axis along which to predict.

In [82]:
target_column_name = 'cnt'
time_column_name = 'date'

### Setting forecaster maximum horizon 

The forecast horizon is the number of periods into the future that the model should predict. Here, we set the horizon to 14 periods (i.e. 14 days). Notice that this is much shorter than the number of days in the test set; we will need to use a rolling test to evaluate the performance on the whole test set. For more discussion of forecast horizons and guiding principles for setting them, please see the [energy demand notebook](https://github.com/Azure/MachineLearningNotebooks/tree/master/how-to-use-azureml/automated-machine-learning/forecasting-energy-demand).  
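
To see what a 14-day horizon covers, the sketch below lists the days that a single forecast would span when the training data ends on 2012-08-31, as in the split above. The helper `forecast_dates` is hypothetical and uses only the standard library.

```python
from datetime import date, timedelta


def forecast_dates(last_training_day, horizon):
    """Return the `horizon` consecutive days that follow `last_training_day`."""
    return [last_training_day + timedelta(days=step) for step in range(1, horizon + 1)]


dates = forecast_dates(date(2012, 8, 31), horizon=14)
print(dates[0], dates[-1])  # first and last day covered by a single forecast
```

A single forecast therefore reaches from 2012-09-01 to 2012-09-14; anything later in the test set has to be covered by the rolling evaluation.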

In [83]:
forecast_horizon = 14

### Config AutoML

In [84]:
# FeaturizationSettings is imported from azure.ml.entities below, so it is not imported here.
from azure.ml._restclient.v2020_09_01_preview.models import (
    GeneralSettings,
    DataSettings,
    LimitSettings,
    TrainingDataSettings,
    ValidationDataSettings,
    TestDataSettings,
)

from azure.ml.entities._job.automl.training_settings import TrainingSettings
from azure.ml.entities._job.automl.featurization import FeaturizationSettings
from azure.ml.entities._job.automl.forecasting import ForecastingSettings

from azure.ml.entities import AutoMLJob, ComputeConfiguration


compute_settings = ComputeConfiguration(target=cpu_cluster_name)

general_settings = GeneralSettings(
    task_type="forecasting",
    primary_metric= "normalized_root_mean_squared_error",
    log_verbosity="Info")

limit_settings = LimitSettings(
    timeout=60,
    trial_timeout=5,
    max_concurrent_trials=4,
    enable_early_termination=True)

training_data_settings = TrainingDataSettings(
    dataset_arm_id="{}:{}".format(training_data.name, training_data.version)
)
validation_data_settings = ValidationDataSettings(
    n_cross_validations=3
)

data_settings = DataSettings(
    training_data=training_data_settings,
    target_column_name=target_column_name,
    validation_data=validation_data_settings
)

featurization_settings = FeaturizationSettings(
    featurization_config="auto"
)

training_settings = TrainingSettings(
    block_list_models=['ExtremeRandomTrees']    
)

# Forecasting setting.
forecasting_settings = ForecastingSettings(
    time_column_name=time_column_name,
    forecast_horizon=forecast_horizon,
    country_or_region_for_holidays='US', # set country_or_region will trigger holiday featurizer
    target_lags='auto', # use heuristic based lag setting
    frequency='D' # Set the forecast frequency to be daily
)

extra_automl_settings = {"save_mlflow": True}

automl_job = AutoMLJob(
    compute=compute_settings,
    general_settings=general_settings,
    limit_settings=limit_settings,
    data_settings=data_settings,
    training_settings=training_settings,
    featurization_settings=featurization_settings,
    forecasting_settings=forecasting_settings,
    properties=extra_automl_settings,
)

automl_job

AutoMLJob({'type': 'automl_job', 'status': None, 'output': None, 'log_files': None, 'name': '293be5e6-49d2-4a08-bf33-63a8de599be3', 'description': None, 'tags': {}, 'properties': {'save_mlflow': True}, 'id': None, 'base_path': './', 'creation_context': None, 'experiment_name': 'forecasting-bike-share', 'interaction_endpoints': None, 'general_settings': <azure.ml._restclient.v2020_09_01_preview.models._models_py3.GeneralSettings object at 0x000002C215BC7240>, 'data_settings': <azure.ml._restclient.v2020_09_01_preview.models._models_py3.DataSettings object at 0x000002C215BC7358>, 'limit_settings': <azure.ml._restclient.v2020_09_01_preview.models._models_py3.LimitSettings object at 0x000002C215BC72B0>, 'forecasting_settings': <azure.ml.entities._job.automl.forecasting.ForecastingSettings object at 0x000002C215BC73C8>, 'training_settings': <azure.ml.entities._job.automl.training_settings.TrainingSettings object at 0x000002C215BC7390>, 'featurization_settings': <azure.ml.entities._job.autom

In [85]:
created_job = client.jobs.create_or_update(automl_job)
created_job

AzureCliCredential.get_token failed: Please run 'az login' to set up an account
AzureCliCredential.get_token failed: Please run 'az login' to set up an account


AutoMLJob({'type': 'automl_job', 'status': 'NotStarted', 'output': None, 'log_files': None, 'name': '293be5e6-49d2-4a08-bf33-63a8de599be3', 'description': None, 'tags': {}, 'properties': {'save_mlflow': 'True'}, 'id': '/subscriptions/381b38e9-9840-4719-a5a0-61d9585e1e91/resourceGroups/yunba_test_rg/providers/Microsoft.MachineLearningServices/workspaces/yunba-test-ws-eastus2/jobs/293be5e6-49d2-4a08-bf33-63a8de599be3', 'base_path': './', 'creation_context': <azure.ml._restclient.v2020_09_01_preview.models._models_py3.SystemData object at 0x000002C215B9E8D0>, 'experiment_name': 'forecasting-bike-share', 'interaction_endpoints': {'Tracking': <azure.ml._restclient.v2020_09_01_preview.models._models_py3.JobEndpoint object at 0x000002C215B9E748>, 'Studio': <azure.ml._restclient.v2020_09_01_preview.models._models_py3.JobEndpoint object at 0x000002C215B9EE80>}, 'general_settings': <azure.ml._restclient.v2020_09_01_preview.models._models_py3.GeneralSettings object at 0x000002C215B9EEB8>, 'data_s

In [87]:
print("Studio URL: ", created_job.interaction_endpoints["Studio"].endpoint)

Studio URL:  https://ml.azure.com/runs/293be5e6-49d2-4a08-bf33-63a8de599be3?wsid=/subscriptions/381b38e9-9840-4719-a5a0-61d9585e1e91/resourcegroups/yunba_test_rg/workspaces/yunba-test-ws-eastus2&tid=72f988bf-86f1-41af-91ab-2d7cd011db47


In [None]:
# TODO: Wait for the remote run to complete
# remote_run.wait_for_completion()

### Retrieve the Best Model
Below we select the best model from all the training iterations by reading the `automl_best_child_run_id` tag of the parent run through the MLflow client.

In [93]:
from mlflow.tracking import MlflowClient

# TODO: Use this run, as it has MLFlow model stored on the run
job_name = "5e759c10-d73a-46fb-8b3f-3c0616c05fbf"
# job_name = created_job.name

mlflow_client = MlflowClient()
mlflow_parent_run = mlflow_client.get_run(job_name)

best_child_run_id = mlflow_parent_run.data.tags["automl_best_child_run_id"]
print("Found best child run id: ", best_child_run_id)

best_run_customized = mlflow_client.get_run(best_child_run_id)
best_run_customized

Found best child run id:  5e759c10-d73a-46fb-8b3f-3c0616c05fbf_104


<Run: data=<RunData: metrics={'explained_variance': 0.7620329414961295,
 'mean_absolute_error': 427.4545191833113,
 'mean_absolute_percentage_error': 6.727707830057546,
 'median_absolute_error': 373.9487443518653,
 'normalized_mean_absolute_error': 0.053896673708651026,
 'normalized_median_absolute_error': 0.047150264071600716,
 'normalized_root_mean_squared_error': 0.06778592385824482,
 'normalized_root_mean_squared_log_error': 0.029927443382009773,
 'r2_score': 0.6823643627231287,
 'root_mean_squared_error': 537.6101621197396,
 'root_mean_squared_log_error': 0.08867941123607752,
 'spearman_correlation': 0.8681318681318682}, params={}, tags={'_aml_system_ComputeTargetStatus': '{"AllocationState":"steady","PreparingNodeCount":0,"RunningNodeCount":4,"CurrentNodeCount":6}',
 '_aml_system_azureml.automlComponent': 'AutoML',
 'mlflow.parentRunId': '5e759c10-d73a-46fb-8b3f-3c0616c05fbf',
 'mlflow.source.name': 'automl_driver.py',
 'mlflow.source.type': 'JOB'}>, info=<RunInfo: artifact_uri='

## Featurization

You can access the engineered feature names generated in time-series featurization. Note that a number of named holiday periods are represented. We recommend that you have at least one year of data when using this feature to ensure that all yearly holidays are captured in the training featurization.

In [94]:
# This step requires AutoML runtime libraries to be installed
# !pip install azureml-train-automl-runtime
import mlflow.sklearn

fitted_model_customized = mlflow.sklearn.load_model("runs:/{}/outputs".format(best_run_customized.info.run_id))

In [95]:
timeseries_transformer = fitted_model_customized.named_steps['timeseriestransformer']
timeseries_transformer.get_engineered_feature_names()

['_automl_target_col_WASNULL',
 'atemp',
 'atemp_WASNULL',
 'horizon_origin',
 'hum',
 'hum_WASNULL',
 'instant',
 'instant_WASNULL',
 'mnth',
 'mnth_WASNULL',
 'season',
 'season_WASNULL',
 'temp',
 'temp_WASNULL',
 'weathersit',
 'weathersit_WASNULL',
 'weekday',
 'weekday_WASNULL',
 'windspeed',
 'windspeed_WASNULL',
 'yr',
 'yr_WASNULL',
 '_automl_target_col_lag1D',
 '_automl_year',
 '_automl_year_iso',
 '_automl_half',
 '_automl_quarter',
 '_automl_month',
 '_automl_day',
 '_automl_wday',
 '_automl_qday',
 '_automl_week',
 '_automl_IsPaidTimeOff',
 '_automl_Holiday_1 day after Christmas Day',
 '_automl_Holiday_1 day after Columbus Day',
 '_automl_Holiday_1 day after Independence Day',
 '_automl_Holiday_1 day after Labor Day',
 '_automl_Holiday_1 day after Martin Luther King, Jr. Day',
 '_automl_Holiday_1 day after Memorial Day',
 "_automl_Holiday_1 day after New Year's Day",
 '_automl_Holiday_1 day after Thanksgiving',
 '_automl_Holiday_1 day after Veterans Day',
 "_automl_Holiday

### View the featurization summary

You can also see what featurization steps were performed on different raw features in the user data. For each raw feature in the user data, the following information is displayed:

- Raw feature name
- Number of engineered features formed out of this raw feature
- Type detected
- If feature was dropped
- List of feature transformations for the raw feature
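
The `MedianImputer` and `ImputationMarker` transformations that appear in the summary correspond to engineered feature pairs such as `temp` and `temp_WASNULL`. The sketch below shows the general idea with the standard library only. It is an assumption about what these transformations do conceptually, not AutoML's code, and the helper name and sample values are hypothetical.

```python
from statistics import median


def impute_with_marker(values):
    """Fill missing entries with the median and flag which entries were missing."""
    observed = [v for v in values if v is not None]
    fill_value = median(observed)
    imputed = [fill_value if v is None else v for v in values]
    was_null = [1 if v is None else 0 for v in values]
    return imputed, was_null


temp = [0.34, None, 0.20, 0.36]  # hypothetical values with one gap
temp_imputed, temp_was_null = impute_with_marker(temp)
print(temp_imputed)
print(temp_was_null)
```

Keeping the marker column lets a model distinguish a genuinely observed median-like value from one that was filled in.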

In [96]:
# Get the featurization summary as a list of JSON
featurization_summary = timeseries_transformer.get_featurization_summary()
# View the featurization summary as a pandas dataframe
pd.DataFrame.from_records(featurization_summary)

Unnamed: 0,RawFeatureName,TypeDetected,Dropped,EngineeredFeatureCount,Transformations
0,_automl_target_col,Numeric,No,2,"[ImputationMarker, Lag]"
1,atemp,Numeric,No,2,"[MedianImputer, ImputationMarker]"
2,date,DateTime,No,105,"[MaxHorizonFeaturizer, DateTimeTransformer, Da..."
3,hum,Numeric,No,2,"[MedianImputer, ImputationMarker]"
4,instant,Numeric,No,2,"[MedianImputer, ImputationMarker]"
5,mnth,Numeric,No,2,"[MedianImputer, ImputationMarker]"
6,season,Numeric,No,2,"[MedianImputer, ImputationMarker]"
7,temp,Numeric,No,2,"[MedianImputer, ImputationMarker]"
8,weathersit,Numeric,No,2,"[MedianImputer, ImputationMarker]"
9,weekday,Numeric,No,2,"[MedianImputer, ImputationMarker]"


## Evaluate

We now use the best fitted model from the AutoML run to make forecasts for the test set. We will do batch scoring on the test dataset, which should have the same schema as the training dataset.

The scoring will run on a remote compute. In this example, it will reuse the training compute.

### Retrieving forecasts from the model
To run the forecast on the remote compute we will use a helper script: forecasting_script. This script contains the utility methods which will be used by the remote estimator. We copy the script to the project folder to upload it to remote compute.

In [88]:
with open('forecasting_script.py', 'r') as cefr:
    print(cefr.read())

from azureml.core.experiment import Experiment
from azureml.core import Dataset, Run
from sklearn.externals import joblib

from azureml.automl.core.shared.constants import MODEL_PATH

train_experiment_name = '<<train_experiment_name>>'
train_run_id = '<<train_run_id>>'
target_column_name = '<<target_column_name>>'
test_dataset_name = '<<test_dataset_name>>'

run = Run.get_context()
ws = run.experiment.workspace

# Get the AutoML run object from the experiment name and the workspace
train_experiment = Experiment(ws, train_experiment_name)
automl_run = Run(experiment=train_experiment, run_id=train_run_id)

# Download the trained model from the artifact store
automl_run.download_file(name=MODEL_PATH, output_file_path='model.pkl')

# get the input dataset by name
test_dataset = Dataset.get_by_name(ws, name=test_dataset_name)

X_test_df = test_dataset.drop_columns(columns=[target_column_name]).to_pandas_dataframe().reset_index(drop=True)
y_test_df = test_dataset.with_timestamp_columns(None)

In [89]:
import os
import shutil

script_folder = os.path.join(os.getcwd(), 'forecast')
os.makedirs(script_folder, exist_ok=True)
shutil.copy('forecasting_script.py', script_folder)

# Path of the copied forecasting script that will run on the remote compute.
script_file_name = os.path.join(script_folder, 'forecasting_script.py')

# Open the sample script for modification
with open(script_file_name, 'r') as cefr:
    content = cefr.read()

# Replace the placeholders in forecasting_script.py with the appropriate values
content = content.replace('<<train_experiment_name>>', experiment_name) # your training experiment name.
content = content.replace('<<train_run_id>>', best_child_run_id) # Training Run-id.
content = content.replace('<<target_column_name>>', target_column_name) # Your target column name
# Name of your test dataset registered with your workspace
content = content.replace('<<test_dataset_name>>', test_dataset_name)

# Write sample file into your script folder.
with open(script_file_name, 'w') as cefw:
    cefw.write(content)
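
The placeholder substitution performed in the cell above can also be written as a small helper driven by a dictionary. This is only a sketch: `fill_placeholders` is a hypothetical name, and the two-line template is a shortened stand-in for `forecasting_script.py`.

```python
def fill_placeholders(template, values):
    """Replace every <<name>> placeholder in `template` with values[name]."""
    for name, value in values.items():
        template = template.replace("<<{}>>".format(name), value)
    return template


# Shortened stand-in for the script template.
template = (
    "target_column_name = '<<target_column_name>>'\n"
    "test_dataset_name = '<<test_dataset_name>>'\n"
)
filled = fill_placeholders(
    template,
    {"target_column_name": "cnt", "test_dataset_name": "bike_share_test"},
)
print(filled)
```

Collecting the values in one dictionary makes it easier to check that no `<<...>>` marker is left in the script before it is uploaded.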

For brevity, we have created a function called run_forecast that submits the test data to the best model determined during the training run and retrieves forecasts. The test set is longer than the forecast horizon specified at train time, so the forecasting script uses a so-called rolling evaluation to generate predictions over the whole test set. A rolling evaluation iterates the forecaster over the test set, using the actuals in the test set to make lag features as needed. 
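
The rolling evaluation described above can be sketched in a few lines. The forecaster here is a deliberately naive stand-in that repeats the last known actual value; it is not the AutoML model, and the real forecasting script may organize the loop differently. The function name and the sample series are hypothetical.

```python
def rolling_forecast(train_values, test_values, horizon):
    """Forecast the test set in consecutive windows of at most `horizon` points.

    The stand-in forecaster repeats the last known actual for every point in
    the window. After each window, the test actuals are appended to the
    history so that the next window can build on them.
    """
    history = list(train_values)
    predictions = []
    for start in range(0, len(test_values), horizon):
        window = test_values[start:start + horizon]
        last_actual = history[-1]
        predictions.extend([last_actual] * len(window))
        history.extend(window)  # actuals become available for the next window
    return predictions


train_values = [10, 12, 11]
test_values = [13, 14, 15, 16, 17]
predictions = rolling_forecast(train_values, test_values, horizon=2)
print(predictions)
```

Each window is forecast using only information available before it starts, so the evaluation never looks further ahead than the horizon the model was trained for.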

In [90]:
test_experiment_name = experiment_name + "_test"

In [100]:
# Using a 'CommandJob' to submit the custom forecasting script to the remote compute
from azure.ml.entities import CommandJob, Code
from azureml.core import Experiment, Run
mlflow.set_experiment(test_experiment_name)

#TODO: Here we create an environment object by loading a downloaded conda env from the training run.
train_experiment = Experiment(ws_tmp, experiment_name)
best_child_run = Run(experiment=train_experiment, run_id=best_child_run_id)

best_child_run.download_file('outputs/conda_env_v_1_0_0.yml', 'condafile.yml')

In [108]:
from azure.ml.entities._assets.environment import Environment

# environment = client.environments.get("AutoML-env")
environment = Environment(name="test-env-1", version=1, conda_file='condafile.yml',
                          docker_image='mcr.microsoft.com/azureml/openmpi3.1.2-cuda10.2-cudnn8-ubuntu18.04:20210507.v1')
environment
client.environments.create_or_update(environment)

Environment({'is_anonymous': False, 'name': 'test-env-1', 'description': None, 'tags': {}, 'properties': {}, 'id': '/subscriptions/381b38e9-9840-4719-a5a0-61d9585e1e91/resourceGroups/yunba_test_rg/providers/Microsoft.MachineLearningServices/workspaces/yunba-test-ws-eastus2/environments/test-env-1/versions/1', 'base_path': './', 'creation_context': <azure.ml._restclient.v2021_03_01_preview.models._models_py3.SystemData object at 0x000002C21598AE80>, 'version': 1, 'conda_file': OrderedDict([('channels', ['anaconda', 'conda-forge']), ('dependencies', ['python=3.6.2', OrderedDict([('pip', ['azureml-train-automl-runtime==1.32.0', 'inference-schema', 'azureml-interpret==1.32.0', 'azureml-defaults==1.32.0'])]), 'numpy>=1.16.0,<1.19.0', 'pandas==0.25.1', 'scikit-learn==0.22.1', 'py-xgboost<=0.90', 'fbprophet==0.5', 'holidays==0.9.11', 'psutil>=5.2.2,<6.0.0']), ('name', 'azureml_c199a2d8511501c9bde5dfe3639e54c9')]), 'path': None, 'docker': <azure.ml.entities._assets.environment._DockerConfigura

In [109]:
compute = compute_settings

command_job = CommandJob(
    command="python forecasting_script.py",
    code=Code(local_path=script_folder),
    environment=environment,
    compute=compute,
    tags={
        'training_run_id': best_child_run_id,
        # 'run_algorithm': train_run.properties['run_algorithm'],
        # 'valid_score': train_run.properties['score'],
        # 'primary_metric': train_run.properties['primary_metric'],
    }
)

created_command_job = client.jobs.create_or_update(command_job)
created_command_job

AzureCliCredential.get_token failed: Please run 'az login' to set up an account
AzureCliCredential.get_token failed: Please run 'az login' to set up an account


CommandJob({'parameters': {}, 'type': 'command_job', 'status': 'Starting', 'output': None, 'log_files': None, 'name': 'e3866325-dc1e-4215-a332-9a706c24fdcf', 'description': None, 'tags': {'training_run_id': '5e759c10-d73a-46fb-8b3f-3c0616c05fbf_104'}, 'properties': {'mlflow.source.git.repoURL': 'https://github.com/CESARDELATORRE/Easy-AutoML-MLOps.git', 'mlflow.source.git.branch': 'master', 'mlflow.source.git.commit': '1ae739f365b9f7679a9b55c03fb2c4b2223bddf5', 'azureml.git.dirty': 'True', '_azureml.ComputeTargetType': 'amlcompute', 'ContentSnapshotId': 'e22ce1c1-43a9-461f-8b27-5f0536f5decb'}, 'id': '/subscriptions/381b38e9-9840-4719-a5a0-61d9585e1e91/resourceGroups/yunba_test_rg/providers/Microsoft.MachineLearningServices/workspaces/yunba-test-ws-eastus2/jobs/e3866325-dc1e-4215-a332-9a706c24fdcf', 'base_path': './', 'creation_context': <azure.ml._restclient.v2020_09_01_preview.models._models_py3.SystemData object at 0x000002C215A60F28>, 'inputs': {}, 'command': 'python forecasting_scri

In [None]:
# remote_run.wait_for_completion(show_output=False)

In [110]:
print("Studio URL: ", created_command_job.interaction_endpoints["Studio"].endpoint)

Studio URL:  https://ml.azure.com/runs/e3866325-dc1e-4215-a332-9a706c24fdcf?wsid=/subscriptions/381b38e9-9840-4719-a5a0-61d9585e1e91/resourcegroups/yunba_test_rg/workspaces/yunba-test-ws-eastus2&tid=72f988bf-86f1-41af-91ab-2d7cd011db47


### Download the prediction result for metrics calculation
The test data with predictions are saved in the artifact outputs/predictions.csv. You can download it, calculate some error metrics for the forecasts, and visualize the predictions vs. the actuals.

In [None]:
# remote_run.download_file('outputs/predictions.csv', 'predictions.csv')
# df_all = pd.read_csv('predictions.csv')

In [None]:
# from azureml.automl.core.shared import constants
# from azureml.automl.runtime.shared.score import scoring
# from sklearn.metrics import mean_absolute_error, mean_squared_error
# from matplotlib import pyplot as plt

# # use automl metrics module
# scores = scoring.score_regression(
#     y_test=df_all[target_column_name],
#     y_pred=df_all['predicted'],
#     metrics=list(constants.Metric.SCALAR_REGRESSION_SET))

# print("[Test data scores]\n")
# for key, value in scores.items():    
#     print('{}:   {:.3f}'.format(key, value))
    
# # Plot outputs
# %matplotlib inline
# test_pred = plt.scatter(df_all[target_column_name], df_all['predicted'], color='b')
# test_test = plt.scatter(df_all[target_column_name], df_all[target_column_name], color='g')
# plt.legend((test_pred, test_test), ('prediction', 'truth'), loc='upper left', fontsize=8)
# plt.show()

For more details on which metrics are included and how they are calculated, please refer to [supported metrics](https://docs.microsoft.com/en-us/azure/machine-learning/how-to-understand-automated-ml#regressionforecasting-metrics). You can also calculate residuals, as described [here](https://docs.microsoft.com/en-us/azure/machine-learning/how-to-understand-automated-ml#residuals).


Since we did a rolling evaluation on the test set, we can analyze the predictions by their forecast horizon relative to the rolling origin. The model was initially trained at a forecast horizon of 14, so each prediction from the model is associated with a horizon value from 1 to 14. The horizon values are stored in the `horizon_origin` column of the prediction set. For example, we can calculate some of the error metrics grouped by horizon:

In [None]:
# from metrics_helper import MAPE, APE
# df_all.groupby('horizon_origin').apply(
#     lambda df: pd.Series({'MAPE': MAPE(df[target_column_name], df['predicted']),
#                           'RMSE': np.sqrt(mean_squared_error(df[target_column_name], df['predicted'])),
#                           'MAE': mean_absolute_error(df[target_column_name], df['predicted'])}))
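
The `metrics_helper` module imported above is not shown in this notebook. The sketch below is one plausible way its `MAPE` and `APE` helpers could be written, applied to a small made-up prediction set; the function bodies, the toy values, and the `actual` column name are assumptions for illustration, not the module's actual code or real results.

```python
import numpy as np
import pandas as pd


def APE(actual, pred):
    """Absolute percentage error per point, in percent (assumed definition)."""
    actual = np.asarray(actual, dtype=float)
    pred = np.asarray(pred, dtype=float)
    return 100.0 * np.abs((actual - pred) / actual)


def MAPE(actual, pred):
    """Mean APE, skipping points where a value is NaN or the actual is ~0."""
    actual = np.asarray(actual, dtype=float)
    pred = np.asarray(pred, dtype=float)
    keep = ~(np.isnan(actual) | np.isnan(pred) | np.isclose(actual, 0.0))
    return float(np.mean(APE(actual[keep], pred[keep])))


# Toy prediction set; 'actual' stands in for the target column.
toy = pd.DataFrame({
    "horizon_origin": [1, 1, 2, 2],
    "actual": [100.0, 200.0, 100.0, 50.0],
    "predicted": [110.0, 180.0, 130.0, 40.0],
})

# One row of metrics per horizon value.
records = []
for horizon, group in toy.groupby("horizon_origin"):
    err = group["actual"] - group["predicted"]
    records.append({
        "horizon_origin": horizon,
        "MAPE": MAPE(group["actual"], group["predicted"]),
        "RMSE": float(np.sqrt(np.mean(err ** 2))),
        "MAE": float(np.mean(np.abs(err))),
    })
by_horizon = pd.DataFrame(records).set_index("horizon_origin")
print(by_horizon)
```

Filtering out near-zero actuals inside `MAPE` is a design choice made here to avoid division by zero; it does not protect against actuals that are small but nonzero.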

To drill down further, we can look at the distribution of APE (absolute percentage error) by horizon. The chart shows that the overall MAPE is skewed by a single point whose actual value is close to zero.

In [None]:
# df_all_APE = df_all.assign(APE=APE(df_all[target_column_name], df_all['predicted']))
# APEs = [df_all_APE[df_all['horizon_origin'] == h].APE.values for h in range(1, forecast_horizon + 1)]

# %matplotlib inline
# plt.boxplot(APEs)
# plt.yscale('log')
# plt.xlabel('horizon')
# plt.ylabel('APE (%)')
# plt.title('Absolute Percentage Errors by Forecast Horizon')

# plt.show()
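
The sensitivity of MAPE to small actuals can be reproduced with a few made-up numbers. The values below are illustrative only and are not taken from the bike-share predictions; the sketch simply shows how one point with a small actual dominates the mean while the median APE stays modest.

```python
import numpy as np

# Illustrative values: the fourth actual is much smaller than the others.
actual = np.array([120.0, 95.0, 110.0, 2.0, 105.0])
predicted = np.array([110.0, 100.0, 121.0, 20.0, 100.0])

# Absolute percentage error per point, in percent.
ape = 100.0 * np.abs((actual - predicted) / actual)

print("APE per point (%):", np.round(ape, 1))
print("MAPE (mean APE):", round(float(ape.mean()), 1))
print("Median APE:", round(float(np.median(ape)), 1))
print("Index of largest APE:", int(ape.argmax()))
```

This is why the box plot uses a logarithmic y-axis, and why a robust summary such as the median APE can be a useful companion to MAPE when the target takes values near zero.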