# Automated Machine Learning
**AutoML Forecasting Recipe Univariate**

**Requirements** - In order to benefit from this tutorial, you will need:
- A basic understanding of Machine Learning
- An Azure account with an active subscription. [Create an account for free](https://azure.microsoft.com/free/?WT.mc_id=A261C142F)
- An Azure ML workspace. [Check this notebook for creating a workspace](../../../resources/workspace/workspace.ipynb) 
- A Compute Cluster. [Check this notebook to create a compute cluster](../../../resources/compute/compute.ipynb)
- A Python environment
- Installed Azure Machine Learning Python SDK v2 - [install instructions](../../../README.md) - check the getting started section

**Learning Objectives** - By the end of this tutorial, you should be able to:
- Connect to your AML workspace from the Python SDK
- Create an AutoML time-series forecasting job with the `forecasting()` factory function
- Train the model using AmlCompute by submitting/running the AutoML forecasting training job
- Obtain the model and use it to generate a forecast



### Running AutoML experiments
See the `automl-forecasting-recipe-univariate-experiment-settings` notebook on how to determine settings for seasonal features, target lags and whether the series needs to be differenced or not. To make experimentation user-friendly, the user has to specify several parameters: DIFFERENCE_SERIES, TARGET_LAGS and STL_TYPE. Once these parameters are set, the notebook will generate correct transformations and settings to run experiments, generate forecasts, compute inference set metrics and plot forecast vs actuals. It will also convert the forecast from first differences to levels (original units of measurement) if the DIFFERENCE_SERIES parameter is set to True before calculating inference set metrics.
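
The conversion from first differences back to levels mentioned above is simple arithmetic: cumulatively sum the predicted changes and add the last observed level. The sketch below shows this in plain Python with made-up numbers (they are not taken from the notebook's dataset), so the individual steps are visible without pandas.

```python
from itertools import accumulate

# Illustrative numbers only, not taken from the notebook's dataset.
levels = [100.0, 103.0, 101.0, 106.0]

# First differences: d[t] = y[t] - y[t-1]. The first observation has no
# predecessor, so the differenced series is one element shorter.
diffs = [b - a for a, b in zip(levels, levels[1:])]

# Suppose a model trained on the differences predicts the next three changes.
predicted_diffs = [2.0, -1.0, 4.0]

# Levels are recovered by cumulatively summing the predicted changes and
# adding the last observed level (the value at the forecast origin).
last_level = levels[-1]
predicted_levels = [last_level + c for c in accumulate(predicted_diffs)]

print(diffs)
print(predicted_levels)
```

Metrics computed on the recovered levels are in the original units of measurement, which is why the conversion happens before the metrics are calculated.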

The output generated by this notebook is saved in the `experiment_output` folder.

# Setup

## 1. Connect to Azure Machine Learning Workspace

The [workspace](https://docs.microsoft.com/en-us/azure/machine-learning/concept-workspace) is the top-level resource for Azure Machine Learning, providing a centralized place to work with all the artifacts you create when you use Azure Machine Learning. In this section we will connect to the workspace in which the job will be run.

### 1.1. Import the required libraries

In [None]:
# Import required libraries
from azure.identity import DefaultAzureCredential
from azure.ai.ml import MLClient
from azure.ai.ml.constants import AssetTypes, InputOutputModes
from azure.ai.ml import automl
from azure.ai.ml import Input

import json
import pandas as pd
import os
import shutil
import yaml

### 1.2. Configure workspace details and get a handle to the workspace

As part of the setup you have already created a workspace. To connect to a workspace, we need its identifier parameters: a subscription ID, a resource group and a workspace name. We will use these details in the `MLClient` from `azure.ai.ml` to get a handle to the required Azure Machine Learning workspace. We use the [default Azure authentication](https://docs.microsoft.com/en-us/python/api/azure-identity/azure.identity.defaultazurecredential?view=azure-python) for this tutorial. Check the [configuration notebook](../../configuration.ipynb) for more details on how to configure credentials and connect to a workspace.

You will also need to create a compute target for your AutoML run. In this tutorial, you create AmlCompute as your training compute resource.

<b>Note </b>that if you have an AzureML Data Scientist role, you will not have permission to create compute resources. Talk to your workspace or IT admin to create the compute targets described in this section, if they do not already exist.

#### Show Azure ML Workspace information

In [None]:
credential = DefaultAzureCredential()
ml_client = None
try:
    ml_client = MLClient.from_config(credential)
except Exception as ex:
    print(ex)
    # Enter details of your AML workspace
    subscription_id = "<SUBSCRIPTION_ID>"
    resource_group = "<RESOURCE_GROUP>"
    workspace = "<AML_WORKSPACE_NAME>"
    ml_client = MLClient(credential, subscription_id, resource_group, workspace)
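
The `MLClient.from_config` call above looks for a `config.json` file describing the workspace. As a minimal sketch, assuming the usual three keys (`subscription_id`, `resource_group`, `workspace_name`), such a file could be written as follows. The values are placeholders, and the file is written to a temporary folder here only to keep the example free of side effects.

```python
import json
import os
import tempfile

# Placeholder values; replace them with your own workspace details.
config = {
    "subscription_id": "<SUBSCRIPTION_ID>",
    "resource_group": "<RESOURCE_GROUP>",
    "workspace_name": "<AML_WORKSPACE_NAME>",
}

# Written to a temporary folder so the sketch has no side effects.
# In practice the file would live next to the notebook or in a parent folder.
config_dir = tempfile.mkdtemp()
config_path = os.path.join(config_dir, "config.json")
with open(config_path, "w") as f:
    json.dump(config, f, indent=2)

# Read the file back to confirm it is valid JSON with the expected keys.
with open(config_path) as f:
    loaded = json.load(f)
print(sorted(loaded))
```

If no such file is found, the cell above falls back to the explicit `MLClient(credential, subscription_id, resource_group, workspace)` constructor.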

In [None]:
workspace = ml_client.workspaces.get(name=ml_client.workspace_name)

output = {}
output["Workspace"] = ml_client.workspace_name
output["Subscription ID"] = ml_client.subscription_id
output["Resource Group"] = workspace.resource_group
output["Location"] = workspace.location
output

## 2. Data

We will load the data into DataFrame objects, split the data into train and test datasets, then create the Azure Machine Learning MLTable objects to prepare for the later training and inference steps.

### 2.1 Load the data file into DataFrame

We will load the data on the local machine into a DataFrame. For this experiment only, the data file has been placed manually in the `data` folder. We will drop the Covid period.

In [None]:
TARGET_COLNAME = "S4248SM144SCEN"
TIME_COLNAME = "observation_date"
COVID_PERIOD_START = (
    "2020-03-01"  # start of the covid period. To be excluded from evaluation.
)

df = pd.read_csv("./data/S4248SM144SCEN.csv", parse_dates=[TIME_COLNAME])
df.sort_values(by=TIME_COLNAME, inplace=True)

# remove the Covid period
df = df.query('{} <= "{}"'.format(TIME_COLNAME, COVID_PERIOD_START))

#### Set parameters

The first set of parameters is based on the analysis performed in the `auto-ml-forecasting-univariate-recipe-experiment-settings` notebook.

In [None]:
# set parameters based on the settings notebook analysis
DIFFERENCE_SERIES = True
TARGET_LAGS = None
STL_TYPE = None

Next, define additional parameters to be used in the <a href="https://docs.microsoft.com/en-us/python/api/azureml-train-automl-client/azureml.train.automl.automlconfig?view=azure-ml-py"> AutoML config </a> class.

<ul> 
    <li> FORECAST_HORIZON:  The forecast horizon is the number of periods into the future that the model should predict. Here, we set the horizon to 12 periods (i.e. 12 quarters). For more discussion of forecast horizons and guiding principles for setting them, please see the <a href="https://github.com/Azure/MachineLearningNotebooks/tree/master/how-to-use-azureml/automated-machine-learning/forecasting-energy-demand"> energy demand notebook </a>. 
    </li>
    <li> TIME_SERIES_ID_COLNAMES: The names of columns used to group a timeseries. It can be used to create multiple series. If time series identifier is not defined, the data set is assumed to be one time-series. This parameter is used with task type forecasting. Since we are working with a single series, this list is empty.
    </li>
    <li> BLOCKED_MODELS: Optional list of models to be blocked from consideration during the model selection stage. In this notebook we block only <code>ExtremeRandomTrees</code>; all other ML and time series models are considered.
        <ul>
            <li> See the following <a href="https://docs.microsoft.com/en-us/python/api/azureml-train-automl-client/azureml.train.automl.constants.supportedmodels.forecasting?view=azure-ml-py"> link </a> for a list of supported Forecasting models</li>
        </ul>
    </li>
</ul>


In [None]:
# set other parameters
FORECAST_HORIZON = 12
TIME_SERIES_ID_COLNAMES = []
BLOCKED_MODELS = ["ExtremeRandomTrees"]

In [None]:
# take first differences of the target; the first row has no predecessor and is dropped
if DIFFERENCE_SERIES:
    df_delta = df.copy()
    df_delta[TARGET_COLNAME] = df[TARGET_COLNAME].diff()
    df_delta.dropna(axis=0, inplace=True)

### Split the data into train and test

In [None]:
from helper_functions import ts_train_test_split

# split the data into train and test set
if DIFFERENCE_SERIES:
    # generate train/inference sets using data in first differences
    df_train, df_test = ts_train_test_split(
        df_input=df_delta,
        n=FORECAST_HORIZON,
        time_colname=TIME_COLNAME,
        ts_id_colnames=TIME_SERIES_ID_COLNAMES,
    )
else:
    df_train, df_test = ts_train_test_split(
        df_input=df,
        n=FORECAST_HORIZON,
        time_colname=TIME_COLNAME,
        ts_id_colnames=TIME_SERIES_ID_COLNAMES,
    )

In [None]:
# Save the DataFrame objects to files
train_data_path = "./data/train_S4248SM144SCEN.csv"
test_data_path = "./data/test_S4248SM144SCEN.csv"
df_train.to_csv(train_data_path, index=False)
df_test.to_csv(test_data_path, index=False)
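
The `ts_train_test_split` helper used above is imported from `helper_functions` and is not shown in this notebook. As a rough, hypothetical stand-in (the real helper may differ, for example in how it handles multiple series), holding out the last `n` periods of a single series can be sketched in plain Python like this:

```python
def split_last_n(rows, n, time_key):
    """Hold out the last n periods of a single time series as the test set.

    Hypothetical stand-in for the notebook's helper; assumes one series and
    time values that sort chronologically (e.g. ISO date strings).
    """
    ordered = sorted(rows, key=lambda r: r[time_key])
    if not 0 < n < len(ordered):
        raise ValueError("n must be positive and smaller than the series length")
    return ordered[:-n], ordered[-n:]


# Twelve illustrative monthly observations (made-up values).
rows = [
    {"observation_date": f"2019-{m:02d}-01", "value": float(m)}
    for m in range(1, 13)
]
train, test = split_last_n(rows, n=3, time_key="observation_date")
print(len(train), len(test))
```

The important property is that the test set is the most recent block of observations. A random split would leak future information into training.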

### Create the Azure Machine Learning MLTable

With Azure Machine Learning MLTables you can keep a single copy of data in your storage, easily access data during model training, share data and collaborate with other users. 
Below, we will upload the data by creating an MLTable to be used for training.

**NOTE:** In this PRIVATE PREVIEW we're defining the MLTable in a separate folder and .YAML file.
In later versions, you'll be able to do it all in Python APIs.

In [None]:
def create_folder_and_ml_table(csv_file, output, delimiter=",", encoding="ascii"):
    os.makedirs(output, exist_ok=True)
    fname = os.path.split(csv_file)[-1]

    mltable = {
        "paths": [{"file": f"./{fname}"}],
        "transformations": [
            {"read_delimited": {"delimiter": delimiter, "encoding": encoding}}
        ],
    }
    with open(os.path.join(output, "MLTable"), "w") as f:
        f.write(yaml.dump(mltable))
    shutil.copy(csv_file, os.path.join(output, fname))

In [None]:
train_mltable_path = "./data/training-mltable-folder"
create_folder_and_ml_table(train_data_path, train_mltable_path)
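
Before submitting a job it can help to confirm that the MLTable folder has the expected layout: an `MLTable` definition file with the CSV file next to it. The check below is a small sketch using only the standard library. `check_mltable_folder` is a hypothetical helper name, not part of the Azure ML SDK, and it is demonstrated on a throwaway folder.

```python
import os
import tempfile


def check_mltable_folder(folder):
    """Return a list of problems found in an MLTable folder (empty if none)."""
    problems = []
    if not os.path.isfile(os.path.join(folder, "MLTable")):
        problems.append("missing MLTable definition file")
    if not any(name.lower().endswith(".csv") for name in os.listdir(folder)):
        problems.append("no CSV data file next to the MLTable file")
    return problems


# Demonstration on a temporary folder: first empty, then with both files.
demo = tempfile.mkdtemp()
before = check_mltable_folder(demo)
open(os.path.join(demo, "MLTable"), "w").close()
open(os.path.join(demo, "train.csv"), "w").close()
after = check_mltable_folder(demo)
print(before, after)
```

In this notebook the same check could be applied to `train_mltable_path` after the cell above has run.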

In [None]:
# Training MLTable defined locally, with local data to be uploaded
my_training_data_input = Input(type=AssetTypes.MLTABLE, path=train_mltable_path)

To create a data input from a TabularDataset created with the V1 SDK, set `type` to `AssetTypes.MLTABLE`, `mode` to `InputOutputModes.DIRECT` and `path` to the format `azureml:<tabulardataset_name>` or `azureml:<tabulardataset_name>:<version>` (if you want to use a specific version of the registered dataset).

In [None]:
"""
# Training MLTable with v1 TabularDataset
my_training_data_input = Input(
    type=AssetTypes.MLTABLE, path="azureml:train:1", mode=InputOutputModes.DIRECT
)
"""

Next, we upload the directory with the test set data, which will be used in the batch endpoint inference.

In [None]:
os.makedirs("test_dataset", exist_ok=True)
shutil.copy(
    "data/test_S4248SM144SCEN.csv",
    "test_dataset/test_S4248SM144SCEN.csv",
)

In [None]:
my_test_data_input = Input(
    type=AssetTypes.URI_FOLDER,
    path="test_dataset/",
)

To use a TabularDataset created with the V1 SDK as test data for the batch endpoint inference, we need to convert it to a V2 `Input`.

In [None]:
"""
from mltable import load
os.makedirs("test_dataset", exist_ok=True)
filedataset_asset = ml_client.data.get(name="test",version=1)
test_df = load(f"azureml:/{filedataset_asset.id}").to_pandas_dataframe()
test_df.to_csv("test_dataset/test_S4248SM144SCEN.csv")
my_test_data_input = Input(
    type=AssetTypes.URI_FOLDER,
    path="test_dataset/",
)
"""

## 3. Create or Attach existing AmlCompute

[Azure Machine Learning Compute](https://docs.microsoft.com/en-us/azure/machine-learning/service/how-to-set-up-training-targets#amlcompute) is a managed-compute infrastructure that allows the user to easily create a single or multi-node compute. In this tutorial, you will create an AmlCompute cluster as your training compute resource.

<b>Creation of AmlCompute takes approximately 5 minutes.</b>

If the AmlCompute with that name is already in your workspace this code will skip the creation process.
As with other Azure services, there are limits on certain resources (e.g. AmlCompute) associated with the Azure Machine Learning service. Please read [this article](https://docs.microsoft.com/en-us/azure/machine-learning/service/how-to-manage-quotas) on the default limits and how to request more quota.

In [None]:
from azure.core.exceptions import ResourceNotFoundError
from azure.ai.ml.entities import AmlCompute

cluster_name = "recipe-cluster"

try:
    # Retrieve an already attached Azure Machine Learning Compute.
    compute = ml_client.compute.get(cluster_name)
except ResourceNotFoundError as e:
    compute = AmlCompute(
        name=cluster_name,
        size="STANDARD_DS12_V2",
        type="amlcompute",
        min_instances=0,
        max_instances=4,
        idle_time_before_scale_down=120,
    )
    poller = ml_client.begin_create_or_update(compute)
    poller.wait()

## 4. Configure and run the AutoML Forecasting training job

In this section we will configure and run the AutoML job to train the model.

### 4.1 Configure the job through the forecasting() factory function

#### forecasting() function parameters:

The `forecasting()` factory function allows the user to configure AutoML for the forecasting task in the most common scenarios, using the following properties.

|Property|Description|
|-|-|
|**target_column_name**|The name of the label column.|
|**primary_metric**|This is the metric that you want to optimize.<br> Forecasting supports the following primary metrics <br><i>spearman_correlation</i><br><i>normalized_root_mean_squared_error</i><br><i>r2_score</i><br><i>normalized_mean_absolute_error</i>|
|**training_data**|The training data to be used for this experiment. You can use a registered MLTable in the workspace using the format `<mltable_name>:<version>`, or you can use a local folder as an MLTable. For example, this notebook passes `Input(type=AssetTypes.MLTABLE, path=train_mltable_path)`, which points to a local MLTable folder. The `training_data` parameter must always be provided.|
|**compute**|The compute on which the AutoML job will run. In this example we are using the compute cluster named `recipe-cluster` created in the previous section. You can replace it with any other compute in the workspace.|
|**n_cross_validations**|Number of cross-validation folds to use for model/pipeline selection. The default value is "auto", in which case AutoML determines the number of cross-validations automatically, if a validation set is not provided. Alternatively, users can specify an integer value.|
|**name**|The name of the Job/Run. This is an optional property. If not specified, a random name will be generated.|
|**experiment_name**|The name of the Experiment. An Experiment is like a folder with multiple runs in Azure ML Workspace that should be related to the same logical machine learning experiment. For example, if a user runs this notebook multiple times, there will be multiple runs associated with the same Experiment name.|
|**enable_model_explainability**|If set to true, the explanations such as feature importance for the best model will be generated.|

#### set_limits() parameters:
This is an optional configuration method to configure limits parameters such as timeouts.

|Property|Description|
|-|-|
|**timeout_minutes**|Maximum amount of time in minutes that the whole AutoML job can take before the job terminates. This timeout includes setup, featurization and training runs but does not include the ensembling and model explainability runs at the end of the process since those actions need to happen once all the trials (children jobs) are done. If not specified, the default job's total timeout is 6 days (8,640 minutes). To specify a timeout less than or equal to 1 hour (60 minutes), make sure your dataset's size is not greater than 10,000,000 (rows times column) or an error results. It is hard to say what the timeout limit should be because the runtimes depend on multiple factors such as number of unique time series in the dataset, length of time series, statistical properties of the data, etc. If your dataset is less than 10,000,000 observations, you can try to set the experiment to 1 hour. If you are seeing less than 30 child jobs completed in this time frame, increase the timeout limit and re-run the experiment.|
|**trial_timeout_minutes**|Maximum time in minutes that each trial (child job) can run for before it terminates. If not specified, a value of 1 month or 43200 minutes is used.|
|**max_trials**|The maximum number of trials/runs, each with a different combination of algorithm and hyperparameters, to try during an AutoML job. If not specified, the default is 1000 trials. If you set `enable_early_termination=True`, the number of trials actually run may be smaller.|
|**max_concurrent_trials**|Represents the maximum number of trials (children jobs) that would be executed in parallel. It's a good practice to set this number equal to the number of nodes in your cluster.|
|**enable_early_termination**|Whether to enable early termination if the score is not improving over 10 iterations. Early stopping window starts only after first 20 iterations. This means that the first iteration where stopping can occur is the 31st.|
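
The numbers in the table above can be checked with a few lines of arithmetic. The sketch below is only an illustration of the rules as described in the table, not AutoML's actual implementation. The function names are hypothetical, and the early-stopping rule is simplified to "lower score is better".

```python
MINUTES_PER_DAY = 24 * 60

# Defaults quoted in the table, expressed in minutes.
default_job_timeout = 6 * MINUTES_PER_DAY     # 6 days
default_trial_timeout = 30 * MINUTES_PER_DAY  # "1 month" taken as 30 days


def short_timeout_allowed(n_rows, n_columns, max_cells=10_000_000):
    """A timeout of 60 minutes or less requires rows * columns <= 10,000,000."""
    return n_rows * n_columns <= max_cells


def first_possible_stop(scores, warmup=20, patience=10):
    """Simplified early-stopping rule (lower score is better).

    Stale iterations are only counted after the warmup window. Returns the
    1-based iteration at which the job would stop, or None if it never stops.
    """
    best = float("inf")
    stale = 0
    for i, score in enumerate(scores, start=1):
        if score < best:
            best = score
            stale = 0
        elif i > warmup:
            stale += 1
        if stale >= patience:
            return i + 1
    return None


print(default_job_timeout, default_trial_timeout)
print(short_timeout_allowed(n_rows=500_000, n_columns=10))
print(first_possible_stop([1.0] * 40))
```

With a score that never improves, the stale counter reaches 10 at iteration 30, so the earliest stop is at iteration 31, as stated in the table.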

#### Specialized Forecasting Parameters
To define forecasting parameters for your experiment training, you can leverage the .set_forecast_settings() method. 
The table below details the forecasting parameters we will be passing into our experiment.

|Property|Description|
|-|-|
|**time_column_name**|The name of your time column.|
|**forecast_horizon**|The forecast horizon is how many periods forward you would like to forecast. This integer horizon is in units of the timeseries frequency (e.g. daily, weekly).|
|**frequency**|Forecast frequency. This optional parameter represents the period with which the forecast is desired, for example, daily, weekly, yearly, etc. Use this parameter for the correction of time series containing irregular data points or for padding of short time series. The frequency needs to be a pandas offset alias. Please refer to [pandas documentation](https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#dateoffset-objects) for more information.|
|**cv_step_size**|Number of periods between two consecutive cross-validation folds. The default value is "auto", in which case AutoML determines the cross-validation step size automatically, if a validation set is not provided. Or users could specify an integer value.|
|**target_lags**|The target_lags specifies how far back we will construct the lags of the target variable.|
|**country_or_region_for_holidays**|The country/region used to generate holiday features. These should be ISO 3166 two-letter country/region codes (e.g. 'US', 'GB').|
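
To make the interaction between `forecast_horizon`, `n_cross_validations` and `cv_step_size` more concrete, the sketch below lays out rolling-origin folds for a series of a given length. It is an illustration under simple assumptions (each validation window is one horizon long and consecutive folds are shifted by the step size). AutoML's internal fold construction may differ in detail, and `rolling_origin_folds` is a hypothetical name.

```python
def rolling_origin_folds(n_obs, horizon, n_folds, step):
    """Return (train_end, valid_end) pairs, earliest fold first.

    Both values are exclusive end positions: fold k trains on
    observations [0, train_end) and validates on [train_end, valid_end).
    """
    folds = []
    for k in range(n_folds):
        valid_end = n_obs - k * step
        train_end = valid_end - horizon
        if train_end <= 0:
            raise ValueError("series too short for the requested folds")
        folds.append((train_end, valid_end))
    return folds[::-1]


# 100 observations, horizon of 12, three folds, step size of 12.
folds = rolling_origin_folds(n_obs=100, horizon=12, n_folds=3, step=12)
for train_end, valid_end in folds:
    print(f"train on [0, {train_end}), validate on [{train_end}, {valid_end})")
```

A smaller step size makes the validation windows overlap, while a short series limits how many folds fit.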

#### Using lags
This training can also use **target lags**, which are lagged values of the target variable. Because lags are generated with respect to the horizon, we must still specify the `forecast_horizon` that the model will learn to forecast. The `target_lags` keyword specifies how far back we will construct the lags of the target variable. In this notebook its value comes from the `TARGET_LAGS` parameter set earlier (`None` in the cell above, which means no target lags are used); set it according to the analysis in the settings notebook.
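
As a minimal illustration of what a target lag is, the sketch below builds lagged copies of a short series in plain Python. It shows the basic idea only: AutoML constructs its lag features internally and with respect to the forecast horizon, so its features will not look exactly like this. `add_target_lags` is a hypothetical helper name.

```python
def add_target_lags(values, lags):
    """Return one row per time step with the target and its lagged values.

    Rows that lack enough history for the largest lag are dropped.
    """
    max_lag = max(lags)
    rows = []
    for t in range(max_lag, len(values)):
        row = {"y": values[t]}
        for lag in lags:
            row[f"y_lag{lag}"] = values[t - lag]
        rows.append(row)
    return rows


# Made-up series; lags of one and two periods.
rows = add_target_lags([10, 12, 11, 15, 14], lags=[1, 2])
for row in rows:
    print(row)
```

Each extra lag costs one usable training row at the start of the series, which matters for short series.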

This notebook uses the `.set_training(blocked_training_algorithms=...)` parameter to exclude some models that take a longer time to train on this dataset.  You can choose to remove models from the blocked_training_algorithms list but you may need to increase the trial_timeout_minutes parameter value to get results.

To run AutoML, you need to create an `Experiment`. An `Experiment` corresponds to a prediction problem you are trying to solve, while a Run corresponds to a specific approach to the problem.

In [None]:
# choose a name for the run history container in the workspace
if isinstance(TARGET_LAGS, list):
    TARGET_LAGS_STR = (
        "-".join(map(str, TARGET_LAGS)) if (len(TARGET_LAGS) > 0) else None
    )
else:
    TARGET_LAGS_STR = TARGET_LAGS

experiment_desc = "diff-{}_lags-{}_STL-{}".format(
    DIFFERENCE_SERIES, TARGET_LAGS_STR, STL_TYPE
)

# General job parameters.
max_trials = 5
exp_name = "univariate_recipe_{}".format(experiment_desc)

### 4.2 Create the AutoML forecasting job

In [None]:
# Create the AutoML forecasting job with the related factory-function.

forecasting_job = automl.forecasting(
    compute=cluster_name,
    experiment_name=exp_name,
    training_data=my_training_data_input,
    target_column_name=TARGET_COLNAME,
    primary_metric="NormalizedRootMeanSquaredError",
    n_cross_validations=3,
    enable_model_explainability=True,
    tags={"my_custom_tag": "My custom value"},
)

# Limits are all optional
forecasting_job.set_limits(
    timeout_minutes=20,
    trial_timeout_minutes=5,
    max_trials=max_trials,  # defined with the general job parameters above
    enable_early_termination=True,
)

In [None]:
# Specialized properties for Time Series Forecasting training
forecasting_job.set_forecast_settings(
    time_column_name=TIME_COLNAME,
    forecast_horizon=FORECAST_HORIZON,
    target_lags=TARGET_LAGS,
    time_series_id_column_names=TIME_SERIES_ID_COLNAMES,
    use_stl=STL_TYPE,
)

# Training properties are optional
forecasting_job.set_training(blocked_training_algorithms=BLOCKED_MODELS)

### 4.3 Train the AutoML model

Using the `MLClient` created earlier, we will execute the following commands to train the model.

In [None]:
# Submit the AutoML job
returned_job = ml_client.jobs.create_or_update(
    forecasting_job, experiment_name=exp_name
)  # submit the job to the backend

print(f"Created job: {returned_job}")

In [None]:
# Wait until AutoML training runs are finished
ml_client.jobs.stream(returned_job.name)

## 5. Retrieve the Best Trial (Best Model's trial/run)

Use the MLflow client (`MlflowClient`) to access the results (such as models, artifacts and metrics) of a previously completed AutoML trial.

### 5.1 Initialize MLFlow Client
The models and artifacts that are produced by AutoML can be accessed via the MLflow interface.
Initialize the MLflow client here and set Azure ML as its tracking backend.

**IMPORTANT:** you need to have the latest MLflow packages installed:

    pip install azureml-mlflow

    pip install mlflow

#### Obtain the tracking URI for MLFlow

In [None]:
import mlflow

# Obtain the tracking URL from MLClient
MLFLOW_TRACKING_URI = ml_client.workspaces.get(
    name=ml_client.workspace_name
).mlflow_tracking_uri

print(MLFLOW_TRACKING_URI)

In [None]:
# Set the MLFLOW TRACKING URI
mlflow.set_tracking_uri(MLFLOW_TRACKING_URI)

print("\nCurrent tracking uri: {}".format(mlflow.get_tracking_uri()))

In [None]:
from mlflow.tracking.client import MlflowClient

# Initialize MLFlow client
mlflow_client = MlflowClient()

#### Get the AutoML parent Job

In [None]:
job_name = returned_job.name

# Example if providing a specific Job name/ID
# job_name = "591640e8-0f88-49c5-adaa-39b9b9d75531"

# Get the parent run
mlflow_parent_run = mlflow_client.get_run(job_name)

print("Parent Run: ")
print(mlflow_parent_run)

# Print parent run tags. 'automl_best_child_run_id' tag should be there.
print(mlflow_parent_run.data.tags)

#### Get the AutoML best child run

In [None]:
# Get the best model's child run
best_child_run_id = mlflow_parent_run.data.tags["automl_best_child_run_id"]
print("Found best child run id: ", best_child_run_id)

best_run = mlflow_client.get_run(best_child_run_id)

print("Best child run: ")
print(best_run)

## 6. Model Evaluation and Forecasting

### 6.1 Download the best model
Access the results (such as models, artifacts, metrics) of a previously completed AutoML Run.

In [None]:
# Create local folder
local_dir = "./artifact_downloads"
if not os.path.exists(local_dir):
    os.mkdir(local_dir)
# Download run's artifacts/outputs
local_path = mlflow_client.download_artifacts(
    best_run.info.run_id, "outputs", local_dir
)
print("Artifacts downloaded in: {}".format(local_path))
print("Artifacts: {}".format(os.listdir(local_path)))

### 6.2 Forecasting using batch endpoint

Now that we have retrieved the best pipeline/model, it can be used to make predictions on test data. We will do batch inferencing on the test dataset, which must have the same schema as the training dataset.

The inference will run on a remote compute. In this example, it will re-use the training compute.
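
Since the test data must have the same schema as the training data, a quick header comparison before invoking the endpoint can catch mismatches early. The sketch below uses only the `csv` module and in-memory text with made-up values. `csv_header` is a hypothetical helper, and the column names are the ones used earlier in this notebook.

```python
import csv
import io


def csv_header(fileobj):
    """Return the list of column names from the first row of a CSV file object."""
    return next(csv.reader(fileobj))


# In-memory stand-ins for the train and test files (values are made up).
train_csv = io.StringIO("observation_date,S4248SM144SCEN\n2019-01-01,1.5\n")
test_csv = io.StringIO("observation_date,S4248SM144SCEN\n2020-01-01,2.5\n")

same_schema = csv_header(train_csv) == csv_header(test_csv)
print(same_schema)
```

With the files saved earlier, the same function can be applied to `open(train_data_path, newline="")` and `open(test_data_path, newline="")`.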

#### 6.2.1 Create a model endpoint
First, we need to register the model, environment and the batch endpoint.

In [None]:
from azure.ai.ml.entities import (
    Environment,
    BatchEndpoint,
    BatchDeployment,
    BatchRetrySettings,
    Model,
)
from azure.ai.ml.constants import BatchDeploymentOutputAction

model_name = "forecast-recipe-univariate-run"
batch_endpoint_name = "forecast-recipes-univariate-run"

model = Model(
    path=f"azureml://jobs/{best_run.info.run_id}/outputs/artifacts/outputs/model.pkl",
    name=model_name,
    description="Forecasting Recipe Univariate Model",
)
registered_model = ml_client.models.create_or_update(model)

env = Environment(
    name="automl-tabular-env",
    description="environment for automl inference",
    image="mcr.microsoft.com/azureml/openmpi3.1.2-ubuntu18.04:20210727.v1",
    conda_file="artifact_downloads/outputs/conda_env_v_1_0_0.yml",
)

endpoint = BatchEndpoint(
    name=batch_endpoint_name,
    description="this is a sample batch endpoint",
)
ml_client.begin_create_or_update(endpoint).wait()

To create a batch deployment, we will use `forecasting_script.py`, which loads the model and calls its forecast method each time we invoke the endpoint.

In [None]:
output_file = "forecasting_recipe_univariate_output.json"
batch_deployment = BatchDeployment(
    name="univariate-non-mlflow-deployment",
    description="this is a sample non-mlflow deployment",
    endpoint_name=batch_endpoint_name,
    model=registered_model,
    code_path="./forecast",
    scoring_script="forecasting_script.py",
    environment=env,
    environment_variables={
        "TARGET_COLUMN_NAME": TARGET_COLNAME,
    },
    compute=cluster_name,
    instance_count=2,
    max_concurrency_per_instance=2,
    mini_batch_size=10,
    output_action=BatchDeploymentOutputAction.APPEND_ROW,
    output_file_name=output_file,
    retry_settings=BatchRetrySettings(max_retries=3, timeout=30),
    logging_level="info",
)

Finally, start a model deployment.

In [None]:
ml_client.begin_create_or_update(batch_deployment).wait()

We create the `Input` as a URI folder because the batch endpoint is intended to process multiple files at a time. In this example the folder holds a single test file, which we placed in `test_dataset/` earlier. The file must be accessible to the endpoint through the path or URL given in the `Input`.

#### Create an inference job

In [None]:
job = ml_client.batch_endpoints.invoke(
    endpoint_name=batch_endpoint_name,
    input=my_test_data_input,
    deployment_name="univariate-non-mlflow-deployment",  # name is required as default deployment is not set
)

We will stream the job output to monitor the execution.

In [None]:
job_name = job.name
batch_job = ml_client.jobs.get(name=job_name)
print(batch_job.status)
# stream the job logs
ml_client.jobs.stream(name=job_name)

#### Download the prediction result for metrics calculation

The forecast output is saved in JSON format. You can use it to calculate test set metrics and plot predictions and actuals over time.

In [None]:
ml_client.jobs.download(job_name, download_path=".")

In [None]:
pred_df = pd.read_json(output_file, orient="table")
pred_df.head()

In [None]:
def convert_fcst_diff_to_levels(fcst, yt, df_orig):
    """Convert forecast from first differences to levels."""
    fcst = fcst.reset_index(drop=False, inplace=False)
    fcst["predicted_level"] = fcst["predicted"].cumsum()
    fcst["predicted_level"] = fcst["predicted_level"].astype(float) + float(yt)
    # merge actuals
    out = pd.merge(
        fcst, df_orig[[TIME_COLNAME, TARGET_COLNAME]], on=[TIME_COLNAME], how="inner"
    )
    out.rename(columns={TARGET_COLNAME + "_y": "actual_level"}, inplace=True)
    return out

In [None]:
if DIFFERENCE_SERIES:
    # convert forecast in differences to the levels
    INFORMATION_SET_DATE = max(df_train[TIME_COLNAME])
    # take the scalar level at the forecast origin rather than a one-element Series
    YT = df.query("{} == @INFORMATION_SET_DATE".format(TIME_COLNAME))[
        TARGET_COLNAME
    ].iloc[0]
    fcst_df = convert_fcst_diff_to_levels(fcst=pred_df, yt=YT, df_orig=df)
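else:
    # No differencing: the forecast is already in levels.
    # Assumes the scoring output holds the actuals in TARGET_COLNAME and the
    # forecast in "predicted", as the differenced branch above also expects.
    fcst_df = pred_df.copy()
    fcst_df["actual_level"] = fcst_df[TARGET_COLNAME]
    fcst_df["predicted_level"] = fcst_df["predicted"]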

#### Calculate metrics and save output

In [None]:
from metrics_helper import calculate_metrics

metrics_df = calculate_metrics(fcst_df["actual_level"], fcst_df["predicted_level"])
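
The `calculate_metrics` helper comes from `metrics_helper` and is not shown in this notebook, so which metrics it reports is not known here. As a hypothetical stand-in, the sketch below computes three common forecast accuracy measures (MAE, RMSE and MAPE) in plain Python on made-up numbers.

```python
import math


def forecast_metrics(actual, predicted):
    """Compute MAE, RMSE and MAPE (in percent) for two equal-length sequences.

    MAPE is undefined when an actual value is zero, so this sketch assumes
    all actual values are non-zero.
    """
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    errors = [p - a for a, p in zip(actual, predicted)]
    n = len(errors)
    mae = sum(abs(e) for e in errors) / n
    rmse = math.sqrt(sum(e * e for e in errors) / n)
    mape = 100.0 * sum(abs(e / a) for e, a in zip(errors, actual)) / n
    return {"MAE": mae, "RMSE": rmse, "MAPE": mape}


# Made-up actuals and forecasts.
m = forecast_metrics([100.0, 200.0], [110.0, 180.0])
print(m)
```

Because the metrics here are computed on levels, MAE and RMSE are in the original units of the series, while MAPE is scale-free.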

In [None]:
# create output directory
output_dir = "experiment_output/{}".format(experiment_desc)
if not os.path.exists(output_dir):
    os.makedirs(output_dir)
# save output
metrics_file_name = "{}_metrics.csv".format(exp_name)
fcst_file_name = "{}_forecast.csv".format(exp_name)
plot_file_name = "{}_plot.pdf".format(exp_name)

metrics_df.to_csv(os.path.join(output_dir, metrics_file_name), index=True)
fcst_df.to_csv(os.path.join(output_dir, fcst_file_name), index=True)

#### Generate and save forecast versus actuals plot 
We will join actual observations of test data with the predictions to plot predictions and actuals on a time series plot.

In [None]:
from matplotlib import pyplot as plt

%matplotlib inline

In [None]:
plot_df = df.query('{} > "2010-01-01"'.format(TIME_COLNAME))
plot_df.set_index(TIME_COLNAME, inplace=True)
fcst_df.set_index(TIME_COLNAME, inplace=True)

# generate and save plots
plt.figure(dpi=180)
plt.plot(plot_df[TARGET_COLNAME], "-g", label="Historical")
plt.plot(fcst_df["actual_level"], "-b", label="Actual")
plt.plot(fcst_df["predicted_level"], "-r", label="Forecast")
plt.legend()
plt.title("Forecast vs Actuals")
plt.xlabel(TIME_COLNAME)
plt.ylabel(TARGET_COLNAME)

plt.xticks(rotation=45)
plt.savefig(os.path.join(output_dir, plot_file_name))

In [None]:
# Delete the batch endpoint and the compute cluster. Be careful not to run this cell by accident.
ml_client.batch_endpoints.begin_delete(name=batch_endpoint_name)
ml_client.compute.begin_delete(name=cluster_name)