In [1]:
# Copyright 2023 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Tabular Workflow for Forecasting

<table align="left">
  <td>
    <a href="https://colab.research.google.com/github/GoogleCloudPlatform/vertex-ai-samples/blob/main/notebooks/official/automl/automl_forecasting_on_vertex_pipelines.ipynb">
      <img src="https://cloud.google.com/ml-engine/images/colab-logo-32px.png" alt="Colab logo"> Run in Colab
    </a>
  </td>
  <td>
    <a href="https://github.com/GoogleCloudPlatform/vertex-ai-samples/blob/main/notebooks/official/automl/automl_forecasting_on_vertex_pipelines.ipynb">
      <img src="https://cloud.google.com/ml-engine/images/github-logo-32px.png" alt="GitHub logo">
      View on GitHub
    </a>
  </td>
  <td>
    <a href="https://console.cloud.google.com/vertex-ai/workbench/deploy-notebook?download_url=https://raw.githubusercontent.com/GoogleCloudPlatform/vertex-ai-samples/main/notebooks/official/automl/automl_forecasting_on_vertex_pipelines.ipynb">
        <img src="https://lh3.googleusercontent.com/UiNooY4LUgW_oTvpsNhPpQzsstV5W8F7rYgxgGBD85cWJoLmrOzhVs_ksK_vgx40SHs7jCqkTkCk=e14-rj-sc0xffffff-h130-w32" alt="Vertex AI logo">
      Open in Vertex AI Workbench
    </a>
  </td>
</table>
<br/><br/><br/>

**_NOTE_**: This notebook has been tested in the following environment:

* Python version = 3.9

## Overview

This tutorial demonstrates how you can use Vertex AI Tabular Workflow for Forecasting to train an AutoML model. You can choose between the following model types: Time Series Dense Encoder (TiDE), Learn to Learn (L2L), Sequence to Sequence (Seq2Seq+), and Temporal Fusion Transformer (TFT).

Learn more about [Tabular Workflow for Forecasting](https://cloud.google.com/vertex-ai/docs/tabular-data/tabular-workflows/forecasting).

### Comparison with the Vertex Forecasting managed service

Compared to the Vertex Forecasting managed service, Tabular Workflow for Forecasting has the following advantages:
1. Composite time series ID columns are supported. You can use a combination of multiple columns as the time series ID. For example, you can use either `['sku_id']` or `['sku_id', 'store_id']` as the time series ID columns.
2. Model architecture search can be skipped. You can reuse a previous architecture search result to train the model directly.
3. Hardware can be customized. You can override the machine spec of the tuning and training steps to adjust training speed. You can also control the parallelism of the training process and the number of trials selected during the ensemble step.
4. A single time series can have an unlimited number of time steps. The 3,000 time step limit on the training dataset no longer applies.
5. The training dataset has no upper size limit. The 100 million row and 100 GB limits no longer apply.
6. All advanced features of Vertex AI Pipelines are available.

### Objective

In this tutorial, you learn how to create AutoML Forecasting models using [Vertex AI Pipelines](https://cloud.google.com/vertex-ai/docs/pipelines/introduction) downloaded from [Google Cloud Pipeline Components](https://cloud.google.com/vertex-ai/docs/pipelines/components-introduction) (GCPC). These are Vertex AI Tabular Workflow pipelines maintained by Google, and they showcase different ways to customize the Vertex Tabular training process.

This tutorial uses the following Google Cloud ML services:

- `AutoML Training`
- `Vertex AI Pipelines`

The steps performed are:

- Create a training pipeline with the TiDE (Time series Dense Encoder) algorithm using a specified machine type for training.
- Create a TiDE training pipeline that reuses the architecture search results from the previous pipeline to save time.
- Create a training pipeline with the L2L (Learn to Learn) algorithm.
- Create a training pipeline with the Seq2Seq+ (Sequence to Sequence) algorithm.
- Create a training pipeline with the TFT (Temporal Fusion Transformer) algorithm.
- Run batch prediction using the model trained in the steps above.

### Dataset

This tutorial uses the [Liquor dataset](https://www.kaggle.com/datasets/residentmario/iowa-liquor-sales) to forecast alcoholic beverage sales in the Midwest.

### Costs

This tutorial uses billable components of Google Cloud:

* Vertex AI
* Cloud Storage
* BigQuery
* Dataflow

Learn about [Vertex AI pricing](https://cloud.google.com/vertex-ai/pricing), [Cloud Storage pricing](https://cloud.google.com/storage/pricing), and [BigQuery](https://cloud.google.com/bigquery), and use the [Pricing Calculator](https://cloud.google.com/products/calculator/) to generate a cost estimate based on your projected usage.

## Install additional packages

Install the Google Cloud Pipeline Components (GCPC) SDK, version `2.3.0` or later. The cell below pins version `2.3.0`.


In [None]:
!pip3 install --upgrade --quiet google-cloud-pipeline-components==2.3.0 \
                                google-cloud-aiplatform

### Colab only: Uncomment the following cell to restart the kernel.

In [None]:
# Automatically restart kernel after installs so that your environment can access the new packages
# import IPython

# app = IPython.Application.instance()
# app.kernel.do_shutdown(True)

## Before you begin

### Set up your Google Cloud project

**The following steps are required, regardless of your notebook environment.**

1. [Select or create a Google Cloud project](https://console.cloud.google.com/cloud-resource-manager). When you first create an account, you get a $300 free credit towards your compute/storage costs.

2. [Make sure that billing is enabled for your project](https://cloud.google.com/billing/docs/how-to/modify-project).

3. [Enable the Vertex AI, Dataflow, Compute Engine, and Cloud Storage APIs](https://console.cloud.google.com/flows/enableapi?apiid=ml.googleapis.com,dataflow.googleapis.com,compute_component,storage-component.googleapis.com).

4. If you are running this notebook locally, you need to install the [Cloud SDK](https://cloud.google.com/sdk).

## Notes about service account and permission

For full details of the permission setup, please refer to https://cloud.google.com/vertex-ai/docs/tabular-data/tabular-workflows/service-accounts

**By default, no configuration is required.** If you run into a permission-related issue, make sure the following service accounts have the required roles:

|Service account email|Description|Roles|
|---|---|---|
|PROJECT_NUMBER-compute@developer.gserviceaccount.com|Compute Engine default service account|Dataflow Developer, Dataflow Worker, Storage Admin, BigQuery Data Editor, Vertex AI User, Service Account User|
|service-PROJECT_NUMBER@gcp-sa-aiplatform.iam.gserviceaccount.com|AI Platform Service Agent|Vertex AI Service Agent|


1. Go to https://console.cloud.google.com/iam-admin/iam.
2. Check the "Include Google-provided role grants" checkbox.
3. Find the above emails.
4. Grant the corresponding roles.
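
To find the two emails from the table in the IAM console, it can help to fill in your project number first. The following is a minimal sketch; the helper name `tabular_workflow_service_accounts` and the sample project number are hypothetical and only illustrate the email patterns listed in the table.

```python
from typing import Dict


def tabular_workflow_service_accounts(project_number: str) -> Dict[str, str]:
    """Fill the project number into the two email patterns from the table."""
    return {
        "Compute Engine default service account": (
            f"{project_number}-compute@developer.gserviceaccount.com"
        ),
        "AI Platform Service Agent": (
            f"service-{project_number}@gcp-sa-aiplatform.iam.gserviceaccount.com"
        ),
    }


# "123456789012" is a placeholder, not a real project number.
accounts = tabular_workflow_service_accounts("123456789012")
for description, email in accounts.items():
    print(f"{description}: {email}")
```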

### Using data source from a different project
- For the BQ data source, grant both service accounts the "BigQuery Data Viewer" role.
- For the CSV data source, grant both service accounts the "Storage Object Viewer" role.


### Set your project ID

**If you don't know your project ID**, try the following:
* Run `gcloud config list`.
* Run `gcloud projects list`.
* See the support page: [Locate the project ID](https://support.google.com/googleapi/answer/7014113)

In [None]:
PROJECT_ID = "[your-project-id]"  # @param {type:"string"}

# Set the project id
! gcloud config set project {PROJECT_ID}

### Region

You can also change the `REGION` variable used by Vertex AI. Learn more about [Vertex AI regions](https://cloud.google.com/vertex-ai/docs/general/locations).

In [None]:
REGION = "us-central1"  # @param {type: "string"}

### Authenticate your Google Cloud account

Depending on your Jupyter environment, you may have to manually authenticate. Follow the relevant instructions below.

**1. Vertex AI Workbench**
* Do nothing as you are already authenticated.

**2. Local JupyterLab instance, uncomment and run:**

In [None]:
# ! gcloud auth login

**3. Colab, uncomment and run:**

In [None]:
# from google.colab import auth
# auth.authenticate_user()

**4. Service account or other**
* See how to grant Cloud Storage permissions to your service account at https://cloud.google.com/storage/docs/gsutil/commands/iam#ch-examples.

### Create a Cloud Storage bucket

Create a storage bucket to store intermediate artifacts such as datasets, TF model checkpoints, and TensorBoard files.

In [None]:
BUCKET_URI = f"gs://your-bucket-name-{PROJECT_ID}-unique"  # @param {type:"string"}

**Only if your bucket doesn't already exist**: Run the following cell to create your Cloud Storage bucket.

In [None]:
! gsutil mb -l {REGION} -p {PROJECT_ID} {BUCKET_URI}

#### Service Account

You use a service account to create Vertex AI Pipeline jobs. If you do not want to use your project's Compute Engine service account, set `SERVICE_ACCOUNT` to another service account ID.

In [None]:
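import sys

# True when the notebook runs in Colab; used below to choose the lookup method.
IS_COLAB = "google.colab" in sys.modules

SERVICE_ACCOUNT = "[your-service-account]"  # @param {type:"string"}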

if (
    SERVICE_ACCOUNT == ""
    or SERVICE_ACCOUNT is None
    or SERVICE_ACCOUNT == "[your-service-account]"
):
    # Get your service account from gcloud
    if not IS_COLAB:
        shell_output = !gcloud auth list 2>/dev/null
        SERVICE_ACCOUNT = shell_output[2].replace("*", "").strip()

    else:  # IS_COLAB:
        shell_output = ! gcloud projects describe  $PROJECT_ID
        project_number = shell_output[-1].split(":")[1].strip().replace("'", "")
        SERVICE_ACCOUNT = f"{project_number}-compute@developer.gserviceaccount.com"

    print("Service Account:", SERVICE_ACCOUNT)

#### Set service account access for Vertex AI Pipelines
Run the following commands to grant your service account access to read and write pipeline artifacts in the bucket that you created in the previous step. You only need to run this step once per service account.

In [None]:
! gsutil iam ch serviceAccount:{SERVICE_ACCOUNT}:roles/storage.objectCreator $BUCKET_URI
! gsutil iam ch serviceAccount:{SERVICE_ACCOUNT}:roles/storage.objectViewer $BUCKET_URI

## Import libraries and define constants

In [None]:
# Import required modules
import json
import os
import uuid
from typing import Any, Dict, List, Optional

from google.cloud import aiplatform, storage
from google_cloud_pipeline_components.preview.automl.forecasting import \
    utils as automl_forecasting_utils

## Initialize Vertex AI SDK for Python

Initialize the Vertex SDK for Python for your project.

In [None]:
aiplatform.init(project=PROJECT_ID, location=REGION)

## VPC related config

If you need to use a custom Dataflow subnetwork, you can set it through the `dataflow_subnetwork` parameter. The requirements are:
1. `dataflow_subnetwork` must be a fully qualified subnetwork name.
   [[reference](https://cloud.google.com/dataflow/docs/guides/specifying-networks#example_network_and_subnetwork_specifications)]
1. The following service accounts must have [Compute Network User role](https://cloud.google.com/compute/docs/access/iam#compute.networkUser) assigned on the specified dataflow subnetwork [[reference](https://cloud.google.com/dataflow/docs/guides/specifying-networks#shared)]:
    1. Compute Engine default service account: PROJECT_NUMBER-compute@developer.gserviceaccount.com
    1. Dataflow service account: service-PROJECT_NUMBER@dataflow-service-producer-prod.iam.gserviceaccount.com

If your project has VPC-SC enabled, please make sure:

1. The dataflow subnetwork used in VPC-SC is configured properly for Dataflow.
   [[reference](https://cloud.google.com/dataflow/docs/guides/routes-firewall)]
1. `dataflow_use_public_ips` is set to False.


In [None]:
# Dataflow's fully qualified subnetwork name, when empty the default subnetwork will be used.
# Fully qualified subnetwork name is in the form of
# https://www.googleapis.com/compute/v1/projects/HOST_PROJECT_ID/regions/REGION_NAME/subnetworks/SUBNETWORK_NAME
# reference: https://cloud.google.com/dataflow/docs/guides/specifying-networks#example_network_and_subnetwork_specifications
dataflow_subnetwork = None  # @param {type:"string"}
# Specifies whether Dataflow workers use public IP addresses.
dataflow_use_public_ips = True  # @param {type:"boolean"}
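
As an illustration of the fully qualified form described in the comments above, the following sketch assembles the subnetwork name from its three parts. The helper name `qualified_subnetwork` and the sample values are hypothetical; only the URL pattern comes from the notebook.

```python
def qualified_subnetwork(host_project_id: str, region: str, subnetwork: str) -> str:
    """Build a fully qualified subnetwork name from its components."""
    return (
        "https://www.googleapis.com/compute/v1/"
        f"projects/{host_project_id}/regions/{region}/subnetworks/{subnetwork}"
    )


# Placeholder values; replace them with your own host project and subnetwork.
example_subnetwork = qualified_subnetwork("my-host-project", "us-central1", "my-subnet")
print(example_subnetwork)
```

You could assign the result to `dataflow_subnetwork` instead of typing the URL by hand.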

## Prepare for training

### Define helper functions

In [None]:
# The following utility functions are used throughout the notebook.


# Fetch the tuple of GCS bucket and object URI.
def get_bucket_name_and_path(uri: str):
    no_prefix_uri = uri[len("gs://") :]
    splits = no_prefix_uri.split("/")
    return splits[0], "/".join(splits[1:])


# Fetch the content from a GCS object URI.
def download_from_gcs(uri: str):
    bucket_name, path = get_bucket_name_and_path(uri)
    storage_client = storage.Client(project=PROJECT_ID)
    bucket = storage_client.get_bucket(bucket_name)
    blob = bucket.blob(path)
    # `download_as_bytes` replaces the deprecated `download_as_string` alias.
    return blob.download_as_bytes()


# Upload the string content as a GCS object.
def write_to_gcs(uri: str, content: str):
    bucket_name, path = get_bucket_name_and_path(uri)
    storage_client = storage.Client()
    bucket = storage_client.get_bucket(bucket_name)
    blob = bucket.blob(path)
    blob.upload_from_string(content)


# This is the example to set non-auto transformations.
# For more details about the transformations, please check:
# https://cloud.google.com/vertex-ai/docs/datasets/data-types-tabular#transformations
def generate_transformation(
    auto_column_names: Optional[List[str]] = None,
    numeric_column_names: Optional[List[str]] = None,
    categorical_column_names: Optional[List[str]] = None,
    text_column_names: Optional[List[str]] = None,
    timestamp_column_names: Optional[List[str]] = None,
) -> Dict[str, List[str]]:
    if auto_column_names is None:
        auto_column_names = []
    if numeric_column_names is None:
        numeric_column_names = []
    if categorical_column_names is None:
        categorical_column_names = []
    if text_column_names is None:
        text_column_names = []
    if timestamp_column_names is None:
        timestamp_column_names = []
    return {
        "auto": auto_column_names,
        "numeric": numeric_column_names,
        "categorical": categorical_column_names,
        "text": text_column_names,
        "timestamp": timestamp_column_names,
    }


# Retrieve the task detail with the given task name, or None if no task matches.
def get_task_detail(task_details: List[Any], task_name: str) -> Optional[Any]:
    for task_detail in task_details:
        if task_detail.task_name == task_name:
            return task_detail
    return None


# Retrieve the URI of the model.
def get_deployed_model_uri(
    task_details,
):
    ensemble_task = get_task_detail(task_details, "model-upload")
    return ensemble_task.outputs["model"].artifacts[0].uri


# Retrieve the feature importance details from GCS.
def get_feature_attributions(
    task_details,
):
    ensemble_task = get_task_detail(task_details, "model-evaluation-2")
    return download_from_gcs(
        ensemble_task.outputs["evaluation_metrics"]
        .artifacts[0]
        .metadata["explanation_gcs_path"]
    )


# Retrieve the evaluation metrics from GCS.
def get_evaluation_metrics(
    task_details,
):
    ensemble_task = get_task_detail(task_details, "model-evaluation")
    return download_from_gcs(
        ensemble_task.outputs["evaluation_metrics"].artifacts[0].uri
    )


# Pretty print the JSON string.
def load_and_print_json(s):
    parsed = json.loads(s)
    print(json.dumps(parsed, indent=2, sort_keys=True))
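
The notebook later passes every feature as `auto`. As a sketch of what non-auto transformations could look like, the following stand-alone helper builds the same dictionary shape as `generate_transformation`. The helper name `build_transformations` and the type chosen for each column are assumptions for illustration, not recommendations from the source.

```python
from typing import Dict, List

# Transformation types listed in the utility function's docstring.
SUPPORTED_TYPES = ("auto", "numeric", "categorical", "text", "timestamp")


def build_transformations(**columns_by_type: List[str]) -> Dict[str, List[str]]:
    """Map each supported transformation type to a list of column names."""
    unknown = set(columns_by_type) - set(SUPPORTED_TYPES)
    if unknown:
        raise ValueError(f"Unsupported transformation types: {sorted(unknown)}")
    # Types that were not passed in get an empty list.
    return {t: list(columns_by_type.get(t, [])) for t in SUPPORTED_TYPES}


# Hypothetical explicit typing of the liquor sales columns.
example_transformations = build_transformations(
    timestamp=["date"],
    numeric=["sale_dollars"],
    categorical=["city", "zip_code", "county"],
)
print(example_transformations)
```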

### Define training specification

In [None]:
root_dir = os.path.join(BUCKET_URI, f"automl_forecasting_pipeline/run-{uuid.uuid4()}")
optimization_objective = "minimize-mae"
time_column = "date"
time_series_identifier_column = "store_name"
target_column = "sale_dollars"
data_source_csv_filenames = None
data_source_bigquery_table_path = (
    "bq://bigquery-public-data.iowa_liquor_sales_forecasting.2020_sales_train"
)

training_fraction = 0.8
validation_fraction = 0.1
test_fraction = 0.1

predefined_split_key = None
if predefined_split_key:
    training_fraction = None
    validation_fraction = None
    test_fraction = None

weight_column = None

features = [
    time_column,
    target_column,
    "city",
    "zip_code",
    "county",
]

available_at_forecast_columns = [time_column]
unavailable_at_forecast_columns = [target_column]
time_series_attribute_columns = ["city", "zip_code", "county"]
forecast_horizon = 150
context_window = 150

transformations = generate_transformation(auto_column_names=features)

# Create a Vertex managed dataset artifact.
vertex_dataset = aiplatform.TimeSeriesDataset.create(
    bq_source=data_source_bigquery_table_path
)
vertex_dataset_artifact_id = vertex_dataset.gca_resource.metadata_artifact.split("/")[
    -1
]
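
Before launching a pipeline, you may want to sanity-check the split settings. The sketch below assumes that the three fractions should sum to 1 and that fractions and `predefined_split_key` are mutually exclusive, as the cell above suggests. The helper name `check_split_config` is hypothetical.

```python
import math
from typing import Optional


def check_split_config(
    training_fraction: Optional[float],
    validation_fraction: Optional[float],
    test_fraction: Optional[float],
    predefined_split_key: Optional[str] = None,
) -> None:
    """Raise ValueError if the split settings look inconsistent."""
    fractions = (training_fraction, validation_fraction, test_fraction)
    if predefined_split_key:
        # Mirrors the cell above: fractions are cleared when a split key is set.
        if any(f is not None for f in fractions):
            raise ValueError("Set either a predefined split key or fractions, not both.")
        return
    if any(f is None for f in fractions):
        raise ValueError("All three fractions are required without a split key.")
    total = sum(fractions)
    # Assumption: the fractions are expected to sum to 1.
    if not math.isclose(total, 1.0, abs_tol=1e-9):
        raise ValueError(f"Fractions sum to {total}, expected 1.0.")


check_split_config(0.8, 0.1, 0.1)
```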

## Supported APIs


Currently, 4 model types are supported in the APIs/SDK with the utility functions:
1. `time_series_dense_encoder`(`TiDE`): `get_time_series_dense_encoder_forecasting_pipeline_and_parameters`
2. `learn_to_learn`(`L2L`): `get_learn_to_learn_forecasting_pipeline_and_parameters`
3. `sequence_to_sequence`(`seq2seq`): `get_sequence_to_sequence_forecasting_pipeline_and_parameters`
4. `temporal_fusion_transformer`(`TFT`): `get_temporal_fusion_transformer_forecasting_pipeline_and_parameters`

### High level workflow

The following code shows the general format for using the APIs:
```python
# Use the utility function to get the required parameters to create Vertex Pipeline job.
template_path, parameter_values = automl_forecasting_utils.get_${MODEL_TYPE}_forecasting_pipeline_and_parameters(
  ...
)

# Construct a Vertex Pipeline job.
job = aiplatform.PipelineJob(
    ...
    location=REGION,  # launches the pipeline job in the specified region
    template_path=template_path,
    ...
    pipeline_root=root_dir,
    parameter_values=parameter_values,
    ...
)

# Launch the Vertex Pipeline job.
job.run()
```
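
The `${MODEL_TYPE}` placeholder above stands for one of the four model type names. A small sketch of resolving the utility function name from a short label follows; the `MODEL_TYPES` mapping and the helper `utility_function_name` are hypothetical conveniences, while the function names themselves are the ones listed under "Supported APIs".

```python
# Short labels mapped to the model type names used in the utility function names.
MODEL_TYPES = {
    "TiDE": "time_series_dense_encoder",
    "L2L": "learn_to_learn",
    "seq2seq": "sequence_to_sequence",
    "TFT": "temporal_fusion_transformer",
}


def utility_function_name(model: str) -> str:
    """Return the name of the parameter-builder function for a model label."""
    return f"get_{MODEL_TYPES[model]}_forecasting_pipeline_and_parameters"


for label in MODEL_TYPES:
    print(label, "->", utility_function_name(label))
```

With the GCPC package imported as in this notebook, `getattr(automl_forecasting_utils, utility_function_name("TFT"))` would then look up the corresponding function.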

### Utility function arguments

The utility functions for all model types have the same arguments.

`get_time_series_dense_encoder_forecasting_pipeline_and_parameters` is shown here as an example:

```python
def get_time_series_dense_encoder_forecasting_pipeline_and_parameters(
    *,
    project: str,
    location: str,
    root_dir: str,
    target_column: str,
    optimization_objective: str,
    transformations: Dict[str, List[str]],
    train_budget_milli_node_hours: float,
    time_column: str,
    time_series_identifier_columns: List[str],
    time_series_attribute_columns: Optional[List[str]] = None,
    available_at_forecast_columns: Optional[List[str]] = None,
    unavailable_at_forecast_columns: Optional[List[str]] = None,
    forecast_horizon: Optional[int] = None,
    context_window: Optional[int] = None,
    evaluated_examples_bigquery_path: Optional[str] = None,
    window_predefined_column: Optional[str] = None,
    window_stride_length: Optional[int] = None,
    window_max_count: Optional[int] = None,
    holiday_regions: Optional[List[str]] = None,
    stage_1_num_parallel_trials: Optional[int] = None,
    stage_1_tuning_result_artifact_uri: Optional[str] = None,
    stage_2_num_parallel_trials: Optional[int] = None,
    num_selected_trials: Optional[int] = None,
    data_source_csv_filenames: Optional[str] = None,
    data_source_bigquery_table_path: Optional[str] = None,
    predefined_split_key: Optional[str] = None,
    training_fraction: Optional[float] = None,
    validation_fraction: Optional[float] = None,
    test_fraction: Optional[float] = None,
    weight_column: Optional[str] = None,
    dataflow_service_account: Optional[str] = None,
    dataflow_subnetwork: Optional[str] = None,
    dataflow_use_public_ips: bool = True,
    feature_transform_engine_bigquery_staging_full_dataset_id: str = '',
    feature_transform_engine_dataflow_machine_type: str = 'n1-standard-16',
    feature_transform_engine_dataflow_max_num_workers: int = 10,
    feature_transform_engine_dataflow_disk_size_gb: int = 40,
    evaluation_batch_predict_machine_type: str = 'n1-standard-16',
    evaluation_batch_predict_starting_replica_count: int = 25,
    evaluation_batch_predict_max_replica_count: int = 25,
    evaluation_dataflow_machine_type: str = 'n1-standard-16',
    evaluation_dataflow_max_num_workers: int = 25,
    evaluation_dataflow_disk_size_gb: int = 50,
    study_spec_parameters_override: Optional[List[Dict[str, Any]]] = None,
    stage_1_tuner_worker_pool_specs_override: Optional[Dict[str, Any]] = None,
    stage_2_trainer_worker_pool_specs_override: Optional[Dict[str, Any]] = None,
    enable_probabilistic_inference: bool = False,
    quantiles: Optional[List[float]] = None,
    encryption_spec_key_name: Optional[str] = None,
    model_display_name: Optional[str] = None,
    model_description: Optional[str] = None,
    run_evaluation: bool = True,
) -> Tuple[str, Dict[str, Any]]:
  """Returns the TiDE forecasting pipeline and formatted parameters.

  Args:
    project: The GCP project that runs the pipeline components.
    location: The GCP region that runs the pipeline components.
    root_dir: The root GCS directory for the pipeline components.
    target_column: The target column name.
    optimization_objective: "minimize-rmse", "minimize-mae", "minimize-rmsle",
      "minimize-rmspe", "minimize-wape-mae", "minimize-mape", or
      "minimize-quantile-loss".
    transformations: Dict mapping auto and/or type-resolutions to feature
      columns. The supported types are: auto, categorical, numeric, text, and
      timestamp.
    train_budget_milli_node_hours: The train budget of creating this model,
      expressed in milli node hours i.e. 1,000 value in this field means 1 node
      hour.
    time_column: The column that indicates the time.
    time_series_identifier_columns: The columns which distinguish different time
      series.
    time_series_attribute_columns: The columns that are invariant across the
      same time series.
    available_at_forecast_columns: The columns that are available at the
      forecast time.
    unavailable_at_forecast_columns: The columns that are unavailable at the
      forecast time.
    forecast_horizon: The length of the horizon.
    context_window: The length of the context window.
    evaluated_examples_bigquery_path: The existing BigQuery dataset to write the
      predicted examples into for evaluation, in the format
      `bq://project.dataset`. The dataset needs to be created first.
    window_predefined_column: The column that indicates the start of each window.
    window_stride_length: The stride length to generate the window.
    window_max_count: The maximum number of windows that will be generated.
    holiday_regions: The geographical regions where the holiday effect is
      applied in modeling.
    stage_1_num_parallel_trials: Number of parallel trials for stage 1.
    stage_1_tuning_result_artifact_uri: The stage 1 tuning result artifact GCS
      URI.
    stage_2_num_parallel_trials: Number of parallel trials for stage 2.
    num_selected_trials: Number of selected trials.
    data_source_csv_filenames: A string that represents a list of comma
      separated CSV filenames.
    data_source_bigquery_table_path: The BigQuery table path of format
      bq://bq_project.bq_dataset.bq_table
    predefined_split_key: The predefined_split column name.
    training_fraction: The training fraction.
    validation_fraction: The validation fraction.
    test_fraction: The test fraction.
    weight_column: The weight column name.
    dataflow_service_account: The full service account name.
    dataflow_subnetwork: The dataflow subnetwork.
    dataflow_use_public_ips: `True` to enable dataflow public IPs.
    feature_transform_engine_bigquery_staging_full_dataset_id: The full id of
      the feature transform engine staging dataset.
    feature_transform_engine_dataflow_machine_type: The dataflow machine type of
      the feature transform engine.
    feature_transform_engine_dataflow_max_num_workers: The max number of
      dataflow workers of the feature transform engine.
    feature_transform_engine_dataflow_disk_size_gb: The disk size of the
      dataflow workers of the feature transform engine.
    evaluation_batch_predict_machine_type: Machine type for the batch prediction
      job in evaluation, such as 'n1-standard-16'.
    evaluation_batch_predict_starting_replica_count: Number of replicas to use
      in the batch prediction cluster at startup time.
    evaluation_batch_predict_max_replica_count: The maximum count of replicas
      the batch prediction job can scale to.
    evaluation_dataflow_machine_type: Machine type for the dataflow job in
      evaluation, such as 'n1-standard-16'.
    evaluation_dataflow_max_num_workers: Maximum number of dataflow workers.
    evaluation_dataflow_disk_size_gb: The disk space in GB for dataflow.
    study_spec_parameters_override: The list for overriding study spec.
    stage_1_tuner_worker_pool_specs_override: The dictionary for overriding
      stage 1 tuner worker pool spec.
    stage_2_trainer_worker_pool_specs_override: The dictionary for overriding
      stage 2 trainer worker pool spec.
    enable_probabilistic_inference: If probabilistic inference is enabled, the
      model will fit a distribution that captures the uncertainty of a
      prediction. If quantiles are specified, then the quantiles of the
      distribution are also returned.
    quantiles: Quantiles to use for probabilistic inference. Up to 5 quantiles
      are allowed of values between 0 and 1, exclusive. Represents the quantiles
      to use for that objective. Quantiles must be unique.
    encryption_spec_key_name: The KMS key name.
    model_display_name: Optional display name for model.
    model_description: Optional description.
    run_evaluation: `True` to evaluate the ensembled model on the test split.
  """
  ...
```
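
The `train_budget_milli_node_hours` argument is expressed in thousandths of a node hour, which is easy to misread. As a small sketch of the arithmetic, the hypothetical helper below converts a budget in minutes to milli node hours.

```python
def minutes_to_milli_node_hours(minutes: float) -> float:
    """Convert minutes of one node's time to milli node hours (1,000 = 1 node hour)."""
    return minutes / 60.0 * 1000.0


print(minutes_to_milli_node_hours(15))  # 250.0
print(minutes_to_milli_node_hours(60))  # 1000.0
```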


### Use holiday regions

For some use cases, forecasting data can be affected by holidays in regional areas. See https://cloud.google.com/vertex-ai/docs/tabular-data/tabular-workflows/forecasting-train#holiday-regions for more information on holiday regions supported by forecasting.

Pass in a list of strings `holiday_regions` to the pipeline parameter builder to incorporate holiday data into your training pipeline.
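
One way to handle optional arguments such as `holiday_regions` is to collect them in a dictionary and unpack it into the parameter builder. The sketch below also checks `quantiles` against the constraints stated in the docstring above (at most 5, unique, strictly between 0 and 1, with the `minimize-quantile-loss` objective). The helper name `optional_forecasting_kwargs` is hypothetical.

```python
from typing import Any, Dict, List, Optional


def optional_forecasting_kwargs(
    optimization_objective: str,
    holiday_regions: Optional[List[str]] = None,
    quantiles: Optional[List[float]] = None,
) -> Dict[str, Any]:
    """Collect optional builder arguments, validating quantiles if given."""
    kwargs: Dict[str, Any] = {}
    if holiday_regions:
        kwargs["holiday_regions"] = list(holiday_regions)
    if quantiles:
        if optimization_objective != "minimize-quantile-loss":
            raise ValueError("Quantiles require the minimize-quantile-loss objective.")
        if len(quantiles) > 5:
            raise ValueError("Up to 5 quantiles are allowed.")
        if len(set(quantiles)) != len(quantiles):
            raise ValueError("Quantiles must be unique.")
        if any(not 0 < q < 1 for q in quantiles):
            raise ValueError("Quantiles must be between 0 and 1, exclusive.")
        kwargs["quantiles"] = list(quantiles)
    return kwargs


extra_kwargs = optional_forecasting_kwargs("minimize-mae", holiday_regions=["US", "AE"])
print(extra_kwargs)
```

The resulting dictionary could be passed as `**extra_kwargs` alongside the required arguments.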

## Customize the training configurations

You can create a Forecasting pipeline with the following customizations: 
- Change machine type and tuning / training parallelism
- Skip evaluation
- Skip model architecture search

Instead of running architecture search every time, you can reuse an existing architecture search result. This can reduce the variation of the output model or the training cost. The existing architecture search result is stored in the `tuning_result_output` output of the `automl-forecasting-stage-1-tuner` component. You can load it programmatically with the API.

```python
stage_1_tuner_task = get_task_detail(
    pipeline_task_details, "automl-forecasting-stage-1-tuner"
)

stage_1_tuning_result_artifact_uri = (
    stage_1_tuner_task.outputs["tuning_result_output"].artifacts[0].uri
)
```

You can use the following code snippet to customize the training configuration:

In [None]:
# Customize the worker pool for each trial during tuning.
# Only the chief node and the evaluator node are used.
# You can change the machine spec for these two nodes.
worker_pool_specs_override = [
    {"machine_spec": {"machine_type": "n1-standard-8"}},  # override for TF chief node
    {},  # override for TF worker node, since it's not used, leave it empty
    {},  # override for TF ps node, since it's not used, leave it empty
    {
        "machine_spec": {"machine_type": "n1-standard-4"}
    },  # override for TF evaluator node
]

# Number of weak models in the final ensemble model.
num_selected_trials = 5

# Specify the evaluation setup.
run_evaluation = False
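
The override list is positional: chief, worker, parameter server, evaluator. As a sketch that makes the positions explicit, the hypothetical helper below builds the same four-element structure from two machine types, leaving the unused worker and parameter server entries empty.

```python
from typing import Any, Dict, List


def build_worker_pool_override(
    chief_machine_type: str, evaluator_machine_type: str
) -> List[Dict[str, Any]]:
    """Build the four-entry override list in chief, worker, ps, evaluator order."""
    return [
        {"machine_spec": {"machine_type": chief_machine_type}},  # TF chief node
        {},  # TF worker node, not used
        {},  # TF ps node, not used
        {"machine_spec": {"machine_type": evaluator_machine_type}},  # TF evaluator node
    ]


example_override = build_worker_pool_override("n1-standard-8", "n1-standard-4")
print(example_override)
```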

You can export evaluated examples from training to BigQuery by setting the parameter `evaluated_examples_bigquery_path` in the training parameters. The BigQuery path needs to point to an existing BigQuery dataset in the format `bq://project.dataset`.

In [None]:
# This is ONLY available when `run_evaluation` is set to `True`.
evaluated_examples_bigquery_path = f"bq://{PROJECT_ID}.eval"
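
Because the path must follow the `bq://project.dataset` format, a quick check before launching the pipeline can catch typos early. The following is a minimal sketch with a hypothetical helper name, `parse_bigquery_dataset_path`. It assumes that neither the project ID nor the dataset ID contains a dot, and it does not verify that the dataset exists.

```python
from typing import Tuple


def parse_bigquery_dataset_path(path: str) -> Tuple[str, str]:
    """Split a `bq://project.dataset` path into (project, dataset)."""
    prefix = "bq://"
    if not path.startswith(prefix):
        raise ValueError(f"Expected a path starting with {prefix!r}: {path!r}")
    parts = path[len(prefix):].split(".")
    # Assumption: exactly two non-empty, dot-free components.
    if len(parts) != 2 or not all(parts):
        raise ValueError(f"Expected the form bq://project.dataset: {path!r}")
    return parts[0], parts[1]


print(parse_bigquery_dataset_path("bq://my-project.eval"))
```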

## TiDE training

Time series Dense Encoder (TiDE) is an optimized dense DNN-based encoder-decoder model that offers strong model quality with fast training and inference, especially for long contexts and horizons.

For more details, please see https://ai.googleblog.com/2023/04/recent-advances-in-deep-long-horizon.html

In this tutorial, you run the TiDE training pipeline twice:
1. With model architecture search
2. Without model architecture search 

### Run the TiDE pipeline with model architecture search

In [None]:
train_budget_milli_node_hours = 250.0  # 15 minutes

(
    template_path,
    parameter_values,
) = automl_forecasting_utils.get_time_series_dense_encoder_forecasting_pipeline_and_parameters(
    project=PROJECT_ID,
    location=REGION,
    root_dir=root_dir,
    target_column=target_column,
    # Use `minimize-quantile-loss` here if you set `quantiles` below.
    optimization_objective=optimization_objective,
    transformations=transformations,
    train_budget_milli_node_hours=train_budget_milli_node_hours,
    # To use a Vertex managed dataset instead, comment out the following two
    # lines so that `data_source_csv_filenames` and
    # `data_source_bigquery_table_path` are not set.
    data_source_csv_filenames=data_source_csv_filenames,
    data_source_bigquery_table_path=data_source_bigquery_table_path,
    weight_column=weight_column,
    predefined_split_key=predefined_split_key,
    training_fraction=training_fraction,
    validation_fraction=validation_fraction,
    test_fraction=test_fraction,
    num_selected_trials=num_selected_trials,
    time_column=time_column,
    time_series_identifier_columns=[time_series_identifier_column],
    time_series_attribute_columns=time_series_attribute_columns,
    available_at_forecast_columns=available_at_forecast_columns,
    unavailable_at_forecast_columns=unavailable_at_forecast_columns,
    forecast_horizon=forecast_horizon,
    context_window=context_window,
    stage_1_tuner_worker_pool_specs_override=worker_pool_specs_override,
    dataflow_subnetwork=dataflow_subnetwork,
    dataflow_use_public_ips=dataflow_use_public_ips,
    run_evaluation=run_evaluation,
    # evaluated_examples_bigquery_path=evaluated_examples_bigquery_path,
    dataflow_service_account=SERVICE_ACCOUNT,
    # Quantile forecast requires `minimize-quantile-loss` as optimization objective.
    # quantiles=[0.25, 0.5, 0.9],
    # holiday_regions=["US", "AE"],
)

job_id = "tide-forecasting-{}".format(uuid.uuid4())
job = aiplatform.PipelineJob(
    display_name=job_id,
    location=REGION,  # launches the pipeline job in the specified region
    template_path=template_path,
    job_id=job_id,
    pipeline_root=root_dir,
    parameter_values=parameter_values,
    enable_caching=False,
    # Uncomment the following line if you want to use Vertex managed dataset.
    # input_artifacts={'vertex_dataset': vertex_dataset_artifact_id},
)

job.run(service_account=SERVICE_ACCOUNT)


pipeline_task_details = job.gca_resource.job_detail.task_details
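
When you look up a task in `pipeline_task_details`, a misspelled task name is easy to miss. The sketch below is a hypothetical variant of the lookup, `find_task`, that raises an error listing the available task names. It uses stand-in objects rather than a real pipeline job, so the task list here is illustrative only.

```python
from types import SimpleNamespace
from typing import Any, Iterable


def find_task(task_details: Iterable[Any], task_name: str) -> Any:
    """Return the task with the given name, or raise KeyError listing known names."""
    tasks = list(task_details)
    for task in tasks:
        if task.task_name == task_name:
            return task
    available = sorted(t.task_name for t in tasks)
    raise KeyError(f"No task named {task_name!r}; available: {available}")


# Stand-ins for the task detail objects returned by a pipeline job.
fake_task_details = [
    SimpleNamespace(task_name="automl-forecasting-stage-1-tuner"),
    SimpleNamespace(task_name="model-upload"),
]
print(find_task(fake_task_details, "model-upload").task_name)
```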

### Run the TiDE pipeline without the model architecture search


After retrieving the tuning result from the stage 1 tuner, you can use it to skip the model architecture search.

In [None]:
# Retrieve the tuning result output from the previous training pipeline.
stage_1_tuner_task = get_task_detail(
    pipeline_task_details, "automl-forecasting-stage-1-tuner"
)

stage_1_tuning_result_artifact_uri = (
    stage_1_tuner_task.outputs["tuning_result_output"].artifacts[0].uri
)

train_budget_milli_node_hours = 250.0  # 15 minutes

(
    template_path,
    parameter_values,
) = automl_forecasting_utils.get_time_series_dense_encoder_forecasting_pipeline_and_parameters(
    project=PROJECT_ID,
    location=REGION,
    root_dir=root_dir,
    target_column=target_column,
    optimization_objective=optimization_objective,
    transformations=transformations,
    train_budget_milli_node_hours=train_budget_milli_node_hours,
    data_source_csv_filenames=data_source_csv_filenames,
    data_source_bigquery_table_path=data_source_bigquery_table_path,
    weight_column=weight_column,
    predefined_split_key=predefined_split_key,
    training_fraction=training_fraction,
    validation_fraction=validation_fraction,
    test_fraction=test_fraction,
    num_selected_trials=num_selected_trials,
    time_column=time_column,
    time_series_identifier_columns=[time_series_identifier_column],
    time_series_attribute_columns=time_series_attribute_columns,
    available_at_forecast_columns=available_at_forecast_columns,
    unavailable_at_forecast_columns=unavailable_at_forecast_columns,
    forecast_horizon=forecast_horizon,
    context_window=context_window,
    dataflow_subnetwork=dataflow_subnetwork,
    dataflow_use_public_ips=dataflow_use_public_ips,
    stage_1_tuning_result_artifact_uri=stage_1_tuning_result_artifact_uri,
    run_evaluation=run_evaluation,
    # evaluated_examples_bigquery_path=evaluated_examples_bigquery_path,
    dataflow_service_account=SERVICE_ACCOUNT,
)

job_id = "tide-forecasting-skip-architecture-search-{}".format(uuid.uuid4())
job = aiplatform.PipelineJob(
    display_name=job_id,
    location=REGION,  # launches the pipeline job in the specified region
    template_path=template_path,
    job_id=job_id,
    pipeline_root=root_dir,
    parameter_values=parameter_values,
    enable_caching=False,
)

job.run(service_account=SERVICE_ACCOUNT)

# Keep the task details of this run for later inspection.
skip_architecture_search_pipeline_task_details = (
    job.gca_resource.job_detail.task_details
)

## L2L training


Learn-to-Learn (L2L) is a good choice for a wide range of time series forecasting use cases.

In [None]:
train_budget_milli_node_hours = 250.0  # 15 minutes

(
    template_path,
    parameter_values,
) = automl_forecasting_utils.get_learn_to_learn_forecasting_pipeline_and_parameters(
    project=PROJECT_ID,
    location=REGION,
    root_dir=root_dir,
    target_column=target_column,
    optimization_objective=optimization_objective,
    transformations=transformations,
    train_budget_milli_node_hours=train_budget_milli_node_hours,
    data_source_csv_filenames=data_source_csv_filenames,
    data_source_bigquery_table_path=data_source_bigquery_table_path,
    weight_column=weight_column,
    predefined_split_key=predefined_split_key,
    training_fraction=training_fraction,
    validation_fraction=validation_fraction,
    test_fraction=test_fraction,
    num_selected_trials=num_selected_trials,
    time_column=time_column,
    time_series_identifier_columns=[time_series_identifier_column],
    time_series_attribute_columns=time_series_attribute_columns,
    available_at_forecast_columns=available_at_forecast_columns,
    unavailable_at_forecast_columns=unavailable_at_forecast_columns,
    forecast_horizon=forecast_horizon,
    context_window=context_window,
    dataflow_subnetwork=dataflow_subnetwork,
    dataflow_use_public_ips=dataflow_use_public_ips,
    run_evaluation=run_evaluation,
    # evaluated_examples_bigquery_path=evaluated_examples_bigquery_path,
    dataflow_service_account=SERVICE_ACCOUNT,
    # Quantile forecast requires `minimize-quantile-loss` as optimization objective.
    # quantiles=[0.25, 0.5, 0.9],
)

job_id = "l2l-forecasting-{}".format(uuid.uuid4())
job = aiplatform.PipelineJob(
    display_name=job_id,
    location=REGION,  # launches the pipeline job in the specified region
    template_path=template_path,
    job_id=job_id,
    pipeline_root=root_dir,
    parameter_values=parameter_values,
    enable_caching=False,
)

job.run(service_account=SERVICE_ACCOUNT)


pipeline_task_details = job.gca_resource.job_detail.task_details

## Seq2seq training

Sequence-to-sequence (seq2seq) is a good choice for experimentation. The algorithm is likely to converge faster than AutoML (L2L) because its architecture is simpler and it uses a smaller search space.
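
For short experimental runs it helps to see how the budget unit maps to node time. The training cells in this notebook set `train_budget_milli_node_hours = 250.0` with the comment "15 minutes"; the conversion behind that comment is sketched below. The two function names are hypothetical helpers, not part of any library.

```python
def minutes_to_milli_node_hours(minutes: float) -> float:
    """Convert node minutes to milli node hours (1 node hour = 1000 milli node hours)."""
    return minutes / 60.0 * 1000.0


def milli_node_hours_to_minutes(milli_node_hours: float) -> float:
    """Convert milli node hours back to node minutes."""
    return milli_node_hours / 1000.0 * 60.0


# The value used throughout this notebook.
print(milli_node_hours_to_minutes(250.0))  # 15.0
print(minutes_to_milli_node_hours(60.0))  # 1000.0
```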

In [None]:
train_budget_milli_node_hours = 250.0  # 15 minutes

(
    template_path,
    parameter_values,
) = automl_forecasting_utils.get_sequence_to_sequence_forecasting_pipeline_and_parameters(
    project=PROJECT_ID,
    location=REGION,
    root_dir=root_dir,
    target_column=target_column,
    optimization_objective=optimization_objective,
    transformations=transformations,
    train_budget_milli_node_hours=train_budget_milli_node_hours,
    data_source_csv_filenames=data_source_csv_filenames,
    data_source_bigquery_table_path=data_source_bigquery_table_path,
    weight_column=weight_column,
    predefined_split_key=predefined_split_key,
    training_fraction=training_fraction,
    validation_fraction=validation_fraction,
    test_fraction=test_fraction,
    num_selected_trials=num_selected_trials,
    time_column=time_column,
    time_series_identifier_columns=[time_series_identifier_column],
    time_series_attribute_columns=time_series_attribute_columns,
    available_at_forecast_columns=available_at_forecast_columns,
    unavailable_at_forecast_columns=unavailable_at_forecast_columns,
    forecast_horizon=forecast_horizon,
    context_window=context_window,
    dataflow_subnetwork=dataflow_subnetwork,
    dataflow_use_public_ips=dataflow_use_public_ips,
    run_evaluation=run_evaluation,
    # evaluated_examples_bigquery_path=evaluated_examples_bigquery_path,
    dataflow_service_account=SERVICE_ACCOUNT,
    # Quantile prediction is NOT supported by Seq2seq.
)

job_id = "seq2seq-forecasting-{}".format(uuid.uuid4())
job = aiplatform.PipelineJob(
    display_name=job_id,
    location=REGION,  # launches the pipeline job in the specified region
    template_path=template_path,
    job_id=job_id,
    pipeline_root=root_dir,
    parameter_values=parameter_values,
    enable_caching=False,
)

job.run(service_account=SERVICE_ACCOUNT)


pipeline_task_details = job.gca_resource.job_detail.task_details

## TFT training

TFT stands for "Temporal Fusion Transformer", which is an attention-based DNN model designed to produce high accuracy and interpretability by aligning the model with the general multi-horizon forecasting task.

With this model, you don't need to explicitly enable explainability support during serving to get the feature importance for each feature column.
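
The training cells in this notebook pass almost the same arguments to every model type, with a few exceptions noted in their comments: `num_selected_trials` cannot be set for TFT, and `quantiles` is marked as unsupported for Seq2seq and TFT. One way to reuse a shared argument set is to filter it per model type. The sketch below is illustrative only; the model type labels, the `UNSUPPORTED_KEYS` table, and `kwargs_for_model` are hypothetical names, and the argument values are placeholders.

```python
# Arguments each model type does not accept, taken from the comments in this
# notebook's training cells. The dictionary keys are hypothetical labels.
UNSUPPORTED_KEYS = {
    "tide": set(),
    "l2l": set(),
    "seq2seq": {"quantiles"},
    "tft": {"quantiles", "num_selected_trials"},
}


def kwargs_for_model(shared_kwargs, model_type):
    """Return a copy of shared_kwargs without the keys the model type rejects."""
    unsupported = UNSUPPORTED_KEYS[model_type]
    return {
        key: value for key, value in shared_kwargs.items() if key not in unsupported
    }


# Placeholder values for illustration.
shared = {"forecast_horizon": 30, "context_window": 30, "num_selected_trials": 5}

print(kwargs_for_model(shared, "tft"))
print(kwargs_for_model(shared, "l2l"))
```

The filtered dictionary could then be unpacked into the matching `get_..._forecasting_pipeline_and_parameters` call together with the required arguments.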

In [None]:
train_budget_milli_node_hours = 250.0  # 15 minutes

(
    template_path,
    parameter_values,
) = automl_forecasting_utils.get_temporal_fusion_transformer_forecasting_pipeline_and_parameters(
    project=PROJECT_ID,
    location=REGION,
    root_dir=root_dir,
    target_column=target_column,
    optimization_objective=optimization_objective,
    transformations=transformations,
    train_budget_milli_node_hours=train_budget_milli_node_hours,
    data_source_csv_filenames=data_source_csv_filenames,
    data_source_bigquery_table_path=data_source_bigquery_table_path,
    weight_column=weight_column,
    predefined_split_key=predefined_split_key,
    training_fraction=training_fraction,
    validation_fraction=validation_fraction,
    test_fraction=test_fraction,
    # Note that the TFT model ensembles ONLY the model from the top trial,
    # so `num_selected_trials` cannot be set for TFT.
    # num_selected_trials=num_selected_trials,
    time_column=time_column,
    time_series_identifier_columns=[time_series_identifier_column],
    time_series_attribute_columns=time_series_attribute_columns,
    available_at_forecast_columns=available_at_forecast_columns,
    unavailable_at_forecast_columns=unavailable_at_forecast_columns,
    forecast_horizon=forecast_horizon,
    context_window=context_window,
    dataflow_subnetwork=dataflow_subnetwork,
    dataflow_use_public_ips=dataflow_use_public_ips,
    run_evaluation=run_evaluation,
    # evaluated_examples_bigquery_path=evaluated_examples_bigquery_path,
    dataflow_service_account=SERVICE_ACCOUNT,
    # Quantile prediction is NOT supported by TFT.
)

job_id = "tft-forecasting-{}".format(uuid.uuid4())
job = aiplatform.PipelineJob(
    display_name=job_id,
    location=REGION,  # launches the pipeline job in the specified region
    template_path=template_path,
    job_id=job_id,
    pipeline_root=root_dir,
    parameter_values=parameter_values,
    enable_caching=False,
)

job.run(service_account=SERVICE_ACCOUNT)


pipeline_task_details = job.gca_resource.job_detail.task_details

## Batch prediction/explain

You can enable the batch explain feature by simply setting `generate_explanation=True` in the `batch_predict` API.
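
The switch between plain batch prediction and batch explanation can be sketched as a small function that builds the keyword arguments for `batch_predict`. This is a hedged illustration: `build_batch_predict_kwargs` is a hypothetical helper, the destination `bq://my-project` is a placeholder, and only parameters that already appear in this notebook's `batch_predict` call are used.

```python
def build_batch_predict_kwargs(
    job_display_name,
    bigquery_source,
    bigquery_destination_prefix,
    explain=False,
):
    """Build keyword arguments for a BigQuery-to-BigQuery batch prediction."""
    kwargs = {
        "job_display_name": job_display_name,
        "bigquery_source": bigquery_source,
        "instances_format": "bigquery",
        "bigquery_destination_prefix": bigquery_destination_prefix,
        "predictions_format": "bigquery",
        "sync": True,
    }
    if explain:
        # Only add the flag when explanations are requested.
        kwargs["generate_explanation"] = True
    return kwargs


explain_kwargs = build_batch_predict_kwargs(
    job_display_name="forecasting_iowa_liquor_sales_forecasting_predictions",
    bigquery_source=(
        "bq://bigquery-public-data:iowa_liquor_sales_forecasting.2021_sales_predict"
    ),
    bigquery_destination_prefix="bq://my-project",  # placeholder project
    explain=True,
)
print(explain_kwargs["generate_explanation"])
```

With a retrieved model object, the result would be passed as `model.batch_predict(**explain_kwargs)`.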

Use the following code to retrieve the trained Forecasting model from the pipeline:

In [None]:
upload_model_task = get_task_detail(pipeline_task_details, "model-upload-2")

forecasting_mp_model_artifact = upload_model_task.outputs["model"].artifacts[0]

forecasting_mp_model = aiplatform.Model(
    forecasting_mp_model_artifact.metadata["resourceName"]
)

Once you have retrieved the Vertex AI model, you can run a batch prediction.

In [None]:
print(f"Running Batch prediction for model: {forecasting_mp_model.display_name}")

batch_predict_bq_output_uri_prefix = f"bq://{PROJECT_ID}"

PREDICTION_DATASET_BQ_PATH = (
    "bq://bigquery-public-data:iowa_liquor_sales_forecasting.2021_sales_predict"
)

batch_prediction_job = forecasting_mp_model.batch_predict(
    job_display_name="forecasting_iowa_liquor_sales_forecasting_predictions",
    bigquery_source=PREDICTION_DATASET_BQ_PATH,
    instances_format="bigquery",
    bigquery_destination_prefix=batch_predict_bq_output_uri_prefix,
    predictions_format="bigquery",
    # Uncomment the following line to run batch explain:
    # generate_explanation=True,
    sync=True,
)

print(batch_prediction_job)

## Retrieve the uploaded Vertex AI model with a Vertex AI pipeline job id

In [None]:
# Example format of pipeline_job_id: projects/{your-project-id}/locations/us-central1/pipelineJobs/{pipeline-job-id}
pipeline_job_id = ""  # @param {type:"string"}
if pipeline_job_id:
    job = aiplatform.PipelineJob.get(pipeline_job_id)
    pipeline_task_details = job.gca_resource.job_detail.task_details
    upload_model_task = get_task_detail(pipeline_task_details, "model-upload-2")

    forecasting_mp_model_artifact = upload_model_task.outputs["model"].artifacts[0]
    forecasting_mp_model = aiplatform.Model(
        forecasting_mp_model_artifact.metadata["resourceName"]
    )
    print(forecasting_mp_model)

## Upload with parent model for different model versions

To upload this model as a new version of a parent Vertex AI model, you need the resource name of the parent model.
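
Before launching a pipeline, it can be worth checking that the resource name has the expected shape. The sketch below validates the format shown in the next cell's comment and builds the artifact URI that the cell looks up. The function name and the sample resource name are hypothetical, and deriving the API host from the location segment is an assumption that generalizes the `us-central1` URI used in this notebook.

```python
import re

# Expected shape: projects/{project}/locations/{location}/models/{model-id}
MODEL_RESOURCE_NAME_PATTERN = re.compile(
    r"^projects/(?P<project>[^/]+)/locations/(?P<location>[^/]+)/models/(?P<model>[^/]+)$"
)


def parent_model_artifact_uri(resource_name):
    """Validate a model resource name and return the matching artifact URI."""
    match = MODEL_RESOURCE_NAME_PATTERN.match(resource_name)
    if match is None:
        raise ValueError(
            "Expected projects/{project}/locations/{location}/models/{model-id}, "
            f"got: {resource_name!r}"
        )
    # Assumption: the regional API host follows the model's location.
    location = match.group("location")
    return f"https://{location}-aiplatform.googleapis.com/v1/{resource_name}"


# Hypothetical resource name for illustration.
print(
    parent_model_artifact_uri(
        "projects/my-project/locations/us-central1/models/1234567890"
    )
)
```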

In [None]:
# The model resource name can be something like: "projects/{your-project-id}/locations/us-central1/models/{model-id}"
parent_model_resource_name = ""  # @param {type:"string"}

if parent_model_resource_name:
    parent_model_artifact = aiplatform.Artifact.get_with_uri(
        # The parent model is expected to be in the same region as the pipeline.
        f"https://{REGION}-aiplatform.googleapis.com/v1/{parent_model_resource_name}"
    )
    parent_model_artifact_id = str(
        parent_model_artifact.gca_resource.name.split("artifacts/")[1]
    )

    train_budget_milli_node_hours = 250.0  # 15 minutes

    (
        template_path,
        parameter_values,
    ) = automl_forecasting_utils.get_time_series_dense_encoder_forecasting_pipeline_and_parameters(
        project=PROJECT_ID,
        location=REGION,
        root_dir=root_dir,
        target_column=target_column,
        optimization_objective=optimization_objective,
        transformations=transformations,
        train_budget_milli_node_hours=train_budget_milli_node_hours,
        # To use a Vertex managed dataset instead, comment out the following
        # two lines so that `data_source_csv_filenames` and
        # `data_source_bigquery_table_path` are not set.
        data_source_csv_filenames=data_source_csv_filenames,
        data_source_bigquery_table_path=data_source_bigquery_table_path,
        weight_column=weight_column,
        predefined_split_key=predefined_split_key,
        training_fraction=training_fraction,
        validation_fraction=validation_fraction,
        test_fraction=test_fraction,
        num_selected_trials=5,
        time_column=time_column,
        time_series_identifier_columns=[time_series_identifier_column],
        time_series_attribute_columns=time_series_attribute_columns,
        available_at_forecast_columns=available_at_forecast_columns,
        unavailable_at_forecast_columns=unavailable_at_forecast_columns,
        forecast_horizon=forecast_horizon,
        context_window=context_window,
        dataflow_subnetwork=dataflow_subnetwork,
        dataflow_use_public_ips=dataflow_use_public_ips,
        run_evaluation=run_evaluation,
        # evaluated_examples_bigquery_path=evaluated_examples_bigquery_path,
        dataflow_service_account=SERVICE_ACCOUNT,
        # Quantile forecast requires `minimize-quantile-loss` as optimization objective.
        # quantiles=[0.25, 0.5, 0.9],
    )

    job_id = "tide-forecasting-with-parent-model-{}".format(uuid.uuid4())
    job = aiplatform.PipelineJob(
        display_name=job_id,
        location=REGION,  # launches the pipeline job in the specified region
        template_path=template_path,
        job_id=job_id,
        pipeline_root=root_dir,
        parameter_values=parameter_values,
        enable_caching=False,
        input_artifacts={"parent_model": parent_model_artifact_id},
    )

    job.run(service_account=SERVICE_ACCOUNT)

## Integrate Tabular Workflow for Forecasting into your existing KFP pipeline

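The integration uses [the pipeline-as-component feature](https://www.kubeflow.org/docs/components/pipelines/v2/load-and-share-components/) of KFP: the compiled forecasting pipeline template is loaded as a component and called as a step of a larger pipeline.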

In [None]:
from kfp import compiler, components, dsl

train_budget_milli_node_hours = 250.0  # 15 minutes

(
    template_path,
    parameter_values,
) = automl_forecasting_utils.get_time_series_dense_encoder_forecasting_pipeline_and_parameters(
    project=PROJECT_ID,
    location=REGION,
    root_dir=root_dir,
    target_column=target_column,
    optimization_objective=optimization_objective,
    transformations=transformations,
    train_budget_milli_node_hours=train_budget_milli_node_hours,
    data_source_csv_filenames=data_source_csv_filenames,
    data_source_bigquery_table_path=data_source_bigquery_table_path,
    weight_column=weight_column,
    predefined_split_key=predefined_split_key,
    training_fraction=training_fraction,
    validation_fraction=validation_fraction,
    test_fraction=test_fraction,
    num_selected_trials=num_selected_trials,
    time_column=time_column,
    time_series_identifier_columns=[time_series_identifier_column],
    time_series_attribute_columns=time_series_attribute_columns,
    available_at_forecast_columns=available_at_forecast_columns,
    unavailable_at_forecast_columns=unavailable_at_forecast_columns,
    forecast_horizon=forecast_horizon,
    context_window=context_window,
    dataflow_subnetwork=dataflow_subnetwork,
    dataflow_use_public_ips=dataflow_use_public_ips,
    run_evaluation=False,
    dataflow_service_account=SERVICE_ACCOUNT,
)

# Load the forecasting pipeline as a sub-pipeline/components which can be used
# in a larger KFP pipeline.
forecasting_pipeline = components.load_component_from_file(template_path)


@dsl.component
def print_message(msg: str):
    print("message:", msg)


# Define a pipeline that runs the following steps in order:
# step_1(print_message) -> step_2(print_message) -> forecasting_pipeline
@dsl.pipeline
def outer_pipeline(msg_1: str, msg_2: str, ds: dsl.Artifact):
    step_1 = print_message(msg=msg_1)
    step_2 = print_message(msg=msg_2).after(step_1)
    # `vertex_dataset` argument needs to be set/forwarded here to avoid the
    # "missing-argument" error in KFP pipeline.
    forecasting_pipeline(**parameter_values, vertex_dataset=ds).after(step_2)


# Compile and save the outer/larger pipeline template.
outer_pipeline_template_path = "./outer_pipeline.yaml"
compiler.Compiler().compile(outer_pipeline, outer_pipeline_template_path)


job_id = "run-forecasting-pipeline-inside-pipeline-{}".format(uuid.uuid4())
job = aiplatform.PipelineJob(
    display_name=job_id,
    location=REGION,  # launches the pipeline job in the specified region
    template_path=outer_pipeline_template_path,
    job_id=job_id,
    pipeline_root=root_dir,
    parameter_values={"msg_1": "step 1", "msg_2": "step 2"},
    enable_caching=False,
)

job.run(service_account=SERVICE_ACCOUNT)

## Cleaning up

To clean up all Google Cloud resources used in this project, you can [delete the Google Cloud
project](https://cloud.google.com/resource-manager/docs/creating-managing-projects#shutting_down_projects) you used for the tutorial.

Otherwise, you can delete the individual resources you created in this tutorial:

In [None]:
import os

# Delete Cloud Storage objects that were created
delete_bucket = False
if delete_bucket or os.getenv("IS_TESTING"):
    ! gsutil -m rm -r $BUCKET_URI
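
The cell above removes only the Cloud Storage bucket. If the batch prediction cells were run, the batch prediction job and the trained model also remain. A minimal sketch for deleting such resources is shown below. `delete_resources` is a hypothetical helper that assumes each object exposes a `delete()` method, as `aiplatform.Model` and `aiplatform.BatchPredictionJob` objects do; the stand-in class exists only so that the snippet runs on its own.

```python
def delete_resources(resources):
    """Call delete() on each resource and collect failures instead of stopping."""
    failures = []
    for resource in resources:
        try:
            resource.delete()
        except Exception as error:
            # Keep going so one failed deletion does not block the others.
            failures.append((resource, error))
    return failures


class FakeResource:
    """Stand-in for a Vertex AI resource object, for illustration only."""

    def __init__(self, should_fail=False):
        self.should_fail = should_fail
        self.deleted = False

    def delete(self):
        if self.should_fail:
            raise RuntimeError("deletion failed")
        self.deleted = True


resources = [FakeResource(), FakeResource(should_fail=True), FakeResource()]
failures = delete_resources(resources)
print(len(failures))
```

In this notebook, the candidates would be `batch_prediction_job` and `forecasting_mp_model`, in that order, guarded by a flag like `delete_bucket` above.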