notebooks/official/automl/automl_forecasting_on_vertex_pipelines.ipynb

{
  "cells": [
    {
      "cell_type": "code",
      "execution_count": 1,
      "metadata": {
        "id": "18ebbd838e32"
      },
      "outputs": [],
      "source": [
        "# Copyright 2023 Google LLC\n",
        "#\n",
        "# Licensed under the Apache License, Version 2.0 (the \"License\");\n",
        "# you may not use this file except in compliance with the License.\n",
        "# You may obtain a copy of the License at\n",
        "#\n",
        "#     https://www.apache.org/licenses/LICENSE-2.0\n",
        "#\n",
        "# Unless required by applicable law or agreed to in writing, software\n",
        "# distributed under the License is distributed on an \"AS IS\" BASIS,\n",
        "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n",
        "# See the License for the specific language governing permissions and\n",
        "# limitations under the License."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "mThXALJl9Yue"
      },
      "source": [
        "# Tabular Workflow for Forecasting\n",
        "\n",
        "<table align=\"left\">\n",
        "  <td>\n",
        "    <a href=\"https://colab.research.google.com/github/GoogleCloudPlatform/vertex-ai-samples/blob/main/notebooks/official/automl/automl_tabular_on_vertex_pipelines.ipynb\">\n",
        "      <img src=\"https://cloud.google.com/ml-engine/images/colab-logo-32px.png\" alt=\"Colab logo\"> Run in Colab\n",
        "    </a>\n",
        "  </td>\n",
        "  <td>\n",
        "    <a href=\"https://github.com/GoogleCloudPlatform/vertex-ai-samples/blob/main/notebooks/official/automl/automl_forecasting_on_vertex_pipelines.ipynb\">\n",
        "      <img src=\"https://cloud.google.com/ml-engine/images/github-logo-32px.png\" alt=\"GitHub logo\">\n",
        "      View on GitHub\n",
        "    </a>\n",
        "  </td>\n",
        "  <td>\n",
        "    <a href=\"https://console.cloud.google.com/vertex-ai/workbench/deploy-notebook?download_url=https://raw.githubusercontent.com/GoogleCloudPlatform/vertex-ai-samples/main/notebooks/official/automl/automl_forecasting_on_vertex_pipelines.ipynb\">\n",
        "        <img src=\"https://lh3.googleusercontent.com/UiNooY4LUgW_oTvpsNhPpQzsstV5W8F7rYgxgGBD85cWJoLmrOzhVs_ksK_vgx40SHs7jCqkTkCk=e14-rj-sc0xffffff-h130-w32\" alt=\"Vertex AI logo\">\n",
        "      Open in Vertex AI Workbench\n",
        "    </a>\n",
        "  </td>\n",
        "</table>\n",
        "<br/><br/><br/>"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "962e636b5cee"
      },
      "source": [
        "**_NOTE_**: This notebook has been tested in the following environment:\n",
        "\n",
        "* Python version = 3.9"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "fcc745968395"
      },
      "source": [
        "## Overview\n",
        "\n",
        "This tutorial demonstrates how you can use Vertex AI Tabular Workflow for Forecasting to train an AutoML model. You can choose between the following model types: Time Series Dense Encoder (TiDE), Learn to Learn (L2L), Sequence to Sequence (Seq2Seq+), and Temporal Fusion Transformer (TFT).\n",
        "\n",
        "Learn more about [Tabular Workflow for Forecasting](https://cloud.google.com/vertex-ai/docs/tabular-data/tabular-workflows/forecasting)."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "8b54ba90629a"
      },
      "source": [
        "### Compared to Vertex Forecasting managed service.\n",
        "\n",
        "Compared to Vertex Forecasting managed service, Tabular Workflow for Forecasting has the following advantages:\n",
        "1. Composite time series id columns are supported. You can use a combination of multiple columns as the time series id, for example, you can use either `['sku_id']` or `['sku_id', 'store_id']` as the time series id columns.\n",
        "2. Model architecture search can be skipped. You can reuse the previous model architecture search tuning result to train the model directly.\n",
        "3. Hardware customization. You can override the machine spec of the tuning and the training step, so that you can tune the training speed. You are also able to control the parallelism of the training process and the number of the final selected trials during the ensemble step.\n",
        "4. Unlimited time steps support in one single time series. There's no 3000 time steps limit in the training dataset.\n",
        "5. No upper limit for the training dataset. There's no 100MM rows limit or 100GB limit in dataset size.\n",
        "6. Use all advanced features from the Vertex AI Pipelines."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "f887ec5c06c5"
      },
      "source": [
        "### Objective\n",
        "\n",
        "In this tutorial, you learn how to create AutoML Forecasting models using [Vertex AI Pipelines](https://cloud.google.com/vertex-ai/docs/pipelines/introduction) downloaded from [Google Cloud Pipeline Components](https://cloud.google.com/vertex-ai/docs/pipelines/components-introduction) (GCPC). These pipelines are Vertex AI Tabular Workflow pipelines that are maintained by Google. These pipelines showcases different ways to customize the Vertex AI Tabular training process.\n",
        "\n",
        "This tutorial uses the following Google Cloud ML services:\n",
        "\n",
        "- AutoML training\n",
        "- Vertex AI Pipelines\n",
        "\n",
        "The steps performed are:\n",
        "\n",
        "- Create a training pipeline with TiDE(Time series Dense Encoder) algorithm using specified machine type for training.\n",
        "- Create a training pipeline that reuses the architecture search results from the previous pipeline to save time for TiDE(Time series Dense Encoder).\n",
        "- Create a training pipeline with Learn-to-learn(L2L) algorithm.\n",
        "- Create a training pipeline with Seq2seq(Sequence to sequence) algorithm.\n",
        "- Create a training pipeline with TFT(Temporal Fusion Transformer) algorithm.\n",
        "- Perform the batch prediction using the trained model in the above steps."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "eac26958afe8"
      },
      "source": [
        "### Dataset\n",
        "\n",
        "This tutorial uses the [Liquor dataset](https://www.kaggle.com/datasets/residentmario/iowa-liquor-sales), which forecasts the alcoholic beverage sales in the Midwest."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "181d4dfbf917"
      },
      "source": [
        "### Costs\n",
        "\n",
        "This tutorial uses billable components of Google Cloud:\n",
        "\n",
        "* Vertex AI\n",
        "* Cloud Storage\n",
        "* BigQuery\n",
        "* Dataflow\n",
        "\n",
        "Learn about [Vertex AI\n",
        "pricing](https://cloud.google.com/vertex-ai/pricing), [Cloud Storage\n",
        "pricing](https://cloud.google.com/storage/pricing), and [BigQuery](https://cloud.google.com/bigquery), and use the [Pricing\n",
        "Calculator](https://cloud.google.com/products/calculator/)\n",
        "to generate a cost estimate based on your projected usage."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "e85f0288a6df"
      },
      "source": [
        "## Install additional packages\n",
        "\n",
        "Install the Google Cloud Pipeline Components (GCPC) SDK not earlier than `2.3.0`.\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "E7-SzYTR9bo2"
      },
      "outputs": [],
      "source": [
        "!pip3 install --upgrade --quiet google-cloud-pipeline-components==2.3.0 \\\n",
        "                                google-cloud-aiplatform"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "Bj5O0S5RTxzY"
      },
      "source": [
        "### Colab only: Uncomment the following cell to restart the kernel."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "023DMKUaTypt"
      },
      "outputs": [],
      "source": [
        "# Automatically restart kernel after installs so that your environment can access the new packages\n",
        "# import IPython\n",
        "\n",
        "# app = IPython.Application.instance()\n",
        "# app.kernel.do_shutdown(True)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "yfEglUHQk9S3"
      },
      "source": [
        "## Before you begin\n",
        "\n",
        "### Set up your Google Cloud project\n",
        "\n",
        "**The following steps are required, regardless of your notebook environment.**\n",
        "\n",
        "1. [Select or create a Google Cloud project](https://console.cloud.google.com/cloud-resource-manager).\n",
        "\n",
        "2. [Make sure that billing is enabled for your project](https://cloud.google.com/billing/docs/how-to/modify-project).\n",
        "\n",
        "3. [Enable the Vertex AI API](https://console.cloud.google.com/flows/enableapi?apiid=ml.googleapis.com,dataflow.googleapis.com,compute_component,storage-component.googleapis.com).\n",
        "\n",
        "4. If you are running this notebook locally, you need to install the [Cloud SDK](https://cloud.google.com/sdk)."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "zebLBGXOky2A"
      },
      "source": [
        "## Notes about service account and permission\n",
        "\n",
        "For full details of the permission setup, refer to https://cloud.google.com/vertex-ai/docs/tabular-data/tabular-workflows/service-accounts\n",
        "\n",
        "**By default no configuration is required**, if you run into any permission related issue, please make sure the service accounts above have the required roles:\n",
        "\n",
        "|Service account email|Description|Roles|\n",
        "|---|---|---|\n",
        "|PROJECT_NUMBER-compute@developer.gserviceaccount.com|Compute Engine default service account|Dataflow Developer, Dataflow Worker, Storage Admin, BigQuery Data Editor, Vertex AI User, Service Account User|\n",
        "|service-PROJECT_NUMBER@gcp-sa-aiplatform.iam.gserviceaccount.com|AI Platform Service Agent|Vertex AI Service Agent|\n",
        "\n",
        "\n",
        "1. Goto https://console.cloud.google.com/iam-admin/iam.\n",
        "2. Check the \"Include Google-provided role grants\" checkbox.\n",
        "3. Find the above emails.\n",
        "4. Grant the corresponding roles.\n",
        "\n",
        "### Using data source from a different project\n",
        "- For the BQ data source, grant both service accounts the \"BigQuery Data Viewer\" role.\n",
        "- For the CSV data source, grant both service accounts the \"Storage Object Viewer\" role.\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "95cb7ffd6895"
      },
      "source": [
        "### Set your project ID\n",
        "\n",
        "**If you don't know your project ID**, try the following:\n",
        "* Run `gcloud config list`.\n",
        "* Run `gcloud projects list`.\n",
        "* See the support page: [Locate the project ID](https://support.google.com/googleapi/answer/7014113)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "cd85f5c794e5"
      },
      "outputs": [],
      "source": [
        "PROJECT_ID = \"[your-project-id]\"  # @param {type:\"string\"}\n",
        "\n",
        "# Set the project id\n",
        "! gcloud config set project {PROJECT_ID}"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "b12f508d97c6"
      },
      "source": [
        "### Region\n",
        "\n",
        "You can also change the `REGION` variable used by Vertex AI. Learn more about [Vertex AI regions](https://cloud.google.com/vertex-ai/docs/general/locations)."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "8e8b7997de7a"
      },
      "outputs": [],
      "source": [
        "REGION = \"us-central1\"  # @param {type: \"string\"}"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "eu0e2TRVxjHb"
      },
      "source": [
        "### Authenticate your Google Cloud account\n",
        "\n",
        "Depending on your Jupyter environment, you may have to manually authenticate. Follow the relevant instructions below."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "d118c95af93f"
      },
      "source": [
        "**1. Vertex AI Workbench**\n",
        "* Do nothing since you're already authenticated."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "3035286fcdda"
      },
      "source": [
        "**2. Local JupyterLab instance, uncomment and run:**"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "455882ec0f11"
      },
      "outputs": [],
      "source": [
        "# ! gcloud auth login"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "5097f3233d53"
      },
      "source": [
        "**3. Colab, uncomment and run:**"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "2b88e46ac2c8"
      },
      "outputs": [],
      "source": [
        "# from google.colab import auth\n",
        "# auth.authenticate_user()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "fcdbb8929927"
      },
      "source": [
        "**4. Service account or other**\n",
        "* See how to grant Cloud Storage permissions to your service account at https://cloud.google.com/storage/docs/gsutil/commands/iam#ch-examples."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "OUfwWir9yPNV"
      },
      "source": [
        "### Create a Cloud Storage bucket\n",
        "\n",
        "Create a storage bucket to store intermediate artifacts such as datasets, TF model checkpoint, TensorBoard file, etc."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "5325f437b46a"
      },
      "outputs": [],
      "source": [
        "BUCKET_URI = f\"gs://your-bucket-name-{PROJECT_ID}-unique\"  # @param {type:\"string\"}"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "6ebc0bdb07af"
      },
      "source": [
        "**If your bucket doesn't already exist**: Run the following cell to create your Cloud Storage bucket."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "0217c63ed87f"
      },
      "outputs": [],
      "source": [
        "! gsutil mb -l {REGION} -p {PROJECT_ID} {BUCKET_URI}"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "44accda192d5"
      },
      "source": [
        "#### Service Account\n",
        "\n",
        "Use a service account to create Vertex AI Pipeline jobs. If you don't want to use your project's Compute Engine service account, set `SERVICE_ACCOUNT` to another service account ID."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "604ae09ab6d3"
      },
      "outputs": [],
      "source": [
        "SERVICE_ACCOUNT = \"[your-service-account]\"  # @param {type:\"string\"}\n",
        "\n",
        "if (\n",
        "    SERVICE_ACCOUNT == \"\"\n",
        "    or SERVICE_ACCOUNT is None\n",
        "    or SERVICE_ACCOUNT == \"[your-service-account]\"\n",
        "):\n",
        "    import sys\n",
        "    IS_COLAB = 'google.colab' in sys.modules\n",
        "\n",
        "    # Get your service account from gcloud\n",
        "    if not IS_COLAB:\n",
        "        shell_output = !gcloud auth list 2>/dev/null\n",
        "        SERVICE_ACCOUNT = shell_output[2].replace(\"*\", \"\").strip()\n",
        "\n",
        "    else:  # IS_COLAB:\n",
        "        shell_output = ! gcloud projects describe  $PROJECT_ID\n",
        "        project_number = shell_output[-1].split(\":\")[1].strip().replace(\"'\", \"\")\n",
        "        SERVICE_ACCOUNT = f\"{project_number}-compute@developer.gserviceaccount.com\"\n",
        "\n",
        "    print(\"Service Account:\", SERVICE_ACCOUNT)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "d1ecb60964d5"
      },
      "source": [
        "#### Set service account access for Vertex AI Pipelines\n",
        "Run the following commands to grant your service account access to read and write pipeline artifacts in the bucket that you created in the previous step. You only need to run this step once per service account."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "a592f0a380c2"
      },
      "outputs": [],
      "source": [
        "! gsutil iam ch serviceAccount:{SERVICE_ACCOUNT}:roles/storage.objectCreator $BUCKET_URI\n",
        "! gsutil iam ch serviceAccount:{SERVICE_ACCOUNT}:roles/storage.objectViewer $BUCKET_URI"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "fbbc3479a1da"
      },
      "source": [
        "## Import libraries and define constants"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "8G6YmJT1yqkV"
      },
      "outputs": [],
      "source": [
        "# Import required modules\n",
        "import json\n",
        "import os\n",
        "import uuid\n",
        "from typing import Any, Dict, List, Optional\n",
        "\n",
        "from google.cloud import aiplatform, storage\n",
        "from google_cloud_pipeline_components.preview.automl.forecasting import \\\n",
        "    utils as automl_forecasting_utils"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "c0423f260423"
      },
      "source": [
        "## Initialize Vertex AI SDK for Python\n",
        "\n",
        "Initialize the Vertex SDK for Python for your project."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "ad69f2590268"
      },
      "outputs": [],
      "source": [
        "aiplatform.init(project=PROJECT_ID, location=REGION)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "Se89OZ0jRWD6"
      },
      "source": [
        "## VPC related config\n",
        "\n",
        "If you need to use a custom Dataflow subnetwork, you can set it through the `dataflow_subnetwork` parameter. The requirements are:\n",
        "1. `dataflow_subnetwork` must be a fully qualified subnetwork name.\n",
        "   ([Example network and subnetwork specifications](https://cloud.google.com/dataflow/docs/guides/specifying-networks#example_network_and_subnetwork_specifications))\n",
        "1. The following service accounts must have [Compute Network User role](https://cloud.google.com/compute/docs/access/iam#compute.networkUser) assigned on the specified dataflow subnetwork [[reference](https://cloud.google.com/dataflow/docs/guides/specifying-networks#shared)]:\n",
        "    1. Compute Engine default service account: PROJECT_NUMBER-compute@developer.gserviceaccount.com\n",
        "    1. Dataflow service account: service-PROJECT_NUMBER@dataflow-service-producer-prod.iam.gserviceaccount.com\n",
        "\n",
        "If your project has VPC-SC enabled, make sure:\n",
        "\n",
        "1. The dataflow subnetwork used in VPC-SC is configured properly for Dataflow.\n",
        "   [[reference](https://cloud.google.com/dataflow/docs/guides/routes-firewall)]\n",
        "1. `dataflow_use_public_ips` is set to False.\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "jjYKwvgxRbTn"
      },
      "outputs": [],
      "source": [
        "# Dataflow's fully qualified subnetwork name, when empty the default subnetwork will be used.\n",
        "# Fully qualified subnetwork name is in the form of\n",
        "# https://www.googleapis.com/compute/v1/projects/HOST_PROJECT_ID/regions/REGION_NAME/subnetworks/SUBNETWORK_NAME\n",
        "# reference: https://cloud.google.com/dataflow/docs/guides/specifying-networks#example_network_and_subnetwork_specifications\n",
        "dataflow_subnetwork = None  # @param {type:\"string\"}\n",
        "# Specifies whether Dataflow workers use public IP addresses.\n",
        "dataflow_use_public_ips = True  # @param {type:\"boolean\"}"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "cqSmR2Q3Rphx"
      },
      "source": [
        "## Prepare for training"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "3LWH3PRF5o2v"
      },
      "source": [
        "### Define helper functions"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "g9FPFT8c5oC0"
      },
      "outputs": [],
      "source": [
        "# Below functions will serve as the utility functions.\n",
        "\n",
        "\n",
        "# Fetch the tuple of GCS bucket and object URI.\n",
        "def get_bucket_name_and_path(uri: str):\n",
        "    no_prefix_uri = uri[len(\"gs://\") :]\n",
        "    splits = no_prefix_uri.split(\"/\")\n",
        "    return splits[0], \"/\".join(splits[1:])\n",
        "\n",
        "\n",
        "# Fetch the content from a GCS object URI.\n",
        "def download_from_gcs(uri: str):\n",
        "    bucket_name, path = get_bucket_name_and_path(uri)\n",
        "    storage_client = storage.Client(project=PROJECT_ID)\n",
        "    bucket = storage_client.get_bucket(bucket_name)\n",
        "    blob = bucket.blob(path)\n",
        "    return blob.download_as_string()\n",
        "\n",
        "\n",
        "# Upload the string content as a GCS object.\n",
        "def write_to_gcs(uri: str, content: str):\n",
        "    bucket_name, path = get_bucket_name_and_path(uri)\n",
        "    storage_client = storage.Client()\n",
        "    bucket = storage_client.get_bucket(bucket_name)\n",
        "    blob = bucket.blob(path)\n",
        "    blob.upload_from_string(content)\n",
        "\n",
        "\n",
        "# This is the example to set non-auto transformations.\n",
        "# For more details about the transformations, please check:\n",
        "# https://cloud.google.com/vertex-ai/docs/datasets/data-types-tabular#transformations\n",
        "def generate_transformation(\n",
        "    auto_column_names: Optional[List[str]] = None,\n",
        "    numeric_column_names: Optional[List[str]] = None,\n",
        "    categorical_column_names: Optional[List[str]] = None,\n",
        "    text_column_names: Optional[List[str]] = None,\n",
        "    timestamp_column_names: Optional[List[str]] = None,\n",
        ") -> List[Dict[str, Any]]:\n",
        "    if auto_column_names is None:\n",
        "        auto_column_names = []\n",
        "    if numeric_column_names is None:\n",
        "        numeric_column_names = []\n",
        "    if categorical_column_names is None:\n",
        "        categorical_column_names = []\n",
        "    if text_column_names is None:\n",
        "        text_column_names = []\n",
        "    if timestamp_column_names is None:\n",
        "        timestamp_column_names = []\n",
        "    return {\n",
        "        \"auto\": auto_column_names,\n",
        "        \"numeric\": numeric_column_names,\n",
        "        \"categorical\": categorical_column_names,\n",
        "        \"text\": text_column_names,\n",
        "        \"timestamp\": timestamp_column_names,\n",
        "    }\n",
        "\n",
        "\n",
        "# Retrieve the data given a task name.\n",
        "def get_task_detail(\n",
        "    task_details: List[Dict[str, Any]], task_name: str\n",
        ") -> List[Dict[str, Any]]:\n",
        "    for task_detail in task_details:\n",
        "        if task_detail.task_name == task_name:\n",
        "            return task_detail\n",
        "\n",
        "\n",
        "# Retrieve the URI of the model.\n",
        "def get_deployed_model_uri(\n",
        "    task_details,\n",
        "):\n",
        "    ensemble_task = get_task_detail(task_details, \"model-upload\")\n",
        "    return ensemble_task.outputs[\"model\"].artifacts[0].uri\n",
        "\n",
        "\n",
        "# Retrieve the feature importance details from GCS.\n",
        "def get_feature_attributions(\n",
        "    task_details,\n",
        "):\n",
        "    ensemble_task = get_task_detail(task_details, \"model-evaluation-2\")\n",
        "    return download_from_gcs(\n",
        "        ensemble_task.outputs[\"evaluation_metrics\"]\n",
        "        .artifacts[0]\n",
        "        .metadata[\"explanation_gcs_path\"]\n",
        "    )\n",
        "\n",
        "\n",
        "# Retrieve the evaluation metrics from GCS.\n",
        "def get_evaluation_metrics(\n",
        "    task_details,\n",
        "):\n",
        "    ensemble_task = get_task_detail(task_details, \"model-evaluation\")\n",
        "    return download_from_gcs(\n",
        "        ensemble_task.outputs[\"evaluation_metrics\"].artifacts[0].uri\n",
        "    )\n",
        "\n",
        "\n",
        "# Pretty print the JSON string.\n",
        "def load_and_print_json(s):\n",
        "    parsed = json.loads(s)\n",
        "    print(json.dumps(parsed, indent=2, sort_keys=True))"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "gvNFMRmBegZq"
      },
      "source": [
        "### Define training specification"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "eV4JrwB8wAkg"
      },
      "outputs": [],
      "source": [
        "root_dir = os.path.join(BUCKET_URI, f\"automl_forecasting_pipeline/run-{uuid.uuid4()}\")\n",
        "optimization_objective = \"minimize-mae\"\n",
        "time_column = \"date\"\n",
        "time_series_identifier_column = \"store_name\"\n",
        "target_column = \"sale_dollars\"\n",
        "data_source_csv_filenames = None\n",
        "data_source_bigquery_table_path = (\n",
        "    \"bq://bigquery-public-data.iowa_liquor_sales_forecasting.2020_sales_train\"\n",
        ")\n",
        "\n",
        "training_fraction = 0.8\n",
        "validation_fraction = 0.1\n",
        "test_fraction = 0.1\n",
        "\n",
        "predefined_split_key = None\n",
        "if predefined_split_key:\n",
        "    training_fraction = None\n",
        "    validation_fraction = None\n",
        "    test_fraction = None\n",
        "\n",
        "weight_column = None\n",
        "\n",
        "features = [\n",
        "    time_column,\n",
        "    target_column,\n",
        "    \"city\",\n",
        "    \"zip_code\",\n",
        "    \"county\",\n",
        "]\n",
        "\n",
        "available_at_forecast_columns = [time_column]\n",
        "unavailable_at_forecast_columns = [target_column]\n",
        "time_series_attribute_columns = [\"city\", \"zip_code\", \"county\"]\n",
        "forecast_horizon = 150\n",
        "context_window = 150\n",
        "\n",
        "transformations = generate_transformation(auto_column_names=features)\n",
        "\n",
        "# Create a Vertex managed dataset artifact.\n",
        "vertex_dataset = aiplatform.TimeSeriesDataset.create(\n",
        "    bq_source=data_source_bigquery_table_path\n",
        ")\n",
        "vertex_dataset_artifact_id = vertex_dataset.gca_resource.metadata_artifact.split(\"/\")[\n",
        "    -1\n",
        "]"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "bSCTxi48DRxz"
      },
      "source": [
        "## Supported APIs\n",
        "\n",
        "\n",
        "Currently, four model types are supported in the APIs/SDK with the utility functions:\n",
        "1. `time_series_dense_encoder`(`TiDE`): `get_time_series_dense_encoder_forecasting_pipeline_and_parameters`\n",
        "2. `learn_to_learn`(`L2L`): `get_learn_to_learn_forecasting_pipeline_and_parameters`\n",
        "3. `sequence_to_sequence`(`seq2seq`): `get_sequence_to_sequence_forecasting_pipeline_and_parameters`\n",
        "4. `temporal_fusion_transformer`(`TFT`): `get_temporal_fusion_transformer_forecasting_pipeline_and_parameters`"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "C9v6kvuXHCiy"
      },
      "source": [
        "### High level workflow\n",
        "\n",
        "The following code shows the general format for using the APIs:\n",
        "```python\n",
        "# Use the utility function to get the required parameters to create Vertex Pipeline job.\n",
        "template_path, parameter_values = automl_forecasting_utils.get_${MODEL_TYPE}_forecasting_pipeline_and_parameters(\n",
        "  ...\n",
        ")\n",
        "\n",
        "# Construct a Vertex Pipeline job.\n",
        "job = aiplatform.PipelineJob(\n",
        "    ...\n",
        "    location=REGION,  # launches the pipeline job in the specified region\n",
        "    template_path=template_path,\n",
        "    ...\n",
        "    pipeline_root=root_dir,\n",
        "    parameter_values=parameter_values,\n",
        "    ...\n",
        ")\n",
        "\n",
        "# Launch the Vertex Pipeline job.\n",
        "job.run()\n",
        "```"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "7gA0zRUEHO5b"
      },
      "source": [
        "### Utility function arguments\n",
        "\n",
        "The utility functions for all model types have the same arguments.\n",
        "\n",
        "`get_time_series_dense_encoder_forecasting_pipeline_and_parameters` is shown here as an example:\n",
        "\n",
        "```python\n",
        "def get_time_series_dense_encoder_forecasting_pipeline_and_parameters(\n",
        "    *,\n",
        "    project: str,\n",
        "    location: str,\n",
        "    root_dir: str,\n",
        "    target_column: str,\n",
        "    optimization_objective: str,\n",
        "    transformations: Dict[str, List[str]],\n",
        "    train_budget_milli_node_hours: float,\n",
        "    time_column: str,\n",
        "    time_series_identifier_columns: List[str],\n",
        "    time_series_attribute_columns: Optional[List[str]] = None,\n",
        "    available_at_forecast_columns: Optional[List[str]] = None,\n",
        "    unavailable_at_forecast_columns: Optional[List[str]] = None,\n",
        "    forecast_horizon: Optional[int] = None,\n",
        "    context_window: Optional[int] = None,\n",
        "    evaluated_examples_bigquery_path: Optional[str] = None,\n",
        "    window_predefined_column: Optional[str] = None,\n",
        "    window_stride_length: Optional[int] = None,\n",
        "    window_max_count: Optional[int] = None,\n",
        "    holiday_regions: Optional[List[str]] = None,\n",
        "    stage_1_num_parallel_trials: Optional[int] = None,\n",
        "    stage_1_tuning_result_artifact_uri: Optional[str] = None,\n",
        "    stage_2_num_parallel_trials: Optional[int] = None,\n",
        "    num_selected_trials: Optional[int] = None,\n",
        "    data_source_csv_filenames: Optional[str] = None,\n",
        "    data_source_bigquery_table_path: Optional[str] = None,\n",
        "    predefined_split_key: Optional[str] = None,\n",
        "    training_fraction: Optional[float] = None,\n",
        "    validation_fraction: Optional[float] = None,\n",
        "    test_fraction: Optional[float] = None,\n",
        "    weight_column: Optional[str] = None,\n",
        "    dataflow_service_account: Optional[str] = None,\n",
        "    dataflow_subnetwork: Optional[str] = None,\n",
        "    dataflow_use_public_ips: bool = True,\n",
        "    feature_transform_engine_bigquery_staging_full_dataset_id: str = '',\n",
        "    feature_transform_engine_dataflow_machine_type: str = 'n1-standard-16',\n",
        "    feature_transform_engine_dataflow_max_num_workers: int = 10,\n",
        "    feature_transform_engine_dataflow_disk_size_gb: int = 40,\n",
        "    evaluation_batch_predict_machine_type: str = 'n1-standard-16',\n",
        "    evaluation_batch_predict_starting_replica_count: int = 25,\n",
        "    evaluation_batch_predict_max_replica_count: int = 25,\n",
        "    evaluation_dataflow_machine_type: str = 'n1-standard-16',\n",
        "    evaluation_dataflow_max_num_workers: int = 25,\n",
        "    evaluation_dataflow_disk_size_gb: int = 50,\n",
        "    study_spec_parameters_override: Optional[List[Dict[str, Any]]] = None,\n",
        "    stage_1_tuner_worker_pool_specs_override: Optional[Dict[str, Any]] = None,\n",
        "    stage_2_trainer_worker_pool_specs_override: Optional[Dict[str, Any]] = None,\n",
        "    enable_probabilistic_inference: bool = False,\n",
        "    quantiles: Optional[List[float]] = None,\n",
        "    encryption_spec_key_name: Optional[str] = None,\n",
        "    model_display_name: Optional[str] = None,\n",
        "    model_description: Optional[str] = None,\n",
        "    run_evaluation: bool = True,\n",
        ") -> Tuple[str, Dict[str, Any]]:\n",
        "  \"\"\"Returns l2l_forecasting pipeline and formatted parameters.\n",
        "\n",
        "  Args:\n",
        "    project: The GCP project that runs the pipeline components.\n",
        "    location: The GCP region that runs the pipeline components.\n",
        "    root_dir: The root GCS directory for the pipeline components.\n",
        "    target_column: The target column name.\n",
        "    optimization_objective: \"minimize-rmse\", \"minimize-mae\", \"minimize-rmsle\",\n",
        "      \"minimize-rmspe\", \"minimize-wape-mae\", \"minimize-mape\", or\n",
        "      \"minimize-quantile-loss\".\n",
        "    transformations: Dict mapping auto and/or type-resolutions to feature\n",
        "      columns. The supported types are: auto, categorical, numeric, text, and\n",
        "      timestamp.\n",
        "    train_budget_milli_node_hours: The train budget of creating this model,\n",
        "      expressed in milli node hours i.e. 1,000 value in this field means 1 node\n",
        "      hour.\n",
        "    time_column: The column that indicates the time.\n",
        "    time_series_identifier_columns: The columns which distinguish different time\n",
        "      series.\n",
        "    time_series_attribute_columns: The columns that are invariant across the\n",
        "      same time series.\n",
        "    available_at_forecast_columns: The columns that are available at the\n",
        "      forecast time.\n",
        "    unavailable_at_forecast_columns: The columns that are unavailable at the\n",
        "      forecast time.\n",
        "    forecast_horizon: The length of the horizon.\n",
        "    context_window: The length of the context window.\n",
        "    evaluated_examples_bigquery_path: The existing BigQuery dataset to write the\n",
        "      predicted examples into for evaluation, in the format\n",
        "      `bq://project.dataset`. The dataset needs to be created first.\n",
        "    window_predefined_column: The column that indicate the start of each window.\n",
        "    window_stride_length: The stride length to generate the window.\n",
        "    window_max_count: The maximum number of windows that will be generated.\n",
        "    holiday_regions: The geographical regions where the holiday effect is\n",
        "      applied in modeling.\n",
        "    stage_1_num_parallel_trials: Number of parallel trails for stage 1.\n",
        "    stage_1_tuning_result_artifact_uri: The stage 1 tuning result artifact GCS\n",
        "      URI.\n",
        "    stage_2_num_parallel_trials: Number of parallel trails for stage 2.\n",
        "    num_selected_trials: Number of selected trails.\n",
        "    data_source_csv_filenames: A string that represents a list of comma\n",
        "      separated CSV filenames.\n",
        "    data_source_bigquery_table_path: The BigQuery table path of format\n",
        "      bq://bq_project.bq_dataset.bq_table\n",
        "    predefined_split_key: The predefined_split column name.\n",
        "    training_fraction: The training fraction.\n",
        "    validation_fraction: The validation fraction.\n",
        "    test_fraction: The test fraction.\n",
        "    weight_column: The weight column name.\n",
        "    dataflow_service_account: The full service account name.\n",
        "    dataflow_subnetwork: The dataflow subnetwork.\n",
        "    dataflow_use_public_ips: `True` to enable dataflow public IPs.\n",
        "    feature_transform_engine_bigquery_staging_full_dataset_id: The full id of\n",
        "      the feature transform engine staging dataset.\n",
        "    feature_transform_engine_dataflow_machine_type: The dataflow machine type of\n",
        "      the feature transform engine.\n",
        "    feature_transform_engine_dataflow_max_num_workers: The max number of\n",
        "      dataflow workers of the feature transform engine.\n",
        "    feature_transform_engine_dataflow_disk_size_gb: The disk size of the\n",
        "      dataflow workers of the feature transform engine.\n",
        "    evaluation_batch_predict_machine_type: Machine type for the batch prediction\n",
        "      job in evaluation, such as 'n1-standard-16'.\n",
        "    evaluation_batch_predict_starting_replica_count: Number of replicas to use\n",
        "      in the batch prediction cluster at startup time.\n",
        "    evaluation_batch_predict_max_replica_count: The maximum count of replicas\n",
        "      the batch prediction job can scale to.\n",
        "    evaluation_dataflow_machine_type: Machine type for the dataflow job in\n",
        "      evaluation, such as 'n1-standard-16'.\n",
        "    evaluation_dataflow_max_num_workers: Maximum number of dataflow workers.\n",
        "    evaluation_dataflow_disk_size_gb: The disk space in GB for dataflow.\n",
        "    study_spec_parameters_override: The list for overriding study spec.\n",
        "    stage_1_tuner_worker_pool_specs_override: The dictionary for overriding\n",
        "      stage 1 tuner worker pool spec.\n",
        "    stage_2_trainer_worker_pool_specs_override: The dictionary for overriding\n",
        "      stage 2 trainer worker pool spec.\n",
        "    enable_probabilistic_inference: If probabilistic inference is enabled, the\n",
        "      model will fit a distribution that captures the uncertainty of a\n",
        "      prediction. If quantiles are specified, then the quantiles of the\n",
        "      distribution are also returned.\n",
        "    quantiles: Quantiles to use for probabilistic inference. Up to 5 quantiles\n",
        "      are allowed of values between 0 and 1, exclusive. Represents the quantiles\n",
        "      to use for that objective. Quantiles must be unique.\n",
        "    encryption_spec_key_name: The KMS key name.\n",
        "    model_display_name: Optional display name for model.\n",
        "    model_description: Optional description.\n",
        "    run_evaluation: `True` to evaluate the ensembled model on the test split.\n",
        "  \"\"\"\n",
        "  ...\n",
        "```\n",
        "\n",
        "\n",
        "### Use holiday regions\n",
        "\n",
        "For some use cases, forecasting data can be affected by holidays in regional areas. See https://cloud.google.com/vertex-ai/docs/tabular-data/tabular-workflows/forecasting-train#holiday-regions for more information on holiday regions supported by forecasting.\n",
        "\n",
        "Pass in a list of strings `holiday_regions` to the pipeline parameter builder to incorporate holiday data into your training pipeline."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "bb17c3654bcb"
      },
      "source": [
        "## Customize the training configurations\n",
        "\n",
        "You can create a Forecasting pipeline with the following customizations: \n",
        "- Change machine type and tuning / training parallelism\n",
        "- Skip evaluation\n",
        "- Skip model architecture search\n",
        "\n",
        "Instead of doing architecture search everytime, you can reuse the existing architecture search result. This can reduce the variation of the output model or the training cost. The existing architecture search result is stored in the `tuning_result_output` output of the `automl-forecasting-stage-1-tuner` component. You can load it programmatically with the API.\n",
        "\n",
        "```python\n",
        "stage_1_tuner_task = get_task_detail(\n",
        "    pipeline_task_details, \"automl-forecasting-stage-1-tuner\"\n",
        ")\n",
        "\n",
        "stage_1_tuning_result_artifact_uri = (\n",
        "    stage_1_tuner_task.outputs[\"tuning_result_output\"].artifacts[0].uri\n",
        ")\n",
        "```\n",
        "\n",
        "Use the following code snippet to customize the training configuration:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "dbf9aaca20a3"
      },
      "outputs": [],
      "source": [
        "# Customize the work pool for each trial during tuning.\n",
        "# Only the chief node and the evaluator node are used.\n",
        "# You can change the machine spec for these two nodes.\n",
        "worker_pool_specs_override = [\n",
        "    {\"machine_spec\": {\"machine_type\": \"n1-standard-8\"}},  # override for TF chief node\n",
        "    {},  # override for TF worker node, since it's not used, leave it empty\n",
        "    {},  # override for TF ps node, since it's not used, leave it empty\n",
        "    {\n",
        "        \"machine_spec\": {\"machine_type\": \"n1-standard-4\"}\n",
        "    },  # override for TF evaluator node\n",
        "]\n",
        "\n",
        "# Number of weak models in the final ensemble model.\n",
        "num_selected_trials = 5\n",
        "\n",
        "# Specify the evaluation setup.\n",
        "run_evaluation = False"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "a359fc31ba3b"
      },
      "source": [
        "You can export evaluated examples from training to BigQuery by setting the parameter `evaluated_examples_bigquery_path` in the training parameters. The BigQuery path needs to point to an existing BigQuery dataset in the format `bq://project.dataset`."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "b2abbcf372ce"
      },
      "outputs": [],
      "source": [
        "# This is ONLY available when `run_evaluation` is set to `True`.\n",
        "evaluated_examples_bigquery_path = f\"bq://{PROJECT_ID}.eval\""
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "N-iXXE14voyR"
      },
      "source": [
        "## TiDE training\n",
        "\n",
        "Time series Dense Encoder (TiDE) is an optimized dense DNN-based encoder-decoder model, which has great model quality with fast training and inference, especially for long contexts and horizons.\n",
        "\n",
        "For more details, see https://ai.googleblog.com/2023/04/recent-advances-in-deep-long-horizon.html\n",
        "\n",
        "In this tutorial, run the TiDE training pipeline twice:\n",
        "1. With model architecture search\n",
        "2. Without model architecture search "
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "3577d591bad1"
      },
      "source": [
        "### Run the TiDE pipeline with model architecture search"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "sG46cXVueb66"
      },
      "outputs": [],
      "source": [
        "train_budget_milli_node_hours = 250.0  # 15 minutes\n",
        "\n",
        "(\n",
        "    template_path,\n",
        "    parameter_values,\n",
        ") = automl_forecasting_utils.get_time_series_dense_encoder_forecasting_pipeline_and_parameters(\n",
        "    project=PROJECT_ID,\n",
        "    location=REGION,\n",
        "    root_dir=root_dir,\n",
        "    target_column=target_column,\n",
        "    # `minimize-quantile-loss`\n",
        "    optimization_objective=optimization_objective,\n",
        "    transformations=transformations,\n",
        "    train_budget_milli_node_hours=train_budget_milli_node_hours,\n",
        "    # Do not set `data_source_csv_filenames` and\n",
        "    # `data_source_bigquery_table_path` if you want to use Vertex managed\n",
        "    # dataset by commenting out the following two lines.\n",
        "    data_source_csv_filenames=data_source_csv_filenames,\n",
        "    data_source_bigquery_table_path=data_source_bigquery_table_path,\n",
        "    weight_column=weight_column,\n",
        "    predefined_split_key=predefined_split_key,\n",
        "    training_fraction=training_fraction,\n",
        "    validation_fraction=validation_fraction,\n",
        "    test_fraction=test_fraction,\n",
        "    num_selected_trials=num_selected_trials,\n",
        "    time_column=time_column,\n",
        "    time_series_identifier_columns=[time_series_identifier_column],\n",
        "    time_series_attribute_columns=time_series_attribute_columns,\n",
        "    available_at_forecast_columns=available_at_forecast_columns,\n",
        "    unavailable_at_forecast_columns=unavailable_at_forecast_columns,\n",
        "    forecast_horizon=forecast_horizon,\n",
        "    context_window=context_window,\n",
        "    stage_1_tuner_worker_pool_specs_override=worker_pool_specs_override,\n",
        "    dataflow_subnetwork=dataflow_subnetwork,\n",
        "    dataflow_use_public_ips=dataflow_use_public_ips,\n",
        "    run_evaluation=run_evaluation,\n",
        "    # evaluated_examples_bigquery_path=evaluated_examples_bigquery_path,\n",
        "    dataflow_service_account=SERVICE_ACCOUNT,\n",
        "    # Quantile forecast requires `minimize-quantile-loss` as optimization objective.\n",
        "    # quantiles=[0.25, 0.5, 0.9],\n",
        "    # holiday_regions=[\"US\", \"AE\"],\n",
        ")\n",
        "\n",
        "job_id = \"tide-forecasting-{}\".format(uuid.uuid4())\n",
        "job = aiplatform.PipelineJob(\n",
        "    display_name=job_id,\n",
        "    location=REGION,  # launches the pipeline job in the specified region\n",
        "    template_path=template_path,\n",
        "    job_id=job_id,\n",
        "    pipeline_root=root_dir,\n",
        "    parameter_values=parameter_values,\n",
        "    enable_caching=False,\n",
        "    # Uncomment the following line if you want to use Vertex managed dataset.\n",
        "    # input_artifacts={'vertex_dataset': vertex_dataset_artifact_id},\n",
        ")\n",
        "\n",
        "job.run(service_account=SERVICE_ACCOUNT)\n",
        "\n",
        "\n",
        "pipeline_task_details = job.gca_resource.job_detail.task_details"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "c5F12ZL_uZZ3"
      },
      "source": [
        "### Run the TiDE pipeline without the model architecture search\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "c24aa07ead0a"
      },
      "source": [
        "After retrieving the tuning result from the stage 1 tuner, you can use it to skip the model architecture search."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "4x4GA5sMuewX"
      },
      "outputs": [],
      "source": [
        "# Retrieve the tuning result output from the previous training pipeline.\n",
        "stage_1_tuner_task = get_task_detail(\n",
        "    pipeline_task_details, \"automl-forecasting-stage-1-tuner\"\n",
        ")\n",
        "\n",
        "stage_1_tuning_result_artifact_uri = (\n",
        "    stage_1_tuner_task.outputs[\"tuning_result_output\"].artifacts[0].uri\n",
        ")\n",
        "\n",
        "train_budget_milli_node_hours = 250.0  # 15 minutes\n",
        "\n",
        "(\n",
        "    template_path,\n",
        "    parameter_values,\n",
        ") = automl_forecasting_utils.get_time_series_dense_encoder_forecasting_pipeline_and_parameters(\n",
        "    project=PROJECT_ID,\n",
        "    location=REGION,\n",
        "    root_dir=root_dir,\n",
        "    target_column=target_column,\n",
        "    optimization_objective=optimization_objective,\n",
        "    transformations=transformations,\n",
        "    train_budget_milli_node_hours=train_budget_milli_node_hours,\n",
        "    data_source_csv_filenames=data_source_csv_filenames,\n",
        "    data_source_bigquery_table_path=data_source_bigquery_table_path,\n",
        "    weight_column=weight_column,\n",
        "    predefined_split_key=predefined_split_key,\n",
        "    training_fraction=training_fraction,\n",
        "    validation_fraction=validation_fraction,\n",
        "    test_fraction=test_fraction,\n",
        "    num_selected_trials=num_selected_trials,\n",
        "    time_column=time_column,\n",
        "    time_series_identifier_columns=[time_series_identifier_column],\n",
        "    time_series_attribute_columns=time_series_attribute_columns,\n",
        "    available_at_forecast_columns=available_at_forecast_columns,\n",
        "    unavailable_at_forecast_columns=unavailable_at_forecast_columns,\n",
        "    forecast_horizon=forecast_horizon,\n",
        "    context_window=context_window,\n",
        "    dataflow_subnetwork=dataflow_subnetwork,\n",
        "    dataflow_use_public_ips=dataflow_use_public_ips,\n",
        "    stage_1_tuning_result_artifact_uri=stage_1_tuning_result_artifact_uri,\n",
        "    run_evaluation=run_evaluation,\n",
        "    # evaluated_examples_bigquery_path=evaluated_examples_bigquery_path,\n",
        "    dataflow_service_account=SERVICE_ACCOUNT,\n",
        ")\n",
        "\n",
        "job_id = \"tide-forecasting-skip-architecture-search-{}\".format(uuid.uuid4())\n",
        "job = aiplatform.PipelineJob(\n",
        "    display_name=job_id,\n",
        "    location=REGION,  # launches the pipeline job in the specified region\n",
        "    template_path=template_path,\n",
        "    job_id=job_id,\n",
        "    pipeline_root=root_dir,\n",
        "    parameter_values=parameter_values,\n",
        "    enable_caching=False,\n",
        ")\n",
        "\n",
        "job.run(service_account=SERVICE_ACCOUNT)\n",
        "\n",
        "# Get model URI\n",
        "skip_architecture_search_pipeline_task_details = (\n",
        "    job.gca_resource.job_detail.task_details\n",
        ")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "xiftLomOwGda"
      },
      "source": [
        "## L2L training\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "2bfe9f2568c7"
      },
      "source": [
        "Learn-to-Learn (L2L) is a good choice for a wide range of the time series forecasting use cases."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "XSAcS70N1GMN"
      },
      "outputs": [],
      "source": [
        "train_budget_milli_node_hours = 250.0  # 15 minutes\n",
        "\n",
        "(\n",
        "    template_path,\n",
        "    parameter_values,\n",
        ") = automl_forecasting_utils.get_learn_to_learn_forecasting_pipeline_and_parameters(\n",
        "    project=PROJECT_ID,\n",
        "    location=REGION,\n",
        "    root_dir=root_dir,\n",
        "    target_column=target_column,\n",
        "    optimization_objective=optimization_objective,\n",
        "    transformations=transformations,\n",
        "    train_budget_milli_node_hours=train_budget_milli_node_hours,\n",
        "    data_source_csv_filenames=data_source_csv_filenames,\n",
        "    data_source_bigquery_table_path=data_source_bigquery_table_path,\n",
        "    weight_column=weight_column,\n",
        "    predefined_split_key=predefined_split_key,\n",
        "    training_fraction=training_fraction,\n",
        "    validation_fraction=validation_fraction,\n",
        "    test_fraction=test_fraction,\n",
        "    num_selected_trials=num_selected_trials,\n",
        "    time_column=time_column,\n",
        "    time_series_identifier_columns=[time_series_identifier_column],\n",
        "    time_series_attribute_columns=time_series_attribute_columns,\n",
        "    available_at_forecast_columns=available_at_forecast_columns,\n",
        "    unavailable_at_forecast_columns=unavailable_at_forecast_columns,\n",
        "    forecast_horizon=forecast_horizon,\n",
        "    context_window=context_window,\n",
        "    dataflow_subnetwork=dataflow_subnetwork,\n",
        "    dataflow_use_public_ips=dataflow_use_public_ips,\n",
        "    run_evaluation=run_evaluation,\n",
        "    # evaluated_examples_bigquery_path=evaluated_examples_bigquery_path,\n",
        "    dataflow_service_account=SERVICE_ACCOUNT,\n",
        "    # Quantile forecast requires `minimize-quantile-loss` as optimization objective.\n",
        "    # quantiles=[0.25, 0.5, 0.9],\n",
        ")\n",
        "\n",
        "job_id = \"l2l-forecasting-{}\".format(uuid.uuid4())\n",
        "job = aiplatform.PipelineJob(\n",
        "    display_name=job_id,\n",
        "    location=REGION,  # launches the pipeline job in the specified region\n",
        "    template_path=template_path,\n",
        "    job_id=job_id,\n",
        "    pipeline_root=root_dir,\n",
        "    parameter_values=parameter_values,\n",
        "    enable_caching=False,\n",
        ")\n",
        "\n",
        "job.run(service_account=SERVICE_ACCOUNT)\n",
        "\n",
        "\n",
        "pipeline_task_details = job.gca_resource.job_detail.task_details"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "1LvCFqFtI-1t"
      },
      "source": [
        "## Seq2seq training\n",
        "\n",
        "Sequence-to-sequence (seq2seq) is a good choice for experimentation. The algorithm is likely to converge faster than AutoML because its architecture is simpler and it uses a smaller search space."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "euK31b8xMJ8i"
      },
      "outputs": [],
      "source": [
        "train_budget_milli_node_hours = 250.0  # 15 minutes\n",
        "\n",
        "(\n",
        "    template_path,\n",
        "    parameter_values,\n",
        ") = automl_forecasting_utils.get_sequence_to_sequence_forecasting_pipeline_and_parameters(\n",
        "    project=PROJECT_ID,\n",
        "    location=REGION,\n",
        "    root_dir=root_dir,\n",
        "    target_column=target_column,\n",
        "    optimization_objective=optimization_objective,\n",
        "    transformations=transformations,\n",
        "    train_budget_milli_node_hours=train_budget_milli_node_hours,\n",
        "    data_source_csv_filenames=data_source_csv_filenames,\n",
        "    data_source_bigquery_table_path=data_source_bigquery_table_path,\n",
        "    weight_column=weight_column,\n",
        "    predefined_split_key=predefined_split_key,\n",
        "    training_fraction=training_fraction,\n",
        "    validation_fraction=validation_fraction,\n",
        "    test_fraction=test_fraction,\n",
        "    num_selected_trials=num_selected_trials,\n",
        "    time_column=time_column,\n",
        "    time_series_identifier_columns=[time_series_identifier_column],\n",
        "    time_series_attribute_columns=time_series_attribute_columns,\n",
        "    available_at_forecast_columns=available_at_forecast_columns,\n",
        "    unavailable_at_forecast_columns=unavailable_at_forecast_columns,\n",
        "    forecast_horizon=forecast_horizon,\n",
        "    context_window=context_window,\n",
        "    dataflow_subnetwork=dataflow_subnetwork,\n",
        "    dataflow_use_public_ips=dataflow_use_public_ips,\n",
        "    run_evaluation=run_evaluation,\n",
        "    # evaluated_examples_bigquery_path=evaluated_examples_bigquery_path,\n",
        "    dataflow_service_account=SERVICE_ACCOUNT,\n",
        "    # Quantile prediction is NOT supported by Seq2seq.\n",
        ")\n",
        "\n",
        "job_id = \"seq2seq-forecasting-{}\".format(uuid.uuid4())\n",
        "job = aiplatform.PipelineJob(\n",
        "    display_name=job_id,\n",
        "    location=REGION,  # launches the pipeline job in the specified region\n",
        "    template_path=template_path,\n",
        "    job_id=job_id,\n",
        "    pipeline_root=root_dir,\n",
        "    parameter_values=parameter_values,\n",
        "    enable_caching=False,\n",
        ")\n",
        "\n",
        "job.run(service_account=SERVICE_ACCOUNT)\n",
        "\n",
        "\n",
        "pipeline_task_details = job.gca_resource.job_detail.task_details"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "b8PdaaWdJCh-"
      },
      "source": [
        "## TFT training\n",
        "\n",
        "TFT stands for \"Temporal Fusion Transformer\", which is an attention-based DNN model designed to produce high accuracy and interpretability by aligning the model with the general multi-horizon forecasting task.\n",
        "\n",
        "With this model, you don't need to explicitly enable the explanability support during serving to get the feature importance for each feature column."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "ZjCEok0IMT-8"
      },
      "outputs": [],
      "source": [
        "train_budget_milli_node_hours = 250.0  # 15 minutes\n",
        "\n",
        "(\n",
        "    template_path,\n",
        "    parameter_values,\n",
        ") = automl_forecasting_utils.get_temporal_fusion_transformer_forecasting_pipeline_and_parameters(\n",
        "    project=PROJECT_ID,\n",
        "    location=REGION,\n",
        "    root_dir=root_dir,\n",
        "    target_column=target_column,\n",
        "    optimization_objective=optimization_objective,\n",
        "    transformations=transformations,\n",
        "    train_budget_milli_node_hours=train_budget_milli_node_hours,\n",
        "    data_source_csv_filenames=data_source_csv_filenames,\n",
        "    data_source_bigquery_table_path=data_source_bigquery_table_path,\n",
        "    weight_column=weight_column,\n",
        "    predefined_split_key=predefined_split_key,\n",
        "    training_fraction=training_fraction,\n",
        "    validation_fraction=validation_fraction,\n",
        "    test_fraction=test_fraction,\n",
        "    # Please note that TFT model will ONLY ensemble the model from\n",
        "    # the top one trial, so `num_selected_trials` can not be set for TFT model.\n",
        "    # num_selected_trials=num_selected_trials,\n",
        "    time_column=time_column,\n",
        "    time_series_identifier_columns=[time_series_identifier_column],\n",
        "    time_series_attribute_columns=time_series_attribute_columns,\n",
        "    available_at_forecast_columns=available_at_forecast_columns,\n",
        "    unavailable_at_forecast_columns=unavailable_at_forecast_columns,\n",
        "    forecast_horizon=forecast_horizon,\n",
        "    context_window=context_window,\n",
        "    dataflow_subnetwork=dataflow_subnetwork,\n",
        "    dataflow_use_public_ips=dataflow_use_public_ips,\n",
        "    run_evaluation=run_evaluation,\n",
        "    # evaluated_examples_bigquery_path=evaluated_examples_bigquery_path,\n",
        "    dataflow_service_account=SERVICE_ACCOUNT,\n",
        "    # Quantile prediction is NOT supported by TFT.\n",
        ")\n",
        "\n",
        "job_id = \"tft-forecasting-{}\".format(uuid.uuid4())\n",
        "job = aiplatform.PipelineJob(\n",
        "    display_name=job_id,\n",
        "    location=REGION,  # launches the pipeline job in the specified region\n",
        "    template_path=template_path,\n",
        "    job_id=job_id,\n",
        "    pipeline_root=root_dir,\n",
        "    parameter_values=parameter_values,\n",
        "    enable_caching=False,\n",
        ")\n",
        "\n",
        "job.run(service_account=SERVICE_ACCOUNT)\n",
        "\n",
        "\n",
        "pipeline_task_details = job.gca_resource.job_detail.task_details"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "L38LI3OOwXkV"
      },
      "source": [
        "## Batch prediction/explain\n",
        "\n",
        "Enable the batch explain feature by simply setting `generate_explanation=True` in the `batch_predict` API.\n",
        "\n",
        "Use the following code to retrieve the trained Forecasting model from the pipeline:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "0bf184357639"
      },
      "outputs": [],
      "source": [
        "upload_model_task = get_task_detail(pipeline_task_details, \"model-upload-2\")\n",
        "\n",
        "forecasting_mp_model_artifact = upload_model_task.outputs[\"model\"].artifacts[0]\n",
        "\n",
        "forecasting_mp_model = aiplatform.Model(\n",
        "    forecasting_mp_model_artifact.metadata[\"resourceName\"]\n",
        ")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "9327a8a390a7"
      },
      "source": [
        "Once you retrieve the Vertex AI model, you can start to perform the batch prediction."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "NfssCxqRg1x-"
      },
      "outputs": [],
      "source": [
        "print(f\"Running Batch prediction for model: {forecasting_mp_model.display_name}\")\n",
        "\n",
        "batch_predict_bq_output_uri_prefix = f\"bq://{PROJECT_ID}\"\n",
        "\n",
        "PREDICTION_DATASET_BQ_PATH = (\n",
        "    \"bq://bigquery-public-data:iowa_liquor_sales_forecasting.2021_sales_predict\"\n",
        ")\n",
        "\n",
        "batch_prediction_job = forecasting_mp_model.batch_predict(\n",
        "    job_display_name=\"forecasting_iowa_liquor_sales_forecasting_predictions\",\n",
        "    bigquery_source=PREDICTION_DATASET_BQ_PATH,\n",
        "    instances_format=\"bigquery\",\n",
        "    bigquery_destination_prefix=batch_predict_bq_output_uri_prefix,\n",
        "    predictions_format=\"bigquery\",\n",
        "    # Uncomment the following line to run batch explain:\n",
        "    # generate_explanation=True,\n",
        "    sync=True,\n",
        ")\n",
        "\n",
        "print(batch_prediction_job)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "PLDc1X8DVSmm"
      },
      "source": [
        "## Retrieve the uploaded Vertex AI model with a Vertex AI pipeline job id"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "MJ_2TaYtTNbO"
      },
      "outputs": [],
      "source": [
        "# Example format of pipeline_job_id: projects/{your-project-id}/locations/us-central1/pipelineJobs/{pipeline-job-id}\n",
        "pipeline_job_id = \"\"  # @param {type:\"string\"}\n",
        "if pipeline_job_id:\n",
        "    job = aiplatform.PipelineJob.get(pipeline_job_id)\n",
        "    pipeline_task_details = job.gca_resource.job_detail.task_details\n",
        "    upload_model_task = get_task_detail(pipeline_task_details, \"model-upload-2\")\n",
        "\n",
        "    forecasting_mp_model_artifact = upload_model_task.outputs[\"model\"].artifacts[0]\n",
        "    forecasting_mp_model = aiplatform.Model(\n",
        "        forecasting_mp_model_artifact.metadata[\"resourceName\"]\n",
        "    )\n",
        "    print(forecasting_mp_model)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "KtcHUmcZIi9g"
      },
      "source": [
        "## Upload with parent model for different model versions"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "6qht5Rdx6fuj"
      },
      "source": [
        "To upload this model to a parent Vertex AI model, you need the `parent_model_resource_name` resource_name of the parent Vertex AI model."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "6-bzpVGTEq14"
      },
      "outputs": [],
      "source": [
        "# The model resource name can be something like: \"projects/{your-project-id}/locations/us-central1/models/{model-id}\"\n",
        "parent_model_resource_name = \"\"  # @param {type:\"string\"}\n",
        "\n",
        "if parent_model_resource_name:\n",
        "    parent_model_artifact = aiplatform.Artifact.get_with_uri(\n",
        "        \"https://us-central1-aiplatform.googleapis.com/v1/\" + parent_model_resource_name\n",
        "    )\n",
        "    parent_model_artifact_id = str(\n",
        "        parent_model_artifact.gca_resource.name.split(\"artifacts/\")[1]\n",
        "    )\n",
        "\n",
        "    train_budget_milli_node_hours = 250.0  # 15 minutes\n",
        "\n",
        "    (\n",
        "        template_path,\n",
        "        parameter_values,\n",
        "    ) = automl_forecasting_utils.get_time_series_dense_encoder_forecasting_pipeline_and_parameters(\n",
        "        project=PROJECT_ID,\n",
        "        location=REGION,\n",
        "        root_dir=root_dir,\n",
        "        target_column=target_column,\n",
        "        optimization_objective=optimization_objective,\n",
        "        transformations=transformations,\n",
        "        train_budget_milli_node_hours=train_budget_milli_node_hours,\n",
        "        # Do not set `data_source_csv_filenames` and\n",
        "        # `data_source_bigquery_table_path` if you want to use Vertex managed\n",
        "        # dataset by commenting out the following two lines.\n",
        "        data_source_csv_filenames=data_source_csv_filenames,\n",
        "        data_source_bigquery_table_path=data_source_bigquery_table_path,\n",
        "        weight_column=weight_column,\n",
        "        predefined_split_key=predefined_split_key,\n",
        "        training_fraction=training_fraction,\n",
        "        validation_fraction=validation_fraction,\n",
        "        test_fraction=test_fraction,\n",
        "        num_selected_trials=5,\n",
        "        time_column=time_column,\n",
        "        time_series_identifier_columns=[time_series_identifier_column],\n",
        "        time_series_attribute_columns=time_series_attribute_columns,\n",
        "        available_at_forecast_columns=available_at_forecast_columns,\n",
        "        unavailable_at_forecast_columns=unavailable_at_forecast_columns,\n",
        "        forecast_horizon=forecast_horizon,\n",
        "        context_window=context_window,\n",
        "        dataflow_subnetwork=dataflow_subnetwork,\n",
        "        dataflow_use_public_ips=dataflow_use_public_ips,\n",
        "        run_evaluation=run_evaluation,\n",
        "        # evaluated_examples_bigquery_path=evaluated_examples_bigquery_path,\n",
        "        dataflow_service_account=SERVICE_ACCOUNT,\n",
        "        # Quantile forecast requires `minimize-quantile-loss` as optimization objective.\n",
        "        # quantiles=[0.25, 0.5, 0.9],\n",
        "    )\n",
        "\n",
        "    job_id = \"tide-forecasting-with-parent-model-{}\".format(uuid.uuid4())\n",
        "    job = aiplatform.PipelineJob(\n",
        "        display_name=job_id,\n",
        "        location=REGION,  # launches the pipeline job in the specified region\n",
        "        template_path=template_path,\n",
        "        job_id=job_id,\n",
        "        pipeline_root=root_dir,\n",
        "        parameter_values=parameter_values,\n",
        "        enable_caching=False,\n",
        "        input_artifacts={\"parent_model\": parent_model_artifact_id},\n",
        "    )\n",
        "\n",
        "    job.run(service_account=SERVICE_ACCOUNT)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "8Bu0wywvPYkD"
      },
      "source": [
        "## Integrate Tabular Workflow for Forecasting into your existing KFP pipeline\n",
        "\n",
        "This is implemented using [the pipeline-as-component feature](https://www.kubeflow.org/docs/components/pipelines/v2/load-and-share-components/) of KFP."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "OcWHUuoYPbSi"
      },
      "outputs": [],
      "source": [
        "from kfp import compiler, components, dsl\n",
        "\n",
        "train_budget_milli_node_hours = 250.0  # 15 minutes\n",
        "\n",
        "(\n",
        "    template_path,\n",
        "    parameter_values,\n",
        ") = automl_forecasting_utils.get_time_series_dense_encoder_forecasting_pipeline_and_parameters(\n",
        "    project=PROJECT_ID,\n",
        "    location=REGION,\n",
        "    root_dir=root_dir,\n",
        "    target_column=target_column,\n",
        "    optimization_objective=optimization_objective,\n",
        "    transformations=transformations,\n",
        "    train_budget_milli_node_hours=train_budget_milli_node_hours,\n",
        "    data_source_csv_filenames=data_source_csv_filenames,\n",
        "    data_source_bigquery_table_path=data_source_bigquery_table_path,\n",
        "    weight_column=weight_column,\n",
        "    predefined_split_key=predefined_split_key,\n",
        "    training_fraction=training_fraction,\n",
        "    validation_fraction=validation_fraction,\n",
        "    test_fraction=test_fraction,\n",
        "    num_selected_trials=num_selected_trials,\n",
        "    time_column=time_column,\n",
        "    time_series_identifier_columns=[time_series_identifier_column],\n",
        "    time_series_attribute_columns=time_series_attribute_columns,\n",
        "    available_at_forecast_columns=available_at_forecast_columns,\n",
        "    unavailable_at_forecast_columns=unavailable_at_forecast_columns,\n",
        "    forecast_horizon=forecast_horizon,\n",
        "    context_window=context_window,\n",
        "    dataflow_subnetwork=dataflow_subnetwork,\n",
        "    dataflow_use_public_ips=dataflow_use_public_ips,\n",
        "    run_evaluation=False,\n",
        "    dataflow_service_account=SERVICE_ACCOUNT,\n",
        ")\n",
        "\n",
        "# Load the forecasting pipeline as a sub-pipeline/components which can be used\n",
        "# in a larger KFP pipeline.\n",
        "forecasting_pipeline = components.load_component_from_file(template_path)\n",
        "\n",
        "\n",
        "@dsl.component\n",
        "def print_message(msg: str):\n",
        "    print(\"message:\", msg)\n",
        "\n",
        "\n",
        "# Define a pipeline that follows the below steps:\n",
        "# step_1(print_message) -> step_2(print_message) -> forecasting_pipeline\n",
        "@dsl.pipeline\n",
        "def outer_pipeline(msg_1: str, msg_2: str, ds: dsl.Artifact):\n",
        "    step_1 = print_message(msg=msg_1)\n",
        "    step_2 = print_message(msg=msg_2).after(step_1)\n",
        "    # `vertex_dataset` argument needs to be set/forwarded here to avoid the\n",
        "    # \"missing-argument\" error in KFP pipeline.\n",
        "    forecasting_pipeline(**parameter_values, vertex_dataset=ds).after(step_2)\n",
        "\n",
        "\n",
        "# Compile and save the outer/larger pipeline template.\n",
        "outer_pipeline_template_path = \"./outer_pipeline.yaml\"\n",
        "compiler.Compiler().compile(outer_pipeline, outer_pipeline_template_path)\n",
        "\n",
        "\n",
        "job_id = \"run-forecasting-pipeline-inside-pipeline-{}\".format(uuid.uuid4())\n",
        "job = aiplatform.PipelineJob(\n",
        "    display_name=job_id,\n",
        "    location=REGION,  # launches the pipeline job in the specified region\n",
        "    template_path=outer_pipeline_template_path,\n",
        "    job_id=job_id,\n",
        "    pipeline_root=root_dir,\n",
        "    parameter_values={\"msg_1\": \"step 1\", \"msg_2\": \"step 2\"},\n",
        "    enable_caching=False,\n",
        ")\n",
        "\n",
        "job.run(service_account=SERVICE_ACCOUNT)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "2992dc3f6019"
      },
      "source": [
        "## Cleaning up\n",
        "\n",
        "To clean up all Google Cloud resources used in this project, you can [delete the Google Cloud\n",
        "project](https://cloud.google.com/resource-manager/docs/creating-managing-projects#shutting_down_projects) you used for the tutorial.\n",
        "\n",
        "Otherwise, you can delete the individual resources you created in this tutorial:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "829cd75564f1"
      },
      "outputs": [],
      "source": [
        "import os\n",
        "\n",
        "# Delete Cloud Storage objects that were created\n",
        "delete_bucket = False\n",
        "if delete_bucket or os.getenv(\"IS_TESTING\"):\n",
        "    ! gsutil -m rm -r $BUCKET_URI"
      ]
    }
  ],
  "metadata": {
    "colab": {
      "name": "automl_forecasting_on_vertex_pipelines.ipynb",
      "toc_visible": true
    },
    "kernelspec": {
      "display_name": "Python 3",
      "name": "python3"
    }
  },
  "nbformat": 4,
  "nbformat_minor": 0
}