In [None]:
# Copyright 2022 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Compare Vertex AI Forecasting and BigQuery ML ARIMA_PLUS

<table align="left">
  <td>
    <a href="https://colab.research.google.com/github/GoogleCloudPlatform/vertex-ai-samples/blob/main/notebooks/official/automl/automl_forecasting_bqml_arima_plus_comparison.ipynb">
      <img src="https://cloud.google.com/ml-engine/images/colab-logo-32px.png" alt="Colab logo"> Run in Colab
    </a>
  </td>
  <td>
    <a href="https://github.com/GoogleCloudPlatform/vertex-ai-samples/blob/main/notebooks/official/automl/automl_forecasting_bqml_arima_plus_comparison.ipynb">
      <img src="https://cloud.google.com/ml-engine/images/github-logo-32px.png" alt="GitHub logo">
      View on GitHub
    </a>
  </td>
  <td>
    <a href="https://console.cloud.google.com/vertex-ai/workbench/deploy-notebook?download_url=https://raw.githubusercontent.com/GoogleCloudPlatform/vertex-ai-samples/main/notebooks/official/automl/automl_forecasting_bqml_arima_plus_comparison.ipynb">
        <img src="https://lh3.googleusercontent.com/UiNooY4LUgW_oTvpsNhPpQzsstV5W8F7rYgxgGBD85cWJoLmrOzhVs_ksK_vgx40SHs7jCqkTkCk=e14-rj-sc0xffffff-h130-w32" alt="Vertex AI logo">
      Open in Vertex AI Workbench
    </a>
  </td>
</table>
<br/><br/><br/>

## Overview

In this tutorial, you take on the role of a store planner who must determine how much inventory they will need to order for each of their products and stores for November 2019. You accomplish this by training forecasting models using historical sales data. You start with a baseline model using BigQuery ML (BQML) [ARIMA_PLUS](https://cloud.google.com/bigquery-ml/docs/reference/standard-sql/bigqueryml-syntax-create-time-series) and then compare it against a [Vertex AI Forecasting](https://cloud.google.com/vertex-ai/docs/tabular-data/forecasting/overview) model.

Learn more about [BQML ARIMA+ forecasting for tabular data](https://cloud.google.com/vertex-ai/docs/tabular-data/forecasting-arima/overview).

### Objective

In this tutorial, you learn how to create a BigQuery ML ARIMA_PLUS model using a training [Vertex AI Pipeline](https://cloud.google.com/vertex-ai/docs/pipelines/introduction) from [Google Cloud Pipeline Components](https://cloud.google.com/vertex-ai/docs/pipelines/components-introduction) (GCPC), and then do a batch prediction using the corresponding prediction pipeline. You then train a Vertex AI Forecasting model using the same data and compare the evaluation metrics.

This tutorial uses the following Google Cloud ML services and resources:

- BigQuery
- Vertex AI

The steps performed are:

- Train the BigQuery ML ARIMA_PLUS model.
- View BigQuery ML model evaluation.
- Make a batch prediction with the BigQuery ML model.
- Create a Vertex AI `Dataset` resource.
- Train the Vertex AI Forecasting model.
- View the Model evaluation.
- Make a batch prediction with the Model.


### Dataset

To demonstrate the tradeoffs between using BigQuery ML and Vertex AI Forecasting, this tutorial will use a synthetic dataset where product sales are dependent on a variety of factors such as advertisements, holidays, and locations. You see how well a univariate model like ARIMA_PLUS can forecast future sales without knowing information about these factors explicitly, and how well a multivariate model like Vertex AI Forecasting can perform when these factors are known.

### Costs

This tutorial uses billable components of Google Cloud:

* Vertex AI
* Cloud Storage
* BigQuery / BigQuery ML

Learn about [Vertex AI pricing](https://cloud.google.com/vertex-ai/pricing), [Cloud Storage pricing](https://cloud.google.com/storage/pricing), and [BigQuery pricing](https://cloud.google.com/bigquery/pricing), and use the [Pricing Calculator](https://cloud.google.com/products/calculator/) to generate a cost estimate based on your projected usage.

## Installation

Install the following packages required to execute this notebook.

In [None]:
! (pip3 install --upgrade --quiet \
    google-cloud-bigquery[pandas]==2.34.4 \
    google-cloud-aiplatform==1.16.1 \
    google-cloud-pipeline-components==1.0.23)

### Colab only: Uncomment the following cell to restart the kernel

In [None]:
# Automatically restart kernel after installs so that your environment can access the new packages
# import IPython

# app = IPython.Application.instance()
# app.kernel.do_shutdown(True)

## Before you begin

### Set your project ID

**If you don't know your project ID**, try the following:
* Run `gcloud config list`.
* Run `gcloud projects list`.
* See the support page: [Locate the project ID](https://support.google.com/googleapi/answer/7014113)

In [None]:
PROJECT_ID = "[your-project-id]"  # @param {type:"string"}

# Set the project id
! gcloud config set project {PROJECT_ID}

#### Region

You can also change the `REGION` variable used by Vertex AI. Learn more about [Vertex AI regions](https://cloud.google.com/vertex-ai/docs/general/locations).

In [None]:
REGION = "us-central1"
DATA_REGION = "US"

### Authenticate your Google Cloud account

Depending on your Jupyter environment, you may have to manually authenticate. Follow the relevant instructions below.

**1. Vertex AI Workbench**
* Do nothing as you are already authenticated.

**2. Local JupyterLab instance, uncomment and run:**

In [None]:
# ! gcloud auth login

**3. Colab, uncomment and run:**

In [None]:
# from google.colab import auth
# auth.authenticate_user()

**4. Service account or other**
* See how to grant Cloud Storage permissions to your service account at https://cloud.google.com/storage/docs/gsutil/commands/iam#ch-examples.

### Create a Cloud Storage bucket

Create a storage bucket to store intermediate artifacts such as datasets.

In [None]:
BUCKET_URI = f"gs://your-bucket-name-{PROJECT_ID}-unique"  # @param {type:"string"}

**Only if your bucket doesn't already exist**: Run the following cell to create your Cloud Storage bucket.

In [None]:
! gsutil mb -l $REGION -p $PROJECT_ID $BUCKET_URI

#### Service Account 

You use a service account to create Vertex AI Pipeline jobs.

In [None]:
SERVICE_ACCOUNT = "[your-service-account]"  # @param {type:"string"}

In [None]:
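import sys

# Detect Colab so the matching service account lookup is used below.
IS_COLAB = "google.colab" in sys.modules

if (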
    SERVICE_ACCOUNT == ""
    or SERVICE_ACCOUNT is None
    or SERVICE_ACCOUNT == "[your-service-account]"
):
    # Get your service account from gcloud
    if not IS_COLAB:
        shell_output = !gcloud auth list 2>/dev/null
        SERVICE_ACCOUNT = shell_output[2].replace("*", "").strip()

    else:  # IS_COLAB:
        shell_output = ! gcloud projects describe  $PROJECT_ID
        project_number = shell_output[-1].split(":")[1].strip().replace("'", "")
        SERVICE_ACCOUNT = f"{project_number}-compute@developer.gserviceaccount.com"

    print("Service Account:", SERVICE_ACCOUNT)

#### Set service account access for Vertex AI Pipelines

Run the following commands to grant your service account access to read and write pipeline artifacts in the bucket that you created in the previous step. You only need to run this step once per service account.

In [None]:
! gsutil iam ch serviceAccount:{SERVICE_ACCOUNT}:roles/storage.objectCreator $BUCKET_URI

! gsutil iam ch serviceAccount:{SERVICE_ACCOUNT}:roles/storage.objectViewer $BUCKET_URI

### Import libraries and define constants

In [None]:
import json
import os
import urllib.parse
import uuid

import pandas as pd
from google.cloud import aiplatform, bigquery
from google_cloud_pipeline_components.experimental.automl.forecasting import \
    utils

## Initialize Vertex AI SDK for Python

Initialize the Vertex AI SDK for Python for your project and corresponding bucket.

In [None]:
aiplatform.init(project=PROJECT_ID, location=REGION, staging_bucket=BUCKET_URI)

## Define train and prediction data

### Location of BigQuery destination table.

#### Create two datasets, one for each model you train. To make things simpler, create the datasets in the same region as the training data.

In [None]:
arima_dataset_name = "forecasting_demo_arima"
vertex_dataset_name = "forecasting_demo_vertex"

arima_dataset_path = ".".join([PROJECT_ID, arima_dataset_name])
vertex_dataset_path = ".".join([PROJECT_ID, vertex_dataset_name])

# Must be same region as TRAINING_DATASET_BQ_PATH.
client = bigquery.Client(project=PROJECT_ID)
bq_dataset = bigquery.Dataset(arima_dataset_path)
bq_dataset.location = DATA_REGION
bq_dataset = client.create_dataset(bq_dataset)
print(f"Created bigquery dataset {arima_dataset_path} in {DATA_REGION}")

# Make this the same region as the other dataset for easier comparisons.
bq_dataset = bigquery.Dataset(vertex_dataset_path)
bq_dataset.location = DATA_REGION
bq_dataset = client.create_dataset(bq_dataset)
print(f"Created bigquery dataset {vertex_dataset_path} in {DATA_REGION}")

### Location of BigQuery training data.

Before training a model, you must first generate the dataset of store sales. This dataset includes multiple products and stores, and it also simulates factors such as advertisements and holiday effects. The data is split into `TRAIN`, `VALIDATE`, `TEST`, and `PREDICT` sets, where the last three sets are each 1 month in duration.
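
As a rough illustration of how the split labels line up with dates, the following sketch reproduces the boundaries in plain Python. The helper name `assign_split` is hypothetical and not part of the tutorial; the boundary dates are taken from the `CASE` expression in the base data query.

```python
from datetime import date

# Split boundaries copied from the CASE expression in the base data query.
# Each entry means "dates strictly before this boundary get this label".
SPLIT_BOUNDARIES = [
    (date(2019, 9, 1), "TRAIN"),
    (date(2019, 10, 1), "VALIDATE"),
    (date(2019, 11, 1), "TEST"),
]


def assign_split(day: date) -> str:
    """Return the split label for a date (hypothetical helper)."""
    for boundary, label in SPLIT_BOUNDARIES:
        if day < boundary:
            return label
    # Everything on or after 2019-11-01 is left for prediction.
    return "PREDICT"


print(assign_split(date(2019, 8, 31)))   # TRAIN
print(assign_split(date(2019, 9, 1)))    # VALIDATE
print(assign_split(date(2019, 10, 31)))  # TEST
print(assign_split(date(2019, 11, 1)))   # PREDICT
```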

#### Begin by defining the subqueries that will create this base sales data.

In [None]:
base_data_query = """
  WITH 

    -- Create time series for each product + store with some covariates.
    time_series AS (
      SELECT
        CONCAT("id_", store_id, "_", product_id) AS id,
        CONCAT('store_', store_id) AS store,
        CONCAT('product_', product_id) AS product,
        date,
        -- Advertise 1/100 products.
        IF(
          ABS(MOD(FARM_FINGERPRINT(CONCAT(product_id, date)), 100)) = 0,
          1,
          0
        ) AS advertisement,
        -- Mark Black Friday (the Friday after Thanksgiving) sales as holiday sales.
        -- DAYOFWEEK is 1-based starting on Sunday, so 6 is Friday.
        IF(
          EXTRACT(DAYOFWEEK FROM date) = 6
            AND EXTRACT(MONTH FROM date) = 11
            AND EXTRACT(DAY FROM date) BETWEEN 23 AND 29,
          1,
          0
        ) AS holiday,
        -- Set when each data split ends.
        CASE
          WHEN date < '2019-09-01' THEN 'TRAIN'
          WHEN date < '2019-10-01' THEN 'VALIDATE'
          WHEN date < '2019-11-01' THEN 'TEST'
          ELSE 'PREDICT'
        END AS split,
      -- Generate the sales with one SKU per date.
      FROM
        UNNEST(GENERATE_DATE_ARRAY('2017-01-01', '2019-12-01')) AS date
      CROSS JOIN
        UNNEST(GENERATE_ARRAY(0, 10)) AS product_id
      CROSS JOIN
        UNNEST(GENERATE_ARRAY(0, 3)) AS store_id  
    ),
    
    -- Randomly determine factors that contribute to how synthetic sales are calculated.
    time_series_sales_factors AS (
      SELECT
        *,
        ABS(MOD(FARM_FINGERPRINT(product), 10)) AS product_factor,
        ABS(MOD(FARM_FINGERPRINT(store), 10)) AS store_factor,
        [1.6, 0.6, 0.8, 1.0, 1.2, 1.8, 2.0][
          ORDINAL(EXTRACT(DAYOFWEEK FROM date))] AS day_of_week_factor,
        1 +  SIN(EXTRACT(MONTH FROM date) * 2.0 * 3.14 / 24.0) AS month_factor,    
        -- Advertised products have increased sales factors for 5 days.
        CASE
          WHEN LAG(advertisement, 0) OVER w = 1.0 THEN 1.2
          WHEN LAG(advertisement, 1) OVER w = 1.0 THEN 1.8
          WHEN LAG(advertisement, 2) OVER w = 1.0 THEN 2.4
          WHEN LAG(advertisement, 3) OVER w = 1.0 THEN 3.0
          WHEN LAG(advertisement, 4) OVER w = 1.0 THEN 1.4
          ELSE 1.0
        END AS advertisement_factor,
        IF(holiday = 1.0, 2.0, 1.0) AS holiday_factor,
        0.001 * ABS(MOD(FARM_FINGERPRINT(CONCAT(product, store, date)), 100)) AS noise_factor
      FROM
        time_series
      WINDOW w AS (PARTITION BY id ORDER BY date)
    ),
  
    -- Use factors to calculate synthetic sales for each time series. 
    base_data AS (
      SELECT
        id,
        store,
        product,
        date,
        split,
        advertisement,
        holiday,
        (
          (1 + store_factor) 
          * (1 + product_factor) 
          * (1 + month_factor + day_of_week_factor) 
          * (
            1.0 
            + 2.0 * advertisement_factor 
            + 3.0 * holiday_factor 
            + 5.0 * noise_factor
          )
        ) AS sales
      FROM
        time_series_sales_factors
      )
"""
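
To make the query's logic easier to inspect, here is a minimal Python sketch of two pieces of it: the holiday flag and the final sales formula. The function names `is_holiday` and `synthetic_sales` are hypothetical; the conditions and coefficients are copied from the SQL above. Note that BigQuery's `DAYOFWEEK` value 6 corresponds to Friday, so the flag fires on the Friday that falls on November 23 to 29.

```python
from datetime import date


def is_holiday(day: date) -> bool:
    """Mirror the query's holiday flag (hypothetical helper).

    BigQuery's DAYOFWEEK is 1-based starting on Sunday, so 6 means Friday.
    Python's isoweekday() is 1-based starting on Monday, so Friday is 5.
    """
    return day.isoweekday() == 5 and day.month == 11 and 23 <= day.day <= 29


def synthetic_sales(
    store_factor: float,
    product_factor: float,
    month_factor: float,
    day_of_week_factor: float,
    advertisement_factor: float,
    holiday_factor: float,
    noise_factor: float,
) -> float:
    """Combine the factors exactly as the base_data subquery does."""
    return (
        (1 + store_factor)
        * (1 + product_factor)
        * (1 + month_factor + day_of_week_factor)
        * (
            1.0
            + 2.0 * advertisement_factor
            + 3.0 * holiday_factor
            + 5.0 * noise_factor
        )
    )


print(is_holiday(date(2019, 11, 29)))  # True
print(is_holiday(date(2019, 11, 28)))  # False
```

Because the holiday and advertisement factors enter the formula directly, a model that sees those columns has information a univariate model has to infer from the sales history alone.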

#### Next, convert this base sales data into a dataset you use to train a model, and a dataset you pass to a trained model at serving time. The training dataset will include the `TRAIN`, `VALIDATE`, and `TEST` splits, while the prediction dataset will include the `PREDICT` split and also the `TEST` split to provide context information.
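
The shape of the prediction dataset can be sketched with plain Python rows. This is only an illustration under the assumption that each row is a dict with `split` and `sales` keys; the helper `build_prediction_rows` is hypothetical and simply mirrors what the prediction query does: keep `TEST` rows as context and blank out the target for `PREDICT` rows.

```python
def build_prediction_rows(rows, target_column="sales"):
    """Keep TEST rows as context and null the target on PREDICT rows.

    Hypothetical helper that mirrors the UNION ALL in the prediction query.
    """
    prediction_rows = []
    for row in rows:
        if row["split"] == "TEST":
            # Context rows keep their observed target values.
            prediction_rows.append(dict(row))
        elif row["split"] == "PREDICT":
            # Rows to forecast carry no target value.
            masked = dict(row)
            masked[target_column] = None
            prediction_rows.append(masked)
    return prediction_rows


example_rows = [
    {"date": "2019-08-31", "split": "TRAIN", "sales": 10.0},
    {"date": "2019-10-31", "split": "TEST", "sales": 12.0},
    {"date": "2019-11-01", "split": "PREDICT", "sales": 15.0},
]
for row in build_prediction_rows(example_rows):
    print(row)
```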

In [None]:
TRAINING_DATASET_BQ_PATH = f"bq://{arima_dataset_path}.train"
PREDICTION_DATASET_BQ_PATH = f"bq://{arima_dataset_path}.pred"

train_query = f"""
    CREATE OR REPLACE TABLE `{arima_dataset_path}.train` AS
    {base_data_query}
    SELECT *
    FROM base_data
    WHERE split != 'PREDICT'
"""
client.query(train_query).result()
print(f"Created {TRAINING_DATASET_BQ_PATH}.")

pred_query = f"""
    CREATE OR REPLACE TABLE `{arima_dataset_path}.pred` AS
    {base_data_query}
    SELECT *
    FROM base_data
    WHERE split = 'TEST'

    UNION ALL

    SELECT * EXCEPT (sales), NULL AS sales
    FROM base_data
    WHERE split = 'PREDICT'
"""
client.query(pred_query).result()
print(f"Created {PREDICTION_DATASET_BQ_PATH}.")

You can take a look at the sales data that was generated. Later in this tutorial, you visualize the time series along with the forecast.

The model is trained with data from January 2017 to October 2019 inclusive.

#### Look at the training data

In [None]:
query = f"SELECT * FROM `{arima_dataset_path}.train` LIMIT 10"
client.query(query).to_dataframe().head()

The table used for prediction contains data from November 2019. It also includes actuals from October 2019 as context information.

#### Look at the prediction data

In [None]:
query = f"SELECT * FROM `{arima_dataset_path}.pred` LIMIT 10"
client.query(query).to_dataframe().head()

# Create a BigQuery ML ARIMA_PLUS model

Now you are ready to start creating your own BigQuery ML ARIMA_PLUS model.

As with Vertex AI Forecasting, the pipeline you run trains evaluation models using the training and validation sets and uses backtesting to create evaluation metrics on the test set. Finally, it produces a serving model that uses all available data.

**How do you estimate the cost?**

Backtesting involves training a single BigQuery ML model for each period in the test set, so the cost is a function of the length of the test set after any downsampling done by the windowing strategy. The cost is also multiplied by the number of candidate models trained, which is determined by `max_order`.

According to [BQ pricing](https://cloud.google.com/bigquery-ml/pricing), BigQuery ML model creation costs $250 per TB. We'll use a max order of 3, which translates to 20 candidate models when there are multiple time series. Our demo dataset is 3 MB in size, and includes 31 test periods. We window with a stride length of 1, so all periods are used for evaluation.

In this tutorial, the model create stage of the pipeline costs `3 MB * ($250 / 1024^2) * (31 / 1) periods * 20 candidates = $0.44`.
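
The same arithmetic can be written as a small function so that you can plug in your own dataset size. This is a sketch only: the function name and parameter names are hypothetical, and the default price is the $250 per TB figure quoted above, which may change over time.

```python
def estimate_model_creation_cost(
    dataset_size_mb: float,
    test_periods: int,
    window_stride_length: int,
    candidate_models: int,
    price_per_tb: float = 250.0,
) -> float:
    """Estimate the model creation cost in dollars (hypothetical helper)."""
    # 1 TB = 1024^2 MB, so this is the price of processing one MB.
    price_per_mb = price_per_tb / 1024**2
    # Windowing with a stride of N evaluates roughly every Nth period.
    evaluated_periods = test_periods / window_stride_length
    return dataset_size_mb * price_per_mb * evaluated_periods * candidate_models


cost = estimate_model_creation_cost(
    dataset_size_mb=3,
    test_periods=31,
    window_stride_length=1,
    candidate_models=20,
)
print(f"${cost:.2f}")  # $0.44
```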

## Create and run the training job
To train a model using the ARIMA pipeline, you perform two steps: 

1. Download the training pipeline from GCPC.
1. Run the job.

#### Create training job

The training pipeline expects the following parameters:

- `bigquery_destination_uri`: (optional) BigQuery Dataset URI. Used to export the metrics table and model. If not given, we will create one for the user.
- `data_granularity_unit`: Enum used to specify the time granularity (hour, day, week, month, etc).
- `data_source_csv_filenames` or `data_source_bigquery_table_path`: A URI for either a CSV stored in Cloud Storage or a BigQuery table, respectively.
- `evaluated_examples_destination_uri`: (optional) BigQuery Dataset URI or Table URI. Used to export the evaluated examples table. Uses `bigquery_destination_uri` if not provided.
- `forecast_horizon`: Integer number of periods to predict.
- A data splitting strategy of either:
  - `predefined_split_key`: A column containing `TRAIN`, `VALIDATE`, or `TEST` to denote the splits for each row.
  - `training_fraction`, `validation_fraction`, and `test_fraction` to split the data chronologically on the time column by the given fractions.
  - `timestamp_split_key` plus the fractions in the previous option to perform fractional splitting on a column other than the time column.
- A windowing strategy of either:
  - `window_column`: A boolean column that decides whether or not each row is considered when calculating the evaluation metrics.
  - `window_stride_length`: Every N rows will be used to compute the evaluation metrics.
  - `window_max_count`: Downsample rows such that only the given number are used to calculate the evaluation metrics.
- `target_column`: Name of target column.
- `time_column`: Name of time column.
- `time_series_identifier_column`: Name of id column.
- `max_order`: Integer between 1 and 5 representing the size of the parameter search space for ARIMA_PLUS. 5 would result in the highest accuracy model, but also the longest training runtime/cost.

The execution of the training pipeline may take around **20 minutes**.
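
To build intuition for how `window_stride_length` affects the number of backtesting windows (and therefore cost), the sketch below picks every Nth test period. This is an assumption about the general idea only; the helper `select_evaluation_windows` is hypothetical, and the pipeline's actual selection logic (for example, which period it starts from) may differ.

```python
def select_evaluation_windows(periods, window_stride_length=1):
    """Keep every Nth period as the start of an evaluation window.

    Hypothetical helper; it illustrates striding, not the pipeline's code.
    """
    if window_stride_length < 1:
        raise ValueError("window_stride_length must be at least 1")
    return periods[::window_stride_length]


test_periods = list(range(31))  # 31 daily periods in the TEST split

print(len(select_evaluation_windows(test_periods, 1)))  # 31
print(len(select_evaluation_windows(test_periods, 7)))  # 5
```

With a stride of 1, as used in this tutorial, every test period starts a window, which gives the most thorough evaluation at the highest cost.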

In [None]:
time_column = "date"  # @param {type: "string"}
time_series_identifier_column = "id"  # @param {type: "string"}
target_column = "sales"  # @param {type: "string"}
forecast_horizon = 30  # @param {type: "integer"}
data_granularity_unit = "day"  # @param {type: "string"}
split_column = "split"  # @param {type: "string"}
window_stride_length = 1  # @param {type: "integer"}
max_order = 3  # @param {type: "integer"}
override_destination = True  # @param {type: "boolean"}

(
    train_job_spec_path,
    train_parameter_values,
) = utils.get_bqml_arima_train_pipeline_and_parameters(
    project=PROJECT_ID,
    location=REGION,
    time_column=time_column,
    time_series_identifier_column=time_series_identifier_column,
    target_column=target_column,
    forecast_horizon=forecast_horizon,
    data_granularity_unit=data_granularity_unit,
    predefined_split_key=split_column,
    data_source_bigquery_table_path=TRAINING_DATASET_BQ_PATH,
    window_stride_length=window_stride_length,
    bigquery_destination_uri=arima_dataset_path,
    override_destination=override_destination,
    max_order=max_order,
)

### Run the training pipeline

Use the Vertex AI Python SDK to kick off a training pipeline run. Once the run has started, the following cell outputs a link that will allow you to monitor the run. The link should look like this: 

`https://console.cloud.google.com/vertex-ai/locations/[REGION]/pipelines/runs/[DISPLAY_NAME]`

In [None]:
# The display name should be unique even if this cell is rerun.
DISPLAY_NAME = f"forecasting-demo-train-{str(uuid.uuid1())}"

job = aiplatform.PipelineJob(
    job_id=DISPLAY_NAME,
    display_name=DISPLAY_NAME,
    pipeline_root=os.path.join(BUCKET_URI, DISPLAY_NAME),
    template_path=train_job_spec_path,
    parameter_values=train_parameter_values,
)
job.run(service_account=SERVICE_ACCOUNT)

## Review model evaluation scores
After your model has finished training, you can review the evaluation scores for it.

#### Metrics are always reported via the `metrics` table in the destination dataset.

In [None]:
query = f"SELECT * FROM `{arima_dataset_path}.metrics`"
arima_metrics = client.query(query).to_dataframe().rename({0: "arima"})
arima_metrics.head()

You can view the predictions used to calculate the evaluation metrics if you want to calculate your own. 

#### View predictions used to calculate the evaluation metrics

The table containing all these predictions is called `evaluated_examples`. In this table, each distinct `predicted_on_date` represents the starting period of a window of predictions. The backtesting metrics make use of all these windows.
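
If you want to calculate your own metrics from these predictions, the standard definitions of MAE, RMSE, and MAPE can be sketched as follows. The function name `backtest_metrics` is hypothetical, and the inputs are plain lists rather than the actual columns of `evaluated_examples`, whose names are not assumed here. MAPE is expressed as a percentage in this sketch, which may not match the scale the pipeline reports.

```python
import math


def backtest_metrics(actuals, predictions):
    """Compute MAE, RMSE, and MAPE (hypothetical helper).

    Assumes equal-length, non-empty inputs and nonzero actual values.
    """
    if not actuals or len(actuals) != len(predictions):
        raise ValueError("actuals and predictions must be non-empty and aligned")
    errors = [p - a for a, p in zip(actuals, predictions)]
    n = len(errors)
    mae = sum(abs(e) for e in errors) / n
    rmse = math.sqrt(sum(e * e for e in errors) / n)
    # MAPE as a percentage; undefined when an actual value is zero.
    mape = 100.0 * sum(abs(e) / abs(a) for e, a in zip(errors, actuals)) / n
    return {"MAE": mae, "RMSE": rmse, "MAPE": mape}


metrics = backtest_metrics([100.0, 200.0, 400.0], [110.0, 180.0, 400.0])
print(metrics["MAE"])  # 10.0
```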

In [None]:
query = f"SELECT * FROM `{arima_dataset_path}.evaluated_examples`"
arima_examples = client.query(query).to_dataframe()
arima_examples.head()

## Create and run prediction job

### Create prediction job
Now that your Model resource is trained, you can make a batch prediction using the prediction pipeline, with the following parameters:

- `bigquery_destination_uri`: (optional) BigQuery Dataset URI. Used to export the metrics table and model. If not given, we will create one for the user.
- `data_source_csv_filenames` or `data_source_bigquery_table_path`: A URI for either a CSV stored in Cloud Storage or a BigQuery table, respectively.
- `generate_explanation`: If True, the predictions table will have some extra xAI columns.
- `model_name`: Name of an existing BigQuery ML ARIMA_PLUS model to use for predictions.

The execution of the prediction pipeline may take around **5 minutes**.

In [None]:
# Get the model name programmatically, you can find this by looking at the
# execution graph in Vertex AI Pipelines.
for task_detail in job.gca_resource.job_detail.task_details:
    if task_detail.task_name == "bigquery-create-model-job":
        model_name = task_detail.outputs["model"].artifacts[0].metadata["modelId"]
        break
else:
    raise ValueError("Couldn't find the model training task.")


(
    predict_job_spec_path,
    predict_parameter_values,
) = utils.get_bqml_arima_predict_pipeline_and_parameters(
    project=PROJECT_ID,
    location=REGION,
    model_name=f"{arima_dataset_path}.{model_name}",
    data_source_bigquery_table_path=PREDICTION_DATASET_BQ_PATH,
    bigquery_destination_uri=arima_dataset_path,
)

### Run the prediction pipeline

Use the Vertex AI Python SDK to kick off a prediction pipeline run. Once the run has started, the following cell outputs a link that will allow you to monitor the run. The link should look like this: 

`https://console.cloud.google.com/vertex-ai/locations/[REGION]/pipelines/runs/[DISPLAY_NAME]`

In [None]:
# The display name should be unique even if this cell is rerun.
DISPLAY_NAME = f"forecasting-demo-predict-{uuid.uuid1()}"

job = aiplatform.PipelineJob(
    job_id=DISPLAY_NAME,
    display_name=DISPLAY_NAME,
    pipeline_root=os.path.join(BUCKET_URI, DISPLAY_NAME),
    template_path=predict_job_spec_path,
    parameter_values=predict_parameter_values,
)
job.run(service_account=SERVICE_ACCOUNT)

### Get the predictions

Next, get the results from the completed batch prediction job. These are always written to a table called `predictions` under the output dataset.

In [None]:
# Get the prediction table programmatically, you can find this by looking at the
# execution graph in Vertex AI Pipelines.
for task_detail in job.gca_resource.job_detail.task_details:
    if task_detail.task_name == "bigquery-query-job":
        pred_table = (
            task_detail.outputs["destination_table"].artifacts[0].metadata["tableId"]
        )
        break
else:
    raise ValueError("Couldn't find the prediction task.")

query = f"SELECT * FROM `{arima_dataset_path}.{pred_table}`"
arima_preds = client.query(query).to_dataframe()
arima_preds.head()

## Visualize the forecasts

Lastly, follow the given link to visualize the generated forecasts in [Data Studio](https://support.google.com/datastudio/answer/6283323?hl=en).
The code block included in this section dynamically generates a Data Studio link that specifies the template, the location of the forecasts, and the query to generate the chart. The data is populated from the forecasts generated earlier.

You can inspect the used template at https://datastudio.google.com/c/u/0/reporting/067f70d2-8cd6-4a4c-a099-292acd1053e8. This was created by Google specifically to view forecasting predictions.

In [None]:
def _sanitize_bq_uri(bq_uri: str):
    if bq_uri.startswith("bq://"):
        bq_uri = bq_uri[5:]
    return bq_uri.replace(":", ".")


def get_data_studio_link(
    batch_prediction_bq_input_uri: str,
    batch_prediction_bq_output_uri: str,
    time_column: str,
    time_series_identifier_column: str,
    target_column: str,
):
    """Creates a link that fills in the demo Data Studio template."""
    batch_prediction_bq_input_uri = _sanitize_bq_uri(batch_prediction_bq_input_uri)
    batch_prediction_bq_output_uri = _sanitize_bq_uri(batch_prediction_bq_output_uri)
    query = f"""
        SELECT
          CAST(input.{time_column} as DATETIME) timestamp_col,
          CAST(input.{time_series_identifier_column} as STRING) time_series_identifier_col,
          CAST(input.{target_column} as NUMERIC) historical_values,
          CAST(predicted_{target_column}.value as NUMERIC) predicted_values,
        FROM `{batch_prediction_bq_input_uri}` input
        LEFT JOIN `{batch_prediction_bq_output_uri}` output
          ON
            TIMESTAMP(input.{time_column}) = TIMESTAMP(output.{time_column})
            AND CAST(input.{time_series_identifier_column} as STRING) = CAST(
              output.{time_series_identifier_column} as STRING)
    """
    params = {
        "templateId": "067f70d2-8cd6-4a4c-a099-292acd1053e8",
        "ds0.connector": "BIG_QUERY",
        "ds0.projectId": PROJECT_ID,
        "ds0.billingProjectId": PROJECT_ID,
        "ds0.type": "CUSTOM_QUERY",
        "ds0.sql": query,
    }
    base_url = "https://datastudio.google.com/c/u/0/reporting"
    url_params = urllib.parse.urlencode({"params": json.dumps(params)})
    return f"{base_url}?{url_params}"
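
When a generated link does not render as expected, it can help to decode it and check the parameters it carries. The following sketch reverses the encoding used above with the standard library; the helper name `decode_data_studio_params` and the example values are hypothetical.

```python
import json
import urllib.parse


def decode_data_studio_params(url: str) -> dict:
    """Recover the JSON-encoded `params` value from a report link.

    Hypothetical debugging helper; it only reverses urlencode + json.dumps.
    """
    query_string = urllib.parse.urlparse(url).query
    encoded_params = urllib.parse.parse_qs(query_string)["params"][0]
    return json.loads(encoded_params)


# Build a small example link the same way the function above does.
example_params = {"ds0.connector": "BIG_QUERY", "ds0.sql": "SELECT 1 AS x"}
example_url = (
    "https://datastudio.google.com/c/u/0/reporting?"
    + urllib.parse.urlencode({"params": json.dumps(example_params)})
)

print(decode_data_studio_params(example_url)["ds0.sql"])  # SELECT 1 AS x
```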

In [None]:
actuals_table = f"{arima_dataset_path}.actuals"
query = f"""
    CREATE OR REPLACE TABLE `{actuals_table}` AS
    {base_data_query}
    SELECT *
    FROM base_data
    WHERE split != 'TRAIN'
"""
client.query(query).result()
print(f"Created {actuals_table}.")

In [None]:
print("Click the link below to view ARIMA predictions:")
print(
    get_data_studio_link(
        batch_prediction_bq_input_uri=actuals_table,
        batch_prediction_bq_output_uri=f"{arima_dataset_path}.{pred_table}",
        time_column=time_column,
        time_series_identifier_column=time_series_identifier_column,
        target_column=target_column,
    )
)

# Compare Against Vertex AI Forecasting

### Create the Dataset

Next, create the `Dataset` resource using the `create` method for the `TimeSeriesDataset` class, which takes the following parameters:

- `display_name`: The human readable name for the `Dataset` resource.
- `gcs_source`: A list of one or more dataset index files to import the data items into the `Dataset` resource.
- `bq_source`: Alternatively, import data items from a BigQuery table into the `Dataset` resource.

This operation may take several minutes.

In [None]:
dataset = aiplatform.TimeSeriesDataset.create(
    display_name="forecasting_demo_train",
    bq_source=[TRAINING_DATASET_BQ_PATH],
)
print(dataset.resource_name)

### Create and run the training job

To train an AutoML model, you perform two steps: 1) create a training job, and 2) run the job.

#### Create training job

An AutoML training job is created with the [`AutoMLForecastingTrainingJob`](https://googleapis.dev/python/aiplatform/latest/aiplatform.html?highlight=batchpredictionjob#google.cloud.aiplatform.AutoMLForecastingTrainingJob) class, with the following parameters:

- `display_name`: The human readable name for the `TrainingJob` resource.
- `column_transformations`: (Optional): Transformations to apply to the input columns
- `optimization_objective`: The optimization objective to minimize or maximize.
    - `minimize-rmse`
    - `minimize-mae`
    - `minimize-rmsle`

The instantiated object is the job for the training pipeline.

In [None]:
column_specs = {
    "date": "timestamp",
    "sales": "numeric",
    "advertisement": "categorical",
    "store": "categorical",
    "product": "categorical",
    "holiday": "categorical",
}
available_at_forecast_columns_ = [
    "date",
    "advertisement",
    "holiday",
]
available_at_forecast_columns = available_at_forecast_columns_  # @param {type: "raw"}
unavailable_at_forecast_columns = ["sales"]  # @param {type: "raw"}
time_series_attribute_columns = ["store", "product"]  # @param {type: "raw"}
context_window = 30  # @param {type: "integer"}
data_granularity_count = 1  # @param {type: "integer"}
budget_milli_node_hours = 1000  # @param {type: "integer"}

In [None]:
MODEL_DISPLAY_NAME = "forecasting-demo-model"

training_job = aiplatform.AutoMLForecastingTrainingJob(
    display_name=MODEL_DISPLAY_NAME,
    optimization_objective="minimize-rmse",
    column_specs=column_specs,
)

#### Run the training pipeline

Next, you start the training job by invoking the method `run`, with the following parameters:

- `dataset`: The `Dataset` resource to train the model.
- `model_display_name`: The human readable name for the trained model.
- `target_column`: The name of the column to train as the label.
- `budget_milli_node_hours`: (optional) Maximum training budget in milli node hours (1,000 = 1 node hour).
- `time_column`: Name of the column that identifies time order in the time series. This column must be available at forecast.
- `time_series_identifier_column`: Name of the column that identifies the time series.

You can specify the split with either
- `training_fraction_split`: The fraction of the dataset to use for training.
- `validation_fraction_split`: The fraction of the dataset to use for validation.
- `test_fraction_split`: The fraction of the dataset to use for test (holdout data).

or
- `predefined_split_column_name`: Column to use to perform the data split.

The `run` method, when completed, returns the `Model` resource.

The execution of the training pipeline may take up to **one hour**. You can learn about the pricing for Vertex AI Forecasting [here](https://cloud.google.com/vertex-ai/pricing#tabular-data).

In [None]:
model = training_job.run(
    dataset=dataset,
    target_column=target_column,
    time_column=time_column,
    time_series_identifier_column=time_series_identifier_column,
    available_at_forecast_columns=available_at_forecast_columns,
    unavailable_at_forecast_columns=unavailable_at_forecast_columns,
    time_series_attribute_columns=time_series_attribute_columns,
    forecast_horizon=forecast_horizon,
    context_window=context_window,
    data_granularity_unit=data_granularity_unit,
    data_granularity_count=data_granularity_count,
    weight_column=None,
    budget_milli_node_hours=budget_milli_node_hours,
    model_display_name=MODEL_DISPLAY_NAME,
    predefined_split_column_name=split_column,
    window_stride_length=window_stride_length,
)

## Review model evaluation scores
After your model has finished training, you can review the evaluation scores for it.

First, you need to get a reference to the new model. As with datasets, you can either use the reference to the model variable you created when you trained the model or you can list all of the models in your project.

In [None]:
# Get model resource ID
models = aiplatform.Model.list(filter=f"display_name={MODEL_DISPLAY_NAME}")
model = models[0]

# Drop any metrics the ARIMA pipeline doesn't support yet.
model_evaluation = list(model.list_model_evaluations())[0]
metrics_dict = {k: [v] for k, v in dict(model_evaluation.metrics).items()}
vertex_metrics = (
    pd.DataFrame.from_dict(metrics_dict)
    .rename(
        {
            "meanAbsoluteError": "MAE",
            "rootMeanSquaredError": "RMSE",
            "meanAbsolutePercentageError": "MAPE",
        },
        axis=1,
    )
    .rename({0: "vertex"})[["MAE", "RMSE", "MAPE"]]
)
vertex_metrics

## Compare both metrics

Now that you have backtesting metrics from both models, you can compare the two side-by-side.

Because sales in this dataset depend on covariates, you should expect lower MAE, RMSE, and MAPE from Vertex AI Forecasting, which can use those additional features. The BigQuery ML ARIMA_PLUS baseline is trained without them, so the gap between the two rows indicates the relative impact of including the covariates in a model.

In [None]:
pd.concat([arima_metrics, vertex_metrics])
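
To put a number on the gap between the two rows, you could compute the percent change of each metric relative to the baseline. The sketch below works on plain dictionaries; `relative_change` is a hypothetical helper, and the numbers are placeholders rather than results from this tutorial.

```python
def relative_change(baseline: dict, candidate: dict) -> dict:
    """Percent change of each shared metric relative to the baseline.

    Negative values mean the candidate has the lower (better) error.
    """
    return {
        name: 100 * (candidate[name] - baseline[name]) / baseline[name]
        for name in baseline
        if name in candidate and baseline[name] != 0
    }


# Placeholder numbers for illustration only; substitute the values from
# arima_metrics and vertex_metrics after running the notebook.
arima_example = {"MAE": 20.0, "RMSE": 30.0, "MAPE": 10.0}
vertex_example = {"MAE": 15.0, "RMSE": 24.0, "MAPE": 8.0}
print(relative_change(arima_example, vertex_example))
```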

## Send a batch prediction request

This section shows how to send a batch prediction request to your Vertex AI Forecasting model in case you want to compare the models at serving time.

### Make the batch prediction request

Now that your `Model` resource is trained, you can make a batch prediction by invoking the `batch_predict()` method with a BigQuery source and destination, using the following parameters:

- `job_display_name`: The human-readable name for the batch prediction job.
- `bigquery_source`: BigQuery URI to a table, up to 2000 characters long. For example: `bq://projectId.bqDatasetId.bqTableId`
- `bigquery_destination_prefix`: The BigQuery dataset or table for storing the batch prediction results.
- `instances_format`: The format for the input instances. Since a BigQuery source is used here, this should be set to `bigquery`.
- `predictions_format`: The format for the output predictions. `bigquery` is used here to write the output to a BigQuery table.
- `generate_explanations`: Set to `True` to generate explanations.
- `sync`: If set to `True`, the call blocks until the asynchronous batch job completes.

In [None]:
batch_prediction_job = model.batch_predict(
    job_display_name="forecasting_demo_predictions",
    bigquery_source=PREDICTION_DATASET_BQ_PATH,
    instances_format="bigquery",
    bigquery_destination_prefix=f"bq://{vertex_dataset_path}",
    predictions_format="bigquery",
    sync=False,
)

print(batch_prediction_job)
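
Both the source and the destination are passed as `bq://` URIs. As a minimal sketch, a small helper can assemble such a URI and enforce the 2000-character limit mentioned in the parameter list. The name `to_bq_uri` and the identifiers in the example are hypothetical.

```python
def to_bq_uri(project: str, dataset: str, table: str = "") -> str:
    """Build a bq:// URI for a dataset, or for a table when one is given."""
    if not project or not dataset:
        raise ValueError("project and dataset are required")
    parts = [project, dataset] + ([table] if table else [])
    uri = "bq://" + ".".join(parts)
    # The parameter description above caps the source URI at 2000 characters.
    if len(uri) > 2000:
        raise ValueError("BigQuery URI exceeds 2000 characters")
    return uri


# Hypothetical identifiers, for illustration only.
print(to_bq_uri("my-project", "my_dataset", "my_table"))
# bq://my-project.my_dataset.my_table
```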

### Wait for completion of batch prediction job

Next, wait for the batch job to complete. Alternatively, you can set the parameter `sync` to `True` in the `batch_predict()` method to block until the batch prediction job is completed.

The execution of the prediction pipeline may take up to 30 minutes.


In [None]:
batch_prediction_job.wait()
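
If you would rather bound the wait than block indefinitely, a generic polling loop with a timeout is one option. The sketch below is independent of the Vertex AI SDK: `wait_until` is a hypothetical helper, and the `is_done` callable is assumed to wrap whatever completion check you choose, for example inspecting the job's state.

```python
import time


def wait_until(is_done, timeout_s: float, poll_interval_s: float = 30.0) -> bool:
    """Poll is_done() until it returns True or timeout_s elapses.

    Returns True if the condition was met, False on timeout.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        if is_done():
            return True
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            return False
        # Never sleep past the deadline.
        time.sleep(min(poll_interval_s, remaining))
```

A timeout of `30 * 60` seconds would match the 30-minute estimate above.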

### Get the predictions

The predictions table is written to the output dataset. Its name always starts with `predictions_`.

In [None]:
tables = client.list_tables(vertex_dataset_path)
pred_tables = [
    table.table_id for table in tables if table.table_id.startswith("predictions")
]
batch_predict_bq_output_uri = f"{vertex_dataset_path}.{max(pred_tables)}"
print(f"Found predictions table: {batch_predict_bq_output_uri}.\n")

query = f"SELECT * FROM `{batch_predict_bq_output_uri}`"
vertex_preds = client.query(query).to_dataframe()
vertex_preds.head()
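
The cell above picks the newest table with `max()`. That relies on an assumption worth stating: the table names end in a zero-padded timestamp, so lexicographic order matches chronological order. A standalone sketch of the same selection, with hypothetical table names and a clear error when nothing matches:

```python
def latest_predictions_table(table_ids, prefix="predictions_"):
    """Return the lexicographically greatest table ID with the given prefix."""
    candidates = [t for t in table_ids if t.startswith(prefix)]
    if not candidates:
        raise LookupError(f"no table starting with {prefix!r} found")
    return max(candidates)


# Hypothetical table names, for illustration only.
example_ids = [
    "errors_2022_01_01T00_00_00_000Z",
    "predictions_2022_01_01T00_00_00_000Z",
    "predictions_2022_03_15T12_30_00_000Z",
]
print(latest_predictions_table(example_ids))
# predictions_2022_03_15T12_30_00_000Z
```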

## Visualize the forecasts

Finally, use the generated link to visualize the forecasts in [Data Studio](https://support.google.com/datastudio/answer/6283323?hl=en).
The code block in this section dynamically generates a Data Studio link that specifies the template, the location of the forecasts, and the query used to build the chart. The data comes from the forecasts generated earlier.

You can inspect the template at https://datastudio.google.com/c/u/0/reporting/067f70d2-8cd6-4a4c-a099-292acd1053e8. Google created it specifically for viewing forecasting predictions.

In [None]:
print("Click the link below to view Vertex AI Forecasting predictions:")
print(
    get_data_studio_link(
        batch_prediction_bq_input_uri=actuals_table,
        batch_prediction_bq_output_uri=batch_predict_bq_output_uri,
        time_column=time_column,
        time_series_identifier_column=time_series_identifier_column,
        target_column=target_column,
    )
)

## Clean up Vertex AI and BigQuery resources

To clean up all Google Cloud resources used in this project, you can [delete the Google Cloud project](https://cloud.google.com/resource-manager/docs/creating-managing-projects#shutting_down_projects) you used for the tutorial.

Otherwise, you can delete the individual resources you created in this tutorial:

- Dataset
- AutoML Training Job
- Model
- Batch Prediction Job
- Cloud Storage Bucket
- BigQuery tables

In [None]:
# Delete dataset
dataset.delete()

# Delete training job
training_job.delete()

# Delete model
model.delete()

# Delete batch prediction job
batch_prediction_job.delete()

# Delete output datasets
for dataset_id in [arima_dataset_path, vertex_dataset_path]:
    client.delete_dataset(dataset_id, delete_contents=True, not_found_ok=True)

# Set to True to also delete the Cloud Storage bucket
delete_bucket = False
if delete_bucket or os.getenv("IS_TESTING"):
    ! gsutil -m rm -r $BUCKET_URI
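
In the cell above, a failed `delete()` call stops the cell before the remaining resources are removed. One possible alternative, sketched here with stand-in objects rather than real Vertex AI resources, is to attempt every deletion and collect the failures. `delete_all` and `_Stub` are hypothetical names; the only assumption about each resource is that it exposes a `delete()` method.

```python
def delete_all(resources: dict) -> dict:
    """Call .delete() on each resource, collecting failures instead of stopping."""
    failures = {}
    for name, resource in resources.items():
        try:
            resource.delete()
        except Exception as exc:  # keep going so later resources are still removed
            failures[name] = exc
    return failures


class _Stub:
    """Stand-in for a cloud resource; raises on delete when told to."""

    def __init__(self, fail: bool = False):
        self.fail = fail
        self.deleted = False

    def delete(self):
        if self.fail:
            raise RuntimeError("delete failed")
        self.deleted = True


stubs = {"dataset": _Stub(), "model": _Stub(fail=True), "job": _Stub()}
failed = delete_all(stubs)
print(sorted(failed))  # ['model']
```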