In [None]:
# Copyright 2022 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Anomaly detection with BigQuery ML and Vertex AI

<table align="left">

  <td>
    <a href="https://colab.research.google.com/github/GoogleCloudPlatform/vertex-ai-samples/tree/main/notebooks/community/pipelines/google_cloud_pipeline_components_bqml_pipeline_anomaly_detection.ipynb">
      <img src="https://cloud.google.com/ml-engine/images/colab-logo-32px.png" alt="Colab logo"> Run in Colab
    </a>
  </td>
  <td>
    <a href="https://github.com/GoogleCloudPlatform/vertex-ai-samples/blob/main/notebooks/community/pipelines/google_cloud_pipeline_components_bqml_pipeline_anomaly_detection.ipynb">
      <img src="https://cloud.google.com/ml-engine/images/github-logo-32px.png" alt="GitHub logo">
      View on GitHub
    </a>
  </td>
  <td>
    <a href="https://console.cloud.google.com/vertex-ai/workbench/deploy-notebook?download_url=https://raw.githubusercontent.com/GoogleCloudPlatform/vertex-ai-samples/main/notebooks/community/pipelines/google_cloud_pipeline_components_bqml_pipeline_anomaly_detection.ipynb">
      <img src="https://lh3.googleusercontent.com/UiNooY4LUgW_oTvpsNhPpQzsstV5W8F7rYgxgGBD85cWJoLmrOzhVs_ksK_vgx40SHs7jCqkTkCk=e14-rj-sc0xffffff-h130-w32" alt="Vertex AI logo">
      Open in Vertex AI Workbench
    </a>
  </td>
</table>

**_NOTE_**: This notebook has been tested in the following environment:

* Python version = 3.9

## Overview

Anomaly detection uses ML to identify rare observations that deviate significantly from the rest of the data. You can approach it in many ways, including supervised, unsupervised, and graph-based methods. It is particularly important for industries such as telecommunications, manufacturing, and financial services.

For instance, in a manufacturing scenario, you might collect sensor data to predict the number of cycles remaining before engine failure, known as time to failure (TTF). With that prediction, you can make informed decisions about maintenance planning.

### Objective

In the absence of labelled data, you may wonder how to best create an anomaly detector.

In this notebook, you learn how to use autoencoders to detect anomalies from turbo fan engine data, and from there build an anomaly detection pipeline.
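
To make the idea concrete, here is a minimal sketch of how reconstruction errors can be turned into anomaly flags. It assumes you already have one mean squared error (MSE) value per row from some reconstruction model. The function name `flag_anomalies` and the sample values are hypothetical, and the cutoff rule is a simplification for illustration, not necessarily the rule BigQuery ML applies.

```python
import math


def flag_anomalies(train_errors, new_errors, contamination=0.1):
    """Flag rows whose reconstruction error reaches a training-derived cutoff.

    The cutoff is chosen so that roughly `contamination` of the training
    errors lie at or above it (an illustrative rule, not BigQuery ML's own).
    """
    ranked = sorted(train_errors)
    # Index of the first training error that falls in the top fraction.
    cutoff_index = min(
        len(ranked) - 1, math.floor(len(ranked) * (1 - contamination))
    )
    cutoff = ranked[cutoff_index]
    return cutoff, [error >= cutoff for error in new_errors]


# Hypothetical per-row MSE values.
train_mse = [0.10, 0.12, 0.09, 0.11, 0.13, 0.10, 0.95, 0.12, 0.11, 0.10]
cutoff, flags = flag_anomalies(train_mse, [0.11, 1.20, 0.08], contamination=0.1)
print(cutoff, flags)
```

Rows that the model reconstructs poorly get a high error and are flagged, which is why no labels are needed to train the detector.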

This tutorial uses the following Google Cloud ML services and resources:

- `Vertex AI Pipelines`
- `BigQuery ML pipeline components`


The steps performed include:

- Define custom evaluation and metrics visualization components
- Define a pipeline:
  - Build training dataset in BigQuery
  - Train a BigQuery AutoEncoder model
  - Evaluate the BigQuery AutoEncoder model
  - Check the model performance
  - Build test dataset in BigQuery
  - Detect anomalies
  - Generate the MSE plot to evaluate predictions
- Compile the pipeline.
- Execute the pipeline.

### Dataset

The [`NASA Turbofan Jet Engine Data Set`](https://www.kaggle.com/datasets/behrad3d/nasa-cmaps) is a multivariate time series dataset in which each time series describes a different engine.

The dataset contains 26 columns, and each row consists of data taken during a single operational cycle.


### Costs

This tutorial uses billable components of Google Cloud:

* Vertex AI
* BigQuery
* Cloud Storage

Learn about [Vertex AI pricing](https://cloud.google.com/vertex-ai/pricing),
[BigQuery pricing](https://cloud.google.com/bigquery/pricing)
and [Cloud Storage pricing](https://cloud.google.com/storage/pricing),
and use the [Pricing Calculator](https://cloud.google.com/products/calculator/)
to generate a cost estimate based on your projected usage.

## Installation

Install the following packages required to execute this notebook.

In [None]:
# Install the packages
! pip3 install --user --upgrade jinja2 google-cloud-bigquery kfp google-cloud-aiplatform google_cloud_pipeline_components -q --no-warn-conflicts

### Colab only: Uncomment the following cell to restart the kernel.

In [None]:
# Automatically restart kernel after installs so that your environment can access the new packages
# import IPython

# app = IPython.Application.instance()
# app.kernel.do_shutdown(True)

## Before you begin

### Set up your Google Cloud project

**The following steps are required, regardless of your notebook environment.**

1. [Select or create a Google Cloud project](https://console.cloud.google.com/cloud-resource-manager). When you first create an account, you get a $300 free credit towards your compute/storage costs.

2. [Make sure that billing is enabled for your project](https://cloud.google.com/billing/docs/how-to/modify-project).

3. [Enable the Vertex AI API](https://console.cloud.google.com/flows/enableapi?apiid=aiplatform.googleapis.com).

4. If you are running this notebook locally, you need to install the [Cloud SDK](https://cloud.google.com/sdk).

#### Set your project ID

**If you don't know your project ID**, try the following:
* Run `gcloud config list`.
* Run `gcloud projects list`.
* See the support page: [Locate the project ID](https://support.google.com/googleapi/answer/7014113)

In [None]:
PROJECT_ID = "[your-project-id]"  # @param {type:"string"}

# Set the project id
! gcloud config set project {PROJECT_ID}

#### Region

You can also change the `REGION` variable used by Vertex AI. Learn more about [Vertex AI regions](https://cloud.google.com/vertex-ai/docs/general/locations).

In [None]:
REGION = "us-central1"  # @param {type: "string"}

### Authenticate your Google Cloud account

Depending on your Jupyter environment, you may have to manually authenticate. Follow the relevant instructions below.

**1. Vertex AI Workbench**
* Do nothing as you are already authenticated.

**2. Local JupyterLab instance, uncomment and run:**

In [None]:
# ! gcloud auth login

**3. Colab, uncomment and run:**

In [None]:
# from google.colab import auth
# auth.authenticate_user()

**4. Service account or other**
* See how to grant Cloud Storage permissions to your service account at https://cloud.google.com/storage/docs/gsutil/commands/iam#ch-examples.

### Create a Cloud Storage bucket

Create a storage bucket to store intermediate artifacts such as datasets.

In [None]:
BUCKET_URI = "gs://your-bucket-name-unique"  # @param {type:"string"}

**Only if your bucket doesn't already exist**: Run the following cell to create your Cloud Storage bucket.

In [None]:
! gsutil mb -l $REGION -p $PROJECT_ID $BUCKET_URI

### Set project template

You create a set of directories to organize your project locally.

In [None]:
import os

KFP_COMPONENTS_PATH = "components"
PIPELINES_PATH = "pipelines"
TRAIN_PIPELINES_PATH = os.path.join(PIPELINES_PATH, "train_pipelines")
TEST_PIPELINES_PATH = os.path.join(PIPELINES_PATH, "test_pipelines")

! mkdir -m 777 -p {KFP_COMPONENTS_PATH} {TRAIN_PIPELINES_PATH} {TEST_PIPELINES_PATH}

### Prepare the training data

Next, you reference the CSV training, test, and label files hosted in a public Cloud Storage bucket. You then load them into BigQuery tables.

In [None]:
PUBLIC_DATA_URI = (
    "gs://cloud-samples-data/vertex-ai/pipeline-deployment/datasets/turbofan_anomaly"
)
GCS_TRAIN_URI = f"{PUBLIC_DATA_URI}/train_FD001.csv"
GCS_TEST_URI = f"{PUBLIC_DATA_URI}/test_FD001.csv"
GCS_LABELS_URI = f"{PUBLIC_DATA_URI}/RUL_FD001.csv"

### Set the BigQuery datasets

You create the `iot_dataset` BigQuery dataset with the following tables for the tutorial:

- `sensors_train_raw_data_<timestamp>` contains training data collected from sensors
- `sensors_test_raw_data_<timestamp>` contains testing data collected from sensors
- `sensors_label_data_<timestamp>` contains the test labels used to validate results

In [None]:
from datetime import datetime

TIMESTAMP = datetime.now().strftime("%Y%m%d%H%M%S")

In [None]:
LOCATION = REGION.split("-")[0]
BQ_DATASET = "iot_dataset"
BQ_TRAIN_RAW_TABLE = f"sensors_train_raw_data_{TIMESTAMP}"
BQ_TEST_RAW_TABLE = f"sensors_test_raw_data_{TIMESTAMP}"
BQ_LABELS_TABLE = f"sensors_label_data_{TIMESTAMP}"

! bq mk --location={LOCATION} --dataset {PROJECT_ID}:{BQ_DATASET}

! bq load \
  --location={LOCATION} \
  --source_format=CSV \
  --skip_leading_rows=1 \
  {BQ_DATASET}.{BQ_TRAIN_RAW_TABLE} \
  {GCS_TRAIN_URI} \
  id:INT64,cycle:INT64,setting1:FLOAT64,setting2:FLOAT64,setting3:FLOAT64,sensor:STRING,value:FLOAT64

! bq load \
  --location={LOCATION} \
  --source_format=CSV \
  --skip_leading_rows=1 \
  {BQ_DATASET}.{BQ_TEST_RAW_TABLE} \
  {GCS_TEST_URI} \
  id:INT64,cycle:INT64,setting1:FLOAT64,setting2:FLOAT64,setting3:FLOAT64,sensor:STRING,value:FLOAT64

! bq load \
  --location={LOCATION} \
  --source_format=CSV \
  --skip_leading_rows=1 \
  {BQ_DATASET}.{BQ_LABELS_TABLE} \
  {GCS_LABELS_URI} \
  id:INT64,time_to_failure:FLOAT64

### Import libraries

Next, import libraries and set up some variables used throughout the tutorial.


In [None]:
from typing import NamedTuple
from urllib.parse import urlparse

import tensorflow as tf
from google.cloud import aiplatform as vertex_ai
from google.cloud import bigquery
from google_cloud_pipeline_components.v1.bigquery import (
    BigqueryCreateModelJobOp, BigqueryEvaluateModelJobOp, BigqueryQueryJobOp)
from jinja2 import Template
from kfp.v2 import compiler, dsl
from kfp.v2.dsl import (HTML, Artifact, Condition, Input, Metrics, Output,
                        component)

### Set up variables

In [None]:
# SQL templates
SENSORS = (
    "s1",
    "s2",
    "s3",
    "s4",
    "s5",
    "s6",
    "s7",
    "s8",
    "s9",
    "s10",
    "s11",
    "s12",
    "s13",
    "s14",
    "s15",
    "s16",
    "s17",
    "s18",
    "s19",
    "s20",
    "s21",
)
WINDOW = 5
PERIOD = 30
TARGET = "is_anomalous_ttf"
EXCLUDED_VARIABLES = "id, cycle, setting1, setting2, setting3"
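
`PERIOD` is used later to label each training row: a cycle counts as anomalous when fewer than `PERIOD` cycles remain before the engine's last recorded cycle. The sketch below restates that rule in plain Python. The `label_cycles` helper and the cycle numbers are hypothetical, and it assumes the last recorded cycle is the failure point.

```python
def label_cycles(cycles, period=30):
    """Return 1 for cycles within `period` of the last cycle, else 0."""
    last_cycle = max(cycles)
    return [1 if (last_cycle - cycle) < period else 0 for cycle in cycles]


# Hypothetical engine observed at these cycles; failure assumed at cycle 100.
labels = label_cycles([1, 50, 70, 71, 100], period=30)
print(labels)  # [0, 0, 0, 1, 1]
```

Cycle 70 is exactly 30 cycles from the end, so it stays at 0 because the comparison is strict.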

### Helper functions

The `print_pipeline_output` helper function lets you validate a pipeline run by locating and printing the output of an executed task.

In [None]:
def print_pipeline_output(pipeline_root, job, output_task_name):
    """Print and return the first output file found for the named task.

    Relies on the global PROJECT_NUMBER, which is set after the pipeline run.
    """
    job_id = job.name
    print(job_id)
    for task in job.gca_resource.job_detail.task_details:
        # Each task writes to <root>/<project number>/<job id>/<task name>_<task id>
        task_dir = (
            f"{pipeline_root}/{PROJECT_NUMBER}/{job_id}/"
            f"{output_task_name}_{task.task_id}"
        )
        # Check the candidate output files in order of preference
        for filename in ("executor_output.json", "gcp_resources", "evaluation_metrics"):
            output_uri = f"{task_dir}/{filename}"
            if tf.io.gfile.exists(output_uri):
                ! gsutil cat $output_uri
                return output_uri

    return None
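
The helper above probes a fixed path layout under the pipeline root for each task. As a rough illustration of that layout, the following standalone sketch builds the three candidate URIs for one task. The function name `candidate_output_uris` and every identifier passed to it are hypothetical placeholders.

```python
def candidate_output_uris(pipeline_root, project_number, job_id, task_name, task_id):
    """Build the three URIs that the helper checks for one task."""
    task_dir = f"{pipeline_root}/{project_number}/{job_id}/{task_name}_{task_id}"
    filenames = ("executor_output.json", "gcp_resources", "evaluation_metrics")
    return [f"{task_dir}/{name}" for name in filenames]


# All identifiers below are hypothetical placeholders.
uris = candidate_output_uris(
    "gs://example-bucket/pipelines/train_pipelines",
    "123456789",
    "example-job-20220101",
    "bigquery-create-model-job",
    42,
)
for uri in uris:
    print(uri)
```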

### Initialize Vertex AI SDK for Python

Initialize the Vertex AI SDK for Python for your project and corresponding bucket.

In [None]:
vertex_ai.init(project=PROJECT_ID, location=REGION, staging_bucket=BUCKET_URI)

### Initialize BigQuery SDK for Python

Initialize the BigQuery SDK for Python for your project.

In [None]:
# Use the BigQuery location of the dataset (for example "us"), not the Vertex AI region
bq_client = bigquery.Client(project=PROJECT_ID, location=LOCATION)

## BigQuery ML pipeline formalization

In the next cells, you build the components and pipeline to train and evaluate the anomaly detection model.

### Set variables for running the pipeline

Below you initialize the variables that are specific to the pipeline run in this tutorial. For instance, you define the pipeline configuration, including the training table name, the model table name, and the performance threshold.

In [None]:
# BQML pipeline job configuration
TRAIN_PIPELINE_NAME = "bqml-anomaly-detection-train-pipeline"
TRAIN_PIPELINE_ROOT = (
    urlparse(BUCKET_URI)._replace(path="pipelines/train_pipelines").geturl()
)
TRAIN_PIPELINE_PACKAGE = os.path.join(
    TRAIN_PIPELINES_PATH, f"{TRAIN_PIPELINE_NAME}.json"
)

# BQML pipeline component configuration
BQ_TRAIN_FEATURES_TABLE_PREFIX = "train_features"
BQ_TEST_FEATURES_TABLE_PREFIX = "test_features"
BQ_TRAIN_TABLE_PREFIX = "train_dataset"
BQ_TEST_TABLE_PREFIX = "test_dataset"
BQ_RECOSTRUCTION_MODEL_TABLE_PREFIX = "reconstruction_model"
DETECT_ANOMALIES_TABLE_PREFIX = "detect_anomalies"
BQ_TRAIN_FEATURES_TABLE = f"{BQ_TRAIN_FEATURES_TABLE_PREFIX}_{TIMESTAMP}"
BQ_TEST_FEATURES_TABLE = f"{BQ_TEST_FEATURES_TABLE_PREFIX}_{TIMESTAMP}"
BQ_TRAIN_TABLE = f"{BQ_TRAIN_TABLE_PREFIX}_{TIMESTAMP}"
BQ_TEST_TABLE = f"{BQ_TEST_TABLE_PREFIX}_{TIMESTAMP}"
BQ_RECOSTRUCTION_MODEL_TABLE = f"{BQ_RECOSTRUCTION_MODEL_TABLE_PREFIX}_{TIMESTAMP}"
DETECT_ANOMALIES_TABLE = f"{DETECT_ANOMALIES_TABLE_PREFIX}_{TIMESTAMP}"
CONTAMINATION_THRESHOLD = 0.1
PERF_THRESHOLD = 10
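
`TRAIN_PIPELINE_ROOT` is built by swapping the path component of the bucket URI. The standalone sketch below shows that pattern with `urllib.parse.urlparse`. The helper name `pipeline_root_for` and the bucket name are placeholders, not part of the tutorial.

```python
from urllib.parse import urlparse


def pipeline_root_for(bucket_uri, subpath):
    """Return `bucket_uri` with its path replaced by `subpath`."""
    return urlparse(bucket_uri)._replace(path=subpath).geturl()


# "gs://example-bucket" is a placeholder bucket name.
root = pipeline_root_for("gs://example-bucket", "pipelines/train_pipelines")
print(root)  # gs://example-bucket/pipelines/train_pipelines
```

Note that any path already present in the bucket URI is replaced rather than extended, so this works best when `BUCKET_URI` names only the bucket.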

### Set SQL queries using templates

One way to run BigQuery and BigQuery ML pipelines on Vertex AI is to define SQL queries as Jinja templates and pass the rendered queries as parameters to the pipeline components.

In this tutorial, you define the following templates:

  - `CREATE_FEATURES_SQL_TEMPLATE` to run feature engineering
  - `CREATE_TRAIN_SQL_TEMPLATE` to create the training dataset
  - `TRAIN_RECONSTRUCTION_MODEL_TEMPLATE` to build a reconstruction model using BigQuery ML AutoEncoder model
  - `CREATE_TEST_SQL_TEMPLATE` to create the testing dataset
  - `DETECT_ANOMALIES_TEMPLATE` to detect anomalies
  - `VISUALIZE_MSE_TEMPLATE` to visualize MSE plots

#### Define SQL query templates

In [None]:
# Training ---------------------------------------------------------------------
CREATE_FEATURES_SQL_TEMPLATE = """
CREATE OR REPLACE TABLE
  `{{project_id}}.{{bq_dataset}}.{{features_table}}` AS
WITH
  get_long_from_wide_table AS (
    SELECT *
    FROM `{{project_id}}.{{bq_dataset}}.{{data_table}}`
    PIVOT(MAX(value) FOR sensor IN {{sensors}})
  ),

  get_features_table AS (
    SELECT
    *,
    {%- for sensor in sensors %}
    -- calculate rolling average sensor value
    AVG({{sensor}}) OVER(PARTITION BY id ORDER BY cycle RANGE BETWEEN {{window}} PRECEDING AND CURRENT ROW) AS {{"rolling_avg_" ~ sensor}},
    -- calculate rolling stdev sensor value
    IFNULL(STDDEV({{sensor}}) OVER(PARTITION BY id ORDER BY cycle RANGE BETWEEN {{window}} PRECEDING AND CURRENT ROW), 0) AS {{"rolling_sd_" ~ sensor}}
    {%- if not loop.last -%}
        ,
    {%- endif -%}
    {%- endfor %}
    FROM get_long_from_wide_table
  )

  SELECT * FROM get_features_table ORDER BY id, cycle
"""

CREATE_TRAIN_SQL_TEMPLATE = """
DECLARE period INT64 DEFAULT {{period}};

CREATE OR REPLACE TABLE
  `{{project_id}}.{{bq_dataset}}.{{train_table}}` AS
WITH
  get_last_cycle AS (
    SELECT id, max(cycle) as last_cycle
    FROM `{{project_id}}.{{bq_dataset}}.{{features_table}}`
    GROUP BY id
  ),

  get_target_train AS (
    SELECT
    a.*,
    CASE WHEN (b.last_cycle - a.cycle) < period THEN 1 ELSE 0 END AS {{target}},
    FROM `{{project_id}}.{{bq_dataset}}.{{features_table}}` as a
    LEFT JOIN get_last_cycle as b on a.id = b.id
  )

  SELECT * EXCEPT({{excluded_variables}}) FROM get_target_train
"""

TRAIN_RECONSTRUCTION_MODEL_TEMPLATE = """
CREATE OR REPLACE MODEL `{{project_id}}.{{bq_dataset}}.{{recostruction_model_name}}`
OPTIONS(MODEL_TYPE='AUTOENCODER',
        ACTIVATION_FN='RELU',
        HIDDEN_UNITS=[32, 16, 4, 16, 32],
        BATCH_SIZE=8,
        DROPOUT=0.2,
        EARLY_STOP=TRUE,
        LEARN_RATE=0.001,
        L1_REG_ACTIVATION=0.0001,
        OPTIMIZER='ADAM',
        MODEL_REGISTRY = 'vertex_ai',
        VERTEX_AI_MODEL_ID = 'reconstruction_model',
        VERTEX_AI_MODEL_VERSION_ALIASES = ['staging']
        )
AS SELECT * FROM `{{project_id}}.{{bq_dataset}}.{{train_table}}`
"""

# Test -------------------------------------------------------------------------
CREATE_TEST_SQL_TEMPLATE = """
DECLARE period INT64 DEFAULT {{period}};

CREATE OR REPLACE TABLE
 `{{project_id}}.{{bq_dataset}}.{{test_table}}` AS
WITH
 get_last_cycle AS (
   SELECT id, max(cycle) as last_cycle
   FROM `{{project_id}}.{{bq_dataset}}.{{features_table}}`
   GROUP BY id
 ),

 get_target_test AS (
   SELECT
   a.*
   FROM `{{project_id}}.{{bq_dataset}}.{{features_table}}` as a
   LEFT JOIN get_last_cycle as b ON a.id = b.id
   WHERE a.cycle = b.last_cycle
 )

 SELECT
 a.*,
 CASE WHEN b.time_to_failure < period THEN 1 ELSE 0 END AS {{target}}
 FROM get_target_test as a
 LEFT JOIN `{{project_id}}.{{bq_dataset}}.{{labels_table}}` as b ON a.id = b.id
"""

DETECT_ANOMALIES_TEMPLATE = """
CREATE OR REPLACE TABLE
  `{{project_id}}.{{bq_dataset}}.{{anomalies_table}}` AS
SELECT
  is_anomaly, mean_squared_error, {{target}}
FROM
  ML.DETECT_ANOMALIES(MODEL `{{project_id}}.{{bq_dataset}}.{{recostruction_model_name}}`,
                      STRUCT({{contamination_thr}} AS contamination),
                      TABLE `{{project_id}}.{{bq_dataset}}.{{test_table}}`)
"""

VISUALIZE_MSE_TEMPLATE = """
SELECT
  *
FROM
  `{{project_id}}.{{bq_dataset}}.{{anomalies_table}}`
"""

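To make the window functions in `CREATE_FEATURES_SQL_TEMPLATE` concrete, here is a small plain-Python sketch of the same rolling statistics for a single engine and sensor. The `rolling_features` helper and the readings are hypothetical. It uses the sample standard deviation, which is what BigQuery's `STDDEV` computes, and falls back to 0 for a single value in the same spirit as the `IFNULL(..., 0)` in the template.

```python
from statistics import mean, stdev


def rolling_features(cycles, values, window=5):
    """Rolling mean and sample standard deviation per cycle.

    Mirrors `RANGE BETWEEN window PRECEDING AND CURRENT ROW`: each row sees
    the rows whose cycle number lies in [cycle - window, cycle].
    """
    features = []
    for cycle in cycles:
        in_range = [v for c, v in zip(cycles, values) if cycle - window <= c <= cycle]
        # A single value has no sample standard deviation; use 0 instead.
        spread = stdev(in_range) if len(in_range) > 1 else 0.0
        features.append((cycle, mean(in_range), spread))
    return features


# Hypothetical readings for one engine and one sensor.
rows = rolling_features(cycles=[1, 2, 3, 4], values=[10.0, 12.0, 14.0, 16.0], window=2)
for row in rows:
    print(row)
```

Because the frame is a `RANGE` over `cycle` values rather than a row count, a window of 5 covers up to six rows when cycles are consecutive.
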
#### Compile SQL query templates

After defining the SQL query templates, you render them by passing the training and testing parameters.

In [None]:
# Training parameters specification
TRAIN_SQL_PARAMS = dict(
    project_id=PROJECT_ID,
    bq_dataset=BQ_DATASET,
    sensors=SENSORS,
    period=PERIOD,
    window=WINDOW,
    target=TARGET,
    excluded_variables=EXCLUDED_VARIABLES,
    contamination_threshold=CONTAMINATION_THRESHOLD,
    data_table=BQ_TRAIN_RAW_TABLE,
    features_table=BQ_TRAIN_FEATURES_TABLE,
    train_table=BQ_TRAIN_TABLE,
    recostruction_model_name=BQ_RECOSTRUCTION_MODEL_TABLE,
    anomalies_table=DETECT_ANOMALIES_TABLE,
    contamination_thr=CONTAMINATION_THRESHOLD,
)

CREATE_TRAIN_FEATURES_QUERY = Template(CREATE_FEATURES_SQL_TEMPLATE).render(
    TRAIN_SQL_PARAMS
)
CREATE_TRAIN_TABLE_QUERY = Template(CREATE_TRAIN_SQL_TEMPLATE).render(TRAIN_SQL_PARAMS)
TRAIN_RECOSTRUCTION_MODEL_QUERY = Template(TRAIN_RECONSTRUCTION_MODEL_TEMPLATE).render(
    TRAIN_SQL_PARAMS
)

# Testing parameters specification
TEST_SQL_PARAMS = dict(
    project_id=PROJECT_ID,
    bq_dataset=BQ_DATASET,
    sensors=SENSORS,
    period=PERIOD,
    window=WINDOW,
    data_table=BQ_TEST_RAW_TABLE,
    labels_table=BQ_LABELS_TABLE,
    target=TARGET,
    features_table=BQ_TEST_FEATURES_TABLE,
    test_table=BQ_TEST_TABLE,
    recostruction_model_name=BQ_RECOSTRUCTION_MODEL_TABLE,
    anomalies_table=DETECT_ANOMALIES_TABLE,
    contamination_thr=CONTAMINATION_THRESHOLD,
)

CREATE_TEST_FEATURES_QUERY = Template(CREATE_FEATURES_SQL_TEMPLATE).render(
    TEST_SQL_PARAMS
)
CREATE_TEST_TABLE_QUERY = Template(CREATE_TEST_SQL_TEMPLATE).render(TEST_SQL_PARAMS)
DETECT_ANOMALIES_QUERY = Template(DETECT_ANOMALIES_TEMPLATE).render(TEST_SQL_PARAMS)
VISUALIZE_MSE_QUERY = Template(VISUALIZE_MSE_TEMPLATE).render(TRAIN_SQL_PARAMS)
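
In `CREATE_FEATURES_SQL_TEMPLATE`, `{{sensors}}` receives the Python tuple `SENSORS` directly. The sketch below shows why that yields a usable `IN` list, under the assumption that Jinja (with its default, non-autoescaping settings) writes out the string form of the value. It uses only `str()` as a stand-in and does not call jinja2. The shortened tuple is for illustration.

```python
SENSORS = ("s1", "s2", "s3")

# Stand-in for what `{{sensors}}` expands to; jinja2 itself is not used here.
rendered_in_list = str(SENSORS)
pivot_clause = f"PIVOT(MAX(value) FOR sensor IN {rendered_in_list})"
print(pivot_clause)  # PIVOT(MAX(value) FOR sensor IN ('s1', 's2', 's3'))
```

Printing a rendered query such as `CREATE_TRAIN_FEATURES_QUERY` before submitting the pipeline is a cheap way to confirm the SQL looks as intended.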

### Create a custom component to read model evaluation metrics

Build a custom component to consume model evaluation metrics for visualizations in the Vertex AI Pipelines UI using Kubeflow SDK visualization APIs.

In [None]:
@component(
    base_image="python:3.8-slim",
    output_component_file=f"{KFP_COMPONENTS_PATH}/build_bq_evaluate_metrics.yaml",
)
def get_model_evaluation_metrics(
    metrics_in: Input[Artifact],
    metrics_out: Output[Metrics],
    model_out: Output[Artifact],
) -> NamedTuple("Outputs", [("mean_squared_error", float)]):
    """
    Get the mean squared error from the evaluation metrics
    Args:
        metrics_in: metrics artifact
        metrics_out: resulting metrics artifact
        model_out: resulting model artifact
    Returns:
        mean_squared_error: mean squared error of the model
    """

    # Extract rows and schema from metrics artifact
    rows = metrics_in.metadata["rows"]
    schema = metrics_in.metadata["schema"]

    # Convert into a dictionary format
    columns = [metrics["name"] for metrics in schema["fields"] if "name" in metrics]
    records = [dl["v"] for dl in rows[0]["f"]]
    metrics = {key: round(float(value), 3) for key, value in zip(columns, records)}

    # Log metrics
    for key in metrics.keys():
        metrics_out.log_metric(key, metrics[key])

    # Return the target metrics
    mean_squared_error = metrics["mean_squared_error"]
    component_outputs = NamedTuple("Outputs", [("mean_squared_error", float)])

    # model metadata
    model_framework = "BQML"
    model_type = "AutoEncoder"
    model_user = "Author"
    model_function = "Reconstruction model"
    model_out.metadata["framework"] = model_framework
    model_out.metadata["type"] = model_type
    model_out.metadata["model function"] = model_function
    model_out.metadata["modified by"] = model_user

    return component_outputs(mean_squared_error)
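
The component reads the evaluation result from the artifact metadata, where the schema lists the column names and the first row holds the values. The following sketch repeats that parsing on a hand-written payload. The `parse_metrics` helper and all metric values are hypothetical and only mimic the `rows`/`schema` shape used above.

```python
def parse_metrics(rows, schema):
    """Pair column names from the schema with the values of the first row."""
    columns = [field["name"] for field in schema["fields"] if "name" in field]
    values = [cell["v"] for cell in rows[0]["f"]]
    return {key: round(float(value), 3) for key, value in zip(columns, values)}


# Hypothetical payload shaped like the artifact metadata.
sample_schema = {
    "fields": [
        {"name": "mean_absolute_error", "type": "FLOAT"},
        {"name": "mean_squared_error", "type": "FLOAT"},
        {"name": "mean_squared_log_error", "type": "FLOAT"},
    ]
}
sample_rows = [{"f": [{"v": "0.41234"}, {"v": "0.27891"}, {"v": "0.01234"}]}]

metrics = parse_metrics(sample_rows, sample_schema)
print(metrics["mean_squared_error"])  # 0.279
```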

### Create a custom component to visualize MSE per label

Build a custom component to visualize MSE per label in the Vertex AI Pipelines UI using Kubeflow SDK visualization APIs.

In [None]:
@component(
    base_image="python:3.8-slim",
    packages_to_install=["pandas", "google-cloud-bigquery[bqstorage,pandas]", "plotly"],
    output_component_file=f"{KFP_COMPONENTS_PATH}/build_evaluation_plot.yaml",
)
def get_mse_plots(
    query: str,
    project: str,
    location: str,
    metrics_out: Output[HTML],
    model_out: Output[Artifact],
):
    """
    Plot the distribution of the mean squared error per label
    Args:
        query: the query to generate the metrics
        project: the project id to initiate the BQ client
        location: the location to initiate the BQ client
        metrics_out: resulting HTML report artifact
        model_out: resulting model artifact
    """

    import plotly.graph_objects as go
    from google.cloud import bigquery
    from plotly.subplots import make_subplots

    # Initiate client
    client = bigquery.Client(project=project, location=location)

    # Run the query in the given project and load the result into a DataFrame
    table_df = client.query(query).to_dataframe()

    # Create anomalies/no anomalies datasets
    anomalies_df = table_df.query("is_anomalous_ttf == 1")
    no_anomalies_df = table_df.query("is_anomalous_ttf == 0")

    # Create a figure with subplots
    fig = make_subplots(
        rows=2,
        cols=2,
        specs=[[{"colspan": 2}, None], [{}, {}]],
        subplot_titles=(
            "Distribution of mean squared error (MSE) for anomalous and non-anomalous sensor data",
            "Distribution of mean squared error (MSE) for anomalous sensor data",
            "Distribution of mean squared error (MSE) for non-anomalous sensor data",
        ),
        x_title="Mean squared error (MSE)",
        y_title="Density",
    )

    # Add subplots to figure
    fig.add_trace(
        go.Histogram(
            x=anomalies_df["mean_squared_error"],
            name="Anomaly",
            marker_color="blue",
            showlegend=True,
        ),
        row=1,
        col=1,
    )
    fig.add_trace(
        go.Histogram(
            x=no_anomalies_df["mean_squared_error"],
            name="No Anomaly",
            marker_color="orange",
            showlegend=True,
        ),
        row=1,
        col=1,
    )
    fig.add_trace(
        go.Histogram(
            x=anomalies_df["mean_squared_error"],
            name="MSE_1",
            marker_color="red",
            showlegend=False,
        ),
        row=2,
        col=1,
    )
    fig.add_trace(
        go.Histogram(
            x=no_anomalies_df["mean_squared_error"],
            name="MSE_2",
            marker_color="green",
            showlegend=False,
        ),
        row=2,
        col=2,
    )

    # Update figure properties
    fig.update_layout(
        title="Anomaly detection report",
        title_x=0.5,
        bargap=0.2,
        bargroupgap=0.1,
        showlegend=True,
    )

    # Save output to static HTML file
    fig.write_html(metrics_out.path)
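
As a numeric complement to the histograms, you could also compare the average MSE of the two label groups. The sketch below does this in plain Python on hand-written rows. The `mse_by_label` helper and the sample values are hypothetical, and the rows only imitate the columns of the anomalies table.

```python
from statistics import mean


def mse_by_label(rows, label_key="is_anomalous_ttf"):
    """Average `mean_squared_error` separately for each label value."""
    groups = {}
    for row in rows:
        groups.setdefault(row[label_key], []).append(row["mean_squared_error"])
    return {label: mean(errors) for label, errors in groups.items()}


# Hypothetical rows shaped like the anomalies table.
sample_rows = [
    {"is_anomalous_ttf": 0, "mean_squared_error": 0.25},
    {"is_anomalous_ttf": 0, "mean_squared_error": 0.75},
    {"is_anomalous_ttf": 1, "mean_squared_error": 1.5},
]
summary = mse_by_label(sample_rows)
print(summary)  # {0: 0.5, 1: 1.5}
```

A clearly higher average for label 1 suggests that the reconstruction error separates engines close to failure from healthy ones.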

### Build the BQML training pipeline

Define your workflow using the Kubeflow Pipelines DSL package.

The pipeline workflow has the following steps:

1. Build training dataset in BigQuery
2. Train a BigQuery AutoEncoder model
3. Evaluate the BigQuery AutoEncoder model
4. Check the model performance
5. Build test dataset in BigQuery
6. Detect anomalies
7. Generate the MSE plot to evaluate predictions


In [None]:
@dsl.pipeline(
    name=TRAIN_PIPELINE_NAME,
    description="A batch pipeline to train reconstruction model using BQML",
)
def pipeline(
    create_train_features_query: str,
    create_train_table_query: str,
    train_recostruction_model_query: str,
    create_test_features_query: str,
    create_test_table_query: str,
    generate_anomalies_query: str,
    performance_thr: float,
    visualize_mse_query: str,
    project: str,
    location: str,
):

    # Create training features
    create_train_features_op = BigqueryQueryJobOp(
        query=create_train_features_query,
        project=project,
        location=location,
    ).set_display_name("build train features")

    # Create train dataset
    create_train_dataset_op = (
        BigqueryQueryJobOp(
            query=create_train_table_query, project=project, location=location
        )
        .set_display_name("build train table")
        .after(create_train_features_op)
    )

    # Train the reconstruction model
    bq_recostruction_model_op = (
        BigqueryCreateModelJobOp(
            query=train_recostruction_model_query,
            project=project,
            location=location,
        )
        .set_display_name("train reconstruction model")
        .after(create_train_dataset_op)
    )

    # Evaluate reconstruction model
    bq_evaluate_model_op = (
        BigqueryEvaluateModelJobOp(
            model=bq_recostruction_model_op.outputs["model"],
            project=project,
            location=location,
        )
        .set_display_name("evaluate reconstruction model")
        .after(bq_recostruction_model_op)
    )

    # Plot model metrics
    get_evaluation_model_metrics_op = (
        get_model_evaluation_metrics(
            metrics_in=bq_evaluate_model_op.outputs["evaluation_metrics"]
        )
        .after(bq_evaluate_model_op)
        .set_display_name("generate evaluation metrics")
    )

    # Check the model performance: continue only if the autoencoder MSE is below the threshold
    with Condition(
        get_evaluation_model_metrics_op.outputs["mean_squared_error"] < performance_thr,
        name="MSE good",
    ):

        # Create test features dataset
        create_test_features_op = BigqueryQueryJobOp(
            query=create_test_features_query,
            project=project,
            location=location,
        ).set_display_name("build test features")

        # Create test dataset
        create_test_dataset_op = (
            BigqueryQueryJobOp(
                query=create_test_table_query, project=project, location=location
            )
            .set_display_name("build test table")
            .after(create_test_features_op)
        )

        # Generate anomalies
        generate_anomalies_op = (
            BigqueryQueryJobOp(
                query=generate_anomalies_query,
                project=project,
                location=location,
            )
            .after(create_test_dataset_op)
            .set_display_name("generate anomalies")
        )

        # Plot mse graph of anomalies
        _ = (
            get_mse_plots(query=visualize_mse_query, project=project, location=location)
            .after(generate_anomalies_op)
            .set_display_name("plot mse report")
        )

### Compile the pipeline into a JSON file

Next, you compile the pipeline, which produces a JSON specification for your pipeline.

In [None]:
compiler.Compiler().compile(pipeline_func=pipeline, package_path=TRAIN_PIPELINE_PACKAGE)

### Execute your pipeline

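Next, you execute the pipeline. It takes the following parameters, which you set in `TRAIN_PIPELINE_RUN_PARAMS`:

- `create_train_features_query`, `create_train_table_query`: queries that build the training features and the training table
- `train_recostruction_model_query`: query that trains the AutoEncoder model
- `create_test_features_query`, `create_test_table_query`: queries that build the test features and the test table
- `generate_anomalies_query`: query that detects anomalies on the test table
- `performance_thr`: MSE threshold the model must stay below for the test steps to run
- `visualize_mse_query`: query that reads the anomalies table for the MSE plot
- `project`, `location`: Google Cloud project ID and BigQuery location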


#### Submit pipeline job

In [None]:
TRAIN_PIPELINE_RUN_PARAMS = dict(
    create_train_features_query=CREATE_TRAIN_FEATURES_QUERY,
    create_train_table_query=CREATE_TRAIN_TABLE_QUERY,
    train_recostruction_model_query=TRAIN_RECOSTRUCTION_MODEL_QUERY,
    create_test_features_query=CREATE_TEST_FEATURES_QUERY,
    create_test_table_query=CREATE_TEST_TABLE_QUERY,
    generate_anomalies_query=DETECT_ANOMALIES_QUERY,
    performance_thr=PERF_THRESHOLD,
    visualize_mse_query=VISUALIZE_MSE_QUERY,
    project=PROJECT_ID,
    location=LOCATION,
)

bqml_train_pipeline = vertex_ai.PipelineJob(
    display_name=f"{TRAIN_PIPELINE_NAME}-job",
    template_path=TRAIN_PIPELINE_PACKAGE,
    parameter_values=TRAIN_PIPELINE_RUN_PARAMS,
    pipeline_root=TRAIN_PIPELINE_ROOT,
    enable_caching=True,
)

bqml_train_pipeline.run()

#### View BigQuery ML training pipeline results

Finally, view the artifact outputs of the model training and model evaluation tasks in the pipeline.

In [None]:
PROJECT_NUMBER = bqml_train_pipeline.gca_resource.name.split("/")[1]
print("PROJECT NUMBER: ", PROJECT_NUMBER)
print("\n\n")
print("bigquery-create-model-job")
artifacts = print_pipeline_output(
    TRAIN_PIPELINE_ROOT, bqml_train_pipeline, "bigquery-create-model-job"
)
print("\n\n")
print("bigquery-evaluate-model-job")
artifacts = print_pipeline_output(
    TRAIN_PIPELINE_ROOT, bqml_train_pipeline, "bigquery-evaluate-model-job"
)
print("\n\n")
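
The cell above reads the project number from the second segment of the pipeline job's resource name. A minimal sketch of that extraction, with a hypothetical `project_number_from` helper and a made-up resource name, assuming the name follows the `projects/<number>/locations/...` layout:

```python
def project_number_from(resource_name):
    """Extract the project number from a `projects/<number>/...` resource name."""
    parts = resource_name.split("/")
    if len(parts) < 2 or parts[0] != "projects":
        raise ValueError(f"Unexpected resource name: {resource_name}")
    return parts[1]


# Hypothetical resource name following the `projects/.../locations/...` layout.
name = "projects/123456789/locations/us-central1/pipelineJobs/example-job"
print(project_number_from(name))  # 123456789
```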

## Conclusion

In this notebook, you built an ML pipeline to train an autoencoder for detecting anomalies using Vertex AI Pipelines and BigQuery ML.

Now you know how to use the prebuilt `google_cloud_pipeline_components` to train a BigQuery ML model and how to build custom components to evaluate and visualize performance metrics.

## Cleaning up

To clean up all Google Cloud resources used in this project, you can [delete the Google Cloud
project](https://cloud.google.com/resource-manager/docs/creating-managing-projects#shutting_down_projects) you used for the tutorial.

Otherwise, you can delete the individual resources you created in this tutorial.

In [None]:
# delete pipeline
delete_pipeline = False
if delete_pipeline:
    vertex_ai_pipeline_jobs = vertex_ai.PipelineJob.list(
        filter=f'pipeline_name="{TRAIN_PIPELINE_NAME}"'
    )
    for pipeline_job in vertex_ai_pipeline_jobs:
        pipeline_job.delete()

# delete model
delete_model = False
if delete_model:
    DELETE_MODEL_SQL = f"DROP MODEL {BQ_DATASET}.{BQ_RECOSTRUCTION_MODEL_TABLE}"
    try:
        delete_model_query_job = bq_client.query(DELETE_MODEL_SQL)
        delete_model_query_result = delete_model_query_job.result()
    except Exception as e:
        print(e)

# delete bucket
delete_bucket = False
if os.getenv("IS_TESTING") or delete_bucket:
    ! gsutil -m rm -r $BUCKET_URI

# Remove local resources
delete_local_resources = False
if delete_local_resources:
    ! rm -rf {KFP_COMPONENTS_PATH}
    ! rm -rf {TRAIN_PIPELINES_PATH}
    ! rm -rf {TEST_PIPELINES_PATH}