In [None]:
# Copyright 2022 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Custom model batch prediction with feature filtering 

<table align="left">
  <td>
    <a href="https://colab.research.google.com/github/GoogleCloudPlatform/vertex-ai-samples/blob/main/notebooks/official/prediction/custom_batch_prediction_feature_filter.ipynb">
      <img src="https://cloud.google.com/ml-engine/images/colab-logo-32px.png" alt="Colab logo"> Run in Colab
    </a>
  </td>
  <td>
    <a href="https://github.com/GoogleCloudPlatform/vertex-ai-samples/blob/main/notebooks/official/prediction/custom_batch_prediction_feature_filter.ipynb">
      <img src="https://cloud.google.com/ml-engine/images/github-logo-32px.png" alt="GitHub logo">
      View on GitHub
    </a>
  </td>
  <td>
    <a href="https://console.cloud.google.com/vertex-ai/workbench/deploy-notebook?download_url=https://raw.githubusercontent.com/GoogleCloudPlatform/vertex-ai-samples/main/notebooks/official/prediction/custom_batch_prediction_feature_filter.ipynb">
      <img src="https://lh3.googleusercontent.com/UiNooY4LUgW_oTvpsNhPpQzsstV5W8F7rYgxgGBD85cWJoLmrOzhVs_ksK_vgx40SHs7jCqkTkCk=e14-rj-sc0xffffff-h130-w32" alt="Vertex AI logo">
      Open in Vertex AI Workbench
    </a>
  </td>  
</table>

**_NOTE_**: This notebook has been tested in the following environment:

* Python version = 3.9

## Overview


This tutorial demonstrates how to use the Vertex AI SDK for Python to train a custom tabular classification model and perform batch prediction with feature filtering. This means that you can run batch prediction on a list of selected features or exclude a list of features from prediction.

Learn more about [Vertex AI Batch Prediction](https://cloud.google.com/vertex-ai/docs/tabular-data/classification-regression/get-batch-predictions).

### Objective

In this notebook, you learn how to create a custom-trained model from a Python script in a Docker container using the Vertex AI SDK for Python, and then run a batch prediction job by including or excluding a list of features. 

This tutorial uses the following Google Cloud ML services and resources:

- BigQuery
- Cloud Storage
- Vertex AI managed Datasets
- Vertex AI Training
- Vertex AI BatchPrediction

The steps performed include:

- Create a Vertex AI custom `TrainingPipeline` for training a model.
- Train a TensorFlow model.
- Send a batch prediction job with feature filtering.

### Dataset

The dataset used for this tutorial is the penguins dataset from [BigQuery public datasets](https://cloud.google.com/bigquery/public-data). This dataset includes fields such as `culmen_length_mm`, `culmen_depth_mm`, `flipper_length_mm`, and `body_mass_g`, which are used to predict the penguin species (`species`).

### Costs

This tutorial uses billable components of Google Cloud:

* Vertex AI
* Cloud Storage
* BigQuery

Learn about [Vertex AI
pricing](https://cloud.google.com/vertex-ai/pricing), [Cloud Storage
pricing](https://cloud.google.com/storage/pricing), [BigQuery pricing](https://cloud.google.com/bigquery/pricing) and use the [Pricing
Calculator](https://cloud.google.com/products/calculator/)
to generate a cost estimate based on your projected usage.

## Installation

Install the following packages required to execute this notebook. 

In [None]:
# Install the packages
! pip3 install --upgrade google-cloud-aiplatform \
                                    google-cloud-storage \
                                    google-cloud-bigquery \
                                    pyarrow -q

### Colab only: Uncomment the following cell to restart the kernel.


In [None]:
# Automatically restart kernel after installs so that your environment can access the new packages
# import IPython

# app = IPython.Application.instance()
# app.kernel.do_shutdown(True)

## Before you begin

### Set up your Google Cloud project

**The following steps are required, regardless of your notebook environment.**

1. [Select or create a Google Cloud project](https://console.cloud.google.com/cloud-resource-manager). When you first create an account, you get a $300 free credit towards your compute/storage costs.

2. [Make sure that billing is enabled for your project](https://cloud.google.com/billing/docs/how-to/modify-project).

3. [Enable the following APIs: Vertex AI API, Cloud Resource Manager API](https://console.cloud.google.com/flows/enableapi?apiid=aiplatform.googleapis.com,cloudresourcemanager.googleapis.com).

4. If you are running this notebook locally, you need to install the [Cloud SDK](https://cloud.google.com/sdk).

#### Set your project ID

**If you don't know your project ID**, try the following:
* Run `gcloud config list`.
* Run `gcloud projects list`.
* See the support page: [Locate the project ID](https://support.google.com/googleapi/answer/7014113)

In [None]:
PROJECT_ID = "[your-project-id]"  # @param {type:"string"}

# Set the project id
! gcloud config set project {PROJECT_ID}

#### Region

You can also change the `REGION` variable used by Vertex AI. Learn more about [Vertex AI regions](https://cloud.google.com/vertex-ai/docs/general/locations).

In [None]:
REGION = "us-central1"  # @param {type: "string"}

### Authenticate your Google Cloud account

Depending on your Jupyter environment, you may have to manually authenticate. Follow the relevant instructions below.

**1. Vertex AI Workbench**
* Do nothing as you are already authenticated.

**2. Local JupyterLab instance, uncomment and run:**

In [None]:
# ! gcloud auth login

**3. Colab, uncomment and run:**

In [None]:
# from google.colab import auth
# auth.authenticate_user()

**4. Service account or other**
* See how to grant Cloud Storage permissions to your service account at https://cloud.google.com/storage/docs/gsutil/commands/iam#ch-examples.

### Create a Cloud Storage bucket

Create a storage bucket to store intermediate artifacts such as datasets.

When you submit a training job using the Cloud SDK, you upload a Python package
containing your training code to a Cloud Storage bucket. Vertex AI runs
the code from this package. In this tutorial, Vertex AI also saves the
trained model that results from your job in the same bucket. Using this model artifact, you can then
create a Vertex AI Model resource and use it for prediction.

In [None]:
BUCKET_URI = "gs://your-bucket-name-unique"  # @param {type:"string"}

**Only if your bucket doesn't already exist**: Run the following cell to create your Cloud Storage bucket.

In [None]:
! gsutil mb -l $REGION -p $PROJECT_ID $BUCKET_URI

### Import libraries

In [None]:
import json
import os

import numpy as np
from google.cloud import aiplatform, bigquery

### Initialize Vertex AI SDK for Python

Initialize the Vertex AI SDK for Python for your project and corresponding bucket.

In [None]:
# Initialize the Vertex AI SDK
aiplatform.init(project=PROJECT_ID, location=REGION, staging_bucket=BUCKET_URI)

### Initialize BigQuery Client

Initialize the BigQuery Python client for your project.

In [None]:
# Set up BigQuery client
bqclient = bigquery.Client(project=PROJECT_ID)

### Set pre-built containers

Vertex AI provides pre-built containers to run training and prediction.

For the latest list, see [Pre-built containers for training](https://cloud.google.com/vertex-ai/docs/training/pre-built-containers) and [Pre-built containers for prediction](https://cloud.google.com/vertex-ai/docs/predictions/pre-built-containers)

In [None]:
TRAIN_VERSION = "tf-cpu.2-8"
DEPLOY_VERSION = "tf2-cpu.2-8"

TRAIN_IMAGE = "us-docker.pkg.dev/vertex-ai/training/{}:latest".format(TRAIN_VERSION)
DEPLOY_IMAGE = "us-docker.pkg.dev/vertex-ai/prediction/{}:latest".format(DEPLOY_VERSION)

print("Training:", TRAIN_IMAGE)
print("Deployment:", DEPLOY_IMAGE)

## Prepare the data

To improve the convergence of the custom deep learning model, normalize the data. To prepare for this, calculate the mean and standard deviation for each numeric column.

Pass these summary statistics to the training script to normalize the data before training. Later, during prediction, use these summary statistics again to normalize the testing data.
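
As a rough illustration of the normalization described above, the following sketch computes a mean and standard deviation for one column and rescales the values to z-scores. It uses only the Python standard library, and the sample values and helper names (`column_stats`, `zscore`) are hypothetical and not part of the notebook. `statistics.stdev` is the sample standard deviation, which matches the default behavior of pandas `Series.std()`.

```python
import statistics


def column_stats(values):
    """Return the mean and sample standard deviation of a numeric column."""
    return {"mean": statistics.mean(values), "std": statistics.stdev(values)}


def zscore(values, stats):
    """Rescale values so the column has mean 0 and standard deviation 1."""
    return [(value - stats["mean"]) / stats["std"] for value in values]


# Hypothetical flipper lengths in millimetres (illustrative values only).
flipper_length_mm = [180.0, 190.0, 200.0, 210.0, 220.0]

stats = column_stats(flipper_length_mm)
scaled = zscore(flipper_length_mm, stats)

print(stats)
print(scaled)
```

The same `stats` dictionary would be reused at prediction time, so that test rows are scaled with the training statistics rather than their own.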

In [None]:
# Calculate mean and std across all rows

# Define NA values
NA_VALUES = ["NA", "."]


# Download a table
def download_table(bq_table_uri: str):
    # Remove bq:// prefix if present
    prefix = "bq://"
    if bq_table_uri.startswith(prefix):
        bq_table_uri = bq_table_uri[len(prefix) :]

    table = bigquery.TableReference.from_string(bq_table_uri)
    rows = bqclient.list_rows(
        table,
    )
    return rows.to_dataframe()


# Remove NA values
def clean_dataframe(df):
    return df.replace(to_replace=NA_VALUES, value=np.nan).dropna()


def calculate_mean_and_std(df):
    # Calculate mean and std for each applicable column
    mean_and_std = {}
    dtypes = list(zip(df.dtypes.index, map(str, df.dtypes)))
    # Normalize numeric columns.
    for column, dtype in dtypes:
        if dtype == "float32" or dtype == "float64":
            mean_and_std[column] = {
                "mean": df[column].mean(),
                "std": df[column].std(),
            }

    return mean_and_std

In [None]:
# Define the BigQuery source dataset
BQ_SOURCE = "bq://bigquery-public-data.ml_datasets.penguins"

dataframe = download_table(BQ_SOURCE)
dataframe = clean_dataframe(dataframe)
mean_and_std = calculate_mean_and_std(dataframe)
print(f"The mean and stds for each column are: {str(mean_and_std)}")

# Write to a file
MEAN_AND_STD_JSON_FILE = "mean_and_std.json"

with open(MEAN_AND_STD_JSON_FILE, "w") as outfile:
    json.dump(mean_and_std, outfile)

# Save to the staging bucket
! gsutil cp {MEAN_AND_STD_JSON_FILE} {BUCKET_URI}

## Create a Vertex AI tabular Dataset from BigQuery dataset

Your first step in training the model is to create a Vertex AI tabular dataset resource.

In [None]:
DATASET_DISPLAY_NAME = "sample-penguins-unique"

dataset = aiplatform.TabularDataset.create(
    display_name=DATASET_DISPLAY_NAME, bq_source=BQ_SOURCE
)

## Train a model

There are two ways you can train a model using a container image:

- **Use a Vertex AI pre-built container**. If you use a pre-built training container, you must additionally specify a Python package to install into the container image. This Python package contains your training code.

- **Use your own custom container image**. If you use your own container, the container image must contain your training code.

### Define the command args for the training script

Prepare the command-line arguments to pass to your training script.
* `args`: The command line arguments to pass to the corresponding Python module. In this example, they are:
  * `--epochs`: The number of epochs for training.
  * `--batch_size`: The batch size for training.
  * `--distribute` : The training distribution strategy to use for single or distributed training.
     * `"single"`: single device.
     * `"mirror"`: all GPU devices on a single compute instance.
     * `"multi"`: all GPU devices on all compute instances.
  * `--mean_and_std_json_file`: The file on Cloud Storage with pre-calculated means and standard deviations.
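
To make the argument format concrete, here is a minimal sketch of how a script might parse flags of this shape with `argparse`. The `build_parser` helper, the `choices` restriction on `--distribute`, and the example bucket path are assumptions added for illustration; the training script in this notebook defines its own parser.

```python
import argparse


def build_parser():
    """Build a parser for the four flags described above."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--epochs", type=int, default=10)
    parser.add_argument("--batch_size", type=int, default=10)
    # Restricting the choices is an assumption made for this sketch.
    parser.add_argument(
        "--distribute",
        type=str,
        default="single",
        choices=["single", "mirror", "multi"],
    )
    parser.add_argument("--mean_and_std_json_file", type=str)
    return parser


# Each list item is one "--flag=value" string, as in a CMDARGS-style list.
example_args = [
    "--epochs=20",
    "--batch_size=10",
    "--distribute=single",
    "--mean_and_std_json_file=gs://example-bucket/mean_and_std.json",
]

parsed = build_parser().parse_args(example_args)
print(parsed.epochs, parsed.batch_size, parsed.distribute)
```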

In [None]:
JOB_NAME = "penguins-custom-job-unique"
EPOCHS = 20
BATCH_SIZE = 10
TRAIN_STRATEGY = "single"

CMDARGS = [
    "--epochs=" + str(EPOCHS),
    "--batch_size=" + str(BATCH_SIZE),
    "--distribute=" + TRAIN_STRATEGY,
    "--mean_and_std_json_file=" + f"{BUCKET_URI}/{MEAN_AND_STD_JSON_FILE}",
]

### Training script

In the next cell, write the contents of the training script, `task.py`. In summary, the script does the following:

- Loads the data from the BigQuery table using the BigQuery Python client library.
- Loads the pre-calculated mean and standard deviation from the Cloud Storage bucket.
- Builds a model using the `tf.keras` API.
- Compiles the model by calling `compile()`.
- Sets a training distribution strategy according to the argument `args.distribute`.
- Trains the model by calling `fit()` with epochs and batch size according to the arguments `args.epochs` and `args.batch_size`.
- Gets the directory where to save the model artifacts from the environment variable `AIP_MODEL_DIR`. This variable is [set by the training service](https://cloud.google.com/vertex-ai/docs/training/code-requirements#environment-variables).
- Saves the trained model to the model directory.
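
When testing a script like this outside the training service, `AIP_MODEL_DIR` is not set. The following sketch shows one possible way to fall back to a local directory in that case. The `resolve_model_dir` helper and the `./model` default are hypothetical and are not used by the script in this notebook.

```python
import os


def resolve_model_dir(default="./model"):
    """Return AIP_MODEL_DIR when it is set and non-empty, else a local fallback."""
    return os.environ.get("AIP_MODEL_DIR") or default


print(resolve_model_dir())
```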

In [None]:
%%writefile task.py

import argparse
import os
from typing import Tuple, Optional

import pandas as pd
import numpy as np
import tensorflow as tf

from google.cloud import bigquery
from google.cloud import storage

# Read environment variables
training_data_uri = os.getenv("AIP_TRAINING_DATA_URI")
validation_data_uri = os.getenv("AIP_VALIDATION_DATA_URI")
test_data_uri = os.getenv("AIP_TEST_DATA_URI")

# Read args
parser = argparse.ArgumentParser()
parser.add_argument('--epochs', dest='epochs',
                    default=10, type=int,
                    help='Number of epochs.')
parser.add_argument('--batch_size', dest='batch_size',
                    default=10, type=int,
                    help='Batch size.')
parser.add_argument('--distribute', dest='distribute', type=str, default='single',
                    help='Distributed training strategy.')
parser.add_argument('--mean_and_std_json_file', dest='mean_and_std_json_file', type=str,
                    help='GCS URI to the JSON file with pre-calculated column means and standard deviations.')
args = parser.parse_args()

# Set up BigQuery clients
bqclient = bigquery.Client()


def download_blob(
  bucket_name: str, 
  source_blob_name: str, 
  destination_file_name: str
  ) -> None:
    """Downloads a blob from the bucket to a local path.
    Args:
        - bucket_name: "your-bucket-name"
        - source_blob_name: "storage-object-name"
        - destination_file_name: "local/path/to/file"
    """

    storage_client = storage.Client()
    bucket = storage_client.bucket(bucket_name)

    # Construct a client side representation of a blob.
    # Note `Bucket.blob` differs from `Bucket.get_blob` as it doesn't retrieve
    # any content from Cloud Storage. As we don't need additional data,
    # using `Bucket.blob` is preferred here.
    blob = bucket.blob(source_blob_name)
    blob.download_to_filename(destination_file_name)

    print(
        "Blob {} downloaded to {}.".format(
            source_blob_name, destination_file_name
        )
    )

def extract_bucket_and_prefix_from_gcs_path(gcs_path: str) -> Tuple[str, Optional[str]]:
    """Given a complete GCS path, return the bucket name and prefix as a tuple.

    Example Usage:

        bucket, prefix = extract_bucket_and_prefix_from_gcs_path(
            "gs://example-bucket/path/to/folder"
        )

        # bucket = "example-bucket"
        # prefix = "path/to/folder"

    Args:
        gcs_path (str):
            Required. A full path to a Cloud Storage folder or resource.
            Can optionally include "gs://" prefix or end in a trailing slash "/".

    Returns:
        Tuple[str, Optional[str]]
            A (bucket, prefix) pair from provided GCS path. If a prefix is not
            present, None is returned in its place.
    """
    if gcs_path.startswith("gs://"):
        gcs_path = gcs_path[5:]
    if gcs_path.endswith("/"):
        gcs_path = gcs_path[:-1]

    gcs_parts = gcs_path.split("/", 1)
    gcs_bucket = gcs_parts[0]
    gcs_blob_prefix = None if len(gcs_parts) == 1 else gcs_parts[1]

    return (gcs_bucket, gcs_blob_prefix)


# Download means and std
def download_mean_and_std(mean_and_std_json_file):
    """Download mean and std for each column"""
    import json
    
    bucket, file_path = extract_bucket_and_prefix_from_gcs_path(mean_and_std_json_file)
    download_blob(bucket_name=bucket, source_blob_name=file_path, destination_file_name=file_path)
    
    with open(file_path, 'r') as file:
        return json.loads(file.read())

        
# Download a table
def download_table(bq_table_uri: str):
    # Remove bq:// prefix if present
    prefix = "bq://"
    if bq_table_uri.startswith(prefix):
        bq_table_uri = bq_table_uri[len(prefix):]

    table = bigquery.TableReference.from_string(bq_table_uri)
    rows = bqclient.list_rows(table)
    
    return rows.to_dataframe(create_bqstorage_client=False)


def standardize(df, mean_and_std):
    """Scales numerical columns using their means and standard deviation to get
    z-scores: the mean of each numerical column becomes 0, and the standard
    deviation becomes 1. This can help the model converge during training.

    Args:
      df: Pandas df

    Returns:
      Input df with the numerical columns scaled to z-scores
    """
    dtypes = list(zip(df.dtypes.index, map(str, df.dtypes)))
    # Normalize numeric columns.
    for column, dtype in dtypes:
        if dtype == "float32":
            df[column] -= mean_and_std[column]["mean"]
            df[column] /= mean_and_std[column]["std"]
    return df


def preprocess(df):
    """Converts categorical features to numeric. Removes unused columns.

    Args:
      df: Pandas df with raw data

    Returns:
      df with preprocessed data
    """
    df = df.drop(columns=UNUSED_COLUMNS)

    # Drop rows with NaN's
    df = df.dropna()

    # Convert integer valued (numeric) columns to floating point
    numeric_columns = df.select_dtypes(["int32", "float32", "float64"]).columns
    df[numeric_columns] = df[numeric_columns].astype("float32")

    # Convert categorical columns to numeric
    cat_columns = df.select_dtypes(["object"]).columns

    df[cat_columns] = df[cat_columns].apply(
        lambda x: x.astype(_CATEGORICAL_TYPES[x.name])
    )
    df[cat_columns] = df[cat_columns].apply(lambda x: x.cat.codes)
    return df


def convert_dataframe_to_dataset(
    df_train,
    df_validation,
    mean_and_std
):
    df_train = preprocess(df_train)
    df_validation = preprocess(df_validation)

    df_train_x, df_train_y = df_train, df_train.pop(LABEL_COLUMN)
    df_validation_x, df_validation_y = df_validation, df_validation.pop(LABEL_COLUMN)

    # Join train_x and eval_x to normalize on overall means and standard
    # deviations. Then separate them again.
    all_x = pd.concat([df_train_x, df_validation_x], keys=["train", "eval"])
    all_x = standardize(all_x, mean_and_std)
    df_train_x, df_validation_x = all_x.xs("train"), all_x.xs("eval")

    y_train = np.asarray(df_train_y).astype("float32")
    y_validation = np.asarray(df_validation_y).astype("float32")

    # Convert to numpy representation
    x_train = np.asarray(df_train_x)
    x_test = np.asarray(df_validation_x)

    # Convert to one-hot representation
    y_train = tf.keras.utils.to_categorical(y_train, num_classes=len(SPECIES))
    y_validation = tf.keras.utils.to_categorical(y_validation, num_classes=len(SPECIES))

    dataset_train = tf.data.Dataset.from_tensor_slices((x_train, y_train))
    dataset_validation = tf.data.Dataset.from_tensor_slices((x_test, y_validation))
    return (dataset_train, dataset_validation)


# Remove NA values
def clean_dataframe(df):
    return df.replace(to_replace=NA_VALUES, value=np.nan).dropna()


def create_model(num_features):
    # Create model
    Dense = tf.keras.layers.Dense
    model = tf.keras.Sequential(
        [
            Dense(
                100,
                activation=tf.nn.relu,
                kernel_initializer="uniform",
                input_dim=num_features,
            ),
            Dense(75, activation=tf.nn.relu),
            Dense(50, activation=tf.nn.relu),
            Dense(25, activation=tf.nn.relu),
            Dense(3, activation=tf.nn.softmax),
        ]
    )
    
    # Compile Keras model
    optimizer = tf.keras.optimizers.RMSprop(learning_rate=0.001)
    model.compile(
        loss="categorical_crossentropy", metrics=["accuracy"], optimizer=optimizer
    )
    
    return model


mean_and_std = download_mean_and_std(args.mean_and_std_json_file)

# Single Machine, single compute device
if args.distribute == 'single':
    if tf.config.list_physical_devices("GPU"):
        strategy = tf.distribute.OneDeviceStrategy(device="/gpu:0")
    else:
        strategy = tf.distribute.OneDeviceStrategy(device="/cpu:0")
# Single Machine, multiple compute device
elif args.distribute == 'mirror':
    strategy = tf.distribute.MirroredStrategy()
# Multiple Machine, multiple compute device
elif args.distribute == 'multi':
    strategy = tf.distribute.MultiWorkerMirroredStrategy()

# Set up training variables
LABEL_COLUMN = "species"
UNUSED_COLUMNS = []
NA_VALUES = ["NA", "."]

# Possible categorical values
SPECIES = ['Adelie Penguin (Pygoscelis adeliae)',
           'Chinstrap penguin (Pygoscelis antarctica)',
           'Gentoo penguin (Pygoscelis papua)']
ISLANDS = ['Dream', 'Biscoe', 'Torgersen']
SEXES = ['FEMALE', 'MALE']

df_train = download_table(training_data_uri)
df_validation = download_table(validation_data_uri)
df_test = download_table(test_data_uri)

df_train = clean_dataframe(df_train)
df_validation = clean_dataframe(df_validation)

_CATEGORICAL_TYPES = {
    "island": pd.api.types.CategoricalDtype(categories=ISLANDS),
    "species": pd.api.types.CategoricalDtype(categories=SPECIES),
    "sex": pd.api.types.CategoricalDtype(categories=SEXES),
}

# Create datasets
dataset_train, dataset_validation = convert_dataframe_to_dataset(
  df_train, 
  df_validation, 
  mean_and_std
)

# Shuffle train set
dataset_train = dataset_train.shuffle(len(df_train))

# Create the model
with strategy.scope():
    model = create_model(num_features=dataset_train.element_spec[0].shape[0])

# Set up datasets
NUM_WORKERS = strategy.num_replicas_in_sync
# Here the batch size scales up by number of workers since
# `tf.data.Dataset.batch` expects the global batch size.
GLOBAL_BATCH_SIZE = args.batch_size * NUM_WORKERS
dataset_train = dataset_train.batch(GLOBAL_BATCH_SIZE)
dataset_validation = dataset_validation.batch(GLOBAL_BATCH_SIZE)

# Train the model
model.fit(dataset_train, epochs=args.epochs, validation_data=dataset_validation)

tf.saved_model.save(model, os.getenv("AIP_MODEL_DIR"))

### Train the model

Define your custom `TrainingPipeline` on Vertex AI.

Use the `CustomTrainingJob` class to define the `TrainingPipeline`. The class takes the following parameters:

- `display_name`: The user-defined name of this training pipeline.
- `script_path`: The local path to the training script.
- `container_uri`: The URI of the training container image.
- `requirements`: The list of Python package dependencies of the script.
- `model_serving_container_image_uri`: The URI of a container that can serve predictions for your model — either a pre-built container or a custom container.

Use the `run` function to start training. The function takes the following parameters:

- `dataset`: Vertex AI Dataset to fit this training against.
- `model_display_name`: The display name of the `Model` if the script produces a managed `Model`.
- `bigquery_destination`: The BigQuery project location where the training data is to be written to.
- `args`: The command line arguments to be passed to the Python script.

The `run` function creates a training pipeline that trains and creates a `Model` object. After the training pipeline completes, the `run` function returns the `Model` object.

In [None]:
job = aiplatform.CustomTrainingJob(
    display_name=JOB_NAME,
    script_path="task.py",
    container_uri=TRAIN_IMAGE,
    requirements=["google-cloud-bigquery>=2.20.0", "db-dtypes"],
    model_serving_container_image_uri=DEPLOY_IMAGE,
)

MODEL_DISPLAY_NAME = "penguins-unique"

# Start the training
model = job.run(
    dataset=dataset,
    model_display_name=MODEL_DISPLAY_NAME,
    bigquery_destination=f"bq://{PROJECT_ID}",
    args=CMDARGS,
)

## Send Batch Prediction job request with feature filtering (instanceConfig field)

Now that the model is ready, you can send a batch prediction request directly from the model resource without needing to deploy the model to an endpoint.

Sometimes, your input data does not match the format that the model expects. Feature filtering lets you either exclude certain fields (such as identifiers or metadata) from the input data in your prediction request, or include only a subset of fields, without any custom pre- or post-processing in the prediction container.

In this notebook, you learn how to send a batch prediction request that includes or excludes a list of features by specifying `instanceConfig` in your `BatchPredictionJob` request (**v1beta1 only**).

Learn more about [Prediction on Vertex AI](https://cloud.google.com/vertex-ai/docs/predictions/overview)<br>
Learn more about [feature filtering](https://cloud.google.com/vertex-ai/docs/predictions/get-predictions#filter_and_transform_input_data_preview)
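
The effect of feature filtering on a single input row can be illustrated locally with plain dictionaries. This is only a sketch of the idea, not how the service implements it. The `filter_instance` helper and the sample row are hypothetical, and rejecting requests that set both lists is an assumption made here for clarity.

```python
def filter_instance(row, included_fields=None, excluded_fields=None):
    """Return the subset of `row` that would be passed on to the model."""
    if included_fields and excluded_fields:
        raise ValueError("Set only one of included_fields or excluded_fields.")
    if included_fields:
        # Keep only the listed fields, in the listed order.
        return {name: row[name] for name in included_fields if name in row}
    if excluded_fields:
        # Keep every field except the listed ones.
        return {name: value for name, value in row.items() if name not in excluded_fields}
    return dict(row)


# Hypothetical, already-normalized input row with an extra identifier column.
row = {"id": 7, "culmen_length_mm": 0.12, "body_mass_g": -0.4}

without_id = filter_instance(row, excluded_fields=["id"])
only_features = filter_instance(
    row, included_fields=["culmen_length_mm", "body_mass_g"]
)

print(without_id)
print(only_features)
```

Both calls produce the same instance here, which mirrors how an exclude list and the complementary include list can describe the same filter.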

### Prepare test data

Prepare the test data by normalizing it and converting categorical values to numeric values.
You must normalize these values in the same way that you normalized the training data.

In this example, we add an extra column called `id` to the test dataset which was not used for training. We show how to exclude this feature at prediction. 
Here, you perform testing with the same dataset that you used for training. In practice, you generally want to use a separate test dataset to verify your results.
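
The categorical-to-numeric conversion mentioned above can be sketched without pandas: each value is replaced by its position in a fixed category list, so the same string always maps to the same code at training and prediction time. The `encode_category` helper is hypothetical. Returning `-1` for unknown values mirrors what pandas categorical codes do for values outside the declared categories.

```python
ISLANDS = ["Dream", "Biscoe", "Torgersen"]


def encode_category(value, categories):
    """Return the index of `value` in `categories`, or -1 if it is not listed."""
    try:
        return categories.index(value)
    except ValueError:
        return -1


codes = [encode_category(island, ISLANDS) for island in ["Biscoe", "Dream", "Unknown"]]
print(codes)
```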

In [None]:
import pandas as pd
from google.cloud import bigquery

UNUSED_COLUMNS = []
LABEL_COLUMN = "species"

# Possible categorical values
SPECIES = [
    "Adelie Penguin (Pygoscelis adeliae)",
    "Chinstrap penguin (Pygoscelis antarctica)",
    "Gentoo penguin (Pygoscelis papua)",
]
ISLANDS = ["Dream", "Biscoe", "Torgersen"]
SEXES = ["FEMALE", "MALE"]

_CATEGORICAL_TYPES = {
    "island": pd.api.types.CategoricalDtype(categories=ISLANDS),
    "species": pd.api.types.CategoricalDtype(categories=SPECIES),
    "sex": pd.api.types.CategoricalDtype(categories=SEXES),
}


def standardize(df, mean_and_std):
    """Scales numerical columns using their means and standard deviation to get
    z-scores: the mean of each numerical column becomes 0, and the standard
    deviation becomes 1. This can help the model converge during training.

    Args:
      df: Pandas df

    Returns:
      Input df with the numerical columns scaled to z-scores
    """
    dtypes = list(zip(df.dtypes.index, map(str, df.dtypes)))
    # Normalize numeric columns.
    for column, dtype in dtypes:
        if dtype == "float32":
            df[column] -= mean_and_std[column]["mean"]
            df[column] /= mean_and_std[column]["std"]
    return df


def preprocess(df, mean_and_std):
    """Converts categorical features to numeric. Removes unused columns.

    Args:
      df: Pandas df with raw data

    Returns:
      df with preprocessed data
    """
    df = df.drop(columns=UNUSED_COLUMNS)

    # Drop rows with NaN's
    df = df.dropna()

    # Convert integer valued (numeric) columns to floating point
    numeric_columns = df.select_dtypes(["int32", "float32", "float64"]).columns
    df[numeric_columns] = df[numeric_columns].astype("float32")

    # Convert categorical columns to numeric
    cat_columns = df.select_dtypes(["object"]).columns

    df[cat_columns] = df[cat_columns].apply(
        lambda x: x.astype(_CATEGORICAL_TYPES[x.name])
    )
    df[cat_columns] = df[cat_columns].apply(lambda x: x.cat.codes)
    return df


def convert_dataframe_to_list(df, mean_and_std):
    df = preprocess(df, mean_and_std)

    df_x, df_y = df, df.pop(LABEL_COLUMN)

    # Normalize on overall means and standard deviations.
    df = standardize(df, mean_and_std)

    y = np.asarray(df_y).astype("float32")

    # Convert to numpy representation
    x = np.asarray(df_x)

    # Return features and labels as plain lists, plus the feature DataFrame
    return x.tolist(), y.tolist(), df_x


x_test, y_test, df_x = convert_dataframe_to_list(dataframe, mean_and_std)

In [None]:
# Add id column to the test dataframe
ID_COLUMN_NAME = "id"
df_x_with_id = df_x.copy()
df_x_with_id[ID_COLUMN_NAME] = range(len(df_x_with_id))

# Print the columns of the dataframe
print(f"Test dataset columns: {df_x_with_id.columns}")

### Upload the test DataFrame to BigQuery 

In [None]:
def save_dataframe_to_bigquery(
    dataframe: pd.DataFrame, dataset_name: str, table_name: str
) -> str:
    """This function loads a dataframe to a new bigquery table

    Args:
        dataframe (pd.Dataframe): dataframe to be loaded to bigquery
        dataset_name (str): name of the BigQuery dataset for storing the data
        table_name (str): name of the BigQuery table that is being created

    Returns:
        str: table id of the destination bigquery table
    """
    client = bigquery.Client(PROJECT_ID)

    bq_dataset = bigquery.Dataset(f"{PROJECT_ID}.{dataset_name}")
    bq_dataset = client.create_dataset(bq_dataset, exists_ok=True)

    job_config = bigquery.LoadJobConfig(
        # Optionally, set the write disposition. BigQuery appends loaded rows
        # to an existing table by default, but with WRITE_TRUNCATE write
        # disposition it replaces the table with the loaded data.
        write_disposition="WRITE_TRUNCATE",
    )

    # Reference: https://cloud.google.com/bigquery/docs/samples/bigquery-load-table-dataframe
    job = client.load_table_from_dataframe(
        dataframe=dataframe,
        destination=f"{PROJECT_ID}.{dataset_name}.{table_name}",
        job_config=job_config,
    )

    job.result()

    return str(job.destination)

In [None]:
# Upload the Dataframe to a BigQuery table

DATASET_NAME = "test_dataset"
TABLE_NAME = "test-data-unique"

TABLE_ID = save_dataframe_to_bigquery(
    dataframe=df_x_with_id, dataset_name=DATASET_NAME, table_name=TABLE_NAME
)

### Send the BatchPredictionJob request using REST API

Now that you have test data, you can use it to send a batch prediction request using REST API. To do that you need to create a `JSON` request with the following information:

- `BATCH_JOB_NAME`: Display name for the batch prediction job.
- `MODEL_URI`: The URI for the Model resource to use for making predictions.
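- `INPUT_FORMAT`: The format of your input data: bigquery, jsonl, csv, tf-record, tf-record-gzip, or file-list.
- `INPUT_URI`: The URI of your input data. This example uses a BigQuery table URI (`bq://...`). Cloud Storage URIs may contain wildcards.
- `OUTPUT_FORMAT`: The format of the prediction output. This example uses `bigquery`.
- `OUTPUT_URI`: The URI of the location where you want Vertex AI to save the output. This example uses a BigQuery project URI (`bq://...`).
- `MACHINE_TYPE`: The machine resources to be used for this batch prediction job.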

In this example, we create two versions of the same JSON request: one with `excludedFields` and the other with `includedFields`, to show how to exclude or include certain features. Note that in this example both requests select the same set of features.

Learn more about [requesting a batch prediction](https://cloud.google.com/vertex-ai/docs/predictions/get-predictions#api_1)<br>
Learn more about [`instanceConfig`](https://cloud.google.com/vertex-ai/docs/reference/rest/v1beta1/projects.locations.batchPredictionJobs#instanceconfig)
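
Because the two request bodies differ only in `instanceConfig`, they could also be produced by one small helper. The sketch below is one way to do that. The `build_batch_request` function and all resource names in it are placeholders for illustration, and the check that only one of the two field lists is set is an assumption based on the `instanceConfig` reference, so verify it against the current API documentation.

```python
import json


def build_batch_request(
    display_name,
    model,
    input_uri,
    output_uri,
    machine_type,
    included_fields=None,
    excluded_fields=None,
):
    """Build a BatchPredictionJob request body with optional feature filtering."""
    if included_fields and excluded_fields:
        raise ValueError("Set either included_fields or excluded_fields, not both.")

    body = {
        "displayName": display_name,
        "model": model,
        "inputConfig": {
            "instancesFormat": "bigquery",
            "bigquerySource": {"inputUri": input_uri},
        },
        "outputConfig": {
            "predictionsFormat": "bigquery",
            "bigqueryDestination": {"outputUri": output_uri},
        },
        "dedicatedResources": {"machineSpec": {"machineType": machine_type}},
    }
    if included_fields:
        body["instanceConfig"] = {"includedFields": list(included_fields)}
    elif excluded_fields:
        body["instanceConfig"] = {"excludedFields": list(excluded_fields)}
    return body


# Placeholder resource names, for illustration only.
request = build_batch_request(
    display_name="penguins-test-excluded_fields",
    model="projects/example-project/locations/us-central1/models/1234567890",
    input_uri="bq://example-project.test_dataset.test_table",
    output_uri="bq://example-project",
    machine_type="n1-standard-2",
    excluded_fields=["id"],
)
print(json.dumps(request["instanceConfig"]))
```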

In [None]:
BATCH_JOB_NAME = "penguins-test"
MODEL_URI = model.resource_name
INPUT_FORMAT = "bigquery"
INPUT_URI = f"bq://{TABLE_ID}"
OUTPUT_FORMAT = "bigquery"
OUTPUT_URI = f"bq://{PROJECT_ID}"
MACHINE_TYPE = "n1-standard-2"
EXCLUDED_FIELDS = [ID_COLUMN_NAME]

# Create a list of columns to be included
ALL_COLUMNS = list(df_x_with_id.columns)
INCLUDED_FIELDS = ALL_COLUMNS.copy()
INCLUDED_FIELDS.remove(ID_COLUMN_NAME)

### Create JSON body requests

In [None]:
import json

request_with_excluded_fields = {
    "displayName": f"{BATCH_JOB_NAME}-excluded_fields",
    "model": MODEL_URI,
    "inputConfig": {
        "instancesFormat": INPUT_FORMAT,
        "bigquerySource": {"inputUri": INPUT_URI},
    },
    "outputConfig": {
        "predictionsFormat": OUTPUT_FORMAT,
        "bigqueryDestination": {"outputUri": OUTPUT_URI},
    },
    "dedicatedResources": {
        "machineSpec": {
            "machineType": MACHINE_TYPE,
        }
    },
    "instanceConfig": {"excludedFields": EXCLUDED_FIELDS},
}

with open("request_with_excluded_fields.json", "w") as outfile:
    json.dump(request_with_excluded_fields, outfile)

In [None]:
request_with_included_fields = {
    "displayName": f"{BATCH_JOB_NAME}-included_fields",
    "model": MODEL_URI,
    "inputConfig": {
        "instancesFormat": INPUT_FORMAT,
        "bigquerySource": {"inputUri": INPUT_URI},
    },
    "outputConfig": {
        "predictionsFormat": OUTPUT_FORMAT,
        "bigqueryDestination": {"outputUri": OUTPUT_URI},
    },
    "dedicatedResources": {
        "machineSpec": {
            "machineType": MACHINE_TYPE,
        }
    },
    "instanceConfig": {"includedFields": INCLUDED_FIELDS},
}

with open("request_with_included_fields.json", "w") as outfile:
    json.dump(request_with_included_fields, outfile)

### Send the requests

To send the requests, specify the API version you want to use. In this case you use `v1beta1` to be able to use `instanceConfig`.
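
The API version appears as a path segment of the request URL. As a small sketch, the URL used for these requests can be assembled as follows. The `batch_prediction_jobs_url` helper and the example project ID are hypothetical.

```python
def batch_prediction_jobs_url(project_id, region, api_version="v1beta1"):
    """Return the regional REST URL for creating batch prediction jobs."""
    return (
        f"https://{region}-aiplatform.googleapis.com/{api_version}"
        f"/projects/{project_id}/locations/{region}/batchPredictionJobs"
    )


url = batch_prediction_jobs_url("example-project", "us-central1")
print(url)
```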

#### Exclude fields

Here, we send the request with `excludedFields`. After running the following cell, you should receive a JSON response that echoes the information you provided. Then wait for the job to complete. You can check the job status in the Batch predictions section of Vertex AI in the Google Cloud console, or use the Python SDK.
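
Waiting for a job to finish can be sketched as a simple polling loop. The helper below is deliberately independent of any client library: `get_state` is a hypothetical callable that you would implement yourself (for example, by reading the job's `state` field), and the set of terminal states listed here may not be exhaustive.

```python
import time

# Terminal job states assumed for this sketch; the service defines others too.
TERMINAL_STATES = {"JOB_STATE_SUCCEEDED", "JOB_STATE_FAILED", "JOB_STATE_CANCELLED"}


def wait_for_job(get_state, poll_seconds=30.0, max_polls=120, sleep=time.sleep):
    """Call `get_state` until it returns a terminal state or polls run out."""
    for _ in range(max_polls):
        state = get_state()
        if state in TERMINAL_STATES:
            return state
        sleep(poll_seconds)
    raise TimeoutError("The job did not reach a terminal state in time.")


# Simulated sequence of states, so the sketch runs without any cloud access.
fake_states = iter(["JOB_STATE_PENDING", "JOB_STATE_RUNNING", "JOB_STATE_SUCCEEDED"])
final_state = wait_for_job(
    lambda: next(fake_states), poll_seconds=0, sleep=lambda seconds: None
)
print(final_state)
```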

In [None]:
! curl \
  -X POST \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json" \
  -d @request_with_excluded_fields.json \
  https://{REGION}-aiplatform.googleapis.com/v1beta1/projects/{PROJECT_ID}/locations/{REGION}/batchPredictionJobs

#### Include fields

Here, we send the request with `includedFields`:

In [None]:
! curl \
  -X POST \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json" \
  -d @request_with_included_fields.json \
  https://{REGION}-aiplatform.googleapis.com/v1beta1/projects/{PROJECT_ID}/locations/{REGION}/batchPredictionJobs

## Cleaning up

To clean up all Google Cloud resources used in this project, you can [delete the Google Cloud project](https://cloud.google.com/resource-manager/docs/creating-managing-projects#shutting_down_projects) you used for the tutorial.

Otherwise, you can delete the individual resources you created in this notebook:

- Training Job
- Model
- Cloud Storage Bucket
- BigQuery Dataset

In [None]:
# Warning: Setting this to true deletes everything in your bucket
delete_bucket = True

# Delete the training job
job.delete()

# Delete the model
model.delete()

if delete_bucket or os.getenv("IS_TESTING"):
    ! gsutil rm -r $BUCKET_URI

# Delete the created BigQuery dataset
! bq rm -r -f $PROJECT_ID:$DATASET_NAME