In [None]:
# Copyright 2020 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# AI Platform (Unified) SDK: AutoML tabular classification model for batch prediction

<table align="left">
  <td>
    <a href="https://colab.research.google.com/github/GoogleCloudPlatform/ai-platform-samples/blob/master/notebooks/deepdive/automl/tabular/ucaip_automl_tabular_classification-batch.ipynb">
      <img src="https://cloud.google.com/ml-engine/images/colab-logo-32px.png" alt="Colab logo"> Run in Colab
    </a>
  </td>
  <td>
    <a href="https://github.com/GoogleCloudPlatform/ai-platform-samples/blob/master/notebooks/deepdive/automl/tabular/ucaip_automl_tabular_classification-batch.ipynb">
      <img src="https://cloud.google.com/ml-engine/images/github-logo-32px.png" alt="GitHub logo">
      View on GitHub
    </a>
  </td>
</table>
<br/><br/><br/>

# Overview


This tutorial demonstrates how to use the AI Platform (Unified) Python SDK to create tabular models and do batch prediction using Google Cloud's [AutoML Tabular](https://cloud.google.com/automl-tables).


### Dataset

The dataset used for this tutorial is the [Iris dataset](https://www.tensorflow.org/datasets/catalog/iris) from [TensorFlow Datasets](https://www.tensorflow.org/datasets/catalog/overview). This dataset does not require any feature engineering. The version of the dataset you will use in this tutorial is stored in a public Cloud Storage bucket.


### Objective

In this notebook, you will learn how to create a tabular classification model with AutoML Tabular from a Python script, and then do a batch prediction using the AI Platform (Unified) SDK. You can alternatively create models with AutoML Tabular from the command line using `gcloud` or online using Google Cloud Console.

The steps performed include: 

- Create an AI Platform (Unified) managed Dataset.
- Train the model for up to one hour.
- View the model evaluation.
- Create a batch prediction service for the model.
- Make a batch prediction.

How does the batch prediction service differ from the prediction service of a deployed model when predicting on multiple instances? There is one key difference; otherwise, the outcome is essentially the same:

* Prediction Service - Does an on-demand prediction for the entire set of instances (i.e., one or more data items) and returns the results in real-time.

* Batch Prediction Service - Does a queued (batch) prediction for the entire set of instances in the background and stores the results in a Cloud Storage bucket when ready.

### Costs 

This tutorial uses billable components of Google Cloud Platform (GCP):

* Cloud AI Platform
* Cloud Storage

Learn about [Cloud AI Platform
pricing](https://cloud.google.com/ml-engine/docs/pricing) and [Cloud Storage
pricing](https://cloud.google.com/storage/pricing), and use the [Pricing
Calculator](https://cloud.google.com/products/calculator/)
to generate a cost estimate based on your projected usage.

## Installation

Install the latest (alpha) version of the AI Platform (Unified) SDK from a tar file stored in a Cloud Storage bucket.

In [None]:
! pip3 install https://storage.googleapis.com/google-cloud-aiplatform/libraries/python/0.1.1/google-cloud-aiplatform-0.1.1.tar.gz

Install the Google Cloud Storage client library (`google-cloud-storage`) as well.

In [None]:
! pip3 install google-cloud-storage

### Restart the Kernel

Once you've installed the AI Platform (Unified) SDK and Google cloud-storage, you need to restart the notebook kernel so it can find the packages.

In [None]:
# Automatically restart kernel after installs
import IPython
app = IPython.Application.instance()
app.kernel.do_shutdown(True)

## Before you begin

### GPU run-time

**Make sure you're running this notebook in a GPU runtime if you have that option. In Colab, select Runtime -> Change runtime type -> GPU.**

### Set up your GCP project

**The following steps are required, regardless of your notebook environment.**

1. [Select or create a GCP project](https://console.cloud.google.com/cloud-resource-manager). When you first create an account, you get a $300 free credit towards your compute/storage costs.

2. [Make sure that billing is enabled for your project.](https://cloud.google.com/billing/docs/how-to/modify-project)

3. [Enable the AI Platform APIs and Compute Engine APIs.](https://console.cloud.google.com/flows/enableapi?apiid=ml.googleapis.com,compute_component)

4. [Google Cloud SDK](https://cloud.google.com/sdk) is already installed in AI Platform Notebooks.

5. Enter your project ID in the cell below. Then run the cell to make sure the
Cloud SDK uses the right project for all the commands in this notebook.

**Note**: Jupyter runs lines prefixed with `!` as shell commands, and it interpolates Python variables prefixed with `$` into these commands.

#### Project ID

**If you don't know your project ID**, you can retrieve it with the `gcloud` command by executing the second cell below.

In [None]:
PROJECT_ID = "[your-project-id]" #@param {type:"string"}

In [None]:
if PROJECT_ID == "" or PROJECT_ID is None or PROJECT_ID == "[your-project-id]":
    # Get your GCP project id from gcloud
    shell_output = ! gcloud config list --format 'value(core.project)' 2>/dev/null
    PROJECT_ID = shell_output[0]
    print("Project ID:", PROJECT_ID)

In [None]:
! gcloud config set project $PROJECT_ID

#### Region

You can also change the `REGION` variable, which is used for operations
throughout the rest of this notebook. Make sure to [choose a region where Cloud
AI Platform services are
available](https://cloud.google.com/ml-engine/docs/tensorflow/regions). You may
not use a Multi-Regional Storage bucket for training with AI Platform.

In [None]:
REGION = 'us-central1' #@param {type: "string"}

#### Timestamp

If you are in a live tutorial session, you might be using a shared test account or project. To avoid name collisions between users on the resources created, you create a timestamp for each session and append it to the names of the resources created in this tutorial.

In [None]:
from datetime import datetime

TIMESTAMP = datetime.now().strftime("%Y%m%d%H%M%S")
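
As a minimal sketch of how the timestamp suffix is used, the hypothetical helper `unique_name` below (not part of the SDK or of this notebook) appends the same second-resolution format to a resource prefix. Note that two sessions started within the same second would still produce the same name.

```python
from datetime import datetime


def unique_name(prefix, now=None):
    """Append a second-resolution timestamp to a resource name prefix.

    Hypothetical helper for illustration; `now` can be injected for testing.
    """
    now = now or datetime.now()
    return prefix + "-" + now.strftime("%Y%m%d%H%M%S")


# Fixed datetime so the output is reproducible.
example_name = unique_name("automl", datetime(2020, 10, 1, 12, 30, 45))
print(example_name)
```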

### Authenticate your GCP account

**If you are using AI Platform Notebooks**, your environment is already
authenticated. Skip this step.

*Note: if you run the cell on an AI Platform notebook, it detects the environment and skips the authentication steps.*

In [None]:
import os
import sys

# If you are running this notebook in Colab, run this cell and follow the
# instructions to authenticate your Google Cloud account. This provides access
# to your Cloud Storage bucket and lets you submit training jobs and prediction
# requests.

# If on AI Platform, then don't execute this code
if not os.path.exists('/opt/deeplearning/metadata/env_version'):
    if 'google.colab' in sys.modules:
        from google.colab import auth as google_auth
        google_auth.authenticate_user()

    # If you are running this tutorial in a notebook locally, replace the string
    # below with the path to your service account key and run this cell to
    # authenticate your Google Cloud account.
    else:
        %env GOOGLE_APPLICATION_CREDENTIALS your_path_to_credentials.json

### Create a Cloud Storage bucket

**The following steps are required if your data is in your own local Cloud Storage bucket, regardless of your notebook environment.**

This tutorial is designed to use training data that is in a public Cloud Storage bucket and a local Cloud Storage bucket for your batch predictions. You may alternatively use your own training data that you have stored in a local Cloud Storage bucket.

Set the name of your Cloud Storage bucket below. It must be unique across all Cloud Storage buckets. 

In [None]:
BUCKET_NAME = "[your-bucket-name]" #@param {type:"string"}

In [None]:
if BUCKET_NAME == "" or BUCKET_NAME is None or BUCKET_NAME == "[your-bucket-name]":
    # Separate the project ID from the suffix so the generated name stays readable
    BUCKET_NAME = PROJECT_ID + "-ucaip-automl-" + TIMESTAMP

**Only if your bucket doesn't already exist**: Run the following cell to create your Cloud Storage bucket.

In [None]:
! gsutil mb -l $REGION gs://$BUCKET_NAME

Finally, validate access to your Cloud Storage bucket by examining its contents:

In [None]:
! gsutil ls -al gs://$BUCKET_NAME

### Set up variables

Let's set up some variables used to create an AutoML model.

### Import libraries and define constants

#### Import AI Platform (Unified) SDK

Import the AI Platform (Unified) SDK into your Python environment.

In [None]:
import os
import sys
import time

from google.cloud import aiplatform_v1alpha1 as aip

#### AI Platform (Unified) constants

Let's now set up some constants for AI Platform (Unified):

- `API_ENDPOINT`: The AI Platform (Unified) API service endpoint for dataset, model, job, pipeline and endpoint services.
- `PARENT`: The AI Platform (Unified) location root path for dataset, model and endpoint resources.

In [None]:
# API Endpoint
API_ENDPOINT = REGION + "-aiplatform.googleapis.com"

# AI Platform (Unified) location root path for your dataset and model resources
PARENT = "projects/" + PROJECT_ID + "/locations/" + REGION
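
The two constants follow simple string patterns. The sketch below uses hypothetical helpers (`regional_endpoint` and `location_path`, neither of which is part of the SDK) and assumes the endpoint pattern seen above for `us-central1` carries over to other regions where the service is available.

```python
def regional_endpoint(region):
    """Build the API service endpoint for a region.

    Assumption: the pattern shown for us-central1 applies to other regions.
    """
    return "{}-aiplatform.googleapis.com".format(region)


def location_path(project_id, region):
    """Build the location root path used as `parent` in service calls."""
    return "projects/{}/locations/{}".format(project_id, region)


# "my-project" is a placeholder project ID.
print(regional_endpoint("us-central1"))
print(location_path("my-project", "us-central1"))
```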

#### AutoML constants

Let's now set up some constants for AutoML:

- Dataset Schemas - tells the managed dataset service which type of dataset it is.
- Data Labeling (Annotations) Schemas - tells the managed dataset service how the data is labeled (annotated).
- Dataset Training Schemas - tells the managed pipelines service the task (e.g., classification) to train the model for.

In [None]:
# Tabular Dataset type
TABLE_SCHEMA = 'google-cloud-aiplatform/schema/dataset/metadata/tables_1.0.0.yaml'
# Tabular Labeling type
IMPORT_SCHEMA_TABLE_CLASSIFICATION = 'gs://google-cloud-aiplatform/schema/dataset/ioformat/table_io_format_1.0.0.yaml'
# Tabular Training task
TRAINING_TABLE_CLASSIFICATION_SCHEMA = "gs://google-cloud-aiplatform/schema/trainingjob/definition/automl_tables_1.0.0.yaml"

#### Deployment constants

Let's now set up some constants for deployment.

- Docker container image we will use for prediction. 
 - Set the variable `GPU = True` to use a container image supporting a GPU; otherwise the container image will run on a CPU.

In [None]:
GPU = True

# Tutorial

Now you are ready to start creating your own AutoML Tabular model for tabular classification.

## Clients

The AI Platform (Unified) SDK works as a client/server model. On your side, the Python script, you will create a client that sends requests and receives responses from the server -- AI Platform.

You will use several clients in this tutorial, so you will set them all up upfront.

- Dataset Service for managed datasets.
- Model Service for managed models.
- Pipeline Service for training.
- Job Service for batch prediction. 

In [None]:
# client options same for all services
client_options = {"api_endpoint": API_ENDPOINT}


def create_dataset_client():
    client = aip.DatasetServiceClient(
        client_options=client_options
    )
    return client


def create_model_client():
    client = aip.ModelServiceClient(
        client_options=client_options
    )
    return client


def create_pipeline_client():
    client = aip.PipelineServiceClient(
        client_options=client_options
    )
    return client


def create_job_client():
    client = aip.JobServiceClient(
        client_options=client_options
    )
    return client


clients = {}
clients['dataset'] = create_dataset_client()
clients['model'] = create_model_client()
clients['pipeline'] = create_pipeline_client()
clients['job'] = create_job_client()

for client in clients.items():
    print(client)

## Dataset 

Now that your clients are ready, your first step is to create a managed dataset instance. This step differs from Vision, Video and Language. For those products, after the managed dataset is created, one then separately imports the data, using the `import_data` method.

For tabular, importing the data is deferred until the training pipeline starts training the model. What do you do differently? First, you won't call the `import_data` method. Instead, when you create the dataset instance, you specify the Cloud Storage location of the CSV file that contains your tabular data as part of the managed dataset's metadata.

`metadata = {"input_config": {"gcs_source": {"uri": [gcs_uri]}}}`

Note that the `uri` field of `gcs_source` is a list, so you can specify multiple CSV files when your data is split across files.
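
As a sketch of the multi-file case, the hypothetical helper `build_tabular_metadata` below builds the same metadata dictionary from one or more Cloud Storage URIs. The bucket and file names are placeholders, and the `gs://` prefix check is a sanity check of this sketch rather than a documented service rule.

```python
def build_tabular_metadata(gcs_uris):
    """Build the dataset metadata dict for one or more CSV files.

    Hypothetical helper; accepts a single URI string or a list of URIs.
    """
    if isinstance(gcs_uris, str):
        gcs_uris = [gcs_uris]
    for uri in gcs_uris:
        if not uri.startswith("gs://"):
            raise ValueError("expected a Cloud Storage URI, got: " + uri)
    return {"input_config": {"gcs_source": {"uri": list(gcs_uris)}}}


# Placeholder URIs for data split across two files.
metadata_example = build_tabular_metadata(
    ["gs://my-bucket/iris_part1.csv", "gs://my-bucket/iris_part2.csv"]
)
print(metadata_example)
```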

### Location of Cloud Storage training data

Let's now set the variable `IMPORT_FILE` to the location of the CSV file in Cloud Storage.

Set the local variable `IMPORT_FORMAT` to indicate whether your dataset is a CSV index file.

Additionally, you can set the variable `SPLIT_TYPE` to choose how AutoML will handle splitting the dataset into training, test and validation sets:

- DEFAULT - AutoML chooses the split.
- ML_USE - Examples are tagged with the set they belong to (TRAINING, TEST, VALIDATION).
- FRACTION - Percentage split ratios specified in `input_config` when training.

In [None]:
# Tabular Classification
# No Split
IRIS_CSV = 'gs://cloud-samples-data/tables/iris_1000.csv'
# ML_USE split
IRIS_SPLIT_CSV = 'gs://cloud-samples-data/tables/iris_1000-split.csv'

IMPORT_FORMAT = 'CSV'  # [CSV]
SPLIT_TYPE = 'DEFAULT'  # [ML_USE, FRACTION, DEFAULT]

if IMPORT_FORMAT == 'CSV':
    if SPLIT_TYPE == 'ML_USE':
        IMPORT_FILE = IRIS_SPLIT_CSV
    else:
        IMPORT_FILE = IRIS_CSV

### Create a managed dataset instance

Use this helper function `create_dataset` to create the instance of your managed dataset. This function does:

1. Uses the dataset client service.
2. Creates an AI Platform (Unified) dataset object (`aip.Dataset`), with the parameters:
- `display_name`: The human-readable name you choose to give it, and
- `metadata_schema_uri`: The dataset type. For this tutorial this will be the schema for tabular dataset type.
- `metadata`: The Cloud Storage location of the tabular data.
3. Calls the client dataset service method `create_dataset`, with the parameters:
- `parent`: AI Platform (Unified) location root path for your dataset, model and endpoint resources.
- `dataset`: the AI Platform (Unified) dataset object instance you created.
4. The method returns an `operation` object.

An `operation` object is how AI Platform (Unified) handles asynchronous calls for long running operations. While this step usually goes fast, when you first use it in your project, there is a longer delay due to provisioning.

You can use the `operation` object to get status on the operation (e.g., create managed dataset) or to cancel the operation, by invoking an operation method:

| Method      | Description |
| ----------- | ----------- |
| result()    | Waits for the operation to complete and returns a result object in JSON format.      |
| running()   | Returns True/False on whether the operation is still running.        |
| done()      | Returns True/False on whether the operation is completed. |
| cancelled() | Returns True/False on whether the operation was canceled. |
| cancel()    | Cancels the operation (this may take up to 30 seconds). |
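
The sketch below illustrates polling with `done()` and then fetching the value with `result()`. To stay runnable without a cloud project, it uses `FakeOperation`, a hypothetical stand-in that only mimics those two methods; `wait_for_operation` is likewise an illustrative helper, not an SDK function.

```python
import time


class FakeOperation:
    """Hypothetical stand-in for a long-running operation.

    Reports done on the N-th call to done(), then returns a fixed value.
    """

    def __init__(self, polls_until_done, value):
        self._remaining = polls_until_done
        self._value = value

    def done(self):
        if self._remaining > 0:
            self._remaining -= 1
        return self._remaining == 0

    def result(self, timeout=None):
        return self._value


def wait_for_operation(operation, poll_seconds=1.0, max_polls=60):
    """Poll done() until the operation completes, then return result()."""
    for _ in range(max_polls):
        if operation.done():
            return operation.result()
        time.sleep(poll_seconds)
    raise TimeoutError("operation did not finish after {} polls".format(max_polls))


print(wait_for_operation(FakeOperation(3, "dataset created"), poll_seconds=0))
```

In the helper function that follows, calling `result(timeout=...)` directly achieves the same blocking wait in one call; explicit polling is mainly useful when you want to report progress between checks.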


In [None]:
from google.protobuf import json_format
from google.protobuf.struct_pb2 import Value, Struct

TIMEOUT = 60
DATA_SCHEMA = TABLE_SCHEMA


def create_dataset(name, schema, gcs_uri=None, labels=None, timeout=TIMEOUT):
    start_time = time.time()
    try:
        metadata = {"input_config": {"gcs_source": {"uri": [gcs_uri]}}}
        dataset = aip.Dataset(display_name=name, metadata_schema_uri="gs://" + schema, labels=labels,
                              metadata=json_format.ParseDict(metadata, Value()))

        operation = clients['dataset'].create_dataset(parent=PARENT, dataset=dataset)
        print("Long running operation:", operation.operation.name)
        # Honor the caller-supplied timeout instead of the module-level default
        response = operation.result(timeout=timeout)
        print("time:", time.time() - start_time)
        print("response")
        print(" name:", response.name)
        print(" display_name:", response.display_name)
        print(" metadata_schema_uri:", response.metadata_schema_uri)
        print(" metadata:", dict(response.metadata))
        print(" create_time:", response.create_time)
        print(" update_time:", response.update_time)
        print(" etag:", response.etag)
        print(" labels:", dict(response.labels))
        return {'name': response.name, 'schema': schema}
    except Exception as e:
        print("exception:", e)
        # Return a single None; the success path returns a dict, not a tuple
        return None


dataset = create_dataset("automl-" + TIMESTAMP, DATA_SCHEMA, gcs_uri=IMPORT_FILE)

### Data preparation

The AI Platform (Unified) managed dataset for tabular has a couple of requirements for your tabular data.

- Must be in a CSV file.

#### CSV

For tabular classification, the CSV file has a few requirements:

- The first row must be the heading -- note how this is different from Vision, Video and Language where the requirement is no heading.
- All but one column are features.
- One column is the label, which you will specify when you subsequently create the training pipeline.

### Dataset splitting

#### CSV

Each row entry in a CSV file may be preceded by a first column that indicates whether the data is part of the training (TRAINING), test (TEST) or validation (VALIDATION) data.  Alternatively, AI Platform (Unified) supports the CAIP (pre-AI Platform (Unified)) version of the tags: TRAIN, TEST and VALIDATE. For example:

    TRAINING, "this is the data item", "this is the label"
    TEST, "this is the data item", "this is the label"
    VALIDATION, "this is the data item", "this is the label"
    
Otherwise, AutoML will automatically split the dataset for you.
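
Before uploading a tagged CSV file, it can help to check how many rows fall into each set. The sketch below is a hypothetical check (not part of the SDK) that counts rows by the tag in the first column and treats the two tag spellings mentioned above as equivalent.

```python
import csv
import io
from collections import Counter

# Map both tag spellings to one canonical name.
CANONICAL_TAGS = {
    "TRAINING": "TRAINING", "TRAIN": "TRAINING",
    "TEST": "TEST",
    "VALIDATION": "VALIDATION", "VALIDATE": "VALIDATION",
}


def count_splits(csv_text, has_header=False):
    """Count rows per split tag found in the first column of CSV text."""
    rows = csv.reader(io.StringIO(csv_text), skipinitialspace=True)
    if has_header:
        next(rows, None)  # skip the heading row
    counts = Counter()
    for row in rows:
        if not row:
            continue  # ignore blank lines
        tag = row[0].strip().upper()
        counts[CANONICAL_TAGS.get(tag, "UNASSIGNED")] += 1
    return counts


SAMPLE = (
    'TRAINING, "this is the data item", "this is the label"\n'
    'TEST, "this is the data item", "this is the label"\n'
    'VALIDATE, "this is the data item", "this is the label"\n'
    'TRAIN, "this is the data item", "this is the label"\n'
)
split_counts = count_splits(SAMPLE)
print(dict(split_counts))
```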

#### Quick Peek at your data

You will use a version of the Iris dataset that is stored in a public Cloud Storage bucket, using a CSV file. 

Let's start by doing a quick peek at the data. You count the number of examples by counting the number of rows in the CSV file (`wc -l`) and then peek at the first few rows.

For training, you also need to know the heading name of the label column, which is saved as `label_column`.

In [None]:
count = ! gsutil cat $IMPORT_FILE | wc -l
print("Number of Examples", int(count[0]))

print("First 10 rows")
! gsutil cat $IMPORT_FILE | head

heading = ! gsutil cat $IMPORT_FILE | head -n1
if IMPORT_FORMAT == 'CSV':
    # The label is the last column of the heading row
    label_column = heading[0].split(',')[-1].strip()
print("Label Column Name", label_column)

### Get dataset information

Now that your AI Platform (Unified) managed dataset is created, let's get some information about the current state of the dataset. Use this helper function `get_dataset`, with the parameter:

- `name`: The AI Platform (Unified) fully qualified dataset identifier, which is in the form:

    projects/[project_id]/locations/[region]/datasets/[dataset id]

The helper function uses the dataset service client's method `get_dataset`, which takes as a parameter:

- `name`: The AI Platform (Unified) fully qualified dataset identifier.
    
If you recall, we got the fully qualified dataset identifier in the `name` field of the response object when we created the AI Platform (Unified) managed dataset instance.

The method returns an AI Platform (Unified) managed dataset object.

In [None]:
def get_dataset(name):
    response = clients['dataset'].get_dataset(name=name)
    print("TYPE", type(response))

    print("name:", response.name)
    print("display name:", response.display_name)
    print("create_time:", response.create_time)
    print("update_time:", response.update_time)
    print("labels:", response.labels)
    print("metadata_schema_uri:", response.metadata_schema_uri)
    print("metadata:", dict(response.metadata))


get_dataset(dataset['name'])
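
To make the structure of the fully qualified identifier concrete, the sketch below splits it into its parts. `parse_dataset_name` is a hypothetical helper, and the project and dataset IDs in the example are placeholders.

```python
def parse_dataset_name(name):
    """Split a fully qualified dataset identifier into its components.

    Expected form: projects/[project_id]/locations/[region]/datasets/[dataset id]
    """
    parts = name.split("/")
    well_formed = (
        len(parts) == 6
        and parts[0] == "projects"
        and parts[2] == "locations"
        and parts[4] == "datasets"
        and all(parts)
    )
    if not well_formed:
        raise ValueError("not a fully qualified dataset identifier: " + name)
    return {"project": parts[1], "location": parts[3], "dataset_id": parts[5]}


parsed = parse_dataset_name("projects/my-project/locations/us-central1/datasets/1234567890")
print(parsed)
```

The last component is the short dataset ID that the training pipeline specification expects later in this tutorial.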

### List all data items

Unlike image, video and text, there is no data item concept in tabular.

## Train the model

Let's now train an AutoML tabular classification model using your AI Platform (Unified) managed dataset. To train the model, do the following steps:

1. Create an AI Platform (Unified) managed training pipeline for the dataset.
2. Execute the pipeline to start the training.

### Create a training pipeline

You may ask, what do we use a pipeline for? We typically use pipelines when the job (such as training) has multiple steps, generally in sequential order: do step A, do step B, etc. By putting the steps into a pipeline, we gain the benefits of:

1. Reusable for subsequent training jobs.
2. Can be containerized and run as a batch job.
3. Can be distributed.
4. All the steps are associated with the same pipeline job for tracking progress.

Use this helper function `create_pipeline`, which takes the parameters:

- `pipeline_name`: A human readable name for the pipeline job.
- `model_name`: A human readable name for the model.
- `dataset`: The AI Platform (Unified) fully qualified dataset identifier.
- `schema`: The training task definition schema. For this tutorial, it will be the schema for training a tabular classification model.
- `task`: A dictionary describing the requirements for the training job.

The helper function uses the AI Platform (Unified) pipeline client service, calling the method `create_training_pipeline`, which takes the parameters:

- `parent`: The AI Platform (Unified) location root path for your dataset, model and endpoint resources.
- `training_pipeline`: The full specification for the pipeline training job.

Let's now look deeper into the *minimal* requirements for constructing a `training_pipeline` specification:

- `display_name`: A human readable name for the pipeline job.
- `training_task_definition`: The training task definition schema.
- `training_task_inputs`: A dictionary describing the requirements for the training job.
- `input_data_config`: The dataset specification.
 - `dataset_id`: The AI Platform (Unified) dataset identifier only (non-fully qualified) -- this is the last part of the fully-qualified identifier.
 - `fraction_split`: If specified, the percentages of the dataset to use for training, test and validation. Otherwise, the percentages are automatically selected by AutoML.
- `model_to_upload`: A human readable name for the model.
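
As a sketch of the `input_data_config` part of the specification, the hypothetical helper below derives `dataset_id` from the fully qualified identifier and optionally adds a `fraction_split`. The check that the fractions are non-negative and sum to 1 is an assumption of this sketch; the service's exact validation rules are not stated in this tutorial.

```python
def build_input_data_config(dataset_name, fractions=None):
    """Build an input_data_config dict.

    dataset_name: fully qualified dataset identifier.
    fractions: optional (training, validation, test) tuple.
    """
    config = {"dataset_id": dataset_name.split("/")[-1]}
    if fractions is not None:
        training, validation, test = fractions
        if min(fractions) < 0 or abs(sum(fractions) - 1.0) > 1e-6:
            raise ValueError("fractions must be non-negative and sum to 1")
        config["fraction_split"] = {
            "training_fraction": training,
            "validation_fraction": validation,
            "test_fraction": test,
        }
    return config


# Placeholder dataset identifier.
DATASET_NAME = "projects/my-project/locations/us-central1/datasets/1234567890"
print(build_input_data_config(DATASET_NAME))
print(build_input_data_config(DATASET_NAME, fractions=(0.8, 0.1, 0.1)))
```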

In [None]:
def create_pipeline(pipeline_name, model_name, dataset, schema, task):

    dataset_id = dataset.split('/')[-1]
    if SPLIT_TYPE == 'FRACTION':
        input_config = {'dataset_id': dataset_id,
                        'fraction_split': {
                            'training_fraction': 0.8,
                            'validation_fraction': 0.1,
                            'test_fraction': 0.1,
                        }}
    else:
        input_config = {'dataset_id': dataset_id}

    training_pipeline = {
        "display_name": pipeline_name,
        "training_task_definition": schema,
        "training_task_inputs": task,
        "input_data_config": input_config,
        "model_to_upload": {"display_name": model_name},
    }

    try:
        pipeline = clients['pipeline'].create_training_pipeline(parent=PARENT, training_pipeline=training_pipeline)
        print(pipeline)
    except Exception as e:
        print("exception:", e)
        return None
    return pipeline

Next, you will construct the task requirements. Unlike other parameters, which take a Python (JSON-like) dictionary, the `task` field takes a Google protobuf Struct, which is very similar to a Python dictionary. The minimal fields you need to specify are:

- `prediction_type`: Whether we are doing "classification" or "regression".
- `target_column`: The CSV heading column name for the column we want to predict (i.e., the label).
- `train_budget_milli_node_hours`: The maximum time to budget (billed) for training the model, where 1000 = 1 hour.
- `disable_early_stopping`: If `True`, training uses the entire budget; if `False`, AutoML may use its judgment to stop training early.
- `transformations`: Specifies the feature engineering for each feature column.

For `transformations`, the list must have an entry for each column. The outer key field indicates the type of feature engineering for the corresponding column. In this tutorial, you set it to `"auto"` to tell AutoML to automatically determine it.
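
Rather than typing one entry per column, the list can be derived from the CSV heading row. The sketch below uses a hypothetical helper, `build_auto_transformations`, which emits an `"auto"` entry for every column except the label. The heading shown is an assumption: the four feature names match the ones used in this tutorial, while `species` as the label heading is a guess for illustration.

```python
def build_auto_transformations(header_line, label_column):
    """Create one {"auto": ...} entry per feature column in the heading row."""
    columns = [column.strip() for column in header_line.split(",")]
    if label_column not in columns:
        raise ValueError("label column not found in heading: " + label_column)
    return [
        {"auto": {"column_name": column}}
        for column in columns
        if column != label_column
    ]


# Assumed heading row; check it against the output of the data peek cell.
HEADER = "sepal_length,sepal_width,petal_length,petal_width,species"
auto_transformations = build_auto_transformations(HEADER, "species")
for entry in auto_transformations:
    print(entry)
```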

Finally, you create the pipeline by calling the helper function `create_pipeline`, which returns an instance of a training pipeline object.

In [None]:
SCHEMA = TRAINING_TABLE_CLASSIFICATION_SCHEMA
PIPE_NAME = "iris_pipe-" + TIMESTAMP
MODEL_NAME = "iris_model-" + TIMESTAMP

TRANSFORMATIONS = [
    {"auto": {"column_name": "sepal_width"}},
    {"auto": {"column_name": "sepal_length"}},
    {"auto": {"column_name": "petal_length"}},
    {"auto": {"column_name": "petal_width"}}
]

task = Value(struct_value=Struct(
            fields={
                'target_column': Value(string_value=label_column),
                'prediction_type': Value(string_value="classification"),
                'train_budget_milli_node_hours': Value(number_value=1000),
                'disable_early_stopping': Value(bool_value=False),
                'transformations': json_format.ParseDict(TRANSFORMATIONS, Value())
            }
        ))
pipeline = create_pipeline(PIPE_NAME, MODEL_NAME, dataset['name'], SCHEMA, task)

### List all training pipelines

Your training pipeline is now executing on Google Cloud AI Platform. Let's start by getting a list of all your pipelines and their corresponding execution state. You likely only have one, but if you have been experimenting with this tutorial or have otherwise used AI Platform (Unified) pipelines previously, you will see those as well.

Use this helper function `list_training_pipeline`. This function uses the pipeline client service and calls the method `list_training_pipelines`, with the parameter:

- `parent`: The AI Platform (Unified) location root path for your dataset, model and endpoint resources.

The method returns a `response object` as a list, where every element in the list is a pipeline object instance. The field we are most interested in is `state`, which at this early point should be `PIPELINE_STATE_RUNNING`, meaning the model is being trained but training has not completed.

You could also see `PIPELINE_STATE_PENDING`, which indicates either that the service has not yet finished provisioning the resources for the training job, or that the training job has momentarily been paused.

In [None]:
def list_training_pipeline():

    response = clients['pipeline'].list_training_pipelines(parent=PARENT)
    for pipeline in response:
        print("pipeline")
        print(" name:", pipeline.name)
        print(" display_name:", pipeline.display_name)
        print(" training_task_definition:", pipeline.training_task_definition)
        print(" training_task_inputs:", dict(pipeline.training_task_inputs))
        print(" state:", pipeline.state)
        print(" create_time:", pipeline.create_time)
        print(" start_time:", pipeline.start_time)
        print(" end_time:", pipeline.end_time)
        print(" update_time:", pipeline.update_time)
        print(" labels:", dict(pipeline.labels))


list_training_pipeline()

### Get information on a training pipeline

Let's now get pipeline information for just this training pipeline instance. You will use the pipeline client service and invoke the `get_training_pipeline` method, with the parameter:

- `name`: The AI Platform (Unified) fully qualified pipeline identifier.

When the model is done training, the pipeline state will be `PIPELINE_STATE_SUCCEEDED`.

In [None]:
def get_training_pipeline(name):
    response = clients['pipeline'].get_training_pipeline(name=name)

    print("pipeline")
    print(" name:", response.name)
    print(" display_name:", response.display_name)
    print(" state:", response.state)
    print(" training_task_definition:", response.training_task_definition)
    print(" training_task_inputs:", dict(response.training_task_inputs))
    print(" create_time:", response.create_time)
    print(" start_time:", response.start_time)
    print(" end_time:", response.end_time)
    print(" update_time:", response.update_time)
    print(" labels:", dict(response.labels))
    return response


pipeline_response = get_training_pipeline(pipeline.name)
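
Because training can take a while, you may prefer to poll the state until it reaches a final value instead of re-running the cell by hand. The sketch below is one way to do that, written against a plain callable so it runs without a cloud project. `wait_for_pipeline` is a hypothetical helper, and the failed and cancelled state names are assumptions not mentioned elsewhere in this tutorial.

```python
import time

# States after which the pipeline will not change again (partly assumed).
TERMINAL_STATES = {
    "PIPELINE_STATE_SUCCEEDED",
    "PIPELINE_STATE_FAILED",
    "PIPELINE_STATE_CANCELLED",
}


def wait_for_pipeline(get_state, poll_seconds=60, max_polls=120):
    """Call get_state() repeatedly until it returns a terminal state name."""
    for _ in range(max_polls):
        state = get_state()
        if state in TERMINAL_STATES:
            return state
        time.sleep(poll_seconds)
    raise TimeoutError("pipeline still running after {} polls".format(max_polls))


# Simulated sequence of states standing in for real service responses.
simulated = iter([
    "PIPELINE_STATE_PENDING",
    "PIPELINE_STATE_RUNNING",
    "PIPELINE_STATE_SUCCEEDED",
])
final_state = wait_for_pipeline(lambda: next(simulated), poll_seconds=0)
print(final_state)
```

In the notebook, `get_state` could wrap a call to `get_training_pipeline` and return the symbolic name of the `state` field, assuming the enum exposes it through a `name` attribute.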

# Deployment

## Pre-Cooked

Training the above model may take upwards of 30 minutes. For expediency, we have a pre-cooked (already trained) version of this model you can use for the next steps, while you wait for your model to finish training.

Once your model is done training, you can repeat these steps for your trained model. You can calculate the actual time it took to train the model by subtracting `start_time` from `end_time`. For your model, we will need to know the fully qualified AI Platform (Unified) managed model identifier, which the pipeline service assigned to it. We can get this from the returned pipeline instance as the field `model_to_upload.name`.
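
The elapsed training time is a simple difference of two timestamps. The sketch below uses illustrative `datetime` values rather than real pipeline output; it assumes the pipeline's `start_time` and `end_time` fields behave like `datetime` objects that support subtraction.

```python
from datetime import datetime, timedelta, timezone


def training_duration(start_time, end_time):
    """Return the elapsed training time as a timedelta."""
    if end_time < start_time:
        raise ValueError("end_time is earlier than start_time")
    return end_time - start_time


# Illustrative timestamps only, not measured results.
started = datetime(2020, 10, 1, 12, 0, 0, tzinfo=timezone.utc)
ended = datetime(2020, 10, 1, 12, 27, 30, tzinfo=timezone.utc)
elapsed = training_duration(started, ended)
print("training took", elapsed)
```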

You can choose between the precooked model or your trained model with the Python variable `precook` in the cell below.

In [None]:
# Tabular Classification
PRECOOK_TABLE_CLASSIFICATION_MODEL = '[not-supported-yet]'
PRECOOK_MODEL = PRECOOK_TABLE_CLASSIFICATION_MODEL

# Precooked flag
precook = False

if precook:
    model_to_deploy_name = PRECOOK_MODEL
else:
    model_to_deploy = pipeline_response.model_to_upload
    model_to_deploy_name = model_to_deploy.name

print("model_to_deploy:", model_to_deploy_name)

## Evaluate the model

Now let's find out how good the model service believes your model is. As part of training, some portion of the dataset was set aside as the test (holdout) data, which is used by the pipeline service to evaluate the model.

### List evaluations for all slices

Use this helper function `list_model_evaluations`, which takes the parameter:

- `name`: The AI Platform (Unified) fully qualified model identifier.

This helper function uses the AI Platform (Unified) model client service, and calls the method `list_model_evaluations`, which takes the same parameter. The response object from the call is a list, where each element is an evaluation metric.

For each evaluation (you probably have only one), the function prints all the key names of the metrics in the evaluation and, for a small set (`logLoss` and `auPrc`), the values.

In [None]:
def list_model_evaluations(name):
    response = clients['model'].list_model_evaluations(parent=name)
    for evaluation in response:
        print("model_evaluation")
        print(" name:", evaluation.name)
        print(" metrics_schema_uri:", evaluation.metrics_schema_uri)
        metrics = json_format.MessageToDict(evaluation._pb.metrics)
        for metric in metrics.keys():
            print(metric)
        print('logloss', metrics['logLoss'])
        print('auPrc', metrics['auPrc'])

    return evaluation.name


last_evaluation = list_model_evaluations(model_to_deploy_name)
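
Once the metrics are converted to a dictionary, selecting a few of them is plain dictionary handling. The sketch below uses a hypothetical helper, `pick_metrics`, which skips keys that are absent instead of raising a `KeyError`. The metric values are placeholders, not results from a trained model.

```python
def pick_metrics(metrics, wanted=("logLoss", "auPrc")):
    """Return only the requested metrics that are present in the dict."""
    return {key: metrics[key] for key in wanted if key in metrics}


# Placeholder values for illustration only.
example_metrics = {"logLoss": 0.05, "auPrc": 0.99, "auRoc": 0.98}
selected = pick_metrics(example_metrics)
print(selected)
```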

### Get evaluations for a slice

Now, let's use the AI Platform (Unified) fully qualified identifier for an evaluation to get just that specific evaluation. Use the last evaluation (`last_evaluation`) from your previous list of evaluations as an example.

Use this helper function `model_evaluation`, which takes as a parameter:

- `name`: The AI Platform (Unified) fully qualified identifier for the specific model evaluation.

The helper function uses the model client service and calls the method `get_model_evaluation`, with the parameter:

- `name`: The AI Platform (Unified) fully qualified identifier for the specific model evaluation.

The helper function then prints the entire evaluation, which may seem somewhat verbose at first.

In [None]:
def model_evaluation(name):
    response = clients['model'].get_model_evaluation(name=name)
    print("response")
    print(" name:", response.name)
    print(" metrics_schema_uri:", response.metrics_schema_uri)
    print(" metrics:", json_format.MessageToDict(response._pb.metrics))
    print(" create_time:", response.create_time)
    print(" slice_dimensions:", response.slice_dimensions)
    model_explanation = response.model_explanation
    print(" model_explanation")
    mean_attributions = model_explanation.mean_attributions
    for mean_attribution in mean_attributions:
        print("  mean_attribution")
        print("   baseline_output_value:", mean_attribution.baseline_output_value)
        print("   instance_output_value:", mean_attribution.instance_output_value)
        print("   feature_attributions:",
            json_format.MessageToDict(mean_attribution._pb.feature_attributions),
        )
        print("   output_index:", mean_attribution.output_index)
        print("   output_display_name:", mean_attribution.output_display_name)
        print("   approximation_error:", mean_attribution.approximation_error)


model_evaluation(last_evaluation)

## Model deployment for batch prediction

Let's now deploy the trained AI Platform (Unified) model you created with AutoML for batch prediction. This differs from deploying a model for on-demand prediction.

For on-demand prediction, you:

1. Create an endpoint for deploying the model to.

2. Deploy the model to the endpoint.

3. Make on-demand (live) prediction requests to the endpoint.

For batch prediction:

1. You create a batch prediction job.

2. The job service provisions resources for the batch prediction request.

3. The results of the batch prediction request are written to the Cloud Storage output location you specified.

4. The job service deprovisions the resources for the batch prediction request.

### Make the batch input file

Let's now make a batch input file, which you will store in your Cloud Storage bucket. Unlike image, video and text data, the batch input file for tabular data is supported only in CSV format. In the CSV file:

- The first line is the heading, which lists the feature (field) names.
- Each remaining line is a separate prediction request with the corresponding feature values.
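As a minimal sketch of this layout, the snippet below builds the CSV text in memory with the standard `csv` module, which also takes care of quoting if a value ever contains a comma. The helper name `build_batch_csv` is hypothetical; the heading and instance mirror the values used in the cells below.

```python
import csv
import io


def build_batch_csv(heading, instances):
    """Return CSV text: one header row, then one row per prediction request."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(heading)
    writer.writerows(instances)
    return buffer.getvalue()


heading = ["petal_length", "petal_width", "sepal_length", "sepal_width"]
instances = [[1.4, 1.3, 5.1, 2.8]]
csv_text = build_batch_csv(heading, instances)
print(csv_text)
```

The resulting string could then be written to Cloud Storage in the same way as the cells below write `HEADING` and `INSTANCE`.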

In [None]:
HEADING = "petal_length,petal_width,sepal_length,sepal_width"
INSTANCE = "1.4,1.3,5.1,2.8"

In [None]:
import tensorflow as tf

gcs_input_uri = "gs://" + BUCKET_NAME + '/test.csv'
with tf.io.gfile.GFile(gcs_input_uri, 'w') as f:
    f.write(HEADING + '\n')
    f.write(INSTANCE + '\n')

### Make batch prediction request

Now that your batch input file with one test instance is ready, let's make the batch prediction request. Use this helper function `create_batch_prediction_job`, with the parameters:

- `display_name`: The human readable name for the prediction job.
- `model_name`: The AI Platform (Unified) fully qualified identifier for the model.
- `gcs_source_uri`: The Cloud Storage path to the CSV input file, which you created above.
- `gcs_destination_output_uri_prefix`: The Cloud Storage path that the service will write the predictions to.

The helper function uses the job client service and calls the method `create_batch_prediction_job`, with the parameters:

- `parent`: The AI Platform (Unified) location root path for dataset, model and pipeline resources.
- `batch_prediction_job`: The specification for the batch prediction job.

Let's now dive into the specification for the `batch_prediction_job`:

- `display_name`: The human readable name for the prediction batch job.
- `model`: The AI Platform (Unified) fully qualified identifier for the model.
- `model_parameters`: Requirements/constraints on the prediction service.
  - `confidenceThreshold`: The minimum confidence threshold for returning a prediction.
  - `maxPredictions`: The maximum size of the batch request.
- `input_config`: The input source and format type for the instances to predict.
  - `instances_format`: The format of the batch prediction request file: only `csv` is supported.
  - `gcs_source`: A list of one or more Cloud Storage paths to your batch prediction requests.
- `output_config`: The output destination and format for the predictions.
  - `predictions_format`: The format of the batch prediction response file: only `csv` is supported.
  - `gcs_destination`: The output destination for the predictions.
- `dedicated_resources`: The compute resources to provision for the batch prediction job.
  - `machine_spec`: The compute instance to provision. Set the variable `GPU = True` (defined earlier) to use a GPU; otherwise only a CPU is allocated.
  - `starting_replica_count`: The number of compute instances to initially provision.
  - `max_replica_count`: The maximum number of compute instances to scale to. In this tutorial, only one instance is provisioned.

This call is an asynchronous operation. You will print a few select fields from the response object, including:

- `name`: The AI Platform (Unified) fully qualified identifier assigned to the batch prediction job.
- `display_name`: The human readable name for the prediction batch job.
- `model`: The AI Platform (Unified) fully qualified identifier for the model.
- `generate_explanation`: Whether (True/False) explanations are provided with the predictions (explainability).
- `state`: The state of the prediction job (pending, running, etc).

Since the job takes a few moments to start, you will likely see `JobState.JOB_STATE_PENDING` for `state`.

The helper function returns the response object. The AI Platform (Unified) fully qualified identifier assigned to the batch prediction job is then saved as `prediction_name`.

In [None]:
BATCH_MODEL = "iris_batch-" + TIMESTAMP

def create_batch_prediction_job(display_name, model_name, gcs_source_uri, gcs_destination_output_uri_prefix):

    model_parameters = {
        "confidenceThreshold": 0.5,
        "maxPredictions": 10000,
    }

    if GPU:
        machine_spec = {
            "machine_type": "n1-standard-2",
            "accelerator_type": aip.AcceleratorType.NVIDIA_TESLA_K80,
            "accelerator_count": 1,
        }
    else:
        machine_spec = {
            "machine_type": "n1-standard-2",
            "accelerator_count": 0,
        }

    batch_prediction_job = {
        "display_name": display_name,
        # Format: 'projects/{project}/locations/{location}/models/{model_id}'
        "model": model_name,
        "model_parameters": json_format.ParseDict(model_parameters, Value()),
        "input_config": {
            "instances_format": "csv",
            "gcs_source": {"uris": [gcs_source_uri]},
        },
        "output_config": {
            "predictions_format": "csv",
            "gcs_destination": {"output_uri_prefix": gcs_destination_output_uri_prefix},
        },
        "dedicated_resources": {
            "machine_spec": machine_spec,
            "starting_replica_count": 1,
            "max_replica_count": 1,
        }
    }
    response = clients['job'].create_batch_prediction_job(
        parent=PARENT, batch_prediction_job=batch_prediction_job
    )
    print("response")
    print(" name:", response.name)
    print(" display_name:", response.display_name)
    print(" model:", response.model)
    print(" generate_explanation:", response.generate_explanation)
    print(" state:", response.state)
    print(" create_time:", response.create_time)
    print(" start_time:", response.start_time)
    print(" end_time:", response.end_time)
    print(" update_time:", response.update_time)
    print(" labels:", response.labels)
    return response


response = create_batch_prediction_job(BATCH_MODEL, model_to_deploy_name, gcs_input_uri, "gs://" + BUCKET_NAME)

prediction_name = response.name

### List all batch prediction jobs

Use this helper function `list_batch_prediction_jobs`. This helper function uses the job client service and calls the method `list_batch_prediction_jobs`, with the parameter:

- `parent`: The AI Platform (Unified) location root path to the dataset, model and pipeline resources.

The method will return a list, where each element is a single batch prediction job. You will probably only have one, unless you've already been using the service or been experimenting with this tutorial.

We will print a couple of additional fields:

- `error`: An error description if an error occurred.
- `output_uri_prefix`: The Cloud Storage location you gave for outputting the predictions.

In [None]:
def list_batch_prediction_jobs():
    response = clients['job'].list_batch_prediction_jobs(parent=PARENT)
    for batch in response:
        print(" name:", batch.name)
        print(" display_name:", batch.display_name)
        print(" model:", batch.model) 
        print(" generate_explanation:", batch.generate_explanation)
        print(" state:", batch.state)
        print(" error:", batch.error)
        gcs_destination = batch.output_config.gcs_destination
        print(" gcs_destination")
        print("  output_uri_prefix:", gcs_destination.output_uri_prefix)


list_batch_prediction_jobs()

### Get information on a batch prediction job

Use this helper function `get_batch_prediction_job`, with the parameter:

- `job_name`: The AI Platform (Unified) fully qualified identifier for the batch prediction job.

The helper function uses the job client service and calls the method `get_batch_prediction_job`, with the parameter:

- `name`: The AI Platform (Unified) fully qualified identifier for the batch prediction job. In this tutorial, you pass it the identifier saved earlier as `prediction_name`.

The helper function returns two values: the Cloud Storage path where the predictions are stored (the `output_uri_prefix` of `gcs_destination`) and the current state of the job.

In [None]:
def get_batch_prediction_job(job_name):
    response = clients['job'].get_batch_prediction_job(name=job_name)
    print("response")
    print(" name:", response.name)
    print(" display_name:", response.display_name)
    print(" model:", response.model) 
    print(" generate_explanation:", response.generate_explanation)
    print(" state:", response.state)
    print(" error:", response.error)
    gcs_destination = response.output_config.gcs_destination
    print(" gcs_destination")
    print("  output_uri_prefix:", gcs_destination.output_uri_prefix)
    return gcs_destination.output_uri_prefix, response.state


predictions, state = get_batch_prediction_job(prediction_name)

### Get the predictions

When the batch prediction is done processing, the job state will be `JOB_STATE_SUCCEEDED`.
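Because the job runs asynchronously, one way to wait for that state is to poll until the job reaches a terminal state. The sketch below shows the idea under stated assumptions: `wait_for_job` is a hypothetical helper, states are compared by their string names, and the set of states treated as terminal is a choice made here rather than something the tutorial defines. A fake sequence of states stands in for real calls to the service.

```python
import time

# Assumption: these state names are the ones treated as terminal here.
TERMINAL_STATES = {"JOB_STATE_SUCCEEDED", "JOB_STATE_FAILED", "JOB_STATE_CANCELLED"}


def wait_for_job(get_state, poll_seconds=60, max_polls=60, sleep=time.sleep):
    """Poll get_state() until it returns a terminal state or max_polls is hit."""
    state = get_state()
    for _ in range(max_polls - 1):
        if state in TERMINAL_STATES:
            break
        sleep(poll_seconds)
        state = get_state()
    return state


# Fake state sequence standing in for repeated calls to the job service.
fake_states = iter(["JOB_STATE_PENDING", "JOB_STATE_RUNNING", "JOB_STATE_SUCCEEDED"])
final_state = wait_for_job(lambda: next(fake_states), poll_seconds=0)
print(final_state)
```

In this notebook, `get_state` could plausibly wrap `get_batch_prediction_job(prediction_name)` and return the name of the state it reports; that wiring is left as an assumption.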

Finally, view the predictions stored at the Cloud Storage path you set as output. The predictions are in CSV format, which you specified when you created the batch prediction job. They are stored in a subfolder whose name starts with `prediction`, in a file named `table*.csv`.

Let's display (cat) the contents. You will see one line for each prediction. The first four fields are the feature values you requested the prediction for, and the remaining fields are the confidence values, between 0 and 1, for each class label.
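To turn such a line into a single answer, you can pick the column with the highest confidence. The sketch below assumes the file starts with a header row and that every column after the first four is a numeric confidence. The helper name `top_predictions`, the score column names, and the sample values are all hypothetical; the real output columns may be named differently.

```python
import csv
import io


def top_predictions(csv_text, num_features=4):
    """For each row, return (features, best score column, best score)."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    score_names = header[num_features:]
    results = []
    for row in reader:
        features = dict(zip(header[:num_features], row[:num_features]))
        scores = [float(value) for value in row[num_features:]]
        best = max(range(len(scores)), key=scores.__getitem__)
        results.append((features, score_names[best], scores[best]))
    return results


# Hypothetical output: the score column names and values are made up.
sample = (
    "petal_length,petal_width,sepal_length,sepal_width,score_a,score_b,score_c\n"
    "1.4,1.3,5.1,2.8,0.9,0.07,0.03\n"
)
results = top_predictions(sample)
print(results)
```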

In [None]:
if state != aip.JobState.JOB_STATE_SUCCEEDED:
    print("The job has not succeeded yet; current state:", state)
else:
    ! gsutil ls $predictions/prediction*/table*.csv

    ! gsutil cat $predictions/prediction*/table*.csv

# Cleaning up

To clean up all GCP resources used in this project, you can [delete the GCP
project](https://cloud.google.com/resource-manager/docs/creating-managing-projects#shutting_down_projects) you used for the tutorial.

Otherwise, you can delete the individual resources you created in this tutorial:

- Dataset
- Model
- Batch Prediction
- Cloud Storage Bucket

In [None]:
delete_dataset = True
delete_model = True
delete_batch = True
delete_bucket = True

# Delete the dataset using the AI Platform (Unified) fully qualified identifier for the dataset
try:
    if delete_dataset:
        clients['dataset'].delete_dataset(name=dataset['name'])
except Exception as e:
    print(e)

# Delete the model using the AI Platform (Unified) fully qualified identifier for the model
try:
    if delete_model:
        clients['model'].delete_model(name=model_to_deploy_name)
except Exception as e:
    print(e)

# Delete the batch prediction job using the AI Platform (Unified) fully qualified identifier for the batch job
try:
    if delete_batch:
        clients['job'].delete_batch_prediction_job(name=prediction_name)
except Exception as e:
    print(e)

if delete_bucket and 'BUCKET_NAME' in globals():
    ! gsutil rm -r gs://$BUCKET_NAME