In [None]:
# Copyright 2022 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Vertex AI SDK: Training an AutoML text sentiment analysis model for online predictions

<table align="left">
  <td>
    <a href="https://colab.research.google.com/github/GoogleCloudPlatform/vertex-ai-samples/blob/main/notebooks/official/automl/sdk_automl_text_sentiment_analysis_online.ipynb">
      <img src="https://cloud.google.com/ml-engine/images/colab-logo-32px.png" alt="Colab logo"> Run in Colab
    </a>
  </td>
  <td>
    <a href="https://github.com/GoogleCloudPlatform/vertex-ai-samples/blob/main/notebooks/official/automl/sdk_automl_text_sentiment_analysis_online.ipynb">
      <img src="https://cloud.google.com/ml-engine/images/github-logo-32px.png" alt="GitHub logo">
      View on GitHub
    </a>
  </td>
  <td>
<a href="https://console.cloud.google.com/vertex-ai/workbench/deploy-notebook?download_url=https://raw.githubusercontent.com/GoogleCloudPlatform/vertex-ai-samples/main/notebooks/official/automl/sdk_automl_text_sentiment_analysis_online.ipynb" target='_blank'>
      <img src="https://lh3.googleusercontent.com/UiNooY4LUgW_oTvpsNhPpQzsstV5W8F7rYgxgGBD85cWJoLmrOzhVs_ksK_vgx40SHs7jCqkTkCk=e14-rj-sc0xffffff-h130-w32" alt="Vertex AI logo">
      Open in Vertex AI Workbench
    </a>
  </td>
</table>
<br/><br/><br/>

## Overview

This tutorial demonstrates how to use the Vertex AI SDK to train and deploy an [AutoML](https://cloud.google.com/vertex-ai/docs/start/automl-users) text sentiment analysis model and get online predictions from it.

Learn more about [Sentiment analysis for text data](https://cloud.google.com/vertex-ai/docs/training-overview#sentiment_analysis_for_text).

### Objective

In this tutorial, you learn how to create an AutoML text sentiment analysis model and deploy it for online predictions from a Python script using the Vertex AI SDK. You can alternatively create and deploy models using the `gcloud` command-line tool or online using the Cloud Console.

This tutorial uses the following Google Cloud ML services and resources:
- Vertex AI Datasets
- Vertex AI Training (AutoML)
- Vertex AI Model Registry
- Vertex AI Endpoints

The steps performed include:

- Create a `Vertex AI Dataset` resource.
- Create a training job for the AutoML model on the dataset.
- View the model evaluation metrics.
- Deploy the `Vertex AI Model` resource to a serving `Vertex AI Endpoint`.
- Make a prediction request to the deployed model.
- Undeploy the model from the endpoint.
- Clean up the resources created in this tutorial.

### Dataset

The dataset used for this tutorial is the [Crowdflower Claritin-Twitter dataset](https://data.world/crowdflower/claritin-twitter), which consists of tweets tagged with sentiment, the author's gender, and whether or not they mention any of the top 10 adverse events reported to the FDA. The version of the dataset you use in this tutorial is stored in a public Cloud Storage bucket. In this tutorial, you use the tweets data to build an AutoML text sentiment analysis model on Google Cloud.

### Costs

This tutorial uses billable components of Google Cloud:

* Vertex AI
* Cloud Storage

Learn about [Vertex AI
pricing](https://cloud.google.com/vertex-ai/pricing) and [Cloud Storage
pricing](https://cloud.google.com/storage/pricing), and use the [Pricing
Calculator](https://cloud.google.com/products/calculator/)
to generate a cost estimate based on your projected usage.

### Set up your local development environment

**If you are using Colab or Vertex AI Workbench Notebooks**, your environment already meets
all the requirements to run this notebook. You can skip this step.

**Otherwise**, make sure your environment meets this notebook's requirements.
You need the following:

* The Google Cloud SDK

The Google Cloud guide to [Setting up a Python development
environment](https://cloud.google.com/python/setup) and the [Jupyter
installation guide](https://jupyter.org/install) provide detailed instructions
for meeting these requirements. The following steps provide a condensed set of
instructions:

1. [Install and initialize the Cloud SDK.](https://cloud.google.com/sdk/docs/)

1. [Install Python 3.](https://cloud.google.com/python/setup#installing_python)

1. [Install
   virtualenv](https://cloud.google.com/python/setup#installing_and_using_virtualenv)
   and create a virtual environment that uses Python 3. Activate the virtual environment.

1. To install Jupyter, run `pip3 install jupyter` on the
command-line in a terminal shell.

1. To launch Jupyter, run `jupyter notebook` on the command-line in a terminal shell.

1. Open this notebook in the Jupyter Notebook Dashboard.

## Installation

Install the latest versions of the Vertex AI SDK for Python and the Cloud Storage client library.

In [None]:
import os

# The Vertex AI Workbench Notebook product has specific requirements
IS_WORKBENCH_NOTEBOOK = os.getenv("DL_ANACONDA_HOME")
IS_USER_MANAGED_WORKBENCH_NOTEBOOK = os.path.exists(
    "/opt/deeplearning/metadata/env_version"
)

# Vertex AI Notebook requires dependencies to be installed with '--user'
USER_FLAG = ""
if IS_WORKBENCH_NOTEBOOK:
    USER_FLAG = "--user"

! pip3 install --upgrade google-cloud-aiplatform $USER_FLAG -q
! pip3 install -U google-cloud-storage $USER_FLAG -q

### Restart the kernel

After you install the additional packages, you need to restart the notebook kernel so it can find the packages.

In [None]:
# Automatically restart the kernel after installs (skipped under automated testing)
import os

if not os.getenv("IS_TESTING"):
    import IPython

    app = IPython.Application.instance()
    app.kernel.do_shutdown(True)

## Before you begin

### Set up your Google Cloud project

**The following steps are required, regardless of your notebook environment.**

1. [Select or create a Google Cloud project](https://console.cloud.google.com/cloud-resource-manager). When you first create an account, you get a $300 free credit towards your compute/storage costs.

1. [Make sure that billing is enabled for your project](https://cloud.google.com/billing/docs/how-to/modify-project).

1. [Enable the Vertex AI, Compute Engine, and Cloud Storage APIs.](https://console.cloud.google.com/flows/enableapi?apiid=ml.googleapis.com,compute_component,storage-component.googleapis.com)

1. If you are running this notebook locally, you need to install the [Cloud SDK](https://cloud.google.com/sdk).

1. Enter your project ID in the cell below. Then run the cell to make sure the
Cloud SDK uses the right project for all the commands in this notebook.

**Note**: Jupyter runs lines prefixed with `!` as shell commands, and it interpolates Python variables prefixed with `$` into these commands.

#### Set your project ID

**If you don't know your project ID**, you may be able to get your project ID using `gcloud`.

In [None]:
PROJECT_ID = "[your-project-id]"  # @param {type:"string"}

In [None]:
if PROJECT_ID == "" or PROJECT_ID is None or PROJECT_ID == "[your-project-id]":
    # Get your GCP project id from gcloud
    shell_output = ! gcloud config list --format 'value(core.project)' 2>/dev/null
    PROJECT_ID = shell_output[0]
    print("Project ID:", PROJECT_ID)

In [None]:
! gcloud config set project $PROJECT_ID

#### Region

You can also change the `REGION` variable, which is used for operations
throughout the rest of this notebook. The following regions are supported for Vertex AI. We recommend that you choose the region closest to you.

- Americas: `us-central1`
- Europe: `europe-west4`
- Asia Pacific: `asia-east1`

You may not use a multi-regional bucket for training with Vertex AI. Not all regions provide support for all Vertex AI services.

Learn more about [Vertex AI regions](https://cloud.google.com/vertex-ai/docs/general/locations).

In [None]:
REGION = "[your-region]"  # @param {type: "string"}

if REGION == "[your-region]":
    REGION = "us-central1"

#### UUID

If you are in a live tutorial session, you might be using a shared test account or project. To avoid name collisions between users on the resources created, you generate a unique identifier for each session and append it to the names of the resources you create in this tutorial.

In [None]:
import random
import string


# Generate a random identifier of a specified length (default=8)
def generate_uuid(length: int = 8) -> str:
    return "".join(random.choices(string.ascii_lowercase + string.digits, k=length))


UUID = generate_uuid()

### Authenticate your Google Cloud account

**If you are using Vertex AI Workbench Notebooks**, your environment is already
authenticated. 

**If you are using Colab**, run the cell below and follow the instructions
when prompted to authenticate your account via OAuth.

**Otherwise**, follow these steps:

1. In the Cloud Console, go to the [**Create service account key**
   page](https://console.cloud.google.com/apis/credentials/serviceaccountkey).

2. Click **Create service account**.

3. In the **Service account name** field, enter a name, and
   click **Create**.

4. In the **Grant this service account access to project** section, click the **Role** drop-down list. Type "Vertex AI"
into the filter box, and select
   **Vertex AI Administrator**. Type "Storage Object Admin" into the filter box, and select **Storage Object Admin**.

5. Click *Create*. A JSON file that contains your key downloads to your
local environment.

6. Enter the path to your service account key as the
`GOOGLE_APPLICATION_CREDENTIALS` variable in the cell below and run the cell.

In [None]:
# If you are running this notebook in Colab, run this cell and follow the
# instructions to authenticate your GCP account. This provides access to your
# Cloud Storage bucket and lets you submit training jobs and prediction
# requests.

import os
import sys

# If on Vertex AI Workbench, then don't execute this code
IS_COLAB = "google.colab" in sys.modules
if not os.path.exists("/opt/deeplearning/metadata/env_version") and not os.getenv(
    "DL_ANACONDA_HOME"
):
    if IS_COLAB:
        from google.colab import auth as google_auth

        google_auth.authenticate_user()

    # If you are running this notebook locally, replace the string below with the
    # path to your service account key and run this cell to authenticate your GCP
    # account.
    elif not os.getenv("IS_TESTING"):
        os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "[your-service-account-key-path]"

### Create a Cloud Storage bucket

**The following steps are required, regardless of your notebook environment.**

When you initialize your Vertex AI SDK, you provide a Cloud Storage bucket to the SDK to serve as a staging bucket for the session. 

Set the name of your Cloud Storage bucket below. It must be unique across all Cloud Storage buckets.

In [None]:
BUCKET_NAME = "[your-bucket-name]"  # @param {type:"string"}
BUCKET_URI = f"gs://{BUCKET_NAME}"

In [None]:
if BUCKET_NAME == "" or BUCKET_NAME is None or BUCKET_NAME == "[your-bucket-name]":
    BUCKET_NAME = PROJECT_ID + "-aip-" + UUID
    BUCKET_URI = f"gs://{BUCKET_NAME}"
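
Before creating the bucket, it can help to check the name against the Cloud Storage naming rules. The sketch below is a simplified check rather than an exhaustive one: it covers length (3 to 63 characters), the allowed character set, and the requirement that a name starts and ends with a letter or digit. The helper `is_valid_bucket_name` is a hypothetical name used for illustration and is not part of any Google Cloud library.

```python
import re

# Simplified pattern: 3-63 characters drawn from lowercase letters, digits,
# dashes, underscores, and dots, starting and ending with a letter or digit.
_BUCKET_NAME_RE = re.compile(r"[a-z0-9][a-z0-9._-]{1,61}[a-z0-9]")


def is_valid_bucket_name(name: str) -> bool:
    """Return True if `name` passes a basic Cloud Storage naming check.

    Hypothetical helper for illustration. It does not check global
    uniqueness or reserved prefixes such as "goog".
    """
    return _BUCKET_NAME_RE.fullmatch(name) is not None


print(is_valid_bucket_name("my-project-aip-ab12cd34"))
print(is_valid_bucket_name("[your-bucket-name]"))
```

A check like this catches an unreplaced placeholder early, before `gsutil mb` rejects it.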

**Only if your bucket doesn't already exist**: Run the following cell to create your Cloud Storage bucket.

In [None]:
! gsutil mb -l $REGION -p $PROJECT_ID $BUCKET_URI

Finally, validate access to your Cloud Storage bucket by examining its contents:

In [None]:
! gsutil ls -al $BUCKET_URI

### Import libraries

In [None]:
import google.cloud.aiplatform as aiplatform

### Initialize Vertex AI SDK for Python

Initialize the Vertex AI SDK for Python for your project, region, and corresponding bucket.

In [None]:
aiplatform.init(project=PROJECT_ID, location=REGION, staging_bucket=BUCKET_URI)

### Define the constants

Set the constants that you use in this tutorial.

In [None]:
# Set the location of the CSV index file in Cloud Storage.
IMPORT_FILE = "gs://cloud-samples-data/language/claritin.csv"
# Set the max. sentiment score
SENTIMENT_MAX = 4

## Take a quick peek at your data

This tutorial uses a version of the `Crowdflower Claritin-Twitter` dataset which is stored in a public Cloud Storage bucket, using a CSV index file.

Start by taking a quick peek at the data. Count the number of examples by counting the rows in the CSV index file (`wc -l`), and then print the first few rows.

In [None]:
FILE = IMPORT_FILE

count = ! gsutil cat $FILE | wc -l
print("Number of Examples", int(count[0]))

print("First 10 rows")
! gsutil cat $FILE | head
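
The same kind of inspection can be done in Python once rows are available as text. The following is a minimal sketch that parses rows with the standard `csv` module and counts how often each sentiment label occurs. The sample rows are made up for illustration and are not taken from the dataset; they assume a three-column layout of text, sentiment label, and maximum sentiment value.

```python
import csv
import io
from collections import Counter

# Made-up rows in the assumed layout: text, sentiment label, sentiment max.
sample_csv = io.StringIO(
    '"Claritin helped, finally some relief",4,4\n'
    "allergies are awful today,1,4\n"
    "took a claritin this morning,2,4\n"
)

label_counts = Counter()
for text, label, sentiment_max in csv.reader(sample_csv):
    # csv.reader keeps the comma inside the quoted tweet as part of the text.
    label_counts[int(label)] += 1

print(dict(label_counts))
```

Using `csv.reader` instead of splitting on commas matters for tweets, because the text itself often contains commas.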

## Create the Dataset

Now, create a `Vertex AI Dataset` resource using the `create` method of the `TextDataset` class, which takes the following parameters:

- `display_name`: The human readable name for the dataset resource.
- `gcs_source`: A list of one or more dataset index files to import the data items into the dataset resource.
- `import_schema_uri`: The data labeling schema for the data items.

This operation may take several minutes.

In [None]:
dataset = aiplatform.TextDataset.create(
    display_name="Crowdflower Claritin-Twitter" + "_" + UUID,
    gcs_source=[IMPORT_FILE],
    import_schema_uri=aiplatform.schema.dataset.ioformat.text.sentiment,
)

print(dataset.resource_name)

## Create and run training job

In this section, to train an AutoML model, you perform these steps:

1. Create a training job.
2. Run the job.

### Create a training job

An AutoML training job is created with the `AutoMLTextTrainingJob` class, with the following parameters:

- `display_name`: The human readable name for the training job resource.
- `prediction_type`: The type of task to train the model for.
  - `classification`: A text classification model.
  - `sentiment`: A text sentiment analysis model.
  - `extraction`: A text entity extraction model.
- `multi_label`: If a classification task, whether single (False) or multi-labeled (True).
- `sentiment_max`: If a sentiment analysis task, the maximum sentiment value.
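
To make the role of `sentiment_max` concrete: to my understanding, sentiment labels are integers from 0 to `sentiment_max` inclusive, and each value in that range should appear in the dataset. The sketch below checks a list of labels against that assumption. The helper `missing_sentiment_values` and the sample labels are hypothetical and only serve as an illustration.

```python
from typing import Iterable, List


def missing_sentiment_values(labels: Iterable[int], sentiment_max: int) -> List[int]:
    """Return the sentiment values in 0..sentiment_max absent from `labels`.

    Hypothetical helper for illustration. Raises ValueError if any label
    falls outside the expected range.
    """
    seen = set(labels)
    out_of_range = sorted(v for v in seen if v < 0 or v > sentiment_max)
    if out_of_range:
        raise ValueError(f"Labels outside 0..{sentiment_max}: {out_of_range}")
    return [v for v in range(sentiment_max + 1) if v not in seen]


# Made-up labels for illustration; the value 3 never occurs.
print(missing_sentiment_values([0, 1, 2, 4, 4, 1], sentiment_max=4))
```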

In [None]:
job = aiplatform.AutoMLTextTrainingJob(
    display_name="claritin_" + UUID,
    prediction_type="sentiment",
    sentiment_max=SENTIMENT_MAX,
)

print(job)

### Run the training job

Next, you run the training job by invoking the method `run`, with the following parameters:

- `dataset`: The `Dataset` resource to train the model.
- `model_display_name`: The human readable name for the trained model.
- `training_fraction_split`: The fraction of the dataset (between 0 and 1) to use for training.
- `test_fraction_split`: The fraction of the dataset (between 0 and 1) to use for test (holdout data).
- `validation_fraction_split`: The fraction of the dataset (between 0 and 1) to use for validation.
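
As a rough illustration of what the split fractions mean, the sketch below converts them into approximate example counts. It assumes all three fractions are given and add up to 1. The helper `split_counts` is hypothetical, and the service decides the actual assignment of examples to splits.

```python
import math
from typing import Dict


def split_counts(
    n_examples: int, training: float, validation: float, test: float
) -> Dict[str, int]:
    """Approximate per-split example counts for the given fractions.

    Hypothetical helper for illustration only.
    """
    if not math.isclose(training + validation + test, 1.0):
        raise ValueError("Split fractions must sum to 1.0")
    n_train = round(n_examples * training)
    n_val = round(n_examples * validation)
    # Assign the remainder to test so the counts add up to n_examples.
    return {
        "training": n_train,
        "validation": n_val,
        "test": n_examples - n_train - n_val,
    }


print(split_counts(1000, training=0.8, validation=0.1, test=0.1))
```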

When it completes, the `run` method returns the `Model` resource.

The execution of the training pipeline can take up to 180 minutes.

In [None]:
model = job.run(
    dataset=dataset,
    model_display_name="claritin_" + UUID,
    training_fraction_split=0.8,
    validation_fraction_split=0.1,
    test_fraction_split=0.1,
)

## Review model evaluation scores

Once your model training has finished, you can review the evaluation scores.

First, you need a reference to the newly created model. As with datasets, you can either use the `model` variable returned when you ran the training job, or list all of the models in your project and filter them.

In [None]:
# Get model resource ID
models = aiplatform.Model.list(filter="display_name=claritin_" + UUID)

# Get a reference to the Model Service client
client_options = {"api_endpoint": f"{REGION}-aiplatform.googleapis.com"}
model_service_client = aiplatform.gapic.ModelServiceClient(
    client_options=client_options
)

model_evaluations = model_service_client.list_model_evaluations(
    parent=models[0].resource_name
)
model_evaluation = list(model_evaluations)[0]
print(model_evaluation)

## Deploy the model

Next, deploy your model to serve online predictions. To deploy the model, invoke the `deploy` method of the model resource, which returns the endpoint that the model is deployed to.

**Note:** Typically, you create an endpoint beforehand and pass a reference to it during model deployment. If you don't pass an endpoint reference, the `deploy()` method creates an endpoint for you.

In [None]:
endpoint = model.deploy()

## Send online prediction requests

In this step, you prepare some test instances from the dataset and send an online prediction request to your deployed model.

### Create test instances

You use an arbitrary example out of the dataset as a test item. Don't be concerned that the example was likely used in training the model. It is just to demonstrate how to make a prediction.

In [None]:
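import csv

first_row = ! gsutil cat $IMPORT_FILE | head -n1
# Parse with the csv module so commas inside a quoted tweet don't split the text.
fields = next(csv.reader([first_row[0]]))
if len(fields) == 4:
    # Four columns: the first one is presumably the ML use (data split) assignment.
    _, test_item, test_label, sentiment_max = fields
else:
    test_item, test_label, sentiment_max = fields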

print(test_item, test_label)

### Make the prediction request

Now that your model is deployed to an endpoint, you can send online prediction requests to the endpoint resource.

#### Request format

Format each instance as a JSON object, as follows:

     { 'content': text_string }

Since the `predict()` method can take multiple instances, send your request as a list of one test instance.
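
The request format above can be sketched as a small helper that wraps any number of texts into the instance shape. The helper name `build_instances` and the example texts are hypothetical; only the `content` key comes from the format described in this section.

```python
from typing import Dict, Iterable, List


def build_instances(texts: Iterable[str]) -> List[Dict[str, str]]:
    """Wrap each text in the {"content": ...} shape described above."""
    return [{"content": text} for text in texts]


# Made-up example texts for illustration.
instances = build_instances(["Claritin worked great", "still sneezing all day"])
print(instances)
```

A list built this way can hold one instance or many, which matches how `predict()` accepts its input.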

#### Response

The response from the `predict()` call is a Python dictionary with the following entries:

- `ids`: The internally assigned unique identifiers for each prediction request.
- `sentiment`: The sentiment value.
- `deployed_model_id`: The Vertex AI identifier for the deployed `Model` resource which did the predictions.

In [None]:
instances_list = [{"content": test_item}]

prediction = endpoint.predict(instances_list)
print(prediction)

## Undeploy the model

After you explore the predictions, undeploy the model from the `Endpoint` resource. This deprovisions all compute resources and ends billing for the deployed model.

In [None]:
endpoint.undeploy_all()

## Cleaning up

To clean up all Google Cloud resources used in this project, you can [delete the Google Cloud
project](https://cloud.google.com/resource-manager/docs/creating-managing-projects#shutting_down_projects) you used for the tutorial.

Otherwise, you can delete the individual resources you created in this tutorial:

- Vertex AI Dataset
- Vertex AI Model
- Vertex AI Endpoint
- AutoML Training Job
- Cloud Storage Bucket (set `delete_bucket` to **True** to delete the bucket)

In [None]:
delete_bucket = False

# Delete the dataset using the Vertex dataset object
dataset.delete()

# Delete the model using the Vertex model object
model.delete()

# Delete the endpoint using the Vertex endpoint object
endpoint.delete()

# Delete the AutoML or Pipeline training job
job.delete()

# Delete the Cloud storage bucket
if delete_bucket or os.getenv("IS_TESTING"):
    ! gsutil rm -r $BUCKET_URI