# Create, train, and deploy a text classification model on Vertex AI

This notebook walks you through the major phases of building and using a text classification model on Vertex AI. Specifically, this notebook demonstrates how to:

* Set up your development environment
* Create a dataset and import data
* Train a model
* Get and review evaluations for the model
* Deploy a model to an endpoint
* Get online predictions
* Get batch predictions

## Set up your development environment

This notebook uses the Python SDK for Vertex AI, which is distributed as the `google-cloud-aiplatform` package (developed in the `python-aiplatform` repository). You must first install the package into your development environment.

In [1]:
!pip install google-cloud-aiplatform

Collecting google-cloud-aiplatform
  Downloading google_cloud_aiplatform-0.8.0-py2.py3-none-any.whl (1.7 MB)
Installing collected packages: google-cloud-aiplatform
Successfully installed google-cloud-aiplatform-0.8.0


Next, you need to import the package into your development environment. 


In [2]:
from google.cloud import aiplatform

Finally, you must initialize the client library before you can send requests to the Vertex AI service. With the Python SDK, you initialize the client library as shown in the following cell. Be sure to provide the ID for your Google Cloud project in the `project` variable.

This notebook uses the `us-central1` Compute location, although you can change it to another location. 

In [4]:
project = '580378083368'  # TODO: Replace with your Google Cloud project ID
location = 'us-central1'

aiplatform.init(project=project, location=location)

## Create a dataset and import your data

The notebook uses the 'Happy Moments' dataset for demonstration purposes. You can change it to another text classification dataset that [conforms to the data preparation requirements](https://cloud.google.com/vertex-ai/docs/datasets/prepare-text#classification).

Using the Python SDK, you can create a dataset and import the dataset in one call to `TextDataset.create()`, as shown in the following cell.

This next step can take a while. The client library prints log messages that include the resource name of the new dataset. Consider copying the resource name somewhere you can find it later.

**Note**: You can close this tab while you wait for this operation to complete. 

In [5]:
src_uris = 'gs://cloud-ml-data/NL-classification/happiness.csv'
display_name = 'e2e-text-dataset'

In [9]:
ds = aiplatform.TextDataset.create(
    display_name=display_name,
    gcs_source=src_uris,
    import_schema_uri=aiplatform.schema.dataset.ioformat.text.single_label_classification,
    sync=True,
)

ds.wait()

INFO:google.cloud.aiplatform.datasets.dataset:Creating TextDataset
INFO:google.cloud.aiplatform.datasets.dataset:Create TextDataset backing LRO: projects/580378083368/locations/us-central1/datasets/2306836967725203456/operations/4298348844011225088
INFO:google.cloud.aiplatform.datasets.dataset:TextDataset created. Resource name: projects/580378083368/locations/us-central1/datasets/2306836967725203456
INFO:google.cloud.aiplatform.datasets.dataset:To use this TextDataset in another session:
INFO:google.cloud.aiplatform.datasets.dataset:ds = aiplatform.TextDataset('projects/580378083368/locations/us-central1/datasets/2306836967725203456')
INFO:google.cloud.aiplatform.datasets.dataset:Importing TextDataset data: projects/580378083368/locations/us-central1/datasets/2306836967725203456
INFO:google.cloud.aiplatform.datasets.dataset:Import TextDataset data backing LRO: projects/580378083368/locations/us-central1/datasets/2306836967725203456/operations/5334176758306439168
INFO:google.cloud.aipl

## Train your text classification model

Once your dataset has finished importing data, you are ready to train your model. To do this, you first need the full resource name of your dataset, where the full name has the format `projects/[YOUR_PROJECT]/locations/us-central1/datasets/[YOUR_DATASET_ID]`. If you don't have the resource name handy, you can list all of the datasets in your project using `TextDataset.list()`. 

As shown in the following code block, you can pass in the display name of your dataset in the call to `list()` to filter the results.


In [10]:
aiplatform.TextDataset.list(filter=f'display_name="{display_name}"')

[<google.cloud.aiplatform.datasets.text_dataset.TextDataset object at 0x7f44267ebf50> 
 resource name: projects/580378083368/locations/us-central1/datasets/2306836967725203456]
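
The resource name format described above can be assembled or taken apart with plain string handling. The following is a minimal sketch. The helper names `dataset_resource_name` and `parse_dataset_resource_name` are hypothetical and are not part of the Vertex AI SDK.

```python
import re

# Hypothetical helpers (not part of the SDK) for the resource name format
# projects/[YOUR_PROJECT]/locations/[LOCATION]/datasets/[YOUR_DATASET_ID]
_DATASET_NAME = re.compile(
    r"^projects/(?P<project>[^/]+)"
    r"/locations/(?P<location>[^/]+)"
    r"/datasets/(?P<dataset_id>[^/]+)$"
)


def dataset_resource_name(project: str, location: str, dataset_id: str) -> str:
    """Build the full resource name of a dataset."""
    return f"projects/{project}/locations/{location}/datasets/{dataset_id}"


def parse_dataset_resource_name(name: str) -> dict:
    """Split a full dataset resource name into project, location, and ID."""
    match = _DATASET_NAME.match(name)
    if match is None:
        raise ValueError(f"Not a dataset resource name: {name!r}")
    return match.groupdict()
```

Parsing the resource name printed by `TextDataset.list()` is one way to recover the bare dataset ID that the training step uses.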

Now you can begin training your model. Training the model is a two-part process:

1. **Define the training job.** You must provide a display name and the type of training you want when you define the training job.
2. **Run the training job.** When you run the training job, you need to supply a reference to the dataset to use for training. At this step, you can also configure the train/test/validate split percentages.

You do not need to specify train/test/validate splits. The training job has a default setting of 80%/10%/10% if you don't provide these values.
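
If you do set the splits yourself, a quick local check can catch a typo before you start a long-running job. The sketch below assumes that the three fractions are each between 0 and 1 and are expected to sum to 1. The function name `check_fraction_splits` is hypothetical and is not part of the SDK.

```python
import math


def check_fraction_splits(training: float, validation: float, test: float) -> None:
    """Raise ValueError unless the three fractions are in [0, 1] and sum to 1.

    Assumption: the splits are expected to add up to exactly 1.0.
    """
    fractions = {"training": training, "validation": validation, "test": test}
    for name, value in fractions.items():
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} fraction must be between 0 and 1, got {value}")
    total = sum(fractions.values())
    # Compare with a tolerance because 0.7 + 0.2 + 0.1 is not exactly 1.0 in floating point
    if not math.isclose(total, 1.0, abs_tol=1e-9):
        raise ValueError(f"Fractions must sum to 1.0, got {total}")


# The default 80%/10%/10% split passes the check
check_fraction_splits(0.8, 0.1, 0.1)
```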

As with importing data into the dataset, training your model can take a substantial amount of time. The client library prints out operation status messages while the training pipeline operation processes. You must wait for the training process to complete before you can get the resource name and ID of your new model.

**Note**: You can close this tab while you wait for the operation to complete.

In [11]:
# Define the training job
training_job_display_name = 'e2e-text-training-job'
job = aiplatform.AutoMLTextTrainingJob(
    display_name=training_job_display_name,
    prediction_type='classification',
    multi_label=False
)

In [12]:
# Run the training job
dataset_id = '2306836967725203456'  # TODO: Replace with your dataset ID
model_display_name = 'e2e-text-classification-model'

text_dataset = aiplatform.TextDataset(dataset_id)

model = job.run(
    dataset=text_dataset,
    model_display_name=model_display_name,
    training_fraction_split=0.7,
    validation_fraction_split=0.2,
    test_fraction_split=0.1,
    sync=False
)

INFO:google.cloud.aiplatform.training_jobs:View Training:
https://console.cloud.google.com/ai/platform/locations/us-central1/training/1624805506969239552?project=580378083368
INFO:google.cloud.aiplatform.training_jobs:AutoMLTextTrainingJob projects/580378083368/locations/us-central1/trainingPipelines/1624805506969239552 current state:
PipelineState.PIPELINE_STATE_RUNNING
INFO:google.cloud.aiplatform.training_jobs:AutoMLTextTrainingJob projects/580378083368/locations/us-central1/trainingPipelines/1624805506969239552 current state:
PipelineState.PIPELINE_STATE_RUNNING
INFO:google.cloud.aiplatform.training_jobs:AutoMLTextTrainingJob projects/580378083368/locations/us-central1/trainingPipelines/1624805506969239552 current state:
PipelineState.PIPELINE_STATE_RUNNING
INFO:google.cloud.aiplatform.training_jobs:AutoMLTextTrainingJob projects/580378083368/locations/us-central1/trainingPipelines/1624805506969239552 current state:
PipelineState.PIPELINE_STATE_RUNNING
INFO:google.cloud.aiplatform.

## Get and review model evaluation scores

After your model has finished training, you can review the evaluation scores for it.

First, you need to get the resource name of your new model. To get the resource name, list all of the models in your project. As before with datasets, you can provide filter criteria to narrow down your search.

In [13]:
aiplatform.Model.list(filter='display_name="e2e-text-classification-model"')

[<google.cloud.aiplatform.models.Model object at 0x7f4425f89110> 
 resource name: projects/580378083368/locations/us-central1/models/8638449443063988224]

Using the model name (in the format `projects/[PROJECT_NAME]/locations/us-central1/models/[MODEL_ID]`), you can get its model evaluations. To get model evaluations, you must use the underlying service client.

Building a service client requires that you provide the name of the regionalized hostname used for your model. In this tutorial, the hostname is `us-central1-aiplatform.googleapis.com` because the model was created in the `us-central1` location.

In [16]:
model_name = "projects/580378083368/locations/us-central1/models/8638449443063988224"  # TODO: Replace with your model resource name
client_options = {"api_endpoint": "us-central1-aiplatform.googleapis.com"}
model_service_client = aiplatform.gapic.ModelServiceClient(client_options=client_options)
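
If you work in more than one location, you can derive the hostname from the location string instead of hard-coding it. This is a minimal sketch that generalizes the `us-central1` hostname used in this tutorial. It assumes other locations follow the same `[LOCATION]-aiplatform.googleapis.com` pattern, and the function name `regional_api_endpoint` is hypothetical.

```python
def regional_api_endpoint(location: str) -> str:
    """Return the regional hostname for a location.

    Assumption: every location uses the pattern shown in this tutorial.
    """
    if not location:
        raise ValueError("location must be a non-empty string")
    return f"{location}-aiplatform.googleapis.com"


# Build the client options dict in the same shape used in this tutorial
client_options_for_location = {"api_endpoint": regional_api_endpoint("us-central1")}
print(client_options_for_location)
```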

Before you can view the model evaluation, you must first list all of the evaluations for that model. Each model can have multiple evaluations, although a new model is likely to have only one.

In [17]:
model_evaluations = model_service_client.list_model_evaluations(parent=model_name)
model_evaluation = list(model_evaluations)[0]

Now that you have the model evaluation, you can look at your model's scores. If you have questions about what the scores mean, review the [public documentation](https://cloud.google.com/vertex-ai/docs/training/evaluating-automl-models#text).

The results returned from the service are formatted as [`google.protobuf.Value`](https://googleapis.dev/python/protobuf/latest/google/protobuf/struct_pb2.html) objects. You can convert the returned object to a `dict` for easier reading and parsing.

In [18]:
from google.protobuf import json_format

# Convert the protobuf message to a plain dict for easier access
model_eval_dict = json_format.MessageToDict(model_evaluation._pb)
metrics = model_eval_dict['metrics']
confidence_metrics = metrics['confidenceMetrics']

print(f'Area under precision-recall curve (AuPRC): {metrics["auPrc"]}')
# Each entry holds the scores computed at one confidence threshold
for confidence_scores in confidence_metrics:
    print('\n')
    for metric_name, value in confidence_scores.items():
        print(f'\t{metric_name}: {value}')

Area under precision-recall curve (AuPRC): 0.958425


	recall: 1.0
	precisionAt1: 0.88270044
	recallAt1: 0.88270044
	f1Score: 0.25
	f1ScoreAt1: 0.88270044
	precision: 0.14285715


	recall: 0.97890294
	precision: 0.7030303
	confidenceThreshold: 0.05
	f1Score: 0.81834215
	precisionAt1: 0.88270044
	recallAt1: 0.88270044
	f1ScoreAt1: 0.88270044


	precision: 0.7623498
	f1Score: 0.8512859
	recallAt1: 0.88270044
	f1ScoreAt1: 0.88270044
	confidenceThreshold: 0.1
	precisionAt1: 0.88270044
	recall: 0.9637131


	recall: 0.9535865
	f1Score: 0.86656445
	f1ScoreAt1: 0.88270044
	confidenceThreshold: 0.15
	precisionAt1: 0.88270044
	recallAt1: 0.88270044
	precision: 0.794097


	f1ScoreAt1: 0.88270044
	confidenceThreshold: 0.2
	precisionAt1: 0.88270044
	precision: 0.8136464
	recallAt1: 0.88270044
	recall: 0.935865
	f1Score: 0.8704867


	f1Score: 0.8747007
	recallAt1: 0.88270044
	recall: 0.9248945
	precision: 0.8296745
	f1ScoreAt1: 0.88270044
	confidenceThreshold: 0.25
	precisionAt1: 0.88270044


	recal
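
One way to use these scores is to find the confidence threshold with the highest F1 score. The sketch below works on a list of dicts shaped like `confidence_metrics`. The sample values are copied from the truncated output above, so they cover only the first six thresholds. The function name `best_f1_entry` is hypothetical, and treating a missing `confidenceThreshold` as 0.0 is an assumption.

```python
def best_f1_entry(confidence_metrics: list) -> dict:
    """Return the entry with the highest f1Score."""
    if not confidence_metrics:
        raise ValueError("confidence_metrics is empty")
    return max(confidence_metrics, key=lambda entry: entry.get("f1Score", 0.0))


# Values copied from the visible part of the output above
sample_confidence_metrics = [
    {"recall": 1.0, "precision": 0.14285715, "f1Score": 0.25},
    {"confidenceThreshold": 0.05, "recall": 0.97890294, "precision": 0.7030303, "f1Score": 0.81834215},
    {"confidenceThreshold": 0.1, "recall": 0.9637131, "precision": 0.7623498, "f1Score": 0.8512859},
    {"confidenceThreshold": 0.15, "recall": 0.9535865, "precision": 0.794097, "f1Score": 0.86656445},
    {"confidenceThreshold": 0.2, "recall": 0.935865, "precision": 0.8136464, "f1Score": 0.8704867},
    {"confidenceThreshold": 0.25, "recall": 0.9248945, "precision": 0.8296745, "f1Score": 0.8747007},
]

best = best_f1_entry(sample_confidence_metrics)
# Assumption: an entry without confidenceThreshold corresponds to a threshold of 0.0
print(best.get("confidenceThreshold", 0.0), best["f1Score"])
```

In the notebook, you would pass the full `confidence_metrics` list rather than this sample.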

## Deploy your text classification model

Once your model has completed training, you must deploy it to an _endpoint_ to get online predictions from it. When you deploy the model to an endpoint, a copy of the model is made on the endpoint with a new resource name and display name.

You can deploy multiple models to the same endpoint and split traffic between the various models assigned to the endpoint. However, you must deploy one model at a time to the endpoint. To change the traffic split percentages, you must assign new values on your second (and subsequent) models each time you deploy a new model.
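
Before you assign new traffic split values, it can help to check them locally. The sketch below assumes a traffic split is a mapping from deployed model ID to an integer percentage, and that the percentages must total 100. The function name `check_traffic_split` and the deployed model IDs are hypothetical placeholders.

```python
def check_traffic_split(traffic_split: dict) -> None:
    """Raise ValueError unless the percentages are integers in [0, 100] totaling 100.

    Assumption: keys are deployed model IDs and values are integer percentages.
    """
    for deployed_model_id, percent in traffic_split.items():
        if not isinstance(percent, int) or not 0 <= percent <= 100:
            raise ValueError(
                f"Invalid percentage for deployed model {deployed_model_id}: {percent!r}"
            )
    total = sum(traffic_split.values())
    if total != 100:
        raise ValueError(f"Traffic percentages must total 100, got {total}")


# Hypothetical deployed model IDs: send 80% of traffic to one model and 20% to another
check_traffic_split({"1111": 80, "2222": 20})
```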

The following code block demonstrates how to deploy a model. The code snippet relies on the Python SDK to create a new endpoint for deployment. During deployment, the Python SDK prints out the name of your new endpoint. Consider recording the endpoint name for future reference.

In [19]:
deployed_model_display_name="e2e-deployed-text-classification-model"

model.deploy(
    deployed_model_display_name=deployed_model_display_name,
    sync=True
)

model.wait()

INFO:google.cloud.aiplatform.models:Creating Endpoint
INFO:google.cloud.aiplatform.models:Create Endpoint backing LRO: projects/580378083368/locations/us-central1/endpoints/9059975812874240/operations/8809389966078509056
INFO:google.cloud.aiplatform.models:Endpoint created. Resource name: projects/580378083368/locations/us-central1/endpoints/9059975812874240
INFO:google.cloud.aiplatform.models:To use this Endpoint in another session:
INFO:google.cloud.aiplatform.models:endpoint = aiplatform.Endpoint('projects/580378083368/locations/us-central1/endpoints/9059975812874240')
INFO:google.cloud.aiplatform.models:Deploying model to Endpoint : projects/580378083368/locations/us-central1/endpoints/9059975812874240
INFO:google.cloud.aiplatform.models:Deploy Endpoint model backing LRO: projects/580378083368/locations/us-central1/endpoints/9059975812874240/operations/252550674074566656
INFO:google.cloud.aiplatform.models:Endpoint model deployed. Resource name: projects/580378083368/locations/us-c

If you didn't record the name of the new endpoint, you can list all of your endpoints as you did before with datasets and models. For each endpoint, you can list the models deployed to it. To find the endpoint that hosts the model you just deployed, compare the `display_name` of each deployed model against the display name you used during deployment.

In [20]:
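endpoints = aiplatform.Endpoint.list()

endpoint_with_deployed_model = []

# Collect every endpoint that hosts a deployed model with a matching display name.
# The loop variable is named deployed_model so it does not overwrite `model`.
for endpoint in endpoints:
    for deployed_model in endpoint.list_models():
        if deployed_model.display_name.startswith(deployed_model_display_name):
            endpoint_with_deployed_model.append(endpoint)

print(endpoint_with_deployed_model)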

[<google.cloud.aiplatform.models.Endpoint object at 0x7f4425f08e90> 
resource name: projects/580378083368/locations/us-central1/endpoints/9059975812874240]


## Get online predictions from your model

Now that you have your endpoint's resource name, you can get online predictions from the text classification model. To get the online prediction, you send a prediction request to your endpoint.

In [21]:
endpoint_name = "projects/580378083368/locations/us-central1/endpoints/9059975812874240"  # TODO: Replace with your endpoint resource name
endpoint = aiplatform.Endpoint(endpoint_name)
content = "I got a high score on my math final!"

response = endpoint.predict(instances=[{"content": content}])

for prediction_ in response.predictions:
    ids = prediction_["ids"]
    display_names = prediction_["displayNames"]
    confidence_scores = prediction_["confidences"]
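    # The ID, display name, and confidence of each label share the same index
    for label_id, label_name, confidence in zip(ids, display_names, confidence_scores):
        print(f"Prediction ID: {label_id}")
        print(f"Prediction display name: {label_name}")
        print(f"Prediction confidence score: {confidence}")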

Prediction ID: 344988265889136640
Prediction display name: affection
Prediction confidence score: 0.000125991296954453
Prediction ID: 6397826165075083264
Prediction display name: achievement
Prediction confidence score: 0.9996910095214844
Prediction ID: 2650831275102830592
Prediction display name: enjoy_the_moment
Prediction confidence score: 0.00011234146950300783
Prediction ID: 8703669174288777216
Prediction display name: bonding
Prediction confidence score: 1.217706721945433e-05
Prediction ID: 7262517293530218496
Prediction display name: leisure
Prediction confidence score: 5.6070250138873234e-05
Prediction ID: 4091983155861389312
Prediction display name: nature
Prediction confidence score: 4.2392474597363616e-07
Prediction ID: 4956674284316524544
Prediction display name: exercise
Prediction confidence score: 2.0062395833519986e-06
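
In most applications you only need the label with the highest confidence score. The following sketch picks that label from a single prediction dict with the same `displayNames` and `confidences` keys used above. The function name `top_label` is hypothetical, and the sample values are a subset of the output above.

```python
def top_label(prediction: dict) -> tuple:
    """Return (display_name, confidence) for the highest-confidence label."""
    names = prediction["displayNames"]
    confidences = prediction["confidences"]
    if not confidences:
        raise ValueError("prediction contains no labels")
    # Pair each confidence with its label and keep the largest pair
    confidence, name = max(zip(confidences, names))
    return name, confidence


# A subset of the prediction printed above
sample_prediction = {
    "displayNames": ["affection", "achievement", "enjoy_the_moment"],
    "confidences": [0.000125991296954453, 0.9996910095214844, 0.00011234146950300783],
}

print(top_label(sample_prediction))
```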


## Get batch predictions from your model

You can get batch predictions from a text classification model without deploying it. You must first format all of your prediction instances (prediction input) in JSONL format and you must store the JSONL file in a Google Cloud Storage bucket. You must also provide a Google Cloud Storage bucket to hold your prediction output.

To start, you must first create your predictions input file in JSONL format. Each line in the JSONL document needs to be formatted like so:

```
{ "content": "gs://sourcebucket/datasets/texts/source_text.txt", "mimeType": "text/plain"}
```

The `content` field in the JSON structure must be a Google Cloud Storage URI to another document that contains the text input for prediction.
[See the documentation for more information.](https://cloud.google.com/ai-platform-unified/docs/predictions/batch-predictions#text)
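
The JSONL format described above can be produced with the standard `json` module, which guarantees that each line is valid JSON with double-quoted strings. This is a minimal sketch. The function name `to_jsonl` is hypothetical, and the second URI is a made-up placeholder.

```python
import json


def to_jsonl(uris: list, mime_type: str = "text/plain") -> str:
    """Return one JSON object per line, each pointing at a text file in Cloud Storage."""
    lines = [json.dumps({"content": uri, "mimeType": mime_type}) for uri in uris]
    return "".join(line + "\n" for line in lines)


jsonl_text = to_jsonl([
    "gs://sourcebucket/datasets/texts/source_text.txt",
    "gs://sourcebucket/datasets/texts/another_text.txt",  # hypothetical second file
])
print(jsonl_text, end="")
```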

In [58]:
instances = [
    "We hiked through the woods and up the hill to the ice caves",
    "My kitten is so cute"
]

For this tutorial, you can create a new set of input files using an existing Google Cloud Storage bucket. You need to provide the name of the bucket, without the `gs://` prefix. For batch prediction, you must use a standard regional bucket.

In [60]:
!pip install google-cloud-storage 



In [83]:
import uuid

gcs_bucket_uri = "e2e-text-notebook"  # TODO: Replace with your bucket name (no gs:// prefix)
gcs_prefix = str(uuid.uuid4()).replace("-", "")[0:15]
gcs_uri = f"gs://{gcs_bucket_uri}/{gcs_prefix}"
input_file_name = "batch-prediction-input.jsonl"

print(gcs_uri)

gs://e2e-text-notebook/a7de3c8b6aab4a9
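
The storage client works with a bucket name and an object path, while the batch prediction job takes full `gs://` URIs. If you start from a full URI, you can split it into the two parts. This is a minimal sketch with a hypothetical function name, `split_gcs_uri`.

```python
def split_gcs_uri(uri: str) -> tuple:
    """Split gs://bucket/path/to/object into (bucket, path/to/object)."""
    scheme = "gs://"
    if not uri.startswith(scheme):
        raise ValueError(f"Not a Cloud Storage URI: {uri!r}")
    # Everything up to the first slash is the bucket; the rest is the object path
    bucket_name, _, object_path = uri[len(scheme):].partition("/")
    if not bucket_name:
        raise ValueError(f"Missing bucket name in URI: {uri!r}")
    return bucket_name, object_path


print(split_gcs_uri("gs://e2e-text-notebook/a7de3c8b6aab4a9"))
```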


In [84]:
from google.cloud import storage

# Use a distinct name so the client does not shadow the `storage` module
storage_client = storage.Client()
bucket = storage_client.bucket(gcs_bucket_uri)

In [85]:
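import json

input_file_data = []
input_str = ""
for count, instance in enumerate(instances):
    tmp_data = {
        "content": f"{gcs_uri}/input_{count}.txt",
        "mimeType": "text/plain"
    }
    input_file_data.append(tmp_data)
    # Upload the raw text that this JSONL entry points to
    blob = bucket.blob(f"{gcs_prefix}/input_{count}.txt")
    blob.upload_from_string(instance)
    # Serialize with json.dumps so each line is valid JSON;
    # str() on a dict emits single-quoted keys, which is not valid JSON
    input_str += json.dumps(tmp_data) + "\n"

print(input_file_data)

# Upload the JSONL file that lists every input document
file_blob = bucket.blob(f"{gcs_prefix}/{input_file_name}")
file_blob.upload_from_string(input_str)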


[{'content': 'gs://e2e-text-notebook/a7de3c8b6aab4a9/input_0.txt', 'mimeType': 'text/plain'}, {'content': 'gs://e2e-text-notebook/a7de3c8b6aab4a9/input_1.txt', 'mimeType': 'text/plain'}]


In [88]:
job_display_name = "e2e-batch-prediction-job"
model = aiplatform.Model(model_name=model_name)

batch_prediction_job = model.batch_predict(
    job_display_name=job_display_name,
    gcs_source=f"{gcs_uri}/{input_file_name}",
    gcs_destination_prefix=gcs_uri,
    sync=True,
)

batch_prediction_job.wait()

INFO:google.cloud.aiplatform.jobs:Creating BatchPredictionJob
INFO:google.cloud.aiplatform.jobs:BatchPredictionJob created. Resource name: projects/1025771077852/locations/us-central1/batchPredictionJobs/4090737409187119104
INFO:google.cloud.aiplatform.jobs:To use this BatchPredictionJob in another session:
INFO:google.cloud.aiplatform.jobs:bpj = aiplatform.BatchPredictionJob('projects/1025771077852/locations/us-central1/batchPredictionJobs/4090737409187119104')
INFO:google.cloud.aiplatform.jobs:View Batch Prediction Job:
https://console.cloud.google.com/ai/platform/locations/us-central1/batch-predictions/4090737409187119104?project=1025771077852
INFO:google.cloud.aiplatform.jobs:BatchPredictionJob projects/1025771077852/locations/us-central1/batchPredictionJobs/4090737409187119104 current state:
JobState.JOB_STATE_PENDING
INFO:google.cloud.aiplatform.jobs:BatchPredictionJob  run. Resource name: projects/1025771077852/locations/us-central1/batchPredictionJobs/4090737409187119104
