In [None]:
# Copyright 2022 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Model Versioning with Vertex AI Model Registry


<table align="left">

  <td>
    <a href="https://colab.research.google.com/github/GoogleCloudPlatform/vertex-ai-samples/blob/main/notebooks/notebook_template.ipynb">
      <img src="https://cloud.google.com/ml-engine/images/colab-logo-32px.png" alt="Colab logo"> Run in Colab
    </a>
  </td>
  <td>
    <a href="https://github.com/GoogleCloudPlatform/vertex-ai-samples/blob/main/notebooks/notebook_template.ipynb">
      <img src="https://cloud.google.com/ml-engine/images/github-logo-32px.png" alt="GitHub logo">
      View on GitHub
    </a>
  </td>
  <td>
    <a href="https://console.cloud.google.com/vertex-ai/workbench/deploy-notebook?download_url=https://raw.githubusercontent.com/GoogleCloudPlatform/vertex-ai-samples/blob/main/notebooks/notebook_template.ipynb">
      <img src="https://lh3.googleusercontent.com/UiNooY4LUgW_oTvpsNhPpQzsstV5W8F7rYgxgGBD85cWJoLmrOzhVs_ksK_vgx40SHs7jCqkTkCk=e14-rj-sc0xffffff-h130-w32" alt="Vertex AI logo">
      Open in Vertex AI Workbench
    </a>
  </td>                                                                                               
</table>

## Overview

This notebook demonstrates the model versioning capabilities of Vertex AI Model Registry with BigQuery ML (BQML) models and custom models.

### Objective

In this tutorial, you learn how to manage your models using the Vertex AI SDK and Vertex AI Model Registry.

This tutorial uses the following Google Cloud ML services and resources:

- BigQuery
- Vertex AI Training
- Vertex AI Model Registry

The steps performed include:

- Preprocess data using Spark NLP and load it into BigQuery
- Train and register a logistic regression model using BQML
- Train and register a Naive Bayes classifier using scikit-learn
- Review and validate the BQML and scikit-learn models
- Nominate a champion and approve it for production by assigning it the `production` alias
- Deploy the default/production version of a Model resource

### Dataset

The [BBC](http://mlg.ucd.ie/datasets/bbc.html) dataset consists of 2225 documents from the BBC news website, corresponding to stories in five topical areas (business, entertainment, politics, sport, tech) from 2004-2005. Each article is stored in its own .txt file.


### Costs 

This tutorial uses billable components of Google Cloud:

* Vertex AI
* BigQuery
* Dataproc
* Cloud Storage

Learn about [Vertex AI
pricing](https://cloud.google.com/vertex-ai/pricing) and [Cloud Storage
pricing](https://cloud.google.com/storage/pricing), and use the [Pricing
Calculator](https://cloud.google.com/products/calculator/)
to generate a cost estimate based on your projected usage.

### Set up your local development environment

**If you are using Colab or Vertex AI Workbench Notebooks**, your environment already meets
all the requirements to run this notebook. You can skip this step.

**Otherwise**, make sure your environment meets this notebook's requirements.
You need the following:

* The Google Cloud SDK
* Git
* Python 3
* virtualenv
* Jupyter notebook running in a virtual environment with Python 3

The Google Cloud guide to [Setting up a Python development
environment](https://cloud.google.com/python/setup) and the [Jupyter
installation guide](https://jupyter.org/install) provide detailed instructions
for meeting these requirements. The following steps provide a condensed set of
instructions:

1. [Install and initialize the Cloud SDK.](https://cloud.google.com/sdk/docs/)

1. [Install Python 3.](https://cloud.google.com/python/setup#installing_python)

1. [Install
   virtualenv](https://cloud.google.com/python/setup#installing_and_using_virtualenv)
   and create a virtual environment that uses Python 3. Activate the virtual environment.

1. To install Jupyter, run `pip3 install jupyter` on the
command-line in a terminal shell.

1. To launch Jupyter, run `jupyter notebook` on the command-line in a terminal shell.

1. Open this notebook in the Jupyter Notebook Dashboard.

## Installation

Install the following packages required to execute this notebook.

In [None]:
import os

# The Vertex AI Workbench Notebook product has specific requirements
IS_WORKBENCH_NOTEBOOK = os.getenv("DL_ANACONDA_HOME")
IS_USER_MANAGED_WORKBENCH_NOTEBOOK = os.path.exists(
    "/opt/deeplearning/metadata/env_version"
)

# Vertex AI Notebook requires dependencies to be installed with '--user'
USER_FLAG = ""
if IS_WORKBENCH_NOTEBOOK:
    USER_FLAG = "--user"

! pip3 install --upgrade tensorflow google-cloud-bigquery google-cloud-aiplatform {USER_FLAG} -q --no-warn-conflicts

### Restart the kernel

After you install the additional packages, you need to restart the notebook kernel so it can find the packages.

In [None]:
# Automatically restart kernel after installs
import os

if not os.getenv("IS_TESTING"):
    import IPython

    app = IPython.Application.instance()
    app.kernel.do_shutdown(True)
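
After the kernel restarts, it can help to confirm that the packages are actually visible to the new kernel. The following is a minimal sketch using the standard library's `importlib.metadata`. The helper name `installed_versions` is hypothetical and not part of this tutorial.

```python
from importlib import metadata


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# Distribution names taken from the pip install command above
REQUIRED = ["tensorflow", "google-cloud-bigquery", "google-cloud-aiplatform"]

for name, version in installed_versions(REQUIRED).items():
    print(f"{name}: {version or 'NOT INSTALLED'}")
```

If a package is reported as not installed, rerun the installation cell and restart the kernel again.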

## Before you begin

### Set up your Google Cloud project

**The following steps are required, regardless of your notebook environment.**

1. [Select or create a Google Cloud project](https://console.cloud.google.com/cloud-resource-manager). When you first create an account, you get a $300 free credit towards your compute/storage costs.

1. [Make sure that billing is enabled for your project](https://cloud.google.com/billing/docs/how-to/modify-project).

1. [Enable APIs](https://console.cloud.google.com/flows/enableapi?apiid=iam.googleapis.com,aiplatform.googleapis.com,cloudresourcemanager.googleapis.com,artifactregistry.googleapis.com,dataproc.googleapis.com,cloudbuild.googleapis.com)

1. If you are running this notebook locally, you will need to install the [Cloud SDK](https://cloud.google.com/sdk).

1. Enter your project ID in the cell below. Then run the cell to make sure the
Cloud SDK uses the right project for all the commands in this notebook.

**Note**: Jupyter runs lines prefixed with `!` as shell commands, and it interpolates Python variables prefixed with `$` into these commands.

#### Set your project ID

**If you don't know your project ID**, you may be able to get your project ID using `gcloud`.

In [None]:
PROJECT_ID = "[your-project-id]"  # @param {type:"string"}

In [None]:
if PROJECT_ID == "" or PROJECT_ID is None or PROJECT_ID == "[your-project-id]":
    # Get your GCP project id from gcloud
    shell_output = ! gcloud config list --format 'value(core.project)' 2>/dev/null
    PROJECT_ID = shell_output[0]
    print("Project ID:", PROJECT_ID)

In [None]:
! gcloud config set project $PROJECT_ID

#### Region

You can also change the `REGION` variable, which is used for operations
throughout the rest of this notebook.  Below are regions supported for Vertex AI. We recommend that you choose the region closest to you.

- Americas: `us-central1`
- Europe: `europe-west4`
- Asia Pacific: `asia-east1`

You may not use a multi-regional bucket for training with Vertex AI. Not all regions provide support for all Vertex AI services.

Learn more about [Vertex AI regions](https://cloud.google.com/vertex-ai/docs/general/locations).

In [None]:
REGION = "[your-region]"  # @param {type: "string"}

if REGION == "[your-region]":
    REGION = "us-central1"

#### UUID

If you are in a live tutorial session, you might be using a shared test account or project. To avoid name collisions between users on the resources created, you generate a short random identifier for each session and append it to the names of the resources you create in this tutorial.

In [None]:
import random
import string


# Generate a uuid of a specified length (default=8)
def generate_uuid(length: int = 8) -> str:
    return "".join(random.choices(string.ascii_lowercase + string.digits, k=length))


UUID = generate_uuid()
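
Note that the helper above produces a random lowercase alphanumeric string rather than a standard RFC 4122 UUID. The sketch below repeats the helper so that it is self-contained and shows how such a suffix could be appended to a resource name. `unique_name` is a hypothetical helper introduced only for illustration; the tutorial itself builds the names with f-strings.

```python
import random
import string


def generate_uuid(length: int = 8) -> str:
    """Return a random string of lowercase letters and digits."""
    return "".join(random.choices(string.ascii_lowercase + string.digits, k=length))


def unique_name(base: str, suffix: str, sep: str = "-") -> str:
    """Append a session suffix to a base resource name (hypothetical helper)."""
    return f"{base}{sep}{suffix}"


suffix = generate_uuid()
print(unique_name("nlp-preprocess", suffix))
```

With 36 possible characters and 8 positions, collisions between a handful of tutorial participants are unlikely, though not impossible.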

### Authenticate your Google Cloud account

**If you are using Vertex AI Workbench Notebooks**, your environment is already
authenticated.

**If you are using Colab**, run the cell below and follow the instructions
when prompted to authenticate your account via OAuth.

**Otherwise**, follow these steps:

1. In the Cloud Console, go to the [**Create service account key**
   page](https://console.cloud.google.com/apis/credentials/serviceaccountkey).

2. Click **Create service account**.

3. In the **Service account name** field, enter a name, and
   click **Create**.

4. In the **Grant this service account access to project** section, click the **Role** drop-down list. Type each of
the following roles into the filter box and select it:

    * Artifact Registry Administrator
    * Artifact Registry Repository Administrator
    * BigQuery Admin
    * Compute Network Admin
    * Cloud Build Editor
    * Dataproc Administrator
    * Dataproc Worker
    * Service Account User
    * Service Usage Admin
    * Storage Admin
    * Storage Object Admin
    * Vertex AI Administrator


5. Click **Create**. A JSON file that contains your key downloads to your
local environment.

6. Enter the path to your service account key as the
`GOOGLE_APPLICATION_CREDENTIALS` variable in the cell below and run the cell.

In [None]:
# If you are running this notebook in Colab, run this cell and follow the
# instructions to authenticate your GCP account. This provides access to your
# Cloud Storage bucket and lets you submit training jobs and prediction
# requests.

import os
import sys

# If on Vertex AI Workbench, then don't execute this code
IS_COLAB = "google.colab" in sys.modules
if not os.path.exists("/opt/deeplearning/metadata/env_version") and not os.getenv(
    "DL_ANACONDA_HOME"
):
    if "google.colab" in sys.modules:
        from google.colab import auth as google_auth

        google_auth.authenticate_user()

    # If you are running this notebook locally, replace the string below with the
    # path to your service account key and run this cell to authenticate your GCP
    # account.
    elif not os.getenv("IS_TESTING"):
        %env GOOGLE_APPLICATION_CREDENTIALS ''

#### Get your project number

Now that the project ID is set, you get your corresponding project number.

In [None]:
shell_output = ! gcloud projects list --filter="PROJECT_ID:'{PROJECT_ID}'" --format='value(PROJECT_NUMBER)'
PROJECT_NUMBER = shell_output[0]
print("Project Number:", PROJECT_NUMBER)

### Create a Cloud Storage bucket

**The following steps are required, regardless of your notebook environment.**

Set the name of your Cloud Storage bucket below. It must be unique across all Cloud Storage buckets.

In [None]:
BUCKET_NAME = "[your-bucket-name]"  # @param {type:"string"}
BUCKET_URI = f"gs://{BUCKET_NAME}"

In [None]:
if BUCKET_NAME == "" or BUCKET_NAME is None or BUCKET_NAME == "[your-bucket-name]":
    BUCKET_NAME = PROJECT_ID + "-aip-" + UUID
    BUCKET_URI = f"gs://{BUCKET_NAME}"
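
Because the default bucket name is composed from the project ID and the session suffix, you may want to check it before creating the bucket. The sketch below covers only a subset of the Cloud Storage naming rules (names without dots: 3 to 63 characters, lowercase letters, digits, dashes and underscores, starting and ending with a letter or digit, and not starting with `goog`). The function name is hypothetical, and passing the check does not guarantee that the name is globally available.

```python
import re

# Subset of the Cloud Storage bucket naming rules, for names without dots
_BUCKET_NAME_RE = re.compile(r"[a-z0-9][a-z0-9_-]{1,61}[a-z0-9]")


def looks_like_valid_bucket_name(name: str) -> bool:
    """Return True if the name passes this simplified naming check."""
    return bool(_BUCKET_NAME_RE.fullmatch(name)) and not name.startswith("goog")


for candidate in ["my-project-aip-ab12cd34", "My-Bucket", "ab", "goog-bucket"]:
    print(candidate, looks_like_valid_bucket_name(candidate))
```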

**Only if your bucket doesn't already exist**: Run the following cell to create your Cloud Storage bucket.

In [None]:
! gsutil mb -l $REGION -p $PROJECT_ID $BUCKET_URI

Finally, validate access to your Cloud Storage bucket by examining its contents:

In [None]:
! gsutil ls -al $BUCKET_URI

#### Service Account

If you do not want to use your project's Compute Engine service account, set `SERVICE_ACCOUNT` to another service account ID.

In [None]:
SERVICE_ACCOUNT = "[your-service-account]"  # @param {type:"string"}

In [None]:
if (
    SERVICE_ACCOUNT == ""
    or SERVICE_ACCOUNT is None
    or SERVICE_ACCOUNT == "[your-service-account]"
):
    # Get your service account from gcloud
    if not IS_COLAB:
        shell_output = !gcloud auth list 2>/dev/null
        SERVICE_ACCOUNT = shell_output[2].replace("*", "").strip()

    else:  # IS_COLAB:
        shell_output = ! gcloud projects describe  $PROJECT_ID
        project_number = shell_output[-1].split(":")[1].strip().replace("'", "")
        SERVICE_ACCOUNT = f"{project_number}-compute@developer.gserviceaccount.com"

    print("Service Account:", SERVICE_ACCOUNT)
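
The cell above reads the service account from a fixed position (`shell_output[2]`) in the `gcloud auth list` output, which assumes that the active account is the first one listed. The following sketch searches for the line marked with `*` instead. The sample lines are illustrative rather than captured output, and `active_account` is a hypothetical helper.

```python
def active_account(auth_list_lines):
    """Return the account on the line marked with '*', or None if there is none."""
    for line in auth_list_lines:
        stripped = line.strip()
        if stripped.startswith("*"):
            return stripped.lstrip("*").strip()
    return None


# Illustrative output with two credentialed accounts, the second one active
sample = [
    "    Credentialed Accounts",
    "ACTIVE  ACCOUNT",
    "        first@example.com",
    "*       second@example.com",
]

print(active_account(sample))
```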

#### Set service account access

Run the following commands to grant your service account access to the bucket that you created in the previous step. You only need to run this step once per service account.

In [None]:
! gsutil iam ch serviceAccount:{SERVICE_ACCOUNT}:roles/storage.objectCreator $BUCKET_URI
! gsutil iam ch serviceAccount:{SERVICE_ACCOUNT}:roles/storage.objectViewer $BUCKET_URI

### Enabling Private Google Access for Dataproc Serverless

To execute Serverless Spark workloads, the VPC subnetwork must meet the [requirements](https://cloud.google.com/dataproc-serverless/docs/concepts/network) listed in Dataproc Serverless for Spark network configuration. In this tutorial, you use the `default` subnetwork and enable Private Google Access on it.

In [None]:
SUBNETWORK = "default"  # @param {type:"string"}

In [None]:
!gcloud compute networks subnets list --regions=$REGION --filter=$SUBNETWORK

In [None]:
!gcloud compute networks subnets update $SUBNETWORK \
--region=$REGION \
--enable-private-ip-google-access

In [None]:
!gcloud compute networks subnets describe $SUBNETWORK \
--region=$REGION \
--format="get(privateIpGoogleAccess)"

### Create and configure the Docker repository

You create a Docker repository in Artifact Registry for the custom Dataproc image that you build for NLP data preprocessing.

In [None]:
REPO_NAME = "vertex-ai-model-registry-demo"

In [None]:
!gcloud artifacts repositories create $REPO_NAME \
    --repository-format=docker \
    --location=$REGION \
    --description="vertex ai model registry spark docker repository"

### Set project template

You create a set of directories to organize your project locally.

In [None]:
DATA_PATH = "data"
SRC_PATH = "src"
BUILD_PATH = "build"
CONFIG_PATH = "config"

In [None]:
!mkdir -m 777 -p $DATA_PATH $SRC_PATH $BUILD_PATH $CONFIG_PATH

### Get input data

In the following code, you download and extract the tutorial dataset.

In [None]:
RAW_DATA_URI = "http://mlg.ucd.ie/files/datasets/bbc-fulltext.zip"

In [None]:
!rm -Rf {DATA_PATH}/raw 
!wget --no-parent {RAW_DATA_URI} --directory-prefix={DATA_PATH}/raw 
!unzip -qo {DATA_PATH}/raw/bbc-fulltext.zip -d {DATA_PATH}/raw && mv {DATA_PATH}/raw/bbc/* {DATA_PATH}/raw/
!rm -Rf {DATA_PATH}/raw/bbc-fulltext.zip {DATA_PATH}/raw/bbc

### Set BigQuery dataset

You create the BigQuery dataset for the tutorial.

In [None]:
LOCATION = REGION.split("-")[0]
BQ_DATASET = "bcc_sport"

! bq mk --location={LOCATION} --dataset {PROJECT_ID}:{BQ_DATASET}
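
The cell above derives the dataset location from the first part of the region name, which gives `us` for `us-central1`. For other regions the prefix may not be a valid BigQuery location. To our knowledge the BigQuery multi-regions are named `US` and `EU`, so `europe` would not match. The sketch below shows one possible mapping under that assumption. `bigquery_location` is a hypothetical helper that the tutorial does not use.

```python
def bigquery_location(region: str) -> str:
    """Map a Vertex AI region to a BigQuery location (assumed mapping)."""
    prefix = region.split("-")[0]
    # Assumption: only US and EU exist as BigQuery multi-regions
    multi_regions = {"us": "US", "europe": "EU"}
    # Fall back to the region itself, which BigQuery accepts as a regional location
    return multi_regions.get(prefix, region)


for region in ["us-central1", "europe-west4", "asia-east1"]:
    print(region, "->", bigquery_location(region))
```

If you use a region outside the Americas, check the BigQuery locations documentation before creating the dataset.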

### Import libraries

In [None]:
# General
import csv
import datetime as dt
import glob
import json
import os
import sys

import pandas as pd

pd.set_option("display.max_colwidth", 3000)

# Model Training
import tensorflow as tf
from google.cloud import aiplatform as vertex_ai
from google.cloud import bigquery

In [None]:
print("BigQuery library version:", bigquery.__version__)
print("Vertex AI library version:", vertex_ai.__version__)

### Set up variables

In [None]:
# General
STAGING_BUCKET = f"{BUCKET_URI}/jobs"
RAW_PATH = os.path.join(DATA_PATH, "raw")
DATAPROC_IMAGE_BUILD_PATH = os.path.join(BUILD_PATH, "dataproc_image")
PREPROCESS_DOCKERFILE_PATH = os.path.join(DATAPROC_IMAGE_BUILD_PATH, "Dockerfile")
DATAPROC_RUNTIME_IMAGE = "dataproc_serverless_custom_runtime"
IMAGE_TAG = "1.0.0"
DATAPROC_RUNTIME_CONTAINER_IMAGE = (
    f"gcr.io/{PROJECT_ID}/{DATAPROC_RUNTIME_IMAGE}:{IMAGE_TAG}"
)
INIT_PATH = os.path.join(SRC_PATH, "__init__.py")
MODULE_URI = f"{BUCKET_URI}/{SRC_PATH}"
VERTEX_AI_MODEL_ID = "text-classifier-model"

# Ingest
PREPARED_PATH = os.path.join(DATA_PATH, "prepared")
PREPARED_FILE = "prepared_data.csv"
PREPARED_FILE_PATH = os.path.join(PREPARED_PATH, PREPARED_FILE)
PREPARED_FILE_URI = f"{BUCKET_URI}/{PREPARED_FILE_PATH}"

# Preprocess
PREPROCESS_MODULE_PATH = os.path.join(SRC_PATH, "preprocess.py")
LEMMA_DICTIONARY_PATH = os.path.join(CONFIG_PATH, "lemmas.txt")
LEMMA_DICTIONARY_URI = f"{BUCKET_URI}/{CONFIG_PATH}/lemmas.txt"
PROCESS_PYTHON_FILE_URI = f"{MODULE_URI}/preprocess.py"
PROCESS_DATA_PATH = os.path.join(DATA_PATH, "processed")
BQ_OUTPUT_TABLE_URI = f"{BQ_DATASET}.news_processed_{UUID}"
PROCESS_DATA_URI = f"{BUCKET_URI}/{PROCESS_DATA_PATH}"
PROCESS_FILE_URI = f"{PROCESS_DATA_URI}/*.parquet"
PREPROCESS_BATCH_ID = f"nlp-preprocess-{UUID}"

# Training
TRAIN_NAIVE_MODULE_PATH = os.path.join(SRC_PATH, "train_naive.py")
NAIVE_TRAIN_JOB_NAME = f"naive_training_job_{UUID}"
TRAIN_VERSION = "scikit-learn-cpu.0-23"
NAIVE_TRAIN_CONTAINER_URI = (
    f"{REGION.split('-')[0]}-docker.pkg.dev/vertex-ai/training/{TRAIN_VERSION}:latest"
)
NAIVE_TRAIN_REQUIREMENTS = ["pyarrow", "fastparquet", "gcsfs"]
DEPLOY_VERSION = "sklearn-cpu.0-23"
NAIVE_DEPLOY_CONTAINER_URI = f"{REGION.split('-')[0]}-docker.pkg.dev/vertex-ai/prediction/{DEPLOY_VERSION}:latest"
NAIVE_MODEL_BASE_URI = f"{BUCKET_URI}/deliverables/naive"
NAIVE_MODEL_URI = f"{BUCKET_URI}/deliverables/naive/model"
NAIVE_METRICS_FILE_URI = f"{NAIVE_MODEL_URI}/metrics.json"

# Deployment
SERVING_BUILD_PATH = os.path.join(BUILD_PATH, "serving")
SERVING_APP_BUILD_PATH = os.path.join(SERVING_BUILD_PATH, "app")
SERVE_NAIVE_MODULE_PATH = os.path.join(SERVING_APP_BUILD_PATH, "main.py")
SERVE_REQUIREMENTS_PATH = os.path.join(SERVING_BUILD_PATH, "requirements.txt")
SERVE_DOCKERFILE_PATH = os.path.join(SERVING_BUILD_PATH, "Dockerfile")
SERVE_AUTH_PATH = os.path.join(SERVING_BUILD_PATH, "key.json")
SERVE_SCRIPT_PATH = os.path.join(SERVING_BUILD_PATH, "copy_model.sh")
SERVING_RUNTIME_IMAGE = "serving_custom_naive"
IMAGE_TAG = "1.0.0"
SERVING_NAIVE_RUNTIME_CONTAINER_IMAGE = (
    f"gcr.io/{PROJECT_ID}/{SERVING_RUNTIME_IMAGE}:{IMAGE_TAG}"
)
ENDPOINT_NAME = "text-classifier-endpoint"
DEPLOYED_MODEL_NAME = "naive-bayes-text-classifier"
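
The prebuilt training and prediction container URIs above follow the same pattern: the multi-region prefix of `REGION`, the `docker.pkg.dev/vertex-ai` path, the container kind, and a version string. The sketch below makes that composition explicit. `prebuilt_container_uri` is a hypothetical helper, and the version strings are the ones used in this notebook.

```python
def prebuilt_container_uri(region: str, kind: str, version: str) -> str:
    """Compose a prebuilt container URI the same way as the cell above."""
    multi_region = region.split("-")[0]  # for example "us" from "us-central1"
    return f"{multi_region}-docker.pkg.dev/vertex-ai/{kind}/{version}:latest"


train_uri = prebuilt_container_uri("us-central1", "training", "scikit-learn-cpu.0-23")
deploy_uri = prebuilt_container_uri("us-central1", "prediction", "sklearn-cpu.0-23")
print(train_uri)
print(deploy_uri)
```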

### Initialize Vertex AI SDK for Python

Initialize the Vertex AI SDK for Python for your project, region, and staging bucket.

In [None]:
vertex_ai.init(project=PROJECT_ID, location=REGION, staging_bucket=STAGING_BUCKET)

### Helpers

The following helpers prepare the raw data, run BigQuery queries, and read a metrics file from Cloud Storage.

In [None]:
def prepare_data(input_path: str, output_path: str, file_name: str):
    """
    This function prepares the data for the model registry demo.
    Args:
        input_path: The directory where the raw data is stored.
        output_path: The directory where the prepared data will be stored.
        file_name: The name of the file to be prepared.
    Returns:
        None
    """
    # Read folder names
    categories = [f.name for f in os.scandir(input_path) if f.is_dir()]

    # Create output directory if it doesn't exist
    os.makedirs(output_path, exist_ok=True)

    # Create output file; newline="" lets the csv module control line endings
    with open(os.path.join(output_path, file_name), "w", newline="") as output_file:
        csv_writer = csv.writer(output_file)
        csv_writer.writerow(["category", "text"])

        # For each category, read all files and write to output file
        for category in categories:
            # Read all files in category
            for filename in glob.glob(os.path.join(input_path, category, "*.txt")):
                # Read file; the `with` block closes it automatically
                with open(filename, "r") as input_file:
                    output_text = "".join([line.rstrip() for line in input_file])
                # Write to output file
                csv_writer.writerow([category, output_text])


def run_query(query):

    """
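    This function runs a query and waits for it to finish.
    Args:
        query: The query to be run.
    Returns:
        table: The DDL target table reported by the query job, if any.
        result: The iterator over the query result rows.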
    """

    # Construct a BigQuery client object.
    client = bigquery.Client(project=PROJECT_ID, location=LOCATION)

    # Run the query_job
    query_job = client.query(query)

    # Wait for the query to finish
    result = query_job.result()

    # Return table
    table = query_job.ddl_target_table

    return table, result


def read_metrics_file(metrics_file_uri):
    """
    This function reads metrics file on bucket
    Args:
      metrics_file_uri: The uri of the metrics file
    Returns:
      metrics_str: metrics string
    """

    # The `with` block closes the file, so no explicit close() is needed
    with tf.io.gfile.GFile(metrics_file_uri, "r") as metrics_file:
        metrics = metrics_file.read().replace("'", '"')
    return metrics
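
`read_metrics_file` swaps single quotes for double quotes, which suggests that the metrics file holds Python-dict-style text rather than strict JSON. That replacement breaks when a value itself contains an apostrophe. The sketch below shows a more tolerant way to parse such text. `parse_metrics` is a hypothetical helper and the sample strings are invented for illustration.

```python
import ast
import json


def parse_metrics(metrics_text: str) -> dict:
    """Parse metrics written either as JSON or as a Python dict literal."""
    try:
        return json.loads(metrics_text)
    except json.JSONDecodeError:
        # literal_eval only evaluates literals, so it does not execute code
        return ast.literal_eval(metrics_text)


dict_style = "{'accuracy': 0.93, 'f1': 0.91}"  # invented sample content
json_style = '{"accuracy": 0.93, "f1": 0.91}'
tricky = "{'note': \"it's fine\"}"  # an apostrophe inside a value

print(parse_metrics(dict_style))
print(parse_metrics(json_style))
print(parse_metrics(tricky))
```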

## Data Engineering with Dataproc Serverless

Before building an NLP machine learning model, a few preprocessing steps are commonly applied:

1.   Preliminaries such as sentence segmentation and word tokenization
2.   Frequent steps such as stop word removal, stemming and lemmatization, removing digits/punctuation, lowercasing, etc.

Other steps include normalization and language detection, as well as POS tagging and parsing.
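
To make these steps concrete, here is a toy pure-Python sketch of sentence segmentation, tokenization, normalization, stop word removal, and a very crude suffix-stripping stemmer. It is not Spark NLP and is far simpler than the annotators used later in this notebook. The stop word list and all function names are assumptions made for illustration.

```python
import re

# Tiny illustrative stop word list; real pipelines use much larger ones
STOP_WORDS = {"the", "a", "an", "is", "are", "of", "in", "and", "to"}


def split_sentences(text):
    """Naively split on ., ! or ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tokenize(sentence):
    """Split a sentence into word-like tokens."""
    return re.findall(r"[A-Za-z0-9']+", sentence)


def normalize(tokens):
    """Lowercase tokens and keep only letters; drop tokens that become empty."""
    cleaned = (re.sub(r"[^a-z]", "", t.lower()) for t in tokens)
    return [t for t in cleaned if t]


def remove_stop_words(tokens):
    return [t for t in tokens if t not in STOP_WORDS]


def crude_stem(token):
    """Strip one common suffix if at least three characters remain."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess_text(text):
    tokens = []
    for sentence in split_sentences(text):
        words = remove_stop_words(normalize(tokenize(sentence)))
        tokens.extend(crude_stem(w) for w in words)
    return tokens


print(preprocess_text("The markets are rising. Investors bought 3 stocks!"))
```

A real stemmer or lemmatizer handles many cases this sketch gets wrong, which is one reason to rely on a library such as Spark NLP.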

In the following section, you ingest your dataset and use Spark NLP on Dataproc Serverless to build and run a simple NLP preprocessing pipeline. To do that, you need to:

1.   Upload the data to a Cloud Storage bucket
2.   Create a custom Dataproc Serverless image
3.   Create the `preprocess` module and upload it, with its dependencies, to the Cloud Storage bucket

Then you run the Dataproc Serverless job, and the resulting data is loaded into BigQuery. 


### Ingest data

Below you will:

1.   Prepare the data by extracting the news articles from the category directories and writing them to a single CSV file.
2.   Upload the data to a Cloud Storage bucket.


#### Prepare data

In [None]:
prepare_data(RAW_PATH, PREPARED_PATH, PREPARED_FILE)

##### Quick peek at the CSV data

In [None]:
! head $PREPARED_FILE_PATH
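
To see the resulting CSV layout without the full dataset, the sketch below applies the same idea as `prepare_data` to a tiny temporary directory tree. It is a condensed, self-contained variant rather than the helper itself; the file contents are invented and `folders_to_rows` is a hypothetical name. Lines are joined without a separator, as in `prepare_data`, so words at line boundaries run together.

```python
import csv
import glob
import os
import tempfile


def folders_to_rows(input_path):
    """Return (category, text) pairs, one per .txt file in each subfolder."""
    rows = []
    for entry in sorted(os.scandir(input_path), key=lambda e: e.name):
        if not entry.is_dir():
            continue
        for filename in sorted(glob.glob(os.path.join(entry.path, "*.txt"))):
            with open(filename, "r") as f:
                # Same joining rule as prepare_data: no separator between lines
                text = "".join(line.rstrip() for line in f)
            rows.append((entry.name, text))
    return rows


with tempfile.TemporaryDirectory() as root:
    samples = [("sport", "Late goal\nwins the cup\n"), ("tech", "New chip\nlaunched\n")]
    for category, text in samples:
        os.makedirs(os.path.join(root, category))
        with open(os.path.join(root, category, "001.txt"), "w") as f:
            f.write(text)

    rows = folders_to_rows(root)

    csv_path = os.path.join(root, "prepared_data.csv")
    with open(csv_path, "w", newline="") as out:
        writer = csv.writer(out)
        writer.writerow(["category", "text"])
        writer.writerows(rows)

    with open(csv_path, newline="") as f:
        loaded = list(csv.DictReader(f))

print(loaded)
```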

#### Upload the data to bucket

In [None]:
! gsutil cp $PREPARED_FILE_PATH $PREPARED_FILE_URI

### Basic Data and Feature Engineering 

In this scenario, you use a Spark pipeline built with Spark NLP to cover the following steps:

1. Sentence segmentation
2. Word tokenization
3. Normalization
4. Stopword removal
5. Stemming
6. Lemmatization

Finally, you create a bag of words (BOW) using a `CountVectorizer` object.
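
As a rough illustration of what a bag of words with a capped vocabulary looks like, the plain-Python sketch below keeps the most frequent terms and counts them per document. It only mimics the idea. Spark's `CountVectorizer` is what the `preprocess` module actually uses, with a vocabulary size of 30. The function names and sample tokens here are assumptions.

```python
from collections import Counter


def fit_vocabulary(documents, vocab_size):
    """Keep the vocab_size most frequent tokens across all documents."""
    counts = Counter(token for doc in documents for token in doc)
    # Sort by descending frequency, then alphabetically for determinism
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [token for token, _ in ranked[:vocab_size]]


def to_bow(doc, vocabulary):
    """Count how often each vocabulary term occurs in one document."""
    counts = Counter(doc)
    return [counts[token] for token in vocabulary]


docs = [
    ["goal", "match", "goal"],
    ["market", "share", "market", "goal"],
]
vocabulary = fit_vocabulary(docs, vocab_size=3)
vectors = [to_bow(doc, vocabulary) for doc in docs]

print(vocabulary)
print(vectors)
```

Tokens that fall outside the vocabulary, such as `share` here, are simply dropped from the vectors.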


#### Build a custom Dataproc Serverless image

The `DataprocPySparkBatchOp` lets you pass a custom container image when the [provided Dataproc Serverless runtime versions](https://cloud.google.com/dataproc-serverless/docs/concepts/versions/spark-runtime-versions) do not meet your requirements.

In this case, you need an image that includes the Spark NLP library.

##### Download the Spark job dependencies

You download the Spark dependencies required to run the NLP preprocessing pipeline. 

In [None]:
! rm -rf $DATAPROC_IMAGE_BUILD_PATH
! mkdir $DATAPROC_IMAGE_BUILD_PATH

In [None]:
!gsutil cp gs://spark-lib/bigquery/spark-bigquery-with-dependencies_2.12-0.22.2.jar $DATAPROC_IMAGE_BUILD_PATH
!wget -P $DATAPROC_IMAGE_BUILD_PATH https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-assembly-4.0.2.jar
!wget -P $DATAPROC_IMAGE_BUILD_PATH https://repo.anaconda.com/miniconda/Miniconda3-py38_4.9.2-Linux-x86_64.sh

##### Define the Dataproc Serverless custom runtime image

You define the Dockerfile to create the custom image. 

In [None]:
dataproc_serverless_custom_runtime_image = """
# Debian 11 is recommended.
FROM debian:11-slim

# Suppress interactive prompts
ENV DEBIAN_FRONTEND=noninteractive

# (Required) Install utilities required by Spark scripts.
RUN apt update && apt install -y procps tini

# (Optional) Add extra jars.
ENV SPARK_EXTRA_JARS_DIR=/opt/spark/jars/
ENV SPARK_EXTRA_CLASSPATH='/opt/spark/jars/*'
RUN mkdir -p "${SPARK_EXTRA_JARS_DIR}"
COPY spark-bigquery-with-dependencies_2.12-0.22.2.jar "${SPARK_EXTRA_JARS_DIR}"
COPY spark-nlp-assembly-4.0.2.jar "${SPARK_EXTRA_JARS_DIR}"

# (Optional) Install and configure Miniconda3.
ENV CONDA_HOME=/opt/miniconda3
ENV PYSPARK_PYTHON=${CONDA_HOME}/bin/python
ENV PATH=${CONDA_HOME}/bin:${PATH}
COPY Miniconda3-py38_4.9.2-Linux-x86_64.sh .
RUN bash Miniconda3-py38_4.9.2-Linux-x86_64.sh -b -p /opt/miniconda3 \
  && ${CONDA_HOME}/bin/conda config --system --set always_yes True \
  && ${CONDA_HOME}/bin/conda config --system --set auto_update_conda False \
  && ${CONDA_HOME}/bin/conda config --system --prepend channels conda-forge \
  && ${CONDA_HOME}/bin/conda config --system --set channel_priority strict

# (Optional) Install Conda packages.
#
# The following packages are installed in the default image, it is strongly
# recommended to include all of them.
#
# Use mamba to install packages quickly.
RUN ${CONDA_HOME}/bin/conda install mamba -n base -c conda-forge \
    && ${CONDA_HOME}/bin/mamba install \
      conda \
      cython \
      gcsfs \
      google-cloud-bigquery-storage \
      google-cloud-bigquery[pandas] \
      google-cloud-dataproc \
      numpy \
      pandas \
      python \
      pyspark \
      findspark

# Use conda to install spark-nlp
RUN ${CONDA_HOME}/bin/conda install -n base -c johnsnowlabs spark-nlp

# Add lemma dictionary
# ENV CONFIG_DIR='/home/app/build'
# RUN mkdir -p "${CONFIG_DIR}"
# COPY lemmas.txt "${CONFIG_DIR}"

# (Required) Create the 'spark' group/user.
# The GID and UID must be 1099. Home directory is required.
RUN groupadd -g 1099 spark
RUN useradd -u 1099 -g 1099 -d /home/spark -m spark
USER spark
"""

# The `with` block closes the file, so no explicit close() is needed
with open(PREPROCESS_DOCKERFILE_PATH, "w") as f:
    f.write(dataproc_serverless_custom_runtime_image)
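
Since the Dockerfile copies files that were downloaded in an earlier step, a quick local check before submitting the build can catch a failed download. The sketch below extracts the sources of simple `COPY` instructions and reports those missing from a build directory. It handles only the plain space-separated form (no JSON form, wildcards, or multi-stage builds), and both function names are hypothetical.

```python
import os
import tempfile


def copy_sources(dockerfile_text):
    """Return the source paths of simple, uncommented COPY instructions."""
    sources = []
    for line in dockerfile_text.splitlines():
        parts = line.strip().split()
        if len(parts) >= 3 and parts[0].upper() == "COPY":
            # Skip flags such as --chown; the last argument is the destination
            args = [p for p in parts[1:] if not p.startswith("--")]
            sources.extend(args[:-1])
    return sources


def missing_copy_sources(dockerfile_text, build_dir):
    """Return the COPY sources that do not exist inside build_dir."""
    return [
        source
        for source in copy_sources(dockerfile_text)
        if not os.path.exists(os.path.join(build_dir, source))
    ]


sample_dockerfile = """
FROM debian:11-slim
COPY spark-nlp-assembly-4.0.2.jar "${SPARK_EXTRA_JARS_DIR}"
# COPY lemmas.txt "${CONFIG_DIR}"
COPY Miniconda3-py38_4.9.2-Linux-x86_64.sh .
"""

with tempfile.TemporaryDirectory() as build_dir:
    # Simulate a build directory where only one of the two files was downloaded
    open(os.path.join(build_dir, "spark-nlp-assembly-4.0.2.jar"), "w").close()
    missing = missing_copy_sources(sample_dockerfile, build_dir)

print("Missing COPY sources:", missing)
```

In this notebook, the equivalent call would pass the Dockerfile text and `DATAPROC_IMAGE_BUILD_PATH`.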

##### Build the Dataproc Serverless custom runtime using Google Cloud Build

You use Cloud Build to build the container image and register it in Artifact Registry. 

Note that the Cloud Build service account, `<PROJECT_NUMBER>@cloudbuild.gserviceaccount.com`, needs `storage.objects.get` access to the Cloud Storage bucket. The next cell grants it the object viewer and object creator roles on the bucket.

**Note**: This step takes about 5 minutes.

In [None]:
CLOUD_BUILD_SERVICE_ACCOUNT = f"{PROJECT_NUMBER}@cloudbuild.gserviceaccount.com"

! gsutil iam ch serviceAccount:{CLOUD_BUILD_SERVICE_ACCOUNT}:roles/storage.objectCreator $BUCKET_URI
! gsutil iam ch serviceAccount:{CLOUD_BUILD_SERVICE_ACCOUNT}:roles/storage.objectViewer $BUCKET_URI

In [None]:
!gcloud builds submit --tag $DATAPROC_RUNTIME_CONTAINER_IMAGE $DATAPROC_IMAGE_BUILD_PATH --machine-type=N1_HIGHCPU_32 --timeout=900s --verbosity=info

#### Prepare `preprocess` module

##### Create the preprocess module

This module will preprocess the data and it covers the following steps:

1. Sentence segmentation
2. Word tokenization
3. Normalization
4. Stopword removal
5. Stemming
6. Lemmatization

In [None]:
with open(INIT_PATH, "w") as init_file:
    pass

In [None]:
process_module = """
#!/usr/bin/env python3

'''
This is a simple module to preprocess the data for the model registry demo.
Steps:
1. Sentence segmentation
2. Word tokenization
3. Normalization
4. Stopword removal
5. Stemming
6. Lemmatization
'''

# Libraries
import logging
import argparse

from pyspark.sql.types import StructType, StringType
from pyspark.sql.functions import col, concat_ws, rand
from pyspark.ml.functions import vector_to_array
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml.feature import CountVectorizer
from pyspark.ml import Pipeline

# Variables ------------------------------------------------------------------------------------------------------------
DATA_SCHEMA = (StructType()
               .add("category", StringType(), True)
               .add("text", StringType(), True))
SEED=8

# Helper functions -----------------------------------------------------------------------------------------------------
def get_logger():
    '''
    This function returns a logger object.
    Returns:
        logger: The logger object.
    '''
    logger = logging.getLogger(__name__)
    logger.setLevel(logging.INFO)
    handler = logging.StreamHandler()
    formatter = logging.Formatter('%(asctime)s - %(name)s - %(levelname)s - %(message)s')
    handler.setFormatter(formatter)
    logger.addHandler(handler)
    return logger


def get_args():
    '''
    This function returns the arguments from the command line.
    Returns:
        args: The arguments from the command line.
    '''
    parser = argparse.ArgumentParser()
    parser.add_argument('--input_path', type=str, help='The input path uri without bucket prefix')
    parser.add_argument('--lemmas_path', type=str, help='The lemma dictionary path without bucket prefix')
    parser.add_argument('--gcs_output_path', type=str, help='The gcs path for preprocessed data without bucket prefix')
    parser.add_argument('--bq_output_table_uri', type=str, help='The Bigquery output table URI')
    parser.add_argument('--bucket', type=str, help='The staging bucket')
    parser.add_argument('--project', type=str, help='The project id')
    args = parser.parse_args()
    return args


def build_preliminary_steps():
    '''
    This function builds the preliminary steps for the preprocessing.
    Returns:
        preliminary_steps: The preliminary steps for the preprocessing.
    '''

    document_assembler = DocumentAssembler().setInputCol("text").setOutputCol("document").setCleanupMode('shrink_full')
    sentence_detector = SentenceDetector().setInputCols("document").setOutputCol("sentence")
    tokenizer = Tokenizer().setInputCols("sentence").setOutputCol("token")
    preliminary_steps = [document_assembler, sentence_detector, tokenizer]
    return preliminary_steps


def build_common_preprocess_steps(lemma_uri):
    '''
    This function builds the common preprocessing steps.
    Args:
        lemma_uri: The uri of lemma dictionary
    Returns:
        common_preprocess_steps: The common preprocessing steps.
    '''

    normalizer = Normalizer().setInputCols("token").setOutputCol("normalized_token").setLowercase(True)
    stopwords_cleaner = StopWordsCleaner().setInputCols("normalized_token").setOutputCol(
        "cleaned_tokens").setCaseSensitive(False)
    stemmer = Stemmer().setInputCols("cleaned_tokens").setOutputCol("stem")
    lemmatizer = Lemmatizer().setInputCols("stem").setOutputCol("lemma").setDictionary(lemma_uri, "->", "\t")
    finisher = Finisher().setInputCols("lemma").setOutputCols(["lemma_features"]).setIncludeMetadata(
        False).setOutputAsArray(True)
    common_preprocess_steps = [normalizer, stopwords_cleaner, stemmer, lemmatizer, finisher]
    return common_preprocess_steps


def build_feature_extraction_steps():
    '''
    This function builds the feature extraction steps.
    Returns:
        feature_extraction_steps: The feature extraction steps.
    '''

    count_vectorizer = CountVectorizer().setInputCol("lemma_features").setOutputCol("features").setVocabSize(30)
    feature_extraction_steps = [count_vectorizer]
    return feature_extraction_steps


def read_data(spark_session, data_schema, input_dir):
    '''
    This function reads the data from the input directory.
    Args:
        spark_session: The SparkSession object.
        data_schema: The data schema.
        input_dir: The input directory.
    Returns:
        raw_df: The raw dataframe.
    '''

    raw_df = (spark_session.read.option("header", True)
              .option("delimiter", ',')
              .schema(data_schema)
              .csv(input_dir))
    return raw_df


def prepare_train_df(df):
    '''
    This function prepares the training dataframe.
    Args:
        df: The dataframe.
    Returns:
        train_df: The training dataframe.
    '''
    train_df = (df.withColumn("bow_col", vector_to_array("features"))
                .withColumn("lemmas", concat_ws(" ", col("lemma_features")))
                .select(["text"] + ["lemmas"] + [col("bow_col")[i] for i in range(30)] + ["category"]))

    return train_df


def save_data(data, bucket, gcs_path, bigquery_uri):
    '''
    This function saves a random sample of up to 1000 rows to BigQuery.
    Args:
        data: The data to save.
        bucket: The bucket.
        gcs_path: The path to store processed data.
        bigquery_uri: The URI of the Bigquery table.
    Returns:
        None
    '''
    # df_sample = data.sample(withReplacement=False, fraction=0.7, seed=SEED)
    df_sample = data.orderBy(rand(SEED)).limit(1000)
    df_sample.write.format('bigquery') \
        .mode("overwrite") \
        .option("persistentGcsBucket", bucket) \
        .option("persistentGcsPath", gcs_path) \
        .save(bigquery_uri)


# Main function --------------------------------------------------------------------------------------------------------
def preprocess(args):
    '''
    preprocess function.
    Args:
        args: The arguments from the command line.
    Returns:
        None
    '''
    # Get logger
    logger = get_logger()

    # Initialize variables
    input_path = args.input_path
    lemma_path = args.lemmas_path
    gcs_output_path = args.gcs_output_path
    bq_output_table_uri = args.bq_output_table_uri
    bucket = args.bucket
    project = args.project
    lemma_uri = f'gs://{bucket}/{lemma_path}'
    input_uri = f'gs://{bucket}/{input_path}'

    # Initialize SparkSession
    logger.info('Starting preprocessing')
    spark = sparknlp.start()
    print(f"Spark NLP version: {sparknlp.version()}")
    print(f"Spark version: {spark.version}")

    # Build pipeline steps
    logger.info('Building pipeline steps')
    preliminary_steps = build_preliminary_steps()
    common_preprocess_steps = build_common_preprocess_steps(lemma_uri)
    feature_extraction_steps = build_feature_extraction_steps()
    pipeline = Pipeline(stages=preliminary_steps + common_preprocess_steps + feature_extraction_steps)

    # Read data
    logger.info('Reading data')
    raw_df = read_data(spark, DATA_SCHEMA, input_uri)

    # Preprocess data
    logger.info('Preprocessing data')
    processed_pipeline = pipeline.fit(raw_df)
    preprocessed_df = processed_pipeline.transform(raw_df)
    preprocessed_df.show(10, truncate=False)

    # Save data to Bigquery
    logger.info('Saving data to Bigquery')
    train_df = prepare_train_df(preprocessed_df)
    save_data(train_df, bucket, gcs_output_path, bq_output_table_uri)
    logger.info('Done.')
    spark.stop()


if __name__ == '__main__':
    # Get args
    args = get_args()
    preprocess(args)
"""

# The with statement closes the file automatically
with open(PREPROCESS_MODULE_PATH, "w") as process_file:
    process_file.write(process_module)
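
The `CountVectorizer` stage in the module caps the vocabulary at 30 terms, which is why `prepare_train_df` can expand `bow_col` into exactly 30 columns. The following is a minimal pure-Python sketch of that idea, assuming the vocabulary keeps the most frequent terms across the corpus. The helper names and the tiny corpus are made up for illustration, and Spark's own implementation offers more options than shown here.

```python
from collections import Counter


def fit_vocabulary(documents, vocab_size):
    """Keep the `vocab_size` most frequent terms across all documents."""
    counts = Counter(term for doc in documents for term in doc)
    return [term for term, _ in counts.most_common(vocab_size)]


def to_count_vector(doc, vocabulary):
    """Map one tokenized document to a fixed-length list of term counts."""
    doc_counts = Counter(doc)
    return [doc_counts[term] for term in vocabulary]


# Made-up tokenized documents (hypothetical data)
docs = [
    ["match", "goal", "team", "goal"],
    ["team", "goal", "market"],
    ["market", "team", "goal"],
]
vocabulary = fit_vocabulary(docs, vocab_size=2)
vectors = [to_count_vector(doc, vocabulary) for doc in docs]
print(vocabulary)
print(vectors)
```

Every vector has the same length as the vocabulary, so terms outside the vocabulary are dropped.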

##### Upload the module to the bucket

In [None]:
!gsutil cp $SRC_PATH/__init__.py $MODULE_URI/__init__.py
!gsutil cp $SRC_PATH/preprocess.py $MODULE_URI/preprocess.py

##### Upload config file

The lemmatizer requires a lemma dictionary, as described in the Spark NLP documentation. You download it and upload it to your Cloud Storage bucket.

In [None]:
!wget https://raw.githubusercontent.com/mahavivo/vocabulary/master/lemmas/AntBNC_lemmas_ver_001.txt -O $LEMMA_DICTIONARY_PATH
!gsutil cp $LEMMA_DICTIONARY_PATH $LEMMA_DICTIONARY_URI

#### Run a preprocessing Spark job using Dataproc Serverless

Now that everything is in place, you can submit the preprocessing job to Dataproc Serverless. Explaining this CLI command is out of scope, but you can find all its options [in the official documentation](https://cloud.google.com/dataproc-serverless/docs/quickstarts/spark-batch).

In [None]:
! gcloud beta dataproc batches submit pyspark $PROCESS_PYTHON_FILE_URI \
  --batch=$PREPROCESS_BATCH_ID \
  --container-image=$DATAPROC_RUNTIME_CONTAINER_IMAGE \
  --region=$REGION \
  --subnet='default' \
  --properties spark.executor.instances=2,spark.driver.cores=4,spark.executor.cores=4,spark.app.name=spark_preprocessing_job \
  -- --input_path=$PREPARED_FILE_PATH --lemmas_path=$LEMMA_DICTIONARY_PATH --gcs_output_path=$PROCESS_DATA_PATH --bq_output_table_uri=$BQ_OUTPUT_TABLE_URI --bucket=$BUCKET_NAME --project=$PROJECT_ID
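
Everything after the bare `--` in the command above is handed to `preprocess.py` as its own command-line arguments. As a rough sketch of how those flags become the `gs://` URIs built in `preprocess`, assume an `argparse` parser with the same flag names. The parser below is a stand-in for the module's `get_args`, and the bucket and paths are placeholder values.

```python
import argparse


def parse_job_args(argv):
    """Parse the flags that follow `--` in the batch submit command."""
    parser = argparse.ArgumentParser()
    for flag in ("input_path", "lemmas_path", "gcs_output_path",
                 "bq_output_table_uri", "bucket", "project"):
        parser.add_argument(f"--{flag}", type=str, required=True)
    return parser.parse_args(argv)


# Placeholder values, not the ones used in this tutorial
example_args = parse_job_args([
    "--input_path=data/prepared.csv",
    "--lemmas_path=config/lemmas.txt",
    "--gcs_output_path=data/processed",
    "--bq_output_table_uri=my_dataset.my_table",
    "--bucket=my-bucket",
    "--project=my-project",
])

# Same URI construction as in preprocess(): paths are relative to the bucket
input_uri = f"gs://{example_args.bucket}/{example_args.input_path}"
lemma_uri = f"gs://{example_args.bucket}/{example_args.lemmas_path}"
print(input_uri)
print(lemma_uri)
```

Because the URIs are built by prefixing the bucket name, the path arguments are expected to be relative to the bucket rather than full `gs://` URIs.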

## Model Training for Text Classification

According to [Practical Natural Language Processing: A Comprehensive Guide to Building Real-World NLP Systems](https://www.oreilly.com/library/view/practical-natural-language/9781492054047/), there are different approaches to training text classifiers, for instance:

- Traditional methods such as Logistic Regression or Naive Bayes Classifier
- Neural Embeddings methods
- Deep Learning methods
- Large, Pre-trained Language Models

In the following sections, you use the traditional methods and see how Vertex AI Model Registry governs all of them.


#### Logistic Regression using BQML

##### Train and register the model

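To register a BigQuery ML model in Vertex AI Model Registry, set the `MODEL_REGISTRY='vertex_ai'` option in the `CREATE MODEL` statement. The `VERTEX_AI_MODEL_ID` and `VERTEX_AI_MODEL_VERSION_ALIASES` options set the model ID and the version aliases in the registry.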

In [None]:
train_lr_query = f"""
CREATE OR REPLACE MODEL
  `{PROJECT_ID}.{BQ_DATASET}.text_logit_classifier`
OPTIONS
  ( MODEL_TYPE='LOGISTIC_REG',
    AUTO_CLASS_WEIGHTS=TRUE,
    DATA_SPLIT_METHOD='RANDOM',
    DATA_SPLIT_EVAL_FRACTION = .10,
    INPUT_LABEL_COLS=['category'],
    ENABLE_GLOBAL_EXPLAIN=TRUE,
    MODEL_REGISTRY='vertex_ai',
    VERTEX_AI_MODEL_ID='{VERTEX_AI_MODEL_ID}',
    VERTEX_AI_MODEL_VERSION_ALIASES=['experimental', 'baseline', 'BQML', 'logistic_regression']
  ) AS
    SELECT * EXCEPT(text, lemmas)
    FROM `{PROJECT_ID}.{BQ_OUTPUT_TABLE_URI}`
"""

In [None]:
model_table, result = run_query(query=train_lr_query)
print(f"The model {model_table.dataset_id}.{model_table.table_id} was successfully created!")

#### Naive Bayes Classifier with scikit-learn

##### Create the Naive Bayes training module

With this module, you will train a simple scikit-learn Naive Bayes estimator for text classification.
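
To build intuition for what the `CountVectorizer` and `MultinomialNB` pipeline in the module computes, here is a minimal pure-Python sketch of a multinomial Naive Bayes classifier with add-one smoothing. It is an illustration under simplifying assumptions (whitespace tokenization, no sample weights, made-up training sentences and helper names), not the scikit-learn implementation.

```python
import math
from collections import Counter, defaultdict


def train_nb(texts, labels, alpha=1.0):
    """Fit per-class log priors and smoothed log word likelihoods."""
    class_docs = Counter(labels)
    word_counts = defaultdict(Counter)
    for text, label in zip(texts, labels):
        word_counts[label].update(text.lower().split())
    vocabulary = {w for counts in word_counts.values() for w in counts}
    model = {}
    for label, n_docs in class_docs.items():
        # Add-alpha smoothing: every vocabulary word gets alpha extra counts
        total = sum(word_counts[label].values()) + alpha * len(vocabulary)
        model[label] = {
            "log_prior": math.log(n_docs / len(labels)),
            "log_likelihood": {
                w: math.log((word_counts[label][w] + alpha) / total)
                for w in vocabulary
            },
        }
    return model


def predict_nb(model, text):
    """Return the class with the highest log posterior for `text`."""
    scores = {}
    for label, params in model.items():
        score = params["log_prior"]
        for word in text.lower().split():
            # Words never seen in training are ignored
            score += params["log_likelihood"].get(word, 0.0)
        scores[label] = score
    return max(scores, key=scores.get)


# Made-up training sentences (hypothetical data)
texts = [
    "team wins the match",
    "striker scores a goal",
    "shares fall on the market",
    "bank raises interest rates",
]
labels = ["sport", "sport", "business", "business"]
nb_model = train_nb(texts, labels)
print(predict_nb(nb_model, "the team scores"))
print(predict_nb(nb_model, "market shares fall"))
```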

In [None]:
train_naive_module = """
#!/usr/bin/env python3
'''
This is a simple module to train a naive bayes model.
'''

import logging
import argparse
import os
import glob

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.utils.class_weight import compute_sample_weight
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.metrics import precision_score, recall_score, accuracy_score, f1_score, log_loss, roc_auc_score
import pickle



# Variables
RANDOM_STATE = 8
TEST_SIZE = 0.2
EVAL_SIZE = 0.25



# Helpers --------------------------------------------------------------------------------------------------------------

def get_logger():
    '''
    This function returns a logger object.
    Returns:
        logger: The logger object.
    '''
    logger = logging.getLogger(__name__)
    logger.setLevel(logging.INFO)
    handler = logging.StreamHandler()
    formatter = logging.Formatter('%(asctime)s - %(name)s - %(levelname)s - %(message)s')
    handler.setFormatter(formatter)
    logger.addHandler(handler)
    return logger


def get_args():
    '''
    This function parses and return arguments passed in command line.
    Returns:
        args: Arguments list.
    '''

    parser = argparse.ArgumentParser()
    parser.add_argument("--data_path",
                        type=str, help="The path of the training data.")
    parser.add_argument('--model_dir',
                        type=str, help='The path of the model directory.')
    args = parser.parse_args()
    return args


def read_data(data_path: str):
    '''
    This function reads the data from the provided data path.
    Args:
        data_path: The path of the data.
    Returns:
        x_train: The training data.
        y_train: The training labels.
        x_test: The test data.
        y_test: The test labels.
    '''
    # Read data
    gs_prefix = 'gs://'
    gcsfuse_prefix = '/gcs/'
    if data_path.startswith(gs_prefix):
        data_path = data_path.replace(gs_prefix, gcsfuse_prefix)
    parquet_files = glob.glob(data_path)
    dataframes = []
    for parquet_file_path in parquet_files:
        parquet_file_path = parquet_file_path.replace(gcsfuse_prefix, gs_prefix)
        dataframes.append(pd.read_parquet(parquet_file_path, engine='fastparquet'))
    df = pd.concat(dataframes, axis=0)
    x = df.text
    y = df.category
    # Hold out 20% for testing, then 25% of the remainder (20% overall) for evaluation.
    # The evaluation split is not used further in this module.
    x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=RANDOM_STATE, test_size=TEST_SIZE)
    x_train, x_eval, y_train, y_eval = train_test_split(x_train, y_train, random_state=RANDOM_STATE, test_size=EVAL_SIZE)
    return x_train, y_train, x_test, y_test


def get_weights(y_train):
    '''
    This function returns balanced per-sample weights for the provided labels.
    Args:
        y_train: The labels.
    Returns:
        weights: The sample weights, inversely proportional to class frequency.
    '''
    weights = compute_sample_weight('balanced', y_train)
    return weights


def build_model():
    '''
    This function builds the model.
    Returns:
        model: The model.
    '''
    model = Pipeline([
        ('count_vectorizer', CountVectorizer()),
        ('classifier', MultinomialNB())
    ])
    return model


def train_model(x_train, y_train, model):
    '''
    This function trains the model.
    Args:
        x_train: The training data.
        y_train: The training labels.
        model: The model to train.
    Returns:
        model: The trained model.
    '''
    model = model.fit(x_train, y_train, classifier__sample_weight=get_weights(y_train))
    return model


def evaluate_model(model, x_test, y_test):
    '''
    This function evaluates the model on the test data.
    Parameters:
        model: The model to evaluate.
        x_test: The test data.
        y_test: The test labels.
    Returns:
        metrics: A dictionary of evaluation metrics.
    '''

    y_pred = model.predict(x_test)
    y_pred_proba = model.predict_proba(x_test)
    # Compute the balanced sample weights once and reuse them for every weighted metric
    weights = get_weights(y_test)
    metrics = {
        "precision": round(precision_score(y_test, y_pred, sample_weight=weights, average="weighted"), 5),
        "recall": round(recall_score(y_test, y_pred, sample_weight=weights, average="weighted"), 5),
        "accuracy": round(accuracy_score(y_test, y_pred, sample_weight=weights), 5),
        "f1_score": round(f1_score(y_test, y_pred, sample_weight=weights, average="weighted"), 5),
        "log_loss": round(log_loss(y_test, y_pred_proba, sample_weight=weights), 5),
        "roc_auc": round(roc_auc_score(y_test, y_pred_proba, multi_class='ovr'), 5)
    }
    return metrics


def save_model(model, model_dir):
    '''
    This function saves the model to the provided model directory.
    Parameters:
        model: The model to save.
        model_dir: The directory to save the model to.
    '''

    # Create output directory if it doesn't exist
    gs_prefix = 'gs://'
    gcsfuse_prefix = '/gcs/'
    if model_dir.startswith(gs_prefix):
        model_dir = model_dir.replace(gs_prefix, gcsfuse_prefix)
    model_dir = os.path.join(model_dir, 'model')
    if not os.path.exists(model_dir):
        os.makedirs(model_dir)
    model_path = os.path.join(model_dir, 'model.pkl')
    with open(model_path, 'wb') as model_file:
        pickle.dump(model, model_file)

def save_metrics(metrics, model_dir):
    '''
    This function saves the metrics as JSON to the provided model directory.
    Parameters:
        metrics: The metrics to save.
        model_dir: The directory to save the metrics to.
    '''
    import json  # local import so this function is self-contained

    # Map the gs:// URI to the Cloud Storage FUSE mount point
    gs_prefix = 'gs://'
    gcsfuse_prefix = '/gcs/'
    if model_dir.startswith(gs_prefix):
        model_dir = model_dir.replace(gs_prefix, gcsfuse_prefix)
    metrics_path = os.path.join(model_dir, 'model', 'metrics.json')
    with open(metrics_path, 'w') as f:
        # json.dump writes valid JSON; str(metrics) would write a Python dict repr
        json.dump(metrics, f)


def train_naive(args):
    '''
    This function trains the model and saves it to the provided model directory.
    Parameters:
        args: The arguments from the command line.
    '''
    # Get logger
    logger = get_logger()
    logger.info('Starting model training...')

    # Initialize variables
    data_path = args.data_path
    model_dir = args.model_dir

    # Build model
    model = build_model()

    # Read data
    logger.info('Reading data')
    x_train, y_train, x_test, y_test = read_data(data_path)

    # Train model
    logger.info('Training model')
    model = train_model(x_train, y_train, model)

    # Evaluate model
    logger.info('Evaluating model')
    metrics = evaluate_model(model, x_test, y_test)
    for key, value in metrics.items():
        print(f'{key}: {value}')

    # Save model
    logger.info('Saving model')
    save_model(model, model_dir)

    # Save metrics
    logger.info('Saving metrics')
    save_metrics(metrics, model_dir)

    logger.info('Training complete.')


if __name__ == '__main__':
    # Get args
    args = get_args()
    train_naive(args)
"""

# The with statement closes the file automatically
with open(TRAIN_NAIVE_MODULE_PATH, "w") as train_naive_file:
    train_naive_file.write(train_naive_module)
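
The training module weights each sample with `compute_sample_weight('balanced', ...)`. As a minimal sketch of what the `balanced` mode computes, each sample receives `n_samples / (n_classes * count_of_its_class)`, so rare classes weigh more. The helper name and the labels below are made up for illustration.

```python
from collections import Counter


def balanced_sample_weights(labels):
    """Weight each sample by n_samples / (n_classes * count of its class)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return [n_samples / (n_classes * counts[label]) for label in labels]


# Made-up, imbalanced labels (hypothetical data)
example_labels = ["sport", "sport", "sport", "tech"]
example_weights = balanced_sample_weights(example_labels)
print(example_weights)
```

With these weights, every class contributes the same total weight regardless of how many samples it has.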

##### Train and register the model using Vertex AI Training

To register a custom model trained with Vertex AI Training as a new version of an existing model, you provide these additional arguments:

*   `parent_model`: The parent resource name of an existing model to register a new version. 
*   `model_version_aliases`: The aliases of the model version to create.
*   `model_version_description`: The description of the model version.
*   `is_default_version`: Whether the model version is the default version.

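The training job takes about **5 minutes** to finish. Note that the job in the next cells only trains the model and saves its artifacts. The new version is registered later with `Model.upload`.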
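
As an illustration of the arguments listed above, the sketch below gathers them into keyword arguments that could be passed to `CustomTrainingJob.run`. The helper name and the example values are hypothetical. It also assumes a job created with a serving container image, so that `run` can return a registered model.

```python
def build_version_kwargs(parent_model, aliases, description, is_default=False):
    """Collect the model-version arguments described above into one dict."""
    if not parent_model:
        raise ValueError("parent_model must reference an existing model")
    return {
        "parent_model": parent_model,
        "model_version_aliases": list(aliases),
        "model_version_description": description,
        "is_default_version": is_default,
    }


version_kwargs = build_version_kwargs(
    parent_model="my-model-id",  # placeholder for an existing model ID
    aliases=["experimental", "challenger"],
    description="A Naive Bayes text classifier",
)
print(version_kwargs)

# Hypothetical usage, not executed here:
# model = training_job.run(args=[...], replica_count=1, **version_kwargs)
```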


In [None]:
naive_bayes_train_job = vertex_ai.CustomTrainingJob(
    display_name=NAIVE_TRAIN_JOB_NAME,
    script_path=TRAIN_NAIVE_MODULE_PATH,
    container_uri=NAIVE_TRAIN_CONTAINER_URI,
    requirements=NAIVE_TRAIN_REQUIREMENTS,
)

In [None]:
naive_model = naive_bayes_train_job.run(
    args=["--data_path", PROCESS_FILE_URI, "--model_dir", NAIVE_MODEL_BASE_URI],
    replica_count=1,
    machine_type="n1-standard-4",
    base_output_dir=NAIVE_MODEL_BASE_URI,
)

## Model Governance with Vertex AI Model Registry

#### Initialize Vertex AI Model Registry

To access different model versions of a Vertex AI Model resource, you can initialize a model registry instance of the model.

In [None]:
registry = vertex_ai.models.ModelRegistry(VERTEX_AI_MODEL_ID)

#### Compare model versions

Next, you use `ML.EVALUATE` to generate the evaluation metrics of the BQML model and compare them with the metrics saved by the custom training job.

In [None]:
evaluation_query = f"""
SELECT *
FROM
  ML.EVALUATE(MODEL `{BQ_DATASET}.text_logit_classifier`)
ORDER BY roc_auc DESC
LIMIT 1
"""
_, result = run_query(query=evaluation_query)
evaluation_df = result.to_dataframe().rename(index={0: "bqml_text_logit_classifier"})
evaluation_df

In [None]:
naive_metrics = read_metrics_file(NAIVE_METRICS_FILE_URI)
metrics_dict = [json.loads(naive_metrics)]
naive_metrics_df = pd.DataFrame.from_dict(metrics_dict).rename(index={0: "naive_bayes"})
# DataFrame.append was removed in pandas 2.0; pd.concat keeps the row labels
evaluation_df = pd.concat([evaluation_df, naive_metrics_df])
evaluation_df

### Register the `champion` model version

Based on the model evaluations, the scikit-learn Naive Bayes classifier outperformed the BQML logistic regression, so it is the production candidate.
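
A comparison like this one can also be made programmatically. The sketch below picks the model with the highest value for a chosen metric. The helper name is hypothetical and the numbers are placeholders for illustration only, not results from this tutorial.

```python
def pick_champion(metrics_by_model, metric="roc_auc"):
    """Return the model name with the highest value for `metric`."""
    return max(metrics_by_model, key=lambda name: metrics_by_model[name][metric])


# Placeholder numbers for illustration only
example_metrics = {
    "model_a": {"roc_auc": 0.90, "f1_score": 0.70},
    "model_b": {"roc_auc": 0.95, "f1_score": 0.80},
}
print(pick_champion(example_metrics))
```

In practice you would likely look at several metrics together rather than rank on a single one.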

##### Build and push the custom serving container to Artifact Registry

###### Create the build directories

In [None]:
! rm -rf $SERVING_BUILD_PATH
! mkdir $SERVING_BUILD_PATH
! mkdir $SERVING_APP_BUILD_PATH

###### Build the serving app

In [None]:
serve_naive_module = """
'''
This is a simple web application to serve the naive bayes model.
'''

# Libraries ------------------------------------------------------------------------------------------------------------

import logging
import os
from flask import Flask, Response, request, jsonify
import pickle
import pandas as pd


# Helpers --------------------------------------------------------------------------------------------------------------

def get_probabilities(model_classes, probabilities):
    '''
    Pair each row of class probabilities with the class labels.
    '''
    proba_classes = []
    for probabilities_list in probabilities:
        proba_classes.append({"classes": model_classes, "scores": probabilities_list})
    return proba_classes


# App ------------------------------------------------------------------------------------------------------------------

# Initialize the app
app = Flask(__name__)
app.logger.setLevel(logging.INFO)

# Load the model
app.logger.info("Loading model...")
with open('./model/model.pkl', 'rb') as model_file:
    model = pickle.load(model_file)
app.logger.info("Model loaded.")

# Class labels as a plain list so they can be serialized to JSON
classes = model.classes_.tolist()


@app.route(os.environ['AIP_HEALTH_ROUTE'], methods=['GET'])
def health():
    '''
    A health check endpoint.
    '''
    app.logger.info("Health check")
    return Response(response='OK', status=200)


@app.route(os.environ['AIP_PREDICT_ROUTE'], methods=['POST'])
def predict():
    '''
    A predict endpoint.
    '''
    app.logger.info("Predict")

    # Get instances
    instances_dict = request.get_json()["instances"]

    # Generate predictions
    instances_df = pd.DataFrame.from_records(instances_dict)
    probabilities = model.predict_proba(instances_df.iloc[:, 0])

    # Format predictions
    fmt_probabilities = get_probabilities(classes, probabilities.tolist())

    return jsonify({"predictions": fmt_probabilities})


if __name__ == "__main__":
    app.run(debug=True, host="0.0.0.0", port=9999)
"""

# The with statement closes the file automatically
with open(SERVE_NAIVE_MODULE_PATH, "w") as serve_naive_file:
    serve_naive_file.write(serve_naive_module)
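
To make the contract of the predict route concrete, the sketch below shows the shape of a request body and the response the app builds from it. The classes and probabilities are made-up stand-ins for `model.classes_` and `predict_proba`, and `format_predictions` is a hypothetical equivalent of `get_probabilities`.

```python
import json


def format_predictions(model_classes, probabilities):
    """Pair every row of class probabilities with the class labels."""
    return [{"classes": model_classes, "scores": row} for row in probabilities]


# Request body in the shape the predict route reads
request_body = {"instances": [{"text": "first document"}, {"text": "second document"}]}

# Made-up values standing in for the model outputs
example_classes = ["business", "sport"]
example_probabilities = [[0.2, 0.8], [0.7, 0.3]]

response_body = {"predictions": format_predictions(example_classes, example_probabilities)}
print(json.dumps(response_body, indent=2))
```

The response carries one prediction per instance, in the same order as the request.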

###### Copy model

In [None]:
!gsutil cp -r $NAIVE_MODEL_URI $SERVING_APP_BUILD_PATH

###### Create `requirements` file

In [None]:
serve_requirements = """
flask==2.2.2
gunicorn==20.1.0
numpy==1.22.4
pandas==1.4.3
scikit-learn==0.23.1
"""

# The with statement closes the file automatically
with open(SERVE_REQUIREMENTS_PATH, "w") as serve_requirements_file:
    serve_requirements_file.write(serve_requirements)

###### Create `Dockerfile` file

In [None]:
serve_dockerfile = """
FROM python:3.8-slim

# Update pip
RUN pip3 install --upgrade pip

# Install requirements
COPY requirements.txt requirements.txt
RUN pip install -r requirements.txt

# Create app folder and copy app files
RUN mkdir /app
COPY app /app
WORKDIR /app

# Run app
EXPOSE 9999
CMD ["gunicorn", "main:app", "--timeout=0", "--preload", \
     "--workers=1", "--threads=4", "--bind=0.0.0.0:9999"]
"""

# The with statement closes the file automatically
with open(SERVE_DOCKERFILE_PATH, "w") as serve_dockerfile_file:
    serve_dockerfile_file.write(serve_dockerfile)

###### Build and push the custom image

In [None]:
!gcloud builds submit --tag $SERVING_NAIVE_RUNTIME_CONTAINER_IMAGE $SERVING_BUILD_PATH --machine-type=N1_HIGHCPU_32 --timeout=900s --verbosity=info

##### Register the model

In [None]:
naive_model = vertex_ai.Model.upload(
    parent_model=VERTEX_AI_MODEL_ID,
    is_default_version=False,
    version_aliases=["experimental", "challenger", "custom-training", "naive-bayes"],
    version_description="A Naive Bayes text classifier",
    serving_container_image_uri=SERVING_NAIVE_RUNTIME_CONTAINER_IMAGE,
    serving_container_health_route="/health",
    serving_container_predict_route="/predict",
    serving_container_ports=[9999],
    labels={"created_by": "inardini", "team": "advocacy"},
)

### Listing versions of a model

You can list all the model versions using the `list_versions` method.

In [None]:
versions = registry.list_versions()
for version in versions:
    version_id = version.version_id
    version_created_time = dt.datetime.fromtimestamp(
        version.version_create_time.timestamp()
    ).strftime("%m/%d/%Y %H:%M:%S")
    version_aliases = version.version_aliases
    print(
        "\n",
        f"Model version {version_id} was created at {version_created_time} with aliases {version_aliases}",
    )

### Getting all information about the `champion` model version

To get all the information about your `champion` model version, you can use the `get_version_info` method.

In [None]:
CHAMPION_VERSION_ID = versions[-1].version_id

In [None]:
champion_model_version_info = registry.get_version_info(CHAMPION_VERSION_ID)
champion_model_version_info_df = pd.DataFrame(
    champion_model_version_info,
    columns=["model_version"],
    index=[
        "version_id",
        "created_at",
        "updated_at",
        "model_display_name",
        "model_resource_name",
        "version_aliases",
        "version_description",
    ],
)
champion_model_version_info_df

### Promote the champion model to `production` and set it as `default`

To update the aliases and change the state of the model from `experimental` to `production`, the Vertex AI SDK provides the `add_version_aliases` and `remove_version_aliases` methods.

Notice that these aliases follow the online experimentation phase discussed in the [Practitioners guide to MLOps: A framework for continuous delivery and automation of machine learning](https://services.google.com/fh/files/misc/practitioners_guide_to_mlops_whitepaper.pdf).

In [None]:
registry.remove_version_aliases(
    ["experimental", "challenger"], version=CHAMPION_VERSION_ID
)
registry.add_version_aliases(["default", "production"], version=CHAMPION_VERSION_ID)
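
The effect of the two calls above on the alias list can be pictured with plain Python. This sketch only models the bookkeeping, starting from the aliases set at upload time. The helper name is hypothetical, and the ordering the Model Registry itself returns may differ.

```python
def update_aliases(current, to_remove, to_add):
    """Drop `to_remove`, then append `to_add`, keeping order and uniqueness."""
    removed = set(to_remove)
    kept = [alias for alias in current if alias not in removed]
    for alias in to_add:
        if alias not in kept:
            kept.append(alias)
    return kept


example_aliases = ["experimental", "challenger", "custom-training", "naive-bayes"]
example_aliases = update_aliases(
    example_aliases,
    to_remove=["experimental", "challenger"],
    to_add=["default", "production"],
)
print(example_aliases)
```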

### Deploy the `champion` model

Finally, you retrieve the champion model version through its `production` alias and deploy it to a Vertex AI Endpoint.

In [None]:
champion_model = registry.get_model(version="production")

#### Create the endpoint

In [None]:
endpoint = vertex_ai.Endpoint.create(
    display_name=ENDPOINT_NAME,
    project=PROJECT_ID,
    location=REGION,
)

#### Deploy the champion model

In [None]:
endpoint.deploy(
    model=champion_model,
    deployed_model_display_name=DEPLOYED_MODEL_NAME,
    machine_type="n1-standard-8",
)

### Generate predictions

In [None]:
text = """The singer to headline the event halftime show: 'It's on"""  # @param {type:"string"}

In [None]:
instances = [{"text": text}]
predictions = endpoint.predict(instances)
print(predictions)
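
Each prediction returned by the serving app pairs the class labels with their scores. As a small sketch of how you might extract the most likely class from one prediction, consider the helper below. The helper name, the class names, and the scores are made up, and the example assumes a prediction shaped like the app's `{"classes": ..., "scores": ...}` output.

```python
def top_class(prediction):
    """Return the (label, score) pair with the highest score."""
    scores = prediction["scores"]
    best = max(range(len(scores)), key=scores.__getitem__)
    return prediction["classes"][best], scores[best]


# Made-up prediction in the shape returned by the serving app
example_prediction = {
    "classes": ["business", "entertainment", "sport"],
    "scores": [0.1, 0.6, 0.3],
}
top_label, top_score = top_class(example_prediction)
print(top_label, top_score)
```

With the endpoint response, you would likely apply this to each element of `predictions.predictions`.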

### Final thoughts 

You can also upload models trained outside Vertex AI to the Model Registry. For an example, have a look at the documentation and the [sample notebook](https://github.com/GoogleCloudPlatform/vertex-ai-samples/blob/main/notebooks/community/ml_ops/stage3/get_started_with_model_registry.ipynb).

## Cleaning up

To clean up all Google Cloud resources used in this project, you can [delete the Google Cloud
project](https://cloud.google.com/resource-manager/docs/creating-managing-projects#shutting_down_projects) you used for the tutorial.

Otherwise, you can delete the individual resources you created.


In [None]:
endpoint.undeploy_all()

endpoint.delete()

drop_model_query = f"DROP MODEL `{PROJECT_ID}.{BQ_DATASET}.text_logit_classifier`"
run_query(drop_model_query)

versions = registry.list_versions()
for version in versions:
    registry.delete_version(version=version.version_id)

naive_bayes_train_job.delete()

! gcloud dataproc batches delete $PREPROCESS_BATCH_ID --region=$REGION --quiet

! bq rm -r -f -d $PROJECT_ID:$BQ_DATASET

! gcloud artifacts repositories delete $REPO_NAME --location=$REGION --quiet

delete_bucket = False
if delete_bucket or os.getenv("IS_TESTING"):
    ! gsutil -m rm -r $BUCKET_URI

! rm -rf $DATA_PATH $SRC_PATH $BUILD_PATH $CONFIG_PATH