# MLOps on Vertex AI

In this notebook, we will build an MLOps pipeline on Vertex AI.

# Introduction and setup

In this notebook, we will build a pipeline that performs the following steps:
1. Perform custom data processing, such as feature scaling, one-hot encoding, and feature engineering, in a Google Cloud Serverless Spark environment.
2. Run a custom training job in Vertex AI to train a custom model. In this case, our model uses the [Titanic dataset from Kaggle](https://www.kaggle.com/competitions/titanic/data) to predict each passenger's likelihood of survival based on the features in the dataset.
3. Upload the trained model to Vertex AI Model Registry.
4. Deploy the trained model to a Vertex AI endpoint for online inference.

In this initial section, we set up all of the baseline requirements to run our pipeline.

## Prerequisites
**Note:** This notebook and repository are supporting artifacts for the "Google Machine Learning and Generative AI for Solutions Architects" book. The book describes the concepts associated with this notebook, and for some of the activities, the book contains instructions that should be performed before running the steps in the notebooks. Each top-level folder in this repo is associated with a chapter in the book. Please ensure that you have read the relevant chapter sections before performing the activities in this notebook.

**There are also important generic prerequisite steps outlined [here](https://github.com/PacktPublishing/Google-Machine-Learning-for-Solutions-Architects/blob/main/Prerequisite-steps/Prerequisites.ipynb).**


**Attention:** The code in this notebook creates Google Cloud resources that can incur costs.

Refer to the Google Cloud pricing documentation for details.

For example:

* [Vertex AI Pricing](https://cloud.google.com/vertex-ai/pricing)
* [Google Cloud Storage Pricing](https://cloud.google.com/storage/pricing)
* [Dataproc pricing](https://cloud.google.com/dataproc/pricing)


## Install required packages

We will use the following libraries in this notebook:

* [The Vertex AI Python SDK](https://cloud.google.com/python/docs/reference/aiplatform/latest)
* [Kubeflow Pipelines (KFP)](https://www.kubeflow.org/docs/components/pipelines/v1/sdk/sdk-overview/)
* [Google Cloud Pipeline Components (GCPC)](https://cloud.google.com/vertex-ai/docs/pipelines/components-introduction)

*The pip installation commands sometimes report various errors. Those errors usually do not affect the activities in this notebook, and you can ignore them.*


In [None]:
! python -m pip install --upgrade pip

In [None]:
! pip3 install --quiet --user --upgrade google-cloud-aiplatform kfp google-cloud-pipeline-components

## Restart the kernel

The code in the next cell will restart the kernel, which is sometimes required after installing or upgrading packages.

**When prompted, click OK to restart the kernel.**

The sleep command simply prevents further cells from executing before the kernel restarts.

In [None]:
import IPython

app = IPython.Application.instance()
app.kernel.do_shutdown(True)


In [None]:
import time
time.sleep(10)

# (Wait for kernel to restart before proceeding...)

## Import required libraries

In [None]:
# General
from google.cloud import aiplatform

# Kubeflow Pipelines (KFP)
import kfp
from kfp import compiler, dsl
from kfp.dsl import component, Input, Output, Artifact

# Google Cloud Pipeline Components (GCPC)
from google_cloud_pipeline_components.v1.dataproc import DataprocPySparkBatchOp
from google_cloud_pipeline_components.v1 import dataset, custom_job
from google_cloud_pipeline_components.v1.model import ModelUploadOp
from google_cloud_pipeline_components.types import artifact_types
from google_cloud_pipeline_components.v1.endpoint import EndpointCreateOp, ModelDeployOp

## Set Google Cloud resource variables

The following code will set variables specific to your Google Cloud resources that will be used in this notebook, such as the Project ID, Region, and GCS Bucket.

**Note: This notebook is intended to execute in a Vertex AI Workbench Notebook, in which case the API calls issued in this notebook are authenticated according to the permissions (e.g., service account) assigned to the Vertex AI Workbench Notebook.**

We will use the `gcloud` command to get the Project ID details from the local Google Cloud project, and assign the results to the PROJECT_ID variable. If, for any reason, PROJECT_ID is not set, you can set it manually or change it, if preferred.

We also use a default bucket name for most of the examples and activities in this book, which has the format: `{PROJECT_ID}-aiml-sa-bucket`. You can change the bucket name if preferred.

Also, we're defaulting to the **us-central1** region, but you can optionally replace this with your [preferred region](https://cloud.google.com/about/locations).
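
If the `gcloud` lookup does not return a usable value, a small guard can make the fallback explicit. The following is a minimal sketch, not part of the original notebook: `resolve_project_id` is a hypothetical helper, and the regular expression encodes the commonly documented project ID format (6 to 30 characters, lowercase letters, digits, and hyphens, starting with a letter), which is an assumption that may not cover legacy, domain-scoped project IDs.

```python
import re

# Assumed project ID format: 6-30 chars, lowercase letters, digits, hyphens,
# starting with a letter and not ending with a hyphen.
PROJECT_ID_PATTERN = re.compile(r"^[a-z][a-z0-9-]{4,28}[a-z0-9]$")

def resolve_project_id(gcloud_output, fallback=None):
    """Return the first line that looks like a project ID, else the fallback."""
    for line in gcloud_output:
        candidate = line.strip()
        # Skips blank lines, warnings, and placeholders such as "(unset)".
        if PROJECT_ID_PATTERN.match(candidate):
            return candidate
    if fallback:
        return fallback
    raise ValueError("No project ID found; set PROJECT_ID manually.")

# Example with captured output lines like those returned by `!gcloud ...`
print(resolve_project_id(["(unset)"], fallback="my-project-123"))
```

In the notebook, this could be applied as `PROJECT_ID = resolve_project_id(PROJECT_ID_DETAILS, fallback="your-project-id")`, where the fallback value is a placeholder you would replace.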

In [None]:
PROJECT_ID_DETAILS = !gcloud config get-value project
PROJECT_ID = PROJECT_ID_DETAILS[0]  # The project ID is item 0 in the list returned by the gcloud command
BUCKET=f"{PROJECT_ID}-aiml-sa-bucket" # Optional: replace with your preferred bucket name, which must be a unique name.
REGION="us-central1" # Optional: replace with your preferred region (See: https://cloud.google.com/about/locations) 
print(f"Project ID: {PROJECT_ID}")
print(f"Bucket Name: {BUCKET}")

## Create bucket

The following code will create the bucket if it doesn't already exist.

If you get an error saying that the bucket already exists, you can ignore it and continue with the rest of the steps, unless you want to use a different bucket.

In [None]:
!gsutil mb -l {REGION} gs://{BUCKET}

# Begin implementation

Now that the prerequisite steps are complete, we can implement the activity.

## Define constants
In this section, we define all of the constants that will be referenced throughout the rest of the notebook.


In [None]:
# Core constants
BUCKET_URI = f"gs://{BUCKET}"
APPLICATION_DIR = "mlops-titanic-app" # Local parent directory for our pipeline resources
TRAINER_DIR = f"{APPLICATION_DIR}/trainer" # Local directory for training resources
PYSPARK_DIR = "pyspark-titanic-dir" # Local directory for PySpark data processing resources
APP_NAME="mlops-titanic" # Base name for our pipeline application

# Pipeline constants
PIPELINE_NAME = "mlops-titanic-pipeline" # Name of our pipeline
PIPELINE_ROOT = f"{BUCKET_URI}/pipelines" # (See: https://www.kubeflow.org/docs/components/pipelines/v1/overview/pipeline-root/)
SUBNETWORK = "default" # Our VPC subnet name
SUBNETWORK_URI = f"projects/{PROJECT_ID}/regions/{REGION}/subnetworks/{SUBNETWORK}" # Our VPC subnet resource identifier
MODEL_NAME = "mlops-titanic" # Name of our model
EXPERIMENT_NAME = "aiml-sa-mlops-experiment" # Vertex AI "Experiment" name for metadata tracking

# Preprocessing constants
SOURCE_DATASET = f"{BUCKET_URI}/data/unprocessed/titanic/train.csv" # Our raw source dataset
PREPROCESSING_PYTHON_FILE_URI = f"{BUCKET_URI}/code/mlops/preprocessing.py" # GCS location of our PySpark script
PROCESSED_DATA_URI =f"{BUCKET_URI}/data/processed/mlops-titanic" # Location to store the output of our data preprocessing step
DATAPROC_RUNTIME_VERSION = "2.1" # (See https://cloud.google.com/dataproc-serverless/docs/concepts/versions/spark-runtime-versions)
# Arguments to pass to our preprocessing script:
PREPROCESSING_ARGS = [
    "--source_dataset",
    SOURCE_DATASET,
    "--processed_data_path",
    PROCESSED_DATA_URI,
]

# Training constants
TRAIN_REPO_NAME=f'{APP_NAME}-train' # Name of repository in which we will store our custom training image
TRAIN_IMAGE_URI = f"{REGION}-docker.pkg.dev/{PROJECT_ID}/{TRAIN_REPO_NAME}/{APP_NAME}-train:latest"
MODEL_URI = f"{BUCKET_URI}/models/mlops-chapter/titanic" # Where to store our trained model
# Where to store our test data:
TEST_DATA_PREFIX = "test_data" 
TEST_DATA_DIR = f"{TEST_DATA_PREFIX}_dir"
TEST_DATA_FILE_NAME = f"{TEST_DATA_PREFIX}.jsonl"
TEST_DATASET_PATH = f"{BUCKET_URI}/{TEST_DATA_FILE_NAME}"
LOCAL_TEST_DATASET_PATH = f"./{TEST_DATA_DIR}/{TEST_DATA_FILE_NAME}"

# Hyperparameters for training
BATCH_SIZE: int = 32
EPOCHS: int = 20
LEARNING_RATE: float = 0.001
N_HIDDEN_LAYERS: int = 3
N_UNITS: int = 64
ACTIVATION_FN: str = 'relu'

# Arguments to pass to our training job
TRAINING_ARGS=[
        "--project_id",
        PROJECT_ID,
        "--bucket_name",
        BUCKET,
        "--processed_data_path",
        PROCESSED_DATA_URI,
        "--test_data_file_name",
        TEST_DATA_FILE_NAME,
        "--model_path",
        MODEL_URI,
        "--batch_size",
        str(BATCH_SIZE),
        "--epochs",
        str(EPOCHS),
        "--learning_rate",
        str(LEARNING_RATE),
        "--n_hidden_layers",
        str(N_HIDDEN_LAYERS),
        "--n_units",
        str(N_UNITS),
        "--activation_fn",
        ACTIVATION_FN,
    ]

# Worker pool spec (see https://cloud.google.com/vertex-ai/docs/reference/rest/v1/CustomJobSpec#workerpoolspec)
WORKER_POOL_SPEC = [
    {
        "machine_spec": {
            "machine_type": "n1-standard-4",
        },
        "replica_count": 1,
        "container_spec": {
            "image_uri": TRAIN_IMAGE_URI,
            "args": TRAINING_ARGS
        },
    }
]

# These will be referenced in a later chapter
EXPLANATION_PARAMS = {
  "sampledShapleyAttribution": {
    "pathCount": 10
  }
}

# Serving constants
SERVING_IMAGE_URI = "us-docker.pkg.dev/vertex-ai/prediction/tf2-cpu.2-12:latest" # (See: https://cloud.google.com/vertex-ai/docs/predictions/pre-built-containers)
ENDPOINT_NAME = "mlops-endpoint" # Name of endpoint on which to serve our trained model

### Create local directories
We will use the following local directories during the activities in this notebook.

In [None]:
# make a source directory to save the code
!mkdir -p $APPLICATION_DIR
!mkdir -p $TRAINER_DIR
!mkdir -p $PYSPARK_DIR
!mkdir -p $TEST_DATA_DIR

### Upload source dataset 
Upload our source dataset to GCS. Our data preprocessing step in our pipeline will ingest this data from GCS.

In [None]:
! gsutil cp ./data/train.csv $SOURCE_DATASET

### Set project ID for gcloud
The following command sets our project ID for using gcloud commands in this notebook.

In [None]:
! gcloud config set project $PROJECT_ID --quiet

### Initialize the Vertex AI SDK client

In [None]:
aiplatform.init(project=PROJECT_ID, location=REGION, staging_bucket=BUCKET_URI)

### Configure Private Google Access for Dataproc 
Our Serverless Spark data preprocessing job in our pipeline will run in Dataproc, which is (as Google defines) Google's "fully managed and highly scalable service for running Apache Hadoop, Apache Spark, Apache Flink, Presto, and 30+ open source tools and frameworks."
We're going to configure something called "Private Google Access", which allows us to interact with Google services without sending requests over the public Internet.

You can learn more about Dataproc [here](https://cloud.google.com/dataproc?hl=en), and learn more about Private Google Access [here](https://cloud.google.com/vpc/docs/private-google-access).

In [None]:
!gcloud compute networks subnets list --regions=$REGION --filter=$SUBNETWORK

!gcloud compute networks subnets update $SUBNETWORK \
--region=$REGION \
--enable-private-ip-google-access

!gcloud compute networks subnets describe $SUBNETWORK \
--region=$REGION \
--format="get(privateIpGoogleAccess)"

# Create custom PySpark job

The following code will create a file that contains the code for our custom PySpark data preprocessing job. 

The code initiates a Spark session, loads our raw source dataset, and then performs the following processing steps (we performed many of these steps using pandas in our feature engineering chapter earlier in this book, but in this case we will implement the steps using PySpark in Google Cloud Serverless Spark):

1. Removes rows from the dataset where the target variable ("Survived") is missing.
2. Drops columns that are unlikely to affect the likelihood of surviving, such as 'PassengerId', 'Name', 'Ticket', and 'Cabin'.
3. Fills in missing values in input features.
4. Performs some feature engineering by creating new features such as 'FamilySize' and 'IsAlone' from combinations of existing features.
5. Ensures that all numeric features are on a consistent scale with each other.
6. One-hot encodes all categorical features.
7. Converts the resulting sparse vector to a dense vector. This mainly makes it easier for us to feed the data into our Keras model in our training script later, with minimal processing needed in the training script.
8. Writes the resulting processed data to a parquet file in GCS.
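
To make steps 4 and 5 concrete, here is a minimal pure-Python sketch of the same arithmetic on three invented passengers. It is an illustration under stated assumptions rather than the PySpark implementation: `add_family_features` and `scale_by_std` are hypothetical helpers, and the scaling mirrors a `StandardScaler` configured with `withStd=True` and `withMean=False`, which divides each value by the sample standard deviation without centering it.

```python
import statistics

# Three hypothetical passengers (values invented for illustration).
passengers = [
    {"SibSp": 1, "Parch": 0, "Fare": 7.25},
    {"SibSp": 0, "Parch": 0, "Fare": 71.0},
    {"SibSp": 3, "Parch": 2, "Fare": 21.0},
]

def add_family_features(row):
    """Add FamilySize (siblings/spouses + parents/children + self) and IsAlone."""
    out = dict(row)
    out["FamilySize"] = row["SibSp"] + row["Parch"] + 1
    out["IsAlone"] = 1 if out["FamilySize"] == 1 else 0
    return out

def scale_by_std(values):
    """Divide by the sample standard deviation; no mean centering."""
    std = statistics.stdev(values)
    if std == 0:
        # A constant column carries no information, so map it to zeros.
        return [0.0 for _ in values]
    return [v / std for v in values]

engineered = [add_family_features(p) for p in passengers]
scaled_fare = scale_by_std([p["Fare"] for p in engineered])

print([p["FamilySize"] for p in engineered])
print([p["IsAlone"] for p in engineered])
print(scaled_fare)
```

Because the values are only divided by a constant, the scaled column has unit standard deviation but keeps a nonzero mean.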

**Note:** we could create a custom container in which to run our PySpark code on Dataproc Serverless if we had very specific dependencies that needed to be installed. However, Dataproc Serverless also provides [default runtimes](https://cloud.google.com/dataproc-serverless/docs/concepts/versions/spark-runtime-versions) that we can use, and these are fine for our needs in this activity, so all we need to do is define our code and put it into GCS so that it can be referenced by the `DataprocPySparkBatchOp` component in our pipeline.
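
Steps 6 and 7 above determine the column names that the training script eventually sees, so a small sketch may help. This is a pure-Python approximation, not Spark code: `fit_string_index` and `one_hot_drop_last` are hypothetical helpers that assume the default behaviors of `StringIndexer` (most frequent label gets index 0) and `OneHotEncoder` (the last category is dropped and encoded as all zeros).

```python
from collections import Counter

def fit_string_index(values):
    """Map labels to indices, most frequent first (ties broken alphabetically)."""
    counts = Counter(values)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: index for index, label in enumerate(ordered)}

def one_hot_drop_last(index, num_categories):
    """Encode an index as a vector of length num_categories - 1."""
    vector = [0.0] * (num_categories - 1)
    if index < num_categories - 1:
        vector[index] = 1.0
    return vector

# Invented sample of embarkation ports.
embarked = ["S", "C", "S", "Q", "S", "C"]
index_map = fit_string_index(embarked)
encoded = [one_hot_drop_last(index_map[v], len(index_map)) for v in embarked]

# Expand each dense vector into one column per element, named like the
# f"{vector_col}_{i}" pattern used by the preprocessing script.
flattened = [
    {f"Embarked_Vec_{i}": x for i, x in enumerate(vec)} for vec in encoded
]

print(index_map)
print(flattened[0])
```

Under these assumptions, a feature with three distinct values produces two output columns, and the least frequent value is represented by zeros in both.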

In [None]:
%%writefile $PYSPARK_DIR/preprocessing.py

import argparse
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import OneHotEncoder, StringIndexer, StandardScaler, VectorAssembler
from pyspark.sql.functions import udf, col, when
from pyspark.sql.types import StringType, ArrayType, FloatType

# Setting up the argument parser
parser = argparse.ArgumentParser(description='Data Preprocessing Script')
parser.add_argument('--source_dataset', type=str, help='Path to the source dataset')
parser.add_argument('--processed_data_path', type=str, help='Path to save the output data')

# Parsing the arguments
args = parser.parse_args()
source_dataset = args.source_dataset
processed_data_path = args.processed_data_path

# Initialize a SparkSession
spark = SparkSession.builder \
    .appName("Titanic Data Processing") \
    .getOrCreate()

# Load the data
titanic = spark.read.csv(args.source_dataset, header=True, inferSchema=True)

# Remove rows where 'Survived' is missing
titanic = titanic.filter(titanic.Survived.isNotNull())

# Drop irrelevant columns
titanic = titanic.drop('PassengerId', 'Name', 'Ticket', 'Cabin')

# Fill missing values
def calculate_median(column_name):
    return titanic.filter(col(column_name).isNotNull()).approxQuantile(column_name, [0.5], 0)[0]

median_age = calculate_median('Age')  # Median age
median_fare = calculate_median('Fare')  # Median fare

titanic = titanic.fillna({
    'Pclass': -1,
    'Sex': 'Unknown',
    'Age': median_age,
    'SibSp': -1,
    'Parch': -1,
    'Fare': median_fare,
    'Embarked': 'Unknown'
})

# Feature Engineering
titanic = titanic.withColumn('FamilySize', col('SibSp') + col('Parch') + 1)
titanic = titanic.withColumn('IsAlone', when(col('FamilySize') == 1, 1).otherwise(0))

# Define categorical features 
categorical_features = ['Pclass', 'Sex', 'Embarked', 'IsAlone']

# Define numerical features 
numerical_features = ['Age', 'SibSp', 'Parch', 'Fare', 'FamilySize']

# One-hot encoding for categorical features
stages = []
for col_name in categorical_features:
    string_indexer = StringIndexer(inputCol=col_name, outputCol=f"{col_name}_Index")
    encoder = OneHotEncoder(inputCols=[f"{col_name}_Index"], outputCols=[f"{col_name}_Vec"])
    stages += [string_indexer, encoder]
    
# Scaling numerical features 
for col_name in numerical_features:
    assembler = VectorAssembler(inputCols=[col_name], outputCol=f"vec_{col_name}")
    scaler = StandardScaler(inputCol=f"vec_{col_name}", outputCol=f"scaled_{col_name}", withStd=True, withMean=False)
    stages += [assembler, scaler]

# Create a pipeline and transform the data
pipeline = Pipeline(stages=stages)
pipeline_model = pipeline.fit(titanic)
titanic = pipeline_model.transform(titanic)

# Drop intermediate columns created during scaling and one-hot encoding
titanic = titanic.drop('vec_Age', 'vec_Fare', 'vec_FamilySize', 'vec_SibSp', 'vec_Parch', 'Pclass_Index', 'Sex_Index', 'Embarked_Index', 'IsAlone_Index')

# Drop original categorical columns (no longer needed after one-hot encoding)
titanic = titanic.drop(*categorical_features)

# Drop original numeric columns (no longer needed after scaling)
titanic = titanic.drop(*numerical_features)

vector_columns = ["Pclass_Vec", "Sex_Vec", "Embarked_Vec", "IsAlone_Vec", "scaled_Age", "scaled_Fare", "scaled_FamilySize", "scaled_SibSp", "scaled_Parch"]

def to_dense(vector):
    return vector.toArray().tolist()

to_dense_udf = udf(to_dense, ArrayType(FloatType()))

for vector_col in vector_columns:
    titanic = titanic.withColumn(vector_col, to_dense_udf(col(vector_col)))

for vector_col in vector_columns:
    num_features = len(titanic.select(vector_col).first()[0])  # Getting the size of the vector
    
    for i in range(num_features):
        titanic = titanic.withColumn(f"{vector_col}_{i}", col(vector_col).getItem(i))
    
    titanic = titanic.drop(vector_col)

# Save the processed data to GCS
titanic.write.parquet(args.processed_data_path, mode="overwrite")

# Stop the SparkSession
spark.stop()

### Upload source code for PySpark

We need to upload our PySpark code to our GCS bucket to be referenced by the `DataprocPySparkBatchOp` component in our pipeline.

In [None]:
! gsutil cp $PYSPARK_DIR/preprocessing.py $PREPROCESSING_PYTHON_FILE_URI

# Create custom training job
In this section, we will create our custom training job. It will consist of the following steps:
1. Create a Google Artifact Registry repository to host our custom container image.
2. Create our custom training script.
3. Create a Dockerfile that will specify how to build our custom container image. 
4. Build our custom container image.
5. Push our custom container image to Google Artifact Registry so that we can use it in subsequent steps in our pipeline.

## Create Google Artifact Registry repository

Our custom training component in our pipeline will run in a container on the Vertex AI Training service. In this section, we will create the Google Artifact Registry repository in which we can store our custom container image that we will build in later steps in this notebook.

In [None]:
!gcloud artifacts repositories create $TRAIN_REPO_NAME --repository-format=docker \
--location=$REGION --description="Train repo for MLOps workload"

## Define the code for our training job

The following code will create a file that contains the code for our custom training job. 

The code performs the following steps:

1. Imports required libraries and sets initial variable values based on arguments passed to the script (the arguments are described below).
2. Reads in the processed dataset (Parquet files) that was created by the data preprocessing step in our pipeline.
3. Separates the target variable ("Survived") from the input features.
4. Splits the data into training and test sets.
5. Defines and compiles a Keras Sequential model based on the hyperparameters passed to the script.
6. Trains the model.
7. Evaluates the model on the test set, reporting the loss, accuracy, and AUC.
8. Saves the trained model to GCS.
9. Saves the test data to GCS in JSON Lines format so that we can use it for inference requests later.
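
The training job receives its hyperparameters as a flat list of strings (the `TRAINING_ARGS` constant defined earlier), and `argparse` converts them back to typed values. The sketch below illustrates that round trip with a hypothetical subset of the flags; `build_parser` and the example values are assumptions for illustration only.

```python
import argparse

def build_parser():
    """Parser with a hypothetical subset of the training script's flags."""
    parser = argparse.ArgumentParser(description="Illustrative training flags")
    parser.add_argument("--batch_size", type=int, default=32)
    parser.add_argument("--epochs", type=int, default=20)
    parser.add_argument("--learning_rate", type=float, default=0.001)
    parser.add_argument("--activation_fn", type=str, default="relu")
    return parser

# Values are passed as strings, in alternating flag/value order.
training_args = ["--batch_size", str(64), "--learning_rate", str(0.01)]
args = build_parser().parse_args(training_args)

# Flags that were supplied are converted to their declared types;
# flags that were omitted fall back to their defaults.
print(args.batch_size, args.learning_rate, args.epochs, args.activation_fn)
```

This is why the notebook wraps numeric constants in `str(...)` when building the argument list: container arguments are strings, and the script restores the types.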

In [None]:
%%writefile {TRAINER_DIR}/train.py

import argparse
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import Adam
from google.cloud import storage
import gcsfs
import os
import json

def train_model(args):
    # Input arguments
    project_id = args.project_id
    bucket_name = args.bucket_name
    processed_data_path = args.processed_data_path
    model_path = args.model_path
    batch_size = args.batch_size
    epochs = args.epochs
    learning_rate = args.learning_rate
    n_hidden_layers = args.n_hidden_layers
    n_units = args.n_units
    activation_fn = args.activation_fn
    test_data_file_name = args.test_data_file_name
    
    ### DATA PREPARATION SECTION ###
    
    # Get list of all Parquet files created by our preprocessing step in our GCS directory
    fs = gcsfs.GCSFileSystem(project=project_id)  # Uses the project ID passed in via --project_id
    files = [f for f in fs.ls(processed_data_path) if 'part' in os.path.basename(f)]

    print(f"Found files: {files}")
    if not files:
        raise FileNotFoundError(f"No Parquet files found in directory: {processed_data_path}")

    # Read all Parquet files and concatenate into a single DataFrame
    dfs = [pd.read_parquet('gs://' + file) for file in files]
    data = pd.concat(dfs, ignore_index=True)

    # Separate the target and input features in the dataset
    y = data['Survived'].values.astype('float32') # Ensuring the target column has a consistent data type
    X = data.drop('Survived', axis=1)

    # Convert X to NumPy array (required input for training)
    X = X.values

    # Split the data into training and testing sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    
    ### MODEL TRAINING AND EVALUATION SECTION ###

    # Define the model
    model = Sequential()
    model.add(Dense(n_units, activation=activation_fn, input_shape=(X_train.shape[1],)))
    for _ in range(n_hidden_layers - 1):
        model.add(Dense(n_units, activation=activation_fn))
    model.add(Dense(1, activation='sigmoid'))

    # Compile the model
    model.compile(optimizer=Adam(learning_rate=learning_rate), loss='binary_crossentropy', metrics=['accuracy'])

    # Train the model
    model.fit(
        X_train,
        y_train,
        epochs=epochs,
        batch_size=batch_size,
        validation_data=(X_test, y_test)
    )

    # Evaluate the model
    test_loss, test_acc = model.evaluate(X_test, y_test)
    
    # Get the model predictions
    y_pred = model.predict(X_test).ravel()

    # Compute the AUC
    auc = roc_auc_score(y_test, y_pred)
    print(f'Test Loss: {test_loss}, Test Accuracy: {test_acc}, AUC: {auc}')

    # Save the model to GCS
    model.save(model_path)
    
    ### SAVING TEST DATA FOR LATER REFERENCE ###    
    # Converting the test dataset to JSON Lines format and saving it to GCS
    
    # Convert numpy array to list of lists
    X_test_list = X_test.tolist()
    
    # Create a JSONL string from the list of lists
    jsonl_str = "\n".join(json.dumps(instance) for instance in X_test_list)
    
    # Initialize the GCS client
    client = storage.Client()
    
    # Get the bucket details
    bucket = client.get_bucket(bucket_name)
    
    # Create the blob (See https://cloud.google.com/python/docs/reference/storage/latest/google.cloud.storage.blob.Blob)
    blob = bucket.blob(test_data_file_name)
    
    # Upload the JSONL string to the blob
    blob.upload_from_string(jsonl_str)
    
    # Return the trained model
    return model  

if __name__ == '__main__':
    parser = argparse.ArgumentParser(description='Train a neural network model for Titanic survival prediction')
    
    parser.add_argument('--project_id', type=str, help='GCP Project ID')
    parser.add_argument('--bucket_name', type=str, help='GCP Bucket ID')
    parser.add_argument('--processed_data_path', type=str, help='Path to the directory containing the preprocessed data')
    parser.add_argument('--test_data_file_name', type=str, help='Name of the file in which to save the test data')
    parser.add_argument('--model_path', type=str, help='Path to save the trained model')
    parser.add_argument('--n_hidden_layers', type=int, default=2, help='Number of hidden layers')
    parser.add_argument('--n_units', type=int, default=64, help='Number of units per layer')
    parser.add_argument('--activation_fn', type=str, default='relu', help='Activation function for hidden layers')
    parser.add_argument('--learning_rate', type=float, default=0.001, help='Learning rate')
    parser.add_argument('--batch_size', type=int, default=32, help='Batch size')
    parser.add_argument('--epochs', type=int, default=20, help='Number of epochs')

    args = parser.parse_args()

    train_model(args)

### Create our requirements.txt file
The requirements.txt file is a convenient way to specify all of the packages that we want to install in our custom container image. This file will be referenced in the Dockerfile for our image.

In this case, we will install:
* [The Vertex AI Python SDK](https://cloud.google.com/python/docs/reference/aiplatform/latest)
* [Python Client for Google Cloud Storage](https://cloud.google.com/python/docs/reference/storage/latest)
* [Filesystem interfaces for Python](https://filesystem-spec.readthedocs.io/en/latest/)
* [GCSFS](https://gcsfs.readthedocs.io/en/latest/)
* [pyarrow](https://arrow.apache.org/docs/python/index.html)

In [None]:
%%writefile {APPLICATION_DIR}/requirements.txt
google-cloud-aiplatform
google-cloud-storage
fsspec==2023.5.0
gcsfs==2023.5.0
pyarrow

## Create the Dockerfile for our custom training container

The [Dockerfile](https://docs.docker.com/engine/reference/builder/) specifies how to build our custom container image.

This Dockerfile specifies that we want to:
1. Use a Vertex AI [prebuilt container for custom training](https://cloud.google.com/vertex-ai/docs/training/pre-built-containers) as a base image.
2. Install the required dependencies specified in our requirements.txt file.
3. Copy our custom training script to the container image.
4. Run our custom training script when the container starts up.
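
As a rough mental model of step 4, arguments supplied to a container are appended to an exec-form `ENTRYPOINT`. The sketch below shows the resulting command line; `container_command` and the example arguments are hypothetical, and the exact behavior on Vertex AI should be confirmed against the `ContainerSpec` documentation.

```python
import shlex

# Entry point matching the exec-form ENTRYPOINT used for the trainer image.
ENTRYPOINT = ["python", "-m", "trainer.train"]

def container_command(entrypoint, args):
    """Return the effective command: entry point followed by the arguments."""
    return list(entrypoint) + [str(a) for a in args]

# Invented example arguments (the bucket name is a placeholder).
example_args = ["--epochs", "20", "--model_path", "gs://example-bucket/models/titanic"]
command = container_command(ENTRYPOINT, example_args)

print(shlex.join(command))
```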

In [None]:
%%writefile {APPLICATION_DIR}/Dockerfile

# Use a Vertex AI prebuilt TensorFlow training container as the parent image
FROM us-docker.pkg.dev/vertex-ai/training/tf-cpu.2-12.py310:latest

WORKDIR /

COPY requirements.txt /requirements.txt

# Install any needed packages specified in requirements.txt
RUN pip install --upgrade pip
RUN pip install --no-cache-dir -r requirements.txt

# Copies the trainer code to the Docker image.
COPY trainer /trainer

# Sets up the entry point to invoke the trainer.
ENTRYPOINT ["python", "-m", "trainer.train"]

### Build our custom training image

The steps required to build our image are:

1. Change directory to our application directory.
2. Configure Docker to authenticate with Artifact Registry.
3. Build the Docker image.
4. Push the image to our Google Artifact Registry repository.
5. Change directory back to the parent directory.
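
The image URI and the registry host used for Docker authentication both depend on the region, so it can help to see how they relate. This is a small sketch with hypothetical helper names (`artifact_registry_image_uri`, `registry_host`) and invented example values; it follows the same URI layout as the `TRAIN_IMAGE_URI` constant.

```python
def artifact_registry_image_uri(region, project_id, repository, image, tag="latest"):
    """Build an image URI in the layout used by TRAIN_IMAGE_URI."""
    return f"{region}-docker.pkg.dev/{project_id}/{repository}/{image}:{tag}"

def registry_host(image_uri):
    """Return the registry host, i.e. everything before the first slash."""
    return image_uri.split("/", 1)[0]

# Invented example values.
uri = artifact_registry_image_uri(
    "europe-west1", "example-project", "mlops-titanic-train", "mlops-titanic-train"
)

print(uri)
print(registry_host(uri))
```

If you changed the region earlier in the notebook, the host passed to `gcloud auth configure-docker` needs to match the host in the image URI.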

In [None]:
cd $APPLICATION_DIR

In [None]:
! gcloud auth configure-docker {REGION}-docker.pkg.dev --quiet

In [None]:
! docker build ./ -t $TRAIN_IMAGE_URI --quiet

### Push our custom image to Google Artifact Registry

In [None]:
! docker push $TRAIN_IMAGE_URI

In [None]:
cd ..

# Define our Vertex AI Pipeline

Now that we have defined our custom data preprocessing and model training components, it's time to define our MLOps pipeline.

In this section, we will use the Kubeflow Pipelines SDK and Google Cloud Pipeline Components to define our MLOps pipeline.

We begin by specifying all of the required variables in our pipeline, and populating their values from the constants we defined earlier in our notebook. We then specify the following components in our pipeline:

1. [DataprocPySparkBatchOp](https://cloud.google.com/vertex-ai/docs/pipelines/dataproc-component) to perform our data preprocessing step.
2. [CustomTrainingJobOp](https://cloud.google.com/vertex-ai/docs/pipelines/customjob-component#customjobop) to perform our custom model training step.
3. [importer](https://www.kubeflow.org/docs/components/pipelines/v2/components/importer-component/) to import our [UnmanagedContainerModel](https://cloud.google.com/python/docs/reference/aiplatform/latest/google.cloud.aiplatform_v1.types.UnmanagedContainerModel) object.
4. [ModelUploadOp](https://google-cloud-pipeline-components.readthedocs.io/en/google-cloud-pipeline-components-2.0.0/api/v1/model.html#v1.model.ModelUploadOp) to upload our Model artifact into Vertex AI Model Registry.
5. [EndpointCreateOp](https://google-cloud-pipeline-components.readthedocs.io/en/google-cloud-pipeline-components-2.0.0/api/v1/endpoint.html#v1.endpoint.EndpointCreateOp) to create a Vertex AI [Endpoint](https://cloud.google.com/vertex-ai/docs/reference/rest/v1/projects.locations.endpoints).
6. [ModelDeployOp](https://google-cloud-pipeline-components.readthedocs.io/en/google-cloud-pipeline-components-2.0.0/api/v1/endpoint.html#v1.endpoint.ModelDeployOp) to deploy our Google Cloud Vertex AI Model to an Endpoint, creating a [DeployedModel](https://cloud.google.com/vertex-ai/docs/reference/rest/v1/projects.locations.endpoints#deployedmodel) object within it.
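
The six components run in a fixed order because each one depends on the one before it, either through an explicit ordering constraint or by consuming an earlier component's output. The sketch below models those dependencies with Python's standard `graphlib` module; the task names are hypothetical labels, not identifiers from KFP or Vertex AI.

```python
from graphlib import TopologicalSorter

# Each key maps to the set of tasks that must finish before it can start.
dependencies = {
    "preprocess": set(),
    "train": {"preprocess"},
    "import_model": {"train"},
    "upload_model": {"import_model"},
    "create_endpoint": {"upload_model"},
    # Deployment needs both the endpoint and the uploaded model.
    "deploy_model": {"create_endpoint", "upload_model"},
}

execution_order = list(TopologicalSorter(dependencies).static_order())
print(execution_order)
```

Because the dependencies form a single chain, there is only one valid execution order, which is the order in which the components are listed above.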

In [None]:
@dsl.pipeline(name=PIPELINE_NAME, description="MLOps pipeline for custom data preprocessing, model training, and deployment.")
def pipeline(
    bucket_name: str = BUCKET,
    display_name: str = PIPELINE_NAME,
    preprocessing_main_python_file_uri: str = PREPROCESSING_PYTHON_FILE_URI,
    preprocessing_args: list = PREPROCESSING_ARGS,
    processed_data_path: str = PROCESSED_DATA_URI,
    model_path: str = MODEL_URI,
    model_name: str = MODEL_NAME,
    project_id: str = PROJECT_ID,
    location: str = REGION,
    subnetwork_uri: str = SUBNETWORK_URI,
    dataproc_runtime_version: str = DATAPROC_RUNTIME_VERSION,
    worker_pool_specs: list = WORKER_POOL_SPEC,
    base_output_directory: str = PIPELINE_ROOT,
    explanation_parameters: dict = EXPLANATION_PARAMS,
    serving_image_uri: str = SERVING_IMAGE_URI,
    endpoint_name: str = ENDPOINT_NAME
):
    
    # Preprocess data
    preprocessing_op = DataprocPySparkBatchOp(
        project=project_id,
        location=location,
        main_python_file_uri=preprocessing_main_python_file_uri,
        args=preprocessing_args,
        subnetwork_uri=subnetwork_uri,
        runtime_config_version=dataproc_runtime_version,
    )

    # Train model
    model_training_op = custom_job.CustomTrainingJobOp(
        project=project_id,
        location=location,
        display_name="train-mlops-model",
        worker_pool_specs = worker_pool_specs,
    ).after(preprocessing_op)
    
    importer_op = dsl.importer(
        artifact_uri=model_path,
        artifact_class=artifact_types.UnmanagedContainerModel,
        metadata={
            "containerSpec": {
                "imageUri": serving_image_uri,
            },
        },
    ).after(model_training_op)

    model_upload_op = ModelUploadOp(
        project=project_id,
        location=location,  # Keep the model in the same region as the rest of the pipeline
        display_name=model_name,
        unmanaged_container_model=importer_op.outputs["artifact"],
        explanation_parameters=explanation_parameters,
    ).after(importer_op)

    endpoint_create_op = EndpointCreateOp(
        project=project_id,
        location=location,  # Keep the endpoint in the same region as the model
        display_name=endpoint_name,
    ).after(model_upload_op)

    model_deploy_op = ModelDeployOp(
        endpoint=endpoint_create_op.outputs["endpoint"],
        model=model_upload_op.outputs["model"],
        deployed_model_display_name=model_name,
        dedicated_resources_machine_type="n1-standard-16",
        dedicated_resources_min_replica_count=1,
        dedicated_resources_max_replica_count=1,
    ).after(endpoint_create_op)

### Compile our pipeline into a YAML file

Now that we have defined our pipeline structure, we need to compile it into YAML format in order to run it in Vertex AI Pipelines.

In [None]:
compiler.Compiler().compile(pipeline, 'mlops-pipeline.yaml')

## Submit and run our pipeline in Vertex AI Pipelines

Now we're ready to use the Vertex AI Python SDK to submit and run our pipeline in Vertex AI Pipelines.

The parameters, artifacts, and metrics produced by the pipeline run are automatically captured in Vertex AI Experiments as an experiment run. We will discuss the concept of Vertex AI Experiments in more detail in later chapters of the book. The output of the following cell will provide a link at which you can watch your pipeline as it progresses through each of the steps.

In [None]:
pipeline = aiplatform.PipelineJob(display_name=PIPELINE_NAME, template_path='mlops-pipeline.yaml', enable_caching=False)

pipeline.submit(experiment=EXPERIMENT_NAME)

### Wait for the pipeline to complete
The following call periodically prints the status of our pipeline execution. If all goes to plan, you will eventually see a message saying "PipelineJob run completed".

In [None]:
pipeline.wait()
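
Conceptually, waiting on a job is a polling loop: check the state, stop if it is terminal, otherwise sleep and check again. The sketch below illustrates that idea with a fake state source. It is not the SDK's implementation. The state labels, the `wait_for_completion` helper, and its parameters are all hypothetical names chosen for this example.

```python
import time

# Hypothetical, simplified state labels for illustration only.
TERMINAL_STATES = {"SUCCEEDED", "FAILED", "CANCELLED"}


def wait_for_completion(get_state, poll_seconds=1.0, max_polls=100):
    """Poll get_state() until it returns a terminal state, then return it.

    Raises TimeoutError if no terminal state is seen within max_polls checks.
    """
    for _ in range(max_polls):
        state = get_state()
        print(f"Current state: {state}")
        if state in TERMINAL_STATES:
            return state
        time.sleep(poll_seconds)
    raise TimeoutError(f"No terminal state after {max_polls} polls")


# Simulate a job that is pending, runs for a while, and then succeeds.
fake_states = iter(["PENDING", "RUNNING", "RUNNING", "SUCCEEDED"])
final_state = wait_for_completion(lambda: next(fake_states), poll_seconds=0)
print(f"Final state: {final_state}")
```

Capping the number of polls keeps a stuck job from blocking the notebook forever, which is a design choice of this sketch rather than a statement about how `pipeline.wait()` behaves.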

## Great job!! You have officially created and implemented an MLOps pipeline on Vertex AI!!

### Next, let's send an inference request to our new model!

Now that our model has been deployed to a Vertex AI Endpoint, we can start sending inference requests to it! In the real world, inference requests may come from a variety of potential sources. In our case, our training script created a small subset of our processed dataset to use for testing purposes. For convenience and example purposes, our training script also saved that test dataset to GCS. We can use it now to send inference requests to our model.

### Copy the test dataset to a local directory in our notebook

In [None]:
! gsutil cp $TEST_DATASET_PATH $TEST_DATA_DIR

## Get model endpoint details

In order to test our model, we first need to get the details of our newly-deployed model endpoint in Vertex AI:

In [None]:
mlops_endpoint_list = aiplatform.Endpoint.list(filter=f'display_name="{ENDPOINT_NAME}"', order_by='create_time desc')
new_mlops_endpoint = mlops_endpoint_list[0]
endpoint_resource_name = new_mlops_endpoint.resource_name
print(endpoint_resource_name)
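
Indexing the list with `[0]` raises an `IndexError` if no endpoint matches, for example when the pipeline has not finished deploying. The sketch below shows one way to guard against that and to pick the most recent match explicitly. `EndpointRecord` is a hypothetical stand-in for an endpoint object so that the example runs without calling Vertex AI, and the resource names are made up.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class EndpointRecord:
    """Hypothetical stand-in for an endpoint object returned by a list call."""
    resource_name: str
    create_time: datetime


def newest_endpoint(endpoints):
    """Return the most recently created endpoint, or raise a clear error."""
    if not endpoints:
        raise LookupError("No endpoint matched the display name filter")
    return max(endpoints, key=lambda endpoint: endpoint.create_time)


# Made-up records for illustration only.
candidates = [
    EndpointRecord("projects/123/locations/us-central1/endpoints/1", datetime(2024, 1, 5)),
    EndpointRecord("projects/123/locations/us-central1/endpoints/2", datetime(2024, 3, 9)),
]

selected = newest_endpoint(candidates)
print(selected.resource_name)
```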

## Send inference requests to our model
Let's go ahead and test it out! The following code will read a record from our test dataset and send it in an inference request to our model endpoint in Vertex AI.
Our model should provide a prediction response that will be printed below the code cell. It should be a number between 0 and 1, which predicts the probability of survival for that record. Numbers closer to zero predict a low probability of survival, while numbers closer to 1 predict a higher probability.

In [None]:
import json

file_path = LOCAL_TEST_DATASET_PATH

with open(file_path, 'r') as f:
    # Read the first line of the file
    line = f.readline()

    # Parse the JSON line into a Python object
    instance = json.loads(line)

    # Wrap the instance in a list, since predict() expects a list of instances
    instance_list = [instance]

    # Send the inference request
    response = aiplatform.Endpoint(endpoint_resource_name).predict(instance_list)

    # Print the response
    print(f"Response: {response}")
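
The cell above sends a single record. If we wanted to send several records at once and turn the returned probabilities into yes/no labels, a sketch could look like the following. The `read_instances` and `to_label` helpers, the 0.5 threshold, and the demo feature values are all assumptions made for illustration. The exact shape of the prediction response depends on the model, so the sketch works on plain numbers rather than on a real response object.

```python
import json
import os
import tempfile


def read_instances(path, limit):
    """Read up to `limit` records from a JSON Lines file (one JSON value per line)."""
    instances = []
    with open(path, "r") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # Skip blank lines
            instances.append(json.loads(line))
            if len(instances) >= limit:
                break
    return instances


def to_label(probability, threshold=0.5):
    """Map a survival probability to 1 (survived) or 0 (did not survive)."""
    return 1 if probability >= threshold else 0


# Demo with a temporary file holding made-up feature rows.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo.jsonl")
    with open(demo_path, "w") as f:
        f.write("[0.1, 0.2, 1.0]\n[0.5, 0.9, 0.0]\n[0.3, 0.3, 1.0]\n")
    demo_instances = read_instances(demo_path, limit=2)

print(demo_instances)

# Made-up probabilities standing in for model output.
demo_labels = [to_label(p) for p in (0.12, 0.87)]
print(demo_labels)
```

In the notebook, the list returned by `read_instances(LOCAL_TEST_DATASET_PATH, limit=...)` could be passed to `predict` in place of `instance_list`.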

# That's it! Well Done!

# Clean up

# WARNING: THE MODEL AND ENDPOINT CREATED IN THIS NOTEBOOK ARE ALSO USED IN CHAPTER 12. IF YOU PLAN TO PROCEED WITH THE ACTIVITIES IN CHAPTER 12, DO NOT DELETE THESE RESOURCES YET.

When you no longer need the resources created by this notebook, you can delete them as follows.

**Note: if you do not delete the resources, you will continue to pay for them.**

In [None]:
clean_up = False  # Set to True if you want to delete the resources

## Delete Vertex AI resources

In [None]:
from google.api_core import exceptions as gcp_exceptions
if clean_up:  
    try:
        endpoint_list = aiplatform.Endpoint.list(filter=f'display_name="{ENDPOINT_NAME}"')
        if endpoint_list:
            endpoint = endpoint_list[0]  # Assuming only one endpoint with that name

            # Undeploy all models (if any)
            try:
                endpoint.undeploy_all()
                print(f"Undeployed all models from endpoint: {ENDPOINT_NAME}")
            except gcp_exceptions.NotFound:
                print(f"No models found to undeploy from endpoint: {ENDPOINT_NAME}")
            except Exception as e:  # Catching general errors for better debugging
                print(f"Unexpected error while undeploying models: {e}")

            # Delete endpoint
            try:
                endpoint.delete()
                print(f"Deleted endpoint: {ENDPOINT_NAME}")
            except Exception as e:
                print(f"Error deleting endpoint: {e}")
        else:
            print(f"No endpoint found matching: {ENDPOINT_NAME}")
    except gcp_exceptions.NotFound:
        print(f"Endpoint not found: {ENDPOINT_NAME}")

    # Delete models
    try:
        model_list = aiplatform.Model.list(filter=f'display_name="{MODEL_NAME}"')
        if model_list:
            for model in model_list:
                print(f"Deleting model: {model.display_name}")
                model.delete()
        else:
            print(f"No models found matching: {MODEL_NAME}")
    except gcp_exceptions.NotFound:
        print(f"Model not found: {MODEL_NAME}")

    # Delete pipeline
    try:
        pipeline.delete()
    except gcp_exceptions.NotFound:
        print("Pipeline job not found; it may already have been deleted.")

else:
    print("clean_up parameter is set to False.")


## Delete artifact repository

In [None]:
if clean_up:  
    try:
        # Delete the artifact repository
        ! gcloud artifacts repositories delete $TRAIN_REPO_NAME --location=$REGION --quiet
    except Exception as e:
        print(f"Error deleting artifact registry: {e}")
else:
    print("clean_up parameter is set to False.")
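
One caveat about the cell above: a shell command run with `!` does not raise a Python exception when it fails, so the `except` branch will not catch a failed `gcloud` call. If we want failures to be catchable, one option is to run the command through `subprocess` with `check=True`. The sketch below uses a hypothetical `run_command` helper and demonstrates it with harmless Python subprocesses instead of `gcloud`.

```python
import subprocess
import sys


def run_command(args):
    """Run a command and return True on success, False on failure."""
    try:
        result = subprocess.run(args, check=True, capture_output=True, text=True)
    except subprocess.CalledProcessError as e:
        # Raised because check=True and the command exited with a non-zero code.
        print(f"Command failed with exit code {e.returncode}: {e.stderr.strip()}")
        return False
    except FileNotFoundError:
        # Raised when the executable itself cannot be found.
        print(f"Command not found: {args[0]}")
        return False
    print(result.stdout.strip())
    return True


# Harmless demo commands: one that succeeds and one that exits with code 3.
succeeded = run_command([sys.executable, "-c", "print('deleted')"])
failed = run_command([sys.executable, "-c", "import sys; sys.exit(3)"])

# In the notebook, the same helper could wrap the gcloud call, for example:
# run_command(["gcloud", "artifacts", "repositories", "delete",
#              TRAIN_REPO_NAME, f"--location={REGION}", "--quiet"])
```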

## Delete GCS Bucket
The bucket can be reused throughout multiple activities in the book. Sometimes, activities in certain chapters make use of artifacts from previous chapters that are stored in the GCS bucket.

I highly recommend **not deleting the bucket** unless you will be performing no further activities in the book. For this reason, there's a separate `delete_bucket` variable to specify if you want to delete the bucket.

If you want to delete the bucket, set the `delete_bucket` parameter to `True`.

In [None]:
delete_bucket = False

In [None]:
if delete_bucket:
    # Delete the bucket
    ! gcloud storage rm --recursive gs://$BUCKET
else:
    print("delete_bucket parameter is set to False")