# Introduction and setup

In this notebook, we will build a pipeline that performs the following steps:
1. Runs custom data processing, such as feature scaling, one-hot encoding, and feature engineering, in a Google Cloud Serverless Spark environment.
2. Runs a custom training job in Vertex AI to train a custom model. In this case, our model uses the [Titanic dataset from Kaggle](https://www.kaggle.com/competitions/titanic/data) to predict the likelihood of survival of each passenger based on their associated features in the dataset.
3. Uploads the trained model to Vertex AI Model Registry.
4. Deploys the trained model to a Vertex AI endpoint for online inference.

In this initial section, we set up all of the baseline requirements to run our pipeline.

## Install required packages

Install additional package dependencies not installed in your notebook environment, such as the Vertex AI SDK, Kubeflow Pipelines (KFP), and Google Cloud Pipeline Components. Use the latest major GA version of each package.

In [None]:
! python -m pip install --upgrade pip

In [None]:
! pip3 install --quiet --user --upgrade google-cloud-aiplatform kfp google-cloud-pipeline-components

### Restart the kernel

After you install the additional packages, you need to restart the notebook kernel so it can find the packages.

In [None]:
# Automatically restart kernel after installs
import os

if not os.getenv("IS_TESTING"):
    # Automatically restart kernel after installs
    import IPython

    app = IPython.Application.instance()
    app.kernel.do_shutdown(True)

## Import required libraries

We will use the following libraries in this notebook:

* [The Vertex AI Python SDK](https://cloud.google.com/python/docs/reference/aiplatform/latest)
* [Kubeflow Pipelines (KFP)](https://www.kubeflow.org/docs/components/pipelines/v1/sdk/sdk-overview/)
* [Google Cloud Pipeline Components (GCPC)](https://cloud.google.com/vertex-ai/docs/pipelines/components-introduction)

In [1]:
# General
from google.cloud import aiplatform

# Kubeflow Pipelines (KFP)
import kfp
from kfp import compiler, dsl
from kfp.dsl import component, Input, Output, Artifact

# Google Cloud Pipeline Components (GCPC)
from google_cloud_pipeline_components.v1.dataproc import DataprocPySparkBatchOp
from google_cloud_pipeline_components.v1 import dataset, custom_job
from google_cloud_pipeline_components.v1.model import ModelUploadOp
from google_cloud_pipeline_components.types import artifact_types
from google_cloud_pipeline_components.v1.endpoint import EndpointCreateOp, ModelDeployOp

## Define constants
In this section, we define all of the constants that will be referenced throughout the rest of the notebook.

**REPLACE THE PROJECT_ID, REGION, AND BUCKET DETAILS WITH YOUR DETAILS.**

In [32]:
# Core constants
PROJECT_ID="YOUR_PROJECT_ID"
REGION="YOUR_REGION"
BUCKET="YOUR_BUCKET"
BUCKET_URI = f"gs://{BUCKET}"
APPLICATION_DIR = "mlops-titanic-app" # Local parent directory for our pipeline resources
TRAINER_DIR = f"{APPLICATION_DIR}/trainer" # Local directory for training resources
PYSPARK_DIR = "pyspark-titanic-dir" # Local directory for PySpark data processing resources
APP_NAME="mlops-titanic" # Base name for our pipeline application

# Pipeline constants
PIPELINE_NAME = "mlops-titanic-pipeline" # Name of our pipeline
PIPELINE_ROOT = f"{BUCKET_URI}/pipelines" # (See: https://www.kubeflow.org/docs/components/pipelines/v1/overview/pipeline-root/)
SUBNETWORK = "default" # Our VPC subnet name
SUBNETWORK_URI = f"projects/{PROJECT_ID}/regions/{REGION}/subnetworks/{SUBNETWORK}" # Our VPC subnet resource identifier
MODEL_NAME = "mlops-titanic" # Name of our model
EXPERIMENT_NAME = "aiml-sa-mlops-experiment" # Vertex AI "Experiment" name for metadata tracking

# Preprocessing constants
PYSPARK_REPO_NAME=f'{APP_NAME}-pyspark' # Name of repository in which we will store our custom PySpark image
PYSPARK_IMAGE_URI = f"{REGION}-docker.pkg.dev/{PROJECT_ID}/{PYSPARK_REPO_NAME}/{APP_NAME}-pyspark:latest"
SOURCE_DATASET = f"{BUCKET_URI}/data/unprocessed/titanic/train.csv" # Our raw source dataset
PREPROCESSING_PYTHON_FILE_URI = f"{BUCKET_URI}/code/mlops/preprocessing.py" # GCS location of our PySpark script
PROCESSED_DATA_URI =f"{BUCKET_URI}/data/processed/mlops-titanic" # Location to store the output of our data preprocessing step
DATAPROC_RUNTIME_VERSION = "2.1" # (See https://cloud.google.com/dataproc-serverless/docs/concepts/versions/spark-runtime-versions)
# Arguments to pass to our preprocessing script:
PREPROCESSING_ARGS = [
    "--source_dataset",
    SOURCE_DATASET,
    "--processed_data_path",
    PROCESSED_DATA_URI,
]

# Training constants
TRAIN_REPO_NAME=f'{APP_NAME}-train' # Name of repository in which we will store our custom training image
TRAIN_IMAGE_URI = f"{REGION}-docker.pkg.dev/{PROJECT_ID}/{TRAIN_REPO_NAME}/{APP_NAME}-train:latest"
MODEL_URI = f"{BUCKET_URI}/models/mlops-chapter/titanic" # Where to store our trained model
# Where to store our test data:
TEST_DATA_PREFIX = "test_data" 
TEST_DATA_DIR = f"{TEST_DATA_PREFIX}_dir"
TEST_DATA_FILE_NAME = f"{TEST_DATA_PREFIX}.jsonl"
TEST_DATASET_PATH = f"{BUCKET_URI}/{TEST_DATA_FILE_NAME}"
LOCAL_TEST_DATASET_PATH = f"./{TEST_DATA_DIR}/{TEST_DATA_FILE_NAME}"

# Hyperparameters for training
BATCH_SIZE: int = 32
EPOCHS: int = 20
LEARNING_RATE: float = 0.001
N_HIDDEN_LAYERS: int = 3
N_UNITS: int = 64
ACTIVATION_FN: str = 'relu'

# Arguments to pass to our training job
TRAINING_ARGS=[
        "--project_id",
        PROJECT_ID,
        "--bucket_name",
        BUCKET,
        "--processed_data_path",
        PROCESSED_DATA_URI,
        "--test_data_file_name",
        TEST_DATA_FILE_NAME,
        "--model_path",
        MODEL_URI,
        "--batch_size",
        str(BATCH_SIZE),
        "--epochs",
        str(EPOCHS),
        "--learning_rate",
        str(LEARNING_RATE),
        "--n_hidden_layers",
        str(N_HIDDEN_LAYERS),
        "--n_units",
        str(N_UNITS),
        "--activation_fn",
        ACTIVATION_FN,
    ]

# Worker pool spec (see https://cloud.google.com/vertex-ai/docs/reference/rest/v1/CustomJobSpec#workerpoolspec)
WORKER_POOL_SPEC = [
    {
        "machine_spec": {
            "machine_type": "n1-standard-4",
        },
        "replica_count": 1,
        "container_spec": {
            "image_uri": TRAIN_IMAGE_URI,
            "args": TRAINING_ARGS
        },
    }
]

# Serving constants
SERVING_IMAGE_URI = "us-docker.pkg.dev/vertex-ai/prediction/tf2-cpu.2-12:latest" # (See: https://cloud.google.com/vertex-ai/docs/predictions/pre-built-containers)
ENDPOINT_NAME = "mlops-endpoint" # Name of endpoint on which to serve our trained model
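
Container arguments are passed as a flat list of strings, which is why the numeric hyperparameters above are wrapped in `str()`; the training script converts them back with `argparse`. The following is a minimal sketch of that round trip. The parser here mirrors only three of the flags and is illustrative, not the training script itself.

```python
import argparse

# Illustrative values mirroring three of the hyperparameter constants above.
BATCH_SIZE = 32
LEARNING_RATE = 0.001
ACTIVATION_FN = "relu"

# Container args form a flat list of strings: flag, value, flag, value, ...
example_args = [
    "--batch_size", str(BATCH_SIZE),
    "--learning_rate", str(LEARNING_RATE),
    "--activation_fn", ACTIVATION_FN,
]


def build_parser() -> argparse.ArgumentParser:
    """Build a parser whose `type=` converters undo the str() calls above."""
    parser = argparse.ArgumentParser(description="Sketch of hyperparameter parsing")
    parser.add_argument("--batch_size", type=int, default=32)
    parser.add_argument("--learning_rate", type=float, default=0.001)
    parser.add_argument("--activation_fn", type=str, default="relu")
    return parser


parsed = build_parser().parse_args(example_args)
print(parsed.batch_size, parsed.learning_rate, parsed.activation_fn)
```

Passing an explicit list to `parse_args` (instead of reading `sys.argv`) makes it easy to check the argument list locally before it is handed to a container.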

### Create local directories
We will use the following local directories during the activities in this notebook.

In [3]:
# make a source directory to save the code
!mkdir -p $APPLICATION_DIR
!mkdir -p $TRAINER_DIR
!mkdir -p $PYSPARK_DIR
!mkdir -p $TEST_DATA_DIR

### Upload source dataset 
Upload our source dataset to GCS. Our data preprocessing step in our pipeline will ingest this data from GCS.

In [4]:
! gsutil cp ./data/train.csv $SOURCE_DATASET

Copying file://./data/train.csv [Content-Type=text/csv]...
/ [1 files][ 59.8 KiB/ 59.8 KiB]                                                
Operation completed over 1 objects/59.8 KiB.                                     


### Set project ID for gcloud
The following command sets our project ID for using gcloud commands in this notebook.

In [5]:
! gcloud config set project $PROJECT_ID --quiet

Updated property [core/project].


### Initialize the Vertex AI SDK client

In [6]:
aiplatform.init(project=PROJECT_ID, location=REGION, staging_bucket=BUCKET_URI)

### Configure Private Google Access for Dataproc 
Our Serverless Spark data preprocessing job in our pipeline will run in Dataproc, which is (as Google defines) Google's "fully managed and highly scalable service for running Apache Hadoop, Apache Spark, Apache Flink, Presto, and 30+ open source tools and frameworks."
We're going to configure something called "Private Google Access", which allows us to interact with Google services without sending requests over the public Internet.

You can learn more about Dataproc [here](https://cloud.google.com/dataproc?hl=en), and learn more about Private Google Access [here](https://cloud.google.com/vpc/docs/private-google-access).

In [7]:
!gcloud compute networks subnets list --regions=$REGION --filter=$SUBNETWORK

!gcloud compute networks subnets update $SUBNETWORK \
--region=$REGION \
--enable-private-ip-google-access

!gcloud compute networks subnets describe $SUBNETWORK \
--region=$REGION \
--format="get(privateIpGoogleAccess)"

NAME     REGION       NETWORK  RANGE          STACK_TYPE  IPV6_ACCESS_TYPE  INTERNAL_IPV6_PREFIX  EXTERNAL_IPV6_PREFIX
default  us-central1  default  10.128.0.0/20  IPV4_ONLY
Updated [https://www.googleapis.com/compute/v1/projects/still-sight-352221/regions/us-central1/subnetworks/default].
True


### Create Google Artifact Registry repositories

Our custom preprocessing and custom training components in our pipeline will run in containers on Dataproc Serverless Spark and Vertex AI Training, respectively. In this section, we will create the Google Artifact Registry repositories in which we can store the custom container images that we will build in later steps in this notebook.

Our data preprocessing step in our pipeline will use Serverless Spark to perform our data preparation and feature engineering steps. We will use our own PySpark code to perform the data processing steps, so we're going to create a custom container to run our code.
This step creates a Docker repository in Google Artifact Registry, which is where we will store our custom container image that we're going to create.

#### Create Google Artifact Registry repository for our custom preprocessing container image

In [8]:
!gcloud artifacts repositories create $PYSPARK_REPO_NAME --repository-format=docker \
--location=$REGION --description="PySpark repo for MLOps workload"

# Register gcloud as a Docker credential helper
! gcloud auth configure-docker $REGION-docker.pkg.dev --quiet

Create request issued for: [mlops-titanic-pyspark]
Waiting for operation [projects/still-sight-352221/locations/us-central1/operat
ions/6456da31-d277-4975-bfe4-2fed546062f5] to complete...done.                 
Created repository [mlops-titanic-pyspark].

{
  "credHelpers": {
    "gcr.io": "gcloud",
    "us.gcr.io": "gcloud",
    "eu.gcr.io": "gcloud",
    "asia.gcr.io": "gcloud",
    "staging-k8s.gcr.io": "gcloud",
    "marketplace.gcr.io": "gcloud",
    "us-central1-docker.pkg.dev": "gcloud"
  }
}
Adding credentials for: us-central1-docker.pkg.dev
gcloud credential helpers already registered correctly.


#### Create Google Artifact Registry repository for our custom training container image

In [9]:
!gcloud artifacts repositories create $TRAIN_REPO_NAME --repository-format=docker \
--location=$REGION --description="Train repo for MLOps workload"

Create request issued for: [mlops-titanic-train]
Waiting for operation [projects/still-sight-352221/locations/us-central1/operat
ions/4e7f375f-b13a-42ff-9ccb-06c5ee9b1007] to complete...done.                 
Created repository [mlops-titanic-train].
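
The image URIs defined in the constants section follow Artifact Registry's Docker naming scheme: `LOCATION-docker.pkg.dev/PROJECT_ID/REPOSITORY/IMAGE:TAG`. As a minimal sketch, the two helpers below build and split such a URI. The helper names and the `my-project` project ID are hypothetical and used only for illustration.

```python
def artifact_registry_image_uri(
    region: str, project_id: str, repo: str, image: str, tag: str = "latest"
) -> str:
    """Build a Docker image URI in Artifact Registry's naming scheme."""
    return f"{region}-docker.pkg.dev/{project_id}/{repo}/{image}:{tag}"


def split_image_uri(uri: str) -> dict:
    """Split an image URI back into its named parts."""
    host, project_id, repo, image_and_tag = uri.split("/", 3)
    image, _, tag = image_and_tag.partition(":")
    return {
        "host": host,
        "project_id": project_id,
        "repository": repo,
        "image": image,
        "tag": tag or "latest",  # Docker assumes "latest" when no tag is given
    }


# "my-project" is a placeholder project ID, not a real project.
uri = artifact_registry_image_uri(
    "us-central1", "my-project", "mlops-titanic-pyspark", "mlops-titanic-pyspark"
)
print(uri)
print(split_image_uri(uri))
```

The repository segment must match the name passed to `gcloud artifacts repositories create`, otherwise a later `docker push` has no repository to write to.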


# Create custom PySpark job
In this section, we will create our custom PySpark job. It will consist of the following steps:
1. Create our custom PySpark script.
2. Create a Dockerfile that will specify how to build our custom container image. See [here](https://docs.docker.com/engine/reference/builder/) for more details.
3. Build our custom container image.
4. Push our custom container image to Google Artifact Registry so that we can use it in subsequent steps in our pipeline.

## Define the code for our PySpark job

The following code will create a file that contains the code for our custom PySpark data preprocessing job. 

The code initiates a Spark session, loads our raw source dataset, and then performs the following processing steps (we performed many of these steps using pandas in our feature engineering chapter earlier in this book, but in this case we will implement the steps using PySpark in Google Cloud Serverless Spark):

1. Removes rows from the dataset where the target variable ("Survived") is missing.
2. Drops columns that are unlikely to affect the likelihood of surviving, such as 'PassengerId', 'Name', 'Ticket', and 'Cabin'.
3. Fills in missing values in input features.
4. Performs some feature engineering by creating new features such as 'FamilySize' and 'IsAlone' from combinations of existing features.
5. Ensures that all numeric features are on a consistent scale with each other.
6. One-hot encodes all categorical features.
7. Converts the resulting sparse vector to a dense vector. This mainly makes it easier for us to feed the data into our Keras model in our training script later, with minimal processing needed in the training script.
8. Writes the resulting processed data to a parquet file in GCS.
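
To make steps 3 and 4 concrete, here is a minimal plain-Python sketch on a few made-up rows (no Spark required). The rows and the helper name `engineer_features` are illustrative assumptions. Note also that `statistics.median` averages the two middle values, whereas Spark's `approxQuantile` returns a value observed in the data, so the two can differ slightly.

```python
from statistics import median

# Made-up sample rows; None marks a missing value.
rows = [
    {"Age": 22.0, "SibSp": 1, "Parch": 0},
    {"Age": None, "SibSp": 0, "Parch": 0},
    {"Age": 38.0, "SibSp": 1, "Parch": 2},
]


def engineer_features(rows):
    """Fill missing ages with the median, then add FamilySize and IsAlone."""
    known_ages = [r["Age"] for r in rows if r["Age"] is not None]
    median_age = median(known_ages)

    processed = []
    for r in rows:
        age = r["Age"] if r["Age"] is not None else median_age
        # Siblings/spouses + parents/children + the passenger themselves
        family_size = r["SibSp"] + r["Parch"] + 1
        processed.append(
            {
                **r,
                "Age": age,
                "FamilySize": family_size,
                "IsAlone": 1 if family_size == 1 else 0,
            }
        )
    return processed


processed = engineer_features(rows)
for row in processed:
    print(row)
```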

In [10]:
%%writefile $PYSPARK_DIR/preprocessing.py

import argparse
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import OneHotEncoder, StringIndexer, StandardScaler, VectorAssembler
from pyspark.sql.functions import udf, col, when
from pyspark.sql.types import StringType, ArrayType, FloatType

# Setting up the argument parser
parser = argparse.ArgumentParser(description='Data Preprocessing Script')
parser.add_argument('--source_dataset', type=str, help='Path to the source dataset')
parser.add_argument('--processed_data_path', type=str, help='Path to save the output data')

# Parsing the arguments
args = parser.parse_args()
source_dataset = args.source_dataset
processed_data_path = args.processed_data_path

# Initialize a SparkSession
spark = SparkSession.builder \
    .appName("Titanic Data Processing") \
    .getOrCreate()

# Load the data
titanic = spark.read.csv(args.source_dataset, header=True, inferSchema=True)

# Remove rows where 'Survived' is missing
titanic = titanic.filter(titanic.Survived.isNotNull())

# Drop irrelevant columns
titanic = titanic.drop('PassengerId', 'Name', 'Ticket', 'Cabin')

# Fill missing values
def calculate_median(column_name):
    return titanic.filter(col(column_name).isNotNull()).approxQuantile(column_name, [0.5], 0)[0]

median_age = calculate_median('Age')  # Median age
median_fare = calculate_median('Fare')  # Median fare

titanic = titanic.fillna({
    'Pclass': -1,
    'Sex': 'Unknown',
    'Age': median_age,
    'SibSp': -1,
    'Parch': -1,
    'Fare': median_fare,
    'Embarked': 'Unknown'
})

# Feature Engineering
titanic = titanic.withColumn('FamilySize', col('SibSp') + col('Parch') + 1)
titanic = titanic.withColumn('IsAlone', when(col('FamilySize') == 1, 1).otherwise(0))

# Define categorical features 
categorical_features = ['Pclass', 'Sex', 'Embarked', 'IsAlone']

# Define numerical features 
numerical_features = ['Age', 'SibSp', 'Parch', 'Fare', 'FamilySize']

# One-hot encoding for categorical features
stages = []
for col_name in categorical_features:
    string_indexer = StringIndexer(inputCol=col_name, outputCol=f"{col_name}_Index")
    encoder = OneHotEncoder(inputCols=[f"{col_name}_Index"], outputCols=[f"{col_name}_Vec"])
    stages += [string_indexer, encoder]
    
# Scaling numerical features 
for col_name in numerical_features:
    assembler = VectorAssembler(inputCols=[col_name], outputCol=f"vec_{col_name}")
    scaler = StandardScaler(inputCol=f"vec_{col_name}", outputCol=f"scaled_{col_name}", withStd=True, withMean=False)
    stages += [assembler, scaler]

# Create a pipeline and transform the data
pipeline = Pipeline(stages=stages)
pipeline_model = pipeline.fit(titanic)
titanic = pipeline_model.transform(titanic)

# Drop intermediate columns created during scaling and one-hot encoding
titanic = titanic.drop('vec_Age', 'vec_Fare', 'vec_FamilySize', 'vec_SibSp', 'vec_Parch', 'Pclass_Index', 'Sex_Index', 'Embarked_Index', 'IsAlone_Index')

# Drop original categorical columns (no longer needed after one-hot encoding)
titanic = titanic.drop(*categorical_features)

# Drop original numeric columns (no longer needed after scaling)
titanic = titanic.drop(*numerical_features)

vector_columns = ["Pclass_Vec", "Sex_Vec", "Embarked_Vec", "IsAlone_Vec", "scaled_Age", "scaled_Fare", "scaled_FamilySize", "scaled_SibSp", "scaled_Parch"]

def to_dense(vector):
    return vector.toArray().tolist()

to_dense_udf = udf(to_dense, ArrayType(FloatType()))

for vector_col in vector_columns:
    titanic = titanic.withColumn(vector_col, to_dense_udf(col(vector_col)))

for vector_col in vector_columns:
    num_features = len(titanic.select(vector_col).first()[0])  # Getting the size of the vector
    
    for i in range(num_features):
        titanic = titanic.withColumn(f"{vector_col}_{i}", col(vector_col).getItem(i))
    
    titanic = titanic.drop(vector_col)

# Save the processed data to GCS
titanic.write.parquet(args.processed_data_path, mode="overwrite")

# Stop the SparkSession
spark.stop()

Overwriting pyspark-titanic-dir/preprocessing.py
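
The scaling and encoding stages in the script can be easier to reason about on a toy example. The sketch below is a plain-Python approximation, under the assumption that `StandardScaler(withStd=True, withMean=False)` divides each value by the sample standard deviation without centering, and that `StringIndexer` followed by `OneHotEncoder` (with default settings) indexes labels by descending frequency and drops the last category. The function names and sample values are hypothetical.

```python
from collections import Counter
from statistics import stdev


def scale_without_centering(values):
    """Divide by the sample standard deviation; do not subtract the mean."""
    s = stdev(values)  # sample standard deviation (n - 1 denominator)
    return [v / s for v in values]


def one_hot_drop_last(labels):
    """Index labels by descending frequency, then one-hot encode them.

    The least frequent label maps to the all-zeros vector, so a feature
    with k distinct labels produces k - 1 output columns.
    """
    counts = Counter(labels)
    # Most frequent label gets index 0; ties are broken alphabetically.
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    index = {label: i for i, label in enumerate(ordered)}
    width = len(ordered) - 1
    return [
        [1.0 if index[label] == i else 0.0 for i in range(width)]
        for label in labels
    ]


fares = [2.0, 4.0, 6.0]                 # made-up values
ports = ["S", "C", "S", "Q", "S", "C"]  # made-up values

print(scale_without_centering(fares))
print(one_hot_drop_last(ports))
```

Because the last category is dropped, a feature with three distinct labels expands to two columns, which is why the script measures each vector's length before splitting it into individual columns.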


## Create the Dockerfile for our PySpark container

The [Dockerfile](https://docs.docker.com/engine/reference/builder/) specifies how to build our custom container image.

This Dockerfile (adapted from [this official Google example](https://github.com/GoogleCloudPlatform/vertex-ai-samples/blob/main/notebooks/official/pipelines/google_cloud_pipeline_components_dataproc_tabular.ipynb)) specifies that we want to:
1. Use a base Debian image.
2. Install required dependencies such as [procps](https://www.linux.co.cr/ldp/lfs/appendixa/procps.html), [tini](https://packages.debian.org/sid/tini), and [openjdk-11-jdk-headless](https://packages.debian.org/sid/openjdk-11-jdk-headless).
3. Set required environment variables.
4. Copy our PySpark script to the container image.
5. Set required permissions.
6. Run our PySpark script.

In [11]:
%%writefile {PYSPARK_DIR}/Dockerfile

# Copyright 2022 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Debian 11 is recommended.
FROM debian:11-slim

# Suppress interactive prompts
ENV DEBIAN_FRONTEND=noninteractive

# Switch to the root user for installation
USER root

# (Required) Install utilities required by Spark scripts.
RUN apt update && apt install -y procps tini

# Install system dependencies
RUN apt-get update -y && \
    apt-get install -y wget openjdk-11-jdk-headless && \
    rm -rf /var/lib/apt/lists/*

# Set environment variables for Java
ENV JAVA_HOME /usr/lib/jvm/java-11-openjdk-amd64/
RUN export JAVA_HOME

# Install Spark
ENV APACHE_SPARK_VERSION 3.1.2
RUN wget -qO - https://archive.apache.org/dist/spark/spark-${APACHE_SPARK_VERSION}/spark-${APACHE_SPARK_VERSION}-bin-hadoop3.2.tgz | tar -xz -C /usr/local/ && \
    ln -s /usr/local/spark-${APACHE_SPARK_VERSION}-bin-hadoop3.2 /usr/local/spark

ENV SPARK_HOME /usr/local/spark
ENV PATH $PATH:/usr/local/spark/bin

COPY preprocessing.py /preprocessing.py

# (Required) Create the 'spark' group/user.
# The GID and UID must be 1099. Home directory is required.
RUN groupadd -g 1099 spark
RUN useradd -u 1099 -g 1099 -d /home/spark -m spark
USER spark

CMD ["spark-submit", "/preprocessing.py"]

Overwriting pyspark-titanic-dir/Dockerfile


## Build a custom container image for our PySpark job

Our custom PySpark job will run on Google Cloud Serverless Spark, which executes our data processing code as a Batch job on [Dataproc Serverless](https://cloud.google.com/dataproc-serverless/docs). This requires our code to be packaged into a Docker container, and that's what we're doing in this section. Our custom container image will then be referenced in the `DataprocPySparkBatchOp` component in our pipeline to run the Serverless Spark job.

First, we change our working directory to the local PySpark directory that we've created.

In [12]:
cd $PYSPARK_DIR

/home/jupyter/Chapter-11-MLops/pyspark-titanic-dir


Next, we build our custom container image.

In [13]:
! docker build ./ -t $PYSPARK_IMAGE_URI --quiet

sha256:6335b55851a6888db1c3510f96b5eb3456866075ea0efef762c68767708b2ea1


Finally, we push our image to Google Artifact Registry.

In [14]:
! docker push $PYSPARK_IMAGE_URI

The push refers to repository [us-central1-docker.pkg.dev/still-sight-352221/mlops-titanic-pyspark/mlops-titanic-pyspark]

cdda1b6d: Preparing
3e1bd22b: Preparing
f3e38a64: Preparing
914ac522: Preparing
ed1d026e: Preparing
d1067a00: Preparing
d1067a00: Mounted from still-sight-352221/mlops-titanic-train-pyspark/mlops-titanic-train-pyspark
latest: digest: sha256:d380b67ac9b7c278fac3588d7994ab16f3d65b0b33396d80111da3f49fa9c2f2 size: 1790


Change our working directory back to our original working directory. 

In [15]:
cd ..

/home/jupyter/Chapter-11-MLops


### Upload source code for PySpark

This is a required step for the `DataprocPySparkBatchOp` component in google-cloud-pipeline-components. We need to upload our PySpark code to our GCS bucket so that it can be referenced by the `DataprocPySparkBatchOp` component in our pipeline.

In [16]:
! gsutil cp $PYSPARK_DIR/preprocessing.py $PREPROCESSING_PYTHON_FILE_URI

Copying file://pyspark-titanic-dir/preprocessing.py [Content-Type=text/x-python]...
/ [1 files][  3.9 KiB/  3.9 KiB]                                                
Operation completed over 1 objects/3.9 KiB.                                      


# Create custom training job
In this section, we will create our custom training job. It will consist of the following steps:
1. Create our custom training script.
2. Create a Dockerfile that will specify how to build our custom container image. 
3. Build our custom container image.
4. Push our custom container image to Google Artifact Registry so that we can use it in subsequent steps in our pipeline.

## Define the code for our training job

The following code will create a file that contains the code for our custom training job. 

The code performs the following processing steps:

1. Imports required libraries and sets initial variable values based on arguments passed to the script (the arguments are described below).
2. Reads in the processed dataset that was created by the data preprocessing step in our pipeline.
3. Separates the target variable ("Survived") from the input features.
4. Splits the data into training and test sets.
5. Defines and compiles a Keras neural network whose size is controlled by the hyperparameter arguments.
6. Trains the model, then evaluates it on the test set and computes the AUC.
7. Saves the trained model to GCS.
8. Saves the test data to GCS in JSON Lines format for later reference.
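
The training script that follows builds a stack of `Dense` layers: the first layer (which also declares the input shape) counts as one hidden layer, the loop adds `n_hidden_layers - 1` more, and a single sigmoid unit produces the survival probability. As a rough sketch, the hypothetical helper below lists the resulting layers and their parameter counts. The input width of 12 is an illustrative assumption, not a value taken from the pipeline.

```python
def dense_layer_plan(n_inputs, n_hidden_layers, n_units):
    """Return (units, activation, parameter_count) for each Dense layer."""
    plan = []
    fan_in = n_inputs
    for _ in range(n_hidden_layers):
        # Weights (fan_in * n_units) plus one bias per unit
        plan.append((n_units, "relu", fan_in * n_units + n_units))
        fan_in = n_units
    # Output layer: one sigmoid unit with fan_in weights and one bias
    plan.append((1, "sigmoid", fan_in + 1))
    return plan


plan = dense_layer_plan(n_inputs=12, n_hidden_layers=3, n_units=64)
total_params = sum(params for _, _, params in plan)

for units, activation, params in plan:
    print(f"Dense({units}, activation={activation!r}): {params} parameters")
print(f"Total parameters: {total_params}")
```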

In [17]:
%%writefile {TRAINER_DIR}/train.py

import argparse
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import Adam
from google.cloud import storage
import gcsfs
import os
import json

def train_model(args):
    # Input arguments
    project_id = args.project_id
    bucket_name = args.bucket_name
    processed_data_path = args.processed_data_path
    model_path = args.model_path
    batch_size = args.batch_size
    epochs = args.epochs
    learning_rate = args.learning_rate
    n_hidden_layers = args.n_hidden_layers
    n_units = args.n_units
    activation_fn = args.activation_fn
    test_data_file_name = args.test_data_file_name
    
    ### DATA PREPARATION SECTION ###
    
    # Get list of all Parquet files created by our preprocessing step in our GCS directory
    fs = gcsfs.GCSFileSystem(project=project_id)  # Use the project ID passed via --project_id
    files = [f for f in fs.ls(processed_data_path) if 'part' in os.path.basename(f)]

    print(f"Found files: {files}")
    if not files:
        raise FileNotFoundError(f"No Parquet files found in directory: {processed_data_path}")

    # Read all Parquet files and concatenate into a single DataFrame
    dfs = [pd.read_parquet('gs://' + file) for file in files]
    data = pd.concat(dfs, ignore_index=True)

    # Separate the target and input features in the dataset
    y = data['Survived'].values.astype('float32') # Ensuring the target column has a consistent data type
    X = data.drop('Survived', axis=1)

    # Convert X to NumPy array (required input for training)
    X = X.values

    # Split the data into training and testing sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    
    ### MODEL TRAINING AND EVALUATION SECTION ###

    # Define the model
    model = Sequential()
    model.add(Dense(n_units, activation=activation_fn, input_shape=(X_train.shape[1],)))
    for _ in range(n_hidden_layers - 1):
        model.add(Dense(n_units, activation=activation_fn))
    model.add(Dense(1, activation='sigmoid'))

    # Compile the model
    model.compile(optimizer=Adam(learning_rate=learning_rate), loss='binary_crossentropy', metrics=['accuracy'])

    # Train the model
    model.fit(
        X_train,
        y_train,
        epochs=epochs,
        batch_size=batch_size,
        validation_data=(X_test, y_test)
    )

    # Evaluate the model
    test_loss, test_acc = model.evaluate(X_test, y_test)
    
    # Get the model predictions
    y_pred = model.predict(X_test).ravel()

    # Compute the AUC
    auc = roc_auc_score(y_test, y_pred)
    print(f'Test Loss: {test_loss}, Test Accuracy: {test_acc}, AUC: {auc}')

    # Save the model to GCS
    model.save(model_path)
    
    ### SAVING TEST DATA FOR LATER REFERENCE ###    
    # Converting the test dataset to JSON Lines format and saving it to GCS
    
    # Convert numpy array to list of lists
    X_test_list = X_test.tolist()
    
    # Create a JSONL string from the list of lists
    jsonl_str = "\n".join(json.dumps(instance) for instance in X_test_list)
    
    # Initialize the GCS client
    client = storage.Client()
    
    # Get the bucket details
    bucket = client.get_bucket(bucket_name)
    
    # Create the blob (See https://cloud.google.com/python/docs/reference/storage/latest/google.cloud.storage.blob.Blob)
    blob = bucket.blob(test_data_file_name)
    
    # Upload the JSONL string to the blob
    blob.upload_from_string(jsonl_str)
    
    # Return the trained model
    return model  

if __name__ == '__main__':
    parser = argparse.ArgumentParser(description='Train a neural network model for Titanic survival prediction')
    
    parser.add_argument('--project_id', type=str, help='GCP Project ID')
    parser.add_argument('--bucket_name', type=str, help='GCS bucket name')
    parser.add_argument('--processed_data_path', type=str, help='Path to the directory containing the preprocessed data')
    parser.add_argument('--test_data_file_name', type=str, help='Name of the JSON Lines file in which to save the test data')
    parser.add_argument('--model_path', type=str, help='Path to save the trained model')
    parser.add_argument('--n_hidden_layers', type=int, default=2, help='Number of hidden layers')
    parser.add_argument('--n_units', type=int, default=64, help='Number of units per layer')
    parser.add_argument('--activation_fn', type=str, default='relu', help='Activation function for hidden layers')
    parser.add_argument('--learning_rate', type=float, default=0.001, help='Learning rate')
    parser.add_argument('--batch_size', type=int, default=32, help='Batch size')
    parser.add_argument('--epochs', type=int, default=20, help='Number of epochs')

    args = parser.parse_args()

    train_model(args)

Overwriting mlops-titanic-app/trainer/train.py
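
The last part of the training script serializes the test features as JSON Lines: one JSON array per line, one line per instance. The following is a minimal sketch of that format and of reading it back. The feature vectors are made up and the helper names are hypothetical.

```python
import json

# Made-up feature vectors standing in for rows of X_test.
instances = [[0.0, 1.0, 2.5], [1.0, 0.0, 0.75]]


def to_jsonl(instances):
    """Serialize each instance as one JSON document per line."""
    return "\n".join(json.dumps(instance) for instance in instances)


def from_jsonl(text):
    """Parse JSON Lines text back into a list of instances."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


jsonl_str = to_jsonl(instances)
restored = from_jsonl(jsonl_str)

print(jsonl_str)
print(restored == instances)
```

Keeping one instance per line means the file can be read back line by line without loading a single large JSON document.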


### Create our requirements.txt file
The requirements.txt file is a convenient way to specify all of the packages that we want to install in our custom container image. This file will be referenced in the Dockerfile for our image.

In this case, we will install:
* [The Vertex AI Python SDK](https://cloud.google.com/python/docs/reference/aiplatform/latest)
* [Python Client for Google Cloud Storage](https://cloud.google.com/python/docs/reference/storage/latest)
* [Filesystem interfaces for Python](https://filesystem-spec.readthedocs.io/en/latest/)
* [GCSFS](https://gcsfs.readthedocs.io/en/latest/)
* [pyarrow](https://arrow.apache.org/docs/python/index.html)

In [18]:
%%writefile {APPLICATION_DIR}/requirements.txt
google-cloud-aiplatform
google-cloud-storage
fsspec==2023.5.0
gcsfs==2023.5.0
pyarrow

Overwriting mlops-titanic-app/requirements.txt


## Create the Dockerfile for our custom training container

The [Dockerfile](https://docs.docker.com/engine/reference/builder/) specifies how to build our custom container image.

This Dockerfile specifies that we want to:
1. Use Vertex AI [prebuilt container for custom training](https://cloud.google.com/vertex-ai/docs/training/pre-built-containers) as a base image.
2. Install the required dependencies specified in our requirements.txt file.
3. Copy our custom training script to the container image.
4. Run our custom training script when the container starts up.

In [19]:
%%writefile {APPLICATION_DIR}/Dockerfile

# Use a Vertex AI prebuilt TensorFlow training container as the base image
FROM us-docker.pkg.dev/vertex-ai/training/tf-cpu.2-12.py310:latest

WORKDIR /

COPY requirements.txt /requirements.txt

# Install any needed packages specified in requirements.txt
RUN pip install --upgrade pip
RUN pip install --no-cache-dir -r requirements.txt

# Copies the trainer code to the Docker image.
COPY trainer /trainer

# Sets up the entry point to invoke the trainer.
ENTRYPOINT ["python", "-m", "trainer.train"]

Overwriting mlops-titanic-app/Dockerfile


### Build our custom training image

These steps are the same as the steps we performed above for our PySpark container, but in this case we are building our custom training container.

In [20]:
cd $APPLICATION_DIR

/home/jupyter/Chapter-11-MLops/mlops-titanic-app


In [21]:
! docker build ./ -t $TRAIN_IMAGE_URI --quiet

sha256:c685ebd7295abab98b38e7baf4a7b2dadaedb14d8c131e973378c297d8f1902b


### Push our custom image to Google Artifact Registry

In [22]:
! docker push $TRAIN_IMAGE_URI

The push refers to repository [us-central1-docker.pkg.dev/still-sight-352221/mlops-titanic-train/mlops-titanic-train]

7f0187c3: Preparing
12663cc0: Preparing
53c1883b: Preparing
76b5629f: Preparing
95c7b436: Preparing
967c8575: Preparing
bcc2d347: Preparing
439bc91e: Preparing
ee080030: Preparing
84cdb93c: Preparing
9a008127: Preparing
cc96b9f7: Preparing
37f6dd03: Preparing
3c0be4ce: Preparing
26e42ac3: Preparing
0a15895d: Preparing
98a818bd: Preparing
01f76958: Preparing
53965cbb: Preparing
66efbc81: Preparing
099a1ebc: Preparing
0c658f04: Preparing
edee2eb8: Preparing
dcc29187: Preparing
cb1fd645: Preparing
cb1fd645: Mounted from still-sight-352221/mlops-titanic-train-app/mlops-titanic-train
la

In [23]:
cd ..

/home/jupyter/Chapter-11-MLops


# Define our Vertex AI Pipeline

Now that we have defined our custom data preprocessing and model training components, it's time to define our MLOps pipeline.

In this section, we will use the Kubeflow Pipelines SDK and Google Cloud Pipeline Components to define our MLOps pipeline.

We begin by specifying all of the required variables in our pipeline, and populating their values from the constants we defined earlier in our notebook. We then specify the following components in our pipeline:

1. [DataprocPySparkBatchOp](https://cloud.google.com/vertex-ai/docs/pipelines/dataproc-component) to perform our data preprocessing step.
2. [CustomTrainingJobOp](https://cloud.google.com/vertex-ai/docs/pipelines/customjob-component#customjobop) to perform our custom model training step.
3. [importer](https://www.kubeflow.org/docs/components/pipelines/v2/components/importer-component/) to import our [UnmanagedContainerModel](https://cloud.google.com/python/docs/reference/aiplatform/latest/google.cloud.aiplatform_v1.types.UnmanagedContainerModel) object.
4. [ModelUploadOp](https://google-cloud-pipeline-components.readthedocs.io/en/google-cloud-pipeline-components-2.0.0/api/v1/model.html#v1.model.ModelUploadOp) to upload our Model artifact into Vertex AI Model Registry.
5. [EndpointCreateOp](https://google-cloud-pipeline-components.readthedocs.io/en/google-cloud-pipeline-components-2.0.0/api/v1/endpoint.html#v1.endpoint.EndpointCreateOp) to create a Vertex AI [Endpoint](https://cloud.google.com/vertex-ai/docs/reference/rest/v1/projects.locations.endpoints).
6. [ModelDeployOp](https://google-cloud-pipeline-components.readthedocs.io/en/google-cloud-pipeline-components-2.0.0/api/v1/endpoint.html#v1.endpoint.ModelDeployOp) to deploy our Google Cloud Vertex AI Model to an Endpoint, creating a [DeployedModel](https://cloud.google.com/vertex-ai/docs/reference/rest/v1/projects.locations.endpoints#deployedmodel) object within it.

In [24]:
@dsl.pipeline(name=PIPELINE_NAME, description="MLOps pipeline for custom data preprocessing, model training, and deployment.")
def pipeline(
    bucket_name: str = BUCKET,
    display_name: str = PIPELINE_NAME,
    preprocessing_main_python_file_uri: str = PREPROCESSING_PYTHON_FILE_URI,
    preprocessing_args: list = PREPROCESSING_ARGS,
    processed_data_path: str = PROCESSED_DATA_URI,
    model_path: str = MODEL_URI,
    custom_container_image: str = PYSPARK_IMAGE_URI,
    model_name: str = MODEL_NAME,
    project_id: str = PROJECT_ID,
    location: str = REGION,
    subnetwork_uri: str = SUBNETWORK_URI,
    dataproc_runtime_version: str = DATAPROC_RUNTIME_VERSION,
    worker_pool_specs: list = WORKER_POOL_SPEC,
    base_output_directory: str = PIPELINE_ROOT,
    serving_image_uri: str = SERVING_IMAGE_URI,
    endpoint_name: str = ENDPOINT_NAME
):
    
    # Preprocess data
    preprocessing_op = DataprocPySparkBatchOp(
        project=project_id,
        location=location,
        container_image=custom_container_image,
        main_python_file_uri=preprocessing_main_python_file_uri,
        args=preprocessing_args,
        subnetwork_uri=subnetwork_uri,
        runtime_config_version=dataproc_runtime_version,
    )

    # Train model
    model_training_op = custom_job.CustomTrainingJobOp(
        project=project_id,
        location=location,
        display_name="train-mlops-model",
        worker_pool_specs=worker_pool_specs,
    ).after(preprocessing_op)
    
    importer_op = dsl.importer(
        artifact_uri=model_path,
        artifact_class=artifact_types.UnmanagedContainerModel,
        metadata={
            "containerSpec": {
                "imageUri": serving_image_uri,
            },
        },
    ).after(model_training_op)

    model_upload_op = ModelUploadOp(
        project=project_id,
        display_name=model_name,
        unmanaged_container_model=importer_op.outputs["artifact"],
    ).after(importer_op)

    endpoint_create_op = EndpointCreateOp(
        project=project_id,
        display_name=endpoint_name,
    ).after(model_upload_op)

    model_deploy_op = ModelDeployOp(
        endpoint=endpoint_create_op.outputs["endpoint"],
        model=model_upload_op.outputs["model"],
        deployed_model_display_name=model_name,
        dedicated_resources_machine_type="n1-standard-16",
        dedicated_resources_min_replica_count=1,
        dedicated_resources_max_replica_count=1,
    ).after(endpoint_create_op)
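
To make the ordering implied by the `.after(...)` calls and output references above easier to see, here is a small sketch that models the six steps as a dependency graph using only the Python standard library. This is a toy model for illustration; it is not how Kubeflow Pipelines or Vertex AI Pipelines schedule work internally, and the `dependencies` dictionary is written by hand from the pipeline definition.

```python
from graphlib import TopologicalSorter

# Each key runs only after every step in its value set has finished.
# The edges mirror the .after(...) calls in the pipeline definition, plus the
# data dependency of model_deploy_op on the output of model_upload_op.
dependencies = {
    "preprocessing_op": set(),
    "model_training_op": {"preprocessing_op"},
    "importer_op": {"model_training_op"},
    "model_upload_op": {"importer_op"},
    "endpoint_create_op": {"model_upload_op"},
    "model_deploy_op": {"endpoint_create_op", "model_upload_op"},
}

# static_order() yields the steps in an order that respects every dependency.
execution_order = list(TopologicalSorter(dependencies).static_order())
for position, step in enumerate(execution_order, start=1):
    print(f"{position}. {step}")
```

Because every step depends on the one before it, this graph has only one valid order, which is why the pipeline runs as a simple linear chain.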

### Compile our pipeline into a YAML file

Now that we have defined our pipeline structure, we need to compile it into YAML format in order to run it in Vertex AI Pipelines.

In [25]:
compiler.Compiler().compile(pipeline, 'mlops-pipeline.yaml')

## Submit and run our pipeline in Vertex AI Pipelines

Now we're ready to use the Vertex AI Python SDK to submit and run our pipeline in Vertex AI Pipelines.

The parameters, artifacts, and metrics produced by the pipeline run are automatically captured in Vertex AI Experiments as an experiment run. We will discuss Vertex AI Experiments in more detail in later chapters of the book. The output of the following cell includes a link where you can watch your pipeline as it progresses through each of the steps.

In [26]:
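# Note: this reassigns the name `pipeline` from the pipeline function defined
# above to the PipelineJob instance that we submit and monitor below.
pipeline = aiplatform.PipelineJob(display_name=PIPELINE_NAME, template_path='mlops-pipeline.yaml')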

pipeline.submit(experiment=EXPERIMENT_NAME)

Creating PipelineJob
PipelineJob created. Resource name: projects/96449483013/locations/us-central1/pipelineJobs/mlops-titanic-pipeline-20230916202555
To use this PipelineJob in another session:
pipeline_job = aiplatform.PipelineJob.get('projects/96449483013/locations/us-central1/pipelineJobs/mlops-titanic-pipeline-20230916202555')
View Pipeline Job:
https://console.cloud.google.com/vertex-ai/locations/us-central1/pipelines/runs/mlops-titanic-pipeline-20230916202555?project=96449483013
Associating projects/96449483013/locations/us-central1/pipelineJobs/mlops-titanic-pipeline-20230916202555 to Experiment: aiml-sa-mlops-experiment


### Wait for the pipeline to complete
The following call blocks until the pipeline run finishes, periodically printing the status of our pipeline execution. If all goes to plan, you will eventually see a message saying "PipelineJob run completed".

In [27]:
pipeline.wait()

PipelineJob projects/96449483013/locations/us-central1/pipelineJobs/mlops-titanic-pipeline-20230916202555 current state:
PipelineState.PIPELINE_STATE_RUNNING
PipelineJob projects/96449483013/locations/us-central1/pipelineJobs/mlops-titanic-pipeline-20230916202555 current state:
PipelineState.PIPELINE_STATE_RUNNING
PipelineJob projects/96449483013/locations/us-central1/pipelineJobs/mlops-titanic-pipeline-20230916202555 current state:
PipelineState.PIPELINE_STATE_RUNNING
PipelineJob projects/96449483013/locations/us-central1/pipelineJobs/mlops-titanic-pipeline-20230916202555 current state:
PipelineState.PIPELINE_STATE_RUNNING
PipelineJob projects/96449483013/locations/us-central1/pipelineJobs/mlops-titanic-pipeline-20230916202555 current state:
PipelineState.PIPELINE_STATE_RUNNING
PipelineJob projects/96449483013/locations/us-central1/pipelineJobs/mlops-titanic-pipeline-20230916202555 current state:
PipelineState.PIPELINE_STATE_RUNNING
PipelineJob projects/96449483013/locations/us-centra
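
The repeated status lines above come from a polling loop. As a rough sketch of that idea, the helper below polls a status callable until it reports a terminal state or a timeout expires. Everything here is an assumption for illustration: `wait_for_terminal_state` is a hypothetical helper, the states are plain strings rather than the SDK's enum values, and the status source is a stand-in rather than a real pipeline job.

```python
import time
from typing import Callable

# Assumed terminal states, written as plain strings for illustration.
TERMINAL_STATES = {
    "PIPELINE_STATE_SUCCEEDED",
    "PIPELINE_STATE_FAILED",
    "PIPELINE_STATE_CANCELLED",
}


def wait_for_terminal_state(get_state: Callable[[], str],
                            poll_seconds: float = 30.0,
                            timeout_seconds: float = 3600.0) -> str:
    """Poll get_state() until it returns a terminal state or the timeout expires."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        state = get_state()
        print(f"current state: {state}")
        if state in TERMINAL_STATES:
            return state
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Still in state {state} after {timeout_seconds} seconds")
        time.sleep(poll_seconds)


# Stand-in for a real status lookup: two RUNNING reports, then success.
fake_states = iter([
    "PIPELINE_STATE_RUNNING",
    "PIPELINE_STATE_RUNNING",
    "PIPELINE_STATE_SUCCEEDED",
])
final_state = wait_for_terminal_state(lambda: next(fake_states), poll_seconds=0.01)
```

In this notebook, `pipeline.wait()` already handles the waiting for us, so a helper like this is only useful if you want custom timeouts or logging.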

## Great job!! You have officially created and implemented an MLOps pipeline on Vertex AI!!

### Next, let's send an inference request to our new model!

Now that our model has been deployed to a Vertex AI Endpoint, we can start sending inference requests to it! In the real world, inference requests may come from a variety of sources. In our case, our training script set aside a small subset of our processed dataset for testing purposes. For convenience, our training script also saved that test dataset to Google Cloud Storage (GCS). We can use it now to send inference requests to our model.

### Copy the test dataset to a local directory in our notebook

In [28]:
! gsutil cp $TEST_DATASET_PATH $TEST_DATA_DIR

Copying gs://kk-ml-book-us-actual-central-1/test_data.jsonl...
/ [1 files][ 19.8 KiB/ 19.8 KiB]                                                
Operation completed over 1 objects/19.8 KiB.                                     
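
Before sending requests, it can help to look at a few records from the downloaded JSON Lines file. The following is a minimal sketch using only the standard library. `read_jsonl` is a hypothetical helper, and the sample values written to the temporary file are made up purely to demonstrate it; they are not taken from the Titanic test dataset.

```python
import json
import os
import tempfile
from itertools import islice


def read_jsonl(path: str, limit: int = 3) -> list:
    """Return up to `limit` parsed records from a JSON Lines file, skipping blank lines."""
    with open(path, "r", encoding="utf-8") as f:
        non_blank = (line for line in f if line.strip())
        return [json.loads(line) for line in islice(non_blank, limit)]


# Demonstrate the helper on a small, made-up file.
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, "sample.jsonl")
    with open(sample_path, "w", encoding="utf-8") as f:
        f.write("[0.1, 1.0, 0.0]\n\n[0.7, 0.0, 1.0]\n[0.3, 1.0, 1.0]\n")
    preview = read_jsonl(sample_path, limit=2)

print(preview)
```

In this notebook, you could call `read_jsonl(LOCAL_TEST_DATASET_PATH)` to preview the first few test records.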


## Get model endpoint details

In order to test our model, we first need to get the details of our newly-deployed model endpoint in Vertex AI:

In [29]:
# Quote the display name in the filter, and sort so the newest endpoint comes first
mlops_endpoint_list = aiplatform.Endpoint.list(filter=f'display_name="{ENDPOINT_NAME}"', order_by='create_time desc')
new_mlops_endpoint = mlops_endpoint_list[0]
endpoint_resource_name = new_mlops_endpoint.resource_name
print(endpoint_resource_name)

projects/96449483013/locations/us-central1/endpoints/1927472470793650176
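
The resource name printed above encodes the project number, the location, and the endpoint ID. As a small sketch, the following splits a resource name of that shape into its parts. `parse_endpoint_resource_name` is a hypothetical helper written for this example, and the regular expression assumes exactly the `projects/.../locations/.../endpoints/...` layout shown in the output.

```python
import re

# Assumed layout: projects/<project>/locations/<location>/endpoints/<endpoint_id>
ENDPOINT_NAME_PATTERN = re.compile(
    r"^projects/(?P<project>[^/]+)"
    r"/locations/(?P<location>[^/]+)"
    r"/endpoints/(?P<endpoint_id>[^/]+)$"
)


def parse_endpoint_resource_name(resource_name: str) -> dict:
    """Split an endpoint resource name into project, location, and endpoint ID."""
    match = ENDPOINT_NAME_PATTERN.match(resource_name)
    if match is None:
        raise ValueError(f"Not an endpoint resource name: {resource_name!r}")
    return match.groupdict()


parts = parse_endpoint_resource_name(
    "projects/96449483013/locations/us-central1/endpoints/1927472470793650176"
)
print(parts)
```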


## Send inference requests to our model
Let's go ahead and test it out! The following code will read a record from our test dataset and send it in an inference request to our model endpoint in Vertex AI.
Our model should return a prediction response, which will be printed below the code cell. The prediction should be a number between 0 and 1 that represents the predicted probability of survival for that record. Numbers closer to 0 indicate a low probability of survival, while numbers closer to 1 indicate a higher probability.

In [30]:
import json

file_path = LOCAL_TEST_DATASET_PATH

with open(file_path, 'r') as f:
    # Read the first line of the file
    line = f.readline()

    # Convert JSON line to Python dictionary
    instance = json.loads(line)
    
    # Wrap the instance in a list, because predict() expects a list of instances
    instance_list = [instance]

    # Send the inference request
    response = aiplatform.Endpoint(endpoint_resource_name).predict(instance_list)

    # Print the response
    print(f"Response: {response}")

Response: Prediction(predictions=[[0.233431324]], deployed_model_id='4937398745969983488', model_version_id='1', model_resource_name='projects/96449483013/locations/us-central1/models/6037257819420360704', explanations=None)
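
The `predictions` field in the response above is a list with one row per instance, and each row holds a single probability. As a sketch of how you might turn those probabilities into a yes/no label, the helper below applies a cutoff. `survival_labels` is a hypothetical helper, and the 0.5 threshold is an assumption chosen for illustration, not a value defined by the model or this notebook.

```python
def survival_labels(predictions, threshold: float = 0.5):
    """Map each [probability] row to a (probability, predicted_survival) pair."""
    results = []
    for row in predictions:
        probability = float(row[0])
        # Treat probabilities at or above the threshold as "survived".
        results.append((probability, probability >= threshold))
    return results


# The predictions value copied from the response printed above.
labels = survival_labels([[0.233431324]])
print(labels)
```

In this notebook, you could pass `response.predictions` directly to the helper instead of the copied value.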


# Cleaning up

When you no longer need the resources created by this notebook, you can delete them as follows.

**Note: if you do not delete the resources, you will continue to pay for them.**

In [None]:
# Delete pipeline
pipeline.delete()

# Delete endpoints
endpoint_list = aiplatform.Endpoint.list(filter=f'display_name="{ENDPOINT_NAME}"')
for endpoint in endpoint_list:
    endpoint.undeploy_all()
    endpoint.delete()

# Delete model
model_list = aiplatform.Model.list(filter=f'display_name="{MODEL_NAME}"')
for model in model_list:
    model.delete()

In [None]:
# Delete the Artifact Registry repositories
! gcloud artifacts repositories delete $PYSPARK_REPO_NAME --location=$REGION --quiet
! gcloud artifacts repositories delete $TRAIN_REPO_NAME --location=$REGION --quiet