
# Deploying Custom Models in Vertex AI 
By Jameson Hohbein
[LinkedIn](https://www.linkedin.com/in/jameson-hohbein/) [GitHub](https://github.com/jamesonhohbein)


*   toc: true
*   badges: true
*   comments: true
*   categories: [jupyter]
*   image: images/vertex.png









## Introduction
This tutorial is for people who want to deploy a working custom PyTorch model in Vertex AI. It is not a substitute for reading the documentation of Vertex AI, TorchServe, PyTorch, or any other service used here. It is a hands-on, project-based introduction for people who learn best by doing.

## Background 
Vertex AI is now Google Cloud Platform's hub for all things AI. As far as I have seen, there are few to no tutorials on deploying custom models in Vertex AI. In this tutorial, I walk through each step needed to host your own custom PyTorch model in Vertex AI, since at the time of writing Vertex AI did not offer prebuilt containers for deploying PyTorch models. You will be able to scale and customize the software around your model as you see fit.


## Step 1 Select Your Environment 
We will be working with Docker images and the Google Cloud SDK. I recommend spinning up a Jupyter notebook instance in Vertex AI Workbench or running your own Jupyter instance on your local machine. Docker cannot be run in Google Colab and similar hosted notebook services, so you will need a dedicated instance that can run Docker.

## Step 2 Collect your Artifacts 
Now that you have a dedicated work environment, you need to collect your model artifacts into it. First things first, let's organize our files. Create a new directory named 'predictor' and, within it, another one named 'model'.


In [None]:
!mkdir predictor
!mkdir predictor/model

Place all of your model artifacts in the model directory. Model artifacts are the files needed to perform inference with your model. They range from a .bin file storing the model weights to config files and tokenizer files, and every model is different. Simply put, whatever files you need for inference go in the predictor/model directory.
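
As a quick sanity check before moving on, the sketch below reports which expected files are missing from `predictor/model`. The file names in `REQUIRED_FILES` are assumptions for a typical Hugging Face checkpoint, not something TorchServe or Vertex AI mandates, so swap in whatever your own model needs.

```python
from pathlib import Path

# Hypothetical file names for a Hugging Face style checkpoint; adjust to your model.
REQUIRED_FILES = ["config.json", "pytorch_model.bin"]


def missing_artifacts(model_dir, required=REQUIRED_FILES):
    """Return the required file names that are not present in model_dir."""
    model_path = Path(model_dir)
    return [name for name in required if not (model_path / name).is_file()]


missing = missing_artifacts("predictor/model")
if missing:
    print("Missing artifacts:", ", ".join(missing))
else:
    print("All expected artifacts are in place.")
```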

## Step 3 Preparing to Dockerize TorchServe
To host any model in Vertex AI, your Docker image needs to contain instructions to run an HTTP server that can return inferences and report the health of the server. There are many ways to do this. You could write a Flask app, and most major ML frameworks ship some kind of serving software of their own. TorchServe is the one for PyTorch models.
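
To make that requirement concrete, here is a minimal sketch of the two routes using only Python's standard library. It is not how TorchServe is implemented. The `fake_predict` function and the `/predictions/demo` path are placeholders chosen for illustration, and port 7080 simply matches the one used later in this tutorial.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def fake_predict(instances):
    """Placeholder for real inference: returns the length of each input text."""
    return [len(instance.get("text", "")) for instance in instances]


def route(method, path, body=b""):
    """Map a request to (status_code, response_dict) for the health and predict routes."""
    if method == "GET" and path == "/ping":
        return 200, {"status": "Healthy"}
    if method == "POST" and path == "/predictions/demo":
        instances = json.loads(body or b"{}").get("instances", [])
        return 200, {"predictions": fake_predict(instances)}
    return 404, {"error": "unknown route"}


class Handler(BaseHTTPRequestHandler):
    def _respond(self, method):
        # Read the request body (if any), route it, and send a JSON response.
        length = int(self.headers.get("Content-Length", 0))
        status, payload = route(method, self.path, self.rfile.read(length))
        data = json.dumps(payload).encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(data)))
        self.end_headers()
        self.wfile.write(data)

    def do_GET(self):
        self._respond("GET")

    def do_POST(self):
        self._respond("POST")


def serve(port=7080):
    """Start the demo server; blocks until interrupted."""
    HTTPServer(("0.0.0.0", port), Handler).serve_forever()
```

Calling `serve()` starts the demo server. In the rest of this tutorial TorchServe plays this role, so you do not need to write the server yourself.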

TorchServe requires what is called a handler. A handler is a Python class that handles the pre- and post-processing of inputs and outputs, as well as loading your model, tokenizer, and anything else you need to perform inference. TorchServe has many prebuilt handlers [found here](https://pytorch.org/serve/default_handlers.html). Here is my own custom handler.

In [None]:

import os
import json
import logging
from os import listdir

import torch
from transformers import AutoTokenizer, OPTForCausalLM
from ts.torch_handler.base_handler import BaseHandler

logger = logging.getLogger(__name__)


class TransformersHandler(BaseHandler):
    """
    This handler takes in an input string and multiple parameters and returns autoregressive generations from various OPT models. 
    """
    def __init__(self):
        super(TransformersHandler, self).__init__()
        self.initialized = False

    def initialize(self, ctx):
        """ 
        The function looks at the specs of the device that is running the server and loads in the model and any other objects that must be loaded in.
        
        """
        # get the passed properties of the torchserve compiler and the device 
        self.manifest = ctx.manifest
        properties = ctx.system_properties
        model_dir = properties.get("model_dir")
        self.device = torch.device("cuda:" + str(properties.get("gpu_id")) if torch.cuda.is_available() else "cpu")

        # Read model serialize/pt file
        serialized_file = self.manifest["model"]["serializedFile"]
        
        model_pt_path = os.path.join(model_dir, serialized_file)
        if not os.path.isfile(model_pt_path):
            raise RuntimeError("Missing the model.pt or pytorch_model.bin file")
        
        # Load model
        logger.info("Loading Model...")
        self.model = OPTForCausalLM.from_pretrained(model_dir)
        logger.info("Model loaded...")
        
        self.model.to(self.device)
        
        logger.debug('Transformer model from path {0} loaded successfully'.format(model_dir))
        
        # Ensure to use the same tokenizer used during training
        logger.info("Loading tokenizer...")
        self.tokenizer = AutoTokenizer.from_pretrained(model_dir)
        logger.info("Tokenizer loaded")

        self.initialized = True

    def preprocess(self, data):
        """
        The initial entry of data being passed for inference. 
        Here it is where we extract the parameters and inputs. 
        Inputs are tokenized for inference.
        """
        params = data[0].get("parameters")
        text = data[0].get("data").get('text')
        
        # set the params 
        self.num_return_sequences = params.get('num_return_sequences')
        self.top_p = params.get('top_p')
        self.top_k = params.get('top_k')
        self.temperature = params.get('temperature')
        self.max_length = params.get('max_length')
        self.no_repeat_ngram_size = params.get('no_repeat_ngram_size')
        
        # tokenize and move the tensors to the same device as the model
        inputs = self.tokenizer(text, return_tensors='pt').to(self.device)

        return inputs

    def inference(self, inputs):
        """
        Function for performing inference on the processed input. The predictions are then decoded and returned.
        """
        
        prediction = self.model.generate(inputs.input_ids,
                                         max_length=self.max_length,
                                         num_return_sequences= self.num_return_sequences,
                                         do_sample = True,
                                         temperature = self.temperature,
                                         early_stopping = True,
                                         top_k = self.top_k,
                                         top_p = self.top_p,
                                         no_repeat_ngram_size = self.no_repeat_ngram_size,
                                         return_dict_in_generate = True,
                                         tokenizer = self.tokenizer)
        
        
        prediction = self.tokenizer.batch_decode(prediction['sequences'],
                                                  skip_special_tokens=True,
                                                  clean_up_tokenization_spaces=False)
        
        return [prediction]

    def postprocess(self, inference_output):
        '''
        Extra function for processing inference outputs if not already done so.
        '''
        return inference_output


This is a custom handler for inference on an autoregressive language model from Hugging Face. In this case I am using OPT-125m. You can see that the handler is made up of different pieces: initializing the model for inference, processing inputs, and processing outputs. Whatever your handler looks like, it has to match the specifics of the model you want to serve and the pre- and post-processing its inputs and outputs require.

Place your handler in the predictor directory.
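
TorchServe's `BaseHandler.handle` calls these methods in the order `preprocess`, `inference`, `postprocess`. The toy class below mimics that flow without TorchServe or a real model, so you can see the shape of the data at each stage. `ToyHandler` and its upper-casing "model" are purely illustrative and are not part of any library.

```python
class ToyHandler:
    """Illustrative stand-in for a TorchServe handler; no real model is loaded."""

    def __init__(self):
        self.initialized = False

    def initialize(self, ctx=None):
        # A real handler would load weights and a tokenizer here.
        self.model = str.upper
        self.initialized = True

    def preprocess(self, data):
        # Same request shape as the handler above: a list holding one dict
        # with "data" and "parameters" keys.
        request = data[0]
        self.repeat = request.get("parameters", {}).get("num_return_sequences", 1)
        return request["data"]["text"]

    def inference(self, inputs):
        # One list of generations per request, as in the real handler.
        return [[self.model(inputs)] * self.repeat]

    def postprocess(self, inference_output):
        return inference_output

    def handle(self, data):
        # Mirrors the preprocess -> inference -> postprocess order.
        return self.postprocess(self.inference(self.preprocess(data)))


handler = ToyHandler()
handler.initialize()
result = handler.handle(
    [{"data": {"text": "hello"}, "parameters": {"num_return_sequences": 2}}]
)
print(result)  # [['HELLO', 'HELLO']]
```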


## Step 4 Define Statics

We need to assign some variables that are specific to your project. `PROJECT_ID` should be your GCP project ID, and `APP_NAME` can be whatever you want your app to be called.


In [None]:
PROJECT_ID = 'my-fun-project'
APP_NAME = "vertex-test"
CUSTOM_PREDICTOR_IMAGE_URI = f"gcr.io/{PROJECT_ID}/pytorch_predict_{APP_NAME}"

## Step 5 Write Your Dockerfile

In [None]:
import os

# NOTE: handler_name, model_source and extra_files are not defined anywhere else
# in this tutorial. The values below are placeholders (assumptions); change them
# to match the files you placed in ./predictor and ./predictor/model.
handler_name = "handler.py"                         # custom handler from Step 3
model_source = "predictor/model/pytorch_model.bin"  # serialized model weights
extra_files = ["/home/model-server/config.json"]    # paths as seen inside the image
VERSION = 1

docker_file = '''FROM pytorch/torchserve:latest-cpu

# install dependencies
RUN python3 -m pip install --upgrade pip
RUN pip3 install transformers
RUN pip3 install torch

USER model-server

# copy model artifacts, custom handler and other dependencies
COPY {handler} /home/model-server/
COPY ./model/ /home/model-server/

# create torchserve configuration files
USER root
RUN printf "\\nservice_envelope=json" >> /home/model-server/config.properties
RUN printf "\\ninference_address=http://0.0.0.0:7080" >> /home/model-server/config.properties
RUN printf "\\nmanagement_address=http://0.0.0.0:7081" >> /home/model-server/config.properties
USER model-server

# expose health and prediction listener ports from the image
EXPOSE 7080
EXPOSE 7081

# create model archive file packaging model artifacts and dependencies
RUN torch-model-archiver -f --model-name={app} --version={version} --serialized-file=/home/model-server/{weights} --handler=/home/model-server/{handler} --extra-files "{extras}" --export-path=/home/model-server/model-store

# run the TorchServe HTTP server to respond to prediction requests
CMD ["torchserve", "--start", "--ts-config=/home/model-server/config.properties", "--models", "{app}={app}.mar", "--model-store", "/home/model-server/model-store"]
'''.format(
    handler=handler_name,
    app=APP_NAME,
    version=VERSION,
    weights=os.path.basename(model_source),
    extras=",".join(extra_files),
)

# write the Dockerfile into the predictor directory
with open("./predictor/Dockerfile", "w") as f:
    f.write(docker_file)

print("Wrote ./predictor/Dockerfile")

The Dockerfile must be within your predictor directory. 

## Step 6 Build and Push Image to GCP

Now that our Dockerfile is ready, the next step is to use Docker to build an image and push it to GCP. To do this, we decide what the image URI will be in GCP and build the image from the Dockerfile. Docker takes the Dockerfile and the model and software artifacts in the predictor directory and builds them into an image. Once the image is built, we push it to the GCP registry.

In [None]:
# set the URI
CUSTOM_PREDICTOR_IMAGE_URI = f"gcr.io/{PROJECT_ID}/pytorch_predict_{APP_NAME}"

# build the image
!docker build --tag=$CUSTOM_PREDICTOR_IMAGE_URI ./predictor

# deploy the image to the GCP registry
!docker push $CUSTOM_PREDICTOR_IMAGE_URI

## Step 7 Docker Image to Vertex AI Model

Normally, Docker images are deployed into a container, which is a runnable instance of that image. To see past the Vertex nomenclature, think of Vertex AI endpoints as containers and models as images. We are simply taking an image from the GCP container registry and registering it in Vertex AI as a "model". We will do this with the GCP Python SDK.

In [None]:
# import the google-cloud-aiplatform package and assign some static variables 
from google.cloud import aiplatform

aiplatform.init(project=PROJECT_ID)
VERSION = 1
model_display_name = f"{APP_NAME}-v{VERSION}"
model_description = "This is so fun"

MODEL_NAME = APP_NAME
health_route = "/ping"
predict_route = f"/predictions/{MODEL_NAME}"
serving_container_ports = [7080]


In [None]:
# now deploy the model into Vertex AI
model = aiplatform.Model.upload(
    display_name=model_display_name,
    description=model_description,
    serving_container_image_uri=CUSTOM_PREDICTOR_IMAGE_URI,
    serving_container_predict_route=predict_route,
    serving_container_health_route=health_route,
    serving_container_ports=serving_container_ports,
)

model.wait()

print(model.display_name)
print(model.resource_name)

## Step 8 Containerizing your Vertex Model

Now that you have registered your image as a model in Vertex AI, you need to containerize it, that is, give it a home within a runnable instance. In Vertex AI, these are called endpoints. An endpoint assigns the machine that runs your HTTP server so it can answer inference requests and health checks.

In [None]:
# create the endpoint here. Initially it will just be an empty object registered in Vertex AI
endpoint_display_name = f"{APP_NAME}-endpoint"
endpoint = aiplatform.Endpoint.create(display_name=endpoint_display_name)


traffic_percentage = 100 # If you have multiple models assigned to a single endpoint, you can split traffic between the models. 
machine_type = "n1-standard-16" # assign what type of machine you want in GCP, list found here https://cloud.google.com/vertex-ai/docs/predictions/configure-compute
deployed_model_display_name = model_display_name
sync = True
# deploy your model, to your endpoint. This may take some time. 
model.deploy(
    endpoint=endpoint,
    deployed_model_display_name=deployed_model_display_name,
    machine_type=machine_type,
    traffic_percentage=traffic_percentage,
    sync=sync,
)

Once your model has been deployed to your endpoint, you will see a message containing something like "projects/{myprojectID}/locations/us-central1/endpoints/{endpoint ID}". This is known as the "resource name", and you will need it to perform inference.
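
If you lose track of that message, the same string is available as `endpoint.resource_name`. The sketch below splits a resource name into its parts, which can help when you only need the endpoint ID or the region. The helper name and the example values are made up for illustration.

```python
import re

ENDPOINT_PATTERN = re.compile(
    r"^projects/(?P<project>[^/]+)"
    r"/locations/(?P<location>[^/]+)"
    r"/endpoints/(?P<endpoint_id>[^/]+)$"
)


def parse_endpoint_resource_name(resource_name):
    """Split an endpoint resource name into project, location and endpoint ID."""
    match = ENDPOINT_PATTERN.match(resource_name)
    if match is None:
        raise ValueError(f"Not an endpoint resource name: {resource_name!r}")
    return match.groupdict()


# Made-up values for illustration only.
parts = parse_endpoint_resource_name(
    "projects/123456789/locations/us-central1/endpoints/987654321"
)
print(parts["location"])  # us-central1
```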

## Step 9 Performing Inference 

To perform inference, create an endpoint object and pass the endpoint resource name as the only argument. Then format JSON-like data that matches whatever your handler expects and pass it to the endpoint object. For my handler, I define the variable `t` as my input for inference.

In [None]:
endpoint = aiplatform.Endpoint('{my endpoint resource}')
endpoint.list_models()

t = [
    {
        "data": {
            "text": "This is a test"
        },
        "parameters":{
            "num_return_sequences": 5,
            "top_p":0.9,
            "top_k":10,
            "temperature":0.8,
            "max_length":20,
            "no_repeat_ngram_size":2
            
            
        }
    }
]

# call the prediction 
prediction = endpoint.predict(instances=t)
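
Because the handler reads each generation parameter with `params.get(...)`, a missing key silently becomes `None`. A small check before calling the endpoint can catch that earlier. The helper below is a hypothetical convenience and not part of the Vertex AI SDK; the key names come from the handler in Step 3.

```python
REQUIRED_PARAMETERS = {
    "num_return_sequences", "top_p", "top_k",
    "temperature", "max_length", "no_repeat_ngram_size",
}


def validate_instance(instance):
    """Raise ValueError if an instance lacks the fields the handler reads."""
    text = instance.get("data", {}).get("text")
    if not isinstance(text, str) or not text:
        raise ValueError("instance['data']['text'] must be a non-empty string")
    missing = REQUIRED_PARAMETERS - set(instance.get("parameters", {}))
    if missing:
        raise ValueError(f"missing parameters: {sorted(missing)}")
    return instance


example = {
    "data": {"text": "This is a test"},
    "parameters": {
        "num_return_sequences": 5, "top_p": 0.9, "top_k": 10,
        "temperature": 0.8, "max_length": 20, "no_repeat_ngram_size": 2,
    },
}
validate_instance(example)  # returns the instance unchanged when it is valid
```

After validation, `endpoint.predict(instances=[example])` returns an object whose `predictions` attribute holds the generated text.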

## Conclusion 

And there you have it: your model's prediction. In this tutorial I walked through how to deploy custom PyTorch models in Vertex AI. To continue your educational journey, play around with what you produced, make it your own, and read up on the many services used here. I hope you found it useful, and I am happy to connect with anyone via [LinkedIn](https://www.linkedin.com/in/jameson-hohbein/). 

All of the code for deploying PyTorch models has been packaged into a public tool I wrote, which you can find here. 
https://github.com/jamesonhohbein/vertex_deployment_4dummies


Warm regards,

Jameson
