<img src="./images/logo.png" alt="Drawing" style="width: 500px;"/>

# **Exercise 6:** Serving your model with KServe

By this stage, you have a model that is saved in a Model Registry as a 'Production' model. Now, it's time to make this model available to every self-serve checkout in every one of our retail stores.

In this exercise, you will:

- Be briefly introduced to containers, Kubernetes, Kubeflow and KServe.
- Create an InferenceService configuration.
- Serve the model through an InferenceService using KServe.
- Learn how to manage Kubeflow Endpoints.

By the end of this exercise, you will have learned how to serve a model at scale using KServe and Kubeflow on **HPE Ezmeral Unified Analytics**.

Let's dive in!

### **Prerequisites**

As instructed in the [Introductory notebook](./00.introduction.ipynb), ensure that you have run `pip install -r requirements.txt` in a Terminal window, located in the same working directory, prior to running this notebook. 

<div class="alert alert-block alert-danger">
<b>Important:</b> This exercise requires the completion of Exercise 5: Tracking, Registering and Inferencing Models in MLflow.</div>

## **1. Introduction to Kubeflow and KServe**

If you were just looking to serve your model locally, such as to a single notebook, script or application, you would now be in a position to simply use the MLflow URI to get a "copy" of it each time. 

You can probably guess that whilst this may work for one or a few instances, it is hardly a **scalable** solution. What if we want to deploy this model such that it can be remotely inferenced by every self-serve checkout in every retail store across multiple countries?

### A quick word about Kubernetes and containers

This problem of scaling applications (such as a model inferencing server) based on user demand is one that has plagued infrastructure managers since the birth of the modern computer. How do I deploy an application onto infrastructure such that it is available to 100 users during the evening and 100,000 users during the day?

This problem brought about the idea of the **container**, whereby a lightweight instance of an application could be easily spun up and powered down near-instantaneously. By deploying applications in containers, we could better manage the allocation of our compute infrastructure resources at any given time. 

However, containerization brought about another problem: when there are hundreds of thousands or even **millions** of containers, who (or *what*) is going to spin up and power down these containers based on demand? Enter **Kubernetes**. 

**Kubernetes**, also known as K8s, is an open-source system designed to automate the deployment, scaling, and management of containerized applications. Kubernetes is like a conductor for a container orchestra*, ensuring everything runs smoothly and efficiently on top of several resource nodes (compute and storage servers). Kubernetes groups containers, which are self-contained units of software, into rapidly replicable logical units for easier management and discovery.

**A fitting analogy, seeing as Kubernetes is, by definition, a container orchestration platform!*



### What is KServe?

KServe is a framework for deploying and serving machine learning (ML) models in production on Kubernetes. It simplifies the process of serving models by providing a Kubernetes Custom Resource Definition (CRD) that lets you easily define **how** your models should be served.

KServe is **incredibly powerful** for model inferencing, as it can scale your model serving instances up or down based on real-time traffic. This allows for what is known as "scale-to-zero" functionality on CPUs and GPUs for efficient resource utilization.
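
To make the idea of scaling with traffic concrete, here is a deliberately simplified sketch. It is not KServe's actual autoscaling algorithm; the function name, the per-replica target and the replica cap are all hypothetical values chosen for illustration. It only shows the general shape of the decision: no traffic means zero replicas, and otherwise enough replicas are started to keep each one at or below its target load.

```python
import math

def desired_replicas(in_flight_requests: int, target_per_replica: int, max_replicas: int) -> int:
    """Toy sizing rule (illustrative only, not KServe's real autoscaler)."""
    if target_per_replica <= 0:
        raise ValueError("target_per_replica must be positive")
    if in_flight_requests <= 0:
        return 0  # scale-to-zero: no traffic, no running replicas
    # Enough replicas to keep each at or below its target, capped at max_replicas
    return min(max_replicas, math.ceil(in_flight_requests / target_per_replica))

# Hypothetical traffic levels, from an idle night to a busy day
for load in (0, 7, 250, 100_000):
    print(load, "->", desired_replicas(load, target_per_replica=10, max_replicas=50))
```

The cap matters as much as the scaling rule: without an upper bound, a traffic spike could request more replicas than the cluster can actually schedule.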

KServe is not a standalone platform but a core add-on component of **Kubeflow**, specifically addressing the model serving aspect of the ML pipeline. Kubeflow is an open-source platform focused on machine learning operations (MLOps) on Kubernetes. It offers a collection of tools that cover the entire ML lifecycle, from model building, training, and deployment to monitoring and management.

### Wait, Kubeflow sounds familiar...

... to another machine learning platform we've already been introduced to in MLflow. 

Kubeflow and MLflow serve different purposes in the machine learning workflow:

* **Kubeflow** is an open-source platform designed to facilitate the end-to-end **orchestration and management** of machine learning workflows on **Kubernetes**, providing capabilities for model training, **deployment**, **serving**, and **monitoring**.

* **MLflow**, on the other hand, is a lightweight, **platform-agnostic** open-source tool that excels at **experiment tracking** and **version control** for models through a Model Registry, and facilitates **collaboration** among data scientists by keeping everything organized and reproducible.

The latest versions of Kubeflow and MLflow come natively installed with **HPE Ezmeral Unified Analytics**, which sits on top of a Kubernetes distribution - taking away all of the pain of deploying, connecting and managing these applications on top of Kubernetes yourself. 

Today, data scientists and engineers leverage both Kubeflow and MLflow to address distinct needs within the machine learning lifecycle. Together with Unified Analytics, they provide the complete MLOps solution!

## **2. Declaring Variables and Importing Libraries**

Let's re-declare the variables related to our MLflow experiment so that we can access them in this exercise.

In [None]:
# Experiment variables for MLflow
experiment_name = "retail-experiment"
model_name = "produce-detection"

Next, we'll import the necessary libraries. 

Ignore any warnings that appear.

In [None]:
from kubernetes import client 
from kubernetes.client import V1EnvVar
from kubernetes.client.models import V1ObjectMeta
from kserve import KServeClient
from kserve import constants
from kserve import utils
from kserve import V1beta1InferenceService
from kserve import V1beta1InferenceServiceSpec
from kserve import V1beta1PredictorSpec
from kserve import V1beta1TFServingSpec
import urllib3
import mlflow
import requests
from PIL import Image
import numpy as np
from tensorflow.keras.preprocessing.image import load_img, img_to_array
import os

## **3. Model Serving with KServe**

It's time to leverage the power of KServe to make our produce detection model available wherever we may need it!

### Get Model details from MLflow

First, let's connect this notebook to MLflow and get the URI for the 'Production' version of our model. We confirmed this URI was accessible in our Testing section of <a href="./05.working_in_mlflow.ipynb" style="color: black">Exercise 5</a>.

In [None]:
%update_token

In [None]:
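# Create an instance of the MlflowClient
# (named mlflow_client so it does not shadow the kubernetes `client` import)
mlflow_client = mlflow.tracking.MlflowClient()

# Get the latest model version in Production
latest_versions = mlflow_client.get_latest_versions(name=model_name, stages=["Production"])
if not latest_versions:
    raise RuntimeError(f"No 'Production' version found for model '{model_name}'")
latest_version = latest_versions[0]

# Get the model URI, pointing at the TensorFlow Serving export
# (note: str.replace swaps every occurrence of "model" in the path)
model_uri = latest_version.source.replace("model", "tf_serving_model")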

print("Model Storage Path in S3 (as shown in MLflow): " + str(model_uri))

### Create the InferenceService config file (YAML)

Next, we'll create the configuration file (YAML) that tells KServe to serve our model as an **InferenceService**. An InferenceService is a Kubernetes Custom Resource (CR) provided by KServe. In the context of KServe, an InferenceService represents a scalable and load-balanced service that hosts one or more machine learning models for real-time inference or prediction.

A YAML file for defining an InferenceService in KServe typically specifies the configuration for deploying and serving a machine learning model. Like most KServe YAMLs, our InferenceService YAML will require:

- `apiVersion` and `kind` to specify the Kubernetes API version and kind of resource being defined, respectively.
- `metadata` to provide metadata, such as the name of the InferenceService.
- `spec` to define the specifications for the InferenceService, including the `predictor` and, optionally, a `transformer`.
- `predictor` section specifies the details of the model to be served, such as the model's location (`storageUri`), runtime version (`runtimeVersion`), and resource requirements (`resources`).
- `transformer` section specifies any pre-processing or post-processing steps required before or after making predictions, typically as a container image (`image`) and environment variables (`env`). The InferenceService in this exercise does not define one.
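
As a rough sketch of how these fields nest, the InferenceService part of the manifest can also be written as a plain Python dictionary. The helper name `build_isvc_manifest` and the storage path below are hypothetical placeholders; only the field names mirror the YAML used in this exercise.

```python
import json

def build_isvc_manifest(name: str, storage_uri: str, service_account: str) -> dict:
    """Return an InferenceService manifest as a plain dict (hypothetical helper)."""
    return {
        "apiVersion": "serving.kserve.io/v1beta1",
        "kind": "InferenceService",
        "metadata": {"name": name},
        "spec": {
            "predictor": {
                # Service account whose secret grants access to the model storage
                "serviceAccountName": service_account,
                # Serve the model with the TensorFlow predictor
                "tensorflow": {"storageUri": storage_uri},
            }
        },
    }

manifest = build_isvc_manifest(
    name="retail-experiment",
    storage_uri="s3://example-bucket/path/to/tf_serving_model",  # placeholder, not a real path
    service_account="s3-proxy-kserve-sa",
)
print(json.dumps(manifest, indent=2))
```

Building the structure in code makes it easy to check that required keys are present before anything is written to disk or applied to the cluster.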

For our YAML file, we set up the necessary resources in Kubernetes to deploy an InferenceService with KServe, including secrets for authentication with the Unified Analytics internal S3 storage that MLflow uses to store models in the Model Registry, service account authentication details, and the configuration for the InferenceService itself (where we pass in the model URI and declare TensorFlow as the `predictor`).

We'll declare the secret, service account and InferenceService names that we will insert into the YAML text, then create a YAML file and store it in the local directory.

In [None]:
# Set parameters
isvc_name = experiment_name
secret_name = 's3-proxy-kserve-secret'
sa_name = 's3-proxy-kserve-sa'

# Set name of YAML file
yaml_name = './model-kserve.yaml'

In [None]:
# Create YAML configuration file
with open(yaml_name, 'w') as file:
    text = f"""---
apiVersion: v1
kind: Secret
metadata:
  name: "{secret_name}"
  annotations:
    serving.kserve.io/s3-cabundle: ""
    serving.kserve.io/s3-endpoint: "local-s3-service.ezdata-system.svc.cluster.local:30000/"
    serving.kserve.io/s3-useanoncredential: "false"
    serving.kserve.io/s3-usehttps: "0"
    serving.kserve.io/s3-verifyssl: "0"
stringData:
  AWS_ACCESS_KEY_ID: "{os.environ['AUTH_TOKEN']}"
  AWS_SECRET_ACCESS_KEY: "s3"
type: Opaque

---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: "{sa_name}"
secrets:
  - name: "access-token"

---
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: "{isvc_name}"
  annotations:
    "sidecar.istio.io/inject": "false"
spec:
  predictor:
    tensorflow:
      storageUri: "{model_uri}"
    serviceAccountName: "{sa_name}"
"""
    file.write(text)

Now, we run the Kubernetes command `apply` to deploy our InferenceService!

In [None]:
# Apply the manifest to create the Secret, ServiceAccount and InferenceService
!kubectl apply -f {yaml_name}

We can confirm that it's up, running and ready to inference!
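
Kubernetes resources commonly report readiness through a list of conditions under `status.conditions`. Assuming the InferenceService follows that convention with a condition of type `Ready`, the check that a readiness wait performs can be sketched on an already-fetched resource dictionary. The helper name and the two sample resources below are hand-written for illustration and are not real cluster output.

```python
def has_ready_condition(isvc: dict) -> bool:
    """Return True if the resource reports a Ready condition with status "True"."""
    conditions = isvc.get("status", {}).get("conditions", [])
    return any(
        c.get("type") == "Ready" and c.get("status") == "True"
        for c in conditions
    )

# Hypothetical, hand-written resource snippets for illustration only
pending = {"status": {"conditions": [{"type": "Ready", "status": "False"}]}}
ready = {
    "status": {
        "conditions": [
            {"type": "PredictorReady", "status": "True"},
            {"type": "Ready", "status": "True"},
        ]
    }
}
print(has_ready_condition(pending), has_ready_condition(ready))
```

In the notebook, such a dictionary could come from something like `kserve_client.get(isvc_name)`; treat the exact return shape as an assumption to verify against your installed SDK version.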

In [None]:
# Wait until the ISvc is ready
kserve_client = KServeClient()
kserve_client.wait_isvc_ready(isvc_name, watch=True, timeout_seconds=120)
print(f"\nInferenceService {isvc_name} is ready.")

## **4. Managing Kubeflow Endpoints**

We can observe the status of our InferenceService through the **Endpoints** pane in Kubeflow.

To access the Endpoints pane:

1. Navigate back to the Unified Analytics dashboard.
1. In the sidebar navigation menu, select `Data Science` > `Model Serving`.
1. The **Kubeflow Endpoints** pane will open in a new tab.

Here, we will see the complete list of applications (in our case, an InferenceService instance) currently being served using KServe.

<img src="./images/exercise6/kserve1.png" alt="Drawing" style="width: 90%;"/>

Let's explore the InferenceService further.

4. Click on the `retail-experiment` Endpoint.

Here, we can see details about our InferenceService, including the **serving endpoints**, the MLflow URI for the model and our choice of predictor.

<img src="./images/exercise6/kserve2.png" alt="Drawing" style="width: 70%;"/>

<div class="alert alert-block alert-success">
<b>Tip:</b> Should you ever encounter issues with an InferenceService, you can diagnose the issue via the logs located under the `Logs` tab.</div>

The `URL internal` link is what can be used to inference the model from any application within **HPE Ezmeral Unified Analytics**, including notebooks and custom applications hosted within Unified Analytics. For applications hosted outside of our Unified Analytics cluster that have the right authentication, you would use the `URL external` link.

5. Copy the `URL internal` link to the clipboard and paste it below.

In [None]:
internal_url = "http://retail-experiment-predictor.ezmeral.svc.cluster.local" #paste the URL internal endpoint for your model here

## **5. Inferencing our Model using an Endpoint**

Your produce recognition model is scalably served and is accessible via an endpoint URL. Let's learn how we use this endpoint to send the model an image of a fruit or vegetable, and for it to send back what it detects*.

**Calling a model to make a detection or prediction is known as **inferencing** a model.*

First, we'll build out the full **endpoint URL** using the `URL internal` from our InferenceService.

In [None]:
%update_token

In [None]:
# Build the Serving URL
serving_url = internal_url + "/v1/models/" + isvc_name + ":predict"

print("Serving URL: " + serving_url)


Next, we'll define some functions. Similar to previous exercises, we'll prepare any image that we want to infer the model on. We'll also define a function that will convert our preprocessed image into a JSON request body that we can `POST` to the endpoint URL.
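
Before looking at the real functions, it may help to see the shape of the request body. TensorFlow Serving's REST predict API accepts a JSON object with an `instances` list holding one entry per input. The sketch below uses a made-up 2x2 "image" rather than a real 224x224 one, and the helper name `to_predict_payload` is hypothetical.

```python
import json

def to_predict_payload(batch: list) -> str:
    """Wrap a batch of inputs in the {"instances": [...]} envelope."""
    return json.dumps({"instances": batch})

# A made-up 2x2 RGB "image" with values already scaled to [0, 1]
tiny_image = [
    [[0.0, 0.5, 1.0], [1.0, 0.5, 0.0]],
    [[0.2, 0.2, 0.2], [0.8, 0.8, 0.8]],
]

payload = to_predict_payload([tiny_image])  # a batch containing one image
decoded = json.loads(payload)

# One key ("instances"), then batch size, image height and image width
print(list(decoded.keys()))
print(len(decoded["instances"]), len(decoded["instances"][0]), len(decoded["instances"][0][0]))
```

The outer list is the batch dimension, which is why a single image still has to be wrapped in a list before it is sent.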

In [None]:
import json
from io import BytesIO

def preprocess_image(location):
    # Load the image from a URL or from a local file path
    if location.startswith(("http://", "https://")):
        response = requests.get(location)
        response.raise_for_status()
        img = Image.open(BytesIO(response.content))
    else:
        # target_size is (height, width); channels are handled by color_mode
        img = load_img(location, target_size=(224, 224))

    # Match the model input: 3-channel RGB, 224x224 pixels
    img = img.convert("RGB").resize((224, 224))

    # Scale pixel values to [0, 1] and add a batch dimension -> (1, 224, 224, 3)
    img_array = img_to_array(img) / 255.0
    img_array = np.expand_dims(img_array, axis=0)
    print(f"Preprocessed image shape: {img_array.shape}")

    return img_array

def format_data(data):
    # Convert the NumPy array to nested lists and wrap it in the
    # {"instances": [...]} JSON body expected by the predict endpoint
    return json.dumps({"instances": data.tolist()})

Now, we're ready to inference our model with a supplied image. 

For the final time, **go out** onto Google Images and find a **new** image with a fruit or vegetable in it. Set the `online_url` variable in the next cell to the link to your image; if you leave it empty, the local test image is used instead.

In [None]:
# Define your labels, then invert the mapping to class index -> label name
labels = {'apple': 0, 'banana': 1, 'carrot': 2, 'cucumber': 3, 'lemon': 4, 'orange': 5}
labels = {index: name for name, index in labels.items()}

# Paste an image link into online_url; if left empty, the local test image is used
online_url = ""
local_url = os.getcwd() + "/images/test_image.jpg"

image_url = online_url if online_url else local_url

preprocessed_image = preprocess_image(image_url)

json_request = format_data(preprocessed_image)

headers = {"Authorization": f"Bearer {os.environ['AUTH_TOKEN']}"}

# Make the POST request (verify=False skips TLS certificate verification)
response = requests.post(serving_url, data=json_request, headers=headers, verify=False)
print(response)
if response.ok:
    print("Request successfully made.")
else:
    print("Request failed with status code", response.status_code)

Let's see what we got back!

In [None]:
# Print the raw response content
print("Raw Response Content:")
print(response.content.decode('utf-8'))

# Decode the JSON response (the header may carry a charset suffix)
if "application/json" in response.headers.get("Content-Type", ""):
    response_data = response.json()
    predictions = response_data['predictions']

    formatted_predictions = [[round(pred * 100, 2) for pred in prediction] for prediction in predictions]

    # labels maps class index -> name, so look each name up by index
    print("\nTranslated Predictions:")
    for index, prob in enumerate(formatted_predictions[0]):
        print(f"- {labels[index]}: \t{prob}%")

    # Get the predicted label for the first (and only) image in the batch
    predicted_label_index = int(np.argmax(formatted_predictions[0]))
    predicted_label = labels[predicted_label_index]

    print("\nPredicted class label:", predicted_label, "with", formatted_predictions[0][predicted_label_index], "%")
else:
    print("\nThe response was not JSON; check the status code and raw content above.")

Did the model correctly guess what was in your image? If so, great! If not... still great! We've successfully validated that our model is being served with an endpoint using KServe!

# **Conclusion**

In this exercise, you have learned the underlying theory behind scaling the serving of a model - including containers, Kubernetes, Kubeflow and KServe. You took the latest version of your produce detection model from the MLflow Model Registry, created the configuration YAML for a KServe InferenceService, and deployed it on the Kubernetes cluster that powers Unified Analytics.

Now, we have an endpoint URL that we can call to inference our model from any application within **HPE Ezmeral Unified Analytics**!
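
For an application that does not have the `requests` library available, the same call can be sketched with Python's standard library alone. The helper name `build_predict_request`, the token placeholder and the tiny input below are assumptions for illustration. The function only builds the request object and does not send it, so it can be inspected safely.

```python
import json
import urllib.request

def build_predict_request(base_url: str, model_name: str, instances: list, token: str) -> urllib.request.Request:
    """Build (but do not send) a POST request for the ':predict' endpoint."""
    # Strip any trailing slash so the path is not doubled up
    url = f"{base_url.rstrip('/')}/v1/models/{model_name}:predict"
    body = json.dumps({"instances": instances}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )

request = build_predict_request(
    base_url="http://retail-experiment-predictor.ezmeral.svc.cluster.local",
    model_name="retail-experiment",
    instances=[[[[0.0, 0.0, 0.0]]]],  # placeholder input, not a real image
    token="<your-auth-token>",        # placeholder, replace with a valid token
)
print(request.get_method(), request.full_url)
# Sending it would look like: urllib.request.urlopen(request, timeout=30)
```

Keeping request construction separate from sending makes the URL and headers easy to check in isolation before any traffic reaches the cluster.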

In the next exercise, we will leverage this endpoint to make **real-time** detections of produce within a custom self-checkout application!