# Creating an Inference Service using MLflow and KServe

Welcome to part two of the tutorial on building a question-answering application over a private document corpus with
Large Language Models (LLMs). In the previous Notebook, you transformed the documents into a high-dimensional latent
space and saved these embeddings in a Vector Store using the Chroma database interface from LangChain.

<figure>
  <img src="images/inference-service.jpg" alt="isvc" style="width:100%">
  <figcaption>
    Photo by <a href="https://unsplash.com/@growtika?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText">Growtika</a> on <a href="https://unsplash.com/photos/GSiEeoHcNTQ?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText">Unsplash</a>
  </figcaption>
</figure>

In this Notebook, you delve deeper. You use MLflow to log the Chroma DB files as experiment artifacts. Once logged, you
then set up an Inference Service (ISVC) that fetches these artifacts and leverages them to provide context to user
inquiries. For this task, you work with KServe, a Kubernetes-centric platform that offers a serverless blueprint for
scaling Machine Learning (ML) models seamlessly.

A crucial point to remember: KServe doesn't support Chroma DB files natively. Because of this, you integrate a custom
predictor component. This involves creating a Docker image, which then serves as your ISVC endpoint. This approach
grants you a high level of customization, ensuring the service fits your requirements. You can find the necessary code
and the Dockerfile for this custom predictor in the `dockerfiles/vectorstore` directory. But for a quicker setup,
there's a pre-built option available: `marketplace.us1.greenlake-hpe.com/ezmeral/ezkf/qna-vectorstore:v1.3.0-e658264`.

## Table of Contents

1. [Logging the Vector Store as an Artifact](#logging-the-vector-store-as-an-artifact)
1. [Creating and Submitting the Inference Service](#creating-and-submitting-the-inference-service)
1. [Conclusion and Next Steps](#conclusion-and-next-steps)

In [None]:
import os
import base64
import requests
import subprocess
import mlflow
import ipywidgets as widgets

from IPython.display import display

In [None]:
# If you're running behind a proxy, set your proxy URLs below. DO NOT edit the "NO_PROXY" variable unless you know what you're doing!
os.environ["HTTP_PROXY"] = ""
os.environ["HTTPS_PROXY"] = ""
os.environ["NO_PROXY"] = "10.96.0.0/12, .local"

In [None]:
def encode_base64(message: str):
    encoded_bytes = base64.b64encode(message.encode('ASCII'))
    return encoded_bytes.decode('ASCII')
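
Kubernetes expects every value under a Secret's `data` field to be base64-encoded, which is what the helper above is used for later in this Notebook. The following is a minimal round-trip sketch; the sample value `example-access-key` is made up for illustration and is not a real credential. Keep in mind that base64 is an encoding, not encryption.

```python
import base64


def encode_base64(message: str) -> str:
    """Same logic as the helper above, repeated so this block runs on its own."""
    return base64.b64encode(message.encode("ascii")).decode("ascii")


sample = "example-access-key"  # hypothetical value, for illustration only
encoded = encode_base64(sample)

# Decoding the encoded string should give back the original value
decoded = base64.b64decode(encoded).decode("ascii")

print(encoded)
print(decoded == sample)
```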

## Logging the Vector Store as an Artifact

To begin, you create a new experiment or utilize an existing one and log the Chroma DB files as an artifact of this
experiment. Ultimately, you retrieve the URI that points to this artifact's location and provide it to the custom
predictor component. By doing this, the custom predictor component understands how to fetch the artifact and serve it
effectively.

In [None]:
# Add heading
heading = widgets.HTML("<h2>MLflow Credentials</h2>")
display(heading)

domain_input = widgets.Text(description='Domain:', placeholder="i001ua.tryezmeral.com")
username_input = widgets.Text(description='Username:')
password_input = widgets.Password(description='Password:')
submit_button = widgets.Button(description='Submit')
success_message = widgets.Output()

domain = None
mlflow_username = None
mlflow_password = None

def submit_button_clicked(b):
    global domain, mlflow_username, mlflow_password
    domain = domain_input.value
    mlflow_username = username_input.value
    mlflow_password = password_input.value
    with success_message:
        success_message.clear_output()
        print("Credentials submitted successfully!")
    submit_button.disabled = True

submit_button.on_click(submit_button_clicked)

# Set margin on the submit button
submit_button.layout.margin = '20px 0 20px 0'

# Display inputs and button
display(domain_input, username_input, password_input, submit_button, success_message)

In [None]:
token_url = f"https://keycloak.{domain}/realms/UA/protocol/openid-connect/token"

data = {
    "username" : mlflow_username,
    "password" : mlflow_password,
    "grant_type" : "password",
    "client_id" : "ua-grant",
}

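token_response = requests.post(token_url, data=data, allow_redirects=True, verify=False)

# Fail early with a clear HTTP error if the credentials or domain are wrong
token_response.raise_for_status()

token = token_response.json()["access_token"]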
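
If the token request fails, the JSON payload has no `access_token` key and the lookup above raises a bare `KeyError`. The sketch below shows one way to surface a more readable message. The helper name `extract_access_token` and both sample payloads are hypothetical. The `error` and `error_description` fields follow the OAuth 2.0 error response format, but the exact content your identity provider returns may differ.

```python
def extract_access_token(payload: dict) -> str:
    """Return the access token from a token-endpoint JSON payload.

    Raises a RuntimeError with the server-provided reason when no token is present.
    """
    token = payload.get("access_token")
    if not token:
        # OAuth 2.0 error responses carry "error" and, optionally, "error_description"
        reason = payload.get("error_description") or payload.get("error") or "unknown error"
        raise RuntimeError(f"Token request failed: {reason}")
    return token


# Hypothetical payloads for illustration only
ok_payload = {"access_token": "abc123", "token_type": "Bearer"}
bad_payload = {"error": "invalid_grant", "error_description": "Invalid user credentials"}

print(extract_access_token(ok_payload))

try:
    extract_access_token(bad_payload)
except RuntimeError as err:
    print(err)
```

In the Notebook, you would pass the parsed JSON body of the token request to this helper instead of indexing it directly.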

In [None]:
os.environ['MLFLOW_TRACKING_TOKEN'] = token
os.environ["AWS_ACCESS_KEY_ID"] = os.environ['MLFLOW_TRACKING_TOKEN']
os.environ["AWS_SECRET_ACCESS_KEY"] = "s3"
os.environ["AWS_ENDPOINT_URL"] = 'http://local-s3-service.ezdata-system.svc.cluster.local:30000'
os.environ["MLFLOW_S3_ENDPOINT_URL"] = os.environ["AWS_ENDPOINT_URL"]
os.environ["MLFLOW_S3_IGNORE_TLS"] = "true"
os.environ["MLFLOW_TRACKING_INSECURE_TLS"] = "true"
os.environ["MLFLOW_TRACKING_URI"] = "http://mlflow.mlflow.svc.cluster.local:5000"

In [None]:
def get_or_create_experiment(exp_name):
    """Activate an MLflow experiment, creating it if it does not exist.

    Args:
      exp_name (str): The name of the experiment.
    """
    try:
        # mlflow.set_experiment creates the experiment when no match is found
        mlflow.set_experiment(exp_name)
    except Exception as e:
        raise RuntimeError(f"Failed to set the experiment: {e}") from e

In [None]:
# Create a new MLflow experiment or re-use an existing one
get_or_create_experiment('question-answering')

# Log the Chroma DB files as an artifact of the experiment
mlflow.log_artifact(f"{os.getcwd()}/db")

# Retrieve the URI of the artifact
uri = mlflow.get_artifact_uri("db")
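
With an S3-compatible artifact store, as the environment variables above suggest, the artifact URI typically has the form `s3://<bucket>/<path>`. The sketch below shows how such a URI breaks down into a bucket and a key prefix. The helper `split_s3_uri` and the sample URI are hypothetical. The custom predictor's actual parsing logic lives in `dockerfiles/vectorstore` and may differ.

```python
from typing import Tuple
from urllib.parse import urlparse


def split_s3_uri(uri: str) -> Tuple[str, str]:
    """Split an s3:// URI into (bucket, key prefix)."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3":
        raise ValueError(f"Expected an s3:// URI, got: {uri}")
    # The bucket is the network location; the prefix is the path without the leading slash
    return parsed.netloc, parsed.path.lstrip("/")


# Hypothetical artifact URI, for illustration only
sample_uri = "s3://mlflow/1/0123456789abcdef/artifacts/db"
bucket, prefix = split_s3_uri(sample_uri)

print(bucket)
print(prefix)
```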

## Creating and Submitting the Inference Service

In the final segment of this Notebook, you create and submit an ISVC via a YAML template and a Python subprocess. This
process unfolds as follows:

1. Drafting the YAML Template: Here, you craft a YAML file that outlines the ISVC's specifics. This captures elements
   like the service's name, the chosen Docker image, and additional configurations. After drafting, you save this YAML
   to a file for inspection and later submission.
1. Applying the YAML Template: With the template ready, you submit it to the cluster so that KServe can deploy the service.
   You do this by running `kubectl apply` through a Python subprocess.

By the end of this section, you will have a running ISVC that is ready to receive user queries and provide context for
answering them using the Vector Store. This marks the completion of your journey, from transforming unstructured text
data into structured vector embeddings, to creating a scalable service that can provide context based on those
embeddings.
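
The template step relies on Python's `str.format` with positional placeholders such as `{0}` and `{1}`. Here is a minimal sketch of that mechanism on a made-up two-field fragment. The fragment and the image name are illustrative only and do not form a valid manifest.

```python
# Illustrative fragment, not a complete or valid Kubernetes manifest
fragment = """
metadata:
  name: {0}
spec:
  image: "{1}"
"""

# Positional arguments fill {0}, {1}, ... in order
rendered = fragment.format(
    "vectorstore",
    "registry.example.com/qna-vectorstore:latest",  # hypothetical image name
)

print(rendered)
```

Any literal `{` or `}` in such a template must be doubled (`{{`, `}}`), because `str.format` otherwise treats it as a placeholder.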

In the next cell, enter the name of the custom predictor Docker image you built. To use the pre-built image instead,
leave the field empty:

In [None]:
# Add heading
heading = widgets.HTML("<h2>Predictor Image</h2>")
display(heading)

predictor_image_widget = widgets.Text(
    description="Image Name:",
    placeholder="marketplace.us1.greenlake-hpe.com/ezmeral/ezkf/qna-vectorstore:v1.3.0-e658264",
    layout=widgets.Layout(width='30%'))
submit_button = widgets.Button(description="Submit")
success_message = widgets.Output()

predictor_image = None

def submit_button_clicked(b):
    global predictor_image
    predictor_image = predictor_image_widget.value
    with success_message:
        success_message.clear_output()
        if not predictor_image:
            predictor_image = "marketplace.us1.greenlake-hpe.com/ezmeral/ezkf/qna-vectorstore:v1.3.0-e658264"
        print(f"The name of the predictor image will be: '{predictor_image}'")
    submit_button.disabled = True

submit_button.on_click(submit_button_clicked)

# Set margin on the submit button
submit_button.layout.margin = '20px 0 20px 0'

# Display inputs and button
display(predictor_image_widget, submit_button, success_message)

In [None]:
isvc = """
apiVersion: v1
kind: Secret
metadata:
  name: minio-secret
type: Opaque
data:
  MINIO_ACCESS_KEY: {0}
  MINIO_SECRET_KEY: {1}

---
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: vectorstore
spec:
  predictor:
    containers:
    - name: kserve-container
      image: {2}
      imagePullPolicy: Always
      resources:
        requests:
          memory: "2Gi"
          cpu: "500m"
        limits:
          memory: "2Gi"
          cpu: "500m"
      args:
      - --persist-uri
      - {3}
      env:
      # The proxy values below come from the HTTP_PROXY and HTTPS_PROXY variables set at the top of this Notebook.
      # They are quoted so that an empty value stays an empty string.
      - name: HTTP_PROXY
        value: "{4}"
      - name: HTTPS_PROXY
        value: "{5}"
      - name: NO_PROXY
        value: .local
      - name: MLFLOW_S3_ENDPOINT_URL
        value: {6}
      - name: TRANSFORMERS_CACHE
        value: /src
      - name: SENTENCE_TRANSFORMERS_HOME
        value: /src
      - name: MINIO_ACCESS_KEY
        valueFrom:
          secretKeyRef:
            key: MINIO_ACCESS_KEY
            name: minio-secret
      - name: MINIO_SECRET_KEY
        valueFrom:
          secretKeyRef:
            key: MINIO_SECRET_KEY
            name: minio-secret
""".format(encode_base64(os.environ["AWS_ACCESS_KEY_ID"]),
           encode_base64(os.environ["AWS_SECRET_ACCESS_KEY"]),
           predictor_image,
           uri,
           os.environ["HTTP_PROXY"],
           os.environ["HTTPS_PROXY"],
           os.environ["MLFLOW_S3_ENDPOINT_URL"])

with open("vectorstore-isvc.yaml", "w") as f:
    f.write(isvc)

In [None]:
# check=True raises CalledProcessError if kubectl exits with a non-zero status
subprocess.run(["kubectl", "apply", "-f", "vectorstore-isvc.yaml"], check=True)
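
Applying the manifest only submits it, and the service can take a while to become ready. The sketch below shows one way to block until the ISVC reports the `Ready` condition, using `kubectl wait`. The helper names and the 300-second default timeout are assumptions for illustration, not part of the original Notebook.

```python
import subprocess
from typing import List


def build_wait_command(name: str, timeout_seconds: int = 300) -> List[str]:
    """Build a kubectl command that waits for an InferenceService to be Ready."""
    return [
        "kubectl", "wait",
        "--for=condition=Ready",
        f"inferenceservice/{name}",
        f"--timeout={timeout_seconds}s",
    ]


def wait_until_ready(name: str, timeout_seconds: int = 300) -> bool:
    """Run the wait command and report whether the service became Ready in time."""
    result = subprocess.run(
        build_wait_command(name, timeout_seconds),
        capture_output=True,
        text=True,
    )
    print(result.stdout or result.stderr)
    return result.returncode == 0


# Only the command is built here; running it requires kubectl and cluster access
command = build_wait_command("vectorstore")
print(" ".join(command))
```

Inside the Notebook, calling `wait_until_ready("vectorstore")` after the apply step would return `True` once the service is ready, or `False` if the timeout expires first.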

## Conclusion and Next Steps

Congratulations! You've successfully navigated through the process of logging the Chroma DB files as artifacts using
MLflow, creating a custom Docker image, and setting up an ISVC with KServe that retrieves these artifacts to serve your
Vector Store. This ISVC forms the backbone of your question-answering application: it supplies the context for each
query from the document embeddings you generated in the previous Notebook.

From here, there are two paths you can choose:

- Testing the Vector Store ISVC: If you'd like to test the Vector Store ISVC that you've just created, you can proceed
  to the third (optional) Notebook. This Notebook provides a step-by-step guide on how to invoke the ISVC and validate
  its performance.
- Creating the LLM ISVC: Alternatively, if you're ready to move on to the next stage of the project, you
  can jump straight to the fourth Notebook. In that Notebook, you create an ISVC for the Large Language Model (LLM),
  which works in conjunction with the Vector Store ISVC to provide answers to user queries.