# Creating an Inference Service using MLFlow and KServe

Welcome to the second part of our tutorial on building a question-answering application over a corpus of private documents using Large Language Models (LLMs). In the previous Notebook, you focused on embedding the documents into a high-dimensional latent space and storing these embeddings in a Vector Store using the Chroma database interface provided by LangChain.

<figure>
  <img src="images/inference-service.jpg" alt="isvc" style="width:100%">
  <figcaption>
    Photo by <a href="https://unsplash.com/@growtika?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText">Growtika</a> on <a href="https://unsplash.com/photos/GSiEeoHcNTQ?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText">Unsplash</a>
  </figcaption>
</figure>

In this Notebook, you will be taking the next step in this journey. You will use MLFlow to log the Chroma DB files as artifacts of an experiment. After logging the artifacts, you will then create an Inference Service that retrieves these artifacts and uses them to provide context to user queries. For this purpose, you'll be using KServe, a Kubernetes-based platform that provides a serverless framework for serving machine learning models at scale.

Because KServe does not support serving Chroma DB files out of the box, you will use a custom predictor component. This means you first need to build a Docker image, which is then deployed as your Inference Service. This approach allows a high degree of customization, so you can tailor the service to your specific needs. You can find the code and the Dockerfile for the custom predictor in the `vectorstore` directory of this project. To save time, you can also use the image we have pre-built for you: `gcr.io/mapr-252711/ezua-demos/vectorstore:v0.1.0`.

Once you're ready, this Notebook will guide you through the necessary steps for creating a scalable Vector Store service. First, let's import the libraries you'll need:

In [None]:
import os
import base64
import logging
import warnings
import subprocess

import mlflow

warnings.filterwarnings('ignore')

In [None]:
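def encode_base64(message: str) -> str:
    """Return the base64 encoding of `message` as an ASCII string."""
    # `base64` is already imported in the first cell
    encoded_bytes = base64.b64encode(message.encode('ASCII'))
    return encoded_bytes.decode('ASCII')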
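The helper above is needed because the `data` field of a Kubernetes Secret expects base64-encoded values. The following minimal sketch shows a round trip. The helper is repeated so the block runs on its own, `decode_base64` is a hypothetical inverse added only for illustration, and the sample string is a placeholder rather than a real credential.

```python
import base64


def encode_base64(message: str) -> str:
    """Same helper as above, repeated so this block is self-contained."""
    return base64.b64encode(message.encode("ascii")).decode("ascii")


def decode_base64(encoded: str) -> str:
    """Hypothetical inverse helper, used here only to check the round trip."""
    return base64.b64decode(encoded.encode("ascii")).decode("ascii")


sample = "example-access-key"  # placeholder value, not a real credential
encoded = encode_base64(sample)

# Decoding the encoded value should give back the original string
assert decode_base64(encoded) == sample
print(encoded)
```

Note that base64 is an encoding, not encryption: anyone who can read the generated manifest can recover the original keys.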

# Logging the Vector Store as an Artifact

To kick off this process, you need to create a new experiment (or re-use an existing one) and log the Chroma DB files as an artifact of this experiment. You then retrieve the URI pointing to the location of this artifact and pass it to the custom predictor component. This way, the custom predictor component knows how to retrieve the artifact and serve it.

In [None]:
def get_or_create_experiment(exp_name):
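    """Set the active MLFlow experiment, creating it if it does not exist.

    Args:
      exp_name (str): The name of the experiment.
    """
    try:
        mlflow.set_experiment(exp_name)
    except Exception as e:
        # Chain the original exception so the root cause stays visible
        raise RuntimeError(f"Failed to set the experiment: {e}") from e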

In [None]:
# Create a new MLFlow experiment or re-use an existing one
get_or_create_experiment('question-answering')

# Log the Chroma DB files as an artifact of the experiment
mlflow.log_artifact(f"{os.getcwd()}/db")

# Retrieve the URI of the artifact
uri = mlflow.get_artifact_uri("db")
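Before passing the URI to the predictor, it can help to check where it actually points. The sketch below splits an artifact URI into its scheme, bucket, and key prefix using only the standard library. It assumes an S3-style URI, which seems likely here given the `MLFLOW_S3_ENDPOINT_URL` setting used later. The function name `split_artifact_uri` and the example URI are hypothetical.

```python
from typing import Tuple
from urllib.parse import urlparse


def split_artifact_uri(artifact_uri: str) -> Tuple[str, str, str]:
    """Split an artifact URI into (scheme, bucket, key prefix).

    Assumes an S3-style URI such as s3://bucket/path/to/artifact.
    """
    parsed = urlparse(artifact_uri)
    return parsed.scheme, parsed.netloc, parsed.path.lstrip("/")


# Hypothetical URI for illustration; in the notebook you would pass `uri`
example_uri = "s3://example-bucket/1/0123abcd/artifacts/db"
scheme, bucket, key_prefix = split_artifact_uri(example_uri)
print(scheme, bucket, key_prefix)
```

In the notebook, calling `split_artifact_uri(uri)` on the value returned by `mlflow.get_artifact_uri("db")` shows which bucket and path the predictor will read from.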

# Creating and Submitting the Inference Service

In the final section of this Notebook, you will create and submit an Inference Service using a YAML template and a Python subprocess. The steps in this section include:

1. Creating the YAML Template: You will create a YAML manifest that defines the specification of your Inference Service, including the name of the service, the Docker image to use, and other configuration settings. You will write this manifest to a file that you can inspect and submit.
1. Submitting the YAML Template: Once your YAML template is ready, you need to submit it to KServe for deployment. To do this, you will use a Python subprocess to run a shell command that submits your YAML template to KServe.

By the end of this section, you will have a running Inference Service that is ready to receive user queries and provide context for answering them using the Vector Store. This completes the Vector Store part of the journey, from transforming unstructured text data into structured vector embeddings to creating a scalable service that can provide context based on those embeddings.

In the next cell, provide the name of the Docker image you built from the `vectorstore` directory. Leave the input blank to use the one we have pre-built for you:

In [None]:
predictor_image = (input("Enter the name of the predictor image (default: gcr.io/mapr-252711/ezua-demos/vectorstore:v0.1.0): ")
                   or "gcr.io/mapr-252711/ezua-demos/vectorstore:v0.1.0")

In [None]:
isvc = """
apiVersion: v1
kind: Secret
metadata:
  name: minio-secret
type: Opaque
data:
  MINIO_ACCESS_KEY: {0}
  MINIO_SECRET_KEY: {1}

---
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: vectorstore
spec:
  predictor:
    containers:
    - name: kserve-container
      image: {2}
      imagePullPolicy: Always
      resources:
        requests:
          memory: "2Gi"
          cpu: "500m"
        limits:
          memory: "2Gi"
          cpu: "500m"
      args:
      - --persist-uri
      - {3}
      env:
      - name: MLFLOW_S3_ENDPOINT_URL
        value: {4}
      - name: TRANSFORMERS_CACHE
        value: /src
      - name: SENTENCE_TRANSFORMERS_HOME
        value: /src
      - name: MINIO_ACCESS_KEY
        valueFrom:
          secretKeyRef:
            key: MINIO_ACCESS_KEY
            name: minio-secret
      - name: MINIO_SECRET_KEY
        valueFrom:
          secretKeyRef:
            key: MINIO_SECRET_KEY
            name: minio-secret
""".format(encode_base64(os.environ["AWS_ACCESS_KEY_ID"]),
           encode_base64(os.environ["AWS_SECRET_ACCESS_KEY"]),
           predictor_image,
           uri,
           os.environ["MLFLOW_S3_ENDPOINT_URL"])

with open("vectorstore/isvc.yaml", "w") as f:
    f.write(isvc)
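Before submitting the file, you may want a quick sanity check that the rendered manifest contains the two documents you expect. The following is a rough sketch under stated assumptions: it is a plain-text check, not a YAML parser, and the function name `list_manifest_kinds` and the shortened sample manifest are hypothetical.

```python
import re
from typing import List


def list_manifest_kinds(manifest: str) -> List[str]:
    """Return the top-level `kind` of each document in a multi-document manifest.

    Splits on lines that contain only `---` and looks for an unindented
    `kind:` line in each part. This does not validate the YAML itself.
    """
    documents = re.split(r"^---\s*$", manifest, flags=re.MULTILINE)
    kinds = []
    for doc in documents:
        match = re.search(r"^kind:\s*(\S+)", doc, flags=re.MULTILINE)
        if match:
            kinds.append(match.group(1))
    return kinds


# Shortened sample with the same layout as the manifest above
sample_manifest = """
apiVersion: v1
kind: Secret
metadata:
  name: minio-secret

---
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: vectorstore
"""
print(list_manifest_kinds(sample_manifest))
```

In the notebook, `list_manifest_kinds(isvc)` should report one `Secret` and one `InferenceService` if the template was rendered as intended.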

In [None]:
# check=True raises CalledProcessError if kubectl exits with a non-zero status
subprocess.run(["kubectl", "apply", "-f", "vectorstore/isvc.yaml"], check=True)
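Applying the manifest only submits it; the service may take a while to become ready. One possible way to block until it is ready is `kubectl wait`, sketched below. The helper names `build_wait_command` and `wait_for_isvc` and the 300-second default timeout are assumptions for illustration. The sketch also assumes that `kubectl` is on your path and configured for the namespace where the service was created.

```python
import subprocess
from typing import List


def build_wait_command(name: str, timeout_seconds: int = 300) -> List[str]:
    """Build a `kubectl wait` command for an InferenceService's Ready condition."""
    return [
        "kubectl", "wait",
        "--for=condition=Ready",
        f"inferenceservice/{name}",
        f"--timeout={timeout_seconds}s",
    ]


def wait_for_isvc(name: str, timeout_seconds: int = 300) -> bool:
    """Run the wait command and report whether the service became ready in time."""
    result = subprocess.run(
        build_wait_command(name, timeout_seconds),
        capture_output=True,
        text=True,
    )
    # Show whatever kubectl reported, on success or failure
    print(result.stdout or result.stderr)
    return result.returncode == 0
```

For example, `wait_for_isvc("vectorstore")` returns `True` once the service created above reports `Ready`, and `False` if the timeout expires first.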

# Conclusion and Next Steps

Congratulations! You've successfully logged the Chroma DB files as artifacts using MLFlow, chosen or built a custom predictor image, and set up an Inference Service with KServe that retrieves these artifacts to serve your Vector Store. This Inference Service forms the backbone of the question-answering application, enabling it to answer queries efficiently based on the document embeddings you generated previously.

From here, there are two paths you can choose:

- Testing the Vector Store Inference Service: If you'd like to test the Vector Store Inference Service that you've just created, you can proceed to our third (optional) Notebook. This Notebook provides a step-by-step guide on how to invoke the Inference Service and validate its performance.
- Creating the LLM Inference Service: Alternatively, if you're ready to move on to the next stage of the project, you can jump straight to our fourth Notebook. In this Notebook, we'll guide you through the process of creating an Inference Service for the Large Language Model (LLM), which will work in conjunction with the Vector Store Inference Service to provide answers to user queries.