# Creating a Large Language Model Inference Service

Welcome to the fourth part of the tutorial series on building a question-answering application over a corpus of private
documents using Large Language Models (LLMs). In the previous Notebooks, you created vector embeddings of your
documents, set up a Vector Store Inference Service (ISVC), and tested its performance.

<figure>
  <img src="images/llm.jpg" alt="llm" style="width:100%">
  <figcaption>
      Photo by <a href="https://unsplash.com/@deepmind?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText">Google DeepMind</a> on <a href="https://unsplash.com/photos/LaKwLAmcnBc?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText">Unsplash</a>
  </figcaption>
</figure>

Now, you're moving towards the next crucial step: creating an ISVC for the LLM. This ISVC is the centerpiece of the
question-answering system, working in tandem with the Vector Store ISVC to deliver comprehensive and accurate answers to
user queries.

In this Notebook, you set up this LLM ISVC. You learn how to build a Docker image for the custom predictor, understand
the role of the transformer component, define a KServe ISVC YAML file, and deploy the service. By the end of this
Notebook, you'll have a fully functioning LLM ISVC that can accept user queries, interact with the Vector Store, and
provide insightful responses.

## Table of Contents

1. [Architecture](#architecture)
1. [Creating the Inference Service](#creating-the-inference-service)
1. [Conclusion and Next Steps](#conclusion-and-next-steps)

In [None]:
import os
import subprocess
import ipywidgets as widgets

from IPython.display import display

In [None]:
# If you're running behind a proxy, set your proxy URLs below.
# DO NOT edit the "NO_PROXY" variable unless you know what you're doing!
os.environ["HTTP_PROXY"] = ""
os.environ["HTTPS_PROXY"] = ""
os.environ["NO_PROXY"] = "10.96.0.0/12, .local"  # DO NOT edit.

# Architecture

In this setup, an additional component, called a "transformer", plays a pivotal role in processing user queries and
integrating the Vector Store ISVC with the LLM ISVC. The transformer's role is to intercept the user's request, extract
the necessary information, and then communicate with the Vector Store ISVC to retrieve the relevant context. The
transformer then takes the response of the Vector Store ISVC (i.e., the context), combines it with the user's query, and
forwards the enriched prompt to the LLM predictor.

Here's a detailed look at the process:

1. Intercepting the User's Request: The transformer acts as a gateway between the user and the LLM ISVC. When a user
   sends a query, it first reaches the transformer. The transformer extracts the query from the request.
1. Communicating with the Vector Store ISVC: The transformer then takes the user's query and sends a POST request to the
   Vector Store ISVC including the user's query in the payload, just like you did in the previous Notebook.
1. Receiving and Processing the Context: The Vector Store ISVC responds by sending back the relevant context.
1. Combining the Context with the User's Query: The transformer then combines the received context with the user's
   original query using a prompt template. This creates an enriched prompt that contains both the user's original
   question and the relevant context from our documents.
1. Forwarding the Enriched Query to the LLM Predictor: Finally, the transformer forwards this enriched query to the LLM
   predictor. The predictor then processes this query and generates a response, which is sent back to the transformer.
   Steps 2 through 5 are transparent to the user.
1. Returning the Response to the User: The transformer returns the predictor's response to the user.
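
The steps above can be sketched in plain Python. This is a minimal illustration, not the code in
`dockerfiles/transformer`: the prompt template, the function names `build_prompt` and `enrich_query`, and the stubbed
retriever that stands in for the Vector Store ISVC are all assumptions made for the example.

```python
from typing import Callable

# Hypothetical prompt template; the real transformer may word it differently.
PROMPT_TEMPLATE = (
    "Use the following context to answer the question.\n\n"
    "Context:\n{context}\n\n"
    "Question: {query}\n"
    "Answer:"
)


def build_prompt(context: str, query: str) -> str:
    """Combine the retrieved context with the user's query (step 4)."""
    return PROMPT_TEMPLATE.format(context=context, query=query)


def enrich_query(query: str, retrieve_context: Callable[[str], str]) -> str:
    """Fetch context for the query (steps 2-3) and build the enriched prompt."""
    context = retrieve_context(query)
    return build_prompt(context, query)


def fake_vector_store(query: str) -> str:
    """Stand-in for the POST request to the Vector Store ISVC."""
    return "KServe is a model serving platform for Kubernetes."


prompt = enrich_query("What is KServe?", fake_vector_store)
print(prompt)
```

Passing the retriever as an argument keeps the prompt-building logic testable without a running Vector Store ISVC.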

As such, you should build two custom Docker images at this point: one for the predictor and one for the transformer. The
source code and the Dockerfiles are provided in the corresponding folders: `dockerfiles/llm` and
`dockerfiles/transformer`. For your convenience, you can use the images we have pre-built for you:

- Predictor: `marketplace.us1.greenlake-hpe.com/ezmeral/ezkf/qna-llm:v1.3.0-e658264`
- Transformer: `marketplace.us1.greenlake-hpe.com/ezmeral/ezkf/qna-transformer:v1.3.0-f7315d3`
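
If you prefer to build the images yourself, the commands might look like the sketch below. It assumes that each folder
contains a `Dockerfile` at its root and that Docker is available on your machine. The registry
`registry.example.com/my-team` and the tag `dev` are placeholders, not values from this tutorial. The cell only prints
the commands so you can review them before running anything.

```python
import shlex

# Hypothetical registry and tag; replace them with your own values.
REGISTRY = "registry.example.com/my-team"
TAG = "dev"


def docker_build_command(name: str, context_dir: str) -> list[str]:
    """Return the `docker build` command for one image."""
    return ["docker", "build", "-t", f"{REGISTRY}/{name}:{TAG}", context_dir]


commands = [
    docker_build_command("qna-llm", "dockerfiles/llm"),
    docker_build_command("qna-transformer", "dockerfiles/transformer"),
]

for command in commands:
    # Print only; run with subprocess.run(command, check=True) when ready.
    print(shlex.join(command))
```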

Once ready, proceed with the next steps.

# Creating the Inference Service

As before, you need to provide a few variables:

1. The custom predictor image you built.
1. The custom transformer image you built.

You can leave any field empty to use the image we provide for you:

In [None]:
# Add heading
heading = widgets.HTML("<h2>KServe Images</h2>")
display(heading)

predictor_image_widget = widgets.Text(
    description="Predictor Image Name:",
    placeholder="marketplace.us1.greenlake-hpe.com/ezmeral/ezkf/qna-llm:v1.3.0-e658264",
    layout=widgets.Layout(width='30%'))
transformer_image_widget = widgets.Text(
    description="Transformer Image Name:",
    placeholder="marketplace.us1.greenlake-hpe.com/ezmeral/ezkf/qna-transformer:v1.3.0-f7315d3",
    layout=widgets.Layout(width='30%'))
submit_button = widgets.Button(description="Submit")
success_message = widgets.Output()

predictor_image = None
transformer_image = None

def submit_button_clicked(b):
    global predictor_image, transformer_image
    predictor_image = predictor_image_widget.value
    transformer_image = transformer_image_widget.value
    with success_message:
        success_message.clear_output()
        if not predictor_image:
            predictor_image = "marketplace.us1.greenlake-hpe.com/ezmeral/ezkf/qna-llm:v1.3.0-e658264"
        print(f"The name of the predictor image will be: '{predictor_image}'")
        if not transformer_image:
            transformer_image = "marketplace.us1.greenlake-hpe.com/ezmeral/ezkf/qna-transformer:v1.3.0-f7315d3"
        print(f"The name of the transformer image will be: '{transformer_image}'")
    submit_button.disabled = True

submit_button.on_click(submit_button_clicked)

# Set margin on the submit button
submit_button.layout.margin = '20px 0 20px 0'

# Display inputs and button
display(predictor_image_widget, transformer_image_widget, submit_button, success_message)

In [None]:
isvc = """
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: llm
spec:
  predictor:
    timeout: 600
    containers:
      - name: kserve-container
        image: {0}
        imagePullPolicy: Always
        # The proxy values below come from the HTTP_PROXY and HTTPS_PROXY variables you set earlier.
        # They are empty if you are not running behind a proxy.
        env:
        - name: HTTP_PROXY
          value: {1}
        - name: HTTPS_PROXY
          value: {2}
        - name: NO_PROXY
          value: .local
        resources:
          requests:
            memory: "8Gi"
            cpu: "1000m"
          limits:
            memory: "8Gi"
            cpu: "1000m"
  transformer:
    timeout: 600
    containers:
      - image: {3}
        imagePullPolicy: Always
        resources:
          requests:
            memory: "1Gi"
            cpu: "500m"
          limits:
            memory: "1Gi"
            cpu: "500m"
        name: kserve-container
        args: ["--use_ssl"]
""".format(predictor_image,
           os.environ["HTTP_PROXY"],
           os.environ["HTTPS_PROXY"],
           transformer_image)

with open("llm-isvc.yaml", "w") as f:
    f.write(isvc)

In [None]:
# check=True raises an error if kubectl fails, instead of failing silently.
subprocess.run(["kubectl", "apply", "-f", "llm-isvc.yaml"], check=True)
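
After you apply the manifest, the service needs some time to pull the images and start. The following sketch shows one
way to check whether it is ready, assuming the InferenceService reports a `Ready` entry under `status.conditions`. The
helper names `is_ready` and `fetch_isvc` are illustrative. `fetch_isvc` needs access to the cluster, so the example at
the bottom uses a hand-written status instead.

```python
import json
import subprocess


def is_ready(isvc: dict) -> bool:
    """Return True if the object has a Ready condition with status "True"."""
    conditions = isvc.get("status", {}).get("conditions", [])
    return any(
        c.get("type") == "Ready" and c.get("status") == "True" for c in conditions
    )


def fetch_isvc(name: str = "llm") -> dict:
    """Read the InferenceService object with kubectl (requires cluster access)."""
    result = subprocess.run(
        ["kubectl", "get", "inferenceservice", name, "-o", "json"],
        capture_output=True,
        text=True,
        check=True,
    )
    return json.loads(result.stdout)


# Offline example with a hand-written status, so the cell runs without a cluster.
sample = {"status": {"conditions": [{"type": "Ready", "status": "True"}]}}
print(is_ready(sample))
```

On a live cluster, you would call `is_ready(fetch_isvc())` and repeat the call until it returns `True`.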

# Conclusion and Next Steps

Congratulations on completing this crucial step in this tutorial series! You've successfully built an LLM ISVC, and
you've learned about the role of a transformer in enriching user queries with relevant context from our documents.
Together with the Vector Store ISVC, these components form the backbone of your question-answering application.

However, the journey doesn't stop here. The next and final step is to test the LLM ISVC, ensuring that it's working as
expected and delivering accurate responses. This will help you gain confidence in your setup and prepare you for
real-world applications. In the next Notebook, you invoke the LLM ISVC. You see how to construct suitable requests,
communicate with the service, and interpret the responses.