# Creating a Large Language Model Inference Service

Welcome to the fourth part of the tutorial series on building a question-answering application over a corpus of private
documents using Large Language Models (LLMs). The previous Notebooks walked you through the processes of creating
vector embeddings of the documents, setting up the Inference Services (ISVCs) for the Vector Store and the Embeddings model,
and testing the performance of the information retrieval system.

<figure>
  <img src="images/llm.jpg" alt="llm" style="width:100%">
  <figcaption>
      Photo by <a href="https://unsplash.com/@deepmind?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText">Google DeepMind</a> on <a href="https://unsplash.com/photos/LaKwLAmcnBc?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText">Unsplash</a>
  </figcaption>
</figure>

Now, you're moving towards the next crucial step: creating an ISVC for the LLM. This ISVC is the centerpiece of the
question-answering system, working in tandem with the Vector Store ISVC to deliver comprehensive and accurate answers to
user queries.

In this Notebook, you set up this LLM ISVC. You learn about the role of the transformer component, how to build a
Docker image for it, how to define a KServe ISVC YAML file, and how to deploy the service. By the end of this Notebook,
you'll have a fully functioning LLM ISVC that can accept user queries, interact with the Vector Store, and provide
insightful responses.

To complete this step, you need access to the Llama 2 model on Hugging Face Hub and to the pre-built TensorRT-LLM engines:

* The TensorRT-LLM engines can be downloaded from the following link: https://ezmeral-artifacts.s3.us-east-2.amazonaws.com/llama-engines.tar.gz.
  You download them later in this Notebook.
* The model repository is available on Hugging Face Hub at https://huggingface.co/meta-llama/Llama-2-7b-chat-hf.
  You clone the repository without the model weights, so you do not need Git LFS. Before cloning, sign in to Hugging Face
  and accept the terms and conditions for using this model.

## Table of Contents

1. [Architecture](#architecture)
1. [Downloading the Artifacts](#downloading-the-artifacts)
1. [Creating the Inference Service](#creating-the-inference-service)
1. [Conclusion and Next Steps](#conclusion-and-next-steps)

In [None]:
import os
import subprocess
import ipywidgets as widgets

from IPython.display import display

# Architecture

In this setup, an additional component, called a "transformer", plays a pivotal role in processing user queries and
integrating the Vector Store ISVC with the LLM ISVC. The transformer's role is to intercept the user's request, extract
the necessary information, and then communicate with the Vector Store ISVC to retrieve the relevant context. The
transformer then takes the response of the Vector Store ISVC (i.e., the context), combines it with the user's query, and
forwards the enriched prompt to the LLM predictor.

Here's a detailed look at the process:

1. **Intercepting the User's Request**: The transformer acts as a gateway between the user and the LLM ISVC. When a user
   sends a query, it first reaches the transformer. The transformer extracts the query from the request.
1. **Communicating with the Vector Store ISVC**: The transformer then takes the user's query and sends a POST request to the
   Vector Store ISVC including the user's query in the payload, just like you did in the previous Notebook.
1. **Receiving and Processing the Context**: The Vector Store ISVC responds by sending back the relevant context.
1. **Combining the Context with the User's Query**: The transformer then combines the received context with the user's
   original query using a prompt template. This creates an enriched prompt that contains both the user's original
   question and the relevant context from our documents.
1. **Forwarding the Enriched Prompt to the LLM Predictor**: Finally, the transformer forwards this enriched prompt to the LLM
   predictor. The predictor then processes the prompt and generates a response, which is sent back to the transformer.
   Steps 2 through 5 are transparent to the user.
1. **Returning the Final Response**: The transformer returns the response to the user.
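
The query-extraction and prompt-building steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not the code shipped in `dockerfiles/transformer`: the `"input"` payload key, the `PROMPT_TEMPLATE` wording, and both function names are assumptions made for the example.

```python
from typing import List

# Hypothetical prompt template; the one used by the real transformer may differ.
PROMPT_TEMPLATE = (
    "Use the following context to answer the question.\n\n"
    "Context:\n{context}\n\n"
    "Question: {question}\n"
    "Answer:"
)


def extract_query(request: dict) -> str:
    """Pull the user's question out of the incoming request.

    The "input" key is an assumed payload shape, used for illustration only.
    """
    return request["input"].strip()


def build_prompt(query: str, context_chunks: List[str]) -> str:
    """Merge the retrieved context and the original query into one prompt."""
    context = "\n".join(chunk.strip() for chunk in context_chunks)
    return PROMPT_TEMPLATE.format(context=context, question=query)


# Example: the chunks stand in for the Vector Store ISVC response.
request = {"input": "  How do I create a PVC?  "}
chunks = ["A PVC is a request for storage.", "Use kubectl apply to create one."]
prompt = build_prompt(extract_query(request), chunks)
print(prompt)
```

Keeping the template in one constant makes it easy to adjust the instructions given to the LLM without touching the request handling.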

Because the transformer is a custom component, it needs its own Docker image. The
source code and the Dockerfile are provided in the corresponding folder: `dockerfiles/transformer`.
For your convenience, you can use the image we have pre-built for you: `marketplace.us1.greenlake-hpe.com/ezmeral/ezkf/qna-transformer-gpu:v1.3.0-e658264`.

Once ready, proceed with the next steps.

# Downloading the Artifacts

Next, download the artifacts that the model needs. The first step is to clone the model repository from Hugging Face Hub into the current directory. Setting `GIT_LFS_SKIP_SMUDGE=1` tells Git LFS to skip downloading large files, so the clone contains small pointer files in place of the model weights:

```
GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/meta-llama/Llama-2-7b-chat-hf
```

When prompted, provide your Hugging Face username and an access token for your account. If you don't have an access token, create a new one. When the clone completes, run the following command to move the cloned Llama 2 7B directory to the right location:

In [None]:
!mv Llama-2-7b-chat-hf/ inflight_batcher_llm/preprocessing/1/
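
If you want to see which files in the clone are Git LFS pointers rather than real content, a short check like the following can help. It relies on the fact that a Git LFS pointer file starts with the line `version https://git-lfs.github.com/spec/v1`; the function names are hypothetical helpers written for this sketch.

```python
from pathlib import Path
from typing import List

# Git LFS pointer files are small text files that begin with this line.
LFS_POINTER_PREFIX = b"version https://git-lfs.github.com/spec/v1"


def is_lfs_pointer(path: Path) -> bool:
    """Return True if the file looks like a Git LFS pointer, not real content."""
    with path.open("rb") as f:
        return f.read(len(LFS_POINTER_PREFIX)) == LFS_POINTER_PREFIX


def list_lfs_pointers(repo_dir: str) -> List[str]:
    """List files under repo_dir (ignoring .git) that are still LFS pointers."""
    root = Path(repo_dir)
    return sorted(
        str(p.relative_to(root))
        for p in root.rglob("*")
        if p.is_file()
        and ".git" not in p.relative_to(root).parts
        and is_lfs_pointer(p)
    )
```

For example, `list_lfs_pointers("inflight_batcher_llm/preprocessing/1/Llama-2-7b-chat-hf")` returns the relative paths of the files that hold no real content. The weight files are expected to appear there; if a file that a later step needs also appears, fetch it explicitly.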

Next, let's download the pre-built TensorRT-LLM engines:

In [None]:
# If you are behind a proxy, do not forget to set your `https_proxy` and `HTTPS_PROXY` environment variables.
# os.environ["https_proxy"] = ""
# os.environ["HTTPS_PROXY"] = ""

In [None]:
!wget https://ezmeral-artifacts.s3.us-east-2.amazonaws.com/llama-engines.tar.gz  # download the tarball

In [None]:
!mv llama-engines.tar.gz inflight_batcher_llm/tensorrt_llm/1/  # move the tarball to the right location
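
Before extracting the tarball, you may want to look at what it contains. The sketch below uses Python's standard `tarfile` module to list the archive members and to flag entries with absolute paths or `..` components; both function names are hypothetical helpers, and the archive contents are not assumed here.

```python
import tarfile
from typing import List


def list_archive_members(archive_path: str) -> List[str]:
    """Return the member names of a gzip-compressed tarball without extracting it."""
    with tarfile.open(archive_path, "r:gz") as tar:
        return tar.getnames()


def find_unsafe_members(names: List[str]) -> List[str]:
    """Return member names that would be written outside the target directory."""
    return [n for n in names if n.startswith("/") or ".." in n.split("/")]
```

For example, `find_unsafe_members(list_archive_members("inflight_batcher_llm/tensorrt_llm/1/llama-engines.tar.gz"))` should return an empty list for a well-formed archive.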

In [None]:
!tar xzf inflight_batcher_llm/tensorrt_llm/1/llama-engines.tar.gz -C inflight_batcher_llm/tensorrt_llm/1/  # extract the tarball

# Creating the Inference Service

As before, you need to provide the name of the transformer image. You can leave the field empty to use the image we
provide for you:

In [None]:
# Add heading
heading = widgets.HTML("<h2>Transformer Image</h2>")
display(heading)

transformer_image_widget = widgets.Text(
    description="Image Name:",
    placeholder="Default: marketplace.us1.greenlake-hpe.com/ezmeral/ezkf/qna-transformer-gpu:v1.3.0-e658264",
    layout=widgets.Layout(width='30%'))
submit_button = widgets.Button(description="Submit")
success_message = widgets.Output()

transformer_image = None

def submit_button_clicked(b):
    global transformer_image
    transformer_image = transformer_image_widget.value
    with success_message:
        success_message.clear_output()
        if not transformer_image:
            transformer_image = "marketplace.us1.greenlake-hpe.com/ezmeral/ezkf/qna-transformer-gpu:v1.3.0-e658264"
        print(f"The name of the transformer image will be: '{transformer_image}'")
    submit_button.disabled = True

submit_button.on_click(submit_button_clicked)

# Set margin on the submit button
submit_button.layout.margin = '20px 0 20px 0'

# Display inputs and button
display(transformer_image_widget, submit_button, success_message)

Copy the `inflight_batcher_llm` directory to the shared PVC. The `inflight_batcher_llm` directory defines the structure of the model repository that the Triton Inference Server expects in order to serve Llama 2 7B using the TensorRT-LLM backend.
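
A Triton model repository typically holds one directory per model, each with a `config.pbtxt` file and at least one numeric version directory (such as the `1/` directories used earlier in this Notebook). The following sketch checks a repository against that layout before you copy it. `check_model_repository` is a hypothetical helper, and the check is deliberately shallow: it does not parse `config.pbtxt` or validate the engines.

```python
from pathlib import Path
from typing import Dict, List


def check_model_repository(repo_dir: str) -> Dict[str, List[str]]:
    """Map each model directory to its layout problems (empty list: none found)."""
    problems: Dict[str, List[str]] = {}
    for model_dir in sorted(Path(repo_dir).iterdir()):
        # Skip plain files and hidden directories such as .ipynb_checkpoints.
        if not model_dir.is_dir() or model_dir.name.startswith("."):
            continue
        issues: List[str] = []
        if not (model_dir / "config.pbtxt").is_file():
            issues.append("missing config.pbtxt")
        versions = [d for d in model_dir.iterdir() if d.is_dir() and d.name.isdigit()]
        if not versions:
            issues.append("no numeric version directory")
        problems[model_dir.name] = issues
    return problems
```

Calling `check_model_repository("inflight_batcher_llm")` returns a dictionary keyed by model name; any non-empty list points at a directory worth inspecting before you deploy.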

In [None]:
!cp -r inflight_batcher_llm/ /mnt/shared/

Define and apply the LLM Inference Service:

In [None]:
llama_isvc = """
apiVersion: "serving.kserve.io/v1beta1"
kind: "InferenceService"
metadata:
  name: "ensemble"
spec:
  predictor:
    timeout: 600
    triton:
      image: nvcr.io/nvidia/tritonserver:24.01-trtllm-python-py3
      securityContext:
          runAsUser: 0
      resources:
        limits:
          cpu: "4"
          memory: 32Gi
          nvidia.com/gpu: 1
        requests:
          cpu: "4"
          memory: 32Gi
      storageUri: "pvc://kubeflow-shared-pvc/inflight_batcher_llm"
  transformer:
    timeout: 600
    containers:
      - image: {0}
        imagePullPolicy: Always
        resources:
          requests:
            memory: "1Gi"
            cpu: "500m"
          limits:
            memory: "1Gi"
            cpu: "500m"
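        name: kserve-container
        args: ["--protocol", "v2"]
""".format(
    # Fall back to the pre-built image if the form above was not submitted.
    transformer_image
    or "marketplace.us1.greenlake-hpe.com/ezmeral/ezkf/qna-transformer-gpu:v1.3.0-e658264"
)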

with open("llama-isvc.yaml", "w") as f:
    f.write(llama_isvc)

In [None]:
# check=True raises an error if `kubectl apply` fails, instead of failing silently.
subprocess.run(["kubectl", "apply", "-f", "llama-isvc.yaml"], check=True)
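
The ISVC can take a while to become ready after you apply it. One way to block until it is up is `kubectl wait` on the `Ready` condition. The sketch below wraps that command; the helper names and the 600-second default (chosen to match the `timeout` in the manifest) are assumptions for illustration, and the command runs against the namespace of your current `kubectl` context.

```python
import subprocess
from typing import List


def build_wait_command(name: str, timeout_seconds: int = 600) -> List[str]:
    """Build a `kubectl wait` command that blocks until the ISVC reports Ready."""
    return [
        "kubectl", "wait",
        "--for=condition=Ready",
        f"inferenceservice/{name}",
        f"--timeout={timeout_seconds}s",
    ]


def wait_for_isvc(name: str, timeout_seconds: int = 600) -> bool:
    """Return True if the ISVC became Ready before the timeout, False otherwise."""
    result = subprocess.run(build_wait_command(name, timeout_seconds))
    return result.returncode == 0
```

For the service defined above, you would call `wait_for_isvc("ensemble")` and continue to the next Notebook once it returns `True`.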

# Conclusion and Next Steps

Congratulations on completing this crucial step in this tutorial series! You've successfully built an LLM ISVC, and
you've learned about the role of a transformer in enriching user queries with relevant context from our documents.
Together with the Vector Store ISVC, these components form the backbone of your question-answering application.

However, the journey doesn't stop here. The next and final step is to test the LLM ISVC, ensuring that it's working as
expected and delivering accurate responses. This will help you gain confidence in your setup and prepare you for
real-world applications. In the next Notebook, you invoke the LLM ISVC. You see how to construct suitable requests,
communicate with the service, and interpret the responses.