# Large Language Model Inference Service

Welcome to the second part of the tutorial series on building a chatbot over a corpus of private
documents using Large Language Models (LLMs). The previous Notebooks walked you through the process
of deploying an embeddings model and using it to populate a Vector Store.

Now, you're moving towards the next crucial step: creating an Inference Service (ISVC) for the LLM.
This ISVC is the centerpiece of the chatbot application, working in tandem with the Vector Store
to deliver comprehensive and accurate answers to user queries.

In this Notebook, you set up the LLM ISVC. You first deploy a new KServe serving runtime and then
deploy a KServe ISVC that uses an NVIDIA NIM backend.

## Table of Contents

1. [Creating the Inference Service](#creating-the-inference-service)
1. [Conclusion and Next Steps](#conclusion-and-next-steps)

In [None]:
import subprocess

# Creating the Inference Service

As before, first you need to apply a custom KServe serving runtime:

In [None]:
serving_runtime_image = "..."  # Set the serving runtime image here

serving_runtime_llm = """
apiVersion: serving.kserve.io/v1alpha1
kind: ClusterServingRuntime
metadata:
  name: nvidia-nim-llm-24.02.day0
spec:
  annotations:
    prometheus.kserve.io/path: /metrics
    prometheus.kserve.io/port: "8002"
  containers:
  - args:
    - |
      ln -s /mnt/models/* /model-store/; \
      echo 'engine:
        model: /model-store
        tensor_parallel_size: {{.Annotations.num_gpus}}
        dtype: float16' > /tmp/model_config.yaml; \
      nim_vllm --model_name={{.Annotations.nim_model_name}} \
        --openai_port=8000 \
        --model_config /tmp/model_config.yaml
    command:
    - /bin/sh
    - -c
    image: {0}
    name: kserve-container
    ports:
    - containerPort: 8000
      protocol: TCP
    resources:
      limits:
        cpu: "8"
        memory: 128Gi
        nvidia.com/gpu: 1
      requests:
        cpu: "4"
        memory: 64Gi
        nvidia.com/gpu: 1
    securityContext:
      runAsUser: 4474987
    volumeMounts:
    - mountPath: /dev/shm
      name: dshm
  imagePullSecrets:
  - name: ngc-secret
  protocolVersions:
  - v2
  - grpc-v2
  supportedModelFormats:
  - autoSelect: true
    name: nvidia-nim-llm
    priority: 1
    version: 24.02.day0
  volumes:
  - emptyDir:
      medium: Memory
      sizeLimit: 128Gi
    name: dshm
""".replace("{0}", serving_runtime_image)  # not str.format: it would collapse the {{...}} Go templates into {...}

with open("serving-runtime-llm.yaml", "w") as f:
    f.write(serving_runtime_llm)

# check=True raises CalledProcessError if kubectl exits with a non-zero status
subprocess.run(["kubectl", "apply", "-f", "serving-runtime-llm.yaml"], check=True)
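
The runtime manifest contains placeholders of the form `{{.Annotations.<key>}}`. KServe resolves these from the annotations of the Inference Service that uses the runtime, so the Inference Service is expected to carry `nim_model_name` and `num_gpus` annotations. The sketch below is only a rough Python imitation of that substitution, meant to show what the container arguments look like once the values are filled in. It is not KServe's implementation; the helper `render_annotations` and the sample annotation values are illustrative assumptions.

```python
import re

# Sample annotations, as they might appear in the InferenceService metadata
# (illustrative values, assumed for this sketch).
annotations = {"nim_model_name": "llama2-7b-chat", "num_gpus": "1"}


def render_annotations(template: str, annotations: dict) -> str:
    """Replace each {{.Annotations.<key>}} placeholder with its annotation value.

    Hypothetical helper that mimics the substitution for illustration only.
    """
    pattern = re.compile(r"\{\{\s*\.Annotations\.(\w+)\s*\}\}")
    return pattern.sub(lambda match: annotations[match.group(1)], template)


args = "nim_vllm --model_name={{.Annotations.nim_model_name}} --openai_port=8000"
engine_line = "tensor_parallel_size: {{.Annotations.num_gpus}}"

print(render_annotations(args, annotations))
print(render_annotations(engine_line, annotations))
```

Because the values come from annotations, the same runtime can serve a different model or GPU count without being redefined; only the Inference Service annotations change.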

Next, deploy the LLM:

In [None]:
storage_uri = "..."  # Set the model's storage URI here

llm_isvc = """
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  annotations:
    autoscaling.knative.dev/target: "10"
    nim_model_name: llama2-7b-chat
    num_gpus: "1"
  name: llama2-7b-chat-1xgpu-day0
spec:
  predictor:
    minReplicas: 1
    model:
      modelFormat:
        name: nvidia-nim-llm
      name: ""
      resources:
        limits:
          cpu: "8"
          memory: 64Gi
          nvidia.com/gpu: "1"
        requests:
          cpu: "6"
          memory: 48Gi
          nvidia.com/gpu: "1"
      runtime: nvidia-nim-llm-24.02.day0
      storageUri: {0}
""".format(storage_uri)

with open("llm-isvc.yaml", "w") as f:
    f.write(llm_isvc)

# check=True raises CalledProcessError if kubectl exits with a non-zero status
subprocess.run(["kubectl", "apply", "-f", "llm-isvc.yaml"], check=True)
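
Applying the manifest only submits it; loading the model can take a while. A minimal sketch of how you might wait for the Inference Service to become ready and then read its URL is shown below. It assumes `kubectl` is on the `PATH` and points at the namespace where the ISVC was created. The helper names and the `600s` timeout are hypothetical choices for this example, not part of the tutorial.

```python
import subprocess

ISVC_NAME = "llama2-7b-chat-1xgpu-day0"  # metadata.name from the ISVC manifest


def wait_command(name: str, timeout: str = "600s") -> list:
    """Build a `kubectl wait` command that blocks until the ISVC reports Ready."""
    return [
        "kubectl", "wait", "--for=condition=Ready",
        f"inferenceservice/{name}", f"--timeout={timeout}",
    ]


def url_command(name: str) -> list:
    """Build a `kubectl get` command that prints the ISVC's status.url field."""
    return ["kubectl", "get", "inferenceservice", name, "-o", "jsonpath={.status.url}"]


def wait_and_get_url(name: str, timeout: str = "600s", run=subprocess.run) -> str:
    """Wait for readiness, then return the ISVC URL.

    `run` is injectable so the function can be exercised without a cluster.
    """
    run(wait_command(name, timeout), check=True)
    result = run(url_command(name), check=True, capture_output=True, text=True)
    return result.stdout.strip()
```

Calling `wait_and_get_url(ISVC_NAME)` in a cell blocks until the service is ready or the timeout expires, in which case `kubectl` exits with an error and `CalledProcessError` is raised.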

# Conclusion and Next Steps

Congratulations on completing this crucial step in this tutorial series! You've successfully
deployed an LLM with KServe, using a custom NVIDIA NIM serving runtime.