# Vector Stores: Embedding and Storing Documents in a Latent Space

In this Jupyter Notebook, you explore a foundational element of a question-answering system: the
Vector Store. The Vector Store serves as the key component that allows you to efficiently retrieve
relevant context from a corpus of documents based on a user's query, providing the backbone of the
information retrieval system.

The approach you will use involves transforming each document into a high-dimensional numerical
representation known as an "embedding", using a fine-tuned embeddings model. This process is
sometimes referred to as "embedding" the document in a latent space. The latent space here is a
high-dimensional space where similar documents are close to each other. The position of a document
in this space is determined by the content and the semantic meaning it carries.
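
To make "close to each other" concrete, the following minimal sketch compares toy 3-dimensional
vectors with cosine similarity, one common way to measure closeness. The vectors and their names are
invented for illustration. They do not come from the embeddings model, whose vectors have far more
dimensions.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical embeddings, made up for this illustration
toy_embeddings = {
    "gpu_driver_guide": [0.9, 0.1, 0.0],
    "gpu_install_notes": [0.8, 0.2, 0.1],
    "cafeteria_menu": [0.0, 0.1, 0.9],
}

similar = cosine_similarity(
    toy_embeddings["gpu_driver_guide"], toy_embeddings["gpu_install_notes"])
unrelated = cosine_similarity(
    toy_embeddings["gpu_driver_guide"], toy_embeddings["cafeteria_menu"])
print(f"similar pair: {similar:.3f}, unrelated pair: {unrelated:.3f}")
```

The two GPU-related vectors point in nearly the same direction, so their similarity is close to 1,
while the unrelated pair scores close to 0.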

Once you have these embeddings, you store them in a Vector Store. A Vector Store is a database
designed to hold high-dimensional vectors, index them, and search them efficiently by similarity.
This lets you quickly identify the documents in your corpus that are semantically closest to a given
query, which is embedded as a vector in the same latent space. For this example, you will use
[FAISS](https://ai.meta.com/tools/faiss/), a popular open source library for vector similarity
search, through the Vector Store interface that LangChain provides.
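
Conceptually, a Vector Store supports two operations: adding vectors together with their payloads,
and returning the `k` stored items nearest to a query vector. The sketch below is a brute-force,
in-memory stand-in written only for illustration. `ToyVectorStore` is a hypothetical class, not
FAISS or LangChain code, and real indexes use far more efficient data structures.

```python
import math


class ToyVectorStore:
    """Hypothetical brute-force store: keeps (vector, text) pairs in a list."""

    def __init__(self):
        self._items = []

    def add(self, vector, text):
        self._items.append((list(vector), text))

    def search(self, query, k=4):
        """Return the k texts whose vectors are closest to query (Euclidean distance)."""
        ranked = sorted(self._items, key=lambda item: math.dist(item[0], query))
        return [text for _, text in ranked[:k]]


store = ToyVectorStore()
store.add([0.0, 0.0], "origin")
store.add([1.0, 1.0], "near (1, 1)")
store.add([5.0, 5.0], "far away")

nearest = store.search([0.9, 1.2], k=2)
print(nearest)
```

Scanning every stored vector is fine for a handful of items but scales linearly with the corpus,
which is the cost that dedicated similarity search libraries are built to reduce.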

The following cells in this Notebook guide you through the process of creating such a Vector Store.
You start by generating embeddings for each document, then you move on to storing these embeddings
in FAISS, and finally, you see how easy it is to retrieve documents from it based on a query.

## Table of Contents

1. [Deploy the Embeddings Model](#deploy-the-embeddings-model)
1. [Load the Documents](#load-the-documents)
1. [Document Processing](#document-processing-chunking-text-for-the-language-model)
1. [Generate and Store Embeddings](#generating-embeddings--storing-them-in-chroma)
1. [Conclusion and Next Steps](#conclusion-and-next-steps)

In [None]:
import os
import warnings
import requests
import subprocess
import ipywidgets as widgets

from IPython.display import IFrame, display
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import FAISS
from langchain_community.document_loaders import PyPDFLoader

from embeddings import EmbeddingsModelClient


warnings.filterwarnings('ignore')

First, let's get an authentication token that we can use to invoke the embeddings model Inference
Service (ISVC). You will need this later.

In [None]:
# Add heading
heading = widgets.HTML("<h2>Credentials</h2>")
display(heading)

domain_input = widgets.Text(description='Domain:', placeholder="i001ua.tryezmeral.com")
username_input = widgets.Text(description='Username:')
password_input = widgets.Password(description='Password:')
submit_button = widgets.Button(description='Submit')
success_message = widgets.Output()

domain = None
username = None
password = None

def submit_button_clicked(b):
    global domain, username, password
    domain = domain_input.value
    username = username_input.value
    password = password_input.value
    with success_message:
        success_message.clear_output()
        print("Credentials submitted successfully!")
    submit_button.disabled = True

submit_button.on_click(submit_button_clicked)

# Set margin on the submit button
submit_button.layout.margin = '20px 0 20px 0'

# Display inputs and button
display(domain_input, username_input, password_input, submit_button, success_message)

In [None]:
token_url = f"https://keycloak.{domain}/realms/UA/protocol/openid-connect/token"

data = {
    "username" : username,
    "password" : password,
    "grant_type" : "password",
    "client_id" : "ua-grant",
}

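# NOTE: verify=False disables TLS certificate verification; prefer a CA bundle outside of demos.
token_response = requests.post(token_url, data=data, allow_redirects=True, verify=False)
token_response.raise_for_status()  # fail early on bad credentials or a wrong domain

token = token_response.json()["access_token"]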
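
Access tokens issued by Keycloak are, by default, JSON Web Tokens (JWTs) with a limited lifetime, so
a token fetched here may expire before you reach the later cells. As a hedged sketch, the
hypothetical helpers below read the `exp` claim to show how long a token remains valid. They only
decode the payload and do not verify the signature, so use them for inspection, not for
authentication decisions.

```python
import base64
import json
import time


def jwt_expiry(jwt_token):
    """Return the `exp` claim (Unix seconds) of a JWT, without verifying its signature."""
    payload_b64 = jwt_token.split(".")[1]
    payload_b64 += "=" * (-len(payload_b64) % 4)  # restore stripped base64 padding
    payload = json.loads(base64.urlsafe_b64decode(payload_b64))
    return payload["exp"]


def seconds_remaining(jwt_token, now=None):
    """Return how many seconds are left before the token expires."""
    now = time.time() if now is None else now
    return jwt_expiry(jwt_token) - now


# Usage in the notebook (assumes `token` from the cell above):
# print(f"Token expires in {seconds_remaining(token):.0f} s")
```

If the remaining time is close to zero, rerun the token request cell before calling the model.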

# Deploy the Embeddings Model

First, we need to deploy the embeddings model we will use to turn documents into multi-dimensional
vectors. For this, we will use KServe, which leverages NVIDIA NIM as a backend.

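The first step is to create an image pull secret, so that the cluster can pull the images needed to
deploy NVIDIA NIM models. For this, you need an NGC token for the NVIDIA container registry
(`nvcr.io`). Note that the `.dockerconfigjson` field of the secret must hold a base64-encoded Docker
config JSON, not the raw token.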

In [None]:
nvcr_token = "..."  # Your NGC token here

ngc_secret = """
apiVersion: v1
kind: Secret
metadata:
  name: ngc-secret
type: kubernetes.io/dockerconfigjson
data:
  .dockerconfigjson: {0}
""".format(nvcr_token)

with open("ngc-secret.yaml", "w") as f:
    f.write(ngc_secret)

subprocess.run(["kubectl", "apply", "-f", "ngc-secret.yaml"])
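
Kubernetes expects the `.dockerconfigjson` value of a `kubernetes.io/dockerconfigjson` secret to be
a base64-encoded Docker config file. The following sketch shows one way to build that value from an
API key. The helper name is hypothetical, and the `nvcr.io` registry with the literal username
`$oauthtoken` reflects the usual NGC convention; treat both defaults as assumptions to check against
your environment.

```python
import base64
import json


def build_dockerconfigjson(api_key, registry="nvcr.io", username="$oauthtoken"):
    """Return the base64-encoded Docker config expected in `.dockerconfigjson`."""
    auth = base64.b64encode(f"{username}:{api_key}".encode()).decode()
    config = {
        "auths": {
            registry: {"username": username, "password": api_key, "auth": auth}
        }
    }
    return base64.b64encode(json.dumps(config).encode()).decode()


# Placeholder key for illustration only, not a real credential
encoded = build_dockerconfigjson("my-ngc-api-key")
```

The resulting string is what gets substituted into the `.dockerconfigjson` field of the secret
manifest.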

Then, create a custom KServe runtime for NVIDIA NIM models. For this you need to get the name of the
image for the runtime, and ensure that you can pull it using the image pull secret you created in
the previous step.

In [None]:
serving_runtime_image = "..."  # The image of the serving runtime here

# Raw string: keeps the backslashes in the sed expression below exactly as written
serving_runtime_embeddings = r"""
apiVersion: serving.kserve.io/v1alpha1
kind: ClusterServingRuntime
metadata:
  name: nvidia-nim-embedding-24.02
spec:
  annotations:
    prometheus.kserve.io/path: /metrics
    prometheus.kserve.io/port: "8002"
  containers:
  - args:
    - |
      sed -i 's/checkpoint_path: .*/checkpoint_path: "\/mnt\/models\/nv-embed-qa_v4\/NV-Embed-QA-4.nemo"/g' /app/model_config_templates/NV-Embed-QA_template.yaml; \
      /app/bin/web -c /mnt/models/nv-embed-qa_v4/NV-Embed-QA-4.nemo -g /app/model_config_templates/NV-Embed-QA_template.yaml -p "8080"
    command:
    - /bin/sh
    - -c
    image: {0}
    name: kserve-container
    ports:
    - containerPort: 8080
      protocol: TCP
    resources:
      limits:
        cpu: "8"
        memory: 128Gi
        nvidia.com/gpu: 1
      requests:
        cpu: "4"
        memory: 64Gi
        nvidia.com/gpu: 1
    securityContext:
      runAsUser: 4474987
    volumeMounts:
    - mountPath: /dev/shm
      name: dshm
  imagePullSecrets:
  - name: ngc-secret
  protocolVersions:
  - v2
  - grpc-v2
  supportedModelFormats:
  - autoSelect: true
    name: nvidia-nim-embedding
    priority: 1
    version: "24.02"
  volumes:
  - emptyDir:
      medium: Memory
      sizeLimit: 128Gi
    name: dshm
""".format(serving_runtime_image)

with open("serving-runtime-embeddings.yaml", "w") as f:
    f.write(serving_runtime_embeddings)

subprocess.run(["kubectl", "apply", "-f", "serving-runtime-embeddings.yaml"])

Finally, deploy the model. Make sure that you have the model stored in a location that the server
can access.

In [None]:
storage_uri = "..."  # The storage URI here

embeddings_isvc = """
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  annotations:
    autoscaling.knative.dev/target: "10"
    nim_model_name: nv-embed-qa
  name: nv-embed-qa-4
spec:
  predictor:
    minReplicas: 1
    model:
      modelFormat:
        name: nvidia-nim-embedding
      name: ""
      resources:
        limits:
          cpu: "8"
          memory: 128Gi
          nvidia.com/gpu: "1"
        requests:
          cpu: "6"
          memory: 64Gi
          nvidia.com/gpu: "1"
      runtime: nvidia-nim-embedding-24.02
      storageUri: {0}
""".format(storage_uri)

with open("embeddings-isvc.yaml", "w") as f:
    f.write(embeddings_isvc)

subprocess.run(["kubectl", "apply", "-f", "embeddings-isvc.yaml"])
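
Applying the manifest returns immediately, while the model server may take several minutes to pull
images and load the model. As a minimal sketch, the hypothetical helpers below wrap `kubectl wait`
to block until the InferenceService reports the `Ready` condition. The default timeout is an
arbitrary assumption.

```python
import subprocess


def build_wait_command(name, namespace=None, timeout_seconds=600):
    """Build a `kubectl wait` command that blocks until the InferenceService is Ready."""
    command = [
        "kubectl", "wait", "--for=condition=Ready",
        f"inferenceservice/{name}", f"--timeout={timeout_seconds}s",
    ]
    if namespace:
        command += ["--namespace", namespace]
    return command


def wait_for_isvc(name, **kwargs):
    """Run the command and return True if the resource became Ready in time."""
    result = subprocess.run(build_wait_command(name, **kwargs))
    return result.returncode == 0


# Usage in the notebook:
# wait_for_isvc("nv-embed-qa-4", timeout_seconds=900)
```

A `False` result means the timeout elapsed or `kubectl` failed; inspecting the resource with
`kubectl describe` is a reasonable next step in that case.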

# Load the Documents

The next cells load PDF documents from a specified directory and extract their text. This prepares
your data before you embed it in the high-dimensional latent space. After running these cells, you
have a list of documents, one per PDF page, ready to be processed and embedded. This forms your
corpus.

First, let's take a look at the documents we will be working with.

In [None]:
pdf_example = 'documents/Q2xhcmEgSG9sb3NjYW4gTUdYIDMvMjIvMjIucGRm.pdf'

# Visualize the PDF sample
IFrame(pdf_example, width=900, height=500)

In [None]:
NUM_DOCS = 100  # Number of documents to load
pdf_folder_path = "documents/"

# Filter to PDFs first so that non-PDF files do not count toward NUM_DOCS
pdf_files = sorted(f for f in os.listdir(pdf_folder_path) if f.endswith('.pdf'))

documents = []
for file in pdf_files[:NUM_DOCS]:
    pdf_path = os.path.join(pdf_folder_path, file)
    loader = PyPDFLoader(pdf_path)
    documents.extend(loader.load())  # one Document per PDF page

In [None]:
documents[50]  # examine the extracted text of one loaded page
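
Because `PyPDFLoader` produces one document per page, with the file path stored under the `source`
metadata key, the list is usually longer than the number of PDF files. The hypothetical helper below
sketches a quick sanity check that counts pages per source file. It runs here on stand-in objects so
that it works without LangChain installed.

```python
from collections import Counter
from types import SimpleNamespace


def pages_per_source(docs):
    """Count how many page-level documents came from each source file."""
    return Counter(doc.metadata.get("source", "unknown") for doc in docs)


# Stand-in objects with the same attributes as LangChain documents
sample_docs = [
    SimpleNamespace(page_content="page one", metadata={"source": "documents/a.pdf", "page": 0}),
    SimpleNamespace(page_content="page two", metadata={"source": "documents/a.pdf", "page": 1}),
    SimpleNamespace(page_content="only page", metadata={"source": "documents/b.pdf", "page": 0}),
]
counts = pages_per_source(sample_docs)
print(counts.most_common())
```

In the Notebook, calling `pages_per_source(documents)` would give the same breakdown for the loaded
corpus.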

# Document Processing: Chunking Text for the Language Model

In this section of the Notebook, you process the documents by splitting them into chunks. This
operation is crucial when working with Large Language Models (LLMs), as these models have a maximum
limit on the number of tokens (words or pieces of words) they can process at once. This limit is
often referred to as the model's "context window".

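In this example, you split each document into segments that are at most `500` characters long, with
an overlap of `100` characters between consecutive segments. You use LangChain's
`RecursiveCharacterTextSplitter`, which measures length in characters by default and tries a list of
separators in order: it starts with two consecutive newline characters (`\n\n`) and falls back to
single newlines, spaces, and finally individual characters.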

In [None]:
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=500, chunk_overlap=100)
all_splits = text_splitter.split_documents(documents)
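
The effect of `chunk_size` and `chunk_overlap` can be illustrated with a deliberately simplified
sliding window over characters. This sketch is not LangChain's algorithm, which also tries to break
at separators; `sliding_window_chunks` is a hypothetical function that only shows how the two
parameters interact.

```python
def sliding_window_chunks(text, chunk_size=500, chunk_overlap=100):
    """Split text into fixed-size character windows that overlap by chunk_overlap."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


sample_text = "".join(str(i % 10) for i in range(1200))
chunks = sliding_window_chunks(sample_text)
print([len(chunk) for chunk in chunks])
```

For the 1,200-character sample, each window starts 400 characters after the previous one, so the
last 100 characters of one chunk reappear at the start of the next. The overlap reduces the chance
that a sentence relevant to a query is cut in half at a chunk boundary.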

# Generating Embeddings & Storing them in Chroma

In this section of the Notebook, you use the embeddings model to transform your documents into
semantically meaningful vectors.

By leveraging this model and the FAISS database interface provided by LangChain, you can embed your
documents into a latent space and subsequently store the results in a Vector Store.

In [None]:
DOMAIN_NAME = domain
DEPLOYMENT_NAME = "nv-embed-qa-4-predictor"
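# Read the namespace of the running pod; strip() guards against a trailing newline
with open("/var/run/secrets/kubernetes.io/serviceaccount/namespace", "r") as f:
    NAMESPACE = f.read().strip()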

In [None]:
embeddings = EmbeddingsModelClient(
    model_name="NV-Embed-QA",
    domain_name=DOMAIN_NAME,
    deployment_name=DEPLOYMENT_NAME,
    namespace=NAMESPACE,
    token=token)

In [None]:
vectorstore = FAISS.from_documents(all_splits, embeddings)
vectorstore.save_local("vectorstore")
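
LangChain's embeddings interface defines two methods: `embed_documents`, which maps a list of texts
to a list of vectors, and `embed_query`, which maps a single text to one vector. The implementation
of `EmbeddingsModelClient` is not shown in this Notebook, so the sketch below is not that client. It
is a hypothetical offline stand-in with the same two methods, using word hashing instead of a real
model, which can be handy for testing the pipeline without a deployed endpoint.

```python
import hashlib
import math


class ToyEmbeddings:
    """Hypothetical offline stand-in that mimics the two-method embeddings interface."""

    def __init__(self, dimensions=8):
        self.dimensions = dimensions

    def _embed(self, text):
        # Hash each word into a bucket, then L2-normalize the bucket counts
        vector = [0.0] * self.dimensions
        for word in text.lower().split():
            digest = hashlib.sha256(word.encode("utf-8")).digest()
            vector[digest[0] % self.dimensions] += 1.0
        norm = math.sqrt(sum(v * v for v in vector)) or 1.0
        return [v / norm for v in vector]

    def embed_documents(self, texts):
        return [self._embed(text) for text in texts]

    def embed_query(self, text):
        return self._embed(text)


toy = ToyEmbeddings()
doc_vectors = toy.embed_documents(["NVIDIA cuOpt overview", "release notes"])
query_vector = toy.embed_query("NVIDIA cuOpt overview")
```

To plug a class like this into LangChain, it would typically subclass
`langchain_core.embeddings.Embeddings`; the sketch omits that so it runs without LangChain
installed. The vectors it produces carry no semantic meaning and are only useful for wiring tests.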

Finally, you can test the document retrieval mechanism with a simple query. By default,
`similarity_search` returns the four chunks that are most similar to the query.

In [None]:
query = "What is NVIDIA cuOpt?"
matches = vectorstore.similarity_search(query)
matches
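
The raw list of matches can be hard to scan. As a small sketch, the hypothetical helper below prints
one line per match with its source file, page number, and the beginning of the text. It relies only
on the `page_content` and `metadata` attributes, and is shown here on a stand-in object with
placeholder text.

```python
from types import SimpleNamespace


def summarize_matches(matches, max_chars=80):
    """Return one summary line per match: source, page, and the start of the text."""
    lines = []
    for rank, match in enumerate(matches, start=1):
        source = match.metadata.get("source", "unknown")
        page = match.metadata.get("page", "?")
        snippet = " ".join(match.page_content.split())[:max_chars]
        lines.append(f"{rank}. {source} (page {page}): {snippet}")
    return lines


# Stand-in with the same attributes as a LangChain document
example = SimpleNamespace(
    page_content="placeholder text\nfor one retrieved chunk",
    metadata={"source": "documents/sample.pdf", "page": 3},
)
summary = summarize_matches([example])
for line in summary:
    print(line)
```

In the Notebook, `summarize_matches(matches)` would produce the same kind of listing for the chunks
returned by the query.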

# Conclusion and Next Steps

Congratulations! You have successfully embedded your documents into a high-dimensional latent space
and stored these embeddings in a Vector Store. By accomplishing this, you've transformed
unstructured text data into a structured form that can power a robust question-answering system.