# Vector Stores: Embedding and Storing Documents in a Latent Space

In this Jupyter Notebook, you explore a foundational element of a question-answering system: the Vector Store. The
Vector Store serves as the key component that allows you to efficiently retrieve relevant context from a corpus of
documents based on a user's query, providing the backbone of the information retrieval system.

<figure>
  <img src="images/documents.jpg" alt="documents" style="width:100%">
  <figcaption>
      Photo by <a href="https://unsplash.com/@anniespratt?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText">Annie Spratt</a> on <a href="https://unsplash.com/photos/5cFwQ-WMcJU?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText">Unsplash</a>
  </figcaption>
</figure>

The approach you will use involves transforming each document into a high-dimensional numerical
representation known as an "embedding", using a fine-tuned [BGE-M3](https://arxiv.org/abs/2402.03216) embeddings model.
This process is sometimes referred to as "embedding" the document in a latent space. The latent space here is a
high-dimensional space where similar documents are close to each other. The position of a document in this space is
determined by the content and the semantic meaning it carries.
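
To make "close to each other" concrete, here is a minimal sketch that compares toy vectors with cosine similarity, one common closeness measure for embeddings. The three-dimensional vectors and their names are hypothetical and chosen by hand for illustration; a real embeddings model produces vectors with far more dimensions.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical hand-made "embeddings" (assumption: not produced by any model).
doc_about_kubernetes = [0.9, 0.1, 0.0]
doc_about_containers = [0.8, 0.2, 0.1]
doc_about_cooking = [0.0, 0.1, 0.9]

sim_related = cosine_similarity(doc_about_kubernetes, doc_about_containers)
sim_unrelated = cosine_similarity(doc_about_kubernetes, doc_about_cooking)

# The two related documents point in almost the same direction.
print(sim_related > sim_unrelated)
```

A value near `1` means the two vectors point in nearly the same direction, while a value near `0` means they are unrelated.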

Once you have these embeddings, you store them in a Vector Store. A Vector Store is an advanced AI-native database
designed to hold these high-dimensional vectors, index them, and provide efficient search capabilities. This enables you
to quickly identify documents in your corpus that are semantically similar to a given query, which will also be
represented as a vector in the same latent space. For this example, you will use [Chroma](https://www.trychroma.com/),
a popular open source vector database.
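
As a rough mental model of what a Vector Store does, the sketch below keeps vectors in a Python list and answers a query with a linear scan. `ToyVectorStore`, its methods, and the sample vectors are all hypothetical and are not Chroma's API; production vector databases typically rely on approximate nearest-neighbor indexes instead of scanning every vector.

```python
import math


class ToyVectorStore:
    """Hypothetical in-memory store: keeps (vector, text) pairs and scans them all on search."""

    def __init__(self):
        self._items = []

    def add(self, vector, text):
        """Store one vector together with the text it represents."""
        self._items.append((vector, text))

    def search(self, query_vector, k=4):
        """Return the texts of the k stored vectors closest to the query (Euclidean distance)."""
        ranked = sorted(self._items, key=lambda item: math.dist(item[0], query_vector))
        return [text for _, text in ranked[:k]]


store = ToyVectorStore()
store.add([0.9, 0.1], "Installing the platform")
store.add([0.8, 0.3], "Upgrading the platform")
store.add([0.1, 0.9], "Billing and invoices")

# The query vector lives in the same space as the stored vectors.
results = store.search([1.0, 0.0], k=2)
print(results)
```

The query must be embedded with the same model as the documents; otherwise distances between the vectors carry no meaning.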

The following cells in this Notebook guide you through the process of creating such a Vector Store. You start by
generating embeddings for each document, then you move on to storing these embeddings in a Vector Store, and finally,
you see how to retrieve documents from the Vector Store based on a query.

## Table of Contents

1. [Download the Embeddings Model](#download-the-embeddings-model)
1. [Load the Documents](#load-the-documents)
1. [Document Processing](#document-processing-chunking-text-for-the-language-model)
1. [Generate and Store Embeddings](#generating-embeddings--storing-them-in-chroma)
1. [Conclusion and Next Steps](#conclusion-and-next-steps)

In [None]:
import os
import glob

from tqdm import tqdm
from chromadb.config import Settings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import Chroma
from langchain_community.document_loaders import JSONLoader

from embeddings import EmbeddingsModel

# Download the Embeddings Model

The first step is to download the fine-tuned embeddings model. You need the model both to create the Vector Store and, later, to deploy it for inference with KServe.

In [None]:
# If you are behind a proxy, do not forget to set your `https_proxy` and `HTTPS_PROXY` environment variables.
# os.environ["https_proxy"] = ""
# os.environ["HTTPS_PROXY"] = ""

In [None]:
!mkdir bge-m3  # create a directory for the embeddings model

In [None]:
!wget https://ezmeral-artifacts.s3.us-east-2.amazonaws.com/bge-m3.tar.gz  # download the embeddings model

In [None]:
!mv bge-m3.tar.gz bge-m3  # move the embeddings model tarball into the right directory

In [None]:
!tar xzf bge-m3/bge-m3.tar.gz -C bge-m3 # extract the embeddings model

In [None]:
!rm bge-m3/bge-m3.tar.gz  # remove the tarball you downloaded

# Load the Documents

The next cells use LangChain's `JSONLoader` to load the JSON documents from the `documents` directory. Loading is
the data preparation step that precedes embedding the documents into the high-dimensional latent space. After
running the following cells, you have a list of documents ready to be processed and embedded in the latent space.
This forms your corpus.

In [None]:
docs = []

In [None]:
ezua_loader = JSONLoader(
    file_path='./documents/EzUA.json',
    jq_schema='.[].content',
    text_content=False)

ezua_data = ezua_loader.load()

In [None]:
ezdf_loader = JSONLoader(
    file_path='./documents/EzDF.json',
    jq_schema='.[].content',
    text_content=False)

ezdf_data = ezdf_loader.load()

In [None]:
mlde_loader = JSONLoader(
    file_path='./documents/MLDE.json',
    jq_schema='.[].content',
    text_content=False)

mlde_data = mlde_loader.load()

mlde_data_filtered = list(filter(lambda doc: doc.page_content != "", mlde_data))

In [None]:
mldm_loader = JSONLoader(
    file_path='./documents/MLDM.json',
    jq_schema='.[].body',
    text_content=False)

mldm_data = mldm_loader.load()

mldm_data_filtered = list(filter(lambda doc: doc.page_content != "", mldm_data))

In [None]:
docs = ezua_data + ezdf_data + mlde_data_filtered + mldm_data_filtered
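
The `jq_schema` values used above, such as `.[].content`, iterate over a top-level JSON array and pull one field out of each element. The sketch below mimics that selection with the standard `json` module. The sample records are hypothetical and only assume the same overall shape as the files in `documents`; this is an illustration of the selection logic, not how `JSONLoader` is implemented.

```python
import json

# Hypothetical sample (assumption): a top-level array of objects with a "content" field.
raw = """
[
  {"title": "Overview", "content": "Product overview text."},
  {"title": "Empty page", "content": ""},
  {"title": "Install", "content": "Installation steps."}
]
"""

records = json.loads(raw)

# Rough equivalent of the jq expression `.[].content`: one value per array element.
contents = [record.get("content") for record in records]

# Same idea as the `filter(...)` calls above: drop entries whose text is empty.
non_empty = [text for text in contents if text != ""]

print(len(contents), len(non_empty))
```

Filtering out empty entries before chunking avoids storing vectors for documents that carry no text.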

# Document Processing: Chunking Text for the Language Model

In this section of the Notebook, you process the documents by splitting them into chunks. This operation is crucial when
working with Large Language Models (LLMs), as these models have a maximum limit on the number of tokens (words or pieces
of words) they can process at once. This limit is often referred to as the model's "context window".

In this example, you split each document into segments that are at most `500` characters long. You use LangChain's
`RecursiveCharacterTextSplitter`, which measures chunk size in characters by default. It first tries to split on two
consecutive newline characters, represented as `\n\n`, and falls back to smaller separators when a piece is still too
long. Consecutive segments overlap by up to `100` characters, so context that straddles a chunk boundary is preserved.

In [None]:
def process_docs(docs: list, chunk_size: int, chunk_overlap: int) -> list:
    """Split the already loaded documents into chunks."""
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=chunk_overlap)
    texts = text_splitter.split_documents(docs)
    return texts

texts = process_docs(docs, chunk_size=500, chunk_overlap=100)
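
To illustrate what the `chunk_size` and `chunk_overlap` arguments in the call above control, here is a deliberately simplified sliding-window splitter. The function `chunk_with_overlap` is hypothetical and is not LangChain's algorithm: `RecursiveCharacterTextSplitter` splits on separators and merges the pieces, whereas this sketch cuts at fixed character offsets.

```python
def chunk_with_overlap(text, chunk_size, chunk_overlap):
    """Cut text into windows of at most chunk_size characters that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = chunk_with_overlap(sample, chunk_size=10, chunk_overlap=3)
print(chunks)
# ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
```

Each chunk starts with the last three characters of the previous one, which is the kind of shared context that `chunk_overlap=100` provides at a larger scale.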

# Generating Embeddings & Storing them in Chroma

In this section of the Notebook, you use the embeddings model to transform your documents into semantically
meaningful vectors.

By leveraging this model and the Chroma database interface provided by LangChain, you can embed your documents into
a latent space and subsequently store the results in a Vector Store. In this step, the model processes the batches of
documents one after another, so it may take a while to complete.

In [None]:
embeddings = EmbeddingsModel()

In [None]:
db = Chroma(embedding_function=embeddings, persist_directory=f"{os.getcwd()}/db")

In [None]:
batch_size = 100  # Number of chunks to embed and store per batch

for i in tqdm(range(0, len(texts), batch_size), desc="Processing Batches"):
    batch = texts[i:i+batch_size]
    db.add_documents(batch)

In [None]:
db.persist()

Finally, you can sanity-check the document retrieval mechanism by running a simple query. By default, Chroma returns
the four most similar documents.

In [None]:
query = "How can I get started with HPE Ezmeral Unified Analytics?"
matches = db.similarity_search(query)
matches

# Conclusion and Next Steps

Congratulations! You have successfully embedded your documents into a high-dimensional latent space
and stored these embeddings in a Vector Store. By accomplishing this, you've transformed unstructured text data into a
structured form that can power a robust question-answering system.

However, your journey doesn't end here. Now that you have the Vector Store ready, the next step is to create an
Inference Service (ISVC) that can leverage this store to provide context to user queries. For this, you use KServe, a
flexible, cloud-native platform for serving Machine Learning models.

In the next Notebook, you will configure two ISVCs: a custom ISVC using KServe to deploy the Chroma Database, and
another one backed by the Triton Inference Server to serve the embeddings model.