# Vector Stores: Embedding and Storing Documents in a Latent Space

In this Jupyter Notebook, you explore a foundational element of a question-answering system: the Vector Store. The
Vector Store serves as the key component that allows us to efficiently retrieve relevant context from a corpus of
documents based on a user's query.

<figure>
  <img src="images/documents.jpg" alt="documents" style="width:100%">
  <figcaption>
      Photo by <a href="https://unsplash.com/@anniespratt?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText">Annie Spratt</a> on <a href="https://unsplash.com/photos/5cFwQ-WMcJU?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText">Unsplash</a>
  </figcaption>
</figure>

The approach you employ involves transforming each document in the corpus into a high-dimensional numerical
representation known as an "embedding", using a pre-trained Transformer model. This process is sometimes referred to as
"embedding" the document in a latent space. The latent space here is a high-dimensional space where similar documents
are close to each other. The position of a document in this space is determined by the content and the semantic meaning
it carries.
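
To make "close to each other" concrete, here is a minimal sketch that compares toy vectors with cosine similarity, one common way to measure closeness between embeddings. The three-dimensional vectors and their names are made up for illustration; a real embedding model produces vectors with hundreds of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical 3-dimensional "embeddings" (not produced by a real model).
doc_cgroups = [0.9, 0.1, 0.0]
doc_namespaces = [0.8, 0.3, 0.1]
doc_cooking = [0.0, 0.2, 0.9]

sim_related = cosine_similarity(doc_cgroups, doc_namespaces)
sim_unrelated = cosine_similarity(doc_cgroups, doc_cooking)

# The two systems-related vectors point in nearly the same direction,
# so their similarity is higher than that of the unrelated pair.
print(sim_related > sim_unrelated)
```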

Once you have these embeddings, you store them in a Vector Store. A Vector Store is a database designed to hold
high-dimensional vectors and search them efficiently by similarity. This enables you to quickly identify documents in
your corpus that are semantically similar to a given query, which is also represented as a vector in the same latent
space.
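
Conceptually, retrieval boils down to "store vectors, then return the entries nearest to the query vector". The sketch below shows this with a brute-force, in-memory store. `ToyVectorStore` and its data are hypothetical and only illustrate the idea; production vector stores often rely on approximate nearest-neighbor indexes rather than scanning every entry.

```python
import heapq
import math


class ToyVectorStore:
    """Brute-force in-memory store, for illustration only."""

    def __init__(self):
        self._items = []  # list of (unit_vector, text) pairs

    @staticmethod
    def _normalize(vec):
        norm = math.sqrt(sum(x * x for x in vec))
        return [x / norm for x in vec]

    def add(self, vector, text):
        # Store unit-length vectors so a dot product equals cosine similarity.
        self._items.append((self._normalize(vector), text))

    def search(self, query_vector, k=2):
        """Return the texts of the k entries most similar to the query."""
        q = self._normalize(query_vector)
        scored = (
            (sum(a * b for a, b in zip(q, vec)), text)
            for vec, text in self._items
        )
        best = heapq.nlargest(k, scored, key=lambda pair: pair[0])
        return [text for _, text in best]


store = ToyVectorStore()
store.add([0.9, 0.1, 0.0], "cgroups limit resource usage")
store.add([0.8, 0.3, 0.1], "namespaces isolate processes")
store.add([0.0, 0.2, 0.9], "how to bake bread")

# The query vector plays the role of an embedded user question.
top = store.search([1.0, 0.2, 0.0], k=2)
print(top)
```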

The following cells in this Notebook guide you through the process of creating such a Vector Store. You start by
generating embeddings for each document, then you store these embeddings in a Vector Store, and finally you see how
easy it is to retrieve documents from the Vector Store based on a query.

First, let's import the libraries you need:

In [None]:
import os
import glob

import mlflow

from tqdm import tqdm
from langchain.vectorstores import Chroma
from langchain.document_loaders import TextLoader
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from chromadb.config import Settings

# Load the Documents

The next cells contain a set of helper functions designed to load text documents from a specified directory. These
functions are essential for preparing your data before embedding it into the high-dimensional latent space.

The key operations performed by these functions are:

- Directory Scanning: Scan the specified directory for all `.txt` files. The `*.txt` pattern matches only files directly
  inside the directory, not files in subdirectories.
- Document Loading: Load each file into a LangChain `Document` object, using a `TextLoader`.

After running these cells, you have a list of documents ready to be processed and embedded in the latent space. This
forms your corpus.

In [None]:
def load_doc(fn):
    loader = TextLoader(fn)
    doc = loader.load()
    return doc

In [None]:
def load_docs(source_dir: str) -> list:
    """Load all `.txt` documents in the given directory."""
    fns = glob.glob(os.path.join(source_dir, "*.txt"))

    docs = []
    for fn in tqdm(fns, desc="Loading documents..."):
        # TextLoader.load() returns a list of Documents, so extend rather than append.
        docs.extend(load_doc(fn))

    return docs

In [None]:
docs = load_docs("documents")
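
The `*.txt` pattern in `load_docs` matches only files directly inside `source_dir`. If your corpus is spread across nested folders, a recursive scan could look like the following sketch. `find_text_files` is a hypothetical helper that is not part of the original Notebook, and the demo runs on a throwaway directory tree.

```python
import glob
import os
import tempfile


def find_text_files(source_dir: str) -> list:
    """Return every .txt file under source_dir, including nested folders."""
    # "**" only matches nested directories when recursive=True is passed.
    pattern = os.path.join(source_dir, "**", "*.txt")
    return sorted(glob.glob(pattern, recursive=True))


# Demo on a temporary directory with one nested folder.
with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "nested"))
    for rel in ("a.txt", os.path.join("nested", "b.txt"), "notes.md"):
        with open(os.path.join(root, rel), "w") as fh:
            fh.write("example")
    found = [os.path.relpath(p, root) for p in find_text_files(root)]

print(found)
```

You could then pass each returned path to `load_doc` exactly as `load_docs` does.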

# Document Processing: Chunking Text for the Language Model

In this section of the Notebook, you process the documents by splitting them into chunks. This operation is crucial when
working with Large Language Models (LLMs), as these models have a maximum limit on the number of tokens (words or pieces
of words) they can process at once. This limit is often referred to as the model's "context window".

In this example, you split each document into segments that are at most `500` characters long. By default, LangChain's
`RecursiveCharacterTextSplitter` measures `chunk_size` in characters, not tokens. It first tries to split on two
consecutive newline characters (`\n\n`) and falls back to single newlines, then spaces, and finally individual
characters when a piece is still too long. Because `chunk_overlap` is `0`, the segments do not overlap.

In [None]:
def process_docs(docs: list, chunk_size: int, chunk_overlap: int) -> list:
    """Split the already loaded documents into chunks."""
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=chunk_overlap)
    texts = text_splitter.split_documents(docs)
    return texts

texts = process_docs(docs, chunk_size=500, chunk_overlap=0)
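
To build intuition for what a recursive splitter does, here is a simplified pure-Python sketch. It is not LangChain's implementation: `split_text` is a hypothetical function, it counts length in characters, and it ignores overlap. It tries the coarsest separator first and only falls back to finer ones for pieces that are still too long.

```python
def split_text(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Simplified recursive splitter; lengths are counted in characters."""
    if len(text) <= chunk_size:
        return [text] if text else []
    # Last resort (empty or exhausted separators): cut at fixed offsets.
    if not separators or separators[0] == "":
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            # Keep merging pieces while the chunk still fits.
            current = candidate
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # The piece alone is too long: retry with a finer separator.
            chunks.extend(split_text(piece, chunk_size, finer))
    if current:
        chunks.append(current)
    return chunks


sample = "First paragraph.\n\nSecond paragraph is a bit longer.\n\nThird."
chunks = split_text(sample, chunk_size=40)
print(chunks)
```

With a larger `chunk_size`, neighboring paragraphs are merged into a single chunk as long as the combined text still fits.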

# Generating Embeddings & Storing them in Chroma

In this section of the Notebook, you use HuggingFace's `all-MiniLM-L6-v2` model in conjunction with the Sentence
Transformers framework to generate embeddings for your document chunks. Sentence Transformers is a Python framework that
allows you to leverage the power of Transformer models to generate dense vector embeddings for sentences. These
embeddings can capture the semantic meaning of the input text, making them ideal for tasks like semantic search,
clustering, and information retrieval.

By leveraging this framework and the Chroma database interface provided by LangChain, you can embed your documents into
a latent space and subsequently store the results in a Vector Store.

In [None]:
embeddings_model = "all-MiniLM-L6-v2"
embeddings = HuggingFaceEmbeddings(model_name=embeddings_model)

In [None]:
# Persist the database in a "db" folder next to this Notebook.
persist_directory = os.path.join(os.getcwd(), "db")
settings = Settings(
    anonymized_telemetry=False,
    chroma_db_impl="duckdb+parquet",
    persist_directory=persist_directory,
)

db = Chroma.from_documents(texts, embeddings, persist_directory=persist_directory, client_settings=settings)
db.persist()

Finally, you can sanity-check the document retrieval mechanism by providing a simple query. By default, the
`similarity_search` method returns the four most similar document chunks (`k=4`).

In [None]:
query = "How can I create a cgroup?"
matches = db.similarity_search(query)
matches
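
The raw list of `Document` objects can be hard to scan. The following sketch prints one line per retrieved chunk, assuming each match exposes `page_content` and a `metadata` dictionary with a `source` key, as documents loaded with `TextLoader` do. `summarize_matches` is a hypothetical helper, and the stand-in objects below only mimic those two attributes so the example runs without a database.

```python
from types import SimpleNamespace


def summarize_matches(matches, width=60):
    """Return one 'source: preview' line per retrieved chunk."""
    lines = []
    for doc in matches:
        source = doc.metadata.get("source", "<unknown>")
        # Collapse whitespace so multi-line chunks fit on a single line.
        preview = " ".join(doc.page_content.split())[:width]
        lines.append(f"{source}: {preview}")
    return lines


# Stand-ins with made-up content (not real search results).
fake_matches = [
    SimpleNamespace(
        page_content="Example chunk about\ncgroups",
        metadata={"source": "documents/cgroups.txt"},
    ),
    SimpleNamespace(page_content="Unrelated text", metadata={}),
]

summary = summarize_matches(fake_matches)
for line in summary:
    print(line)
```

In the Notebook itself, you would call `summarize_matches(matches)` on the result of `db.similarity_search(query)`.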

# Conclusion and Next Steps

Congratulations! You have successfully traversed the journey of embedding documents into a high-dimensional latent space
and storing these embeddings in a Vector Store. By accomplishing this, you've transformed unstructured text data into a
structured form that can power a robust question-answering system.

However, your journey doesn't end here. Now that you have the Vector Store ready, the next step is to create an
Inference Service that can leverage this store to provide context to user queries. For this, you use KServe, a flexible,
cloud-native platform for serving Machine Learning models.

In the next Notebook, you set up a custom Inference Service using KServe. This service uses the Vector Store to retrieve
and rank relevant document chunks based on a user's query, providing accurate and efficient context to an LLM!