# Utilizing a Vector Database for our RAG application

This notebook demonstrates how to use a vector database to build a Retrieval-Augmented Generation (RAG) pipeline. Follow the steps below to understand the process of setting up the pipeline, indexing documents, and retrieving answers to queries using OpenAI's GPT models.



## Setup Environment

This section clones the repository containing the data used in this notebook and installs all the required packages.


**NOTE: Make sure to change the notebook runtime to T4 GPU**

In [None]:
!git clone https://github.com/CaSToRC-CyI/AI-Agents-Training.git

In [None]:
%cd ./AI-Agents-Training

In [None]:
%%bash

uv pip install haystack-ai
uv pip install milvus_haystack
uv pip install pymilvus
uv pip install python-docx

### Import packages

In [4]:
from pathlib import Path
import os
from getpass import getpass
from haystack import Pipeline
from haystack.components.converters import DOCXToDocument
from haystack.components.preprocessors import DocumentCleaner
from haystack.components.preprocessors import DocumentSplitter
from haystack.components.writers import DocumentWriter
from haystack.components.embedders import OpenAIDocumentEmbedder, OpenAITextEmbedder

from milvus_haystack import MilvusDocumentStore
from milvus_haystack.milvus_embedding_retriever import MilvusEmbeddingRetriever
from haystack.components.generators.chat import OpenAIChatGenerator
from haystack.components.builders import ChatPromptBuilder
from haystack.dataclasses import ChatMessage
import textwrap

### Setup OpenAI API key

In [None]:
os.environ["OPENAI_API_KEY"] = getpass("Enter OpenAI API key:")

### Convert data to Haystack Documents

Haystack uses an abstraction called *Documents*. A Document can hold text, tables, and binary data.

Documents have the following features:

- Unique ID for each document.
- Multiple content types are supported.
- Custom metadata and scoring for advanced document management.
- Optional embeddings for AI-based applications

**Example:**

```python
@dataclass
class Document(metaclass=_BackwardCompatible):
    id: str = field(default="")
    content: Optional[str] = field(default=None)
    blob: Optional[ByteStream] = field(default=None)
    meta: Dict[str, Any] = field(default_factory=dict)
    score: Optional[float] = field(default=None)
    embedding: Optional[List[float]] = field(default=None)
    sparse_embedding: Optional[SparseEmbedding] = field(default=None)
```
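
To make the "unique ID" and "custom metadata" features more concrete, here is a minimal sketch using a simplified stand-in class. `SimpleDocument` is a hypothetical name invented for this illustration, and hashing the content plus metadata is an assumption about how a deterministic ID could be derived; Haystack's own `Document` may compute its ID differently.

```python
import hashlib
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class SimpleDocument:
    """Simplified stand-in for a Haystack Document (illustration only)."""

    content: Optional[str] = None
    meta: Dict[str, Any] = field(default_factory=dict)
    score: Optional[float] = None
    embedding: Optional[List[float]] = None
    id: str = ""

    def __post_init__(self) -> None:
        # Assumption: derive a deterministic ID from content and metadata
        # when the caller does not provide one.
        if not self.id:
            payload = f"{self.content}{sorted(self.meta.items())}"
            self.id = hashlib.sha256(payload.encode("utf-8")).hexdigest()


doc_a = SimpleDocument(content="QuantumStream supports plugins.", meta={"file_path": "plugins.docx"})
doc_b = SimpleDocument(content="QuantumStream supports plugins.", meta={"file_path": "plugins.docx"})
doc_c = SimpleDocument(content="The Hackathon is annual.", meta={"file_path": "hackathon.docx"})

print(doc_a.id == doc_b.id)  # identical content and metadata give the same ID
print(doc_a.id == doc_c.id)  # different content gives a different ID
```

A content-derived ID like this makes it easy to spot duplicates when the same file is indexed twice.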

---

In [6]:
DOCUMENTS_DIR = Path("./dummy_data/documents_dir")
FILES = [file.resolve() for file in DOCUMENTS_DIR.rglob("*") if file.is_file()]

In [None]:
converter = DOCXToDocument()

result = converter.run(sources=FILES)
print(f'{result["documents"][0].meta["file_path"]}')
print(result["documents"][0].content)

### Initialize the Vector Database

Set up the Milvus vector database to store document embeddings. The `drop_old` parameter ensures any existing data is cleared.

In [10]:
connection_args={"uri": "./rag_vectordb.db"}
document_store = MilvusDocumentStore(
    connection_args=connection_args,
    drop_old=True,
)

## Indexing Documents and performing RAG

Create a pipeline to process, clean, split, embed, and store the documents in the vector database.

### Setup Indexing components

To use our documents, we need to perform five steps:

1. Turn them into compatible Haystack *Documents*.
2. Clean each Document using Haystack's `DocumentCleaner`. This removes extra whitespace, empty lines, specified substrings, regex matches, and so on.
3. Split the Documents into *smaller chunks*. The split method and the chunk length are configurable.
4. Turn the chunks into embeddings with an *embedder*.
5. Store them in a Haystack *Document Store* so they can be accessed later on.
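
The cleaning and splitting steps above can be sketched in plain Python. This is a toy illustration and not how `DocumentCleaner` or `DocumentSplitter` are implemented: the helper names `clean_text` and `split_into_chunks` are hypothetical, and the regex-based sentence splitting is a naive assumption that will mishandle abbreviations and similar edge cases.

```python
import re
from typing import List


def clean_text(text: str) -> str:
    """Collapse repeated whitespace and drop empty lines."""
    lines = [re.sub(r"\s+", " ", line).strip() for line in text.splitlines()]
    return " ".join(line for line in lines if line)


def split_into_chunks(text: str, split_length: int = 2) -> List[str]:
    """Group sentences into chunks of `split_length` sentences each."""
    # Naive sentence boundary: whitespace that follows '.', '!' or '?'.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text) if s]
    return [
        " ".join(sentences[i : i + split_length])
        for i in range(0, len(sentences), split_length)
    ]


raw = (
    "QuantumStream is a data platform.  \n\n It has a CLI.\n"
    "It has a REST API. It encrypts data.   It schedules tasks."
)
chunks = split_into_chunks(clean_text(raw), split_length=2)
for chunk in chunks:
    print(chunk)
    print("-" * 10)
```

With `split_length=2`, the five sentences in the toy input end up in three chunks, the last one holding the single leftover sentence.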

In [None]:
# Initialize the indexing pipeline
indexing_pipeline = Pipeline()

# Add each component to the pipeline
indexing_pipeline.add_component("converter", DOCXToDocument())
indexing_pipeline.add_component("cleaner", DocumentCleaner())
indexing_pipeline.add_component("splitter", DocumentSplitter(split_by="sentence", split_length=2))
indexing_pipeline.add_component("embedder", OpenAIDocumentEmbedder())
indexing_pipeline.add_component("writer", DocumentWriter(document_store))

# Connect each component
indexing_pipeline.connect("converter", "cleaner")
indexing_pipeline.connect("cleaner", "splitter")
indexing_pipeline.connect("splitter", "embedder")
indexing_pipeline.connect("embedder", "writer")

# Run the Pipeline
indexing_pipeline.run({"converter": {"sources": FILES}})

### Testing the retrieval using a Vector Database

The cell below shows what the retriever returns for our query.

We need two components: an embedder that turns the query into an embedding, and the retriever.
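
Conceptually, embedding-based retrieval ranks stored chunks by how close their vectors are to the query vector. The sketch below shows this with hand-written three-dimensional vectors and cosine similarity. The vectors, the `retrieve_top_k` helper, and the choice of cosine similarity are all assumptions made for illustration; the metric Milvus actually uses depends on how the collection's index is configured.

```python
import math
from typing import List, Tuple


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def retrieve_top_k(
    query_embedding: List[float],
    store: List[Tuple[str, List[float]]],
    top_k: int = 3,
) -> List[Tuple[str, float]]:
    """Score every stored chunk against the query and keep the best top_k."""
    scored = [(text, cosine_similarity(query_embedding, emb)) for text, emb in store]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_k]


# Toy "vector database": (chunk text, made-up embedding).
toy_store = [
    ("QuantumStream CLI usage", [0.9, 0.1, 0.0]),
    ("Annual Hackathon details", [0.0, 0.2, 0.9]),
    ("QuantumStream REST API", [0.8, 0.3, 0.1]),
]
toy_query_embedding = [1.0, 0.2, 0.0]  # stands in for the embedded question

top_matches = retrieve_top_k(toy_query_embedding, toy_store, top_k=2)
for text, score in top_matches:
    print(f"{score:.3f}  {text}")
```

A real vector database avoids scoring every stored vector by using an index, but the ranking idea is the same.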

In [None]:
question = "Tell me a bit about QuantumStream"  # You can replace it with your own question.

retrieval_pipeline = Pipeline()
retrieval_pipeline.add_component("embedder", OpenAITextEmbedder())
retrieval_pipeline.add_component("retriever", MilvusEmbeddingRetriever(document_store=document_store, top_k=3))
retrieval_pipeline.connect("embedder", "retriever")

retrieval_results = retrieval_pipeline.run({"embedder": {"text": question}})

for doc in retrieval_results["retriever"]["documents"]:
    print(doc.content)
    print("-" * 10)

### Initialize RAG

#### Prompt Template for user

This prompt template will be used by our LLM to generate a response based on our Query.

Specifically, the LLM will read this text from top to bottom:

- It will read the task, which is to respond to the user's query using the **provided context**.
- It will then read some **General Guidelines**.
- Then it will read the **provided context**.
- And finally it will read the **user's query**.

Both the **context** and the **user's query** are passed in through this template, which is why we use a prompt builder later on.

The `ChatPromptBuilder` component builds prompts from a list of chat messages whose templates are written in Jinja2 syntax. The templates contain placeholders like `{{ variable }}` that are filled with values provided at runtime. You can set a static template at initialization, or change the template and its variables dynamically while the pipeline runs.
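
To illustrate how placeholders are filled at runtime, here is a toy substitution function. It is not Jinja2 and not the `ChatPromptBuilder`: the hypothetical `render_placeholders` helper only handles simple `{{ name }}` placeholders and does not support loops such as `{% for %}`, so the retrieved chunks are joined into one string beforehand.

```python
import re
from typing import Any, Dict


def render_placeholders(template: str, variables: Dict[str, Any]) -> str:
    """Replace each {{ name }} placeholder with its value (toy imitation of Jinja2)."""
    return re.sub(
        r"\{\{\s*(\w+)\s*\}\}",
        lambda match: str(variables[match.group(1)]),
        template,
    )


toy_template = "Context:\n{{ context }}\n\nUser's Query: {{query}}\nAnswer:"
retrieved_chunks = ["QuantumStream has a CLI.", "QuantumStream has a REST API."]

prompt = render_placeholders(
    toy_template,
    {
        "context": "\n".join(retrieved_chunks),
        "query": "Tell me a bit about QuantumStream",
    },
)
print(prompt)
```

The same template can be reused for every question, since only the values passed in at runtime change.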

In [None]:
template = [
    ChatMessage.from_user(
        """
Respond to the User Query using the provided Context.

General Guidelines:
    - Ensure citations are concise and directly related to the information provided.
    - If the answer is not found in the context, state this clearly instead of making assumptions.
    - If the answer comes from several sources, make sure to cite every one of them, including their Source Filename, Source Chapter and Source Page.
    - If information is region-specific, clarify which region it pertains to.
    - Respond in the same language as the user’s query.  
    - Do not use emojis.
    - Be professional and punctual
    - *Avoid* writing a conclusion or a follow-up at the end of each response unless you were asked to.

Context:
{% for document in documents %}
    {{ document.content }}
{% endfor %}

User's Query: {{query}}
Answer:
"""
    )
]

#### RAG Components

For the actual RAG pipeline we have the following components:

- The `text_embedder`, which takes the user's query and turns it into an embedding.
- The `retriever`, which retrieves the relevant documents. The retriever is **different** now: we are using one that is compatible with our vector database.
- The `llm`, an `OpenAIChatGenerator`, which is our LLM.
- The `prompt_builder`, which was explained above.

In [None]:
# Initialize RAG pipeline
rag_pipeline = Pipeline()
rag_pipeline.add_component("text_embedder", OpenAITextEmbedder())
rag_pipeline.add_component("retriever", MilvusEmbeddingRetriever(document_store=document_store, top_k=3))
rag_pipeline.add_component("prompt_builder", ChatPromptBuilder(template=template))
rag_pipeline.add_component("llm", OpenAIChatGenerator(model="gpt-4o-mini"))

# Connect the input/output of each component
rag_pipeline.connect("text_embedder.embedding", "retriever.query_embedding")
rag_pipeline.connect("retriever.documents", "prompt_builder.documents")
rag_pipeline.connect("prompt_builder.prompt", "llm.messages")

#### Perform RAG on our data

Feel free to change the question to something else.

Our documents contain information about the following topics:

- Annual Hackathon the company is organizing
- Cybersecurity Awareness Month
- Employee Recognition Program
- New Office Layout Plan
- Office layout redesign plan
- Product X Launch Timeline
- Product Y Launch Timeline
- QuantumStream product CLI Usage
- QuantumStream product Data Encryption feature
- QuantumStream product Plugin System
- QuantumStream product REST API documentation
- QuantumStream product Scheduler feature
- QuantumStream product Scheduling tasks

---

Feel free to ask anything relating to these topics.

**Suggested prompts:**

- "What's the purpose of the new office layout? Are we losing our desks?"
- "I am a new employee at the company. Onboard me about the QuantumStream product."
- "I cannot find the relevant email about the Hackathon, can you tell me more details about it?"

In [None]:
question = "Tell me a bit about QuantumStream"  # You can replace it with your own question.
results = rag_pipeline.run({"text_embedder": {"text": question}, "prompt_builder": {"query": question}})
# Read the reply from `results`, the variable that holds the pipeline output
formatted_text = results["llm"]["replies"][0].text

wrapped_text = "\n".join(
    textwrap.fill(line, width=120, subsequent_indent="  ") if line.strip() else line
    for line in formatted_text.splitlines()
)

print(wrapped_text)