# Build RAG With LlamaIndex and watsonx.data Milvus

In this notebook, we will explore the process of building a Retrieval-Augmented Generation (RAG) system using LlamaIndex, Milvus, and watsonx.ai models. RAG is a powerful method that combines information retrieval with language generation, enabling systems to retrieve relevant documents and generate meaningful responses based on them.

We will use Milvus to store and manage high-dimensional vector embeddings. These embeddings represent the knowledge contained in documents and enable efficient similarity search. Then, we'll integrate LlamaIndex, a framework that connects large language models (LLMs) with external data sources.

Finally, we'll use pre-trained watsonx.ai models, one for embeddings and one for text generation, to produce answers grounded in the retrieved context. Together, these components form a RAG pipeline that can answer questions over large document collections.

## Understanding the RAG Architecture
Before diving into the implementation, let's understand the basic flow of a RAG system:

- Data Preparation: Raw data (documents, text, etc.) is collected and processed.
- Embedding Generation: The processed data is converted into vector embeddings using an embedding model.
- Vector Storage: These embeddings are stored in a vector database (in our case, Milvus).
- Query Processing: When a user asks a question, the query is also converted to an embedding.
- Similarity Search: The system searches for the stored vectors most similar to the query embedding.
- Context Generation: The retrieved relevant information is used as context.
- Response Generation: An LLM uses the retrieved context to generate a comprehensive answer.

This architecture helps ground the AI's responses in your data, reducing the likelihood of hallucinations or incorrect information.
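
To make the flow above concrete, here is a minimal, dependency-free sketch of the same steps. It is only an illustration under simplifying assumptions: the "embedding" is a plain word-count vector rather than a neural model, the "vector store" is a Python list, and the final LLM call is replaced by printing the prompt that would be sent. The function names (`embed`, `cosine`, `retrieve`, `build_prompt`) are hypothetical and are not part of LlamaIndex, Milvus, or watsonx.ai.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy 'embedding': lowercase word counts (a real system uses a neural model)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, store: list[tuple[str, Counter]], top_k: int = 2) -> list[str]:
    """Rank stored chunks by similarity to the query and keep the top_k."""
    q = embed(query)
    ranked = sorted(store, key=lambda item: cosine(q, item[1]), reverse=True)
    return [text for text, _ in ranked[:top_k]]


def build_prompt(query: str, context: list[str]) -> str:
    """Combine the retrieved chunks and the question into one prompt string."""
    joined = "\n".join(f"- {c}" for c in context)
    return f"Context:\n{joined}\n\nQuestion: {query}\nAnswer:"


# Data preparation, embedding generation, and vector storage
chunks = [
    "Milvus stores vector embeddings for similarity search",
    "LlamaIndex connects language models with external data",
    "Bananas are rich in potassium",
]
store = [(c, embed(c)) for c in chunks]

# Query processing, similarity search, and context generation
question = "Which component stores vector embeddings"
context = retrieve(question, store, top_k=2)

# Response generation would pass this prompt to an LLM
prompt = build_prompt(question, context)
print(prompt)
```

In the rest of this notebook, LlamaIndex, Milvus, and the watsonx.ai models take over each of these roles.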


## Step-by-Step Implementation
Let's walk through the process of building a RAG system using LlamaIndex, Milvus, and watsonx.ai models:

1. Create a Milvus instance on watsonx.data. For details, refer to "Getting Started with IBM watsonx.data Milvus".
2. Set up a Watson Machine Learning (WML) service instance and API key:
   1. Create a Watson Machine Learning service instance (you can choose the Lite plan, which is a free instance).
   2. Generate an API key in WML. Save this API key for use in this tutorial.
   3. Associate the WML service with the project you created in watsonx.ai.


## Installing Required Libraries
Our implementation requires several Python libraries:


In [1]:
%pip install -qU llama-index
%pip install -qU llama-index-llms-ibm
%pip install -qU llama-index-postprocessor-ibm
%pip install -qU llama-index-embeddings-ibm
%pip install -qU llama-index-vector-stores-milvus
%pip install -qU "pymilvus>=2.4.2"

Note: you may need to restart the kernel to use updated packages.

## Environment Configuration
Set up the environment variables with your watsonx.ai credentials:


In [2]:
import os
os.environ["WATSONX_URL"] = "<WATSONX_URL>"
os.environ["WATSONX_APIKEY"] = '<WATSONX_APIKEY>'
os.environ["WATSONX_PROJECT_ID"] = '<WATSONX_PROJECT_ID>'
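
Because the cell above ships with `<...>` placeholders, a small check can catch values that were never filled in before any API call fails with a less obvious error. The sketch below is only an illustration; the helper name `missing_settings` is hypothetical and is not part of any watsonx.ai or LlamaIndex API.

```python
import os

REQUIRED_VARS = ("WATSONX_URL", "WATSONX_APIKEY", "WATSONX_PROJECT_ID")


def missing_settings(env=os.environ, required=REQUIRED_VARS):
    """Return names that are unset, empty, or still a '<...>' placeholder."""
    missing = []
    for name in required:
        value = env.get(name, "").strip()
        if not value or (value.startswith("<") and value.endswith(">")):
            missing.append(name)
    return missing


problems = missing_settings()
if problems:
    print("Set these variables before continuing:", ", ".join(problems))
```

Passing a plain dictionary as `env` makes the helper easy to try out without touching the real environment.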

## Preparing Sample Data


In [3]:
!mkdir -p 'data/'
!wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/paul_graham/paul_graham_essay.txt' -O 'data/paul_graham_essay.txt'
!wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/10k/uber_2021.pdf' -O 'data/uber_2021.pdf'



## Generate our data
As a first example, let's generate a document from the file paul_graham_essay.txt, a single essay by Paul Graham titled "What I Worked On". To load the documents, we will use SimpleDirectoryReader.


In [4]:
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex
from llama_index.core import Settings

# Set chunk size for document splitting
Settings.chunk_size = 512

# Load documents from file
documents = SimpleDirectoryReader(
    input_files=["./data/paul_graham_essay.txt"]
).load_data()

print(f"Document ID: {documents[0].doc_id}")


Document ID: a22be147-250b-41ef-8b28-ccf89bf8f86c
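
`Settings.chunk_size = 512` caps the size of the pieces the document is split into before embedding. To build intuition for what a chunk size and an overlap do, here is a simplified word-based splitter. This is not LlamaIndex's actual splitter, which works on tokens and sentence boundaries; the function name `split_words` and its parameters are hypothetical.

```python
def split_words(text: str, chunk_size: int, overlap: int = 0) -> list[str]:
    """Split text into chunks of at most chunk_size words, sharing `overlap` words."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks


sample = " ".join(f"w{i}" for i in range(10))
for chunk in split_words(sample, chunk_size=4, overlap=1):
    print(chunk)
```

Smaller chunks give more precise retrieval hits but less context per hit; larger chunks do the opposite.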


## IBM watsonx.ai Configuration
Next, we'll configure our connection to IBM watsonx.ai:


In [5]:
from ibm_watsonx_ai import APIClient

# Set up watsonx.ai API credentials
my_credentials = {
    "url": "<watsonx_url>",  # Replace with your service instance URL (watsonx URL)
    "apikey": "<watsonx_api_key>",  # Replace with your watsonx API key
}


# Initialize the watsonx.ai API client
client = APIClient(my_credentials)

## Initializing the Embedding Model
We'll use IBM's slate-30m-english-rtrvr model for generating embeddings:


In [6]:
from llama_index.embeddings.ibm import WatsonxEmbeddings

# Truncating inputs to fit embedding model's context window
truncate_input_tokens = 512

# Initialize watsonx embedding model
watsonx_embedding = WatsonxEmbeddings(
    model_id="ibm/slate-30m-english-rtrvr",  # Or any preferred embedding model
    credentials=my_credentials,
    project_id="<project_id>",
    truncate_input_tokens=truncate_input_tokens,
)
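
The Milvus collection later in this notebook is created with `dim=384`, which has to match the length of the vectors the embedding model returns. A quick check like the one below can catch a mismatch before indexing. It is a sketch under assumptions: `embedding_dim`, `assert_dim`, and `fake_embed` are hypothetical names, and `fake_embed` is a stand-in so the example runs without credentials.

```python
from typing import Callable, Sequence


def embedding_dim(embed_fn: Callable[[str], Sequence[float]], probe: str = "dimension check") -> int:
    """Embed a short probe string and return the length of the resulting vector."""
    return len(embed_fn(probe))


def assert_dim(embed_fn: Callable[[str], Sequence[float]], expected: int) -> None:
    """Raise ValueError if the embedder's output length differs from `expected`."""
    actual = embedding_dim(embed_fn)
    if actual != expected:
        raise ValueError(f"embedding dim is {actual}, but the collection expects {expected}")


# Stand-in embedder (hypothetical) that returns a fixed-length vector of zeros.
def fake_embed(text: str) -> list[float]:
    return [0.0] * 384


assert_dim(fake_embed, expected=384)
print(embedding_dim(fake_embed))
```

In the notebook itself, you could pass the embedding model's `get_text_embedding` method (for example `watsonx_embedding.get_text_embedding`) instead of `fake_embed`; this issues one real embedding request.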

## Initializing the Language Model
For text generation, we'll use Meta's Llama 3.3 70B Instruct model:


In [7]:
from llama_index.llms.ibm import WatsonxLLM

# Maximum tokens to generate in response
max_new_tokens = 256

# Initialize watsonx LLM
watsonx_llm = WatsonxLLM(
    model_id="meta-llama/llama-3-3-70b-instruct",  # Or any preferred foundation model
    credentials=my_credentials,
    project_id="<project_id>",
    max_new_tokens=max_new_tokens,
)

## Setting Up Milvus Vector Store
Now we configure LlamaIndex to use our Milvus instance:


In [8]:
from llama_index.vector_stores.milvus import MilvusVectorStore
from llama_index.core import StorageContext

In [9]:
vector_store = MilvusVectorStore(
    uri="https://<hostname>:<port>",
    token="<user>:<password>",
    server_pem_path="/root/path to ca.cert",
    dim=384,  # Must match the embedding model's output dimension
    overwrite=True,  # See the "Managing Vector Collections" section
    collection_name="watsonx_llamaindex",
)



2025-05-06 22:31:11,078 [DEBUG][_create_connection]: Created new connection using: bbdd600defec4476b2644f08d49d3340 (async_milvus_client.py:599)


In [10]:
storage_context = StorageContext.from_defaults(vector_store=vector_store)
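
The same Milvus connection settings are repeated several times in this notebook, with only `overwrite` changing. One way to avoid retyping them is to build the keyword arguments in one place. The helper below is a sketch, not part of LlamaIndex or pymilvus: the function name `milvus_kwargs` and the environment variable names (`MILVUS_HOST`, `MILVUS_PORT`, `MILVUS_USER`, `MILVUS_PASSWORD`, `MILVUS_CA_CERT`) are assumptions made for illustration. The returned keys mirror the arguments used in the cell above.

```python
import os


def milvus_kwargs(collection_name: str, overwrite: bool, dim: int = 384, env=os.environ) -> dict:
    """Build keyword arguments matching the MilvusVectorStore(...) calls in this notebook."""
    # Fall back to the notebook's placeholders when a variable is not set.
    host = env.get("MILVUS_HOST", "<hostname>")
    port = env.get("MILVUS_PORT", "<port>")
    user = env.get("MILVUS_USER", "<user>")
    password = env.get("MILVUS_PASSWORD", "<password>")
    return {
        "uri": f"https://{host}:{port}",
        "token": f"{user}:{password}",
        "server_pem_path": env.get("MILVUS_CA_CERT", "<path to ca.cert>"),
        "dim": dim,
        "overwrite": overwrite,
        "collection_name": collection_name,
    }


kwargs = milvus_kwargs("watsonx_llamaindex", overwrite=True)
print(kwargs["uri"], kwargs["collection_name"], kwargs["overwrite"])
```

With real values in place, the store could then be created as `MilvusVectorStore(**milvus_kwargs("watsonx_llamaindex", overwrite=True))`.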

## Creating the Index
With all components in place, we create the vector index:


In [11]:
# Create index with watsonx embeddings and Milvus vector store
index = VectorStoreIndex.from_documents(
    documents=documents, 
    embed_model=watsonx_embedding,
    storage_context=storage_context
)

## Building a Query Engine
The query engine retrieves the most relevant document chunks and generates a coherent response using the LLM.


In [12]:
# Create a query engine 
query_engine = index.as_query_engine(
    llm=watsonx_llm,
    similarity_top_k=3,  # Retrieve top 3 most similar nodes
)

# Run a query against the index
response_simple = query_engine.query(
    "What did Sam Altman do in this essay?",
)

# Print the response with sources
from llama_index.core.response.pprint_utils import pprint_response
print("\n\nResponse:")
pprint_response(response_simple, show_source=True)



Response:
Final Response: In this essay, Sam Altman was asked to be the
president of Y Combinator (YC) and initially said no because he wanted
to start a startup to make nuclear reactors. However, he eventually
agreed to take over as president, starting with the winter 2014 batch,
and was given the freedom to reorganize YC. He learned the job and
took over running YC, allowing the original founders, including Paul
Graham, to retire or become ordinary partners.
______________________________________________________________________
Source Node 1/3
Node ID: 8728b685-c0b1-4cf4-9c9b-038ef0a60b8f
Similarity: 0.6335484385490417
Text: [17]  As well as HN, I wrote all of YC's internal software in
Arc. But while I continued to work a good deal in Arc, I gradually
stopped working on Arc, partly because I didn't have time to, and
partly because it was a lot less attractive to mess around with the
language now that we had all this infrastructure depending on it. So
now my three pr...
____________

**This is a Retrieval-Augmented Generation (RAG) use case. The query engine retrieved the top 3 most relevant text chunks and used them to answer the question. It summarized that Sam Altman initially declined the YC president role to build nuclear reactors but eventually accepted and restructured the organization. The sources show the retrieved content with similarity scores, although only one directly relates to the answer.**
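
Since not every retrieved chunk is equally relevant, one common refinement is to drop low-scoring hits before they reach the LLM. The sketch below shows the idea in plain Python; the `Hit` type, the `apply_cutoff` function, and the scores are hypothetical placeholders, not values from the run above.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    """A retrieved node reduced to its ID and similarity score."""
    node_id: str
    score: float


def apply_cutoff(hits: list[Hit], cutoff: float) -> list[Hit]:
    """Keep only hits whose similarity score is at least `cutoff`."""
    return [h for h in hits if h.score >= cutoff]


# Illustrative scores only (assumed for this example).
hits = [Hit("node-a", 0.63), Hit("node-b", 0.41), Hit("node-c", 0.38)]
kept = apply_cutoff(hits, cutoff=0.5)
print([h.node_id for h in kept])
```

LlamaIndex provides a `SimilarityPostprocessor` node postprocessor with a `similarity_cutoff` option that plays a similar role; check its documentation for the exact usage. A sensible cutoff depends on the embedding model and similarity metric, so it is worth tuning against your own data.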

Now, let's check out a few more things.

## Managing Vector Collections
LlamaIndex and Milvus let you either overwrite an existing collection or append to it:

### Overwriting Existing Data

In [13]:
from llama_index.core import Document

# overwrite=True drops the existing collection, removing the previous data

vector_store = MilvusVectorStore(
    uri="https://<hostname>:<port>",
    token="<user>:<password>",
    server_pem_path="/root/path to ca.cert",
    dim=384,
    overwrite=True,
    collection_name="watsonx_llamaindex",)

storage_context = StorageContext.from_defaults(vector_store=vector_store)
# Create a new document
new_doc = Document(text="The number that is being searched for is ten.")

# Create index with the new document and watsonx embedding model
index = VectorStoreIndex.from_documents(
    [new_doc],
    embed_model=watsonx_embedding,  # Use the watsonx embedding model we defined earlier
    storage_context=storage_context,
)

# The query engine created earlier reads the same Milvus collection,
# so it now sees only the new document
res = query_engine.query("Who is the author?")
print(f"Response: {res}")


2025-05-06 22:31:20,795 [DEBUG][_create_connection]: Created new connection using: 98caa97df1c04ed1af987d9f51ee603a (async_milvus_client.py:599)


Response:  There is no information about an author in the context. The context only mentions a number being searched for, which is ten. Therefore, it is not possible to determine the author based on the provided context.


### Appending to Existing Data

In [14]:
from llama_index.core import Document
# overwrite=False keeps the existing collection and adds the new data to it

vector_store = MilvusVectorStore(
    uri="https://<hostname>:<port>",
    token="<user>:<password>",
    server_pem_path="/root/path to ca.cert",
    dim=384,
    overwrite=False,
    collection_name="watsonx_llamaindex",)

storage_context = StorageContext.from_defaults(vector_store=vector_store)
index = VectorStoreIndex.from_documents(
    documents=documents, 
    embed_model=watsonx_embedding,
    storage_context=storage_context
)

# The collection now holds both the new document and the essay
res = query_engine.query("What is the number?")
print(f"Response: {res}")

2025-05-06 22:31:25,945 [DEBUG][_create_connection]: Created new connection using: 19783b8c21ed445ca896450d2aa5db93 (async_milvus_client.py:599)


Response: 10.


In [15]:
res = query_engine.query("Who is the author?")
print(res)

 Paul Graham.
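
The behavior seen above (the earlier query engine answering from whatever the collection currently holds) follows from the collection being addressed by name. The toy model below imitates that with an in-memory dictionary. It is only an analogy under stated assumptions: `ToyStore` and `_COLLECTIONS` are hypothetical and do not reflect how Milvus or `MilvusVectorStore` are implemented.

```python
_COLLECTIONS: dict[str, list[str]] = {}  # collection name -> stored texts


class ToyStore:
    """In-memory stand-in for a handle to a named vector collection."""

    def __init__(self, collection_name: str, overwrite: bool = False):
        self.collection_name = collection_name
        if overwrite or collection_name not in _COLLECTIONS:
            _COLLECTIONS[collection_name] = []  # drop any previous contents

    def add(self, texts: list[str]) -> None:
        _COLLECTIONS[self.collection_name].extend(texts)

    def rows(self) -> list[str]:
        # Looked up by name on every call, so older handles see later changes.
        return list(_COLLECTIONS[self.collection_name])


first = ToyStore("demo", overwrite=True)
first.add(["essay chunk 1", "essay chunk 2"])

second = ToyStore("demo", overwrite=True)  # replaces the essay chunks
second.add(["The number is ten."])
after_overwrite = first.rows()

third = ToyStore("demo", overwrite=False)  # keeps what is already stored
third.add(["essay chunk 1", "essay chunk 2"])
after_append = first.rows()

print(after_overwrite)
print(after_append)
```

The `first` handle plays the role of the original query engine: after the overwrite it sees only the new text, and after the append it sees both.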


## Metadata filtering
We can restrict retrieval to specific sources by filtering on metadata. The following example loads all documents from the directory and then filters them by file name.


In [16]:
from llama_index.core.vector_stores import ExactMatchFilter, MetadataFilters

# Load both files downloaded earlier (the essay and the Uber 10-K PDF)
documents_all = SimpleDirectoryReader("./data/").load_data()

vector_store = MilvusVectorStore(
    uri="https://<hostname>:<port>",
    token="<user>:<password>",
    server_pem_path="/root/path to ca.cert",
    dim=384,
    overwrite=True,
    collection_name="watsonx_llamaindex",)

storage_context = StorageContext.from_defaults(vector_store=vector_store)

index = VectorStoreIndex.from_documents(
    documents=documents_all,
    embed_model=watsonx_embedding,
    storage_context=storage_context,
)

2025-05-06 22:31:50,768 [DEBUG][_create_connection]: Created new connection using: b0163873bcce48ae9c9558e6a0fee9a5 (async_milvus_client.py:599)


### We want to only retrieve documents from the file uber_2021.pdf.

In [17]:
filters = MetadataFilters(
    filters=[ExactMatchFilter(key="file_name", value="uber_2021.pdf")]
)
query_engine = index.as_query_engine(
    llm=watsonx_llm,
    similarity_top_k=3,  # Retrieve top 3 most similar nodes
    filters=filters,
)
res = query_engine.query("What difficulties did the author face due to the disease?")

print(res)

 The author faced difficulties such as reduced demand for Mobility offerings, accelerated growth of Delivery offerings, and challenges in managing driver availability and consumer demand due to the COVID-19 pandemic. The author also had to implement measures such as suspending shared rides, implementing "leave at door" delivery options, and asking employees to work remotely to comply with social distancing guidelines. Additionally, the author had to increase investments in driver incentives to improve driver availability. The author was also unable to accurately predict the full impact of COVID-19 on their business due to numerous uncertainties.


### We get a different result this time when retrieving from the file paul_graham_essay.txt.

In [18]:
filters = MetadataFilters(
    filters=[ExactMatchFilter(key="file_name", value="paul_graham_essay.txt")]
)
query_engine = index.as_query_engine(
    llm=watsonx_llm,
    similarity_top_k=3,  # Retrieve top 3 most similar nodes
    filters=filters,
)
res = query_engine.query("What difficulties did the author face due to the disease?")

print(res)

 None. The text does not mention the author facing any difficulties due to a disease. It mentions the author facing stress due to Hacker News (HN), but this is not related to a disease. It also mentions a blister from an ill-fitting shoe as a metaphor for the stress caused by HN, but this is not a real disease or health issue.
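
Conceptually, an exact-match metadata filter narrows the candidate set to nodes whose metadata matches, and similarity ranking then applies only within that set. The sketch below imitates this in plain Python. The node dictionaries, their scores, and the helper names `exact_match` and `top_k` are hypothetical and chosen only for illustration.

```python
def exact_match(nodes: list[dict], key: str, value: str) -> list[dict]:
    """Keep nodes whose metadata[key] equals value."""
    return [n for n in nodes if n["metadata"].get(key) == value]


def top_k(nodes: list[dict], k: int) -> list[dict]:
    """Return the k nodes with the highest similarity score."""
    return sorted(nodes, key=lambda n: n["score"], reverse=True)[:k]


# Hypothetical nodes; the scores are made up for illustration.
nodes = [
    {"text": "essay chunk about Hacker News", "score": 0.52,
     "metadata": {"file_name": "paul_graham_essay.txt"}},
    {"text": "10-K chunk about COVID-19", "score": 0.71,
     "metadata": {"file_name": "uber_2021.pdf"}},
    {"text": "10-K chunk about driver incentives", "score": 0.64,
     "metadata": {"file_name": "uber_2021.pdf"}},
]

uber_hits = top_k(exact_match(nodes, "file_name", "uber_2021.pdf"), k=3)
essay_hits = top_k(exact_match(nodes, "file_name", "paul_graham_essay.txt"), k=3)
print([n["text"] for n in uber_hits])
print([n["text"] for n in essay_hits])
```

This is why the same question yields different answers in the two cells above: the LLM only ever sees chunks from the file the filter allows.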


## Conclusion
Building a RAG system with LlamaIndex, Milvus, and watsonx.ai models provides a clean way to create AI applications that answer from your own data. This architecture separates concerns effectively:

- LlamaIndex handles document processing and query orchestration
- Milvus efficiently stores and retrieves vector embeddings
- watsonx.ai provides the models for embedding generation and text generation

This separation makes the system modular and maintainable, allowing you to swap components as needed or scale individual parts of the system.

By following the steps outlined in this notebook, you can create a RAG system that provides contextually relevant responses grounded in your own data. Whether you're building a customer support chatbot, a document analysis tool, or a research assistant, the LlamaIndex-Milvus-watsonx.ai stack offers a solid foundation for your AI application.

As LLM technology continues to evolve, the RAG architecture will remain relevant because it addresses one of the fundamental challenges of AI systems: connecting models to real-world, up-to-date information. By mastering RAG, you're preparing for the future of AI application development.
