# Build RAG with Llama Stack and watsonx.data Milvus

Llama Stack is a set of open-source tools for building AI applications, especially large language model (LLM) apps such as chatbots, document search, and question-answering systems.

Llama Stack can be deployed as a library, a standalone server, or a custom-built distribution. You can mix and match components from different providers, so the setup varies with your goals.

In this tutorial, we show you how to set up a Llama Stack server with Milvus as the vector store and watsonx.ai models for inference. With this setup you can upload your own data and use it as a knowledge base. We then run some example questions, producing a complete Retrieval-Augmented Generation (RAG) app that answers from your data.


## Setting Up the Environment

### 1. Create Milvus Instance on watsonx.data

To create the instance, follow the [Getting Started guide with IBM watsonx.data Milvus](https://community.ibm.com/community/user/blogs/swati-karot/2025/02/06/getting-started-with-watsonxdata-milvus).

### 2. Set up a Watson Machine Learning service instance and API key

1. Create a [Watson Machine Learning](https://cloud.ibm.com/catalog/services/watson-machine-learning?utm_source=ibm_developer&utm_content=in_content_link&utm_id=tutorials_awb-create-langchain-rag-system-python-watsonx&cm_sp=ibmdev-_-developer-tutorials-_-trial) service instance (Lite plan is available).
2. Generate and save an API Key for use in this tutorial.
3. Associate the Watson Machine Learning (WML) service with your project in watsonx.ai.

### 3. Prepare the Llama Stack Server

#### Clone the Llama Stack Repo
```bash
git clone https://github.com/meta-llama/llama-stack.git
cd llama-stack
```

#### Set Up a Conda Environment
```bash
conda create -n stack python=3.10 -y
conda activate stack
pip install -e .
```

#### Set Environment Variables
Llama Stack reads environment variables to authenticate with and configure external services. This tutorial uses watsonx for inference, so set the following variables to your watsonx API key and project ID:
```bash
export WATSONX_API_KEY="<WATSONX_API_KEY>"
export WATSONX_PROJECT_ID="<WATSONX_PROJECT_ID>"
```
Make sure you replace `<WATSONX_API_KEY>` and `<WATSONX_PROJECT_ID>` with your actual API key and project ID.
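Before building the distribution, it can help to confirm that both variables are visible to the shell you will launch the server from. The following is a minimal sketch, not part of Llama Stack: the helper name `missing_env_vars` is hypothetical, and treating a value wrapped in `<...>` as "still a placeholder" is an assumption based on the export commands shown here.

```python
import os

# Variable names taken from the export commands in this section.
REQUIRED_VARS = ("WATSONX_API_KEY", "WATSONX_PROJECT_ID")


def missing_env_vars(env=None, required=REQUIRED_VARS):
    """Return names of required variables that are unset, empty, or placeholders."""
    env = os.environ if env is None else env
    missing = []
    for name in required:
        value = env.get(name, "").strip()
        # Assumption: a value like "<WATSONX_API_KEY>" was never replaced.
        if not value or (value.startswith("<") and value.endswith(">")):
            missing.append(name)
    return missing


if __name__ == "__main__":
    missing = missing_env_vars()
    if missing:
        print("Missing or placeholder values:", ", ".join(missing))
    else:
        print("watsonx credentials are set.")
```

Run it in the same terminal session where you exported the variables, since `export` only affects the current shell and its child processes.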


#### Configure Milvus as Your Vector Store

Edit the file: `llama_stack/templates/watsonx/run.yaml`

Replace the `vector_io` section with the following, substituting the URI, credentials, and certificate path of your own Milvus instance:

```yaml
vector_io:
- provider_id: milvus
  provider_type: remote::milvus
  config:
    uri: http://localhost:19530
    token: <user>:<Password>
    secure: True
    server_pem_path: "path/to/server.pem"
```
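A common mistake at this step is leaving a placeholder such as `<user>:<Password>` or `path/to/server.pem` in the file. The sketch below scans the YAML text for those patterns. It is a heuristic under stated assumptions: the function name `find_placeholders` and the two regular expressions are illustrative only, and the check does not validate the configuration against Milvus in any way.

```python
import re
from pathlib import Path

# Patterns matching the placeholder styles used in the vector_io snippet.
PLACEHOLDER_PATTERNS = (
    re.compile(r"<[A-Za-z_]+>"),  # e.g. <user> or <Password>
    re.compile(r"path/to/"),      # e.g. path/to/server.pem
)


def find_placeholders(yaml_text):
    """Return (line_number, stripped_line) pairs that still contain a placeholder."""
    hits = []
    for number, line in enumerate(yaml_text.splitlines(), start=1):
        if any(pattern.search(line) for pattern in PLACEHOLDER_PATTERNS):
            hits.append((number, line.strip()))
    return hits


if __name__ == "__main__":
    config_path = Path("llama_stack/templates/watsonx/run.yaml")
    if config_path.exists():
        for number, line in find_placeholders(config_path.read_text(encoding="utf-8")):
            print(f"line {number}: {line}")
    else:
        print(f"{config_path} not found; run this from the repository root.")
```

An empty result only means no placeholder text was found; the endpoint and credentials still need to be correct for your instance.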


## Building a Custom Distribution Using a Template

### 1. Build the Distribution
```bash
llama stack build --template watsonx --image-type conda
```

### 2. Launch the Llama Stack Server
```bash
llama stack run --image-type conda ~/.llama/distributions/watsonx/watsonx-run.yaml
```

If everything goes well, you should see the Llama Stack server running on port 8321.
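If you want to confirm from another terminal that something is listening on that port, a plain TCP check is enough. This is a minimal sketch using only the standard library; `is_port_open` is a hypothetical helper, and a successful connection shows only that a process accepts connections on the port, not that it is a healthy Llama Stack server.

```python
import socket


def is_port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


if __name__ == "__main__":
    if is_port_open("localhost", 8321):
        print("Something is listening on port 8321.")
    else:
        print("Nothing is listening on port 8321 yet.")
```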


## Running RAG from the Client
Once the Llama Stack server is up and running, the next step is to interact with it using client code. The script below demonstrates how to perform Retrieval-Augmented Generation (RAG) using your own documents.

**Note**: Run this script inside the Llama Stack environment, that is, the Docker container or the Conda environment created by Llama Stack. This ensures access to the required dependencies, file paths, and the running Llama Stack service.


```python
# Import required modules
import uuid
from llama_stack_client.types import Document
from llama_stack_client.lib.agents.agent import Agent
```


```python
# Define the inference model and the port where Llama Stack is running
INFERENCE_MODEL = "meta-llama/llama-3-3-70b-instruct"
LLAMA_STACK_PORT = 8321
```


```python
# Function to create a client for connecting to the local Llama Stack server
def create_http_client():
    from llama_stack_client import LlamaStackClient
    return LlamaStackClient(base_url=f"http://localhost:{LLAMA_STACK_PORT}")
```


```python
# Create a client instance
client = create_http_client()

# List of file paths containing content to be inserted into Milvus
doc_paths = [
    "/root/VP/milvus_intro.txt",
    "/root/VP/collection.txt",
    "/root/VP/schema.txt"
]
```

```python
# Read and convert the documents into Document objects with metadata
documents = []
for i, path in enumerate(doc_paths):
    with open(path, 'r', encoding='utf-8') as f:
        content = f.read()
        documents.append(Document(
            document_id=f"milvus-doc-{i}",
            content=content,
            mime_type="text/plain",
            metadata={"source": path}
        ))
```


```python
# Create a unique ID for the vector database using UUID
vector_db_id = f"milvus-vector-db-{uuid.uuid4().hex}"

# Register a new vector database in Llama Stack, using Milvus as the backend
client.vector_dbs.register(
    vector_db_id=vector_db_id,
    embedding_model="all-MiniLM-L6-v2",  # Model used to generate embeddings
    embedding_dimension=384,
    provider_id="milvus"
)
```

Output:

```text
VectorDBRegisterResponse(embedding_dimension=384, embedding_model='all-MiniLM-L6-v2', identifier='milvus-vector-db-1a92a8d20ee2467494a40b71507a9aa9', provider_id='milvus', provider_resource_id='milvus-vector-db-1a92a8d20ee2467494a40b71507a9aa9', type='vector_db', access_attributes=None)
```

```python
print("Inserting Milvus docs into vector DB...")

# Insert the documents into the vector DB using Llama Stack's built-in RAG tool
client.tool_runtime.rag_tool.insert(
    documents=documents,
    vector_db_id=vector_db_id,
    chunk_size_in_tokens=1024
)
```

Output:

```text
Inserting Milvus docs into vector DB...
```
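To build intuition for what a chunk size means, the sketch below splits text into fixed-size windows. It is not how Llama Stack chunks documents internally: whitespace-separated words stand in for model tokens, and both the `chunk_words` function and its `overlap` parameter are hypothetical, introduced only for illustration.

```python
def chunk_words(text, chunk_size, overlap=0):
    """Split text into chunks of at most chunk_size whitespace-separated words.

    Consecutive chunks share `overlap` words. Words approximate tokens here.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        # Stop once the current window has reached the end of the text.
        if start + chunk_size >= len(words):
            break
    return chunks


if __name__ == "__main__":
    sample = "word " * 2500
    print(len(chunk_words(sample, 1024)), "chunks")
```

Smaller chunks give more precise retrieval hits but less context per hit; larger chunks do the opposite, which is the trade-off behind the `chunk_size_in_tokens` setting.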


```python
# Create a RAG agent using the selected LLM and the registered Milvus vector store
rag_agent = Agent(
    client=client,
    model=INFERENCE_MODEL,  # LLM to use for generating responses
    instructions="You are a Milvus expert assistant.",  # System prompt to guide behavior
    enable_session_persistence=False,  # Don't persist chat history
    tools=[{
        "name": "builtin::rag",  # Use the built-in RAG tool
        "args": {"vector_db_ids": [vector_db_id]}  # Connect RAG to the created vector DB
    }],
    sampling_params={
        "max_tokens": 2048,  # Max tokens for response generation
    },
)
```


```python
# Start a new chat session with the agent
session_id = rag_agent.create_session(session_name="milvus-session")
```


```python
# Provide a user question for the agent to answer
user_prompt = "What is Milvus ? Give it in bullet points"
```



```python
# Generate a response to the user prompt within the session
response = rag_agent.create_turn(
    messages=[{"role": "user", "content": user_prompt}],
    session_id=session_id,
    stream=False  # Set to True to stream response if supported
)

# Print the AI assistant's response
print("Response from Milvus Bot:")
print(response.output_message.content)
```

Output:

```text
Response from Milvus Bot:
* Milvus is an open-source vector database built to power embedding similarity search and AI applications.
* It is designed to manage large-scale vector data, such as embeddings generated by machine learning models.
* Milvus provides a scalable and efficient way to store, index, and search vector data, enabling fast and accurate similarity searches.
* It supports a wide range of indexing algorithms and distance metrics, allowing users to choose the best approach for their specific use case.
* Milvus is often used in applications such as image and video search, natural language processing, recommendation systems, and more.
* It provides a simple and intuitive API, making it easy to integrate with existing machine learning workflows and applications.
* Milvus is highly scalable and can handle large volumes of data, making it suitable for large-scale AI applications.
* It also provides features such as data partitioning, replication, and backup, ensuring high ava
```

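## Understanding the Code

1. **Client Setup**: Connects to the Llama Stack server at `http://localhost:8321`.
2. **Document Preparation**: Reads each text file and wraps its content in a `Document` object with an ID, MIME type, and source metadata.
3. **Vector DB Registration**: Registers a new vector database with the `milvus` provider, using `all-MiniLM-L6-v2` embeddings (384 dimensions).
4. **Document Ingestion**: Inserts the documents into the vector database through the built-in RAG tool, in chunks of up to 1024 tokens.
5. **RAG Agent Creation**: Creates an agent that combines the inference model with the `builtin::rag` tool bound to the new vector database.
6. **Query Execution**: Sends a user question within a session and prints the agent's response.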
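The retrieval that happens between ingestion and response generation can be pictured with a toy example. The sketch below ranks stored vectors by cosine similarity to a query vector. It is an illustration only: the 3-dimensional vectors are hand-made stand-ins for real 384-dimensional embeddings, `top_k` is a hypothetical helper, and cosine similarity is assumed as the metric (the metric actually used depends on how the Milvus collection is configured).

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, indexed, k=2):
    """Return the k (doc_id, score) pairs most similar to query_vec."""
    scored = [(doc_id, cosine_similarity(query_vec, vec)) for doc_id, vec in indexed.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Hand-made toy vectors; real embeddings come from the embedding model.
toy_index = {
    "milvus-doc-0": [0.9, 0.1, 0.0],
    "milvus-doc-1": [0.1, 0.9, 0.1],
    "milvus-doc-2": [0.0, 0.2, 0.9],
}

if __name__ == "__main__":
    for doc_id, score in top_k([1.0, 0.2, 0.0], toy_index, k=2):
        print(f"{doc_id}: {score:.3f}")
```

In the real pipeline, the highest-scoring chunks are passed to the LLM as context for the answer, which is why the quality of the ingested documents directly shapes the response.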

## Conclusion

Combining Llama Stack with watsonx.data Milvus gives you a complete RAG pipeline for building context-aware applications. The pipeline covers:

- Storage and indexing of domain-specific knowledge
- Retrieval of relevant information via semantic similarity
- Generation of contextual responses with LLMs
