# Building RAG with LangChain and watsonx.data Milvus

In this notebook, we build a Retrieval-Augmented Generation (RAG) system using LangChain, Milvus, and watsonx.ai models. RAG combines information retrieval with language generation: the system first retrieves relevant documents and then generates a response grounded in them.

We use Milvus to store and manage high-dimensional vector embeddings. These embeddings represent the content of the documents and enable efficient similarity search. We then integrate LangChain, a framework for building applications with language models, to connect retrieval and response generation over the stored data.

Finally, we use a pre-trained watsonx.ai foundation model to generate answers grounded in the retrieved context. Together, these components form a RAG pipeline that can answer questions over a collection of documents.

## What Are We Trying to Achieve?
Our goal is to create a retrieval-augmented generation (RAG) chain that:

- Retrieves relevant content from a set of documents.

- Embeds and indexes that content into a vector store using Milvus.

- Uses a foundation model from watsonx.ai to answer questions based on retrieved context.



## Prerequisites
Before proceeding, ensure that your environment is set up with the necessary libraries. These libraries are used for loading, processing, and working with documents and embeddings. Follow the steps below to set up your environment:

1. Install the necessary libraries.

2. Ensure you have:
-	IBM watsonx.ai service instance (URL, API key, project ID)
-	IBM watsonx.data Milvus instance (Connection params)
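One way to keep the connection details above out of the notebook itself is to read them from environment variables. The sketch below is a minimal illustration, not part of the original notebook: the variable names (`WATSONX_URL`, `MILVUS_URI`, and so on) and the helper `load_settings` are hypothetical and should be renamed to match your setup.

```python
import os

# Hypothetical variable names; adjust them to your own environment.
REQUIRED_SETTINGS = [
    "WATSONX_URL",
    "WATSONX_APIKEY",
    "WATSONX_PROJECT_ID",
    "MILVUS_URI",
    "MILVUS_USER",
    "MILVUS_PASSWORD",
]


def load_settings(env=os.environ):
    """Return the required settings as a dict, or raise if any are missing."""
    missing = [name for name in REQUIRED_SETTINGS if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {name: env[name] for name in REQUIRED_SETTINGS}
```

Calling `load_settings()` at the top of the notebook would fail early with a list of missing values, and the returned dictionary could then replace the `<...>` placeholders used in the later cells.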



In [1]:
pip install --upgrade --quiet langchain langchain-community langchain-milvus langchain-ibm ibm-watsonx-ai "ibm-watson-machine-learning>=1.0.327" unstructured beautifulsoup4

Note: you may need to restart the kernel to use updated packages.


## Authentication Setup

In [2]:
from ibm_watsonx_ai import APIClient

# Set up watsonx.ai API credentials
my_credentials = {
    "url": "<watsonx URL>",  # Replace with your service instance URL (watsonx URL)
    "apikey": "<watsonx_api_key>",  # Replace with your watsonx API key
}


# Initialize the watsonx.ai API client (used below to look up the embedding model ID)
client = APIClient(my_credentials)

## Loading and Splitting Documents
We start by collecting content from IBM-related product pages, blog posts, and announcements. For this, we use LangChain’s WebBaseLoader and split the text into manageable chunks using a recursive character splitter.

In [3]:
import bs4
from langchain_community.document_loaders import WebBaseLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Create a WebBaseLoader instance to load documents from web sources
loader = WebBaseLoader(
    web_paths=(
        "https://www.ibm.com/events/think/faq",
        "https://www.ibm.com/events/think/agenda",
        "https://www.ibm.com/products/watsonx-ai",
        "https://www.ibm.com/products/watsonx-ai/foundation-models",
        "https://www.ibm.com/watsonx/pricing",
        "https://www.ibm.com/watsonx",
        "https://www.ibm.com/products/watsonx-data",
        "https://www.ibm.com/products/watsonx-assistant",
        "https://www.ibm.com/products/watsonx-code-assistant",
        "https://www.ibm.com/products/watsonx-orchestrate",
        "https://www.ibm.com/products/watsonx-governance",
        "https://research.ibm.com/blog/granite-code-models-open-source",
        "https://www.redhat.com/en/about/press-releases/red-hat-delivers-accessible-open-source-generative-ai-innovation-red-hat-enterprise-linux-ai",
        "https://www.ibm.com/blog/announcement/enterprise-grade-model-choices/",
        "https://www.ibm.com/blog/announcement/democratizing-large-language-model-development-with-instructlab-support-in-watsonx-ai/",
        "https://newsroom.ibm.com/Blog-IBM-Consulting-Expands-Capabilities-to-Help-Enterprises-Scale-AI",
        "https://www.ibm.com/products/data-product-hub",
        "https://www.ibm.com/blog/announcement/delivering-superior-price-performance-and-enhanced-data-management-for-ai-with-ibm-watsonx-data/",
        "https://www.ibm.com/blog/a-new-era-in-bi-overcoming-low-adoption-to-make-smart-decisions-accessible-for-all/",
        "https://www.ibm.com/blog/announcement/ibm-watsonx-code-assistant-for-z-accelerate-the-application-lifecycle-with-generative-ai-and-automation/",
        "https://www.ibm.com/blog/announcement/watsonx-code-assistant-java/",
        "https://www.ibm.com/blog/announcement/watsonx-orchestrate-ai-z-assistant/",
        "https://newsroom.ibm.com/Blog-How-IBM-Cloud-is-Accelerating-Business-Outcomes-with-Gen-AI",
        "https://newsroom.ibm.com/2024-05-21-IBM-Unveils-Next-Chapter-of-watsonx-with-Open-Source,-Product-Ecosystem-Innovations-to-Drive-Enterprise-AI-at-Scale",
        "https://www.ibm.com/products/concert",
        "https://newsroom.ibm.com/2024-01-17-IBM-Introduces-IBM-Consulting-Advantage,-an-AI-Services-Platform-and-Library-of-Assistants-to-Empower-Consultants",
        "https://www.ibm.com/consulting/info/ibm-consulting-advantage"
                
    ),
    bs_kwargs=dict(
        parse_only=bs4.SoupStrainer(name=["main", "article", "p"])
    ),
)
# Load documents from the web sources
documents = loader.load()
# Initialize a splitter that produces chunks of up to 1000 characters with a 50-character overlap
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=50)
# Split the documents into chunks using the splitter
docs = splitter.split_documents(documents)
print(docs[1])


USER_AGENT environment variable not set, consider setting it to identify your requests.


page_content='Registration opened on the Think website on 21 January 2025. Or a direct link to register is available here.Yes, the Think® 2025 event in Boston will be a fee-based event. Standard pricing is USD 1,899.00, effective 21 January through 8 May 2025.Yes. All attendees must be at least 21 years of age by the day they pick up their conference badge.Requests for cancellation must be made in writing to IBMThink@gpj.com. Requests received by 4 April 2025, 11:59 PM ET will receive a full refund. Requests received after 12:00 AM ET, 5 April 2025 will not be eligible for a refund unless within 24-hours of registration.Substitutions for purchased passes will be allowed for attendees from the same company, and processed free of charge if requested by 11:59 PM ET on 4 April 2025. Substitution requests received after 12:00 AM ET, 5 April 2025 are subject to a USD 50 transaction fee.Substitutions for complimentary pass holders are not permitted. Please contact your IBM account representat
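
To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a simplified sketch using fixed-width windows. It is not the algorithm used by `RecursiveCharacterTextSplitter`, which also tries to break on separators such as paragraphs and sentences; the helper `split_with_overlap` is a hypothetical stand-in that only illustrates how consecutive chunks share characters.

```python
def split_with_overlap(text, chunk_size, chunk_overlap):
    """Cut text into fixed-width windows that share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = split_with_overlap(sample, chunk_size=10, chunk_overlap=3)
print(chunks)
# ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
```

Each chunk begins with the last three characters of the previous one. In the notebook, the 50-character overlap serves the same purpose: a sentence that straddles a chunk boundary is less likely to be cut off from its context.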

## Embedding Configuration
IBM watsonx.ai offers several embedding models. Here we use SLATE_30M_ENGLISH_RTRVR to convert each text chunk into a vector (a list of numbers). Input truncation is enabled, so inputs longer than 128 tokens are truncated before embedding.


In [4]:
from ibm_watsonx_ai.foundation_models.embeddings import Embeddings
from ibm_watsonx_ai.metanames import EmbedTextParamsMetaNames as EmbedParams

model_id = client.foundation_models.EmbeddingModels.SLATE_30M_ENGLISH_RTRVR

# Define embedding parameters
embed_params = {
    EmbedParams.TRUNCATE_INPUT_TOKENS: 128,  # Adjust token truncation as needed
    EmbedParams.RETURN_OPTIONS: {'input_text': True},
}

# Set up the embedding model
embedding = Embeddings(
    model_id=model_id,
    credentials=my_credentials,
    params=embed_params,
    project_id="<project_id>",  # Replace with your project ID
    space_id=None,
    verify=False,  # Disables TLS certificate verification; use only in test environments
)

## Generate Embeddings
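Each document chunk is converted into a high-dimensional vector. This cell shows the embedding call on its own; `embedding_vectors` is not reused in later cells, because the vector store step receives the same `embedding` object and computes the vectors itself.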


In [5]:
# embed_documents() takes a list of text chunks and returns one vector per chunk
embedding_vectors = embedding.embed_documents(texts=[doc.page_content for doc in docs])
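
The call above passes every chunk to `embed_documents` at once. For larger document sets it may be useful to embed in smaller batches. The sketch below shows one way to do that under stated assumptions: `batched`, `embed_in_batches`, and `fake_embed` are hypothetical helpers, `fake_embed` stands in for a real embedding call, and whether the watsonx.ai client already batches requests internally is not established by this notebook.

```python
def batched(items, batch_size):
    """Yield consecutive slices of items with at most batch_size elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def embed_in_batches(texts, embed_fn, batch_size=100):
    """Call embed_fn on each batch and concatenate the resulting vectors."""
    vectors = []
    for batch in batched(texts, batch_size):
        vectors.extend(embed_fn(batch))
    return vectors


def fake_embed(batch):
    """Stand-in embedder for demonstration: (text length, number of spaces)."""
    return [[float(len(text)), float(text.count(" "))] for text in batch]


vectors = embed_in_batches(["a b", "hello", "x y z"], fake_embed, batch_size=2)
print(vectors)
# [[3.0, 1.0], [5.0, 0.0], [5.0, 2.0]]
```

In the notebook, `embed_fn` could be `embedding.embed_documents`, with `texts` set to the list of `page_content` strings.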


## Store Embeddings in Milvus

We initialize a Milvus vector store from the documents. This embeds the documents, loads them into Milvus, and builds an index under the hood.


In [6]:
from langchain_milvus import Milvus

vectorstore = Milvus.from_documents(
    documents=docs,
    embedding=embedding,
    connection_args={
        "uri": "https://<hostname>:<port>",  # Replace with your watsonx.data Milvus host and port
        "user": "<user>",
        "password": "<password>",
        "secure": True,  # Set to True if TLS is enabled
        "server_pem_path": "/path_to_ca.cert",  # Replace with the path to your TLS certificate
    },
    drop_old=True,  # Drop any existing collection with the same name before inserting
)


print("connected")


connected


## Perform a Search Query
You may wonder why we run a search query before using a language model.

In a RAG pipeline, the retriever step brings in relevant knowledge before generation, so the language model generates its answer from retrieved facts rather than from thin air.

In [7]:
query = "Describe in detail some of the foundational models in watsonx-ai?"
print(vectorstore.similarity_search(query, k=1))

[Document(metadata={'pk': 457518948738408506, 'source': 'https://www.ibm.com/products/watsonx-ai'}, page_content='trusted and cost-effective, including IBM Granite models, select open-source models from Hugging Face, third-party models from strategic partners and custom foundation models.DeepSeek R1 Distilled Models now available on watsonx.aiLearn more about DeepSeek-R1 on IBM watsonx.ai and how you can seamlessly integrate, fine-tune, and deploy with watsonx.ai’s secure, governed environment.\xa0Whether you need code explanation, campaigns or lesson planning, use fit-for-purpose foundation models to get a head start on\xa0quality content, both general and personalized.AI developers can build and deploy ready-to-use knowledge management applications quickly with pre-built RAG templates, frameworks and APIs.Streamline discovery and analysis of large amounts of data for faster, more valuable insights and forecasts specific to your needs and business requirements.From more customer satis
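
Conceptually, `similarity_search` embeds the query and returns the stored chunks whose vectors are closest to it. The sketch below illustrates that idea with a brute-force cosine similarity over a toy in-memory index. It is only a conceptual illustration: the vectors and texts are made up, `cosine_similarity` and `top_k` are hypothetical helpers, and the index type and distance metric that Milvus actually uses depend on its configuration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, indexed, k=1):
    """Return the k (score, text) pairs most similar to query_vec."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in indexed]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Toy index: made-up 3-dimensional vectors standing in for real embeddings.
toy_index = [
    ("chunk about foundation models", [0.9, 0.1, 0.0]),
    ("chunk about event registration", [0.0, 0.2, 0.9]),
    ("chunk about pricing", [0.1, 0.9, 0.1]),
]
best = top_k([1.0, 0.0, 0.0], toy_index, k=1)
print(best[0][1])
# chunk about foundation models
```

A vector database avoids this linear scan by using an index, which is what makes the search scale to large collections.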

## Set up watsonx.ai Language Model

We use ibm/granite-3-3-8b-instruct to answer the user query:


In [8]:
from ibm_watsonx_ai.foundation_models import ModelInference
from langchain_ibm import WatsonxLLM

# Initialize model inference
model_inference = ModelInference(
    model_id="ibm/granite-3-3-8b-instruct",  # Use a watsonx.ai foundation model
    params={
        "max_new_tokens": 1024,  # Upper bound on the number of generated tokens
    },
    credentials=my_credentials,
    project_id="<project_id>",  # Replace with your project ID
)

# Wrap with LangChain's WatsonxLLM
llm = WatsonxLLM(watsonx_model=model_inference)


## Compose the Final RAG Chain
Let’s glue everything together using LangChain’s composable Runnable interface. Here’s how it works:

* Retriever fetches top documents.

* Formatter prepares them for prompt injection.

* PromptTemplate frames a question and context.

* LLM generates an answer.

* OutputParser extracts the response.


In [9]:

from langchain_core.runnables import RunnablePassthrough
from langchain_core.prompts import PromptTemplate
from langchain_core.output_parsers import StrOutputParser

# Define the prompt template with explicit instructions for the model
PROMPT_TEMPLATE = """Generate a summary of the context that answers the question. Explain the answer in multiple steps if possible. 
Answer style should match the context. Ideal Answer Length 5-12 sentences.
Context:
{context}

Question:
{question}

Answer:
"""


# Create a PromptTemplate instance with the defined template and input variables
prompt = PromptTemplate(
    template=PROMPT_TEMPLATE, input_variables=["context", "question"]
)
# Convert the vector store to a retriever
retriever = vectorstore.as_retriever()

def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)

rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)
response = rag_chain.invoke(query)
print("Answer:", response)


Answer: Watsonx-ai offers a variety of foundational models, including IBM Granite models, select open-source models from Hugging Face, third-party models from strategic partners, and custom foundation models. Among these, IBM Granite models are specifically highlighted for their performance, trustworthiness, and cost-effectiveness. These models are developed with a robust process that involves searching for and removing duplication, employing URL blocklists, filters for objectionable content, document quality checks, sentence splitting, and tokenization techniques before model training. Additionally, DeepSeek R1 Distilled Models are now available on watsonx-ai, which have been integrated into the InstructLab community. Developers can contribute to enhancing these models, similar to open-source projects, fostering a collaborative environment for model improvement. The platform also provides pre-built RAG (Retrieval-Augmented Generation) templates, frameworks, and APIs for AI developers 
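
The data flow of the chain above can be traced with plain Python functions. The sketch below replaces the retriever and the LLM with stand-ins so that it runs without any external service. `Doc`, `fake_retriever`, `fake_llm`, and `run_chain` are hypothetical names used only for illustration, and the template is a shortened version of the one in the notebook.

```python
from dataclasses import dataclass


@dataclass
class Doc:
    """Minimal stand-in for a LangChain Document."""
    page_content: str


def fake_retriever(question):
    """Stand-in retriever: always returns the same two chunks."""
    return [Doc("first retrieved chunk"), Doc("second retrieved chunk")]


def format_docs(docs):
    """Join chunk texts with blank lines, as in the notebook."""
    return "\n\n".join(doc.page_content for doc in docs)


TEMPLATE = "Context:\n{context}\n\nQuestion:\n{question}\n\nAnswer:\n"


def fake_llm(prompt_text):
    """Stand-in LLM: reports the prompt size instead of generating text."""
    return f"(stub answer based on a {len(prompt_text)}-character prompt)"


def run_chain(question):
    """Retrieve, format, fill the prompt, then call the model."""
    context = format_docs(fake_retriever(question))
    prompt_text = TEMPLATE.format(context=context, question=question)
    return prompt_text, fake_llm(prompt_text)


prompt_text, answer = run_chain("What is in the index?")
print(prompt_text)
print(answer)
```

Printing `prompt_text` shows exactly what the model receives: the retrieved chunks under `Context:`, followed by the user question. Inspecting this intermediate string is a practical way to debug a RAG chain when answers look ungrounded.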

## Conclusion
Through this step-by-step guide, we've built a RAG pipeline using LangChain, Milvus, and watsonx.ai models.

This architecture enables:

- Responses grounded in the retrieved documents

- Scalable vector search with Milvus

- Seamless use of IBM’s foundation models for intelligent generation

By combining retrieval with generation, this RAG system can help enterprises answer questions over large knowledge bases, for example in assistants, customer support bots, or internal knowledge tools.

Whether you're exploring RAG for research or real-world deployment, this pipeline gives you a foundation to build on.

