# Building a Retrieval-Augmented Generation Pipeline Using Db2's Vector Search and LangChain

### **Notebook Summary: Implementing Retrieval-Augmented Generation (RAG) with IBM Db2 and LangChain**

This notebook demonstrates how to build a **Retrieval-Augmented Generation (RAG) pipeline** using **IBM Db2’s vector search capabilities** and **LangChain**. The pipeline enhances large language model (LLM) responses by retrieving relevant information from a knowledge base before generating an answer.

#### **Key Components & Workflow:**

1. **Loading & Processing Documents:**
   - `document_loader(urls)`: Fetches text from a list of webpages.
   - `text_splitter(docs)`: Splits the text into **manageable chunks** while maintaining context.

2. **Storing & Retrieving Vectors:**
   - `watsonx_embedding()`: Converts text chunks into **vector embeddings** using **IBM Watsonx AI**.
   - `Db2VectorStore`: Stores the **vectorized document chunks** in **IBM Db2**.
   - `retriever(urls)`: Retrieves **semantically similar** document chunks based on user queries.

3. **Building the RAG Pipeline:**
   - `get_llm()`: Loads the **Mistral Large** LLM from **IBM Watsonx AI** (Meta Llama 3.1 70B Instruct is kept in the code as a commented-out alternative).
   - `qa_chain(urls)`: Creates a **retrieval-based QA system** using **retrieved document chunks** and **LLM** to generate context-aware answers.

4. **Running the QA System:**
   - A list of sample **URLs** is passed to `qa_chain(urls)` to load two blog posts into the pipeline.
   - The system can now **answer questions** based on the content retrieved from IBM Db2.

#### **Purpose & Benefits:**
- **Reduces hallucinations** by instructing the LLM to answer only from **retrieved knowledge**.
- **Improves accuracy** by dynamically **fetching relevant information** before answering.
- **Leverages IBM Db2’s vector capabilities** to store and search high-dimensional embeddings.
- **Creates a reusable RAG pipeline** for answering domain-specific questions.

### **Environment Setup**
Before running this notebook, make sure you have set up your environment correctly. Follow the **setup instructions** provided in the [README.md](README.md) in the same directory.

This notebook serves as a **step-by-step guide** for implementing RAG with **IBM Db2, LangChain, and Watsonx AI**, enabling **more reliable, context-aware AI applications**. 🚀

# 1. Import

In [1]:
# Standard library
import os

# Third-party packages
import pandas as pd
from dotenv import load_dotenv
from IPython.display import display, Markdown
from ibm_watsonx_ai.metanames import GenTextParamsMetaNames as GenParams, EmbedTextParamsMetaNames
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate
from langchain.text_splitter import RecursiveCharacterTextSplitter, TokenTextSplitter
from langchain_community.document_loaders import WebBaseLoader
from langchain_ibm import WatsonxLLM, WatsonxEmbeddings

# Local helper module shipped alongside this notebook
from db2_utils import get_db2_connection, close_db2_connection, initialize_db, cleanup_db, Db2VectorStore

USER_AGENT environment variable not set, consider setting it to identify your requests.
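
The message above is emitted when `langchain_community` is imported and no `USER_AGENT` environment variable is found. A minimal sketch of one way to avoid it, assuming the snippet runs before the imports cell; the agent string is a made-up placeholder, not a value this notebook prescribes.

```python
import os

# Hypothetical identifier; replace it with something that describes your application.
DEFAULT_USER_AGENT = "db2-rag-notebook/0.1"

# setdefault keeps any value that was already exported in the shell.
os.environ.setdefault("USER_AGENT", DEFAULT_USER_AGENT)

print(os.environ["USER_AGENT"])
```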


# 2. Setup

`Connecting to Db2`

In [2]:
get_db2_connection()

Connected to Db2 successfully.


<ibm_db.IBM_DBConnection at 0x7f555f814170>

`Creating a Db2 table for storing vectors`
```sql
CREATE TABLE embeddings (
    id INT NOT NULL GENERATED ALWAYS AS IDENTITY (START WITH 1, INCREMENT BY 1),
    content CLOB,
    source VARCHAR(255),
    title VARCHAR(255),
    embedding VECTOR(1024, FLOAT32),
    PRIMARY KEY (id)
)
```

In [3]:
initialize_db()

Table 'embeddings' dropped successfully (if it existed).
Table 'embeddings' created successfully.


`Load watsonx.ai API credentials from the .env file`

In [4]:
load_dotenv(os.getcwd()+"/.env", override=True)

True
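
A `True` result here does not by itself confirm that both credentials are present, and the functions later in this notebook fall back to an empty string when a variable is missing. The following sketch shows one way to check up front. The helper name `missing_env_vars` is hypothetical; the variable names are the ones this notebook reads.

```python
import os

REQUIRED_VARS = ("WATSONX_APIKEY", "WATSONX_PROJECT")

def missing_env_vars(names=REQUIRED_VARS, env=None):
    """Return the names that are unset or empty in the given environment mapping."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]

# Example with an explicit mapping so the check is easy to try out.
example_env = {"WATSONX_APIKEY": "abc123", "WATSONX_PROJECT": ""}
print(missing_env_vars(env=example_env))  # ['WATSONX_PROJECT']
```

Calling `missing_env_vars()` with no arguments checks the real process environment.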

`Suppress warnings from code`

In [5]:
def warn(*args, **kwargs):
    pass
import warnings
warnings.warn = warn
warnings.filterwarnings('ignore')
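
The cell above silences every warning for the rest of the session. If that is broader than you want, a scoped alternative is sketched below using the standard library's `warnings.catch_warnings()`. The `noisy_call` function is a stand-in invented for illustration.

```python
import warnings

def noisy_call():
    """Stand-in for a library call that emits a deprecation warning."""
    warnings.warn("this API is deprecated", DeprecationWarning)
    return "result"

# Warnings are ignored only inside the with-block; the filters are restored on exit.
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    value = noisy_call()

print(value)
```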

# 3. Define `RAG` pipeline functions

`Set up LLM services`

The `watsonx_embedding` function **creates and returns an embedding service** using IBM Watsonx for generating vector representations of text.

**What it does:**
1. **Defines embedding parameters (`embed_params`)**:  
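   - `TRUNCATE_INPUT_TOKENS: 3` → Sets the maximum number of tokens kept from each input; longer inputs are truncated. A value of 3 keeps only the first three tokens of every chunk, so a limit close to the model's maximum input length is usually more appropriate.  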
   - `RETURN_OPTIONS: {"input_text": True}` → Ensures the input text is included in the response.  

2. **Initializes a `WatsonxEmbeddings` instance**:  
   - Uses the `"intfloat/multilingual-e5-large"` model for text embeddings.  
   - Connects to **IBM Watsonx AI** at `"https://us-south.ml.cloud.ibm.com"`.  
   - Retrieves API credentials (`apikey` and `project_id`) from **environment variables** (`WATSONX_APIKEY`, `WATSONX_PROJECT`).  

3. **Returns the configured Watsonx embedding model**, which can be used to convert text into vector representations for similarity search.

### **Purpose:**
- Enables **text embeddings** for vector-based retrieval in **RAG pipelines**.  
- Supports **multilingual processing** with `multilingual-e5-large`.  
- Facilitates **semantic search** in AI applications by converting text into numerical vectors for similarity comparison.
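
To make "similarity comparison" concrete, here is a small sketch of two common measures on toy 3-dimensional vectors. The vector search SQL shown later in this notebook uses Euclidean distance; cosine similarity is included only for comparison. The vectors are made up, and real embeddings from this model have 1024 dimensions.

```python
import math

def euclidean_distance(a, b):
    """Straight-line distance between two equal-length vectors (smaller means closer)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (larger means more similar)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

query = [1.0, 0.0, 0.0]
near = [0.9, 0.1, 0.0]
far = [0.0, 1.0, 0.0]

print(euclidean_distance(query, near) < euclidean_distance(query, far))  # True
print(cosine_similarity(query, near) > cosine_similarity(query, far))    # True
```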

In [6]:
# embedding service for generating vectors
def watsonx_embedding():
    embed_params = {
        # Maximum number of tokens kept per input; longer inputs are truncated.
        EmbedTextParamsMetaNames.TRUNCATE_INPUT_TOKENS: 3,
        # Ask the service to include the input text in its response.
        EmbedTextParamsMetaNames.RETURN_OPTIONS: {"input_text": True},
    }
    # Credentials come from the environment variables loaded from .env above.
    embeddings = WatsonxEmbeddings(
        model_id="intfloat/multilingual-e5-large",
        url="https://us-south.ml.cloud.ibm.com",
        apikey=os.getenv("WATSONX_APIKEY", ""),
        project_id=os.getenv("WATSONX_PROJECT", ""),
        params=embed_params,
    )
    return embeddings

The `get_llm` function **initializes and returns a large language model (LLM) service** using IBM Watsonx for generating AI-driven responses.

**What it does:**
1. **Defines the LLM model ID**:  
   - Uses **Mistral Large**: `'mistralai/mistral-large'`. **Meta Llama 3.1 70B Instruct** (`'meta-llama/llama-3-1-70b-instruct'`) is kept in the code as a commented-out alternative.

2. **Sets generation parameters** (`parameters`):  
   - `MAX_NEW_TOKENS: 512` → Limits response length to **512 tokens**.  
   - `TEMPERATURE: 0.5` → Balances between **creativity and determinism** in responses (higher values make responses more random).  

3. **Creates a Watsonx LLM instance (`WatsonxLLM`)**:  
   - Connects to **IBM Watsonx AI** at `"https://us-south.ml.cloud.ibm.com"`.  
   - Retrieves API credentials (`apikey` and `project_id`) from **environment variables** (`WATSONX_APIKEY`, `WATSONX_PROJECT`).  
   - Passes the **model ID** and **generation parameters** for configuration.  

4. **Returns the Watsonx LLM instance**, which can be used to generate text-based responses.

### **Purpose:**
- Enables **natural language generation (NLG)** in AI applications.  
- Supports **instruction-following tasks** using the configured model (**Mistral Large** as written).  
- Works in **retrieval-augmented generation (RAG) pipelines** to provide context-aware responses based on retrieved information.
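
The effect of the temperature setting can be illustrated with a temperature-scaled softmax. This is a generic sketch of how temperature reshapes a probability distribution over candidate tokens, not a description of how the Watsonx service samples internally; the scores are made up.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert raw scores to probabilities, scaled by a sampling temperature > 0."""
    scaled = [value / temperature for value in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(value - peak) for value in scaled]
    total = sum(exps)
    return [value / total for value in exps]

logits = [2.0, 1.0, 0.5]  # made-up scores for three candidate tokens
for t in (0.2, 0.5, 1.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

Lower temperatures concentrate probability on the top-scoring token, while higher temperatures spread it more evenly across the candidates.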

In [7]:
# llm for generating responses
def get_llm():
    # model_id = 'meta-llama/llama-3-1-70b-instruct'
    model_id = 'mistralai/mistral-large'
    parameters = {
        GenParams.MAX_NEW_TOKENS: 512,
        GenParams.TEMPERATURE: 0.5,
    }
   
    watsonx_llm = WatsonxLLM(
        model_id=model_id,
        url="https://us-south.ml.cloud.ibm.com",
        apikey=os.getenv("WATSONX_APIKEY", ""),
        project_id=os.getenv("WATSONX_PROJECT", ""),
        params=parameters,
    )
    return watsonx_llm

The following code block **defines a prompt template** for a retrieval-augmented generation (RAG) system, ensuring that the AI assistant answers questions **only based on provided context** and avoids hallucination.

**What it does:**
1. **Defines a structured prompt** (`prompt_template`):  
   - Instructs the AI to **only use** the given context to answer.  
   - If the answer **is not found in the context**, it must explicitly state:  
     *"The information is not available in the provided context."*  
2. **Creates a `PromptTemplate` object** (`PROMPT`):  
   - Uses `{context}` and `{question}` as placeholders for dynamic input.  
   - Ensures the large language model (LLM) receives a **properly formatted** prompt.  

### Purpose:
- Helps prevent **hallucinations** by restricting responses to known data.  
- Enhances the **accuracy and trustworthiness** of LLM-generated answers in **RAG pipelines**.

In [8]:
prompt_template = """
You are a knowledgeable assistant. Answer the question based solely on the provided context.
- If the context contains the answer, respond directly to the reader using 'you' to make it personal.
- If the answer includes code, provide an explanation of the code following the code block.
- If the information is not available in the context, respond with 'The information is not available in the provided context.'

Context: {context}

Question: {question}
Answer:
"""

PROMPT = PromptTemplate(
    template=prompt_template, input_variables=["context", "question"]
)
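
A template and its declared input variables can drift apart when the prompt is edited. The sketch below uses only the standard library to list the placeholders in a format-style template and then fills it. The helper name `placeholders` is hypothetical, and the template is a shortened illustration rather than the full prompt above.

```python
from string import Formatter

def placeholders(template):
    """Return the set of {name} fields that appear in a format-style template."""
    return {field for _, field, _, _ in Formatter().parse(template) if field}

# Shortened, illustrative template; the notebook's real one is longer.
template = "Context: {context}\n\nQuestion: {question}\nAnswer:"
declared = {"context", "question"}

# Every placeholder is declared, and every declared variable is used.
assert placeholders(template) == declared

filled = template.format(
    context="Db2 supports a VECTOR column type.",
    question="Which column type stores embeddings?",
)
print(filled)
```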

The `document_loader` function **fetches text content from a list of webpages** using `WebBaseLoader`.  

**What it does:**  
1. **Loads webpage content**: It creates a `WebBaseLoader` for each URL in the list and loads the page content as `Document` objects.  
2. **Flattens the results**: Each loader returns its own list of documents, so the per-URL lists are merged into a single flat list.  
3. **Returns the extracted content**: The function outputs a list of `Document` objects. The text is returned as loaded; no cleaning (such as removing empty lines) is applied.  

This function is useful for **loading web-based textual data** before further processing (e.g., chunking, embedding).

In [9]:
def document_loader(urls):
    """
    Loads content from a list of webpages using WebBaseLoader.

    Args:
        urls (list[str]): The URLs of the webpages to load.

    Returns:
        list: A flat list of Document objects extracted from the webpages.
    """
    # Each loader returns a list of Documents, so docs is a list of lists.
    docs = [WebBaseLoader(url).load() for url in urls]
    # Flatten the per-URL lists into a single list.
    docs_list = [item for sublist in docs for item in sublist]

    return docs_list
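
The `document_loader` code does not itself remove blank lines, and scraped pages often contain long runs of them. If you want that cleanup, a minimal sketch is shown below. The helper `strip_blank_lines` is hypothetical and not part of the notebook or of LangChain.

```python
def strip_blank_lines(text):
    """Drop empty or whitespace-only lines and trim trailing spaces on the rest."""
    return "\n".join(line.rstrip() for line in text.splitlines() if line.strip())

raw = "Title\n\n\n  \nFirst paragraph.   \n\nSecond paragraph.\n"
print(strip_blank_lines(raw))
```

One way to use it would be to assign `doc.page_content = strip_blank_lines(doc.page_content)` for each loaded document before splitting.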

In [10]:
# docs_list[0]

The `text_splitter` function **splits documents into smaller chunks** for better processing in retrieval-augmented generation (RAG) pipelines.  

**What it does:**  
1. **Initializes a text splitter** using `RecursiveCharacterTextSplitter.from_tiktoken_encoder`, which:  
   - Splits text into chunks of up to **2048 tokens**.  
   - Maintains a **256-token overlap** between chunks to preserve context.  
   - Measures length in `tiktoken` tokens rather than characters.  
2. **Splits the input documents** into chunks using `split_documents(docs)`.  
3. **Returns the list of document chunks**.  

This function ensures that long documents are **divided into manageable segments** while retaining contextual continuity between adjacent chunks, improving **vector search accuracy** and **LLM-generated responses**.

In [11]:
def text_splitter(docs):
    # Token-based splitter: sizes are counted in tiktoken tokens, not characters.
    splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
        chunk_size=2048, chunk_overlap=256
    )
    chunks = splitter.split_documents(docs)
    return chunks
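
The idea of overlapping chunks can be sketched without any library. The function below slides a fixed-size window over a list of tokens. It is a simplification: `RecursiveCharacterTextSplitter` also tries to break at natural separators such as paragraph boundaries, which this sketch ignores. The name `chunk_with_overlap` is hypothetical.

```python
def chunk_with_overlap(tokens, chunk_size, overlap):
    """Split a token list into windows of chunk_size that share `overlap` tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reached the end
    return chunks

tokens = list(range(10))  # stand-in for token ids
print(chunk_with_overlap(tokens, chunk_size=4, overlap=1))
# [[0, 1, 2, 3], [3, 4, 5, 6], [6, 7, 8, 9]]
```

Each chunk repeats the last token of the previous one, which is how the overlap carries context across chunk boundaries.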

The `retriever` function **builds a retrieval system** by processing the documents and storing their vectorized chunks for efficient semantic search.  

**What it does:**  
1. **Loads the documents** using `document_loader(urls)`.  
2. **Splits the documents** into smaller chunks using `text_splitter(docs)`.  
3. **Creates a vector database** (`Db2VectorStore`) with an embedding function (`watsonx_embedding()`) and retrieves the top **5** most relevant results (`k=5`).  
4. **Adds the document chunks** to the vector store for indexing.  
5. **Returns a retriever instance** that enables similarity-based search over the stored vectorized chunks.  

This function allows **efficient retrieval of relevant document segments** based on a query, improving context-aware AI responses in **retrieval-augmented generation (RAG) pipelines**.

`INSERT VECTOR SQL`:
```sql
INSERT INTO embeddings(content, source, title, embedding)
VALUES (?, ?, ?, VECTOR('[{embedding_vector_str}]', 1024, FLOAT32));
```
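
The `{embedding_vector_str}` placeholder in the statement above has to be filled with a textual form of the embedding. The `db2_utils` implementation is not shown in this notebook, so the following is only a sketch of how such a string could be produced; both function names are hypothetical.

```python
def to_vector_literal(values):
    """Serialize floats as a bracketed, comma-separated string such as '[0.1,0.2]'."""
    # float() rejects non-numeric input, so only numeric text reaches the SQL string.
    return "[" + ",".join(repr(float(v)) for v in values) + "]"

def build_insert_sql(values, table="embeddings"):
    """Build the INSERT text; content, source and title remain parameter markers."""
    literal = to_vector_literal(values)
    return (
        f"INSERT INTO {table}(content, source, title, embedding) "
        f"VALUES (?, ?, ?, VECTOR('{literal}', {len(values)}, FLOAT32))"
    )

print(build_insert_sql([0.25, -1.0, 3.5]))
```

In this sketch the text columns stay as `?` parameter markers and only the numeric vector is interpolated into the statement.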

In [12]:
def retriever(urls):
    docs = document_loader(urls)
    chunks = text_splitter(docs)
    vectordb = Db2VectorStore(embedding_function=watsonx_embedding(), k=5)
    vectordb.add_documents(chunks)
    retriever_instance = vectordb.as_retriever()
    return retriever_instance

The `qa_chain` function creates a **question-answering (QA) pipeline** using a retrieval-augmented generation (RAG) approach. It:  

1. **Initializes an LLM** using `get_llm()`.  
2. **Builds the retriever** by calling `retriever(urls)`, which loads, chunks, embeds, and stores the documents.  
3. **Builds a RetrievalQA chain** with `RetrievalQA.from_chain_type()`, specifying the LLM, the `"stuff"` chain type, the retriever, and the prompt.  
4. **Returns the QA chain**, which can answer queries based on retrieved documents while also returning source documents.  

This function enables **context-aware Q&A** by retrieving relevant information before generating responses.

In [13]:
def qa_chain(urls):
    llm = get_llm()
    retriever_obj = retriever(urls)
    qa_chain = RetrievalQA.from_chain_type(
        llm=llm,
        chain_type="stuff",
        retriever=retriever_obj,
        return_source_documents=True,
        chain_type_kwargs={"prompt": PROMPT}
    )
    return qa_chain
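
With `chain_type="stuff"`, all retrieved chunks are placed into a single prompt. The sketch below shows that idea in plain Python. It is a simplification with hypothetical function names and a shortened prompt layout, not LangChain's actual implementation.

```python
def stuff_documents(chunks, separator="\n\n"):
    """Join retrieved chunk texts into one context string, in retrieval order."""
    return separator.join(chunk.strip() for chunk in chunks)

def build_prompt(chunks, question):
    """Place the stuffed context and the question into a simple prompt layout."""
    context = stuff_documents(chunks)
    return f"Context: {context}\n\nQuestion: {question}\nAnswer:"

retrieved = ["Chunk about Python UDFs. ", " Chunk about registering the UDF."]
print(build_prompt(retrieved, "What is a Python UDF?"))
```

Because every retrieved chunk is included, the prompt grows with both `k` and the chunk size, so those two settings need to fit within the model's context window.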

# 4. Create a QA chain and ask questions using the chain

The following code **runs a question-answering (QA) pipeline** using a list of URLs as the knowledge source.

**What it does:**
1. **Resets the `embeddings` table** with `initialize_db()` and **defines the list of URLs** (two IBM community blog posts).
2. **Calls `qa_chain(urls)`**:
   - Loads the webpage content using `document_loader(urls)`.
   - Splits the text into **manageable chunks** via `text_splitter`.
   - Stores vector embeddings in **Db2VectorStore** using `watsonx_embedding()`.
   - Creates a **retrieval-based QA pipeline** (`RetrievalQA`) using `get_llm()` as the language model.
3. **Outputs a QA system (`qa_chain`)** that can:
   - Retrieve **relevant chunks** from the blog posts using **vector search**.
   - Generate **context-aware responses** using an **LLM**.

**Purpose:**
- Enables **automated question-answering** over the blog posts' content.
- Instructs the **LLM to answer only from the retrieved context**.
- Supports **retrieval-augmented generation (RAG)** for **accurate, domain-specific AI responses**.

In [14]:
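initialize_db()
urls = [
    "https://community.ibm.com/community/user/datamanagement/blogs/shaikh-quader/2024/05/07/building-an-in-db-linear-regression-model-with-ibm",
    "https://community.ibm.com/community/user/datamanagement/blogs/shaikh-quader/2024/05/27/db2ai-pyudf"
]

# Note: this assignment rebinds the name qa_chain from the function to the chain
# object, so re-run the cell that defines qa_chain before re-running this one.
qa_chain = qa_chain(urls)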

Table 'embeddings' dropped successfully (if it existed).


Table 'embeddings' created successfully.


# 5. Ask LLM

Behind-the-scenes SQL for vector search:
```sql
SELECT 
    content AS CONTENT, 
    source AS SOURCE, 
    title AS TITLE, 
    VECTOR_DISTANCE(
        VECTOR('{query_embedding_str}', 1024, FLOAT32), 
        embedding, 
        EUCLIDEAN
    ) AS DISTANCE 
FROM embeddings 
ORDER BY DISTANCE ASC 
FETCH FIRST {top_k} ROWS ONLY;
```
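
The query above amounts to a nearest-neighbour search: compute the distance from the query vector to every stored vector, sort ascending, and keep the first `top_k` rows. A plain-Python sketch of the same logic follows, using toy 2-dimensional vectors and a hypothetical function name. It illustrates the ranking only, not how Db2 executes the query.

```python
import heapq
import math

def top_k_by_distance(query_vec, rows, k):
    """Return the k (distance, content) pairs closest to query_vec, nearest first."""
    scored = (
        (math.dist(query_vec, embedding), content)
        for content, embedding in rows
    )
    return heapq.nsmallest(k, scored)

# Toy 2-dimensional "embeddings"; the ones in this notebook have 1024 dimensions.
rows = [
    ("about UDFs", [1.0, 0.0]),
    ("about regression", [0.0, 1.0]),
    ("about UDF registration", [0.8, 0.2]),
]
for distance, content in top_k_by_distance([1.0, 0.0], rows, k=2):
    print(f"{distance:.3f}  {content}")
```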

`Question: What is a Python UDF?`

In [None]:
# Alternative questions to try; assign one of them to query instead of the active line below.
# query = 'How to train a linear regression model inside Db2?'
# query = 'How to see the list of in database ML models in Db2?'
# query = 'How to impute missing values of columns in Db2?'
# query = 'How to generate predictions using a Python UDF?'
# query = 'How to compute summary statistics in Db2 for machine learning?'
# query = 'How to generate predictions using a linear regression model?'
# query = 'How to drop a model?'
query = 'What is Python UDF?'

response = qa_chain.invoke({"query": query})

# Extract the answer
answer = response['result']
print("Answer:")
display(Markdown(answer))

# Extract and print the retrieved documents
source_documents = response['source_documents']
print("\nRetrieved Contexts:")
for i, doc in enumerate(source_documents, 1):
    print(f"\nDocument {i}:")
    print(f"Content: {doc.page_content[:500]}...")  # Display the first 500 characters
    print(f"Metadata: {doc.metadata['distance']}")

Answer:



Python UDF stands for Python User-Defined Function. It is a custom function written in Python that can be registered and used within a database system, such as IBM Db2. This allows you to leverage the power and flexibility of Python for tasks that may be complex or not easily achievable with standard SQL functions.

In the provided context, a Python UDF is defined to perform batch inferencing using a trained machine learning model. This UDF collects input rows into a batch, processes them using the model, and returns the predictions. The UDF is then registered in the Db2 database, enabling you to use it in SQL queries for generating predictions.

Here is the relevant part of the context that defines the Python UDF:

```python
import nzae
import pandas as pd
from joblib import load

ml_model_path = '/home/shaikhq/pipe_lr/myudf_lr.joblib'
ml_model_features = ['YEAR', 'QUARTER', 'MONTH', 'DAYOFMONTH', 'DAYOFWEEK', 'UNIQUECARRIER', 'ORIGIN', 'DEST', 'CRSDEPTIME', 'DEPDELAY', 'DEPDEL15', 'TAXIOUT', 'WHEELSOFF', 'CRSARRTIME', 'CRSELAPSEDTIME', 'AIRTIME', 'DISTANCEGROUP']

class full_pipeline(nzae.Ae):
    def _runUdtf(self):
        # Load the trained pipeline
        trained_pipeline = load(ml_model_path)

        # Collect rows into a single batch
        rownum = 0
        row_list = []
        for row in self:
            if rownum == 0:
                # Grab batch size from first element value (select count(*))
                batchsize = row[0]

            # Collect everything but the first element (which is select count(*))
            row_list.append(row[1:])
            rownum += 1

            if rownum == batchsize:
                # Collect data into a Pandas dataframe for scoring
                data = pd.DataFrame(row_list, columns=ml_model_features)
```


Retrieved Contexts:

Document 1:
Content: # define the strategy to fill in missing values in the categorical columns
cat_pipeline = make_pipeline(SimpleImputer(strategy='most_frequent'),
                            OneHotEncoder(handle_unknown='ignore'))

# combine the previous 2 pipelines into a data preproessing pipeline. 

preprocessing = make_column_transformer(
    (num_pipeline, make_column_selector(dtype_include=np.number)),
    (cat_pipeline, make_column_selector(dtype_include='object'))
)

# create a final pipeline by chaining ...
Metadata: 0.6180782797877933

Document 2:
Content: SELECT count(*) FROM GOSALES.GOSALES_TRAIN



SELECT count(*) FROM GOSALES.GOSALES_TEST



The above counts confirm that the train and test tables have 80% and 20% of records, respectively, from the original table with 60252 records.
Data Exploration
Now, I will look into some sample records from the training dataset, GOSALES_TRAIN.
SELECT * FROM GOSALES.GOSALES_TRAIN FETCH FIRST 5 ROWS ONLY





Fr

# 6. Clean Up

In [16]:
# Close the connection when done
# cleanup_db()
# close_db2_connection()
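
If you prefer cleanup that runs even when a cell raises an error, one option is to wrap the open and close calls in a context manager. The sketch below uses stand-in callables. Wiring in `get_db2_connection` and `close_db2_connection` is an assumption about how those helpers could be used, and `managed_resource` is a hypothetical name.

```python
from contextlib import contextmanager

@contextmanager
def managed_resource(open_fn, close_fn):
    """Call open_fn on entry and guarantee close_fn runs on exit, even on errors."""
    resource = open_fn()
    try:
        yield resource
    finally:
        close_fn()

# Stand-in callables that record what happened, so the order is visible.
events = []
with managed_resource(
    lambda: events.append("open") or "conn",
    lambda: events.append("close"),
) as conn:
    events.append(f"using {conn}")

print(events)  # ['open', 'using conn', 'close']
```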