## Code to Chapter 10 of LangChain for Life Science and Healthcare book, by Dr. Ivan Reznikov

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1fqilNVeTErmrpkZ-qVEUzc54zCnIYWGy?usp=sharing)

## RAG Tutorial with AdalFlow

This notebook demonstrates how to build a complete Retrieval-Augmented Generation (RAG) system using the AdalFlow library. We'll walk through setting up the environment, processing documents, creating embeddings, and implementing a full RAG pipeline.

## Table of Contents
1. [Environment Setup](#environment-setup)
2. [Configuration](#configuration)
3. [Data Pipeline Creation](#data-pipeline-creation)
4. [Document Processing](#document-processing)
5. [Database Setup](#database-setup)
6. [RAG Pipeline Implementation](#rag-pipeline-implementation)
7. [Testing and Results](#testing-and-results)

## Environment Setup

First, we need to install the required dependencies for our RAG system:

In [1]:
!pip install -qU adalflow[openai] PyPDF2 pyvis faiss-cpu

[?25l   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0.0/232.6 kB[0m [31m?[0m eta [36m-:--:--[0m[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m232.6/232.6 kB[0m [31m7.1 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m756.0/756.0 kB[0m [31m24.2 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m31.3/31.3 MB[0m [31m29.5 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m45.5/45.5 kB[0m [31m2.4 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m310.1/310.1 kB[0m [31m11.8 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m1.6/1.6 MB[0m [31m37.6 MB/s[0m eta [36m0:00:00[0m
[?25h

In [2]:
# patch for colab to run

!pip -q uninstall httpx anyio -y
!pip -q install "anyio>=3.1.0,<4.0"
!pip -q install httpx==0.24.1

[?25l   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0.0/80.9 kB[0m [31m?[0m eta [36m-:--:--[0m[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m80.9/80.9 kB[0m [31m2.8 MB/s[0m eta [36m0:00:00[0m
[?25h[31mERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
google-genai 1.27.0 requires httpx<1.0.0,>=0.28.1, which is not installed.
openai 1.97.1 requires httpx<1,>=0.23.0, which is not installed.
gradio 5.38.1 requires httpx<1.0,>=0.24.1, which is not installed.
google-genai 1.27.0 requires anyio<5.0.0,>=4.8.0, but you have anyio 3.7.1 which is incompatible.[0m[31m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m75.4/75.4 kB[0m [31m4.2 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m74.5/74.5 kB[0m [31m2.8 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━

In [3]:
!pip freeze | grep "adal\|httpx\|faiss\|openai"

adalflow==1.0.4
faiss-cpu==1.11.0.post1
httpx==0.24.1
openai==1.97.1
safehttpx==0.1.6


In [4]:
import os
import openai
from google.colab import userdata

os.environ["OPENAI_API_KEY"] = userdata.get("LC4LS_OPENAI_API_KEY")

## RAG Architecture Overview

Unlike other libraries, AdalFlow's RAG pipeline consists of two main components:

1. **Task Pipeline**: Contains a retriever and a generator for query processing
2. **Data Pipeline**: Handles preprocessing and persistence of documents with local/cloud databases

This architecture mirrors real production environments where data processing and query handling are separated for better scalability and maintainability.

## Configuration

We centralize all system configurations in a single dictionary for easy management:

In [5]:
configs = {
    "embedder": {
        "batch_size": 100,
        "model_kwargs": {
            "model": "text-embedding-3-large",
            "dimensions": 1024,
            "encoding_format": "float",
        },
    },
    "retriever": {
        "top_k": 2,
    },
    "generator": {
        "model": "gpt-4o-mini",
        "temperature": 0.1,
        "stream": False,
    },
    "text_splitter": {
        "split_by": "word",
        "chunk_size": 500,
        "chunk_overlap": 100,
    },
}

## Data Pipeline Creation

We will use local data base `LocalDB` and `core.data_process` to create a data processing pipeline. This data pipeline will split documents into chunks and work with `LocalDB` to persis the transformed/processed documents in local file `index.faiss` (pickle format).

Data pipeline requires a sequence of `Document` as inputs.

In [6]:
from adalflow.components.data_process import (
    RetrieverOutputToContextStr,
    ToEmbeddings,
    TextSplitter,
)
from adalflow.core import Embedder, Sequential, Component, Generator, ModelClient
from adalflow.core.types import Document, ModelClientType


def prepare_data_pipeline():
    splitter = TextSplitter(**configs["text_splitter"])
    embedder = Embedder(
        model_client=ModelClientType.OPENAI(),
        model_kwargs=configs["embedder"]["model_kwargs"],
    )
    embedder_transformer = ToEmbeddings(
        embedder=embedder, batch_size=configs["embedder"]["batch_size"]
    )
    data_transformer = Sequential(splitter, embedder_transformer)
    return data_transformer

In [7]:
data_transformer = prepare_data_pipeline()
data_transformer

Sequential(
  (0): TextSplitter(split_by=word, chunk_size=500, chunk_overlap=100)
  (1): ToEmbeddings(
    batch_size=100
    (embedder): Embedder(
      model_kwargs={'model': 'text-embedding-3-large', 'dimensions': 1024, 'encoding_format': 'float'}, 
      (model_client): OpenAIClient()
    )
    (batch_embedder): BatchEmbedder(
      (embedder): Embedder(
        model_kwargs={'model': 'text-embedding-3-large', 'dimensions': 1024, 'encoding_format': 'float'}, 
        (model_client): OpenAIClient()
      )
    )
  )
)

## Document Processing

Now we'll download a research paper and convert it into processable documents:

### Download Research Paper

In [8]:
os.makedirs('./data', exist_ok=True)

In [9]:
import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36',
    'Referer': 'https://github.com/IvanReznikov/LangChain4LifeScience/blob/main/data/articles/2410.20354v4.pdf',
}

response = requests.get(
    'https://raw.githubusercontent.com/IvanReznikov/LangChain4LifeScience/refs/heads/main/data/articles/2410.20354v4.pdf',
    headers=headers,
)

pdf_path = "./data/article.pdf"
with open(pdf_path, "wb") as f:
    f.write(response.content)

### PDF Text Extraction


In [10]:
from PyPDF2 import PdfReader
from uuid import uuid4

def pdf_to_documents(pdf_path):
    # Read the PDF file
    reader = PdfReader(pdf_path)
    documents = []

    # Loop through each page to extract text
    for i, page in enumerate(reader.pages):
        text = page.extract_text()
        if text:
            # Create a dictionary for each page
            doc = {
                "meta_data": {"title": f"Page {i + 1} of {pdf_path.split('/')[-1]}"},
                "text": text,
                "id": f"doc{i + 1}"
            }
            documents.append(Document(*doc))

    return documents

docs = pdf_to_documents(pdf_path)

### Transform Documents

Apply the data pipeline to convert documents into embeddings:

In [11]:
transformed_documents = data_transformer(docs)

Splitting Documents in Batches: 100%|██████████| 1/1 [00:00<00:00, 1336.62it/s]
Batch embedding documents: 100%|██████████| 1/1 [00:02<00:00,  2.56s/it]
Adding embeddings to documents from batch: 1it [00:00, 2721.81it/s]


## Database Setup

We will use localdb to manage the `documents`, `transformers`, and the persistance of the transformed documents. This resembles more of the production environment where the embeddings and documents are often handled in data base and can be reused to save cost.

In [12]:
os.makedirs('./index', exist_ok=True)
index_name = "index.faiss"
index_key = "data_transformer"

In [13]:
from typing import List
import os
from adalflow.core.db import LocalDB

def prepare_database_with_index(
    docs: List[Document],
    index_key: str = "data_transformer",
    index_path: str = "./index/default"
):
    if os.path.exists(index_path):
        os.remove(index_path)

    db = LocalDB()
    db.load(docs)

    data_transformer = prepare_data_pipeline()

    # 🧩 Fix: manually register then apply
    key = db.register_transformer(transformer=data_transformer, key=index_key)
    db.transform(key=key)

    db.save_state(index_path)
    print(db)


In [14]:
# prepare the database for retriever

prepare_database_with_index(docs, index_key=index_key, index_path=f"./index/{index_name}")

Splitting Documents in Batches: 100%|██████████| 1/1 [00:00<00:00, 1298.14it/s]
Batch embedding documents: 100%|██████████| 1/1 [00:00<00:00,  1.08it/s]
Adding embeddings to documents from batch: 1it [00:00, 6345.39it/s]

Saved the state of the DB to ./index/index.faiss
LocalDB(name='LocalDB', items=[Document(id=830f3afb-2243-4256-a377-5c98ef000042, text='meta_data', meta_data=text, vector='len: 2', parent_doc_id=None, order=None, score=None), Document(id=fb77bee6-1585-4ca5-900e-fd89c826990b, text='meta_data', meta_data=text, vector='len: 2', parent_doc_id=None, order=None, score=None), Document(id=e67bb650-51b8-401d-babe-751a8ecb30b2, text='meta_data', meta_data=text, vector='len: 2', parent_doc_id=None, order=None, score=None), Document(id=174f8ae9-24e4-4f8c-979e-042b29c80a22, text='meta_data', meta_data=text, vector='len: 2', parent_doc_id=None, order=None, score=None), Document(id=2b51fdd1-4bf4-4ee1-a353-9a615ffa6d6d, text='meta_data', meta_data=text, vector='len: 2', parent_doc_id=None, order=None, score=None), Document(id=f11d6778-8c79-4bfc-9b61-c1f7273c472f, text='meta_data', meta_data=text, vector='len: 2', parent_doc_id=None, order=None, score=None), Document(id=cb4e0c8b-3b6e-4f8b-a8c1-b738097a




## Loading from Persistent Storage

LocalDB `save_state` not only persist the transformed documents, but also the `data_transformer`.

This is really helpful as your retriever needs to have a matching `embedder` to embed the string query. Saving the transformer lets you verify and know what embedder you need to pass to Retriever.

In [15]:
db = LocalDB.load_state(f"./index/{index_name}")
db

LocalDB(name='LocalDB', items=[Document(id=830f3afb-2243-4256-a377-5c98ef000042, text='meta_data', meta_data=text, vector='len: 2', parent_doc_id=None, order=None, score=None), Document(id=fb77bee6-1585-4ca5-900e-fd89c826990b, text='meta_data', meta_data=text, vector='len: 2', parent_doc_id=None, order=None, score=None), Document(id=e67bb650-51b8-401d-babe-751a8ecb30b2, text='meta_data', meta_data=text, vector='len: 2', parent_doc_id=None, order=None, score=None), Document(id=174f8ae9-24e4-4f8c-979e-042b29c80a22, text='meta_data', meta_data=text, vector='len: 2', parent_doc_id=None, order=None, score=None), Document(id=2b51fdd1-4bf4-4ee1-a353-9a615ffa6d6d, text='meta_data', meta_data=text, vector='len: 2', parent_doc_id=None, order=None, score=None), Document(id=f11d6778-8c79-4bfc-9b61-c1f7273c472f, text='meta_data', meta_data=text, vector='len: 2', parent_doc_id=None, order=None, score=None), Document(id=cb4e0c8b-3b6e-4f8b-a8c1-b738097abb15, text='meta_data', meta_data=text, vector='l

In [16]:
db.get_transformed_data(key=index_key)

[Document(id=98aac68c-9cba-46a8-b60b-7f063b2df3d6, text='meta_data', meta_data=text, vector='len: 1024', parent_doc_id=830f3afb-2243-4256-a377-5c98ef000042, order=0, score=None),
 Document(id=da8d6901-3ebc-4751-839e-87ec1fc15598, text='meta_data', meta_data=text, vector='len: 1024', parent_doc_id=fb77bee6-1585-4ca5-900e-fd89c826990b, order=0, score=None),
 Document(id=3f663a0d-4492-4917-938d-a6330d3bca75, text='meta_data', meta_data=text, vector='len: 1024', parent_doc_id=e67bb650-51b8-401d-babe-751a8ecb30b2, order=0, score=None),
 Document(id=98bae3a1-a8d1-40e8-a30f-75e6cec5d445, text='meta_data', meta_data=text, vector='len: 1024', parent_doc_id=174f8ae9-24e4-4f8c-979e-042b29c80a22, order=0, score=None),
 Document(id=a0eb4dd9-6e3b-4f9f-8c5c-1797f01426ac, text='meta_data', meta_data=text, vector='len: 1024', parent_doc_id=2b51fdd1-4bf4-4ee1-a353-9a615ffa6d6d, order=0, score=None),
 Document(id=eae7e279-7b10-42bf-9489-56a3deaf4595, text='meta_data', meta_data=text, vector='len: 1024', 

## RAG Pipeline Implementation

Now we'll create the complete RAG system that combines retrieval and generation:

* db (we will load from index_path), we will use `data_transformer` as the key to load the transformed documents.
* `FAISSRetriever` which will use embeddings to perform semantic search, and return similarity score in range [0, 1].
* `RetrieverOutputToContextStr`: this will convert the retrieved documents to a single str.
* `Generator`: we will use a simple `JsonParser` to output a dict with field `answer`.

In [17]:
from typing import Optional, Any, List
import os

from adalflow.core.db import LocalDB
from adalflow.core.component import Component

from adalflow.components.retriever.faiss_retriever import FAISSRetriever
from adalflow.components.model_client.openai_client import OpenAIClient
from adalflow.core.string_parser import JsonParser



rag_prompt_task_desc = r"""
You are a helpful assistant.

Your task is to answer the query that may or may not come with context information.
When context is provided, you should stick to the context and less on your prior knowledge to answer the query.

Output JSON format:
{
    "answer": "The answer to the query",
}"""


class RAG(Component):
    def __init__(self, index_path: str = f"./index/{index_name}"):
        super().__init__()

        self.db = LocalDB.load_state(index_path)

        # ✅ FIXED: Access transformed data using dict directly
        self.transformed_docs: List[Document] = self.db.transformed_items[index_key]

        embedder = Embedder(
            model_client=ModelClientType.OPENAI(),
            model_kwargs=configs["embedder"]["model_kwargs"],
        )

        self.retriever = FAISSRetriever(
            **configs["retriever"],
            embedder=embedder,
            documents=self.transformed_docs,
            document_map_func=lambda doc: doc.vector,
        )

        self.retriever_output_processors = RetrieverOutputToContextStr(deduplicate=True)

        self.generator = Generator(
            prompt_kwargs={
                "task_desc_str": rag_prompt_task_desc,
            },
            model_client=OpenAIClient(),
            model_kwargs=configs["generator"],
            output_processors=JsonParser(),
        )

    def generate(self, query: str, context: Optional[str] = None) -> Any:
        if not self.generator:
            raise ValueError("Generator is not set")

        prompt_kwargs = {
            "context_str": context,
            "input_str": query,
        }
        response = self.generator(prompt_kwargs=prompt_kwargs)
        return response

    def call(self, query: str) -> Any:
        retrieved_documents = self.retriever(query)

        # 🧩 Re-attach original documents
        for i, retriever_output in enumerate(retrieved_documents):
            retrieved_documents[i].documents = [
                self.transformed_docs[doc_index]
                for doc_index in retriever_output.doc_indices
            ]

        print(f"retrieved_documents: \n {retrieved_documents}\n")

        context_str = self.retriever_output_processors(retrieved_documents)
        print(f"context_str: \n {context_str}\n")

        return self.generate(query, context=context_str), retrieved_documents

In [18]:
# initialize rag and visualize its structure
rag = RAG(index_path=f"./index/{index_name}")
rag

RAG(
  (db): LocalDB(name='LocalDB', items=[Document(id=830f3afb-2243-4256-a377-5c98ef000042, text='meta_data', meta_data=text, vector='len: 2', parent_doc_id=None, order=None, score=None), Document(id=fb77bee6-1585-4ca5-900e-fd89c826990b, text='meta_data', meta_data=text, vector='len: 2', parent_doc_id=None, order=None, score=None), Document(id=e67bb650-51b8-401d-babe-751a8ecb30b2, text='meta_data', meta_data=text, vector='len: 2', parent_doc_id=None, order=None, score=None), Document(id=174f8ae9-24e4-4f8c-979e-042b29c80a22, text='meta_data', meta_data=text, vector='len: 2', parent_doc_id=None, order=None, score=None), Document(id=2b51fdd1-4bf4-4ee1-a353-9a615ffa6d6d, text='meta_data', meta_data=text, vector='len: 2', parent_doc_id=None, order=None, score=None), Document(id=f11d6778-8c79-4bfc-9b61-c1f7273c472f, text='meta_data', meta_data=text, vector='len: 2', parent_doc_id=None, order=None, score=None), Document(id=cb4e0c8b-3b6e-4f8b-a8c1-b738097abb15, text='meta_data', meta_data=te

## Testing and Results

Let's test our RAG system with a specific question about the research paper:

In [19]:
# Test query about watermarking protein generative models
query = "What are the benefits of watermarking protein generative models?"

print(f"Question: {query}\n")
print("Processing query through RAG pipeline...\n")

# Run the complete RAG pipeline
response, retrieved_documents = rag.call(query)

# Extract and display the answer
answer = response.to_dict()['data']['answer']
print("="*60)
print("FINAL ANSWER:")
print("="*60)
print(answer)
print("="*60)

Question: What are the benefits of watermarking protein generative models?

Processing query through RAG pipeline...

retrieved_documents: 
 [RetrieverOutput(id=None, doc_indices=[1, 0], doc_scores=[0.5860000252723694, 0.5860000252723694], query='What are the benefits of watermarking protein generative models?', documents=[Document(id=da8d6901-3ebc-4751-839e-87ec1fc15598, text='meta_data', meta_data=text, vector='len: 1024', parent_doc_id=fb77bee6-1585-4ca5-900e-fd89c826990b, order=0, score=None), Document(id=98aac68c-9cba-46a8-b60b-7f063b2df3d6, text='meta_data', meta_data=text, vector='len: 1024', parent_doc_id=830f3afb-2243-4256-a377-5c98ef000042, order=0, score=None)])]

context_str: 
  meta_data meta_data

FINAL ANSWER:
The benefits of watermarking protein generative models include: 1) Ensuring the authenticity of generated proteins, 2) Protecting intellectual property by marking proprietary models, 3) Enabling traceability of generated data back to the source model, 4) Preventing

In [20]:
response.to_dict()['data']['answer']

'The benefits of watermarking protein generative models include: 1) Ensuring the authenticity of generated proteins, 2) Protecting intellectual property by marking proprietary models, 3) Enabling traceability of generated data back to the source model, 4) Preventing misuse of the models by identifying unauthorized use, and 5) Facilitating accountability in research and applications involving generated proteins.'

In [21]:
retrieved_documents[0].documents

[Document(id=da8d6901-3ebc-4751-839e-87ec1fc15598, text='meta_data', meta_data=text, vector='len: 1024', parent_doc_id=fb77bee6-1585-4ca5-900e-fd89c826990b, order=0, score=None),
 Document(id=98aac68c-9cba-46a8-b60b-7f063b2df3d6, text='meta_data', meta_data=text, vector='len: 1024', parent_doc_id=830f3afb-2243-4256-a377-5c98ef000042, order=0, score=None)]

## Key Features and Benefits

### What Makes This RAG System Special:

1. **Persistent Storage**: The database saves both documents and transformers, ensuring consistency across sessions
2. **Batch Processing**: Efficient embedding generation through batching
3. **Modular Architecture**: Clear separation between data processing and query handling
4. **Production-Ready**: Designed to mirror real-world deployment scenarios
5. **Flexible Configuration**: Centralized config management for easy tuning

### Performance Considerations:

- **Chunk Overlap**: The 100-word overlap ensures context preservation across chunks
- **Top-K Retrieval**: Limited to 2 documents to focus on most relevant information
- **Low Temperature**: Ensures consistent, factual responses
- **Deduplication**: Prevents redundant context in the final prompt