# LangChain Docling RAG Example

This notebook demonstrates how to use LangChain's Docling integration in a RAG (Retrieval-Augmented Generation) pipeline.
It is based on the official LangChain Docling documentation: https://python.langchain.com/docs/integrations/document_loaders/docling/

## 1. Setup and Configuration

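Load settings (API keys and model parameters) from `config.json`, and read the Hugging Face token from Colab secrets or the environment. This cell imports `dotenv`, which Section 2 installs, so run Section 2 first in a fresh environment.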

In [None]:
import json
import os
from pathlib import Path
from tempfile import mkdtemp

# Load configuration
config_path = "../config.json"
with open(config_path, "r", encoding="utf-8") as f:
    config = json.load(f)

# Environment setup for tokenizers (to avoid warnings)
os.environ["TOKENIZERS_PARALLELISM"] = "false"


# Helper function to get environment variables
def _get_env_from_colab_or_os(key):
    try:
        from google.colab import userdata

        try:
            return userdata.get(key)
        except userdata.SecretNotFoundError:
            pass
    except ImportError:
        pass
    return os.getenv(key)


# Load environment variables
from dotenv import load_dotenv

load_dotenv()

HF_TOKEN = _get_env_from_colab_or_os("HF_TOKEN")

print(f"Configuration loaded from: {config_path}")
print(f"HuggingFace Token available: {'Yes' if HF_TOKEN else 'No'}")
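
The structure of `config.json` is not shown in this notebook, so the key names below (`embed_model_id`, `top_k`) are hypothetical. As a minimal sketch, defaults can be merged under whatever the file provides, so that a missing key does not surface later as a `KeyError`:

```python
import json
import tempfile
from pathlib import Path

# Hypothetical defaults; the real config.json keys are not shown in this notebook.
DEFAULTS = {"embed_model_id": "sentence-transformers/all-MiniLM-L6-v2", "top_k": 3}


def load_config(path, defaults=DEFAULTS):
    """Return the defaults, overridden by the values found in the JSON file at `path`."""
    with open(path, "r", encoding="utf-8") as f:
        loaded = json.load(f)
    return {**defaults, **loaded}


# Demonstration with a throwaway file standing in for ../config.json.
demo_path = Path(tempfile.mkdtemp()) / "config.json"
demo_path.write_text(json.dumps({"top_k": 5}), encoding="utf-8")
demo_config = load_config(demo_path)
print(demo_config)
```

Values present in the file take precedence; anything absent falls back to the default.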

## 2. Install Required Dependencies

Install necessary packages including langchain-docling and other dependencies.

In [None]:
# Install required packages
!pip install -qU langchain-docling
!pip install -q --progress-bar off --no-warn-conflicts langchain-core langchain-huggingface langchain-milvus langchain python-dotenv

## 3. Import Libraries

Import the LangChain, Docling, Hugging Face, and Milvus classes used in the rest of the notebook.

In [None]:
# Import required libraries
from langchain_docling import DoclingLoader
from langchain_docling.loader import ExportType
from docling.chunking import HybridChunker
from langchain_huggingface.embeddings import HuggingFaceEmbeddings
from langchain_milvus import Milvus
from langchain.chains import create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain_huggingface import HuggingFaceEndpoint
from langchain_core.prompts import PromptTemplate
from langchain_text_splitters import MarkdownHeaderTextSplitter

print("All libraries imported successfully!")

## 4. Configure Pipeline Parameters

Set the parameters for the RAG pipeline. The values follow the LangChain Docling example and are hard-coded in this cell rather than read from `config`.

In [None]:
# Pipeline parameters from LangChain Docling example
FILE_PATH = ["https://arxiv.org/pdf/2408.09869"]  # Docling Technical Report
EMBED_MODEL_ID = (
    "sentence-transformers/all-MiniLM-L6-v2"  # You can change to BAAI model later
)
GEN_MODEL_ID = "mistralai/Mixtral-8x7B-Instruct-v0.1"
EXPORT_TYPE = ExportType.DOC_CHUNKS  # or ExportType.MARKDOWN
QUESTION = "Which are the main AI models in Docling?"
TOP_K = 3
MILVUS_URI = str(Path(mkdtemp()) / "docling.db")  # Temporary database

# Prompt template
PROMPT = PromptTemplate.from_template(
    "Context information is below.\n"
    "---------------------\n"
    "{context}\n"
    "---------------------\n"
    "Given the context information and not prior knowledge, answer the query.\n"
    "Query: {input}\n"
    "Answer:\n"
)

print(f"File path: {FILE_PATH}")
print(f"Embedding model: {EMBED_MODEL_ID}")
print(f"Export type: {EXPORT_TYPE}")
print(f"Milvus URI: {MILVUS_URI}")
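
To see what the model eventually receives, the template can be filled in by hand. The sketch below uses a plain Python string with `str.format` as a stand-in for the `PromptTemplate` object, and the context text is a made-up placeholder rather than real retrieved content:

```python
# Plain-string stand-in for the PROMPT template defined above.
TEMPLATE = (
    "Context information is below.\n"
    "---------------------\n"
    "{context}\n"
    "---------------------\n"
    "Given the context information and not prior knowledge, answer the query.\n"
    "Query: {input}\n"
    "Answer:\n"
)

# Placeholder context, for illustration only.
rendered = TEMPLATE.format(
    context="(retrieved chunk text goes here)",
    input="Which are the main AI models in Docling?",
)
print(rendered)
```

The `{context}` slot is filled with the retrieved chunks and the `{input}` slot with the user's question.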

## 5. Initialize Docling Document Loader

Create a DoclingLoader for the source files. With `ExportType.DOC_CHUNKS`, the loader chunks each document using a HybridChunker whose tokenizer matches the embedding model.

In [None]:
# Initialize DoclingLoader with HybridChunker
loader = DoclingLoader(
    file_path=FILE_PATH,
    export_type=EXPORT_TYPE,
    chunker=HybridChunker(tokenizer=EMBED_MODEL_ID),
)

print("DoclingLoader initialized successfully!")
print(f"Export type: {EXPORT_TYPE}")
print(f"Chunker: HybridChunker with tokenizer {EMBED_MODEL_ID}")
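
Passing the embedding model ID as the tokenizer is meant to keep chunk sizes aligned with what the embedding model can accept. The sketch below is not Docling's algorithm; it is a simplified illustration of token-aware merging that approximates tokens by whitespace-separated words, and `merge_by_token_budget` and `max_tokens` are hypothetical names:

```python
def merge_by_token_budget(pieces, max_tokens):
    """Greedily merge consecutive text pieces while staying within max_tokens.

    Tokens are approximated by whitespace-separated words; a real tokenizer
    would usually produce different counts. A single piece larger than the
    budget is kept as its own chunk rather than split.
    """
    chunks, current, current_len = [], [], 0
    for piece in pieces:
        n = len(piece.split())
        # Start a new chunk when adding this piece would exceed the budget.
        if current and current_len + n > max_tokens:
            chunks.append(" ".join(current))
            current, current_len = [], 0
        current.append(piece)
        current_len += n
    if current:
        chunks.append(" ".join(current))
    return chunks


pieces = ["Docling parses PDFs.", "It extracts tables.", "It keeps layout.", "Done."]
chunks = merge_by_token_budget(pieces, max_tokens=6)
print(chunks)
```

Counting with the same tokenizer as the embedding model matters because a word-based estimate like this one can undercount, and text beyond the model's input limit may be truncated at embedding time.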

## 6. Load and Process Documents

Call `loader.load()` to fetch, convert, and chunk the source documents, then inspect the first few results.

In [None]:
# Load documents
print("Loading documents...")
docs = loader.load()

print(f"Loaded {len(docs)} documents")

# Inspect first few documents
for i, d in enumerate(docs[:3]):
    print(f"\nDocument {i+1}:")
    print(f"Content preview: {d.page_content[:100]}...")
    print(f"Metadata keys: {list(d.metadata.keys())}")

## 7. Determine Document Splits

Derive the splits from the export type. With `DOC_CHUNKS`, the loader output is already chunked and is used as-is; with `MARKDOWN`, each document is split on its Markdown headers.

In [None]:
# Determine splits based on export type
if EXPORT_TYPE == ExportType.DOC_CHUNKS:
    splits = docs
    print(f"Using DOC_CHUNKS: {len(splits)} splits")
elif EXPORT_TYPE == ExportType.MARKDOWN:
    # Use MarkdownHeaderTextSplitter for Markdown export
    splitter = MarkdownHeaderTextSplitter(
        headers_to_split_on=[
            ("#", "Header_1"),
            ("##", "Header_2"),
            ("###", "Header_3"),
        ],
    )
    splits = [split for doc in docs for split in splitter.split_text(doc.page_content)]
    print(f"Using MARKDOWN with header splitting: {len(splits)} splits")
else:
    raise ValueError(f"Unexpected export type: {EXPORT_TYPE}")

# Inspect some sample splits
print("\nSample splits:")
for i, d in enumerate(splits[:3]):
    print(f"\nSplit {i+1}: {d.page_content[:150]}...")
print("...")
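
For intuition about the `MARKDOWN` branch, header-based splitting can be sketched in plain Python. This is a simplified stand-in, not the `MarkdownHeaderTextSplitter` implementation: it ignores code fences and records only the nearest header, and the function name `split_by_headers` is hypothetical:

```python
def split_by_headers(markdown_text, max_level=3):
    """Split Markdown into sections, one per header of level 1..max_level."""
    sections = []
    header, body = None, []

    def flush():
        # Emit the accumulated body, skipping sections with no text.
        text = "\n".join(body).strip()
        if text:
            sections.append({"header": header, "content": text})

    for line in markdown_text.splitlines():
        stripped = line.lstrip()
        marks = len(stripped) - len(stripped.lstrip("#"))
        # A header is 1..max_level '#' characters followed by a space.
        is_header = 1 <= marks <= max_level and stripped[marks:marks + 1] == " "
        if is_header:
            flush()
            header, body = stripped[marks:].strip(), []
        else:
            body.append(line)
    flush()
    return sections


sample = "# Intro\nDocling overview.\n## Models\nLayout model.\nTable model.\n"
sections = split_by_headers(sample)
print(sections)
```

Each section keeps its header as metadata, which is what makes header-based splits easier to trace back to a place in the document.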

## 8. Setup Vector Store with Embeddings

Initialize embedding model and Milvus vector store for document indexing.

In [None]:
# Initialize embeddings
print("Initializing embeddings model...")
embedding = HuggingFaceEmbeddings(model_name=EMBED_MODEL_ID)

# Create Milvus vector store from documents
print("Creating Milvus vector store...")
vectorstore = Milvus.from_documents(
    documents=splits,
    embedding=embedding,
    collection_name="docling_demo",
    connection_args={"uri": MILVUS_URI},
    index_params={"index_type": "FLAT"},
    drop_old=True,
)

print("Vector store created successfully!")
print("Collection: docling_demo")
print("Index type: FLAT")
print(f"Number of documents indexed: {len(splits)}")
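
A `FLAT` index compares the query against every stored vector, which gives exact results and is reasonable for a small demo collection. The sketch below illustrates that exhaustive search on toy 2-D vectors. Cosine similarity is an assumption made for illustration; the metric actually used depends on the Milvus configuration:

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def flat_search(query, vectors, k):
    """Score the query against every vector and return the top-k (index, score) pairs."""
    scored = [(i, cosine_similarity(query, v)) for i, v in enumerate(vectors)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy 2-D vectors standing in for real embeddings.
toy_vectors = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
top = flat_search([1.0, 0.1], toy_vectors, k=2)
print(top)
```

The cost grows linearly with the number of stored vectors, which is why larger collections typically use approximate index types instead.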

## 9. Setup Retrieval Chain

Build a retriever over the vector store, initialize the LLM, and combine them into a retrieval chain.

In [None]:
# Create retriever
retriever = vectorstore.as_retriever(search_kwargs={"k": TOP_K})

# Initialize LLM (using HuggingFace Endpoint)
llm = HuggingFaceEndpoint(
    repo_id=GEN_MODEL_ID,
    huggingfacehub_api_token=HF_TOKEN,
    task="text-generation",
)

# Create chains
question_answer_chain = create_stuff_documents_chain(llm, PROMPT)
rag_chain = create_retrieval_chain(retriever, question_answer_chain)

print("RAG chain created successfully!")
print(f"Retriever top-k: {TOP_K}")
print(f"LLM model: {GEN_MODEL_ID}")
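
Conceptually, the chain retrieves the top-k passages, joins ("stuffs") them into one context string, fills the prompt, and calls the model. The sketch below mimics that flow with stand-ins: `toy_retriever` ranks by word overlap instead of embeddings, `toy_llm` is a placeholder that calls no model, and the prompt is shortened. All three names are hypothetical:

```python
def toy_retriever(question, corpus, k):
    """Rank passages by the number of words they share with the question."""
    q_words = set(question.lower().split())
    ranked = sorted(
        corpus,
        key=lambda passage: len(q_words & set(passage.lower().split())),
        reverse=True,
    )
    return ranked[:k]


def toy_llm(prompt):
    """Placeholder for a real model call; reports the prompt size only."""
    return f"(model output for a {len(prompt)}-character prompt)"


def toy_rag_chain(question, corpus, k=2):
    context_docs = toy_retriever(question, corpus, k)
    # "Stuffing" means joining all retrieved passages into a single context string.
    context = "\n\n".join(context_docs)
    prompt = f"Context:\n{context}\n\nQuery: {question}\nAnswer:\n"
    return {"input": question, "context": context_docs, "answer": toy_llm(prompt)}


corpus = [
    "Docling uses a layout analysis model.",
    "Milvus stores vectors.",
    "Docling uses a table structure model.",
]
result = toy_rag_chain("Which models does Docling use?", corpus)
print(result["context"])
```

The returned dictionary uses the same `input`, `context`, and `answer` keys that the notebook reads from the real chain's response in the next section.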

## 10. Test Document Loading and Retrieval

Run the complete pipeline on the sample question, then print the answer and the retrieved source chunks with their metadata.

In [None]:
# Helper function to clip text for display
def clip_text(text, threshold=100):
    return f"{text[:threshold]}..." if len(text) > threshold else text


# Run the RAG chain
print(f"Question: {QUESTION}")
print("\nProcessing...")

resp_dict = rag_chain.invoke({"input": QUESTION})

# Display results
clipped_answer = clip_text(resp_dict["answer"], threshold=350)
print(f"\nQuestion:\n{resp_dict['input']}")
print(f"\nAnswer:\n{clipped_answer}")

# Display retrieved context sources
print("\nRetrieved Sources:")
for i, doc in enumerate(resp_dict["context"]):
    print(f"\nSource {i + 1}:")
    print(f"  text: {json.dumps(clip_text(doc.page_content, threshold=350))}")

    # Display metadata (excluding 'pk' if present)
    for key in doc.metadata:
        if key != "pk":
            val = doc.metadata.get(key)
            clipped_val = clip_text(val) if isinstance(val, str) else val
            print(f"  {key}: {clipped_val}")

## 11. Alternative: Test with Local PDF Files

Example of how to use the system with local PDF files instead of URLs.

In [None]:
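# Example with local files (uncomment to use)
# LOCAL_PDF_DIR = Path("../data/pdf/")  # Adjust path as needed
#
# # The loader takes file paths or URLs, so list the PDFs in the directory
# # (assumption: passing the directory itself would not be expanded).
# local_paths = [str(p) for p in sorted(LOCAL_PDF_DIR.glob("*.pdf"))]
#
# # Create loader for local files
# local_loader = DoclingLoader(
#     file_path=local_paths,
#     export_type=ExportType.DOC_CHUNKS,
#     chunker=HybridChunker(tokenizer=EMBED_MODEL_ID),
# )
#
# # Load and process local documents
# local_docs = local_loader.load()
# print(f"Loaded {len(local_docs)} local documents")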

print("Local file processing example is commented out.")
print("Uncomment the code above to test with local PDF files.")

## Summary

This notebook demonstrates the LangChain Docling integration for RAG:

1. **Document Loading**: Uses DoclingLoader to parse PDFs with rich metadata
2. **Chunking**: Supports both DOC_CHUNKS and MARKDOWN export types
3. **Vector Storage**: Integrates with Milvus for efficient similarity search
4. **Retrieval**: Creates a complete RAG chain with HuggingFace models
5. **Rich Metadata**: Preserves document structure, headings, and bounding boxes

### Key Benefits of Docling:
- Preserves document layout and structure
- Extracts tables and other elements properly
- Provides rich metadata for better grounding
- Works with multiple document formats (PDF, DOCX, PPTX, etc.)

### Next Steps:
- Swap the embedding model for a BAAI model
- Add BM25 hybrid retrieval
- Implement Vietnamese language support
- Add proper error handling and logging