# Deploy a RAG system with Mosaic AI Agent Evaluation and Lakehouse Applications

In this chapter, you will build a **Databricks Docs Assistant** that answers user questions about Databricks, using:
- a vector database / index
- an LLM endpoint
- MLflow for tracking and deployment
- the Lakehouse for data storage


## In this notebook, we build the pipeline that feeds the vector store.

Our goal is to develop an assistant that can read the Databricks docs to answer developer questions.


## Extraction and preprocessing of the contextual documents

This notebook creates a Delta table in Unity Catalog that will later serve as the source of the vector store.

We will use two chunking methods:
- the first applies classic recursive chunking to a PDF source
- the second uses Docling to extract and chunk the documents.


 

In [0]:
%pip install -U --quiet databricks-langchain==0.6.0 mlflow[databricks]==3.4.0  langchain==0.3.27 langchain_core==0.3.74 bs4 langchain_community markdownify docling pypdf2 pypdf

In [0]:
dbutils.library.restartPython()

# 0- Configuration

In a previous notebook, we uploaded the PDFs from a GitHub repository to a Unity Catalog volume:

- catalog: demo
- schema: demo
- volume: /Volumes/{catalog}/{schema}/raw_data/pdf
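
The volume path above follows the pattern /Volumes/{catalog}/{schema}/{volume}/{subdirectory}. As a minimal sketch, the helper below builds that path and guarantees the trailing slash that later cells rely on when they concatenate a file name to the path. The function name build_volume_path and its default arguments are illustrative assumptions, not part of the notebook's configuration.

```python
import posixpath


def build_volume_path(catalog: str, schema: str,
                      volume: str = "raw_data", subdir: str = "pdf") -> str:
    """Return '/Volumes/<catalog>/<schema>/<volume>/<subdir>/' with a trailing slash.

    Hypothetical helper for illustration; the notebook builds the path with an f-string.
    """
    for part in (catalog, schema, volume):
        if not part or "/" in part:
            raise ValueError(f"invalid path component: {part!r}")
    path = posixpath.join("/Volumes", catalog, schema, volume, subdir.strip("/"))
    # Always end with exactly one slash so that path + file_name is a valid file path
    return path.rstrip("/") + "/"


path_volume = build_volume_path("demo", "demo")
print(path_volume)
```

With the catalog and schema both set to demo, this yields the same path as the one listed above.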


## Catalog and schema 
The %run magic command loads the Unity Catalog configuration and switches to our "demo" working schema.


In [0]:
%run "../_config/config_unity_catalog"


## Source extract
The source is a collection of PDFs from the dedicated UC volume.

In [0]:
# PDF volume path
path_volume = f"/Volumes/{catalog}/{schema}/raw_data/pdf/"
pdf_list = dbutils.fs.ls(path_volume)
print("PDF list : ") 
for pdf_infos in pdf_list[:2] : 
    print(pdf_infos)

# 1- Create a table with classic chunks.  

### Extracting and chunking the PDF documents

For this demo, we load a few PDF documents directly and extract their contents page by page.
We use LangChain, which provides integrated tools for each stage of building an agent.

Here are the main steps:

- List the PDF files stored in the Unity Catalog volume
- Load each PDF page by page
- Split the pages into small chunks so that vector search can index and retrieve them accurately
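
The splitter configured in the next cell uses chunk_size=1000 and chunk_overlap=200. To illustrate what these two parameters mean, here is a minimal sketch of fixed-size chunking with overlap in plain Python. It is a simplification: RecursiveCharacterTextSplitter additionally tries to cut on natural separators (paragraphs, lines, spaces) before falling back to raw characters, so its real chunk boundaries will differ. The function name split_with_overlap is a hypothetical helper.

```python
def split_with_overlap(text: str, chunk_size: int = 1000,
                       chunk_overlap: int = 200) -> list[str]:
    """Cut text into windows of chunk_size characters that share chunk_overlap characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current window already reaches the end of the text
    return chunks


sample = "".join(str(i % 10) for i in range(2500))
chunks = split_with_overlap(sample)
print([len(c) for c in chunks])  # [1000, 1000, 900]
```

The overlap means the last 200 characters of one chunk reappear at the start of the next, which reduces the risk of cutting a sentence away from the context needed to retrieve it.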



In [0]:
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import PyPDFLoader
from langchain_core.documents import Document
import time

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    length_function=len,
    is_separator_regex=False,
)

def classic_pdf_splitter(pdf_path) : 
    """
    Load and split a PDF using PyPDFLoader and RecursiveCharacterTextSplitter.
    
    Args:
        pdf_path: Path to the PDF file
    
    Returns:
        List of Document chunks with metadata
    """
    print(f"Processing {pdf_path}")
    loader = PyPDFLoader(pdf_path)
    pages = loader.load()
    # Split the loaded pages into chunks and return them
    return text_splitter.split_documents(pages)

def create_classic_documents(path_volume) : 
    """
    For each PDF, add the Document results from classic_pdf_splitter
    to the list of documents extracted from the sources. 
    
    Args:
        path_volume: Path to the volume containing PDF files
    
    Returns:
        List of all document chunks from all PDFs
    """
    documents = []
    start_time = time.time()
    for pdf_infos in dbutils.fs.ls(path_volume) :
        print(f"Processing {pdf_infos.name}")
        documents += classic_pdf_splitter(f"{path_volume}{pdf_infos.name}")
    
    elapsed = time.time() - start_time

    # Report the total processing time
    print(f"classic_pdf_splitter took {elapsed:.1f} seconds to process classic documents")
    return documents


documents = create_classic_documents(path_volume)
print(f"{len(documents)} chunks ! ")



In [0]:
from pathlib import Path
from pyspark.sql.types import StructType, StructField, StringType

# Explicit Spark schema for the chunks. It is named chunk_schema so that it does not
# overwrite the `schema` variable used earlier to build the volume path.
chunk_schema = StructType([
    StructField("content", StringType(), True),
    StructField("source", StringType(), True),
    StructField("source_name", StringType(), True),
])

# One row per chunk: the text, the full source path, and the file name
doc_dict = [{'content': doc.page_content,
             'source': doc.metadata['source'],
             'source_name': Path(doc.metadata['source']).name,
             } for doc in documents]

df_spark = spark.createDataFrame(doc_dict, schema=chunk_schema)
display(df_spark.limit(2))

The Delta table needs:
- 'delta.enableChangeDataFeed' = 'true', so that a vector search index can be kept in sync with the table
- an id
- a timestamp for the creation date

In [0]:
%sql
CREATE OR REPLACE TABLE pdf_document (
  id BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
  content STRING,
  source STRING,
  source_name STRING,
  created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP()
)
USING DELTA
TBLPROPERTIES (
  'delta.enableChangeDataFeed' = 'true',
  'delta.feature.allowColumnDefaults' = 'enabled'
);
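
The same table definition is used for both chunk tables in this notebook (pdf_document and pdf_document_docling). As a sketch, the statement could be generated from the table name so that the two definitions cannot drift apart. The helper build_chunk_table_ddl and its name check are assumptions made for illustration; in a Databricks notebook the resulting string could be passed to spark.sql.

```python
import re

# Conservative pattern for a simple, unqualified table name (an assumption of this sketch)
_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def build_chunk_table_ddl(table_name: str) -> str:
    """Return the CREATE OR REPLACE TABLE statement for a chunk table."""
    if not _IDENTIFIER.fullmatch(table_name):
        raise ValueError(f"unexpected table name: {table_name!r}")
    return (
        f"CREATE OR REPLACE TABLE {table_name} (\n"
        "  id BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,\n"
        "  content STRING,\n"
        "  source STRING,\n"
        "  source_name STRING,\n"
        "  created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP()\n"
        ")\n"
        "USING DELTA\n"
        "TBLPROPERTIES (\n"
        "  'delta.enableChangeDataFeed' = 'true',\n"
        "  'delta.feature.allowColumnDefaults' = 'enabled'\n"
        ")"
    )


ddl = build_chunk_table_ddl("pdf_document")
print(ddl)
```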

### Loading the vector store source table: "demo.demo.pdf_document"

In [0]:
# Name of the UC table, resolved against the current catalog and schema
table_name = "pdf_document"

# Load the chunks into the table created above.
# overwriteSchema is intentionally left out: with it, the table schema would be replaced
# by the DataFrame schema and the id and created_at columns would be lost.
df_spark.write \
    .format("delta") \
    .mode("overwrite") \
    .option("delta.enableChangeDataFeed", "true") \
    .saveAsTable(table_name)

### Check that the table exists

In [0]:
%sql
SELECT * FROM pdf_document limit 2;

# 2- Create a table with the second method : Docling 

In [0]:
from pathlib import Path
from docling.document_converter import DocumentConverter
from docling.chunking import HybridChunker
import time

doc_converter = DocumentConverter()
chunker = HybridChunker(
    tokenizer="sentence-transformers/all-MiniLM-L6-v2",
    max_tokens=512,
    include_section_info=True,
)

def docling_pdf_splitter(pdf_path) : 
    """
    Load and split a PDF using Docling's HybridChunker.
    
    Args:
        pdf_path: Path to the PDF file
    
    Returns:
        List of dictionaries containing chunk content and metadata
    """
    print(f"Processing {pdf_path}")

    doc_result = doc_converter.convert(pdf_path).document
    print(f"doc_result {doc_result.name}")
    chunk_iter = chunker.chunk(dl_doc=doc_result)
    data_docling = []
    for chunk in chunk_iter:
        # Enrich the chunk text with its surrounding context before storing it
        enriched_text = chunker.contextualize(chunk=chunk)
        row = {
            'content': enriched_text,
            'source': pdf_path,
            'source_name': Path(pdf_path).name,
        }
        data_docling.append(row)
    # Report the actual number of chunks (also safe when the document yields none)
    print(f"=> {len(data_docling)} chunks")
    return data_docling

def create_docling_documents(path_volume) : 
    """
    For each PDF, add the Document results from docling_pdf_splitter
    to the list of documents extracted from the sources. 
    
    Args:
        path_volume: Path to the volume containing PDF files
    
    Returns:
        List of all document chunks from all PDFs
    """
    documents = []
    start_time = time.time()
    for pdf_infos in dbutils.fs.ls(path_volume) :
        print(f"Processing {pdf_infos.name}")
        documents += docling_pdf_splitter(f"{path_volume}{pdf_infos.name}")
    
    elapsed = time.time() - start_time

    # Report the total processing time
    print(f"docling_pdf_splitter took {elapsed:.1f} seconds to process docling documents")
    return documents


docling_documents = create_docling_documents(path_volume)
print(f"{len(docling_documents)} chunks ! ")
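
The Docling splitter above stores the result of chunker.contextualize(chunk=chunk) rather than the raw chunk text. Conceptually, this enriches each chunk with context such as the section headings it belongs to, so that a chunk stays meaningful when retrieved on its own. The sketch below illustrates the idea in plain Python. SimpleChunk and this contextualize function are hypothetical stand-ins, not Docling's classes, and the exact text produced by Docling may differ.

```python
from dataclasses import dataclass, field


@dataclass
class SimpleChunk:
    """Hypothetical stand-in for a chunk: body text plus the headings above it."""
    text: str
    headings: list[str] = field(default_factory=list)


def contextualize(chunk: SimpleChunk, delim: str = "\n") -> str:
    """Prefix the chunk text with its non-empty headings, one per line."""
    parts = [h for h in chunk.headings if h] + [chunk.text]
    return delim.join(parts)


chunk = SimpleChunk(
    text="Delta tables support change data feed.",
    headings=["Delta Lake", "Change data feed"],
)
print(contextualize(chunk))
```

A chunk without headings is returned unchanged, so the enrichment only adds text where structural context exists.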

In [0]:
from pyspark.sql.types import StructType, StructField, StringType

# Explicit Spark schema for the chunks. It is named chunk_schema so that it does not
# overwrite the `schema` variable used earlier to build the volume path.
chunk_schema = StructType([
    StructField("content", StringType(), True),
    StructField("source", StringType(), True),
    StructField("source_name", StringType(), True),
])
df_spark_docling = spark.createDataFrame(docling_documents, schema=chunk_schema)
display(df_spark_docling.limit(2))

The Delta table needs:
- 'delta.enableChangeDataFeed' = 'true', so that a vector search index can be kept in sync with the table
- an id
- a timestamp for the creation date

### Docling
Docling simplifies document processing: it parses diverse formats, including advanced PDF understanding, and integrates with the gen AI ecosystem.

It is an open-source project started at IBM.

Its PDF understanding covers page layout, reading order, table structure, code, formulas, image classification, and more.

It also supports several Visual Language Models, such as GraniteDocling, for image-based documents.

It is also a very popular repository on GitHub.

In [0]:
%sql
CREATE OR REPLACE TABLE pdf_document_docling (
  id BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
  content STRING,
  source STRING,
  source_name STRING,
  created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP()
)
USING DELTA
TBLPROPERTIES (
  'delta.enableChangeDataFeed' = 'true',
  'delta.feature.allowColumnDefaults' = 'enabled'
);

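### Loading the vector store source table: "demo.demo.pdf_document_docling"

In [0]:
# Name of the UC table, resolved against the current catalog and schema
table_name = "pdf_document_docling"

# Load the Docling chunks (df_spark_docling, not the classic df_spark) into the table created above.
# overwriteSchema is intentionally left out: with it, the table schema would be replaced
# by the DataFrame schema and the id and created_at columns would be lost.
df_spark_docling.write \
    .format("delta") \
    .mode("overwrite") \
    .option("delta.enableChangeDataFeed", "true") \
    .saveAsTable(table_name)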

### Check that the table exists

In [0]:
%sql
SELECT * FROM pdf_document_docling limit 2;