# Deploy a RAG system with Mosaic AI Agent Evaluation and Lakehouse Applications

In this chapter, you will build an **ecommerce Docs Assistant** that helps users answer questions about Databricks, using:
- a vector database / index
- an LLM endpoint
- MLflow for tracking and deployment
- the Lakehouse for data storage


## In this notebook, we will build the pipeline that feeds the vector store

Our goal is to develop an assistant that can read the Databricks documentation to answer developer questions.


## Extracting and preprocessing the contextual documents

This notebook creates a Delta table in Unity Catalog that will later serve as the source of the vector store.

We will use two chunking methods:
- the first applies classic recursive chunking to a PDF source
- the second uses docling to extract and chunk the documents.


 

In [0]:
%pip install -U --quiet databricks-langchain==0.6.0 mlflow[databricks]==3.4.0  langchain==0.3.27 langchain_core==0.3.74 bs4 langchain_community markdownify docling pypdf2 pypdf


dbutils.library.restartPython()

# 0- Configuration

In a first notebook, we uploaded the PDFs from a GitHub repository to a Unity Catalog volume:

- catalog: `demo`
- schema: `demo`
- volume: `/Volumes/{catalog}/{schema}/raw_data/pdf`


## Catalog and schema


In [0]:
catalog = "demo"
schema = "demo"


## Source extract
The source is a set of PDF files stored in the dedicated UC volume.

In [0]:
# PDF volume path
path_volume = f"/Volumes/{catalog}/{schema}/raw_data/pdf/"
pdf_list = dbutils.fs.ls(path_volume) 
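
The listing returned by `dbutils.fs.ls` contains every entry in the volume folder, not only PDFs. A minimal sketch of a filter, assuming you pass it the entry names (for example `[f.name for f in pdf_list]`); the helper name `keep_pdf_names` and the sample listing are hypothetical:

```python
from pathlib import PurePosixPath


def keep_pdf_names(names: list[str]) -> list[str]:
    """Return only the entries whose extension is .pdf (case-insensitive), sorted."""
    return sorted(n for n in names if PurePosixPath(n).suffix.lower() == ".pdf")


# Hypothetical listing; in the notebook this would be [f.name for f in pdf_list]
listing = ["guide.pdf", "README.md", "Report.PDF", "images/"]
pdf_names = keep_pdf_names(listing)
print(pdf_names)
```

Filtering up front avoids handing a non-PDF file to the PDF loader later in the pipeline.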

# 1- Create a table with classic chunks.  

### Loading the PDF documents

For this demo, we directly load a few PDF documents and extract their contents.
We use LangChain, which provides integrated tools for each stage of agent authoring.

Here are the main steps:

- Load each PDF from the volume
- Extract the text of each page as a LangChain `Document`

### Splitting documentation pages into small chunks

LLMs typically have a maximum input context length, and you won't be able to compute embeddings for very long texts. In addition, the longer your context is, the longer the model takes to respond.

Document preparation is key for your model to perform well, and multiple strategies exist depending on your dataset:

- Split documents into small chunks (`chunk_size=1000`, `chunk_overlap=100` in our case).
- Truncate documents to a fixed length.
- The chunk size depends on your content and how you'll be using it to craft your prompt. Adding multiple small doc chunks to your prompt might give different results than sending only one big chunk.
- Split into big chunks and ask a model to summarize each chunk as a one-off job, for faster live inference.
- Create multiple agents to evaluate each bigger document in parallel, and ask a final agent to craft your answer.
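
To make the effect of `chunk_size` and `chunk_overlap` concrete, here is a deliberately simplified sliding-window splitter in plain Python. It is a sketch for illustration only: `split_with_overlap` is a hypothetical helper, and it is not how `RecursiveCharacterTextSplitter` works internally (that splitter also tries to cut on separators such as paragraphs and line breaks).

```python
def split_with_overlap(text: str, chunk_size: int = 1000, chunk_overlap: int = 100) -> list[str]:
    """Cut text into windows of at most chunk_size characters.

    Consecutive windows share chunk_overlap characters, so a sentence cut
    at a boundary still appears whole in at least one chunk.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks


sample = "".join(str(i % 10) for i in range(2500))
chunks = split_with_overlap(sample, chunk_size=1000, chunk_overlap=100)
print([len(c) for c in chunks])  # [1000, 1000, 700]
```

With these values the window advances 900 characters at a time, so the last 100 characters of one chunk are repeated at the start of the next.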

In [0]:
from langchain.docstore.document import Document
from langchain.text_splitter import RecursiveCharacterTextSplitter
#from langchain_community.document_transformers import MarkdownifyTransformer
from langchain_community.document_loaders import PyPDFLoader


#md = MarkdownifyTransformer(["script", "style", "nav", "footer"])

def custom_pdf_splitter(pdf_path, chunk_size=500, chunk_overlap=100):
    """
        Load the PDF at pdf_path with PyPDFLoader (one Document per page)
        and split the pages into smaller chunks.
        chunk_size and chunk_overlap are parameters of RecursiveCharacterTextSplitter
    """
    # Load the pdf doc
    loader = PyPDFLoader(pdf_path)

    # Create a list of LangChain documents
    documents =  loader.load()


    # Apply text splitter, you can add a separator if needed
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=chunk_overlap
    )
    
    return text_splitter.split_documents(documents)



def create_documents(path_volume) : 
    """
        For each PDF in the volume, add the chunks returned by custom_pdf_splitter
        to the list of documents.
    """
    documents = []
    
    for pdf_infos in dbutils.fs.ls(path_volume) :
        documents += custom_pdf_splitter(f"{path_volume}{pdf_infos.name}", chunk_size=1000)
        
    return documents


documents = create_documents(path_volume)

In [0]:
print(documents[0].metadata)
print(documents[0].page_content[:100])

In [0]:
print(len(documents))
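
Beyond the chunk count, a quick look at the distribution of chunk lengths helps confirm that the splitter settings behave as intended. The following is a sketch under the assumption that you feed it the chunk texts, for example `[d.page_content for d in documents]`; `chunk_length_summary` and the sample texts are hypothetical:

```python
from statistics import mean, median


def chunk_length_summary(texts: list[str]) -> dict:
    """Summarize the character lengths of a list of chunk texts."""
    lengths = [len(t) for t in texts]
    if not lengths:
        return {"count": 0, "min": 0, "max": 0, "mean": 0.0, "median": 0.0}
    return {
        "count": len(lengths),
        "min": min(lengths),
        "max": max(lengths),
        "mean": mean(lengths),
        "median": median(lengths),
    }


# Hypothetical sample; in the notebook: [d.page_content for d in documents]
sample_texts = ["a" * 120, "b" * 980, "c" * 1000]
summary = chunk_length_summary(sample_texts)
print(summary)
```

A maximum above `chunk_size`, or many very short chunks, would suggest revisiting the splitter parameters.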

#### Convert the list of chunks split with LangChain into columns: content, source, and page

In [0]:
import pandas as pd
from pyspark.sql import SparkSession

# From the list of LangChain chunks, build the rows of the vector store table.
data = [{
            'content': doc.page_content,
            'source': str(doc.metadata['source']), 
            'page': str(doc.metadata['page_label']), 
            # Add as many fields as you need
        }
        for doc in documents ]

# Build a Spark DataFrame that will be written to the UC table
df_pandas = pd.DataFrame(data)
df_spark = spark.createDataFrame(df_pandas)

In [0]:
display(df_spark)
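
The row-building cell indexes `doc.metadata['page_label']` directly, which raises a `KeyError` if that key is absent from a chunk's metadata. A defensive variant could look like the sketch below. `FakeDocument` is a hypothetical stand-in that only mimics the `page_content` and `metadata` attributes of a LangChain `Document`, and falling back to the `page` key is an assumption about what the loader provides:

```python
from dataclasses import dataclass, field


@dataclass
class FakeDocument:
    """Hypothetical stand-in exposing the two attributes used by to_row."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def to_row(doc) -> dict:
    """Build one table row, tolerating missing metadata keys."""
    meta = doc.metadata
    # Prefer the printed page label, fall back to the page index, else None
    page = meta.get("page_label", meta.get("page"))
    return {
        "content": doc.page_content,
        "source": str(meta.get("source", "")),
        "page": None if page is None else str(page),
    }


docs = [
    FakeDocument("hello", {"source": "a.pdf", "page_label": "iv", "page": 3}),
    FakeDocument("world", {"source": "b.pdf", "page": 0}),
    FakeDocument("no metadata"),
]
rows = [to_row(d) for d in docs]
print(rows)
```

Keeping `page` as a string (or `None`) matches the `page STRING` column of the target table.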

### Create a Delta table with the documents

In [0]:
%sql
USE CATALOG demo;
USE SCHEMA demo;

CREATE OR REPLACE TABLE pdf_document_raw (
  id BIGINT GENERATED ALWAYS AS IDENTITY,
  content STRING,
  source STRING,
  page STRING,
  created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP()
)
USING DELTA
TBLPROPERTIES (
  'delta.enableChangeDataFeed' = 'true',
  'delta.feature.allowColumnDefaults' = 'enabled'
);

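The `delta.feature.allowColumnDefaults` table feature is what allows the `DEFAULT CURRENT_TIMESTAMP()` clause on `created_at`. Set it explicitly as shown: column defaults on Delta tables are only available on recent Databricks Runtime versions (check the documentation for the minimum version that applies to your workspace), and the `DEFAULT` clause is rejected when the feature is not enabled on the table.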

In [0]:
# Define the name of your UC table (catalog.schema.table)
table_name = f"{catalog}.{schema}.pdf_document_raw"

# Append the chunks to the UC table created above
df_spark.write \
    .format("delta") \
    .mode("append") \
    .option("delta.enableChangeDataFeed", "true") \
    .saveAsTable(table_name)

# 2- Use the docling preparation module

In [0]:
import json
import logging
import time
from pathlib import Path

from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import (
    PdfPipelineOptions,
)
from docling.document_converter import DocumentConverter, PdfFormatOption
from docling.datamodel.document import InputDocument
from docling.chunking import HybridChunker, HierarchicalChunker

# Chunker that follows the document hierarchy (sections, lists, tables)
chunker = HierarchicalChunker()

# Docling Parse without EasyOCR
# -------------------------
pipeline_options = PdfPipelineOptions()
pipeline_options.do_ocr = False
pipeline_options.do_table_structure = True
pipeline_options.table_structure_options.do_cell_matching = True

doc_converter = DocumentConverter(
    format_options={
       InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)     
    }
)

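def docling_pdf_splitter(input_doc_path, chunk_size=500, chunk_overlap=100):
    """
        Convert the PDF with docling and chunk it with HierarchicalChunker.
        chunk_size and chunk_overlap mirror custom_pdf_splitter but are
        not used by the chunker call below.
    """
    print(f"Processing {input_doc_path}")
    start_time = time.time()
    doc_result = doc_converter.convert(input_doc_path).document
    elapsed = time.time() - start_time
    # Report how long the conversion took
    print(f"docling_pdf_splitter took {elapsed:.1f} seconds to process {input_doc_path}")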


    print(f"doc_result {doc_result}")
    chunk_iter = chunker.chunk(dl_doc=doc_result) #, chunk_size=chunk_size, chunk_overlap=chunk_overlap)
    data_docling = []
    for i, chunk in enumerate(chunk_iter):
        print(f"=== {i} ===")
        #print(f"chunk.text:\n{f'{chunk.text[:300]}…'!r}")

        enriched_text = chunker.contextualize(chunk=chunk)
        print(f"chunker.contextualize(chunk):\n{f'{enriched_text[:300]}…'!r}")

        row = {
            #'content': chunk,
            'content' : enriched_text,
            'source': input_doc_path,  
            #'page': str(doc_result.metadata['page_label'])
        }
       
        data_docling.append(row)

    return data_docling

def create_docling_documents(path_volume) : 
    """
        For each PDF in the volume, add the rows returned by docling_pdf_splitter
        to the list of documents.
    """
    documents = []
    
    for pdf_infos in dbutils.fs.ls(path_volume) :
        print(f"Processing {pdf_infos.name}")
        documents += docling_pdf_splitter(f"{path_volume}{pdf_infos.name}", chunk_size=1000)
        
    return documents


docling_documents = create_docling_documents(path_volume)

In [0]:
print(len(docling_documents))
#print(type(docling_documents[0]))
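
The total count does not show whether every PDF actually produced chunks. A small per-source tally makes an empty conversion easy to spot. This is a sketch that assumes rows shaped like the ones built above (dicts with a `source` key); `chunks_per_source` and the sample data are hypothetical:

```python
from collections import Counter


def chunks_per_source(rows: list[dict], expected_sources=()) -> dict[str, int]:
    """Count chunks per source file, reporting 0 for expected files with no chunk."""
    counts = Counter(row["source"] for row in rows)
    for source in expected_sources:
        counts.setdefault(source, 0)  # make silent failures visible
    return dict(counts)


# Hypothetical sample; in the notebook: chunks_per_source(docling_documents)
sample_rows = [
    {"content": "intro", "source": "a.pdf"},
    {"content": "details", "source": "a.pdf"},
    {"content": "summary", "source": "b.pdf"},
]
tally = chunks_per_source(sample_rows, expected_sources=["a.pdf", "b.pdf", "c.pdf"])
print(tally)
```

The same helper can be applied to the rows of the first chunking method to compare how many chunks each approach yields per document.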


In [0]:

from pyspark.sql.types import StructType, StructField, StringType

# Named docling_schema so it does not overwrite the `schema` string ("demo") set in the configuration
docling_schema = StructType([
    StructField("content", StringType(), True),
    StructField("source", StringType(), True)
])
df_spark_docling = spark.createDataFrame(docling_documents, schema=docling_schema)
display(df_spark_docling)

The Delta table needs:
- the `'delta.enableChangeDataFeed' = 'true'` table property
- an `id` column
- a timestamp column recording the creation date
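
Since both tables in this notebook share these three requirements, the DDL can be generated from the list of content columns. The helper below is a sketch, not part of the original notebook: `build_source_table_ddl` is a hypothetical name, and the statement it produces simply mirrors the `CREATE OR REPLACE TABLE` cells used in this notebook.

```python
def build_source_table_ddl(table_name: str, columns: dict[str, str]) -> str:
    """Build a CREATE TABLE statement with an id, a creation timestamp and CDF enabled."""
    column_lines = [
        "  id BIGINT GENERATED ALWAYS AS IDENTITY",
        *[f"  {name} {sql_type}" for name, sql_type in columns.items()],
        "  created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP()",
    ]
    return (
        f"CREATE OR REPLACE TABLE {table_name} (\n"
        + ",\n".join(column_lines)
        + "\n)\nUSING DELTA\nTBLPROPERTIES (\n"
        "  'delta.enableChangeDataFeed' = 'true',\n"
        "  'delta.feature.allowColumnDefaults' = 'enabled'\n"
        ")"
    )


ddl = build_source_table_ddl(
    "demo.demo.pdf_document_docling",
    {"content": "STRING", "source": "STRING"},
)
print(ddl)
```

In the notebook, the resulting string could be run with `spark.sql(ddl)` instead of a `%sql` cell, which keeps the two table definitions consistent.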

### Docling
Docling simplifies document processing: it parses diverse formats, including advanced PDF understanding, and integrates with the gen AI ecosystem.

It is an open-source project from IBM.

Its PDF understanding covers page layout, reading order, table structure, code, formulas, image classification, and more.

It also supports several visual language models, such as GraniteDocling, for image-aware processing.

At the time of writing, it is one of the most popular repositories on GitHub.

In [0]:
%sql
USE CATALOG demo;
USE SCHEMA demo;

CREATE OR REPLACE TABLE pdf_document_docling (
  id BIGINT GENERATED ALWAYS AS IDENTITY,
  content STRING,
  source STRING,
  created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP()
)
USING DELTA
TBLPROPERTIES (
  'delta.enableChangeDataFeed' = 'true',
  'delta.feature.allowColumnDefaults' = 'enabled'
);

### Creating the vector store source table: "demo.demo.pdf_document_docling"

In [0]:
# Define the name of your UC table (catalog.schema.table)
table_name = "demo.demo.pdf_document_docling"

# Write the docling chunks to the UC table, replacing any existing rows
df_spark_docling.write \
    .format("delta") \
    .mode("overwrite") \
    .option("delta.enableChangeDataFeed", "true") \
    .saveAsTable(table_name)