# Deploy a RAG system with Mosaic AI Agent Evaluation and Lakehouse Applications

In this chapter, you will build a **Databricks Docs Assistant** that answers users' questions about Databricks, using:
- a vector database / index
- an LLM endpoint
- MLflow for tracking and deployment
- the Lakehouse for data storage


## In this notebook, we build the pipeline that feeds the vector store

Our goal is to develop an assistant that can read the Databricks documentation to answer developer questions.


## Extract and preprocess the contextual documents

This notebook creates a Delta table in Unity Catalog that will then be used as the vector store database.

For this, we will use two chunking methods:
- the first one with classic recursive chunking on an HTML source
- the second one with a Docling extraction followed by chunking


 

### Docling
Docling simplifies document processing: it parses diverse formats, including advanced PDF understanding, and integrates with the gen AI ecosystem.

It is an open-source project from IBM.

Its PDF understanding covers page layout, reading order, table structure, code, formulas, image classification, and more.

It supports several visual language models, such as GraniteDocling, which makes it ready for image integration.

At the time of writing, it is one of the most widely used repositories on GitHub.

In [0]:
%pip install -U --quiet databricks-langchain==0.6.0 mlflow[databricks]==3.4.0  langchain==0.3.27 langchain_core==0.3.74 bs4 langchain_community markdownify docling
dbutils.library.restartPython()

## Catalog and schema


In [0]:
catalog = "demo"
schema = "demo"

## Source extract
The source is a subset of the Databricks documentation used as doc QA sources.

Because we are limited in usage volume, I will only keep the part of the Databricks docs about MLflow.

## 1- Web scraping

In [0]:
DATABRICKS_SITEMAP_URL = "https://docs.databricks.com/aws/en/sitemap.xml"

### 1-1 Retrieve the relevant URLs

In [0]:
import requests
import xml.etree.ElementTree as ET
import pandas as pd
from urllib3.util.retry import Retry
from requests.adapters import HTTPAdapter

# Constants
URL_PARALLELISM = 50

# Setup shared retry-enabled HTTP session
shared_session = requests.Session()
retries = Retry(total=3, backoff_factor=3, status_forcelist=[429])
adapter = HTTPAdapter(max_retries=retries, pool_maxsize=URL_PARALLELISM)
shared_session.mount("http://", adapter)
shared_session.mount("https://", adapter)

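# Fetch sitemap URLs, optionally keeping only those that contain `filter_documents`
def fetch_urls(filter_documents=None):
    response = shared_session.get(DATABRICKS_SITEMAP_URL, timeout=30)
    response.raise_for_status()  # fail early on HTTP errors instead of parsing an error page
    root = ET.fromstring(response.content)
    loc_tag = "{http://www.sitemaps.org/schemas/sitemap/0.9}loc"
    urls = [
        loc.text
        for loc in root.findall(f".//{loc_tag}")
        if filter_documents is None or filter_documents in loc.text
    ]
    return urls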

url_docs_mlflow = fetch_urls(filter_documents="/mlflow")
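
The namespace-qualified lookup and the substring filter used in `fetch_urls` can be tried offline. The sketch below is only an illustration: it assumes a hypothetical three-entry sitemap (`SAMPLE_SITEMAP`) with made-up `example.com` URLs and a hypothetical helper named `extract_urls`; it does not contact the real Databricks sitemap.

```python
import xml.etree.ElementTree as ET

# Namespace used by the sitemap protocol; every <loc> tag is qualified with it
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

# Hypothetical miniature sitemap, used only for illustration
SAMPLE_SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/docs/mlflow/tracking</loc></url>
  <url><loc>https://example.com/docs/delta/index</loc></url>
  <url><loc>https://example.com/docs/mlflow/models</loc></url>
</urlset>"""


def extract_urls(sitemap_xml, filter_documents=None):
    """Return the <loc> URLs of a sitemap, optionally filtered by substring."""
    root = ET.fromstring(sitemap_xml)
    return [
        loc.text
        for loc in root.iter(f"{SITEMAP_NS}loc")
        if filter_documents is None or filter_documents in loc.text
    ]


mlflow_urls = extract_urls(SAMPLE_SITEMAP, filter_documents="/mlflow")
print(mlflow_urls)
```

Without the namespace prefix, the lookup would find no `loc` element at all, which is why the notebook spells out the full `{...}loc` tag.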

## 2- Create a table with classic chunks.  

#### Extracting Databricks documentation sitemap and pages

For this demo, we will directly download a few documentation pages from docs.databricks.com and extract the HTML content.

Here are the main steps:

- Extract the page URLs from the sitemap (done in step 1-1)
- Download the web pages
- Use BeautifulSoup to extract the text of each page
    

#### Splitting documentation pages into small chunks

LLMs typically have a maximum input context length, and you won't be able to compute embeddings for very long texts. In addition, the longer your context is, the longer the model will take to provide a response.

Document preparation is key for your model to perform well, and multiple strategies exist depending on your dataset:

- Split documents into small chunks (paragraph, h2...)
- Truncate documents to a fixed length
- Split into big chunks and ask a model to summarize each chunk as a one-off job, for faster live inference
- Create multiple agents to evaluate each bigger document in parallel, and ask a final agent to craft your answer...

The chunk size depends on your content and how you'll be using it to craft your prompt. Adding multiple small doc chunks to your prompt might give different results than sending only a big one.
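
To make the idea of recursive splitting concrete, here is a library-free sketch. It is a simplified illustration, not the implementation of LangChain's `RecursiveCharacterTextSplitter`: the function name `recursive_split` and the sample text are assumptions, and chunk overlap is deliberately left out to keep the logic short. The function tries the coarsest separator first (paragraphs), and only falls back to finer ones (lines, then words) for pieces that are still too long.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ")):
    """Split text into chunks of at most chunk_size characters,
    preferring the coarsest separator that keeps chunks small enough."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators:
        # No separator left: fall back to a hard cut every chunk_size characters
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            # The piece still fits: keep growing the current chunk
            current = candidate
            continue
        if current:
            chunks.append(current)
        if len(piece) <= chunk_size:
            current = piece
        else:
            # The piece alone is too long: recurse with finer separators
            chunks.extend(recursive_split(piece, chunk_size, finer))
            current = ""
    if current:
        chunks.append(current)
    return chunks


sample = "Intro paragraph.\n\nSecond paragraph that is quite a bit longer than the limit.\n\nEnd."
chunks = recursive_split(sample, chunk_size=30)
for chunk in chunks:
    print(len(chunk), repr(chunk))
```

Short paragraphs stay intact, while the long one is broken at word boundaries. A production splitter would also add overlap between neighbouring chunks so that a sentence cut at a boundary remains retrievable.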

In [0]:
from bs4 import BeautifulSoup
import requests
from langchain.docstore.document import Document
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.document_transformers import MarkdownifyTransformer


md = MarkdownifyTransformer(["script", "style", "nav", "footer"])

def custom_html_splitter(url_doc, chunk_size=500, chunk_overlap=100):
    """
        Fetch the HTML content of url_doc, extract its text with BeautifulSoup,
        and split it into smaller chunks.
        chunk_size and chunk_overlap are passed to RecursiveCharacterTextSplitter.
    """
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
    }

    response = requests.get(url_doc, headers=headers, timeout=30)
    soup = BeautifulSoup(response.content, 'html.parser')

    # Wrap the page text in a LangChain document, keeping the source URL as metadata.
    # Note: get_text() already drops the HTML tags, so the Markdownify step below
    # receives plain text rather than markup.
    documents = [Document(page_content=soup.get_text(),
                          metadata={"url": url_doc})]

    converted_docs = md.transform_documents(documents)
    # Apply the text splitter; you can pass custom separators if needed
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=chunk_overlap
    )

    return text_splitter.split_documents(converted_docs)



def create_documents(databricks_urls):
    """
        For each URL, append the Documents returned by custom_html_splitter
        to the list of documents extracted from the HTML docs.
    """
    documents = []
    
    for databricks_url in databricks_urls :
        documents += custom_html_splitter(databricks_url, chunk_size=1000)
        
    return documents


documents = create_documents(databricks_urls=url_docs_mlflow)

In [0]:
print(documents[0].metadata)
print(documents[0].page_content[:100])

#### Convert the list of chunks produced by LangChain into two columns: `content` (from doc.page_content) and `url` (from doc.metadata['url'])

In [0]:
import pandas as pd

# From the list of LangChain chunks, build the rows for the vector store table.
data = [{
            'content': doc.page_content,
            'url': str(doc.metadata['url']),
            # Add as much information as you need
        }
        for doc in documents ]

# Create a Spark DataFrame (`spark` is the session provided by the Databricks notebook)
df_pandas = pd.DataFrame(data)
df_spark = spark.createDataFrame(df_pandas)
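
The comment about adding more information can be illustrated without LangChain or Spark. The sketch below is an assumption-laden example: `Chunk` is a hypothetical stand-in for a LangChain `Document`, and `chunks_to_rows` and the extra `chunk_index` field (the position of the chunk within its page) are names introduced here, not part of the notebook.

```python
from dataclasses import dataclass, field


@dataclass
class Chunk:
    """Hypothetical stand-in for a LangChain Document (text plus metadata)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def chunks_to_rows(chunks):
    """Flatten chunks into row dictionaries, numbering the chunks of each URL."""
    rows = []
    next_index_by_url = {}
    for chunk in chunks:
        url = str(chunk.metadata["url"])
        index = next_index_by_url.get(url, 0)
        next_index_by_url[url] = index + 1
        rows.append({
            "content": chunk.page_content,
            "url": url,
            "chunk_index": index,  # extra field, added for illustration only
        })
    return rows


example_chunks = [
    Chunk("first part", {"url": "https://example.com/a"}),
    Chunk("second part", {"url": "https://example.com/a"}),
    Chunk("other page", {"url": "https://example.com/b"}),
]
rows = chunks_to_rows(example_chunks)
print(rows)
```

A list of flat dictionaries like `rows` can be passed to `pd.DataFrame` in the same way as `data` in the cell above.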

In [0]:
display(df_spark)

## 3- Launch the docling preparation module

In [0]:
from docling.document_converter import DocumentConverter
from docling.chunking import HybridChunker

doc_convert = DocumentConverter()
chunker = HybridChunker()
data_docling = []
for reference_url in url_docs_mlflow:

    # DocumentConverter downloads and parses the page itself,
    # so no separate download or HTML backend is needed.
    # example (GitHub): url = "https://en.wikipedia.org/wiki/Duck"
    doc = doc_convert.convert(source=reference_url).document

    chunk_iter = chunker.chunk(dl_doc=doc)
    for chunk in chunk_iter:
        # Enrich the chunk text with its context (e.g. headings) for embedding
        enriched_text = chunker.contextualize(chunk=chunk)

        row = {
            'content': chunk.text,
            'contextualize_content': enriched_text,
            'url': reference_url,  # or serialize as needed
            # Add other fields depending on your structure
        }
        # Append inside the inner loop so that every chunk is kept,
        # not only the last chunk of each page
        data_docling.append(row)
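
The difference between the `content` and `contextualize_content` columns can be pictured with a library-free sketch. This is an illustration under assumptions, not docling's code: `SimpleChunk` and this `contextualize` function are hypothetical stand-ins, and the exact text produced by `HybridChunker.contextualize` may differ. The idea is that the chunk text is prefixed with the headings it sits under, so that the embedding also captures where the passage comes from.

```python
from dataclasses import dataclass, field


@dataclass
class SimpleChunk:
    """Hypothetical stand-in for a docling chunk: raw text plus heading metadata."""
    text: str
    headings: list = field(default_factory=list)


def contextualize(chunk, delim="\n"):
    """Prefix the chunk text with its headings, one per line."""
    return delim.join([*chunk.headings, chunk.text])


chunk = SimpleChunk(
    text="Metrics are logged inside a run.",
    headings=["MLflow tracking", "Log metrics"],
)
enriched = contextualize(chunk)
print(enriched)
```

Storing both versions keeps the raw text available for display while the enriched text can be used for embedding.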


In [0]:

df_spark_docling = spark.createDataFrame(pd.DataFrame(data_docling).astype({'content': 'string'}))
display(df_spark_docling)

The Delta table needs:
- the table property 'delta.enableChangeDataFeed' = 'true', so that a vector search index can stay in sync with the table
- an id column
- a timestamp column for the creation date

In [0]:
%sql
USE CATALOG demo;
USE SCHEMA demo;

CREATE OR REPLACE TABLE databricks_document_docling (
  id BIGINT GENERATED ALWAYS AS IDENTITY,
  content STRING,
  contextualize_content STRING,
  url STRING,
  created_at TIMESTAMP
)
USING DELTA
TBLPROPERTIES (
  'delta.enableChangeDataFeed' = 'true'
);

ALTER TABLE databricks_document_docling SET TBLPROPERTIES (
  'delta.feature.allowColumnDefaults' = 'enabled'
);

ALTER TABLE databricks_document_docling ALTER COLUMN created_at SET DEFAULT CURRENT_TIMESTAMP();

### Create the vector store table: "demo.demo.databricks_document_docling"

In [0]:
# Define the name of your UC table (catalog.schema.table)
table_name = "demo.demo.databricks_document_docling"

# Write the chunks to the UC table
df_spark_docling.write \
    .format("delta") \
    .mode("append") \
    .option("delta.enableChangeDataFeed", "true") \
    .saveAsTable(table_name)