# Visual Document Search with Azure Content Understanding

Source: https://github.com/Azure-Samples/azure-ai-search-with-content-understanding-python

(slightly modified for training purposes)

## Objective
This document walks through an example workflow that uses the Azure AI Content Understanding API to improve the quality of document search.

The sample demonstrates the following steps:
1. Extract the layout and content of a document using Azure AI Document Intelligence.
2. For each figure in the document, extract its content with a custom analyzer using Azure AI Content Understanding, and insert it into the corresponding location in the document content.
3. Chunk and embed the document content with LangChain and Azure OpenAI, and index the chunks in Azure AI Search.
4. Use an Azure OpenAI chat model to answer a natural language query from the content retrieved from the index.


## Pre-requisites
1. Follow the [README](README.md) to create the required resources for this sample.
2. Install the required packages.

In [1]:
# %pip install -r ../requirements_aisearch.txt

## Load environment variables

In [2]:
from dotenv import load_dotenv
import os
from datetime import datetime # added for customizing AZURE_SEARCH_INDEX_NAME (if needed)

load_dotenv(override=True)

# Load and validate Azure AI Services configs
AZURE_AI_SERVICE_ENDPOINT = os.getenv("AZURE_AI_SERVICE_ENDPOINT")
AZURE_AI_SERVICE_API_VERSION = os.getenv("AZURE_AI_SERVICE_API_VERSION") or "2024-12-01-preview"
AZURE_DOCUMENT_INTELLIGENCE_API_VERSION = os.getenv("AZURE_DOCUMENT_INTELLIGENCE_API_VERSION") or "2024-11-30"

# Load and validate Azure OpenAI configs
AZURE_OPENAI_ENDPOINT = os.getenv("AZURE_OPENAI_ENDPOINT")
AZURE_OPENAI_CHAT_DEPLOYMENT_NAME = os.getenv("AZURE_OPENAI_CHAT_DEPLOYMENT_NAME")
AZURE_OPENAI_CHAT_API_VERSION = os.getenv("AZURE_OPENAI_CHAT_API_VERSION") or "2024-08-01-preview"
AZURE_OPENAI_EMBEDDING_DEPLOYMENT_NAME = os.getenv("AZURE_OPENAI_EMBEDDING_DEPLOYMENT_NAME")
AZURE_OPENAI_EMBEDDING_API_VERSION = os.getenv("AZURE_OPENAI_EMBEDDING_API_VERSION") or "2023-05-15"

# Load and validate Azure Search Services configs
AZURE_SEARCH_ENDPOINT = os.getenv("AZURE_SEARCH_ENDPOINT")
AZURE_SEARCH_INDEX_NAME = os.getenv("AZURE_SEARCH_INDEX_NAME") or "sample-index-visual-doc"
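
The comments in the cell above mention validation, but the cell only loads values, so a missing variable would surface later as a less obvious error. A minimal sketch of an explicit check follows; the helper name `find_missing_settings` is hypothetical and not part of the original sample, and the list of required names is an assumption based on the variables that have no fallback default above.

```python
import os


def find_missing_settings(required_names, environ=None):
    """Return the names in required_names that are unset or blank."""
    env = os.environ if environ is None else environ
    return [name for name in required_names if not (env.get(name) or "").strip()]


# Variables this notebook reads without a fallback default (assumed list).
REQUIRED = [
    "AZURE_AI_SERVICE_ENDPOINT",
    "AZURE_OPENAI_ENDPOINT",
    "AZURE_OPENAI_CHAT_DEPLOYMENT_NAME",
    "AZURE_OPENAI_EMBEDDING_DEPLOYMENT_NAME",
    "AZURE_SEARCH_ENDPOINT",
]

missing = find_missing_settings(REQUIRED)
if missing:
    print("Missing environment variables: " + ", ".join(missing))
```

Printing rather than raising keeps the notebook runnable cell by cell; raising a `ValueError` instead would be a reasonable choice for a script.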

In [3]:
print(f"Current Azure AI Services endpoint: {AZURE_AI_SERVICE_ENDPOINT}")

Current Azure AI Services endpoint: https://ep-ai-services.services.ai.azure.com/


In [None]:
print(f"Current Azure OpenAI endpoint: {AZURE_OPENAI_ENDPOINT}")

In [None]:
print(f"Current Azure AI Search endpoint: {AZURE_SEARCH_ENDPOINT}")

## File to analyze

In [6]:
from pathlib import Path

# Get the path to the file that will be analyzed
# Sample report source: https://www.imf.org/en/Publications/CR/Issues/2024/07/18/United-States-2024-Article-IV-Consultation-Press-Release-Staff-Report-and-Statement-by-the-552100
file = Path("assets/reports/sample_report_3pg.pdf")

In [7]:
file

WindowsPath('assets/reports/sample_report_3pg.pdf')

## Create custom analyzer using chart and diagram understanding template

### Setup

In [8]:
import json
import sys
import uuid
import pandas as pd # added for visualizing existing analyzers into a df
import logging # added for visualizing response details from methods e.g. delete_analyzer()

In [9]:
# only if necessary, add the parent directory to the path to use shared modules
# parent_dir = Path(Path.cwd()).parent
# sys.path.append(str(parent_dir))

# import the utility class AzureContentUnderstandingClient, which is a wrapper around the Azure Content Understanding REST API client
from python.content_understanding_client import AzureContentUnderstandingClient

In [10]:
from azure.identity import DefaultAzureCredential, get_bearer_token_provider
credential = DefaultAzureCredential()
token_provider = get_bearer_token_provider(credential, "https://cognitiveservices.azure.com/.default")

In [11]:
# try:
#     credential = DefaultAzureCredential()
#     # Test token acquisition
#     token = credential.get_token("https://cognitiveservices.azure.com/.default")
#     print("Successfully acquired token!")
# except Exception as e:
#     print(f"Authentication failed: {str(e)}")

### Create content understanding client

In [12]:
content_understanding_client = AzureContentUnderstandingClient(
    endpoint=AZURE_AI_SERVICE_ENDPOINT,
    api_version=AZURE_AI_SERVICE_API_VERSION,
    # subscription_key= "715b91dd-7c91-4bf2-8987-8640c7168071",
    token_provider=token_provider,
    # x_ms_useragent="azure-ai-content-understanding-python/search_with_visusal_document", # This header is used for sample usage telemetry, please comment out this line if you want to opt out.
)

### Create an analyzer

In [13]:
# Get path to sample template
ANALYZER_TEMPLATE_PATH = "analyzer_templates/image_chart_diagram_understanding.json"

In [14]:
# Create analyzer
ANALYZER_ID = "content-understanding-search-sample-" + str(uuid.uuid4())
print(f"Creating analyzer with ID '{ANALYZER_ID}'...")


Creating analyzer with ID 'content-understanding-search-sample-97062efd-e55f-4a1a-b276-1dd504e5d6da'...


In [None]:
try:
    response = content_understanding_client.begin_create_analyzer(ANALYZER_ID, analyzer_template_path=ANALYZER_TEMPLATE_PATH)
    result = content_understanding_client.poll_result(response)
    print(f'Analyzer details for {result["result"]["analyzerId"]}:')
    # print(json.dumps(result, indent=2))
except Exception as e:
    print(e)
    print("Error in creating analyzer. Please double-check your analysis settings.\nIf there is a conflict, you can delete the analyzer and then recreate it, or move to the next cell and use the existing analyzer.")

In [19]:
# check if the analyzer was created successfully
result = content_understanding_client.get_analyzer_detail_by_id(ANALYZER_ID)
print(json.dumps(result, indent=2))

{
  "analyzerId": "content-understanding-search-sample-97062efd-e55f-4a1a-b276-1dd504e5d6da",
  "description": "Extract detailed structured information from charts and diagrams.",
  "createdAt": "2025-04-02T17:33:07Z",
  "lastModifiedAt": "2025-04-02T17:33:07Z",
  "config": {
    "returnDetails": false,
    "disableContentFiltering": false
  },
  "fieldSchema": {
    "name": "ChartsAndDiagrams",
    "fields": {
      "Title": {
        "type": "string",
        "description": "Verbatim title of the chart."
      },
      "ChartType": {
        "type": "string",
        "description": "The type of chart.",
        "enum": [
          "area",
          "bar",
          "box",
          "bubble",
          "candlestick",
          "funnel",
          "heatmap",
          "histogram",
          "line",
          "pie",
          "radar",
          "rings",
          "rose",
          "treemap"
        ],
        "enumDescriptions": {
          "histogram": "Continuous values on the x-axis,

## Analyze document layout and compose with figure descriptions

### Helper functions for document-figure composition

In [21]:
# %pip install PyMuPDF

from azure.ai.documentintelligence import DocumentIntelligenceClient
from azure.ai.documentintelligence.models import AnalyzeResult
from azure.ai.documentintelligence.models import AnalyzeDocumentRequest
import fitz
from PIL import Image

In [22]:
# Define helper functions for document-figure composition
def insert_figure_contents(md_content, figure_contents, span_offsets):
    """
    Inserts the figure content for each of the provided figures in figure_contents
    before the span offset of that figure in the given markdown content.

    Args:
    - md_content (str): The original markdown content.
    - figure_contents (list[str]): The contents of each figure to insert.
    - span_offsets (list[int]): The span offsets of each figure in order. These should be sorted and strictly increasing.

    Returns:
    - str: The modified markdown content with the figure contents prepended to each figure's span.
    """
    # NOTE: In this notebook, we only alter the Markdown content returned by the Document Intelligence API,
    # and not the per-element spans in the API response. Thus, after figure content insertion, these per-element spans will be inaccurate.
    # This may impact use cases like citation page number calculation.
    # Additional code may be needed to correct the spans or otherwise infer the page numbers for each citation.
    # The main purpose of the notebook is to show the feasibility of using Content Understanding with Azure Search for RAG chat applications.

    # Validate span_offsets are sorted and strictly increasing
    if span_offsets != sorted(span_offsets) or not all([o < span_offsets[i + 1] for i, o in enumerate(span_offsets) if i < len(span_offsets) - 1]):
        raise ValueError("span_offsets should be sorted and strictly increasing.")

    # Split the content based on the provided spans.
    # The text before the first figure (if any) is kept as a preamble;
    # each part then starts at a figure offset and runs to the next one.
    parts = []
    preamble = None
    for i, offset in enumerate(span_offsets):
        if i == 0 and offset > 0:
            preamble = md_content[0:offset]
        if i == len(span_offsets) - 1:
            # Last figure: take everything to the end (also covers a single-figure document)
            parts.append(md_content[offset:])
        else:
            parts.append(md_content[offset:span_offsets[i + 1]])

    # Join the parts back together with the figure content inserted
    modified_content = ""
    if preamble:
        modified_content += preamble
    for i, part in enumerate(parts):
        modified_content += f"<!-- FigureContent=\"{figure_contents[i]}\" -->" + part

    return modified_content

def crop_image_from_pdf_page(pdf_path, page_number, bounding_box):
    """
    Crops a region from a given page in a PDF and returns it as an image.

    Args:    
    - pdf_path (pathlib.Path): Path to the PDF file.
    - page_number (int): The page number to crop from (0-indexed).
    - bounding_box (tuple): A tuple of (x0, y0, x1, y1) coordinates for the bounding box.
    
    Returns:
    - PIL.Image: A PIL Image of the cropped area.
    """
    doc = fitz.open(pdf_path)
    page = doc.load_page(page_number)
    
    # Cropping the page. The rect requires the coordinates in the format (x0, y0, x1, y1).
    bbx = [x * 72 for x in bounding_box]
    rect = fitz.Rect(bbx)
    pix = page.get_pixmap(matrix=fitz.Matrix(300 / 72, 300 / 72), clip=rect)
    
    img = Image.frombytes("RGB", [pix.width, pix.height], pix.samples)
    
    doc.close()

    return img

def format_content_understanding_result(content_understanding_result):
    """
    Formats the JSON output of the Content Understanding result as Markdown for downstream usage in text.
    
    Args:
    - content_understanding_result (dict): A dictionary containing the output from Content Understanding.

    Returns:
    - str: A Markdown string of the result content.
    """
    def _format_result(key, result):
        result_type = result["type"]
        if result_type in ["string", "integer", "number", "boolean"]:
            return f"**{key}**: " + str(result[f'value{result_type.capitalize()}']) + "\n"
        elif result_type == "array":
            return f"**{key}**: " + ', '.join([str(result["valueArray"][i][f"value{r['type'].capitalize()}"]) for i, r in enumerate(result["valueArray"])]) + "\n"
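        elif result_type == "object":
            return f"**{key}**\n" + ''.join([_format_result(f"{key}.{k}", result["valueObject"][k]) for k in result["valueObject"]])
        else:
            # Fall back to a generic rendering for other field types (e.g. date or time)
            # so that the caller never receives None
            return f"**{key}**: " + str(result.get(f"value{result_type.capitalize()}", "")) + "\n"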

    fields = content_understanding_result['result']['contents'][0]['fields']
    markdown_result = ""
    for field in fields:
        markdown_result += _format_result(field, fields[field])

    return markdown_result
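
To make the offset-based insertion concrete, here is a small self-contained sketch on a toy Markdown string. The function `insert_before_offsets` is a simplified stand-in written for this illustration, not the helper defined above, and the `<figure>` markers and figure texts are made up.

```python
def insert_before_offsets(text, inserts, offsets):
    """Insert inserts[i] immediately before text[offsets[i]], for increasing offsets."""
    pieces = []
    cursor = 0
    for snippet, offset in zip(inserts, offsets):
        pieces.append(text[cursor:offset])  # original text up to this figure
        pieces.append(f'<!-- FigureContent="{snippet}" -->')
        cursor = offset
    pieces.append(text[cursor:])  # remainder after the last figure
    return "".join(pieces)


md = "Intro.\n<figure>A</figure>\nMiddle.\n<figure>B</figure>\nEnd."
offsets = [md.index("<figure>A"), md.index("<figure>B")]
result = insert_before_offsets(md, ["chart A", "chart B"], offsets)
print(result)
```

Each figure description lands directly in front of the figure it describes, so a chunker that splits on `<!--` tends to keep the description and the figure reference together.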

### Extract figures and run content understanding

In [23]:
import io
import json
import os

In [24]:
file

WindowsPath('assets/reports/sample_report_3pg.pdf')

In [None]:
# observed computation time: ~1 minute for 3 pages
# the output is cached in a file called 'sample_report.cache'

In [26]:
# Run Content Understanding on each figure, format figure contents, and insert figure contents into corresponding document locations
with open(file, 'rb') as f:
    pdf_bytes = f.read()

    # Figure output is requested per analyze call below, not on the client
    document_intelligence_client = DocumentIntelligenceClient(
        endpoint=AZURE_AI_SERVICE_ENDPOINT,
        api_version=AZURE_DOCUMENT_INTELLIGENCE_API_VERSION,
        credential=credential,
    )

    poller = document_intelligence_client.begin_analyze_document(
        "prebuilt-layout",
        AnalyzeDocumentRequest(bytes_source=pdf_bytes),
        output=[str('figures')],
        features=['ocrHighResolution'],
        output_content_format="markdown"
    )

    result: AnalyzeResult = poller.result()
    
    md_content = result.content

    figure_contents = []
    if result.figures:
        print("Extracting figure contents with Content Understanding.")
        for figure_idx, figure in enumerate(result.figures):
            for region in figure.bounding_regions:
                    # Uncomment the below to print out the bounding regions of each figure
                    # print(f"Figure {figure_idx + 1} body bounding regions: {region}")
                    # To learn more about bounding regions, see https://aka.ms/bounding-region
                    bounding_box = (
                            region.polygon[0],  # x0 (left)
                            region.polygon[1],  # y0 (top)
                            region.polygon[4],  # x1 (right)
                            region.polygon[5]   # y1 (bottom)
                        )
            page_number = figure.bounding_regions[0]['pageNumber']
            cropped_img = crop_image_from_pdf_page(file, page_number - 1, bounding_box)

            os.makedirs("figures", exist_ok=True)

            figure_filename = f"figure_{figure_idx + 1}.png"
            # Full path for the file
            figure_filepath = os.path.join("figures", figure_filename)

            # Save the figure
            cropped_img.save(figure_filepath)
            bytes_io = io.BytesIO()
            cropped_img.save(bytes_io, format='PNG')
            cropped_img = bytes_io.getvalue()

            # Collect formatted content from the figure
            content_understanding_response = content_understanding_client.begin_analyze(ANALYZER_ID, figure_filepath)
            content_understanding_result = content_understanding_client.poll_result(content_understanding_response, timeout_seconds=1000)
            figure_content = format_content_understanding_result(content_understanding_result)
            figure_contents.append(figure_content)
            print(f"Figure {figure_idx + 1} contents:\n{figure_content}")

        # Insert figure content into corresponding location in document
        md_content = insert_figure_contents(md_content, figure_contents, [f.spans[0]["offset"] for f in result.figures])
    
    # Save results as a JSON file to cache the result for downstream use
    result.content = md_content
    output = {}
    output['analyzeResult'] = result.as_dict()
    output = json.dumps(output)
    with open('sample_report.cache', 'w') as f:
        f.write(output)

Extracting figure contents with Content Understanding.
Figure 1 contents:
**Title**: 2023 Real GDP
**ChartType**: bar
**TopicKeywords**: Business and finance, Economics, Global economy
**DetailedDescription**: The bar chart titled '2023 Real GDP' shows the percent deviation from the pre-crisis trend for various regions and countries. The United States is the only region with a positive deviation, approximately 1.0%. Japan, Canada, and the Euro Area have negative deviations, each around -1.0%. The United Kingdom shows a significant negative deviation of about -4.0%. The G-20 Emerging Markets (EMs) also have a negative deviation, slightly less than -3.0%.
**Summary**: The chart illustrates the 2023 Real GDP deviations from pre-crisis trends for several major economies. The United States is the only economy with a positive deviation, indicating growth above the pre-crisis trend. In contrast, the UK shows the largest negative deviation, suggesting a significant downturn compared to its pre
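
The cell above builds the crop box from polygon elements 0, 1, 4, and 5, which assumes a four-point polygon listed from the top-left corner. A slightly more defensive sketch takes the minimum and maximum over all points instead. It assumes, as the `* 72` in `crop_image_from_pdf_page` implies, that polygon coordinates are in inches; the function name and sample polygon are hypothetical.

```python
POINTS_PER_INCH = 72  # PDF user-space units per inch


def polygon_to_bbox_points(polygon):
    """Convert a flat [x0, y0, x1, y1, ...] polygon in inches to an
    axis-aligned (left, top, right, bottom) box in PDF points."""
    xs = polygon[0::2]
    ys = polygon[1::2]
    return tuple(v * POINTS_PER_INCH for v in (min(xs), min(ys), max(xs), max(ys)))


# Hypothetical polygon: top-left, top-right, bottom-right, bottom-left corners.
polygon = [1.0, 2.0, 4.0, 2.0, 4.0, 5.0, 1.0, 5.0]
print(polygon_to_bbox_points(polygon))  # (72.0, 144.0, 288.0, 360.0)
```

Because it does not depend on point order, this variant also handles polygons that are slightly rotated or listed from a different starting corner.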

In [27]:
# Uncomment the first line below to load in a previously cached result.
# output = open("sample_report.cache").read()
document_content = json.loads(output)
document_content = document_content['analyzeResult']['content']

In [28]:
type(document_content)

str

In [None]:
# nicely print document content
# print(document_content)
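
Rather than printing the whole document, it can be handy to look only at the figure descriptions that were inserted. The sketch below pulls out every `<!-- FigureContent="..." -->` comment with a regular expression. It runs on a made-up sample string; the helper name is hypothetical, and the pattern would stop early if a figure description itself contained `" -->`.

```python
import re

FIGURE_COMMENT = re.compile(r'<!-- FigureContent="(.*?)" -->', re.DOTALL)


def extract_figure_contents(markdown_text):
    """Return the text of every FigureContent comment, in document order."""
    return FIGURE_COMMENT.findall(markdown_text)


sample = (
    'Paragraph.\n'
    '<!-- FigureContent="**Title**: 2023 Real GDP\n**ChartType**: bar\n" -->'
    '<figure>...</figure>\n'
)
for i, content in enumerate(extract_figure_contents(sample), start=1):
    print(f"Figure {i}:\n{content}")
```

In the notebook, the same call could be applied to `document_content` to confirm that one description was inserted per figure.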

## Enhanced RAG

This is a simple (and not exhaustive) starting point. Feel free to give your own chunking strategies a try!

...and replace langchain with Azure AI Foundry SDK! 😊

In the following example:
- we chunk the output from Document Intelligence (enriched with Content Understanding descriptions of the charts) by Markdown headers, then by recursive character splitting
- we use an embedding model to embed the chunks and store them in Azure AI Search for retrieval
- we retrieve the most relevant chunks based on a sample query (e.g. "What was the country with the lowest real GDP in 2023?")
- we inject the question and the retrieved content in the LLM prompt to give the LLM a context
- we inspect the LLM response

### Chunk text by splitting with Markdown header splitting and recursive character splitting

In [30]:
from langchain_text_splitters import MarkdownHeaderTextSplitter
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Configure langchain text splitting settings
EMBEDDING_CHUNK_SIZE = 512
EMBEDDING_CHUNK_OVERLAP = 20
headers_to_split_on = [
    ("#", "Header 1"),
    ("##", "Header 2"),
    ("###", "Header 3")
]

# First split text using Markdown headers
text_splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers_to_split_on, strip_headers=False)
chunks = text_splitter.split_text(document_content)

# Then further split the text using recursive character text splitting
char_text_splitter = RecursiveCharacterTextSplitter(separators=["<!--", "\n\n", "#"], chunk_size=EMBEDDING_CHUNK_SIZE, chunk_overlap=EMBEDDING_CHUNK_OVERLAP, is_separator_regex=True)
chunks = char_text_splitter.split_documents(chunks)

print("Number of chunks: " + str(len(chunks)))

Number of chunks: 19
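
To show what the header-based first pass does conceptually, here is a simplified stand-in written with the standard library only. It is not LangChain's implementation: it ignores header metadata and code fences, and the function name and sample text are made up. Like `strip_headers=False` above, it keeps each header line inside its section.

```python
import re

HEADER = re.compile(r"^(#{1,3}) +(.*)$")


def split_on_headers(markdown_text):
    """Split Markdown into sections, starting a new one at each #, ##, or ### header."""
    sections, current = [], []
    for line in markdown_text.splitlines():
        if HEADER.match(line) and current:
            sections.append("\n".join(current))
            current = []
        current.append(line)
    if current:
        sections.append("\n".join(current))
    return sections


doc = "# Report\nIntro text.\n## Outlook\nGrowth is steady.\n## Risks\nInflation."
sections = split_on_headers(doc)
print(len(sections))  # 3
```

The second pass in the notebook then breaks any section longer than `EMBEDDING_CHUNK_SIZE` characters into smaller pieces, which is why the final chunk count is usually higher than the number of headers.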


### Calculate embeddings and populate the Azure AI Search index

In [31]:
from langchain_openai import AzureOpenAIEmbeddings
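# AzureSearch is provided by the langchain-community package in current LangChain releases
from langchain_community.vectorstores.azuresearch import AzureSearch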

In [32]:
print(f"Embedding model: {AZURE_OPENAI_EMBEDDING_DEPLOYMENT_NAME}")
print(f"Embedding API version: {AZURE_OPENAI_EMBEDDING_API_VERSION}")

Embedding model: text-embedding-3-large
Embedding API version: 2024-10-21


In [33]:
aoai_embeddings = AzureOpenAIEmbeddings(model=AZURE_OPENAI_EMBEDDING_DEPLOYMENT_NAME,
                                        azure_endpoint=AZURE_OPENAI_ENDPOINT,
                                        azure_ad_token_provider=token_provider,
                                        api_version=AZURE_OPENAI_EMBEDDING_API_VERSION)

In [34]:
type(aoai_embeddings)

langchain_openai.embeddings.azure.AzureOpenAIEmbeddings

In [35]:
print(f"Search endpoint: {AZURE_SEARCH_ENDPOINT}")
print(f"Saving on index: {AZURE_SEARCH_INDEX_NAME}-{datetime.now().strftime('%Y-%m-%d')}")

Search endpoint: https://ep-aisearch-swedencentral-s1.search.windows.net
Saving on index: my-index-2025-04-02


In [36]:
# IMPORTANT:
# 1) grant user roles Search Index Data Contributor + Search Service Contributor for the Azure AI Search resource
# 2) Settings --> Keys --> API Access Control --> select Role-based access control
vector_store = AzureSearch(
    azure_search_endpoint=AZURE_SEARCH_ENDPOINT,
    azure_search_key=None,
    # To use a dated index instead: f"{AZURE_SEARCH_INDEX_NAME}-{datetime.now().strftime('%Y-%m-%d')}"
    index_name=AZURE_SEARCH_INDEX_NAME,
    embedding_function=aoai_embeddings.embed_query,
)

In [None]:
# This is a one-time operation to add the documents to the vector store. Comment out this line if you are re-running this cell with the same index.
vector_store.add_documents(documents=chunks)

### Query vector index to retrieve relevant documents

In [38]:
# Set up the retriever that will be used to query the index for similar documents
retriever = vector_store.as_retriever(search_type="similarity")
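
As a rough intuition for what similarity retrieval does, the sketch below ranks a few toy vectors by cosine similarity to a query vector. This is a brute-force illustration only: the vectors and document names are invented, and Azure AI Search uses its own approximate nearest-neighbor index and whichever metric is configured on the index.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k=2):
    """Return the ids of the k vectors most similar to query_vec."""
    scored = sorted(
        doc_vecs.items(),
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [doc_id for doc_id, _ in scored[:k]]


# Toy 3-dimensional "embeddings" (real ones have hundreds or thousands of dimensions).
toy = {
    "gdp_chart": [0.9, 0.1, 0.0],
    "oil_table": [0.1, 0.9, 0.0],
    "footnote": [0.0, 0.1, 0.9],
}
print(top_k([1.0, 0.2, 0.0], toy))  # ['gdp_chart', 'oil_table']
```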

In [39]:
# Retrieve relevant documents
# query = "What was the crude oil production in 2019?"
query = "What was the country with the lowest real GDP in 2023?"
retrieved_docs = retriever.invoke(query)

In [40]:
# Print retrieved documents
for doc in retrieved_docs:
    print("Document id:", doc.metadata['id'])
    print("Content:", doc.page_content)
    print("=" * 50)

Document id: OTA3NGJjMGUtODM4MS00YmI1LTk0Y2UtOTVkOGUzOTNiMTQ0
Content: <!-- FigureContent="**Title**: 2023 Real GDP
**ChartType**: bar
**TopicKeywords**: Business and finance, Economics, Global economy
**DetailedDescription**: The bar chart titled '2023 Real GDP' shows the percent deviation from the pre-crisis trend for various regions and countries. The United States is the only region with a positive deviation, approximately 1.0%. Japan, Canada, and the Euro Area have negative deviations, each around -1.0%. The United Kingdom shows a significant negative deviation of about -4.0%. The G-20 Emerging Markets (EMs) also have a negative deviation, slightly less than -3.0%.
**Summary**: The chart illustrates the 2023 Real GDP deviations from pre-crisis trends for several major economies. The United States is the only economy with a positive deviation, indicating growth above the pre-crisis trend. In contrast, the UK shows the largest negative deviation, suggesting a significant downturn co

### Generate answer to query

In [None]:
# Define system prompt template for chat model
prompt = """
You are an expert in document analysis. You are proficient in reading and analyzing technical reports. You are good at numerical reasoning and have a good understanding of financial concepts. You are given a question which you need to answer based on the references provided. To answer this question, you may first read the question carefully to know what information is required or helpful to answer the question. Then, you may read the references to find the relevant information.

If you find enough information to answer the question, you can first write down your thinking process and then provide a concise answer at the end.
If you find that there is not enough information to answer the question, you can state that there is insufficient information.
If you are not able or sure how to answer the question, say that you are not able to answer the question.
Do not provide any information that is not present in the references.
References are in markdown format, you may follow the markdown syntax to better understand the references.

---
References:
{context}
---

Now, here is the question:
---
Question:
{question}
---
Thinking Process::: 
Answer::: 
"""

# Helper function to generate the formatted context from each retrieved document
def generate_context(chunks):
    context = []
    for i, chunk in enumerate(chunks):
        s = (f'Source {i} Metadata: {chunk.metadata}\n'
                f'Source {i} Content: {chunk.page_content}')
        context.append(s)
    context = '\n---\n'.join(context)
    return context

# Remove redundant chunks
appeared = set()
unique_chunks = []
for chunk in retrieved_docs:
    chunk_id = chunk.metadata['id']
    if chunk_id not in appeared:
        appeared.add(chunk_id)
        unique_chunks.append(chunk)
context = generate_context(unique_chunks)

# Format the prompt with the provided query and formatted context
# The context is given by the retrieved documents
prompt = prompt.format(question=query,
                       context=context)

In [None]:
print(prompt)

In [44]:
from langchain_openai import AzureChatOpenAI

In [45]:
print(f"Chat model: {AZURE_OPENAI_CHAT_DEPLOYMENT_NAME}")
print(f"Chat API version: {AZURE_OPENAI_CHAT_API_VERSION}")

Chat model: gpt-4o
Chat API version: 2025-01-01-preview


In [46]:
chat_llm = AzureChatOpenAI(model=AZURE_OPENAI_CHAT_DEPLOYMENT_NAME,
                            azure_endpoint=AZURE_OPENAI_ENDPOINT,
                            azure_ad_token_provider=token_provider,
                            api_version=AZURE_OPENAI_CHAT_API_VERSION,
                            temperature=0.7)

In [47]:
# Print the LLM's answer to the query with the retrieved documents as additional context
answer = chat_llm.invoke(prompt)

In [48]:
print(answer.content)

### Thinking Process:
1. The question asks for the country with the lowest real GDP in 2023 based on the references provided. Specifically, it refers to the percent deviation from the pre-crisis trend.
2. From the references, we can analyze the markdown data tables and descriptions provided about the "2023 Real GDP" deviations.
3. The markdown tables and detailed descriptions indicate that the United Kingdom (UK) has the largest negative deviation of -4.0%, making it the country with the lowest real GDP in 2023 relative to its pre-crisis trend.
4. Other regions like Japan, Canada, and the Euro Area have moderate negative deviations (-1.0%), and G-20 Emerging Markets (EMs) show a slightly larger negative deviation (-3.0%). The United States is the only region with positive deviation (+1.0%).

### Answer:
The country with the lowest real GDP in 2023, based on percent deviation from the pre-crisis trend, was the United Kingdom (UK) with a deviation of -4.0%.
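
If only the final answer is needed downstream, the response can be split on its "Answer" heading. The sketch below assumes the model puts "Thinking Process" and "Answer" on their own lines, as in the output above; that layout is not guaranteed, so the function falls back to returning the whole text as reasoning. The function name and sample text are hypothetical.

```python
import re


def split_reasoning_and_answer(response_text):
    """Split a response on its 'Answer' heading; returns (reasoning, answer)."""
    parts = re.split(r"^#*\s*Answer:+\s*$", response_text, maxsplit=1, flags=re.MULTILINE)
    if len(parts) != 2:
        # No standalone 'Answer' heading found: treat everything as reasoning
        return response_text.strip(), ""
    reasoning = re.sub(
        r"^#*\s*Thinking Process:+\s*$", "", parts[0], count=1, flags=re.MULTILINE
    )
    return reasoning.strip(), parts[1].strip()


sample = "### Thinking Process:\n1. Compare deviations.\n\n### Answer:\nThe United Kingdom."
reasoning, answer = split_reasoning_and_answer(sample)
print(answer)  # The United Kingdom.
```

In the notebook this could be called as `split_reasoning_and_answer(answer.content)`.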
