# Visual Document Search with Azure Content Understanding and AI Foundry SDK

Source: https://github.com/Azure-Samples/azure-ai-search-with-content-understanding-python

(modified for training purposes)

## Objective
This notebook walks through an example workflow that uses the Azure AI Content Understanding API to improve the quality of document search.

The sample demonstrates the following steps:
1. Extract the layout and content of a document using Azure AI Document Intelligence.
2. For each figure in the document, extract its content with a custom analyzer using Azure AI Content Understanding, and insert it into the corresponding location in the document content.
3. Chunk and embed the document content with LangChain and Azure OpenAI, then index the chunks in Azure AI Search.
4. Use an OpenAI chat model to answer a natural language query from the indexed document content.


## Pre-requisites
1. Follow the [README](README.md) to create the required resources for this sample.
2. Install the required packages.


## Load environment variables

In [1]:
from dotenv import load_dotenv
import os
from datetime import datetime # added for customizing AZURE_SEARCH_INDEX_NAME (if needed)

load_dotenv(dotenv_path='../infra/credentials.env', override=True)

# Load Azure AI Services configs (falling back to default API versions)
AZURE_AI_SERVICE_ENDPOINT = os.getenv("AZURE_AI_SERVICE_ENDPOINT")
AZURE_AI_SERVICE_API_VERSION = os.getenv("AZURE_AI_SERVICE_API_VERSION") or "2024-12-01-preview"
AZURE_DOCUMENT_INTELLIGENCE_API_VERSION = os.getenv("AZURE_DOCUMENT_INTELLIGENCE_API_VERSION") or "2024-11-30"

# Load Azure OpenAI configs (falling back to default API versions)
AZURE_OPENAI_ENDPOINT = os.getenv("AZURE_OPENAI_ENDPOINT")
AZURE_OPENAI_CHAT_DEPLOYMENT_NAME = os.getenv("MODEL_DEPLOYMENT_NAME")
AZURE_OPENAI_CHAT_API_VERSION = os.getenv("AZURE_OPENAI_API_VERSION") or "2024-08-01-preview"
AZURE_OPENAI_EMBEDDING_DEPLOYMENT_NAME = os.getenv("EMBEDDING_DEPLOYMENT_NAME")
AZURE_OPENAI_EMBEDDING_API_VERSION = os.getenv("AZURE_OPENAI_EMBEDDING_API_VERSION") or "2023-05-15"

# Load Azure AI Search configs (falling back to a default index name)
AZURE_SEARCH_ENDPOINT = os.getenv("AZURE_SEARCH_ENDPOINT")
AZURE_SEARCH_KEY = os.getenv("AZURE_SEARCH_KEY")
AZURE_SEARCH_INDEX_NAME = os.getenv("AZURE_SEARCH_INDEX_NAME") or "sample-index-visual-doc"
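
`os.getenv` returns `None` for an unset variable, so a missing endpoint only surfaces later as a client error. A minimal sketch of an up-front check is shown below, assuming the three endpoints are the settings that must be present. `require_env`, `REQUIRED_SETTINGS`, and `demo_env` are hypothetical names used for illustration only.

```python
import os

def require_env(names, environ=None):
    """Return {name: value} for every name, raising if any is unset or empty."""
    environ = os.environ if environ is None else environ
    missing = [name for name in names if not environ.get(name)]
    if missing:
        raise EnvironmentError("Missing required settings: " + ", ".join(missing))
    return {name: environ[name] for name in names}

# Assumption: these are the settings without a usable default in this notebook.
REQUIRED_SETTINGS = [
    "AZURE_AI_SERVICE_ENDPOINT",
    "AZURE_OPENAI_ENDPOINT",
    "AZURE_SEARCH_ENDPOINT",
]

# Demo with a fake environment; pass environ=None to check the real one.
demo_env = {
    "AZURE_AI_SERVICE_ENDPOINT": "https://example.cognitiveservices.azure.com/",
    "AZURE_OPENAI_ENDPOINT": "https://example.openai.azure.com/",
    "AZURE_SEARCH_ENDPOINT": "https://example.search.windows.net",
}
settings = require_env(REQUIRED_SETTINGS, environ=demo_env)
print(sorted(settings))
```

Calling `require_env(REQUIRED_SETTINGS)` right after `load_dotenv` would fail fast with the list of missing names.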

In [2]:
print(f"Current Azure AI Services endpoint: {AZURE_AI_SERVICE_ENDPOINT}")

Current Azure AI Services endpoint: https://ai-aifoundryupskillinghub687267079310.cognitiveservices.azure.com/


In [3]:
print(f"Current Azure OpenAI endpoint: {AZURE_OPENAI_ENDPOINT}")

Current Azure OpenAI endpoint: https://ai-aifoundryupskillinghub687267079310.openai.azure.com/


In [4]:
print(f"Current Azure AI Search endpoint: {AZURE_SEARCH_ENDPOINT}")

Current Azure AI Search endpoint: https://ai-search-abutneva687267079310.search.windows.net


## File to analyze

In [5]:
from pathlib import Path

# Get the path to the file that will be analyzed
# Sample report source: https://www.imf.org/en/Publications/CR/Issues/2024/07/18/United-States-2024-Article-IV-Consultation-Press-Release-Staff-Report-and-Statement-by-the-552100
file = Path("assets/reports/sample_report_3pg.pdf")

In [6]:
file

PosixPath('assets/reports/sample_report_3pg.pdf')

## Create custom analyzer using chart and diagram understanding template

### Setup

In [7]:
import json
import sys
import uuid
import pandas as pd # added for visualizing existing analyzers into a df
import logging # added for visualizing response details from methods e.g. delete_analyzer()

In [8]:
# only if necessary, add the parent directory to the path to use shared modules
# parent_dir = Path(Path.cwd()).parent
# sys.path.append(str(parent_dir))

# import the utility class AzureContentUnderstandingClient, which is a wrapper around the Azure Content Understanding REST API client
from python.content_understanding_client import AzureContentUnderstandingClient

In [9]:
from azure.identity import DefaultAzureCredential, get_bearer_token_provider
credential = DefaultAzureCredential()
token_provider = get_bearer_token_provider(credential, "https://cognitiveservices.azure.com/.default")

In [10]:
# try:
#     credential = DefaultAzureCredential()
#     # Test token acquisition
#     token = credential.get_token("https://cognitiveservices.azure.com/.default")
#     print("Successfully acquired token!")
# except Exception as e:
#     print(f"Authentication failed: {str(e)}")

### Create content understanding client

In [11]:
content_understanding_client = AzureContentUnderstandingClient(
    endpoint=AZURE_AI_SERVICE_ENDPOINT,
    api_version=AZURE_AI_SERVICE_API_VERSION,
    # subscription_key="<your-subscription-key>",  # key-based alternative to token_provider; never commit a real key
    token_provider=token_provider,
    # x_ms_useragent="azure-ai-content-understanding-python/search_with_visusal_document", # This header is used for sample usage telemetry, please comment out this line if you want to opt out.
)

### Create an analyzer

In [12]:
# Get path to sample template
ANALYZER_TEMPLATE_PATH = "analyzer_templates/image_chart_diagram_understanding.json"

In [13]:
# Create analyzer
ANALYZER_ID = "content-understanding-search-sample-" + str(uuid.uuid4())
print(f"Creating analyzer with ID '{ANALYZER_ID}'...")


Creating analyzer with ID 'content-understanding-search-sample-4d23a4ef-c9f3-4225-91f7-b935ab815914'...


In [14]:
try:
    response = content_understanding_client.begin_create_analyzer(ANALYZER_ID, analyzer_template_path=ANALYZER_TEMPLATE_PATH)
    result = content_understanding_client.poll_result(response)
    print(f'Analyzer details for {result["result"]["analyzerId"]}:')
    # print(json.dumps(result, indent=2))
except Exception as e:
    print(e)
    print("Error in creating analyzer. Please double-check your analysis settings.\nIf there is a conflict, you can delete the analyzer and then recreate it, or move to the next cell and use the existing analyzer.")

Analyzer details for content-understanding-search-sample-4d23a4ef-c9f3-4225-91f7-b935ab815914:


In [15]:
# check if the analyzer was created successfully
result = content_understanding_client.get_analyzer_detail_by_id(ANALYZER_ID)
print(json.dumps(result, indent=2))

{
  "analyzerId": "content-understanding-search-sample-4d23a4ef-c9f3-4225-91f7-b935ab815914",
  "description": "Extract detailed structured information from charts and diagrams.",
  "createdAt": "2025-04-10T16:42:26Z",
  "lastModifiedAt": "2025-04-10T16:42:26Z",
  "config": {
    "returnDetails": false,
    "disableContentFiltering": false
  },
  "fieldSchema": {
    "name": "ChartsAndDiagrams",
    "fields": {
      "Title": {
        "type": "string",
        "description": "Verbatim title of the chart."
      },
      "ChartType": {
        "type": "string",
        "description": "The type of chart.",
        "enum": [
          "area",
          "bar",
          "box",
          "bubble",
          "candlestick",
          "funnel",
          "heatmap",
          "histogram",
          "line",
          "pie",
          "radar",
          "rings",
          "rose",
          "treemap"
        ],
        "enumDescriptions": {
          "histogram": "Continuous values on the x-axis,

## Analyze document layout and compose with figure descriptions

### Helper functions for document-figure composition

In [16]:
# %pip install PyMuPDF

from azure.ai.documentintelligence import DocumentIntelligenceClient
from azure.ai.documentintelligence.models import AnalyzeResult
from azure.ai.documentintelligence.models import AnalyzeDocumentRequest
import fitz
from PIL import Image

In [17]:
# Define helper functions for document-figure composition
def insert_figure_contents(md_content, figure_contents, span_offsets):
    """
    Inserts the figure content for each of the provided figures in figure_contents
    before the span offset of that figure in the given markdown content.

    Args:
    - md_content (str): The original markdown content.
    - figure_contents (list[str]): The contents of each figure to insert.
    - span_offsets (list[int]): The span offsets of each figure in order. These should be sorted and strictly increasing.

    Returns:
    - str: The modified markdown content with the figure contents prepended to each figure's span.
    """
    # NOTE: In this notebook, we only alter the Markdown content returned by the Document Intelligence API,
    # and not the per-element spans in the API response. Thus, after figure content insertion, these per-element spans will be inaccurate.
    # This may impact use cases like citation page number calculation.
    # Additional code may be needed to correct the spans or otherwise infer the page numbers for each citation.
    # The main purpose of the notebook is to show the feasibility of using Content Understanding with Azure Search for RAG chat applications.

    # Validate span_offsets are sorted and strictly increasing
    if span_offsets != sorted(span_offsets) or not all([o < span_offsets[i + 1] for i, o in enumerate(span_offsets) if i < len(span_offsets) - 1]):
        raise ValueError("span_offsets should be sorted and strictly increasing.")

    # Split the content based on the provided spans
    parts = []
    preamble = None
    for i, offset in enumerate(span_offsets):
        if i == 0 and offset > 0:
            preamble = md_content[0:offset]
        # Each part runs from this figure's offset up to the next figure's offset;
        # the last part runs to the end of the content (this also covers a document with a single figure).
        if i == len(span_offsets) - 1:
            parts.append(md_content[offset:])
        else:
            parts.append(md_content[offset:span_offsets[i + 1]])

    # Join the parts back together with the figure content inserted
    modified_content = ""
    if preamble:
        modified_content += preamble
    for i, part in enumerate(parts):
        modified_content += f"<!-- FigureContent=\"{figure_contents[i]}\" -->" + part

    return modified_content

def crop_image_from_pdf_page(pdf_path, page_number, bounding_box):
    """
    Crops a region from a given page in a PDF and returns it as an image.

    Args:    
    - pdf_path (pathlib.Path): Path to the PDF file.
    - page_number (int): The page number to crop from (0-indexed).
    - bounding_box (tuple): A tuple of (x0, y0, x1, y1) coordinates for the bounding box.
    
    Returns:
    - PIL.Image: A PIL Image of the cropped area.
    """
    doc = fitz.open(pdf_path)
    page = doc.load_page(page_number)
    
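    # Cropping the page. The rect requires the coordinates in the format (x0, y0, x1, y1).
    # Document Intelligence reports PDF bounding boxes in inches, while PyMuPDF works in points (72 per inch).
    bbx = [x * 72 for x in bounding_box]
    rect = fitz.Rect(bbx)
    # Render the clipped region at 300 DPI instead of the default 72 DPI.
    pix = page.get_pixmap(matrix=fitz.Matrix(300 / 72, 300 / 72), clip=rect)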
    
    img = Image.frombytes("RGB", [pix.width, pix.height], pix.samples)
    
    doc.close()

    return img

def format_content_understanding_result(content_understanding_result):
    """
    Formats the JSON output of the Content Understanding result as Markdown for downstream usage in text.
    
    Args:
    - content_understanding_result (dict): A dictionary containing the output from Content Understanding.

    Returns:
    - str: A Markdown string of the result content.
    """
    def _format_result(key, result):
        result_type = result["type"]
        if result_type in ["string", "integer", "number", "boolean"]:
            return f"**{key}**: " + str(result[f'value{result_type.capitalize()}']) + "\n"
        elif result_type == "array":
            return f"**{key}**: " + ', '.join([str(r[f"value{r['type'].capitalize()}"]) for r in result["valueArray"]]) + "\n"
        elif result_type == "object":
            return f"**{key}**\n" + ''.join([_format_result(f"{key}.{k}", result["valueObject"][k]) for k in result["valueObject"]])
        # Other field types are not handled in this sample; skip them instead of returning None.
        return ""

    fields = content_understanding_result['result']['contents'][0]['fields']
    markdown_result = ""
    for field in fields:
        markdown_result += _format_result(field, fields[field])

    return markdown_result
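
To see what `insert_figure_contents` does to the Markdown, here is a compact standard-library sketch of the same idea on a toy string. `insert_before_offsets` is a hypothetical, simplified re-implementation (it skips the offset validation), and the `<figure>` snippets are made up for the demo.

```python
def insert_before_offsets(text, inserts, offsets):
    """Insert a FigureContent comment before each offset, walking left to right."""
    pieces, cursor = [], 0
    for snippet, offset in zip(inserts, offsets):
        pieces.append(text[cursor:offset])                   # text up to the figure
        pieces.append(f'<!-- FigureContent="{snippet}" -->')  # the figure description
        cursor = offset
    pieces.append(text[cursor:])                              # remainder after the last figure
    return "".join(pieces)

md = "Intro text.\n<figure>A</figure>\nMiddle.\n<figure>B</figure>\nEnd."
offsets = [md.index("<figure>A"), md.index("<figure>B")]
enriched = insert_before_offsets(md, ["chart A", "chart B"], offsets)
print(enriched)
```

Because the offsets refer to the original string, the text is always sliced from the original and never from the partially modified result.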

### Extract figures and run content understanding

In [90]:
import io
import json
import os

In [18]:
file

PosixPath('assets/reports/sample_report_3pg.pdf')

In [62]:
# observed computation time: ~1 minute for 3 pages
# the output is cached in a file called 'sample_report.cache'

In [63]:
# Run Content Understanding on each figure, format figure contents, and insert figure contents into corresponding document locations
with open(file, 'rb') as f:
    pdf_bytes = f.read()

    document_intelligence_client = DocumentIntelligenceClient(
        endpoint=AZURE_AI_SERVICE_ENDPOINT,
        api_version=AZURE_DOCUMENT_INTELLIGENCE_API_VERSION,
        credential=credential
    )

    poller = document_intelligence_client.begin_analyze_document(
        "prebuilt-layout",
        AnalyzeDocumentRequest(bytes_source=pdf_bytes),
        output=['figures'],
        features=['ocrHighResolution'],
        output_content_format="markdown"
    )

    result: AnalyzeResult = poller.result()
    
    md_content = result.content

    figure_contents = []
    if result.figures:
        print("Extracting figure contents with Content Understanding.")
        for figure_idx, figure in enumerate(result.figures):
            # A figure can have several bounding regions; this sample crops only the first one,
            # so that the bounding box and the page number come from the same region.
            # To learn more about bounding regions, see https://aka.ms/bounding-region
            region = figure.bounding_regions[0]
            # Uncomment the below to print out the bounding region of each figure
            # print(f"Figure {figure_idx + 1} body bounding region: {region}")
            bounding_box = (
                region.polygon[0],  # x0 (left)
                region.polygon[1],  # y0 (top)
                region.polygon[4],  # x1 (right)
                region.polygon[5]   # y1 (bottom)
            )
            page_number = region['pageNumber']
            cropped_img = crop_image_from_pdf_page(file, page_number - 1, bounding_box)

            os.makedirs("figures", exist_ok=True)

            figure_filename = f"figure_{figure_idx + 1}.png"
            # Full path for the file
            figure_filepath = os.path.join("figures", figure_filename)

            # Save the figure
            cropped_img.save(figure_filepath)
            bytes_io = io.BytesIO()
            cropped_img.save(bytes_io, format='PNG')
            cropped_img = bytes_io.getvalue()

            # Collect formatted content from the figure
            content_understanding_response = content_understanding_client.begin_analyze(ANALYZER_ID, figure_filepath)
            content_understanding_result = content_understanding_client.poll_result(content_understanding_response, timeout_seconds=1000)
            figure_content = format_content_understanding_result(content_understanding_result)
            figure_contents.append(figure_content)
            print(f"Figure {figure_idx + 1} contents:\n{figure_content}")

        # Insert figure content into corresponding location in document
        md_content = insert_figure_contents(md_content, figure_contents, [f.spans[0]["offset"] for f in result.figures])
    
    # Save results as a JSON file to cache the result for downstream use
    result.content = md_content
    output = {}
    output['analyzeResult'] = result.as_dict()
    output = json.dumps(output)
    with open('sample_report.cache', 'w') as f:
        f.write(output)

Extracting figure contents with Content Understanding.
Figure 1 contents:
**Title**: 2023 Real GDP
**ChartType**: bar
**TopicKeywords**: Business and finance, Economics
**DetailedDescription**: The bar chart displays the percent deviation of the 2023 Real GDP from the pre-crisis trend for various regions. The United States shows a positive deviation of approximately 1.0%, indicating growth above the pre-crisis trend. Japan, Canada, and the Euro Area have negative deviations, with Japan and Canada around -1.0% and the Euro Area slightly more negative. The United Kingdom has the largest negative deviation at approximately -4.0%, indicating significant underperformance compared to the pre-crisis trend. The G-20 Emerging Markets (EMs) also show a negative deviation, slightly less than the UK, around -3.5%.
**Summary**: In 2023, the United States is the only region with a positive GDP deviation from the pre-crisis trend, while the UK shows the largest negative deviation. Other regions like 
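
The extraction cell builds the crop box from two corners of the figure polygon (indices 0, 1 and 4, 5), which assumes an axis-aligned rectangle listed clockwise from the top-left corner. A slightly more defensive sketch takes the minimum and maximum over all points, then shows the unit arithmetic behind `crop_image_from_pdf_page`: inches times 72 gives PDF points, and inches times 300 gives the approximate pixel size at 300 DPI. The helper names and the sample polygon are hypothetical.

```python
def polygon_to_bbox(polygon):
    """Convert a flat [x0, y0, x1, y1, ...] polygon into (left, top, right, bottom)."""
    xs, ys = polygon[0::2], polygon[1::2]
    return (min(xs), min(ys), max(xs), max(ys))

def inches_to_points(bbox):
    """PDF points are 1/72 inch, so scale every coordinate by 72."""
    return tuple(value * 72 for value in bbox)

def approx_pixel_size(bbox_inches, dpi=300):
    """Approximate (width, height) in pixels of the crop when rendered at the given DPI."""
    left, top, right, bottom = bbox_inches
    return (round((right - left) * dpi), round((bottom - top) * dpi))

# Made-up polygon in inches: top-left, top-right, bottom-right, bottom-left.
polygon = [1.0, 2.0, 4.0, 2.0, 4.0, 3.5, 1.0, 3.5]
bbox = polygon_to_bbox(polygon)
print(bbox)
print(inches_to_points(bbox))
print(approx_pixel_size(bbox))
```

For an axis-aligned polygon this yields the same box as the index-based version; the actual pixmap size produced by PyMuPDF may differ by a pixel because of rounding.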

In [19]:
# Uncomment the first line below to load in a previously cached result.
# output = open("sample_report.cache").read()
document_content = json.loads(output)
document_content = document_content['analyzeResult']['content']

In [20]:
type(document_content)

str

In [21]:
# nicely print document content
# print(document_content)

## Enhanced RAG

This is a simple (and not exhaustive) starting point. Feel free to give your own chunking strategies a try!

In the following example:
- we split the Document Intelligence output (enriched with the Content Understanding descriptions of the charts) along Markdown headers, then split oversized sections further by character count
- we use an embedding model to embed the chunks and store them in Azure AI Search for retrieval
- we retrieve the most relevant chunks for a sample query (e.g. "What was the country with the lowest real GDP in 2023?")
- we inject the question and the retrieved content into the LLM prompt as context
- we inspect the LLM response

### Chunk text by splitting with Markdown header splitting and recursive character splitting

In [22]:
from langchain_text_splitters import MarkdownHeaderTextSplitter
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Configure langchain text splitting settings
EMBEDDING_CHUNK_SIZE = 512
EMBEDDING_CHUNK_OVERLAP = 20
headers_to_split_on = [
    ("#", "Header 1"),
    ("##", "Header 2"),
    ("###", "Header 3")
]

# First split text using Markdown headers
text_splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers_to_split_on, strip_headers=False)
chunks = text_splitter.split_text(document_content)
print("Number of chunks after semantic chunking: " + str(len(chunks)))

# Then further split the text using recursive character text splitting
# It first attempts to split the text using the highest-priority separator ("<!--"). 
# If the resulting chunks are still too large (i.e., exceed chunk_size), 
# it recursively applies the next separator ("\n\n"), and so on.
char_text_splitter = RecursiveCharacterTextSplitter(separators=["<!--", "\n\n", "#"], chunk_size=EMBEDDING_CHUNK_SIZE, chunk_overlap=EMBEDDING_CHUNK_OVERLAP, is_separator_regex=True)
chunks = char_text_splitter.split_documents(chunks)

print("Final number of chunks: " + str(len(chunks)))

Number of chunks after semantic chunking: 6
Final number of chunks: 19
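
The second splitting pass tries separators in priority order and only falls back to the next one for pieces that are still too long. The sketch below illustrates that recursion with the standard library only. It is a deliberate simplification and not LangChain's implementation: it drops the separators, does not merge small pieces back together, and ignores `chunk_overlap`. `recursive_split` and the sample text are hypothetical.

```python
def recursive_split(text, separators, chunk_size):
    """Split text on the first separator; recurse with the remaining ones for long pieces."""
    if len(text) <= chunk_size:
        return [text]
    if not separators:
        # No separators left: fall back to fixed-width slices.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    head, *tail = separators
    chunks = []
    for piece in text.split(head):
        if piece:  # skip empty strings produced by leading separators
            chunks.extend(recursive_split(piece, tail, chunk_size))
    return chunks

sample = "<!-- a -->short<!-- b -->" + "x" * 30 + "\n\n" + "y" * 10
chunks = recursive_split(sample, ["<!--", "\n\n"], chunk_size=20)
for chunk in chunks:
    print(len(chunk), repr(chunk))
```

Putting `"<!--"` first means figure descriptions and page headers, which Document Intelligence and the insertion step emit as HTML comments, become natural split points before paragraph breaks are considered.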


In [23]:
type(chunks)

list

In [24]:
type(chunks[0])

langchain_core.documents.base.Document

### [AI Foundry SDK] Calculate embeddings and populate the Azure AI Search index

In [25]:
# These packages are required for the Azure AI Foundry SDK (see requirements_aisearch.txt for package versions)
# %pip install azure-ai-projects
# %pip install azure-ai-inference

from azure.identity import DefaultAzureCredential
from azure.ai.projects import AIProjectClient

In [26]:
# These dependencies handle the connection to Azure AI Search and the processing of the documents into the index
from azure.core.credentials import AzureKeyCredential
from azure.ai.projects.models import ConnectionType
from azure.search.documents import SearchClient
from azure.search.documents.indexes import SearchIndexClient

In [27]:
# Get your AI Foundry project connection string from the AI Foundry portal
project_connection_string="swedencentral.api.azureml.ms;dbc342d5-96b5-4aef-a49d-5f6cbd7db6ce;aifoundry-upskilling-rg;aifoundry-upskilling-pj"

# Initialize the AI Foundry project client
project = AIProjectClient.from_connection_string(
  conn_str=project_connection_string,
  credential=DefaultAzureCredential())

In [28]:
# Create a vector embeddings client that will be used to generate vector embeddings
# (at least one AI model that supports text embeddings must be deployed in the project)
embeddings = project.inference.get_embeddings_client()

type(embeddings)

azure.ai.inference._patch.EmbeddingsClient

In [29]:
# Azure AI Search resource
print(f"Search endpoint: {AZURE_SEARCH_ENDPOINT}")

# Azure AI Search index name
# load_dotenv(dotenv_path='../infra/credentials.env', override=True)
# AZURE_SEARCH_INDEX_NAME = os.getenv("AZURE_SEARCH_INDEX_NAME") or "sample-index-visual-doc"
# AZURE_SEARCH_INDEX_NAME = f"{AZURE_SEARCH_INDEX_NAME}-{datetime.now().strftime('%Y-%m-%d')}-sdk"
AZURE_SEARCH_INDEX_NAME = "sample-index-visual-doc"
print(f"Saving on index: {AZURE_SEARCH_INDEX_NAME}") #-{datetime.now().strftime('%Y-%m-%d')}")

Search endpoint: https://ai-search-abutneva687267079310.search.windows.net
Saving on index: sample-index-visual-doc


In [30]:
# Use the project client to get the default search connection
# Ensure that an Azure AI Search resource is among the connected resources of your AI Foundry project
search_connection = project.connections.get_default(
    connection_type=ConnectionType.AZURE_AI_SEARCH,
    include_credentials=True)

# Print to check it
# search_connection

In [31]:
# Create a client to interact with Azure search service index
index_client = SearchIndexClient(
    endpoint=search_connection.endpoint_url,
    credential=AzureKeyCredential(key=search_connection.key)
)

In [32]:
from azure.search.documents.indexes.models import (
    SemanticSearch,
    SearchField,
    SimpleField,
    SearchableField,
    SearchFieldDataType,
    SemanticConfiguration,
    SemanticPrioritizedFields,
    SemanticField,
    VectorSearch,
    HnswAlgorithmConfiguration,
    VectorSearchAlgorithmKind,
    HnswParameters,
    VectorSearchAlgorithmMetric,
    ExhaustiveKnnAlgorithmConfiguration,
    ExhaustiveKnnParameters,
    VectorSearchProfile,
    SearchIndex
)

In [33]:
def create_index_definition(index_name: str, model: str) -> SearchIndex:
    dimensions = 1536  # text-embedding-ada-002
    if model == "text-embedding-3-large":
        dimensions = 3072

    # The fields we want to index. The "contentVector" field is a vector field that will
    # be used for vector search.
    fields = [
        SimpleField(name="id", type=SearchFieldDataType.String, key=True),
        SearchableField(name="content", type=SearchFieldDataType.String),
        SimpleField(name="filepath", type=SearchFieldDataType.String),
        SearchableField(name="title", type=SearchFieldDataType.String),
        SimpleField(name="url", type=SearchFieldDataType.String),
        SearchField(
            name="contentVector",
            type=SearchFieldDataType.Collection(SearchFieldDataType.Single),
            searchable=True,
            # Size of the vector created by the embedding model
            vector_search_dimensions=dimensions,
            vector_search_profile_name="myHnswProfile",
        ),
    ]

    # The "content" field should be prioritized for semantic ranking.
    semantic_config = SemanticConfiguration(
        name="default",
        prioritized_fields=SemanticPrioritizedFields(
            title_field=SemanticField(field_name="title"),
            keywords_fields=[],
            content_fields=[SemanticField(field_name="content")],
        ),
    )

    # For vector search, we want to use the HNSW (Hierarchical Navigable Small World)
    # algorithm (a type of approximate nearest neighbor search algorithm) with cosine
    # distance.
    vector_search = VectorSearch(
        algorithms=[
            HnswAlgorithmConfiguration(
                name="myHnsw",
                kind=VectorSearchAlgorithmKind.HNSW,
                parameters=HnswParameters(
                    m=4,
                    ef_construction=1000,
                    ef_search=1000,
                    metric=VectorSearchAlgorithmMetric.COSINE,
                ),
            ),
            ExhaustiveKnnAlgorithmConfiguration(
                name="myExhaustiveKnn",
                kind=VectorSearchAlgorithmKind.EXHAUSTIVE_KNN,
                parameters=ExhaustiveKnnParameters(metric=VectorSearchAlgorithmMetric.COSINE),
            ),
        ],
        profiles=[
            VectorSearchProfile(
                name="myHnswProfile",
                algorithm_configuration_name="myHnsw",
            ),
            VectorSearchProfile(
                name="myExhaustiveKnnProfile",
                algorithm_configuration_name="myExhaustiveKnn",
            ),
        ],       
    )

    # Create the semantic settings with the configuration
    semantic_search = SemanticSearch(configurations=[semantic_config])

    # Create the search index definition
    return SearchIndex(
        name=index_name,
        fields=fields,
        semantic_search=semantic_search,
        vector_search=vector_search,
    )
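
`create_index_definition` falls back to 1536 dimensions for every model name other than `text-embedding-3-large`, so an unexpected model name silently yields an index whose vector size may not match the embeddings. A minimal sketch of an explicit lookup that fails loudly instead is shown below. `EMBEDDING_DIMENSIONS` and `embedding_dimensions` are hypothetical names, and the values are the default output sizes of these OpenAI embedding models.

```python
# Default vector sizes of common OpenAI embedding models.
EMBEDDING_DIMENSIONS = {
    "text-embedding-ada-002": 1536,
    "text-embedding-3-small": 1536,
    "text-embedding-3-large": 3072,
}

def embedding_dimensions(model: str) -> int:
    """Return the vector size for a known embedding model, or raise for an unknown one."""
    try:
        return EMBEDDING_DIMENSIONS[model]
    except KeyError:
        raise ValueError(
            f"Unknown embedding model '{model}'; add its vector size to EMBEDDING_DIMENSIONS."
        ) from None

print(embedding_dimensions("text-embedding-ada-002"))
print(embedding_dimensions("text-embedding-3-large"))
```

Note that the lookup is keyed by model name; if a deployment is named differently from its model, the mapping would need the deployment name instead.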

In [34]:
print(AZURE_SEARCH_INDEX_NAME)

sample-index-visual-doc


In [35]:
index_definition = create_index_definition(AZURE_SEARCH_INDEX_NAME, model="text-embedding-ada-002")
index_client.create_index(index_definition)

<azure.search.documents.indexes.models._index.SearchIndex at 0x7fd6a58c61d0>

In [36]:
# https://learn.microsoft.com/en-us/azure/search/tutorial-rag-build-solution-models#configure-search-engine-access-to-azure-models
# Assign Cognitive Services OpenAI User + Azure AI Developer role to the current user and to the Azure AI Search system-managed identity.

# Resource was added as Search Index Data Contributor at subscription level.
# Resource was added as Search Service Contributor at subscription level.

# Create a client to interact with an existing Azure Search index
search_client = SearchClient(
    index_name=AZURE_SEARCH_INDEX_NAME,
    endpoint=AZURE_SEARCH_ENDPOINT,
    credential=AzureKeyCredential(search_connection.key)  # Use the correct key from the search_connection
)

In [37]:
# Define a function to process the current list of chunks,
# generate vector embeddings, and prepare them for indexing.
def process_chunks(chunks: list[dict], model: str) -> list[dict]:
    items = []
    for chunk in chunks:
        content = chunk["content"]
        id = str(chunk["id"])
        title = chunk.get("title", "")
        url = chunk.get("url", f"/documents/{id}")
        emb = embeddings.embed(input=content, model=model)
        rec = {
            "id": id,
            "content": content,
            # "filepath": chunk.get("filepath", ""),
            "title": title,
            "url": url,
            "contentVector": emb.data[0].embedding,
        }
        items.append(rec)

    return items

In [38]:
# Convert chunks to the expected format (list of dictionaries)
formatted_chunks = [
    {
        "content": chunk.page_content,  # Extract the content
        "id": chunk.metadata.get("id", str(index)),  # Use metadata 'id' or fallback to index
        "title": chunk.metadata.get("title", ""),  # Extract title if available
        "url": chunk.metadata.get("url", ""),  # Extract URL if available
        # "filepath": chunk.metadata.get("filepath", ""),  # Extract filepath if available
    }
    for index, chunk in enumerate(chunks)
]

In [39]:
chunks[0]

Document(metadata={}, page_content='<!-- PageHeader="UNITED STATES" -->')

In [40]:
formatted_chunks[0]

{'content': '<!-- PageHeader="UNITED STATES" -->',
 'id': '0',
 'title': '',
 'url': ''}

In [41]:
# Upload documents to the search index
search_client.upload_documents(process_chunks(formatted_chunks, model="text-embedding-ada-002"))

[<azure.search.documents._generated.models._models_py3.IndexingResult at 0x7fd6c09a94e0>,
 <azure.search.documents._generated.models._models_py3.IndexingResult at 0x7fd6c09a9480>,
 <azure.search.documents._generated.models._models_py3.IndexingResult at 0x7fd6c09ab610>,
 <azure.search.documents._generated.models._models_py3.IndexingResult at 0x7fd6c09a95a0>,
 <azure.search.documents._generated.models._models_py3.IndexingResult at 0x7fd6c09a9750>,
 <azure.search.documents._generated.models._models_py3.IndexingResult at 0x7fd6c09a92d0>,
 <azure.search.documents._generated.models._models_py3.IndexingResult at 0x7fd6c09aada0>,
 <azure.search.documents._generated.models._models_py3.IndexingResult at 0x7fd6c09aaaa0>,
 <azure.search.documents._generated.models._models_py3.IndexingResult at 0x7fd6c09aa410>,
 <azure.search.documents._generated.models._models_py3.IndexingResult at 0x7fd6c09aa3b0>,
 <azure.search.documents._generated.models._models_py3.IndexingResult at 0x7fd6c09aa290>,
 <azure.se

In [42]:
# Check the number of documents in the index 
search_client.get_document_count()

19

### [AI Foundry SDK] Query vector index to retrieve relevant documents

In [43]:
from azure.ai.inference.prompts import PromptTemplate
from azure.search.documents.models import VectorizedQuery

In [44]:
# Create a chat completion client
chat = project.inference.get_chat_completions_client()

In [45]:
# query = "What was the crude oil production in 2019?"
query = "What was the country with the lowest real GDP in 2023?"

In [46]:
def get_product_documents(search_query: str) -> list[dict]:
    """
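    Runs a hybrid (keyword + vector) query against the Azure AI Search index and returns the matching documents.
    The vector part of the query requests the 5 nearest neighbors; because no `top` limit is set on the search call,
    the merged result set can contain more than 5 documents.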

    Args:
        search_query (str): The search query string to find relevant documents.

    Returns:
        list[dict]: A list of dictionaries, where each dictionary represents a document with the following keys:
            - "id" (str): The unique identifier of the document.
            - "content" (str): The content of the document.
            - "title" (str): The title of the document.
            - "url" (str): The URL of the document.
    """
    # generate a vector representation of the search query
    embedding = embeddings.embed(model="text-embedding-ada-002", input=search_query)
    search_vector = embedding.data[0].embedding

    # search the index for chunks matching the search query (hybrid: keyword search plus vector search)
    vector_query = VectorizedQuery(vector=search_vector, k_nearest_neighbors=5, fields="contentVector")

    search_results = search_client.search(
        search_text=search_query, vector_queries=[vector_query], select=["id", "content", "title", "url"] # ["id", "content", "filepath", "title", "url"]
    )

    documents = [
        {
            "id": result["id"],
            "content": result["content"],
            # "filepath": result["filepath"],
            "title": result["title"],
            "url": result["url"],
        }
        for result in search_results
    ]

    print(f"📄 {len(documents)} documents retrieved") #: {documents}")
    return documents

In [47]:
retrieved_documents = get_product_documents(query)

📄 11 documents retrieved
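
The vector query asks for 5 nearest neighbors, yet 11 documents come back. The call is a hybrid query: `search_text` runs a keyword search alongside the vector query, and Azure AI Search merges the two ranked lists with Reciprocal Rank Fusion (RRF), so the result is the union of both lists. A minimal sketch of RRF is shown below; the function name, the toy document IDs, and the constant `k=60` are assumptions for illustration, not the service's code.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of IDs: each list contributes 1 / (k + rank) per document."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties are broken by ID to keep the order deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))

keyword_hits = ["3", "7", "1", "9"]  # toy keyword ranking
vector_hits = ["7", "2", "3"]        # toy vector ranking
fused = reciprocal_rank_fusion([keyword_hits, vector_hits])
print(fused)
```

Documents that appear in both lists rise to the top, while documents found by only one of the two searches are still returned. Passing `top=5` to `search_client.search` would cap the merged list.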


In [48]:
# Helper function to merge the metadata and content of the retrieved documents into a single context string
def generate_context(chunks):
    context = []
    for i, chunk in enumerate(chunks):
        s = (f"Source {i} Metadata: {chunk['id']}\n"
             f"Source {i} Content: {chunk['content']}")
        context.append(s)
    context = '\n---\n'.join(context)
    return context

# Remove redundant chunks
appeared = set()
unique_chunks = []
for chunk in retrieved_documents:
    chunk_id = chunk['id']
    if chunk_id not in appeared:
        appeared.add(chunk_id)
        unique_chunks.append(chunk)
context = generate_context(unique_chunks)

### [AI Foundry SDK] Generate answer to query

In [49]:
# Define system prompt template for chat model
GROUNDED_PROMPT = """
You are an expert in document analysis. You are proficient in reading and analyzing technical reports. You are good at numerical reasoning and have a good understanding of financial concepts. You are given a question which you need to answer based on the references provided. To answer this question, you may first read the question carefully to know what information is required or helpful to answer the question. Then, you may read the references to find the relevant information.

If you find enough information to answer the question, you can first write down your thinking process and then provide a concise answer at the end.
If you find that there is not enough information to answer the question, you can state that there is insufficient information.
If you are not able or sure how to answer the question, say that you are not able to answer the question.
Do not provide any information that is not present in the references.
References are in markdown format, you may follow the markdown syntax to better understand the references.

---
References:
{{context}}
---

Now, here is the question:
---
Question:
{{question}}
---
Thinking Process::: 
Answer::: 
"""

In [50]:
grounded_chat_prompt = PromptTemplate.from_string(GROUNDED_PROMPT)

In [51]:
grounded_chat_prompt.create_messages(context = context, question = query)

[{'role': 'system',
  'content': "You are an expert in document analysis. You are proficient in reading and analyzing technical reports. You are good at numerical reasoning and have a good understanding of financial concepts. You are given a question which you need to answer based on the references provided. To answer this question, you may first read the question carefully to know what information is required or helpful to answer the question. Then, you may read the references to find the relevant information.\n\nIf you find enough information to answer the question, you can first write down your thinking process and then provide a concise answer at the end.\nIf you find that there is not enough information to answer the question, you can state that there is insufficient information.\nIf you are not able or sure how to answer the question, say that you are not able to answer the question.\nDo not provide any information that is not present in the references.\nReferences are in markdow
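
The `{{context}}` and `{{question}}` placeholders in `GROUNDED_PROMPT` are filled in when `create_messages` is called. The substitution can be pictured with the following standard-library sketch. `render_placeholders` is a hypothetical helper for illustration and not the `PromptTemplate` implementation; the template and values are shortened stand-ins.

```python
import re

def render_placeholders(template: str, **values: str) -> str:
    """Replace every {{name}} placeholder with the matching keyword argument."""
    def substitute(match):
        name = match.group(1)
        if name not in values:
            raise KeyError(f"No value provided for placeholder '{name}'")
        return values[name]
    return re.sub(r"\{\{\s*(\w+)\s*\}\}", substitute, template)

template = "References:\n{{context}}\n---\nQuestion:\n{{question}}"
prompt = render_placeholders(
    template,
    context="Source 0 Content: **Title**: 2023 Real GDP",
    question="What was the country with the lowest real GDP in 2023?",
)
messages = [{"role": "system", "content": prompt}]
print(messages[0]["content"])
```

The result has the same shape as the output above: a single system message whose content is the template with both placeholders resolved.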

In [52]:
messages = [{"role": "user", "content": query}]
# documents = get_product_documents(query)
# grounded_chat_prompt = PromptTemplate.from_string(GROUNDED_PROMPT)

system_message = grounded_chat_prompt.create_messages(context = context, question = query)
response = chat.complete(
    model="gpt-4o",
    messages=system_message + messages,
    **grounded_chat_prompt.parameters,
)
print(f"💬 Response: {response.choices[0].message}")

💬 Response: {'content': 'Based on the reference from Source 0, which provides a chart detailing the 2023 Real GDP percent deviation from the pre-crisis trend, the United Kingdom had the lowest real GDP in 2023 with a percent deviation close to -4.0%.', 'refusal': None, 'role': 'assistant'}


In [53]:
type(response)

azure.ai.inference.models._patch.ChatCompletions

In [54]:
print(response.choices[0].message.content)

Based on the reference from Source 0, which provides a chart detailing the 2023 Real GDP percent deviation from the pre-crisis trend, the United Kingdom had the lowest real GDP in 2023 with a percent deviation close to -4.0%.
