## Setup

This quickstart is a reference for the most common use cases of interacting with the prebuilt models of Azure Document Intelligence (**prebuilt-read** and **prebuilt-layout**).


Some add-on capabilities are also explored, together with the usage of **markdown output format** for the layout model.
This option is particularly useful when the results need to be served as context to an LLM, as demonstrated in the last section of this notebook.

### Import libraries

In [71]:
import os
from dotenv import load_dotenv
from azure.core.credentials import AzureKeyCredential
from azure.ai.documentintelligence import DocumentIntelligenceClient
from azure.ai.documentintelligence.models import AnalyzeResult
from azure.ai.documentintelligence.models import AnalyzeDocumentRequest
from azure.ai.documentintelligence.models import DocumentAnalysisFeature
# import base64
import pandas as pd

### Document Intelligence client

In [72]:
# Load environment variables from .env file
load_dotenv(override=True)

True

In [73]:
# Check whether your deployment is single-service (Azure Document Intelligence resource) or multi-service (Azure AI Services resource)
azure_docintelligence_endpoint = os.environ.get('AZURE_DOCUMENT_INTELLIGENCE_ENDPOINT')
azure_docintelligence_key = os.environ.get('AZURE_DOCUMENT_INTELLIGENCE_KEY')
print(f'Current endpoint: {azure_docintelligence_endpoint}')

Current endpoint: https://ep-di-standalone.cognitiveservices.azure.com


In [74]:
document_intelligence_client = DocumentIntelligenceClient(
    endpoint=azure_docintelligence_endpoint, 
    credential=AzureKeyCredential(azure_docintelligence_key),
    # api_version="2024-11-30" # v4.0 (default)
)
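
`os.environ.get` returns `None` when a variable is missing, and the client constructor then fails with a less obvious error. A minimal sketch of an early check follows; `require_env` is a hypothetical helper, not part of the SDK or of `python-dotenv`.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable or fail with a clear message."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(
            f"Environment variable {name!r} is missing or empty; check your .env file."
        )
    return value


# Possible usage before creating the client (variable names as in the cell above):
# azure_docintelligence_endpoint = require_env("AZURE_DOCUMENT_INTELLIGENCE_ENDPOINT")
# azure_docintelligence_key = require_env("AZURE_DOCUMENT_INTELLIGENCE_KEY")
```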

## Sample document

In [5]:
# a lot of test files in different formats are available in this repo:
# https://github.com/Azure-Samples/cognitive-services-REST-api-samples/tree/master/curl/form-recognizer

In [6]:
# for an example of how to use a local file, see the Prebuilt-layout --> Key-value pairs section

In [7]:
# get the document file from a URL
formUrl = "https://raw.githubusercontent.com/Azure-Samples/cognitive-services-REST-api-samples/master/curl/form-recognizer/sample-layout.pdf"

In [8]:
#formUrl = "https://raw.githubusercontent.com/Azure-Samples/cognitive-services-REST-api-samples/master/curl/form-recognizer/invoice-logic-apps-tutorial.pdf"

In [9]:
#formUrl = "https://raw.githubusercontent.com/Azure-Samples/cognitive-services-REST-api-samples/master/curl/form-recognizer/invoice_sample.jpg"

## Analyze document

### Prebuilt-read

In [41]:
poller = document_intelligence_client.begin_analyze_document(
    model_id="prebuilt-read",
    body=AnalyzeDocumentRequest(url_source=formUrl),
)

In [42]:
# An instance of AnalyzeDocumentLROPoller that returns AnalyzeResult. 
# (LRO = long-running operation)
poller

<azure.ai.documentintelligence._operations._patch.AnalyzeDocumentLROPoller at 0x2acb70078e0>

In [43]:
# The result() method is designed to retrieve the result of a long-running operation (LRO), 
# which is a common pattern in cloud services where certain tasks, such as analyzing data or deploying resources, take time to complete.
# It abstracts the complexity of polling and waiting, handling the operation's result once it is available.

# Returns: The deserialized resource of the long running operation, if one is available
result: AnalyzeResult = poller.result(timeout=1000)
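
To make the polling idea concrete, here is a generic sketch of what "poll until done or time out" looks like. It is not the SDK's implementation: `wait_for_result`, `check_status` and the fake operation are hypothetical names used only for illustration.

```python
import time


def wait_for_result(check_status, interval=0.01, timeout=1.0):
    """Poll check_status() until it reports completion or the timeout expires.

    check_status is a hypothetical callable returning a (done, value) tuple.
    """
    deadline = time.monotonic() + timeout
    while True:
        done, value = check_status()
        if done:
            return value
        if time.monotonic() >= deadline:
            raise TimeoutError("operation did not finish in time")
        time.sleep(interval)


# Fake operation that completes on the third status check.
calls = {"count": 0}


def fake_status():
    calls["count"] += 1
    return calls["count"] >= 3, "analysis finished"


print(wait_for_result(fake_status))
```

The poller returned by `begin_analyze_document` hides this loop behind `result()`, so notebook code only has to decide how long it is willing to wait.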

In [None]:
print(result)

In [None]:
print(result.content)

In [15]:
# print dir(result) ignoring hidden attributes
print([attr for attr in dir(result) if not attr.startswith('_')])



In [16]:
# experiment with prebuilt read model: it does not return tables
if result.tables:
    print(f"I've found {len(result.tables)} tables.")
else:
    print("I haven't found any tables.")

I haven't found any tables.
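
Besides `content`, the result also carries per-page data. The sketch below counts lines and words per page; it assumes the `pages`, `page_number`, `lines` and `words` attributes of the result object, and uses stand-in objects with the same attribute names so that it runs without calling the service. `summarize_pages` is a hypothetical helper.

```python
from types import SimpleNamespace


def summarize_pages(result):
    """Return one (page_number, line_count, word_count) tuple per page."""
    summary = []
    for page in result.pages or []:
        lines = page.lines or []
        words = page.words or []
        summary.append((page.page_number, len(lines), len(words)))
    return summary


# Stand-in object with the same attribute names, so the sketch runs offline.
fake_result = SimpleNamespace(
    pages=[
        SimpleNamespace(page_number=1, lines=["FORM 10-Q"], words=["FORM", "10-Q"]),
        SimpleNamespace(page_number=2, lines=None, words=None),
    ]
)
print(summarize_pages(fake_result))
```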


### Prebuilt-layout

In [17]:
poller = document_intelligence_client.begin_analyze_document(
    model_id="prebuilt-layout", 
    body=AnalyzeDocumentRequest(url_source=formUrl) # the parameter urlSource or base64Source is required
)

In [18]:
# Wait for the long-running operation (LRO) to finish, as explained in the Prebuilt-read section.
# Returns: The deserialized resource of the long running operation, if one is available
result: AnalyzeResult = poller.result(timeout=1000)

In [None]:
print(result)

In [20]:
type(result)

azure.ai.documentintelligence.models._models.AnalyzeResult

In [21]:
result.model_id

'prebuilt-layout'

In [22]:
result.api_version

'2024-11-30'

In [23]:
print(result.content)

UNITED STATES SECURITIES AND EXCHANGE COMMISSION Washington, D.C. 20549
FORM 10-Q
☐ ☒ :selected: QUARTERLY REPORT PURSUANT TO SECTION 13 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF 1934 For the Quarterly Period Ended March 31, 2020 OR :unselected: TRANSITION REPORT PURSUANT TO SECTION 13 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF 1934 For the Transition Period From to
Commission File Number 001-37845
MICROSOFT CORPORATION
WASHINGTON (STATE OF INCORPORATION) ONE MICROSOFT WAY, REDMOND, WASHINGTON 98052-6399 (425) 882-8080 www.microsoft.com/investor
91-1144442 (I.R.S. ID)
Securities registered pursuant to Section 12(b) of the Act:
Title of each class
Trading Symbol
Name of exchange on which registered
Common stock, $0.00000625 par value per share
MSFT
NASDAQ
2.125% Notes due 2021
MSFT
NASDAQ
3.125% Notes due 2028
MSFT
NASDAQ
2.625% Notes due 2033
MSFT
NASDAQ
Securities registered pursuant to Section 12(g) of the Act: NONE
Indicate by check mark whether the registrant (1) has filed all rep

In [24]:
if result.tables:
    print(f"I've found {len(result.tables)} tables.")

I've found 2 tables.


#### Tables parsing

In [None]:
if result.tables:
    for table_idx, table in enumerate(result.tables):
        print(
            f"Table # {table_idx} has {table.row_count} rows and "
            f"{table.column_count} columns"
        )
        if table.bounding_regions:
            for region in table.bounding_regions:
                print(
                    f"Table # {table_idx} location on page: {region.page_number} is {region.polygon}"
                )
        for cell in table.cells:
            print(
                f"...Cell[{cell.row_index}][{cell.column_index}] has text '{cell.content}'"
            )
            if cell.bounding_regions:
                for region in cell.bounding_regions:
                    print(
                        f"...content on page {region.page_number} is within bounding polygon '{region.polygon}'"
                    )

In [26]:
# table to dataframe
if result.tables:
    # list to store all dataframes
    dataframes = []  
    for table_idx, table in enumerate(result.tables):
        # count rows and columns, considering the header row
        print(
            f"Table # {table_idx} has {table.row_count - 1} rows and "
            f"{table.column_count} columns"
        )
        # initialize an empty dataframe with the correct dimensions
        df = pd.DataFrame(index=range(table.row_count), columns=range(table.column_count))
        for cell in table.cells:
            # Assign the cell content to the correct location in the dataframe
            df.at[cell.row_index, cell.column_index] = cell.content        
        # promote the first row as column headers
        df.columns = df.iloc[0]  # Set the first row as the header
        df = df[1:].reset_index(drop=True)  # Drop the first row and reset the index
        
        # add the current dataframe to the list of dataframes
        dataframes.append(df)  

Table # 0 has 4 rows and 3 columns
Table # 1 has 1 rows and 2 columns


In [27]:
len(dataframes)

2

In [28]:
dataframes[0]

Unnamed: 0,Title of each class,Trading Symbol,Name of exchange on which registered
0,"Common stock, $0.00000625 par value per share",MSFT,NASDAQ
1,2.125% Notes due 2021,MSFT,NASDAQ
2,3.125% Notes due 2028,MSFT,NASDAQ
3,2.625% Notes due 2033,MSFT,NASDAQ
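
The dataframe cell above assumes that row 0 is always the header. A hedged alternative is to rely on each cell's `kind`, which the layout model is expected to set to `"columnHeader"` for header cells. The sketch uses plain lists and stand-in objects so that it runs without the service; `table_to_rows` is a hypothetical helper.

```python
from types import SimpleNamespace


def table_to_rows(table):
    """Split a table into (header_rows, body_rows) using each cell's `kind`."""
    grid = [["" for _ in range(table.column_count)] for _ in range(table.row_count)]
    header_indexes = set()
    for cell in table.cells:
        grid[cell.row_index][cell.column_index] = cell.content
        # Assumption: header cells carry kind == "columnHeader".
        if getattr(cell, "kind", None) == "columnHeader":
            header_indexes.add(cell.row_index)
    header_rows = [grid[i] for i in sorted(header_indexes)]
    body_rows = [row for i, row in enumerate(grid) if i not in header_indexes]
    return header_rows, body_rows


def make_cell(row, col, content, kind=None):
    """Build a stand-in for a table cell (illustration only)."""
    return SimpleNamespace(row_index=row, column_index=col, content=content, kind=kind)


fake_table = SimpleNamespace(
    row_count=2,
    column_count=2,
    cells=[
        make_cell(0, 0, "Title of each class", "columnHeader"),
        make_cell(0, 1, "Trading Symbol", "columnHeader"),
        make_cell(1, 0, "2.125% Notes due 2021"),
        make_cell(1, 1, "MSFT"),
    ],
)
header_rows, body_rows = table_to_rows(fake_table)
print(header_rows)
print(body_rows)
```

When exactly one header row is found, `pd.DataFrame(body_rows, columns=header_rows[0])` gives the same shape as the dataframes built above.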


#### Key-value pairs

In [45]:
# Read the local file in binary mode
with open('assets/simple-invoice.png', "rb") as file:
    poller = document_intelligence_client.begin_analyze_document(
        model_id="prebuilt-layout",
        body=file,  # Pass the file as the 'body' parameter
        features=[DocumentAnalysisFeature.KEY_VALUE_PAIRS],
        # Content type of the uploaded bytes (default: "application/json").
        # Other examples: "image/jpeg", "image/png", "application/pdf".
        # "application/octet-stream" is a flexible choice for mixed file types: it is a safe default
        # but may not provide the best performance for specific file types.
        content_type="image/png",
    )
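
When the files come in mixed formats, the content type can be guessed from the file extension instead of being hard-coded. This is a small sketch using the standard-library `mimetypes` module; `guess_content_type` is a hypothetical helper, and the fallback value is the generic type mentioned in the comment above.

```python
import mimetypes


def guess_content_type(path, fallback="application/octet-stream"):
    """Guess a MIME type from the file extension, with a generic fallback."""
    content_type, _encoding = mimetypes.guess_type(path)
    return content_type or fallback


print(guess_content_type("assets/simple-invoice.png"))
print(guess_content_type("sample-layout.pdf"))
```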


In [46]:
result: AnalyzeResult = poller.result()
# key_value_pairs can be None when nothing is detected, so fall back to an empty list
print(f"I've found {len(result.key_value_pairs or [])} key-value pairs.")

I've found 7 key-value pairs.


In [47]:
# verbose print of key-value pairs
result.key_value_pairs

[{'key': {'content': 'Address:', 'boundingRegions': [{'pageNumber': 1, 'polygon': [186, 353, 329, 353, 329, 385, 186, 384]}], 'spans': [{'offset': 8, 'length': 8}]}, 'value': {'content': '1 Redmond way Suite\n6000 Redmond, WA\n99243', 'boundingRegions': [{'pageNumber': 1, 'polygon': [186, 397, 508, 397, 508, 519, 186, 519]}], 'spans': [{'offset': 17, 'length': 42}]}, 'confidence': 0.997},
 {'key': {'content': 'Invoice For:', 'boundingRegions': [{'pageNumber': 1, 'polygon': [1031, 351, 1201, 351, 1201, 386, 1031, 386]}], 'spans': [{'offset': 60, 'length': 12}]}, 'value': {'content': 'Microsoft\n1020 Enterprise Way\nSunnayvale, CA 87659', 'boundingRegions': [{'pageNumber': 1, 'polygon': [1220, 351, 1568, 351, 1568, 480, 1220, 480]}], 'spans': [{'offset': 73, 'length': 50}]}, 'confidence': 0.997},
 {'key': {'content': 'Invoice Number', 'boundingRegions': [{'pageNumber': 1, 'polygon': [123, 671, 374, 672, 374, 706, 123, 706]}], 'spans': [{'offset': 124, 'length': 14}]}, 'value': {'content'

In [48]:
print("----Key-value pairs found in document----")
if result.key_value_pairs:
    for kv_pair in result.key_value_pairs:
        key = kv_pair.key.content if kv_pair.key else "None"
        value = kv_pair.value.content if kv_pair.value else "None"
        print(f"Key: {key}, \nValue: {value}")
        print("--")


----Key-value pairs found in document----
Key: Address:, 
Value: 1 Redmond way Suite
6000 Redmond, WA
99243
--
Key: Invoice For:, 
Value: Microsoft
1020 Enterprise Way
Sunnayvale, CA 87659
--
Key: Invoice Number, 
Value: 34278587
--
Key: Invoice Date, 
Value: 6/18/2017
--
Key: Invoice Due Date, 
Value: 6/24/2017
--
Key: Charges, 
Value: $56,651.49
PT
--
Key: VAT ID, 
Value: None
--
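
For downstream use, the pairs are often easier to handle as a dictionary. The sketch below assumes the `key`, `value` and `confidence` attributes visible in the verbose output above; `key_value_pairs_to_dict` and its `min_confidence` threshold are hypothetical, and the stand-in data (including the 0.5 confidence) is made up for illustration.

```python
from types import SimpleNamespace


def key_value_pairs_to_dict(pairs, min_confidence=0.0):
    """Map each key's text to its value's text, skipping low-confidence pairs."""
    extracted = {}
    for pair in pairs or []:
        if pair.key is None or (pair.confidence or 0.0) < min_confidence:
            continue
        key = pair.key.content.strip().rstrip(":")
        extracted[key] = pair.value.content if pair.value else None
    return extracted


# Stand-in pairs with the same attribute names, so the sketch runs offline.
fake_pairs = [
    SimpleNamespace(
        key=SimpleNamespace(content="Address:"),
        value=SimpleNamespace(content="1 Redmond way Suite\n6000 Redmond, WA\n99243"),
        confidence=0.997,
    ),
    SimpleNamespace(key=SimpleNamespace(content="VAT ID"), value=None, confidence=0.5),
]
print(key_value_pairs_to_dict(fake_pairs))
print(key_value_pairs_to_dict(fake_pairs, min_confidence=0.9))
```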


#### Markdown output

In [85]:
poller = document_intelligence_client.begin_analyze_document(
    "prebuilt-layout",
    body=AnalyzeDocumentRequest(url_source=formUrl),
    output_content_format="markdown" # default "text"
)

In [86]:
print(formUrl)

https://raw.githubusercontent.com/Azure-Samples/cognitive-services-REST-api-samples/master/curl/form-recognizer/sample-layout.pdf


In [87]:
# retrieve the file name from the URL
file_name = os.path.basename(formUrl)
print(f"File name: {file_name}")
# file name without extension
file_name_without_ext = os.path.splitext(file_name)[0]

File name: sample-layout.pdf


In [88]:
result: AnalyzeResult = poller.result()

In [89]:
print(result.content)

# UNITED STATES SECURITIES AND EXCHANGE COMMISSION Washington, D.C. 20549


## FORM 10-Q

☐
☒
☒
QUARTERLY REPORT PURSUANT TO SECTION 13 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF
1934
For the Quarterly Period Ended March 31, 2020
OR
☐
TRANSITION REPORT PURSUANT TO SECTION 13 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF
1934
For the Transition Period From
to

Commission File Number 001-37845


## MICROSOFT CORPORATION

WASHINGTON
(STATE OF INCORPORATION)
ONE MICROSOFT WAY, REDMOND, WASHINGTON 98052-6399
(425) 882-8080
www.microsoft.com/investor

91-1144442
(I.R.S. ID)

Securities registered pursuant to Section 12(b) of the Act:


<table>
<tr>
<th>Title of each class</th>
<th>Trading Symbol</th>
<th>Name of exchange on which registered</th>
</tr>
<tr>
<td>Common stock, $0.00000625 par value per share</td>
<td>MSFT</td>
<td>NASDAQ</td>
</tr>
<tr>
<td>2.125% Notes due 2021</td>
<td>MSFT</td>
<td>NASDAQ</td>
</tr>
<tr>
<td>3.125% Notes due 2028</td>
<td>MSFT</td>
<td>NASDAQ</td>
</tr>
<tr>
<t

In [90]:
# save result content to file
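# explicit UTF-8 so that symbols such as ☐ and ☒ are written correctly on any platform
with open(f"assets/{file_name_without_ext}.md", "w", encoding="utf-8") as f: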
    f.write(result.content)  # Write the string content directly
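
One practical benefit of markdown output is that headings give natural split points when the content is too long to pass to an LLM in one piece. The following is a minimal sketch, not a full markdown parser: `split_markdown_by_heading` is a hypothetical helper, and it would mis-split a `#` line that appears inside a fenced code block.

```python
import re


def split_markdown_by_heading(markdown_text):
    """Split markdown into (heading, body) sections at lines starting with '#'."""
    sections = []
    heading, body = "", []
    for line in markdown_text.splitlines():
        if re.match(r"^#{1,6}\s", line):
            # Close the previous section if it has a heading or any text.
            if heading or any(part.strip() for part in body):
                sections.append((heading, "\n".join(body).strip()))
            heading, body = line.lstrip("#").strip(), []
        else:
            body.append(line)
    if heading or any(part.strip() for part in body):
        sections.append((heading, "\n".join(body).strip()))
    return sections


sample = (
    "# UNITED STATES SECURITIES AND EXCHANGE COMMISSION\n\n"
    "## FORM 10-Q\n\n"
    "Commission File Number 001-37845\n"
)
for heading, body in split_markdown_by_heading(sample):
    print(repr(heading), "->", repr(body))
```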

#### Extract figures

In [77]:
import fitz
from PIL import Image
import io

In [82]:
file_path = "assets/sample_report_10pg.pdf"

In [83]:
# Analyze a local PDF with the layout model (high-resolution OCR, markdown output) to locate its figures
with open(file_path, "rb") as f:
    pdf_bytes = f.read()

    poller = document_intelligence_client.begin_analyze_document(
        "prebuilt-layout",
        AnalyzeDocumentRequest(bytes_source=pdf_bytes),
        # output=["figures"],
        features=["ocrHighResolution"],
        output_content_format="markdown"
    )

    result: AnalyzeResult = poller.result()

In [80]:
result.figures

[{'id': '1.1', 'boundingRegions': [{'pageNumber': 1, 'polygon': [0.6657, 0.201, 1.4683, 0.2009, 1.4685, 1.0425, 0.6659, 1.0427]}], 'spans': [{'offset': 0, 'length': 28}], 'elements': ['/paragraphs/0']},
 {'id': '2.1', 'boundingRegions': [{'pageNumber': 2, 'polygon': [0.3589, 0.3289, 1.1488, 0.329, 1.1488, 1.1574, 0.3588, 1.1574]}], 'spans': [{'offset': 1811, 'length': 33}], 'elements': ['/paragraphs/16']},
 {'id': '5.1', 'boundingRegions': [{'pageNumber': 5, 'polygon': [0.6718, 0.1812, 1.4738, 0.1812, 1.4739, 1.0343, 0.6717, 1.0343]}], 'spans': [{'offset': 10946, 'length': 32}], 'elements': ['/paragraphs/168']},
 {'id': '9.1', 'boundingRegions': [{'pageNumber': 9, 'polygon': [4.3481, 2.1541, 7.1709, 2.1544, 7.1705, 3.8987, 4.3477, 3.8981]}], 'spans': [{'offset': 20989, 'length': 204}], 'elements': ['/paragraphs/238'], 'footnotes': [{'content': 'Sources: IMF World Economic Outlook. Note: Bars show the difference in real output in 2023 and anticipated output for the same period prior to 

In [67]:
for figure_idx, figure in enumerate(result.figures):
    page_number = figure.bounding_regions[0].page_number
    print(f"Figure {figure_idx} is on page {page_number}.")

Figure 0 is on page 1.
Figure 1 is on page 2.
Figure 2 is on page 5.
Figure 3 is on page 9.
Figure 4 is on page 10.
Figure 5 is on page 10.
Figure 6 is on page 10.
Figure 7 is on page 10.


In [None]:
def crop_image_from_pdf_page(pdf_path, page_number, bounding_box):
    """
    Crops a region from a given page in a PDF and returns it as an image.

    Args:
    - pdf_path (str or pathlib.Path): Path to the PDF file.
    - page_number (int): The page number to crop from (0-indexed).
    - bounding_box (tuple): A tuple of (x0, y0, x1, y1) coordinates for the bounding box,
            given in inches; they are converted to points inside the function (1 inch = 72 points).
    
    Returns:
    - PIL.Image: A PIL Image of the cropped area.
    """
    doc = fitz.open(pdf_path)
    page = doc.load_page(page_number)
    
    # Cropping the page. The rect requires the coordinates in the format (x0, y0, x1, y1).
    bbx = [x * 72 for x in bounding_box]
    rect = fitz.Rect(bbx)
    # Render the cropped region into a high-resolution image (300 DPI)
    pix = page.get_pixmap(matrix=fitz.Matrix(300 / 72, 300 / 72), clip=rect)
    
    # The resulting pixel data is converted into a PIL.Image object using Image.frombytes, allowing for further manipulation or saving in various image formats.
    img = Image.frombytes("RGB", [pix.width, pix.height], pix.samples)
    
    doc.close()

    return img

In [85]:
if result.figures:
    print("Extracting figures...")
    os.makedirs("figures", exist_ok=True)
    for figure_idx, figure in enumerate(result.figures):
        if not figure.bounding_regions:
            continue
        # Use the first bounding region for both the page number and the crop box
        # To learn more about bounding regions, see https://aka.ms/bounding-region
        region = figure.bounding_regions[0]
        # Uncomment the line below to print the bounding region of each figure
        # print(f"Figure {figure_idx + 1} body bounding region: {region}")
        bounding_box = (
            region.polygon[0],  # x0 (left)
            region.polygon[1],  # y0 (top)
            region.polygon[4],  # x1 (right)
            region.polygon[5],  # y1 (bottom)
        )
        cropped_img = crop_image_from_pdf_page(file_path, region.page_number - 1, bounding_box)

        figure_filename = f"figure_{figure_idx + 1}.png"
        # Full path for the file
        figure_filepath = os.path.join("figures", figure_filename)

        # Save the figure
        cropped_img.save(figure_filepath)

        # Keep an in-memory PNG copy as well (not used further in this notebook)
        bytes_io = io.BytesIO()
        cropped_img.save(bytes_io, format='PNG')
        figure_bytes = bytes_io.getvalue()

        # Print the figure name
        print(f"\tFigure {figure_idx + 1} saved as {figure_filepath}")

Extracting figures...
	Figure 1 saved as figures\figure_1.png
	Figure 2 saved as figures\figure_2.png
	Figure 3 saved as figures\figure_3.png
	Figure 4 saved as figures\figure_4.png
	Figure 5 saved as figures\figure_5.png
	Figure 6 saved as figures\figure_6.png
	Figure 7 saved as figures\figure_7.png
	Figure 8 saved as figures\figure_8.png
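
The crop box above takes the first and third polygon points as opposite corners, which works when the polygon is an axis-aligned rectangle. For slightly skewed polygons, a hedged alternative is to take the minimum and maximum over all points. Both functions below are hypothetical helpers; the polygon is the one printed for the first figure in the `result.figures` output.

```python
def polygon_to_bbox(polygon):
    """Return (x0, y0, x1, y1) enclosing a flat [x, y, x, y, ...] polygon."""
    xs = polygon[0::2]
    ys = polygon[1::2]
    return (min(xs), min(ys), max(xs), max(ys))


def inches_to_points(bbox):
    """Convert inch coordinates to PDF points (1 inch = 72 points)."""
    return tuple(round(value * 72, 2) for value in bbox)


# Polygon of figure '1.1' from the output above (coordinates in inches).
polygon = [0.6657, 0.201, 1.4683, 0.2009, 1.4685, 1.0425, 0.6659, 1.0427]
bbox = polygon_to_bbox(polygon)
print(bbox)
print(inches_to_points(bbox))
```

The resulting tuple has the same `(x0, y0, x1, y1)` layout that `crop_image_from_pdf_page` expects for its `bounding_box` argument.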


## Chat with your document (basic)

In [91]:
from openai import AzureOpenAI

In [92]:
# Load environment variables from .env file
load_dotenv(override=True)

# Use your `endpoint` environment variable for Azure OpenAI
azure_openai_endpoint = os.environ.get('AZURE_OPENAI_ENDPOINT')
print(f'Current endpoint: {azure_openai_endpoint}')

Current endpoint: https://epaifhub1921084884.openai.azure.com/


### Azure OpenAI client

In [99]:
client = AzureOpenAI(
    azure_endpoint=os.getenv("AZURE_OPENAI_ENDPOINT"),
    api_key=os.getenv("AZURE_OPENAI_KEY"),
    api_version="2024-05-01-preview",  # "2024-08-01-preview"
)

### Prompt template

In [94]:
question = "What quarterly period does this form cover?"

In [95]:
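# NOTE: `result` must hold the markdown analysis of sample-layout.pdf (Markdown output section).
# The Extract figures cells reassign `result`, so re-run the Markdown output cells if needed.
document_prompt = f"""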
Given the markdown-formatted content extracted, answer the following question using only the information contained in the content.
---

Answer concisely and factually. If the information is not present, reply: "Not specified in the document."
---

Markdown Content:
{result.content}
---

Question:
{question}
"""

### Chat

In [100]:
document_response = client.chat.completions.create(
    messages=[
        {
            "role": "system",
            "content": "You are a document understanding assistant.",
        },
        {
            "role": "user", 
            "content": document_prompt,
        },
    ],
    model="gpt-4o", 
    temperature=0.0, # for stable results
)

print(document_response.choices[0].message.content)

The quarterly period covered by this form is the period ended March 31, 2020.
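
To ask several questions about the same document, the prompt template can be wrapped in a function. This is a sketch under assumptions: `build_document_prompt` and the `max_chars` cut-off are hypothetical, and truncating by characters is only a crude stand-in for a proper token budget.

```python
def build_document_prompt(content, question, max_chars=20000):
    """Build a grounded question-answering prompt, truncating overly long content."""
    if len(content) > max_chars:
        content = content[:max_chars] + "\n[content truncated]"
    return (
        "Given the markdown-formatted content extracted, answer the following question "
        "using only the information contained in the content.\n"
        "---\n\n"
        "Answer concisely and factually. If the information is not present, reply: "
        '"Not specified in the document."\n'
        "---\n\n"
        f"Markdown Content:\n{content}\n"
        "---\n\n"
        f"Question:\n{question}\n"
    )


prompt = build_document_prompt("## FORM 10-Q", "What quarterly period does this form cover?")
print(prompt)
```

The returned string can be passed as the `content` of the user message in `client.chat.completions.create`, exactly like `document_prompt` above.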


### Questions generation

In [97]:
questions_prompt = f"""
Given the following markdown-formatted content, generate a list of 5 relevant questions that can be used to verify the correct processing and comprehension of the document by an automated pipeline.

The questions could cover:
- Document metadata and structure
- Company information
- Securities details
- Compliance and filing status
- Shares outstanding

Make sure the questions are clear, factual, and refer only to the information available in the text.

---

Markdown Content:
{result.content}

---

Now, list the questions based on the markdown content above.
"""

In [98]:
questions_response = client.chat.completions.create(
    messages=[
        {
            "role": "system",
            "content": "You are a document understanding assistant.",
        },
        {
            "role": "user", 
            "content": questions_prompt,
        },
    ],
    model="gpt-4o", 
    temperature=0.0, # for stable results
)

print(questions_response.choices[0].message.content)

Here are five relevant questions based on the provided markdown content:

1. **Document Metadata and Structure**  
   - What is the form type of the document, and for which quarterly period is the report filed?

2. **Company Information**  
   - What is the name of the company, its state of incorporation, and its principal address as listed in the document?

3. **Securities Details**  
   - What securities are registered pursuant to Section 12(b) of the Securities Exchange Act, and what is the trading symbol and exchange for these securities?

4. **Compliance and Filing Status**  
   - Has the registrant filed all reports required by Section 13 or 15(d) of the Securities Exchange Act during the preceding 12 months, and is the registrant classified as a large accelerated filer?

5. **Shares Outstanding**  
   - As of April 24, 2020, how many shares of common stock, with a par value of $0.00000625 per share, were outstanding?
