# Document Intelligence with Markdown

In this notebook we experiment with Document Intelligence and its Markdown output. We read several sample documents and test whether Document Intelligence can handle harder cases, such as tables that span multiple pages.


### Python Imports


In [None]:
%load_ext autoreload
%autoreload 2


import sys
sys.path.append('..\\code')


import os
from dotenv import load_dotenv
load_dotenv()

from IPython.display import display, Markdown, HTML
from PIL import Image
from doc_utils import *


def show_img(img_path, width = None):
    if width is not None:
        display(HTML(f'<img src="{img_path}" width={width}>'))
    else:
        display(Image.open(img_path))


### Make sure we have the OpenAI Models information

We will need the GPT-4-Turbo and GPT-4-Vision models for this notebook.

When you run the cell below, the values should match the OpenAI resource settings you defined in the `.env` file.

In [None]:
model_info = {
        'AZURE_OPENAI_RESOURCE': os.environ.get('AZURE_OPENAI_RESOURCE'),
        'AZURE_OPENAI_KEY': os.environ.get('AZURE_OPENAI_KEY'),
        'AZURE_OPENAI_MODEL_VISION': os.environ.get('AZURE_OPENAI_MODEL_VISION'),
        'AZURE_OPENAI_MODEL': os.environ.get('AZURE_OPENAI_MODEL'),
}
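
Because `os.environ.get` returns `None` for a variable that is not set, a missing `.env` entry would otherwise only surface later as a failed API call. A minimal sketch of an early check follows. `find_missing_settings` is a hypothetical helper (not part of `doc_utils`), and the sample dictionary with made-up values stands in for `model_info`.

```python
def find_missing_settings(settings):
    """Return the names of settings whose value is None or an empty string."""
    return [name for name, value in settings.items() if not value]


# Stand-in for model_info; in the notebook, pass model_info itself.
example_settings = {
    'AZURE_OPENAI_RESOURCE': 'my-resource',  # made-up value
    'AZURE_OPENAI_KEY': None,                # simulates a missing .env entry
}

missing = find_missing_settings(example_settings)
if missing:
    print("Missing settings:", ", ".join(missing))
```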

### Experimenting with Document Intelligence

First, make sure to install the right version of the Document Intelligence package.

In [None]:
## This version corresponds to API Version 2024-02-29-preview
## Visit: https://learn.microsoft.com/en-us/python/api/overview/azure/ai-documentintelligence-readme?view=azure-python-preview

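## The requirement is quoted so the shell does not treat ">" as an output redirect.
## It is pinned to the beta named above because the helper below uses that version's argument names.
%pip install "azure-ai-documentintelligence==1.0.0b2"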

Define a helper function that sends a document to the `prebuilt-layout` model and requests Markdown output.

In [None]:
from azure.core.credentials import AzureKeyCredential
from azure.ai.documentintelligence import DocumentIntelligenceClient
from azure.ai.documentintelligence.models import AnalyzeResult
from azure.ai.documentintelligence.models import AnalyzeDocumentRequest, ContentFormat

endpoint = os.environ["DI_ENDPOINT"]
key = os.environ["DI_KEY"]

def analyze_document(path):
    document_intelligence_client = DocumentIntelligenceClient(endpoint=endpoint, credential=AzureKeyCredential(key))

    with open(path, "rb") as f:
        poller = document_intelligence_client.begin_analyze_document(
            "prebuilt-layout", analyze_request=f, output_content_format=ContentFormat.MARKDOWN, content_type="application/octet-stream"
        )
    result: AnalyzeResult = poller.result()
    
    return result



#### Reading in Sample Documents

In [None]:
path = r"sample_data/1_London_Brochure.docx"
london_docx_result = analyze_document(path)
Markdown(london_docx_result['content'])

In [None]:
path = r"sample_data/1_London_Brochure.pdf"
london_pdf_result = analyze_document(path)
Markdown(london_pdf_result['content'])

### Figures

For the same document, figures are detected in the PDF version but not in the DOCX version.
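
One way to make this comparison without risking a `KeyError` is to read the `figures` entry with a default. The sketch below assumes the analysis result behaves like a dictionary, as the `result['content']` and `.keys()` accesses in this notebook suggest. `count_figures` is a hypothetical helper, and the two dictionaries are made-up stand-ins for real results.

```python
def count_figures(result):
    """Count detected figures, treating a missing or empty 'figures' entry as zero."""
    return len(result.get('figures') or [])


# Made-up stand-ins: one result with a figure, one without a 'figures' key.
pdf_like = {'content': '...', 'figures': [{'boundingRegions': []}]}
docx_like = {'content': '...'}

print(count_figures(pdf_like), count_figures(docx_like))
```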

In [None]:
london_pdf_result.keys()

In [None]:
london_docx_result.keys()

In [None]:
london_pdf_result['figures']

In [None]:
from PIL import Image

def get_dpi(image):
    try:
        dpi = image.info['dpi']
        print("DPI", dpi)
    except KeyError:
        dpi = (300, 300)
    return dpi

def polygon_to_bbox(polygon):
    xs = polygon[::2]  # Extract all x coordinates
    ys = polygon[1::2]  # Extract all y coordinates
    left = min(xs)
    top = min(ys)
    right = max(xs)
    bottom = max(ys)
    return (left, top, right, bottom)


def inches_to_pixels(inches, dpi):
    dpi_x, dpi_y = dpi
    return [int(inches[i] * dpi_x if i % 2 == 0 else inches[i] * dpi_y) for i in range(len(inches))]

polygon = [2.7496, 2.5964, 7.1241, 2.597, 7.1249, 5.0656, 2.7505, 5.0645]


def extract_figure(image_path, polygon):
    bbox_in_inches = polygon_to_bbox(polygon)

    # Load the image
    image = Image.open(image_path)

    filename = os.path.splitext(os.path.basename(image_path))[0].strip()
    extension = os.path.splitext(os.path.basename(image_path))[1].strip()
    crop_name = os.path.join(os.path.dirname(image_path), f"{filename}_{generate_uuid_from_string(str(polygon))}{extension}")
    print(f"Cropped image will be saved under name: {crop_name}")

    # Get DPI from the image
    dpi = get_dpi(image)

    # Convert the bounding box to pixels using the image's DPI
    bbox_in_pixels = inches_to_pixels(bbox_in_inches, dpi)
    print("bbox_in_pixels", bbox_in_pixels)

    # Crop the image
    cropped_image = image.crop(bbox_in_pixels)
    cropped_image.save(crop_name)  
    show_img(crop_name, 400)

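# NOTE: `png_files` is not defined in this notebook. It is assumed to be a list of
# page images (one PNG per PDF page) rendered beforehand. Only the first page image
# is used here, so figures on later pages would need the image of their own page.
for bounding in london_pdf_result['figures']:
    polygon = bounding['boundingRegions'][0]['polygon']
    extract_figure(png_files[0], polygon)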
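
To make the coordinate conversion in the cell above concrete, here is a small worked example using the sample `polygon` and the 300 DPI fallback. It treats the polygon values as inches, as the cell above does. `bbox_to_pixels_outward` is a hypothetical variant of `inches_to_pixels` that rounds the left and top edges down and the right and bottom edges up, so the crop cannot clip the figure by a fraction of a pixel. It is a sketch, not a replacement for the helper above.

```python
import math


def polygon_to_bbox(polygon):
    """Collapse a flat [x1, y1, x2, y2, ...] polygon into (left, top, right, bottom)."""
    xs = polygon[::2]
    ys = polygon[1::2]
    return (min(xs), min(ys), max(xs), max(ys))


def bbox_to_pixels_outward(bbox, dpi):
    """Convert inches to pixels, rounding outward so the crop never clips the figure."""
    dpi_x, dpi_y = dpi
    left, top, right, bottom = bbox
    return (math.floor(left * dpi_x), math.floor(top * dpi_y),
            math.ceil(right * dpi_x), math.ceil(bottom * dpi_y))


polygon = [2.7496, 2.5964, 7.1241, 2.597, 7.1249, 5.0656, 2.7505, 5.0645]
bbox = polygon_to_bbox(polygon)
pixels = bbox_to_pixels_outward(bbox, (300, 300))
print(bbox)    # (2.7496, 2.5964, 7.1249, 5.0656)
print(pixels)  # (824, 778, 2138, 1520)
```

The result is only meaningful if the DPI matches the resolution at which the page image was rendered, so the `(300, 300)` fallback should be checked against however the PNG files were produced.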

### Document Intelligence with MS Word documents (docx)

Next, we analyze another sample document.

In [None]:
path = r"sample_data/2_ai_facts.docx"
docx_result = analyze_document(path)
Markdown(docx_result['content'])

In [None]:
print(docx_result['content'])

### Document Intelligence with PDF Documents

Notice that the table is broken across 3 pages, which yields 3 separate tables.

In [None]:
path = r"sample_data/2_ai_facts.pdf"
pdf_result = analyze_document(path)
Markdown(pdf_result['content'])

It seems that for the PDF, Document Intelligence did not correctly identify this as a single table, but rather as three separate tables.

In [None]:
print(pdf_result['content'])

### Table Merging

Let's try to merge the tables using OpenAI GPT-4.

In [None]:
#### Let's try something "hacky"

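## A hacky way to check whether the text contains a Markdown table: look for a row ending in '|\n'.
## (The length of str.split() is always at least 1, so comparing it to 0 would always succeed.)
if '|\n' in pdf_result['content']:
    print('yes')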


#### First Step

First, try to establish the table boundaries.
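
As a cross-check for the LLM output, the same boundaries can be approximated deterministically. The sketch below applies the definition used in the prompt (a table is a run of consecutive lines, each a pipe-delimited row) and returns the first and last row of every run. `find_table_blocks` is a hypothetical helper, and the sample text is made up for illustration.

```python
def find_table_blocks(markdown_text):
    """Return (first_row, last_row) for every run of consecutive lines starting with '|'."""
    blocks, current = [], []
    for line in markdown_text.split('\n'):
        if line.lstrip().startswith('|'):
            current.append(line)
        elif current:
            # A non-table line closes the current run.
            blocks.append((current[0], current[-1]))
            current = []
    if current:
        blocks.append((current[0], current[-1]))
    return blocks


# Made-up sample: one logical table split in two by a page number.
sample = "Intro\n\n| A | B |\n| - | - |\n| 1 | 2 |\n\nPage 1\n\n| 3 | 4 |\n| 5 | 6 |\n"
table_blocks = find_table_blocks(sample)
for first_row, last_row in table_blocks:
    print(repr(first_row), repr(last_row))
```

Comparing these rows with the `first_row` and `last_row` values returned by the model is a cheap way to spot rows that were not reproduced verbatim.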

In [None]:
## Experiment with prompting, to see if broken tables can be "merged" based on LLM's output
## This is work in progress .. NOT YET COMPLETE


prompt = """
You are a Markdown expert whose objective is to check whether Markdown tables have been split into two or more tables because of page breaks or because of the OCR extraction process. You are designed to output JSON.

You are given the following Markdown text:
## START OF MARKDOWN TEXT
{text}
## END OF MARKDOWN TEXT

In the markdown text above, detect whether there are Markdown tables. If there are, you **MUST** then output the first row and the last row of each table. We define a Markdown table as consecutive rows with no blank line between them, with each row separated from the next by '|\n'. Do not separate Markdown tables semantically.


JSON OUTPUT FORMAT:
You **MUST** output a JSON object with the following key-value pairs and following format:

{{
    "total_number_of_markdown_tables_located": "the total number of Markdown tables located",
    "table_first_and_last_rows_array":
    [
        {{
            "first_row": "The first row of the table in Markdown table format. Make sure that the first row of the table is outputted verbatim word-for-word, including any leading or trailing spaces, and any leading or trailing pipes.",
            "last_row": "The last row of the table in Markdown table format. Make sure that the last row of the table is outputted verbatim word-for-word, including any leading or trailing spaces, and any leading or trailing pipes."
        }}
    ]
}}
"""

p = prompt.format(text = pdf_result['content'])
table_edges = ask_LLM_with_JSON(p, model_info=model_info)
print(table_edges)



#### Second Step

Second, use the table boundaries to determine whether the tables were split by page breaks or by the OCR extraction, and if they were, try to merge them.
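
The "ignorable text between tables" rule that the prompt below describes can also be sketched as a plain heuristic. The version here treats a gap as ignorable if, after removing HTML comments, only blank lines and bare page numbers remain. It assumes that page markers appear in the Markdown as HTML comments (for example `<!-- PageBreak -->`), which should be verified against the actual output. `is_ignorable_gap` is a hypothetical helper and does not attempt to recognize footnotes.

```python
import re

# Assumption: page markers show up as HTML comments in the Markdown output.
HTML_COMMENT = re.compile(r'<!--.*?-->', re.DOTALL)
PAGE_NUMBER = re.compile(r'(page\s*)?\d+', re.IGNORECASE)


def is_ignorable_gap(text_between_tables):
    """True if the gap holds only whitespace, HTML comments, or bare page numbers."""
    without_comments = HTML_COMMENT.sub('', text_between_tables)
    for line in without_comments.split('\n'):
        stripped = line.strip()
        if not stripped:
            continue
        if PAGE_NUMBER.fullmatch(stripped):
            continue
        return False  # real text sits between the tables
    return True


print(is_ignorable_gap('\n\n<!-- PageBreak -->\n\n'))                     # True
print(is_ignorable_gap('\nPage 3\n'))                                    # True
print(is_ignorable_gap('\nThe next table lists different data.\n'))      # False
```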

In [None]:

prompt = """
You are a Markdown expert whose objective is to check whether Markdown tables have been split into two or more tables because of page breaks or because of the OCR extraction process. You are designed to output JSON.

You are given the following Markdown text:
## START OF MARKDOWN TEXT
{text}
## END OF MARKDOWN TEXT

The below are the edges that define the start and end of each Markdown table. Use the first and last row mentioned in the "table_first_and_last_rows_array" as the definition of a Markdown table:
## START OF TABLE EDGES
{table_edges}
## END OF TABLE EDGES

In the markdown text above, detect whether there are Markdown tables. If there are, you **MUST** follow the Chain of Thought below:
    1. Locate every Markdown table in the text using the provided Table Edges, and write down the number of those tables in your scratchpad.
    2. Work on every two consecutive Markdown tables in a pairwise manner. For every pair of consecutive Markdown tables, you **MUST** check if they have any text between them. If the two tables are immediately consecutive, but there is text between them and this text is a footnote or a page number, you can safely ignore it, and mark the two tables as consecutive in your scratchpad. If, however, there is valid text between the two tables, then you **MUST** mark the two tables as not consecutive in your scratchpad.
    3. Write down the number of checks you have to perform. You will have to perform checks for consecutive tables pair-wise.

 After completing the above steps, you **MUST** then do the following for every two immediately consecutive Markdown tables:
    A. If the separated Markdown tables are immediately consecutive, then check the headers and the data type of the columns of each table to estimate if the two tables originated from the same table but got split into two because of the page break, or because of the OCR extraction process.
    B. If you find that these two Markdown tables likely belong to the same original table, then output **EXACTLY** the full last row of the first table, and the full first row of the second table in Markdown table format. Also output the exact string of characters that exists between the two tables in the JSON output.
    C. Repeat the above steps for **ALL** immediately consecutive Markdown tables located in your scratchpad, to match the number of checks in your scratchpad.

**SUPER IMPORTANT**: We define a Markdown table as consecutive rows with no blank line between them, with each row separated from the next by '|\n'. Do not separate Markdown tables semantically.


JSON OUTPUT FORMAT:
Remember that you **MUST** process every two immediately consecutive or consecutive Markdown tables. You **MUST** output a JSON object with the following key-value pairs and following format:

{{
    "total_number_of_markdown_tables_located": "the total number of Markdown tables located",
    "num_of_tables_to_be_merged": "the number of tables to be merged",
    "tables_to_be_merged":
    [
        {{
            "first_table_last_row": "The last row of the first table in Markdown table format. Make sure that the last row of the first table is outputted verbatim word-for-word, including any leading or trailing spaces, and any leading or trailing pipes.",
            "second_table_first_row": "The first row of the second table in Markdown table format. Make sure that the first row of the second table is outputted verbatim word-for-word, including any leading or trailing spaces, and any leading or trailing pipes.",
            "dividing_string": "The exact string of characters that exists between the two tables. Make sure that the string is outputted verbatim word-for-word, including any leading or trailing spaces, and any leading or trailing pipes."
        }}
    ]
}}

"""

p = prompt.format(text = pdf_result['content'], table_edges=table_edges)
output = ask_LLM_with_JSON(p, model_info=model_info)
print(output)




In [None]:
import json


## Let's try to merge the tables
findings = json.loads(output)
pdf_content = pdf_result['content']  # strings are immutable, so no copy is needed


for table in findings['tables_to_be_merged']:
    first_table_last_row = table['first_table_last_row']
    second_table_first_row = table['second_table_first_row']

    found_first = first_table_last_row in pdf_content
    found_second = second_table_first_row in pdf_content
    print("Last row of first table found: ", found_first)
    print("First row of second table found: ", found_second)

    # Skip this pair if the LLM did not reproduce the rows verbatim
    if not (found_first and found_second):
        continue

    # Keep everything before the first table's last row and everything after
    # the second table's first row, dropping whatever sat between the two tables
    first_part = pdf_content.split(first_table_last_row)[0]
    second_part = pdf_content.split(second_table_first_row, 1)[1]

    pdf_content = first_part + first_table_last_row + '\n' + second_table_first_row + second_part


Markdown(pdf_content)



In [None]:
print(pdf_content)

#### Change prompt and provide an example

In [None]:
## Experiment with prompting, to see if broken tables can be "merged" based on LLM's output
## This is work in progress .. NOT YET COMPLETE

prompt = """
You are a Markdown expert whose objective is to check whether Markdown tables have been split into two or more tables because of page breaks or because of the OCR extraction process. You are designed to output JSON.

You are given the following Markdown text:
## START OF MARKDOWN TEXT
{text}
## END OF MARKDOWN TEXT

In the markdown text above, detect whether there are Markdown tables. If a table spans multiple pages, you **MUST** combine the table chunks from consecutive pages into a single table **BEFORE** writing the JSON output.

## EXAMPLE
Here is an example of a table that spans across pages:

|||
| - | - |
| Column1 | Column2 | ColumnN |
| Row1 Column1 value | Row1 Column2 value | Row1 ColumnN value |
| Row2 Column1 value | Row2 Column2 value | Row2 ColumnN value |

|||
| - | - |
| RowX Column1 value | RowX Column2 value | RowX ColumnN value |
| RowY Column1 value | RowY Column2 value | RowY ColumnN value |

## COMBINED TABLE
Here is the result of combining the table chunks from consecutive pages. The first row is always the header row.

| Column1 | Column2 | ColumnN |
| Row1 Column1 value | Row1 Column2 value | Row1 ColumnN value |
| Row2 Column1 value | Row2 Column2 value | Row2 ColumnN value |
| RowX Column1 value | RowX Column2 value | RowX ColumnN value |
| RowY Column1 value | RowY Column2 value | RowY ColumnN value |

## JSON OUTPUT FORMAT:
You **MUST** output a JSON object with the following key-value pairs and following format. Do **NOT** provide any other comments or unnecessary information in the JSON output:

{{
    "Column1 Header":
        [
            "Row1 value",
            "Row2 value",
            ...,
            "RowN value"
        ],
    "Column2 Header":
        [
            "Row1 value",
            "Row2 value",
            ...,
            "RowN value"
        ],
    ...,
    "ColumnN Header":
        [
            "Row1 value",
            "Row2 value",
            ...,
            "RowN value"
        ]
}}
"""

p = prompt.format(text = pdf_result['content'])
table_edges = ask_LLM_with_JSON(p, model_info=model_info)
print(table_edges)

In [None]:
import pandas as pd
import json

# Parse the JSON string
data = json.loads(table_edges)

# Create DataFrame
df = pd.DataFrame(data)

# Display the DataFrame
display(df)
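
`pd.DataFrame(data)` raises a `ValueError` when the column lists returned by the model have different lengths, which can happen if the model drops or merges a cell. A minimal sketch of a more tolerant conversion follows: wrapping each list in a `pd.Series` lets pandas align the columns by position and pad the short ones with `NaN`. `table_json_to_dataframe` is a hypothetical helper, and the example JSON is made up.

```python
import json

import pandas as pd


def table_json_to_dataframe(json_text):
    """Build a DataFrame from {"header": [values, ...]}, padding short columns with NaN."""
    data = json.loads(json_text)
    return pd.DataFrame({header: pd.Series(values) for header, values in data.items()})


# Made-up example in which the second column is one value short.
example = '{"Fact": ["a", "b", "c"], "Year": ["1956", "2012"]}'
df_example = table_json_to_dataframe(example)
print(df_example.shape)
```

A padded `NaN` is a sign that the extraction lost a value, so such rows are worth checking against the source document.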


#### Let's try with a different PDF file

In [None]:
path = r"sample_data/1_London_Brochure.pdf"
pdf_result = analyze_document(path)
Markdown(pdf_result['content'])

In [None]:
## Experiment with prompting, to see if broken tables can be "merged" based on LLM's output
## This is work in progress .. NOT YET COMPLETE

prompt = """
You are a Markdown expert whose objective is to check whether Markdown tables have been split into two or more tables because of page breaks or because of the OCR extraction process. You are designed to output JSON.

You are given the following Markdown text:
## START OF MARKDOWN TEXT
{text}
## END OF MARKDOWN TEXT

In the markdown text above, detect whether there are Markdown tables. If a table spans multiple pages, you **MUST** combine the table chunks from consecutive pages into a single table **BEFORE** writing the JSON output.

## EXAMPLE
Here is an example of a table that spans across pages:

|||
| - | - |
| Column1 | Column2 | ColumnN |
| Row1 Column1 value | Row1 Column2 value | Row1 ColumnN value |
| Row2 Column1 value | Row2 Column2 value | Row2 ColumnN value |

|||
| - | - |
| RowX Column1 value | RowX Column2 value | RowX ColumnN value |
| RowY Column1 value | RowY Column2 value | RowY ColumnN value |

## COMBINED TABLE
Here is the result of combining the table chunks from consecutive pages. The first row is always the header row.

| Column1 | Column2 | ColumnN |
| Row1 Column1 value | Row1 Column2 value | Row1 ColumnN value |
| Row2 Column1 value | Row2 Column2 value | Row2 ColumnN value |
| RowX Column1 value | RowX Column2 value | RowX ColumnN value |
| RowY Column1 value | RowY Column2 value | RowY ColumnN value |

## JSON OUTPUT FORMAT:
You **MUST** output a JSON object with the following key-value pairs and following format. Do **NOT** provide any other comments or unnecessary information in the JSON output:

{{
    "Column1 Header":
        [
            "Row1 value",
            "Row2 value",
            ...,
            "RowN value"
        ],
    "Column2 Header":
        [
            "Row1 value",
            "Row2 value",
            ...,
            "RowN value"
        ],
    ...,
    "ColumnN Header":
        [
            "Row1 value",
            "Row2 value",
            ...,
            "RowN value"
        ]
}}
"""

p = prompt.format(text = pdf_result['content'])
table_edges = ask_LLM_with_JSON(p, model_info=model_info)
print(table_edges)

In [None]:
import pandas as pd
import json

# Parse the JSON string
data = json.loads(table_edges)

# Create DataFrame
df = pd.DataFrame(data)

# Display the DataFrame
display(df)
