# 01. Azure AI Document Intelligence - Layout Model
> https://learn.microsoft.com/en-us/azure/ai-services/document-intelligence/overview?view=doc-intel-4.0.0

## A. Create an AI Document Intelligence resource and set up environment to run notebook

### Prerequisite

#### Create Document Intelligence resource
To create an AI Document Intelligence resource in your Azure subscription, follow the steps at https://learn.microsoft.com/en-us/azure/ai-services/document-intelligence/create-document-intelligence-resource?view=doc-intel-4.0.0

#### Create a storage account
1. Create a storage account by following https://learn.microsoft.com/en-us/azure/storage/blobs/blob-containers-portal and choose these settings:
    - Primary Service - Azure Blob Storage
    - Primary Workload - Other
    - Standard performance
    - Redundancy - Locally Redundant Storage is sufficient, as this is a demo

1. Create a container called *data* in the `Container` blade.
1. Upload a file to analyse into your container by clicking on the newly created container and selecting the `Upload` button in the ribbon.


In the portal, navigate to the file you want to analyse; it should already be uploaded to blob storage.  
Select the three dots at the far right of the file's row and generate a SAS URL for the file.  
Then set `BLOB_SAS_URL` in your `.env` file to that SAS URL.

#### Environment

**AML workspace**: Ensure you use Python 3.10 or above, i.e. select **Python 3.10 - SDK v2** as the kernel in the AML notebook.

##### Add the following environment variables to your .env file

1. Document Intelligence variables (the service was formerly named Form Recognizer)
    - Get the values from the `Resource Management` / `Keys and Endpoint` blade of the *Document Intelligence* service in the Azure portal.
    ```
    FORM_RECOGNIZER_ENDPOINT= 
    FORM_RECOGNIZER_KEY= 
    ```

1. Storage Account variables  
    - Go to your storage resource in the portal.
        - For the connection string: Security + networking / Access keys / Connection string
    - Click on the newly created container.
    - For the SAS URL: Settings / Shared access tokens
        - Set the expiry date to a time after the workshop finishes, then create a shared access token for the container by clicking the *Generate SAS token and URL* button.
        - Copy the SAS URL.
    ```
    BLOB_STORAGE_ACCOUNT_CONNECTION_STRING=
    BLOB_SAS_URL=
    ```
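
Before running the notebook cells, it can help to confirm that every variable from the `.env` template is actually set. The following is a minimal sketch, not part of the original notebook: the helper name `missing_env_vars` is hypothetical, and the example uses an in-memory mapping with a placeholder endpoint rather than real credentials.

```python
import os

# Variable names taken from the .env template above.
REQUIRED_VARS = (
    "FORM_RECOGNIZER_ENDPOINT",
    "FORM_RECOGNIZER_KEY",
    "BLOB_STORAGE_ACCOUNT_CONNECTION_STRING",
    "BLOB_SAS_URL",
)


def missing_env_vars(env=None, required=REQUIRED_VARS):
    """Return the names in `required` that are absent or blank in `env`.

    `env` defaults to os.environ; pass a dict to check other mappings.
    (Hypothetical helper, for illustration only.)
    """
    env = os.environ if env is None else env
    return [name for name in required if not (env.get(name) or "").strip()]


# Example with an in-memory mapping instead of the real environment.
sample_env = {
    "FORM_RECOGNIZER_ENDPOINT": "https://example.invalid/",  # placeholder value
    "FORM_RECOGNIZER_KEY": " ",  # blank values count as missing
}
print(missing_env_vars(sample_env))
```

In the notebook you would call `missing_env_vars()` after `load_dotenv()` and stop early if the returned list is not empty.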


## B. Install AI Doc Intelligence library


In [None]:
# install azure-ai-formrecognizer python library and restart the kernel after installation
%pip install azure-ai-formrecognizer --upgrade --user
%pip install tabulate
%pip install python-dotenv

## C. Setting up AI Document Intelligence endpoint and key

In [None]:
# Import the os module for interacting with the operating system
import os
# Import for handling resource not found errors
from azure.core.exceptions import ResourceNotFoundError
# Import for authenticating with the Azure service
from azure.core.credentials import AzureKeyCredential
# Import the Form Recognizer client library to analyze the documents
from azure.ai.formrecognizer import DocumentAnalysisClient, AnalyzeResult

# Load the environment variables from the .env file
from dotenv import load_dotenv
load_dotenv()


# Read the endpoint and key from the environment variables defined in the .env file
# (the values come from the Azure portal)
# END_POINT is the endpoint URL of your AI Document Intelligence service
# END_POINT_KEY is the key for your AI Document Intelligence service
END_POINT = os.getenv("FORM_RECOGNIZER_ENDPOINT")
END_POINT_KEY = os.getenv("FORM_RECOGNIZER_KEY")

# Create a DocumentAnalysisClient instance
# This client is used to interact with the Azure Form Recognizer service
# It is initialized with your endpoint and key
form_recognizer_client = DocumentAnalysisClient(END_POINT, AzureKeyCredential(END_POINT_KEY))

## D. Analysing the Layout Document

In [None]:
# Define the URL of the document to analyse.
# You can point this at your own layout document, but make sure the URL grants read access.
#layoutUrl = "https://raw.githubusercontent.com/Azure-Samples/cognitive-services-REST-api-samples/master/curl/form-recognizer/sample-layout.pdf"
# Try out PDFs with interesting layouts and tables.

blob_sas = os.getenv("BLOB_SAS_URL")

# Start the analysis of the document using the prebuilt layout model
# The result is a poller object that can be used to check the status of the operation
poller = form_recognizer_client.begin_analyze_document_from_url("prebuilt-layout", blob_sas)
print(poller)

# Get the result of the analysis
result = poller.result()

# Extract document insights
# Check if the document contains any handwritten content
if any([style.is_handwritten for style in result.styles]):
    print("Document contains handwritten content")
else:
    print("Document does not contain handwritten content")

# Loop through each page in the document
for page in result.pages:
    print(f"----Analyzing layout from page #{page.page_number}----")
    print(
        f"Page has width: {page.width} and height: {page.height}, measured with unit: {page.unit}"
    )
    for line_idx, line in enumerate(page.lines):
        words = line.get_words()
        #words = get_words(page, line)
        print(
            f"...Line # {line_idx} has word count {len(words)} and text '{line.content}' "
            f"within bounding polygon '{line.polygon}'"
        )

        # Loop through each word in the line  
        for word in words:
            print(
                f"......Word '{word.content}' has a confidence of {word.confidence}"
            )

    # Loop through each selection mark in the page 
    for selection_mark in page.selection_marks:
        print(
            f"Selection mark is '{selection_mark.state}' within bounding polygon "
            f"'{selection_mark.polygon}' and has a confidence of {selection_mark.confidence}"
        )

# Loop through each table in the document
for table_idx, table in enumerate(result.tables):
    print(
        f"Table # {table_idx} has {table.row_count} rows and "
        f"{table.column_count} columns"
    )
    # Loop through each bounding region of the table
    for region in table.bounding_regions:
        print(
            f"Table # {table_idx} location on page: {region.page_number} is {region.polygon}"
        )
    # Loop through each cell in the table    
    for cell in table.cells:
        print(
            f"...Cell[{cell.row_index}][{cell.column_index}] has text '{cell.content}'"
        )

        # Loop through each bounding region of the cell
        for region in cell.bounding_regions:
            print(
                f"...content on page {region.page_number} is within bounding polygon '{region.polygon}'"
            )
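
The bounding polygons printed above are lists of vertices, which are awkward to compare or draw directly. A minimal sketch of reducing a polygon to an axis-aligned bounding box follows. It assumes each vertex exposes `x` and `y` attributes; `Pt` is a hypothetical stand-in used so the example runs without calling the service, and `bounding_box` is an illustrative helper, not part of the SDK.

```python
from typing import Iterable, NamedTuple, Tuple


class Pt(NamedTuple):
    """Hypothetical stand-in for a polygon vertex with x and y coordinates."""
    x: float
    y: float


def bounding_box(polygon: Iterable) -> Tuple[float, float, float, float]:
    """Return (min_x, min_y, max_x, max_y) for a sequence of vertices."""
    points = list(polygon)
    if not points:
        raise ValueError("polygon has no points")
    xs = [p.x for p in points]
    ys = [p.y for p in points]
    return (min(xs), min(ys), max(xs), max(ys))


# A slightly skewed quadrilateral, similar in shape to a scanned text line.
quad = [Pt(1.0, 2.0), Pt(4.0, 2.1), Pt(4.0, 3.0), Pt(1.0, 2.9)]
print(bounding_box(quad))  # (1.0, 2.0, 4.0, 3.0)
```

The resulting coordinates are in the same unit as the page, as reported by `page.unit` in the loop above.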

## F. Extracted layout document insights / response in JSON format

In [None]:
# import JSON packages
import json
import datetime
import time
from azure.core.serialization import AzureJSONEncoder
from urllib.parse import urlparse


# Generate a unique file name from the current timestamp and the basename of the URL,
# with a .json extension so the saved file is recognisable as JSON
filename = (
    datetime.datetime.now().strftime('%Y%m%d%H%M%S')
    + "_"
    + os.path.splitext(os.path.basename(urlparse(blob_sas).path))[0]
    + ".json"
)

# parse and format the model response json 
# convert the received model to a dictionary
analyze_result_dict = result.to_dict()

# save the dictionary as JSON content in a JSON file, use the AzureJSONEncoder
# to help make types, such as dates, JSON serializable
with open(filename, 'w') as f:
    json.dump(analyze_result_dict, f, cls=AzureJSONEncoder, indent=4)

# convert the dictionary back to the original model
model = AnalyzeResult.from_dict(analyze_result_dict)
print("--------------JSON Response from Model Starts---------------------")
# use the model as normal
print("Model ID: '{}'".format(model.model_id))
print("Number of pages analyzed {}".format(len(model.pages)))
print("API version used: {}".format(model.api_version))
print(json.dumps(analyze_result_dict, cls=AzureJSONEncoder, indent=4))
print("--------------JSON Response from Model Ends---------------------")
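
The file name in the cell above is derived from the SAS URL, whose query string holds the token and must not end up in the name. The sketch below isolates that step so it can be tried without a real URL. The helper `output_filename` and the sample URL are hypothetical, and the `.json` extension is an assumption made here for convenience.

```python
import datetime
import posixpath
from urllib.parse import urlparse


def output_filename(url, now=None, ext=".json"):
    """Build '<timestamp>_<blob name without extension><ext>' from a URL.

    Only the path component of the URL is used, so the SAS query string
    (signature, expiry, ...) never appears in the file name.
    """
    now = now or datetime.datetime.now()
    stem = posixpath.splitext(posixpath.basename(urlparse(url).path))[0]
    return f"{now:%Y%m%d%H%M%S}_{stem}{ext}"


# Hypothetical SAS URL; the account, container, and token are made up.
sample_url = "https://acct.blob.core.windows.net/data/invoice.pdf?sv=2022&sig=abc"
fixed_time = datetime.datetime(2024, 1, 2, 3, 4, 5)
print(output_filename(sample_url, fixed_time))  # 20240102030405_invoice.json
```

Passing a fixed `now` keeps the result reproducible; omit it to use the current time.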

## G. Extracted document insights / response as a table of key-value pairs

In [None]:
# Get the document insights as key / value table
from tabulate import tabulate

data = []

# Display key value pairs
for idx, document in enumerate(result.documents):
    print()
    print("--------Analyzing document #{}--------".format(idx + 1))
    print("Document has type {}".format(document.doc_type))
    print("Document has document type confidence {}".format(
        document.confidence))
    print("Document was analyzed with model with ID {}".format(
        result.model_id))
    print()
    for name, field in document.fields.items():
        # Fall back to the raw content when no typed value was extracted
        field_value = field.value if field.value else field.content
        if field.value_type != 'list':
            data.append([name, field_value, field.confidence])

data.sort()
print(tabulate(data, headers=[
    'Label', 'Value', 'Confidence'], tablefmt='fancy_grid'))

# Display table data
for i, table in enumerate(result.tables):

    row_index = 1
    hdr = []
    rows = []
    row = []

    print("\nTable {} can be found on page:".format(i + 1))
    for region in table.bounding_regions:
        print("...{}".format(region.page_number))

    for cell in table.cells:
        if cell.row_index == 0:
            hdr.append(cell.content)
        else:
            if row_index != cell.row_index:
                rows.append(row)
                row_index = cell.row_index
                row = []

            row.append(cell.content)

    rows.append(row)
    print(tabulate(rows, headers=hdr, tablefmt='fancy_grid'))    
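
The loop above relies on the cells arriving in row order and on every position in a row being present. An alternative is to place each cell by its indices into a grid sized from the table's row and column counts. This is a minimal sketch under those assumptions: `Cell` is a hypothetical stand-in carrying only the attributes used above, `cells_to_grid` is an illustrative helper, and spanning cells are not handled.

```python
from dataclasses import dataclass


@dataclass
class Cell:
    """Hypothetical stand-in with the cell attributes used in the loop above."""
    row_index: int
    column_index: int
    content: str


def cells_to_grid(cells, row_count, column_count):
    """Place each cell at [row_index][column_index]; gaps stay as ''."""
    grid = [["" for _ in range(column_count)] for _ in range(row_count)]
    for cell in cells:
        grid[cell.row_index][cell.column_index] = cell.content
    return grid


# Cells deliberately out of order to show that ordering does not matter.
cells = [Cell(1, 1, "3"), Cell(0, 0, "Item"), Cell(0, 1, "Qty"), Cell(1, 0, "Pens")]
grid = cells_to_grid(cells, row_count=2, column_count=2)
header, body = grid[0], grid[1:]
print(header)  # ['Item', 'Qty']
print(body)    # [['Pens', '3']]
```

With a real result, `header` and `body` could be passed to `tabulate(body, headers=header, tablefmt='fancy_grid')` in place of `hdr` and `rows`.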