# OCR and Embed with Azure Document Intelligence

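This notebook uses Azure Document Intelligence to run OCR on a document, keeps the longer body paragraphs, and saves them to a JSON file. Each saved record holds an ID, a page number, the paragraph text, and an empty `contentVector` field.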

## Setup

First, import the required libraries and load the Azure endpoint and key from environment variables (for example, from a `.env` file).

In [None]:
# Cell 1: Import necessary libraries
import json
import os
import time

from azure.ai.documentintelligence import DocumentIntelligenceClient
from azure.ai.documentintelligence.models import AnalyzeResult
from azure.core.credentials import AzureKeyCredential
from dotenv import load_dotenv

# Load environment variables
load_dotenv(override=True)

# Set up Azure endpoint and key
endpoint = os.getenv("AZURE_DI_ENDPOINT")
key = os.getenv("AZURE_DI_KEY")
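
`os.getenv` returns `None` when a variable is missing, so a typo in the `.env` file only surfaces later as a confusing client error. A minimal sketch of an early check follows; `require_env` is a hypothetical helper name, not part of the notebook or of any Azure library.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or raise if it is unset or empty."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set")
    return value


# Possible use in Cell 1 (left commented out so this sketch runs without credentials):
# endpoint = require_env("AZURE_DI_ENDPOINT")
# key = require_env("AZURE_DI_KEY")
```

Failing at load time keeps the error message next to its cause instead of inside the OCR call.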

## Define Helper Functions

Next, define a helper function that runs OCR on a document and returns its body paragraphs.

In [None]:
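# Cell 2: Define function to get text from a document
def get_document_text(document_path):
    """Run layout OCR on a document and return its body paragraphs as a list of dicts."""
    print("Starting OCR")
    document_intelligence_client = DocumentIntelligenceClient(
        endpoint=endpoint, credential=AzureKeyCredential(key)
    )

    # Submit the file as a raw byte stream to the prebuilt layout model.
    # Depending on the installed azure-ai-documentintelligence version, the
    # request argument may be named `body` rather than `analyze_request`.
    with open(document_path, "rb") as f:
        poller = document_intelligence_client.begin_analyze_document(
            "prebuilt-layout", analyze_request=f, content_type="application/octet-stream"
        )
    # Block until the service finishes the analysis.
    result: AnalyzeResult = poller.result()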

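    paragraph_results = []
    page_paragraph_count = {}
    # result.paragraphs can be None when no paragraphs are detected.
    for paragraph in result.paragraphs or []:
        # Keep body text only: skip paragraphs that carry a role (such as titles,
        # page headers, or page footers) and those whose first span is under 40 characters.
        if "role" not in paragraph and paragraph['spans'][0]['length'] >= 40:
            page_number = paragraph.bounding_regions[0].page_number
            # Number the kept paragraphs per page, starting at 1.
            page_paragraph_count[page_number] = page_paragraph_count.get(page_number, 0) + 1

            paragraph_id = f"{page_number}_{page_paragraph_count[page_number]}"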
            paragraph_results.append({
                'id': paragraph_id,
                'page': page_number,
                'content': paragraph.content,
                'contentVector': []
            })
    print("Finished OCR")
    return paragraph_results
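
To make the record format concrete, here is a minimal sketch of the same filtering and numbering applied to hand-written stand-in data instead of a real `AnalyzeResult`. The names `build_records` and `make_paragraph` and the sample paragraphs are hypothetical; the dicts only imitate the fields the function above reads.

```python
def make_paragraph(content, page, role=None):
    """Build a plain dict that imitates the paragraph fields used by the filter."""
    paragraph = {
        "content": content,
        "spans": [{"offset": 0, "length": len(content)}],
        "bounding_regions": [{"page_number": page}],
    }
    if role is not None:
        paragraph["role"] = role
    return paragraph


def build_records(paragraphs, min_length=40):
    """Drop paragraphs with a role or a short first span, then assign page-scoped IDs."""
    records = []
    per_page = {}
    for p in paragraphs:
        if "role" in p or p["spans"][0]["length"] < min_length:
            continue
        page = p["bounding_regions"][0]["page_number"]
        per_page[page] = per_page.get(page, 0) + 1
        records.append({
            "id": f"{page}_{per_page[page]}",
            "page": page,
            "content": p["content"],
            "contentVector": [],
        })
    return records


sample = [
    make_paragraph("Quarterly report", 1, role="title"),
    make_paragraph("The first body paragraph is long enough to be kept.", 1),
    make_paragraph("Short line.", 1),
    make_paragraph("A second body paragraph on the same page, also kept.", 1),
    make_paragraph("A body paragraph on the second page of the document.", 2),
]

records = build_records(sample)
print([r["id"] for r in records])  # ['1_1', '1_2', '2_1']
```

The title and the short line are skipped, so IDs count only the paragraphs that are kept on each page.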

## Main Execution

Now run the OCR process and save the results to a JSON file. Set `input_file` and `output_file` below to your own paths.

In [None]:
# Cell 3: Run OCR on a PDF file and save the results to a JSON file
# Specify input and output files
input_file = 'demofile.pdf'  # Change the path to your input file
output_file = 'textOCR.json'  # Change this to your desired output file name

start_time = time.time()
result = get_document_text(input_file)

# Save result to JSON file (UTF-8, keeping non-ASCII characters readable)
with open(output_file, 'w', encoding='utf-8') as json_file:
    json.dump(result, json_file, indent=4, ensure_ascii=False)

end_time = time.time()
execution_time = end_time - start_time
print(f"Execution time: {execution_time:.2f} seconds")
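
As a quick sanity check on the saved file, the records can be read back and counted per page. The sketch below is an illustration only: `paragraphs_per_page` is a hypothetical helper, and it runs against a small hand-written file in a temporary directory, not the real `textOCR.json`.

```python
import json
import tempfile
from collections import Counter
from pathlib import Path


def paragraphs_per_page(json_path):
    """Load saved OCR records and count how many paragraphs each page contributed."""
    with open(json_path, encoding="utf-8") as f:
        records = json.load(f)
    return Counter(record["page"] for record in records)


# Stand-in records in the same shape the notebook writes.
sample_records = [
    {"id": "1_1", "page": 1, "content": "First paragraph.", "contentVector": []},
    {"id": "1_2", "page": 1, "content": "Second paragraph.", "contentVector": []},
    {"id": "2_1", "page": 2, "content": "Third paragraph.", "contentVector": []},
]

with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = Path(tmp_dir) / "sample_ocr.json"
    sample_path.write_text(json.dumps(sample_records, indent=4), encoding="utf-8")
    counts = paragraphs_per_page(sample_path)

print(dict(counts))  # {1: 2, 2: 1}
```

For the real output, the call would be `paragraphs_per_page(output_file)`.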