
Create separate document for each page in "markdown" mode of AzureAIDocumentIntelligenceLoader #40790

@4MIR2000


Hi all,
I would like to get a separate document for each processed page of a PDF file when using AzureAIDocumentIntelligenceLoader in "markdown" mode. I need to store the page number for each chunk in the vector database. Currently, load() returns only a single document. I also tried "page" mode, but its output does not contain the tables and figures that "markdown" mode produces.

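from langchain_community.document_loaders import AzureAIDocumentIntelligenceLoader

loader = AzureAIDocumentIntelligenceLoader(bytes_source=html_bytes, api_key=doc_intelligence_key, api_endpoint=doc_intelligence_endpoint, api_model="prebuilt-layout", mode="markdown")
docs_azure = loader.load()  # returns a single Document in "markdown" mode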

I also tried the following approach, but it does not include the tables and figures either:

from langchain_core.documents import Document

separate_docs = []
separate_docs_join = ""
for page in docs_azure[0].metadata["pages"]:
    page_number = page["pageNumber"]
    # Only the plain text lines are joined here, so tables and figures are lost
    page_content = "\n".join(line["content"] for line in page["lines"])
    page_metadata = {
        "page_number": page_number,
        **docs_azure[0].metadata  # Include other metadata if needed
    }
    separate_docs_join += page_content + "\n\n"  # Join the content of all pages
    separate_docs.append(Document(page_content=page_content, metadata=page_metadata))

Thank you for your support
Amir
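
One possible direction, as a minimal sketch rather than a confirmed solution: slice the single markdown string by per-page spans instead of rebuilding each page from its lines. This assumes that every entry in metadata["pages"] carries a "spans" list of {"offset", "length"} dicts pointing into the markdown content, and that those offsets can be used directly as Python string indices. Neither assumption is verified against this loader's output. The helper name split_content_by_page_spans and the sample data are hypothetical.

```python
from typing import Any


def split_content_by_page_spans(
    content: str, pages: list[dict[str, Any]]
) -> list[dict[str, Any]]:
    """Slice the full markdown string into one record per page."""
    records = []
    for page in pages:
        # ASSUMPTION: each page dict has "spans": [{"offset": int, "length": int}]
        # and the offsets are valid Python string indices into `content`.
        parts = [
            content[span["offset"]: span["offset"] + span["length"]]
            for span in page.get("spans", [])
        ]
        records.append(
            {
                "page_content": "".join(parts),
                "metadata": {"page_number": page["pageNumber"]},
            }
        )
    return records


# Synthetic two-page input, built so that the offsets are known exactly.
page_one = "# Report\n\n| a | b |\n"
page_two = "## Appendix\n"
sample_content = page_one + page_two
sample_pages = [
    {"pageNumber": 1, "spans": [{"offset": 0, "length": len(page_one)}]},
    {"pageNumber": 2, "spans": [{"offset": len(page_one), "length": len(page_two)}]},
]
sample_records = split_content_by_page_spans(sample_content, sample_pages)
```

With the loader output above, the inputs would be docs_azure[0].page_content and docs_azure[0].metadata["pages"], and each record could then be wrapped in a Document. If the reported offsets are not counted in Python string indices, the slices will drift on non-ASCII text and the offsets would need converting first.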

Labels

Document Intelligence, Service Attention, customer-reported, needs-author-feedback, no-recent-activity, question