# LDAI Splitter and Classifier (Synchronous v1)

This notebook demonstrates how to use the Lending Document Splitter and Classifier to split a single PDF that contains multiple scanned documents (specifically mortgage and federal tax documents) at their logical boundaries. The processor also classifies each resulting document by known document type.

In [1]:
# Install the necessary Python libraries, then restart your kernel.
!python -m pip install -r ../requirements.txt



## Set your Processor Variables

In [2]:
# TODO(developer): Fill these variables with your values before running the sample
PROJECT_ID = 'rand-automl-project'
LOCATION = 'us' # Format is 'us' or 'eu'
PROCESSOR_ID = 'f8bd845a6e664b68' # Create processor in Cloud Console
FILE_PATH = '../resources/lending/splitter_classifier/federal_package_sample.pdf'

#GCS_INPUT_BUCKET = 'cloud-samples-data'
#GCS_INPUT_PREFIX = 'documentai/async_invoices/'
#GCS_OUTPUT_URI = 'gs://YOUR-OUTPUT-BUCKET'
#GCS_OUTPUT_URI_PREFIX = 'TEST'
#TIMEOUT = 300
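
Before calling the API, it can help to sanity-check these values. The following is a minimal sketch, not part of the original sample: `check_settings` is a hypothetical helper, and it assumes that only the two locations named in the comment above (`'us'` and `'eu'`) are in use and that the input is a local PDF file.

```python
from pathlib import Path

# Hypothetical helper (not part of the Document AI client library).
ALLOWED_LOCATIONS = ("us", "eu")  # assumption: the two locations named above


def check_settings(project_id: str, location: str, processor_id: str, file_path: str) -> None:
    """Raise ValueError if any setting is obviously unusable."""
    if not project_id or not processor_id:
        raise ValueError("PROJECT_ID and PROCESSOR_ID must be non-empty")
    if location not in ALLOWED_LOCATIONS:
        raise ValueError(f"LOCATION must be one of {ALLOWED_LOCATIONS}, got {location!r}")
    path = Path(file_path)
    if path.suffix.lower() != ".pdf":
        raise ValueError(f"Expected a .pdf file, got {path.name!r}")
    if not path.is_file():
        raise ValueError(f"File not found: {file_path}")
```

Calling `check_settings(PROJECT_ID, LOCATION, PROCESSOR_ID, FILE_PATH)` raises a `ValueError` before any network request is made, so a typo in the configuration surfaces early.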

Now let's define the function that processes the document with the Document AI Python client.

In [3]:
# Import necessary Python modules
from google.cloud import documentai_v1beta3 as documentai

In [4]:
def process_document_sample(
    project_id: str, location: str, processor_id: str, file_path: str
):

    # Instantiates a client
    opts = {"api_endpoint": f"{location}-documentai.googleapis.com"}
    client = documentai.DocumentProcessorServiceClient(client_options=opts)

    # The full resource name of the processor, e.g.:
    # projects/project-id/locations/location/processors/processor-id
    # You must create new processors in the Cloud Console first
    name = f"projects/{project_id}/locations/{location}/processors/{processor_id}"

    # Read the file into memory
    with open(file_path, "rb") as image:
        image_content = image.read()

    # Build the document payload from the file bytes and MIME type
    document = {"content": image_content, "mime_type": "application/pdf"}

    # Configure the process request
    request = {"name": name, "document": document}

    # Send the request; the response contains the recognized text and split entities
    result = client.process_document(request=request)

    document = result.document
    print("Document processing complete.")

    # For a full list of Document object attributes, please reference this page: https://googleapis.dev/python/documentai/latest/_modules/google/cloud/documentai_v1beta3/types/document.html#Document

    document_pages = document.pages

    # Read the text recognition output from the processor
    text = document.text
    print("The document contains the following text (first 100 characters):")
    print(text[:100])
    
    # Read the detected page split from the processor
    print("\nThe processor detected the following page split entities:")
    print_pages_split(text, document)


def print_pages_split(text: str, document: documentai.Document):
    """
    Document AI identifies possible page splits
    in the document. This function converts the page splits
    to text snippets and prints them.
    """
    for i, entity in enumerate(document.entities):
        confidence = entity.confidence
        doc_type = entity.type_
        text_entity = ''
        for segment in entity.text_anchor.text_segments:
            start = segment.start_index
            end = segment.end_index
            text_entity += text[start:end]
        pages = [p.page for p in entity.page_anchor.page_refs]
        print(f"*** Document Type: {doc_type} ***")
        print(f"*** Entity number: {i}, Split Confidence: {confidence} ***")
        print(f"*** Pages numbers: {pages} ***\nText snippet: {text_entity[:100]}")
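
Printing is convenient for a demo, but downstream code usually wants the splits as data. The sketch below is an illustration rather than part of the original sample: `collect_splits` is a hypothetical helper that reads the same attributes as `print_pages_split` (`type_`, `confidence`, and `page_anchor.page_refs`), and the `SimpleNamespace` objects with made-up values are stand-ins for the API response so the example runs without credentials.

```python
from types import SimpleNamespace


def collect_splits(document) -> list:
    """Return one dict per split entity: type, confidence, and page indices."""
    splits = []
    for entity in document.entities:
        splits.append(
            {
                "type": entity.type_,
                "confidence": entity.confidence,
                "pages": [ref.page for ref in entity.page_anchor.page_refs],
            }
        )
    return splits


def _fake_entity(doc_type, confidence, pages):
    """Stand-in for a response entity (hypothetical, for illustration only)."""
    refs = [SimpleNamespace(page=p) for p in pages]
    return SimpleNamespace(
        type_=doc_type,
        confidence=confidence,
        page_anchor=SimpleNamespace(page_refs=refs),
    )


# Illustrative values only; a real run would pass result.document instead.
fake_document = SimpleNamespace(
    entities=[_fake_entity("other", 0.99, [0]), _fake_entity("1040_2018", 0.95, [1, 2])]
)
for split in collect_splits(fake_document):
    print(split)
```

Returning plain dictionaries keeps the result easy to serialize or load into a table, at the cost of dropping the text snippet that the printing function shows.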

We can now run the processor on the sample multi-document PDF.

In [5]:
process_document_sample(PROJECT_ID, LOCATION, PROCESSOR_ID, FILE_PATH)

Document processing complete.
The document contains the following text (first 100 characters):
A
A
A
A
May
A
May
A
A
US
Affordable Care Act Worksheet
2018
Name: Dawn Miller
SSN: 589-50-0176
Did t

The processor detected the following page split entities:
*** Document Type: other ***
*** Entity number: 0, Split Confidence: 0.9956564903259277 ***
*** Pages numbers: [0] ***
Text snippet: A
A
A
A
May
A
May
A
A
US
Affordable Care Act Worksheet
2018
Name: Dawn Miller
SSN: 589-50-0176
Did t
*** Document Type: 1040_2018 ***
*** Entity number: 1, Split Confidence: 0.9490926861763 ***
*** Pages numbers: [1, 2] ***
Text snippet: 1040
2018
Department of the Treasury-Internal Revenue Service
(99)
U.S. Individual Income Tax Return
*** Document Type: other ***
*** Entity number: 2, Split Confidence: 0.6018480658531189 ***
*** Pages numbers: [3] ***
Text snippet: OMB No. 1545-0074
SCHEDULE 3
(Form 1040)
Nonrefundable Credits
2018
49
Attach to Form 1040.
Departme
*** Document Type: other ***
*** Entity
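
The third split above has a noticeably lower confidence (about 0.60) than the first two. A minimal sketch of how such splits could be routed to manual review follows; the `REVIEW_THRESHOLD` value of 0.8 and the `needs_review` helper are assumptions chosen for illustration, not recommendations from the Document AI documentation.

```python
# Splits copied from the sample output above: (document type, confidence, pages).
SPLITS = [
    ("other", 0.9956564903259277, [0]),
    ("1040_2018", 0.9490926861763, [1, 2]),
    ("other", 0.6018480658531189, [3]),
]

REVIEW_THRESHOLD = 0.8  # assumption: tune this for your own documents


def needs_review(splits, threshold=REVIEW_THRESHOLD):
    """Return the splits whose confidence is below the threshold."""
    return [split for split in splits if split[1] < threshold]


for doc_type, confidence, pages in needs_review(SPLITS):
    print(f"Review pages {pages}: classified as {doc_type!r} ({confidence:.2f})")
```

With these sample values, only the split covering page 3 falls below the threshold.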

---------
Congratulations, you've successfully used the Document AI API to extract the logical page boundaries from a multi-page document. We encourage you to experiment with other documents.