# Overview

## What is Document AI?
The [Document AI](https://cloud.google.com/document-ai/docs) API is a document understanding solution that takes unstructured data, such as documents and emails, and makes the data easier to understand, analyze, and consume. The API provides structure through content classification, entity extraction, advanced searching, and more.

In this tutorial, you use the Document AI API with Python. The tutorial demonstrates how to use Document Splitter to parse a PDF that contains several scanned documents and to separate those documents at logical page boundaries.

## What you'll learn
- How to enable the Document AI API
- How to authenticate API requests
- How to install the client library for Python
- How to parse a multipage document and detect logical page boundaries

## What you'll need
- A Google Cloud Project
- A browser, such as Chrome or Firefox
- Knowledge of Python 3
- An instance of AI Notebook

# Setup and Requirements


## Install the client library
Install the Document AI and Cloud Storage client libraries:

In [None]:
%%bash
pip3 install --upgrade google-cloud-documentai
pip3 install --upgrade google-cloud-storage

You should see something like this:

```
...
Installing collected packages: google-cloud-documentai
Successfully installed google-cloud-documentai-0.3.0
.
.
Installing collected packages: google-cloud-storage
Successfully installed google-cloud-storage-1.35.0
```
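
If the pip log scrolls past too quickly, the installed versions can also be checked from a notebook cell. The following is a minimal sketch that uses only the standard library (`importlib.metadata`, available in Python 3.8 and later); the helper name `installed_version` is hypothetical and not part of either client library.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Report the two packages installed in the cell above
for package in ("google-cloud-documentai", "google-cloud-storage"):
    print(f"{package}: {installed_version(package) or 'not installed'}")
```

The version numbers you see will probably differ from the ones in the sample log, because `--upgrade` installs the latest release.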

Now, you're ready to use the Document AI API!

# Document Splitter (Private Beta)

We will test the Document Splitter on a sample document.
Let's set up variables with:
- Google Cloud project ID
- Document AI processor location
- Document AI processor ID
- Path to the sample document

In [None]:
# TODO(developer): Fill these variables with your values before running the sample
project_id = 'YOUR_GCP_PROJECT_ID'
location = 'eu'  # Either 'us' or 'eu'
processor_id = 'YOUR_DOCAI_PROCESSOR_ID' # Create processor in Cloud Console
file_path = '../resources/general/multi-document.pdf'
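
Placeholders left unfilled, or a wrong file path, tend to surface only as an API or I/O error later on. As a rough pre-flight sketch, the hypothetical helper `check_settings` below collects such problems before any request is sent. It assumes that placeholders start with `YOUR_`, that the location is either `us` or `eu` as noted in the comment above, and that a PDF file typically starts with the bytes `%PDF-`.

```python
from pathlib import Path


def check_settings(
    project_id: str, location: str, processor_id: str, file_path: str
) -> list:
    """Return a list of readable problems; an empty list means none were found."""
    problems = []

    # Placeholders from the tutorial cell start with 'YOUR_' (assumption)
    for name, value in (("project_id", project_id), ("processor_id", processor_id)):
        if not value or value.startswith("YOUR_"):
            problems.append(f"{name} still holds a placeholder value")

    # The tutorial mentions only these two locations
    if location not in ("us", "eu"):
        problems.append(f"location should be 'us' or 'eu', got {location!r}")

    # A PDF file typically begins with the signature b'%PDF-'
    pdf = Path(file_path)
    if not pdf.is_file():
        problems.append(f"{file_path} does not exist")
    else:
        with pdf.open("rb") as handle:
            if handle.read(5) != b"%PDF-":
                problems.append(f"{file_path} does not start with the PDF signature")

    return problems


# Example call with the placeholder values from the cell above
for problem in check_settings(
    "YOUR_GCP_PROJECT_ID",
    "eu",
    "YOUR_DOCAI_PROCESSOR_ID",
    "../resources/general/multi-document.pdf",
):
    print(problem)
```

In the notebook you would pass `project_id`, `location`, `processor_id`, and `file_path` directly; an empty result means these basic checks passed, not that the processor itself exists.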

Now let's define a function that processes the document with the Document AI Python client:

In [None]:
def process_document_sample(
    project_id: str, location: str, processor_id: str, file_path: str
) -> None:
    from google.cloud import documentai_v1beta3 as documentai

    # Instantiate a client that targets the regional endpoint
    opts = {"api_endpoint": f"{location}-documentai.googleapis.com"}
    client = documentai.DocumentProcessorServiceClient(client_options=opts)

    # The full resource name of the processor, e.g.:
    # projects/project-id/locations/location/processors/processor-id
    # You must create new processors in the Cloud Console first
    name = f"projects/{project_id}/locations/{location}/processors/{processor_id}"

    # Read the file into memory
    with open(file_path, "rb") as pdf_file:
        pdf_content = pdf_file.read()

    # Wrap the raw bytes in a document payload
    raw_document = {"content": pdf_content, "mime_type": "application/pdf"}

    # Configure the process request
    request = {"name": name, "document": raw_document}

    # Send the request; the result holds the recognized text and the split entities
    result = client.process_document(request=request)

    document = result.document

    print("Document processing complete.")

    # For a full list of Document object attributes, please reference this page: https://googleapis.dev/python/documentai/latest/_modules/google/cloud/documentai_v1beta3/types/document.html#Document

    # Read the text recognition output from the processor
    text = document.text
    print("The document contains the following text (first 100 characters):")
    print(text[:100])

    # Read the detected page split from the processor
    print("\nThe processor detected the following page split entities:")
    print_pages_split(text, document)


def print_pages_split(text: str, document) -> None:
    """
    Document AI identifies possible page splits
    in a document. This function converts each page split
    to a text snippet and prints it.

    `document` is the Document object returned by the processor.
    """
    for i, entity in enumerate(document.entities):
        confidence = entity.confidence
        text_entity = ''
        # An entity's text can span several segments of the full document text
        for segment in entity.text_anchor.text_segments:
            start = segment.start_index
            end = segment.end_index
            text_entity += text[start:end]
        pages = [p.page for p in entity.page_anchor.page_refs]
        print(f"*** Entity number: {i}, Split Confidence: {confidence} ***")
        print(f"*** Page numbers: {pages} ***\nText snippet: {text_entity[:100]}")
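
Printing is convenient for a first look, but downstream code usually needs the splits as data. The sketch below gathers the same fields into a list of dictionaries and optionally drops low-confidence splits. The function name `collect_splits`, the `min_confidence` parameter, and the hand-built `fake_document` are assumptions for illustration; the stand-in object mimics only the attributes read above (`entities`, `text_anchor.text_segments`, `page_anchor.page_refs`) so that the example runs without calling the API.

```python
from types import SimpleNamespace


def collect_splits(document, min_confidence: float = 0.0) -> list:
    """Return one dict per split entity whose confidence meets the threshold."""
    splits = []
    for entity in document.entities:
        if entity.confidence < min_confidence:
            continue
        # Join every text segment the entity points to in the full text
        snippet = "".join(
            document.text[segment.start_index:segment.end_index]
            for segment in entity.text_anchor.text_segments
        )
        splits.append(
            {
                "pages": [ref.page for ref in entity.page_anchor.page_refs],
                "confidence": entity.confidence,
                "snippet": snippet[:100],
            }
        )
    return splits


def _fake_entity(confidence, start, end, pages):
    """Build a stand-in entity with the attributes used above (illustration only)."""
    return SimpleNamespace(
        confidence=confidence,
        text_anchor=SimpleNamespace(
            text_segments=[SimpleNamespace(start_index=start, end_index=end)]
        ),
        page_anchor=SimpleNamespace(
            page_refs=[SimpleNamespace(page=p) for p in pages]
        ),
    )


# Hand-built stand-in for a processed document, not real API output
fake_document = SimpleNamespace(
    text="Intake form page. Invoice page.",
    entities=[
        _fake_entity(0.2, 0, 17, [0, 1]),
        _fake_entity(0.97, 18, 31, [2]),
    ],
)

for split in collect_splits(fake_document, min_confidence=0.5):
    print(split)
```

The 0.5 threshold is arbitrary. Whether a low-confidence split should be dropped or sent for manual review depends on your documents.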

We can now run the processor on the sample multi-document PDF.

In [None]:
process_document_sample(project_id, location, processor_id, file_path)

You should see output similar to:
```
Document processing complete.
The document contains the following text (first 100 characters):
FakeDoc M.D.
HEALTH INTAKE FORM
Please fill out the questionnaire carefully. The information you pro

The processor detected the following page split entities:
*** Entity number: 0, Split Confidence: 0.21864357590675354 ***
*** Page numbers: [0, 1] ***
Text snippet: FakeDoc M.D.
HEALTH INTAKE FORM
Please fill out the questionnaire carefully. The information you pro
*** Entity number: 1, Split Confidence: 0.970017671585083 ***
*** Page numbers: [2] ***
Text snippet: Invoice
DATE: 01/01/1970
INVOICE: NO. 001
FROM: Company ABC
user@companyabc.com
TO: John Doe
johndoe
```
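
The page numbers in this output are zero-based, so `[0, 1]` refers to the first two pages of the PDF. As a small sketch, the hypothetical helper `describe_split` below turns each page list into a one-based label that is easier to read. It assumes that the pages of a split are contiguous, and it uses the page lists and confidences copied from the sample output above.

```python
def describe_split(pages: list) -> str:
    """Format zero-based page indices as a one-based, human-readable range."""
    first, last = min(pages) + 1, max(pages) + 1
    return f"page {first}" if first == last else f"pages {first}-{last}"


# (pages, confidence) pairs copied from the sample output above
sample_splits = [([0, 1], 0.21864357590675354), ([2], 0.970017671585083)]

for number, (pages, confidence) in enumerate(sample_splits, start=1):
    print(f"Document {number}: {describe_split(pages)} (confidence {confidence:.2f})")

# Prints:
# Document 1: pages 1-2 (confidence 0.22)
# Document 2: page 3 (confidence 0.97)
```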

---------
Congratulations, you've successfully used the Document AI API to extract logical page boundaries from a multipage document. We encourage you to experiment with other documents.