## Using Document AI Warehouse Python Client Library to run common operations on Document AI Warehouse
**Prerequisites**
1. Ensure that you have a Document AI Warehouse instance. You can follow this [quickstart](https://cloud.google.com/document-warehouse/docs/quickstart) to complete the setup.
2. Create a Document AI [Invoice processor](https://cloud.google.com/document-ai/docs/processors-list#processor_invoice-processor) and update the DOCAI_PROCESSOR_ID variable below.
3. If you are using a Vertex AI Workbench managed JupyterLab notebook, grant [roles/contentwarehouse.documentAdmin](https://cloud.google.com/document-warehouse/docs/manage-access-control) and [documentai.apiUser](https://cloud.google.com/document-ai/docs/access-control/iam-roles) to the notebook's identity. If you are using your own development environment, grant the same roles to the identity you run as.
4. Install the dependencies listed in requirements.txt.

#### This notebook covers the following scenarios
1. Create document & folder schemas.
2. Create a folder using the folder schema created in step #1.
3. Create a document using the document schema created in step #1, from an inline raw document, and set property values.
4. Create a document using the document schema created in step #1, from a document stored in GCS, and embed the Document AI processor output along with it.
5. Link the document created in step #3 to the folder
6. Search document
7. Clean-up

In [None]:
# Install necessary Python libraries and restart your kernel after.
# !python -m pip install -r ./requirements.txt

In [1]:
import time

from document_ai_utils import DocumentaiUtils
from document_warehouse_utils import DocumentWarehouseUtils
import storage_utils

In [2]:
import base64
import json

In [3]:
PROJECT_NUM = 226216830996  # Set this to your project number
API_LOCATION = "us"  # Choose "us" or "eu"

DOCAI_PROCESSOR_ID = "<insert here>"  # Create a Document AI Invoice processor and enter its processor ID here.
caller_user = "user:<insert here>"  # Change this to the service account you have created. The "user:" prefix is required.

# Public GCS bucket and invoice document. You may change these to a document stored in your own bucket.
GCS_BUCKET = "cloud-samples-data"
BLOB_NAME = "documentai/Custom/Invoices/PDF_Unlabeled/fake_invoice_10103.pdf"
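
Two of the values above ship as placeholders, and the comment on `caller_user` says the `user:` prefix is required. A minimal sketch of a sanity check that could run before any API call is shown below. `check_config` is a hypothetical helper, not part of the notebook's utility modules, and the example address is made up.

```python
from typing import List

PLACEHOLDER = "<insert here>"


def check_config(processor_id: str, caller_user: str) -> List[str]:
    """Return a list of problems; an empty list means the values look usable."""
    problems = []
    if PLACEHOLDER in processor_id:
        problems.append("DOCAI_PROCESSOR_ID still contains the placeholder")
    if PLACEHOLDER in caller_user:
        problems.append("caller_user still contains the placeholder")
    if not caller_user.startswith("user:"):
        problems.append("caller_user must start with 'user:'")
    return problems


# Untouched defaults are reported; filled-in values pass.
print(check_config("<insert here>", "user:<insert here>"))
print(check_config("abc123", "user:someone@example.com"))
```

The check only looks at the shape of the strings. It cannot tell whether the processor or the identity actually exists.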

In [4]:
invoice_document_path = "./resources/sample_invoices/invoice1.pdf"
invoice_document_metadata_path = "./resources/metadata/invoice1.json"
document_schema_path = "./resources/schema_files/invoice_schema.json"
folder_schema_path = "./resources/schema_files/invoice_folder_schema.json"

In [5]:
schema_id_list = []
folder_id_list = []
document_id_list = []

#### 1. Create document & folder Schema.

In [6]:
# Instantiate DocumentWarehouseUtils.
dw_utils = DocumentWarehouseUtils(project_number=PROJECT_NUM, api_location=API_LOCATION)

In [7]:
!head {folder_schema_path} # Let's look at folder schema

{
  "display_name": "Invoices",
  "property_definitions": [],
  "document_is_folder": true,
  "description": "Invoice Folder"
}

##### As you can see, a folder is just a special type of document.
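
The only thing that marks the schema above as a folder schema is the `document_is_folder` flag. A small sketch that reads the flag from the schema JSON is given here, using the file content printed above inline so that it runs on its own. `is_folder_schema` is a hypothetical helper name.

```python
import json

# Same content as the folder schema file printed above.
folder_schema_text = """
{
  "display_name": "Invoices",
  "property_definitions": [],
  "document_is_folder": true,
  "description": "Invoice Folder"
}
"""


def is_folder_schema(schema_text: str) -> bool:
    """True when the schema JSON sets document_is_folder; a missing flag counts as False."""
    return bool(json.loads(schema_text).get("document_is_folder", False))


print(is_folder_schema(folder_schema_text))           # True
print(is_folder_schema('{"display_name": "Invoice"}'))  # False
```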

In [8]:
folder_schema = storage_utils.read_file(folder_schema_path, mode="r")

In [9]:
create_schema_response = dw_utils.create_document_schema(folder_schema)

parent: "projects/226216830996/locations/us"
document_schema {
  display_name: "Invoices"
  document_is_folder: true
  description: "Invoice Folder"
}

display_name: "Invoices"
document_is_folder: true
description: "Invoice Folder"



In [10]:
folder_schema_id = create_schema_response.name.split("/")[-1]

In [11]:
folder_schema_id

'6t6ebbp36kb9o'
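
The schema ID is simply the last path segment of the resource name returned by the API. As a sketch, the whole name can also be split into collection/ID pairs, which makes it easier to check that a name really points at a schema rather than a document. `parse_resource_name` is a hypothetical helper, and the name below combines the parent and ID shown in the outputs above.

```python
def parse_resource_name(name: str) -> dict:
    """Split 'collection/id/collection/id/...' into a {collection: id} mapping."""
    parts = name.split("/")
    if len(parts) % 2 != 0 or not all(parts):
        raise ValueError(f"not a well-formed resource name: {name!r}")
    return dict(zip(parts[0::2], parts[1::2]))


name = "projects/226216830996/locations/us/documentSchemas/6t6ebbp36kb9o"
parsed = parse_resource_name(name)
print(parsed["documentSchemas"])  # 6t6ebbp36kb9o
print(parsed["locations"])        # us
```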

In [12]:
schema_id_list.append(folder_schema_id)

In [13]:
# Let's try to get the schema using the schema_id
dw_utils.get_document_schema(schema_id=folder_schema_id)

name: "projects/226216830996/locations/us/documentSchemas/6t6ebbp36kb9o"
display_name: "Invoices"
document_is_folder: true
update_time {
  seconds: 1671184616
  nanos: 543639000
}
create_time {
  seconds: 1671184616
  nanos: 543639000
}
description: "Invoice Folder"
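
The `create_time` and `update_time` fields are printed as seconds and nanoseconds since the Unix epoch. A short sketch for turning such a pair into a readable UTC timestamp follows. `to_datetime` is a hypothetical helper, and it truncates nanoseconds to microseconds because Python's `datetime` has no finer resolution.

```python
from datetime import datetime, timedelta, timezone


def to_datetime(seconds: int, nanos: int = 0) -> datetime:
    """Convert a seconds/nanos pair to a timezone-aware UTC datetime."""
    epoch = datetime(1970, 1, 1, tzinfo=timezone.utc)
    return epoch + timedelta(seconds=seconds, microseconds=nanos // 1000)


created = to_datetime(1671184616, 543639000)
print(created.isoformat())  # 2022-12-16T09:56:56.543639+00:00
```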

#### Let's create the document schema

In [14]:
!head {document_schema_path} # Let's look at document schema

{
  "display_name": "Invoice",
  "property_definitions": [
    {
      "name": "supplier_name",
      "display_name": "supplier name",
      "is_repeatable": false,
      "is_filterable": true,
      "is_searchable": true,
      "is_metadata": true,


In [15]:
document_schema = storage_utils.read_file(document_schema_path, mode="r")

In [16]:
create_schema_response = dw_utils.create_document_schema(document_schema)
document_schema_id = create_schema_response.name.split("/")[-1]

parent: "projects/226216830996/locations/us"
document_schema {
  display_name: "Invoice"
  property_definitions {
    name: "supplier_name"
    is_filterable: true
    is_searchable: true
    is_metadata: true
    text_type_options {
    }
    display_name: "supplier name"
  }
  property_definitions {
    name: "supplier_address"
    is_filterable: true
    is_searchable: true
    is_metadata: true
    text_type_options {
    }
    display_name: "supplier address"
  }
  property_definitions {
    name: "invoice_id"
    is_filterable: true
    is_searchable: true
    is_metadata: true
    text_type_options {
    }
    display_name: "invoice id"
  }
  property_definitions {
    name: "invoice_date"
    is_filterable: true
    is_searchable: true
    is_metadata: true
    display_name: "invoice date"
    date_time_type_options {
    }
  }
  property_definitions {
    name: "total_amount"
    is_filterable: true
    is_searchable: true
    is_metadata: true
    float_type_options {
    }
 

In [17]:
schema_id_list.append(document_schema_id)

In [18]:
# Let's try to get the schema using the schema_id
dw_utils.get_document_schema(schema_id=document_schema_id)

name: "projects/226216830996/locations/us/documentSchemas/3lbhrm6pcsdo8"
display_name: "Invoice"
property_definitions {
  name: "supplier_name"
  is_filterable: true
  is_searchable: true
  is_metadata: true
  text_type_options {
  }
  display_name: "supplier name"
}
property_definitions {
  name: "supplier_address"
  is_filterable: true
  is_searchable: true
  is_metadata: true
  text_type_options {
  }
  display_name: "supplier address"
}
property_definitions {
  name: "invoice_id"
  is_filterable: true
  is_searchable: true
  is_metadata: true
  text_type_options {
  }
  display_name: "invoice id"
}
property_definitions {
  name: "invoice_date"
  is_filterable: true
  is_searchable: true
  is_metadata: true
  display_name: "invoice date"
  date_time_type_options {
  }
}
property_definitions {
  name: "total_amount"
  is_filterable: true
  is_searchable: true
  is_metadata: true
  float_type_options {
  }
  display_name: "Total Amount"
}
update_time {
  seconds: 1671184622
  nanos: 8
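
In the output above, each property definition carries exactly one `*_type_options` block, and that block is what determines the property's type. The sketch below derives a name-to-type mapping from a schema expressed as a plain dictionary. It assumes the dictionary keys mirror the field names in the printed output; `property_types` is a hypothetical helper, and the dictionary only reproduces the names and type blocks shown above.

```python
SUFFIX = "_type_options"


def property_types(schema: dict) -> dict:
    """Map each property name to its type, taken from its single *_type_options key."""
    result = {}
    for prop in schema.get("property_definitions", []):
        kinds = [key[: -len(SUFFIX)] for key in prop if key.endswith(SUFFIX)]
        # None flags a definition with zero or several type blocks.
        result[prop["name"]] = kinds[0] if len(kinds) == 1 else None
    return result


invoice_schema = {
    "display_name": "Invoice",
    "property_definitions": [
        {"name": "supplier_name", "text_type_options": {}},
        {"name": "supplier_address", "text_type_options": {}},
        {"name": "invoice_id", "text_type_options": {}},
        {"name": "invoice_date", "date_time_type_options": {}},
        {"name": "total_amount", "float_type_options": {}},
    ],
}

for name, kind in property_types(invoice_schema).items():
    print(f"{name}: {kind}")
```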

#### 2. Create a folder using schema created in step #1.

In [19]:
create_folder_response = dw_utils.create_document(
    display_name="Invoice",
    document_schema_id=folder_schema_id,
    caller_user_id=caller_user,
)

In [20]:
folder_id = create_folder_response.document.name.split("/")[-1]

In [21]:
folder_id

'1pa5a3u7e5cag'

In [22]:
folder_id_list.append(folder_id)

#### 3. Create a document using schema created in step #1 using inline raw document & set property values.

In [23]:
document_raw_bytes = storage_utils.read_file(invoice_document_path, mode="rb")

In [24]:
invoice_document_metadata = storage_utils.read_file(
    invoice_document_metadata_path, mode="r"
)

In [25]:
invoice_document_metadata_json = json.loads(invoice_document_metadata)
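
The `create_document` call in the next cell reads three keys from this metadata: `display_name`, `mime_type`, and `properties`. A minimal sketch of a check that reports missing keys before the call is made is shown below. `missing_metadata_keys` is a hypothetical helper, and the sample deliberately omits `properties` because the layout of that entry is not shown in this notebook.

```python
import json

REQUIRED_KEYS = ("display_name", "mime_type", "properties")


def missing_metadata_keys(metadata: dict) -> list:
    """Return the required top-level keys that are absent from the metadata."""
    return [key for key in REQUIRED_KEYS if key not in metadata]


sample = json.loads('{"display_name": "Invoice 1", "mime_type": "application/pdf"}')
print(missing_metadata_keys(sample))  # ['properties']
```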

In [26]:
create_document_response = dw_utils.create_document(
    display_name=invoice_document_metadata_json["display_name"],
    mime_type=invoice_document_metadata_json["mime_type"],
    document_schema_id=document_schema_id,
    metadata_properties=invoice_document_metadata_json["properties"],
    raw_inline_bytes=document_raw_bytes,
    caller_user_id=caller_user,
)

In [27]:
create_document_response  # Verify that the properties have been set correctly

document {
  name: "projects/226216830996/locations/us/documents/4i3b1v8ka8qho"
  display_name: "Invoice 1"
  document_schema_name: "projects/226216830996/locations/us/documentSchemas/3lbhrm6pcsdo8"
  cloud_ai_document {
    mime_type: "application/pdf"
  }
  properties {
    name: "supplier_name"
    text_values {
      values: "Acme Inc"
    }
  }
  properties {
    name: "supplier_address"
    text_values {
      values: "123 Main street N. Seattle, WA 98109"
    }
  }
  properties {
    name: "invoice_id"
    text_values {
      values: "ABC123"
    }
  }
  properties {
    name: "invoice_date"
    date_time_values {
      values {
        year: 2022
        month: 12
        day: 12
        utc_offset {
        }
      }
    }
  }
  properties {
    name: "total_amount"
    float_values {
      values: 3524.449951171875
    }
  }
  update_time {
    seconds: 1671184638
    nanos: 609607000
  }
  create_time {
    seconds: 1671184638
    nanos: 609607000
  }
  raw_document_file_typ
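
The `total_amount` value comes back as 3524.449951171875 rather than a round figure. This is what a value such as 3524.45 looks like after being stored as a 32-bit float and widened back to a Python float. The metadata file is not shown here, so 3524.45 as the input is an assumption. The sketch below reproduces the effect with the standard library only.

```python
import struct


def as_float32(value: float) -> float:
    """Round a Python float to the nearest 32-bit float and return it as a Python float."""
    return struct.unpack("<f", struct.pack("<f", value))[0]


print(as_float32(3524.45))  # 3524.449951171875
```

When comparing a stored amount with the original, compare with a tolerance (for example `math.isclose` with a relative tolerance around `1e-6`) rather than with `==`.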

Now let's create a document in Document AI Warehouse using a file stored in a GCS bucket.

In [28]:
document_id = create_document_response.document.name.split("/")[-1]
print(document_id)

4i3b1v8ka8qho


In [29]:
document_id_list.append(document_id)

#### 4. Create a document using the document schema created in step #1, from a document stored in a GCS bucket, and embed the Document AI processor output along with it.

In [30]:
docai_utils = DocumentaiUtils(project_number=PROJECT_NUM, api_location=API_LOCATION)

In [31]:
# Call Document AI Online API
document_ai_output = docai_utils.process_file_from_gcs(
    processor_id=DOCAI_PROCESSOR_ID, bucket_name=GCS_BUCKET, file_path=BLOB_NAME
)

In [32]:
create_document_response = dw_utils.create_document(
    display_name=invoice_document_metadata_json["display_name"],
    mime_type=invoice_document_metadata_json["mime_type"],
    document_schema_id=document_schema_id,
    metadata_properties=invoice_document_metadata_json["properties"],
    raw_document_path=f"gs://{GCS_BUCKET}/{BLOB_NAME}",
    docai_document=document_ai_output,
    caller_user_id=caller_user,
)
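
The `raw_document_path` argument above is a `gs://` URI assembled from the bucket and blob names. As a sketch, the two helpers below build such a URI and split it back into its parts, which can help when the path comes from user input. Both function names are hypothetical and are not part of `storage_utils`.

```python
def build_gcs_uri(bucket: str, blob: str) -> str:
    """Join a bucket and blob name into a gs:// URI."""
    return f"gs://{bucket}/{blob.lstrip('/')}"


def split_gcs_uri(uri: str) -> tuple:
    """Split a gs:// URI into (bucket, blob); raise ValueError if either part is missing."""
    prefix = "gs://"
    if not uri.startswith(prefix):
        raise ValueError(f"not a gs:// URI: {uri!r}")
    bucket, _, blob = uri[len(prefix):].partition("/")
    if not bucket or not blob:
        raise ValueError(f"URI needs both a bucket and a blob: {uri!r}")
    return bucket, blob


uri = build_gcs_uri(
    "cloud-samples-data",
    "documentai/Custom/Invoices/PDF_Unlabeled/fake_invoice_10103.pdf",
)
print(uri)
print(split_gcs_uri(uri))
```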

In [33]:
create_document_response

document {
  name: "projects/226216830996/locations/us/documents/2lj1f4rick560"
  display_name: "Invoice 1"
  document_schema_name: "projects/226216830996/locations/us/documentSchemas/3lbhrm6pcsdo8"
  properties {
    name: "supplier_name"
    text_values {
      values: "Acme Inc"
    }
  }
  properties {
    name: "supplier_address"
    text_values {
      values: "123 Main street N. Seattle, WA 98109"
    }
  }
  properties {
    name: "invoice_id"
    text_values {
      values: "ABC123"
    }
  }
  properties {
    name: "invoice_date"
    date_time_values {
      values {
        year: 2022
        month: 12
        day: 12
        utc_offset {
        }
      }
    }
  }
  properties {
    name: "total_amount"
    float_values {
      values: 3524.449951171875
    }
  }
  update_time {
    seconds: 1671184656
    nanos: 200909000
  }
  create_time {
    seconds: 1671184656
    nanos: 200909000
  }
  raw_document_file_type: RAW_DOCUMENT_FILE_TYPE_PDF
  creator: "226216830996-comp

In [34]:
document_id_gcs = create_document_response.document.name.split("/")[-1]

In [35]:
document_id_list.append(document_id_gcs)

#### 5. Link the document created in step #3 to the folder

In [36]:
dw_utils.link_document_to_folder(
    document_id=document_id, folder_document_id=folder_id, caller_user_id=caller_user
)

name: "projects/226216830996/locations/us/documents/1pa5a3u7e5cag/documentLinks/3ult13jh4soc0"
source_document_reference {
  document_name: "projects/226216830996/locations/us/documents/1pa5a3u7e5cag"
  display_name: "Invoice"
  document_is_folder: true
  update_time {
    seconds: 1671184629
    nanos: 133972000
  }
  create_time {
    seconds: 1671184629
    nanos: 133972000
  }
}
target_document_reference {
  document_name: "projects/226216830996/locations/us/documents/4i3b1v8ka8qho"
  display_name: "Invoice 1"
  update_time {
    seconds: 1671184638
    nanos: 611727000
  }
  create_time {
    seconds: 1671184638
    nanos: 611727000
  }
}
update_time {
  seconds: 1671184663
  nanos: 13675000
}
create_time {
  seconds: 1671184663
  nanos: 13675000
}
state: ACTIVE

#### 6. Search document

In [37]:
query = "ABC123"
search_response = dw_utils.search_documents(query=query, caller_user_id=caller_user)

In [38]:
search_response

SearchDocumentsPager<matching_documents {
  document {
    name: "projects/226216830996/locations/us/documents/4i3b1v8ka8qho"
    display_name: "Invoice 1"
    document_schema_name: "projects/226216830996/locations/us/documentSchemas/3lbhrm6pcsdo8"
    properties {
      name: "supplier_name"
      text_values {
        values: "Acme Inc"
      }
    }
    properties {
      name: "invoice_date"
      date_time_values {
        values {
          year: 2022
          month: 12
          day: 12
          utc_offset {
          }
        }
      }
    }
    properties {
      name: "supplier_address"
      text_values {
        values: "123 Main street N. Seattle, WA 98109"
      }
    }
    properties {
      name: "invoice_id"
      text_values {
        values: "ABC123"
      }
    }
    properties {
      name: "total_amount"
      float_values {
        values: 3524.449951171875
      }
    }
    update_time {
      seconds: 1671184638
      nanos: 611727000
    }
    create_time {

#### 7. Clean-up

In [1]:
def cleanup():
    # delete documents
    for document_id in document_id_list:
        dw_utils.delete_document(document_id, caller_user_id=caller_user)
        print(f"Document:{document_id} deleted")

    # delete folders
    for folder_document_id in folder_id_list:
        dw_utils.delete_document(folder_document_id, caller_user_id=caller_user)
        print(f"Folder:{folder_document_id} deleted")

    time.sleep(2)
    # Delete schemas
    for schema_id in schema_id_list:
        dw_utils.delete_document_schema(schema_id)
        print(f"Schema:{schema_id} deleted")

    document_id_list.clear()
    folder_id_list.clear()
    schema_id_list.clear()

In [40]:
cleanup()

Document:4i3b1v8ka8qho deleted
Document:2lj1f4rick560 deleted
Folder:1pa5a3u7e5cag deleted
Schema:6t6ebbp36kb9o deleted
Schema:3lbhrm6pcsdo8 deleted
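
The `cleanup()` function above stops at the first failed delete, which can leave later resources behind. One possible variation, sketched here, keeps going and collects the failures so they can be retried or inspected. `delete_all` is a hypothetical helper, and `fake_delete` stands in for calls such as `dw_utils.delete_document` so that the example runs without a Warehouse instance.

```python
from typing import Callable, Iterable, List, Tuple


def delete_all(
    ids: Iterable[str], delete_fn: Callable[[str], object], label: str
) -> List[Tuple[str, str]]:
    """Call delete_fn for every ID; return (id, error message) for each failure."""
    failures = []
    for resource_id in list(ids):
        try:
            delete_fn(resource_id)
            print(f"{label}:{resource_id} deleted")
        except Exception as exc:  # keep going so one failure does not block the rest
            failures.append((resource_id, str(exc)))
    return failures


def fake_delete(resource_id: str) -> None:
    """Stand-in for a real delete call; fails for one ID to show the error path."""
    if resource_id == "missing":
        raise RuntimeError("not found")


failed = delete_all(["4i3b1v8ka8qho", "missing", "2lj1f4rick560"], fake_delete, "Document")
print(failed)  # [('missing', 'not found')]
```

With the real client, `delete_fn` could be a small lambda that also passes `caller_user_id`. Keeping the original order (documents, then folders, then schemas) still seems advisable, since the notebook deletes schemas last and waits briefly before doing so.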
