# CDE HITL Line Item Prefix Issue

* Author: docai-incubator@google.com

## Disclaimer

This tool is not supported by the Google engineering team or product team. It is provided and supported on a best-effort basis by the **DocAI Incubator Team**. No guarantees of performance are implied.


## Objective

This document addresses a Custom Document Extractor (CDE) processor trained on invoices whose child items cannot be updated in Human-in-the-Loop (HITL) review because the child entity types lack the parent prefix. The code below adds the parent prefix to each child item (matching the invoice parser output, for example `line_item/description`) and then triggers HITL through the invoice parser HITL endpoint.
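To make the renaming concrete, here is a minimal sketch that works on plain dictionaries rather than Document AI objects. The helper name `add_parent_prefix` and the sample line item are illustrative assumptions and not part of the tool; only the `type` and `properties` keys follow the Document JSON layout.

```python
from copy import deepcopy


def add_parent_prefix(entity: dict) -> dict:
    """Return a copy of `entity` whose child types are prefixed with the parent type."""
    updated = deepcopy(entity)
    prefix = updated["type"] + "/"
    for child in updated.get("properties", []):
        # Skip children that already carry the prefix so the helper is idempotent.
        if not child["type"].startswith(prefix):
            child["type"] = prefix + child["type"]
    return updated


# Hypothetical CDE-style line item: the children have no parent prefix.
cde_line_item = {
    "type": "line_item",
    "properties": [
        {"type": "description", "mentionText": "Widget"},
        {"type": "amount", "mentionText": "10.00"},
    ],
}

invoice_style = add_parent_prefix(cde_line_item)
print([child["type"] for child in invoice_style["properties"]])
# prints ['line_item/description', 'line_item/amount']
```

Working on a copy leaves the input untouched, and the `startswith` check means running the helper twice does not produce a doubled prefix.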


## Prerequisites
* Vertex AI Notebook
* CDE parser output
* Invoice parser (if the schema of the standard invoice parser differs from the CDE schema, the invoice parser has to be uptrained so that all the child items can be edited based on the threshold)


## Step-by-Step Procedure


### 1. Importing Required Modules

In [None]:
!wget https://raw.githubusercontent.com/GoogleCloudPlatform/document-ai-samples/main/incubator-tools/best-practices/utilities/utilities.py

In [None]:
import utilities
import json
from google.api_core.client_options import ClientOptions
from google.cloud import documentai_v1beta3 as documentai
from google.cloud import storage
from tqdm import tqdm

### 2. Set Up the Inputs

* `gcs_input_path`: GCS path of the folder containing the input files (the CDE parser output JSONs).
* `gcs_output_path`: GCS path of the folder where the updated JSON files are written. Include the trailing `/`, because file names are appended directly to this path.
* `project_id`: ID of the Google Cloud project.
* `invoice_processor_id`: Processor ID of the invoice processor.
* `location_processor`: Location (region) of the processor, for example `us`.


In [None]:
# input details
gcs_input_path = "gs://xxxxxxxxxx/"
gcs_output_path = "gs://xxxxxxxxxx/"
project_id = "xxxxxxxxxxxx"
invoice_processor_id = "xxxxxxxxxxx"
location_processor = "us"

* **Note**: If the schema of the standard invoice parser differs from the CDE schema, the invoice parser has to be uptrained so that all the child items can be edited based on the threshold.
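The cells below take the bucket name from a `gs://` path with `split("/")[2]` and treat the rest as the object prefix. As a minimal sketch of the same idea with some validation, assuming a hypothetical helper named `split_gcs_path` that is not part of `utilities.py`:

```python
def split_gcs_path(gcs_path: str) -> tuple[str, str]:
    """Split 'gs://bucket/some/prefix' into ('bucket', 'some/prefix/')."""
    scheme = "gs://"
    if not gcs_path.startswith(scheme):
        raise ValueError(f"Expected a gs:// path, got: {gcs_path!r}")
    bucket, _, prefix = gcs_path[len(scheme):].partition("/")
    if not bucket:
        raise ValueError(f"No bucket name found in: {gcs_path!r}")
    # A non-empty prefix must end with '/' so that prefix + filename is a valid object name.
    if prefix and not prefix.endswith("/"):
        prefix += "/"
    return bucket, prefix


print(split_gcs_path("gs://my-bucket/invoices/output"))
# prints ('my-bucket', 'invoices/output/')
print(split_gcs_path("gs://my-bucket/"))
# prints ('my-bucket', '')
```

Normalizing the trailing slash up front avoids object names such as `invoices/outputfile.json` when the path is entered without it.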


### 3. Run the Code

In [None]:
def update_json(json_dict: documentai.Document) -> documentai.Document:
    """
    Prefixes each child entity type with its parent entity type and rebuilds the
    parent's page anchor as the box enclosing all of its child entities.

    Args:
    - json_dict (Document): Input document in protobuf format.

    Returns:
    - Document: Updated document with prefixed child types and adjusted page anchors.
    """
    # Prefix every child type with its parent type, e.g. "line_item/description".
    for entity in json_dict.entities:
        if entity.properties:
            for subentity in entity.properties:
                subentity.type_ = entity.type_ + "/" + subentity.type_
    line = []
    for entity in json_dict.entities:
        if entity.properties:
            line.append(entity)
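    for line_1 in line:
        # Gather the vertices of every child bounding box (first page reference only).
        x_temp = []
        y_temp = []
        for subentity in line_1.properties:
            if not subentity.page_anchor.page_refs:
                continue  # a child without a page anchor contributes no vertices
            for item in subentity.page_anchor.page_refs[
                0
            ].bounding_poly.normalized_vertices:
                x_temp.append(item.x)
                y_temp.append(item.y)
        if not x_temp:
            continue  # nothing to enclose; keep the parent's existing page anchor
        x_min = min(x_temp)
        y_min = min(y_temp)
        x_max = max(x_temp)
        y_max = max(y_temp)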

        # Rectangle enclosing all children, listed clockwise from the top-left corner.
        line_1.page_anchor = {
            "page_refs": [
                {
                    "bounding_poly": {
                        "normalized_vertices": [
                            {"x": x_min, "y": y_min},
                            {"x": x_max, "y": y_min},
                            {"x": x_max, "y": y_max},
                            {"x": x_min, "y": y_max},
                        ]
                    }
                }
            ]
        }

    # Keep the entities without children first; the updated parents are appended afterwards.
    updated_ent = []
    for ent in json_dict.entities:
        if not ent.properties:
            updated_ent.append(ent)

    for l1 in line:
        updated_ent.append(l1)

    json_dict.entities = updated_ent
    return json_dict


def review_document_sample(
    project_id: str, location: str, processor_id: str, file_path: str
) -> str:
    """
    Sends a human review request for a processed document.

    Args:
    - project_id (str): Project ID.
    - location (str): Location of the document processor.
    - processor_id (str): ID of the document processor.
    - file_path (str): Path to the document file.

    Returns:
    - str: Operation name that can be used to check the status of the request.
    """
    # You must set the api_endpoint if you use a location other than 'us'.
    opts = ClientOptions(api_endpoint=f"{location}-documentai.googleapis.com")

    # Create a client
    client = documentai.DocumentProcessorServiceClient(client_options=opts)

    # Load the updated document JSON from GCS as a Document object
    inline_document = json_to_inline(file_path)

    # Gets the full resource name of the human review config
    human_review_config = client.human_review_config_path(
        project_id, location, processor_id
    )

    # Options are DEFAULT, URGENT
    priority = documentai.ReviewDocumentRequest.Priority.DEFAULT

    # Sends the human review request
    request = documentai.ReviewDocumentRequest(
        inline_document=inline_document,
        human_review_config=human_review_config,
        enable_schema_validation=True,
        priority=priority,
    )

    # Make a request for human review of the processed document
    operation = client.review_document(request=request)

    # Return the operation name, which can be used to check the status of the request
    operation_name = operation.operation.name
    return operation_name


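def json_to_inline(file_path: str) -> documentai.Document:
    """
    Loads a Document JSON file from GCS and converts it to a Document object.

    Note: relies on the module-level variable `bucket_name`, which is set in the
    cell that triggers HITL before this function is called.

    Args:
    - file_path (str): Path of the JSON document file within the bucket.

    Returns:
    - Document: The parsed document object.
    """
    storage_client = storage.Client()
    # `bucket_name` is a global defined in the cell that triggers HITL.
    bucket = storage_client.bucket(bucket_name)
    blob = bucket.blob(file_path)
    json_data = json.loads(blob.download_as_text())
    # Remove the "docid" key, if present, before parsing.
    json_data.pop("docid", None)
    json_string = json.dumps(json_data)
    document = documentai.Document.from_json(json_string)
    return document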
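`update_json` rebuilds each parent's bounding box as the smallest rectangle that encloses all of its child boxes. The following is a minimal standalone sketch of that computation on plain `(x, y)` tuples. The function name `enclosing_box` and the sample coordinates are assumptions made for illustration, and the clockwise vertex order is a choice made in this sketch.

```python
Vertex = tuple[float, float]


def enclosing_box(child_boxes: list[list[Vertex]]) -> list[Vertex]:
    """Return the axis-aligned rectangle that encloses every vertex of every child box."""
    xs = [x for box in child_boxes for x, _ in box]
    ys = [y for box in child_boxes for _, y in box]
    if not xs:
        raise ValueError("At least one vertex is required")
    x_min, x_max = min(xs), max(xs)
    y_min, y_max = min(ys), max(ys)
    # Clockwise from the top-left corner (normalized coordinates, origin at top-left).
    return [(x_min, y_min), (x_max, y_min), (x_max, y_max), (x_min, y_max)]


# Hypothetical normalized boxes for two children of one line item.
description_box = [(0.10, 0.40), (0.50, 0.40), (0.50, 0.43), (0.10, 0.43)]
amount_box = [(0.80, 0.41), (0.90, 0.41), (0.90, 0.44), (0.80, 0.44)]

print(enclosing_box([description_box, amount_box]))
# prints [(0.1, 0.4), (0.9, 0.4), (0.9, 0.44), (0.1, 0.44)]
```

Because only `min` and `max` are taken, the result does not depend on the order in which the child boxes or their vertices are listed.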

In [None]:
# Save the updated JSONs to the output folder
file_names_list, file_dict = utilities.file_names(gcs_input_path)
for filename, filepath in tqdm(file_dict.items(), desc="Progress"):
    input_bucket_name = gcs_input_path.split("/")[2]
    json_dict = utilities.documentai_json_proto_downloader(input_bucket_name, filepath)
    json_dict_updated = update_json(json_dict)
    output_bucket_name = gcs_output_path.split("/")[2]
    output_path_within_bucket = "/".join(gcs_output_path.split("/")[3:]) + filename
    utilities.store_document_as_json(
        documentai.Document.to_json(json_dict_updated),
        output_bucket_name,
        output_path_within_bucket,
    )

In [None]:
# Manually trigger HITL using the invoice parser HITL endpoint
file_out, file_dict_out = utilities.file_names(gcs_output_path)
for filename, filepath in tqdm(file_dict_out.items(), desc="Progress"):
    # json_to_inline() reads this module-level variable
    bucket_name = gcs_output_path.split("/")[2]
    x = review_document_sample(
        project_id=project_id,
        location=location_processor,
        processor_id=invoice_processor_id,
        file_path=filepath,
    )
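The loop above discards the value returned for each file, and a single failed request stops the whole batch. One possible way to keep track of what was submitted is sketched below. `submit_all` and `fake_submit` are hypothetical names introduced here; `fake_submit` is only a stand-in that makes the sketch runnable without calling Document AI.

```python
from typing import Callable


def submit_all(
    file_paths: dict[str, str],
    submit: Callable[[str], str],
) -> tuple[dict[str, str], dict[str, str]]:
    """Call `submit` for every file and return (results, errors), both keyed by file name."""
    submitted: dict[str, str] = {}
    failed: dict[str, str] = {}
    for filename, filepath in file_paths.items():
        try:
            submitted[filename] = submit(filepath)
        except Exception as exc:  # broad on purpose: one bad file should not stop the batch
            failed[filename] = str(exc)
    return submitted, failed


def fake_submit(filepath: str) -> str:
    """Stand-in for the real HITL call; fails for one file to show the error path."""
    if filepath.endswith("bad.json"):
        raise ValueError("schema validation failed")
    return f"operations/{filepath}"


submitted, failed = submit_all(
    {"a.json": "out/a.json", "bad.json": "out/bad.json"}, fake_submit
)
print(submitted)  # prints {'a.json': 'operations/out/a.json'}
print(failed)     # prints {'bad.json': 'schema validation failed'}
```

In the notebook, `submit` could be a small function that wraps `review_document_sample` with the project, location, and processor ID already filled in.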

### 4. Output

The output JSON files match the invoice parser output in how line items are structured (child types carry the parent prefix), and HITL is triggered through the invoice parser endpoint.
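A quick way to confirm that an output file has the expected structure is to list any child entity whose type still lacks the parent prefix. The sketch below assumes the JSON has already been loaded into a dictionary; the function name `children_missing_prefix` and the sample document are hypothetical.

```python
def children_missing_prefix(document: dict) -> list[str]:
    """Return the types of child entities that do not start with '<parent type>/'."""
    missing = []
    for entity in document.get("entities", []):
        prefix = entity["type"] + "/"
        for child in entity.get("properties", []):
            if not child["type"].startswith(prefix):
                missing.append(child["type"])
    return missing


# Hypothetical document with one correctly prefixed child and one unprefixed child.
sample_document = {
    "entities": [
        {"type": "invoice_id"},
        {
            "type": "line_item",
            "properties": [
                {"type": "line_item/description"},
                {"type": "amount"},
            ],
        },
    ]
}

print(children_missing_prefix(sample_document))
# prints ['amount']
```

An empty list means every child item carries its parent prefix.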

<img src="./Images/HITL_output_1.png" width=800 height=400 ></img>