# Label migration - Child to Parent

* Author: docai-incubator@google.com

## Disclaimer

This tool is not supported by the Google engineering team or product team. It is provided and supported on a best-effort basis by the DocAI Incubator Team. No guarantees of performance are implied.	

## Objective
This tool reads the labeled dataset exported from a processor, removes the child item “po_number” from each “invoice_line_item” entity, adds it back as an independent top-level entity, and saves the updated JSON to Google Cloud Storage.
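
To make the transformation concrete, here is a minimal sketch on a plain Python dict shaped like an exported Document JSON. The sample data and the helper name `promote_po_number` are illustrative assumptions; the tool itself works on `documentai.Document` objects read from GCS.

```python
import copy


def promote_po_number(document: dict) -> dict:
    """Return a copy of `document` with po_number children promoted to top-level entities."""
    result = copy.deepcopy(document)
    promoted = []
    for entity in result.get("entities", []):
        kept = []
        for prop in entity.get("properties", []):
            if prop.get("type") in ("invoice_line_item/po_number", "po_number"):
                # Rename the child type and set it aside for promotion.
                prop["type"] = "po_number"
                promoted.append(prop)
            else:
                kept.append(prop)
        if "properties" in entity:
            entity["properties"] = kept
    # Promoted items go to the end of the top-level entity list.
    result.setdefault("entities", []).extend(promoted)
    return result


# Hypothetical sample, for illustration only.
sample = {
    "entities": [
        {
            "type": "invoice_line_item",
            "mentionText": "PO-123 Widget",
            "properties": [
                {"type": "invoice_line_item/po_number", "mentionText": "PO-123"},
                {"type": "invoice_line_item/description", "mentionText": "Widget"},
            ],
        }
    ]
}

migrated = promote_po_number(sample)
print([e["type"] for e in migrated["entities"]])  # ['invoice_line_item', 'po_number']
```

The sketch builds a new property list instead of deleting items in place, which avoids skipping elements while iterating.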

## Prerequisites
* Vertex AI Notebook
* GCS URI of the folder that contains the exported labeled dataset
* Permissions for Google Cloud Storage and the Vertex AI Notebook


## Step-by-Step Procedure

**1. Input and Output Path**

Provide the input and output GCS paths.


In [None]:
input_bucket_path = "gs://xxxxxxxxxxxxx-xxxx-test-bucket/cde_xxxxxxx_export_test/"  # Keep the trailing '/'.
output_bucket_path = "gs://xxxxxxxxxx-xxxxx-test-bucket/cde_xxxxxxx_output_export_test/"  # Keep the trailing '/'.

**input_bucket_path:** GCS URI of the folder that contains the dataset exported from the processor.  

**output_bucket_path:** GCS URI of the folder where the updated JSON files are saved.
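
Both paths are split into a bucket name and an object prefix before use. The helper below is a hedged sketch of that split; `split_gcs_uri` and the example URIs are hypothetical and not part of the tool, which splits the string on `/` directly and therefore relies on the trailing slash.

```python
from typing import Tuple


def split_gcs_uri(uri: str) -> Tuple[str, str]:
    """Split 'gs://bucket/some/folder/' into ('bucket', 'some/folder/')."""
    scheme = "gs://"
    if not uri.startswith(scheme):
        raise ValueError(f"Not a GCS URI: {uri!r}")
    bucket, _, prefix = uri[len(scheme):].partition("/")
    if not bucket:
        raise ValueError(f"Missing bucket name in: {uri!r}")
    # Normalize the prefix so it ends with exactly one '/' (empty for the bucket root).
    prefix = prefix.strip("/")
    return bucket, (prefix + "/") if prefix else ""


# Hypothetical URIs, for illustration only.
print(split_gcs_uri("gs://example-bucket/export_test/"))
print(split_gcs_uri("gs://example-bucket/export_test"))
```

Normalizing the prefix this way makes the result the same whether or not the trailing slash was typed.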

**Note:** The output folder keeps the same folder structure as input_bucket_path, as shown in the screenshots below.

<img src="./images/image_1.png" width=800 height=400></img>

<img src="./images/image_2.png" width=800 height=400></img>
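
One way to express how an input object name maps to its output object name while preserving the folder structure is sketched below. The function `destination_name` and the sample names are assumptions made for illustration.

```python
def destination_name(source_name: str, input_prefix: str, output_prefix: str) -> str:
    """Map an object under `input_prefix` to the same relative path under `output_prefix`."""
    if not source_name.startswith(input_prefix):
        raise ValueError(f"{source_name!r} is not under {input_prefix!r}")
    # Keep everything after the input prefix, including sub-folders.
    relative_path = source_name[len(input_prefix):]
    return output_prefix + relative_path


# Hypothetical object names, for illustration only.
print(destination_name("export_test/train/doc_1.json", "export_test/", "output_export_test/"))
# output_export_test/train/doc_1.json
```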

**2. Run the Code**
	
Copy the code from the "Code to Execute" section of this document into the notebook, set the paths as described in step 1, and run it.


<img src="./images/image_3.png" width=800 height=400></img>

**3. Output**

The output folder contains the updated JSON files, in which “po_number” is removed from invoice_line_item and is present as an independent entity.
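
A quick way to sanity-check an output file is to confirm that no `po_number` remains nested under a parent entity and that it appears at the top level. The sketch below assumes the JSON has already been loaded into a dict; the helper names and the sample document are hypothetical.

```python
from collections import Counter

PO_TYPES = ("invoice_line_item/po_number", "po_number")


def nested_po_numbers(document: dict) -> list:
    """Return child properties that are still typed as a PO number."""
    leftovers = []
    for entity in document.get("entities", []):
        for prop in entity.get("properties", []):
            if prop.get("type") in PO_TYPES:
                leftovers.append(prop)
    return leftovers


def top_level_counts(document: dict) -> Counter:
    """Count top-level entities by type."""
    return Counter(entity.get("type") for entity in document.get("entities", []))


# Hypothetical migrated document, for illustration only.
migrated_sample = {
    "entities": [
        {
            "type": "invoice_line_item",
            "properties": [
                {"type": "invoice_line_item/description", "mentionText": "Widget"}
            ],
        },
        {"type": "po_number", "mentionText": "PO-123", "properties": []},
    ]
}

print(len(nested_po_numbers(migrated_sample)))          # 0
print(top_level_counts(migrated_sample)["po_number"])   # 1
```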


**4. Comparison Between Input and Output File**

**Input File:**

<img src="./images/label_migration_input.png" width=800 height=400></img>

**Output File:**

<img src="./images/label_migration_output.png" width=800 height=400></img>

## Code to Execute

In [None]:
# Install the libraries below if they are not already installed, then continue.
# !pip install google-cloud-documentai
# !pip install google-cloud-storage
# !pip install tqdm


input_bucket_path = "gs://xxxxxxxxxxxxx-xxxx-test-bucket/cde_xxxxxxx_export_test/"  # Keep the trailing '/'.
output_bucket_path = "gs://xxxxxxxxxx-xxxxx-test-bucket/cde_xxxxxxx_output_export_test/"  # Keep the trailing '/'.

import json
from google.cloud import documentai_v1beta3 as documentai
from tqdm.notebook import tqdm
from google.cloud import storage

storage_client = storage.Client()
source_bucket = storage_client.bucket(input_bucket_path.split("/")[2])
source_blob = source_bucket.list_blobs(
    prefix="/".join(input_bucket_path.split("/")[3:])
)
destination_bucket = storage_client.bucket(output_bucket_path.split("/")[2])
# Collect the names of all JSON files under the input prefix.
list_of_files = []
for blob in source_blob:
    if blob.name.endswith(".json"):
        list_of_files.append(blob.name)


def remove_po_number_parent(filePath):
    """
    Promotes 'po_number' child properties to top-level entities in a document JSON stored in a GCS bucket.

    Args:
    - filePath (str): The object name (path) of the file within the input Google Cloud Storage bucket.

    This function:
    1. Downloads the JSON representation of a document from the input Google Cloud Storage bucket.
    2. Removes the 'boundingPolyForDemoFrontend' key from each entity and its properties.
    3. Changes the type of properties typed 'invoice_line_item/po_number' or 'po_number' to 'po_number'.
    4. Removes those properties from their parent entity and appends them to the end of the entities list.
    5. Uploads the modified JSON to the output location in Google Cloud Storage.
    """
    # download_as_bytes() replaces the deprecated download_as_string().
    x = json.loads(source_bucket.blob(filePath).download_as_bytes().decode("utf-8"))
    # Drop the 'boundingPolyForDemoFrontend' key from entities and their
    # properties so that the JSON parses cleanly into a Document.
    for entity in x["entities"]:
        entity.pop("boundingPolyForDemoFrontend", None)
        for j in entity.get("properties", []):
            j.pop("boundingPolyForDemoFrontend", None)
    doc = documentai.Document.from_json(json.dumps(x))
    entity_deleted = []
    for entity in doc.entities:
        # Iterate over a copy: removing items from a sequence while iterating
        # over it skips the element that follows each removal.
        for prop in list(entity.properties):
            # 'type_' is the field name used by the Python client for Entity.type.
            if prop.type_ in ("invoice_line_item/po_number", "po_number"):
                prop.type_ = "po_number"
                entity_deleted.append(prop)
                entity.properties.remove(prop)
    # Append the promoted properties as top-level entities.
    for entity in entity_deleted:
        doc.entities.append(entity)
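    # Build the output object name from the path relative to the input prefix,
    # so the folder structure is preserved even when the input folder is nested.
    input_prefix = "/".join(input_bucket_path.split("/")[3:])
    output_prefix = "/".join(output_bucket_path.split("/")[3:])
    relative_path = filePath[len(input_prefix):].lstrip("/")
    blob = destination_bucket.blob(output_prefix + relative_path)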
    temp_dict = json.loads(documentai.Document.to_json(doc))
    blob.upload_from_string(
        data=bytes(json.dumps(temp_dict, ensure_ascii=False), "utf-8"),
        content_type="application/json",
    )


for i in tqdm(list_of_files):
    remove_po_number_parent(i)