# Key Value Pair Entity Conversion

* Author: docai-incubator@google.com

## Disclaimer

This tool is not supported by the Google engineering team or product team. It is provided and supported on a best-effort basis by the DocAI Incubator Team. No guarantees of performance are implied.

## Purpose and Description

This tool reads Form Parser JSON files (the output of a Form Parser processor) from a GCS bucket, converts the key/value pairs into entities, and writes the results back to a GCS bucket as JSON files.
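To make the conversion concrete, here is a minimal sketch that works on plain dictionaries shaped like Document AI JSON (camelCase keys such as `fieldName` and `fieldValue`). The helper name `form_field_to_entity` and the sample values are hypothetical and not part of the tool; the actual script in the Sample Code section operates on `documentai.Document` objects instead.

```python
def form_field_to_entity(form_field: dict, entity_type: str, page_index: int) -> dict:
    """Build an entity-shaped dict from one form field (illustrative only)."""
    value = form_field["fieldValue"]
    return {
        "type": entity_type,
        # The entity text, confidence and anchors all come from the field VALUE.
        "mentionText": value["textAnchor"]["content"],
        "confidence": value.get("confidence", 0.0),
        "textAnchor": value["textAnchor"],
        "pageAnchor": {
            "pageRefs": [
                {"page": page_index, "boundingPoly": value.get("boundingPoly", {})}
            ]
        },
    }


# Hypothetical form field, trimmed to the keys used above.
sample_field = {
    "fieldName": {"textAnchor": {"content": "Due Date:\n"}, "confidence": 0.97},
    "fieldValue": {
        "textAnchor": {"content": "01/02/2023\n"},
        "confidence": 0.95,
        "boundingPoly": {"normalizedVertices": [{"x": 0.1, "y": 0.2}]},
    },
}

entity = form_field_to_entity(sample_field, "Due_date", page_index=0)
print(entity["type"], repr(entity["mentionText"]))
```

The field name is used only to decide the entity type; everything stored on the entity is taken from the field value.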

## Prerequisites

1. Vertex AI Notebook
2. Form Parser output JSON files in a GCS folder

## Step-by-Step Procedure

### 1. Setup Input Variables


 * **PROJECT_ID:** your GCP project ID (optional)
 * **bucket_name:** the name of the GCS bucket
 * **formparser_path:** the folder, without the bucket name, that contains the JSON files produced by the Form Parser
 * **output_path:** the folder, without the bucket name, where the converted JSON files will be saved
 * **entity_synonyms_list:** a list of dictionaries, each mapping one entity name to a list of its synonyms. Replace "Entity_1" and "Entity_2" with your entity names, and "Entity_1_synonyms_1" and so on with the key texts that should map to that entity. Add as many entities as needed.

     [{"Entity_1":["Entity_1_synonyms_1","Entity_1_synonyms_2","Entity_1_synonyms_3"]},{"Entity_2":["Entity_2_synonyms_1","Entity_2_synonyms_2","Entity_2_synonyms_3"]}]
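The sample code further down cleans each form-field key before looking it up: it removes spaces, strips surrounding whitespace, drops punctuation, and then compares case-insensitively against the synonyms. The sketch below mirrors that logic in isolation so the effect on matching is easy to see. The function names `clean_key` and `lookup_entity_type` are hypothetical, and the loop assumes each dictionary holds a single entity name, as in the template above.

```python
import re

entity_synonyms_list = [
    {"Bill_to": ["Bill To:", "Bill To", "BillTo"]},
    {"Due_date": ["Due Date:", "Due Date", "DueDate"]},
]


def clean_key(raw_key: str) -> str:
    """Mirror the cleaning in the sample code: drop spaces, then punctuation."""
    return re.sub(r"[^\w\s]", "", raw_key.replace(" ", "").strip())


def lookup_entity_type(raw_key: str) -> str:
    """Return the entity name whose synonyms contain the cleaned key, else ""."""
    cleaned = clean_key(raw_key).lower()
    for item in entity_synonyms_list:
        for entity_name, synonyms in item.items():
            if cleaned in (s.lower() for s in synonyms):
                return entity_name
    return ""


for key in ["Bill To:\n", "due date", "Invoice No."]:
    print(repr(key), "->", repr(lookup_entity_type(key)))
```

Because the key is cleaned before the lookup, only the compact variant of a synonym (for example "BillTo") can match; variants that contain spaces or punctuation are harmless but never match on their own.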

### 2. Output

The converted JSON files are written to the GCS path given by the **output_path** variable.

![](https://screenshot.googleplex.com/43tgnEWB3HXSRpt.png)
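As a quick check on a converted file, a sketch like the following counts the entities by type. It assumes the JSON has already been loaded as a string (for example after downloading it from the bucket) and that it uses the camelCase keys that `documentai.Document.to_json` writes by default (`entities`, `type`, `mentionText`). The function name `summarize_entities` and the sample document are hypothetical.

```python
import json
from collections import Counter


def summarize_entities(document_json: str) -> Counter:
    """Count entities per type in a Document AI JSON string."""
    document = json.loads(document_json)
    return Counter(e.get("type", "") for e in document.get("entities", []))


# Hypothetical, heavily trimmed converted document.
sample_json = json.dumps(
    {
        "entities": [
            {"type": "Bill_to", "mentionText": "ACME Corp\n", "confidence": 0.93},
            {"type": "Due_date", "mentionText": "01/02/2023\n", "confidence": 0.95},
        ],
        "pages": [{"pageNumber": 1}],
    }
)

counts = summarize_entities(sample_json)
print(dict(counts))
```

An empty result for a file that visibly contains the expected keys usually points at the synonym list rather than at the parser output.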

### 3. Sample Code

#### Importing necessary modules

In [None]:
# Download incubator-tools utilities module to present-working-directory
!wget https://raw.githubusercontent.com/GoogleCloudPlatform/document-ai-samples/main/incubator-tools/best-practices/utilities/utilities.py

In [2]:
import re
from pathlib import Path

from google.cloud import documentai_v1beta3 as documentai
from google.cloud import storage
from tqdm import tqdm

# Helpers from the utilities.py module downloaded above
from utilities import documentai_json_proto_downloader, file_names

#### Setup the required inputs

In [2]:
PROJECT_ID = "xxx-xxx-xxx"  # your project id
bucket_name = "xxxxxx"  # bucket name

formparser_path = (
    "xxx/xxxxxxx/xxxxxx"  # path of the form parser output without bucket name
)
output_path = "xxxx/xxxxxxxx/xxxxx"  # output path for this script without bucket name

entity_synonyms_list = [
    {"Bill_to": ["Bill To:", "Bill To", "BillTo"]},
    {"Due_date": ["Due Date:", "Due Date", "DueDate"]},
]  # example

#### Execute the code

In [171]:
def list_blobs(bucket_name: str) -> list:
    """
    Return the names of all files in a bucket.

    Args:
        bucket_name: name of the GCS bucket.
    Returns:
        List of blob names.
    """
    storage_client = storage.Client()
    return [blob.name for blob in storage_client.list_blobs(bucket_name)]


def store_blob(document, file_name: str):
    """
    Store files in cloud storage.
    """

    storage_client = storage.Client()
    process_result_bucket = storage_client.get_bucket(bucket_name)
    document_blob = storage.Blob(
        name=str(Path(output_path, file_name)), bucket=process_result_bucket
    )
    document_blob.upload_from_string(document, content_type="application/json")


def entity_synonyms(old_entity: str) -> str:
    """
    Return the entity name whose synonyms contain old_entity (case-insensitive).
    """
    for item in entity_synonyms_list:
        # each item is a single-key dict: {entity_name: [synonyms]}
        synonym_list = list(map(str.lower, [*item.values()][0]))
        if old_entity.lower() in synonym_list:
            return [*item][0]

    # if the entity does not match any synonym, return an empty string
    return ""


def entity_data(formField_data, page_number: int):
    """
    Create an entity object from a form field, after cleaning the key name.
    Returns None when the key does not match any configured synonym.
    """
    # Cleaning the entity name: remove spaces, then punctuation
    key_name = re.sub(
        r"[^\w\s]",
        "",
        formField_data.field_name.text_anchor.content.replace(" ", "").strip(),
    )
    # checking for entity synonyms
    key_name = entity_synonyms(key_name)
    if not key_name:
        return None

    # initializing new entity from the form field value
    entity = documentai.Document.Entity()
    entity.confidence = formField_data.field_value.confidence
    entity.mention_text = formField_data.field_value.text_anchor.content
    page_ref = documentai.Document.PageAnchor.PageRef()
    page_ref.bounding_poly.normalized_vertices.extend(
        formField_data.field_value.bounding_poly.normalized_vertices
    )
    page_ref.page = page_number  # zero-based page index
    entity.page_anchor.page_refs.append(page_ref)
    entity.text_anchor = formField_data.field_value.text_anchor
    entity.type_ = key_name  # the "type" field is exposed as type_ in Python
    return entity


def convert_kv_entities(document: documentai.Document) -> documentai.Document:
    """
    Convert form parser key/value pairs to entities.
    """
    # initializing entities list
    document.entities = []

    for page_number, page_data in enumerate(document.pages):
        for formField_data in getattr(page_data, "form_fields", []):
            # build an entity and keep it only if the key matched a synonym
            entity_obj = entity_data(formField_data, page_number)
            if entity_obj:
                document.entities.append(entity_obj)
    # removing the form parser data
    for page in document.pages:
        del page.form_fields
        del page.tables

    return document


def main():
    """
    Main function to call helper functions
    """
    # fetching all the files
    files = list(file_names(f"gs://{bucket_name}/{formparser_path}")[1].values())
    for file in tqdm(files, desc="Status : "):
        # converting key value to entities
        entity_json = convert_kv_entities(
            documentai_json_proto_downloader(bucket_name, file)
        )

        # storing the json
        store_blob(documentai.Document.to_json(entity_json), file.split("/")[-1])


# calling main function
main()

Status : 100%|██████████| 2/2 [00:01<00:00,  1.14it/s]
