# Document AI Parser Result Merger

* Author: docai-incubator@google.com

## Disclaimer

This tool is not supported by the Google engineering team or product team. It is provided and supported on a best-effort basis by the DocAI Incubator Team. No guarantees of performance are implied.	

## Objective
Document AI Parser Result Merger is a Python tool that merges two or more JSON result files produced by Document AI processors into a single JSON file. This document describes how the tool (script) works and what it requires. The input documents usually contain multiple pages. The tool supports two use cases.
### Case 1: Different documents, parser results JSON merger (default)
 * Case 1 applies when the two or more parser output JSON files come from different documents.
 * To enable this case, set the flag to 1.
### Case 2: Same document, different parsers JSON merger (added functionality)
 * Case 2 applies when the two or more parser outputs come from the same document.
 * To enable this case, set the flag to 2.

## Prerequisites

This tool requires the following services:

 * Google Jupyter Notebook or Colab.
 * Google Cloud Storage 
 * Document AI processor and its JSON output files
 
Google Jupyter Notebook or Colab is used to run the Python notebook file. Cloud Storage buckets hold the input files for this script. The input files are the JSON files produced by a Document AI processor (e.g., Bank Statement Parser). Each of these JSON files can describe a multi-page document. After the script executes, the output is a single merged JSON file stored in the output bucket path.

## Workflow overview
    

![](https://screenshot.googleplex.com/9F5qLEtZJ4Kdj8m.png)

The diagram above shows the flow of the tool. There are input and output GCP buckets and a Python script that processes the request. The input bucket holds the multiple JSON files to be merged, and the Python script merges them into a single file. The script reads the input JSON files and runs in either the default Case 1 mode or the Case 2 mode, as selected by the MERGER_TYPE_FLAG variable and described in the previous sections. Finally, the single merged file is stored in the output GCP bucket.

## Script walkthrough
The script is explained step by step below.

## 1. Import Modules/Packages


In [5]:
import json
import re
from typing import Dict, List, Tuple, Union

from google.cloud import documentai_v1beta3 as documentai
from google.cloud import storage
from google.cloud.documentai_v1beta3 import Document

## 2. Input Details: Enter Project Details in the Variables Below


 * **PROJECT_ID:** your GCP project ID (optional).
 * **INPUT_MULTIPLE_JSONS_URI:** the URI of the folder containing the input files (must end with "/").
 * **JSON_DIRECTORY_PATH_OUTPUT:** the URI of the folder where the output file is written after the script runs (must end with "/").
 * **OUTPUT_FILE_NAME:** the name of the generated file saved in the output bucket.
 * **MERGER_TYPE_FLAG:** 1 or 2, depending on the use case described earlier in this document.

     - Case 1 applies when the two or more parser output JSON files come from different documents.

     - Case 2 applies when the two or more parser outputs come from the same document.


In [6]:
PROJECT_ID = "xxxx-xxxx"  # Optional
INPUT_MULTIPLE_JSONS_URI = "gs://xxxx/xxxx/"  # ends with "/"
JSON_DIRECTORY_PATH_OUTPUT = "gs://xxxx/xxxx/"  # ends with "/"
OUTPUT_FILE_NAME = "merged_file.json"
MERGER_TYPE_FLAG = 1  # 1 = different documents (default), 2 = same document
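
Before running the merger, it can help to check the values entered above. The following is a minimal sketch, not part of the original notebook: `validate_config` and `GCS_FOLDER_PATTERN` are hypothetical names introduced only for illustration. It assumes the rules stated in the list above (both URIs are `gs://` folder paths ending with "/") and mirrors the script's behavior of treating any flag other than 2 as the default Case 1.

```python
import re

# Hypothetical helper, not part of the original notebook.
GCS_FOLDER_PATTERN = re.compile(r"^gs://(?P<bucket>[^/]+)/(?P<prefix>.*)$")


def validate_config(input_uri: str, output_uri: str, flag) -> int:
    """Check the two folder URIs and return the normalized merger flag (1 or 2)."""
    named_uris = (
        ("INPUT_MULTIPLE_JSONS_URI", input_uri),
        ("JSON_DIRECTORY_PATH_OUTPUT", output_uri),
    )
    for name, uri in named_uris:
        if not GCS_FOLDER_PATTERN.match(uri):
            raise ValueError(f"{name} must look like gs://bucket/folder/, got {uri!r}")
        if not uri.endswith("/"):
            raise ValueError(f"{name} must end with '/', got {uri!r}")
    # The script falls back to Case 1 for anything other than 2 (or "2").
    return 2 if flag in (2, "2") else 1


print(validate_config("gs://xxxx/xxxx/", "gs://xxxx/xxxx/", 1))
```

A URI without the trailing "/" (for example `gs://xxxx/xxxx`) raises `ValueError` in this sketch, so the problem surfaces before any Cloud Storage call is made.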

## 3. Run the Code Below

Run all the cells below after updating the input variables above.


In [11]:
def split_gcs_folder(path: str) -> Tuple[str, str]:
    """
    This function splits gcs uri to 2 parts
        1. gcs bucket
        2. file path after bucket
    """

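    pattern = re.compile(r"gs://(?P<bucket>.*?)/(?P<files_dir>.*)")
    uri = re.match(pattern, path)
    if uri is None:
        # Fail with a clear message instead of an AttributeError on None
        raise ValueError(f"Not a valid GCS folder URI: {path!r}")
    return uri.group("bucket"), uri.group("files_dir")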


def file_names(bucket: str, files_dir_prefix: str) -> Tuple[List[str], Dict[str, str]]:
    """This Function will load the bucket and get the list of files
    in the gs path given
    """

    filenames_list = []
    filenames_dict = {}
    storage_client = storage.Client()
    bucket = storage_client.get_bucket(bucket)
    blobs = bucket.list_blobs(prefix=files_dir_prefix)
    filenames = [blob.name for blob in blobs]
    for filename in filenames:
        file = filename.split("/")[-1]
        if file:
            filenames_list.append(file)
            filenames_dict[file] = filename
    return filenames_list, filenames_dict


def list_json_files(filenames: List[str]) -> List[str]:
    """
    Takes filenames and returns only the JSON files as a list
    """

    json_files = []
    for filename in filenames:
        if filename.endswith(".json"):
            json_files.append(filename)
    return json_files


def assign_indexes(
    layout: Union[Document.Entity, Document.Page.Layout], text: str
) -> None:
    """
    It will assign new index values to start_index and end_index for respective class object
    """

    for text_segment in layout.text_anchor.text_segments:
        text_segment.end_index = int(text_segment.end_index) + len(text)
        text_segment.start_index = int(text_segment.start_index) + len(text)


def assign_page_ref_page(
    entity: documentai.Document.Entity, doc_first: documentai.Document
) -> None:
    """
    It will accumulate page count for entities-page_anchor-page_refs
    """

    for page_ref in entity.page_anchor.page_refs:
        page_ref.page = str(int(page_ref.page) + len(doc_first.pages))


### CASE - 1
def different_doc_merger(
    doc_first: documentai.Document, doc_second: documentai.Document
) -> documentai.Document:
    """
    This function takes two documentai.Document objects and merges them as one
    """

    doc_merged = documentai.Document()

    ### Entities ###
    for entity in doc_second.entities:
        assign_indexes(entity, doc_first.text)  # entity-textanchors
        assign_page_ref_page(entity, doc_first)  # entity-pageanchors
        for prop in entity.properties:  # entity properties
            assign_indexes(prop, doc_first.text)
            assign_page_ref_page(prop, doc_first)
    doc_merged.entities = list(doc_first.entities) + list(doc_second.entities)

    # Pages
    for page in doc_second.pages:
        print(page.page_number, end=" ")
        page.page_number = int(page.page_number) + len(
            doc_first.pages
        )  # Page Number increment in second doc
        print(" ", page.page_number)

        # page . layout . textanchor . textsegment
        assign_indexes(page.layout, doc_first.text)

        for block in page.blocks:
            assign_indexes(block.layout, doc_first.text)

        for paragraph in page.paragraphs:
            assign_indexes(paragraph.layout, doc_first.text)

        for line in page.lines:
            assign_indexes(line.layout, doc_first.text)

        for token in page.tokens:
            assign_indexes(token.layout, doc_first.text)

    doc_merged.pages = list(doc_first.pages) + list(doc_second.pages)
    doc_merged.text = doc_first.text + doc_second.text
    doc_merged.shard_info = doc_second.shard_info
    doc_merged.uri = doc_second.uri
    return doc_merged


### CASE -2
def same_doc_diff_parser_merger(
    doc_first: documentai.Document, doc_second: documentai.Document
) -> documentai.Document:
    """
    This function merges the entities of two documentai.Document object as one
    """

    doc_first.entities = list(doc_first.entities) + list(doc_second.entities)
    doc_first.uri = doc_second.uri
    doc_first.text = doc_second.text
    doc_first.pages = doc_second.pages
    doc_first.shard_info = doc_second.shard_info
    return doc_first


def iter_json_files(
    bucket_obj: storage.Bucket,
    input_bucket_files: List[str],
    file_dict: Dict[str, str],
    doc_merged: documentai.Document,
    MERGER_TYPE_FLAG: int = 1,
) -> documentai.Document:
    """
    It iterates through all JSON files and merges each file into the doc_merged parameter based on MERGER_TYPE_FLAG
    """

    func = (
        different_doc_merger if (MERGER_TYPE_FLAG == 1) else same_doc_diff_parser_merger
    )
    for file in input_bucket_files:
        print(file)
        doc_second = load_document_from_gcs(bucket_obj, file_dict[file])
        doc_merged = func(doc_merged, doc_second)
    return doc_merged


def delete_id(doc_merged: documentai.Document) -> documentai.Document:
    """
    It will assign empty string to id property of Entity object
    """

    for entity in doc_merged.entities:
        entity.id = ""
        for prop in entity.properties:
            prop.id = ""
    return doc_merged


def load_document_from_gcs(
    bucket_obj: storage.Bucket, filepath: str
) -> documentai.Document:
    """
    It loads a JSON file from the GCS filepath and returns a documentai.Document object
    """

    # download_as_string() is a deprecated alias of download_as_bytes()
    data_str = bucket_obj.blob(filepath).download_as_bytes().decode("utf-8")
    document = documentai.Document.from_json(data_str)
    return document


def merge_document_objects(
    MERGER_TYPE_FLAG: int,
    input_bucket: str,
    input_bucket_files: List[str],
    file_dict: Dict[str, str],
) -> documentai.Document:
    """
    It merges all JSON files from the GCS folder based on MERGER_TYPE_FLAG and returns the merged documentai.Document object
    """

    if MERGER_TYPE_FLAG not in (2, "2"):
        print("\t" * 5, "Using Default Merger")
        MERGER_TYPE_FLAG = 1
    else:
        print("\t" * 5, "Using Different Processor Result jsons merger")
        MERGER_TYPE_FLAG = 2
    storage_client = storage.Client()
    bucket_obj = storage_client.get_bucket(input_bucket)
    if len(input_bucket_files) < 2:
        raise AssertionError(
            "At least 2 JSON files are required to perform merging."
        )
    print(
        "....more than 2 JSON files detected....",
        "Process Started...",
        sep="\n",
    )
    doc_merged = documentai.Document()
    if int(MERGER_TYPE_FLAG) == 1:
        doc_merged = iter_json_files(
            bucket_obj, input_bucket_files, file_dict, doc_merged, MERGER_TYPE_FLAG=1
        )
    elif int(MERGER_TYPE_FLAG) == 2:
        doc_merged = iter_json_files(
            bucket_obj, input_bucket_files, file_dict, doc_merged, MERGER_TYPE_FLAG=2
        )
    return doc_merged


def upload_doc_obj_to_gcs(
    doc_merged: documentai.Document, output_bucket: str, merged_json_path: str
) -> None:
    """
    It converts the documentai.Document object to JSON and uploads it to the specified GCS path.
    """

    # The first positional argument of storage.Client is the project, not a bucket name
    storage_client = storage.Client()
    bucket_obj = storage_client.get_bucket(output_bucket)
    blob = bucket_obj.blob(merged_json_path)
    print(f"Uploading file to gs://{output_bucket}/{merged_json_path} ...")
    blob.upload_from_string(
        documentai.Document.to_json(
            doc_merged,
            use_integers_for_enums=False,
            including_default_value_fields=False,
        )
    )
    print(
        "Entities count After Merging - ",
        len(doc_merged.entities),
    )
    print(
        "Pages count After Merging - ",
        len(doc_merged.pages),
    )
    blob.content_type = "application/json"
    blob.update()
    print("Successfully uploaded Merged Document Object as JSON to GCS")


def main(
    INPUT_MULTIPLE_JSONS_URI: str,
    JSON_DIRECTORY_PATH_OUTPUT: str,
    OUTPUT_FILE_NAME: str,
    MERGER_TYPE_FLAG: int,
    PROJECT_ID: str = "",
) -> None:
    print("Merging JSON's tool started")
    input_bucket, input_files_dir = split_gcs_folder(INPUT_MULTIPLE_JSONS_URI)
    output_bucket, output_files_dir = split_gcs_folder(JSON_DIRECTORY_PATH_OUTPUT)
    output_files_dir = output_files_dir.strip("/")
    file_names_list, file_dict = file_names(input_bucket, input_files_dir)
    print(
        f"Pulling list of JSON files from source GCS Path - {INPUT_MULTIPLE_JSONS_URI}"
    )
    input_bucket_files = list_json_files(file_names_list)
    doc_merged = merge_document_objects(
        MERGER_TYPE_FLAG, input_bucket, input_bucket_files, file_dict
    )
    print("Merging process completed...")
    print("Deleting id under Entities & Properties of Document Object...")
    delete_id(doc_merged)
    merged_json_path = (
        (output_files_dir + "/" + OUTPUT_FILE_NAME)
        if output_files_dir
        else (OUTPUT_FILE_NAME)
    )
    upload_doc_obj_to_gcs(doc_merged, output_bucket, merged_json_path)
    print("Process Completed.")

In [12]:
main(
    INPUT_MULTIPLE_JSONS_URI,
    JSON_DIRECTORY_PATH_OUTPUT,
    OUTPUT_FILE_NAME,
    MERGER_TYPE_FLAG,
    PROJECT_ID,
)

Merging JSON's tool started
Pulling list of JSON files from source GCS Path - gs://siddam_bucket_test/cde_processor_test/test/
					 Using Default Merger
....more than 2 JSON files detected....
Process Started...
InsuranceCard-7.json
1   1
InsuranceCard_24.json
1   2
InsuranceCard_21.json
1   3
InsuranceCard_20.json
1   4
InsuranceCard_26.json
1   5
InsuranceCard_22.json
1   6
InsuranceCard_23.json
1   7
InsuranceCard_25.json
1   8
InsuranceCard-6.json
1   9
InsuranceCard-10.json
1   10
Merging process completed...
Deleting id under Entities & Properties of Document Object...
Uploading file to gs://siddam_bucket_test/cde_processor_test/merged_file.json ...
Entities count After Merging -  40
Pages count After Merging -  10
Successfully uploaded Merged Document Object as JSON to GCS
Process Completed.
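
The Case 1 merger in the script above shifts every text anchor of the second document by the length of the first document's text, and every page reference by the number of pages in the first document. The sketch below illustrates that idea on plain dictionaries instead of `documentai.Document` objects; `shift_segments`, `shift_page_refs`, and the sample texts are hypothetical and exist only for this illustration.

```python
from typing import Dict, List


def shift_segments(segments: List[Dict[str, int]], offset: int) -> None:
    """Shift start_index/end_index in place, in the spirit of assign_indexes."""
    for segment in segments:
        segment["start_index"] = int(segment.get("start_index", 0)) + offset
        segment["end_index"] = int(segment.get("end_index", 0)) + offset


def shift_page_refs(page_refs: List[Dict[str, int]], page_count: int) -> None:
    """Shift page references in place, in the spirit of assign_page_ref_page."""
    for page_ref in page_refs:
        page_ref["page"] = int(page_ref.get("page", 0)) + page_count


first_text = "Name: Alice\n"  # text of the first document (1 page)
second_text = "Name: Bob\n"  # text of the second document

# An entity from the second document, anchored to "Bob" on its first page.
bob_segments = [{"start_index": 6, "end_index": 9}]
bob_page_refs = [{"page": 0}]

shift_segments(bob_segments, len(first_text))
shift_page_refs(bob_page_refs, 1)

merged_text = first_text + second_text
start = bob_segments[0]["start_index"]
end = bob_segments[0]["end_index"]
anchored = merged_text[start:end]
print(anchored)  # Bob
print(bob_page_refs)  # [{'page': 1}]
```

After the shift, the anchor still selects the same characters, now located in the concatenated text, which is why the merged file remains internally consistent.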


## 4. Output 

The output of the tool is a **single JSON file**. Let's examine the output for each case type, using 3 JSON documents as input.

Consider the following 3 input JSON files residing in the input GCS bucket:

 * json_doc_merge / 0 / doc-0.json
 * json_doc_merge / 1 / doc-1.json
 * json_doc_merge / 2 / doc-2.json

Running the script for each case produces the output described below.

### CASE - 1 Output 
Suppose the three JSON files are from different documents (the parser used may be the same or different).

In Case 1, the page and entity counts of the output are the sums of the page and entity counts of the input files. The same applies to the text: the texts of the input files are concatenated and stored as a single value under the text key of the output file.

| Input json files | Screenshot highlighting the number of entities and number of pages in each of the input json files | The output single merged json file                         |
|:----------------:|----------------------------------------------------------------------------------------------------|------------------------------------------------------------|
|    **doc-0.json**    | ![](https://screenshot.googleplex.com/7Cn7bf5HKA62omx.png)                                         | ![](https://screenshot.googleplex.com/7zWP7zPZkLeZSra.png) |
|    **doc-1.json**    | ![](https://screenshot.googleplex.com/BMGMEcW3EFxWrRc.png)                                         |                                                            |
|    **doc-2.json**    | ![](https://screenshot.googleplex.com/3wCEqP9i3Bm9dqB.png)                                         |                                                            |

**For example:** if each JSON file has 2 pages and 21 entities, the merged output JSON has 6 pages and 63 entities.
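
The arithmetic of this example can be reproduced with a small sketch on plain dictionaries. This is an illustration under stated assumptions, not the script itself: `make_doc` and `merge_different_docs` are hypothetical stand-ins, and the merge starts from an empty document in the same way the script folds each file into `doc_merged`.

```python
from functools import reduce
from typing import Any, Dict

Doc = Dict[str, Any]


def make_doc(name: str, pages: int = 2, entities: int = 21) -> Doc:
    """Build a tiny stand-in for a Document AI JSON with the given counts."""
    return {
        "text": f"text of {name}\n",
        "pages": [{"page_number": i + 1} for i in range(pages)],
        "entities": [{"type": f"{name}-field-{i}"} for i in range(entities)],
    }


def merge_different_docs(first: Doc, second: Doc) -> Doc:
    """Case 1 on plain dicts: append pages, entities, and text of the second doc."""
    offset = len(first["pages"])
    renumbered = [
        {**page, "page_number": page["page_number"] + offset}
        for page in second["pages"]
    ]
    return {
        "text": first["text"] + second["text"],
        "pages": first["pages"] + renumbered,
        "entities": first["entities"] + second["entities"],
    }


empty: Doc = {"text": "", "pages": [], "entities": []}
docs = [make_doc(f"doc-{i}") for i in range(3)]
merged = reduce(merge_different_docs, docs, empty)

print(len(merged["pages"]), len(merged["entities"]))  # 6 63
print([page["page_number"] for page in merged["pages"]])  # [1, 2, 3, 4, 5, 6]
```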

### CASE - 2 Output 

Suppose the three JSON files are results from different parsers for a single document.

In Case 2, the page count remains the same and only the entity count increases when the input JSON files are merged.


| Input json files | Screenshot highlighting the number of entities and number of pages in each of the input json files | The output single merged json file                         |
|:----------------:|----------------------------------------------------------------------------------------------------|------------------------------------------------------------|
|    **doc-0.json**    | ![](https://screenshot.googleplex.com/ZofmvdULKVFvZ9w.png)                                         | ![](https://screenshot.googleplex.com/Bx2WNCxdcv3pN8p.png) |
|    **doc-1.json**    | ![](https://screenshot.googleplex.com/6fgDDEEtRaxNJ2N.png)                                         |                                                            |
|    **doc-2.json**    | ![](https://screenshot.googleplex.com/BwYcWwMuT6byLTm.png)                                         |                                                            |

**For example:** if each JSON file has 2 pages and 21 entities, the merged output JSON has 2 pages and 63 entities.
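
The same counts can be checked with a short sketch on plain dictionaries. It assumes, as the script's Case 2 merger does, that text and pages are taken from the most recently merged file while entities accumulate; `make_parser_output` and `merge_same_doc` are hypothetical names used only here.

```python
from functools import reduce
from typing import Any, Dict

Doc = Dict[str, Any]

SHARED_TEXT = "text of the single source document\n"
SHARED_PAGES = [{"page_number": 1}, {"page_number": 2}]


def make_parser_output(parser: str, entities: int = 21) -> Doc:
    """Stand-in for one parser's result on the same two-page document."""
    return {
        "text": SHARED_TEXT,
        "pages": list(SHARED_PAGES),
        "entities": [{"type": f"{parser}-field-{i}"} for i in range(entities)],
    }


def merge_same_doc(first: Doc, second: Doc) -> Doc:
    """Case 2 on plain dicts: keep the second doc's text and pages, pool entities."""
    return {
        "text": second["text"],
        "pages": second["pages"],
        "entities": first["entities"] + second["entities"],
    }


empty: Doc = {"text": "", "pages": [], "entities": []}
outputs = [make_parser_output(f"parser-{i}") for i in range(3)]
merged_same = reduce(merge_same_doc, outputs, empty)

print(len(merged_same["pages"]), len(merged_same["entities"]))  # 2 63
```

Because text and pages are not concatenated in this mode, no index or page-number shifting is needed.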
