# DocAI - Script for Removing Empty Bounding Boxes

* Author: docai-incubator@google.com

## Disclaimer

This tool is not supported by the Google engineering team or product team. It is provided and supported on a best-effort basis by the DocAI Incubator Team. No guarantees of performance are implied.

## Purpose of the Script


This document provides instructions and a Python script for removing empty bounding boxes from labeled JSON files. The script removes entities whose `mentionText` is empty, along with child entities (properties) that have no `mentionText` or no `textAnchor` segments. Removing these empty labels keeps the labeled data clean and consistent.
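To make "empty" concrete, here is a minimal sketch of one possible check on the exported JSON form of an entity. It assumes the camelCase keys `mentionText`, `textAnchor`, and `textSegments` used in exported Document JSON. The helper name `is_empty_entity` and the sample values are hypothetical and are not part of the script below.

```python
def is_empty_entity(entity: dict) -> bool:
    """Return True when an entity has no text or no text anchor segments."""
    text = entity.get("mentionText", "")
    segments = entity.get("textAnchor", {}).get("textSegments", [])
    return not text.strip() or not segments


# Hypothetical sample entities, shaped like entries of a Document JSON file.
sample_entities = [
    {
        "type": "invoice_id",
        "mentionText": "INV-001",
        "textAnchor": {"textSegments": [{"startIndex": "10", "endIndex": "17"}]},
    },
    {"type": "total_amount", "mentionText": "", "textAnchor": {}},
    {"type": "supplier_name", "mentionText": "   "},
]

empty_types = [e["type"] for e in sample_entities if is_empty_entity(e)]
print(empty_types)  # ['total_amount', 'supplier_name']
```

Whitespace-only text is treated as empty here, which is a design choice of this sketch.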


## Prerequisites

1. Python in a Jupyter notebook (Vertex AI).
2. A service account with permission to read from and write to the Cloud Storage bucket in the project.

## Installation Procedure

The script consists of Python code. To load and run it:
1.  Upload the IPYNB file or copy the code to the Vertex Notebook and follow the operation procedure. \
**NOTE:** Do not run the script against a processor dataset path. Export the dataset to JSON first, then use the bucket holding the export as the input.

## Operation Procedure

### 1. Import the modules

**Note:** The script uses external modules. If they are not already installed, uncomment and run the install line(s) at the top of the first cell.

In [1]:
# !pip install gcsfs google-cloud-storage google-cloud-documentai pandas tqdm

import gcsfs
import google.auth
import pandas as pd
from google.cloud import storage
from tqdm import tqdm
from google.cloud import documentai_v1beta3 as documentai

### 2. Setup the required inputs

* **PROJECT_ID** - Your Google Cloud project ID
* **BUCKET_NAME** - Name of the bucket
* **INPUT_FOLDER_PATH** - The path of the folder containing the JSON files to be processed, without the bucket name.
* **OUTPUT_FOLDER_PATH** - The path of the folder where the processed JSON files are stored, without the bucket name.

**Note:** The input and output paths must be in the same bucket.
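The script derives each output object name from the input object name by dropping the bucket name and replacing the input folder with the output folder. A minimal sketch of that mapping follows. The function name `to_output_path` and the bucket, folder, and file names are hypothetical.

```python
def to_output_path(blob_path: str, input_folder: str, output_folder: str) -> str:
    """Drop the bucket name and swap the input folder for the output folder."""
    path_in_bucket = blob_path.split("/", 1)[-1]
    return path_in_bucket.replace(input_folder, output_folder)


# Hypothetical bucket, folders, and file name, for illustration only.
result = to_output_path(
    "my-bucket/exported/labeled/doc_1.json", "exported/labeled", "cleaned"
)
print(result)  # cleaned/doc_1.json
```

Because the mapping is a plain string replacement, a trailing slash on only one of the two folder paths ends up in the object name (for example `cleaned//doc_1.json`), so it helps to write both paths in the same style.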

In [2]:
PROJECT_ID = "rand-automl-project"
BUCKET_NAME = "accenture_line_items_samples"
INPUT_FOLDER_PATH = "output/output/2839778604252110189/0"  # Path without bucket name
OUTPUT_FOLDER_PATH = "output_atul/output/"  # Path without bucket name
credentials, _ = google.auth.default()
fs = gcsfs.GCSFileSystem(project=PROJECT_ID, token=credentials)

### 3. Execute the code

In [12]:
def get_file(file_path: str) -> documentai.Document:
    """
    To read files from cloud storage.
    """
    file_object = fs.cat(file_path)
    doc = documentai.Document.from_json(file_object)  # JSON to DocumentProto Format
    return doc


def store_blob(document, file: str):
    """
    Store files in cloud storage.
    """
    storage_client = storage.Client()
    result_bucket = storage_client.get_bucket(BUCKET_NAME)
    document_blob = storage.Blob(name=str(file), bucket=result_bucket)
    document_blob.upload_from_string(
        documentai.Document.to_json(document), content_type="application/json"
    )


def main():
    files = [
        i for i in fs.find(f"{BUCKET_NAME}/{INPUT_FOLDER_PATH}") if i.endswith(".json")
    ]
    print("No. of files : ", len(files))

    removed = {}  # file name -> types of the entities removed from that file
    for file_path in tqdm(files):
        file_name = file_path.split("/", 1)[-1]
        output_file_name = file_name.replace(INPUT_FOLDER_PATH, OUTPUT_FOLDER_PATH)
        doc = get_file(file_path)
        removed_types = []
        if not doc.entities:
            print("Entities missing : ", file_path)
        # Walk the lists in reverse so that deleting an item does not shift
        # the indices of the items that are still to be visited.
        for i in reversed(range(len(doc.entities))):
            entity = doc.entities[i]
            # Parent entity: remove it when its text is empty or whitespace.
            if not entity.mention_text.strip():
                removed_types.append(entity.type_)
                del doc.entities[i]
                continue
            # Child entities: remove those without text or text segments.
            for j in reversed(range(len(entity.properties))):
                sub_entity = entity.properties[j]
                if (
                    not sub_entity.mention_text.strip()
                    or not sub_entity.text_anchor.text_segments
                ):
                    removed_types.append(sub_entity.type_)
                    del entity.properties[j]
        store_blob(doc, output_file_name)
        if removed_types:
            removed[file_name] = pd.Series(removed_types)
    # One column per file, listing the types of the entities removed from it.
    logs = pd.DataFrame(removed)
    logs.to_csv("output.csv", index=False)


main()

No. of files :  1


100%|██████████| 1/1 [00:00<00:00,  1.62it/s]
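The removal step in `main()` deletes entries from a list while scanning it, and the direction of the scan matters. The following plain-list sketch uses hypothetical string values in place of entities to show that a forward scan can skip items, while a reverse scan by index visits every item.

```python
def remove_blank_forward(items: list) -> None:
    """Delete blank strings while iterating forward (can skip neighbours)."""
    for value in items:
        if not value.strip():
            items.remove(value)


def remove_blank_reverse(items: list) -> None:
    """Delete blank strings by index, walking the list from the end."""
    for i in reversed(range(len(items))):
        if not items[i].strip():
            del items[i]


forward = ["a", "", " ", "b", ""]
remove_blank_forward(forward)
print(forward)  # ['a', ' ', 'b']  (the blank " " was skipped)

backward = ["a", "", " ", "b", ""]
remove_blank_reverse(backward)
print(backward)  # ['a', 'b']
```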


## Output File

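The script removes the empty bounding boxes (entities) from each JSON file and writes the cleaned file to `OUTPUT_FOLDER_PATH` in the same bucket. If the output path is the same as the input path, the original file is overwritten. The script also creates a CSV file, `output.csv`, that lists the types of the removed entities, with one column per file.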
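As a rough sketch of how the CSV log could be summarized, the snippet below counts the removed entities per file. It assumes the log has one column per file with entity types in the rows and empty cells where a file has fewer removals. The helper name `count_removed` and the sample contents are hypothetical.

```python
import csv
import io


def count_removed(csv_text: str) -> dict:
    """Count the non-empty cells in each column of the log."""
    reader = csv.DictReader(io.StringIO(csv_text))
    counts = {name: 0 for name in reader.fieldnames or []}
    for row in reader:
        for name, value in row.items():
            if value:
                counts[name] += 1
    return counts


# Hypothetical log contents: two files, with two and one removed entities.
sample_log = (
    "folder/a.json,folder/b.json\n"
    "total_amount,invoice_id\n"
    "supplier_name,\n"
)
summary = count_removed(sample_log)
print(summary)  # {'folder/a.json': 2, 'folder/b.json': 1}
```

To use it on the real log, pass the text of `output.csv`, for example `count_removed(open("output.csv").read())`.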