# DocAI JSON to Canonical Doc JSON conversion

* Author: docai-incubator@google.com

## Disclaimer

This tool is not supported by the Google engineering team or product team. It is provided and supported on a best-effort basis by the DocAI Incubator Team. No guarantees of performance are implied.

## Purpose and Description
<div><span style="background-color:#f5f569;font-weight:800" ><i><b>Note:</b> This feature is in Preview with allowlist. To turn on this feature, contact your Google account team.</i></span></div>


A parsed unstructured document (a Canonical Doc JSON) is represented by JSON that describes the document as a sequence of text, table, and list blocks. You import canonical JSON files with your parsed unstructured document data in the same way that you import other types of unstructured documents, such as PDFs. When this feature is turned on, any uploaded JSON file that is identified by either an `application/json` MIME type or a `.JSON` extension is treated as a parsed document.

**Canonical JSON:** Canonical Doc JSONs are a JSON representation of parsed unstructured documents. They use a sequence of text, table, and list blocks to describe the document's structure.

The cell below shows the **Canonical JSON schema** together with the metadata schema.  
Change `metadata_schema` to suit your needs. For example, to add file information, add a key-value pair to the `structData` key of the `metadata_schema` object.

In [None]:
# Predefined JSON structures: a canonical document template and a metadata template

conanical_schema = {
    "title": "Some Title",
    "blocks": [
        {
            "textBlock": {"text": "Some PARAGRAPH 1", "type": "PARAGRAPH"},
            "pageNumber": 1,
        }
    ],
}


metadata_schema = {
    "id": "your_random_id",
    "structData": {
        "Title": "File Title",
        "Description": "Your file description",
        "Source_url": "https://storage.mtls.cloud.google.com/",
    },
    "content": {"mimeType": "application/json", "uri": "gs://"},
}
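
As a minimal sketch of the customization described above, the snippet below copies the metadata template and adds extra entries under `structData` without mutating the template itself. The helper name `with_extra_struct_data` and the keys `File_name` and `Page_count` are hypothetical and only illustrate the pattern; the template is repeated here so the block runs on its own.

```python
from copy import deepcopy

# Template repeated from the cell above so this sketch is self-contained.
metadata_template = {
    "id": "your_random_id",
    "structData": {
        "Title": "File Title",
        "Description": "Your file description",
        "Source_url": "https://storage.mtls.cloud.google.com/",
    },
    "content": {"mimeType": "application/json", "uri": "gs://"},
}


def with_extra_struct_data(template: dict, extra: dict) -> dict:
    """Return a copy of the template with extra key-value pairs in structData.

    Hypothetical helper, not part of the original notebook.
    """
    result = deepcopy(template)  # deep copy so the nested dicts are not shared
    result["structData"].update(extra)
    return result


# The key names below are illustrative assumptions, not required fields.
custom_metadata = with_extra_struct_data(
    metadata_template, {"File_name": "invoice_001.pdf", "Page_count": 3}
)
print(custom_metadata["structData"])
```

Working on a deep copy keeps the shared template unchanged, so it can be reused for every file.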

## Prerequisites

1. Vertex AI Notebook
2. Parsed JSON files (Document AI processor output) in GCS folders.
3. The corresponding PDF files in GCS folders.

## Step by Step procedure 

### 1. Import Modules/Packages

In [None]:
!wget https://raw.githubusercontent.com/GoogleCloudPlatform/document-ai-samples/main/incubator-tools/best-practices/utilities/utilities.py

### 2. Input Details

* <b>GCS_INPUT_PATH :</b> GCS path for the input files. It should contain the JSON files output by the DocAI processor and the PDFs that the processor parsed, with each PDF having the same name as its JSON file.

The input bucket structure should look like this:
<img src="./Images/input_sample_image.png" width=1200 height=400 alt="input bucket sample image">
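
Because the metadata links each converted JSON file to a PDF with the same name, it can help to check the pairing before running the conversion. The following is a minimal sketch that works on plain lists of blob names; the function name `find_unpaired_json` and the sample paths are hypothetical. It derives the base name the same way the script below does (everything before the first dot of the file name).

```python
from typing import List


def base_name(blob_name: str) -> str:
    """File name without folders or extension, split at the first dot."""
    return blob_name.split("/")[-1].split(".")[0]


def find_unpaired_json(json_blobs: List[str], pdf_blobs: List[str]) -> List[str]:
    """Return the JSON blobs that have no PDF with the same base name.

    Hypothetical helper, not part of the original notebook.
    """
    pdf_names = {base_name(b) for b in pdf_blobs if b.lower().endswith(".pdf")}
    return [
        b
        for b in json_blobs
        if b.lower().endswith(".json") and base_name(b) not in pdf_names
    ]


# Sample blob names are assumptions used only for illustration.
json_blobs = ["in/processor_output/a.json", "in/processor_output/b.json"]
pdf_blobs = ["in/pdfs/a.pdf"]
missing = find_unpaired_json(json_blobs, pdf_blobs)
print(missing)
```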

In [None]:
# Follow this folder layout so that the metadata is created successfully.

# GCS_INPUT_PATH must contain folders with the following structure:
#
# gs://path/to/input/folder
#   ├──/processor_output/    Folder with the parsed JSON files from the processor.
#   └──/pdfs/                Folder with the parsed PDFs, named the same as their JSON files.

GCS_INPUT_PATH = "gs://{bucket_name}/{folder_path}"

input_bucket_name = GCS_INPUT_PATH.split("/")[2]
input_prefix_path = "/".join(GCS_INPUT_PATH.split("/")[3:])
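
The two `split` calls above assume a well-formed `gs://bucket/prefix` value. As a sketch under that assumption, the same parsing can be wrapped in a small function that also rejects malformed paths. The name `split_gcs_path` is hypothetical, and unlike the cell above it strips a trailing slash from the prefix.

```python
from typing import Tuple


def split_gcs_path(gcs_path: str) -> Tuple[str, str]:
    """Split 'gs://bucket/some/prefix' into ('bucket', 'some/prefix').

    Hypothetical helper, not part of the original notebook.
    """
    scheme = "gs://"
    if not gcs_path.startswith(scheme):
        raise ValueError(f"Expected a path starting with {scheme!r}: {gcs_path!r}")
    # Everything before the first "/" is the bucket; the rest is the prefix.
    bucket, _, prefix = gcs_path[len(scheme):].partition("/")
    if not bucket:
        raise ValueError(f"No bucket name found in {gcs_path!r}")
    return bucket, prefix.strip("/")


bucket_name, prefix_path = split_gcs_path("gs://my-bucket/path/to/input/")
print(bucket_name, prefix_path)
```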

### 3. Run the script

In [None]:
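from copy import deepcopy
import json
from typing import Dict
import uuid

from google.cloud import documentai_v1beta3 as documentai

# Helper functions from the utilities.py file downloaded in step 1.
from utilities import (
    documentai_json_proto_downloader,
    file_names,
    store_document_as_json,
)


def convert_doc_object_to_conanical_object(
    doc_object: documentai.Document, file_path: str
) -> Dict:
    """
    Convert a Document AI object to a canonical object that is compatible with Vertex AI Search.

    Parameters
    ----------
    doc_object : documentai.Document
        The Document AI object loaded from the input file provided by the user.

    file_path : str
        The GCS file path of the JSON file.

    Returns
    -------
    Dict
        Returns the converted canonical object.
    """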

    conanical_obj = {}
    conanical_obj["blocks"] = list()
    conanical_schema_c = deepcopy(conanical_schema)
    file_name = file_path.split("/")[-1].split(".")[0]
    OCR_text = doc_object.text

    conanical_schema_c["title"] = file_name
    conanical_obj["title"] = conanical_schema_c["title"]

    # Loop through all pages; each page becomes one text block.
    for page in doc_object.pages:
        page_number = page.page_number
        conanical_schema_c["blocks"][0]["pageNumber"] = page_number

        # Take the OCR text from the start of the first paragraph to the
        # end of the last paragraph on the page.
        paragraphs = page.paragraphs
        if paragraphs:
            first_paragraph = True
            for paragraph in paragraphs:
                text_segment = paragraph.layout.text_anchor.text_segments[0]
                if first_paragraph:
                    page_starting_index = text_segment.start_index
                    first_paragraph = False
                last_paragraph_index = text_segment.end_index
            page_text = OCR_text[page_starting_index:last_paragraph_index]
            conanical_schema_c["blocks"][0]["textBlock"]["text"] = page_text
            conanical_obj["blocks"].append(deepcopy(conanical_schema_c["blocks"][0]))
    return conanical_obj


def create_metadata(metadata: str, output_file: str) -> str:
    """
    Append one metadata record (PDF location, content type, converted JSON location,
    and any other user-defined fields) to the metadata string.

    Parameters
    ----------
    metadata : str
        The metadata string built so far, to which the new record is appended.

    output_file : str
        The GCS path of the Canonical JSON file.

    Returns
    -------
    str
        Returns the updated metadata string with the new file's record appended as one JSON line.
    """

    metadata_json_copy = deepcopy(metadata_schema)
    file_path = output_file.replace("gs://", "")
    pdf_file_path = (
        GCS_INPUT_PATH + "/pdfs/" + file_path.split("/")[-1].split(".")[0] + ".pdf"
    )
    pdf_file_path = pdf_file_path.replace("gs://", "")
    metadata_json_copy["id"] = str(uuid.uuid4())
    metadata_json_copy["content"]["uri"] += file_path
    metadata_json_copy["structData"]["Source_url"] += pdf_file_path
    metadata_json_copy["structData"]["Title"] = file_path.split("/")[-1].split(".")[0]
    metadata += json.dumps(metadata_json_copy) + "\n"
    return metadata


output_files_for_metadata = []
file_name_list = [
    i
    for i in list(file_names(f"{GCS_INPUT_PATH}/processor_output")[1].values())
    if i.endswith(".json")
]

print("Converting files ...")
for file_name in file_name_list:
    try:
        document = documentai_json_proto_downloader(input_bucket_name, file_name)
        conanical_object = convert_doc_object_to_conanical_object(document, file_name)

    except Exception as e:
        print(f"[x] {input_bucket_name}/{file_name} || Error : {str(e)}")
        continue
    output_file_name = f"{input_prefix_path}/output/{file_name.split('/')[-1]}"
    output_files_for_metadata.append(f"gs://{input_bucket_name}/{output_file_name}")
    store_document_as_json(
        json.dumps(conanical_object), input_bucket_name, output_file_name
    )
    print(f"[✓] {input_bucket_name}/{output_file_name}")

print("\n\nCreating metadata file ...")
metadata_str = ""
for gcs_output_file in output_files_for_metadata:
    metadata_str = create_metadata(metadata_str, gcs_output_file)

store_document_as_json(
    metadata_str, input_bucket_name, input_prefix_path + "/metadata/" + "metadata.jsonl"
)
print(
    "Metadata created & stored in",
    f"gs://{input_bucket_name}/{input_prefix_path}/metadata/",
)

### 4. Output

After conversion, the script stores the Canonical JSON files in the `output` folder inside the GCS_INPUT_PATH folder. The text of each page is stored in its own text block, together with the page number.<br>
<img src="./Images/conanical_json_output_1.png" width=1200 height=400 alt="conanical json output image">
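
The page-to-block mapping described above can be illustrated without Document AI objects. The sketch below is an assumption-laden stand-in: each page is a plain list of `(start_index, end_index)` paragraph spans into the full OCR text, and the function name `pages_to_blocks` is hypothetical. As in the script, the block text runs from the start of the first paragraph to the end of the last one, and pages without paragraphs produce no block.

```python
from typing import Dict, List, Tuple

Span = Tuple[int, int]  # (start_index, end_index) into the full OCR text


def pages_to_blocks(full_text: str, pages: List[List[Span]]) -> List[Dict]:
    """Build one canonical text block per page from paragraph spans.

    Hypothetical helper that mimics the script's slicing on plain data.
    """
    blocks = []
    for page_number, paragraph_spans in enumerate(pages, start=1):
        if not paragraph_spans:
            continue  # no paragraphs on this page, so no block is emitted
        start = paragraph_spans[0][0]  # start of the first paragraph
        end = paragraph_spans[-1][1]  # end of the last paragraph
        blocks.append(
            {
                "textBlock": {"text": full_text[start:end], "type": "PARAGRAPH"},
                "pageNumber": page_number,
            }
        )
    return blocks


# Illustrative text and spans; real spans come from the processor output.
full_text = "Hello world.\nSecond paragraph.\nPage two text.\n"
pages = [[(0, 13), (13, 31)], [(31, 46)]]
canonical = {"title": "sample", "blocks": pages_to_blocks(full_text, pages)}
print(canonical)
```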

Finally, the script creates a metadata file, `metadata/metadata.jsonl`, containing the GCS location of each converted Canonical JSON file along with the authenticated link to the corresponding PDF.
<img src="./Images/conanical_json_output_2.png" width=800 height=400 alt="conanical json output image"><br><hr>
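
The metadata file is in JSON Lines format: one JSON object per line, one line per converted file. As a rough sketch of that layout, the snippet below builds two records and parses them back to check that every line is valid JSON. The function name `metadata_line`, the bucket name, and the file paths are hypothetical; the field names follow the `metadata_schema` shown earlier.

```python
import json
import uuid


def metadata_line(canonical_json_uri: str, pdf_path: str, title: str) -> str:
    """Return one JSONL record for a converted file.

    Hypothetical helper; pdf_path is 'bucket/path/file.pdf' without 'gs://'.
    """
    record = {
        "id": str(uuid.uuid4()),  # random unique id per document
        "structData": {
            "Title": title,
            "Description": "Your file description",
            "Source_url": "https://storage.mtls.cloud.google.com/" + pdf_path,
        },
        "content": {"mimeType": "application/json", "uri": canonical_json_uri},
    }
    return json.dumps(record)


# Sample locations are assumptions used only for illustration.
lines = [
    metadata_line("gs://my-bucket/input/output/a.json", "my-bucket/input/pdfs/a.pdf", "a"),
    metadata_line("gs://my-bucket/input/output/b.json", "my-bucket/input/pdfs/b.pdf", "b"),
]
metadata_jsonl = "\n".join(lines) + "\n"

# Each line must parse on its own for the file to be valid JSONL.
records = [json.loads(line) for line in metadata_jsonl.splitlines()]
print(len(records), records[0]["content"]["uri"])
```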

Import the metadata file into a Vertex AI Search and Conversation data store, and verify that the files were imported.<br>
<img src="./Images/conanical_json_output_3.png" width=800 height=400 alt="conanical json output image"><br>
<img src="./Images/conanical_json_output_4.png" width=800 height=400 alt="conanical json output image">