# Date Entity Normalization


* Author: docai-incubator@google.com


## Disclaimer

This tool is not supported by the Google engineering team or product team. It is provided and supported on a best-effort basis by the DocAI Incubator Team. No guarantees of performance are implied.

## Purpose and Description

This tool updates the normalized values of date entities in Document AI JSON output. It uses a heuristic to identify the date format actually used in the document, such as MM/DD/YYYY or DD/MM/YYYY. Once the format is identified, the tool rewrites the normalized value of each date entity in ISO 8601 form (YYYY-MM-DD) so that all dates in the JSON follow one consistent format.
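
The need for a heuristic can be seen with a small sketch. It assumes only the two slash-separated formats named above, and `matching_formats` is a hypothetical helper written for illustration, not part of the tool. A date whose day is greater than 12 parses under one format only, while a date such as `03/04/2023` parses under both and cannot be resolved from the string alone.

```python
from datetime import datetime
from typing import List

# The two candidate formats discussed above (DD/MM/YYYY and MM/DD/YYYY).
CANDIDATE_FORMATS = ["%d/%m/%Y", "%m/%d/%Y"]


def matching_formats(text: str) -> List[str]:
    """Return every candidate format under which `text` parses as a valid date."""
    matches = []
    for fmt in CANDIDATE_FORMATS:
        try:
            datetime.strptime(text, fmt)
            matches.append(fmt)
        except ValueError:
            # Not a valid date under this format; try the next one.
            pass
    return matches


print(matching_formats("25/04/2023"))  # only the day-first format parses
print(matching_formats("03/04/2023"))  # both formats parse: ambiguous
```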

## Prerequisites

1. Vertex AI Notebook or Google Colab
2. GCS bucket that holds the input JSON files and receives the output JSON files


## Step-by-Step Procedure

### 1. Install the required libraries

In [None]:
%pip install google-cloud-storage
%pip install google-cloud-documentai

In [None]:
!wget https://raw.githubusercontent.com/GoogleCloudPlatform/document-ai-samples/main/incubator-tools/best-practices/utilities/utilities.py

### 2. Import the required libraries/Packages

In [None]:
import json
import os
import re
from datetime import datetime
from tqdm import tqdm
from pathlib import Path
from typing import Dict, List, Union, Optional, Tuple
from google.cloud import documentai_v1beta3 as documentai
from google.cloud import storage
from utilities import (
    file_names,
    documentai_json_proto_downloader,
    store_document_as_json,
)
from pprint import pprint

### 3. Input Details

<ul>
    <li><b>input_path:</b> GCS URI of the folder that contains the Document AI processed output JSON files. These files are the input to this tool.</li>
    <li><b>output_path:</b> GCS URI of the folder where the updated output JSON files will be stored.</li>
</ul>

In [None]:
input_path = (
    "gs://{bucket_name}/{folder_path}"  # Folder with the Document AI JSON files to process.
)
# Folder where the updated JSON files are saved. End the path with "/":
# each file name is appended directly to this prefix.
output_path = "gs://{bucket_name}/{folder_path}/"
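
As a hedged illustration of how a `gs://` URI breaks down into a bucket name and an object prefix, the sketch below uses a hypothetical helper, `split_gcs_uri`, that is not part of `utilities.py`. It also appends a trailing `/` to a non-empty prefix, so that joining the prefix and a file name yields a path inside the folder.

```python
from typing import Tuple


def split_gcs_uri(gcs_uri: str) -> Tuple[str, str]:
    """Split 'gs://bucket/some/folder' into ('bucket', 'some/folder/')."""
    scheme = "gs://"
    if not gcs_uri.startswith(scheme):
        raise ValueError(f"Not a GCS URI: {gcs_uri!r}")
    bucket, _, prefix = gcs_uri[len(scheme):].partition("/")
    if not bucket:
        raise ValueError(f"Missing bucket name in: {gcs_uri!r}")
    # A non-empty prefix must end with "/" so that prefix + file_name
    # points inside the folder rather than at a sibling object.
    if prefix and not prefix.endswith("/"):
        prefix += "/"
    return bucket, prefix


# "my-bucket" and "docai/output" are placeholder names for this example.
bucket_name, path_prefix = split_gcs_uri("gs://my-bucket/docai/output")
print(bucket_name, path_prefix)
```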

### 4. Execute the code

In [None]:
input_storage_bucket_name = input_path.split("/")[2]
input_bucket_path_prefix = "/".join(input_path.split("/")[3:])
output_storage_bucket_name = output_path.split("/")[2]
output_bucket_path_prefix = "/".join(output_path.split("/")[3:])


def identify_and_convert_date_format(
    mention_text: str, known_format: Optional[str] = None
) -> Tuple[Optional[datetime], str]:
    """
    This function attempts to identify and convert a date string to a datetime object.

    Args:
      mention_text: The text string potentially containing a date.
      known_format: (Optional) A specific date format string to try first (e.g., "%Y-%m-%d").

    Returns:
      A tuple containing two elements:
          - The converted datetime object (or None if not successful).
          - The identified date format string (or "N/A" if not found).
    """

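    # Day-first is tried before month-first, so an ambiguous string such as
    # "03/04/2023" resolves to 3 April unless known_format says otherwise.
    formats = ["%d/%m/%Y", "%m/%d/%Y"]
    if known_format:
        # Try the caller-supplied format first, without listing it twice.
        formats = [known_format] + [f for f in formats if f != known_format]

    # Surrounding whitespace from OCR would otherwise make strptime fail.
    text = mention_text.strip()
    for fmt in formats:
        try:
            return datetime.strptime(text, fmt), fmt
        except ValueError:
            continue
    return None, "N/A"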


def process_json_files(
    list_of_files: List[str],
    input_storage_bucket_name: str,
    output_storage_bucket_name: str,
    output_bucket_path_prefix: str,
) -> None:
    """
    Processes a list of JSON files, converting dates within entities to ISO 8601 format and storing the updated JSON data in a specified output bucket.

    Args:
        list_of_files: List of file paths for the JSON files to process (type: List[str]).
        input_storage_bucket_name: Name of the input storage bucket (type: str).
        output_storage_bucket_name: Name of the output storage bucket (type: str).
        output_bucket_path_prefix: Prefix for the output file paths (type: str).

    Returns:
        None
    """
    for file_path in tqdm(list_of_files):
        print("***************")
        file_name = file_path.split("/")[-1]
        print(f"File Name {file_name}")
        json_proto_data = documentai_json_proto_downloader(
            input_storage_bucket_name, file_path
        )
        # Convert the document to a dict once per file. Converting inside the
        # entity loop would rebuild it from the proto each time and discard
        # the updates already made to earlier date entities.
        json_data = json.loads(documentai.Document.to_json(json_proto_data))

        for ind, ent in enumerate(json_proto_data.entities):
            if "date" in ent.type:
                print("---------------")
                mention_text = ent.mention_text
                print(f"Type: {ent.type}")
                print(f"Mention Text: {mention_text}")
                print(f"Old Normalized Value: {ent.normalized_value}")

                date_obj, identified_format = identify_and_convert_date_format(
                    mention_text
                )

                ent_json = json_data["entities"][ind]

                if date_obj is None:
                    # Retry with a format already identified on another date
                    # entity in this document, if there is one. The type check
                    # matches the "date" substring test used above.
                    known_format = next(
                        (
                            e["identified_format"]
                            for e in json_data["entities"]
                            if "date" in e.get("type", "")
                            and "identified_format" in e
                        ),
                        None,
                    )
                    if known_format:
                        date_obj, identified_format = identify_and_convert_date_format(
                            mention_text, known_format=known_format
                        )

                if date_obj:
                    # normalizedValue may be absent from the JSON when the
                    # processor did not populate it, so create it if needed.
                    normalized_json = ent_json.setdefault("normalizedValue", {})
                    normalized_json["text"] = date_obj.strftime("%Y-%m-%d")
                    normalized_json["dateValue"] = {
                        "day": date_obj.day,
                        "month": date_obj.month,
                        "year": date_obj.year,
                    }
                    ent_json["identified_format"] = identified_format

        output_file_name = f"{output_bucket_path_prefix}{file_name}"
        store_document_as_json(
            json.dumps(json_data), output_storage_bucket_name, output_file_name
        )

    print("--------------------")
    print("All files processed.")


json_files = file_names(input_path)[1].values()
list_of_files = [i for i in list(json_files) if i.endswith(".json")]
process_json_files(
    list_of_files,
    input_storage_bucket_name,
    output_storage_bucket_name,
    output_bucket_path_prefix,
)
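
The format identification in step 4 tries day-first before month-first for each entity on its own. One possible refinement, shown here only as a sketch and not as the tool's implementation, is to let the unambiguous dates in a document vote on the format and then apply the winner to the ambiguous ones. The names `parses_as` and `infer_document_format` are hypothetical, and the sample mention texts are made up for illustration.

```python
from collections import Counter
from datetime import datetime
from typing import List, Optional

CANDIDATE_FORMATS = ["%d/%m/%Y", "%m/%d/%Y"]


def parses_as(text: str, fmt: str) -> bool:
    """Return True if `text` is a valid date under `fmt`."""
    try:
        datetime.strptime(text.strip(), fmt)
        return True
    except ValueError:
        return False


def infer_document_format(mention_texts: List[str]) -> Optional[str]:
    """Return the format backed by the unambiguous dates, or None if undecided."""
    votes = Counter()
    for text in mention_texts:
        matches = [fmt for fmt in CANDIDATE_FORMATS if parses_as(text, fmt)]
        if len(matches) == 1:  # only unambiguous dates get a vote
            votes[matches[0]] += 1
    ranked = votes.most_common(2)
    if not ranked:
        return None  # every date was ambiguous or unparseable
    if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
        return None  # the evidence is split evenly between both formats
    return ranked[0][0]


# Made-up mention texts: only "25/04/2023" is unambiguous (day-first).
mentions = ["03/04/2023", "25/04/2023", "01/05/2023"]
doc_format = infer_document_format(mentions)
print(doc_format)
if doc_format:
    print([datetime.strptime(m, doc_format).strftime("%Y-%m-%d") for m in mentions])
```

When the function returns `None`, the document gives no evidence either way, and a caller would need a default or a manual review step.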

### 5. Output

The post-processed JSON files can be found in the storage path supplied as <code>output_path</code> during script execution. <br><hr>
<b>Comparison Between Input and Output File</b><br><br>
<i><h4>Post-processing results</h4></i><br>
Running the post-processing script against the input data produces the output JSON data. The following image shows the difference in the date format of the date field between input and output. <br>
    

<img src= "./images/output_image1.png" width=800 height=400>
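
The same comparison can be made programmatically. The sketch below is illustrative only: `changed_date_values` is a hypothetical helper, the two dictionaries are made-up samples trimmed to the relevant fields, and the camelCase keys (`mentionText`, `normalizedValue`) are assumed to match the JSON produced by `documentai.Document.to_json`.

```python
from typing import Dict, List, Tuple


def changed_date_values(before: Dict, after: Dict) -> List[Tuple[str, str, str, str]]:
    """List (type, mentionText, old_text, new_text) for changed date entities."""
    changes = []
    # Entities are compared by position, assuming the tool keeps their order.
    for old_ent, new_ent in zip(before.get("entities", []), after.get("entities", [])):
        if "date" not in new_ent.get("type", ""):
            continue
        old_text = old_ent.get("normalizedValue", {}).get("text", "")
        new_text = new_ent.get("normalizedValue", {}).get("text", "")
        if old_text != new_text:
            changes.append(
                (new_ent.get("type", ""), new_ent.get("mentionText", ""), old_text, new_text)
            )
    return changes


# Made-up sample: the same entity before and after normalization.
before = {
    "entities": [
        {
            "type": "invoice_date",
            "mentionText": "03/04/2023",
            "normalizedValue": {"text": "2023-03-04"},
        }
    ]
}
after = {
    "entities": [
        {
            "type": "invoice_date",
            "mentionText": "03/04/2023",
            "normalizedValue": {"text": "2023-04-03"},
        }
    ]
}

for change in changed_date_values(before, after):
    print(change)
```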