# Normalize Date Value 19xx to 20xx

* Author: docai-incubator@google.com

## Disclaimer

This tool is not supported by the Google engineering team or product team. It is provided and supported on a best-effort basis by the **DocAI Incubator Team**. No guarantees of performance are implied.

# Objective
This is a post-processing tool that normalizes the year in date entities from 19xx to 20xx. Document AI processors populate a `normalized_value` attribute for date entities in the Document object, and the year in this normalized value is sometimes inferred as 19xx instead of 20xx.
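
As a rough illustration of the correction, the sketch below applies the shift to a hand-written dictionary that mimics the `normalizedValue` part of an entity in the Document JSON. The sample date and the helper name `shift_to_20xx` are assumptions made for this example; they are not part of the tool.

```python
import copy


def shift_to_20xx(normalized_value: dict) -> dict:
    """Return a copy of the value with its year moved 100 years forward."""
    shifted = copy.deepcopy(normalized_value)
    date = shifted["dateValue"]
    date["year"] += 100
    # Rebuild the ISO-style text so it stays in sync with the structured date.
    shifted["text"] = f"{date['year']:04d}-{date['month']:02d}-{date['day']:02d}"
    return shifted


# Hypothetical value: a date that was inferred as 1923 instead of 2023.
sample = {"text": "1923-05-17", "dateValue": {"year": 1923, "month": 5, "day": 17}}
print(shift_to_20xx(sample)["text"])  # 2023-05-17
```

The copy keeps the original value untouched, which makes it easy to compare the two side by side; the tool itself updates the entity in place.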

# Prerequisites
* Vertex AI Notebook
* GCS Folder Path

# Step-by-Step Procedure

## 1. Import Modules/Packages

In [None]:
# Run this cell to download utilities module
!wget https://raw.githubusercontent.com/GoogleCloudPlatform/document-ai-samples/main/incubator-tools/best-practices/utilities/utilities.py

In [None]:
from google.cloud import storage
from google.cloud import documentai_v1beta3 as documentai
from utilities import (
    file_names,
    store_document_as_json,
)

## 2. Input Details

* **GCS_INPUT_DIR** : GCS folder path that contains the Document AI processor JSON results
* **GCS_OUTPUT_DIR** : GCS folder path where the post-processed results are stored

In [None]:
# Parser results as JSON; date entities should contain normalized value data
GCS_INPUT_DIR = "gs://BUCKET_NAME/incubator/"
GCS_OUTPUT_DIR = "gs://BUCKET_NAME/incubator/output/"
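
Both values are expected to be `gs://` folder URIs. The next step splits each of them into a bucket name and a folder prefix. A minimal sketch of that split, using a hypothetical helper named `split_gcs_uri` that is not part of `utilities.py`, could look like this:

```python
from typing import Tuple


def split_gcs_uri(uri: str) -> Tuple[str, str]:
    """Split 'gs://bucket/some/folder/' into ('bucket', 'some/folder')."""
    if not uri.startswith("gs://"):
        raise ValueError(f"Expected a gs:// URI, got: {uri!r}")
    # Drop the scheme, then separate the bucket from the rest of the path.
    bucket, _, prefix = uri[len("gs://"):].partition("/")
    return bucket, prefix.strip("/")


print(split_gcs_uri("gs://BUCKET_NAME/incubator/output/"))
# ('BUCKET_NAME', 'incubator/output')
```

Validating the scheme up front is a design choice of this sketch; it surfaces a mistyped path before any files are read.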

## 3. Run the Code Cells Below

In [None]:
def normalize_date_entity(entity: documentai.Document.Entity) -> None:
    """
    Normalize a date entity by adding 100 years to the year value.

    The entity is modified in place: both the structured date value and the
    text of its normalized value are updated.

    Args:
        entity (documentai.Document.Entity): The date entity to be normalized.

    Returns:
        None

    Example:
        entity = ...  # Assume entity is extracted from a document
        normalize_date_entity(entity)
        # The date entity will be normalized with the year increased by 100.
    """
    print("\t\t", entity.type_, entity.normalized_value.text, end=" -> ")
    accumulate = 100
    date = entity.normalized_value.date_value
    curr_year, curr_month, curr_day = date.year, date.month, date.day
    updated_year = curr_year + accumulate
    entity.normalized_value.date_value.year = updated_year
    text = f"{updated_year}-{curr_month:0>2}-{curr_day:0>2}"
    entity.normalized_value.text = text
    print(entity.normalized_value.text)


json_splits = GCS_INPUT_DIR.strip("/").split("/")
input_bucket = json_splits[2]
INPUT_FILES_DIR = "/".join(json_splits[3:])
GCS_OUTPUT_DIR = GCS_OUTPUT_DIR.strip("/")
output_splits = GCS_OUTPUT_DIR.split("/")
output_bucket = output_splits[2]
OUTPUT_FILES_DIR = "/".join(output_splits[3:])


_, files_dict = file_names(GCS_INPUT_DIR)
ip_storage_client = storage.Client()
ip_storage_bucket = ip_storage_client.bucket(input_bucket)
print("Process started for converting normalized date value from 19xx to 20xx...")
for fn, fp in files_dict.items():
    print(f"\tFile: {fn}")
    json_str = ip_storage_bucket.blob(fp).download_as_bytes()
    doc = documentai.Document.from_json(json_str)
    for ent in doc.entities:
        if 100 < ent.normalized_value.date_value.year < 2000:
            normalize_date_entity(ent)

    json_str = documentai.Document.to_json(doc)
    file_name = f"{OUTPUT_FILES_DIR}/{fn}"
    print(f"\t  Output gcs uri - gs://{output_bucket}/{file_name}")
    store_document_as_json(json_str, output_bucket, file_name)

print("Process Completed!!!")
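
The condition `100 < year < 2000` in the loop above also matches years before 1900, while entities without a normalized date report a year of 0 and are skipped. If only 19xx values should be shifted, a narrower check may be preferable. The sketch below is one way to express that on plain integers; `is_19xx` is a hypothetical helper, not part of the tool.

```python
def is_19xx(year: int) -> bool:
    """Return True only for years in the range 1900-1999."""
    return 1900 <= year <= 1999


# 0 stands for an entity with no normalized date (the protobuf default).
for year in (0, 1850, 1923, 1999, 2005):
    current = 100 < year < 2000  # check used in the loop above
    print(year, current, is_19xx(year))
```

To adopt it, the `if` condition in the loop could become `if is_19xx(ent.normalized_value.date_value.year):`. Whether pre-1900 dates should be shifted depends on the documents being processed.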

## 4. Output Details

Refer to the images below for the pre-processed and post-processed results.

<table>
    <tr>
        <td>
            <b>Pre-processed data</b>
        </td>
        <td>
            <b>Post-processed data</b>
        </td>
    </tr>
    <tr>
        <td>
            <img src='./images/pre_processing_image.png' width=400 height=600></img>
        </td>
        <td>
            <img src='./images/post_processing_image.png' width=400 height=600></img>
        </td>
    </tr>
</table>
    