# Enrich Address for Invoice and Expense Documents
 

* Author: docai-incubator@google.com

## Disclaimer

This tool is not supported by the Google engineering team or product team. It is provided and supported on a best-effort basis by the **DocAI Incubator Team**. No guarantees of performance are implied.

# Objective
This tool splits each detected address into its constituent parts (street address, city, state, ZIP code, and country) using a Vertex AI text generation model. The structured result is stored in the entity's `normalizedValue` attribute, so downstream consumers can read individual address fields instead of a single block of text.

# Prerequisites
* Python: Jupyter notebook (Vertex AI).

NOTE:
* The version of Python running in the Jupyter notebook should be greater than 3.8.
* The `normalizedValue` attribute is available only in the output JSON file and is not visible in the processor.
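
The version requirement in the note above can be checked from inside the notebook. The following is a minimal sketch; `python_version_ok` and `MIN_VERSION` are hypothetical names, and it assumes "greater than 3.8" means Python 3.9 or later.

```python
import sys

# Assumption: "greater than 3.8" is read as "3.9 or later".
MIN_VERSION = (3, 9)


def python_version_ok(version_info=sys.version_info, minimum=MIN_VERSION) -> bool:
    """Return True when the (major, minor) version meets the minimum."""
    return tuple(version_info[:2]) >= minimum


running = sys.version.split()[0]
if python_version_ok():
    print(f"Python {running} meets the requirement.")
else:
    print(f"Python {running} is older than {MIN_VERSION[0]}.{MIN_VERSION[1]}.")
```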


# Step-by-Step Procedure

## 1. Import the libraries

In [None]:
!pip install --upgrade google-cloud-aiplatform

In [None]:
# Run this cell to download utilities module
!wget https://raw.githubusercontent.com/GoogleCloudPlatform/document-ai-samples/main/incubator-tools/best-practices/utilities/utilities.py

In [None]:
import json
from pathlib import Path
from google.cloud import storage
import vertexai
from vertexai.language_models import TextGenerationModel
from utilities import file_names, store_document_as_json, blob_downloader

## 2. Input Details

* **PROJECT_ID**: The ID of the Google Cloud project to use.
* **LOCATION**: The location of the Vertex AI project (for example, `us-central1`).
* **GCS_INPUT_PATH**: The Cloud Storage path that contains the input JSON files.
* **GCS_OUTPUT_PATH**: The Cloud Storage path where the updated JSON files, with the added attribute, are stored.
* **ENTITY_NAME**: The list of entity names whose addresses should be split.


In [None]:
PROJECT_ID = "rand-automl-project"  # Your Google Cloud project ID.
LOCATION = "us-central1"
# '/' should be provided at the end of the path.
GCS_INPUT_PATH = "gs://bucket_name/path/to/jsons/"
# '/' should be provided at the end of the path.
GCS_OUTPUT_PATH = "gs://bucket_name/path/to/jsons/"
# Name of the entities in a list format.
ENTITY_NAME = ["receiver_address", "remit_to_address"]
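
Both paths are expected to start with `gs://` and end with `/`. As a hedged sketch, a small check like the one below could catch a malformed path before any files are processed. The helper name `check_gcs_folder_path` is hypothetical and is not part of `utilities.py`.

```python
def check_gcs_folder_path(path: str) -> str:
    """Return the path unchanged if it looks like 'gs://bucket/prefix/'.

    Raises ValueError otherwise. This only checks the shape of the string;
    it does not contact Cloud Storage.
    """
    if not path.startswith("gs://"):
        raise ValueError(f"Path must start with 'gs://': {path!r}")
    if not path.endswith("/"):
        raise ValueError(f"Path must end with '/': {path!r}")
    # Everything between 'gs://' and the next '/' is the bucket name.
    bucket = path[len("gs://"):].split("/", 1)[0]
    if not bucket:
        raise ValueError(f"Path has no bucket name: {path!r}")
    return path


# Placeholder path taken from the configuration cell above.
print(check_gcs_folder_path("gs://bucket_name/path/to/jsons/"))
```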

## 3. Run the Code Cells Below

In [None]:
from typing import Optional


def split_address_to_json(
    address: str, project_id: str, location: str
) -> Optional[dict]:
    """
    Split an address into JSON format with specific keys using a text generation model.

    This function uses a text generation model to split an address into JSON
    format with keys for streetAddress, city, state, zipcode, and country.

    Args:
        address (str): The input address string to be split.
        project_id (str): The project ID for the Vertex AI project.
        location (str): The location of the Vertex AI project.

    Returns:
        dict or None: A dictionary containing the JSON-formatted address if successful, else None.
    """
    vertexai.init(project=project_id, location=location)
    parameters = {
        "max_output_tokens": 1024,
        "temperature": 0.2,
        "top_p": 0.8,
        "top_k": 40,
    }
    model = TextGenerationModel.from_pretrained("text-bison@001")
    response = model.predict(
        f"""Please split the address into Json format with keys
        streetAddress, city, state, zipcode, country

        input: {address}
        output:
        """,
        **parameters,
    )

    # Extracting JSON response from the model
    json_response = response.text

    try:
        json_output = json.loads(json_response)
        print("JSON OUTPUT", json_output)
        return json_output
    except json.JSONDecodeError as e:
        print(f"Error decoding JSON response: {e}")
        print("Response from Model:", response.text)
        return None


def process_json_files(
    list_of_files: list,
    input_storage_bucket_name: str,
    output_storage_bucket_name: str,
    output_bucket_path_prefix: str,
    project_id: str,
    location: str,
) -> None:
    """
    Process JSON files containing address entities, split the addresses,
    and store the updated JSON files in Google Cloud Storage.

    This function iterates over a list of JSON files containing address entities,
    splits the addresses into JSON format with keys for streetAddress,
    city, state, zipcode, and country,
    and stores the updated JSON files in a specified Google Cloud Storage bucket.

    Args:
        list_of_files (list): A list of JSON file paths to be processed.
        input_storage_bucket_name (str): The name of the input Google Cloud Storage bucket.
        output_storage_bucket_name (str): The name of the output Google Cloud Storage bucket.
        output_bucket_path_prefix (str): The prefix path within the output bucket
                                         where the processed files will be stored.
        project_id (str): The project ID for the Vertex AI project.
        location (str): The location of the Vertex AI project.

    Returns:
        None
    """

    for file_path in list_of_files:
        print("***************")
        # Extract the file name from the path
        file_name = file_path.split("/")[-1]
        print(f"File Name {file_name}")
        json_data = blob_downloader(input_storage_bucket_name, file_path)
        for ent in json_data["entities"]:
            for name in ENTITY_NAME:
                if name in ent["type"]:
                    print("---------------")
                    mention_text = ent.get("mentionText", "")
                    # normalized_value = ent.get('normalizedValue', "")
                    type_ = ent.get("type", "")
                    print(f"Type: {type_}")
                    print(f"Mention Text: {mention_text}")

                    # Try splitting the address
                    output_json = split_address_to_json(
                        mention_text.replace("\n", " ").strip(), project_id, location
                    )
                    # If address was successfully split, update the entity
                    if output_json is not None:
                        ent["normalizedValue"] = output_json
                        ent["identified_format"] = "Address split"
                    else:
                        print("Address couldn't be split.")

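                    # Use .get() so a failed split on an entity without an
                    # existing normalizedValue does not raise KeyError.
                    print(f"New Normalized Value: {ent.get('normalizedValue')}")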

        # save to Google Cloud Storage
        output_file_name = f"{output_bucket_path_prefix}{file_name}"
        store_document_as_json(
            json.dumps(json_data), output_storage_bucket_name, output_file_name
        )

    print("--------------------")
    print("All files processed.")
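
`split_address_to_json` passes the model's raw text straight to `json.loads`, so any extra text around the JSON object (for example, a Markdown code fence or a leading sentence) causes the split to fail. The following is a minimal sketch of a more tolerant parser, assuming the response contains at most one top-level JSON object. `extract_json_object` is a hypothetical helper, not part of the original notebook.

```python
import json
from typing import Optional


def extract_json_object(text: str) -> Optional[dict]:
    """Parse the span from the first '{' to the last '}' in text as JSON.

    Returns the parsed dict, or None if no JSON object can be decoded.
    """
    start = text.find("{")
    end = text.rfind("}")
    if start == -1 or end < start:
        return None
    try:
        parsed = json.loads(text[start : end + 1])
    except json.JSONDecodeError:
        return None
    # Only accept a JSON object, since the caller expects address keys.
    return parsed if isinstance(parsed, dict) else None


# Illustrative response with extra text around the JSON object.
sample = 'output: {"city": "Springfield", "state": "IL"} Done.'
print(extract_json_object(sample))
```

If used, the helper would replace the `json.loads(json_response)` call; the rest of the function can stay as it is.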

## 4. Run the main function after executing the functions above

In [None]:
def main(project_id: str, location: str, input_path: str, output_path: str) -> None:
    """
    Main function to process JSON files containing address entities and
    store the updated JSON files in Google Cloud Storage.

    This function serves as the main entry point for processing JSON files
    containing address entities, splitting the addresses,
    and storing the updated JSON files in a specified Google Cloud Storage bucket.

    Args:
        project_id (str): The project ID for the Vertex AI project.
        location (str): The location of the Vertex AI project.
        input_path (str): The path to the input directory containing JSON files.
        output_path (str): The path to the output directory
                           where the processed files will be stored.

    Returns:
        None
    """
    input_storage_bucket_name = input_path.split("/")[2]
    # input_bucket_path_prefix = "/".join(input_path.split("/")[3:])
    output_storage_bucket_name = output_path.split("/")[2]
    output_bucket_path_prefix = "/".join(output_path.split("/")[3:])

    json_files = file_names(input_path)[1].values()
    list_of_files = [i for i in list(json_files) if i.endswith(".json")]
    process_json_files(
        list_of_files,
        input_storage_bucket_name,
        output_storage_bucket_name,
        output_bucket_path_prefix,
        project_id,
        location,
    )


main(PROJECT_ID, LOCATION, GCS_INPUT_PATH, GCS_OUTPUT_PATH)
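
To make the path handling in `main` concrete, the sketch below reproduces the same `split("/")` logic on the placeholder path and shows why the trailing `/` matters: the output object name is built by concatenating the prefix and the file name with no separator. The names `split_gcs_path` and `output_object_name`, and the file name `invoice.json`, are hypothetical and used only for illustration.

```python
from typing import Tuple


def split_gcs_path(path: str) -> Tuple[str, str]:
    """Split 'gs://bucket/prefix/' into (bucket, prefix), as main() does."""
    parts = path.split("/")
    return parts[2], "/".join(parts[3:])


def output_object_name(path: str, file_name: str) -> str:
    """Build the output object name the same way process_json_files() does."""
    _, prefix = split_gcs_path(path)
    return f"{prefix}{file_name}"


print(split_gcs_path("gs://bucket_name/path/to/jsons/"))
# ('bucket_name', 'path/to/jsons/')

print(output_object_name("gs://bucket_name/path/to/jsons/", "invoice.json"))
# path/to/jsons/invoice.json

# Without the trailing '/', the folder and file names run together.
print(output_object_name("gs://bucket_name/path/to/jsons", "invoice.json"))
# path/to/jsonsinvoice.json
```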

## Output
The new `normalizedValue` attribute is added to each matching address entity in the newly generated JSON file.

<table>
    <tr>
        <td>
            <b>Pre-processed data</b>
        </td>
    </tr>
    <tr>
        <td>
            <img src='./images/input_image.png' width=600 height=600></img>
        </td>
    </tr>
</table>
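
As a rough illustration of the change made to each entity, the sketch below applies a split result to a single entity dictionary. The entity, the address, and the split values are made up for this example and are not taken from a real document or a real model response. `apply_split` is a hypothetical helper that mirrors the update in `process_json_files`, except that it returns a copy instead of modifying the entity in place.

```python
import copy
import json
from typing import Optional


def apply_split(entity: dict, split: Optional[dict]) -> dict:
    """Return a copy of entity with the split address attached, if any."""
    updated = copy.deepcopy(entity)
    if split is not None:
        updated["normalizedValue"] = split
        updated["identified_format"] = "Address split"
    return updated


# Hypothetical entity as it might appear before processing.
entity = {
    "type": "receiver_address",
    "mentionText": "123 Example St\nSpringfield, IL 62701\nUSA",
}

# Hypothetical split result with the keys requested in the prompt.
split = {
    "streetAddress": "123 Example St",
    "city": "Springfield",
    "state": "IL",
    "zipcode": "62701",
    "country": "USA",
}

print(json.dumps(apply_split(entity, split), indent=2))
```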

In [None]:
!python3 --version