# Currency Normalization

* Author: docai-incubator@google.com

## Disclaimer

This tool is not supported by the Google engineering team or product team. It is provided and supported on a best-effort basis by the DocAI Incubator Team. No guarantees of performance are implied. 


## Objective

This document describes how to use the currency normalization tool. The tool takes parsed DocAI JSON files and an Excel file that maps each parsed currency prediction to the desired currency name, and normalizes the currency entity to the desired value.

## Prerequisites

* Vertex AI Notebook or Colab (if using Colab, authenticate first)
* Storage bucket for storing the input and output JSON files
* Permissions for Google Cloud Storage and Vertex AI Notebook
* Excel file that contains the mapping information


## Step by Step procedure

### 1. Importing Required Modules

In [2]:
%pip install tqdm
%pip install pandas
%pip install google-cloud-storage
%pip install google-cloud-documentai
%pip install fuzzywuzzy

In [None]:
# Uncomment and run this cell to download the utilities module, which is imported below
# !wget https://raw.githubusercontent.com/GoogleCloudPlatform/document-ai-samples/main/incubator-tools/best-practices/utilities/utilities.py

In [None]:
from google.cloud import storage
from fuzzywuzzy import fuzz
from tqdm import tqdm
from google.cloud import documentai_v1beta3 as documentai
from typing import Dict, List

import json
import pandas as pd
import utilities

### 2. Input and Output Paths

* **gcs_input_path** : GCS input path. It should contain the DocAI processed output JSON files and must end with `/`.
* **gcs_output_path** : GCS output path. The post-processed JSON files are stored in this path.
* **project_id** : The ID of your current project.
* **column_key** : Name of the Excel column to use as the key for the conversion (the existing currency format).
* **column_value** : Name of the Excel column to use as the value for the conversion (the desired currency format).
* **update_entity** : List of entity names to convert.
* **excel_path** : Path to the Excel file that contains the mapping. Screenshot from a sample file:

<img src="./Images/currency_issue.png" width=800 height=400></img>
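To make the expected shape of the mapping concrete, here is a minimal sketch that builds a small in-memory table with the two column names used as sample values in this notebook and turns it into a lookup dictionary. The rows are made up for illustration and are not the contents of the sample file; the in-memory `DataFrame` simply stands in for the result of reading the Excel file.

```python
import pandas as pd

# Hypothetical rows; the real sheet would be loaded with pd.read_excel(excel_path).
df = pd.DataFrame(
    {
        "Currency in Invoice": ["US Dollar", "Euro", "Rs."],
        "Expected Currency Code": ["USD", "EUR", "INR"],
    }
)

# Pair each existing currency string (key) with its desired code (value).
mapping = dict(zip(df["Currency in Invoice"], df["Expected Currency Code"]))
print(mapping)
```

Each key is a currency string as it appears in the parsed documents, and each value is the code it should be replaced with.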

In [None]:
# Input and Output Bucket path
gcs_input_path = "gs://XXXXXXXXXXXX/"  # Parsed JSON files path; the trailing '/' is mandatory
gcs_output_path = "gs://XXXXXXXXXX/"  # Output path; the trailing '/' is mandatory because file names are appended to it
project_id = "XXXXXXXXX"  # Project ID
excel_path = "Currency Issue.xlsx"  # Path to the Excel mapping file
column_key = "Currency in Invoice"  # Column used as the key (existing currency format)
column_value = "Expected Currency Code"  # Column used as the value (desired currency format)
update_entity = ["currency"]  # Names of the entities to normalize

### 3. Run the Code

In [None]:
def get_mapping_dict(
    excel_path: str, column_key: str, column_value: str
) -> Dict[str, str]:
    """Reads the Excel file and returns a mapping dictionary.

    Args:
      excel_path (str) : Path to the Excel file that contains the mapping.
      column_key (str) : Name of the column holding the existing currency format.
      column_value (str) : Name of the column holding the desired currency format.

    Returns:
      A dictionary mapping each existing currency string to its desired value.
    """

    df = pd.read_excel(excel_path)
    mapping_dict = {}
    for index, row in df.iterrows():
        key = row[column_key]
        value = row[column_value]
        mapping_dict[key] = value

    return mapping_dict


def mapping_entities_dict(
    json_dict: documentai.Document,
    update_entity: List[str],
    mapping_dict: Dict[str, str],
) -> documentai.Document:
    """
    Uses the mapping dictionary to update the mention text and normalized value of the entities listed in update_entity.

    Args:
      json_dict (object) : The document object that holds all the document data.
      update_entity (list) : Names of the entity types to convert from one form to another.
      mapping_dict (Dict) : Existing currency strings as keys and desired values as values.

    Returns:
        object : The document after updating the mention text and normalized value from the mapping dictionary.
    """

    def calculate_similarity_ratio(
        mapping_dict: Dict[str, str], mention_text: str, match_ratio: float
    ) -> str:
        """
        Finds the key in the mapping dictionary that best matches the mention text.

        Args:
            mapping_dict (dict): A dictionary where keys represent potential matches for the mention text.
            mention_text (str): The text for which similarity ratios are calculated against the keys in the mapping dictionary.
            match_ratio (float): Threshold between 0 and 1 that a fuzzy match must exceed to be considered a valid match.

        Returns:
            str: The key with the highest fuzzy ratio that exceeds the threshold, or an empty string if no key qualifies.
        """

        match_key1 = ""
        # fuzz.ratio returns an integer from 0 to 100, so scale the threshold to the same range.
        match_fuzzy = match_ratio * 100
        for i in mapping_dict.keys():
            ratio = fuzz.ratio(str(i), str(mention_text))
            if ratio > match_fuzzy:
                match_key1 = i
                match_fuzzy = ratio
        return match_key1

    for entity in json_dict.entities:
        if entity.type_ in update_entity:
            match_key = calculate_similarity_ratio(
                mapping_dict, entity.mention_text, 0.95
            )
            if match_key != "":
                entity.mention_text = mapping_dict[match_key]
                if not entity.normalized_value:
                    entity.normalized_value.text = mapping_dict[match_key]
                else:
                    entity.normalized_value = {"text": mapping_dict[match_key]}
            else:
                continue

    return json_dict
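The docstring of `calculate_similarity_ratio` describes picking the best-scoring key above a threshold. The following is a minimal, standalone sketch of that idea. It swaps in the standard library's `difflib.SequenceMatcher` as a stand-in for `fuzz.ratio` so it runs without extra packages; scores from the two may differ. The function name `best_match` and the sample mapping are hypothetical.

```python
from difflib import SequenceMatcher
from typing import Dict


def best_match(mapping: Dict[str, str], text: str, threshold: float) -> str:
    """Return the key most similar to text, or "" if no score exceeds threshold (0 to 1)."""
    best_key, best_score = "", threshold
    for key in mapping:
        # SequenceMatcher.ratio() lies in [0, 1]; it is a stand-in for fuzz.ratio here.
        score = SequenceMatcher(None, str(key), str(text)).ratio()
        if score > best_score:
            best_key, best_score = key, score
    return best_key


# Hypothetical mapping used only for illustration.
mapping = {"US Dollar": "USD", "Euro": "EUR"}
print(best_match(mapping, "US Dollar", 0.95))  # identical strings score 1.0
print(best_match(mapping, "Yen", 0.95))  # no key is close enough
```

Starting `best_score` at the threshold means a key is accepted only if it beats both the threshold and every earlier candidate, so an empty string signals that the entity should be left unchanged.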

In [None]:
file_names_list, file_dict = utilities.file_names(gcs_input_path)
mapping_dict = get_mapping_dict(excel_path, column_key, column_value)
# The bucket name is the third "/"-separated component of a gs:// path.
input_bucket_name = gcs_input_path.split("/")[2]
for filename, filepath in tqdm(file_dict.items(), desc="Progress"):
    # Process only JSON files.
    if filepath.endswith(".json"):
        json_dict = utilities.documentai_json_proto_downloader(
            input_bucket_name, filepath
        )
        json_dict_updated = mapping_entities_dict(
            json_dict, update_entity, mapping_dict
        )

        output_bucket_name = gcs_output_path.split("/")[2]
        output_path_within_bucket = "/".join(gcs_output_path.split("/")[3:]) + filename
        utilities.store_document_as_json(
            documentai.Document.to_json(json_dict_updated),
            output_bucket_name,
            output_path_within_bucket,
        )
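The loop above derives the bucket name and the object prefix by splitting the `gs://` path on `/`. A small sketch of that splitting, with a hypothetical helper name `split_gcs_path` and made-up bucket names, shows why the trailing `/` matters when the file name is appended to the prefix:

```python
from typing import Tuple


def split_gcs_path(gcs_path: str) -> Tuple[str, str]:
    """Split 'gs://bucket/prefix/' into ('bucket', 'prefix/')."""
    parts = gcs_path.split("/")
    # parts[0] is "gs:", parts[1] is "", parts[2] is the bucket name.
    return parts[2], "/".join(parts[3:])


bucket, prefix = split_gcs_path("gs://my-bucket/output/")
print(bucket, prefix + "invoice.json")

# Without the trailing slash the prefix and the file name run together.
_, bad_prefix = split_gcs_path("gs://my-bucket/output")
print(bad_prefix + "invoice.json")
```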

### 4. Output

The post-processed JSON files can be found in the storage path you provided in `gcs_output_path`.

<img src="./Images/currency_output.png" width=800 height=400></img>
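To spot-check a result, one option is to load an output file and list the normalized entities. The sketch below works on a hypothetical, heavily trimmed JSON string instead of a real output file, and it assumes the serialized document uses camelCase field names (`mentionText`, `normalizedValue`); verify this against your own output before relying on it. The helper name `list_entities` is illustrative.

```python
import json
from typing import List, Tuple

# Hypothetical, heavily trimmed document JSON for illustration only.
sample_json = """
{
  "entities": [
    {"type": "currency", "mentionText": "USD", "normalizedValue": {"text": "USD"}},
    {"type": "total_amount", "mentionText": "100.00"}
  ]
}
"""


def list_entities(doc: dict, entity_types: List[str]) -> List[Tuple[str, str]]:
    """Return (mentionText, normalizedValue.text) pairs for the given entity types."""
    results = []
    for entity in doc.get("entities", []):
        if entity.get("type") in entity_types:
            normalized = entity.get("normalizedValue", {}).get("text", "")
            results.append((entity.get("mentionText", ""), normalized))
    return results


print(list_entities(json.loads(sample_json), ["currency"]))
```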