# Detecting language of the text within the entities

* Author: docai-incubator@google.com

## Disclaimer

This tool is not supported by the Google engineering team or product team. It is provided and supported on a best-effort basis by the **DocAI Incubator Team**. No guarantees of performance are implied.


## Objective

This tool detects the language of each entity's text by matching token text anchors against entity text anchors. It then adds a `detectedLanguages` attribute to every entity in the generated JSON file, so the language codes associated with an entity can be read directly from the entity in the JSON output.
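
The matching idea can be illustrated on a toy document. This is a minimal sketch, not the tool's implementation: the text, indices, and language codes are invented, and `languages_for_span` is a hypothetical helper. Only the field names (`textAnchor`, `textSegments`, `detectedLanguages`, `languageCode`) come from the Document AI JSON format.

```python
# Toy document using the field names of Document AI's JSON export.
# The text, indices, and language codes are invented for illustration.
toy_document = {
    "text": "Hello Bonjour",
    "pages": [
        {
            "tokens": [
                {
                    "layout": {
                        "textAnchor": {
                            "textSegments": [{"startIndex": "0", "endIndex": "6"}]
                        }
                    },
                    "detectedLanguages": [{"languageCode": "en"}],
                },
                {
                    "layout": {
                        "textAnchor": {
                            "textSegments": [{"startIndex": "6", "endIndex": "13"}]
                        }
                    },
                    "detectedLanguages": [{"languageCode": "fr"}],
                },
            ]
        }
    ],
    "entities": [
        {
            "type": "greeting",
            "textAnchor": {"textSegments": [{"startIndex": "6", "endIndex": "13"}]},
        }
    ],
}


def languages_for_span(document, start, end):
    """Collect language codes of tokens lying fully inside [start, end]."""
    codes = set()
    for page in document.get("pages", []):
        for token in page.get("tokens", []):
            segment = token["layout"]["textAnchor"]["textSegments"][0]
            token_start = int(segment.get("startIndex", 0))
            token_end = int(segment.get("endIndex", 0))
            if start <= token_start and token_end <= end:
                for language in token.get("detectedLanguages", []):
                    codes.add(language["languageCode"])
    return sorted(codes)


entity = toy_document["entities"][0]
span = entity["textAnchor"]["textSegments"][0]
entity["detectedLanguages"] = languages_for_span(
    toy_document, int(span["startIndex"]), int(span["endIndex"])
)
print(entity["detectedLanguages"])  # ['fr']
```

Only the second token falls inside the entity's span, so the entity receives the single code `fr`.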


## Prerequisites
* Vertex AI Notebook

## Step-by-Step Procedure

### 1. Importing Required Modules

In [None]:
!wget https://raw.githubusercontent.com/GoogleCloudPlatform/document-ai-samples/main/incubator-tools/best-practices/utilities/utilities.py

In [None]:
from google.cloud import documentai_v1beta3 as documentai
from google.cloud import storage
from typing import Any, Dict, List, Optional, Sequence, Tuple, Union
from pathlib import Path
import json
from utilities import (
    file_names,
    documentai_json_proto_downloader,
    store_document_as_json,
)

### 2. Set up the inputs
* `gcs_input_path`: GCS path of the folder that contains the input JSON files.
* `gcs_output_path`: GCS path of the folder where the updated JSON files, with the new attribute added, will be stored.


In [None]:
gcs_input_path = "gs://{input_bucket_name}/{subfolder_name}/"  # '/' should be provided at the end of the path.
gcs_output_path = "gs://{output_bucket_name}/{subfolder_name}/"  # '/' should be provided at the end of the path.
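
Both paths are later split into a bucket name and an object prefix, and each output object name is formed by appending the file name to that prefix, which is why the trailing `/` matters. The following is a minimal sketch of that split. The helper `split_gcs_path`, the bucket name, and the file name are hypothetical and not part of `utilities.py`.

```python
def split_gcs_path(gcs_path):
    """Split 'gs://bucket/prefix/' into (bucket, prefix)."""
    scheme = "gs://"
    if not gcs_path.startswith(scheme):
        raise ValueError(f"Expected a path starting with {scheme!r}: {gcs_path!r}")
    # Everything up to the first '/' is the bucket; the rest is the prefix.
    bucket, _, prefix = gcs_path[len(scheme):].partition("/")
    return bucket, prefix


bucket, prefix = split_gcs_path("gs://my-bucket/docai/output/")
print(bucket)                  # my-bucket
print(prefix)                  # docai/output/
print(f"{prefix}sample.json")  # docai/output/sample.json
```

Without the trailing `/`, the last line would produce `docai/outputsample.json`, so the file would land next to the intended folder rather than inside it.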

### 3. Run the required functions

In [None]:
def get_detected_languages_for_entities_with_multiple_segments(
    document_json: dict,
) -> dict:
    """
    Adds a "detectedLanguages" list to each entity, built from the tokens whose
    text anchor contains, or is contained in, one of the entity's text segments.

    Args:
        document_json (dict): The JSON representation of the document, containing entities, pages, and tokens.

    Returns:
        dict: The same document JSON, updated in place, with detected languages added to each entity.
    """
    for entity in document_json.get("entities", []):
        detected_languages = set()  # Using a set to avoid duplicate languages

        try:
            # Iterate over each text segment of the entity
            for segment in entity["textAnchor"]["textSegments"]:
                # Index fields are int64 and are serialized as strings in the JSON,
                # so convert them to int; comparing raw strings is lexicographic.
                entity_start_index = int(segment.get("startIndex", 0))
                entity_end_index = int(segment.get("endIndex", 0))

                # Iterate through each page and each token in the document
                for page in document_json.get("pages", []):
                    for token in page.get("tokens", []):
                        token_segment = token["layout"]["textAnchor"]["textSegments"][0]
                        token_start_index = int(token_segment.get("startIndex", 0))
                        token_end_index = int(token_segment.get("endIndex", 0))

                        # Match when the entity segment lies inside the token,
                        # or the token lies inside the entity segment
                        if (
                            entity_start_index >= token_start_index
                            and entity_end_index <= token_end_index
                        ) or (
                            token_start_index >= entity_start_index
                            and token_end_index <= entity_end_index
                        ):
                            # Tokens without a detected language contribute nothing
                            for language in token.get("detectedLanguages", []):
                                detected_languages.add(language["languageCode"])

            # Store the language codes on the entity, sorted for stable output
            entity["detectedLanguages"] = sorted(detected_languages)
        except (KeyError, IndexError, TypeError, ValueError) as e:
            # Entities or tokens without a usable text anchor are left unchanged
            print(f"Skipping entity of type {entity.get('type')!r}: {e!r}")

    return document_json
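
One detail worth keeping in mind when comparing anchors: `startIndex` and `endIndex` are 64-bit integer fields, and the protobuf JSON mapping writes such fields as strings. Comparing the raw strings is lexicographic, not numeric. The short sketch below shows the difference; the values are invented and `to_index` is a hypothetical helper.

```python
# Index values as they appear in the exported JSON (strings, not ints).
entity_start, token_start = "9", "10"

print(entity_start >= token_start)            # True: "9" sorts after "1" as text
print(int(entity_start) >= int(token_start))  # False: 9 is less than 10


def to_index(segment, key):
    """Read an index from a text segment as an int, treating a missing key as 0."""
    return int(segment.get(key, 0))


print(to_index({"endIndex": "42"}, "startIndex"))  # 0
print(to_index({"endIndex": "42"}, "endIndex"))    # 42
```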

### 4. Run the code

In [None]:
def main(gcs_input_path, gcs_output_path):
    # Bucket names and output prefix, taken from paths of the form gs://bucket/prefix/
    input_bucket_name = gcs_input_path.split("/")[2]
    output_bucket_name = gcs_output_path.split("/")[2]
    output_path_within_bucket = "/".join(gcs_output_path.split("/")[3:])

    # Only the name-to-path dictionary is needed; the list of names is unused
    _, file_dict = file_names(gcs_input_path)
    for filename, filepath in file_dict.items():
        print(">>>>>>>>>>>>>>> Processing File : ", filename)
        document_proto = documentai_json_proto_downloader(input_bucket_name, filepath)
        document_json = json.loads(documentai.Document.to_json(document_proto))

        final_json = get_detected_languages_for_entities_with_multiple_segments(
            document_json
        )
        store_document_as_json(
            json.dumps(final_json),
            output_bucket_name,
            f"{output_path_within_bucket}{filename}",
        )


main(gcs_input_path, gcs_output_path)

### 5. Output

The new `detectedLanguages` attribute is added to each entity in the newly generated JSON file.


<img src="./Images/image.png" width=800 height=400 ></img>
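
Once the output file has been downloaded, the attribute can be read like any other entity field. The sketch below is only an illustration: the JSON excerpt is a hypothetical, heavily trimmed stand-in for a real output file, and the entity values are invented.

```python
import json

# Hypothetical, trimmed excerpt of an output file; the entity values are invented.
output_json = """
{
  "entities": [
    {"type": "invoice_id", "mentionText": "INV-001", "detectedLanguages": ["en"]},
    {"type": "supplier_name", "mentionText": "Dupont SARL", "detectedLanguages": ["fr"]},
    {"type": "total_amount", "mentionText": "120.00"}
  ]
}
"""

document = json.loads(output_json)

# Map each entity type to its language codes; entities without the
# attribute are reported with an empty list.
languages_by_type = {
    entity["type"]: entity.get("detectedLanguages", [])
    for entity in document.get("entities", [])
}

for entity_type, languages in languages_by_type.items():
    print(entity_type, languages)
```

Reading the field with `.get` keeps the loop working for entities that carry no `detectedLanguages` attribute.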