# OCR Confidence Score Calculation Tool

* Author: docai-incubator@google.com


## Disclaimer

This tool is not supported by the Google engineering team or product team. It is provided and supported on a best-effort basis by the DocAI Incubator Team. No guarantees of performance are implied.

## Purpose and Description

This tool reads the OCR confidence score of each individual token and matches the tokens against the text segments of every entity. For each entity, it finds the lowest OCR confidence among the tokens that overlap the entity's text segments.
That lowest value is stored on the entity as "ocr_confidence_score". Every entity, whether parent or child, gets its own OCR confidence score.
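
The idea can be illustrated with a small, self-contained sketch. The token tuples and the helper name `min_overlapping_confidence` are made up for this illustration and are not part of the tool; character ranges are treated as half-open (end index exclusive).

```python
# Toy OCR tokens as (start, end, confidence); end is exclusive.
tokens = [
    (0, 7, 0.99),
    (8, 14, 0.62),  # a poorly recognized word
    (15, 20, 0.97),
    (21, 25, 0.88),
]


def min_overlapping_confidence(tokens, segments):
    """Return the lowest confidence among tokens overlapping any segment."""
    hits = [
        conf
        for seg_start, seg_end in segments
        for tok_start, tok_end, conf in tokens
        if tok_start < seg_end and tok_end > seg_start
    ]
    return min(hits) if hits else None


# A parent entity anchored to two text segments, and a child anchored to one.
parent_score = min_overlapping_confidence(tokens, [(0, 14), (21, 25)])
child_score = min_overlapping_confidence(tokens, [(15, 20)])
print(parent_score, child_score)  # 0.62 0.97
```

The parent picks up the weakest token inside its segments (0.62), while the child only sees the single token it overlaps.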


## Prerequisites

1. Vertex AI Notebook
2. Input JSON files
3. GCS bucket for reading the input JSON files and writing the output




## Step-by-step procedure

### 1. Input details


In [None]:
%pip install google-cloud-storage
%pip install tqdm

In [None]:
#!wget https://raw.githubusercontent.com/GoogleCloudPlatform/document-ai-samples/main/incubator-tools/best-practices/utilities/utilities.py

In [None]:
import json
import tqdm
from google.cloud import storage

# Replace with your Google Cloud Storage bucket and folder paths
source_bucket_name = "<input_bucket_name>"
source_folder_path = "<folder_path_without_bucket_name>"  # Folder containing jsons
destination_bucket_name = "<output_bucket_name>"
destination_folder_path = "<folder_path_without_bucket_name>"

### 2. Output

<img src="./Images/ocr_confidence_output_1.png" width=800 height=400 alt="Entity confidence output image">

<img src="./Images/ocr_confidence_output_2.png" width=800 height=400 alt="Child Entity confidence output image">

### 3. Run the functions

In [None]:
def get_token_range(jsonData: dict) -> dict:
    """Map each OCR token's character range to its text and confidence.

    Args:
        jsonData: The Document AI document loaded as a dict.

    Returns:
        A dict keyed by range(startIndex, endIndex), where each value holds
        the token's text and its OCR confidence.
    """

    tokenRange = {}
    full_text = jsonData["text"]
    for page in jsonData["pages"]:
        for token in page.get("tokens", []):
            layout = token["layout"]
            segment = layout["textAnchor"]["textSegments"][0]
            # startIndex may be absent from the JSON when it is 0.
            startIndex = int(segment.get("startIndex", 0))
            endIndex = int(segment["endIndex"])
            tokenRange[range(startIndex, endIndex)] = {
                # endIndex is exclusive, so the slice needs no +1.
                "token_text": full_text[startIndex:endIndex],
                "confidence": layout["confidence"],
            }
    return tokenRange


def find_min_confidence(token_ranges: dict, start_end_indices: list):
    """Get the minimum confidence among the tokens that overlap the segments.

    Args:
        token_ranges: Dictionary returned by get_token_range, keyed by each
            token's range(startIndex, endIndex).
        start_end_indices: List of (startIndex, endIndex) pairs, one per text
            segment of the entity.

    Returns:
        The lowest confidence of any overlapping token, or None if no token
        overlaps the given segments.
    """

    min_confidence = float("inf")

    for start, end in start_end_indices:
        for rng, info in token_ranges.items():
            # Two half-open ranges overlap when each starts before the other ends.
            if rng.start < int(end) and rng.stop > int(start):
                min_confidence = min(min_confidence, info["confidence"])
    # Return None rather than infinity so the saved output stays valid JSON.
    return None if min_confidence == float("inf") else min_confidence
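
To try the logic without a GCS bucket, the sketch below runs it on a hand-written document. `sample_doc` is an invented stand-in, not real processor output, and `token_spans` and `lowest_confidence` are condensed, hypothetical stand-ins for the two functions above so that the snippet runs on its own.

```python
# Hand-written stand-in for a Document AI JSON export (illustrative only).
sample_doc = {
    "text": "Total 42.00\n",
    "pages": [
        {
            "tokens": [
                {
                    "layout": {
                        # First token: startIndex left out, so it defaults to 0.
                        "textAnchor": {"textSegments": [{"endIndex": "6"}]},
                        "confidence": 0.98,
                    }
                },
                {
                    "layout": {
                        "textAnchor": {
                            "textSegments": [{"startIndex": "6", "endIndex": "12"}]
                        },
                        "confidence": 0.71,
                    }
                },
            ]
        }
    ],
    "entities": [
        {
            "type": "total_amount",
            "mentionText": "42.00",
            "textAnchor": {"textSegments": [{"startIndex": "6", "endIndex": "11"}]},
        }
    ],
}


def token_spans(doc):
    """Collect (start, end, confidence) for every token on every page."""
    spans = []
    for page in doc["pages"]:
        for token in page.get("tokens", []):
            seg = token["layout"]["textAnchor"]["textSegments"][0]
            spans.append(
                (
                    int(seg.get("startIndex", 0)),
                    int(seg["endIndex"]),
                    token["layout"]["confidence"],
                )
            )
    return spans


def lowest_confidence(spans, start, end):
    """Lowest confidence among tokens overlapping the half-open range [start, end)."""
    hits = [conf for s, e, conf in spans if s < end and e > start]
    return min(hits) if hits else None


spans = token_spans(sample_doc)
for entity in sample_doc["entities"]:
    seg = entity["textAnchor"]["textSegments"][0]
    entity["ocr_confidence_score"] = lowest_confidence(
        spans, int(seg.get("startIndex", 0)), int(seg["endIndex"])
    )

print(sample_doc["entities"][0]["ocr_confidence_score"])  # 0.71
```

Only the second token overlaps the entity's characters 6 to 11, so the entity receives that token's confidence.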

### 4. Run the code

<div style="background-color:#f5f569"><b>NOTE:</b> By default, the OCR score is written to a new "ocr_confidence_score" field. To overwrite the confidence score generated by the processor instead, change "ocr_confidence_score" to "confidence" in the code below.</div>
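
The difference between the two keys can be sketched as follows. `attach_score` is a hypothetical helper written for this illustration (the cell below assigns the key directly), and the entity values are made up.

```python
def attach_score(entity: dict, score: float, key: str = "ocr_confidence_score") -> dict:
    """Store score on the entity under key and return the same dict."""
    entity[key] = score
    return entity


# Made-up entity carrying the processor's own confidence.
entity = {"type": "invoice_id", "mentionText": "A-1001", "confidence": 0.93}

# Default key: the processor's confidence is kept and the OCR score sits next to it.
side_by_side = attach_score(dict(entity), 0.62)

# Key changed to "confidence": the processor's value is overwritten.
overwritten = attach_score(dict(entity), 0.62, key="confidence")

print(side_by_side)
print(overwritten)
```

Keeping the default key is the safer choice when the processor's original confidence is still needed downstream.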

In [None]:
# Initialize Google Cloud Storage client
client = storage.Client()

# List all JSON files in the source folder
source_bucket = client.get_bucket(source_bucket_name)
blobs = source_bucket.list_blobs(prefix=source_folder_path)
json_files = [blob.name for blob in blobs if blob.name.endswith(".json")]

# Process each JSON file
for json_file_path in json_files:
    blob = source_bucket.blob(json_file_path)
    json_data = json.loads(blob.download_as_text())

    # Documents without entities are written back unchanged
    entities = json_data.get("entities", [])
    token_ranges = get_token_range(json_data)

    # Iterate through entities
    for entity in entities:
        try:
            text_segments = entity["textAnchor"]["textSegments"]
            # startIndex may be missing from a segment when it is 0
            segment_indices = [
                (text_segment.get("startIndex", 0), text_segment["endIndex"])
                for text_segment in text_segments
            ]
            min_confidence = find_min_confidence(token_ranges, segment_indices)
            entity["ocr_confidence_score"] = min_confidence
        except KeyError:
            # Entity has no text anchor, so there is nothing to score
            pass

        # Child entities are scored one by one, so a child without a text
        # anchor does not stop the remaining children from being scored
        for lt in entity.get("properties", []):
            try:
                lt_text_segments = lt["textAnchor"]["textSegments"]
                segment_indices = [
                    (text_segment.get("startIndex", 0), text_segment["endIndex"])
                    for text_segment in lt_text_segments
                ]
                min_confidence = find_min_confidence(token_ranges, segment_indices)
                lt["ocr_confidence_score"] = min_confidence
            except KeyError:
                pass
    # Save modified JSON with added confidence scores
    # Swap only the first occurrence of the source folder in the path
    output_filename = json_file_path.replace(
        source_folder_path, destination_folder_path, 1
    )
    output_blob = client.bucket(destination_bucket_name).blob(output_filename)
    output_blob.upload_from_string(
        json.dumps(json_data, ensure_ascii=False).encode("utf-8"),
        content_type="application/json",
    )
    print(f"Done: {output_filename}")
print("Completed")
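
Once the files are written, the new field can be used to pick out entities for review. The following is a minimal sketch, not part of the tool: `low_confidence_entities` is a hypothetical helper, the 0.8 threshold is an arbitrary illustrative value, and `output_doc` is made-up data standing in for one processed JSON file.

```python
def low_confidence_entities(doc: dict, threshold: float = 0.8) -> list:
    """List (type, mentionText, score) for entities scoring below threshold."""
    flagged = []
    for entity in doc.get("entities", []):
        # Check the parent entity and each of its child entities.
        for item in [entity, *entity.get("properties", [])]:
            score = item.get("ocr_confidence_score")
            # Skip entities that were never scored (field missing or null).
            if isinstance(score, (int, float)) and score < threshold:
                flagged.append((item.get("type"), item.get("mentionText"), score))
    return flagged


# Made-up stand-in for one output document.
output_doc = {
    "entities": [
        {
            "type": "line_item",
            "mentionText": "Widget 4Z.00",
            "ocr_confidence_score": 0.95,
            "properties": [
                {
                    "type": "line_item/amount",
                    "mentionText": "4Z.00",
                    "ocr_confidence_score": 0.5,
                }
            ],
        },
        {"type": "supplier_name", "mentionText": "Acme"},
    ]
}

flagged = low_confidence_entities(output_doc)
print(flagged)  # [('line_item/amount', '4Z.00', 0.5)]
```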