# Text ordering by bounding box coordinates

* Author: docai-incubator@google.com

## Disclaimer

This tool is not supported by the Google engineering team or product team. It is provided and supported on a best-effort basis by the **DocAI Incubator Team**. No guarantees of performance are implied.

## Objective

This document provides a step-by-step guide on how to order text by bounding box coordinates using a Python notebook and a Document AI OCR output JSON file. 

## Prerequisites
* Access to a Vertex AI notebook or Google Colab
* Python
* A Document AI OCR processor output JSON file

## Step-by-step procedure

### 1. Importing Required Modules

In [None]:
!wget https://raw.githubusercontent.com/GoogleCloudPlatform/document-ai-samples/main/incubator-tools/best-practices/utilities/utilities.py

In [None]:
import json
from operator import itemgetter
from typing import Any, Dict, List, Optional, Sequence, Tuple, Union

### 2. Set up the inputs

In [None]:
# Path to the Document AI OCR output JSON file
input_filename = "your-input.json"

# Name of the text file that will store the reordered text
output_filename = "ocr_output.txt"
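
The script reads only a handful of fields from the OCR output. As a rough orientation, the sketch below builds a tiny synthetic document containing just those fields (`text`, `pages[].tokens[].layout.textAnchor.textSegments`, and `layout.boundingPoly.normalizedVertices`) and shows how a token's text is sliced out of the document-level `text`. The `sample_doc` values and the `token_text` helper are made up for illustration; real Document AI output carries many more fields. In JSON exports the indices commonly appear as strings and a zero `startIndex` may be absent, which would explain the `int(...)` conversion and the `0` default in the script; treat this as an assumption and check your own file.

```python
# Synthetic stand-in for a Document AI OCR output. Only the fields the
# script reads are included, and every value is invented for illustration.
sample_doc = {
    "text": "Invoice 42\n",
    "pages": [
        {
            "tokens": [
                {
                    "layout": {
                        # Assumption: a zero startIndex is omitted from the JSON.
                        "textAnchor": {"textSegments": [{"endIndex": "8"}]},
                        "boundingPoly": {
                            "normalizedVertices": [
                                {"x": 0.10, "y": 0.05},
                                {"x": 0.25, "y": 0.05},
                                {"x": 0.25, "y": 0.08},
                                {"x": 0.10, "y": 0.08},
                            ]
                        },
                    }
                },
                {
                    "layout": {
                        "textAnchor": {
                            "textSegments": [{"startIndex": "8", "endIndex": "11"}]
                        },
                        "boundingPoly": {
                            "normalizedVertices": [
                                {"x": 0.27, "y": 0.05},
                                {"x": 0.32, "y": 0.05},
                                {"x": 0.32, "y": 0.08},
                                {"x": 0.27, "y": 0.08},
                            ]
                        },
                    }
                },
            ]
        }
    ],
}


def token_text(token: dict, full_text: str) -> str:
    """Slice a token's text out of the document-level text (hypothetical helper)."""
    segment = token["layout"]["textAnchor"]["textSegments"][0]
    start = int(segment.get("startIndex", 0))
    end = int(segment.get("endIndex", 0))
    return full_text[start:end]


for token in sample_doc["pages"][0]["tokens"]:
    # The first vertex is used as the token position, as in the script.
    first_vertex = token["layout"]["boundingPoly"]["normalizedVertices"][0]
    print(repr(token_text(token, sample_doc["text"])), first_vertex["x"], first_vertex["y"])
```

The coordinates are normalized to the page, so both `x` and `y` fall between 0 and 1 regardless of the page's pixel size.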

### 3. Run the required functions

In [None]:
with open(input_filename, "r", encoding="utf-8") as f:
    json_data = json.load(f)


def clean_text(text: str) -> str:
    """
    Cleans the input text by removing extra spaces and joins the words with a single space.

    Args:
        text (str): The raw input text.

    Returns:
        str: The cleaned text with excess spaces removed.
    """
    return " ".join(text.split())


def get_line_key(y: float) -> int:
    """
    Returns a rounded integer value for the y-coordinate to group tokens by their approximate line position.

    Args:
        y (float): The y-coordinate of the token's position.

    Returns:
        int: The rounded y-coordinate value multiplied by 100 to represent the line key.
    """
    return round(y * 100)


def arrange_tokens(tokens: list, text: str) -> list[tuple[str, float, float]]:
    """
    Arranges tokens from the document by extracting their text and normalized coordinates.

    Args:
        tokens (list): A list of tokens from the document, where each token contains layout, textAnchor, and boundingPoly information.
        text (str): The full text of the document used to extract substrings based on token indices.

    Returns:
        list[tuple[str, float, float]]: A list of tuples where each tuple contains:
            - The cleaned substring of text corresponding to the token.
            - The x-coordinate of the token's position.
            - The y-coordinate of the token's position.
    """
    arranged_tokens = []
    for token in tokens:
        text_anchor = token.get("layout", {}).get("textAnchor", {})
        # Default to an empty list so tokens without a text anchor are skipped.
        text_segments = text_anchor.get("textSegments", [])
        if text_segments:
            start_index = int(text_segments[0].get("startIndex", 0))
            end_index = int(text_segments[0].get("endIndex", 0))
            substring = clean_text(text[start_index:end_index])

            bounding_poly = token.get("layout", {}).get("boundingPoly", {})
            normalized_vertices = bounding_poly.get("normalizedVertices", [])

            if normalized_vertices:
                x = normalized_vertices[0].get("x", 0)
                y = normalized_vertices[0].get("y", 0)
                arranged_tokens.append((substring, x, y))

    return arranged_tokens


def group_into_lines(arranged_tokens: list[tuple[str, float, float]]) -> list[str]:
    """
    Groups tokens into lines based on their y-coordinate and sorts them by their x-coordinate to reconstruct the line of text.

    Args:
        arranged_tokens (list[tuple[str, float, float]]): A list of tokens where each token is represented as a tuple containing:
            - The token text (str).
            - The x-coordinate (float) of the token's position.
            - The y-coordinate (float) of the token's position.

    Returns:
        list[str]: A list of strings where each string is a line of text reconstructed from the tokens.
    """
    lines = {}
    for token, x, y in arranged_tokens:
        line_key = get_line_key(y)
        if line_key not in lines:
            lines[line_key] = []
        lines[line_key].append((token, x))

    sorted_lines = []
    for line_key in sorted(lines.keys()):
        sorted_line = sorted(lines[line_key], key=itemgetter(1))
        line_text = " ".join(token for token, _ in sorted_line)
        sorted_lines.append(line_text)

    return sorted_lines


with open(output_filename, "w", encoding="utf-8") as output_file:
    for page_num, page in enumerate(json_data["pages"], 1):
        tokens = page.get("tokens", [])
        text = json_data.get("text", "")

        arranged_tokens = arrange_tokens(tokens, text)
        lines = group_into_lines(arranged_tokens)
        output_file.write(f"Page: {page_num}\n\n")
        for line in lines:
            output_file.write(f"{line}\n")
        output_file.write("\n" + "=" * 50 + "\n\n")

print(f"Output has been saved to {output_filename}")
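
To see the effect of the grouping in isolation, the following minimal sketch applies the same idea (bucket by `round(y * 100)`, then sort by `x` within each bucket) to a few synthetic `(text, x, y)` tuples listed out of reading order. The tuples and the `reorder` helper name are illustrative and not part of the notebook.

```python
from collections import defaultdict
from operator import itemgetter


def reorder(tokens):
    """Group (text, x, y) tuples into lines by rounded y, then sort each line by x."""
    lines = defaultdict(list)
    for text, x, y in tokens:
        lines[round(y * 100)].append((text, x))
    return [
        " ".join(text for text, _ in sorted(lines[key], key=itemgetter(1)))
        for key in sorted(lines)
    ]


# Synthetic tokens in a scrambled order; y values on the same visual line
# differ slightly, as they often do in OCR output.
tokens = [
    ("Total:", 0.10, 0.302),
    ("Date:", 0.10, 0.101),
    ("$99.00", 0.60, 0.298),
    ("2024-01-15", 0.60, 0.099),
]

for line in reorder(tokens):
    print(line)
# Date: 2024-01-15
# Total: $99.00
```

Tokens whose `y` values round to the same integer after scaling by 100 end up on the same line, so the bucket size is roughly one percent of the page height.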

### 4. Output

The screenshot below shows a before-and-after comparison in which the text is reordered using its bounding box coordinates.

<b>Comparison Between Input and Output File</b><br><br>
<h4><i>Post-processing results</i></h4><br>
Running the post-processing script against the input data produces the output text file. The following images show the difference.<br>
    
<table>
        <tr>
           <td><h3><b>Input JSON</b></h3></td>
            <td><h3><b>Output Text File</b></h3></td>
        </tr>
    <tr>
    <td><img src="./Images/input.png"></td>
    <td><img src="./Images/output.png"></td>
    </tr>
    </table>
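
One behavior worth keeping in mind when reading the output: because lines are keyed by `round(y * 100)`, two tokens on the same visual line can fall into different buckets when their `y` values straddle a rounding boundary (for example 0.104 and 0.106). The sketch below illustrates this and shows one possible alternative, a tolerance-based grouping. The `group_by_tolerance` helper and the `0.01` threshold are assumptions for illustration and are not part of the tool above.

```python
def group_by_tolerance(tokens, tolerance=0.01):
    """Group (text, x, y) tuples into lines.

    A token joins the current line when its y is within `tolerance` of the
    y of the first token on that line; otherwise it starts a new line.
    """
    lines = []
    for text, x, y in sorted(tokens, key=lambda t: t[2]):
        if lines and abs(y - lines[-1]["y"]) <= tolerance:
            lines[-1]["items"].append((text, x))
        else:
            lines.append({"y": y, "items": [(text, x)]})
    return [
        " ".join(text for text, _ in sorted(line["items"], key=lambda item: item[1]))
        for line in lines
    ]


# Two synthetic tokens that sit on the same visual line.
tokens = [("world", 0.30, 0.106), ("Hello", 0.10, 0.104)]

# The rounded keys differ (10 vs. 11), so bucket-based grouping splits them.
print(round(0.104 * 100), round(0.106 * 100))

# Tolerance-based grouping keeps them together.
print(group_by_tolerance(tokens))
```

A suitable tolerance depends on font size and page layout, so the value would need tuning per document type.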