# Synonyms Based Splitter Document Labeling 

* Author: docai-incubator@google.com

## Disclaimer

This tool is not supported by the Google engineering team or product team. It is provided and supported on a best-effort basis by the DocAI Incubator Team. No guarantees of performance are implied.

## Objective

This notebook automates document labeling for a Custom Document Splitter parser using a synonyms-based approach. It compares a user-defined list of keywords against the OCR-extracted text to identify and label document segments. An optional flag also splits the PDFs and saves them into GCS folders named after the labels.

In this context, "synonyms" are the keywords that are matched against the text extracted from the documents. The tool searches the OCR text for these keywords and creates splitter entities, the markers used to categorize and split the document.

### Practical Application

The tool labels document sections by searching OCR text for user-provided synonyms, streamlining the process of splitting and categorizing documents based on their content.
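
A minimal sketch of this kind of search, assuming the OCR text is available as a plain Python string. The function name `find_keyword_spans` and the sample text are hypothetical and are not part of the notebook, whose own matching logic is defined in step 3.

```python
import re
from typing import Dict, List, Tuple


def find_keyword_spans(ocr_text: str, keywords: List[str]) -> Dict[str, List[Tuple[int, int]]]:
    """Return the (start, end) character offsets of every keyword occurrence."""
    spans = {}
    for keyword in keywords:
        # re.escape keeps characters such as "." or "(" in a keyword literal
        pattern = re.compile(re.escape(keyword), re.IGNORECASE)
        spans[keyword] = [(m.start(), m.end()) for m in pattern.finditer(ocr_text)]
    return spans


sample_text = "Part A: Overview\nDetails...\nPART B: Costs\n"
print(find_keyword_spans(sample_text, ["PART A", "PART B"]))
```

The offsets are what later allow a match to be tied back to the page whose text range contains it.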

### Examples

- **Example 1:** With `synonyms_list=['PART A','PART B','PART C','PART D']`, if both "PART A" and "PART B" are found, the page is labeled as "PART A".
- **Example 2:** For `synonyms_list=['INTRODUCTION', 'EXECUTIVE SUMMARY', 'CONCLUSION']`, if "EXECUTIVE SUMMARY" and "CONCLUSION" are found, it labels the page as "EXECUTIVE SUMMARY".

Priority is given to the synonym appearing first in the list when multiple matches occur on a page, ensuring consistent labeling.
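
The priority rule can be sketched as follows. This illustrates the rule as described here and is not the notebook's implementation; `pick_label` is a hypothetical helper that uses a plain case-insensitive substring check.

```python
from typing import List, Optional


def pick_label(page_text: str, synonyms_list: List[str]) -> Optional[str]:
    """Return the first synonym, in list order, that occurs in the page text."""
    lowered = page_text.lower()
    for synonym in synonyms_list:  # list order defines the priority
        if synonym.lower() in lowered:
            return synonym
    return None  # no synonym on this page


synonyms_list = ["INTRODUCTION", "EXECUTIVE SUMMARY", "CONCLUSION"]
page_text = "Conclusion ... see also the Executive Summary"
print(pick_label(page_text, synonyms_list))
```

Here "CONCLUSION" appears first in the page text, but "EXECUTIVE SUMMARY" comes first in the list, so it is the label that is chosen.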

## Prerequisites
* Python : Jupyter notebook (Vertex AI) 
* Service account permissions in projects.



## Step-by-Step Procedure

### 1. Importing Required Modules

In [None]:
!wget https://raw.githubusercontent.com/GoogleCloudPlatform/document-ai-samples/main/incubator-tools/best-practices/utilities/utilities.py

In [None]:
#importing libraries
import re
from utilities import *
from google.cloud import documentai_v1beta3 as documentai
import json
from pathlib import Path
from typing import Any, Dict, List, Optional, Sequence, Tuple, Union
from google.cloud import storage
from google.cloud.exceptions import Conflict, NotFound
from PIL import Image
import io

### 2. Set up the required inputs
* `project_id`: Your Google Cloud project ID.
* `synonyms_list`: List of synonyms (keywords) to search for in the OCR text when splitting documents.
* `gcs_input_uri`: GCS path of the OCR-parsed JSON results.
* `gcs_output_uri`: GCS path where the updated JSON files are saved.
* `save_split_pdfs_flag`: Set to `'TRUE'` to save the split PDFs in the GCS bucket.
* `pdfs_ouput_path`: GCS path where the split PDF files are saved.
* `synonym_entity_name`: Entity type assigned to the matched synonyms (the CDE-format entities).
* `label_unidentified_entity_name`: Default label for the leading pages in which no synonym is found.
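
The GCS paths are later split into a bucket name and an object prefix, and the prefix is concatenated directly with folder and file names, so each path should end with `/`. A small sketch of that split, using a hypothetical helper `split_gcs_uri` that is not part of the notebook:

```python
from typing import Tuple


def split_gcs_uri(gcs_uri: str) -> Tuple[str, str]:
    """Split 'gs://bucket/some/prefix/' into ('bucket', 'some/prefix/')."""
    if not gcs_uri.startswith("gs://"):
        raise ValueError(f"Not a GCS URI: {gcs_uri}")
    parts = gcs_uri.split("/")  # ['gs:', '', 'bucket', 'some', 'prefix', '']
    bucket_name = parts[2]
    prefix = "/".join(parts[3:])
    return bucket_name, prefix


bucket, prefix = split_gcs_uri("gs://my-bucket/ocr/output/")
print(bucket, prefix)
```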


In [None]:
project_id='xxxx-xxxx-xxxx' 
synonyms_list=['PART A','PART B','PART C','PART D'] 
gcs_input_uri="gs://xxxx/xxxx/xxxx/"
gcs_output_uri='gs://xxxx/xxxx/xxx/' 
save_split_pdfs_flag='TRUE'
pdfs_ouput_path='gs://xxxx/xxxx/xxx/' 
synonym_entity_name='synonym_entity' 
label_unidentified_entity_name="label_unidentified"

### Function to parse raw PDFs of up to 200 pages into a single JSON
Use the function below to parse documents with more than 10 pages so that the output is written to a single JSON file, then provide the output path as `gcs_input_uri` in the cell above.

In [None]:
# BATCH PROCESSING FUNCTION WITH SHARDING TILL 200 pages
def batch_process_documents(
    project_id : str,
    location : str,
    processor_id : str,
    gcs_input_uri : str,
    gcs_output_uri : str,
    timeout: int = 600,
) -> Any:
    """It will perform Batch Process on raw input documents

    Args:
        project_id (str): GCP project ID
        location (str): Processor location us or eu
        processor_id (str): GCP DocumentAI ProcessorID
        gcs_input_uri (str): GCS path which contains all input files
        gcs_output_uri (str): GCS path to store processed JSON results
        timeout (int, optional): Maximum waiting time for operation to complete.

    Returns:
        operation.Operation: LRO operation ID for current batch-job
    """

    # You must set the api_endpoint if you use a location other than 'us', e.g.:
    opts = {}
    if location == "eu":
        opts = {"api_endpoint": "eu-documentai.googleapis.com"}
    elif location == "us":
        opts = {"api_endpoint": "us-documentai.googleapis.com"}
    client = documentai.DocumentProcessorServiceClient(client_options=opts)


    input_config= documentai.BatchDocumentsInputConfig(gcs_prefix=documentai.GcsPrefix(gcs_uri_prefix=gcs_input_uri))
    
    sharding_config = documentai.DocumentOutputConfig.GcsOutputConfig.ShardingConfig(pages_per_shard=200)
    gcs_output_config = documentai.DocumentOutputConfig.GcsOutputConfig(
        gcs_uri=gcs_output_uri, sharding_config=sharding_config
    )

    output_config = documentai.DocumentOutputConfig(
        gcs_output_config=gcs_output_config
    )

    # Location can be 'us' or 'eu'
    name = f"projects/{project_id}/locations/{location}/processors/{processor_id}"
    request = documentai.types.document_processor_service.BatchProcessRequest(
        name=name,
        input_documents=input_config,
        document_output_config=output_config,
    )

    operation = client.batch_process_documents(request)

    # Wait for the operation to finish
    operation.result(timeout=timeout)
    return operation
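
As a rough illustration of what `pages_per_shard` changes, the sketch below estimates how many output JSON files a document produces for a given shard size. `estimate_shard_count` is a hypothetical helper, and the calculation assumes each shard holds at most `pages_per_shard` pages with no overlap between shards.

```python
import math


def estimate_shard_count(total_pages: int, pages_per_shard: int = 200) -> int:
    """Estimate the number of output JSON shards for one document."""
    if total_pages < 1 or pages_per_shard < 1:
        raise ValueError("total_pages and pages_per_shard must be positive")
    return math.ceil(total_pages / pages_per_shard)


for pages in (10, 200, 201):
    print(pages, "pages ->", estimate_shard_count(pages), "shard(s)")
```

Under that assumption, a document of up to 200 pages stays in one JSON file, which is what the later steps expect as input.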

### 3. Define the required functions and run them

In [None]:
def get_text_anchors_page_wise(json_ocr : object) -> Dict:
    """
    Get text anchors for each page in the OCR result.

    Args:
        json_ocr (object): The OCR result in Document AI Document format.

    Returns:
        Dict : A dictionary where keys are page numbers (0-indexed),
            and values are dictionaries with 'start_index' and 'end_index' for each page's text anchor.
    """
    
    #Getting text anchors
    p=0
    text_anchors_page_wise={}
    for page in json_ocr.pages:
        for an in page.layout.text_anchor.text_segments:
            start_index=an.start_index
            end_index=an.end_index
        text_anchors_page_wise[p]={'start_index':start_index,'end_index':end_index}
        p+=1
    return text_anchors_page_wise

#getting text anchors of matches with synonyms
def find_substring_indexes(text : str, substring : str) -> List[Tuple[int, int]]:
    """
    Find the starting and ending indexes of occurrences of a substring in the given text.

    Args:
        text (str): The input text where substring needs to be found.
        substring (str): The substring to be searched in the text.

    Returns:
        List[Tuple[int, int]]: A list of (start, end) index tuples, one per occurrence.
    """
    
    # Join the words with \s+ so a multi-word synonym still matches when OCR puts
    # a line break (or extra spaces) between its words.
    words = substring.split()
    if not words:
        return []
    pattern = re.compile(r'\s+'.join(re.escape(word) for word in words), re.IGNORECASE)
    matches = [(match.start(), match.end()) for match in pattern.finditer(text)]

    return matches

def get_synonyms_matches_pages(synonyms_list : List[str], text_anchors_page_wise : Dict[int, Dict[str, int]], json_ocr : object) -> Tuple:
    """
    Find matches of synonyms in the OCR text and associate them with corresponding pages.

    Args:
        synonyms_list (List[str]): List of synonyms to be searched in the OCR text.
        text_anchors_page_wise (Dict[int, Dict[str, int]]): Text anchors with start and end indexes for each page.
        json_ocr (object): JSON representation of the OCR output.

    Returns:
        Tuple : A tuple containing:
            - A dictionary with synonyms as keys and lists of pages where they are found, sorted in ascending order.
            - A dictionary with synonym information, including text anchors and corresponding pages.
    """
    
    matches_synonyms={}
    synonym_info={}
    for synonym in synonyms_list:
        # Replace punctuation with spaces; lengths are unchanged, so match indexes still map onto json_ocr.text
        pattern = re.compile(r'[^a-zA-Z0-9\s]')
        matches_list=find_substring_indexes(re.sub(pattern, ' ', json_ocr.text),re.sub(pattern, ' ', synonym))
        for match in matches_list:
            for p1,anc in text_anchors_page_wise.items():
                if match[0]>=anc['start_index'] and match[1]<=anc['end_index']:
                    if synonym in matches_synonyms.keys():
                        matches_synonyms[synonym].append(p1)
                        synonym_info[synonym].append({'text_anchors':{'start_index':match[0],'end_index':match[1]},'page':p1})
                    else:
                        matches_synonyms[synonym]=[p1]
                        synonym_info[synonym]=[{'text_anchors':{'start_index':match[0],'end_index':match[1]},'page':p1}]
    matches_synonyms_updated = {key: sorted(list(set(value))) for key, value in matches_synonyms.items()}
    synonym_wise_data={}
    temp_pages=[]
    temp_permanant=[]
    temp_page=-1
    asssigned_pages=list(set(value for values_list in matches_synonyms_updated.values() for value in values_list))
    unassigned_pages=[]
    synonym_assigned=''
    for page_num in range(len(json_ocr.pages)):
        asssigned_flag='NO'
        for synonym_1,pages_available in matches_synonyms_updated.items():
            if synonym_assigned=='':
                synonym_assigned=synonym_1
            if page_num in pages_available:
                if temp_page<page_num:
                    if len(temp_pages)==0:
                        temp_pages=[page_num]
                    else:
                        temp_pages.append(page_num)
                    temp_page=page_num
                if synonym_1 in synonym_wise_data.keys() and len(temp_pages)>0:
                    synonym_wise_data[synonym_1].append(temp_pages)
                    temp_permanant.append(temp_pages)
                elif len(temp_pages)>0:
                    synonym_wise_data[synonym_1]=[temp_pages]
                temp_pages=[]
                asssigned_flag='YES'
        if asssigned_flag=='NO':
            unassigned_pages.append(page_num)

    for unass_page in unassigned_pages:
        closest_list=''
        closest_synonym=''
        min_diff=float('inf')  # no distance cap: any later page inherits the closest preceding label
        for syn_2,pagass_list in synonym_wise_data.items():
            for pag_ass in pagass_list:
                for p_n1 in pag_ass:
                    if p_n1<unass_page:
                        if min_diff>unass_page-p_n1:
                            min_diff=unass_page-p_n1
                            closest_list=pag_ass
                            closest_synonym=syn_2
                        else:
                            continue
        if closest_synonym!='':
            #print(closest_list)
            for syn_2,pagass_list in synonym_wise_data.items():
                for pag_ass in pagass_list:
                    if syn_2==closest_synonym and pag_ass==closest_list:
                        pag_ass.append(unass_page)
        else:
            if label_unidentified_entity_name in synonym_wise_data.keys():
                synonym_wise_data[label_unidentified_entity_name].append([unass_page])
            else:
                synonym_wise_data[label_unidentified_entity_name]=[[unass_page]]


    data = {part: [list(set(sublist)) for sublist in lists] for part, lists in synonym_wise_data.items()}
    for part, part_data in data.items():
        for sublist in part_data:
            sublist.sort()
    return data,synonym_info

def remove_repeated_pages(data : Dict[str, List[List[int]]]) -> Dict[str, List[List[int]]]:
    """
    Remove repeated page numbers from the provided data.

    Args:
        data (Dict[str, List[List[int]]]): Dictionary containing part-wise data with lists of page numbers.

    Returns:
        Dict[str, List[List[int]]]: Modified dictionary with repeated page numbers removed from the lists.
    """
    
    all_numbers = []
    unique_numbers = set()
    repeated_numbers = set()

    for part_data in data.values():
        for item in part_data:
            if isinstance(item, list):
                flat_list = item
            else:
                flat_list = [item]
            all_numbers.extend(flat_list)

    for num in all_numbers:
        if num in unique_numbers:
            repeated_numbers.add(num)
        else:
            unique_numbers.add(num)
    
    for part, part_data in data.items():
        for sublist in part_data:
            if isinstance(sublist, list) and len(sublist) > 1 and sublist[0] in repeated_numbers:
                sublist.pop(0)
    return data

def store_blob(bytes_stream: bytes, file: str ,BUCKET_NAME: str) -> None:
    """To store PDF files in GCS

    Args:
        bytes_stream (bytes): Binary Format of pdf data
        file (str): filename to store in specified GCS bucket
        BUCKET_NAME (str): Name of the GCS bucket to upload to
    """

    storage_client = storage.Client()
    result_bucket = storage_client.get_bucket(BUCKET_NAME)
    document_blob = storage.Blob(name=str(file), bucket=result_bucket)
    document_blob.upload_from_string(bytes_stream, content_type="application/pdf")

def save_split_pdfs(json_ocr : object, pdfs_ouput_path : str, file_name : str, synonym_tag : Optional[str]) -> None:
    """
    Save split PDFs based on OCR data.

    Args:
        json_ocr (object): JSON OCR data.
        pdfs_ouput_path (str): Output path for storing the PDFs.
        file_name (str): Base file name for the saved PDFs.
        synonym_tag (Optional[str]): 'YES' if synonyms were found in the document, 'NO' otherwise.

    Returns:
        None
    """
    
    if synonym_tag=='YES':
        for entity in json_ocr.entities:
            if entity.type!=synonym_entity_name:
                pages_new=[]
                for p_num in entity.page_anchor.page_refs:
                    # print(p_num)
                    pages_new.append(p_num.page)
                pages_image=[]
                for page_num1 in range(len(json_ocr.pages)):
                    if page_num1 in pages_new:
                        pages_image.append(json_ocr.pages[page_num1].image.content)
                folder_name=entity.type
                synthesized_images = [decode_image(page) for page in pages_image]
                pdf_bytes = create_pdf_from_images(synthesized_images)
                from datetime import datetime
                current_time = datetime.now()
                time_stamp_1=int(current_time.timestamp())
                file_save_path=('/').join(pdfs_ouput_path.split('/')[3:])+str(folder_name)+'/'+file_name+'_'+str(time_stamp_1)+'.pdf'
                BUCKET_NAME=pdfs_ouput_path.split('/')[2]
                store_blob(bytes_stream= pdf_bytes, file=file_save_path ,BUCKET_NAME=BUCKET_NAME)

    elif synonym_tag=='NO':
        synthesized_images = [decode_image(page.image.content) for page in json_ocr.pages]
        pdf_bytes = create_pdf_from_images(synthesized_images)
        from datetime import datetime
        current_time = datetime.now()
        time_stamp_1=int(current_time.timestamp())
        file_save_path=('/').join(pdfs_ouput_path.split('/')[3:])+label_unidentified_entity_name+'/'+file_name+'_'+str(time_stamp_1)+'.pdf'
        BUCKET_NAME=pdfs_ouput_path.split('/')[2]
        store_blob(bytes_stream= pdf_bytes, file=file_save_path ,BUCKET_NAME=BUCKET_NAME)

def decode_image(image_bytes: bytes) -> Image.Image:
    """
    Decode image bytes into a Pillow Image object.

    Args:
        image_bytes (bytes): The image bytes to be decoded.

    Returns:
        Image.Image: The Pillow Image object.
    """
    
    with io.BytesIO(image_bytes) as image_file:
        image = Image.open(image_file)
        image.load()
    return image

def create_pdf_from_images(images: Sequence[Image.Image]) -> bytes:
    """Creates a PDF from a sequence of images.

    The PDF will contain 1 page per image, in the same order.

    Args:
      images: A sequence of images.

    Returns:
      The PDF bytes.
    """
    
    if not images:
        raise ValueError("At least one image is required to create a PDF")

    # PIL PDF saver does not support RGBA images
    images = [
        image.convert("RGB") if image.mode == "RGBA" else image for image in images
    ]

    with io.BytesIO() as pdf_file:
        images[0].save(
            pdf_file, save_all=True, append_images=images[1:], format="PDF"
        )
        return pdf_file.getvalue()
    
def create_splitter_entities(json_ocr : object, synonyms_list : List[str], synonym_entity_name : str) -> Tuple[object, str]:
    """
    Creates splitter entities based on the identified synonyms in the OCR output.

    Args:
        json_ocr (object): The OCR output in the form of a Document AI document.
        synonyms_list (List[str]): List of synonyms to be identified in the OCR output.
        synonym_entity_name (str): Name to be assigned to the entity representing identified synonyms.

    Returns:
        Tuple[object, str]: A tuple containing the updated OCR document and a tag indicating
        whether synonyms were found ('YES') or not ('NO').
    """
    
    text_anchors_page_wise=get_text_anchors_page_wise(json_ocr)
    synonyms_pages,synonym_info=get_synonyms_matches_pages(synonyms_list,text_anchors_page_wise,json_ocr)
    if len(synonyms_pages)>0:
        data=remove_repeated_pages(synonyms_pages)
        max_page=len(text_anchors_page_wise)
        max_page_list = max([max(sublist, default=0) for sublist in sum(data.values(), [])])
        entities_splitter=[]
        for synonym,pages_nested_list in data.items():
            for pages_list in pages_nested_list:
                temp_splitter_entity={'type_':'','text_anchor':{'text_segments':[]},'page_anchor':{'page_refs':[]}}
                # if max_page_list in pages_list:
                #     pages_list.extend(range(max_page_list, max_page))
                #     # print(pages_list)
                if len(pages_list)>=1:
                    sorted_pages=sorted(pages_list)
                    start_index_ent=text_anchors_page_wise[sorted_pages[0]]['start_index']
                    end_index_ent=text_anchors_page_wise[sorted_pages[-1]]['end_index']
                for page_2 in pages_list:
                    temp_splitter_entity['page_anchor']['page_refs'].append({'page':page_2})
                temp_splitter_entity['text_anchor']['text_segments'].append({'start_index':start_index_ent,'end_index':end_index_ent})
                temp_splitter_entity['type_']=''.join(['_' if c.isspace() or not c.isalnum() else c for c in synonym])
                entities_splitter.append(temp_splitter_entity)
        
        json_ocr.entities=entities_splitter
        json_ocr=add_ent_json(json_ocr,synonym_info,synonym_entity_name)
        synonym_tag='YES'
    else:
        # print('NO OUTPUT')
        synonym_tag='NO'
    
    return json_ocr,synonym_tag

def get_new_entity(syn_data_1 : Dict[str, Any], json_ocr : object, synonym_entity_name : str) -> Dict[str, Any]:
    """
    Creates a new entity based on the provided synonym data and the OCR output.

    Args:
        syn_data_1 (Dict[str, Any]): Information about the synonym data, including text anchors and page number.
        json_ocr (object): The OCR output in the form of a Document AI document.
        synonym_entity_name (str): Name to be assigned to the new entity.

    Returns:
        Dict[str, Any]: A dictionary representing the new entity with mention text, page anchors, text anchors, and type.
    """
    
    text_anchors_temp= syn_data_1['text_anchors']
    page_num=syn_data_1['page']
    # synonym_entity_name='synonym_entity'
    new_ent={'mention_text':'','page_anchor':{'page_refs':[{'bounding_poly':{'normalized_vertices':[]},'page': page_num}]},'text_anchor':{'text_segments': []},'type_':''}
    entity_text_anc=[]
    page_anc={'x':[],'y':[]}
    for page in json_ocr.pages:
        # print(page.page_number)
        if page_num==page.page_number-1:
            # print(page.page_number)
            for token in page.tokens:
                # print(token)
                token_seg=token.layout.text_anchor.text_segments
                for seg in token_seg:
                    token_start=seg.start_index
                    token_end=seg.end_index
                if token_start>=text_anchors_temp['start_index']-3 and token_end<=text_anchors_temp['end_index']+2:
                    if json_ocr.text[token_start:token_end].replace(" ", "") in json_ocr.text[text_anchors_temp['start_index']:text_anchors_temp['end_index']].replace(" ", ""):
                        vertices = token.layout.bounding_poly.normalized_vertices
                        minx_token, miny_token = min(point.x for point in vertices), min(point.y for point in vertices)
                        maxx_token, maxy_token = max(point.x for point in vertices), max(point.y for point in vertices)
                        entity_text_anc.append({'start_index':token_start,'end_index':token_end})
                        page_anc['x'].extend([minx_token,maxx_token])
                        page_anc['y'].extend([miny_token,maxy_token])
    new_ent['mention_text']=json_ocr.text[text_anchors_temp['start_index']:text_anchors_temp['end_index']]
    # Vertices in clockwise order: top-left, top-right, bottom-right, bottom-left
    page_anchors_ent=[{'x':min(page_anc['x']),'y':min(page_anc['y'])},{'x':max(page_anc['x']),'y':min(page_anc['y'])},
                     {'x':max(page_anc['x']),'y':max(page_anc['y'])},{'x':min(page_anc['x']),'y':max(page_anc['y'])}]
    new_ent['page_anchor']['page_refs'][0]['bounding_poly']['normalized_vertices']=page_anchors_ent
    new_ent['text_anchor']['text_segments']=entity_text_anc
    new_ent['type_']=synonym_entity_name
    
    return new_ent

def create_cde_entities(synonym_info : Dict[str, List[Dict[str, Any]]], json_ocr : object, synonym_entity_name : str) ->List[Dict[str, Any]]:
    """
    Creates CDE entities based on the synonym information and OCR output.

    Args:
        synonym_info (Dict[str, List[Dict[str, Any]]]): Information about synonyms and their occurrences in the OCR output.
        json_ocr (object): The OCR output in the form of a Document AI document.
        synonym_entity_name (str): Name to be assigned to the synonym entity.

    Returns:
        List[Dict[str, Any]]: A list of dictionaries representing CDE entities with mention text, page anchors, text anchors, and type.
    """
    
    cde_entities=[] 
    for syn,tag in synonym_info.items():
        for item in tag:
            try:
                cde_ent=get_new_entity(item,json_ocr,synonym_entity_name)
                cde_entities.append(cde_ent)
            except Exception:
                # No bounding box could be built for this occurrence (e.g. no token matched); skip it
                continue
    return cde_entities


def add_ent_json(json_ocr : object, synonym_info : Dict[str, List[Dict[str, Any]]], synonym_entity_name : str) -> object:
    """
    Adds CDE entities to the Document AI document based on synonym information.

    Args:
        json_ocr (object): The OCR output in the form of a Document AI document.
        synonym_info (Dict[str, List[Dict[str, Any]]]): Information about synonyms and their occurrences in the OCR output.
        synonym_entity_name (str): Name to be assigned to the synonym entity.

    Returns:
        object : The updated Document AI document with added CDE entities.
    """
    
    cde_entities=create_cde_entities(synonym_info,json_ocr,synonym_entity_name)
    for ent_cde in cde_entities:
        json_ocr.entities.append(ent_cde)
    
    return json_ocr

def main():
    files_name_list,files_path_dict=file_names(gcs_input_uri)
    for i in range(len(files_name_list)):
        #print(file_name_list[i])
        file_path='gs://'+gcs_input_uri.split('/')[2]+'/'+files_path_dict[files_name_list[i]]
        print(file_path)
        json_ocr=documentai_json_proto_downloader(file_path.split('/')[2],('/').join(file_path.split('/')[3:]))
        json_ocr,synonym_tag=create_splitter_entities(json_ocr,synonyms_list,synonym_entity_name)
        if save_split_pdfs_flag=='TRUE':
            save_split_pdfs(json_ocr,pdfs_ouput_path,files_name_list[i],synonym_tag)
        store_document_as_json(documentai.Document.to_json(json_ocr),gcs_output_uri.split('/')[2],('/').join(gcs_output_uri.split('/')[3:])+files_name_list[i])
        
main()
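
The page-assignment logic in `get_synonyms_matches_pages` is dense, so here is a simplified sketch of the underlying idea: a page without a match inherits the label of the closest preceding matched page, and leading pages with no match get the default label. It is a standalone illustration rather than a drop-in replacement; `assign_labels` and its inputs are hypothetical.

```python
from typing import Dict, List


def assign_labels(num_pages: int, matched: Dict[int, str], default_label: str) -> List[str]:
    """Label every page, carrying the last matched label forward."""
    labels = []
    current = default_label  # used until the first match appears
    for page in range(num_pages):
        if page in matched:
            current = matched[page]
        labels.append(current)
    return labels


# Pages 1 and 3 contain a synonym; pages 0, 2 and 4 do not.
print(assign_labels(5, {1: "PART A", 3: "PART B"}, "label_unidentified"))
```

Consecutive pages that share a label then form one split segment.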

### Output

The entities are added to the JSON files, which are saved to the output GCS path.

* CDE-format entities with the entity type `synonym_entity`, as shown below


<img src="./Images/cde_entity.png" width=800 height=400 alt="cde_entity"></img>
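
For reference, the bounding box of such an entity can be derived from the boxes of the tokens it covers. A minimal sketch, assuming each token box is a list of `(x, y)` normalized vertices; `union_box` is a hypothetical helper, and it returns the vertices clockwise from the top-left corner.

```python
from typing import List, Tuple

Vertex = Tuple[float, float]


def union_box(token_boxes: List[List[Vertex]]) -> List[Vertex]:
    """Return the rectangle enclosing all token boxes, clockwise from top-left."""
    xs = [x for box in token_boxes for x, _ in box]
    ys = [y for box in token_boxes for _, y in box]
    if not xs:
        raise ValueError("At least one token box is required")
    return [(min(xs), min(ys)), (max(xs), min(ys)), (max(xs), max(ys)), (min(xs), max(ys))]


# Two adjacent tokens, e.g. "PART" and "A", on the same line
part = [(0.10, 0.20), (0.18, 0.20), (0.18, 0.23), (0.10, 0.23)]
letter = [(0.19, 0.20), (0.21, 0.20), (0.21, 0.23), (0.19, 0.23)]
print(union_box([part, letter]))
```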

* Splitter-format entities whose entity type matches the given labels (synonyms)

<img src="./Images/splitter_entity.png" width=800 height=400 alt="splitter_entity"></img>
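
A splitter entity for a page range can be sketched as a plain dictionary. The helper `build_splitter_entity` below is hypothetical; it mirrors the field names used in this notebook (`type_`, `page_anchor`, `page_refs`) and replaces every non-alphanumeric character in the label with `_`.

```python
from typing import Any, Dict, List


def build_splitter_entity(label: str, pages: List[int]) -> Dict[str, Any]:
    """Build a splitter-style entity dict for a label and its pages."""
    # Same sanitization as the notebook: "PART A" becomes "PART_A"
    entity_type = "".join(c if c.isalnum() else "_" for c in label)
    return {
        "type_": entity_type,
        "page_anchor": {"page_refs": [{"page": p} for p in sorted(pages)]},
    }


entity = build_splitter_entity("PART A", [2, 0, 1])
print(entity)
```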


* If `save_split_pdfs_flag` is `TRUE`, the split PDFs are saved to the provided GCS path, in folders named after the labels

<img src="./Images/folders.png" width=800 height=400 alt="folders_image"></img>

If a document does not contain any of the given synonyms, it is saved in the `label_unidentified` folder.

The files are named `<filename>_<timestamp>.pdf`.

<img src="./Images/pdf_split.png" width=800 height=400 alt="pdf_split"></img>
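
The destination object name can be sketched as below. `build_pdf_blob_name` is a hypothetical helper that follows the naming used by `save_split_pdfs` (label folder, then file name, underscore, Unix timestamp); the timestamp is passed in so the result is reproducible.

```python
import time
from typing import Optional


def build_pdf_blob_name(prefix: str, label: str, file_name: str, timestamp: Optional[int] = None) -> str:
    """Return '<prefix><label>/<file_name>_<timestamp>.pdf'."""
    if timestamp is None:
        timestamp = int(time.time())  # seconds since the Unix epoch
    return f"{prefix}{label}/{file_name}_{timestamp}.pdf"


print(build_pdf_blob_name("split/output/", "PART_A", "report", timestamp=1700000000))
```

Because the timestamp has one-second resolution, two segments with the same label and file name written within the same second would map to the same object name.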