# Categorizing Bank Statement Transactions by Account Number

* Author: docai-incubator@google.com

## Disclaimer

This tool is not supported by the Google engineering team or product team. It is provided and supported on a best-effort basis by the **DocAI Incubator Team**. No guarantees of performance are implied.

# Objective
This document shows how to categorize the transactions for each account number in the JSON output of the Bank Statement Parser.


# Prerequisites
* Python: Jupyter notebook (Vertex AI)
* GCS storage bucket
* Bank Statement Parser

# Step by Step Procedure

## 1. Import Modules/Packages

In [1]:
%pip install google-cloud-documentai --quiet
%pip install google-cloud-documentai-toolbox --quiet

Note: you may need to restart the kernel to use updated packages.


In [3]:
!wget https://raw.githubusercontent.com/GoogleCloudPlatform/document-ai-samples/main/incubator-tools/best-practices/utilities/utilities.py

--2024-01-12 12:45:47--  https://raw.githubusercontent.com/GoogleCloudPlatform/document-ai-samples/main/incubator-tools/best-practices/utilities/utilities.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 29735 (29K) [text/plain]
Saving to: ‘utilities.py’


2024-01-12 12:45:47 (13.3 MB/s) - ‘utilities.py’ saved [29735/29735]



In [4]:
import re
from collections import Counter, defaultdict
from difflib import SequenceMatcher
from typing import Dict, List, Union

from google.cloud import documentai_v1beta3 as documentai
from google.cloud.documentai_toolbox import gcs_utilities

import utilities

## 2. Input Details

`gcs_input_path`: GCS path that contains the Bank Statement Parser JSON files  
`gcs_output_path`: GCS path where the post-processed JSON results are stored

In [5]:
# GCS path containing the Bank Statement Parser JSON files
gcs_input_path = "gs://bucket/path_to/pre/input"
# GCS path where the post-processed JSON files are stored
gcs_output_path = "gs://bucket/path_to/post/output/"
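
The notebook later separates each of these paths into a bucket name and an object prefix with `gcs_utilities.split_gcs_uri`. As a rough illustration of what that separation means, here is a minimal standard-library sketch. The helper name `split_gcs_path` is hypothetical and is not the toolbox implementation; it only assumes that a path has the form `gs://<bucket>/<prefix>`.

```python
from typing import Tuple


def split_gcs_path(gcs_path: str) -> Tuple[str, str]:
    """Split a gs:// path into (bucket, prefix). Hypothetical helper for illustration."""
    scheme = "gs://"
    if not gcs_path.startswith(scheme):
        raise ValueError(f"Expected a path starting with {scheme!r}, got {gcs_path!r}")
    # Everything up to the first "/" is the bucket; the rest is the object prefix.
    bucket, _, prefix = gcs_path[len(scheme):].partition("/")
    return bucket, prefix


bucket, prefix = split_gcs_path("gs://bucket/path_to/post/output/")
print(bucket, prefix)
```

Keeping the trailing slash on `gcs_output_path` is harmless here, because the main loop strips it before building the output file name.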

## 3. Run Below Code-cells

In [7]:
def del_ent_attrs(ent: documentai.Document.Entity) -> None:
    """Delete the empty attributes of an Entity object.

    Args:
        ent (documentai.Document.Entity): DocumentAI Entity object
    """

    if not ent.normalized_value:
        del ent.normalized_value
    if not ent.confidence:
        del ent.confidence
    if not ent.page_anchor:
        del ent.page_anchor
    if not ent.id:
        del ent.id
    if not ent.mention_text:
        del ent.mention_text
    if not ent.text_anchor:
        del ent.text_anchor


def boundary_markers(doc: documentai.Document) -> documentai.Document:
    """Rename entity and child-entity types so that each one carries its account prefix.

    Args:
        doc (documentai.Document): DocumentAI doc-proto object

    Returns:
        documentai.Document: The doc-proto object with the renamed entity types
    """

    # find ent_ids of Json
    ent_ids = defaultdict(list)
    all_entities = []
    for idx, entity in enumerate(doc.entities):
        if entity.id:
            ent_ids[idx].append(int(entity.id))
            all_entities.append(entity)
        for prop in entity.properties:
            ent_ids[idx].append(int(prop.id))
            all_entities.append(prop)
    all_entities = sorted(all_entities, key=lambda x: x.id)
    # Single Level Entities file : json_dict
    json_dict = defaultdict(list)
    for entity in all_entities:
        json_dict["confidence"].append(entity.confidence)
        json_dict["id"].append(entity.id)
        json_dict["mentionText"].append(entity.mention_text)
        json_dict["normalizedValue"].append(entity.normalized_value)
        json_dict["pageAnchor"].append(entity.page_anchor)
        json_dict["textAnchor"].append(entity.text_anchor)
        json_dict["type"].append(entity.type_)

    acc_dict = {}
    idx = 0
    for ent in doc.entities:
        if ent.type_ != "account_number":
            continue
        pg_no = ent.page_anchor.page_refs[0].page
        y_min = min(
            vertex.y
            for vertex in ent.page_anchor.page_refs[0].bounding_poly.normalized_vertices
        )
        acn = re.sub(r"\D", "", ent.mention_text.strip(".#:' "))
        acc_dict[idx] = {"page": pg_no, "account_number": acn, "min_y": y_min}
        idx += 1

    sorted_data = sorted(acc_dict.values(), key=lambda x: (int(x["page"]), x["min_y"]))
    # acns -> maps each account number to its new entity type
    acns = {}
    idx = 0
    for data in sorted_data:
        acn = data["account_number"]
        if acn not in acns and len(acn) > 6:
            acns[acn] = f"account_{idx}_number"
            idx += 1
    acn_dict = {}
    acn_page_dict = {}
    for key, value in acns.items():
        si_ei_pn = []
        pg_nos = set()
        zip_data = zip(
            json_dict["mentionText"], json_dict["pageAnchor"], json_dict["textAnchor"]
        )
        for mt, pa, ta in zip_data:
            if re.sub(r"\D", "", mt.strip(".#:' ")) == key:
                page = pa.page_refs[0].page
                ts = ta.text_segments[0]
                si_ei_pn.append((ts.start_index, ts.end_index, page))
                pg_nos.add(page)
        acn_page_dict[value] = pg_nos
        acn_dict[value] = si_ei_pn

    page_no = set(range(len(doc.pages)))
    pages_temp = set()
    for pn_set in acn_page_dict.values():
        page_no = page_no & pn_set
        if page_no:
            pages_temp = page_no
    page_no = list(pages_temp)
    for value in acn_dict.values():
        value.sort(key=lambda x: x[2])

    acns_to_delete = []
    for key, value in acn_dict.items():
        if key != "account_0_number":
            min_si = min_ei = min_page = float("inf")
            data_to_rm = []
            if len(value) <= 1:
                acns_to_delete.append(key)
                continue
            for si_ei_pn in value:
                check_length = len(data_to_rm) < len(value) - 1
                check_if = (si_ei_pn[2] in page_no) and (si_ei_pn[2] < 3)
                if check_if and check_length:
                    data_to_rm.append(si_ei_pn)
                    continue
                min_si = min(min_si, si_ei_pn[0])
                min_ei = min(min_ei, si_ei_pn[1])
                min_page = min(min_page, si_ei_pn[2])
                data_to_rm.append(si_ei_pn)
            for k in data_to_rm:
                value.remove(k)
            acn_dict[key] = [(min_si, min_ei, min_page)]
            continue
        min_si = min_ei = 0
        min_page = float("inf")
        data_to_rm = []
        for si_ei_pn in value:
            if si_ei_pn[2] != page_no[0]:
                continue
            min_si = max(min_si, si_ei_pn[0])
            min_ei = max(min_ei, si_ei_pn[1])
            min_page = min(min_page, si_ei_pn[2])
            data_to_rm.append(si_ei_pn)

        for k in data_to_rm:
            value.remove(k)
        acn_dict["account_0_number"] = [(min_si, min_ei, min_page)]

    for i in acns_to_delete:
        del acn_dict[i]

    txt_len = len(doc.text)
    if len(acns) > 1:
        border_idx = []
        for si_ei_pn in acn_dict.values():
            border_idx.append((si_ei_pn[0][0], si_ei_pn[0][1]))

        region_splitter = []
        for bi in border_idx:
            region_splitter.append(bi[0])

        region_splitter_dict = {}
        for idx, rs in enumerate(region_splitter):
            region_splitter_dict[rs] = f"account_{idx}"
        region_splitter_dict[txt_len] = "last_index"
    else:
        region_splitter_dict = dict([(txt_len, "account_0")])
        region_splitter_dict[txt_len + 1] = "last_index"

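    for i, _ in enumerate(json_dict["id"]):
        sub_str = re.sub(r"\D", "", json_dict["mentionText"][i].strip(".#:' "))
        ent_type = json_dict["type"][i]
        # Only rename account numbers registered in `acns`; a plain length check
        # can raise KeyError for numbers that were filtered out earlier.
        if ent_type == "account_number" and sub_str in acns:
            json_dict["type"][i] = acns[sub_str]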

    TYPE_MAPPING = {
        "starting_balance": "_beginning_balance",
        "ending_balance": "_ending_balance",
        "table_item/transaction_deposit_date": "_transaction/deposit_date",
        "table_item/transaction_deposit_description": "_transaction/deposit_desc",
        "table_item/transaction_deposit": "_transaction/deposit_amount",
        "table_item/transaction_withdrawal_date": "_transaction/withdraw_date",
        "table_item/transaction_withdrawal_description": "_transaction/withdraw_desc",
        "table_item/transaction_withdrawal": "_transaction/withdraw_amount",
    }
    for i, _id in enumerate(json_dict["id"]):
        try:
            si = json_dict["textAnchor"][i].text_segments[0].start_index
        except IndexError:
            # To skip entity type checking if there is no TextAnchor object in Doc Proto
            continue
        ent_type = json_dict["type"][i]
        keys = list(region_splitter_dict.keys())
        for j in range(1, len(region_splitter_dict)):
            if ent_type in TYPE_MAPPING and si < keys[j]:
                json_dict["type"][i] = (
                    region_splitter_dict[keys[j - 1]] + TYPE_MAPPING[ent_type]
                )
                break

    new_entities = []
    for i, _ in enumerate(all_entities):
        entity = documentai.Document.Entity(
            confidence=json_dict["confidence"][i],
            id=json_dict["id"][i],
            mention_text=json_dict["mentionText"][i],
            normalized_value=json_dict["normalizedValue"][i],
            page_anchor=json_dict["pageAnchor"][i],
            text_anchor=json_dict["textAnchor"][i],
            type_=json_dict["type"][i],
        )
        new_entities.append(entity)
    new_entities_to_id_dict = {}
    for ent in new_entities:
        new_entities_to_id_dict[int(ent.id)] = ent
    all_entities_new = [""] * len(ent_ids)
    for i, _ids in ent_ids.items():
        if len(_ids) == 1:
            all_entities_new[i] = new_entities_to_id_dict[_ids[0]]
            continue
        sub_entities = []
        for _id in _ids:
            sub_entities.append(new_entities_to_id_dict[_id])
        all_entities_new[i] = doc.entities[i]
        all_entities_new[i].properties = sub_entities
    for ent in all_entities_new:
        del_ent_attrs(ent)
        for child_ent in ent.properties:
            del_ent_attrs(child_ent)
    for i in all_entities_new:
        if i.type_ == "table_item":
            i.type_ = i.properties[0].type_.split("/")[0]

    doc.entities = all_entities_new
    return doc


def match_ent_type(doc: documentai.Document, ent_type: str) -> Dict[str, str]:
    """Collect entity types containing `ent_type` and the most frequent mention_text of each.

    Args:
        doc (documentai.Document): DocumentAI doc-proto object
        ent_type (str): Substring to look for in each entity type

    Returns:
        Dict[str, str]: Each matched entity type mapped to its most frequent mention_text
            (with leading and trailing `$` and `#` stripped)
    """

    types = set()
    for entity in doc.entities:
        if ent_type in entity.type_:
            types.add(entity.type_)
    types_dict = {}
    for unique_type in types:
        cleaned_mts = []
        for entity in doc.entities:
            if unique_type == entity.type_:
                cleaned_mts.append(entity.mention_text.strip("$#"))
        data = Counter(cleaned_mts).most_common(1)[0][0]
        types_dict[unique_type] = data
    return types_dict


def fix_account_balance(doc: documentai.Document) -> documentai.Document:
    """Reconcile entities whose type contains `beginning_balance` or `ending_balance`.

    An entity whose value differs from the most frequent value of its type is retyped
    when the value matches another balance type, and removed otherwise.

    Args:
        doc (documentai.Document): DocumentAI doc-proto object

    Returns:
        documentai.Document: The updated DocumentAI doc-proto object
    """

    beg_end_dict = dict()
    beg_end_dict.update(match_ent_type(doc, "beginning_balance"))
    beg_end_dict.update(match_ent_type(doc, "ending_balance"))
    # Iterate over a copy so that removing entities does not skip elements.
    for entity in list(doc.entities):
        mt = entity.mention_text.strip("$#")
        et = entity.type_
        keys = list(beg_end_dict.keys())
        values = list(beg_end_dict.values())
        if et in beg_end_dict:
            if mt != beg_end_dict[et] and mt in values:
                entity.type_ = keys[values.index(mt)]
            elif mt != beg_end_dict[et]:
                doc.entities.remove(entity)
    return doc


def find_account_number(
    data: Dict[str, List[Dict[str, Union[int, float]]]], page_no: int, y_coord: float
) -> Union[None, str]:
    """Find the account number nearest to `y_coord` on the given page.

    Args:
        data (Dict[str, List[Dict[str, Union[int, float]]]]): Maps each account number
            to the pages and y-coordinates where it appears
        page_no (int): Page number to search for an account number
        y_coord (float): Minimum y-coordinate of the entity to match

    Returns:
        Union[None, str]: The closest account number on `page_no`, or None if there is none
    """
    closest_acc = None
    min_dst = float("inf")
    for acn, page_info_list in data.items():
        for page_info in page_info_list:
            page = page_info.get("page")
            y = page_info.get("y")
            dst = abs(y_coord - y)
            if page == page_no and dst < min_dst:
                min_dst = dst
                closest_acc = acn
    return closest_acc


def detials_account(
    doc: documentai.Document, account_type: str
) -> List[
    Dict[
        str,
        Dict[str, Union[str, int, documentai.Document.TextAnchor.TextSegment, float]],
    ]
]:
    """It will look for entities whose type_ matches with `account_type`

    Args:
        doc (documentai.Document): DocumentAI doc-proto object
        account_type (str): String data to match with individual entity.type_

    Returns:
        List[Dict[str, Dict[str, Union[str, int, documentai.Document.TextAnchor.TextSegment, float]]]]:
            A list of dictionaries, each mapping a mention_text to its id, page number, text segment, x_max & y_max
    """
    acc_dict_lst = []
    for ent in doc.entities:
        if ent.properties:
            continue
        match_ratio = SequenceMatcher(None, ent.type_, account_type).ratio()
        if match_ratio >= 0.9:
            id1 = ent.id
            page1 = ent.page_anchor.page_refs[0].page
            text_segment = ent.text_anchor.text_segments[0]
            x_coords = []
            y_coords = []
            nvs = ent.page_anchor.page_refs[0].bounding_poly.normalized_vertices
            for nv in nvs:
                x_coords.append(nv.x)
                y_coords.append(nv.y)
            x_max = max(x_coords, default="")
            y_max = max(y_coords, default="")
            acc_dict_lst.append(
                {
                    ent.mention_text: {
                        "id": id1,
                        "page": page1,
                        "textSegments": text_segment,
                        "x_max": x_max,
                        "y_max": y_max,
                    }
                }
            )
    return acc_dict_lst


def accounttype_change(doc: documentai.Document) -> documentai.Document:
    """Rename `account_type` entities to `account_<i>_name` using the nearest account number.

    Args:
        doc (documentai.Document): DocumentAI doc-proto object

    Returns:
        documentai.Document: The updated doc-proto object
    """

    acc_name_dict = detials_account(doc, "account_type")
    acn_dict = detials_account(doc, "account_i_number")
    temp_del = []
    for item in acc_name_dict:
        for key in item:
            if re.search(r"\sstatement", key, re.IGNORECASE):
                temp_del.append(key)
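    # Drop account-type entries that are statement headers. Build a new list
    # instead of deleting while iterating, which would skip elements.
    acc_name_dict = [
        item for item in acc_name_dict if not any(key in temp_del for key in item)
    ]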
    acc_comp = []
    for name_item in acc_name_dict:
        for acn_item in acn_dict:
            for key, value in name_item.items():
                for acn, value_2 in acn_item.items():
                    y_diff = abs(value["y_max"] - value_2["y_max"])
                    acc_comp.append({key: {acn: y_diff}})

    ymin_dict = {}
    for entry in acc_comp:
        for acc_type, account_info in entry.items():
            # acn -> account_number
            for acn, miny in account_info.items():
                if acn in ymin_dict:
                    curr_min = ymin_dict[acn]["min_value"]
                    if miny < curr_min:
                        ymin_dict[acn] = {"account_type": acc_type, "min_value": miny}
                else:
                    ymin_dict[acn] = {"account_type": acc_type, "min_value": miny}

    # Extract one account name based on min y
    result_dict = {acn: data["account_type"] for acn, data in ymin_dict.items()}
    acn_ymin = {}
    map_acc_type = {}
    for ent in doc.entities:
        match_ratio = SequenceMatcher(None, ent.type_, "account_i_number").ratio()
        if match_ratio > 0.8:
            acc_num1 = re.sub(r"\D", "", ent.mention_text.strip(".#:' "))
            if len(acc_num1) > 5:
                nvs = ent.page_anchor.page_refs[0].bounding_poly.normalized_vertices
                min_y1 = min(nv.y for nv in nvs)
                page = ent.page_anchor.page_refs[0].page
                if acc_num1 in acn_ymin.keys():
                    acn_ymin[acc_num1].append({"y": min_y1, "page": page})
                else:
                    acn_ymin[acc_num1] = [{"y": min_y1, "page": page}]
        cond1 = ent.mention_text in result_dict.keys()
        cond2 = ent.mention_text not in map_acc_type.keys()
        if cond1 and cond2:
            map_acc_type[ent.mention_text] = ent.type_

    # Iterate over a copy so that removing entities does not skip elements.
    for ent in list(doc.entities):
        cond1 = ent.type_ == "account_type"
        cond2 = re.search(r"\sstatement", ent.mention_text, re.IGNORECASE)
        if cond1 and cond2:
            doc.entities.remove(ent)
        elif cond1:
            nvs = ent.page_anchor.page_refs[0].bounding_poly.normalized_vertices
            ymin_2 = min(nv.y for nv in nvs)
            page = ent.page_anchor.page_refs[0].page
            x1 = find_account_number(acn_ymin, page, ymin_2)
            try:
                data = map_acc_type[x1].split("_")[1]
            except KeyError:
                continue
            else:
                ent.type_ = f"account_{data}_name"
    return doc


input_bucket, _ = gcs_utilities.split_gcs_uri(gcs_input_path)
output_bucket, output_files_dir = gcs_utilities.split_gcs_uri(gcs_output_path)
_, file_dict = utilities.file_names(gcs_input_path)
print("Categorizing Bank Statement Transactions by Account Number Process Started...")
for fn, fp in file_dict.items():
    print(f"\tFile: {fn}")
    doc = utilities.documentai_json_proto_downloader(input_bucket, fp)
    try:
        doc = boundary_markers(doc)
    except Exception as e:
        print("Unable to update the account details because of {}".format(e.args))
    try:
        doc = fix_account_balance(doc)
    except Exception as e:
        print(
            "Unable to update the starting and ending balance because of {}".format(
                e.args
            )
        )
    try:
        doc = accounttype_change(doc)
    except Exception as e:
        print("Unable to update the account type because of {}".format(e.args))
    str_data = documentai.Document.to_json(
        doc,
        use_integers_for_enums=False,
        including_default_value_fields=False,
        preserving_proto_field_name=False,
    )
    output_file_path = f"{output_files_dir.rstrip('/')}/{fn}"
    target_path = output_file_path if output_files_dir else fn
    utilities.store_document_as_json(str_data, output_bucket, target_path)
    print(f"\t\tPost processed data uploaded to gs://{output_bucket}/{target_path}")
print("Process Completed")

Categorizing Bank Statement Transactions by Account Number Process Started...
	File: 1941000828-0.json
		Post processed data uploaded to gs://siddamv/categorizing_bank_statement_transactions_by_account_number/post/output/1941000828-0.json
	File: 2016398000-0.json
		Post processed data uploaded to gs://siddamv/categorizing_bank_statement_transactions_by_account_number/post/output/2016398000-0.json
	File: 2016654464-0.json
		Post processed data uploaded to gs://siddamv/categorizing_bank_statement_transactions_by_account_number/post/output/2016654464-0.json
	File: 2017496199-0.json
		Post processed data uploaded to gs://siddamv/categorizing_bank_statement_transactions_by_account_number/post/output/2017496199-0.json
	File: 2024616717-0.json
		Post processed data uploaded to gs://siddamv/categorizing_bank_statement_transactions_by_account_number/post/output/2024616717-0.json
	File: SampleBank-0.json
		Post processed data uploaded to gs://siddamv/categorizing_bank_statement_transactions_by_a
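
The renaming in `boundary_markers` assigns each balance or transaction entity to an account by comparing the entity's text start index with the offsets at which the account numbers appear in the document text. The idea can be sketched with a binary search over those offsets. The names `account_for_offset` and `account_starts` are hypothetical and are used only for this illustration; the sketch assumes the offsets are already sorted in ascending order.

```python
from bisect import bisect_right
from typing import List


def account_for_offset(start_index: int, account_starts: List[int]) -> str:
    """Return the account prefix for an entity starting at `start_index`.

    `account_starts` holds the sorted text offsets where account_0, account_1, ...
    begin (illustrative input, not the notebook's own data structure).
    """
    # Count the boundaries at or before the entity; the last one passed wins.
    position = bisect_right(account_starts, start_index) - 1
    # Entities that appear before the first boundary still belong to account_0.
    return f"account_{max(position, 0)}"


starts = [10, 200, 450]  # hypothetical offsets of three account numbers
for offset in (5, 150, 200, 460):
    print(offset, account_for_offset(offset, starts))
```

An entity that starts exactly at a boundary is placed in the account that begins there, which matches the strict `si < keys[j]` comparison used in the notebook.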

## 4. Output Details

Post-processing renames the Bank Statement Parser entities so that each one is tied to the account it belongs to.  
The mapping is as follows:  
<table>
    <tr>
        <td><b>Bank Statement Parser entity type before post-processing</b></td>
        <td><b>Entity type after post-processing</b></td>
    </tr>
    <tr>
        <td>account_number</td>
        <td>account_0_number<br>account_1_number  ..etc</td>
    </tr>
    <tr>
        <td>account_type</td>
        <td>account_0_name<br>account_1_name   ..etc</td>
    </tr>
    <tr>
        <td>starting_balance</td>
        <td>account_0_beginning_balance<br>account_1_beginning_balance  ..etc</td>
    </tr>
    <tr>
        <td>ending_balance</td>
        <td>account_0_ending_balance<br>account_1_ending_balance  ..etc</td>
    </tr>
    <tr>
        <td>table_item/transaction_deposit_date</td>
        <td>account_0_transaction/deposit_date<br>account_1_transaction/deposit_date  ..etc</td>
    </tr>
    <tr>
        <td>table_item/transaction_deposit_description</td>
        <td>account_0_transaction/deposit_desc<br>account_1_transaction/deposit_desc ..etc</td>
    </tr>
    <tr>
        <td>table_item/transaction_deposit</td>
        <td>account_0_transaction/deposit_amount<br>account_1_transaction/deposit_amount  ..etc</td>
    </tr>
    <tr>
        <td>table_item/transaction_withdrawal_date</td>
        <td>account_0_transaction/withdraw_date<br>account_1_transaction/withdraw_date  ..etc</td>
    </tr>
    <tr>
        <td>table_item/transaction_withdrawal_description</td>
        <td>account_0_transaction/withdraw_desc<br>account_1_transaction/withdraw_desc ..etc</td>
    </tr>
    <tr>
        <td>table_item/transaction_withdrawal</td>
        <td>account_0_transaction/withdraw_amount<br>account_1_transaction/withdraw_amount ..etc</td>
    </tr>
    <tr>
        <td>table_item</td>
        <td>account_0_transaction<br>account_1_transaction  ..etc</td>
    </tr>
    </table>
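
Once the entity types carry an `account_<i>_` prefix, downstream code can group the post-processed entities per account. The following is a minimal sketch of that idea, not part of the notebook. It assumes the entities have been loaded from the output JSON as a flat list of dictionaries with `type` and `mentionText` keys; the function name `group_by_account` and the sample values are hypothetical.

```python
import re
from collections import defaultdict
from typing import Dict, List, Tuple

# Matches types such as "account_0_number" or "account_1_transaction/deposit_amount".
ACCOUNT_PREFIX = re.compile(r"^account_(\d+)_(.+)$")


def group_by_account(entities: List[dict]) -> Dict[int, List[Tuple[str, str]]]:
    """Group (field, mention_text) pairs by account index. Illustrative helper."""
    grouped = defaultdict(list)
    for entity in entities:
        match = ACCOUNT_PREFIX.match(entity.get("type", ""))
        if match is None:
            # Entities without an account prefix are left out of the grouping.
            continue
        account_index = int(match.group(1))
        grouped[account_index].append((match.group(2), entity.get("mentionText", "")))
    return dict(grouped)


# Hypothetical sample data, shaped like entries of the output JSON "entities" list.
sample = [
    {"type": "account_0_number", "mentionText": "12345678"},
    {"type": "account_1_transaction/deposit_amount", "mentionText": "100.00"},
    {"type": "account_0_transaction/withdraw_desc", "mentionText": "ATM"},
    {"type": "statement_date", "mentionText": "01/01/2024"},
]
print(group_by_account(sample))
```

Child entities stored under `properties` are not traversed in this sketch; they would need to be flattened first if transaction fields are nested under their parent rows.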
