diff --git a/incubator-tools/DocAI_Json_to_Canonical_Json_Conversion/Docai_json_to_canonical_json_conversion.ipynb b/incubator-tools/DocAI_Json_to_Canonical_Json_Conversion/Docai_json_to_canonical_json_conversion.ipynb new file mode 100644 index 00000000..4c127916 --- /dev/null +++ b/incubator-tools/DocAI_Json_to_Canonical_Json_Conversion/Docai_json_to_canonical_json_conversion.ipynb @@ -0,0 +1,405 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "62ac8784-c132-4866-acf0-0bfd0e569002", + "metadata": {}, + "source": [ + "# DocAI JSON to Canonical Doc JSON conversion" + ] + }, + { + "cell_type": "markdown", + "id": "10b01c70-c6cf-4111-89fd-e830e3016db4", + "metadata": {}, + "source": [ + "* Author: docai-incubator@google.com" + ] + }, + { + "cell_type": "markdown", + "id": "133f6ba4-6180-4ac7-9084-68e29a4c40df", + "metadata": {}, + "source": [ + "## Disclaimer\n", + "\n", + "This tool is not supported by the Google engineering team or product team. It is provided and supported on a best-effort basis by the DocAI Incubator Team. No guarantees of performance are implied." + ] + }, + { + "cell_type": "markdown", + "id": "189ef8bd-7434-4fd1-ba36-f9c31ce15d12", + "metadata": {}, + "source": [ + "## Purpose and Description\n", + "
Note: This feature is in Preview with allowlist. To turn on this feature, contact your Google account team.
\n", + "\n", + "\n", + "A parsed, unstructured document(Canonical Doc JSONs) is represented by JSON that describes the unstructured document using a sequence of text, table, and list blocks. You import canonical JSON files with your parsed unstructured document data in the same way that you import other types of unstructured documents, such as PDFs. When this feature is turned on, whenever a JSON file is uploaded and identified by either an application/json MIME type or a .JSON extension, it is treated as a parsed document." + ] + }, + { + "cell_type": "markdown", + "id": "a048aa80-bd14-488c-8f84-75c93a3cb00a", + "metadata": {}, + "source": [ + "**Canonical Json :** Canonical Doc JSONs are a JSON representation of parsed unstructured documents. They use a sequence of text, table, and list blocks to describe the document's structure.\n", + "\n", + "Refer below cell, which gives details about **Canonical Json Schema** \n", + "Change the metadata_schema according to your need. Ex : if you want to add file info, add a key&value pair to structData key of metadata_schema object." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "20af14d1-8a02-4f13-8c63-9a13f16da33e", + "metadata": {}, + "outputs": [], + "source": [ + "# Pre defined json structure\n", + "\n", + "conanical_schema = {\n", + " \"title\": \"Some Title\",\n", + " \"blocks\": [\n", + " {\n", + " \"textBlock\": {\"text\": \"Some PARAGRAPH 1\", \"type\": \"PARAGRAPH\"},\n", + " \"pageNumber\": 1,\n", + " }\n", + " ],\n", + "}\n", + "\n", + "\n", + "metadata_schema = {\n", + " \"id\": \"your_random_id\",\n", + " \"structData\": {\n", + " \"Title\": \"File Title\",\n", + " \"Description\": \"Your file description\",\n", + " \"Source_url\": \"https://storage.mtls.cloud.google.com/\",\n", + " },\n", + " \"content\": {\"mimeType\": \"application/json\", \"uri\": \"gs://\"},\n", + "}" + ] + }, + { + "cell_type": "markdown", + "id": "c358842d-da07-4d4e-8521-d55eb5197f5b", + "metadata": {}, + "source": [ + "## Prerequisites\n", + "\n", + "1. Vertex AI Notebook\n", + "2. Parser Json files in GCS Folders.\n", + "3. PDFs files in GCS Folders." + ] + }, + { + "cell_type": "markdown", + "id": "f3d2adea-a44f-4c06-863c-a77454080ae8", + "metadata": {}, + "source": [ + "## Step by Step procedure " + ] + }, + { + "cell_type": "markdown", + "id": "8765cc08-24aa-4540-b0f7-32d3c32d457c", + "metadata": {}, + "source": [ + "### 1. Import Modules/Packages" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "532dea4c-5f0b-40ff-ac3f-3e011c709437", + "metadata": {}, + "outputs": [], + "source": [ + "!wget https://raw.githubusercontent.com/GoogleCloudPlatform/document-ai-samples/main/incubator-tools/best-practices/utilities/utilities.py" + ] + }, + { + "cell_type": "raw", + "id": "dbcf5eb5-1dc4-40be-bba1-f9ad160e40fa", + "metadata": {}, + "source": [ + "import json\n", + "import uuid\n", + "from copy import deepcopy\n", + "from google.cloud import documentai\n", + "from typing import Any, Dict, List, Optional, Sequence, Tuple, Union\n", + "from utilities import documentai_json_proto_downloader,store_document_as_json, file_names" + ] + }, + { + "cell_type": "markdown", + "id": "0ae09265-eaed-4fe5-abf6-c57aa66d8564", + "metadata": {}, + "source": [ + "### 2. Input Details" + ] + }, + { + "cell_type": "markdown", + "id": "34eb5b04-e17f-4402-95e5-98c40b905916", + "metadata": {}, + "source": [ + "* GCS_INPUT_PATH : GCS path for input files. 
It should contain DocAI processed output json files and also the pdfs which got parsed by the processor with the same name as json files." + ] + }, + { + "cell_type": "markdown", + "id": "0b80a90a-9c93-4710-9f37-fb0faf63cd05", + "metadata": {}, + "source": [ + "This is how the input bucket structure should look like : \n", + "\"input" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "e782c446-c601-4a07-b511-ae2c2b8821eb", + "metadata": {}, + "outputs": [], + "source": [ + "# Please follow the folders path to sucessfully create metadata\n", + "\n", + "# Given folders in GCS_INPUT_PATH with the following structure :\n", + "#\n", + "# gs://path/to/input/folder\n", + "# ├──/processor_output/ Folder having parsed json from the processor.\n", + "# └──/pdfs/ Folder having all the pdfs which got parsed with same name as jsons.\n", + "\n", + "GCS_INPUT_PATH = \"gs://{bucket_name}/{folder_path}\"\n", + "\n", + "input_bucket_name = GCS_INPUT_PATH.split(\"/\")[2]\n", + "input_prefix_path = \"/\".join(GCS_INPUT_PATH.split(\"/\")[3:])" + ] + }, + { + "cell_type": "markdown", + "id": "3508a7de-06d7-48cc-95bf-be335562d344", + "metadata": {}, + "source": [ + "#### Change the metadata_schema according to your need. Ex : if you want to add file info, add a key&value pair to structData key of metadata_schema object." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "062d74d4-7a32-4c16-b4a2-e314452f917e", + "metadata": {}, + "outputs": [], + "source": [ + "# Pre defined json structure\n", + "\n", + "conanical_schema = {\n", + " \"title\": \"Some Title\",\n", + " \"blocks\": [\n", + " {\n", + " \"textBlock\": {\"text\": \"Some PARAGRAPH 1\", \"type\": \"PARAGRAPH\"},\n", + " \"pageNumber\": 1,\n", + " }\n", + " ],\n", + "}\n", + "\n", + "\n", + "metadata_schema = {\n", + " \"id\": \"your_random_id\",\n", + " \"structData\": {\n", + " \"Title\": \"File Title\",\n", + " \"Description\": \"Your file description\",\n", + " \"Source_url\": \"https://storage.mtls.cloud.google.com/\",\n", + " },\n", + " \"content\": {\"mimeType\": \"application/json\", \"uri\": \"gs://\"},\n", + "}" + ] + }, + { + "cell_type": "markdown", + "id": "d6aaa615-f4dd-4520-a0bc-ad3a12cac009", + "metadata": {}, + "source": [ + "### 3. 
Run the scipt" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "b1618f90-feda-4802-9122-7463c8e75d34", + "metadata": {}, + "outputs": [], + "source": [ + "def convert_doc_object_to_conanical_object(\n", + " doc_object: documentai.Document, file_path: str\n", + ") -> Dict:\n", + " \"\"\"\n", + " To convert the document AI object structure to conanical object which is compactible with vertex AI search.\n", + "\n", + " Parameters\n", + " ----------\n", + " doc_object : documentai.Document\n", + " The documnet AI object from the input file provided by the user.\n", + "\n", + " file_path : str\n", + " The GCS file path of the json file.\n", + "\n", + " Returns\n", + " -------\n", + " Dict\n", + " Returns the converted conanical object.\n", + " \"\"\"\n", + "\n", + " conanical_obj = {}\n", + " conanical_obj[\"blocks\"] = list()\n", + " conanical_schema_c = deepcopy(conanical_schema)\n", + " file_name = file_path.split(\"/\")[-1].split(\".\")[0]\n", + " OCR_text = doc_object.text\n", + "\n", + " conanical_schema_c[\"title\"] = file_name\n", + " conanical_obj[\"title\"] = conanical_schema_c[\"title\"]\n", + "\n", + " # looping through all pages\n", + " for page in doc_object.pages:\n", + " page_number = page.page_number\n", + " conanical_schema_c[\"blocks\"][0][\"pageNumber\"] = page_number\n", + "\n", + " # looping through all paragraph and getting OCR text by index\n", + " paragraph = page.paragraphs\n", + " if paragraph:\n", + " first_paragraph = True\n", + " for paragraph in page.paragraphs:\n", + " text_segments = paragraph.layout.text_anchor.text_segments[0]\n", + " if first_paragraph:\n", + " page_starting_index = text_segments.start_index\n", + " first_paragraph = False\n", + " last_paragraph_index = text_segments.end_index\n", + " paragraph_text = OCR_text[page_starting_index:last_paragraph_index]\n", + " conanical_schema_c[\"blocks\"][0][\"textBlock\"][\"text\"] = paragraph_text\n", + " conanical_obj[\"blocks\"].append(deepcopy(conanical_schema_c[\"blocks\"][0]))\n", + " return conanical_obj\n", + "\n", + "\n", + "def create_metadata(metadata: str, output_file: str) -> str:\n", + " \"\"\"\n", + " To create the metadata file by adding pdfs file location, content type, converted json location,other user defined schema.\n", + "\n", + " Parameters\n", + " ----------\n", + " metadata : str\n", + " The metadata string with the older configuration which will get update with new configuration.\n", + "\n", + " output_file : str\n", + " The GCS path of the Canonical json path.\n", + "\n", + " Returns\n", + " -------\n", + " str\n", + " Returns the updated metadata string having the latest file info attached with the older metadata string.\n", + " \"\"\"\n", + "\n", + " metadata_json_copy = deepcopy(metadata_schema)\n", + " file_path = output_file.replace(\"gs://\", \"\")\n", + " pdf_file_path = (\n", + " GCS_INPUT_PATH + \"/pdfs/\" + file_path.split(\"/\")[-1].split(\".\")[0] + \".pdf\"\n", + " )\n", + " pdf_file_path = pdf_file_path.replace(\"gs://\", \"\")\n", + " metadata_json_copy[\"id\"] = str(uuid.uuid4())\n", + " metadata_json_copy[\"content\"][\"uri\"] += file_path\n", + " metadata_json_copy[\"structData\"][\"Source_url\"] += pdf_file_path\n", + " metadata_json_copy[\"structData\"][\"Title\"] = file_path.split(\"/\")[-1].split(\".\")[0]\n", + " metadata += json.dumps(metadata_json_copy) + \"\\n\"\n", + " return metadata\n", + "\n", + "\n", + "output_files_for_metadata = []\n", + "file_name_list = [\n", + " i\n", + " for i in 
list(file_names(f\"{GCS_INPUT_PATH}/processor_output\")[1].values())\n", + " if i.endswith(\".json\")\n", + "]\n", + "\n", + "print(\"Converting files ...\")\n", + "for file_name in file_name_list:\n", + " try:\n", + " document = documentai_json_proto_downloader(input_bucket_name, file_name)\n", + " conanical_object = convert_doc_object_to_conanical_object(document, file_name)\n", + "\n", + " except Exception as e:\n", + " print(f\"[x] {input_bucket_name}/{file_name} || Error : {str(e)}\")\n", + " continue\n", + " output_file_name = f\"{input_prefix_path}/output/{file_name.split('/')[-1]}\"\n", + " output_files_for_metadata.append(f\"gs://{input_bucket_name}/{output_file_name}\")\n", + " store_document_as_json(\n", + " json.dumps(conanical_object), input_bucket_name, output_file_name\n", + " )\n", + " print(f\"[✓] {input_bucket_name}/{output_file_name}\")\n", + "\n", + "print(\"\\n\\nCreating metadata file ...\")\n", + "metadata_str = \"\"\n", + "for gcs_output_file in output_files_for_metadata:\n", + " metadata_str = create_metadata(metadata_str, gcs_output_file)\n", + "\n", + "store_document_as_json(\n", + " metadata_str, input_bucket_name, input_prefix_path + \"/metadata/\" + \"metadata.jsonl\"\n", + ")\n", + "print(\n", + " \"Metadata created & stored in\",\n", + " f\"gs://{input_bucket_name}/{input_prefix_path}/metadata/\",\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "9a390272-f2a8-435e-b626-6bcc9c91dd43", + "metadata": {}, + "source": [ + "### 4. Output" + ] + }, + { + "cell_type": "markdown", + "id": "aa841590-c1ac-4dd5-9301-d97b149236a4", + "metadata": {}, + "source": [ + "Document AI json after conversion to Canonical json and store the files to output folder inside the GCS_INPUT_PATH folder. Each page text will get store inside each text block with their respective page number.
\n", + "\"conanical\n", + "\n", + "Finally script will create metadata having GCS location of converted Canonical json with the authentication link of pdfs.\n", + "\"conanical

\n", + "\n", + "Import the metadata file in vertex AI search and conversation datastore, verify the files got imported.
\n", + "\"conanical
\n", + "\"conanical" + ] + } + ], + "metadata": { + "environment": { + "kernel": "python3", + "name": "common-cpu.m112", + "type": "gcloud", + "uri": "gcr.io/deeplearning-platform-release/base-cpu:m112" + }, + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.7.12" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/incubator-tools/DocAI_Json_to_Canonical_Json_Conversion/Images/conanical_json_output_1.png b/incubator-tools/DocAI_Json_to_Canonical_Json_Conversion/Images/conanical_json_output_1.png new file mode 100644 index 00000000..77996c14 Binary files /dev/null and b/incubator-tools/DocAI_Json_to_Canonical_Json_Conversion/Images/conanical_json_output_1.png differ diff --git a/incubator-tools/DocAI_Json_to_Canonical_Json_Conversion/Images/conanical_json_output_2.png b/incubator-tools/DocAI_Json_to_Canonical_Json_Conversion/Images/conanical_json_output_2.png new file mode 100644 index 00000000..4003713e Binary files /dev/null and b/incubator-tools/DocAI_Json_to_Canonical_Json_Conversion/Images/conanical_json_output_2.png differ diff --git a/incubator-tools/DocAI_Json_to_Canonical_Json_Conversion/Images/conanical_json_output_3.png b/incubator-tools/DocAI_Json_to_Canonical_Json_Conversion/Images/conanical_json_output_3.png new file mode 100644 index 00000000..8790bdca Binary files /dev/null and b/incubator-tools/DocAI_Json_to_Canonical_Json_Conversion/Images/conanical_json_output_3.png differ diff --git a/incubator-tools/DocAI_Json_to_Canonical_Json_Conversion/Images/conanical_json_output_4.png b/incubator-tools/DocAI_Json_to_Canonical_Json_Conversion/Images/conanical_json_output_4.png new file mode 100644 index 00000000..d22ae7ca Binary files /dev/null and b/incubator-tools/DocAI_Json_to_Canonical_Json_Conversion/Images/conanical_json_output_4.png differ diff --git a/incubator-tools/DocAI_Json_to_Canonical_Json_Conversion/Images/input_sample_image.png b/incubator-tools/DocAI_Json_to_Canonical_Json_Conversion/Images/input_sample_image.png new file mode 100644 index 00000000..e299f213 Binary files /dev/null and b/incubator-tools/DocAI_Json_to_Canonical_Json_Conversion/Images/input_sample_image.png differ diff --git a/incubator-tools/Export_import_document_schema_from_processor/Export_import_document_schema_from_processor.ipynb b/incubator-tools/Export_import_document_schema_from_processor/Export_import_document_schema_from_processor.ipynb new file mode 100644 index 00000000..227e50c2 --- /dev/null +++ b/incubator-tools/Export_import_document_schema_from_processor/Export_import_document_schema_from_processor.ipynb @@ -0,0 +1,287 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "6589fc93-39d1-4d10-be1f-e7eb33fe4087", + "metadata": {}, + "source": [ + "# Export and Import Document schema from a processor (using spreadsheet).\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "id": "361f188e-fe11-4a49-b7c8-080e0e69ce7a", + "metadata": {}, + "source": [ + "## Disclaimer\n", + "\n", + "This tool is not supported by the Google engineering team or product team. It is provided and supported on a best-effort basis by the DocAI Incubator Team. No guarantees of performance are implied. 
\n" + ] + }, + { + "cell_type": "markdown", + "id": "1036937a-0221-48eb-862e-3fa0b8e646a8", + "metadata": {}, + "source": [ + "## Objective\n", + "\n", + "This document Guides how to export a schema from a processor to a spreadsheet(.xlsx extension) and import a schema from a spreadsheet to a processor . This approach considers 3 level nesting as well.\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "id": "115a4e82-5e83-468a-b0e5-097ca14f15d5", + "metadata": {}, + "source": [ + "## Prerequisites\n", + "\n", + "* Vertex AI Notebook Or Colab (If using Colab, use authentication)\n", + "* Processor details to import the processor\n", + "* Permission For Google Storage and Vertex AI Notebook.\n" + ] + }, + { + "cell_type": "markdown", + "id": "142123d3-37b1-4aa8-841c-40c3bd52d70c", + "metadata": {}, + "source": [ + "## 1. Exporting Document schema to a spreadsheet" + ] + }, + { + "cell_type": "markdown", + "id": "73ae8955-8516-42ce-a08e-2ace4152d7d9", + "metadata": {}, + "source": [ + "\n", + "#### Input\n", + "* `project_id`=\"xxxxxxxxxx\" # Project ID of the project\n", + "* `location`=\"us\" # location of the processor \n", + "* `processor_id`=\"xxxxxxxxxxxxxxx\" #Processor id of processor from which the schema has to be exported to spreadsheet" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "840bb64b-66e8-4ec2-b25b-815db36775e1", + "metadata": {}, + "outputs": [], + "source": [ + "processor_name = f\"projects/{project_id}/locations/{location}/processors/{processor_id}\"\n", + "# get document schema\n", + "from google.cloud import documentai_v1beta3\n", + "\n", + "\n", + "def get_dataset_schema(processor_name):\n", + " # Create a client\n", + " client = documentai_v1beta3.DocumentServiceClient()\n", + "\n", + " # dataset_name = client.dataset_schema_path(project, location, processor)\n", + " # Initialize request argument(s)\n", + " request = documentai_v1beta3.GetDatasetSchemaRequest(\n", + " name=processor_name + \"/dataset/datasetSchema\",\n", + " )\n", + "\n", + " # Make the request\n", + " response = client.get_dataset_schema(request=request)\n", + "\n", + " return response\n", + "\n", + "\n", + "response_document_schema = get_dataset_schema(processor_name)\n", + "dataset_schema = []\n", + "for schema_metadata in response_document_schema.document_schema.entity_types:\n", + " if len(schema_metadata.properties) > 0:\n", + " for schema_property in schema_metadata.properties:\n", + " temp_schema_metadata = {\n", + " \"name\": schema_property.name,\n", + " \"value_type\": schema_property.value_type,\n", + " \"occurrence_type\": schema_property.occurrence_type.name,\n", + " }\n", + " if len(schema_metadata.display_name) == 0:\n", + " dataset_schema.append(temp_schema_metadata)\n", + " else:\n", + " temp_schema_metadata[\"display_name\"] = schema_metadata.display_name\n", + " dataset_schema.append(temp_schema_metadata)\n", + "\n", + "import pandas as pd\n", + "\n", + "df = pd.DataFrame(dataset_schema)\n", + "df.to_excel(\"Document_Schema_exported.xlsx\", index=False)" + ] + }, + { + "cell_type": "markdown", + "id": "6e7dc8b0-c547-4cc5-845b-cdf73d2ce909", + "metadata": {}, + "source": [ + "### Output \n", + "* The output will be the schema saved in \"Document_Schema_exported.xlsx\" file as shown below\n", + "\n", + "\n", + "#### * `Columns`\n", + "#### Name:\n", + "Entity type which can be parent entity or child entities\n", + "\n", + "#### Value_type:\n", + "\n", + "* Value type is the data type of the entities, if the entity is a parent item the value type will be 
same as entity type.if it is final child type then value type is data type\n", + "\n", + "#### Occurance_type :\n", + "\n", + "* Occurance type is the occurance type of respective entity\n", + "\n", + "#### display_name:\n", + "\n", + "* Display name is the name of the parent entity for child entities. if entity itself is the parent entity then display_name will be empty" + ] + }, + { + "cell_type": "markdown", + "id": "c156ac15-faa2-407d-87f0-86f19a10af33", + "metadata": {}, + "source": [ + "## 2. Importing Document schema from a spreadsheet" + ] + }, + { + "cell_type": "markdown", + "id": "650eb081-9ca5-443e-8369-f281fc39f6fc", + "metadata": {}, + "source": [ + "#### Input\n", + "* `project_id`=\"xxxxxxxxxx\" # Project ID of the project\n", + "* `new_location`=\"us\" # location of the processor \n", + "* `new_processor_id`=\"xxxxxxxxxxxxxxx\" #Processor id of processor to which the schema has to be imported\n", + "* `schema_xlsx_path`=\"Document_Schema_exported.xlsx\"" + ] + }, + { + "cell_type": "markdown", + "id": "d4e7499b-9895-4d61-a7dd-a12454d80c59", + "metadata": { + "tags": [] + }, + "source": [ + "* Add any entities in the xlsx file to be added in the new processor\n", + "\n", + "## Note\n", + "\n", + "* Make sure the entities in the spreadsheet are not already in the schema of the processor to avoid issues\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "6fa50e70-48ab-419f-95ee-9b1ffc889d73", + "metadata": {}, + "outputs": [], + "source": [ + "import numpy as np\n", + "import math\n", + "import pandas as pd\n", + "from google.cloud import documentai_v1beta3\n", + "\n", + "# Import the Excel file back into a data frame\n", + "imported_df = pd.read_excel(schema_xlsx_path)\n", + "\n", + "# Convert the data frame back to a list of dictionaries\n", + "imported_data = imported_df.to_dict(orient=\"records\")\n", + "\n", + "parent_entities = []\n", + "nested_entities = {}\n", + "for data in imported_data:\n", + " temp_data = {key: value for key, value in data.items() if key != \"display_name\"}\n", + " if isinstance(data[\"display_name\"], float) and math.isnan(data[\"display_name\"]):\n", + " parent_entities.append(temp_data)\n", + " else:\n", + " if data[\"display_name\"] in nested_entities.keys():\n", + " nested_entities[data[\"display_name\"]].append(temp_data)\n", + " else:\n", + " nested_entities[data[\"display_name\"]] = [temp_data]\n", + "\n", + "schema_line = []\n", + "\n", + "for line, properties in nested_entities.items():\n", + " client = documentai_v1beta3.types.DocumentSchema.EntityType()\n", + " client.name = line\n", + " client.base_types = [\"object\"]\n", + " client.properties = properties\n", + " client.display_name = line\n", + " schema_line.append(client)\n", + "\n", + "new_processor_name = (\n", + " f\"projects/{project_id}/locations/{new_location}/processors/{new_processor_id}\"\n", + ")\n", + "\n", + "response_newprocessor = get_dataset_schema(new_processor_name)\n", + "# updating into the processor\n", + "for i in response_newprocessor.document_schema.entity_types:\n", + " for e3 in parent_entities:\n", + " i.properties.append(e3)\n", + "\n", + "for e4 in schema_line:\n", + " response_newprocessor.document_schema.entity_types.append(e4)\n", + "\n", + "\n", + "def update_dataset_schema(schema):\n", + " from google.cloud import documentai_v1beta3\n", + "\n", + " # Create a client\n", + " client = documentai_v1beta3.DocumentServiceClient()\n", + "\n", + " # Initialize request argument(s)\n", + " request = 
documentai_v1beta3.UpdateDatasetSchemaRequest(\n", + " dataset_schema={\"name\": schema.name, \"document_schema\": schema.document_schema}\n", + " )\n", + "\n", + " # Make the request\n", + " response = client.update_dataset_schema(request=request)\n", + "\n", + " # Handle the response\n", + " return response\n", + "\n", + "\n", + "response_update = update_dataset_schema(response_newprocessor)" + ] + }, + { + "cell_type": "markdown", + "id": "043b9c7c-83f5-49c1-84ef-1cddb6c1dbf6", + "metadata": {}, + "source": [ + "### Output \n", + "* The schema of new processor will be updated as per spreadsheet given" + ] + } + ], + "metadata": { + "environment": { + "kernel": "python3", + "name": "common-cpu.m112", + "type": "gcloud", + "uri": "gcr.io/deeplearning-platform-release/base-cpu:m112" + }, + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.7.12" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/incubator-tools/Export_import_document_schema_from_processor/Images/Exported_schema.png b/incubator-tools/Export_import_document_schema_from_processor/Images/Exported_schema.png new file mode 100644 index 00000000..cf55098f Binary files /dev/null and b/incubator-tools/Export_import_document_schema_from_processor/Images/Exported_schema.png differ diff --git a/incubator-tools/README.md b/incubator-tools/README.md index 1c931e7f..4da5c657 100644 --- a/incubator-tools/README.md +++ b/incubator-tools/README.md @@ -32,4 +32,41 @@ Folder contains various tools which is made for the benefit of Doc AI users. 
* [Import and Evaluator Processors](./importing_processor_and_evaluating_with_alternate_test_sets/) * [Labeled Dataset Validation](./labeled_dataset_validation/) * [Split Overlapping Entities](./overlapping_split/) -* [Rename Entity Type](./rename_entity_type/) \ No newline at end of file +* [Rename Entity Type](./rename_entity_type/) + +* [Combine Two Processors Output](./Combine_two_processors_output/) +* [DocAI Json to Canonical Json Conversion](./DocAI_Json_to_Canonical_Json_Conversion/) +* [Export and Import Document Schema from Processor](./Export_import_document_schema_from_processor/) +* [Asynchronous API Reference Architecture](./Reference_architecture_asynchronous/) +* [Advance Table Line Enhancement](./advance_table_line_enhancement/) +* [Backmapping Entities from Parser Output Language to Original Language of the Document](./backmapping_entities_from_parser_output_to_original_language/) +* [Bank Statement Post Processing Tool](./bank_statement_post_processing_tool/) +* [Bank Statements Line Item Improver and Missing Items Finder](./bank_statements_line_items_improver_and_missing_items_finder/) +* [Categorizing Bank Statement Transactions by Account Number](./categorizing_bank_statement_transactions_by_account_number/) +* [Comparison between Custom Document Classifier Ground Truth and Parsed Json Prediction Results](./cdc_comparison/) +* [CMEK Key Creation and Destroying Procedure](./cmek_docai_processor/) +* [Date Entity Normalization](./date_entity_normalization/) +* [PDF Clustering Analysis Tool](./docai_pdf_clustering_analysis_tool/) +* [Document AI Processor Types](./docai_processor_types/) +* [Schema from Form Parser Output](./document-schema-from-form-parser-output/) +* [Migrating Schema Between the Processors](./documentai_migrating_schema_between_processors/) +* [Enrich the Address for Invioce Parser](./enrich_address_for_invoice/) +* [Entity Sorting using Csharp](./entity_sorting_csharp/) +* [Entity Sorting using Python](./entity_sorting_python/) +* [Formparser Table to Entity Converter Tool](./formparser_table_to_entity_converter_tool/) +* [HITL Line Item Prefix Issues](./hitl_line_item_prefix_issue/) +* [Identity Document Proofing Evaluation](./identity_document_proofing_evaluation/) +* [Normalize Date Entities from 19xx to 20xx](./normalize_date_value_19xx_to_20xx/) +* [OCR Based Document Section Splitter](./ocr_based_document_section_splitter/) +* [Reprocess Old OCR Json to New OCR Engine](./old_ocr_to_new_ocr_conversion/) +* [Seperation of Paragraphs in a Document](./paragraph_separation/) +* [Replace PII Data with Synthetic Data](./pii_synthetic_redaction_tool/) +* [Post Processing Negative Values](./post_processing_negative_values/) +* [Annotating the Entities in the Document based on the OCR Tokens](./reverse_annotation_tool/) +* [Schema Converter Tool](./schema_converter_tool/) +* [Signature Detection Technique](./signature-detection-technique/) +* [Special Character Removal](./special_character_removal/) +* [Tagging Line Items in a Specific Format](./specific_format_line_items_tagging/) +* [Labeling Documents through Custom Document Splitter Parser using Synonyms List](./synonyms_based_splitter_document_labeling/) +* [Tagging Entity Synonyms](./synonyms_entity_tag/) +* [Vertex Object Detection Visualization](./vertex_object_detection_visualization/) diff --git a/incubator-tools/combine_two_processors_output/combine_two_processor_output.ipynb b/incubator-tools/combine_two_processors_output/combine_two_processor_output.ipynb new file mode 100644 index 
00000000..55b8a02b --- /dev/null +++ b/incubator-tools/combine_two_processors_output/combine_two_processor_output.ipynb @@ -0,0 +1,652 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "1af5ebbb-4a9c-4c9a-8ae8-d9574c754397", + "metadata": {}, + "source": [ + "# Combine Two processor Output Tool" + ] + }, + { + "cell_type": "markdown", + "id": "3dde429f-5406-4a57-a6f9-d6adf8ae3c5b", + "metadata": {}, + "source": [ + "* Author: docai-incubator@google.com" + ] + }, + { + "cell_type": "markdown", + "id": "21aa9549-e3a4-47b1-9648-c95e2a02eea6", + "metadata": {}, + "source": [ + "## Disclaimer\n", + "\n", + "This tool is not supported by the Google engineering team or product team. It is provided and supported on a best-effort basis by the **DocAI Incubator Team**. No guarantees of performance are implied." + ] + }, + { + "cell_type": "markdown", + "id": "0f6dc7e8-a4a9-4279-8271-000cabe72bf7", + "metadata": {}, + "source": [ + "## Objective\n", + "The objective of the tooling is to efficiently integrate the output of one AI processor (proto) with another. This integration results in a comprehensive final output that reflects the combined capabilities of both parsers. Technically, this process involves sending the document proto object from one parser to the next." + ] + }, + { + "cell_type": "markdown", + "id": "cd252261-d116-4f31-9b46-c19b14bba143", + "metadata": {}, + "source": [ + "## Prerequisites \n", + "* Python : Jupyter notebook (Vertex AI) or Google Colab \n", + "* Permission to the Google project and Document AI \n", + "* Input PDF Files" + ] + }, + { + "cell_type": "markdown", + "id": "4d8382ad-e669-4f07-87ae-91a09895eacb", + "metadata": {}, + "source": [ + "## Sync Code" + ] + }, + { + "cell_type": "markdown", + "id": "20289168-5b69-4a39-9bb7-1bd8f0bb7047", + "metadata": {}, + "source": [ + "### Importing Required Modules" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "c1c7ea6e-3c37-499d-84b7-989924e233ad", + "metadata": {}, + "outputs": [], + "source": [ + "import os\n", + "from google.api_core.client_options import ClientOptions\n", + "import google.auth.transport.requests\n", + "from google import auth\n", + "from google.cloud import documentai\n", + "from google.cloud import storage\n", + "import requests\n", + "import json\n", + "import mimetypes\n", + "from typing import List, Tuple" + ] + }, + { + "cell_type": "markdown", + "id": "856cd580-126e-40b2-85c7-c5144908361f", + "metadata": {}, + "source": [ + "### Setup the required inputs" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "f5b258c2-d255-4755-b528-ba0f1697795d", + "metadata": {}, + "outputs": [], + "source": [ + "# First and Second Processor Configuration Details\n", + "# Replace with your Google Cloud Project ID\n", + "PROJECT_ID = \"\" # e.g., \"project-123\"\n", + "# Specify the location for the first processor\n", + "LOCATION = \"\" # e.g., \"us\"\n", + "# Replace with the ID of your first processor\n", + "PROCESSOR_ID = \"\" # e.g., \"1234abcd\"\n", + "\n", + "# The MIME type for the files to be processed\n", + "MIME_TYPE = \"application/pdf\" # Keep as-is if processing PDF files\n", + "\n", + "# Configuration for the second processor\n", + "# Replace with your Google Cloud Project ID for the second processor\n", + "PROJECT_ID_2 = \"\" # e.g., \"project-456\"\n", + "# Specify the location for the second processor\n", + "LOCATION_2 = \"\" # e.g., \"us\"\n", + "# Replace with the ID of your second processor\n", + "PROCESSOR_ID_2 = \"\" # e.g., \"5678efgh\"\n", 
+ "\n", + "# Google Cloud Storage Bucket Paths\n", + "# Specify the path to the input PDF files\n", + "input_path = \"\" # e.g., \"bucket/input_pdf/\" gs:// is not required at the beginning.\n", + "# Specify the path for output from the first parser\n", + "output_path1 = \"\" # e.g., \"bucket/first_parser_output\" gs:// is not required at the beginning.\n", + "# Specify the path for output from the second parser\n", + "output_path2 = \"\" # e.g., \"bucket/second_parser_output\" gs:// is not required at the beginning." + ] + }, + { + "cell_type": "markdown", + "id": "adc6af03-68d9-474d-8124-381110db8902", + "metadata": {}, + "source": [ + "* `PROJECT_ID :` Google Cloud Project ID for the first processor.\n", + "* `LOCATION :` Google Cloud project location for the first processor.\n", + "* `PROCESSOR_ID :` Processor ID from the first processor.\n", + "* `MIME_TYPE :` The MIME type for the files to be processed.\n", + "* `PROJECT_ID_2 :` Google Cloud Project ID for the second processor.\n", + "* `LOCATION_2 :` Google Cloud project location for the second processor.\n", + "* `PROCESSOR_ID_2 :` Processor ID from the second processor.\n", + "* `input_path :` The path to the input PDF files.\n", + "* `output_path1 :` The path for output from the first parser.\n", + "* `output_path2 :` The path for output from the second parser." + ] + }, + { + "cell_type": "markdown", + "id": "dcc8f035-59b9-46a6-b2db-73e44ab7e372", + "metadata": {}, + "source": [ + "### Run the Code" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "5186722b-524d-4ef0-8efc-952309aa6878", + "metadata": {}, + "outputs": [], + "source": [ + "# get credentials of current user / service account\n", + "def get_access_token() -> str:\n", + " \"\"\"\n", + " Retrieves the access token for authentication.\n", + "\n", + " Returns:\n", + " str: The access token.\n", + " \"\"\"\n", + "\n", + " credentials, _ = auth.default()\n", + " credentials.refresh(google.auth.transport.requests.Request())\n", + " return credentials.token\n", + "\n", + "\n", + "def list_files(bucket_name: str, prefix: str) -> List[str]:\n", + " \"\"\"\n", + " Lists all files in a Google Cloud Storage (GCS) bucket with the given prefix.\n", + "\n", + " Args:\n", + " bucket_name (str): The name of the GCS bucket.\n", + " prefix (str): The prefix to filter files in the bucket.\n", + "\n", + " Returns:\n", + " List[str]: A list of file names in the bucket with the specified prefix.\n", + " \"\"\"\n", + "\n", + " storage_client = storage.Client()\n", + " blobs = storage_client.list_blobs(bucket_name, prefix=prefix)\n", + " return [blob.name for blob in blobs]\n", + "\n", + "\n", + "def download_blob(\n", + " bucket_name: str, source_blob_name: str, destination_file_name: str\n", + ") -> str:\n", + " \"\"\"\n", + " Downloads a blob from a GCS bucket.\n", + "\n", + " Args:\n", + " bucket_name (str): The name of the GCS bucket.\n", + " source_blob_name (str): The name of the source blob.\n", + " destination_file_name (str): The name of the destination file.\n", + "\n", + " Returns:\n", + " str: The path to the downloaded file.\n", + " \"\"\"\n", + "\n", + " storage_client = storage.Client()\n", + " bucket = storage_client.bucket(bucket_name)\n", + " blob = bucket.blob(source_blob_name)\n", + " blob.download_to_filename(destination_file_name)\n", + " return destination_file_name\n", + "\n", + "\n", + "def upload_blob(\n", + " bucket_name: str, source_file_name: str, destination_blob_name: str\n", + ") -> None:\n", + " \"\"\"\n", + " Uploads a file to a GCS 
bucket.\n", + "\n", + " Args:\n", + " bucket_name (str): The name of the GCS bucket.\n", + " source_file_name (str): The name of the source file.\n", + " destination_blob_name (str): The name of the destination blob.\n", + " \"\"\"\n", + "\n", + " storage_client = storage.Client()\n", + " bucket = storage_client.bucket(bucket_name)\n", + " blob = bucket.blob(destination_blob_name)\n", + " blob.upload_from_filename(source_file_name)\n", + "\n", + "\n", + "def get_bucket_and_prefix(full_path: str) -> Tuple[str, str]:\n", + " \"\"\"\n", + " Extracts the bucket name and prefix from a full path.\n", + "\n", + " Args:\n", + " full_path (str): The full path containing the bucket name and prefix.\n", + "\n", + " Returns:\n", + " Tuple[str, str]: A tuple containing the bucket name and prefix.\n", + " \"\"\"\n", + "\n", + " parts = full_path.split(\"/\")\n", + " bucket_name = parts[0]\n", + " prefix = \"/\".join(parts[1:])\n", + " return bucket_name, prefix\n", + "\n", + "\n", + "def second_processer_calling(\n", + " document: object, PROJECT_ID_2: str, LOCATION_2: str, PROCESSOR_ID_2: str\n", + ") -> dict:\n", + " \"\"\"\n", + " Calls the second Document AI processor to process the document.\n", + "\n", + " Args:\n", + " document (object): The document to be processed.\n", + " PROJECT_ID_2 (str): The Google Cloud project ID.\n", + " LOCATION_2 (str): The location of the Document AI processor.\n", + " PROCESSOR_ID_2 (str): The ID of the Document AI processor.\n", + "\n", + " Returns:\n", + " dict: The JSON response from the second processor.\n", + " \"\"\"\n", + "\n", + " print(\"Processing through Second Parser\")\n", + " url = f\"https://us-documentai.googleapis.com/v1/projects/{PROJECT_ID_2}/locations/{LOCATION_2}/processors/{PROCESSOR_ID_2}:process\"\n", + " headers = {\"Authorization\": f\"Bearer {get_access_token()}\"}\n", + " json_data = documentai.Document.to_json(document)\n", + " json_data_dict = json.loads(json_data) # Parse the JSON string to a dictionary\n", + "\n", + " create_process_request = {\"inlineDocument\": json_data_dict}\n", + " create_processor_response = requests.post(\n", + " url, headers=headers, json=create_process_request\n", + " )\n", + " create_processor_response.raise_for_status()\n", + " json_object = create_processor_response.json()\n", + " return json_object[\"document\"]\n", + "\n", + "\n", + "def online_process(\n", + " project_id: str,\n", + " location: str,\n", + " processor_id: str,\n", + " file_content: bytes,\n", + " mime_type: str,\n", + ") -> object:\n", + " \"\"\"\n", + " Process a document with the given MIME type and content using the Document AI processor.\n", + "\n", + " Args:\n", + " project_id (str): The Google Cloud project ID.\n", + " location (str): The location of the Document AI processor.\n", + " processor_id (str): The ID of the Document AI processor.\n", + " file_content (bytes): The content of the document.\n", + " mime_type (str): The MIME type of the document content.\n", + "\n", + " Returns:\n", + " object: The processed Document AI document.\n", + " \"\"\"\n", + "\n", + " print(\"Processing through First Parser\")\n", + " opts = {\"api_endpoint\": f\"{location}-documentai.googleapis.com\"}\n", + " documentai_client = documentai.DocumentProcessorServiceClient(client_options=opts)\n", + " resource_name = documentai_client.processor_path(project_id, location, processor_id)\n", + "\n", + " # Load Binary Data into Document AI RawDocument Object\n", + " raw_document = documentai.RawDocument(content=file_content, mime_type=mime_type)\n", + 
"\n", + " # Configure the process request\n", + " request = documentai.ProcessRequest(name=resource_name, raw_document=raw_document)\n", + "\n", + " # Use the Document AI client to process the document\n", + " result = documentai_client.process_document(request=request)\n", + "\n", + " return result.document\n", + "\n", + "\n", + "def process_files_in_bucket(input_path: str, output_path1: str, output_path2: str):\n", + " \"\"\"\n", + " Processes files in a bucket and saves output to other buckets.\n", + "\n", + " Args:\n", + " input_path (str): The path to the input bucket and prefix.\n", + " output_path1 (str): The path to the output bucket and prefix for the first parser output.\n", + " output_path2 (str): The path to the output bucket and prefix for the second parser output.\n", + " \"\"\"\n", + "\n", + " storage_client = storage.Client()\n", + " input_bucket_name, input_prefix = get_bucket_and_prefix(input_path)\n", + " output_bucket1_name, output_prefix1 = get_bucket_and_prefix(output_path1)\n", + " output_bucket2_name, output_prefix2 = get_bucket_and_prefix(output_path2)\n", + " input_bucket = storage_client.bucket(input_bucket_name)\n", + "\n", + " for blob in input_bucket.list_blobs(prefix=input_prefix):\n", + " if (\n", + " not blob.name.endswith(\"/\") and blob.name != input_prefix\n", + " ): # Skip directories and the prefix itself\n", + " file_name = blob.name\n", + " content = blob.download_as_bytes()\n", + "\n", + " # Removing the original extension and appending .json\n", + " base_file_name = os.path.splitext(os.path.basename(file_name))[0] + \".json\"\n", + "\n", + " print(\"Processing file:\", file_name) # Debug print\n", + " mime_type = mimetypes.guess_type(file_name)[0] or \"application/octet-stream\"\n", + " print(\"Detected MIME type:\", mime_type) # Debug print\n", + "\n", + " # Process with first parser\n", + " first_parser_output = online_process(\n", + " PROJECT_ID, LOCATION, PROCESSOR_ID, content, mime_type\n", + " )\n", + " first_parser_output_json = json.loads(\n", + " documentai.Document.to_json(first_parser_output)\n", + " )\n", + "\n", + " # Save first parser output to output_bucket1\n", + " output_blob1 = storage.Blob(\n", + " output_prefix1 + \"/\" + base_file_name,\n", + " storage_client.bucket(output_bucket1_name),\n", + " )\n", + " output_blob1.upload_from_string(\n", + " json.dumps(first_parser_output_json, indent=2),\n", + " content_type=\"application/json\",\n", + " )\n", + "\n", + " # Process with second parser\n", + " second_parser_output = second_processer_calling(\n", + " first_parser_output, PROJECT_ID_2, LOCATION_2, PROCESSOR_ID_2\n", + " )\n", + "\n", + " # Save second parser output to output_bucket2\n", + " output_blob2 = storage.Blob(\n", + " output_prefix2 + \"/\" + base_file_name,\n", + " storage_client.bucket(output_bucket2_name),\n", + " )\n", + " output_blob2.upload_from_string(\n", + " json.dumps(second_parser_output, indent=2),\n", + " content_type=\"application/json\",\n", + " )\n", + "\n", + "\n", + "# Call the function\n", + "process_files_in_bucket(input_path, output_path1, output_path2)\n", + "print(\"Done\")" + ] + }, + { + "cell_type": "markdown", + "id": "d20aa173-4dbc-415d-bf1c-887f5f57b4ab", + "metadata": {}, + "source": [ + "### Async Code" + ] + }, + { + "cell_type": "markdown", + "id": "4cd56f6f-d520-46a1-80da-9adb38d74f7d", + "metadata": {}, + "source": [ + "### Importing Required Modules" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "d8fed567-6c79-4fc9-b9a6-8c1d416e5f49", + "metadata": {}, + 
"outputs": [], + "source": [ + "import os\n", + "import google.auth.transport.requests\n", + "from google import auth\n", + "from google.cloud import documentai\n", + "from google.cloud import storage\n", + "import requests\n", + "import json\n", + "import mimetypes\n", + "import asyncio\n", + "import aiohttp" + ] + }, + { + "cell_type": "markdown", + "id": "9c3c0fe2-3063-42ba-9415-571e8a503950", + "metadata": {}, + "source": [ + "### Run the Code" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "0b554003-baad-4848-a36a-1075d27e0600", + "metadata": {}, + "outputs": [], + "source": [ + "# get credentials of current user / service account\n", + "def get_access_token() -> str:\n", + " \"\"\"\n", + " Retrieves the access token for authentication.\n", + "\n", + " Returns:\n", + " str: The access token.\n", + " \"\"\"\n", + "\n", + " credentials, _ = auth.default()\n", + " credentials.refresh(google.auth.transport.requests.Request())\n", + " return credentials.token\n", + "\n", + "\n", + "def batch_process_documents(\n", + " project_id: str,\n", + " location: str,\n", + " processor_id: str,\n", + " gcs_input_uri: str,\n", + " gcs_output_uri: str,\n", + " timeout: int = 6000,\n", + ") -> object:\n", + " \"\"\"\n", + " Batch process documents using the Document AI processor.\n", + "\n", + " Args:\n", + " project_id (str): The Google Cloud project ID.\n", + " location (str): The location of the Document AI processor.\n", + " processor_id (str): The ID of the Document AI processor.\n", + " gcs_input_uri (str): The URI of the input documents in Google Cloud Storage.\n", + " gcs_output_uri (str): The URI of the output documents in Google Cloud Storage.\n", + " timeout (int): Timeout for the operation in seconds. Defaults to 6000.\n", + "\n", + " Returns:\n", + " documentai.BatchDocumentsResponse: The response object containing the batch processing operation result.\n", + " \"\"\"\n", + "\n", + " from google.cloud import documentai_v1beta3 as documentai\n", + "\n", + " # You must set the api_endpoint if you use a location other than 'us', e.g.:\n", + " opts = {}\n", + " if location == \"eu\":\n", + " opts = {\"api_endpoint\": \"eu-documentai.googleapis.com\"}\n", + " elif location == \"us\":\n", + " opts = {\"api_endpoint\": \"us-documentai.googleapis.com\"}\n", + " # opts = {\"api_endpoint\": \"us-autopush-documentai.sandbox.googleapis.com\"}\n", + " client = documentai.DocumentProcessorServiceClient(client_options=opts)\n", + "\n", + " destination_uri = f\"{gcs_output_uri}/\"\n", + "\n", + " input_config = documentai.BatchDocumentsInputConfig(\n", + " gcs_prefix=documentai.GcsPrefix(gcs_uri_prefix=gcs_input_uri)\n", + " )\n", + "\n", + " # Where to write results\n", + " output_config = documentai.DocumentOutputConfig(\n", + " gcs_output_config={\"gcs_uri\": destination_uri}\n", + " )\n", + "\n", + " # Location can be 'us' or 'eu'\n", + " name = f\"projects/{project_id}/locations/{location}/processors/{processor_id}\"\n", + " request = documentai.types.document_processor_service.BatchProcessRequest(\n", + " name=name,\n", + " input_documents=input_config,\n", + " document_output_config=output_config,\n", + " )\n", + "\n", + " operation = client.batch_process_documents(request)\n", + "\n", + " # Wait for the operation to finish\n", + " operation.result(timeout=timeout)\n", + " return operation" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "3dd889c2-fcb9-4072-a30e-0e599841e0d9", + "metadata": {}, + "outputs": [], + "source": [ + "res = 
batch_process_documents(\n", + " project_id=PROJECT_ID,\n", + " location=LOCATION,\n", + " processor_id=PROCESSOR_ID,\n", + " gcs_input_uri=f\"gs://{input_path}\",\n", + " gcs_output_uri=f\"gs://{output_path1}\",\n", + ")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "f160c3e6-fd5e-474e-ab37-4074b580e96a", + "metadata": {}, + "outputs": [], + "source": [ + "async def second_processer_calling(\n", + " document: object,\n", + " PROJECT_ID_2: str,\n", + " LOCATION_2: str,\n", + " PROCESSOR_ID_2: str,\n", + " session: aiohttp.ClientSession,\n", + ") -> object:\n", + " \"\"\"\n", + " Asynchronously calls the second Document AI processor.\n", + "\n", + " Args:\n", + " document (object): The document to process.\n", + " PROJECT_ID_2 (str): The Google Cloud project ID for the second processor.\n", + " LOCATION_2 (str): The location of the second processor.\n", + " PROCESSOR_ID_2 (str): The ID of the second processor.\n", + " session (aiohttp.ClientSession): The aiohttp session for making HTTP requests.\n", + "\n", + " Returns:\n", + " object: The processed document.\n", + " \"\"\"\n", + "\n", + " print(\"Processing through Second Parser\")\n", + " url = f\"https://us-documentai.googleapis.com/v1/projects/{PROJECT_ID_2}/locations/{LOCATION_2}/processors/{PROCESSOR_ID_2}:process\"\n", + " headers = {\"Authorization\": f\"Bearer {get_access_token()}\"}\n", + " # json_data = documentai.Document.to_json(document)\n", + " # json_data_dict = json.loads(json_data)\n", + "\n", + " create_process_request = {\"inlineDocument\": document}\n", + " async with session.post(\n", + " url, headers=headers, json=create_process_request\n", + " ) as response:\n", + " response.raise_for_status()\n", + " json_object = await response.json()\n", + " return json_object[\"document\"]\n", + "\n", + "\n", + "async def save_to_gcs(bucket: str, blob_name: str, data: str) -> None:\n", + " \"\"\"\n", + " Asynchronously saves data to Google Cloud Storage.\n", + "\n", + " Args:\n", + " bucket (str): The Google Cloud Storage bucket.\n", + " blob_name (str): The name of the blob.\n", + " data (str): The data to save.\n", + " \"\"\"\n", + "\n", + " blob = bucket.blob(blob_name)\n", + " blob.upload_from_string(data)\n", + "\n", + "\n", + "async def main():\n", + " # Splitting the bucket name and prefix from output_path1\n", + " bucket_name, prefix = output_path1.split(\"/\", 1)\n", + "\n", + " # Fetch JSON from output_path1 and its subfolders\n", + " storage_client = storage.Client()\n", + " bucket = storage_client.get_bucket(bucket_name)\n", + " blobs = bucket.list_blobs(prefix=prefix)\n", + "\n", + " # Initialize aiohttp session\n", + " async with aiohttp.ClientSession() as session:\n", + " tasks = []\n", + " original_filenames = []\n", + "\n", + " for blob in blobs:\n", + " if blob.name.endswith(\".json\"):\n", + " json_string = blob.download_as_string()\n", + " document = json.loads(json_string)\n", + " original_filenames.append(os.path.basename(blob.name))\n", + "\n", + " # Schedule asynchronous processing\n", + " task = asyncio.ensure_future(\n", + " second_processer_calling(\n", + " document, PROJECT_ID_2, LOCATION_2, PROCESSOR_ID_2, session\n", + " )\n", + " )\n", + " tasks.append(task)\n", + "\n", + " # Wait for all tasks to complete\n", + " results = await asyncio.gather(*tasks)\n", + "\n", + " # Save results to output_path2 with the same original file names\n", + " _, output_prefix = output_path2.split(\"/\", 1)\n", + " for filename, result in zip(original_filenames, results):\n", + " result_json = 
json.dumps(result)\n", + " blob_name = f\"{output_prefix}/{filename}\"\n", + " await save_to_gcs(bucket, blob_name, result_json)\n", + "\n", + "\n", + "await main()" + ] + } + ], + "metadata": { + "environment": { + "kernel": "python3", + "name": "common-cpu.m112", + "type": "gcloud", + "uri": "gcr.io/deeplearning-platform-release/base-cpu:m112" + }, + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.7.12" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/incubator-tools/date_entity_normalization/date_entity_normalization.ipynb b/incubator-tools/date_entity_normalization/date_entity_normalization.ipynb new file mode 100644 index 00000000..4e29c612 --- /dev/null +++ b/incubator-tools/date_entity_normalization/date_entity_normalization.ipynb @@ -0,0 +1,367 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "05597417-c7d3-4132-99d9-76591e72b58b", + "metadata": {}, + "source": [ + "# Date Entity Normalization\n" + ] + }, + { + "cell_type": "markdown", + "id": "ad09b57c-8166-4662-b33d-dd883bcd4b2e", + "metadata": {}, + "source": [ + "* Author: docai-incubator@google.com\n" + ] + }, + { + "cell_type": "markdown", + "id": "45325f56-279d-480d-af14-f138181d3efa", + "metadata": {}, + "source": [ + "## Disclaimer\n", + "\n", + "This tool is not supported by the Google engineering team or product team. It is provided and supported on a best-effort basis by the DocAI Incubator Team. No guarantees of performance are implied." + ] + }, + { + "cell_type": "markdown", + "id": "89cc7ea6-79b7-4905-a77f-6f5dec1de471", + "metadata": {}, + "source": [ + "## Purpose and Description\n", + "\n", + "This tool updates the values of normalized dates in entities within the Document AI JSON output. It aids in identifying the actual date format, such as MM/DD/YYYY or DD/MM/YYYY, through a heuristic approach. Upon successful identification, the tool updates all date values in the JSON to maintain a consistent format." + ] + }, + { + "cell_type": "markdown", + "id": "70169fe9-1709-4050-8f1c-3050cef19566", + "metadata": {}, + "source": [ + "## Prerequisites\n", + "\n", + "1. Vertex AI Notebook or Google Colab\n", + "2. GCS bucket for processing of the input json and output json\n" + ] + }, + { + "cell_type": "markdown", + "id": "174b2cf3-04ad-45e4-b0b0-b279990ea211", + "metadata": {}, + "source": [ + "## Step by Step procedure " + ] + }, + { + "cell_type": "markdown", + "id": "8983f13d-91d8-48cc-8f79-87c1cf309cc3", + "metadata": {}, + "source": [ + "### 1. Install the required libraries" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "2f1fa99b-07ba-427c-8c03-520efbb81161", + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "%pip install google-cloud-storage\n", + "%pip install google-cloud-documentai" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "b2bd0830-3222-4300-b63b-fd21c7b9aa01", + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "!wget https://raw.githubusercontent.com/GoogleCloudPlatform/document-ai-samples/main/incubator-tools/best-practices/utilities/utilities.py" + ] + }, + { + "cell_type": "markdown", + "id": "8371a9df-8297-4424-9be0-bfa7cfb9485f", + "metadata": {}, + "source": [ + "### 2. 
Import the required libraries/Packages" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "ccb6b464-fa6e-42ba-8e1a-feade406fdbf", + "metadata": {}, + "outputs": [], + "source": [ + "import json\n", + "import os\n", + "import re\n", + "from datetime import datetime\n", + "from tqdm import tqdm\n", + "from pathlib import Path\n", + "from typing import Dict, List, Union, Optional, Tuple\n", + "from google.cloud import documentai_v1beta3 as documentai\n", + "from google.cloud import storage\n", + "from utilities import (\n", + " file_names,\n", + " documentai_json_proto_downloader,\n", + " store_document_as_json,\n", + ")\n", + "from pprint import pprint" + ] + }, + { + "cell_type": "markdown", + "id": "c70e438f-c525-48ef-8de3-fdeda1294a7f", + "metadata": {}, + "source": [ + "### 3. Input Details" + ] + }, + { + "cell_type": "markdown", + "id": "f6a948aa-b3fb-4b5e-bfd3-058f5e9a38cb", + "metadata": {}, + "source": [ + "
    \n", + "
  • input_path : GCS path of the folder containing the DocAI-processed output JSON files. This bucket is used both for reading the input files and for saving the output files.
  • \n", + "
  • output_path : GCS URI of the folder where the output JSON files will be stored.
  • \n", + "
" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "74083216-d1d6-44a6-9b29-967593fb53c2", + "metadata": {}, + "outputs": [], + "source": [ + "input_path = (\n", + " \"gs://{bucket_name}/{folder_path}\" # Path to your Document AI input JSON files.\n", + ")\n", + "output_path = \"gs://{bucket_name}/{folder_path}\" # Path where Vertex AI output merged JSON files will be saved." + ] + }, + { + "cell_type": "markdown", + "id": "28716e4e-210e-4f42-9a0f-fc7d97310567", + "metadata": {}, + "source": [ + "### 4.Execute the code" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "0fe73f8a-19ff-4025-a082-685ddeb54fc1", + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "input_storage_bucket_name = input_path.split(\"/\")[2]\n", + "input_bucket_path_prefix = \"/\".join(input_path.split(\"/\")[3:])\n", + "output_storage_bucket_name = output_path.split(\"/\")[2]\n", + "output_bucket_path_prefix = \"/\".join(output_path.split(\"/\")[3:])\n", + "\n", + "\n", + "def identify_and_convert_date_format(\n", + " mention_text: str, known_format: Optional[str] = None\n", + ") -> Tuple[Optional[datetime], str]:\n", + " \"\"\"\n", + " This function attempts to identify and convert a date string to a datetime object.\n", + "\n", + " Args:\n", + " mention_text: The text string potentially containing a date.\n", + " known_format: (Optional) A specific date format string to try first (e.g., \"%Y-%m-%d\").\n", + "\n", + " Returns:\n", + " A tuple containing two elements:\n", + " - The converted datetime object (or None if not successful).\n", + " - The identified date format string (or \"N/A\" if not found).\n", + " \"\"\"\n", + "\n", + " formats = [\"%d/%m/%Y\", \"%m/%d/%Y\"]\n", + " if known_format:\n", + " formats.insert(0, known_format)\n", + "\n", + " for fmt in formats:\n", + " try:\n", + " date_obj = datetime.strptime(mention_text, fmt)\n", + " return date_obj, fmt\n", + " except ValueError:\n", + " continue\n", + " return None, \"N/A\"\n", + "\n", + "\n", + "def process_json_files(\n", + " list_of_files: List[str],\n", + " input_storage_bucket_name: str,\n", + " output_storage_bucket_name: str,\n", + " output_bucket_path_prefix: str,\n", + ") -> None:\n", + " \"\"\"\n", + " Processes a list of JSON files, converting dates within entities to ISO 8601 format and storing the updated JSON data in a specified output bucket.\n", + "\n", + " Args:\n", + " list_of_files: List of file paths for the JSON files to process (type: List[str]).\n", + " input_storage_bucket_name: Name of the input storage bucket (type: str).\n", + " output_storage_bucket_name: Name of the output storage bucket (type: str).\n", + " output_bucket_path_prefix: Prefix for the output file paths (type: str).\n", + "\n", + " Returns:\n", + " None\n", + " \"\"\"\n", + " all_json_data = []\n", + "\n", + " for k in tqdm(range(0, len(list_of_files))):\n", + " print(\"***************\")\n", + " file_name = list_of_files[k].split(\"/\")[-1]\n", + " print(f\"File Name {file_name}\")\n", + " json_proto_data = documentai_json_proto_downloader(\n", + " input_storage_bucket_name, list_of_files[k]\n", + " )\n", + " for ind, ent in enumerate(json_proto_data.entities):\n", + " if \"date\" in ent.type:\n", + " print(\"---------------\")\n", + " mention_text = ent.mention_text if hasattr(ent, \"mention_text\") else \"\"\n", + " normalized_value = (\n", + " ent.normalized_value if hasattr(ent, \"normalized_value\") else \"\"\n", + " )\n", + " type_ = ent.type if hasattr(ent, \"type\") else \"\"\n", + " 
print(f\"Type: {type_}\")\n", + " print(f\"Mention Text: {mention_text}\")\n", + " print(f\"Old Normalized Value: {normalized_value}\")\n", + "\n", + " date_obj, identified_format = identify_and_convert_date_format(\n", + " mention_text\n", + " )\n", + "\n", + " json_data = json.loads(documentai.Document.to_json(json_proto_data))\n", + "\n", + " ent_json = json_data[\"entities\"][ind]\n", + "\n", + " if date_obj:\n", + " new_date_text_iso = date_obj.strftime(\"%Y-%m-%d\")\n", + " ent_json[\"normalizedValue\"][\"text\"] = new_date_text_iso\n", + " ent_json[\"normalizedValue\"][\"dateValue\"] = {\n", + " \"day\": date_obj.day,\n", + " \"month\": date_obj.month,\n", + " \"year\": date_obj.year,\n", + " }\n", + " if identified_format != \"N/A\":\n", + " ent_json[\"identified_format\"] = identified_format\n", + " else:\n", + " if identified_format == \"N/A\" and any(\n", + " e\n", + " for e in json_data[\"entities\"]\n", + " if e[\"type\"] == \"date\" and \"identified_format\" in e\n", + " ):\n", + " known_format = next(\n", + " (\n", + " e[\"identified_format\"]\n", + " for e in json_data[\"entities\"]\n", + " if e[\"type\"] == \"date\" and \"identified_format\" in e\n", + " ),\n", + " None,\n", + " )\n", + " date_obj, identified_format = identify_and_convert_date_format(\n", + " mention_text, known_format=known_format\n", + " )\n", + " if date_obj:\n", + " new_date_text_iso = date_obj.strftime(\"%Y-%m-%d\")\n", + " ent_json[\"normalizedValue\"][\"text\"] = new_date_text_iso\n", + " ent_json[\"normalizedValue\"][\"dateValue\"] = {\n", + " \"day\": date_obj.day,\n", + " \"month\": date_obj.month,\n", + " \"year\": date_obj.year,\n", + " }\n", + "\n", + " output_file_name = f\"{output_bucket_path_prefix}{file_name}\"\n", + " store_document_as_json(\n", + " json.dumps(json_data), output_storage_bucket_name, output_file_name\n", + " )\n", + "\n", + " print(\"--------------------\")\n", + " print(\"All files processed.\")\n", + "\n", + "\n", + "json_files = file_names(input_path)[1].values()\n", + "list_of_files = [i for i in list(json_files) if i.endswith(\".json\")]\n", + "process_json_files(\n", + " list_of_files,\n", + " input_storage_bucket_name,\n", + " output_storage_bucket_name,\n", + " output_bucket_path_prefix,\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "5b8baad3-4bd5-4626-81b8-b3b5587a5056", + "metadata": {}, + "source": [ + "### 5.Output" + ] + }, + { + "cell_type": "markdown", + "id": "a9cda194-9d54-4d42-9eb0-de225422e17f", + "metadata": {}, + "source": [ + "The post processed json field can be found in the storage path provided by the user during the script execution that is output_bucket_path.

\n", + "Comparison Between Input and Output File

\n", + "

Post processing results


\n", + "Upon running the post processing script against input data. The resultant output json data is obtained. The following image will show the difference date formate in the date filed
\n", + " \n", + "\n", + "" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "ace1b20b-6417-420e-b0dd-3841ecac2773", + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "environment": { + "kernel": "python3", + "name": "common-cpu.m112", + "type": "gcloud", + "uri": "gcr.io/deeplearning-platform-release/base-cpu:m112" + }, + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.7.12" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/incubator-tools/date_entity_normalization/images/output_image1.png b/incubator-tools/date_entity_normalization/images/output_image1.png new file mode 100644 index 00000000..6c3bbb3b Binary files /dev/null and b/incubator-tools/date_entity_normalization/images/output_image1.png differ diff --git a/incubator-tools/enrich_address_for_invoice/enrich_address_for_invoice.ipynb b/incubator-tools/enrich_address_for_invoice/enrich_address_for_invoice.ipynb new file mode 100644 index 00000000..d2496fc2 --- /dev/null +++ b/incubator-tools/enrich_address_for_invoice/enrich_address_for_invoice.ipynb @@ -0,0 +1,404 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "b9b561af-dfd4-43b5-9baa-36ba937ff124", + "metadata": {}, + "source": [ + "# Enrich Address for Invoice and Expense Documents\n", + " " + ] + }, + { + "cell_type": "markdown", + "id": "cb81020b-6cf8-486e-98d7-def30ce0490a", + "metadata": {}, + "source": [ + "* Author: docai-incubator@google.com" + ] + }, + { + "cell_type": "markdown", + "id": "455354e7-a77e-44ca-b44e-0425d8b59a3d", + "metadata": {}, + "source": [ + "## Disclaimer\n", + "\n", + "This tool is not supported by the Google engineering team or product team. It is provided and supported on a best-effort basis by the **DocAI Incubator Team**. No guarantees of performance are implied." + ] + }, + { + "cell_type": "markdown", + "id": "d9962c16-5cc6-4642-93d5-11d11b6b8de1", + "metadata": {}, + "source": [ + "# Objective\n", + "The tool facilitates a more detailed and accurate address parsing process. Detected addresses are broken down into their constituent parts, such as city, country, and ZIP code. The address data is enriched with additional relevant information, enhancing its overall usability." 
+ ] + }, + { + "cell_type": "markdown", + "id": "cb319168-3b6d-40e5-8681-fb47ad709556", + "metadata": {}, + "source": [ + "# Prerequisites\n", + "* Python : Jupyter notebook (Vertex AI).\n", + "\n", + "NOTE : \n", + " * The version of Python currently running in the Jupyter notebook should be greater than 3.8\n", + " * The normalizedValue attribute will be accessible exclusively in JSON file and is not visible in the processor.\n" + ] + }, + { + "cell_type": "markdown", + "id": "52acc46e-5e4c-4daa-ac9e-7031283f10a0", + "metadata": {}, + "source": [ + "# Step-by-Step Procedure" + ] + }, + { + "cell_type": "markdown", + "id": "389c1532-21cb-4950-8446-7cabc0394332", + "metadata": {}, + "source": [ + "## Import the libraries" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "36d8a5b9-f90e-4575-87ff-95810f84ccd8", + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "!pip install --upgrade google-cloud-aiplatform" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "412f3332-6473-457c-922a-d6eb26b395c0", + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "# Run this cell to download utilities module\n", + "!wget https://raw.githubusercontent.com/GoogleCloudPlatform/document-ai-samples/main/incubator-tools/best-practices/utilities/utilities.py" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "1ad36024-52db-4628-9a87-a1fcbdfdc66c", + "metadata": {}, + "outputs": [], + "source": [ + "import json\n", + "from pathlib import Path\n", + "from google.cloud import storage\n", + "import vertexai\n", + "from vertexai.language_models import TextGenerationModel\n", + "from utilities import file_names, store_document_as_json, blob_downloader" + ] + }, + { + "cell_type": "markdown", + "id": "0f6ea2a0-d3ea-4b2b-a2c1-0af6ec85f4d9", + "metadata": {}, + "source": [ + "## 2. Input Details" + ] + }, + { + "cell_type": "markdown", + "id": "052c4c0b-3872-40dd-8a4e-17d5e05f01b7", + "metadata": {}, + "source": [ + "* **PROJECT_ID** : It contains the project ID of the working project.\n", + "* **LOCATION** : It contains the location.\n", + "* **GCS_INPUT_PATH** : It contains the input jsons bucket path. \n", + "* **GCS_OUTPUT_PATH** : It contains the output bucket path where the updated jsons after adding the attribute will be stored.\n", + "* **ENTITY_NAME** : It contains the names of the entities which the user wants to split. \n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "c0b83232-34e7-4ff1-abab-ab68235ad29f", + "metadata": {}, + "outputs": [], + "source": [ + "PROJECT_ID = \"rand-automl-project\" # Your Google Cloud project ID.\n", + "LOCATION = \"us-central1\"\n", + "# '/' should be provided at the end of the path.\n", + "GCS_INPUT_PATH = \"gs://bucket_name/path/to/jsons/\"\n", + "# '/' should be provided at the end of the path.\n", + "GCS_OUTPUT_PATH = \"gs://bucket_name/path/to/jsons/\"\n", + "# Name of the entities in a list format.\n", + "ENTITY_NAME = [\"receiver_address\", \"remit_to_address\"]" + ] + }, + { + "cell_type": "markdown", + "id": "b6b658d0-d777-4bc6-8a94-2d7452b03db6", + "metadata": {}, + "source": [ + "## 3. 
Run Below Code-Cells" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "b98d18d7-9a44-4d0b-b743-25bc26010ecb", + "metadata": {}, + "outputs": [], + "source": [ + "def split_address_to_json(address: str, project_id: str, location: str) -> dict:\n", + " \"\"\"\n", + " Split an address into JSON format with specific keys using a text generation model.\n", + "\n", + " This function splits an address into JSON format with keys for streetAddress,\n", + " city, state, zipcode, and country\n", + " using a text generation model.\n", + "\n", + " Args:\n", + " address (str): The input address string to be split.\n", + " project_id (str): The project ID for the Vertex AI project.\n", + " location (str): The location of the Vertex AI project.\n", + "\n", + " Returns:\n", + " dict or None: A dictionary containing the JSON-formatted address if successful, else None.\n", + " \"\"\"\n", + " vertexai.init(project=project_id, location=location)\n", + " parameters = {\n", + " \"max_output_tokens\": 1024,\n", + " \"temperature\": 0.2,\n", + " \"top_p\": 0.8,\n", + " \"top_k\": 40,\n", + " }\n", + " model = TextGenerationModel.from_pretrained(\"text-bison@001\")\n", + " response = model.predict(\n", + " f\"\"\"Please split the address into Json format with keys\n", + " streetAddress, city, state, zipcode, country\n", + "\n", + " input: {address}\n", + " output:\n", + " \"\"\",\n", + " **parameters,\n", + " )\n", + "\n", + " # Extracting JSON response from the model\n", + " json_response = response.text\n", + "\n", + " try:\n", + " json_output = json.loads(json_response)\n", + " print(\"JSON OUTPUT\", json_output)\n", + " return json_output\n", + " except json.JSONDecodeError as e:\n", + " print(f\"Error decoding JSON response: {e}\")\n", + " print(\"Response from Model:\", response.text)\n", + " return None\n", + "\n", + "\n", + "def process_json_files(\n", + " list_of_files: list,\n", + " input_storage_bucket_name: str,\n", + " output_storage_bucket_name: str,\n", + " output_bucket_path_prefix: str,\n", + " project_id: str,\n", + " location: str,\n", + ") -> None:\n", + " \"\"\"\n", + " Process JSON files containing address entities, split the addresses,\n", + " and store the updated JSON files in Google Cloud Storage.\n", + "\n", + " This function iterates over a list of JSON files containing address entities,\n", + " splits the addresses into JSON format with keys for streetAddress,\n", + " city, state, zipcode, and country,\n", + " and stores the updated JSON files in a specified Google Cloud Storage bucket.\n", + "\n", + " Args:\n", + " list_of_files (list): A list of JSON file paths to be processed.\n", + " input_storage_bucket_name (str): The name of the input Google Cloud Storage bucket.\n", + " output_storage_bucket_name (str): The name of the output Google Cloud Storage bucket.\n", + " output_bucket_path_prefix (str): The prefix path within the output bucket\n", + " where the processed files will be stored.\n", + " project_id (str): The project ID for the Vertex AI project.\n", + " location (str): The location of the Vertex AI project.\n", + "\n", + " Returns:\n", + " None\n", + " \"\"\"\n", + "\n", + " for k, _ in enumerate(list_of_files):\n", + " print(\"***************\")\n", + " file_name = list_of_files[k].split(\"/\")[\n", + " -1\n", + " ] # Extracting the file name from the path\n", + " print(f\"File Name {file_name}\")\n", + " json_data = blob_downloader(input_storage_bucket_name, list_of_files[k])\n", + " for ent in json_data[\"entities\"]:\n", + " for name in 
ENTITY_NAME:\n", + " if name in ent[\"type\"]:\n", + " print(\"---------------\")\n", + " mention_text = ent.get(\"mentionText\", \"\")\n", + " # normalized_value = ent.get('normalizedValue', \"\")\n", + " type_ = ent.get(\"type\", \"\")\n", + " print(f\"Type: {type_}\")\n", + " print(f\"Mention Text: {mention_text}\")\n", + "\n", + " # Try splitting the address\n", + " output_json = split_address_to_json(\n", + " mention_text.replace(\"\\n\", \" \").strip(), project_id, location\n", + " )\n", + " # If address was successfully split, update the entity\n", + " if output_json is not None:\n", + " ent[\"normalizedValue\"] = output_json\n", + " ent[\"identified_format\"] = \"Address split\"\n", + " else:\n", + " print(\"Address couldn't be split.\")\n", + "\n", + " print(f\"New Normalized Value: {ent['normalizedValue']}\")\n", + "\n", + " # save to Google Cloud Storage\n", + " output_file_name = f\"{output_bucket_path_prefix}{file_name}\"\n", + " store_document_as_json(\n", + " json.dumps(json_data), output_storage_bucket_name, output_file_name\n", + " )\n", + "\n", + " print(\"--------------------\")\n", + " print(\"All files processed.\")" + ] + }, + { + "cell_type": "markdown", + "id": "f1f6e420-8b24-4773-b069-771869be758f", + "metadata": {}, + "source": [ + "## Run the main functions after executing the above functions: " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "eb6aa350-f4af-440d-8185-dbcc030b73b1", + "metadata": {}, + "outputs": [], + "source": [ + "def main(project_id: str, location: str, input_path: str, output_path: str) -> None:\n", + " \"\"\"\n", + " Main function to process JSON files containing address entities and\n", + " store the updated JSON files in Google Cloud Storage.\n", + "\n", + " This function serves as the main entry point for processing JSON files\n", + " containing address entities, splitting the addresses,\n", + " and storing the updated JSON files in a specified Google Cloud Storage bucket.\n", + "\n", + " Args:\n", + " project_id (str): The project ID for the Vertex AI project.\n", + " location (str): The location of the Vertex AI project.\n", + " input_path (str): The path to the input directory containing JSON files.\n", + " output_path (str): The path to the output directory\n", + " where the processed files will be stored.\n", + "\n", + " Returns:\n", + " \"\"\"\n", + " input_storage_bucket_name = input_path.split(\"/\")[2]\n", + " # input_bucket_path_prefix = \"/\".join(input_path.split(\"/\")[3:])\n", + " output_storage_bucket_name = output_path.split(\"/\")[2]\n", + " output_bucket_path_prefix = \"/\".join(output_path.split(\"/\")[3:])\n", + "\n", + " json_files = file_names(input_path)[1].values()\n", + " list_of_files = [i for i in list(json_files) if i.endswith(\".json\")]\n", + " process_json_files(\n", + " list_of_files,\n", + " input_storage_bucket_name,\n", + " output_storage_bucket_name,\n", + " output_bucket_path_prefix,\n", + " project_id,\n", + " location,\n", + " )\n", + "\n", + "\n", + "main(PROJECT_ID, LOCATION, GCS_INPUT_PATH, GCS_OUTPUT_PATH)" + ] + }, + { + "cell_type": "markdown", + "id": "94e01349-3898-4160-adcd-715a9b83b5c7", + "metadata": {}, + "source": [ + "## Output\n", + "The new attribute 'normalizedValue' will be added to each address entity in the newly generated json file." + ] + }, + { + "cell_type": "markdown", + "id": "f9a46bcb-4b12-4c6a-8e5f-b56469d8a4a5", + "metadata": {}, + "source": [ + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
\n", + " Pre-processed data\n", + "
\n", + " \n", + "
" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "1847ab37-c050-4fa4-bd7c-ac3c826e9622", + "metadata": {}, + "outputs": [], + "source": [ + "!python3 --version" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "54f5c823-b08d-4a9b-a5b2-0235b3067f7a", + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "environment": { + "kernel": "python3", + "name": "common-cpu.m112", + "type": "gcloud", + "uri": "gcr.io/deeplearning-platform-release/base-cpu:m112" + }, + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.7.12" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/incubator-tools/enrich_address_for_invoice/images/input_image.png b/incubator-tools/enrich_address_for_invoice/images/input_image.png new file mode 100644 index 00000000..c92d5c58 Binary files /dev/null and b/incubator-tools/enrich_address_for_invoice/images/input_image.png differ diff --git a/incubator-tools/normalize_date_value_19xx_to_20xx/images/post_processing_image.png b/incubator-tools/normalize_date_value_19xx_to_20xx/images/post_processing_image.png new file mode 100644 index 00000000..634aef1c Binary files /dev/null and b/incubator-tools/normalize_date_value_19xx_to_20xx/images/post_processing_image.png differ diff --git a/incubator-tools/normalize_date_value_19xx_to_20xx/images/pre_processing_image.png b/incubator-tools/normalize_date_value_19xx_to_20xx/images/pre_processing_image.png new file mode 100644 index 00000000..57c0b7a8 Binary files /dev/null and b/incubator-tools/normalize_date_value_19xx_to_20xx/images/pre_processing_image.png differ diff --git a/incubator-tools/normalize_date_value_19xx_to_20xx/normalize_date_value_19xx_to_20xx.ipynb b/incubator-tools/normalize_date_value_19xx_to_20xx/normalize_date_value_19xx_to_20xx.ipynb new file mode 100644 index 00000000..fd046070 --- /dev/null +++ b/incubator-tools/normalize_date_value_19xx_to_20xx/normalize_date_value_19xx_to_20xx.ipynb @@ -0,0 +1,275 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "0396799c-a7b9-4572-813e-f1940e332b80", + "metadata": {}, + "source": [ + "# Normalize Date Value 19xx to 20xx" + ] + }, + { + "cell_type": "markdown", + "id": "c05e4b88-d62c-41bb-9094-d448099bb2de", + "metadata": {}, + "source": [ + "* Author: docai-incubator@google.com" + ] + }, + { + "cell_type": "markdown", + "id": "85d0d35f-1ca8-4b53-b508-2900e9e6fe58", + "metadata": {}, + "source": [ + "## Disclaimer\n", + "\n", + "This tool is not supported by the Google engineering team or product team. It is provided and supported on a best-effort basis by the **DocAI Incubator Team**. No guarantees of performance are implied." + ] + }, + { + "cell_type": "markdown", + "id": "2bf8bebe-720c-4fb9-8449-ecdcf2b1c3de", + "metadata": {}, + "source": [ + "# Objective\n", + "This is a post processing tool to normalize year in date related entities from 19xx to 20xx. Document AI processors will give a normalized_value attribute for date entities in Document Object and sometimes this normalized value for year will be inferred as 19xx instead of 20xx." 
+ ] + }, + { + "cell_type": "markdown", + "id": "8782c76f-ca1f-4341-83e3-8d9fba967461", + "metadata": {}, + "source": [ + "# Prerequisites\n", + "* Vertex AI Notebook\n", + "* GCS Folder Path" + ] + }, + { + "cell_type": "markdown", + "id": "e554aae7-245f-4342-83fc-a7d7ead70049", + "metadata": {}, + "source": [ + "# Step-by-Step Procedure" + ] + }, + { + "cell_type": "markdown", + "id": "14c857c3-230b-4c8d-80ee-46cc44f8493f", + "metadata": {}, + "source": [ + "## 1. Import Modules/Packages" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "eca65467-4ed8-4521-945d-57ed5411c487", + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "# Run this cell to download utilities module\n", + "!wget https://raw.githubusercontent.com/GoogleCloudPlatform/document-ai-samples/main/incubator-tools/best-practices/utilities/utilities.py" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "a64c5868-9809-4518-8d36-1b56eae9897c", + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "from google.cloud import storage\n", + "from google.cloud import documentai_v1beta3 as documentai\n", + "from utilities import (\n", + " file_names,\n", + " store_document_as_json,\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "f8c967e2-1b39-4335-906a-5135a613b503", + "metadata": {}, + "source": [ + "## 2. Input Details" + ] + }, + { + "cell_type": "markdown", + "id": "096ad870-e76b-4190-8934-9b95f88c9865", + "metadata": {}, + "source": [ + "* **INPUT_GCS_PATH** : It is input GCS folder path which contains DocumentAI processor JSON results\n", + "* **OUTPUT_GCS_PATH** : It is a GCS folder path to store post-processing results" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "73e16b97-fce1-4593-8683-d23f3cd0092c", + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "# Parser results as JSON, Data entities should contain Normalized Value data in it\n", + "GCS_INPUT_DIR = \"gs://BUCKET_NAME/incubator/\"\n", + "GCS_OUTPUT_DIR = \"gs://BUCKET_NAME/incubator/output/\"" + ] + }, + { + "cell_type": "markdown", + "id": "10b448dc-2572-417e-b685-afe8db2bcfe0", + "metadata": {}, + "source": [ + "## 3. Run Below Code-Cells" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "92155d09-86dc-4a08-bc19-791dd91a32c9", + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "def normalize_date_entity(entity: documentai.Document.Entity):\n", + " \"\"\"\n", + " Normalize a date entity by adding 100 years to the year value.\n", + "\n", + " This function takes a date entity extracted using Google Cloud Document AI\n", + " and normalizes it by adding 100 years to the year value.\n", + "\n", + " Args:\n", + " entity (documentai.Document.Entity): The date entity to be normalized.\n", + "\n", + " Returns:\n", + " None\n", + "\n", + " Example:\n", + " # Example usage:\n", + " entity = ... 
# Assume entity is extracted from a document\n", + " normalize_date_entity(entity)\n", + " # The date entity will be normalized with the year increased by 100.\n", + " \"\"\"\n", + " print(\"\\t\\t\", entity.type_, entity.normalized_value.text, end=\" -> \")\n", + " accumulate = 100\n", + " date = entity.normalized_value.date_value\n", + " curr_year, curr_month, curr_day = date.year, date.month, date.day\n", + " updated_year = curr_year + accumulate\n", + " entity.normalized_value.date_value.year = updated_year\n", + " text = f\"{updated_year}-{curr_month:0>2}-{curr_day:0>2}\"\n", + " entity.normalized_value.text = text\n", + " print(entity.normalized_value.text)\n", + "\n", + "\n", + "json_splits = GCS_INPUT_DIR.strip(\"/\").split(\"/\")\n", + "input_bucket = json_splits[2]\n", + "INPUT_FILES_DIR = \"/\".join(json_splits[3:])\n", + "GCS_OUTPUT_DIR = GCS_OUTPUT_DIR.strip(\"/\")\n", + "output_splits = GCS_OUTPUT_DIR.split(\"/\")\n", + "output_bucket = output_splits[2]\n", + "OUTPUT_FILES_DIR = \"/\".join(output_splits[3:])\n", + "\n", + "\n", + "_, files_dict = file_names(GCS_INPUT_DIR)\n", + "ip_storage_client = storage.Client()\n", + "ip_storage_bucket = ip_storage_client.bucket(input_bucket)\n", + "print(\"Process started for converting normalized dat value from 19xx to 20xx...\")\n", + "for fn, fp in files_dict.items():\n", + " print(f\"\\tFile: {fn}\")\n", + " json_str = ip_storage_bucket.blob(fp).download_as_string()\n", + " doc = documentai.Document.from_json(json_str)\n", + " for ent in doc.entities:\n", + " if 100 < ent.normalized_value.date_value.year < 2000:\n", + " normalize_date_entity(ent)\n", + "\n", + " json_str = documentai.Document.to_json(doc)\n", + " file_name = f\"{OUTPUT_FILES_DIR}/{fn}\"\n", + " print(f\"\\t Output gcs uri - {file_name}\", output_bucket)\n", + " store_document_as_json(json_str, output_bucket, file_name)\n", + "\n", + "print(\"Process Completed!!!\")" + ] + }, + { + "cell_type": "markdown", + "id": "4e8df1da-2ef2-424c-902e-e4cebc0515c6", + "metadata": {}, + "source": [ + "# 4. Output Details\n", + "\n", + "Refer below images for preprocessed and postprocessed results" + ] + }, + { + "cell_type": "markdown", + "id": "83949d45-cac6-48c2-a067-0b67c0d36910", + "metadata": { + "tags": [] + }, + "source": [ + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
\n", + " Pre-processed data\n", + " \n", + " Post-processed data\n", + "
\n", + " \n", + " \n", + " \n", + "
\n", + " " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "97bf332a-2406-473d-8b0e-304181331dd0", + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "environment": { + "kernel": "python3", + "name": "common-cpu.m112", + "type": "gcloud", + "uri": "gcr.io/deeplearning-platform-release/base-cpu:m112" + }, + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.7.12" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/incubator-tools/reverse_annotation_tool/images/csv_sample.png b/incubator-tools/reverse_annotation_tool/images/csv_sample.png new file mode 100644 index 00000000..46ddaa35 Binary files /dev/null and b/incubator-tools/reverse_annotation_tool/images/csv_sample.png differ diff --git a/incubator-tools/reverse_annotation_tool/images/output_sample.png b/incubator-tools/reverse_annotation_tool/images/output_sample.png new file mode 100644 index 00000000..8439600f Binary files /dev/null and b/incubator-tools/reverse_annotation_tool/images/output_sample.png differ diff --git a/incubator-tools/reverse_annotation_tool/reverse_annotation_tool.ipynb b/incubator-tools/reverse_annotation_tool/reverse_annotation_tool.ipynb new file mode 100644 index 00000000..6cd1c755 --- /dev/null +++ b/incubator-tools/reverse_annotation_tool/reverse_annotation_tool.ipynb @@ -0,0 +1,955 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "29399724", + "metadata": {}, + "source": [ + "# Reverse Annotation Tool" + ] + }, + { + "cell_type": "markdown", + "id": "1e4a6def", + "metadata": {}, + "source": [ + "* Author: docai-incubator@google.com" + ] + }, + { + "cell_type": "markdown", + "id": "6029ebef", + "metadata": {}, + "source": [ + "## Disclaimer\n", + "\n", + "This tool is not supported by the Google engineering team or product team. It is provided and supported on a best-effort basis by the **DocAI Incubator Team**. No guarantees of performance are implied." + ] + }, + { + "cell_type": "markdown", + "id": "18bbfedd", + "metadata": {}, + "source": [ + "# Objective\n", + "This tool helps in annotating or labeling the entities in the document based on the ocr text tokens. The notebook script expects the input file containing the name of entities in tabular format. And the first row is the header representing the entities that need to be labeled in every document. The script calls the processor and parses each of these input documents. The parsed document is then annotated if input entities are present in the document based on the OCR text tokens. The result is an output json file with updated entities and exported into a storage bucket path. This result json files can be imported into a processor to further check the annotations are existing as per the input file which was provided to the script prior the execution." + ] + }, + { + "cell_type": "markdown", + "id": "4a77d9e9", + "metadata": {}, + "source": [ + "# Prerequisites\n", + "* Vertex AI Notebook\n", + "* Input csv file containing list of files to be labeled.\n", + "* Document AI Processor\n", + "* GCS bucket for processing of the input documents and writing the output." 
+ ] + }, + { + "cell_type": "markdown", + "id": "cad788f5", + "metadata": {}, + "source": [ + "# Step-by-Step Procedure" + ] + }, + { + "cell_type": "markdown", + "id": "3e511f7a", + "metadata": {}, + "source": [ + "## 1. Import Modules/Packages" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "id": "80ecc19e-d0fb-4435-9284-44318db1937c", + "metadata": {}, + "outputs": [], + "source": [ + "!pip install google-cloud-documentai\n", + "!pip install google-cloud-storage\n", + "!pip install numpy\n", + "!pip install pandas\n", + "!pip install fuzzywuzzy" + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "id": "10e90cfe", + "metadata": {}, + "outputs": [], + "source": [ + "# Run this cell to download utilities module\n", + "!wget https://raw.githubusercontent.com/GoogleCloudPlatform/document-ai-samples/main/incubator-tools/best-practices/utilities/utilities.py" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "id": "2d4b7289", + "metadata": {}, + "outputs": [], + "source": [ + "import csv\n", + "import re\n", + "from typing import Dict, List, Tuple, Union\n", + "\n", + "import numpy as np\n", + "import pandas as pd\n", + "from fuzzywuzzy import fuzz\n", + "from google.cloud import documentai_v1beta3 as documentai\n", + "from google.cloud import storage\n", + "\n", + "from utilities import process_document_sample, store_document_as_json" + ] + }, + { + "cell_type": "markdown", + "id": "4aa4caea", + "metadata": {}, + "source": [ + "## 2. Input Details" + ] + }, + { + "cell_type": "markdown", + "id": "717bb8cc", + "metadata": {}, + "source": [ + "* **PROJECT_ID** : GCP project Id\n", + "* **LOCATION** : Location of DocumentAI processor, either `us` or `eu`\n", + "* **PROCESSOR_ID** : DocumentAI processor Id\n", + "* **PROCESSOR_VERSION** : DocumentAI processor verrsion Id(eg- pretrained-invoice-v2.0-2023-12-06)\n", + "* **INPUT_BUCKET** : It is input GCS folder path which contains pdf files\n", + "* **OUTPUT_BUCKET** : It is a GCS folder path to store post-processing results\n", + "* **READ_SCHEMA_FILENAME** : It is a csv file contains entities(type & mention_text) data, which are needed to be annotated. In csv, Column-1(FileNames) contains file names , Column-2(entity_type) contains data to be annotated, Column-3 and its following fields should follow same field-schema as Column-2. In otherwords it is a schema file containing a tabular data with header row as name of the entities that needs to be identified and annotated in the document and the following rows are for each file whose values needs to be extracted. \n", + "\n", + " " + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "id": "bfdd838b", + "metadata": {}, + "outputs": [], + "source": [ + "PROJECT_ID = \"xx-xx-xx\"\n", + "PROCESSOR_ID = \"xx-xx-xx\"\n", + "PROCESSOR_VERSION = \"pretrained-invoice-v2.0-2023-12-06\"\n", + "INPUT_BUCKET = \"gs://BUCKET_NAME/reverse_annotation_tool/input/\"\n", + "OUTPUT_BUCKET = \"gs://BUCKET_NAME/reverse_annotation_tool/output/\"\n", + "LOCATION = \"us\"\n", + "# Column headers based on your original CSV structure\n", + "READ_SCHEMA_FILENAME = \"schema_and_data.csv\"" + ] + }, + { + "cell_type": "markdown", + "id": "3347e5d1", + "metadata": {}, + "source": [ + "## 3. 
Run Below Code-Cells" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "94606760", + "metadata": {}, + "outputs": [], + "source": [ + "def read_input_schema(read_schema_filename: str) -> pd.DataFrame:\n", + " \"\"\"\n", + " Reads an input schema from a CSV file.\n", + "\n", + " Args:\n", + " - read_schema_file_name (str): Path to the CSV file containing the schema.\n", + "\n", + " Returns:\n", + " - pd.DataFrame: DataFrame containing the schema data.\n", + " \"\"\"\n", + "\n", + " df = pd.read_csv(read_schema_filename, dtype=str)\n", + " df = df.drop(df[df[\"FileNames\"] == \"Type\"].index)\n", + " df.replace(\"\", np.nan, inplace=True)\n", + " return df\n", + "\n", + "\n", + "def get_token_range(json_data: documentai.Document) -> Dict[range, Dict[str, int]]:\n", + " \"\"\"\n", + " Gets the token ranges from the provided JSON data.\n", + "\n", + " Args:\n", + " - json_data (documentai.Document): JSON data containing page and token information.\n", + "\n", + " Returns:\n", + " - dict: Dictionary containing token ranges with page number and token number information.\n", + " \"\"\"\n", + "\n", + " token_range = {}\n", + " for pn, page in enumerate(json_data.pages):\n", + " for tn, token in enumerate(page.tokens):\n", + " ts = token.layout.text_anchor.text_segments[0]\n", + " start_index = ts.start_index\n", + " end_index = ts.end_index\n", + " token_range[range(start_index, end_index)] = {\n", + " \"page_number\": pn,\n", + " \"token_number\": tn,\n", + " }\n", + " return token_range\n", + "\n", + "\n", + "def fix_page_anchor_entity(\n", + " entity: documentai.Document.Entity,\n", + " json_data: documentai.Document,\n", + " token_range: Dict[range, Dict[str, int]],\n", + ") -> documentai.Document.Entity:\n", + " \"\"\"\n", + " Fixes the page anchor entity based on the provided JSON data and token range.\n", + "\n", + " Args:\n", + " - entity (documentai.Document.Entity): Entity object to be fixed.\n", + " - json_data (documentai.Document): JSON data containing page and token information.\n", + " - token_range (Dict[range, Dict[str, int]]):\n", + " Dictionary containing token ranges with page number and token number information.\n", + "\n", + " Returns:\n", + " - documentai.Document.Entity: Fixed entity object.\n", + " \"\"\"\n", + "\n", + " start = entity.text_anchor.text_segments[0].start_index\n", + " end = entity.text_anchor.text_segments[0].end_index - 1\n", + "\n", + " for j in token_range:\n", + " if start in j:\n", + " lower_token = token_range[j]\n", + " for j in token_range:\n", + " if end in j:\n", + " upper_token = token_range[j]\n", + "\n", + " lower_token_data = (\n", + " json_data.pages[lower_token[\"page_number\"]]\n", + " .tokens[lower_token[\"token_number\"]]\n", + " .layout.bounding_poly.normalized_vertices\n", + " )\n", + " upper_token_data = (\n", + " json_data.pages[int(upper_token[\"page_number\"])]\n", + " .tokens[int(upper_token[\"token_number\"])]\n", + " .layout.bounding_poly.normalized_vertices\n", + " )\n", + "\n", + " def get_coords(\n", + " normalized_vertex: documentai.NormalizedVertex,\n", + " ) -> Tuple[float, float]:\n", + " return normalized_vertex.x, normalized_vertex.y\n", + "\n", + " xa, ya = get_coords(lower_token_data[0])\n", + " xa_, ya_ = get_coords(upper_token_data[0])\n", + "\n", + " xb, yb = get_coords(lower_token_data[1])\n", + " xb_, yb_ = get_coords(upper_token_data[1])\n", + "\n", + " xc, yc = get_coords(lower_token_data[2])\n", + " xc_, yc_ = get_coords(upper_token_data[2])\n", + "\n", + " xd, yd = 
get_coords(lower_token_data[3])\n", + " xd_, yd_ = get_coords(upper_token_data[3])\n", + "\n", + " cord1 = {\"x\": min(xa, xa_), \"y\": min(ya, ya_)}\n", + " cord2 = {\"x\": max(xb, xb_), \"y\": min(yb, yb_)}\n", + " cord3 = {\"x\": max(xc, xc_), \"y\": max(yc, yc_)}\n", + " cord4 = {\"x\": min(xd, xd_), \"y\": max(yd, yd_)}\n", + " nvs = []\n", + " for coords in [cord1, cord2, cord3, cord4]:\n", + " x, y = coords[\"x\"], coords[\"y\"]\n", + " nv = documentai.NormalizedVertex(x=x, y=y)\n", + " nvs.append(nv)\n", + " entity.page_anchor.page_refs[0].bounding_poly.normalized_vertices = nvs\n", + " entity.page_anchor.page_refs[0].page = lower_token[\"page_number\"]\n", + " return entity\n", + "\n", + "\n", + "def create_entity(\n", + " mention_text: str, type_: str, match: re.Match\n", + ") -> documentai.Document.Entity:\n", + " \"\"\"\n", + " Creates a Document Entity based on the provided mention text, type, and match object.\n", + "\n", + " Args:\n", + " - mention_text (str): The text to be mentioned in the entity.\n", + " - type_ (str): The type of the entity.\n", + " - match (re.Match): Match object representing the start and end indices of the mention text.\n", + "\n", + " Returns:\n", + " - documentai.Document.Entity: The created Document Entity.\n", + " \"\"\"\n", + "\n", + " entity = documentai.Document.Entity()\n", + " entity.mention_text = mention_text\n", + " entity.type = type_\n", + " bp = documentai.BoundingPoly(normalized_vertices=[])\n", + " ts = documentai.Document.TextAnchor.TextSegment(\n", + " start_index=str(match.start()), end_index=str(match.end())\n", + " )\n", + " entity.text_anchor.text_segments = [ts]\n", + " entity.page_anchor.page_refs = [\n", + " documentai.Document.PageAnchor.PageRef(bounding_poly=bp)\n", + " ]\n", + "\n", + " return entity\n", + "\n", + "\n", + "# Line Items processing\n", + "def extract_anchors(prop: documentai.Document.Entity) -> Tuple[str, str]:\n", + " \"\"\"It will look for text anchors and page anchors in Entity object\n", + "\n", + " Args:\n", + " prop (documentai.Document.Entity): DocumentAI Entity object\n", + "\n", + " Returns:\n", + " Tuple[str, str]: It contains text_anchors and page_anchors in string-format\n", + " \"\"\"\n", + " text_anchor = f\"{prop.text_anchor.text_segments}\" if prop.text_anchor else \"MISSING\"\n", + " page_anchor = f\"{prop.page_anchor.page_refs[0]}\" if prop.page_anchor else \"MISSING\"\n", + " return text_anchor, page_anchor\n", + "\n", + "\n", + "def improved_similarity_score(str1: str, str2: str) -> float:\n", + " \"\"\"it return similarity/fuzzy ratio between two strings\n", + "\n", + " Args:\n", + " str1 (str): It is a text\n", + " str2 (str): It is also a text\n", + "\n", + " Returns:\n", + " float: similarity ration between string_1 and string_2\n", + " \"\"\"\n", + "\n", + " str1_parts = set(str1.split())\n", + " str2_parts = set(str2.split())\n", + " common_parts = str1_parts.intersection(str2_parts)\n", + " total_parts = str1_parts.union(str2_parts)\n", + " if not total_parts:\n", + " return 0.0\n", + " return len(common_parts) / len(total_parts)\n", + "\n", + "\n", + "def pair_items_with_improved_similarity(\n", + " gt_dict: Dict[str, Dict[str, str]], pred_dict: Dict[str, Dict[str, str]]\n", + ") -> Dict[str, str]:\n", + " \"\"\"It pairs grounf_truth and prediction data based on similarity scrore between them\n", + "\n", + " Args:\n", + " gt_dict (Dict[str, Dict[str, str]]): A dictionary containing ground truth data\n", + " pred_dict (Dict[str, Dict[str, str]]): A dictionary containing predicton 
data\n", + "\n", + " Returns:\n", + " Dict[str, str]: It contains type & best matched mention text\n", + " \"\"\"\n", + " pairings = {}\n", + " for gt_key, gt_values in gt_dict.items():\n", + " gt_concat = \" \".join(gt_values.values()).lower()\n", + " best_match_key = None\n", + " best_score = -1\n", + " for pred_key, pred_values in pred_dict.items():\n", + " pred_values_only = {k: v[\"value\"] for k, v in pred_values.items()}\n", + " pred_concat = \" \".join(pred_values_only.values()).lower()\n", + " score = improved_similarity_score(gt_concat, pred_concat)\n", + " if score > best_score:\n", + " best_score = score\n", + " best_match_key = pred_key\n", + " pairings[gt_key] = best_match_key if best_score > 0 else None\n", + " return pairings\n", + "\n", + "\n", + "def process_documents(\n", + " csv_file_path: str, doc_obj: documentai.Document, file_name: str\n", + ") -> Tuple[Dict[str, Dict[str, str]], Dict[str, Dict[str, str]], Dict[str, str]]:\n", + " \"\"\"\n", + " It an a helper function to get line_items , grouped entities and paired entity type and\n", + " its best match against ground truth\n", + "\n", + " Args:\n", + " csv_file_path (str):\n", + " CSV file path, It contains text's which need to annotated in doc-proto object\n", + " doc_obj (documentai.Document): DocumentAI Doc proto object\n", + " file_name (str): _description_\n", + "\n", + " Returns:\n", + " Tuple[Dict[str, Dict[str, str]],Dict[str, Dict[str, str]],Dict[str, str]]:\n", + " it returns line_items , grouped entities and paired entity type and\n", + " its best match against ground truth\n", + " \"\"\"\n", + " line_items_dict = {}\n", + " entity_groups_dict = {}\n", + "\n", + " # Read and process the CSV file\n", + " with open(csv_file_path, mode=\"r\", newline=\"\") as csv_file:\n", + " csv_reader = csv.reader(csv_file)\n", + " headers = next(csv_reader)\n", + " for index, row in enumerate(csv_reader):\n", + " row_dict = dict(zip(headers, row))\n", + " if row_dict.get(\"FileNames\") == file_name:\n", + " line_item_details = {}\n", + " has_line_item_values = False\n", + " for header, value in row_dict.items():\n", + " if \"line_item/\" in header and value:\n", + " has_line_item_values = True\n", + " line_item_details[header] = value\n", + " if line_item_details and has_line_item_values:\n", + " line_items_dict[f\"gt_line_item_{index}\"] = line_item_details\n", + "\n", + " n = 1\n", + " for entity in doc_obj.entities:\n", + " if entity.properties:\n", + " entity_details = {}\n", + " for prop in entity.properties:\n", + " key = prop.type_\n", + " value = prop.mention_text\n", + " text_anchor, page_anchor = extract_anchors(prop)\n", + " entity_details[key] = {\n", + " \"value\": value,\n", + " \"text_anchor\": text_anchor,\n", + " \"page_anchor\": page_anchor,\n", + " }\n", + " entity_groups_dict[f\"pred_line_item_{n}\"] = entity_details\n", + " n += 1\n", + "\n", + " # Pair items using the improved similarity score\n", + " improved_pairings = pair_items_with_improved_similarity(\n", + " line_items_dict, entity_groups_dict\n", + " )\n", + "\n", + " return line_items_dict, entity_groups_dict, improved_pairings\n", + "\n", + "\n", + "def extract_bounding_box_and_page(layout_info: str) -> Dict[str, Union[int, float]]:\n", + " \"\"\"It is used to get xy-coords and its page number from page_anchor\n", + "\n", + " Args:\n", + " layout_info (str): DocumentAI token object page_anchor data in string format\n", + "\n", + " Returns:\n", + " Dict[str, Union[int, float]]: It contains page_number and xy-coords of token\n", + " 
\"\"\"\n", + " x_values = []\n", + " y_values = []\n", + " page = 0 # Default page number\n", + "\n", + " for line in layout_info.split(\"\\n\"):\n", + " if \"x:\" in line:\n", + " _, x_value = line.split(\":\")\n", + " x_values.append(float(x_value.strip()))\n", + " elif \"y:\" in line:\n", + " _, y_value = line.split(\":\")\n", + " y_values.append(float(y_value.strip()))\n", + " elif line.startswith(\"page:\"):\n", + " _, page = line.split(\":\")\n", + " page = int(page.strip())\n", + "\n", + " return {\n", + " \"page\": page,\n", + " \"min_x\": min(x_values),\n", + " \"max_x\": max(x_values),\n", + " \"min_y\": min(y_values),\n", + " \"max_y\": max(y_values),\n", + " }\n", + "\n", + "\n", + "def get_page_anc_line(line_dict_1: Dict[str, str]) -> Dict[str, Union[int, float]]:\n", + " \"\"\"It is used to get xy-coords and its page number from page_anchor\n", + "\n", + " Args:\n", + " line_dict_1 (Dict[str, str]): Dictionary which holds page_anchor data\n", + "\n", + " Returns:\n", + " Dict[str, Union[int, float]]:\n", + " It returns page_number and xy-coords of based on page_anchor object\n", + " \"\"\"\n", + "\n", + " val_s = []\n", + " for en1, val1 in line_dict_1.items():\n", + " page_anc_dict = extract_bounding_box_and_page(val1[\"page_anchor\"])\n", + " val_s.append(page_anc_dict)\n", + " page_line = {\n", + " \"page\": val_s[0][\"page\"],\n", + " \"min_x\": min(entry[\"min_x\"] for entry in val_s),\n", + " \"max_x\": max(entry[\"max_x\"] for entry in val_s),\n", + " \"min_y\": min(entry[\"min_y\"] for entry in val_s),\n", + " \"max_y\": max(entry[\"max_y\"] for entry in val_s),\n", + " }\n", + "\n", + " return page_line\n", + "\n", + "\n", + "def get_cleaned_text(text: str) -> str:\n", + " \"\"\"it removes spaces & newline characters from provided text\n", + "\n", + " Args:\n", + " text (str): A text which need to be cleaned\n", + "\n", + " Returns:\n", + " str: text without containing spaces & newline chars\n", + " \"\"\"\n", + " return text.lower().replace(\" \", \"\").replace(\"\\n\", \"\")\n", + "\n", + "\n", + "def get_match(gt_mt_split: List[str], mt_temp: str) -> Union[List[str], None]:\n", + " \"\"\"It returns best match of mention text from ground truth text\n", + "\n", + " Args:\n", + " gt_mt_split (List[str]): It contains list of strings\n", + " mt_temp (str): It is text, which need to be checked against gt_mt_split for best match\n", + "\n", + " Returns:\n", + " Union[List[str], None]: It returns best match of mention text from ground truth text\n", + " \"\"\"\n", + " flag_found = False\n", + " for mt in gt_mt_split:\n", + " if fuzz.ratio(get_cleaned_text(mt_temp), get_cleaned_text(mt)) > 75:\n", + " gt_mt_split.remove(mt)\n", + " flag_found = True\n", + " return gt_mt_split\n", + " if flag_found == False:\n", + " return None\n", + "\n", + "\n", + "def get_new_entity(\n", + " doc_obj: documentai.Document,\n", + " page_anc_dict: Dict[str, Union[int, float]],\n", + " gt_mt: str,\n", + " type_en: str,\n", + ") -> Union[documentai.Document.Entity, None]:\n", + " \"\"\"It creates new entity based on provided page_anchor, mention text and entity type\n", + "\n", + " Args:\n", + " doc_obj (documentai.Document): DocumentAI Doc proto object\n", + " page_anc_dict (Dict[str, Union[int, float]]):\n", + " It contains page_number and xy-coords of based on page_anchor object\n", + " gt_mt (str): text which uses as mention_text for entity object\n", + " type_en (str): text which uses as type_ for an entity object\n", + "\n", + " Returns:\n", + " Union[documentai.Document.Entity, 
None]:\n", + " It creates new entity based on provided page_anchor, mention text and entity type\n", + " \"\"\"\n", + " text_anc = []\n", + " page_anc = {\"x\": [], \"y\": []}\n", + " mt_text = \"\"\n", + " gt_mt_split = gt_mt.split()\n", + " for page_num, _ in enumerate(doc_obj.pages):\n", + " if page_num == int(page_anc_dict[\"page\"]):\n", + " for token in doc_obj.pages[page_num].tokens:\n", + " vertices = token.layout.bounding_poly.normalized_vertices\n", + " minx_token, miny_token = min(point.x for point in vertices), min(\n", + " point.y for point in vertices\n", + " )\n", + " maxx_token, maxy_token = max(point.x for point in vertices), max(\n", + " point.y for point in vertices\n", + " )\n", + " token_seg = token.layout.text_anchor.text_segments\n", + " for seg in token_seg:\n", + " token_start, token_end = seg.start_index, seg.end_index\n", + " if (\n", + " abs(miny_token - page_anc_dict[\"min_y\"]) <= 0.02\n", + " and abs(maxy_token - page_anc_dict[\"max_y\"]) <= 0.02\n", + " ):\n", + " mt_temp = doc_obj.text[token_start:token_end]\n", + "\n", + " if (\n", + " get_cleaned_text(mt_temp) in gt_mt.lower().replace(\" \", \"\")\n", + " or fuzz.ratio(\n", + " get_cleaned_text(mt_temp), gt_mt.lower().replace(\" \", \"\")\n", + " )\n", + " > 70\n", + " ):\n", + " if len(mt_temp) <= 2:\n", + " if (\n", + " fuzz.ratio(\n", + " mt_temp.lower().replace(\" \", \"\").replace(\"\\n\", \"\"),\n", + " gt_mt.lower().replace(\" \", \"\"),\n", + " )\n", + " > 80\n", + " ):\n", + " ts = documentai.Document.TextAnchor.TextSegment(\n", + " start_index=token_start, end_index=token_end\n", + " )\n", + " text_anc.append(ts)\n", + " page_anc[\"x\"].extend([minx_token, maxx_token])\n", + " page_anc[\"y\"].extend([miny_token, maxy_token])\n", + " mt_text += mt_temp\n", + " else:\n", + " ts = documentai.Document.TextAnchor.TextSegment(\n", + " start_index=token_start, end_index=token_end\n", + " )\n", + " text_anc.append(ts)\n", + " page_anc[\"x\"].extend([minx_token, maxx_token])\n", + " page_anc[\"y\"].extend([miny_token, maxy_token])\n", + " mt_text += mt_temp\n", + " else:\n", + " match_mt = get_match(gt_mt_split, mt_temp)\n", + " if match_mt:\n", + " ts = documentai.Document.TextAnchor.TextSegment(\n", + " start_index=token_start, end_index=token_end\n", + " )\n", + " text_anc.append(ts)\n", + " page_anc[\"x\"].extend([minx_token, maxx_token])\n", + " page_anc[\"y\"].extend([miny_token, maxy_token])\n", + " mt_text += mt_temp\n", + "\n", + " try:\n", + " x, y = page_anc.values()\n", + " page_anc_new = [\n", + " {\"x\": min(x), \"y\": min(y)},\n", + " {\"x\": max(x), \"y\": min(y)},\n", + " {\"x\": max(x), \"y\": max(y)},\n", + " {\"x\": min(x), \"y\": max(y)},\n", + " ]\n", + " nvs = []\n", + " for xy in page_anc_new:\n", + " nv = documentai.NormalizedVertex(**xy)\n", + " nvs.append(nv)\n", + " new_entity = documentai.Document.Entity()\n", + " new_entity.mention_text = mt_text\n", + " new_entity.type_ = type_en\n", + " ta = documentai.Document.TextAnchor(content=mt_text, text_segments=text_anc)\n", + " new_entity.text_anchor = ta\n", + " bp = documentai.BoundingPoly(normalized_vertices=nvs)\n", + " page_ref = documentai.Document.PageAnchor.PageRef(\n", + " page=str(page_anc_dict[\"page\"]), bounding_poly=bp\n", + " )\n", + " new_entity.page_anchor.page_refs = [page_ref]\n", + " return new_entity\n", + " except ValueError:\n", + " return None\n", + "\n", + "\n", + "def parse_page_anchor(page_anchor_str: str) -> documentai.Document.PageAnchor:\n", + " \"\"\"It creates page_anchor proto-object based on 
provided page_anchor text\n", + "\n", + " Args:\n", + " page_anchor_str (str): page anchor data in string format\n", + "\n", + " Returns:\n", + " documentai.Document.PageAnchor:\n", + " newly created page_anchor proto-object based on provided page_anchor text\n", + " \"\"\"\n", + " # Extract normalized vertices using the provided reference approach\n", + " vertices = []\n", + " page = \"0\" # Default to page 0 if not specified\n", + " lines = page_anchor_str.split(\"\\n\")\n", + " for idx, line in enumerate(lines):\n", + " if \"x:\" in line:\n", + " x = float(line.split(\":\")[1].strip())\n", + " # Ensure there's a corresponding 'y' line following 'x' line\n", + " if idx + 1 < len(lines) and \"y:\" in lines[idx + 1]:\n", + " y = float(lines[idx + 1].split(\":\")[1].strip())\n", + " nv = documentai.NormalizedVertex(x=x, y=y)\n", + " vertices.append(nv)\n", + " elif line.startswith(\"page:\"):\n", + " page = line.split(\":\")[1].strip()\n", + "\n", + " bp = documentai.BoundingPoly(normalized_vertices=vertices)\n", + " page_ref = documentai.Document.PageAnchor.PageRef(page=page, bounding_poly=bp)\n", + " page_anchor = documentai.Document.PageAnchor(page_refs=[page_ref])\n", + " return page_anchor\n", + "\n", + "\n", + "def parse_text_anchor(\n", + " text_anchor_str: str, content_str: str\n", + ") -> documentai.Document.TextAnchor:\n", + " \"\"\"\n", + " It builds DocAI text Anchor object based on provided text anchor and content in string format\n", + "\n", + " Args:\n", + " text_anchor_str (str): DocAI text anchor object in string format\n", + " content_str (str): text to add to text anchot object\n", + "\n", + " Returns:\n", + " documentai.Document.TextAnchor:\n", + " Text Anchor object created based on provided text anchor and content in string format\n", + " \"\"\"\n", + "\n", + " # Simplified parsing for 'text_segments' from 'text_anchor'\n", + " segments_matches = re.findall(\n", + " r\"start_index: (\\d+)\\nend_index: (\\d+)\", text_anchor_str\n", + " )\n", + " text_segments = []\n", + " for si, ei in segments_matches:\n", + " ts = documentai.Document.TextAnchor.TextSegment(start_index=si, end_index=ei)\n", + " text_segments.append(ts)\n", + "\n", + " ta = documentai.Document.TextAnchor(\n", + " text_segments=text_segments, content=content_str\n", + " )\n", + " return ta\n", + "\n", + "\n", + "def construct_predicted_value_details(\n", + " predicted_value: Dict[str, str], type_: str\n", + ") -> documentai.Document.Entity:\n", + " \"\"\"It will create new entinty object in doc-proto format\n", + "\n", + " Args:\n", + " predicted_value (Dict[str, str]):\n", + " it is dictionary wich contains text_anchor, page_anchor and value as string-object\n", + " type_ (str): text which needs to be assigned to entity.type_\n", + "\n", + " Returns:\n", + " documentai.Document.Entity: It returnds new entity object\n", + " \"\"\"\n", + " page_anchor = parse_page_anchor(predicted_value[\"page_anchor\"])\n", + " text_anchor = parse_text_anchor(\n", + " predicted_value[\"text_anchor\"], predicted_value.get(\"value\", \"\")\n", + " )\n", + " ent = documentai.Document.Entity()\n", + " ent.mention_text = predicted_value.get(\"value\", \"\")\n", + " ent.page_anchor = page_anchor\n", + " ent.text_anchor = text_anchor\n", + " ent.type_ = type_\n", + " return ent\n", + "\n", + "\n", + "df_schema = read_input_schema(READ_SCHEMA_FILENAME)\n", + "# Group by 'FileNames'\n", + "grouped = df_schema.groupby(\"FileNames\", as_index=False)\n", + "processed_rows = []\n", + "\n", + "for name, group in grouped:\n", + " # Get 
the total number of columns\n", + " max_columns = len(group.columns)\n", + " combined_row = []\n", + "\n", + " # Iterate over rows in the group\n", + " for index, row in group.iterrows():\n", + " row_list = row.tolist()\n", + " row_filled = row_list + [np.nan] * (\n", + " max_columns - len(row_list)\n", + " ) # Extend with NaNs if less than max_columns\n", + " combined_row.extend(row_filled)\n", + "\n", + " processed_rows.append(combined_row)\n", + "\n", + "\n", + "headers = [\n", + " header.strip()\n", + " for header in pd.read_csv(READ_SCHEMA_FILENAME, nrows=0).columns.tolist()\n", + "]\n", + "prefix = \"line_item/\"\n", + "\n", + "# Extract the part after 'line_item/' for each item that starts with the prefix\n", + "unique_entities = [item.split(\"/\")[-1] for item in headers if item.startswith(prefix)]\n", + "\n", + "processed_files = set() # Set to keep track of processed FileNames\n", + "\n", + "\n", + "client = storage.Client()\n", + "bucket = client.get_bucket(INPUT_BUCKET.split(\"/\")[2])\n", + "\n", + "for row in processed_rows:\n", + " file_name = row[0] # The first item is 'FileNames'\n", + "\n", + " if file_name not in processed_files:\n", + " print(\"Processing:\", file_name)\n", + " file_name_path = INPUT_BUCKET + file_name\n", + " file_name_path = \"/\".join(file_name_path.split(\"/\")[3:])\n", + " blob = bucket.blob(file_name_path)\n", + " content = blob.download_as_bytes()\n", + " res = utilities.process_document_sample(\n", + " project_id=PROJECT_ID,\n", + " location=LOCATION,\n", + " processor_id=PROCESSOR_ID,\n", + " pdf_bytes=content,\n", + " processor_version=PROCESSOR_VERSION,\n", + " )\n", + " res_dict = res.document\n", + " token_range = get_token_range(res_dict)\n", + "\n", + " # Add the file_name to the set of processed files\n", + " processed_files.add(file_name)\n", + "\n", + " parser_entities = res_dict.entities\n", + "\n", + " # Process the line items\n", + " line_items_dict, entity_groups_dict, improved_pairings = process_documents(\n", + " READ_SCHEMA_FILENAME, res_dict, file_name\n", + " )\n", + "\n", + " entities = [] # Initialize the list to hold all entities\n", + "\n", + " for match_key, match_value in improved_pairings.items():\n", + " if match_value:\n", + " entity = documentai.Document.Entity()\n", + " mention_texts = [] # To hold all mention_text values for concatenation\n", + "\n", + " for gt_k, gt_v in line_items_dict[match_key].items():\n", + " is_correct = True\n", + " if gt_k in entity_groups_dict[match_value].keys():\n", + " predicted_value = entity_groups_dict[match_value][gt_k]\n", + " similarity = fuzz.ratio(gt_v, predicted_value[\"value\"])\n", + " if similarity < 90:\n", + " is_correct = False\n", + " else:\n", + " predicted_value_details = construct_predicted_value_details(\n", + " predicted_value, gt_k\n", + " )\n", + " entity.type_ = gt_k.split(\"/\")[\n", + " 0\n", + " ] # Assuming 'line_item' is the desired type\n", + " entity.properties.append(predicted_value_details)\n", + " mention_texts.append(predicted_value_details.mention_text)\n", + " else:\n", + " is_correct = False\n", + "\n", + " if not is_correct:\n", + " page_line = get_page_anc_line(entity_groups_dict[match_value])\n", + " new_ent = get_new_entity(res_dict, page_line, gt_v, gt_k)\n", + " if new_ent is not None:\n", + " entity.type_ = gt_k.split(\"/\")[\n", + " 0\n", + " ] # Assuming 'line_item' is the desired type\n", + " entity.properties.append(new_ent)\n", + " mention_texts.append(new_ent.mention_text)\n", + " else:\n", + " predicted_value_details = 
construct_predicted_value_details(\n", + " predicted_value, gt_k\n", + " )\n", + " entity.properties.append(predicted_value_details)\n", + " mention_texts.append(predicted_value_details.mention_text)\n", + "\n", + " # Concatenate all mention_texts for the parent entity\n", + " entity.mention_text = \" \".join(mention_texts).strip()\n", + "\n", + " # Add the entity to the list if it has been populated with properties\n", + " if entity.properties:\n", + " entities.append(entity)\n", + "\n", + " # Initialize containers for processed and unprocessed entities\n", + " list_of_entities = []\n", + " list_of_entities_not_mapped = []\n", + " processed_entities = set() # Set to track processed entities\n", + "\n", + " # Iterate over rows and headers\n", + " for i in range(0, len(row), len(headers)):\n", + " row_slice = row[i : i + len(headers)]\n", + " for j in range(1, len(headers)):\n", + " type_ = headers[j]\n", + " mention_text = row_slice[j]\n", + " if \"/\" in type_:\n", + " continue # Skip if type_ contains '/'\n", + " # Check if entity matches parser_entities before using re.finditer\n", + " matched = False\n", + " for proc_ent in parser_entities:\n", + " if proc_ent.type_ == type_ and proc_ent.mention_text == mention_text:\n", + " matched = True\n", + " # Directly append the parser entity\n", + " list_of_entities.append(proc_ent)\n", + " break # Exit loop after match is found\n", + "\n", + " # If no match is found, proceed with finditer to create a new entity\n", + " if not matched and mention_text:\n", + " occurrences = re.finditer(\n", + " re.escape(str(mention_text)) + r\"[ |\\,|\\n]\", res_dict.text\n", + " )\n", + " for m in occurrences:\n", + " start, end = m.start(), m.end()\n", + " entity_id = (mention_text, start, end)\n", + " if entity_id not in processed_entities:\n", + " entity = create_entity(mention_text, type_, m)\n", + " try:\n", + " entity_modified = fix_page_anchor_entity(\n", + " entity, res_dict, token_range\n", + " )\n", + " processed_entities.add(entity_id)\n", + " list_of_entities.append(entity_modified)\n", + " except Exception as e:\n", + " print(\n", + " \"Not able to find \" + mention_text + \" in the OCR:\", e\n", + " )\n", + " continue\n", + "\n", + " # Update and write final output as in your existing code\n", + " res_dict.entities = list_of_entities\n", + " for entity in entities:\n", + " res_dict.entities.append(entity)\n", + "\n", + " # Write the final output to GCS\n", + " output_bucket_name = OUTPUT_BUCKET.split(\"/\")[2]\n", + " output_path_within_bucket = (\n", + " \"/\".join(OUTPUT_BUCKET.split(\"/\")[3:]) + file_name + \".json\"\n", + " )\n", + " utilities.store_document_as_json(\n", + " documentai.Document.to_json(res_dict),\n", + " output_bucket_name,\n", + " output_path_within_bucket,\n", + " )\n", + "print(\"Process Completed!!!\")" + ] + }, + { + "cell_type": "markdown", + "id": "a9eda18a", + "metadata": {}, + "source": [ + "# 4. 
Output Details\n", + "\n", + "As we can observe, data mentioned in csv is annotated in DocAI proto results\n", + "" + ] + }, + { + "cell_type": "markdown", + "id": "7347f175", + "metadata": {}, + "source": [] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "43567429-2662-4c1b-aaa6-0f0278a1bae0", + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "environment": { + "kernel": "python3", + "name": "common-cpu.m112", + "type": "gcloud", + "uri": "gcr.io/deeplearning-platform-release/base-cpu:m112" + }, + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.7.12" + }, + "widgets": { + "application/vnd.jupyter.widget-state+json": { + "state": {}, + "version_major": 2, + "version_minor": 0 + } + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/incubator-tools/reverse_annotation_tool/schema_and_data.csv b/incubator-tools/reverse_annotation_tool/schema_and_data.csv new file mode 100644 index 00000000..3b719223 --- /dev/null +++ b/incubator-tools/reverse_annotation_tool/schema_and_data.csv @@ -0,0 +1,13 @@ +FileNames,line_item/quantity,line_item/description,line_item/unit_price,line_item/amount,invoice_date,invoice_id,ship_to_address,supplier_address,supplier_name,supplier_email,supplier_website,supplier_phone,total_amount,net_amount,receiver_email +Type,,,,,,,,,,,,,,, +file_name_1.pdf,< fill_data >,< fill_data >,< fill_data >,< fill_data >,,,,,,,,,,, +file_name_1.pdf,< fill_data >,< fill_data >,< fill_data >,< fill_data >,,,,,,,,,,, +file_name_1.pdf,< fill_data >,< fill_data >,< fill_data >,< fill_data >,,,,,,,,,,, +file_name_1.pdf,< fill_data >,< fill_data >,< fill_data >,< fill_data >,,,,,,,,,,, +file_name_1.pdf,< fill_data >,< fill_data >,< fill_data >,< fill_data >,,,,,,,,,,, +file_name_1.pdf,< fill_data >,< fill_data >,< fill_data >,< fill_data >,,,,,,,,,,, +file_name_1.pdf,,,,,< fill_data >,< fill_data >,< fill_data >,< fill_data >,< fill_data >,< fill_data >,< fill_data >,< fill_data >,< fill_data >,< fill_data >,< fill_data > +file_name_2.pdf,< fill_data >,< fill_data >,< fill_data >,< fill_data >,,,,,,,,,,, +file_name_2.pdf,< fill_data >,< fill_data >,< fill_data >,< fill_data >,,,,,,,,,,, +file_name_2.pdf,< fill_data >,< fill_data >,< fill_data >,< fill_data >,,,,,,,,,,, +file_name_2.pdf,,,,,< fill_data >,< fill_data >,< fill_data >,< fill_data >,< fill_data >,< fill_data >,,< fill_data >,< fill_data >,< fill_data >,< fill_data > diff --git a/incubator-tools/schema_comparision/contract_schema.json b/incubator-tools/schema_comparision/contract_schema.json deleted file mode 100644 index 25ad018e..00000000 --- a/incubator-tools/schema_comparision/contract_schema.json +++ /dev/null @@ -1,77 +0,0 @@ -{ - "Contract": { - "displayName": "Contract Doc AI v1.2", - "entityTypes": [ - { - "type": "document_name", - "baseType": "string", - "occurrenceType": "OPTIONAL_MULTIPLE" - }, - { - "type": "parties", - "baseType": "string", - "occurrenceType": "OPTIONAL_MULTIPLE" - }, - { - "type": "agreement_date", - "baseType": "datetime", - "occurrenceType": "OPTIONAL_MULTIPLE" - }, - { - "type": "effective_date", - "baseType": "datetime", - "occurrenceType": "OPTIONAL_MULTIPLE" - }, - { - "type": "expiration_date", - "baseType": "datetime", - "occurrenceType": 
"OPTIONAL_MULTIPLE" - }, - { - "type": "initial_term", - "baseType": "string", - "occurrenceType": "OPTIONAL_MULTIPLE" - }, - { - "type": "governing_law", - "baseType": "string", - "occurrenceType": "OPTIONAL_MULTIPLE" - }, - { - "type": "renewal_term", - "baseType": "string", - "occurrenceType": "OPTIONAL_MULTIPLE" - }, - { - "type": "notice_to_terminate_renewal", - "baseType": "string", - "occurrenceType": "OPTIONAL_MULTIPLE" - }, - { - "type": "arbitration_venue", - "baseType": "string", - "occurrenceType": "OPTIONAL_MULTIPLE" - }, - { - "type": "litigation_venue", - "baseType": "string", - "occurrenceType": "OPTIONAL_MULTIPLE" - }, - { - "type": "indemnity_clause", - "baseType": "string", - "occurrenceType": "OPTIONAL_MULTIPLE" - }, - { - "type": "confidentiality_clause", - "baseType": "string", - "occurrenceType": "OPTIONAL_MULTIPLE" - }, - { - "type": "non_compete_clause", - "baseType": "string", - "occurrenceType": "OPTIONAL_MULTIPLE" - } - ] - } -} \ No newline at end of file diff --git a/incubator-tools/schema_comparision/invoice_schema.json b/incubator-tools/schema_comparision/invoice_schema.json deleted file mode 100644 index d5ee4638..00000000 --- a/incubator-tools/schema_comparision/invoice_schema.json +++ /dev/null @@ -1,247 +0,0 @@ -{ - "Invoice": { - "displayName": "invoice_uptrain", - "entityTypes": [ - { - "type": "amount_due", - "baseType": "money", - "occurrenceType": "OPTIONAL_ONCE" - }, - { - "type": "amount_paid_since_last_invoice", - "baseType": "money", - "occurrenceType": "OPTIONAL_ONCE" - }, - { - "type": "carrier", - "baseType": "string", - "occurrenceType": "OPTIONAL_ONCE" - }, - { - "type": "currency", - "baseType": "currency", - "occurrenceType": "OPTIONAL_ONCE" - }, - { - "type": "currency_exchange_rate", - "baseType": "money", - "occurrenceType": "OPTIONAL_ONCE" - }, - { - "type": "customer_tax_id", - "baseType": "string", - "occurrenceType": "OPTIONAL_ONCE" - }, - { - "type": "receiver_tax_id", - "baseType": "string", - "occurrenceType": "OPTIONAL_ONCE" - }, - { - "type": "delivery_date", - "baseType": "datetime", - "occurrenceType": "OPTIONAL_ONCE" - }, - { - "type": "due_date", - "baseType": "datetime", - "occurrenceType": "OPTIONAL_ONCE" - }, - { - "type": "freight_amount", - "baseType": "money", - "occurrenceType": "OPTIONAL_ONCE" - }, - { - "type": "invoice_date", - "baseType": "datetime", - "occurrenceType": "OPTIONAL_ONCE" - }, - { - "type": "invoice_id", - "baseType": "string", - "occurrenceType": "OPTIONAL_ONCE" - }, - { - "type": "net_amount", - "baseType": "money", - "occurrenceType": "OPTIONAL_ONCE" - }, - { - "type": "payment_terms", - "baseType": "string", - "occurrenceType": "OPTIONAL_ONCE" - }, - { - "type": "purchase_order", - "baseType": "string", - "occurrenceType": "OPTIONAL_ONCE" - }, - { - "type": "receiver_address", - "baseType": "address", - "occurrenceType": "OPTIONAL_ONCE" - }, - { - "type": "receiver_email", - "baseType": "string", - "occurrenceType": "OPTIONAL_ONCE" - }, - { - "type": "receiver_name", - "baseType": "string", - "occurrenceType": "OPTIONAL_ONCE" - }, - { - "type": "receiver_phone", - "baseType": "string", - "occurrenceType": "OPTIONAL_ONCE" - }, - { - "type": "receiver_website", - "baseType": "string", - "occurrenceType": "OPTIONAL_ONCE" - }, - { - "type": "remit_to_address", - "baseType": "address", - "occurrenceType": "OPTIONAL_ONCE" - }, - { - "type": "remit_to_name", - "baseType": "string", - "occurrenceType": "OPTIONAL_ONCE" - }, - { - "type": "ship_from_address", - "baseType": "address", - "occurrenceType": 
"OPTIONAL_ONCE" - }, - { - "type": "ship_from_name", - "baseType": "string", - "occurrenceType": "OPTIONAL_ONCE" - }, - { - "type": "ship_to_address", - "baseType": "address", - "occurrenceType": "OPTIONAL_ONCE" - }, - { - "type": "ship_to_name", - "baseType": "string", - "occurrenceType": "OPTIONAL_ONCE" - }, - { - "type": "supplier_address", - "baseType": "address", - "occurrenceType": "OPTIONAL_ONCE" - }, - { - "type": "supplier_email", - "baseType": "string", - "occurrenceType": "OPTIONAL_ONCE" - }, - { - "type": "supplier_iban", - "baseType": "string", - "occurrenceType": "OPTIONAL_ONCE" - }, - { - "type": "supplier_name", - "baseType": "string", - "occurrenceType": "OPTIONAL_ONCE" - }, - { - "type": "supplier_payment_ref", - "baseType": "string", - "occurrenceType": "OPTIONAL_ONCE" - }, - { - "type": "supplier_phone", - "baseType": "string", - "occurrenceType": "OPTIONAL_ONCE" - }, - { - "type": "supplier_registration", - "baseType": "string", - "occurrenceType": "OPTIONAL_ONCE" - }, - { - "type": "supplier_tax_id", - "baseType": "string", - "occurrenceType": "OPTIONAL_ONCE" - }, - { - "type": "supplier_website", - "baseType": "string", - "occurrenceType": "OPTIONAL_ONCE" - }, - { - "type": "total_amount", - "baseType": "money", - "occurrenceType": "OPTIONAL_ONCE" - }, - { - "type": "total_tax_amount", - "baseType": "money", - "occurrenceType": "OPTIONAL_ONCE" - }, - { - "type": "line_item/amount", - "baseType": "money", - "occurrenceType": "OPTIONAL_MULTIPLE" - }, - { - "type": "line_item/description", - "baseType": "string", - "occurrenceType": "OPTIONAL_MULTIPLE" - }, - { - "type": "line_item/product_code", - "baseType": "string", - "occurrenceType": "OPTIONAL_MULTIPLE" - }, - { - "type": "line_item/purchase_order", - "baseType": "string", - "occurrenceType": "OPTIONAL_MULTIPLE" - }, - { - "type": "line_item/quantity", - "baseType": "number", - "occurrenceType": "OPTIONAL_MULTIPLE" - }, - { - "type": "line_item/unit", - "baseType": "string", - "occurrenceType": "OPTIONAL_MULTIPLE" - }, - { - "type": "line_item/unit_price", - "baseType": "money", - "occurrenceType": "OPTIONAL_MULTIPLE" - }, - { - "type": "vat/amount", - "baseType": "money", - "occurrenceType": "OPTIONAL_MULTIPLE" - }, - { - "type": "vat/category_code", - "baseType": "string", - "occurrenceType": "OPTIONAL_MULTIPLE" - }, - { - "type": "vat/tax_amount", - "baseType": "money", - "occurrenceType": "OPTIONAL_MULTIPLE" - }, - { - "type": "vat/tax_rate", - "baseType": "money", - "occurrenceType": "OPTIONAL_MULTIPLE" - } - ] - } -} \ No newline at end of file diff --git a/incubator-tools/schema_comparision/purchase_order_schema.json b/incubator-tools/schema_comparision/purchase_order_schema.json deleted file mode 100644 index 74addd8a..00000000 --- a/incubator-tools/schema_comparision/purchase_order_schema.json +++ /dev/null @@ -1,92 +0,0 @@ -{ - "PurchaseOrder": { - "displayName": "PO-schema", - "entityTypes": [ - { - "type": "currency", - "baseType": "currency", - "occurrenceType": "OPTIONAL_ONCE" - }, - { - "type": "delivery_date", - "baseType": "datetime", - "occurrenceType": "OPTIONAL_ONCE" - }, - { - "type": "payment_terms", - "baseType": "string", - "occurrenceType": "OPTIONAL_ONCE" - }, - { - "type": "purchase_order_date", - "baseType": "datetime", - "occurrenceType": "OPTIONAL_ONCE" - }, - { - "type": "purchase_order_id", - "baseType": "string", - "occurrenceType": "OPTIONAL_ONCE" - }, - { - "type": "receiver_name", - "baseType": "string", - "occurrenceType": "OPTIONAL_ONCE" - }, - { - "type": "ship_to_address", - 
"baseType": "address", - "occurrenceType": "OPTIONAL_ONCE" - }, - { - "type": "ship_to_name", - "baseType": "string", - "occurrenceType": "OPTIONAL_ONCE" - }, - { - "type": "total_amount", - "baseType": "money", - "occurrenceType": "OPTIONAL_ONCE" - }, - { - "type": "line_item/amount", - "baseType": "money", - "occurrenceType": "OPTIONAL_MULTIPLE" - }, - { - "type": "line_item/description", - "baseType": "string", - "occurrenceType": "OPTIONAL_MULTIPLE" - }, - { - "type": "line_item/product_code", - "baseType": "string", - "occurrenceType": "OPTIONAL_MULTIPLE" - }, - { - "type": "line_item/purchase_order_id", - "baseType": "string", - "occurrenceType": "OPTIONAL_MULTIPLE" - }, - { - "type": "line_item/quantity", - "baseType": "number", - "occurrenceType": "OPTIONAL_MULTIPLE" - }, - { - "type": "line_item/receiver_reference", - "baseType": "string", - "occurrenceType": "OPTIONAL_MULTIPLE" - }, - { - "type": "line_item/unit_of_measure", - "baseType": "string", - "occurrenceType": "OPTIONAL_MULTIPLE" - }, - { - "type": "line_item/unit_price", - "baseType": "money", - "occurrenceType": "OPTIONAL_MULTIPLE" - } - ] - } -} \ No newline at end of file diff --git a/incubator-tools/signature-detection-technique/images/image_input.png b/incubator-tools/signature-detection-technique/images/image_input.png new file mode 100644 index 00000000..8b41bdb3 Binary files /dev/null and b/incubator-tools/signature-detection-technique/images/image_input.png differ diff --git a/incubator-tools/signature-detection-technique/images/image_output.png b/incubator-tools/signature-detection-technique/images/image_output.png new file mode 100644 index 00000000..5b8db7ec Binary files /dev/null and b/incubator-tools/signature-detection-technique/images/image_output.png differ diff --git a/incubator-tools/signature-detection-technique/signature_detection_technique.ipynb b/incubator-tools/signature-detection-technique/signature_detection_technique.ipynb new file mode 100644 index 00000000..d423ba30 --- /dev/null +++ b/incubator-tools/signature-detection-technique/signature_detection_technique.ipynb @@ -0,0 +1,298 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "5fd2bc5c-50a3-46c0-9077-b6599c2de7b6", + "metadata": {}, + "source": [ + "# Signature Detection by Reading Pixels" + ] + }, + { + "cell_type": "markdown", + "id": "d4b79b77-f7f4-48c5-b857-2b3b7c7a1444", + "metadata": {}, + "source": [ + "* Author: docai-incubator@google.com\n" + ] + }, + { + "cell_type": "markdown", + "id": "d11b31a4-e270-47d0-995d-7316958e30af", + "metadata": {}, + "source": [ + "## Disclaimer\n", + "\n", + "This tool is not supported by the Google engineering team or product team. It is provided and supported on a best-effort basis by the DocAI Incubator Team. No guarantees of performance are implied." + ] + }, + { + "cell_type": "markdown", + "id": "929318e2-d58c-4acc-988c-34430fb9f8dc", + "metadata": {}, + "source": [ + "## Purpose and Description\n", + "This documentation outlines the procedure for detecting the signature in the document by taking normalized bounding box coordinates of signature location.\n", + "While using this code, the user needs to set two values while calling the function a) BlankLine Pixel count b) Signature Pixel Count (only for the black pixels).\n" + ] + }, + { + "cell_type": "markdown", + "id": "88bd161e-4993-4a77-a0c3-64f8c7c304fb", + "metadata": {}, + "source": [ + "## Prerequisites\n", + "\n", + "1. Access to vertex AI Notebook or Google Colab\n", + "2. Python\n", + "3. 
Python Libraries like cv2, PIL, base64, io, numpy etc." + ] + }, + { + "cell_type": "markdown", + "id": "68ab89ba-c325-4baf-b694-3967f78fde15", + "metadata": {}, + "source": [ + "## Step by Step procedure " + ] + }, + { + "cell_type": "markdown", + "id": "8465596d-9656-4cff-b4aa-66749c8b7e3b", + "metadata": {}, + "source": [ + "### 1. Import the required Libraries" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "4fe262b4-7250-4422-859c-f4d960a59514", + "metadata": {}, + "outputs": [], + "source": [ + "%pip install google.cloud" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "a4888923-5cb5-409e-a25f-3c8cdf7c1982", + "metadata": {}, + "outputs": [], + "source": [ + "!wget https://raw.githubusercontent.com/GoogleCloudPlatform/document-ai-samples/main/incubator-tools/best-practices/utilities/utilities.py" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "8f85b531-2993-4ce0-b712-069d9cc8eefa", + "metadata": {}, + "outputs": [], + "source": [ + "import io\n", + "from io import BytesIO\n", + "from typing import Dict\n", + "import os\n", + "import base64\n", + "from PIL import Image\n", + "import numpy as np\n", + "import json\n", + "import cv2\n", + "import PIL\n", + "from utilities import documentai_json_proto_downloader, file_names\n", + "from google.cloud import documentai_v1beta3 as documentai" + ] + }, + { + "cell_type": "markdown", + "id": "eda73c3b-e421-482f-9471-2455d624e2e5", + "metadata": {}, + "source": [ + "### 2. Input details" + ] + }, + { + "cell_type": "markdown", + "id": "4854f4e6-5326-481d-b62a-dd8ff7ca2217", + "metadata": {}, + "source": [ + "
    \n", + "
  • input_path : GCS path of the DocAI-processed output JSON file to analyze, for example gs://bucket_name/folder_path/file_name.json.
\n", +    "
  • normalized_vertices: The four normalized (0–1) bounding-box vertices of the region where the signature is expected to be present.\n", +    "
  • page_number: Zero-based page index (into document_proto.pages) where the signature is expected to be present.\n", +    "
  • blank_line_pixel_count: Black-pixel count of an empty signature field after the cropped region is binarized (black & white); the default is 600.\n", +    "
  • signature_threshold_pixel_count: Black-pixel count above which the binarized signature region is treated as signed; the default is 1000.\n", +    "
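Because both pixel-count thresholds depend on the page resolution and on the size of the cropped region, it helps to calibrate them once per document template. The sketch below is an illustration only (the image file names are placeholders, not part of this notebook); it mirrors the binarize-and-count logic used by the signature_detection function defined in the next cell, so you can measure an empty signature field and a signed one and pick thresholds between the two counts.

```python
import cv2
import numpy as np


def count_black_pixels(cropped_image_path: str) -> int:
    """Count black pixels in a cropped signature region after binarization."""
    gray = cv2.imread(cropped_image_path, 0)  # read as grayscale
    _, binary = cv2.threshold(gray, 127, 255, cv2.THRESH_BINARY)
    values, counts = np.unique(binary, return_counts=True)
    return dict(zip(values, counts)).get(0, 0)  # pixel value 0 = black


# Hypothetical calibration crops saved from sample documents:
# blank = count_black_pixels("blank_signature_field.jpeg")
# signed = count_black_pixels("signed_signature_field.jpeg")
# Set blank_line_pixel_count slightly above `blank`, and
# signature_threshold_pixel_count between `blank` and `signed`.
```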
" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "eaf0950f-1ab9-4ae9-bae7-d836c4960bbf", + "metadata": {}, + "outputs": [], + "source": [ + "def signature_detection(\n", + " document_proto: documentai.Document,\n", + " normalized_vertices: Dict,\n", + " page_number: int,\n", + " blank_line_pixel_count: int = 600,\n", + " signature_threshold_pixel_count: int = 1000,\n", + ") -> bool:\n", + " \"\"\"\n", + " Detects signatures within a document.\n", + "\n", + " Args:\n", + " document_proto (documentai.Document): Document AI proto object.\n", + " normalized_vertices (Dict): Normalized vertices containing bounding box information.\n", + " page_number (int): Page number to process.\n", + " blank_line_pixel_count (int): Threshold for considering a line as blank. Default is 600.\n", + " signature_threshold_pixel_count (int): Threshold for considering a signature. Default is 1000.\n", + "\n", + " Returns:\n", + " bool: True if signature is detected, False otherwise.\n", + " \"\"\"\n", + "\n", + " bounding_box = normalized_vertices\n", + "\n", + " # Getting the height & width of the page\n", + " img_height = document_proto.pages[page_number].image.height\n", + " img_width = document_proto.pages[page_number].image.width\n", + "\n", + " x = [i[\"x\"] for i in bounding_box]\n", + " y = [i[\"y\"] for i in bounding_box]\n", + "\n", + " left = min(x) * img_width - 1\n", + " top = min(y) * img_height - 2\n", + " right = max(x) * img_width + 5\n", + " bottom = max(y) * img_height + 17\n", + "\n", + " # Setting up the bounding box coordinates to crop the image to only the signature part.\n", + " bounding_box_coordinates = (left, top, right, bottom)\n", + "\n", + " # Fetching the Image data which is in base64 encoded format\n", + " content = document_proto.pages[page_number].image.content\n", + "\n", + " image = Image.open(io.BytesIO(content))\n", + "\n", + " # Cropping the image where signature is present\n", + " cropped_image = image.crop(bounding_box_coordinates)\n", + "\n", + " # Saving cropped image\n", + " cropped_image.save(\"cropped.jpeg\")\n", + "\n", + " cropped_img = cv2.imread(\"cropped.jpeg\", 0) # Read image as grayscale\n", + "\n", + " # Apply binary thresholding\n", + " _, cropped_bw_image = cv2.threshold(cropped_img, 127, 255, cv2.THRESH_BINARY)\n", + "\n", + " # Count the total black & white pixels in the cropped image part\n", + " pixel_value, occurrence = np.unique(cropped_bw_image, return_counts=True)\n", + " pixel_counts = dict(zip(pixel_value, occurrence))\n", + "\n", + " cropped_black_pixel = pixel_counts.get(0, 0) # Count of black pixels\n", + "\n", + " # Logic to determine if the cropped part contains a signature\n", + " os.remove(\"cropped.jpeg\")\n", + " if (\n", + " cropped_black_pixel > blank_line_pixel_count\n", + " and cropped_black_pixel > signature_threshold_pixel_count\n", + " ):\n", + " print(\"Signature Detected\")\n", + " return True\n", + " else:\n", + " print(\"No Signature Detected\")\n", + " return False" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "b7cd9c12-28f3-4bd8-a3f7-b0beab95b81c", + "metadata": {}, + "outputs": [], + "source": [ + "# INPUT : Storage bucket name\n", + "INPUT_PATH = \"gs://{bucket_name}/{folder_path}/{file_name}.json\"\n", + "normalized_vertices = [\n", + " {\"x\": 0.33105803, \"y\": 0.7846154},\n", + " {\"x\": 0.41695109, \"y\": 0.7846154},\n", + " {\"x\": 0.41695109, \"y\": 0.80000001},\n", + " {\"x\": 0.33105803, \"y\": 0.80000001},\n", + "]\n", + "input_bucket_name = 
INPUT_PATH.split(\"/\")[2]\n", + "path_parts = INPUT_PATH.split(\"/\")[3:]\n", + "file_name = \"/\".join(path_parts)\n", + "file_name = INPUT_PATH[len(input_bucket_name) + 6 :]\n", + "document_proto = documentai_json_proto_downloader(input_bucket_name, file_name)\n", + "# Function Calling\n", + "signature_detection(document_proto, normalized_vertices, 0)" + ] + }, + { + "cell_type": "markdown", + "id": "c0cdb673-9858-4022-b1f1-06452c1ba922", + "metadata": {}, + "source": [ + "### 3.Output" + ] + }, + { + "cell_type": "markdown", + "id": "ce460741-3b0d-4965-a1d4-b64d903d7579", + "metadata": {}, + "source": [ + "Upon the above code execution, it will prompt whether the given image with bounding box coordinates is having the signature in it or not. \n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + "\n", + "\n", + "\n", + "\n", + "

Input Json Image

Output

\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "847c1e16-74f2-4324-8fd8-d5c1a0423a4a", + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "environment": { + "kernel": "python3", + "name": "common-cpu.m112", + "type": "gcloud", + "uri": "gcr.io/deeplearning-platform-release/base-cpu:m112" + }, + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.7.12" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/incubator-tools/special_character_removal/images/image_input_json.png b/incubator-tools/special_character_removal/images/image_input_json.png new file mode 100644 index 00000000..d7dfb1c2 Binary files /dev/null and b/incubator-tools/special_character_removal/images/image_input_json.png differ diff --git a/incubator-tools/special_character_removal/images/image_output_json.png b/incubator-tools/special_character_removal/images/image_output_json.png new file mode 100644 index 00000000..b2224389 Binary files /dev/null and b/incubator-tools/special_character_removal/images/image_output_json.png differ diff --git a/incubator-tools/special_character_removal/special_character_removal.ipynb b/incubator-tools/special_character_removal/special_character_removal.ipynb new file mode 100644 index 00000000..51dc1044 --- /dev/null +++ b/incubator-tools/special_character_removal/special_character_removal.ipynb @@ -0,0 +1,394 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "49bf864f-c8d8-47c9-af54-b253f41bca88", + "metadata": {}, + "source": [ + "# DocAI Special Character Removal" + ] + }, + { + "cell_type": "markdown", + "id": "2388af64-6727-4876-b3ef-296cd321c302", + "metadata": {}, + "source": [ + "* Author: docai-incubator@google.com\n" + ] + }, + { + "cell_type": "markdown", + "id": "a84b2797-4849-48ec-b222-09c150dcc4e0", + "metadata": {}, + "source": [ + "## Disclaimer\n", + "\n", + "This tool is not supported by the Google engineering team or product team. It is provided and supported on a best-effort basis by the DocAI Incubator Team. No guarantees of performance are implied." + ] + }, + { + "cell_type": "markdown", + "id": "79756a4c-45b1-4507-9c8e-7b7c9e504170", + "metadata": {}, + "source": [ + "## Purpose and Description\n", + "\n", + "This documentation outlines the procedure for handling special characters within the CDE JSON samples. It involves replacing the original mention text value with its corresponding post-processed value using the provided code.\n", + "\n", + "This process removes special characters like hyphens (-) and forward slashes (/) from the amount field. This is done because the presence of these characters can interfere with the ability of parsing elements to correctly identify the amount values." + ] + }, + { + "cell_type": "markdown", + "id": "ca0c3b23-a58e-4f9f-ad92-d3e05ba1a619", + "metadata": {}, + "source": [ + "## Prerequisites\n", + "\n", + "1. Access to vertex AI Notebook or Google Colab\n", + "2. Python\n", + "3. Access to the google storage bucket." 
+ ] + }, + { + "cell_type": "markdown", + "id": "f3ac4060-3e3c-43fc-a1ee-9feb9544373c", + "metadata": {}, + "source": [ + "## Step by Step procedure " + ] + }, + { + "cell_type": "markdown", + "id": "1c96f566-cd19-444a-9b68-95d4d2d125ed", + "metadata": {}, + "source": [ + "### 1. Install the required libraries" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "b0e13449-1ca7-4f1c-8a73-1cdbe4b74db6", + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "%pip install Pillow\n", + "%pip install google-cloud-storage\n", + "%pip install google-cloud-documentai" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "ea0b148e-2fc1-41ea-94e1-7e0681aa4c6c", + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "!wget https://raw.githubusercontent.com/GoogleCloudPlatform/document-ai-samples/main/incubator-tools/best-practices/utilities/utilities.py" + ] + }, + { + "cell_type": "markdown", + "id": "f79501f4-7df9-4ff3-9849-c0922718d939", + "metadata": {}, + "source": [ + "### 2. Import the required libraries/Packages" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "43f907df-4fe2-4ae8-9d1d-33cf77f764fa", + "metadata": {}, + "outputs": [], + "source": [ + "from io import BytesIO\n", + "from google.cloud import storage\n", + "from PIL import Image\n", + "from google.cloud import documentai_v1beta3 as documentai\n", + "from google.api_core.client_options import ClientOptions\n", + "from pathlib import Path\n", + "import base64\n", + "import io\n", + "import json\n", + "from utilities import (\n", + " file_names,\n", + " documentai_json_proto_downloader,\n", + " store_document_as_json,\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "0057355b-8edb-4394-aed9-6cae90975184", + "metadata": {}, + "source": [ + "### 3. Input Details" + ] + }, + { + "cell_type": "markdown", + "id": "7a7883f7-2912-4c8a-ab0a-31ad66b4e9cf", + "metadata": {}, + "source": [ + "
    \n", + "
  • input_path : Input GCS folder path containing Document AI processor JSON results.
\n", +    "
  • output_path : GCS folder path where the post-processed results are stored.
\n", +    "
  • project_id : Google Cloud project ID that hosts the Document AI processor.
\n", +    "
  • location : Location of the Document AI processor (for example, us or eu).
\n", +    "
  • processor_id : ID of the CDE (Custom Document Extractor) processor used to re-process the cropped entity image.
\n", +    "
  • entity_name : Entity type whose mention text is cleaned and replaced with the post-processed value.
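Before running the full pipeline, it may help to see what the symbol-level clean-up in step 4 (Execute the code) effectively does to this entity. The standalone sketch below uses fabricated (symbol, confidence) pairs rather than real processor output, and reproduces only the core of the logic: the special characters '-' and '/' are dropped outright, and low-confidence leading or trailing symbols are trimmed.

```python
# Fabricated symbol/confidence pairs for illustration (not real OCR output).
symbols = [("-", 0.40), ("1", 0.99), ("2", 0.98), ("5", 0.97),
           (".", 0.93), ("0", 0.96), ("0", 0.95), ("/", 0.35), ("~", 0.42)]

# 1) Remove the special characters outright.
filtered = [(s, c) for s, c in symbols if s not in ("-", "/")]

# 2) Trim a low-confidence first symbol and a low-confidence last symbol.
if filtered and filtered[0][1] < 0.85:
    filtered.pop(0)
if len(filtered) > 2 and filtered[-1][1] < 0.85:
    filtered.pop(-1)

cleaned = "".join(s for s, _ in filtered)
print(cleaned)  # -> "125.00" with these fabricated values
```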
  • \n", + " \n", + "
" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "e9b0c0d7-31b7-4657-97f7-e6f21d5906f3", + "metadata": {}, + "outputs": [], + "source": [ + "input_path = \"gs://bucket_name/path/to/jsons/\"\n", + "output_path = \"gs://bucket_name/path/to/output/\"\n", + "project_id = \"project-id\"\n", + "location = \"location\"\n", + "processor_id = \"processor-id\"\n", + "entity_name = \"Amount_in_number\" # It is the entity name which need to be converted." + ] + }, + { + "cell_type": "markdown", + "id": "3e4d367c-4ef3-46ae-ba0a-43bd2d9c7f6b", + "metadata": {}, + "source": [ + "### 4.Execute the code" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "7e39691d-db63-4f1f-9764-67d866933da1", + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "input_storage_bucket_name = input_path.split(\"/\")[2]\n", + "input_bucket_path_prefix = \"/\".join(input_path.split(\"/\")[3:])\n", + "output_storage_bucket_name = output_path.split(\"/\")[2]\n", + "output_bucket_path_prefix = \"/\".join(output_path.split(\"/\")[3:])\n", + "\n", + "\n", + "def remove_special_characters(\n", + " json_proto_data: documentai.Document, entity_name: str\n", + ") -> documentai.Document:\n", + " \"\"\"\n", + " Removes special characters from a specified entity type (\"entity_name\") in a Documentai document.\n", + "\n", + " This function processes the entity bounding box, extracts the image data, performs OCR with symbol confidence,\n", + " and removes special characters like '-' and '/' based on confidence thresholds while considering adjacent digits.\n", + "\n", + " Args:\n", + " json_proto_data: The Documentai document object containing text and entities (type: documentai.Document).\n", + " entity_name: The name of the entity type to process (type: str).\n", + "\n", + " Returns:\n", + " The modified Documentai document object with updated entity mentions after removing special characters (type: documentai.Document).\n", + " \"\"\"\n", + " for page_index, page in enumerate(json_proto_data.pages):\n", + " if \"image\" in page and \"content\" in page.image:\n", + " # Decode the image content\n", + " image_data = page.image.content\n", + " # image_data = base64.b64decode(image_data_base64)\n", + " image = Image.open(io.BytesIO(image_data))\n", + "\n", + " for entity in json_proto_data.entities:\n", + " if entity.type == entity_name:\n", + " bounding_box = entity.page_anchor.page_refs[\n", + " 0\n", + " ].bounding_poly.normalized_vertices\n", + "\n", + " # Convert normalized coordinates to pixel coordinates\n", + " img_width, img_height = image.size\n", + " left = bounding_box[0].x * img_width\n", + " top = bounding_box[0].y * img_height\n", + " right = bounding_box[2].x * img_width\n", + " bottom = bounding_box[2].y * img_height\n", + "\n", + " # Crop the image\n", + " cropped_image = image.crop((left, top, right, bottom))\n", + "\n", + " # Convert the PIL image to bytes directly\n", + " cropped_image_bytes = BytesIO()\n", + " cropped_image.save(cropped_image_bytes, format=\"PNG\")\n", + " image_content = cropped_image_bytes.getvalue()\n", + "\n", + " docai_client = documentai.DocumentProcessorServiceClient(\n", + " client_options=ClientOptions(\n", + " api_endpoint=f\"{location}-documentai.googleapis.com\"\n", + " )\n", + " )\n", + " RESOURCE_NAME = docai_client.processor_path(\n", + " project_id, location, processor_id\n", + " )\n", + "\n", + " raw_document = documentai.RawDocument(\n", + " content=image_content, mime_type=\"image/png\"\n", + " )\n", + " process_options = 
{\"ocr_config\": {\"enable_symbol\": True}}\n", + " request = documentai.ProcessRequest(\n", + " name=RESOURCE_NAME,\n", + " raw_document=raw_document,\n", + " process_options=process_options,\n", + " )\n", + "\n", + " result = docai_client.process_document(request=request)\n", + " new_json_proto_data = result.document\n", + "\n", + " # Extracting text and confidence values\n", + " # new_json_data = json.loads(documentai.Document.to_json(document_object))\n", + " complete_text = new_json_proto_data.text\n", + " symbols_confidence = []\n", + "\n", + " for page in new_json_proto_data.pages:\n", + " for symbol in page.symbols:\n", + " segments = symbol.layout.text_anchor.text_segments[0]\n", + " start_index = int(segments.start_index)\n", + " end_index = int(segments.end_index)\n", + " symbol_text = complete_text[start_index:end_index]\n", + " confidence = symbol.layout.confidence\n", + " symbols_confidence.append((symbol_text, confidence))\n", + "\n", + " # Initially filter out '-' and '/' without affecting adjacent numeric values\n", + " symbols_confidence_filtered = [\n", + " (sym, conf)\n", + " for sym, conf in symbols_confidence\n", + " if sym not in (\"-\", \"/\")\n", + " ]\n", + "\n", + " # Check and remove the first symbol if its confidence is below 0.85\n", + " if (\n", + " symbols_confidence_filtered\n", + " and symbols_confidence_filtered[0][1] < 0.85\n", + " ):\n", + " symbols_confidence_filtered.pop(0)\n", + "\n", + " # Check and remove the last two symbols if their confidences are below 0.85\n", + " if (\n", + " len(symbols_confidence_filtered) > 2\n", + " and symbols_confidence_filtered[-1][1] < 0.85\n", + " ):\n", + " symbols_confidence_filtered.pop(-1)\n", + " if (\n", + " len(symbols_confidence_filtered) > 2\n", + " and symbols_confidence_filtered[-2][1] < 0.85\n", + " ):\n", + " symbols_confidence_filtered.pop(-2)\n", + "\n", + " if len(symbols_confidence_filtered) > 2:\n", + " for j in range(1, len(symbols_confidence_filtered) - 2):\n", + " if symbols_confidence_filtered[j][1] < 0.5:\n", + " symbols_confidence_filtered.pop(j)\n", + "\n", + " # Join the remaining characters from the processed list\n", + " post_processed_symbols = \"\".join(\n", + " [sym for sym, conf in symbols_confidence_filtered]\n", + " )\n", + " entity.mention_text = post_processed_symbols\n", + " return json_proto_data\n", + "\n", + "\n", + "list_of_files = [\n", + " i for i in list(file_names(input_path)[1].values()) if i.endswith(\".json\")\n", + "]\n", + "\n", + "for i in range(0, len(list_of_files)):\n", + " file_name = list_of_files[i]\n", + " # json_data=json.loads(source_bucket.blob(list_of_files[i]).download_as_string().decode('utf-8'))\n", + " json_proto_data = documentai_json_proto_downloader(\n", + " input_storage_bucket_name, file_name\n", + " )\n", + " print(\"Processing>>>>>>>\", file_name)\n", + " document_proto = remove_special_characters(json_proto_data, entity_name)\n", + " output_path_within_bucket = output_bucket_path_prefix + file_name.split(\"/\")[1]\n", + " store_document_as_json(\n", + " documentai.Document.to_json(document_proto),\n", + " output_storage_bucket_name,\n", + " output_path_within_bucket,\n", + " )" + ] + }, + { + "cell_type": "markdown", + "id": "77c3018b-729f-40a2-ba13-3dd994987862", + "metadata": {}, + "source": [ + "### 5.Output" + ] + }, + { + "cell_type": "markdown", + "id": "f98e124a-0fbe-45f4-8bb0-18be734d916e", + "metadata": {}, + "source": [ + "\n", + "The post processed json field can be found in the storage path provided by the user during the script 
execution, i.e. the configured output_path.
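If you prefer to verify the written files programmatically rather than browsing the bucket, a short listing snippet along these lines can be used. This is a sketch only; the bucket name and prefix below are placeholders split out of an output_path such as gs://bucket_name/path/to/output/.

```python
from google.cloud import storage

# Placeholders derived from output_path = "gs://bucket_name/path/to/output/"
bucket_name = "bucket_name"
prefix = "path/to/output/"

client = storage.Client()
for blob in client.list_blobs(bucket_name, prefix=prefix):
    if blob.name.endswith(".json"):
        print(blob.name)  # each post-processed Document JSON written in step 4
```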

\n", + "Comparison Between Input and Output File

\n", + "

Post processing results


\n", + "Upon code execution, the JSONs with the newly replaced values will be stored in the designated output Google Cloud Storage (GCS) bucket. This table summarizes the key differences between the input and output JSON files for the 'Amount_in_number' entity
\n", + " \n", + "\n", + " \n", + " \n", + " \n", + " \n", + "\n", + "\n", + "\n", + "\n", + "

Input Json

Output Json

" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "c88a3113-dfdb-4ec8-b119-2725da4c8b93", + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "environment": { + "kernel": "python3", + "name": "common-cpu.m112", + "type": "gcloud", + "uri": "gcr.io/deeplearning-platform-release/base-cpu:m112" + }, + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.7.12" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/noxfile.py b/noxfile.py index 70dc2360..767ccccf 100644 --- a/noxfile.py +++ b/noxfile.py @@ -30,7 +30,7 @@ ISORT_VERSION = "isort==5.11.0" LINT_PATHS = ["."] -DEFAULT_PYTHON_VERSION = "3.8" +DEFAULT_PYTHON_VERSION = "3.7" UNIT_TEST_PYTHON_VERSIONS = ["3.7", "3.8", "3.9", "3.10", "3.11"] UNIT_TEST_STANDARD_DEPENDENCIES = [