Commit

Merge branch 'main' into renovate/major-eslint-monorepo
holtskinner committed May 17, 2024
2 parents 418e1dd + a93cd89 commit 30cee11
Showing 31 changed files with 4,089 additions and 418 deletions.

@@ -0,0 +1,287 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "6589fc93-39d1-4d10-be1f-e7eb33fe4087",
"metadata": {},
"source": [
"# Export and Import Document schema from a processor (using spreadsheet).\n",
"\n"
]
},
{
"cell_type": "markdown",
"id": "361f188e-fe11-4a49-b7c8-080e0e69ce7a",
"metadata": {},
"source": [
"## Disclaimer\n",
"\n",
"This tool is not supported by the Google engineering team or product team. It is provided and supported on a best-effort basis by the DocAI Incubator Team. No guarantees of performance are implied. \n"
]
},
{
"cell_type": "markdown",
"id": "1036937a-0221-48eb-862e-3fa0b8e646a8",
"metadata": {},
"source": [
"## Objective\n",
"\n",
"This document Guides how to export a schema from a processor to a spreadsheet(.xlsx extension) and import a schema from a spreadsheet to a processor . This approach considers 3 level nesting as well.\n",
"\n"
]
},
{
"cell_type": "markdown",
"id": "115a4e82-5e83-468a-b0e5-097ca14f15d5",
"metadata": {},
"source": [
"## Prerequisites\n",
"\n",
"* Vertex AI Notebook Or Colab (If using Colab, use authentication)\n",
"* Processor details to import the processor\n",
"* Permission For Google Storage and Vertex AI Notebook.\n"
]
},
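{
"cell_type": "markdown",
"id": "b2f1c0de-1111-4aaa-8bbb-0c0c0c0c0c01",
"metadata": {},
"source": [
"If running in Colab, authenticate before calling any Google Cloud APIs. A minimal sketch:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b2f1c0de-1111-4aaa-8bbb-0c0c0c0c0c02",
"metadata": {},
"outputs": [],
"source": [
"# Colab only: authenticate the user for Google Cloud API calls\n",
"from google.colab import auth\n",
"\n",
"auth.authenticate_user()"
]
},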
{
"cell_type": "markdown",
"id": "142123d3-37b1-4aa8-841c-40c3bd52d70c",
"metadata": {},
"source": [
"## 1. Exporting Document schema to a spreadsheet"
]
},
{
"cell_type": "markdown",
"id": "73ae8955-8516-42ce-a08e-2ace4152d7d9",
"metadata": {},
"source": [
"\n",
"#### Input\n",
"* `project_id`=\"xxxxxxxxxx\" # Project ID of the project\n",
"* `location`=\"us\" # location of the processor \n",
"* `processor_id`=\"xxxxxxxxxxxxxxx\" #Processor id of processor from which the schema has to be exported to spreadsheet"
]
},
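{
"cell_type": "markdown",
"id": "b2f1c0de-1111-4aaa-8bbb-0c0c0c0c0c03",
"metadata": {},
"source": [
"The cell below assigns these inputs so the notebook runs end to end; the values are placeholders, so replace them with your own before running."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b2f1c0de-1111-4aaa-8bbb-0c0c0c0c0c04",
"metadata": {},
"outputs": [],
"source": [
"# Placeholder values; replace them with your own project, location, and processor\n",
"project_id = \"xxxxxxxxxx\"  # Project ID of the project\n",
"location = \"us\"  # Location of the processor\n",
"processor_id = \"xxxxxxxxxxxxxxx\"  # Processor to export the schema from"
]
},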
{
"cell_type": "code",
"execution_count": null,
"id": "840bb64b-66e8-4ec2-b25b-815db36775e1",
"metadata": {},
"outputs": [],
"source": [
"processor_name = f\"projects/{project_id}/locations/{location}/processors/{processor_id}\"\n",
"# get document schema\n",
"from google.cloud import documentai_v1beta3\n",
"\n",
"\n",
"def get_dataset_schema(processor_name):\n",
" # Create a client\n",
" client = documentai_v1beta3.DocumentServiceClient()\n",
"\n",
" # dataset_name = client.dataset_schema_path(project, location, processor)\n",
" # Initialize request argument(s)\n",
" request = documentai_v1beta3.GetDatasetSchemaRequest(\n",
" name=processor_name + \"/dataset/datasetSchema\",\n",
" )\n",
"\n",
" # Make the request\n",
" response = client.get_dataset_schema(request=request)\n",
"\n",
" return response\n",
"\n",
"\n",
"response_document_schema = get_dataset_schema(processor_name)\n",
"dataset_schema = []\n",
"for schema_metadata in response_document_schema.document_schema.entity_types:\n",
" if len(schema_metadata.properties) > 0:\n",
" for schema_property in schema_metadata.properties:\n",
" temp_schema_metadata = {\n",
" \"name\": schema_property.name,\n",
" \"value_type\": schema_property.value_type,\n",
" \"occurrence_type\": schema_property.occurrence_type.name,\n",
" }\n",
" if len(schema_metadata.display_name) == 0:\n",
" dataset_schema.append(temp_schema_metadata)\n",
" else:\n",
" temp_schema_metadata[\"display_name\"] = schema_metadata.display_name\n",
" dataset_schema.append(temp_schema_metadata)\n",
"\n",
"import pandas as pd\n",
"\n",
"df = pd.DataFrame(dataset_schema)\n",
"df.to_excel(\"Document_Schema_exported.xlsx\", index=False)"
]
},
{
"cell_type": "markdown",
"id": "6e7dc8b0-c547-4cc5-845b-cdf73d2ce909",
"metadata": {},
"source": [
"### Output \n",
"* The output will be the schema saved in \"Document_Schema_exported.xlsx\" file as shown below\n",
"<img src=\"./Images/Exported_schema.png\" width=800 height=400></img>\n",
"\n",
"#### * `Columns`\n",
"#### Name:\n",
"Entity type which can be parent entity or child entities\n",
"\n",
"#### Value_type:\n",
"\n",
"* Value type is the data type of the entities, if the entity is a parent item the value type will be same as entity type.if it is final child type then value type is data type\n",
"\n",
"#### Occurance_type :\n",
"\n",
"* Occurance type is the occurance type of respective entity\n",
"\n",
"#### display_name:\n",
"\n",
"* Display name is the name of the parent entity for child entities. if entity itself is the parent entity then display_name will be empty"
]
},
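{
"cell_type": "markdown",
"id": "b2f1c0de-1111-4aaa-8bbb-0c0c0c0c0c05",
"metadata": {},
"source": [
"For illustration only, a hypothetical export could contain rows like the ones built below; the entity names, value types, and occurrence types are invented for this example and will differ for your processor."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b2f1c0de-1111-4aaa-8bbb-0c0c0c0c0c06",
"metadata": {},
"outputs": [],
"source": [
"# Hypothetical exported rows (illustrative only)\n",
"import pandas as pd\n",
"\n",
"example_rows = [\n",
"    # Parent / top-level entities: display_name left empty\n",
"    {\"name\": \"invoice_date\", \"value_type\": \"datetime\", \"occurrence_type\": \"OPTIONAL_ONCE\"},\n",
"    {\"name\": \"line_item\", \"value_type\": \"line_item\", \"occurrence_type\": \"OPTIONAL_MULTIPLE\"},\n",
"    # Child entity: display_name names the parent entity\n",
"    {\"name\": \"line_item/amount\", \"value_type\": \"money\", \"occurrence_type\": \"OPTIONAL_MULTIPLE\", \"display_name\": \"line_item\"},\n",
"]\n",
"pd.DataFrame(example_rows)"
]
},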
{
"cell_type": "markdown",
"id": "c156ac15-faa2-407d-87f0-86f19a10af33",
"metadata": {},
"source": [
"## 2. Importing Document schema from a spreadsheet"
]
},
{
"cell_type": "markdown",
"id": "650eb081-9ca5-443e-8369-f281fc39f6fc",
"metadata": {},
"source": [
"#### Input\n",
"* `project_id`=\"xxxxxxxxxx\" # Project ID of the project\n",
"* `new_location`=\"us\" # location of the processor \n",
"* `new_processor_id`=\"xxxxxxxxxxxxxxx\" #Processor id of processor to which the schema has to be imported\n",
"* `schema_xlsx_path`=\"Document_Schema_exported.xlsx\""
]
},
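{
"cell_type": "markdown",
"id": "b2f1c0de-1111-4aaa-8bbb-0c0c0c0c0c07",
"metadata": {},
"source": [
"As before, the cell below assigns these inputs with placeholder values; replace them with your own before running."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b2f1c0de-1111-4aaa-8bbb-0c0c0c0c0c08",
"metadata": {},
"outputs": [],
"source": [
"# Placeholder values; replace them with your own project, location, and processor\n",
"project_id = \"xxxxxxxxxx\"  # Project ID of the project\n",
"new_location = \"us\"  # Location of the target processor\n",
"new_processor_id = \"xxxxxxxxxxxxxxx\"  # Processor to import the schema into\n",
"schema_xlsx_path = \"Document_Schema_exported.xlsx\"  # Schema spreadsheet"
]
},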
{
"cell_type": "markdown",
"id": "d4e7499b-9895-4d61-a7dd-a12454d80c59",
"metadata": {
"tags": []
},
"source": [
"* Add any entities in the xlsx file to be added in the new processor\n",
"\n",
"## Note\n",
"\n",
"* Make sure the entities in the spreadsheet are not already in the schema of the processor to avoid issues\n"
]
},
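{
"cell_type": "markdown",
"id": "b2f1c0de-1111-4aaa-8bbb-0c0c0c0c0c09",
"metadata": {},
"source": [
"The optional pre-check below is a minimal sketch of that duplicate check, not part of the original tool: it re-uses `get_dataset_schema` from section 1 and compares the entity names in the spreadsheet against the properties already present in the target processor's schema."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b2f1c0de-1111-4aaa-8bbb-0c0c0c0c0c10",
"metadata": {},
"outputs": [],
"source": [
"# Optional pre-check (sketch): warn about spreadsheet entities that already\n",
"# exist in the target processor's schema\n",
"import pandas as pd\n",
"\n",
"target_name = (\n",
"    f\"projects/{project_id}/locations/{new_location}/processors/{new_processor_id}\"\n",
")\n",
"existing_schema = get_dataset_schema(target_name)\n",
"\n",
"existing_names = {\n",
"    prop.name\n",
"    for entity_type in existing_schema.document_schema.entity_types\n",
"    for prop in entity_type.properties\n",
"}\n",
"\n",
"sheet_names = set(pd.read_excel(schema_xlsx_path)[\"name\"])\n",
"duplicates = sorted(sheet_names & existing_names)\n",
"if duplicates:\n",
"    print(f\"Warning: entities already present in the schema: {duplicates}\")"
]
},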
{
"cell_type": "code",
"execution_count": null,
"id": "6fa50e70-48ab-419f-95ee-9b1ffc889d73",
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"import math\n",
"import pandas as pd\n",
"from google.cloud import documentai_v1beta3\n",
"\n",
"# Import the Excel file back into a data frame\n",
"imported_df = pd.read_excel(schema_xlsx_path)\n",
"\n",
"# Convert the data frame back to a list of dictionaries\n",
"imported_data = imported_df.to_dict(orient=\"records\")\n",
"\n",
"parent_entities = []\n",
"nested_entities = {}\n",
"for data in imported_data:\n",
" temp_data = {key: value for key, value in data.items() if key != \"display_name\"}\n",
" if isinstance(data[\"display_name\"], float) and math.isnan(data[\"display_name\"]):\n",
" parent_entities.append(temp_data)\n",
" else:\n",
" if data[\"display_name\"] in nested_entities.keys():\n",
" nested_entities[data[\"display_name\"]].append(temp_data)\n",
" else:\n",
" nested_entities[data[\"display_name\"]] = [temp_data]\n",
"\n",
"schema_line = []\n",
"\n",
"for line, properties in nested_entities.items():\n",
" client = documentai_v1beta3.types.DocumentSchema.EntityType()\n",
" client.name = line\n",
" client.base_types = [\"object\"]\n",
" client.properties = properties\n",
" client.display_name = line\n",
" schema_line.append(client)\n",
"\n",
"new_processor_name = (\n",
" f\"projects/{project_id}/locations/{new_location}/processors/{new_processor_id}\"\n",
")\n",
"\n",
"response_newprocessor = get_dataset_schema(new_processor_name)\n",
"# updating into the processor\n",
"for i in response_newprocessor.document_schema.entity_types:\n",
" for e3 in parent_entities:\n",
" i.properties.append(e3)\n",
"\n",
"for e4 in schema_line:\n",
" response_newprocessor.document_schema.entity_types.append(e4)\n",
"\n",
"\n",
"def update_dataset_schema(schema):\n",
" from google.cloud import documentai_v1beta3\n",
"\n",
" # Create a client\n",
" client = documentai_v1beta3.DocumentServiceClient()\n",
"\n",
" # Initialize request argument(s)\n",
" request = documentai_v1beta3.UpdateDatasetSchemaRequest(\n",
" dataset_schema={\"name\": schema.name, \"document_schema\": schema.document_schema}\n",
" )\n",
"\n",
" # Make the request\n",
" response = client.update_dataset_schema(request=request)\n",
"\n",
" # Handle the response\n",
" return response\n",
"\n",
"\n",
"response_update = update_dataset_schema(response_newprocessor)"
]
},
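{
"cell_type": "markdown",
"id": "b2f1c0de-1111-4aaa-8bbb-0c0c0c0c0c11",
"metadata": {},
"source": [
"Optionally, verify the update with a minimal sketch (not part of the original tool) that re-fetches the dataset schema via `get_dataset_schema` from section 1 and lists each entity type with its property count."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b2f1c0de-1111-4aaa-8bbb-0c0c0c0c0c12",
"metadata": {},
"outputs": [],
"source": [
"# Optional verification (sketch): re-fetch the schema and summarize it\n",
"updated_schema = get_dataset_schema(new_processor_name)\n",
"for entity_type in updated_schema.document_schema.entity_types:\n",
"    print(entity_type.name, \"->\", len(entity_type.properties), \"properties\")"
]
},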
{
"cell_type": "markdown",
"id": "043b9c7c-83f5-49c1-84ef-1cddb6c1dbf6",
"metadata": {},
"source": [
"### Output \n",
"* The schema of new processor will be updated as per spreadsheet given"
]
}
],
"metadata": {
"environment": {
"kernel": "python3",
"name": "common-cpu.m112",
"type": "gcloud",
"uri": "gcr.io/deeplearning-platform-release/base-cpu:m112"
},
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.12"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
39 changes: 38 additions & 1 deletion incubator-tools/README.md
@@ -32,4 +32,41 @@ Folder contains various tools made for the benefit of Doc AI users.
* [Import and Evaluate Processors](./importing_processor_and_evaluating_with_alternate_test_sets/)
* [Labeled Dataset Validation](./labeled_dataset_validation/)
* [Split Overlapping Entities](./overlapping_split/)
* [Rename Entity Type](./rename_entity_type/)

* [Combine Two Processors Output](./Combine_two_processors_output/)
* [DocAI JSON to Canonical JSON Conversion](./DocAI_Json_to_Canonical_Json_Conversion/)
* [Export and Import Document Schema from Processor](./Export_import_document_schema_from_processor/)
* [Asynchronous API Reference Architecture](./Reference_architecture_asynchronous/)
* [Advanced Table Line Enhancement](./advance_table_line_enhancement/)
* [Backmapping Entities from Parser Output Language to Original Language of the Document](./backmapping_entities_from_parser_output_to_original_language/)
* [Bank Statement Post Processing Tool](./bank_statement_post_processing_tool/)
* [Bank Statements Line Item Improver and Missing Items Finder](./bank_statements_line_items_improver_and_missing_items_finder/)
* [Categorizing Bank Statement Transactions by Account Number](./categorizing_bank_statement_transactions_by_account_number/)
* [Comparison between Custom Document Classifier Ground Truth and Parsed JSON Prediction Results](./cdc_comparison/)
* [CMEK Key Creation and Destroying Procedure](./cmek_docai_processor/)
* [Date Entity Normalization](./date_entity_normalization/)
* [PDF Clustering Analysis Tool](./docai_pdf_clustering_analysis_tool/)
* [Document AI Processor Types](./docai_processor_types/)
* [Schema from Form Parser Output](./document-schema-from-form-parser-output/)
* [Migrating Schema Between the Processors](./documentai_migrating_schema_between_processors/)
* [Enrich the Address for Invoice Parser](./enrich_address_for_invoice/)
* [Entity Sorting using Csharp](./entity_sorting_csharp/)
* [Entity Sorting using Python](./entity_sorting_python/)
* [Formparser Table to Entity Converter Tool](./formparser_table_to_entity_converter_tool/)
* [HITL Line Item Prefix Issues](./hitl_line_item_prefix_issue/)
* [Identity Document Proofing Evaluation](./identity_document_proofing_evaluation/)
* [Normalize Date Entities from 19xx to 20xx](./normalize_date_value_19xx_to_20xx/)
* [OCR Based Document Section Splitter](./ocr_based_document_section_splitter/)
* [Reprocess Old OCR JSON to New OCR Engine](./old_ocr_to_new_ocr_conversion/)
* [Separation of Paragraphs in a Document](./paragraph_separation/)
* [Replace PII Data with Synthetic Data](./pii_synthetic_redaction_tool/)
* [Post Processing Negative Values](./post_processing_negative_values/)
* [Annotating the Entities in the Document based on the OCR Tokens](./reverse_annotation_tool/)
* [Schema Converter Tool](./schema_converter_tool/)
* [Signature Detection Technique](./signature-detection-technique/)
* [Special Character Removal](./special_character_removal/)
* [Tagging Line Items in a Specific Format](./specific_format_line_items_tagging/)
* [Labeling Documents through Custom Document Splitter Parser using Synonyms List](./synonyms_based_splitter_document_labeling/)
* [Tagging Entity Synonyms](./synonyms_entity_tag/)
* [Vertex Object Detection Visualization](./vertex_object_detection_visualization/)