feat: Incubator Tools - 7th Iteration #804

Merged
merged 8 commits into from
May 17, 2024
@@ -0,0 +1,287 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "6589fc93-39d1-4d10-be1f-e7eb33fe4087",
"metadata": {},
"source": [
"# Export and Import Document schema from a processor (using spreadsheet).\n",
"\n"
]
},
{
"cell_type": "markdown",
"id": "361f188e-fe11-4a49-b7c8-080e0e69ce7a",
"metadata": {},
"source": [
"## Disclaimer\n",
"\n",
"This tool is not supported by the Google engineering team or product team. It is provided and supported on a best-effort basis by the DocAI Incubator Team. No guarantees of performance are implied. \n"
]
},
{
"cell_type": "markdown",
"id": "1036937a-0221-48eb-862e-3fa0b8e646a8",
"metadata": {},
"source": [
"## Objective\n",
"\n",
"This guide shows how to export a schema from a processor to a spreadsheet (.xlsx) and how to import a schema from a spreadsheet into a processor. The approach also handles three levels of nesting.\n",
"\n"
]
},
{
"cell_type": "markdown",
"id": "115a4e82-5e83-468a-b0e5-097ca14f15d5",
"metadata": {},
"source": [
"## Prerequisites\n",
"\n",
"* Vertex AI Notebook or Colab (if using Colab, authenticate first)\n",
"* Details of the processor whose schema is exported or imported (project ID, location, processor ID)\n",
"* Permissions for Google Cloud Storage and Vertex AI Notebook\n"
]
},
{
"cell_type": "markdown",
"id": "142123d3-37b1-4aa8-841c-40c3bd52d70c",
"metadata": {},
"source": [
"## 1. Exporting Document schema to a spreadsheet"
]
},
{
"cell_type": "markdown",
"id": "73ae8955-8516-42ce-a08e-2ace4152d7d9",
"metadata": {},
"source": [
"\n",
"#### Input\n",
"* `project_id`=\"xxxxxxxxxx\" # Project ID of the project\n",
"* `location`=\"us\" # location of the processor \n",
"* `processor_id`=\"xxxxxxxxxxxxxxx\" #Processor id of processor from which the schema has to be exported to spreadsheet"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "840bb64b-66e8-4ec2-b25b-815db36775e1",
"metadata": {},
"outputs": [],
"source": [
"processor_name = f\"projects/{project_id}/locations/{location}/processors/{processor_id}\"\n",
"# get document schema\n",
"from google.cloud import documentai_v1beta3\n",
"\n",
"\n",
"def get_dataset_schema(processor_name):\n",
"    # Create a client\n",
"    client = documentai_v1beta3.DocumentServiceClient()\n",
"\n",
"    # dataset_name = client.dataset_schema_path(project, location, processor)\n",
"    # Initialize request argument(s)\n",
"    request = documentai_v1beta3.GetDatasetSchemaRequest(\n",
"        name=processor_name + \"/dataset/datasetSchema\",\n",
"    )\n",
"\n",
"    # Make the request\n",
"    response = client.get_dataset_schema(request=request)\n",
"\n",
"    return response\n",
"\n",
"\n",
"response_document_schema = get_dataset_schema(processor_name)\n",
"dataset_schema = []\n",
"for schema_metadata in response_document_schema.document_schema.entity_types:\n",
"    if len(schema_metadata.properties) > 0:\n",
"        for schema_property in schema_metadata.properties:\n",
"            temp_schema_metadata = {\n",
"                \"name\": schema_property.name,\n",
"                \"value_type\": schema_property.value_type,\n",
"                \"occurrence_type\": schema_property.occurrence_type.name,\n",
"            }\n",
"            if len(schema_metadata.display_name) == 0:\n",
"                dataset_schema.append(temp_schema_metadata)\n",
"            else:\n",
"                temp_schema_metadata[\"display_name\"] = schema_metadata.display_name\n",
"                dataset_schema.append(temp_schema_metadata)\n",
"\n",
"import pandas as pd\n",
"\n",
"df = pd.DataFrame(dataset_schema)\n",
"df.to_excel(\"Document_Schema_exported.xlsx\", index=False)"
]
},
{
"cell_type": "markdown",
"id": "6e7dc8b0-c547-4cc5-845b-cdf73d2ce909",
"metadata": {},
"source": [
"### Output \n",
"* The schema is saved to the file \"Document_Schema_exported.xlsx\", as shown below\n",
"<img src=\"./Images/Exported_schema.png\" width=800 height=400></img>\n",
"\n",
"#### Columns\n",
"#### name:\n",
"Entity type, which can be a parent entity or a child entity\n",
"\n",
"#### value_type:\n",
"\n",
"* The data type of the entity. For a parent entity, the value type is the same as the entity type; for a leaf child entity, it is the entity's data type\n",
"\n",
"#### occurrence_type:\n",
"\n",
"* The occurrence type of the respective entity\n",
"\n",
"#### display_name:\n",
"\n",
"* For child entities, the name of the parent entity. If the entity itself is a parent entity, display_name is empty"
]
},
{
"cell_type": "markdown",
"id": "c156ac15-faa2-407d-87f0-86f19a10af33",
"metadata": {},
"source": [
"## 2. Importing Document schema from a spreadsheet"
]
},
{
"cell_type": "markdown",
"id": "650eb081-9ca5-443e-8369-f281fc39f6fc",
"metadata": {},
"source": [
"#### Input\n",
"* `project_id`=\"xxxxxxxxxx\" # Project ID of the project\n",
"* `new_location`=\"us\" # location of the processor \n",
"* `new_processor_id`=\"xxxxxxxxxxxxxxx\" #Processor id of processor to which the schema has to be imported\n",
"* `schema_xlsx_path`=\"Document_Schema_exported.xlsx\""
]
},
{
"cell_type": "markdown",
"id": "d4e7499b-9895-4d61-a7dd-a12454d80c59",
"metadata": {
"tags": []
},
"source": [
"* Add to the xlsx file any entities that should be added to the new processor\n",
"\n",
"## Note\n",
"\n",
"* Make sure the entities in the spreadsheet are not already present in the processor's schema, to avoid issues\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "6fa50e70-48ab-419f-95ee-9b1ffc889d73",
"metadata": {},
"outputs": [],
"source": [
"import math\n",
"import pandas as pd\n",
"from google.cloud import documentai_v1beta3\n",
"\n",
"# Import the Excel file back into a data frame\n",
"imported_df = pd.read_excel(schema_xlsx_path)\n",
"\n",
"# Convert the data frame back to a list of dictionaries\n",
"imported_data = imported_df.to_dict(orient=\"records\")\n",
"\n",
"# Rows with an empty display_name (read back as NaN) are top-level properties;\n",
"# all other rows are grouped under their parent entity name\n",
"parent_entities = []\n",
"nested_entities = {}\n",
"for data in imported_data:\n",
"    temp_data = {key: value for key, value in data.items() if key != \"display_name\"}\n",
"    if isinstance(data[\"display_name\"], float) and math.isnan(data[\"display_name\"]):\n",
"        parent_entities.append(temp_data)\n",
"    else:\n",
"        if data[\"display_name\"] in nested_entities.keys():\n",
"            nested_entities[data[\"display_name\"]].append(temp_data)\n",
"        else:\n",
"            nested_entities[data[\"display_name\"]] = [temp_data]\n",
"\n",
"schema_line = []\n",
"\n",
"# Build one EntityType per parent entity found in the spreadsheet\n",
"for line, properties in nested_entities.items():\n",
"    entity_type = documentai_v1beta3.types.DocumentSchema.EntityType()\n",
"    entity_type.name = line\n",
"    entity_type.base_types = [\"object\"]\n",
"    entity_type.properties = properties\n",
"    entity_type.display_name = line\n",
"    schema_line.append(entity_type)\n",
"\n",
"new_processor_name = (\n",
"    f\"projects/{project_id}/locations/{new_location}/processors/{new_processor_id}\"\n",
")\n",
"\n",
"# get_dataset_schema is defined in the export cell above\n",
"response_newprocessor = get_dataset_schema(new_processor_name)\n",
"# updating into the processor\n",
"for i in response_newprocessor.document_schema.entity_types:\n",
"    for e3 in parent_entities:\n",
"        i.properties.append(e3)\n",
"\n",
"for e4 in schema_line:\n",
"    response_newprocessor.document_schema.entity_types.append(e4)\n",
"\n",
"\n",
"def update_dataset_schema(schema):\n",
"    from google.cloud import documentai_v1beta3\n",
"\n",
"    # Create a client\n",
"    client = documentai_v1beta3.DocumentServiceClient()\n",
"\n",
"    # Initialize request argument(s)\n",
"    request = documentai_v1beta3.UpdateDatasetSchemaRequest(\n",
"        dataset_schema={\"name\": schema.name, \"document_schema\": schema.document_schema}\n",
"    )\n",
"\n",
"    # Make the request\n",
"    response = client.update_dataset_schema(request=request)\n",
"\n",
"    # Handle the response\n",
"    return response\n",
"\n",
"\n",
"response_update = update_dataset_schema(response_newprocessor)"
]
},
{
"cell_type": "markdown",
"id": "043b9c7c-83f5-49c1-84ef-1cddb6c1dbf6",
"metadata": {},
"source": [
"### Output \n",
"* The schema of the new processor is updated to match the given spreadsheet"
]
}
],
"metadata": {
"environment": {
"kernel": "python3",
"name": "common-cpu.m112",
"type": "gcloud",
"uri": "gcr.io/deeplearning-platform-release/base-cpu:m112"
},
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.12"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
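
The notebook above writes one spreadsheet row per schema property and, on import, uses an empty `display_name` cell to tell top-level properties from nested ones. The following is a minimal sketch of that row layout and regrouping, assuming plain dicts as stand-ins for `DocumentSchema.EntityType` messages. The entity names (`invoice_document`, `invoice_id`, `line_item`, `amount`), the value types, and the helper names `flatten_schema` and `group_rows` are hypothetical and chosen only for illustration; no Document AI or pandas call is made.

```python
import math

# Hypothetical stand-in for document_schema.entity_types: plain dicts, not
# real DocumentSchema.EntityType messages. All names here are illustrative.
entity_types = [
    {
        "name": "invoice_document",
        "display_name": "",  # assumed: the root entity type has no display name
        "properties": [
            {"name": "invoice_id", "value_type": "string", "occurrence_type": "OPTIONAL_ONCE"},
            {"name": "line_item", "value_type": "line_item", "occurrence_type": "OPTIONAL_MULTIPLE"},
        ],
    },
    {
        "name": "line_item",
        "display_name": "line_item",
        "properties": [
            {"name": "amount", "value_type": "money", "occurrence_type": "OPTIONAL_ONCE"},
        ],
    },
]


def flatten_schema(entity_types):
    """Return one row per property; display_name holds the parent entity, if any."""
    rows = []
    for entity_type in entity_types:
        for prop in entity_type["properties"]:
            row = dict(prop)
            # pandas reads an empty spreadsheet cell back as NaN, so NaN is
            # used here to mimic a row that went through the xlsx round trip.
            row["display_name"] = entity_type["display_name"] or math.nan
            rows.append(row)
    return rows


def group_rows(rows):
    """Split rows into top-level properties and properties keyed by parent name."""
    parent_entities, nested_entities = [], {}
    for row in rows:
        prop = {key: value for key, value in row.items() if key != "display_name"}
        parent = row["display_name"]
        if isinstance(parent, float) and math.isnan(parent):
            parent_entities.append(prop)
        else:
            nested_entities.setdefault(parent, []).append(prop)
    return parent_entities, nested_entities


rows = flatten_schema(entity_types)
parent_entities, nested_entities = group_rows(rows)
print(parent_entities)
print(nested_entities)
```

Using `dict.setdefault` replaces the explicit "key already present" branch in the notebook's import cell without changing the grouping it produces.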
39 changes: 38 additions & 1 deletion incubator-tools/README.md
@@ -32,4 +32,41 @@ Folder contains various tools which is made for the benefit of Doc AI users.
* [Import and Evaluator Processors](./importing_processor_and_evaluating_with_alternate_test_sets/)
* [Labeled Dataset Validation](./labeled_dataset_validation/)
* [Split Overlapping Entities](./overlapping_split/)
* [Rename Entity Type](./rename_entity_type/)

* [Combine Two Processors Output](./Combine_two_processors_output/)
* [DocAI Json to Canonical Json Conversion](./DocAI_Json_to_Canonical_Json_Conversion/)
* [Export and Import Document Schema from Processor](./Export_import_document_schema_from_processor/)
* [Asynchronous API Reference Architecture](./Reference_architecture_asynchronous/)
* [Advance Table Line Enhancement](./advance_table_line_enhancement/)
* [Backmapping Entities from Parser Output Language to Original Language of the Document](./backmapping_entities_from_parser_output_to_original_language/)
* [Bank Statement Post Processing Tool](./bank_statement_post_processing_tool/)
* [Bank Statements Line Item Improver and Missing Items Finder](./bank_statements_line_items_improver_and_missing_items_finder/)
* [Categorizing Bank Statement Transactions by Account Number](./categorizing_bank_statement_transactions_by_account_number/)
* [Comparison between Custom Document Classifier Ground Truth and Parsed Json Prediction Results](./cdc_comparison/)
* [CMEK Key Creation and Destroying Procedure](./cmek_docai_processor/)
* [Date Entity Normalization](./date_entity_normalization/)
* [PDF Clustering Analysis Tool](./docai_pdf_clustering_analysis_tool/)
* [Document AI Processor Types](./docai_processor_types/)
* [Schema from Form Parser Output](./document-schema-from-form-parser-output/)
* [Migrating Schema Between the Processors](./documentai_migrating_schema_between_processors/)
* [Enrich the Address for Invoice Parser](./enrich_address_for_invoice/)
* [Entity Sorting using Csharp](./entity_sorting_csharp/)
* [Entity Sorting using Python](./entity_sorting_python/)
* [Formparser Table to Entity Converter Tool](./formparser_table_to_entity_converter_tool/)
* [HITL Line Item Prefix Issues](./hitl_line_item_prefix_issue/)
* [Identity Document Proofing Evaluation](./identity_document_proofing_evaluation/)
* [Normalize Date Entities from 19xx to 20xx](./normalize_date_value_19xx_to_20xx/)
* [OCR Based Document Section Splitter](./ocr_based_document_section_splitter/)
* [Reprocess Old OCR Json to New OCR Engine](./old_ocr_to_new_ocr_conversion/)
* [Separation of Paragraphs in a Document](./paragraph_separation/)
* [Replace PII Data with Synthetic Data](./pii_synthetic_redaction_tool/)
* [Post Processing Negative Values](./post_processing_negative_values/)
* [Annotating the Entities in the Document based on the OCR Tokens](./reverse_annotation_tool/)
* [Schema Converter Tool](./schema_converter_tool/)
* [Signature Detection Technique](./signature-detection-technique/)
* [Special Character Removal](./special_character_removal/)
* [Tagging Line Items in a Specific Format](./specific_format_line_items_tagging/)
* [Labeling Documents through Custom Document Splitter Parser using Synonyms List](./synonyms_based_splitter_document_labeling/)
* [Tagging Entity Synonyms](./synonyms_entity_tag/)
* [Vertex Object Detection Visualization](./vertex_object_detection_visualization/)