-
Notifications
You must be signed in to change notification settings - Fork 91
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
Merge branch 'main' into renovate/major-eslint-monorepo
- Loading branch information
Showing
31 changed files
with
4,089 additions
and
418 deletions.
There are no files selected for viewing
405 changes: 405 additions & 0 deletions
405
...ols/DocAI_Json_to_Canonical_Json_Conversion/Docai_json_to_canonical_json_conversion.ipynb
Large diffs are not rendered by default.
Oops, something went wrong.
Binary file added
BIN
+353 KB
...ools/DocAI_Json_to_Canonical_Json_Conversion/Images/conanical_json_output_1.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Binary file added
BIN
+103 KB
...ools/DocAI_Json_to_Canonical_Json_Conversion/Images/conanical_json_output_2.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Binary file added
BIN
+115 KB
...ools/DocAI_Json_to_Canonical_Json_Conversion/Images/conanical_json_output_3.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Binary file added
BIN
+154 KB
...ools/DocAI_Json_to_Canonical_Json_Conversion/Images/conanical_json_output_4.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Binary file added
BIN
+76.1 KB
...tor-tools/DocAI_Json_to_Canonical_Json_Conversion/Images/input_sample_image.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
287 changes: 287 additions & 0 deletions
287
..._import_document_schema_from_processor/Export_import_document_schema_from_processor.ipynb
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,287 @@ | ||
{ | ||
"cells": [ | ||
{ | ||
"cell_type": "markdown", | ||
"id": "6589fc93-39d1-4d10-be1f-e7eb33fe4087", | ||
"metadata": {}, | ||
"source": [ | ||
"# Export and Import Document schema from a processor (using spreadsheet).\n", | ||
"\n" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"id": "361f188e-fe11-4a49-b7c8-080e0e69ce7a", | ||
"metadata": {}, | ||
"source": [ | ||
"## Disclaimer\n", | ||
"\n", | ||
"This tool is not supported by the Google engineering team or product team. It is provided and supported on a best-effort basis by the DocAI Incubator Team. No guarantees of performance are implied. \n" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"id": "1036937a-0221-48eb-862e-3fa0b8e646a8", | ||
"metadata": {}, | ||
"source": [ | ||
"## Objective\n", | ||
"\n", | ||
"This document Guides how to export a schema from a processor to a spreadsheet(.xlsx extension) and import a schema from a spreadsheet to a processor . This approach considers 3 level nesting as well.\n", | ||
"\n" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"id": "115a4e82-5e83-468a-b0e5-097ca14f15d5", | ||
"metadata": {}, | ||
"source": [ | ||
"## Prerequisites\n", | ||
"\n", | ||
"* Vertex AI Notebook Or Colab (If using Colab, use authentication)\n", | ||
"* Processor details to import the processor\n", | ||
"* Permission For Google Storage and Vertex AI Notebook.\n" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"id": "142123d3-37b1-4aa8-841c-40c3bd52d70c", | ||
"metadata": {}, | ||
"source": [ | ||
"## 1. Exporting Document schema to a spreadsheet" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"id": "73ae8955-8516-42ce-a08e-2ace4152d7d9", | ||
"metadata": {}, | ||
"source": [ | ||
"\n", | ||
"#### Input\n", | ||
"* `project_id`=\"xxxxxxxxxx\" # Project ID of the project\n", | ||
"* `location`=\"us\" # location of the processor \n", | ||
"* `processor_id`=\"xxxxxxxxxxxxxxx\" #Processor id of processor from which the schema has to be exported to spreadsheet" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"id": "840bb64b-66e8-4ec2-b25b-815db36775e1", | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"processor_name = f\"projects/{project_id}/locations/{location}/processors/{processor_id}\"\n", | ||
"# get document schema\n", | ||
"from google.cloud import documentai_v1beta3\n", | ||
"\n", | ||
"\n", | ||
"def get_dataset_schema(processor_name):\n", | ||
" # Create a client\n", | ||
" client = documentai_v1beta3.DocumentServiceClient()\n", | ||
"\n", | ||
" # dataset_name = client.dataset_schema_path(project, location, processor)\n", | ||
" # Initialize request argument(s)\n", | ||
" request = documentai_v1beta3.GetDatasetSchemaRequest(\n", | ||
" name=processor_name + \"/dataset/datasetSchema\",\n", | ||
" )\n", | ||
"\n", | ||
" # Make the request\n", | ||
" response = client.get_dataset_schema(request=request)\n", | ||
"\n", | ||
" return response\n", | ||
"\n", | ||
"\n", | ||
"response_document_schema = get_dataset_schema(processor_name)\n", | ||
"dataset_schema = []\n", | ||
"for schema_metadata in response_document_schema.document_schema.entity_types:\n", | ||
" if len(schema_metadata.properties) > 0:\n", | ||
" for schema_property in schema_metadata.properties:\n", | ||
" temp_schema_metadata = {\n", | ||
" \"name\": schema_property.name,\n", | ||
" \"value_type\": schema_property.value_type,\n", | ||
" \"occurrence_type\": schema_property.occurrence_type.name,\n", | ||
" }\n", | ||
" if len(schema_metadata.display_name) == 0:\n", | ||
" dataset_schema.append(temp_schema_metadata)\n", | ||
" else:\n", | ||
" temp_schema_metadata[\"display_name\"] = schema_metadata.display_name\n", | ||
" dataset_schema.append(temp_schema_metadata)\n", | ||
"\n", | ||
"import pandas as pd\n", | ||
"\n", | ||
"df = pd.DataFrame(dataset_schema)\n", | ||
"df.to_excel(\"Document_Schema_exported.xlsx\", index=False)" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"id": "6e7dc8b0-c547-4cc5-845b-cdf73d2ce909", | ||
"metadata": {}, | ||
"source": [ | ||
"### Output \n", | ||
"* The output will be the schema saved in \"Document_Schema_exported.xlsx\" file as shown below\n", | ||
"<img src=\"./Images/Exported_schema.png\" width=800 height=400></img>\n", | ||
"\n", | ||
"#### * `Columns`\n", | ||
"#### Name:\n", | ||
"Entity type which can be parent entity or child entities\n", | ||
"\n", | ||
"#### Value_type:\n", | ||
"\n", | ||
"* Value type is the data type of the entities, if the entity is a parent item the value type will be same as entity type.if it is final child type then value type is data type\n", | ||
"\n", | ||
"#### Occurance_type :\n", | ||
"\n", | ||
"* Occurance type is the occurance type of respective entity\n", | ||
"\n", | ||
"#### display_name:\n", | ||
"\n", | ||
"* Display name is the name of the parent entity for child entities. if entity itself is the parent entity then display_name will be empty" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"id": "c156ac15-faa2-407d-87f0-86f19a10af33", | ||
"metadata": {}, | ||
"source": [ | ||
"## 2. Importing Document schema from a spreadsheet" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"id": "650eb081-9ca5-443e-8369-f281fc39f6fc", | ||
"metadata": {}, | ||
"source": [ | ||
"#### Input\n", | ||
"* `project_id`=\"xxxxxxxxxx\" # Project ID of the project\n", | ||
"* `new_location`=\"us\" # location of the processor \n", | ||
"* `new_processor_id`=\"xxxxxxxxxxxxxxx\" #Processor id of processor to which the schema has to be imported\n", | ||
"* `schema_xlsx_path`=\"Document_Schema_exported.xlsx\"" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"id": "d4e7499b-9895-4d61-a7dd-a12454d80c59", | ||
"metadata": { | ||
"tags": [] | ||
}, | ||
"source": [ | ||
"* Add any entities in the xlsx file to be added in the new processor\n", | ||
"\n", | ||
"## Note\n", | ||
"\n", | ||
"* Make sure the entities in the spreadsheet are not already in the schema of the processor to avoid issues\n" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"id": "6fa50e70-48ab-419f-95ee-9b1ffc889d73", | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"import numpy as np\n", | ||
"import math\n", | ||
"import pandas as pd\n", | ||
"from google.cloud import documentai_v1beta3\n", | ||
"\n", | ||
"# Import the Excel file back into a data frame\n", | ||
"imported_df = pd.read_excel(schema_xlsx_path)\n", | ||
"\n", | ||
"# Convert the data frame back to a list of dictionaries\n", | ||
"imported_data = imported_df.to_dict(orient=\"records\")\n", | ||
"\n", | ||
"parent_entities = []\n", | ||
"nested_entities = {}\n", | ||
"for data in imported_data:\n", | ||
" temp_data = {key: value for key, value in data.items() if key != \"display_name\"}\n", | ||
" if isinstance(data[\"display_name\"], float) and math.isnan(data[\"display_name\"]):\n", | ||
" parent_entities.append(temp_data)\n", | ||
" else:\n", | ||
" if data[\"display_name\"] in nested_entities.keys():\n", | ||
" nested_entities[data[\"display_name\"]].append(temp_data)\n", | ||
" else:\n", | ||
" nested_entities[data[\"display_name\"]] = [temp_data]\n", | ||
"\n", | ||
"schema_line = []\n", | ||
"\n", | ||
"for line, properties in nested_entities.items():\n", | ||
" client = documentai_v1beta3.types.DocumentSchema.EntityType()\n", | ||
" client.name = line\n", | ||
" client.base_types = [\"object\"]\n", | ||
" client.properties = properties\n", | ||
" client.display_name = line\n", | ||
" schema_line.append(client)\n", | ||
"\n", | ||
"new_processor_name = (\n", | ||
" f\"projects/{project_id}/locations/{new_location}/processors/{new_processor_id}\"\n", | ||
")\n", | ||
"\n", | ||
"response_newprocessor = get_dataset_schema(new_processor_name)\n", | ||
"# updating into the processor\n", | ||
"for i in response_newprocessor.document_schema.entity_types:\n", | ||
" for e3 in parent_entities:\n", | ||
" i.properties.append(e3)\n", | ||
"\n", | ||
"for e4 in schema_line:\n", | ||
" response_newprocessor.document_schema.entity_types.append(e4)\n", | ||
"\n", | ||
"\n", | ||
"def update_dataset_schema(schema):\n", | ||
" from google.cloud import documentai_v1beta3\n", | ||
"\n", | ||
" # Create a client\n", | ||
" client = documentai_v1beta3.DocumentServiceClient()\n", | ||
"\n", | ||
" # Initialize request argument(s)\n", | ||
" request = documentai_v1beta3.UpdateDatasetSchemaRequest(\n", | ||
" dataset_schema={\"name\": schema.name, \"document_schema\": schema.document_schema}\n", | ||
" )\n", | ||
"\n", | ||
" # Make the request\n", | ||
" response = client.update_dataset_schema(request=request)\n", | ||
"\n", | ||
" # Handle the response\n", | ||
" return response\n", | ||
"\n", | ||
"\n", | ||
"response_update = update_dataset_schema(response_newprocessor)" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"id": "043b9c7c-83f5-49c1-84ef-1cddb6c1dbf6", | ||
"metadata": {}, | ||
"source": [ | ||
"### Output \n", | ||
"* The schema of new processor will be updated as per spreadsheet given" | ||
] | ||
} | ||
], | ||
"metadata": { | ||
"environment": { | ||
"kernel": "python3", | ||
"name": "common-cpu.m112", | ||
"type": "gcloud", | ||
"uri": "gcr.io/deeplearning-platform-release/base-cpu:m112" | ||
}, | ||
"kernelspec": { | ||
"display_name": "Python 3", | ||
"language": "python", | ||
"name": "python3" | ||
}, | ||
"language_info": { | ||
"codemirror_mode": { | ||
"name": "ipython", | ||
"version": 3 | ||
}, | ||
"file_extension": ".py", | ||
"mimetype": "text/x-python", | ||
"name": "python", | ||
"nbconvert_exporter": "python", | ||
"pygments_lexer": "ipython3", | ||
"version": "3.7.12" | ||
} | ||
}, | ||
"nbformat": 4, | ||
"nbformat_minor": 5 | ||
} |
Binary file added
BIN
+114 KB
...r-tools/Export_import_document_schema_from_processor/Images/Exported_schema.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Oops, something went wrong.