This repository demonstrates how to create, train, and deploy a custom document extractor using Google Cloud's Document AI Workbench. The project involves setting up the environment, defining processor fields, labeling documents, training models, and deploying the trained models. Below is a detailed technical overview of each step involved:
In the Google Cloud console, navigate to the Document AI section and create a new processor for custom document extraction. Name the processor (e.g., my-custom-document-extractor), select the nearest region, and opt for Google-managed storage and encryption.
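The same processor can also be created from code instead of the console. The following is a minimal sketch, assuming the `google-cloud-documentai` Python package is installed and credentials are configured. The helper names, the project ID, and the processor type string `CUSTOM_EXTRACTION_PROCESSOR` are assumptions made for illustration and are not taken from this repository.

```python
def location_parent(project_id: str, location: str) -> str:
    """Build the parent resource name under which processors are created."""
    return f"projects/{project_id}/locations/{location}"


def api_endpoint(location: str) -> str:
    """Build the regional Document AI endpoint for a location such as 'us' or 'eu'."""
    return f"{location}-documentai.googleapis.com"


def create_custom_extractor(project_id: str, location: str, display_name: str):
    """Create a custom extractor processor (hypothetical helper, not from the repo)."""
    # Imported lazily so the pure helpers above work without the client library.
    from google.api_core.client_options import ClientOptions
    from google.cloud import documentai

    client = documentai.DocumentProcessorServiceClient(
        client_options=ClientOptions(api_endpoint=api_endpoint(location))
    )
    return client.create_processor(
        parent=location_parent(project_id, location),
        processor=documentai.Processor(
            display_name=display_name,
            # Assumed type name for a custom extractor processor.
            type_="CUSTOM_EXTRACTION_PROCESSOR",
        ),
    )
```

A call such as `create_custom_extractor("my-project", "us", "my-custom-document-extractor")` would then mirror the console steps, with `my-project` standing in for a real project ID.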
Specify the fields to be extracted by the processor. Create fields such as control_number, employees_social_security_number, employer_identification_number, employers_name_address_and_zip_code, federal_income_tax_withheld, social_security_tax_withheld, social_security_wages, and wages_tips_other_compensation.
For each field, define attributes such as Data Type (e.g., Number, Money, Address) and Occurrence (e.g., Optional multiple, Required multiple).
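It can help to keep the field definitions in code so they can be reviewed before being entered in the console. The sketch below is only an illustration: the `FieldSpec` class is hypothetical, the allowed values are limited to the examples named above (the console offers more), and the Data Type and Occurrence assigned to each field are assumptions rather than values taken from this repository.

```python
from dataclasses import dataclass

# Only the example values mentioned in the text; extend as needed.
DATA_TYPES = {"Number", "Money", "Address"}
OCCURRENCES = {"Optional multiple", "Required multiple"}


@dataclass(frozen=True)
class FieldSpec:
    name: str
    data_type: str
    occurrence: str

    def __post_init__(self):
        # Reject values outside the known example sets early.
        if self.data_type not in DATA_TYPES:
            raise ValueError(f"unknown data type: {self.data_type}")
        if self.occurrence not in OCCURRENCES:
            raise ValueError(f"unknown occurrence: {self.occurrence}")

    @property
    def required(self) -> bool:
        return self.occurrence.startswith("Required")


# Per-field types and occurrences below are illustrative assumptions.
W2_SCHEMA = [
    FieldSpec("control_number", "Number", "Optional multiple"),
    FieldSpec("employees_social_security_number", "Number", "Required multiple"),
    FieldSpec("employer_identification_number", "Number", "Required multiple"),
    FieldSpec("employers_name_address_and_zip_code", "Address", "Required multiple"),
    FieldSpec("federal_income_tax_withheld", "Money", "Required multiple"),
    FieldSpec("social_security_tax_withheld", "Money", "Required multiple"),
    FieldSpec("social_security_wages", "Money", "Required multiple"),
    FieldSpec("wages_tips_other_compensation", "Money", "Required multiple"),
]

required_names = [field.name for field in W2_SCHEMA if field.required]
```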
Import sample documents from Cloud Storage to the labeling console.
Use tools such as Bounding box and Select text to annotate documents. Confirm the labels suggested by the foundation model, and manually label any fields it does not identify automatically.
Create a processor version using the pretrained foundation model for initial extraction. Name the version (e.g., w2-foundation-model).
Import additional documents, split data for training and testing, and use the foundation model to auto-label documents. Verify and manually correct auto-labeled documents.
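If you prefer to decide the training/test split yourself before importing, a reproducible shuffle is one simple way to do it. This is a sketch under stated assumptions: the function name, the 80/20 ratio, and the `gs://example-bucket/...` paths in the usage note are hypothetical and not part of the repository.

```python
import random


def split_train_test(doc_uris, test_fraction=0.2, seed=0):
    """Return (train, test) lists using a seeded, reproducible shuffle."""
    if not 0 < test_fraction < 1:
        raise ValueError("test_fraction must be between 0 and 1")
    # Sort first so the result does not depend on the input order.
    shuffled = sorted(doc_uris)
    random.Random(seed).shuffle(shuffled)
    # Keep at least one test document when any documents are given.
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]
```

For example, `split_train_test([f"gs://example-bucket/w2_{i}.pdf" for i in range(50)])` yields 40 training and 10 test URIs, and the same seed always yields the same split.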
Ensure at least 10 training and 10 test instances for each field.
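The minimum-instance rule can be checked before starting training. The sketch below assumes each labeled document has been reduced to a list of field names, one entry per labeled instance; that representation and the function names are assumptions for illustration, not an export format defined by Document AI.

```python
from collections import Counter

MIN_INSTANCES = 10  # minimum per field, in each of the two splits


def count_field_instances(documents):
    """Count labeled instances per field across a list of documents."""
    counts = Counter()
    for labels in documents:
        counts.update(labels)
    return counts


def underrepresented_fields(field_names, train_docs, test_docs, minimum=MIN_INSTANCES):
    """Map each field below the minimum to its (train, test) instance counts."""
    train_counts = count_field_instances(train_docs)
    test_counts = count_field_instances(test_docs)
    return {
        name: (train_counts[name], test_counts[name])
        for name in field_names
        if train_counts[name] < minimum or test_counts[name] < minimum
    }
```

An empty result means every field meets the threshold in both splits; otherwise the returned counts show where more labeled documents are needed.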
To train a custom model, name the new processor version (e.g., w2-custom-model) and select the Model based training method. Then start training, which can take several hours.
Deploy the trained processor version. Set it as the Default version or use the version ID for document processing.
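The choice between the default version and a specific version ID shows up in the resource name passed to the API. A small sketch of that naming scheme follows; the helper name and the example IDs in the usage note are hypothetical.

```python
def processor_resource_name(project_id, location, processor_id, version_id=None):
    """Build the resource name used in processing requests.

    Without a version ID the name targets the processor, which uses its
    default version; with one it targets that specific processor version.
    """
    base = f"projects/{project_id}/locations/{location}/processors/{processor_id}"
    if version_id is None:
        return base
    return f"{base}/processorVersions/{version_id}"
```

For instance, `processor_resource_name("my-project", "us", "abc123")` addresses the default version, while adding `version_id="def456"` pins requests to one trained version.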
Test the processor version using new documents not involved in training. Assess performance using metrics like F1 score, precision, and recall.
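For reference, these three metrics follow from counts of true positives, false positives, and false negatives. The sketch below uses the standard definitions; how Document AI matches predicted entities to ground truth when producing those counts is not covered here, and the function name is hypothetical.

```python
def precision_recall_f1(true_pos: int, false_pos: int, false_neg: int):
    """Compute precision, recall, and F1 from raw counts.

    Each metric falls back to 0.0 when its denominator is zero.
    """
    predicted = true_pos + false_pos
    actual = true_pos + false_neg
    precision = true_pos / predicted if predicted else 0.0
    recall = true_pos / actual if actual else 0.0
    total = precision + recall
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1
```

Precision drops when the model extracts values that are wrong, recall drops when it misses values that are present, and F1 balances the two.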
Manage the processor's versions the same way as for any other Document AI processor.
Use the Document AI API for online or batch processing. Follow the code samples for sending processing requests and handling responses.
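An online processing request might look like the following sketch. It assumes the `google-cloud-documentai` Python package and configured credentials; `process_file` and `group_entities` are hypothetical helper names, and `name` is the full resource name of the processor or processor version to call.

```python
import mimetypes


def group_entities(pairs):
    """Group (field_type, text) pairs into {field_type: [text, ...]}."""
    grouped = {}
    for field_type, text in pairs:
        grouped.setdefault(field_type, []).append(text)
    return grouped


def process_file(name: str, path: str, location: str):
    """Send one local file for online processing and return grouped entities."""
    # Imported lazily so group_entities can be used without the client library.
    from google.api_core.client_options import ClientOptions
    from google.cloud import documentai

    mime_type, _ = mimetypes.guess_type(path)
    if mime_type is None:
        raise ValueError(f"cannot determine MIME type of {path}")

    client = documentai.DocumentProcessorServiceClient(
        client_options=ClientOptions(
            api_endpoint=f"{location}-documentai.googleapis.com"
        )
    )
    with open(path, "rb") as handle:
        raw_document = documentai.RawDocument(
            content=handle.read(), mime_type=mime_type
        )
    request = documentai.ProcessRequest(name=name, raw_document=raw_document)
    result = client.process_document(request=request)

    # Each extracted entity carries its field name and the matched text.
    return group_entities(
        (entity.type_, entity.mention_text) for entity in result.document.entities
    )
```

Grouping by field name keeps multiple occurrences of the same field together, which matches fields defined with a "multiple" occurrence. Batch processing uses a different request type and is not shown here.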
The custom document extractor can reduce manual data entry and improve accuracy by automating the extraction of structured data from unstructured documents.
Users can define specific fields and attributes tailored to their document types, ensuring high relevance and precision.
Using foundation models and generative AI for auto-labeling streamlines the annotation process and saves labeling time.
Training custom models with sufficient data and evaluating them on held-out test documents supports reliable performance across diverse document processing scenarios.
The solution supports scalable deployment and management of multiple processor versions, catering to evolving business needs and document types.
Seamless integration with the Document AI API allows for flexible and scalable document processing in various applications and workflows.
This technical description and summary provide a comprehensive guide to building a custom document extractor, automating structured data extraction from various document types using Google Cloud's Document AI Workbench.