# Vertex AI Search Using Document AI Reference Architecture for Semantic Search and Q&A

* Author: docai-incubator@google.com

## Objective

This document describes a Semantic Search and Question-Answering (Q&A) **reference architecture** built on Vertex AI Search and **Document AI technologies**. The goal is to enable users to:

* **Extract** specific entities from documents, converting unstructured data into structured data for Retrieval-Augmented Search.
* **Uncover relevant information** from a corpus of documents through semantic understanding of search queries and document content.
* **Obtain answers to natural language questions** by leveraging Document AI for structured information extraction and Vertex AI Search for efficient retrieval and comprehension of related passages.

This document serves as a guide for organizations seeking to build a powerful search and Q&A experience using Google Cloud's advanced AI capabilities.


## Disclaimer

This tool is not supported by the Google engineering team or product team. It is provided and supported on a best-effort basis by the **DocAI Incubator Team**. No guarantees of performance are implied.

## Prerequisites

* Familiarity with the [Vertex AI Workbench Notebook](https://cloud.google.com/vertex-ai-notebooks?hl=en) and [Google Services](https://cloud.google.com/gcp?hl=en)

* Basic knowledge of [Google Cloud Storage](https://cloud.google.com/storage/?hl=en)

* Intermediate knowledge of [Document AI](https://cloud.google.com/document-ai/docs/overview)

* Intermediate knowledge of [Python scripting](https://www.python.org/)

* Intermediate knowledge of [Cloud Functions](https://cloud.google.com/functions?hl=en) and [Cloud Scheduler](https://cloud.google.com/scheduler/docs/schedule-run-cron-job)

* Necessary [IAM](https://cloud.google.com/security/products/iam) privileges for [GCS](https://cloud.google.com/storage/docs/access-control/iam-roles), [BigQuery](https://cloud.google.com/bigquery/docs/introduction) and [Vertex AI Search](https://cloud.google.com/generative-ai-app-builder/docs/access-control)

* Intermediate knowledge of [Vertex AI Search and Conversation](https://cloud.google.com/generative-ai-app-builder/docs/try-enterprise-search)

* The pipeline supports only the following file formats: <b>pdf, png, jpg, jpeg, tif, tiff</b>
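
The format restriction above can be checked before files are uploaded. The following is a minimal sketch, not the pipeline's own code: the function name `is_supported` and the case-insensitive comparison are assumptions made for illustration.

```python
from pathlib import PurePosixPath

# Extensions listed in the prerequisites above.
# Comparing them case-insensitively is an assumption of this sketch.
SUPPORTED_EXTENSIONS = {".pdf", ".png", ".jpg", ".jpeg", ".tif", ".tiff"}


def is_supported(file_name: str) -> bool:
    """Return True when the file extension is one of the listed formats."""
    return PurePosixPath(file_name).suffix.lower() in SUPPORTED_EXTENSIONS


if __name__ == "__main__":
    for name in ["invoices/scan_001.PDF", "notes.docx"]:
        print(name, is_supported(name))
```

Files that fail such a check would otherwise end up in the Failure folder described later in this document.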


## Technology Overview
<img src="./Images/Technology_Overview.png" width=1000 height=600 alt="Technology_Overview"></img>

The following Google Cloud services are utilized for this pipeline.

* Google Cloud Storage
* Google Cloud Functions
* Google Cloud Document AI
* Google Cloud Scheduler

The diagram above shows the overall architecture of the Vertex AI Semantic Search and Q&A solution. Each phase of the solution is explained below:
* <b>Document Ingestion & Services Initialization:</b> In this step, you run an initializer script that provisions the required GCP services: Storage bucket, DocAI, Cloud Functions, Scheduler and BigQuery. Your files are also uploaded to the Cloud Storage bucket for further processing. A shell script sets up the environment automatically.
    * Cloud Shell: Run the provided shell script here. This removes the need to set up each service manually.
    * Cloud Storage: This is the storage bucket where your files land.
<br>
<br>
* <b>Scheduled batch processing and conversion into Parsed, Canonical and Metadata JSON files:</b> The Cloud Function is triggered to process the files that landed in the Storage bucket in the previous step. The files are grouped in batches of 50 and processed concurrently by the Document AI processor. After the parsed JSON files are generated, the pipeline creates the canonical JSON and metadata JSONL files and then updates the BigQuery table with the file paths. The scheduler automates the repeated run of this entire workflow, from file ingestion to the BigQuery update.
<br>
<br>
* <b>Importing Metadata JSONL into Vertex AI Datastore:</b> In this step, the previously generated metadata is imported into the Vertex AI Datastore.
<br>
<br>
* <b>Faceted Search Tool :</b> This tool enhances user experience by providing a seamless search and navigation feature. It directly provides access to relevant documents through navigation links in the search results, thereby reducing the need for manual effort and intervention. The tool effectively addresses user queries related to documents, streamlining the process of finding and accessing the desired information.
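
The grouping of landed files into batches of 50, described in the batch processing phase above, can be sketched as follows. This is an illustrative sketch only: `make_batches` and the bucket name are hypothetical, and the actual Cloud Function may select and submit files differently.

```python
from typing import Iterator, List

BATCH_SIZE = 50  # group size stated in the phase description above


def make_batches(paths: List[str], size: int = BATCH_SIZE) -> Iterator[List[str]]:
    """Yield consecutive groups of at most `size` file paths."""
    for start in range(0, len(paths), size):
        yield paths[start:start + size]


if __name__ == "__main__":
    # Hypothetical input bucket with 120 landed files.
    files = [f"gs://input-bucket/doc_{i}.pdf" for i in range(120)]
    print([len(batch) for batch in make_batches(files)])  # [50, 50, 20]
```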

## Reference Architecture Workflow diagram
<img src='./Images/Reference_Architecture.png' width=1500 height=1000 alt="Reference_Architecture"></img>

The diagram illustrates the architectural components of the system, with each step explained in detail: 

1. Upload the input documents to the Google Cloud Storage bucket. Before doing so, make sure the one-time initializer script has been run; it provisions the required GCP services: Storage bucket, DocAI, Cloud Functions, Scheduler and BigQuery.
2. Cloud Function 1 is triggered to batch-process the input files from the Storage bucket of the previous step. Each trigger processes only a group of 50 files; the remaining files are handled in the next run. The batch passes the files through the [DocAI extraction parser](https://cloud.google.com/document-ai/docs/processors-list#processor_cde), which outputs proto JSON files. Cloud Scheduler automates the repeated run of this entire process.
      * <b>Cloud Functions gen2 :</b> This service runs the Python script in which the batch processing logic is implemented.
      * <b>Cloud Storage :</b> A temporary directory is created to track every processing activity.
      * <b>Cloud scheduler :</b> Ensures the Cloud Functions are triggered automatically every hour.
      * <b>BigQuery:</b> Tracks the processing of input files in a table with the following columns:
          * <b>Input file name and Input file path :</b> The name and the path of the input file.
          * <b>Proto json file path :</b> The path of the proto JSON file.
          * <b>Canonical json path :</b> The path of the canonical JSON file.
          * <b>Jsonl path :</b> The path of the metadata JSONL file.
          * <b>Status of the document :</b> Whether processing of the file succeeded or failed.
      * The canonical bucket's **"batch_status"** folder has three subfolders: “Success”, “Failure” and “Existing”. After a batch run, each input file is moved into one of them:
          * If a file is parsed successfully, it is moved from the input folder to the canonical bucket’s batch_status/Success folder.
          * If a file fails, there are two possible causes: the file has an unsupported extension or format, or the batch process itself was unsuccessful. In both cases, the file is moved from the input bucket to the canonical bucket’s batch_status/Failure folder.
          * If a file already exists in BigQuery, it is moved from the input folder to the canonical bucket’s batch_status/Existing folder.
<img src="./Images/batch_status.png" width=400 height=1000 alt="batch_status"></img>
3. The DocAI JSON files are then post-processed to generate (Vertex Search) canonical JSON files, which adhere to the predefined Vertex Search JSON format. These canonical files are stored in the canonical bucket’s **“canonical_object”** folder.
4. The pipeline then generates the metadata JSONL file, which contains the extracted entity information along with the source and canonical file paths, in a [supported NDJSON format](https://cloud.google.com/generative-ai-app-builder/docs/prepare-data#storage-unstructured). These metadata JSONL files are stored in the canonical bucket’s **“jsonl_metadata”** folder.
5. When a new metadata file is created, Cloud Function 2 is triggered to upload the metadata file to the Vertex AI Datastore. After the parsed JSON, canonical JSON and metadata JSONL files are created, BigQuery is updated with these file paths and the status of the file. The **“vais_status”** folder has two subfolders: **“success”** and **“failed”**.
      * If the data in the metadata file is imported into the Vertex AI Datastore successfully, that metadata JSONL file is moved to the canonical bucket’s “vais_status/success” folder.
      * If the import fails, that metadata JSONL file is moved to the canonical bucket’s “vais_status/failed” folder.
<img src="./Images/vais_status.png" width=400 height=1000 alt="vais_status"></img>
6. Vertex AI Datastore is integrated within the supplied web application. This integration will enable faceted and semantic search capabilities for user queries. Users will receive pertinent and interconnected information tailored to their search queries.
7. Whenever a new file is added to the input folder, steps 2 to 5 above are triggered again by Cloud Scheduler.
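
The routing of input files into the batch_status subfolders described in step 2 can be summarized as a small decision function. This is a sketch under stated assumptions, not the pipeline's implementation: the function name `status_folder`, its parameters, and the order in which the three conditions are checked are all hypothetical.

```python
from pathlib import PurePosixPath

# Formats the pipeline accepts, as listed in the prerequisites.
SUPPORTED_EXTENSIONS = {".pdf", ".png", ".jpg", ".jpeg", ".tif", ".tiff"}


def status_folder(file_name: str, already_in_bigquery: bool, batch_succeeded: bool) -> str:
    """Pick the batch_status subfolder for one input file.

    Assumption: the "Existing" check runs first, then the extension check,
    then the outcome of the batch process.
    """
    if already_in_bigquery:
        return "batch_status/Existing"
    if PurePosixPath(file_name).suffix.lower() not in SUPPORTED_EXTENSIONS:
        return "batch_status/Failure"
    return "batch_status/Success" if batch_succeeded else "batch_status/Failure"


if __name__ == "__main__":
    print(status_folder("invoice.pdf", already_in_bigquery=False, batch_succeeded=True))
    print(status_folder("invoice.docx", already_in_bigquery=False, batch_succeeded=True))
    print(status_folder("invoice.pdf", already_in_bigquery=True, batch_succeeded=True))
```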

## GCP Service Configurations
<table>
    <tr>
        <td>
        <b>GCP Service</b>
        </td>
        <td>
            <b>Configuration</b>
        </td>
    </tr>
    <tr>
        <td>Google Cloud Storage</td>
        <td>Storage Class : Standard</td>
    </tr>
    <tr>
        <td>Google Cloud Functions (gen2)</td>
        <td>Memory allocated : 4 GiB
            <br>
            CPU : 2000m
            <br>
            Timeout : 3,600 seconds
            <br>
            Minimum instances : 1
            <br>
            Maximum instances : 100
            <br>
            Concurrency : 100
        </td>
    </tr>
    <tr>
        <td>Google Cloud Functions (gen2) with Trigger</td>
        <td>Memory allocated : 1 GiB
            <br>
            CPU : 1000m
            <br>
            Timeout : 540 seconds
            <br>
            Minimum instances : 1
            <br>
            Maximum instances : 100
            <br>
            Concurrency : 100
        </td>
    </tr>
    <tr>
        <td>Google Cloud Scheduler</td>
        <td>Frequency : Every 1 hour
            <br>
            Target URL : Cloud Function
        </td>
    </tr>
</table>

## Deployment overview
The installer script runs in the Cloud Shell environment.
It automates the following steps so that you do not have to set up the environment for this application manually.
* Initialize the following Google Cloud services with the help of a shell script: [Storage](https://cloud.google.com/storage/docs/creating-buckets#storage-create-bucket-cli), [Cloud Function](https://cloud.google.com/functions/docs/create-deploy-gcloud), [BigQuery](https://cloud.google.com/bigquery/docs), [Cloud Scheduler](https://cloud.google.com/scheduler/docs/creating#gcloud_1) and [DocAI Processor](https://cloud.google.com/document-ai/docs/create-processor#documentai_create_processor-drest)
* Create, configure and deploy Cloud Functions
* Create Cloud Storage buckets
* Create and update the BigQuery table (the Python script builds this).
* Run the process automatically.

The automated deployment is described in the following brief note about the installer:

The installer is a .tar file containing:

* Shell Script files
* Python Scripts (main.py)
* Requirements.txt files

Make sure the following parameters are assigned proper values in input.sh before executing the deployer.sh shell script.

* <b>bq_project_name :</b> The BigQuery project name.
* <b>bq_dataset_name :</b> The BigQuery dataset name.
* <b>bq_table_name :</b> The BigQuery table name.
* <b>input_bucket_path :</b> The path of the input files.
* <b>docai_parsed_json_bucket_path :</b> The path where the proto JSON files are to be stored.
* <b>canonical_json_bucket_path :</b> The path where the canonical JSON files are to be stored.
* <b>jsonl_bucket_path :</b> The path where the metadata JSONL files are to be stored.
* <b>project_id :</b> The project ID of the project.
* <b>docai_location :</b> The location of the processor.
* <b>docai_processor_id :</b> The processor ID of the processor.
* <b>docai_processor_version_id :</b> The version ID of the deployed processor version.
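
Before running deployer.sh, it can help to sanity-check the values listed above. The sketch below is not part of the installer: `find_problems` is a hypothetical helper, and the only rules it applies are that every parameter is non-empty and that bucket paths start with `gs://`, as the comments in the input.sh example require.

```python
from typing import Dict, List

# Parameter names taken from the list above.
REQUIRED_KEYS = (
    "bq_project_name", "bq_dataset_name", "bq_table_name",
    "input_bucket_path", "docai_parsed_json_bucket_path",
    "canonical_json_bucket_path", "jsonl_bucket_path",
    "project_id", "docai_location", "docai_processor_id",
    "docai_processor_version_id",
)
# Parameters that hold Cloud Storage paths.
BUCKET_KEYS = tuple(key for key in REQUIRED_KEYS if key.endswith("_bucket_path"))


def find_problems(settings: Dict[str, str]) -> List[str]:
    """Return human-readable problems; an empty list means the checks passed."""
    problems = []
    for key in REQUIRED_KEYS:
        if not settings.get(key, "").strip():
            problems.append(f"{key} is missing or empty")
    for key in BUCKET_KEYS:
        value = settings.get(key, "").strip()
        if value and not value.startswith("gs://"):
            problems.append(f"{key} must start with gs://")
    return problems


if __name__ == "__main__":
    # Hypothetical values: one bucket path lacks the gs:// prefix.
    example = {key: "example-value" for key in REQUIRED_KEYS}
    example.update({key: "gs://example-bucket" for key in BUCKET_KEYS})
    example["jsonl_bucket_path"] = "example-bucket"
    print(find_problems(example))
```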

This creates, configures and deploys the necessary services:
* Cloud Storage Buckets
* Invoice Processor
* Cloud Functions gen2
* Bigquery
* Cloud Scheduler

**NOTE** : This is a one-time installation activity. The automated deployment steps are described in detail later in this document.


## Service Account details
<img src="./Images/service_account_note.png" width=1000 height=500 alt='service_account_note'></img>
* Create Service Accounts
* Owner
* Project IAM Admin
* Project Viewer IAM
* Enable/Disable Service Accounts

You need the Project IAM Admin role granted to your account. This role lets you grant the roles listed below to the service account that gets created.
The created and deployed service account needs the following roles so that each cloud service can function:
* Cloud function developer
* Run invoker
* Document AI Editor
* Datastore User
* Storage Admin
* Service Account User
* Service Account Admin
* Bigquery Data Editor
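
The roles above could be granted with `gcloud projects add-iam-policy-binding`. The sketch below only builds the command strings and does not run them. The mapping from the role names in the list to IAM role IDs is an assumption that should be verified against the IAM documentation, and the project ID and service account email are hypothetical.

```python
from typing import Dict, List

# Assumed mapping from the role names listed above to IAM role IDs.
ROLE_IDS: Dict[str, str] = {
    "Cloud function developer": "roles/cloudfunctions.developer",
    "Run invoker": "roles/run.invoker",
    "Document AI Editor": "roles/documentai.editor",
    "Datastore User": "roles/datastore.user",
    "Storage Admin": "roles/storage.admin",
    "Service Account User": "roles/iam.serviceAccountUser",
    "Service Account Admin": "roles/iam.serviceAccountAdmin",
    "Bigquery Data Editor": "roles/bigquery.dataEditor",
}


def binding_commands(project_id: str, service_account_email: str) -> List[str]:
    """Build one gcloud command per role; nothing is executed here."""
    member = f"serviceAccount:{service_account_email}"
    return [
        f"gcloud projects add-iam-policy-binding {project_id} "
        f'--member="{member}" --role="{role_id}"'
        for role_id in ROLE_IDS.values()
    ]


if __name__ == "__main__":
    # Hypothetical project and service account.
    for command in binding_commands(
        "my-project", "pipeline-sa@my-project.iam.gserviceaccount.com"
    ):
        print(command)
```

Printing the commands first makes it possible to review each binding before applying it.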


## Automated deployment steps
Note that this is a one-time setup: if the services are deployed successfully in this attempt, there is no need to run the installer script again. Upon successful deployment, you can open the input storage in the GCP console and place any new documents in the previously deployed input bucket. Cloud Scheduler runs every 1 hour [(configurable)](https://cloud.google.com/scheduler/docs/creating#edit), and files newly added to the input GCS bucket are picked up and processed.

Follow the steps below to run the one-time installer script that sets up the pipeline:
* Before you begin, make sure you have already created and trained a DocAI processor; the following steps depend on it.
* Log in to the GCP console with your Google Workspace account and open Cloud Shell as shown. Click the Cloud Shell icon (1); a Cloud Shell session opens at the bottom of the page (2).
<img src="./Images/Cloud Shell Image.png" width=1800 height=800 alt="Cloud Shell Image"></img>
* Download the DocAI_vertexAI_Pipeline folder and VertexAI_init.sh to your local computer. 
There are two files:
<img src="./Images/download_and_zip.png" width=800 height=50 alt="download_and_zip"></img>
* Now, compress the downloaded DocAI_vertexAI_Pipeline folder into a zip file named DocAI_vertexAI_Pipeline.zip. You can use any tool or library you prefer.

* Next, in the open Cloud Shell terminal, find the option to upload files. Uploaded files are placed in your present working directory in Cloud Shell.
<img src="./Images/Uploading files into shell.png" width=1800 height=800 alt="Uploading files into shell"></img>
<table>
    <tr>
        <td><b>Choose files</b></td>
        <td><b>Choose and Upload files</b></td>
    </tr>
    <tr>
        <td><img src="./Images/Choose files.png" width=1800 height=400 alt="Choose files"></img>
    </td>
        <td><img src="./Images/upload_downloaded_files.png" width=1800 height=400 alt="upload_downloaded_files"></img>
    </td>
    </tr>
</table>

* Use the ‘cd’ command to change into the folder and the ‘ls’ command to view the two uploaded files.
    * DocAI_vertexAI_Pipeline.zip file : This compressed .zip file contains the scripts and files needed to start and set up the project.
    * VertexAI_init.sh file : This shell script extracts the DocAI_vertexAI_Pipeline.zip file and creates a folder called ‘DocAI_vertexAI_Pipeline/’ in the same directory.
<img src="./Images/Listing_files.png" width=1800 height=400 alt="Listing files"></img>
* Use the change mode (chmod) command to make the VertexAI_init.sh shell script executable.
<img src="./Images/Granting_Permissions.png" width=1800 height=200 alt="Granting Permissions"></img>
* Run the VertexAI_init.sh shell script; it produces the folder shown in the image below.
<img src="./Images/Run_VertexAI_Init.png" width=1800 height=200 alt="Run_VertexAI_Init"></img>
* Change directory into the DocAI_vertexAI_Pipeline/ folder, where the 5 files shown below can be seen.
<img src="./Images/change_directory.png" width=1800 height=200 alt="change_directory"></img>
* Input.sh : 
This is an input config file. Provide the project details in this file and save. Here is an example of project details input :

    BQ_PROJECT_NAME="BigQuery project Name"

    BQ_DATASET_NAME="Bigquery Dataset Name" #existing one or mention a new name if required to create a new dataset

    BQ_TABLE_NAME="Name of the table" #existing one or mention a new name if required to create a new table

    PDF_INPUT_BUCKET_PATH="gs://bucket_name" #include gs://

    DOCAI_PARSED_JSON_BUCKET_PATH="gs://parsed_json_bucket" #include gs://

    CANONICAL_JSON_BUKCET_PATH="gs://canonical_jsons_bucket" #include gs://

    JSONL_BUKCET_PATH="gs://vertexai_metadata_jsonl_bucket" #include gs://

    PROJECT_ID="project id of the processor "

    PROJECT_NUMBER="project number"

    DOCAI_LOCATION="location of the processor" # eg: us or eu

    DOCAI_PROCESSOR_ID="processor id"

    DOCAI_PROCESSOR_VERSION_ID="v2.0-2023-12-06" # Here is an example

    PDF_TO_JSONL_CF="pdf_to_jsonl_cf" #Cloud Function Name 

    PDF_TO_JSONL_CF_CS="pdf_to_jsonl_cf_cs" #Cloud Scheduler Name

    JSONL_TO_VAIS_CF="jsonl_to_vais_cf" #Cloud Function Name with trigger on JSONL_BUKCET_PATH

    VAIS_DATASTORE_ID="pdf-json_1699996276496"

    SERVICE_ACCOUNT_NAME="dwh-uim-service-account@project-name.iam.gserviceaccount.com" # Here is an example of a service account.

* deployer.sh :
    * This is a shell script that automatically configures and deploys necessary cloud services. More about this in the following points.
* pdf_to_jsonl_cf : 
    * This folder contains a Python file and "requirements.txt". These convert PDFs into structured JSON files, including the canonical JSON and the metadata in JSONL format, whose details are subsequently updated in BigQuery.
* Jsonl_to_vais_cf : 
    * This folder contains a Python file and "requirements.txt". These import the metadata JSONL files into the Vertex AI Datastore through the Cloud Function.
    
* <b>Use the change mode command to provide permission before executing the deployer.sh file.</b>
<img src="./Images/premission_deployer.png" width=1800 height=400 alt="premission_deployer"></img>

* Run the shell script; the output will be as follows:
    * Run **“./deployer.sh”** from the current DocAI_vertexAI_Pipeline directory.

<br>
<img src="./Images/deployer_1.png" width=1800 height=500 alt='deployer_output_1'></img>
<img src="./Images/deployer_2.png" width=1800 height=500 alt='deployer_output_2'></img>

* After execution completes, a URL is shown in the console. Open the URL to run search queries.

The automated deployment takes 3 to 5 minutes to set up and configure the Google Cloud services. Once completed, these services can be viewed individually in the console. The following services are deployed:
* Service Account with the name defined when the automated deployment script was first run.
* Document AI Invoice Processor.
* Cloud function gen2 with python script and its dependencies configured. 
* Cloud scheduler is set up to trigger the cloud function on an hourly basis.
* Bigquery is updated on every run if any new files are added to the input bucket.
* Storage bucket created for following:
    * Input
    * Parsed JSON, Canonical JSON, Metadata JSONL Buckets.
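
Each line written to the Metadata JSONL bucket is one JSON record. The sketch below builds such a line using the `id` / `structData` / `content` layout that the Vertex AI Search data preparation guide describes for unstructured documents with metadata. It is an assumption that the pipeline uses exactly this layout; the helper name `metadata_line`, the entity fields, the bucket path, and the MIME type are hypothetical.

```python
import json
from typing import Any, Dict


def metadata_line(doc_id: str, entities: Dict[str, Any], content_uri: str,
                  mime_type: str = "application/pdf") -> str:
    """Serialize one document's metadata as a single JSONL line."""
    record = {
        "id": doc_id,
        "structData": entities,  # extracted entities (hypothetical fields)
        "content": {"mimeType": mime_type, "uri": content_uri},
    }
    return json.dumps(record)


if __name__ == "__main__":
    line = metadata_line(
        "invoice-0001",
        {"supplier_name": "Example Supplier", "total_amount": "120.50"},
        "gs://example-input-bucket/invoice-0001.pdf",
    )
    print(line)
```

Because every record is serialized onto a single line, records for several documents can be appended to one .jsonl file.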


## User Interface

* Use the [Simple Demo Web Application code](https://drive.google.com/file/d/1I7pia3r-bfIb_2sWVfc9dB5kDpn_7gBw/view?usp=drive_link) (written in HTML, CSS and JavaScript) to get started. Don’t forget to replace the PROJECT_NUMBER and DATASTORE_ID in the web app code mentioned above.

* The deployer.sh script updates the project_number and datastore_id values automatically.
<img src='./Images/deployer_updating_automatic.png' width=1800 height=400 alt="deployer_updating_automatic"></img>
* Vertex AI Demo Search Tool Image
<img src='./Images/demo_search_tool.png' width=1800 height=400 alt="demo_search_tool"></img>

* Configure different [serving controls](https://cloud.google.com/generative-ai-app-builder/docs/configure-serving-controls) such as Boosting, Property Filter, Synonyms, etc., to change the default behavior of a search request.

* This custom web app, built on the REST API, lets you restrict the GenAI/Semantic search to a specific set of documents using [metadata filtering](https://cloud.google.com/generative-ai-app-builder/docs/filter-search-metadata).

* You can customize or build your own UI on top of the Search API request with the required [fields](https://cloud.google.com/generative-ai-app-builder/docs/reference/rest/v1alpha/projects.locations.collections.dataStores.servingConfigs/search#request-body). This gives you the flexibility to add features that let end users control the results (such as GenAI Search, Faceted Search, user login and a few others) and to support a variety of search use cases that may not be possible with the widget.
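
A custom UI ultimately sends a search request to the datastore's serving config. The sketch below only assembles the URL and JSON body for such a request and sends nothing. It assumes the `global` location, the `default_collection` collection and the `default_search` serving config; the filter field `supplier_name` is hypothetical, and the helper names are made up for illustration. Authentication (for example an OAuth access token header) is left out.

```python
import json
from typing import Any, Dict, Optional

API_ROOT = "https://discoveryengine.googleapis.com/v1alpha"


def search_url(project_number: str, datastore_id: str, location: str = "global") -> str:
    """Build the search endpoint URL for a datastore's default serving config."""
    return (
        f"{API_ROOT}/projects/{project_number}/locations/{location}"
        f"/collections/default_collection/dataStores/{datastore_id}"
        "/servingConfigs/default_search:search"
    )


def search_body(query: str, page_size: int = 10,
                metadata_filter: Optional[str] = None) -> Dict[str, Any]:
    """Build the request body; the filter is included only when given."""
    body: Dict[str, Any] = {"query": query, "pageSize": page_size}
    if metadata_filter:
        body["filter"] = metadata_filter
    return body


if __name__ == "__main__":
    # Hypothetical project number; datastore ID taken from the input.sh example.
    print(search_url("123456789", "pdf-json_1699996276496"))
    print(json.dumps(search_body("total amount due", 5, 'supplier_name: ANY("Example Supplier")')))
```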

## References

* [Vertex AI Search and Conversation](https://cloud.google.com/generative-ai-app-builder/docs/introduction)
* [Faceted_Regulated_GenAI_Search.html](https://drive.google.com/file/d/1I7pia3r-bfIb_2sWVfc9dB5kDpn_7gBw/view)
* [Document AI Overview](https://cloud.google.com/document-ai/docs/overview)
* https://screencast.googleplex.com/cast/NTg2NzI2MjI2NjMxMDY1NnxjYjQzOGVjZi0yNQ