# Importing processor and evaluating with alternate test sets


* Author: docai-incubator@google.com

## Disclaimer

This tool is not supported by the Google engineering team or product team. It is provided and supported on a best-effort basis by the DocAI Incubator Team. No guarantees of performance are implied. 


## Objective

This document explains how to import a processor version into a new processor and evaluate the imported version against alternate test sets.

* Alternate test sets are labeled documents other than those used to train the processor or those configured as the test split in the UI.
* The notebook reports precision, recall, and F1 score for the given set of files.
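
For reference, the three scores relate to raw match counts as sketched below. This is only an illustration: the function name `precision_recall_f1` and the counts are hypothetical and are not part of the Document AI API.

```python
def precision_recall_f1(true_positives: int, false_positives: int, false_negatives: int):
    """Return (precision, recall, f1) from raw counts, using 0.0 where a score is undefined."""
    predicted = true_positives + false_positives  # everything the processor predicted
    actual = true_positives + false_negatives  # everything present in the ground truth
    precision = true_positives / predicted if predicted else 0.0
    recall = true_positives / actual if actual else 0.0
    denominator = precision + recall
    # F1 is the harmonic mean of precision and recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return precision, recall, f1


# Hypothetical counts, for illustration only
precision, recall, f1 = precision_recall_f1(
    true_positives=8, false_positives=2, false_negatives=2
)
print(precision, recall, f1)
```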


## Prerequisites

* Vertex AI Notebook or Colab (if using Colab, authenticate first)
* Details of the processor to import (processor ID, version ID, and location)
* Permissions for Google Cloud Storage and the Vertex AI Notebook
* GCS path where the labeled documents are stored

### IAM roles for the service account linked to the Vertex AI notebook

* Document AI Editor
* Storage Admin
* Vertex AI Service Agent

Also grant the `Document AI Editor` role to the `Document AI Service Agent`.

### Download utilities module

In [None]:
!pip install google-cloud-storage
!pip install google-cloud-documentai
!pip install tqdm

In [None]:
!wget https://raw.githubusercontent.com/GoogleCloudPlatform/document-ai-samples/main/incubator-tools/best-practices/utilities/utilities.py

## Step by Step procedure

## 1. Create a DocAI processor

### Input for creating processor


* `project_number` : Your project number
* `location` : Location in which the processor is to be created
* `new_processor_display_name` : Display name of the new processor
* `processor_type` : Type of the processor to be created

In [None]:
# Input for creating a new processor
project_number = "xxxxxxxxxx"  # project number
location = "us"  # location in which the processor is to be created
new_processor_display_name = "test_processor_api5"  # display name of the new processor
processor_type = "CUSTOM_EXTRACTION_PROCESSOR"  # type of processor to be created

### Function to create new processor

In [None]:
from google.cloud import documentai_v1beta3


# Function to create a new processor
def sample_create_processor(
    project_number, location, new_processor_display_name, processor_type
):
    """
    Create a Document AI processor.

    Args:
        project_number (str): The Google Cloud project number.
        location (str): The location where the processor will be created.
        new_processor_display_name (str): The display name for the new processor.
        processor_type (str): The type of the processor.

    Returns:
        documentai.Processor: The created Document AI processor.
    """

    # Create a client
    client = documentai_v1beta3.DocumentProcessorServiceClient()

    # Initialize request argument(s)
    request = documentai_v1beta3.CreateProcessorRequest(
        parent=f"projects/{project_number}/locations/{location}",
        processor=documentai_v1beta3.Processor(
            display_name=f"{new_processor_display_name}", type_=f"{processor_type}"
        ),
    )

    # Make the request
    response = client.create_processor(request=request)

    # Handle the response
    print(response)

    return response


# calling  function
response_new = sample_create_processor(
    project_number, location, new_processor_display_name, processor_type
)

## 2. Importing processor
* The code below imports a trained processor version into the new processor.

### Input for importing processor

* `project_number` : Your project number
* `source_processor_id` : ID of the processor to import from
* `source_processor_version_id` : ID of the processor version to import
* `source_processor_location` : Location of the source processor
* `new_processor_id` : ID of the destination processor; use `response_new.name.split('/')[-1]` from step 1, or enter the processor ID if known
* `new_processor_location` : Location of the processor to import into
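
The `split('/')[-1]` shortcut works when the resource name ends with the processor ID. A slightly more defensive sketch is shown below; the helper name `processor_id_from_name` and the sample resource names are hypothetical.

```python
def processor_id_from_name(resource_name: str) -> str:
    """Return the segment that follows 'processors' in a Document AI resource name."""
    parts = resource_name.strip("/").split("/")
    if "processors" not in parts:
        raise ValueError(f"Not a processor resource name: {resource_name!r}")
    index = parts.index("processors") + 1
    if index >= len(parts):
        raise ValueError(f"Processor ID is missing in: {resource_name!r}")
    return parts[index]


# Hypothetical resource names, for illustration only
print(processor_id_from_name("projects/123/locations/us/processors/abc123"))
print(
    processor_id_from_name(
        "projects/123/locations/us/processors/abc123/processorVersions/v1"
    )
)
```

Unlike `split('/')[-1]`, this also returns the processor ID when it is given a processor version name.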


In [None]:
# Input for importing a processor version from another processor
project_number = "xxxxxxxxxx"  # project number
source_processor_id = (
    "xxxxxxxxxxxxxxxxx"  # ID of the processor to import from
)
source_processor_version_id = (
    "xxxxxxxxx"  # ID of the processor version to import
)
source_processor_location = "us"  # location of the source processor
new_processor_id = "xxxxxxxxxxxx"  # response_new.name.split('/')[-1] from step 1, or the processor ID if known
new_processor_location = "us"  # location of the processor to import into

### Function to import processor

In [None]:
import google.cloud.documentai_v1beta3 as documentai
from google.api_core import operation


# Function to import a processor version
def import_processor(
    project_number: str,
    new_processor_location: str,
    new_processor_id: str,
    source_processor_location: str,
    source_processor_id: str,
    source_processor_version_id: str,
) -> operation.Operation:
    """
    Import a Document AI processor version from a source processor.

    Args:
        project_number (str): The Google Cloud project number.
        new_processor_location (str): The location where the new processor is located.
        new_processor_id (str): The ID of the new processor.
        source_processor_location (str): The location where the source processor is located.
        source_processor_id (str): The ID of the source processor.
        source_processor_version_id (str): The ID of the source processor version.

    Returns:
        google.api_core.operation.Operation: The long-running import operation.
        Its `metadata` reports the current state of the import.
    """

    client = documentai.DocumentProcessorServiceClient()

    # Destination processor, in the format
    # 'projects/{project_number}/locations/{location}/processors/{new_processor_id}'
    new_processor_name = f"projects/{project_number}/locations/{new_processor_location}/processors/{new_processor_id}"

    # Source processor version to copy
    source_version = f"projects/{project_number}/locations/{source_processor_location}/processors/{source_processor_id}/processorVersions/{source_processor_version_id}"

    op_import_version_req = documentai.ImportProcessorVersionRequest(
        processor_version_source=source_version, parent=new_processor_name
    )

    # Start the import; this returns a long-running operation
    op_import_version = client.import_processor_version(request=op_import_version_req)

    return op_import_version


# calling function
op_import_version = import_processor(
    project_number,
    new_processor_location,
    new_processor_id,
    source_processor_location,
    source_processor_id,
    source_processor_version_id,
)

In [None]:
op_import_version.metadata.common_metadata.state
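
The state shown above only reflects the moment it is read. The sketch below waits for the import to finish and then reads the imported version's resource name. It assumes `op_import_version` is the long-running operation returned in the previous cell and that `metadata.common_metadata.resource` holds the version name, as the notebook's own comments suggest. The helper name `wait_for_imported_version` and the timeout value are hypothetical.

```python
def wait_for_imported_version(operation, timeout_seconds: float = 3600.0) -> str:
    """Block until the import operation finishes, then return the imported version name."""
    # result() raises if the operation fails or the timeout elapses
    operation.result(timeout=timeout_seconds)
    # Assumption: the operation metadata carries the full resource name of the new version
    return operation.metadata.common_metadata.resource


# Usage in this notebook (uncomment after running the import cell):
# new_version_name = wait_for_imported_version(op_import_version)
# new_processor_version_id = new_version_name.split("/")[-1]
```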

## 3. Adding a dataset to the processor
* The code below attaches a Cloud Storage location to the new processor as its dataset. The bucket must already exist.



* sample `op_import_version.metadata`


<img src="./Images/sample_import_version_metadata.png" width=800 height=400></img>

### Input for adding dataset to a processor

In [None]:
# new_dataset_bucket must already exist
project_number = "xxxxxxxxxxx"
new_dataset_bucket = "gs://xxxxxxxx"
new_processor_location = "us"
new_processor_id = "xxxxxxxxxxxxxxxxxx"  # response_new.name.split('/')[-1] from step 1, or the processor ID if known
new_processor_version_id = (
    "xxxxxxxxxxxxxxx"  # ID of the imported processor version (deployed below)
)
# The full resource name of the imported version is also available via
# new_version_processor_details = op_import_version.metadata.common_metadata.resource
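
Since the dataset bucket has to exist beforehand, it may help to validate the URI and check the bucket before attaching it. The following is a minimal sketch: the helper names `split_gcs_uri` and `bucket_exists` are hypothetical, and the check relies on `Bucket.exists()` from `google-cloud-storage`.

```python
def split_gcs_uri(uri: str):
    """Split 'gs://bucket/prefix' into (bucket, prefix); the prefix may be empty."""
    scheme = "gs://"
    if not uri.startswith(scheme):
        raise ValueError(f"Expected a URI starting with gs://, got {uri!r}")
    bucket, _, prefix = uri[len(scheme):].partition("/")
    if not bucket:
        raise ValueError(f"Bucket name is missing in {uri!r}")
    return bucket, prefix


def bucket_exists(storage_client, uri: str) -> bool:
    """Return True if the bucket named in the URI exists and is visible to the caller."""
    bucket_name, _ = split_gcs_uri(uri)
    return storage_client.bucket(bucket_name).exists()


# Usage in this notebook (requires credentials, so it is left commented out):
# from google.cloud import storage
# print(bucket_exists(storage.Client(), new_dataset_bucket))
```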

## Deploy Processor

In [None]:
from google.cloud import documentai_v1beta3


def sample_deploy_processor_version(
    project_number, new_processor_location, new_processor_id, new_processor_version_id
):
    # Create a client
    client = documentai_v1beta3.DocumentProcessorServiceClient()

    # Initialize request argument(s)
    request = documentai_v1beta3.DeployProcessorVersionRequest(
        name=f"projects/{project_number}/locations/{new_processor_location}/processors/{new_processor_id}/processorVersions/{new_processor_version_id}",
    )

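    try:
        # Make the request
        operation = client.deploy_processor_version(request=request)
        print("Waiting for operation to complete...")
        # Block until the deployment finishes
        operation.result()
        # Print the operation metadata, which reports the final state
        print(operation.metadata)

    except Exception as e:
        # Not every exception has a `.message` attribute, so print the exception itself
        print(e)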


sample_deploy_processor_version(
    project_number, new_processor_location, new_processor_id, new_processor_version_id
)

### Function to create and add dataset to a processor

In [None]:
from google.cloud import storage
from tqdm.auto import tqdm
from google.cloud import documentai_v1beta3

new_processor_name = f"projects/{project_number}/locations/{new_processor_location}/processors/{new_processor_id}"


# function to add a dataset into processor
def add_processor_dataset(
    processor_name: str, dataset_gcs_uri: str, project_id: str, location: str
):
    """
    Add a dataset to a Document AI processor.

    Args:
        processor_name (str): The name of the Document AI processor.
        dataset_gcs_uri (str): The URI of the Google Cloud Storage bucket for the dataset.
        project_id (str): The Google Cloud project ID.
        location (str): The location of the processor.
    """
    # Create a client
    client = documentai_v1beta3.DocumentServiceClient()

    # Initialize request argument(s)
    dataset = documentai_v1beta3.Dataset(
        {
            "name": f"{processor_name}/dataset",
            "gcs_managed_config": {"gcs_prefix": {"gcs_uri_prefix": dataset_gcs_uri}},
            "spanner_indexing_config": {},
        }
    )

    request = documentai_v1beta3.UpdateDatasetRequest(dataset=dataset)

    try:
        # Make the request
        operation = client.update_dataset(request=request)

        response = operation.result()

    except Exception as e:
        # Not every exception has a `.message` attribute, so print the exception itself
        print(e)


# calling function
add_processor_dataset(
    new_processor_name, new_dataset_bucket, project_number, source_processor_location
)

## 4. Evaluating processor version with additional test sets


## NOTE
**Before running the evaluation, set the imported (trained) version as the default version of the processor.**
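
The default version can be set in the UI, or programmatically as sketched below. This is a minimal example, not the notebook's original method: the helper name `set_default_version` and the timeout are hypothetical, and it assumes a `DocumentProcessorServiceClient` is passed in as `client`.

```python
def set_default_version(
    client,
    project_id: str,
    location: str,
    processor_id: str,
    processor_version_id: str,
    timeout_seconds: float = 600.0,
) -> str:
    """Set the given processor version as the processor's default and return its name."""
    processor = f"projects/{project_id}/locations/{location}/processors/{processor_id}"
    request = {
        "processor": processor,
        "default_processor_version": f"{processor}/processorVersions/{processor_version_id}",
    }
    # Long-running operation; result() blocks until it finishes or raises on failure
    operation = client.set_default_processor_version(request=request)
    operation.result(timeout=timeout_seconds)
    return request["default_processor_version"]


# Usage in this notebook (run after the input cell below; requires credentials):
# from google.cloud import documentai_v1beta3
# client = documentai_v1beta3.DocumentProcessorServiceClient()
# set_default_version(client, project_id, location, processor_id, processor_version_id)
```

For a location other than `us`, the client would also need the matching `api_endpoint`, as in the evaluation function later in this section.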

### Input to Evaluate processor version with Additional test sets

In [None]:
project_id = "xxxxxxxx"
location = "us"  # Format is 'us' or 'eu'
processor_id = "xxxxxxxxxxxxxx"
processor_version_id = "xxxxxxxxxxxxx"
gcs_input_uri = "gs://xxxxxxx/xxxxxxx/"  # Format: gs://bucket/directory/ ==> where the labeled documents are present

### Function to evaluate

In [None]:
from google.api_core.client_options import ClientOptions
from google.cloud import documentai  # type: ignore


def evaluate_processor_version_sample(
    project_id: str,
    location: str,
    processor_id: str,
    processor_version_id: str,
    gcs_input_uri: str,
) -> str:
    """
    Evaluate a Document AI processor version using documents from a Google Cloud Storage bucket.

    Args:
        project_id (str): The Google Cloud project ID.
        location (str): The location of the processor.
        processor_id (str): The ID of the Document AI processor.
        processor_version_id (str): The ID of the processor version.
        gcs_input_uri (str): The Google Cloud Storage URI for the input documents.

    Returns:
        str: The ID of the evaluation.
    """
    # You must set the api_endpoint if you use a location other than 'us', e.g.:
    opts = ClientOptions(api_endpoint=f"{location}-documentai.googleapis.com")

    client = documentai.DocumentProcessorServiceClient(client_options=opts)

    # The full resource name of the processor version
    # e.g. `projects/{project_id}/locations/{location}/processors/{processor_id}/processorVersions/{processor_version_id}`
    name = client.processor_version_path(
        project_id, location, processor_id, processor_version_id
    )

    evaluation_documents = documentai.BatchDocumentsInputConfig(
        gcs_prefix=documentai.GcsPrefix(gcs_uri_prefix=gcs_input_uri)
    )

    request = documentai.EvaluateProcessorVersionRequest(
        processor_version=name,
        evaluation_documents=evaluation_documents,
    )

    # Make EvaluateProcessorVersion request
    # Continually polls the operation until it is complete.
    # This could take some time for larger files
    operation = client.evaluate_processor_version(request=request)
    # Print operation details
    # Format: projects/PROJECT_NUMBER/locations/LOCATION/operations/OPERATION_ID
    print(f"Waiting for operation {operation.operation.name} to complete...")
    # Wait for operation to complete
    response = documentai.EvaluateProcessorVersionResponse(operation.result())

    # Once the operation is complete,
    # Print evaluation ID from operation response
    print(f"Evaluation Complete: {response.evaluation}")
    return response.evaluation


# calling function
response_evaluation = evaluate_processor_version_sample(
    project_id, location, processor_id, processor_version_id, gcs_input_uri
)

# sample output
#'projects/xxxxxxx/locations/xx/processors/xxxxxxxxxxxxxxxx/processorVersions/xxxxxxxxxxx/evaluations/xxxxxxxxxxxx'

### To get evaluation of processor version



In [None]:
# evaluation_value has to be the output of above function in format
#'projects/xxxxxxx/locations/xx/processors/xxxxxxxxxxxxxxxx/processorVersions/xxxxxxxxxxx/evaluations/xxxxxxxxxxxx'

from google.cloud import documentai_v1beta3


def sample_get_evaluation(evaluation_value):
    # Create a client
    client = documentai_v1beta3.DocumentProcessorServiceClient()

    # Initialize request argument(s)
    request = documentai_v1beta3.GetEvaluationRequest(
        name=evaluation_value,
    )

    # Make the request
    response = client.get_evaluation(request=request)

    # Handle the response
    print(response)
    return response


eval_result = sample_get_evaluation(response_evaluation)
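
The printed evaluation is verbose. The sketch below pulls precision, recall, and F1 into a small table keyed by entity type. It assumes the field names of the v1beta3 `Evaluation` message (`all_entities_metrics`, `entity_metrics`, `confidence_level_metrics`, `metrics.f1_score`) as I understand them, so verify them against your `eval_result`. The helper name `summarize_evaluation` and the `"ALL"` key are hypothetical.

```python
def summarize_evaluation(evaluation, confidence_threshold: float = 0.0) -> dict:
    """Return {entity_type: (precision, recall, f1)} at the nearest confidence level."""

    def pick_metrics(multi_confidence_metrics):
        levels = list(multi_confidence_metrics.confidence_level_metrics)
        if not levels:
            return None
        # Choose the reported confidence level closest to the requested threshold
        nearest = min(
            levels, key=lambda level: abs(level.confidence_level - confidence_threshold)
        )
        return nearest.metrics

    rows = {}
    overall = pick_metrics(evaluation.all_entities_metrics)
    if overall is not None:
        rows["ALL"] = (overall.precision, overall.recall, overall.f1_score)
    for entity_type, multi_confidence_metrics in evaluation.entity_metrics.items():
        metrics = pick_metrics(multi_confidence_metrics)
        if metrics is not None:
            rows[entity_type] = (metrics.precision, metrics.recall, metrics.f1_score)
    return rows


# Usage in this notebook (uncomment after fetching eval_result):
# for entity_type, (precision, recall, f1) in summarize_evaluation(eval_result).items():
#     print(f"{entity_type}: precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```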

### Sample eval_result


<img src="./Images/eval_result.png" width=800 height=400></img>