In [None]:
# Copyright 2025 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Evaluate Gemini Structured Output

<table align="left">
  <td style="text-align: center">
    <a href="https://colab.research.google.com/github/GoogleCloudPlatform/generative-ai/blob/main//gemini/evaluation/evaluate_gemini_structured_output.ipynb">
      <img width="32px" src="https://www.gstatic.com/pantheon/images/bigquery/welcome_page/colab-logo.svg" alt="Google Colaboratory logo"><br> Open in Colab
    </a>
  </td>
  <td style="text-align: center">
    <a href="https://console.cloud.google.com/vertex-ai/colab/import/https:%2F%2Fraw.githubusercontent.com%2FGoogleCloudPlatform%2Fgenerative-ai%2Fmain%2Fgemini%2Fevaluation%2Fevaluate_gemini_structured_output.ipynb">
      <img width="32px" src="https://lh3.googleusercontent.com/JmcxdQi-qOpctIvWKgPtrzZdJJK-J3sWE1RsfjZNwshCFgE_9fULcNpuXYTilIR2hjwN" alt="Google Cloud Colab Enterprise logo"><br> Open in Colab Enterprise
    </a>
  </td>
  <td style="text-align: center">
    <a href="https://console.cloud.google.com/vertex-ai/workbench/deploy-notebook?download_url=https://raw.githubusercontent.com/GoogleCloudPlatform/generative-ai/main/gemini/evaluation/evaluate_gemini_structured_output.ipynb">
      <img src="https://www.gstatic.com/images/branding/gcpiconscolors/vertexai/v1/32px.svg" alt="Vertex AI logo"><br> Open in Vertex AI Workbench
    </a>
  </td>
  <td style="text-align: center">
    <a href="https://github.com/GoogleCloudPlatform/generative-ai/blob/main//gemini/evaluation/evaluate_gemini_structured_output.ipynb">
      <img width="32px" src="https://www.svgrepo.com/download/217753/github.svg" alt="GitHub logo"><br> View on GitHub
    </a>
  </td>
</table>

<div style="clear: both;"></div>


| Author |
| --- |
| [Steve Phillips](https://github.com/stevie-p) |

## Overview

This notebook uses the [**Gen AI Evaluation Service**](https://cloud.google.com/vertex-ai/generative-ai/docs/model-reference/evaluation) to evaluate and compare the performance of Gemini models for an extraction task.

The task is to accurately extract information from scanned, handwritten order forms for "Acme Corporation".

Within this notebook, we:

* Use Gemini models with [structured output](https://cloud.google.com/vertex-ai/generative-ai/docs/multimodal/control-generated-output) to ensure well-structured JSON output
* Extract the data using Gemini models
* Define two custom metrics: `valid_schema` using the [`jsonschema`](https://pypi.org/project/jsonschema/) library, and `accuracy` using the [`deepdiff`](https://github.com/seperman/deepdiff) library
* Use the **Gen AI Evaluation Service** to run the evaluation experiments

The [models](https://cloud.google.com/vertex-ai/generative-ai/docs/models) under test are:

* Gemini 2.0 Flash
* Gemini 2.5 Flash
* Gemini 2.5 Pro

## Get started

### Install Google Gen AI SDK and other required packages


In [1]:
%pip install --upgrade --quiet google-genai jsonschema IPython==7.34.0 google-cloud-aiplatform 'pybind11>=2.12' deepdiff

Restart the runtime to use the newly installed packages.

### Authenticate your notebook environment (Colab only)

If you're running this notebook on Google Colab, run the cell below to authenticate your environment.

In [1]:
import sys

if "google.colab" in sys.modules:
    from google.colab import auth

    auth.authenticate_user()

### Set Google Cloud project information

To get started using Vertex AI, you must have an existing Google Cloud project and [enable the Vertex AI API](https://console.cloud.google.com/flows/enableapi?apiid=aiplatform.googleapis.com).

Learn more about [setting up a project and a development environment](https://cloud.google.com/vertex-ai/docs/start/cloud-environment).

In [2]:
# Create your own project and insert the project ID here ---->

# Use the environment variable if the user doesn't provide Project ID.
import os

# fmt: off
PROJECT_ID = "[your-project-id]"  # @param {type: "string", placeholder: "[your-project-id]", isTemplate: true}
LOCATION = "us-central1" # @param {type: "string", placeholder: "[your-location]", isTemplate: true}
# fmt: on

if not PROJECT_ID or PROJECT_ID == "[your-project-id]":
    PROJECT_ID = str(os.environ.get("GOOGLE_CLOUD_PROJECT"))

if not LOCATION:
    LOCATION = os.environ.get("GOOGLE_CLOUD_REGION", "us-central1")

EXPERIMENT_NAME = "eval-gemini-structured"

from google import genai
import vertexai

client = genai.Client(vertexai=True, project=PROJECT_ID, location=LOCATION)
vertexai.init(project=PROJECT_ID, location=LOCATION)

### Import libraries

In [3]:
import io
import json
from datetime import datetime
import pandas as pd
from IPython.display import Image, Markdown, display
from google.cloud import storage  # type: ignore
from google.genai.types import (
    GenerateContentConfig,
    Part,
)
from jsonschema import validate
from vertexai.evaluation import CustomMetric, EvalTask, notebook_utils
from deepdiff import DeepDiff

## View the images

Let's have a look at the images we want to extract data from.
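
Each entry in the list below carries both a `gs://` URI (passed to Gemini) and a public HTTPS URL (used to display the image). As a minimal sketch of how one maps to the other, assuming the object is publicly readable, the hypothetical helper `gcs_uri_to_public_url` (not part of this notebook or any SDK) percent-encodes the object path:

```python
from urllib.parse import quote


def gcs_uri_to_public_url(gcs_uri: str) -> str:
    """Map gs://bucket/object to https://storage.googleapis.com/bucket/object."""
    prefix = "gs://"
    if not gcs_uri.startswith(prefix):
        raise ValueError(f"Not a Cloud Storage URI: {gcs_uri}")
    # quote() keeps "/" as-is and encodes characters such as spaces as %20
    return "https://storage.googleapis.com/" + quote(gcs_uri[len(prefix):])


public_url = gcs_uri_to_public_url("gs://eval-extraction-examples/Acme Order Form.jpg")
print(public_url)
```

This reproduces the `image_url` value used for the first form.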

In [4]:
images = [
    {
        "image_url": "https://storage.googleapis.com/eval-extraction-examples/Acme%20Order%20Form.jpg",
        "image_uri": "gs://eval-extraction-examples/Acme Order Form.jpg",
        "image_type": "image/jpeg",
        "image_name": "Acme Order Form.jpg",
        "reference": {  # The Ground Truth
            "order_number": "98-X42-77A",
            "order_date": "2025-09-01",
            "customer_name": "WILE E. COYOTE (ESQ., PH.D, S.G.)",
            "customer_address": "HIGH MESA, CORNER OF X-MARK AND DETONATION CANYON, ANVIL FALLS, AZ",
            "line_items": [
                {
                    "item_description": "Jet Propelled Unicycle",
                    "quantity": 1,
                    "unit_price": 99.99,
                    "delivery_option": "Next Day",
                },
                {
                    "item_description": "Instant Hole Kit",
                    "quantity": 3,
                    "unit_price": 45.00,
                    "delivery_option": "Standard",
                },
                {
                    "item_description": "TNT High Explosives x24",
                    "quantity": 1,
                    "unit_price": 120.00,
                    "delivery_option": "Fast",
                },
                {
                    "item_description": "Super Magnet (XL)",
                    "quantity": 1,
                    "unit_price": 150.00,
                    "delivery_option": "Fast",
                },
                {
                    "item_description": "Rocket-Powered Roller skates",
                    "quantity": 2,
                    "unit_price": 79.99,
                    "delivery_option": "Next Day",
                },
            ],
        },
    },
    {
        "image_url": "https://storage.googleapis.com/eval-extraction-examples/EF0004.jpg",
        "image_uri": "gs://eval-extraction-examples/EF0004.jpg",
        "image_type": "image/jpeg",
        "image_name": "EF0004.jpg",
        "reference": {  # The Ground Truth
            "order_number": "EF0004",
            "order_date": "2025-10-26",
            "customer_name": "Elmer J. Fudd",
            "customer_address": "Happy Hunter's Hollow, Looney Tune Forest, CA",
            "line_items": [
                {
                    "item_description": "Silent Sneak Shoes",
                    "quantity": 1,
                    "unit_price": 35.99,
                    "delivery_option": "Standard",
                },
                {
                    "item_description": "Invisible Rabbit Trap",
                    "quantity": 2,
                    "unit_price": 75.00,
                    "delivery_option": "Standard",
                },
                {
                    "item_description": "Giant Butterfly Net",
                    "quantity": 1,
                    "unit_price": 49.50,
                    "delivery_option": "Fast",
                },
                {
                    "item_description": "Instant Camouflage Kit",
                    "quantity": 3,
                    "unit_price": 65.00,
                    "delivery_option": "Next Day",
                },
                {
                    "item_description": "Repellent Spray",
                    "quantity": 4,
                    "unit_price": 29.99,
                    "delivery_option": "Next Day",
                },
            ],
        },
    },
]

In [5]:
# Display the images using their public URLs

for image in images:

    print(image["image_name"])
    display(Image(url=image["image_url"], height=800))

These are mock order forms for *Acme Corporation*. Customers use them to order products and select a delivery option for each item: "Standard", "Fast" or "Next Day".

We will use these forms to evaluate the performance of Gemini.

## Extract the data using Gemini


### Select the models

In [6]:
# Define which models to compare

models = [
    # Gemini 2.0 family
    "gemini-2.0-flash",
    # Gemini 2.5 family
    "gemini-2.5-flash",
    "gemini-2.5-pro",
]

### Define the prompt and schema
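
The prompt in this section tells the model to read ambiguous dates as dd/mm/yyyy, while the schema's `"format": "date"` asks for `YYYY-MM-DD` output. As a small illustration of that normalisation, the sketch below uses only the standard library. The input `01/09/2025` is a hypothetical handwritten value, not one read from the forms, and `to_iso_date` is an illustrative helper name.

```python
from datetime import datetime


def to_iso_date(text: str) -> str:
    """Read a dd/mm/yyyy string and return it as YYYY-MM-DD."""
    return datetime.strptime(text, "%d/%m/%Y").date().isoformat()


iso_date = to_iso_date("01/09/2025")
print(iso_date)  # 2025-09-01 (1 September, not 9 January)
```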

In [7]:
# Define the prompt and the structured output schema

prompt = """
    Analyze the attached scanned form and extract the information in the table in accordance with the schema.

    Provide the output in a clean JSON format.

    If any date field is formatted ambiguously, assume the dates are in dd/mm/yyyy format.

    If a field is blank, illegible, or cannot be found, return null for its value.

    If there are blank rows, do not include them in the output.

    If there is no image attached, return null for all fields.

"""

# Use structured output to ensure well formatted and consistent JSON output

schema = {
    "type": "object",
    "properties": {
        "order_number": {"type": "string"},
        "order_date": {
            "type": "string",
            "format": "date",  # Note: Enforces a full date output in the RFC 3339 format ("YYYY-MM-DD")
        },
        "customer_name": {"type": "string"},
        "customer_address": {"type": "string"},
        "line_items": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "item_description": {"type": "string"},
                    "quantity": {"type": "integer"},
                    "unit_price": {"type": "number"},
                    "delivery_option": {  # Note: We do not tell Gemini how to interpret the checkboxes as "Standard", "Fast" or "Next Day"
                        "type": "string"
                    },
                },
            },
        },
    },
}

generate_content_config = GenerateContentConfig(
    response_mime_type="application/json",
    response_schema=schema,
)

### Run the prompt

In [8]:
# Run the prompt for each model in `models` and each image in `images`, storing the output in `gemini_response`

gemini_response = {}
run_id = notebook_utils.generate_uuid(8)


for image_info in images:
    image = Part.from_uri(
        file_uri=image_info["image_uri"], mime_type=image_info["image_type"]
    )
    image_name = image_info["image_name"]

    gemini_response[image_name] = {}

    for model in models:
        run_name = f"{run_id}-{model}-{image_name}"

        response = client.models.generate_content(
            model=model, contents=[prompt, image], config=generate_content_config
        )

        response_json = json.dumps(response.parsed, indent=4)

        print("----------------------------------")
        print(f"{run_name}: \n{response_json}")
        gemini_response[image_name][model] = response_json

## Perform the Evaluation

### Prepare the evaluation dataset

Now that we have the outputs from the Gemini models, we can run the evaluation.

In [9]:
# Create the Evaluation Dataset

eval_dataset_rows = []
for image_info in images:
    image_name = image_info["image_name"]
    image_uri = image_info["image_uri"]
    image_type = image_info["image_type"]
    reference_str = json.dumps(
        image_info["reference"], indent=4
    )  # Convert the reference (ground truth) to pretty-printed JSON

    if image_name in gemini_response:
        models_data = gemini_response[image_name]
        for model_name, response_text in models_data.items():
            eval_dataset_rows.append(
                {
                    "model": model_name,
                    "prompt": prompt,  # The same prompt is used for all Gemini calls
                    "image": image_name,
                    "reference": reference_str,
                    "response": response_text,
                    "differences": DeepDiff(json.loads(reference_str), json.loads(response_text)).pretty() # Uses the `deepdiff` library for identifying the differences between the response and the references
                }
            )

eval_dataset = pd.DataFrame(eval_dataset_rows)

This evaluation dataset now contains the reference (ground truth) and response for each combination of model and image.
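
To make the idea of a field-wise comparison concrete, here is a minimal standard-library sketch. It is not how `deepdiff` works internally, and the helper names (`flatten`, `field_differences`, `field_accuracy`) and the sample response are made up for illustration. Unlike the `ignore_order=True` setting used for the `accuracy` metric in this notebook, this sketch is sensitive to list order.

```python
from typing import Any

MISSING = "<missing>"  # placeholder for a field present on only one side


def flatten(value: Any, path: str = "") -> dict[str, Any]:
    """Flatten nested dicts and lists into {"path": leaf_value} pairs."""
    if isinstance(value, dict):
        children = [(f"{path}.{k}" if path else str(k), v) for k, v in value.items()]
    elif isinstance(value, list):
        children = [(f"{path}[{i}]", v) for i, v in enumerate(value)]
    else:
        return {path: value}
    flat: dict[str, Any] = {}
    for child_path, child in children:
        flat.update(flatten(child, child_path))
    return flat


def field_differences(reference: Any, response: Any) -> dict[str, tuple[Any, Any]]:
    """Return {path: (reference_value, response_value)} for each mismatching leaf."""
    ref, resp = flatten(reference), flatten(response)
    return {
        path: (ref.get(path, MISSING), resp.get(path, MISSING))
        for path in sorted(ref.keys() | resp.keys())
        if ref.get(path, MISSING) != resp.get(path, MISSING)
    }


def field_accuracy(reference: Any, response: Any) -> float:
    """Share of leaf fields, over the union of paths, that match exactly."""
    ref, resp = flatten(reference), flatten(response)
    paths = ref.keys() | resp.keys()
    if not paths:
        return 1.0
    matches = sum(1 for p in paths if p in ref and p in resp and ref[p] == resp[p])
    return matches / len(paths)


sample_reference = {
    "order_number": "EF0004",
    "line_items": [
        {"item_description": "Silent Sneak Shoes", "quantity": 1, "delivery_option": "Standard"}
    ],
}
# Hypothetical model output with one wrong field
sample_response = {
    "order_number": "EF0004",
    "line_items": [
        {"item_description": "Silent Sneak Shoes", "quantity": 1, "delivery_option": "Fast"}
    ],
}

print(field_differences(sample_reference, sample_response))
print(field_accuracy(sample_reference, sample_response))  # 3 of 4 fields match: 0.75
```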

### Define custom metrics for JSON schema validation and accuracy

In [10]:
# Define a custom evaluation metric to assess whether the response complies with the schema

def is_valid_schema(instance: dict[str, str]) -> dict[str, bool]:
    """Return True if the response complies with the schema, False if not"""
    response = instance["response"]

    try:
        validate(instance=json.loads(response), schema=schema)
    except Exception:
        return {"valid_schema": False}

    return {"valid_schema": True}


valid_schema = CustomMetric(name="valid_schema", metric_function=is_valid_schema)

In [11]:
# Define a `CustomMetric` for accuracy using the `deepdiff` library

def calculate_accuracy(instance: dict[str, str]) -> dict[str, float]:
    """Return an accuracy score between 0 and 1, based on the deep distance between the response and the reference"""
    ref_json_string = instance["reference"]
    resp_json_string = instance["response"]

    try:
        reference_data = json.loads(ref_json_string)
        response_data = json.loads(resp_json_string)
    except json.JSONDecodeError:
        # If either string is not valid JSON, return 0 accuracy
        return {"accuracy": 0.0}

    # Use the deepdiff library to calculate the distance between the response and the reference
    # (0 = exact match, 1 = completely different)
    deep_distance = DeepDiff(
        reference_data, response_data, ignore_order=True, get_deep_distance=True
    ).get("deep_distance")

    # Guard against a missing distance value, treating it as no measured difference
    accuracy = 1.0 - (deep_distance or 0.0)

    return {"accuracy": accuracy}

accuracy = CustomMetric(name="accuracy", metric_function=calculate_accuracy)

### Define `EvalTask` & Experiment
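
The task defined in this section includes `exact_match` alongside the custom metrics. As the comment in the cell notes, it makes no allowance for inconsistent JSON formatting. The sketch below uses plain Python on made-up strings (it is not the service's implementation) to show why comparing raw strings is stricter than comparing parsed JSON:

```python
import json

sample_reference = json.dumps({"order_number": "EF0004", "quantity": 1}, indent=4)
# Same content, but compact formatting and a different key order
sample_response = '{"quantity": 1, "order_number": "EF0004"}'

strings_equal = sample_reference == sample_response
parsed_equal = json.loads(sample_reference) == json.loads(sample_response)

print(strings_equal)  # False: whitespace and key order differ
print(parsed_equal)  # True: the parsed objects are identical
```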

In [12]:
# Define the evaluation task

extraction_eval_task = EvalTask(
    dataset=eval_dataset,
    metrics=[
        "exact_match",  # Exact match will only be 1 if the response is perfectly accurate, with no allowance for inconsistent JSON formatting. Hence, the custom `accuracy` metric is the better metric.
        valid_schema,
        accuracy,
    ],
    experiment=EXPERIMENT_NAME,
)

### Run the Evaluation

In [13]:
# Define the experiment & experiment run
run_id = notebook_utils.generate_uuid(8)

experiment_run_name = f"eval-{run_id}"

eval_result = extraction_eval_task.evaluate(experiment_run_name=experiment_run_name)

### Display the results

In [14]:
notebook_utils.display_eval_result(eval_result=eval_result)

## Analyse the results

### Use Gemini to analyse the results

In [15]:
# Let's get Gemini to analyse the results.

# Prepare the prompt for Gemini 2.5 Flash to summarize and analyze the results
summary_prompt = """
Analyze the following experiment results comparing Gemini models for extracting data from scanned forms.
The results include a summary table with overall metrics and row-based metrics, as well as the specific differences between the extracted data and the reference (ground truth).

Summarize the performance of each model based on the metrics provided (valid_schema, accuracy) from the summary table.
Analyze the detailed differences to understand the *types* of errors and mismatches occurring for each model.
Identify which models performed best and worst for each metric and based on the detailed error analysis.
Draw conclusions about the strengths and weaknesses of Gemini models for this specific tabular data extraction task, considering both the overall accuracy and the nature of the errors.
Consider the different versions of Gemini and how their performance varies.
Provide a clear and concise summary of the overall results, followed by key conclusions supported by observations from the detailed comparison.

Experiment Results Summary Table:
"""

# Convert the summary metrics and the row-based metrics table to text.
# `DataFrame.to_string()` avoids the truncation applied by the default DataFrame repr,
# so the full `differences` text reaches the model.
eval_result_string = (
    f"Summary metrics:\n{json.dumps(eval_result.summary_metrics, indent=2, default=str)}\n\n"
    f"Row-based metrics:\n{eval_result.metrics_table.to_string()}"
)


# Concatenate the prompt and the summary table results
full_prompt = (
    summary_prompt
    + eval_result_string
)

# Use Gemini 2.5 Flash to analyze the results
try:
    # Generate the response
    response = client.models.generate_content(
        model="gemini-2.5-flash", contents=full_prompt
    )

    # Display the summary and analysis from Gemini
    display(Markdown(response.text))

except Exception as e:
    print(f"An error occurred while calling Gemini: {e}")
    print(
        "Please ensure you have access to Gemini 2.5 Flash and your project/location settings are correct."
    )

## Conclusions


This notebook has shown how to use the Gen AI Evaluation Service to evaluate Gemini's structured output for a document processing task.

It takes a "bring your own response" approach, combining custom `valid_schema` and `accuracy` metrics with the built-in `exact_match` metric.

It also performs a deep, field-wise comparison of the responses to understand inaccuracies, and uses Gemini to summarise and analyse the results.