In [None]:
# Copyright 2025 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Evaluate groundedness with custom parsing

<table align="left">
  <td style="text-align: center">
    <a href="https://colab.research.google.com/github/GoogleCloudPlatform/generative-ai/blob/main/gemini/evaluation/evaluate_groundedness_with_custom_parsing.ipynb">
      <img src="https://cloud.google.com/ml-engine/images/colab-logo-32px.png" alt="Google Colaboratory logo"><br> Open in Colab
    </a>
  </td>
  <td style="text-align: center">
    <a href="https://console.cloud.google.com/vertex-ai/colab/import/https:%2F%2Fraw.githubusercontent.com%2FGoogleCloudPlatform%2Fgenerative-ai%2Fmain%2Fgemini%2Fevaluation%2Fevaluate_groundedness_with_custom_parsing.ipynb">
      <img width="32px" src="https://lh3.googleusercontent.com/JmcxdQi-qOpctIvWKgPtrzZdJJK-J3sWE1RsfjZNwshCFgE_9fULcNpuXYTilIR2hjwN" alt="Google Cloud Colab Enterprise logo"><br> Open in Colab Enterprise
    </a>
  </td>    
  <td style="text-align: center">
    <a href="https://console.cloud.google.com/vertex-ai/workbench/deploy-notebook?download_url=https://raw.githubusercontent.com/GoogleCloudPlatform/generative-ai/main/gemini/evaluation/evaluate_groundedness_with_custom_parsing.ipynb">
      <img src="https://lh3.googleusercontent.com/UiNooY4LUgW_oTvpsNhPpQzsstV5W8F7rYgxgGBD85cWJoLmrOzhVs_ksK_vgx40SHs7jCqkTkCk=e14-rj-sc0xffffff-h130-w32" alt="Vertex AI logo"><br> Open in Workbench
    </a>
  </td>
  <td style="text-align: center">
    <a href="https://github.com/GoogleCloudPlatform/generative-ai/blob/main/gemini/evaluation/evaluate_groundedness_with_custom_parsing.ipynb">
      <img src="https://cloud.google.com/ml-engine/images/github-logo-32px.png" alt="GitHub logo"><br> View on GitHub
    </a>
  </td>
</table>

<div style="clear: both;"></div>

 | | |
 |-|-|
 |Author(s): | [Greg Breard](https://github.com/gregbreard) |

## Overview

This notebook shows how to assess the grounding capabilities of generative models: their ability to generate responses that are factually consistent with, and derived from, a given textual context. Assessing grounding with custom parsing, as shown here, supports a more detailed evaluation than the standard [metric prompt templates](https://cloud.google.com/vertex-ai/generative-ai/docs/models/metrics-templates), because each sentence of a response receives its own verdict, rationale, and supporting excerpt.

It goes beyond basic information retrieval by examining the extent to which model outputs are truly rooted in the provided source and can synthesize information accurately from that context. The approach to grounding demonstrated here is based on the [FACTS Grounding Benchmark](https://www.kaggle.com/facts-leaderboard/examples).

This is accomplished using the Vertex Gen AI Evaluation SDK, which supports custom output parsing. The prompt instructs the autorater model to return structured (JSON) output, which is then parsed seamlessly by the parsing function provided in the metric definition. The parsed output is appended to the evaluation result data frame.

## Objective

1. Generate structured output from an autorater.
2. Use custom parsing for advanced evaluation.

## Steps

1. Set up the environment.
2. Define helper functions, prompt templates, and metric.
3. Prepare the dataset for evaluation.
4. Run the evaluation (including model inference).

## Costs
This tutorial uses billable components of Google Cloud:

- Vertex AI

Learn about [Vertex AI pricing](https://cloud.google.com/vertex-ai/pricing) and use the [Pricing Calculator](https://cloud.google.com/products/calculator/) to generate a cost estimate based on your projected usage.

# Get started

## Install Vertex AI SDK for Python and other required packages


In [None]:
%pip install --upgrade --quiet google-cloud-aiplatform

## Restart runtime (Colab only)

To use the newly installed packages, you must restart the runtime on Google Colab.

In [None]:
import sys

if "google.colab" in sys.modules:

    import IPython

    app = IPython.Application.instance()
    app.kernel.do_shutdown(True)

<div class="alert alert-block alert-warning">
<b>⚠️ The kernel is going to restart. Wait until it's finished before continuing to the next step. ⚠️</b>
</div>


## Authenticate your notebook environment (Colab only)


Authenticate your environment on Google Colab.

In [None]:
import sys

if "google.colab" in sys.modules:

    from google.colab import auth

    auth.authenticate_user()

## Set Google Cloud project information and initialize Vertex AI SDK for Python

To get started using Vertex AI, you must have an existing Google Cloud project and [enable the Vertex AI API](https://console.cloud.google.com/flows/enableapi?apiid=aiplatform.googleapis.com). Learn more about [setting up a project and a development environment](https://cloud.google.com/vertex-ai/docs/start/cloud-environment).

In [None]:
PROJECT_ID = "your-project-id"  # @param {type:"string"}
LOCATION = "us-central1"  # @param {type:"string"}


import vertexai

vertexai.init(project=PROJECT_ID, location=LOCATION)

## Import libraries

In [None]:
import numpy as np
import pandas as pd
from vertexai.generative_models import GenerativeModel
from vertexai.preview.evaluation import (
    AutoraterConfig,
    CustomOutputConfig,
    EvalTask,
    PointwiseMetric,
)

# Set up evaluation

## Helper functions

The following functions extract the JSON objects from the results returned by the autorater and compute the model response score. Additional pretty-printing functions improve the readability of the evaluation results.

In [None]:
import json
import re
from typing import Any

_TABLE_STYLE = [
    {
        "selector": "th",
        "props": [
            ("background-color", "#f2f2f2"),
            ("border", "1px solid gray"),
            ("color", "black"),
            ("font-size", "11pt"),
            ("text-align", "center"),
            ("word-break", "break-all"),
        ],
    },
    {
        "selector": "td",
        "props": [
            ("border", "1px solid gray"),
            ("color", "black"),
            ("min-width", "100px"),
            ("text-align", "center"),
        ],
    },
    {"selector": "tr:nth-child(even)", "props": [("background-color", "#f9f9f9")]},
    {"selector": "tr:nth-child(odd)", "props": [("background-color", "white")]},
    {"selector": "tr:hover", "props": [("background-color", "#94e6ff")]},
    {"selector": "td:hover", "props": [("background-color", "#ffffb3")]},
]


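def parse_response_to_json(responses: list[str]) -> dict[str, Any]:
    """Parses the autorater's raw output and scores the model response.

    Only the first sampled response is used. The score is 0 if any sentence is
    labeled `unsupported` or `contradictory`; otherwise it is the fraction of
    sentences labeled `supported` (so `no_rad` sentences lower the score).
    """
    # Strip Markdown code fences (```json ... ```) around the JSON payload.
    response = re.sub(
        r"(.*```json|```.*)",
        "",
        responses[0].strip(),
    )
    sentences = []
    verdicts = []
    score = 0
    try:
        # The autorater is expected to return a JSON array of sentence objects.
        sentences = json.loads(response)
        for sentence in sentences:
            verdicts.append(sentence["label"])
        successful = 0
        for verdict in verdicts:
            if verdict == "supported":
                successful += 1
            elif verdict in ["unsupported", "contradictory"]:
                # A single ungrounded sentence zeroes the whole response.
                successful = 0
                break
        # An empty array raises ZeroDivisionError here, leaving the score at 0.
        score = successful / len(verdicts)
    except Exception as e:
        print(f"Failed to parse or score the autorater response: {e}")
    return {
        "sentence": sentences,
        "sentence_verdict": verdicts,
        "model_resp_score": score,
    }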


def pretty_print_df(
    df: "pd.DataFrame", hide_columns: list[str]
) -> "pd.io.formats.style.Styler":
    """Renders the results as a styled table, hiding the given columns."""
    styled_df = df.copy()
    for col in df.columns:
        # Render cells that hold a non-empty list of dicts as nested HTML tables.
        first_value = df[col].iloc[0]
        if (
            isinstance(first_value, list)
            and first_value
            and isinstance(first_value[0], dict)
        ):
            styled_df[col] = styled_df[col].apply(_list_to_html_table)
    return (
        styled_df.style.hide(axis="index")
        .hide(subset=hide_columns, axis=1)
        .format({"groundedness/model_resp_score": "{:,.1f}"})
        .set_table_styles(_TABLE_STYLE)
    )


def _list_to_html_table(data: list[dict[str, Any]]) -> str:
    if not data:
        return "<i>No data to display.</i>"
    html_table = "<table style='border-collapse: collapse'><thead><tr>"
    # Extract headers from the first element
    for key in data[0].keys():
        html_table += f"<th>{key}</th>"
    html_table += "</tr></thead><tbody>"
    # Add rows
    for item in data:
        html_table += "<tr>"
        for value in item.values():
            html_table += f"<td>{value}</td>"
        html_table += "</tr>"
    html_table += "</tbody></table>"
    return html_table
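
The fence stripping and scoring rule used by `parse_response_to_json` can be traced by hand on a small input. The following is a minimal sketch, not part of the evaluation pipeline: the raw output is a made-up autorater reply, and `score_labels` is a hypothetical helper that mirrors the scoring rule. The fence marker is built programmatically only to keep literal code fences out of the snippet.

```python
import json
import re

FENCE = "`" * 3  # Markdown code fence marker, built to avoid literal fences here.

# Hypothetical raw autorater output wrapped in a Markdown JSON fence.
payload = json.dumps(
    [
        {"sentence": "Apples are red.", "label": "supported"},
        {"sentence": "Enjoy your fruit!", "label": "no_rad"},
    ]
)
raw_output = f"{FENCE}json\n{payload}\n{FENCE}"

# Same pattern as in parse_response_to_json: drop the opening and closing fences.
cleaned = re.sub(rf"(.*{FENCE}json|{FENCE}.*)", "", raw_output.strip())
labels = [item["label"] for item in json.loads(cleaned)]


def score_labels(labels: list[str]) -> float:
    """Returns 0 on any unsupported/contradictory label, else the supported share."""
    if not labels or any(
        label in ("unsupported", "contradictory") for label in labels
    ):
        return 0.0
    return labels.count("supported") / len(labels)


print(labels)  # ['supported', 'no_rad']
print(score_labels(labels))  # 0.5
print(score_labels(["supported", "unsupported", "supported"]))  # 0.0
```

Note that a `no_rad` sentence counts in the denominator, so a response made up of one supported sentence and one greeting scores 0.5 rather than 1.0.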

## Prompt templates

In [None]:
GROUNDING_AUTORATER_PROMPT = """
You are a helpful and harmless AI assistant. You will be provided with a textual
context and a model-generated response.
Your task is to analyze the response sentence by sentence and classify each
sentence according to its relationship with the provided context.

**Instructions:**

1. **Decompose the response into individual sentences.**
2. **For each sentence, assign one of the following labels:**
    * **`supported`**: The sentence is entailed by the given context.  Provide a
      supporting excerpt from the context. The supporting excerpt must *fully*
      entail the sentence. If you need to cite multiple supporting excerpts,
      simply concatenate them.
    * **`unsupported`**: The sentence is not entailed by the given context. No
      excerpt is needed for this label.
    * **`contradictory`**: The sentence is falsified by the given context.
      Provide a contradicting excerpt from the context.
    * **`no_rad`**: The sentence does not require factual attribution (e.g.,
      opinions, greetings, questions, disclaimers).  No excerpt is needed for
      this label.
3. **For each label, provide a short rationale explaining your decision.**
   The rationale should be separate from the excerpt.
4. **Be very strict with your `supported` and `contradictory` decisions.**
   Unless you can find straightforward, indisputable evidence excerpts *in the
   context* that a sentence is `supported` or `contradictory`, consider it
   `unsupported`. You should not employ world knowledge unless it is truly
   trivial.

**Input Format:**

The input will consist of two parts, clearly separated:

* **Context:**  The textual context used to generate the response.
* **Response:** The model-generated response to be analyzed.

**Output Format:**

For each sentence in the response, output a JSON object with the following
fields:

* `"sentence"`: The sentence being analyzed.
* `"label"`: One of `supported`, `unsupported`, `contradictory`, or `no_rad`.
* `"rationale"`: A brief explanation for the assigned label.
* `"excerpt"`:  A relevant excerpt from the context. Only required for
  `supported` and `contradictory` labels.

Output a single JSON array containing one object per sentence, with each object on a new line.

**Example:**

**Input:**

```
Context:
Apples are red fruits. Bananas are yellow fruits.

Response:
Apples are red. Bananas are green. Bananas are cheaper than apples. Enjoy your fruit!
```

**Output:**

[
{"sentence": "Apples are red.", "label": "supported", "rationale": "The context explicitly states that apples are red.", "excerpt": "Apples are red fruits."},
{"sentence": "Bananas are green.", "label": "contradictory", "rationale": "The context states that bananas are yellow, not green.", "excerpt": "Bananas are yellow fruits."},
{"sentence": "Bananas are cheaper than apples.", "label": "unsupported", "rationale": "The context does not mention the price of bananas or apples.", "excerpt": null},
{"sentence": "Enjoy your fruit!", "label": "no_rad", "rationale": "This is a general expression and does not require factual attribution.", "excerpt": null}
]

**Now, please analyze the following context and response:**

**User Query:**
{query}

**Context:**
{context}

**Response:**
{response}
"""

In [None]:
RESPONSE_PROMPT_TEMPLATE = """
Using only the information included in the context block, answer the user query
in 5 sentences or less.

**User Query:**
{query}

**Context:**
{context}
"""
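
Both templates mix `{query}`, `{context}`, and `{response}` placeholders with literal braces (the JSON example in the autorater prompt). How the SDK fills the placeholders is not shown in this notebook. The sketch below is only an illustration of the idea, assuming simple name-based substitution; `fill_placeholders` and the miniature template are hypothetical. It also shows why plain `str.format` would not work on such a template.

```python
import re

# Hypothetical miniature template: literal JSON braces plus two placeholders.
template = (
    'Return {"label": "..."} for each sentence.\n'
    "Context: {context}\n"
    "Response: {response}"
)


def fill_placeholders(template: str, values: dict[str, str]) -> str:
    """Replaces {name} for known names and leaves all other braces untouched."""

    def substitute(match: re.Match) -> str:
        name = match.group(1)
        return values.get(name, match.group(0))

    # Only bare identifiers in braces are treated as placeholders.
    return re.sub(r"\{(\w+)\}", substitute, template)


filled = fill_placeholders(
    template,
    {"context": "Apples are red fruits.", "response": "Apples are red."},
)
print(filled)

# str.format treats the JSON braces as a replacement field and raises.
try:
    template.format(context="x", response="y")
except (KeyError, ValueError, IndexError) as err:
    print(f"str.format fails on literal braces: {err!r}")
```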

## Define the metric

In [None]:
grounded_metric = PointwiseMetric(
    metric="groundedness",
    metric_prompt_template=GROUNDING_AUTORATER_PROMPT,
    custom_output_config=CustomOutputConfig(
        return_raw_output=True,
        parsing_fn=parse_response_to_json,
    ),
)
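
The result columns referenced later in this notebook (`groundedness/model_resp_score` and `groundedness/sentence_verdict`) suggest that each key returned by the parsing function is prefixed with the metric name. A minimal sketch of that naming convention follows, assuming this is how the columns are derived; `prefix_keys` is a hypothetical helper, not an SDK function, and the parsed values are made up.

```python
from typing import Any


def prefix_keys(metric_name: str, parsed: dict[str, Any]) -> dict[str, Any]:
    """Prefixes every key of a parsed result with '<metric_name>/'."""
    return {f"{metric_name}/{key}": value for key, value in parsed.items()}


# Hypothetical parsed output with the same keys parse_response_to_json returns.
parsed = {
    "sentence": [{"sentence": "Apples are red.", "label": "supported"}],
    "sentence_verdict": ["supported"],
    "model_resp_score": 1.0,
}

row = prefix_keys("groundedness", parsed)
print(list(row))
# ['groundedness/sentence', 'groundedness/sentence_verdict', 'groundedness/model_resp_score']
```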

# Prepare the dataset

In [None]:
# source: https://www.kaggle.com/datasets/deepmind/facts-grounding-examples/data?select=examples.csv
queries = [
    "What advantages do offline vs online retailers have?",
    "What are the requirements of OPM?",
    "How does virtual memory improve the efficiency of real physical memory (RAM) usage in computer systems?",
    "List the reasons that resulted in decreased emission of greenhouse gases from ethanol production.",
]
contexts = [
    """Due to savings in inventory costs, online retailing has a big advantage in selling
less popular items (the long tail) and in removing geographic barriers to purchase. Brynjolfsson
et al. (2003) estimated that the significantly increased assortment of books available online
increased consumer welfare by $700 million to a billion dollars in 2000. In comparing the online
sales of a clothing retailer with the catalog sales of the same item, Brynjolfsson et al. (2009)
showed that online sales of niche items were less sensitive to competition from offline stores
than from catalog sales because the online sales were skewed toward niche items. Brynjolfsson et
al. (2011) showed that online sales of niche items increased with recommendations and search tools,
indicating that these tools lowered search costs, making it easier for consumers to locate them.
In sum, because of the ability to handle more extensive inventories and provide search tools that
facilitate locating niche items, online retailing has a comparative advantage in selling less popular
items, translating into substantial benefits for consumers.
As noted above, online sellers have an advantage in facilitating a search for information on digital
attributes (including price). In contrast, offline sellers have an advantage in providing information
on non- digital attributes and providing faster delivery. This leads to the possibility that consumers
will search among both online and offline retailers.""",
    """As part of the assessment, S. 4043 would require OPM to explain whether each agency met its telework
goals and, if not, the actions being taken to identify and eliminate barriers to meeting them. The annual
report would also discuss additional steps that are planned by agencies to ensure telework oversight and
quality control and increase the utilization rates of office building space owned or leased by the agencies.
S. 4043 also requires the Office of Management and Budget (OMB), in consultation with GSA and the
Federal Real Property Council, to develop benchmarks and guidance for executive agencies to use when
calculating building utilization rates. S. 4043 would then require each executive agency head to establish
(1) a system to track office building space utilization rates consistent with that OMB guidance and (2)
indicators that measure the effects of telework policy on the management of real and personal property,
among other things.
S. 4043 would also require OPM to establish data standards to aid telework reporting requirements and
for automated telework tracking within payroll systems used by agencies. S. 4043 would require OPM, in
turn, to create an online tool that makes the standardized and reported data publicly available and would
allow OPM to use the online tool to fulfill its annual reporting requirements. For a more detailed
discussion of the bill's provisions on telework data standards, including office building utilization data,
see CRS Insight IN12352, Establishing Data Standards and Measuring Building Use: Select Provisions
of the Telework Transparency Act of 2024 (S. 4043).""",
    """Virtual memory is a computer system technique which gives an application program
the impression that it has contiguous working memory (an address space), while in fact it may be
physically fragmented and may even overflow on to disk storage. Systems that use this technique
make programming of large applications easier and use real physical memory (e.g.RAM) more
efficiently than those without virtual memory.
http://en.wikipedia.org/wiki/Virtual_memory
Page Fault: A page is a fixed-length block of memory that is used as a unit of transfer
between physical memory and external storage like a disk, and a page fault is an interrupt (or
exception) to the software raised by the hardware, when a program accesses a page that is
mapped in address space, but not loaded in physical memory.
http://en.wikipedia.org/wiki/Page_fault
Thrash is the term used to describe a degenerate situation on a computer where increasing
resources are used to do a decreasing amount of work. In this situation the system is
said to be thrashing. Usually it refers to two or more processes accessing a shared resource
repeatedly such that serious system performance degradation occurs because the system is
spending a disproportionate amount of time just accessing the shared resource. Resource
access time may generally be considered as wasted, since it does not contribute to the
advancement of any process. In modern computers, thrashing may occur in the paging system
(if there is not 'sufficient' physical memory or the disk access time is overly long), or in the
communications system (especially in conflicts over internal bus access), etc.
http://en.wikipedia.org/wiki/Thrash_(computer_science)""",
    """A new USDA report, titled 'A Life-Cycle Analysis of the Greenhouse Gas Emissions of Corn-Based
Ethanol,' finds that greenhouse gas emissions associated with producing corn-based ethanol in
the United States are about 43 percent lower than gasoline when measured on an energy equivalent
basis. Unlike other studies of greenhouse gas benefits, which relied on forecasts of future ethanol production
systems and expected impacts on the farm sector, this study reviewed how the industry and farm
sectors have performed over the past decade to assess the current greenhouse gas profile of corn-based ethanol.
The report shows that the reductions in greenhouse gas emissions were driven by a variety of improvements in
ethanol production, spanning from the corn field to the ethanol refinery. Farmers are producing corn
more efficiently and using conservation practices that reduce greenhouse gas emissions, including reduced tillage,
cover crops, and improved nitrogen management. Both corn yields and the efficiency of ethanol
production technologies are also improving.
Previous estimates of ethanol's greenhouse gas balance report lower efficiencies, largely due to anticipated
conversion of grasslands and forests to commodity production as a result of increased demand for corn
used in ethanol production. However, recent studies of international agricultural land use trends show
that since 2004, the primary land use change response of the world's farmers to rising commodity prices
has been to use available land resources more efficiently rather than to expand the amount of land used
for farming.""",
]

eval_dataset = pd.DataFrame(
    {
        "query": queries,
        "context": contexts,
    }
)
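
Before running the evaluation, it can help to confirm that every placeholder in a prompt template has a matching dataset column. The sketch below is one way to do that with plain Python, assuming placeholders are bare identifiers in braces; `missing_columns` is a hypothetical helper, and `response` is treated as a column generated during evaluation rather than supplied in the dataset.

```python
import re


def missing_columns(
    template: str, columns: list[str], generated: tuple[str, ...] = ()
) -> list[str]:
    """Returns placeholder names with no matching dataset or generated column."""
    placeholders = set(re.findall(r"\{(\w+)\}", template))
    return sorted(placeholders - set(columns) - set(generated))


# Hypothetical template using the same placeholder names as this notebook.
template = "Query: {query}\nContext: {context}\nResponse: {response}"
columns = ["query", "context"]

print(missing_columns(template, columns))  # ['response']
print(missing_columns(template, columns, generated=("response",)))  # []
```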

# Run evaluation

In [None]:
eval_task = EvalTask(
    dataset=eval_dataset,
    metrics=[grounded_metric],
    autorater_config=AutoraterConfig(sampling_count=1),
)

# Model to use for generating responses to evaluate.
eval_model = GenerativeModel(model_name="gemini-2.0-flash-001")

eval_result = eval_task.evaluate(
    model=eval_model,
    prompt_template=RESPONSE_PROMPT_TEMPLATE,
)

# Calculate overall score for metric.
np.mean(eval_result.metrics_table["groundedness/model_resp_score"])
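
The overall score is the mean of the per-response scores, so a response with any `unsupported` or `contradictory` sentence contributes 0 to the average. A worked example with made-up scores (not results from this notebook) also shows a stricter alternative summary, the share of responses that score exactly 1.0:

```python
from statistics import mean

# Hypothetical per-response scores for four evaluated rows.
scores = [1.0, 0.0, 0.75, 1.0]

overall = mean(scores)
fully_grounded_share = sum(score == 1.0 for score in scores) / len(scores)

print(overall)  # 0.6875
print(fully_grounded_share)  # 0.5
```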

In [None]:
pretty_print_df(
    eval_result.metrics_table, hide_columns=["prompt", "groundedness/sentence_verdict"]
)