In [None]:
# Copyright 2025 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Agentspace answer eval using BLEU, ROUGE, BERT, Similarity Score

<table align="left">
  <td style="text-align: center">
    <a href="https://colab.research.google.com/github/GoogleCloudPlatform/generative-ai/blob/main/search/agentspace/agentspace_eval.ipynb">
      <img width="32px" src="https://www.gstatic.com/pantheon/images/bigquery/welcome_page/colab-logo.svg" alt="Google Colaboratory logo"><br> Open in Colab
    </a>
  </td>
  <td style="text-align: center">
    <a href="https://console.cloud.google.com/vertex-ai/colab/import/https:%2F%2Fraw.githubusercontent.com%2FGoogleCloudPlatform%2Fgenerative-ai%2Fmain%2Fsearch%2Fagentspace%2Fagentspace_eval.ipynb">
      <img width="32px" src="https://lh3.googleusercontent.com/JmcxdQi-qOpctIvWKgPtrzZdJJK-J3sWE1RsfjZNwshCFgE_9fULcNpuXYTilIR2hjwN" alt="Google Cloud Colab Enterprise logo"><br> Open in Colab Enterprise
    </a>
  </td>
  <td style="text-align: center">
    <a href="https://console.cloud.google.com/vertex-ai/workbench/deploy-notebook?download_url=https://raw.githubusercontent.com/GoogleCloudPlatform/generative-ai/main/search/agentspace/agentspace_eval.ipynb">
      <img src="https://www.gstatic.com/images/branding/gcpiconscolors/vertexai/v1/32px.svg" alt="Vertex AI logo"><br> Open in Vertex AI Workbench
    </a>
  </td>
  <td style="text-align: center">
    <a href="https://github.com/GoogleCloudPlatform/generative-ai/blob/main/search/agentspace/agentspace_eval.ipynb">
      <img width="32px" src="https://www.svgrepo.com/download/217753/github.svg" alt="GitHub logo"><br> View on GitHub
    </a>
  </td>
</table>

<div style="clear: both;"></div>

| Authors |
| --- |
| [Nikhil Kulkarni](https://github.com/nikhilkul) |
| [Koushik Ghosh](https://github.com/Koushik25feb) |
| [Koyel Guha](https://github.com/koyelguha) |

## Overview: Agentspace Answer Evaluation

This notebook provides a comprehensive framework for evaluating the quality of answers generated by an Agentspace application. It compares the Agentspace output against a "golden dataset" of expected answers using several widely recognized natural language processing (NLP) metrics.

**Target Audience:** Developers and researchers working with Agentspace applications who need to quantitatively assess the performance of their answer generation models.

**Key Features:**

*   **Metric-Based Evaluation:** Utilizes established NLP metrics such as BLEU, ROUGE, BERTScore, and Semantic Similarity to provide a robust evaluation.
*   **Google Sheets Integration:** Seamlessly loads the golden dataset from a specified Google Sheet and can output the evaluation results back to a sheet.
*   **BigQuery Integration:** Provides an option to save the evaluation results to a BigQuery table for further analysis and historical tracking.
*   **Agentspace API Interaction:** Includes helper functions to interact with Agentspace Search and Assist APIs to fetch answers for evaluation.
*   **Qualitative Rating:** Maps the numerical scores from the metrics to a qualitative rating (e.g., Excellent, Good) for easier interpretation.
*   **Configurable:** Allows users to configure project details, engine ID, application type, and input/output sheet URLs.

**Evaluation Framework:**

The notebook employs a multi-faceted evaluation approach using the following NLP metrics:

*   **BLEU (Bilingual Evaluation Understudy):** Measures the n-gram overlap between the generated answer and the expected answer. It is a precision-focused metric, indicating how much of the generated text is present in the reference.
*   **ROUGE (Recall-Oriented Understudy for Gisting Evaluation):** A set of metrics that measure the overlap of n-grams, word sequences, and word pairs between the generated answer and the expected answer. It is a recall-focused family of metrics, indicating how much of the reference text is covered by the generated text. The notebook specifically uses ROUGE-L, which is based on the longest common subsequence, and reports its F-measure.
*   **BERTScore:** Leverages pre-trained BERT embeddings to compute a similarity score between the generated answer and the expected answer. It considers semantic similarity beyond simple word overlap, making it more robust to paraphrasing.
*   **Semantic Similarity:** Calculates the semantic similarity between the generated answer and the expected answer using a Sentence Transformer model (`all-MiniLM-L6-v2`). This provides a measure of how similar the meaning of the two texts is, regardless of the exact wording.

For each question in the golden dataset, the notebook calculates these four scores and assigns each one a qualitative rating (Excellent, Good, Moderate, Low, Poor). The evaluation loop also derives a rating from the average of the BLEU, ROUGE, and BERT scores, although only the per-metric ratings are written to the results. This provides both numerical and easily interpretable qualitative feedback on the performance of the Agentspace application.
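
To make the precision/recall distinction above concrete, here is a minimal pure-Python sketch. It is not the NLTK or `rouge-score` implementation used later in the notebook: it tokenizes on whitespace, looks only at unigrams for the BLEU-style precision, and the function names are hypothetical.

```python
from collections import Counter


def clipped_unigram_precision(expected: str, actual: str) -> float:
    """BLEU-style unigram precision: share of generated tokens found in the reference."""
    ref_counts = Counter(expected.lower().split())
    cand_tokens = actual.lower().split()
    if not cand_tokens:
        return 0.0
    cand_counts = Counter(cand_tokens)
    # Clip each token's credit at the number of times it appears in the reference.
    matched = sum(min(count, ref_counts[token]) for token, count in cand_counts.items())
    return matched / len(cand_tokens)


def lcs_length(a: list[str], b: list[str]) -> int:
    """Length of the longest common subsequence, computed row by row."""
    prev = [0] * (len(b) + 1)
    for token_a in a:
        curr = [0]
        for j, token_b in enumerate(b, start=1):
            if token_a == token_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_recall(expected: str, actual: str) -> float:
    """ROUGE-L recall: LCS length divided by the reference length."""
    ref = expected.lower().split()
    cand = actual.lower().split()
    if not ref:
        return 0.0
    return lcs_length(ref, cand) / len(ref)


expected = "the cat sat on the mat"
actual = "the cat is on the mat"
print(clipped_unigram_precision(expected, actual))  # 5 of 6 generated tokens match
print(rouge_l_recall(expected, actual))  # LCS "the cat on the mat" covers 5 of 6 reference tokens
```

Both values come out to 5/6 here because the two sentences have the same length; they diverge as soon as the generated answer is much longer or shorter than the reference.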

**Input and Output Options:**

*   **Input (Golden Dataset):** The golden dataset, containing the test queries and their corresponding expected answers, is loaded from a Google Sheet. You need to provide the Google Drive URL of your sheet and the name of the worksheet containing the data. The notebook expects at least two columns: one for the query/question and one for the expected answer. Column names MUST be `search_query` and `expected_answers`.

Sample:

| `search_query` | `expected_answers` |
|----------------|--------------------|
|                |                    |
|                |                    |

*   **Output (Evaluation Results):** The evaluation results, including the calculated scores (BLEU, ROUGE, BERTScore, Semantic Similarity) and their corresponding qualitative ratings for each question, can be saved in the following formats:
    *   **CSV File:** The results are saved to a local CSV file within the Colab environment.
    *   **Google Sheet:** The results can be written to a specified worksheet within your Google Sheet. The notebook handles the case where the worksheet already exists.
    *   **BigQuery Table:** The results can be appended to a BigQuery table, including a timestamp for each run, allowing for historical tracking and further analysis using BigQuery's capabilities. You need to provide the dataset ID and table name.
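
Because the evaluation loop looks up the input columns by exact name, a quick header check before running it can save a failed run. The following is a small sketch; `REQUIRED_COLUMNS` and `missing_columns` are hypothetical names, not part of the notebook.

```python
REQUIRED_COLUMNS = ("search_query", "expected_answers")


def missing_columns(header: list[str]) -> list[str]:
    """Return the required column names that are absent from the sheet header."""
    present = {name.strip() for name in header}
    return [name for name in REQUIRED_COLUMNS if name not in present]


# A header with a space instead of an underscore fails the check.
header = ["search_query", "expected answers"]
print(missing_columns(header))  # ['expected_answers']
```

With a DataFrame loaded from the sheet, the same check could be run as `missing_columns(list(df.columns))`.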

**How to Use:**

1.  **Setup:** Provide your Google Cloud project details, Agentspace engine ID, and the URLs for your golden dataset Google Sheet, as well as the desired BigQuery dataset and table names if using that option.
2.  **Authentication:** Authenticate your Google Cloud account and enable the necessary APIs (Discovery Engine, Sheets, Drive, BigQuery).
3.  **Data Loading:** The notebook will fetch your golden dataset from the specified Google Sheet.
4.  **Answer Retrieval:** The notebook will query your Agentspace application with the questions from the golden dataset to get the generated answers.
5.  **Evaluation:** The notebook will compute the specified NLP metrics by comparing the generated answers to the expected answers in your golden dataset.
6.  **Results:** The results, including the scores and qualitative ratings, will be saved to the configured output options (CSV, Google Sheet, and/or BigQuery).

**Prerequisites:**

*   Access to a Google Cloud project.
*   An Agentspace application (Search or Assist).
*   A Google Sheet containing your golden dataset with at least two columns: one for the query/question and one for the expected answer.
*   Necessary APIs enabled in your Google Cloud project (Discovery Engine, Sheets, Drive, BigQuery).

This notebook can be easily adapted to evaluate other answer generation systems by modifying the helper functions to interact with the relevant APIs.

## Step 1: Initialization

In [None]:
# @title Step 1.1 Install necessary libraries

%pip install --upgrade --quiet pandas openpyxl nltk rouge-score bert-score transformers sentence-transformers gspread colabtools google-cloud-discoveryengine

In [None]:
# @title Step 1.2 Import necessary libraries
import datetime
import json
import logging
import time

import google.auth
import google.auth.transport.requests
from google.colab import auth
import requests
import vertexai

# Application default credentials; the token is refreshed in Step 3.2
creds, _ = google.auth.default()
auth_req = google.auth.transport.requests.Request()

In [None]:
# @title Add logger

logger = logging.getLogger("agentspace_eval")
logger.setLevel(logging.DEBUG)
log_file_path = "./agentspace_eval_notebook_logs.log"

# Ensuring that handlers are not added multiple times if the cell is run multiple times
# This prevents duplicate log entries in the file and console
if logger.hasHandlers():
    logger.handlers.clear()  # Clear existing handlers


file_handler = logging.FileHandler(log_file_path, mode="a")
file_handler.setLevel(logging.DEBUG)
formatter = logging.Formatter("%(asctime)s - %(levelname)s - %(name)s - %(message)s")
file_handler.setFormatter(formatter)
logger.addHandler(file_handler)

# Optionally, adding a StreamHandler to also print logs to the Colab console output
console_handler = logging.StreamHandler()
console_handler.setLevel(logging.INFO)
console_handler.setFormatter(formatter)
logger.addHandler(console_handler)


logger.info(f"Logging initialized. Logs will be saved to: {log_file_path}")

## Step 2: Setup

In [None]:
# @title Step 2.1 Set the project related configuration.

# Use project number and not Project ID
project_num = "00000000"  # @param{ type : 'string' }

# Engine ID is the ID of the Agentspace APP
engine_id = "agentspace_app_engine_id"  # @param{ type : 'string' }

# Application type: "search" or "assist".
# "search" means Search + Answer; "assist" means Search + Assist.
app_type = "search"  # @param{ type : 'string' }

# Location global, us and eu
location = "global"  # @param ["us", "eu", "global"]

# Use Project ID
auth_project_id = "[your-project-id]"  # @param{ type : 'string' }

# Input Queries
# Note: Every user needs to have their own copy of this doc. Please make a copy of the golden data google sheet below and add that link.
eval_data_google_drive_url = "[spreadsheet-url]"  # @param{ type : 'string' }

# Name of the worksheet that holds the input queries
worksheet_name = "input_queries"  # @param{ type : 'string' }

# (Optional) Output file name used for debugging. This file will be saved in the colab env.
output_file_name = "test_output"  # @param{ type : 'string' }

# Eval data worksheet
sheet_name_suffix = datetime.datetime.fromtimestamp(time.time()).strftime(
    "%Y-%m-%d %H:%M:%S"
)
eval_data_worksheet_name = "sample_outputs"  # @param{ type : 'string' }

# Eval data with metrics worksheet
eval_data_results_worksheet_name = "eval_data_results"  # @param{ type : 'string' }

# Top 'K' for search Results
K = 10  # @param{ type : 'integer' }

In [None]:
# @title Step 2.2 Enable Google Sheets Integration

# Enable the Google Sheets and Google Drive APIs by visiting the links printed below
# (only needed if you load the golden dataset from a spreadsheet).

print(
    f"https://console.developers.google.com/apis/api/sheets.googleapis.com/overview?project={auth_project_id}"
)
print(
    f"https://console.developers.google.com/apis/api/drive.googleapis.com/overview?project={auth_project_id}"
)

## Step 3: Authenticate your Google Cloud Account and enable APIs


In [None]:
# @title Step 3.1 Authenticate your Google Cloud Account and enable APIs

# Authenticate gcloud.
auth.authenticate_user(project_id=auth_project_id)

region = "us-central1"  # region = global is not supported yet

# Configure gcloud.
!gcloud config set project {auth_project_id}
!gcloud config get-value project

# Initialize Vertex AI
vertexai.init(project=auth_project_id, location=region)

In [None]:
# @title Step 3.2 Authenticate refresh

creds.refresh(auth_req)

## Step 4: Helper Functions


In [None]:
# @title Step 4.0: Build the Discovery engine API based on the location

if location == "us":
    base_discovery_engine_domain = "us-discoveryengine.googleapis.com"
elif location == "eu":
    base_discovery_engine_domain = "eu-discoveryengine.googleapis.com"
else:  # Default to global if the location is not explicitly 'us' or 'eu'
    base_discovery_engine_domain = "discoveryengine.googleapis.com"
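
The host selection above can also be wrapped in a small helper that keeps the host and the `locations/...` segment of the resource path in sync. This is a sketch under assumptions: the function names are hypothetical, and the path layout simply mirrors the URLs used in the cells that follow.

```python
def discovery_engine_domain(location: str) -> str:
    """Regional locations use a prefixed host; anything else falls back to the global host."""
    if location in ("us", "eu"):
        return f"{location}-discoveryengine.googleapis.com"
    return "discoveryengine.googleapis.com"


def engine_resource_url(
    project: str, location: str, engine_id: str, api_version: str = "v1alpha"
) -> str:
    """Build the engine URL with the same location in both the host and the path."""
    return (
        f"https://{discovery_engine_domain(location)}/{api_version}"
        f"/projects/{project}/locations/{location}"
        f"/collections/default_collection/engines/{engine_id}"
    )


# Placeholder project and engine names, for illustration only.
print(engine_resource_url("my-project", "eu", "my-engine"))
```

Suffixes such as `/servingConfigs/default_search:search` or `/sessions` can then be appended to the returned string.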

In [None]:
# @title Step 4.1: Search

from functools import lru_cache

serving_config_id = "default_search"  # @param{ type : 'string' }
discovery_engine_url = f"https://{base_discovery_engine_domain}/v1alpha/projects/{auth_project_id}/locations/{location}/collections/default_collection/engines/{engine_id}/servingConfigs/{serving_config_id}"

# Create serving config with Control
create_serving_config_response = requests.post(
    discovery_engine_url,
    headers={
        "Content-Type": "application/json",
        "Authorization": "Bearer " + creds.token,
        "X-Goog-User-Project": auth_project_id,
    },
    json={
        "displayName": f"{serving_config_id}",
        "solutionType": "SOLUTION_TYPE_SEARCH",
    },
)


@lru_cache(maxsize=None)
def get_search_results(query, num=K):
    search_response = requests.post(
        f"{discovery_engine_url}:search",
        headers={
            "Content-Type": "application/json",
            "Authorization": "Bearer " + creds.token,
            "X-Goog-User-Project": auth_project_id,
        },
        json={
            "query": query,
            "pageSize": num,
        },
    )
    results = search_response.json()
    logger.info(f"query : {query}")
    logger.info(f"Response code:  {search_response.status_code}")
    return results

In [None]:
# @title Step 4.2: Assistant


ENDPOINT = f"https://{base_discovery_engine_domain}"
ASSISTANT_NAME = f"projects/{auth_project_id}/locations/{location}/collections/default_collection/engines/{engine_id}/assistants/default_assistant"


@lru_cache(maxsize=None)
def get_assist_results(query: str):
    response = requests.post(
        f"{ENDPOINT}/v1alpha/{ASSISTANT_NAME}:assist",
        headers={
            "Content-Type": "application/json; charset=utf-8",
            "Authorization": f"Bearer {creds.token}",
            "X-Goog-User-Project": f"{auth_project_id}",
        },
        data=json.dumps({"query": {"text": query}}),
    )
    if response.status_code != 200:
        logger.error(f"Assistant failed for query: {query} | {response.content}")
        answer = "FAILED"
        assist_token = "None"
    else:
        assist_response = response.json()
        assist_token = assist_response.get("assistToken")
        answer_data = assist_response.get("answer")
        state = answer_data.get("state")
        if state == "SKIPPED":
            answer = answer_data["assistSkippedReasons"][0]
        else:
            answer = (
                answer_data.get("replies")[0]
                .get("groundedContent")
                .get("content")
                .get("text")
            )
    return answer, assist_token

In [None]:
# @title Step 4.3: Answer


session = requests.post(
    f"https://{base_discovery_engine_domain}/v1alpha/projects/{auth_project_id}/locations/{location}/collections/default_collection/engines/{engine_id}/sessions",  # auto session model
    headers={
        "Content-Type": "application/json",
        "Authorization": "Bearer " + creds.token,
    },
    json={
        "userPseudoId": "12345",  # customer id
    },
)


@lru_cache(maxsize=None)
def get_answer_results(query):
    response = requests.post(
        f"https://{base_discovery_engine_domain}/v1alpha/projects/{auth_project_id}/locations/{location}/collections/default_collection/engines/{engine_id}/servingConfigs/default_search:answer",
        headers={
            "Content-Type": "application/json",
            "Authorization": "Bearer " + creds.token,
        },
        json={
            "query": {"text": query},
            "searchSpec": {
                "searchParams": {
                    "maxReturnResults": K,
                },
            },
            "session": session.json()["name"],
        },
    )

    if response.status_code != 200:
        logger.error(f"Answer API failed for query: {query} | {response.content}")
        answer_data = "FAILED"
        answer_token = "None"
    else:
        answer_response = response.json()
        answer_token = answer_response.get("answerQueryToken")
        answer_data = answer_response.get("answer").get("answerText")
    return answer_data, answer_token
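
The chained `.get(...)` calls in Step 4.2 raise an exception if any intermediate key is missing from the assist response. A more defensive extraction might look like the sketch below. It assumes the response shape used in Step 4.2 (`answer.replies[0].groundedContent.content.text`); the function name and the sentinel strings `UNKNOWN_REASON`, `NO_REPLY`, and `NO_TEXT` are hypothetical.

```python
def extract_assist_text(assist_response: dict) -> str:
    """Pull the answer text out of an assist response without raising on missing keys."""
    answer = assist_response.get("answer") or {}
    if answer.get("state") == "SKIPPED":
        # Report the first skip reason, or a placeholder if none is given.
        reasons = answer.get("assistSkippedReasons") or ["UNKNOWN_REASON"]
        return str(reasons[0])
    replies = answer.get("replies") or []
    if not replies:
        return "NO_REPLY"
    content = (replies[0].get("groundedContent") or {}).get("content") or {}
    return content.get("text") or "NO_TEXT"


# Hand-written sample payload, not a real API response.
sample = {"answer": {"replies": [{"groundedContent": {"content": {"text": "Hello"}}}]}}
print(extract_assist_text(sample))  # Hello
```

Returning a sentinel string keeps the row in the results, so failed extractions show up as low scores instead of aborting the whole evaluation run.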

## Step 5: Get Golden Dataset

In [None]:
# @title Step 5.1: Get Golden Dataset from a local CSV file (alternative to Step 5.2)
import pandas as pd

df = pd.read_csv("/content/golden_dataset.csv")

In [None]:
# @title Step 5.2: Get Golden Dataset from Google Drive/Sheet
from google.auth import default
from google.colab import auth, drive
import gspread
import pandas as pd

drive.mount("/content/drive")
auth.authenticate_user()
creds, _ = default()
gc = gspread.authorize(creds)

# Open the spreadsheet and worksheet configured in Step 2.1
spreadsheet = gc.open_by_url(eval_data_google_drive_url)
worksheet = spreadsheet.worksheet(worksheet_name)
# spreadsheet.add_worksheet(f"{eval_data_worksheet_name}", rows="100", cols="20")

# Fetch all rows in one API call; the first row holds the column names
rows = worksheet.get_all_values()
df = pd.DataFrame(rows[1:], columns=rows[0])

In [None]:
# @title Step 5.3 Populate Agentspace answer based on the Golden dataset

df[["answer_result", "answer_token"]] = df["search_query"].apply(
    lambda q: pd.Series(get_answer_results(q))
)

In [None]:
# @title Step 5.4: Visualise the Golden dataset to verify
df

## Step 6: Evaluation Functionality


In [None]:
# @title Step 6.1: Evaluation imports

from bert_score import score as bert_score
import nltk
from nltk.translate.bleu_score import sentence_bleu
import pandas as pd
from rouge_score import rouge_scorer
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer(
    "all-MiniLM-L6-v2"  # similarity score model (open source - Hugging Face based)
)

In [None]:
# @title Step 6.2: Download the NLTK punkt tokenizer data (used to tokenize text for BLEU)

nltk.download("punkt")
nltk.download("punkt_tab")

In [None]:
# @title Step 6.3: Evaluation utility functions


def get_semantic_score(expected, actual):
    emb1 = model.encode(expected, convert_to_tensor=True)
    emb2 = model.encode(actual, convert_to_tensor=True)
    similarity = util.cos_sim(emb1, emb2).item()
    return similarity  # Cosine similarity, in the range [-1, 1]


def compute_scores(expected, actual):
    """Compute BLEU, ROUGE-L (F-measure), and BERTScore (F1) for one answer pair."""
    # BLEU
    reference = [nltk.word_tokenize(expected.lower())]
    candidate = nltk.word_tokenize(actual.lower())
    bleu = sentence_bleu(reference, candidate)

    # ROUGE
    scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
    rouge = scorer.score(expected, actual)["rougeL"].fmeasure

    # BERT Score
    P, R, F1 = bert_score([actual], [expected], lang="en", verbose=False)
    bert = F1[0].item()

    return bleu, rouge, bert


def score_to_rating(score):
    """Map metric values to a qualitative rating"""

    if score > 0.8:
        return 5, "Excellent match – high semantic and lexical similarity."
    elif score > 0.6:
        return 4, "Good match – minor differences, but meaning mostly intact."
    elif score > 0.4:
        return 3, "Moderate match – some loss in meaning."
    elif score > 0.2:
        return 2, "Low match – significant differences from expected."
    else:
        return 1, "Poor match – largely incorrect or off-topic."

In [None]:
# @title Step 6.4: Score the generated answers against the golden dataset using BLEU, ROUGE, BERTScore, and semantic similarity
results = []

for _, row in df.iterrows():
    question = row["search_query"]
    expected = row["expected_answers"]
    actual = row["answer_result"]

    bleu, rouge, bert = compute_scores(expected, actual)

    avg = (bleu + rouge + bert) / 3
    score, reasoning = score_to_rating(avg)
    similarity = get_semantic_score(expected, actual)

    results.append(
        {
            "question": question,
            "expected_answer": expected,
            "actual_answer": actual,
            "BLEU_score": bleu,
            "BLEU_rating": score_to_rating(bleu),
            "ROUGE_score": rouge,
            "ROUGE_rating": score_to_rating(rouge),
            "BERTScore": bert,
            "BERT_rating": score_to_rating(bert),
            "similarity": similarity,
            "similarity_rating": score_to_rating(similarity),
        }
    )
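
Once `results` is populated, a per-metric summary gives a quick read on the whole run. The sketch below uses only the standard library and reuses the thresholds from `score_to_rating`; `rating_label`, `summarize`, and the sample rows are hypothetical and only illustrate the idea.

```python
from collections import Counter
from statistics import mean


def rating_label(score: float) -> str:
    """Same thresholds as score_to_rating, reduced to a one-word label."""
    if score > 0.8:
        return "Excellent"
    if score > 0.6:
        return "Good"
    if score > 0.4:
        return "Moderate"
    if score > 0.2:
        return "Low"
    return "Poor"


def summarize(results: list[dict], metric: str) -> dict:
    """Mean score and rating distribution for one metric column."""
    scores = [row[metric] for row in results]
    return {
        "mean": mean(scores),
        "ratings": dict(Counter(rating_label(s) for s in scores)),
    }


# Hypothetical rows; real rows come from the evaluation loop above.
sample_results = [
    {"similarity": 0.91},
    {"similarity": 0.65},
    {"similarity": 0.15},
]
print(summarize(sample_results, "similarity"))
```

Calling `summarize(results, "BLEU_score")` or `summarize(results, "BERTScore")` on the real results would give the same summary for the other metrics.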

## Step 7: Saving the Eval result

In [None]:
# @title Step 7.1: Convert the results to a DataFrame and add a timestamp
import datetime

output_df = pd.DataFrame(results)
# Add a timestamp column to the DataFrame
output_df["timestamp"] = datetime.datetime.now()

In [None]:
# @title Step 7.2: Save the output to a CSV file

output_csv_file_name = "eval_output.csv"  # @param{ type : 'string' }
output_df.to_csv(output_csv_file_name, index=False)

In [None]:
# @title Step 7.3: Save the output to a Google Sheet

spreadsheet = gc.open_by_url(f"{eval_data_google_drive_url}")

try:
    worksheet_eval_data = spreadsheet.add_worksheet(
        f"{eval_data_worksheet_name}", rows="100", cols="20"
    )
except gspread.exceptions.APIError as e:
    logger.info(
        f"Worksheet '{eval_data_worksheet_name}' already exists or another API error occurred: {e}"
    )
    # If the worksheet already exists, get it
    worksheet_eval_data = spreadsheet.worksheet(eval_data_worksheet_name)
else:
    pass  # No need to get the worksheet again if it was just created

# Convert tuple columns to strings
for col in ["BLEU_rating", "ROUGE_rating", "BERT_rating", "similarity_rating"]:
    if col in output_df.columns:
        output_df[col] = output_df[col].apply(
            lambda x: f"{x[0]}: {x[1]}" if isinstance(x, tuple) else x
        )

# Convert timestamp column to string
if "timestamp" in output_df.columns:
    output_df["timestamp"] = output_df["timestamp"].astype(str)

worksheet_eval_data.clear()
worksheet_eval_data.update(
    [output_df.columns.values.tolist()] + output_df.values.tolist()
)

In [None]:
# @title Step 7.4: Save the output to a BigQuery table

# provide the BigQuery details here
dataset_id = "agentspace_eval_dataset"  # @param{type: 'string'}
table_name = "agentspace_eval_result"  # @param{type: 'string'}
project_id = auth_project_id  # Use the authenticated project ID

import datetime

from google.cloud import bigquery
from google.cloud.exceptions import NotFound

client = bigquery.Client(project=project_id)

# Add a timestamp column to the DataFrame (if not already added in a previous step)
if "timestamp" not in output_df.columns:
    output_df["timestamp"] = datetime.datetime.now()

# Convert tuple columns to strings before saving to BigQuery
for col in ["BLEU_rating", "ROUGE_rating", "BERT_rating", "similarity_rating"]:
    if col in output_df.columns:
        output_df[col] = output_df[col].apply(
            lambda x: f"{x[0]}: {x[1]}" if isinstance(x, tuple) else x
        )

# Convert timestamp column to string before saving to BigQuery
if "timestamp" in output_df.columns:
    output_df["timestamp"] = output_df["timestamp"].astype(str)

# Construct a full table ID
table_id = f"{project_id}.{dataset_id}.{table_name}"

# Check if the dataset exists, create if not
try:
    client.get_dataset(dataset_id)
    logger.info(f"Dataset {dataset_id} already exists.")
except NotFound:
    logger.info(f"Dataset {dataset_id} not found. Creating dataset.")
    dataset = bigquery.Dataset(f"{project_id}.{dataset_id}")
    # BigQuery has no "global" location; assumption: fall back to the US multi-region
    dataset.location = "US" if location == "global" else location.upper()
    client.create_dataset(dataset, timeout=30)
    logger.info(f"Dataset {dataset_id} created.")


# Check if the table exists, create if not
try:
    client.get_table(table_id)
    logger.info(f"Table {table_name} already exists.")
except NotFound:
    logger.info(f"Table {table_name} not found. Creating table.")
    # Create table with inferred schema (schema will be inferred during the load job)
    table = bigquery.Table(table_id)
    client.create_table(table, exists_ok=True)
    logger.info(f"Table {table_name} created.")


# Load data from DataFrame to BigQuery
job_config = bigquery.LoadJobConfig(
    write_disposition="WRITE_APPEND",  # Append data if table exists
)

job = client.load_table_from_dataframe(output_df, table_id, job_config=job_config)

job.result()  # Wait for the job to complete.

logger.info(f"Loaded {job.output_rows} rows into {table_id}.")

## Learnings and Conclusion

### What We Learned

This notebook provided a comprehensive framework for evaluating the quality of answers generated by an Agentspace application. Our process involved several key stages:

1.  **Environment Setup and Configuration:**
    *   We successfully installed and imported all necessary libraries for data manipulation (pandas), NLP evaluation (NLTK, rouge-score, bert-score, sentence-transformers), and Google Cloud integration (google-cloud-discoveryengine, gspread, google-auth).
    *   We configured essential parameters, including Google Cloud project details, Agentspace engine ID, application type (search/assist), and locations for input (Google Sheets) and output (CSV, Google Sheets, BigQuery).

2.  **Data Ingestion and Answer Retrieval:**
    *   The notebook demonstrated how to load a "golden dataset" containing queries and their expected answers from a Google Sheet.
    *   We utilized helper functions to interact with the Agentspace Discovery Engine APIs (specifically `get_answer_results` in the main flow) to retrieve answers generated by the Agentspace application for each query in our golden dataset.

3.  **Metric-Based Evaluation:**
    *   We implemented a robust evaluation methodology using four distinct NLP metrics:
        *   **BLEU Score:** To measure n-gram precision against reference answers.
        *   **ROUGE-L Score:** To measure overlap based on the longest common subsequence, reported as an F-measure.
        *   **BERTScore:** To assess semantic similarity using contextual embeddings, going beyond lexical overlap.
        *   **Semantic Similarity:** Calculated using the `all-MiniLM-L6-v2` Sentence Transformer model to capture the similarity in meaning between generated and expected answers.
    *   For each query, we computed these four scores by comparing the Agentspace-generated answer with the corresponding expected answer.
    *   A qualitative rating system (Excellent, Good, Moderate, Low, Poor) was applied to the numerical scores, providing an easily interpretable assessment of answer quality for each metric.

4.  **Results Storage and Reporting:**
    *   The evaluation results, including the raw scores, qualitative ratings, original queries, expected answers, and actual answers, were compiled into a structured Pandas DataFrame.
    *   A timestamp was added to each evaluation run for tracking purposes.
    *   The notebook demonstrated multiple ways to persist these results:
        *   Saving to a local CSV file.
        *   Writing to a new or existing worksheet in a Google Sheet.
        *   Appending to a BigQuery table, enabling historical analysis and more complex querying. The notebook also handled the creation of the BigQuery dataset and table if they didn't already exist.

### Conclusion

This notebook successfully establishes a repeatable and quantitative process for evaluating the performance of an Agentspace application's answer generation capabilities. By leveraging a golden dataset and a suite of established NLP metrics, we can:

*   **Objectively measure answer quality:** Moving beyond subjective assessments to data-driven insights.
*   **Identify areas for improvement:** Pinpoint queries or types of queries where the Agentspace application may be underperforming.
*   **Track performance over time:** By storing results (especially in BigQuery), we can monitor how changes to the Agentspace configuration, underlying data, or models impact answer quality.
*   **Benchmark different configurations:** The framework can be used to compare the performance of different Agentspace engine settings or versions.

The integration with Google Sheets and BigQuery makes the golden dataset management and results analysis highly accessible and scalable. The qualitative ratings alongside numerical scores offer a balanced view, catering to both technical and non-technical stakeholders.

**Potential Next Steps:**

*   **Automate the evaluation pipeline:** Integrate this notebook into a CI/CD pipeline for regular, automated performance checks.
*   **Expand the golden dataset:** Continuously add more diverse and challenging queries to the golden dataset for more comprehensive testing.
*   **Error Analysis:** Perform a deeper dive into queries with low scores to understand the root causes of poor performance (e.g., issues with grounding, summarization, or relevance).
*   **Experiment with different models/settings:** Use this evaluation framework to systematically test the impact of different Agentspace configurations or underlying LLMs.
*   **Visualize results:** Create dashboards (e.g., in Looker Studio using BigQuery data) to visualize evaluation trends and key metrics.

This evaluation framework is a valuable asset for anyone looking to rigorously assess and improve their Agentspace applications.