# Evaluate Generative Model Tool Use | Gen AI Evaluation SDK

## Overview

* Define an API function and a Tool for a Gemini model, then evaluate the quality of the model's tool use with the *Vertex AI Python SDK for Gen AI Evaluation Service*.

## Getting Started

### Set Google Cloud project information and initialize Vertex AI SDK

In [None]:
PROJECT_ID = !gcloud config list --format 'value(core.project)'
PROJECT_ID = PROJECT_ID[0]  # @param {type:"string"}
LOCATION = "us-central1"  # @param {type:"string"}

import vertexai

vertexai.init(project=PROJECT_ID, location=LOCATION)

### Import libraries

In [None]:
# General
import json
import logging
import random
import string
import warnings

import pandas as pd
from IPython.display import Markdown, display

# Main
from vertexai.evaluation import EvalTask
from vertexai.generative_models import GenerativeModel

### Library settings

In [None]:
logging.getLogger("urllib3.connectionpool").setLevel(logging.ERROR)
warnings.filterwarnings("ignore")

### Helper functions

In [None]:
def generate_uuid(length: int = 8) -> str:
    """Generate a random lowercase alphanumeric ID of the given length (default=8)."""
    return "".join(
        random.choices(string.ascii_lowercase + string.digits, k=length)
    )


def display_eval_report(eval_result, metrics=None):
    """Display the evaluation results."""

    title, summary_metrics, report_df = eval_result
    metrics_df = pd.DataFrame.from_dict(summary_metrics, orient="index").T
    if metrics:
        metrics_df = metrics_df.filter(
            [
                metric
                for metric in metrics_df.columns
                if any(selected_metric in metric for selected_metric in metrics)
            ]
        )
        report_df = report_df.filter(
            [
                metric
                for metric in report_df.columns
                if any(selected_metric in metric for selected_metric in metrics)
            ]
        )

    # Display the title with Markdown for emphasis
    display(Markdown(f"## {title}"))

    # Display the metrics DataFrame
    display(Markdown("### Summary Metrics"))
    display(metrics_df)

    # Display the detailed report DataFrame
    display(Markdown("### Report Metrics"))
    display(report_df)

## Evaluate Tool Use and Function Calling Quality for Gemini

#### Tool evaluation metrics

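* `tool_call_valid`: whether the response contains a well-formed tool call.
* `tool_name_match`: whether the tool name in the response matches the reference.
* `tool_parameter_key_match`: how closely the parameter names in the response match the reference.
* `tool_parameter_kv_match`: how closely the parameter names and their values match the reference.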

In [None]:
tool_metrics = [
    "tool_call_valid",
    "tool_name_match",
    "tool_parameter_key_match",
    "tool_parameter_kv_match",
]
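To build intuition for what these four metrics check, the sketch below approximates them locally in plain Python. It is not the service's implementation: the helper names `first_tool_call` and `score_tool_call` are hypothetical, and scoring the two parameter metrics as a fraction of the reference parameters is an assumption made for illustration.

```python
import json


def first_tool_call(payload: str):
    """Return the first tool call in a JSON payload, or None if it is malformed."""
    try:
        call = json.loads(payload)["tool_calls"][0]
    except (ValueError, KeyError, IndexError, TypeError):
        return None
    if not isinstance(call, dict):
        return None
    if not isinstance(call.get("name"), str):
        return None
    if not isinstance(call.get("arguments"), dict):
        return None
    return call


def score_tool_call(response: str, reference: str) -> dict:
    """Approximate the four tool-use metrics for one response/reference pair."""
    ref = first_tool_call(reference)
    if ref is None:
        raise ValueError("reference must contain a well-formed tool call")

    scores = {
        "tool_call_valid": 0.0,
        "tool_name_match": 0.0,
        "tool_parameter_key_match": 0.0,
        "tool_parameter_kv_match": 0.0,
    }
    pred = first_tool_call(response)
    if pred is None:
        return scores  # an invalid call scores zero on every metric

    scores["tool_call_valid"] = 1.0
    scores["tool_name_match"] = float(pred["name"] == ref["name"])

    ref_args, pred_args = ref["arguments"], pred["arguments"]
    if ref_args:  # an empty reference leaves both parameter scores at 0.0 here
        key_hits = sum(key in pred_args for key in ref_args)
        kv_hits = sum(
            key in pred_args and pred_args[key] == value
            for key, value in ref_args.items()
        )
        # Assumption: scores are fractions of the reference parameters.
        scores["tool_parameter_key_match"] = key_hits / len(ref_args)
        scores["tool_parameter_kv_match"] = kv_hits / len(ref_args)
    return scores
```

Running `score_tool_call` on a pair of strings before submitting an evaluation job can help explain why a row scores below 1 on the parameter metrics.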

## 1. Evaluate a Bring-Your-Own-Prediction Dataset

A generative model's tool use quality can be evaluated when the evaluation dataset contains the model's saved tool call responses and the expected references.

In [None]:
response = [
    '{"content": "", "tool_calls": [{"name": "book_tickets", "arguments": {"movie": "Mission Impossible Dead Reckoning Part 1", "theater": "Regal Edwards 14", "location": "Mountain View CA", "showtime": "7:30", "date": "2024-03-30", "num_tix": "2"}}]}',
    '{"content": "", "tool_calls": [{"name": "book_tickets", "arguments": {"movie": "Mission Impossible Dead Reckoning Part 1", "theater": "Regal Edwards 14", "location": "Mountain View CA", "showtime": "7:30", "date": "2024-03-30", "num_tix": "2"}}]}',
    '{"content": "", "tool_calls": [{"name": "book_tickets", "arguments": {"movie": "Mission Impossible Dead Reckoning Part 1", "theater": "Regal Edwards 14"}}]}',
    '{"content": "", "tool_calls": [{"name": "book_tickets", "arguments": {"movie": "Mission Impossible Dead Reckoning Part 1", "theater": "Cinemark", "location": "Mountain View CA", "showtime": "5:30", "date": "2024-03-30", "num_tix": "2"}}]}',
]

reference = [
    '{"content": "", "tool_calls": [{"name": "book_tickets", "arguments": {"movie": "Mission Impossible Dead Reckoning Part 1", "theater": "Regal Edwards 14", "location": "Mountain View CA", "showtime": "7:30", "date": "2024-03-30", "num_tix": "2"}}]}',
    '{"content": "", "tool_calls": [{"name": "book_tickets", "arguments": {"movie": "Godzilla", "theater": "Regal Edwards 14", "location": "Mountain View CA", "showtime": "9:30", "date": "2024-03-30", "num_tix": "2"}}]}',
    '{"content": "", "tool_calls": [{"name": "book_tickets", "arguments": {"movie": "Mission Impossible Dead Reckoning Part 1", "theater": "Regal Edwards 14", "location": "Mountain View CA", "showtime": "7:30", "date": "2024-03-30", "num_tix": "2"}}]}',
    '{"content": "", "tool_calls": [{"name": "book_tickets", "arguments": {"movie": "Mission Impossible Dead Reckoning Part 1", "theater": "Regal Edwards 14", "location": "Mountain View CA", "showtime": "7:30", "date": "2024-03-30", "num_tix": "2"}}]}',
]

eval_dataset = pd.DataFrame(
    {
        "response": response,
        "reference": reference,
    }
)

#### Define EvalTask

**Exercise:** Define the `EvalTask` for this job. It needs `eval_dataset`, `tool_metrics`, and the experiment name. Use this [documentation](https://cloud.google.com/vertex-ai/generative-ai/docs/reference/python/latest/vertexai.preview.evaluation.EvalTask) for reference.

In [None]:
experiment_name = "eval-saved-llm-tool-use"  # @param {type:"string"}

tool_use_eval_task = None  # TODO: Define EvalTask
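As a hint for the exercise, here is a minimal sketch of one possible shape for the answer. The wrapper `build_tool_use_eval_task` is a hypothetical helper introduced for illustration; it assumes `EvalTask` accepts the keyword arguments `dataset`, `metrics`, and `experiment`, and it imports the SDK lazily so the function can be defined without it installed.

```python
def build_tool_use_eval_task(dataset, metrics, experiment):
    """Return an EvalTask for a dataset with `response` and `reference` columns."""
    # Imported lazily; calling this function requires the Vertex AI SDK
    # and an initialized project (see `vertexai.init` earlier in the notebook).
    from vertexai.evaluation import EvalTask

    return EvalTask(dataset=dataset, metrics=metrics, experiment=experiment)
```

With the variables defined in this notebook, the call would look like `build_tool_use_eval_task(eval_dataset, tool_metrics, experiment_name)`.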

In [None]:
run_id = generate_uuid()

experiment_run_name = f"eval-{run_id}"

eval_result = tool_use_eval_task.evaluate(
    experiment_run_name=experiment_run_name
)
display_eval_report(
    (
        "Tool Use Quality Evaluation Metrics",
        eval_result.summary_metrics,
        eval_result.metrics_table,
    )
)

In [None]:
tool_use_eval_task.display_runs()

## 2. Tool Use and Function Calling with Gemini

[Function Calling Documentation](https://cloud.google.com/vertex-ai/generative-ai/docs/multimodal/function-calling)

### Define a function and tool

**Exercise:** Define a function declaration for booking movie tickets. The parameters are "movie", "theater", "location", "showtime", "date" and "num_tix".


In [None]:
from vertexai.generative_models import FunctionDeclaration, Tool

book_tickets_func = (
    FunctionDeclaration()
)  # TODO: Define the FunctionDeclaration for booking movie tickets


book_tickets_tool = Tool(
    function_declarations=[book_tickets_func],
)
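As a hint for the exercise, the sketch below shows one way the declaration might look. The parameter names follow the reference tool call used in this notebook; the descriptions, the choice of string types, the constant `BOOK_TICKETS_PARAMETERS`, and the helper `make_book_tickets_tool` are illustrative assumptions. The SDK import is kept inside the helper so the schema can be inspected without the SDK installed.

```python
# OpenAPI-style schema for the function arguments. All values are typed as
# strings here because the reference tool call stores them as strings.
BOOK_TICKETS_PARAMETERS = {
    "type": "object",
    "properties": {
        "movie": {"type": "string", "description": "Title of the movie"},
        "theater": {"type": "string", "description": "Name of the theater"},
        "location": {"type": "string", "description": "City and state"},
        "showtime": {"type": "string", "description": "Showtime, e.g. 7:30"},
        "date": {"type": "string", "description": "Date as YYYY-MM-DD"},
        "num_tix": {"type": "string", "description": "Number of tickets"},
    },
    "required": ["movie", "theater", "location", "showtime", "date", "num_tix"],
}


def make_book_tickets_tool():
    """Build a Tool wrapping the `book_tickets` declaration (needs the SDK)."""
    from vertexai.generative_models import FunctionDeclaration, Tool

    declaration = FunctionDeclaration(
        name="book_tickets",
        description="Book movie tickets at a theater",  # illustrative wording
        parameters=BOOK_TICKETS_PARAMETERS,
    )
    return Tool(function_declarations=[declaration])
```

Keeping the parameter names identical to those in the reference matters, because the parameter metrics compare names as well as values.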

### Generate a function call

Prompt the Gemini model and include the tool that you defined.

In [None]:
prompt = """I'd like to book 2 tickets for the movie "Mission Impossible Dead Reckoning Part 1"
at the Regal Edwards 14 theater in Mountain View, CA. The showtime is 7:30 PM on March 30th, 2024.
"""

# Replace the model ID if it is not available in your project and region.
gemini_model = GenerativeModel("gemini-2.0-pro")

gemini_response = gemini_model.generate_content(
    prompt,
    tools=[book_tickets_tool],
)

gemini_response.candidates[0].content

### Unpack the Gemini response into a Python dictionary

In [None]:
def unpack_response(response):
    """Convert the first function call in a Gemini response into the JSON
    string format expected by the tool-use metrics."""
    # The first part of the first candidate holds the function call.
    part = response.candidates[0].content.parts[0].to_dict()
    function_call = part["function_call"]
    output = {
        "content": "",
        "tool_calls": [
            {
                "name": function_call["name"],
                # The metrics expect "arguments" rather than Gemini's "args".
                "arguments": function_call["args"],
            }
        ],
    }
    return json.dumps(output)


response = unpack_response(gemini_response)
response
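To see the conversion in isolation, the sketch below applies the same reshaping to a hand-written dictionary instead of a live model response. The helper `to_eval_format` and the `sample_part` value are hypothetical stand-ins; `sample_part` mimics the shape that the unpacking code reads (`function_call` with `name` and `args`) and is not real model output.

```python
import json


def to_eval_format(part: dict) -> str:
    """Reshape a {"function_call": {"name", "args"}} dict into the metrics' JSON."""
    call = part["function_call"]
    return json.dumps(
        {
            "content": "",
            "tool_calls": [{"name": call["name"], "arguments": call["args"]}],
        }
    )


# Hand-written stand-in for `parts[0].to_dict()`; not real model output.
sample_part = {
    "function_call": {
        "name": "book_tickets",
        "args": {
            "movie": "Mission Impossible Dead Reckoning Part 1",
            "num_tix": "2",
        },
    }
}

converted = json.loads(to_eval_format(sample_part))
print(converted["tool_calls"][0]["name"])
```

The only structural changes are the wrapping `tool_calls` list, the empty `content` field, and the rename from `args` to `arguments`.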

### Evaluate Gemini's Function Call Response

In [None]:
reference_str = json.dumps(
    {
        "content": "",
        "tool_calls": [
            {
                "name": "book_tickets",
                "arguments": {
                    "movie": "Mission Impossible Dead Reckoning Part 1",
                    "theater": "Regal Edwards 14",
                    "location": "Mountain View CA",
                    "showtime": "7:30",
                    "date": "2024-03-30",
                    "num_tix": "2",
                },
            }
        ],
    }
)

eval_dataset = pd.DataFrame(
    {"response": [response], "reference": [reference_str]}
)

In [None]:
# Expected Tool Call Response
json.loads(eval_dataset.reference[0])

In [None]:
# Actual Gemini Tool Call Response
json.loads(eval_dataset.response[0])
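Comparing the expected and actual tool calls by eye can miss small differences, such as a value returned as a number where the reference holds a string. The sketch below lists such differences locally; `diff_tool_arguments` is a hypothetical helper, and it assumes both inputs are JSON strings in the `tool_calls` format used in this notebook.

```python
import json


def diff_tool_arguments(response: str, reference: str) -> dict:
    """List argument differences between the first tool calls of two payloads."""
    pred = json.loads(response)["tool_calls"][0]["arguments"]
    ref = json.loads(reference)["tool_calls"][0]["arguments"]
    return {
        # Parameters the reference expects but the response omits.
        "missing": sorted(ref.keys() - pred.keys()),
        # Parameters the response adds that the reference does not have.
        "unexpected": sorted(pred.keys() - ref.keys()),
        # Shared parameters whose values differ (including type differences).
        "mismatched": {
            key: {"response": pred[key], "reference": ref[key]}
            for key in sorted(ref.keys() & pred.keys())
            if pred[key] != ref[key]
        },
    }
```

For example, `diff_tool_arguments(eval_dataset.response[0], eval_dataset.reference[0])` would report which arguments explain a parameter score below 1.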

**Exercise:** As in the first section, define the `EvalTask` for this dataset.

In [None]:
experiment_name = "eval-gemini-model-function-call"  # @param {type:"string"}

gemini_function_call_eval_task = EvalTask()  # TODO: Define the EvalTask

In [None]:
run_id = generate_uuid()

eval_result = gemini_function_call_eval_task.evaluate(
    experiment_run_name=f"eval-{run_id}"
)

display_eval_report(
    (
        "Gemini Tool Use Quality Evaluation Metrics",
        eval_result.summary_metrics,
        eval_result.metrics_table,
    )
)