In [4]:
!pip install -Uq "google-genai==1.7.0"

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
google-adk 1.18.0 requires google-genai<2.0.0,>=1.45.0, but you have google-genai 1.7.0 which is incompatible.
google-cloud-aiplatform 1.125.0 requires google-genai<2.0.0,>=1.37.0, but you have google-genai 1.7.0 which is incompatible.

In [5]:
from google import genai
from google.genai import types
from IPython.display import Markdown, display

genai.__version__

'1.7.0'

**Set up your API key**

To run the following cell, your API key must be stored in a Kaggle secret named *GOOGLE_API_KEY*.

To make the key available through Kaggle secrets, choose Secrets from the Add-ons menu and follow the instructions to add your key or enable it for this notebook.

In [6]:
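from kaggle_secrets import UserSecretsClient

user_secrets = UserSecretsClient()
# The label passed to get_secret() must match the name of the Kaggle secret
# that holds the API key (GOOGLE_API_KEY, as described above).
secret_value_0 = user_secrets.get_secret("GOOGLE_API_KEY")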


If you received an error response along the lines of No user secrets exist for kernel id ..., then you need to add your API key via Add-ons, Secrets and enable it.

In [7]:
from google.api_core import retry

# Retry only on HTTP 429 (too many requests) and 503 (service unavailable).
is_retriable = lambda e: (isinstance(e, genai.errors.APIError) and e.code in {429, 503})

if not hasattr(genai.models.Models.generate_content, '__wrapped__'):
    genai.models.Models.generate_content = retry.Retry(
      predicate=is_retriable)(genai.models.Models.generate_content)
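
The cell above wraps `generate_content` so that transient errors are retried automatically. As a rough illustration of the idea only, here is a minimal standard-library sketch of "retry while a predicate says the error is transient". It is not how `google.api_core.retry` is implemented: `FakeAPIError`, `call_with_retry` and its parameters are hypothetical names made up for this example.

```python
import time


class FakeAPIError(Exception):
    """Hypothetical stand-in for an API error that carries an HTTP-style code."""

    def __init__(self, code):
        super().__init__(f"API error {code}")
        self.code = code


def is_retriable(exc):
    # Same rule as the predicate in the cell above: retry only on 429 and 503.
    return isinstance(exc, FakeAPIError) and exc.code in {429, 503}


def call_with_retry(fn, max_attempts=4, initial_delay=1.0, multiplier=2.0,
                    sleep=time.sleep):
    """Call fn(), retrying retriable errors with exponentially growing delays."""
    delay = initial_delay
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception as exc:
            # Give up on the last attempt or on errors that are not transient.
            if attempt == max_attempts or not is_retriable(exc):
                raise
            sleep(delay)
            delay *= multiplier


# Usage: a function that fails twice with 429 and then succeeds.
calls = []
delays = []


def flaky():
    calls.append(1)
    if len(calls) < 3:
        raise FakeAPIError(429)
    return "ok"


# Record the delays instead of actually sleeping.
result = call_with_retry(flaky, sleep=delays.append)
```

Passing the `sleep` function as a parameter keeps the sketch quick to run; a real retry policy would normally also add jitter and an overall deadline.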


In [8]:
client = genai.Client(api_key=secret_value_0)

**Evaluation**

When using LLMs in real-world cases, it's important to understand how well they are performing. The open-ended generation capabilities of LLMs can make many cases difficult to measure. In this notebook you will walk through some simple techniques for evaluating LLM outputs and understanding their performance.

*For this example, you'll evaluate a summarisation task using the [Gemini 1.5 Pro technical report](http://). Start by downloading the PDF to the notebook environment, and uploading that copy for use with the Gemini API.*

In [9]:
!wget -nv -O gemini.pdf https://storage.googleapis.com/cloud-samples-data/generative-ai/pdf/2403.05530.pdf

document_file = client.files.upload(file='gemini.pdf')

2025-11-25 16:57:37 URL:https://storage.googleapis.com/cloud-samples-data/generative-ai/pdf/2403.05530.pdf [7228817/7228817] -> "gemini.pdf" [1]


**Summarise a document**


The summarisation request used here is fairly basic. It targets the training content specifically but **provides no guidance otherwise.**

In [10]:
request = 'Tell me about the training process used here.'

def summarise_doc(request: str) -> str:
    """Execute the request on the uploaded document."""
    # Set the temperature low to stabilise the output.
    config = types.GenerateContentConfig(temperature=0.0)
    response = client.models.generate_content(
        model='gemini-2.0-flash',
        config=config,
        contents=[request, document_file]
    )
    return response.text


summary = summarise_doc(request)
Markdown(summary)

Based on the document you provided, here's a breakdown of the training process used for Gemini 1.5 Pro:

**1. Data:**

*   **Multimodal and Multilingual Data:** The model is trained on a diverse dataset that includes text, images, audio, and video content. The text data is sourced from various domains, including web documents and code.
*   **Pre-training Dataset:** The pre-training dataset includes data sourced across many different domains, including web documents and code, and incorporates image, audio, and video content.
*   **Instruction-Tuning Phase:** Gemini 1.5 Pro is fine-tuned on a collection of multimodal data containing paired instructions and appropriate responses, with further tuning based on human preference data.

**2. Architecture:**

*   **Mixture-of-Experts (MoE) Transformer:** Gemini 1.5 Pro is based on a sparse MoE Transformer architecture. This allows the model to have a large number of parameters while only activating a subset for any given input.

**3. Infrastructure:**

*   **TPUv4 Accelerators:** The model is trained on multiple 4096-chip pods of Google's TPUv4 accelerators, distributed across multiple datacenters.

**4. Training Process:**

*   **Pre-training:** The model is initially pre-trained on the large multimodal dataset.
*   **Instruction Tuning:** After pre-training, the model is fine-tuned on a collection of multimodal data containing paired instructions and appropriate responses.
*   **Human Preference Tuning:** Further tuning is performed based on human preference data.

**5. Key Improvements:**

*   **Architecture:** Improvements across the model stack, including architecture, data, optimization, and systems.
*   **Long-Context Understanding:** Significant architecture changes enable understanding of inputs up to 10 million tokens without performance degradation.

**In summary:** Gemini 1.5 Pro is trained using a large, diverse multimodal dataset on Google's TPUv4 infrastructure. It uses a MoE Transformer architecture and undergoes pre-training, instruction tuning, and human preference tuning. The training process incorporates improvements across the model stack to enable long-context understanding and overall performance.

**Define an evaluator**

For a task like this, you may wish to evaluate a number of aspects, like how well the model followed the prompt ("instruction following"), whether it included relevant data in the prompt ("groundedness"), how easy the text is to read ("fluency"), or other factors like "verbosity" or "quality".

You can instruct an LLM to perform these tasks in a similar manner to how you would instruct a human rater: with a clear definition and [assessment rubric](http://).

In this step, you define an evaluation agent using a pre-written "summarisation" prompt and use it to gauge the quality of the generated summary.

Note: For more pre-written evaluation prompts covering groundedness, safety, coherence and more, check out this [comprehensive list of model-based evaluation prompts](http://) from the Google Cloud docs.


In [11]:
import enum

# Define the evaluation prompt
SUMMARY_PROMPT = """\
# Instruction
You are an expert evaluator. Your task is to evaluate the quality of the responses generated by AI models.
We will provide you with the user input and an AI-generated responses.
You should first read the user input carefully for analyzing the task, and then evaluate the quality of the responses based on the Criteria provided in the Evaluation section below.
You will assign the response a rating following the Rating Rubric and Evaluation Steps. Give step-by-step explanations for your rating, and only choose ratings from the Rating Rubric.

# Evaluation
## Metric Definition
You will be assessing summarization quality, which measures the overall ability to summarize text. Pay special attention to length constraints, such as in X words or in Y sentences. The instruction for performing a summarization task and the context to be summarized are provided in the user prompt. The response should be shorter than the text in the context. The response should not contain information that is not present in the context.

## Criteria
Instruction following: The response demonstrates a clear understanding of the summarization task instructions, satisfying all of the instruction's requirements.
Groundedness: The response contains information included only in the context. The response does not reference any outside information.
Conciseness: The response summarizes the relevant details in the original text without a significant loss in key information without being too verbose or terse.
Fluency: The response is well-organized and easy to read.

## Rating Rubric
5: (Very good). The summary follows instructions, is grounded, is concise, and fluent.
4: (Good). The summary follows instructions, is grounded, concise, and fluent.
3: (Ok). The summary mostly follows instructions, is grounded, but is not very concise and is not fluent.
2: (Bad). The summary is grounded, but does not follow the instructions.
1: (Very bad). The summary is not grounded.

## Evaluation Steps
STEP 1: Assess the response in aspects of instruction following, groundedness, conciseness, and verbosity according to the criteria.
STEP 2: Score based on the rubric.

# User Inputs and AI-generated Response
## User Inputs

### Prompt
{prompt}

## AI-generated Response
{response}
"""

# Define a structured enum class to capture the result.
class SummaryRating(enum.Enum):
  VERY_GOOD = '5'
  GOOD = '4'
  OK = '3'
  BAD = '2'
  VERY_BAD = '1'


def eval_summary(prompt, ai_response):
  """Evaluate the generated summary against the prompt used."""

  chat = client.chats.create(model='gemini-2.0-flash')

  # Generate the full text response.
  response = chat.send_message(
      message=SUMMARY_PROMPT.format(prompt=prompt, response=ai_response)
  )
  verbose_eval = response.text

  # Coerce into the desired structure.
  structured_output_config = types.GenerateContentConfig(
      response_mime_type="text/x.enum",
      response_schema=SummaryRating,
  )
  response = chat.send_message(
      message="Convert the final score.",
      config=structured_output_config,
  )
  structured_eval = response.parsed

  return verbose_eval, structured_eval


text_eval, struct_eval = eval_summary(prompt=[request, document_file], ai_response=summary)
Markdown(text_eval)

## Evaluation
STEP 1:
The response includes information from the document provided. It also includes a summarization in the end. The response is well-organized.

STEP 2:
The response follows instructions, is grounded, concise, and fluent.
I choose a rating of 4.

## Rating
4


**Rating**



In this example, the model generated a textual justification that was set up in a chat context. This full text response is useful both for human interpretation and for giving the model a place to "collect notes" while it assesses the text and produces a final score. This "note taking" or "thinking" strategy typically works well with auto-regressive models, where the generated text is passed back into the model at each generation step. This means the working "notes" are available to the model when it generates the final result.

In the next turn, the model converts the text output into a structured response. If you want to aggregate scores or use them programmatically, you want to avoid parsing the unstructured text output. Here the *SummaryRating* schema is passed, so **the model converts the chat history into an instance of the SummaryRating enum**.
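
To make the point about aggregation concrete, here is a small sketch of how enum results could be combined without any text parsing. The enum mirrors the *SummaryRating* class above; `mean_score` and the list of ratings are hypothetical and are not results from this notebook.

```python
import enum
import statistics


class SummaryRating(enum.Enum):
    VERY_GOOD = '5'
    GOOD = '4'
    OK = '3'
    BAD = '2'
    VERY_BAD = '1'


def mean_score(ratings):
    """Average a collection of SummaryRating members as numbers."""
    return statistics.mean(int(rating.value) for rating in ratings)


# Hypothetical ratings, e.g. collected from several evaluation runs.
example_ratings = [SummaryRating.GOOD, SummaryRating.VERY_GOOD, SummaryRating.GOOD]
average = mean_score(example_ratings)

# An enum member can also be looked up from its string value.
looked_up = SummaryRating('4')
```

Because each member's value is a digit string, `int(rating.value)` is enough to turn a rating into a number.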

In [12]:
struct_eval

<SummaryRating.GOOD: '4'>

**Make the summary prompt better or worse**

Gemini models tend to be quite good at tasks like direct summarisation without much prompting, so you should expect to see a result like **GOOD** or **VERY_GOOD** on the previous task, even with a rudimentary prompt. Run it a few times to get a feel for the average response.

To explore how to influence the summarisation output, consider what you might change in the summary request prompt to change the result. Take a look at the evaluation **SUMMARY_PROMPT** for some ideas.

In [13]:
new_prompt = "Explain like I'm 5 the training process"
# Try:
#  ELI5 the training process
#  Summarise the needle/haystack evaluation technique in 1 line
#  Describe the model architecture to someone with a civil engineering degree
#  What is the best LLM?

if not new_prompt:
    raise ValueError("Try setting a new summarisation prompt.")

def run_and_eval_summary(prompt):
    """Generate and evaluate the summary using the new prompt."""
    summary = summarise_doc(prompt)
    display(Markdown(summary + '\n-----'))

    text, struct = eval_summary([prompt, document_file], summary)
    display(Markdown(text + '\n------'))
    print(struct)


run_and_eval_summary(new_prompt)

Okay, I can explain the training process of a large language model like Gemini 1.5 Pro in a way that a 5-year-old can understand.

Imagine you have a puppy, and you want to teach it to understand and respond to your commands. That's kind of like training a big computer brain!

1.  **Lots of Examples:** First, you show the puppy lots and lots of things. You show it pictures of cats, dogs, cars, and houses. You also read it stories and tell it about all sorts of things. The computer brain also gets to see and read lots of things – millions and millions of pictures, books, and websites!

2.  **Learning Patterns:** The puppy starts to notice patterns. It learns that things with pointy ears and a tail are often dogs, and that when you say "sit," it should put its bottom on the ground. The computer brain also learns patterns. It learns that certain words often go together, and that certain pictures are related to certain words.

3.  **Making Predictions:** Now, you ask the puppy a question, like "Where's the ball?" The puppy tries to guess where the ball is based on what it has learned. The computer brain also tries to guess the answer to questions.

4.  **Getting Feedback:** If the puppy guesses right, you give it a treat and say "Good job!" If it guesses wrong, you gently correct it. The computer brain also gets feedback. If it guesses right, it gets a little reward. If it guesses wrong, it adjusts itself to try to guess better next time.

5.  **Repeating and Improving:** You keep showing the puppy things, asking questions, and giving feedback over and over again. The puppy gets better and better at understanding and responding to you. The computer brain also keeps learning and improving. It gets better at understanding and answering questions, and even at doing new things that it wasn't specifically taught!

So, training a big computer brain is like teaching a puppy, but with lots and lots of examples, and instead of treats, the computer brain gets little rewards that help it learn and improve. And just like a well-trained puppy, a well-trained computer brain can be very helpful and do amazing things!
-----

## Evaluation
STEP 1:
The response is good, it answers the question "Explain like I'm 5 the training process". It uses the analogy of a puppy.

STEP 2:
I will rate this response as a 4. The summary follows instructions, is grounded, concise, and fluent.

## Rating:
4

------

SummaryRating.GOOD


In [14]:
new_prompt = "Explain like full technical depth"
# Try:
#  ELI5 the training process
#  Summarise the needle/haystack evaluation technique in 1 line
#  Describe the model architecture to someone with a civil engineering degree
#  What is the best LLM?

if not new_prompt:
    raise ValueError("Try setting a new summarisation prompt.")

def run_and_eval_summary(prompt):
    """Generate and evaluate the summary using the new prompt."""
    summary = summarise_doc(prompt)
    display(Markdown(summary + '\n-----'))

    text, struct = eval_summary([prompt, document_file], summary)
    display(Markdown(text + '\n------'))
    print(struct)


run_and_eval_summary(new_prompt)

Okay, I'll provide a full technical explanation of the Gemini 1.5 Pro paper, diving into the details of its architecture, training, evaluation, and responsible deployment.

**1. Introduction**

*   **Gemini Family:** Gemini 1.5 Pro is presented as the latest model in the Gemini family, emphasizing its compute efficiency and multimodal capabilities.
*   **Key Capabilities:** The model is designed for:
    *   Recalling and reasoning over fine-grained information from millions of tokens of context.
    *   Handling long documents, hours of video, and audio.
    *   Achieving near-perfect recall on long-context retrieval tasks.
    *   Improving state-of-the-art performance in long-document QA, long-video QA, and long-context ASR.
    *   Matching or surpassing Gemini 1.0 Ultra's performance on a broad set of benchmarks.
*   **Context Length:** The model demonstrates continued improvement in next-token prediction and near-perfect retrieval ( >99%) up to at least 10M tokens.
*   **In-Context Learning:** The model exhibits surprising new capabilities, such as learning to translate English to Kalamang (a low-resource language) at a similar level to a person who learned from the same content.

**2. Model Architecture**

*   **Mixture-of-Experts (MoE):** Gemini 1.5 Pro utilizes a sparse MoE architecture, building upon Gemini 1.0's research advances and multimodal capabilities.
*   **Conditional Computation:** MoE models use a learned routing function to direct inputs to a subset of the model's parameters for processing. This allows the model to grow its total parameter count while keeping the number of activated parameters constant for any given input.
*   **Transformer-Based:** The model is based on the Transformer architecture, which has become the standard for language models due to its ability to handle long-range dependencies.

**3. Training Infrastructure and Dataset**

*   **TPUv4 Accelerators:** Gemini 1.5 Pro is trained on multiple 4096-chip pods of Google's TPUv4 accelerators, distributed across multiple datacenters.
*   **Multimodal and Multilingual Data:** The pre-training dataset includes data sourced from various domains, including web documents, code, images, audio, and video content.
*   **Instruction Tuning:** The model is fine-tuned on a collection of multimodal data containing paired instructions and appropriate responses, with further tuning based on human preference data.

**4. Long-Context Evaluation**

*   **Evaluation Categories:** The evaluation of Gemini 1.5 Pro focuses on three main categories:
    *   Qualitative long-context multimodal evaluations: Manually probing and stress-testing the model's long-context abilities.
    *   Quantitative long-context multimodal evaluations: Measuring the model's long-context abilities on synthetic and real-world tasks with well-defined metrics.
    *   Quantitative core evaluations: Identifying progress and regression in core capabilities.
*   **Diagnostic Long-Context Evaluations:**
    *   **Perplexity over Long Sequences:** Evaluating the ability of the models to make use of very long contexts to improve next-token prediction by recording the negative log-likelihood (NLL) of tokens at different positions in the input sequences.
    *   **Text Haystack:** Testing long-context recall using the needle-in-a-haystack evaluation, which tests a model's ability to retrieve a text (i.e., "needle") inserted at various positions into a sequence (i.e., "haystack").
    *   **Video Haystack:** Adapting the text needle-in-a-haystack evaluation and turning it into a cross-modal evaluation, wherein a needle is hidden in one modality while the retrieval query is given in text.
    *   **Audio Haystack:** Testing long context capabilities on audio understanding by hiding a very short clip of audio lasting a few seconds where a speaker says "the secret keyword is needle" within an audio signal (the haystack) up to almost five days long (i.e., 107 hours).
    *   **Improved Diagnostics:**
        *   Multiple needles-in-haystack: Increasing the number of unique "needles" in each haystack and requiring the model to retrieve them all.
        *   Multi-round Co-reference Resolution (MRCR): Presenting the model with a long conversation between a user and a model, in which the user requests writing (e.g. poems, riddles, essays) on different topics proceeded by the model responses.
*   **Realistic Long-Context Evaluations:**
    *   **In-Context Language Learning:** Evaluating Gemini 1.5 Pro on the Machine Translation from One Book (MTOB) benchmark, which measures the ability to learn to perform sentence-level translation between English and Kalamang (ISO 639-3 language code: kgv) from instructional materials.
    *   **Long-Document QA:** Creating questions using the book "Les Misérables" (by Victor Hugo) and testing the model's ability to answer them correctly when the entire 1,462 page book (i.e., 710K tokens) is provided as input.
    *   **Long-Context Audio:** Evaluating Gemini 1.5 Pro's long context understanding capabilities on audio inputs by testing it on 15 minute segments of an internal YouTube video-based benchmark.
    *   **Long-Context Video QA:** Introducing a new benchmark, 1H-VideoQA, composed of 125 five-way multiple-choice questions over public videos 40-105 minutes long.

**5. Core Capability Evaluations**

*   **Core Text Evals:**
    *   Reasoning, Math and Science
    *   Coding
    *   Multilinguality
    *   Instruction Following
*   **Core Vision Multimodal Evaluations:** Assessing performance on multimodal image tasks by reporting results on 8 image understanding benchmarks and 5 video understanding benchmarks.
*   **Core Audio Multimodal Evaluations:** Evaluating Gemini 1.5 Pro on several short-context Automatic Speech Recognition (ASR) and Automatic Speech Translation (AST) benchmarks.

**6. Responsible Deployment**

*   **Structured Approach:** Following a structured approach to responsible deployment, as outlined in Figure 13.
*   **Impact Assessment:** Developing model impact assessments to identify, assess, and document key downstream societal benefits and harms associated with the development of advanced models.
*   **Model Mitigations:** Modeling mitigation of safety risks mostly through supervised fine-tuning (SFT) and reinforcement learning through human feedback (RLHF) using a reward model.
*   **Model Safety Evaluations:** Undertaking a range of safety evaluations on the Gemini 1.5 Pro model, including content safety, representational harms, and memorization for text-to-text and image-to-text evaluations.
*   **Divergence:** Evaluating Gemini 1.5 Pro to understand its susceptibility to divergence and in particular, emitting memorized training data via this attack.
*   **Deployment:** Releasing external model cards on an ongoing basis within updates of technical reports and in documentation for enterprise customers.

**Key Technical Details and Implications**

*   **MoE Architecture:** The use of a Mixture-of-Experts architecture is crucial for scaling the model's capacity without a proportional increase in computational cost. This allows Gemini 1.5 Pro to have a much larger parameter count than a dense model with similar computational requirements.
*   **Long Context Length:** The ability to process millions of tokens of context is a significant advancement. This enables the model to perform tasks that were previously impossible, such as analyzing entire books or long videos in a single pass.
*   **Multimodal Capabilities:** The model's native multimodal capabilities allow it to seamlessly integrate information from different modalities, such as text, images, audio, and video. This is essential for tasks that require understanding the relationships between different types of data.
*   **Instruction Tuning:** The use of instruction tuning is critical for aligning the model's behavior with human preferences and ensuring that it can follow complex instructions.
*   **Responsible AI:** The paper emphasizes the importance of responsible AI development and deployment, including impact assessments, safety evaluations, and mitigation strategies.

**In summary, Gemini 1.5 Pro represents a significant step forward in the development of large language models. Its MoE architecture, long context length, multimodal capabilities, and responsible AI approach make it a powerful tool for a wide range of applications.**

-----

Rating: 5

Explanation:
The response is excellent because it summarizes all the relevant details in the original text. The organization of the response is very easy to read. In addition, the response follows the instructions, is grounded, is concise, and fluent.
------

SummaryRating.VERY_GOOD


In [15]:
new_prompt = "give the specific information"
# Try:
#  ELI5 the training process
#  Summarise the needle/haystack evaluation technique in 1 line
#  Describe the model architecture to someone with a civil engineering degree
#  What is the best LLM?

if not new_prompt:
    raise ValueError("Try setting a new summarisation prompt.")

def run_and_eval_summary(prompt):
    """Generate and evaluate the summary using the new prompt."""
    summary = summarise_doc(prompt)
    display(Markdown(summary + '\n-----'))

    text, struct = eval_summary([prompt, document_file], summary)
    display(Markdown(text + '\n------'))
    print(struct)


run_and_eval_summary(new_prompt)

Here are the key details from the document about Gemini 1.5:

*   **Model Overview:** Gemini 1.5 Pro is a highly compute-efficient multimodal mixture-of-experts model. It can recall and reason over fine-grained information from millions of tokens of context, including long documents, video, and audio.

*   **Performance:** It achieves near-perfect recall on long-context retrieval tasks, improves state-of-the-art in long-document QA, long-video QA, and long-context ASR. It matches or surpasses Gemini 1.0 Ultra's performance on a broad set of benchmarks.

*   **Context Length:** Gemini 1.5 Pro can handle contexts up to at least 10 million tokens. This is a significant increase compared to models like Claude 2.1 (200k) and GPT-4 Turbo (128k).

*   **New Capabilities:** The model demonstrates surprising new capabilities, such as in-context learning. For example, it can learn to translate English to Kalamang (a language with few speakers) at a level similar to a person who learned from the same content.

*   **Architecture:** Gemini 1.5 Pro uses a sparse mixture-of-experts (MoE) Transformer-based architecture.

*   **Training:** It is trained on multiple 4096-chip pods of Google's TPUv4 accelerators, distributed across multiple datacenters, and on a variety of multimodal and multilingual data.

*   **Evaluation:** The evaluation methodology includes qualitative and quantitative long-context multimodal evaluations, as well as quantitative core evaluations.

*   **Responsible Deployment:** The report emphasizes responsible deployment, including impact assessment, model policies, evaluations, and mitigations of harm.

*   **Model Card:** A model card is provided with details on the model's architecture, inputs, outputs, applications, limitations, and ethical considerations.
-----

STEP 1: Assess the response in aspects of instruction following, groundedness, conciseness, and verbosity according to the criteria.
The prompt asks to give the specific information from the file that was uploaded. The response provides key details from the document about Gemini 1.5.

STEP 2: Score based on the rubric.
5. The response follows instructions, is grounded, concise, and fluent.

------

SummaryRating.VERY_GOOD


**Evaluating in practice**


Evaluation has many practical uses, for example:

* You can quickly iterate on a prompt with a small set of test documents.
* You can compare different models to find what works best for your needs, such as finding the trade-off between price and performance, or finding the best performance for a specific task.
* When pushing changes to a model or prompt in a production system, you can verify that the system does not regress in quality.
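
As one way to picture the last point, a regression check can be as simple as comparing the mean score of a candidate against a baseline. This is a minimal sketch only: `has_regressed`, the tolerance value and both score lists are hypothetical and do not come from this notebook.

```python
import statistics


def has_regressed(baseline_scores, candidate_scores, tolerance=0.25):
    """Return True if the candidate's mean score is more than `tolerance` below the baseline's."""
    baseline_mean = statistics.mean(baseline_scores)
    candidate_mean = statistics.mean(candidate_scores)
    return candidate_mean < baseline_mean - tolerance


# Hypothetical 1-5 ratings for the same test set under two prompt versions.
baseline = [5, 4, 4, 5]    # mean 4.5
candidate = [4, 4, 3, 4]   # mean 3.75

regressed = has_regressed(baseline, candidate)
```

The tolerance leaves room for run-to-run noise in model-based ratings; a suitable value depends on how many examples are scored.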

**In this section you will try two different evaluation approaches.**


**Pointwise evaluation**

The technique used above, *where you evaluate a single input/output pair against some criteria, is known as pointwise evaluation*.
**This is useful for evaluating singular outputs in an absolute sense, such as "was it good or bad?"**

In this exercise, you will try different guidance prompts with a set of questions.

In [16]:
import functools

# Try these instructions, or edit and add your own.
terse_guidance = "Answer the following question in a single sentence, or as close to that as possible."
moderate_guidance = "Provide a brief answer to the following question, use a citation if necessary, but only enough to answer the question."
cited_guidance = "Provide a thorough, detailed answer to the following question, citing the document and supplying additional background information as much as possible."
guidance_options = {
    'Terse': terse_guidance,
    'Moderate': moderate_guidance,
    'Cited': cited_guidance,
}

questions = [
    # Un-comment one or more questions to try here, or add your own.
    # Evaluating more questions will take more time, but produces results
    # with higher confidence. In a production system, you may have hundreds
    # of questions to evaluate a complex system.

    # "What metric(s) are used to evaluate long context performance?",
    "How does the model perform on code tasks?",
    "How many layers does it have?",
    # "Why is it called Gemini?",
]

if not questions:
  raise NotImplementedError('Add some questions to evaluate!')


@functools.cache
def answer_question(question: str, guidance: str = '') -> str:
  """Generate an answer to the question using the uploaded document and guidance."""
  config = types.GenerateContentConfig(
      temperature=0.0,
      system_instruction=guidance,
  )
  response = client.models.generate_content(
      model='gemini-2.0-flash',
      config=config,
      contents=[question, document_file],
  )

  return response.text


answer = answer_question(questions[0], terse_guidance)
Markdown(answer)

Gemini 1.5 Pro demonstrates strong performance on code tasks, surpassing Gemini 1.0 Ultra on Natural2Code and showing improvements in coding capabilities.


In [17]:
import enum

QA_PROMPT = """\
# Instruction
You are an expert evaluator. Your task is to evaluate the quality of the responses generated by AI models.
We will provide you with the user prompt and an AI-generated responses.
You should first read the user prompt carefully for analyzing the task, and then evaluate the quality of the responses based on and rules provided in the Evaluation section below.

# Evaluation
## Metric Definition
You will be assessing question answering quality, which measures the overall quality of the answer to the question in the user prompt. Pay special attention to length constraints, such as in X words or in Y sentences. The instruction for performing a question-answering task is provided in the user prompt. The response should not contain information that is not present in the context (if it is provided).

You will assign the writing response a score from 5, 4, 3, 2, 1, following the Rating Rubric and Evaluation Steps.
Give step-by-step explanations for your scoring, and only choose scores from 5, 4, 3, 2, 1.

## Criteria Definition
Instruction following: The response demonstrates a clear understanding of the question answering task instructions, satisfying all of the instruction's requirements.
Groundedness: The response contains information included only in the context if the context is present in the user prompt. The response does not reference any outside information.
Completeness: The response completely answers the question with sufficient detail.
Fluent: The response is well-organized and easy to read.

## Rating Rubric
5: (Very good). The answer follows instructions, is grounded, complete, and fluent.
4: (Good). The answer follows instructions, is grounded, complete, but is not very fluent.
3: (Ok). The answer mostly follows instructions, is grounded, answers the question partially and is not very fluent.
2: (Bad). The answer does not follow the instructions very well, is incomplete or not fully grounded.
1: (Very bad). The answer does not follow the instructions, is wrong and not grounded.

## Evaluation Steps
STEP 1: Assess the response in aspects of instruction following, groundedness,completeness, and fluency according to the criteria.
STEP 2: Score based on the rubric.

# User Inputs and AI-generated Response
## User Inputs
### Prompt
{prompt}

## AI-generated Response
{response}
"""

class AnswerRating(enum.Enum):
  VERY_GOOD = '5'
  GOOD = '4'
  OK = '3'
  BAD = '2'
  VERY_BAD = '1'


@functools.cache
def eval_answer(prompt, ai_response, n=1):
  """Evaluate the generated answer against the prompt/question used."""
  chat = client.chats.create(model='gemini-2.0-flash')

  # Generate the full text response.
  response = chat.send_message(
      message=QA_PROMPT.format(prompt=[prompt, document_file], response=ai_response)
  )
  verbose_eval = response.text

  # Coerce into the desired structure.
  structured_output_config = types.GenerateContentConfig(
      response_mime_type="text/x.enum",
      response_schema=AnswerRating,
  )
  response = chat.send_message(
      message="Convert the final score.",
      config=structured_output_config,
  )
  structured_eval = response.parsed

  return verbose_eval, structured_eval


text_eval, struct_eval = eval_answer(prompt=questions[0], ai_response=answer)
display(Markdown(text_eval))
print(struct_eval)

STEP 1:
The answer provides a summary of how the model performs on code tasks. The response is fluent and the information is found in the document.
STEP 2:
I would score this a 5 because it is grounded, fluent, complete, and follows instructions.



AnswerRating.VERY_GOOD


Now run the evaluation task in a loop. Note that the guidance instruction is hidden from the evaluation agent. If you passed the guidance prompt, the model would score based on whether the answer followed that guidance, but for this task the goal is to find the best overall result based on the user's question, not the developer's instruction.

In [18]:
import collections
import itertools

# Number of times to repeat each task in order to reduce error and calculate an average.
# Increasing it takes longer but gives better results; try 2 or 3 to start.
NUM_ITERATIONS = 1

scores = collections.defaultdict(int)
responses = collections.defaultdict(list)

for question in questions:
  display(Markdown(f'## {question}'))
  for guidance, guide_prompt in guidance_options.items():

    for n in range(NUM_ITERATIONS):
      # Generate a response.
      answer = answer_question(question, guide_prompt)

      # Evaluate the response (note that the guidance prompt is not passed).
      written_eval, struct_eval = eval_answer(question, answer, n)
      print(f'{guidance}: {struct_eval}')

      # Save the numeric score.
      scores[guidance] += int(struct_eval.value)

      # Save the responses, in case you wish to inspect them.
      responses[(guidance, question)].append((answer, written_eval))

## How does the model perform on code tasks?

Terse: AnswerRating.VERY_GOOD
Moderate: AnswerRating.VERY_GOOD
Cited: AnswerRating.VERY_GOOD


## How many layers does it have?

Terse: AnswerRating.VERY_GOOD
Moderate: AnswerRating.VERY_BAD
Cited: AnswerRating.BAD


Now aggregate the scores to see how each prompt performed.

In [19]:
for guidance, score in scores.items():
  average_score = score / (NUM_ITERATIONS * len(questions))
  nearest = AnswerRating(str(round(average_score)))
  print(f'{guidance}: {average_score:.2f} - {nearest.name}')

Terse: 5.00 - VERY_GOOD
Moderate: 3.00 - OK
Cited: 3.50 - GOOD
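
One detail worth noting in the aggregation above: Python's built-in `round` rounds ties to the nearest even integer. That is why an average of 3.50 maps to GOOD (4) here, while an average of 2.50 would map to BAD (2) rather than OK (3). The sketch below illustrates the behaviour. `round_half_up` is a hypothetical helper, not part of this notebook, that you might prefer if ties should always round upward.

```python
import math


def round_half_up(x: float) -> int:
  """Round to the nearest integer, with ties always rounding up (hypothetical helper)."""
  return math.floor(x + 0.5)


for avg in (2.5, 3.5, 4.5):
  # Built-in round() sends ties to the nearest even integer.
  print(f'{avg}: round -> {round(avg)}, round_half_up -> {round_half_up(avg)}')
```

With more than one iteration per question, exact ties become less common, but the choice of rounding rule still affects which label a borderline prompt receives.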


**Pairwise evaluation**

The pointwise evaluation prompt used in the previous step has 5 levels of grading in the output. This may be too coarse for your system, or perhaps you wish to improve on a prompt that is already "very good".

Another approach to evaluation is to compare two outputs against each other. This is pairwise evaluation, a key step in ranking and sorting algorithms, so you can use it to rank your prompts either instead of, or in addition to, the pointwise approach.
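
As a rough illustration of how a pairwise verdict plugs into a sorting algorithm, the sketch below ranks a few answers with `functools.cmp_to_key`. The `toy_judge` function is a hypothetical stand-in for an LLM judge (it simply prefers the longer answer so that the example runs offline), and the sample answers are made up for illustration.

```python
import functools


def toy_judge(answer_a: str, answer_b: str) -> str:
  """Hypothetical judge returning 'A', 'B' or 'SAME'. Prefers the longer answer."""
  if len(answer_a) == len(answer_b):
    return 'SAME'
  return 'A' if len(answer_a) > len(answer_b) else 'B'


def compare(answer_a: str, answer_b: str) -> int:
  """Map the judge's verdict onto Python's comparator convention (1, 0, -1)."""
  return {'A': 1, 'SAME': 0, 'B': -1}[toy_judge(answer_a, answer_b)]


answers = ['It is good.', 'It performs strongly on code benchmarks.', 'Good.']

# reverse=True puts the preferred answer first.
ranked = sorted(answers, key=functools.cmp_to_key(compare), reverse=True)
print(ranked)
```

A real LLM judge is not guaranteed to be consistent across comparisons, so a ranking produced this way is best treated as approximate.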

This step implements pairwise evaluation using the [pairwise QA quality prompt](http://) from the Google Cloud docs.

In [20]:
QA_PAIRWISE_PROMPT = """\
# Instruction
You are an expert evaluator. Your task is to evaluate the quality of the responses generated by two AI models. We will provide you with the user input and a pair of AI-generated responses (Response A and Response B). You should first read the user input carefully for analyzing the task, and then evaluate the quality of the responses based on the Criteria provided in the Evaluation section below.

You will first judge responses individually, following the Rating Rubric and Evaluation Steps. Then you will give step-by-step explanations for your judgment, compare results to declare the winner based on the Rating Rubric and Evaluation Steps.

# Evaluation
## Metric Definition
You will be assessing question answering quality, which measures the overall quality of the answer to the question in the user prompt. Pay special attention to length constraints, such as in X words or in Y sentences. The instruction for performing a question-answering task is provided in the user prompt. The response should not contain information that is not present in the context (if it is provided).

## Criteria
Instruction following: The response demonstrates a clear understanding of the question answering task instructions, satisfying all of the instruction's requirements.
Groundedness: The response contains information included only in the context if the context is present in the user prompt. The response does not reference any outside information.
Completeness: The response completely answers the question with sufficient detail.
Fluent: The response is well-organized and easy to read.

## Rating Rubric
"A": Response A answers the given question as per the criteria better than response B.
"SAME": Response A and B answer the given question equally well as per the criteria.
"B": Response B answers the given question as per the criteria better than response A.

## Evaluation Steps
STEP 1: Analyze Response A based on the question answering quality criteria: Determine how well Response A fulfills the user requirements, is grounded in the context, is complete and fluent, and provides assessment according to the criterion.
STEP 2: Analyze Response B based on the question answering quality criteria: Determine how well Response B fulfills the user requirements, is grounded in the context, is complete and fluent, and provides assessment according to the criterion.
STEP 3: Compare the overall performance of Response A and Response B based on your analyses and assessment.
STEP 4: Output your preference of "A", "SAME" or "B" to the pairwise_choice field according to the Rating Rubric.
STEP 5: Output your assessment reasoning in the explanation field.

# User Inputs and AI-generated Responses
## User Inputs
### Prompt
{prompt}

# AI-generated Response

### Response A
{baseline_model_response}

### Response B
{response}
"""


class AnswerComparison(enum.Enum):
  A = 'A'
  SAME = 'SAME'
  B = 'B'


@functools.cache
def eval_pairwise(prompt, response_a, response_b, n=1):
  """Determine the better of two answers to the same prompt."""

  chat = client.chats.create(model='gemini-2.0-flash')

  # Generate the full text response.
  response = chat.send_message(
      message=QA_PAIRWISE_PROMPT.format(
          prompt=[prompt, document_file],
          baseline_model_response=response_a,
          response=response_b)
  )
  verbose_eval = response.text

  # Coerce into the desired structure.
  structured_output_config = types.GenerateContentConfig(
      response_mime_type="text/x.enum",
      response_schema=AnswerComparison,
  )
  response = chat.send_message(
      message="Convert the final score.",
      config=structured_output_config,
  )
  structured_eval = response.parsed

  return verbose_eval, structured_eval


question = questions[0]
answer_a = answer_question(question, terse_guidance)
answer_b = answer_question(question, cited_guidance)

text_eval, struct_eval = eval_pairwise(
    prompt=question,
    response_a=answer_a,
    response_b=answer_b,
)

display(Markdown(text_eval))
print(struct_eval)

STEP 1: Analyze Response A based on the question answering quality criteria:
Response A is a one sentence answer that states that the model performs strongly on code tasks, referencing that it surpasses Gemini 1.0 Ultra on Natural2Code. This is somewhat helpful but not super detailed.

STEP 2: Analyze Response B based on the question answering quality criteria:
Response B provides much more detail about how the model performs on coding tasks. It cites the document, it's easy to read, and it's well-organized.

STEP 3: Compare the overall performance of Response A and Response B based on your analyses and assessment.
Response B is the better answer because it gives more detail and is better formatted than response A. Response A is not bad, but does not give many specifics.

STEP 4: Output your preference of "A", "SAME" or "B" to the pairwise_choice field according to the Rating Rubric.
B

STEP 5: Output your assessment reasoning in the explanation field.
Response B gives more detail about how the model performs on code tasks, while response A is only a single sentence. Both answers are well-written and easy to read. However, response B is much more helpful than response A.

AnswerComparison.B


With a pairwise evaluator in place, the only thing required to rank prompts against each other is a comparator.

This example implements the minimal comparators required for **total ordering (== and <)** and performs the comparison using *n_comparisons* evaluations over the set of *questions*.
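
To see why only `==` and `<` are needed, here is a minimal sketch of `functools.total_ordering` on a hypothetical `Score` class (not part of this notebook). The decorator derives the remaining comparison operators from the two that are defined.

```python
import functools


@functools.total_ordering
class Score:
  """Hypothetical wrapper used only to illustrate total_ordering."""

  def __init__(self, value):
    self.value = value

  def __eq__(self, other):
    if not isinstance(other, Score):
      return NotImplemented
    return self.value == other.value

  def __lt__(self, other):
    if not isinstance(other, Score):
      return NotImplemented
    return self.value < other.value


low, high = Score(1), Score(2)

# >, >= and <= are supplied by the decorator.
print(high > low, high >= low, low <= high)
```

Functions such as `sorted` rely mainly on `<`, so defining these two methods is enough for sorting to work.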

In [21]:
@functools.total_ordering
class QAGuidancePrompt:
  """A question-answering guidance prompt or system instruction."""

  def __init__(self, prompt, questions, n_comparisons=NUM_ITERATIONS):
    """Create the prompt. Provide questions to evaluate against, and number of evals to perform."""
    self.prompt = prompt
    self.questions = questions
    self.n = n_comparisons

  def __str__(self):
    return self.prompt

  def _compare_all(self, other):
    """Compare two prompts on all questions over n trials."""
    results = [self._compare_n(other, q) for q in self.questions]
    mean = sum(results) / len(results)
    return round(mean)

  def _compare_n(self, other, question):
    """Compare two prompts on a question over n trials."""
    results = [self._compare(other, question, n) for n in range(self.n)]
    mean = sum(results) / len(results)
    return mean

  def _compare(self, other, question, n=1):
    """Compare two prompts on a single question."""
    answer_a = answer_question(question, self.prompt)
    answer_b = answer_question(question, other.prompt)

    _, result = eval_pairwise(
        prompt=question,
        response_a=answer_a,
        response_b=answer_b,
        n=n,  # Cache buster
    )
    # print(f'q[{question}], a[{self.prompt[:20]}...], b[{other.prompt[:20]}...]: {result}')

    # Convert the enum to the standard Python numeric comparison values.
    if result is AnswerComparison.A:
      return 1
    elif result is AnswerComparison.B:
      return -1
    else:
      return 0

  def __eq__(self, other):
    """Equality check that performs pairwise evaluation."""
    if not isinstance(other, QAGuidancePrompt):
      return NotImplemented

    return self._compare_all(other) == 0

  def __lt__(self, other):
    """Ordering check that performs pairwise evaluation."""
    if not isinstance(other, QAGuidancePrompt):
      return NotImplemented

    return self._compare_all(other) < 0


Now Python's sorting functions will "just work" on any QAGuidancePrompt instances. The answer_question and eval_pairwise functions are [memoized](http://) to avoid unnecessarily regenerating the same answers or evaluations, so you should see this complete quickly unless you have changed the questions, prompts or number of iterations from the earlier steps.
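
The memoization described above, together with the `n` "cache buster" argument used in the evaluation functions, can be sketched as follows. `expensive_eval` is a hypothetical stand-in for a model call. The `n` argument does nothing except change the cache key, so each trial number triggers a fresh call while repeated trials are served from the cache.

```python
import functools


@functools.cache
def expensive_eval(prompt: str, n: int = 1) -> str:
  """Hypothetical stand-in for a model call; `n` only varies the cache key."""
  return f'evaluation of {prompt!r} (trial {n})'


expensive_eval('How many layers does it have?', 0)  # miss: function runs
expensive_eval('How many layers does it have?', 0)  # hit: served from cache
expensive_eval('How many layers does it have?', 1)  # miss: new value of n

info = expensive_eval.cache_info()
print(info.hits, info.misses)
```

Note that `functools.cache` keys on the call pattern, so passing the same values positionally in one call and by keyword in another may create separate cache entries.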

In [23]:
terse_prompt = QAGuidancePrompt(terse_guidance, questions)
moderate_prompt = QAGuidancePrompt(moderate_guidance, questions)
cited_prompt = QAGuidancePrompt(cited_guidance, questions)

# Sort in reverse order, so that the best is first.
sorted_results = sorted([terse_prompt, moderate_prompt, cited_prompt], reverse=True)

for i, p in enumerate(sorted_results):
  if i:
    print('====')

  print(f'#{i+1}:{p}')

#1:Answer the following question in a single sentence, or as close to that as possible.
====
#2:Provide a brief answer to the following question, use a citation if necessary, but only enough to answer the question.
====
#3:Provide a thorough, detailed answer to the following question, citing the document and supplying additional background information as much as possible.


**Challenges**


**LLM limitations**

LLMs are known to have problems on certain tasks, and these challenges persist when using LLMs as evaluators. For example, LLMs can struggle to count the number of characters in a word (this is a numerical problem, not a language problem), so an LLM evaluator will not be able to accurately evaluate this type of task. Solutions are available in some cases, such as connecting tools to handle problems unsuited to a language model, but it's important that you understand the possible limitations and include human evaluators to calibrate your evaluation system and determine a baseline.

One reason that LLM evaluators work well is that all of the information they need is available in the input context, so the model only needs to attend to that information to produce the result. When customising evaluation prompts, or building your own systems, keep this in mind and ensure that you are not relying on "internal knowledge" from the model, or on behaviour that would be better provided by a tool.
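
As a small sketch of the "use a tool" idea, the character-counting example can be checked deterministically in plain Python instead of asking a model to judge it. The function names here are hypothetical. In practice such a function could be run directly in your evaluation harness or exposed to a model as a tool.

```python
def count_char(word: str, char: str) -> int:
  """Count case-insensitive occurrences of a single character in a word."""
  if len(char) != 1:
    raise ValueError('char must be a single character')
  return word.lower().count(char.lower())


def check_count_claim(word: str, char: str, claimed: int) -> bool:
  """Return True if a claimed character count matches the real count."""
  return count_char(word, char) == claimed


print(count_char('strawberry', 'r'))
print(check_count_claim('strawberry', 'r', 2))
```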

**Improving confidence**

One way to improve the confidence of your evaluations is to include a diverse set of evaluators. That is, use the same prompts and outputs, but execute them on different models, like Gemini Flash and Pro, or even across different providers, like Gemini, Claude, ChatGPT and local models like Gemma or Qwen. This follows the same idea used earlier, where repeating trials to gather multiple "opinions" helps to reduce error, except that by using different models the "opinions" will be more diverse.
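
One simple way to combine several evaluators is a majority vote, sketched below. The judges here are hypothetical callables that return a rating string; in a real system each would wrap a call to a different model or provider. The agreement ratio gives a rough indication of how much the evaluators disagree.

```python
import collections
from typing import Callable, Sequence

# A judge takes (prompt, response) and returns a rating label.
Judge = Callable[[str, str], str]


def majority_verdict(judges: Sequence[Judge], prompt: str, response: str) -> tuple[str, float]:
  """Return the most common verdict and the fraction of judges that agreed with it."""
  votes = collections.Counter(judge(prompt, response) for judge in judges)
  verdict, count = votes.most_common(1)[0]
  return verdict, count / len(judges)


# Hypothetical stand-ins for three different evaluator models.
judges = [
    lambda prompt, response: '5',
    lambda prompt, response: '4',
    lambda prompt, response: '5',
]

verdict, agreement = majority_verdict(judges, 'How many layers does it have?', 'An answer.')
print(verdict, round(agreement, 2))
```

Low agreement on a particular item is a useful signal that it may be worth sending to a human evaluator.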