In [None]:
# Copyright 2025 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Intro to Logprobs


<table align="left">
  <td style="text-align: center">
    <a href="https://colab.research.google.com/github/GoogleCloudPlatform/generative-ai/blob/main/gemini/logprobs/intro_logprobs.ipynb">
      <img width="32px" src="https://www.gstatic.com/pantheon/images/bigquery/welcome_page/colab-logo.svg" alt="Google Colaboratory logo"><br> Open in Colab
    </a>
  </td>
  <td style="text-align: center">
    <a href="https://console.cloud.google.com/vertex-ai/colab/import/https:%2F%2Fraw.githubusercontent.com%2FGoogleCloudPlatform%2Fgenerative-ai%2Fmain%2Fgemini%2Flogprobs%2Fintro_logprobs.ipynb">
      <img width="32px" src="https://lh3.googleusercontent.com/JmcxdQi-qOpctIvWKgPtrzZdJJK-J3sWE1RsfjZNwshCFgE_9fULcNpuXYTilIR2hjwN" alt="Google Cloud Colab Enterprise logo"><br> Open in Colab Enterprise
    </a>
  </td>
  <td style="text-align: center">
    <a href="https://console.cloud.google.com/vertex-ai/workbench/deploy-notebook?download_url=https://raw.githubusercontent.com/GoogleCloudPlatform/generative-ai/main/gemini/logprobs/intro_logprobs.ipynb">
      <img src="https://www.gstatic.com/images/branding/gcpiconscolors/vertexai/v1/32px.svg" alt="Vertex AI logo"><br> Open in Vertex AI Workbench
    </a>
  </td>
  <td style="text-align: center">
    <a href="https://github.com/GoogleCloudPlatform/generative-ai/blob/main/gemini/logprobs/intro_logprobs.ipynb">
      <img width="32px" src="https://www.svgrepo.com/download/217753/github.svg" alt="GitHub logo"><br> View on GitHub
    </a>
  </td>
</table>

<div style="clear: both;"></div>


| Authors |
| --- |
| [Eric Dong](https://github.com/gericdong) |


## Overview

The Gemini API on Vertex AI provides two model parameters, `response_logprobs` and `logprobs`, that return the log probabilities of the model's output tokens. They give developers a deeper view into the model's decision-making process for each generated token.

The `response_logprobs` parameter, when enabled, instructs the model to return the log probabilities of the output tokens. The `logprobs` parameter allows users to specify the number of top alternative tokens to be included in the response, along with their associated log probabilities.

### Objectives

In this tutorial, you will learn how to set and use the `response_logprobs` and `logprobs` model parameters in Gemini API on Vertex AI. You will complete the following tasks:

- Set the `response_logprobs` and `logprobs` parameters
- Process and interpret log probabilities output
- Use log probabilities in classification tasks
- Use log probabilities in auto-complete
- Use log probabilities in RAG evaluation


## Getting Started

### Install Google Gen AI SDK for Python


In [None]:
%pip install --upgrade --quiet google-genai

### Authenticate your notebook environment

If you are running this notebook on Google Colab, run the cell below to authenticate your environment.

In [None]:
import sys

if "google.colab" in sys.modules:
    from google.colab import auth

    auth.authenticate_user()

### Set up Google Cloud Project and location

This tutorial uses Google Cloud credentials to authenticate your environment. Learn more about [setting up a project and a development environment](https://cloud.google.com/vertex-ai/docs/start/cloud-environment).


In [None]:
import os

PROJECT_ID = "[your-project-id]"  # @param {type: "string", placeholder: "[your-project-id]", isTemplate: true}
if not PROJECT_ID or PROJECT_ID == "[your-project-id]":
    PROJECT_ID = str(os.environ.get("GOOGLE_CLOUD_PROJECT"))

LOCATION = os.environ.get("GOOGLE_CLOUD_REGION", "global")

### Import libraries


In [None]:
import math

from google import genai
from google.genai.types import GenerateContentConfig
import pandas as pd

### Create a client

In [None]:
client = genai.Client(vertexai=True, project=PROJECT_ID, location=LOCATION)

### Use a Gemini model

Learn more about all [Gemini models on Vertex AI](https://cloud.google.com/vertex-ai/generative-ai/docs/learn/models#gemini-models).

In [None]:
MODEL_ID = "gemini-2.0-flash"  # @param ["gemini-2.0-flash", "gemini-2.5-pro", "gemini-2.5-flash"] {"allow-input":true, isTemplate: true}

## Enable log probabilities

You can enable log probabilities by setting the `response_logprobs` and `logprobs` parameters in `GenerateContentConfig`.

- `response_logprobs`: Optional boolean. If true, returns the log probabilities of the tokens that the model chose at each step. Defaults to false.
- `logprobs`: Optional integer in the range 1-20. Returns the log probabilities of that many top candidate tokens at each generation step. The example below sets it together with `response_logprobs=True`.

See [Gemini API reference docs](https://cloud.google.com/vertex-ai/generative-ai/docs/model-reference/inference#generationconfig) for more details.


In [None]:
prompt = """
I am not sure if I really like this restaurant a lot.
"""

response_schema = {"type": "STRING", "enum": ["Positive", "Negative", "Neutral"]}

response = client.models.generate_content(
    model=MODEL_ID,
    contents=prompt,
    config=GenerateContentConfig(
        response_mime_type="text/x.enum",
        response_schema=response_schema,
        response_logprobs=True,  # defaults to False
        logprobs=3,  # number of top candidates to return per step (1-20)
    ),
)

print(response.text)

### Process log probabilities output

A log probability is the natural logarithm of the probability score the model assigned to a token.

- A probability is always a value between 0 and 1 (e.g., 0.95 for 95% probability).
- The natural logarithm of any number between 0 and 1 is always a negative number.
- The natural logarithm of 1 (representing 100% certainty) is 0.
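
As a quick check of these properties, the following minimal sketch converts between probabilities and log probabilities with Python's `math` module. The helper names `to_logprob` and `to_probability` are illustrative and are not part of the Gen AI SDK.

```python
import math


def to_logprob(probability: float) -> float:
    """Natural log of a probability in (0, 1]."""
    if not 0.0 < probability <= 1.0:
        raise ValueError("probability must be in (0, 1]")
    return math.log(probability)


def to_probability(logprob: float) -> float:
    """Inverse of to_logprob: exponentiate to recover the probability."""
    return math.exp(logprob)


print(to_logprob(1.0))  # certainty maps to a log probability of 0
print(to_logprob(0.95) < 0)  # any probability below 1 maps to a negative value
print(round(to_probability(to_logprob(0.95)), 6))  # the round trip recovers the input
```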

You can use the helper function `print_logprobs` to print the log probabilities. The output from the function provides a detailed look into the model's predictions for each token in the generated response.

In [None]:
def print_logprobs(response):
    """
    Print log probabilities for each token in the response
    """
    if response.candidates and response.candidates[0].logprobs_result:
        logprobs_result = response.candidates[0].logprobs_result
        for i, chosen_candidate in enumerate(logprobs_result.chosen_candidates):
            print(
                f"Token: '{chosen_candidate.token}' ({chosen_candidate.log_probability:.4f})"
            )
            if i < len(logprobs_result.top_candidates):
                top_alternatives = logprobs_result.top_candidates[i].candidates
                alternatives = [
                    alt
                    for alt in top_alternatives
                    if alt.token != chosen_candidate.token
                ]
                if alternatives:
                    print("Alternative Tokens:")
                    for alt_token_info in alternatives:
                        print(
                            f"  - '{alt_token_info.token}': ({alt_token_info.log_probability:.4f})"
                        )
            print("-" * 20)

In [None]:
print_logprobs(response)

### Interpret log probabilities

Let's break down this example (the output you receive may differ):

```
Token: 'Neutral' (-0.0214)
Alternative Tokens:
  - 'Positive': (-4.8219)
  - 'Negative': (-5.6293)
```

- Token: `Neutral` (`-0.0214`): This line indicates that the model's chosen token for this position in the output sequence is `Neutral`.

- The value (`-0.0214`) is the log probability of this token. A log probability closer to 0 indicates a higher probability and, therefore, greater confidence from the model in its choice. Here, `exp(-0.0214)` is about 0.979, so the model assigns roughly a 97.9% probability to `Neutral`.

- Alternative Tokens: This section lists other tokens that the model considered for this same position in the sequence. These are the "runners-up" that had the next highest probabilities.
   - `Positive` (`-4.8219`): The token `Positive` was a possible alternative, but its log probability is significantly lower (more negative) than that of `Neutral`, indicating the model considered it a much less likely choice.
   - `Negative` (`-5.6293`): Similarly, the token `Negative` was another alternative with an even lower log probability, making it a less probable candidate in the model's view.

This output shows that while `Positive` and `Negative` were considered, the model's confidence in `Neutral` being the correct token was higher than for any of the alternatives. This level of insight is invaluable for understanding the model's certainty at each step of the generation process and for debugging or refining model behavior.
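
To make these numbers more concrete, the sketch below converts the three example log probabilities into probabilities. It uses `types.SimpleNamespace` as an offline stand-in for one entry of `top_candidates` (an assumption made so the example runs without an API call). The values are copied from the example output, and `candidate_probabilities` is a hypothetical helper name.

```python
import math
from types import SimpleNamespace

# Offline stand-in for logprobs_result.top_candidates[0].candidates,
# filled with the values from the example output.
example_candidates = [
    SimpleNamespace(token="Neutral", log_probability=-0.0214),
    SimpleNamespace(token="Positive", log_probability=-4.8219),
    SimpleNamespace(token="Negative", log_probability=-5.6293),
]


def candidate_probabilities(candidates) -> dict[str, float]:
    """Map each candidate token to exp(log_probability)."""
    return {c.token: math.exp(c.log_probability) for c in candidates}


probabilities = candidate_probabilities(example_candidates)
for token, probability in probabilities.items():
    print(f"{token}: {probability:.2%}")

# The listed candidates need not sum to exactly 1, because only the
# top-k tokens are returned.
print(f"Total: {sum(probabilities.values()):.2%}")
```

With these values, `Neutral` comes out at roughly 97.9%, while `Positive` and `Negative` are each below 1%.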

## Use log probabilities


### Classification

Using `logprobs` in classification tasks can transform the model's output from a simple, "black box" answer into a transparent, quantifiable decision. It allows you to understand not just what category the model chose, but how confident it was in that choice and what other options it considered. See some use case examples below that are built upon the above classification example.
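
The use cases in this section read the first generated token, which works when each label is emitted as a single token, as in the example output above. If a label were split across several tokens (an assumption; tokenization depends on the model), one option is to sum the per-token log probabilities, since the log of a product of probabilities is the sum of their logs. A minimal sketch with made-up values, where `sequence_logprob` is a hypothetical helper:

```python
import math


def sequence_logprob(token_logprobs: list[float]) -> float:
    """Log probability of a whole token sequence: the sum of per-token log probabilities."""
    return sum(token_logprobs)


# Made-up log probabilities for a label generated as two tokens.
label_token_logprobs = [-0.05, -0.02]

total = sequence_logprob(label_token_logprobs)
print(f"Sequence log probability: {total:.4f}")
print(f"Sequence probability: {math.exp(total):.2%}")
```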

#### **Use Case 1**: Detecting ambiguity for human review

**Scenario:**
You are using the model to automate a classification task (e.g., categorizing support tickets). If the model is not confident in its top choice and the next best choice is very close, the prediction is "ambiguous." Instead of accepting the potentially incorrect top answer, you can flag it for human review.

**Why `logprobs` is useful:**
It allows you to see the confidence score of the top choice and compare it to the alternatives. A small difference between the top two log probabilities is a clear signal of ambiguity.
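
A difference in log probabilities corresponds to a ratio of probabilities: if the top two candidates differ by `d`, the top choice is `exp(d)` times as likely as the runner-up. The sketch below illustrates this with the values from the earlier example output. `likelihood_ratio` is a hypothetical helper name.

```python
import math


def likelihood_ratio(logprob_top: float, logprob_second: float) -> float:
    """How many times more likely the top choice is than the second choice."""
    return math.exp(logprob_top - logprob_second)


# Values from the example output: 'Neutral' vs. 'Positive'.
ratio = likelihood_ratio(-0.0214, -4.8219)
print(f"Top choice is about {ratio:.0f}x as likely as the runner-up")

# A margin of 1.0 in log space corresponds to a ratio of e (about 2.72).
print(f"Ratio at a margin of 1.0: {math.exp(1.0):.2f}")
```

Read this way, a margin of 1.0 flags any result whose top choice is less than about 2.72 times as likely as the second choice.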

In [None]:
def check_for_ambiguity(response, ambiguity_margin=1.0):
    """
    Check if the classification is ambiguous.
    Ambiguity is defined as the log probability of the top choice being
    too close to the log probability of the second choice.
    """
    if not (response.candidates and response.candidates[0].logprobs_result):
        return

    logprobs_result = response.candidates[0].logprobs_result
    if (
        len(logprobs_result.chosen_candidates) < 1
        or len(logprobs_result.top_candidates) < 1
        or len(logprobs_result.top_candidates[0].candidates) < 2
    ):
        print("Not enough data to check for ambiguity.")
        return

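    chosen_candidate = logprobs_result.chosen_candidates[0]
    top_alternatives = logprobs_result.top_candidates[0].candidates
    # The chosen token may not be listed first among the top candidates, so
    # compare against the most likely candidate that is not the chosen token.
    alternatives = [
        alt for alt in top_alternatives if alt.token != chosen_candidate.token
    ]
    second_choice_token_info = max(alternatives, key=lambda alt: alt.log_probability)
    logprob_top_choice = chosen_candidate.log_probability
    logprob_second_choice = second_choice_token_info.log_probability
    difference = abs(logprob_top_choice - logprob_second_choice)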

    print(f"Top choice: '{chosen_candidate.token}' ({logprob_top_choice:.4f})")
    print(
        f"Second choice: '{second_choice_token_info.token}' ({logprob_second_choice:.4f})"
    )
    print(f"Logprob difference: {difference:.4f}")

    if difference < ambiguity_margin:
        print(
            f"\nResult is AMBIGUOUS. The difference is less than the margin of {ambiguity_margin}. Flag for human review."
        )
    else:
        print("\nResult is confident.")

In [None]:
check_for_ambiguity(response)

#### **Use Case 2**: Confidence-based thresholding

**Scenario:**
You only want to accept and process the model's classification if its confidence in the result is above a certain level (e.g., 90%). This is crucial in applications where incorrect classifications can have significant negative consequences.

**Why `logprobs` is useful:**
It provides a direct measure of the model's confidence in its chosen classification. You can convert the top log probability back to a raw probability score to apply a simple threshold.
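
The same threshold can also be applied without leaving log space, by comparing the log probability against the logarithm of the threshold. A minimal sketch, where `meets_threshold` is a hypothetical helper and the log probabilities are taken from the earlier example output:

```python
import math


def meets_threshold(log_prob: float, threshold: float = 0.90) -> bool:
    """True if exp(log_prob) >= threshold, evaluated in log space."""
    return log_prob >= math.log(threshold)


print(f"log(0.90) = {math.log(0.90):.4f}")
print(meets_threshold(-0.0214))  # 'Neutral' from the example output
print(meets_threshold(-4.8219))  # 'Positive' from the example output
```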

In [None]:
def accept_if_confident(response, threshold=0.90):
    """
    Accepts the classification only if the confidence level is above a threshold.
    """
    if not (response.candidates and response.candidates[0].logprobs_result):
        return None

    chosen_candidate = response.candidates[0].logprobs_result.chosen_candidates[0]
    log_prob = chosen_candidate.log_probability

    # Convert log probability back to a raw probability score
    probability = math.exp(log_prob)

    print(f"Chosen classification: '{chosen_candidate.token}'")
    print(f"Model confidence: {probability:.2%}")

    if probability >= threshold:
        print(f"Confidence is above {threshold:.0%}. Accepting result.")
        return chosen_candidate.token
    else:
        print(f"Confidence is below {threshold:.0%}. Rejecting result.")
        return None

In [None]:
accepted_class = accept_if_confident(response, threshold=0.90)
print(f"Final classification: {accepted_class}")

### Auto-complete

This use case showcases how `logprobs` can be used to enable an auto-complete feature that adapts its suggestions in real-time. As the user types, the context changes, and therefore the predictions for the very next word should become more relevant.

**Why `logprobs` is useful:**
By repeatedly querying the model with the growing text and examining the `logprobs`, we can directly observe the model's shifting "thought process". We can see the list of likely next words narrow down from very general predictions to highly specific ones, providing a powerful and intuitive user experience.

In [None]:
def get_autocomplete_suggestions(prompt: str) -> genai.types.GenerateContentResponse:
    """
    Gets autocomplete suggestions for a given text prompt.
    """
    system_instruction = (
        "You are acting as auto-complete. Complete the sentence with only one word."
    )

    response = client.models.generate_content(
        model=MODEL_ID,
        contents=prompt,
        config=GenerateContentConfig(
            system_instruction=system_instruction,
            max_output_tokens=1,  # We only want the very next word
            temperature=0.7,  # A non-zero value gives more diverse suggestions
            response_logprobs=True,
            logprobs=3,
        ),
    )

    return response


def parse_suggestions(response: genai.types.GenerateContentResponse) -> str:
    """
    Parses the logprobs from a model response and formats them.
    """
    if not (
        response.candidates
        and response.candidates[0].logprobs_result
        and response.candidates[0].logprobs_result.top_candidates
    ):
        return "N/A"

    suggestions = response.candidates[0].logprobs_result.top_candidates[0].candidates
    suggestion_strings = []
    for token_info in suggestions:
        probability = math.exp(token_info.log_probability) * 100
        suggestion_strings.append(f"'{token_info.token}' ({probability:.1f}%)")

    return " | ".join(suggestion_strings)

Provide a list of strings simulating a user typing a sentence word by word, and loop through each step of the user typing the sentence to demonstrate evolution of auto-complete suggestions.


In [None]:
prompts_in_sequence = [
    "The",
    "The best",
    "The best thing",
    "The best thing about",
    "The best thing about living",
    "The best thing about living in",
    "The best thing about living in Toronto",
    "The best thing about living in Toronto is",
    "The best thing about living in Toronto is the",
]

results = []

for prompt in prompts_in_sequence:
    response = get_autocomplete_suggestions(prompt)
    top_suggestions = parse_suggestions(response)

    results.append({"Typed Text": f"`{prompt}`", "Top 3 Next Words": top_suggestions})

df = pd.DataFrame(results)
print(df.to_string(index=False))

**Analysis of the auto-complete suggestions:**

The example output below shows how the next-word predictions change as more context is provided. The exact tokens and probabilities will vary, but the pattern will be similar to this:


```
                                     Typed Text                                               Top 3 Next Words
                                          `The`                'cat' (39.2%) | 'quick' (32.8%) | 'end' (23.0%)
                                     `The best`              'thing' (84.4%) | 'things' (11.0%) | 'way' (2.8%)
                               `The best thing`                 'ever' (100.0%) | 'is' (0.0%) | 'since' (0.0%)
                         `The best thing about`      'life' (84.5%) | 'everything' (4.8%) | 'chocolate' (4.8%)
                  `The best thing about living`                   'life' (81.3%) | 'now' (14.7%) | 'ab' (1.0%)
               `The best thing about living in`          'Europe' (30.8%) | 'Spain' (13.8%) | 'nature' (12.3%)
       `The best thing about living in Toronto`         'is' (91.0%) | 'diversity' (6.9%) | 'Diversity' (1.3%)
    `The best thing about living in Toronto is` 'diversity' (97.1%) | 'everything' (1.6%) | 'Diversity' (1.2%)
`The best thing about living in Toronto is the`       'diversity' (99.8%) | 'Diversity' (0.1%) | 'food' (0.0%)

```

- `The`: The possibilities are vast. The probability is spread across several common continuations (`cat`, `quick`, `end`), none of which exceeds 40%.
- `The best thing`: The model treats the phrase as nearly complete and puts almost all of the probability on `ever`.
- `The best thing about living`: Most of the probability goes to `life`, with `now` a distant second.
- `The best thing about living in`: The model now expects a location, but no single place dominates (`Europe` at 30.8%, `Spain` at 13.8%).
- `The best thing about living in Toronto is the`: With the full context, the model converges on a specific, well-known attribute of Toronto: `diversity` (99.8%), with `food` as a distant alternative.

This step-by-step demonstration shows how `logprobs` give developers a window into how the model's predictions shift with context, which they can use to build adaptive features such as auto-complete.
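
In the example output, `diversity` and `Diversity` appear as separate suggestions because they are distinct tokens. If an application wants to treat such variants as one suggestion, it could add up their probabilities after normalizing the text. The sketch below is one possible approach, not part of the notebook's pipeline. `merge_case_variants` is a hypothetical helper and the log probabilities are made up for illustration.

```python
import math
from collections import defaultdict


def merge_case_variants(candidates: list[tuple[str, float]]) -> list[tuple[str, float]]:
    """Sum probabilities of tokens that differ only in case or surrounding whitespace.

    Takes (token, log_probability) pairs and returns (token, probability)
    pairs sorted from most to least likely.
    """
    merged: dict[str, float] = defaultdict(float)
    for token, log_probability in candidates:
        merged[token.strip().lower()] += math.exp(log_probability)
    return sorted(merged.items(), key=lambda item: item[1], reverse=True)


# Made-up log probabilities for three candidate tokens.
suggestions = merge_case_variants(
    [("diversity", -0.03), ("everything", -4.1), ("Diversity", -4.4)]
)
for token, probability in suggestions:
    print(f"'{token}' ({probability:.1%})")
```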

### RAG Evaluation

`logprobs` can be used to assess the quality of a RAG system's output, specifically by measuring how "grounded" the generated answer is in the provided context.

**Scenario**: You have a RAG system that answers questions based on a knowledge base. You need to evaluate how well the generated answers are supported by the retrieved context. A high-quality RAG system should generate answers that are factually consistent with the provided text.

**Why `logprobs` is useful:**

When an LLM is given relevant context, it should be more "confident" in generating an answer that is based on that context. This confidence is reflected in the log probabilities of the tokens it generates.

You can calculate an average log probability for the tokens in the generated answer. This serves as a "grounding" or "confidence" score.
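
Written out, the score is the mean of the per-token log probabilities. Exponentiating it gives the geometric mean of the token probabilities, which can be easier to read. A minimal sketch with made-up token values (`average_logprob` is a hypothetical helper name):

```python
import math


def average_logprob(token_logprobs: list[float]) -> float:
    """Mean log probability across the generated tokens."""
    if not token_logprobs:
        raise ValueError("token_logprobs must not be empty")
    return sum(token_logprobs) / len(token_logprobs)


# Made-up log probabilities for a four-token answer.
token_logprobs = [-0.01, -0.02, -0.005, -0.3]

score = average_logprob(token_logprobs)
print(f"Average log probability: {score:.4f}")
# exp(mean of logs) is the geometric mean of the token probabilities.
print(f"Geometric-mean token probability: {math.exp(score):.2%}")
```

Averaging rather than summing keeps scores comparable between answers of different lengths.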


In [None]:
# Fictional knowledge base:
# This is the source document our RAG system will use.
KNOWLEDGE_BASE = {
    "doc1": "Project Adam, launched in 2025, is a next-generation AI-powered data analytics platform. Its primary goal is to provide real-time insights from unstructured data sources like customer feedback and social media trends.",
    "doc2": "The core technology behind Project Adam is the 'Quantum-Entangled Data Core' (QEDC), which allows for processing speeds up to 100x faster than traditional cloud architectures. Security is handled by a decentralized cryptographic layer.",
    "doc3": "Project Adam is currently in a private beta phase and is available to select enterprise partners in the financial and retail sectors. A public release is tentatively scheduled for Q4 2025. The headquarters is located in Montreal, Canada.",
}

# User's question:
USER_QUESTION = "What is the core technology of Project Adam?"

# Simulate different retrieval qualities:
# 1. Good Retrieval: The retrieved chunk is highly relevant.
good_context = KNOWLEDGE_BASE["doc2"]

# 2. Poor Retrieval: The retrieved chunk is from the same knowledge base but does not answer the question.
poor_context = KNOWLEDGE_BASE["doc3"]

# 3. No Retrieval: No context is provided.
no_context = ""

In [None]:
def get_answer_and_score(question: str, context: str) -> tuple[str, float]:
    """
    Generates an answer based on a question and context, and calculates a grounding score.
    """
    if context:
        prompt = f"""
        Context:
        ---
        {context}
        ---
        Based *only* on the context provided, answer the following question.
        Question: {question}
        Answer:
        """
    else:
        # No context provided (control case)
        prompt = f"Question: {question}\nAnswer:"

    response = client.models.generate_content(
        model=MODEL_ID,
        contents=prompt,
        config=GenerateContentConfig(
            temperature=0,  # Set to 0 for more deterministic answers
            response_logprobs=True,
        ),
    )
    generated_text = response.text
    total_logprob = 0.0
    token_count = 0

    if (
        response.candidates
        and response.candidates[0].logprobs_result
        and response.candidates[0].logprobs_result.chosen_candidates
    ):

        logprobs_result = response.candidates[0].logprobs_result
        for chosen_candidate in logprobs_result.chosen_candidates:
            total_logprob += chosen_candidate.log_probability
            token_count += 1

        average_logprob = total_logprob / token_count if token_count > 0 else 0.0
        return generated_text, average_logprob

    # Return NaN if logprobs are unavailable; 0.0 would read as full confidence.
    return generated_text, float("nan")

In [None]:
# Run the RAG evaluation across different retrieval scenarios.
scenarios = {
    "1. Good Retrieval": good_context,
    "2. Poor Retrieval": poor_context,
    "3. No Retrieval (Control)": no_context,
}

results = []
print("Evaluating RAG grounding scores with logprobs")

for scenario_name, context in scenarios.items():
    answer, score = get_answer_and_score(USER_QUESTION, context)
    results.append(
        {
            "Scenario": scenario_name,
            "Score (Avg Logprob)": f"{score:.4f}",
            "Generated Answer": answer.strip(),
        }
    )

df = pd.DataFrame(results)
print(df.to_string(index=False))

**Example output & analysis:**

The output will show the correlation between context quality and the grounding score.

```
                 Scenario              Score (Avg Logprob)                                                                                              Generated Answer
        1. Good Retrieval                          -0.0076                                                                            Quantum-Entangled Data Core (QEDC)
        2. Poor Retrieval                          -0.1144                                    The context provided does not mention the core technology of Project Adam.
3. No Retrieval (Control)                          -0.2336 The core technology of Project Adam is **deep learning**, specifically large-scale distributed deep learning.
```

- Good Retrieval (Score: `-0.0076`): This scenario has the highest score (the number closest to zero). The model is very confident because every token in its answer is directly supported by the provided text. This is the ideal outcome.

- Poor Retrieval (Score: `-0.1144`): This score is lower than in the good-retrieval scenario. The model correctly recognizes that the provided context (about the beta phase and release date) doesn't contain the answer.

- No Retrieval (Score: `-0.2336`): This is the lowest score. With no context to draw on, the model falls back on whatever it learned during training. Because the knowledge base is fictional, the answer is not supported by any source.

By using `logprobs` as a quantitative metric, you can automate part of the process of evaluating and improving your RAG system, and check whether answers are grounded in your source data rather than merely plausible.

## What's next

You have learned how to use the `response_logprobs` and `logprobs` parameters to gain deeper insight into the model's decision-making process. This notebook demonstrated how to apply these insights to practical use cases, including analyzing classification results, building dynamic auto-complete features, and evaluating RAG systems.

Learn more:

- [Gemini API reference docs](https://cloud.google.com/vertex-ai/generative-ai/docs/model-reference/inference#generationconfig)
