In [None]:
# Copyright 2025 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Know Your Customer Use Case - Gemini Grounding with Google Search 

<table align="left">
  <td style="text-align: center">
    <a href="https://colab.research.google.com/github/GoogleCloudPlatform/generative-ai/blob/main/gemini/use-cases/kyc/kyc-with-grounding.ipynb">
      <img width="32px" src="https://www.gstatic.com/pantheon/images/bigquery/welcome_page/colab-logo.svg" alt="Google Colaboratory logo"><br> Open in Colab
    </a>
  </td>
  <td style="text-align: center">
    <a href="https://console.cloud.google.com/vertex-ai/colab/import/https:%2F%2Fraw.githubusercontent.com%2FGoogleCloudPlatform%2Fgenerative-ai%2Fmain%2Fgemini%2Fuse-cases%2Fkyc%2Fkyc-with-grounding.ipynb">
      <img width="32px" src="https://lh3.googleusercontent.com/JmcxdQi-qOpctIvWKgPtrzZdJJK-J3sWE1RsfjZNwshCFgE_9fULcNpuXYTilIR2hjwN" alt="Google Cloud Colab Enterprise logo"><br> Open in Colab Enterprise
    </a>
  </td>
  <td style="text-align: center">
    <a href="https://console.cloud.google.com/vertex-ai/workbench/deploy-notebook?download_url=https://raw.githubusercontent.com/GoogleCloudPlatform/generative-ai/main/gemini/use-cases/kyc/kyc-with-grounding.ipynb">
      <img src="https://www.gstatic.com/images/branding/gcpiconscolors/vertexai/v1/32px.svg" alt="Vertex AI logo"><br> Open in Vertex AI Workbench
    </a>
  </td>
  <td style="text-align: center">
    <a href="https://github.com/GoogleCloudPlatform/generative-ai/blob/main/gemini/use-cases/kyc/kyc-with-grounding.ipynb">
      <img width="32px" src="https://www.svgrepo.com/download/217753/github.svg" alt="GitHub logo"><br> View on GitHub
    </a>
  </td>
</table>

<div style="clear: both;"></div>

| Author |
| --- |
| [Lukas Geiger](https://github.com/ljogeiger) |

## Overview

This notebook demonstrates how to use the Gemini API (specifically the `gemini-2.5-flash` model) to find and summarize negative news articles related to a specified entity. The entity can be a person, company, or ship. The notebook leverages Google Search as a tool for the Gemini API to ground its responses in real-world information.

You will learn how to:
* Configure the Gemini API client.
* Define a detailed system instruction to guide the model's behavior.
* Craft a prompt that incorporates an input entity.
* Use Google Search as a grounding tool for the model.
* Process the model's response to extract the generated text and grounding metadata (sources).
* Evaluate the responses with custom metrics using the Gen AI evaluation service in Vertex AI.

## Use Case Definition
"Know Your Customer" (KYC) is a crucial due diligence process used by businesses, particularly in regulated industries, to verify the identity of their clients and assess potential risks associated with doing business with them. The primary goal of KYC is to prevent financial crimes like money laundering, terrorist financing, and fraud.

For example, imagine you are conducting interviews for a board seat. As part of this process, you might run a background check to evaluate whether the candidate has been involved in any illegal activities. Based on the resulting report, you can make a more informed decision about whether to proceed with the candidate. The same approach applies to many other use cases and entity types (companies, people, vessels, governments, etc.).

## Get started

### Install Google Gen AI SDK and other required packages
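The following command installs the Google Gen AI SDK, which is necessary to interact with the Gemini API, along with the Vertex AI SDK evaluation extras, pandas, and Plotly, which the evaluation section uses.

In [None]:
%pip install --upgrade --quiet google-genai "google-cloud-aiplatform[evaluation]" pandas plotly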

### Authenticate your notebook environment (Colab only)

If you're running this notebook on Google Colab, run the cell below to authenticate your environment. This allows the notebook to access Google Cloud services.

In [17]:
import sys

if "google.colab" in sys.modules:
    from google.colab import auth

    auth.authenticate_user()

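### Authenticate your notebook environment (outside Colab)

If you're running this notebook locally or in another environment that is not already authenticated, run the cell below to set up Application Default Credentials.
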
In [None]:
!gcloud auth application-default login

### Set Google Cloud project information and Initialize API Client

To get started using Vertex AI (which hosts the Gemini models used here), you must have an existing Google Cloud project and [enable the Vertex AI API](https://console.cloud.google.com/flows/enableapi?apiid=aiplatform.googleapis.com).

Learn more about [setting up a project and a development environment](https://cloud.google.com/vertex-ai/docs/start/cloud-environment).

This cell also initializes the `genai.Client` which will be used to interact with the Gemini API.
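
The project ID lookup in the cell below prefers an explicitly set value and falls back to the `GOOGLE_CLOUD_PROJECT` environment variable. As a minimal sketch of that fallback logic, the hypothetical helper `resolve_project_id` (not part of the notebook) takes the environment as a parameter so it can be tried without touching real settings:

```python
import os
from typing import Mapping, Optional

PLACEHOLDER = "[your-project-id]"


def resolve_project_id(
    user_value: str, env: Optional[Mapping[str, str]] = None
) -> str:
    """Return the explicit project ID, else fall back to GOOGLE_CLOUD_PROJECT."""
    env = os.environ if env is None else env
    # An explicitly set value wins over the environment.
    if user_value and user_value != PLACEHOLDER:
        return user_value
    fallback = env.get("GOOGLE_CLOUD_PROJECT", "")
    if not fallback:
        raise ValueError("Please set your Google Cloud Project ID.")
    return fallback


print(resolve_project_id("my-project", env={}))
print(resolve_project_id(PLACEHOLDER, env={"GOOGLE_CLOUD_PROJECT": "env-project"}))
```

The project names here are made up for illustration.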

In [None]:
import os

from google import genai

PROJECT_ID = "[your-project-id]"  # @param {type: "string", placeholder: "[your-project-id]", isTemplate: true}
if not PROJECT_ID or PROJECT_ID == "[your-project-id]":
    # Fall back to the environment variable if the user did not set a project ID
    PROJECT_ID = os.environ.get("GOOGLE_CLOUD_PROJECT", "")
    if not PROJECT_ID:
        raise ValueError("Please set your Google Cloud Project ID.")
print(f"Using Project ID: {PROJECT_ID}")

LOCATION = os.environ.get(
    "GOOGLE_CLOUD_REGION", "us-central1"
)  # Default to us-central1 if not set
print(f"Using Location: {LOCATION}")

# Initialize the Google Gen AI client
client = genai.Client(vertexai=True, project=PROJECT_ID, location=LOCATION)

### Import libraries
Import necessary libraries, including `google.genai.types` for defining specific configurations for the API call, and `IPython.display` for better rendering of markdown in the notebook.

In [4]:
from IPython.display import Markdown, display
from google.genai import types

## Generating Negative News Reports with Gemini

### Load model
Specify the model ID to be used. We are using `gemini-2.5-flash`, a fast and versatile model with reasoning capabilities.

In [5]:
MODEL_ID = "gemini-2.5-flash"  # @param {type:"string"}

### Define Prompt Template and System Instructions

**System Instructions:** These give the model high-level guidance on its role. Here they tell the model to act as a professional news analyst whose reports are thorough, accurate, and professionally presented.

**Prompt Template:** This is the specific query sent to the model for each entity. It lists the activities to focus the search on, the steps to follow (search thoroughly, include the date of each article, and handle the case where no negative news is found), and the desired output format. It includes a placeholder `{input_entity}`, which is filled in with the actual entity name during execution.
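
As a small illustration of how the placeholder is filled, the sketch below formats a shortened, hypothetical stand-in for the template with `str.format`. The entity name is made up.

```python
# A shortened, hypothetical stand-in for the full prompt template.
demo_template = """Input Entity:
{input_entity}

If there are no negative news articles, respond with: "There are no results found for {input_entity}."
"""

# str.format replaces every occurrence of the named placeholder.
demo_prompt = demo_template.format(input_entity="Example Shipping Ltd")
print(demo_prompt)
```

Because `str.format` treats braces as placeholders, any literal `{` or `}` added to the template would need to be doubled (`{{` and `}}`).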

In [6]:
prompt_template = """
Your task is to provide a comprehensive and professional report of negative news articles for a given input entity. The input entity can be a person, company, or ship.

Input Entity:
{input_entity}

Activities:
Money Laundering
Forgery
Bribery and Corruption
Human Trafficking

Follow these steps:

1.  If the input entity is a company, map it to its legal business name.
2.  Thoroughly search Google News for negative news articles related to the input entity and the specified activities across all time.
3.  Summarize and interpret the Google Search results for the given input entity and each activity. If there are no results for a given activity, skip it.
4.  For each activity with search results, create a headline that summarizes the event.
5.  Group the search results under the corresponding headline, including the date of each news article.
6.  For person names, strictly follow the entity names.
7.  If there are no negative news articles associated with the input entity, respond with: "There are no results found for {input_entity}."


Output Format:

Headline: [Summary of the event]
Date: [Date of the news article]
Summary: [Brief summary of the news article]

Example:

Headline: John Doe Accused of Money Laundering\n
Date: 2023-01-15\n
Summary: John Doe is accused of laundering money through offshore accounts, according to a report by the International Consortium of Investigative Journalists.

Ensure that the report is comprehensive, accurate, and professionally presented.
"""

system_instructions_text = """
You are a professional news analyst tasked with providing comprehensive reports on negative news articles related to a given entity. Your reports must be thorough, accurate, and professionally presented.
"""

### Helper Function to Get Sources
The `get_sources` function processes the `grounding_metadata` from the model's response. This metadata contains information about the web pages the model used to generate its answer (when using Google Search as a tool). The function extracts titles and URLs for these sources and formats them for display. This is crucial for verifying the information provided by the model.
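
To make the shape of the metadata concrete, the sketch below builds a hypothetical stand-in with `SimpleNamespace` objects (not a real API response) and collects each cited chunk once. Only the attribute names `grounding_supports`, `grounding_chunk_indices`, `grounding_chunks`, `web.title`, and `web.uri` come from the function below. The titles and URLs are invented.

```python
from types import SimpleNamespace

# Hypothetical stand-in for response.candidates[0].grounding_metadata.
metadata = SimpleNamespace(
    grounding_chunks=[
        SimpleNamespace(web=SimpleNamespace(title="example.com", uri="https://example.com/a")),
        SimpleNamespace(web=SimpleNamespace(title="example.org", uri="https://example.org/b")),
    ],
    grounding_supports=[
        SimpleNamespace(grounding_chunk_indices=[1]),
        SimpleNamespace(grounding_chunk_indices=[0, 1]),
    ],
)

# Collect each cited chunk once, keyed by its 1-based display index.
cited = {}
for support in metadata.grounding_supports:
    for idx in support.grounding_chunk_indices:
        if idx < len(metadata.grounding_chunks):
            web = metadata.grounding_chunks[idx].web
            cited.setdefault(idx + 1, (web.title, web.uri))

# Format the sources as Markdown links in index order.
lines = [f"[[{n}] {title}]({uri})" for n, (title, uri) in sorted(cited.items())]
print("\n".join(lines))
```

Each support maps a span of the generated text to one or more chunk indices, so the same chunk can be cited several times. Keying by index keeps the source list free of duplicates.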

In [7]:
def get_sources(response):
    """Return a formatted string of sources corresponding with response citations

    Args:
        response: The response from the Gemini API containing grounding metadata

    Returns:
        A formatted string containing the sources with their titles and URLs
    """
    source_text = "\n\n**Sources:**\n"
    if not response.candidates or not response.candidates[0].grounding_metadata:
        return source_text + "No grounding metadata found.\n"

    metadata = response.candidates[0].grounding_metadata
    sources = {}
    source_titles = {}
    max_chunk_index = -1

    if not metadata.grounding_supports:
        return source_text + "No grounding supports found in metadata.\n"

    for support in metadata.grounding_supports:
        for chunk_index in support.grounding_chunk_indices:
            display_chunk_index = chunk_index + 1  # 1-based index for display
            if display_chunk_index > max_chunk_index:
                max_chunk_index = display_chunk_index
            if display_chunk_index not in source_titles and chunk_index < len(
                metadata.grounding_chunks
            ):
                chunk = metadata.grounding_chunks[chunk_index]
                source_titles[display_chunk_index] = chunk.web.title
                sources[display_chunk_index] = chunk.web.uri
            elif chunk_index >= len(metadata.grounding_chunks):
                print(
                    f"Warning: chunk_index {chunk_index} out of bounds for grounding_chunks (len: {len(metadata.grounding_chunks)})."
                )
    sorted_source_titles = dict(sorted(source_titles.items()))

    if sources:
        for i in sorted_source_titles:
            source_text += f"[[{i}] {sorted_source_titles[i]}]({sources[i]})\n"
    else:
        source_text += "No sources extracted from grounding metadata.\n"

    # Debugging information (optional, can be commented out)
    # print(f"Max Chunk Index: {max_chunk_index}")
    # print(f"Length of GroundingChunks: {len(metadata.grounding_chunks)}")
    return source_text

### Define Generation Function
The `generate_kyc_report_from_entity` function encapsulates the logic for calling the Gemini API. It takes an entity name, the system instructions text, and the prompt template as input.
Key configurations:
* **`model`**: Uses the `MODEL_ID` defined earlier.
* **`contents`**: The user prompt, formatted with the specific entity.
* **`tools`**: Configured to use `types.GoogleSearch()`, enabling the model to perform Google searches to find relevant information.
* **`generate_content_config`**:
    * `temperature=1`, `top_p=0.95`: These parameters control the randomness of the output. Higher temperature and top_p values lead to more diverse responses.
    * `max_output_tokens=8192`: Sets the maximum length of the generated response.
    * `response_modalities=["TEXT"]`: Specifies that we expect a text response.
    * **`safety_settings`**: **Important Note:** All four harm categories (`HATE_SPEECH`, `DANGEROUS_CONTENT`, `SEXUALLY_EXPLICIT`, `HARASSMENT`) are set to `BLOCK_NONE`, so responses are not blocked on these categories. This is done to ensure the model can retrieve and report on potentially sensitive topics related to negative news. However, in a production environment or for other use cases, you should carefully consider and configure appropriate safety settings based on your application's requirements and responsible AI practices.
    * `system_instruction`: The system instructions text, wrapped in a `types.Content` object.
    * `thinking_config`: `include_thoughts=True` asks the model to return its thought summaries alongside the final answer.

The function calls `client.models.generate_content` and returns the API response.
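
The four safety settings differ only in their category, so the list can also be built in a loop. The sketch below is illustrative only: `build_safety_settings` is a hypothetical helper, and it produces plain dictionaries whose `category` and `threshold` keys mirror the fields of `types.SafetySetting` used in the code cell. Whether your SDK version accepts dictionaries in place of `types.SafetySetting` objects is an assumption to verify before use.

```python
HARM_CATEGORIES = [
    "HARM_CATEGORY_HATE_SPEECH",
    "HARM_CATEGORY_DANGEROUS_CONTENT",
    "HARM_CATEGORY_SEXUALLY_EXPLICIT",
    "HARM_CATEGORY_HARASSMENT",
]


def build_safety_settings(threshold: str = "BLOCK_NONE") -> list:
    """Return one setting per harm category, all sharing the same threshold."""
    return [
        {"category": category, "threshold": threshold}
        for category in HARM_CATEGORIES
    ]


safety_settings = build_safety_settings()
print(safety_settings[0])
```

Keeping the threshold in one argument makes it easy to switch to a stricter value for other use cases.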

In [8]:
def generate_kyc_report_from_entity(
    entity_name: str, system_instructions: str, prompt_template: str
):
    """Generate a KYC report for a given entity using the Gemini API.

    Args:
        entity_name: The name of the entity to generate a report for
        system_instructions: The system instructions to guide the model's behavior
        prompt_template: The prompt template to use for the report
    Returns:
        response: The response from the Gemini API containing the generated report
    """
    current_prompt = prompt_template.format(input_entity=entity_name)
    contents = [current_prompt]

    tools = [
        types.Tool(google_search=types.GoogleSearch()),
    ]

    generate_content_config = types.GenerateContentConfig(
        temperature=1,
        top_p=0.95,
        max_output_tokens=8192,
        response_modalities=["TEXT"],  # Expect text modality output
        safety_settings=[
            # BLOCK_NONE turns off blocking so sensitive news topics can be reported
            types.SafetySetting(
                category="HARM_CATEGORY_HATE_SPEECH", threshold="BLOCK_NONE"
            ),
            types.SafetySetting(
                category="HARM_CATEGORY_DANGEROUS_CONTENT", threshold="BLOCK_NONE"
            ),
            types.SafetySetting(
                category="HARM_CATEGORY_SEXUALLY_EXPLICIT", threshold="BLOCK_NONE"
            ),
            types.SafetySetting(
                category="HARM_CATEGORY_HARASSMENT", threshold="BLOCK_NONE"
            ),
        ],
        tools=tools,
        system_instruction=types.Content(
            parts=[types.Part(text=system_instructions)]
        ),  # System instructions should be Content object
        thinking_config=types.ThinkingConfig(
            include_thoughts=True,
        ),
    )

    print(f"\n--- Generating report for: {entity_name} ---")
    response = client.models.generate_content(
        model=MODEL_ID,
        contents=contents,
        config=generate_content_config,
    )

    return response

### Define Entities and Run Analysis
Define a list of entities for which to generate reports. The `generate_kyc_report_from_entity_list` function iterates through this list, calls `generate_kyc_report_from_entity` for each entity, and displays the model's thought summaries, its text response, and the extracted sources. Using `display(Markdown(...))` renders the output in a more readable format.
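
Because the request sets `include_thoughts=True`, the returned parts mix thought summaries with the final answer. A minimal sketch of the split, using hypothetical `SimpleNamespace` parts in place of a real response (only the `text` and `thought` attribute names come from the code below):

```python
from types import SimpleNamespace

# Hypothetical parts, shaped like response.candidates[0].content.parts.
parts = [
    SimpleNamespace(text="Plan: search each activity for the entity.", thought=True),
    SimpleNamespace(text="There are no results found for Example Corp.", thought=None),
    SimpleNamespace(text="", thought=None),  # empty parts are skipped
]

# Thought summaries are the parts flagged with a truthy `thought` attribute.
thoughts = [p.text for p in parts if getattr(p, "thought", False) and p.text]
# Everything else with non-empty text belongs to the final answer.
answer = [p.text for p in parts if not getattr(p, "thought", False) and p.text]

print("Thoughts:", thoughts)
print("Answer:", answer)
```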

In [None]:
entities_to_check = [
    "Ricardo Martinelli",
    "Robert Burke",
]  # Example entities, you can change or extend this list


def generate_kyc_report_from_entity_list(
    entity_list: list[str], system_instructions_text: str, prompt_template: str
):
    """Generate and display a KYC report for each entity in a list.

    Args:
        entity_list: The list of entities to generate reports for
        system_instructions_text: The system instructions to guide the model's behavior
        prompt_template: The prompt template to use for each report
    """
    for entity in entity_list:
        response = generate_kyc_report_from_entity(
            entity, system_instructions_text, prompt_template
        )

        # Display the model's text response
        try:
            if response.candidates and len(response.candidates) > 0:
                parts = response.candidates[0].content.parts
                # First display any thought parts
                thought_parts = [
                    part for part in parts if hasattr(part, "thought") and part.thought
                ]
                if thought_parts:
                    display(Markdown("**Model Thoughts:**"))
                    for part in thought_parts:
                        if hasattr(part, "text") and part.text:
                            display(Markdown(part.text))

                # Then display the final response
                response_parts = [
                    part
                    for part in parts
                    if hasattr(part, "thought")
                    and not part.thought
                    and hasattr(part, "text")
                    and part.text
                ]
                if response_parts:
                    display(Markdown("**Model Response:**"))
                    for part in response_parts:
                        display(Markdown(part.text))
                else:
                    display(
                        Markdown(
                            "**Model Response:**\nNo text content found in response parts."
                        )
                    )
        except Exception as e:
            display(Markdown(f"**Error displaying response:** {str(e)}"))
            print("Full response object:", response)

        # Get and display grounding information (sources)
        sources_text = get_sources(response)
        display(Markdown(sources_text))

        print("------------------------------------------------------")


generate_kyc_report_from_entity_list(
    entities_to_check, system_instructions_text, prompt_template
)

## Evaluation
This section evaluates the generated reports with the Gen AI evaluation service in Vertex AI, using custom pointwise metrics.

Import necessary libraries

In [10]:
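from IPython.display import Markdown, display
import pandas as pd
import plotly.graph_objects as go
import vertexai
from vertexai.evaluation import EvalTask, PointwiseMetric, PointwiseMetricPromptTemplate

# Point the Vertex AI SDK at the same project and region as the Gen AI client
vertexai.init(project=PROJECT_ID, location=LOCATION)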

Define helper functions

In [11]:
def display_eval_result(eval_result, metrics=None):
    """Display the summary and row-level evaluation results.

    Args:
        eval_result: The evaluation result object containing metrics
        metrics: Optional list of metric name substrings used to filter the display
    """
    summary_metrics, metrics_table = (
        eval_result.summary_metrics,
        eval_result.metrics_table,
    )

    metrics_df = pd.DataFrame.from_dict(summary_metrics, orient="index").T
    if metrics:
        metrics_df = metrics_df.filter(
            [
                metric
                for metric in metrics_df.columns
                if any(selected_metric in metric for selected_metric in metrics)
            ]
        )
        metrics_table = metrics_table.filter(
            [
                metric
                for metric in metrics_table.columns
                if any(selected_metric in metric for selected_metric in metrics)
            ]
        )

    # Display the summary metrics
    display(Markdown("### Summary Metrics"))
    display(metrics_df)
    # Display the metrics table
    display(Markdown("### Row-based Metrics"))
    display(metrics_table)


def plot_bar_plot(eval_results, metrics=None):
    """Create and show a grouped bar plot of evaluation summary metrics.

    Args:
        eval_results: List of tuples containing (title, summary_metrics, metrics_table)
        metrics: Optional list of metric name substrings used to filter the plot
    """
    data = []

    for title, summary_metrics, _ in eval_results:
        if metrics:
            # Keep only the metrics whose name contains a selected substring
            summary_metrics = {
                k: v
                for k, v in summary_metrics.items()
                if any(selected_metric in k for selected_metric in metrics)
            }

        data.append(
            go.Bar(
                x=list(summary_metrics.keys()),
                y=list(summary_metrics.values()),
                name=title,
            )
        )

    fig = go.Figure(data=data)
    fig.update_layout(barmode="group")
    fig.show()
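
The `metrics` argument of both helpers filters by substring match on metric names. A minimal sketch of that filter, using made-up values and assumed key names (the real keys come from `eval_result.summary_metrics`):

```python
# Hypothetical summary metrics; the key names and values are assumptions.
summary_metrics = {
    "row_count": 2,
    "completeness_accuracy/mean": 4.0,
    "completeness_accuracy/std": 1.41,
    "structure_professionalism/mean": 5.0,
}

selected = ["completeness_accuracy"]

# Keep every entry whose name contains one of the selected substrings.
filtered = {
    name: value
    for name, value in summary_metrics.items()
    if any(metric in name for metric in selected)
}
print(filtered)
```

A short substring such as `"mean"` would instead keep the mean of every metric, which is handy for side-by-side bar plots.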

### Test Evaluation 1: Response Completeness and Accuracy

In [None]:
completeness_accuracy_template = PointwiseMetricPromptTemplate(
    criteria={
        "category_coverage": "The response should cover all relevant negative news categories (Money Laundering, Forgery, Bribery, Human Trafficking) if they exist for the entity.",
        "source_citation": "The response should properly cite sources for each claim.",
    },
    rating_rubric={
        "5": "Response covers all relevant categories and properly cites sources.",
        "3": "Response covers most categories but may miss some citations.",
        "1": "Response is incomplete or contains inaccuracies.",
    },
)

completeness_accuracy_metric = PointwiseMetric(
    metric="completeness_accuracy",
    metric_prompt_template=completeness_accuracy_template,
)

In [None]:
def generate_ground_truth_data(
    entities: list[str], system_instructions: str, prompt_template: str
) -> dict[str, list[str]]:
    """Generate ground truth data for evaluation using generate_kyc_report_from_entity.

    Args:
        entities: List of entity names to generate reports for
        system_instructions: System instructions for the model
        prompt_template: The prompt template to use for the report

    Returns:
        Dictionary containing prompts, references, and responses for each entity
    """
    ground_truth_data = {"prompt": [], "reference": [], "response": []}

    # Process each entity's data
    for entity in entities:
        # Generate the prompt
        prompt = prompt_template.format(input_entity=entity)
        ground_truth_data["prompt"].append(prompt)

        # Generate the report
        response = generate_kyc_report_from_entity(
            entity, system_instructions, prompt_template
        )

        # Extract the response text, skipping thought summaries
        if response.candidates and len(response.candidates) > 0:
            response_parts = [
                part.text
                for part in response.candidates[0].content.parts
                if getattr(part, "text", None) and not getattr(part, "thought", False)
            ]
            response_text = "\n".join(response_parts)
        else:
            response_text = "No response generated."

        ground_truth_data["response"].append(response_text)

        # Note: we'll use second run response as reference
        # In a real scenario, you might want to use human-verified references
        ref_response = generate_kyc_report_from_entity(
            entity, system_instructions, prompt_template
        )

        # Extract the reference text, skipping thought summaries
        if ref_response.candidates and len(ref_response.candidates) > 0:
            ref_response_parts = [
                part.text
                for part in ref_response.candidates[0].content.parts
                if getattr(part, "text", None) and not getattr(part, "thought", False)
            ]
            ref_response_text = "\n".join(ref_response_parts)
        else:
            ref_response_text = "No response generated."

        ground_truth_data["reference"].append(ref_response_text)

    return ground_truth_data


# Example usage:
entities_to_check = [
    "Ricardo Martinelli",
    "Robert Burke",
]

ground_truth_data = generate_ground_truth_data(
    entities_to_check, system_instructions_text, prompt_template
)

In [None]:
# Create evaluation dataset
eval_dataset = pd.DataFrame(ground_truth_data)

# Run evaluation
eval_task = EvalTask(
    dataset=eval_dataset,
    metrics=[completeness_accuracy_metric],
    experiment="kyc-completeness-accuracy",
)

eval_result = eval_task.evaluate()

# Display results
display_eval_result(eval_result)

### Test Evaluation 2: Response Structure and Professionalism

In [None]:
structure_professionalism_template = PointwiseMetricPromptTemplate(
    criteria={
        "formatting": "The response should follow the specified format with clear headlines, dates, and summaries.",
        "professional_tone": "The response should maintain a professional and objective tone throughout.",
        "clarity": "The information should be presented clearly and be easy to understand.",
    },
    rating_rubric={
        "5": "Response is well-structured, professional, and clear.",
        "3": "Response has good structure but could be more professional or clearer.",
        "1": "Response lacks proper structure or professionalism.",
    },
)

structure_professionalism_metric = PointwiseMetric(
    metric="structure_professionalism",
    metric_prompt_template=structure_professionalism_template,
)

# Create evaluation dataset
eval_dataset = pd.DataFrame(ground_truth_data)

# Run evaluation
eval_task = EvalTask(
    dataset=eval_dataset,
    metrics=[structure_professionalism_metric],
    experiment="kyc-structure-professionalism",
)

eval_result = eval_task.evaluate()

# Display results
display_eval_result(eval_result)

# Create visualization of results
plot_bar_plot(
    [
        (
            "Structure and Professionalism",
            eval_result.summary_metrics,
            eval_result.metrics_table,
        )
    ]
)

## Cleaning up
This notebook primarily makes API calls to Gemini models on Vertex AI and does not create resources such as VMs or storage buckets in your Google Cloud project. The evaluation cells do log runs to Vertex AI Experiments under the names `kyc-completeness-accuracy` and `kyc-structure-professionalism`; you can delete these experiments from the Google Cloud console if you no longer need them.

If you want to disable the Vertex AI API, you can do so from the Google Cloud console, but this would affect any other services or notebooks that rely on it.