# Prompt Management and Context Caching with Gemini


## Learning Objectives

1.  Learn how to use the Vertex AI SDK to manage the lifecycle of prompt templates.
2.  Learn how to define, save, load and manage prompts directly within Python code.
3.  Understand the concept of context caching and its benefits when working with large language models.
4.  Learn how to use the Vertex AI SDK to create and utilize cached content with Gemini models.
5.  Compare the performance of using cached content versus generating content from scratch, highlighting the speed and cost advantages.

## Overview
This notebook explores two key aspects of working with generative AI on Google Cloud. The first part focuses on Vertex AI's prompt management capabilities, explaining how to programmatically create, version, and organize prompt templates using the Vertex AI SDK. The second part introduces the Gemini API's context caching feature, designed to optimize requests with large, consistent initial contexts. 

## Basic Setup

In [1]:
import datetime

import vertexai
from google import genai
from google.genai import types
from vertexai.preview import prompts
from vertexai.preview.prompts import Prompt

In [2]:
PROJECT = !(gcloud config get-value core/project)
PROJECT = PROJECT[0]
MODEL = "gemini-2.0-flash-001"

client = genai.Client(vertexai=True, project=PROJECT, location="us-central1")

## Prompt Management

Vertex AI offers prompt management through its user interface, Vertex AI Studio, and programmatically via the Vertex AI SDK. This section focuses on the latter method, demonstrating how to leverage the `vertexai.preview.prompts` module to define, save, and manage prompts specifically for Gemini text generation.

### The Prompt class

To effectively manage prompts, we will use the [Prompt class](https://cloud.google.com/vertex-ai/generative-ai/docs/reference/python/latest/vertexai.preview.prompts#prompt). This class represents a prompt object and encapsulates the prompt data, variables, generation configuration, and other relevant information.

Consider managing a social media page that features two-sentence stories with two main characters. The [Prompt class](https://cloud.google.com/vertex-ai/generative-ai/docs/reference/python/latest/vertexai.preview.prompts#prompt) can define a reusable prompt template, enabling the generation of multiple stories with varied character pairings from a single structure. 

Let's construct the prompt object.


In [3]:
# Initialize vertexai
vertexai.init(project=PROJECT, location="us-central1")

# Create local Prompt
prompt = Prompt(
    prompt_name="story-writer",
    prompt_data="Generate a story with 2 main characters: {A} and {B}.",
    model_name=MODEL,
    system_instruction="You are a story writer. Write a short story in 2 sentences. Don't replace the words in the variables with their synnonyms.",
)

prompt

Prompt(prompt_data='Generate a story with 2 main characters: {A} and {B}.', system_instruction=You are a story writer. Write a short story in 2 sentences. Don't replace the words in the variables with their synnonyms.), model_name=projects/nghiale-demo-358818/locations/us-central1/publishers/google/models/gemini-2.0-flash-001), prompt_name=story-writer)

Once a Prompt object exists, its prompt data and configuration properties can be used to generate content. Let's generate content for two different sets of variable values.

In [4]:
content_1 = prompt.assemble_contents(A="cat", B="dog")

response_1 = prompt.generate_content(contents=content_1)

print(response_1)

Assembled prompt replacing: 1 instances of variable A, 1 instances of variable B
candidates {
  content {
    role: "model"
    parts {
      text: "The cat, with a mischievous glint in its emerald eyes, plotted to swap the dog\'s bone for a squeaky toy, but the dog, sensing the feline\'s intentions, cleverly replaced the toy with a smelly sock.\n"
    }
  }
  finish_reason: STOP
  avg_logprobs: -0.4668608109156291
}
usage_metadata {
  prompt_token_count: 43
  candidates_token_count: 48
  total_token_count: 91
  prompt_tokens_details {
    modality: TEXT
    token_count: 43
  }
  candidates_tokens_details {
    modality: TEXT
    token_count: 48
  }
}
model_version: "gemini-2.0-flash-001"
create_time {
  seconds: 1745464148
  nanos: 987746000
}
response_id: "VKsJaOKkPLeTmecP6Nno-QM"



In [5]:
content_2 = prompt.assemble_contents(
    **{"A": "a king", "B": "a high school student"}
)

response_2 = prompt.generate_content(contents=content_2)

print(response_2)

Assembled prompt replacing: 1 instances of variable A, 1 instances of variable B
candidates {
  content {
    role: "model"
    parts {
      text: "King Alaric, bored with ruling his kingdom, wished for an escape, and a portal suddenly opened in his throne room, sucking him into detention with a bewildered high school student named Maya. Together, they had to navigate algebra tests and royal decrees to find a way back to their respective realities before either one of them failed senior year or lost a kingdom.\n"
    }
  }
  finish_reason: STOP
  avg_logprobs: -0.38257723384433323
}
usage_metadata {
  prompt_token_count: 47
  candidates_token_count: 72
  total_token_count: 119
  prompt_tokens_details {
    modality: TEXT
    token_count: 47
  }
  candidates_tokens_details {
    modality: TEXT
    token_count: 72
  }
}
model_version: "gemini-2.0-flash-001"
create_time {
  seconds: 1745464150
  nanos: 347889000
}
response_id: "VqsJaPGdFcuhmecP8vbe-Qc"
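
The log line `Assembled prompt replacing: 1 instances of variable A, 1 instances of variable B` reflects the template step: each `{name}` placeholder in `prompt_data` is filled with the value passed to `assemble_contents()`. As a rough local illustration only, the sketch below mimics that substitution with a regular expression. The helper `fill_template` is hypothetical and is not how the SDK implements this step; it only shows the idea on the template used above.

```python
import re


def fill_template(template, **variables):
    """Fill {name} placeholders. Hypothetical helper, not part of the Vertex AI SDK."""

    def substitute(match):
        name = match.group(1)
        if name not in variables:
            # Fail loudly instead of leaving an unfilled placeholder behind.
            raise KeyError(f"No value supplied for variable {name!r}")
        return variables[name]

    # \w+ matches simple variable names such as A or B.
    return re.sub(r"\{(\w+)\}", substitute, template)


template = "Generate a story with 2 main characters: {A} and {B}."
filled = fill_template(template, **{"A": "a king", "B": "a high school student"})
print(filled)
```

Raising on a missing variable is a design choice of this sketch: it surfaces a forgotten value before any request is sent.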



### Save, load and update a prompt

We can use the `vertexai.preview.prompts.create_version()` method to save a prompt online, making it accessible in the Google Cloud console. This method takes a Prompt object and creates a new version in the online store, returning an updated Prompt object linked to this resource. Remember that changes to a Prompt object are only saved online when `create_version()` is explicitly called.

In [6]:
prompt_v1 = prompts.create_version(prompt=prompt)

Created prompt resource with id 8304504122208419840 with version number 1


You can go to Google Cloud Console to check your newly created prompt in [Prompt Management](https://console.cloud.google.com/vertex-ai/studio/saved-prompts). You can also retrieve a saved prompt from the online resource using the `vertexai.preview.prompts.get()` method. Simply provide the prompt's unique ID to this function, and it will return the associated Prompt object, as demonstrated in the following code snippet. 

In [7]:
loaded_prompt = prompts.get(prompt_id=prompt_v1.prompt_id)

loaded_prompt

Prompt(prompt_data='Generate a story with 2 main characters: {A} and {B}.', system_instruction=You are a story writer. Write a short story in 2 sentences. Don't replace the words in the variables with their synnonyms.), model_name=projects/nghiale-demo-358818/locations/us-central1/publishers/google/models/gemini-2.0-flash-001), prompt_id=8304504122208419840)

After retrieving a prompt with `get()`, you can modify its attributes and save the modifications as a new version. For instance, assigning new content to `prompt_data` updates the prompt only locally; the change is saved online only when `create_version()` is called. Because the prompt is already associated with a prompt resource, `create_version()` generates a new version under the same `prompt_id` and returns a new `Prompt` object linked to the online resource.

In [8]:
loaded_prompt.prompt_data = (
    "Write a story with {A} as the protagonist and {B} as the antagonist."
)

prompt_v2 = prompts.create_version(prompt=loaded_prompt)

prompt_v2

Updated prompt resource with id 8304504122208419840 as version number 2


Prompt(prompt_data='Write a story with {A} as the protagonist and {B} as the antagonist.', system_instruction=You are a story writer. Write a short story in 2 sentences. Don't replace the words in the variables with their synnonyms.), model_name=projects/nghiale-demo-358818/locations/us-central1/publishers/google/models/gemini-2.0-flash-001), prompt_id=8304504122208419840, version_id=2, version_name=story-writer_2025_04_23_200923)

### List prompts and prompt versions

To see the display names and prompt IDs of all prompts saved in the current Google Cloud project, use the `list()` method.

In [9]:
prompts_metadata = prompts.list()

prompts_metadata

[PromptMetadata(display_name='story-writer', prompt_id='8304504122208419840'),
 PromptMetadata(display_name='story-writer', prompt_id='2044500640163430400'),
 PromptMetadata(display_name='story-writer', prompt_id='4660938092037799936'),
 PromptMetadata(display_name='story-writer', prompt_id='3767817990934888448'),
 PromptMetadata(display_name='story-writer', prompt_id='1732472434340134912'),
 PromptMetadata(display_name='story-writer', prompt_id='3786113864421081088'),
 PromptMetadata(display_name='story-writer', prompt_id='893677003742380032'),
 PromptMetadata(display_name='story-writer', prompt_id='1425101759772098560'),
 PromptMetadata(display_name='story-writer', prompt_id='347615548923707392'),
 PromptMetadata(display_name='movie-critic', prompt_id='803745748683325440'),
 PromptMetadata(display_name='雑談', prompt_id='5764847766324903936'),
 PromptMetadata(display_name='untitled_1743733676692', prompt_id='6377337315647291392'),
 PromptMetadata(display_name='Car Customer Service', pr

After checking the prompt list, you can use an index into it to retrieve a specific prompt.

In [10]:
retrieved_prompt = prompts.get(prompt_id=prompts_metadata[0].prompt_id)

retrieved_prompt

Prompt(prompt_data='Write a story with {A} as the protagonist and {B} as the antagonist.', system_instruction=You are a story writer. Write a short story in 2 sentences. Don't replace the words in the variables with their synnonyms.), model_name=projects/nghiale-demo-358818/locations/us-central1/publishers/google/models/gemini-2.0-flash-001), prompt_id=8304504122208419840)
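
Relying on a list index is fragile when several prompts share a display name, as in the listing above. One possible alternative is to filter the metadata by `display_name` first. The sketch below uses a stand-in `PromptMetadata` tuple (an assumption that mirrors only the two fields visible in the output) so that it runs without a Google Cloud project. The helper `find_prompt_ids` is hypothetical; it should also work on the result of `prompts.list()` if those objects expose the same two attributes.

```python
from typing import List, NamedTuple


class PromptMetadata(NamedTuple):
    """Stand-in for the SDK's metadata object; only the fields shown above."""

    display_name: str
    prompt_id: str


def find_prompt_ids(metadata: List[PromptMetadata], display_name: str) -> List[str]:
    """Return the IDs of every prompt whose display name matches exactly."""
    return [m.prompt_id for m in metadata if m.display_name == display_name]


# Sample entries copied from the listing above.
sample_metadata = [
    PromptMetadata("story-writer", "8304504122208419840"),
    PromptMetadata("story-writer", "2044500640163430400"),
    PromptMetadata("movie-critic", "803745748683325440"),
]

print(find_prompt_ids(sample_metadata, "movie-critic"))
print(find_prompt_ids(sample_metadata, "story-writer"))
```

Because display names are not unique, the helper returns a list and leaves the choice among duplicates to the caller.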

To see the display names and version IDs of all prompt versions saved within the prompt, use the `list_versions()` method.

In [11]:
prompt_versions_metadata = prompts.list_versions(prompt_id=prompt_v1.prompt_id)

prompt_versions_metadata

[PromptVersionMetadata(display_name='story-writer_2025_04_23_200914', prompt_id='8304504122208419840', version_id='1'),
 PromptVersionMetadata(display_name='story-writer_2025_04_23_200923', prompt_id='8304504122208419840', version_id='2')]

### Restore a prompt version

Prompt resources keep a history of saved versions. To revert to a previous version, use the `restore_version()` method, which makes that older version the latest one. This method returns metadata you can use with `get()` to retrieve the newly restored version.

For instance, the following code restores the prompt content to version id 1, the original version.

In [12]:
prompt_version_metadata = prompts.restore_version(
    prompt_id=prompt_v1.prompt_id, version_id="1"
)

# Fetch the newly restored latest version of the prompt
restored_prompt = prompts.get(prompt_id=prompt_version_metadata.prompt_id)

restored_prompt

Restored prompt version 1 under prompt id 8304504122208419840 as version number 3


Prompt(prompt_data='Generate a story with 2 main characters: {A} and {B}.', system_instruction=You are a story writer. Write a short story in 2 sentences. Don't replace the words in the variables with their synnonyms.), model_name=projects/nghiale-demo-358818/locations/us-central1/publishers/google/models/gemini-2.0-flash-001), prompt_id=8304504122208419840)
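
The log shows that restoring version 1 does not rewind the history: it adds version 3 with the content of version 1. The toy in-memory model below only illustrates that append-only behavior as observed in the outputs above. `ToyPromptHistory` and its methods are hypothetical names and say nothing about how the service actually stores prompts.

```python
from typing import List


class ToyPromptHistory:
    """Toy model of an append-only version history (illustration only)."""

    def __init__(self) -> None:
        self.versions: List[str] = []

    def create_version(self, prompt_data: str) -> int:
        """Append a new version and return its 1-based version number."""
        self.versions.append(prompt_data)
        return len(self.versions)

    def restore_version(self, version_id: int) -> int:
        """Copy an older version to the end of the history as the new latest."""
        return self.create_version(self.versions[version_id - 1])

    def latest(self) -> str:
        return self.versions[-1]


history = ToyPromptHistory()
history.create_version("Generate a story with 2 main characters: {A} and {B}.")
history.create_version(
    "Write a story with {A} as the protagonist and {B} as the antagonist."
)
restored_number = history.restore_version(1)

print(restored_number)
print(history.latest())
```

In this model nothing is ever overwritten, so version 2 remains available after the restore.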

### Delete a prompt

To delete the online resource associated with a prompt ID, use the `delete()` method.


In [13]:
prompts.delete(prompt_id=prompt_v1.prompt_id)

Deleted prompt resource with id 8304504122208419840.


## Context caching

The second section of this notebook demonstrates how to use context caching with Gemini models in Vertex AI. 

Context caching lets you store processed content, such as research papers or long video and audio files, along with system instructions, so that it does not have to be re-processed on every request. <br>
When you query the model, it can reuse the stored context, which leads to faster response times and lower resource consumption. This is particularly useful when working with large documents or when using the same context across multiple queries.
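
To get a feel for why reuse matters, the back-of-the-envelope sketch below counts how many input tokens have to be processed from scratch when the same context is queried several times. It is a simplified model, not a billing calculation: it ignores pricing and cache storage, which this notebook does not cover. The context size is the token count reported for the cache created later in this notebook; the question length and the number of queries are hypothetical values chosen for illustration.

```python
def fresh_input_tokens(
    context_tokens: int, question_tokens: int, num_queries: int, use_cache: bool
) -> int:
    """Count input tokens processed from scratch (simplified model)."""
    if use_cache:
        # The context is processed once, when the cache is created.
        return context_tokens + question_tokens * num_queries
    # Without a cache, the full context is resent with every question.
    return (context_tokens + question_tokens) * num_queries


CONTEXT_TOKENS = 43_127  # total_token_count of the cache created in this notebook
QUESTION_TOKENS = 8  # hypothetical length of one short question
NUM_QUERIES = 10  # hypothetical number of questions

without_cache = fresh_input_tokens(
    CONTEXT_TOKENS, QUESTION_TOKENS, NUM_QUERIES, use_cache=False
)
with_cache = fresh_input_tokens(
    CONTEXT_TOKENS, QUESTION_TOKENS, NUM_QUERIES, use_cache=True
)

print(without_cache)  # 431350
print(with_cache)  # 43207
```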

### Define the contents

Here we define the contents variable as a list of `Part` objects, each containing a reference to a research paper in PDF format stored in Google Cloud Storage.<br>
These are the papers that will be used for context caching.

In [14]:
system_instruction = """
You are an expert researcher. You always stick to the facts in the sources provided, and never make up new facts.
Now look at these research papers, and answer the following questions.
"""

contents = [
    types.Part.from_uri(
        file_uri="gs://asl-public/data/generative-ai/pdf/2312.11805v3.pdf",
        mime_type="application/pdf",
    ),
    types.Part.from_uri(
        file_uri="gs://asl-public/data/generative-ai/pdf/2403.05530.pdf",
        mime_type="application/pdf",
    ),
]

### Create context caching

Let's create the cached content with `client.caches.create`. Its parameters are:

*   `model`: Specifies the Gemini model to use ("gemini-2.0-flash-001" in this case).
*   `config`: Basic configuration, which includes:
    *   `system_instruction`: Sets the instructions for how the model should behave.
    *   `contents`: The actual documents or other data you want to store in the cache.
    *   `ttl`: The time-to-live of the cache (60 minutes in this case), after which the cache will expire.
    *   `display_name`: A name for easy identification.

The cell prints `cached_content.name`, the unique identifier used to reference the cached content later.

In [15]:
# Create a context cache holding the system instruction and the two PDFs
cached_content = client.caches.create(
    model=MODEL,
    config=types.CreateCachedContentConfig(
        system_instruction=system_instruction,
        contents=contents,
        ttl="3600s",
        display_name="example-cache",
    ),
)

print(cached_content.name)

projects/684496754124/locations/us-central1/cachedContents/5362817386942562304


Let's take a look at the created context cache!

In [16]:
# List the context caches visible to this client
for cache in client.caches.list():
    print(cache)

name='projects/684496754124/locations/us-central1/cachedContents/5362817386942562304' display_name='example-cache' model='projects/nghiale-demo-358818/locations/us-central1/publishers/google/models/gemini-2.0-flash-001' create_time=datetime.datetime(2025, 4, 24, 3, 9, 52, 640307, tzinfo=TzInfo(UTC)) update_time=datetime.datetime(2025, 4, 24, 3, 9, 52, 640307, tzinfo=TzInfo(UTC)) expire_time=datetime.datetime(2025, 4, 24, 4, 9, 52, 634948, tzinfo=TzInfo(UTC)) usage_metadata=CachedContentUsageMetadata(audio_duration_seconds=None, image_count=167, text_count=153, total_token_count=43127, video_duration_seconds=None)
name='projects/684496754124/locations/us-central1/cachedContents/505685188823482368' display_name='example-cache' model='projects/nghiale-demo-358818/locations/us-central1/publishers/google/models/gemini-2.0-flash-001' create_time=datetime.datetime(2025, 4, 24, 3, 3, 56, 907339, tzinfo=TzInfo(UTC)) update_time=datetime.datetime(2025, 4, 24, 3, 3, 56, 907339, tzinfo=TzInfo(UTC)
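
The `ttl` of `"3600s"` accounts for the `expire_time` in the listing, which falls about one hour after `create_time`. Below is a minimal sketch of that arithmetic, assuming the duration string is a number of seconds followed by `s`, as in the cell above. `ttl_to_timedelta` is a hypothetical helper, and the creation timestamp is rounded to whole seconds from the first cache in the listing.

```python
import datetime


def ttl_to_timedelta(ttl: str) -> datetime.timedelta:
    """Convert a duration such as "3600s" to a timedelta (hypothetical helper)."""
    if not ttl.endswith("s"):
        raise ValueError(f"Expected a duration like '3600s', got {ttl!r}")
    return datetime.timedelta(seconds=float(ttl[:-1]))


# create_time of the first cache above, rounded to whole seconds.
created = datetime.datetime(2025, 4, 24, 3, 9, 52, tzinfo=datetime.timezone.utc)
expires = created + ttl_to_timedelta("3600s")

print(expires.isoformat())  # 2025-04-24T04:09:52+00:00
```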

### Generate without cached context

For comparison, let's first generate the answer **without** cached content and note the processing time.

In [17]:
%%time
response = client.models.generate_content(
    model=MODEL, contents=contents + ["What are the papers about?"]
)

print(response.text)

The papers "Gemini: A Family of Highly Capable Multimodal Models" and "Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context" are both about Google's Gemini family of language models. Here's a breakdown of what each paper covers:

**1. "Gemini: A Family of Highly Capable Multimodal Models"**

*   **Introduction of the Gemini Family:** This paper introduces the Gemini model family, emphasizing its multimodal capabilities (understanding and generating across image, audio, video, and text).
*   **Model Sizes:** It describes three sizes of Gemini: Ultra (for complex tasks), Pro (for enhanced performance and scalability), and Nano (for on-device applications).
*   **Performance:** It presents benchmarks showing state-of-the-art results on a wide range of language understanding, reasoning, coding, and multimodal tasks. One key highlight is Gemini Ultra being the first model to achieve human-expert performance on the MMLU benchmark.
*   **Capabilities Showcase:**

### Generate with cached context

Now let's use the cached content to generate answers. The `cached_content` parameter refers to the cache created above.

In [18]:
%%time
# Reference the cache by name instead of resending the PDFs
response = client.models.generate_content(
    model=MODEL,
    contents="What are the papers about?",
    config=types.GenerateContentConfig(cached_content=cached_content.name),
)
print(response.text)

The papers are about Gemini, which is a new family of multimodal models developed by Google. Gemini exhibits capabilities across image, audio, video, and text understanding. It advances the state-of-the-art in large-scale language modeling, image understanding, audio processing, and video understanding. It also has capabilities in areas such as coding and reasoning.

One paper specifically discusses Gemini 1.5 Pro, the latest model of the Gemini family, and how it unlocks multimodal understanding across millions of tokens of context. This model is capable of recalling and reasoning over fine-grained information from millions of tokens of context and achieves near-perfect recall on long-context retrieval tasks across modalities.
CPU times: user 17.7 ms, sys: 1.19 ms, total: 18.9 ms
Wall time: 9.88 s


Compare the `Wall time` reported by `%%time` for this cell with the one for the uncached request: the cached request completes in substantially less time.

In conclusion, context caching speeds up requests that share a large context, and the gain grows with the size of that context. Storing and reusing the processed context avoids repeating the same work on every query.
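
`%%time` is only available inside IPython or Jupyter. To repeat the comparison in a plain Python script, one option is a small timing helper built on `time.perf_counter`. The sketch below is an assumption about how such a measurement might be structured: `timed` is a hypothetical helper, and the two stand-in functions simulate a slower and a faster call with `time.sleep` in place of real `generate_content` requests.

```python
import time


def timed(func, *args, **kwargs):
    """Call func and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    return result, time.perf_counter() - start


def uncached_stand_in():
    # Placeholder for a request that resends the full context.
    time.sleep(0.2)
    return "answer without cache"


def cached_stand_in():
    # Placeholder for a request that references a cache.
    time.sleep(0.05)
    return "answer with cache"


_, uncached_seconds = timed(uncached_stand_in)
_, cached_seconds = timed(cached_stand_in)

print(f"uncached stand-in: {uncached_seconds:.2f} s")
print(f"cached stand-in:   {cached_seconds:.2f} s")
```

With real requests, a single measurement can be noisy, so averaging several runs of each variant gives a fairer comparison.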

Copyright 2025 Google LLC

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

     https://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.