# Context Caching with Gemini


## Learning Objectives


1.  Understand the concept of context caching and its benefits when working with large language models.
2.  Learn how to use the Vertex AI SDK to create and utilize cached content with Gemini models.
3.  Compare the performance of using cached content versus generating content from scratch, highlighting the speed and cost advantages.

## Overview
This notebook demonstrates how to use context caching with Gemini models in Vertex AI. 

Context caching lets you store content that has already been processed, such as research papers or long video and audio files, together with system instructions, so the model does not have to re-process it on every request.<br>
When you query the model, it reuses the stored context, which leads to faster responses and lower resource consumption. This is particularly useful when working with large documents or when the same context is shared across multiple queries.
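To make the benefit concrete, here is a minimal back-of-the-envelope sketch in plain Python. It only counts how many context tokens are submitted as fresh input, and it deliberately ignores cache storage charges and any per-request billing for cached tokens, so it is not a pricing model. The function name `context_tokens_submitted` is hypothetical, and the token count is an illustrative figure.

```python
def context_tokens_submitted(context_tokens: int, num_queries: int, cached: bool) -> int:
    """Return how many context tokens are sent as fresh input across all queries."""
    if num_queries < 0:
        raise ValueError("num_queries must be non-negative")
    if cached:
        # The context is uploaded and processed once, when the cache is created.
        return context_tokens
    # Without a cache, the full context accompanies every single query.
    return context_tokens * num_queries


context_tokens = 43_127  # illustrative context size, in tokens
for num_queries in (1, 5, 20):
    uncached = context_tokens_submitted(context_tokens, num_queries, cached=False)
    with_cache = context_tokens_submitted(context_tokens, num_queries, cached=True)
    print(f"{num_queries:>2} queries: {uncached:>7} uncached vs {with_cache:>6} cached")
```

Under these simplifying assumptions, the gap grows linearly with the number of queries that share the same context.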

## Import

In [1]:
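import datetime

import vertexai
from vertexai.generative_models import Part
from vertexai.preview import caching
from vertexai.preview.generative_models import GenerativeModel

# Outside a preconfigured environment you may need to initialize the SDK first.
# The values below are placeholders, not part of the original notebook:
# vertexai.init(project="your-project-id", location="us-central1")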

Here we define the system instruction and the `contents` variable, a list of `Part` objects that each reference a research paper in PDF format stored in Google Cloud Storage.<br>
These are the papers that will be stored in the context cache.

In [2]:
system_instruction = """
You are an expert researcher. You always stick to the facts in the sources provided, and never make up new facts.
Now look at these research papers, and answer the following questions.
"""

contents = [
    Part.from_uri(
        "gs://cloud-samples-data/generative-ai/pdf/2312.11805v3.pdf",
        mime_type="application/pdf",
    ),
    Part.from_uri(
        "gs://cloud-samples-data/generative-ai/pdf/2403.05530.pdf",
        mime_type="application/pdf",
    ),
]

## Create a context cache

Let's create the cached content with `caching.CachedContent.create`, which sets up a cache from the following parameters:

*   `model_name`: Specifies the Gemini model to use ("gemini-1.5-pro-002" in this case).
*   `system_instruction`: Sets the instructions for how the model should behave.
*   `contents`: The actual documents or other data you want to store in the cache.
*   `ttl`: The time-to-live of the cache (60 minutes in this case), after which the cache will expire.
*   `display_name`: A name for easy identification.

The cell prints `cached_content.name`, the unique identifier that is used to retrieve the cached content later.
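As a rough illustration of what `ttl` means, the expiry time is approximately the creation time plus the time-to-live. The sketch below uses only the standard library. The helper names `expiry_time` and `is_expired` are hypothetical, the timestamp is an illustrative value, and the service may record an expiry that differs from this by a fraction of a second.

```python
import datetime


def expiry_time(create_time: datetime.datetime, ttl: datetime.timedelta) -> datetime.datetime:
    """Return the moment at which a cache created at create_time would expire."""
    return create_time + ttl


def is_expired(expire_time: datetime.datetime, now: datetime.datetime) -> bool:
    """Return True once the current time has reached the expiry time."""
    return now >= expire_time


# Illustrative creation time in UTC (not read from the service).
create_time = datetime.datetime(2025, 1, 28, 5, 37, 8, tzinfo=datetime.timezone.utc)
ttl = datetime.timedelta(minutes=60)

expire_time = expiry_time(create_time, ttl)
print(expire_time.isoformat())  # 2025-01-28T06:37:08+00:00

# Thirty minutes in, the cache is still valid; ninety minutes in, it is not.
print(is_expired(expire_time, create_time + datetime.timedelta(minutes=30)))  # False
print(is_expired(expire_time, create_time + datetime.timedelta(minutes=90)))  # True
```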

In [3]:
cached_content = caching.CachedContent.create(
    model_name="gemini-1.5-pro-002",
    system_instruction=system_instruction,
    contents=contents,
    ttl=datetime.timedelta(minutes=60),
    display_name="example-cache",
)

print(cached_content.name)

5149940940688850944


Let's list the existing context caches, which include the one we just created:

In [4]:
caching.CachedContent.list()

[<vertexai.caching._caching.CachedContent object at 0x7fa9e072cdc0>: {
   "name": "projects/237937020997/locations/us-central1/cachedContents/5149940940688850944",
   "model": "projects/takumiohym-sandbox/locations/us-central1/publishers/google/models/gemini-1.5-pro-002",
   "createTime": "2025-01-28T05:37:08.263884Z",
   "updateTime": "2025-01-28T05:37:08.263884Z",
   "expireTime": "2025-01-28T06:37:08.212948Z",
   "displayName": "example-cache",
   "usageMetadata": {
     "totalTokenCount": 43127,
     "textCount": 153,
     "imageCount": 167
   }
 },
 <vertexai.caching._caching.CachedContent object at 0x7fa9e072c8b0>: {
   "name": "projects/237937020997/locations/us-central1/cachedContents/3570303371388649472",
   "model": "projects/takumiohym-sandbox/locations/us-central1/publishers/google/models/gemini-1.5-pro-002",
   "createTime": "2025-01-28T05:29:13.873488Z",
   "updateTime": "2025-01-28T05:29:13.873488Z",
   "expireTime": "2025-01-28T06:29:13.845257Z",
   "displayName": "exam

## Generate with Cached Context

Now we can use the cached content to generate answers. The `GenerativeModel.from_cached_content` method returns a model that is bound to the previously created cached content.

In [5]:
# You can also retrieve existing cached content by its identifier, for example:
# cached_content = caching.CachedContent(cached_content_name=caching.CachedContent.list()[0].name)

cached_model = GenerativeModel.from_cached_content(
    cached_content=cached_content
)
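In the outputs above, `cached_content.name` printed only a numeric ID, while the listing showed full resource names of the form `projects/.../locations/.../cachedContents/...`. If you only have the full resource name, a small helper can extract the trailing ID. This is a plain string-parsing sketch, not an SDK function, and `cache_id_from_resource_name` is a hypothetical name.

```python
def cache_id_from_resource_name(resource_name: str) -> str:
    """Extract the trailing cache ID from a full cachedContents resource name."""
    parts = resource_name.split("/")
    # Expected layout: projects/<project>/locations/<location>/cachedContents/<id>
    well_formed = (
        len(parts) == 6
        and parts[0] == "projects"
        and parts[2] == "locations"
        and parts[4] == "cachedContents"
        and parts[5] != ""
    )
    if not well_formed:
        raise ValueError(f"Unexpected resource name: {resource_name!r}")
    return parts[5]


full_name = (
    "projects/237937020997/locations/us-central1/"
    "cachedContents/5149940940688850944"
)
print(cache_id_from_resource_name(full_name))  # 5149940940688850944
```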

In [6]:
%%time
response = cached_model.generate_content("What are the papers about?")
print(response.text)

The first paper, "Gemini: A Family of Highly Capable Multimodal Models," introduces Gemini 1.0, Google's family of multimodal models. These models are capable of understanding and generating text, images, audio, and video, and excel in reasoning, coding, and multilingual tasks. The paper details the model architecture, training infrastructure, and evaluation on various benchmarks, demonstrating state-of-the-art performance in many areas, including surpassing human expert performance on the MMLU benchmark.

The second paper, "Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context," introduces Gemini 1.5 Pro, an enhanced version capable of processing vastly longer contexts (up to 10 million tokens). This allows the model to handle very long documents, multiple hours of video, and days of audio. The paper focuses on evaluating Gemini 1.5 Pro's long-context capabilities through both synthetic and real-world tasks, including retrieval from extensive materials, l

## Generate without cached context

Let's compare the processing time by generating an answer **without** using the cached content. 
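The `%%time` cell magic used in this notebook only works inside IPython or Jupyter. In a plain Python script, a rough equivalent is to wrap the call with `time.perf_counter`. The sketch below assumes nothing about the Vertex AI SDK: `time_call` is a hypothetical helper and `fake_generate` is a stand-in for a real `generate_content` call.

```python
import time


def time_call(fn, *args, **kwargs):
    """Call fn and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed


def fake_generate(prompt: str, delay: float) -> str:
    """Stand-in for a model call; sleeps to simulate latency."""
    time.sleep(delay)
    return f"answer to: {prompt}"


result, elapsed = time_call(fake_generate, "What are the papers about?", 0.05)
print(result)
print(f"wall time: {elapsed:.3f} s")
```

In real use you would pass `cached_model.generate_content` or `model.generate_content` as `fn`, and repeat each measurement a few times, since the latency of a single request can vary.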

In [7]:
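# A plain model with no cache attached: the PDFs are sent with the request.
model = GenerativeModel("gemini-1.5-pro-002")

# Build a new list instead of appending to `contents`, so that re-running
# this cell does not add the question a second time.
uncached_contents = contents + ["What are the papers about?"]

In [8]:
%%time
response = model.generate_content(uncached_contents)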
print(response.text)

The first paper, "Gemini: A Family of Highly Capable Multimodal Models," introduces Gemini, a family of multimodal models developed by Google.  Gemini is designed to understand and generate text, images, audio, and video, excelling in tasks requiring reasoning, coding, and understanding across different modalities (like interpreting charts and diagrams within text).  The paper discusses the model's architecture, training, and performance on various benchmarks, emphasizing its state-of-the-art results and potential for real-world applications. A key aspect is Gemini's ability to handle a 32k context length, enabling it to process long documents and complex multimodal inputs.

The second paper, "Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context," focuses on Gemini 1.5 Pro, an enhanced version of the Gemini model.  The primary advancement is a vastly expanded context window, allowing the model to process up to 10 million tokens. This enables the model to 

Copyright 2024 Google LLC

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

     https://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.