# Benchmarking Anthropic's Prompt Caching

## Overview
This notebook quickly explores Anthropic's prompt caching feature, comparing time and costs for a large prompt. Below, we:

* Load a copy of the US Constitution (~70k tokens) and shove it into a system prompt
* Ask 5 questions about it without caching the prompt, using Anthropic models, and benchmark response times
* Ask the same questions about it *with* system prompt caching, and benchmark response times
* Compare it with Langchain prompt caching

You can read more about:
* Anthropic's prompt caching: https://docs.anthropic.com/en/docs/build-with-claude/prompt-caching
* OpenAI's prompt caching: https://platform.openai.com/docs/guides/prompt-caching  
* Langchain's caching: https://python.langchain.com/docs/how_to/caching_embeddings/#using-with-a-vector-store

## TL;DR (Too Long; Didn't Run)
Using an unscientific toy example, Anthropic's prompt caching feature reduced my runtime by 40-50% and cut costs by up to 65% after just 5 calls. But it only makes sense to use this feature with long prompts and repeated calls in a short time frame.

## Setup

First, set up some prerequisites: Libraries, API keys, etc.

In [16]:
!pip install -qU langchain langchain-anthropic bs4 anthropic

In [None]:
import getpass
anthropic_api_key = getpass.getpass("Anthropic API key: ")


Choose your preferred model below; just make sure it's a model for which Anthropic has prompt caching available.

The wait time between calls (`wait_time_seconds`) can be adjusted based on your account's rate limits; pausing between calls helps avoid hitting them. For a Tier 2 account using Sonnet, 60 seconds worked for me. (10 seconds worked with Haiku.)

In [12]:
model = "claude-3-5-sonnet-latest"
wait_time_seconds = 60

Grab the data -- the text of the US Constitution.

In [13]:
import requests
import os

from bs4 import BeautifulSoup

# Text of the U.S. Constitution - about 70k tokens!
url = "https://www.govinfo.gov/content/pkg/CDOC-110hdoc50/html/CDOC-110hdoc50.htm" 
txt_file_name = "constitution.txt"

if os.path.isfile(txt_file_name):
    with open(txt_file_name, "r", encoding="utf-8") as file:
        us_constitution_text = file.read()
else:
    response = requests.get(url, timeout=30)
    response.raise_for_status()  # fail loudly if the download did not succeed

    soup = BeautifulSoup(response.content, 'html.parser')
    us_constitution_text = soup.get_text(separator="\n", strip=True)

    # Save the text content to a .txt file so later runs can skip the download
    with open(txt_file_name, "w", encoding="utf-8") as file:
        file.write(us_constitution_text)

I generated these test questions using AI. The cell below also sets up a simple system prompt that embeds the full text.

In [3]:
test_questions = [
    "What six goals are outlined in the Preamble of the U.S. Constitution?",
    "How does Article I define the composition and powers of the House and Senate?",
    "What process does Article V establish for proposing and ratifying constitutional amendments?",
    "What powers are explicitly granted to Congress under Article I, Section 8?",
    "How does the Constitution address the resolution of disputes between states in Article III?"
]

system_prompt = f"""Answer the question based on the context. 

Context: {us_constitution_text}
"""

## Baseline: Ask the test questions with no prompt caching

In [None]:
import anthropic
import time

client = anthropic.Anthropic(api_key=anthropic_api_key)
for i,question in enumerate(test_questions):
  start_time = time.time()
  response = client.messages.create(
      model=model,
      max_tokens=1024,
      system=[
        {
          "type": "text",
          "text": system_prompt,
        }
      ],
      messages=[{"role": "user", "content": f"Question to analyze: {question}"}],
  )
  print(response.content)
  print(f"Cache usage: {response.usage}")
  print(f"Time: {time.time()-start_time}s")

  if i<(len(test_questions)-1):
    time.sleep(wait_time_seconds) 

[TextBlock(text='According to the Preamble of the U.S. Constitution, the six goals are:\n\n1. "Form a more perfect Union"\n2. "Establish Justice"\n3. "Insure domestic Tranquility"\n4. "Provide for the common defence"\n5. "Promote the general Welfare"\n6. "Secure the Blessings of Liberty to ourselves and our Posterity"\n\nThese goals are laid out in the opening paragraph of the Constitution, which reads:\n\n"We the People of the United States, in Order to form a more perfect Union, establish Justice, insure domestic Tranquility, provide for the common defence, promote the general Welfare, and secure the Blessings of Liberty to ourselves and our Posterity, do ordain and establish this Constitution for the United States of America."\n\nThese six goals represent the fundamental purposes and aspirations that the Constitution was designed to achieve for the United States.', type='text')]
Cache usage: Usage(cache_creation_input_tokens=0, cache_read_input_tokens=0, input_tokens=70549, output_t

## Comparison: Ask the same questions, but cache the system prompt

In [15]:
for i,question in enumerate(test_questions):
  start_time = time.time()
  response = client.messages.create(
      model=model,
      max_tokens=1024,
      system=[
        {
          "type": "text",
          "text": system_prompt,
          "cache_control": {"type": "ephemeral"}
        }
      ],
      messages=[{"role": "user", "content": f"Question to analyze: {question}"}],
  )
  print(response.content)
  print(f"Cache usage: {response.usage}")
  print(f"Time: {time.time()-start_time}s")

  if i<(len(test_questions)-1):
    time.sleep(wait_time_seconds) 

[TextBlock(text='According to the Preamble of the U.S. Constitution, the six goals outlined are:\n\n1. To form a more perfect Union\n2. To establish Justice\n3. To insure domestic Tranquility\n4. To provide for the common defence\n5. To promote the general Welfare\n6. To secure the Blessings of Liberty to ourselves and our Posterity\n\nThese goals are stated in the opening lines of the Constitution:\n\n"We the People of the United States, in Order to form a more perfect Union, establish Justice, insure domestic Tranquility, provide for the common defence, promote the general Welfare, and secure the Blessings of Liberty to ourselves and our Posterity, do ordain and establish this Constitution for the United States of America."\n\nThese six goals represent the fundamental purposes for which the Constitution was created and serve as guiding principles for the American system of government.', type='text')]
Cache usage: Usage(cache_creation_input_tokens=70520, cache_read_input_tokens=0, inp
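
To turn the per-call printouts into totals, something like the helper below could be used. This is a minimal sketch, not what I ran for the benchmark: `summarize_runs` and `fake_usage` are hypothetical names, and the numbers fed in are placeholders shaped like the `usage` objects above, not measured results.

```python
from statistics import median
from types import SimpleNamespace

FIELDS = (
    "cache_creation_input_tokens",
    "cache_read_input_tokens",
    "input_tokens",
    "output_tokens",
)

def summarize_runs(runs):
    """Sum the token counters and take the median latency over (usage, seconds) pairs."""
    totals = {name: sum(getattr(usage, name) for usage, _ in runs) for name in FIELDS}
    totals["median_seconds"] = median(seconds for _, seconds in runs)
    return totals

def fake_usage(created, read):
    # Stand-in for response.usage; input/output token counts are made-up placeholders.
    return SimpleNamespace(
        cache_creation_input_tokens=created,
        cache_read_input_tokens=read,
        input_tokens=25,
        output_tokens=300,
    )

# One cache-creating call followed by four cache reads (placeholder timings).
runs = [(fake_usage(70520, 0), 10.0)] + [(fake_usage(0, 70520), 7.0) for _ in range(4)]
summary = summarize_runs(runs)
print(summary)
```

In the loops above, you would append `(response.usage, elapsed_seconds)` to a list on each iteration and pass that list in instead of the placeholder data.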

## Results

| Model | Prompt Caching | Median Response Time | Cache Creation Tokens | Cache Read Tokens | Input Tokens | Output Tokens | Cost |
|-------|---------------|---------------------|---------------------------|----------------------|-------------------|-------------------|------|
| Sonnet-3.5 | N | **11.9 s** | 0 | 0 | 352729 | 1591 | **$1.08** |
| Sonnet-3.5 | Y | **10.6 s (cache creation), 7.0 s (subsequent)** | 70520 | 282080 | 129 | 1731 | **$0.38** |
| Haiku-3.5 | N | **16.1 s** | 0 | 0 | 352729 | 1591 | **$0.36** |
| Haiku-3.5 | Y | **13.4 s (cache creation), 8.0 s (subsequent)** | 70520 | 282080 | 129 | 1467 | **$0.12** |

So here we have it! Benefits of prompt caching for this example:

* After the initial cache creation, it's about **40-50% faster**. 
* Cache reads are billed at a lower rate than regular input tokens, so for the same model, the total **cost dropped by about 65%**!
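
For anyone who wants to check the arithmetic, here is a small sketch that recomputes the Sonnet rows from the token counts in the table. The per-million-token prices are my assumption (they reproduce the costs shown above, but check Anthropic's current pricing before relying on them), and `run_cost` is just an illustrative helper.

```python
# Assumed USD prices per million tokens for Claude 3.5 Sonnet; verify before use.
PRICES = {"input": 3.00, "cache_write": 3.75, "cache_read": 0.30, "output": 15.00}

def run_cost(input_tokens, output_tokens, cache_write_tokens=0, cache_read_tokens=0, prices=PRICES):
    """Total cost in USD for one benchmark run, given its summed token counts."""
    total = (
        input_tokens * prices["input"]
        + output_tokens * prices["output"]
        + cache_write_tokens * prices["cache_write"]
        + cache_read_tokens * prices["cache_read"]
    )
    return total / 1_000_000

# Token counts taken from the Sonnet rows of the results table.
no_cache = run_cost(352729, 1591)
with_cache = run_cost(129, 1731, cache_write_tokens=70520, cache_read_tokens=282080)
savings = 1 - with_cache / no_cache

print(f"${no_cache:.2f} without caching, ${with_cache:.2f} with caching, {savings:.0%} saved")
# prints: $1.08 without caching, $0.38 with caching, 65% saved
```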

Interestingly, the Haiku calls were _not_ faster than the Sonnet calls here as one might expect. This may have more to do with API load than the models themselves.

It would be interesting to compare these results to RAG and see how the time, cost and accuracy compare; is it faster / cheaper to shove everything in the prompt and cache it, or fetch the context from a vector DB? (Of course, the answer to this likely depends on the model, data and use case!)

## When to use Anthropic's prompt caching

Prompt caching really only makes sense when:
* You have long context (>1024 tokens), and
* You make repeated calls within a 5-minute window that reuse that same context

Anthropic's cache expires after 5 minutes of disuse (each cache hit resets the timer) -- so it's not worthwhile to cache if you are calling infrequently! In fact, since cache creation costs more than regular input tokens, this could actually _increase_ your costs.
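
A quick back-of-the-envelope sketch of that trade-off: the multipliers below are assumptions (cache writes at 1.25x and cache reads at 0.1x the normal input price, which matches the Sonnet prices I used for the cost column), and the function assumes every call lands while the cache is still alive.

```python
CACHE_WRITE_MULTIPLIER = 1.25  # assumption: cache creation costs 25% more than normal input
CACHE_READ_MULTIPLIER = 0.10   # assumption: cache reads cost 10% of normal input

def relative_prompt_cost(calls, cached):
    """Cost of sending the same prompt `calls` times, in units of one uncached prompt."""
    if calls < 1:
        raise ValueError("calls must be at least 1")
    if not cached:
        return float(calls)
    # First call writes the cache; every later call reads it.
    return CACHE_WRITE_MULTIPLIER + CACHE_READ_MULTIPLIER * (calls - 1)

for calls in (1, 2, 5):
    uncached = relative_prompt_cost(calls, cached=False)
    cached = relative_prompt_cost(calls, cached=True)
    print(f"{calls} call(s): uncached={uncached:.2f}, cached={cached:.2f}")
```

Under these assumptions a single call is more expensive with caching (1.25 vs 1.00), and caching starts to pay off from the second call inside the cache window.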

Also -- note that I used a direct call to Anthropic's API here, instead of using Langchain. I started out with Langchain, but prompt caching didn't work well for me -- it may be that Langchain 0.3 does not fully support this feature yet since Anthropic only recently (as of 1/15/25) took it out of beta. I did not try LlamaIndex.

## How it compares to OpenAI's

OpenAI automatically applies prompt caching to prompts longer than 1024 tokens, so in some ways it's simpler -- it's happening under the hood and doesn't need to be set up separately.

On the other hand, OpenAI only applies caching if there is an exact match at the beginning of the prompt. Anthropic's version is much more flexible, allowing the user to tag which parts of the prompt to cache or not.
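
As a rough illustration of that tagging, the sketch below builds a `system` list where only the long, stable block carries `cache_control`. My understanding of Anthropic's docs is that the marker caches the prompt up to and including the tagged block, so anything that changes per request should come after it; treat that as an assumption to verify. `build_system_blocks` and its arguments are hypothetical names.

```python
def build_system_blocks(stable_context, per_request_note):
    """Return a `system` list: cached stable context first, uncached variable text after."""
    return [
        {
            "type": "text",
            "text": stable_context,
            # Cache breakpoint: same marker used in the cached benchmark cell.
            "cache_control": {"type": "ephemeral"},
        },
        {
            "type": "text",
            "text": per_request_note,  # changes per call, so it is left untagged
        },
    ]

blocks = build_system_blocks(
    stable_context="Context: <full document text goes here>",
    per_request_note="Answer in at most three sentences.",
)
print([("cache_control" in block) for block in blocks])
```

The resulting list can be passed as `system=blocks` to `client.messages.create`, in the same way as the cached call earlier in this notebook.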

## What about Langchain's caching feature?

Langchain also has caching options, which can offer blazingly fast performance, but it only applies if the prompt is _identical_ each time. The speedup is much more dramatic than what Anthropic offers, but if only part of your prompt is the same, this won't help.
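
To see why "identical" matters, here is a toy exact-match cache in plain Python. It is not Langchain's implementation, just a sketch of the idea: `ExactMatchCache` and `fake_llm` are made-up names, and the fake model stands in for a real (slow, paid) API call.

```python
class ExactMatchCache:
    """Memoize responses keyed on the exact prompt string."""

    def __init__(self, llm_call):
        self._llm_call = llm_call
        self._store = {}
        self.misses = 0  # number of times the underlying model was actually called

    def ask(self, prompt):
        if prompt not in self._store:
            self.misses += 1
            self._store[prompt] = self._llm_call(prompt)
        return self._store[prompt]

def fake_llm(prompt):
    # Stand-in for a real model call.
    return f"answer to: {prompt}"

cache = ExactMatchCache(fake_llm)
cache.ask("What does Article V establish?")
cache.ask("What does Article V establish?")   # identical string: served from the cache
cache.ask("What does Article V establish? ")  # extra trailing space: counts as a new prompt
print(cache.misses)
```

Only the exact repeat is served from memory; even a one-character difference triggers a fresh model call, which is the gap that prefix caching and semantic caching try to fill.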

Below is some example code -- in this case I was able to get a response in <0.005 s for an identical prompt using an in-memory cache. 

If you expect to have prompts that are semantically similar but not quite identical, **semantic caching** is a great option to explore as well! For an example build of semantic caching with Redis, see the bottom of [this notebook](https://github.com/angelachapman/llmops/blob/main/Week%208/Day%201/Prototyping_LangChain_Application_with_Production_Minded_Changes_Assignment.ipynb).

And of course these caching options are not all mutually exclusive -- mixing and matching is probably the best way to optimize compute time and costs.

In [5]:
from langchain_anthropic import ChatAnthropic
from langchain_core.prompts import ChatPromptTemplate

# Set up a ChatAnthropic LLM without prompt caching, to baseline.
llm = ChatAnthropic(
    model="claude-3-5-haiku-latest",  
    anthropic_api_key=anthropic_api_key  
)

prompt = ChatPromptTemplate.from_messages(
    [
        ("system", system_prompt),
        ("human", "Question: {question}"),
    ]
)

chain = prompt | llm

In [7]:
from langchain_core.globals import set_llm_cache
from langchain_core.caches import InMemoryCache

set_llm_cache(InMemoryCache())

In [None]:
import time

for question in test_questions:
    start_time = time.time()
    print(chain.invoke({
            "question": question,
            })   
    )

    print(f"Time: {time.time()-start_time}\n")
    time.sleep(wait_time_seconds) # Tier 2 account, so sleep for a bit before calling again


In [10]:
import time

start_time = time.time()
print(chain.invoke({
    "question": test_questions[0],
    })   
)

print(f"Time: {time.time()-start_time}\n")

content='Based on the Preamble text in the document, the six goals outlined are:\n\n1. To form a more perfect Union\n2. To establish Justice\n3. To insure domestic Tranquility\n4. To provide for the common defence\n5. To promote the general Welfare\n6. To secure the Blessings of Liberty to ourselves and our Posterity\n\nThe full Preamble reads: "We the People of the United States, in Order to form a more perfect Union, establish Justice, insure domestic Tranquility, provide for the common defence, promote the general Welfare, and secure the Blessings of Liberty to ourselves and our Posterity, do ordain and establish this Constitution for the United States of America."' additional_kwargs={} response_metadata={'id': 'msg_01A4DYzJHYCQEPfAnx8AArYZ', 'model': 'claude-3-5-haiku-20241022', 'stop_reason': 'end_turn', 'stop_sequence': None, 'usage': {'cache_creation_input_tokens': 0, 'cache_read_input_tokens': 0, 'input_tokens': 70547, 'output_tokens': 165}} id='run-3434c0e5-348c-44fc-a8bf-930f