# Prompt caching through the Claude API

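Prompt caching lets you store and reuse context within your prompt. This makes it more practical to include additional information in the prompt, such as detailed instructions and example responses, which helps improve every response Claude generates.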

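In addition, by making full use of prompt caching, you can reduce latency by more than 2x and cut costs by up to 90%. When you build solutions that run repetitive tasks over the same detailed content, such as the full text of a book, the savings can be significant.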
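
To make the cost claim concrete, here is a minimal sketch of the arithmetic for repeated calls over the same large prompt. The two price multipliers are assumptions chosen for illustration (they are not stated in this tutorial), and `relative_input_cost` is a hypothetical helper; check current pricing before relying on the numbers.

```python
# Assumed price multipliers relative to the base input-token price.
# These are illustrative assumptions, not values taken from this tutorial.
CACHE_WRITE_MULTIPLIER = 1.25  # assumed: writing to the cache costs a premium
CACHE_READ_MULTIPLIER = 0.10   # assumed: reading from the cache costs 10% of base


def relative_input_cost(prompt_tokens: int, calls: int, cached: bool) -> float:
    """Return total input cost in units of the base price per token."""
    if calls <= 0:
        return 0.0
    if not cached:
        # Every call pays full price for the whole prompt.
        return float(prompt_tokens * calls)
    # The first call writes the cache; every later call reads from it.
    first_call = prompt_tokens * CACHE_WRITE_MULTIPLIER
    later_calls = prompt_tokens * CACHE_READ_MULTIPLIER * (calls - 1)
    return first_call + later_calls


prompt_tokens = 187_000  # approximate size of the book used in this tutorial
calls = 10               # hypothetical number of questions about the same book

uncached = relative_input_cost(prompt_tokens, calls, cached=False)
cached = relative_input_cost(prompt_tokens, calls, cached=True)
savings = 1 - cached / uncached
print(f"Relative input-cost savings over {calls} calls: {savings:.1%}")
```

Under these assumptions the savings grow with the number of calls and approach 90% as the one-time cache write is amortized. The sketch ignores output tokens and cache expiry.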

In this tutorial, we demonstrate how to use prompt caching in single-turn and multi-turn conversations.

## Setup

First, let's set up the necessary imports and initialize the environment:

In [3]:
%pip install anthropic bs4 --quiet

Note: you may need to restart the kernel to use updated packages.


In [4]:
import anthropic
import time
import requests
from bs4 import BeautifulSoup

client = anthropic.Anthropic()
MODEL_NAME = "claude-sonnet-4-5"

Now let's fetch some text to use in the examples. We'll use the text of Jane Austen's *Pride and Prejudice*, which is about 187,000 tokens long.

In [5]:
def fetch_article_content(url):
    response = requests.get(url)
    soup = BeautifulSoup(response.content, "html.parser")

    # Remove script and style elements
    for script in soup(["script", "style"]):
        script.decompose()

    # Get text
    text = soup.get_text()

    # Break into lines and remove leading and trailing space on each
    lines = (line.strip() for line in text.splitlines())
    # Break multi-headlines into a line each
    chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
    # Drop blank lines
    text = "\n".join(chunk for chunk in chunks if chunk)

    return text


# Fetch the content of the article
book_url = "https://www.gutenberg.org/cache/epub/1342/pg1342.txt"
book_content = fetch_article_content(book_url)

print(f"Fetched {len(book_content)} characters from the book.")
print("First 500 characters:")
print(book_content[:500])

Fetched 737525 characters from the book.
First 500 characters:
The Project Gutenberg eBook of Pride and Prejudice
This ebook is for the use of anyone anywhere in the United States and
most other parts of the world at no cost and with almost no restrictions
whatsoever. You may copy it, give it away or re-use it under the terms
of the Project Gutenberg License included with this ebook or online
at www.gutenberg.org. If you are not located in the United States,
you will have to check the laws of the country where you are located
before using this eBook.
Title:


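## Example 1: Single turn

Let's demonstrate prompt caching with a large document, comparing the performance and cost of cached and non-cached API calls.

### Part 1: Non-cached API call

First, let's make the non-cached API call. "Non-cached" here means that nothing is in the cache yet: the request already marks the book with `cache_control`, so this call loads the prompt into the cache and our subsequent cached API calls can benefit from it.

We'll ask for a short output string to keep output time low, because prompt caching only speeds up input processing.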

In [6]:
def make_non_cached_api_call():
    messages = [
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "<book>" + book_content + "</book>",
                    "cache_control": {"type": "ephemeral"},
                },
                {"type": "text", "text": "What is the title of this book? Only output the title."},
            ],
        }
    ]

    start_time = time.time()
    response = client.messages.create(
        model=MODEL_NAME,
        max_tokens=300,
        messages=messages,
    )
    end_time = time.time()

    return response, end_time - start_time


non_cached_response, non_cached_time = make_non_cached_api_call()

print(f"Non-cached API call time: {non_cached_time:.2f} seconds")
print(f"Non-cached API call input tokens: {non_cached_response.usage.input_tokens}")
print(f"Non-cached API call output tokens: {non_cached_response.usage.output_tokens}")

print("\nSummary (non-cached):")
print(non_cached_response.content)

Non-cached API call time: 20.37 seconds
Non-cached API call input tokens: 17
Non-cached API call output tokens: 8

Summary (non-cached):
[TextBlock(text='Pride and Prejudice', type='text')]
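
The call above reports only 17 input tokens even though the book is far larger. A likely reading, consistent with the cache fields printed in the multi-turn example later in this tutorial, is that tokens written to or read from the cache are reported separately from `input_tokens`. The sketch below shows one way to combine the three counters; `summarize_usage` is a hypothetical helper, and the example numbers are stand-ins, not real API output.

```python
from types import SimpleNamespace


def summarize_usage(usage) -> dict:
    """Split prompt tokens into uncached, cache-write, and cache-read parts."""
    # getattr with a default keeps this working if a field is missing or None.
    uncached = getattr(usage, "input_tokens", 0) or 0
    cache_write = getattr(usage, "cache_creation_input_tokens", 0) or 0
    cache_read = getattr(usage, "cache_read_input_tokens", 0) or 0
    total = uncached + cache_write + cache_read
    return {
        "uncached": uncached,
        "cache_write": cache_write,
        "cache_read": cache_read,
        "total_prompt_tokens": total,
        "cache_read_fraction": cache_read / total if total else 0.0,
    }


# Stand-in for response.usage, with hypothetical numbers for illustration.
example_usage = SimpleNamespace(
    input_tokens=17,
    cache_creation_input_tokens=187_000,
    cache_read_input_tokens=0,
)
summary = summarize_usage(example_usage)
print(summary)
```

With a real response you would pass `non_cached_response.usage` instead of the stand-in object.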


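### Part 2: Cached API call

Now let's make the cached API call. The request is the same as before: the book's content block carries the "cache_control": {"type": "ephemeral"} attribute, and because the first call already wrote that block to the cache, this call can read it back instead of reprocessing it. The code below relies on the attribute alone and does not send the "prompt-caching-2024-07-31" beta header.

To keep output latency constant, we ask Claude the same question as before. Note that the question is not part of the cached content.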

In [7]:
def make_cached_api_call():
    messages = [
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "<book>" + book_content + "</book>",
                    "cache_control": {"type": "ephemeral"},
                },
                {"type": "text", "text": "What is the title of this book? Only output the title."},
            ],
        }
    ]

    start_time = time.time()
    response = client.messages.create(
        model=MODEL_NAME,
        max_tokens=300,
        messages=messages,
    )
    end_time = time.time()

    return response, end_time - start_time


cached_response, cached_time = make_cached_api_call()

print(f"Cached API call time: {cached_time:.2f} seconds")
print(f"Cached API call input tokens: {cached_response.usage.input_tokens}")
print(f"Cached API call output tokens: {cached_response.usage.output_tokens}")

print("\nSummary (cached):")
print(cached_response.content)

Cached API call time: 2.92 seconds
Cached API call input tokens: 17
Cached API call output tokens: 8

Summary (cached):
[TextBlock(text='Pride and Prejudice', type='text')]


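As you can see, the cached API call took only 2.92 seconds in total, compared with 20.37 seconds for the non-cached call. Caching significantly improves overall latency.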
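
As a quick worked check, the two timings printed above can be turned into a speedup factor and a relative latency reduction. This is plain arithmetic on one pair of measurements, so treat it as indicative only; timings vary from run to run.

```python
# Timings copied from the printed outputs of the two calls above.
non_cached_seconds = 20.37
cached_seconds = 2.92

# How many times faster the cached call was.
speedup = non_cached_seconds / cached_seconds
# Fraction of the original latency that was eliminated.
latency_reduction = 1 - cached_seconds / non_cached_seconds

print(f"Speedup: {speedup:.1f}x")
print(f"Latency reduction: {latency_reduction:.1%}")
```

For this run, that works out to roughly a 7x speedup.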

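## Example 2: Multi-turn conversation with incremental caching

Now let's look at a multi-turn conversation in which we add cache breakpoints as the conversation progresses.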

In [8]:
class ConversationHistory:
    def __init__(self):
        # Initialize an empty list to store conversation turns
        self.turns = []

    def add_turn_assistant(self, content):
        # Add an assistant's turn to the conversation history
        self.turns.append({"role": "assistant", "content": [{"type": "text", "text": content}]})

    def add_turn_user(self, content):
        # Add a user's turn to the conversation history
        self.turns.append({"role": "user", "content": [{"type": "text", "text": content}]})

    def get_turns(self):
        # Retrieve conversation turns with specific formatting
        result = []
        user_turns_processed = 0
        # Iterate through turns in reverse order
        for turn in reversed(self.turns):
            if turn["role"] == "user" and user_turns_processed < 1:
                # Add the last user turn with ephemeral cache control
                result.append(
                    {
                        "role": "user",
                        "content": [
                            {
                                "type": "text",
                                "text": turn["content"][0]["text"],
                                "cache_control": {"type": "ephemeral"},
                            }
                        ],
                    }
                )
                user_turns_processed += 1
            else:
                # Add other turns as they are
                result.append(turn)
        # Return the turns in the original order
        return list(reversed(result))


# Initialize the conversation history
conversation_history = ConversationHistory()

# System message containing the book content
# Note: 'book_content' was fetched in the setup section above
system_message = f"<file_contents> {book_content} </file_contents>"

# Predefined questions for our simulation
questions = [
    "What is the title of this novel?",
    "Who are Mr. and Mrs. Bennet?",
    "What is Netherfield Park?",
    "What is the main theme of this novel?",
]


def simulate_conversation():
    for i, question in enumerate(questions, 1):
        print(f"\nTurn {i}:")
        print(f"User: {question}")

        # Add user input to conversation history
        conversation_history.add_turn_user(question)

        # Record the start time for performance measurement
        start_time = time.time()

        # Make an API call to the assistant
        response = client.messages.create(
            model=MODEL_NAME,
            max_tokens=300,
            system=[
                {"type": "text", "text": system_message, "cache_control": {"type": "ephemeral"}},
            ],
            messages=conversation_history.get_turns(),
        )

        # Record the end time
        end_time = time.time()

        # Extract the assistant's reply
        assistant_reply = response.content[0].text
        print(f"Assistant: {assistant_reply}")

        # Print token usage information
        input_tokens = response.usage.input_tokens
        output_tokens = response.usage.output_tokens
        input_tokens_cache_read = getattr(response.usage, "cache_read_input_tokens", "---")
        input_tokens_cache_create = getattr(response.usage, "cache_creation_input_tokens", "---")
        print(f"User input tokens: {input_tokens}")
        print(f"Output tokens: {output_tokens}")
        print(f"Input tokens (cache read): {input_tokens_cache_read}")
        print(f"Input tokens (cache write): {input_tokens_cache_create}")

        # Calculate and print the elapsed time
        elapsed_time = end_time - start_time

        # Calculate the percentage of input prompt cached
        total_input_tokens = input_tokens + (
            int(input_tokens_cache_read) if input_tokens_cache_read != "---" else 0
        )
        percentage_cached = (
            int(input_tokens_cache_read) / total_input_tokens * 100
            if input_tokens_cache_read != "---" and total_input_tokens > 0
            else 0
        )

        print(f"{percentage_cached:.1f}% of input prompt cached ({total_input_tokens} tokens)")
        print(f"Time taken: {elapsed_time:.2f} seconds")

        # Add assistant's reply to conversation history
        conversation_history.add_turn_assistant(assistant_reply)


# Run the simulated conversation
simulate_conversation()


Turn 1:
User: What is the title of this novel?
Assistant: The title of this novel is "Pride and Prejudice" by Jane Austen.
User input tokens: 4
Output tokens: 22
Input tokens (cache read): 0
Input tokens (cache write): 187354
0.0% of input prompt cached (4 tokens)
Time taken: 20.37 seconds

Turn 2:
User: Who are Mr. and Mrs. Bennet?
Assistant: Mr. and Mrs. Bennet are the parents of five daughters (Jane, Elizabeth, Mary, Kitty, and Lydia) in Pride and Prejudice. 

Mr. Bennet is an intelligent but detached father who often retreats to his library to avoid his wife's dramatics. He has a satirical wit and tends to be amused by the follies of others, including his own family members. He shows particular fondness for his second daughter Elizabeth, who shares his sharp mind and wit.

Mrs. Bennet is a woman primarily focused on getting her five daughters married to wealthy men. She is described as having "poor nerves" and is often anxious, dramatic, and somewhat foolish. Her main goal in life

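As this example shows, after the initial cache setup the response time drops from about 20 seconds to just 7 to 11 seconds, while every answer keeps the same level of quality. Most of the remaining latency comes from the time needed to generate the response, which prompt caching does not affect.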

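Because we keep moving the cache breakpoint forward on each subsequent turn, nearly 100% of the input tokens are cached, and the next user message can be read almost instantly.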
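
The idea of moving the breakpoint can also be sketched as a small standalone function, separate from the `ConversationHistory` class. This is an illustrative variant under stated assumptions, not the class's exact logic: `with_cache_breakpoint` is a hypothetical name, it works on a deep copy so the stored history never carries stale markers, and it places the marker on the last content block of the most recent user turn.

```python
import copy


def with_cache_breakpoint(turns: list) -> list:
    """Return a copy of `turns` with cache_control on the last user turn only."""
    result = copy.deepcopy(turns)  # leave the caller's history untouched
    for turn in reversed(result):
        if turn["role"] == "user":
            # Mark the final content block of the most recent user turn.
            turn["content"][-1]["cache_control"] = {"type": "ephemeral"}
            break
    return result


history = [
    {"role": "user", "content": [{"type": "text", "text": "What is the title of this novel?"}]},
    {"role": "assistant", "content": [{"type": "text", "text": "Pride and Prejudice."}]},
    {"role": "user", "content": [{"type": "text", "text": "Who are Mr. and Mrs. Bennet?"}]},
]

marked = with_cache_breakpoint(history)
for turn in marked:
    has_marker = "cache_control" in turn["content"][-1]
    print(turn["role"], "-> breakpoint" if has_marker else "-> no breakpoint")
```

Calling the function again after each new user turn moves the marker forward, so earlier turns fall inside the cached prefix.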