# CrewAI + Anthropic Prompt Caching: Custom LLM Cookbook

This notebook demonstrates how to:
- Implement a custom LLM in CrewAI that calls Anthropic's Messages API
- Use Anthropic's prompt caching (5-minute TTL by default; optional 1-hour TTL with beta header)
- Cache a long public-domain text (Frankenstein by Mary Shelley) in the system prompt, then run a task twice to observe cache usage

You can run this notebook locally or in any Jupyter environment.

## Prerequisites
- An Anthropic API key (set it via environment variable or prompt in this notebook)
- Python 3.10+ (required by current CrewAI releases)
- Internet access to fetch the public-domain text from Project Gutenberg

Note on prompt caching TTLs:
- Default cache TTL is 5 minutes. No special header needed.
- Extended 1-hour TTL requires access to Anthropic's beta and a specific `anthropic-beta` header value provided by Anthropic. This is optional and shown later.
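
As a minimal sketch of the two options above, the helper below builds the `cache_control` value attached to a cached content block. The name `build_cache_control` is hypothetical (it is not part of CrewAI or the Anthropic SDK); it mirrors the logic used by the custom LLM later in this notebook, where the 5-minute default omits `ttl` and the 1-hour option sets it explicitly.

```python
from typing import Dict


def build_cache_control(ttl: str = "5m") -> Dict[str, str]:
    """Return a cache_control dict for a cached content block (illustrative helper)."""
    if ttl not in ("5m", "1h"):
        raise ValueError("ttl must be '5m' or '1h'")
    cache_control = {"type": "ephemeral"}
    # The 5-minute TTL is the default, so "ttl" is only set for the 1-hour option.
    if ttl == "1h":
        cache_control["ttl"] = "1h"
    return cache_control


print(build_cache_control())      # default 5-minute cache
print(build_cache_control("1h"))  # extended 1-hour cache
```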

In [1]:
# Install required packages
%pip install -U "crewai[tools]" anthropic requests --quiet

Note: you may need to restart the kernel to use updated packages.


In [19]:
import os
import time
import requests
from getpass import getpass
from typing import Any, Dict, List, Optional, Union
from copy import deepcopy

from crewai import BaseLLM, Agent, Task, Crew

## Configure your environment
Enter your Anthropic API key below (input is hidden). Alternatively, set `ANTHROPIC_API_KEY` in your environment before starting the notebook.
You can also tweak model/temperature and choose the cache TTL.

In [20]:
# Set your API key securely (input hidden). If already set in env, you won't be prompted.
if not os.getenv("ANTHROPIC_API_KEY"):
    os.environ["ANTHROPIC_API_KEY"] = getpass("Enter your ANTHROPIC_API_KEY (input hidden): ")

# Base configuration (feel free to adjust)
MODEL_NAME = "claude-sonnet-4-20250514"  # Update to your preferred Anthropic model
MAX_TOKENS = 1000
TEMPERATURE = 0.2
CACHE_TTL = "1h"  # choices: '5m' (the API default) or '1h' (may require the beta header below)

# If you have access to the extended 1-hour TTL beta, set the provided header value here.
# Example value shown below may change; use the exact value Anthropic gives you.
EXTENDED_TTL_BETA_HEADER = None  # e.g., "extended-cache-ttl-2025-04-11"

## Custom CrewAI LLM with Anthropic Prompt Caching
We subclass `crewai.BaseLLM` and implement a `call` method that:
- Builds the Anthropic Messages API payload
- Sends the system prompt as a list of content blocks, with `cache_control` set on the last block
- Adds the `anthropic-beta` header when using the 1-hour TTL (if you have access)
- Returns the assistant's text

We'll also log cache read/write token counts for observability.
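
The cache counters mentioned above arrive in the response's `usage` object. The sketch below shows one way to turn them into a short summary. `summarize_cache_usage` is a hypothetical helper, and it assumes that `input_tokens` counts only the uncached portion of the prompt, so the three counters add up to the total input.

```python
from typing import Any, Dict


def summarize_cache_usage(usage: Dict[str, Any]) -> Dict[str, Any]:
    """Classify a usage dict as a cache read, write, or miss (illustrative helper)."""
    read = usage.get("cache_read_input_tokens") or 0
    written = usage.get("cache_creation_input_tokens") or 0
    uncached = usage.get("input_tokens") or 0
    total_input = read + written + uncached
    if read:
        status = "read"
    elif written:
        status = "write"
    else:
        status = "none"
    return {
        "status": status,
        "total_input_tokens": total_input,
        "cached_fraction": read / total_input if total_input else 0.0,
    }


# Hypothetical usage values, for illustration only (not taken from a real response).
example_usage = {
    "input_tokens": 100,
    "cache_read_input_tokens": 900,
    "cache_creation_input_tokens": 0,
}
print(summarize_cache_usage(example_usage))
```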

In [21]:
import requests

class AnthropicPromptCachingLLM(BaseLLM):
    """
    A custom CrewAI LLM for Anthropic models that supports prompt caching.

    Args:
        model (str): Anthropic model name.
        api_key (Optional[str]): Anthropic API key. If None, reads from env.
        temperature (float): Sampling temperature.
        system_content (Optional[List[Dict[str, str]]]): List of content blocks for the system prompt.
        cache_ttl (str): '5m' (default) or '1h'.
        max_tokens (int): Max tokens for the response.
        api_version (str): Anthropic API version header value.
        extended_ttl_beta_header (Optional[str]): Anthropic beta header value required for 1-hour TTL.
    """

    def __init__(
        self,
        model: str,
        api_key: Optional[str] = None,
        temperature: float = 0.2,
        system_content: Optional[List[Dict[str, str]]] = None,
        cache_ttl: str = "5m",
        max_tokens: int = 1024,
        api_version: str = "2023-06-01",
        extended_ttl_beta_header: Optional[str] = None,
    ):
        super().__init__(model=model, temperature=temperature)
        self.api_key = api_key or os.getenv("ANTHROPIC_API_KEY")
        if not self.api_key:
            raise ValueError("Anthropic API key is required. Set ANTHROPIC_API_KEY or pass api_key.")

        if cache_ttl not in ("5m", "1h"):
            raise ValueError("cache_ttl must be either '5m' or '1h'.")

        self.system_content = system_content
        self.cache_ttl = cache_ttl
        self.max_tokens = max_tokens
        self.api_version = api_version
        self.extended_ttl_beta_header = extended_ttl_beta_header
        self.endpoint = "https://api.anthropic.com/v1/messages"

    def call(self, messages: Union[str, List[Dict[str, str]]], **kwargs: Any) -> str:
        # Normalize messages
        if isinstance(messages, str):
            messages = [{"role": "user", "content": messages}]

        # Drop system-role messages (such as CrewAI's generated agent prompt);
        # the cached system blocks are sent via the dedicated `system` param instead
        messages = [m for m in messages if m.get("role") != "system"]

        # Headers
        headers = {
            "x-api-key": self.api_key,
            "Content-Type": "application/json",
            "anthropic-version": self.api_version,
        }
        if self.cache_ttl == "1h" and self.extended_ttl_beta_header:
            headers["anthropic-beta"] = self.extended_ttl_beta_header

        # Payload
        payload: Dict[str, Any] = {
            "model": self.model,
            "messages": messages,
            "max_tokens": self.max_tokens,
            "temperature": self.temperature,
        }

        # System content with prompt caching
        if self.system_content:
            system_payload = deepcopy(self.system_content)

            cache_control: Dict[str, Any] = {"type": "ephemeral"}
            if self.cache_ttl == "1h":
                cache_control["ttl"] = "1h"

            if system_payload and isinstance(system_payload[-1], dict):
                system_payload[-1]["cache_control"] = cache_control

            payload["system"] = system_payload

        # Call Anthropic API
        try:
            resp = requests.post(self.endpoint, headers=headers, json=payload, timeout=120)
            resp.raise_for_status()
            result = resp.json()

            # Log cache usage if present
            usage = result.get("usage", {})
            read_tokens = usage.get("cache_read_input_tokens", 0)
            write_tokens = usage.get("cache_creation_input_tokens", 0)
            if read_tokens:
                print(f"Cache Hit: Read {read_tokens} tokens from cache.")
            if write_tokens:
                print(f"Cache Write: Wrote {write_tokens} tokens to cache.")

            return result["content"][0]["text"]

        except requests.Timeout as e:
            raise TimeoutError("Anthropic API request timed out.") from e
        except requests.RequestException as e:
            # Print the API's error body when available; it usually explains the failure
            if getattr(e, "response", None) is not None:
                try:
                    print(f"Error Response: {e.response.text}")
                except Exception:
                    pass
            raise RuntimeError(f"Anthropic API request failed: {e}") from e
        except (KeyError, IndexError) as e:
            raise ValueError(f"Failed to parse Anthropic API response: {e}") from e

## Fetch a long public-domain text (Frankenstein)
We'll use Mary Shelley's "Frankenstein; or, The Modern Prometheus" from Project Gutenberg (public domain). The text is long enough to demonstrate prompt caching and, truncated if necessary, fits within a 200k-token context window.
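
Before sending a long text, a rough size check can indicate whether truncation is needed. The sketch below assumes about 4 characters per token, which is only a rule of thumb for English prose and not an exact tokenizer count. `estimate_tokens` and `fits_context` are hypothetical helper names, and the 200,000-token window and 5,000-token reserve are example values.

```python
def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough token estimate from character count (heuristic, not a tokenizer)."""
    if chars_per_token <= 0:
        raise ValueError("chars_per_token must be positive")
    return int(len(text) / chars_per_token)


def fits_context(text: str, context_window: int = 200_000, reserve: int = 5_000) -> bool:
    """Check whether the estimate leaves `reserve` tokens for the question and answer."""
    return estimate_tokens(text) + reserve <= context_window


sample = "x" * 300_000  # same length as the truncation limit used in this demo
print(estimate_tokens(sample), fits_context(sample))
```

If the API reports a context-window error despite a passing estimate, treat the heuristic as too optimistic for that text and truncate further.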

In [22]:
# Project Gutenberg ebook ID for Frankenstein is 84
EBOOK_ID = 84


def fetch_gutenberg_text(ebook_id: int) -> str:
    # Try a few common URL patterns from Gutenberg
    urls = [
        f"https://www.gutenberg.org/cache/epub/{ebook_id}/pg{ebook_id}.txt",
        f"https://www.gutenberg.org/ebooks/{ebook_id}.txt.utf-8",
    ]
    for url in urls:
        try:
            r = requests.get(url, timeout=60)
            if r.ok and len(r.text) > 10_000:
                return r.text
        except Exception:
            pass
    raise RuntimeError("Failed to fetch the text from Project Gutenberg.")

text = fetch_gutenberg_text(EBOOK_ID)
print(f"Fetched {len(text):,} characters.")

# Optional: truncate to stay safely within context limits for 200k-token models.
# Adjust this if you know your model/context window capabilities.
MAX_CHARS_FOR_DEMO = 300_000
if len(text) > MAX_CHARS_FOR_DEMO:
    text = text[:MAX_CHARS_FOR_DEMO]
    print(f"Truncated to {len(text):,} characters for the demo.")

Fetched 446,543 characters.
Truncated to 300,000 characters for the demo.


In [12]:
system_blocks: List[Dict[str, str]] = [
    {
        "type": "text",
        "text": (
            "You are a literary analyst. Be concise, factual, and cite short, relevant "
            "passages when helpful. If unsure, say so."
        ),
    },
    {
        "type": "text",
        "text": text,
    },
]

llm = AnthropicPromptCachingLLM(
    model=MODEL_NAME,
    temperature=TEMPERATURE,
    system_content=system_blocks,
    cache_ttl=CACHE_TTL,
    api_key=os.getenv("ANTHROPIC_API_KEY"),
    max_tokens=MAX_TOKENS,
    extended_ttl_beta_header=EXTENDED_TTL_BETA_HEADER,
)

## Create a CrewAI Agent, Task, and Crew
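We'll ask the agent to summarize the main themes of Frankenstein in up to 5 bullet points, then run the task twice to observe prompt cache behavior: the first run writes to the cache and the second should read from it. If the same system content was already cached within the TTL (for example, by an earlier execution of this notebook), the first run reports a cache read as well, as in the recorded output below.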
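
The two runs can be timed with a small wrapper instead of repeating the `time.time()` bookkeeping. This is a sketch only: `timed` is a hypothetical helper, and it uses `time.perf_counter()`, which is intended for measuring short intervals. In the notebook you would pass `crew.kickoff` as the callable.

```python
import time
from typing import Any, Callable, Tuple


def timed(fn: Callable[..., Any], *args: Any, **kwargs: Any) -> Tuple[Any, float]:
    """Call fn and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


# Stand-in workload; replace with `crew.kickoff` in the notebook.
result, elapsed = timed(sum, [1, 2, 3])
print(f"result={result}, elapsed={elapsed:.4f} s")
```

Latency alone is a noisy signal; the cache counters in the response `usage` are the more dependable indicator of a hit.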

In [24]:
analyst = Agent(
    role="Literary Analyst",
    goal="Answer questions about Mary Shelley's Frankenstein accurately and concisely.",
    backstory="Expert in 19th-century literature with the full text available via prompt caching.",
    llm=llm,
    verbose=True,
)

task = Task(
    description="Summarize the main themes of Frankenstein in at most 5 bullet points.",
    expected_output="A concise bullet list of up to 5 items, each a single sentence.",
    agent=analyst,
)

crew = Crew(
    agents=[analyst],
    tasks=[task],
    verbose=True,
)

In [25]:
# First run (writes to cache, unless an earlier run within the TTL already populated it)
start = time.time()
result1 = crew.kickoff()
end = time.time()
print(f"First run time: {end - start:.2f} s\n")
print("--- Result (Run 1) ---")

from IPython.display import display, Markdown

display(Markdown(result1.raw))

Cache Hit: Read 72757 tokens from cache.


First run time: 12.62 s

--- Result (Run 1) ---



• The dangers of unchecked scientific ambition and the pursuit of knowledge without considering moral consequences, as shown through Victor's obsessive quest to create life.

• The destructive effects of isolation and the fundamental human need for companionship, demonstrated by both Victor's self-imposed solitude and the creature's desperate loneliness.

• The question of what truly makes someone human versus monstrous, exploring how society's rejection and treatment can shape one's identity and actions.

• The moral responsibility of creators toward their creations and the catastrophic consequences of abandoning that duty, as Victor flees from his creature.

• The destructive cycle of revenge and how hatred breeds more violence, escalating from the creature's initial abandonment to the mutual destruction of creator and creation.

In [26]:
# Second run (should read from cache)
start = time.time()
result2 = crew.kickoff()
end = time.time()
print(f"Second run time: {end - start:.2f} s\n")
print("--- Result (Run 2) ---")
display(Markdown(result2.raw))

Cache Hit: Read 72757 tokens from cache.


Second run time: 12.24 s

--- Result (Run 2) ---



• The dangers of unchecked scientific ambition and the pursuit of knowledge without considering moral consequences, as demonstrated by Victor's reckless creation of life.

• The responsibility of creators toward their creations and the tragic consequences of abandonment, shown through Victor's rejection of his creature.

• The nature of humanity and monstrosity, exploring what truly makes someone human versus monstrous through both Victor's and the creature's actions.

• Isolation and the fundamental human need for companionship, as both Victor and his creature suffer from profound loneliness that drives their destructive behaviors.

• The destructive cycle of revenge and how hatred perpetuates suffering, illustrated through the escalating violence between Victor and his creature that destroys innocent lives.

## Optional: Enable 1-hour cache TTL
If you have access to Anthropic's extended TTL beta, set `EXTENDED_TTL_BETA_HEADER` to the exact value Anthropic provides and pass `cache_ttl="1h"` when instantiating the LLM. Example:

```python
EXTENDED_TTL_BETA_HEADER = "extended-cache-ttl-2025-04-11"  # Example; replace with your value
llm_1h = AnthropicPromptCachingLLM(
    model=MODEL_NAME,
    temperature=TEMPERATURE,
    system_content=system_blocks,
    cache_ttl="1h",
    api_key=os.getenv("ANTHROPIC_API_KEY"),
    max_tokens=MAX_TOKENS,
    extended_ttl_beta_header=EXTENDED_TTL_BETA_HEADER,
)
```

Then create a new `Agent`, `Task`, and `Crew` with `llm_1h` and run twice as before.

## Troubleshooting
- If you see a context/window error, reduce `MAX_CHARS_FOR_DEMO` or use a shorter text.
- If you don't see cache logs, make sure Run 2 executes within the TTL window and that the system content is unchanged between runs. Prompts shorter than the model's minimum cacheable length are not cached.
- For 1-hour TTL, you must include the exact `anthropic-beta` header value provided by Anthropic and set `cache_ttl='1h'`.
- Ensure your Anthropic key has access to the chosen model.
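
A cache read requires the cached prefix to be identical between requests, so one way to rule out accidental changes to the system content is to fingerprint it before each run. The sketch below is illustrative only: `fingerprint_blocks` is a hypothetical helper that hashes the JSON form of the blocks locally, and it does not reproduce how Anthropic computes cache keys.

```python
import hashlib
import json
from typing import Dict, List


def fingerprint_blocks(blocks: List[Dict[str, str]]) -> str:
    """Return a stable SHA-256 hex digest of the system blocks (local check only)."""
    canonical = json.dumps(blocks, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


blocks_a = [{"type": "text", "text": "You are a literary analyst."}]
blocks_b = [{"type": "text", "text": "You are a literary analyst. "}]  # trailing space added

print(fingerprint_blocks(blocks_a) == fingerprint_blocks(blocks_a))  # identical content
print(fingerprint_blocks(blocks_a) == fingerprint_blocks(blocks_b))  # content differs
```

Comparing the fingerprint of `system_blocks` before Run 1 and Run 2 would show whether a missing cache read could be explained by a changed prompt.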