# Demonstrate a Pro-feature: Prompt Caching using litellm

In [12]:
# Load the story file (the full text of Hamlet)
with open("hamlet.txt", "r", encoding="utf-8") as f:
    hamlet = f.read()

loc = hamlet.find("Speak, man")

print(hamlet[loc:loc + 100])

Speak, man.
  Laer. Where is my father?
  King. Dead.
  Queen. But not by him!
  King. Let him deman


## Connect with local Ollama LLM
### Initial setup

In [13]:
import os
from dotenv import load_dotenv
import litellm
from IPython.display import Markdown

load_dotenv()


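OLLAMA_NEW_BASE_URL = os.getenv("OLLAMA_NEW_BASE_URL")
LLAMA3_3B = os.getenv("LLAMA3_3B")
# Build the model name from the raw value; this also avoids nested double
# quotes inside the f-string, which only parse on Python 3.12+.
LLAMA3_3B_MODEL = f"ollama/{LLAMA3_3B}"

# Both settings are mandatory. Check the raw env value, because the f-string
# above is always truthy (it becomes "ollama/None" when the variable is unset).
if not (OLLAMA_NEW_BASE_URL and LLAMA3_3B):
    print("One or more mandatory config values are missing.")
else:
    print(f"Config loaded successfully for Ollama. \nBase URL: {OLLAMA_NEW_BASE_URL} \nModel: {LLAMA3_3B_MODEL}")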

Config loaded successfully for Ollama. 
Base URL: http://localhost:11434 
Model: ollama/llama3.2:3b


# Build with litellm

In [14]:
question = [{"role": "user", "content": "In Hamlet, when Laertes asks 'Where is my father?' what is the reply?"}]

In [15]:
openrouter_prefix = "openrouter/"

# Single quotes inside the f-strings keep these lines valid before Python 3.12
minimax_model = f"{openrouter_prefix}{os.getenv('MINIMAX_FREE')}"
zai_model = f"{openrouter_prefix}{os.getenv('Z_AI_FREE')}"

print(f"Models: \n1. {minimax_model} \n2. {zai_model}")


Models: 
1. openrouter/minimax/minimax-m2:free 
2. openrouter/z-ai/glm-4.5-air:free


In [10]:
response = litellm.completion(model=minimax_model, messages=question)

display(Markdown(response.choices[0].message.content))


In *Hamlet*, when Laertes asks "Where is my father?" in **Act 4, Scene 5**, **Claudius** (the King) replies:

> **"Dead."**

He then elaborates, saying, "But there's no great cry. **He lies at ease.**" (Some editions use "He lies in peace" instead of "ease").

### Context:
- Laertes has just returned from France after hearing his father, **Polonius**, was killed.
- Ophelia, Laertes' sister, enters the scene, having been driven mad by grief. She sings and distributes flowers.
- After Ophelia leaves, Laertes asks Claudius where Polonius is buried.
- Claudius, who secretly ordered Polonius's death and hid the body, lies about the funeral and Polonius's fate to placate Laertes.
- This exchange is part of Claudius' ongoing deception to prevent Laertes from seeking immediate revenge against Hamlet.

### Why this matters:
Claudius' lie ("He lies at ease") contrasts with the audience's knowledge that Polonius' body was hidden *unceremoniously* in a crypt, and it deepens the theme of concealment in the play. Laertes eventually learns the truth from the gravedigger in **Act 5, Scene 1**, fueling his role as Hamlet's foil in the fencing match.

# Find the tokens and cost for the previous call

In [33]:
print(f"Input tokens: {response.usage.prompt_tokens}")
print(f"Output tokens: {response.usage.completion_tokens}")
print(f"Total tokens: {response.usage.total_tokens}")
print(f"Reasoning tokens: {response.usage.completion_tokens_details.reasoning_tokens}")
print(f"Usage details: {response.usage}")

print(response._response_headers)

# TODO: check how to retrieve the details below from the response
# print(response.litellm_call_id)
# print(response.response_cost)

Input tokens: 59
Output tokens: 602
Total tokens: 661
Reasoning tokens: 318
Usage details: Usage(completion_tokens=602, prompt_tokens=59, total_tokens=661, completion_tokens_details=CompletionTokensDetailsWrapper(accepted_prediction_tokens=None, audio_tokens=None, reasoning_tokens=318, rejected_prediction_tokens=None, text_tokens=None), prompt_tokens_details=None)
{'date': 'Fri, 07 Nov 2025 00:57:17 GMT', 'content-type': 'application/json', 'transfer-encoding': 'chunked', 'connection': 'keep-alive', 'content-encoding': 'gzip', 'access-control-allow-origin': '*', 'vary': 'Accept-Encoding', 'permissions-policy': 'payment=(self "https://checkout.stripe.com" "https://connect-js.stripe.com" "https://js.stripe.com" "https://*.js.stripe.com" "https://hooks.stripe.com")', 'referrer-policy': 'no-referrer, strict-origin-when-cross-origin', 'x-content-type-options': 'nosniff', 'server': 'cloudflare', 'cf-ray': '99a8e01a5b8e7fc2-MAA'}
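One possible way to answer the open question about cost: litellm documents a `response_cost` entry in `response._hidden_params`. The helper below is a minimal sketch under that assumption. `recorded_cost` is a hypothetical name, the stand-in objects carry made-up values, and it has not been checked here whether the entry is populated for free OpenRouter models.

```python
from types import SimpleNamespace


def recorded_cost(response):
    """Return the cost stored on a litellm-style response, or None if absent."""
    # Assumption: the cost lives under `_hidden_params["response_cost"]`.
    # Fall back to an empty dict so objects without the attribute do not raise.
    hidden = getattr(response, "_hidden_params", None) or {}
    return hidden.get("response_cost")


# Stand-in objects with made-up values so the helper can be tried offline.
with_cost = SimpleNamespace(_hidden_params={"response_cost": 0.0012})
without_cost = SimpleNamespace()

print(recorded_cost(with_cost))
print(recorded_cost(without_cost))
```

In the notebook this would be called as `recorded_cost(response)` right after `litellm.completion(...)`.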


## Observation

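* Caching does not appear to be supported here, probably because we are using free models from OpenRouter: the usage above reports `prompt_tokens_details=None`, so no cached tokens are reported.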

In [37]:
response = litellm.completion(model=minimax_model, messages=question)

display(Markdown(f"## Response: \n{response.choices[0].message.content[:100]}"))

print(f"Input tokens: {response.usage.prompt_tokens}")
print(f"Output tokens: {response.usage.completion_tokens}")
print(f"Total tokens: {response.usage.total_tokens}")
print(f"Reasoning tokens: {response.usage.completion_tokens_details.reasoning_tokens}")

## Response: 
In *Hamlet* (Act 4, Scene 5), when Laertes bursts in demanding, "**Where is my father?**" (after Pol

Input tokens: 59
Output tokens: 839
Total tokens: 898
Reasoning tokens: 409
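To make the cache check explicit instead of reading the raw `Usage` printout, a small helper can pull the cached-token count out of the usage object. This is a sketch that assumes the OpenAI-style layout litellm exposes, where cached tokens appear as `usage.prompt_tokens_details.cached_tokens`. `cached_prompt_tokens` is a hypothetical name, and the stand-in objects only mimic that layout.

```python
from types import SimpleNamespace


def cached_prompt_tokens(usage):
    """Return how many prompt tokens were served from cache (0 if unreported)."""
    details = getattr(usage, "prompt_tokens_details", None)
    if details is None:
        # This is the case seen above: the provider reported no details at all.
        return 0
    return getattr(details, "cached_tokens", None) or 0


# Stand-ins: the first mirrors the usage printed above, the second is made up.
no_cache = SimpleNamespace(prompt_tokens=59, prompt_tokens_details=None)
cache_hit = SimpleNamespace(
    prompt_tokens=2048,
    prompt_tokens_details=SimpleNamespace(cached_tokens=1024),
)

print(cached_prompt_tokens(no_cache))
print(cached_prompt_tokens(cache_hit))
```

Calling `cached_prompt_tokens(response.usage)` after each request gives a quick yes/no signal on whether the provider reused a cached prefix.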


## Prompt Caching with OpenAI

For OpenAI:

https://platform.openai.com/docs/guides/prompt-caching

> Cache hits are only possible for exact prefix matches within a prompt. To realize caching benefits, place static content like instructions and examples at the beginning of your prompt, and put variable content, such as user-specific information, at the end. This also applies to images and tools, which must be identical between requests.
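The prefix rule suggests putting the large, unchanging text (here the play) first and the changing question last. The sketch below illustrates that ordering only. `build_messages` is a hypothetical helper, and the short string stands in for the full `hamlet` text. OpenAI's guide also states that caching applies only to prompts of 1024 tokens or more, so the 59-token question on its own would be too short to benefit.

```python
def build_messages(static_context, question):
    """Place static content first so repeated requests share an exact prefix."""
    return [
        # Identical across requests: this is the part a provider can cache.
        {"role": "system", "content": static_context},
        # Varies per request: kept at the end so it does not break the prefix.
        {"role": "user", "content": question},
    ]


# Short stand-in for the full play text loaded earlier in the notebook.
play_text = "Speak, man.\n  Laer. Where is my father?\n  King. Dead."

first = build_messages(play_text, "What is the reply to Laertes?")
second = build_messages(play_text, "Who says 'But not by him!'?")

# The leading message is identical, only the final message differs.
print(first[0] == second[0], first[-1] == second[-1])
```

Putting the question before the play text would change the very first tokens of every request and rule out any prefix match.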


Cached input tokens can be about 4X cheaper than uncached input tokens, depending on the model (see the pricing page):

https://openai.com/api/pricing/

## Prompt Caching with Anthropic

https://docs.anthropic.com/en/docs/build-with-claude/prompt-caching

Caching is explicit: you have to tell Claude which parts of the prompt to cache.
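Anthropic's API marks a cacheable block with a `cache_control` field of type `ephemeral` on a content block, and litellm passes this message format through for Anthropic models. The sketch below only builds the message structure. `anthropic_cached_messages` is a hypothetical helper, the short string stands in for the full play, and no request is sent.

```python
def anthropic_cached_messages(static_text, question):
    """Build messages where the static text is explicitly marked for caching."""
    return [
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": static_text,
                    # Marks the end of the prefix that should be cached.
                    "cache_control": {"type": "ephemeral"},
                },
                # The question comes after the cached block and is not cached.
                {"type": "text", "text": question},
            ],
        }
    ]


messages = anthropic_cached_messages(
    "Speak, man.\n  Laer. Where is my father?\n  King. Dead.",
    "What is the reply to Laertes?",
)
print(messages[0]["content"][0]["cache_control"])
```

These messages could then be passed as `messages=` to `litellm.completion` with an Anthropic model, which is an untested assumption in this notebook.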

You pay 25% more than the normal input price to "prime" (write) the cache.

After that, input tokens read from the cache cost 10X less than normal input tokens.
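A quick bit of arithmetic shows when priming the cache pays off. The sketch below takes the two figures above as multipliers (1.25 for the first, cache-writing call and 0.10 for each later cache read) and compares the average cost of the cached prefix with sending it uncached every time. `relative_prefix_cost` is a hypothetical helper, and it ignores cache expiry and the uncached part of each prompt.

```python
def relative_prefix_cost(n_calls, write_multiplier=1.25, read_multiplier=0.10):
    """Average cost of a cached prefix per call, relative to no caching (1.0)."""
    if n_calls < 1:
        raise ValueError("n_calls must be at least 1")
    # First call writes the cache; every later call reads from it.
    total = write_multiplier + (n_calls - 1) * read_multiplier
    return total / n_calls


for n in (1, 2, 10):
    print(n, round(relative_prefix_cost(n), 3))
```

A single call costs more than not caching at all (1.25), but by the second call the average for the prefix is already 0.675 of the uncached price.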

https://www.anthropic.com/pricing#api

## Gemini supports both 'implicit' and 'explicit' prompt caching

https://ai.google.dev/gemini-api/docs/caching?lang=python