### We are using LiteLLM here, a lightweight library that lets us call many LLM providers through one `completion` function

In [3]:
import os
import requests
from dotenv import load_dotenv
from openai import OpenAI
from IPython.display import Markdown, display
from litellm import completion



tell_a_joke = [
    {"role": "user", "content": "Tell a joke for a student who is studying but is being disturbed by his orange cat"},
]

response = completion(model="openai/gpt-5-nano", messages=tell_a_joke)
reply = response.choices[0].message.content
display(Markdown(reply))

Trying to study with an orange cat around is tough. Every time I say “focus,” he says “paws” and sits on my notes, turning my study session into a cat-astrophe I never planned.

Above, the reply is read the same way as with the OpenAI Chat Completions API (`response.choices[0].message.content`). Below, LiteLLM also gives us the finer details of the call: token usage and cost.

In [4]:
print(f"Input tokens: {response.usage.prompt_tokens}")
print(f"Output tokens: {response.usage.completion_tokens}")
print(f"Total tokens: {response.usage.total_tokens}")
print(f"Total cost: {response._hidden_params['response_cost']*100:.4f} cents")  # single quotes inside the f-string also work before Python 3.12

Input tokens: 23
Output tokens: 2686
Total tokens: 2709
Total cost: 0.1076 cents
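
The same four prints come back after every call in this notebook, so they can be wrapped in a small helper. This is only a sketch: `usage_summary` is a hypothetical name, it assumes `_hidden_params` is a plain dict (as the cell above suggests), and it treats a missing or `None` cached-token count as "not reported". The fake response is a stand-in built from the numbers printed above so the helper can be tried without an API call.

```python
from types import SimpleNamespace


def usage_summary(response):
    """Collect token counts and cost (in cents) from a LiteLLM-style response."""
    usage = response.usage
    # Some providers do not report cache details; both lookups fall back to None.
    details = getattr(usage, "prompt_tokens_details", None)
    cached = getattr(details, "cached_tokens", None)
    # Assumption: _hidden_params behaves like a dict with a "response_cost" in USD.
    cost_usd = response._hidden_params.get("response_cost")
    return {
        "input_tokens": usage.prompt_tokens,
        "output_tokens": usage.completion_tokens,
        "cached_tokens": cached,
        "cost_cents": None if cost_usd is None else cost_usd * 100,
    }


# Stand-in object mimicking the response above (values copied from the output).
fake_response = SimpleNamespace(
    usage=SimpleNamespace(
        prompt_tokens=23,
        completion_tokens=2686,
        total_tokens=2709,
        prompt_tokens_details=None,
    ),
    _hidden_params={"response_cost": 0.001076},
)

summary = usage_summary(fake_response)
for name, value in summary.items():
    print(f"{name}: {value}")
```

With a real call, `usage_summary(response)` would take the place of the individual prints. Because `_hidden_params` is a private attribute, its contents may change between LiteLLM versions.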


### Now let's do something fun and useful: we will simulate a RAG-like feature with LiteLLM, and see how prompt caching works and how it helps reduce API costs!

In [5]:
with open("hamlet.txt", "r", encoding="utf-8") as f:
    hamlet = f.read()

loc = hamlet.find("Speak, man")
print(hamlet[loc:loc+100])

Speak, man.
  Laer. Where is my father?
  King. Dead.
  Queen. But not by him!
  King. Let him deman


In [25]:
question = [{"role": "user", "content": "In Hamlet, when Laertes asks 'Where is my father?' what is the reply?"}]

response = completion(model="gemini/gemini-2.5-flash-lite", messages=question)
display(Markdown(response.choices[0].message.content))

In Shakespeare's *Hamlet*, when Laertes, distraught and enraged after hearing of his father's death, cries out, **"Where is my father?"**, the reply he receives is:

**"He is dead."**

This stark and brutal reply is delivered by **Claudius**.

In [26]:
print(f"Input tokens: {response.usage.prompt_tokens}")
print(f"Output tokens: {response.usage.completion_tokens}")
print(f"Total tokens: {response.usage.total_tokens}")
print(f"Total cost: {response._hidden_params['response_cost']*100:.4f} cents")

Input tokens: 19
Output tokens: 61
Total tokens: 80
Total cost: 0.0026 cents


#### Notice that the answer above is wrong: in the play the reply is simply "Dead." Now let's try a RAG-like workaround and put the text of the play into the prompt. That much context costs a lot more, so we will also look at prompt caching

In [27]:
question[0]["content"] += "\n\nFor context, here is the entire text of Hamlet:\n\n"+hamlet

response = completion(model="gemini/gemini-2.5-flash-lite", messages=question)
display(Markdown(response.choices[0].message.content))

When Laertes asks, "Where is my father?", the reply is given by **Claudius, the King of Denmark**.

The specific reply is: **"Dead."**

In [28]:
print(f"Input tokens: {response.usage.prompt_tokens}")
print(f"Output tokens: {response.usage.completion_tokens}")
print(f"Cached tokens: {response.usage.prompt_tokens_details.cached_tokens}")
print(f"Total cost: {response._hidden_params['response_cost']*100:.4f} cents")

Input tokens: 53208
Output tokens: 36
Cached tokens: None
Total cost: 0.5335 cents


#### That is a lot for such a small answer: about half a cent, roughly 200 times the cost of the previous call! Now let's see the power of prompt caching

In [29]:
response = completion(model="gemini/gemini-2.5-flash-lite", messages=question)
display(Markdown(response.choices[0].message.content))

When Laertes asks "Where is my father?", the reply is:

**"Dead."**

This reply comes from Claudius, the King.

In [30]:
print(f"Input tokens: {response.usage.prompt_tokens}")
print(f"Output tokens: {response.usage.completion_tokens}")
print(f"Cached tokens: {response.usage.prompt_tokens_details.cached_tokens}")
print(f"Total cost: {response._hidden_params['response_cost']*100:.4f} cents")

Input tokens: 53208
Output tokens: 31
Cached tokens: 52216
Total cost: 0.0634 cents
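
To see where the saving comes from, the three Gemini costs printed in this notebook can be rebuilt from the token counts. The per-million-token prices below are assumptions, not values taken from a price list: they were chosen because they reproduce the printed figures, with cached input tokens billed at a tenth of the normal input rate. `estimate_cost_cents` is a hypothetical helper, not part of LiteLLM.

```python
# Assumed prices in USD per million tokens (fitted to the outputs in this notebook).
INPUT_PER_M = 0.10
CACHED_INPUT_PER_M = 0.01
OUTPUT_PER_M = 0.40


def estimate_cost_cents(prompt_tokens, completion_tokens, cached_tokens=0):
    """Estimate the cost of one call in cents under the assumed prices."""
    uncached = prompt_tokens - cached_tokens  # input tokens billed at the full rate
    usd = (
        uncached * INPUT_PER_M
        + cached_tokens * CACHED_INPUT_PER_M
        + completion_tokens * OUTPUT_PER_M
    ) / 1_000_000
    return usd * 100


# Token counts copied from the three Gemini calls in this notebook.
no_context = estimate_cost_cents(19, 61)
full_text = estimate_cost_cents(53208, 36)
cached_run = estimate_cost_cents(53208, 31, cached_tokens=52216)

print(f"No context:       {no_context:.4f} cents")
print(f"Full text:        {full_text:.4f} cents")
print(f"Full text cached: {cached_run:.4f} cents")
print(f"Share of input served from cache: {52216 / 53208:.1%}")
print(f"Cost reduction from caching: {full_text / cached_run:.1f}x")
```

Under these assumptions, almost the whole bill of the uncached call is input tokens, which is why caching about 98% of them cuts the cost by roughly a factor of 8.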


#### See how the cost dropped by roughly 8 times, and 52216 of the 53208 input tokens came from the cache! That's the power of prompt caching!
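
Prompt caching makes a repeated long prompt cheaper. A complementary step, closer to what RAG systems actually do, is to send only the passage that matters instead of the whole play. The sketch below reuses the `str.find` idea from the cell that located "Speak, man"; `excerpt_around` and its `before`/`after` window sizes are hypothetical, and a real RAG pipeline would normally use a search index or embeddings rather than an exact substring match. The sample text is a stand-in so the block runs without `hamlet.txt`.

```python
def excerpt_around(text, needle, before=200, after=400):
    """Return a window of text around the first occurrence of needle, or "" if absent."""
    loc = text.find(needle)
    if loc == -1:
        return ""
    start = max(0, loc - before)
    end = loc + len(needle) + after
    return text[start:end]


# Stand-in for the play: filler text around the lines quoted earlier in the notebook.
sample = (
    "A" * 50
    + "Speak, man.\n  Laer. Where is my father?\n  King. Dead.\n"
    + "B" * 50
)

excerpt = excerpt_around(sample, "Where is my father?", before=20, after=20)
print(excerpt)
print(f"Characters sent: {len(excerpt)} instead of {len(sample)}")
```

With the real file, `excerpt_around(hamlet, "Where is my father?")` could be appended to the question in place of the full `hamlet` string, which should bring the input token count down from tens of thousands to a few hundred.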