# cachy

> Cache your LLM requests and make your notebooks fast again

## Introduction 

`cachy` caches your LiteLLM API requests. It saves the result of each LLM call to a local `cachy.txt` file. Before calling an LLM API (e.g. Anthropic), it checks whether the request already exists in `cachy.txt`; if it does, the cached result is returned instead of making the call.

Under the hood LiteLLM uses `httpx.Client` and `httpx.AsyncClient` to call an LLM API. `cachy` patches the `send` method of both clients and injects a simple caching mechanism:

- create a cache key from the request
- if the key exists in `cachy.txt`, return the cached response
- if not, call the LLM API and save the response to `cachy.txt`
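
The three steps above can be sketched in a few lines. This is a minimal illustration rather than the library's code: `fake_llm_call`, `cached_call`, the example URL and the in-memory `store` dict are hypothetical stand-ins for the real HTTP request and for `cachy.txt`.

```python
import hashlib

store = {}   # hypothetical in-memory stand-in for cachy.txt
calls = []   # records every time the fake API is really invoked

def fake_llm_call(url, body):
    "Hypothetical stand-in for a real LLM API request."
    calls.append(url)
    return f"response to {body}"

def cached_call(url, body):
    # 1. create a cache key from the request
    key = hashlib.sha256((url + body).encode()).hexdigest()[:8]
    # 2. if the key exists, return the cached response
    if key in store: return store[key]
    # 3. otherwise call the API and save the response
    store[key] = fake_llm_call(url, body)
    return store[key]

first_res = cached_call("https://api.example.com/v1/chat", '{"prompt":"hi"}')
second_res = cached_call("https://api.example.com/v1/chat", '{"prompt":"hi"}')
```

After both calls, `calls` holds a single entry: the second request never reaches the fake API.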

In [None]:
#| default_exp cachy

In [None]:
#| export
import hashlib, httpx, json
from fastcore.utils import *

We want to restrict our caching to LLM API calls only. A simple way to do this is to check the request url. By default we cache OpenAI, Anthropic, Gemini and DeepSeek API calls.

In [None]:
#| export
doms = ['api.openai.com', 'api.anthropic.com', 'generativelanguage.googleapis.com', 'api.deepseek.com'] 

In [None]:
#| export
def _should_cache(url, doms): return any(dom in str(url) for dom in doms)
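
As a quick sanity check, the sketch below repeats the same substring test under a different name (`should_cache`, so that it runs on its own) and applies it to two made-up URLs. Only the first one contains a listed domain.

```python
def should_cache(url, doms):
    "Same substring check as `_should_cache`, repeated so this runs standalone."
    return any(dom in str(url) for dom in doms)

doms = ['api.openai.com', 'api.anthropic.com']

hit = should_cache("https://api.anthropic.com/v1/messages", doms)
miss = should_cache("https://example.com/v1/messages", doms)
```

Note that this is a plain substring match on the full URL, so a listed domain appearing anywhere in the URL counts as a hit.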

`cachy.txt` contains one LLM request/response pair per line.

Each line has the format `key|response`, where `key` is a hash of the LLM request.

```txt
416f3b8c|{"id":"chatcmpl-C74Ljo3pEBjR5tvlghY3RtHLFBcMQ","object":"chat.completion","created":1755801051,"model":"gpt-4o-2024-08-06","choices":[{"index":0,"message":{"role":"assistant","content":"Hello! How can I assist you today?","refusal":null,"annotations":[]},"logprobs":null,"finish_reason":"stop"}],"usage":{"prompt_tokens":9,"completion_tokens":9,"total_tokens":18,"prompt_tokens_details":{"cached_tokens":0,"audio_tokens":0},"completion_tokens_details":{"reasoning_tokens":0,"audio_tokens":0,"accepted_prediction_tokens":0,"rejected_prediction_tokens":0}},"service_tier":"default","system_fingerprint":"fp_80956533cb"}
```
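
For illustration, here is roughly how such a key can be derived. The sketch mirrors the hashing used in the patched `send` methods later in this notebook (URL, request body and stream flag, hashed with SHA-256 and truncated to 8 hex characters); `make_key` and the example request are hypothetical.

```python
import hashlib

def make_key(url, body, is_stream=None):
    "Hypothetical helper: hash the URL, body and stream flag into a short key."
    raw = url + body + str(is_stream)
    return hashlib.sha256(raw.encode()).hexdigest()[:8]

url = "https://api.openai.com/v1/chat/completions"
body = '{"model":"gpt-4o","messages":[{"role":"user","content":"hi"}]}'

k1 = make_key(url, body)
k2 = make_key(url, body)                  # same request, same key
k3 = make_key(url, body, is_stream=True)  # streaming gets its own key
```

Because the key is only 8 hex characters (32 bits), collisions are possible in principle, though unlikely for a notebook-sized cache.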

In [None]:
#| export
def _cache(key, cfp):
    with open(cfp, 'r') as f:
        line = first(f, lambda x: x.startswith(key))
        return last(line.strip().split('|',1)) if line else None

An LLM response can include the `|` character, so we use `.split('|', 1)` to split only on the first `|` and leave the response intact.
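
A small sketch of the difference, using a made-up cache line whose response contains `|`:

```python
line = '416f3b8c|{"content":"a|b|c"}\n'

# maxsplit=1 splits only on the first `|`, which separates key from response
key, response = line.strip().split('|', 1)

# without maxsplit the response itself would be broken into pieces
naive = line.strip().split('|')
```

Here `naive` ends up with four pieces, while the `maxsplit=1` version returns exactly the key and the untouched response.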

In [None]:
#| export
def _write_cache(key, content, cfp):
    with open(cfp, 'a') as f: f.write(f"{key}|{str(content)}\n")
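
Since every entry must fit on a single line, a body that itself contains newlines, such as a streamed server-sent-events payload, needs to be encoded before it is written. The patched `send` methods in this notebook use `json.dumps` for the streaming case; the sketch below shows why that works, using a made-up two-event payload and a made-up key.

```python
import json

# made-up streamed body: two events separated by blank lines
sse = 'data: {"delta":"Eff"}\n\ndata: {"delta":"icient"}\n\n'

encoded = json.dumps(sse)        # real newlines become escaped `\n` sequences
line = f"416f3b8c|{encoded}\n"   # safe to store as one line of the cache file

# reading it back: split off the key, then decode the JSON string
key, stored = line.strip().split('|', 1)
decoded = json.loads(stored)
```

The decoded value is identical to the original payload, so the cached stream can be replayed as it was received.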

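We patch `Client.send`. Because `send` already exists on the class, fastcore's `@patch` keeps the original method available as `_orig_send`, which we call for non-LLM URLs and on a cache miss. Streamed bodies are stored JSON-encoded so that they fit on a single line.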
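
The patching technique itself can be shown on a toy class. This sketch uses plain attribute assignment rather than fastcore's `@patch`, and `ToyClient`, `cached_send` and the hand-set `_orig_send` are illustrative names only; it is an analogy for the approach, not the code `cachy` runs.

```python
class ToyClient:
    "Hypothetical stand-in for an HTTP client."
    def send(self, request):
        return f"live:{request}"

cache = {}

# keep a reference to the original method before replacing it
ToyClient._orig_send = ToyClient.send

def cached_send(self, request):
    # serve from the cache when possible, otherwise defer to the original
    if request in cache: return f"cached:{cache[request]}"
    cache[request] = self._orig_send(request)
    return cache[request]

ToyClient.send = cached_send

client = ToyClient()
first_res = client.send("ping")   # goes through the original method
second_res = client.send("ping")  # answered from the cache
```

Every existing caller of `send` picks up the caching behaviour without any change on its side, which is what makes this approach convenient for wrapping a third-party client.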

In [None]:
#| export
def _apply_sync_patch(cfp, doms):    
    @patch
    def send(self:httpx._client.Client, r, **kwargs):
        is_stream = kwargs.get('stream')
        if not _should_cache(r.url, doms): return self._orig_send(r, **kwargs)
        key = hashlib.sha256((str(r.url)+r.content.decode()+str(is_stream)).encode()).hexdigest()[:8]
        if res := _cache(key, cfp): 
            return httpx.Response(status_code=200, content=json.loads(res) if is_stream else res, request=r)
        res = self._orig_send(r, **kwargs)
        if is_stream: content = b''.join(list(res.iter_bytes())).decode()
        else: content = json.dumps(json.loads(res.read().decode()), separators=(',',':'))
        _write_cache(key, json.dumps(content) if is_stream else content , cfp)
        return httpx.Response(status_code=res.status_code, content=content, request=r)

We patch `AsyncClient.send` in the same way.
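
The main difference in the async version is that the streamed body has to be drained with an async iterator. The sketch below shows the same join-the-chunks pattern on a made-up async generator (`fake_chunks` is a hypothetical stand-in for the response stream).

```python
import asyncio

async def fake_chunks():
    "Hypothetical stand-in for a streamed response body."
    for chunk in (b"Con", b"current", b"."):
        yield chunk

async def collect():
    # an async list comprehension drains the stream chunk by chunk
    return b"".join([c async for c in fake_chunks()]).decode()

content = asyncio.run(collect())
```

Inside a notebook, where an event loop is already running, you would write `content = await collect()` instead of calling `asyncio.run`.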

In [None]:
#| export
def _apply_async_patch(cfp, doms):    
    @patch
    async def send(self:httpx._client.AsyncClient, r, **kwargs):
        is_stream = kwargs.get('stream')
        if not _should_cache(r.url, doms): return await self._orig_send(r, **kwargs)
        key = hashlib.sha256((str(r.url)+r.content.decode()+str(is_stream)).encode()).hexdigest()[:8]
        if res := _cache(key, cfp): 
            return httpx.Response(status_code=200, content=json.loads(res) if is_stream else res, request=r)
        res = await self._orig_send(r, **kwargs)
        if is_stream: content = b''.join([c async for c in res.aiter_bytes()]).decode()
        else: content = json.dumps(json.loads(res.read().decode()), separators=(',',':'))
        _write_cache(key, json.dumps(content) if is_stream else content , cfp)
        return httpx.Response(status_code=res.status_code, content=content, request=r)

Now let's define a function that makes it easy for the user to enable caching.

In [None]:
#| export
def enable_cachy(cache_dir=None, doms=doms):
    cfp = Path(cache_dir or Config.find('settings.ini').config_path or '.') / 'cachy.txt'
    cfp.touch(exist_ok=True)   
    _apply_sync_patch(cfp, doms)
    _apply_async_patch(cfp, doms)

## Tests 

Let's test `enable_cachy` on the scenarios below for each provider:

- sync
- sync (with streaming)
- async
- async (with streaming)

In [None]:
enable_cachy()

### Sync Tests

In [None]:
from litellm import completion

In [None]:
class mods: ant="claude-sonnet-4-20250514"; oai="gpt-4o"; gem="gemini/gemini-2.0-flash"

In [None]:
def mk_msgs(m): return [{"role": "user", "content": f"write 1 words about {m}"}]

Let's define a helper function to display a streamed response.

In [None]:
def _stream(r): 
    for ch in r: print(ch.choices[0].delta.content or "")

#### Anthropic

Let's test `claude-sonnet-4`.

In [None]:
r = completion(model=mods.ant, messages=mk_msgs("ant sync..."))
r

ModelResponse(id='chatcmpl-5c241db5-4bfc-4751-9b97-47a8c966227e', created=1756123821, model='claude-sonnet-4-20250514', object='chat.completion', system_fingerprint=None, choices=[Choices(finish_reason='stop', index=0, message=Message(content='Coordination.', role='assistant', tool_calls=None, function_call=None, provider_specific_fields={'citations': None, 'thinking_blocks': None}))], usage=Usage(completion_tokens=6, prompt_tokens=16, total_tokens=22, completion_tokens_details=None, prompt_tokens_details=PromptTokensDetailsWrapper(audio_tokens=None, cached_tokens=0, text_tokens=None, image_tokens=None), cache_creation_input_tokens=0, cache_read_input_tokens=0))

In [None]:
r = completion(model=mods.ant, messages=mk_msgs("ant sync..."))
r

ModelResponse(id='chatcmpl-08800b92-edf0-4c11-8f9e-9b275077536a', created=1756123823, model='claude-sonnet-4-20250514', object='chat.completion', system_fingerprint=None, choices=[Choices(finish_reason='stop', index=0, message=Message(content='Coordination.', role='assistant', tool_calls=None, function_call=None, provider_specific_fields={'citations': None, 'thinking_blocks': None}))], usage=Usage(completion_tokens=6, prompt_tokens=16, total_tokens=22, completion_tokens_details=None, prompt_tokens_details=PromptTokensDetailsWrapper(audio_tokens=None, cached_tokens=0, text_tokens=None, image_tokens=None), cache_creation_input_tokens=0, cache_read_input_tokens=0))

Now, with streaming enabled.

In [None]:
r = completion(model=mods.ant, messages=mk_msgs("ant sync stream..."), stream=True)
_stream(r)

**
Asynchronous**

(Ant sync
 stream likely refers to asynchronous streaming - handling
 data flows that don't require synchronized,
 real-time processing between sender and receiver.)



In [None]:
r = completion(model=mods.ant, messages=mk_msgs("ant sync stream..."), stream=True)
_stream(r)

**
Asynchronous**

(Ant sync
 stream likely refers to asynchronous streaming - handling
 data flows that don't require synchronized,
 real-time processing between sender and receiver.)



#### OpenAI

Let's test `gpt-4o`.

In [None]:
r = completion(model=mods.oai, messages=mk_msgs("oai sync..."))
r

ModelResponse(id='chatcmpl-C8QJx7gpuXLgyh7bndGwsPtN1Ny4a', created=1756123837, model='gpt-4o-2024-08-06', object='chat.completion', system_fingerprint='fp_df0f7b956c', choices=[Choices(finish_reason='stop', index=0, message=Message(content='Synchronization', role='assistant', tool_calls=None, function_call=None, provider_specific_fields={'refusal': None}, annotations=[]), provider_specific_fields={})], usage=Usage(completion_tokens=1, prompt_tokens=16, total_tokens=17, completion_tokens_details=CompletionTokensDetailsWrapper(accepted_prediction_tokens=0, audio_tokens=0, reasoning_tokens=0, rejected_prediction_tokens=0, text_tokens=None), prompt_tokens_details=PromptTokensDetailsWrapper(audio_tokens=0, cached_tokens=0, text_tokens=None, image_tokens=None)), service_tier='default')

In [None]:
r = completion(model=mods.oai, messages=mk_msgs("oai sync..."))
r

ModelResponse(id='chatcmpl-C8QJx7gpuXLgyh7bndGwsPtN1Ny4a', created=1756123837, model='gpt-4o-2024-08-06', object='chat.completion', system_fingerprint='fp_df0f7b956c', choices=[Choices(finish_reason='stop', index=0, message=Message(content='Synchronization', role='assistant', tool_calls=None, function_call=None, provider_specific_fields={'refusal': None}, annotations=[]), provider_specific_fields={})], usage=Usage(completion_tokens=1, prompt_tokens=16, total_tokens=17, completion_tokens_details=CompletionTokensDetailsWrapper(accepted_prediction_tokens=0, audio_tokens=0, reasoning_tokens=0, rejected_prediction_tokens=0, text_tokens=None), prompt_tokens_details=PromptTokensDetailsWrapper(audio_tokens=0, cached_tokens=0, text_tokens=None, image_tokens=None)), service_tier='default')

Now, with streaming enabled.

In [None]:
r = completion(model=mods.oai, messages=mk_msgs("oai sync stream..."), stream=True)
_stream(r)

Eff
icient
.




In [None]:
r = completion(model=mods.oai, messages=mk_msgs("oai sync stream..."), stream=True)
_stream(r)

Eff
icient
.




#### Gemini

Let's test `gemini-2.0-flash`.

In [None]:
r = completion(model=mods.gem, messages=mk_msgs("gem sync..."))
r

ModelResponse(id='xVKsaI-SF6qxxN8P58LQyAI', created=1756123845, model='gemini-2.0-flash', object='chat.completion', system_fingerprint=None, choices=[Choices(finish_reason='stop', index=0, message=Message(content='Synchronization.\n', role='assistant', tool_calls=None, function_call=None, provider_specific_fields=None))], usage=Usage(completion_tokens=3, prompt_tokens=8, total_tokens=11, completion_tokens_details=None, prompt_tokens_details=PromptTokensDetailsWrapper(audio_tokens=None, cached_tokens=None, text_tokens=8, image_tokens=None)), vertex_ai_grounding_metadata=[], vertex_ai_url_context_metadata=[], vertex_ai_safety_results=[], vertex_ai_citation_metadata=[])

In [None]:
r = completion(model=mods.gem, messages=mk_msgs("gem sync..."))
r

ModelResponse(id='xVKsaI-SF6qxxN8P58LQyAI', created=1756123846, model='gemini-2.0-flash', object='chat.completion', system_fingerprint=None, choices=[Choices(finish_reason='stop', index=0, message=Message(content='Synchronization.\n', role='assistant', tool_calls=None, function_call=None, provider_specific_fields=None))], usage=Usage(completion_tokens=3, prompt_tokens=8, total_tokens=11, completion_tokens_details=None, prompt_tokens_details=PromptTokensDetailsWrapper(audio_tokens=None, cached_tokens=None, text_tokens=8, image_tokens=None)), vertex_ai_grounding_metadata=[], vertex_ai_url_context_metadata=[], vertex_ai_safety_results=[], vertex_ai_citation_metadata=[])

Now, with streaming enabled.

In [None]:
r = completion(model=mods.gem, messages=mk_msgs("gem sync stream..."), stream=True)
_stream(r)

Eff
ortless.




In [None]:
r = completion(model=mods.gem, messages=mk_msgs("gem sync stream..."), stream=True)
_stream(r)

Eff
ortless.




### Async Tests

In [None]:
from litellm import acompletion

In [None]:
async def _astream(r):
    async for chunk in r: print(chunk.choices[0].delta.content or "")

#### Anthropic

Let's test `claude-sonnet-4`.

In [None]:
r = await acompletion(model=mods.ant, messages=mk_msgs("ant async..."))
r

ModelResponse(id='chatcmpl-c1e613e2-8f3d-4060-8224-7f59eff9b56e', created=1756123859, model='claude-sonnet-4-20250514', object='chat.completion', system_fingerprint=None, choices=[Choices(finish_reason='stop', index=0, message=Message(content='Concurrency.', role='assistant', tool_calls=None, function_call=None, provider_specific_fields={'citations': None, 'thinking_blocks': None}))], usage=Usage(completion_tokens=7, prompt_tokens=16, total_tokens=23, completion_tokens_details=None, prompt_tokens_details=PromptTokensDetailsWrapper(audio_tokens=None, cached_tokens=0, text_tokens=None, image_tokens=None), cache_creation_input_tokens=0, cache_read_input_tokens=0))

In [None]:
r = await acompletion(model=mods.ant, messages=mk_msgs("ant async..."))
r

ModelResponse(id='chatcmpl-489b76c3-27c7-46a0-a88d-8acf9f3fbed4', created=1756123860, model='claude-sonnet-4-20250514', object='chat.completion', system_fingerprint=None, choices=[Choices(finish_reason='stop', index=0, message=Message(content='Concurrency.', role='assistant', tool_calls=None, function_call=None, provider_specific_fields={'citations': None, 'thinking_blocks': None}))], usage=Usage(completion_tokens=7, prompt_tokens=16, total_tokens=23, completion_tokens_details=None, prompt_tokens_details=PromptTokensDetailsWrapper(audio_tokens=None, cached_tokens=0, text_tokens=None, image_tokens=None), cache_creation_input_tokens=0, cache_read_input_tokens=0))

Now, with streaming enabled.

In [None]:
r = await acompletion(model=mods.ant, messages=mk_msgs("ant async stream..."), stream=True)
await _astream(r)

As
ynchronous streams enable non-blocking iteration
 over data sequences that arrive over time, allowing
 efficient processing of real-time events
, API responses, or file chunks without
 freezing the main thread.



In [None]:
r = await acompletion(model=mods.ant, messages=mk_msgs("ant async stream..."), stream=True)
await _astream(r)

As
ynchronous streams enable non-blocking iteration
 over data sequences that arrive over time, allowing
 efficient processing of real-time events
, API responses, or file chunks without
 freezing the main thread.



#### OpenAI

Let's test `gpt-4o`.

In [None]:
r = await acompletion(model=mods.oai, messages=mk_msgs("oai async..."))
r

ModelResponse(id='chatcmpl-C8QKVEoMm2fEvWkbDlE43jRVKosVV', created=1756123871, model='gpt-4o-2024-08-06', object='chat.completion', system_fingerprint='fp_80956533cb', choices=[Choices(finish_reason='stop', index=0, message=Message(content='Efficiency.', role='assistant', tool_calls=None, function_call=None, provider_specific_fields={'refusal': None}, annotations=[]), provider_specific_fields={})], usage=Usage(completion_tokens=2, prompt_tokens=16, total_tokens=18, completion_tokens_details=CompletionTokensDetailsWrapper(accepted_prediction_tokens=0, audio_tokens=0, reasoning_tokens=0, rejected_prediction_tokens=0, text_tokens=None), prompt_tokens_details=PromptTokensDetailsWrapper(audio_tokens=0, cached_tokens=0, text_tokens=None, image_tokens=None)), service_tier='default')

In [None]:
r = await acompletion(model=mods.oai, messages=mk_msgs("oai async..."))
r

ModelResponse(id='chatcmpl-C8QKVEoMm2fEvWkbDlE43jRVKosVV', created=1756123871, model='gpt-4o-2024-08-06', object='chat.completion', system_fingerprint='fp_80956533cb', choices=[Choices(finish_reason='stop', index=0, message=Message(content='Efficiency.', role='assistant', tool_calls=None, function_call=None, provider_specific_fields={'refusal': None}, annotations=[]), provider_specific_fields={})], usage=Usage(completion_tokens=2, prompt_tokens=16, total_tokens=18, completion_tokens_details=CompletionTokensDetailsWrapper(accepted_prediction_tokens=0, audio_tokens=0, reasoning_tokens=0, rejected_prediction_tokens=0, text_tokens=None), prompt_tokens_details=PromptTokensDetailsWrapper(audio_tokens=0, cached_tokens=0, text_tokens=None, image_tokens=None)), service_tier='default')

Now, with streaming enabled.

In [None]:
r = await acompletion(model=mods.oai, messages=mk_msgs("oai async stream..."), stream=True)
await _astream(r)

"
Eff
icient
"




In [None]:
r = await acompletion(model=mods.oai, messages=mk_msgs("oai async stream..."), stream=True)
await _astream(r)

"
Eff
icient
"




#### Gemini

Let's test `gemini-2.0-flash`.

In [None]:
r = await acompletion(model=mods.gem, messages=mk_msgs("gem async..."))
r

ModelResponse(id='5VKsaM37OrqzxN8PyPrC0Ak', created=1756123877, model='gemini-2.0-flash', object='chat.completion', system_fingerprint=None, choices=[Choices(finish_reason='stop', index=0, message=Message(content='Concurrency.\n', role='assistant', tool_calls=None, function_call=None, provider_specific_fields=None))], usage=Usage(completion_tokens=3, prompt_tokens=8, total_tokens=11, completion_tokens_details=None, prompt_tokens_details=PromptTokensDetailsWrapper(audio_tokens=None, cached_tokens=None, text_tokens=8, image_tokens=None)), vertex_ai_grounding_metadata=[], vertex_ai_url_context_metadata=[], vertex_ai_safety_results=[], vertex_ai_citation_metadata=[])

In [None]:
r = await acompletion(model=mods.gem, messages=mk_msgs("gem async..."))
r

ModelResponse(id='5VKsaM37OrqzxN8PyPrC0Ak', created=1756123879, model='gemini-2.0-flash', object='chat.completion', system_fingerprint=None, choices=[Choices(finish_reason='stop', index=0, message=Message(content='Concurrency.\n', role='assistant', tool_calls=None, function_call=None, provider_specific_fields=None))], usage=Usage(completion_tokens=3, prompt_tokens=8, total_tokens=11, completion_tokens_details=None, prompt_tokens_details=PromptTokensDetailsWrapper(audio_tokens=None, cached_tokens=None, text_tokens=8, image_tokens=None)), vertex_ai_grounding_metadata=[], vertex_ai_url_context_metadata=[], vertex_ai_safety_results=[], vertex_ai_citation_metadata=[])

Now, with streaming enabled.

In [None]:
r = await acompletion(model=mods.gem, messages=mk_msgs("gem async stream..."), stream=True)
await _astream(r)

Concurrent
.




In [None]:
r = await acompletion(model=mods.gem, messages=mk_msgs("gem async stream..."), stream=True)
await _astream(r)

Concurrent
.




### Tool Calls

In [None]:
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_current_weather",
            "description": "Get the current weather in a given location",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {"type":"string", "description":"The city e.g. Reims"},
                    "unit": {"type":"string", "enum":["celsius", "fahrenheit"]},
                },
                "required": ["location"],
            }
        }
    }
]

In [None]:
r = completion(model=mods.ant, messages=mk_msgs("Is it raining in Reims?"), tools=tools)
r

ModelResponse(id='chatcmpl-39afdd40-b7f1-4dfc-9093-17282b055d5f', created=1756124252, model='claude-sonnet-4-20250514', object='chat.completion', system_fingerprint=None, choices=[Choices(finish_reason='tool_calls', index=0, message=Message(content=None, role='assistant', tool_calls=[ChatCompletionMessageToolCall(index=0, function=Function(arguments='{"location": "Reims"}', name='get_current_weather'), id='toolu_01Bx7sqvFrX71pcEYVQtA7PJ', type='function')], function_call=None, provider_specific_fields={'citations': None, 'thinking_blocks': None}))], usage=Usage(completion_tokens=57, prompt_tokens=427, total_tokens=484, completion_tokens_details=None, prompt_tokens_details=PromptTokensDetailsWrapper(audio_tokens=None, cached_tokens=0, text_tokens=None, image_tokens=None), cache_creation_input_tokens=0, cache_read_input_tokens=0))

In [None]:
r = completion(model=mods.ant, messages=mk_msgs("Is it raining in Reims?"), tools=tools)
r

ModelResponse(id='chatcmpl-44db0e0c-9a39-444b-865b-fde6b2e1afb8', created=1756124246, model='claude-sonnet-4-20250514', object='chat.completion', system_fingerprint=None, choices=[Choices(finish_reason='tool_calls', index=0, message=Message(content=None, role='assistant', tool_calls=[ChatCompletionMessageToolCall(index=0, function=Function(arguments='{"location": "Reims"}', name='get_current_weather'), id='toolu_01Bx7sqvFrX71pcEYVQtA7PJ', type='function')], function_call=None, provider_specific_fields={'citations': None, 'thinking_blocks': None}))], usage=Usage(completion_tokens=57, prompt_tokens=427, total_tokens=484, completion_tokens_details=None, prompt_tokens_details=PromptTokensDetailsWrapper(audio_tokens=None, cached_tokens=0, text_tokens=None, image_tokens=None), cache_creation_input_tokens=0, cache_read_input_tokens=0))

### Citations

In [None]:
search_tool = {"type":"web_search_20250305", "name":"web_search", "max_uses":3}
r = completion(
    "claude-sonnet-4-20250514", tools=[search_tool],
    messages=mk_msgs("Search the web and tell me very briefly about otters"), 
)
r

ModelResponse(id='chatcmpl-7edbf05a-6e23-4cca-a722-e758ecb9d6c4', created=1756124350, model='claude-sonnet-4-20250514', object='chat.completion', system_fingerprint=None, choices=[Choices(finish_reason='stop', index=0, message=Message(content="Here's 1 word about searching the web: Informative.\n\nAnd briefly about otters: Otters are carnivorous mammals in the subfamily Lutrinae. The 14 extant otter species are all semiaquatic, both freshwater and marine. Otters are distinguished by their long, slim bodies, powerful webbed feet for swimming, and their dense fur, which keeps them warm and buoyant in water. They are playful animals, engaging in activities like sliding into water on natural slides and playing with stones. Sea otters have the densest fur of any animal on earth with an estimated 1 million hairs per square inch.", role='assistant', tool_calls=None, function_call=None, provider_specific_fields={'citations': [[{'type': 'web_search_result_location', 'cited_text': 'Otters are ca

In [None]:
search_tool = {"type":"web_search_20250305", "name":"web_search", "max_uses":3}
r = completion(
    "claude-sonnet-4-20250514", tools=[search_tool],
    messages=mk_msgs("Search the web and tell me very briefly about otters"), 
)
r

ModelResponse(id='chatcmpl-33b68f32-ebc7-4fa5-9251-5c93e909d8d1', created=1756124362, model='claude-sonnet-4-20250514', object='chat.completion', system_fingerprint=None, choices=[Choices(finish_reason='stop', index=0, message=Message(content="Here's 1 word about searching the web: Informative.\n\nAnd briefly about otters: Otters are carnivorous mammals in the subfamily Lutrinae. The 14 extant otter species are all semiaquatic, both freshwater and marine. Otters are distinguished by their long, slim bodies, powerful webbed feet for swimming, and their dense fur, which keeps them warm and buoyant in water. They are playful animals, engaging in activities like sliding into water on natural slides and playing with stones. Sea otters have the densest fur of any animal on earth with an estimated 1 million hairs per square inch.", role='assistant', tool_calls=None, function_call=None, provider_specific_fields={'citations': [[{'type': 'web_search_result_location', 'cited_text': 'Otters are ca

## Export

In [None]:
#| hide
import nbdev; nbdev.nbdev_export()