# Canonical Neural Cache
[Canonical Cache](https://canonical.chat/)

Email us with any and all questions at <founders@canonical.chat>.

This is a simple notebook that demonstrates how to use the Canonical Cache with LiteLLM in your LLM apps. An API key is required; email us at <founders@canonical.chat> to get one. See our [GitHub repository](https://github.com/Canonical-AI-Inc/canonical) for more examples.

## A few important notes:

 - The first-level cache key is a hash of the model's name and the system prompt. Changing either one automatically creates a new cache bucket. If you start seeing cache misses after editing your system prompt, this is why.

 - We enforce a delay period of 5 seconds between when a cache entry is created and when it can be used.

 - Function calling is not supported and is explicitly ignored by the cache.

 - Messages must follow the order system -> assistant -> user. See the examples below.

 - When the temperature is > 0, expect to see cache misses initially. We use the temperature to emulate the behavior you would expect without the cache: for temperature values > 0 we build a list of possible completions and, once the list is full, select one at random. The greater the temperature, the larger the list and the longer it takes to fill. For example, when the temperature is 0.1, the list size is 5.
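
The temperature note above can be pictured with a small offline simulation. This is only a sketch of the described behavior, not the Canonical implementation: `CompletionPool`, `simulate`, and the `capacity` parameter are hypothetical names, the model call is replaced by a placeholder string, and the 5-second delay is ignored. The list size of 5 is the value the notes give for a temperature of 0.1; the general mapping from temperature to list size is not specified here, so the sketch takes the size as an input.

```python
import random


class CompletionPool:
    """Hypothetical stand-in for one cache entry when temperature > 0."""

    def __init__(self, capacity, rng=None):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self.completions = []
        self.rng = rng or random.Random()

    def lookup(self):
        """Return a random completion once the pool is full, else None (a miss)."""
        if len(self.completions) < self.capacity:
            return None
        return self.rng.choice(self.completions)

    def store(self, completion):
        """Record a fresh completion until the pool reaches its capacity."""
        if len(self.completions) < self.capacity:
            self.completions.append(completion)


def simulate(pool, requests):
    """Run `requests` lookups, filling the pool on each miss; return hit/miss labels."""
    outcomes = []
    for i in range(requests):
        cached = pool.lookup()
        if cached is None:
            pool.store(f"completion-{i}")  # placeholder for a real model call
            outcomes.append("miss")
        else:
            outcomes.append("hit")
    return outcomes


# List size 5 is the value the notes give for temperature 0.1.
print(simulate(CompletionPool(capacity=5), requests=7))
# ['miss', 'miss', 'miss', 'miss', 'miss', 'hit', 'hit']
```

In this simplified model, the first five identical requests all miss while the list fills, and only later requests are served from the cache, which is why a higher temperature (a larger list) means more initial misses.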

In [None]:
import os
import litellm
from litellm.caching import Cache, canonical_cache_key

# Read the API keys from environment variables
openaikey = os.environ["OPENAI_API_KEY"]
apikey = os.environ["CANONICAL_CACHE_API_KEY"]

# The Canonical cache only supports completions
litellm.cache = Cache(type="canonical",
                      apikey=apikey,
                      supported_call_types=["completion", "acompletion"])

# The Canonical cache uses a custom cache key function, set it here
litellm.cache.get_cache_key = canonical_cache_key
# Off to the races!

The code below runs a synchronous completion. Run it multiple times to see a cache hit. The first level of hashing combines the model name and the system prompt, so changing either one creates a new cache key and results in a cache miss. The next level of hashing is the chat context.

Subtle changes to the user's message should still result in a cache hit. For example, if you change `Hello, how are you?` to `Hello, how are you today?` the cache should still hit. 
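
Before running the notebook cell, here is a rough offline illustration of the first-level bucketing described above. It does not call the cache and is not the Canonical implementation: `bucket_key` is a hypothetical helper, and SHA-256 with a NUL separator is an assumption, since the actual hash function is not documented here. The second level (matching on the chat context, which is what lets a reworded user message still hit) is not modeled.

```python
import hashlib


def bucket_key(model, messages):
    """Hypothetical first-level key: a hash of the model name and system prompt."""
    system_prompt = next(
        (m["content"] for m in messages if m["role"] == "system"), ""
    )
    # A separator keeps ("ab", "c") and ("a", "bc") from hashing identically.
    payload = f"{model}\x00{system_prompt}".encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


base = [
    {"role": "system", "content": "Be a Gen Z assistant today and everyday."},
    {"role": "assistant", "content": "How can I help you today?"},
    {"role": "user", "content": "Hello, how are you?"},
]
# Same system prompt, slightly different user message.
reworded = base[:2] + [{"role": "user", "content": "Hello, how are you today?"}]
# Different system prompt, same user message.
new_prompt = [{"role": "system", "content": "Be a Baby Boomer assistant today."}] + base[1:]

print(bucket_key("gpt-3.5-turbo", base) == bucket_key("gpt-3.5-turbo", reworded))    # True
print(bucket_key("gpt-3.5-turbo", base) == bucket_key("gpt-3.5-turbo", new_prompt))  # False
print(bucket_key("gpt-3.5-turbo", base) == bucket_key("gpt-4", base))                # False
```

Rewording the user message leaves the bucket unchanged, so the request is still eligible for a hit, while a new system prompt or model name lands in a different bucket and starts from a miss.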

In [None]:
import litellm
from litellm import completion
litellm.set_verbose = True

messages = [
    { "role": "system", "content": "Be a Gen Z assistant today and everyday."},
    { "role": "assistant", "content": "How can I help you today?" },
    { "role": "user",  "content": "Hello, how are you?"}
]
# openai call
response = completion(model="gpt-3.5-turbo", messages=messages, stream=False)
print(response)

Notice the change to the user's message below.

In [None]:
import litellm
from litellm import completion
litellm.set_verbose = True

# this subtle change to the user's prompt should result in a cache hit
messages = [
    { "role": "system", "content": "Be a Gen Z assistant today and everyday."},
    { "role": "assistant", "content": "How can I help you today?" },
    { "role": "user",  "content": "Hello, how are you today?"} # notice the change here
]
# openai call
response = completion(model="gpt-3.5-turbo", messages=messages, stream=False)
print(response)

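The example below demonstrates an asynchronous completion. The top-level `await` works inside a notebook; in a plain Python script, wrap the call in a coroutine and run it with `asyncio.run`.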

In [None]:
import litellm
from litellm import acompletion
litellm.set_verbose = True

messages = [
    { "role": "system", "content": "Be a Baby Boomer assistant today."},
    { "role": "assistant", "content": "How can I help you today?" },
    { "role": "user",  "content": "Hello, how are you?"}
]

# execute an async completion
response = await acompletion(model="gpt-3.5-turbo", messages=messages, stream=False)
print(response)
