# Semantic LLM caching

_**NOTE:** this uses Cassandra's experimental "Vector Similarity Search" capability.
At the moment, using it requires building and running an early alpha from a specific branch of the codebase._

The Cassandra-backed "semantic cache" for prompt responses is imported like this:

In [1]:
from langchain.cache import CassandraSemanticCache

As usual, a database connection is needed to access Cassandra. The following assumes
that a _vector-search-capable Cassandra cluster_ is running locally. Adjust as needed.

In [2]:
from cqlsession import getLocalSession, getLocalKeyspace
localSession = getLocalSession()
localKeyspace = getLocalKeyspace()

An embedding function and an LLM are needed:

In [3]:
from langchain.llms import OpenAI
from langchain.embeddings import OpenAIEmbeddings

myEmbedding = OpenAIEmbeddings()
llm = OpenAI(model_name="text-davinci-002", n=2, best_of=2)

**Note**: for the time being you have to explicitly _turn on this experimental flag_ on the `cassio` side:

In [4]:
import cassio
cassio.globals.enableExperimentalVectorSearch()

## Create the cache

At this point you can instantiate the semantic cache:

In [5]:
cassSemanticCache = CassandraSemanticCache(
    session=localSession,
    keyspace=localKeyspace,
    embedding=myEmbedding
)

_Note: the following cell simply clears the cache to better demonstrate what's coming next. It's not terribly important._

In [6]:
from langchain.llms.base import get_prompts
this_llm_string = get_prompts(llm.dict(), [])[1]

cassSemanticCache.clear(llm_string=this_llm_string)
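The `llm_string` passed to `clear` identifies which LLM configuration the cached entries belong to, so that responses from differently configured models are kept apart. As a rough illustration only, such an identifier can be built by serializing the model parameters in a stable order. The helper name `sketch_llm_string` is hypothetical, and the exact format LangChain produces is an assumption here, not something this notebook shows.

```python
from typing import Any, Dict


def sketch_llm_string(params: Dict[str, Any]) -> str:
    """Hypothetical helper: serialize parameters in sorted-key order.

    Sorting makes the result independent of the order in which the
    parameters were supplied.
    """
    return str(sorted(params.items()))


# Same parameters in a different order give the same identifier ...
same_a = sketch_llm_string({"model_name": "text-davinci-002", "n": 2})
same_b = sketch_llm_string({"n": 2, "model_name": "text-davinci-002"})

# ... while changing any parameter gives a different one.
different = sketch_llm_string({"model_name": "text-davinci-002", "n": 1})
```

With a scheme like this, clearing the cache for one `llm_string` leaves entries for other model configurations untouched.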

Set this cache as the LangChain-wide LLM cache:

In [7]:
import langchain
langchain.llm_cache = cassSemanticCache

Now try submitting a few prompts to the LLM and pay attention to the response times.

If the LLM is actually called, the response time should be on the order of a few seconds; on a cache hit, it should be well under a second.

Notice that you get a cache hit even after rephrasing the question.
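To make the idea of a hit on a rephrased question concrete, here is a minimal in-memory sketch of a similarity-based lookup. It is not how `CassandraSemanticCache` is implemented: the class `ToySemanticCache`, the word-counting `toy_embed` function, the vocabulary, and the `0.85` threshold are all assumptions chosen for illustration. A real setup would use a proper embedding model and the database's vector search.

```python
import math
from typing import Callable, List, Optional, Tuple

Vector = List[float]

# Hypothetical, tiny vocabulary standing in for a real embedding model.
VOCAB = ["spider", "eye", "proof", "absence"]


def toy_embed(text: str) -> Vector:
    """Count vocabulary terms after crude normalization (illustration only)."""
    words = [w.strip("?.,!").lower().rstrip("s") for w in text.split()]
    return [float(words.count(term)) for term in VOCAB]


def cosine_similarity(a: Vector, b: Vector) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # no usable signal in one of the vectors
    return dot / (norm_a * norm_b)


class ToySemanticCache:
    """Stores (embedding, response) pairs and returns the closest match."""

    def __init__(self, embed: Callable[[str], Vector], threshold: float = 0.85):
        self.embed = embed
        self.threshold = threshold
        self.entries: List[Tuple[Vector, str]] = []

    def update(self, prompt: str, response: str) -> None:
        self.entries.append((self.embed(prompt), response))

    def lookup(self, prompt: str) -> Optional[str]:
        query = self.embed(prompt)
        best_score, best_response = -1.0, None
        for vector, response in self.entries:
            score = cosine_similarity(query, vector)
            if score > best_score:
                best_score, best_response = score, response
        # Only report a hit when the closest entry is similar enough.
        return best_response if best_score >= self.threshold else None


cache = ToySemanticCache(toy_embed)
cache.update("How many eyes do spiders have?", "Spiders have eight eyes.")

hit = cache.lookup("How many eyes does a spider generally have?")
miss = cache.lookup("Is absence of proof the same as proof of absence?")
```

Here `hit` holds the stored answer because the rephrased prompt maps to a nearby vector, while `miss` is `None` because the unrelated prompt falls below the threshold. The threshold is the main tuning knob in such a design: too low and unrelated questions share answers, too high and rephrasings stop matching.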

In [8]:
%%time
# A new question: this should take a while
llm("How many eyes do spiders have?")

CPU times: user 43.8 ms, sys: 5.54 ms, total: 49.4 ms
Wall time: 2.05 s


'\n\nSpiders have eight eyes.'

In [9]:
%%time
# Second time, very same question, this should be quick
llm("How many eyes do spiders have?")

CPU times: user 10.3 ms, sys: 933 µs, total: 11.3 ms
Wall time: 59.4 ms


'\n\nSpiders have eight eyes.'

In [10]:
%%time
# Just a rephrasing: but it's the same question, so ...
llm("How many eyes does a spider generally have?")

CPU times: user 14.1 ms, sys: 996 µs, total: 15.1 ms
Wall time: 313 ms


'\n\nSpiders have eight eyes.'

In [11]:
%%time
# A totally new question
llm("Is absence of proof the same as proof of absence?")

CPU times: user 24.4 ms, sys: 211 µs, total: 24.6 ms
Wall time: 1.71 s


'\n\nNo, absence of proof is not the same as proof of absence.'

In [12]:
%%time
# Trying to catch the cache off-guard :)
llm("How many eyes are on the head of a typical spider?")

CPU times: user 6.99 ms, sys: 0 ns, total: 6.99 ms
Wall time: 484 ms


'\n\nSpiders have eight eyes.'

In [13]:
%%time
# Switching to the other question again
llm("Is it true that the absence of a proof equates the proof of an absence?")

CPU times: user 12.8 ms, sys: 4.29 ms, total: 17 ms
Wall time: 461 ms


'\n\nNo, absence of proof is not the same as proof of absence.'
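The timings above come from the `%%time` notebook magic. Outside a notebook, a similar comparison can be made with a small wrapper around `time.perf_counter`. This is only a sketch: `timed_call` is a hypothetical helper, and `fake_llm` is a stand-in so the snippet runs without any API access. In practice the `llm` object defined earlier would be passed instead.

```python
import time
from typing import Any, Callable, Tuple


def timed_call(fn: Callable[..., Any], *args: Any, **kwargs: Any) -> Tuple[Any, float]:
    """Call fn and return its result together with the elapsed wall time in seconds."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed


def fake_llm(prompt: str) -> str:
    """Stand-in for a real LLM call (assumption, for illustration only)."""
    return "answer to: " + prompt


answer, seconds = timed_call(fake_llm, "How many eyes do spiders have?")
print(f"{seconds * 1000:.1f} ms")
```

Comparing the elapsed time of the first call with that of a repeated or rephrased prompt gives the same picture as the wall times shown in the cells above.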