# Semantic Caching

RedisVL provides the ``LLMCache`` interface to turn Redis, with its vector search capability, into a semantic cache that stores query results, reducing the number of requests and tokens sent to a Large Language Model (LLM) service. This lowers cost and improves performance by cutting the time taken to generate responses.

This notebook walks through how to use ``LLMCache`` in your applications.
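
To make the idea concrete, here is a minimal in-memory sketch of what a semantic cache does: embed the prompt, compare it with the stored prompt vectors, and return the stored responses whose similarity clears a threshold. The `ToySemanticCache` class, the `embed` and `cosine` helpers, and the bag-of-words embedding are illustrative assumptions, not RedisVL's implementation; a real cache would use a sentence-embedding model and Redis vector search, so the similarity scores here apply only to this toy.

```python
import math
from collections import Counter


def embed(text):
    """Toy embedding (assumption): lowercase bag-of-words counts."""
    return Counter(text.lower().replace("?", "").split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class ToySemanticCache:
    """Hypothetical in-memory stand-in for a semantic cache."""

    def __init__(self, threshold=0.9):
        self.threshold = threshold
        self.entries = []  # list of (prompt_vector, response) pairs

    def store(self, prompt, response):
        self.entries.append((embed(prompt), response))

    def check(self, prompt):
        query = embed(prompt)
        # return every stored response that is similar enough to the query
        return [
            response
            for vector, response in self.entries
            if cosine(query, vector) >= self.threshold
        ]


toy_cache = ToySemanticCache(threshold=0.9)
toy_cache.store("What is the capital of France?", "Paris")
print(toy_cache.check("What really is the capital of France?"))  # ['Paris']
print(toy_cache.check("What is the capital of Spain?"))          # []
```

The linear scan over `entries` is what a vector index replaces: Redis performs the nearest-neighbor lookup instead of comparing against every stored prompt.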

First, we set the Redis connection URL, then import OpenAI to use its API for responding to prompts.

In [1]:
import os

# set redis address
username = "default"
host = "<enter your redis host here>"
port = "<enter your redis port here>"
password = "<enter your redis password here>"


REDIS_URL = f"redis://{username}:{password}@{host}:{port}"
os.environ["REDIS_URL"] = REDIS_URL
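
If the password contains characters such as `@`, `:` or `/`, the f-string above produces a URL that cannot be parsed unambiguously. A minimal sketch of one way around this is to percent-encode the credentials first. The `build_redis_url` helper is hypothetical, and it assumes the client percent-decodes the username and password when it parses the URL.

```python
from urllib.parse import quote


def build_redis_url(username, password, host, port):
    """Hypothetical helper: percent-encode credentials before building the URL."""
    user = quote(username, safe="")
    secret = quote(password, safe="")
    return f"redis://{user}:{secret}@{host}:{port}"


print(build_redis_url("default", "p@ss:word/1", "localhost", 6379))
# redis://default:p%40ss%3Aword%2F1@localhost:6379
```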

In [2]:
import os
import openai
import getpass
os.environ["TOKENIZERS_PARALLELISM"] = "False"


api_key = os.getenv("OPENAI_API_KEY") or getpass.getpass("Enter your OpenAI API key: ")

openai.api_key = api_key

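def ask_openai(question):
    # legacy Completions API: requires openai<1.0 (removed in openai>=1.0)
    response = openai.Completion.create(
        engine="text-davinci-003",
        prompt=question,
        max_tokens=200,
    )
    # return only the generated text, without surrounding whitespace
    return response.choices[0].text.strip()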

In [3]:
# test it
print(ask_openai("What is the capital of France?"))

Paris.


## Initializing and using ``LLMCache``

``LLMCache`` automatically creates an index within Redis for the semantic cache upon initialization. The ``SemanticCache`` class imported below is the implementation used in this notebook; it relies on the same ``SearchIndex`` class used in the previous tutorials to perform index creation and manipulation.

In [4]:
from redisvl.llmcache.semantic import SemanticCache
cache = SemanticCache(
    redis_url=REDIS_URL,
    threshold=0.9, # semantic similarity threshold
    )



In [5]:
# look at the index specification created for the semantic cache lookup
!rvl index info -i cache

01:31:37 [RedisVL] INFO   Using Redis address from environment variable, REDIS_URL


Index Information:
╭──────────────┬────────────────┬──────────────┬─────────────────┬────────────╮
│ Index Name   │ Storage Type   │ Prefixes     │ Index Options   │   Indexing │
├──────────────┼────────────────┼──────────────┼─────────────────┼────────────┤
│ cache        │ HASH           │ ['llmcache'] │ []              │          0 │
╰──────────────┴────────────────┴──────────────┴─────────────────┴────────────╯
Index Fields:
╭───────────────┬───────────────┬────────╮
│ Name          │ Attribute     │ Type   │
├───────────────┼───────────────┼────────┤
│ prompt_vector │ prompt_vector │ VECTOR │
╰───────────────┴───────────────┴────────╯


In [6]:
# check the cache
cache.check("What is the capital of France?")

[]

In [7]:
# store the question and answer
cache.store("What is the capital of France?", "Paris")

In [8]:
# check the cache again
cache.check("What is the capital of France?")

['Paris']

In [9]:
# check for a semantically similar result
cache.check("What really is the capital of France?")

[]

In [10]:
# decrease the semantic similarity threshold
cache.set_threshold(0.7)
cache.check("What really is the capital of France?")

['Paris']
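
The threshold above is expressed as a similarity, whereas a cosine vector index reports a distance. The sketch below shows how the two relate, assuming a cosine metric where similarity is taken as `1 - distance`. The `is_cache_hit` helper is hypothetical and is not RedisVL's own code; it only illustrates why lowering the threshold from 0.9 to 0.7 turns a miss into a hit.

```python
def is_cache_hit(vector_distance, threshold):
    """Hypothetical check: convert cosine distance to similarity, then compare."""
    similarity = 1 - float(vector_distance)
    return similarity >= threshold


# an illustrative (made-up) distance of 0.2 between two prompts
print(is_cache_hit(0.2, threshold=0.9))  # False
print(is_cache_hit(0.2, threshold=0.7))  # True
```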

In [11]:
# adversarial example (not semantically similar enough)
cache.check("What is the capital of Spain?")

[]

In [12]:
cache.clear()

## Performance

Next, we will measure the speedup obtained by using ``LLMCache``. We will use the ``time`` module to measure the time taken to generate responses with and without ``LLMCache``.

In [13]:
def answer_question(question: str):
    results = cache.check(question)
    if results:
        return results[0]
    else:
        answer = ask_openai(question)
        cache.store(question, answer)
        return answer
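
The function above is a read-through (cache-aside) lookup: check the cache, fall back to the LLM on a miss, then store the new answer. To try the same pattern without a Redis instance or an OpenAI key, the dependencies can be passed in as arguments. In this sketch `StubCache`, `answer_with` and `fake_llm` are hypothetical stand-ins; `StubCache` matches prompts exactly rather than semantically and only mimics the `check`/`store` methods used in this notebook.

```python
class StubCache:
    """Exact-match stand-in exposing check/store like the cache used above."""

    def __init__(self):
        self._data = {}

    def check(self, prompt):
        return [self._data[prompt]] if prompt in self._data else []

    def store(self, prompt, response):
        self._data[prompt] = response


def answer_with(question, cache, llm):
    """Return a cached answer if present; otherwise call the LLM and cache it."""
    results = cache.check(question)
    if results:
        return results[0]
    answer = llm(question)
    cache.store(question, answer)
    return answer


calls = []  # records every prompt that actually reached the "LLM"


def fake_llm(question):
    calls.append(question)
    return "Paris"


stub = StubCache()
first = answer_with("What is the capital of France?", stub, fake_llm)
second = answer_with("What is the capital of France?", stub, fake_llm)
print(first, second, len(calls))  # Paris Paris 1
```

The second call never reaches `fake_llm`, which is the saving the timing cells measure.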

In [14]:
import time
start = time.time()
answer = answer_question("What is the capital of France?")
end = time.time()
print(f"Time taken without cache {end - start}")

Time taken without cache 0.5926871299743652


In [15]:
cached_start = time.time()
cached_answer = answer_question("What is the capital of France?")
cached_end = time.time()
print(f"Time Taken with cache: {cached_end - cached_start}")
print(f"Percentage of time saved: {round(((end - start) - (cached_end - cached_start)) / (end - start) * 100, 2)}%")

Time Taken with cache: 0.12577414512634277
Percentage of time saved: 78.78%
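
The same measurement can be wrapped in small helpers so the timing logic is written once. This is a sketch, not part of RedisVL: `timed`, `percent_saved` and `slow_echo` are hypothetical names, and `time.perf_counter()` is used here instead of `time.time()` because it is a monotonic, high-resolution clock intended for measuring intervals.

```python
import time


def timed(fn, *args, **kwargs):
    """Call fn and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


def percent_saved(baseline_seconds, cached_seconds):
    """Share of the baseline time avoided by the cached call, in percent."""
    return (baseline_seconds - cached_seconds) / baseline_seconds * 100


def slow_echo(text):
    """Stand-in for a slow, uncached call."""
    time.sleep(0.01)
    return text


result, elapsed = timed(slow_echo, "hello")
print(result, f"{elapsed:.4f}s")

# the two timings printed in the cells above
print(round(percent_saved(0.5926871299743652, 0.12577414512634277), 2))  # 78.78
```

A single pair of timings is noisy, since network latency to the LLM service varies between calls, so averaging over several distinct prompts gives a steadier estimate.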


In [16]:
# check the stats of the index
!rvl stats -i cache

01:31:40 [RedisVL] INFO   Using Redis address from environment variable, REDIS_URL

Statistics:
╭─────────────────────────────┬─────────────╮
│ Stat Key                    │ Value       │
├─────────────────────────────┼─────────────┤
│ num_docs                    │ 1           │
│ num_terms                   │ 0           │
│ max_doc_id                  │ 2           │
│ num_records                 │ 2           │
│ percent_indexed             │ 1           │
│ hash_indexing_failures      │ 0           │
│ number_of_uses              │ 11          │
│ bytes_per_record_avg        │ 0           │
│ doc_table_size_mb           │ 0.000134468 │
│ inverted_sz_mb              │ 0           │
│ key_table_size_mb           │ 2.76566e-05 │
│ offset_bits_per_record_avg  │ nan         │
│ offset_vectors_sz_mb        │ 0           │
│ offsets_per_term_avg        │ 0           │
│ records_per_doc_avg         │ 2           │
│ sortable_values_size_mb     │ 0           │
╰─────────────────────────────┴─────────────╯

In [17]:
# remove the index and all cached items
cache.index.delete()