## Implementing a semantic cache to improve a RAG system with FAISS


> https://huggingface.co/learn/cookbook/en/semantic_cache_chroma_vector_database

---

In this notebook, we explore a typical RAG solution that uses an open-source model and the Chroma vector database. On top of it, we integrate a semantic cache system that stores previous user queries and decides whether the prompt is enriched with information from the vector database or from the cache.

A semantic caching system aims to identify similar or identical user requests. When a matching request is found, the system retrieves the corresponding information from the cache, reducing the need to fetch it from the original source.

As the comparison takes into account the semantic meaning of the requests, they don’t have to be identical for the system to recognize them as the same question. They can be formulated differently or contain inaccuracies, be they typographical or in the sentence structure, and we can identify that the user is actually requesting the same information.

For instance, queries like `What is the capital of France?`, `Tell me the name of the capital of France?`, and `What The capital of France is?` all convey the same intent and should be identified as the same question.


While the model’s response may differ based on the request for a concise answer in the second example, the information retrieved from the vector database should be the same. This is why I’m placing the cache system between the user and the vector database, not between the user and the Large Language Model.


![](https://huggingface.co/datasets/huggingface/cookbook-images/resolve/main/semantic_cache.jpg)

Most tutorials that guide you through creating a RAG system are designed for single-user use, meant to operate in a testing environment. In other words, they run within a notebook, interacting with a local vector database and either making API calls or using a locally stored model.

This architecture quickly becomes insufficient when attempting to move one of these systems to production, where it might receive anywhere from tens to thousands of recurring requests.

One way to enhance performance is through one or more semantic caches. Such a cache retains the results of previous requests, and before resolving a new request, it checks whether a similar one has been received before. If so, instead of re-executing the process, it retrieves the information from the cache.
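The lookup flow described above can be sketched in a few lines of plain Python. This is only an illustration under loud assumptions: `toy_embed` is a bag-of-words stand-in for a real sentence encoder, and `ToySemanticCache`, `answer`, and the `0.7` threshold are hypothetical names and values chosen for the example, not part of this notebook's implementation.

```python
import math
from collections import Counter


def toy_embed(text: str) -> Counter:
    """Stand-in for a real sentence encoder (assumption): bag-of-words counts."""
    cleaned = "".join(ch.lower() if ch.isalnum() else " " for ch in text)
    return Counter(cleaned.split())


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[word] * b[word] for word in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


class ToySemanticCache:
    """Keeps (embedding, result) pairs and returns a result for similar queries."""

    def __init__(self, threshold: float = 0.7):
        self.threshold = threshold  # hypothetical value, tune for a real encoder
        self.entries = []

    def lookup(self, query: str):
        query_embedding = toy_embed(query)
        best_score, best_result = 0.0, None
        for embedding, result in self.entries:
            score = cosine_similarity(query_embedding, embedding)
            if score > best_score:
                best_score, best_result = score, result
        return best_result if best_score >= self.threshold else None

    def store(self, query: str, result) -> None:
        self.entries.append((toy_embed(query), result))


def answer(query: str, cache: ToySemanticCache, slow_retrieve):
    """Check the cache first; fall back to the slow source and remember the result."""
    cached = cache.lookup(query)
    if cached is not None:
        return cached, "cache"
    result = slow_retrieve(query)
    cache.store(query, result)
    return result, "source"
```

Here `slow_retrieve` stands for whatever expensive step is being cached, such as the vector database query. A real system would replace `toy_embed` with a proper embedding model, since word counts only catch rewordings that reuse the same words.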

In a RAG system, there are two time-consuming steps:

- Retrieving the information used to construct the enriched prompt.

- Calling the Large Language Model to obtain the response.


A semantic cache system can be implemented at either point, and we could even have two caches, one for each.

> Placing it at the model’s response point may lead to a loss of influence over the obtained response. Our cache system could consider “Explain the French Revolution in 10 words” and “Explain the French Revolution in a hundred words” as the same query. If our cache system stores model responses, users might think that their instructions are not being followed accurately.

But both requests will require the same information to enrich the prompt. This is the main reason why I chose to place the semantic cache system between the user’s request and the retrieval of information from the vector database.

However, this is a design decision. Depending on the type of responses and system requests, it can be placed at one point or another. It’s evident that caching model responses would yield the most time savings, but as I’ve already explained, it comes at the cost of losing user influence over the response.
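To make the trade-off concrete, here is a small sketch of why caching at the retrieval step keeps the user's instructions intact. The `build_prompt` helper, the prompt layout, and the placeholder context are assumptions made for illustration only.

```python
def build_prompt(context_docs: list[str], user_request: str) -> str:
    """Combine retrieved context with the user's request, passed through unchanged."""
    context = "\n".join(context_docs)
    return f"Context:\n{context}\n\nRequest: {user_request}"


# Both requests need the same background documents, so a retrieval-level
# cache can serve the same context to each of them (placeholder content).
cached_context = ["(documents about the French Revolution)"]

short_prompt = build_prompt(cached_context, "Explain the French Revolution in 10 words")
long_prompt = build_prompt(cached_context, "Explain the French Revolution in a hundred words")
```

The two prompts share the cached context but still carry different length instructions to the model, which is exactly what would be lost if the cache stored the model's responses instead.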



### Setup and load data

In [24]:
import numpy as np
import pandas as pd
import os
from getpass import getpass
from datasets import load_dataset
from chromadb.api.models.Collection import Collection

In [2]:
if 'hf_key' not in locals():
  hf_key = getpass("Your Hugging Face API Key: ")
!huggingface-cli login --token $hf_key

Token will not been saved to git credential helper. Pass `add_to_git_credential=True` if you want to set the git credential as well.
Token is valid (permission: write).
Your token has been saved to /Users/shaunaksen/.cache/huggingface/token
Login successful


In [3]:
data = load_dataset(
    path="keivalya/MedQuad-MedicalQnADataset",
    split="train"
)

ChromaDB requires each record to have a unique identifier. After inspecting the dataset, we convert it to a pandas DataFrame and create one from its index, in a new column called id.

In [4]:
data

Dataset({
    features: ['qtype', 'Question', 'Answer'],
    num_rows: 16407
})

In [5]:
data = data.to_pandas()
data["id"] = data.index

In [6]:
data.head()

| | qtype | Question | Answer | id |
|---|---|---|---|---|
| 0 | susceptibility | Who is at risk for Lymphocytic Choriomeningiti... | LCMV infections can occur after exposure to fr... | 0 |
| 1 | symptoms | What are the symptoms of Lymphocytic Choriomen... | LCMV is most commonly recognized as causing ne... | 1 |
| 2 | susceptibility | Who is at risk for Lymphocytic Choriomeningiti... | Individuals of all ages who come into contact ... | 2 |
| 3 | exams and tests | How to diagnose Lymphocytic Choriomeningitis (... | During the first phase of the disease, the mos... | 3 |
| 4 | treatment | What are the treatments for Lymphocytic Chorio... | Aseptic meningitis, encephalitis, or meningoen... | 4 |


In [7]:
data.shape

(16407, 4)

In [8]:
MAX_ROWS = 1000
DOCUMENT = "Answer"
TOPIC = "qtype"

In [9]:
# Because this is just a sample, we select a small portion of the dataset.
subset_data = data.head(MAX_ROWS)

### Import and configure the Vector Database

In [10]:
import chromadb

Now we only need to indicate the path where the vector database will be stored.

In [11]:
if not os.path.exists(os.path.join(os.getcwd(), "data")):
    os.mkdir(os.path.join(os.getcwd(), "data"))

In [12]:
chroma_path = os.path.join(os.getcwd(), "data")

In [13]:
chroma_client = chromadb.PersistentClient(path=chroma_path)

In the next lines, we are creating the collection by calling the create_collection function in the chroma_client created above.

In [14]:
chroma_client.count_collections()

1

In [15]:
chroma_client.list_collections()

[Collection(name=news_collection)]

If the collection already exists, we need to delete it before creating it again.

In [16]:
collection_name = "news_collection"
# Check every existing collection, not only the first one, before deleting.
if any(c.name == collection_name for c in chroma_client.list_collections()):
    chroma_client.delete_collection(name=collection_name)

collection = chroma_client.create_collection(
    name=collection_name
)

In [17]:
chroma_client.list_collections()

[Collection(name=news_collection)]

We are now ready to add the data to the collection using the add function. This function requires three key pieces of information:

- In documents, we store the content of the Answer column of the dataset.

- In metadatas, we can provide a list of topics. I used the value in the qtype column.

- In ids, we need to provide a unique identifier for each row. I’m creating the IDs from the row positions of the subset, which holds at most MAX_ROWS rows.

In [18]:
subset_data["Answer"]

0      LCMV infections can occur after exposure to fr...
1      LCMV is most commonly recognized as causing ne...
2      Individuals of all ages who come into contact ...
3      During the first phase of the disease, the mos...
4      Aseptic meningitis, encephalitis, or meningoen...
                             ...                        
995    Treatment options depend on the type of AVM, i...
996    The greatest potential danger posed by AVMs is...
997    The mission of the National Institute of Neuro...
998    Ataxia-telangiectasia is a rare, childhood neu...
999    There is no cure for A-T and, currently, no wa...
Name: Answer, Length: 1000, dtype: object

In [19]:
collection.add(
    ids=[f"id{id_}" for id_ in range(len(subset_data))],
    documents=subset_data['Answer'].tolist(),
    metadatas=[{TOPIC: topic_} for topic_ in subset_data[TOPIC].tolist()]
)

Once we have the information in the database, we can query it and ask for data that matches our needs. The search runs over the content of the documents, and it doesn’t look for an exact word or phrase. The results are based on the similarity between the search terms and the content of the documents.

Metadata isn’t directly involved in the similarity search itself, but it can be used to filter or refine the results, enabling further customization and precision.
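As a sketch of how metadata can narrow a search, Chroma's `query` method accepts a `where` argument with a metadata filter. The helper name `query_by_topic` is hypothetical; the `qtype` key matches the metadata stored in this notebook's collection.

```python
from typing import Any


def query_by_topic(collection: Any, query_text: str, topic: str, n_results: int = 10) -> dict:
    """Query a Chroma collection, keeping only documents whose qtype equals `topic`."""
    return collection.query(
        query_texts=[query_text],
        n_results=n_results,
        where={"qtype": topic},  # metadata filter on the field stored at add() time
    )
```

For example, `query_by_topic(collection, "How is the infection diagnosed?", "exams and tests", n_results=3)` would restrict the similarity search to documents tagged with that question type.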

Let’s define a function to query the ChromaDB Database.


In [23]:
type(collection)

chromadb.api.models.Collection.Collection

In [25]:
def query_database(collection: Collection, query_text: str, n_results: int=10):
    results = collection.query(query_texts=query_text, n_results=n_results)

    return results
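The object returned by `query_database` is a dictionary whose `documents` and `distances` entries hold one inner list per query text. A small helper for reading the matches of a single query might look like this; the name `top_matches` is hypothetical.

```python
def top_matches(results: dict, query_index: int = 0) -> list[tuple[str, float]]:
    """Pair each returned document with its distance for one query."""
    documents = results["documents"][query_index]
    distances = results["distances"][query_index]
    return list(zip(documents, distances))
```

Calling `top_matches(query_database(collection, "How is the infection diagnosed?", n_results=3))` would give up to three `(document, distance)` pairs, where a smaller distance means a closer match.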

In [26]:
data.head()

| | qtype | Question | Answer | id |
|---|---|---|---|---|
| 0 | susceptibility | Who is at risk for Lymphocytic Choriomeningiti... | LCMV infections can occur after exposure to fr... | 0 |
| 1 | symptoms | What are the symptoms of Lymphocytic Choriomen... | LCMV is most commonly recognized as causing ne... | 1 |
| 2 | susceptibility | Who is at risk for Lymphocytic Choriomeningiti... | Individuals of all ages who come into contact ... | 2 |
| 3 | exams and tests | How to diagnose Lymphocytic Choriomeningitis (... | During the first phase of the disease, the mos... | 3 |
| 4 | treatment | What are the treatments for Lymphocytic Chorio... | Aseptic meningitis, encephalitis, or meningoen... | 4 |


### Creating the semantic cache system

To implement the cache system, we will use Faiss, a library that allows storing embeddings in memory. It’s quite similar to what Chroma does, but without its persistence.

For this purpose, we will create a class called semantic_cache that will work with its own encoder and provide the necessary functions for the user to perform queries.

In this class, we first query the cache implemented with Faiss, which contains the previous requests. If the closest match is similar enough, meaning its distance falls within a specified threshold, the class returns the content of the cache. Otherwise, it fetches the result from the Chroma database.

The cache is stored in a .json file.
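A minimal sketch of that persistence step, using only the standard library. The function names `load_cache` and `save_cache` and the dictionary layout (`questions`, `embeddings`, `answers`) are assumptions for illustration and may differ from the class built in this notebook.

```python
import json


def load_cache(json_file: str) -> dict:
    """Read the cache from disk, or start an empty one if the file is missing."""
    try:
        with open(json_file, "r", encoding="utf-8") as f:
            return json.load(f)
    except FileNotFoundError:
        # Hypothetical layout: parallel lists, one entry per cached request.
        return {"questions": [], "embeddings": [], "answers": []}


def save_cache(json_file: str, cache: dict) -> None:
    """Write the cache to disk; embeddings must be plain lists, not arrays."""
    with open(json_file, "w", encoding="utf-8") as f:
        json.dump(cache, f)
```

Because JSON cannot store NumPy arrays directly, embeddings would need to be converted with `.tolist()` before saving.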



In [28]:
import faiss
from sentence_transformers import SentenceTransformer
import time
import json

The init_cache() function below initializes the semantic cache.

It employs the FlatL2 index, which might not be the fastest but is ideal for small datasets. Depending on the characteristics of the data intended for the cache and the expected dataset size, another index such as HNSW or IVF could be utilized.

I chose this index because it aligns well with the example. It can be used with vectors of high dimensions, consumes minimal memory, and performs well with small datasets.
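To show what a flat index does conceptually, here is a pure-Python sketch of an exhaustive nearest-neighbour search. It is not Faiss itself, and `flat_l2_search` is a hypothetical name. It ranks by squared L2 distance, which is also what Faiss reports for its flat L2 index.

```python
def flat_l2_search(vectors, query, k=1):
    """Compare the query against every stored vector and return the k closest.

    Returns a list of (squared_l2_distance, position) pairs, closest first.
    """
    scored = []
    for position, vector in enumerate(vectors):
        distance = sum((a - b) ** 2 for a, b in zip(vector, query))
        scored.append((distance, position))
    scored.sort()
    return scored[:k]
```

Every query touches every stored vector, so the cost grows linearly with the cache size. That is acceptable for a small cache and is the reason other index types become attractive for larger ones.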

Below, I outline the key features of some of the indices available with Faiss:

- FlatL2 or FlatIP. Well-suited for small datasets, it may not be the fastest, but its memory consumption is not excessive.
- LSH. It works effectively with small datasets and is recommended for use with vectors of up to 128 dimensions.
- HNSW. Very fast but demands a substantial amount of RAM.
- IVF. Works well with large datasets without consuming much memory or compromising performance.

More information about the different indices available with Faiss can be found at this link: https://github.com/facebookresearch/faiss/wiki/Guidelines-to-choose-an-index
