<a href="https://colab.research.google.com/github/sugarforever/LangChain-Tutorials/blob/main/LangChain_Caching.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Understanding LangChain Caching

In this notebook, we will see:
1. How the LangChain framework uses caching to make LLM interactions more efficient.
2. How two underlying storage backends, in-memory and SQLite, store and look up cached entries.

Hopefully this helps you decide if and when to use a cache.

In [1]:
# !pip install langchain openai --quiet --upgrade

## Get your ChatOpenAI instance ready

In [6]:
import os,sys
import openai
from dotenv import load_dotenv, find_dotenv
# sys.path.append("../..")

# Load environment variables from a local or project-level .env file.

# find_dotenv() locates the path of the .env file.
# load_dotenv() reads that .env file and loads its variables into the current environment.
# If the variables are already set globally, this call has no effect.
print(find_dotenv())
_ = load_dotenv(find_dotenv())
# Check that the key is set without printing the secret itself.
assert "OPENAI_API_KEY" in os.environ

from langchain import OpenAI
from langchain.chat_models import ChatOpenAI
from langchain.schema import (
    SystemMessage,
    HumanMessage,
    AIMessage
)

import langchain
from langchain.output_parsers import CommaSeparatedListOutputParser
from langchain.prompts import PromptTemplate, ChatPromptTemplate, HumanMessagePromptTemplate

chat_llm = ChatOpenAI()

c:\Users\lenovo\Desktop\LangChainPlayGround\DeeperTutorials\.env


## 1. In Memory Cache

In [3]:
from langchain.cache import InMemoryCache
langchain.llm_cache = InMemoryCache()

### Ask a question and measure how long it takes for LLM to respond.

In [5]:
%%time

chat_llm.predict("What is OpenAI?")

CPU times: total: 188 ms
Wall time: 8.93 s


"OpenAI is an artificial intelligence research laboratory consisting of the for-profit OpenAI LP and its non-profit parent company, OpenAI Inc. It was founded in December 2015 by Elon Musk, Sam Altman, Greg Brockman, Ilya Sutskever, John Schulman, and Wojciech Zaremba. OpenAI's mission is to ensure that artificial general intelligence (AGI) benefits all of humanity. AGI refers to highly autonomous systems that outperform humans in most economically valuable work. OpenAI conducts research in AI and develops AI technologies with a focus on long-term safety, technical leadership, and cooperative orientation. They aim to create safe and beneficial AI systems while also promoting the broad distribution of AI benefits. OpenAI has produced various notable AI models, including GPT-3, which is a state-of-the-art language model capable of natural language processing and generation."

#### How the cache stores data

**source code**: [cache.py](https://github.com/hwchase17/langchain/blob/v0.0.219/langchain/cache.py#L102)
```python
class InMemoryCache(BaseCache):
    """Cache that stores things in memory."""

    def __init__(self) -> None:
        """Initialize with empty cache."""
        self._cache: Dict[Tuple[str, str], RETURN_VAL_TYPE] = {}
```

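This is the implementation of `InMemoryCache`: a plain dictionary keyed by a tuple of two strings. The cells below inspect each element of that tuple.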

In [6]:
# First element of the tuple
list(langchain.llm_cache._cache.keys())[0][0]

'[{"lc": 1, "type": "constructor", "id": ["langchain", "schema", "messages", "HumanMessage"], "kwargs": {"content": "What is OpenAI?"}}]'

In [7]:
# Second element of the tuple
list(langchain.llm_cache._cache.keys())[0][1]

'{"lc": 1, "type": "constructor", "id": ["langchain", "chat_models", "openai", "ChatOpenAI"], "kwargs": {"openai_api_key": {"lc": 1, "type": "secret", "id": ["OPENAI_API_KEY"]}}}---[(\'stop\', None)]'
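The two key elements above suggest how a lookup works: the serialized prompt and the serialized LLM configuration together form the dictionary key, so both must match exactly. Below is a minimal sketch of that idea. The class name `TinyInMemoryCache`, the plain-string responses, and the placeholder LLM strings are assumptions made for illustration; this is not LangChain's actual class.

```python
from typing import Dict, Optional, Tuple


class TinyInMemoryCache:
    """Hypothetical stand-in for an in-memory LLM cache (not LangChain's class)."""

    def __init__(self) -> None:
        # Key: (serialized prompt, serialized LLM configuration).
        self._cache: Dict[Tuple[str, str], str] = {}

    def lookup(self, prompt: str, llm_string: str) -> Optional[str]:
        # Exact match on both parts of the key; None signals a cache miss.
        return self._cache.get((prompt, llm_string))

    def update(self, prompt: str, llm_string: str, response: str) -> None:
        self._cache[(prompt, llm_string)] = response


cache = TinyInMemoryCache()
cache.update("What is OpenAI?", "ChatOpenAI(stop=None)", "An AI research lab.")

# Same prompt, same LLM configuration: served from the dictionary.
hit = cache.lookup("What is OpenAI?", "ChatOpenAI(stop=None)")
# Same prompt, different LLM configuration: treated as a different key.
miss_other_llm = cache.lookup("What is OpenAI?", "OtherModel(stop=None)")
print(hit, miss_other_llm)
```

Including the LLM configuration in the key means that changing the model or its parameters does not return answers cached for a different setup.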

### Ask the same question again and see the quicker response.

In [9]:
%%time

chat_llm.predict("What is OpenAI?")

CPU times: total: 0 ns
Wall time: 0 ns


"OpenAI is an artificial intelligence research laboratory consisting of the for-profit OpenAI LP and its non-profit parent company, OpenAI Inc. It was founded in December 2015 by Elon Musk, Sam Altman, Greg Brockman, Ilya Sutskever, John Schulman, and Wojciech Zaremba. OpenAI's mission is to ensure that artificial general intelligence (AGI) benefits all of humanity. AGI refers to highly autonomous systems that outperform humans in most economically valuable work. OpenAI conducts research in AI and develops AI technologies with a focus on long-term safety, technical leadership, and cooperative orientation. They aim to create safe and beneficial AI systems while also promoting the broad distribution of AI benefits. OpenAI has produced various notable AI models, including GPT-3, which is a state-of-the-art language model capable of natural language processing and generation."

## 2. SQLite as Cache

In [10]:
# !rm -f .cache.db

'rm' is not recognized as an internal or external command,
operable program or batch file.


In [12]:
from langchain.cache import SQLiteCache
langchain.llm_cache = SQLiteCache(database_path=".cache.db")

### Ask the same question twice and measure the performance difference

In [14]:
%%time

chat_llm.predict("What is OpenAI?")

CPU times: total: 0 ns
Wall time: 4.96 s


'OpenAI is an artificial intelligence research laboratory and company that aims to ensure that artificial general intelligence (AGI) benefits all of humanity. AGI refers to highly autonomous systems that outperform humans at most economically valuable work. OpenAI conducts research, develops AI models, and promotes responsible deployment of AI technology. It also focuses on creating AI systems that are safe, transparent, and aligned with human values. OpenAI has developed various models, including the GPT-3 language model, and actively contributes to the open-source AI community.'

In [16]:
%%time

chat_llm.predict("What is OpenAI?")

CPU times: total: 0 ns
Wall time: 2 ms


'OpenAI is an artificial intelligence research laboratory and company that aims to ensure that artificial general intelligence (AGI) benefits all of humanity. AGI refers to highly autonomous systems that outperform humans at most economically valuable work. OpenAI conducts research, develops AI models, and promotes responsible deployment of AI technology. It also focuses on creating AI systems that are safe, transparent, and aligned with human values. OpenAI has developed various models, including the GPT-3 language model, and actively contributes to the open-source AI community.'

### Add an extra space to the prompt and ask again

In [18]:
%%time

chat_llm.predict("What is  OpenAI?")

CPU times: total: 78.1 ms
Wall time: 5.17 s


'OpenAI is an artificial intelligence research laboratory and company. It aims to ensure that artificial general intelligence (AGI) benefits all of humanity. OpenAI conducts research, develops AI models and technologies, and promotes the principles of safe and responsible AI development. It has created various AI models, including GPT-3, which is a powerful language model capable of generating human-like text. OpenAI also provides an API for developers to access its models and integrate them into their applications.'

In [19]:
import sqlalchemy
from sqlalchemy import create_engine, text
engine = create_engine("sqlite:///.cache.db")

### **Why does the extra space cause a cache miss?**

#### How SQLite stores cache data

**source code**: [cache.py](https://github.com/hwchase17/langchain/blob/v0.0.219/langchain/cache.py#L128)
```python
class FullLLMCache(Base):  # type: ignore
    """SQLite table for full LLM Cache (all generations)."""

    __tablename__ = "full_llm_cache"
    prompt = Column(String, primary_key=True)
    llm = Column(String, primary_key=True)
    idx = Column(Integer, primary_key=True)
    response = Column(String)


class SQLAlchemyCache(BaseCache):
    """Cache that uses SQAlchemy as a backend."""

    def __init__(self, engine: Engine, cache_schema: Type[FullLLMCache] = FullLLMCache):
        """Initialize by creating all tables."""
        self.engine = engine
        self.cache_schema = cache_schema
        self.cache_schema.metadata.create_all(self.engine)
```

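This is the schema of the cache table `full_llm_cache`. The columns `prompt`, `llm`, and `idx` together form the primary key, so a lookup matches the serialized prompt string exactly. An extra space changes that string, which is why the modified prompt missed the cache.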
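The exact-match behavior can be reproduced without LangChain. The sketch below uses the standard `sqlite3` module with an in-memory database; the column types, the `lookup` helper, and the placeholder value `"chat-openai"` are assumptions that only mirror the schema shown above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE full_llm_cache (
        prompt TEXT,
        llm TEXT,
        idx INTEGER,
        response TEXT,
        PRIMARY KEY (prompt, llm, idx)
    )
    """
)
conn.execute(
    "INSERT INTO full_llm_cache VALUES (?, ?, ?, ?)",
    ("What is OpenAI?", "chat-openai", 0, "An AI research lab."),
)


def lookup(prompt: str, llm: str) -> list:
    # Plain equality on the key columns: no trimming, no normalization.
    rows = conn.execute(
        "SELECT response FROM full_llm_cache "
        "WHERE prompt = ? AND llm = ? ORDER BY idx",
        (prompt, llm),
    ).fetchall()
    return [row[0] for row in rows]


exact = lookup("What is OpenAI?", "chat-openai")
with_extra_space = lookup("What is  OpenAI?", "chat-openai")
print(exact, with_extra_space)
```

The second lookup returns an empty list because `"What is  OpenAI?"` is a different string from the stored prompt, even though a human reads them as the same question.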

In [20]:
with engine.connect() as connection:

    rs = connection.exec_driver_sql('select * from full_llm_cache')
    print(rs.keys())
    for row in rs:
        print(row)

RMKeyView(['prompt', 'llm', 'idx', 'response'])
('[{"lc": 1, "type": "constructor", "id": ["langchain", "schema", "messages", "HumanMessage"], "kwargs": {"content": "What is OpenAI?"}}]', '{"lc": 1, "type": "constructor", "id": ["langchain", "chat_models", "openai", "ChatOpenAI"], "kwargs": {"openai_api_key": {"lc": 1, "type": "secret", "id": ["OPENAI_API_KEY"]}}}---[(\'stop\', None)]', 0, '{"lc": 1, "type": "constructor", "id": ["langchain", "schema", "output", "ChatGeneration"], "kwargs": {"message": {"lc": 1, "type": "constructor", "i ... (589 characters truncated) ... language model, and actively contributes to the open-source AI community.", "additional_kwargs": {}}}, "generation_info": {"finish_reason": "stop"}}}')
('[{"lc": 1, "type": "constructor", "id": ["langchain", "schema", "messages", "HumanMessage"], "kwargs": {"content": "What is  OpenAI?"}}]', '{"lc": 1, "type": "constructor", "id": ["langchain", "chat_models", "openai", "ChatOpenAI"], "kwargs": {"openai_api_key": {"lc"

## Semantic Cache

A semantic cache stores prompts and responses, and evaluates hits based on semantic similarity rather than exact string equality.
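To make the idea concrete, here is a minimal sketch of a semantic lookup. It assumes a toy word-count "embedding" and cosine distance in place of a real embedding model and vector store, and it assumes a hit is accepted when the distance is at or below the threshold. The names `embed`, `cosine_distance`, and `ToySemanticCache` are hypothetical and not part of LangChain or Redis.

```python
import math
from collections import Counter
from typing import List, Optional, Tuple


def embed(text: str) -> Counter:
    # Toy embedding: lowercase word counts. A real setup would call an embedding model.
    return Counter(text.lower().split())


def cosine_distance(a: Counter, b: Counter) -> float:
    dot = sum(a[word] * b[word] for word in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


class ToySemanticCache:
    def __init__(self, score_threshold: float) -> None:
        self.score_threshold = score_threshold
        self._entries: List[Tuple[Counter, str]] = []

    def update(self, prompt: str, response: str) -> None:
        self._entries.append((embed(prompt), response))

    def lookup(self, prompt: str) -> Optional[str]:
        query = embed(prompt)
        scored = [(cosine_distance(query, vec), resp) for vec, resp in self._entries]
        if not scored:
            return None
        # Take the closest stored prompt and accept it only within the threshold.
        score, response = min(scored, key=lambda pair: pair[0])
        return response if score <= self.score_threshold else None


cache = ToySemanticCache(score_threshold=0.2)
cache.update("Please translate 'this is Monday' into Chinese", "cached Monday answer")

similar = cache.lookup("Please translate 'this is Tuesday' into Chinese")
unrelated = cache.lookup("Tell me a joke")
print(similar, unrelated)
```

Note that the Tuesday prompt receives the answer cached for Monday. A semantic cache trades correctness for speed whenever two prompts are close but not equivalent, which is why the threshold matters.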

In [None]:
# !pip install langchain openai --quiet --upgrade

In [None]:
# import os
# os.environ['OPENAI_API_KEY'] = 'your openai api key'

### Follow the [official Redis doc](https://redis.com/blog/running-redis-on-google-colab/) to install and start a Redis server on Google Colab.

In [None]:
# !curl -fsSL https://packages.redis.io/redis-stack/redis-stack-server-6.2.6-v7.focal.x86_64.tar.gz -o redis-stack-server.tar.gz
# !tar -xvf redis-stack-server.tar.gz
# !pip install redis

# !./redis-stack-server-6.2.6-v7/bin/redis-stack-server --daemonize yes

In [21]:
import langchain
from langchain.llms import OpenAI

# To make the caching really obvious, let's use a slower model.
llm = OpenAI(model_name="text-davinci-002", n=2, best_of=2)

### Initialize the Redis semantic cache with default score threshold 0.2

In [22]:
from langchain.embeddings import OpenAIEmbeddings
from langchain.cache import RedisSemanticCache


langchain.llm_cache = RedisSemanticCache(redis_url="redis://localhost:6379", 
                                         embedding=OpenAIEmbeddings(), 
                                         score_threshold=0.2)

In [23]:
%%time

llm("Please translate 'this is Monday' into Chinese")

Redis cannot be used as a vector database without RediSearch >=2.4Please head to https://redis.io/docs/stack/search/quick_start/to know more about installing the RediSearch module within Redis Stack.


ValueError: Redis failed to connect: Redis cannot be used as a vector database without RediSearch >=2.4Please head to https://redis.io/docs/stack/search/quick_start/to know more about installing the RediSearch module within Redis Stack.

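The run above failed because the local Redis server lacks the RediSearch module, so the remaining Redis cells have no recorded output. With a working Redis Stack server, note that the query below differs from the previous one by a single word, so it should be served from the cache as a similar hit.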

In [None]:
%%time

llm("Please translate 'this is Tuesday' into Chinese")

In [None]:
%%time

llm("Tell me a joke")

In [None]:
%%time

llm("Tell me 2 jokes")

### Initialize the Redis semantic cache with a lower score threshold of 0.05

In [None]:
langchain.llm_cache = RedisSemanticCache(redis_url="redis://localhost:6379", embedding=OpenAIEmbeddings(), score_threshold=0.05)

In [None]:
%%time

llm("Give me a peach")

In [None]:
%%time

llm("Give me 2 peaches")

### Deep dive into Redis semantic cache

#### Find the keys in the cache

In [None]:
langchain.llm_cache._cache_dict

#### Manually execute similarity search to fetch the similar documents with scores

The more similar a document is to the query, the smaller its score should be. In other words, the score behaves like a distance.

In [None]:
langchain.llm_cache._cache_dict['cache:bf6f6d9ebdf492e28cb8bf4878a4b951'].similarity_search_with_score(query='Give me 2 peaches')

### Conclusion

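The score threshold is the key factor when using the Redis semantic cache: it decides how close a new prompt must be to a cached one to count as a hit. A larger threshold (0.2) accepts looser matches, while a smaller one (0.05) demands nearly identical prompts.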
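The effect of the threshold can be sketched with a few lines of plain Python. The scores below are hypothetical distances chosen for illustration, not values measured from Redis, and the `cache_hits` helper is an assumption about how a threshold is applied.

```python
from typing import Dict, List

# Hypothetical distance scores (smaller means more similar); not measured values.
candidate_scores: Dict[str, float] = {
    "near-duplicate prompt": 0.03,
    "one word changed": 0.12,
    "unrelated prompt": 0.45,
}


def cache_hits(scores: Dict[str, float], score_threshold: float) -> List[str]:
    # A candidate counts as a hit only if its distance is within the threshold.
    return [name for name, score in scores.items() if score <= score_threshold]


loose = cache_hits(candidate_scores, score_threshold=0.2)
strict = cache_hits(candidate_scores, score_threshold=0.05)
print(loose)
print(strict)
```

With these assumed scores, the looser threshold also accepts the prompt with one word changed, while the stricter one keeps only the near-duplicate.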

## Semantic Cache with GPTCache

### What is GPTCache?

GPTCache is an open-source project dedicated to building a semantic cache for storing LLM responses.

Two use cases:
1. Exact match
2. Similar match

GPTCache addresses the following questions:
1. How to generate embeddings for the queries? (via embedding function)
2. How to cache the data? (via cache store of data manager, such as SQLite, MySQL, and PostgreSQL. More NoSQL databases will be added in the future)
3. How to store and search vector embeddings? (via vector store of data manager, such as FAISS or vector databases such as Milvus. More vector databases and cloud services will be added in the future.)
4. How to determine eviction policy? (LRU or FIFO)
5. How to determine cache hit or miss? (via evaluation function)

The following `Cache` class definition shows how these questions are addressed:

```python
class Cache:
   def init(self,
            cache_enable_func=cache_all,
            pre_embedding_func=last_content,
            embedding_func=string_embedding,
            data_manager: DataManager = get_data_manager(),
            similarity_evaluation=ExactMatchEvaluation(),
            post_process_messages_func=first,
            config=Config(),
            next_cache=None,
            **kwargs
            ):
       self.has_init = True
       self.cache_enable_func = cache_enable_func
       self.pre_embedding_func = pre_embedding_func
       self.embedding_func = embedding_func
       self.data_manager: DataManager = data_manager
       self.similarity_evaluation = similarity_evaluation
       self.post_process_messages_func = post_process_messages_func
       self.data_manager.init(**kwargs)
       self.config = config
       self.next_cache = next_cache
```
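Of the questions listed earlier, the eviction policy is the easiest to illustrate in isolation. The sketch below contrasts LRU and FIFO using `collections.OrderedDict`. The `BoundedCache` class and the `surviving_keys` helper are hypothetical and do not reflect how GPTCache implements eviction.

```python
from collections import OrderedDict
from typing import List, Optional


class BoundedCache:
    """Hypothetical bounded cache supporting 'LRU' or 'FIFO' eviction."""

    def __init__(self, max_size: int, policy: str = "LRU") -> None:
        if policy not in ("LRU", "FIFO"):
            raise ValueError("policy must be 'LRU' or 'FIFO'")
        if max_size < 1:
            raise ValueError("max_size must be at least 1")
        self.max_size = max_size
        self.policy = policy
        self._data: "OrderedDict[str, str]" = OrderedDict()

    def get(self, key: str) -> Optional[str]:
        if key not in self._data:
            return None
        if self.policy == "LRU":
            # A read counts as use, so move the entry to the most-recent end.
            self._data.move_to_end(key)
        return self._data[key]

    def put(self, key: str, value: str) -> None:
        if key in self._data:
            self._data[key] = value
            if self.policy == "LRU":
                self._data.move_to_end(key)
            return
        if len(self._data) >= self.max_size:
            # Evict from the front: oldest insertion (FIFO) or least recently used (LRU).
            self._data.popitem(last=False)
        self._data[key] = value

    def keys(self) -> List[str]:
        return list(self._data)


def surviving_keys(policy: str) -> List[str]:
    cache = BoundedCache(max_size=2, policy=policy)
    cache.put("a", "1")
    cache.put("b", "2")
    cache.get("a")       # touch "a" before the cache overflows
    cache.put("c", "3")  # forces one eviction
    return cache.keys()


lru_keys = surviving_keys("LRU")
fifo_keys = surviving_keys("FIFO")
print(lru_keys, fifo_keys)
```

Under LRU the recently read entry `"a"` survives and `"b"` is evicted; under FIFO the read makes no difference and `"a"` is evicted because it was inserted first.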

In [1]:
!pip install gptcache 

Collecting gptcache
Installing collected packages: gptcache
Successfully installed gptcache-0.1.39.1




In [8]:
import langchain
from langchain.llms import OpenAI

llm = OpenAI(model_name="text-davinci-002", n=2, best_of=2)

### Exact Match

In [9]:
from gptcache import Cache
from gptcache.manager.factory import manager_factory
from gptcache.processor.pre import get_prompt
from gptcache.adapter.api import init_similar_cache
from langchain.cache import GPTCache
import hashlib

def get_hashed_name(name):
    return hashlib.sha256(name.encode()).hexdigest()


def init_gptcache(cache_obj: Cache, llm: str):
    hashed_llm = get_hashed_name(llm)
    cache_obj.init(
        pre_embedding_func=get_prompt,
        data_manager=manager_factory(
            manager="map", 
            data_dir=f"map_cache_{hashed_llm}"),
    )


langchain.llm_cache = GPTCache(init_gptcache)

In [10]:
question = "What is cache eviction policy?"

In [11]:
%%time

llm(question)

CPU times: total: 31.2 ms
Wall time: 6.16 s


'\n\nA cache eviction policy is a set of rules that determine which items in a cache should be removed when the cache becomes full and new items need to be added.'

In [12]:
%%time

llm(question)

CPU times: total: 0 ns
Wall time: 998 µs


'\n\nA cache eviction policy is a set of rules that determine which items in a cache should be removed when the cache becomes full and new items need to be added.'

In [13]:
%%time

llm("What is cache eviction   policy?")

CPU times: total: 0 ns
Wall time: 2.54 s


'\n\nThere are several cache eviction policies, including least recently used (LRU), first in first out (FIFO), and random.'
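The miss above comes from exact string comparison: the extra spaces make the prompt a different key. One possible workaround, which is not part of LangChain or GPTCache, is to normalize whitespace before using the prompt as a cache key. In the sketch below, `normalize_prompt`, `cached_call`, and `fake_llm` are hypothetical helpers, and `fake_llm` stands in for a real LLM call.

```python
import re
from typing import Callable, Dict, List


def normalize_prompt(prompt: str) -> str:
    # Collapse runs of whitespace into one space and trim both ends.
    return re.sub(r"\s+", " ", prompt).strip()


cache: Dict[str, str] = {}


def cached_call(prompt: str, llm_fn: Callable[[str], str]) -> str:
    key = normalize_prompt(prompt)
    if key not in cache:
        cache[key] = llm_fn(prompt)
    return cache[key]


calls: List[str] = []


def fake_llm(prompt: str) -> str:
    # Stand-in for a real LLM call; records each invocation.
    calls.append(prompt)
    return f"answer #{len(calls)}"


first = cached_call("What is cache eviction policy?", fake_llm)
second = cached_call("What is cache eviction   policy?", fake_llm)
print(first, second, len(calls))
```

Both calls resolve to the same key, so the stand-in LLM runs only once. Normalization like this is only safe when whitespace carries no meaning in the prompt, which may not hold for prompts containing code or formatted text.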

### Similar Match

In [14]:
from gptcache import Cache
from gptcache.adapter.api import init_similar_cache
from langchain.cache import GPTCache
import hashlib


def get_hashed_name(name):
    return hashlib.sha256(name.encode()).hexdigest()


def init_gptcache(cache_obj: Cache, llm: str):
    hashed_llm = get_hashed_name(llm)
    init_similar_cache(cache_obj=cache_obj, data_dir=f"similar_cache_{hashed_llm}")


langchain.llm_cache = GPTCache(init_gptcache)

In [15]:
%%time

llm(question)

Downloading model.onnx:   0%|          | 0.00/46.9M [00:00<?, ?B/s]

CPU times: total: 7.22 s
Wall time: 1min 43s


'\n\nA cache eviction policy is a set of rules that determine when and how often cached data is removed from the cache.'

In [16]:
%%time

llm(question)

CPU times: total: 3.47 s
Wall time: 459 ms


'\n\nA cache eviction policy is a set of rules that determine when and how often cached data is removed from the cache.'

In [17]:
%%time

llm("What is cache eviction   policy?")

CPU times: total: 5.59 s
Wall time: 925 ms


'\n\nA cache eviction policy is a set of rules that determine when and how often cached data is removed from the cache.'

In [18]:
%%time

llm("Give me a peach")

CPU times: total: 6.22 s
Wall time: 1.98 s


",\n\nAnd I'll give you a kiss."

In [19]:
%%time

llm("Give me 2 peaches")

CPU times: total: 3.27 s
Wall time: 400 ms


",\n\nAnd I'll give you a kiss."