<a href="https://colab.research.google.com/github/sudarshan-koirala/youtube-stuffs/blob/main/langchain/cache_llm_calls.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Cache** in LangChain

## Outline:

* In-Memory Cache
* SQLite Cache
* GPTCache
* Optional Caching
* Optional Caching in Chains

## Overview
* Caching LLM responses reduces the number of API calls made to the service, which lowers cost.
* Storing LLM responses in a cache can significantly reduce response retrieval time and improve the overall performance of the application.
* Caching is particularly relevant when dealing with high traffic levels, where API call expenses can be substantial.
* Traditional cache systems typically require an exact match between a new query and a cached query before serving content from the cache. This approach is less effective for LLM caches, because LLM queries are complex and variable: the same request can be phrased in many ways, so exact matches are rare.
* Other LLM cost-optimization techniques include prompt engineering, caching with vector stores, chains for long documents, summarization for efficient chat history, and fine-tuning.
* Choosing the right caching mechanism depends on the use case you are working on.

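The limitation of exact-match caching mentioned in the overview can be seen in a minimal sketch. `ExactMatchCache` and `fake_llm` are hypothetical names used only for illustration; they are not part of LangChain, and `fake_llm` stands in for a paid API call.

```python
class ExactMatchCache:
    """Toy exact-match cache keyed on the raw prompt string."""

    def __init__(self):
        self._store = {}
        self.hits = 0
        self.misses = 0

    def get_or_compute(self, prompt, compute):
        # Serve from the cache only if the prompt matches character for character
        if prompt in self._store:
            self.hits += 1
            return self._store[prompt]
        self.misses += 1
        value = compute(prompt)
        self._store[prompt] = value
        return value


def fake_llm(prompt):
    # Hypothetical stand-in for a real LLM API call
    return f"response to: {prompt!r}"


cache = ExactMatchCache()
cache.get_or_compute("Tell me a joke", fake_llm)  # miss: first time seen
cache.get_or_compute("Tell me a joke", fake_llm)  # hit: identical prompt
cache.get_or_compute("Tell me joke", fake_llm)    # miss: one word differs
print(cache.hits, cache.misses)
```

A prompt that differs by a single word is treated as entirely new, which is the gap that similarity-based caching tries to close.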
## Import & Setup Libraries

In [None]:

!pip install langchain openai gptcache tiktoken

In [None]:
import os
import openai
import warnings
from getpass import getpass
from langchain.llms import OpenAI
warnings.filterwarnings("ignore")

OPENAI_API_KEY = getpass()
os.environ["OPENAI_API_KEY"] = OPENAI_API_KEY
openai.api_key = os.getenv("OPENAI_API_KEY")

llm = OpenAI(model_name="text-davinci-002", n=2, best_of=2)

## In-Memory Cache
- The examples below use the OpenAI callback to [track token usage](https://python.langchain.com/en/latest/modules/models/llms/examples/token_usage_tracking.html), so you can see what each call costs.

In [None]:
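import langchain  # needed so that langchain.llm_cache can be assigned below
from langchain.cache import InMemoryCache
from langchain.callbacks import get_openai_callback

# Enable a global in-memory cache for LLM calls
langchain.llm_cache = InMemoryCache()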

In [None]:
%%time
# The first time, it is not yet in cache, so it should take longer
with get_openai_callback() as cb:
    result = llm("Tell me a joke")
    print(cb)
    print(result)

Tokens Used: 39
	Prompt Tokens: 4
	Completion Tokens: 35
Successful Requests: 1
Total Cost (USD): $0.0007800000000000001


Why did the chicken cross the road?

To get to the other side!
CPU times: user 16.8 ms, sys: 3.96 ms, total: 20.7 ms
Wall time: 1.44 s


In [None]:
%%time
# The second time, the result is in the cache, so the call returns faster and costs nothing
with get_openai_callback() as cb:
    result = llm("Tell me a joke")
    print(cb)
    print(result)

Tokens Used: 0
	Prompt Tokens: 0
	Completion Tokens: 0
Successful Requests: 0
Total Cost (USD): $0.0


Why did the chicken cross the road?

To get to the other side!
CPU times: user 365 µs, sys: 919 µs, total: 1.28 ms
Wall time: 1.29 ms


In [None]:
%%time
# A third call with a different prompt is not in the cache, so it hits the API again
with get_openai_callback() as cb:
    result = llm("Capital of Nepal")
    print(cb)
    print(result)

Tokens Used: 15
	Prompt Tokens: 3
	Completion Tokens: 12
Successful Requests: 1
Total Cost (USD): $0.0003


Kathmandu is the capital of Nepal.
CPU times: user 16.8 ms, sys: 1.19 ms, total: 18 ms
Wall time: 1.17 s

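The cost figures printed by the callback above can be reproduced with simple arithmetic. This is a sketch under one assumption: a flat rate of $0.02 per 1,000 tokens, which is inferred from the outputs above (39 tokens for about $0.00078) rather than taken from a price list. `estimate_cost` is a hypothetical helper.

```python
# Assumption: flat rate inferred from the callback output above, not an official price
PRICE_PER_1K_TOKENS_USD = 0.02


def estimate_cost(total_tokens, price_per_1k=PRICE_PER_1K_TOKENS_USD):
    """Estimate the cost in USD of a call that used `total_tokens` tokens."""
    return total_tokens * price_per_1k / 1000


first_call = estimate_cost(39)   # cache miss: 4 prompt + 35 completion tokens
cached_call = estimate_cost(0)   # cache hit: no tokens are billed
third_call = estimate_cost(15)   # different prompt: 3 prompt + 12 completion tokens

print(f"{first_call:.5f} {cached_call:.5f} {third_call:.5f}")
```

The cached call contributes nothing to the total, so the saving per cache hit equals the full cost of the original call.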

## SQLite Cache

In [None]:
# We can do the same thing with a SQLite cache
from langchain.cache import SQLiteCache
langchain.llm_cache = SQLiteCache(database_path=".langchain.db")

In [None]:
%%time
# The first time, it is not yet in cache, so it should take longer
llm("Tell me a joke")

CPU times: user 19.7 ms, sys: 171 µs, total: 19.8 ms
Wall time: 1.54 s


'\n\nTwo guys stole a calendar. They got six months each.'

In [None]:
%%time
# The second time, the result is in the cache, so the call returns faster
llm("Tell me a joke")

CPU times: user 1.13 ms, sys: 13 µs, total: 1.15 ms
Wall time: 1.15 ms


'\n\nTwo guys stole a calendar. They got six months each.'

## GPTCache

We can use [GPTCache](https://github.com/zilliztech/GPTCache) for exact-match caching or to cache results based on semantic similarity.

Let's start with an example of exact-match caching.

In [None]:
from gptcache import Cache
from gptcache.manager.factory import manager_factory
from gptcache.processor.pre import get_prompt
from langchain.cache import GPTCache
import hashlib

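def get_hashed_name(name):
    # Hash the LLM description into a fixed-length, filesystem-safe name
    return hashlib.sha256(name.encode()).hexdigest()

def init_gptcache(cache_obj: Cache, llm: str):
    # `llm` is a string describing the LLM; hashing it gives each configuration its own cache directory
    hashed_llm = get_hashed_name(llm)
    cache_obj.init(
        pre_embedding_func=get_prompt,  # use the prompt itself as the cache key
        data_manager=manager_factory(manager="map", data_dir=f"map_cache_{hashed_llm}"),  # "map" manager: exact-match lookup
    )

langchain.llm_cache = GPTCache(init_gptcache)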

In [None]:
%%time
# The first time, it is not yet in cache, so it should take longer
llm("Tell me a joke")

CPU times: user 11.7 ms, sys: 165 µs, total: 11.8 ms
Wall time: 1.59 s


'\n\nWhy did the chicken cross the road?\n\nTo get to the other side.'

In [None]:
%%time
# The second time, the result is in the cache, so the call returns faster
llm("Tell me a joke")

CPU times: user 322 µs, sys: 31 µs, total: 353 µs
Wall time: 357 µs


'\n\nWhy did the chicken cross the road?\n\nTo get to the other side.'

<font color="orange">Let's now show an example of similarity caching</font>

In [None]:
from gptcache import Cache
from gptcache.adapter.api import init_similar_cache
from langchain.cache import GPTCache
import hashlib

def get_hashed_name(name):
    return hashlib.sha256(name.encode()).hexdigest()

def init_gptcache(cache_obj: Cache, llm: str):
    hashed_llm = get_hashed_name(llm)
    init_similar_cache(cache_obj=cache_obj, data_dir=f"similar_cache_{hashed_llm}")

langchain.llm_cache = GPTCache(init_gptcache)

In [None]:
%%time
# The first time, it is not yet in cache, so it should take longer
llm("Tell me a joke")

Downloading (…)okenizer_config.json:   0%|          | 0.00/465 [00:00<?, ?B/s]

Downloading (…)lve/main/config.json:   0%|          | 0.00/827 [00:00<?, ?B/s]

Downloading spiece.model:   0%|          | 0.00/760k [00:00<?, ?B/s]

Downloading (…)/main/tokenizer.json:   0%|          | 0.00/1.31M [00:00<?, ?B/s]

Downloading (…)cial_tokens_map.json:   0%|          | 0.00/245 [00:00<?, ?B/s]

Downloading model.onnx:   0%|          | 0.00/46.9M [00:00<?, ?B/s]

CPU times: user 2.34 s, sys: 247 ms, total: 2.59 s
Wall time: 25.4 s


'\n\nTwo guys stole a calendar. They got six months each.'

In [None]:
%%time
# This is an exact match, so it finds it in the cache
llm("Tell me a joke")

CPU times: user 1.14 s, sys: 4.87 ms, total: 1.14 s
Wall time: 597 ms


'\n\nTwo guys stole a calendar. They got six months each.'

In [None]:
%%time
# This is not an exact match, but semantically within distance so it hits!
llm("Tell me joke")

CPU times: user 853 ms, sys: 14.8 ms, total: 868 ms
Wall time: 224 ms


'\n\nWhy did the chicken cross the road?\n\nTo get to the other side.'

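The near-match hit above can be illustrated with a toy similarity cache. This sketch deliberately swaps in word-overlap (Jaccard) similarity for the embedding model and vector search that GPTCache sets up, so it shows the idea only, not GPTCache's implementation. `ToySimilarityCache`, `jaccard`, and the `0.7` threshold are all hypothetical choices.

```python
def tokens(text):
    # Lowercased set of whitespace-separated words
    return set(text.lower().split())


def jaccard(a, b):
    # Word-overlap similarity in [0, 1]; a crude stand-in for embedding similarity
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


class ToySimilarityCache:
    def __init__(self, threshold=0.7):
        self.threshold = threshold
        self._entries = []  # list of (prompt, response) pairs

    def update(self, prompt, response):
        self._entries.append((prompt, response))

    def lookup(self, prompt):
        # Return the response of the most similar cached prompt, if similar enough
        best_score, best_response = 0.0, None
        for cached_prompt, response in self._entries:
            score = jaccard(prompt, cached_prompt)
            if score > best_score:
                best_score, best_response = score, response
        return best_response if best_score >= self.threshold else None


cache = ToySimilarityCache(threshold=0.7)
cache.update("Tell me a joke", "Two guys stole a calendar. They got six months each.")

near_match = cache.lookup("Tell me joke")      # 3 shared words out of 4 -> 0.75, a hit
unrelated = cache.lookup("Capital of Nepal")   # no shared words -> a miss
print(near_match is not None, unrelated is None)
```

The threshold is the main design choice: set it too low and unrelated prompts receive stale answers, set it too high and the cache degrades to exact matching.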
## Optional Caching
- You can also turn off caching for specific LLMs if needed. 
- In the example below, even though global caching is enabled, we turn it off for a specific LLM.

In [None]:
llm = OpenAI(model_name="text-davinci-002", n=2, best_of=2, cache=False)

In [None]:
%%time
llm("Tell me a joke")

CPU times: user 1.15 s, sys: 7.63 ms, total: 1.16 s
Wall time: 2.91 s


"\n\nWhy couldn't the bicycle stand up by itself? Because it was...two tired!"

In [None]:
%%time
llm("Tell me a joke")

CPU times: user 1.15 s, sys: 5.01 ms, total: 1.15 s
Wall time: 2.97 s


'\n\nWhy did the chicken cross the road?\n\nTo get to the other side.'

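The effect of `cache=False` seen above (two calls, two different jokes, similar wall times) can be modelled in a few lines. This is a simplified sketch, not LangChain's implementation: `ToyLLM` and `GLOBAL_CACHE` are hypothetical, and the toy cache is keyed on the prompt alone.

```python
GLOBAL_CACHE = {}  # hypothetical stand-in for a globally configured cache


class ToyLLM:
    def __init__(self, cache=True):
        self.cache = cache   # per-instance switch, analogous to cache=False above
        self.api_calls = 0

    def _call_api(self, prompt):
        # Stand-in for a real, billed API request
        self.api_calls += 1
        return f"answer #{self.api_calls} to {prompt!r}"

    def __call__(self, prompt):
        if not self.cache:
            # Opted out: bypass the global cache entirely
            return self._call_api(prompt)
        if prompt not in GLOBAL_CACHE:
            GLOBAL_CACHE[prompt] = self._call_api(prompt)
        return GLOBAL_CACHE[prompt]


cached_llm = ToyLLM()
uncached_llm = ToyLLM(cache=False)
for _ in range(2):
    cached_llm("Tell me a joke")
    uncached_llm("Tell me a joke")

print(cached_llm.api_calls, uncached_llm.api_calls)
```

The opted-out instance neither reads from nor writes to the shared cache, so every call reaches the API.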
## Optional Caching in Chains
- You can also turn off caching for particular nodes in chains. Note that because of certain interfaces, it's often easier to construct the chain first and then edit the LLM afterwards.

- As an example, we will load a map-reduce summarization chain. We will cache results for the map step, but not for the combine (reduce) step.

In [None]:
llm = OpenAI(model_name="text-davinci-002")
no_cache_llm = OpenAI(model_name="text-davinci-002", cache=False)

In [None]:
from langchain.text_splitter import CharacterTextSplitter

# refer to this article for more info -> https://www.pinecone.io/learn/chunking-strategies/
text_splitter = CharacterTextSplitter(
    separator="\n\n",
    chunk_size=512,
    # chunk_overlap=20
)

In [None]:
# Download the State of the Union text from the LangChain GitHub repository
!wget https://raw.githubusercontent.com/hwchase17/langchain/master/docs/modules/state_of_the_union.txt

--2023-06-01 17:49:41--  https://raw.githubusercontent.com/hwchase17/langchain/master/docs/modules/state_of_the_union.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 39027 (38K) [text/plain]
Saving to: ‘state_of_the_union.txt’


2023-06-01 17:49:41 (104 MB/s) - ‘state_of_the_union.txt’ saved [39027/39027]



In [None]:
with open('/content/state_of_the_union.txt') as f:
    state_of_the_union = f.read()
texts = text_splitter.split_text(state_of_the_union)

In [None]:
len(texts)

119

In [None]:
from langchain.docstore.document import Document
docs = [Document(page_content=t) for t in texts[:3]]
from langchain.chains.summarize import load_summarize_chain

In [None]:
len(docs)

3

In [None]:
chain = load_summarize_chain(llm, chain_type="map_reduce", reduce_llm=no_cache_llm)

In [None]:
%%time
#chain.run(docs)
with get_openai_callback() as cb:
    response = chain.run(docs)
    print(f"Total Tokens: {cb.total_tokens}")
    print(f"Prompt Tokens: {cb.prompt_tokens}")
    print(f"Completion Tokens: {cb.completion_tokens}")
    print(f"Total Cost (USD): ${cb.total_cost}")
    print(response)

Total Tokens: 745
Prompt Tokens: 533
Completion Tokens: 212
Total Cost (USD): $0.0149


The President gave a speech to Congress, highlighting the fact that Americans from all political backgrounds can come together and work towards a common goal. He also reaffirmed the country's commitment to freedom and democracy. In light of Vladimir Putin's attempted takeover of Ukraine, the President noted the strength and determination of the Ukrainian people.
CPU times: user 130 ms, sys: 25.7 ms, total: 155 ms
Wall time: 5.57 s


<font color="orange">When we run it again, we see that it runs substantially faster but the final answer is different. This is due to caching at the map steps, but not at the reduce step.</font>

In [None]:
from pprint import pprint

In [None]:
%%time
#chain.run(docs)
with get_openai_callback() as cb:
    response = chain.run(docs)
    print(f"Total Tokens: {cb.total_tokens}")
    print(f"Prompt Tokens: {cb.prompt_tokens}")
    print(f"Completion Tokens: {cb.completion_tokens}")
    print(f"Total Cost (USD): ${cb.total_cost}")
    pprint(response)

Total Tokens: 229
Prompt Tokens: 169
Completion Tokens: 60
Total Cost (USD): $0.00458
('\n'
 '\n'
 'The President gave a speech to Congress, highlighting the fact that '
 'Americans from all political backgrounds can come together and work towards '
 "a common goal. He also reaffirmed the country's commitment to freedom and "
 'democracy. Vladimir Putin tried to take over Ukraine, but he underestimated '
 'the strength of the Ukrainian people.')
CPU times: user 16.8 ms, sys: 1.38 ms, total: 18.2 ms
Wall time: 2.06 s

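The pattern in the two runs above (fewer tokens billed on the second run, but a different final summary) can be modelled with a toy map-reduce pipeline. This is a sketch only: `CountingLLM` and `map_reduce_summarize` are hypothetical stand-ins, not LangChain classes, and the "summaries" are placeholder strings.

```python
class CountingLLM:
    """Toy LLM that counts how many calls actually reach the (fake) API."""

    def __init__(self, cache=None):
        self.cache = cache   # a dict to enable caching, or None to disable it
        self.api_calls = 0

    def __call__(self, prompt):
        if self.cache is not None and prompt in self.cache:
            return self.cache[prompt]
        self.api_calls += 1
        result = f"summary({len(prompt)} chars)"
        if self.cache is not None:
            self.cache[prompt] = result
        return result


def map_reduce_summarize(chunks, map_llm, reduce_llm):
    # Map step: summarize each chunk; reduce step: summarize the joined summaries
    partials = [map_llm(chunk) for chunk in chunks]
    return reduce_llm("\n".join(partials))


map_llm = CountingLLM(cache={})       # map step is cached
reduce_llm = CountingLLM(cache=None)  # reduce step is never cached
chunks = ["first chunk", "second chunk", "third chunk"]

map_reduce_summarize(chunks, map_llm, reduce_llm)
first_run = (map_llm.api_calls, reduce_llm.api_calls)

map_reduce_summarize(chunks, map_llm, reduce_llm)
second_run = (map_llm.api_calls, reduce_llm.api_calls)

print(first_run, second_run)
```

On the second run only the reduce step reaches the API, which matches the lower token count reported above. With a real LLM, that uncached reduce call is also why the final wording can change between runs.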