# Caching LLM Responses in LangChain

Caching is the practice of storing frequently accessed data or results in a temporary, faster storage layer.

In LangChain, caching reduces the number of API calls made to the LLM provider and speeds up the application, which lowers cost and improves the user experience.

When we repeatedly request the same completion from an LLM, caching ensures that the result is stored locally. Subsequent requests for the same input are then served directly from the cache, avoiding redundant and expensive API calls to the LLM provider.
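
The lookup-then-store idea described above can be sketched in plain Python. This is only an illustration, not LangChain's implementation: `PromptCache` and `fake_llm` are hypothetical names, and the "model" is a stand-in function that sleeps briefly instead of calling a real API.

```python
import time


class PromptCache:
    """Toy in-process cache keyed on the exact prompt text (hypothetical helper)."""

    def __init__(self):
        self._store = {}
        self.misses = 0  # number of times the underlying function was actually called

    def get_or_compute(self, prompt, compute):
        # Cache hit: return the stored result without calling compute().
        if prompt in self._store:
            return self._store[prompt]
        # Cache miss: call the slow function once and remember the result.
        self.misses += 1
        result = compute(prompt)
        self._store[prompt] = result
        return result


def fake_llm(prompt):
    """Stand-in for a slow, billable API call (assumption: not a real LLM client)."""
    time.sleep(0.05)
    return f"response to: {prompt}"


cache = PromptCache()
first = cache.get_or_compute("Tell me a joke.", fake_llm)   # miss: calls fake_llm
second = cache.get_or_compute("Tell me a joke.", fake_llm)  # hit: served from the dict
```

Only an exact repeat of the prompt hits this toy cache; a prompt that differs by a single character is treated as a new request.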

This notebook covers two types of caching: **1) In-Memory Caching** and **2) SQLite Caching**.

In [2]:
pip install -r ./requirements.txt -q

Note: you may need to restart the kernel to use updated packages.


In [4]:
import os
from dotenv import load_dotenv, find_dotenv
load_dotenv(find_dotenv(), override=True) 

True

## 1) In-Memory Caching

In [7]:
from langchain.globals import set_llm_cache
from langchain_openai import OpenAI
llm = OpenAI(model_name='gpt-3.5-turbo-instruct')

To measure the model's response time, use **`%%time`**. This Jupyter cell magic must be the first line of a cell, and it reports the execution time of the entire cell.

In [10]:
%%time
from langchain.cache import InMemoryCache
set_llm_cache(InMemoryCache())

prompt = 'Tell me a joke that a toddler can understand.'

# The following request is not in the cache, so it will take longer.
llm.invoke(prompt)

CPU times: total: 0 ns
Wall time: 681 ms


'\n\nWhy was the math book sad? Because it had too many problems!'

In [12]:
%%time

# This time the request is already in the cache, so it will be faster.
llm.invoke(prompt)

CPU times: total: 0 ns
Wall time: 0 ns


'\n\nWhy was the math book sad? Because it had too many problems!'

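## 2) SQLite Caching

Unlike the in-memory cache, which is lost when the process exits, the SQLite cache writes its entries to a database file on disk (here `.langchain.db`), so cached responses remain available after the kernel is restarted.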

In [19]:
%%time

from langchain.cache import SQLiteCache
set_llm_cache(SQLiteCache(database_path=".langchain.db"))

prompt2 = 'Tell me a joke about chocolate that a toddler can understand.'

# The following request is not in the cache, so it will take longer.
llm.invoke(prompt2)


CPU times: total: 15.6 ms
Wall time: 773 ms


'\n\nWhy did the chocolate chip cookie go to the doctor?\n\nBecause it was feeling crumbly!'

In [20]:
%%time

# This time the request is already in the cache, so it will be faster.
llm.invoke(prompt2)

CPU times: total: 0 ns
Wall time: 967 µs


'\n\nWhy did the chocolate chip cookie go to the doctor?\n\nBecause it was feeling crumbly!'
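
To illustrate why a file-backed cache survives a restart, here is a minimal sketch using Python's built-in `sqlite3` module. It is not how `SQLiteCache` is implemented: the table name `llm_cache`, its two columns, and the helpers `open_cache`, `lookup`, and `store` are all assumptions made for the example, and the stored response is a placeholder string. LangChain's own cache also takes the model configuration into account when looking up an entry, which this sketch omits.

```python
import os
import sqlite3
import tempfile


def open_cache(path):
    """Open (or create) a SQLite file with a hypothetical prompt -> response table."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS llm_cache (prompt TEXT PRIMARY KEY, response TEXT)"
    )
    return conn


def lookup(conn, prompt):
    """Return the cached response for this exact prompt, or None on a miss."""
    row = conn.execute(
        "SELECT response FROM llm_cache WHERE prompt = ?", (prompt,)
    ).fetchone()
    return row[0] if row else None


def store(conn, prompt, response):
    """Insert or overwrite the cached response for a prompt."""
    conn.execute(
        "INSERT OR REPLACE INTO llm_cache (prompt, response) VALUES (?, ?)",
        (prompt, response),
    )
    conn.commit()


# Use a temporary directory so the demo does not touch any real cache file.
db_path = os.path.join(tempfile.mkdtemp(), "demo_cache.db")

# "First session": write an entry, then close the connection.
conn = open_cache(db_path)
store(conn, "Tell me a joke about chocolate.", "placeholder response")
conn.close()

# "Second session": reopen the same file; the entry is still there.
conn = open_cache(db_path)
restored = lookup(conn, "Tell me a joke about chocolate.")
missing = lookup(conn, "A prompt that was never stored")
conn.close()
```

Closing and reopening the connection stands in for restarting the kernel: the entry is read back from the file rather than from process memory.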