# How to cache chat model responses

:::info Prerequisites

This guide assumes familiarity with the following concepts:
- [Chat models](/docs/concepts/chat_models)
- [LLMs](/docs/concepts/text_llms)

:::

LangChain provides an optional caching layer for [chat models](/docs/concepts/chat_models). This is useful for two main reasons:

- It can save you money by reducing the number of API calls you make to the LLM provider if you often request the same completion multiple times. This is especially useful during app development.
- It can speed up your application, because a cached response is returned without a round trip to the LLM provider.

This guide will walk you through how to enable this in your apps.
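
To make the two benefits above concrete, here is a minimal, standard-library-only sketch of what a caching layer does. It is not LangChain's implementation: `TinyResponseCache`, `CountingProvider`, and the `model_settings` string are hypothetical names used only for illustration. The sketch assumes responses are keyed by both the prompt and the model configuration, so that changing a setting such as the temperature does not return a stale answer.

```python
from typing import Callable, Dict, Tuple


class TinyResponseCache:
    """Hypothetical in-memory cache keyed by (prompt, model settings)."""

    def __init__(self) -> None:
        self._store: Dict[Tuple[str, str], str] = {}

    def get_or_call(
        self, prompt: str, model_settings: str, call_model: Callable[[str], str]
    ) -> str:
        key = (prompt, model_settings)
        if key not in self._store:
            # Cache miss: pay for one provider call and remember the result.
            self._store[key] = call_model(prompt)
        # Cache hit (or freshly stored value): no further provider call.
        return self._store[key]


class CountingProvider:
    """Stand-in for a paid API; counts how many times it is called."""

    def __init__(self) -> None:
        self.calls = 0

    def __call__(self, prompt: str) -> str:
        self.calls += 1
        return f"response #{self.calls} to {prompt!r}"


provider = CountingProvider()
cache = TinyResponseCache()

first = cache.get_or_call("Tell me a joke", "model=demo,temperature=0.7", provider)
second = cache.get_or_call("Tell me a joke", "model=demo,temperature=0.7", provider)
# Same prompt, different settings: treated as a separate cache entry.
third = cache.get_or_call("Tell me a joke", "model=demo,temperature=0", provider)

print(provider.calls)
```

Only the first and third requests reach the stand-in provider; the second is served from the dictionary.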

import ChatModelTabs from "@theme/ChatModelTabs";

<ChatModelTabs customVarName="llm" />


In [9]:
# | output: false
# | echo: false

import os
from getpass import getpass

from langchain_openai import ChatOpenAI

from langchain.globals import set_verbose

set_verbose(True)

if "OPENAI_API_KEY" not in os.environ:
    os.environ["OPENAI_API_KEY"] = getpass()

llm = ChatOpenAI()

In [10]:
# <!-- ruff: noqa: F821 -->
from langchain_core.globals import set_llm_cache

## In Memory Cache

This is an ephemeral cache that stores model calls in memory. It will be wiped when your environment restarts, and is not shared across processes.

In [11]:
%%time
from langchain_core.caches import InMemoryCache

set_llm_cache(InMemoryCache())

# The first time, it is not yet in cache, so it should take longer
llm.invoke("Tell me a joke")

CPU times: user 3.11 ms, sys: 2.44 ms, total: 5.55 ms
Wall time: 992 ms


AIMessage(content="Why couldn't the bicycle stand up by itself?\n\nBecause it was two-tired!", additional_kwargs={'refusal': None}, response_metadata={'token_usage': {'completion_tokens': 17, 'prompt_tokens': 11, 'total_tokens': 28, 'completion_tokens_details': {'reasoning_tokens': 0}}, 'model_name': 'gpt-3.5-turbo-0125', 'system_fingerprint': None, 'finish_reason': 'stop', 'logprobs': None}, id='run-2ab2bd18-7a5c-4662-8c2d-dabba20c8930-0', usage_metadata={'input_tokens': 11, 'output_tokens': 17, 'total_tokens': 28})

In [12]:
%%time
# The second time, the response is already in the cache, so the call returns faster
llm.invoke("Tell me a joke")

CPU times: user 239 μs, sys: 0 ns, total: 239 μs
Wall time: 215 μs


AIMessage(content="Why couldn't the bicycle stand up by itself?\n\nBecause it was two-tired!", additional_kwargs={'refusal': None}, response_metadata={'token_usage': {'completion_tokens': 17, 'prompt_tokens': 11, 'total_tokens': 28, 'completion_tokens_details': {'reasoning_tokens': 0}}, 'model_name': 'gpt-3.5-turbo-0125', 'system_fingerprint': None, 'finish_reason': 'stop', 'logprobs': None}, id='run-2ab2bd18-7a5c-4662-8c2d-dabba20c8930-0', usage_metadata={'input_tokens': 11, 'output_tokens': 17, 'total_tokens': 28})
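
The `%%time` magic used above only works in IPython and Jupyter. As a rough sketch for plain Python scripts, the same comparison can be made with `time.perf_counter`. The helper `timed` and the stand-in function `slow_echo` are hypothetical names, and `slow_echo` merely simulates a slow call rather than contacting a model.

```python
import time
from typing import Any, Callable, Tuple


def timed(fn: Callable[..., Any], *args: Any, **kwargs: Any) -> Tuple[Any, float]:
    """Call fn and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


def slow_echo(text: str) -> str:
    """Stand-in for a network call: sleeps briefly, then echoes the input."""
    time.sleep(0.05)
    return text


result, seconds = timed(slow_echo, "Tell me a joke")
print(f"Wall time: {seconds * 1000:.1f} ms")
```

With the `llm` object from this guide, calling `timed(llm.invoke, "Tell me a joke")` twice should show the second call returning much faster once the cache is set.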

## SQLite Cache

This cache implementation stores responses in a `SQLite` database file, so cached responses persist across process restarts.

In [13]:
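# Remove any existing cache file so the first call below is a cache miss (-f ignores a missing file)
!rm -f .langchain.db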

In [14]:
# We can do the same thing with a SQLite cache
from langchain_community.cache import SQLiteCache

set_llm_cache(SQLiteCache(database_path=".langchain.db"))

In [15]:
%%time
# The first time, it is not yet in cache, so it should take longer
llm.invoke("Tell me a joke")

CPU times: user 8.38 ms, sys: 0 ns, total: 8.38 ms
Wall time: 701 ms


AIMessage(content='Why did the scarecrow win an award?\nBecause he was outstanding in his field!', additional_kwargs={'refusal': None}, response_metadata={'token_usage': {'completion_tokens': 17, 'prompt_tokens': 11, 'total_tokens': 28, 'completion_tokens_details': {'reasoning_tokens': 0}}, 'model_name': 'gpt-3.5-turbo-0125', 'system_fingerprint': None, 'finish_reason': 'stop', 'logprobs': None}, id='run-b5eb1f93-b220-4014-bebe-2c0dbb5fb65d-0', usage_metadata={'input_tokens': 11, 'output_tokens': 17, 'total_tokens': 28})

In [16]:
%%time
# The second time, the response is already in the cache, so the call returns faster
llm.invoke("Tell me a joke")

CPU times: user 1.15 ms, sys: 883 μs, total: 2.03 ms
Wall time: 1.56 ms


AIMessage(content='Why did the scarecrow win an award?\nBecause he was outstanding in his field!', additional_kwargs={'refusal': None}, response_metadata={'token_usage': {'completion_tokens': 17, 'prompt_tokens': 11, 'total_tokens': 28, 'completion_tokens_details': {'reasoning_tokens': 0}}, 'model_name': 'gpt-3.5-turbo-0125', 'system_fingerprint': None, 'finish_reason': 'stop', 'logprobs': None}, id='run-b5eb1f93-b220-4014-bebe-2c0dbb5fb65d-0', usage_metadata={'input_tokens': 11, 'output_tokens': 17, 'total_tokens': 28})
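
To illustrate why a file-backed cache survives restarts, the following sketch uses only Python's built-in `sqlite3` module. It is a simplified illustration, not the schema or code that `SQLiteCache` uses: the class `TinySQLiteCache`, the `responses` table, and the `model_settings` string are all hypothetical. Closing one connection and opening a second one to the same file stands in for a process restart.

```python
import os
import sqlite3
import tempfile
from typing import Optional


class TinySQLiteCache:
    """Hypothetical file-backed cache keyed by (prompt, model settings)."""

    def __init__(self, database_path: str) -> None:
        self._conn = sqlite3.connect(database_path)
        self._conn.execute(
            "CREATE TABLE IF NOT EXISTS responses ("
            "prompt TEXT, model_settings TEXT, response TEXT, "
            "PRIMARY KEY (prompt, model_settings))"
        )
        self._conn.commit()

    def lookup(self, prompt: str, model_settings: str) -> Optional[str]:
        row = self._conn.execute(
            "SELECT response FROM responses WHERE prompt = ? AND model_settings = ?",
            (prompt, model_settings),
        ).fetchone()
        return row[0] if row else None

    def update(self, prompt: str, model_settings: str, response: str) -> None:
        self._conn.execute(
            "INSERT OR REPLACE INTO responses VALUES (?, ?, ?)",
            (prompt, model_settings, response),
        )
        self._conn.commit()

    def close(self) -> None:
        self._conn.close()


# Use a throwaway directory so the demo does not touch .langchain.db.
db_path = os.path.join(tempfile.mkdtemp(), "demo_cache.db")

writer = TinySQLiteCache(db_path)
writer.update("Tell me a joke", "model=demo", "A stored response")
writer.close()

# A fresh connection to the same file still sees the stored entry.
reader = TinySQLiteCache(db_path)
hit = reader.lookup("Tell me a joke", "model=demo")
miss = reader.lookup("Tell me another joke", "model=demo")
reader.close()

print(hit, miss)
```

Because the data lives on disk, stale entries also persist until the file is deleted, which is why this guide removes `.langchain.db` before the first timed call.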

## Next steps

You've now learned how to cache model responses to save time and money.

Next, check out the other how-to guides on chat models in this section, like [how to get a model to return structured output](/docs/how_to/structured_output) or [how to create your own custom chat model](/docs/how_to/custom_chat_model).