### What is Cache-Augmented Generation (CAG)?
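Cache-Augmented Generation (CAG) is a retrieval-free approach: instead of querying an external knowledge source at inference time, it preloads the relevant documents into the LLM's extended context window, precomputes the model's key-value (KV) cache for them, and reuses that cache during inference, so the model can generate responses without any additional retrieval step. The cells below illustrate the caching idea in its simplest form: an in-memory dictionary that stores full model responses keyed by the exact query string, rather than the KV cache itself.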
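The preloading half of this idea can be sketched without any model-specific API. The snippet below is a minimal illustration, not a reference implementation: `build_preloaded_context`, `answer_with_preloaded_context`, and the `model` callable are hypothetical names introduced here. It only shows that the document prefix is built once and reused for every question; actual KV-cache reuse depends on the serving stack and is not shown.

```python
from typing import Callable, Sequence


def build_preloaded_context(documents: Sequence[str]) -> str:
    """Join the knowledge documents into one fixed prompt prefix (built once)."""
    numbered = [
        f"[Document {i}]\n{doc.strip()}"
        for i, doc in enumerate(documents, start=1)
    ]
    return "Answer using only the documents below.\n\n" + "\n\n".join(numbered)


def answer_with_preloaded_context(
    question: str, context: str, model: Callable[[str], str]
) -> str:
    """Append the question to the fixed prefix; no retrieval runs per query."""
    prompt = f"{context}\n\nQuestion: {question}\nAnswer:"
    return model(prompt)
```

Because the prefix is identical across questions, a serving stack that supports prefix or KV caching could in principle skip recomputing it; whether that happens is an assumption about the backend, not something this sketch guarantees.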

In [1]:
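import os
from dotenv import load_dotenv

load_dotenv()  # loads variables from a local .env file into os.environ
# Fail early with a clear message instead of assigning None to os.environ
if not os.getenv("OPENAI_API_KEY"):
    raise RuntimeError("OPENAI_API_KEY is not set; add it to your .env file")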

from langchain.chat_models import init_chat_model

llm=init_chat_model("openai:gpt-4o-mini")

llm

ChatOpenAI(client=<openai.resources.chat.completions.completions.Completions object at 0x109ec0770>, async_client=<openai.resources.chat.completions.completions.AsyncCompletions object at 0x128821a30>, root_client=<openai.OpenAI object at 0x10629ecc0>, root_async_client=<openai.AsyncOpenAI object at 0x109f225d0>, model_name='gpt-4o-mini', model_kwargs={}, openai_api_key=SecretStr('**********'))

In [2]:
### Cache variable
Model_Cache={}

In [None]:
import time

def cache_model(query):
    """Return the cached response for `query`, calling the model only on a miss."""
    start_time = time.time()
    if query in Model_Cache:  # exact string match only; any change in wording is a miss
        print("**CAche Hit**")
        response = Model_Cache[query]
    else:
        print("***CACHE MISS – EXECUTING MODEL***")
        response = llm.invoke(query)
        Model_Cache[query] = response  # store for later identical queries
    elapsed_time = time.time() - start_time
    print(f"EXECUTION TIME: {elapsed_time:.2f} seconds")
    return response
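The dictionary above grows without limit as new queries arrive. One possible way to bound it, sketched here under the assumption that least-recently-used eviction is acceptable, is a small wrapper around any callable. `BoundedResponseCache` and its `max_entries` parameter are hypothetical names introduced for illustration and are not part of LangChain.

```python
from collections import OrderedDict
from typing import Any, Callable


class BoundedResponseCache:
    """Exact-match response cache that evicts the least recently used entry."""

    def __init__(self, model_fn: Callable[[str], Any], max_entries: int = 128) -> None:
        self._model_fn = model_fn
        self._max_entries = max_entries
        self._entries: "OrderedDict[str, Any]" = OrderedDict()
        self.hits = 0
        self.misses = 0

    def __call__(self, query: str) -> Any:
        if query in self._entries:
            self.hits += 1
            self._entries.move_to_end(query)  # mark as most recently used
            return self._entries[query]
        self.misses += 1
        response = self._model_fn(query)
        self._entries[query] = response
        if len(self._entries) > self._max_entries:
            self._entries.popitem(last=False)  # drop the least recently used entry
        return response
```

In this notebook it could be used as `cached_llm = BoundedResponseCache(llm.invoke, max_entries=256)` followed by `cached_llm("hi")`; the `hits` and `misses` counters then give a quick view of how effective the cache is.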


In [4]:
response=cache_model("hi")
response

***CACHE MISS – EXECUTING MODEL***
EXECUTION TIME: 1.05 seconds


AIMessage(content='Hello! How can I assist you today?', additional_kwargs={'refusal': None}, response_metadata={'token_usage': {'completion_tokens': 9, 'prompt_tokens': 8, 'total_tokens': 17, 'completion_tokens_details': {'accepted_prediction_tokens': 0, 'audio_tokens': 0, 'reasoning_tokens': 0, 'rejected_prediction_tokens': 0}, 'prompt_tokens_details': {'audio_tokens': 0, 'cached_tokens': 0}}, 'model_name': 'gpt-4o-mini-2024-07-18', 'system_fingerprint': 'fp_560af6e559', 'id': 'chatcmpl-C83P3t1w3BANA8T1wKMawHjcSScKx', 'service_tier': 'default', 'finish_reason': 'stop', 'logprobs': None}, id='run--adb58d05-b25a-469b-94d6-92ea8706d869-0', usage_metadata={'input_tokens': 8, 'output_tokens': 9, 'total_tokens': 17, 'input_token_details': {'audio': 0, 'cache_read': 0}, 'output_token_details': {'audio': 0, 'reasoning': 0}})

In [5]:
Model_Cache

{'hi': AIMessage(content='Hello! How can I assist you today?', additional_kwargs={'refusal': None}, response_metadata={'token_usage': {'completion_tokens': 9, 'prompt_tokens': 8, 'total_tokens': 17, 'completion_tokens_details': {'accepted_prediction_tokens': 0, 'audio_tokens': 0, 'reasoning_tokens': 0, 'rejected_prediction_tokens': 0}, 'prompt_tokens_details': {'audio_tokens': 0, 'cached_tokens': 0}}, 'model_name': 'gpt-4o-mini-2024-07-18', 'system_fingerprint': 'fp_560af6e559', 'id': 'chatcmpl-C83P3t1w3BANA8T1wKMawHjcSScKx', 'service_tier': 'default', 'finish_reason': 'stop', 'logprobs': None}, id='run--adb58d05-b25a-469b-94d6-92ea8706d869-0', usage_metadata={'input_tokens': 8, 'output_tokens': 9, 'total_tokens': 17, 'input_token_details': {'audio': 0, 'cache_read': 0}, 'output_token_details': {'audio': 0, 'reasoning': 0}})}

In [6]:
response=cache_model("hi")
response

**CAche Hit**
EXECUTION TIME: 0.00 seconds


AIMessage(content='Hello! How can I assist you today?', additional_kwargs={'refusal': None}, response_metadata={'token_usage': {'completion_tokens': 9, 'prompt_tokens': 8, 'total_tokens': 17, 'completion_tokens_details': {'accepted_prediction_tokens': 0, 'audio_tokens': 0, 'reasoning_tokens': 0, 'rejected_prediction_tokens': 0}, 'prompt_tokens_details': {'audio_tokens': 0, 'cached_tokens': 0}}, 'model_name': 'gpt-4o-mini-2024-07-18', 'system_fingerprint': 'fp_560af6e559', 'id': 'chatcmpl-C83P3t1w3BANA8T1wKMawHjcSScKx', 'service_tier': 'default', 'finish_reason': 'stop', 'logprobs': None}, id='run--adb58d05-b25a-469b-94d6-92ea8706d869-0', usage_metadata={'input_tokens': 8, 'output_tokens': 9, 'total_tokens': 17, 'input_token_details': {'audio': 0, 'cache_read': 0}, 'output_token_details': {'audio': 0, 'reasoning': 0}})

In [7]:
query="can you give me 500 words on langgraph?"
response =cache_model(query)
print(response)

***CACHE MISS – EXECUTING MODEL***
EXECUTION TIME: 9.34 seconds
content='LangGraph is an innovative approach to harnessing the power of language models in order to facilitate better understanding, generation, and interaction with natural language. As language models have evolved, they have demonstrated remarkable capabilities in numerous applications, ranging from text generation to translation, sentiment analysis, and beyond. However, the integration of these models into user-friendly applications often presents challenges, which is where LangGraph comes into play.\n\nIn essence, LangGraph aims to create a conceptual framework and a user-friendly interface that allows developers and users to interact with and manipulate language models in a more intuitive manner. The key idea behind LangGraph is to treat language data as nodes and relationships within a graph, where each node represents specific concepts, phrases, or entities, and the edges between them signify their relationships or 

In [8]:
query="can you give me 500 words on langgraph?"
response =cache_model(query)
print(response)

**CAche Hit**
EXECUTION TIME: 0.00 seconds
content='LangGraph is an innovative approach to harnessing the power of language models in order to facilitate better understanding, generation, and interaction with natural language. As language models have evolved, they have demonstrated remarkable capabilities in numerous applications, ranging from text generation to translation, sentiment analysis, and beyond. However, the integration of these models into user-friendly applications often presents challenges, which is where LangGraph comes into play.\n\nIn essence, LangGraph aims to create a conceptual framework and a user-friendly interface that allows developers and users to interact with and manipulate language models in a more intuitive manner. The key idea behind LangGraph is to treat language data as nodes and relationships within a graph, where each node represents specific concepts, phrases, or entities, and the edges between them signify their relationships or contextual associatio

In [9]:
query="give me 500 words on langgraph?"  # wording differs slightly, so the exact-match lookup misses
response =cache_model(query)
print(response)

***CACHE MISS – EXECUTING MODEL***
EXECUTION TIME: 9.10 seconds
content="LangGraph is an innovative framework designed to bridge the gap between natural language processing (NLP) and graph-based data representations. By leveraging the structure and semantic richness of graphs, LangGraph enhances the ability to analyze, interpret, and generate language in a manner that aligns more closely with human cognitive processes. This approach is particularly valuable in understanding and managing complex relationships within text data, thereby enabling a more nuanced interpretation of language.\n\nAt its core, LangGraph employs graphs to represent linguistic elements where nodes symbolize words or phrases, and edges represent the relationships between them. Such a representation can be particularly useful in various NLP tasks, such as word sense disambiguation, semantic similarity assessment, and knowledge extraction. By visualizing linguistic data as a network, LangGraph allows for the explorat
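The last cell shows the main weakness of exact-match keys: dropping "can you" from the query forces a second 9-second model call for what is essentially the same request. One possible mitigation, sketched below with the standard-library `difflib`, is to look for a sufficiently similar cached key before calling the model. The helper names and the `0.85` threshold are assumptions chosen for illustration, and character-level similarity is only a rough proxy for meaning; an embedding-based comparison would be a more robust (and more expensive) alternative.

```python
from difflib import SequenceMatcher
from typing import Dict, Optional


def normalize(query: str) -> str:
    """Lowercase and collapse whitespace so trivial variations compare equal."""
    return " ".join(query.lower().split())


def find_similar_key(
    query: str, cache: Dict[str, object], threshold: float = 0.85
) -> Optional[str]:
    """Return the cached key most similar to `query`, or None if below threshold."""
    target = normalize(query)
    best_key, best_score = None, 0.0
    for key in cache:
        score = SequenceMatcher(None, normalize(key), target).ratio()
        if score > best_score:
            best_key, best_score = key, score
    return best_key if best_score >= threshold else None
```

With the two queries from the cells above, the similarity ratio is about 0.89, so `find_similar_key("give me 500 words on langgraph?", Model_Cache)` would return the earlier "can you give me ..." key and the cached response could be reused. A threshold that is too low risks returning an answer to a different question, so it should be tuned against real traffic.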