<a href="https://colab.research.google.com/github/DhrubaAdhikary/GEN_AI_DEMO/blob/master/Memory_in_RAGs.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# RAG with Memory: Comprehensive Architecture Design

## 🎯 Overview
This notebook demonstrates a **Retrieval-Augmented Generation (RAG) system with Memory Management** using LangChain and OpenAI. The system combines the power of conversational memory with vector-based retrieval to create context-aware AI applications.

## 🏗️ Architecture Components

### 1. **Conversational Memory Layer**
- **In-Memory Chat History**: Stores conversation context for continuity
- **Session Management**: Maintains separate conversation threads per user
- **Message Persistence**: Retains both user inputs and AI responses

### 2. **Memory Management Strategies**

#### A. Full Memory (Unlimited Context)
```
User Message 1 → AI Response 1
User Message 2 → AI Response 2
    ...
User Message N → AI Response N
[All messages retained]
```
**Pros**: Complete context awareness  
**Cons**: Token limits, increased costs

#### B. Windowed Memory (Last K Messages)
```
[Older messages discarded]
User Message (N-K+1) → AI Response (N-K+1)
    ...
User Message N → AI Response N
[Only last K messages retained]
```
**Pros**: Controlled token usage, scalable  
**Cons**: Loss of older context

#### C. Summary-Based Memory
```
[Message 1...N] → Compressed Summary
Recent Messages (Last K)
[Combined for context]
```
**Pros**: Retains historical essence, efficient  
**Cons**: Information loss, summary generation cost

### 3. **Retrieval-Augmented Generation (RAG)**
- **Vector Embeddings**: Converts text into numerical representations
- **FAISS Vector Store**: Fast similarity search for relevant information
- **Semantic Retrieval**: Finds contextually relevant documents/facts
- **Context Injection**: Augments prompts with retrieved information
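
The retrieval steps listed above can be sketched without any external service. This is a toy illustration only: `embed` uses word counts instead of a learned embedding model, the documents are made up, and `retrieve` is a hypothetical helper rather than a FAISS or LangChain call.

```python
import math
from collections import Counter


def embed(text):
    """Toy embedding: lowercase word counts (a real system would call an embedding model)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query, documents, k=1):
    """Return the k documents most similar to the query."""
    q = embed(query)
    ranked = sorted(documents, key=lambda d: cosine(q, embed(d)), reverse=True)
    return ranked[:k]


# Made-up documents, for illustration only.
documents = [
    "The office is closed on public holidays",
    "Refunds are processed within five business days",
    "Support is available by email and chat",
]
top = retrieve("How long do refunds take", documents, k=1)
print(top)
```

The retrieved text is what would then be injected into the prompt as context for the model.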

## 🔄 System Workflow

```
┌─────────────────┐
│  User Query     │
└────────┬────────┘
         │
         ├──────────────────────────────────┐
         │                                  │
         ▼                                  ▼
┌─────────────────────┐          ┌──────────────────┐
│  Memory Retrieval   │          │  Vector Search   │
│  (Chat History)     │          │  (RAG System)    │
└──────────┬──────────┘          └────────┬─────────┘
           │                               │
           │        ┌──────────────┐       │
           └───────►│  LLM (GPT)   │◄──────┘
                    └──────┬───────┘
                           │
                           ▼
                   ┌───────────────┐
                   │  AI Response  │
                   └───────────────┘
```

## 🎓 Learning Objectives

1. **Environment Setup**: Configure API keys and import dependencies
2. **Basic Memory**: Implement simple conversational memory
3. **Windowed Memory**: Manage context with sliding windows
4. **Memory Summarization**: Compress conversation history
5. **Vector Storage**: Create and query FAISS vector stores
6. **RAG Integration**: Combine retrieval with generation

## 🛠️ Technologies Used

- **LangChain**: Framework for LLM applications
- **OpenAI GPT-3.5**: Language model for generation
- **FAISS**: Facebook AI Similarity Search for vector retrieval
- **OpenAI Embeddings**: Text-to-vector conversion
- **Python**: Core implementation language

---

Let's explore each component step by step! 👇

![image.png](attachment:image.png)

## Where the Memory Lives

As shown in the architecture diagram:

- **Short-Term Memory**: Acts as the "immediate context," feeding the most recent chat history into the LLM.
- **Semantic Memory**: The large library of facts stored in your vector DB (such as Qdrant or Pinecone).
- **Episodic Memory**: Stores past user preferences and historical summaries, allowing for continuity over months or years.

All three layers funnel into the LLM Brain / Orchestrator to generate a grounded, context-aware response.

## 1. Environment Setup and API Configuration

This section initializes the OpenAI API key from Google Colab's secure storage (userdata). The API key is essential for:
- Authenticating with OpenAI services
- Accessing GPT models for text generation
- Using OpenAI's embedding models for vector representations

**Key Concepts:**
- **API Key Security**: Stored securely in Colab's userdata, never hardcoded
- **Environment Variables**: Standard practice for managing credentials
- **Assertion Check**: Validates the key is successfully loaded before proceeding
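
Outside Colab, `google.colab` cannot be imported, so the same notebook needs another source for the key. The sketch below is one possible fallback, not part of Colab or LangChain: `load_secret` and the `DEMO_SECRET` variable are hypothetical names used only for illustration.

```python
import os


def load_secret(name: str) -> str:
    """Return a secret from Colab's userdata when available, else from the environment."""
    value = None
    try:
        # Only importable inside Google Colab.
        from google.colab import userdata
        value = userdata.get(name)
    except Exception:
        # Not running in Colab, or the secret is not configured there.
        value = None
    value = value or os.environ.get(name)
    if not value:
        raise RuntimeError(f"Secret {name!r} is not set")
    return value


# Usage with a throwaway variable (hypothetical name and value).
os.environ["DEMO_SECRET"] = "not-a-real-key"
print(load_secret("DEMO_SECRET"))
```

Raising an explicit error keeps the failure close to its cause instead of surfacing later as an authentication error from the API client.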

In [None]:
from google.colab import userdata
import os

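# Read the key from Colab's secret storage and fail early with a clear message.
api_key = userdata.get("OPENAI_API_KEY")
assert api_key, "OPENAI_API_KEY is empty; add it under Colab's Secrets panel"
os.environ["OPENAI_API_KEY"] = api_key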


## 2. Core Library Imports

Importing essential LangChain components for building our RAG system with memory:

**Components Breakdown:**
- **`ChatOpenAI`**: Interface to OpenAI's chat models (GPT-3.5/4)
- **`OpenAIEmbeddings`**: Converts text to vector embeddings for similarity search
- **`ChatPromptTemplate`**: Structures prompts with system messages, history, and user input
- **`RunnableWithMessageHistory`**: Wrapper that adds conversation memory to any chain
- **`FAISS`**: LangChain's vector store wrapper around FAISS (Facebook AI Similarity Search), a library for fast in-memory vector similarity search

These components form the foundation of our memory-enabled RAG pipeline.

In [None]:
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnableWithMessageHistory
# from langchain_core.chat_history import ChatMessageHistory
from langchain_community.vectorstores import FAISS


## 3. In-Memory Chat History Implementation

Importing the **`InMemoryChatMessageHistory`** class, which provides a simple storage mechanism for conversation messages.

**Why In-Memory Storage?**
- ✅ Fast access - no disk I/O or database queries
- ✅ Simple implementation for prototyping and demos
- ✅ Automatic message ordering and retrieval
- ⚠️ Data lost when process terminates (not persistent)
- ⚠️ Limited to single-process applications

**Use Cases:**
- Development and testing
- Short-lived conversational sessions
- Proof-of-concept applications

For production systems, consider persistent storage like Redis, PostgreSQL, or MongoDB.

In [None]:
from langchain_core.chat_history import InMemoryChatMessageHistory


## 4. Initialize the Language Model

Creating an instance of **GPT-3.5-Turbo** with specific configuration:

**Parameters:**
- **`model="gpt-3.5-turbo"`**: Uses OpenAI's GPT-3.5 model
  - Fast response times
  - Cost-effective for most use cases
  - Good balance between performance and price
  
- **`temperature=0`**: Controls randomness in responses
  - `0` = Near-deterministic, consistent outputs (small run-to-run variation is still possible)
  - Ideal for factual queries and consistent behavior
  - Higher values (0.7-1.0) encourage creative, varied responses

This LLM instance will be the core reasoning engine for our conversational AI system.
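
To make the effect of `temperature` concrete, here is a small sketch of the standard temperature-scaled softmax. It illustrates the general technique only; how the OpenAI API applies temperature internally is not shown in this notebook, and the logits below are made-up numbers.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities; temperature=0 is treated as greedy argmax."""
    if temperature == 0:
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.1]  # made-up scores for three candidate tokens
for t in (0, 0.7, 1.0):
    print(t, [round(p, 3) for p in softmax_with_temperature(logits, t)])
```

Lower temperatures concentrate probability on the highest-scoring token, which is why `temperature=0` gives the most repeatable answers.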

In [None]:
llm = ChatOpenAI(
    model="gpt-3.5-turbo",
    temperature=0
)


## 5. Session Management with Memory Store

Implementing a **session-based memory management system** for handling multiple concurrent conversations:

**Architecture:**
```python
store = {
    "session_1": InMemoryChatMessageHistory([msg1, msg2, ...]),
    "session_2": InMemoryChatMessageHistory([msg1, msg2, ...]),
    ...
}
```

**Key Functions:**
- **`get_session_history(session_id)`**: Retrieves or creates a conversation history
  - If session exists → returns existing history
  - If new session → creates new `InMemoryChatMessageHistory` instance
  
**Benefits:**
- 👥 **Multi-User Support**: Each user gets isolated conversation context
- 🔒 **Session Isolation**: Prevents context bleeding between conversations
- 🎯 **Lazy Initialization**: Sessions created only when needed
- 📊 **Scalable**: Can handle multiple concurrent conversations

**Example Use Cases:**
- Chat applications with multiple users
- A/B testing different conversation flows
- Parallel conversation experiments
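
The get-or-create pattern and the isolation it provides can be checked with a library-free sketch. `SimpleHistory`, `sessions`, and `get_history` are hypothetical stand-ins for illustration, not LangChain classes.

```python
from dataclasses import dataclass, field


@dataclass
class SimpleHistory:
    """Hypothetical stand-in for a chat history: an ordered list of (role, text) pairs."""
    messages: list = field(default_factory=list)

    def add(self, role: str, text: str) -> None:
        self.messages.append((role, text))


sessions: dict[str, SimpleHistory] = {}


def get_history(session_id: str) -> SimpleHistory:
    # setdefault stores a new history on first access and returns the existing one afterwards.
    return sessions.setdefault(session_id, SimpleHistory())


get_history("alice").add("human", "My name is Alice")
get_history("bob").add("human", "My name is Bob")
get_history("alice").add("human", "I like graphs")

# Each session only ever sees its own messages.
print(len(get_history("alice").messages), len(get_history("bob").messages))
```

Because every lookup goes through the session id, one user's messages never appear in another user's prompt.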

In [None]:
store = {}

def get_session_history(session_id: str):
    if session_id not in store:
        store[session_id] = InMemoryChatMessageHistory()
    return store[session_id]


## 6. Building the Conversational Chain with Full Memory

Constructing a complete **memory-enabled conversational AI pipeline**:

**Components:**

1. **Prompt Template Structure:**
   ```
   System Message: "You are a helpful assistant."
   ↓
   Conversation History: {history} (all previous messages)
   ↓
   Current User Input: {input}
   ```

2. **Chain Construction (`prompt | llm`):**
   - Pipes the formatted prompt into the language model
   - LangChain Expression Language (LCEL) for composability

3. **RunnableWithMessageHistory Wrapper:**
   - **`chain`**: The base prompt + LLM pipeline
   - **`get_session_history`**: Function to retrieve/store messages
   - **`input_messages_key="input"`**: Maps user input to prompt variable
   - **`history_messages_key="history"`**: Maps stored messages to prompt variable

**Memory Behavior:**
- ✅ **Full Context**: All messages from session start retained
- ✅ **Automatic Persistence**: Each exchange is appended to the in-memory `store` after the call (and lost when the process ends)
- ✅ **Context Awareness**: Model receives all previous exchanges on every call

**Flow Diagram:**
```
User Input → Retrieve History → Format Prompt → LLM → Save Response → Return
```
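
The flow above can be written out as a plain-Python loop. This is a conceptual sketch of what a history wrapper does, not LangChain's implementation: `fake_llm`, `histories`, and `invoke_with_history` are hypothetical names, and the model is replaced by a stub so the example runs offline.

```python
def fake_llm(prompt_messages):
    """Stand-in for a chat model: reports how many messages it was shown."""
    return f"(saw {len(prompt_messages)} messages)"


histories = {}  # session_id -> list of (role, text)


def invoke_with_history(session_id, user_input, llm=fake_llm):
    history = histories.setdefault(session_id, [])         # 1. retrieve history
    prompt = [("system", "You are a helpful assistant.")]  # 2. format the prompt
    prompt += history
    prompt.append(("human", user_input))
    reply = llm(prompt)                                     # 3. call the model
    history.append(("human", user_input))                   # 4. save both sides of the turn
    history.append(("ai", reply))
    return reply                                            # 5. return the response


print(invoke_with_history("demo", "My name is Dhruba"))
print(invoke_with_history("demo", "I work on graph neural networks"))
```

The second call sees two more messages than the first, because the first exchange was saved and replayed as history.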

In [None]:
prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a helpful assistant."),
    ("placeholder", "{history}"),
    ("human", "{input}")
])

llm = ChatOpenAI(model="gpt-3.5-turbo", temperature=0)

chain = prompt | llm

chain_with_memory = RunnableWithMessageHistory(
    chain,
    get_session_history,
    input_messages_key="input",
    history_messages_key="history",
)


## 7. Testing Full Memory - Conversational Persistence

Demonstrating **full memory retention** across multiple conversation turns:

**Test Scenario:**
1. **First Message**: "My name is Dhruba"
2. **Second Message**: "I work on graph neural networks"

**Expected Behavior:**
- Both messages stored in `store["demo"]`
- LLM has access to all previous context
- Can answer questions like "What's my name?" or "What do I work on?"

**What We're Observing:**
- **`store["demo"].messages`**: Displays the complete conversation history
  - User messages (HumanMessage)
  - AI responses (AIMessage)
  - Preserves chronological order

**Memory Pattern:**
```
Turn 1: User → "My name is Dhruba" → AI response → [saved]
Turn 2: User → "I work on GNNs" → AI response (knows your name) → [saved]
...
[All messages accumulated]
```

This is useful when you need **complete context** but can become expensive with long conversations due to token limits.

In [None]:
chain_with_memory.invoke(
    {"input": "My name is Dhruba"},
    config={"configurable": {"session_id": "demo"}}
)

chain_with_memory.invoke(
    {"input": "I work on graph neural networks"},
    config={"configurable": {"session_id": "demo"}}
)

store["demo"].messages


[HumanMessage(content='My name is Dhruba', additional_kwargs={}, response_metadata={}),
 AIMessage(content='Nice to meet you, Dhruba! How can I assist you today?', additional_kwargs={'refusal': None}, response_metadata={'token_usage': {'completion_tokens': 16, 'prompt_tokens': 23, 'total_tokens': 39, 'completion_tokens_details': {'accepted_prediction_tokens': 0, 'audio_tokens': 0, 'reasoning_tokens': 0, 'rejected_prediction_tokens': 0}, 'prompt_tokens_details': {'audio_tokens': 0, 'cached_tokens': 0}}, 'model_provider': 'openai', 'model_name': 'gpt-4o-mini-2024-07-18', 'system_fingerprint': 'fp_1590f93f9d', 'id': 'chatcmpl-D4PUNy7E9xWv7NAGcxP3BFApjLo9q', 'service_tier': 'default', 'finish_reason': 'stop', 'logprobs': None}, id='lc_run--019c18dd-0096-7c71-8340-ec5731f51470-0', tool_calls=[], invalid_tool_calls=[], usage_metadata={'input_tokens': 23, 'output_tokens': 16, 'total_tokens': 39, 'input_token_details': {'audio': 0, 'cache_read': 0}, 'output_token_details': {'audio': 0, 'reason

## 8. Windowed Memory Implementation - Token Optimization

Introducing **sliding window memory** to manage long conversations efficiently:

**The Problem with Full Memory:**
- Prompt size grows linearly with conversation length, so the cumulative token cost of a session grows roughly quadratically
- Risk of hitting model context limits (4K, 16K, 128K tokens)
- Unnecessary older context may dilute recent, relevant information
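
A quick count shows how the cost of full memory accumulates. The sketch counts messages rather than tokens and ignores the system message; `history_plus_input` is a hypothetical helper, and the 50-turn session is an arbitrary example.

```python
from typing import Optional


def history_plus_input(turn: int, window: Optional[int] = None) -> int:
    """Messages sent on a given 1-based turn: prior history plus the new user message."""
    prior = 2 * (turn - 1)            # each earlier turn added one user and one AI message
    if window is not None:
        prior = min(prior, window)    # windowed memory caps the history that is sent
    return prior + 1


TURNS = 50
full_total = sum(history_plus_input(t) for t in range(1, TURNS + 1))
windowed_total = sum(history_plus_input(t, window=4) for t in range(1, TURNS + 1))
print(full_total, windowed_total)  # 2500 244
```

With full memory the total is the sum of the first 50 odd numbers (50² = 2500 messages), while a 4-message window sends at most 5 messages per turn.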

**The Solution: `get_windowed_history(session_id, k=4)`**

**How It Works:**
```
Original: [Msg1, Msg2, Msg3, Msg4, Msg5, Msg6, Msg7, Msg8]
                                    ↓
Window (k=4): [Msg5, Msg6, Msg7, Msg8]  # Last 4 messages only
```

**Key Features:**
- **`hist.messages = hist.messages[-k:]`**: Python slice keeps the last K messages and overwrites the stored history in place
- **Default k=4**: Retains 2 conversation turns (2 user + 2 AI messages); the wrapper passes only the session id, so the default value is what applies
- **Dynamic Pruning**: Older messages are discarded permanently each time the history is loaded

**Trade-offs:**
- ✅ **Bounded context size**: The number of history messages per request is capped, so token cost stays predictable
- ✅ **Scalable**: Works for indefinitely long conversations
- ✅ **Recent context focus**: Emphasizes latest exchanges
- ⚠️ **Context loss**: Older information forgotten

**When to Use:**
- Long-running chat sessions
- Cost-sensitive applications
- When recent context is most important
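
If the full log should survive (for audits or later summarization), the window can be computed as a view instead of overwriting the stored messages. The sketch below is an alternative design under that assumption, using plain `(role, text)` tuples; `window_view` is a hypothetical helper, not a LangChain function.

```python
def window_view(messages, k=4):
    """Return a copy of the last k messages, starting on a human message when possible."""
    # Guard k <= 0 explicitly: messages[-0:] would return the whole list.
    recent = list(messages[-k:]) if k > 0 else []
    # Avoid opening the window on an orphaned AI reply whose question was cut off.
    if recent and recent[0][0] == "ai":
        recent = recent[1:]
    return recent


log = [
    ("human", "q1"), ("ai", "a1"),
    ("human", "q2"), ("ai", "a2"),
    ("human", "q3"), ("ai", "a3"),
]
print(window_view(log, k=4))  # the last two full turns
print(window_view(log, k=3))  # the leading AI reply is dropped
print(len(log))               # the original log is untouched
```

Choosing an even `k` keeps whole turns together; the guard only matters when `k` is odd or the log does not alternate strictly.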

In [None]:
# Reuse `store` and `get_session_history` from section 5. Re-creating `store` here
# would erase the "demo" session that section 12 summarizes.

def get_windowed_history(session_id: str, k: int = 4):
    """Return the session history, trimmed in place to its last k messages."""
    hist = get_session_history(session_id)
    # This overwrites the stored history: messages older than the window are lost.
    hist.messages = hist.messages[-k:]
    return hist


## 9. Creating Windowed Memory Chain

Building a **new conversational chain** with the windowed memory function:

**Critical Difference:**
```python
# Full Memory
chain_with_memory = RunnableWithMessageHistory(
    chain,
    get_session_history,  # ← Returns ALL messages
    ...
)

# Windowed Memory
chain_with_window_memory = RunnableWithMessageHistory(
    chain,
    get_windowed_history,  # ← Returns LAST K messages only
    ...
)
```

**Configuration:**
- Same prompt template (system + history + input)
- Same LLM (GPT-3.5-turbo)
- **Different memory retrieval function**: `get_windowed_history` instead of `get_session_history`

**Impact:**
- Model sees at most the last 4 stored messages (k=4) plus the new user input
- Recent conversational context preserved
- Older messages automatically pruned
- Bounded token usage across long conversations

This approach is particularly useful for:
- Customer support chatbots (focus on current issue)
- Task-oriented conversations
- Applications with strict latency requirements

In [None]:
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnableWithMessageHistory

llm = ChatOpenAI(model="gpt-3.5-turbo", temperature=0)

prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a helpful assistant."),
    ("placeholder", "{history}"),
    ("human", "{input}")
])

chain = prompt | llm

chain_with_window_memory = RunnableWithMessageHistory(
    chain,
    get_windowed_history,      # 🔴 IMPORTANT: windowed history function instead of get_session_history
    input_messages_key="input",
    history_messages_key="history",
)


## 10. Testing Windowed Memory - Multiple Conversation Turns

Running **4 consecutive messages** to demonstrate windowed memory in action:

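**Conversation Sequence:**
1. "I live in Bangalore" → 🏙️ Location information
2. "I work at AstraZeneca" → 💼 Employment details
3. "I build agentic AI systems" → 🤖 Work focus
4. "I focus on A2A architectures" → 🏗️ Technical specialization

**Expected Memory Behavior:**

Each turn stores two messages (one user, one AI). `get_windowed_history` trims the stored history to the last k=4 messages when a turn starts, before the new pair is appended:
- After turn 1: [Turn1] (2 messages)
- After turn 2: [Turn1, Turn2] (4 messages)
- After turn 3: [Turn1, Turn2, Turn3] (6 messages; the trim runs at the start of the next turn)
- After turn 4: [Turn2, Turn3, Turn4] ✓ (Turn1 dropped when turn 4 started)
- If we add a turn 5: the model receives only [Turn3, Turn4] as history (Turn2 dropped!)

**What to Observe:**
- The history passed to the model caps at 4 messages (2 turns)
- The store holds up to k + 2 = 6 messages right after a turn
- Oldest messages automatically pruned beyond window
- Model maintains coherent recent context

**Testing Questions (try after running):**
- ✅ "What do I build?" → Should answer (turn 3 is within the window)
- ✅ "What do I focus on?" → Should answer (turn 4 is within the window)
- ❌ "Where do I work?" → Likely forgotten (turn 2 is trimmed when the next turn starts)
- ❌ "Where do I live?" → Forgotten (turn 1 is already outside the window)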

In [None]:
chain_with_window_memory.invoke(
    {"input": "I live in Bangalore"},
    config={"configurable": {"session_id": "win"}}
)

chain_with_window_memory.invoke(
    {"input": "I work at AstraZeneca"},
    config={"configurable": {"session_id": "win"}}
)

chain_with_window_memory.invoke(
    {"input": "I build agentic AI systems"},
    config={"configurable": {"session_id": "win"}}
)

chain_with_window_memory.invoke(
    {"input": "I focus on A2A architectures"},
    config={"configurable": {"session_id": "win"}}
)


AIMessage(content="That's great to hear! A2A architectures, which stands for Application-to-Application architectures, are crucial for enabling communication and interaction between different software applications. It's a key aspect of building complex systems that can work together seamlessly. If you have any questions or need assistance related to A2A architectures or any other topic, feel free to ask. I'm here to help!", additional_kwargs={'refusal': None}, response_metadata={'token_usage': {'completion_tokens': 77, 'prompt_tokens': 180, 'total_tokens': 257, 'completion_tokens_details': {'accepted_prediction_tokens': 0, 'audio_tokens': 0, 'reasoning_tokens': 0, 'rejected_prediction_tokens': 0}, 'prompt_tokens_details': {'audio_tokens': 0, 'cached_tokens': 0}}, 'model_provider': 'openai', 'model_name': 'gpt-3.5-turbo-0125', 'system_fingerprint': None, 'id': 'chatcmpl-D4PX0f90QRKhKtfCnpPP0o8SfU6Z8', 'service_tier': 'default', 'finish_reason': 'stop', 'logprobs': None}, id='lc_run--019

## 11. Inspecting Windowed Memory State

Examining the **actual message history** stored in the windowed session:

**What This Shows:**
- Direct access to `store["win"].messages`
- Reveals which messages are retained in memory
- Confirms the window size limitation

**Expected Output:**
```python
[
    HumanMessage(content="I work at AstraZeneca"),
    AIMessage(content="..."),
    HumanMessage(content="I build agentic AI systems"),
    AIMessage(content="..."),
    HumanMessage(content="I focus on A2A architectures"),
    AIMessage(content="...")
]
# Total: 6 messages (3 user + 3 AI); "I live in Bangalore" has been pruned
```

**Note:** With k=4 for `get_windowed_history`, we're tracking the last **4 messages total** (not 4 turns). If each turn = 2 messages (user + AI), the model receives **2 full conversation turns** as history. The trim runs when a turn starts, so the store holds those 4 messages plus the 2 added by the latest turn.

**Debugging Tip:**
Inspecting the message store is crucial for:
- Verifying memory strategy implementation
- Debugging context issues
- Understanding what information is available to the model
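
The stored state can be reproduced offline with a small simulation. It assumes the history function is called once per turn, before the new messages are appended, which matches the stored output in this notebook but is an assumption about the wrapper's behaviour; `simulate` and the canned replies are hypothetical.

```python
def simulate(turn_inputs, k=4):
    """Replay turns through a history that is trimmed to the last k messages on load."""
    stored = []
    seen_by_model = []
    for text in turn_inputs:
        stored = stored[-k:]                         # trim happens when the history is fetched
        seen_by_model = list(stored)                 # what the model receives as history
        stored.append(("human", text))               # the new turn is appended afterwards
        stored.append(("ai", f"reply to: {text}"))
    return seen_by_model, stored


seen, stored = simulate([
    "I live in Bangalore",
    "I work at AstraZeneca",
    "I build agentic AI systems",
    "I focus on A2A architectures",
])
print(len(seen), len(stored))  # 4 6
print(stored[0])               # ('human', 'I work at AstraZeneca')
```

The window limits what the model sees on each turn (4 messages), while the store briefly holds the window plus the newest turn (6 messages).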

In [None]:
store["win"].messages


[HumanMessage(content='I work at AstraZeneca', additional_kwargs={}, response_metadata={}),
 AIMessage(content="That's wonderful! AstraZeneca is a global biopharmaceutical company known for its innovative medicines and contributions to healthcare. If you have any questions or need assistance related to your work at AstraZeneca or anything else, feel free to ask. I'm here to help!", additional_kwargs={'refusal': None}, response_metadata={'token_usage': {'completion_tokens': 58, 'prompt_tokens': 72, 'total_tokens': 130, 'completion_tokens_details': {'accepted_prediction_tokens': 0, 'audio_tokens': 0, 'reasoning_tokens': 0, 'rejected_prediction_tokens': 0}, 'prompt_tokens_details': {'audio_tokens': 0, 'cached_tokens': 0}}, 'model_provider': 'openai', 'model_name': 'gpt-3.5-turbo-0125', 'system_fingerprint': None, 'id': 'chatcmpl-D4PWyGvqLjqNgt0KcwvasD04CuQWW', 'service_tier': 'default', 'finish_reason': 'stop', 'logprobs': None}, id='lc_run--019c18df-74d9-7e52-9795-21944988327a-0', tool_c

## 12. Memory Summarization - Compressing Conversation History

Implementing **summarization-based memory** to preserve historical context efficiently:

**The Strategy:**
Instead of keeping all messages or just recent ones, **summarize** old conversations into compressed form:
```
[Old Messages] → Summary → [Recent Messages + Summary] → LLM
```

**Implementation Steps:**

1. **Summary Prompt Template:**
   - System instruction: "Summarize the conversation briefly."
   - Input: Full conversation text

2. **Summary Chain:**
   - Pipes the prompt into the LLM
   - Generates concise summary of key points

3. **Conversation Text Preparation:**
   - Extracts all messages from session
   - Joins them into single text string
   - Preserves conversational flow

**Benefits:**
- 📦 **Compression**: Reduce 1000 tokens to 100 tokens
- 🧠 **Context Preservation**: Retain essential information
- 💰 **Cost Efficiency**: Lower token usage than full history
- 🎯 **Semantic Retention**: Keep meaning, discard verbosity

**Use Cases:**
- Long-term memory systems
- Multi-session conversations
- Knowledge base construction from chats

The resulting summary can be prepended to recent messages for a hybrid memory approach.
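
A hybrid of summary and recent messages might look like the sketch below. It is an illustration under stated assumptions: `naive_summarizer` is a stub that stands in for the LLM summary call so the example runs offline, and `compress_history` is a hypothetical helper operating on plain `(role, text)` tuples.

```python
def naive_summarizer(messages):
    """Hypothetical stand-in for an LLM summary call: keeps only the human statements."""
    facts = [text for role, text in messages if role == "human"]
    return "Earlier, the user said: " + "; ".join(facts)


def compress_history(messages, keep_last=2, summarize=naive_summarizer):
    """Replace everything except the last `keep_last` messages with one summary message."""
    if len(messages) <= keep_last:
        return list(messages)
    cut = len(messages) - keep_last
    old, recent = messages[:cut], messages[cut:]
    return [("system", summarize(old))] + list(recent)


log = [
    ("human", "I live in Bangalore"), ("ai", "Noted."),
    ("human", "I work at AstraZeneca"), ("ai", "Great."),
    ("human", "I build agentic AI systems"), ("ai", "Interesting."),
]
for role, text in compress_history(log, keep_last=2):
    print(role, "|", text)
```

In a real pipeline the `summarize` argument would wrap a call such as `summary_chain.invoke(...)`, at the cost of one extra model call whenever the summary is refreshed.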

In [None]:
summary_prompt = ChatPromptTemplate.from_messages([
    ("system", "Summarize the conversation briefly."),
    ("human", "{conversation}")
])

summary_chain = summary_prompt | llm

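# Prefix each message with its role ("human" / "ai") so the summarizer can tell speakers apart.
conversation_text = "\n".join(
    f"{m.type}: {m.content}" for m in store["demo"].messages
)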

summary = summary_chain.invoke(
    {"conversation": conversation_text}
)

summary


AIMessage(content="The user's name is Dhruba, and they work on graph neural networks (GNNs). GNNs are a fascinating area of research with applications in various fields. The conversation may involve discussing specific aspects of GNNs or addressing any questions or topics related to this area of research.", additional_kwargs={'refusal': None}, response_metadata={'token_usage': {'completion_tokens': 60, 'prompt_tokens': 112, 'total_tokens': 172, 'completion_tokens_details': {'accepted_prediction_tokens': 0, 'audio_tokens': 0, 'reasoning_tokens': 0, 'rejected_prediction_tokens': 0}, 'prompt_tokens_details': {'audio_tokens': 0, 'cached_tokens': 0}}, 'model_provider': 'openai', 'model_name': 'gpt-3.5-turbo-0125', 'system_fingerprint': None, 'id': 'chatcmpl-D4PVNkgaDINqUEx3hjDPQj7e1Zb5p', 'service_tier': 'default', 'finish_reason': 'stop', 'logprobs': None}, id='lc_run--019c18dd-f41c-7cc0-bd0e-3452ed7a4600-0', tool_calls=[], invalid_tool_calls=[], usage_metadata={'input_tokens': 112, 'outpu

## 13. RAG Implementation - Vector Store and Semantic Search

Introducing **Retrieval-Augmented Generation (RAG)** with vector embeddings:

**RAG Architecture:**
```
Knowledge Base → Embeddings → Vector Store → Similarity Search → Context → LLM
```

**Components:**

1. **OpenAI Embeddings:**
   - Converts text into 1536-dimensional vectors (with the default `text-embedding-ada-002` model)
   - Captures semantic meaning mathematically
   - Similar meanings = similar vectors

2. **FAISS Vector Store:**
   - Facebook AI Similarity Search
   - Efficient nearest-neighbor search
   - Indexes vectors for fast retrieval

3. **Knowledge Base:**
   ```python
   texts = [
       "User works on graph neural networks",
       "User builds agentic AI systems",
       "User is a senior data scientist"
   ]
   ```
   These facts are embedded and stored for retrieval

4. **Similarity Search:**
   - Query: "What kind of systems does the user build?"
   - Returns: "User builds agentic AI systems" (k=1, top match)
   - LangChain's FAISS wrapper ranks by L2 (Euclidean) distance by default; OpenAI embeddings are unit-length, so this gives the same ranking as cosine similarity
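
For unit-length vectors, squared Euclidean distance and cosine similarity are tied by the identity ‖a − b‖² = 2 − 2·cos(a, b), so ranking by smallest distance and ranking by largest cosine agree. The sketch below checks this on made-up 3-dimensional vectors; the names and numbers are illustrative and unrelated to real embedding values.

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    length = math.sqrt(sum(x * x for x in v))
    return [x / length for x in v]


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def l2(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Made-up toy vectors standing in for document embeddings.
docs = {
    "graphs": normalize([0.9, 0.1, 0.2]),
    "agents": normalize([0.1, 0.9, 0.3]),
    "role": normalize([0.3, 0.3, 0.9]),
}
query = normalize([0.2, 0.8, 0.4])

by_cosine = sorted(docs, key=lambda name: -cosine(query, docs[name]))
by_l2 = sorted(docs, key=lambda name: l2(query, docs[name]))
print(by_cosine, by_l2)
```

Both orderings put the same document first, which is why the choice between the two metrics does not change the top match when embeddings are normalized.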

**Why RAG?**
- 📚 **External Knowledge**: Access information beyond training data
- 🎯 **Relevant Context**: Retrieve only pertinent information
- ✅ **Factual Accuracy**: Ground responses in specific documents
- 🔄 **Dynamic Updates**: Add new knowledge without retraining

**Combined with Memory:**
- Memory: Tracks conversation flow
- RAG: Provides factual knowledge
- Together: Context-aware + knowledge-grounded AI

In [None]:
embeddings = OpenAIEmbeddings()

texts = [
    "User works on graph neural networks",
    "User builds agentic AI systems",
    "User is a senior data scientist"
]

vectorstore = FAISS.from_texts(texts, embedding=embeddings)

vectorstore.similarity_search(
    "What kind of systems does the user build?", k=1
)[0].page_content


'User builds agentic AI systems'

## 14. Conclusion - Next Steps

**🎉 What We've Built:**

A comprehensive **RAG system with multiple memory strategies**:
- ✅ Full conversation memory
- ✅ Windowed (sliding) memory  
- ✅ Summarization-based memory
- ✅ Vector-based semantic retrieval

**🔄 Combining Memory + RAG:**

For production systems, you can integrate these components:
1. Retrieve relevant facts from vector store (RAG)
2. Load recent conversation history (Memory)
3. Optionally include conversation summary (Summarization)
4. Combine all context in prompt
5. Generate informed response
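
The five steps can be sketched as a single prompt-assembly function. This is a minimal illustration, not the notebook's implementation: `build_prompt` is a hypothetical helper, the inputs are passed in as plain Python values, and the final model call is left out so the example runs offline.

```python
def build_prompt(question, retrieved_facts, recent_history, summary=None):
    """Assemble one chat prompt from retrieved facts, an optional summary, and recent turns."""
    system_parts = ["You are a helpful assistant."]
    if retrieved_facts:
        facts = "\n".join(f"- {fact}" for fact in retrieved_facts)
        system_parts.append("Relevant facts:\n" + facts)
    if summary:
        system_parts.append("Conversation summary: " + summary)
    messages = [("system", "\n\n".join(system_parts))]
    messages.extend(recent_history)          # recent turns, oldest first
    messages.append(("human", question))     # the new question always comes last
    return messages


prompt_messages = build_prompt(
    question="What kind of systems do I build?",
    retrieved_facts=["User builds agentic AI systems"],
    recent_history=[("human", "I focus on A2A architectures"), ("ai", "Noted.")],
    summary="The user is a data scientist based in Bangalore.",
)
for role, text in prompt_messages:
    print(role, "|", text)
```

A list of `(role, text)` tuples in this shape can be passed to `ChatPromptTemplate.from_messages`, provided the texts contain no literal curly braces, which the template would treat as variables.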

**🚀 Advanced Patterns:**
- **Hybrid Memory**: Summary + recent window + vector retrieval
- **Semantic Memory**: Store conversations as embeddings for cross-session retrieval
- **Adaptive Windowing**: Dynamic window size based on conversation complexity
- **Multi-Index RAG**: Separate vector stores for different knowledge domains

**📈 Production Considerations:**
- Use persistent storage (PostgreSQL, Redis, Pinecone)
- Implement caching for embeddings
- Monitor token usage and costs
- Add conversation export/import functionality
- Implement memory cleanup policies

**Try Building:**
- Personal AI assistant with long-term memory
- Customer support bot with knowledge base
- Research assistant with document retrieval
- Code assistant with codebase context