
# 📘 Vector Databases, RAG, and LangChain Basics

This notebook covers:

1. **Vector Databases (Theory Only)**
2. **Retrieval-Augmented Generation (RAG)** — Theory + Practical
3. **LangChain Framework (Basics)** — Theory + Practical

---



## 🧠 2.7 Vector Databases — Theory Only

### 🔹 Need for Vector Storage & Retrieval
Traditional databases store **structured or textual data** and match it by exact values or keywords. **Embedding models**, by contrast, produce **vector representations**: numerical arrays that capture the **semantic meaning** of data.

Hence, we need **vector databases** to:
- Store and manage **high-dimensional vector embeddings**.
- Retrieve similar items efficiently using **similarity search**.
- Scale to millions of records with **fast approximate nearest neighbor (ANN)** search.

**Example use cases:**
- Semantic document search  
- Recommendation systems  
- Image or audio similarity  
- RAG (Retrieval-Augmented Generation)

---

### 🔹 Popular Vector Databases

| Tool | Description | Key Features |
|------|--------------|---------------|
| **FAISS (Facebook AI Similarity Search)** | Open-source by Meta for efficient similarity search on dense vectors. | Very fast, supports GPU, suitable for local/offline use. |
| **Chroma** | Lightweight and open-source database integrated with LangChain. | Built for LLM workflows, supports persistence. |
| **Weaviate** | Cloud-native, schema-based, with RESTful APIs. | Supports hybrid (keyword + vector) search. |
| **Pinecone** | Fully managed cloud vector database. | Scalable, easy to integrate with OpenAI & LangChain. |

---

### 🔹 Search Methods

1. **Cosine Similarity:**  
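   Measures the cosine of the angle between two vectors, ignoring their magnitudes; scores near 1 mean the vectors point in nearly the same direction (semantically similar).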

2. **Top-k Search:**  
   Retrieves the top `k` most similar vectors to a query.

3. **Filtering:**  
   Combines metadata filters (e.g., date, category) with similarity search for context-aware retrieval.
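
The three search methods above can be sketched together in plain Python. This is a minimal brute-force illustration, not how FAISS or any other vector database is implemented (those use ANN indexes). The `records` list, its `category` field, and the function names are hypothetical and chosen only for this example.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k_search(query, records, k=2, category=None):
    """Return the k records most similar to `query`.

    If `category` is given, records with a different category are
    filtered out before ranking (metadata filtering).
    """
    candidates = [r for r in records if category is None or r["category"] == category]
    ranked = sorted(
        candidates,
        key=lambda r: cosine_similarity(query, r["vector"]),
        reverse=True,
    )
    return ranked[:k]

# Hypothetical toy "database": 3-dimensional vectors plus one metadata field.
records = [
    {"id": "doc_a", "vector": [1.0, 0.0, 0.0], "category": "sports"},
    {"id": "doc_b", "vector": [0.9, 0.1, 0.0], "category": "science"},
    {"id": "doc_c", "vector": [0.0, 1.0, 0.0], "category": "sports"},
    {"id": "doc_d", "vector": [0.7, 0.7, 0.0], "category": "sports"},
]

query = [1.0, 0.0, 0.0]
top_ids = [r["id"] for r in top_k_search(query, records, k=2)]
sports_ids = [r["id"] for r in top_k_search(query, records, k=2, category="sports")]
print(top_ids)     # unfiltered top-2
print(sports_ids)  # top-2 restricted to category == "sports"
```

Scoring every record is fine for a few thousand vectors; ANN indexes exist because this linear scan becomes too slow at millions of records.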



## 🧩 LangChain Framework — Basics

### 🔹 Concept
**LangChain** is a modular framework for building applications powered by **large language models (LLMs)**.  
It allows developers to combine components (like models, memory, tools, and data) to build intelligent workflows.

### 🔹 Key Elements

| Element | Description |
|----------|--------------|
| **Chains** | Sequential workflows that connect multiple steps (e.g., retrieval → LLM). |
| **Tools** | External functions or APIs the model can call. |
| **Memory** | Enables context persistence across conversations. |
| **Agents** | Decision-making components that dynamically choose which tools/chains to use. |

---

### 🔹 Why LangChain?
- Simplifies **integration** of LLMs with external data.  
- Supports **retrieval**, **memory**, **tools**, and **multi-step reasoning**.  
- Framework for **production-ready AI systems** (RAG, chatbots, assistants).

---

### 🔹 Simple Example Use Cases
- **Q&A Bots:** Retrieve info from custom documents.  
- **Web Agents:** Use APIs and browse the web.  
- **Data Assistants:** Query and summarize data automatically.


In [27]:
# 🚀 LangChain with Mistral-7B-Instruct 

!pip install langchain langchain-community transformers accelerate -q

from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain
from langchain_community.llms import HuggingFacePipeline
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
import torch

print(f"Using device: {'CUDA' if torch.cuda.is_available() else 'CPU'}")

template = """You are a helpful assistant. Answer the following question clearly and concisely.

{question}"""

prompt = PromptTemplate(template=template, input_variables=["question"])

# --- Load model---
model_name = "mistralai/Mistral-7B-Instruct-v0.2"


print(f"Loading {model_name}...")
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",           # place layers on the available GPU(s)/CPU automatically
    torch_dtype=torch.float16,   # half precision to reduce memory use
)

# Set padding token to avoid warnings
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

model.config.pad_token_id = tokenizer.pad_token_id

print("Model loaded successfully!")
# --- Create text generation pipeline ---
text_gen_pipeline = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=200,
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
    repetition_penalty=1.1,
    pad_token_id=tokenizer.eos_token_id,
    return_full_text=False,  # Only return generated text, not the prompt
)

# --- Wrap pipeline in LangChain ---
llm = HuggingFacePipeline(pipeline=text_gen_pipeline)

# --- Create LangChain ---
chain = LLMChain(prompt=prompt, llm=llm)

# --- Test the chain ---
question = "What is Cricket?"
print(f"\n{'='*60}")
print(f"Question: {question}")
print(f"{'='*60}")

response = chain.run(question)
print(f"\nAnswer: {response}")

# --- Additional example questions ---
print(f"\n{'='*60}")
print("Additional Examples:")
print(f"{'='*60}\n")

examples = [
    "Explain quantum computing in simple terms.",
    "What are the benefits of regular exercise?"
]

for q in examples:
    print(f"Q: {q}")
    print(f"A: {chain.run(q)}\n")

Using device: CUDA
Loading mistralai/Mistral-7B-Instruct-v0.2...


Device set to use cuda:0


Model loaded successfully!

Question: What is Cricket?

Answer: 
Cricket is an outdoor bat-and-ball game originating in southeast England during the late 16th century. It's played between two teams, each consisting of eleven players. The objective is to score as many runs as possible by hitting the ball bowled by the opposing team with bats, while fielders try to prevent this by getting the ball back to the wicket (a pair of three wooden stumps with two smaller wooden pieces called bail on top) before the batsman can score another run. Runs are scored when both batsmen have successfully made it to their respective creases (small rectangular areas on either side of the pitch) without being dismissed. The team with the most runs at the end of the match wins.

Additional Examples:

Q: Explain quantum computing in simple terms.
A: 

Quantum computing is a type of computation that uses quantum bits, or qubits, instead of classical bits to store and process information. In classical computin

| **Parameter**                         | **Purpose**                                 | **Detailed Explanation**                                                                                                                                                    |
| ------------------------------------- | ------------------------------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `"text-generation"`                   | **Pipeline type**                           | Tells the `transformers` library what kind of task to perform — in this case, generating text continuations given an input prompt.                                          |
| `model=model`                         | **Pretrained model used for generation**    | The model object you’ve loaded (e.g., GPT-2, Llama-2, Falcon, Mistral). It defines the actual neural network used for text generation.                                      |
| `tokenizer=tokenizer`                 | **Text → tokens → text converter**          | Converts your input string into numerical tokens for the model, and decodes generated tokens back into readable text. Must match the model.                                 |
| `max_new_tokens=200`                  | **Length control**                          | Limits the number of *new* tokens the model can generate, which prevents runaway text. Example: with a 50-token prompt, the full sequence is at most 50 + 200 = 250 tokens. |
| `do_sample=True`                      | **Enable randomness**                       | Turns on probabilistic sampling. Without this (`False`), the model always picks the highest probability next token → deterministic but repetitive.                          |
| `temperature=0.7`                     | **Controls creativity / randomness**        | Divides the logits by this value before sampling. <br>🔹 **Low (<1.0):** Sharper distribution, more focused and predictable <br>🔹 **High (>1.0):** Flatter distribution, more creative and diverse responses. |
| `top_p=0.9`                           | **Nucleus Sampling (a probability filter)** | Instead of considering all possible next tokens, it only samples from the smallest set of tokens whose total probability ≥ `0.9`. <br>Helps balance creativity & coherence. |
| `repetition_penalty=1.1`              | **Prevents looping**                        | Penalizes tokens that were already generated earlier. Values >1 discourage repetition (e.g., "the the the...").                                                             |
| `pad_token_id=tokenizer.eos_token_id` | **Padding fix for some models**             | Reuses the **end-of-sequence** token as the padding token for models that define none (e.g., GPT-2), so shorter sequences in a batch can be padded and the missing-pad-token warning is avoided. |
| `return_full_text=False`              | **Clean output**                            | If `True`, output includes both **prompt + generated text**. If `False`, you only get the **newly generated part** — cleaner for chatbots and summarizers.                  |
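
To make `temperature` and `top_p` from the table concrete, here is a small sketch of both steps on a made-up four-token distribution. It illustrates the idea only; the `transformers` library applies these steps internally and its implementation details may differ. The logit values and function names are assumptions for the example.

```python
import math
import random

def softmax_with_temperature(logits, temperature=1.0):
    """Divide logits by the temperature, then apply a numerically stable softmax."""
    scaled = [value / temperature for value in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_p_filter(probs, top_p=0.9):
    """Keep the smallest set of most-likely tokens whose cumulative
    probability reaches top_p, then renormalise over that set."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}

logits = [2.0, 1.0, 0.5, -1.0]  # hypothetical next-token logits

sharp = softmax_with_temperature(logits, temperature=0.7)  # more peaked
flat = softmax_with_temperature(logits, temperature=1.5)   # more spread out
nucleus = top_p_filter(sharp, top_p=0.9)                   # token index -> probability

print([round(p, 3) for p in sharp])
print([round(p, 3) for p in flat])
print(sorted(nucleus))

# Draw one token index from the filtered distribution (seeded for repeatability).
rng = random.Random(0)
sampled = rng.choices(list(nucleus), weights=list(nucleus.values()), k=1)[0]
print(sampled)
```

Lowering the temperature moves probability mass toward the most likely token, and `top_p` then discards the unlikely tail, so the two settings are usually tuned together.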



# 🧠 LangChain Memory

**Memory types**
- ConversationBufferMemory
- ConversationBufferWindowMemory
- ConversationKGMemory
- VectorStoreRetrieverMemory






## Memory - Theory

**Why memory is essential in conversational AI**  
- LLMs are stateless: without memory they cannot recall prior turns or user-specific facts. Memory enables continuity, personalization, and multi-turn reasoning.

**How memory persists and structures conversation context**  
- Memory components store parts of conversation (raw text, embeddings, or structured facts) and provide retrieval mechanisms so that the LLM can access relevant context when generating answers.

**Internal flow:**  
```
User input -> Memory (read) -> Prompt with history -> LLM -> Output -> Memory (write)
```

**Types of Memory (overview & when to use)**

1. **ConversationBufferMemory** — stores the full conversation as text.  
   - Use when short chats need full context and token usage is acceptable.

2. **ConversationBufferWindowMemory** — keeps only the last *N* interactions.  
   - Use when chats grow long and only recent context matters.

3. **ConversationKGMemory** — builds a knowledge graph of entities and relations extracted from conversation.  
   - Use when you need structured reasoning and relationship queries (e.g., who works with whom).

4. **VectorStoreRetrieverMemory** — stores embeddings of past interactions in a vector DB (FAISS/Chroma).  
   - Use for semantic/long-term recall over many past interactions.
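
The difference between the first two memory types can be sketched without any LLM. The classes below are hypothetical stand-ins written for this illustration, not LangChain's implementations: one keeps every turn, the other keeps only the last `k`.

```python
from collections import deque

class BufferMemory:
    """Stores every (user, assistant) turn, like a full-transcript memory."""

    def __init__(self):
        self.turns = []

    def save(self, user_text, assistant_text):
        self.turns.append((user_text, assistant_text))

    def load(self):
        # Render the stored turns as plain text to prepend to the next prompt.
        return "\n".join(f"User: {u}\nAssistant: {a}" for u, a in self.turns)


class WindowMemory(BufferMemory):
    """Stores only the last k turns; older turns are dropped automatically."""

    def __init__(self, k):
        self.turns = deque(maxlen=k)


full, window = BufferMemory(), WindowMemory(k=1)
conversation = [
    ("My favorite color is blue.", "Noted."),
    ("I like playing cricket.", "Nice."),
]
for user_text, assistant_text in conversation:
    full.save(user_text, assistant_text)
    window.save(user_text, assistant_text)

print(full.load())    # both turns
print("---")
print(window.load())  # only the most recent turn
```

With `k=1` the window version has already lost the favorite-color turn, which is the trade-off: fewer prompt tokens in exchange for forgetting older context.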

---


In [28]:
# ConversationBufferMemory demo (reuses text_gen_pipeline from the Mistral cell above)
from langchain.memory import ConversationBufferMemory
# ConversationBufferMemory stores the whole chat as a string
cb_memory = ConversationBufferMemory(memory_key="chat_history")

# Simulate conversation
def respond_with_memory(user_input):
    # Read memory
    history = cb_memory.load_memory_variables({}).get("chat_history", "")
    # Create prompt that includes history (very simple prompt concatenation)
    # Adjusted prompt format for direct pipeline use
    prompt = f"""Conversation history:
{history}

User: {user_input}
Assistant:"""
    generated = text_gen_pipeline(prompt, max_new_tokens=200)[0]['generated_text']
    res = generated.split("User:")[0].strip()

    # Update memory (append user + assistant)
    cb_memory.save_context({"input": user_input}, {"output": res})
    return res

print("User: Hi, my name is Aarnav and I love football.")
print("Assistant:", respond_with_memory("Hi, my name is Aarnav and I love football."))
print("User: Do you remember my hobby?")
print("Assistant:", respond_with_memory("Do you remember my hobby?"))

User: Hi, my name is Aarnav and I love football.
Assistant: Hello Aarnav, nice to meet you! I'm an assistant that helps answer questions and find information. How can I assist you with football today?
User: Do you remember my hobby?
Assistant: Yes, you mentioned that you love football earlier in our conversation.


In [33]:
# 🚀 ConversationBufferWindowMemory Forgetting Demo (Mistral-7B-Instruct) 
from langchain.memory import ConversationBufferWindowMemory 
# Keep only the last 1 interaction in memory 
window_memory = ConversationBufferWindowMemory(k=1, memory_key="chat_history_buffer") 
def respond_with_window_memory(user_input): 
    # Load only the last k turns from memory 
    history = window_memory.load_memory_variables({}).get("chat_history_buffer", "") 
    # Build the prompt using limited memory 
    prompt = f"""Recent conversation:
{history}

User: {user_input}
Assistant:"""
    # Generate model response 
    generated = text_gen_pipeline(prompt, max_new_tokens=150)[0]['generated_text'] 
    response = generated.split("User:")[0].strip() 
    # Save this turn to memory 
    window_memory.save_context({"input": user_input}, {"output": response}) 
    return response 
# --- Conversation Test --- 
print("Turn 1 - User: My favorite color is blue.") 
print("Assistant:", respond_with_window_memory("My favorite color is blue.")) 
print("\nTurn 2 - User: What’s my favorite color?") 
print("Assistant:", respond_with_window_memory("What’s my favorite color?")) 
print("\nTurn 3 - User: I like playing cricket") 
print("Assistant:", respond_with_window_memory("I like playing cricket")) 
print("\nTurn 4 - User: What did i say about my favorite colour?") 
print("Assistant:", respond_with_window_memory("What did i say about my favorite colour?"))

Turn 1 - User: My favorite color is blue.
Assistant: : That's great! Blue is a beautiful color. It's often associated with feelings of calm and peace. Is there anything specific you would like to know or discuss about the color blue?

Turn 2 - User: What’s my favorite color?
Assistant: : Based on our previous conversation, your favorite color is blue. Is that correct?

Turn 3 - User: I like playing cricket
Assistant: : That's great to know! While we don't have information about your favorite color directly linked to your preference for cricket, I'm here to help answer any questions you might have about the sport or anything else. How about a question related to cricket? For example, who holds the record for the most runs in an innings in international cricket? The answer is 400 not out, which was scored by Brian Lara.

Turn 4 - User: What did i say about my favorite colour?
Assistant: : Based on our conversation, you mentioned your preference for playing cricket but there was no mentio


## 🤖 Retrieval-Augmented Generation (RAG)

### 🔹 Motivation
LLMs (like GPT) have **no internet access by default** and only **limited knowledge**, frozen at their training cutoff.  
RAG helps LLMs **access external knowledge** dynamically by retrieving context from a **vector database**.

### 🔹 RAG Pipeline Overview

1. **Query →** User inputs a question.  
2. **Embedding →** Convert query into a vector using a model (like `sentence-transformers`).  
3. **Vector DB →** Look up the query vector in a vector store (FAISS/Chroma) that already holds the document chunk embeddings.  
4. **Retrieve →** Find top similar chunks to the query.  
5. **Inject Context →** Add retrieved chunks as context to LLM prompt.  
6. **Generate →** LLM produces a final response using the context.

### 🔹 Key Components
- **Chunking:** Breaking large documents into smaller pieces (e.g., 500 tokens).
- **Embedding:** Converting text chunks into vectors using models like `all-MiniLM-L6-v2`.
- **Retrieval:** Searching for relevant chunks using similarity search.
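
As a rough illustration of chunking with overlap, the sketch below splits text into fixed-size character windows. It is a simplified stand-in: LangChain's `RecursiveCharacterTextSplitter` prefers to break at natural separators such as paragraphs and sentences, which this version does not attempt. The function name and the small sizes are assumptions for the example.

```python
def chunk_text(text, chunk_size=40, overlap=10):
    """Split text into chunks of at most chunk_size characters.

    Consecutive chunks share `overlap` characters, so a sentence cut at a
    chunk boundary still appears whole in at least one neighbouring chunk
    (as long as it is shorter than the overlap).
    """
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last chunk already reaches the end of the text
    return chunks

sample = "abcdefghij" * 10  # 100 characters of placeholder text
chunks = chunk_text(sample, chunk_size=40, overlap=10)

for i, chunk in enumerate(chunks, 1):
    print(i, len(chunk), chunk)
```

Larger overlap reduces the chance of splitting a relevant passage across two chunks, at the cost of storing and embedding more duplicated text.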

### 🔹 Tools
- **FAISS:** Local, efficient vector search.  
- **LangChain:** Simplifies RAG pipelines.  
- **OpenAI Embeddings / Sentence Transformers:** For converting text to vectors.

### 🔹 Applications
- Document Q&A systems  
- Customer support bots  
- Personalized knowledge assistants  


In [34]:
!pip install langchain langchain-community sentence-transformers faiss-cpu pypdf transformers -q

from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import FAISS
from langchain_community.embeddings import SentenceTransformerEmbeddings
from langchain_community.document_loaders import PyPDFLoader
# text_gen_pipeline (the Mistral pipeline created in an earlier cell) is reused below

# --- Load PDF ---
file_path = "/kaggle/input/serverless-computing-in-enterprise-cloud-architect/Serverless Computing in Enterprise Cloud Architect.pdf"
loader = PyPDFLoader(file_path)
docs = loader.load()

# --- Chunking ---
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=50)
chunks = splitter.split_documents(docs)

# --- Embeddings ---
embeddings = SentenceTransformerEmbeddings(model_name="all-MiniLM-L6-v2")

# --- Vector Store (FAISS) ---
vectorstore = FAISS.from_documents(chunks, embeddings)

# --- Retrieval ---
retriever = vectorstore.as_retriever(search_kwargs={"k": 3})

# --- Query ---
query = "What is serverless cloud?"
retrieved_docs = retriever.get_relevant_documents(query)
print("\nTop Retrieved Chunks:")
for i, d in enumerate(retrieved_docs, 1):
    print(f"\nChunk {i}:\n{d.page_content}")
# --- Combine Retrieved Context ---
context = "\n\n".join([d.page_content for d in retrieved_docs])

# --- Prepare final prompt ---
final_prompt = f"""Answer the question based on the context below.

Context:
{context}

Question: {query}

Answer:"""

# --- Generate Answer ---
response = text_gen_pipeline(final_prompt)[0]["generated_text"]

print("=== FINAL ANSWER ===\n")
print(response)



Top Retrieved Chunks:

Chunk 1:
Serverless Computing in Enterprise Cloud 
Architecture: A Comprehensive Case Study 
Based on extensive research of current cloud computing literature , I've developed a 
comprehensive case study examining serverless computing 's transformative impact on 
enterprise architecture. The analysis is anchored by the systematic review "Rise of the Planet of 
Serverless Computing" by Wen et al. 2022 , which analyzed 164 research papers across 17 
different research directions.  1  
 
Research Foundation and Market Analysis 
The serverless computing market demonstrates remarkable growth trajectory, expanding from 
$3 billion in 2017 to a projected $ 22 billion by 2025, with enterprise adoption expected to reach 
50% of global organizations. This growth is driven by fundamental advantages in cost efficiency, 
automatic scaling, and operational simplification.  2  1  
 
 
Growth trajectory of serverless computing market size and enterprise adoption rates from 2017