[![Labellerr](https://storage.googleapis.com/labellerr-cdn/%200%20Labellerr%20template/notebook.webp)](https://www.labellerr.com)

# **Agent 101 – Mini RAG + Gemini LLM (In-Memory Vector Store)**

---

[![labellerr](https://img.shields.io/badge/Labellerr-BLOG-black.svg)](https://www.labellerr.com/blog/<BLOG_NAME>)
[![Youtube](https://img.shields.io/badge/Labellerr-YouTube-b31b1b.svg)](https://www.youtube.com/@Labellerr)
[![Github](https://img.shields.io/badge/Labellerr-GitHub-green.svg)](https://github.com/Labellerr/Hands-On-Learning-in-Computer-Vision)

## Understanding Retrieval-Augmented Generation (RAG) with LLMs

This notebook demonstrates how **Retrieval-Augmented Generation (RAG)** works with Large Language Models (LLMs) like Gemini.

### What is RAG?
RAG is a technique that combines:
1. **Retrieval**: Fetch relevant documents/chunks from a knowledge base based on a query
2. **Augmentation**: Provide those chunks as context to an LLM
3. **Generation**: Let the LLM generate answers grounded in the retrieved context

### Why use RAG?
- **Reduces Hallucinations**: The LLM is instructed to answer from the provided documents instead of relying on memory alone
- **Keeps Knowledge Fresh**: Update documents without retraining models
- **Improves Factuality**: Answers are grounded in your specific data
- **Explains Sources**: You can see which chunks contributed to the answer

### In this notebook:
- Build a minimal in-memory vector store
- Implement similarity search using TF-IDF (+ keyword matching); TF-IDF matches on shared terms, so it only approximates semantic search
- Use Gemini API to generate LLM-powered answers
- See the complete RAG pipeline in action

**Requirements:** A free Gemini API key from [Google AI Studio](https://aistudio.google.com/app/apikey)

## Section 0: Setup & Dependencies

This section imports necessary libraries and defines helper functions for text processing and vector operations.

### Key Functions:
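- `normalize_text()`: Lowercases text, replaces every character other than letters, digits, whitespace, and periods with a space, and collapses repeated whitespace
- `split_into_sentences()`: Splits a document on whitespace that follows `.`, `!`, or `?`, giving the sentences used for chunking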

In [1]:

# Imports and helpers
import math, re, collections, os
from typing import List, Dict, Tuple

# Optional: use TF-IDF from scikit-learn if available
try:
    from sklearn.feature_extraction.text import TfidfVectorizer
    SKLEARN_OK = True
except Exception:
    SKLEARN_OK = False

def normalize_text(s: str) -> str:
    s = s.lower()
    s = re.sub(r"[^a-z0-9\s\.]", " ", s)
    s = re.sub(r"\s+", " ", s).strip()
    return s

def split_into_sentences(text: str) -> List[str]:
    # Very small sentence splitter
    parts = re.split(r'(?<=[.!?])\s+', text.strip())
    return [p.strip() for p in parts if p.strip()]
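
As a quick check of the two helpers, the sketch below repeats their definitions from the cell above so that it runs on its own, then applies them to a made-up sample string (the sample text is an assumption, not part of the corpus). One thing it makes visible: `normalize_text()` keeps periods, so a word at the end of a sentence retains its trailing period as part of the token.

```python
import re
from typing import List

def normalize_text(s: str) -> str:
    # Same logic as the helper above: lowercase, drop most punctuation, squeeze spaces
    s = s.lower()
    s = re.sub(r"[^a-z0-9\s\.]", " ", s)
    s = re.sub(r"\s+", " ", s).strip()
    return s

def split_into_sentences(text: str) -> List[str]:
    # Same logic as the helper above: split after ., ! or ? followed by whitespace
    parts = re.split(r'(?<=[.!?])\s+', text.strip())
    return [p.strip() for p in parts if p.strip()]

sample = "RAG is useful! Does it reduce errors? Often, yes."  # hypothetical input
sentences = split_into_sentences(sample)
tokens = normalize_text(sample).split()
print(sentences)
print(tokens)
```

Because the last token keeps its period, exact-token overlap (as used by the keyword bonus later in the notebook) will not match a sentence-final word against the same word typed without the period in a query.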


## Section 1: Sample Knowledge Base (Corpus)

We create a small collection of documents about **RAG, vector stores, and search techniques**.

In a real application, this would be replaced with:
- Company documentation
- Research papers
- Product catalogs
- User manuals
- etc.

The RAG system will search this corpus to answer user questions.

In [2]:
corpus = [
'RAG, or Retrieval-Augmented Generation, is a technique where we fetch relevant context from a knowledge base and then generate answers grounded in that context. It reduces hallucinations and improves factuality.',
'A vector store keeps representations (embeddings) of chunks so we can quickly find semantically similar pieces to a query.',
'Chunking breaks long documents into smaller segments (like paragraphs or sliding windows). Good chunking helps retrieval match the right context.',
'Cosine similarity measures how close two vectors point in the same direction. Higher cosine means more similar meanings.',
'You can start with TF-IDF for quick experiments and later switch to stronger embeddings like Sentence-Transformers. For scale, tools like FAISS, Milvus, or Chroma are common.'
]


for i, doc in enumerate(corpus, 1):
    print(f"Doc {i}:", doc[:90] + ("..." if len(doc) > 90 else ""))


Doc 1: RAG, or Retrieval-Augmented Generation, is a technique where we fetch relevant context fro...
Doc 2: A vector store keeps representations (embeddings) of chunks so we can quickly find semanti...
Doc 3: Chunking breaks long documents into smaller segments (like paragraphs or sliding windows)....
Doc 4: Cosine similarity measures how close two vectors point in the same direction. Higher cosin...
Doc 5: You can start with TF-IDF for quick experiments and later switch to stronger embeddings li...


## Section 2: Document Chunking

### Why chunk documents?
Long documents need to be split into smaller chunks for better retrieval:
- **Semantic Precision**: Smaller chunks are more focused on single topics
- **Better Matching**: Queries are more likely to match relevant sections
- **Efficient Indexing**: Faster storage and retrieval in vector stores
- **Context Windows**: LLM context limits require appropriately sized chunks

### Chunking Strategy:
We use a **sentence-based fixed window** approach:
- Group N consecutive sentences into a chunk, with no overlap between chunks
- Can be extended with overlapping (sliding) windows for better context preservation

In [3]:
    
def chunk_docs(docs: List[str], sentences_per_chunk: int = 2) -> List[str]:
    chunks = []
    for doc in docs:
        sents = split_into_sentences(doc)
        for i in range(0, len(sents), sentences_per_chunk):
            chunk = " ".join(sents[i:i+sentences_per_chunk])
            if chunk.strip():
                chunks.append(chunk.strip())
    return chunks

chunks = chunk_docs(corpus, sentences_per_chunk=2)
print(f"Total chunks: {len(chunks)}")
for i, c in enumerate(chunks[:5], 1):
    print(f"Chunk {i}: {c}")


Total chunks: 5
Chunk 1: RAG, or Retrieval-Augmented Generation, is a technique where we fetch relevant context from a knowledge base and then generate answers grounded in that context. It reduces hallucinations and improves factuality.
Chunk 2: A vector store keeps representations (embeddings) of chunks so we can quickly find semantically similar pieces to a query.
Chunk 3: Chunking breaks long documents into smaller segments (like paragraphs or sliding windows). Good chunking helps retrieval match the right context.
Chunk 4: Cosine similarity measures how close two vectors point in the same direction. Higher cosine means more similar meanings.
Chunk 5: You can start with TF-IDF for quick experiments and later switch to stronger embeddings like Sentence-Transformers. For scale, tools like FAISS, Milvus, or Chroma are common.
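
The overlapping-window extension mentioned in the chunking strategy could look roughly like the sketch below. The function name `chunk_with_overlap` and its `overlap` parameter are assumptions introduced for illustration; the notebook's own `chunk_docs()` does not overlap chunks.

```python
from typing import List

def chunk_with_overlap(sentences: List[str], size: int = 2, overlap: int = 1) -> List[str]:
    """Group `size` sentences per chunk, repeating `overlap` sentences between neighbours."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(sentences), step):
        window = sentences[start:start + size]
        chunks.append(" ".join(window))
        if start + size >= len(sentences):
            break  # this window already reaches the last sentence
    return chunks

sents = ["S1.", "S2.", "S3.", "S4."]  # placeholder sentences
print(chunk_with_overlap(sents, size=2, overlap=1))
print(chunk_with_overlap(sents, size=2, overlap=0))
```

With `overlap=0` the result matches the non-overlapping grouping used above; a larger overlap produces more chunks and more repeated text, which is the usual trade-off for keeping context that straddles a chunk boundary.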


## Section 3: In-Memory Vector Store & Smart Retrieval

### How the Vector Store Works:

1. **Vectorization**: Convert text chunks into numerical vectors
   - Uses **TF-IDF** (Term Frequency-Inverse Document Frequency) for weighted term importance
   - Falls back to a hand-rolled TF-IDF over **bag-of-words** counts if scikit-learn is unavailable

2. **Vector Normalization**: Scale vectors to unit length for consistent similarity scores

3. **Similarity Search**: Find chunks most similar to a query
   - **Cosine Similarity**: Measures the angle between vectors (0 to 1 here, because TF-IDF weights are non-negative)
   - **Keyword Bonus**: Extra boost for exact term matches
   - **Combined Score**: Weighted blend of semantic + keyword relevance

### Retrieval Pipeline:
```
User Query → Vectorize → Cosine Similarity Scores → Keyword Bonus
→ Combined Ranking → Return Top-K Chunks
```
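
To make the combined score concrete, here is a minimal, dependency-free sketch of the blend: `alpha * cosine + (1 - alpha) * keyword_overlap`. It is a simplification under stated assumptions: it uses raw term counts instead of TF-IDF weights and plain whitespace tokenization, and the names `cosine` and `blended_score` are illustrative only.

```python
import math
from collections import Counter

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b.get(t, 0) for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def blended_score(query: str, chunk: str, alpha: float = 0.85) -> float:
    """Weighted blend of vector similarity and the fraction of query terms found in the chunk."""
    q_tokens = query.lower().split()
    c_tokens = chunk.lower().split()
    cos = cosine(Counter(q_tokens), Counter(c_tokens))
    q_terms = set(q_tokens)
    overlap = len(q_terms & set(c_tokens)) / max(1, len(q_terms))
    return alpha * cos + (1 - alpha) * overlap

for chunk in ["vector store", "a vector index", "cosine similarity"]:
    print(f"{chunk!r}: {blended_score('vector store', chunk):.3f}")
```

A higher `alpha` trusts the vector similarity more; a lower `alpha` gives exact term matches more influence on the final ranking.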

In [4]:

class TinyVectorStore:
    def __init__(self, chunks: List[str]):
        self.chunks = chunks
        self.method = "tfidf" if SKLEARN_OK else "bow_fallback"
        if SKLEARN_OK:
            self.vectorizer = TfidfVectorizer(lowercase=True)
            self.matrix = self.vectorizer.fit_transform(chunks)  # sparse matrix
        else:
            self._build_bow_index(chunks)

    def _build_bow_index(self, chunks: List[str]):
        tokenized = [normalize_text(c).split() for c in chunks]
        df = collections.Counter()
        for toks in tokenized:
            df.update(set(toks))
        N = len(chunks)
        self.vocab, self.idf = {}, {}
        for i, term in enumerate(sorted(df.keys())):
            self.vocab[term] = i
            self.idf[term] = math.log((1 + N) / (1 + df[term])) + 1.0
        self.matrix = []
        for toks in tokenized:
            tf = collections.Counter(toks)
            vec = [0.0] * len(self.vocab)
            for t, cnt in tf.items():
                if t in self.vocab:
                    vec[self.vocab[t]] = (cnt / max(1,len(toks))) * self.idf[t]
            norm = math.sqrt(sum(v*v for v in vec)) or 1.0
            vec = [v / norm for v in vec]
            self.matrix.append(vec)

    def encode_query(self, query: str):
        if SKLEARN_OK:
            return self.vectorizer.transform([query])
        else:
            toks = normalize_text(query).split()
            tf = collections.Counter(toks)
            vec = [0.0] * len(self.vocab)
            if len(toks) == 0:
                return vec
            for t, cnt in tf.items():
                if t in self.vocab:
                    vec[self.vocab[t]] = (cnt / len(toks)) * self.idf.get(t, 0.0)
            norm = math.sqrt(sum(v*v for v in vec)) or 1.0
            return [v / norm for v in vec]

    def cosine_scores(self, qvec):
        if SKLEARN_OK:
            sims = (self.matrix @ qvec.T).toarray().ravel()
            return sims.tolist()
        else:
            sims = []
            for vec in self.matrix:
                s = sum(a*b for a,b in zip(vec, qvec))
                sims.append(s)
            return sims

    def search(self, query: str, k: int = 3, alpha: float = 0.85) -> List[Tuple[int, float, str]]:
        qvec = self.encode_query(query)
        sims = self.cosine_scores(qvec)
        q_terms = set(normalize_text(query).split())
        keyword_bonus = []
        for chunk in self.chunks:
            terms = set(normalize_text(chunk).split())
            overlap = len(q_terms & terms) / max(1, len(q_terms))
            keyword_bonus.append(overlap)
        # Combine
        scores = [alpha*s + (1-alpha)*b for s,b in zip(sims, keyword_bonus)]
        tops = sorted(list(enumerate(scores)), key=lambda x: x[1], reverse=True)[:k]
        return [(idx, scores[idx], self.chunks[idx]) for idx, _ in tops]

store = TinyVectorStore(chunks)
print("Vectorizer method:", store.method)


Vectorizer method: tfidf
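
The fallback index weights each term with a smoothed inverse document frequency, `log((1 + N) / (1 + df)) + 1`, which is also the formula scikit-learn applies by default (`smooth_idf=True`). The short sketch below evaluates it for a few document frequencies; the helper name `smoothed_idf` is an assumption for illustration.

```python
import math

def smoothed_idf(n_docs: int, doc_freq: int) -> float:
    """idf = ln((1 + N) / (1 + df)) + 1, the same expression used in `_build_bow_index`."""
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1.0

n_chunks = 5  # matches the five chunks built earlier
for df in (1, 3, 5):
    print(f"df={df}: idf={smoothed_idf(n_chunks, df):.3f}")
```

Terms that appear in every chunk get a weight of exactly 1.0, while rarer terms get larger weights, so rare query terms contribute more to the similarity score.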


## Section 4: Baseline Extractive Answer (Without LLM)

### What is Extractive QA?
Before using an LLM, let's see a baseline approach:
- Retrieve relevant chunks using semantic search
- Extract sentences that contain query terms
- Combine sentences into an answer

### Limitations of Extractive QA:
- ❌ Cannot synthesize information across multiple chunks
- ❌ Cannot reformulate or explain concepts
- ❌ May include irrelevant sentences
- ✅ But: Fast, deterministic, and doesn't require an API

### Next: We'll upgrade to LLM-powered generation for better answers

In [5]:

def extractive_answer(query: str, retrieved: List[Tuple[int, float, str]], max_sents: int = 3) -> str:
    q_terms = set(normalize_text(query).split())
    chosen = []
    for _, _, chunk in retrieved:
        for s in split_into_sentences(chunk):
            s_terms = set(normalize_text(s).split())
            if len(q_terms & s_terms) > 0:
                chosen.append(s)
                if len(chosen) >= max_sents:
                    break
        if len(chosen) >= max_sents:
            break
    if not chosen:
        for _, _, chunk in retrieved:
            for s in split_into_sentences(chunk):
                chosen.append(s)
                if len(chosen) >= max_sents:
                    break
            if len(chosen) >= max_sents:
                break
    return " ".join(chosen)

def rag_ask_extractive(query: str, k: int = 3) -> Dict[str, object]:
    retrieved = store.search(query, k=k)
    answer = extractive_answer(query, retrieved, max_sents=3)
    return {"query": query, "retrieved": retrieved, "answer": answer}

demo = rag_ask_extractive("What is RAG and why are vector stores useful?", k=3)
demo


{'query': 'What is RAG and why are vector stores useful?',
 'retrieved': [(0,
   0.29330802646365534,
   'RAG, or Retrieval-Augmented Generation, is a technique where we fetch relevant context from a knowledge base and then generate answers grounded in that context. It reduces hallucinations and improves factuality.'),
  (4,
   0.15488700510938208,
   'You can start with TF-IDF for quick experiments and later switch to stronger embeddings like Sentence-Transformers. For scale, tools like FAISS, Milvus, or Chroma are common.'),
  (1,
   0.11758022982737715,
   'A vector store keeps representations (embeddings) of chunks so we can quickly find semantically similar pieces to a query.')],
 'answer': 'RAG, or Retrieval-Augmented Generation, is a technique where we fetch relevant context from a knowledge base and then generate answers grounded in that context. It reduces hallucinations and improves factuality. You can start with TF-IDF for quick experiments and later switch to stronger embed

## Section 5: Gemini API Setup

### What is Gemini?
**Google's Gemini** is a family of advanced multimodal LLMs that can:
- Generate coherent, contextual text
- Understand and reason about provided context
- Follow instructions reliably

### Available Gemini Models:
- **gemini-1.5-flash**: Fast, efficient, great for most use cases
- **gemini-1.5-pro**: More powerful, better reasoning (costs more)
- **gemini-2.0-flash**: Newer and faster; this notebook uses the experimental `gemini-2.0-flash-exp` variant ⭐

### Setup Steps:
1. Get a free API key: https://aistudio.google.com/app/apikey
2. Set `GEMINI_API_KEY` environment variable
3. Install `google-generativeai` SDK

### Why use Gemini for RAG?
- ✅ Free tier available
- ✅ Fast inference
- ✅ Good context understanding
- ✅ Reliable instruction following
- ✅ Can be prompted to stay within retrieved context

In [6]:
# 5.1 Install Gemini SDK (if needed)
try:
    import google.generativeai as genai
    print("\n✅ Gemini SDK is already installed.")
except Exception:
    print("\nInstalling google-generativeai SDK...")
    import sys, subprocess
    subprocess.run([sys.executable, "-m", "pip", "install", "--quiet", "google-generativeai"], check=False)
    import google.generativeai as genai
    print("✅ Gemini SDK installed.")

# 5.2 Configure your API key
import os
from getpass import getpass

GEMINI_API_KEY = os.getenv("GEMINI_API_KEY")

if not GEMINI_API_KEY:
    print("\n🔑 Enter your Gemini API key below (input is hidden for security):")
    GEMINI_API_KEY = getpass("Gemini API Key: ")
    if GEMINI_API_KEY:
        os.environ["GEMINI_API_KEY"] = GEMINI_API_KEY
        print("✅ API key set successfully in this session.")
    else:
        print("⚠️ No API key entered. Please provide one to enable LLM calls.")
else:
    print("\n✅ GEMINI_API_KEY already found in environment.")

# 5.3 Configure Gemini and list available models
if GEMINI_API_KEY:
    genai.configure(api_key=GEMINI_API_KEY)
    try:
        print("\n=== Available Models ===")
        for model in genai.list_models():
            if 'generateContent' in model.supported_generation_methods:
                print(f"✓ {model.name}")
    except Exception as e:
        print(f"Could not list models: {e}")

# 5.4 Choose your model (any name from the list above that supports generateContent)
GEMINI_MODEL = "models/gemini-2.0-flash-exp"
print(f"\nUsing model: {GEMINI_MODEL}")


✅ Gemini SDK is already installed.

🔑 Enter your Gemini API key below (input is hidden for security):
✅ API key set successfully in this session.

=== Available Models ===
✓ models/gemini-2.5-pro-preview-03-25
✓ models/gemini-2.5-flash-preview-05-20
✓ models/gemini-2.5-flash
✓ models/gemini-2.5-flash-lite-preview-06-17
✓ models/gemini-2.5-pro-preview-05-06
✓ models/gemini-2.5-pro-preview-06-05
✓ models/gemini-2.5-pro
✓ models/gemini-2.0-flash-exp
✓ models/gemini-2.0-flash
✓ models/gemini-2.0-flash-001
✓ models/gemini-2.0-flash-exp-image-generation
✓ models/gemini-2.0-flash-lite-001
✓ models/gemini-2.0-flash-lite
✓ models/gemini-2.0-flash-preview-image-generation
✓ models/gemini-2.0-flash-lite-preview-02-05
✓ models/gemini-2.0-flash-lite-preview
✓ models/gemini-2.0-pro-exp
✓ models/gemini-2.0-pro-exp-02-05
✓ models/gemini-exp-1206
✓ models/gemini-2.0-flash-thinking-exp-01-21
✓ models/gemini-2.0-flash-thinking-exp
✓ models/gemini-2.0-flash-thinking-exp-1219
✓ models/gemini-2.5-flash-pre

## Section 6: LLM-Powered RAG Answer Generation

### The RAG Loop with Gemini:

```
┌─────────────────┐
│  User Query     │
└────────┬────────┘
         │
         ▼
┌──────────────────────────┐
│ 1. RETRIEVE              │  Vector store searches for
│    Semantic Search       │  relevant chunks
└────────┬─────────────────┘
         │
         ▼
┌──────────────────────────┐
│ 2. AUGMENT               │  Format retrieved chunks
│    Build Context         │  as prompt context
└────────┬─────────────────┘
         │
         ▼
┌──────────────────────────┐
│ 3. GENERATE              │  Send query + context
│    Gemini LLM            │  to Gemini API
│    (with instructions)   │
└────────┬─────────────────┘
         │
         ▼
┌─────────────────────────┐
│  LLM-Generated Answer   │
│  (grounded in context)  │
└─────────────────────────┘
```

### Key Benefits of LLM-Powered RAG:
- 🧠 **Synthesis**: Combine information from multiple chunks
- 📝 **Explanation**: Generate coherent, natural language answers
- 🎯 **Accuracy**: The system prompt instructs the model to stay within the retrieved context
- 🔍 **Traceability**: Can show which chunks informed the answer

### System Prompt Design:
The system prompt is critical—it tells the LLM:
- **What role to play** (helpful assistant)
- **Scope constraints** (answer ONLY from provided context)
- **Failure mode** (what to say if context doesn't contain the answer)

This reduces the risk of the LLM hallucinating information that is not in your knowledge base, although it cannot rule it out entirely.

In [7]:
def build_context(chunks: List[str], max_chars: int = 2000) -> str:
    ctx = ""
    for i, ch in enumerate(chunks, 1):
        block = f"[Chunk {i}]\n{ch}\n\n"
        if len(ctx) + len(block) > max_chars:
            break
        ctx += block
    return ctx.strip()

def call_gemini_chat(system: str, user: str, model: str = None, max_tokens: int = 400, temperature: float = 0.2) -> str:
    model = model or GEMINI_MODEL
    key = os.getenv("GEMINI_API_KEY", "")
    if not key:
        return "GEMINI_API_KEY not set. Please set it and re-run."
    try:
        import google.generativeai as genai
        genai.configure(api_key=key)
        
        # Create the model
        client = genai.GenerativeModel(model_name=model)
        
        # Build the full prompt with system instructions
        full_prompt = f"{system}\n\n{user}"
        
        # Generate content
        resp = client.generate_content(
            full_prompt,
            generation_config={
                "max_output_tokens": max_tokens,
                "temperature": temperature,
            }
        )
        return resp.text.strip()
    except Exception as e:
        return f"[Gemini error] {e}"

def rag_ask_gemini(query: str, k: int = 3, max_ctx_chars: int = 2000, max_tokens: int = 400) -> Dict[str, object]:
    retrieved = store.search(query, k=k)
    top_texts = [t for _, _, t in retrieved]
    context = build_context(top_texts, max_chars=max_ctx_chars)
    system = (
        "You are a helpful assistant. Answer ONLY using the provided [Chunk] context. "
        "If the answer is not in the context, say you don't have enough information."
    )
    user_prompt = f"Question: {query}\n\nContext:\n{context}\n\nAnswer:"
    answer = call_gemini_chat(system, user_prompt, model=GEMINI_MODEL, max_tokens=max_tokens, temperature=0.2)
    return {"query": query, "retrieved": retrieved, "context": context, "answer": answer}

# Demo call (will only work if GEMINI_API_KEY is configured)
print("\n=== GEMINI LLM DEMO ===")
demo_llm = rag_ask_gemini("Explain RAG and how cosine similarity helps retrieval.", k=3)
print(f"Query: {demo_llm['query']}")
print(f"\nContext:\n{demo_llm['context']}")
print(f"\nAnswer: {demo_llm['answer']}")


=== GEMINI LLM DEMO ===
Query: Explain RAG and how cosine similarity helps retrieval.

Context:
[Chunk 1]
Cosine similarity measures how close two vectors point in the same direction. Higher cosine means more similar meanings.

[Chunk 2]
RAG, or Retrieval-Augmented Generation, is a technique where we fetch relevant context from a knowledge base and then generate answers grounded in that context. It reduces hallucinations and improves factuality.

[Chunk 3]
Chunking breaks long documents into smaller segments (like paragraphs or sliding windows). Good chunking helps retrieval match the right context.

Answer: RAG, or Retrieval-Augmented Generation, is a technique where we fetch relevant context from a knowledge base and then generate answers grounded in that context. Cosine similarity measures how close two vectors point in the same direction, where higher cosine means more similar meanings.
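
For the traceability benefit listed above, one option is to print a short source list next to the answer. The sketch below is a minimal illustration, assuming the `(chunk_index, score, text)` tuple shape that `store.search()` returns; the helper name `format_sources` and the example scores are hypothetical.

```python
from typing import List, Tuple

def format_sources(retrieved: List[Tuple[int, float, str]], max_len: int = 60) -> str:
    """Render (chunk_index, score, text) tuples as a numbered source list."""
    lines = []
    for rank, (idx, score, text) in enumerate(retrieved, 1):
        # Truncate long chunks so the list stays readable
        preview = text if len(text) <= max_len else text[:max_len].rstrip() + "..."
        lines.append(f"[{rank}] chunk #{idx} (score {score:.3f}): {preview}")
    return "\n".join(lines)

# Hypothetical retrieval result; the scores are made up for illustration
example = [
    (3, 0.41, "Cosine similarity measures how close two vectors point in the same direction."),
    (0, 0.29, "RAG, or Retrieval-Augmented Generation, is a technique."),
]
print(format_sources(example))
```

In the notebook this could be called as `format_sources(demo_llm["retrieved"])` after `rag_ask_gemini()` returns, so readers can compare the answer against the chunks that were sent as context.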


## Summary & Next Steps

### What You've Learned:

#### 🔹 RAG Fundamentals
- **Retrieval**: Efficient document search using vector similarity
- **Augmentation**: Formatting retrieved context for LLMs
- **Generation**: LLM-powered answer synthesis grounded in facts

#### 🔹 Technical Components
- **Text normalization** for consistent processing
- **Document chunking** for optimal retrieval
- **TF-IDF vectorization** for term-based similarity search
- **Cosine similarity** for relevance scoring
- **Keyword matching** for exact match boosting
- **System prompts** to guide LLM behavior

#### 🔹 Integration
- Using Gemini API for high-quality text generation
- Controlling LLM behavior through careful prompting
- Reducing hallucinations with context constraints

### Practical Use Cases:
- ✅ Customer support chatbots (FAQ-based answering)
- ✅ Document Q&A systems (legal, medical, technical docs)
- ✅ Internal knowledge bases (company wikis, runbooks)
- ✅ Research paper analysis and summarization
- ✅ Codebase documentation and Q&A
- ✅ Product recommendation engines

### Next Steps to Explore:

**1. Improve Retrieval Quality**
```python
# Use better embeddings
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('all-MiniLM-L6-v2')
embeddings = model.encode(chunks)
```

**2. Scale to Production**
```python
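# Use a vector index (FAISS expects float32 NumPy arrays)
import faiss
import numpy as np
embeddings = np.asarray(embeddings, dtype="float32")  # embeddings from step 1
index = faiss.IndexFlatL2(embeddings.shape[1])  # exact L2 search over the embedding dimension
index.add(embeddings)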
```

**3. Add Reranking**
```python
# Rerank with cross-encoder
from sentence_transformers import CrossEncoder
reranker = CrossEncoder('cross-encoder/ms-marco-TinyBERT-L-2-v2')
scores = reranker.predict([[query, doc] for doc in top_k_docs])
```

**4. Implement Streaming**
```python
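# Stream Gemini responses (prompt: the system + user text, as built in call_gemini_chat)
client = genai.GenerativeModel(model_name=GEMINI_MODEL)
for chunk in client.generate_content(prompt, stream=True):
    print(chunk.text, end='', flush=True)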
```

**5. Add Multi-Hop Reasoning**
```python
# Agent that can make multiple retrieval passes
# (Build a CrewAI agent with retriever + reasoner tools)
```

### Resources:
- 📖 [Google Gemini API Docs](https://ai.google.dev/docs)
- 📖 [Retrieval Augmented Generation Survey](https://arxiv.org/abs/2312.10997)
- 🛠️ [LangChain RAG Guide](https://python.langchain.com/docs/use_cases/question_answering/)
- 🛠️ [LlamaIndex Documentation](https://docs.llamaindex.ai/)
- 🛠️ [Sentence-Transformers](https://www.sbert.net/)

---

**Congratulations!** You now understand how RAG works and can build your own systems. Happy building! 🚀