# Section 2: Building a Basic RAG System

In the previous demo, we saw LLMs fail due to:
- Knowledge cutoff
- Hallucinations
- No access to private data
- Context limits

**Now let's build a RAG system to solve these problems!**

## What We'll Build

```
Documents → Chunk → Embed → Store → Retrieve → Augment → Generate
```

---

## Step 1: LOAD - Get Your Documents

RAG starts with your knowledge base. Let's create a sample document that simulates the internal documentation of a fictional company, TechNova Solutions.

In [1]:
import os
from dotenv import load_dotenv

load_dotenv()

OPENROUTER_API_KEY = os.getenv("OPENROUTER_API_KEY")
OPENROUTER_BASE_URL = "https://openrouter.ai/api/v1"

print("API Key loaded:", "Yes" if OPENROUTER_API_KEY else "No - check your .env file!")

API Key loaded: No - check your .env file!


In [2]:
# Create a knowledge base document
# This simulates INTERNAL company documentation that LLMs don't have access to

knowledge_base = """
# TechNova Solutions - Internal Documentation

## Company Overview
TechNova Solutions is a Bangalore-based enterprise software company founded in 2019.
The company specializes in cloud-native solutions and has 450 employees across 3 offices.
Current valuation: $120 million (Series C, 2024).

## Engineering Team Structure
- Platform Team: 45 engineers, led by Rajesh Kumar
- Backend Team: 60 engineers, led by Priya Sharma  
- Frontend Team: 35 engineers, led by Amit Patel
- DevOps Team: 25 engineers, led by Sneha Reddy
- Data Engineering: 30 engineers, led by Vikram Iyer

## Technology Stack

### Backend Services
- Primary Language: Go (Golang) for all microservices
- API Framework: gRPC for internal services, REST for external APIs
- Database: PostgreSQL 15 for transactional data, MongoDB for document storage
- Cache: Redis Cluster with 6 nodes
- Message Queue: Apache Kafka with 12 partitions per topic

### Frontend Architecture  
- Framework: Next.js 14 with React 18
- State Management: Zustand (migrated from Redux in Q2 2024)
- UI Components: Custom design system called "Nova UI"
- Testing: Playwright for E2E, Vitest for unit tests

### Infrastructure
- Cloud Provider: AWS (primary), GCP (disaster recovery)
- Kubernetes: EKS clusters in Mumbai (ap-south-1) and Singapore (ap-southeast-1)
- Container Registry: Amazon ECR
- CI/CD: GitHub Actions with ArgoCD for GitOps deployments
- Monitoring: Prometheus + Grafana stack, PagerDuty for alerts
- Logging: ELK Stack (Elasticsearch, Logstash, Kibana)

### Security
- Authentication: OAuth 2.0 with Keycloak
- Secrets Management: HashiCorp Vault
- WAF: AWS WAF with custom rule sets
- Compliance: SOC 2 Type II certified, ISO 27001 in progress

## Deployment Process
1. Developer creates PR against main branch
2. Automated tests run (unit, integration, security scans)
3. Code review required from 2 team members
4. Merge to main triggers staging deployment via ArgoCD
5. QA team performs validation (2-4 hours)
6. Production deployment requires approval from Tech Lead
7. Canary deployment: 5% traffic for 30 minutes
8. Full rollout if metrics are healthy

## On-Call Rotation
- Primary on-call rotates weekly across teams
- Escalation path: On-call → Team Lead → Engineering Manager → CTO
- SLA: P1 incidents must be acknowledged within 15 minutes
- Post-incident reviews required for all P1/P2 incidents

## Recent Incidents
- Nov 2024: Database failover caused 23-minute outage. Root cause: misconfigured health checks.
- Oct 2024: Kafka consumer lag spike. Resolution: increased partition count.
- Sep 2024: Memory leak in payment service. Fixed in v2.3.4.

## Q1 2025 Roadmap
- Migrate remaining services from REST to gRPC
- Implement distributed tracing with Jaeger
- Launch new analytics dashboard (Project Apollo)
- Achieve ISO 27001 certification
- Reduce deployment time from 45 min to under 15 min

## Contact Information
- Engineering Support: eng-support@technova.internal
- Security Team: security@technova.internal  
- Platform Team Slack: #platform-team
- Incident Channel: #incidents

"""

# Save to file. Create the data directory first so the write does not fail
# with FileNotFoundError on a fresh checkout; write as UTF-8 because the text
# contains non-ASCII characters such as "→".
os.makedirs("../data", exist_ok=True)
with open("../data/knowledge_base.txt", "w", encoding="utf-8") as f:
    f.write(knowledge_base)

print(f"Created knowledge base: {len(knowledge_base)} characters")
print("\nThis simulates INTERNAL company docs!")
print("\nSections included:")
print("  - Company Overview")
print("  - Engineering Team Structure")
print("  - Technology Stack (Backend, Frontend, Infra, Security)")
print("  - Deployment Process")
print("  - On-Call Rotation")
print("  - Recent Incidents")
print("  - Q1 2025 Roadmap")
print("  - Contact Information")

In [None]:
from langchain_community.document_loaders import TextLoader

# Load the document
loader = TextLoader("../data/knowledge_base.txt", encoding="utf-8")  # explicit encoding: the file contains non-ASCII characters
documents = loader.load()

print(f"Loaded {len(documents)} document(s)")
print(f"Document length: {len(documents[0].page_content)} characters")

Loaded 1 document(s)
Document length: 3072 characters




---

## Step 2: CHUNK - Split Into Smaller Pieces

Why chunk?
- Embeddings work better on focused content
- Retrieval is more precise with smaller chunks
- Fits within LLM context limits
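
To make `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of a fixed-window splitter in plain Python. It is an illustration only: the function name `chunk_text` is hypothetical, and the splitter used in the next cell behaves differently because it prefers to break on separators such as blank lines before falling back to raw character positions.

```python
def chunk_text(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into fixed-size character windows that overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
pieces = chunk_text(sample, chunk_size=10, chunk_overlap=3)
for piece in pieces:
    print(piece)
```

Each window starts 7 characters after the previous one, so the last 3 characters of one chunk reappear at the start of the next. That repeated text is what lets a sentence cut at a boundary still appear whole in at least one chunk.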

In [None]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Create a text splitter
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,        # Max characters per chunk (larger = more context per chunk)
    chunk_overlap=100,     # Overlap to preserve context across splits
    length_function=len,
    separators=["\n\n", "\n", " ", ""]  # Tried in order: paragraphs, lines, words, then single characters
)

# Split the document
chunks = text_splitter.split_documents(documents)

print(f"Split into {len(chunks)} chunks")
print("\n" + "="*50)
print("CHUNK EXAMPLES:")
print("="*50)

for i, chunk in enumerate(chunks[:3]):
    print(f"\n--- Chunk {i+1} ({len(chunk.page_content)} chars) ---")
    print(chunk.page_content)
    print()

Split into 10 chunks

CHUNK EXAMPLES:

--- Chunk 1 (291 chars) ---
# TechNova Solutions - Internal Documentation

## Company Overview
TechNova Solutions is a Bangalore-based enterprise software company founded in 2019.
The company specializes in cloud-native solutions and has 450 employees across 3 offices.
Current valuation: $120 million (Series C, 2024).


--- Chunk 2 (303 chars) ---
## Engineering Team Structure
- Platform Team: 45 engineers, led by Rajesh Kumar
- Backend Team: 60 engineers, led by Priya Sharma  
- Frontend Team: 35 engineers, led by Amit Patel
- DevOps Team: 25 engineers, led by Sneha Reddy
- Data Engineering: 30 engineers, led by Vikram Iyer

## Technology Stack


--- Chunk 3 (337 chars) ---
## Technology Stack

### Backend Services
- Primary Language: Go (Golang) for all microservices
- API Framework: gRPC for internal services, REST for external APIs
- Database: PostgreSQL 15 for transactional data, MongoDB for document storage
- Cache: Redis Cluster with 6 node

---

## Step 3: EMBED - Convert Text to Vectors

Embeddings capture semantic meaning:
- Similar concepts → similar vectors
- Enables semantic search (not just keyword matching)

In [None]:
from langchain_community.embeddings import HuggingFaceEmbeddings

# Using a free, local embedding model
# First run will download the model (~90MB)
print("Loading embedding model (first run downloads ~90MB)...")

embeddings = HuggingFaceEmbeddings(
    model_name="all-MiniLM-L6-v2",  # Fast and good quality
    model_kwargs={'device': 'cpu'}
)

print("Embedding model loaded!")

Loading embedding model (first run downloads ~90MB)...


Embedding model loaded!


In [None]:
# Let's see what an embedding looks like
test_text = "What is the tech stack?"
test_embedding = embeddings.embed_query(test_text)

print(f"Text: '{test_text}'")
print(f"Embedding dimension: {len(test_embedding)}")
print(f"First 10 values: {test_embedding[:10]}")
print(f"\nThis vector captures the MEANING of '{test_text}'")

Text: 'What is the tech stack?'
Embedding dimension: 384
First 10 values: [-0.05783945694565773, -0.10159027576446533, -0.04430558905005455, -0.023304156959056854, -0.056692417711019516, -0.05945168808102608, 0.04886123538017273, 0.11376131325960159, 0.009314495138823986, -0.004337272606790066]

This vector captures the MEANING of 'What is the tech stack?'


### Understanding Embeddings: Cosine Similarity

Embeddings capture **semantic meaning**. Similar sentences have similar vectors (low distance / high similarity).
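
Before applying this to real embeddings, it can help to check the formula on vectors small enough to verify by hand. The sketch below uses only the standard library; the helper name `cosine` and the toy 2-D vectors are illustrative assumptions, not part of the notebook's pipeline.

```python
import math


def cosine(a, b):
    """Cosine similarity: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


same_direction = cosine([3.0, 4.0], [6.0, 8.0])   # parallel vectors
orthogonal = cosine([1.0, 0.0], [0.0, 1.0])       # nothing in common
opposite = cosine([1.0, 0.0], [-1.0, 0.0])        # pointing opposite ways

print(same_direction, orthogonal, opposite)
```

Only the direction of the vectors matters, not their length: `[3, 4]` and `[6, 8]` score 1.0 even though one is twice as long. In general the score ranges from -1 to 1, although for sentence embeddings of natural-language text most pairs tend to land between 0 and 1.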

In [None]:
import numpy as np

def cosine_similarity(vec1, vec2):
    """Calculate cosine similarity between two vectors."""
    dot_product = np.dot(vec1, vec2)
    norm1 = np.linalg.norm(vec1)
    norm2 = np.linalg.norm(vec2)
    return dot_product / (norm1 * norm2)

# Define sentences - some similar, some different
sentences = {
    "s1": "What is the deployment process?",
    "s2": "How do we deploy code to production?",      # Similar to s1
    "s3": "What databases does the company use?",      # Different topic
    "s4": "Tell me about the CI/CD pipeline",          # Related to s1
    "s5": "What is the weather like today?",           # Completely unrelated
}

# Get embeddings for all sentences
sentence_embeddings = {key: embeddings.embed_query(text) for key, text in sentences.items()}

print("=" * 60)
print("COSINE SIMILARITY BETWEEN SENTENCES")
print("=" * 60)
print("\nSentences:")
for key, text in sentences.items():
    print(f"  {key}: \"{text}\"")

print("\n" + "-" * 60)
print("Similarity Scores (1.0 = identical, 0.0 = unrelated):")
print("-" * 60)

# Compare similar sentences
sim_1_2 = cosine_similarity(sentence_embeddings["s1"], sentence_embeddings["s2"])
print(f"\n✓ s1 vs s2 (both about deployment):     {sim_1_2:.4f}  ← HIGH (similar meaning!)")

sim_1_4 = cosine_similarity(sentence_embeddings["s1"], sentence_embeddings["s4"])
print(f"✓ s1 vs s4 (deployment vs CI/CD):       {sim_1_4:.4f}  ← MEDIUM-HIGH (related)")

# Compare different sentences
sim_1_3 = cosine_similarity(sentence_embeddings["s1"], sentence_embeddings["s3"])
print(f"\n✗ s1 vs s3 (deployment vs databases):  {sim_1_3:.4f}  ← LOWER (different topics)")

sim_1_5 = cosine_similarity(sentence_embeddings["s1"], sentence_embeddings["s5"])
print(f"✗ s1 vs s5 (deployment vs weather):    {sim_1_5:.4f}  ← LOWEST (unrelated!)")

print("\n" + "=" * 60)
print("KEY INSIGHT: Embeddings capture MEANING, not just keywords!")
print("'deployment process' ≈ 'deploy code to production'")
print("=" * 60)

COSINE SIMILARITY BETWEEN SENTENCES

Sentences:
  s1: "What is the deployment process?"
  s2: "How do we deploy code to production?"
  s3: "What databases does the company use?"
  s4: "Tell me about the CI/CD pipeline"
  s5: "What is the weather like today?"

------------------------------------------------------------
Similarity Scores (1.0 = identical, 0.0 = unrelated):
------------------------------------------------------------

✓ s1 vs s2 (both about deployment):     0.6666  ← HIGH (similar meaning!)
✓ s1 vs s4 (deployment vs CI/CD):       0.3548  ← MEDIUM-HIGH (related)

✗ s1 vs s3 (deployment vs databases):  0.1184  ← LOWER (different topics)
✗ s1 vs s5 (deployment vs weather):    0.0766  ← LOWEST (unrelated!)

KEY INSIGHT: Embeddings capture MEANING, not just keywords!
'deployment process' ≈ 'deploy code to production'


---

## Step 4: STORE - Save in Vector Database

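Vector databases enable fast similarity search across millions of vectors.

Note that the store in this step is persisted to `../data/chroma_db`. Re-running the cell adds the same chunks again rather than replacing them, so the vector count can end up larger than the number of chunks and retrieval may return duplicates. Deleting that directory before a re-run gives a clean store.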
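
Conceptually, a similarity search scores the query vector against every stored vector and keeps the best matches. The brute-force sketch below shows that idea in plain Python. The names `top_k` and `stored` and the 2-D toy vectors are hypothetical; production vector databases typically avoid this full scan by using approximate nearest-neighbour indexes.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def top_k(query_vec, stored, k):
    """Return the k (score, text) pairs most similar to query_vec."""
    scored = [(cosine(query_vec, vec), text) for text, vec in stored]
    scored.sort(key=lambda pair: pair[0], reverse=True)  # best score first
    return scored[:k]


# Toy 2-D vectors standing in for real 384-dimensional embeddings.
stored = [
    ("deployment steps", [1.0, 0.0]),
    ("database list", [0.0, 1.0]),
    ("release checklist", [0.8, 0.6]),
]

results = top_k([1.0, 0.0], stored, k=2)
for score, text in results:
    print(f"{score:.2f}  {text}")
```

The cost of this scan grows linearly with the number of stored vectors, which is why dedicated indexes matter once a collection reaches millions of entries.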

In [None]:
from langchain_community.vectorstores import Chroma

# Create vector store from chunks
# This embeds all chunks and stores them
print("Creating vector store (embedding all chunks)...")

vectorstore = Chroma.from_documents(
    documents=chunks,
    embedding=embeddings,
    persist_directory="../data/chroma_db"
)

print(f"\nVector store created!")
print(f"Contains {vectorstore._collection.count()} vectors")

Creating vector store (embedding all chunks)...

Vector store created!
Contains 25 vectors


---

## Step 5: RETRIEVE - Find Relevant Chunks

Given a query, find the most similar chunks using vector similarity.

In [None]:
# Create a retriever
retriever = vectorstore.as_retriever(
    search_type="similarity",
    search_kwargs={"k": 4}  # Return top 4 matches for better coverage
)

# Test retrieval with company data
query = "What databases does TechNova use?"
retrieved_docs = retriever.invoke(query)

print(f"Query: '{query}'")
print(f"\nRetrieved {len(retrieved_docs)} relevant chunks:")
print("="*50)

for i, doc in enumerate(retrieved_docs):
    print(f"\n--- Match {i+1} ---")
    print(doc.page_content)

Query: 'What databases does TechNova use?'

Retrieved 4 relevant chunks:

--- Match 1 ---
# TechNova Solutions - Internal Documentation

## Company Overview
TechNova Solutions is a Bangalore-based enterprise software company founded in 2019.
The company specializes in cloud-native solutions and has 450 employees across 3 offices.
Current valuation: $120 million (Series C, 2024).

--- Match 2 ---
# TechNova Solutions - Internal Documentation

## Company Overview
TechNova Solutions is a Bangalore-based enterprise software company founded in 2019.
The company specializes in cloud-native solutions and has 450 employees across 3 offices.
Current valuation: $120 million (Series C, 2024).

--- Match 3 ---
## Contact Information
- Engineering Support: eng-support@technova.internal
- Security Team: security@technova.internal  
- Platform Team Slack: #platform-team
- Incident Channel: #incidents

--- Match 4 ---
## Q1 2025 Roadmap
- Migrate remaining services from REST to gRPC
- Implement distri
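
Matches 1 and 2 above are identical, which is what you would expect if the same chunk has been stored more than once. One way to guard against this on the retrieval side is to drop repeated text before building the prompt. The sketch below is an assumption-laden illustration: `Doc` is a hypothetical stand-in for a retrieved document exposing a `page_content` attribute, and `dedupe_docs` is not a LangChain function.

```python
from dataclasses import dataclass


@dataclass
class Doc:
    """Hypothetical stand-in for a retrieved document."""
    page_content: str


def dedupe_docs(docs):
    """Drop documents whose text was already seen, keeping first-seen order."""
    seen = set()
    unique = []
    for doc in docs:
        key = doc.page_content.strip()
        if key not in seen:
            seen.add(key)
            unique.append(doc)
    return unique


retrieved = [
    Doc("## Company Overview"),
    Doc("## Company Overview"),
    Doc("## Contact Information"),
]
unique_docs = dedupe_docs(retrieved)
print([d.page_content for d in unique_docs])
```

Deduplicating only hides the symptom. With `k=4`, two of the four slots were still spent on the same chunk, so fixing the duplicate entries in the store itself is the more robust option.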

---

## Step 6: AUGMENT & GENERATE - Build the RAG Chain

Now we combine retrieval with the LLM to generate grounded answers.
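
The "augment" step is just string assembly: the retrieved chunks are joined and placed into a prompt template next to the user's question. The plain-Python sketch below shows this without any framework. `TEMPLATE` and `build_prompt` are illustrative names, and the two context lines are copied from the sample knowledge base.

```python
TEMPLATE = """Answer the question based ONLY on the following context.

Context:
{context}

Question: {question}

Answer:"""


def build_prompt(chunks, question):
    """Join retrieved chunk texts and fill them into the prompt template."""
    context = "\n\n".join(chunks)
    return TEMPLATE.format(context=context, question=question)


prompt_text = build_prompt(
    [
        "- Primary Language: Go (Golang) for all microservices",
        "- Cache: Redis Cluster with 6 nodes",
    ],
    "Which language are the backend services written in?",
)
print(prompt_text)
```

The chain built in the next cell does the same thing declaratively: the retriever output is formatted into `{context}`, the original question is passed through into `{question}`, and the filled prompt goes to the LLM.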

In [None]:
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

# Setup LLM
llm = ChatOpenAI(
    model="anthropic/claude-3.5-sonnet",
    openai_api_key=OPENROUTER_API_KEY,
    openai_api_base=OPENROUTER_BASE_URL,
    temperature=0.3
)

# RAG Prompt Template
template = """You are a helpful assistant. Answer the question based ONLY on the following context.
If the context doesn't contain the answer, say "I don't have information about that in my knowledge base."

Context:
{context}

Question: {question}

Answer:"""

prompt = ChatPromptTemplate.from_template(template)

# Helper to format retrieved documents
def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)

# Build the RAG chain
rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

print("RAG chain ready!")

RAG chain ready!


---

## Let's Test It!

### Test 1: Questions about our knowledge base

In [None]:
# Question 1: Internal company data (LLM has NO access to this!)
question = "In which programming language are TechNova's backend services written?"

print(f"Q: {question}")
print("="*50)
answer = rag_chain.invoke(question)
print(f"A: {answer}")

Q: In which programming language are TechNova's backend services written?
A: According to the context, TechNova's backend services are written in Go (Golang) for all microservices.


In [None]:
# Question 2: More internal data
question = "What is the deployment process at TechNova? Who needs to approve production deployments?"

print(f"Q: {question}")
print("="*50)
answer = rag_chain.invoke(question)
print(f"A: {answer}")

Q: What is the deployment process at TechNova? Who needs to approve production deployments?
A: Based on the context, the deployment process at TechNova consists of these steps:

1. Developer creates PR against main branch
2. Automated tests run (unit, integration, security scans)
3. Code review required from 2 team members
4. Merge to main triggers staging deployment via ArgoCD
5. QA team performs validation (2-4 hours)
6. Production deployment requires approval from Tech Lead
7. Canary deployment: 5% traffic for 30 minutes
8. Full rollout if metrics are healthy

Specifically regarding approvals, the Tech Lead needs to approve production deployments.


### Test 2: Question NOT in knowledge base

RAG should gracefully handle questions outside its knowledge.
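
Because the prompt template tells the model to reply with a fixed sentence when the context lacks the answer, application code can check for that sentence and branch on it, for example to show a "no results" message. The sketch below assumes the model reproduces the sentence verbatim at the start of its reply, which is not guaranteed; `FALLBACK` and `is_fallback` are hypothetical names.

```python
# The fallback sentence requested in the RAG prompt template.
FALLBACK = "I don't have information about that in my knowledge base."


def is_fallback(answer: str) -> bool:
    """Return True when the answer opens with the agreed fallback sentence."""
    return answer.strip().startswith(FALLBACK)


out_of_scope = (
    "I don't have information about that in my knowledge base. "
    "The provided context does not mention Nobel Prize winners."
)
in_scope = "According to the context, the DevOps Team has 25 engineers."

print(is_fallback(out_of_scope), is_fallback(in_scope))
```

A string match like this is brittle if the model paraphrases the sentence, so treat it as a convenience check rather than a reliable signal.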

In [None]:
# Question not in our knowledge base
question = "Who won the 2024 Nobel Prize in Physics?"

print(f"Q: {question}")
print("="*50)
answer = rag_chain.invoke(question)
print(f"A: {answer}")
print("\n(RAG correctly says it doesn't have this information!)")

Q: Who won the 2024 Nobel Prize in Physics?
A: I don't have information about that in my knowledge base. The provided context only contains information about engineering team structure, frontend architecture, and technology stack. It does not contain any information about Nobel Prize winners.

(RAG correctly says it doesn't have this information!)


---

## The Magic: Compare With vs Without RAG

Let's see the difference RAG makes.

In [None]:
# Direct LLM (no RAG) - for comparison
def ask_without_rag(question):
    response = llm.invoke(question)
    return response.content

# THE KEY DEMO: Private company data that LLM cannot know!

print("=" * 70)
print("DEMO: Private Company Data - With vs Without RAG")
print("=" * 70)

question = "What is the team size and lead name for the DevOps team at TechNova?"

print(f"\nQuestion: {question}")

print("\n" + "-" * 70)
print("WITHOUT RAG (LLM alone):")
print("-" * 70)
print(ask_without_rag(question))

print("\n" + "-" * 70)
print("WITH RAG (LLM + TechNova knowledge base):")
print("-" * 70)
print(rag_chain.invoke(question))

print("\n" + "=" * 70)
print("KEY TAKEAWAY")
print("=" * 70)
print("Without RAG: LLM has no access to private company data")
print("With RAG: Accurate answers from YOUR internal knowledge base!")
print("\nThis is why RAG is essential for enterprise applications.")

DEMO: Private Company Data - With vs Without RAG

Question: What is the team size and lead name for the DevOps team at TechNova?

----------------------------------------------------------------------
WITHOUT RAG (LLM alone):
----------------------------------------------------------------------
I cannot provide specific information about TechNova's DevOps team size or lead name, as I don't have access to their internal organizational details. To get accurate information about TechNova's team structure, you would need to contact TechNova directly or consult their official company resources.

----------------------------------------------------------------------
WITH RAG (LLM + TechNova knowledge base):
----------------------------------------------------------------------
According to the context, the DevOps Team has 25 engineers and is led by Sneha Reddy.

KEY TAKEAWAY
Without RAG: LLM has no access to private company data
With RAG: Accurate answers from YOUR internal knowledge base

---

## Summary: What We Built

```
┌─────────────┐     ┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│   LOAD      │ →   │   CHUNK     │ →   │   EMBED     │ →   │   STORE     │
│  Documents  │     │  Split text │     │  Vectors    │     │  ChromaDB   │
└─────────────┘     └─────────────┘     └─────────────┘     └─────────────┘
                                                                   ↓
┌─────────────┐     ┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│   ANSWER    │ ←   │  GENERATE   │ ←   │  AUGMENT    │ ←   │  RETRIEVE   │
│  Grounded!  │     │    LLM      │     │  Prompt     │     │  Similar    │
└─────────────┘     └─────────────┘     └─────────────┘     └─────────────┘
```

### Key Takeaways:

1. **Embeddings** capture semantic meaning - similar sentences have high cosine similarity
2. **Chunking** strategy matters - too small loses context, too large loses precision
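3. **RAG grounds LLMs** in your data - answers come from retrieved context, which reduces (but does not eliminate) hallucinations about your content
4. **Only retrieved chunks reach the LLM** - the full knowledge base is never sent, but with a hosted model the retrieved chunks still leave your environment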

### Next: Agentic RAG
Basic RAG has limitations. What if:
- The first retrieval doesn't find good results?
- The question needs to be rewritten?
- Multiple lookups are needed?

**Agentic RAG adds intelligence to decide WHEN and HOW to retrieve!**