# Hands-on Coding Exercises

### 🎯 Goal: Understand the fundamental concepts and see them in action.

### What are we going to do? 
We will walk through the code step by step for the first 45 minutes. In the last 15 minutes, everyone can start working on the shared code and executing it.

### What are we not going to do?
We are not going to code in parallel with the instructor's exercise.
We should also refrain from deep dives into individual pieces of complex code or other packages.

Note: Participants are at various skill levels, so the focus is on understanding the fundamentals and seeing the concepts in action. Since we all have strong coding skills, everyone is expected to dive deeper into the code, parameters, and so on afterwards.



# Module 1: Introduction & Problem Statement

## 🎯 Learning Objectives
By the end of this module, you will:
- Understand the fundamental limitations of standalone LLMs
- Recognize when RAG is the right solution
- Visualize the complete RAG workflow
- Experience hands-on examples of LLM limitations and RAG solutions

## 📚 Key Concepts

### What is RAG?
**Retrieval-Augmented Generation (RAG)** is a technique that enhances Large Language Models (LLMs) by providing them with relevant external information before generating responses.

Think of it like an open-book exam:
- **Without RAG**: LLM answers from memory only (closed-book exam)
- **With RAG**: LLM gets relevant documents first, then answers (open-book exam)

### Why Do We Need RAG?

#### 🚫 LLM Limitations
1. **Knowledge Cutoff Dates**: LLMs are trained on data up to a certain date
2. **Hallucinations**: LLMs can generate confident-sounding but incorrect information
3. **No Domain-Specific Knowledge**: Limited knowledge about your company/domain
4. **No Real-time Information**: Cannot access current events or dynamic data
5. **Context Length Limits**: Cannot process entire large documents

#### ✅ How RAG Helps
1. **Fresh Information**: Access to up-to-date external data
2. **Grounded Responses**: Answers based on provided evidence
3. **Domain Expertise**: Include your specific documents and data
4. **Source Attribution**: Know where information comes from
5. **Cost Effective**: No need to retrain models with new data

### 2025 Research Insights 🔬
- **Mathematical Proof**: OpenAI researchers proved LLM hallucinations are mathematically inevitable (Sept 2025)
- **Current Best**: Anthropic Claude 3.7 has the lowest hallucination rate at 17%
- **RAG Impact**: Properly implemented RAG can reduce hallucinations by 49-67%

## 🛠️ Setup (LlmUtils)
Let's set up and get started!

In [None]:
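import os
from datetime import datetime

import llmutils
# .. (remaining setup elided)

# Assumed imports for the names used below; adjust to your installed LangChain version.
from langchain_openai import ChatOpenAI
from langchain_core.messages import HumanMessage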

llm = ChatOpenAI(
    model="gpt-3.5-turbo",
    temperature=0.1,  # Low temperature for more consistent results
    openai_api_key=os.getenv("OPENAI_API_KEY")
)

print("✅ Setup complete!")
print(f"📅 Today's date: {datetime.now().strftime('%Y-%m-%d')}")

## 🧪 Exercise 1: Demonstrating LLM Limitations

Let's see these limitations in action with real examples.

### Problem 1: Knowledge Cutoff Dates

In [None]:
# Test knowledge cutoff with recent events
recent_questions = [
    "What happened in the 2024 US Presidential Election?",
    "Who won the 2024 Nobel Prize in Physics?",
    "What are the latest features in iPhone 16?",
    "What is the current stock price of NVIDIA?"
]

print("🔍 Testing Knowledge Cutoff Issues:")
print("=" * 50)

for question in recent_questions:
    print(f"\n❓ Question: {question}")
    try:
        response = llm.invoke([HumanMessage(content=question)])
        print(f"🤖 LLM Response: {response.content[:200]}...")
    except Exception as e:
        print(f"❌ Error: {e}")
    print("-" * 30)

### Problem 2: Hallucinations (Confident but Wrong Answers)

In [None]:
# Test with questions that might trigger hallucinations
tricky_questions = [
    "What is the exact population of Atlantis?",
    "Who is the CEO of DataScience Corp (a fictional company)?",
    "What are the side effects of Imaginex (a made-up drug)?",
    "What year was the Treaty of Fabrication signed?"
]

print("🎭 Testing Hallucination Tendencies:")
print("=" * 50)

for question in tricky_questions:
    print(f"\n❓ Question: {question}")
    try:
        response = llm.invoke([HumanMessage(content=question)])
        print(f"🤖 LLM Response: {response.content}")
        print("⚠️  Note: This response might be hallucinated since the question involves fictional entities!")
    except Exception as e:
        print(f"❌ Error: {e}")
    print("-" * 30)

### Problem 3: No Domain-Specific Knowledge

In [None]:
# Test with company-specific questions
company_questions = [
    "What is our company's return policy?",
    "Who is the head of our marketing department?",
    "What are the specs of our Model-X product?",
    "What was discussed in last week's board meeting?"
]

print("🏢 Testing Domain-Specific Knowledge:")
print("=" * 50)

for question in company_questions:
    print(f"\n❓ Question: {question}")
    try:
        response = llm.invoke([HumanMessage(content=question)])
        print(f"🤖 LLM Response: {response.content}")
        print("📝 Note: LLM cannot access company-specific information!")
    except Exception as e:
        print(f"❌ Error: {e}")
    print("-" * 30)

## 🔧 Exercise 2: Simple RAG Preview

Now let's see how RAG can solve these problems with a basic example.

In [None]:
# Create some sample company knowledge
company_knowledge = """
COMPANY INFORMATION DATABASE
============================

COMPANY: TechCorp Solutions
FOUNDED: 2018
HEADQUARTERS: San Francisco, CA

LEADERSHIP:
- CEO: Sarah Johnson
- CTO: Michael Chen  
- Head of Marketing: Lisa Rodriguez
- Head of Sales: David Kim

PRODUCTS:
- Model-X: AI-powered analytics platform
  Specs: 99.9% uptime, supports 1M+ queries/sec, cloud-native
- Model-Y: Customer service automation tool
  Specs: 24/7 support, multi-language, integrates with 50+ platforms

POLICIES:
- Return Policy: 30-day money-back guarantee on all products
- Support: 24/7 technical support for enterprise customers
- Privacy: SOC2 Type II certified, GDPR compliant

RECENT NEWS:
- 2024-01-15: Launched Model-Y 2.0 with enhanced AI capabilities
- 2024-02-10: Secured $50M Series B funding
- 2024-03-05: Opened new office in London
"""

In [None]:
def simple_rag_query(question, knowledge_base):
    """
    A simple RAG implementation:
    1. Provide relevant context to the LLM
    2. Ask the LLM to answer based on that context
    """
    
    # Create a prompt that includes our knowledge base
    rag_prompt = f"""
    You are a helpful assistant that answers questions based on the provided company information.
    
    COMPANY INFORMATION:
    {knowledge_base}
    
    QUESTION: {question}
    
    Please answer the question based ONLY on the information provided above. 
    If the information is not available, say "I don't have that information in the company database."
    
    ANSWER:
    """
    
    # Get response from LLM
    response = llm.invoke([HumanMessage(content=rag_prompt)])
    return response.content



In [None]:
# Test our simple RAG with the same company questions
print("🚀 Testing Simple RAG vs Standard LLM:")
print("=" * 60)

test_questions = [
    "What is our company's return policy?",
    "Who is the head of our marketing department?",
    "What are the specs of our Model-X product?",
    "When did we open our London office?"
]

for question in test_questions:
    print(f"\n❓ Question: {question}")
    
    # Standard LLM response
    print("\n🤖 Standard LLM (no context):")
    standard_response = llm.invoke([HumanMessage(content=question)])
    print(f"   {standard_response.content}")
    
    # RAG response
    print("\n🔍 RAG LLM (with company context):")
    rag_response = simple_rag_query(question, company_knowledge)
    print(f"   {rag_response}")
    
    print("-" * 60)

## 📊 Exercise 3: RAG Workflow Visualization

Let's understand the complete RAG workflow step by step.

In [None]:
def visualize_rag_workflow(user_question):
    """
    Demonstrate the RAG workflow step by step
    """
    print("🔄 RAG WORKFLOW DEMONSTRATION")
    print("=" * 50)
    
    # Step 1: User asks a question
    print(f"📝 STEP 1: User Question")
    print(f"   '{user_question}'")
    print()
    
    # Step 2: Retrieve relevant documents (simplified)
    print(f"🔍 STEP 2: Document Retrieval")
    print(f"   Searching knowledge base for relevant information...")
    
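    # Simple keyword matching for demonstration.
    # Strip punctuation so "CTO?" matches "CTO", and skip very short words.
    question_words = [w.strip("?.,!:;'\"") for w in user_question.lower().split()]
    keywords = [w for w in question_words if len(w) >= 3]
    relevant_lines = []
    
    for line in company_knowledge.split('\n'):
        if any(word in line.lower() for word in keywords):
            relevant_lines.append(line.strip())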
    
    print(f"   Found {len(relevant_lines)} relevant lines:")
    for line in relevant_lines[:3]:  # Show first 3 relevant lines
        if line:
            print(f"   - {line}")
    print()
    
    # Step 3: Augment the prompt
    print(f"📝 STEP 3: Prompt Augmentation")
    print(f"   Combining retrieved context with user question...")
    print(f"   Context + Question → Enhanced Prompt")
    print()
    
    # Step 4: Generate response
    print(f"🤖 STEP 4: Generation")
    print(f"   LLM generates response based on provided context...")
    
    response = simple_rag_query(user_question, company_knowledge)
    print(f"   Response: '{response}'")
    print()
    
    # Step 5: Return grounded answer
    print(f"✅ STEP 5: Grounded Answer")
    print(f"   Answer is based on actual company data, not LLM's training!")
    
    return response

# Test the workflow
sample_question = "Who is our CTO?"
result = visualize_rag_workflow(sample_question)

## 🧠 Key Takeaways

From this module, you should now understand:

### ❌ LLM Limitations We Observed:
1. **Knowledge Cutoff**: Cannot answer questions about recent events
2. **Hallucinations**: May provide confident but incorrect answers
3. **No Domain Knowledge**: Cannot access company-specific information
4. **Generic Responses**: Provides general answers without specific context

### ✅ RAG Benefits We Demonstrated:
1. **Accurate Information**: Answers based on provided context
2. **Domain-Specific**: Can access and use company knowledge
3. **Grounded Responses**: Explicitly tells you when information isn't available
4. **Improved Accuracy**: Significantly better performance on domain-specific questions

### 🔄 RAG Workflow:
1. **User Question** → 2. **Retrieve Relevant Docs** → 3. **Augment Prompt** → 4. **Generate Response** → 5. **Grounded Answer**
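
The five steps above can be sketched end to end without an API key. This is a minimal sketch, not the workshop's implementation: `KNOWLEDGE`, `tokenize`, `retrieve`, `augment`, and `fake_llm` are hypothetical names, retrieval is plain keyword overlap, and `fake_llm` stands in for a real model call such as `llm.invoke`.

```python
# Minimal end-to-end RAG sketch. All names here are illustrative assumptions.
KNOWLEDGE = [
    "CEO: Sarah Johnson",
    "CTO: Michael Chen",
    "Return Policy: 30-day money-back guarantee on all products",
    "2024-03-05: Opened new office in London",
]

STOPWORDS = {"who", "what", "when", "is", "our", "the", "did", "we", "a", "of"}


def tokenize(text):
    """Lowercase, strip basic punctuation, and drop stopwords."""
    words = (w.strip("?.,!:;-") for w in text.lower().split())
    return {w for w in words if w and w not in STOPWORDS}


def retrieve(question, documents, k=2):
    """Step 2: rank documents by keyword overlap and keep the top k with a hit."""
    q = tokenize(question)
    scored = [(len(q & tokenize(doc)), doc) for doc in documents]
    scored = [pair for pair in scored if pair[0] > 0]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:k]]


def augment(question, context_docs):
    """Step 3: build a prompt that contains only the retrieved context."""
    context = "\n".join(context_docs) or "(no relevant context found)"
    return f"CONTEXT:\n{context}\n\nQUESTION: {question}\nANSWER:"


def fake_llm(prompt):
    """Step 4 stand-in: a real system would send the prompt to an LLM."""
    return f"[model would answer from a {len(prompt)}-character prompt]"


question = "Who is our CTO?"
context_docs = retrieve(question, KNOWLEDGE)   # Step 2
prompt = augment(question, context_docs)       # Step 3
print(prompt)
print(fake_llm(prompt))                        # Steps 4 and 5
```

Unlike `simple_rag_query`, which passes the whole knowledge base to the model, this sketch sends only the retrieved lines, which is what keeps prompts small once the knowledge base grows.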

## 🎯 Next Steps

In the next modules, we'll dive deeper into each component of the RAG system:
- **Module 2**: How to load and process different types of documents
- **Module 3**: Strategies for breaking documents into chunks
- **Module 4**: Understanding embedding models for semantic search
- And much more!

The simple RAG we built here is just the beginning. Real-world RAG systems are much more sophisticated and powerful!

## 🤔 Discussion Questions

1. In which scenarios would you prefer a standard LLM over RAG?
2. What types of company data would be most valuable to include in a RAG system?
3. How might RAG help with compliance and audit requirements?
4. What are potential challenges with implementing RAG in a large organization?

## 📝 Optional Exercise

Try creating your own knowledge base for a fictional company or organization and test the RAG system with domain-specific questions!

# Module 2: Document Types & Data Sources

## 🎯 Learning Objectives
By the end of this module, you will:
- Use LangChain document loaders for various file formats
- Process the loaded documents and attach metadata

## 📚 Key Concepts

### Why Document Processing Matters
In RAG systems, the quality of your document processing directly impacts your final results. 

**Garbage In = Garbage Out!**

### Document Types & Challenges

| Document Type | Common Issues | Best Approach |
|---------------|---------------|---------------|
| **Plain Text** | Encoding, structure | Simple loaders |
| **PDF** | Tables, images, layout | AI-powered parsing |
| **HTML** | Noise, dynamic content | Smart cleaning |
| **CSV/JSON** | Structure preservation | Specialized loaders |
| **Images** | OCR accuracy | Multimodal models |
| **Code** | Syntax preservation | Language-aware parsing |
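
As a rough illustration of the "structure preservation" row for CSV/JSON, the sketch below turns each CSV row into its own record and keeps the column names in the text, so a retriever can later match on field names as well as values. It uses only the standard library; `rows_to_records` and the sample data are hypothetical, and LangChain's own loaders may structure their output differently.

```python
import csv
import io


def rows_to_records(csv_text, source):
    """Turn each CSV row into one record, keeping column names in the text."""
    reader = csv.DictReader(io.StringIO(csv_text))
    records = []
    for row_number, row in enumerate(reader):
        # "column: value" lines keep the table structure readable as plain text.
        content = "\n".join(f"{column}: {value}" for column, value in row.items())
        records.append({
            "page_content": content,
            "metadata": {"source": source, "row": row_number},
        })
    return records


# Made-up sample data, for illustration only.
sample_csv = "product,price\nModel-X,100\nModel-Y,80\n"
records = rows_to_records(sample_csv, source="products.csv")
for record in records:
    print(record["metadata"], "->", record["page_content"].replace("\n", " | "))
```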

### 2025 Breakthroughs 🚀
- **LlamaParse**: AI-powered PDF parsing with vision-language models
- **Hybrid Multimodal**: Combining traditional + AI approaches
- **Markdown Intermediate**: Better structure preservation
- **Unstructured Library**: Production-ready document processing


In [None]:
from datetime import datetime, timezone
from pathlib import Path
from langchain_community.document_loaders import PyPDFLoader

PDF_PATH = Path("data/sample.pdf")
loader = PyPDFLoader(str(PDF_PATH))
docs = loader.load()  # one Document per page

# Add simple metadata to every page
for i, d in enumerate(docs):
    d.metadata.update({
        "doc_id": "sample_pdf",
        "source": str(PDF_PATH),
        "page": d.metadata.get("page", i),
        "type": "pdf",
        "module": "doc_processing_min",
        # datetime.utcnow() is deprecated; use a timezone-aware UTC timestamp
        "ingested_at": datetime.now(timezone.utc).isoformat().replace("+00:00", "Z"),
    })

# Cell output: page count, first page's metadata, and a short text preview
len(docs), docs[0].metadata, docs[0].page_content[:300]


#print.... and show


In [None]:
### code for loading all, but will not run it..

# Module 3: Chunking Strategies & Implementation

## 🎯 Learning Objectives
By the end of this module, you will:
- Understand why chunking is essential for RAG systems
- Implement different chunking strategies using LangChain
- Compare fixed-size, semantic, and recursive chunking approaches
- Optimize chunk sizes for different use cases
- Preserve context and metadata across chunks
- Apply 2025's latest semantic chunking techniques

## 📚 Key Concepts

### Why Chunking is Critical 🔪

**Think of chunking like creating sections and subsections in a book:**
- Writers don't write an entire book as one paragraph.
- They organize it into chapters, sections, subsections, and paragraphs.
- RAG does the same with your documents.

### The Chunking Challenge
| Too Small 📏 | Just Right ✅ | Too Large 📐 |
|--------------|---------------|---------------|
| Loses context | Preserves meaning | Hard to match |
| Poor retrieval | Good relevance | Noise |
| Fast processing | Balanced | Slow processing |
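
The size trade-off in the table can be seen with a small fixed-size splitter. This is a minimal sketch in plain Python, not LangChain's implementation; `fixed_size_chunks` is a hypothetical helper and the alphabet string is only a stand-in for a real document.

```python
def fixed_size_chunks(text, chunk_size, overlap=0):
    """Split text into windows of chunk_size characters sharing `overlap` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
        start += chunk_size - overlap
    return chunks


text = "abcdefghijklmnopqrstuvwxyz"  # 26 characters
for size, overlap in [(5, 0), (10, 2), (26, 0)]:
    chunks = fixed_size_chunks(text, size, overlap)
    print(f"size={size:>2} overlap={overlap}: {len(chunks)} chunks -> {chunks}")
```

Small sizes produce many fragments with little context each, while a size equal to the document length produces a single chunk that is hard to match precisely.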

### 2025 Chunking Evolution 🚀
- **Semantic Chunking**: AI determines natural boundaries
- **Adaptive Chunking**: Dynamic sizing based on content type
- **Contextual Preservation**: Better metadata and overlap strategies
- **Embedding-Based Splitting**: Use embeddings to find semantic breaks
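
One way to picture embedding-based splitting: embed consecutive sentences and start a new chunk wherever the similarity between neighbours drops below a threshold. The sketch below is a toy under stated assumptions: `toy_embed` is a bag-of-words stand-in for a real embedding model, the threshold of 0.2 is arbitrary, and `semantic_chunks` is a hypothetical helper.

```python
import math
from collections import Counter


def toy_embed(sentence):
    """Stand-in for a real embedding model: a bag-of-words count vector."""
    return Counter(word.strip(".,!?").lower() for word in sentence.split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[word] * b[word] for word in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def semantic_chunks(sentences, threshold=0.2):
    """Start a new chunk wherever neighbouring sentences are dissimilar."""
    if not sentences:
        return []
    chunks = [[sentences[0]]]
    for previous, current in zip(sentences, sentences[1:]):
        if cosine(toy_embed(previous), toy_embed(current)) < threshold:
            chunks.append([current])    # similarity dropped: treat as a topic boundary
        else:
            chunks[-1].append(current)  # still on the same topic
    return [" ".join(chunk) for chunk in chunks]


sentences = [
    "The return policy covers all products.",
    "The policy lasts 30 days for all products.",
    "Our London office opened in March.",
    "The London office hosts the sales team.",
]
for chunk in semantic_chunks(sentences):
    print("-", chunk)
```

With a real embedding model the same loop applies; only `toy_embed` and the threshold would change.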

### Common Chunking Strategies
1. **Fixed-Size**: Split by character/token count
2. **Semantic**: Split by meaning and context
3. **Recursive**: Try multiple separators hierarchically
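
The recursive strategy can be sketched as: split on the coarsest separator first, and fall back to finer separators only for pieces that are still too long. This is a simplified illustration; `recursive_split` is a hypothetical helper and, unlike LangChain's `RecursiveCharacterTextSplitter`, it does not merge small pieces back together or add overlap.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ")):
    """Split on the coarsest separator first; recurse with finer ones if needed."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators:
        # No separators left: fall back to a hard cut every chunk_size characters.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    first, rest = separators[0], separators[1:]
    chunks = []
    for piece in text.split(first):
        chunks.extend(recursive_split(piece, chunk_size, rest))
    return chunks


document = "Intro line.\n\nFirst paragraph, line one.\nFirst paragraph, line two.\n\nEnd."
for chunk in recursive_split(document, chunk_size=30):
    print(repr(chunk))
```

Paragraph breaks are tried first, so short paragraphs stay intact and only the long middle paragraph is split again on single newlines.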


In [None]:
import os
from pathlib import Path
import matplotlib.pyplot as plt
import numpy as np
from collections import Counter

# LangChain imports (current package layout, matching the langchain_community import in Module 2)
from langchain_text_splitters import (
    CharacterTextSplitter,
    RecursiveCharacterTextSplitter,
    TokenTextSplitter,
    MarkdownHeaderTextSplitter
)
from langchain_core.documents import Document
from langchain_community.document_loaders import TextLoader

# For token counting
import tiktoken

# For semantic chunking (2025 approach)
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

print("✅ Setup complete!")
print("🔪 Ready to chunk documents!")

In [None]:
### will move this to a separate file..
 
def analyze_chunks(chunks, method_name):
    """
    Analyze and visualize chunk characteristics
    """
    print(f"\n📊 {method_name} Analysis:")
    print(f"   Total chunks: {len(chunks)}")
    
    if chunks:
        # Calculate statistics
        chunk_sizes = [len(chunk.page_content) for chunk in chunks]
        
        print(f"   Avg chunk size: {np.mean(chunk_sizes):.0f} chars")
        print(f"   Min chunk size: {min(chunk_sizes)} chars")
        print(f"   Max chunk size: {max(chunk_sizes)} chars")
        print(f"   Std deviation: {np.std(chunk_sizes):.0f} chars")
        
        # Show first chunk preview
        print(f"\n📝 First chunk preview:")
        print(f"   {chunks[0].page_content[:150]}...")
        
        return chunk_sizes
    return []

# Test different fixed-size approaches
print("🔪 FIXED-SIZE CHUNKING EXPERIMENTS")
print("=" * 50)

# Use the technical document for comparison
test_doc = docs[0]  # API documentation



In [None]:
# 1. Character-based chunking
char_splitter = CharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=50,
    separator="\n\n"  # Split on paragraphs when possible
)

char_chunks = char_splitter.split_documents([test_doc])
char_sizes = analyze_chunks(char_chunks, "Character-based (500 chars)")

In [None]:
# 2. Token-based chunking (more precise for LLMs)
token_splitter = TokenTextSplitter(
    chunk_size=100,  # 100 tokens
    chunk_overlap=20
)

token_chunks = token_splitter.split_documents([test_doc])
token_sizes = analyze_chunks(token_chunks, "Token-based (100 tokens)")

In [None]:
## todo: add metadata section
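
Until the metadata section is written, here is a minimal sketch of one way to carry a page's metadata into every chunk and add a chunk index and start offset. It uses plain dicts rather than LangChain `Document` objects; `chunk_with_metadata` and the field names `chunk_index` and `start_char` are hypothetical. LangChain's `split_documents` already copies each document's metadata onto its chunks, so this only illustrates the idea.

```python
def chunk_with_metadata(text, metadata, chunk_size=40):
    """Split text into fixed-size chunks that each inherit the parent metadata."""
    chunks = []
    for index, start in enumerate(range(0, len(text), chunk_size)):
        chunk_metadata = dict(metadata)  # copy, so chunks do not share state
        chunk_metadata.update({"chunk_index": index, "start_char": start})
        chunks.append({
            "page_content": text[start:start + chunk_size],
            "metadata": chunk_metadata,
        })
    return chunks


page_text = "Return Policy: 30-day money-back guarantee on all products."
page_metadata = {"source": "data/sample.pdf", "page": 0}
chunks = chunk_with_metadata(page_text, page_metadata)
for chunk in chunks:
    print(chunk["metadata"], repr(chunk["page_content"]))
```

Keeping `source` and `page` on every chunk is what later makes source attribution possible when a chunk is retrieved.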

# Module 4: Understanding Embedding Models

## 🎯 Learning Objectives
By the end of this module, you will:
- Understand the history of embeddings
- Create your first embeddings
- Visualize embeddings
- Convert the chunks into embeddings

## 📚 Key Concepts



**Modern text embeddings build on earlier static word vectors but are much more powerful:**
- **Contextual**: Same word has different embeddings in different contexts
- **Semantic**: Capture meaning, not just word co-occurrence
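
"Capturing meaning" is usually measured with cosine similarity between vectors. The worked example below uses made-up 3-dimensional vectors chosen only to illustrate the arithmetic; real embeddings have hundreds or thousands of dimensions and come from a model, and `cosine_similarity` here is a hand-written helper.

```python
import math


def cosine_similarity(a, b):
    """cos(a, b) = (a . b) / (|a| * |b|)"""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up 3-dimensional "embeddings", for illustration only.
vectors = {
    "refund policy": [0.9, 0.1, 0.0],
    "money-back guarantee": [0.8, 0.2, 0.1],
    "London office": [0.0, 0.2, 0.9],
}

query = vectors["refund policy"]
for name, vector in vectors.items():
    print(f"{name:>22}: {cosine_similarity(query, vector):.3f}")
```

The two policy phrases point in nearly the same direction and score close to 1, while the unrelated phrase scores close to 0.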

### Embedding Evolution Timeline 📈
| Era | Approach | Example | Context |
|-----|----------|---------|----------|
| 2013 | Static Word Vectors | Word2Vec | One vector per word |
| 2018 | Contextualized | BERT | Different vectors per context |
| 2019 | Sentence-Level | Sentence-BERT | Optimized for sentences |
| 2024 | Large-Scale | OpenAI text-embedding-3 | Billions of parameters |
| 2025 | Specialized | NV-Embed-v2, Stella | Domain & task optimized |

### 2025 MTEB Leaderboard Leaders 🏆
- **NV-Embed-v2**: 72.31 score (NVIDIA, Mistral-7B based)
- **Stella-1.5B**: Best open-source with commercial license
- **text-embedding-3-large**: OpenAI's flagship (64.6% MTEB)
- **EmbeddingGemma**: Google's best under 500M parameters
- **Voyage-3**: Strong commercial performance


In [None]:
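# Generate embeddings for the loaded documents
# (pass the Module 3 chunks instead of docs to embed chunks)
import time

# Assumption: the notebook does not show how embedding_model is created.
# Any LangChain embeddings class that provides embed_documents() works here.
from langchain_openai import OpenAIEmbeddings
embedding_model = OpenAIEmbeddings(model="text-embedding-3-small")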

print("Generating embeddings...")
start_time = time.time()

# Extract text content from documents
texts = [doc.page_content for doc in docs]

# Generate embeddings in batch (more efficient)
embeddings = embedding_model.embed_documents(texts)

end_time = time.time()
print(f"Generated {len(embeddings)} embeddings in {end_time - start_time:.2f} seconds")
print(f"Each embedding has {len(embeddings[0])} dimensions")

# Preview first embedding
print(f"First embedding (first 10 values): {embeddings[0][:10]}")
print(f"Embedding magnitude: {sum(x**2 for x in embeddings[0])**0.5:.3f}")  # Should be ~1.0 if normalized
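
The magnitude check matters because, for unit-length vectors, the dot product equals the cosine similarity, so a search can skip the division. A minimal sketch of L2 normalization follows; `l2_normalize` is a hypothetical helper and the input vector is made up.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm else list(vector)


raw = [3.0, 4.0]  # made-up vector with magnitude 5
unit = l2_normalize(raw)
magnitude = math.sqrt(sum(x * x for x in unit))
print(unit, magnitude)
```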


In [None]:
## viz..

# Module 5: Vector Stores & Databases

## 🎯 Learning Objectives
By the end of this module, you will:
- Understand vector database architecture and why traditional databases aren't suitable for similarity search
- Compare different vector store options (local vs cloud, open-source vs commercial)
- Implement CRUD operations for vectors with metadata
- Design effective metadata schemas for filtering and organization
- Benchmark and optimize vector database performance

## 📚 Key Concepts

### Why Vector Databases?

Traditional databases excel at exact matches but struggle with **similarity search**.


#### 🚫 Traditional Database Limitations
1. **No built-in similarity search**: SQL doesn't understand "semantic closeness"
2. **Poor scalability**: Linear scan becomes prohibitive with millions of vectors
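
The linear-scan problem can be made concrete with a brute-force search: every query is compared against every stored vector, so the cost grows with the size of the collection. This is a minimal sketch with random vectors; `brute_force_search` is a hypothetical helper, and the collection size and dimension are arbitrary.

```python
import math
import random


def brute_force_search(query, vectors, k=3):
    """Exact nearest neighbours by scanning every vector: O(n * d) per query."""
    def distance(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    ranked = sorted(range(len(vectors)), key=lambda i: distance(query, vectors[i]))
    return ranked[:k]


random.seed(0)  # fixed seed so the run is repeatable
dimensions = 8
vectors = [[random.random() for _ in range(dimensions)] for _ in range(1000)]
query = vectors[42]  # a stored vector is its own nearest neighbour

top = brute_force_search(query, vectors)
print("Nearest indices:", top)
print("Distance computations for one query:", len(vectors))
```

At a thousand vectors this is instant; at millions of high-dimensional vectors, one full scan per query is what ANN indexes are designed to avoid.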

#### ✅ Vector Database Advantages
1. **Approximate Nearest Neighbor (ANN)**: Fast similarity search with controllable accuracy
2. **Metadata filtering**: Combine vector similarity with traditional filters
3. **Horizontal scaling**: Built for production workloads with millions/billions of vectors
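
Metadata filtering and basic create/update/delete operations can be sketched with a toy in-memory store: filter on metadata first, then rank the survivors by similarity. `TinyVectorStore` is a hypothetical class that does exact search rather than ANN, and the vectors and metadata are made up; real vector databases add indexes so they do not scan every item.

```python
import math


class TinyVectorStore:
    """Toy in-memory store: exact search plus metadata filtering (no ANN index)."""

    def __init__(self):
        self._items = {}  # id -> (vector, metadata)

    def upsert(self, item_id, vector, metadata):
        """Create or update an item."""
        self._items[item_id] = (vector, metadata)

    def delete(self, item_id):
        """Remove an item; unknown ids are ignored."""
        self._items.pop(item_id, None)

    def search(self, query, k=2, where=None):
        """Return the k most similar ids whose metadata matches every `where` pair."""
        where = where or {}
        candidates = [
            (self._cosine(query, vector), item_id)
            for item_id, (vector, metadata) in self._items.items()
            if all(metadata.get(key) == value for key, value in where.items())
        ]
        candidates.sort(reverse=True)  # highest similarity first
        return [item_id for _, item_id in candidates[:k]]

    @staticmethod
    def _cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
        return dot / norms if norms else 0.0


store = TinyVectorStore()
store.upsert("policy-1", [0.9, 0.1], {"type": "pdf", "page": 0})
store.upsert("news-1", [0.8, 0.3], {"type": "html", "page": 0})
store.upsert("news-2", [0.1, 0.9], {"type": "html", "page": 1})

print(store.search([1.0, 0.0]))                          # no filter
print(store.search([1.0, 0.0], where={"type": "html"}))  # only HTML documents
```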

### 2025 Vector Database Landscape 🏆

| Database | Query Latency | Cost | Best For |
|----------|---------------|------|---------|
| **Pinecone** | 23ms p95 | High | Enterprise, turnkey scale |
| **Qdrant** | ~30ms | Low | Complex filters, self-hosted |
| **Weaviate** | 34ms p95 | Medium | OSS flexibility, GraphQL |
| **Milvus** | Lowest | Variable | GPU acceleration |
| **Chroma** | 20ms p50 | Free | Fast prototyping |

### Database Categories

#### 🏠 Local/Embedded Options
- **Chroma**: SQLite-based, perfect for development
- **FAISS**: Meta's library, CPU/GPU optimized
- **Hnswlib**: Pure HNSW implementation, very fast

#### ☁️ Cloud/Managed Options
- **Pinecone**: Fully managed, highest performance
- **Weaviate**: Open-source with cloud hosting
- **Qdrant**: Rust-based, excellent cost/performance

#### 🗄️ Traditional DB Extensions
- **pgvector**: PostgreSQL extension
- **Redis**: In-memory vector search
- **Elasticsearch**: Dense vector search support