# 🔍 RAG: Retrieval Augmented Generation

## Giving LLMs Long-Term Memory

**Problem**: An LLM's knowledge stops at its training cutoff date.

**Solution**: Retrieve relevant context at query time and add it to the prompt.

---


In [None]:
import numpy as np
print('✅ RAG concepts ready!')


## RAG Architecture

### The Pipeline

1. **Ingestion**
   - Chunk documents
   - Generate embeddings
   - Store in vector DB

2. **Retrieval**
   - User query → embedding
   - Similarity search
   - Top-k relevant chunks

3. **Generation**
   - Inject chunks into prompt
   - LLM generates answer
   - Cite sources

### Chunking Strategies

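- **Fixed-size**: Equal-length windows, e.g. 512 tokens with a 50-token overlap between neighbors
- **Semantic**: Split on natural boundaries such as paragraphs or sentences
- **Recursive**: Split on the largest separator first (sections, paragraphs), then split any oversized piece again on smaller separators (sentences, words)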
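
The fixed-size strategy can be sketched in a few lines. This is a minimal illustration, not a production chunker: the function name `chunk_fixed` is made up for this example, and the generated `tok0, tok1, ...` list stands in for the output of a real tokenizer.

```python
def chunk_fixed(tokens, size=512, overlap=50):
    """Split a token list into windows of `size` tokens that share `overlap` tokens."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap  # how far each window advances
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # this window already reaches the end of the input
    return chunks

# Assumption: a synthetic token list stands in for real tokenizer output.
tokens = [f"tok{i}" for i in range(1200)]
chunks = chunk_fixed(tokens)
print([len(c) for c in chunks])  # [512, 512, 276]
```

The overlap means the last 50 tokens of one chunk reappear at the start of the next, so a sentence cut by a chunk boundary is still fully present in at least one chunk.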

### Embedding Models

| Model | Dimensions | Relative speed | Relative quality |
|-------|-----------|-------|----------|
| **all-MiniLM-L6-v2** | 384 | Fast | Good |
| **text-embedding-ada-002** | 1536 | Medium | Excellent |
| **BGE-large** | 1024 | Slow | Best |
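
The dimensions column also drives storage cost. As a rough back-of-the-envelope sketch, assuming vectors are stored as float32 (4 bytes per value, an assumption; many stores quantize or compress) and ignoring index overhead, the raw size is `chunks × dimensions × 4` bytes. The helper name `index_size_mib` is hypothetical.

```python
# Dimensions come from the table above; float32 storage is an assumption.
MODEL_DIMS = {
    "all-MiniLM-L6-v2": 384,
    "text-embedding-ada-002": 1536,
    "BGE-large": 1024,
}

def index_size_mib(num_chunks, dims, bytes_per_value=4):
    """Raw size of the embedding matrix in MiB, ignoring index overhead."""
    return num_chunks * dims * bytes_per_value / (1024 ** 2)

for name, dims in MODEL_DIMS.items():
    print(f"{name}: {index_size_mib(1_000_000, dims):,.0f} MiB per million chunks")
```

A 1536-dimensional model needs four times the raw storage of a 384-dimensional one for the same corpus, which is worth weighing against any quality gain.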

### Vector Databases

- **Chroma**: Simple, local, great for prototyping
- **Qdrant**: Production-ready, scalable
- **Pinecone**: Managed, serverless
- **Weaviate**: Hybrid search (dense + sparse)


In [None]:
# RAG pipeline (simplified): random vectors stand in for a real embedding model
class SimpleRAG:
    def __init__(self, documents, dim=384, seed=0):
        self.documents = documents
        self.dim = dim
        self.rng = np.random.default_rng(seed)  # seeded so runs are reproducible
        self.embeddings = self._embed_documents(documents)

    def _embed_documents(self, docs):
        # Simplified: random embeddings, so retrieval results are arbitrary
        # In practice: use SentenceTransformer
        return self.rng.standard_normal((len(docs), self.dim))

    def retrieve(self, query, k=3):
        # Embed query (simplified: a random vector that ignores the query text)
        query_emb = self.rng.standard_normal(self.dim)

        # Cosine similarity = dot product divided by the product of the norms
        norms = np.linalg.norm(self.embeddings, axis=1) * np.linalg.norm(query_emb)
        similarities = (self.embeddings @ query_emb) / norms

        # Top-k indices, most similar first (also correct for k=0)
        top_k_idx = np.argsort(similarities)[::-1][:k]
        return [self.documents[i] for i in top_k_idx]

    def generate(self, query):
        # Retrieve context
        context = self.retrieve(query)

        # Build prompt
        prompt = f"Context: {' '.join(context)}\n\nQuestion: {query}\n\nAnswer:"

        # In practice: send `prompt` to an LLM API and return its response
        return "[LLM would generate answer here]"

print('✅ RAG pipeline structure!')


## Advanced RAG Techniques

### HyDE (Hypothetical Document Embeddings)
1. Have the LLM generate a hypothetical answer to the query
2. Embed the hypothetical answer instead of the raw query
3. Retrieve the real documents closest to that embedding
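
The three steps above can be sketched as follows. Everything here is a stand-in: `embed` is a toy bag-of-words embedding rather than a real model, `fake_llm` returns a canned answer in place of an LLM call, and `hyde_retrieve` is a hypothetical name for this example.

```python
import math
from collections import Counter

def embed(text):
    """Toy bag-of-words embedding; a stand-in for a real embedding model."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def hyde_retrieve(query, docs, generate_hypothetical, k=2):
    # 1. Ask the LLM for a plausible (possibly wrong) answer to the query.
    hypothetical = generate_hypothetical(query)
    # 2. Embed the hypothetical answer instead of the raw query.
    hyp_emb = embed(hypothetical)
    # 3. Rank the real documents by similarity to that embedding.
    ranked = sorted(docs, key=lambda d: cosine(hyp_emb, embed(d)), reverse=True)
    return ranked[:k]

docs = [
    "Chunk documents before generating embeddings",
    "Rerank retrieved chunks with a slower model",
    "Store embeddings in a vector database",
]

def fake_llm(query):
    """Hypothetical stand-in for an LLM call; returns a canned answer."""
    return "embeddings are stored in a vector database"

print(hyde_retrieve("Where do the vectors go?", docs, fake_llm, k=1))
```

The query shares no words with any document, but the hypothetical answer does, which is the situation HyDE is meant to help with: answers tend to look more like the stored documents than questions do.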

### Multi-Query
1. Have the LLM generate several rephrasings of the query
2. Run retrieval for each rephrasing
3. Combine and deduplicate the results
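
The text does not say how the results are combined. One common option is reciprocal rank fusion (RRF), sketched below as an assumption rather than as the method this notebook prescribes. The document ids and the constant `c=60` (a conventional default) are illustrative.

```python
from collections import defaultdict

def reciprocal_rank_fusion(ranked_lists, c=60):
    """Merge several ranked lists of doc ids into one ranking.

    Each appearance contributes 1 / (c + rank); a larger c flattens the
    difference between top and lower ranks.
    """
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (c + rank)
    # Each doc id appears once in the output, so duplicates are merged.
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical retrieval results for three rephrasings of the same question.
results_per_query = [
    ["doc_a", "doc_b", "doc_c"],
    ["doc_b", "doc_a", "doc_d"],
    ["doc_b", "doc_e", "doc_a"],
]
fused = reciprocal_rank_fusion(results_per_query)
print(fused)
```

Documents that rank well across several rephrasings (`doc_b`, `doc_a`) rise to the top, while documents returned by only one rephrasing fall behind them.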

### Reranking
1. Retrieve the top 20 candidates with a fast method (e.g. vector similarity)
2. Rerank them with a slower, more accurate model (e.g. a cross-encoder)
3. Return the top 5
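
A minimal sketch of the two-stage flow, assuming both scorers are plain functions of `(query, doc)`. The scorers here are toy stand-ins: `term_overlap` plays the fast retriever and `phrase_bonus` plays the slower reranker; neither is a real model, and all names are hypothetical.

```python
def retrieve_then_rerank(query, docs, fast_score, slow_score, n_candidates=20, n_final=5):
    # Stage 1: cheap scoring over the whole corpus, keep a generous shortlist.
    candidates = sorted(docs, key=lambda d: fast_score(query, d), reverse=True)[:n_candidates]
    # Stage 2: expensive scoring on the shortlist only.
    return sorted(candidates, key=lambda d: slow_score(query, d), reverse=True)[:n_final]

def term_overlap(query, doc):
    """Fast stand-in: number of shared lowercase terms."""
    return len(set(query.lower().split()) & set(doc.lower().split()))

def phrase_bonus(query, doc):
    """Slow stand-in for a reranker: also rewards an exact phrase match."""
    return term_overlap(query, doc) + (10 if query.lower() in doc.lower() else 0)

docs = [
    "search vector similarity is fast",
    "vector similarity search finds the nearest chunks",
    "sparse keyword search",
]
best = retrieve_then_rerank("vector similarity search", docs, term_overlap, phrase_bonus, n_final=1)
print(best)
```

The first two documents tie under the fast scorer, and only the second stage separates them. The expensive model runs on at most `n_candidates` documents, so its cost stays fixed as the corpus grows.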

### Production Considerations

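- ✅ **Caching**: Cache embeddings and retrieval results for repeated inputs
- ✅ **Monitoring**: Track retrieval quality over time
- ✅ **Evaluation**: Use RAGAS metrics such as faithfulness, answer relevancy, and context precision
- ✅ **Cost**: Embedding API calls add up at scale, so avoid re-embedding unchanged text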
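
Caching and cost are closely linked: every cache hit is an embedding call that is not paid for. The sketch below uses the standard library's `functools.lru_cache` as an in-memory cache. `call_embedding_api` is a hypothetical stand-in that returns a fake vector and counts calls; a real deployment would more likely use a persistent cache shared across processes.

```python
from functools import lru_cache

api_calls = 0  # counts how often the simulated paid endpoint is hit

def call_embedding_api(text):
    """Hypothetical stand-in for a paid embedding API call."""
    global api_calls
    api_calls += 1
    return tuple(float(ord(ch)) for ch in text[:4])  # fake vector, not a real embedding

@lru_cache(maxsize=10_000)
def embed_cached(text):
    # Identical strings are served from memory instead of calling the API again.
    return call_embedding_api(text)

for q in ["what is rag", "what is rag", "what is hyde", "what is rag"]:
    embed_cached(q)

print(api_calls, embed_cached.cache_info().hits)  # 2 2
```

Four lookups trigger only two API calls here. The cache key is the exact string, so normalizing queries (case, whitespace) before lookup would raise the hit rate.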
