### 4. Retrieval-Augmented Generation (RAG)
    4.1 Document Loading
            * File loaders
                PDF (PyPDFLoader/Docling/Nougat)
                Text
                Word
                HTML
                Markdown
            * Web loaders
                WebBaseLoader
                PlaywrightLoader (dynamic sites)
                SitemapLoader
            * Cloud loaders
                S3 loader
                GCS loader
                Azure Blob loader
    4.2 Text Splitting
                RecursiveCharacterTextSplitter — BEST for most cases
                Token-based (OpenAI’s tiktoken)
                Document-specific (Markdown, HTML, Python code)
    4.3 Embeddings
            * Embeddings convert text → numeric vectors.
            * Popular embedding models
                OpenAIEmbeddings
                CohereEmbeddings
                HuggingFaceEmbeddings
                Google VertexAI embeddings
                Jina embeddings
    4.4 Key Concepts
            * Dimensionality                
            * Context window
            * Similarity metrics (cosine, dot)
            * Performance trade-offs

    4.5 Vector Stores (Databases for RAG)
            FAISS (local)
            ChromaDB (local)
            Pinecone (managed)
            Weaviate
            Milvus
            Elasticsearch
            
            Important topics
                Index creation
                Metadata filters
                Hybrid search (dense + keyword)
                Incremental updates
    4.6 Retrievers
            VectorStoreRetriever
            MultiQueryRetriever (uses LLM to rewrite queries)
            ContextualCompressionRetriever
            ParentDocumentRetriever
            EnsembleRetriever (combines BM25 + vector search)

    4.7 Retrieval-Augmented Generation (RAG)
            The most common RAG pipeline:
            Load → Split → Embed → Store → Retrieve → LLM
            Advanced RAG techniques
                RAG-Fusion (multiple query expansion)
                LLM-based ranking
                Graph RAG (knowledge graph embeddings)
                HyDE (Hypothetical Document Embeddings)
                Self-RAG / ReAct-RAG
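
The Load → Split → Embed → Store → Retrieve → LLM flow from 4.7 can be sketched end to end with only the standard library. This is a toy illustration, not LangChain code: `embed` is a word-count stand-in for a real embedding model, the "store" is a plain list, and the function names (`build_store`, `retrieve`, `build_prompt`) are hypothetical. It also shows the two similarity metrics named in 4.4: dot product and cosine.

```python
import math
from collections import Counter


def embed(text):
    """Toy embedding: word counts (a stand-in for a real embedding model)."""
    return Counter(text.lower().split())


def dot(a, b):
    """Dot product of two sparse count vectors."""
    return sum(a[term] * b[term] for term in a.keys() & b.keys())


def cosine(a, b):
    """Cosine similarity: dot product normalised by both vector lengths."""
    norm = math.sqrt(dot(a, a)) * math.sqrt(dot(b, b))
    return dot(a, b) / norm if norm else 0.0


def build_store(chunks):
    """Store step: keep each chunk next to its vector (a list, not a real index)."""
    return [(chunk, embed(chunk)) for chunk in chunks]


def retrieve(store, query, k=2):
    """Retrieve step: rank stored chunks by cosine similarity to the query."""
    query_vector = embed(query)
    ranked = sorted(store, key=lambda item: cosine(query_vector, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]


def build_prompt(question, context_chunks):
    """LLM step (input only): put the retrieved chunks in front of the question."""
    context = "\n".join(f"- {chunk}" for chunk in context_chunks)
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"


# Already-split chunks (the Load and Split steps are assumed to have run).
demo_chunks = [
    "FAISS is a local vector index",
    "Pinecone is a managed vector database",
    "BM25 is a keyword ranking function",
]
question = "which vector database is managed"
store = build_store(demo_chunks)
top_chunks = retrieve(store, question, k=1)
prompt = build_prompt(question, top_chunks)
print(prompt)
```

In a real pipeline the word-count vectors would be replaced by one of the embedding models from 4.3 and the list by one of the vector stores from 4.5; the overall shape of the flow stays the same.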

#### ✅ 4.1 Document Loading
     
     * File loaders

##### PDF — PyPDFLoader

In [3]:
#!pip install pypdf

In [None]:
a) PyPDFLoader

In [16]:
from langchain_community.document_loaders import PyPDFLoader

loader = PyPDFLoader("saikrishna_java_fullstack_developer.pdf")
docs = loader.load()

In [None]:
b) PDF — Docling
c) PDF — Nougat OCR
d) Text Files
e) Word Documents (.docx)
f) HTML
g) Markdown
h) Web Loader — static pages
i) Playwright Loader — dynamic JS website
j) Sitemap Loader
k) S3 Loader
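
The loaders listed above all produce the same kind of result: a list of documents, each with `page_content` and a `metadata` dict that records the source. As a rough sketch of that shape for item d) Text Files, here is a standard-library stand-in. `SimpleDocument` and `load_text_file` are hypothetical names for illustration and are not LangChain classes.

```python
import tempfile
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class SimpleDocument:
    """Minimal stand-in for a loaded document: text plus metadata."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def load_text_file(path, encoding="utf-8"):
    """Read one text file and return it as a single-document list."""
    path = Path(path)
    text = path.read_text(encoding=encoding)
    return [SimpleDocument(page_content=text, metadata={"source": str(path)})]


# Demo on a temporary file so the cell runs without any local data.
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = Path(tmp_dir) / "sample.txt"
    sample_path.write_text("hello loaders", encoding="utf-8")
    text_docs = load_text_file(sample_path)

print(text_docs[0].page_content, text_docs[0].metadata["source"])
```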

#### ✅ 4.2 Text Splitting

##### a) RecursiveCharacterTextSplitter (a good default for most cases)

In [18]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
chunks = splitter.split_documents(docs)
chunks

[Document(metadata={'producer': 'www.smallpdf.com', 'creator': 'www.smallpdf.com', 'creationdate': 'D:20251209144849', 'moddate': 'D:20251209144849', 'source': 'saikrishna_java_fullstack_developer.pdf', 'total_pages': 6, 'page': 0, 'page_label': '1'}, page_content='Saikrishna Vinjamuri\nsaikrishna.spark@gmail.com\n4694492774\nProfessional Summary:\nJava Full Stack Engineer with 10+ years of experience in designing and developing enterprise\napplications using Java, Spring Boot and AWS. Proficient in building scalable microservices and\nintegrating LLM-based AI solutions using LangChain, Amazon Bedrock, and Cohere. Strong expertise\nin RESTful APIs and intelligent document processing.\nSummary:'),
 Document(metadata={'producer': 'www.smallpdf.com', 'creator': 'www.smallpdf.com', 'creationdate': 'D:20251209144849', 'moddate': 'D:20251209144849', 'source': 'saikrishna_java_fullstack_developer.pdf', 'total_pages': 6, 'page': 0, 'page_label': '1'}, page_content='Summary:\n● Proficient in so
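
To make `chunk_size` and `chunk_overlap` more concrete, here is a simplified standard-library sketch of the recursive idea: try coarse separators first, fall back to finer ones only for pieces that are still too long, then merge small pieces into chunks that share some trailing text. It is an approximation written for illustration, not LangChain's implementation. In particular it rejoins pieces with a single space, so the original separators are not preserved, and `recursive_split` and `merge_with_overlap` are hypothetical names.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Break text into pieces of at most chunk_size characters.

    Separators are tried from coarse to fine; the empty string means a hard cut.
    """
    if len(text) <= chunk_size:
        return [text] if text else []
    separator, finer = separators[0], separators[1:]
    if separator == "":
        # Last resort: cut every chunk_size characters.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    pieces = []
    for part in text.split(separator):
        pieces.extend(recursive_split(part, chunk_size, finer))
    return pieces


def merge_with_overlap(pieces, chunk_size, chunk_overlap, joiner=" "):
    """Pack pieces into chunks of at most chunk_size, carrying over a short tail."""
    chunks, current = [], []
    for piece in pieces:
        if current and len(joiner.join(current + [piece])) > chunk_size:
            chunks.append(joiner.join(current))
            # Keep only a tail that fits within chunk_overlap characters.
            while current and len(joiner.join(current)) > chunk_overlap:
                current.pop(0)
            # Drop more of the tail if the next piece would not fit otherwise.
            while current and len(joiner.join(current + [piece])) > chunk_size:
                current.pop(0)
        current.append(piece)
    if current:
        chunks.append(joiner.join(current))
    return chunks


sample_text = "one two three four five six seven eight nine ten"
pieces = recursive_split(sample_text, chunk_size=20)
demo_chunks = merge_with_overlap(pieces, chunk_size=20, chunk_overlap=8)
for chunk in demo_chunks:
    print(repr(chunk))
```

Each chunk stays within `chunk_size`, and the last word of one chunk reappears at the start of the next, which is the same effect that makes "Summary:" appear at the end of the first chunk and the start of the second in the output above.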

##### b) Token-based (OpenAI tiktoken)

In [11]:
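from langchain_text_splitters import TokenTextSplitter

# Requires the tiktoken package (pip install tiktoken).
# chunk_size and chunk_overlap are counted in tokens here, not characters.
splitter = TokenTextSplitter(chunk_size=200, chunk_overlap=50)
chunks = splitter.split_documents(docs)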
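
Token-based splitting amounts to a sliding window over the token sequence: each window holds `chunk_size` tokens and starts `chunk_size - chunk_overlap` tokens after the previous one. The sketch below shows that windowing on its own. It assumes the text has already been tokenized and uses plain integers in place of real tiktoken token ids; `split_by_tokens` is a hypothetical helper name.

```python
def split_by_tokens(tokens, chunk_size, chunk_overlap):
    """Return overlapping windows of at most chunk_size tokens."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    windows = []
    for start in range(0, len(tokens), step):
        windows.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the sequence
    return windows


# Integers stand in for token ids produced by a real tokenizer.
demo_tokens = list(range(10))
demo_windows = split_by_tokens(demo_tokens, chunk_size=4, chunk_overlap=2)
for window in demo_windows:
    print(window)
```

With the settings used in the cell above (200 and 50), each new chunk would start 150 tokens after the previous one.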

##### c) Markdown-specific splitting

In [14]:
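from langchain_text_splitters import MarkdownHeaderTextSplitter

# headers_to_split_on takes (marker, metadata key) tuples, not bare strings
headers_to_split_on = [("#", "Header 1"), ("##", "Header 2"), ("###", "Header 3")]
splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers_to_split_on)

# notes.md must exist in the working directory
with open("notes.md", encoding="utf-8") as f:
    chunks = splitter.split_text(f.read())
chunks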

FileNotFoundError: [Errno 2] No such file or directory: 'notes.md'
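
The error above only means that `notes.md` is not present in the working directory. To see what header-based splitting does without needing a file, here is a small standard-library sketch that works on an inline string. It is a simplified illustration, not LangChain's implementation: it does not handle fenced code blocks (a `# comment` inside one would be treated as a header), and `split_markdown_by_headers` is a hypothetical name.

```python
def split_markdown_by_headers(text, markers=("#", "##", "###")):
    """Split Markdown into sections, tagging each with its header path."""
    sections = []
    header_path = {}  # header level (1, 2, 3) -> current title at that level
    body_lines = []

    def flush():
        # Emit the text collected so far, if any, with a copy of the header path.
        content = "\n".join(body_lines).strip()
        if content:
            sections.append({"metadata": dict(header_path), "content": content})
        body_lines.clear()

    for line in text.splitlines():
        marker, _, title = line.strip().partition(" ")
        if marker in markers and title:
            flush()
            level = len(marker)
            # A new header replaces same-level and deeper headers.
            for stale in [lvl for lvl in header_path if lvl >= level]:
                del header_path[stale]
            header_path[level] = title.strip()
        else:
            body_lines.append(line)
    flush()
    return sections


sample_md = "# Guide\nIntro text.\n## Setup\nInstall it.\n## Usage\nRun it.\n"
md_sections = split_markdown_by_headers(sample_md)
for section in md_sections:
    print(section["metadata"], "->", section["content"])
```

The point of this kind of splitter is that each chunk keeps the headers it sat under as metadata, which can later be used for metadata filtering in a vector store.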