# 📄 LangChain: Document Loaders Guide


Document loaders are LangChain utilities that **load external data** (text files, PDFs, web pages, and so on) into `Document` objects, the format the rest of an LLM pipeline consumes.

---


## 🧾 What is a Document?

A `Document` in LangChain is a lightweight wrapper around text with optional **metadata**.

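```python
# Current releases expose Document from langchain_core;
# older releases imported it from langchain.schema.
from langchain_core.documents import Document

doc = Document(page_content="Hello world!", metadata={"source": "manual"})
```

## 📚 Common Loaders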

| Loader               | Use Case                             |
| -------------------- | ------------------------------------ |
| `TextLoader`         | Load plain text files                |
| `PyPDFLoader`        | Extract text from PDFs               |
| `WebBaseLoader`      | Scrape and load from a web URL       |
| `DirectoryLoader`    | Load multiple files from a folder    |
| `UnstructuredLoader` | Parse complex formats (tables, etc.) |
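
To make the loader idea concrete, here is a minimal, standard-library-only sketch of what a directory-style loader does: read each file and wrap its text in a document object that records the file path under `source` in the metadata. `SimpleDocument` and `load_text_directory` are hypothetical names invented for this illustration. They are not LangChain classes, and the real `DirectoryLoader` offers more options.

```python
from dataclasses import dataclass, field
from pathlib import Path
import tempfile


@dataclass
class SimpleDocument:
    """Hypothetical stand-in for LangChain's Document (text plus metadata)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def load_text_directory(folder: str, pattern: str = "*.txt") -> list[SimpleDocument]:
    """Read every file matching `pattern` and wrap it in a SimpleDocument."""
    docs = []
    for path in sorted(Path(folder).glob(pattern)):
        text = path.read_text(encoding="utf-8")
        docs.append(SimpleDocument(page_content=text, metadata={"source": str(path)}))
    return docs


# Usage: build a throwaway folder with two files and load it.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "a.txt").write_text("first file", encoding="utf-8")
    Path(tmp, "b.txt").write_text("second file", encoding="utf-8")
    loaded = load_text_directory(tmp)

print(len(loaded), loaded[0].page_content)
```

Keeping the source path in the metadata is what later lets a pipeline cite where a retrieved passage came from.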


### 📄 PDF Loaders in LangChain: Comparison Table

---

| Loader Name                     | Module Location                                | Backend         | Best Use Case                                                                 | Pros                                             | Cons                                              |
|--------------------------------|--------------------------------------------------|------------------|--------------------------------------------------------------------------------|--------------------------------------------------|---------------------------------------------------|
| **PyPDFLoader**                | `langchain_community.document_loaders`         | pypdf            | General-purpose PDF loading with page-level text                              | Simple, widely used, good for clean PDFs         | Poor table/image handling                         |
| **PDFMinerLoader**             | `langchain_community.document_loaders`         | pdfminer.six     | PDFs with precise layout/format-sensitive content                             | Preserves layout, detailed text control          | Slower, more complex to parse                     |
| **PDFPlumberLoader**           | `langchain_community.document_loaders`         | pdfplumber       | PDFs with tables and visual elements                                           | Great for tables and structured text             | Slightly heavier, may include layout noise        |
| **UnstructuredPDFLoader**     | `langchain_community.document_loaders`         | Unstructured.io  | Complex documents (invoices, HTML PDFs, emails)                               | Handles images, tables, structure well           | Requires `unstructured`, can be slow              |
| **PyMuPDFLoader**              | `langchain_community.document_loaders`         | PyMuPDF (fitz)   | Extract both text and metadata efficiently                                    | Fast, extracts images and metadata               | Requires the `pymupdf` package                    |
| **UnstructuredPDFLoader** (`strategy="ocr_only"`) | `langchain_community.document_loaders` | Unstructured.io with an OCR engine | OCR and scanned PDFs                                          | Works when PDFs are image-based or scanned       | OCR-dependent, requires Tesseract or similar      |

---

## ✅ Loader Selection Guidelines

- 📃 **Simple text PDFs?** → Use `PyPDFLoader`
- 🧾 **PDFs with tables/layouts?** → Use `PDFPlumberLoader` or `PDFMinerLoader`
- 🧠 **Scanned or image PDFs?** → Use `UnstructuredPDFLoader` with an OCR strategy (for example `strategy="ocr_only"`)
- 📊 **Need images/metadata?** → Use `PyMuPDFLoader`
- 🏗️ **Highly structured content (invoices, emails)?** → Use `UnstructuredPDFLoader`
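
The guidelines above can be read as a simple decision procedure. The sketch below encodes them in a hypothetical helper, `choose_pdf_loader`. The function name, its boolean flags, and the order of the checks are assumptions made for illustration, and it only returns the loader's class name as a string rather than importing anything. For scanned PDFs it returns `UnstructuredPDFLoader`, on the assumption that OCR runs through the `unstructured` package.

```python
def choose_pdf_loader(
    scanned: bool = False,
    highly_structured: bool = False,
    has_tables: bool = False,
    needs_images_or_metadata: bool = False,
) -> str:
    """Return the name of a loader class following the guidelines above.

    Hypothetical helper: the order of the checks is one reasonable
    priority, not something prescribed by LangChain.
    """
    if scanned:
        return "UnstructuredPDFLoader"  # assumed to run with an OCR strategy
    if highly_structured:
        return "UnstructuredPDFLoader"
    if has_tables:
        return "PDFPlumberLoader"
    if needs_images_or_metadata:
        return "PyMuPDFLoader"
    return "PyPDFLoader"  # default for simple text PDFs


print(choose_pdf_loader())                 # PyPDFLoader
print(choose_pdf_loader(has_tables=True))  # PDFPlumberLoader
```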

---

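```python
# Example 1: Load a text file
# (TextLoader now lives in langchain_community; the old
# langchain.document_loaders import path is deprecated.)
from langchain_community.document_loaders import TextLoader

loader = TextLoader("example.txt")
docs = loader.load()  # returns a list of Document objects

print(docs[0].page_content)
```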


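```python
# Example 2: Load a PDF (requires the pypdf package)
from langchain_community.document_loaders import PyPDFLoader

loader = PyPDFLoader("document.pdf")
docs = loader.load()  # one Document per page

print(docs[0].metadata)  # includes the source path and page number
```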

```python
# Example 3: Load a web page (requires beautifulsoup4)
from langchain_community.document_loaders import WebBaseLoader

loader = WebBaseLoader("https://en.wikipedia.org/wiki/Penguin")
docs = loader.load()

print(docs[0].page_content[:500])  # first 500 characters of the page text
```

### 🧠 How are Documents Used?
Once loaded, `Document` objects are commonly passed to:

- 🧱 Text splitters (for chunking)

- 🧠 Embedding models (to create vectors)

- 🗃️ Vector stores (to enable search)

- 🧭 Chains and agents (for contextual LLM tasks)
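
As an illustration of the first step in that list, the sketch below splits text into fixed-size chunks with a small overlap, using only the standard library. `split_text`, `chunk_size`, and `overlap` are hypothetical names chosen for this example. LangChain's own splitters (for instance `RecursiveCharacterTextSplitter`) are more careful about where they cut.

```python
def split_text(text: str, chunk_size: int = 20, overlap: int = 5) -> list[str]:
    """Cut `text` into chunks of at most `chunk_size` characters.

    Consecutive chunks share `overlap` characters, so text near a
    boundary appears in both neighboring chunks.
    """
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last chunk already reaches the end of the text
    return chunks


chunks = split_text("abcdefghijklmnopqrstuvwxyz", chunk_size=10, overlap=2)
print(chunks)  # ['abcdefghij', 'ijklmnopqr', 'qrstuvwxyz']
```

The overlap trades a little redundancy for a lower chance that a relevant phrase is cut in half at a chunk boundary.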

## 🧠 What is RAG?

**RAG** stands for **Retrieval-Augmented Generation**.  
It is an LLM architecture pattern in which external knowledge is **retrieved** from a data source (such as a vector database) and added to the prompt, **augmenting** it, before the prompt is sent to a language model.

### 🔍 Why RAG?

LLMs like GPT have limitations:
- **They can't know everything**: they are trained on a static snapshot of data.
- **They may hallucinate**: they can state incorrect facts with confidence.
- **They don't update in real time**: anything that happened after training is missing.


RAG addresses these limitations by:
- Fetching **relevant documents** from a trusted source
- Feeding those documents into the model for **context-aware generation**

---

## 🛠️ Tools Used in RAG

| Tool/Component       | Role in RAG                                           |
|----------------------|--------------------------------------------------------|
| **Document Loaders** | Load text, PDFs, or web data into Document objects     |
| **Text Splitters**   | Break large documents into manageable chunks           |
| **Embeddings**       | Convert text into vectors for semantic similarity      |
| **Vector Store**     | Store and search documents by meaning (e.g., FAISS)    |
| **Retriever**        | Pull relevant docs based on query similarity           |
| **LLM**              | Generate answers based on query + retrieved context    |
| **Prompt Template**  | Formats context and query into a prompt for the LLM    |
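
To show how the embedding, vector store, and retriever rows fit together, here is a toy sketch that uses word counts as a stand-in for embeddings and cosine similarity for search. Everything in it (`embed`, `cosine`, `TinyVectorStore`) is hypothetical and standard-library-only. A real pipeline would use a learned embedding model and a store such as FAISS.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy 'embedding': lowercase word counts (not a learned model)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[word] * b[word] for word in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class TinyVectorStore:
    """Keeps (text, vector) pairs in memory and ranks them by similarity."""

    def __init__(self) -> None:
        self._items: list[tuple[str, Counter]] = []

    def add(self, text: str) -> None:
        self._items.append((text, embed(text)))

    def search(self, query: str, k: int = 1) -> list[str]:
        query_vec = embed(query)
        ranked = sorted(
            self._items,
            key=lambda item: cosine(query_vec, item[1]),
            reverse=True,
        )
        return [text for text, _ in ranked[:k]]


store = TinyVectorStore()
store.add("penguins are flightless birds")
store.add("python is a programming language")
print(store.search("which birds are flightless"))
```

Word counts only match exact words, which is the gap learned embeddings close: they place texts with similar meaning near each other even when the wording differs.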

---

## 🔄 RAG Workflow

1. 📥 User submits a **query**
2. 📚 Retriever fetches **relevant chunks** from the knowledge base
3. 🧾 Prompt is composed from the query and the retrieved context
4. 🤖 LLM generates an **answer grounded in the retrieved documents**
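
The four steps can be sketched end to end as follows. This is a minimal illustration under stated assumptions: `retrieve` ranks chunks by shared words instead of vector similarity, `fake_llm` is a placeholder that only reports the prompt length where a real model call would go, and the prompt wording and knowledge base are invented for the example.

```python
PROMPT_TEMPLATE = (
    "Answer the question using only the context below.\n\n"
    "Context:\n{context}\n\n"
    "Question: {question}"
)

# Hypothetical knowledge base: in practice these would be loaded, split chunks.
KNOWLEDGE_BASE = [
    "Penguins live mostly in the Southern Hemisphere.",
    "FAISS is a library for vector similarity search.",
    "Emperor penguins are the largest penguin species.",
]


def retrieve(query: str, chunks: list[str], k: int = 2) -> list[str]:
    """Step 2: rank chunks by the number of words they share with the query."""
    query_words = set(query.lower().split())

    def overlap(chunk: str) -> int:
        return len(query_words & set(chunk.lower().split()))

    return sorted(chunks, key=overlap, reverse=True)[:k]


def build_prompt(query: str, context_chunks: list[str]) -> str:
    """Step 3: insert the retrieved chunks and the query into the template."""
    return PROMPT_TEMPLATE.format(context="\n".join(context_chunks), question=query)


def fake_llm(prompt: str) -> str:
    """Step 4 placeholder: a real pipeline would call a language model here."""
    return f"(model answer based on a {len(prompt)}-character prompt)"


query = "Where do penguins live"  # Step 1: the user's query
context = retrieve(query, KNOWLEDGE_BASE)
prompt = build_prompt(query, context)
print(prompt)
print(fake_llm(prompt))
```

Passing the retriever's output through an explicit template keeps the grounding visible: the model only sees the chunks that were selected for this query.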

---

> 💡 **RAG = Retrieval (search) + Generation (LLM response)**: it typically improves accuracy, reduces hallucination, and lets the knowledge base change without retraining the model.