# 📚 Data Ingestion with Document Loaders

## What is Data Ingestion?

**Data Ingestion** is the process of loading data from various sources into a format that LangChain can work with. Think of it as the "import" step - getting your data ready for processing.

## What are Document Loaders?

Document Loaders are LangChain's way of:
- 📄 Reading data from different sources (files, web, APIs)
- 🔄 Converting that data into a standardized `Document` format
- 📋 Preserving **metadata** (like source, page number, etc.)

### The Document Object

Every loader's `.load()` method returns a list of `Document` objects, each with two main attributes:
- `page_content`: The actual text content
- `metadata`: Information about the source (filename, page number, URL, etc.)
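To make the two attributes concrete, here is a minimal sketch that builds a `Document` by hand. It assumes `langchain-core` is installed; if it is not, the block falls back to a small stand-in dataclass that exists only for this illustration and is not LangChain's class. The text and source path are sample values.

```python
from dataclasses import dataclass, field

try:
    # Real class, available when langchain-core is installed.
    from langchain_core.documents import Document
except ImportError:
    # Stand-in with the same two attributes (an assumption, for illustration only).
    @dataclass
    class Document:
        page_content: str
        metadata: dict = field(default_factory=dict)

doc = Document(
    page_content="The world must be made safe for democracy.",
    metadata={"source": "speech.txt"},
)

print(doc.page_content)        # the text itself
print(doc.metadata["source"])  # where the text came from
```

Loaders build these objects for you; constructing one manually is mainly useful in tests or when wrapping data that has no ready-made loader.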

### Common Document Loaders

| Loader | Use Case |
|--------|----------|
| `TextLoader` | Plain text files (.txt) |
| `PyPDFLoader` | PDF documents |
| `WebBaseLoader` | Web pages |
| `CSVLoader` | CSV files |
| `ArxivLoader` | Academic papers from arXiv |
| `WikipediaLoader` | Wikipedia articles |

📖 **Full List**: https://python.langchain.com/v0.2/docs/integrations/document_loaders/

---

## 1️⃣ TextLoader - Loading Plain Text Files

The simplest document loader - reads `.txt` files directly.

**Use Cases:**
- Reading transcripts
- Loading plain text documents
- Processing text exports

**Key Points:**
- Returns a single Document for the entire file
- Preserves the source path in metadata
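Those two points can be pictured with a standard-library sketch. This is not TextLoader's actual implementation: `load_text_file` is a hypothetical helper, and it returns plain dicts rather than `Document` objects. It only mirrors the behaviour described above, one document for the whole file with the path stored under `source`.

```python
import tempfile
from pathlib import Path

def load_text_file(path, encoding="utf-8"):
    """Hypothetical helper: read a whole file into one document-shaped dict."""
    text = Path(path).read_text(encoding=encoding)
    return [{"page_content": text, "metadata": {"source": str(path)}}]

with tempfile.TemporaryDirectory() as tmp:
    # Write a small sample file so the example is self-contained.
    sample = Path(tmp) / "speech.txt"
    sample.write_text("We desire no conquest, no dominion.", encoding="utf-8")
    documents = load_text_file(sample)

print(len(documents))                    # one document for the entire file
print(documents[0]["metadata"]["source"])  # the path that was loaded
```

The real `TextLoader` also accepts an `encoding` argument, which is worth setting explicitly when a file is not in your platform's default encoding.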

In [1]:
# TextLoader - For loading plain .txt files
from langchain_community.document_loaders import TextLoader

# Step 1: Create a loader instance by specifying the file path
loader = TextLoader('speech.txt')

# Step 2: The loader object is ready but hasn't loaded the data yet
print(f"Loader type: {type(loader)}")
print(f"File path: {loader.file_path}")

Loader type: <class 'langchain_community.document_loaders.text.TextLoader'>
File path: speech.txt


In [2]:
# Step 3: Call .load() to actually read the file and get Document objects
text_documents = loader.load()

# Let's explore what we got
print(f"Number of documents: {len(text_documents)}")
print(f"Document type: {type(text_documents[0])}")
print(f"\n--- Metadata ---")
print(text_documents[0].metadata)
print(f"\n--- Content Preview (first 500 chars) ---")
print(text_documents[0].page_content[:500])

Number of documents: 1
Document type: <class 'langchain_core.documents.base.Document'>

--- Metadata ---
{'source': 'speech.txt'}

--- Content Preview (first 500 chars) ---
The world must be made safe for democracy. Its peace must be planted upon the tested foundations of political liberty. We have no selfish ends to serve. We desire no conquest, no dominion. We seek no indemnities for ourselves, no material compensation for the sacrifices we shall freely make. We are but one of the champions of the rights of mankind. We shall be satisfied when those rights have been made as secure as the faith and the freedom of nations can make them.

Just because we fight withou


---

## 2️⃣ PyPDFLoader - Loading PDF Documents

PDF files are everywhere! PyPDFLoader extracts text from PDF files.

**Key Features:**
- 📄 Creates **one Document per page** (great for large PDFs!)
- 📊 Includes the page number in metadata (`page` is zero-based)
- ⚡ Uses the `pypdf` library under the hood

**Installation:** `pip install pypdf`

**Use Cases:**
- Research papers
- Reports and documentation
- Any PDF content you need to process
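Per-page metadata is what lets you cite where a passage came from. The sketch below turns page metadata into a readable citation. `format_citation` is a hypothetical helper, and the keys it reads (`source`, `page`, `page_label`) are assumed to match what your installed version of PyPDFLoader emits, so check a real `doc.metadata` before relying on them.

```python
def format_citation(metadata):
    """Hypothetical helper: build a human-readable citation from page metadata."""
    source = metadata.get("source", "unknown source")
    # Prefer the printed page label; otherwise convert the zero-based index.
    label = metadata.get("page_label")
    if label is None and "page" in metadata:
        label = str(metadata["page"] + 1)
    return f"{source}, p. {label}" if label is not None else source

# Sample metadata shaped like a first-page entry (illustrative values).
first_page_metadata = {"source": "attention.pdf", "page": 0, "page_label": "1"}
print(format_citation(first_page_metadata))
```

Converting the zero-based `page` index to a one-based number avoids off-by-one citations when no page label is present.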

In [3]:
# PyPDFLoader - For loading PDF documents
from langchain_community.document_loaders import PyPDFLoader

# Load a PDF file (the famous "Attention Is All You Need" paper)
loader = PyPDFLoader('attention.pdf')

# Load returns one Document per page!
docs = loader.load()

print(f"Total pages/documents: {len(docs)}")
print(f"Document type: {type(docs[0])}")

Total pages/documents: 15
Document type: <class 'langchain_core.documents.base.Document'>


In [4]:
# Check the type - it's a LangChain Document object
print(f"Type: {type(docs[0])}")
print(f"\nDocument attributes:")
print(f"  - page_content: The actual text")
print(f"  - metadata: Information about the source")

Type: <class 'langchain_core.documents.base.Document'>

Document attributes:
  - page_content: The actual text
  - metadata: Information about the source


In [5]:
# Explore the first page (index 0)
first_page = docs[0]

print("=== METADATA ===")
print(first_page.metadata)

print("\n=== PAGE CONTENT (first 800 characters) ===")
print(first_page.page_content[:800])

=== METADATA ===
{'producer': 'pdfTeX-1.40.25', 'creator': 'LaTeX with hyperref', 'creationdate': '2023-08-03T00:07:29+00:00', 'author': '', 'keywords': '', 'moddate': '2023-08-03T00:07:29+00:00', 'ptex.fullbanner': 'This is pdfTeX, Version 3.141592653-2.6-1.40.25 (TeX Live 2023) kpathsea version 6.3.5', 'subject': '', 'title': '', 'trapped': '/False', 'source': 'attention.pdf', 'total_pages': 15, 'page': 0, 'page_label': '1'}

=== PAGE CONTENT (first 800 characters) ===
Provided proper attribution is provided, Google hereby grants permission to
reproduce the tables and figures in this paper solely for use in journalistic or
scholarly works.
Attention Is All You Need
Ashish Vaswani∗
Google Brain
avaswani@google.com
Noam Shazeer∗
Google Brain
noam@google.com
Niki Parmar∗
Google Research
nikip@google.com
Jakob Uszkoreit∗
Google Research
usz@google.com
Llion Jones∗
Google Research
llion@google.com
Aidan N. Gomez∗ †
University of Toronto
aidan@cs.toronto.edu
Łukasz Kaiser∗

---

## 3️⃣ WebBaseLoader - Loading Web Pages

Scrape content directly from websites! Uses BeautifulSoup under the hood.

**Key Features:**
- 🌐 Load content from any URL
- 🎯 Filter specific HTML elements using BeautifulSoup
- 📦 Can load multiple URLs at once

**Installation:** `pip install beautifulsoup4`

**Use Cases:**
- Blog posts and articles
- Documentation pages
- News articles
- Any web content
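WebBaseLoader logs a warning when the `USER_AGENT` environment variable is not set. A minimal sketch for setting it is shown here. It assumes the variable is read when `langchain_community` is imported, so this should run before that import; the value is a placeholder, not a required format.

```python
import os

# Set a descriptive User-Agent before importing WebBaseLoader.
# The value is a placeholder; use something that identifies your project.
os.environ.setdefault("USER_AGENT", "my-ingestion-notebook/0.1")

print(os.environ["USER_AGENT"])
```

`setdefault` leaves an existing value untouched, so a User-Agent already configured in your shell or deployment environment takes precedence.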

In [6]:
# WebBaseLoader - Basic usage (loads entire page)
from langchain_community.document_loaders import WebBaseLoader
import bs4

# Load a blog post about AI Agents
loader = WebBaseLoader(
    web_paths=("https://lilianweng.github.io/posts/2023-06-23-agent/",)
)

# Note: web_paths expects a sequence of URLs; the trailing comma makes this a one-element tuple

USER_AGENT environment variable not set, consider setting it to identify your requests.


In [7]:
# Load the web page content
web_docs = loader.load()

print(f"Number of documents: {len(web_docs)}")
print(f"\n=== METADATA ===")
print(web_docs[0].metadata)
print(f"\n=== CONTENT PREVIEW (first 1000 chars) ===")
print(web_docs[0].page_content[:1000])

Number of documents: 1

=== METADATA ===
{'source': 'https://lilianweng.github.io/posts/2023-06-23-agent/', 'title': "LLM Powered Autonomous Agents | Lil'Log", 'description': 'Building agents with LLM (large language model) as its core controller is a cool concept. Several proof-of-concepts demos, such as AutoGPT, GPT-Engineer and BabyAGI, serve as inspiring examples. The potentiality of LLM extends beyond generating well-written copies, stories, essays and programs; it can be framed as a powerful general problem solver.\nAgent System Overview\nIn a LLM-powered autonomous agent system, LLM functions as the agent’s brain, complemented by several key components:\n\nPlanning\n\nSubgoal and decomposition: The agent breaks down large tasks into smaller, manageable subgoals, enabling efficient handling of complex tasks.\nReflection and refinement: The agent can do self-criticism and self-reflection over past actions, learn from mistakes and refine them for future steps, thereby improving t

### Filtering Web Content with BeautifulSoup

Often, you don't want the entire page (navigation, ads, footers, etc.). 
Use `bs_kwargs` to filter specific HTML elements!

**SoupStrainer** lets you specify which elements to extract:
- `class_`: Filter by CSS class names
- `id`: Filter by element IDs
- Other HTML attributes

In [8]:
# WebBaseLoader with BeautifulSoup filtering
# This extracts ONLY the main content (title, header, body)
from langchain_community.document_loaders import WebBaseLoader
import bs4

loader = WebBaseLoader(
    web_paths=("https://lilianweng.github.io/posts/2023-06-23-agent/",),
    bs_kwargs=dict(
        parse_only=bs4.SoupStrainer(
            # Only extract elements with these CSS classes
            class_=("post-title", "post-content", "post-header")
        )
    )
)

# This gives us cleaner content without navigation/footer clutter!

In [9]:
# Load the filtered content
filtered_docs = loader.load()

print(f"Number of documents: {len(filtered_docs)}")
print(f"\n=== CLEANER CONTENT PREVIEW (first 1500 chars) ===")
print(filtered_docs[0].page_content[:1500])

Number of documents: 1

=== CLEANER CONTENT PREVIEW (first 1500 chars) ===


      LLM Powered Autonomous Agents
    
Date: June 23, 2023  |  Estimated Reading Time: 31 min  |  Author: Lilian Weng


Building agents with LLM (large language model) as its core controller is a cool concept. Several proof-of-concepts demos, such as AutoGPT, GPT-Engineer and BabyAGI, serve as inspiring examples. The potentiality of LLM extends beyond generating well-written copies, stories, essays and programs; it can be framed as a powerful general problem solver.
Agent System Overview#
In a LLM-powered autonomous agent system, LLM functions as the agent’s brain, complemented by several key components:

Planning

Subgoal and decomposition: The agent breaks down large tasks into smaller, manageable subgoals, enabling efficient handling of complex tasks.
Reflection and refinement: The agent can do self-criticism and self-reflection over past actions, learn from mistakes and refine them for future steps, th

---

## 4️⃣ ArxivLoader - Loading Academic Papers

Access academic papers directly from [arXiv](https://arxiv.org/)!

**Key Features:**
- 🔬 Load papers by arXiv ID (e.g., "1706.03762")
- 🔍 Search papers by topic/query
- 📊 Rich metadata (authors, title, abstract, etc.)
- 📖 Extracts full PDF text

**Installation:** `pip install arxiv pymupdf`

**Use Cases:**
- Research assistance
- Academic Q&A systems
- Paper summarization

In [10]:
# ArxivLoader - Load academic papers from arXiv
from langchain_community.document_loaders import ArxivLoader

# Load the famous "Attention Is All You Need" paper
# "1706.03762" is the arXiv ID
docs = ArxivLoader(
    query="1706.03762",  # Can be arXiv ID or search query
    load_max_docs=2       # Limit number of papers to load
).load()

print(f"Number of documents loaded: {len(docs)}")

Number of documents loaded: 1


In [11]:
# Explore the loaded paper
print("=== PAPER METADATA ===")
for key, value in docs[0].metadata.items():
    # Truncate long values for display
    if isinstance(value, str) and len(value) > 100:
        value = value[:100] + "..."
    print(f"  {key}: {value}")

print(f"\n=== CONTENT PREVIEW (first 1000 chars) ===")
print(docs[0].page_content[:1000])

=== PAPER METADATA ===
  Published: 2023-08-02
  Title: Attention Is All You Need
  Authors: Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kais...
  Summary: The dominant sequence transduction models are based on complex recurrent or convolutional neural net...

=== CONTENT PREVIEW (first 1000 chars) ===
Provided proper attribution is provided, Google hereby grants permission to
reproduce the tables and figures in this paper solely for use in journalistic or
scholarly works.
Attention Is All You Need
Ashish Vaswani∗
Google Brain
avaswani@google.com
Noam Shazeer∗
Google Brain
noam@google.com
Niki Parmar∗
Google Research
nikip@google.com
Jakob Uszkoreit∗
Google Research
usz@google.com
Llion Jones∗
Google Research
llion@google.com
Aidan N. Gomez∗†
University of Toronto
aidan@cs.toronto.edu
Łukasz Kaiser∗
Google Brain
lukaszkaiser@google.com
Illia Polosukhin∗‡
illia.polosukhin@gmail.com
Abstract
The dominant sequence tr

---

## 5️⃣ WikipediaLoader - Loading Wikipedia Articles

Access Wikipedia's vast knowledge base directly!

**Key Features:**
- 🔍 Search Wikipedia by topic
- 📚 Returns article summaries and content
- 🌍 Supports multiple languages (via the `lang` parameter)
- 📋 Rich metadata (such as title and source URL)

**Installation:** `pip install wikipedia`

**Use Cases:**
- General knowledge Q&A
- Research and fact-checking
- Building knowledge bases

In [12]:
# WikipediaLoader - Search and load Wikipedia articles
from langchain_community.document_loaders import WikipediaLoader

# Search for articles about "Generative AI"
docs = WikipediaLoader(
    query="Generative AI",  # Your search query
    load_max_docs=2          # Number of articles to load
).load()

print(f"Number of articles loaded: {len(docs)}")

Number of articles loaded: 2


In [13]:
# Explore Wikipedia documents
for i, doc in enumerate(docs):
    print(f"\n{'='*50}")
    print(f"📄 Article {i+1}")
    print(f"{'='*50}")
    print(f"Title: {doc.metadata.get('title', 'N/A')}")
    print(f"Source: {doc.metadata.get('source', 'N/A')}")
    print(f"\nContent Preview (first 500 chars):")
    print(doc.page_content[:500])
    print("...")


==================================================
📄 Article 1
==================================================
Title: Generative artificial intelligence
Source: https://en.wikipedia.org/wiki/Generative_artificial_intelligence

Content Preview (first 500 chars):
Generative artificial intelligence (Generative AI, or GenAI) is a subfield of artificial intelligence that uses generative models to generate text, images, videos, audio, software code or other forms of data. These models learn the underlying patterns and structures of their training data and use them to produce new data in response to input, which often comes in the form of natural language prompts.
The prevalence of generative AI tools has increased significantly since the AI boom in the 2020s
...

==================================================
📄 Article 2
==================================================
Title: Generative AI pornography
Source: https://en.wikipedia.org/wiki/Generative_AI_pornography

Content Preview (first 500 chars):
Generative AI pornography or simply AI pornography is a digitally created pornography produced through generative artificial intelligence (AI) technologies. Unlike traditional por

In [14]:
# 🎯 Quick Reference: Install required packages
# Run this cell if you need to install the dependencies

# !pip install langchain-community
# !pip install pypdf          # For PyPDFLoader
# !pip install beautifulsoup4 # For WebBaseLoader
# !pip install arxiv pymupdf  # For ArxivLoader
# !pip install wikipedia      # For WikipediaLoader

print("📦 Common installations for Document Loaders:")
print("   pip install langchain-community")
print("   pip install pypdf beautifulsoup4 arxiv pymupdf wikipedia")

📦 Common installations for Document Loaders:
   pip install langchain-community
   pip install pypdf beautifulsoup4 arxiv pymupdf wikipedia


---

## 📝 Summary: Key Takeaways

### Document Loaders Overview

| Loader | Source | Documents Returned |
|--------|--------|--------------------|
| `TextLoader` | .txt files | 1 per file |
| `PyPDFLoader` | PDF files | 1 per page |
| `WebBaseLoader` | Web URLs | 1 per URL |
| `CSVLoader` | CSV files | 1 per row |
| `ArxivLoader` | arXiv papers | 1 per paper |
| `WikipediaLoader` | Wikipedia | 1 per article |
| `DirectoryLoader` | Folders | Depends on the loader used for each file (`loader_cls`) |

### Common Pattern

```python
# All loaders follow this pattern (TextLoader shown as a concrete example):
from langchain_community.document_loaders import TextLoader

# 1. Create a loader instance, passing the source (a path, URL, or query)
loader = TextLoader("speech.txt")

# 2. Load documents
docs = loader.load()

# 3. Access content and metadata
for doc in docs:
    print(doc.page_content)  # The text
    print(doc.metadata)      # Source info
```
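Because every loader returns the same shape, downstream code does not need to know which loader produced a document. The sketch below illustrates that with a hypothetical `summarize` helper. It uses `SimpleNamespace` stand-ins with made-up content instead of real loader output, so it runs without any files or network access.

```python
from collections import Counter
from types import SimpleNamespace

def summarize(docs):
    """Hypothetical helper: count documents and characters per source."""
    doc_counts = Counter()
    char_counts = Counter()
    for doc in docs:
        source = doc.metadata.get("source", "unknown")
        doc_counts[source] += 1
        char_counts[source] += len(doc.page_content)
    return {source: (doc_counts[source], char_counts[source]) for source in doc_counts}

# Stand-ins shaped like Document objects (illustrative data, not real loader output).
docs = [
    SimpleNamespace(page_content="page one", metadata={"source": "attention.pdf", "page": 0}),
    SimpleNamespace(page_content="page two!", metadata={"source": "attention.pdf", "page": 1}),
    SimpleNamespace(page_content="a speech", metadata={"source": "speech.txt"}),
]

for source, (n_docs, n_chars) in summarize(docs).items():
    print(f"{source}: {n_docs} document(s), {n_chars} characters")
```

The same function should work unchanged on lists returned by any loader's `.load()`, since it touches only `page_content` and `metadata`.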

### Next Steps
- 🔪 **Text Splitting**: Break large documents into smaller chunks
- 🧮 **Embeddings**: Convert text to vectors for similarity search
- 💾 **Vector Stores**: Store and retrieve documents efficiently

---

## 7️⃣ DirectoryLoader - Loading Multiple Files

Load all files from a directory at once!

**Key Features:**
- 📁 Batch load entire directories
- 🎯 Filter by file extension (glob patterns)
- 🔄 Choose which loader class handles the matched files (`loader_cls`)
- ⚡ Supports parallel loading (`use_multithreading=True`)

**Use Cases:**
- Loading entire documentation folders
- Processing multiple reports
- Bulk data ingestion

In [15]:
# DirectoryLoader - Load all files from a folder
from langchain_community.document_loaders import DirectoryLoader, TextLoader

# Example: Load all .txt files from a directory
# loader = DirectoryLoader(
#     path="./documents/",          # Directory path
#     glob="**/*.txt",               # Pattern: all .txt files (recursive)
#     loader_cls=TextLoader,         # Which loader to use
#     show_progress=True             # Show loading progress (requires tqdm)
# )
# docs = loader.load()

print("DirectoryLoader patterns (glob):")
print("  '*.txt'      - All .txt files in the directory")
print("  '**/*.txt'   - All .txt files (recursive)")
print("  '**/*.pdf'   - All .pdf files (recursive)")
print("  '**/*.*'     - All files with an extension (recursive)")

DirectoryLoader patterns (glob):
  '*.txt'      - All .txt files in the directory
  '**/*.txt'   - All .txt files (recursive)
  '**/*.pdf'   - All .pdf files (recursive)
  '**/*.*'     - All files with an extension (recursive)

---

## 6️⃣ CSVLoader - Loading CSV Files

Perfect for structured data in spreadsheets!

**Key Features:**
- 📊 Each row becomes a separate Document
- 🏷️ Metadata records the source file and row index; `metadata_columns` can move chosen columns into metadata
- ⚙️ Customizable parsing and field selection (`csv_args`, `source_column`)

**Use Cases:**
- Customer data
- Product catalogs
- FAQ databases

In [16]:
# CSVLoader - For loading CSV/spreadsheet data
from langchain_community.document_loaders import CSVLoader

# Example: Load a CSV file (uncomment when you have a CSV file)
# loader = CSVLoader(
#     file_path="your_data.csv",
#     csv_args={
#         'delimiter': ',',
#         'quotechar': '"'
#     }
# )
# docs = loader.load()

# Each row becomes a Document!
# for doc in docs[:3]:
#     print(doc.page_content)
#     print(doc.metadata)
#     print("---")

print("CSVLoader creates one Document per row in your CSV file!")
print("Each row's content is formatted as 'column: value' pairs")

CSVLoader creates one Document per row in your CSV file!
Each row's content is formatted as 'column: value' pairs

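To see what "one Document per row" with `column: value` content looks like, here is a standard-library approximation. `rows_to_documents` is a hypothetical helper, not CSVLoader itself. The sample data is made up, and the exact formatting and metadata keys produced by CSVLoader may differ between versions.

```python
import csv
import io

def rows_to_documents(csv_text, source="your_data.csv"):
    """Hypothetical helper: one document-shaped dict per CSV row."""
    documents = []
    for index, row in enumerate(csv.DictReader(io.StringIO(csv_text))):
        # Each row becomes "column: value" lines, one line per column.
        content = "\n".join(f"{column}: {value}" for column, value in row.items())
        documents.append(
            {"page_content": content, "metadata": {"source": source, "row": index}}
        )
    return documents

sample = "question,answer\nWhat is a loader?,It reads data into Documents\n"
documents = rows_to_documents(sample)

print(documents[0]["page_content"])
print(documents[0]["metadata"])
```

Keeping the column names inside the text gives an embedding model or LLM the context for each value, which is why this layout suits FAQ-style data.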