# Document Loaders

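Document loaders read data from a source, such as a PDF file or a web page, and return it as a list of `Document` objects. Each `Document` holds the extracted text in `page_content` and descriptive fields in `metadata`. The sections below walk through a few of the document loaders available in LangChain.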
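
To make the shape of a loader's output concrete, here is a minimal sketch that uses only the standard library. `SimpleDocument` and `load_text_file` are hypothetical stand-ins written for illustration; they are not LangChain classes, and they only mimic the `page_content` and `metadata` fields seen in the outputs in this notebook.

```python
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class SimpleDocument:
    """Hypothetical stand-in for a loaded document: text plus metadata."""

    page_content: str
    metadata: dict = field(default_factory=dict)


def load_text_file(path: str) -> list[SimpleDocument]:
    """Read one UTF-8 text file and wrap it in a single-item list."""
    text = Path(path).read_text(encoding="utf-8")
    # Record where the text came from, as the loaders here do with 'source'.
    return [SimpleDocument(page_content=text, metadata={"source": path})]
```

Real loaders differ mainly in how they obtain the text and in which metadata keys they fill in.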


### 1. PDF Loaders

In [8]:
from langchain_community.document_loaders import PyPDFLoader

loader = PyPDFLoader("myPDF.pdf")

document = loader.load()  # returns a list with one Document per PDF page
pages = list(document)

pages[3]

Document(metadata={'producer': 'pdfTeX-1.40.21', 'creator': 'LaTeX with hyperref', 'creationdate': '2021-06-22T01:27:10+00:00', 'author': '', 'keywords': '', 'moddate': '2021-06-22T01:27:10+00:00', 'ptex.fullbanner': 'This is pdfTeX, Version 3.14159265-2.6-1.40.21 (TeX Live 2020) kpathsea version 6.3.2', 'subject': '', 'title': '', 'trapped': '/False', 'source': 'myPDF.pdf', 'total_pages': 16, 'page': 3, 'page_label': '4'}, page_content='4 Z. Shen et al.\nEfficient Data Annotation\nC u s t o m i z e d  M o d e l  T r a i n i n g\nModel Cust omization\nDI A Model Hub\nDI A Pipeline Sharing\nCommunity Platform\nLa y out Detection Models\nDocument Images \nT h e  C o r e  L a y o u t P a r s e r  L i b r a r y\nOCR Module St or age & VisualizationLa y out Data Structur e\nFig. 1: The overall architecture of LayoutParser. For an input document image,\nthe core LayoutParser library provides a set of oﬀ-the-shelf tools for layout\ndetection, OCR, visualization, and storage, backed by a caref

Limitations of PyPDFLoader:

1. **No OCR:** By default, it does not extract text from embedded images or from handwriting.
2. **No Layout Analysis:** It does not distinguish between headers, paragraphs, and tables.
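
One way to spot pages affected by the first limitation is to flag pages whose extracted text is suspiciously short, since scanned or image-only pages often yield little or no text. The helper below is a rough sketch: `pages_needing_ocr` and its `min_chars` threshold are assumptions made for illustration, and it only relies on each page having `page_content` and `metadata` attributes.

```python
def pages_needing_ocr(pages, min_chars=50):
    """Return page numbers whose extracted text is shorter than min_chars."""
    flagged = []
    for index, page in enumerate(pages):
        text = (page.page_content or "").strip()
        if len(text) < min_chars:
            # Fall back to the list position when no 'page' key is present.
            flagged.append(page.metadata.get("page", index))
    return flagged
```

With the loader above, `pages_needing_ocr(loader.load())` would list candidate pages to send through an OCR-capable loader. The threshold is a heuristic and will need tuning per document.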

In [7]:
from langchain_community.document_loaders import PyMuPDFLoader

loader = PyMuPDFLoader(
    "myPDF.pdf",
    mode="page",
    extract_images=True,
    extract_tables="markdown"
)
docs = loader.load()

pages = list(docs)  # use the PyMuPDFLoader result, not the earlier PyPDFLoader output

pages[3]


---
### 2. Webpage Loaders


In [29]:
from langchain_community.document_loaders import WebBaseLoader

loader = WebBaseLoader("https://peymankh.dev")

docs = loader.load()

docs

[Document(metadata={'source': 'https://peymankh.dev', 'title': 'Peyman KH | AI Engineer & Developer', 'description': 'Peyman Khodabandehlouei, an AI Engineer & developer specializing in AI orchestration, multi-agent systems, and production-grade automation', 'language': 'en'}, page_content='\n\n\n\n\nPeyman KH | AI Engineer & Developer\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n')]
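
The `page_content` in the output above is mostly blank lines. A small post-processing step can tidy this up before the text is used further. The function below is a sketch using only the standard library; `collapse_whitespace` is a hypothetical helper name, not part of LangChain.

```python
import re


def collapse_whitespace(text: str) -> str:
    """Trim each line, drop empty lines, and squeeze runs of spaces and tabs."""
    lines = (re.sub(r"[ \t]+", " ", line).strip() for line in text.splitlines())
    return "\n".join(line for line in lines if line)
```

It could be applied to the loaded pages with `cleaned = [collapse_whitespace(d.page_content) for d in docs]`.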

In [9]:
import os

from langchain_community.document_loaders import FireCrawlLoader

loader = FireCrawlLoader(
    api_key=os.environ["FIRECRAWL_API_KEY"],  # read the key from the environment; never hard-code it
    url="https://binance.com",
    mode="scrape",  # fetch a single page rather than crawling the whole site
    api_url="https://api.firecrawl.dev",  # set the API endpoint explicitly
)

docs = loader.load()

HTTPError: Request Timeout: Failed to scrape URL as the request timed out. Scrape timed out after waiting in the concurrency limit queue
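
Timeouts like the one above are sometimes transient, so one option is to retry the call with an increasing delay. The helper below is a generic sketch, not a FireCrawl or LangChain feature: `retry_with_backoff` and its parameters are assumptions for illustration, and whether a retry succeeds depends on why the request was queued in the first place.

```python
import time


def retry_with_backoff(fn, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fn(); on failure wait base_delay * 2**n seconds and try again."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of retries: surface the last error
            sleep(base_delay * 2 ** attempt)
```

With the loader defined above, this would be used as `docs = retry_with_backoff(loader.load)`. Catching every `Exception` keeps the sketch short; in practice it is better to catch only the timeout-related error type.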

In [8]:
docs

[Document(metadata={'og:image': '/hero-headshot.jpg', 'og:title': 'Peyman KH | AI Engineer & Developer', 'twitter:description': 'AI Engineering Student & Full-Stack Developer specializing in AI orchestration, multi-agent systems, and production-grade automation.', 'twitter:card': 'summary_large_image', 'viewport': 'width=device-width, initial-scale=1.0', 'twitter:image': '/hero-headshot.jpg', 'og:type': 'website', 'ogImage': '/hero-headshot.jpg', 'title': 'Peyman KH | AI Engineer & Developer', 'og:description': 'AI Engineering Student & Full-Stack Developer specializing in AI orchestration, multi-agent systems, and production-grade automation. Graduating 2026.', 'language': 'en', 'twitter:title': 'Peyman KH | AI Engineer & Developer', 'ogDescription': 'AI Engineering Student & Full-Stack Developer specializing in AI orchestration, multi-agent systems, and production-grade automation. Graduating 2026.', 'author': 'AI Engineering Portfolio', 'ogTitle': 'Peyman KH | AI Engineer & Develope