# LangChain Document Loaders

This notebook demonstrates several popular and practical document loaders in LangChain.  
Most of them are part of the `langchain_community` package; the YouTube loader used here comes from the separate `langchain_yt_dlp` package.

We'll explore loaders for:
- PDFs  
- Web pages  
- YouTube videos  
- Text and CSV files  
- Wikipedia articles and arXiv papers

Each example loads a source into `Document` objects, ready for text splitting and embedding.


# PDF Loader

In [1]:

from langchain_community.document_loaders import PyPDFLoader

loader = PyPDFLoader("attention.pdf")
docs = loader.load()

print(f"Loaded {len(docs)} pages")
print(docs[0].page_content[:500])


Loaded 15 pages
Provided proper attribution is provided, Google hereby grants permission to
reproduce the tables and figures in this paper solely for use in journalistic or
scholarly works.
Attention Is All You Need
Ashish Vaswani∗
Google Brain
avaswani@google.com
Noam Shazeer∗
Google Brain
noam@google.com
Niki Parmar∗
Google Research
nikip@google.com
Jakob Uszkoreit∗
Google Research
usz@google.com
Llion Jones∗
Google Research
llion@google.com
Aidan N. Gomez∗ †
University of Toronto
aidan@cs.toronto.edu
Łukasz 
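
Scanned or image-only pages often come back with little or no text, so it can help to filter them out before splitting. The sketch below is one possible way to do that; `pages_with_text` and `label_page` are hypothetical helper names, the stand-in objects only mimic the `page_content` and `metadata` fields shown above, and the zero-based `page` metadata key is an assumption about what the PDF loader records.

```python
from types import SimpleNamespace


def pages_with_text(docs, min_chars=20):
    """Keep only pages whose extracted text has at least `min_chars` characters."""
    return [d for d in docs if len(d.page_content.strip()) >= min_chars]


def label_page(doc, preview_chars=60):
    """Build a short human-readable label: source, 1-based page number, text preview."""
    source = doc.metadata.get("source", "unknown")
    page = doc.metadata.get("page")  # assumed to be a zero-based page index
    number = page + 1 if isinstance(page, int) else "?"
    preview = doc.page_content.strip()[:preview_chars]
    return f"[{source}, page {number}] {preview}"


# Stand-in objects for illustration; real loader output would be used the same way.
sample_docs = [
    SimpleNamespace(
        page_content="Attention Is All You Need",
        metadata={"source": "attention.pdf", "page": 0},
    ),
    SimpleNamespace(page_content="   ", metadata={"source": "attention.pdf", "page": 1}),
]

for doc in pages_with_text(sample_docs, min_chars=5):
    print(label_page(doc))
```

The same two helpers should work on the `docs` list returned by `loader.load()`, since they only touch `page_content` and `metadata`.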


# Web Page Loader

In [17]:
from langchain_community.document_loaders import UnstructuredURLLoader

urls = ["https://blog.langchain.com/context-engineering-for-agents/"]

loader = UnstructuredURLLoader(urls=urls)
docs = loader.load()

print(f"Loaded {len(docs)} web documents")
print(docs[0])


Loaded 1 web documents
page_content='Context Engineering

11 min read Jul 2, 2025

TL;DR

Agents need context to perform tasks. Context engineering is the art and science of filling the context window with just the right information at each step of an agent’s trajectory. In this post, we break down some common strategies — write, select, compress, and isolate — for context engineering by reviewing various popular agents and papers. We then explain how LangGraph is designed to support them!

Also, see our video on context engineering here.

Context Engineering

As Andrej Karpathy puts it, LLMs are like a new kind of operating system. The LLM is like the CPU and its context window is like the RAM, serving as the model’s working memory. Just like RAM, the LLM context window has limited capacity to handle various sources of context. And just as an operating system curates what fits into a CPU’s RAM, we can think about “context engineering” playing a similar role. Karpathy summarizes this w
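
Text extracted from web pages tends to carry uneven spacing and stacked blank lines. A minimal cleanup sketch, using only the standard library, might look like the following; `normalize_whitespace` is a hypothetical helper, not part of LangChain.

```python
import re


def normalize_whitespace(text: str) -> str:
    """Collapse runs of spaces/tabs, trim each line, and allow at most one blank line."""
    lines = [re.sub(r"[ \t]+", " ", line).strip() for line in text.splitlines()]
    cleaned = "\n".join(lines)
    # Three or more consecutive newlines become a single blank line.
    return re.sub(r"\n{3,}", "\n\n", cleaned).strip()


raw = "Context  Engineering\n\n\n\n11 min read \t Jul 2, 2025\n"
print(normalize_whitespace(raw))
```

It could be applied to `docs[0].page_content` before the text is passed to a splitter.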

# YouTube Video Loader

In [29]:
from langchain_yt_dlp.youtube_loader import YoutubeLoaderDL

# Load the video as a Document; add_video_info=True also fetches
# metadata such as the title and description
loader = YoutubeLoaderDL.from_youtube_url(
    "https://www.youtube.com/watch?v=f7RfHxyoVyI", add_video_info=True
)

docs = loader.load()
docs[0].metadata

{'source': 'f7RfHxyoVyI',
 'title': 'How AWS Outage Took Down Over 1000 Websites and Apps | Vantage with Palki Sharma',
 'description': "A major outage at Amazon Web Services (AWS) brought down huge parts of the internet—crippling platforms like Snapchat, Reddit, WhatsApp, and even banking and tax portals. The issue was traced to a DNS failure at a key AWS data center in North Virginia. The disruption has renewed concerns over Big Tech's dominance, with U.S. Senator Elizabeth Warren calling for the breakup of large internet giants like Amazon.\n\n---\n\nAWS Outage | Amazon Web Services Down | Amazon Web Services Outage | Firstpost | World News | News Live | Vantage | Palki Sharma | News\n\n#aws #amazonwebservice #awsoutage #firstpost #vantageonfirstpost #palkisharma #worldnews\n\nVantage is a ground-breaking news, opinions, and current affairs show from Firstpost. Catering to a global audience, Vantage covers the biggest news stories from a 360-degree perspective, giving viewers a chan
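
The `source` value in the metadata above matches the `v` parameter of the video URL. As a rough illustration of that mapping, the sketch below pulls the ID out of a URL with the standard library; `youtube_video_id` is a hypothetical helper that only handles the two common URL shapes shown, and it is not how the loader itself is implemented.

```python
from urllib.parse import parse_qs, urlparse


def youtube_video_id(url: str):
    """Return the video ID from a watch URL or a youtu.be short link, else None."""
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    if host == "youtu.be":
        # Short links carry the ID as the path: https://youtu.be/<id>
        return parsed.path.lstrip("/") or None
    if host == "youtube.com" or host.endswith(".youtube.com"):
        # Watch URLs carry the ID in the `v` query parameter.
        values = parse_qs(parsed.query).get("v")
        return values[0] if values else None
    return None


print(youtube_video_id("https://www.youtube.com/watch?v=f7RfHxyoVyI"))
```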

# Text Loader

In [33]:
from langchain_community.document_loaders import TextLoader

loader = TextLoader("./speech.txt")
docs = loader.load()

print(docs)


[Document(metadata={'source': './speech.txt'}, page_content='The world must be made safe for democracy. Its peace must be planted upon the tested foundations of political liberty. We have no selfish ends to serve. We desire no conquest, no dominion. We seek no indemnities for ourselves, no material compensation for the sacrifices we shall freely make. We are but one of the champions of the rights of mankind. We shall be satisfied when those rights have been made as secure as the faith and the freedom of nations can make them.\n\nJust because we fight without rancor and without selfish object, seeking nothing for ourselves but what we shall wish to share with all free peoples, we shall, I feel confident, conduct our operations as belligerents without passion and ourselves observe with proud punctilio the principles of right and of fair play we profess to be fighting for.\n\n…\n\nIt will be all the easier for us to conduct ourselves as belligerents in a high spirit of right and fairness 

# CSV Loader

In [36]:
from langchain_community.document_loaders import CSVLoader

loader = CSVLoader(file_path="HousingData.csv")
docs = loader.load()

print(f"Loaded {len(docs)} rows")
print(docs[0].page_content)


Loaded 506 rows
CRIM: 0.00632
ZN: 18
INDUS: 2.31
CHAS: 0
NOX: 0.538
RM: 6.575
AGE: 65.2
DIS: 4.09
RAD: 1
TAX: 296
PTRATIO: 15.3
B: 396.9
LSTAT: 4.98
MEDV: 24
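
The output above shows one document per row, with each column rendered as a `name: value` line. To make that shape concrete, here is a standard-library sketch that produces the same layout; `rows_as_text` is a hypothetical function written for illustration and is not the loader's actual implementation. The sample row reuses the first three columns of the output above.

```python
import csv
import io


def rows_as_text(csv_text: str):
    """Render each CSV row as newline-separated 'column: value' pairs."""
    reader = csv.DictReader(io.StringIO(csv_text))
    texts = []
    for row in reader:
        lines = [
            f"{key.strip()}: {(value or '').strip()}"
            for key, value in row.items()
            if key is not None  # skip overflow cells that have no header
        ]
        texts.append("\n".join(lines))
    return texts


sample_csv = "CRIM,ZN,INDUS\n0.00632,18,2.31\n"
for text in rows_as_text(sample_csv):
    print(text)
```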


# Wikipedia Loader

In [None]:
from langchain_community.document_loaders import WikipediaLoader

# Fetch up to two articles matching the query
docs = WikipediaLoader(query="Agentic AI", load_max_docs=2).load()
docs[0].metadata


{'title': 'Agentic AI',
 'summary': 'Agentic AI is a class of artificial intelligence that focuses on autonomous systems that can make decisions and perform tasks with limited or no human intervention. The independent systems automatically respond to conditions to produce process results. The field is closely linked to agentic automation, also known as agent-based process management systems, when applied to process automation. Applications include software development, customer support, cybersecurity and business intelligence. \n\n',
 'source': 'https://en.wikipedia.org/wiki/Agentic_AI'}

In [None]:
docs[0].page_content[:500]  

'Agentic AI is a class of artificial intelligence that focuses on autonomous systems that can make decisions and perform tasks with limited or no human intervention. The independent systems automatically respond to conditions to produce process results. The field is closely linked to agentic automation, also known as agent-based process management systems, when applied to process automation. Applications include software development, customer support, cybersecurity and business intelligence. \n\n\n='

# Arxiv Loader

In [None]:
from langchain_community.document_loaders import ArxivLoader

loader = ArxivLoader(
    query="Hallucination is Inevitable",
    load_max_docs=2,
)

# get_summaries_as_docs() puts each paper's abstract in page_content;
# call loader.load() instead to fetch the full paper text.
docs = loader.get_summaries_as_docs()
docs[0]

Document(metadata={'Entry ID': 'http://arxiv.org/abs/2401.11817v2', 'Published': datetime.date(2025, 2, 13), 'Title': 'Hallucination is Inevitable: An Innate Limitation of Large Language Models', 'Authors': 'Ziwei Xu, Sanjay Jain, Mohan Kankanhalli'}, page_content='Hallucination has been widely recognized to be a significant drawback for\nlarge language models (LLMs). There have been many works that attempt to reduce\nthe extent of hallucination. These efforts have mostly been empirical so far,\nwhich cannot answer the fundamental question whether it can be completely\neliminated. In this paper, we formalize the problem and show that it is\nimpossible to eliminate hallucination in LLMs. Specifically, we define a formal\nworld where hallucination is defined as inconsistencies between a computable\nLLM and a computable ground truth function. By employing results from learning\ntheory, we show that LLMs cannot learn all the computable functions and will\ntherefore inevitably hallucinate i
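
The `Entry ID` in the metadata encodes both the arXiv identifier and the version of the paper. If those are needed separately, a small parser along these lines could work; `parse_entry_id` is a hypothetical helper, and the pattern assumes new-style identifiers of the form `YYMM.NNNNN` only.

```python
import re

# Matches e.g. http://arxiv.org/abs/2401.11817v2 (new-style IDs only, an assumption).
ARXIV_ENTRY = re.compile(
    r"arxiv\.org/abs/(?P<id>\d{4}\.\d{4,5})(?:v(?P<version>\d+))?$"
)


def parse_entry_id(entry_id: str):
    """Return (arxiv_id, version) from an Entry ID URL, or None if it does not match."""
    match = ARXIV_ENTRY.search(entry_id)
    if match is None:
        return None
    version = match.group("version")
    return match.group("id"), int(version) if version else None


print(parse_entry_id("http://arxiv.org/abs/2401.11817v2"))
```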

# Summary

- Document Loaders act as the **data ingestion layer** in LangChain.
- Most loaders used here reside in `langchain_community`; the YouTube loader comes from `langchain_yt_dlp`.
- Each loader standardizes data into a `Document` format with `page_content` and `metadata`.
- Next step: use **Text Splitters** to break these documents into smaller chunks for embeddings.
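
To give a feel for the splitting step mentioned in the last point, here is a minimal sketch of fixed-size chunking with overlap. It uses a stand-in `Doc` dataclass with the same two fields as a loaded document rather than LangChain's own classes, and `split_doc` and the `start_index` metadata key are names chosen for this illustration. LangChain's text splitters are more sophisticated (they try to break on separators such as paragraphs), so treat this only as an approximation of the idea.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Stand-in with the two fields every loaded document carries."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def split_doc(doc: Doc, chunk_size: int = 200, overlap: int = 20):
    """Cut a document into fixed-size character chunks that overlap by `overlap`."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    text = doc.page_content
    chunks = []
    for start in range(0, len(text), step):
        piece = text[start:start + chunk_size]
        # Each chunk inherits the parent's metadata plus its starting offset.
        chunks.append(Doc(piece, {**doc.metadata, "start_index": start}))
        if start + chunk_size >= len(text):
            break  # the last chunk already reaches the end of the text
    return chunks


example = Doc("abcdefghij", {"source": "./speech.txt"})
for chunk in split_doc(example, chunk_size=4, overlap=1):
    print(chunk.metadata["start_index"], chunk.page_content)
```

Copying the parent metadata into every chunk keeps each piece traceable to its source after embedding.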
