# Document Loader
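A Document Loader is a class that reads data from a source (files, web pages, APIs) and returns it as a list of LangChain `Document` objects, each holding text in `page_content` and source details in `metadata`.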

Listed below are some examples of Document Loaders.

- PyPDFLoader: Loads PDF files
- CSVLoader: Loads CSV files
- UnstructuredHTMLLoader: Loads HTML files
- JSONLoader: Loads JSON files
- TextLoader: Loads text files
- DirectoryLoader: Loads documents from a directory
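
To make the idea concrete, here is a minimal standard-library sketch of what a loader does: read a source and wrap the text plus its origin in a document object. `SimpleDocument` and `load_text_file` are hypothetical names used only for illustration; they are not LangChain classes and do not reflect LangChain's implementation.

```python
import tempfile
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class SimpleDocument:
    """Hypothetical stand-in for a loader's output: text plus metadata."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def load_text_file(path, encoding="utf-8"):
    """Read one text file and return it as a single-item list of documents."""
    text = Path(path).read_text(encoding=encoding)
    return [SimpleDocument(page_content=text, metadata={"source": str(path)})]


# Demo on a temporary file so the sketch runs anywhere
with tempfile.TemporaryDirectory() as tmp:
    sample_path = Path(tmp) / "note.txt"
    sample_path.write_text("Loaders turn raw sources into documents.", encoding="utf-8")
    sample_docs = load_text_file(sample_path)

print(sample_docs[0].page_content)
print(sample_docs[0].metadata)
```

The loaders in the following sections follow the same pattern: `loader.load()` returns a list, and each item exposes `page_content` and `metadata`.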

In [None]:
!pip install langchain langchain_community pypdf arxiv pymupdf beautifulsoup4

## 1. PyPDF
pypdf is a widely used Python library for reading and extracting text from PDF files. LangChain integrates with it through PyPDFLoader, which converts a PDF into a list of Document objects, one per page, each holding the page text and metadata such as the source path and page number.

In [21]:
from langchain_community.document_loaders import PyPDFLoader

loader = PyPDFLoader('/content/MIASDB Excerpta Medica 1994.pdf')

# Load data into Document objects
docs = loader.load()

# Print the first 300 characters of the first page
print(docs[0].page_content[:300])

See	discussions,	stats,	and	author	profiles	for	this	publication	at:	
http://www.researchgate.net/publication/243788073
The	mammographic	image	analysis	society
digital	mammogram	database.	Exerpta	Medica
ARTICLE
	·	JANUARY	1994
CITATIONS
170
3	AUTHORS
,	INCLUDING:
John	Suckling
University	of	Cambridg
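
The text extracted above separates words with tab characters, which is common for PDF text extraction. A small post-processing sketch, assuming you want single spaces and no empty lines; `normalize_whitespace` is a hypothetical helper, not part of PyPDFLoader.

```python
import re


def normalize_whitespace(text):
    """Collapse runs of tabs/spaces into one space and drop empty lines."""
    lines = (re.sub(r"[ \t]+", " ", line).strip() for line in text.splitlines())
    return "\n".join(line for line in lines if line)


# Sample shaped like the extracted output above (tabs between words)
raw = "The\tmammographic\timage\tanalysis\tsociety\n\n digital\tmammogram\tdatabase. "
cleaned = normalize_whitespace(raw)
print(cleaned)
```

With a loaded document you would apply it as `normalize_whitespace(docs[0].page_content)`.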


## 2. WebBaseLoader
WebBaseLoader is a specialized document loader in LangChain designed for processing web-based content.

It leverages the BeautifulSoup4 library to parse web pages effectively, offering customizable parsing options through SoupStrainer and additional bs4 parameters.

This tutorial demonstrates how to use WebBaseLoader to:
- Load and parse web documents effectively
- Customize parsing behavior using BeautifulSoup options
- Handle different web content structures flexibly
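
The idea behind restricting parsing to specific elements can be illustrated without any third-party packages. The sketch below uses Python's built-in `html.parser` as a stand-in for BeautifulSoup's `SoupStrainer`: it keeps only text inside `<div>` elements that carry a given class. `DivTextExtractor` and the sample HTML are hypothetical, and this is not how WebBaseLoader works internally.

```python
from html.parser import HTMLParser


class DivTextExtractor(HTMLParser):
    """Collect text found inside <div> elements that carry a target class."""

    def __init__(self, target_class):
        super().__init__()
        self.target_class = target_class
        self.depth = 0  # > 0 while inside a matching div (counts nested divs)
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        classes = (dict(attrs).get("class") or "").split()
        if self.depth > 0:
            self.depth += 1  # nested div inside the matching one
        elif self.target_class in classes:
            self.depth = 1  # entering a matching div

    def handle_endtag(self, tag):
        if tag == "div" and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0 and data.strip():
            self.chunks.append(data.strip())


sample_html = """
<html><body>
  <div class="site-header">Navigation</div>
  <div class="entry-content wp-block-post-content">
    <p>First paragraph.</p>
    <div class="inner"><p>Nested paragraph.</p></div>
  </div>
  <div class="footer">Footer links</div>
</body></html>
"""

parser = DivTextExtractor("entry-content")
parser.feed(sample_html)
article_text = "\n".join(parser.chunks)
print(article_text)
```

Filtering at parse time drops navigation, footers, and similar page chrome before the text reaches the Document, which is the same goal `parse_only` serves in the WebBaseLoader cell.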

In [23]:
import bs4
from langchain_community.document_loaders import WebBaseLoader

# Load news article content using WebBaseLoader
loader = WebBaseLoader(
   web_paths=("https://techcrunch.com/2024/12/28/google-ceo-says-ai-model-gemini-will-the-companys-biggest-focus-in-2025/",),
   # Configure BeautifulSoup to parse only specific div elements
   bs_kwargs=dict(
       parse_only=bs4.SoupStrainer(
           "div",
           attrs={"class": ["entry-content wp-block-post-content is-layout-constrained wp-block-post-content-is-layout-constrained"]},
       )
   ),
   # Set the User-Agent request header to mimic a browser
   header_template={
       "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.0.0 Safari/537.36",
   },
)

# Load and process the documents
docs = loader.load()
print(f"Number of documents: {len(docs)}")
docs[0]



Number of documents: 1


Document(metadata={'source': 'https://techcrunch.com/2024/12/28/google-ceo-says-ai-model-gemini-will-the-companys-biggest-focus-in-2025/'}, page_content='\nGoogle CEO Sundar Pichai reportedly told Google employees that 2025 will be a “critical” year for the company.\nCNBC reports that it obtained audio from a December 18 strategy meeting where Pichai and other executives put on ugly holiday sweaters and laid out their priorities for the coming year.\n\n\n\n\n\n\n\n\n“I think 2025 will be critical,” Pichai said. “I think it’s really important we internalize the urgency of this moment, and need to move faster as a company. The stakes are high.”\nThe moment, of course, is one where tech companies like Google are making heavy investments in AI, and often with mixed results. Pichai acknowledged that the company has some catching up to do on the AI side — he described the Gemini app (based on the company’s AI model of the same name) as having “strong momentum,” while also acknowledging “we h

In [24]:
# Disable SSL certificate verification (use only for trusted sites or local testing)
loader.requests_kwargs = {"verify": False}

# Load documents from the web
docs = loader.load()
docs[0]



Document(metadata={'source': 'https://techcrunch.com/2024/12/28/google-ceo-says-ai-model-gemini-will-the-companys-biggest-focus-in-2025/'}, page_content='\nGoogle CEO Sundar Pichai reportedly told Google employees that 2025 will be a “critical” year for the company.\nCNBC reports that it obtained audio from a December 18 strategy meeting where Pichai and other executives put on ugly holiday sweaters and laid out their priorities for the coming year.\n\n\n\n\n\n\n\n\n“I think 2025 will be critical,” Pichai said. “I think it’s really important we internalize the urgency of this moment, and need to move faster as a company. The stakes are high.”\nThe moment, of course, is one where tech companies like Google are making heavy investments in AI, and often with mixed results. Pichai acknowledged that the company has some catching up to do on the AI side — he described the Gemini app (based on the company’s AI model of the same name) as having “strong momentum,” while also acknowledging “we h

## 3. CSVLoader
LangChain's CSVLoader allows you to load structured CSV data into a list of Document objects, making it ready for use with language models in tasks like:

- Summarization
- Q&A over tabular data
- Semantic search or vector storage

In [25]:
from langchain_community.document_loaders import CSVLoader

loader = CSVLoader("/content/avocado.csv")

docs = loader.load()

# Print the first two rows, each loaded as its own Document
for record in docs[:2]:
    print(record)

page_content=': 0
Date: 2015-12-27
AveragePrice: 1.33
Total Volume: 64236.62
4046: 1036.74
4225: 54454.85
4770: 48.16
Total Bags: 8696.87
Small Bags: 8603.62
Large Bags: 93.25
XLarge Bags: 0.0
type: conventional
year: 2015
region: Albany' metadata={'source': '/content/avocado.csv', 'row': 0}
page_content=': 1
Date: 2015-12-20
AveragePrice: 1.35
Total Volume: 54876.98
4046: 674.28
4225: 44638.81
4770: 58.33
Total Bags: 9505.56
Small Bags: 9408.07
Large Bags: 97.49
XLarge Bags: 0.0
type: conventional
year: 2015
region: Albany' metadata={'source': '/content/avocado.csv', 'row': 1}


In [26]:
print(docs[1].page_content)

: 1
Date: 2015-12-20
AveragePrice: 1.35
Total Volume: 54876.98
4046: 674.28
4225: 44638.81
4770: 58.33
Total Bags: 9505.56
Small Bags: 9408.07
Large Bags: 97.49
XLarge Bags: 0.0
type: conventional
year: 2015
region: Albany
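
The `page_content` printed above has one `column: value` line per CSV column, and the metadata records the source file and row index. The sketch below approximates that shape with only the standard library. `rows_to_documents` and the inline sample data are hypothetical and simplified (plain dicts instead of Document objects); this is not CSVLoader's implementation.

```python
import csv
import io

# Small inline sample using a few columns from the output above
sample_csv = (
    "Date,AveragePrice,region\n"
    "2015-12-27,1.33,Albany\n"
    "2015-12-20,1.35,Albany\n"
)


def rows_to_documents(text, source):
    """Turn each CSV row into a dict with page_content and metadata."""
    documents = []
    for index, row in enumerate(csv.DictReader(io.StringIO(text))):
        content = "\n".join(f"{key}: {value}" for key, value in row.items())
        documents.append(
            {"page_content": content, "metadata": {"source": source, "row": index}}
        )
    return documents


csv_docs = rows_to_documents(sample_csv, source="sample.csv")
print(csv_docs[1]["page_content"])
print(csv_docs[1]["metadata"])
```

Keeping one document per row means each record can be embedded and retrieved independently, at the cost of losing cross-row context.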


## 4. ArxivLoader
arXiv is an open-access archive of 2 million scholarly articles in the fields of physics, mathematics, computer science, quantitative biology, quantitative finance, statistics, electrical engineering and systems science, and economics.

To use ArxivLoader, install the `arxiv`, `PyMuPDF`, and `langchain-community` packages. PyMuPDF converts the PDF files downloaded from arxiv.org into text.

In [29]:
from langchain_community.document_loaders import ArxivLoader

# Enter the research topic you want to search for in the query parameter
loader = ArxivLoader(
    query="Chain of thought",
    load_max_docs=2,  # max number of documents
    load_all_available_meta=True,  # load all available metadata
)

In [33]:
# Print the first document's content and metadata
docs = loader.load()
print(docs[0].page_content[:100])
print(docs[0].metadata)

Contrastive Chain-of-Thought Prompting
Yew Ken Chia∗1,
Guizhen Chen∗1, 2
Luu Anh Tuan2
Soujanya Pori
{'Published': '2023-11-15', 'Title': 'Contrastive Chain-of-Thought Prompting', 'Authors': 'Yew Ken Chia, Guizhen Chen, Luu Anh Tuan, Soujanya Poria, Lidong Bing', 'Summary': 'Despite the success of chain of thought in enhancing language model\nreasoning, the underlying process remains less well understood. Although\nlogically sound reasoning appears inherently crucial for chain of thought,\nprior studies surprisingly reveal minimal impact when using invalid\ndemonstrations instead. Furthermore, the conventional chain of thought does not\ninform language models on what mistakes to avoid, which potentially leads to\nmore errors. Hence, inspired by how humans can learn from both positive and\nnegative examples, we propose contrastive chain of thought to enhance language\nmodel reasoning. Compared to the conventional chain of thought, our approach\nprovides both valid and invalid reasoning 
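
The metadata printed above stores the publication date and the author list as plain strings. If you need to filter or sort on them, a small parsing step helps. This is a sketch under the assumption that `Published` is an ISO date and `Authors` is comma-separated, as in the output above; `parse_arxiv_metadata` is a hypothetical helper, not part of ArxivLoader.

```python
from datetime import date

# Fields copied from the metadata shown in the output above
metadata = {
    "Published": "2023-11-15",
    "Title": "Contrastive Chain-of-Thought Prompting",
    "Authors": "Yew Ken Chia, Guizhen Chen, Luu Anh Tuan, Soujanya Poria, Lidong Bing",
}


def parse_arxiv_metadata(meta):
    """Convert string fields into a date object and a list of author names."""
    return {
        "title": meta["Title"],
        "published": date.fromisoformat(meta["Published"]),
        "authors": [name.strip() for name in meta["Authors"].split(",")],
    }


parsed = parse_arxiv_metadata(metadata)
print(parsed["published"].year, len(parsed["authors"]))
```

With loaded documents you would call it as `parse_arxiv_metadata(docs[0].metadata)`.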

## 5. TextLoader
This section uses LangChain's TextLoader to load a single plain-text file into one Document.

In [34]:
from langchain_community.document_loaders import TextLoader

# Create a text loader
loader = TextLoader("/content/Text summarization.txt", encoding="utf-8")

# Load the document
docs = loader.load()
print(f"Number of documents: {len(docs)}\n")
print("[Metadata]\n")
print(docs[0].metadata)
print("\n========= [Preview - First 500 Characters] =========\n")
print(docs[0].page_content[:500])

Number of documents: 1

[Metadata]

{'source': '/content/Text summarization.txt'}


Project Title:
Text Summarization Assistant

Objective:
The goal of this project is to develop an intelligent assistant capable of generating concise and informative summaries from longer bodies of text. The assistant will support extractive and abstractive summarization techniques and be designed to help project managers quickly digest reports, meeting transcripts, documentation, and other text-heavy resources.

Key Features:

Input Processing: Accepts plain text or documents (PDF, DOCX, etc.) 


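## 6. DirectoryLoader
DirectoryLoader loads the files in a directory, using the class passed as `loader_cls` to load each one. Here `TextLoader` reads every file as plain text, so the CSV and JSON files in the directory are loaded as raw text rather than parsed.
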
In [37]:
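from langchain_community.document_loaders import DirectoryLoader, TextLoader

path = "/content/sample_data"

# If the default encoding fails, let TextLoader try to detect the file's encoding
text_loader_kwargs = {"autodetect_encoding": True}

loader = DirectoryLoader(
    path,
    loader_cls=TextLoader,  # loader class used for every file
    silent_errors=True,  # skip files that fail to load instead of raising
    loader_kwargs=text_loader_kwargs,
)
docs = loader.load()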

In [38]:
doc_sources = [doc.metadata["source"] for doc in docs]
doc_sources

['/content/sample_data/anscombe.json',
 '/content/sample_data/README.md',
 '/content/sample_data/california_housing_train.csv',
 '/content/sample_data/california_housing_test.csv',
 '/content/sample_data/mnist_test.csv',
 '/content/sample_data/mnist_train_small.csv']
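
The behavior configured above (walk a directory, try more than one encoding, skip files that cannot be read) can be sketched with the standard library. `load_directory` and its parameters are hypothetical names chosen to mirror the options used in the cell; this is an illustration, not DirectoryLoader's implementation.

```python
import tempfile
from pathlib import Path


def load_directory(root, pattern="**/*", encodings=("utf-8", "latin-1"), silent_errors=True):
    """Read every file under root, trying each encoding in order."""
    documents = []
    for path in sorted(Path(root).glob(pattern)):
        if not path.is_file():
            continue
        raw = path.read_bytes()
        text = None
        for encoding in encodings:
            try:
                text = raw.decode(encoding)
                break
            except UnicodeDecodeError:
                continue
        if text is None:
            if silent_errors:
                continue  # skip unreadable files instead of failing the whole load
            raise ValueError(f"Could not decode {path}")
        documents.append({"page_content": text, "metadata": {"source": str(path)}})
    return documents


with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "a.txt").write_text("hello", encoding="utf-8")
    (Path(tmp) / "b.txt").write_bytes("café".encode("latin-1"))  # not valid UTF-8

    demo_docs = load_directory(tmp)  # falls back to latin-1 for b.txt
    strict_docs = load_directory(tmp, encodings=("utf-8",))  # b.txt is skipped

demo_names = [Path(d["metadata"]["source"]).name for d in demo_docs]
print(demo_names, len(strict_docs))
```

Skipping unreadable files keeps a large batch load from failing on one bad file, but it also hides problems, so it is worth comparing the number of loaded documents against the number of files you expected.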

In [39]:
print("[Metadata]\n")
print(docs[0].metadata)
print("\n========= [Preview - First 500 Characters] =========\n")
print(docs[0].page_content[:500])

[Metadata]

{'source': '/content/sample_data/anscombe.json'}


[
  {"Series":"I", "X":10.0, "Y":8.04},
  {"Series":"I", "X":8.0, "Y":6.95},
  {"Series":"I", "X":13.0, "Y":7.58},
  {"Series":"I", "X":9.0, "Y":8.81},
  {"Series":"I", "X":11.0, "Y":8.33},
  {"Series":"I", "X":14.0, "Y":9.96},
  {"Series":"I", "X":6.0, "Y":7.24},
  {"Series":"I", "X":4.0, "Y":4.26},
  {"Series":"I", "X":12.0, "Y":10.84},
  {"Series":"I", "X":7.0, "Y":4.81},
  {"Series":"I", "X":5.0, "Y":5.68},

  {"Series":"II", "X":10.0, "Y":9.14},
  {"Series":"II", "X":8.0, "Y":8.14},
  {"Ser


In [40]:
print("[Metadata]\n")
print(docs[1].metadata)
print("\n========= [Preview - First 500 Characters] =========\n")
print(docs[1].page_content[:500])

[Metadata]

{'source': '/content/sample_data/README.md'}


This directory includes a few sample datasets to get you started.

*   `california_housing_data*.csv` is California housing data from the 1990 US
    Census; more information is available at:
    https://docs.google.com/document/d/e/2PACX-1vRhYtsvc5eOR2FWNCwaBiKL6suIOrxJig8LcSBbmCbyYsayia_DvPOOBlXZ4CAlQ5nlDD8kTaIDRwrN/pub

*   `mnist_*.csv` is a small sample of the
    [MNIST database](https://en.wikipedia.org/wiki/MNIST_database), which is
    described at: http://yann.lecun.com/exdb/mnist/

* 


In [41]:
print("[Metadata]\n")
print(docs[3].metadata)
print("\n========= [Preview - First 500 Characters] =========\n")
print(docs[3].page_content[:500])

[Metadata]

{'source': '/content/sample_data/california_housing_test.csv'}


"longitude","latitude","housing_median_age","total_rooms","total_bedrooms","population","households","median_income","median_house_value"
-122.050000,37.370000,27.000000,3885.000000,661.000000,1537.000000,606.000000,6.608500,344700.000000
-118.300000,34.260000,43.000000,1510.000000,310.000000,809.000000,277.000000,3.599000,176500.000000
-117.810000,33.780000,27.000000,3589.000000,507.000000,1484.000000,495.000000,5.793400,270500.000000
-118.360000,33.820000,28.000000,67.000000,15.000000,49.00000
