# **Document Loaders**
Use document loaders to load data from a source as `Document`.

Document loaders provide a `load` method that reads data from a configured source and returns it as a list of documents. They may also implement `lazy_load`, which yields documents one at a time instead of holding them all in memory.
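
To make the difference between eager and lazy loading concrete, here is a minimal sketch that uses only the standard library. The `Document` dataclass is a stand-in for LangChain's class, and `LineLoader` is a hypothetical loader invented for illustration; neither is the library's implementation.

```python
from dataclasses import dataclass, field
from typing import Iterator, List


@dataclass
class Document:
    """Stand-in for LangChain's Document: text plus a metadata dict."""
    page_content: str
    metadata: dict = field(default_factory=dict)


class LineLoader:
    """Hypothetical loader that emits one Document per line of text."""

    def __init__(self, text: str, source: str = "in-memory") -> None:
        self.text = text
        self.source = source

    def lazy_load(self) -> Iterator[Document]:
        # Generator: each Document is built only when the caller asks for it.
        for number, line in enumerate(self.text.splitlines()):
            yield Document(line, {"source": self.source, "line": number})

    def load(self) -> List[Document]:
        # Eager variant: drain the generator into a list held in memory.
        return list(self.lazy_load())


loader = LineLoader("first\nsecond\nthird")
docs = loader.load()              # all three Documents at once
first = next(loader.lazy_load())  # only the first Document is built
print(len(docs), first.page_content)
```

For large sources, iterating over the lazy variant avoids holding every document in memory at the same time.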

1. **Text Loaders**
```python
from langchain_community.document_loaders import TextLoader
loader = TextLoader("./index.md")
loader.load()
```

2. **CSV**
```python
from langchain_community.document_loaders.csv_loader import CSVLoader
loader = CSVLoader(file_path='./example_data/mlb_teams_2012.csv')
data = loader.load()
```
3. **HTML**
```python
from langchain_community.document_loaders import UnstructuredHTMLLoader
loader = UnstructuredHTMLLoader("example_data/fake-content.html")
data = loader.load()
```
4. **Web Base Loader**
```python
# !pip install beautifulsoup4
from langchain_community.document_loaders import WebBaseLoader
loader = WebBaseLoader("https://docs.smith.langchain.com/user_guide")
data = loader.load()
```
5. **JSON**  
[Click Here](https://python.langchain.com/docs/modules/data_connection/document_loaders/json) for detailed docs.  
Suppose we are interested in extracting the values under the `content` field within the `messages` key of the JSON data. This can easily be done through the JSONLoader as shown below.
```python
#!pip install jq
from langchain_community.document_loaders import JSONLoader
loader = JSONLoader(
    file_path='./example_data/facebook_chat.json',
    jq_schema='.messages[].content',
    text_content=False)

data = loader.load()
```
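
To see what the `jq_schema` expression `.messages[].content` selects, the sketch below performs the same extraction in plain Python. The sample JSON is a hypothetical stand-in shaped like a chat export, not the contents of `facebook_chat.json`.

```python
import json

# Hypothetical sample shaped like a chat export; not the real file.
raw = """
{
  "title": "Example chat",
  "messages": [
    {"sender_name": "User 1", "content": "Bye!"},
    {"sender_name": "User 2", "content": "See you later."}
  ]
}
"""

data = json.loads(raw)

# Plain-Python equivalent of the jq expression '.messages[].content':
# iterate over the "messages" array and pull out each "content" value.
contents = [message["content"] for message in data["messages"]]

print(contents)
```

Each extracted value would then become the `page_content` of one document.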

6. **PDF**  
[Click here](https://python.langchain.com/docs/modules/data_connection/document_loaders/pdf) for detailed docs.  
Make sure to install: `pip install pypdf`

```python
from langchain_community.document_loaders import PyPDFLoader

loader = PyPDFLoader("example_data/layout-parser-paper.pdf")
pages = loader.load_and_split()
```

7. **File Directory**  
[Click Here](https://python.langchain.com/docs/modules/data_connection/document_loaders/file_directory) for detailed docs.  
By default it uses UnstructuredLoader under the hood; the example below passes `loader_cls=TextLoader` instead.  
Make sure to install: `pip install "unstructured[all-docs]"`  
This section covers how to load all documents in a directory. The `glob` parameter controls which files are loaded.

```python
from langchain_community.document_loaders import DirectoryLoader
from langchain_community.document_loaders import TextLoader

loader = DirectoryLoader('../', glob="**/*.md", show_progress=True, loader_cls=TextLoader)
docs = loader.load()
```
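
The effect of the `glob` pattern can be previewed without LangChain. The sketch below assumes `DirectoryLoader` resolves the pattern with pathlib-style globbing; the file names are hypothetical and are created in a temporary directory.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "docs" / "guide").mkdir(parents=True)
    # Hypothetical files used only to demonstrate the patterns.
    for relative in ["README.md", "notes.txt", "docs/intro.md", "docs/guide/setup.md"]:
        (root / relative).write_text("placeholder")

    # "*.md" matches Markdown files directly inside the root directory.
    top_level = sorted(p.relative_to(root).as_posix() for p in root.glob("*.md"))
    # "**/*.md" also descends into every subdirectory.
    recursive = sorted(p.relative_to(root).as_posix() for p in root.glob("**/*.md"))

print(top_level)
print(recursive)
```

Previewing the matches this way is a cheap check before pointing a loader at a large directory tree.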

In [3]:
#!pip install "unstructured[all-docs]"
#!pip install jq
#!pip install pypdf
#!pip install pymupdf

In [5]:
!pip install langchain_community

Collecting langchain_community
  Downloading langchain_community-0.3.4-py3-none-any.whl.metadata (2.9 kB)
Collecting dataclasses-json<0.7,>=0.5.7 (from langchain_community)
  Downloading dataclasses_json-0.6.7-py3-none-any.whl.metadata (25 kB)
Collecting httpx-sse<0.5.0,>=0.4.0 (from langchain_community)
  Downloading httpx_sse-0.4.0-py3-none-any.whl.metadata (9.0 kB)
Collecting langchain<0.4.0,>=0.3.6 (from langchain_community)
  Downloading langchain-0.3.6-py3-none-any.whl.metadata (7.1 kB)
Collecting langchain-core<0.4.0,>=0.3.14 (from langchain_community)
  Downloading langchain_core-0.3.15-py3-none-any.whl.metadata (6.3 kB)
Collecting pydantic-settings<3.0.0,>=2.4.0 (from langchain_community)
  Downloading pydantic_settings-2.6.0-py3-none-any.whl.metadata (3.5 kB)
Collecting marshmallow<4.0.0,>=3.18.0 (from dataclasses-json<0.7,>=0.5.7->langchain_community)
  Downloading marshmallow-3.23.0-py3-none-any.whl.metadata (7.6 kB)
Collecting typing-inspect<1,>=0.4.0 (from dataclasses-jso

In [17]:
from langchain_community.document_loaders import DirectoryLoader  # to support document loading
from langchain_community.document_loaders import TextLoader

loader = DirectoryLoader('../', glob="**/*.md", show_progress=True, loader_cls=TextLoader)
# docs = loader.load()

# Loading one .srt File

In [7]:
# Web Based Loader
#!pip install beautifulsoup4

In [8]:
from langchain_community.document_loaders import WebBaseLoader



In [9]:
loader = WebBaseLoader("https://docs.smith.langchain.com/user_guide")


In [10]:
data = loader.load()

In [11]:
type(data)

list

In [12]:
len(data)

1

In [13]:
data[0]

Document(metadata={'source': 'https://docs.smith.langchain.com/user_guide', 'title': 'LangSmith User Guide | 🦜️🛠️ LangSmith', 'description': 'LangSmith is a platform for LLM application development, monitoring, and testing. In this guide, we’ll highlight the breadth of workflows LangSmith supports and how they fit into each stage of the application development lifecycle. We hope this will inform users how to best utilize this powerful platform or give them something to consider if they’re just starting their journey.', 'language': 'en'}, page_content="\n\n\n\n\nLangSmith User Guide | 🦜️🛠️ LangSmith\n\n\n\n\n\n\nSkip to main contentGo to API DocsSearchRegionUSEUGo to AppQuick StartUser GuideTracingEvaluationProduction Monitoring & AutomationsPrompt HubProxyPricingSelf-HostingCookbookThis is outdated documentation for 🦜️🛠️ LangSmith, which is no longer actively maintained.For up-to-date documentation, see the latest version.

In [14]:
print(data)

[Document(metadata={'source': 'https://docs.smith.langchain.com/user_guide', 'title': 'LangSmith User Guide | 🦜️🛠️ LangSmith', 'description': 'LangSmith is a platform for LLM application development, monitoring, and testing. In this guide, we’ll highlight the breadth of workflows LangSmith supports and how they fit into each stage of the application development lifecycle. We hope this will inform users how to best utilize this powerful platform or give them something to consider if they’re just starting their journey.', 'language': 'en'}, page_content="\n\n\n\n\nLangSmith User Guide | 🦜️🛠️ LangSmith\n\n\n\n\n\n\nSkip to main contentGo to API DocsSearchRegionUSEUGo to AppQuick StartUser GuideTracingEvaluationProduction Monitoring & AutomationsPrompt HubProxyPricingSelf-HostingCookbookThis is outdated documentation for 🦜️🛠️ LangSmith, which is no longer actively maintained.For up-to-date documentation, see the latest version

# Text Loader

In [19]:
from langchain_community.document_loaders import TextLoader
loader = TextLoader("/content/Friends_2x01.srt")
data1 = loader.load()

In [20]:
print("Type of Data Variable: ", type(data1))
print()
print("Number of Documents: ", len(data1))
print()
print("Type of each data point:", type(data1[0]))
print()
print("Metadata: ", data1[0].metadata)
print()
print("Page Content:", data1[0].page_content[:200])

Type of Data Variable:  <class 'list'>

Number of Documents:  1

Type of each data point: <class 'langchain_core.documents.base.Document'>

Metadata:  {'source': '/content/Friends_2x01.srt'}

Page Content: 1
00:00:01,435 --> 00:00:04,082
This is pretty much
what's happened so far.

2
00:00:04,395 --> 00:00:07,179
Ross was in love
with Rachel since forever.

3
00:00:07,423 --> 00:00:10,437
Every time he 


# Loading all .srt Files

In [22]:
from langchain_community.document_loaders import DirectoryLoader
loader = DirectoryLoader("/content", glob="*.srt", show_progress=True, loader_cls=TextLoader)
data2 = loader.load()

100%|██████████| 10/10 [00:00<00:00, 5001.55it/s]


In [24]:
print("Type of Data Variable:", type(data2))

print("Number of Documents:", len(data2))

Type of Data Variable: <class 'list'>
Number of Documents: 10


# Load .csv File

In [25]:
from langchain_community.document_loaders.csv_loader import CSVLoader
loader = CSVLoader(file_path="/content/movies_data.csv")
data3 = loader.load()

In [27]:
print("Type of loaded data:", type(data3))

print("Number of data points:", len(data3))

print("Type of each data point:", type(data3[0]))

Type of loaded data: <class 'list'>
Number of data points: 436
Type of each data point: <class 'langchain_core.documents.base.Document'>


In [28]:
data3[:5]

[Document(metadata={'source': '/content/movies_data.csv', 'row': 0}, page_content="movieId: 1\ntitle: Toy Story (1995)\ngenres: ['Adventure', 'Animation', 'Children', 'Comedy', 'Fantasy']"),
 Document(metadata={'source': '/content/movies_data.csv', 'row': 1}, page_content="movieId: 2\ntitle: Jumanji (1995)\ngenres: ['Adventure', 'Children', 'Fantasy']"),
 Document(metadata={'source': '/content/movies_data.csv', 'row': 2}, page_content="movieId: 3\ntitle: Grumpier Old Men (1995)\ngenres: ['Comedy', 'Romance']"),
 Document(metadata={'source': '/content/movies_data.csv', 'row': 3}, page_content="movieId: 6\ntitle: Heat (1995)\ngenres: ['Action', 'Crime', 'Thriller']"),
 Document(metadata={'source': '/content/movies_data.csv', 'row': 4}, page_content="movieId: 7\ntitle: Sabrina (1995)\ngenres: ['Comedy', 'Romance']")]

In [29]:
print(data3[0].page_content)

movieId: 1
title: Toy Story (1995)
genres: ['Adventure', 'Animation', 'Children', 'Comedy', 'Fantasy']
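
The output above suggests that each CSV row becomes one document whose `page_content` lists `column: value` pairs, with the row index stored in the metadata. The sketch below approximates that layout with the standard library only. It is not `CSVLoader`'s implementation, and the two-row sample and the `"in-memory"` source label are assumptions made for illustration.

```python
import csv
import io

# Hypothetical two-row sample; the real file has more columns and rows.
raw = "movieId,title\n1,Toy Story (1995)\n2,Jumanji (1995)\n"

documents = []
for row_number, row in enumerate(csv.DictReader(io.StringIO(raw))):
    # One "column: value" line per field, joined by newlines.
    page_content = "\n".join(f"{column}: {value}" for column, value in row.items())
    documents.append(
        {
            "page_content": page_content,
            "metadata": {"source": "in-memory", "row": row_number},
        }
    )

print(documents[0]["page_content"])
```

Because every row is a separate document, the number of documents matches the number of data rows in the file.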


In [None]:
from