## **Document Loaders in LangChain**

### **1. Text File Example**

In [1]:
from langchain_community.document_loaders import TextLoader

loader = TextLoader('./example_data/example_text.md')

data = loader.load()

In [2]:
print("Type of Data Variable: ", type(data))
print()
print("Number of Documents: ", len(data))
print()
print("Type of each data point:", type(data[0]))
print()
print("Metadata: ", data[0].metadata)
print()
print("Page Content:")
print(data[0].page_content[:200])

Type of Data Variable:  <class 'list'>

Number of Documents:  1

Type of each data point: <class 'langchain_core.documents.base.Document'>

Metadata:  {'source': './example_data/example_text.md'}

Page Content:
# Sample Markdown File  
This is a sample text file for practice.  
It contains headings, bullet points, and plain text.  

## Key Points  
- Markdown is lightweight and simple.  
- It is widely used 
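
The output above shows the shape every loader in this notebook returns: a list of `Document` objects, each carrying `page_content` and `metadata`. As a rough sketch only, the stdlib-only stand-in below mimics that shape for a text file. `SimpleDocument` and `load_text` are hypothetical names made up for illustration, not LangChain APIs, and the real `TextLoader` offers more (for example, an `encoding` argument).

```python
from dataclasses import dataclass, field
from pathlib import Path
import tempfile


@dataclass
class SimpleDocument:
    """Hypothetical stand-in for LangChain's Document (not the real class)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def load_text(path: str, encoding: str = "utf-8") -> list[SimpleDocument]:
    """Read the whole file into one document, recording the path as 'source'."""
    text = Path(path).read_text(encoding=encoding)
    return [SimpleDocument(page_content=text, metadata={"source": path})]


# Demo on a temporary file so the sketch runs without example_data/.
with tempfile.TemporaryDirectory() as tmp:
    sample_path = str(Path(tmp) / "sample.md")
    Path(sample_path).write_text("# Sample Markdown File\nHello.\n", encoding="utf-8")
    docs = load_text(sample_path)

print(len(docs))
print(docs[0].metadata)
```

One file maps to one document here, which matches the `Number of Documents:  1` line in the output above.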


### **2. CSV Loaders**

In [3]:
from langchain_community.document_loaders import CSVLoader

loader = CSVLoader("./example_data/mlb_teams_2012.csv")

data = loader.load()

In [4]:
print("Type of Data Variable: ", type(data))
print()
print("Number of Documents: ", len(data))
print()
print("Type of each data point:", type(data[0]))
print()
print("Metadata: ", data[0].metadata)
print("Metadata: ", data[1].metadata)
print()
print("Page Content:")
print(data[0].page_content[:200])
print(data[1].page_content[:200])

Type of Data Variable:  <class 'list'>

Number of Documents:  4

Type of each data point: <class 'langchain_core.documents.base.Document'>

Metadata:  {'source': './example_data/mlb_teams_2012.csv', 'row': 0}
Metadata:  {'source': './example_data/mlb_teams_2012.csv', 'row': 1}

Page Content:
Team: Yankees
City: New York
Championships: 27
Team: Red Sox
City: Boston
Championships: 9
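
As the output shows, each CSV row becomes its own document, with the row index stored in `metadata` and the columns rendered as `column: value` lines. Below is a minimal stdlib sketch of that mapping, assuming a header row and using plain dicts in place of `Document` objects. `rows_to_docs` is a hypothetical helper, not how `CSVLoader` is implemented, and the sample data contains only the two rows visible in the output above.

```python
import csv
import io

# The two rows visible in the output above; header names are inferred from it.
SAMPLE_CSV = "Team,City,Championships\nYankees,New York,27\nRed Sox,Boston,9\n"


def rows_to_docs(csv_text: str, source: str) -> list[dict]:
    """Turn each CSV row into a dict with 'page_content' and 'metadata'."""
    docs = []
    for i, row in enumerate(csv.DictReader(io.StringIO(csv_text))):
        # Render every column as a "name: value" line, one line per column.
        content = "\n".join(f"{key}: {value}" for key, value in row.items())
        docs.append({"page_content": content, "metadata": {"source": source, "row": i}})
    return docs


docs = rows_to_docs(SAMPLE_CSV, "./example_data/mlb_teams_2012.csv")
print(docs[0]["page_content"])
print(docs[1]["metadata"])
```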


### **3. HTML Loaders**

In [5]:
# !pip install "unstructured[all-docs]"

In [6]:
from langchain_community.document_loaders import UnstructuredHTMLLoader
loader = UnstructuredHTMLLoader("./example_data/fake-content.html")
data = loader.load()

In [7]:
print("Type of Data Variable: ", type(data))
print()
print("Number of Documents: ", len(data))
print()
print("Type of each data point:", type(data[0]))
print()
print("Metadata: ", data[0].metadata)
print()
print("Page Content:")
print(data[0].page_content[:200])

Type of Data Variable:  <class 'list'>

Number of Documents:  1

Type of each data point: <class 'langchain_core.documents.base.Document'>

Metadata:  {'source': './example_data/fake-content.html'}

Page Content:
Welcome to LangChain Practice

This is a sample HTML file to practice with document loaders.
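
The loader returns the page's visible text with the markup removed. To illustrate the idea only, here is a stdlib sketch built on `html.parser`. It is not how the `unstructured` package works internally, and the `TextExtractor` class and the sample markup are assumptions made for this example.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0
        self.chunks: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.chunks.append(text)


# Hypothetical markup, loosely modelled on the output above.
SAMPLE_HTML = """
<html><head><style>p { color: red; }</style></head>
<body><h1>Welcome to LangChain Practice</h1>
<p>This is a sample HTML file to practice with document loaders.</p></body></html>
"""

parser = TextExtractor()
parser.feed(SAMPLE_HTML)
parser.close()
page_content = "\n\n".join(parser.chunks)
print(page_content)
```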


### **4. Web Base Loader** 

In [8]:
from langchain_community.document_loaders import WebBaseLoader

loader = WebBaseLoader("https://docs.smith.langchain.com/user_guide")

data = loader.load()

USER_AGENT environment variable not set, consider setting it to identify your requests.


In [9]:
print("Type of Data Variable: ", type(data))
print()
print("Number of Documents: ", len(data))
print()
print("Type of each data point:", type(data[0]))
print()
print("Metadata: ", data[0].metadata['source'])
print("Metadata: ", data[0].metadata['title'])
print("Metadata: ", data[0].metadata['description'][:55])
print()
print("Page Content:")
print(data[0].page_content[:200])

Type of Data Variable:  <class 'list'>

Number of Documents:  1

Type of each data point: <class 'langchain_core.documents.base.Document'>

Metadata:  https://docs.smith.langchain.com/user_guide
Metadata:  LangSmith User Guide | 🦜️🛠️ LangSmith
Metadata:  LangSmith is a platform for LLM application development

Page Content:





LangSmith User Guide | 🦜️🛠️ LangSmith






Skip to main contentGo to API DocsSearchRegionUSEUGo to AppQuick StartUser GuideTracingEvaluationProduction Monitoring & AutomationsPrompt Hu
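
The `USER_AGENT` warning printed in this section asks for an environment variable that identifies your requests. A small sketch of setting it follows. The agent string is a made-up placeholder, and placing the assignment before the import is an assumption based on the warning appearing as soon as the loader cell runs.

```python
import os

# Hypothetical identifier; replace it with something that describes your application.
# setdefault keeps any value that is already present in the environment.
os.environ.setdefault("USER_AGENT", "langchain-loaders-demo/0.1")

# Import the loader only after the variable is set (left commented out so this
# sketch runs without langchain_community installed):
# from langchain_community.document_loaders import WebBaseLoader
# loader = WebBaseLoader("https://docs.smith.langchain.com/user_guide")

print(os.environ["USER_AGENT"])
```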


### **5. JSON Loaders**

In [10]:
# !pip install jq

In [11]:
from langchain_community.document_loaders import JSONLoader

loader = JSONLoader(
    file_path='./example_data/facebook_chat.json',
    jq_schema='.messages[].content',
    text_content=False
)

data = loader.load()

In [12]:
print("Type of Data Variable: ", type(data))
print()
print("Number of Documents: ", len(data))
print()
print("Type of each data point:", type(data[0]))
print()
print("Metadata: ", data[0].metadata)
print("Metadata: ", data[1].metadata)
print()
print("Page Content:")
print(data[0].page_content[:200])
print(data[1].page_content[:200])

Type of Data Variable:  <class 'list'>

Number of Documents:  2

Type of each data point: <class 'langchain_core.documents.base.Document'>

Metadata:  {'source': 'C:\\Users\\91889\\OneDrive\\Desktop\\RAG\\example_data\\facebook_chat.json', 'seq_num': 1}
Metadata:  {'source': 'C:\\Users\\91889\\OneDrive\\Desktop\\RAG\\example_data\\facebook_chat.json', 'seq_num': 2}

Page Content:
Hi, how are you?
I'm fine, thank you!
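
The `jq_schema` expression `.messages[].content` selects the `content` field of every entry in the top-level `messages` array, and each selected value becomes one document. A plain-Python equivalent of that selection is sketched below. The sample JSON is reconstructed from the two messages in the output above (any other fields in the real file are unknown and omitted), and `select_contents` is a hypothetical helper.

```python
import json

# Reconstructed from the output above; the real file may contain more fields.
SAMPLE_JSON = """
{
  "messages": [
    {"content": "Hi, how are you?"},
    {"content": "I'm fine, thank you!"}
  ]
}
"""


def select_contents(raw: str) -> list[dict]:
    """Mimic `.messages[].content`: one record per message, numbered from 1."""
    payload = json.loads(raw)
    return [
        {"page_content": str(message["content"]), "metadata": {"seq_num": i}}
        for i, message in enumerate(payload["messages"], start=1)
    ]


docs = select_contents(SAMPLE_JSON)
for doc in docs:
    print(doc["metadata"]["seq_num"], doc["page_content"])
```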


### **6. PDF Loaders** 

In [13]:
# !pip install pypdf

In [14]:
from langchain_community.document_loaders import PyPDFLoader
loader = PyPDFLoader("example_data/example-paper.pdf")
pages = loader.load_and_split()

In [15]:
print("Type of Data Variable: ", type(pages))
print()
print("Number of Documents: ", len(pages))
print()
print("Type of each data point:", type(pages[0]))
print()
print("Metadata: ", pages[0].metadata)
print()
print("Page Content:")
print(pages[0].page_content[:200])

Type of Data Variable:  <class 'list'>

Number of Documents:  1

Type of each data point: <class 'langchain_core.documents.base.Document'>

Metadata:  {'source': 'example_data/example-paper.pdf', 'page': 0}

Page Content:
LangChain Practice 
 
This is a sample PDF file for testing document loaders.   
It includes plain text and can be split into pages for testing purposes.


### **Loading One .srt File**

In [16]:
from langchain_community.document_loaders import TextLoader

loader = TextLoader('./example_data/subtitles/Friends_2x01.srt')

data = loader.load()

In [24]:
print("Type of Data Variable: ", type(data))
print()
print("Number of Documents: ", len(data))
print()
print("Type of each data point:", type(data[0]))
print()
print("Metadata: ", data[0].metadata)
print()
print("Page Content:")
print(data[0].page_content[:152])

Type of Data Variable:  <class 'list'>

Number of Documents:  1

Type of each data point: <class 'langchain_core.documents.base.Document'>

Metadata:  {'source': './example_data/subtitles/Friends_2x01.srt'}

Page Content:
1
00:00:01,435 --> 00:00:04,082
This is pretty much
what's happened so far.

2
00:00:04,395 --> 00:00:07,179
Ross was in love
with Rachel since forever.


### **7. Loading Multiple .srt Files**

In [25]:
from langchain_community.document_loaders import DirectoryLoader

loader = DirectoryLoader('example_data/subtitles', glob="*.srt", show_progress=True, loader_cls=TextLoader)

data = loader.load()

100%|██████████| 10/10 [00:00<00:00, 61.41it/s]


In [26]:
print("Type of Data Variable: ", type(data))

print("Number of Documents:", len(data))

Type of Data Variable:  <class 'list'>
Number of Documents: 10
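
`DirectoryLoader` applies the `glob` pattern to the directory and hands each matching file to `loader_cls`, which is why the ten `.srt` files produce ten documents here. The selection step can be sketched with `pathlib` alone. The temporary directory and file names below are invented for the demo, and `load_matching` is a hypothetical helper rather than part of LangChain.

```python
from pathlib import Path
import tempfile


def load_matching(directory: str, pattern: str) -> list[dict]:
    """Read every file matching `pattern` into one record per file."""
    docs = []
    for path in sorted(Path(directory).glob(pattern)):
        if path.is_file():
            docs.append(
                {
                    "page_content": path.read_text(encoding="utf-8"),
                    "metadata": {"source": str(path)},
                }
            )
    return docs


# Demo directory with invented names: three subtitle files and one other file.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("ep01.srt", "ep02.srt", "ep03.srt", "notes.txt"):
        (Path(tmp) / name).write_text(f"contents of {name}", encoding="utf-8")
    docs = load_matching(tmp, "*.srt")

print(len(docs))
```

Only the files matching `*.srt` are picked up, so `notes.txt` is ignored in this demo.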
