## TextLoader

In [10]:
from langchain_community.document_loaders import TextLoader

loader = TextLoader('./cricket.txt', encoding='utf-8')

doc = loader.load()

print(f'Type of the document loader (It loads inside the list): {type(doc)}')
print(f'How many documents loaded: {len(doc)}')

print(f'First 50 characters of the first document: \n{doc[0].page_content[:50]}')
print(f'Metadata of the first document: {doc[0].metadata}')

Type of the document loader (It loads inside the list): <class 'list'>
How many documents loaded: 1
First 50 characters of the first document: 
Beneath the sun or floodlight's gleam,

Cricket li
Metadata of the first document: {'source': './cricket.txt'}


## PDF loader: PyPDFLoader

In [12]:
from langchain_community.document_loaders import PyPDFLoader

loader = PyPDFLoader('./dl-curriculum.pdf')

doc = loader.load()

print(f'Type of the document loader (It loads inside the list): {type(doc)}')
print(f'How many documents loaded: {len(doc)}')
print(f'First 50 characters of the first document: \n{doc[0].page_content[:50]}')
print(f'Metadata of the first document: {doc[0].metadata}')

Type of the document loader (It loads inside the list): <class 'list'>
How many documents loaded: 23
First 50 characters of the first document: 
CampusXDeepLearningCurriculum
A.ArtificialNeuralNe
Metadata of the first document: {'producer': 'Skia/PDF m131 Google Docs Renderer', 'creator': 'PyPDF', 'creationdate': '', 'title': 'Deep Learning Curriculum', 'source': './dl-curriculum.pdf', 'total_pages': 23, 'page': 0, 'page_label': '1'}


## DirectoryLoader

In [13]:
from langchain_community.document_loaders import DirectoryLoader, PyPDFLoader

loader = DirectoryLoader(
    path = 'books',
    glob = '*.pdf',
    loader_cls = PyPDFLoader
)

docs = loader.load()

print(f'Type of the document loader (It loads inside the list): {type(docs)}')
print(f'How many documents loaded: {len(docs)}')
print(f'First 50 characters of the first document: \n{docs[0].page_content[:50]}')
print(f'Metadata of the first document: {docs[0].metadata}')

Type of the document loader (It loads inside the list): <class 'list'>
How many documents loaded: 384
First 50 characters of the first document: 
A LiveCoMS Tutorial
Molecular Dynamics: From Basic
Metadata of the first document: {'producer': 'pdfTeX-1.40.24', 'creator': 'LaTeX with hyperref', 'creationdate': '2025-09-08T10:56:01+00:00', 'author': '', 'keywords': '', 'moddate': '2025-09-08T10:56:01+00:00', 'ptex.fullbanner': 'This is pdfTeX, Version 3.141592653-2.6-1.40.24 (TeX Live 2022) kpathsea version 6.3.4', 'subject': '', 'title': '', 'trapped': '/False', 'source': 'books/MolecularDynamics.pdf', 'total_pages': 58, 'page': 0, 'page_label': '1'}
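With `PyPDFLoader` as the `loader_cls`, each PDF page becomes its own document, so the count above is the total number of pages across every matched file. One way to see how those pages split by file is to group on `metadata['source']`. The sketch below uses plain dictionaries as a stand-in for the loaded documents; the file names and page counts are made up for illustration. With real documents you would read `d.metadata['source']` instead.

```python
from collections import Counter

# Stand-in for the list returned by loader.load(): one entry per PDF page.
# The file names and page counts are hypothetical, chosen only for the demo.
docs = (
    [{"metadata": {"source": "books/a.pdf", "page": p}} for p in range(3)]
    + [{"metadata": {"source": "books/b.pdf", "page": p}} for p in range(2)]
)

# Count how many page-level documents came from each source file.
pages_per_file = Counter(d["metadata"]["source"] for d in docs)

for source, pages in pages_per_file.items():
    print(f"{source}: {pages} pages")
```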


## Load vs LazyLoad

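- `load()`: eager loading. Every document is read up front and returned as a list, so all of them are in memory at once.
- `lazy_load()`: lazy loading. It returns a generator that yields one document at a time, so only the current document needs to be in memory.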
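The same contrast can be sketched without LangChain, using an ordinary Python generator. `read_lazily` and `read_eagerly` are hypothetical helpers written for this note, not library functions; they only mimic the contract of `lazy_load()` (an iterator) and `load()` (a list).

```python
import tempfile
from pathlib import Path
from typing import Iterator


def read_lazily(folder: Path) -> Iterator[str]:
    """Yield one file's text at a time (mimics lazy_load)."""
    for path in sorted(folder.glob("*.txt")):
        yield path.read_text(encoding="utf-8")


def read_eagerly(folder: Path) -> list[str]:
    """Read every file up front and return a list (mimics load)."""
    return list(read_lazily(folder))


# Build a throwaway folder with three small files for the demo.
with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    for n in range(3):
        (folder / f"file{n}.txt").write_text(f"contents of file {n}", encoding="utf-8")

    eager = read_eagerly(folder)  # all three files are in memory now
    lazy = read_lazily(folder)    # nothing has been read yet
    first = next(lazy)            # reads only file0.txt
```

The lazy version is the better fit when the folder is large or when you only need the first few documents.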

In [15]:
from langchain_community.document_loaders import DirectoryLoader, PyPDFLoader

loader = DirectoryLoader(
    path = 'books',
    glob = '*.pdf',
    loader_cls = PyPDFLoader
)
docs = loader.lazy_load()  # generator: nothing is read until we iterate

# Print the first six documents, then stop iterating.
for i, document in enumerate(docs):
    print(f'First 50 characters of the document: \n{document.page_content[:50]}')
    print(f'Metadata of the document: {document.metadata}')
    if i == 5:
        break

First 50 characters of the document: 
A LiveCoMS Tutorial
Molecular Dynamics: From Basic
Metadata of the document: {'producer': 'pdfTeX-1.40.24', 'creator': 'LaTeX with hyperref', 'creationdate': '2025-09-08T10:56:01+00:00', 'author': '', 'keywords': '', 'moddate': '2025-09-08T10:56:01+00:00', 'ptex.fullbanner': 'This is pdfTeX, Version 3.141592653-2.6-1.40.24 (TeX Live 2022) kpathsea version 6.3.4', 'subject': '', 'title': '', 'trapped': '/False', 'source': 'books/MolecularDynamics.pdf', 'total_pages': 58, 'page': 0, 'page_label': '1'}
First 50 characters of the document: 
A LiveCoMS Tutorial
then utilize positions, veloci
Metadata of the document: {'producer': 'pdfTeX-1.40.24', 'creator': 'LaTeX with hyperref', 'creationdate': '2025-09-08T10:56:01+00:00', 'author': '', 'keywords': '', 'moddate': '2025-09-08T10:56:01+00:00', 'ptex.fullbanner': 'This is pdfTeX, Version 3.141592653-2.6-1.40.24 (TeX Live 2022) kpathsea version 6.3.4', 'subject': '', 'title': '', 'trapped': '/False', 'sou

## WebBaseLoader

- Loads content from web pages (URLs). It works best on static pages, where the content is already present in the HTML the server returns.
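To see why static pages are the easy case, here is a rough standard-library sketch of "fetch HTML, keep the text". It is not how `WebBaseLoader` is implemented (that loader relies on `requests` and BeautifulSoup); `TextExtractor` and the sample HTML are made up for illustration. The point is that text written into the page by JavaScript never shows up, because the script is never executed.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes, skipping the bodies of <script> and <style>."""

    def __init__(self) -> None:
        super().__init__()
        self.parts: list[str] = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep non-empty text that is not inside a script or style element.
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


# Hypothetical page: two static elements plus one that JavaScript would fill in.
html = """
<html><body>
  <h1>Forecast</h1>
  <p>Sunny, 75F</p>
  <div id="app"></div>
  <script>document.getElementById('app').textContent = 'Loaded by JS';</script>
</body></html>
"""

parser = TextExtractor()
parser.feed(html)
text = " ".join(parser.parts)
print(text)
```

Only the heading and paragraph survive; the `div` that the script would populate stays empty.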

In [18]:
from langchain_community.document_loaders import WebBaseLoader

url = 'https://weather.com/weather/tenday/l/North%2BPlainfield%2BNJ?canonicalCityId=99f6f5324bd07939af5f2e75060b0deabfbaf5997e4b19fd2c735392b44f7ba2'

loader = WebBaseLoader(url) 

doc = loader.load()

print(f'Type of the document loader (It loads inside the list): {type(doc)}')
print(f'How many documents loaded: {len(doc)}')
print(f'First 500 characters of the first document: \n{doc[0].page_content[:500]}')
print(f'Metadata of the first document: {doc[0].metadata}')

Type of the document loader (It loads inside the list): <class 'list'>
How many documents loaded: 1
First 500 characters of the first document: 
10-Day Weather Forecast for North Plainfield, New Jersey 07060 - The Weather Channel | weather.comHamburgerThe Weather CompanyHomeHomeHomeTodayTodayTodayHourlyHourlyHourlyClose10 Day10 DayMonthlyMonthlyMonthlyRadarRadarRadarVideoVideoVideolightningUpgrade to PremiumChevron RightCloseThe Weather CompanyHomeHomeHomeTodayTodayTodayHourlyHourlyHourlyClose10 Day10 DayMonthlyMonthlyMonthlyRadarRadarRadarVideoVideoVideolightningUpgrade to PremiumChevron RightHamburgerThe Weather CompanySearchGlobeUS°FC
Metadata of the first document: {'source': 'https://weather.com/weather/tenday/l/North%2BPlainfield%2BNJ?canonicalCityId=99f6f5324bd07939af5f2e75060b0deabfbaf5997e4b19fd2c735392b44f7ba2', 'title': '10-Day Weather Forecast for North Plainfield, New Jersey 07060 - The Weather Channel | weather.com', 'description': 'Be prepared with the most accurate 10-d

## CSVLoader

In [20]:
from langchain_community.document_loaders import CSVLoader

loader = CSVLoader(file_path='./Social_Network_Ads.csv')

doc = loader.load()

print(f'Type of the document loader (It loads inside the list): {type(doc)}')
print(f'How many documents loaded: {len(doc)}')
print(f'First 500 characters of the first document: \n{doc[0].page_content[:500]}')
print(f'Metadata of the first document: {doc[0].metadata}')
print(f'First row in the CSV file: \n{doc[0]}')

Type of the document loader (It loads inside the list): <class 'list'>
How many documents loaded: 400
First 500 characters of the first document: 
User ID: 15624510
Gender: Male
Age: 19
EstimatedSalary: 19000
Purchased: 0
Metadata of the first document: {'source': './Social_Network_Ads.csv', 'row': 0}
First row in the CSV file: 
page_content='User ID: 15624510
Gender: Male
Age: 19
EstimatedSalary: 19000
Purchased: 0' metadata={'source': './Social_Network_Ads.csv', 'row': 0}
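The output shows the pattern: one document per CSV row, with `column: value` lines as `page_content` and the file path plus row index as metadata. A minimal sketch of that mapping with the standard `csv` module follows. `rows_to_documents` is a hypothetical helper, the documents are plain dictionaries rather than LangChain `Document` objects, and the header row is inferred from the keys printed above.

```python
import csv
import io

# Header inferred from the printed keys; the data row is the first row shown above.
csv_text = (
    "User ID,Gender,Age,EstimatedSalary,Purchased\n"
    "15624510,Male,19,19000,0\n"
)


def rows_to_documents(text: str, source: str) -> list[dict]:
    """Turn each CSV row into a dict with page_content and metadata."""
    documents = []
    for i, row in enumerate(csv.DictReader(io.StringIO(text))):
        # One "column: value" line per field, joined by newlines.
        content = "\n".join(f"{key}: {value}" for key, value in row.items())
        documents.append(
            {"page_content": content, "metadata": {"source": source, "row": i}}
        )
    return documents


rows = rows_to_documents(csv_text, "./Social_Network_Ads.csv")
print(rows[0]["page_content"])
print(rows[0]["metadata"])
```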
