# **Document Loaders in LangChain**

Document loaders are LangChain components that load data from various sources into a standardized format (usually Document objects), which can then be used for chunking, embedding, retrieval, and generation.



# **TextLoader**

TextLoader is a simple and commonly used document loader in LangChain that reads plain-text (.txt) files and converts them into LangChain Document objects.

**Use case**

Ideal for loading chat logs, scraped text, transcripts, code snippets, or any other plain-text data into a LangChain pipeline.


**Limitation**

Works only with plain-text (.txt) files.
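
As a rough mental model of what a plain-text loader does, the sketch below reads a file and wraps its content in a one-element list. It is an illustration only, not LangChain's code: `SimpleDocument` and `load_text_file` are hypothetical names, and the `source` metadata key is assumed from the metadata printed elsewhere in this notebook.

```python
import os
import tempfile
from dataclasses import dataclass, field


@dataclass
class SimpleDocument:
    """Hypothetical stand-in for a LangChain Document (illustration only)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def load_text_file(path, encoding="utf-8"):
    """Read a whole text file and wrap it in a one-element list of documents."""
    with open(path, encoding=encoding) as f:
        text = f.read()
    return [SimpleDocument(page_content=text, metadata={"source": path})]


# Write a small sample file so the sketch runs on its own.
with tempfile.TemporaryDirectory() as tmp:
    sample_path = os.path.join(tmp, "sample.txt")
    with open(sample_path, "w", encoding="utf-8") as f:
        f.write("The Self-Taught Data Scientist\n")

    sample_docs = load_text_file(sample_path)

print(len(sample_docs))
print(sample_docs[0].page_content)
```

The whole file becomes a single document, which is why `len(docs)` is 1 for `datascience.txt` in the cells that follow.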

In [None]:
# Import TextLoader for loading plain text files as LangChain Document objects.
from langchain_community.document_loaders import TextLoader



In [None]:
# Create a TextLoader instance for the file 'datascience.txt' and load the documents.

loader = TextLoader('datascience.txt', encoding='utf-8')

docs = loader.load()

# Print the type of the loaded docs object (should be a list of Document objects).

print(type(docs))

<class 'list'>


In [None]:
# Explore the loaded documents.
print(len(docs)) # Print the number of documents loaded

# print(docs[0]) # Uncomment to print the first document object

# print(type(docs[0])) # Uncomment to print the type of the first document

print(docs[0].page_content) # Print the content of the first document

print(docs[0].metadata) # Print the metadata of the first document

1
The Self-Taught Data Scientist

In quiet rooms where pixels glow,
A curious mind begins to grow.
No lecture hall, no rigid pace,
Just eager steps through data’s maze.

A dusty book, a midnight screen,
A question sparks in spaces between:
“What secrets hide in rows and charts?
What truth does data speak in parts?”

Python scripts and messy code,
Errors stacked a heavy load.
Yet in each bug, a lesson found—
Persistence is the battleground.

Statistics whispers gentle clues,
Probabilities, hidden truths.
Linear lines and curves that bend,
Regressions that predict, or end.

A scatterplot, a clustering sphere,
Machine’s learning, patterns clear.
A forest random, tangled trees,
Decision splits with cryptic ease.

Stack Overflow, a trusted friend,
To help confusion meet its end.
Blogs and MOOCs, and podcasts too,
Each puzzle piece reveals the view.

From wrangling data’s jagged mess,
To crafting insights, no less.
Transforming noise into a song,
To show the world where it belongs.

And slow

In [None]:
# Import additional modules for using HuggingFace LLM, output parsing, and prompt templates.
from langchain_community.document_loaders import TextLoader
from langchain_huggingface import HuggingFaceEndpoint, ChatHuggingFace
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import PromptTemplate
from dotenv import load_dotenv

# Load environment variables (such as API keys) from a .env file.
load_dotenv()


True

In [None]:
#  Initialize the HuggingFaceEndpoint and wrap it in a ChatHuggingFace model for text generation.
llm = HuggingFaceEndpoint(
    repo_id='meta-llama/Meta-Llama-3-8B-Instruct',
    task='text-generation'
)

model = ChatHuggingFace(llm=llm)

In [None]:
# Create a prompt template for summarizing a poem.

prompt = PromptTemplate(
    template='write a summary for the following poem - \n {poem} ',
    input_variables=['poem']
)

In [None]:
#  Create a string output parser to extract plain text from the model's response.

parser = StrOutputParser()

In [None]:
# Reload the text file to get the document content for summarization.

loader = TextLoader('datascience.txt', encoding='utf-8')

docs = loader.load()

In [None]:
# Build a chain that applies the prompt, model, and parser in sequence.

chain = prompt | model | parser
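
To give an intuition for how `|` chains the stages together, here is a minimal sketch of operator-based composition. `Step` is a hypothetical class written for this illustration and is not LangChain's implementation; the three stages are simple stand-ins for the prompt, model, and parser.

```python
class Step:
    """Hypothetical minimal pipeline step (illustration only)."""

    def __init__(self, func):
        self.func = func

    def __or__(self, other):
        # `a | b` returns a new step that runs a, then feeds its result to b.
        return Step(lambda value: other.invoke(self.invoke(value)))

    def invoke(self, value):
        return self.func(value)


# Stand-ins for the prompt, model, and parser stages.
fill_prompt = Step(
    lambda inputs: f"write a summary for the following poem - \n {inputs['poem']}"
)
fake_model = Step(lambda prompt_text: {"content": prompt_text.upper()})
extract_text = Step(lambda message: message["content"])

sketch_chain = fill_prompt | fake_model | extract_text
result = sketch_chain.invoke({"poem": "In quiet rooms"})
print(result)
```

Each stage receives the previous stage's output, so the dictionary passed to `invoke` only needs the variables the first stage expects.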

In [None]:
# Invoke the chain with the content of the first document and print the summary.

print(chain.invoke({'poem':docs[0].page_content}))

The poem "The Self-Taught Data Scientist" is a celebration of an individual who has taught themselves the skills and knowledge to become a data scientist. The poem describes the self-directed learning process, from curiosity and eagerness to engage with data, to overcoming obstacles and finding lessons in each step. It highlights the tools, techniques, and resources used by the self-taught data scientist, such as Python scripts, statistical analysis, machine learning, and online resources like Stack Overflow and MOOCs.

Throughout the poem, the speaker emphasizes the importance of persistence, self-motivation, and determination in overcoming the challenges of data science. The poem also touches on the journey of transformation, from a novice to a skilled data scientist, and the satisfaction of leaving a mark on the world through insights and discoveries.

Ultimately, the poem is a tribute to the self-taught data scientist, acknowledging their dedication, creativity, and passion for lea

# **PyPDF Loader**

PyPDFLoader is a document loader in LangChain that loads content from PDF files and converts each page into a Document object.

**Limitations**

It uses the pypdf library under the hood, so it does not handle scanned PDFs or complex layouts well.
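
The page-per-document idea can be sketched without a PDF library. The example below is an illustration under assumptions, not PyPDFLoader's code: `pages_to_documents` is a hypothetical helper, the page texts are made up, and page labels are assumed to be simple sequential numbers. The metadata keys mirror those printed in this section's output.

```python
def pages_to_documents(page_texts, source):
    """Turn a list of page texts into one dict-based record per page."""
    total = len(page_texts)
    return [
        {
            "page_content": text,
            "metadata": {
                "source": source,
                "total_pages": total,
                "page": index,                 # zero-based page index
                "page_label": str(index + 1),  # human-readable page number
            },
        }
        for index, text in enumerate(page_texts)
    ]


# Hypothetical page texts standing in for text extracted from a PDF.
sample_pages = ["Title page", "Chapter 1", "Chapter 2"]
page_docs = pages_to_documents(sample_pages, source="example.pdf")

print(len(page_docs))
print(page_docs[1]["metadata"])
```

This is why `docs[1].metadata` reports `'page': 1` together with `'page_label': '2'`: the index counts from zero while the label counts from one.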

In [None]:
# Install the pypdf package for PDF document loading.
! pip install pypdf

Collecting pypdf
  Downloading pypdf-5.7.0-py3-none-any.whl.metadata (7.2 kB)
Downloading pypdf-5.7.0-py3-none-any.whl (305 kB)
Installing collected packages: pypdf
Successfully installed pypdf-5.7.0


In [None]:
# Import PyPDFLoader and load a PDF file as LangChain document objects.
from langchain_community.document_loaders import PyPDFLoader

loader = PyPDFLoader('dl-curriculum.pdf')

docs = loader.load()

# print(docs)  # Uncomment to print all loaded documents

print(len(docs)) # Print the number of pages/documents loaded

print(docs[0].page_content) # Print the content of the first page

print(docs[1].metadata) # Print the metadata of the second page

23
CampusXDeepLearningCurriculum
A.ArtificialNeuralNetworkandhowtoimprovethem
1.BiologicalInspiration
● Understandingtheneuronstructure● Synapsesandsignaltransmission● Howbiologicalconceptstranslatetoartificialneurons
2.HistoryofNeuralNetworks
● Earlymodels(Perceptron)● BackpropagationandMLPs● The"AIWinter"andresurgenceofneuralnetworks● Emergenceofdeeplearning
3.PerceptronandMultilayerPerceptrons(MLP)
● Single-layerperceptronlimitations● XORproblemandtheneedforhiddenlayers● MLParchitecture
4. LayersandTheirFunctions
● InputLayer○ Acceptinginputdata● HiddenLayers○ Featureextraction● OutputLayer○ Producingfinalpredictions
5.ActivationFunctions
{'producer': 'Skia/PDF m131 Google Docs Renderer', 'creator': 'PyPDF', 'creationdate': '', 'title': 'Deep Learning Curriculum', 'source': 'dl-curriculum.pdf', 'total_pages': 23, 'page': 1, 'page_label': '2'}


Which loader to use for other kinds of PDFs:

PDFs with tables/columns =====> PDFPlumberLoader

Scanned/image PDFs =====> UnstructuredPDFLoader / AmazonTextractPDFLoader

Need layout and image data =====> PyMuPDFLoader

https://colab.research.google.com/github/langchain-ai/langchain/blob/master/docs/docs/how_to/document_loader_pdf.ipynb

# **DirectoryLoader**

DirectoryLoader in LangChain is a utility that loads multiple documents from a directory by applying a specified loader class (for example, one for PDFs or text files) to each file that matches a given glob pattern.

**Load vs. lazy load**

`load()` reads all documents into memory at once and returns them as a list. This is fast for small datasets but can be memory-intensive for large ones.

`lazy_load()` yields documents one at a time as they are needed, which makes it more memory-efficient and better suited to large collections of files that would otherwise overwhelm system resources.

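The difference between eager and lazy loading can be sketched with a plain list versus a generator. This is a minimal standard-library illustration, not DirectoryLoader's code; `read_file`, `load_all`, and `lazy_load_all` are hypothetical helpers, and the files are temporary samples.

```python
import os
import tempfile


def read_file(path):
    """Read one file and return a small dict-based record."""
    with open(path, encoding="utf-8") as f:
        return {"page_content": f.read(), "metadata": {"source": path}}


def load_all(paths):
    """Eager: every file is read before anything is returned."""
    return [read_file(p) for p in paths]


def lazy_load_all(paths):
    """Lazy: each file is read only when the caller asks for the next item."""
    for p in paths:
        yield read_file(p)


with tempfile.TemporaryDirectory() as tmp:
    paths = []
    for i in range(3):
        path = os.path.join(tmp, f"file_{i}.txt")
        with open(path, "w", encoding="utf-8") as f:
            f.write(f"contents of file {i}")
        paths.append(path)

    eager_docs = load_all(paths)           # all three files are in memory now
    lazy_docs = lazy_load_all(paths)       # nothing has been read yet
    first_lazy = next(lazy_docs)           # reads only file_0.txt
    remaining = sum(1 for _ in lazy_docs)  # reads the other two, one at a time

print(len(eager_docs), first_lazy["page_content"], remaining)
```

LangChain loaders expose the same distinction through `load()` and `lazy_load()`; with the lazy variant you would typically iterate with a `for` loop and process each document before moving on to the next.
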
In [46]:
# Import DirectoryLoader for loading multiple files from a directory,
# and PyPDFLoader for loading PDF files as LangChain document objects.

from langchain_community.document_loaders import DirectoryLoader, PyPDFLoader

In [47]:
# Create a DirectoryLoader instance to load all PDF files from the 'books' directory using PyPDFLoader.

loader = DirectoryLoader(
    path='books',               # Directory containing PDF files
    glob='*.pdf',               # Pattern to match PDF files
    loader_cls=PyPDFLoader      # Loader class to use for each file
)

In [48]:
# Load all documents from the specified directory using the loader.

docs = loader.load()

In [49]:
# Iterate through all loaded documents and print their content.

for doc in docs:
    print(doc.page_content)

iants, such as train_on_batch() or fit_generator()), plus the get_layers()
method (which can return any of the model’s layers by name or by index), and the
save() method (and support for keras.models.load_model() and keras.mod
els.clone_model()). So if models provide more functionalities than layers, why not
just define every layer as a model? Well, technically you could, but it is probably
cleaner to distinguish the internal components of your model (layers or reusable
blocks of layers) from the model itself. The former should subclass the Layer class,
while the latter should subclass the Model class.
With that, you can quite naturally and concisely build almost any model that you find
in a paper, either using the sequential API, the functional API, the subclassing API, or
even a mix of these. “ Almost” any model? Y es, there are still a couple things that we
need to look at: first, how to define losses or metrics based on model internals, and
second how to build a custom training loo

In [None]:
# Print the total number of loaded documents/pages.

print(len(docs))

# Print the content of the 510th document (index 509).
print(docs[509].page_content)


# Print the metadata of the 510th document (index 509).
print(docs[509].metadata)

902
About the Author
Aurélien Géron is a Machine Learning consultant. A former Googler, he led the Y ou‐
Tube video classification team from 2013 to 2016. He was also a founder and CTO of
Wifirst from 2002 to 2012, a leading Wireless ISP in France; and a founder and CTO
of Polyconseil in 2001, the firm that now manages the electric car sharing service
Autolib’ .
Before this he worked as an engineer in a variety of domains: finance (JP Morgan and
Société Générale), defense (Canada’s DOD), and healthcare (blood transfusion). He
published a few technical books (on C++, WiFi, and internet architectures), and was
a Computer Science lecturer in a French engineering school.
A few fun facts: he taught his three children to count in binary with their fingers (up
to 1023), he studied microbiology and evolutionary genetics before going into soft‐
ware engineering, and his parachute didn’t open on the second jump.
Colophon
The animal on the cover of Hands-On Machine Learning with Scikit-Learn and 

# **WebBase Loader**

WebBaseLoader is a document loader in LangChain used to load and extract text content from web pages (URLs).

It uses BeautifulSoup under the hood to parse HTML and extract visible text.

**When to use**

For blogs, news articles, or public websites where the content is primarily text-based and static.

**Limitations**

Doesn't handle JavaScript-heavy pages well (use SeleniumURLLoader for that).

Loads only static content (what is in the HTML, not what loads after the page renders).
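
The static-content limitation can be illustrated with a small parser. The sketch below uses Python's built-in `html.parser` as a stand-in (it is not BeautifulSoup and not WebBaseLoader's code), and `VisibleTextParser` and the sample HTML are hypothetical. It shows that text a script would insert after rendering never appears in the extracted output.

```python
from html.parser import HTMLParser


class VisibleTextParser(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.chunks.append(text)


sample_html = """
<html>
  <head><title>Data science</title><style>p { color: red; }</style></head>
  <body>
    <p>Data science is an interdisciplinary field.</p>
    <div id="comments"></div>
    <script>document.getElementById("comments").innerText = "Loaded later";</script>
  </body>
</html>
"""

parser_sketch = VisibleTextParser()
parser_sketch.feed(sample_html)
visible_text = " ".join(parser_sketch.chunks)

print(visible_text)
```

The `comments` div stays empty because no JavaScript is executed; a browser-driven loader is needed when the page fills in its content that way.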


In [None]:
# Install BeautifulSoup (bs4) for web page parsing support in WebBaseLoader
! pip install beautifulsoup4

Collecting beautifulsoup4
  Downloading beautifulsoup4-4.13.4-py3-none-any.whl.metadata (3.8 kB)
Collecting soupsieve>1.2 (from beautifulsoup4)
  Downloading soupsieve-2.7-py3-none-any.whl.metadata (4.6 kB)
Downloading beautifulsoup4-4.13.4-py3-none-any.whl (187 kB)
Downloading soupsieve-2.7-py3-none-any.whl (36 kB)
Installing collected packages: soupsieve, beautifulsoup4


In [None]:
# Import WebBaseLoader for loading web pages as documents,
# and other necessary modules for LLM, output parsing, and prompt templates.

from langchain_community.document_loaders import WebBaseLoader
from langchain_huggingface import HuggingFaceEndpoint, ChatHuggingFace
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import PromptTemplate
from dotenv import load_dotenv

# Load environment variables (such as API keys) from a .env file.
load_dotenv()

True

In [32]:
#  Initialize the HuggingFaceEndpoint and wrap it in a ChatHuggingFace model for text generation.
llm = HuggingFaceEndpoint(
    repo_id='meta-llama/Meta-Llama-3-8B-Instruct',
    task='text-generation'
)

model = ChatHuggingFace(llm=llm)

In [None]:
# Create a prompt template for answering a question based on provided text.

prompt = PromptTemplate(
    template='Answer the following question \n {question} from the following text - \n {text} ',
    input_variables=['question' , 'text']
)

In [34]:
#  Create a string output parser to extract plain text from the model's response.

parser = StrOutputParser()

In [None]:
# Specify the URL to load and create a WebBaseLoader instance for it.

url = "https://en.wikipedia.org/wiki/Data_science"

loader = WebBaseLoader(url)

In [None]:
# Load the web page content as LangChain document objects.

docs = loader.load()

In [None]:
# Build a chain that applies the prompt, model, and parser in sequence.

chain = prompt | model | parser

In [None]:
# Invoke the chain with a question and the loaded web page content, then print the answer.

print(chain.invoke({'question':'what is the future of data science ', 'text':docs[0].page_content}))

The future of data science is expected to be shaped by several trends and advancements. Some of the key developments that are likely to impact the field of data science include:

1. **Artificial Intelligence and Machine Learning**: The increasing use of AI and ML in data science will lead to more sophisticated and accurate analysis of large datasets.
2. **Cloud Computing**: The growth of cloud computing will enable easier access to computational power and storage, making it easier to process and analyze large datasets.
3. **Big Data and IoT**: The increasing amount of data being generated by IoT devices and other sources will require more efficient and effective methods for processing and analyzing large datasets.
4. **Data Governance and Ethics**: As data science becomes more prevalent, there will be a greater need for data governance and ethics to ensure that data is collected and used responsibly.
5. **Interdisciplinary Collaboration**: Data science will continue to be an interdisci

In [None]:
# (Optional) Print the number of loaded documents and the content of the first document.
# print(len(docs))
# print(docs[0].page_content)

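# **CSV Loader**

CSVLoader is a document loader in LangChain that reads a CSV file and turns each row into a Document object.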

In [None]:
# Import CSVLoader for loading CSV files as LangChain document objects.

from langchain_community.document_loaders import CSVLoader

# Create a CSVLoader instance for the file 'Social_Network_Ads.csv'.

loader = CSVLoader(file_path='Social_Network_Ads.csv')

# Load the CSV file as documents.

docs = loader.load()

In [None]:
# Print the total number of loaded documents (rows).

print(len(docs))

# Print the content of the sixth document (row) in the CSV file.

print(docs[5])

400
page_content='User ID: 15728773
Gender: Male
Age: 27
EstimatedSalary: 58000
Purchased: 0' metadata={'source': 'Social_Network_Ads.csv', 'row': 5}
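
The output above suggests that each row becomes one document whose content lists `column: value` pairs, with the row index stored in the metadata. A minimal sketch of that mapping using the standard `csv` module follows. It is not CSVLoader's code: `rows_to_documents` is a hypothetical helper, the first sample row is copied from the output above, and the second row is made up for illustration.

```python
import csv
import io


def rows_to_documents(file_obj, source):
    """Build one dict-based record per CSV row, with 'column: value' lines."""
    records = []
    for row_index, row in enumerate(csv.DictReader(file_obj)):
        content = "\n".join(f"{column}: {value}" for column, value in row.items())
        records.append(
            {"page_content": content, "metadata": {"source": source, "row": row_index}}
        )
    return records


# In-memory sample; the second data row is hypothetical.
sample_csv = io.StringIO(
    "User ID,Gender,Age,EstimatedSalary,Purchased\n"
    "15728773,Male,27,58000,0\n"
    "15700000,Female,30,40000,1\n"
)

row_docs = rows_to_documents(sample_csv, source="sample.csv")

print(len(row_docs))
print(row_docs[0]["page_content"])
print(row_docs[0]["metadata"])
```

One document per row keeps each record self-contained, which is convenient when rows are later embedded and retrieved individually.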
