# **Document Loaders in LangChain**

Document loaders are components in LangChain that load data from various sources into a standardized format (usually Document objects), which can then be used for chunking, embedding, retrieval, and generation.
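
To make the "standardized format" concrete, here is a minimal sketch of the shape a loaded document takes: a text payload plus a metadata dictionary. `SketchDocument` is a hypothetical stand-in written for illustration, not LangChain's actual Document class; only the field names `page_content` and `metadata` are taken from the cells in this notebook.

```python
from dataclasses import dataclass, field


@dataclass
class SketchDocument:
    """Hypothetical stand-in for a loaded document (not LangChain's class)."""
    page_content: str                             # the raw text that was loaded
    metadata: dict = field(default_factory=dict)  # e.g. where the text came from


# Whatever the source, a loader hands back a list of such objects.
docs = [SketchDocument("hello world", {"source": "example.txt"})]
print(type(docs).__name__, len(docs))
print(docs[0].page_content, docs[0].metadata)
```

Because every loader returns the same shape, the later steps (chunking, embedding, retrieval) do not need to know whether the text came from a .txt file or a PDF.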



# **TextLoader**

TextLoader is a simple and commonly used document loader in LangChain that reads plain text (.txt) files and converts them into LangChain Document objects.

 **Use Case**

 Ideal for loading chat logs, scraped text, transcripts, code snippets, or any plain text data into a LangChain pipeline.


 **Limitation**

 Works only with plain text files such as .txt; it does not parse other formats such as PDF.
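
As a rough illustration of what loading a text file amounts to, the sketch below reads one file and returns a one-element list holding its text and its path. `load_text_sketch` is a hypothetical helper using only the standard library, and the `source` metadata key is an assumption made for this example; it is not how TextLoader is implemented.

```python
import os
import tempfile


def load_text_sketch(path, encoding="utf-8"):
    """Hypothetical helper: read one text file into a one-element list of dicts."""
    with open(path, encoding=encoding) as f:
        text = f.read()
    # One file becomes one document; the path is kept as metadata.
    return [{"page_content": text, "metadata": {"source": path}}]


# Write a throwaway file so the example runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    sample_path = os.path.join(tmp, "sample.txt")
    with open(sample_path, "w", encoding="utf-8") as f:
        f.write("line one\nline two\n")
    loaded = load_text_sketch(sample_path)

print(len(loaded))
print(loaded[0]["page_content"])
```

Passing the encoding explicitly, as the notebook does with `encoding='utf-8'`, avoids depending on the platform default when the file contains non-ASCII characters.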

In [None]:
# Cell: Import the TextLoader for loading plain text files as LangChain document objects.
from langchain_community.document_loaders import TextLoader



In [None]:
# Create a TextLoader instance for the file 'datascience.txt' and load the documents.

loader = TextLoader('datascience.txt', encoding='utf-8')

docs = loader.load()

# Print the type of the loaded docs object (should be a list of Document objects).

print(type(docs))

<class 'list'>


In [None]:
# Explore the loaded documents.
print(len(docs)) # Print the number of documents loaded

# print(docs[0]) # Uncomment to print the first document object

# print(type(docs[0])) # Uncomment to print the type of the first document

print(docs[0].page_content) # Print the content of the first document

print(docs[0].metadata) # Print the metadata of the first document

1
The Self-Taught Data Scientist

In quiet rooms where pixels glow,
A curious mind begins to grow.
No lecture hall, no rigid pace,
Just eager steps through data’s maze.

A dusty book, a midnight screen,
A question sparks in spaces between:
“What secrets hide in rows and charts?
What truth does data speak in parts?”

Python scripts and messy code,
Errors stacked a heavy load.
Yet in each bug, a lesson found—
Persistence is the battleground.

Statistics whispers gentle clues,
Probabilities, hidden truths.
Linear lines and curves that bend,
Regressions that predict, or end.

A scatterplot, a clustering sphere,
Machine’s learning, patterns clear.
A forest random, tangled trees,
Decision splits with cryptic ease.

Stack Overflow, a trusted friend,
To help confusion meet its end.
Blogs and MOOCs, and podcasts too,
Each puzzle piece reveals the view.

From wrangling data’s jagged mess,
To crafting insights, no less.
Transforming noise into a song,
To show the world where it belongs.

And slow

In [None]:
# Import additional modules for using HuggingFace LLM, output parsing, and prompt templates.
from langchain_community.document_loaders import TextLoader
from langchain_huggingface import HuggingFaceEndpoint, ChatHuggingFace
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import PromptTemplate
from dotenv import load_dotenv

# Load environment variables (such as API keys) from a .env file.
load_dotenv()


True

In [None]:
#  Initialize the HuggingFaceEndpoint and wrap it in a ChatHuggingFace model for text generation.
llm = HuggingFaceEndpoint(
    repo_id='meta-llama/Meta-Llama-3-8B-Instruct',
    task='text-generation'
)

model = ChatHuggingFace(llm=llm)

In [None]:
# Create a prompt template for summarizing a poem.

prompt = PromptTemplate(
    template='write a summary for the following poem - \n {poem} ',
    input_variables=['poem']
)

In [None]:
#  Create a string output parser to extract plain text from the model's response.

parser = StrOutputParser()

In [None]:
# Reload the text file to get the document content for summarization.

loader = TextLoader('datascience.txt', encoding='utf-8')

docs = loader.load()

In [None]:
# Build a chain that applies the prompt, model, and parser in sequence.

chain = prompt | model | parser
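
The `|` operator reads left to right: each step's output becomes the next step's input. The sketch below imitates that idea with plain Python. `Step`, `fake_model`, and the other names are hypothetical and no real LLM is called; this is a conceptual illustration under those assumptions, not LangChain's implementation.

```python
class Step:
    """Hypothetical wrapper that lets plain functions be chained with `|`."""

    def __init__(self, func):
        self.func = func

    def __or__(self, other):
        # a | b builds a new step that runs a first, then feeds its result to b.
        return Step(lambda value: other.invoke(self.invoke(value)))

    def invoke(self, value):
        return self.func(value)


# Stand-ins for the prompt, the model, and the output parser.
fill_prompt = Step(lambda d: "write a summary for the following poem - \n " + d["poem"])
fake_model = Step(lambda text: {"content": text.upper()})  # no real LLM here
to_text = Step(lambda message: message["content"])

sketch_chain = fill_prompt | fake_model | to_text
result = sketch_chain.invoke({"poem": "roses are red"})
print(result)
```

The dictionary passed to `invoke` plays the same role as `{'poem': ...}` in the notebook: its keys fill the placeholders in the prompt before anything reaches the model.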

In [None]:
# Invoke the chain with the content of the first document and print the summary.

print(chain.invoke({'poem':docs[0].page_content}))

The poem "The Self-Taught Data Scientist" is a celebration of an individual who has taught themselves the skills and knowledge to become a data scientist. The poem describes the self-directed learning process, from curiosity and eagerness to engage with data, to overcoming obstacles and finding lessons in each step. It highlights the tools, techniques, and resources used by the self-taught data scientist, such as Python scripts, statistical analysis, machine learning, and online resources like Stack Overflow and MOOCs.

Throughout the poem, the speaker emphasizes the importance of persistence, self-motivation, and determination in overcoming the challenges of data science. The poem also touches on the journey of transformation, from a novice to a skilled data scientist, and the satisfaction of leaving a mark on the world through insights and discoveries.

Ultimately, the poem is a tribute to the self-taught data scientist, acknowledging their dedication, creativity, and passion for lea

# **PyPDF Loader**

PyPDFLoader is a document loader in LangChain used to load content from PDF files and convert each page into a Document object.

**Limitations**

It uses the pypdf library under the hood, so it does not handle scanned PDFs or complex layouts well.
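
The one-document-per-page idea can be sketched without any PDF library: given page texts that have already been extracted, wrap each one with page-level metadata. `pages_to_documents` is a hypothetical helper; the metadata keys mirror those printed later in this notebook for dl-curriculum.pdf, and deriving `page_label` as the index plus one is an assumption that will not hold for PDFs with custom page labels.

```python
def pages_to_documents(page_texts, source):
    """Hypothetical helper: wrap already-extracted page texts, one record per page."""
    total = len(page_texts)
    return [
        {
            "page_content": text,
            "metadata": {
                "source": source,
                "total_pages": total,
                "page": index,                 # zero-based page index
                "page_label": str(index + 1),  # assumed human-facing page number
            },
        }
        for index, text in enumerate(page_texts)
    ]


pdf_docs = pages_to_documents(["first page text", "second page text"], "example.pdf")
print(len(pdf_docs))
print(pdf_docs[1]["metadata"])
```

Keeping the page index in the metadata means a retrieved chunk can later be traced back to the page it came from.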

In [None]:
# Install the pypdf package for PDF document loading.
! pip install pypdf

Collecting pypdf
  Downloading pypdf-5.7.0-py3-none-any.whl.metadata (7.2 kB)
Downloading pypdf-5.7.0-py3-none-any.whl (305 kB)
Installing collected packages: pypdf
Successfully installed pypdf-5.7.0


In [None]:
# Import PyPDFLoader and load a PDF file as LangChain document objects.
from langchain_community.document_loaders import PyPDFLoader

loader = PyPDFLoader('dl-curriculum.pdf')

docs = loader.load()

# print(docs)  # Uncomment to print all loaded documents

print(len(docs)) # Print the number of pages/documents loaded

print(docs[0].page_content) # Print the content of the first page

print(docs[1].metadata) # Print the metadata of the second page

23
CampusXDeepLearningCurriculum
A.ArtificialNeuralNetworkandhowtoimprovethem
1.BiologicalInspiration
● Understandingtheneuronstructure● Synapsesandsignaltransmission● Howbiologicalconceptstranslatetoartificialneurons
2.HistoryofNeuralNetworks
● Earlymodels(Perceptron)● BackpropagationandMLPs● The"AIWinter"andresurgenceofneuralnetworks● Emergenceofdeeplearning
3.PerceptronandMultilayerPerceptrons(MLP)
● Single-layerperceptronlimitations● XORproblemandtheneedforhiddenlayers● MLParchitecture
4. LayersandTheirFunctions
● InputLayer○ Acceptinginputdata● HiddenLayers○ Featureextraction● OutputLayer○ Producingfinalpredictions
5.ActivationFunctions
{'producer': 'Skia/PDF m131 Google Docs Renderer', 'creator': 'PyPDF', 'creationdate': '', 'title': 'Deep Learning Curriculum', 'source': 'dl-curriculum.pdf', 'total_pages': 23, 'page': 1, 'page_label': '2'}


PDFs with tables/columns =====> PDFPlumberLoader

Scanned/image PDFs =====> UnstructuredPDFLoader / AmazonTextractPDFLoader

Need layout and image data =====> PyMuPDFLoader

https://colab.research.google.com/github/langchain-ai/langchain/blob/master/docs/docs/how_to/document_loader_pdf.ipynb