In [None]:

from langchain_community.document_loaders import (
    TextLoader,
    CSVLoader,
    PyPDFLoader,
    DirectoryLoader,
    WebBaseLoader
)

USER_AGENT environment variable not set, consider setting it to identify your requests.


In [3]:
loader = TextLoader("data/sample.txt")
documents = loader.load()

print("Number of documents:", len(documents))
print("\nPage Content Preview:\n", documents[0].page_content[:300])
print("\nMetadata:\n", documents[0].metadata)

Number of documents: 1

Page Content Preview:
 The quick brown fox jumps over the lazy dog.
This is the second line of the file.
LangChain's TextLoader is great for simple text files.
It handles various encodings like UTF-8.

Metadata:
 {'source': 'data/sample.txt'}


In [4]:
loader = CSVLoader("data/sample.csv")
documents = loader.load()

print("Number of rows converted to documents:", len(documents))
print("\nSample Document:\n", documents[0].page_content)

Number of rows converted to documents: 4

Sample Document:
 Name: Alice
Age: 30
City: New York
Occupation: Software Engineer


In [5]:
loader = PyPDFLoader("data/sample.pdf")
documents = loader.load()

print("Total pages:", len(documents))
print("\nSample Page Content:\n", documents[0].page_content[:500])

Total pages: 3

Sample Page Content:
 1. Cover Page 
 Logo (from website) 
 Title: Neuron – Corporate Photography & Video Solutions 
 Tagline: Visual storytelling for modern brands 
 Hero image from homepage 
 
2. About Neuron 
 Introduction from homepage: 
“Elevate your brand's presence with our bespoke corporate video production and 
professional headshots in Pune. At Neuron, we help you unleash the power of visual 
storytelling to connect with your audience and showcase your corporate identity like never 
before.” 
 Positio


In [8]:
loader = DirectoryLoader(
    "data/",
    glob="**/*.txt",
    loader_cls=TextLoader  # can customize per file type
)

documents = loader.load()
print("Total documents loaded from directory:", len(documents))

Total documents loaded from directory: 1


In [16]:
loader = WebBaseLoader("https://python.langchain.com/docs/")
documents = loader.load()

print("Web Content Preview:\n")
print(documents[0].page_content[:500])

Web Content Preview:

LangChain overview - Docs by LangChainSkip to main contentDocs by LangChain home pageOpen sourceSearch...⌘KAsk AIGitHubTry LangSmithTry LangSmithSearch...NavigationLangChain overviewDeep AgentsLangChainLangGraphIntegrationsLearnReferenceContributePythonOverviewGet startedInstallQuickstartChangelogPhilosophyCore componentsAgentsModelsMessagesToolsShort-term memoryStreamingStructured outputMiddlewareOverviewPrebuilt middlewareCustom middlewareAdvanced usageGuardrailsRuntimeContext engineeringModel


## PART 2: Text Splitters
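Task 6: Why Text Splitting Is Required (Conceptual)

Why can’t we pass large documents directly?

    LLMs have token limits, so a large document can overflow the context window

    Larger inputs cost more, since usage is typically billed per token

    Retrieval is less accurate when each retrieved unit covers many topics

What does chunking solve?

    Improves retrieval precision

    Fits inside token limits

    Can reduce hallucination by keeping the supplied context focused

    Enables semantic search over individually embedded chunks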
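
The points above can be illustrated with a minimal, library-free sketch. It assumes a hypothetical budget of 500 characters (characters stand in for tokens here; real token counts depend on the model's tokenizer), and `split_fixed` is an illustrative helper, not a LangChain function. It shows one oversized text becoming several pieces that each fit the budget.

```python
def split_fixed(text: str, chunk_size: int) -> list[str]:
    """Cut text into consecutive pieces of at most chunk_size characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


# Hypothetical document: 1,200 characters, too large for a 500-character budget.
document = "x" * 1200
pieces = split_fixed(document, chunk_size=500)

print(len(pieces))               # 3
print([len(p) for p in pieces])  # [500, 500, 200]
```

Cutting at fixed offsets ignores sentence and paragraph boundaries, which is the problem the splitters in this part try to address.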

In [20]:
%pip install -U langchain-text-splitters

Note: you may need to restart the kernel to use updated packages.


In [22]:
from langchain_text_splitters import CharacterTextSplitter

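# With the default separator ("\n\n"), a paragraph longer than chunk_size
# is kept whole, which is why the warnings below report oversized chunks.
text_splitter = CharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=100
)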

chunks = text_splitter.split_documents(documents)

print("Number of chunks:", len(chunks))
print("\nSample chunk:\n", chunks[0].page_content)

Created a chunk of size 2396, which is longer than the specified 500
Created a chunk of size 1072, which is longer than the specified 500


Number of chunks: 4

Sample chunk:
 LangChain overview - Docs by LangChainSkip to main contentDocs by LangChain home pageOpen sourceSearch...⌘KAsk AIGitHubTry LangSmithTry LangSmithSearch...NavigationLangChain overviewDeep AgentsLangChainLangGraphIntegrationsLearnReferenceContributePythonOverviewGet startedInstallQuickstartChangelogPhilosophyCore componentsAgentsModelsMessagesToolsShort-term memoryStreamingStructured outputMiddlewareOverviewPrebuilt middlewareCustom middlewareAdvanced usageGuardrailsRuntimeContext engineeringModel Context Protocol (MCP)Human-in-the-loopMulti-agentRetrievalLong-term memoryAgent developmentLangSmith StudioTestAgent Chat UIDeploy with LangSmithDeploymentObservabilityOn this page Create an agent Core benefitsLangChain overviewCopy pageLangChain is an open source framework with a pre-built agent architecture and integrations for any model or tool — so you can build agents that adapt as fast as the ecosystem evolvesCopy pageLangChain is the easy way to start

In [24]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=100
)

chunks = splitter.split_documents(documents)

print("Number of chunks:", len(chunks))
print("\nSample chunk:\n", chunks[0].page_content)

Number of chunks: 13

Sample chunk:
 LangChain overview - Docs by LangChainSkip to main contentDocs by LangChain home pageOpen sourceSearch...⌘KAsk AIGitHubTry LangSmithTry LangSmithSearch...NavigationLangChain overviewDeep AgentsLangChainLangGraphIntegrationsLearnReferenceContributePythonOverviewGet startedInstallQuickstartChangelogPhilosophyCore componentsAgentsModelsMessagesToolsShort-term memoryStreamingStructured outputMiddlewareOverviewPrebuilt middlewareCustom middlewareAdvanced usageGuardrailsRuntimeContext engineeringModel


Splitter                          Advantage                                           Disadvantage
CharacterTextSplitter             Simple                                              Splits on a single separator, so chunks can exceed chunk_size
RecursiveCharacterTextSplitter    Tries paragraph, line, and word boundaries in turn  Slightly slower
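
The difference in the table can be sketched without LangChain. The two functions below are simplified illustrations, not the library's implementation: the real RecursiveCharacterTextSplitter also merges small pieces back together up to chunk_size and applies chunk_overlap, which this toy version skips. The sample text and the 30-character limit are made up for the example.

```python
def split_on_separator(text: str, separator: str) -> list[str]:
    """Single-separator splitting: pieces can be any length."""
    return [piece for piece in text.split(separator) if piece]


def split_recursive(text: str, chunk_size: int, separators: list[str]) -> list[str]:
    """Try each separator in turn until every piece fits within chunk_size."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators:
        # No separators left: fall back to a hard cut.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    first, rest = separators[0], separators[1:]
    pieces = []
    for part in text.split(first):
        pieces.extend(split_recursive(part, chunk_size, rest))
    return pieces


text = "First paragraph is short.\n\nSecond paragraph is a good deal longer than the first one."

single = split_on_separator(text, "\n\n")
recursive = split_recursive(text, 30, ["\n\n", "\n", " "])

print([len(p) for p in single])        # the second piece is longer than 30
print(max(len(p) for p in recursive))  # every piece fits within 30
```

The single-separator version leaves the long paragraph whole, mirroring the "Created a chunk of size ..." warnings in the CharacterTextSplitter cell, while the recursive version keeps falling back to finer separators until each piece fits.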

In [25]:
loader = PyPDFLoader("data/sample.pdf")
pages = loader.load()

print("Each page preserved as chunk")
print("Total pages:", len(pages))

Each page preserved as chunk
Total pages: 3


Task 10: Semantic Meaning-Based Splitting

What is semantic chunking?

Instead of splitting by length, we split where the meaning changes.

A new chunk starts where:

    The topic shifts

    The semantic distance between neighboring sentences increases

How do embeddings help?

    Convert each sentence into a vector

    Measure the similarity between neighboring sentences

    Break where the similarity drops
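
The three steps above can be sketched as follows. This is a toy illustration under stated assumptions: `embed` is a hypothetical stand-in that builds a bag-of-words count vector instead of calling a real embedding model, and the 0.2 threshold is arbitrary. A real pipeline would swap in model embeddings and tune the threshold.

```python
import math
from collections import Counter


def embed(sentence: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(word.strip(".,!?").lower() for word in sentence.split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def semantic_chunks(sentences: list[str], threshold: float) -> list[list[str]]:
    """Start a new chunk whenever similarity to the previous sentence drops below threshold."""
    if not sentences:
        return []
    chunks = [[sentences[0]]]
    for prev, curr in zip(sentences, sentences[1:]):
        if cosine(embed(prev), embed(curr)) < threshold:
            chunks.append([curr])
        else:
            chunks[-1].append(curr)
    return chunks


sentences = [
    "Cats sleep most of the day.",
    "Cats also sleep at night.",
    "Python lists are mutable.",
    "Python tuples are immutable.",
]
result = semantic_chunks(sentences, threshold=0.2)
for chunk in result:
    print(chunk)
```

The two sentences about cats share words and stay together; the similarity to the first Python sentence is zero, so a new chunk starts there.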

## PART 3: Mini Integration

In [38]:
from langchain_community.document_loaders import DirectoryLoader, WebBaseLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

def load_and_split_documents(path_or_url):
    """Load a web page or a directory of .txt files and split it into chunks."""
    if path_or_url.startswith(("http://", "https://")):
        loader = WebBaseLoader(path_or_url)
    else:
        # No loader_cls is given, so DirectoryLoader uses its default
        # unstructured-based loader, which needs the `unstructured` package.
        loader = DirectoryLoader(path_or_url, glob="**/*.txt")

    documents = loader.load()

    splitter = RecursiveCharacterTextSplitter(
        chunk_size=500,
        chunk_overlap=100
    )

    return splitter.split_documents(documents)

In [None]:
%pip install "unstructured[all-docs]"

In [41]:
chunks = load_and_split_documents("data/")
print("Total chunks from local data:", len(chunks))

libmagic is unavailable but assists in filetype detection. Please consider installing libmagic for better results.


Total chunks from local data: 1


In [36]:
chunks = load_and_split_documents("https://python.langchain.com/docs/")
print("Total chunks from web:", len(chunks))

Total chunks from web: 13


Which loader for which data?

Data Type    Loader
.txt         TextLoader
.csv         CSVLoader
.pdf         PyPDFLoader
Folder       DirectoryLoader
Website      WebBaseLoader
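
The table above could be turned into a small dispatch helper. The sketch below only returns loader names as strings so it runs without LangChain installed. `pick_loader_name` and `LOADER_BY_SUFFIX` are hypothetical names for this example, and treating a path with no extension as a folder is an assumption of the sketch.

```python
from pathlib import Path

# Loader class names taken from the table above, keyed by file extension.
LOADER_BY_SUFFIX = {
    ".txt": "TextLoader",
    ".csv": "CSVLoader",
    ".pdf": "PyPDFLoader",
}


def pick_loader_name(source: str) -> str:
    """Return the name of the loader the table suggests for this source."""
    if source.startswith(("http://", "https://")):
        return "WebBaseLoader"
    suffix = Path(source).suffix.lower()
    if suffix in LOADER_BY_SUFFIX:
        return LOADER_BY_SUFFIX[suffix]
    if suffix == "":
        # No extension: treat it as a folder (an assumption of this sketch).
        return "DirectoryLoader"
    raise ValueError(f"No loader listed for {source!r}")


for source in ["data/sample.txt", "data/sample.csv", "data/sample.pdf",
               "data/", "https://python.langchain.com/docs/"]:
    print(source, "->", pick_loader_name(source))
```

In a real pipeline the dictionary values would be the loader classes themselves, so the helper could return a ready-to-use loader instance.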

Which splitter for which use case?

Use Case      Best Splitter
Small text    CharacterTextSplitter
Large PDFs    RecursiveCharacterTextSplitter
Web data      RecursiveCharacterTextSplitter

Why is chunk overlap important?

    Maintains context continuity across chunk boundaries

    Keeps text cut at a boundary available in the next chunk, so it is not lost to retrieval

    Improves retrieval quality

    Reduces answer fragmentation
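
A minimal sliding-window sketch makes the effect of overlap visible. It is a plain-Python illustration, not how LangChain's splitters apply chunk_overlap internally; `split_with_overlap`, the sample sentence, and the sizes 40 and 15 are all made up for this example.

```python
def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Sliding-window split: each chunk starts chunk_size - chunk_overlap after the previous one."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    if not text:
        return []
    step = chunk_size - chunk_overlap
    # Stop early so the last chunk is never made up of overlap alone.
    last_start = max(len(text) - chunk_overlap, 1)
    return [text[i:i + chunk_size] for i in range(0, last_start, step)]


text = "The refund window is 30 days. Refunds go to the original payment method."
chunks = split_with_overlap(text, chunk_size=40, chunk_overlap=15)

for chunk in chunks:
    print(repr(chunk))

# The tail of each chunk reappears at the head of the next one.
print(chunks[0][-15:] == chunks[1][:15])  # True
```

Because the last 15 characters of one chunk are repeated at the start of the next, text that straddles a boundary is still retrievable from the following chunk.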