# Advanced LangChain Document Loaders and Text Splitters

This notebook demonstrates six additional loader and splitter classes not covered in the basic examples, followed by a combined workflow.

## Overview
We'll explore:
1. **CSVLoader** - For structured data files
2. **JSONLoader** - For JSON data with jq queries
3. **DirectoryLoader** - For loading multiple files
4. **RecursiveCharacterTextSplitter** - Intelligent text chunking
5. **MarkdownHeaderTextSplitter** - Structure-aware splitting
6. **TokenTextSplitter** - Token-based splitting for LLMs
7. **Combined workflows** for real-world data processing

In [3]:

# Import required libraries
import os
import json
import shutil
from typing import List

# Core LangChain imports
from langchain_core.documents import Document

# Document Loaders
from langchain_community.document_loaders import (
    CSVLoader,
    JSONLoader,
    DirectoryLoader,
    TextLoader
)

# Text Splitters (langchain.text_splitter is a legacy import path in recent releases)
from langchain_text_splitters import (
    RecursiveCharacterTextSplitter,
    MarkdownHeaderTextSplitter,
    TokenTextSplitter
)

## 1. CSV Loader - Loading Structured Data

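The CSVLoader turns each row of a CSV file into one Document: the page content lists the row as column: value pairs, and the metadata records the source file and the row index. Parsing options such as the delimiter and quote character are passed through csv_args.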

In [32]:
# Create a sample CSV file for demonstration
sample_csv_data = "name,age,department,salary\nJohn Doe,30,Engineering,75000\nJane Smith,28,Marketing,65000\nBob Johnson,35,Sales,70000\nAlice Brown,32,HR,60000\nCharlie Davis,29,Engineering,72000"

with open('employees.csv', 'w') as f:
    f.write(sample_csv_data)

print("Sample CSV file created")
print("Content preview:")
print(sample_csv_data[:200] + "...")

Sample CSV file created
Content preview:
name,age,department,salary
John Doe,30,Engineering,75000
Jane Smith,28,Marketing,65000
Bob Johnson,35,Sales,70000
Alice Brown,32,HR,60000
Charlie Davis,29,Engineering,72000...


In [19]:
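# Load the CSV with explicit parsing options (csv_args is passed to csv.DictReader).
# Caution: 'fieldnames' tells the reader that the file has no header row, so the
# header line of employees.csv is loaded as the first document (row 0).
# Omit 'fieldnames' to take the column names from the file's own header instead.
csv_loader = CSVLoader(
    file_path='employees.csv',
    csv_args={
        'delimiter': ',',
        'quotechar': '"',
        'fieldnames': ['name', 'age', 'department', 'salary']
    }
)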

csv_documents = csv_loader.load()

print(f"Loaded {len(csv_documents)} documents from CSV")
print(f"Sample document metadata: {csv_documents[0].metadata}")
print(f"Sample content: {csv_documents[0].page_content}")
print(f"Document type: {type(csv_documents[0])}")

Loaded 6 documents from CSV
Sample document metadata: {'source': 'employees.csv', 'row': 0}
Sample content: name: name
age: age
department: department
salary: salary
Document type: <class 'langchain_core.documents.base.Document'>
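
The output above reports six documents for five employees, and the first document contains the header itself. CSVLoader hands csv_args to Python's csv.DictReader, so the effect can be reproduced with the standard library alone. The sketch below is a minimal illustration on a shortened, in-memory copy of the sample data; the variable names are illustrative and not part of LangChain.

```python
import csv
import io

# Shortened in-memory copy of the sample data: one header line plus two rows
sample = (
    "name,age,department,salary\n"
    "John Doe,30,Engineering,75000\n"
    "Jane Smith,28,Marketing,65000\n"
)
columns = ["name", "age", "department", "salary"]

# With explicit fieldnames, DictReader treats the first line as a data row
with_fieldnames = list(csv.DictReader(io.StringIO(sample), fieldnames=columns))

# Without fieldnames, the first line is consumed as the header
from_header = list(csv.DictReader(io.StringIO(sample)))

print(len(with_fieldnames), with_fieldnames[0]["name"])  # 3 name
print(len(from_header), from_header[0]["name"])          # 2 John Doe
```

Dropping 'fieldnames' from csv_args is therefore the simplest way to avoid a spurious header document when the file already has a header row.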


## 2. JSON Loader - Loading JSON Data with JQ Queries

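The JSONLoader uses a jq expression (jq_schema) to select records from a JSON file, which makes it practical for nested data. It requires the jq Python package to be installed. A content_key names the field used as page content, and an optional metadata_func copies other fields into the metadata.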

In [20]:
# Create sample JSON data
sample_json_data = {
    "articles": [
        {
            "title": "Introduction to Machine Learning",
            "author": "Dr. Smith",
            "content": "Machine learning is a subset of artificial intelligence that focuses on algorithms that can learn from data.",
            "tags": ["AI", "ML", "Technology"],
            "published_date": "2024-01-15"
        },
        {
            "title": "Deep Learning Fundamentals",
            "author": "Prof. Johnson",
            "content": "Deep learning uses neural networks with multiple layers to model and understand complex patterns in data.",
            "tags": ["Deep Learning", "Neural Networks", "AI"],
            "published_date": "2024-01-20"
        }
    ]
}

with open('articles.json', 'w') as f:
    json.dump(sample_json_data, f, indent=2)

print("Sample JSON file created")
print(f"Number of articles: {len(sample_json_data['articles'])}")

Sample JSON file created
Number of articles: 2


In [21]:
# Load JSON with JQ query to extract specific fields
json_loader = JSONLoader(
    file_path='articles.json',
    jq_schema='.articles[]',  # Extract each article
    content_key='content',    # Use 'content' field as page_content
    metadata_func=lambda record, metadata: {
        **metadata,
        "title": record.get("title"),
        "author": record.get("author"),
        "tags": record.get("tags"),
        "published_date": record.get("published_date")
    }
)

json_documents = json_loader.load()

print(f"Loaded {len(json_documents)} documents from JSON")
print("First document:")
print(f"Metadata: {json_documents[0].metadata}")
print(f"Content: {json_documents[0].page_content}")

Loaded 2 documents from JSON
First document:
Metadata: {'source': '/Users/ortimus/dev/KrishnaikAcademy/AgenticAI/articles.json', 'seq_num': 1, 'title': 'Introduction to Machine Learning', 'author': 'Dr. Smith', 'tags': ['AI', 'ML', 'Technology'], 'published_date': '2024-01-15'}
Content: Machine learning is a subset of artificial intelligence that focuses on algorithms that can learn from data.
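
To make the roles of jq_schema, content_key, and metadata_func concrete, here is a plain-Python counterpart of the cell above. It is only a sketch: it uses dictionaries instead of Document objects, hypothetical two-record data, and it leaves out the 'source' metadata that the real loader adds.

```python
import json

# Hypothetical data with the same shape as articles.json
raw = json.loads(
    '{"articles": ['
    '{"title": "A", "author": "X", "content": "first body"},'
    '{"title": "B", "author": "Y", "content": "second body"}'
    ']}'
)

# Plain-Python counterpart of the jq expression '.articles[]'
records = raw["articles"]

documents = []
for seq_num, record in enumerate(records, start=1):  # seq_num starts at 1
    documents.append({
        # content_key='content' selects this field as the page content
        "page_content": record["content"],
        # metadata_func contributes the remaining fields
        "metadata": {
            "seq_num": seq_num,
            "title": record.get("title"),
            "author": record.get("author"),
        },
    })

print(len(documents), documents[0]["metadata"])
```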


## 3. Directory Loader - Loading Multiple Files

The DirectoryLoader loads every file in a directory that matches a glob pattern and delegates each file to a loader class (TextLoader in this example), returning one combined list of documents.

In [22]:
# Create a sample directory structure
os.makedirs('sample_docs', exist_ok=True)

# Create sample text files
sample_texts = {
    'nlp_basics.txt': "Natural Language Processing (NLP) is a field of AI that focuses on human language.",
    'computer_vision.txt': "Computer Vision trains computers to interpret and understand visual information.",
    'robotics_ai.txt': "Robotics combines AI, engineering, and physics to create intelligent machines."
}

for filename, content in sample_texts.items():
    with open(f'sample_docs/{filename}', 'w') as f:
        f.write(content)

print(f"Created directory with {len(sample_texts)} files")

Created directory with 3 files


In [23]:
# Load all text files from directory
directory_loader = DirectoryLoader(
    'sample_docs/',
    glob="*.txt",
    loader_cls=TextLoader
)

directory_documents = directory_loader.load()

print(f"Loaded {len(directory_documents)} documents from directory")
for doc in directory_documents:
    filename = os.path.basename(doc.metadata['source'])
    print(f"  - {filename}: {len(doc.page_content)} characters")

Loaded 3 documents from directory
  - nlp_basics.txt: 82 characters
  - computer_vision.txt: 80 characters
  - robotics_ai.txt: 78 characters
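
The glob pattern decides which files are picked up. The sketch below assumes the pattern behaves like pathlib's Path.glob, and uses a temporary directory with hypothetical file names to show that "*.txt" matches only top-level text files, while a recursive search also reaches subdirectories.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "a.txt").write_text("top-level text file")
    (root / "notes.md").write_text("markdown file, not matched by *.txt")
    (root / "nested").mkdir()
    (root / "nested" / "b.txt").write_text("text file in a subdirectory")

    # Non-recursive: only files directly inside root
    top_level = sorted(p.name for p in root.glob("*.txt"))
    # Recursive: files in root and in every subdirectory
    recursive = sorted(p.name for p in root.rglob("*.txt"))

print(top_level)   # ['a.txt']
print(recursive)   # ['a.txt', 'b.txt']
```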


## 4. Recursive Character Text Splitter - Intelligent Text Chunking

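The RecursiveCharacterTextSplitter tries a list of separators in order (here paragraph breaks, line breaks, spaces, then single characters) and falls back to the next separator only when a piece is still longer than chunk_size. As a result, chunks tend to end at paragraph or line boundaries rather than mid-sentence.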

In [24]:
# Sample long text for splitting
long_text = """# Introduction to Artificial Intelligence

Artificial Intelligence (AI) is a rapidly evolving field that aims to create machines capable of intelligent behavior.

## History of AI

The field of AI was founded in 1956 at a conference at Dartmouth College. Early pioneers included Alan Turing, John McCarthy, and Marvin Minsky.

### Early Developments
- 1950: Alan Turing publishes Computing Machinery and Intelligence
- 1956: The term artificial intelligence is coined
- 1960s: First AI programs developed

## Applications of AI

AI has numerous applications across various industries:

1. Healthcare: Diagnostic imaging, drug discovery
2. Transportation: Autonomous vehicles, traffic optimization
3. Finance: Fraud detection, algorithmic trading
4. Education: Personalized learning, automated grading"""

print(f"Long text prepared ({len(long_text)} characters)")

Long text prepared (800 characters)


In [25]:
# Initialize the splitter with various parameters
recursive_splitter = RecursiveCharacterTextSplitter(
    chunk_size=300,
    chunk_overlap=50,
    length_function=len,
    separators=["\n\n", "\n", " ", ""],
    keep_separator=True
)

# Split the text
chunks = recursive_splitter.split_text(long_text)

print(f"Split text into {len(chunks)} chunks")
print(f"Average chunk size: {sum(len(chunk) for chunk in chunks) / len(chunks):.1f} characters")

for i, chunk in enumerate(chunks[:2]):
    print(f"\nChunk {i+1}:")
    print(chunk[:150] + "..." if len(chunk) > 150 else chunk)

Split text into 4 chunks
Average chunk size: 203.0 characters

Chunk 1:
# Introduction to Artificial Intelligence

Artificial Intelligence (AI) is a rapidly evolving field that aims to create machines capable of intelligen...

Chunk 2:
## History of AI

The field of AI was founded in 1956 at a conference at Dartmouth College. Early pioneers included Alan Turing, John McCarthy, and Ma...
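
The fallback through the separator list can be sketched in a few lines of plain Python. This is a simplified illustration of the idea, not LangChain's implementation: it ignores chunk_overlap and keep_separator, and it assumes the separator list ends with the empty string.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Split text into pieces of at most chunk_size characters.

    Separators are tried in order; the final "" means "cut anywhere".
    """
    if len(text) <= chunk_size:
        return [text] if text else []
    sep, rest = separators[0], separators[1:]
    if sep == "":
        # Last resort: hard cut every chunk_size characters
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    if sep not in text:
        return recursive_split(text, chunk_size, rest)

    chunks, current = [], ""
    for part in text.split(sep):
        candidate = part if not current else current + sep + part
        if len(candidate) <= chunk_size:
            current = candidate          # keep merging small parts
            continue
        if current:
            chunks.append(current)       # close the chunk built so far
        if len(part) <= chunk_size:
            current = part
        else:
            # The part alone is too long: retry with finer separators
            chunks.extend(recursive_split(part, chunk_size, rest))
            current = ""
    if current:
        chunks.append(current)
    return chunks


pieces = recursive_split("aaaa bbbb\n\ncccc dddd eeee", 10, ("\n\n", " ", ""))
print(pieces)  # ['aaaa bbbb', 'cccc dddd', 'eeee']
```

The first paragraph fits in one chunk, so it is kept whole; the second is too long and is split again at spaces.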


## 5. Markdown Header Text Splitter - Structure-Aware Splitting

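The MarkdownHeaderTextSplitter splits text at markdown headers and records the headers that enclose each section in that section's metadata, so every chunk carries its position in the document hierarchy.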

In [26]:
# Define headers to split on
headers_to_split_on = [
    ("#", "Header 1"),
    ("##", "Header 2"),
    ("###", "Header 3"),
]

# Initialize the markdown splitter
markdown_splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=headers_to_split_on,
    strip_headers=False
)

# Split the markdown text
markdown_chunks = markdown_splitter.split_text(long_text)

print(f"Split markdown into {len(markdown_chunks)} sections")

for i, chunk in enumerate(markdown_chunks[:3]):
    print(f"\nChunk {i+1}:")
    print(f"Metadata: {chunk.metadata}")
    print(f"Content preview: {chunk.page_content[:100]}...")

Split markdown into 4 sections

Chunk 1:
Metadata: {'Header 1': 'Introduction to Artificial Intelligence'}
Content preview: # Introduction to Artificial Intelligence  
Artificial Intelligence (AI) is a rapidly evolving field...

Chunk 2:
Metadata: {'Header 1': 'Introduction to Artificial Intelligence', 'Header 2': 'History of AI'}
Content preview: ## History of AI  
The field of AI was founded in 1956 at a conference at Dartmouth College. Early p...

Chunk 3:
Metadata: {'Header 1': 'Introduction to Artificial Intelligence', 'Header 2': 'History of AI', 'Header 3': 'Early Developments'}
Content preview: ### Early Developments
- 1950: Alan Turing publishes Computing Machinery and Intelligence
- 1956: Th...
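
The metadata in the output above follows a simple rule: a new header replaces any active header at the same or a deeper level. The sketch below shows that bookkeeping in plain Python. It is a simplified stand-in, not the library's code: it returns dictionaries, drops the header lines from the content (unlike strip_headers=False above), and does not treat fenced code blocks specially.

```python
def split_by_headers(markdown, headers):
    """Split markdown into sections; headers is a list of (prefix, metadata_key)."""
    levels = dict(headers)
    sections, body = [], []
    active = {}  # header depth -> (metadata_key, title)

    def flush():
        content = "\n".join(body).strip()
        if content:
            metadata = {key: title for _, (key, title) in sorted(active.items())}
            sections.append({"metadata": metadata, "content": content})
        body.clear()

    for line in markdown.splitlines():
        prefix, _, title = line.partition(" ")
        if prefix in levels and title:
            flush()
            depth = len(prefix)
            # A new header closes every header at the same or a deeper level
            for d in [d for d in active if d >= depth]:
                del active[d]
            active[depth] = (levels[prefix], title.strip())
        else:
            body.append(line)
    flush()
    return sections


sample_md = "# A\nintro\n## B\ntext b\n### C\ntext c\n## D\ntext d"
sections = split_by_headers(
    sample_md, [("#", "Header 1"), ("##", "Header 2"), ("###", "Header 3")]
)
for section in sections:
    print(section["metadata"], "|", section["content"])
```

The last section shows the reset: once "## D" starts, "Header 2" becomes "D" and "Header 3" disappears from the metadata.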


## 6. Token Text Splitter - Token-Based Chunking for LLMs

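The TokenTextSplitter measures chunk size in tokens rather than characters. This matters for LLM applications because model context limits are defined in tokens, and the number of characters per token varies with the text.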

In [27]:
# Initialize token splitter
token_splitter = TokenTextSplitter(
    chunk_size=100,
    chunk_overlap=20,
    encoding_name="cl100k_base"
)

# Split text by tokens
token_chunks = token_splitter.split_text(long_text)

print(f"Split text into {len(token_chunks)} token-based chunks")

# Analyze token counts
import tiktoken
encoding = tiktoken.get_encoding("cl100k_base")

token_counts = [len(encoding.encode(chunk)) for chunk in token_chunks]
print(f"Token counts: {token_counts}")
print(f"Average tokens per chunk: {sum(token_counts) / len(token_counts):.1f}")

Split text into 2 token-based chunks
Token counts: [100, 79]
Average tokens per chunk: 89.5
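
The counts [100, 79] follow from a sliding window over the token sequence: each window holds up to chunk_size tokens and starts chunk_size - chunk_overlap tokens after the previous one. The sketch below reproduces the arithmetic on integer stand-ins for token IDs, assuming an input of 159 tokens (a figure inferred from the counts above, not measured separately). The function name is illustrative.

```python
def window_token_ids(token_ids, chunk_size, chunk_overlap):
    """Return overlapping windows of token IDs."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    windows, start = [], 0
    while start < len(token_ids):
        end = min(start + chunk_size, len(token_ids))
        windows.append(token_ids[start:end])
        if end == len(token_ids):
            break  # the last window reached the end of the input
        start += step
    return windows


token_ids = list(range(159))  # stand-in for real token IDs (assumed length)
windows = window_token_ids(token_ids, chunk_size=100, chunk_overlap=20)
print([len(w) for w in windows])  # [100, 79]
print(windows[1][0])              # 80: the second window re-reads 20 tokens
```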


## 7. Combining Loaders with Splitters - Real-World Workflow

In practice, you'll combine document loaders with text splitters for complete data processing pipelines.

In [28]:
def process_documents_workflow(documents: List[Document], splitter) -> List[Document]:
    """Split each document and copy its metadata onto every chunk, adding chunk_index and total_chunks."""
    all_chunks = []
    
    for doc in documents:
        chunks = splitter.split_text(doc.page_content)
        
        for i, chunk in enumerate(chunks):
            chunk_doc = Document(
                page_content=chunk,
                metadata={
                    **doc.metadata,
                    'chunk_index': i,
                    'total_chunks': len(chunks)
                }
            )
            all_chunks.append(chunk_doc)
    
    return all_chunks

print("Workflow functions defined")

Workflow functions defined


In [29]:
# Process different document types with different splitters

# Process CSV documents with recursive splitter
processed_csv = process_documents_workflow(csv_documents, recursive_splitter)
print(f"CSV: {len(csv_documents)} docs → {len(processed_csv)} chunks")

# Process JSON documents with token splitter
processed_json = process_documents_workflow(json_documents, token_splitter)
print(f"JSON: {len(json_documents)} docs → {len(processed_json)} chunks")

# Process directory documents
processed_directory = process_documents_workflow(directory_documents, recursive_splitter)
print(f"Directory: {len(directory_documents)} docs → {len(processed_directory)} chunks")

total_original = len(csv_documents + json_documents + directory_documents)
total_processed = len(processed_csv + processed_json + processed_directory)
print(f"\nTotal: {total_original} docs → {total_processed} chunks")

CSV: 6 docs → 6 chunks
JSON: 2 docs → 2 chunks
Directory: 3 docs → 3 chunks

Total: 11 docs → 11 chunks
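
Every sample document here is shorter than its splitter's chunk size, which is why 11 documents yield exactly 11 chunks. Counting chunks per document makes this easy to check before running a full pipeline. The sketch below uses hypothetical stand-ins (Doc, FixedWidthSplitter, chunks_per_document) so that it runs with the standard library alone; any object with a split_text method, including the splitters above, could take the splitter's place.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical stand-in for langchain_core.documents.Document."""
    page_content: str
    metadata: dict = field(default_factory=dict)


class FixedWidthSplitter:
    """Hypothetical splitter that cuts text every chunk_size characters."""

    def __init__(self, chunk_size):
        self.chunk_size = chunk_size

    def split_text(self, text):
        return [text[i:i + self.chunk_size] for i in range(0, len(text), self.chunk_size)]


def chunks_per_document(documents, splitter):
    """Return how many chunks each document produces."""
    return [len(splitter.split_text(doc.page_content)) for doc in documents]


docs = [Doc("short text"), Doc("x" * 650)]
counts = chunks_per_document(docs, FixedWidthSplitter(chunk_size=300))
print(counts)  # [1, 3]
```

A list of all ones, as in the run above, means the splitter left every document unchanged.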


## 8. Cleanup and Summary

In [4]:
# Clean up temporary files
def cleanup_files():
    """Remove temporary files created during the demo"""
    files_to_remove = ['employees.csv', 'articles.json']
    dirs_to_remove = ['sample_docs']
    
    print("🧹 Cleaning up temporary files...")
    
    for file_path in files_to_remove:
        try:
            if os.path.exists(file_path):
                os.remove(file_path)
                print(f"Removed {file_path}")
        except OSError as e:
            print(f"Error removing {file_path}: {e}")
    
    # 'dir_path' avoids shadowing the built-in dir()
    for dir_path in dirs_to_remove:
        try:
            if os.path.exists(dir_path):
                shutil.rmtree(dir_path)
                print(f"Removed directory {dir_path}")
        except OSError as e:
            print(f"Error removing {dir_path}: {e}")
    
    print("\nCleanup completed!")

cleanup_files()

🧹 Cleaning up temporary files...
Removed employees.csv

Cleanup completed!


In [31]:
print("COMPREHENSIVE SUMMARY")
print("=" * 50)

print("\nDOCUMENT LOADERS COVERED:")
print("1. CSVLoader - Structured tabular data with flexible parsing")
print("2. JSONLoader - Complex JSON with JQ query support")
print("3. DirectoryLoader - Bulk loading from file systems")

print("\nTEXT SPLITTERS COVERED:")
print("4. RecursiveCharacterTextSplitter - Intelligent hierarchy-based splitting")
print("5. MarkdownHeaderTextSplitter - Structure-aware markdown processing")
print("6. TokenTextSplitter - Token-based splitting for LLM applications")

print("\nKEY TAKEAWAYS:")
print("• Choose loaders based on data format and complexity")
print("• Select splitters based on your application's needs")
print("• Consider token limits for LLM applications")
print("• Preserve important metadata throughout processing")
print("• Test different configurations for optimal results")



COMPREHENSIVE SUMMARY
==================================================

DOCUMENT LOADERS COVERED:
1. CSVLoader - Structured tabular data with flexible parsing
2. JSONLoader - Complex JSON with JQ query support
3. DirectoryLoader - Bulk loading from file systems

TEXT SPLITTERS COVERED:
4. RecursiveCharacterTextSplitter - Intelligent hierarchy-based splitting
5. MarkdownHeaderTextSplitter - Structure-aware markdown processing
6. TokenTextSplitter - Token-based splitting for LLM applications

KEY TAKEAWAYS:
• Choose loaders based on data format and complexity
• Select splitters based on your application's needs
• Consider token limits for LLM applications
• Preserve important metadata throughout processing
• Test different configurations for optimal results
