# Module Overview
This module covers parsing and ingesting data for RAG systems, from plain text files to complex PDFs and databases. We'll use LangChain v0.3.

Table of Contents

- Introduction to Data Ingestion
- Text Files (.txt)
- PDF Documents
- Microsoft Word Documents
- CSV and Excel Files
- JSON and Structured Data
- Web Scraping
- Databases (SQL)
- Audio and Video Transcripts
- Advanced Techniques
- Best Practices

# Introduction To Data Ingestion

In [1]:
# Import os for file and directory operations
import os
# Import type hints for clearer function signatures and better editor support
from typing import List, Dict, Any
# Import pandas for data manipulation and analysis
import pandas as pd

In [2]:
# Importing the Document class from LangChain Core
# ------------------------------------------------
# The `Document` class represents a single document (or text unit) in LangChain.
# It typically contains two parts:
#   1. page_content → the main text
#   2. metadata → extra information (like file name, source, etc.)
from langchain_core.documents import Document


# Importing Text Splitter classes
# -------------------------------
# LangChain provides several text splitter utilities to break large texts into smaller chunks.
# This helps overcome token limits and improves retrieval accuracy.
from langchain.text_splitter import (
    # Splits text while preserving semantic boundaries (best general-purpose splitter)
    RecursiveCharacterTextSplitter,  
    # Splits text based on character count or separators like '\n'
    CharacterTextSplitter,           
    # Splits text based on token count (useful when working with LLM token limits)
    TokenTextSplitter                
)


# Printing a confirmation message
# -------------------------------
# This line simply indicates that all imports and setup were successful.
print("Set up Completed!")


Set up Completed!


## Understanding Document Structure In LangChain

In [3]:
# Import the Document class
# --------------------------
# The Document class is the fundamental data structure in LangChain used
# to represent a piece of text (page_content) and its associated metadata.
from langchain_core.documents import Document


# Create a simple Document object
# ----------------------------------
# Here, we define a document with:
# - `page_content`: The actual text to be processed or embedded.
# - `metadata`: Additional context or attributes describing the text,
#   such as the source file, author, date, or any custom info.

doc = Document(
    page_content="This is the main text content that will be embedded and searched.",
    metadata={
        "source": "example.txt",        # File or data source name
        "page": 1,                      # Page number or section ID
        "author": "Krish Naik",         # Author or data origin
        "date_created": "2024-01-01",   # Date of creation or ingestion
        "custom_field": "any_value"     # Any custom metadata field
    }
)


# Print statements to confirm document creation
# ---------------------------------------------
# These lines display the internal structure of the Document object.
print("Document Structure")
print(f"Content  : {doc.page_content}")  # Displays the main text content
print(f"Metadata : {doc.metadata}")      # Displays all metadata as a dictionary


Document Structure
Content  : This is the main text content that will be embedded and searched.
Metadata : {'source': 'example.txt', 'page': 1, 'author': 'Krish Naik', 'date_created': '2024-01-01', 'custom_field': 'any_value'}


In [4]:
type(doc)

langchain_core.documents.base.Document
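
Metadata is what lets you narrow a search down to one source or page later on. The sketch below illustrates that idea with plain Python. `SimpleDocument` and `filter_by_metadata` are hypothetical names introduced only for this example; `SimpleDocument` is a stand-in that mimics the two fields shown above and is not LangChain's own class.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class SimpleDocument:
    """Hypothetical stand-in for a document: main text plus metadata."""
    page_content: str
    metadata: Dict[str, Any] = field(default_factory=dict)


def filter_by_metadata(docs: List[SimpleDocument], **criteria: Any) -> List[SimpleDocument]:
    """Keep only the documents whose metadata matches every key/value pair."""
    return [
        d for d in docs
        if all(d.metadata.get(key) == value for key, value in criteria.items())
    ]


docs = [
    SimpleDocument("Intro to Python.", {"source": "python_intro.txt", "page": 1}),
    SimpleDocument("Intro to ML.", {"source": "machine_learning.txt", "page": 1}),
    SimpleDocument("More on ML.", {"source": "machine_learning.txt", "page": 2}),
]

# Select every document that came from one source file
ml_docs = filter_by_metadata(docs, source="machine_learning.txt")
print([d.page_content for d in ml_docs])
```

The same pattern should carry over to real `Document` objects, since they expose `page_content` and `metadata` in the same way.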

## Text Files (.txt)

In [5]:
# Create a directory to hold the sample text files
import os
os.makedirs("data/text_files",exist_ok=True)

In [6]:
sample_texts={
    "data/text_files/python_intro.txt":"""Python Programming Introduction

Python is a high-level, interpreted programming language known for its simplicity and readability.
Created by Guido van Rossum and first released in 1991, Python has become one of the most popular
programming languages in the world.

Key Features:
- Easy to learn and use
- Extensive standard library
- Cross-platform compatibility
- Strong community support

Python is widely used in web development, data science, artificial intelligence, and automation.""",
    
    "data/text_files/machine_learning.txt": """Machine Learning Basics

Machine learning is a subset of artificial intelligence that enables systems to learn and improve
from experience without being explicitly programmed. It focuses on developing computer programs
that can access data and use it to learn for themselves.

Types of Machine Learning:
1. Supervised Learning: Learning with labeled data
2. Unsupervised Learning: Finding patterns in unlabeled data
3. Reinforcement Learning: Learning through rewards and penalties

Applications include image recognition, speech processing, and recommendation systems
    
    
    """

}

for filepath,content in sample_texts.items():
    with open(filepath,'w',encoding="utf-8") as f:
        f.write(content)

print("Sample text files created!")

Sample text files created!


### TextLoader: Read a Single File

In [7]:
# Importing the TextLoader class
# ------------------------------
# Loaders can be imported from two places in LangChain:
# - `langchain.document_loaders`: legacy import path (deprecated)
# - `langchain_community.document_loaders`: the maintained location
# Prefer the second one.
from langchain_community.document_loaders import TextLoader


# Load a single text file
# --------------------------
# TextLoader reads a plain text file (.txt) and wraps its content
# in a LangChain Document object.
# The result is a list containing a single Document for the whole file.
loader = TextLoader("data/text_files/python_intro.txt", encoding="utf-8")


# Load the file content
# ---------------------
# The .load() method reads the file, extracts its text,
# and returns it as a list of Document objects.
documents = loader.load()


# Display summary information
# ----------------------------
# Let's print how many documents were loaded and show a short preview
# of the first document’s content and metadata.

# Number of loaded docs
print(f"Loaded {len(documents)} document")             
# First 100 chars of the content         
print(f"Content preview: {documents[0].page_content[:100]}...") 
# File details like path or source
print(f"Metadata: {documents[0].metadata}")                      


Loaded 1 document
Content preview: Python Programming Introduction

Python is a high-level, interpreted programming language known for ...
Metadata: {'source': 'data/text_files/python_intro.txt'}
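
The `encoding="utf-8"` argument matters because a file saved in another encoding will fail to decode. As a rough illustration of one way to cope with that, the sketch below tries a list of encodings in order. `read_text_with_fallback` is a hypothetical helper written for this example with the standard library only; it is not part of LangChain.

```python
import tempfile
from pathlib import Path
from typing import Sequence, Tuple


def read_text_with_fallback(
    path: str, encodings: Sequence[str] = ("utf-8", "latin-1")
) -> Tuple[str, str]:
    """Return (text, encoding_used), trying each encoding in order."""
    for enc in encodings:
        try:
            return Path(path).read_text(encoding=enc), enc
        except UnicodeDecodeError:
            continue  # This encoding did not fit; try the next one
    raise ValueError(f"Could not decode {path} with any of {list(encodings)}")


with tempfile.TemporaryDirectory() as tmp:
    # Write bytes that are valid Latin-1 but not valid UTF-8
    legacy_path = str(Path(tmp) / "legacy.txt")
    Path(legacy_path).write_bytes(b"caf\xe9")
    text, used = read_text_with_fallback(legacy_path)

print(f"Decoded {text!r} using {used}")
```

Note that `latin-1` accepts any byte sequence, so placing it last means the function always returns something, even if the guess is wrong. Whether that trade-off is acceptable depends on your data.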


### DirectoryLoader: Multiple Text Files

In [8]:
# Importing DirectoryLoader
# --------------------------
# DirectoryLoader helps to automatically scan a folder and load multiple files
# into LangChain Document objects. It can recursively search through subdirectories,
# apply file-matching patterns, and use any specific loader (like TextLoader or PyPDFLoader).
from langchain_community.document_loaders import DirectoryLoader
from langchain_community.document_loaders import TextLoader  # required for loader_cls


# ✅ Load all text files from a directory
# ---------------------------------------
# DirectoryLoader parameters:
# - "data/text_files": The root directory path.
# - glob="**/*.txt": Pattern to match all text files recursively.
# - loader_cls=TextLoader: Defines how each file is loaded (here, as text).
# - loader_kwargs={'encoding': 'utf-8'}: Extra parameters for the loader (e.g., encoding).
# - show_progress=True: Displays a progress bar while loading files.
dir_loader = DirectoryLoader(
    "data/text_files",
    # Matches all .txt files inside folders and subfolders
    glob="**/*.txt",                    
    # Use TextLoader for each file
    loader_cls=TextLoader,    
    # Pass UTF-8 encoding          
    loader_kwargs={'encoding': 'utf-8'},
    # Show progress bar during file loading
    show_progress=True                  
)


# Load all matched documents into a list
# --------------------------------------
# Each file is loaded as a Document object (text + metadata)
documents = dir_loader.load()


# Display summary information
# -------------------------------
# Print the total number of documents loaded and display details of each.
print(f"📁 Loaded {len(documents)} documents")

for i, doc in enumerate(documents):
    print(f"\nDocument {i+1}:")
    # File path or source name
    print(f"  Source: {doc.metadata['source']}")     
    # Number of characters in content     
    print(f"  Length: {len(doc.page_content)} characters")

100%|██████████| 2/2 [00:00<00:00, 1816.11it/s]

📁 Loaded 2 documents

Document 1:
  Source: data\text_files\machine_learning.txt
  Length: 575 characters

Document 2:
  Source: data\text_files\python_intro.txt
  Length: 489 characters





| Component           | Description                                                       |
| ------------------- | ----------------------------------------------------------------- |
| **DirectoryLoader** | Scans a folder and loads all matching files as `Document` objects |
| `glob="**/*.txt"`   | Recursively loads all `.txt` files inside the directory           |
| **loader_cls**      | Defines which loader to use for each file                         |
| **loader_kwargs**   | Extra parameters (e.g., encoding, options) passed to the loader   |
| **show_progress**   | Displays progress bar while loading                               |
| **documents**       | List of `Document` objects containing text and metadata           |


| Aspect          | Details                                                                                                                                                                                                                     |
| --------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| ✅ Advantages    | - Loads multiple files at once (efficient bulk loading) <br> - Supports glob patterns (e.g., `*.txt`, `*.pdf`) <br> - Progress tracking while loading <br> - Recursive directory scanning (loads files from nested folders) |
| ❌ Disadvantages | - All files must be of the same type (one loader per run) <br> - Limited error handling per file (one failure may affect batch) <br> - Can be memory intensive for large directories                                        |
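
To make the error-handling limitation concrete, here is a minimal sketch of loading a directory file by file so that one unreadable file is recorded and skipped instead of stopping the batch. It uses only the standard library; `load_directory` and the dictionary layout are assumptions made for this example and do not reproduce how `DirectoryLoader` works internally.

```python
import tempfile
from pathlib import Path
from typing import Any, Dict, List, Tuple


def load_directory(
    root: str, pattern: str = "*.txt", encoding: str = "utf-8"
) -> Tuple[List[Dict[str, Any]], List[Tuple[str, str]]]:
    """Load every matching file under root, collecting failures separately."""
    documents: List[Dict[str, Any]] = []
    failures: List[Tuple[str, str]] = []
    # rglob searches the root folder and all nested folders
    for path in sorted(Path(root).rglob(pattern)):
        try:
            text = path.read_text(encoding=encoding)
        except (OSError, UnicodeDecodeError) as exc:
            failures.append((str(path), repr(exc)))  # Record and move on
            continue
        documents.append({"page_content": text, "metadata": {"source": str(path)}})
    return documents, failures


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "nested").mkdir()
    (root / "good.txt").write_text("readable text", encoding="utf-8")
    (root / "nested" / "also_good.txt").write_text("nested text", encoding="utf-8")
    (root / "bad.txt").write_bytes(b"\xff\xfe\x00broken")  # Not valid UTF-8
    docs, failed = load_directory(tmp)

print(f"Loaded {len(docs)} documents, skipped {len(failed)}")
```

Returning the failures alongside the documents keeps the decision about what to do with bad files (retry, log, or abort) in the caller's hands.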


## Text Splitting Strategies

In [9]:
# Importing different text splitter classes
# -----------------------------------------
# LangChain provides multiple text splitters to handle large documents
# before embedding or retrieval. Each splitter has its own strategy:

from langchain.text_splitter import (
    CharacterTextSplitter,           # Splits text by character count or separators (simple approach)
    RecursiveCharacterTextSplitter,  # Recursively splits text by paragraphs, sentences, or sections (context-aware)
    TokenTextSplitter                # Splits text based on LLM tokens instead of raw characters
)

# Display the loaded documents for reference
print(documents)

[Document(metadata={'source': 'data\\text_files\\machine_learning.txt'}, page_content='Machine Learning Basics\n\nMachine learning is a subset of artificial intelligence that enables systems to learn and improve\nfrom experience without being explicitly programmed. It focuses on developing computer programs\nthat can access data and use it to learn for themselves.\n\nTypes of Machine Learning:\n1. Supervised Learning: Learning with labeled data\n2. Unsupervised Learning: Finding patterns in unlabeled data\n3. Reinforcement Learning: Learning through rewards and penalties\n\nApplications include image recognition, speech processing, and recommendation systems\n\n\n    '), Document(metadata={'source': 'data\\text_files\\python_intro.txt'}, page_content='Python Programming Introduction\n\nPython is a high-level, interpreted programming language known for its simplicity and readability.\nCreated by Guido van Rossum and first released in 1991, Python has become one of the most popular\nprog

### Character-Based Splitting

In [11]:
# Method 1: Character Text Splitter (start from the first loaded document's text)
text=documents[0].page_content
text

'Machine Learning Basics\n\nMachine learning is a subset of artificial intelligence that enables systems to learn and improve\nfrom experience without being explicitly programmed. It focuses on developing computer programs\nthat can access data and use it to learn for themselves.\n\nTypes of Machine Learning:\n1. Supervised Learning: Learning with labeled data\n2. Unsupervised Learning: Finding patterns in unlabeled data\n3. Reinforcement Learning: Learning through rewards and penalties\n\nApplications include image recognition, speech processing, and recommendation systems\n\n\n    '

In [12]:
# Method 1: Character-based splitting
print("1️⃣ CHARACTER TEXT SPLITTER")
char_splitter = CharacterTextSplitter(
    separator=" ",  # Split on spaces
    chunk_size=200,  # Max chunk size in characters
    chunk_overlap=20,  # Overlap between chunks
    length_function=len  # How to measure chunk size
)

char_chunks=char_splitter.split_text(text)
print(f"Created {len(char_chunks)} chunks")
print(f"First chunk: {char_chunks[0][:100]}...")

1️⃣ CHARACTER TEXT SPLITTER
Created 3 chunks
First chunk: Machine Learning Basics

Machine learning is a subset of artificial intelligence that enables system...


In [13]:
print(char_chunks[0])
print("------------------")
print(char_chunks[1])

Machine Learning Basics

Machine learning is a subset of artificial intelligence that enables systems to learn and improve
from experience without being explicitly programmed. It focuses on developing
------------------
on developing computer programs
that can access data and use it to learn for themselves.

Types of Machine Learning:
1. Supervised Learning: Learning with labeled data
2. Unsupervised Learning:


In [14]:
# Method 1 again: character-based splitting, this time on newlines
print("1️⃣ CHARACTER TEXT SPLITTER")
char_splitter = CharacterTextSplitter(
    separator="\n",  # Split on newlines
    chunk_size=200,  # Max chunk size in characters
    chunk_overlap=20,  # Overlap between chunks
    length_function=len  # How to measure chunk size
)

char_chunks=char_splitter.split_text(text)
print(f"Created {len(char_chunks)} chunks")
print(f"First chunk: {char_chunks[0][:100]}...")

1️⃣ CHARACTER TEXT SPLITTER
Created 4 chunks
First chunk: Machine Learning Basics
Machine learning is a subset of artificial intelligence that enables systems...


In [15]:
print(char_chunks[0])
print("-------------")
print(char_chunks[1])
print("-------------")
print(char_chunks[2])

Machine Learning Basics
Machine learning is a subset of artificial intelligence that enables systems to learn and improve
-------------
from experience without being explicitly programmed. It focuses on developing computer programs
that can access data and use it to learn for themselves.
Types of Machine Learning:
-------------
1. Supervised Learning: Learning with labeled data
2. Unsupervised Learning: Finding patterns in unlabeled data
3. Reinforcement Learning: Learning through rewards and penalties
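
The interplay of `separator`, `chunk_size`, and `chunk_overlap` seen above can be approximated in a few lines of plain Python. This is a simplified sketch under stated assumptions, not the library's implementation: `split_on_separator` is a hypothetical function that splits on one separator, greedily packs pieces into chunks, and carries the trailing pieces (up to `chunk_overlap` characters) into the next chunk.

```python
from typing import List


def split_on_separator(
    text: str, separator: str, chunk_size: int, chunk_overlap: int
) -> List[str]:
    """Greedily pack separator-delimited pieces into overlapping chunks."""
    pieces = [p for p in text.split(separator) if p]
    chunks: List[str] = []
    current: List[str] = []  # Pieces that make up the chunk being built

    for piece in pieces:
        if current and len(separator.join(current + [piece])) > chunk_size:
            # The chunk is full: emit it
            chunks.append(separator.join(current))
            # Keep only trailing pieces that fit within the overlap budget
            while current and len(separator.join(current)) > chunk_overlap:
                current.pop(0)
            # Drop more if the carried text plus the new piece is still too long
            while current and len(separator.join(current + [piece])) > chunk_size:
                current.pop(0)
        current.append(piece)

    if current:
        chunks.append(separator.join(current))
    return chunks


sample = "one two three four five six seven eight"
for chunk in split_on_separator(sample, " ", chunk_size=15, chunk_overlap=5):
    print(repr(chunk))
```

Because whole pieces are carried over, the overlap is at most `chunk_overlap` characters rather than exactly that many. A single piece longer than `chunk_size` is emitted as an oversized chunk in this sketch.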


### Recursive Character Splitting

In [18]:
# RECURSIVE CHARACTER TEXT SPLITTER (Recommended)
# ---------------------------------------------------
# This splitter is context-aware: it tries to split the text at logical boundaries
# using the list of separators provided. It is ideal for long documents because
# it preserves some context across chunks.

from langchain.text_splitter import RecursiveCharacterTextSplitter

# Example text to split
text = documents[0].page_content  # Using the first loaded document

# Create a RecursiveCharacterTextSplitter instance
recursive_splitter = RecursiveCharacterTextSplitter(
    # Separators are tried in order; only a space is given here, so every split
    # happens at a space (the default list is ["\n\n", "\n", " ", ""])
    separators=[" "],  
    # Maximum number of characters per chunk
    chunk_size=200,    
    # Number of characters to overlap between consecutive chunks
    chunk_overlap=20,  
    # Function to measure text length (here, number of characters)
    length_function=len 
)

# Split the text into chunks
recursive_chunks = recursive_splitter.split_text(text)

# Display summary information
# Number of chunks created
print(f"Created {len(recursive_chunks)} chunks")         
# Show first 100 characters of the first chunk
print(f"First chunk: {recursive_chunks[0][:100]}...")   

Created 4 chunks
First chunk: Machine Learning Basics

Machine learning is a subset of artificial intelligence that enables system...


In [19]:
print(recursive_chunks[0])
print("-----------------")
print(recursive_chunks[1])
print("------------------")
print(recursive_chunks[2])


Machine Learning Basics

Machine learning is a subset of artificial intelligence that enables systems to learn and improve
from experience without being explicitly programmed. It focuses on developing
-----------------
on developing computer programs
that can access data and use it to learn for themselves.

Types of Machine Learning:
1. Supervised Learning: Learning with labeled data
2. Unsupervised Learning:
------------------
Learning: Finding patterns in unlabeled data
3. Reinforcement Learning: Learning through rewards and penalties

Applications include image recognition, speech processing, and recommendation


In [20]:
# Example: Splitting text without natural break points
# -----------------------------------------------------
# Some texts don’t have clear paragraph or sentence boundaries (like logs or single-line content)
# RecursiveCharacterTextSplitter can still split them based on specified separators and chunk sizes.

from langchain.text_splitter import RecursiveCharacterTextSplitter

# Sample text (long sentences, no natural breaks)
simple_text = (
    "This is sentence one and it is quite long. "
    "This is sentence two and it is also quite long. "
    "This is sentence three which is even longer than the others. "
    "This is sentence four. This is sentence five. This is sentence six."
)

# Create a recursive character splitter
splitter = RecursiveCharacterTextSplitter(
    separators=[" "],  # Only split at spaces
    chunk_size=80,     # Maximum characters per chunk
    chunk_overlap=20,  # Overlap characters between consecutive chunks
    length_function=len
)

# Split the text
chunks = splitter.split_text(simple_text)

# Display results as consecutive pairs so the overlapping text is easy to spot
print(f"\nSimple text example - {len(chunks)} chunks:\n")

for i in range(len(chunks) - 1):
    print(f"Chunk {i+1}: '{chunks[i]}'")
    print(f"Chunk {i+2}: '{chunks[i+1]}'\n")



Simple text example - 4 chunks:

Chunk 1: 'This is sentence one and it is quite long. This is sentence two and it is also'
Chunk 2: 'two and it is also quite long. This is sentence three which is even longer than'

Chunk 2: 'two and it is also quite long. This is sentence three which is even longer than'
Chunk 3: 'is even longer than the others. This is sentence four. This is sentence five.'

Chunk 3: 'is even longer than the others. This is sentence four. This is sentence five.'
Chunk 4: 'is sentence five. This is sentence six.'
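
The "recursive" part is easier to see when more than one separator is supplied. The following sketch illustrates the fallback idea only: try the coarsest separator first and move to a finer one just for pieces that are still too long. `recursive_split` is a hypothetical function written for this example; it omits overlap and differs in detail from `RecursiveCharacterTextSplitter`.

```python
from typing import List


def recursive_split(text: str, separators: List[str], chunk_size: int) -> List[str]:
    """Split text, falling back to finer separators only where needed."""
    if len(text) <= chunk_size:
        return [text] if text.strip() else []
    if not separators:
        # No separators left: fall back to hard cuts at chunk_size
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    separator, remaining = separators[0], separators[1:]
    chunks: List[str] = []
    buffer = ""  # Small pieces merged back together with the current separator

    for piece in text.split(separator):
        if not piece.strip():
            continue
        if len(piece) > chunk_size:
            # Too long for this level: emit what we have, then go one level finer
            if buffer:
                chunks.append(buffer)
                buffer = ""
            chunks.extend(recursive_split(piece, remaining, chunk_size))
        elif buffer and len(buffer) + len(separator) + len(piece) <= chunk_size:
            buffer += separator + piece
        else:
            if buffer:
                chunks.append(buffer)
            buffer = piece

    if buffer:
        chunks.append(buffer)
    return chunks


sample = (
    "Short paragraph.\n\n"
    "A much longer paragraph that will not fit in one chunk of forty characters.\n\n"
    "End."
)
for chunk in recursive_split(sample, ["\n\n", "\n", " "], chunk_size=40):
    print(repr(chunk))
```

In this sample the short paragraphs survive intact, and only the long paragraph is broken at spaces, which is the behaviour that a single `" "` separator cannot show.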



### Token-Based Splitting

In [21]:
# TOKEN TEXT SPLITTER
# -----------------------
# This splitter splits text based on tokens rather than characters.
# Useful for LLM pipelines where token limits matter more than character count.

from langchain.text_splitter import TokenTextSplitter

# Create a TokenTextSplitter instance
token_splitter = TokenTextSplitter(
    chunk_size=50,    # Maximum number of tokens per chunk
    chunk_overlap=10  # Number of tokens to overlap between chunks
)

# Split the document text into token-based chunks
token_chunks = token_splitter.split_text(text)

# Display summary
print(f"Created {len(token_chunks)} chunks")
print(f"First chunk: {token_chunks[0][:100]}...")  # Display first 100 characters of the first chunk


Created 3 chunks
First chunk: Machine Learning Basics

Machine learning is a subset of artificial intelligence that enables system...
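
Token-based splitting is a sliding window over a token sequence instead of over characters. The sketch below shows the windowing logic, with one loudly stated assumption: it treats whitespace-separated words as "tokens" so it runs without any dependencies. A real tokenizer would produce different, usually smaller, units. `split_by_tokens` is a hypothetical helper, not the `TokenTextSplitter` implementation.

```python
from typing import List


def split_by_tokens(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Slide a window of chunk_size tokens, stepping by chunk_size - chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")

    tokens = text.split()  # Assumption: words stand in for real LLM tokens
    step = chunk_size - chunk_overlap
    chunks: List[str] = []

    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break  # The window has reached the end of the text
    return chunks


for chunk in split_by_tokens("a b c d e f g h i j", chunk_size=4, chunk_overlap=1):
    print(chunk)
```

Each window starts `chunk_size - chunk_overlap` tokens after the previous one, so consecutive chunks share exactly `chunk_overlap` tokens until the final, possibly shorter, chunk.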


### Comparison

| Method                             | Splitting Basis                                     | Chunk Size Type | Overlap    | Pros                                                                   | Cons                                                  | Best Use Case                                                         |
| ---------------------------------- | --------------------------------------------------- | --------------- | ---------- | ---------------------------------------------------------------------- | ----------------------------------------------------- | --------------------------------------------------------------------- |
| **CharacterTextSplitter**          | Characters or separators like spaces, newlines      | Characters      | Characters | Simple, fast, easy to implement                                        | May split sentences/words awkwardly, context loss     | Short/plain text or when token limits don’t matter                    |
| **RecursiveCharacterTextSplitter** | Logical boundaries (paragraphs → sentences → words) | Characters      | Characters | Preserves semantic boundaries, context-aware, handles long text better | Slightly slower than simple character splitter        | Long documents, RAG pipelines, embedding for LLMs                     |
| **TokenTextSplitter**              | Tokens (LLM tokenization)                           | Tokens          | Tokens     | Ensures chunks fit LLM token limits, overlap preserves context         | Needs tokenization, slower than character-based split | Embeddings, LLM processing, or any scenario where token limits matter |
