### 🎯 Module Overview
This module covers everything you need to know about parsing and ingesting data for RAG systems, from basic text files to complex PDFs and databases. We'll use LangChain v0.3 and explore each technique with practical examples.

Table of Contents

- Introduction to Data Ingestion
- Text Files (.txt)
- PDF Documents
- Microsoft Word Documents
- CSV and Excel Files
- JSON and Structured Data
- Web Scraping
- Databases (SQL)
- Audio and Video Transcripts
- Advanced Techniques
- Best Practices

### Introduction to Data Ingestion

**Data Ingestion: The First Step in RAG**

Data ingestion is the process of loading, processing, and preparing your external data to be used in a Retrieval-Augmented Generation (RAG) system. The goal is to convert unstructured or semi-structured data from various sources (like text files, PDFs, websites, etc.) into a clean, standardized format that can be effectively used for retrieval.

The typical ingestion pipeline involves three key steps:
1.  **Loading**: Reading data from its source. LangChain provides a wide variety of `DocumentLoaders` for this purpose.
2.  **Splitting**: Breaking down large documents into smaller, manageable chunks. This is crucial for embedding and retrieval, as it helps the system find more precise pieces of information.
3.  **Storing**: Placing the processed chunks (often after converting them into numerical vectors, or *embeddings*) into a specialized database called a Vector Store, where they can be efficiently searched.

This notebook focuses on the first two steps: **Loading** and **Splitting**.
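The three steps can be sketched end to end with only the standard library. This is a toy illustration, not LangChain's API: the `Chunk` class, the `load`, `split`, and `store` functions, the fixed-width splitting, and the plain list standing in for a vector store are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Chunk:
    """Hypothetical stand-in for a piece of text plus its metadata."""
    text: str
    metadata: dict = field(default_factory=dict)


def load(sources: dict[str, str]) -> list[Chunk]:
    """Step 1 (loading): wrap each raw text with a record of where it came from."""
    return [Chunk(text, {"source": name}) for name, text in sources.items()]


def split(docs: list[Chunk], size: int) -> list[Chunk]:
    """Step 2 (splitting): cut each text into pieces of at most `size` characters."""
    pieces = []
    for doc in docs:
        for start in range(0, len(doc.text), size):
            meta = {**doc.metadata, "start": start}
            pieces.append(Chunk(doc.text[start:start + size], meta))
    return pieces


def store(chunks: list[Chunk]) -> list[Chunk]:
    """Step 3 (storing): a real system would embed each chunk and write it to a
    vector store; this sketch simply keeps the chunks in a list."""
    return list(chunks)


raw = {"notes.txt": "RAG systems retrieve relevant chunks before generating an answer."}
index = store(split(load(raw), size=30))
for chunk in index:
    print(chunk.metadata, repr(chunk.text))
```

Each chunk carries its source and starting offset, so a retrieved piece can always be traced back to the file it came from.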

In [1]:
# os: A standard Python library for interacting with the operating system, used here for file path and directory management.
import os

# typing: Provides support for type hints, which makes the code more readable and easier to debug.
from typing import List, Dict, Any

# pandas: A powerful data manipulation and analysis library, useful for handling structured data like from CSV or Excel files.
import pandas as pd

In [2]:
# Document: The fundamental object in LangChain for representing a piece of text and its associated metadata.
from langchain_core.documents import Document

# TextSplitters: These are classes designed to break down long texts into smaller chunks.
from langchain.text_splitter import (
    RecursiveCharacterTextSplitter, # Recommended for general text.
    CharacterTextSplitter,          # Splits based on a single character.
    TokenTextSplitter               # Splits based on language model tokens.
)
print("Set up Completed!")

Set up Completed!


### Understanding Document Structure in LangChain

At the heart of LangChain's data handling is the `Document` object. Think of it as a standardized container for your data. No matter the source—be it a text file, a PDF page, or a database row—LangChain loaders will transform it into a `Document`.

A `Document` has two main components:

1.  `page_content` (string): This holds the actual text content of the data chunk.
2.  `metadata` (dictionary): This is a dictionary containing extra information about the content. Common metadata includes the source file, page number, author, etc. Metadata is incredibly powerful for filtering searches and providing context to the language model.

In [3]:
## create a simple document
doc=Document(
    # page_content: This is the core text that will be used for embedding and retrieval.
    page_content="This is the main text content that will be embedded and searched.",
    
    # metadata: A dictionary to hold supplementary information about the content.
    metadata={
        "source":"example.txt",        # The original file or source of the document.
        "page":1,                      # The page number, if applicable.
        "author":"Krish Naik",        # The author or creator of the content.
        "date_created":"2024-01-01",   # The creation date.
        "custom_field":"any_value"     # You can add any custom fields you need.
    }
)
print("Document Structure")

print(f"Content :{doc.page_content}")
print(f"Metadata :{doc.metadata}")

# Why metadata matters:
print("\n📝 Metadata is crucial for:")
print("- Filtering search results")
print("- Tracking document sources")
print("- Providing context in responses")
print("- Debugging and auditing")

Document Structure
Content :This is the main text content that will be embedded and searched.
Metadata :{'source': 'example.txt', 'page': 1, 'author': 'Krish Naik', 'date_created': '2024-01-01', 'custom_field': 'any_value'}

📝 Metadata is crucial for:
- Filtering search results
- Tracking document sources
- Providing context in responses
- Debugging and auditing


In [4]:
# Verify the type of our created object to confirm it's a LangChain Document.
type(doc)

langchain_core.documents.base.Document
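
To make the "filtering search results" point concrete, here is a minimal sketch of metadata filtering. The helper `filter_by_metadata` is hypothetical (it is not a LangChain function), and the example uses `SimpleNamespace` stand-ins so it runs without any dependencies; since the helper only reads `.metadata`, real `Document` objects should work the same way.

```python
from types import SimpleNamespace


def filter_by_metadata(docs, **criteria):
    """Keep only the documents whose metadata matches every key/value in `criteria`."""
    return [
        d for d in docs
        if all(d.metadata.get(key) == value for key, value in criteria.items())
    ]


# Stand-ins exposing the same two attributes as a LangChain Document.
docs = [
    SimpleNamespace(page_content="Intro to Python", metadata={"source": "a.txt", "page": 1}),
    SimpleNamespace(page_content="Loops in Python", metadata={"source": "a.txt", "page": 2}),
    SimpleNamespace(page_content="ML basics", metadata={"source": "b.txt", "page": 1}),
]

for d in filter_by_metadata(docs, source="a.txt"):
    print(d.metadata["page"], d.page_content)
```

Vector stores typically apply this kind of filter on their side, which is why consistent metadata keys across all loaded documents pay off later.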

### Text Files (.txt) - The Simplest Case

In [5]:
# First, we'll set up a directory to store our sample data.
import os

# Create a directory named 'data/text_files'. 
# The 'exist_ok=True' argument prevents an error if the directory already exists.
os.makedirs("data/text_files",exist_ok=True)

In [6]:
# Define the content for our two sample text files in a dictionary.
sample_texts={
    "data/text_files/python_intro.txt":"""Python Programming Introduction

Python is a high-level, interpreted programming language known for its simplicity and readability.
Created by Guido van Rossum and first released in 1991, Python has become one of the most popular
programming languages in the world.

Key Features:
- Easy to learn and use
- Extensive standard library
- Cross-platform compatibility
- Strong community support

Python is widely used in web development, data science, artificial intelligence, and automation.""",
    
    "data/text_files/machine_learning.txt": """Machine Learning Basics

Machine learning is a subset of artificial intelligence that enables systems to learn and improve
from experience without being explicitly programmed. It focuses on developing computer programs
that can access data and use it to learn for themselves.

Types of Machine Learning:
1. Supervised Learning: Learning with labeled data
2. Unsupervised Learning: Finding patterns in unlabeled data
3. Reinforcement Learning: Learning through rewards and penalties

Applications include image recognition, speech processing, and recommendation systems
    
    
    """

}

# Loop through the dictionary to create and write to each file.
for filepath,content in sample_texts.items():
    # Open the file in write mode ('w') with UTF-8 encoding to handle a wide range of characters.
    with open(filepath,'w',encoding="utf-8") as f:
        f.write(content)

print("✅ Sample text files created!")

✅ Sample text files created!


### TextLoader: Read a Single File

The `TextLoader` is one of the simplest document loaders. It's designed to read a single text file and load its entire content into one `Document` object.

- **Input**: The file path to a single `.txt` file.
- **Output**: A list containing a single `Document`. The `page_content` will be the full text of the file, and the `metadata` will automatically include the `source` (the file path).

In [7]:
# Note: It's good practice to import from langchain_community for loaders and other integrations.
from langchain_community.document_loaders import TextLoader

## Loading a single text file
# Instantiate the loader with the path to the file and specify the encoding.
loader=TextLoader("data/text_files/python_intro.txt", encoding="utf-8")

# The .load() method reads the file and returns a list of Document objects.
documents=loader.load()

print(f"📄 Loaded {len(documents)} document")
print(f"Content preview: {documents[0].page_content[:100]}...")
print(f"Metadata: {documents[0].metadata}")

📄 Loaded 1 document
Content preview: Python Programming Introduction

Python is a high-level, interpreted programming language known for ...
Metadata: {'source': 'data/text_files/python_intro.txt'}
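
Conceptually, loading a text file amounts to reading it and attaching the path as metadata. The sketch below mimics that shape with plain dictionaries; `load_text_file` is a hypothetical helper written for illustration, not how `TextLoader` is implemented.

```python
import tempfile
from pathlib import Path


def load_text_file(path, encoding="utf-8"):
    """Read one file and return a one-element list, mirroring the loader's output shape."""
    text = Path(path).read_text(encoding=encoding)
    return [{"page_content": text, "metadata": {"source": str(path)}}]


# Write a throwaway file so the example is self-contained.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "hello.txt"
    sample.write_text("Hello, RAG!\nSecond line.", encoding="utf-8")
    loaded = load_text_file(sample)

print(len(loaded))
print(loaded[0]["page_content"].splitlines()[0])
```

Passing the encoding explicitly, as the notebook does with `encoding="utf-8"`, avoids depending on the platform default.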


### DirectoryLoader: Read Multiple Files

When you need to load all files from a directory, `DirectoryLoader` is the tool to use. It's a convenient wrapper that can apply a specific loader (like `TextLoader`) to every file in a directory that matches a certain pattern.

- **Input**: A directory path, a `glob` pattern to select files (e.g., `"*.txt"`), and a `loader_cls` specifying which loader to use for each file.
- **Output**: A list of `Document` objects. With `TextLoader`, that is one `Document` per matching file.

In [8]:
from langchain_community.document_loaders import DirectoryLoader

## load all the text files from the directory
dir_loader=DirectoryLoader(
    # The path to the directory containing the files.
    "data/text_files",
    
    # A glob pattern to select which files to load. "**/*.txt" means all files ending with .txt in this directory and any subdirectories.
    glob="**/*.txt",
    
    # The specific loader class to use for each file found. Here, we use TextLoader.
    loader_cls= TextLoader,
    
    # Keyword arguments to pass to the loader_cls during instantiation. We pass the encoding for TextLoader.
    loader_kwargs={'encoding': 'utf-8'},
    
    # If True, displays a progress bar to show loading progress.
    show_progress=True
)

# The .load() method iterates through the directory, applies the loader to each matching file, and collects the results.
documents=dir_loader.load()

print(f"\n📁 Loaded {len(documents)} documents")
for i, doc in enumerate(documents):
    print(f"\nDocument {i+1}:")
    print(f"  Source: {doc.metadata['source']}")
    print(f"  Length: {len(doc.page_content)} characters")

# 📊 Analysis
print("\n📊 DirectoryLoader Characteristics:")
print("✅ Advantages:")
print("  - Loads multiple files at once")
print("  - Supports glob patterns")
print("  - Progress tracking")
print("  - Recursive directory scanning")

print("\n❌ Disadvantages:")
print("  - All files must be same type")
print("  - Limited error handling per file")
print("  - Can be memory intensive for large directories")

100%|██████████| 2/2 [00:00<00:00, 132.10it/s]


📁 Loaded 2 documents

Document 1:
  Source: data\text_files\machine_learning.txt
  Length: 575 characters

Document 2:
  Source: data\text_files\python_intro.txt
  Length: 489 characters

📊 DirectoryLoader Characteristics:
✅ Advantages:
  - Loads multiple files at once
  - Supports glob patterns
  - Progress tracking
  - Recursive directory scanning

❌ Disadvantages:
  - All files must be same type
  - Limited error handling per file
  - Can be memory intensive for large directories
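
Two of the disadvantages listed above, per-file error handling and memory use, can be addressed by walking the directory yourself. The sketch below is plain Python under stated assumptions: `iter_text_files` and the caller-supplied `failed` list are hypothetical names, and a generator is used so only one file is held in memory at a time. (`DirectoryLoader` also exposes a `silent_errors` flag for skipping files that fail to load; this sketch just shows the underlying idea.)

```python
import tempfile
from pathlib import Path


def iter_text_files(directory, failed, pattern="**/*.txt", encoding="utf-8"):
    """Yield one document dict per readable file; record unreadable files in `failed`."""
    for path in sorted(Path(directory).glob(pattern)):
        try:
            text = path.read_text(encoding=encoding)
        except (OSError, UnicodeDecodeError) as exc:
            failed.append((path.name, type(exc).__name__))
            continue  # skip this file but keep going
        yield {"page_content": text, "metadata": {"source": path.name}}


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "good.txt").write_text("readable text", encoding="utf-8")
    (root / "sub").mkdir()
    (root / "sub" / "nested.txt").write_text("nested text", encoding="utf-8")
    (root / "bad.txt").write_bytes(b"\xff\xfe\xfa")  # not valid UTF-8

    failed = []
    loaded = list(iter_text_files(root, failed))

print([d["metadata"]["source"] for d in loaded])
print(failed)
```

Collecting failures in a list instead of raising lets a long ingestion run finish and report every bad file at the end.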





### Text Splitting Strategies

**Why Do We Need to Split Text?**

Large language models (LLMs) have a limited **context window**, which is the maximum number of tokens they can process at once. If a document is larger than this window, it cannot be processed in its entirety. Furthermore, for effective retrieval in RAG, we want to find and use only the most relevant parts of a document, not the whole thing.

Splitting a large document into smaller chunks addresses both issues:
1.  **Fits within the Context Window**: Ensures each piece of text sent to the LLM is of a manageable size.
2.  **Improves Retrieval Accuracy**: By embedding smaller, more focused chunks, the retrieval system can find pieces of text that are semantically very close to the user's query. Retrieving a whole book chapter about Python when the user asks a specific question about `for` loops is far less effective than retrieving a small paragraph that directly addresses it.

The goal of text splitting is to create chunks that are **semantically meaningful** and of an appropriate size.
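
Chunk size and overlap (the `chunk_size` and `chunk_overlap` parameters of the splitters used below) also determine roughly how many chunks a document produces. The following back-of-the-envelope sketch assumes fixed-width sliding windows; `estimate_chunks` is a hypothetical helper, and real splitters cut at separators and strip whitespace, so the actual count can differ.

```python
import math


def estimate_chunks(text_length, chunk_size, chunk_overlap):
    """Estimate the number of sliding windows needed to cover `text_length` characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    if text_length <= chunk_size:
        return 1
    # Each new window advances by (chunk_size - chunk_overlap) characters.
    stride = chunk_size - chunk_overlap
    return math.ceil((text_length - chunk_overlap) / stride)


print(estimate_chunks(10_000, 500, 50))  # → 23
print(estimate_chunks(10_000, 500, 0))   # → 20
```

The comparison shows the cost of overlap: repeating 50 characters per chunk adds three chunks, and therefore three more embeddings, for the same text.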

In [9]:
### Different text splitting strategies
from langchain.text_splitter import (
    CharacterTextSplitter,
    RecursiveCharacterTextSplitter,
    TokenTextSplitter
)
# The 'documents' variable currently holds two Document objects, one for each file we loaded.
print(documents)

[Document(metadata={'source': 'data\\text_files\\machine_learning.txt'}, page_content='Machine Learning Basics\n\nMachine learning is a subset of artificial intelligence that enables systems to learn and improve\nfrom experience without being explicitly programmed. It focuses on developing computer programs\nthat can access data and use it to learn for themselves.\n\nTypes of Machine Learning:\n1. Supervised Learning: Learning with labeled data\n2. Unsupervised Learning: Finding patterns in unlabeled data\n3. Reinforcement Learning: Learning through rewards and penalties\n\nApplications include image recognition, speech processing, and recommendation systems\n\n\n    '), Document(metadata={'source': 'data\\text_files\\python_intro.txt'}, page_content='Python Programming Introduction\n\nPython is a high-level, interpreted programming language known for its simplicity and readability.\nCreated by Guido van Rossum and first released in 1991, Python has become one of the most popular\nprog

In [10]:
# We'll select the first document (about machine learning) to demonstrate splitting.
text=documents[0].page_content
text

'Machine Learning Basics\n\nMachine learning is a subset of artificial intelligence that enables systems to learn and improve\nfrom experience without being explicitly programmed. It focuses on developing computer programs\nthat can access data and use it to learn for themselves.\n\nTypes of Machine Learning:\n1. Supervised Learning: Learning with labeled data\n2. Unsupervised Learning: Finding patterns in unlabeled data\n3. Reinforcement Learning: Learning through rewards and penalties\n\nApplications include image recognition, speech processing, and recommendation systems\n\n\n    '

#### Method 1: `CharacterTextSplitter`

This is the simplest splitting method. It splits the text on a single `separator` string (the default is `"\n\n"`) and then merges the resulting pieces into chunks of at most `chunk_size` characters.

- **Pro**: Very straightforward and predictable.
- **Con**: It doesn't respect the semantic structure of the text. A chunk can end in the middle of a sentence, and because the text is only ever cut at the separator, a piece that contains no separator stays in one chunk even when it exceeds `chunk_size`.
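
The split-then-merge idea can be sketched in a few lines of plain Python. This is a simplified illustration, not LangChain's implementation: `split_on_separator` is a hypothetical function, and details such as whitespace stripping are left out.

```python
def split_on_separator(text, separator, chunk_size, chunk_overlap=0):
    """Split on `separator`, then greedily pack the pieces into chunks."""
    pieces = [p for p in text.split(separator) if p]
    chunks, current = [], []

    def joined_len(parts):
        return len(separator.join(parts))

    for piece in pieces:
        if current and joined_len(current + [piece]) > chunk_size:
            chunks.append(separator.join(current))
            # Carry trailing pieces forward as overlap: drop pieces from the
            # front until what remains fits in chunk_overlap and leaves room
            # for the incoming piece.
            while current and (
                joined_len(current) > chunk_overlap
                or joined_len(current + [piece]) > chunk_size
            ):
                current = current[1:]
        current.append(piece)
    if current:
        chunks.append(separator.join(current))
    return chunks


for chunk in split_on_separator("one two three four five six seven", " ", 13, 5):
    print(repr(chunk))
```

Because cuts happen only at the separator, a single piece longer than `chunk_size` is emitted as one oversized chunk rather than being cut mid-word.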

In [11]:
# Method 1: Character-based splitting, using a space character as the separator.
print("1️⃣ CHARACTER TEXT SPLITTER (by space)")
char_splitter = CharacterTextSplitter(
    separator=" ",          # The character to split the text on.
    chunk_size=200,         # The maximum size of each chunk (in characters).
    chunk_overlap=20,       # The number of characters to overlap between consecutive chunks. This helps maintain context.
    length_function=len     # The function used to measure the length of a chunk (default is len).
)

# The .split_text() method performs the splitting.
char_chunks=char_splitter.split_text(text)
print(f"Created {len(char_chunks)} chunks")
print(f"First chunk: {char_chunks[0][:100]}...")

1️⃣ CHARACTER TEXT SPLITTER (by space)
Created 3 chunks
First chunk: Machine Learning Basics

Machine learning is a subset of artificial intelligence that enables system...


In [12]:
# Notice the overlap between the end of the first chunk and the start of the second.
print(char_chunks[0])
print("------------------")
print(char_chunks[1])

Machine Learning Basics

Machine learning is a subset of artificial intelligence that enables systems to learn and improve
from experience without being explicitly programmed. It focuses on developing
------------------
on developing computer programs
that can access data and use it to learn for themselves.

Types of Machine Learning:
1. Supervised Learning: Learning with labeled data
2. Unsupervised Learning:


In [13]:
# Let's try again with a more logical separator for this text: a newline character.
print("1️⃣ CHARACTER TEXT SPLITTER (by newline)")
char_splitter = CharacterTextSplitter(
    separator="\n",  # Split on newlines, which often separate paragraphs or logical blocks.
    chunk_size=200,  
    chunk_overlap=20,
    length_function=len
)

char_chunks=char_splitter.split_text(text)
print(f"Created {len(char_chunks)} chunks")
print(f"First chunk: {char_chunks[0][:100]}...")

1️⃣ CHARACTER TEXT SPLITTER (by newline)
Created 4 chunks
First chunk: Machine Learning Basics
Machine learning is a subset of artificial intelligence that enables systems...


In [14]:
# The chunks now end at line breaks rather than at arbitrary spaces, so each one reads more coherently.
print(char_chunks[0])
print("-------------")
print(char_chunks[1])
print("-------------")
print(char_chunks[2])

Machine Learning Basics
Machine learning is a subset of artificial intelligence that enables systems to learn and improve
-------------
from experience without being explicitly programmed. It focuses on developing computer programs
that can access data and use it to learn for themselves.
Types of Machine Learning:
-------------
1. Supervised Learning: Learning with labeled data
2. Unsupervised Learning: Finding patterns in unlabeled data
3. Reinforcement Learning: Learning through rewards and penalties


#### Method 2: `RecursiveCharacterTextSplitter` (Recommended)

This is the most recommended and versatile text splitter. It works by trying to split the text using a list of separators in a specific order. It starts with the first separator (e.g., `"\n\n"` for paragraphs) and, if the resulting chunks are still too large, it moves to the next separator (e.g., `"\n"` for lines), and so on.

- **Why it's better**: It tries to keep semantically related pieces of text together as long as possible (e.g., paragraphs, then sentences). This hierarchical splitting approach generally results in more coherent and meaningful chunks.
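
The hierarchical approach described above can be sketched as a short recursive function. This is a simplified illustration under stated assumptions, not LangChain's implementation: `recursive_split` is a hypothetical name, chunk overlap is omitted, and when the separators run out the sketch cuts at fixed positions.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Split with the coarsest separator first; recurse on pieces that are still too long."""
    if len(text) <= chunk_size:
        return [text] if text.strip() else []
    if not separators or separators[0] == "":
        # Last resort: cut at fixed positions, even mid-word.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    separator, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(separator):
        if not piece:
            continue
        candidate = current + separator + piece if current else piece
        if len(candidate) <= chunk_size:
            current = candidate  # the piece still fits, keep packing
            continue
        if current:
            chunks.append(current)  # flush what has been packed so far
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # Too long even on its own: retry with the finer separators.
            chunks.extend(recursive_split(piece, chunk_size, finer))
    if current:
        chunks.append(current)
    return chunks


sample = (
    "Para one is short.\n\n"
    "Para two is a much longer paragraph that will not fit in one chunk.\n\n"
    "End."
)
for chunk in recursive_split(sample, chunk_size=40):
    print(repr(chunk))
```

Short paragraphs survive intact, and only the paragraph that exceeds the limit is broken down further, first by line and then by word.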

In [15]:
# Method 2: Recursive character splitting (RECOMMENDED)
print("\n2️⃣ RECURSIVE CHARACTER TEXT SPLITTER")
recursive_splitter = RecursiveCharacterTextSplitter(
    # A list of characters to try splitting on, in order of preference.
    # It will try to split by paragraph (\n\n), then by line (\n), then by space (' '), and finally by character ('').
    # The default is ["\n\n", "\n", " ", ""]. We are only using space here for a simple example.
    separators=[" "],
    chunk_size=200,
    chunk_overlap=20,
    length_function=len
)

recursive_chunks = recursive_splitter.split_text(text)
print(f"Created {len(recursive_chunks)} chunks")
print(f"First chunk: {recursive_chunks[0][:100]}...")


2️⃣ RECURSIVE CHARACTER TEXT SPLITTER
Created 4 chunks
First chunk: Machine Learning Basics

Machine learning is a subset of artificial intelligence that enables system...


In [16]:
print(recursive_chunks[0])
print("-----------------")
print(recursive_chunks[1])
print("------------------")
print(recursive_chunks[2])

Machine Learning Basics

Machine learning is a subset of artificial intelligence that enables systems to learn and improve
from experience without being explicitly programmed. It focuses on developing
-----------------
on developing computer programs
that can access data and use it to learn for themselves.

Types of Machine Learning:
1. Supervised Learning: Learning with labeled data
2. Unsupervised Learning:
------------------
Learning: Finding patterns in unlabeled data
3. Reinforcement Learning: Learning through rewards and penalties

Applications include image recognition, speech processing, and recommendation


In [17]:
# Create text without natural break points like newlines to see how the splitter handles it.
simple_text = "This is sentence one and it is quite long. This is sentence two and it is also quite long. This is sentence three which is even longer than the others. This is sentence four. This is sentence five. This is sentence six."

splitter = RecursiveCharacterTextSplitter(
    separators=[" "],  # For this example, we only allow splitting on spaces.
    chunk_size=80,      # A smaller chunk size to demonstrate the splitting.
    chunk_overlap=20,
    length_function=len
)

chunks = splitter.split_text(simple_text)

print(f"\nSimple text example - {len(chunks)} chunks:\n")

# Iterate through the chunks to show the content and the overlap between them.
for i in range(len(chunks) - 1):
    print(f"Chunk {i+1}: '{chunks[i]}'")
    print(f"Chunk {i+2}: '{chunks[i+1]}'")
    print()


Simple text example - 4 chunks:

Chunk 1: 'This is sentence one and it is quite long. This is sentence two and it is also'
Chunk 2: 'two and it is also quite long. This is sentence three which is even longer than'

Chunk 2: 'two and it is also quite long. This is sentence three which is even longer than'
Chunk 3: 'is even longer than the others. This is sentence four. This is sentence five.'

Chunk 3: 'is even longer than the others. This is sentence four. This is sentence five.'
Chunk 4: 'is sentence five. This is sentence six.'
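
The overlap in the output above can also be measured rather than eyeballed. The helper below is a hypothetical utility (not part of LangChain) that returns the longest suffix of one chunk that is also a prefix of the next; the chunk strings are copied from the printed output.

```python
def shared_overlap(left, right):
    """Return the longest suffix of `left` that is also a prefix of `right`."""
    for size in range(min(len(left), len(right)), 0, -1):
        if left.endswith(right[:size]):
            return right[:size]
    return ""


printed_chunks = [
    "This is sentence one and it is quite long. This is sentence two and it is also",
    "two and it is also quite long. This is sentence three which is even longer than",
    "is even longer than the others. This is sentence four. This is sentence five.",
    "is sentence five. This is sentence six.",
]

for left, right in zip(printed_chunks, printed_chunks[1:]):
    overlap = shared_overlap(left, right)
    print(f"{len(overlap):2d} chars: {overlap!r}")
```

Every overlap stays at or below the configured `chunk_overlap=20` and begins at a word boundary, because this splitter only cuts at spaces.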



#### Method 3: `TokenTextSplitter`

Language models don't see characters; they see **tokens**. A token is a sub-word unit drawn from the model's vocabulary: a common word like "Apple" might be a single token, while a rarer or longer word may be broken into several pieces.

`TokenTextSplitter` splits the text based on the number of tokens. This is the most accurate way to split text if your primary concern is staying within the token limit of a model, provided the splitter uses the same tokenizer as that model.

- **Pro**: Chunk sizes are measured in tokens, the same unit as the model's context limit.
- **Con**: Can be slightly slower than character-based methods because it needs to tokenize the text first, and it cuts at token boundaries without regard for sentence or paragraph structure.
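
The windowing logic behind token-based splitting can be sketched without a real tokenizer. In this minimal example, whitespace-separated words stand in for tokens, which is an assumption for illustration only (real tokenizers produce sub-word units), and `split_by_tokens` is a hypothetical function.

```python
def split_by_tokens(tokens, chunk_size, chunk_overlap):
    """Slide a window of `chunk_size` tokens, stepping by chunk_size - chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    stride = chunk_size - chunk_overlap
    windows, start = [], 0
    while start < len(tokens):
        windows.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reaches the end of the text
        start += stride
    return windows


# Toy tokenizer (assumption): one token per whitespace-separated word.
tokens = "the quick brown fox jumps over the lazy dog today".split()
for window in split_by_tokens(tokens, chunk_size=4, chunk_overlap=1):
    print(" ".join(window))
```

Each window starts with the last token of the previous one, which is the token-level equivalent of the character overlap seen earlier.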

In [18]:
# Method 3: Token-based splitting
print("\n3️⃣ TOKEN TEXT SPLITTER")
token_splitter = TokenTextSplitter(
    # Note: The chunk_size and chunk_overlap now refer to the number of tokens, not characters.
    chunk_size=50,
    chunk_overlap=10
    # This splitter tokenizes with the tiktoken library (OpenAI's tokenizer). Its default encoding is "gpt2";
    # pass encoding_name or model_name to match the tokenizer of the model you actually use.
)

token_chunks = token_splitter.split_text(text)
print(f"Created {len(token_chunks)} chunks")
print(f"First chunk: {token_chunks[0][:100]}...")


3️⃣ TOKEN TEXT SPLITTER
Created 3 chunks
First chunk: Machine Learning Basics

Machine learning is a subset of artificial intelligence that enables system...


### 📊 Text Splitting Methods Comparison

---

#### CharacterTextSplitter
* ✅ **Simple and predictable**
* ✅ **Good for structured text**
* ❌ **May break mid-sentence**
* **Use when**: Text has clear, consistent delimiters (like CSV data).

---

#### RecursiveCharacterTextSplitter
* ✅ **Respects text structure** by trying multiple separators.
* ✅ **Best general-purpose splitter.**
* ❌ **Slightly more complex** to configure.
* **Use when**: This is your default choice for most unstructured text like articles or books.

---

#### TokenTextSplitter
* ✅ **Respects model token limits**, as long as the splitter's tokenizer matches the model.
* ✅ **Most precise size control**, because chunk length is measured in tokens, the unit the model actually counts.
* ❌ **Slower** than character-based methods.
* **Use when**: It's critical to control the exact number of tokens per chunk for a specific model.

---

### 🔑 Key Takeaways

* **The `Document` Object is Fundamental**: LangChain loaders standardize every source into `Document` objects. Each `Document` contains the main text (`page_content`) and extra information (`metadata`) like the source, which is crucial for filtering and context.
* **Loaders for Every Source**: LangChain provides `DocumentLoaders` to handle various data sources. We saw how `TextLoader` handles a single file and `DirectoryLoader` efficiently loads all matching files in a folder.
* **Splitting is Non-Negotiable for RAG**: Large documents must be split into smaller chunks. This is critical to fit the text into a model's limited context window and to allow the retrieval system to find highly relevant, specific pieces of information.
* **Choose the Right Splitting Strategy**: While there are several methods, `RecursiveCharacterTextSplitter` is the recommended general-purpose choice because it tries to preserve the structure of the text (paragraphs, then lines, then words). When chunk sizes must map exactly onto a model's token limit, `TokenTextSplitter` is the most accurate, provided it is configured with that model's tokenizer.

---