### 🎯 Module Overview
This module covers parsing and ingesting data for RAG systems, from plain text files to complex PDFs and databases. We'll use LangChain v0.3 and work through each technique with practical examples.

Table of Contents

- Introduction to Data Ingestion
- Text Files (.txt)
- PDF Documents
- Microsoft Word Documents
- CSV and Excel Files
- JSON and Structured Data
- Web Scraping
- Databases (SQL)
- Audio and Video Transcripts
- Advanced Techniques
- Best Practices

#### Introduction to Data Ingestion

In [1]:
import os
from typing import List, Dict, Any
import pandas as pd

In [5]:
from langchain_core.documents import Document
from langchain_text_splitters import (
    RecursiveCharacterTextSplitter,
    CharacterTextSplitter,
    TokenTextSplitter,
)
print("Set up is completed!!!")



Set up is completed!!!


#### Understanding Document Structure in LangChain

In [7]:
## Create a simple document
doc=Document(
    page_content="This is the main text content that will be embedded and searched.",
    metadata={
        "source":"example.txt",
        "page":1,
        "author":"Souvick Mazumdar",
        "date_created":"2024-01-01",
        "custom_field":"any_value"
    }
)
print("Document Structure")

print(f"Content :{doc.page_content}")
print(f"Metadata :{doc.metadata}")

# Why metadata matters:
print("\n📝 Metadata is crucial for:")
print("- Filtering search results")
print("- Tracking document sources")
print("- Providing context in responses")
print("- Debugging and auditing")

Document Structure
Content :This is the main text content that will be embedded and searched.
Metadata :{'source': 'example.txt', 'page': 1, 'author': 'Souvick Mazumdar', 'date_created': '2024-01-01', 'custom_field': 'any_value'}

📝 Metadata is crucial for:
- Filtering search results
- Tracking document sources
- Providing context in responses
- Debugging and auditing


In [8]:
type(doc)

langchain_core.documents.base.Document
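
To make the "filtering search results" point concrete, here is a minimal sketch of metadata filtering. It assumes a small stand-in class, `SimpleDoc`, that mimics the `page_content` and `metadata` attributes shown above so the snippet runs without LangChain installed. `SimpleDoc` and `filter_by_metadata` are hypothetical names for illustration, not part of the LangChain API.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class SimpleDoc:
    """Stand-in for a LangChain Document: text plus a metadata dict."""
    page_content: str
    metadata: Dict[str, Any] = field(default_factory=dict)


def filter_by_metadata(docs: List[SimpleDoc], **criteria: Any) -> List[SimpleDoc]:
    """Keep only documents whose metadata matches every key/value in criteria."""
    return [
        d for d in docs
        if all(d.metadata.get(key) == value for key, value in criteria.items())
    ]


docs = [
    SimpleDoc("Intro to Python.", {"source": "python_intro.txt", "page": 1}),
    SimpleDoc("More Python.", {"source": "python_intro.txt", "page": 2}),
    SimpleDoc("Intro to ML.", {"source": "ml_intro.txt", "page": 1}),
]

# Restrict the candidate set to one source file before any similarity search.
python_docs = filter_by_metadata(docs, source="python_intro.txt")
print([d.metadata["page"] for d in python_docs])
```

The same idea applies to real `Document` objects, since they expose `metadata` as a plain dictionary. Vector stores usually offer their own filter syntax, which would replace this list comprehension in practice.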

### Text Files (.txt) - The Simplest Case


In [2]:
## Create a simple txt file
import os
os.makedirs("data/text_files",exist_ok=True)

In [8]:
sample_texts={
    "data/text_files/python_intro.txt":"""

Python is a high-level, interpreted programming language known for its simplicity and readability. Created by Guido van Rossum in 1991, Python emphasizes code clarity through meaningful indentation and a clean syntax that makes it accessible to beginners while remaining powerful for experts. It's dynamically typed, supporting multiple programming paradigms including procedural, object-oriented, and functional programming styles.

Python's extensive standard library provides built-in modules for file I/O, system operations, mathematics, and more, enabling rapid development. The language excels in data science and machine learning through libraries like NumPy, Pandas, and TensorFlow. It's also widely used for web development (Django, Flask), automation, scientific computing, artificial intelligence, and backend services.

Key strengths include its versatility, large active community, comprehensive documentation, and rapid development cycle. Python's "batteries included" philosophy means most common tasks can be accomplished without external dependencies. The language runs on multiple platforms—Windows, macOS, Linux—making it highly portable. With its emphasis on readability and developer productivity, Python has become one of the most popular programming languages worldwide, used by organizations from startups to Fortune 500 companies.


""",
    "data/text_files/python_machine_learning.txt":"""


Machine Learning is a subset of artificial intelligence that enables systems to learn and improve from experience without being explicitly programmed. It focuses on developing algorithms that analyze data, identify patterns, and make predictions or decisions based on that analysis.

Machine Learning has three main types: Supervised Learning (labeled data), Unsupervised Learning (unlabeled data), and Reinforcement Learning (learning through rewards). Applications include image recognition, natural language processing, recommendation systems, fraud detection, autonomous vehicles, and predictive analytics.

Key concepts include training data, features, models, and validation. Popular libraries like scikit-learn, TensorFlow, and PyTorch make implementing ML solutions accessible. Machine Learning powers modern AI systems, enabling computers to perform tasks that typically require human intelligence, from medical diagnosis to financial forecasting.

"""
}
for filepath,content in sample_texts.items():
    with open(filepath,'w',encoding="utf-8") as f:
        f.write(content)

print("Sample text files created")

Sample text files created


### TextLoader - Read a Single File

In [7]:
# from langchain.document_loaders import TextLoader
from langchain_community.document_loaders import TextLoader
# Loading a single text file
loader=TextLoader("data/text_files/python_intro.txt",encoding="utf-8")

documents=loader.load()
# print(type(documents))
# print(documents)
print(f"Loaded {len(documents)} document")
print(f"Content preview: {documents[0].page_content[:100]}...")
print(f"Metadata: {documents[0].metadata}")


Loaded 1 document
Content preview: 

Python is a high-level, interpreted programming language known for its simplicity and readability....
Metadata: {'source': 'data/text_files/python_intro.txt'}


### DirectoryLoader - Multiple Text Files


In [11]:
from langchain_community.document_loaders import DirectoryLoader
##load all the text files from the directory
dir_loader=DirectoryLoader(
    "data/text_files",
    glob="**/*.txt",##Pattern to match files
    loader_cls=TextLoader, ## loader class to use
    loader_kwargs={'encoding':'utf-8'},
    show_progress=True
)
documents=dir_loader.load()
print(f"Loaded {len(documents)} document")
for i,doc in enumerate(documents):
    print(f"\nDocument {i+1}:")
    # Use the loop variable so each document is previewed, not just the first one
    print(f"Content preview: {doc.page_content[:100]}...")
    print(f"Metadata: {doc.metadata}")

# 📊 Analysis
print("\n📊 DirectoryLoader Characteristics:")
print("✅ Advantages:")
print("  - Loads multiple files at once")
print("  - Supports glob patterns")
print("  - Progress tracking")
print("  - Recursive directory scanning")

print("\n❌ Disadvantages:")
print("  - All files must be same type")
print("  - Limited error handling per file")
print("  - Can be memory intensive for large directories")

0it [00:00, ?it/s]

Loaded 0 document
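
The "limited error handling per file" point can be illustrated with a plain-Python sketch that walks a directory and records failures instead of stopping at the first unreadable file. This is not how `DirectoryLoader` is implemented. `load_text_dir` is a hypothetical helper, and the dictionaries it returns only imitate the `page_content` / `metadata` layout of a `Document`.

```python
import tempfile
from pathlib import Path
from typing import Dict, List, Tuple


def load_text_dir(
    root: str, pattern: str = "**/*.txt", encoding: str = "utf-8"
) -> Tuple[List[Dict], List[Tuple[str, str]]]:
    """Read every file matching pattern; return (documents, failures)."""
    documents: List[Dict] = []
    failures: List[Tuple[str, str]] = []
    for path in sorted(Path(root).glob(pattern)):
        if not path.is_file():
            continue
        try:
            text = path.read_text(encoding=encoding)
        except (UnicodeDecodeError, OSError) as exc:
            # Record the problem and keep going with the remaining files.
            failures.append((str(path), type(exc).__name__))
            continue
        documents.append({"page_content": text, "metadata": {"source": str(path)}})
    return documents, failures


# Demo in a throwaway directory: one valid file, one with invalid UTF-8 bytes.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "good.txt").write_text("hello", encoding="utf-8")
    Path(tmp, "bad.txt").write_bytes(b"\xff\xfe\xfa")
    loaded, skipped = load_text_dir(tmp)
    print(f"Loaded {len(loaded)} file(s), skipped {len(skipped)}")
```

Returning the failures alongside the documents makes it easy to log or retry the skipped files with a different encoding.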





In [None]:
from langchain_text_splitters import(
    CharacterTextSplitter,
    RecursiveCharacterTextSplitter,
    TokenTextSplitter
)
print(documents)

In [20]:
### Use the first loaded document as the sample text for the splitters
text=documents[0].page_content
text

'\n\nPython is a high-level, interpreted programming language known for its simplicity and readability. Created by Guido van Rossum in 1991, Python emphasizes code clarity through meaningful indentation and a clean syntax that makes it accessible to beginners while remaining powerful for experts. It\'s dynamically typed, supporting multiple programming paradigms including procedural, object-oriented, and functional programming styles.\n\nPython\'s extensive standard library provides built-in modules for file I/O, system operations, mathematics, and more, enabling rapid development. The language excels in data science and machine learning through libraries like NumPy, Pandas, and TensorFlow. It\'s also widely used for web development (Django, Flask), automation, scientific computing, artificial intelligence, and backend services.\n\nKey strengths include its versatility, large active community, comprehensive documentation, and rapid development cycle. Python\'s "batteries included" phil

In [27]:
### Method 1- Character-based splitting
print("CHARACTER TEXT SPLITTER")
char_splitter=CharacterTextSplitter(
    separator="\n", #Split on newline
    chunk_size=200, #Max chunk size in characters
    chunk_overlap=20, #Overlap between chunks
    length_function=len #How to measure chunk size

)

char_chunks=char_splitter.split_text(text)
print(f"Created {len(char_chunks)} chunks")
print(f"First Chunk {char_chunks[0][:101]}...")

Created a chunk of size 432, which is longer than the specified 200
Created a chunk of size 396, which is longer than the specified 200


CHARACTER TEXT SPLITTER
Created 3 chunks
First Chunk Python is a high-level, interpreted programming language known for its simplicity and readability. Cr...
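
The "longer than the specified 200" warnings appear because splitting on a single separator can only cut where that separator occurs, and each paragraph here is one long line with no `\n` inside it. The sketch below is a simplified approximation of that behaviour, not LangChain's implementation: it ignores overlap, and `split_on_separator` is a hypothetical name.

```python
from typing import List


def split_on_separator(text: str, separator: str, chunk_size: int) -> List[str]:
    """Split on one separator, then greedily pack pieces into chunks of at most chunk_size."""
    pieces = [p for p in text.split(separator) if p.strip()]
    chunks: List[str] = []
    current = ""
    for piece in pieces:
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size:
            current = candidate
        else:
            if current:
                chunks.append(current)
            # A single piece may itself exceed chunk_size; there is nowhere to cut it.
            current = piece
    if current:
        chunks.append(current)
    return chunks


long_line = ("word " * 60).strip()  # one long line with no newline inside it
sample = "short line\n" + long_line + "\nanother short line"
for chunk in split_on_separator(sample, separator="\n", chunk_size=200):
    print(len(chunk))
```

The middle chunk comes out larger than `chunk_size` because the only allowed cut points are the newlines on either side of it.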


In [28]:
print(char_chunks[0])
print("------------------------------")
print(char_chunks[1])
print("------------------------------")
print(char_chunks[2])


Python is a high-level, interpreted programming language known for its simplicity and readability. Created by Guido van Rossum in 1991, Python emphasizes code clarity through meaningful indentation and a clean syntax that makes it accessible to beginners while remaining powerful for experts. It's dynamically typed, supporting multiple programming paradigms including procedural, object-oriented, and functional programming styles.
------------------------------
Python's extensive standard library provides built-in modules for file I/O, system operations, mathematics, and more, enabling rapid development. The language excels in data science and machine learning through libraries like NumPy, Pandas, and TensorFlow. It's also widely used for web development (Django, Flask), automation, scientific computing, artificial intelligence, and backend services.
------------------------------
Key strengths include its versatility, large active community, comprehensive documentation, and rapid develo

#### NOTE:
 Overlap matters for the similarity search that comes later. A sentence or idea can straddle a chunk boundary, so part of the context needed to understand chunk 2 may sit at the end of chunk 1. Repeating a small amount of text across adjacent chunks keeps that context available to whichever chunk is retrieved.
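
As a rough illustration of overlap, the sketch below cuts a string into fixed-width character windows where each window starts `chunk_size - chunk_overlap` characters after the previous one. It is a deliberately simple model: the LangChain splitters also respect separators, which this version does not. `sliding_chunks` is a hypothetical helper.

```python
from typing import List


def sliding_chunks(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Cut text into fixed-width windows that share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks: List[str] = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks


sample = "The model was trained on public data. It was evaluated on a held-out set."
chunks = sliding_chunks(sample, chunk_size=30, chunk_overlap=10)
for first, second in zip(chunks, chunks[1:]):
    # The tail of each chunk reappears at the head of the next one.
    print(repr(first[-10:]), "->", repr(second[:10]))
```

Each printed pair shows the same ten characters twice, which is exactly the shared context that lets a retrieved chunk carry a little of its neighbour with it.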

In [None]:
### Method 2- Recursive Character-based splitting (RECOMMENDED)
print("RECURSIVE CHARACTER TEXT SPLITTER")
char_splitter=RecursiveCharacterTextSplitter(
    separators=["\n\n", "\n", " ", ""],
    chunk_size=200, #Max chunk size in characters
    chunk_overlap=20, #Overlap between chunks
    length_function=len #How to measure chunk size
)

char_chunks=char_splitter.split_text(text)
print(f"Created {len(char_chunks)} chunks")
print(f"First Chunk: {char_chunks[0][:101]}...")
for i, chunk in enumerate(char_chunks):
    print(f"Chunk {i+1} size: {len(chunk)}")

RECURSIVE CHARACTER TEXT SPLITTER
Created 9 chunks
First Chunk: Python is a high-level, interpreted programming language known for its simplicity and readability. Cr...
Chunk 1 size: 197
Chunk 2 size: 194
Chunk 3 size: 63
Chunk 4 size: 193
Chunk 5 size: 186
Chunk 6 size: 46
Chunk 7 size: 192
Chunk 8 size: 199
Chunk 9 size: 164


In [30]:
print(char_chunks[0])
print("------------------------------")
print(char_chunks[1])
print("------------------------------")
print(char_chunks[2])


Python is a high-level, interpreted programming language known for its simplicity and readability. Created by Guido van Rossum in 1991, Python emphasizes code clarity through meaningful indentation
------------------------------
indentation and a clean syntax that makes it accessible to beginners while remaining powerful for experts. It's dynamically typed, supporting multiple programming paradigms including procedural,
------------------------------
procedural, object-oriented, and functional programming styles.
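
The chunk sizes above all stay under 200 because the recursive splitter falls back to finer separators whenever a piece is still too large. A minimal sketch of that idea follows. It is an approximation written for illustration, not the library's code: it omits overlap, and `recursive_split` is a hypothetical name.

```python
from typing import List, Tuple


def recursive_split(
    text: str,
    chunk_size: int,
    separators: Tuple[str, ...] = ("\n\n", "\n", " ", ""),
) -> List[str]:
    """Split text into chunks of at most chunk_size, preferring coarse separators."""
    if len(text) <= chunk_size:
        return [text] if text.strip() else []
    if not separators or separators[0] == "":
        # Last resort: hard cut every chunk_size characters.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    separator, finer = separators[0], separators[1:]
    chunks: List[str] = []
    current = ""
    for piece in text.split(separator):
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size:
            current = candidate
            continue
        if current.strip():
            chunks.append(current)
        if len(piece) <= chunk_size:
            current = piece
        else:
            # The piece alone is too big: retry it with the finer separators.
            chunks.extend(recursive_split(piece, chunk_size, finer))
            current = ""
    if current.strip():
        chunks.append(current)
    return chunks


first_paragraph = "First paragraph with a handful of words in it."
second_paragraph = (
    "Second paragraph, which is deliberately long enough that it cannot "
    "fit into a single chunk and must be cut at spaces."
)
sample = first_paragraph + "\n\n" + second_paragraph
for chunk in recursive_split(sample, chunk_size=60):
    print(len(chunk), repr(chunk))
```

The short first paragraph survives intact as its own chunk, while the long second paragraph is broken at word boundaries. That is the behaviour that makes this splitter a reasonable default.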


In [None]:
### Method 3- Token-based splitting
print("\n TOKEN TEXT SPLITTER")
token_splitter=TokenTextSplitter(
    chunk_size=50, #Size in tokens (not characters)
    chunk_overlap=10
)
token_chunks=token_splitter.split_text(text)
print(f"Created {len(token_chunks)} chunks")
print(f"First chunk: {token_chunks[0][:100]} ....")


 TOKEN TEXT SPLITTER
Created 6 chunks
First chunk: 

Python is a high-level, interpreted programming language known for its simplicity and readability. ....
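
Token-based splitting applies the same windowing idea to a list of tokens rather than a string of characters. The sketch below uses a regular expression as a stand-in tokenizer (words and punctuation marks) so it runs with the standard library only. Real tokenizers produce subword units, so the counts will differ from what `TokenTextSplitter` reports. `split_by_tokens` is a hypothetical helper.

```python
import re
from typing import List


def split_by_tokens(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Window over a token list so chunk_size is measured in tokens, not characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    # Stand-in tokenizer (assumption): each word or punctuation mark is one token.
    tokens = re.findall(r"\w+|[^\w\s]", text)
    step = chunk_size - chunk_overlap
    chunks: List[str] = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break
    return chunks


sentence = (
    "Python is a high-level, interpreted programming language "
    "known for its simplicity and readability."
)
token_windows = split_by_tokens(sentence, chunk_size=8, chunk_overlap=2)
for window in token_windows:
    print(len(window.split()), "tokens:", window)
```

Because the limit is counted in tokens, the character length of each chunk varies. That is the trade-off compared with the character-based splitters.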


In [32]:
# 📊 Comparison
print("\n📊 Text Splitting Methods Comparison:")
print("\nCharacterTextSplitter:")
print("  ✅ Simple and predictable")
print("  ✅ Good for structured text")
print("  ❌ May break mid-sentence")
print("  Use when: Text has clear delimiters")

print("\nRecursiveCharacterTextSplitter:")
print("  ✅ Respects text structure")
print("  ✅ Tries multiple separators")
print("  ✅ Best general-purpose splitter")
print("  ❌ Slightly more complex")
print("  Use when: Default choice for most texts")

print("\nTokenTextSplitter:")
print("  ✅ Respects model token limits")
print("  ✅ More accurate for embeddings")
print("  ❌ Slower than character-based")
print("  Use when: Working with token-limited models")


📊 Text Splitting Methods Comparison:

CharacterTextSplitter:
  ✅ Simple and predictable
  ✅ Good for structured text
  ❌ May break mid-sentence
  Use when: Text has clear delimiters

RecursiveCharacterTextSplitter:
  ✅ Respects text structure
  ✅ Tries multiple separators
  ✅ Best general-purpose splitter
  ❌ Slightly more complex
  Use when: Default choice for most texts

TokenTextSplitter:
  ✅ Respects model token limits
  ✅ More accurate for embeddings
  ❌ Slower than character-based
  Use when: Working with token-limited models
