## Data Ingestion

#### Understanding Document Structure in LangChain

In [2]:
from langchain_core.documents import Document

In [5]:
# Create a single document
doc = Document(
    page_content = "This is the main text content that will be embedded and searched.",
    metadata = {
        "author": "Lalu Mahato",
        "source": "example.txt",
        "page": 1,
        "created_at": "2025-09-25",
    }
)

print("****Document Structure****")
print("Content: ", doc.page_content)
print("Metadata: ", doc.metadata)

# Why metadata matters
print("\n📝 Metadata is crucial for:")
print("- Filtering search results")
print("- Tracking document sources")
print("- Providing context in responses")
print("- Debugging and auditing")

****Document Structure****
Content:  This is the main text content that will be embedded and searched.
Metadata:  {'author': 'Lalu Mahato', 'source': 'example.txt', 'page': 1, 'created_at': '2025-09-25'}

📝 Metadata is crucial for:
- Filtering search results
- Tracking document sources
- Providing context in responses
- Debugging and auditing
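
The first point, filtering search results, can be illustrated without a vector store. The sketch below is only an illustration: `SimpleDoc` and `filter_by_metadata` are hypothetical names introduced here, not part of LangChain, and the sample documents are made up. It assumes a document is just text plus a metadata dictionary, as in the cell above.

```python
from dataclasses import dataclass, field


@dataclass
class SimpleDoc:
    """Hypothetical stand-in for a document: text plus a metadata dict."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def filter_by_metadata(docs, **criteria):
    """Return the docs whose metadata matches every key/value in criteria."""
    return [
        d for d in docs
        if all(d.metadata.get(k) == v for k, v in criteria.items())
    ]


# Made-up sample documents for the demo.
docs = [
    SimpleDoc("Intro to Python.", {"source": "example.txt", "page": 1}),
    SimpleDoc("More Python.", {"source": "example.txt", "page": 2}),
    SimpleDoc("ML basics.", {"source": "other.txt", "page": 1}),
]

page_one = filter_by_metadata(docs, page=1)
from_example = filter_by_metadata(docs, source="example.txt", page=2)
print([d.page_content for d in page_one])
print([d.page_content for d in from_example])
```

Because every criterion must match, adding more keys narrows the result. The same idea underlies metadata filters in retrieval, where only chunks with matching metadata are considered.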


#### Text File

In [9]:
# Create a text file
import os

os.makedirs("data/text_files", exist_ok=True)
sample_text = {
    "data/text_files/python_intro.txt": """Python Programming Introduction:
    Python is a high-level, interpreted programming language known for its simplicity and readability.
    Created by Guido van Rossum and first released in 1991, Python has become one of the most popular
    programming languages in the world.

    Key Features:
    - Easy to learn and use
    - Extensive standard library
    - Cross-platform compatibility
    - Strong community support

    Python is widely used in web development, data science, artificial intelligence, and automation.
    """,
            
    "data/text_files/machine_learning.txt": """Machine Learning Basics:
    Machine learning is a subset of artificial intelligence that enables systems to learn and improve
    from experience without being explicitly programmed. It focuses on developing computer programs
    that can access data and use it to learn for themselves.

    Types of Machine Learning:
    1. Supervised Learning: Learning with labeled data
    2. Unsupervised Learning: Finding patterns in unlabeled data
    3. Reinforcement Learning: Learning through rewards and penalties

    Applications include image recognition, speech processing, and recommendation systems
    """
}

for filepath, content in sample_text.items():
    with open(filepath, "w", encoding="utf-8") as f:
        f.write(content)


##### 1. Read/Load a single file

In [None]:
from langchain_community.document_loaders import TextLoader

loader = TextLoader("data/text_files/python_intro.txt", encoding="utf-8")
documents = loader.load()
print(type(documents))
print(documents)

<class 'list'>
[Document(metadata={'source': 'data/text_files/python_intro.txt'}, page_content='Python Programming Introduction:\n    Python is a high-level, interpreted programming language known for its simplicity and readability.\n    Created by Guido van Rossum and first released in 1991, Python has become one of the most popular\n    programming languages in the world.\n\n    Key Features:\n    - Easy to learn and use\n    - Extensive standard library\n    - Cross-platform compatibility\n    - Strong community support\n\n    Python is widely used in web development, data science, artificial intelligence, and automation.\n    ')]


In [14]:
print(f"📄 Loaded {len(documents)} documents")
print(f"Content Preview: {documents[0].page_content[:100]}...")
print(f"Metadata: {documents[0].metadata}")

📄 Loaded 1 documents
Content Preview: Python Programming Introduction:
    Python is a high-level, interpreted programming language known ...
Metadata: {'source': 'data/text_files/python_intro.txt'}


##### 2. Load multiple files

In [18]:
from langchain_community.document_loaders import DirectoryLoader

dir_loader = DirectoryLoader(
    "data/text_files",
    glob="**/*.txt",  # match .txt files in this directory and all subdirectories
    loader_cls=TextLoader,
    loader_kwargs={"encoding": "utf-8"},
    show_progress=True
)

documents = dir_loader.load()

print(f"📁 Loaded {len(documents)} documents")
for i, doc in enumerate(documents):
    print(f"\nDocument {i+1}:")
    print(f"Source: {doc.metadata['source']}")
    print(f"Length: {len(doc.page_content)} characters")
    
# 📊 Analysis
print("\n📊 DirectoryLoader Characteristics:")
print("✅ Advantages:")
print("  - Loads multiple files at once")
print("  - Supports glob patterns")
print("  - Progress tracking")
print("  - Recursive directory scanning")

print("\n❌ Disadvantages:")
print("  - All files must be same type")
print("  - Limited error handling per file")
print("  - Can be memory intensive for large directories")

100%|██████████| 2/2 [00:00<00:00, 666.82it/s]

📁 Loaded 2 documents

Document 1:
Source: data\text_files\machine_learning.txt
Length: 605 characters

Document 2:
Source: data\text_files\python_intro.txt
Length: 530 characters

📊 DirectoryLoader Characteristics:
✅ Advantages:
  - Loads multiple files at once
  - Supports glob patterns
  - Progress tracking
  - Recursive directory scanning

❌ Disadvantages:
  - All files must be same type
  - Limited error handling per file
  - Can be memory intensive for large directories
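
One way to work around the limited per-file error handling is to read the files in a plain loop and record failures instead of aborting. The sketch below uses only the standard library; `load_text_files` is a hypothetical helper, not a LangChain API, and the dictionaries it returns only mimic the `page_content` / `metadata` shape shown earlier.

```python
import tempfile
from pathlib import Path


def load_text_files(root, pattern="**/*.txt", encoding="utf-8"):
    """Read every file matching pattern; collect failures instead of aborting."""
    loaded, failed = [], []
    for path in sorted(Path(root).glob(pattern)):
        try:
            text = path.read_text(encoding=encoding)
        except (OSError, UnicodeDecodeError) as exc:
            # Remember which file failed and why, then keep going.
            failed.append((str(path), repr(exc)))
            continue
        loaded.append({"page_content": text, "metadata": {"source": str(path)}})
    return loaded, failed


# Demo in a temporary directory: one valid file, one with invalid UTF-8 bytes.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "good.txt").write_text("hello", encoding="utf-8")
    Path(tmp, "bad.txt").write_bytes(b"\xff\xfe\xfa")
    loaded, failed = load_text_files(tmp)

print(f"Loaded {len(loaded)} file(s), {len(failed)} failure(s)")
```

Returning the failures alongside the loaded files makes it possible to log or retry them later without losing the files that did load.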





### Text Splitting Strategies

#### 1. CHARACTER TEXT SPLITTER

In [34]:
from langchain_text_splitters import CharacterTextSplitter

splitter = CharacterTextSplitter(
    separator="\n",
    chunk_size=200,
    chunk_overlap=20,
    length_function=len
)

# documents[0] is machine_learning.txt, the first file listed by DirectoryLoader above
chunks = splitter.split_text(documents[0].page_content)
print(f"Total Chunks: {len(chunks)}")
print(chunks[0], "\n", "--"*50)
print(chunks[1], "\n", "--"*50)
print(chunks[2], "\n", "--"*50)
print(chunks[3], "\n", "--"*50)

Total Chunks: 4
Machine Learning Basics:
    Machine learning is a subset of artificial intelligence that enables systems to learn and improve 
 ----------------------------------------------------------------------------------------------------
from experience without being explicitly programmed. It focuses on developing computer programs
    that can access data and use it to learn for themselves.
    Types of Machine Learning: 
 ----------------------------------------------------------------------------------------------------
1. Supervised Learning: Learning with labeled data
    2. Unsupervised Learning: Finding patterns in unlabeled data
    3. Reinforcement Learning: Learning through rewards and penalties 
 ----------------------------------------------------------------------------------------------------
Applications include image recognition, speech processing, and recommendation systems 
 --------------------------------------------------------------------------------------

In [36]:
splitter = CharacterTextSplitter(
    separator=" ",
    chunk_size=200,
    chunk_overlap=20,
    length_function=len
)

chunks = splitter.split_text(documents[0].page_content)
print(len(chunks))
print(chunks[0], "\n", "--"*50)
print(chunks[1], "\n", "--"*50)
print(chunks[2], "\n", "--"*50)
print(chunks[3], "\n", "--"*50)

4
Machine Learning Basics:
 Machine learning is a subset of artificial intelligence that enables systems to learn and improve
 from experience without being explicitly programmed. It focuses on 
 ----------------------------------------------------------------------------------------------------
It focuses on developing computer programs
 that can access data and use it to learn for themselves.

 Types of Machine Learning:
 1. Supervised Learning: Learning with labeled data
 2. Unsupervised 
 ----------------------------------------------------------------------------------------------------
2. Unsupervised Learning: Finding patterns in unlabeled data
 3. Reinforcement Learning: Learning through rewards and penalties

 Applications include image recognition, speech processing, and 
 ----------------------------------------------------------------------------------------------------
processing, and recommendation systems 
 ----------------------------------------------------------------
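
The two runs above differ only in the separator, and the repeated words at chunk boundaries come from `chunk_overlap`. The sketch below is a simplified approximation of that behaviour, not LangChain's implementation: `split_on_separator` is a hypothetical function that splits on a non-empty separator, greedily merges pieces up to `chunk_size`, and carries trailing pieces within the `chunk_overlap` budget into the next chunk.

```python
def split_on_separator(text, separator, chunk_size, chunk_overlap):
    """Split on a non-empty separator, then greedily merge pieces into chunks.

    Simplified: a single piece longer than chunk_size is emitted as-is.
    """
    pieces = [p for p in text.split(separator) if p]
    chunks, window, window_len = [], [], 0  # window_len == len(separator.join(window))
    for piece in pieces:
        sep_len = len(separator) if window else 0
        if window and window_len + sep_len + len(piece) > chunk_size:
            # The window is full: emit it as a chunk.
            chunks.append(separator.join(window))
            # Drop pieces from the front until what remains fits the overlap
            # budget and leaves room for the incoming piece.
            while window and (
                window_len > chunk_overlap
                or window_len + len(separator) + len(piece) > chunk_size
            ):
                removed = window.pop(0)
                window_len -= len(removed) + (len(separator) if window else 0)
        sep_len = len(separator) if window else 0
        window.append(piece)
        window_len += sep_len + len(piece)
    if window:
        chunks.append(separator.join(window))
    return chunks


demo_chunks = split_on_separator(
    "aaa bbb ccc ddd eee", separator=" ", chunk_size=11, chunk_overlap=3
)
for chunk in demo_chunks:
    print(repr(chunk))
```

With these inputs the two chunks share the piece `ccc`, which is the overlap. A coarser separator such as `"\n"` gives larger pieces, so fewer of them fit inside the overlap budget.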

#### 2. RECURSIVE CHARACTER TEXT SPLITTER

In [39]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

rec_splitter = RecursiveCharacterTextSplitter(
    separators=[" "],  # separators expects a list of strings, tried in order
    chunk_size=200,
    chunk_overlap=20,
    length_function=len
)

chunks = rec_splitter.split_text(documents[0].page_content)
print(len(chunks))
print(chunks[0], "\n", "--"*50)
print(chunks[1], "\n", "--"*50)
print(chunks[2], "\n", "--"*50)
print(chunks[3], "\n", "--"*50)

4
Machine Learning Basics:
    Machine learning is a subset of artificial intelligence that enables systems to learn and improve
    from experience without being explicitly programmed. It focuses on 
 ----------------------------------------------------------------------------------------------------
It focuses on developing computer programs
    that can access data and use it to learn for themselves.

    Types of Machine Learning:
    1. Supervised Learning: Learning with labeled data
    2. 
 ----------------------------------------------------------------------------------------------------
labeled data
    2. Unsupervised Learning: Finding patterns in unlabeled data
    3. Reinforcement Learning: Learning through rewards and penalties

    Applications include image recognition, speech 
 ----------------------------------------------------------------------------------------------------
recognition, speech processing, and recommendation systems 
 --------------------------------

In [41]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

rec_splitter = RecursiveCharacterTextSplitter(
    separators=["\n", ""],
    chunk_size=200,
    chunk_overlap=20,
    length_function=len
)

chunks = rec_splitter.split_text(documents[0].page_content)
print(len(chunks))
print(chunks[0], "\n", "--"*50)
print(chunks[1], "\n", "--"*50)
print(chunks[2], "\n", "--"*50)
print(chunks[3], "\n", "--"*50)

4
Machine Learning Basics:
    Machine learning is a subset of artificial intelligence that enables systems to learn and improve 
 ----------------------------------------------------------------------------------------------------
from experience without being explicitly programmed. It focuses on developing computer programs
    that can access data and use it to learn for themselves.

    Types of Machine Learning: 
 ----------------------------------------------------------------------------------------------------
1. Supervised Learning: Learning with labeled data
    2. Unsupervised Learning: Finding patterns in unlabeled data
    3. Reinforcement Learning: Learning through rewards and penalties 
 ----------------------------------------------------------------------------------------------------
Applications include image recognition, speech processing, and recommendation systems 
 ---------------------------------------------------------------------------------------------------
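
The recursive idea can be sketched in a few lines: try the coarsest separator first and fall back to finer ones only for pieces that are still too long. The code below is a simplified illustration under stated assumptions, not LangChain's algorithm: `recursive_split` is a hypothetical function, and it omits overlap and the re-merging of small pieces.

```python
def recursive_split(text, separators, chunk_size):
    """Split with the first separator; recurse on pieces that are too long.

    Simplified: no overlap and no re-merging of small pieces.
    """
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # No usable separator left: fall back to fixed-width slices.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    head, rest = separators[0], separators[1:]
    chunks = []
    for piece in text.split(head):
        chunks.extend(recursive_split(piece, rest, chunk_size))
    return chunks


sample = "alpha beta\ngamma delta epsilon\nzeta"
rec_demo = recursive_split(sample, ["\n", " ", ""], chunk_size=12)
print(rec_demo)

# A long run without spaces reaches the "" fallback and is sliced by width.
fallback_demo = recursive_split("abcdefgh", [" ", ""], chunk_size=3)
print(fallback_demo)
```

The line `alpha beta` fits within 12 characters and stays whole, while `gamma delta epsilon` does not and is split again on spaces. This shows why the order of `separators` matters: earlier entries preserve larger units of text.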

#### 3. TOKEN TEXT SPLITTER

In [42]:
from langchain_text_splitters import TokenTextSplitter

token_splitter = TokenTextSplitter(
    chunk_size=50,
    chunk_overlap=10
)

token_chunks = token_splitter.split_text(documents[0].page_content)
print(f"Created {len(token_chunks)} chunks")
print(f"First chunk: {token_chunks[0][:100]}...")

Created 4 chunks
First chunk: Machine Learning Basics:
    Machine learning is a subset of artificial intelligence that enables sy...


In [43]:
token_chunks

['Machine Learning Basics:\n    Machine learning is a subset of artificial intelligence that enables systems to learn and improve\n    from experience without being explicitly programmed. It focuses on developing computer programs\n    that can access data and use',
 '\n    that can access data and use it to learn for themselves.\n\n    Types of Machine Learning:\n    1. Supervised Learning: Learning with labeled data\n    2. Unsupervised Learning:',
 '    2. Unsupervised Learning: Finding patterns in unlabeled data\n    3. Reinforcement Learning: Learning through rewards and penalties\n\n    Applications include image recognition, speech processing, and recommendation systems\n  ',
 ', speech processing, and recommendation systems\n    ']
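
Token-based splitting measures `chunk_size` and `chunk_overlap` in tokens rather than characters, which is why consecutive chunks above repeat a few words. The sketch below shows only the sliding-window arithmetic: `split_by_tokens` is a hypothetical function, and whitespace-separated words stand in for real tokenizer output, so the boundaries will not match `TokenTextSplitter`.

```python
def split_by_tokens(tokens, chunk_size, chunk_overlap):
    """Slide a window of chunk_size tokens, stepping by chunk_size - chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the input
    return chunks


# Assumption: words stand in for tokens; a real tokenizer yields subword IDs.
words = "one two three four five six seven eight nine ten".split()
windows = split_by_tokens(words, chunk_size=4, chunk_overlap=1)
for window in windows:
    print(" ".join(window))
```

Each window starts `chunk_size - chunk_overlap` tokens after the previous one, so the last token of one window is the first token of the next when the overlap is 1.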