## Course Table of Contents

- Introduction to Data Ingestion
- Text Files (.txt)
- PDF Documents
- Microsoft Word Documents
- CSV and Excel Files
- JSON and Structured Data
- Web Scraping
- Databases (SQL)
- Audio and Video Transcripts
- Advanced Techniques
- Best Practices

### Introduction To Data Ingestion

In [60]:
import os
from typing import List, Dict, Any


In [41]:
from langchain.text_splitter import (
    RecursiveCharacterTextSplitter,
    CharacterTextSplitter,
    TokenTextSplitter,
)
from langchain.schema import Document

### Document Structure in LangChain

In [42]:
## A Document is a simple object that holds page content and metadata.
## Why metadata matters:
##   - Filtering search results
##   - Tracking document sources
##   - Providing context in responses
##   - Debugging and auditing

doc=Document(
    page_content="This is the main text content that will be embedded and searched.",
    metadata={
        "source":"book.txt",
        "page":2,
        "author":"Hammad Ali Tahir",
        "date_created":"2026-01-01",
        "custom_field":"any_value"
    }
)
print("Document Structure")

print(f"Content: {doc.page_content}")
print(f"Metadata: {doc.metadata}")

Document Structure
Content: This is the main text content that will be embedded and searched.
Metadata: {'source': 'book.txt', 'page': 2, 'author': 'Hammad Ali Tahir', 'date_created': '2026-01-01', 'custom_field': 'any_value'}


In [43]:
type(doc)

langchain_core.documents.base.Document
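To make the "filtering search results" point concrete, here is a minimal sketch of selecting documents by metadata. The helper `filter_by_metadata` is a hypothetical name introduced for illustration, not a LangChain function, and the fallback `Document` dataclass is only a stand-in so the snippet runs even where `langchain_core` is not installed.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List

try:
    from langchain_core.documents import Document
except ImportError:
    # Stand-in with the same two fields, used only if LangChain is missing.
    @dataclass
    class Document:
        page_content: str
        metadata: Dict[str, Any] = field(default_factory=dict)


def filter_by_metadata(docs: List[Document], **criteria: Any) -> List[Document]:
    """Keep documents whose metadata matches every key/value in criteria."""
    return [
        d for d in docs
        if all(d.metadata.get(key) == value for key, value in criteria.items())
    ]


docs = [
    Document(page_content="Intro to Python", metadata={"source": "book.txt", "page": 1}),
    Document(page_content="Python data types", metadata={"source": "book.txt", "page": 2}),
    Document(page_content="Meeting notes", metadata={"source": "notes.txt", "page": 1}),
]

book_docs = filter_by_metadata(docs, source="book.txt")
print([d.page_content for d in book_docs])
```

Vector stores usually offer their own metadata filters; this sketch only shows why consistent metadata keys make that kind of filtering possible.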

### Text Files (.txt) - The Simplest Case {#2-text-files}

In [44]:
## Create a directory for the sample text files
import os
os.makedirs("data/text_files",exist_ok=True)

In [45]:
## Now write some text into those files
## There are two sample text files
sample_texts={
    "data/text_files/python_intro.txt":"""Python Programming Introduction

Python is a high-level, interpreted programming language known for its simplicity and readability.
Created by Guido van Rossum and first released in 1991, Python has become one of the most popular
programming languages in the world.

Key Features:
- Easy to learn and use
- Extensive standard library
- Cross-platform compatibility
- Strong community support

Python is widely used in web development, data science, artificial intelligence, and automation.""",
    
    "data/text_files/machine_learning.txt": """Machine Learning Basics

Machine learning is a subset of artificial intelligence that enables systems to learn and improve
from experience without being explicitly programmed. It focuses on developing computer programs
that can access data and use it to learn for themselves.

Types of Machine Learning:
1. Supervised Learning: Learning with labeled data
2. Unsupervised Learning: Finding patterns in unlabeled data
3. Reinforcement Learning: Learning through rewards and penalties

Applications include image recognition, speech processing, and recommendation systems
    
    
    """

}

for filepath,content in sample_texts.items():
    with open(filepath,'w',encoding="utf-8") as f:
        f.write(content)

print("Sample text files created!")

Sample text files created!


### TextLoader - Read a Single File

In [46]:
from langchain_community.document_loaders import TextLoader

## Loading a single text file
loader=TextLoader("data/text_files/python_intro.txt", encoding="utf-8")

documents=loader.load()
print(f"📄 Loaded {len(documents)} document")
print(f"Content preview: {documents[0].page_content[:100]}...")
print(f"Metadata: {documents[0].metadata}")

📄 Loaded 1 document
Content preview: Python Programming Introduction

Python is a high-level, interpreted programming language known for ...
Metadata: {'source': 'data/text_files/python_intro.txt'}
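As a rough mental model of the output above, loading a text file amounts to reading it with a given encoding and recording the path under `metadata["source"]`. The sketch below uses only the standard library; `load_text_file` is a hypothetical helper that returns a plain dict, not the actual `TextLoader` implementation.

```python
import tempfile
from pathlib import Path
from typing import Any, Dict


def load_text_file(path: str, encoding: str = "utf-8") -> Dict[str, Any]:
    """Read one text file and wrap it in a document-like dict."""
    text = Path(path).read_text(encoding=encoding)
    return {"page_content": text, "metadata": {"source": str(path)}}


# Demo on a throwaway file so the snippet does not depend on the notebook's data folder.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.txt"
    sample.write_text("Hello, ingestion!", encoding="utf-8")
    doc = load_text_file(str(sample))

print(doc["page_content"])
print(doc["metadata"])
```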


#### Reading a file created outside the notebook

In [47]:
loader=TextLoader("data/output.txt", encoding="utf-8")

output_documents=loader.load()
print(f"📄 Loaded {len(output_documents)} document")
print(f"Content preview: {output_documents[0].page_content[:1000]}...")
print(f"Metadata: {output_documents[0].metadata}")

📄 Loaded 1 document
Content preview: 
        Date: 05 Sep, 2025 09:03pm
        Title: UK deputy PM Rayner resigns over tax mistake in damaging blow to Starmer
        Explanation: With Labour trailing Nigel Farage's populist Reform UK in the polls, UK PM Starmer faces difficult state spending and tax choices.
 
     

        Date: 05 Sep, 2025 09:40pm
        Title: Justice Shah asks CJP Afridi to publicly answer 6 questions on ‘pressing institutional concerns’
        Explanation: Says he trusts CJP to use Sept 8 judicial conference as a “moment of institutional renewal by answering these questions and reaffirming the principles of collegiality and constitutional fidelity”.
 
     

        Date: 05 Sep, 2025 07:25pm
        Title: Imran Khan’s other nephew Shershah also released from Kot Lakhpat jail after bail
        Explanation: Shershah's brother Shahrez was set free from prison after being granted bail a day earlier.
 
     

        Date: 05 Sep, 2025 07:05pm
  

### DirectoryLoader - Multiple Text Files
Load both sample files from `data/text_files`.

In [None]:
from langchain_community.document_loaders import DirectoryLoader

## Load all the text files from the directory
dir_loader=DirectoryLoader(
    "data/text_files",
    glob="**/*.txt",  ## Glob pattern: match every .txt file in the directory and its subdirectories
    loader_cls=TextLoader,  ## Loader class to use for each file
    loader_kwargs={'encoding': 'utf-8'},  ## Keyword arguments passed to the loader class
    show_progress=True
)

documents=dir_loader.load()

print(f" Loaded {len(documents)} documents")
for i, doc in enumerate(documents):
    print(f"\nDocument {i+1}:")
    print(f"  Source: {doc.metadata['source']}")
    print(f"  Length: {len(doc.page_content)} characters")


# 📊 Analysis
print("\n DirectoryLoader Characteristics:")
print(" Advantages:")
print("  - Loads multiple files at once")
print("  - Supports glob patterns")
print("  - Progress tracking")
print("  - Recursive directory scanning")

print("\n Disadvantages:")
print("  - All files must be same type")
print("  - Limited error handling per file")
print("  - Can be memory intensive for large directories")

100%|██████████| 2/2 [00:00<00:00, 2273.95it/s]

 Loaded 2 documents

Document 1:
  Source: data/text_files/python_intro.txt
  Length: 489 characters

Document 2:
  Source: data/text_files/machine_learning.txt
  Length: 575 characters

 DirectoryLoader Characteristics:
 Advantages:
  - Loads multiple files at once
  - Supports glob patterns
  - Progress tracking
  - Recursive directory scanning

 Disadvantages:
  - All files must be same type
  - Limited error handling per file
  - Can be memory intensive for large directories
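Two of the disadvantages listed above, limited per-file error handling and memory use, can be worked around by iterating over files yourself. The following is a standard-library sketch under that assumption; `iter_text_documents` is a hypothetical helper, not part of LangChain, and it yields plain dicts rather than `Document` objects.

```python
import tempfile
from pathlib import Path
from typing import Any, Dict, Iterator, List, Tuple


def iter_text_documents(
    root: str,
    pattern: str = "**/*.txt",
    encoding: str = "utf-8",
    failures: List[Tuple[str, str]] = None,
) -> Iterator[Dict[str, Any]]:
    """Yield one document-like dict per readable file, one file at a time."""
    for path in sorted(Path(root).glob(pattern)):
        if not path.is_file():
            continue
        try:
            text = path.read_text(encoding=encoding)
        except (OSError, UnicodeDecodeError) as exc:
            # Record the failure and keep going instead of aborting the whole load.
            if failures is not None:
                failures.append((str(path), repr(exc)))
            continue
        yield {"page_content": text, "metadata": {"source": str(path)}}


# Demo: one valid file and one file that is not valid UTF-8.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "good.txt").write_text("Readable text", encoding="utf-8")
    (Path(tmp) / "broken.txt").write_bytes(b"\xff\xfe\x00 not utf-8")
    failures: List[Tuple[str, str]] = []
    loaded = list(iter_text_documents(tmp, failures=failures))

print(f"Loaded {len(loaded)} document(s), {len(failures)} failure(s)")
```

Because the function is a generator, a caller can process and discard each document in turn rather than holding the whole directory in memory.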





### Text Splitting Strategies

In [49]:
### Different text splitting strategies
from langchain.text_splitter import (
    CharacterTextSplitter,
    RecursiveCharacterTextSplitter,
    TokenTextSplitter
)
print(output_documents)



In [50]:
### Method 1 - Character Text Splitter
text=output_documents[0].page_content
print(text)


        Date: 05 Sep, 2025 09:03pm
        Title: UK deputy PM Rayner resigns over tax mistake in damaging blow to Starmer
        Explanation: With Labour trailing Nigel Farage's populist Reform UK in the polls, UK PM Starmer faces difficult state spending and tax choices.
 
     

        Date: 05 Sep, 2025 09:40pm
        Title: Justice Shah asks CJP Afridi to publicly answer 6 questions on ‘pressing institutional concerns’
        Explanation: Says he trusts CJP to use Sept 8 judicial conference as a “moment of institutional renewal by answering these questions and reaffirming the principles of collegiality and constitutional fidelity”.
 
     

        Date: 05 Sep, 2025 07:25pm
        Title: Imran Khan’s other nephew Shershah also released from Kot Lakhpat jail after bail
        Explanation: Shershah's brother Shahrez was set free from prison after being granted bail a day earlier.
 
     

        Date: 05 Sep, 2025 07:05pm
        Title: ‘Roblox’ game to impose

In [79]:
# Method 1: Character-based splitting
print("1️⃣ CHARACTER TEXT SPLITTER")
char_splitter = CharacterTextSplitter(
    separator="\n",  # Split on newlines
    chunk_size=270,  # Max chunk size in characters
    chunk_overlap=1,  # Overlap between chunks
    length_function=len  # How to measure chunk size
)

char_chunks=char_splitter.split_text(text)
print(f"Created {len(char_chunks)} chunks")
print(f"First chunk: {char_chunks[0]}")

1️⃣ CHARACTER TEXT SPLITTER
Created 216 chunks
First chunk: Date: 05 Sep, 2025 09:03pm
        Title: UK deputy PM Rayner resigns over tax mistake in damaging blow to Starmer


In [80]:
print(char_chunks[0])
print("------------------")
print(char_chunks[1])
print("------------------")
print(char_chunks[2])
print("------------------")
print(char_chunks[3])

Date: 05 Sep, 2025 09:03pm
        Title: UK deputy PM Rayner resigns over tax mistake in damaging blow to Starmer
------------------
Explanation: With Labour trailing Nigel Farage's populist Reform UK in the polls, UK PM Starmer faces difficult state spending and tax choices.
 
     
        Date: 05 Sep, 2025 09:40pm
------------------
Title: Justice Shah asks CJP Afridi to publicly answer 6 questions on ‘pressing institutional concerns’
------------------
Explanation: Says he trusts CJP to use Sept 8 judicial conference as a “moment of institutional renewal by answering these questions and reaffirming the principles of collegiality and constitutional fidelity”.
 
     
        Date: 05 Sep, 2025 07:25pm


In [71]:
# Method 1: Character-based splitting
print("1️⃣ CHARACTER TEXT SPLITTER")
char_splitter = CharacterTextSplitter(
    separator="\n",  # Split on newlines
    chunk_size=300,  # Max chunk size in characters
    chunk_overlap=20,  # Overlap between chunks
    length_function=len  # How to measure chunk size
)

char_chunks=char_splitter.split_text(text)
print(f"Created {len(char_chunks)} chunks")
print(f"First chunk: {char_chunks[0][:1000]}...")

1️⃣ CHARACTER TEXT SPLITTER
Created 194 chunks
First chunk: Date: 05 Sep, 2025 09:03pm
        Title: UK deputy PM Rayner resigns over tax mistake in damaging blow to Starmer
        Explanation: With Labour trailing Nigel Farage's populist Reform UK in the polls, UK PM Starmer faces difficult state spending and tax choices....


In [72]:
print(char_chunks[0])
print("-------------")
print(char_chunks[1])
print("-------------")
print(char_chunks[2])

Date: 05 Sep, 2025 09:03pm
        Title: UK deputy PM Rayner resigns over tax mistake in damaging blow to Starmer
        Explanation: With Labour trailing Nigel Farage's populist Reform UK in the polls, UK PM Starmer faces difficult state spending and tax choices.
-------------
Date: 05 Sep, 2025 09:40pm
        Title: Justice Shah asks CJP Afridi to publicly answer 6 questions on ‘pressing institutional concerns’
-------------
Explanation: Says he trusts CJP to use Sept 8 judicial conference as a “moment of institutional renewal by answering these questions and reaffirming the principles of collegiality and constitutional fidelity”.
 
     
        Date: 05 Sep, 2025 07:25pm
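
The chunks above show no repeated text even though `chunk_overlap=20`. One plausible reading is that a separator-based splitter can only carry whole pieces (here, whole lines) into the next chunk, and most lines in this file are longer than 20 characters. The sketch below is a simplified approximation of that split-then-merge idea, not LangChain's actual code; `split_with_overlap` is a hypothetical name.

```python
from typing import List


def split_with_overlap(
    text: str, separator: str, chunk_size: int, chunk_overlap: int
) -> List[str]:
    """Split on one separator, then greedily merge pieces into chunks.

    Overlap is made of whole trailing pieces only, so a piece longer than
    chunk_overlap is never repeated in the next chunk.
    """
    pieces = [p for p in text.split(separator) if p.strip()]
    chunks: List[str] = []
    window: List[str] = []  # pieces of the chunk currently being built

    def joined_len(parts: List[str]) -> int:
        return len(separator.join(parts))

    for piece in pieces:
        if window and joined_len(window + [piece]) > chunk_size:
            chunks.append(separator.join(window))
            # Drop pieces from the front until what remains fits the overlap
            # budget and still leaves room for the incoming piece.
            while window and (
                joined_len(window) > chunk_overlap
                or joined_len(window + [piece]) > chunk_size
            ):
                window.pop(0)
        window.append(piece)
    if window:
        chunks.append(separator.join(window))
    return chunks


lines = "alpha\nbeta\ngamma\ndelta\nepsilon"
with_overlap = split_with_overlap(lines, "\n", chunk_size=12, chunk_overlap=6)
without_fit = split_with_overlap(lines, "\n", chunk_size=12, chunk_overlap=3)
print(with_overlap)
print(without_fit)
```

With an overlap budget of 6, short lines are repeated across chunks; with a budget of 3, no line is short enough to be carried over, so the chunks do not overlap at all.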


In [55]:
# Method 2: Recursive character splitting (RECOMMENDED)
print("\n2️⃣ RECURSIVE CHARACTER TEXT SPLITTER")
recursive_splitter = RecursiveCharacterTextSplitter(
    separators=[" "],  # Only spaces here; the default list is ["\n\n", "\n", " ", ""], tried in order
    chunk_size=250,
    chunk_overlap=20,
    length_function=len
)

recursive_chunks = recursive_splitter.split_text(text)
print(f"Created {len(recursive_chunks)} chunks")
print(f"First chunk: {recursive_chunks[0][:100]}...")


2️⃣ RECURSIVE CHARACTER TEXT SPLITTER
Created 216 chunks
First chunk: Date: 05 Sep, 2025 09:03pm
        Title: UK deputy PM Rayner resigns over tax mistake in damaging b...


In [56]:
print(recursive_chunks[0])
print("-----------------")
print(recursive_chunks[1])
print("------------------")
print(recursive_chunks[2])

Date: 05 Sep, 2025 09:03pm
        Title: UK deputy PM Rayner resigns over tax mistake in damaging blow to Starmer
        Explanation: With Labour trailing Nigel Farage's populist Reform UK in the polls, UK PM Starmer faces difficult state
-----------------
difficult state spending and tax choices.
 
     

        Date: 05 Sep, 2025 09:40pm
        Title: Justice Shah asks CJP Afridi to publicly answer 6 questions on ‘pressing institutional concerns’
        Explanation: Says he trusts CJP to use Sept
------------------
CJP to use Sept 8 judicial conference as a “moment of institutional renewal by answering these questions and reaffirming the principles of collegiality and constitutional fidelity”.
 
     

        Date: 05 Sep, 2025 07:25pm
        Title: Imran
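
The cell above restricts the splitter to a single separator. To illustrate what "recursive" means when several separators are available, here is a simplified sketch: try the coarsest separator first and fall back to finer ones only for pieces that are still too long. `recursive_split` is a hypothetical function written for this explanation; it omits the overlap handling that the real `RecursiveCharacterTextSplitter` performs.

```python
from typing import List


def recursive_split(text: str, separators: List[str], chunk_size: int) -> List[str]:
    """Split text into chunks of at most chunk_size characters.

    Separators must be non-empty strings, ordered from coarsest to finest.
    """
    if len(text) <= chunk_size:
        return [text] if text.strip() else []
    if not separators:
        # Nothing left to split on: fall back to hard cuts.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    chunks: List[str] = []
    current = ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate  # piece still fits in the chunk being built
            continue
        if current.strip():
            chunks.append(current)
        current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # Piece is too long on its own: retry with the finer separators.
            chunks.extend(recursive_split(piece, finer, chunk_size))
    if current.strip():
        chunks.append(current)
    return chunks


sample = "Para one is short.\n\nPara two is a bit longer and needs splitting on spaces."
pieces = recursive_split(sample, ["\n\n", "\n", " "], chunk_size=30)
for p in pieces:
    print(repr(p))
```

The short first paragraph survives intact, and only the long second paragraph is broken down at word boundaries, which is the behaviour that makes this strategy a sensible default.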


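Overlap is useful here: because neighbouring chunks share a few words, a retriever is more likely to pull back every chunk that belongs to the same passage.

The overlap is clearly visible in these chunks because only one separator (a space) is used, so the splitter works word by word and carries the trailing words of one chunk into the next.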

In [57]:
# Create text without natural break points
simple_text = "This is sentence one and it is quite long. This is sentence two and it is also quite long. This is sentence three which is even longer than the others. This is sentence four. This is sentence five. This is sentence six."

splitter = RecursiveCharacterTextSplitter(
    separators=[" "],  # Only split on spaces
    chunk_size=80,
    chunk_overlap=20,
    length_function=len
)

chunks = splitter.split_text(simple_text)

print(f"\nSimple text example - {len(chunks)} chunks:\n")

# Print each consecutive pair of chunks so the shared words are easy to spot
for i in range(len(chunks) - 1):
    print(f"Chunk {i+1}: '{chunks[i]}'")
    print(f"Chunk {i+2}: '{chunks[i+1]}'")
    print()


Simple text example - 4 chunks:

Chunk 1: 'This is sentence one and it is quite long. This is sentence two and it is also'
Chunk 2: 'two and it is also quite long. This is sentence three which is even longer than'

Chunk 2: 'two and it is also quite long. This is sentence three which is even longer than'
Chunk 3: 'is even longer than the others. This is sentence four. This is sentence five.'

Chunk 3: 'is even longer than the others. This is sentence four. This is sentence five.'
Chunk 4: 'is sentence five. This is sentence six.'
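
To check the overlap rather than eyeball it, the following sketch measures the text shared between consecutive chunks. The chunk strings are copied from the output above; `shared_overlap` is a hypothetical helper that returns the longest suffix of one chunk that is also a prefix of the next.

```python
from typing import List


def shared_overlap(left: str, right: str) -> str:
    """Return the longest suffix of left that is also a prefix of right."""
    for size in range(min(len(left), len(right)), 0, -1):
        if left.endswith(right[:size]):
            return right[:size]
    return ""


printed_chunks: List[str] = [
    "This is sentence one and it is quite long. This is sentence two and it is also",
    "two and it is also quite long. This is sentence three which is even longer than",
    "is even longer than the others. This is sentence four. This is sentence five.",
    "is sentence five. This is sentence six.",
]

overlaps = [
    shared_overlap(a, b) for a, b in zip(printed_chunks, printed_chunks[1:])
]
for text_shared in overlaps:
    print(f"{len(text_shared):2d} chars shared: '{text_shared}'")

overlap_lengths = [len(o) for o in overlaps]  # -> [18, 19, 17]
```

Each shared span stays within the configured `chunk_overlap=20` and always ends on a word boundary, since the splitter only cuts at spaces.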



In [58]:
# Method 3: Token-based splitting
print("\n3️⃣ TOKEN TEXT SPLITTER")
token_splitter = TokenTextSplitter(
    chunk_size=50,  # Size in tokens (not characters)
    chunk_overlap=10
)

token_chunks = token_splitter.split_text(text)
print(f"Created {len(token_chunks)} chunks")
print(f"First chunk: {token_chunks[0][:1000]}...")


3️⃣ TOKEN TEXT SPLITTER
Created 416 chunks
First chunk: 
        Date: 05 Sep, 2025 09:03pm
        Title: UK deputy PM Rayner resigns over tax mistake in damaging blow to Starmer
     ...
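
Token-based splitting can be pictured as a sliding window over a token list: take `chunk_size` tokens, then advance by `chunk_size - chunk_overlap`. The sketch below uses whitespace-separated words as stand-in tokens, which is an assumption made so the example needs no tokenizer library; a real tokenizer produces sub-word tokens and will give different counts. `split_by_tokens` is a hypothetical name.

```python
from typing import Callable, List


def split_by_tokens(
    text: str,
    chunk_size: int,
    chunk_overlap: int,
    tokenize: Callable[[str], List[str]] = str.split,
    detokenize: Callable[[List[str]], str] = " ".join,
) -> List[str]:
    """Slide a window of chunk_size tokens over the text with the given overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    tokens = tokenize(text)
    step = chunk_size - chunk_overlap
    chunks: List[str] = []
    for start in range(0, len(tokens), step):
        chunks.append(detokenize(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break  # the window already reached the end of the text
    return chunks


words = "one two three four five six seven eight nine ten"
word_chunks = split_by_tokens(words, chunk_size=4, chunk_overlap=1)
for chunk in word_chunks:
    print(chunk)
```

Unlike the character-based splitters, the window here ignores line and sentence boundaries entirely, which matches the way the first token chunk above stops partway through the text.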


### CharacterTextSplitter:
  ✅ Simple and predictable
  ✅ Good for structured text
  ❌ May break mid-sentence
  Use when: Text has clear delimiters

### RecursiveCharacterTextSplitter:
  ✅ Respects text structure
  ✅ Tries multiple separators
  ✅ Best general-purpose splitter
  ❌ Slightly more complex
  Use when: Default choice for most texts

### TokenTextSplitter:
  ✅ Respects model token limits
  ✅ More accurate for embeddings
  ❌ Slower than character-based
  Use when: Working with token-limited models