# Data Ingestion

In [100]:
import os
import pandas as pd
from typing import List, Dict, Any

## Document

In [101]:
from langchain_core.documents import Document

## Understanding the Document structure in LangChain
doc = Document(
    page_content="This is a sample document.",
    metadata={
        "source": "sample_source.txt",
        "author": "Mostafa",
        "page": 1,
        "custom_field": "custom_value"
    }
)

print("Document Structure:")
print(f"Content: {doc.page_content}")
print(f"Metadata: {doc.metadata}")
print(f"Author: {doc.metadata['author']}")

Document Structure:
Content: This is a sample document.
Metadata: {'source': 'sample_source.txt', 'author': 'Mostafa', 'page': 1, 'custom_field': 'custom_value'}
Author: Mostafa


**Why metadata?**
- Filtering search results
- Tracking document sources
- Providing context in responses
- Debugging and auditing
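
The filtering and source-tracking uses can be sketched without a vector store. This is a minimal illustration only: `Doc` is a stand-in dataclass that mirrors the `page_content`/`metadata` shape of LangChain's `Document`, `filter_by_metadata` is a hypothetical helper (not a LangChain API), and the sample values are made up.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class Doc:
    """Stand-in with the same two fields as LangChain's Document."""
    page_content: str
    metadata: Dict[str, Any] = field(default_factory=dict)


def filter_by_metadata(docs: List[Doc], **criteria: Any) -> List[Doc]:
    """Keep documents whose metadata matches every key/value in criteria."""
    return [
        d for d in docs
        if all(d.metadata.get(k) == v for k, v in criteria.items())
    ]


# Made-up sample documents for illustration.
docs = [
    Doc("Intro to Python.", {"source": "python_intro.txt", "author": "Mostafa"}),
    Doc("Intro to ML.", {"source": "machine_learning.txt", "author": "Mostafa"}),
    Doc("Quarterly report.", {"source": "report.pdf", "author": "Someone Else"}),
]

by_author = filter_by_metadata(docs, author="Mostafa")
sources = [d.metadata["source"] for d in by_author]
print(sources)
```

The same idea applies to real `Document` objects, since they expose `metadata` as a plain dictionary.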

## Text

In [102]:
from langchain_community.document_loaders import TextLoader

## single text file loading
text_loader = TextLoader(r"data\text_files\python_intro.txt", encoding="utf8")

text_documents = text_loader.load()

print("Text Loader Structure:")
print(f"Number of Documents Loaded:{len(text_documents)}")
print(f"Content of Text:\n{text_documents[0].page_content}\n")
print(f"Metadata of Text: {text_documents[0].metadata}")

Text Loader Structure:
Number of Documents Loaded:1
Content of Text:
Python Programming Introduction

Python is a high-level, interpreted programming language known for its simplicity and readability.
Created by Guido van Rossum and first released in 1991, Python has become one of the most popular
programming languages in the world.

Key Features:
- Easy to learn and use
- Extensive standard library
- Cross-platform compatibility
- Strong community support

Python is widely used in web development, data science, artificial intelligence, and automation.

Metadata of Text: {'source': 'data\\text_files\\python_intro.txt'}


In [103]:
from langchain_community.document_loaders import DirectoryLoader

## load all .txt files from a directory; other file types work with a matching loader_cls
dir_loader = DirectoryLoader(
    r"data\text_files",
    glob="*.txt", ## load only text files
    loader_cls=TextLoader, ## specify the loader class
    loader_kwargs={"encoding": "utf8"}, ## loader specific arguments
    show_progress=True
)

dir_documents = dir_loader.load()
print("Directory Loader Structure:")
for i, document in enumerate(dir_documents):
    print("="*40)
    print(f"Document {i+1}:")
    print(f"Length of Content: {len(document.page_content)} characters")
    print(f"Metadata: {document.metadata}")

100%|██████████| 2/2 [00:00<00:00, 976.56it/s]

Directory Loader Structure:
Document 1:
Length of Content: 575 characters
Metadata: {'source': 'data\\text_files\\machine_learning.txt'}
Document 2:
Length of Content: 489 characters
Metadata: {'source': 'data\\text_files\\python_intro.txt'}





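**DirectoryLoader**

**Pros** - loads multiple files in one call, supports glob patterns, shows progress (`show_progress=True`), and can scan subdirectories (`recursive=True`)

**Cons** - a single `loader_cls` is applied to every matched file, so mixed file types need separate loaders; by default one unreadable file aborts the whole load (unless `silent_errors=True`); loading everything at once can be memory-intensive for large directories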
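
When per-file error handling matters, one option is to iterate over the files yourself. The sketch below uses only the standard library and plain dictionaries in place of `Document` objects; `load_text_dir` is a hypothetical helper, and the temporary files exist only to make the example runnable.

```python
import tempfile
from pathlib import Path
from typing import Dict, List, Tuple


def load_text_dir(root: Path, pattern: str = "*.txt") -> Tuple[List[Dict], List[Tuple[str, str]]]:
    """Read every file matching pattern; collect failures instead of stopping."""
    loaded: List[Dict] = []
    failed: List[Tuple[str, str]] = []
    for path in sorted(root.glob(pattern)):
        try:
            text = path.read_text(encoding="utf8")
        except (UnicodeDecodeError, OSError) as exc:
            # Record which file failed and why, then move on to the next one.
            failed.append((path.name, type(exc).__name__))
            continue
        loaded.append({"page_content": text, "metadata": {"source": str(path)}})
    return loaded, failed


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "good.txt").write_text("hello", encoding="utf8")
    (root / "bad.txt").write_bytes(b"\xff\xfe\xff")  # not valid UTF-8
    loaded, failed = load_text_dir(root)

print(len(loaded), failed)
```

Here the unreadable file ends up in `failed` while the readable one is still loaded, which is the behaviour a single bulk load does not give you by default.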

## PDF

In [104]:
from langchain_community.document_loaders import (
    PyMuPDFLoader,
    PyPDFLoader
)

In [105]:
## Method 1: Using PyPDFLoader
pypdf_loader = PyPDFLoader(r"data\pdf\attention.pdf")
pypdf_documents = pypdf_loader.load()
print("PyPDFLoader loaded documents:")
for i, document in enumerate(pypdf_documents):
    print("="*40)
    print(f"Document {i+1}:")
    print(f"Length of Content: {len(document.page_content)} characters")
    print(f"Page content preview: {document.page_content[:100]}...")
    print(f"Metadata: {document.metadata}")

PyPDFLoader loaded documents:
Document 1:
Length of Content: 2859 characters
Page content preview: Provided proper attribution is provided, Google hereby grants permission to
reproduce the tables and...
Metadata: {'producer': 'pdfTeX-1.40.25', 'creator': 'LaTeX with hyperref', 'creationdate': '2024-04-10T21:11:43+00:00', 'author': '', 'keywords': '', 'moddate': '2024-04-10T21:11:43+00:00', 'ptex.fullbanner': 'This is pdfTeX, Version 3.141592653-2.6-1.40.25 (TeX Live 2023) kpathsea version 6.3.5', 'subject': '', 'title': '', 'trapped': '/False', 'source': 'data\\pdf\\attention.pdf', 'total_pages': 15, 'page': 0, 'page_label': '1'}
Document 2:
Length of Content: 4257 characters
Page content preview: 1 Introduction
Recurrent neural networks, long short-term memory [13] and gated recurrent [7] neural...
Metadata: {'producer': 'pdfTeX-1.40.25', 'creator': 'LaTeX with hyperref', 'creationdate': '2024-04-10T21:11:43+00:00', 'author': '', 'keywords': '', 'moddate': '2024-04-10T21:11:43+00:00',

In [106]:
## Method 2: Using PyMuPDFLoader
pymupdf_loader = PyMuPDFLoader(r"data\pdf\attention.pdf")
pymupdf_documents = pymupdf_loader.load()
print("PyMuPDFLoader loaded documents:")
for i, document in enumerate(pymupdf_documents):
    print("="*40)
    print(f"Document {i+1}:")
    print(f"Length of Content: {len(document.page_content)} characters")
    print(f"Page content preview: {document.page_content[:100]}...")
    print(f"Metadata: {document.metadata}")

PyMuPDFLoader loaded documents:
Document 1:
Length of Content: 2857 characters
Page content preview: Provided proper attribution is provided, Google hereby grants permission to
reproduce the tables and...
Metadata: {'producer': 'pdfTeX-1.40.25', 'creator': 'LaTeX with hyperref', 'creationdate': '2024-04-10T21:11:43+00:00', 'source': 'data\\pdf\\attention.pdf', 'file_path': 'data\\pdf\\attention.pdf', 'total_pages': 15, 'format': 'PDF 1.5', 'title': '', 'author': '', 'subject': '', 'keywords': '', 'moddate': '2024-04-10T21:11:43+00:00', 'trapped': '', 'modDate': 'D:20240410211143Z', 'creationDate': 'D:20240410211143Z', 'page': 0}
Document 2:
Length of Content: 4255 characters
Page content preview: 1
Introduction
Recurrent neural networks, long short-term memory [13] and gated recurrent [7] neural...
Metadata: {'producer': 'pdfTeX-1.40.25', 'creator': 'LaTeX with hyperref', 'creationdate': '2024-04-10T21:11:43+00:00', 'source': 'data\\pdf\\attention.pdf', 'file_path': 'data\\pdf\\attenti

**PyMuPDFLoader**
- pros: fast processing, good text extraction, image extraction support
- use: when speed is important

**PyPDFLoader**
- pros: simple and reliable, good for most PDFs, preserves page numbers
- cons: basic text extraction
- use: standard text PDFs
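
The two outputs above differ slightly (2859 vs 2857 characters on page 1, and `1 Introduction` vs `1` / `Introduction` on page 2). A quick way to check whether such differences are only whitespace is to normalize before comparing. The sketch below uses the page 2 previews copied from the outputs above; `normalize_ws` is a hypothetical helper.

```python
import re

# Page 2 previews copied from the two loader outputs above.
pypdf_preview = (
    "1 Introduction\n"
    "Recurrent neural networks, long short-term memory [13] and gated recurrent [7] neural"
)
pymupdf_preview = (
    "1\nIntroduction\n"
    "Recurrent neural networks, long short-term memory [13] and gated recurrent [7] neural"
)


def normalize_ws(text: str) -> str:
    """Collapse every run of whitespace (spaces, newlines) into one space."""
    return re.sub(r"\s+", " ", text).strip()


same_raw = pypdf_preview == pymupdf_preview
same_normalized = normalize_ws(pypdf_preview) == normalize_ws(pymupdf_preview)
print(same_raw, same_normalized)
```

For these previews the raw strings differ but the normalized ones match, which suggests the loaders disagree on line breaks rather than on the extracted words.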

## Word

In [126]:
from langchain_community.document_loaders import Docx2txtLoader

docx_loader = Docx2txtLoader(r"data\word_files\proposal.docx")
docx_documents = docx_loader.load()

print("Docx2txtLoader loaded documents:")
print(f"Number of Documents Loaded: {len(docx_documents)}")
print(f"Content Preview:\n{docx_documents[0].page_content[:200]}...")
print(f"Metadata: {docx_documents[0].metadata}")

Docx2txtLoader loaded documents:
Number of Documents Loaded: 1
Content Preview:
Project Proposal: RAG Implementation

Executive Summary

This proposal outlines the implementation of a Retrieval-Augmented Generation system for our organization.

Objectives

Key objectives include:...
Metadata: {'source': 'data\\word_files\\proposal.docx'}


In [127]:
from langchain_community.document_loaders import UnstructuredWordDocumentLoader

unstructured_loader = UnstructuredWordDocumentLoader(r"data\word_files\proposal.docx", mode="elements")
unstructured_documents = unstructured_loader.load()

print("UnstructuredWordDocumentLoader loaded documents:")
print(f"Number of Documents Loaded: {len(unstructured_documents)}")

for i, document in enumerate(unstructured_documents):
    print("="*40)
    print(f"element {i+1}:")
    print(f"Length of Content: {len(document.page_content)} characters")
    print(f"Metadata: {document.metadata['category']}")
    

UnstructuredWordDocumentLoader loaded documents:
Number of Documents Loaded: 20
element 1:
Length of Content: 36 characters
Metadata: Title
element 2:
Length of Content: 17 characters
Metadata: Title
element 3:
Length of Content: 106 characters
Metadata: NarrativeText
element 4:
Length of Content: 10 characters
Metadata: Title
element 5:
Length of Content: 23 characters
Metadata: NarrativeText
element 6:
Length of Content: 38 characters
Metadata: ListItem
element 7:
Length of Content: 41 characters
Metadata: ListItem
element 8:
Length of Content: 38 characters
Metadata: ListItem
element 9:
Length of Content: 19 characters
Metadata: Title
element 10:
Length of Content: 15 characters
Metadata: UncategorizedText
element 11:
Length of Content: 18 characters
Metadata: UncategorizedText
element 12:
Length of Content: 37 characters
Metadata: UncategorizedText
element 13:
Length of Content: 22 characters
Metadata: Title
element 14:
Length of Content: 22 characters
Metadata: NarrativeText
eleme

## CSV or Structured Data

In [137]:
from langchain_community.document_loaders import CSVLoader

csv_loader = CSVLoader(file_path=r"data\structured_files\products.csv", encoding="utf8")

csv_documents = csv_loader.load()
print("CSVLoader loaded documents:")
print(f"Number of Documents Loaded: {len(csv_documents)}")
print(f"Content Preview:\n{csv_documents[0].page_content}...")
print(f"Metadata: {csv_documents[0].metadata}")
print('='*40)
print(f"Content Preview:\n{csv_documents[1].page_content}...")

CSVLoader loaded documents:
Number of Documents Loaded: 5
Content Preview:
Product: Laptop
Category: Electronics
Price: 999.99
Stock: 50
Description: High-performance laptop with 16GB RAM and 512GB SSD...
Metadata: {'source': 'data\\structured_files\\products.csv', 'row': 0}
Content Preview:
Product: Mouse
Category: Accessories
Price: 29.99
Stock: 200
Description: Wireless optical mouse with ergonomic design...
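
The output above shows one document per CSV row, with each cell rendered as a `column: value` line and the row index stored in the metadata. That mapping can be approximated with the standard library. This is a sketch of the idea, not the loader's actual code; the inline CSV is a shortened stand-in for `products.csv`, and plain dictionaries stand in for `Document` objects.

```python
import csv
import io

# Shortened stand-in for products.csv (values taken from the output above).
raw = io.StringIO(
    "Product,Category,Price,Stock\n"
    "Laptop,Electronics,999.99,50\n"
    "Mouse,Accessories,29.99,200\n"
)

documents = []
for row_index, row in enumerate(csv.DictReader(raw)):
    # One "column: value" line per cell, one document per row.
    content = "\n".join(f"{column}: {value}" for column, value in row.items())
    documents.append(
        {"page_content": content, "metadata": {"source": "products.csv", "row": row_index}}
    )

print(documents[0]["page_content"])
print(documents[1]["metadata"])
```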


## JSON

In [142]:
from langchain_community.document_loaders import JSONLoader

json_loader = JSONLoader(file_path=r"data\json_files\company_data.json", jq_schema=".employees[]", text_content=False)
json_documents = json_loader.load()

print("JSONLoader loaded documents:")
print(f"Number of Documents Loaded: {len(json_documents)}")
print(f"Content Preview:\n{json_documents[0].page_content}...")
print(f"Metadata: {json_documents[0].metadata}")
print('='*40)
print(f"Content Preview:\n{json_documents[1].page_content}...")

JSONLoader loaded documents:
Number of Documents Loaded: 2
Content Preview:
{"id": 1, "name": "John Doe", "role": "Software Engineer", "skills": ["Python", "JavaScript", "React"], "projects": [{"name": "RAG System", "status": "In Progress"}, {"name": "Data Pipeline", "status": "Completed"}]}...
Metadata: {'source': 'F:\\AI Space\\RAG\\data\\json_files\\company_data.json', 'seq_num': 1}
Content Preview:
{"id": 2, "name": "Jane Smith", "role": "Data Scientist", "skills": ["Python", "Machine Learning", "SQL"], "projects": [{"name": "ML Model", "status": "In Progress"}, {"name": "Analytics Dashboard", "status": "Planning"}]}...
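
The jq expression `.employees[]` emits each element of the `employees` array, and `text_content=False` lets non-string elements be serialized as JSON text, which is why each `page_content` above is a JSON string. A rough standard-library equivalent is sketched below. The inline data is a trimmed, partly made-up stand-in for `company_data.json` (the `company` field is an assumption), and dictionaries stand in for `Document` objects.

```python
import json

# Trimmed stand-in for company_data.json; the "company" field is made up.
data = json.loads("""
{
  "company": "Example Corp",
  "employees": [
    {"id": 1, "name": "John Doe", "role": "Software Engineer"},
    {"id": 2, "name": "Jane Smith", "role": "Data Scientist"}
  ]
}
""")

# Equivalent of jq's `.employees[]`: one document per array element,
# numbered from 1 like the seq_num values in the output above.
documents = [
    {
        "page_content": json.dumps(employee),
        "metadata": {"source": "company_data.json", "seq_num": seq_num},
    }
    for seq_num, employee in enumerate(data["employees"], start=1)
]

print(documents[0]["page_content"])
print(documents[1]["metadata"])
```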


## Database

In [150]:
from langchain_community.document_loaders import SQLDatabaseLoader
from langchain_community.utilities import SQLDatabase

db = SQLDatabase.from_uri("sqlite:///data/databases/company.db")

print("SQLDatabase connected successfully.")
print(f"tables in the database: {db.get_usable_table_names()}")  # get_table_names() is deprecated

db_loader = SQLDatabaseLoader(query="SELECT * FROM employees", db=db)

db_documents = db_loader.load()
print("SQLDatabaseLoader loaded documents:")
print(f"Number of Documents Loaded: {len(db_documents)}")
print(f"Content Preview:\n{db_documents[0].page_content}...")
print(f"Metadata: {db_documents[0].metadata}")
print('='*40)
print(f"Content Preview:\n{db_documents[1].page_content}...")

SQLDatabase connected successfully.
tables in the database: ['employees', 'projects']
SQLDatabaseLoader loaded documents:
Number of Documents Loaded: 4
Content Preview:
id: 1
name: John Doe
role: Senior Developer
department: Engineering
salary: 95000.0...
Metadata: {}
Content Preview:
id: 2
name: Jane Smith
role: Data Scientist
department: Analytics
salary: 105000.0...
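
Note that the documents above come back with empty metadata, so their origin is lost once they leave the loader. The sketch below shows one way to build row documents with provenance attached, using only `sqlite3` and an in-memory table. The table contents are a shortened copy of the rows shown above, the metadata keys are assumptions chosen for illustration, and dictionaries stand in for `Document` objects.

```python
import sqlite3

# In-memory stand-in for company.db with a shortened employees table.
conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row
conn.execute("CREATE TABLE employees (id INTEGER, name TEXT, role TEXT, department TEXT)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?, ?)",
    [
        (1, "John Doe", "Senior Developer", "Engineering"),
        (2, "Jane Smith", "Data Scientist", "Analytics"),
    ],
)

documents = []
for row in conn.execute("SELECT * FROM employees ORDER BY id"):
    record = dict(row)
    documents.append({
        # Same "column: value" layout as the loader output above.
        "page_content": "\n".join(f"{k}: {v}" for k, v in record.items()),
        # Hypothetical metadata keys: record the table and primary key.
        "metadata": {"source": "employees", "id": record["id"]},
    })
conn.close()

print(documents[0]["page_content"])
print(documents[1]["metadata"])
```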


# Data Parsing

## Text Splitting

In [107]:
from langchain_text_splitters import (
    RecursiveCharacterTextSplitter,
    CharacterTextSplitter,
    TokenTextSplitter
)

In [108]:
## Method 1 - CharacterTextSplitter

char_splitter = CharacterTextSplitter(
    separator="\n", # split at newlines
    chunk_size=200, # target maximum of 200 characters per chunk (a piece with no separator can exceed it)
    chunk_overlap=50, # overlap of 50 characters between chunks
    length_function=len # function to calculate length
)

text = text_documents[0].page_content
char_chunks = char_splitter.split_text(text)

print(f"Created {len(char_chunks)} chunks using CharacterTextSplitter.")
print("Sample Chunks:")
for i, chunk in enumerate(char_chunks):
    print(f"--- Chunk {i+1} ---")
    print(chunk)
    print()

Created 3 chunks using CharacterTextSplitter.
Sample Chunks:
--- Chunk 1 ---
Python Programming Introduction
Python is a high-level, interpreted programming language known for its simplicity and readability.

--- Chunk 2 ---
Created by Guido van Rossum and first released in 1991, Python has become one of the most popular
programming languages in the world.
Key Features:
- Easy to learn and use
- Extensive standard library

--- Chunk 3 ---
- Extensive standard library
- Cross-platform compatibility
- Strong community support
Python is widely used in web development, data science, artificial intelligence, and automation.



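> With `separator="\n\n"` and `chunk_size=200`, the splitter warns `Created a chunk of size 232, which is longer than the specified 200`: the second paragraph is 232 characters long and contains no `\n\n` to split on. The next cell therefore raises `chunk_size` to 232.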

In [109]:
## Method 1 (continued) - CharacterTextSplitter with a paragraph separator

char_splitter = CharacterTextSplitter(
    separator="\n\n", # split at blank lines (paragraph boundaries)
    chunk_size=232, # raised to 232 so the longest paragraph (232 characters) fits
    chunk_overlap=50, # overlap of 50 characters between chunks
    length_function=len # function to calculate length
)

text = text_documents[0].page_content
char_chunks = char_splitter.split_text(text)

print(f"Created {len(char_chunks)} chunks using CharacterTextSplitter.")
print("Sample Chunks:")
for i, chunk in enumerate(char_chunks):
    print(f"--- Chunk {i+1} ---")
    print(chunk)
    print()

Created 3 chunks using CharacterTextSplitter.
Sample Chunks:
--- Chunk 1 ---
Python Programming Introduction

--- Chunk 2 ---
Python is a high-level, interpreted programming language known for its simplicity and readability.
Created by Guido van Rossum and first released in 1991, Python has become one of the most popular
programming languages in the world.

--- Chunk 3 ---
Key Features:
- Easy to learn and use
- Extensive standard library
- Cross-platform compatibility
- Strong community support

Python is widely used in web development, data science, artificial intelligence, and automation.



In [110]:
## Method 2 - RecursiveCharacterTextSplitter
recursive_splitter = RecursiveCharacterTextSplitter(
    separators=["\n", "\n\n", " ", ""], # separators are tried in order (the library default is ["\n\n", "\n", " ", ""])
    chunk_size=200,
    chunk_overlap=50,
    length_function=len
)

rec_chunks = recursive_splitter.split_text(text)
print(f"Created {len(rec_chunks)} chunks using RecursiveCharacterTextSplitter.")
print("Sample Chunks:")
for i, chunk in enumerate(rec_chunks):
    print(f"--- Chunk {i+1} ---")
    print(chunk)

Created 4 chunks using RecursiveCharacterTextSplitter.
Sample Chunks:
--- Chunk 1 ---
Python Programming Introduction

Python is a high-level, interpreted programming language known for its simplicity and readability.
--- Chunk 2 ---
Created by Guido van Rossum and first released in 1991, Python has become one of the most popular
programming languages in the world.

Key Features:
- Easy to learn and use
--- Chunk 3 ---
Key Features:
- Easy to learn and use
- Extensive standard library
- Cross-platform compatibility
- Strong community support
--- Chunk 4 ---
- Strong community support

Python is widely used in web development, data science, artificial intelligence, and automation.


In [111]:
## Method 3 - TokenTextSplitter
token_splitter = TokenTextSplitter(
    chunk_size=50, # maximum tokens per chunk (not characters)
    chunk_overlap=10, # tokens shared between consecutive chunks
)

token_chunks = token_splitter.split_text(text)
print(f"Created {len(token_chunks)} chunks using TokenTextSplitter.")
print("Sample Chunks:")
for i, chunk in enumerate(token_chunks):
    print(f"--- Chunk {i+1} ---")
    print(chunk)

Created 3 chunks using TokenTextSplitter.
Sample Chunks:
--- Chunk 1 ---
Python Programming Introduction

Python is a high-level, interpreted programming language known for its simplicity and readability.
Created by Guido van Rossum and first released in 1991, Python has become one of the most popular
programming languages in
--- Chunk 2 ---
 one of the most popular
programming languages in the world.

Key Features:
- Easy to learn and use
- Extensive standard library
- Cross-platform compatibility
- Strong community support

Python is widely used in web
--- Chunk 3 ---
 community support

Python is widely used in web development, data science, artificial intelligence, and automation.


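**CharacterTextSplitter**
- pros: simple and predictable, good for structured text
- cons: splits on a single separator only, so it may break mid-sentence and can exceed `chunk_size` when a piece contains no separator
- use: when the text has a clear delimiter

**RecursiveCharacterTextSplitter**
- pros: respects text structure, tries multiple separators in order, best general-purpose splitter
- cons: slightly more complex
- use: default choice for most texts

**TokenTextSplitter**
- pros: respects model token limits, measures chunks in the same unit that embedding models and LLMs count
- cons: slower than character-based splitting, and chunk boundaries can fall mid-sentence (as in the output above)
- use: when working with token-limited models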
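
The interplay of `chunk_size` and `chunk_overlap` can be sketched as a greedy packing loop. This is a simplified illustration written for this notebook, not LangChain's implementation: `split_with_overlap` is a hypothetical function that splits on one separator, packs pieces until the limit would be exceeded, and carries trailing pieces (up to the overlap budget) into the next chunk.

```python
from typing import List


def split_with_overlap(text: str, separator: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Greedily pack separator-delimited pieces into chunks of at most
    chunk_size characters. Trailing pieces totalling at most chunk_overlap
    characters are repeated at the start of the next chunk. A single piece
    longer than chunk_size is kept whole, so it exceeds the limit."""
    pieces = [p for p in text.split(separator) if p]
    chunks: List[str] = []
    current: List[str] = []

    def joined_len(parts: List[str]) -> int:
        return len(separator.join(parts))

    for piece in pieces:
        if current and joined_len(current + [piece]) > chunk_size:
            chunks.append(separator.join(current))
            # Drop pieces from the front until the carry-over fits the overlap
            # budget and still leaves room for the incoming piece.
            while current and (
                joined_len(current) > chunk_overlap
                or joined_len(current + [piece]) > chunk_size
            ):
                current.pop(0)
        current.append(piece)
    if current:
        chunks.append(separator.join(current))
    return chunks


sample = "aaaa bbbb cccc dddd eeee"
chunks = split_with_overlap(sample, separator=" ", chunk_size=14, chunk_overlap=5)
print(chunks)

# A piece with no separator inside it cannot be shortened.
oversized = split_with_overlap("short averyveryverylongword", " ", chunk_size=10, chunk_overlap=0)
print(oversized)
```

With these inputs, `chunks` is `['aaaa bbbb cccc', 'cccc dddd eeee']`: the piece `cccc` is repeated because it fits within the 5-character overlap. The second call returns a 21-character chunk despite `chunk_size=10`, the same situation that produces the "longer than the specified" warning seen earlier.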