### DATA INGESTION

In [3]:
## Document data structure

from langchain_core.documents import Document

In [4]:
doc = Document(
    page_content="This is the sample document content taken from example.txt",
    metadata={
        "source":"example.txt",
        "pages":1,
        "author":"Deepak",
        "date_created":"16-11-2025"
    }
)

In [5]:
doc

Document(metadata={'source': 'example.txt', 'pages': 1, 'author': 'Deepak', 'date_created': '16-11-2025'}, page_content='This is the sample document content taken from example.txt')
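
The cell above pairs a block of text (`page_content`) with a free-form `metadata` dict. A minimal sketch of why that pairing is useful, using a hypothetical `SimpleDocument` dataclass as a stand-in (it is not LangChain's `Document`; the second record and its values are made up for illustration):

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class SimpleDocument:
    """Hypothetical stand-in for a document record: text plus free-form metadata."""
    page_content: str
    metadata: dict[str, Any] = field(default_factory=dict)


docs = [
    SimpleDocument("Notes on loaders", {"source": "example.txt", "author": "Deepak"}),
    SimpleDocument("Notes on PDFs", {"source": "other.txt", "author": "Someone"}),  # made-up record
]

# Metadata lets us select documents without inspecting the text itself.
by_author = [d for d in docs if d.metadata.get("author") == "Deepak"]
print(by_author[0].metadata["source"])
```

The same idea carries over to retrieval later on: the text is what gets searched, while the metadata is what gets filtered and cited.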

In [6]:
## Creating a simple text file
import os

os.makedirs("../data/text_files",exist_ok=True)

In [7]:
sample_texts = {
    "../data/text_files/python_intro.txt":"""
    What is Python?
Python is a popular programming language. It was created by Guido van Rossum, and released in 1991.

It is used for:

web development (server-side),
software development,
mathematics,
system scripting.
What can Python do?
Python can be used on a server to create web applications.
Python can be used alongside software to create workflows.
Python can connect to database systems. It can also read and modify files.
Python can be used to handle big data and perform complex mathematics.
Python can be used for rapid prototyping, or for production-ready software development.
Why Python?
Python works on different platforms (Windows, Mac, Linux, Raspberry Pi, etc).
Python has a simple syntax similar to the English language.
Python has syntax that allows developers to write programs with fewer lines than some other programming languages.
Python runs on an interpreter system, meaning that code can be executed as soon as it is written. This means that prototyping can be very quick.
Python can be treated in a procedural way, an object-oriented way or a functional way.
Good to know
The most recent major version of Python is Python 3, which we shall be using in this tutorial.
In this tutorial Python will be written in a text editor. It is possible to write Python in an Integrated Development Environment, such as Thonny, Pycharm, Netbeans or Eclipse which are particularly useful when managing larger collections of Python files.
Python Syntax compared to other programming languages
Python was designed for readability, and has some similarities to the English language with influence from mathematics.
Python uses new lines to complete a command, as opposed to other programming languages which often use semicolons or parentheses.
Python relies on indentation, using whitespace, to define scope; such as the scope of loops, functions and classes. Other programming languages often use curly-brackets for this purpose.
"""
}


for filepath,content in sample_texts.items():
    with open(filepath,'w',encoding="utf-8") as f:
        f.write(content)

print("Sample text file created ")

Sample text file created 
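
Before loading, it can help to confirm the files were actually written as expected. A small sketch using only the standard library; `write_text_files` is a hypothetical helper, and a temporary directory stands in for `../data/text_files` so the example does not touch real data:

```python
import tempfile
from pathlib import Path


def write_text_files(files: dict[str, str], root: Path) -> list[Path]:
    """Write each relative filename -> content pair under root and return the paths."""
    written = []
    for name, content in files.items():
        path = root / name
        path.parent.mkdir(parents=True, exist_ok=True)  # create missing folders
        path.write_text(content, encoding="utf-8")
        written.append(path)
    return written


with tempfile.TemporaryDirectory() as tmp:
    paths = write_text_files({"text_files/python_intro.txt": "What is Python?\n"}, Path(tmp))
    # Read the file back to confirm the round trip preserved the content.
    round_trip = paths[0].read_text(encoding="utf-8")
    sizes = {p.name: p.stat().st_size for p in paths}

print(sizes)
```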


In [8]:
### TextLoader
from langchain_community.document_loaders import TextLoader

loader = TextLoader("../data/text_files/python_intro.txt",
                    encoding="utf-8")

In [9]:
loader

<langchain_community.document_loaders.text.TextLoader at 0x277fb93c040>

In [10]:
document = loader.load()
print(document)

[Document(metadata={'source': '../data/text_files/python_intro.txt'}, page_content='\n    What is Python?\nPython is a popular programming language. It was created by Guido van Rossum, and released in 1991.\n\nIt is used for:\n\nweb development (server-side),\nsoftware development,\nmathematics,\nsystem scripting.\nWhat can Python do?\nPython can be used on a server to create web applications.\nPython can be used alongside software to create workflows.\nPython can connect to database systems. It can also read and modify files.\nPython can be used to handle big data and perform complex mathematics.\nPython can be used for rapid prototyping, or for production-ready software development.\nWhy Python?\nPython works on different platforms (Windows, Mac, Linux, Raspberry Pi, etc).\nPython has a simple syntax similar to the English language.\nPython has syntax that allows developers to write programs with fewer lines than some other programming languages.\nPython runs on an interpreter system
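
Note that the loaded `page_content` starts with `'\n    What is Python?'`: the leading newline and indentation come straight from the triple-quoted string written earlier. A minimal sketch of one way to tidy such text before further processing; `normalize_text` is a hypothetical helper, not part of LangChain:

```python
def normalize_text(text: str) -> str:
    """Strip whitespace around every line and drop leading/trailing blank lines."""
    lines = [line.strip() for line in text.splitlines()]
    return "\n".join(lines).strip()


raw = "\n    What is Python?\nPython is a popular programming language.\n\nIt is used for:\n"
clean = normalize_text(raw)
print(repr(clean))
```

Blank lines inside the text are kept, so paragraph boundaries survive the cleanup.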

In [11]:
### Directory loader
from langchain_community.document_loaders import DirectoryLoader

dir_loader= DirectoryLoader(
    path="../data/text_files",
    glob="**/*.txt",  # Pattern to match files
    loader_cls=TextLoader,
    loader_kwargs={"encoding":"utf-8"},  # match the encoding used when the files were written
    # show_progress=True
)

In [12]:
dir_doc= dir_loader.load()
print(dir_doc)

[Document(metadata={'source': '..\\data\\text_files\\demo.txt'}, page_content='\n    What is Python?\nPython is a popular programming language. It was created by Guido van Rossum, and released in 1991.\n\nIt is used for:\n\nweb development (server-side),\nsoftware development,\nmathematics,\nsystem scripting.\nWhat can Python do?\nPython can be used on a server to create web applications.\nPython can be used alongside software to create workflows.\nPython can connect to database systems. It can also read and modify files.\nPython can be used to handle big data and perform complex mathematics.\nPython can be used for rapid prototyping, or for production-ready software development.\nWhy Python?\nPython works on different platforms (Windows, Mac, Linux, Raspberry Pi, etc).\nPython has a simple syntax similar to the English language.\nPython has syntax that allows developers to write programs with fewer lines than some other programming languages.\nPython runs on an interpreter system, mea
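
The `glob` pattern decides which files get loaded, so every `.txt` file already in the folder is picked up (the output above starts with `demo.txt`, not only the file written earlier). A sketch for previewing the matches, assuming the pattern behaves like `pathlib`'s glob; `list_matching_files` and the file names are hypothetical, and a temporary directory is used in place of `../data/text_files`:

```python
import tempfile
from pathlib import Path


def list_matching_files(root: Path, pattern: str = "**/*.txt") -> list[str]:
    """Return sorted POSIX-style relative paths under root that match pattern."""
    return sorted(p.relative_to(root).as_posix() for p in root.glob(pattern) if p.is_file())


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "nested").mkdir()
    # "notes.md" is included to show that non-matching files are skipped.
    for name in ("python_intro.txt", "demo.txt", "nested/extra.txt", "notes.md"):
        (root / name).write_text("placeholder", encoding="utf-8")
    matches = list_matching_files(root)

print(matches)
```

Because `**` also matches zero directories, top-level files and files in subfolders are both included.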

In [13]:
## Reading PDF files

from langchain_community.document_loaders import PyPDFLoader, PyMuPDFLoader

dir_loader= DirectoryLoader(
    path="../data/pdf",
    glob="**/*.pdf",  # Pattern to match files
    loader_cls=PyMuPDFLoader,
    # loader_kwargs={"encoding":"utf-8"},
    # show_progress=True
)

pdf_doc= dir_loader.load()
print(pdf_doc)

[Document(metadata={'producer': 'Skia/PDF m137 Google Docs Renderer', 'creator': '', 'creationdate': '', 'source': '..\\data\\pdf\\1Notes.pdf', 'file_path': '..\\data\\pdf\\1Notes.pdf', 'total_pages': 2, 'format': 'PDF 1.4', 'title': 'Untitled document', 'author': '', 'subject': '', 'keywords': '', 'moddate': '', 'trapped': '', 'modDate': '', 'creationDate': '', 'page': 0}, page_content='Lecture 1 Notes :\u200b\n\u200b\n1.\u2060 \u2060What is Low‑Level Design (LLD)? \n \nDefinition: Designing the internal structure (“skeleton”) of an application by identifying \nclasses/objects, their relationships, data flows, and how DSA solutions plug into this \nstructure. \n●\u200b DSA: Solves isolated problems (e.g. “find shortest path in an array/graph”) using \nalgorithms like binary search, quicksort, Dijkstra’s, heaps, etc. \n●\u200b LLD: Determines which objects exist in the system and how they interact, then \napplies DSA inside that structure. \n \n2.\u2060 \u2060Illustrative Story: Two Ap
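
The metadata above carries `source`, `page`, and `total_pages`, which suggests one `Document` per PDF page. A short sketch of summarising such records per file; the list below is hypothetical sample data shaped like those keys (only `1Notes.pdf` appears in the real output), and `pages_by_source` is an illustrative helper:

```python
from collections import defaultdict

# Hypothetical metadata records using the keys seen in the output above.
page_metadata = [
    {"source": "1Notes.pdf", "page": 0, "total_pages": 2},
    {"source": "1Notes.pdf", "page": 1, "total_pages": 2},
    {"source": "other.pdf", "page": 0, "total_pages": 1},  # made-up second file
]


def pages_by_source(records):
    """Group page numbers by source file."""
    pages = defaultdict(list)
    for meta in records:
        pages[meta["source"]].append(meta["page"])
    return {source: sorted(nums) for source, nums in pages.items()}


summary = pages_by_source(page_metadata)
print(summary)
```

With real loader output, the same grouping could be fed from `[d.metadata for d in pdf_doc]` to check that no pages are missing.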