## Data ingestion 

Raw PDF → Loader → Document objects → Splitter → Smaller Document objects → Embeddings → Vector DB

`langchain_core.documents` is the module in LangChain that defines the `Document` class.

In [2]:
from langchain_core.documents import Document
import langchain

A `Document` mainly has two fields:

- `page_content` → the actual text.
- `metadata` → extra information about that text.

In [3]:
doc = Document(
    page_content="this is the main text content I am using to learn RAG",
    metadata={
        "source":"example.txt",
        "page":1,
        "author":"Taneeeshh"
    }
)
doc.metadata
#doc.page_content

{'source': 'example.txt', 'page': 1, 'author': 'Taneeeshh'}

In [4]:
import os 
os.makedirs("../data/text_files",exist_ok=True)

In [5]:
sample_text={
    "../data/text_files/intro.txt":"""The Evolution of Artificial Intelligence: 
    From Logic to Learning
Artificial Intelligence (AI) did not begin as a field obsessed with neural networks and billion-parameter models. Its roots lie in formal logic, symbolic reasoning, and the philosophical question of whether machines could simulate human thought. Early researchers believed intelligence could be replicated through carefully designed rules and structured representations of knowledge. This approach, often called symbolic AI, dominated the field during the mid-20th century.
One of the earliest milestones in AI research was the development of the perceptron in the 1950s. The perceptron was a simple computational model inspired by the biological neuron. While limited in capability, it demonstrated that machines could learn from data rather than rely purely on predefined rules. However, limitations in computational power and theoretical understanding led to what became known as the “AI winter,” a period of reduced funding and slowed progress.
The resurgence of AI began in the 1990s and accelerated in the 2000s with the availability of large datasets and improved computing resources. Machine learning emerged as a dominant paradigm, shifting the focus from manually encoding knowledge to training models on examples. Instead of telling the computer explicitly how to solve a problem, researchers allowed algorithms to identify patterns on their own.
Deep learning, a subset of machine learning, revolutionized the field further. Neural networks with multiple hidden layers demonstrated remarkable capabilities in tasks such as image recognition, natural language processing, and speech synthesis. Convolutional neural networks transformed computer vision, while recurrent neural networks and later transformer architectures reshaped language understanding.
The introduction of transformer models marked a significant breakthrough. Transformers rely on self-attention mechanisms, allowing them to process relationships between words in parallel rather than sequentially. This architecture enabled the training of extremely large language models capable of generating coherent and context-aware text. The ability to pre-train models on massive corpora and fine-tune them for specific tasks significantly reduced the barrier to building AI applications.
Today, AI systems are embedded in everyday technology. From recommendation engines and voice assistants to autonomous vehicles and medical diagnostic tools, machine learning algorithms operate behind the scenes of modern life. However, the rapid advancement of AI has also raised ethical and societal concerns. Issues such as data privacy, algorithmic bias, job displacement, and misinformation have become central topics of discussion.
Looking ahead, the future of AI depends not only on technical innovation but also on responsible governance. As models grow more capable, researchers and policymakers must collaborate to ensure that AI systems remain transparent, fair, and aligned with human values. The challenge is no longer merely to build intelligent systems, but to build systems that are beneficial, safe, and trustworthy.
Artificial Intelligence has evolved from rule-based logic engines to large-scale learning systems capable of creative output. Its trajectory reflects a broader shift in computing—from explicit instruction to emergent behavior derived from data. Whether AI will ultimately replicate human-level reasoning remains uncertain, but its transformative impact on society is already undeniable.
"""
}

for filepath,content in sample_text.items():
    with open(filepath,'w',encoding="utf-8") as f:
        f.write(content)
print("Sample text file created ")

Sample text file created 


### Text loader 

`TextLoader` reads a text file and converts it into LangChain `Document` objects.
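
To make the idea concrete, here is a minimal sketch of what a text loader does conceptually, using only the standard library. `SimpleDocument` and `load_text_file` are hypothetical names invented for this illustration; they are not LangChain's classes, and the real `TextLoader` may handle more cases (for example encoding detection).

```python
import tempfile
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class SimpleDocument:
    """Hypothetical stand-in for a Document: text plus metadata."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def load_text_file(path, encoding="utf-8"):
    """Read one file and wrap it in a single document that records its source."""
    text = Path(path).read_text(encoding=encoding)
    return [SimpleDocument(page_content=text, metadata={"source": str(path)})]


# Usage: write a small file to a temporary folder, then load it back
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "intro.txt"
    sample.write_text("From Logic to Learning", encoding="utf-8")
    docs = load_text_file(sample)

print(len(docs), docs[0].page_content)
```

One file becomes one document, and the file path ends up under the `source` key of `metadata`, which matches the shape of the `TextLoader` output shown in this notebook.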

In [6]:
from langchain_community.document_loaders import TextLoader
loader=TextLoader("../data/text_files/intro.txt",encoding='utf-8')
loader.load()

[Document(metadata={'source': '../data/text_files/intro.txt'}, page_content='The Evolution of Artificial Intelligence: \n    From Logic to Learning\nArtificial Intelligence (AI) did not begin as a field obsessed with neural networks and billion-parameter models. Its roots lie in formal logic, symbolic reasoning, and the philosophical question of whether machines could simulate human thought. Early researchers believed intelligence could be replicated through carefully designed rules and structured representations of knowledge. This approach, often called symbolic AI, dominated the field during the mid-20th century.\nOne of the earliest milestones in AI research was the development of the perceptron in the 1950s. The perceptron was a simple computational model inspired by the biological neuron. While limited in capability, it demonstrated that machines could learn from data rather than rely purely on predefined rules. However, limitations in computational power and theoretical understan

### Directory loader

`DirectoryLoader` loads every file in a folder that matches a glob pattern, using the loader class you pass in (`loader_cls`) for each file.
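
As a rough illustration of the glob matching, the sketch below walks a folder with `pathlib` and builds one record per matching file. `load_directory` is a hypothetical helper written for this example, and plain dicts stand in for `Document` objects; it is not how `DirectoryLoader` is implemented.

```python
import tempfile
from pathlib import Path


def load_directory(root, glob="**/*.txt", encoding="utf-8"):
    """Return one dict per matching file (hypothetical stand-in for Document)."""
    docs = []
    for path in sorted(Path(root).glob(glob)):
        if path.is_file():
            docs.append({
                "page_content": path.read_text(encoding=encoding),
                "metadata": {"source": str(path)},
            })
    return docs


# Usage: two .txt files (one nested) and one .md file that should be skipped
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "sub").mkdir()
    (root / "intro.txt").write_text("top level", encoding="utf-8")
    (root / "sub" / "notes.txt").write_text("nested", encoding="utf-8")
    (root / "skip.md").write_text("ignored", encoding="utf-8")
    docs = load_directory(tmp)

names = sorted(Path(d["metadata"]["source"]).name for d in docs)
print(names)  # ['intro.txt', 'notes.txt']
```

The `**/` prefix in the pattern is what makes the search recursive: it matches the folder itself and every subfolder.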


In [7]:
from langchain_community.document_loaders import DirectoryLoader

dir_loader=DirectoryLoader(
    "../data/text_files",   # folder path
    glob="**/*.txt",       # match all .txt files (including subfolders)
    loader_cls=TextLoader, # use TextLoader for each file
    loader_kwargs={'encoding':'utf-8'},
    show_progress=False     # set to True to show a progress bar
)

documents=dir_loader.load()
documents


[Document(metadata={'source': '..\\data\\text_files\\intro.txt'}, page_content='The Evolution of Artificial Intelligence: \n    From Logic to Learning\nArtificial Intelligence (AI) did not begin as a field obsessed with neural networks and billion-parameter models. Its roots lie in formal logic, symbolic reasoning, and the philosophical question of whether machines could simulate human thought. Early researchers believed intelligence could be replicated through carefully designed rules and structured representations of knowledge. This approach, often called symbolic AI, dominated the field during the mid-20th century.\nOne of the earliest milestones in AI research was the development of the perceptron in the 1950s. The perceptron was a simple computational model inspired by the biological neuron. While limited in capability, it demonstrated that machines could learn from data rather than rely purely on predefined rules. However, limitations in computational power and theoretical unders

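`loader_kwargs` is forwarded to the constructor of `loader_cls`, so the valid keys depend on which loader you use:<br>
For `TextLoader`: `loader_kwargs={"encoding": "utf-8"}`<br>
For `PyPDFLoader`: `loader_kwargs={"extract_images": True}`<br>
For `CSVLoader`: `loader_kwargs={"csv_args": {"delimiter": ","}}`<br>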
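
The forwarding can be pictured with a small sketch. `ToyTextLoader` and `build_loader` are hypothetical names used only here; the assumption is that the directory loader unpacks the dict into keyword arguments for whichever class it was given.

```python
class ToyTextLoader:
    """Hypothetical loader that accepts a path and an encoding option."""

    def __init__(self, file_path, encoding="ascii"):
        self.file_path = file_path
        self.encoding = encoding


def build_loader(loader_cls, file_path, loader_kwargs=None):
    """Unpack the kwargs dict into the chosen loader's constructor."""
    return loader_cls(file_path, **(loader_kwargs or {}))


loader = build_loader(ToyTextLoader, "intro.txt", {"encoding": "utf-8"})
print(loader.encoding)  # utf-8

# A key the loader does not understand fails at construction time
try:
    build_loader(ToyTextLoader, "intro.txt", {"csv_args": {"delimiter": ","}})
    rejected = False
except TypeError:
    rejected = True
print(rejected)  # True
```

This is why kwargs meant for one loader should not be reused with another: an unknown keyword raises an error instead of being ignored.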

### PDF loader

In [11]:
from langchain_community.document_loaders import PyMuPDFLoader,PyPDFLoader

dir_loader=DirectoryLoader(
    "../data/pdf_files",   # folder path
    glob="**/*.pdf",       # match all .pdf files (including subfolders)
    loader_cls=PyMuPDFLoader, # use PyMuPDFLoader for each file
    show_progress=False     # set to True to show a progress bar
)

pdf_documents=dir_loader.load()
pdf_documents

[Document(metadata={'producer': 'Corel PDF Engine Version 23.5.0.506', 'creator': 'CorelDRAW 2021', 'creationdate': '2025-07-08T11:23:57+05:30', 'source': '..\\data\\pdf_files\\MIT-UG Academic HandBook 2025 -2026 - Revised.pdf', 'file_path': '..\\data\\pdf_files\\MIT-UG Academic HandBook 2025 -2026 - Revised.pdf', 'total_pages': 409, 'format': 'PDF 1.7', 'title': '', 'author': 'Harish Shetty', 'subject': '', 'keywords': '', 'moddate': '2025-07-09T11:29:33+05:30', 'trapped': '', 'modDate': "D:20250709112933+05'30'", 'creationDate': "D:20250708112357+05'30'", 'page': 0}, page_content='(A constituent unit of MAHE, Manipal)\nMANIPAL INSTITUTE OF TECHNOLOGY\nMANIPAL\n2025-26\nB.Tech. 2025 - 26\nAcademic \nProgramme\nHand\x00book'),
 Document(metadata={'producer': 'Corel PDF Engine Version 23.5.0.506', 'creator': 'CorelDRAW 2021', 'creationdate': '2025-07-08T11:23:57+05:30', 'source': '..\\data\\pdf_files\\MIT-UG Academic HandBook 2025 -2026 - Revised.pdf', 'file_path': '..\\data\\pdf_files\