### Data Ingestion

In [1]:
### Document Structure

from langchain_core.documents import Document

In [2]:
doc = Document(
    page_content="this is the main content",
    metadata={
        "source": "example.txt",
        "pages": 1,
        "author": "Yash",
        "date_created": "2025-01-01",
    },
)
doc

Document(metadata={'source': 'example.txt', 'pages': 1, 'author': 'Yash', 'date_created': '2025-01-01'}, page_content='this is the main content')
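
Because every document carries a free-form `metadata` dict next to its `page_content`, a common next step is to select documents by a metadata value. The sketch below illustrates that idea with plain Python only. `SimpleDocument` and `filter_by_metadata` are hypothetical names introduced here as stand-ins; they are not part of LangChain, and the second record is made-up placeholder data.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class SimpleDocument:
    """Hypothetical stand-in that mirrors the two fields shown above."""
    page_content: str
    metadata: dict[str, Any] = field(default_factory=dict)


def filter_by_metadata(documents, key, value):
    """Return the documents whose metadata[key] equals value."""
    return [d for d in documents if d.metadata.get(key) == value]


docs = [
    SimpleDocument("this is the main content",
                   {"source": "example.txt", "author": "Yash"}),
    # Placeholder record, present only so the filter has something to exclude.
    SimpleDocument("some other content",
                   {"source": "other.txt", "author": "Someone"}),
]

matches = filter_by_metadata(docs, "author", "Yash")
print([d.metadata["source"] for d in matches])
```

The same attribute access (`.page_content`, `.metadata[...]`) should carry over to the real `Document` object created in the cell above.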

In [3]:
# Create the directory that will hold the sample text files
import os
os.makedirs("../data/text_files", exist_ok=True)

In [4]:
sample_texts={
    "../data/text_files/ml_intro.txt" : """Machine learning (ML) powers some of the most important technologies we use, from translation apps to autonomous vehicles. This course explains the core concepts behind ML.

ML offers a new way to solve problems, answer complex questions, and create new content. ML can predict the weather, estimate travel times, recommend songs, auto-complete sentences, summarize articles, and generate never-seen-before images.

In basic terms, ML is the process of training a piece of software, called a model, to make useful predictions or generate content (like text, images, audio, or video) from data.

For example, suppose we wanted to create an app to predict rainfall. We could use either a traditional approach or an ML approach. Using a traditional approach, we'd create a physics-based representation of the Earth's atmosphere and surface, computing massive amounts of fluid dynamics equations. This is incredibly difficult.

Using an ML approach, we would give an ML model enormous amounts of weather data until the ML model eventually learned the mathematical relationship between weather patterns that produce differing amounts of rain. We would then give the model the current weather data, and it would predict the amount of rain.""",
    "../data/text_files/dl_intro.txt" : """Deep Learning is a new area of Machine Learning research, which has been introduced with the objective of moving Machine Learning closer to one of its original goals: Artificial Intelligence. See these course notes for a brief introduction to Machine Learning for AI and an introduction to Deep Learning algorithms.

Deep Learning is about learning multiple levels of representation and abstraction that help to make sense of data such as images, sound, and text. For more about deep learning algorithms, see for example:

The monograph or review paper Learning Deep Architectures for AI (Foundations & Trends in Machine Learning, 2009).
The ICML 2009 Workshop on Learning Feature Hierarchies webpage has a list of references.
The LISA public wiki has a reading list and a bibliography.
Geoff Hinton has readings from 2009’s NIPS tutorial.
The tutorials presented here will introduce you to some of the most important deep learning algorithms and will also show you how to run them using Theano. Theano is a python library that makes writing deep learning models easy, and gives the option of training them on a GPU.

The algorithm tutorials have some prerequisites. You should know some python, and be familiar with numpy. Since this tutorial is about using Theano, you should read over the Theano basic tutorial first. Once you’ve done that, read through our Getting Started chapter – it introduces the notation, and downloadable datasets used in the algorithm tutorials, and the way we do optimization by stochastic gradient descent.
"""
}

for filepath, content in sample_texts.items():
    with open(filepath, "w", encoding="utf-8") as f:
        f.write(content)
print("Text files created")

Text files created
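
Before handing the files to a loader, it can help to read them back and confirm the write succeeded. The following is a minimal sketch of that check using only `pathlib`. To keep it self-contained, it writes into a temporary directory rather than `../data/text_files`, and the short strings are placeholders for the full texts above.

```python
import tempfile
from pathlib import Path

# Placeholder contents; the notebook cell above writes the full texts.
placeholder_texts = {
    "ml_intro.txt": "Machine learning placeholder text.",
    "dl_intro.txt": "Deep learning placeholder text.",
}

with tempfile.TemporaryDirectory() as tmp:
    text_dir = Path(tmp) / "text_files"
    text_dir.mkdir(parents=True, exist_ok=True)

    for name, content in placeholder_texts.items():
        (text_dir / name).write_text(content, encoding="utf-8")

    # Read each file back and record whether the content survived the round trip.
    round_trip_ok = {
        path.name: path.read_text(encoding="utf-8") == placeholder_texts[path.name]
        for path in sorted(text_dir.glob("*.txt"))
    }

print(round_trip_ok)
```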


In [5]:
### TextLoader

from langchain_community.document_loaders import TextLoader

loader = TextLoader("../data/text_files/ml_intro.txt", encoding="utf-8")
document = loader.load()  # returns a list containing a single Document
print(document)

[Document(metadata={'source': '../data/text_files/ml_intro.txt'}, page_content="Machine learning (ML) powers some of the most important technologies we use, from translation apps to autonomous vehicles. This course explains the core concepts behind ML.\n\nML offers a new way to solve problems, answer complex questions, and create new content. ML can predict the weather, estimate travel times, recommend songs, auto-complete sentences, summarize articles, and generate never-seen-before images.\n\nIn basic terms, ML is the process of training a piece of software, called a model, to make useful predictions or generate content (like text, images, audio, or video) from data.\n\nFor example, suppose we wanted to create an app to predict rainfall. We could use either a traditional approach or an ML approach. Using a traditional approach, we'd create a physics-based representation of the Earth's atmosphere and surface, computing massive amounts of fluid dynamics equations. This is incredibly di

In [6]:
### DirectoryLoader

from langchain_community.document_loaders import DirectoryLoader

dir_loader = DirectoryLoader(
    "../data/text_files",
    glob="**/*.txt",  # match .txt files in this directory and all subdirectories
    loader_cls=TextLoader,  # loader class applied to each matched file
    loader_kwargs={"encoding": "utf-8"},
    show_progress=False,
)

documents = dir_loader.load()
documents

[Document(metadata={'source': '..\\data\\text_files\\dl_intro.txt'}, page_content='Deep Learning is a new area of Machine Learning research, which has been introduced with the objective of moving Machine Learning closer to one of its original goals: Artificial Intelligence. See these course notes for a brief introduction to Machine Learning for AI and an introduction to Deep Learning algorithms.\n\nDeep Learning is about learning multiple levels of representation and abstraction that help to make sense of data such as images, sound, and text. For more about deep learning algorithms, see for example:\n\nThe monograph or review paper Learning Deep Architectures for AI (Foundations & Trends in Machine Learning, 2009).\nThe ICML 2009 Workshop on Learning Feature Hierarchies webpage has a list of references.\nThe LISA public wiki has a reading list and a bibliography.\nGeoff Hinton has readings from 2009’s NIPS tutorial.\nThe tutorials presented here will introduce you to some of the most i
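
To make the `glob="**/*.txt"` behaviour concrete, here is a rough plain-Python sketch of the same idea: walk a directory recursively, read each matching file, and record where it came from. This is an illustration only, not how `DirectoryLoader` is implemented. `load_text_directory` is a hypothetical helper, records are plain dicts rather than `Document` objects, and the files are placeholders written to a temporary directory.

```python
import tempfile
from pathlib import Path


def load_text_directory(root, pattern="**/*.txt", encoding="utf-8"):
    """Hypothetical helper: one record per file matching the glob pattern."""
    root = Path(root)
    records = []
    for path in sorted(root.glob(pattern)):
        if path.is_file():
            records.append({
                "page_content": path.read_text(encoding=encoding),
                # as_posix() gives forward slashes on every operating system.
                "metadata": {"source": path.relative_to(root).as_posix()},
            })
    return records


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "nested").mkdir()
    (root / "ml_intro.txt").write_text("ml placeholder", encoding="utf-8")
    (root / "nested" / "dl_intro.txt").write_text("dl placeholder", encoding="utf-8")
    (root / "notes.md").write_text("not matched by the pattern", encoding="utf-8")

    records = load_text_directory(root)

sources = [r["metadata"]["source"] for r in records]
print(sources)
```

The `source` values in the notebook output above contain backslashes because the notebook was run on Windows. If those paths are later compared across machines, normalising them (as the sketch does with `as_posix()`) is one option.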

In [7]:
from langchain_community.document_loaders import PyPDFLoader, PyMuPDFLoader

# Load every PDF under ../data/pdf; PyMuPDFLoader requires the pymupdf package
dir_loader = DirectoryLoader(
    "../data/pdf",
    glob="**/*.pdf",
    loader_cls=PyMuPDFLoader,  # loader class applied to each matched file
    show_progress=False,
)
pdf_documents = dir_loader.load()
pdf_documents


[Document(metadata={'producer': 'pdfTeX-1.40.27', 'creator': 'LaTeX with hyperref', 'creationdate': '2025-09-21T16:40:31+00:00', 'source': '..\\data\\pdf\\Resume_Latest.pdf', 'file_path': '..\\data\\pdf\\Resume_Latest.pdf', 'total_pages': 1, 'format': 'PDF 1.5', 'title': '', 'author': '', 'subject': '', 'keywords': '', 'moddate': '2025-09-21T16:40:31+00:00', 'trapped': '', 'modDate': 'D:20250921164031Z', 'creationDate': 'D:20250921164031Z', 'page': 0}, page_content='Yash Kumar\n+919471694118 — yashcoder9187@gmail.com — linkedin.com/in/yashcoder2403 — github.com/Anonymus-Coder2403\nBegusarai, India\nSummary\nApplied AI/Full-Stack engineer building LLM features end-to-end: data refinement to model-ready signals, LangChain/FastAPI APIs, and\nNext.js/React frontends. Experience in SQL/NoSQL schema design, ETL with Python + SQL, vector search (Pinecone/FAISS)\nSkills and Interests\nLanguages: Python, Java, C/C++, JavaScript, TypeScript, SQL\nAI/LLMs: LangChain, Prompt Engineering, Hugging F

In [8]:
type(pdf_documents[0])

langchain_core.documents.base.Document
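
The PDF metadata printed above includes `page` and `total_pages`, which suggests the loader emits one document per page. Assuming that holds, a multi-page PDF arrives as several entries that may need to be regrouped by `source`. The sketch below shows one way to do that. The records are hypothetical placeholders (plain dicts with made-up file names), not output from the notebook.

```python
from collections import defaultdict

# Hypothetical page-level records shaped like the metadata printed above.
page_records = [
    {"page_content": "page one text", "metadata": {"source": "a.pdf", "page": 0}},
    {"page_content": "page two text", "metadata": {"source": "a.pdf", "page": 1}},
    {"page_content": "single page", "metadata": {"source": "b.pdf", "page": 0}},
]

# Collect the pages that belong to each source file.
pages_by_source = defaultdict(list)
for record in page_records:
    pages_by_source[record["metadata"]["source"]].append(record)

# Join each file's pages in page order to recover the full text.
full_text = {
    source: "\n".join(
        r["page_content"]
        for r in sorted(pages, key=lambda r: r["metadata"]["page"])
    )
    for source, pages in pages_by_source.items()
}

print({source: len(pages) for source, pages in pages_by_source.items()})
```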