## Data Ingestion

### Document data structure

In [10]:
from langchain_core.documents import Document

In [11]:
%pip install langchain langchain-community pymupdf


Note: you may need to restart the kernel to use updated packages.





In [12]:
doc = Document(
    page_content="This is the main page content I am using to create the RAG",
    metadata={
        "source": "Example.txt",
        "pages": 1,
        "author": "Manish Basnet",
        "date_created": "2024-06-01",
    },
)
doc


Document(metadata={'source': 'Example.txt', 'pages': 1, 'author': 'Manish Basnet', 'date_created': '2024-06-01'}, page_content='This is the main page content I am using to create the RAG')
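
As a rough mental model, a `Document` is a piece of text plus a dictionary of metadata. The sketch below uses a hypothetical `SimpleDocument` dataclass (a stand-in written for illustration, not LangChain's actual class) to show that shape with only the standard library.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class SimpleDocument:
    """Hypothetical stand-in that mirrors the two fields used above."""

    page_content: str
    metadata: dict[str, Any] = field(default_factory=dict)


simple_doc = SimpleDocument(
    page_content="This is the main page content I am using to create the RAG",
    metadata={"source": "Example.txt", "pages": 1},
)

# Metadata is an ordinary dict, so individual keys can be read directly.
source_name = simple_doc.metadata["source"]

# The text itself is an ordinary string.
word_count = len(simple_doc.page_content.split())
```

The real `Document` exposes the same two attributes, `page_content` and `metadata`, as the output above shows.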

In [13]:
# Create a directory for the sample text files
import os

os.makedirs("data/text_files", exist_ok=True)

In [14]:
sample_text={
    "./data/text_files/python_intro.txt": """Python 🐍 is a versatile, high-level programming language known for its readability and simple syntax, which often looks like plain English. It is a favorite for everything from web development to data science and artificial intelligence 🤖.

Let’s build a basic introduction file together. We can start by looking at three fundamental concepts. Which of these would you like to explore first?

Variables and Data Types: How Python stores information like names (strings), ages (integers), or prices (floats).

Control Flow: Using if statements and loops to let your code make decisions and repeat tasks.

Functions: Creating reusable blocks of code to keep your scripts organized and efficient."""
}

for file_path, content in sample_text.items():
    with open(file_path, "w", encoding="utf-8") as f:
        f.write(content)    

print("Sample text file created successfully!")

Sample text file created successfully!


In [15]:
sample_text={
    "./data/text_files/machine_learning.txt": """Machine learning 🤖 is a subset of artificial intelligence that focuses on building systems that learn from data to make predictions or decisions. Instead of following explicit rules (like a traditional program), these systems identify patterns within vast amounts of information. To build your introductory text for machine learning, we can explore these three core areas. I'll ask guiding questions along the way to help you fill in the details:

The Three Main Types: Understanding supervised learning (learning with labels), unsupervised learning (finding hidden patterns), and reinforcement learning (learning through trial and error).

The Training Process: How we take raw data, split it into training and testing sets, and help the model "learn" from its mistakes.

Real-World Applications: Exploring how machine learning powers things you use every day, like recommendation engines, spam filters, or image recognition."""
}

for file_path, content in sample_text.items():
    with open(file_path, "w", encoding="utf-8") as f:
        f.write(content)    

print("Sample text file created successfully!")

Sample text file created successfully!
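
A quick way to confirm that files were written with the intended encoding is to read them back and compare. The sketch below is self-contained: it writes two short placeholder strings (not the full sample texts above) into a temporary directory that stands in for `data/text_files`, then checks the round trip.

```python
import tempfile
from pathlib import Path

# Short placeholder contents; the emoji exercises the UTF-8 encoding.
samples = {
    "python_intro.txt": "Python 🐍 is a versatile, high-level programming language.",
    "machine_learning.txt": "Machine learning 🤖 is a subset of artificial intelligence.",
}

# A temporary directory stands in for data/text_files.
with tempfile.TemporaryDirectory() as tmp_dir:
    text_dir = Path(tmp_dir) / "text_files"
    text_dir.mkdir(parents=True, exist_ok=True)

    for name, content in samples.items():
        (text_dir / name).write_text(content, encoding="utf-8")

    # Read every file back and compare it with what was written.
    round_trip_ok = all(
        (text_dir / name).read_text(encoding="utf-8") == content
        for name, content in samples.items()
    )
    written_names = sorted(path.name for path in text_dir.glob("*.txt"))
```

Passing `encoding="utf-8"` on both the write and the read matters here because the sample texts contain emoji, which a platform default encoding may not handle.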


In [18]:
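### Text Loader
from langchain_community.document_loaders import TextLoader

# The path is relative to the notebook's working directory; adjust it if the sample files were written elsewhere
loader = TextLoader("../data/text_files/python_intro.txt", encoding="utf-8")
document = loader.load()  # Returns a list containing one Document for the file
print(document)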


[Document(metadata={'source': '../data/text_files/python_intro.txt'}, page_content='Python 🐍 is a versatile, high-level programming language known for its readability and simple syntax, which often looks like plain English. It is a favorite for everything from web development to data science and artificial intelligence 🤖.\n\nLet’s build a basic introduction file together. We can start by looking at three fundamental concepts. Which of these would you like to explore first?\n\nVariables and Data Types: How Python stores information like names (strings), ages (integers), or prices (floats).\n\nControl Flow: Using if statements and loops to let your code make decisions and repeat tasks.\n\nFunctions: Creating reusable blocks of code to keep your scripts organized and efficient.')]


In [20]:
### Directory Loader
from langchain_community.document_loaders import DirectoryLoader

# Load all the text files from the directory
dir_loader = DirectoryLoader(
    "../data/text_files",  # Directory to load files from
    glob="**/*.txt",  # Pattern to match all text files in the directory and subdirectories
    loader_cls=TextLoader,  # Loader class to use for each file
    loader_kwargs={"encoding": "utf-8"},  # Additional arguments to pass to the loader class
    show_progress=False,  # Whether to display a progress bar during loading
)

documents = dir_loader.load()
print(documents)

[Document(metadata={'source': '..\\data\\text_files\\machine_learning.txt'}, page_content='Machine learning 🤖 is a subset of artificial intelligence that focuses on building systems that learn from data to make predictions or decisions. Instead of following explicit rules (like a traditional program), these systems identify patterns within vast amounts of information. To build your introductory text for machine learning, we can explore these three core areas. I\'ll ask guiding questions along the way to help you fill in the details:\n\nThe Three Main Types: Understanding supervised learning (learning with labels), unsupervised learning (finding hidden patterns), and reinforcement learning (learning through trial and error).\n\nThe Training Process: How we take raw data, split it into training and testing sets, and help the model "learn" from its mistakes.\n\nReal-World Applications: Exploring how machine learning powers things you use every day, like recommendation engines, spam filt
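
To see what the recursive `**/*.txt` pattern selects, here is a minimal standard-library sketch. The helper `load_text_directory` is hypothetical and only imitates the idea of a directory loader (glob, read, attach a `source`); it is not how `DirectoryLoader` is implemented.

```python
import tempfile
from pathlib import Path


def load_text_directory(root, pattern="**/*.txt", encoding="utf-8"):
    """Hypothetical helper: read every file under root that matches pattern."""
    records = []
    for path in sorted(Path(root).glob(pattern)):
        if path.is_file():
            records.append(
                {
                    "page_content": path.read_text(encoding=encoding),
                    "metadata": {"source": str(path)},
                }
            )
    return records


# Build a small throwaway tree: one top-level file, one nested file, one non-match.
with tempfile.TemporaryDirectory() as tmp_dir:
    root = Path(tmp_dir)
    (root / "nested").mkdir()
    (root / "a.txt").write_text("top-level file", encoding="utf-8")
    (root / "nested" / "b.txt").write_text("nested file", encoding="utf-8")
    (root / "notes.md").write_text("not matched by the pattern", encoding="utf-8")

    records = load_text_directory(root)
    loaded_names = [Path(record["metadata"]["source"]).name for record in records]
```

The `**` segment matches the directory itself and any level of subdirectory, so both `a.txt` and `nested/b.txt` are picked up while `notes.md` is skipped.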

In [22]:
### PDF Directory Loader
from langchain_community.document_loaders import PyMuPDFLoader

# Load all the PDF files from the directory
dir_loader = DirectoryLoader(
    "../data/pdf",  # Directory to load files from
    glob="**/*.pdf",  # Pattern to match all PDF files in the directory and subdirectories
    loader_cls=PyMuPDFLoader,  # Loader class to use for each file
    show_progress=False,  # Whether to display a progress bar during loading
)

pdf_documents = dir_loader.load()
print(pdf_documents)

[Document(metadata={'producer': 'Microsoft® Word 2019', 'creator': 'Microsoft® Word 2019', 'creationdate': '2026-01-31T21:58:09+05:45', 'source': '..\\data\\pdf\\Project Proposal A-M-S.pdf', 'file_path': '..\\data\\pdf\\Project Proposal A-M-S.pdf', 'total_pages': 16, 'format': 'PDF 1.7', 'title': '', 'author': 'Ashutosh Adhikari', 'subject': '', 'keywords': '', 'moddate': '2026-01-31T21:58:09+05:45', 'trapped': '', 'modDate': "D:20260131215809+05'45'", 'creationDate': "D:20260131215809+05'45'", 'page': 0}, page_content='TRIBHUVAN UNIVERSITY \nInstitute of Science and Technology \n \n \nA Project Proposal \nOn \n"E-Voting System" \n \nSubmitted to \nDepartment of Statistics and Computer Science \nPatan Multiple Campus \n \nIn partial fulfillement of the requriments for Bachelor Degree in Computer \nscience and Information Technology \n \n \nSubmitted By: \nAshutosh Adhikari (79010020) \nManish Basnet (79010054) \nSnehal Sigdel (79010119) \n \nDate: \n1st Feb 2026'), Document(metadata=

In [24]:
type(pdf_documents[0])

langchain_core.documents.base.Document
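
The PDF metadata above carries `page` and `total_pages` keys, which suggests the loader produced one `Document` per page. Under that assumption, a common follow-up is to count how many pages came from each source file. The sketch below uses plain dictionaries with made-up file names in place of real `Document.metadata` values, so it runs without any PDFs.

```python
from collections import Counter

# Hypothetical metadata records; real ones would come from doc.metadata.
page_metadata = [
    {"source": "report_a.pdf", "page": 0},
    {"source": "report_a.pdf", "page": 1},
    {"source": "report_b.pdf", "page": 0},
]

# Count one entry per page, keyed by the file it came from.
pages_per_source = Counter(meta["source"] for meta in page_metadata)
```

With real data, the same expression could be written over `pdf_documents`, reading `doc.metadata["source"]` for each item.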