### Data Ingestion

In [9]:
### Document Structure

from langchain_core.documents import Document


In [10]:
doc = Document(
    page_content="this is the content of the RAG document",
    metadata={
        "source": "example_source.txt",
        "author": "Aditya Rathore",
        "date": "2024-10-01",
        "page_number": 1,
    },
)
doc

Document(metadata={'source': 'example_source.txt', 'author': 'Aditya Rathore', 'date': '2024-10-01', 'page_number': 1}, page_content='this is the content of the RAG document')
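
The `metadata` dictionary travels with the text, so documents can later be selected by source, author, page, and so on. A minimal sketch of that idea follows. It uses a hypothetical `SimpleDocument` dataclass as a stand-in so the snippet runs without LangChain installed, and `filter_by_metadata` is an illustrative helper, not part of the LangChain API:

```python
from dataclasses import dataclass, field


@dataclass
class SimpleDocument:
    """Hypothetical stand-in with the same two fields as the Document above."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def filter_by_metadata(docs, **criteria):
    """Keep documents whose metadata matches every key/value in criteria."""
    return [
        d for d in docs
        if all(d.metadata.get(k) == v for k, v in criteria.items())
    ]


docs = [
    SimpleDocument("this is the content of the RAG document",
                   {"source": "example_source.txt", "page_number": 1}),
    SimpleDocument("a second, made-up document",
                   {"source": "other_source.txt", "page_number": 1}),
]

matches = filter_by_metadata(docs, source="example_source.txt")
print([d.metadata["source"] for d in matches])
```

The same attribute names (`page_content`, `metadata`) appear in the real `Document` repr above, so the filter should carry over with little change.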

In [11]:
import os

# Create the output directory if it does not already exist
os.makedirs("../data/text_files", exist_ok=True)

In [12]:
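# Note: the triple-quoted strings keep their source indentation,
# so the leading spaces on each line are written to the files as well.
sample_text = {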
    "../data/text_files/python_intro.txt": """Python Programming Introduction
    
    Python is a high-level, interpreted programming language known for its readability and versatility. 
    It supports multiple programming paradigms, including procedural, object-oriented, and functional programming.
    Python's extensive standard library and vibrant ecosystem of third-party packages make it suitable for a wide range of applications, 
    from web development to data science and artificial intelligence.
    
    Key Features of Python:

    1. Readability: Python's syntax emphasizes code readability, making it easier for developers to write and maintain code.
    2. Versatility: Python can be used for various applications, including web development, data analysis, machine learning, automation, and more.
    3. Extensive Libraries: Python has a rich set of libraries and frameworks, such as Django for web development,
       NumPy and Pandas for data analysis, and TensorFlow and PyTorch for machine learning.
    4. Community Support: Python has a large and active community that contributes to its development and provides support through forums, tutorials, and documentation.
    5. Cross-Platform: Python is available on multiple platforms, including Windows, macOS, and Linux, allowing developers to write code that runs seamlessly across different operating systems.
    
    Overall, Python's simplicity, versatility, and strong community support have made it one of the most popular programming languages in the world.""",

    "../data/text_files/machine_learning_basics.txt": """Machine Learning Basics

    Machine learning is a subset of artificial intelligence (AI) that focuses on the development of algorithms and statistical models
    that enable computers to perform tasks without explicit instructions. It involves training models on large datasets to recognize patterns,
    make decisions, and improve their performance over time.

    Key Concepts in Machine Learning:

    1. Supervised Learning: In supervised learning, models are trained on labeled data, where the input features are paired with the correct output.
       The model learns to map inputs to outputs and can make predictions on new, unseen data.
    2. Unsupervised Learning: Unsupervised learning involves training models on unlabeled data, allowing them to discover patterns and relationships
       within the data without explicit guidance. Clustering and dimensionality reduction are common techniques in this category.
    3. Reinforcement Learning: Reinforcement learning is a type of machine learning where agents learn to make decisions by interacting with an environment.
       They receive feedback in the form of rewards or penalties and aim to maximize cumulative rewards over time.
    4. Neural Networks: Neural networks are a class of machine learning models inspired by the human brain's structure. They consist of interconnected nodes
       (neurons) organized in layers and are particularly effective for tasks such as image and speech recognition.
    5. Overfitting and Underfitting: Overfitting occurs when a model learns the training data too well, including noise and outliers, leading to poor
       generalization on new data. Underfitting happens when a model is too simple to capture the underlying patterns in the data.

    Applications of Machine Learning:

    Machine learning has a wide range of applications across various industries, including:

    - Healthcare: Predictive analytics for patient outcomes, medical image analysis, and drug discovery.
    - Finance: Fraud detection, algorithmic trading, and credit scoring.
    - Marketing: Customer segmentation, recommendation systems, and sentiment analysis.
    - Autonomous Systems: Self-driving cars, robotics, and drone navigation.

    Conclusion:

    Machine learning is a rapidly evolving field with the potential to transform industries and improve decision-making processes.
    As more data becomes available and computational power increases, machine learning will continue to advance and unlock new possibilities.
    """
}

# Write each sample to disk as UTF-8
for filepath, content in sample_text.items():
    with open(filepath, "w", encoding="utf-8") as f:
        f.write(content)

print("Sample Text files created.")

Sample Text files created.
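
Because the sample strings are indented inside the dictionary literal, every line after the title is written to disk with leading spaces (visible as `\n    ` in the loader output further down). If that indentation is unwanted, one option is to strip it before writing. The sketch below uses a hypothetical helper, `clean_block`, and a shortened made-up sample. `textwrap.dedent` on its own would not help, because the title line has no indentation and the common margin is therefore empty:

```python
import textwrap


def clean_block(text: str) -> str:
    """Dedent everything after the first line and trim surrounding blank space."""
    first, _, rest = text.partition("\n")
    return (first.strip() + "\n" + textwrap.dedent(rest)).strip() + "\n"


# Shortened, illustrative sample with the same layout as the strings above
raw = """Python Programming Introduction

    Python is a high-level, interpreted programming language.
    Key Features of Python:
       indented continuation line
    """

cleaned = clean_block(raw)
print(cleaned)
```

Relative indentation (such as the continuation line) is preserved; only the shared four-space margin is removed.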


In [13]:
### TextLoader Example

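# Loaders live in langchain_community; the old langchain.document_loaders path is deprecated
from langchain_community.document_loaders import TextLoader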

loader = TextLoader("../data/text_files/python_intro.txt", encoding="utf-8")
document = loader.load()  # load() returns a list of Document objects

print(document)


[Document(metadata={'source': '../data/text_files/python_intro.txt'}, page_content="Python Programming Introduction\n\n    Python is a high-level, interpreted programming language known for its readability and versatility. \n    It supports multiple programming paradigms, including procedural, object-oriented, and functional programming.\n    Python's extensive standard library and vibrant ecosystem of third-party packages make it suitable for a wide range of applications, \n    from web development to data science and artificial intelligence.\n\n    Key Features of Python:\n\n    1. Readability: Python's syntax emphasizes code readability, making it easier for developers to write and maintain code.\n    2. Versatility: Python can be used for various applications, including web development, data analysis, machine learning, automation, and more.\n    3. Extensive Libraries: Python has a rich set of libraries and frameworks, such as Django for web development,\n       NumPy and Pandas fo

In [14]:
### DirectoryLoader Example

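# TextLoader is imported here too so the cell runs on its own
from langchain_community.document_loaders import DirectoryLoader, TextLoader

dir_loader = DirectoryLoader(
    "../data/text_files",
    glob="**/*.txt",                      # match .txt files at any depth
    loader_cls=TextLoader,                # loader used for each matched file
    loader_kwargs={"encoding": "utf-8"},  # forwarded to TextLoader
    show_progress=False,
)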
documents = dir_loader.load()
documents

[Document(metadata={'source': '../data/text_files/machine_learning_basics.txt'}, page_content="Machine Learning Basics\n\n    Machine learning is a subset of artificial intelligence (AI) that focuses on the development of algorithms and statistical models\n    that enable computers to perform tasks without explicit instructions. It involves training models on large datasets to recognize patterns,\n    make decisions, and improve their performance over time.\n\n    Key Concepts in Machine Learning:\n\n    1. Supervised Learning: In supervised learning, models are trained on labeled data, where the input features are paired with the correct output.\n       The model learns to map inputs to outputs and can make predictions on new, unseen data.\n    2. Unsupervised Learning: Unsupervised learning involves training models on unlabeled data, allowing them to discover patterns and relationships\n       within the data without explicit guidance. Clustering and dimensionality reduction are comm
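
The `glob="**/*.txt"` pattern selects `.txt` files at any depth under the directory. A rough standard-library sketch of that matching behaviour, using `pathlib` on a temporary directory, is shown below. It illustrates the pattern only, not how `DirectoryLoader` is implemented, and the file names are made up:

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "nested").mkdir()
    (root / "a.txt").write_text("top level", encoding="utf-8")
    (root / "nested" / "b.txt").write_text("nested", encoding="utf-8")
    (root / "notes.md").write_text("ignored", encoding="utf-8")

    # "**" matches zero or more directory levels, so both .txt files qualify
    matched = sorted(p.relative_to(root).as_posix() for p in root.glob("**/*.txt"))
    print(matched)
```

Narrowing the pattern to `*.txt` would keep only the top-level file.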

In [16]:
from langchain_community.document_loaders import (
    DirectoryLoader,
    PyMuPDFLoader,
    PyPDFLoader,
)

dir_loader = DirectoryLoader(
    "../data/pdf_files/",
    glob="**/*.pdf",           # match .pdf files at any depth
    loader_cls=PyMuPDFLoader,  # loader used for each matched file
    show_progress=False,
)

pdf_documents = dir_loader.load()
pdf_documents

[Document(metadata={'producer': 'Skia/PDF m142 Google Docs Renderer', 'creator': '', 'creationdate': '', 'source': '../data/pdf_files/Rental_Lease_Agreement.pdf', 'file_path': '../data/pdf_files/Rental_Lease_Agreement.pdf', 'total_pages': 2, 'format': 'PDF 1.4', 'title': 'Untitled document', 'author': '', 'subject': '', 'keywords': '', 'moddate': '', 'trapped': '', 'modDate': '', 'creationDate': '', 'page': 0}, page_content='Residential Lease Agreement \nThis Lease Agreement (the "Agreement") is made and entered into this 21st day of \nSeptember, 2025, by and between Rohit Soni (the "Landlord") and Amit Kumar (the \n"Tenant"). \n1. PROPERTY The Landlord agrees to lease to the Tenant the property located at: 123 \nInnovation Drive, Techville, ST 54321 (the "Premises"). \n2. LEASE TERM The term of this lease shall be for a period of 12 months, commencing on \nOctober 1, 2025, and ending on September 30, 2026. \n3. RENT 3.1. The Tenant shall pay the Landlord a monthly rent of $2,400.00. 3
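
The metadata above includes `page` and `total_pages`, which suggests the loader returns one `Document` per PDF page. A sketch of regrouping such page-level records by `source` follows. It uses plain dictionaries with placeholder page text in place of real `Document` objects, and `group_pages_by_source` is a hypothetical helper:

```python
from collections import defaultdict


def group_pages_by_source(pages):
    """Join page texts per source file, ordered by page index."""
    grouped = defaultdict(list)
    for page in pages:
        grouped[page["metadata"]["source"]].append(page)
    return {
        source: "\n".join(
            p["page_content"]
            for p in sorted(items, key=lambda p: p["metadata"]["page"])
        )
        for source, items in grouped.items()
    }


# Placeholder records shaped like the metadata shown above (text is made up)
pdf_path = "../data/pdf_files/Rental_Lease_Agreement.pdf"
pages = [
    {"page_content": "second page (placeholder)",
     "metadata": {"source": pdf_path, "page": 1}},
    {"page_content": "first page (placeholder)",
     "metadata": {"source": pdf_path, "page": 0}},
]

merged = group_pages_by_source(pages)
print(merged)
```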