## Data Ingestion

#### 1. Converting all files to the LangChain Document structure
```python
from langchain_core.documents import Document  # same import the notebook cells below use
```
- **Core components**:
    - `page_content` (str): the actual text content of the file
    - `metadata` (dict): key-value pairs describing that content (source, page, timestamp, etc.)

**Creating a Document**:
```python
doc = Document(
    page_content = "RAG is a technique...",
    metadata = {
        "source" : "chapter1.pdf",
        "page": 5,
        "timestamp" : "2024-01-15"
    }
)
```

Eventually, we'll embed this data and push it into a vector DB. We can then run a similarity search over it using a metric such as cosine similarity.
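To make the similarity step concrete, here is a minimal sketch of cosine similarity in plain Python. The three-dimensional vectors and the chunk names are made up for illustration; real embeddings come from an embedding model and have hundreds of dimensions, and a vector DB would do this ranking for you.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen for this sketch: a zero vector matches nothing
    return dot / (norm_a * norm_b)

# Toy "embeddings" (hypothetical values, not produced by any model)
query_vec = [1.0, 0.0, 1.0]
chunk_vecs = {
    "chunk_about_rag": [0.9, 0.1, 0.8],
    "chunk_about_cooking": [0.0, 1.0, 0.1],
}

# Rank chunks from most to least similar to the query
ranked = sorted(
    chunk_vecs,
    key=lambda name: cosine_similarity(query_vec, chunk_vecs[name]),
    reverse=True,
)
print(ranked)  # ['chunk_about_rag', 'chunk_about_cooking']
```

The score depends only on the direction of the vectors, not their length, which is why it is a common choice for comparing text embeddings.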

##### LangChain Document Loaders
These loaders read various file formats directly into the Document structure:
1. **PyPDFLoader**
    ```python
    from langchain_community.document_loaders import PyPDFLoader
    loader = PyPDFLoader("file.pdf")
    documents = loader.load()
    ```
2. **CSVLoader**
    ```python
    from langchain_community.document_loaders import CSVLoader
    loader = CSVLoader("file.csv")
    documents = loader.load()
    ```
3. **WebBaseLoader**
    ```python
    from langchain_community.document_loaders import WebBaseLoader
    loader = WebBaseLoader("https://example.com")  # placeholder URL; pass the page you want to load
    documents = loader.load()
    ```
4. **DirectoryLoader**
    ```python
    from langchain_community.document_loaders import DirectoryLoader
    loader = DirectoryLoader("../data/text_files")  # takes a directory path, not a single file
    documents = loader.load()
    ```


In [1]:
### Document Structure

from langchain_core.documents import Document

In [2]:
doc = Document(
    page_content="This is the main text content I'm using to create RAG",
    metadata = {
        "source": "example.txt",
        "author": "Pranav Yadav",
        "date_created": "2024-01-01"
    }
)
doc

Document(metadata={'source': 'example.txt', 'author': 'Pranav Yadav', 'date_created': '2024-01-01'}, page_content="This is the main text content I'm using to create RAG")

**Why do we need the metadata?**
- To filter results during similarity search once the documents are stored in the vector DB (for example, restricting matches to a single source file)
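A rough sketch of what metadata filtering amounts to, written in plain Python so it runs without LangChain installed. `SimpleDoc` and `filter_by_metadata` are hypothetical names introduced here, and the sample documents are made up; an actual vector DB applies an equivalent filter alongside the similarity search through its own API.

```python
from dataclasses import dataclass, field

@dataclass
class SimpleDoc:
    """Hypothetical stand-in for a LangChain Document (same two fields)."""
    page_content: str
    metadata: dict = field(default_factory=dict)

def filter_by_metadata(docs, **conditions):
    """Keep only documents whose metadata matches every key=value condition."""
    return [
        d for d in docs
        if all(d.metadata.get(key) == value for key, value in conditions.items())
    ]

# Made-up sample documents for illustration
docs = [
    SimpleDoc("RAG is a technique...", {"source": "chapter1.pdf", "page": 5}),
    SimpleDoc("Chunking splits text...", {"source": "chapter2.pdf", "page": 1}),
    SimpleDoc("Retrieval then generation...", {"source": "chapter1.pdf", "page": 6}),
]

chapter1_docs = filter_by_metadata(docs, source="chapter1.pdf")
print([d.metadata["page"] for d in chapter1_docs])  # [5, 6]
```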

In [3]:
## Create a directory for the sample text files

import os
os.makedirs("../data/text_files", exist_ok=True)

In [4]:
sample_texts = {
    "../data/text_files/placements.txt": """⦁	DSA, SQL, Probab & Stats practice
⦁	PyTorch, PySpark Practice
⦁	Implementation
 	- RAG
 	- FastAPI
 	- Docker
 	- Fine-tuning LLMs using Hugging face (PEFT - LoRA, QLoRA)
 	- Langchain
 	- Done:
 		- Kafka
 		- Kubernetes
 		- PySpark
⦁	Learn
	- DBMS
	- GenAI stuff (GANs, diffusion models, VAEs, etc)
 	- Agentic frameworks (whatever you can)
 	-

Need to finished with before tests:
1.	200 DSA questions and theory rev
2. Probab 50+ questions and theory rev
4. ML, DL, MLOps, Probab & Stats revised
5. Any other things like aptitude, ML/DL libraries, etc.

Theory:
1.	DSA - Lectures and slides
2.	FML - Slides
3.	Intro to DL - Mitesh sir's site
4.	MLOps lab - GitHub sudarsun, internet, my notes
5.	Probab, stats and linear algebra - MFDS slides, prep materials
6.	Data analytics lab - GitHub sudarsun, internet

Target companies:
1.	Honda - Less chance
2. Abacus.ai - Less chance
3. BlackRock - IDK (SQL, Aptitude, Coding)
4. PIMIC AI - IDK
3. American Express - High chance
4. Meesho - Moderate chance
6. Infoedge - High Chance
8. Hilabs - Good Chance
9. Pilvo - Less chance
10. Suntory - High chance (if no detection)
""",
"../data/text_files/machine_learning.txt":"""
Machine learning (ML) is a field of study in artificial intelligence concerned with the development and study of statistical algorithms that can learn from data and generalise to unseen data, and thus perform tasks without explicit instructions.[1] Within a subdiscipline in machine learning, advances in the field of deep learning have allowed neural networks, a class of statistical algorithms, to surpass many previous machine learning approaches in performance.[2]

ML finds application in many fields, including natural language processing, computer vision, speech recognition, email filtering, agriculture, and medicine. The application of ML to business problems is known as predictive analytics.

Statistics and mathematical optimisation (mathematical programming) methods comprise the foundations of machine learning. Data mining is a related field of study, focusing on exploratory data analysis (EDA) via unsupervised learning.[4][5]

From a theoretical viewpoint, probably approximately correct learning provides a mathematical and statistical framework for describing machine learning. Most traditional machine learning and deep learning algorithms can be described as empirical risk minimisation under this framework.

History
See also: Timeline of machine learning
The term machine learning was coined in 1959 by Arthur Samuel, an IBM employee and pioneer in the field of computer gaming and artificial intelligence.[6][7] The synonym self-teaching computers was also used in this time period.[8][9]

The earliest machine learning program was introduced in the 1950s when Arthur Samuel invented a computer program that calculated the winning chance in checkers for each side, but the history of machine learning roots back to decades of human desire and effort to study human cognitive processes.[10] In 1949, Canadian psychologist Donald Hebb published the book The Organization of Behavior, in which he introduced a theoretical neural structure formed by certain interactions among nerve cells.[11] Hebb's model of neurons interacting with one another set a groundwork for how AIs and machine learning algorithms work under nodes, or artificial neurons used by computers to communicate data.[10] Other researchers who have studied human cognitive systems contributed to the modern machine learning technologies as well, including logician Walter Pitts and Warren McCulloch, who proposed the early mathematical models of neural networks to come up with algorithms that mirror human thought processes.[10]

By the early 1960s, an experimental "learning machine" with punched tape memory, called Cybertron, had been developed by Raytheon Company to analyse sonar signals, electrocardiograms, and speech patterns using rudimentary reinforcement learning. It was repetitively "trained" by a human operator/teacher to recognise patterns and equipped with a "goof" button to cause it to reevaluate incorrect decisions.[12] A representative book on research into machine learning during the 1960s was Nils Nilsson's book on Learning Machines, dealing mostly with machine learning for pattern classification.[13] Interest related to pattern recognition continued into the 1970s, as described by Duda and Hart in 1973.[14] In 1981, a report was given on using teaching strategies so that an artificial neural network learns to recognise 40 characters (26 letters, 10 digits, and 4 special symbols) from a computer terminal.[15]

Tom M. Mitchell provided a widely quoted, more formal definition of the algorithms studied in the machine learning field: "A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P if its performance at tasks in T, as measured by P, improves with experience E."[16] This definition of the tasks in which machine learning is concerned offers a fundamentally operational definition rather than defining the field in cognitive terms. This follows Alan Turing's proposal in his paper "Computing Machinery and Intelligence", in which the question, "Can machines think?", is replaced with the question, "Can machines do what we (as thinking entities) can do?".[17]

Modern day Machine Learning algorithms are broken into 3 algorithms types: Supervised Learning Algorithms, Unsupervised Learning Algorithms, and Reinforcement Learning Algorithms.[18]

Current Supervised Learning Algorithms have objectives of classification and regression.
Current Unsupervised Learning Algorithms have objectives of clustering, dimensionality reduction, and association rule.
Current Reinforcement Learning Algorithms focus on decisions that must be made with respect to some previous, unknown time and are broken down to either be studies of model based methods, and model free methods.
In 2014 Ian Goodfellow and others introduced generative adversarial networks (GANs) with realistic data synthesis.[19] By 2016 AlphaGo obtained victory against top human players using reinforcement learning techniques.[20] Shortly after, transformer architectures obtained natural language processing, powering the now popular large language models advancing generative AI and multimodal applications.[21]
"""
}

for filepath, content in sample_texts.items():
    with open(filepath, 'w', encoding = "utf-8") as f:
        f.write(content)

print("Sample text files created!")

Sample text files created!


##### langchain_community
- The `langchain_community.document_loaders` package loads different file types (PDF, TXT, etc.) directly into the Document structure needed for the later stages of RAG.
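Conceptually, a plain-text loader reads the file and wraps its contents together with the file path as metadata. The sketch below imitates that with the standard library only; `load_text_file` is a hypothetical helper, and it returns a dict rather than the list of `Document` objects that the real loader returns.

```python
import os
import tempfile

def load_text_file(path: str, encoding: str = "utf-8") -> dict:
    """Read a text file into a Document-shaped dict (illustrative stand-in only)."""
    with open(path, encoding=encoding) as f:
        text = f.read()
    return {"page_content": text, "metadata": {"source": path}}

# Write a throwaway file so the example is self-contained
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, "sample.txt")
    with open(sample_path, "w", encoding="utf-8") as f:
        f.write("RAG is a technique...")
    loaded = load_text_file(sample_path)

print(loaded["page_content"])
print(loaded["metadata"])
```

Passing the encoding explicitly matters for files with non-ASCII characters, since the platform default encoding is not always UTF-8.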

In [6]:
### TextLoader 
from langchain_community.document_loaders import TextLoader

loader = TextLoader("../data/text_files/placements.txt", encoding = "utf-8")
document = loader.load()
print(document)

[Document(metadata={'source': '../data/text_files/placements.txt'}, page_content="⦁\tDSA, SQL, Probab & Stats practice\n⦁\tPyTorch, PySpark Practice\n⦁\tImplementation\n\xa0\t- RAG\n\xa0\t- FastAPI\n\xa0\t- Docker\n\xa0\t- Fine-tuning LLMs using Hugging face (PEFT - LoRA, QLoRA)\n\xa0\t- Langchain\n\xa0\t- Done:\n\xa0\t\t- Kafka\n\xa0\t\t- Kubernetes\n\xa0\t\t- PySpark\n⦁\tLearn\n\t- DBMS\n\t- GenAI stuff (GANs, diffusion models, VAEs, etc)\n\xa0\t- Agentic frameworks (whatever you can)\n\xa0\t-\n\nNeed to finished with before tests:\n1.\t200 DSA questions and theory rev\n2. Probab 50+ questions and theory rev\n4. ML, DL, MLOps, Probab & Stats revised\n5. Any other things like aptitude, ML/DL libraries, etc.\n\nTheory:\n1.\tDSA - Lectures and slides\n2.\tFML - Slides\n3.\tIntro to DL - Mitesh sir's site\n4.\tMLOps lab - GitHub sudarsun, internet, my notes\n5.\tProbab, stats and linear algebra - MFDS slides, prep materials\n6.\tData analytics lab - GitHub sudarsun, internet\n\nTarget co

In [10]:
### Directory Loader - to load all files in a directory
from langchain_community.document_loaders import DirectoryLoader

dir_loader = DirectoryLoader(
    "../data/text_files",
    glob = "**/*.txt", ## Pattern to match files
    loader_cls = TextLoader, ## Loader class to use
    loader_kwargs={'encoding': 'utf-8'},
    show_progress=False
)
documents = dir_loader.load()

documents

[Document(metadata={'source': '..\\data\\text_files\\machine_learning.txt'}, page_content='\nMachine learning (ML) is a field of study in artificial intelligence concerned with the development and study of statistical algorithms that can learn from data and generalise to unseen data, and thus perform tasks without explicit instructions.[1] Within a subdiscipline in machine learning, advances in the field of deep learning have allowed neural networks, a class of statistical algorithms, to surpass many previous machine learning approaches in performance.[2]\n\nML finds application in many fields, including natural language processing, computer vision, speech recognition, email filtering, agriculture, and medicine. The application of ML to business problems is known as predictive analytics.\n\nStatistics and mathematical optimisation (mathematical programming) methods comprise the foundations of machine learning. Data mining is a related field of study, focusing on exploratory data analys
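To see which files a pattern like `**/*.txt` selects, the sketch below uses `pathlib` on a temporary directory. It assumes `DirectoryLoader` follows the same glob semantics as `pathlib`; the file names are made up for the demonstration.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp_dir:
    root = Path(tmp_dir)
    (root / "nested").mkdir()
    (root / "a.txt").write_text("top-level text file", encoding="utf-8")
    (root / "nested" / "b.txt").write_text("nested text file", encoding="utf-8")
    (root / "notes.md").write_text("not matched by the pattern", encoding="utf-8")

    # "**" matches zero or more directory levels, so both .txt files are found
    matched = sorted(p.relative_to(root).as_posix() for p in root.glob("**/*.txt"))

print(matched)  # ['a.txt', 'nested/b.txt']
```

`as_posix()` normalizes separators to forward slashes. The backslashes in the `source` values above are what the same relative paths look like on Windows.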

In [None]:
### Loading pdf files to document structure directly using langchain_community
from langchain_community.document_loaders import PyPDFLoader, PyMuPDFLoader

dir_loader = DirectoryLoader(
    "../data/pdf",
    glob = "**/*.pdf", ## Pattern to match files
    loader_cls = PyMuPDFLoader, ## Loader class to use
    show_progress=False
)

pdf_documents = dir_loader.load()

pdf_documents

[Document(metadata={'producer': 'Microsoft: Print To PDF', 'creator': '', 'creationdate': '2025-09-19T19:17:00+09:00', 'source': '..\\data\\pdf\\AI researcher.pdf', 'file_path': '..\\data\\pdf\\AI researcher.pdf', 'total_pages': 2, 'format': 'PDF 1.7', 'title': 'IIT登録用Job Description(2025-2026)0903提出.xlsx', 'author': 'AtoJ-Ruchira', 'subject': '', 'keywords': '', 'moddate': '2025-09-19T19:17:00+09:00', 'trapped': '', 'modDate': "D:20250919191700+09'00'", 'creationDate': "D:20250919191700+09'00'", 'page': 0}, page_content='Honda Motor Co., \nLtd.\nAI Research Engineer\nHonda is a total mobility company dedicated to delivering the "joy of free movement." We innovate in various fields, including motorcycles, automobiles, power products, aircraft and aircraft \nengines, and robotics. Honda operates in markets across about 200 countries worldwide, continuously creating next-generation mobility solutions and products that improve people’s lives by \ncombining our technology and expertise.\nA
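The metadata above includes `page` (0 for the first page) and `total_pages`, which suggests the loader emits one Document per PDF page. Under that assumption, a minimal sketch for regrouping page-level records by source file could look like this. The records are made-up dicts shaped like the metadata above, and `group_pages_by_source` is a hypothetical helper.

```python
from collections import defaultdict

# Made-up page-level records; real ones would come from pdf_documents
pages = [
    {"page_content": "first page text", "metadata": {"source": "report.pdf", "page": 0, "total_pages": 2}},
    {"page_content": "second page text", "metadata": {"source": "report.pdf", "page": 1, "total_pages": 2}},
    {"page_content": "only page text", "metadata": {"source": "memo.pdf", "page": 0, "total_pages": 1}},
]

def group_pages_by_source(records):
    """Collect page texts per source file, ordered by page number."""
    grouped = defaultdict(list)
    ordered = sorted(records, key=lambda r: (r["metadata"]["source"], r["metadata"]["page"]))
    for record in ordered:
        grouped[record["metadata"]["source"]].append(record["page_content"])
    return dict(grouped)

by_source = group_pages_by_source(pages)
print({source: len(texts) for source, texts in by_source.items()})
# {'memo.pdf': 1, 'report.pdf': 2}
```

Comparing the number of collected pages against `total_pages` is a cheap check that no page was dropped during loading.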

#### 2. Chunking