# Data Ingestion

In [3]:
from langchain_core.documents import Document

In [4]:
doc = Document(
    page_content="This is a sample document.",
    metadata={
        "source": "generated",
        "author": "LangChain",
        "pages": 1,
        "date_created": "2024-06-15",
    },
)
# print(doc)

doc

Document(metadata={'source': 'generated', 'author': 'LangChain', 'pages': 1, 'date_created': '2024-06-15'}, page_content='This is a sample document.')
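To make the structure concrete, here is a minimal stand-in for a document object, assuming only the two fields shown in the output above (`page_content` and `metadata`). `SimpleDocument` is a hypothetical name used for illustration; it is not LangChain's `Document` class.

```python
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass
class SimpleDocument:
    """Hypothetical stand-in: text plus a free-form metadata dictionary."""
    page_content: str
    metadata: Dict[str, Any] = field(default_factory=dict)


sample = SimpleDocument(
    page_content="This is a sample document.",
    metadata={"source": "generated", "pages": 1},
)

# Metadata travels with the text, so it can be used for filtering later.
print(sample.page_content)
print(sample.metadata["source"])
```

Keeping metadata next to the text is what later lets a retrieval step report where a passage came from.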

In [5]:
# Create the directory that will hold the sample text files
import os
os.makedirs("../datas/text_files", exist_ok=True)

In [6]:
sample_text = {
    "python.txt":
    '''What is Python? Executive Summary
Python is an interpreted, object-oriented, high-level programming language with dynamic semantics. Its high-level built in data structures, combined with dynamic typing and dynamic binding, make it very attractive for Rapid Application Development, as well as for use as a scripting or glue language to connect existing components together. Python's simple, easy to learn syntax emphasizes readability and therefore reduces the cost of program maintenance. Python supports modules and packages, which encourages program modularity and code reuse. The Python interpreter and the extensive standard library are available in source or binary form without charge for all major platforms, and can be freely distributed.

Often, programmers fall in love with Python because of the increased productivity it provides. Since there is no compilation step, the edit-test-debug cycle is incredibly fast. Debugging Python programs is easy: a bug or bad input will never cause a segmentation fault. Instead, when the interpreter discovers an error, it raises an exception. When the program doesn't catch the exception, the interpreter prints a stack trace. A source level debugger allows inspection of local and global variables, evaluation of arbitrary expressions, setting breakpoints, stepping through the code a line at a time, and so on. The debugger is written in Python itself, testifying to Python's introspective power. On the other hand, often the quickest way to debug a program is to add a few print statements to the source: the fast edit-test-debug cycle makes this simple approach very effective.

See also some comparisons between Python and other languages.

1. Introduction
Python, renowned for its simplicity and readability, stands as one of the most versatile and widely adopted programming languages in the world today. Created by Guido van Rossum in the late 1980s, Python was designed with a focus on code readability, enabling developers to express concepts in fewer lines of code compared to languages like C++ or Java.

The Significance of Python
The importance of Python transcends traditional coding realms. Its versatility allows it to be employed in a multitude of applications, ranging from web development and scientific computing to data analysis, artificial intelligence, and more. Python's extensive library ecosystem empowers developers with pre-built modules and packages, easing the implementation of complex functionalities. This language's adaptability is showcased by its role in some of the most prominent technological advances of our time.

In web development, frameworks like Django and Flask have propelled Python to the forefront, enabling developers to build robust and scalable applications. In the realm of data science, Python, along with libraries like Pandas, NumPy, and Matplotlib, has become the de facto choice for data manipulation, analysis, and visualization. Additionally, Python's prowess in artificial intelligence and machine learning is exemplified by the popularity of libraries such as TensorFlow and PyTorch.

Beyond these domains, Python finds application in automation, scripting, game development, and more. Its straightforward syntax and vast community support make it an ideal choice for both novice programmers and seasoned developers alike.

In this article, we embark on a journey through the foundational aspects of Python programming, equipping you with the skills to leverage this versatile language for your own projects and endeavors.

2. Getting Started with Python
Python’s Genesis and Guido van Rossum
Python, conceived in the late 1980s by Guido van Rossum, was designed with a vision to create a language that emphasized code readability and maintainability. Its name is a nod to the British comedy group Monty Python, highlighting the language's penchant for humor and accessibility.

Installing Python
Before we dive into Python programming, you'll need to set up Python on your system. Follow these steps based on your operating system:

- For Windows:
1. Visit the official Python website at python.org.
2. Navigate to the "Downloads" section.
3. Select the latest version compatible with your system (usually recommended for most users).
4. Check the box that says "Add Python X.X to PATH" during installation.
5. Click "Install Now" and follow the on-screen prompts.

- For macOS:
1. Visit python.org.
2. Navigate to the "Downloads" section.
3. Select the latest version compatible with your system (usually recommended for most users).
4. Run the installer and follow the on-screen instructions.

- For Linux:
- Python is often pre-installed in many Linux distributions. To check if it’s installed, open a terminal and type `python --version`. If Python is not installed, you can install it via your package manager (e.g., `sudo apt install python3` for Ubuntu).

Choosing an IDE or Text Editor
Once Python is installed, you'll need an environment to write and run your code. Here are a few popular options:

- PyCharm:
- PyCharm is a powerful IDE known for its intelligent code assistance, debugging capabilities, and extensive plugin ecosystem. It’s suitable for both beginners and experienced developers.

- Jupyter Notebook:
- Jupyter Notebook provides an interactive environment for running code snippets. It’s particularly useful for data analysis, experimentation, and creating interactive documents.

- Visual Studio Code (VSCode):
- VSCode is a lightweight, open-source code editor that supports Python with extensions. It offers a rich set of features, including debugging, version control, and a thriving community.

Selecting the right environment largely depends on your personal preferences and the nature of your projects. Experiment with a few to find the one that best suits your workflow.

3. Python Basics
Writing and Running a Simple Python Program
Let's kickstart your Python journey by writing and running a basic program. Open your chosen Python environment (IDE or text editor), and type the following:

print("Hello, Python!")
Save this file with a `.py` extension (e.g., `hello.py`). Then, in your terminal or command prompt, navigate to the directory containing the file and execute it by typing `python hello.py`. You should see the output: `Hello, Python!`.

Variables and Data Types
In Python, variables are like containers that hold data. They can store various types of information such as numbers, text, and more. Here are some essential data types:

- Integer (int): Represents whole numbers (positive or negative), e.g., `5`, `-10`.
- Float (float): Represents decimal numbers, e.g., `3.14`, `-0.001`.
- String (str): Represents text, enclosed in either single or double quotes, e.g., `'Hello'`, `"Python"`.

To declare a variable, you simply assign a value to it:

age = 25
pi = 3.14
name = 'Alice'
Operators
Python supports a variety of operators for performing operations on variables and values.

- Arithmetic Operators (+, -, *, /, %, **):
- Addition, subtraction, multiplication, division, modulus (remainder), exponentiation.

- Comparison Operators (==, !=, <, >, <=, >=):
- Compare values and return `True` or `False`.

- Logical Operators (and, or, not):
- Perform logical operations on `True` and `False` values.

Here's a quick example illustrating these operators:

x = 10
y = 5

# Arithmetic
sum_result = x + y
difference = x - y
product = x * y
quotient = x / y

# Comparison
is_equal = x == y
is_greater = x > y

# Logical
logical_and = (x > 0) and (y < 10)
logical_or = (x > 0) or (y > 10)
logical_not = not(x > 0)
Understanding and using these concepts will serve as a strong foundation for your Python programming journey.

4. Control Flow
Conditional Statements (if-else)
Conditional statements allow your program to make decisions based on certain conditions. They are pivotal for executing different code blocks depending on the input or circumstances.

# Example 1: Simple if-else statement
age = 20

if age >= 18:
    print("You are eligible to vote.")
else:
    print("You are not eligible to vote yet.")
In this example, if the condition `age >= 18` evaluates to `True`, the first block of code (indented under `if`) will be executed. Otherwise, the block of code under `else` will be executed.

# Example 2: Chained conditions with elif
score = 85

if score >= 90:
    print("You got an A!")
elif score >= 80:
    print("You got a B.")
elif score >= 70:
    print("You got a C.")
else:
    print("You need to improve.")
Here, the program checks multiple conditions one after another. If the first condition is not met, it moves to the next `elif` statement. If none of the conditions are met, the code under `else` is executed.

Loops (for, while)
Loops are fundamental for executing a block of code repeatedly.

For Loop Example:

# Example 1: Iterating over a list
fruits = ['apple', 'banana', 'cherry']

for fruit in fruits:
    print(fruit)
This loop iterates through the list of fruits and prints each one.

# Example 2: Using range() for a specified number of iterations
for i in range(5):
    print(i)
This loop uses `range(5)` to iterate from `0` to `4`, printing each number.

While Loop Example:

# Example: Countdown using a while loop
count = 5

while count > 0:
    print(count)
    count -= 1
This while loop counts down from 5 to 1.

Choosing Between for and while Loops
When deciding between a `for` loop and a `while` loop, consider the following:

Use a `for` loop when:

- You know the number of iterations in advance.
- You're iterating over a sequence, like a list or a range of numbers.
- You want to iterate through a collection or perform a specific action a fixed number of times.

Example:

for i in range(5):
    print(i)
Use a `while` loop when:

- You don't know the number of iterations in advance or want to loop until a specific condition is met.
- The condition for terminating the loop may change during runtime.

Example:

count = 5
while count > 0:
    print(count)
    count -= 1
Remember, `while` loops rely on a condition to stop execution, which means they can potentially run indefinitely if the condition is not met. Always ensure there's a mechanism to break out of a `while` loop.

Loops provide a powerful mechanism for automating repetitive tasks in your programs.

5. Data Structures in Python
Lists, Tuples, and Dictionaries
Python provides a rich set of data structures to handle different types of data efficiently.

- Lists:
- Lists are ordered collections of elements that can be of any data type (including mixed types).
- They are mutable, meaning you can modify their contents after creation.

- Example:

numbers = [1, 2, 3, 4, 5]
fruits = ['apple', 'banana', 'cherry']
- Use Cases: Storing and manipulating sequences of items, such as a list of numbers or names.

- Tuples:
- Tuples are similar to lists, but they are immutable, meaning their elements cannot be changed after creation.

- Example:

coordinates = (3, 5)
colors = ('red', 'green', 'blue')
- Use Cases: Representing collections of related data, like coordinates or settings.

- Dictionaries:
- Dictionaries store data in key-value pairs, allowing for fast retrieval based on keys.

- They are mutable and, since Python 3.7, preserve insertion order.

- Example:

person = {'name': 'John Doe', 'age': 30, 'city': 'New York'}
- Use Cases: Storing and retrieving information based on labels or IDs, like user profiles or configurations.

Strings and String Manipulation
Strings are sequences of characters, and Python provides powerful tools for working with them.

- Concatenation:
- Concatenation allows you to combine two or more strings into one.

- Example:

first_name = 'John'
last_name = 'Doe'
full_name = first_name + ' ' + last_name
- Slicing:
- Slicing allows you to extract parts of a string.

- Example:

message = 'Hello, Python!'
sub_message = message[7:13]  # Output: 'Python'
- String Formatting:
- String formatting enables you to construct strings dynamically with variables.
''',

"machine_learning.txt":
'''



Machine Learning (ML) is a branch of artificial intelligence (AI) that allows computers to learn and make predictions or decisions without being explicitly programmed for each task. Instead, use data to “train” to understand patterns, make predictions, and improve performance over time.

In simpler terms, ML enables computers to automatically improve their performance on a task through experience. Just like humans learn from their experiences, machines learn from the data provided to them.

Types of Machine Learning
There are three main types of machine learning:
1. Supervised Learning
2. Unsupervised Learning
3. Reinforcement Learning

Let’s break down each type and see how they work.

Supervised Learning
In Supervised Learning, the model is trained on labeled data, meaning each input (the data) has a corresponding output (the label). The model learns the relationship between inputs and outputs and can then predict outputs for new, unseen data.

Example:
- Suppose you have a dataset of houses where each house is described by its features (size, number of rooms, location) and a label (price of the house). You can train a supervised learning model to predict the price of a new house based on its features.

Simple Analogy:
Think of a student learning math with a teacher. The teacher (data labels) gives the student (the model) correct answers during practice. Over time, the student learns to solve similar problems on their own.

Unsupervised Learning
In Unsupervised Learning, the data does not have labels. The model is given only the inputs and must find patterns or relationships between them. It often groups similar data points together.

Example:
- Imagine you have a large set of images of different animals, but none of the images are labeled (no “dog,” “cat,” etc.). An unsupervised learning algorithm could group similar images together, forming clusters, even if it doesn’t know what the animals are.

Simple Analogy:
It’s like a person sorting different types of fruits without knowing their names. The person groups similar-looking fruits together (apples in one group, oranges in another) without knowing exactly what they are.

Reinforcement Learning
Reinforcement Learning (RL) involves an agent (a model) that learns through trial and error by interacting with an environment. The agent receives rewards for good actions and penalties for bad actions and adjusts its behavior to maximize rewards over time.

Example:
- In a video game, an RL agent learns to play by making moves and receiving points (rewards) or losing lives (penalties). Over time, it figures out the best strategies to win the game.

Simple Analogy:
Imagine teaching a pet a new trick. Every time the pet performs the trick correctly, you give it a treat (reward). If the pet does something wrong, you don’t give a treat (penalty). Over time, the pet learns to perform the trick correctly to get the treat.

Differences Between Supervised, Unsupervised, and Reinforcement Learning
[Figure: Differences Between Supervised, Unsupervised, and Reinforcement Learning]

Supervised Learning: Train and Test a Model (Simple Example)
In Supervised Learning, we train a model using a dataset where we already know the correct answers (labels). After training, we evaluate the model’s performance on a separate “test set” to see how well it can predict new data.

Steps:

1. Train the Model: Use a portion of the data (the training set) to teach the model.
2. Test the Model: Use the remaining data (the test set) to evaluate the model’s performance on unseen examples.

Example (Predicting House Prices)

Let’s use a simple example of training and testing a model to predict house prices based on features like size and number of rooms. We will use a supervised learning approach and the scikit-learn library.

# Import necessary libraries
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Create a simple dataset for house prices
# Features: size (in square feet), number of rooms
# Target: house price in $1000s
data = {
    'size': [1500, 1800, 2400, 3000, 3500, 4000, 4500, 5000, 5500, 6000],
    'rooms': [3, 3, 4, 4, 5, 5, 6, 6, 6, 7],
    'price': [300, 320, 400, 450, 500, 540, 600, 620, 670, 700]
}

# Convert the dictionary to a pandas DataFrame
df = pd.DataFrame(data)

# Define the features (X) and the target (y)
X = df[['size', 'rooms']]  # Input features
y = df['price']  # Target variable

# Split the data into training and testing sets (80% training, 20% testing)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Train a Linear Regression model
model = LinearRegression()
model.fit(X_train, y_train)

# Make predictions on the test set
y_pred = model.predict(X_test)

# Evaluate the model using Mean Squared Error (MSE)
mse = mean_squared_error(y_test, y_pred)

# Output the results
print(f"Predicted prices: {y_pred}")
print(f"Actual prices: {y_test.values}")
print(f"Mean Squared Error: {mse:.2f}")
Why Split Data?
- Training Set: Used to teach the model by showing it examples with known answers.
- Test Set: Used to see how well the model performs on data it has never seen before. This helps check if the model is overfitting (performing well on training but poorly on new data) or generalizing well.

Predictions are made on the test set, and the Mean Squared Error (MSE) is calculated to measure how well the model is performing. The lower the MSE, the better the model’s predictions.

Machine Learning offers powerful tools for making predictions and uncovering insights from data. By understanding the key types (Supervised, Unsupervised, and Reinforcement Learning) and using simple models, you can begin to explore its potential. Supervised learning is often the easiest to start with since you have labeled data, and you can quickly test models with real-world applications like house price prediction or flower classification.

Regression vs. Classification
In machine learning, there are two main types of tasks:

- Classification
- Regression

1. Classification
Classification involves predicting a category or class label. The output is discrete, meaning the model tries to classify data into predefined labels or groups.

Example:

- Predicting whether an email is “spam” or “not spam” (two distinct classes).
- Classifying a flower as one of three species based on its features.

2. Regression
Regression involves predicting a continuous value or quantity. The output is continuous, meaning the model predicts a value on a numerical scale.

Example:

- Predicting the price of a house based on its features (like size, number of rooms, location).
- Forecasting the temperature tomorrow based on historical data.

Understanding Linear Regression
Linear regression is one of the most basic types of regression analysis. It is used to predict a continuous value by modeling the relationship between an independent variable (input) and a dependent variable (output).

Linear Regression Example
The house price example above, which predicted price from size and number of rooms, used linear regression. With a single feature such as size, linear regression models the relationship using a straight line:

[Figure: Linear Regression]

We plot the original data and the fitted line to visually see how well the model captures the relationship.

Multivariable (Multiple) Linear Regression
Multiple Linear Regression is an extension of simple linear regression where we use multiple input features (variables) to predict the output.

Instead of just predicting house prices based on size, we could also consider the number of rooms, location, or age of the house. The equation would now look like this:

[Figure: Multivariable Regression]

Understanding the Cost Function in Machine Learning
A cost function (sometimes called a loss function or error function) is a mathematical function that tells us how far our model’s predictions are from the actual results. It essentially gives a score to a model’s performance: the higher the score, the worse the model; the lower the score, the better the model.

The ultimate goal is to minimize the cost function, which would mean that the model’s predictions are as close to the actual values as possible.

In the case of Gradient Descent, the algorithm adjusts the parameters in such a way that the cost function decreases with each iteration until it reaches a minimum value (the lowest point in the cost function). At this minimum, the model is making the best possible predictions given the training data.

Types of Cost Functions in Machine Learning
1. For Regression (e.g., Linear Regression):
The most common cost function for linear regression is the Mean Squared Error (MSE), which looks at the difference between the predicted and actual values, squares them (to avoid negative differences), and averages them over the entire dataset.

MSE = (1/n) * Σ (ŷ(i) - y(i))^2, with the sum taken over i = 1, ..., n

Where:

n is the number of training examples.
ŷ(i) is the predicted value for the i-th training example.
y(i) is the actual value for the i-th training example.

2. For Classification (e.g., Logistic Regression):
The cost function is often the Logarithmic Loss (also known as Cross-Entropy Loss), which measures the error in classifying between categories. For binary classification, this function looks like:

[Figure: log loss]

Complexity with Multiple Variables (Features)
As we move from simple linear regression (with one feature) to multiple linear regression (with several features), the complexity of the model increases. When we add more features to our dataset, it becomes increasingly expensive to compute the optimal values for the parameters (coefficients) analytically, which is why iterative methods such as Gradient Descent are used instead.

Limitations of Linear Regression
Linear regression, though a very powerful algorithm, has certain disadvantages:

1. The main limitation of Linear Regression is the assumption of linearity between the dependent variable and the independent variables. In the real world, relationships between variables are rarely perfectly linear, so a straight-line fit is often only an approximation.

2. Prone to noise and overfitting: If the number of observations is smaller than the number of features, Linear Regression should not be used; the model is likely to overfit and capture noise rather than the underlying relationship.

3. Prone to outliers: Linear regression is very sensitive to outliers. An outlier is a data point that deviates markedly from the rest of the data. Outliers should be analyzed and, where appropriate, removed before applying Linear Regression to the dataset, or the fitted line will be heavily skewed.


[Figure: the fitted line does not correctly predict the data points shown in blue]

Gradient Descent
Gradient Descent is an optimization algorithm used to minimize the cost function by iteratively updating the parameters (coefficients). Gradient Descent iteratively moves toward the optimal solution by following the slope of the cost function.


Intuition Behind Gradient Descent
Imagine you are standing on a mountain peak and want to reach the lowest point (valley). You can’t see where the valley is because you’re blindfolded. The only thing you can do is feel the ground near you and step in the direction where the slope decreases.


In this analogy:

- The mountain is the cost function.
- The goal is to minimize the cost, i.e., find the point with the lowest value (the valley).
- Each step you take corresponds to updating the parameters (coefficients) of the model.

How Gradient Descent Works:
1. Initialize Parameters: Start by assigning random values to the parameters (coefficients).
2. Compute the Cost Function: The cost function represents the error between the predicted values and actual values. In linear regression, this is often the Mean Squared Error.
3. Calculate the Gradient (Slope): Compute the slope of the cost function with respect to each parameter. This slope tells us the direction to move the parameters to reduce the cost.
4. Update Parameters: Adjust the parameters using the gradient and a learning rate (which controls the size of the steps).
5. Repeat: Continue this process until the parameters converge (when further updates make minimal improvements).

Importance of Learning Rate
The learning rate (α) is a crucial hyperparameter that controls how much the coefficients can change on each iteration. If the learning rate is too large, the algorithm might overshoot the minimum and fail to converge. If it’s too small, the algorithm might take too long to find the minimum.

Small Learning Rate: Slow convergence, but more precise.
Large Learning Rate: Faster, but risks overshooting and not finding the minimum.

Feature Scaling in Machine Learning
When working with machine learning algorithms, the features (input variables) can often have different scales. For instance, if one feature is measured in kilometers and another in meters, the range of values can differ significantly. Feature scaling helps normalize or standardize these features, ensuring that the model treats them equally during training.

Why Feature Scaling is Important?
Many machine learning algorithms, especially those based on distance or gradient descent, are sensitive to the scale of the input features. Some examples include:

Gradient Descent: The convergence of gradient descent is faster when features are on a similar scale. Without scaling, features with larger ranges dominate the optimization process, making it inefficient.
Types of Feature Scaling
- Min-Max Normalization (Rescaling)
- Standardization (Z-score scaling)

Let’s explore both techniques with simple examples.

1. Min-Max Normalization
Min-Max normalization scales the data to a fixed range, typically between 0 and 1. Each feature’s minimum value becomes 0, and the maximum value becomes 1.

X_scaled = (X - Xmin) / (Xmax - Xmin)

Where:

X is the original feature value,
Xmin, Xmax are the minimum and maximum values of that feature.

When to Use Min-Max Scaling?
Use Min-Max scaling when you know that the distribution of your data does not contain extreme outliers and is relatively uniform.
It’s commonly used in algorithms like KNN, which are based on distances between data points.
2. Standardization (Z-score Scaling)
Standardization (also called Z-score scaling) transforms the data to have a mean of 0 and a standard deviation of 1. It centers the data by subtracting the mean and then scales by dividing by the standard deviation:

Z = (X - μ) / σ

Where:

X is the original feature value,
μ is the mean of the feature,
σ is the standard deviation of the feature.
When to Use Standardization?
Standardization is useful when your data has outliers or when the distribution of features is not uniform.
It’s widely used in algorithms that rely on the Gaussian distribution, such as logistic regression, linear regression, and support vector machines.
When is Feature Scaling Not Necessary?
Not all algorithms are sensitive to the scale of features. For example:

Tree-based models (like Decision Trees, Random Forests) do not require feature scaling since they are based on splitting points in the data and are not sensitive to the relative scales of the features.
Naive Bayes is also insensitive to feature scaling because it relies on probabilities rather than distance or magnitude.
Using Min-Max Scaling with scikit-learn:
from sklearn.preprocessing import MinMaxScaler
import numpy as np

# Original data
data = np.array([[50, 30], [60, 90], [70, 100]])
# Initialize MinMaxScaler
scaler = MinMaxScaler()
# Scale the data
scaled_data = scaler.fit_transform(data)
print("Scaled Data using Min-Max Scaling:\n", scaled_data)
Using Standardization with scikit-learn:
from sklearn.preprocessing import StandardScaler
# Initialize StandardScaler
scaler = StandardScaler()
# Scale the data
scaled_data = scaler.fit_transform(data)
print("Scaled Data using Standardization:\n", scaled_data)
This is just an introduction to ML; we will cover it in more detail in future blogs. Stay tuned!'''
}

In [7]:
for filename, content in sample_text.items():
    with open(f"../datas/text_files/{filename}", "w", encoding="utf-8") as f:
        f.write(content)
    print(f"Created file: {filename}")

print("All sample text files created successfully.")

Created file: python.txt
Created file: machine_learning.txt
All sample text files created successfully.
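The sample text contains non-ASCII characters (curly quotes, α, μ, σ), so the encoding used for writing should match the one used for reading. The sketch below is a small round-trip check in a temporary directory; the file name and the sample string are made up for illustration.

```python
import tempfile
from pathlib import Path

text = "Learning rate (α) and mean (μ): “train” the model."

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "sample.txt"
    # Passing the encoding explicitly avoids depending on the platform default.
    path.write_text(text, encoding="utf-8")
    round_tripped = path.read_text(encoding="utf-8")

print(round_tripped == text)
```

If the write side relied on a platform default other than UTF-8, a later UTF-8 read of the same file could fail or produce garbled characters.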


## Read using TextLoader

In [8]:
from langchain_community.document_loaders import TextLoader
loader = TextLoader("../datas/text_files/python.txt", encoding="utf-8")

documents = loader.load()

# print(documents)

In [9]:
type(documents)

list
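The result is a list even though only one file was loaded. A rough mental model, assuming the loader produces one record per text file with the file path stored under `source`, is sketched below. `load_text_file` is a hypothetical helper that returns plain dictionaries; it is not how `TextLoader` is implemented.

```python
import tempfile
from pathlib import Path
from typing import Any, Dict, List


def load_text_file(path: Path) -> List[Dict[str, Any]]:
    """Hypothetical helper: return one record per file, wrapped in a list."""
    text = path.read_text(encoding="utf-8")
    return [{"page_content": text, "metadata": {"source": str(path)}}]


with tempfile.TemporaryDirectory() as tmp_dir:
    file_path = Path(tmp_dir) / "python.txt"
    file_path.write_text("What is Python?", encoding="utf-8")
    records = load_text_file(file_path)

print(type(records).__name__, len(records))
print(records[0]["page_content"])
```

Returning a list keeps the interface uniform with loaders that yield many documents per input, such as one per PDF page.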

## Read using DirectoryLoader

In [10]:
from langchain_community.document_loaders import DirectoryLoader
directory_loader = DirectoryLoader(
    "../datas", 
    glob="**/*.txt",
    loader_cls=TextLoader,
    loader_kwargs={'encoding': 'utf-8'},
    show_progress=True
    )
docs = directory_loader.load()

100%|██████████| 2/2 [00:00<00:00, 861.78it/s]


In [11]:
docs

[Document(metadata={'source': '../datas/text_files/machine_learning.txt'}, page_content='\n\n\n\nMachine Learning (ML) is a branch of artificial intelligence (AI) that allows computers to learn and make predictions or decisions without being explicitly programmed for each task. Instead, use data to “train” to understand patterns, make predictions, and improve performance over time.\n\nIn simpler terms, ML enables computers to automatically improve their performance on a task through experience. Just like humans learn from their experiences, machines learn from the data provided to them.\n\nTypes of Machine Learning\nThere are three main types of machine learning:\n1. Supervised Learning\n2. Unsupervised Learning\n3. Reinforcement Learning\n\nLet’s break down each type and see how they work.\n\nSupervised Learning\nIn Supervised Learning, the model is trained on labeled data, meaning each input (the data) has a corresponding output (the label). The model learns the relationship between 
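The `glob="**/*.txt"` pattern is what makes the directory scan recursive. The sketch below shows the same pattern with the standard library's `pathlib`, assuming the loader's pattern matching behaves like `Path.glob`; the directory layout and file names are recreated in a temporary folder purely for illustration.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp_dir:
    root = Path(tmp_dir)
    (root / "text_files").mkdir()
    (root / "pdf_files").mkdir()
    (root / "text_files" / "python.txt").write_text("a", encoding="utf-8")
    (root / "text_files" / "machine_learning.txt").write_text("b", encoding="utf-8")
    (root / "pdf_files" / "paper.pdf").write_bytes(b"%PDF-")

    # "**" matches the root and every subdirectory; "*.txt" filters by suffix.
    matches = sorted(p.relative_to(root).as_posix() for p in root.glob("**/*.txt"))

print(matches)
```

Only the two `.txt` files are matched; the PDF in the sibling folder is skipped, which is why a separate loader pass with a `**/*.pdf` pattern is needed for PDFs.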

## Load Pdf Documents

In [12]:
from langchain_community.document_loaders import DirectoryLoader, PyMuPDFLoader
directory_loader = DirectoryLoader(
    "../datas", 
    glob="**/*.pdf",
    loader_cls=PyMuPDFLoader,
    # loader_kwargs={'encoding': 'utf-8'},
    show_progress=True
    )
pdf_documents = directory_loader.load()


100%|██████████| 4/4 [00:02<00:00,  1.81it/s]


In [13]:
len(pdf_documents)

94

In [14]:
pdf_documents

[Document(metadata={'producer': 'pdfTeX-1.40.17', 'creator': 'LaTeX with hyperref package', 'creationdate': '2019-04-17T00:45:22+00:00', 'source': '../datas/pdf_files/object_detection.pdf', 'file_path': '../datas/pdf_files/object_detection.pdf', 'total_pages': 21, 'format': 'PDF 1.5', 'title': '', 'author': '', 'subject': '', 'keywords': '', 'moddate': '2019-04-17T00:45:22+00:00', 'trapped': '', 'modDate': 'D:20190417004522Z', 'creationDate': 'D:20190417004522Z', 'page': 0}, page_content='THIS PAPER HAS BEEN ACCEPTED BY IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS FOR PUBLICATION\n1\nObject Detection with Deep Learning: A Review\nZhong-Qiu Zhao, Member, IEEE, Peng Zheng,\nShou-tao Xu, and Xindong Wu, Fellow, IEEE\nAbstract—Due to object detection’s close relationship with\nvideo analysis and image understanding, it has attracted much\nresearch attention in recent years. Traditional object detection\nmethods are built on handcrafted features and shallow trainable\narchitect

In [15]:
type(pdf_documents[0])

langchain_core.documents.base.Document
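
The loader returns one `Document` per PDF page, which is why four files become 94 documents. A quick sanity check is to count documents by their `source` metadata key. The sketch below is self-contained: `PageDoc` is a hypothetical stand-in for `Document`, and the sample paths are made up for illustration. In the notebook, `pdf_documents` could be passed to the same function.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class PageDoc:
    """Hypothetical stand-in for langchain_core.documents.Document."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def pages_per_source(documents):
    """Count how many page-level documents came from each source file."""
    return Counter(doc.metadata.get("source", "unknown") for doc in documents)


# Made-up sample data, not the notebook's real files
sample_docs = [
    PageDoc("page 0 text", {"source": "a.pdf", "page": 0}),
    PageDoc("page 1 text", {"source": "a.pdf", "page": 1}),
    PageDoc("page 0 text", {"source": "b.pdf", "page": 0}),
]
counts = pages_per_source(sample_docs)
print(counts)
```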

## Embedding & Vector Store DB

In [3]:
import os
import numpy as np
from sentence_transformers import SentenceTransformer
import chromadb
# from chromadb.settings import Settings
import uuid
from typing import List, Tuple, Dict, Any
from sklearn.metrics.pairwise import cosine_similarity

In [4]:
class EmbeddingManager:
    # Handles embedding generation using SentenceTransformer
    def __init__(self, model_name: str = 'all-MiniLM-L6-v2'):
        '''Initialize the embedding model from the Hugging Face sentence-transformers library.'''
        self.model_name = model_name
        self.model = None
        self._load_model()

    def _load_model(self):
        try:
            print(f"Loading Embedding model: {self.model_name}...")
            self.model = SentenceTransformer(self.model_name)
            print("Embedding model loaded successfully. Embedding dimension:", self.model.get_sentence_embedding_dimension())
        except Exception as e:
            print("Error loading embedding model:", str(e))

    def generate_embeddings(self, texts: List[str]) -> np.ndarray:
        """
        Generate embeddings for a list of texts
        
        Args:
            texts: List of text strings to embed
            
        Returns:
            numpy array of embeddings with shape (len(texts), embedding_dim)
        """
        if not self.model:
            raise ValueError("Model not loaded")
        
        print(f"Generating embeddings for {len(texts)} texts...")
        embeddings = self.model.encode(texts, show_progress_bar=True)
        print(f"Generated embeddings with shape: {embeddings.shape}")
        return embeddings

# Initialize the EmbeddingManager
embedding_manager = EmbeddingManager(model_name='all-MiniLM-L6-v2')

Loading Embedding model: all-MiniLM-L6-v2...
Embedding model loaded successfully. Embedding dimension: 384
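
Once texts are embedded, retrieval comes down to ranking document vectors by their similarity to a query vector. The following is a minimal NumPy sketch of that ranking step, not the code the vector store runs internally. `cosine_top_k` is a hypothetical helper, and the 3-dimensional toy vectors stand in for the 384-dimensional embeddings produced above.

```python
import numpy as np


def cosine_top_k(query_vec, doc_vecs, k=2):
    """Return (index, score) pairs for the k rows of doc_vecs most similar to query_vec."""
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    scores = d @ q                       # cosine similarity of each row with the query
    order = np.argsort(-scores)[:k]      # indices of the k highest scores
    return [(int(i), float(scores[i])) for i in order]


# Toy vectors (assumption: illustration only, not real embeddings)
toy_docs = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [1.0, 1.0, 0.0],
])
toy_query = np.array([1.0, 0.1, 0.0])
top = cosine_top_k(toy_query, toy_docs, k=2)
print(top)
```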


## Vector Store

In [5]:
class VectorStore:
    # Manages the ChromaDB vector store
    def __init__(self, collection_name: str = "pdf_documents", persist_directory: str = "../datas/vector_store"):
        '''Initialize the ChromaDB client and collection.'''
        self.collection_name = collection_name
        self.persist_directory = persist_directory
        self.client = None
        self.collection = None
        self._initialize_store()

    def _initialize_store(self):
        try:
            print("Initializing ChromaDB client...")
            os.makedirs(self.persist_directory, exist_ok=True)
            self.client = chromadb.PersistentClient(path=self.persist_directory)
            
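            self.collection = self.client.get_or_create_collection(
                name=self.collection_name,
                # "hnsw:space": "cosine" makes queries return cosine distance, which a
                # `1 - distance` similarity conversion expects. It only takes effect when
                # the collection is first created (Chroma's default space is squared L2).
                metadata={
                    "description": "Collection of PDF document embeddings",
                    "hnsw:space": "cosine"
                }
                )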
            print("Vector store initialized successfully. Collection name:", self.collection_name)
            print("Number of documents in collection:", self.collection.count())
        except Exception as e:
            print("Error initializing ChromaDB client:", str(e))

    def add_documents(self, documents: List[Any], embeddings: np.ndarray):
        '''Add documents and their embeddings to the collection.'''
        if len(documents) != len(embeddings):
            raise ValueError("Number of documents and embeddings must match.")
        
        print(f"Adding {len(documents)} documents to the vector store...")
        ids = []
        metadatas = []
        documents_text = []
        embeddings_list = []

        for i, (doc, embedding) in enumerate(zip(documents, embeddings)):
            doc_id = f"doc_{uuid.uuid4().hex[:8]}_{i}"
            ids.append(doc_id)

            metadata = dict(doc.metadata)
            metadata["doc_index"] = i
            metadata["context_length"] = len(doc.page_content)
            metadatas.append(metadata)

            documents_text.append(doc.page_content)
            embeddings_list.append(embedding.tolist())

        try:
            self.collection.add(
                ids=ids,
                metadatas=metadatas,
                documents=documents_text,
                embeddings=embeddings_list
            )
            print(f"Successfully added {len(documents)} documents to the vector store.")
            print("Documents added successfully. Total documents in collection:", self.collection.count())
        except Exception as e:
            print("Error adding documents to vector store:", str(e))

vector_store = VectorStore()

Initializing ChromaDB client...
Vector store initialized successfully. Collection name: pdf_documents
Number of documents in collection: 188


In [19]:
# pdf_documents

In [20]:
# Convert documents to embeddings
texts = [doc.page_content for doc in pdf_documents]
embeddings = embedding_manager.generate_embeddings(texts)

# Store in the vector db
vector_store.add_documents(pdf_documents, embeddings)
# texts

Generating embeddings for 94 texts...


Batches: 100%|██████████| 3/3 [00:10<00:00,  3.62s/it]


Generated embeddings with shape: (94, 384)
Adding 94 documents to the vector store...
Successfully added 94 documents to the vector store.
Documents added successfully. Total documents in collection: 188
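
Because `add_documents` builds each ID from a random `uuid4` fragment, running the ingestion cell twice stores every page twice. One possible way to make re-runs idempotent is to derive the ID from the page itself. The sketch below assumes a hypothetical helper, `stable_doc_id`, that hashes the source path, page number, and text. Such IDs could then be passed to the collection's `upsert` method instead of `add`, so a repeated page overwrites its earlier entry.

```python
import hashlib


def stable_doc_id(source: str, page, content: str) -> str:
    """Derive a repeatable ID from a page's source path, page number, and text."""
    digest = hashlib.sha256(f"{source}|{page}|{content}".encode("utf-8")).hexdigest()
    return f"doc_{digest[:16]}"


# Made-up inputs for illustration
id_first_run = stable_doc_id("paper.pdf", 0, "Some page text")
id_second_run = stable_doc_id("paper.pdf", 0, "Some page text")
id_other_page = stable_doc_id("paper.pdf", 1, "Other page text")
print(id_first_run == id_second_run, id_first_run != id_other_page)
```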


# RAG Pipeline Retriever from VectorStore

In [6]:
class RAGRetreiver:
    '''
    Handles Query based Retrieval from the Vector Store
    '''
    def __init__(self, vector_store: VectorStore, embedding_manager: EmbeddingManager):
        '''
        Initialize the RAG Retriever with a vector store and embedding manager.

        Args:
            vector_store: Instance of VectorStore
            embedding_manager: Instance of EmbeddingManager
        '''
        self.vector_store = vector_store
        self.embedding_manager = embedding_manager

    def retrieve(self, query: str, top_k: int = 5, score_threshold: float = 0.0) -> List[Dict[str, Any]]:
        """
        Retrieve relevant documents for a query
        
        Args:
            query: The search query
            top_k: Number of top results to return
            score_threshold: Minimum similarity score threshold
            
        Returns:
            List of dictionaries containing retrieved documents and metadata
        """
        print(f"Retrieving documents for query: '{query}'")
        print(f"Top K: {top_k}, Score threshold: {score_threshold}")
        
        # Generate query embedding
        query_embedding = self.embedding_manager.generate_embeddings([query])[0]
        
        # Search in vector store
        try:
            results = self.vector_store.collection.query(
                query_embeddings=[query_embedding.tolist()],
                n_results=top_k
            )
            
            # Process results
            retrieved_docs = []
            
            if results['documents'] and results['documents'][0]:
                documents = results['documents'][0]
                metadatas = results['metadatas'][0]
                distances = results['distances'][0]
                ids = results['ids'][0]
                
                for i, (doc_id, document, metadata, distance) in enumerate(zip(ids, documents, metadatas, distances)):
                    # Convert distance to a similarity score. This equals cosine similarity only
                    # if the collection uses cosine distance; Chroma's default space is squared L2.
                    similarity_score = 1 - distance
                    
                    if similarity_score >= score_threshold:
                        retrieved_docs.append({
                            'id': doc_id,
                            'content': document,
                            'metadata': metadata,
                            'similarity_score': similarity_score,
                            'distance': distance,
                            'rank': i + 1
                        })
                
                print(f"Retrieved {len(retrieved_docs)} documents (after filtering)")
            else:
                print("No documents found")
            
            return retrieved_docs
            
        except Exception as e:
            print(f"Error during retrieval: {e}")
            return []
        
rag_retriever = RAGRetreiver(vector_store, embedding_manager)

In [22]:
rag_retriever.retrieve('what is object detection')

Retrieving documents for query: 'what is object detection'
Top K: 5, Score threshold: 0.0
Generating embeddings for 1 texts...


Batches: 100%|██████████| 1/1 [00:00<00:00, 29.23it/s]

Generated embeddings with shape: (1, 384)
Retrieved 5 documents (after filtering)





[{'id': 'doc_5fee55da_18',
  'content': 'THIS PAPER HAS BEEN ACCEPTED BY IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS FOR PUBLICATION\n19\n[92] S. Ren, K. He, R. Girshick, and J. Sun, “Faster r-cnn: Towards real-time\nobject detection with region proposal networks,” IEEE Trans. Pattern\nAnal. Mach. Intell., vol. 39, no. 6, pp. 1137–1149, 2017.\n[93] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, “Rethink-\ning the inception architecture for computer vision,” in CVPR, 2016.\n[94] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan,\nP. Doll´ar, and C. L. Zitnick, “Microsoft coco: Common objects in\ncontext,” in ECCV, 2014.\n[95] S. Bell, C. Lawrence Zitnick, K. Bala, and R. Girshick, “Inside-outside\nnet: Detecting objects in context with skip pooling and recurrent neural\nnetworks,” in CVPR, 2016.\n[96] A. Arnab and P. H. S. Torr, “Pixelwise instance segmentation with a\ndynamically instantiated network,” in CVPR, 2017.\n[97] J. Dai, K. He, and J

In [23]:
rag_retriever.retrieve('explain attention is all you need')[0]["content"]

Retrieving documents for query: 'explain attention is all you need'
Top K: 5, Score threshold: 0.0
Generating embeddings for 1 texts...


Batches: 100%|██████████| 1/1 [00:00<00:00,  2.58it/s]

Generated embeddings with shape: (1, 384)
Retrieved 2 documents (after filtering)





'Attention Visualizations\nIt\nis\nin\nthis\nspirit\nthat\na\nmajority\nof\nAmerican\ngovernments\nhave\npassed\nnew\nlaws\nsince\n2009\nmaking\nthe\nregistration\nor\nvoting\nprocess\nmore\ndifficult\n.\n<EOS>\n<pad>\n<pad>\n<pad>\n<pad>\n<pad>\n<pad>\nIt\nis\nin\nthis\nspirit\nthat\na\nmajority\nof\nAmerican\ngovernments\nhave\npassed\nnew\nlaws\nsince\n2009\nmaking\nthe\nregistration\nor\nvoting\nprocess\nmore\ndifficult\n.\n<EOS>\n<pad>\n<pad>\n<pad>\n<pad>\n<pad>\n<pad>\nFigure 3: An example of the attention mechanism following long-distance dependencies in the\nencoder self-attention in layer 5 of 6. Many of the attention heads attend to a distant dependency of\nthe verb ‘making’, completing the phrase ‘making...more difficult’. Attentions here shown only for\nthe word ‘making’. Different colors represent different heads. Best viewed in color.\n13'
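
The retriever reports fewer than `top_k` results here even with a threshold of 0.0, which happens when `1 - distance` goes negative. That conversion is a cosine similarity only when the collection's distance space is cosine. With squared L2 distance on unit-length vectors, the relation is d = 2 - 2·cos, so the similarity is `1 - d / 2`. The sketch below illustrates both conversions on toy vectors. `similarity_from_distance` is a hypothetical helper, and the unit-length assumption must hold for the `"l2"` branch to be valid.

```python
import numpy as np


def similarity_from_distance(distance: float, space: str) -> float:
    """Map a distance to cosine similarity (the 'l2' branch assumes unit-length vectors)."""
    if space == "cosine":
        return 1.0 - distance          # cosine distance = 1 - cos
    if space == "l2":
        return 1.0 - distance / 2.0    # squared L2 on unit vectors = 2 - 2*cos
    raise ValueError(f"Unsupported space: {space}")


# Two unit-length toy vectors (assumption: illustration only)
a = np.array([1.0, 0.0])
b = np.array([0.6, 0.8])

true_cosine = float(a @ b)
squared_l2 = float(np.sum((a - b) ** 2))
cosine_distance = 1.0 - true_cosine

sim_from_l2 = similarity_from_distance(squared_l2, "l2")
sim_from_cosine = similarity_from_distance(cosine_distance, "cosine")
print(sim_from_l2, sim_from_cosine)
```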

## Integrating the VectorDB Context Pipeline with LLM Output

In [7]:
# Simple RAG Pipeline with LLM Output
from langchain_groq import ChatGroq
import os
from dotenv import load_dotenv
load_dotenv()

# Initialize GROQ LLM
api_key = os.getenv("GROQ_API_KEY")

llm = ChatGroq(api_key=api_key, temperature=0.1, model="moonshotai/kimi-k2-instruct-0905", max_tokens=1024)

## Build RAG Function : Retrieve context + response generation
def rag_qa(query, retriever, llm, top_k=3):
    # Retrieve the context
    results = retriever.retrieve(query, top_k=top_k)
    context = "\n\n".join([doc['content'] for doc in results]) if results else ""
    if not context:
        return "No relevant context found to answer the question."
    
    # Generate the answer using LLM
    prompt = f'''Use the following context to answer the question concisely.\n\n
        Context: 
        {context}

        Question: {query}
        Answer:
    '''
    # The f-string above is already interpolated; calling .format() on it
    # would raise if the retrieved context contains braces
    response = llm.invoke(prompt)

    return response.content

In [8]:
answer = rag_qa("vision transformer", rag_retriever, llm, top_k=3)
print("Answer:", answer)

Retrieving documents for query: 'vision transformer'
Top K: 3, Score threshold: 0.0
Generating embeddings for 1 texts...


Batches: 100%|██████████| 1/1 [00:00<00:00,  4.38it/s]


Generated embeddings with shape: (1, 384)
Retrieved 2 documents (after filtering)
Answer: Vision Transformer (ViT) splits an image into fixed-size patches, linearly embeds each patch, adds 1-D position embeddings and a learnable classification token, then processes the sequence through a standard Transformer encoder whose output token is fed to an MLP classification head.
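
An f-string is fully interpolated at the moment it is created, so calling `.format()` on the result is at best a no-op and fails when the retrieved context happens to contain braces (common in code or LaTeX-heavy PDFs). The sketch below shows the pitfall with a hypothetical `build_prompt` helper and a made-up context string. It is an illustration of the string behavior, not part of the pipeline above.

```python
def build_prompt(context: str, query: str) -> str:
    """Assemble the RAG prompt once with f-strings; no further .format() call is needed."""
    return (
        "Use the following context to answer the question concisely.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\n"
        "Answer:"
    )


# Made-up context containing literal braces
context_with_braces = "A Python dict literal looks like {'key': 'value'}."
prompt = build_prompt(context_with_braces, "What does a dict literal look like?")

# Re-formatting the finished prompt treats the braces as replacement fields
try:
    prompt.format(context=context_with_braces, query="ignored")
    format_failed = False
except (KeyError, ValueError, IndexError):
    format_failed = True

print(format_failed)
```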


## Enhanced RAG Pipeline Features

In [10]:
# --- Enhanced RAG Pipeline Features ---
def rag_advanced(query, retriever, llm, top_k=5, min_score=0.2, return_context=False):
    """
    RAG pipeline with extra features:
    - Returns answer, sources, confidence score, and optionally full context.
    """
    results = retriever.retrieve(query, top_k=top_k, score_threshold=min_score)
    if not results:
        return {'answer': 'No relevant context found.', 'sources': [], 'confidence': 0.0, 'context': ''}
    
    # Prepare context and sources
    context = "\n\n".join([doc['content'] for doc in results])
    sources = [{
        'source': doc['metadata'].get('source_file', doc['metadata'].get('source', 'unknown')),
        'page': doc['metadata'].get('page', 'unknown'),
        'score': doc['similarity_score'],
        'preview': doc['content'][:300] + '...'
    } for doc in results]
    confidence = max([doc['similarity_score'] for doc in results])
    
    # Generate answer
    prompt = f"""Use the following context to answer the question concisely.\nContext:\n{context}\n\nQuestion: {query}\n\nAnswer:"""
    # The f-string is already interpolated, so it is passed to the LLM as-is
    response = llm.invoke(prompt)
    
    output = {
        'answer': response.content,
        'sources': sources,
        'confidence': confidence
    }
    if return_context:
        output['context'] = context
    return output

# Example usage:
result = rag_advanced("vision transformer", rag_retriever, llm, top_k=3, min_score=0.01, return_context=True)
print("Answer:", result['answer'])
print("Sources:", result['sources'])
print("Confidence:", result['confidence'])
print("Context Preview:", result['context'][:300])

Retrieving documents for query: 'vision transformer'
Top K: 3, Score threshold: 0.01
Generating embeddings for 1 texts...


Batches: 100%|██████████| 1/1 [00:00<00:00, 21.15it/s]

Generated embeddings with shape: (1, 384)
Retrieved 2 documents (after filtering)





Answer: Vision Transformer (ViT) splits an image into fixed-size patches, linearly embeds each patch, adds 1-D position embeddings and a learnable classification token, then feeds the sequence to a standard Transformer encoder; the encoder’s output for the class token is used for classification via an MLP head.
Sources: [{'source': '../datas/pdf_files/ViT.pdf', 'page': 2, 'score': 0.09102499485015869, 'preview': 'Published as a conference paper at ICLR 2021\nTransformer Encoder\nMLP \nHead\nVision Transformer (ViT)\n*\nLinear Projection of Flattened Patches\n* Extra learnable\n     [ cl ass]  embedding\n1\n2\n3\n4\n5\n6\n7\n8\n9\n0\nPatch + Position \nEmbedding\nClass\nBird\nBall\nCar\n...\nEmbedded \nPatches\nMulti-Head \nAttention\nNor...'}, {'source': '../datas/pdf_files/ViT.pdf', 'page': 2, 'score': 0.09102499485015869, 'preview': 'Published as a conference paper at ICLR 2021\nTransformer Encoder\nMLP \nHead\nVision Transformer (ViT)\n*\nLinear Projection of Flattened Patches\n* Ex
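
The sources above list the same page of `ViT.pdf` twice with an identical score, which suggests that the same page is stored in the collection more than once. Until the store is cleaned up, one option is to drop repeated hits before building the context. This is a minimal sketch: `dedupe_results` is a hypothetical helper that expects the dictionaries returned by `retrieve`, and the sample data is made up.

```python
def dedupe_results(results):
    """Keep the first (best-ranked) hit for each (source, page, content) triple."""
    seen = set()
    unique = []
    for doc in results:
        meta = doc.get("metadata", {})
        key = (meta.get("source"), meta.get("page"), doc.get("content"))
        if key not in seen:
            seen.add(key)
            unique.append(doc)
    return unique


# Made-up retrieval results with one exact repeat
sample_results = [
    {"content": "page two text", "metadata": {"source": "x.pdf", "page": 2}, "similarity_score": 0.5},
    {"content": "page two text", "metadata": {"source": "x.pdf", "page": 2}, "similarity_score": 0.5},
    {"content": "page five text", "metadata": {"source": "x.pdf", "page": 5}, "similarity_score": 0.3},
]
unique_results = dedupe_results(sample_results)
print(len(unique_results))
```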

In [13]:
# --- Advanced RAG Pipeline: Streaming, Citations, History, Summarization ---
from typing import List, Dict, Any
import time

class AdvancedRAGPipeline:
    def __init__(self, retriever, llm):
        self.retriever = retriever
        self.llm = llm
        self.history = []  # Store query history

    def query(self, question: str, top_k: int = 5, min_score: float = 0.2, stream: bool = False, summarize: bool = False) -> Dict[str, Any]:
        # Retrieve relevant documents
        results = self.retriever.retrieve(question, top_k=top_k, score_threshold=min_score)
        if not results:
            answer = "No relevant context found."
            sources = []
            context = ""
        else:
            context = "\n\n".join([doc['content'] for doc in results])
            sources = [{
                'source': doc['metadata'].get('source_file', doc['metadata'].get('source', 'unknown')),
                'page': doc['metadata'].get('page', 'unknown'),
                'score': doc['similarity_score'],
                'preview': doc['content'][:120] + '...'
            } for doc in results]
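            # Build the prompt. Note: with stream=True the loop below echoes the prompt
            # (not the model's answer) in 80-character chunks as a streaming simulation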
            prompt = f"""Use the following context to answer the question concisely.\nContext:\n{context}\n\nQuestion: {question}\n\nAnswer:"""
            if stream:
                print("Streaming answer:")
                for i in range(0, len(prompt), 80):
                    print(prompt[i:i+80], end='', flush=True)
                    time.sleep(0.05)
                print()
            # The f-string is already interpolated, so it is passed to the LLM as-is
            response = self.llm.invoke(prompt)
            answer = response.content

        # Add citations to answer
        citations = [f"[{i+1}] {src['source']} (page {src['page']})" for i, src in enumerate(sources)]
        answer_with_citations = answer + "\n\nCitations:\n" + "\n".join(citations) if citations else answer

        # Optionally summarize answer
        summary = None
        if summarize and answer:
            summary_prompt = f"Summarize the following answer in 2 sentences:\n{answer}"
            summary_resp = self.llm.invoke([summary_prompt])
            summary = summary_resp.content

        # Store query history
        self.history.append({
            'question': question,
            'answer': answer,
            'sources': sources,
            'summary': summary
        })

        return {
            'question': question,
            'answer': answer_with_citations,
            'sources': sources,
            'summary': summary,
            'history': self.history
        }

# Example usage:
adv_rag = AdvancedRAGPipeline(rag_retriever, llm)
result = adv_rag.query("vision transformer", top_k=3, min_score=0.01, stream=True, summarize=True)
print("\nFinal Answer:", result['answer'])
print("Summary:", result['summary'])
print("History:", result['history'][-1])

Retrieving documents for query: 'vision transformer'
Top K: 3, Score threshold: 0.01
Generating embeddings for 1 texts...


Batches: 100%|██████████| 1/1 [00:00<00:00, 11.16it/s]

Generated embeddings with shape: (1, 384)
Retrieved 2 documents (after filtering)
Streaming answer:
Use the following context to answer the question concisely.
Context:
Published a

s a conference paper at ICLR 2021
Transformer Encoder
MLP 
Head
Vision Transform




er (ViT)
*
Linear Projection of Flattened Patches
* Extra learnable
     [ cl ass]  embedding
1
2
3
4
5
6
7
8
9
0
Patch + Position 
Embedding
Class
Bird
Ball
Car
...
Embedded 
Patches
Multi-Head 
Attention
Norm
MLP
Norm
+
L x
+
Transformer Encoder
Figure 1: Model overview. We split an image into ﬁxed-size patches, linearly embed each of them,
add position embeddings, and feed the resulting sequence of vectors to a standard Transformer
encoder. In order to perform classiﬁcation, we use the standard approach of adding an extra learnable
“classiﬁcation token” to the sequence. The illustration of the Transformer encoder was inspired by
Vaswani et al. (2017).
3
METHOD
In model design we follow the original Transformer (Vaswani et al., 2017) as closely as possible.
An advantage of this intentionally simple setup is that scalable NLP Transformer architectures – and
their efﬁcient implementations – can be used almost out of the box.
3.1
VISION TRANSFORMER (VIT)
An overview of the model is depi