# Retrieval-Augmented Generation (RAG) System

## Problem Statement

Large Language Models (LLMs) generate responses based on their pre-trained knowledge, which may be outdated or inaccurate. To improve response accuracy, a Retrieval-Augmented Generation (RAG) system is used.

This project builds a RAG pipeline that retrieves relevant information from a document and provides it to a language model to generate context-aware and accurate answers.

## Objectives

- Load a document as a knowledge source
- Split the document into smaller text chunks
- Convert text into vector embeddings
- Store embeddings in a vector database
- Retrieve relevant chunks based on user queries
- Generate accurate responses using a language model

## Dataset / Knowledge Source

- **Data Type:** PDF  
- **File Name:** ml_intro.pdf  
- **Source:** Public research paper  
- **Origin:** arXiv (https://arxiv.org)  
- **Domain:** Machine Learning  

This document serves as the knowledge base for the RAG system. The model retrieves relevant information from this PDF to answer user queries.

**File Path:**  
/content/ml_intro.pdf

In [1]:
import os

file_path = "/content/drive/MyDrive/ml_intro.pdf"

if os.path.exists(file_path):
    print("Dataset uploaded successfully.")
else:
    print("File not found. Please upload ml_intro.pdf")

Dataset uploaded successfully.


## Step 1: Install Required Libraries

The following libraries are required for building the RAG system:

- **LangChain** – Framework for building RAG pipelines  
- **FAISS** – Vector database for similarity search  
- **PyPDF** – PDF document loader  
- **Sentence Transformers** – Embedding generation  
- **Transformers** – Language model support  

These libraries are installed using pip in Google Colab.

In [2]:

!pip install -q langchain==0.2.0 langchain-community==0.2.0 langchain-text-splitters==0.2.0 faiss-cpu pypdf sentence-transformers transformers

In [4]:
!pip uninstall -y langchain langchain-community langchain-core langchain-text-splitters pydantic

!pip install -q \
pydantic==1.10.13 \
langchain==0.0.353 \
faiss-cpu \
pypdf \
sentence-transformers \
transformers

Found existing installation: langchain 0.2.0
Uninstalling langchain-0.2.0:
  Successfully uninstalled langchain-0.2.0
Found existing installation: langchain-community 0.2.0
Uninstalling langchain-community-0.2.0:
  Successfully uninstalled langchain-community-0.2.0
Found existing installation: langchain-core 0.2.8
Uninstalling langchain-core-0.2.8:
  Successfully uninstalled langchain-core-0.2.8
Found existing installation: langchain-text-splitters 0.2.0
Uninstalling langchain-text-splitters-0.2.0:
  Successfully uninstalled langchain-text-splitters-0.2.0
Found existing installation: pydantic 1.10.12
Uninstalling pydantic-1.10.12:
  Successfully uninstalled pydantic-1.10.12
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m149.6/149.6 kB[0m [31m12.4 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m158.6/158.6 kB[0m [31m16.5 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m803.1/803.1 k

In [6]:
!pip install -q pypdf faiss-cpu sentence-transformers transformers

## Step 2: Import Libraries

In this step, all required modules are imported for:

- Loading the PDF document
- Splitting text into chunks
- Generating embeddings
- Creating a vector database (FAISS)
- Building the Retrieval-Augmented Generation pipeline
- Running a language model for answer generation



This project uses lightweight libraries to build a stable RAG pipeline:

- PyPDF – PDF loading
- Sentence Transformers – Text embeddings
- FAISS – Vector similarity search
- Transformers – Language model for answer generation

In [7]:
from pypdf import PdfReader
from sentence_transformers import SentenceTransformer
import faiss
import numpy as np
from transformers import pipeline

print("Environment ready (stable version)")

Environment ready (stable version)


## Step 3: Load PDF Document

The PDF document is loaded from Google Drive using PyPDF.

Each page of the document is extracted as text and stored for further processing.

**File Path:**  
/content/drive/MyDrive/ml_intro.pdf

In [8]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [9]:
file_path = "/content/drive/MyDrive/ml_intro.pdf"

reader = PdfReader(file_path)

documents = []
for page in reader.pages:
    text = page.extract_text()
    if text:
        documents.append(text)

print("Total pages loaded:", len(documents))
print("Sample text preview:\n")
print(documents[0][:500])

Total pages loaded: 158
Sample text preview:

Geometric Deep Learning
Grids, Groups, Graphs,
Geodesics, and Gauges
Michael M. Bronstein1, Joan Bruna2, Taco Cohen3, Petar VeliŁković4
May 4, 2021
1Imperial College London / USI IDSIA / Twitter
2New York University
3Qualcomm AI Research. Qualcomm AI Research is an initiative of Qualcomm
Technologies, Inc.
4DeepMind
arXiv:2104.13478v2  [cs.LG]  2 May 2021


## Step 4: Text Chunking Strategy

The extracted text is divided into smaller chunks to improve retrieval performance.

**Chunk Size:** 500 characters  
**Chunk Overlap:** 50 characters  

### Reason for Selection
- Maintains context continuity between chunks
- Prevents loss of important information at chunk boundaries
- Improves semantic search accuracy
- Balances retrieval quality and memory efficiency

In [10]:
chunk_size = 500
chunk_overlap = 50

chunks = []

for doc in documents:
    start = 0
    while start < len(doc):
        end = start + chunk_size
        chunk = doc[start:end]
        chunks.append(chunk)
        start = end - chunk_overlap

print("Total chunks created:", len(chunks))
print("\nSample chunk:\n")
print(chunks[0])

Total chunks created: 881

Sample chunk:

Geometric Deep Learning
Grids, Groups, Graphs,
Geodesics, and Gauges
Michael M. Bronstein1, Joan Bruna2, Taco Cohen3, Petar VeliŁković4
May 4, 2021
1Imperial College London / USI IDSIA / Twitter
2New York University
3Qualcomm AI Research. Qualcomm AI Research is an initiative of Qualcomm
Technologies, Inc.
4DeepMind
arXiv:2104.13478v2  [cs.LG]  2 May 2021


## Step 5: Embedding Generation

Text chunks are converted into numerical vector representations (embeddings) for semantic similarity search.

**Embedding Model Used:**  
sentence-transformers/all-MiniLM-L6-v2

### Reason for Selection
- Lightweight and fast
- Good semantic understanding
- Suitable for CPU execution
- Open-source and free
- Widely used for semantic search tasks

In [11]:
model = SentenceTransformer('all-MiniLM-L6-v2')

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


modules.json:   0%|          | 0.00/349 [00:00<?, ?B/s]

config_sentence_transformers.json:   0%|          | 0.00/116 [00:00<?, ?B/s]

README.md: 0.00B [00:00, ?B/s]

sentence_bert_config.json:   0%|          | 0.00/53.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/612 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/90.9M [00:00<?, ?B/s]

Loading weights:   0%|          | 0/103 [00:00<?, ?it/s]

BertModel LOAD REPORT from: sentence-transformers/all-MiniLM-L6-v2
Key                     | Status     |  | 
------------------------+------------+--+-
embeddings.position_ids | UNEXPECTED |  | 

Notes:
- UNEXPECTED	:can be ignored when loading from different task/architecture; not ok if you expect identical arch.


tokenizer_config.json:   0%|          | 0.00/350 [00:00<?, ?B/s]

vocab.txt: 0.00B [00:00, ?B/s]

tokenizer.json: 0.00B [00:00, ?B/s]

special_tokens_map.json:   0%|          | 0.00/112 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/190 [00:00<?, ?B/s]

In [12]:
embeddings = model.encode(chunks, show_progress_bar=True)

embeddings = np.array(embeddings)

print("Embedding shape:", embeddings.shape)

Batches:   0%|          | 0/28 [00:00<?, ?it/s]

Embedding shape: (881, 384)


## RAG Architecture

User Query  
↓  
Query Embedding  
↓  
FAISS Vector Search  
↓  
Retrieve Top Relevant Chunks  
↓  
Combine Context + Query  
↓  
Language Model  
↓  
Generated Answer

## Step 6: Vector Database

The generated embeddings are stored in a vector database for efficient similarity search.

**Vector Store Used:** FAISS (Facebook AI Similarity Search)

### Reason for Selection
- Fast nearest-neighbor search
- Efficient for large vector datasets
- Works locally without external services
- Lightweight and easy to integrate

In [13]:
dimension = embeddings.shape[1]

index = faiss.IndexFlatL2(dimension)
index.add(embeddings)

print("Total vectors stored in FAISS:", index.ntotal)

Total vectors stored in FAISS: 881


In [14]:
def retrieve(query, top_k=3):
    query_embedding = model.encode([query])
    query_embedding = np.array(query_embedding)

    distances, indices = index.search(query_embedding, top_k)

    results = []
    for idx in indices[0]:
        results.append(chunks[idx])

    return results

## Step 7: Language Model Setup

A language model is used to generate answers based on the retrieved document context.

**Model Used:** google/flan-t5-base

### Reason for Selection
- Open-source and free
- Optimized for question-answering and instruction tasks
- Lightweight enough to run in Google Colab
- Produces structured and relevant responses

In [16]:
from transformers import pipeline

generator = pipeline(
    "text-generation",   # Updated task
    model="google/flan-t5-base",
    max_length=256
)

print("LLM loaded successfully")

model.safetensors:   0%|          | 0.00/990M [00:00<?, ?B/s]

Loading weights:   0%|          | 0/282 [00:00<?, ?it/s]



generation_config.json:   0%|          | 0.00/147 [00:00<?, ?B/s]

tokenizer_config.json: 0.00B [00:00, ?B/s]

spiece.model:   0%|          | 0.00/792k [00:00<?, ?B/s]

tokenizer.json: 0.00B [00:00, ?B/s]

special_tokens_map.json: 0.00B [00:00, ?B/s]

Passing `generation_config` together with generation-related arguments=({'max_length'}) is deprecated and will be removed in future versions. Please pass either a `generation_config` object OR all generation parameters explicitly, but not both.
The model 'T5ForConditionalGeneration' is not supported for text-generation. Supported models are ['PeftModelForCausalLM', 'AfmoeForCausalLM', 'ApertusForCausalLM', 'ArceeForCausalLM', 'AriaTextForCausalLM', 'BambaForCausalLM', 'BartForCausalLM', 'BertLMHeadModel', 'BertGenerationDecoder', 'BigBirdForCausalLM', 'BigBirdPegasusForCausalLM', 'BioGptForCausalLM', 'BitNetForCausalLM', 'BlenderbotForCausalLM', 'BlenderbotSmallForCausalLM', 'BloomForCausalLM', 'BltForCausalLM', 'CamembertForCausalLM', 'LlamaForCausalLM', 'CodeGenForCausalLM', 'CohereForCausalLM', 'Cohere2ForCausalLM', 'CpmAntForCausalLM', 'CTRLLMHeadModel', 'CwmForCausalLM', 'Data2VecTextForCausalLM', 'DbrxForCausalLM', 'DeepseekV2ForCausalLM', 'DeepseekV3ForCausalLM', 'DiffLlamaForCa

LLM loaded successfully


## Step 8: Build RAG Pipeline

The Retrieval-Augmented Generation process consists of:

1. User enters a query
2. Query is converted into an embedding
3. FAISS retrieves the most relevant text chunks
4. Retrieved chunks are combined as context
5. Context + query are given to the language model
6. The model generates the final answer

This ensures responses are grounded in the document instead of relying only on model knowledge.

In [17]:
def rag_answer(query, top_k=3):
    # Step 1: Retrieve relevant chunks
    retrieved_chunks = retrieve(query, top_k=top_k)

    # Step 2: Combine context
    context = "\n".join(retrieved_chunks)

    # Step 3: Create prompt
    prompt = f"""
    Answer the question based only on the context below.

    Context:
    {context}

    Question:
    {query}

    Answer:
    """

    # Step 4: Generate response
    result = generator(prompt)[0]['generated_text']

    return result

In [18]:
print(rag_answer("What is machine learning?"))


    Answer the question based only on the context below.

    Context:
    4 BRONSTEIN, BRUNA, COHEN & VELIČKOVIﬂ
1 Introduction
The last decade has witnessed an experimental revolution in data science
and machine learning, epitomised by deep learning methods. Indeed, many
high-dimensional learning tasks previously thought to be beyond reach –
such as computer vision, playing Go, or protein folding – are in fact feasi-
ble with appropriate computational scale. Remarkably, the essence of deep
learning is built from two simple algorithmic principles: ﬁrst, the notion of

 Learning’ was ﬁrst introduced
by one of the authors of this text in his ERC grant in 2015 and popularised
in the eponymous IEEE Signal Processing Magazine paper (Bronstein et al.,
2017). This paper proclaimed, albeit “with some caution”, the signs of “a
newﬁeldbeingborn.” Giventherecentpopularityofgraphneuralnetworks,
the increasing use of ideas of invariance and equivariance in a broad range
of machine learning applic

## Step 9: Testing the RAG System

The system is tested using multiple queries to evaluate its ability to retrieve relevant information and generate accurate responses.

Minimum three queries are tested as required.

In [19]:
test_queries = [
    "What is machine learning?",
    "What are the main types of machine learning?",
    "Explain supervised learning."
]

for i, query in enumerate(test_queries, 1):
    print(f"\nQuery {i}: {query}")
    answer = rag_answer(query)
    print("Answer:")
    print(answer)
    print("-" * 80)


Query 1: What is machine learning?
Answer:

    Answer the question based only on the context below.

    Context:
    4 BRONSTEIN, BRUNA, COHEN & VELIČKOVIﬂ
1 Introduction
The last decade has witnessed an experimental revolution in data science
and machine learning, epitomised by deep learning methods. Indeed, many
high-dimensional learning tasks previously thought to be beyond reach –
such as computer vision, playing Go, or protein folding – are in fact feasi-
ble with appropriate computational scale. Remarkably, the essence of deep
learning is built from two simple algorithmic principles: ﬁrst, the notion of

 Learning’ was ﬁrst introduced
by one of the authors of this text in his ERC grant in 2015 and popularised
in the eponymous IEEE Signal Processing Magazine paper (Bronstein et al.,
2017). This paper proclaimed, albeit “with some caution”, the signs of “a
newﬁeldbeingborn.” Giventherecentpopularityofgraphneuralnetworks,
the increasing use of ideas of invariance and equivariance

## Step 10: Future Improvements

The current RAG system is functional but can be enhanced in several ways:

### Possible Enhancements
- Implement semantic or dynamic chunking instead of fixed-size chunking
- Use hybrid search (keyword + semantic search)
- Add reranking models to improve retrieval accuracy
- Implement metadata-based filtering for multi-document search
- Store FAISS index persistently instead of rebuilding each time
- Support multiple file formats (PDF, TXT, DOCX, Web)
- Develop a Streamlit or Gradio interface for interactive use

## Project Overview

This project implements a Retrieval-Augmented Generation (RAG) system that answers user queries using information retrieved from a PDF document. The system improves response accuracy by grounding the language model in relevant document context.

## Tools and Libraries Used

- Python
- Google Colab
- PyPDF
- Sentence Transformers
- FAISS (Vector Database)
- HuggingFace Transformers

## RAG Workflow

User Query  
→ Query Embedding  
→ FAISS Similarity Search  
→ Retrieve Relevant Chunks  
→ Context + Query  
→ Language Model  
→ Generated Answer

## Instructions to Run

1. Open the notebook in Google Colab
2. Upload the PDF file to Google Drive
3. Update the file path if necessary
4. Run all cells sequentially
5. Modify the query section to test custom questions