# RAG Assignment – Retrieval Augmented Generation System

## 1. Problem Statement
The aim of this project is to build a Retrieval-Augmented Generation (RAG) system that answers user questions using information from a document. Instead of generating answers only from a language model’s memory, the system first retrieves relevant content from the document and then generates the answer from that content. This reduces incorrect or hallucinated answers and improves accuracy.


In [None]:
!pip install --quiet pdfplumber sentence-transformers faiss-cpu



In [None]:
import os
from typing import List, Tuple

import numpy as np
import pdfplumber
from sentence_transformers import SentenceTransformer
import faiss
import textwrap


## 2. Dataset / Knowledge Source
- **Type of data:** PDF document  
- **Data source:** User-uploaded document  

In this project, the dataset is a PDF file uploaded by the user using Google Colab. This PDF acts as the knowledge base for the RAG system. The text from the PDF is extracted and later used to answer user queries.


In [None]:
from google.colab import files

print("/content/ai_notes.pdf")
uploaded = files.upload()

DATASET_FILENAME = list(uploaded.keys())[0]
print("Uploaded file:", DATASET_FILENAME)


/content/ai_notes.pdf


Saving ai_notes.pdf to ai_notes (1).pdf
Uploaded file: ai_notes (1).pdf


## 3. RAG Architecture
The Retrieval-Augmented Generation (RAG) system follows a structured pipeline to answer user queries. First, the document uploaded by the user is processed and converted into text. This text is later divided into smaller chunks and converted into embeddings. When a user enters a query, the system searches for the most relevant text chunks from the vector database and uses them to generate the final answer.

The overall flow of the RAG system is:

User Query  
→ Query Embedding  
→ Vector Database Search (FAISS)  
→ Retrieval of Relevant Text Chunks  
→ Context Preparation  
→ Answer Generation
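
The flow above can be traced end to end on toy data. The following is a minimal sketch, not the notebook's implementation: `toy_embed` is a hypothetical stand-in for the sentence encoder (it only counts vocabulary words), and the final string stands in for answer generation.

```python
import numpy as np

def toy_embed(texts, vocab):
    """Hypothetical stand-in for a sentence encoder: word-count vectors."""
    rows = []
    for text in texts:
        words = text.lower().split()
        rows.append([words.count(term) for term in vocab])
    return np.array(rows, dtype="float32")

def toy_rag(query, chunks, vocab, top_k=1):
    chunk_vecs = toy_embed(chunks, vocab)                 # embed the chunks
    query_vec = toy_embed([query], vocab)[0]              # query embedding
    dists = ((chunk_vecs - query_vec) ** 2).sum(axis=1)   # vector search
    best = np.argsort(dists)[:top_k]                      # retrieval
    context = "\n\n".join(chunks[i] for i in best)        # context preparation
    return f"Question: {query}\n\nContext:\n{context}"    # stand-in for generation

toy_chunks = [
    "supervised learning uses labeled data",
    "reinforcement learning uses reward and punishment",
]
toy_vocab = ["supervised", "labeled", "reinforcement", "reward"]
toy_result = toy_rag("what is reinforcement learning", toy_chunks, toy_vocab)
print(toy_result)
```

Only the chunk about reinforcement learning ends up in the context, because its vector is closest to the query vector.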


In [None]:
def load_pdf(path: str) -> str:
    """Extract text from every page of a PDF and join the pages with newlines."""
    text_parts = []
    with pdfplumber.open(path) as pdf:
        for page in pdf.pages:
            # Pages with no extractable text contribute an empty string.
            page_text = page.extract_text() or ""
            text_parts.append(page_text)
    return "\n".join(text_parts)

def load_txt(path: str) -> str:
    """Read a plain-text file as UTF-8, skipping bytes that cannot be decoded."""
    with open(path, "r", encoding="utf-8", errors="ignore") as f:
        return f.read()

def load_document(path: str) -> str:
    """Choose the loader from the file extension (case-insensitive)."""
    path_lower = path.lower()
    if path_lower.endswith(".pdf"):
        print("Detected PDF file. Extracting text...")
        return load_pdf(path)
    elif path_lower.endswith(".txt"):
        print("Detected TXT file. Reading text...")
        return load_txt(path)
    else:
        raise ValueError("Unsupported file type. Please upload a PDF or TXT file.")


### Document Loading and Text Extraction
In this step, the uploaded document is loaded into the system and converted into raw text. If the uploaded file is a PDF, the text is extracted page by page. This extracted text forms the base data that will later be divided into smaller chunks and used for retrieval in the RAG pipeline.
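
Text extracted page by page keeps the line breaks of the page layout, so a single sentence can be split across lines. One optional cleanup step is sketched below. `normalize_whitespace` is a hypothetical helper that is not part of this notebook, and it also removes paragraph breaks, which may not be wanted for every document.

```python
import re

def normalize_whitespace(text: str) -> str:
    """Collapse every run of spaces, tabs, and newlines into a single space."""
    return re.sub(r"\s+", " ", text).strip()

sample = "learn from data without being explicitly\nprogrammed.  "
cleaned = normalize_whitespace(sample)
print(cleaned)
```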


In [None]:
doc_path = DATASET_FILENAME
raw_text = load_document(doc_path)

print("Total characters in document:", len(raw_text))
print("\nPreview of document text:\n")
print(raw_text[:1000])


Detected PDF file. Extracting text...
Total characters in document: 823

Preview of document text:

Artificial Intelligence (AI) is the branch of computer science that focuses on creating systems
capable of performing tasks that normally require human intelligence.
Machine Learning is a subset of AI that enables systems to learn from data without being explicitly
programmed.
There are three main types of Machine Learning: Supervised Learning, Unsupervised Learning,
and Reinforcement Learning.
Supervised Learning uses labeled data to train models. Common examples include classification
and regression.
Unsupervised Learning works with unlabeled data and is mainly used for clustering and association
tasks.
Reinforcement Learning is based on reward and punishment and is widely used in robotics and
game playing.
AI applications include healthcare, finance, education, autonomous vehicles, and recommendation
systems.


## 4. Text Chunking Strategy
After extracting the raw text from the document, the text is divided into smaller pieces called chunks. Chunking is important because embeddings work better on smaller and meaningful text segments.

- **Chunk size:** 700 characters  
- **Chunk overlap:** 150 characters  

### Reason for Choosing This Strategy
A chunk size of 700 characters is chosen to keep enough context within each chunk. An overlap of 150 characters is used to make sure that important information at the boundary of chunks is not lost. This improves the accuracy of similarity search during retrieval.
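
To see how these two numbers interact, note that each chunk starts `chunk_size - overlap = 550` characters after the previous one. The sketch below only computes the start offsets of the windows; `chunk_starts` is a hypothetical helper used for illustration.

```python
def chunk_starts(text_len: int, chunk_size: int = 700, overlap: int = 150) -> list:
    """Start offsets of fixed-size windows that advance by chunk_size - overlap."""
    step = max(chunk_size - overlap, 1)
    return list(range(0, text_len, step))

starts = chunk_starts(823)
print(starts)  # [0, 550]
```

For the 823-character document used here, this gives two chunks, `text[0:700]` and `text[550:823]`, and characters 550 to 699 appear in both.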


In [None]:
CHUNK_SIZE = 700
CHUNK_OVERLAP = 150

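def chunk_text(text: str,
               chunk_size: int = CHUNK_SIZE,
               overlap: int = CHUNK_OVERLAP) -> List[str]:
    """Split text into fixed-size character windows that overlap by `overlap`."""
    chunks = []
    start = 0
    text_len = len(text)
    # Each window starts `step` characters after the previous one.
    step = max(chunk_size - overlap, 1)

    while start < text_len:
        end = start + chunk_size
        chunk = text[start:end].strip()
        if chunk:
            chunks.append(chunk)
        if end >= text_len:
            # This window already reaches the end of the text; another one
            # would only repeat part of the overlap.
            break
        start += step

    return chunks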

chunks = chunk_text(raw_text)
print("Number of chunks created:", len(chunks))
print("\nExample chunk:\n")
print(textwrap.shorten(chunks[0].replace("\n", " "), width=400))


Number of chunks created: 2

Example chunk:

Artificial Intelligence (AI) is the branch of computer science that focuses on creating systems capable of performing tasks that normally require human intelligence. Machine Learning is a subset of AI that enables systems to learn from data without being explicitly programmed. There are three main types of Machine Learning: Supervised Learning, Unsupervised Learning, and Reinforcement [...]


## 5. Embedding Details
Text chunks are converted into numerical vectors called embeddings. These embeddings capture semantic meaning and allow similarity comparison.

- **Embedding model used:** all-MiniLM-L6-v2  

### Reason for Selection
This model is lightweight, fast, and suitable for Google Colab. It maps each chunk to a 384-dimensional vector and provides good semantic similarity performance for document-based question answering.
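
As an illustration of what "similarity comparison" means for vectors, the sketch below computes cosine similarity with NumPy. The three vectors are made up for the example and are not outputs of all-MiniLM-L6-v2.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Made-up 3-dimensional "embeddings" for illustration only.
v_cat = np.array([0.9, 0.1, 0.0])
v_kitten = np.array([0.8, 0.2, 0.1])
v_invoice = np.array([0.0, 0.1, 0.9])

sim_related = cosine_similarity(v_cat, v_kitten)
sim_unrelated = cosine_similarity(v_cat, v_invoice)
print(sim_related > sim_unrelated)
```

Vectors that point in similar directions score close to 1, and unrelated ones score close to 0.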


In [None]:
EMBED_MODEL_NAME = "sentence-transformers/all-MiniLM-L6-v2"
embedder = SentenceTransformer(EMBED_MODEL_NAME)

def embed_texts(text_list: List[str]) -> np.ndarray:
    emb = embedder.encode(
        text_list,
        convert_to_numpy=True,
        show_progress_bar=True,
        batch_size=32
    )
    return emb

chunk_embeddings = embed_texts(chunks)
print("Embeddings shape:", chunk_embeddings.shape)



Embeddings shape: (2, 384)


## 6. Vector Database
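FAISS (Facebook AI Similarity Search) is used as the vector store. The embeddings are held in an `IndexFlatL2` index, which performs exact nearest-neighbour search by comparing the query with every stored vector using squared Euclidean (L2) distance.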
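
To make the search step concrete, here is a NumPy sketch of an exhaustive nearest-neighbour search by squared L2 distance. It illustrates the computation only and is not how FAISS is implemented; `flat_l2_search` and the sample vectors are hypothetical.

```python
import numpy as np

def flat_l2_search(vectors: np.ndarray, query: np.ndarray, k: int):
    """Exact k-nearest-neighbour search by squared L2 distance."""
    sq_dists = ((vectors - query) ** 2).sum(axis=1)  # distance to every vector
    order = np.argsort(sq_dists)[:k]                 # ids of the k closest
    return sq_dists[order], order

stored = np.array([[0.0, 0.0], [1.0, 0.0], [3.0, 4.0]], dtype="float32")
dists, ids = flat_l2_search(stored, np.array([1.0, 1.0], dtype="float32"), k=2)
print(ids, dists)
```

A smaller distance means a closer match, so results come back in ascending order of distance.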


In [None]:
embedding_dim = chunk_embeddings.shape[1]
index = faiss.IndexFlatL2(embedding_dim)

index.add(chunk_embeddings)
print("Number of vectors stored in index:", index.ntotal)


Number of vectors stored in index: 2


## 7. Query Processing and Retrieval
When a user enters a query, it is converted into an embedding with the same model and compared with the chunk embeddings stored in FAISS. The chunks with the smallest distances are retrieved and used to answer the query.
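
Raw L2 distances can be hard to interpret. If the embeddings have unit length (an assumption worth checking with `np.linalg.norm`) and the index reports squared L2 distance, the distance maps directly to cosine similarity through ‖a − b‖² = 2 − 2·cos(a, b). A small sketch, with `l2_squared_to_cosine` as a hypothetical helper:

```python
import numpy as np

def l2_squared_to_cosine(sq_dist: float) -> float:
    """Cosine similarity implied by a squared L2 distance between unit vectors."""
    return 1.0 - sq_dist / 2.0

# Two unit-length vectors for checking the identity.
a = np.array([1.0, 0.0])
b = np.array([0.6, 0.8])
sq = float(((a - b) ** 2).sum())
cos_from_dist = l2_squared_to_cosine(sq)
print(cos_from_dist, float(a @ b))
```

Under that assumption, a distance near 0 means almost identical direction, while a distance near 2 corresponds to a cosine similarity near 0, which suggests the query is unrelated to the chunk.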


In [None]:
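def retrieve_relevant_chunks(
    query: str,
    top_k: int = 4
) -> Tuple[List[str], np.ndarray]:
    query_emb = embedder.encode([query], convert_to_numpy=True)
    # Never ask for more neighbours than the index holds. FAISS pads missing
    # results with index -1, and chunks[-1] would silently repeat the last chunk.
    top_k = max(1, min(top_k, index.ntotal))
    distances, indices = index.search(query_emb, top_k)
    # Drop any remaining -1 placeholders before looking up chunk text.
    valid = indices[0] >= 0
    idxs = indices[0][valid]
    dists = distances[0][valid]
    retrieved = [chunks[i] for i in idxs]
    return retrieved, dists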

demo_query = "Write your own small test question related to the document."
retrieved_demo, demo_dists = retrieve_relevant_chunks(demo_query, top_k=3)

for i, (chunk, dist) in enumerate(zip(retrieved_demo, demo_dists), start=1):
    print(f"\n=== Retrieved Chunk {i} (distance={dist:.4f}) ===\n")
    print(textwrap.shorten(chunk.replace("\n", " "), width=400))



=== Retrieved Chunk 1 (distance=1.9674) ===

Artificial Intelligence (AI) is the branch of computer science that focuses on creating systems capable of performing tasks that normally require human intelligence. Machine Learning is a subset of AI that enables systems to learn from data without being explicitly programmed. There are three main types of Machine Learning: Supervised Learning, Unsupervised Learning, and Reinforcement [...]

=== Retrieved Chunk 2 (distance=2.0222) ===

data and is mainly used for clustering and association tasks. Reinforcement Learning is based on reward and punishment and is widely used in robotics and game playing. AI applications include healthcare, finance, education, autonomous vehicles, and recommendation systems.


## 7. Answer Generation using RAG Pipeline
After retrieving the most relevant text chunks from the vector database, the system combines them to form a single context. This context is then used to generate the final answer for the user query.

### Context Building
The retrieved chunks are appended in retrieval order until adding the next one would exceed a maximum length of 2,500 characters. This keeps the context within a fixed size while giving priority to the closest matches.

### Answer Generation
In this project, a simple answer generation stub is used instead of a real large language model. The stub clearly demonstrates how the query and retrieved context are combined to form a prompt. This approach helps in understanding the complete RAG pipeline without using external APIs.
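
If the stub were replaced by a real model, the query and context would typically be combined into an instruction-style prompt. The sketch below shows one possible template. `build_prompt` and its wording are hypothetical and are not used elsewhere in this notebook.

```python
def build_prompt(query: str, context: str) -> str:
    """Hypothetical prompt template for a real LLM; the wording is illustrative."""
    return (
        "Answer the question using only the context below. "
        "If the context does not contain the answer, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\n"
        "Answer:"
    )

prompt = build_prompt("What is Machine Learning?", "Machine Learning is a subset of AI.")
print(prompt)
```

Placing the context before the question and ending with `Answer:` is a common layout, but the exact wording is a design choice to tune for the chosen model.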

The final RAG pipeline performs the following steps:
1. Accepts user query  
2. Retrieves top relevant chunks  
3. Builds context from retrieved chunks  
4. Generates answer based on the context  


In [None]:
def build_context(chunks_list: List[str], max_chars: int = 2500) -> str:
    """Concatenate chunks in order, stopping before max_chars is exceeded."""
    context = ""
    for ch in chunks_list:
        if len(context) + len(ch) > max_chars:
            break
        context += ch + "\n\n"
    return context.strip()

def llm_answer_stub(query: str, context: str) -> str:
    """Placeholder for an LLM call: returns the prompt instead of an answer."""
    # Only the first 700 characters of the context are shown.
    prompt_view = f"Question: {query}\n\nContext (truncated):\n{context[:700]}"
    return (
        "RAG Answer (stub):\n"
        "I will base my answer only on the provided context.\n\n"
        + prompt_view
    )

def rag_pipeline(query: str, top_k: int = 4) -> str:
    """Retrieve chunks for the query, build the context, and call the stub."""
    retrieved_chunks, _ = retrieve_relevant_chunks(query, top_k=top_k)
    context = build_context(retrieved_chunks)
    answer = llm_answer_stub(query, context)
    return answer


## 8. Test Queries and Results
To evaluate the performance of the RAG system, multiple test queries are used. These queries are designed to check factual understanding, conceptual explanation, and summarization ability based on the document content.

Each query is passed through the complete RAG pipeline, which retrieves relevant chunks from the document and builds a response from the retrieved context. Because answer generation is a stub, each result shows the query together with the retrieved context rather than a written answer.


In [None]:
test_queries = [
    "Question 1: Ask something factual based on your document.",
    "Question 2: Ask for an explanation of a concept mentioned.",
    "Question 3: Ask for a summary of a specific section or topic."
]

for i, q in enumerate(test_queries, start=1):
    print("=" * 70)
    print(f"Test Query {i}: {q}")
    print("=" * 70)
    response = rag_pipeline(q, top_k=4)
    print(response)
    print("\n")


Test Query 1: Question 1: Ask something factual based on your document.
RAG Answer (stub):
I will base my answer only on the provided context.

Question: Question 1: Ask something factual based on your document.

Context (truncated):
Artificial Intelligence (AI) is the branch of computer science that focuses on creating systems
capable of performing tasks that normally require human intelligence.
Machine Learning is a subset of AI that enables systems to learn from data without being explicitly
programmed.
There are three main types of Machine Learning: Supervised Learning, Unsupervised Learning,
and Reinforcement Learning.
Supervised Learning uses labeled data to train models. Common examples include classification
and regression.
Unsupervised Learning works with unlabeled data and is mainly used for clustering and association
tasks.
Reinforcement Learning is based on reward and punishment and is widely used in robotics


Test Query 2: Question 2: Ask for an explanation of a concept m

## 9. Future Improvements
The current RAG system works correctly, but it can be improved further in the following ways:

1. **Use a real Large Language Model (LLM):**  
   Instead of the simple answer generation stub, a real LLM, such as a model from OpenAI or an open-source model, can be integrated to generate more natural and detailed answers.

2. **Improve Chunking Strategy:**  
   Advanced chunking techniques such as semantic chunking can be used instead of fixed-size chunking to improve retrieval accuracy.

3. **Apply Reranking or Hybrid Search:**  
   Combining keyword-based search with vector search or adding reranking can improve the relevance of retrieved chunks.

4. **Add Metadata Filtering:**  
   Metadata such as page numbers or section titles can be stored and used to filter results more effectively.

5. **User Interface Integration:**  
   A simple web interface using Streamlit or Gradio can be added to make the system easier for users to interact with.

These improvements can enhance performance, usability, and overall system accuracy.
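
As one illustration of the hybrid search idea in item 3, ranked results from vector search and keyword search can be merged with reciprocal rank fusion (RRF). This is a sketch of that general technique, not something the notebook implements. The two input rankings are hypothetical, and `k = 60` is a commonly used constant.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k: int = 60):
    """Merge ranked lists of chunk ids; score(id) = sum of 1 / (k + rank)."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores, key=scores.get, reverse=True)

vector_ranking = [2, 0, 1]   # hypothetical ids ranked by embedding distance
keyword_ranking = [0, 3, 2]  # hypothetical ids ranked by keyword match
fused = reciprocal_rank_fusion([vector_ranking, keyword_ranking])
print(fused)  # [0, 2, 3, 1]
```

Chunks that rank well in both lists rise to the top, and no score normalization between the two search methods is needed because only ranks are used.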


## 10. README / Report

### Project Overview
This project implements a Retrieval-Augmented Generation (RAG) system that answers user queries based on the content of an uploaded document. The system retrieves relevant information from the document and then generates answers using that retrieved context. This approach helps in improving answer accuracy and reduces incorrect or hallucinated responses.

### Tools & Libraries Used
- Python  
- Google Colab  
- SentenceTransformers  
- FAISS  
- pdfplumber  
- NumPy  

### How to Run the Notebook
1. Open the notebook in Google Colab  
2. Upload a PDF or TXT document when prompted  
3. Run all cells in sequence  
4. Edit `demo_query` and `test_queries` to ask questions about the document  

### Conclusion
The implemented RAG system successfully demonstrates document-based question answering using retrieval and semantic search techniques. It fulfills all the requirements of the assignment and provides a strong foundation for further improvements.
