# RAG From Scratch — Step-by-Step Tutorial

This notebook demonstrates how to implement a Retrieval-Augmented Generation (RAG) pipeline from scratch, **without relying on high-level RAG frameworks**.

We will build:
1. **Document Text Extraction** — Read text from PDF files.
2. **Chunking with Overlap** — Split text into manageable pieces.
3. **Embeddings** — Convert text chunks into vector representations.
4. **Retrieval** — Find the most relevant chunks for a given query.
5. **Answer Generation (Basic)** — Use retrieved chunks to form answers.

**Goal:** Learn and understand each step of RAG by coding it ourselves.


## Step 0 — Environment & Setup
Before starting, let's check our Python version and install required libraries.


In [1]:
import sys, platform
print("Python:", sys.version.split()[0])
print("Platform:", platform.platform())


Python: 3.11.9
Platform: Windows-10-10.0.26100-SP0


### Install dependencies
We will use:
- `pdfplumber` for PDF text extraction
- `sentence-transformers` for embeddings
- `numpy` for similarity calculations


In [2]:
!pip install pdfplumber sentence-transformers numpy






## Step 1 — Document Text Extraction
We start by extracting text from a PDF file using `pdfplumber`.

**Inputs:**
- `pdf_path` — Path to the PDF file.

**Outputs:**
- `text` — Extracted text from all pages.


In [None]:
import pdfplumber

pdf_path = 'Enter your PDF Path'
with pdfplumber.open(pdf_path) as pdf:
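    # Join pages with a newline so words at page boundaries do not merge;
    # `or ""` guards against pages for which no text is returned.
    text = "\n".join(page.extract_text() or "" for page in pdf.pages)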

print(text[:500])  # preview first 500 characters


Shivam Kumar
Indian Institute of Technology, Roorkee
(cid:211) +91-7037162459  shivamkushwaha636@gmail.com  shivam k@amsc.iitr.ac.in  Linkedin (cid:135) Github
EDUCATION
Indian Institute of Technology, Roorkee 2023 – 2025
M.Tech - Applied Mathematics and Scientific Computing CGPA: 7.45
Jamia Millia Islamia, New Delhi 2019 – 2021
M.Sc. - Applied Mathematics Percentage: 84.08
INTERESTS
• AI • Time Series • LLM • GenAI
COURSEWORK / SKILLS
• DSA • Machine Learning • NLP • Mathematics
• Soft Computin
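
The `(cid:211)` and `(cid:135)` tokens in the preview are placeholders for glyphs (most likely icon-font characters here) that could not be mapped to Unicode during extraction. A minimal sketch of one way to strip them before chunking is shown below. The helper name `strip_cid_tokens` and the sample string are assumptions made for illustration, not part of the original notebook.

```python
import re

# Matches placeholder tokens such as "(cid:211)".
CID_PATTERN = re.compile(r"\(cid:\d+\)")

def strip_cid_tokens(raw_text):
    """Remove (cid:N) placeholders and collapse the extra spaces they leave behind."""
    without_cid = CID_PATTERN.sub("", raw_text)
    # Collapse runs of spaces or tabs, but keep newlines intact.
    return re.sub(r"[ \t]{2,}", " ", without_cid).strip()

sample = "(cid:211) phone  email  Linkedin (cid:135) Github"
cleaned = strip_cid_tokens(sample)
print(cleaned)
```

In the notebook this could be applied as `text = strip_cid_tokens(text)` right after extraction, if the placeholders turn out to hurt retrieval quality.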


## Step 2 — Generating Unique IDs
The chunking function in Step 3 stores each chunk under a unique ID.
Here we use `uuid-utils` to generate a random (version 4) UUID.


In [4]:
!pip install uuid-utils







In [5]:
import uuid_utils as uuid
example_id = uuid.uuid4()  # avoid the name `id`, which shadows a Python built-in
print(str(example_id))


f3591049-a7a5-4786-823b-df90821658c5
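
Python's standard library `uuid` module also provides `uuid4()`, so the same kind of ID can be generated without an extra install. The sketch below is an alternative, not what the notebook uses. The helper name `make_chunk_id` is hypothetical, and the module is imported as `std_uuid` so it does not rebind the `uuid` name used elsewhere in the notebook.

```python
import uuid as std_uuid  # standard-library module, aliased to avoid clashing with uuid_utils

def make_chunk_id():
    """Return a random (version 4) UUID as a 36-character string."""
    return str(std_uuid.uuid4())

first_id = make_chunk_id()
second_id = make_chunk_id()
print(first_id, second_id)
```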


## Step 3 — Chunking with Overlap
We split the document text into fixed-size chunks. Consecutive chunks share `overlap` characters, so text near a chunk boundary keeps some of its surrounding context.

**Function:** `chunk_overlap(text, chunk_size, overlap)`
- `text` — Full text
- `chunk_size` — Length of each chunk
- `overlap` — Number of overlapping characters between chunks


In [6]:
def chunk_overlap(text, chunk_size, overlap):
    """
    Splits text into chunks of given size with overlaps.

    Args:
        text (str): Full document text.
        chunk_size (int): Size of each chunk in characters.
        overlap (int): Number of characters overlapping between chunks.

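    Returns:
        dict: Mapping from chunk ID (a UUID string) to chunk text.
    """
    if not 0 <= overlap < chunk_size:
        # Otherwise `start` would never advance and the loop would not terminate.
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    chunks = {}
    start = 0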
    while start < len(text):
        chunk_id = str(uuid.uuid4())
        end = start + chunk_size
        chunks[chunk_id] = text[start:end]
        start = end - overlap
    return chunks
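
To see where the chunk boundaries fall, it can help to compute only the character offsets on a small example. The sketch below mirrors the stepping logic of `chunk_overlap`: each chunk starts `chunk_size - overlap` characters after the previous one. The helper name `chunk_spans` is an assumption introduced for illustration.

```python
def chunk_spans(text_len, chunk_size, overlap):
    """Return the (start, end) character offsets of each chunk."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    spans = []
    start = 0
    while start < text_len:
        # Slicing past the end of a string is safe, so the last chunk is simply shorter.
        end = min(start + chunk_size, text_len)
        spans.append((start, end))
        start += chunk_size - overlap  # stride between consecutive chunk starts
    return spans

spans = chunk_spans(text_len=25, chunk_size=10, overlap=3)
print(spans)  # [(0, 10), (7, 17), (14, 24), (21, 25)]
```

Each chunk begins 7 characters after the previous one, so characters 7 to 9 appear in both the first and the second chunk.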


### Apply chunking
We apply our chunking function to the extracted text.


In [7]:
chunks = chunk_overlap(text, chunk_size=500, overlap=50)
len(chunks), list(chunks.items())[:3]


(7,
 [('29a18838-074f-4fff-aa4f-ba83e10aabac',
   'Shivam Kumar\nIndian Institute of Technology, Roorkee\n(cid:211) +91-7037162459  shivamkushwaha636@gmail.com  shivam k@amsc.iitr.ac.in  Linkedin (cid:135) Github\nEDUCATION\nIndian Institute of Technology, Roorkee 2023 – 2025\nM.Tech - Applied Mathematics and Scientific Computing CGPA: 7.45\nJamia Millia Islamia, New Delhi 2019 – 2021\nM.Sc. - Applied Mathematics Percentage: 84.08\nINTERESTS\n• AI • Time Series • LLM • GenAI\nCOURSEWORK / SKILLS\n• DSA • Machine Learning • NLP • Mathematics\n• Soft Computin'),
  ('7777e251-cc4c-4037-9101-829fd316c484',
   'chine Learning • NLP • Mathematics\n• Soft Computing • Deep Learning • Statistics • Optimization\nPROJECTS\nExploratory Data Analysis on FIFA World Cup 2022 Dataset | IIT Roorkee Nov 2023\n• Conducted an in-depth analysis of the FIFA 2022 dataset with a team of four members.\n• Project encompassed data cleaning, feature engineering, data visualization, and comparative analysis.\n• Le

## Step 4 — Load Embedding Model
We use `SentenceTransformer` to convert each chunk into a dense vector embedding.


In [8]:
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")


### Generate embeddings for all chunks
We pass each chunk to the embedding model and store the resulting vectors.


In [9]:
def embedd_chunks(chunks):
    """
    Converts text chunks into embeddings using SentenceTransformer.

    Args:
        chunks (dict): Mapping from chunk ID to chunk text.

    Returns:
        dict: Mapping from chunk ID to embedding vector.
    """
    chunk_embeddings = {}
    for idx, chunk in chunks.items():
        chunk_embeddings[idx] = model.encode(chunk)
    return chunk_embeddings

embeddings = embedd_chunks(chunks)
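
`SentenceTransformer.encode` also accepts a list of strings, so all chunks can be encoded in a single call instead of one at a time. The sketch below takes the encoder as a parameter. Both `embed_chunks_batched` and `toy_encode` are hypothetical names, and the toy encoder exists only so the example runs without downloading a model.

```python
def embed_chunks_batched(chunks, encode):
    """Encode all chunk texts in one call and map each chunk ID to its vector."""
    chunk_ids = list(chunks.keys())
    texts = [chunks[chunk_id] for chunk_id in chunk_ids]
    vectors = encode(texts)  # one vector per input text, in the same order
    return dict(zip(chunk_ids, vectors))

def toy_encode(texts):
    """Stand-in encoder (assumption): maps text to [length, number of spaces]."""
    return [[float(len(t)), float(t.count(" "))] for t in texts]

toy_chunks = {"id-1": "hello world", "id-2": "rag"}
toy_embeddings = embed_chunks_batched(toy_chunks, toy_encode)
print(toy_embeddings)
```

In the notebook, the equivalent call would be `embed_chunks_batched(chunks, model.encode)`.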


## Step 5 — Retrieve Relevant Chunks
Given a query, we find the top-k most relevant chunks using cosine similarity.


In [10]:
import numpy as np

def retrieve_chunks(query, k):
    """
    Retrieves the top-k most relevant chunks for a query.

    Args:
        query (str): The user query.
        k (int): Number of chunks to retrieve.

    Returns:
        list[str]: The k chunk texts most similar to the query, best match first.
    """
    query_embedd = model.encode([query])[0]
    similarity = {}
    for idx, emb in embeddings.items():
        # Cosine similarity: dot product divided by the product of the vector norms.
        sim = np.dot(query_embedd, emb) / (np.linalg.norm(query_embedd) * np.linalg.norm(emb))
        similarity[idx] = sim
    sorted_similarity = sorted(similarity.items(), key=lambda x: x[1], reverse=True)
    # Use `chunk_id` rather than `id` to avoid shadowing the built-in.
    top_chunks = [chunks[chunk_id] for chunk_id, _ in sorted_similarity[:k]]
    return top_chunks
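
The ranking logic can be checked by hand on toy 2-D vectors. The sketch below uses plain Python rather than NumPy so it runs anywhere. The names `cosine_similarity`, `top_k`, and `toy_vectors` are assumptions introduced for this illustration.

```python
import math

def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their Euclidean norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, vectors, k):
    """Return the keys of the k vectors most similar to query_vec, best first."""
    scores = {key: cosine_similarity(query_vec, vec) for key, vec in vectors.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]

toy_vectors = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
ranked = top_k([2.0, 0.1], toy_vectors, k=2)
print(ranked)  # ['a', 'c']
```

The query points almost along the first axis, so `a` ranks first, the diagonal vector `c` second, and `b` is dropped. Cosine similarity ignores vector length, which is why the query's larger magnitude does not affect the ranking.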


### Test retrieval with a sample query
We query the system and print the top-k chunks.


In [11]:
query = "What is CGPA of the candidate?"
k = 3
retrieve_chunks(query, k)


['Shivam Kumar\nIndian Institute of Technology, Roorkee\n(cid:211) +91-7037162459  shivamkushwaha636@gmail.com  shivam k@amsc.iitr.ac.in  Linkedin (cid:135) Github\nEDUCATION\nIndian Institute of Technology, Roorkee 2023 – 2025\nM.Tech - Applied Mathematics and Scientific Computing CGPA: 7.45\nJamia Millia Islamia, New Delhi 2019 – 2021\nM.Sc. - Applied Mathematics Percentage: 84.08\nINTERESTS\n• AI • Time Series • LLM • GenAI\nCOURSEWORK / SKILLS\n• DSA • Machine Learning • NLP • Mathematics\n• Soft Computin',
 'T. LTD. Oct-2019 To Till Now\n• To answer questions of mathematics posted by students on the QA board while maintaining a high level of\nacademic integrity.\nFreelance Expert | BARTLEBY Apr-2022 To Aug-2022\n• To provide students with step-by-step solutions for textbook problems and homework questions online.\nVolunteer | 12th international conference on soft computing and problem solving IIT Roorkee Aug-2023\n• Volunteered during 12th international conference on soft computin

## Step 6 — LLM Answer Generation with Groq API

Now that we can retrieve the most relevant chunks, let's use an LLM to generate an answer.  
We'll use Groq's `llama-3.3-70b-versatile` model, passing in both the **query** and the **retrieved context**.

**Process:**
1. Retrieve top-k relevant chunks from our knowledge base.
2. Combine them into a single context string.
3. Pass query + context to Groq LLM for final answer.
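
Steps 2 and 3 of this process can be tried without an API key. The sketch below only assembles the prompt string and makes no call to Groq. The helper name `build_prompt`, the exact prompt wording, and the toy chunks are assumptions for illustration.

```python
def build_prompt(query, retrieved_chunks):
    """Join retrieved chunks into one context string and wrap it in a prompt."""
    context = "\n".join(retrieved_chunks)
    return (
        "Answer the query using the context below.\n"
        f"query: {query}\n"
        f"context: {context}\n"
        "If the answer is not in the context, politely say so."
    )

example_chunks = [
    "Paris is the capital of France.",
    "Berlin is the capital of Germany.",
]
example_prompt = build_prompt("What is the capital of France?", example_chunks)
print(example_prompt)
```

Printing the assembled prompt before sending it is a cheap way to confirm that the retrieved context actually contains the information the query asks for.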


In [12]:
!pip install groq






In [None]:
from groq import Groq

# Initialize Groq client with your API key
api_key = "Enter Your GROQ API Key"  # Replace with your actual key
client = Groq(api_key=api_key)

def generate_answer(query, k=3):
    """
    Generates an answer using Groq LLM based on retrieved chunks.

    Args:
        query (str): The user's question.
        k (int): Number of top chunks to use as context.

    Returns:
        str: The LLM-generated answer.
    """

    # Step 1: Retrieve top-k chunks
    retrieve_chunk = retrieve_chunks(query, k=k)  # pass k through instead of hard-coding 3

    # Step 2: Build context from retrieved chunks
    retrieve_context = "\n".join(retrieve_chunk)

    # Step 3: Create prompt for the LLM
    prompt = f"""
    You are a helpful assistant who helps others answer queries from the given context.
    query: {query}
    context: {retrieve_context}

    If you don't find the answer in the context, politely say so.
    """

    # Step 4: Get completion from Groq LLM
    chat_completion = client.chat.completions.create(
        messages=[
            # Set an optional system message. This sets the behavior of the
            # assistant and can be used to provide specific instructions for
            # how it should behave throughout the conversation.
            {
                "role": "system",
                "content": "You are a helpful assistant."
            },
            # Set a user message for the assistant to respond to.
            {
                "role": "user",
                "content": prompt,
            }
        ],

        # The language model which will generate the completion.
        model="llama-3.3-70b-versatile"
    )
    return chat_completion.choices[0].message.content

In [14]:
# Example test
query = "What is the CGPA of the candidate?"
print(generate_answer(query, k=1))

The CGPA of the candidate, Shivam Kumar, is 7.45, as mentioned in the EDUCATION section of the context, specifically for his M.Tech in Applied Mathematics and Scientific Computing at the Indian Institute of Technology, Roorkee.


In [15]:
# Example 2 test
query = 'Give the educational information of this candidate'
print(generate_answer(query, k=3))

The educational information of the candidate, Shivam Kumar, is as follows:

* M.Tech in Applied Mathematics and Scientific Computing from the Indian Institute of Technology, Roorkee (2023-2025) with a CGPA of 7.45.
* M.Sc. in Applied Mathematics from Jamia Millia Islamia, New Delhi (2019-2021) with a percentage of 84.08.

This information is available in the "EDUCATION" section of the context.


In [16]:
# Example 3 test
query = 'Tell me about integration and differentiation'
print(generate_answer(query, k=1))

Based on the provided context, I couldn't find any direct information related to integration and differentiation in the context of decision-making. The context appears to be a resume or a personal profile, highlighting the individual's educational background, technical skills, and work experience.

However, I can infer that the individual has a strong background in mathematics, particularly in applied mathematics, and has worked with various software packages and coding platforms. They also have experience in machine learning, NLP, and soft computing, which may involve concepts related to integration and differentiation.

If you're looking for information on integration and differentiation in a specific context, such as calculus or mathematical modeling, I'd be happy to try and provide a general overview. But if you're looking for a direct connection to decision-making, I'm afraid I couldn't find any relevant information in the provided context. Would you like me to try and provide a g

## 📌 Summary & Next Steps

In this notebook, we built a **Retrieval-Augmented Generation (RAG)** system from scratch:

1. **Document Loading** — Extracted text from a file for knowledge base creation.  
2. **Chunking** — Split the text into overlapping chunks for better retrieval.  
3. **Embedding** — Converted each chunk into a vector representation using a sentence transformer model.  
4. **Similarity Search** — Found the most relevant chunks for a query using cosine similarity.  
5. **Retrieval Pipeline** — Built a function to fetch top-k chunks as context.  
6. **LLM Integration** — Used Groq's `llama-3.3-70b-versatile` model to generate final answers from retrieved context.  

---

### ✅ What We Achieved
- Created a **minimal but complete** RAG pipeline without heavy frameworks.
- Kept the process transparent so each step is easy to understand and modify.
- Instructed the LLM, through the prompt, to **politely say so** when the answer is not found in the context.

---

### 🚀 Next Steps
- **Switch to a Vector Database** (FAISS, ChromaDB, Weaviate) for faster and scalable retrieval.
- Experiment with **different embedding models** for better semantic matching.
- Add **multi-document support** to handle a larger knowledge base.
- Fine-tune or prompt-engineer the LLM for domain-specific tasks.
- Deploy as an **API or web app** for interactive querying.

