# 📄 Resume Retrieval System using RAG + ChromaDB

This project implements a **retrieval-based resume analysis system**: the retrieval half of a 
Retrieval-Augmented Generation (RAG) pipeline, with no LLM generation step.

The system works by:
1. Reading multiple resumes (PDFs)
2. Extracting and cleaning the text
3. Applying one chosen **chunking technique**
4. Embedding the chunks and storing them in **ChromaDB**
5. Asking predefined questions (from a `.txt` file)
6. Retrieving the most relevant sections from each resume

🔍 This allows the system to surface the resume sections most relevant to questions such as:
- What technical skills does each candidate have?
- What projects did they work on?
- Do they have internship experience?
- How strong are their qualifications?
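
As a rough illustration of the questions file, the sketch below writes a few sample questions (hypothetical wording, adapted from the list above) to a temporary `questions.txt` and reads them back one per line, skipping blank lines. The helper name `load_questions` is an assumption introduced for this example only.

```python
import tempfile
from pathlib import Path

# Hypothetical sample content: one question per line, blank lines allowed.
SAMPLE_QUESTIONS = """\
What technical skills does the candidate have?
What projects did the candidate work on?

Does the candidate have internship experience?
"""

def load_questions(path):
    """Return the non-empty, stripped lines of a questions file."""
    with open(path, "r", encoding="utf-8") as f:
        return [line.strip() for line in f if line.strip()]

# Write the sample to a temporary file and read it back.
with tempfile.TemporaryDirectory() as tmp:
    questions_path = Path(tmp) / "questions.txt"
    questions_path.write_text(SAMPLE_QUESTIONS, encoding="utf-8")
    questions = load_questions(questions_path)

for q in questions:
    print(q)
```

Keeping one question per line means the same file can be reused unchanged for every resume.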

📌 Each resume is indexed in its own freshly created collection and queried with the same questions, so results from one candidate never mix with another's.



In [11]:
# Install required libraries if not available

# !pip install chromadb sentence-transformers pymupdf nltk --quiet

In [12]:
import os
os.environ["TRANSFORMERS_NO_TF"] = "1"

import fitz
import chromadb
from chromadb.utils import embedding_functions
import re
import nltk
from nltk.tokenize import sent_tokenize
# Newer NLTK releases may need the "punkt_tab" resource instead of "punkt"
nltk.download("punkt")

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Lenovo\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [13]:
# Chunking Methods

def sentence_chunking(text, max_sentences=5):
    sentences = sent_tokenize(text)
    return [" ".join(sentences[i:i+max_sentences]) for i in range(0, len(sentences), max_sentences)]

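def paragraph_chunking(text):
    # One chunk per non-empty line. Note: read_resume() collapses whitespace
    # within each page, so in this pipeline each "paragraph" is a whole page.
    return [p for p in text.split("\n") if p.strip()]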

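def semantic_chunking(text, size=120, overlap=40):
    # Despite its name, this is a fixed-size word window with overlap (same
    # logic as sliding_window_chunking); it does not detect topic boundaries.
    words = text.split()
    chunks = []
    step = max(1, size - overlap)  # guard against overlap >= size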
    for i in range(0, len(words), step):
        chunks.append(" ".join(words[i:i+size]))
    return chunks

def sliding_window_chunking(text, chunk_size=120, overlap=40):
    words = text.split()
    chunks = []
    step = max(1, chunk_size - overlap)  # guard against overlap >= chunk_size
    for i in range(0, len(words), step):
        chunk_words = words[i:i + chunk_size]
        if not chunk_words:
            break
        chunk = " ".join(chunk_words)
        chunks.append(chunk)
    return chunks
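
To make the overlap behaviour of the word-window chunkers concrete, here is a small self-contained sketch on a toy input. The generator `sliding_window` and the toy sizes (`chunk_size=4`, `overlap=2`) are assumptions chosen for readability; they are not the notebook's defaults.

```python
def sliding_window(words, chunk_size, overlap):
    """Yield (start_index, chunk_words) for overlapping word windows."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - overlap
    for start in range(0, len(words), step):
        yield start, words[start:start + chunk_size]

# Toy input: ten placeholder "words" w0 .. w9.
words = [f"w{i}" for i in range(10)]
windows = list(sliding_window(words, chunk_size=4, overlap=2))

for start, chunk in windows:
    print(start, " ".join(chunk))
```

Each window shares its first two words with the end of the previous one. Note that the final window here contains only words already covered by the window before it; the same kind of redundant tail chunk can appear with larger sizes whenever the remaining words fit inside the overlap.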

In [14]:
# Chunking Selection ONE TIME

CHUNK_METHOD = None

def choose_chunking_once():
    global CHUNK_METHOD
    
    if CHUNK_METHOD is not None:
        return CHUNK_METHOD
    
    print("\n🔹 Select Chunking Method:")
    print("1️⃣ Sentence-based")
    print("2️⃣ Paragraph-based")
    print("3️⃣ Semantic-based")
    print("4️⃣ Sliding Window-based")

    choice = input("Enter 1, 2, 3, or 4: ")

    CHUNK_METHOD = {
        "1": sentence_chunking,
        "2": paragraph_chunking,
        "3": semantic_chunking,
        "4": sliding_window_chunking
    }.get(choice, semantic_chunking)

    return CHUNK_METHOD

In [15]:
# Embedding Selection ONE TIME

def choose_embedding():
    print("\n✨ Choose Embedding Model:")
    print("1️⃣ all-MiniLM-L6-v2  (Fast - Good)")
    print("2️⃣ all-mpnet-base-v2  (Higher Accuracy)")
    print("3️⃣ paraphrase-mpnet-base-v2 (Excellent)")

    choice = input("Enter your choice: ")

    return {
        "1": "all-MiniLM-L6-v2",
        "2": "all-mpnet-base-v2",
        "3": "paraphrase-mpnet-base-v2"
    }.get(choice, "all-MiniLM-L6-v2")


In [16]:
# Read & Preprocess Resume

def preprocess(text):
    text = re.sub(r"\s+", " ", text)
    return text.strip()

def read_resume(path: str) -> str:
    text = ""
    with fitz.open(path) as doc:
        for page in doc:
            raw = page.get_text("text")
            text += preprocess(raw) + "\n"
    return text
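
The effect of the whitespace normalization can be seen on a short made-up page. The sketch below re-declares `preprocess` so it runs on its own; the sample string `raw_page` is hypothetical and not taken from any resume.

```python
import re

def preprocess(text):
    # Same normalization as the notebook: collapse whitespace runs, trim ends.
    return re.sub(r"\s+", " ", text).strip()

# Hypothetical raw page text with newlines, tabs, and repeated spaces.
raw_page = "Skills\n  Python,   SQL\n\nProjects\n\tResume search\n"
clean_page = preprocess(raw_page)

print(repr(clean_page))
```

Because every newline inside a page is replaced by a space, the only line breaks left in the text returned by `read_resume` are the ones it adds between pages. This matters for any chunker that splits on `"\n"`.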

In [17]:
# Create ChromaDB Collection

def create_collection(model_name):
    client = chromadb.Client()

    # Remove the old collection (and its index) if it exists
    try:
        client.delete_collection("resume_chunks")
    except Exception:
        # Most likely the collection does not exist yet; nothing to delete
        pass

    emb = embedding_functions.SentenceTransformerEmbeddingFunction(
        model_name=model_name
    )
    return client.create_collection(name="resume_chunks", embedding_function=emb)

In [18]:
# Index Resume PDFs

def index_resume(collection, pdf_path, base_id):
    global CHUNK_METHOD
    
    if CHUNK_METHOD is None:
        CHUNK_METHOD = choose_chunking_once()

    text = read_resume(pdf_path)
    chunks = CHUNK_METHOD(text)

    ids = [f"{base_id}_{i}" for i in range(len(chunks))]
    collection.add(documents=chunks, ids=ids)

    print(f"📌 Indexed {len(chunks)} chunks → {os.path.basename(pdf_path)}")
    return chunks

In [19]:
# Retrieval

def ask_resume(collection, query, topk=3):
    results = collection.query(query_texts=[query], n_results=topk)
    ids = results["ids"][0]
    docs = results["documents"][0]
    dists = results["distances"][0]

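    print(f"\n🔍 Query: {query}")
    print("--------------------------------------------------")

    for i, (doc_id, text, dist) in enumerate(zip(ids, docs, dists), 1):
        print(f"\n⭐ Result #{i}")
        print(f"ID: {doc_id}")
        # Note: 1 - dist equals cosine similarity only if the collection uses
        # cosine distance; otherwise treat it as a relative score.
        print(f"Similarity: {1 - dist:.4f}")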
        print(f"Text:\n{text[:300]}...")
        print("-" * 50)
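
How to read the `1 - dist` score depends on which distance the collection reports. The sketch below uses plain Python on two made-up 2-D vectors to compare the two cases. It assumes, as I understand Chroma's documented behaviour, that the default space reports squared L2 distance and the cosine space reports `1 - cosine similarity`; the vectors and helper names are illustrative only.

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine_similarity(a, b):
    a, b = normalize(a), normalize(b)
    return sum(x * y for x, y in zip(a, b))

def squared_l2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

# Made-up unit-length "embeddings" for a query and a chunk.
query = normalize([1.0, 0.0])
chunk = normalize([1.0, 1.0])

cos = cosine_similarity(query, chunk)
d_l2 = squared_l2(query, chunk)  # assumed report of a squared-L2 space
d_cos = 1.0 - cos                # assumed report of a cosine space

print(f"cosine similarity      : {cos:.4f}")
print(f"1 - cosine distance    : {1 - d_cos:.4f}")
print(f"1 - squared L2 distance: {1 - d_l2:.4f}")
```

For unit-length vectors the squared L2 distance equals `2 - 2 * cos`, so `1 - dist` under that space is `2 * cos - 1`: it preserves the ranking but understates the similarity. Under a cosine space, `1 - dist` recovers the cosine similarity directly.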

In [None]:
# Main Pipeline

model = choose_embedding()
chunk_method = choose_chunking_once()

PDF_FOLDER = "CVs"
QUESTIONS_FILE = "questions.txt"

# Read all questions once
with open(QUESTIONS_FILE, "r", encoding="utf-8") as f:
    questions = [q.strip() for q in f.readlines() if q.strip()]

print("\n📌 Starting Multi-CV Query System...\n")

# 🔥 Loop over each CV separately (sorted for a reproducible order)
for file in sorted(os.listdir(PDF_FOLDER)):
    if not file.lower().endswith(".pdf"):
        continue

    pdf_path = os.path.join(PDF_FOLDER, file)
    cv_name = os.path.splitext(file)[0]

    print("=" * 70)
    print(f"📌 RESULTS FOR CV: {file}")
    print("=" * 70)

    # Create a separate collection for each CV
    collection = create_collection(model_name=model)

    # Index the specific CV only
    index_resume(collection, pdf_path, base_id=cv_name)

    # Now apply all questions one by one
    for q in questions:
        print(f"\n🔍 Question: {q}")
        ask_resume(collection, q, topk=3)

    print("\n" + "=" * 70)
    print(f"✔ Finished Results for {file}")
    print("=" * 70 + "\n")



✨ Choose Embedding Model:
1️⃣ all-MiniLM-L6-v2  (Fast - Good)
2️⃣ all-mpnet-base-v2  (Higher Accuracy)
3️⃣ paraphrase-mpnet-base-v2 (Excellent)

📌 Starting Multi-CV Query System...

📌 RESULTS FOR CV: cv.pdf

🔹 Select Chunking Method:
1️⃣ Sentence-based
2️⃣ Paragraph-based
3️⃣ Semantic-based
4️⃣ Sliding Window-based

📌 Indexed 4 chunks → cv.pdf

🔍 Question: What are my technical skills?

🔍 Query: What are my technical skills?
--------------------------------------------------

⭐ Result #1
ID: cv_0
Similarity: 0.3435
Text:
ibraheem.khdier@gmail.com +(970) 569-049-126 Ibrahim Khdier https://github.com/IbrahimKhdeir https://www.linkedin.com/in/ibrah eem-khdier-0b34a3252/ Experience Software Engineer Intern OppoTrain Jul 2025 - Sep 2025 ● Deve