
# 🧠 CV Screening RAG Chatbot — Hands‑On

This notebook walks you through building a **Retrieval‑Augmented Generation (RAG) chatbot** over a collection of resumes (CVs).  
We will **manually upload CVs as PDFs** and build a search index that lives on your machine, while embeddings and answers come from the OpenAI API.

---

### 🎯 Goals
- Let HR quickly **search candidate CVs** for skills, experience, and qualifications.
- Use **RAG (Retrieval-Augmented Generation)** with OpenAI models to ground chatbot answers in CV data.

---

⚠️ **Privacy Note**: CV text and your queries will be sent to OpenAI's API to generate responses. Avoid uploading sensitive or personal information unless you have permission.



## 1️⃣ Setup & Imports

We’ll use:
- `PyPDF2` → to extract text from CVs (PDFs)
- `tiktoken` → for tokenization when chunking text
- `faiss` → local vector index to store and search embeddings
- `openai` → for embeddings + chat completions
- `dotenv` → to load API keys from `.env`


In [None]:

import os
import json
import time
from typing import List, Dict
from dataclasses import dataclass

import numpy as np
import faiss
from PyPDF2 import PdfReader
from openai import OpenAI
from dotenv import load_dotenv

# Load environment variables
load_dotenv()

# Check API key
api_key = os.getenv("OPENAI_API_KEY", "").strip()
if not api_key:
    raise RuntimeError("❌ OPENAI_API_KEY not set. Please add it to your .env file.")
client = OpenAI(api_key=api_key)

print("✅ OpenAI client ready")



## 2️⃣ Upload & Parse CVs (PDFs)

We’ll read CVs from a local folder (`./uploads`).  
Each PDF will be converted into plain text for further processing.


In [None]:

UPLOAD_DIR = "uploads"
os.makedirs(UPLOAD_DIR, exist_ok=True)

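def extract_text_from_pdf(pdf_path: str) -> str:
    """Return the text of all pages in a PDF, separated by newlines."""
    reader = PdfReader(pdf_path)
    text = ""
    for page in reader.pages:
        # Image-only (scanned) pages may yield no text; fall back to ""
        text += (page.extract_text() or "") + "\n"
    return text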

# Parse all resumes in uploads/
resumes = {}
for fname in os.listdir(UPLOAD_DIR):
    if fname.lower().endswith(".pdf"):
        path = os.path.join(UPLOAD_DIR, fname)
        resumes[fname] = extract_text_from_pdf(path)

print(f"✅ Loaded {len(resumes)} resumes")
list(resumes.keys())[:5]
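
Text extracted from PDFs often carries ragged whitespace: runs of spaces, trailing spaces, and stacks of blank lines. The sketch below shows one way to tidy it before chunking. The helper name `normalize_whitespace` is hypothetical (it is not part of PyPDF2), and the rules are an assumption about what is useful for CVs rather than a complete cleanup.

```python
import re

def normalize_whitespace(text: str) -> str:
    """Hypothetical helper: tidy whitespace in text extracted from a PDF."""
    # Collapse runs of spaces and tabs into a single space.
    text = re.sub(r"[ \t]+", " ", text)
    # Drop a space that sits directly before or after a line break.
    text = re.sub(r" ?\n ?", "\n", text)
    # Squeeze three or more consecutive newlines down to one blank line.
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()

raw = "Jane  Doe \n\n\n\nSkills:\tPython,   SQL  \n"
print(repr(normalize_whitespace(raw)))
# → 'Jane Doe\n\nSkills: Python, SQL'
```

If this suits your PDFs, it could be applied to each value in `resumes` before the chunking step.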



## 3️⃣ Chunking CV Texts

Why chunking?  
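- CVs can be long. Embedding an entire CV as a single vector blends unrelated sections together, which makes retrieval less precise.  
- We split text into **manageable chunks** (up to 400 tokens) with a small overlap (60 tokens) so sentences cut at a boundary still appear whole in one chunk.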

This helps us retrieve only the most relevant pieces.


In [None]:

import tiktoken

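def chunk_text(text: str, max_tokens: int = 400, overlap: int = 60) -> List[str]:
    """Split text into overlapping chunks of at most max_tokens tokens."""
    if not 0 <= overlap < max_tokens:
        raise ValueError("overlap must be >= 0 and smaller than max_tokens")
    enc = tiktoken.get_encoding("cl100k_base")
    tokens = enc.encode(text)
    chunks = []
    # Advance by (max_tokens - overlap) so consecutive chunks share `overlap` tokens
    for i in range(0, len(tokens), max_tokens - overlap):
        chunk = tokens[i:i+max_tokens]
        chunks.append(enc.decode(chunk))
    return chunks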

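# Example chunking on first resume (needs at least one PDF in uploads/)
if not resumes:
    raise RuntimeError("No PDFs found in uploads/. Add at least one CV and re-run the previous cell.")
sample_resume = next(iter(resumes.values()))
chunks = chunk_text(sample_resume)
print(f"First resume split into {len(chunks)} chunks")
chunks[:2]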
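
To see how `max_tokens` and `overlap` interact, it can help to look at the token spans alone. The sketch below mirrors the loop in `chunk_text` but works on plain integers, so it needs no tokenizer. The helper name `window_spans` is hypothetical and exists only for illustration.

```python
from typing import List, Tuple

def window_spans(n_tokens: int, max_tokens: int = 400, overlap: int = 60) -> List[Tuple[int, int]]:
    """Return the (start, end) token spans a sliding window would produce."""
    if not 0 <= overlap < max_tokens:
        raise ValueError("overlap must be >= 0 and smaller than max_tokens")
    stride = max_tokens - overlap  # how far each window advances
    return [
        (start, min(start + max_tokens, n_tokens))
        for start in range(0, n_tokens, stride)
    ]

for span in window_spans(1000):
    print(span)
# → (0, 400)
# → (340, 740)
# → (680, 1000)
```

Each span starts 340 tokens after the previous one, so neighbours share 60 tokens. Note that for some lengths the final span is redundant: `window_spans(700)` ends with `(680, 700)`, which lies entirely inside the previous span `(340, 700)`. That is harmless for a demo, but you may want to drop such tails in a larger index.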



## 4️⃣ Embeddings + FAISS Index

We’ll convert each chunk into a **vector embedding** using OpenAI.  
Then, we’ll store all vectors in a **FAISS index** for fast similarity search.


In [None]:

@dataclass
class DocChunk:
    doc_id: str
    chunk_id: int
    text: str

all_chunks: List[DocChunk] = []
for doc_name, text in resumes.items():
    chunks = chunk_text(text)
    for i, ch in enumerate(chunks):
        # Skip whitespace-only chunks (e.g. from pages with no extractable text)
        if ch.strip():
            all_chunks.append(DocChunk(doc_id=doc_name, chunk_id=i, text=ch))

print(f"Total chunks: {len(all_chunks)}")

def embed_texts(texts: List[str]) -> np.ndarray:
    """Embed texts in batches of 50 and return a float32 array of shape (n, dim)."""
    vectors = []
    for i in range(0, len(texts), 50):
        batch = texts[i:i+50]
        resp = client.embeddings.create(model="text-embedding-3-small", input=batch)
        vectors.extend([d.embedding for d in resp.data])
        time.sleep(0.5)  # brief pause between batches to ease rate limits
    # FAISS expects float32 vectors
    return np.array(vectors, dtype="float32")

# Embed all chunks
texts = [c.text for c in all_chunks]
embeddings = embed_texts(texts)

# Build FAISS index
dim = embeddings.shape[1]
index = faiss.IndexFlatL2(dim)
index.add(embeddings)

print("✅ FAISS index ready")
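
`IndexFlatL2` performs an exact, brute-force search and reports squared Euclidean distances, so smaller means more similar. OpenAI documents its embeddings as normalized to unit length; assuming that holds, squared L2 distance and cosine similarity are tied by `d² = 2 − 2·cos`, so both give the same ranking. The pure-Python sketch below checks that identity on made-up 3‑dimensional vectors (real embeddings have far more dimensions); all names in it are illustrative.

```python
import math
from typing import Dict, List

def normalize(v: List[float]) -> List[float]:
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def squared_l2(a: List[float], b: List[float]) -> float:
    """Squared Euclidean distance, the quantity a flat L2 index reports."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def dot(a: List[float], b: List[float]) -> float:
    """Dot product; equals cosine similarity when both vectors are unit length."""
    return sum(x * y for x, y in zip(a, b))

# Made-up vectors standing in for a query and three chunk embeddings.
query = normalize([1.0, 2.0, 2.0])
docs: Dict[str, List[float]] = {
    "a": normalize([1.0, 2.0, 1.5]),
    "b": normalize([-2.0, 0.5, 1.0]),
    "c": normalize([2.0, 2.0, 1.0]),
}

for name, vec in docs.items():
    # The two printed numbers should agree up to floating-point error.
    print(name, round(squared_l2(query, vec), 6), round(2 - 2 * dot(query, vec), 6))

ranked_by_distance = sorted(docs, key=lambda n: squared_l2(query, docs[n]))
ranked_by_cosine = sorted(docs, key=lambda n: -dot(query, docs[n]))
print(ranked_by_distance == ranked_by_cosine)
```

This is why a plain L2 index is a reasonable choice here even though similarity between embeddings is usually described in terms of cosine.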



## 5️⃣ Retrieval + Answer Generation (RAG)

Workflow:
1. Embed the user’s question
2. Retrieve top‑k most similar chunks
3. Send them along with the question to OpenAI for a **grounded answer**


In [None]:

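def search_index(query: str, k: int = 3):
    """Return up to k (chunk, distance) pairs, nearest first."""
    q_emb = embed_texts([query])
    D, I = index.search(q_emb, k)
    # FAISS pads results with -1 when the index holds fewer than k vectors; skip those
    return [(all_chunks[i], float(D[0][j])) for j, i in enumerate(I[0]) if i >= 0]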

def answer_with_rag(query: str, k: int = 3) -> str:
    results = search_index(query, k)
    context = "\n---\n".join([f"[{r.doc_id}#{r.chunk_id}] {r.text}" for r, _ in results])
    prompt = f"""You are an HR assistant.
Use the following resume excerpts to answer the question:

{context}

Question: {query}
Answer clearly, citing resume IDs like [filename#chunk]."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}]
    )
    return resp.choices[0].message.content

# Try a sample HR query
print(answer_with_rag("Who has experience with Python and machine learning?"))
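
For "who has…" questions, several of the top‑k chunks may come from the same CV. One option is to group hits by file and keep each file's closest chunk before building the prompt. The sketch below assumes hits are available as plain `(doc_id, chunk_id, distance)` tuples, which could be derived from the pairs `search_index` returns. The helper `best_hit_per_resume` and the sample values are made up for illustration.

```python
from typing import Dict, List, Tuple

Hit = Tuple[str, int, float]  # (doc_id, chunk_id, distance)

def best_hit_per_resume(hits: List[Hit]) -> List[Hit]:
    """Keep the closest chunk per resume, sorted by ascending distance."""
    best: Dict[str, Hit] = {}
    for hit in hits:
        doc_id, _, distance = hit
        # Smaller distance means a closer match.
        if doc_id not in best or distance < best[doc_id][2]:
            best[doc_id] = hit
    return sorted(best.values(), key=lambda h: h[2])

# Illustrative values only, not real search output.
hits = [("alice.pdf", 0, 0.82), ("bob.pdf", 3, 0.91), ("alice.pdf", 2, 0.75)]
for doc_id, chunk_id, distance in best_hit_per_resume(hits):
    print(f"{doc_id}#{chunk_id} distance={distance:.2f}")
# → alice.pdf#2 distance=0.75
# → bob.pdf#3 distance=0.91
```

Retrieving with a larger `k` and then deduplicating this way tends to surface more distinct candidates, at the cost of dropping supporting chunks from the same CV.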



## ✅ Key Takeaways

- RAG lets HR **search CVs** using natural language (not just keywords).
- **Chunking + embeddings** improve retrieval accuracy.
- **FAISS** is fast and lightweight for local experiments.
- With **Streamlit** or Docker, this can be demoed to non-technical HR teams easily.

---

## 📌 Next Steps
- Add filters (e.g., years of experience, location).
- Use a dedicated vector store (Pinecone, Chroma, pgvector) for scale.
- Improve PDF parsing (layout, tables, OCR).  
- Add guardrails for bias + privacy.
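
As a starting point for the privacy guardrail, contact details could be masked before any CV text is sent to the API. The sketch below is a rough assumption of what that might look like: the helper `redact_contacts` and both regular expressions are illustrative, not a vetted PII filter. They will miss some formats and over-match others (a year range such as `2019 - 2023` looks like a phone number to this pattern).

```python
import re

# Crude patterns for illustration only; real PII detection needs more care.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s()-]{7,}\d")

def redact_contacts(text: str) -> str:
    """Replace email addresses and phone-like digit runs with placeholders."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    return PHONE_RE.sub("[PHONE]", text)

sample = "Contact: jane.doe@example.com, +1 (555) 010-2030. 5 years of Python."
print(redact_contacts(sample))
# → Contact: [EMAIL], [PHONE]. 5 years of Python.
```

Applying a step like this to the extracted text before chunking would keep contact details out of both the embeddings and the chat prompt.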

🎉 You now have a working RAG chatbot over CVs!
