# AI-Powered Recruitment Agent  
### Intelligent CV Screening, Matching, and Ranking System  

---

## 🎯 **Project Objective**

The goal of this project is to develop an **AI Recruiter Agent** capable of:
- Automatically reading and analyzing unstructured CVs.  
- Matching candidates intelligently to job descriptions using semantic similarity.  
- Ranking candidates based on relevance.  
- Providing human-like explanations for each match.  
- Reducing the manual screening time for HR professionals.  

---

## **Dataset Overview**

We use a Kaggle dataset containing the following key columns:  
`ID`, `Name`, `Role`, `Transcript`, `Resume`, `decision`, `Reason_for_decision`, `Job_Description`  

For this notebook, we focus mainly on **textual columns**:  
`Resume`, `Transcript`, and `Job_Description`, which represent the candidate's background and the job requirements.

---

## **Methodology: CRISP-DM Framework**

| Phase | Description |
|-------|--------------|
| **1. Business Understanding** | Define project goals (automated candidate screening). |
| **2. Data Understanding** | Explore dataset structure and content. |
| **3. Data Preparation** | (Already completed) Preprocess CVs and job descriptions. |
| **4. Modeling** | Convert text into embeddings using BERT, store vectors in Qdrant, and perform similarity-based matching. |
| **5. Evaluation** | Rank candidates and evaluate semantic similarity results. |
| **6. Deployment** | Build an AI Agent with LangGraph that automates the end-to-end recruitment process. |

---

## **Notebook Objective**

This notebook focuses on **Modeling**, **Evaluation**, and **Deployment preparation**.  
Specifically, it will:  
1. Generate **embeddings** for CVs and job descriptions using **Sentence-BERT**.  
2. Store embeddings and metadata in a **Qdrant vector database**.  
3. Implement **retrieval logic** (RAG) to find the best candidates for a job.  
4. Generate **explanations** for candidate-job matches using **Phi-3 (Ollama)**.  
5. Prepare nodes and workflow for final **LangGraph agent integration**.


In [1]:
import pandas as pd
import sys
import os

# Add src folder to path
sys.path.append(os.path.join(os.getcwd(), "../src"))

# Import preprocessing functions
from preprocessing import preprocess_dataframe, add_wordcount_columns

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\sayar\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\sayar\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [2]:
df = pd.read_csv("../data/raw/dataset.csv")
df.head()

Unnamed: 0,ID,Name,Role,Transcript,Resume,decision,Reason_for_decision,Job_Description
0,jasojo159,Jason Jones,E-commerce Specialist,"Interviewer: Good morning, Jason. It's great t...",Here's a professional resume for Jason Jones:\...,reject,Lacked leadership skills for a senior position.,Be part of a passionate team at the forefront ...
1,annma759,Ann Marshall,Game Developer,Interview Scene\n\nA conference room with a ta...,Here's a professional resume for Ann Marshall:...,select,Strong technical skills in AI and ML.,Help us build the next-generation products as ...
2,patrmc729,Patrick Mcclain,Human Resources Specialist,Interview Setting: A conference room in a medi...,Here's a professional resume for Patrick Mccla...,reject,Insufficient system design expertise for senio...,We need a Human Resources Specialist to enhanc...
3,patrgr422,Patricia Gray,E-commerce Specialist,Here's a simulated professional interview for ...,Here's a professional resume for Patricia Gray...,select,Impressive leadership and communication abilit...,Be part of a passionate team at the forefront ...
4,amangr696,Amanda Gross,E-commerce Specialist,Here's the simulated interview:\n\nInterviewer...,Here's a professional resume for Amanda Gross:...,reject,Lacked leadership skills for a senior position.,We are looking for an experienced E-commerce S...


In [4]:
# Extract emails from raw text before preprocessing

import re
def extract_email_from_row(row):
    for col in ["Resume"]:
        val = row.get(col)
        if isinstance(val, str) and val:
            m = re.search(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}", val)
            if m:
                return m.group(0).strip()
    return None

df_email = df.apply(extract_email_from_row, axis=1)
print("Extracted emails:")
print(df_email.head())

Extracted emails:
0         jasonjones@email.com
1       ann.marshall@email.com
2    patrick.mcclain@email.com
3       patriciagray@email.com
4       amanda.gross@email.com
dtype: object


## Data Preprocessing

**Objective:** Clean and standardize textual data from CVs, Transcripts, and Job Descriptions to prepare it for NLP tasks and AI-based candidate matching.

**What we did:**
- Converted text to lowercase to ensure consistency.
- Removed URLs, HTML tags, and email addresses to eliminate noise.
- Removed punctuation and special characters.
- Tokenized text into words.
- Removed common stopwords to focus on meaningful terms.
- Applied lemmatization to reduce words to their base forms.
- Calculated word counts before and after cleaning for comparison.

**Why it is important:**
- Ensures embeddings (CBOW, Skip-gram, BERT) capture meaningful information rather than irrelevant noise.
- Reduces vocabulary size, improves model efficiency, and enhances AI agent understanding of candidate skills and qualifications.
- Prepares the dataset for the next steps: embedding generation, similarity search, and candidate ranking.
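
The cleaning steps listed above can be sketched roughly as follows. This is a minimal, dependency-free illustration and not the code in `preprocessing.py`, which is not shown in this notebook: `clean_text`, `word_count`, the regular expressions, and the tiny `STOPWORDS` set are all assumptions made for the example. The real module loads NLTK's stopword list and WordNet data (see the download messages above), and lemmatization is only marked by a comment here.

```python
import re

# Tiny illustrative stopword list (assumption); the notebook's module uses NLTK's full list.
STOPWORDS = {"a", "an", "and", "for", "in", "is", "of", "the", "to", "with"}

URL_RE = re.compile(r"https?://\S+|www\.\S+")
HTML_RE = re.compile(r"<[^>]+>")
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
NON_ALNUM_RE = re.compile(r"[^a-z0-9\s]")


def clean_text(text: str) -> str:
    """Lowercase, strip URLs/HTML/emails/punctuation, tokenize, drop stopwords."""
    text = text.lower()
    for pattern in (URL_RE, HTML_RE, EMAIL_RE):
        text = pattern.sub(" ", text)
    text = NON_ALNUM_RE.sub(" ", text)
    tokens = text.split()
    # A lemmatizer (e.g. NLTK's WordNetLemmatizer) would be applied per token here.
    return " ".join(t for t in tokens if t not in STOPWORDS)


def word_count(text: str) -> int:
    """Count whitespace-separated tokens."""
    return len(text.split())


raw = ("Senior <b>E-commerce</b> Specialist with 5 years of experience. "
       "Contact: jane@email.com, https://example.com")
cleaned = clean_text(raw)
print(cleaned)  # senior e commerce specialist 5 years experience contact
print(word_count(raw), "->", word_count(cleaned))  # 11 -> 8
```

Note that removing punctuation splits hyphenated terms, which matches the `e commerce` tokens visible in the `*_clean` columns below.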


In [5]:
# Columns to preprocess
text_columns = ['Resume', 'Transcript', 'Job_Description']

# Apply preprocessing
df = preprocess_dataframe(df, text_columns)
df = add_wordcount_columns(df, text_columns)

# Quick check
df.head()

Unnamed: 0,ID,Name,Role,Transcript,Resume,decision,Reason_for_decision,Job_Description,Resume_clean,Transcript_clean,Job_Description_clean,Resume_wordcount,Resume_clean_wordcount,Transcript_wordcount,Transcript_clean_wordcount,Job_Description_wordcount,Job_Description_clean_wordcount
0,jasojo159,Jason Jones,E-commerce Specialist,"Interviewer: Good morning, Jason. It's great t...",Here's a professional resume for Jason Jones:\...,reject,Lacked leadership skills for a senior position.,Be part of a passionate team at the forefront ...,professional resume jason jones jason jones e ...,interviewer good morning jason great meet welc...,part passionate team forefront machine learnin...,342,264,606,339,22,13
1,annma759,Ann Marshall,Game Developer,Interview Scene\n\nA conference room with a ta...,Here's a professional resume for Ann Marshall:...,select,Strong technical skills in AI and ML.,Help us build the next-generation products as ...,professional resume ann marshall ann marshall ...,interview scene conference room table two chai...,help u build next generation product game deve...,51,41,635,347,17,13
2,patrmc729,Patrick Mcclain,Human Resources Specialist,Interview Setting: A conference room in a medi...,Here's a professional resume for Patrick Mccla...,reject,Insufficient system design expertise for senio...,We need a Human Resources Specialist to enhanc...,professional resume patrick mcclain patrick mc...,interview setting conference room medium sized...,need human resource specialist enhance team te...,405,287,739,392,19,13
3,patrgr422,Patricia Gray,E-commerce Specialist,Here's a simulated professional interview for ...,Here's a professional resume for Patricia Gray...,select,Impressive leadership and communication abilit...,Be part of a passionate team at the forefront ...,professional resume patricia gray patricia gra...,simulated professional interview e commerce sp...,part passionate team forefront cloud computing...,319,241,843,490,22,13
4,amangr696,Amanda Gross,E-commerce Specialist,Here's the simulated interview:\n\nInterviewer...,Here's a professional resume for Amanda Gross:...,reject,Lacked leadership skills for a senior position.,We are looking for an experienced E-commerce S...,professional resume amanda gross amanda gross ...,simulated interview interviewer good morning a...,looking experienced e commerce specialist join...,357,274,585,335,20,13


In [6]:
df["email"] = df_email

In [7]:
# Make sure processed folder exists
os.makedirs("../data/processed", exist_ok=True)

# Save cleaned CSV
df.to_csv("../data/processed/cv_data_clean.csv", index=False)
print("Cleaned data saved to data/processed/cv_data_clean.csv")

Cleaned data saved to data/processed/cv_data_clean.csv


# Modeling Phase

## Embedding Generation

### Objective  
Transform textual data (CVs and Job Descriptions) into **numerical vector representations** that capture semantic meaning.  
These embeddings will later be stored in **Qdrant**, allowing our system to perform **similarity-based retrieval** between job requirements and candidate profiles.
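
To make "similarity-based retrieval" concrete, here is a small sketch of ranking candidates by cosine similarity to a job vector. It is written in plain Python with toy 3-dimensional vectors; `cosine_similarity` and `top_k_candidates` are hypothetical helpers for illustration, and in the actual pipeline the search is delegated to Qdrant rather than computed in a loop like this.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat an all-zero vector as matching nothing
    return dot / (norm_a * norm_b)


def top_k_candidates(job_vec, cv_vecs, k=3):
    """Return the k (candidate_id, score) pairs with the highest similarity."""
    scored = [(cid, cosine_similarity(job_vec, vec)) for cid, vec in cv_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy data: real embeddings in this notebook have 768 dimensions.
job = [1.0, 0.0, 1.0]
cvs = {"cv_a": [1.0, 0.0, 1.0], "cv_b": [0.0, 1.0, 0.0], "cv_c": [1.0, 1.0, 1.0]}
ranked = top_k_candidates(job, cvs, k=2)
for cid, score in ranked:
    print(cid, round(score, 4))
```

Here `cv_a` points in the same direction as the job vector and ranks first, `cv_c` overlaps partially, and `cv_b` is orthogonal and falls outside the top two.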

### How It Works  
- Uses **BERT (Bidirectional Encoder Representations from Transformers)**, a deep contextual model trained to understand language semantics.  
- Each text (CV or job description) is tokenized, passed through BERT, and converted into a **768-dimensional embedding vector**.  
- The **mean pooling** of token embeddings represents the global meaning of the text.  
- These embeddings are the foundation of the **matching and ranking logic** in the next stages.
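
The mean-pooling step can be illustrated with a short sketch. It assumes padded positions are excluded via an attention mask, which is the usual approach but is an assumption about `embeddings.py`, whose source is not shown here. `mean_pool` is a hypothetical helper, and plain lists with 2-dimensional toy vectors stand in for the model's 768-dimensional token tensors.

```python
def mean_pool(token_embeddings, attention_mask):
    """Average token vectors, ignoring padded positions (mask value 0)."""
    dim = len(token_embeddings[0])
    summed = [0.0] * dim
    real_tokens = 0
    for vec, keep in zip(token_embeddings, attention_mask):
        if keep:
            real_tokens += 1
            for i, value in enumerate(vec):
                summed[i] += value
    if real_tokens == 0:
        return summed  # nothing to average; return the zero vector
    return [s / real_tokens for s in summed]


# Three token vectors; the last one is padding and must not affect the result.
tokens = [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]
mask = [1, 1, 0]
sentence_vec = mean_pool(tokens, mask)
print(sentence_vec)  # [2.0, 3.0]
```

Without the mask, the padding vector would pull the average toward `[9.0, 9.0]`, which is why batched inputs of different lengths need this correction.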

### In This Notebook  
We call the function `generate_embeddings()` from `embeddings.py` to process two main columns:  
- `Resume_clean` → Candidate information.  
- `Job_Description_clean` → Job requirements.  

The output (`bert_embeddings`) will contain two matrices of shape `(num_texts × 768)` representing both sides of our recruitment problem.


In [None]:
import sys, os
sys.path.append(os.path.join(os.getcwd(), "../src"))

from embeddings import generate_embeddings

text_columns = ['Resume_clean', 'Job_Description_clean']
bert_embeddings = generate_embeddings(df, text_columns, batch_size=16)

In [8]:
import sys, os
sys.path.append(os.path.join(os.getcwd(), "../src"))

from embeddings import generate_embeddings

text_columns = ['Resume_clean']
bert_embeddings = generate_embeddings(df[:100], text_columns, batch_size=16)

  from .autonotebook import tqdm as notebook_tqdm


Generating BERT embeddings for 'Resume_clean' ...


BERT batches: 100%|██████████| 7/7 [00:22<00:00,  3.18s/it]

'Resume_clean' embeddings done! Shape: (100, 768)





In [9]:
# Display shapes of generated embeddings
for col, emb in bert_embeddings.items():
    print(f"Column: {col}, Embedding shape: {emb.shape}")

# Display first 3 embeddings for each column
for col, emb in bert_embeddings.items():
    print(f"\nColumn: {col} - First 3 embeddings:")
    for i in range(min(3, len(emb))):
        print(f"Index {i}:", emb[i][:10], "...")  # show first 10 values of each vector


Column: Resume_clean, Embedding shape: (100, 768)

Column: Resume_clean - First 3 embeddings:
Index 0: [-0.1221816   0.16356383  0.5868464  -0.13695338  0.3472323   0.01262982
  0.25611553  0.06087828 -0.10591841 -0.20190999] ...
Index 1: [-0.05073959 -0.11806602  0.27505004  0.0218641   0.09115624 -0.16021815
  0.02668428  0.05024377  0.25067604 -0.16618688] ...
Index 2: [-0.21807513  0.29017293  0.47447053 -0.19598886  0.38506976 -0.00517338
  0.27680123 -0.01751059 -0.08286051 -0.18198432] ...


In [11]:
from storage import CVStorage

storage = CVStorage()
storage.add_cv_embeddings(bert_embeddings['Resume_clean'], df)

Created collection 'cvs' in Qdrant.
Stored 100 CV embeddings in collection 'cvs'.


In [10]:
from storage import CVStorage
import numpy as np

# Initialize storage
cv_storage = CVStorage(host="localhost", port=6333, collection_name="cvs")

# Scroll first 10 points and request vectors back
points, next_page = cv_storage.client.scroll(
    collection_name="cvs",
    limit=10,
    with_vectors=True,
    with_payload=True,
)

print("Retrieved points from Qdrant:\n")

def extract_vector(p):
    # Single unnamed vector
    if getattr(p, "vector", None) is not None:
        return p.vector
    # Named vectors
    vs = getattr(p, "vectors", None)
    if vs is None:
        return None
    # vs may be dict-like or a VectorStruct with .data
    if isinstance(vs, dict):
        return vs.get("default") or (next(iter(vs.values())) if vs else None)
    data = getattr(vs, "data", None)
    if isinstance(data, dict):
        return data.get("default") or (next(iter(data.values())) if data else None)
    return None

for i, p in enumerate(points):
    vec_raw = extract_vector(p)
    if vec_raw is None:
        print(f"CV {i}, candidate_id: {p.payload.get('candidate_id')}, vector is None (not returned).")
        continue
    vec = np.asarray(vec_raw, dtype=float)
    print(
        f"CV {i}, candidate_id: {p.payload.get('candidate_id')}, "
        f"vector length: {vec.shape[0]}, first 10 values: {vec[:10]}, any NaNs: {np.isnan(vec).any()}"
    )

Using existing collection 'cvs' in Qdrant.
Retrieved points from Qdrant:

CV 0, candidate_id: 0, vector length: 768, first 10 values: [-0.0156574   0.02096048  0.07520354 -0.01755039  0.04449734  0.00161849
  0.03282085  0.00780147 -0.0135733  -0.02587448], any NaNs: False
CV 1, candidate_id: 1, vector length: 768, first 10 values: [-0.00639109 -0.01487142  0.0346449   0.00275397  0.01148191 -0.02018085
  0.00336111  0.00632863  0.03157479 -0.02093266], any NaNs: False
CV 2, candidate_id: 2, vector length: 768, first 10 values: [-0.02698374  0.03590483  0.05870907 -0.02425087  0.04764698 -0.00064013
  0.03425027 -0.00216669 -0.01025283 -0.02251801], any NaNs: False
CV 3, candidate_id: 3, vector length: 768, first 10 values: [-0.01062189  0.0214748   0.06493713 -0.02393205  0.04250755 -0.00113168
  0.04098814  0.00662222 -0.00417765 -0.02143684], any NaNs: False
CV 4, candidate_id: 4, vector length: 768, first 10 values: [-0.00082673  0.02626257  0.06901769 -0.02630627  0.04887611  0.00
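
The vectors read back from Qdrant are smaller in magnitude than the ones printed right after `generate_embeddings`. The sketch below compares the first three components of candidate 0 from both outputs and finds a constant ratio, which is consistent with the vectors having been L2-normalized before storage. That interpretation is an assumption, since `storage.py` is not shown; `l2_normalize` and `dot` are hypothetical helpers that illustrate why normalization is convenient: for unit vectors, the dot product equals the cosine similarity.

```python
import math

# First three components of candidate 0, copied from the two outputs in this notebook.
raw = [-0.1221816, 0.16356383, 0.5868464]      # printed after generate_embeddings
stored = [-0.0156574, 0.02096048, 0.07520354]  # read back from Qdrant

# A constant ratio across components means the vector was only rescaled.
ratios = [s / r for s, r in zip(stored, raw)]
print([round(x, 4) for x in ratios])  # [0.1281, 0.1281, 0.1281]


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


a = l2_normalize([3.0, 4.0])
b = l2_normalize([4.0, 3.0])
print(round(dot(a, b), 4))  # 0.96, the cosine similarity of the original vectors
```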

In [8]:
# test_retriever.py

import pandas as pd
from retriever import CVRetriever



# -----------------------------
# Initialize Retriever
# -----------------------------
retriever = CVRetriever(storage=storage, top_k=3)

# -----------------------------
# Test with a job description
# -----------------------------
job_desc = """
Looking for a Data Scientist with experience in Python, ML, NLP, and data visualization.
Must have strong analytical skills and 2+ years of industry experience.
"""

top_candidates = retriever.retrieve_top_candidates(job_desc)

# -----------------------------
# Display results
# -----------------------------
for i, c in enumerate(top_candidates, 1):
    print(f"\nCandidate {i}:")
    print(f"ID: {c['candidate_id']}, Score: {c['score']:.4f}")
    print(f"Resume Text (truncated): {c['cv_text'][:200]}...")
    print(f"Explanation:\n{c['explanation']}")


BERT batches: 100%|██████████| 1/1 [00:00<00:00,  2.70it/s]



Candidate 1:
ID: 33, Score: 0.7152
Resume Text (truncated): professional resume susan edward susan edward quality assurance engineer contact information address 123 main st anytown usa 12345 phone 555 555 5555 email linkedin linkedin com susanedwardsqa profess...
Explanation:
Susan Edward's resume reveals that she is a highly motivated Quality Assurance Engineer with over five years of professional experience in the software testing field which aligns well with seeking someone for Data Science roles emphasizing on analytical skills required by your job description. Although her primary skill set doesn't specifically mention Python, machine learning (ML), natural language processing (NLP), or data visualization as presented here, Susan has a proven track record in developing automated tests and performance testing which may suggest experience with the kind of technical troubleshooting that can be related to dealing with ML systems. 

Here are some highlighted skills from her resume:


# LangGraph

In [4]:
job_desc = """
Looking for a Data Scientist with experience in Python, ML, NLP, and data visualization.
Must have strong analytical skills and 2+ years of industry experience.
"""

In [1]:
import sys, os
sys.path.append(os.path.join(os.getcwd(), "../src"))
from agentGraph import run_agent_simple, run_agent

job_desc = """
Looking for a Data Scientist with experience in Python, ML, NLP, and data visualization.
Must have strong analytical skills and 2+ years of industry experience.
"""


print("\nGraph pipeline:")
res_graph = run_agent(job_desc, top_k=5, with_explain=True)
print(len(res_graph), [ (r.get("candidate_id"), round(float(r.get("score", 0.0)), 4)) for r in res_graph ])


[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\sayar\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\sayar\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
  from .autonotebook import tqdm as notebook_tqdm


Using existing collection 'cvs' in Qdrant.

Graph pipeline:


BERT batches: 100%|██████████| 1/1 [00:00<00:00,  6.77it/s]


[retrieve] job_vec shape: (768,)
[retrieve] search hits: 10
[retrieve] kept cvs: 10
[analyze] cvs in: 10
[analyze] analyzed: 10
[rank] ids: 10, embs: 10, job_emb: (768,)
[rank] ranking size: 10
[rank] top scores: [0.8135, 0.8132, 0.8117]
[explain] ranked in: 10, top_k: 5
[explain] results: 5
5 [(19, 0.8135), (33, 0.8132), (18, 0.8117), (26, 0.8004), (92, 0.8001)]
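
The log above shows the request passing through named stages (`retrieve`, `analyze`, `rank`, `explain`) that share data. The sketch below mimics that pattern in plain Python: each node receives the current state and returns the keys it wants to update. It does not use the LangGraph API and is not the code in `agentGraph.py`. Every function, key name, and placeholder value is hypothetical, and the `analyze` stage is omitted because its behavior is not described in this notebook.

```python
from typing import Callable, Dict, List

State = Dict[str, object]
Node = Callable[[State], State]


def retrieve(state: State) -> State:
    # Placeholder: a real node would query the vector store with the job embedding.
    return {"cvs": [{"candidate_id": 1, "score": 0.2}, {"candidate_id": 2, "score": 0.9}]}


def rank(state: State) -> State:
    # Order the retrieved candidates by descending similarity score.
    ranked = sorted(state["cvs"], key=lambda c: c["score"], reverse=True)
    return {"ranked": ranked}


def explain(state: State) -> State:
    top = state["ranked"][: state["top_k"]]
    # Placeholder: a real node would ask the LLM for a justification per candidate.
    return {"results": [dict(c, explanation="<LLM output>") for c in top]}


def run_pipeline(nodes: List[Node], state: State) -> State:
    """Run nodes in order, merging each node's updates into the shared state."""
    for node in nodes:
        state = {**state, **node(state)}
        print(f"[{node.__name__}] keys: {sorted(state)}")
    return state


final = run_pipeline([retrieve, rank, explain], {"job_desc": "Data Scientist", "top_k": 1})
print(final["results"])
```

Keeping each stage as a function from state to state updates is what makes the stages easy to log, test in isolation, and later register as graph nodes.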


In [2]:
# In notebook
for r in res_graph:
    print(f"ID {r['candidate_id']}, Score {r['score']:.4f}")
    print("Explanation:", (r.get("explanation") or "<none>").strip(), "\n")

ID 19, Score 0.8135
Explanation: The candidate is highly suitable for the Data Scientist position requiring Python ML and NLP expertise as evidenced by his Master's degree in Computer Science from Stanford University where he demonstrated strong analytical abilities, an essential requirement of this role. With a proven track record spanning five years, Mario Edward has specialized in developing scalable, accurate, efficient machine learning models using Python and frameworks such as TensorFlow, PyTorch, and Scikit-learn (Deep Learning library), which aligns well with the job description's emphasis on analytical skills.

Mario also possesses hands-on experience that matches industry standards for ML roles in data visualization, a skill demonstrated during his tenure as a senior Machine Learning Engineer at XYZ Corporation, where he led projects delivering deep learning based e-commerce recommendation systems and computer vision defect detection systems with over 95% accuracy. These ac

In [11]:
import sys, os
sys.path.append(os.path.join(os.getcwd(), "../src"))
from agentGraph import run_agent_simple, run_agent

job_desc = """
Looking for a Data Scientist with experience in Python, ML, NLP, and data visualization.
Must have strong analytical skills and 2+ years of industry experience.
"""


print("\nGraph pipeline:")
res_graph = run_agent(job_desc, top_k=5, with_explain=True)
print(len(res_graph), [ (r.get("candidate_id"), round(float(r.get("score", 0.0)), 4)) for r in res_graph ])


Using existing collection 'cvs' in Qdrant.

Graph pipeline:


BERT batches: 100%|██████████| 1/1 [00:00<00:00,  3.77it/s]


[retrieve] job_vec shape: (768,)
[retrieve] search hits: 10
[retrieve] kept cvs: 10
[analyze] cvs in: 10
[analyze] analyzed: 10
[rank] ids: 10, embs: 10, job_emb: (768,)
[rank] ranking size: 10
[rank] top scores: [0.8135, 0.8132, 0.8117]
[explain] ranked in: 10, top_k: 5
[explain] results: 5
5 [(19, 0.8135), (33, 0.8132), (18, 0.8117), (26, 0.8004), (92, 0.8001)]
