# Resume / Candidate Screening System

## Project Context

This project is part of the **Future Interns Machine Learning Internship – Task 3 (2026)**.

Recruitment teams often receive hundreds of resumes for a single job opening.  
Manually reviewing each resume is slow, inconsistent, and prone to bias.

Machine Learning–based resume screening systems help companies:
- Shortlist candidates faster
- Match candidate skills with job requirements
- Identify missing or weak skills
- Reduce recruiter workload

This project focuses on building a **decision-support ML system**, not a fully automated hiring tool.

---

## Objective

The objective of this project is to build a machine learning system that can:

- Read and process resume text data
- Extract relevant skills and keywords
- Compare resumes against a given job description
- Score and rank candidates based on role fit
- Highlight missing or required skills for each candidate

The goal is to support recruiters and hiring managers with **clear, explainable insights**, not just predictions.


In [1]:
import pandas as pd

df = pd.read_csv("../data/Resume.csv")
df.head()


| | ID | Resume_str | Resume_html | Category |
|---|---|---|---|---|
| 0 | 16852973 | HR ADMINISTRATOR/MARKETING ASSOCIATE\... | `<div class="fontsize fontface vmargins hmargin...` | HR |
| 1 | 22323967 | HR SPECIALIST, US HR OPERATIONS ... | `<div class="fontsize fontface vmargins hmargin...` | HR |
| 2 | 33176873 | HR DIRECTOR Summary Over 2... | `<div class="fontsize fontface vmargins hmargin...` | HR |
| 3 | 27018550 | HR SPECIALIST Summary Dedica... | `<div class="fontsize fontface vmargins hmargin...` | HR |
| 4 | 17812897 | HR MANAGER Skill Highlights ... | `<div class="fontsize fontface vmargins hmargin...` | HR |


In [2]:
df.info()


<class 'pandas.DataFrame'>
RangeIndex: 2484 entries, 0 to 2483
Data columns (total 4 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   ID           2484 non-null   int64
 1   Resume_str   2484 non-null   str  
 2   Resume_html  2484 non-null   str  
 3   Category     2484 non-null   str  
dtypes: int64(1), str(3)
memory usage: 77.8 KB


In [3]:
X_text = df["Resume_str"]
y = df["Category"]

X_text.head(), y.head()


(0             HR ADMINISTRATOR/MARKETING ASSOCIATE\...
 1             HR SPECIALIST, US HR OPERATIONS      ...
 2             HR DIRECTOR       Summary      Over 2...
 3             HR SPECIALIST       Summary    Dedica...
 4             HR MANAGER         Skill Highlights  ...
 Name: Resume_str, dtype: str,
 0    HR
 1    HR
 2    HR
 3    HR
 4    HR
 Name: Category, dtype: str)

In [4]:
import re

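def clean_text(text):
    """Lowercase the text, keep only the letters a-z, and collapse whitespace."""
    text = text.lower()
    # Replace digits, punctuation, and every other non-letter with a space.
    text = re.sub(r"[^a-z\s]", " ", text)
    # Collapse runs of whitespace into one space and trim both ends.
    text = re.sub(r"\s+", " ", text).strip()
    return text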

X_clean = X_text.apply(clean_text)

X_clean.head()


0    hr administrator marketing associate hr admini...
1    hr specialist us hr operations summary versati...
2    hr director summary over years experience in r...
3    hr specialist summary dedicated driven and dyn...
4    hr manager skill highlights hr skills hr depar...
Name: Resume_str, dtype: str
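
Because `clean_text` keeps only the letters a-z, digits and symbols disappear: "Over 2 years" becomes "over years" in the output above, and a skill written as "C++" would collapse to "c". The sketch below shows one possible variant that keeps `+` and `#` inside tokens, in case such skills matter for a role. `clean_text_keep_symbols` and the sample sentence are hypothetical and not part of the notebook.

```python
import re

def clean_text(text):
    # Same cleaning steps as the notebook cell above.
    text = text.lower()
    text = re.sub(r"[^a-z\s]", " ", text)
    text = re.sub(r"\s+", " ", text).strip()
    return text

def clean_text_keep_symbols(text):
    # Hypothetical variant: additionally keep '+' and '#' so that
    # tokens such as "c++" and "c#" survive cleaning.
    text = text.lower()
    text = re.sub(r"[^a-z+#\s]", " ", text)
    text = re.sub(r"\s+", " ", text).strip()
    return text

sample = "Over 2 years of C++, C# and scikit-learn."
print(clean_text(sample))               # over years of c c and scikit learn
print(clean_text_keep_symbols(sample))  # over years of c++ c# and scikit learn
```

Note that `TfidfVectorizer`'s default `token_pattern` only keeps tokens of two or more word characters, so a variant like this would mainly affect a text-based skill check rather than the TF-IDF features, unless a custom pattern is passed as well.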

In [5]:
from sklearn.feature_extraction.text import TfidfVectorizer

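vectorizer = TfidfVectorizer(
    max_features=5000,    # keep the 5,000 most frequent terms across the corpus
    stop_words="english"  # drop common English words such as "the" and "and"
)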

X_tfidf = vectorizer.fit_transform(X_clean)

X_tfidf.shape


(2484, 5000)
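
Since `max_features` caps the vocabulary, a word from a job description that is rare or absent in the resumes may have no column in the matrix and then contributes nothing to the match. A hedged sketch of how this could be checked through the fitted `vocabulary_` attribute follows; the toy corpus, the `max_features=4` setting, and the `job_terms` list are illustrative assumptions, not the notebook's data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy stand-in for the cleaned resumes (illustrative only).
toy_resumes = [
    "python sql pandas reporting",
    "python sql machine learning",
    "python sales marketing",
]

# Small cap so that only the most frequent terms are kept.
toy_vectorizer = TfidfVectorizer(max_features=4, stop_words="english")
toy_vectorizer.fit(toy_resumes)

# Hypothetical job-description terms to look up in the learned vocabulary.
job_terms = ["python", "sql", "nlp", "statistics"]
in_vocab = [t for t in job_terms if t in toy_vectorizer.vocabulary_]
out_of_vocab = [t for t in job_terms if t not in toy_vectorizer.vocabulary_]

print("scored terms: ", in_vocab)       # ['python', 'sql']
print("ignored terms:", out_of_vocab)   # ['nlp', 'statistics']
```

Running the same lookup against the notebook's `vectorizer` would show which job-description words can actually influence the scores.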

In [6]:
job_description = """
We are looking for a Data Scientist with strong skills in Python, machine learning,
data analysis, statistics, and experience with data visualization tools.
Knowledge of SQL, pandas, scikit-learn, and NLP is a plus.
"""

job_description


'\nWe are looking for a Data Scientist with strong skills in Python, machine learning,\ndata analysis, statistics, and experience with data visualization tools.\nKnowledge of SQL, pandas, scikit-learn, and NLP is a plus.\n'

In [7]:
job_clean = clean_text(job_description)
job_clean


'we are looking for a data scientist with strong skills in python machine learning data analysis statistics and experience with data visualization tools knowledge of sql pandas scikit learn and nlp is a plus'

In [8]:
# Reuse the vocabulary and IDF weights fitted on the resumes; do not refit on the job description.
job_vector = vectorizer.transform([job_clean])
job_vector.shape


(1, 5000)

In [9]:
from sklearn.metrics.pairwise import cosine_similarity

similarity_scores = cosine_similarity(X_tfidf, job_vector).flatten()
similarity_scores[:5]


array([0.06156519, 0.0037955 , 0.01362744, 0.00626964, 0.01475613])
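
Each score is the cosine of the angle between a resume's TF-IDF vector and the job description's vector: 0 means no shared vocabulary terms, and 1 means the two vectors point in the same direction. `TfidfVectorizer` L2-normalizes every row by default (`norm="l2"`), so the cosine similarity reduces to a plain dot product. The sketch below checks this on a small toy corpus; the texts and the `toy_` names are illustrative assumptions, not the notebook's data.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy stand-ins for cleaned resumes and a cleaned job description.
toy_resumes = [
    "python machine learning statistics",
    "sales marketing customer service",
    "python sql reporting",
]
toy_job = "python machine learning sql"

toy_vectorizer = TfidfVectorizer()
toy_X = toy_vectorizer.fit_transform(toy_resumes)
toy_job_vec = toy_vectorizer.transform([toy_job])

# Cosine similarity as used in the notebook.
cos = cosine_similarity(toy_X, toy_job_vec).flatten()

# Plain dot product of the already L2-normalized rows.
dot = (toy_X @ toy_job_vec.T).toarray().flatten()

print(np.round(cos, 3))
print(np.allclose(cos, dot))  # True
```

The second resume shares no words with the job text and scores 0, while the first shares the most terms and ranks highest.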

In [10]:
# Attach each resume's score to its row, then sort so the best match comes first.
df["Match_Score"] = similarity_scores
ranked_candidates = df.sort_values("Match_Score", ascending=False)
ranked_candidates[["ID", "Category", "Match_Score"]].head(10)


| | ID | Category | Match_Score |
|---|---|---|---|
| 1218 | 21156767 | CONSULTANT | 0.40015 |
| 1762 | 12011623 | ENGINEERING | 0.347898 |
| 1339 | 18448085 | AUTOMOBILE | 0.296956 |
| 926 | 62994611 | AGRICULTURE | 0.259155 |
| 1303 | 42156237 | DIGITAL-MEDIA | 0.223147 |
| 1091 | 24610685 | SALES | 0.218913 |
| 1142 | 30863060 | CONSULTANT | 0.218486 |
| 331 | 18067556 | INFORMATION-TECHNOLOGY | 0.212271 |
| 1040 | 12351749 | SALES | 0.202948 |
| 335 | 79541391 | INFORMATION-TECHNOLOGY | 0.190893 |


In [11]:
required_skills = [
    "python",
    "machine learning",
    "data analysis",
    "statistics",
    "sql",
    "pandas",
    "scikit learn",
    "nlp",
    "data visualization"
]


In [12]:
def missing_skills(resume_text, skills):
    """Return the skills that do not occur in resume_text (substring match on cleaned text)."""
    return [skill for skill in skills if skill not in resume_text]
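
The check above is a substring test, so a short skill can be found inside a longer word: `"sql" in "mysql"` is true, and the skill would be treated as present. A minimal sketch of a stricter variant that matches whole words or phrases only is shown below. `missing_skills_strict` and the sample resume are hypothetical, and switching to this variant could change the results reported later in the notebook.

```python
import re

def missing_skills(resume_text, skills):
    # Notebook version: plain substring test.
    return [skill for skill in skills if skill not in resume_text]

def missing_skills_strict(resume_text, skills):
    # Hypothetical variant: a skill counts as present only when it appears
    # as a whole word or phrase, not inside another word.
    missing = []
    for skill in skills:
        pattern = r"\b" + re.escape(skill) + r"\b"
        if not re.search(pattern, resume_text):
            missing.append(skill)
    return missing

resume = "experienced with mysql databases and python scripting"
skills = ["python", "sql", "nlp"]

print(missing_skills(resume, skills))         # ['nlp']
print(missing_skills_strict(resume, skills))  # ['sql', 'nlp']
```

Whether "mysql" should count as evidence of "sql" is a judgment call for the recruiter; the strict variant simply makes that decision explicit instead of leaving it to substring matching.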


In [13]:
# Copy the top 10 rows so the new column is added to an independent frame.
top_candidates = ranked_candidates.head(10).copy()

top_candidates["Missing_Skills"] = top_candidates["Resume_str"].apply(
    lambda x: missing_skills(clean_text(x), required_skills)
)

top_candidates[["ID", "Category", "Match_Score", "Missing_Skills"]]


| | ID | Category | Match_Score | Missing_Skills |
|---|---|---|---|---|
| 1218 | 21156767 | CONSULTANT | 0.40015 | [statistics, pandas, scikit learn, nlp] |
| 1762 | 12011623 | ENGINEERING | 0.347898 | [scikit learn, nlp] |
| 1339 | 18448085 | AUTOMOBILE | 0.296956 | [machine learning, nlp] |
| 926 | 62994611 | AGRICULTURE | 0.259155 | [machine learning, data analysis, statistics, ... |
| 1303 | 42156237 | DIGITAL-MEDIA | 0.223147 | [machine learning, data analysis, statistics, ... |
| 1091 | 24610685 | SALES | 0.218913 | [python, machine learning, data analysis, stat... |
| 1142 | 30863060 | CONSULTANT | 0.218486 | [python, machine learning, data analysis, stat... |
| 331 | 18067556 | INFORMATION-TECHNOLOGY | 0.212271 | [machine learning, statistics, pandas, scikit ... |
| 1040 | 12351749 | SALES | 0.202948 | [python, sql, pandas, scikit learn, nlp, data ... |
| 335 | 79541391 | INFORMATION-TECHNOLOGY | 0.190893 | [python, machine learning, data analysis, pand... |
