# End-to-End Resume Screening and Candidate Ranking System

## Project Overview
This project implements an automated resume screening system using NLP techniques. It ranks candidates based on the relevance of their resumes to a given Job Description (JD).

### Pipeline Steps:
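1. **Data Loading**: Load the resume dataset from CSV.
2. **Preprocessing**: Clean the text (lowercasing, stopword removal, lemmatization).
3. **Skill Extraction**: Extract key skills from each cleaned resume.
4. **Job Description**: Define the JD, clean it with the same preprocessor, and extract its required skills.
5. **Similarity Scoring**: Convert the text to TF-IDF vectors and compute the cosine similarity between each resume and the JD.
6. **Ranking and Analysis**: Sort candidates by similarity score and list the JD skills each one is missing.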

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from src.preprocessing import Preprocessor
from src.skills import SkillExtractor
from src.ranking import ResumeRanker

%matplotlib inline


## 1. Load Data
Load the resume dataset from 'Resume/Resume.csv'.

In [2]:
data_path = 'Resume/Resume.csv'
df = pd.read_csv(data_path)
print(f"Dataset Shape: {df.shape}")
df.head()

Dataset Shape: (2484, 4)


index,ID,Resume_str,Resume_html,Category
0,16852973,HR ADMINISTRATOR/MARKETING ASSOCIATE\...,"<div class=""fontsize fontface vmargins hmargin...",HR
1,22323967,"HR SPECIALIST, US HR OPERATIONS ...","<div class=""fontsize fontface vmargins hmargin...",HR
2,33176873,HR DIRECTOR Summary Over 2...,"<div class=""fontsize fontface vmargins hmargin...",HR
3,27018550,HR SPECIALIST Summary Dedica...,"<div class=""fontsize fontface vmargins hmargin...",HR
4,17812897,HR MANAGER Skill Highlights ...,"<div class=""fontsize fontface vmargins hmargin...",HR


## 2. Preprocessing
Clean the resume text to remove noise and standardize it for analysis.

In [3]:
preprocessor = Preprocessor()
df['Cleaned_Resume'] = df['Resume_str'].apply(preprocessor.clean_text)
df[['Resume_str', 'Cleaned_Resume']].head()

index,Resume_str,Cleaned_Resume
0,HR ADMINISTRATOR/MARKETING ASSOCIATE\...,hr administratormarketing associate hr adminis...
1,"HR SPECIALIST, US HR OPERATIONS ...",hr specialist u hr operation summary versatile...
2,HR DIRECTOR Summary Over 2...,hr director summary 20 year experience recruit...
3,HR SPECIALIST Summary Dedica...,hr specialist summary dedicated driven dynamic...
4,HR MANAGER Skill Highlights ...,hr manager skill highlight hr skill hr departm...
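
The first row above shows `administratormarketing` as a single token, which suggests the slash in `ADMINISTRATOR/MARKETING` was deleted rather than replaced by a space. The internals of `Preprocessor.clean_text` are not shown, so the following is only a minimal sketch of a cleaning step that avoids that merge. The function name `clean_text_sketch` and the tiny `STOPWORDS` set are hypothetical, and lemmatization is left out to keep the example dependency-free.

```python
import re

# Hypothetical, deliberately tiny stopword list; a real pipeline would use a fuller one.
STOPWORDS = {"a", "an", "and", "the", "of", "in", "to", "for", "with", "over", "is", "are"}

def clean_text_sketch(text: str) -> str:
    """Lowercase, replace non-alphanumeric runs with a space, and drop stopwords."""
    text = text.lower()
    # Substitute a space (not an empty string) so "a/b" becomes "a b", not "ab".
    text = re.sub(r"[^a-z0-9]+", " ", text)
    tokens = [tok for tok in text.split() if tok not in STOPWORDS]
    return " ".join(tokens)

print(clean_text_sketch("HR ADMINISTRATOR/MARKETING ASSOCIATE"))
# hr administrator marketing associate
```

Keeping the two job titles as separate tokens matters later in the pipeline, because a merged token can never match a term in the job description.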


## 3. Skill Extraction
Extract key skills from the resumes to aid in analysis.

In [4]:
extractor = SkillExtractor()
df['Skills'] = df['Cleaned_Resume'].apply(extractor.extract_skills)
df[['Cleaned_Resume', 'Skills']].head()

index,Cleaned_Resume,Skills
0,hr administratormarketing associate hr adminis...,"[time management, presentation, swift, leaders..."
1,hr specialist u hr operation summary versatile...,"[communication, project management, presentation]"
2,hr director summary 20 year experience recruit...,"[project management, negotiation, express, lea..."
3,hr specialist summary dedicated driven dynamic...,"[communication, presentation, excel]"
4,hr manager skill highlight hr skill hr departm...,"[problem solving, project management, negotiat..."
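
`SkillExtractor` lives in `src/skills.py` and its implementation is not shown here. As a rough illustration of how a vocabulary-based extractor can work, the sketch below matches a fixed skill list against the cleaned text using word boundaries. `SKILL_VOCAB` and `extract_skills_sketch` are hypothetical names, and the vocabulary is a made-up subset.

```python
import re

# Hypothetical skill vocabulary; the real SkillExtractor's list is not shown in this notebook.
SKILL_VOCAB = ["python", "sql", "machine learning", "deep learning", "project management", "excel"]

def extract_skills_sketch(cleaned_text: str, vocab=SKILL_VOCAB) -> list:
    """Return the vocabulary skills that occur as whole words or phrases, in vocabulary order."""
    found = []
    for skill in vocab:
        # \b anchors keep "excel" from matching inside "excellent".
        if re.search(rf"\b{re.escape(skill)}\b", cleaned_text):
            found.append(skill)
    return found

print(extract_skills_sketch("excellent python developer with machine learning and sql experience"))
# ['python', 'sql', 'machine learning']
```

Whole-word matching still cannot tell a technology from an ordinary English word. The `swift` and `express` entries extracted from HR resumes above may be cases of this, so the skill lists are best treated as approximate.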


## 4. Define Job Description
Enter the job description for the role you are hiring for.

In [5]:
job_description = """
We are looking for a Data Scientist with strong Python and Machine Learning skills.
Experience with Deep Learning frameworks like TensorFlow or PyTorch is required.
Knowledge of SQL and Big Data tools like Spark is a plus.
"""

# Clean the JD with the same preprocessor used for the resumes, then extract its required skills
cleaned_jd = preprocessor.clean_text(job_description)
jd_skills = extractor.extract_skills_from_job_description(cleaned_jd)
print(f"Extracted Skills from JD: {jd_skills}")

Extracted Skills from JD: ['python', 'spark', 'pytorch', 'sql', 'tensorflow', 'machine learning', 'deep learning']


## 5. Ranking Candidates
Score each cleaned resume against the cleaned JD by converting both to TF-IDF vectors and computing their cosine similarity.
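
For reference, one common formulation of the two quantities is given below. Here $N$ is the number of documents, $\mathrm{tf}(t, d)$ is the count of term $t$ in document $d$, and $\mathrm{df}(t)$ is the number of documents containing $t$. The smoothed IDF shown is one widely used variant; whether `ResumeRanker` uses exactly this weighting is an assumption.

$$
\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t), \qquad \mathrm{idf}(t) = \ln\frac{1 + N}{1 + \mathrm{df}(t)} + 1
$$

$$
\mathrm{sim}(\mathbf{r}, \mathbf{j}) = \frac{\mathbf{r} \cdot \mathbf{j}}{\lVert \mathbf{r} \rVert \, \lVert \mathbf{j} \rVert}
$$

Because TF-IDF weights are non-negative, the similarity between a resume vector $\mathbf{r}$ and the JD vector $\mathbf{j}$ lies between 0 and 1.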

In [6]:
ranker = ResumeRanker()
scores = ranker.score_resumes(df['Cleaned_Resume'], cleaned_jd)
df['Similarity_Score'] = scores
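
`ResumeRanker.score_resumes` is defined in `src/ranking.py` and is not shown here. The sketch below illustrates the general technique in plain Python so that each step is visible: build TF-IDF weights over the resumes plus the JD, then take the cosine similarity of each resume vector with the JD vector. The function names, the whitespace tokenization, and the smoothed IDF are assumptions, not the project's actual implementation.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Build sparse TF-IDF dicts for tokenized docs, with smoothed IDF ln((1+N)/(1+df)) + 1."""
    n_docs = len(docs)
    # Document frequency: number of documents in which each term appears at least once.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append({
            term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in counts.items()
        })
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts; 0.0 if either is empty."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def score_resumes_sketch(resumes, jd):
    """Score each resume string against the JD string; the JD is weighted together with the resumes."""
    docs = [text.split() for text in resumes] + [jd.split()]
    vectors = tfidf_vectors(docs)
    jd_vec = vectors[-1]
    return [cosine(vec, jd_vec) for vec in vectors[:-1]]

resumes = ["python sql machine learning", "hr recruiting payroll"]
scores = score_resumes_sketch(resumes, "data scientist python machine learning")
print(scores)  # the first resume shares terms with the JD; the second shares none and scores 0.0
```

A resume scores above zero only if it shares at least one term with the JD, so this approach rewards vocabulary overlap and does not capture synonyms or paraphrases.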

## 6. Results and Analysis
Display the top-ranked candidates and list the JD skills missing from each resume.

In [7]:
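# Identify missing skills
def get_missing_skills(candidate_skills, required_skills):
    """Return the required skills absent from the candidate's skills.

    Uses a set difference, so the order of the returned list is not guaranteed.
    """
    return list(set(required_skills) - set(candidate_skills))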

df['Missing_Skills'] = df['Skills'].apply(lambda x: get_missing_skills(x, jd_skills))

ranked_df = df.sort_values(by='Similarity_Score', ascending=False)
print("Top 10 Candidates:")
ranked_df[['ID', 'Category', 'Similarity_Score', 'Skills', 'Missing_Skills']].head(10)

Top 10 Candidates:


index,ID,Category,Similarity_Score,Skills,Missing_Skills
1762,12011623,ENGINEERING,0.245769,"[python, html, collaboration, sql, machine lea...","[deep learning, pytorch, tensorflow, spark]"
297,83816738,INFORMATION-TECHNOLOGY,0.234114,"[java, git, project management, android, githu...","[python, spark, pytorch, tensorflow, machine l..."
1218,21156767,CONSULTANT,0.20733,"[java, python, sql, machine learning, javascri...","[deep learning, pytorch, tensorflow, spark]"
2153,34953092,BANKING,0.161623,"[mathematics, machine learning, sql, python]","[deep learning, pytorch, tensorflow, spark]"
926,62994611,AGRICULTURE,0.161183,"[mathematics, python, project management, sql,...","[machine learning, pytorch, deep learning, spark]"
1142,30863060,CONSULTANT,0.151388,"[java, collaboration, sql, communication, java...","[python, spark, pytorch, tensorflow, machine l..."
2291,12777487,ARTS,0.150243,[leadership],"[python, spark, pytorch, sql, tensorflow, mach..."
331,18067556,INFORMATION-TECHNOLOGY,0.143717,"[excel, python, project management, android, n...","[spark, pytorch, tensorflow, machine learning,..."
929,11813872,AGRICULTURE,0.141284,"[python, aws, agile]","[spark, pytorch, sql, tensorflow, machine lear..."
194,18835363,DESIGNER,0.140158,[project management],"[python, spark, pytorch, sql, tensorflow, mach..."
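
The ranking above is driven by text similarity alone, so a candidate with no matching JD skills can still place in the top 10 (for example ID 12777487, whose only extracted skill is `leadership`, is ranked seventh). One way to make the missing-skills column easier to compare across candidates is to turn it into a coverage fraction. The helper below is a hypothetical addition, not part of the original pipeline; the `jd_skills` list is copied from the output of the job description cell.

```python
def skill_coverage(candidate_skills, required_skills):
    """Fraction of required skills found in the candidate's skill list (0.0 to 1.0)."""
    required = set(required_skills)
    if not required:
        # No requirements to satisfy; report zero coverage rather than dividing by zero.
        return 0.0
    return len(required & set(candidate_skills)) / len(required)

# Copied from the "Extracted Skills from JD" output earlier in the notebook.
jd_skills = ['python', 'spark', 'pytorch', 'sql', 'tensorflow', 'machine learning', 'deep learning']

# Skill lists taken from rows 2153 and 2291 of the table above.
print(skill_coverage(['mathematics', 'machine learning', 'sql', 'python'], jd_skills))  # 3 of 7 matched
print(skill_coverage(['leadership'], jd_skills))  # 0.0
```

In the notebook this could be applied with `df['Skills'].apply(lambda s: skill_coverage(s, jd_skills))` and shown next to `Similarity_Score` when reviewing the top candidates.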
