<a href="https://colab.research.google.com/github/Siddharth-Singh-Verma/final_year_research/blob/main/research_ATS.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Performance Evaluation of ML Algorithms for Efficient Resume-Job Matching

## Abstract
Resume-job matching is a critical task in recruitment: it helps organizations identify candidates whose skills align with job requirements. This project compares approaches to resume-job matching, progressing from **rule-based methods** to **machine learning (ML), deep learning, and generative AI**. By measuring the performance of each technique, we aim to determine which approach is most effective for automated hiring pipelines. The study begins with rule-based matching, moves on to ML algorithms, then deep learning, and finally investigates whether generative AI can improve resume-job matching accuracy.


## 📂 Dataset Analysis

### Why Are We Analyzing the Dataset?
Before implementing any matching techniques, it is crucial to **understand the structure and quality of the data**. This step ensures that our dataset is suitable for both rule-based and machine learning approaches. Specifically, we will:

- **Explore the dataset** to check for missing values, inconsistencies, and variations.
- **Analyze job titles** to find common and unique roles in both datasets.
- **Ensure logical correctness** by verifying that job titles map sensibly between the two datasets before merging resumes with job descriptions.

A well-prepared dataset will help us achieve more **accurate and efficient resume-job matching** throughout our study.
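
The checks listed above can be sketched on a tiny made-up DataFrame. The helper name `quality_report` and the toy rows are assumptions for illustration only; the `Category` and `Resume` column names are the ones used by the resume dataset.

```python
import pandas as pd

def quality_report(df: pd.DataFrame, text_cols: list) -> dict:
    """Summarize missing values, duplicate rows, and blank text fields."""
    blank = {
        col: int(df[col].fillna("").str.strip().eq("").sum())
        for col in text_cols
    }
    return {
        "rows": len(df),
        "missing": {col: int(n) for col, n in df.isna().sum().items()},
        "duplicates": int(df.duplicated().sum()),
        "blank_or_missing_text": blank,
    }

# Toy data (hypothetical): one duplicated row and one missing resume.
toy = pd.DataFrame({
    "Category": ["Data Science", "HR", "HR", "Sales"],
    "Resume": ["python sql", "recruiting", "recruiting", None],
})
report = quality_report(toy, text_cols=["Category", "Resume"])
print(report)
```

Counting blank strings separately from missing values is a deliberate choice here: an empty resume is as unusable as a missing one, but `isna()` alone does not detect it.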



In [2]:
import pandas as pd


resume_df = pd.read_csv("/content/drive/MyDrive/finalyearproject/UpdatedResumeDataSet.csv")
job_desc_df = pd.read_csv("/content/drive/MyDrive/finalyearproject/data.csv")


print("Resume Dataset Info:")
resume_df.info()
print("\nJob Description Dataset Info:")
job_desc_df.info()

# Display first few rows
resume_df.head(), job_desc_df.head()


Resume Dataset Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1262 entries, 0 to 1261
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   Category  1262 non-null   object
 1   Resume    1262 non-null   object
dtypes: object(2)
memory usage: 19.8+ KB

Job Description Dataset Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 157 entries, 0 to 156
Data columns (total 10 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   Unnamed: 0       157 non-null    int64 
 1   company          157 non-null    object
 2   position         157 non-null    object
 3   url              157 non-null    object
 4   location         157 non-null    object
 5   headquaters      157 non-null    object
 6   employees        154 non-null    object
 7   founded          154 non-null    object
 8   industry         154 non-null    object
 9   Job Description  157 non-null    object
dtyp

(       Category                                             Resume
 0  Data Science  Skills * Programming Languages: Python (pandas...
 1  Data Science  Education Details \r\nMay 2013 to May 2017 B.E...
 2  Data Science  Areas of Interest Deep Learning, Control Syste...
 3  Data Science  Skills â¢ R â¢ Python â¢ SAP HANA â¢ Table...
 4  Data Science  Education Details \r\n MCA   YMCAUST,  Faridab...,
    Unnamed: 0                          company  \
 0           1          Visual BI Solutions Inc   
 1           2                       Jobvertise   
 2           3           Santander Consumer USA   
 3           4   Federal Reserve Bank of Dallas   
 4           5                           Aviall   
 
                                             position  \
 0  Graduate Intern (Summer 2017) - SAP BI / Big D...   
 1                          Digital Marketing Manager   
 2    Manager, Pricing Management Information Systems   
 3               Treasury Services Analyst Internship  

In [3]:
# Extract job titles from both datasets
resume_titles = resume_df["Category"].unique()
job_desc_titles = job_desc_df["position"].unique()

# Convert to sets for easy comparison
resume_titles_set = set(resume_titles)
job_desc_titles_set = set(job_desc_titles)

# Find common and unique job titles
common_titles = resume_titles_set & job_desc_titles_set
unique_resume_titles = resume_titles_set - job_desc_titles_set
unique_job_desc_titles = job_desc_titles_set - resume_titles_set

# Display results
print(f"Total Resume Job Titles: {len(resume_titles_set)}")
print(f"Total Job Description Titles: {len(job_desc_titles_set)}")
print(f"Common Titles: {len(common_titles)}\n", common_titles)
print(f"Unique Titles in Resume Dataset: {len(unique_resume_titles)}\n", unique_resume_titles)
print(f"Unique Titles in Job Description Dataset: {len(unique_job_desc_titles)}\n", unique_job_desc_titles)


Total Resume Job Titles: 35
Total Job Description Titles: 110
Common Titles: 1
 {'Business Analyst'}
Unique Titles in Resume Dataset: 34
 {'Health and fitness', 'Hadoop', 'Python Developer', 'Mechanical Engineer', 'DevOps Engineer', 'Data Science', 'Cloud Architect', 'Advocate', 'Sales', 'VR/AR Developer', 'SAP Developer', 'ETL Developer', 'Testing', 'Energy Analyst', 'Civil Engineer', 'Cybersecurity Analyst', 'DotNet Developer', 'Product Manager', 'Network Security Engineer', 'Arts', 'Game Developer', 'UI/UX Designer', 'HR', 'Digital Marketing', 'Robotics Engineer', 'Database', 'Blockchain', 'PMO', 'Electrical Engineering', 'Web Designing', 'AI Specialist', 'Automation Testing', 'Operations Manager', 'Java Developer'}
Unique Titles in Job Description Dataset: 109
 {'Digital Marketing Intern', 'Sr Copywriter', 'Quantitative Analyst Intern', 'Global Real Estate Research Intern', 'Summer Intern - Channel Marketing - Marketing Dept (PAID)', 'Research Analyst Intern', 'Summer Associate - S
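
Exact set intersection finds only one common title because the job postings carry suffixes such as "Intern". A minimal sketch of a looser comparison follows, assuming it is acceptable to ignore a small list of noise words; the `NOISE_WORDS` list and both helper names are hypothetical and not part of the notebook.

```python
import re

# Words that describe the contract type rather than the role itself
# (this stop list is an assumption chosen for the example).
NOISE_WORDS = {"intern", "internship", "summer", "paid"}

def normalize_title(title: str) -> frozenset:
    """Lowercase a title, drop punctuation and noise words, return its tokens."""
    tokens = re.findall(r"[a-z0-9]+", title.lower())
    return frozenset(t for t in tokens if t not in NOISE_WORDS)

def categories_for(job_title: str, categories: list) -> list:
    """Return the categories whose tokens all appear in the job title."""
    job_tokens = normalize_title(job_title)
    matches = []
    for category in categories:
        cat_tokens = normalize_title(category)
        # An empty token set would be a subset of everything, so skip it.
        if cat_tokens and cat_tokens <= job_tokens:
            matches.append(category)
    return matches

categories = ["Business Analyst", "Digital Marketing", "Data Science"]
print(categories_for("Business Analyst Intern", categories))
print(categories_for("Digital Marketing Intern", categories))
print(categories_for("Sr Copywriter", categories))
```

Token containment is stricter than fuzzy string similarity: a category is only proposed when every one of its words occurs in the job title.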

In [4]:
# Define manual job title mapping
job_title_mapping = {
    "Data Science": [
        "Data Scientist Intern", "Data Science / Software Engineering Intern", "Intern - Data Scientist"
    ],
    "DevOps Engineer": [
        "Intern - DevOps", "Cloud Engineer Intern"
    ],
    "Digital Marketing": [
        "Digital Marketing Intern", "Marketing Analyst Intern", "Performance Marketing Professional - Intern"
    ],
    "Cybersecurity Analyst": [
        "Information Security Intern", "IT Security Analyst Intern"
    ],
    "UI/UX Designer": [
        "User Interaction Design Intern", "Digital Technology/E-Commerce Summer Internship"
    ],
    "Java Developer": [
        "Software Engineering Intern", "Application Developer Intern"
    ],
    "Product Manager": [
        "Intern, Product Management", "Associate Project Specialist Intern"
    ],
    "Business Analyst": [
        "Business Analyst Intern", "Project Manager/Business Analyst Intern"
    ]
}

# Reverse the mapping for easy lookup: job-description title -> resume category
title_reverse_mapping = {
    job_title: resume_title
    for resume_title, job_titles in job_title_mapping.items()
    for job_title in job_titles
}

# Apply mapping to job description dataset
job_desc_df["Mapped Job Title"] = job_desc_df["position"].map(title_reverse_mapping)

# Fill unmatched job titles as 'Other'
# (assign the result back instead of calling fillna(..., inplace=True) on a
# column selection, which pandas warns will stop working in pandas 3.0)
job_desc_df["Mapped Job Title"] = job_desc_df["Mapped Job Title"].fillna("Other")

# Display results
job_desc_df[["position", "Mapped Job Title"]].head(20)



| | position | Mapped Job Title |
|---|---|---|
| 0 | Graduate Intern (Summer 2017) - SAP BI / Big D... | Other |
| 1 | Digital Marketing Manager | Other |
| 2 | Manager, Pricing Management Information Systems | Other |
| 3 | Treasury Services Analyst Internship | Other |
| 4 | Intern, Sales Analyst | Other |
| 5 | Human Resources Analyst Internship | Other |
| 6 | Intern - Business Analytics | Other |
| 7 | Intern - Business Analytics | Other |
| 8 | Digital Marketing Intern | Digital Marketing |
| 9 | Data Analyst - Intern | Other |
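
Most rows above fall into "Other", so it helps to measure how many titles the manual mapping actually covers and to make sure no job title is listed under two categories (a plain dict comprehension would silently keep only the last one). The following is a sketch under those assumptions; `invert_mapping`, `coverage`, and the toy data are hypothetical names, not part of the notebook.

```python
from collections import defaultdict

def invert_mapping(mapping: dict) -> dict:
    """Invert {category: [titles]} to {title: category}, rejecting conflicts."""
    owners = defaultdict(set)
    for category, titles in mapping.items():
        for title in titles:
            owners[title].add(category)
    conflicts = {t: sorted(c) for t, c in owners.items() if len(c) > 1}
    if conflicts:
        raise ValueError(f"Titles mapped to several categories: {conflicts}")
    return {title: next(iter(cats)) for title, cats in owners.items()}

def coverage(titles: list, reverse_mapping: dict) -> float:
    """Fraction of titles that have an entry in the reverse mapping."""
    if not titles:
        return 0.0
    return sum(t in reverse_mapping for t in titles) / len(titles)

toy_mapping = {
    "Data Science": ["Data Scientist Intern"],
    "DevOps Engineer": ["Intern - DevOps"],
}
reverse = invert_mapping(toy_mapping)
positions = [
    "Data Scientist Intern",
    "Sr Copywriter",
    "Intern - DevOps",
    "Research Analyst Intern",
]
print(coverage(positions, reverse))  # 2 of the 4 toy titles are mapped
```

A low coverage value is a signal that exact lookup alone is not enough and that a looser matching step is needed.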



In [9]:
from fuzzywuzzy import process

# Convert all job titles to lowercase for case-insensitive comparison
job_desc_df["position"] = job_desc_df["position"].str.lower().str.strip()

# Convert mapping keys to lowercase
title_reverse_mapping_lower = {job_title.lower(): resume_title for job_title, resume_title in title_reverse_mapping.items()}

# Function to apply fuzzy matching
def fuzzy_map_title(job_title):
    best_match, score = process.extractOne(job_title, title_reverse_mapping_lower.keys())
    return title_reverse_mapping_lower[best_match] if score > 80 else "Other"  # 80 is a confidence threshold

# Apply fuzzy matching
job_desc_df["Mapped Job Title"] = job_desc_df["position"].apply(fuzzy_map_title)

# Display results
job_desc_df[["position", "Mapped Job Title"]].head(20)




| | position | Mapped Job Title |
|---|---|---|
| 0 | graduate intern (summer 2017) - sap bi / big d... | Data Science |
| 1 | digital marketing manager | Digital Marketing |
| 2 | manager, pricing management information systems | Cybersecurity Analyst |
| 3 | treasury services analyst internship | Digital Marketing |
| 4 | intern, sales analyst | Data Science |
| 5 | human resources analyst internship | Other |
| 6 | intern - business analytics | Business Analyst |
| 7 | intern - business analytics | Business Analyst |
| 8 | digital marketing intern | Digital Marketing |
| 9 | data analyst - intern | Data Science |
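
Several of the matches above look doubtful (for example, a pricing manager mapped to Cybersecurity Analyst), which suggests inspecting the best match and its score before settling on a threshold. The sketch below illustrates thresholded fuzzy matching with the standard library's `difflib.SequenceMatcher` rather than `fuzzywuzzy`; its scores lie in [0, 1] and are not directly comparable to the 0 to 100 scale used in the cell above. The helper names and the toy lookup are assumptions.

```python
from difflib import SequenceMatcher

def best_match(query: str, choices: list) -> tuple:
    """Return (choice, score) for the most similar choice; score is in [0, 1]."""
    scored = [(c, SequenceMatcher(None, query, c).ratio()) for c in choices]
    return max(scored, key=lambda pair: pair[1])

def map_title(query: str, lookup: dict, threshold: float) -> str:
    """Map a title through `lookup` if its best match clears the threshold."""
    choice, score = best_match(query.lower().strip(), list(lookup))
    return lookup[choice] if score >= threshold else "Other"

# Toy lookup (hypothetical): lowercase job title -> resume category.
lookup = {
    "data scientist intern": "Data Science",
    "digital marketing intern": "Digital Marketing",
}
for title in ["Digital Marketing Intern", "human resources analyst internship"]:
    choice, score = best_match(title.lower(), list(lookup))
    print(f"{title!r}: best={choice!r} score={score:.2f} "
          f"mapped={map_title(title, lookup, threshold=0.8)}")
```

Printing the winning choice next to its score makes it easier to see which threshold separates plausible matches from accidental ones.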


In [11]:
from fuzzywuzzy import process

# Convert all job titles to lowercase for case-insensitive comparison
job_desc_df["position"] = job_desc_df["position"].str.lower().str.strip()

# Convert mapping keys to lowercase
title_reverse_mapping_lower = {job_title.lower(): resume_title for job_title, resume_title in title_reverse_mapping.items()}

# Define manual keyword-based rules
manual_rules = {
    "business analyst": "Business Analyst",
    "data analyst": "Data Science",
    "data scientist": "Data Science",
    "digital marketing": "Digital Marketing",
    "software engineer": "Java Developer",
    "devops": "DevOps Engineer",
    "cybersecurity": "Cybersecurity Analyst",
    "human resources": "HR",
    "sales": "Sales",
    "marketing": "Digital Marketing",
    "operations": "Operations Manager",
    "mechanical": "Mechanical Engineer",
    "electrical": "Electrical Engineering",
    "civil": "Civil Engineer",

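    # Intern-specific mappings. Note: "Finance", "Risk Analyst",
    # "Research & Development" and "Accounting" are not categories in the
    # resume dataset, so these rows receive no resume in the later merge.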
    "finance intern": "Finance",
    "treasury analyst": "Finance",
    "pricing management": "Finance",
    "business risk": "Risk Analyst",
    "research & development": "Research & Development",
    "strategy": "Business Analyst",
    "consulting intern": "Business Analyst",
    "accounting": "Accounting",
}


# Function to apply manual rules first, then fuzzy matching
def hybrid_map_title(job_title):
    # Check manual rules first
    for keyword, mapped_title in manual_rules.items():
        if keyword in job_title:
            return mapped_title

    # If no manual rule applies, use fuzzy matching
    best_match, score = process.extractOne(job_title, title_reverse_mapping_lower.keys())
    return title_reverse_mapping_lower[best_match] if score > 85 else "Other"

# Apply mapping
job_desc_df["Mapped Job Title"] = job_desc_df["position"].apply(hybrid_map_title)

# Display results
job_desc_df[["position", "Mapped Job Title"]].head(20)


| | position | Mapped Job Title |
|---|---|---|
| 0 | graduate intern (summer 2017) - sap bi / big d... | Data Science |
| 1 | digital marketing manager | Digital Marketing |
| 2 | manager, pricing management information systems | Finance |
| 3 | treasury services analyst internship | Digital Marketing |
| 4 | intern, sales analyst | Sales |
| 5 | human resources analyst internship | HR |
| 6 | intern - business analytics | Business Analyst |
| 7 | intern - business analytics | Business Analyst |
| 8 | digital marketing intern | Digital Marketing |
| 9 | data analyst - intern | Data Science |
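
In the hybrid cell above, rules are tried in dictionary order with a plain substring test, so the result depends on how the dictionary happens to be ordered and a keyword can match inside a longer word. One possible refinement, sketched here as an assumption rather than as the notebook's method, is to match whole words only and to try longer keywords first. The toy rules (including the "product marketing" rule) and the helper names are hypothetical.

```python
import re

def compile_rules(rules: dict) -> list:
    """Turn {keyword: category} into (pattern, category) pairs, longest keyword first."""
    ordered = sorted(rules.items(), key=lambda item: len(item[0]), reverse=True)
    return [
        (re.compile(r"\b" + re.escape(keyword) + r"\b"), category)
        for keyword, category in ordered
    ]

def apply_rules(job_title: str, compiled: list, default: str = "Other") -> str:
    """Return the category of the first rule whose keyword occurs as whole words."""
    title = job_title.lower()
    for pattern, category in compiled:
        if pattern.search(title):
            return category
    return default

rules = {
    "marketing": "Digital Marketing",
    "product marketing": "Product Manager",
    "sales": "Sales",
}
compiled = compile_rules(rules)
print(apply_rules("Product Marketing Intern", compiled))
print(apply_rules("Intern, Sales Analyst", compiled))
print(apply_rules("Wholesales Associate", compiled))
```

Titles that no rule matches still return the default, so a fuzzy fallback like the one in the cell above can be applied afterwards.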


In [12]:
job_desc_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 157 entries, 0 to 156
Data columns (total 11 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   Unnamed: 0        157 non-null    int64 
 1   company           157 non-null    object
 2   position          157 non-null    object
 3   url               157 non-null    object
 4   location          157 non-null    object
 5   headquaters       157 non-null    object
 6   employees         154 non-null    object
 7   founded           154 non-null    object
 8   industry          154 non-null    object
 9   Job Description   157 non-null    object
 10  Mapped Job Title  157 non-null    object
dtypes: int64(1), object(10)
memory usage: 13.6+ KB


In [16]:


# Group resumes by Category (which represents job titles)
resume_agg = resume_df.groupby("Category")["Resume"].apply(lambda x: " ".join(x)).reset_index()

# Rename "Category" to "Mapped Job Title" for consistency
resume_agg.rename(columns={"Category": "Mapped Job Title"}, inplace=True)

# Merge with job descriptions on "Mapped Job Title"
final_df = job_desc_df.merge(resume_agg, on="Mapped Job Title", how="left")

# Display final DataFrame info
final_df.info()

# Show sample data
final_df.head()



<class 'pandas.core.frame.DataFrame'>
RangeIndex: 157 entries, 0 to 156
Data columns (total 12 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   Unnamed: 0        157 non-null    int64 
 1   company           157 non-null    object
 2   position          157 non-null    object
 3   url               157 non-null    object
 4   location          157 non-null    object
 5   headquaters       157 non-null    object
 6   employees         154 non-null    object
 7   founded           154 non-null    object
 8   industry          154 non-null    object
 9   Job Description   157 non-null    object
 10  Mapped Job Title  157 non-null    object
 11  Resume            139 non-null    object
dtypes: int64(1), object(11)
memory usage: 14.8+ KB


| | Unnamed: 0 | company | position | url | location | headquaters | employees | founded | industry | Job Description | Mapped Job Title | Resume |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Visual BI Solutions Inc | graduate intern (summer 2017) - sap bi / big d... | https://www.glassdoor.com/partner/jobListing.h... | Plano, TX | Plano, TX | 51 to 200 employees | 2010 | Information Technology | Location: Plano, TX or Oklahoma City, OK Dura... | Data Science | Skills * Programming Languages: Python (pandas... |
| 1 | 2 | Jobvertise | digital marketing manager | https://www.glassdoor.com/partner/jobListing.h... | Dallas, TX | Berlin, Germany | 1 to 50 employees | 2011 | Unknown | The Digital Marketing Manager is the front li... | Digital Marketing | Category: Digital Marketing\nSkills: Communica... |
| 2 | 3 | Santander Consumer USA | manager, pricing management information systems | https://www.glassdoor.com/partner/jobListing.h... | Dallas, TX | Dallas, TX | 5001 to 10000 employees | 1995 | Finance | Summary of Responsibilities:The Manager Prici... | Finance | |
| 3 | 4 | Federal Reserve Bank of Dallas | treasury services analyst internship | https://www.glassdoor.com/partner/jobListing.h... | Dallas, TX | Dallas, TX | 1001 to 5000 employees | 1914 | Finance | ORGANIZATIONAL SUMMARY: As part of the nati... | Digital Marketing | Category: Digital Marketing\nSkills: Communica... |
| 4 | 5 | Aviall | intern, sales analyst | https://www.glassdoor.com/partner/jobListing.h... | Dallas, TX | Dallas, TX | 1001 to 5000 employees | Boeing | Subsidiary or Business Segment | Aviall is the world's largest provider of n... | Sales | Education Details \r\n Bachelor's \r\n Bache... |
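
The left merge keeps every job posting, so postings whose mapped title has no resume category end up with an empty `Resume` (139 of 157 rows are non-null above). A minimal sketch of how such rows could be listed uses the `indicator` and `validate` options of `DataFrame.merge`; the toy frames below are made up for illustration.

```python
import pandas as pd

# Toy stand-ins (hypothetical) for job_desc_df and resume_agg.
jobs = pd.DataFrame({
    "position": ["data analyst - intern", "finance intern", "intern, sales analyst"],
    "Mapped Job Title": ["Data Science", "Finance", "Sales"],
})
resumes = pd.DataFrame({
    "Mapped Job Title": ["Data Science", "Sales"],
    "Resume": ["python pandas", "b2b sales"],
})

merged = jobs.merge(
    resumes,
    on="Mapped Job Title",
    how="left",
    validate="many_to_one",  # fail loudly if a category appears twice in `resumes`
    indicator=True,          # adds a `_merge` column: "both" or "left_only"
)
unmatched = merged.loc[merged["_merge"] == "left_only", "Mapped Job Title"]
print(unmatched.value_counts())
```

Listing the unmatched titles shows directly which mapped categories need either a resume category or a different mapping.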


In [15]:
print(resume_df.columns)


Index(['Category', 'Resume'], dtype='object')


In [17]:
import re

def clean_text(text):
    """Preprocess text by lowercasing and removing special characters."""
    text = text.lower()
    text = re.sub(r'[^a-z0-9\s]', '', text)  # Remove special characters (text is already lowercase)
    return text.strip()

# Fill missing resume values with an empty string
# (assign the result back instead of calling fillna(..., inplace=True) on a
# column selection, which pandas warns will stop working in pandas 3.0)
final_df["Resume"] = final_df["Resume"].fillna("")

# Apply text preprocessing
final_df["Job Description"] = final_df["Job Description"].apply(clean_text)
final_df["Resume"] = final_df["Resume"].apply(clean_text)

# Create a new DataFrame with selected columns
job_resume_df = final_df[["Mapped Job Title", "Job Description", "Resume"]]

# Display the first few rows
job_resume_df.head()



| | Mapped Job Title | Job Description | Resume |
|---|---|---|---|
| 0 | Data Science | location plano tx or oklahoma city ok duration... | skills programming languages python pandas nu... |
| 1 | Digital Marketing | the digital marketing manager is the front lin... | category digital marketing\nskills communicati... |
| 2 | Finance | summary of responsibilitiesthe manager pricing... | |
| 3 | Digital Marketing | organizational summary as part of the nation... | category digital marketing\nskills communicati... |
| 4 | Sales | aviall is the worlds largest provider of new a... | education details \r\n bachelors \r\n bachel... |
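
Deleting special characters outright fuses neighbouring words (row 2 above shows "responsibilitiesthe"), and line breaks such as `\r\n` survive because they count as whitespace. A possible variant, shown here only as a sketch with the hypothetical name `clean_text_spaced`, replaces special characters with a space and then collapses all whitespace.

```python
import re

def clean_text_spaced(text: str) -> str:
    """Lowercase, replace non-alphanumeric characters with spaces, collapse whitespace."""
    text = text.lower()
    text = re.sub(r"[^a-z0-9\s]", " ", text)  # punctuation becomes a space, not nothing
    text = re.sub(r"\s+", " ", text)          # collapse runs of spaces, \r, \n and tabs
    return text.strip()

sample = "Summary of Responsibilities:The Manager\r\n leads pricing."
print(clean_text_spaced(sample))
# summary of responsibilities the manager leads pricing
```

The trade-off is that apostrophes also become spaces, so "world's" turns into "world s" instead of "worlds".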


In [19]:
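# Note: missing resumes were already replaced with "" in the previous cell, so
# dropna() removes no rows here; the empty strings are read back as NaN when
# the saved CSV is reloaded.
final_df = job_resume_df.dropna(subset=["Resume"])
final_df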

| | Mapped Job Title | Job Description | Resume |
|---|---|---|---|
| 0 | Data Science | location plano tx or oklahoma city ok duration... | skills programming languages python pandas nu... |
| 1 | Digital Marketing | the digital marketing manager is the front lin... | category digital marketing\nskills communicati... |
| 2 | Finance | summary of responsibilitiesthe manager pricing... | |
| 3 | Digital Marketing | organizational summary as part of the nation... | category digital marketing\nskills communicati... |
| 4 | Sales | aviall is the worlds largest provider of new a... | education details \r\n bachelors \r\n bachel... |
| ... | ... | ... | ... |
| 152 | Business Analyst | realworld experience lifelong connections inte... | education details \r\n be computer science mum... |
| 153 | Digital Marketing | the internship program our paid internship pr... | category digital marketing\nskills communicati... |
| 154 | Business Analyst | are you an analytical thinker with a passion f... | education details \r\n be computer science mum... |
| 155 | Digital Marketing | the internship program our paid internship pr... | category digital marketing\nskills communicati... |


In [20]:
final_df.to_csv("/content/drive/MyDrive/finalyearproject/final_dataset.csv", index=False)
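
Because missing resumes were filled with empty strings before `dropna` was called, the saved file still contains rows without a resume, and those empty fields are parsed as NaN when the CSV is read back. The sketch below reproduces this round trip on made-up data using an in-memory buffer instead of a file; the variable names are assumptions.

```python
import io
import pandas as pd

toy = pd.DataFrame({
    "Mapped Job Title": ["Data Science", "Finance"],
    "Resume": ["python pandas", None],
})

filled = toy.fillna({"Resume": ""})
# "" is not NaN, so dropna() keeps both rows.
rows_after_dropna = len(filled.dropna(subset=["Resume"]))
print(rows_after_dropna)

# Round-trip through CSV text: the empty field is parsed back as NaN.
buffer = io.StringIO()
filled.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)
missing_after_reload = int(reloaded["Resume"].isna().sum())
print(missing_after_reload)

# One option: treat blank resumes as missing and filter them out before saving.
kept = filled[filled["Resume"].str.strip().ne("")]
print(len(kept))
```

Whether rows without a resume should be dropped or kept is a modelling decision; the point of the sketch is only that the decision should be made explicitly before the file is written.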


## Data Merging Summary

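The resume dataset and the job description dataset are now merged on the standardized job title.  
The merged dataset contains three columns:
- **Mapped Job Title**: Standardized job title assigned to each job posting.
- **Job Description**: Cleaned description text from the job posting.
- **Resume**: Cleaned text of all resumes in the matching category, concatenated into one string; empty when no resume category matches the mapped title.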

Next, we will perform data validation and testing to ensure the quality of the merged dataset.


In [21]:
df = pd.read_csv("/content/drive/MyDrive/finalyearproject/final_dataset.csv")

In [22]:
df.head()

| | Mapped Job Title | Job Description | Resume |
|---|---|---|---|
| 0 | Data Science | location plano tx or oklahoma city ok duration... | skills programming languages python pandas nu... |
| 1 | Digital Marketing | the digital marketing manager is the front lin... | category digital marketing\nskills communicati... |
| 2 | Finance | summary of responsibilitiesthe manager pricing... | |
| 3 | Digital Marketing | organizational summary as part of the nation... | category digital marketing\nskills communicati... |
| 4 | Sales | aviall is the worlds largest provider of new a... | education details \r\n bachelors \r\n bachel... |


In [23]:
import pandas as pd

# Load dataset
df = pd.read_csv("/content/drive/MyDrive/finalyearproject/final_dataset.csv")

# 1. Display basic information
print("Basic Info:")
print(df.info())

# 2. Check for missing values
print("\nMissing Values:")
print(df.isnull().sum())

# 3. Check unique job titles
print("\nUnique Job Titles:")
print(df["Mapped Job Title"].nunique())
print(df["Mapped Job Title"].unique())

# 4. Check for duplicate rows
print("\nDuplicate Rows:", df.duplicated().sum())

# 5. Check length of descriptions and resumes
df["Job Description Length"] = df["Job Description"].astype(str).apply(len)
df["Resume Length"] = df["Resume"].astype(str).apply(len)

print("\nShort Job Descriptions (<50 characters):")
print(df[df["Job Description Length"] < 50][["Mapped Job Title", "Job Description"]])

print("\nShort Resumes (<50 characters):")
print(df[df["Resume Length"] < 50][["Mapped Job Title", "Resume"]])

# 6. Check if each job title has descriptions and resumes
print("\nRows with Empty Job Descriptions:")
print(df[df["Job Description"].isna()])

print("\nRows with Empty Resumes:")
print(df[df["Resume"].isna()])


Basic Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 157 entries, 0 to 156
Data columns (total 3 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   Mapped Job Title  157 non-null    object
 1   Job Description   157 non-null    object
 2   Resume            139 non-null    object
dtypes: object(3)
memory usage: 3.8+ KB
None

Missing Values:
Mapped Job Title     0
Job Description      0
Resume              18
dtype: int64

Unique Job Titles:
17
['Data Science' 'Digital Marketing' 'Finance' 'Sales' 'HR'
 'Business Analyst' 'Risk Analyst' 'Research & Development'
 'Operations Manager' 'Other' 'DevOps Engineer' 'Product Manager'
 'UI/UX Designer' 'Accounting' 'Cybersecurity Analyst' 'Java Developer'
 'Electrical Engineering']

Duplicate Rows: 34

Short Job Descriptions (<50 characters):
Empty DataFrame
Columns: [Mapped Job Title, Job Description]
Index: []

Short Resumes (<50 characters):
           Mapped Job Title Resume
2 