# InterviewMate - Resume-JD Matching Analysis with Sentence Transformers
In this project, I developed and evaluated three different approaches for Resume → Job Description (JD) matching:

1️⃣ Method 1: Pure Semantic Matching (Sentence-BERT + Cosine Similarity)

2️⃣ Method 2: Hybrid Matching (Semantic + Keyword Score)

3️⃣ Method 3: Semantic Ranking + Keyword Explanation (Final Chosen Approach)

### 🔍 Key Findings

1️⃣ Sentence-BERT consistently delivered the most accurate and meaningful rankings

- It understands context rather than relying on word overlap

- It ranked resumes correctly even when wording differed from the JD

- The top similarity score plateaued at around 0.6, which is expected given how differently real-world resumes and JDs are worded

2️⃣ Hybrid scoring introduced noise

- Generic skills (e.g., SQL, server, testing) appeared across unrelated resumes

- Keyword weighting sometimes pushed incorrect resumes higher

- Although academically interesting, it reduced practical accuracy

3️⃣ Explanation matters

- Recruiters need to understand why a resume fits

- Using keyword matching only as an explanation layer added clarity without changing the ranking

- “Why it fits?” + “Missing Skills” moved the system from purely technical to product-ready

In [91]:
import pandas as pd
import seaborn as sns
import re
import numpy as np
import matplotlib.pyplot as plt
from wordcloud import WordCloud, STOPWORDS
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')

from sklearn.feature_extraction.text import TfidfVectorizer 
from sklearn.metrics.pairwise import cosine_similarity  
from gensim.models import Word2Vec  
from sklearn.model_selection import train_test_split
from sentence_transformers import SentenceTransformer   
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.linear_model import LogisticRegression
from sklearn.manifold import TSNE
from sklearn.decomposition import PCA


[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\nguye\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [92]:
df = pd.read_csv('../data/JD/job_dataset.csv')
df.head()
df.shape

(1068, 7)

In [113]:
df_resume = pd.read_csv('../data/data/processed_data/pdf_data.csv')
df_resume.head()
df_resume.shape

(2484, 7)

In [94]:
df['Title'].value_counts()

Title
.NET Developer                  20
Business Analyst                20
Copywriter                      20
Data Engineer                   20
Digital Marketing Specialist    20
                                ..
Android Architect                1
iOS Mobile Developer             1
Junior iOS Developer             1
iOS App Developer Intern         1
Graduate iOS Developer           1
Name: count, Length: 218, dtype: int64

In [114]:
def build_jd_text(row):
    parts = [
        str(row.get("Title", "")),
        str(row.get("ExperienceLevel", "")),
        str(row.get("YearsOfExperience", "")),
        str(row.get("Skills", "")),
        str(row.get("Responsibilities", "")),
        str(row.get("Keywords", ""))
    ]
    return " ".join(parts)

df["JD_text"] = df.apply(build_jd_text, axis=1)
df.head()


Unnamed: 0,JobID,Title,ExperienceLevel,YearsOfExperience,Skills,Responsibilities,Keywords,JD_text
0,NET-F-001,.NET Developer,Fresher,0-1,C#; VB.NET basics; .NET Framework; .NET Core f...,Assist in coding and debugging applications; L...,.NET; C#; ASP.NET MVC; Entity Framework; SQL S...,.NET Developer Fresher 0-1 C#; VB.NET basics; ...
1,NET-F-002,.NET Developer,Fresher,0-1,C#; .NET Framework basics; ASP.NET; Razor; HTM...,Write simple C# programs under guidance; Suppo...,.NET; C#; ASP.NET MVC; Entity Framework; SQL S...,.NET Developer Fresher 0-1 C#; .NET Framework ...
2,NET-F-003,.NET Developer,Fresher,0-1,C#; VB.NET basics; .NET Core; ASP.NET MVC; HTM...,Contribute to development of small modules; As...,.NET; C#; ASP.NET MVC; SQL Server; Entity Fram...,.NET Developer Fresher 0-1 C#; VB.NET basics; ...
3,NET-F-004,.NET Developer,Fresher,0-1,C#; .NET Framework; ASP.NET basics; SQL Server...,Support in software design documentation; Assi...,.NET; C#; SQL Server; Entity Framework; ASP.NET,.NET Developer Fresher 0-1 C#; .NET Framework;...
4,NET-F-005,.NET Developer,Fresher,0-1,C#; ASP.NET; MVC; Entity Framework basics; SQL...,Learn to design and build ASP.NET applications...,.NET; C#; ASP.NET MVC; Entity Framework; SQL S...,.NET Developer Fresher 0-1 C#; ASP.NET; MVC; E...
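
One thing to watch in `build_jd_text`: `str()` turns a missing value into the literal text `nan` or `None`, which would then be embedded along with the real content. The sketch below is a hypothetical variant (`build_jd_text_safe` and `JD_FIELDS` are illustrative names, not part of the notebook) that skips missing or empty fields. Whether the JD dataset contains any missing fields is not shown here, so this is a precaution, not a confirmed fix.

```python
import pandas as pd

# Same fields, in the same order, as build_jd_text above.
JD_FIELDS = ["Title", "ExperienceLevel", "YearsOfExperience",
             "Skills", "Responsibilities", "Keywords"]

def build_jd_text_safe(row: pd.Series) -> str:
    """Join the JD fields, skipping missing values instead of emitting 'nan'."""
    parts = []
    for field in JD_FIELDS:
        value = row.get(field, "")
        if pd.isna(value):          # None / NaN: skip the field entirely
            continue
        text = str(value).strip()
        if text:                    # skip empty strings as well
            parts.append(text)
    return " ".join(parts)

# Toy row with a missing "Responsibilities" field (made-up data).
toy = pd.DataFrame([
    {"Title": ".NET Developer", "ExperienceLevel": "Fresher",
     "YearsOfExperience": "0-1", "Skills": "C#; ASP.NET",
     "Responsibilities": None, "Keywords": ".NET; C#"},
])
toy_text = toy.apply(build_jd_text_safe, axis=1).iloc[0]
print(toy_text)
```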


## Sentence Transformer

In [122]:

model = SentenceTransformer("all-MiniLM-L6-v2")
jd_embeddings = model.encode(df["JD_text"].tolist(), show_progress_bar=True)
# Resume embeddings are loaded from a previously saved file.
resume_embeddings = np.load("model/resume_embeddings.npy")

print('Success on import jd and resume embeddings.')


Batches: 100%|██████████| 34/34 [00:26<00:00,  1.28it/s]

Success on import jd and resume embeddings.





In [115]:
df_resume.head()

Unnamed: 0.2,Unnamed: 0.1,Unnamed: 0,Filename,Filepath,Extracted_Text,Category,Cleaned_Text
0,0,0,10554236.pdf,../data/data/data/ACCOUNTANT\10554236.pdf,ACCOUNTANT\nSummary\nFinancial Accountant spec...,ACCOUNTANT,accountant summary financial accountant specia...
1,1,1,10674770.pdf,../data/data/data/ACCOUNTANT\10674770.pdf,STAFF ACCOUNTANT\nSummary\nHighly analytical a...,ACCOUNTANT,staff accountant summary highly analytical det...
2,2,2,11163645.pdf,../data/data/data/ACCOUNTANT\11163645.pdf,ACCOUNTANT\nProfessional Summary\nTo obtain a ...,ACCOUNTANT,accountant professional summary obtain positio...
3,3,3,11759079.pdf,../data/data/data/ACCOUNTANT\11759079.pdf,SENIOR ACCOUNTANT\nExperience\nCompany Name Ju...,ACCOUNTANT,senior accountant experience company name june...
4,4,4,12065211.pdf,../data/data/data/ACCOUNTANT\12065211.pdf,SENIOR ACCOUNTANT\nProfessional Summary\nSenio...,ACCOUNTANT,senior accountant professional summary senior ...


In [123]:
df_resume.shape

(2484, 7)

## Match JD to Resume 
### Method 1: Pure semantic matching using Sentence-BERT

Purpose: produce the actual ranking.

Sentence-BERT embeddings:

- capture meaning and context rather than exact word overlap

- come from models pre-trained on large sentence-pair datasets (e.g., NLI and semantic-similarity data)

- are well suited to resume ↔ JD matching, a common application of sentence embeddings
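
Since the ranking rests on cosine similarity, a small worked example may help. The sketch below uses made-up 3-dimensional vectors in place of real embeddings; `cosine` is an illustrative helper, not a function from the notebook. It shows that the score depends only on the direction of the vectors and not on their length. This matters here because the JD query is encoded with `normalize_embeddings=True`, while nothing in the notebook shows whether the saved resume embeddings were normalized.

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity: dot product divided by the product of the norms."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

a = np.array([1.0, 2.0, 0.0])
b = np.array([2.0, 4.0, 0.0])   # same direction as a, twice the length
c = np.array([0.0, 0.0, 3.0])   # orthogonal to a

sim_ab = cosine(a, b)            # same direction: similarity of 1
sim_ac = cosine(a, c)            # orthogonal: similarity of 0
sim_scaled = cosine(10 * a, b)   # rescaling a vector leaves the score unchanged

print(round(sim_ab, 6), round(sim_ac, 6), round(sim_scaled, 6))
```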

In [124]:
def match_jd_to_resumes(jd_embeddings, top_k=10): 
    # accept 1-D embedding or 2-D (1, dim) 
    emb = np.asarray(jd_embeddings) 
    if emb.ndim == 2 and emb.shape[0] == 1: 
        emb = emb[0] 
    if emb.ndim != 1: 
        raise ValueError("jd_embeddings must be a 1-D embedding vector or a 2-D array with shape (1, dim)") 
    scores = cosine_similarity(emb.reshape(1, -1), resume_embeddings).flatten() 
    # Safely handle cases where resume_embeddings length and df_resume length differ. 
    # We compute sorted indices by score, then filter only those indices that exist in df_resume. 
    sorted_idx = scores.argsort()[::-1] 
    valid_idx = [int(i) for i in sorted_idx if i < len(df_resume)] 
    if len(valid_idx) == 0: 
        # return empty dataframe with score column if no valid resumes available 
        results = df_resume.iloc[[]].copy() 
        results["score"] = [] 
        return results 
    top_k = min(top_k, len(valid_idx)) 
    top_idx = valid_idx[:top_k] 
    results = df_resume.iloc[top_idx].copy() 
    results["score"] = scores[np.array(top_idx)] 
    return results
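
The ranking step of `match_jd_to_resumes` can be seen in isolation on toy data. The following is a minimal sketch in plain NumPy, assuming a 2-D corpus matrix with one row per resume; `top_k_by_cosine` and the toy vectors are illustrative and not part of the notebook.

```python
import numpy as np

def top_k_by_cosine(query: np.ndarray, corpus: np.ndarray, k: int = 3):
    """Return (indices, scores) of the k corpus rows most similar to the query."""
    q = query / np.linalg.norm(query)
    m = corpus / np.linalg.norm(corpus, axis=1, keepdims=True)
    scores = m @ q                            # cosine similarity per row
    k = min(k, len(scores))                   # never ask for more rows than exist
    order = np.argsort(scores)[::-1][:k]      # highest score first
    return order, scores[order]

toy_corpus = np.array([
    [1.0, 0.0],    # row 0: same direction as the query
    [0.0, 1.0],    # row 1: orthogonal to the query
    [1.0, 1.0],    # row 2: 45 degrees away from the query
    [-1.0, 0.0],   # row 3: opposite direction
])
toy_query = np.array([2.0, 0.0])

idx, sims = top_k_by_cosine(toy_query, toy_corpus, k=2)
print(idx, np.round(sims, 3))
```

A full `argsort` is fine at this corpus size (2,484 resumes). For much larger corpora, `np.argpartition` would avoid sorting every score.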

In [125]:
test_jd = df["JD_text"].iloc[0]
jd_embedding = model.encode(test_jd, normalize_embeddings=True)
print(test_jd)
match_jd_to_resumes(jd_embedding, top_k=10)


.NET Developer Fresher 0-1 C#; VB.NET basics; .NET Framework; .NET Core fundamentals; ASP.NET; MVC; HTML; CSS; JavaScript basics; SQL Server; Entity Framework basics; LINQ; Visual Studio; Git; Unit Testing basics Assist in coding and debugging applications; Learn and apply .NET Framework and Core fundamentals; Support team in building ASP.NET MVC web applications; Write basic SQL queries and work with Entity Framework; Collaborate with peers to solve issues; Participate in code reviews for learning; Follow best practices in coding; Work with version control (Git) .NET; C#; ASP.NET MVC; Entity Framework; SQL Server; LINQ; Visual Studio; Unit Testing


Unnamed: 0.2,Unnamed: 0.1,Unnamed: 0,Filename,Filepath,Extracted_Text,Category,Cleaned_Text,score
2102,2102,2102,26746496.pdf,../data/data/data/INFORMATION-TECHNOLOGY\26746...,DATABASE PROGRAMMER/ANALYST (.NET DEVELOPER)\n...,INFORMATION-TECHNOLOGY,database programmer analyst net developer summ...,0.614708
2059,2059,2059,16186411.pdf,../data/data/data/INFORMATION-TECHNOLOGY\16186...,DATABASE PROGRAMMER/ANALYST (.NET DEVELOPER)\n...,INFORMATION-TECHNOLOGY,database programmer analyst net developer summ...,0.584151
2048,2048,2048,12763627.pdf,../data/data/data/INFORMATION-TECHNOLOGY\12763...,ASP.NET WEB DEVELOPER\nAccomplishments\nWon As...,INFORMATION-TECHNOLOGY,asp net web developer accomplishments associat...,0.540443
742,742,742,39247950.pdf,../data/data/data/BANKING\39247950.pdf,"SOFTWARE ENGINEER\nQualifications\nC# 3.0, PL/...",BANKING,software engineer qualifications c pl sql java...,0.513053
1331,1331,1331,35990852.pdf,../data/data/data/DESIGNER\35990852.pdf,WEBSITE DESIGNER\nSummary\nSoftware developer ...,DESIGNER,website designer summary software developer we...,0.510468
2148,2148,2148,83816738.pdf,../data/data/data/INFORMATION-TECHNOLOGY\83816...,INFORMATION TECHNOLOGY INTERN (TEST AUTOMATION...,INFORMATION-TECHNOLOGY,information technology intern test automation ...,0.505999
2080,2080,2080,20674668.pdf,../data/data/data/INFORMATION-TECHNOLOGY\20674...,INFORMATION TECHNOLOGY SPECIALIST III (DRUPAL ...,INFORMATION-TECHNOLOGY,information technology specialist drupal dev s...,0.487816
2112,2112,2112,28126340.pdf,../data/data/data/INFORMATION-TECHNOLOGY\28126...,INFORMATION TECHNOLOGY COORDINATOR\nProfession...,INFORMATION-TECHNOLOGY,information technology coordinator professiona...,0.474242
766,766,766,99124477.pdf,../data/data/data/BANKING\99124477.pdf,ASSOCIATE CONSULTANT\nProfessional Summary\n7+...,BANKING,associate consultant professional summary year...,0.471596
1356,1356,1356,85101052.pdf,../data/data/data/DESIGNER\85101052.pdf,TECHNICAL DESIGNER\nCareer Overview\n√¢‚Äî‚Äã√Ç Havi...,DESIGNER,technical designer career overview years exper...,0.470555


### Method 2: Hybrid (Semantic + Keywords)

Purpose: explain why a resume fits by blending a keyword-overlap score into the ranking.

`final_scores = w_semantic * semantic_scores + w_keyword * keyword_scores`, with `w_semantic = 0.7` and `w_keyword = 0.3`.

Method 2 (hybrid) performed worse because:
- keyword scoring introduced noise

- weights distorted ranking

- keywords are often generic and unreliable

- SBERT already captures those concepts better

#### Normalization (matches the `Cleaned_Text` style)

In [126]:
def normalize_for_match(text: str) -> str:
    if not isinstance(text, str):
        return ""
    text = text.lower()
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    text = re.sub(r"\s+", " ", text).strip()
    return text
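
To see what this normalization does to typical JD keywords, the short example below repeats the same rule and applies it to a few keywords from the first JD. Because every character outside `a-z`, `0-9`, and whitespace becomes a space, `C#` collapses to the single token `c` and `.NET` to `net`. The `Cleaned_Text` column shows the same effect (for example, "qualifications c pl sql").

```python
import re

def normalize_for_match(text: str) -> str:
    # Same rule as the cell above: lowercase, keep only a-z, 0-9 and whitespace.
    if not isinstance(text, str):
        return ""
    text = text.lower()
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip()

examples = ["C#", ".NET", "ASP.NET MVC", "SQL Server"]
normalized = {kw: normalize_for_match(kw) for kw in examples}

for raw, norm in normalized.items():
    print(f"{raw!r:15} -> {norm!r}")
```

One consequence is that `C#` and `C++` both normalize to `c`, so any resume containing a standalone "c" token will match the `C#` keyword.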


#### Extract JD keywords

In [127]:
def extract_keywords(jd_keywords):
    if not isinstance(jd_keywords, str):
        return []
    jd_keywords = jd_keywords.replace(";", ",")
    return [kw.strip() for kw in jd_keywords.split(",") if kw.strip()]


In [128]:
def keyword_score(resume_text, jd_keywords):
    resume_norm = normalize_for_match(resume_text)
    resume_tokens = set(resume_norm.split())

    matched = 0
    kws = extract_keywords(jd_keywords)

    for kw in kws:
        kw_norm = normalize_for_match(kw)
        kw_tokens = kw_norm.split()

        if kw_tokens and all(t in resume_tokens for t in kw_tokens):
            matched += 1

    return matched / max(1, len(kws))
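
The token-set rule in `keyword_score` counts a keyword as matched when each of its tokens appears anywhere in the resume, not necessarily side by side. The sketch below reproduces that rule on a made-up resume sentence and compares it with a stricter contiguous-phrase check. `phrase_match` is a hypothetical alternative for illustration only and was not evaluated in this notebook.

```python
import re

def _norm(text: str) -> str:
    # Same normalization as normalize_for_match above.
    text = re.sub(r"[^a-z0-9\s]", " ", text.lower())
    return re.sub(r"\s+", " ", text).strip()

def token_match(resume: str, keyword: str) -> bool:
    """Rule used by keyword_score: every keyword token appears somewhere."""
    tokens = set(_norm(resume).split())
    kw_tokens = _norm(keyword).split()
    return bool(kw_tokens) and all(t in tokens for t in kw_tokens)

def phrase_match(resume: str, keyword: str) -> bool:
    """Hypothetical stricter rule: keyword tokens must be contiguous."""
    kw = _norm(keyword)
    return bool(kw) and f" {kw} " in f" {_norm(resume)} "

# Made-up resume text with no SQL Server or unit-testing experience.
resume = ("Managed a Linux server fleet and wrote SQL reports "
          "for the unit; manual testing")

for kw in ["SQL Server", "Unit Testing"]:
    print(kw, token_match(resume, kw), phrase_match(resume, kw))
```

Both keywords match under the token rule even though the resume mentions neither skill. This is the kind of noise from generic words (SQL, server, testing) described in the key findings.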


In [129]:
def match_jd_to_resumes_method_2(jd_embedding, jd_keywords, 
                                 top_k=10, w_semantic=0.7, w_keyword=0.3):

    emb = np.asarray(jd_embedding)
    if emb.ndim == 2 and emb.shape[0] == 1:
        emb = emb[0]

    scores = cosine_similarity(emb.reshape(1, -1), resume_embeddings).flatten()

    kw_scores = []
    for _, row in df_resume.iterrows():
        txt = row.get("Cleaned_Text", "")
        kw_scores.append(keyword_score(txt, jd_keywords))

    kw_scores = np.array(kw_scores)

    if len(kw_scores) != len(scores):
        raise ValueError("Resume embeddings and df_resume not aligned")

    final = (w_semantic * scores) + (w_keyword * kw_scores)

    sorted_idx = final.argsort()[::-1][:top_k]

    results = df_resume.iloc[sorted_idx].copy()
    results["semantic_score"] = scores[sorted_idx]
    results["keyword_score"] = kw_scores[sorted_idx]
    results["final_score"] = final[sorted_idx]

    return results.sort_values(by="final_score", ascending=False)



In [130]:
jd = df.iloc[0]

jd_text = jd["JD_text"]
jd_keywords = jd["Keywords"]

jd_embedding = model.encode(jd_text, normalize_embeddings=True)


match_jd_to_resumes_method_2(jd_embedding, jd_keywords, top_k=10)

Unnamed: 0.2,Unnamed: 0.1,Unnamed: 0,Filename,Filepath,Extracted_Text,Category,Cleaned_Text,semantic_score,keyword_score,final_score
2102,2102,2102,26746496.pdf,../data/data/data/INFORMATION-TECHNOLOGY\26746...,DATABASE PROGRAMMER/ANALYST (.NET DEVELOPER)\n...,INFORMATION-TECHNOLOGY,database programmer analyst net developer summ...,0.614708,0.875,0.692795
2059,2059,2059,16186411.pdf,../data/data/data/INFORMATION-TECHNOLOGY\16186...,DATABASE PROGRAMMER/ANALYST (.NET DEVELOPER)\n...,INFORMATION-TECHNOLOGY,database programmer analyst net developer summ...,0.584151,0.875,0.671406
1331,1331,1331,35990852.pdf,../data/data/data/DESIGNER\35990852.pdf,WEBSITE DESIGNER\nSummary\nSoftware developer ...,DESIGNER,website designer summary software developer we...,0.510468,1.0,0.657328
1225,1225,1225,43311839.pdf,../data/data/data/CONSULTANT\43311839.pdf,IT CONSULTANT\nSummary\nOver Seven years of So...,CONSULTANT,consultant summary seven years software applic...,0.46613,1.0,0.626291
2048,2048,2048,12763627.pdf,../data/data/data/INFORMATION-TECHNOLOGY\12763...,ASP.NET WEB DEVELOPER\nAccomplishments\nWon As...,INFORMATION-TECHNOLOGY,asp net web developer accomplishments associat...,0.540443,0.625,0.56581
1556,1556,1556,51588273.pdf,../data/data/data/ENGINEERING\51588273.pdf,SOFTWARE ENGINEERING MANAGER\nSummary\nMultifa...,ENGINEERING,software engineering manager summary multiface...,0.46765,0.625,0.514855
2035,2035,2035,10089434.pdf,../data/data/data/INFORMATION-TECHNOLOGY\10089...,INFORMATION TECHNOLOGY TECHNICIAN I\nSummary\n...,INFORMATION-TECHNOLOGY,information technology technician summary vers...,0.465147,0.625,0.513103
742,742,742,39247950.pdf,../data/data/data/BANKING\39247950.pdf,"SOFTWARE ENGINEER\nQualifications\nC# 3.0, PL/...",BANKING,software engineer qualifications c pl sql java...,0.513053,0.5,0.509137
2080,2080,2080,20674668.pdf,../data/data/data/INFORMATION-TECHNOLOGY\20674...,INFORMATION TECHNOLOGY SPECIALIST III (DRUPAL ...,INFORMATION-TECHNOLOGY,information technology specialist drupal dev s...,0.487816,0.5,0.491471
2148,2148,2148,83816738.pdf,../data/data/data/INFORMATION-TECHNOLOGY\83816...,INFORMATION TECHNOLOGY INTERN (TEST AUTOMATION...,INFORMATION-TECHNOLOGY,information technology intern test automation ...,0.505999,0.375,0.4667
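
The effect of the weights can be checked by hand with numbers from the table above. The snippet below takes the semantic and keyword scores of the five resumes that led the Method 1 ranking and recomputes the hybrid score with the same 0.7 / 0.3 weights. The variable names are illustrative.

```python
# resume index: (semantic_score, keyword_score), copied from the output above
rows = {
    2102: (0.614708, 0.875),
    2059: (0.584151, 0.875),
    2048: (0.540443, 0.625),
    742:  (0.513053, 0.5),
    1331: (0.510468, 1.0),
}
W_SEM, W_KW = 0.7, 0.3

final = {i: W_SEM * s + W_KW * k for i, (s, k) in rows.items()}

semantic_order = sorted(rows, key=lambda i: rows[i][0], reverse=True)
hybrid_order = sorted(final, key=final.get, reverse=True)

print("semantic:", semantic_order)
print("hybrid:  ", hybrid_order)
print("1331 final score:", round(final[1331], 6))
```

Among these five, the website designer resume (1331) climbs from fifth to third, ahead of the ASP.NET web developer (2048), solely because its keyword score is 1.0.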


### Method 3: Semantic Ranking + Keyword EXPLANATION ONLY

In [131]:
def explain_fit_only(resume_text, jd_keywords):
    resume_norm = normalize_for_match(resume_text)
    resume_tokens = set(resume_norm.split())

    matched = []

    for raw_kw in extract_keywords(jd_keywords):
        kw_norm = normalize_for_match(raw_kw)
        kw_tokens = kw_norm.split()

        if kw_tokens and all(t in resume_tokens for t in kw_tokens):
            matched.append(raw_kw)

    return ", ".join(matched) if matched else "No matching keywords"

def missing_skills_only(resume_text, jd_keywords):
    resume_norm = normalize_for_match(resume_text)
    resume_tokens = set(resume_norm.split())

    missing = []

    for raw_kw in extract_keywords(jd_keywords):
        kw_norm = normalize_for_match(raw_kw)
        kw_tokens = kw_norm.split()

        # skill is missing if any token isn't present
        if not (kw_tokens and all(t in resume_tokens for t in kw_tokens)):
            missing.append(raw_kw)
    return ", ".join(missing) if missing else "None"

def match_jd_to_resumes_method_3(jd_embedding, jd_keywords, top_k=10):
    results = match_jd_to_resumes(jd_embedding, top_k)

    results["Why it fits?"] = results["Cleaned_Text"].apply(
        lambda txt: explain_fit_only(txt, jd_keywords)
    )

    results["Missing skills"] = results["Cleaned_Text"].apply(
        lambda txt: missing_skills_only(txt, jd_keywords)
    )

    return results



In [132]:
jd = df.iloc[0]

jd_text = jd["JD_text"]
jd_keywords = jd["Keywords"]

jd_embedding = model.encode(jd_text, normalize_embeddings=True)


match_jd_to_resumes_method_3(jd_embedding, jd_keywords, top_k=10)


Unnamed: 0.2,Unnamed: 0.1,Unnamed: 0,Filename,Filepath,Extracted_Text,Category,Cleaned_Text,score,Why it fits?,Missing skills
2102,2102,2102,26746496.pdf,../data/data/data/INFORMATION-TECHNOLOGY\26746...,DATABASE PROGRAMMER/ANALYST (.NET DEVELOPER)\n...,INFORMATION-TECHNOLOGY,database programmer analyst net developer summ...,0.614708,".NET, C#, ASP.NET MVC, Entity Framework, SQL S...",Unit Testing
2059,2059,2059,16186411.pdf,../data/data/data/INFORMATION-TECHNOLOGY\16186...,DATABASE PROGRAMMER/ANALYST (.NET DEVELOPER)\n...,INFORMATION-TECHNOLOGY,database programmer analyst net developer summ...,0.584151,".NET, C#, ASP.NET MVC, Entity Framework, SQL S...",Unit Testing
2048,2048,2048,12763627.pdf,../data/data/data/INFORMATION-TECHNOLOGY\12763...,ASP.NET WEB DEVELOPER\nAccomplishments\nWon As...,INFORMATION-TECHNOLOGY,asp net web developer accomplishments associat...,0.540443,".NET, C#, ASP.NET MVC, SQL Server, Visual Studio","Entity Framework, LINQ, Unit Testing"
742,742,742,39247950.pdf,../data/data/data/BANKING\39247950.pdf,"SOFTWARE ENGINEER\nQualifications\nC# 3.0, PL/...",BANKING,software engineer qualifications c pl sql java...,0.513053,".NET, C#, SQL Server, Visual Studio","ASP.NET MVC, Entity Framework, LINQ, Unit Testing"
1331,1331,1331,35990852.pdf,../data/data/data/DESIGNER\35990852.pdf,WEBSITE DESIGNER\nSummary\nSoftware developer ...,DESIGNER,website designer summary software developer we...,0.510468,".NET, C#, ASP.NET MVC, Entity Framework, SQL S...",
2148,2148,2148,83816738.pdf,../data/data/data/INFORMATION-TECHNOLOGY\83816...,INFORMATION TECHNOLOGY INTERN (TEST AUTOMATION...,INFORMATION-TECHNOLOGY,information technology intern test automation ...,0.505999,"C#, SQL Server, Unit Testing",".NET, ASP.NET MVC, Entity Framework, LINQ, Vis..."
2080,2080,2080,20674668.pdf,../data/data/data/INFORMATION-TECHNOLOGY\20674...,INFORMATION TECHNOLOGY SPECIALIST III (DRUPAL ...,INFORMATION-TECHNOLOGY,information technology specialist drupal dev s...,0.487816,"C#, SQL Server, Visual Studio, Unit Testing",".NET, ASP.NET MVC, Entity Framework, LINQ"
2112,2112,2112,28126340.pdf,../data/data/data/INFORMATION-TECHNOLOGY\28126...,INFORMATION TECHNOLOGY COORDINATOR\nProfession...,INFORMATION-TECHNOLOGY,information technology coordinator professiona...,0.474242,".NET, SQL Server, Visual Studio","C#, ASP.NET MVC, Entity Framework, LINQ, Unit ..."
766,766,766,99124477.pdf,../data/data/data/BANKING\99124477.pdf,ASSOCIATE CONSULTANT\nProfessional Summary\n7+...,BANKING,associate consultant professional summary year...,0.471596,".NET, C#, Unit Testing","ASP.NET MVC, Entity Framework, SQL Server, LIN..."
1356,1356,1356,85101052.pdf,../data/data/data/DESIGNER\85101052.pdf,TECHNICAL DESIGNER\nCareer Overview\n√¢‚Äî‚Äã√Ç Havi...,DESIGNER,technical designer career overview years exper...,0.470555,"C#, SQL Server, Unit Testing",".NET, ASP.NET MVC, Entity Framework, LINQ, Vis..."
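
The two explanation helpers walk the keyword list twice with the same matching rule. As a possible simplification, the sketch below computes both lists in one pass. `explain_fit`, `split_keywords`, and the toy inputs are hypothetical names and data for illustration; the matching rule is the same token-set check used above.

```python
import re

def _norm(text: str) -> str:
    # Same normalization as normalize_for_match above.
    text = re.sub(r"[^a-z0-9\s]", " ", text.lower())
    return re.sub(r"\s+", " ", text).strip()

def split_keywords(jd_keywords: str) -> list:
    # Same splitting as extract_keywords: ';' or ',' separated.
    return [kw.strip() for kw in jd_keywords.replace(";", ",").split(",")
            if kw.strip()]

def explain_fit(resume_text: str, jd_keywords: str) -> tuple:
    """Return (matched, missing) JD keywords in a single pass."""
    tokens = set(_norm(resume_text).split())
    matched, missing = [], []
    for kw in split_keywords(jd_keywords):
        kw_tokens = _norm(kw).split()
        if kw_tokens and all(t in tokens for t in kw_tokens):
            matched.append(kw)
        else:
            missing.append(kw)
    return matched, missing

# Toy inputs in the style of Cleaned_Text and the Keywords column.
toy_resume = "asp net web developer c sql server visual studio"
toy_keywords = ".NET; C#; ASP.NET MVC; SQL Server; LINQ"

matched, missing = explain_fit(toy_resume, toy_keywords)
print("Why it fits?  ", ", ".join(matched))
print("Missing skills:", ", ".join(missing))
```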


In [134]:
# df.to_csv('../data/data/processed_data/jd_data.csv')

import os
import datetime
import numpy as np
import pandas as pd
import json

today = datetime.datetime.now().strftime("%Y-%m-%d")

save_dir = f"model/{today}"
os.makedirs(save_dir, exist_ok=True)

# Save resume and JD embeddings under the dated directory
np.save(f"{save_dir}/resume_embeddings.npy", resume_embeddings)
np.save(f"{save_dir}/jd_embeddings.npy", jd_embeddings)


print("Saved version:", save_dir)


Saved version: model/2025-12-31
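
When a saved version is loaded back, it is worth confirming that the embedding matrix still lines up row for row with the DataFrame it indexes, since the ranking functions look resumes up by position. The sketch below is one way to do that. `load_embeddings` is a hypothetical helper, and the example uses a temporary directory with random data, not the real files.

```python
import os
import tempfile
import numpy as np

def load_embeddings(path: str, expected_rows: int) -> np.ndarray:
    """Load a saved embedding matrix and check its row count."""
    emb = np.load(path)
    if emb.ndim != 2 or emb.shape[0] != expected_rows:
        raise ValueError(
            f"expected {expected_rows} rows, got shape {emb.shape}"
        )
    return emb

with tempfile.TemporaryDirectory() as tmp:
    # Random stand-in for real embeddings: 5 "resumes", 4 dimensions.
    toy = np.random.default_rng(0).normal(size=(5, 4)).astype(np.float32)
    path = os.path.join(tmp, "resume_embeddings.npy")
    np.save(path, toy)

    restored = load_embeddings(path, expected_rows=5)
    roundtrip_ok = bool(np.array_equal(toy, restored))

    try:
        load_embeddings(path, expected_rows=6)   # simulated misalignment
        mismatch_detected = False
    except ValueError:
        mismatch_detected = True

print(roundtrip_ok, mismatch_detected)
```

In the notebook this would correspond to something like `load_embeddings(f"{save_dir}/resume_embeddings.npy", len(df_resume))`.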
