## CV-JD Matching
### This notebook matches job descriptions (JDs) to CVs using cosine similarity over GPT-2 embeddings

In [2]:
import numpy as np
import pandas as pd

import re
import string 
import contractions 

from tqdm import tqdm
tqdm.pandas(desc="Progress Bar")

import torch
from datasets import load_dataset
from transformers import GPT2Tokenizer, GPT2Model
from sklearn.metrics.pairwise import cosine_similarity



In [3]:
jd_data = load_dataset('jacob-hugging-face/job-descriptions', split="train")
jd_data

Dataset({
    features: ['company_name', 'job_description', 'position_title', 'description_length', 'model_response'],
    num_rows: 853
})

In [4]:
jd_df = pd.DataFrame(jd_data)
jd_df.head()

Unnamed: 0,company_name,job_description,position_title,description_length,model_response
0,Google,minimum qualifications\nbachelors degree or eq...,Sales Specialist,2727,"{\n ""Core Responsibilities"": ""Responsible fo..."
1,Apple,description\nas an asc you will be highly infl...,Apple Solutions Consultant,828,"{\n ""Core Responsibilities"": ""as an asc you ..."
2,Netflix,its an amazing time to be joining netflix as w...,Licensing Coordinator - Consumer Products,3205,"{\n ""Core Responsibilities"": ""Help drive bus..."
3,Robert Half,description\n\nweb designers looking to expand...,Web Designer,2489,"{\n ""Core Responsibilities"": ""Designing webs..."
4,TrackFive,at trackfive weve got big goals were on a miss...,Web Developer,3167,"{\n ""Core Responsibilities"": ""Build and layo..."


In [5]:
print(jd_df['job_description'][0])

minimum qualifications
bachelors degree or equivalent practical experience years of experience in saas or productivity tools businessexperience managing enterprise accounts with sales cycles
preferred qualifications
 years of experience building strategic business partnerships with enterprise customersability to work through and with a reseller ecosystem to scale the businessability to plan pitch and execute a territory business strategyability to build relationships and to deliver results in a crossfunctionalmatrixed environmentability to identify crosspromoting and uppromoting opportunities within the existing account baseexcellent account management writtenverbal communication strategic and analyticalthinking skills
about the job
as a member of the google cloud team you inspire leading companies schools and government agencies to work smarter with google tools like google workspace search and chrome you advocate the innovative power of our products to make organizations more product

In [6]:
df = pd.read_csv('data/extracted_resume.csv')
df.head()

Unnamed: 0,Skills,Education,ID,Category
0,Excellent classroom management,Subject Matter Authorization in Science: Scien...,37201447,AGRICULTURE
1,"Team mediation, Budget Management, Delegation ...","2009 Howard University ： City , State , USA ...",12674256,AGRICULTURE
2,"COMPUTER LITERACY, E-mail, English, government...","2011\nThe Universty of Zambia ： City , State...",29968330,AGRICULTURE
3,"C, C++, communication skills, designing, ELISA...","Masters of Science , Biotechnology 5 2013 Univ...",81042872,AGRICULTURE
4,"Data Entry, Printers, Clients, Loans, Tax Retu...",Wayne State University 2013 MBA : Linguistics ...,20006992,AGRICULTURE


In [7]:
df.shape

(2484, 4)

In [None]:
def text_cleaning(text: str) -> str:
    """
    Clean a free-text field before embedding.

    Args:
        text (str): Text to be cleaned
    Returns:
        str: Cleaned text (empty string for missing values)
    """

    # Return an empty string for missing values so the result is always a str
    if pd.isnull(text):
        return ''

    # Lower-case everything
    text = text.lower().strip()

    # Translation table that deletes all punctuation characters
    translator = str.maketrans('', '', string.punctuation)

    # Expand contractions (short-form words)
    text = contractions.fix(text)

    # Remove URLs, emails, phone numbers, punctuation and any remaining non-letters
    text = re.sub(r'http\S+|www\S+|https\S+', '', text) # Remove URLs
    text = re.sub(r'\S+@\S+', '', text) # Remove emails
    text = re.sub(r'\b\d{1,3}[-./]?\d{1,3}[-./]?\d{1,4}\b', '', text) # Remove phone numbers
    text = text.translate(translator) # Remove punctuation
    text = re.sub(r'[^a-zA-Z]', ' ', text) # Replace every non-letter character (digits included) with a space

    return text.strip()
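
To see what the cleaning steps do to typical resume text, here is a minimal, dependency-free sketch of the same regex pipeline. It is an illustration rather than the notebook's function: `clean_demo` is a hypothetical name, the sample strings are invented, and the `contractions.fix` step is left out so the block runs with the standard library alone.

```python
import re
import string

# Hypothetical, simplified stand-in for text_cleaning (no contractions.fix step)
_PUNCT_TABLE = str.maketrans('', '', string.punctuation)

def clean_demo(text: str) -> str:
    text = text.lower().strip()
    text = re.sub(r'http\S+|www\S+|https\S+', '', text)                 # URLs
    text = re.sub(r'\S+@\S+', '', text)                                 # emails
    text = re.sub(r'\b\d{1,3}[-./]?\d{1,3}[-./]?\d{1,4}\b', '', text)   # phone-like digit runs
    text = text.translate(_PUNCT_TABLE)                                 # punctuation
    text = re.sub(r'[^a-zA-Z]', ' ', text)                              # anything that is not a letter
    return text.strip()

# Invented sample inputs
contact = clean_demo("Email me at a@b.com or visit https://x.io! Call 555-123-4567.")
skills = clean_demo("C, C++, Python3")

print(contact.split())
print(skills.split())
```

With these inputs the contact details disappear entirely, and `C`, `C++`, and `Python3` collapse to `c`, `c`, and `python`, so the cleaning cannot tell some skills apart. Whether that loss matters depends on the matching task.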

In [None]:
# We have 15 Resumes where Skills & Education were not extracted
# So, let's remove them
cv_df = df[~(df['Skills'].isna() & df['Education'].isna())].reset_index(drop=True)

# Fill null values in Skills & Education with empty strings before concatenating them
cv_df = cv_df.fillna(value='')

# Let's stitch together Skills & Education, similar to given in job description.
cv_df['CV'] = cv_df['Skills'] + ' ' + cv_df['Education']

# Doing text cleaning
cv_df['CV'] = cv_df['CV'].progress_apply(text_cleaning)

Progress Bar: 100%|██████████| 2469/2469 [00:00<00:00, 18007.57it/s]


In [10]:
cv_df.shape

(2469, 5)

In [11]:
# Sample job descriptions: clean only the first 15
job_descriptions = jd_df['job_description'][:15].apply(text_cleaning).to_list()

# Sample resumes (replace with your extracted resume data)
resumes = cv_df['CV'].to_list()

## Creating Embedding using `GPT2Tokenizer`, `GPT2Model`

In [12]:
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2Model.from_pretrained("gpt2")

tokenizer.pad_token = tokenizer.eos_token
model.config.pad_token_id = tokenizer.pad_token_id

# Embedding function
def get_gpt2_embedding(text):
    if not text.strip():  # Handle empty string
        return np.zeros(model.config.hidden_size)
    
    tokens = tokenizer(text, return_tensors="pt", padding=True, truncation=True)
    if tokens["input_ids"].shape[1] == 0:
        return np.zeros(model.config.hidden_size)
    
    with torch.no_grad():
        outputs = model(**tokens)
    return outputs.last_hidden_state.mean(dim=1).squeeze().numpy()

# Generate embeddings
job_description_embeddings = [get_gpt2_embedding(desc) for desc in job_descriptions]
resume_embeddings = [get_gpt2_embedding(resume) for resume in resumes]



In [13]:
job_description_embeddings[0].shape, resume_embeddings[0].shape

((768,), (768,))

In [14]:
len(job_description_embeddings), len(resume_embeddings)

(15, 2469)

## Calculating Similarity Score & Getting Top 10 Candidates

In [15]:
similarity_scores = cosine_similarity(job_description_embeddings, resume_embeddings)
similarity_scores

array([[0.9973296 , 0.99834927, 0.9969055 , ..., 0.99878311, 0.99674451,
        0.996813  ],
       [0.99610823, 0.99775191, 0.99539102, ..., 0.99843083, 0.99559432,
        0.99569584],
       [0.99702658, 0.9986651 , 0.9964485 , ..., 0.99858105, 0.99716343,
        0.9966288 ],
       ...,
       [0.99537926, 0.99751591, 0.99476767, ..., 0.99769908, 0.99567614,
        0.99477961],
       [0.9980274 , 0.99883912, 0.99729106, ..., 0.99898969, 0.99759107,
        0.99740252],
       [0.99784996, 0.99894311, 0.99718943, ..., 0.9989561 , 0.99742801,
        0.9973633 ]])
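
Scores this close to 1 do not necessarily mean the texts are alike. Cosine similarity is the dot product of two vectors divided by the product of their norms, and it saturates whenever every vector shares a large common component. The toy sketch below illustrates that effect with made-up 2-D vectors (not GPT-2 output). The names `cosine`, `shifted`, and `centered` are hypothetical, and subtracting the mean vector is shown only as one possible diagnostic, not something this notebook does.

```python
import math

def cosine(u, v):
    """Cosine similarity: dot(u, v) / (||u|| * ||v||)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Three toy vectors; the first two are orthogonal
vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# Add the same large offset to every component of every vector
shifted = [[x + 100.0 for x in v] for v in vectors]

# Subtract the mean vector of the shifted set
mean_vec = [sum(col) / len(shifted) for col in zip(*shifted)]
centered = [[x - m for x, m in zip(v, mean_vec)] for v in shifted]

raw_sim = cosine(vectors[0], vectors[1])
shifted_sim = cosine(shifted[0], shifted[1])
centered_sim = cosine(centered[0], centered[1])

print(f"original: {raw_sim:.4f}  shifted: {shifted_sim:.4f}  centered: {centered_sim:.4f}")
```

The shifted pair scores above 0.9999 even though the underlying directions are orthogonal, while the centered pair gets a clearly different score.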

In [16]:
num_top_candidates = 10
top_candidates = []

for i, job_description in enumerate(job_descriptions):
    candidates_with_scores = list(enumerate(similarity_scores[i]))
    candidates_with_scores.sort(key=lambda x: x[1], reverse=True)
    top_candidates_for_job = candidates_with_scores[:num_top_candidates]
    top_candidates.append(top_candidates_for_job)

# Print the top candidates for each job description
for i, job_description in enumerate(job_descriptions):
    print(f"Top candidates for JD {i+1} - Position: {jd_df['position_title'][i]}")
    for candidate_index, score in top_candidates[i]:
        print(f"  Candidate {candidate_index + 1} - Similarity Score: {score:.4f} - {cv_df['Category'][candidate_index]}/{cv_df['ID'][candidate_index]}.pdf")
        # print(f"  Resume: {resumes[candidate_index]}")
    print()

Top candidates for JD 1 - Position: Sales Specialist
  Candidate 2085 - Similarity Score: 0.9995 - ADVOCATE/97405769.pdf
  Candidate 740 - Similarity Score: 0.9994 - PUBLIC-RELATIONS/24491862.pdf
  Candidate 2071 - Similarity Score: 0.9994 - APPAREL/29764492.pdf
  Candidate 225 - Similarity Score: 0.9994 - SALES/17410700.pdf
  Candidate 61 - Similarity Score: 0.9993 - AGRICULTURE/84512719.pdf
  Candidate 1932 - Similarity Score: 0.9993 - BUSINESS-DEVELOPMENT/20317319.pdf
  Candidate 1268 - Similarity Score: 0.9993 - INFORMATION-TECHNOLOGY/17111768.pdf
  Candidate 269 - Similarity Score: 0.9993 - SALES/54101961.pdf
  Candidate 1450 - Similarity Score: 0.9993 - CONSTRUCTION/30311725.pdf
  Candidate 855 - Similarity Score: 0.9992 - AVIATION/99416532.pdf

Top candidates for JD 2 - Position: Apple Solutions Consultant
  Candidate 582 - Similarity Score: 0.9996 - CHEF/21869994.pdf
  Candidate 802 - Similarity Score: 0.9995 - PUBLIC-RELATIONS/14364597.pdf
  Candidate 649 - Similarity Score:

In [17]:
selected_jd_indices = [0, 1, 2, 3, 4]

top_data = []

for jd_index in selected_jd_indices:
    # Top matching resume for this JD
    top_resume_index, top_score = top_candidates[jd_index][0]
    
    # Extract resume details
    resume_id = cv_df['ID'][top_resume_index]
    resume_cat = cv_df['Category'][top_resume_index]
    
    top_data.append({
        "JD Title": jd_df['position_title'][jd_index],
        "Resume ID": resume_id,
        "Resume Category": resume_cat,
        "Similarity Score": top_score
    })

top_resumes_df = pd.DataFrame(top_data)
top_resumes_df
# print(top_resumes_df)


Unnamed: 0,JD Title,Resume ID,Resume Category,Similarity Score
0,Sales Specialist,97405769,ADVOCATE,0.999457
1,Apple Solutions Consultant,21869994,CHEF,0.999566
2,Licensing Coordinator - Consumer Products,20317319,BUSINESS-DEVELOPMENT,0.999535
3,Web Designer,19156751,SALES,0.999545
4,Web Developer,35579812,HEALTHCARE,0.999458


## Getting Top-10 Average Scores

In [18]:
avg_top10_scores = []

for jd_index in selected_jd_indices:
    top_10 = top_candidates[jd_index][:10]
    
    scores = [score for _, score in top_10]
    
    avg_score = sum(scores) / len(scores)
    
    avg_top10_scores.append({
        "JD Title": jd_df['position_title'][jd_index],
        "Average Top-10 Score": round(avg_score, 4)
    })

avg_top10_df = pd.DataFrame(avg_top10_scores)

print(avg_top10_df)


                                    JD Title  Average Top-10 Score
0                           Sales Specialist                0.9993
1                 Apple Solutions Consultant                0.9994
2  Licensing Coordinator - Consumer Products                0.9995
3                               Web Designer                0.9995
4                              Web Developer                0.9994


In [20]:
# Fixed selections
selected_jd_indices = [0, 1, 2, 3, 4]
selected_resume_indices = [1160, 1799, 2112, 1268, 1430, 1360, 713, 1944, 663, 1551]

# Initialize matrix
gpt_matrix = []

for resume_idx in selected_resume_indices:
    row = []
    for jd_idx in selected_jd_indices:
        sim_score = similarity_scores[jd_idx][resume_idx]
        row.append(round(sim_score, 4))
    gpt_matrix.append(row)

# Prepare column & row labels
jd_titles = [jd_df["position_title"][i] for i in selected_jd_indices]
resume_labels = [f"{cv_df['Category'][i]}/{cv_df['ID'][i]}" for i in selected_resume_indices]

# Create DataFrame
gpt_df = pd.DataFrame(gpt_matrix, index=resume_labels, columns=jd_titles)
gpt_df


Unnamed: 0,Sales Specialist,Apple Solutions Consultant,Licensing Coordinator - Consumer Products,Web Designer,Web Developer
ACCOUNTANT/75286906,0.9966,0.9958,0.9976,0.9973,0.9956
FITNESS/27974588,0.994,0.9926,0.9944,0.9922,0.9954
ADVOCATE/14445309,0.9959,0.9942,0.9956,0.9944,0.9973
INFORMATION-TECHNOLOGY/29975124,0.9983,0.9975,0.9985,0.9979,0.9981
CONSTRUCTION/30397268,0.9971,0.9964,0.997,0.9963,0.9979
HR/22323967,0.9985,0.9979,0.9989,0.9982,0.9978
HEALTHCARE/24025053,0.9966,0.9954,0.9965,0.9956,0.9976
BUSINESS-DEVELOPMENT/12546838,0.9973,0.9976,0.997,0.9961,0.9957
HEALTHCARE/23944036,0.9977,0.9969,0.9972,0.996,0.9988
DESIGNER/25023614,0.9987,0.9982,0.9985,0.9977,0.9989


The GPT-2 embeddings produced uniformly high similarity scores: every value reported above lies between 0.9922 and 0.9995. A range this narrow leaves almost no variance to separate good matches from poor ones. Although the scores might initially appear to reflect strong semantic alignment, manual review showed that GPT-2 failed to capture contextual differences between unrelated job descriptions and resumes. For example, the resume from the “Fitness” category (FITNESS/27974588) scored above 0.99 against every selected role, including “Web Developer” and “Licensing Coordinator - Consumer Products,” which are clearly mismatched. The top-ranked resumes point the same way: a “Chef” resume for Apple Solutions Consultant and a “Healthcare” resume for Web Developer. This inflation is likely due to how the embeddings are built: GPT-2 is an autoregressive language model that is not trained for sentence-level or sentence-pair similarity, so its mean-pooled hidden states are not optimized to separate texts by meaning.
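
Because the raw scores leave so little room, it can help to look at how each resume ranks relative to the others for the same JD. The sketch below standardizes each JD column of the resume-by-JD table above into z-scores, using only the scores already reported there. It is a diagnostic sketch, not part of the original analysis. `z_scores_per_jd` is a hypothetical helper, and with only ten resumes the statistics are rough.

```python
from statistics import mean, pstdev

jd_titles = [
    "Sales Specialist",
    "Apple Solutions Consultant",
    "Licensing Coordinator - Consumer Products",
    "Web Designer",
    "Web Developer",
]

# Scores copied from the resume-by-JD table above (one row per resume)
scores = {
    "ACCOUNTANT/75286906": [0.9966, 0.9958, 0.9976, 0.9973, 0.9956],
    "FITNESS/27974588": [0.994, 0.9926, 0.9944, 0.9922, 0.9954],
    "ADVOCATE/14445309": [0.9959, 0.9942, 0.9956, 0.9944, 0.9973],
    "INFORMATION-TECHNOLOGY/29975124": [0.9983, 0.9975, 0.9985, 0.9979, 0.9981],
    "CONSTRUCTION/30397268": [0.9971, 0.9964, 0.997, 0.9963, 0.9979],
    "HR/22323967": [0.9985, 0.9979, 0.9989, 0.9982, 0.9978],
    "HEALTHCARE/24025053": [0.9966, 0.9954, 0.9965, 0.9956, 0.9976],
    "BUSINESS-DEVELOPMENT/12546838": [0.9973, 0.9976, 0.997, 0.9961, 0.9957],
    "HEALTHCARE/23944036": [0.9977, 0.9969, 0.9972, 0.996, 0.9988],
    "DESIGNER/25023614": [0.9987, 0.9982, 0.9985, 0.9977, 0.9989],
}

def z_scores_per_jd(table):
    """Standardize each JD column: (score - column mean) / column std."""
    columns = list(zip(*table.values()))
    stats = [(mean(col), pstdev(col)) for col in columns]
    return {
        label: [(s - m) / sd for s, (m, sd) in zip(row, stats)]
        for label, row in table.items()
    }

# Spread of the raw scores in this table
all_scores = [s for row in scores.values() for s in row]
print(f"raw range in this table: {min(all_scores)} to {max(all_scores)}")

# Relative standing of the Fitness resume for each JD
z = z_scores_per_jd(scores)
for title, value in zip(jd_titles, z["FITNESS/27974588"]):
    print(f"{title}: z = {value:+.2f}")
```

In this table the Fitness resume falls below the column mean for every JD, so a relative view separates it even though its raw scores all exceed 0.99. That does not make the ranking trustworthy, since the top-ranked matches reported earlier still include clearly unrelated categories.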