# Resume-Job Description Semantic Matching Notebook

This notebook performs semantic matching between preprocessed resume data and job descriptions using Sentence-BERT. The resume data has been preprocessed and lemmatized, and now we focus on the semantic similarity matching workflow.

# 1. Load Preprocessed Resume Data

Load the cleaned and lemmatized resume data CSV file and add unique identifiers for tracking.

## 2. Load the Resume Data CSV
Read the 'parsed_resumes.csv' file into a pandas DataFrame.

## 3. Explore the Data Structure
Display the first few rows and check column names, datatypes, and basic statistics.

### Exploring Data Structure

Let's examine the columns, datatypes, and basic statistics of the loaded DataFrame to understand the structure of our parsed resume data.

## 4. Handle Missing Values

In this section, we will identify and handle missing values in the dataset to ensure data quality for further processing.

## 5. Normalize Text Columns

We will now normalize text data by converting to lowercase, stripping whitespace, and standardizing formats where appropriate.

## 6. Remove Duplicates

Next, we will remove duplicate rows to ensure each resume entry is unique.

## 7. Extract and Clean Skills

We will now extract and clean the 'Skills' column, ensuring consistent formatting and removing irrelevant characters.

## 8. Tokenize and Lemmatize Text Fields

We will now tokenize and lemmatize relevant text fields (such as 'Summary', 'Experience', etc.) to prepare for downstream NLP tasks.

## 9. Save Preprocessed Data

Finally, we will save the cleaned and preprocessed DataFrame to a new CSV file for downstream analysis or modeling.

In [None]:
# Load the manually cleaned lemmatized CSV and add unique identifier
import pandas as pd
import numpy as np

# Load the cleaned lemmatized data
df_lemmatized = pd.read_csv('preprocessed_resumes_lemmatized.csv')

# Add unique identifier column at the start
df_lemmatized.insert(0, 'Resume_ID', range(1, len(df_lemmatized) + 1))

# Save the updated CSV with unique identifier
df_lemmatized.to_csv('preprocessed_resumes_lemmatized.csv', index=False)

print(f'Loaded {len(df_lemmatized)} resumes')
print(f'Columns: {df_lemmatized.columns.tolist()}')
print(f'Shape: {df_lemmatized.shape}')
df_lemmatized.head()

Lemmatized columns found: ['Education_lemmatized', 'Skills_lemmatized', 'Work Experience_lemmatized', 'Additional Information_lemmatized', 'Person Name_lemmatized', 'Certifications/Licenses_lemmatized', 'Education_lemmatized_lemmatized', 'Skills_lemmatized_lemmatized', 'Work Experience_lemmatized_lemmatized', 'Additional Information_lemmatized_lemmatized', 'Person Name_lemmatized_lemmatized', 'Certifications/Licenses_lemmatized_lemmatized', 'Education_lemmatized_lemmatized_lemmatized', 'Skills_lemmatized_lemmatized_lemmatized', 'Work Experience_lemmatized_lemmatized_lemmatized', 'Additional Information_lemmatized_lemmatized_lemmatized']
Lemmatized data saved to preprocessed_resumes_lemmatized.csv
Shape of lemmatized dataframe: (2437, 16)
Columns in lemmatized CSV: ['Education_lemmatized', 'Skills_lemmatized', 'Work Experience_lemmatized', 'Additional Information_lemmatized', 'Person Name_lemmatized', 'Certifications/Licenses_lemmatized', 'Education_lemmatized_lemmatized', 'Skills_lemma

In [27]:
# Add unique identifier (Resume_ID) to each row
# Check if Resume_ID column already exists
if 'Resume_ID' not in df_lemmatized.columns:
    # Add Resume_ID as the first column
    df_lemmatized.insert(0, 'Resume_ID', range(1, len(df_lemmatized) + 1))
    print(f'Added Resume_ID column to {len(df_lemmatized)} resumes')
else:
    print('Resume_ID column already exists')

# Display the first few rows to verify
print(f'Dataframe shape: {df_lemmatized.shape}')
print(f'Columns: {df_lemmatized.columns.tolist()}')
df_lemmatized.head()

Added Resume_ID column to 2437 resumes
Dataframe shape: (2437, 18)
Columns: ['Resume_ID', 'Education_lemmatized', 'Skills_lemmatized', 'Work Experience_lemmatized', 'Additional Information_lemmatized', 'Person Name_lemmatized', 'Certifications/Licenses_lemmatized', 'Education_lemmatized_lemmatized', 'Skills_lemmatized_lemmatized', 'Work Experience_lemmatized_lemmatized', 'Additional Information_lemmatized_lemmatized', 'Person Name_lemmatized_lemmatized', 'Certifications/Licenses_lemmatized_lemmatized', 'Education_lemmatized_lemmatized_lemmatized', 'Skills_lemmatized_lemmatized_lemmatized', 'Work Experience_lemmatized_lemmatized_lemmatized', 'Additional Information_lemmatized_lemmatized_lemmatized', 'JD_Match_Score']


Unnamed: 0,Resume_ID,Education_lemmatized,Skills_lemmatized,Work Experience_lemmatized,Additional Information_lemmatized,Person Name_lemmatized,Certifications/Licenses_lemmatized,Education_lemmatized_lemmatized,Skills_lemmatized_lemmatized,Work Experience_lemmatized_lemmatized,Additional Information_lemmatized_lemmatized,Person Name_lemmatized_lemmatized,Certifications/Licenses_lemmatized_lemmatized,Education_lemmatized_lemmatized_lemmatized,Skills_lemmatized_lemmatized_lemmatized,Work Experience_lemmatized_lemmatized_lemmatized,Additional Information_lemmatized_lemmatized_lemmatized,JD_Match_Score
0,1,associate drafting technology university texas...,window window system administrator ii trico pr...,senior system administrator trico product brow...,,,,associate drafting technology university texas...,window window system administrator ii trico pr...,senior system administrator trico product brow...,,,,associate drafting technology university texas...,window window system administrator ii trico pr...,senior system administrator trico product brow...,,0.250928
1,2,bachelor science computer criminology florida ...,less average per day service desk analyst knac...,system administrator bios technology metairie ...,,b,comptia november november comptia november nov...,bachelor science computer criminology florida ...,less average per day service desk analyst knac...,system administrator bios technology metairie ...,,b,comptia november november comptia november nov...,bachelor science computer criminology florida ...,less average per day service desk analyst knac...,system administrator bios technology metairie ...,,0.282353
2,3,certificate web development university wiscons...,altiris apache barracuda cisco switch year web...,system administrator nord gear corporation wau...,seasoned professional proven track record tech...,c,,certificate web development university wiscons...,altiris apache barracuda cisco switch year web...,system administrator nord gear corporation wau...,seasoned professional proven track record tech...,c,,certificate web development university wiscons...,altiris apache barracuda cisco switch year web...,system administrator nord gear corporation wau...,seasoned professional proven track record tech...,0.22915
3,4,certified technician telecommunication cerco i...,available resource set new computer built serv...,roti mediterranean grill north bethesda md jan...,technical skill patient punctual great attenti...,,certified technician,certified technician telecommunication cerco i...,available resource set new computer built serv...,roti mediterranean grill north bethesda md jan...,technical skill patient punctual great attenti...,,certified technician,certified technician telecommunication cerco i...,available resource set new computer built serv...,roti mediterranean grill north bethesda md jan...,technical skill patient punctual great attenti...,0.215634
4,5,bachelor science statistic university florida ...,aws year bash year database year freebsd year ...,system administrator bex realty boca raton fl ...,computer skill linux skill linux user since op...,e,red hat certified system administrator august ...,bachelor science statistic university florida ...,aws year bash year database year freebsd year ...,system administrator bex realty boca raton fl ...,computer skill linux skill linux user since op...,e,red hat certified system administrator august ...,bachelor science statistic university florida ...,aws year bash year database year freebsd year ...,system administrator bex realty boca raton fl ...,computer skill linux skill linux user since op...,0.317821


# 3. Resume-Job Description Semantic Matching

Use Sentence-BERT to compute semantic similarity between each preprocessed resume and the job description. The similarity score indicates how well each resume matches the job requirements.

In [24]:
# Install required packages for semantic matching
!pip install torch==2.2.2 --force-reinstall --no-cache-dir
!pip install transformers==4.40.2 --force-reinstall --no-cache-dir
!pip install sentence-transformers==2.6.1 --force-reinstall --no-cache-dir
!pip install scikit-learn --force-reinstall --no-cache-dir

Collecting torch==2.2.2
  Downloading torch-2.2.2-cp311-cp311-win_amd64.whl.metadata (26 kB)
Collecting filelock (from torch==2.2.2)
  Downloading filelock-3.18.0-py3-none-any.whl.metadata (2.9 kB)
Collecting typing-extensions>=4.8.0 (from torch==2.2.2)
  Downloading typing_extensions-4.14.0-py3-none-any.whl.metadata (3.0 kB)
Collecting sympy (from torch==2.2.2)
  Downloading sympy-1.14.0-py3-none-any.whl.metadata (12 kB)
Collecting networkx (from torch==2.2.2)
  Downloading networkx-3.5-py3-none-any.whl.metadata (6.3 kB)
Collecting jinja2 (from torch==2.2.2)
  Downloading jinja2-3.1.6-py3-none-any.whl.metadata (2.9 kB)
Collecting fsspec (from torch==2.2.2)
  Downloading fsspec-2025.5.1-py3-none-any.whl.metadata (11 kB)
Collecting MarkupSafe>=2.0 (from jinja2->torch==2.2.2)
  Downloading MarkupSafe-3.0.2-cp311-cp311-win_amd64.whl.metadata (4.1 kB)
Collecting mpmath<1.4,>=1.1.0 (from sympy->torch==2.2.2)
  Downloading mpmath-1.3.0-py3-none-any.whl.metadata (8.6 kB)
Downloading torch-2.2

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
tensorflow-intel 2.16.1 requires keras>=3.0.0, which is not installed.
s3fs 2023.10.0 requires fsspec==2023.10.0, but you have fsspec 2025.5.1 which is incompatible.
streamlit 1.30.0 requires numpy<2,>=1.19.3, but you have numpy 2.3.1 which is incompatible.
streamlit 1.30.0 requires packaging<24,>=16.8, but you have packaging 25.0 which is incompatible.
streamlit 1.30.0 requires pillow<11,>=7.1.0, but you have pillow 11.3.0 which is incompatible.
streamlit 1.30.0 requires protobuf<5,>=3.20, but you have protobuf 5.29.5 which is incompatible.
streamlit 1.30.0 requires rich<14,>=10.14.0, but you have rich 14.0.0 which is incompatible.
tensorflow-intel 2.16.1 requires ml-dtypes~=0.3.1, but you have ml-dtypes 0.5.1 which is incompatible.
tensorflow-intel 2.16.1 requires numpy<2.0.0,>=1.23.5; python_version <= "3.11", 

Collecting transformers==4.40.2
  Downloading transformers-4.40.2-py3-none-any.whl.metadata (137 kB)
Collecting filelock (from transformers==4.40.2)
  Downloading filelock-3.18.0-py3-none-any.whl.metadata (2.9 kB)
Collecting huggingface-hub<1.0,>=0.19.3 (from transformers==4.40.2)
  Downloading huggingface_hub-0.33.2-py3-none-any.whl.metadata (14 kB)
Collecting numpy>=1.17 (from transformers==4.40.2)
  Downloading numpy-2.3.1-cp311-cp311-win_amd64.whl.metadata (60 kB)
Collecting packaging>=20.0 (from transformers==4.40.2)
  Downloading packaging-25.0-py3-none-any.whl.metadata (3.3 kB)
Collecting pyyaml>=5.1 (from transformers==4.40.2)
  Downloading PyYAML-6.0.2-cp311-cp311-win_amd64.whl.metadata (2.1 kB)
Collecting regex!=2019.12.17 (from transformers==4.40.2)
  Downloading regex-2024.11.6-cp311-cp311-win_amd64.whl.metadata (41 kB)
Collecting requests (from transformers==4.40.2)
  Downloading requests-2.32.4-py3-none-any.whl.metadata (4.9 kB)
Collecting tokenizers<0.20,>=0.19 (from tra

  You can safely remove it manually.
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
conda-repo-cli 1.0.75 requires requests_mock, which is not installed.
gensim 4.3.0 requires FuzzyTM>=0.4.0, which is not installed.
tensorflow-intel 2.16.1 requires keras>=3.0.0, which is not installed.
anaconda-cloud-auth 0.1.4 requires pydantic<2.0, but you have pydantic 2.11.7 which is incompatible.
astropy 5.3.4 requires numpy<2,>=1.21, but you have numpy 2.3.1 which is incompatible.
botocore 1.31.64 requires urllib3<2.1,>=1.25.4; python_version >= "3.10", but you have urllib3 2.5.0 which is incompatible.
conda-repo-cli 1.0.75 requires clyent==1.2.1, but you have clyent 1.2.2 which is incompatible.
conda-repo-cli 1.0.75 requires PyYAML==6.0.1, but you have pyyaml 6.0.2 which is incompatible.
conda-repo-cli 1.0.75 requires requests==2.31.0, but you have requests 2.32.4 which

Collecting sentence-transformers==2.6.1
  Downloading sentence_transformers-2.6.1-py3-none-any.whl.metadata (11 kB)
Collecting transformers<5.0.0,>=4.32.0 (from sentence-transformers==2.6.1)
  Downloading transformers-4.53.0-py3-none-any.whl.metadata (39 kB)
Collecting tqdm (from sentence-transformers==2.6.1)
  Downloading tqdm-4.67.1-py3-none-any.whl.metadata (57 kB)
Collecting torch>=1.11.0 (from sentence-transformers==2.6.1)
  Downloading torch-2.7.1-cp311-cp311-win_amd64.whl.metadata (28 kB)
Collecting numpy (from sentence-transformers==2.6.1)
  Downloading numpy-2.3.1-cp311-cp311-win_amd64.whl.metadata (60 kB)
Collecting scikit-learn (from sentence-transformers==2.6.1)
  Downloading scikit_learn-1.7.0-cp311-cp311-win_amd64.whl.metadata (14 kB)
Collecting scipy (from sentence-transformers==2.6.1)
  Downloading scipy-1.16.0-cp311-cp311-win_amd64.whl.metadata (60 kB)
Collecting huggingface-hub>=0.15.1 (from sentence-transformers==2.6.1)
  Downloading huggingface_hub-0.33.2-py3-none-a

  You can safely remove it manually.
  You can safely remove it manually.
  You can safely remove it manually.
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
conda-repo-cli 1.0.75 requires requests_mock, which is not installed.
gensim 4.3.0 requires FuzzyTM>=0.4.0, which is not installed.
tensorflow-intel 2.16.1 requires keras>=3.0.0, which is not installed.
anaconda-cloud-auth 0.1.4 requires pydantic<2.0, but you have pydantic 2.11.7 which is incompatible.
astropy 5.3.4 requires numpy<2,>=1.21, but you have numpy 2.3.1 which is incompatible.
botocore 1.31.64 requires urllib3<2.1,>=1.25.4; python_version >= "3.10", but you have urllib3 2.5.0 which is incompatible.
conda-repo-cli 1.0.75 requires clyent==1.2.1, but you have clyent 1.2.2 which is incompatible.
conda-repo-cli 1.0.75 requires PyYAML==6.0.1, but you have pyyaml 6.0.2 which is incompatible.
conda-rep

Collecting scikit-learn
  Downloading scikit_learn-1.7.0-cp311-cp311-win_amd64.whl.metadata (14 kB)
Collecting numpy>=1.22.0 (from scikit-learn)
  Downloading numpy-2.3.1-cp311-cp311-win_amd64.whl.metadata (60 kB)
Collecting scipy>=1.8.0 (from scikit-learn)
  Downloading scipy-1.16.0-cp311-cp311-win_amd64.whl.metadata (60 kB)
Collecting joblib>=1.2.0 (from scikit-learn)
  Downloading joblib-1.5.1-py3-none-any.whl.metadata (5.6 kB)
Collecting threadpoolctl>=3.1.0 (from scikit-learn)
  Downloading threadpoolctl-3.6.0-py3-none-any.whl.metadata (13 kB)
Downloading scikit_learn-1.7.0-cp311-cp311-win_amd64.whl (10.7 MB)
   ---------------------------------------- 0.0/10.7 MB ? eta -:--:--
   --- ------------------------------------ 1.0/10.7 MB 5.6 MB/s eta 0:00:02
   ------ --------------------------------- 1.8/10.7 MB 4.6 MB/s eta 0:00:02
   ---------- ----------------------------- 2.9/10.7 MB 4.3 MB/s eta 0:00:02
   ----------- ---------------------------- 3.1/10.7 MB 4.3 MB/s eta 0:00:02


ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
gensim 4.3.0 requires FuzzyTM>=0.4.0, which is not installed.
tensorflow-intel 2.16.1 requires keras>=3.0.0, which is not installed.
astropy 5.3.4 requires numpy<2,>=1.21, but you have numpy 2.3.1 which is incompatible.
contourpy 1.2.0 requires numpy<2.0,>=1.20, but you have numpy 2.3.1 which is incompatible.
numba 0.59.0 requires numpy<1.27,>=1.22, but you have numpy 2.3.1 which is incompatible.
pywavelets 1.5.0 requires numpy<2.0,>=1.22.4, but you have numpy 2.3.1 which is incompatible.
streamlit 1.30.0 requires numpy<2,>=1.19.3, but you have numpy 2.3.1 which is incompatible.
streamlit 1.30.0 requires packaging<24,>=16.8, but you have packaging 25.0 which is incompatible.
streamlit 1.30.0 requires pillow<11,>=7.1.0, but you have pillow 11.3.0 which is incompatible.
streamlit 1.30.0 requires protobuf<5,>=3.20, b

# 2. Install Dependencies

Install the required packages for semantic matching with Sentence-BERT.

In [28]:
# Load the job description from file
with open(r'c:/Users/AMARTYA KUMAR/Desktop/PS-2/PS-2 functions/Web-Developer-job-description.txt', 'r', encoding='utf-8') as f:
    job_description = f.read()
print(job_description)


Job Title: Web Developer
Company: Not specified
Department: engineering
Level: intern
Work Mode: remote
Location: Not specified
Salary: EUR1K - EUR2K
Job Type: internship
Description: We need web developers with expertise in at least 1 of the following skills - Node.js, Nest.js, Vue.js, Laravel.

Selected Intern's day-to-day responsibilities include creating websites through Node.js, Nest.js, Vue.js, etc.

Please do not apply if you do not have expertise.

We are looking only for serious interns. To prevent mutual waste of time, energy, and resources, any intern not working seriously can be removed with immediate effect without any benefits being given. We will give intern certificates, etc., to only those who have contributed sufficiently.
                  


In [29]:
# Semantic matching using Sentence-BERT
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

# Load the model
model = SentenceTransformer('all-MiniLM-L6-v2')

# Prepare resume texts (concatenate all lemmatized fields)
resume_texts = df_lemmatized.apply(lambda row: ' '.join([
    str(row.get('Person Name_lemmatized', '')),
    str(row.get('Work Experience_lemmatized', '')),
    str(row.get('Skills_lemmatized', '')),
    str(row.get('Education_lemmatized', '')),
    str(row.get('Certifications/Licenses_lemmatized', '')),
    str(row.get('Additional Information_lemmatized', ''))
]), axis=1).tolist()

print(f'Processing {len(resume_texts)} resumes for semantic matching...')

# Encode resumes and job description
resume_embeddings = model.encode(resume_texts, convert_to_numpy=True)
jd_embedding = model.encode([job_description], convert_to_numpy=True)

# Compute cosine similarity
similarity_scores = cosine_similarity(resume_embeddings, jd_embedding).flatten()

# Add match scores to dataframe
df_lemmatized['JD_Match_Score'] = similarity_scores

# Show top 10 matches
top_matches = df_lemmatized.sort_values('JD_Match_Score', ascending=False).head(10)
print('\nTop 10 Resume Matches:')
display(top_matches[['Resume_ID', 'Person Name_lemmatized', 'Skills_lemmatized', 'JD_Match_Score']])

# Save results
df_lemmatized.to_csv('resume_matching_results_final.csv', index=False)
print(f'\nResults saved to resume_matching_results_final.csv')

Processing 2437 resumes for semantic matching...

Top 10 Resume Matches:

Top 10 Resume Matches:


Unnamed: 0,Resume_ID,Person Name_lemmatized,Skills_lemmatized,JD_Match_Score
637,638,xn,year bootstrap expressjs year ajax year year a...,0.625818
693,694,zr,year bootstrap expressjs year ajax year year a...,0.616476
858,859,aga,year bootstrap expressjs year ajax year year a...,0.603025
748,749,abu,front end less year javascript less year ui le...,0.583438
715,716,aan,coding year troubleshooting,0.554881
656,657,yg,coding year troubleshooting,0.544948
608,609,wk,avocode year photoshop,0.529047
643,644,xt,cooporate design team create simple beautiful ...,0.508679
803,804,adx,angular less year front end less year javascri...,0.48399
1913,1914,bup,ajax bootstrap cs dhtml html javascript jquery...,0.481086



Results saved to resume_matching_results_final.csv
