# RESUME PARSER 


---

## Introduction

This project implements an AI-driven system to recommend jobs to candidates based on semantic similarity between resumes and job descriptions. It leverages pre-trained sentence transformer models to generate embeddings for job descriptions and resumes, enabling more meaningful and context-aware matching beyond simple keyword searches.

## Problem Definition

Traditional recruitment processes rely heavily on manual resume screening and keyword matching, which are time-consuming, inefficient, and prone to overlooking well-qualified candidates due to differences in phrasing or terminology. This results in longer hiring cycles and potentially biased selections.

## Value Proposition

This project demonstrates how embedding-based similarity matching can:

* **Enhance Candidate-Job Fit:** By understanding the semantic content of resumes and job postings, the system provides relevant job recommendations to candidates.
* **Increase Recruitment Efficiency:** Automating the matching process reduces manual effort and accelerates candidate shortlisting.
* **Support Data-Driven Hiring Decisions:** Enables recruiters to prioritize candidates based on meaningful content similarity scores.

## Scope

* Use of the Sentence-Transformers `all-MiniLM-L6-v2` embedding model for feature extraction.
* Processing and embedding of real-world datasets of job postings and candidate resumes.
* Development of a recommendation mechanism that ranks top job matches for candidates based on cosine similarity.
* Visualization of job and resume data insights using multiple plot types (bar charts, pie charts, word clouds).
* Implementation of evaluation metrics suited for recommendation tasks (Top-K Accuracy, Precision, Recall, MRR).
* Threshold tuning and binary classification evaluation (Precision, Recall, F1, ROC-AUC) to measure match quality.

## Data Sources

* **Job Postings Dataset:** Real job descriptions with titles, companies, locations, and domains.
* **Resume Dataset:** Candidate resumes with extracted text and metadata fields.

## Data Preprocessing

* Handling missing values in critical columns to maintain dataset quality.
* Text normalization including lowercasing and removal of non-alphanumeric characters for consistent embedding generation.
* Extraction of key textual features (e.g., job descriptions, resume text) to generate meaningful embeddings.
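
The normalization step listed above can be sketched with the standard library alone. This is a minimal illustration, not the notebook's own function: the name `normalize_text` is hypothetical, and replacing stripped characters with a space (rather than deleting them) is an assumption made here so that words joined by punctuation stay separate.

```python
import re

def normalize_text(text):
    """Lowercase, replace non-alphanumeric characters with spaces, collapse whitespace."""
    if text is None:
        # Treat a missing value as empty text (assumption for this sketch)
        return ""
    text = str(text).lower()
    # Anything that is not a lowercase letter, digit, or whitespace becomes a space
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    # Collapse runs of whitespace left behind by the substitution
    return re.sub(r"\s+", " ", text).strip()

print(normalize_text("Sr. Python/AWS Developer (5+ yrs)"))
```

Replacing with a space keeps `python/aws` as two tokens; deleting the character instead would fuse them into one.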

## Model and Matching

* Use of a sentence transformer model to convert textual data into dense vector representations.
* Calculation of cosine similarity between job and resume embeddings to identify the best matches.
* Application of filters and ranking to refine recommendations.
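
As a rough illustration of the matching step, the sketch below ranks jobs for one resume by cosine similarity. The three-dimensional vectors are toy stand-ins for real embeddings, and the helper names `cosine` and `top_k_jobs` are hypothetical; in the project itself the vectors would come from the sentence transformer model and the similarity from `sklearn.metrics.pairwise.cosine_similarity`.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # define similarity with a zero vector as 0
    return dot / (norm_u * norm_v)

def top_k_jobs(resume_vec, job_vecs, k=5):
    """Return the k (job_id, score) pairs with the highest similarity."""
    scored = [(job_id, cosine(resume_vec, vec)) for job_id, vec in job_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Toy embeddings (assumed values, for illustration only)
job_vecs = {
    "job_a": [1.0, 0.0, 0.0],
    "job_b": [0.6, 0.8, 0.0],
    "job_c": [0.0, 0.0, 1.0],
}
resume_vec = [0.8, 0.6, 0.0]

for job_id, score in top_k_jobs(resume_vec, job_vecs, k=2):
    print(job_id, round(score, 3))
```

Any filters (for example by location or domain) would be applied to the candidate set before or after this ranking step.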

## Evaluation

* Use of recommendation system metrics like Top-5 Accuracy, Precision\@5, Recall\@5, and Mean Reciprocal Rank to evaluate ranking performance.
* Additional binary classification metrics after threshold tuning to assess the quality of match predictions.
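
The ranking metrics named above can be sketched per candidate as follows. The definitions used here are common ones and are an assumption, since this section does not spell out the exact formulas: Top-K Accuracy is taken as "at least one relevant job in the top K", and MRR is the mean of the per-candidate reciprocal rank. All function names and the sample IDs are hypothetical.

```python
def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k recommendations that are relevant."""
    hits = sum(1 for item in ranked[:k] if item in relevant)
    return hits / k

def recall_at_k(ranked, relevant, k):
    """Fraction of all relevant items that appear in the top k."""
    if not relevant:
        return 0.0
    hits = sum(1 for item in ranked[:k] if item in relevant)
    return hits / len(relevant)

def hit_at_k(ranked, relevant, k):
    """1.0 if any relevant item is in the top k, else 0.0 (averaged, this gives Top-K Accuracy)."""
    return 1.0 if any(item in relevant for item in ranked[:k]) else 0.0

def reciprocal_rank(ranked, relevant):
    """1 / position of the first relevant item, or 0.0 if none is found (averaged, this gives MRR)."""
    for position, item in enumerate(ranked, start=1):
        if item in relevant:
            return 1.0 / position
    return 0.0

# One candidate's ranked recommendations and the jobs considered relevant (made-up IDs)
ranked = ["j7", "j2", "j9", "j4", "j1"]
relevant = {"j2", "j1", "j8"}

print(precision_at_k(ranked, relevant, 5))
print(recall_at_k(ranked, relevant, 5))
print(hit_at_k(ranked, relevant, 5))
print(reciprocal_rank(ranked, relevant))
```

Averaging each value over all candidates gives the dataset-level Precision@5, Recall@5, Top-5 Accuracy, and MRR.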

---




### Import Libraries and Install Dependencies

In [1]:
pip install nltk spacy

Collecting numpy>=1.19.0 (from spacy)
  Using cached numpy-2.2.6-cp310-cp310-win_amd64.whl.metadata (60 kB)
Using cached numpy-2.2.6-cp310-cp310-win_amd64.whl (12.9 MB)
Installing collected packages: numpy
  Attempting uninstall: numpy
    Found existing installation: numpy 1.24.3
    Uninstalling numpy-1.24.3:
      Successfully uninstalled numpy-1.24.3
Successfully installed numpy-2.2.6
Note: you may need to restart the kernel to use updated packages.


ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
streamlit 1.45.1 requires packaging<25,>=20, but you have packaging 25.0 which is incompatible.
tensorflow-intel 2.13.0 requires numpy<=1.24.3,>=1.22, but you have numpy 2.2.6 which is incompatible.
tensorflow-intel 2.13.0 requires typing-extensions<4.6.0,>=3.6.6, but you have typing-extensions 4.14.1 which is incompatible.
torchvision 0.22.1 requires torch==2.7.1, but you have torch 2.1.2 which is incompatible.


In [38]:
pip install sentence-transformers

Note: you may need to restart the kernel to use updated packages.


In [3]:
pip install pandas numpy scikit-learn matplotlib seaborn

Note: you may need to restart the kernel to use updated packages.


In [39]:
# Installing  the sentence-transformers library required for Sentence-BERT embeddings
!pip install -U sentence-transformers

Defaulting to user installation because normal site-packages is not writeable


In [5]:
!pip install spacy

Defaulting to user installation because normal site-packages is not writeable


In [6]:
!python -m spacy download en_core_web_sm

Traceback (most recent call last):
  File "<frozen runpy>", line 189, in _run_module_as_main
  File "<frozen runpy>", line 148, in _get_module_details
  File "<frozen runpy>", line 112, in _get_module_details
  File "C:\Users\Gouthum\AppData\Roaming\Python\Python312\site-packages\spacy\__init__.py", line 6, in <module>
  File "C:\Users\Gouthum\AppData\Roaming\Python\Python312\site-packages\spacy\errors.py", line 3, in <module>
    from .compat import Literal
  File "C:\Users\Gouthum\AppData\Roaming\Python\Python312\site-packages\spacy\compat.py", line 4, in <module>
    from thinc.util import copy_array
  File "C:\Users\Gouthum\AppData\Roaming\Python\Python312\site-packages\thinc\__init__.py", line 5, in <module>
    from .config import registry
  File "C:\Users\Gouthum\AppData\Roaming\Python\Python312\site-packages\thinc\config.py", line 5, in <module>
    from .types import Decorator
  File "C:\Users\Gouthum\AppData\Roaming\Python\Python312\site-packages\thinc\types.py", line 27, in 

In [7]:
pip install tensorflow keras ipywidgets sentence-transformers

Collecting numpy<=1.24.3,>=1.22 (from tensorflow-intel==2.13.0->tensorflow)
  Using cached numpy-1.24.3-cp310-cp310-win_amd64.whl.metadata (5.6 kB)
Collecting typing-extensions<4.6.0,>=3.6.6 (from tensorflow-intel==2.13.0->tensorflow)
  Using cached typing_extensions-4.5.0-py3-none-any.whl.metadata (8.5 kB)
Using cached numpy-1.24.3-cp310-cp310-win_amd64.whl (14.8 MB)
Using cached typing_extensions-4.5.0-py3-none-any.whl (27 kB)
Installing collected packages: typing-extensions, numpy

  Attempting uninstall: typing-extensions

    Found existing installation: typing_extensions 4.14.1

    Uninstalling typing_extensions-4.14.1:

   ---------------------------------------- 0/2 [typing-extensions]

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
altair 5.5.0 requires typing-extensions>=4.10.0; python_version < "3.14", but you have typing-extensions 4.5.0 which is incompatible.
opencv-python 4.12.0.88 requires numpy<2.3.0,>=2; python_version >= "3.9", but you have numpy 1.24.3 which is incompatible.
opencv-python-headless 4.12.0.88 requires numpy<2.3.0,>=2; python_version >= "3.9", but you have numpy 1.24.3 which is incompatible.
optree 0.16.0 requires typing-extensions>=4.6.0, but you have typing-extensions 4.5.0 which is incompatible.
pydantic 2.11.7 requires typing-extensions>=4.12.2, but you have typing-extensions 4.5.0 which is incompatible.
pydantic-core 2.33.2 requires typing-extensions!=4.7.0,>=4.6.0, but you have typing-extensions 4.5.0 which is incompatible.
streamlit 1.45.1 requires packaging<25,>=20, but you have packaging 25.0 which is incompa

In [8]:
!jupyter nbextension enable --py widgetsnbextension

usage: jupyter [-h] [--version] [--config-dir] [--data-dir] [--runtime-dir]
               [--paths] [--json] [--debug]
               [subcommand]

Jupyter: Interactive Computing

positional arguments:
  subcommand     the subcommand to launch

options:
  -h, --help     show this help message and exit
  --version      show the versions of core jupyter packages and exit
  --config-dir   show Jupyter config dir
  --data-dir     show Jupyter data dir
  --runtime-dir  show Jupyter runtime dir
  --paths        show all Jupyter paths. Add --json for machine-readable
                 format.
  --json         output paths as machine-readable json
  --debug        output debug information about paths

Available subcommands: console dejavu events execute kernel kernelspec lab
labextension labhub migrate nbconvert notebook qtconsole run script server
troubleshoot trust

Jupyter command `jupyter-nbextension` not found.


In [9]:
import ipywidgets as widgets
widgets.IntSlider()

IntSlider(value=0)

In [10]:
!jupyter nbextension enable --py widgetsnbextension --sys-prefix

Jupyter command `jupyter-nbextension` not found.


### Importing necessary libraries

In [13]:
pip uninstall transformers sentence-transformers -y

Found existing installation: transformers 4.54.1
Uninstalling transformers-4.54.1:
  Successfully uninstalled transformers-4.54.1
Found existing installation: sentence-transformers 5.0.0
Uninstalling sentence-transformers-5.0.0:
  Successfully uninstalled sentence-transformers-5.0.0
Note: you may need to restart the kernel to use updated packages.


In [14]:
pip install transformers==4.30.2 sentence-transformers==2.2.2

Collecting transformers==4.30.2
  Downloading transformers-4.30.2-py3-none-any.whl.metadata (113 kB)
Collecting sentence-transformers==2.2.2
  Downloading sentence-transformers-2.2.2.tar.gz (85 kB)
  Preparing metadata (setup.py): started
  Preparing metadata (setup.py): finished with status 'done'
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1 (from transformers==4.30.2)
  Downloading tokenizers-0.13.3-cp310-cp310-win_amd64.whl.metadata (6.9 kB)
Collecting sentencepiece (from sentence-transformers==2.2.2)
  Downloading sentencepiece-0.2.0-cp310-cp310-win_amd64.whl.metadata (8.3 kB)
Collecting torch>=1.6.0 (from sentence-transformers==2.2.2)
  Using cached torch-2.7.1-cp310-cp310-win_amd64.whl.metadata (28 kB)
Collecting typing-extensions>=3.7.4.3 (from huggingface-hub<1.0,>=0.14.1->transformers==4.30.2)
  Using cached typing_extensions-4.14.1-py3-none-any.whl.metadata (3.0 kB)
Downloading transformers-4.30.2-py3-none-any.whl (7.2 MB)
   ---------------------------------------- 0.0/7.2 M

  DEPRECATION: Building 'sentence-transformers' using the legacy setup.py bdist_wheel mechanism, which will be removed in a future version. pip 25.3 will enforce this behaviour change. A possible replacement is to use the standardized build interface by setting the `--use-pep517` option, (possibly combined with `--no-build-isolation`), or adding a `pyproject.toml` file to the source tree of 'sentence-transformers'. Discussion can be found at https://github.com/pypa/pip/issues/6334
  You can safely remove it manually.
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
streamlit 1.45.1 requires packaging<25,>=20, but you have packaging 25.0 which is incompatible.
tensorflow-intel 2.13.0 requires typing-extensions<4.6.0,>=3.6.6, but you have typing-extensions 4.14.1 which is incompatible.
thinc 8.3.6 requires numpy<3.0.0,>=2.0.0, 

In [16]:
pip install sentence-transformers==2.2.2 huggingface_hub==0.10.1 transformers==4.30.2

Collecting huggingface_hub==0.10.1
  Downloading huggingface_hub-0.10.1-py3-none-any.whl.metadata (6.1 kB)
INFO: pip is looking at multiple versions of transformers to determine which version is compatible with other requirements. This could take a while.
Collecting sentence-transformers==2.2.2
  Using cached sentence_transformers-2.2.2-py3-none-any.whl

The conflict is caused by:
    The user requested huggingface_hub==0.10.1
    sentence-transformers 2.2.2 depends on huggingface-hub>=0.4.0
    transformers 4.30.2 depends on huggingface-hub<1.0 and >=0.14.1

To fix this you could try to:
1. loosen the range of package versions you've specified
2. remove package versions to allow pip to attempt to solve the dependency conflict

Note: you may need to restart the kernel to use updated packages.


ERROR: Cannot install huggingface_hub==0.10.1, sentence-transformers==2.2.2 and transformers==4.30.2 because these package versions have conflicting dependencies.
ERROR: ResolutionImpossible: for help visit https://pip.pypa.io/en/latest/topics/dependency-resolution/#dealing-with-dependency-conflicts


In [40]:
pip install huggingface-hub==0.14.1

Collecting huggingface-hub==0.14.1
  Downloading huggingface_hub-0.14.1-py3-none-any.whl.metadata (7.6 kB)
Downloading huggingface_hub-0.14.1-py3-none-any.whl (224 kB)
Installing collected packages: huggingface-hub
  Attempting uninstall: huggingface-hub
    Found existing installation: huggingface-hub 0.34.3
    Uninstalling huggingface-hub-0.34.3:
      Successfully uninstalled huggingface-hub-0.34.3
Successfully installed huggingface-hub-0.14.1
Note: you may need to restart the kernel to use updated packages.


ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
datasets 4.0.0 requires huggingface-hub>=0.24.0, but you have huggingface-hub 0.14.1 which is incompatible.
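
Because the install cells above report several version conflicts, it can help to check what is actually installed in the running kernel before importing anything. The sketch below uses only the standard library; the helper name `installed_version` is hypothetical. The logs above mention both `cp310` wheels and a `Python312` site-packages path, which may indicate that `pip` and `!python` are targeting different interpreters, so running this inside the notebook kernel is the safer check.

```python
from importlib.metadata import PackageNotFoundError, version

def installed_version(package):
    """Return the installed version string, or None if the package is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None

# Packages touched by the install cells in this notebook
for name in ["numpy", "transformers", "sentence-transformers", "huggingface-hub"]:
    print(f"{name}: {installed_version(name)}")
```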


In [43]:
import pandas as pd
import numpy as np
import spacy
import re
import os
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics.pairwise import cosine_similarity

### Load Resume and Job Posting Dataset

In [19]:
import pandas as pd

# Load resumes dataset
resumes_df = pd.read_csv(r'C:\Users\Gouthum\Downloads\PinnacleAiInternship\Dataset\Resume\Resume.csv')

# Load job postings dataset
jobs_df = pd.read_csv(r'C:\Users\Gouthum\Downloads\PinnacleAiInternship\jobdetails\postings.csv')

print(f"Resumes: {resumes_df.shape}")
print(f"Job Postings: {jobs_df.shape}")

# Display first few rows
print(resumes_df.head())
print(jobs_df.head())

Resumes: (2484, 4)
Job Postings: (123849, 31)
         ID                                         Resume_str  \
0  16852973           HR ADMINISTRATOR/MARKETING ASSOCIATE\...   
1  22323967           HR SPECIALIST, US HR OPERATIONS      ...   
2  33176873           HR DIRECTOR       Summary      Over 2...   
3  27018550           HR SPECIALIST       Summary    Dedica...   
4  17812897           HR MANAGER         Skill Highlights  ...   

                                         Resume_html Category  
0  <div class="fontsize fontface vmargins hmargin...       HR  
1  <div class="fontsize fontface vmargins hmargin...       HR  
2  <div class="fontsize fontface vmargins hmargin...       HR  
3  <div class="fontsize fontface vmargins hmargin...       HR  
4  <div class="fontsize fontface vmargins hmargin...       HR  
     job_id            company_name  \
0    921716   Corcoran Sawyer Smith   
1   1829192                     NaN   
2  10998357  The National Exemplar    
3  23221523  Abra
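
The load cell above uses absolute Windows paths, which only work on the author's machine. One possible way to make the paths portable is sketched below; the environment variable `RESUME_DATA_DIR` and the helper `dataset_paths` are hypothetical, and the folder layout is simply copied from the paths in the cell.

```python
import os
from pathlib import Path

def dataset_paths(base_dir=None):
    """Build dataset paths relative to a configurable base directory."""
    # Fall back to an environment variable, then to the current directory
    base = Path(base_dir or os.environ.get("RESUME_DATA_DIR", "."))
    return {
        "resumes": base / "Dataset" / "Resume" / "Resume.csv",
        "postings": base / "jobdetails" / "postings.csv",
    }

paths = dataset_paths("data")
print(paths["resumes"])
print(paths["postings"])
```

Each path can then be passed to `pd.read_csv` in place of the hard-coded string.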

### Resume.csv Dataset

In [20]:
resumes_df

| | ID | Resume_str | Resume_html | Category |
|---|---|---|---|---|
| 0 | 16852973 | HR ADMINISTRATOR/MARKETING ASSOCIATE\... | `<div class="fontsize fontface vmargins hmargin...` | HR |
| 1 | 22323967 | HR SPECIALIST, US HR OPERATIONS ... | `<div class="fontsize fontface vmargins hmargin...` | HR |
| 2 | 33176873 | HR DIRECTOR Summary Over 2... | `<div class="fontsize fontface vmargins hmargin...` | HR |
| 3 | 27018550 | HR SPECIALIST Summary Dedica... | `<div class="fontsize fontface vmargins hmargin...` | HR |
| 4 | 17812897 | HR MANAGER Skill Highlights ... | `<div class="fontsize fontface vmargins hmargin...` | HR |
| ... | ... | ... | ... | ... |
| 2479 | 99416532 | RANK: SGT/E-5 NON- COMMISSIONED OFFIC... | `<div class="fontsize fontface vmargins hmargin...` | AVIATION |
| 2480 | 24589765 | GOVERNMENT RELATIONS, COMMUNICATIONS ... | `<div class="fontsize fontface vmargins hmargin...` | AVIATION |
| 2481 | 31605080 | GEEK SQUAD AGENT Professional... | `<div class="fontsize fontface vmargins hmargin...` | AVIATION |
| 2482 | 21190805 | PROGRAM DIRECTOR / OFFICE MANAGER ... | `<div class="fontsize fontface vmargins hmargin...` | AVIATION |


In [21]:
resumes_df.head(5)

| | ID | Resume_str | Resume_html | Category |
|---|---|---|---|---|
| 0 | 16852973 | HR ADMINISTRATOR/MARKETING ASSOCIATE\... | `<div class="fontsize fontface vmargins hmargin...` | HR |
| 1 | 22323967 | HR SPECIALIST, US HR OPERATIONS ... | `<div class="fontsize fontface vmargins hmargin...` | HR |
| 2 | 33176873 | HR DIRECTOR Summary Over 2... | `<div class="fontsize fontface vmargins hmargin...` | HR |
| 3 | 27018550 | HR SPECIALIST Summary Dedica... | `<div class="fontsize fontface vmargins hmargin...` | HR |
| 4 | 17812897 | HR MANAGER Skill Highlights ... | `<div class="fontsize fontface vmargins hmargin...` | HR |


### postings.csv Dataset

In [29]:
jobs_df

Unnamed: 0,job_id,company_name,title,description,max_salary,pay_period,location,company_id,views,med_salary,...,skills_desc,listed_time,posting_domain,sponsored,work_type,currency,compensation_type,normalized_salary,zip_code,fips
0,921716,Corcoran Sawyer Smith,Marketing Coordinator,Job descriptionA leading real estate firm in N...,20.0,HOURLY,"Princeton, NJ",2774458.0,20.0,,...,Requirements: \n\nWe are seeking a College or ...,1.713398e+12,,0,FULL_TIME,USD,BASE_SALARY,38480.0,8540.0,34021.0
1,1829192,,Mental Health Therapist/Counselor,"At Aspen Therapy and Wellness , we are committ...",50.0,HOURLY,"Fort Collins, CO",,1.0,,...,,1.712858e+12,,0,FULL_TIME,USD,BASE_SALARY,83200.0,80521.0,8069.0
2,10998357,The National Exemplar,Assitant Restaurant Manager,The National Exemplar is accepting application...,65000.0,YEARLY,"Cincinnati, OH",64896719.0,8.0,,...,We are currently accepting resumes for FOH - A...,1.713278e+12,,0,FULL_TIME,USD,BASE_SALARY,55000.0,45202.0,39061.0
3,23221523,"Abrams Fensterman, LLP",Senior Elder Law / Trusts and Estates Associat...,Senior Associate Attorney - Elder Law / Trusts...,175000.0,YEARLY,"New Hyde Park, NY",766262.0,16.0,,...,This position requires a baseline understandin...,1.712896e+12,,0,FULL_TIME,USD,BASE_SALARY,157500.0,11040.0,36059.0
4,35982263,,Service Technician,Looking for HVAC service tech with experience ...,80000.0,YEARLY,"Burlington, IA",,3.0,,...,,1.713452e+12,,0,FULL_TIME,USD,BASE_SALARY,70000.0,52601.0,19057.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
123844,3906267117,Lozano Smith,Title IX/Investigations Attorney,Our Walnut Creek office is currently seeking a...,195000.0,YEARLY,"Walnut Creek, CA",56120.0,1.0,,...,,1.713571e+12,,0,FULL_TIME,USD,BASE_SALARY,157500.0,94595.0,6013.0
123845,3906267126,Pinterest,"Staff Software Engineer, ML Serving Platform",About Pinterest:\n\nMillions of people across ...,,,United States,1124131.0,3.0,,...,,1.713572e+12,www.pinterestcareers.com,0,FULL_TIME,,,,,
123846,3906267131,EPS Learning,"Account Executive, Oregon/Washington",Company Overview\n\nEPS Learning is a leading ...,,,"Spokane, WA",90552133.0,3.0,,...,,1.713572e+12,epsoperations.bamboohr.com,0,FULL_TIME,,,,99201.0,53063.0
123847,3906267195,Trelleborg Applied Technologies,Business Development Manager,The Business Development Manager is a 'hunter'...,,,"Texas, United States",2793699.0,4.0,,...,,1.713573e+12,,0,FULL_TIME,,,,,


In [23]:
jobs_df.head(5)

Unnamed: 0,job_id,company_name,title,description,max_salary,pay_period,location,company_id,views,med_salary,...,skills_desc,listed_time,posting_domain,sponsored,work_type,currency,compensation_type,normalized_salary,zip_code,fips
0,921716,Corcoran Sawyer Smith,Marketing Coordinator,Job descriptionA leading real estate firm in N...,20.0,HOURLY,"Princeton, NJ",2774458.0,20.0,,...,Requirements: \n\nWe are seeking a College or ...,1713398000000.0,,0,FULL_TIME,USD,BASE_SALARY,38480.0,8540.0,34021.0
1,1829192,,Mental Health Therapist/Counselor,"At Aspen Therapy and Wellness , we are committ...",50.0,HOURLY,"Fort Collins, CO",,1.0,,...,,1712858000000.0,,0,FULL_TIME,USD,BASE_SALARY,83200.0,80521.0,8069.0
2,10998357,The National Exemplar,Assitant Restaurant Manager,The National Exemplar is accepting application...,65000.0,YEARLY,"Cincinnati, OH",64896719.0,8.0,,...,We are currently accepting resumes for FOH - A...,1713278000000.0,,0,FULL_TIME,USD,BASE_SALARY,55000.0,45202.0,39061.0
3,23221523,"Abrams Fensterman, LLP",Senior Elder Law / Trusts and Estates Associat...,Senior Associate Attorney - Elder Law / Trusts...,175000.0,YEARLY,"New Hyde Park, NY",766262.0,16.0,,...,This position requires a baseline understandin...,1712896000000.0,,0,FULL_TIME,USD,BASE_SALARY,157500.0,11040.0,36059.0
4,35982263,,Service Technician,Looking for HVAC service tech with experience ...,80000.0,YEARLY,"Burlington, IA",,3.0,,...,,1713452000000.0,,0,FULL_TIME,USD,BASE_SALARY,70000.0,52601.0,19057.0


In [32]:
postings_path = r"C:\Users\Gouthum\Downloads\PinnacleAiInternship\jobdetails\postings.csv"

# Load the Job Postings CSV file into a Pandas DataFrame
postings_df = pd.read_csv(postings_path)

# Display the first few rows of the dataset to understand its structure
postings_df.head()

Unnamed: 0,job_id,company_name,title,description,max_salary,pay_period,location,company_id,views,med_salary,...,skills_desc,listed_time,posting_domain,sponsored,work_type,currency,compensation_type,normalized_salary,zip_code,fips
0,921716,Corcoran Sawyer Smith,Marketing Coordinator,Job descriptionA leading real estate firm in N...,20.0,HOURLY,"Princeton, NJ",2774458.0,20.0,,...,Requirements: \n\nWe are seeking a College or ...,1713398000000.0,,0,FULL_TIME,USD,BASE_SALARY,38480.0,8540.0,34021.0
1,1829192,,Mental Health Therapist/Counselor,"At Aspen Therapy and Wellness , we are committ...",50.0,HOURLY,"Fort Collins, CO",,1.0,,...,,1712858000000.0,,0,FULL_TIME,USD,BASE_SALARY,83200.0,80521.0,8069.0
2,10998357,The National Exemplar,Assitant Restaurant Manager,The National Exemplar is accepting application...,65000.0,YEARLY,"Cincinnati, OH",64896719.0,8.0,,...,We are currently accepting resumes for FOH - A...,1713278000000.0,,0,FULL_TIME,USD,BASE_SALARY,55000.0,45202.0,39061.0
3,23221523,"Abrams Fensterman, LLP",Senior Elder Law / Trusts and Estates Associat...,Senior Associate Attorney - Elder Law / Trusts...,175000.0,YEARLY,"New Hyde Park, NY",766262.0,16.0,,...,This position requires a baseline understandin...,1712896000000.0,,0,FULL_TIME,USD,BASE_SALARY,157500.0,11040.0,36059.0
4,35982263,,Service Technician,Looking for HVAC service tech with experience ...,80000.0,YEARLY,"Burlington, IA",,3.0,,...,,1713452000000.0,,0,FULL_TIME,USD,BASE_SALARY,70000.0,52601.0,19057.0


### Size of the Dataset

In [24]:
#Resume.CSV
print(f"Resumes: {resumes_df.shape}")

#postings.csv
print(f"Job Postings: {jobs_df.shape}")

Resumes: (2484, 4)
Job Postings: (123849, 31)


### Statistics of Resume.csv

In [25]:
resumes_df.describe()

| | ID |
|---|---|
| count | 2484.0 |
| mean | 31826160.0 |
| std | 21457350.0 |
| min | 3547447.0 |
| 25% | 17544300.0 |
| 50% | 25210310.0 |
| 75% | 36114440.0 |
| max | 99806120.0 |


### Statistics of postings.csv

In [26]:
jobs_df.describe()

Unnamed: 0,job_id,max_salary,company_id,views,med_salary,min_salary,applies,original_listed_time,remote_allowed,expiry,closed_time,listed_time,sponsored,normalized_salary,zip_code,fips
count,123849.0,29793.0,122132.0,122160.0,6280.0,29793.0,23320.0,123849.0,15246.0,123849.0,1073.0,123849.0,123849.0,36073.0,102977.0,96434.0
mean,3896402000.0,91939.42,12204010.0,14.618247,22015.619876,64910.85,10.591981,1713152000000.0,1.0,1716213000000.0,1712928000000.0,1713204000000.0,0.0,205327.0,50400.491887,28713.879887
std,84043550.0,701110.1,25541430.0,85.903598,52255.873846,495973.8,29.047395,484820900.0,0.0,2321394000.0,362289300.0,398912200.0,0.0,5097627.0,30252.232515,16015.929825
min,921716.0,1.0,1009.0,1.0,0.0,1.0,1.0,1701811000000.0,1.0,1712903000000.0,1712346000000.0,1711317000000.0,0.0,0.0,1001.0,1003.0
25%,3894587000.0,48.28,14352.0,3.0,18.94,37.0,1.0,1712863000000.0,1.0,1715481000000.0,1712670000000.0,1712886000000.0,0.0,52000.0,24112.0,13121.0
50%,3901998000.0,80000.0,226965.0,4.0,25.5,60000.0,3.0,1713395000000.0,1.0,1716042000000.0,1712670000000.0,1713408000000.0,0.0,81500.0,48059.0,29183.0
75%,3904707000.0,140000.0,8047188.0,8.0,2510.5,100000.0,8.0,1713478000000.0,1.0,1716088000000.0,1713283000000.0,1713484000000.0,0.0,125000.0,78201.0,42077.0
max,3906267000.0,120000000.0,103473000.0,9975.0,750000.0,85000000.0,967.0,1713573000000.0,1.0,1729125000000.0,1713562000000.0,1713573000000.0,0.0,535600000.0,99901.0,56045.0


# Random Sampling

In [53]:
print(type(jobs_df))
print(type(resumes_df))

<class 'pandas.core.frame.DataFrame'>
<class 'pandas.core.frame.DataFrame'>


In [55]:
print("Jobs df rows:", jobs_df.shape[0])
print("Resumes df rows:", resumes_df.shape[0])

Jobs df rows: 123849
Resumes df rows: 2484


In [57]:
postings_sample_df = jobs_df.sample(2000, random_state=42).copy()
resume_sample_df = resumes_df.sample(2000, random_state=42).copy()

print('Job Postings Sample Shape:', postings_sample_df.shape)
print('Resume Sample Shape:', resume_sample_df.shape)

Job Postings Sample Shape: (2000, 31)
Resume Sample Shape: (2000, 4)


In [159]:
postings_sample_df.to_csv(r"C:\Users\Gouthum\Downloads\PinnacleAiInternship\postings_sample.csv", index=False)
resume_sample_df.to_csv(r"C:\Users\Gouthum\Downloads\PinnacleAiInternship\resume_sample.csv", index=False)
print("Saved Sample File Successfully")

Saved Sample File Successfully


In [58]:
# Placeholder for dynamic resume and JD input
uploaded_resume = None
uploaded_jd = None

# Data Preprocessing
This stage prepares the raw text of job postings and resumes (job descriptions and resume text) for modeling by cleaning and normalizing it.

🔧 Tasks Performed:

* **Handle Missing Values:** Missing values in key text fields, such as the job `description` column and the resume text, are dropped or filled.
* **Text Cleaning & Normalization:** Preprocessing is applied to ensure consistency and remove noise:
  * Lowercasing all text
  * Removing punctuation and special characters (the `[^\w\s]` pattern used in `preprocess_text` keeps digits)
  * Removing extra whitespace
  * Tokenization (breaking text into words)
  * Stopword removal (removing words like "and", "the", etc.)
  * Lemmatization (converting words to their base form)

#### Preprocess and Extract Features from the Job Description Dataset
This process involves:

* Handling missing values.
* Normalizing text fields (e.g., lemmatizing with spaCy, converting to lowercase).
* Extracting structured features (skills, domains) using predefined lists.
* Creating a single string representation of each job posting and resume, enhanced with extracted skills.
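
The last step, a single string per posting or resume enhanced with extracted skills, could look roughly like the sketch below. The exact format the notebook uses is not shown in this section, so the helper name `build_profile_text` and the `skills:` suffix are assumptions made for illustration.

```python
def build_profile_text(lemmatized_text, skills):
    """Append the de-duplicated skill list to the cleaned text (hypothetical format)."""
    unique_skills = sorted(set(skills))
    if not unique_skills:
        return lemmatized_text
    return f"{lemmatized_text} skills: {' '.join(unique_skills)}"

print(build_profile_text("looking python developer", ["python", "aws", "python"]))
```

Sorting the skills keeps the resulting string stable across runs, which makes cached embeddings easier to compare.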

In [60]:
import re
import string
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

In [61]:
nltk.download('stopwords')
nltk.download('wordnet')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Gouthum/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Gouthum/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [62]:
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

In [63]:
stop_words

{'a',
 'about',
 'above',
 'after',
 'again',
 'against',
 'ain',
 'all',
 'am',
 'an',
 'and',
 'any',
 'are',
 'aren',
 "aren't",
 'as',
 'at',
 'be',
 'because',
 'been',
 'before',
 'being',
 'below',
 'between',
 'both',
 'but',
 'by',
 'can',
 'couldn',
 "couldn't",
 'd',
 'did',
 'didn',
 "didn't",
 'do',
 'does',
 'doesn',
 "doesn't",
 'doing',
 'don',
 "don't",
 'down',
 'during',
 'each',
 'few',
 'for',
 'from',
 'further',
 'had',
 'hadn',
 "hadn't",
 'has',
 'hasn',
 "hasn't",
 'have',
 'haven',
 "haven't",
 'having',
 'he',
 "he'd",
 "he'll",
 "he's",
 'her',
 'here',
 'hers',
 'herself',
 'him',
 'himself',
 'his',
 'how',
 'i',
 "i'd",
 "i'll",
 "i'm",
 "i've",
 'if',
 'in',
 'into',
 'is',
 'isn',
 "isn't",
 'it',
 "it'd",
 "it'll",
 "it's",
 'its',
 'itself',
 'just',
 'll',
 'm',
 'ma',
 'me',
 'mightn',
 "mightn't",
 'more',
 'most',
 'mustn',
 "mustn't",
 'my',
 'myself',
 'needn',
 "needn't",
 'no',
 'nor',
 'not',
 'now',
 'o',
 'of',
 'off',
 'on',
 'once',
 'on

In [64]:
lemmatizer

<WordNetLemmatizer>

#### Preprocess_text

In [83]:
import re
import pandas as pd
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

def preprocess_text(text):
    if pd.isna(text):
        return {'original': [], 'lemmatized': ''}
    text = text.lower()
    text = re.sub(r'[^\w\s]', '', text)  # remove punctuation
    tokens = word_tokenize(text)
    original_terms = tokens
    lemmatized_terms = [lemmatizer.lemmatize(token) for token in tokens if token not in stop_words]
    lemmatized = ' '.join(lemmatized_terms)
    return {'original': original_terms, 'lemmatized': lemmatized}

text = "Looking for a python developer with experience in AWS and docker, wanted 5+years expereince in healthcare', 'finance', 'tech'."

result = preprocess_text(text)
print(result)

{'original': ['looking', 'for', 'a', 'python', 'developer', 'with', 'experience', 'in', 'aws', 'and', 'docker', 'wanted', '5years', 'expereince', 'in', 'healthcare', 'finance', 'tech'], 'lemmatized': 'looking python developer experience aws docker wanted 5years expereince healthcare finance tech'}


In [84]:
text

"Looking for a python developer with experience in AWS and docker, wanted 5+years expereince in healthcare', 'finance', 'tech'."

In [87]:
import pandas as pd
import re
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# Setup
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

def preprocess_text(text):
    if pd.isna(text):
        return {'original': [], 'lemmatized': ''}
    text = text.lower()
    text = re.sub(r'[^\w\s]', '', text)  # remove punctuation
    tokens = word_tokenize(text)
    original_terms = tokens
    lemmatized_terms = [lemmatizer.lemmatize(token) for token in tokens if token not in stop_words]
    lemmatized = ' '.join(lemmatized_terms)
    return {'original': original_terms, 'lemmatized': lemmatized}

In [88]:
text = "Looking for a python developer with experience in AWS and docker, wanted 5+years expereince in healthcare', 'finance', 'tech'."
result = preprocess_text(text)

print("Original tokens:", result['original'])
print("Lemmatized:", result['lemmatized'])


Original tokens: ['looking', 'for', 'a', 'python', 'developer', 'with', 'experience', 'in', 'aws', 'and', 'docker', 'wanted', '5years', 'expereince', 'in', 'healthcare', 'finance', 'tech']
Lemmatized: looking python developer experience aws docker wanted 5years expereince healthcare finance tech


### Extract Skills

In [95]:
# Example SKILLS list
SKILLS = ['python', 'aws', 'docker', 'java', 'sql']

# Example processed text from a resume/job description
text_data = {
    'original': ['looking', 'for', 'a', 'python', 'developer', 'with', 'experience', 'in', 'aws', 'and', 'docker'],
    'lemmatized': 'looking python developer experience aws docker'
}

def extract_skills(text_data):
    # Non-dict input (NaN, None) has no tokens to search
    if not isinstance(text_data, dict):
        return []
    original_terms = text_data['original']
    # Substring test: a skill matches if it appears inside any token
    return [skill for skill in SKILLS if any(skill in term for term in original_terms)]

# Call the function
extracted = extract_skills(text_data)
print("Extracted Skills:", extracted)


Extracted Skills: ['python', 'aws', 'docker']


### Extract Domains

In [105]:
# Example dictionary (like one processed from resume text)
text_data = {
    'original': ['looking', 'for', 'python', 'developer', 'aws', 'docker', 'kubernetes', 'cloud'],
    'lemmatized': 'looking python developer aws docker kubernetes cloud'
}


DOMAINS = ['tech', 'healthcare', 'finance', 'education', 'business']


# Define extract_domains: keyword match on tokens, plus domains implied by skills
def extract_domains(text_data, skills=None):
    if not isinstance(text_data, dict):  # also covers NaN and None
        return []
    original_terms = text_data['original']
    # Substring test: 'tech' also matches tokens such as 'technical'
    domains = [domain for domain in DOMAINS if any(domain in term for term in original_terms)]
    if skills:
        if 'aws' in skills or 'kubernetes' in skills or 'docker' in skills:
            domains.append('tech')
        if 'management' in skills or 'leadership' in skills:
            domains.append('business')
    return list(set(domains))  # Remove duplicates


domains = extract_domains(text_data)


# Print the extracted result, not the DOMAINS keyword list
print("Extracted Domains:", domains)


Extracted Domains: []
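
The skill-based rules inside `extract_domains` can also be written as a lookup table, which keeps the function short as more skills are added. This is a sketch under that assumption: `SKILL_TO_DOMAIN` and `infer_domains` are hypothetical names, and the mapping only restates the two rules in the function above.

```python
# Hypothetical lookup table restating the hard-coded rules in extract_domains.
SKILL_TO_DOMAIN = {
    "aws": "tech",
    "kubernetes": "tech",
    "docker": "tech",
    "management": "business",
    "leadership": "business",
}

DOMAINS = ["tech", "healthcare", "finance", "education", "business"]

def infer_domains(tokens, skills=None):
    """Return a sorted, de-duplicated list of domains from tokens and skills."""
    # Same substring test as extract_domains: 'tech' also matches 'technical'.
    found = {d for d in DOMAINS if any(d in tok for tok in tokens)}
    # Add the domains implied by the extracted skills.
    found.update(SKILL_TO_DOMAIN[s] for s in (skills or []) if s in SKILL_TO_DOMAIN)
    return sorted(found)  # sorted so the order is stable between runs

tokens = ["looking", "for", "python", "developer", "aws", "docker", "kubernetes", "cloud"]
print(infer_domains(tokens))                     # no skills passed: []
print(infer_domains(tokens, ["aws", "docker"]))  # skills imply tech: ['tech']
```

Without a `skills` argument the skill-based rules never fire, so the sample tokens yield no domain even though they mention `aws` and `docker`.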


### Apply preprocessing to job postings


In [106]:
postings_sample_df['processed_desc'] = postings_sample_df['description'].apply(preprocess_text)
postings_sample_df['job_skills'] = postings_sample_df['processed_desc'].apply(extract_skills)
postings_sample_df['job_domain'] = postings_sample_df.apply(lambda x: extract_domains(x['processed_desc'], x['job_skills']), axis=1)

In [107]:
postings_sample_df['processed_desc']

73989     {'original': ['the', 'senior', 'automation', '...
59308     {'original': ['company', 'summary', 'dish', 'a...
44663     {'original': ['division', 'north', 'alabama', ...
81954     {'original': ['kmgh', 'the', 'ew', 'scripps', ...
113151    {'original': ['come', 'for', 'the', 'flexibili...
                                ...                        
74903     {'original': ['rd', 'staff', 'chemist', 'coati...
11601     {'original': ['about', 'the', 'companyour', 'c...
53215     {'original': ['medpro', 'healthcare', 'staffin...
50579     {'original': ['overview', 'systems', 'planning...
66711     {'original': ['overview', 'financial', 'specia...
Name: processed_desc, Length: 2000, dtype: object

In [108]:
postings_sample_df['job_skills']

73989        []
59308        []
44663        []
81954     [aws]
113151    [aws]
          ...  
74903        []
11601        []
53215        []
50579        []
66711        []
Name: job_skills, Length: 2000, dtype: object
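
The `aws` tags above may include false positives. One plausible cause, assuming the substring test `skill in term` from the first `extract_skills` definition was in effect, is that `aws` also occurs inside ordinary words such as `laws`. The sketch below contrasts the substring test with an exact token test on made-up tokens; the function names are illustrative.

```python
SKILLS = ["python", "aws", "docker", "java", "sql"]

def skills_by_substring(tokens):
    """Substring test: a skill counts if it appears inside any token."""
    return [s for s in SKILLS if any(s in tok for tok in tokens)]

def skills_by_token(tokens):
    """Exact test: a skill counts only if it equals a whole token."""
    token_set = set(tokens)
    return [s for s in SKILLS if s in token_set]

# Made-up tokens that mention none of the skills as whole words.
tokens = ["nurse", "follows", "state", "laws", "and", "javascript", "free", "forms"]
print(skills_by_substring(tokens))  # ['aws', 'java']
print(skills_by_token(tokens))      # []
```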

In [109]:
postings_sample_df['job_domain']

73989      [business, tech]
59308      [business, tech]
44663     [tech, education]
81954      [business, tech]
113151               [tech]
                ...        
74903                [tech]
11601       [tech, finance]
53215          [healthcare]
50579      [business, tech]
66711                    []
Name: job_domain, Length: 2000, dtype: object

### Apply preprocessing to resumes (using Resume_str as the main text field)


In [112]:
resume_sample_df['processed_resume'] = resume_sample_df['Resume_str'].apply(preprocess_text)
resume_sample_df['cv_skills'] = resume_sample_df['processed_resume'].apply(extract_skills)
resume_sample_df['cv_domain'] = resume_sample_df.apply(lambda x: extract_domains(x['processed_resume'], x['cv_skills']), axis=1)

In [113]:
resume_sample_df['processed_resume']

420     {'original': ['kpandipou', 'koffi', 'summary',...
1309    {'original': ['director', 'of', 'digital', 'tr...
2023    {'original': ['senior', 'project', 'manager', ...
1360    {'original': ['chef', 'summary', 'experienced'...
2186    {'original': ['operations', 'manager', 'summar...
                              ...                        
750     {'original': ['claims', 'service', 'specialist...
657     {'original': ['business', 'development', 'cons...
1866    {'original': ['senior', 'accountant', 'summary...
234     {'original': ['information', 'technology', 'sp...
732     {'original': ['customer', 'service', 'represen...
Name: processed_resume, Length: 2000, dtype: object

In [114]:
resume_sample_df['cv_skills']

420     []
1309    []
2023    []
1360    []
2186    []
        ..
750     []
657     []
1866    []
234     []
732     []
Name: cv_skills, Length: 2000, dtype: object

In [115]:
resume_sample_df['cv_domain']

420                 [business, education]
1309          [business, tech, education]
2023       [business, education, finance]
1360                                   []
2186          [business, tech, education]
                      ...                
750     [healthcare, business, education]
657           [business, tech, education]
1866                          [education]
234           [business, tech, education]
732         [healthcare, tech, education]
Name: cv_domain, Length: 2000, dtype: object

In [119]:
print("Extracted skills:", resume_sample_df['cv_skills'].iloc[0])


Extracted skills: []


In [118]:
for i in range(1):
    print("Resume:", resume_sample_df['Resume_str'].iloc[i])
    print("Tokens:", resume_sample_df['processed_resume'].iloc[i]['original'])
    print("Extracted skills:", resume_sample_df['cv_skills'].iloc[i])
    print("------")


Resume:            Kpandipou    Koffi         Summary      Compassionate teaching professional delivering exemplary support and assistance to teachers and students. Display exceptional Communication and problem solving skills.  Experience in office administration and public speaking. Attentive and adaptable, skilled in management of classroom operations. Effective in leveraging student feedback to create dynamic lesson plans that address individual strengths and weaknesses.  Dedicated and responsive team leader with proven skills in classroom management, behavior modification and individualized support.  Personable with experience using relationship-building to cultivate positive client, staff and management connections. Highly-developed communicator with outstanding skills in complex problem-solving and conflict resolution.  High-performing Administrative Assistant offering experience working with diverse client base and delivering exceptional results. Polished in managing client rela

In [120]:
def extract_skills(text_data):
    if pd.isna(text_data) or not isinstance(text_data, dict):
        return []
    original_terms = [term.lower() for term in text_data['original']]
    return [skill for skill in SKILLS if skill in original_terms]


In [122]:
skills_employe = extract_skills(text_data)


print("Extracted skills:", skills_employe)

Extracted skills: ['python', 'aws', 'docker']


In [123]:
def extract_skills(text_data):
    if not isinstance(text_data, dict):  # also covers NaN and None
        return []
    # A set makes each membership test O(1); matching is exact, per token
    original_terms = {term.lower() for term in text_data['original']}
    return [skill for skill in SKILLS if skill in original_terms]

In [124]:
skills_employe = extract_skills(text_data)


print("Extracted skills:", skills_employe)

Extracted skills: ['python', 'aws', 'docker']


### Use NLTK for basic tokenization and lemmatization

In [73]:
import nltk
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
import re

# Download required nltk data once
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('omw-1.4')

lemmatizer = WordNetLemmatizer()

def normalize_and_lemmatize(text):
    text = text.lower()
    text = re.sub(r'[^\w\s]', '', text)
    tokens = word_tokenize(text)
    lemmas = [lemmatizer.lemmatize(token) for token in tokens]
    return lemmas

text = "Looking for a python developer with experience in AWS and docker, wanted 5+years expereince in healthcare', 'finance', 'tech'."
lemmas = normalize_and_lemmatize(text)

SKILLS = ['javascript', 'node.js', 'aws', 'kubernetes', 'golang', 'ruby', 'python', 'sql', 'java', 
          'docker', 'html', 'management', 'engineering', 'marketing', 'design', 'sales', 'software', 
          'development', 'communication', 'leadership', 'installation', 'technical', 'automation', 'power systems']

DOMAINS = ['healthcare', 'finance', 'tech', 'education', 'manufacturing', 'retail', 'sales', 
           'construction', 'hospitality', 'engineering', 'legal', 'marketing', 'government']

matched_skills = [skill for skill in SKILLS if skill in lemmas]
matched_domains = [domain for domain in DOMAINS if domain in lemmas]

print("Skills found:", matched_skills)
print("Domains found:", matched_domains)

Skills found: ['aws', 'python', 'docker']
Domains found: ['healthcare', 'finance', 'tech']
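
Matching against a list of single-word lemmas cannot find multi-word entries such as `power systems`, because no single token equals the whole phrase. Here is a minimal sketch that also checks adjacent word pairs (bigrams); `extract_skills_ngrams` and the short skill list are assumptions for illustration.

```python
def extract_skills_ngrams(tokens, skills, max_n=2):
    """Match skills against single tokens and n-grams of up to max_n tokens."""
    tokens = [t.lower() for t in tokens]
    candidates = set()
    for n in range(1, max_n + 1):
        # Join every run of n consecutive tokens with a single space.
        for i in range(len(tokens) - n + 1):
            candidates.add(" ".join(tokens[i:i + n]))
    return [s for s in skills if s in candidates]

skills = ["python", "aws", "power systems"]
tokens = ["senior", "automation", "engineer", "power", "systems", "python"]
print(extract_skills_ngrams(tokens, skills))  # ['python', 'power systems']
```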




### Simple Python text processing (no spaCy)

In [71]:
import re

# Sample text (e.g., job description or resume text)
text = "Looking for a python developer with experience in AWS and docker healthcare in finance."

# Normalize text: lowercase and remove punctuation
def normalize_text(text):
    text = text.lower()
    text = re.sub(r'[^\w\s]', '', text)  # remove punctuation
    return text

normalized_text = normalize_text(text)

# Define keywords
SKILLS = ['javascript', 'node.js', 'aws', 'kubernetes', 'go lang', 'ruby', 'python', 'sql', 'java', 
          'docker', 'html', 'management', 'engineering', 'marketing', 'design', 'sales', 'software', 
          'development', 'communication', 'leadership', 'installation', 'technical', 'automation', 'power systems']

DOMAINS = ['healthcare', 'finance', 'tech', 'education', 'manufacturing', 'retail', 'sales', 
           'construction', 'hospitality', 'engineering', 'legal', 'marketing', 'government']

# Extract matched keywords from text
matched_skills = [skill for skill in SKILLS if skill in normalized_text]
matched_domains = [domain for domain in DOMAINS if domain in normalized_text]

print("Skills found:", matched_skills)
print("Domains found:", matched_domains)


Skills found: ['aws', 'python', 'docker']
Domains found: ['healthcare', 'finance']
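
Because `skill in normalized_text` is a substring test on the whole string, `java` would also match text that only mentions `javascript`. The following sketch shows whole-word matching with regular expressions. `find_keywords` is a hypothetical helper, and lookarounds are used instead of `\b` so that multi-word entries such as `go lang` are handled the same way.

```python
import re

def find_keywords(text, keywords):
    """Return the keywords that occur in text as whole words or phrases."""
    found = []
    for kw in keywords:
        # (?<!\w) and (?!\w) require a non-word character (or the string
        # edge) on both sides, so 'java' does not match inside 'javascript'.
        pattern = r"(?<!\w)" + re.escape(kw) + r"(?!\w)"
        if re.search(pattern, text):
            found.append(kw)
    return found

text = "looking for a javascript developer with experience in aws and go lang"
print(find_keywords(text, ["java", "javascript", "aws", "go lang"]))
# ['javascript', 'aws', 'go lang']
```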


#### Combine data into a single DataFrame

In [129]:

job_data = postings_sample_df[['job_id', 'title', 'processed_desc', 'job_skills', 'job_domain']].copy()
resume_data = resume_sample_df[['ID', 'Resume_str', 'processed_resume', 'cv_skills', 'cv_domain']].copy()
combined_data = pd.concat([job_data.assign(type='job'), resume_data.assign(type='resume')], ignore_index=True)

In [130]:
job_data

| | job_id | title | processed_desc | job_skills | job_domain |
| --- | --- | --- | --- | --- | --- |
| 73989 | 3902944011 | Senior Automation Engineer - Power Systems | {'original': ['the', 'senior', 'automation', '... | [] | [business, tech] |
| 59308 | 3901960222 | DISH Installation Technician - Field | {'original': ['company', 'summary', 'dish', 'a... | [] | [business, tech] |
| 44663 | 3900944095 | Order Builder | {'original': ['division', 'north', 'alabama', ... | [] | [tech, education] |
| 81954 | 3903878594 | Mountain Multimedia Journalist, KMGH | {'original': ['kmgh', 'the', 'ew', 'scripps', ... | [aws] | [business, tech] |
| 113151 | 3905670593 | Licensed Practical Nurse (LPN) | {'original': ['come', 'for', 'the', 'flexibili... | [aws] | [tech] |
| ... | ... | ... | ... | ... | ... |
| 74903 | 3903442844 | R & D Chemist - Coatings | {'original': ['rd', 'staff', 'chemist', 'coati... | [] | [tech] |
| 11601 | 3887496577 | Senior Accounting Manager | {'original': ['about', 'the', 'companyour', 'c... | [] | [tech, finance] |
| 53215 | 3901638416 | Med-Surg Registered Nurse | {'original': ['medpro', 'healthcare', 'staffin... | [] | [healthcare] |
| 50579 | 3901374545 | Intern - Chemist/Biologist | {'original': ['overview', 'systems', 'planning... | [] | [business, tech] |


In [131]:
resume_data

| | ID | Resume_str | processed_resume | cv_skills | cv_domain |
| --- | --- | --- | --- | --- | --- |
| 420 | 99244405 | Kpandipou Koffi Summary ... | {'original': ['kpandipou', 'koffi', 'summary',... | [] | [business, education] |
| 1309 | 17562754 | DIRECTOR OF DIGITAL TRANSFORMATION ... | {'original': ['director', 'of', 'digital', 'tr... | [] | [business, tech, education] |
| 2023 | 30311725 | SENIOR PROJECT MANAGER Professi... | {'original': ['senior', 'project', 'manager', ... | [] | [business, education, finance] |
| 1360 | 19007667 | CHEF Summary Experienced ca... | {'original': ['chef', 'summary', 'experienced'... | [] | [] |
| 2186 | 11065180 | OPERATIONS MANAGER Summary E... | {'original': ['operations', 'manager', 'summar... | [] | [business, tech, education] |
| ... | ... | ... | ... | ... | ... |
| 750 | 23918545 | CLAIMS SERVICE SPECIALIST Pro... | {'original': ['claims', 'service', 'specialist... | [] | [healthcare, business, education] |
| 657 | 91467795 | BUSINESS DEVELOPMENT CONSULTANT ... | {'original': ['business', 'development', 'cons... | [] | [business, tech, education] |
| 1866 | 25862026 | SENIOR ACCOUNTANT Summary 8+... | {'original': ['senior', 'accountant', 'summary... | [] | [education] |
| 234 | 27372171 | INFORMATION TECHNOLOGY SPECIALIST/SYS... | {'original': ['information', 'technology', 'sp... | [] | [business, tech, education] |


In [132]:
combined_data

| | job_id | title | processed_desc | job_skills | job_domain | type | ID | Resume_str | processed_resume | cv_skills | cv_domain |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 3.902944e+09 | Senior Automation Engineer - Power Systems | {'original': ['the', 'senior', 'automation', '... | [] | [business, tech] | job | | | | | |
| 1 | 3.901960e+09 | DISH Installation Technician - Field | {'original': ['company', 'summary', 'dish', 'a... | [] | [business, tech] | job | | | | | |
| 2 | 3.900944e+09 | Order Builder | {'original': ['division', 'north', 'alabama', ... | [] | [tech, education] | job | | | | | |
| 3 | 3.903879e+09 | Mountain Multimedia Journalist, KMGH | {'original': ['kmgh', 'the', 'ew', 'scripps', ... | [aws] | [business, tech] | job | | | | | |
| 4 | 3.905671e+09 | Licensed Practical Nurse (LPN) | {'original': ['come', 'for', 'the', 'flexibili... | [aws] | [tech] | job | | | | | |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 3995 | | | | | | resume | 23918545.0 | CLAIMS SERVICE SPECIALIST Pro... | {'original': ['claims', 'service', 'specialist... | [] | [healthcare, business, education] |
| 3996 | | | | | | resume | 91467795.0 | BUSINESS DEVELOPMENT CONSULTANT ... | {'original': ['business', 'development', 'cons... | [] | [business, tech, education] |
| 3997 | | | | | | resume | 25862026.0 | SENIOR ACCOUNTANT Summary 8+... | {'original': ['senior', 'accountant', 'summary... | [] | [education] |
| 3998 | | | | | | resume | 27372171.0 | INFORMATION TECHNOLOGY SPECIALIST/SYS... | {'original': ['information', 'technology', 'sp... | [] | [business, tech, education] |


In [150]:
print(combined_data[['type', 'job_skills', 'job_domain']].head(15))

   type        job_skills                       job_domain
0   job                []                 [business, tech]
1   job                []                 [business, tech]
2   job                []                [tech, education]
3   job             [aws]                 [business, tech]
4   job             [aws]                           [tech]
5   job                []                       [business]
6   job             [aws]                [tech, education]
7   job                []                           [tech]
8   job  [aws, java, sql]                           [tech]
9   job                []                               []
10  job                []                           [tech]
11  job                []                 [business, tech]
12  job                []  [healthcare, business, finance]
13  job                []    [healthcare, tech, education]
14  job                []                               []
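
In `combined_data`, `job_id` and `ID` appear as floats because `pd.concat` fills the columns that a frame lacks with NaN, which forces integer columns to a float dtype. One way to avoid this, sketched here on toy rows, is to rename both frames to a shared schema before concatenating. The column names `doc_id`, `text`, `skills`, and `domain` are assumptions and do not come from the notebook.

```python
import pandas as pd

# Toy stand-ins for job_data and resume_data (text values are made up).
job_data = pd.DataFrame({
    "job_id": [3902944011], "processed_desc": ["senior automation engineer"],
    "job_skills": [["aws"]], "job_domain": [["tech"]],
})
resume_data = pd.DataFrame({
    "ID": [99244405], "processed_resume": ["teaching professional"],
    "cv_skills": [[]], "cv_domain": [["education"]],
})

# Map each frame onto the same four columns so no NaN padding is needed.
jobs = job_data.rename(columns={"job_id": "doc_id", "processed_desc": "text",
                                "job_skills": "skills", "job_domain": "domain"})
resumes = resume_data.rename(columns={"ID": "doc_id", "processed_resume": "text",
                                      "cv_skills": "skills", "cv_domain": "domain"})

combined = pd.concat([jobs.assign(type="job"), resumes.assign(type="resume")],
                     ignore_index=True)
print(combined.dtypes["doc_id"])  # an integer dtype, not float
print(combined)
```

The `type` column still distinguishes jobs from resumes, so later steps can filter on it instead of checking which columns are empty.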


In [67]:
pip install --upgrade numpy h5py

Collecting numpy
  Using cached numpy-2.2.6-cp310-cp310-win_amd64.whl.metadata (60 kB)
Using cached numpy-2.2.6-cp310-cp310-win_amd64.whl (12.9 MB)
Installing collected packages: numpy
  Attempting uninstall: numpy
    Found existing installation: numpy 1.24.3
    Uninstalling numpy-1.24.3:
      Successfully uninstalled numpy-1.24.3
Successfully installed numpy-2.2.6
Note: you may need to restart the kernel to use updated packages.


ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
datasets 4.0.0 requires huggingface-hub>=0.24.0, but you have huggingface-hub 0.14.1 which is incompatible.
streamlit 1.45.1 requires packaging<25,>=20, but you have packaging 25.0 which is incompatible.
tensorflow-intel 2.13.0 requires numpy<=1.24.3,>=1.22, but you have numpy 2.2.6 which is incompatible.
tensorflow-intel 2.13.0 requires typing-extensions<4.6.0,>=3.6.6, but you have typing-extensions 4.14.1 which is incompatible.
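
The conflict report says that `tensorflow-intel 2.13.0` requires `numpy<=1.24.3,>=1.22`, while the upgrade installed `numpy 2.2.6`. Below is a small sketch of checking such a range before upgrading, limited to plain dotted versions. `parse_version` and `in_range` are hypothetical helpers; a real project would more likely rely on the `packaging` library.

```python
def parse_version(version):
    """Turn a plain dotted version such as '1.24.3' into a tuple of ints."""
    return tuple(int(part) for part in version.split("."))

def in_range(version, minimum, maximum):
    """True if minimum <= version <= maximum (plain dotted versions only)."""
    parts = [parse_version(v) for v in (minimum, version, maximum)]
    width = max(len(p) for p in parts)
    # Pad with zeros so that '1.22' compares as (1, 22, 0).
    low, current, high = (p + (0,) * (width - len(p)) for p in parts)
    return low <= current <= high

# Range taken from the pip error message above.
print(in_range("1.24.3", "1.22", "1.24.3"))  # True
print(in_range("2.2.6", "1.22", "1.24.3"))   # False
```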


In [68]:
!python -m spacy download en_core_web_sm

Traceback (most recent call last):
  File "<frozen runpy>", line 189, in _run_module_as_main
  File "<frozen runpy>", line 148, in _get_module_details
  File "<frozen runpy>", line 112, in _get_module_details
  File "C:\Users\Gouthum\AppData\Roaming\Python\Python312\site-packages\spacy\__init__.py", line 6, in <module>
  File "C:\Users\Gouthum\AppData\Roaming\Python\Python312\site-packages\spacy\errors.py", line 3, in <module>
    from .compat import Literal
  File "C:\Users\Gouthum\AppData\Roaming\Python\Python312\site-packages\spacy\compat.py", line 4, in <module>
    from thinc.util import copy_array
  File "C:\Users\Gouthum\AppData\Roaming\Python\Python312\site-packages\thinc\__init__.py", line 5, in <module>
    from .config import registry
  File "C:\Users\Gouthum\AppData\Roaming\Python\Python312\site-packages\thinc\config.py", line 5, in <module>
    from .types import Decorator
  File "C:\Users\Gouthum\AppData\Roaming\Python\Python312\site-packages\thinc\types.py", line 27, in 

In [69]:
# Load spaCy model
import spacy

nlp = spacy.load('en_core_web_sm')

OSError: [E050] Can't find model 'en_core_web_sm'. It doesn't seem to be a Python package or a valid path to a data directory.
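
Since `spacy.load` raises `OSError` when the model is not installed, a pipeline can guard the call and fall back to a simpler tokenizer, which is roughly the role the NLTK and plain-Python cells play in this notebook. This is a hedged sketch: `load_tokenizer` and the regex fallback are illustrative assumptions.

```python
import re

def load_tokenizer(model_name="en_core_web_sm"):
    """Return a function that maps text to lowercase word tokens.

    Uses spaCy when both the package and the model load cleanly,
    otherwise falls back to a regex tokenizer.
    """
    try:
        import spacy  # may be missing or broken by version conflicts
        nlp = spacy.load(model_name)  # raises OSError if the model is absent
    except Exception:  # broad on purpose: any load failure triggers the fallback
        def regex_tokenize(text):
            return re.findall(r"\w+", text.lower())
        return regex_tokenize

    def spacy_tokenize(text):
        return [tok.text.lower() for tok in nlp(text)
                if not tok.is_punct and not tok.is_space]
    return spacy_tokenize

tokenize = load_tokenizer()
print(tokenize("Python developer with AWS and Docker experience."))
```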

In [153]:
pip install sentence-transformers==4.1.0 transformers==4.51.3 huggingface-hub==0.31.1




In [154]:
pip show transformers

Name: transformers
Version: 4.51.3
Summary: State-of-the-art Machine Learning for JAX, PyTorch and TensorFlow
Home-page: https://github.com/huggingface/transformers
Author: The Hugging Face team (past and future) with the help of all our contributors (https://github.com/huggingface/transformers/graphs/contributors)
Author-email: transformers@huggingface.co
License: Apache 2.0 License
Location: c:\users\gouthum\.conda\envs\myenv\lib\site-packages
Requires: filelock, huggingface-hub, numpy, packaging, pyyaml, regex, requests, safetensors, tokenizers, tqdm
Required-by: sentence-transformers
Note: you may need to restart the kernel to use updated packages.


In [155]:
pip show sentence-transformers

Name: sentence-transformers
Version: 4.1.0
Summary: Embeddings, Retrieval, and Reranking
Home-page: https://www.SBERT.net
Author: 
Author-email: Nils Reimers <info@nils-reimers.de>, Tom Aarsen <tom.aarsen@huggingface.co>
License: Apache 2.0
Location: c:\users\gouthum\.conda\envs\myenv\lib\site-packages
Requires: huggingface-hub, Pillow, scikit-learn, scipy, torch, tqdm, transformers, typing_extensions
Required-by: 
Note: you may need to restart the kernel to use updated packages.
