In [1]:
print("Hello World")

Hello World


# Data Exploration and Preprocessing

### Import libraries

In [2]:
import pandas as pd
import numpy as np

## Resume Data

The dataset has been taken from Kaggle and can be found here:
https://www.kaggle.com/datasets/snehaanbhawal/resume-dataset?resource=download-directory

#### About Dataset
**Context** \
A collection of Resume Examples taken from livecareer.com for categorizing a given resume into any of the labels defined in the dataset.

**Content** \
Contains 2400+ resumes in both string and PDF format.
The PDFs are stored in the data folder, grouped into one subfolder per label. Each resume is a single PDF file whose filename is the id defined in the CSV.

**I will use the PDF files to test the vector search performance of FAISS.**
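
The folder layout described above (one subfolder per label, one PDF per resume id) can be turned into a lookup table. This is a minimal sketch, assuming the layout is `<pdf_root>/<label>/<id>.pdf`; the function name `index_resume_pdfs` and any path passed to it are hypothetical and not part of the dataset.

```python
from pathlib import Path


def index_resume_pdfs(pdf_root):
    """Map each resume id (PDF file stem) to its label (parent folder name)."""
    pdf_root = Path(pdf_root)
    index = {}
    # Assumed layout: <pdf_root>/<label>/<resume_id>.pdf
    for pdf_path in sorted(pdf_root.glob("*/*.pdf")):
        index[pdf_path.stem] = pdf_path.parent.name
    return index
```

Calling it on the folder that holds the label subfolders returns a dictionary such as `{"<id>": "<label>", ...}`, which could later be joined against the ids in the CSV.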

In [6]:
%pwd

'c:\\Users\\amman\\Documents\\Generative AI\\End-to-End-AI-Resume-Matcher\\notebooks'

In [7]:
# Change working directory to root directory
import os
os.chdir("../")
%pwd

'c:\\Users\\amman\\Documents\\Generative AI\\End-to-End-AI-Resume-Matcher'
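
Note that `os.chdir("../")` is relative to the current directory, so re-running that cell moves up one more level each time. A possible alternative is to search upward for a known marker. The sketch below assumes the project root is the first parent that contains a `data` folder; the helper name `find_project_root` and the choice of marker are assumptions for illustration.

```python
from pathlib import Path


def find_project_root(start, marker="data"):
    """Walk up from `start` until a directory containing `marker` is found."""
    start = Path(start).resolve()
    for candidate in [start, *start.parents]:
        if (candidate / marker).exists():
            return candidate
    raise FileNotFoundError(f"No parent of {start} contains {marker!r}")
```

With this helper, `os.chdir(find_project_root(Path.cwd()))` gives the same result no matter how many times the cell is executed.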

## Job Descriptions Dataset

The job listings dataset is taken from the [Hugging Face datasets hub](https://huggingface.co/datasets). It contains 124,000 rows of job listings. Columns include:

- job_id (str)
- company_name (str)
- title (str)
- description (str)
- max_salary (float64)

Load the data into a pandas DataFrame.

In [48]:
from huggingface_hub import hf_hub_download

REPO_ID = "datastax/linkedin_job_listings"
FILENAME = "postings.csv"

jobs = pd.read_csv(
    hf_hub_download(repo_id=REPO_ID, filename=FILENAME, repo_type="dataset")
)



The full dataset is quite large, so only a subset is used in order to speed up preprocessing and limit embedding storage size. The raw subset is saved to the data folder.

In [52]:
sample_size = 2000
# Take the first `sample_size` rows; copy so later edits never touch `jobs`
jobs_samp = jobs.iloc[:sample_size].copy()
jobs_samp.shape

(2000, 31)
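
Taking the first 2000 rows keeps whatever ordering the source file has. If a more representative subset is wanted, a seeded random sample is one option. This is a sketch of that alternative, not what the notebook does; the helper `sample_rows`, the seed value, and the toy frame are assumptions for illustration.

```python
import pandas as pd


def sample_rows(df, n, seed=42):
    """Return n randomly chosen rows (all rows if df is smaller), reproducibly."""
    n = min(n, len(df))
    return df.sample(n=n, random_state=seed).reset_index(drop=True)


# Toy stand-in for the job listings frame
toy = pd.DataFrame(
    {"job_id": range(10), "description": [f"job {i}" for i in range(10)]}
)
subset = sample_rows(toy, n=4)
print(subset.shape)  # (4, 2)
```

Fixing `random_state` means the same rows are drawn on every run, so downstream embeddings stay comparable.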

In [53]:
# Save raw subset as csv
jobs_samp.to_csv('data/job_desc.csv')

In [54]:
# Remove full job_listing dataframe variable to save memory
del jobs

In [89]:
jobs_samp.head()

Unnamed: 0,job_id,description
0,921716,Job descriptionA leading real estate firm in N...
1,1829192,"At Aspen Therapy and Wellness , we are committ..."
2,10998357,The National Exemplar is accepting application...
3,23221523,Senior Associate Attorney - Elder Law / Trusts...
4,35982263,Looking for HVAC service tech with experience ...


In [57]:
# Let's remove all the columns we don't need right now
columns_to_keep = ["job_id", "description", "skills_desc"]
jobs_samp = jobs_samp[columns_to_keep]
jobs_samp.head()

Unnamed: 0,job_id,description,skills_desc
0,921716,Job descriptionA leading real estate firm in N...,Requirements: \n\nWe are seeking a College or ...
1,1829192,"At Aspen Therapy and Wellness , we are committ...",
2,10998357,The National Exemplar is accepting application...,We are currently accepting resumes for FOH - A...
3,23221523,Senior Associate Attorney - Elder Law / Trusts...,This position requires a baseline understandin...
4,35982263,Looking for HVAC service tech with experience ...,


Check for missing values

In [90]:
jobs_samp.isnull().sum()

job_id         0
description    0
dtype: int64
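
Raw counts are easier to judge as a share of rows. The sketch below uses a toy frame (not the real data) and flags columns where more than half of the values are missing; the 0.5 threshold is an arbitrary assumption.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the job listings sample
toy = pd.DataFrame(
    {
        "job_id": [1, 2, 3, 4],
        "description": ["a", "b", "c", "d"],
        "skills_desc": ["python", np.nan, np.nan, np.nan],
    }
)

# The mean of a boolean mask is the fraction of missing values per column
missing_share = toy.isnull().mean()
mostly_empty = missing_share[missing_share > 0.5].index.tolist()
print(mostly_empty)  # ['skills_desc']
```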

Since most of the `skills_desc` column is empty, we will drop it.

In [62]:
jobs_samp = jobs_samp.drop(columns=["skills_desc"])

In [91]:
jobs_samp.head()

Unnamed: 0,job_id,description
0,921716,Job descriptionA leading real estate firm in N...
1,1829192,"At Aspen Therapy and Wellness , we are committ..."
2,10998357,The National Exemplar is accepting application...
3,23221523,Senior Associate Attorney - Elder Law / Trusts...
4,35982263,Looking for HVAC service tech with experience ...


Check for duplicated records

In [92]:
int(jobs_samp.duplicated().sum())

0
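
By default, `duplicated()` compares entire rows, so two listings with identical text but different `job_id` values are not counted. A sketch of checking the text column alone, using a toy frame rather than the real data:

```python
import pandas as pd

# Toy frame: two listings share the same text but have different ids
toy = pd.DataFrame(
    {
        "job_id": [1, 2, 3],
        "description": ["Sales associate", "Sales associate", "HVAC technician"],
    }
)

full_row_dupes = int(toy.duplicated().sum())
text_dupes = int(toy.duplicated(subset=["description"]).sum())
print(full_row_dupes, text_dupes)  # 0 1
```

Whether repeated descriptions should be dropped depends on the use case; for vector search, identical texts would produce identical embeddings.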

In [93]:
jobs_samp.dtypes

job_id          int64
description    object
dtype: object

### Text Cleaning

In [1]:
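import re

def clean_text(text):
    """Lowercase text, strip HTML tags and non-letter characters, collapse whitespace."""
    text = text.lower()
    text = re.sub(r"<.*?>", "", text)  # Remove HTML tags
    # Delete everything except letters and whitespace (digits and punctuation too)
    text = re.sub(r"[^a-z\s]", "", text)
    text = " ".join(text.split())  # Collapse runs of whitespace into single spaces
    return text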
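
A short worked example makes the effect of each step visible. The function is repeated here only so the snippet runs on its own; the sample string is made up for illustration.

```python
import re


def clean_text(text):
    # Same steps as the function above, repeated so this example is standalone
    text = text.lower()
    text = re.sub(r"<.*?>", "", text)
    text = re.sub(r"[^a-zA-Z\s]", "", text)
    return " ".join(text.split())


sample = "<p>Full-time HVAC Tech,\n 5+ years'  experience!</p>"
print(clean_text(sample))  # fulltime hvac tech years experience
```

Two side effects are worth keeping in mind: characters are deleted rather than replaced by a space, so "Full-time" becomes "fulltime", and digits are removed, so "5+ years" loses the number.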

In [94]:
# Create copy of jobs df to perform cleaning on
jobs_clean = jobs_samp.copy()

In [110]:
jobs_clean["description"] = jobs_clean["description"].apply(clean_text)
jobs_clean.head()

Unnamed: 0,job_id,description
0,921716,job descriptiona leading real estate firm in n...
1,1829192,at aspen therapy and wellness we are committed...
2,10998357,the national exemplar is accepting application...
3,23221523,senior associate attorney elder law trusts and...
4,35982263,looking for hvac service tech with experience ...


Save clean job listings dataset to csv

In [111]:
jobs_clean.to_csv('data/jobs_desc_clean.csv', index=False)