# **Data Intake and Preview**

## **1. Introduction**

### **Notebook Overview**

This notebook prepares the textual data for the **Resume-to-Job Recommendation System**, which aims to recommend job postings to users based on the similarity of their resumes to available roles. To enable effective classification and content-based retrieval, we create a **unified preprocessing pipeline** that cleans and normalizes the text data for **TF-IDF feature extraction**. This ensures consistency across resumes and job descriptions, which is critical for computing cosine similarity and training our classification models later in the pipeline.

The output will be clean, lemmatized, and consistently formatted text, enabling downstream tasks such as TF-IDF vectorization, embedding generation, and feature analysis. All logic is modular and reusable to support future enhancements.
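To illustrate why consistent normalization matters for similarity scoring, here is a minimal sketch using only the standard library. It is not the project's implementation: it uses raw term counts rather than TF-IDF weights, the two text snippets are hypothetical, and the "normalized" token lists are hand-written stand-ins for what a cleaning and lemmatizing step might produce.

```python
import math
from collections import Counter


def cosine_similarity(tokens_a, tokens_b):
    """Cosine similarity between two token lists using raw term counts."""
    counts_a, counts_b = Counter(tokens_a), Counter(tokens_b)
    shared = counts_a.keys() & counts_b.keys()
    dot = sum(counts_a[t] * counts_b[t] for t in shared)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical snippets; not taken from either dataset.
resume_raw = "Managed Python projects."
job_raw = "managing python project"

# Without normalization, case, punctuation, and inflection keep tokens from matching.
raw_score = cosine_similarity(resume_raw.split(), job_raw.split())

# Hand-normalized stand-ins for the output of a cleaning + lemmatizing step.
resume_norm = ["manage", "python", "project"]
job_norm = ["manage", "python", "project"]
norm_score = cosine_similarity(resume_norm, job_norm)
```

In this toy case the raw texts share no tokens at all, while the normalized versions are identical, which is the effect the unified pipeline is meant to achieve at scale.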

---

### **Objectives**

- Develop a **single, reusable cleaning pipeline** for textual data using `spaCy`.
- Normalize, clean, and lemmatize text for both resumes and job descriptions.
- Create derived features for exploratory data analysis.
- Save preprocessed, clean datasets for downstream modeling and matching phases.

---

### **Dataset Descriptions**

#### **LinkedIn Job Postings Dataset**

[LinkedIn Job Postings (2023 - 2024)](https://www.kaggle.com/datasets/arshkon/linkedin-job-postings) by [Arsh Koneru](https://www.kaggle.com/arshkon) and [Zoey Yu Zou](https://www.kaggle.com/zoeyyuzou)
- Contains job titles, descriptions, industries, and metadata.
- We primarily focus on the `title` and `description` fields for text processing.

#### **Resume Dataset**

[Resume Dataset](https://www.kaggle.com/datasets/snehaanbhawal/resume-dataset/data) by [Snehaan Bhawal](https://www.kaggle.com/snehaanbhawal)
- Contains labeled résumé texts (`Resume_str`) across multiple categories.
- The `Category` field serves as the ground-truth label for classifier training.

---

### **Importing Packages**

In [1]:
import pandas as pd
import numpy as np
from pandarallel import pandarallel

In [2]:
from jobrec import config
from jobrec import preprocessing as pp
from jobrec import feature_extractor as fe

INFO: Pandarallel will run on 32 workers.
INFO: Pandarallel will use Memory file system to transfer data between the main process and workers.


---

## **2. Data Exploration**

### **Jobs Dataset**

In [3]:
# Load in jobs dataset
jobs_df = pd.read_csv(config.RAW_DATA_DIR/'postings.csv')

In [4]:
# Initial View
jobs_df.head()

Unnamed: 0,job_id,company_name,title,description,max_salary,pay_period,location,company_id,views,med_salary,...,skills_desc,listed_time,posting_domain,sponsored,work_type,currency,compensation_type,normalized_salary,zip_code,fips
0,921716,Corcoran Sawyer Smith,Marketing Coordinator,Job descriptionA leading real estate firm in N...,20.0,HOURLY,"Princeton, NJ",2774458.0,20.0,,...,Requirements: \n\nWe are seeking a College or ...,1713398000000.0,,0,FULL_TIME,USD,BASE_SALARY,38480.0,8540.0,34021.0
1,1829192,,Mental Health Therapist/Counselor,"At Aspen Therapy and Wellness , we are committ...",50.0,HOURLY,"Fort Collins, CO",,1.0,,...,,1712858000000.0,,0,FULL_TIME,USD,BASE_SALARY,83200.0,80521.0,8069.0
2,10998357,The National Exemplar,Assitant Restaurant Manager,The National Exemplar is accepting application...,65000.0,YEARLY,"Cincinnati, OH",64896719.0,8.0,,...,We are currently accepting resumes for FOH - A...,1713278000000.0,,0,FULL_TIME,USD,BASE_SALARY,55000.0,45202.0,39061.0
3,23221523,"Abrams Fensterman, LLP",Senior Elder Law / Trusts and Estates Associat...,Senior Associate Attorney - Elder Law / Trusts...,175000.0,YEARLY,"New Hyde Park, NY",766262.0,16.0,,...,This position requires a baseline understandin...,1712896000000.0,,0,FULL_TIME,USD,BASE_SALARY,157500.0,11040.0,36059.0
4,35982263,,Service Technician,Looking for HVAC service tech with experience ...,80000.0,YEARLY,"Burlington, IA",,3.0,,...,,1713452000000.0,,0,FULL_TIME,USD,BASE_SALARY,70000.0,52601.0,19057.0


In [5]:
# Dataframe info
jobs_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 123849 entries, 0 to 123848
Data columns (total 31 columns):
 #   Column                      Non-Null Count   Dtype  
---  ------                      --------------   -----  
 0   job_id                      123849 non-null  int64  
 1   company_name                122130 non-null  object 
 2   title                       123849 non-null  object 
 3   description                 123842 non-null  object 
 4   max_salary                  29793 non-null   float64
 5   pay_period                  36073 non-null   object 
 6   location                    123849 non-null  object 
 7   company_id                  122132 non-null  float64
 8   views                       122160 non-null  float64
 9   med_salary                  6280 non-null    float64
 10  min_salary                  29793 non-null   float64
 11  formatted_work_type         123849 non-null  object 
 12  applies                     23320 non-null   float64
 13  original_liste

In [6]:
# Summary Statistics
jobs_df.describe()

Unnamed: 0,job_id,max_salary,company_id,views,med_salary,min_salary,applies,original_listed_time,remote_allowed,expiry,closed_time,listed_time,sponsored,normalized_salary,zip_code,fips
count,123849.0,29793.0,122132.0,122160.0,6280.0,29793.0,23320.0,123849.0,15246.0,123849.0,1073.0,123849.0,123849.0,36073.0,102977.0,96434.0
mean,3896402000.0,91939.42,12204010.0,14.618247,22015.619876,64910.85,10.591981,1713152000000.0,1.0,1716213000000.0,1712928000000.0,1713204000000.0,0.0,205327.0,50400.491887,28713.879887
std,84043550.0,701110.1,25541430.0,85.903598,52255.873846,495973.8,29.047395,484820900.0,0.0,2321394000.0,362289300.0,398912200.0,0.0,5097627.0,30252.232515,16015.929825
min,921716.0,1.0,1009.0,1.0,0.0,1.0,1.0,1701811000000.0,1.0,1712903000000.0,1712346000000.0,1711317000000.0,0.0,0.0,1001.0,1003.0
25%,3894587000.0,48.28,14352.0,3.0,18.94,37.0,1.0,1712863000000.0,1.0,1715481000000.0,1712670000000.0,1712886000000.0,0.0,52000.0,24112.0,13121.0
50%,3901998000.0,80000.0,226965.0,4.0,25.5,60000.0,3.0,1713395000000.0,1.0,1716042000000.0,1712670000000.0,1713408000000.0,0.0,81500.0,48059.0,29183.0
75%,3904707000.0,140000.0,8047188.0,8.0,2510.5,100000.0,8.0,1713478000000.0,1.0,1716088000000.0,1713283000000.0,1713484000000.0,0.0,125000.0,78201.0,42077.0
max,3906267000.0,120000000.0,103473000.0,9975.0,750000.0,85000000.0,967.0,1713573000000.0,1.0,1729125000000.0,1713562000000.0,1713573000000.0,0.0,535600000.0,99901.0,56045.0


### **Resume Dataset**

In [7]:
# Load Resume dataset
resume_df = pd.read_csv(config.RAW_DATA_DIR/'Resume.csv')

In [8]:
# Initial View
resume_df.head()

Unnamed: 0,ID,Resume_str,Resume_html,Category
0,16852973,HR ADMINISTRATOR/MARKETING ASSOCIATE\...,"<div class=""fontsize fontface vmargins hmargin...",HR
1,22323967,"HR SPECIALIST, US HR OPERATIONS ...","<div class=""fontsize fontface vmargins hmargin...",HR
2,33176873,HR DIRECTOR Summary Over 2...,"<div class=""fontsize fontface vmargins hmargin...",HR
3,27018550,HR SPECIALIST Summary Dedica...,"<div class=""fontsize fontface vmargins hmargin...",HR
4,17812897,HR MANAGER Skill Highlights ...,"<div class=""fontsize fontface vmargins hmargin...",HR


In [9]:
# Dataframe information
resume_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2484 entries, 0 to 2483
Data columns (total 4 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   ID           2484 non-null   int64 
 1   Resume_str   2484 non-null   object
 2   Resume_html  2484 non-null   object
 3   Category     2484 non-null   object
dtypes: int64(1), object(3)
memory usage: 77.8+ KB


In [10]:
# Summary Statistics
resume_df.describe()

Unnamed: 0,ID
count,2484.0
mean,31826160.0
std,21457350.0
min,3547447.0
25%,17544300.0
50%,25210310.0
75%,36114440.0
max,99806120.0


## **3. Data Preprocessing**

### **Text Cleaning Pipeline**

Resumes and job postings often include noise (punctuation, special characters, filler words, repeated boilerplate) that weakens NLP feature extraction.

This notebook applies the following pipeline of custom functions:
1. `regex_text`:
    - **Lowercase** and standardize text
    - **Remove URLs, emails, and special characters** with regex
2. `lemmatize_text`:
    - Built with **spaCy**, a high-performance NLP library
    - **Tokenize** text into individual words
    - **Remove stopwords** to keep only meaningful terms
    - **Lemmatize** words to their root form (e.g., *running* → *run*) for semantic consistency
3. `extract_skills`:
    - Identifies and extracts skills from processed text
    - `SKILLS` list manually defined in `config.py`
4. `extract_domains`:
    - Identifies and extracts domains from preprocessed text and extracted skills list
    - Infers broader knowledge domains from detected skills
    - `DOMAINS` list manually defined in `config.py` 
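
The regex cleaning step at the top of the list can be sketched as follows. This is an assumption about what a function like `regex_text` might do, not the code in `jobrec.preprocessing`; the function name `regex_text_sketch` and the exact patterns are hypothetical.

```python
import re

# Hypothetical patterns; the real regex_text may use different ones.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
EMAIL_RE = re.compile(r"\S+@\S+")
NON_ALNUM_RE = re.compile(r"[^a-z0-9\s]")
WHITESPACE_RE = re.compile(r"\s+")


def regex_text_sketch(text):
    """Lowercase text, then strip URLs, emails, and special characters."""
    text = text.lower()
    text = URL_RE.sub(" ", text)        # drop URLs before punctuation is removed
    text = EMAIL_RE.sub(" ", text)      # drop email addresses
    text = NON_ALNUM_RE.sub(" ", text)  # replace remaining special characters
    return WHITESPACE_RE.sub(" ", text).strip()


cleaned = regex_text_sketch(
    "Apply at https://example.com or email HR@example.com -- Sr. Engineer (C++)!"
)
# cleaned == "apply at or email sr engineer c"
```

Note that stripping every non-alphanumeric character reduces a term like `C++` to `c`; a production version might preserve a short list of such tokens before this step.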

Both datasets pass through the same pipeline, ensuring **identical treatment** of resumes and job descriptions.
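
As a rough illustration of the skill and domain extraction steps, the sketch below matches a small vocabulary against cleaned text and then infers domains from the detected skills. The `SKILLS` and `DOMAINS` values here are hypothetical stand-ins for the definitions in `config.py`, the function names are invented for this example, and the domain step uses only the skills list, whereas the notebook's `extract_domains` also looks at the preprocessed text.

```python
import re

# Hypothetical stand-ins for the SKILLS and DOMAINS definitions in config.py.
SKILLS = ["management", "marketing", "design", "python"]
DOMAINS = {
    "business": ["management", "marketing"],
    "tech": ["python", "design"],
}


def extract_skills_sketch(text, skills=SKILLS):
    """Return the skills that appear as whole words or phrases in text."""
    return [s for s in skills if re.search(rf"\b{re.escape(s)}\b", text)]


def extract_domains_sketch(skills_found, domains=DOMAINS):
    """Return every domain that has at least one of its skills detected."""
    found = set(skills_found)
    return [name for name, members in domains.items() if found & set(members)]


# The same functions are applied to a resume and to a job description.
resume_skills = extract_skills_sketch(
    "led marketing and project management for a design studio"
)
job_skills = extract_skills_sketch("seeking python developer")

resume_domains = extract_domains_sketch(resume_skills)
job_domains = extract_domains_sketch(job_skills)
```

Because both texts go through the same vocabulary and matching rules, their skill and domain lists are directly comparable.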

### **Job Postings Dataset**

To keep the scope of this project tight, we focus only on the fields that might be useful in directly matching a job listing to a resume.

In [11]:
# Load in auxiliary datasets
skill_ids = pd.read_csv(config.RAW_DATA_DIR/'jobs'/'job_skills.csv')
industry_ids = pd.read_csv(config.RAW_DATA_DIR/'jobs'/'job_industries.csv')

# Load in mappings
mapped_skills = pd.read_csv(config.RAW_DATA_DIR/'mappings'/'skills.csv')
mapped_industries = pd.read_csv(config.RAW_DATA_DIR/'mappings'/'industries.csv')

In [12]:
# Merge mapped skills with associated ids
skill_ids = skill_ids.merge(mapped_skills, on='skill_abr', how='left')
industry_ids = industry_ids.merge(mapped_industries, on='industry_id', how='left')

# Collapse to lists of unique mapped skill names per job_id
skills_per_job = skill_ids.groupby('job_id')['skill_name'].apply(lambda x: list(set(x.dropna().str.lower()))).reset_index()

# Collapse to lists of unique mapped industry names per job_id
industries_per_job = industry_ids.groupby('job_id')['industry_name'].apply(lambda x: list(set(x.dropna().str.lower()))).reset_index()

# Merge in mapped skills and industries
jobs_df = jobs_df.merge(skills_per_job, on='job_id', how='left')
jobs_df = jobs_df.merge(industries_per_job, on='job_id', how='left')

# Clean text in job title
jobs_df['title_clean'] = jobs_df['title'].parallel_apply(pp.regex_text)

# Remove all unnecessary features
jobs_df = jobs_df[['job_id', 'title', 'title_clean', 'skill_name', 'industry_name', 'description']]

In [13]:
# Apply preprocessing to job listing descriptions
jobs_df = pp.preprocessing_pipeline(
    df=jobs_df.sample(1000, random_state=42),
    text_column="description",
    prefix="desc_",
    regex_func=pp.regex_text,
    lemmatize_func=pp.lemmatize_text,
    extract_skills_func=pp.extract_skills,
    extract_domains_func=pp.extract_domains
)

### **Resume Dataset**

The following cleaning steps are performed:
- Rename `Resume_str` to `resume`
- Drop `ID` and `Resume_html`
- Apply `preprocessing_pipeline`

In [14]:
# Rename columns and remove unnecessary features
resume_df['resume'] = resume_df['Resume_str']
resume_df = resume_df.drop(columns=['ID', 'Resume_html', 'Resume_str'])

In [15]:
# Apply preprocessing to resumes
resume_df = pp.preprocessing_pipeline(
    df=resume_df.sample(1000, random_state=42),
    text_column="resume",
    prefix="resume_",
    regex_func=pp.regex_text,
    lemmatize_func=pp.lemmatize_text,
    extract_skills_func=pp.extract_skills,
    extract_domains_func=pp.extract_domains
)

---

## **4. Feature Extraction**

### **Text Feature Pipeline**
This notebook applies the following custom functions to perform feature extraction:
1. `compute_text_length`: Text length (in words)
2. `compute_avg_word_length`: Mean length of all words
3. `compute_unique_word_count`: Number of distinct words
4. `compute_lexical_diversity`: Unique/total word ratio
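
The four features above can be sketched in a few lines. This is an illustrative version, not the code in `jobrec.feature_extractor`: the function name `text_features_sketch` is hypothetical, and it assumes words are whitespace-delimited and that text length is a word count. That assumption is consistent with the outputs in this notebook, where lexical diversity equals unique word count divided by text length (for example, 335 / 635 ≈ 0.5276).

```python
def text_features_sketch(text):
    """Compute simple length and diversity features for a cleaned text string."""
    words = text.split()      # assumes whitespace-delimited tokens
    total = len(words)
    unique = len(set(words))
    return {
        "text_length": total,
        "avg_word_length": sum(len(w) for w in words) / total if total else 0.0,
        "unique_word_count": unique,
        "lexical_diversity": unique / total if total else 0.0,
    }


features = text_features_sketch("data engineer builds data pipelines")
# text_length=5, unique_word_count=4, lexical_diversity=0.8
```

Guarding against empty input avoids a division by zero for rows whose text is blank after cleaning.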

### **Job Postings Dataset**

In [16]:
# Apply text feature pipeline to processed LinkedIn dataset
jobs_df = fe.text_features_pipeline(jobs_df, text_column="desc_clean", prefix="desc_")

In [17]:
# View dataframe
jobs_df.head()

Unnamed: 0,job_id,title,title_clean,skill_name,industry_name,description,desc_clean,desc_clean_lemmatized,desc_skills,desc_domains,desc_text_length,desc_avg_word_length,desc_unique_word_count,desc_lexical_diversity
73989,3902944011,Senior Automation Engineer - Power Systems,senior automation engineer power systems,"[information technology, engineering]",[oil and gas],The Senior Automation / Power Systems Engineer...,the senior automation power systems engineer w...,senior automation power system engineer primar...,"[engineering, design, development, communicati...",[engineering],635,6.059843,335,0.527559
59308,3901960222,DISH Installation Technician - Field,dish installation technician field,"[information technology, engineering]",[telecommunications],"Company Summary\n\nDISH, an EchoStar Company, ...",company summary dish an echostar company has b...,company summary dish echostar company reimagin...,"[leadership, installation]",[business],466,5.193133,260,0.55794
44663,3900944095,Order Builder,order builder,"[manufacturing, management]",[manufacturing],Division: North Alabama\n\nDepartment : Oxford...,division north alabama department oxford wareh...,division north alabama department oxford wareh...,[management],"[business, education]",439,6.214123,291,0.66287
81954,3903878594,"Mountain Multimedia Journalist, KMGH",mountain multimedia journalist kmgh,"[marketing, public relations, writing/editing]",[broadcast media production and distribution],"KMGH, the E.W. Scripps Company ABC affiliate i...",kmgh the e w scripps company abc affiliate in ...,kmgh e w scripps company abc affiliate denver ...,[leadership],[business],833,5.370948,446,0.535414
113151,3905670593,Licensed Practical Nurse (LPN),licensed practical nurse lpn,[health care provider],[hospitals and health care],"Come for the Flexibility, Stay for the Culture...",come for the flexibility stay for the culture ...,come flexibility stay culture need life work l...,[],[],305,5.37377,204,0.668852


In [18]:
# View information on jobs dataframe
jobs_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 1000 entries, 73989 to 109452
Data columns (total 14 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   job_id                  1000 non-null   int64  
 1   title                   1000 non-null   object 
 2   title_clean             1000 non-null   object 
 3   skill_name              981 non-null    object 
 4   industry_name           988 non-null    object 
 5   description             1000 non-null   object 
 6   desc_clean              1000 non-null   object 
 7   desc_clean_lemmatized   1000 non-null   object 
 8   desc_skills             1000 non-null   object 
 9   desc_domains            1000 non-null   object 
 10  desc_text_length        1000 non-null   int64  
 11  desc_avg_word_length    1000 non-null   float64
 12  desc_unique_word_count  1000 non-null   int64  
 13  desc_lexical_diversity  1000 non-null   float64
dtypes: float64(2), int64(3), object(9)
memo

In [19]:
# View summary statistics of processed dataset
jobs_df.describe()

Unnamed: 0,job_id,desc_text_length,desc_avg_word_length,desc_unique_word_count,desc_lexical_diversity
count,1000.0,1000.0,1000.0,1000.0,1000.0
mean,3898151000.0,519.017,5.811018,275.257,0.570253
std,19158250.0,291.481962,0.406078,124.886539,0.089103
min,3438940000.0,19.0,4.288927,18.0,0.333753
25%,3894817000.0,308.75,5.569872,186.75,0.514243
50%,3901988000.0,460.5,5.790077,260.0,0.56295
75%,3904512000.0,705.0,6.072195,361.0,0.617024
max,3906265000.0,2138.0,7.294118,846.0,0.95


In [20]:
# Save to CSV
jobs_df.to_csv(config.PROCESSED_DATA_DIR / 'jobs_clean.csv', index=False)

### **Resume Dataset**

In [21]:
# Apply text feature pipeline to processed resume dataset
resume_df = fe.text_features_pipeline(resume_df, text_column="resume_clean", prefix="resume_")

In [22]:
# View dataframe
resume_df.head()

Unnamed: 0,Category,resume,resume_clean,resume_clean_lemmatized,resume_skills,resume_domains,resume_text_length,resume_avg_word_length,resume_unique_word_count,resume_lexical_diversity
420,TEACHER,Kpandipou Koffi Summary ...,kpandipou koffi summary compassionate teaching...,kpandipou koffi summary compassionate teaching...,"[management, marketing, design, communication,...","[marketing, business, education]",675,6.591111,378,0.56
1309,DIGITAL-MEDIA,DIRECTOR OF DIGITAL TRANSFORMATION ...,director of digital transformation executive p...,director digital transformation executive prof...,"[management, marketing, design, development, l...","[marketing, business, education, tech]",845,5.733728,339,0.401183
2023,CONSTRUCTION,SENIOR PROJECT MANAGER Professi...,senior project manager professional summary am...,senior project manager professional summary am...,"[management, marketing, development, communica...","[finance, marketing, business, construction, e...",688,6.476744,324,0.47093
1360,CHEF,CHEF Summary Experienced ca...,chef summary experienced catering chef skilled...,chef summary experience catering chef skille p...,[management],"[retail, business]",180,5.911111,102,0.566667
2186,BANKING,OPERATIONS MANAGER Summary E...,operations manager summary experienced client ...,operation manager summary experience client se...,"[management, sales, development, communication]","[business, legal, sales, education]",602,6.523256,296,0.491694


In [23]:
# View information on resume dataframe
resume_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 1000 entries, 420 to 1717
Data columns (total 10 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Category                  1000 non-null   object 
 1   resume                    1000 non-null   object 
 2   resume_clean              1000 non-null   object 
 3   resume_clean_lemmatized   1000 non-null   object 
 4   resume_skills             1000 non-null   object 
 5   resume_domains            1000 non-null   object 
 6   resume_text_length        1000 non-null   int64  
 7   resume_avg_word_length    1000 non-null   float64
 8   resume_unique_word_count  1000 non-null   int64  
 9   resume_lexical_diversity  1000 non-null   float64
dtypes: float64(2), int64(2), object(6)
memory usage: 85.9+ KB


In [24]:
# View summary statistics of processed dataset
resume_df.describe()

Unnamed: 0,resume_text_length,resume_avg_word_length,resume_unique_word_count,resume_lexical_diversity
count,1000.0,1000.0,1000.0,1000.0
mean,797.167,6.172035,352.09,0.467404
std,375.281299,0.334469,112.832988,0.082392
min,118.0,4.519016,81.0,0.12071
25%,631.75,6.002374,292.75,0.420188
50%,737.0,6.201427,348.0,0.468563
75%,921.0,6.397403,405.25,0.517815
max,5070.0,7.098408,1023.0,0.735772


In [25]:
# Save to CSV
resume_df.to_csv(config.PROCESSED_DATA_DIR / 'resume_clean.csv', index=False)

---

## **5. Summary of Work**

- Applied a single, robust text cleaning function to both datasets  
- Removed noise, standardized tokens, and lemmatized words  
- Created columns for cleaned text, lemmatized text, extracted skills and domains
- Added derived `text_length`, `avg_word_length`, `unique_word_count`, `lexical_diversity` features for later EDA  
- Saved `resume_clean.csv` and `jobs_clean.csv` for modeling

**Next:** Continue to `02_eda.ipynb` to analyze class balance, keyword patterns, and feature distributions that will inform the classification model and recommendation engine.