# **Exploratory Data Analysis**

## **1. Introduction**

### **Notebook Overview**

---

### **EDA Objectives**

**Goal**: Build succinct but high-leverage EDA that:
1. **Validates data readiness**
2. **Characterizes category & text distributions** to guide vectorizer and model decisions
3. **Assesses resume / job domain alignment** so that similarity scores are interpretable
4. **Surfaces feature signals** that motivate the classifier phase

---

### **Key Questions to Explore / Goals**

1. *We’ve consolidated heterogeneous résumé labels into a stable category schema. Are categories
balanced? Are some under‑represented (affects model choice & evaluation)?*
2. *Our text cleaning pipeline produced reasonably normalized documents. Are lengths sane? Any empty /
near‑empty docs that need dropping?*
3. *Résumés and job postings live in related but not identical vocabularies. Quantify overlap → motivates
TF‑IDF vs domain‑invariant embeddings.*
4. *Certain tokens/skills strongly associate with categories. Justifies supervised modelling & informs
interpretability features in the prototype app.*
5. *There is (or isn’t) enough signal alignment between supply (resumes) and demand (jobs) to support
recommender ranking. Drives how heavily to weight category filters before SBERT similarity.*

---

### **Dataset Descriptions**

#### **LinkedIn Job Postings Dataset**
**Original Dataset**: [LinkedIn Job Postings (2023 - 2024)](https://www.kaggle.com/datasets/arshkon/linkedin-job-postings) by [Arsh Koneru](https://www.kaggle.com/arshkon) and [Zoey Yu Zou](https://www.kaggle.com/zoeyyuzou)
- Contains job titles, descriptions, industries, and metadata.
- We primarily focus on the `title` and `description` fields for text processing.

**Cleaned Jobs Dataset**:
- Processed by spaCy

#### **Resume Dataset**
**Original Dataset**: [Resume Dataset](https://www.kaggle.com/datasets/snehaanbhawal/resume-dataset/data) by [Snehaan Bhawal](https://www.kaggle.com/snehaanbhawal)
- Contains labeled résumé texts (`Resume_str`) across multiple categories.
- The `Category` field serves as the ground-truth label for classifier training.

**Cleaned Resume Dataset**
- Processed by spaCy

### **Import Packages**

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from joblib import Parallel, delayed

In [2]:
from jobrec import config
from jobrec import visualizer as vis
from jobrec.preprocessing import _nlp
from jobrec.spacy_df_io import save_spacy_df, load_spacy_df

INFO: Pandarallel will run on 32 workers.
INFO: Pandarallel will use Memory file system to transfer data between the main process and workers.


In [3]:
import warnings
warnings.simplefilter(action='ignore', category=UserWarning)
warnings.simplefilter(action='ignore', category=FutureWarning)

---

## **2. Data Integrity & Sanity Checks**
**Purpose:**
- Catch residual data errors before modelling: broken tokenization, empty texts, duplicate IDs, inconsistent
stats.

**Key Items:**
1. Missing data? (Title, Skills, Descriptions, Docs, Other Fields)
2. Duplicates?
3. Are any text fields empty or extremely short?
4. Do numeric length stats match actual text lengths (spot check)?
5. Are there rows with skills but no text?

**Visuals:** 
- Optional bar of missingness
- Optional boxplot of text length 

In [4]:
# Load Datasets
if config.SPACY_MODE:
    # Load jobs dataframe using custom utility
    jobs_df   = load_spacy_df(config.JOB_CORPUS_DIR, _nlp)
    resume_df = load_spacy_df(config.RESUME_CORPUS_DIR, _nlp)
else:
    # Load from CSV
    jobs_df   = pd.read_csv(f"{config.PROCESSED_DATA_DIR / config.JOB_NAME}.csv").reset_index()
    resume_df = pd.read_csv(f"{config.PROCESSED_DATA_DIR / config.RESUME_NAME}.csv").reset_index()

### **2.1 Initial Inspections**
Description of what I will do in this section.

**Questions to Answer:** 

#### **Jobs Dataset**

#### **Resume Dataset**

### **2.2 Missingness**
Description of what I will do in this section.

**Questions to Answer:** 
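
A minimal sketch of a per-column missingness summary, assuming that whitespace-only strings should count as missing alongside `NaN`/`None`. The helper name `missingness_report` and the toy frame are illustrative and are not part of `jobrec`.

```python
import pandas as pd


def missingness_report(df: pd.DataFrame) -> pd.DataFrame:
    """Count and percentage of missing or blank values per column."""

    def _is_blank(col: pd.Series) -> pd.Series:
        # True only for strings that are empty after stripping whitespace.
        return col.map(lambda v: isinstance(v, str) and v.strip() == "")

    missing = df.isna() | df.apply(_is_blank)
    report = pd.DataFrame({
        "n_missing": missing.sum(),
        "pct_missing": (missing.mean() * 100).round(2),
    })
    return report.sort_values("n_missing", ascending=False)


# Toy data (made up) standing in for the jobs frame.
toy = pd.DataFrame({
    "title": ["Data Analyst", None, "Nurse"],
    "description": ["Builds dashboards", "   ", "Patient care"],
})
report = missingness_report(toy)
print(report)
```

The same call could be applied to `jobs_df` and `resume_df`; the `n_missing` column is also a natural input for the optional missingness bar chart.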

### **2.3 Duplicates**
Description of what I will do in this section.
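
One way the duplicate check might look, as a sketch: count repeated IDs and repeated texts after light normalization (lowercasing and collapsing whitespace). The column name `job_id` is an assumption for illustration, and `duplicate_summary` is a hypothetical helper.

```python
import pandas as pd


def duplicate_summary(df: pd.DataFrame, id_col: str, text_col: str) -> dict:
    """Count repeated IDs and repeated texts (case- and whitespace-insensitive)."""
    # Normalize so that "Build  dashboards" and "build dashboards" compare equal.
    norm_text = df[text_col].fillna("").str.lower().str.split().str.join(" ")
    return {
        "duplicate_ids": int(df[id_col].duplicated().sum()),
        "duplicate_texts": int(norm_text.duplicated().sum()),
    }


# Toy data (made up); `job_id` is an assumed column name.
toy = pd.DataFrame({
    "job_id": [1, 2, 2, 3],
    "description": ["Build  dashboards", "build dashboards", "Patient care", None],
})
summary = duplicate_summary(toy, id_col="job_id", text_col="description")
print(summary)
```

`duplicated()` marks every occurrence after the first, so the counts are the number of rows that would be dropped by keeping the first copy.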

### **2.4 Text Length Edge Cases**
- Identify extreme outliers in text length/tokens (>99th percentile, <1st percentile, arbitrary range)

Description of what I will do in this section.

**Questions to Answer:** 
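
A sketch of percentile-based flagging for length outliers. It uses whitespace token counts as a cheap proxy for spaCy token counts, and the `min_tokens` floor is an arbitrary assumption rather than a value taken from this notebook.

```python
import pandas as pd


def flag_length_outliers(
    df: pd.DataFrame,
    text_col: str,
    lower_q: float = 0.01,
    upper_q: float = 0.99,
    min_tokens: int = 5,
) -> pd.DataFrame:
    """Add token counts plus too_short / too_long flags to a copy of df."""
    # Whitespace split is only an approximation of real tokenization.
    n_tokens = df[text_col].fillna("").str.split().str.len()
    low, high = n_tokens.quantile([lower_q, upper_q])
    out = df.assign(n_tokens=n_tokens)
    out["too_short"] = (n_tokens < low) | (n_tokens < min_tokens)
    out["too_long"] = n_tokens > high
    return out


# Toy corpus (made up): one empty doc, twenty typical docs, one very long doc.
texts = [""] + ["word " * 50] * 20 + ["word " * 5000]
flagged = flag_length_outliers(pd.DataFrame({"description": texts}), "description")
print(flagged[["n_tokens", "too_short", "too_long"]].describe(include="all"))
```

Flagging rather than dropping keeps the pruning decision separate, so the flagged rows can be inspected before anything is removed.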

### **2.5 Further Dataset Pruning**

This section depends on the findings in the previous subsections. Anything that would hinder the analysis is pruned at this stage.

#### **Jobs Dataset**

#### **Resume Dataset**

---

## **3. Category Landscape**
**Questions:**
1. Is there a balance of resume categories?
2. Can we create a bridge between resume categories and job domains?
3. Are there any signature terms on a category basis?

### **3.1 Resume Category Balance**
**Purpose:**
- Understand target imbalance for supervised résumé classifier.
  
**Key Questions:**
1. Which categories dominate?
2. How many classes fall below a learnable threshold (<50 rows? <20?)
3. Should we collapse / reweight / use stratified CV?
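
The balance questions above could be answered with a small count table, sketched here. The 50-row threshold mirrors the one mentioned in the questions; the label values are made up and `category_balance` is a hypothetical helper.

```python
import pandas as pd


def category_balance(labels: pd.Series, min_rows: int = 50) -> pd.DataFrame:
    """Per-category row count, share of the corpus, and a small-class flag."""
    counts = labels.value_counts()
    return pd.DataFrame({
        "n_rows": counts,
        "share": (counts / counts.sum()).round(3),
        "below_threshold": counts < min_rows,
    })


# Toy labels (made up) standing in for resume_df["Category"].
labels = pd.Series(["HR"] * 60 + ["CHEF"] * 30 + ["BPO"] * 10)
balance = category_balance(labels, min_rows=50)
imbalance_ratio = balance["n_rows"].max() / balance["n_rows"].min()
print(balance)
print(f"largest / smallest class: {imbalance_ratio:.1f}")
```

A large ratio or many flagged classes would argue for stratified CV, class weights, or collapsing rare categories.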

### **3.2 Category to Domain Bridge Construction and Validation**
Description of what I will do in this section.
**Purpose:**
- Harmonize fine-grained résumé Category labels to the coarser domain vocabulary shared with jobs.
- This mapping underpins filtering in the recommender and aggregation in EDA.

**Questions to Answer:** 
1. Which domain best represents each category?
2. Are there categories that map to multiple domains?
3. How noisy are auto-extracted domains vs manual judgement? 

### **3.3 Category Signature Terms**
**Purpose:**
- Surface discriminative language that differentiates categories (for classifier) and domains (for
recommender filtering + interpretability).

**Questions to Answer:** 
1. What words/phrases are over-represented in each resume category?
2. What terms characterize job descriptions in each job domain?
3. Are there mismatches?
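
As one possible way to surface over-represented terms, the sketch below scores each token by a smoothed log-ratio of its in-category frequency to its frequency in all other categories. This is a stand-in chosen for simplicity, not necessarily the method used later in the notebook; whitespace tokenization and the smoothing constant `alpha` are assumptions.

```python
import math
from collections import Counter, defaultdict


def signature_terms(docs, labels, top_n=3, alpha=1.0):
    """Top tokens per label by smoothed log(p_in_category / p_elsewhere)."""
    per_label = defaultdict(Counter)
    for text, label in zip(docs, labels):
        per_label[label].update(text.lower().split())

    total = Counter()
    for counts in per_label.values():
        total.update(counts)
    vocab_size = len(total)
    n_total = sum(total.values())

    result = {}
    for label, counts in per_label.items():
        n_in = sum(counts.values())
        n_out = n_total - n_in
        scores = {}
        for tok, k in counts.items():
            # Additive smoothing keeps tokens unseen elsewhere from dividing by zero.
            p_in = (k + alpha) / (n_in + alpha * vocab_size)
            p_out = (total[tok] - k + alpha) / (n_out + alpha * vocab_size)
            scores[tok] = math.log(p_in / p_out)
        result[label] = sorted(scores, key=scores.get, reverse=True)[:top_n]
    return result


# Toy documents and labels (made up).
docs = [
    "python sql python data",
    "python data models",
    "patient care nursing",
    "nursing patient charts data",
]
labels = ["DATA", "DATA", "HEALTH", "HEALTH"]
terms = signature_terms(docs, labels)
print(terms)
```

Running the same function on job descriptions grouped by domain gives a second set of term lists; comparing the two per mapped pair is one way to look for mismatches.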

---

## **4. Domain Analysis**
**We want to know:** 
- Does each résumé category have enough corresponding job postings? If not, similarity search
will either fail or return cross‑domain noise.

**Purpose:**
- Ensure each résumé domain has enough jobs for recommender filtering and evaluation.

**Questions to Answer:** 
1. How many jobs exist for each mapped resume domain?
2. Are there any coverage gaps?
3. Do domain clusters differ between jobs and resumes?
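
A sketch of a coverage table that lines up résumé and job counts per domain. It assumes both frames already carry a mapped domain label; the `min_jobs` cutoff and the domain names are illustrative only.

```python
import pandas as pd


def domain_coverage(
    resume_domains: pd.Series, job_domains: pd.Series, min_jobs: int = 100
) -> pd.DataFrame:
    """Resumes and jobs per domain, with a flag for under-covered domains."""
    coverage = pd.DataFrame({
        "n_resumes": resume_domains.value_counts(),
        "n_jobs": job_domains.value_counts(),
    }).fillna(0).astype(int)
    # Leave the ratio undefined (NaN) for domains with no resumes.
    has_resumes = coverage["n_resumes"] > 0
    coverage["jobs_per_resume"] = (
        coverage["n_jobs"] / coverage["n_resumes"].where(has_resumes)
    ).round(2)
    coverage["coverage_gap"] = has_resumes & (coverage["n_jobs"] < min_jobs)
    return coverage.sort_values("n_jobs")


# Toy domain labels (made up).
resume_domains = pd.Series(["tech"] * 3 + ["health"] * 2 + ["arts"])
job_domains = pd.Series(["tech"] * 5 + ["health"] + ["finance"] * 4)
coverage = domain_coverage(resume_domains, job_domains, min_jobs=2)
print(coverage)
```

Domains flagged as gaps are the ones where a category filter would leave the recommender with too few candidates.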

---

## **5. Text Profiling and Quality Assurance**
- Run the numerical analysis on the text features and the spaCy docs.
- Guides vectorizer limits, n‑gram settings, and whether to trim/clean further.

**Purpose:**
- Characterize document scale differences that affect vectorization (TF-IDF vs char n-grams), truncation,
embedding memory, and model robustness.
- Satisfy the histogram, scatter, and Pearson requirements.

**Key Questions:**
1. Are résumés dramatically longer than job descriptions?
2. Are certain domains consistently short/long?
3. Does lexical diversity vary by domain?
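
A minimal profiling sketch, assuming whitespace tokens as a proxy for spaCy tokens and type-token ratio (TTR) as the lexical diversity measure. TTR tends to fall as documents get longer, so comparisons across very different lengths should be read with care.

```python
import pandas as pd


def text_profile(texts: pd.Series) -> pd.DataFrame:
    """Character count, token count, unique-token count, and type-token ratio."""
    clean = texts.fillna("")
    tokens = clean.str.lower().str.split()
    n_tokens = tokens.str.len()
    n_unique = tokens.map(lambda toks: len(set(toks)))
    return pd.DataFrame({
        "n_chars": clean.str.len(),
        "n_tokens": n_tokens,
        "n_unique": n_unique,
        # Undefined (NaN) for empty documents instead of dividing by zero.
        "ttr": (n_unique / n_tokens.where(n_tokens > 0)).round(3),
    })


# Toy texts (made up).
texts = pd.Series(["a b c d", "a a a a", "x y", ""])
profile = text_profile(texts)
pearson_r = profile["n_tokens"].corr(profile["n_unique"])  # Pearson is the default
print(profile)
print(f"Pearson r (n_tokens vs n_unique): {pearson_r:.3f}")
```

The resulting columns feed directly into histograms (`n_tokens`), scatter plots (`n_tokens` vs `n_unique`), and per-domain group-bys.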

---

## **6. Vocabulary Overlap and Lexical Analysis**
Quantify lexical overlap between résumé and job corpora.

**Questions to Answer:** 
1. Is there any general vocabulary overlap between resumes and job listings?
2. Is there vocabulary overlap that is category-conditional?

### **6.1 Global Overlap**
- Top N (e.g., 5k) tokens in résumés vs jobs, compute Jaccard.
- **Weighted overlap:** sum min(TFIDF_R, TFIDF_J) across tokens.
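
The two measures above can be sketched as follows. To stay dependency-free, the weighted overlap is demonstrated with relative term frequencies as a stand-in for TF-IDF weights; any token-to-weight mapping could be passed instead. Whitespace tokenization is an assumption.

```python
from collections import Counter


def token_counts(docs):
    """Corpus-level token counts from whitespace-split, lowercased text."""
    return Counter(tok for doc in docs for tok in doc.lower().split())


def top_tokens(docs, n):
    """Set of the n most frequent tokens in a corpus."""
    return {tok for tok, _ in token_counts(docs).most_common(n)}


def jaccard(a, b):
    """|A ∩ B| / |A ∪ B|, defined as 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def relative_freq(docs):
    """Token -> share of all tokens; used here in place of TF-IDF weights."""
    counts = token_counts(docs)
    total = sum(counts.values())
    return {tok: c / total for tok, c in counts.items()}


def weighted_overlap(w_r, w_j):
    """Sum of min(weight_R, weight_J) over tokens present in both corpora."""
    return sum(min(w_r[t], w_j[t]) for t in w_r.keys() & w_j.keys())


# Toy corpora (made up).
resumes = ["python sql python", "sql excel"]
jobs = ["python sql cloud", "cloud excel sql"]

j = jaccard(top_tokens(resumes, 5000), top_tokens(jobs, 5000))
w = weighted_overlap(relative_freq(resumes), relative_freq(jobs))
print(f"Jaccard: {j:.3f}, weighted overlap: {w:.3f}")
```

With normalized weights the weighted overlap lies between 0 and 1, which makes it easier to compare across corpora than a raw Jaccard on differently sized vocabularies.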

### **6.2 Category-Conditional Overlap**
- For each category, compute Jaccard between résumé subset and job subset tokens.
- Heatmap categories (rows=résumé cats, cols=job cats) colored by token overlap or cosine of average
TF‑IDF vectors.
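
A sketch of the category-by-category overlap matrix, using Jaccard on whitespace token sets. The column names `text` and `category` are placeholders for whatever the cleaned frames actually use.

```python
import pandas as pd


def token_set(texts) -> set:
    """All distinct lowercased whitespace tokens across a group of texts."""
    return {tok for text in texts for tok in text.lower().split()}


def jaccard(a: set, b: set) -> float:
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def overlap_matrix(resumes, jobs, text_col, cat_col) -> pd.DataFrame:
    """Rows = resume categories, columns = job categories, values = Jaccard."""
    r_sets = {c: token_set(g[text_col]) for c, g in resumes.groupby(cat_col)}
    j_sets = {c: token_set(g[text_col]) for c, g in jobs.groupby(cat_col)}
    # Outer dict keys become columns, inner dict keys become the row index.
    return pd.DataFrame({
        jc: {rc: jaccard(r_sets[rc], j_sets[jc]) for rc in r_sets}
        for jc in j_sets
    })


# Toy frames (made up); column names are assumptions.
resumes = pd.DataFrame({
    "text": ["python sql", "patient care"],
    "category": ["DATA", "HEALTH"],
})
jobs = pd.DataFrame({
    "text": ["python cloud sql", "patient nursing"],
    "category": ["DATA", "HEALTH"],
})
matrix = overlap_matrix(resumes, jobs, text_col="text", cat_col="category")
print(matrix.round(3))
```

The returned frame can be passed straight to `sns.heatmap(matrix, annot=True)`; a strong diagonal would suggest the category mapping lines up with shared vocabulary.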

---

## **7. Skill Supply Vs. Demand**
**Objective:** 
- Show where candidate self‑reported skills (résumé mentions) align or misalign with requested
skills in job postings.

**Purpose:**
- Quantify which skills employers ask for that candidates under-report (and vice versa). Fuels ATS gap feedback &
résumé improvement tips.

**Key Questions:**
1. Which skills are in higher demand than are represented by resumes?
2. Which skills are over-represented in resumes relative to job postings?
3. Are gaps domain-specific?
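
One way to quantify the gap, as a sketch: compare the share of job postings that mention a skill (demand) with the share of résumés that mention it (supply). It assumes skills have already been extracted into one collection per document; the skill lists below are made up.

```python
from collections import Counter

import pandas as pd


def skill_share(docs) -> pd.Series:
    """Share of documents mentioning each skill (each skill counted once per doc)."""
    counts = Counter(skill for doc in docs for skill in set(doc))
    return pd.Series({s: c / len(docs) for s, c in counts.items()}, dtype=float)


def skill_gap(resume_skills, job_skills) -> pd.DataFrame:
    """Positive gap = demanded more often than supplied; negative = the reverse."""
    gap = pd.DataFrame({
        "supply": skill_share(resume_skills),
        "demand": skill_share(job_skills),
    }).fillna(0.0)
    gap["gap"] = gap["demand"] - gap["supply"]
    return gap.sort_values("gap", ascending=False)


# Toy per-document skill lists (made up).
resume_skills = [["excel", "python"], ["excel"], ["excel", "sql"], ["excel"]]
job_skills = [["python", "sql"], ["python", "aws"], ["python"], ["sql", "excel"]]
gap = skill_gap(resume_skills, job_skills)
print(gap)
```

Calling `skill_gap` separately on each domain's subset is a simple way to check whether the gaps are domain-specific.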

## **8. Prototype Recommender**
**Steps:**
1. Pick K random résumés.
2. For each, compute cosine similarity vs all job TF‑IDF vectors (bag‑of‑words baseline; embeddings
later).
3. Show top 5 jobs; eyeball whether category alignment is reasonable.
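
The steps above can be sketched with scikit-learn's `TfidfVectorizer` and `cosine_similarity`. Fitting the vectorizer on the job texts only is an assumption made here for simplicity (fitting on both corpora is an equally plausible choice), and `top_jobs` is a hypothetical helper.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def top_jobs(resume_texts, job_texts, top_k=5):
    """Indices and scores of the top_k most similar jobs for each resume."""
    vectorizer = TfidfVectorizer(stop_words="english")
    job_matrix = vectorizer.fit_transform(job_texts)   # vocabulary learned from jobs
    resume_matrix = vectorizer.transform(resume_texts)  # resumes projected into it
    sims = cosine_similarity(resume_matrix, job_matrix)  # shape: (n_resumes, n_jobs)
    order = np.argsort(-sims, axis=1)[:, :top_k]         # best match first
    scores = np.take_along_axis(sims, order, axis=1)
    return order, scores


# Toy texts (made up).
resumes = ["python developer sql", "registered nurse patient care"]
jobs = ["senior python developer", "icu nurse patient care", "pastry chef"]
order, scores = top_jobs(resumes, jobs, top_k=2)
for i, (idx, sc) in enumerate(zip(order, scores)):
    print(f"resume {i}: " + ", ".join(f"{jobs[j]!r} ({s:.2f})" for j, s in zip(idx, sc)))
```

For the eyeball check, printing each résumé's category next to the returned job titles is usually enough to see whether the bag-of-words baseline respects category alignment.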

---

## **Conclusions**

### **Save Final Datasets**

---

# Notes