# **Exploratory Data Analysis**

## **1. Introduction**

### **Notebook Overview**

---

### **EDA Objectives**

**Goal**: Build a succinct but high-leverage EDA that:
1. **Validates data readiness**
2. **Characterizes category & text distributions** to guide vectorizer and model decisions
3. **Assesses resume / job domain alignment** so that similarity scores are interpretable
4. **Surfaces feature signals** that motivate the classifier phase

---

### **Key Questions to Explore / Goals**

1. *We’ve consolidated heterogeneous résumé labels into a stable category schema. Are categories
balanced? Are some under‑represented (affects model choice & evaluation)?*
2. *Our text cleaning pipeline produced reasonably normalized documents. Are lengths sane? Any empty /
near‑empty docs that need dropping?*
3. *Résumés and job postings live in related but not identical vocabularies. Quantify overlap → motivates
TF‑IDF vs domain‑invariant embeddings.*
4. *Certain tokens/skills strongly associate with categories. Justifies supervised modelling & informs
interpretability features in the prototype app.*
5. *There is (or isn’t) enough signal alignment between supply (resumes) and demand (jobs) to support
recommender ranking. Drives how heavily to weight category filters before SBERT similarity.*
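
Question 3 asks for a quantified vocabulary overlap. One simple way to do that is Jaccard similarity between the two vocabularies. The sketch below is a minimal, dependency-free illustration and not the notebook's actual method: `build_vocab`, `vocab_overlap`, and the toy documents are hypothetical, and in practice the inputs would presumably be the cleaned text columns (`resume_clean`, `desc_clean`).

```python
from collections import Counter

def build_vocab(docs, min_count=1):
    """Return the set of whitespace tokens that occur at least `min_count` times."""
    counts = Counter(tok for doc in docs for tok in doc.split())
    return {tok for tok, n in counts.items() if n >= min_count}

def vocab_overlap(docs_a, docs_b, min_count=1):
    """Jaccard similarity |A & B| / |A | B| between the two vocabularies."""
    a = build_vocab(docs_a, min_count)
    b = build_vocab(docs_b, min_count)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# Toy stand-ins for cleaned résumé and job-description text (hypothetical data)
resume_docs = ["python sql project manager", "python teaching summary"]
job_docs = ["python engineer sql", "manager engineer benefits"]

overlap = vocab_overlap(resume_docs, job_docs)  # 3 shared tokens out of 8 distinct
```

A low overlap would argue for embeddings that generalize across wording, while a high overlap suggests TF-IDF may already be adequate. Raising `min_count` filters out rare tokens that would otherwise dominate the union.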

---

### **Dataset Descriptions**

#### **LinkedIn Job Postings Dataset**
**Original Dataset**: [LinkedIn Job Postings (2023 - 2024)](https://www.kaggle.com/datasets/arshkon/linkedin-job-postings) by [Arsh Koneru](https://www.kaggle.com/arshkon) and [Zoey Yu Zou](https://www.kaggle.com/zoeyyuzou)
- Contains job titles, descriptions, industries, and metadata.
- We primarily focus on the `title` and `description` fields for text processing.

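**Cleaned Jobs Dataset**:
- Processed with spaCy; adds cleaned, lemmatized, and tokenized versions of the description, extracted skills and domains, and text statistics (the `title_clean` and `desc_*` columns shown in the Data Overview).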

#### **Resume Dataset**
**Original Dataset**: [Resume Dataset](https://www.kaggle.com/datasets/snehaanbhawal/resume-dataset/data) by [Snehaan Bhawal](https://www.kaggle.com/snehaanbhawal)
- Contains labeled résumé texts (`Resume_str`) across multiple categories.
- The `Category` field serves as the ground-truth label for classifier training.

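**Cleaned Resume Dataset**:
- Processed with spaCy; adds cleaned, lemmatized, and tokenized versions of the résumé text, extracted skills and domains, and text statistics (the `resume_*` columns shown in the Data Overview).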

### **Importing Packages and Configuring Directory Pathing**

In [1]:
# REMOVE THIS BEFORE SUBMISSION
%load_ext autoreload
%autoreload 2

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from joblib import Parallel, delayed

In [16]:
from jobrec import config
from jobrec import visualizer as vis
from jobrec.preprocessing import _nlp
from jobrec.spacy_df_io import save_spacy_df, load_spacy_df

INFO: Pandarallel will run on 32 workers.
INFO: Pandarallel will use Memory file system to transfer data between the main process and workers.


In [4]:
import warnings
warnings.simplefilter(action='ignore', category=UserWarning)
warnings.simplefilter(action='ignore', category=FutureWarning)

---

## **2. Data Overview**

### **Jobs Dataset**

In [17]:
# Load jobs dataframe using custom utility
jobs_df = load_spacy_df(config.CORPUS_DATA_DIR/"jobs", _nlp)

In [18]:
jobs_df.head()

Unnamed: 0,job_id,title,title_clean,skill_name,industry_name,description,desc_clean,desc_clean_lemmatized,desc_clean_tokens,desc_skills,desc_domains,desc_text_length,desc_avg_word_length,desc_unique_word_count,desc_lexical_diversity
73989,3902944011,Senior Automation Engineer - Power Systems,senior automation engineer power systems,"[information technology, engineering]",[oil and gas],The Senior Automation / Power Systems Engineer...,the senior automation power systems engineer w...,senior automation power system engineer primar...,"(the, senior, automation, power, systems, engi...","[engineering, design, development, communicati...",[engineering],635,6.059843,335,0.527559
59308,3901960222,DISH Installation Technician - Field,dish installation technician field,"[information technology, engineering]",[telecommunications],"Company Summary\n\nDISH, an EchoStar Company, ...",company summary dish an echostar company has b...,company summary dish echostar company reimagin...,"(company, summary, dish, an, echostar, company...","[leadership, installation]",[business],466,5.193133,260,0.55794
44663,3900944095,Order Builder,order builder,"[management, manufacturing]",[manufacturing],Division: North Alabama\n\nDepartment : Oxford...,division north alabama department oxford wareh...,division north alabama department oxford wareh...,"(division, north, alabama, department, oxford,...",[management],"[education, business]",439,6.214123,291,0.66287
81954,3903878594,"Mountain Multimedia Journalist, KMGH",mountain multimedia journalist kmgh,"[writing/editing, marketing, public relations]",[broadcast media production and distribution],"KMGH, the E.W. Scripps Company ABC affiliate i...",kmgh the e w scripps company abc affiliate in ...,kmgh e w scripps company abc affiliate denver ...,"(kmgh, the, e, w, scripps, company, abc, affil...",[leadership],[business],833,5.370948,446,0.535414
113151,3905670593,Licensed Practical Nurse (LPN),licensed practical nurse lpn,[health care provider],[hospitals and health care],"Come for the Flexibility, Stay for the Culture...",come for the flexibility stay for the culture ...,come flexibility stay culture need life work l...,"(come, for, the, flexibility, stay, for, the, ...",[],[],305,5.37377,204,0.668852
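
The four statistics columns appear to be related: in the rows above, `desc_lexical_diversity` matches `desc_unique_word_count / desc_text_length` (for example, 335 / 635 ≈ 0.527559). The sketch below shows one way such statistics could be derived from a cleaned string. It is an assumption about how the columns were built, not the project's actual preprocessing code, and `text_stats` is a hypothetical helper.

```python
def text_stats(text):
    """Word-level statistics for an already-cleaned, whitespace-separated string.

    Assumption: length is counted in words and diversity is unique / total words.
    """
    words = text.split()
    n_words = len(words)
    n_unique = len(set(words))
    return {
        "text_length": n_words,
        "avg_word_length": sum(map(len, words)) / n_words if n_words else 0.0,
        "unique_word_count": n_unique,
        "lexical_diversity": n_unique / n_words if n_words else 0.0,
    }

stats = text_stats("come for the flexibility stay for the culture")

# Consistency check against the first displayed row: 335 unique words out of 635
first_row_ratio = round(335 / 635, 6)
```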


In [14]:
tokens = jobs_df['desc_clean_tokens'].iloc[0]

In [15]:
type(tokens)

str

### **Resume Dataset**

In [19]:
resume_df = load_spacy_df(config.CORPUS_DATA_DIR/"resumes", _nlp)

In [20]:
resume_df.head()

Unnamed: 0,Category,resume,resume_clean,resume_clean_lemmatized,resume_clean_tokens,resume_skills,resume_domains,resume_text_length,resume_avg_word_length,resume_unique_word_count,resume_lexical_diversity
420,TEACHER,Kpandipou Koffi Summary ...,kpandipou koffi summary compassionate teaching...,kpandipou koffi summary compassionate teaching...,"(kpandipou, koffi, summary, compassionate, tea...","[management, marketing, design, communication,...","[education, business, marketing]",675,6.591111,378,0.56
1309,DIGITAL-MEDIA,DIRECTOR OF DIGITAL TRANSFORMATION ...,director of digital transformation executive p...,director digital transformation executive prof...,"(director, of, digital, transformation, execut...","[management, marketing, design, development, l...","[education, tech, business, marketing]",845,5.733728,339,0.401183
2023,CONSTRUCTION,SENIOR PROJECT MANAGER Professi...,senior project manager professional summary am...,senior project manager professional summary am...,"(senior, project, manager, professional, summa...","[management, marketing, development, communica...","[finance, marketing, construction, education, ...",688,6.476744,324,0.47093
1360,CHEF,CHEF Summary Experienced ca...,chef summary experienced catering chef skilled...,chef summary experience catering chef skille p...,"(chef, summary, experienced, catering, chef, s...",[management],"[business, retail]",180,5.911111,102,0.566667
2186,BANKING,OPERATIONS MANAGER Summary E...,operations manager summary experienced client ...,operation manager summary experience client se...,"(operations, manager, summary, experienced, cl...","[management, sales, development, communication]","[education, business, sales, legal]",602,6.523256,296,0.491694
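
With the length columns available, the "empty / near-empty docs" question from the introduction can be checked directly. The following is a minimal sketch on a toy frame that mimics the columns above; `flag_short_docs` is a hypothetical helper and the 20-word threshold is an arbitrary assumption that should be tuned after looking at the real length distribution.

```python
import pandas as pd

def flag_short_docs(df, length_col, min_words=20):
    """Split `df` into (kept, dropped) using a minimum word-count threshold."""
    too_short = df[length_col].fillna(0) < min_words
    return df[~too_short], df[too_short]

# Toy frame mimicking the resume_df columns (hypothetical values)
toy = pd.DataFrame({
    "Category": ["TEACHER", "CHEF", "BANKING", "CHEF"],
    "resume_text_length": [675, 180, 0, 5],
})

kept, dropped = flag_short_docs(toy, "resume_text_length", min_words=20)
summary = toy["resume_text_length"].describe()  # count, mean, std, min, quartiles, max
```

The same call should work for the jobs frame by passing `desc_text_length` as `length_col`.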


---

## **3. Job Listings EDA**

In [None]:
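# Placeholder: deliberately halts "Run All" here until the job-listings EDA is written
raise NotImplementedError("Job Listings EDA not written yet")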

---

## **4. Resume EDA**

In [None]:
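# Build one word cloud per category in parallel (n_jobs=-1 uses all cores).
# NOTE: `categories`, `df_categories`, and `generate_wordcloud_from_df` must be defined
# before this cell runs; each result is consumed below as a (category, wordcloud) pair.
results = Parallel(n_jobs=-1)(
    delayed(generate_wordcloud_from_df)(df_cat, category)
    for df_cat, category in zip(df_categories, categories)
)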

In [None]:
ax = resume_df['Category'].value_counts().sort_index().plot(kind='bar', figsize=(12, 6))
ax.set(xlabel='Category', ylabel='Number of résumés', title='Résumé count per category')
plt.show()
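
The bar chart shows balance visually. A numeric summary makes it easier to decide whether stratified splits or class weights are needed. The sketch below is a hedged illustration on toy labels; `class_balance` is a hypothetical helper, and the max/min ratio is just one common heuristic for imbalance.

```python
import pandas as pd

def class_balance(labels):
    """Return a per-class count/share table and the max/min imbalance ratio."""
    counts = labels.value_counts()
    table = pd.DataFrame({"count": counts, "share": counts / counts.sum()})
    ratio = counts.max() / counts.min()
    return table, ratio

# Toy labels standing in for resume_df['Category'] (hypothetical data)
toy_labels = pd.Series(["TEACHER"] * 6 + ["CHEF"] * 3 + ["BANKING"] * 1)

table, ratio = class_balance(toy_labels)  # ratio is 6.0 for this toy input
```

On the real data, the call would be `class_balance(resume_df['Category'])`.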

In [None]:
plt.figure(figsize=(32, 28))

for i, (category, wc) in enumerate(results):
    plt.subplot(6, 4, i + 1).set_title(category)
    plt.imshow(wc)
    plt.axis('off')

plt.tight_layout()
plt.show()

In [None]:
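fig = plt.figure(figsize=(32, 64))

for i, category in enumerate(categories):
    # `wordfreq` is expected to return a frame with 'Word' and 'Frequency' columns
    wf = wordfreq(df_categories[i])

    ax = fig.add_subplot(12, 2, i + 1)
    ax.set_title(category)
    ax.bar(wf['Word'], wf['Frequency'])
    ax.set_ylim(0, 1500)  # shared y-range so panels are directly comparable

plt.show()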
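
The cell above relies on a `wordfreq` helper that is not defined in this notebook. As a rough stand-in, the sketch below counts whitespace tokens and returns the `Word` / `Frequency` columns the plotting loop expects. The default column name, the `top_n` parameter, and the tokenization are all assumptions, and the project's real helper may differ.

```python
from collections import Counter

import pandas as pd

def wordfreq(df, text_col="resume_clean_lemmatized", top_n=20):
    """Hypothetical helper: top-`top_n` tokens in `text_col` as a Word/Frequency frame."""
    counts = Counter(
        tok
        for doc in df[text_col].dropna()  # skip missing documents
        for tok in doc.split()
    )
    return pd.DataFrame(counts.most_common(top_n), columns=["Word", "Frequency"])

# Toy frame standing in for one entry of df_categories (hypothetical data)
toy = pd.DataFrame({"resume_clean_lemmatized": ["chef catering chef", "chef menu", None]})
wf = wordfreq(toy, top_n=2)
```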

---

## **5. Comparison**

---

## **6. Conclusion and Insights**