# Resume-Job Matching Model

The pipeline is designed to match resumes to job postings by:
- Preprocessing and cleaning text data.
- Building embeddings and token-based features.
- Combining semantic similarity (embeddings), lexical similarity (BM25), and overlap features.
- Training a LightGBM LambdaRank model to rank jobs for each resume.

The model is trained on a 2022 listing of New York City government job openings and a set of web-scraped resumes, available at <https://www.kaggle.com/datasets/anandaramg/nyc-jobs-openings-2022> and <https://www.kaggle.com/datasets/snehaanbhawal/resume-dataset>, respectively.

## Stop-word Handling Overview

In this pipeline, a custom stop-word list is used to filter out tokens that are
unlikely to contribute meaningful information when matching resumes to jobs.

- Generic function words: e.g., the, and, to, in, of
  These are common across all texts and add little discriminative value.

- Domain-specific filler words: e.g., management, department, coordinator, assistant 
  These appear frequently in job postings but do not help distinguish between roles.

### How it works
- During tokenization (`tokenize_informative`), any token that:
  - is shorter than 3 characters,  
  - is in the `STOPWORDS` set, or  
  - is purely numeric  
  is removed from the token list.

- The resulting informative tokens are then used for:
  - Building the informative vocabulary 
  - Computing BM25 lexical similarity
  - Calculating resume–job token overlaps
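
The three filtering rules above can be illustrated with a minimal sketch. `STOPWORDS_DEMO` and `tokenize_informative_demo` are hypothetical stand-ins introduced only for this example; the demo set is a small subset chosen for illustration, not the notebook's full `STOPWORDS` set.

```python
# Hypothetical, reduced stop-word set used only for this illustration.
STOPWORDS_DEMO = {"the", "and", "with", "management", "department"}

def tokenize_informative_demo(text):
    """Drop tokens that are shorter than 3 characters, stop-words, or purely numeric."""
    return [
        t for t in text.split()
        if len(t) > 2 and t not in STOPWORDS_DEMO and not t.isdigit()
    ]

text = "senior data analyst with 5 years of sql and python in the finance department 2022"
tokens = tokenize_informative_demo(text)
print(tokens)
# ['senior', 'data', 'analyst', 'years', 'sql', 'python', 'finance']
```

Short function words such as "of" and "in" are removed by the length rule alone, so they do not strictly need to appear in the stop-word set.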


In [24]:
import os, re
import numpy as np
import pandas as pd
from collections import Counter
from sentence_transformers import SentenceTransformer
import lightgbm as lgb
from rank_bm25 import BM25Okapi 

JOBS_CSV = "NYC_Jobs.csv"
RESUMES_CSV = "Resume.csv"

MODEL_NAME = "all-MiniLM-L6-v2"
RANDOM_STATE = 42

USE_SUBSET = True
MAX_RESUMES = 1000
MAX_JOBS = 3000

TOP_K = 5
CAP_POSITIVES_PER_RESUME = 3
NEGATIVES_PER_RESUME = 100

RESULTS_DIR = "./results"
os.makedirs(RESULTS_DIR, exist_ok=True)

STOPWORDS = {
    "the","and","for","with","a","to","in","of","on","at","by","from","or","as","an","be","is","are","was","were",
    "management","division","department","coordinator","assistant","director","supervisor","support","services",
    "city","office","program","unit","team","senior","junior","lead","staff","administration","administrative", "analyst"
}

## Data Loading and Preprocessing

**`load_jobs()`** 
- Input: Path to a CSV file containing job postings.
- Output: DataFrame with non-null "Business Title" and "Preferred Skills".
- Goal: Load and clean job postings for downstream processing.


**`load_resumes()`** 
- Input: Path to a CSV file containing resumes.
- Output: DataFrame with non-null "Resume_str". Adds "Category" column if missing.
- Goal: Load resumes and ensure consistent schema.

**`clean_text()`**
- Input: Raw text string.
- Output: Lowercased, alphanumeric-only, whitespace-normalized string.
- Goal: Normalize text for tokenization and embedding.
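
As a quick illustration of this normalization, the sketch below applies the same three steps (lowercase, replace non-alphanumeric characters with spaces, collapse whitespace) to a few sample strings. `clean_text_demo` is a hypothetical copy made for this example, and the sample inputs are invented.

```python
import re

def clean_text_demo(text):
    """Lowercase, replace anything outside a-z, 0-9, and whitespace with a space, collapse whitespace."""
    if not isinstance(text, str):
        return ""
    text = text.lower()
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    text = re.sub(r"\s+", " ", text)
    return text.strip()

cleaned = clean_text_demo("Sr. Data-Analyst (SQL/Python), 5+ yrs!")
print(cleaned)   # sr data analyst sql python 5 yrs

# Non-ASCII letters are treated as punctuation, which splits accented words.
accented = clean_text_demo("Résumé")
print(accented)  # r sum

# Non-string input (for example NaN from pandas) maps to an empty string.
missing = clean_text_demo(float("nan"))
```

The accented example suggests that resumes containing non-English characters may lose tokens at this stage, which could matter if the corpus is not purely ASCII.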

**`preprocess_jobs()`**
- Input: Job postings DataFrame.
- Output: Adds "job_text" column (cleaned concatenation of title + skills).
- Goal: Create a unified text field for jobs.

**`preprocess_resumes()`**
- Input: Resume DataFrame.
- Output: Adds "resume_text" column (cleaned resume text).
- Goal: Create a unified text field for resumes.

**`subset_frames()`**
- Input: Job and resume DataFrames, max sample sizes.
- Output: Subsampled DataFrames.
- Goal: Enable faster debugging and model training by limiting dataset size.


In [17]:
def load_jobs(path: str) -> pd.DataFrame:
    df = pd.read_csv(path)
    df = df.dropna(subset=["Business Title", "Preferred Skills"])
    return df.reset_index(drop=True)

def load_resumes(path: str) -> pd.DataFrame:
    df = pd.read_csv(path)
    df = df.dropna(subset=["Resume_str"])
    if "Category" not in df.columns:
        df["Category"] = ""
    return df.reset_index(drop=True)

def clean_text(text: str) -> str:
    if not isinstance(text, str): return ""
    text = text.lower()
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    text = re.sub(r"\s+", " ", text)
    return text.strip()

def preprocess_jobs(df: pd.DataFrame) -> pd.DataFrame:
    df["job_text"] = (df["Business Title"].fillna("") + " " +
                      df["Preferred Skills"].fillna("")).map(clean_text)
    return df

def preprocess_resumes(df: pd.DataFrame) -> pd.DataFrame:
    df["resume_text"] = df["Resume_str"].map(clean_text)
    return df

def subset_frames(jobs_df, resumes_df, max_jobs=MAX_JOBS, max_resumes=MAX_RESUMES):
    if USE_SUBSET:
        jobs_df = jobs_df.sample(n=min(max_jobs, len(jobs_df)), random_state=RANDOM_STATE).reset_index(drop=True)
        resumes_df = resumes_df.sample(n=min(max_resumes, len(resumes_df)), random_state=RANDOM_STATE).reset_index(drop=True)
        print(f"Using subset: jobs={len(jobs_df)}, resumes={len(resumes_df)}")
    return jobs_df, resumes_df

## Tokenization, Vocabulary, Embeddings
**`tokenize_informative()`**
- Input: Cleaned text string.
- Output: List of tokens (length > 2, not in stopwords, not numeric).
- Goal: Extract informative tokens for overlap and BM25.
                                  
**`build_informative_vocab()`**
- Input: Job and resume DataFrames.
- Output: Set of tokens whose document frequency is at least `min_df` (default 5) and at most `max_df_frac` (default 10%) of all documents.
- Goal: Build a domain-specific vocabulary of informative words.
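
The document-frequency filter can be sketched on a toy corpus. `build_vocab_demo`, the four sample documents, and the thresholds used here are assumptions chosen so that the effect is visible on a tiny input; they are not the notebook's defaults.

```python
from collections import Counter

def build_vocab_demo(docs, min_df=2, max_df_frac=0.75):
    """Keep tokens that appear in at least min_df documents and at most max_df_frac of them."""
    df_counts = Counter()
    for doc in docs:
        # Count each token once per document (document frequency, not term frequency).
        df_counts.update(set(doc.split()))
    max_df = int(max_df_frac * len(docs))
    return {tok for tok, df in df_counts.items() if min_df <= df <= max_df}

docs = [
    "python sql experience",
    "python java experience",
    "nursing care experience",
    "sql reporting experience",
]
vocab_demo = build_vocab_demo(docs)
print(sorted(vocab_demo))  # ['python', 'sql']
```

Here "experience" is dropped for appearing in every document, while "java", "nursing", "care", and "reporting" are dropped for appearing only once.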


**`class Embedder`**
- Purpose: Wraps a SentenceTransformer model for encoding text into embeddings.
- Methods:
  - `__init__()`: Loads transformer model.
  - `encode()`: Returns NumPy embeddings for a list of texts.
- Goal: Provide semantic embeddings for resumes and jobs.


In [18]:
def tokenize_informative(text: str):
    return [t for t in text.split() if len(t) > 2 and t not in STOPWORDS and not t.isdigit()]

def build_informative_vocab(jobs_df, resumes_df, min_df=5, max_df_frac=0.1):
    docs = jobs_df["job_text"].tolist() + resumes_df["resume_text"].tolist()
    n_docs = len(docs)
    df_counts = Counter()
    for d in docs:
        df_counts.update(set(tokenize_informative(d)))
    max_df = int(max_df_frac * n_docs)
    vocab = {tok for tok, df in df_counts.items() if df >= min_df and df <= max_df}
    print(f"Informative vocab size: {len(vocab)}")
    return vocab

class Embedder:
    def __init__(self, model_name=MODEL_NAME):
        self.model = SentenceTransformer(model_name)
    def encode(self, texts):
        return self.model.encode(texts, show_progress_bar=False, convert_to_numpy=True)

## Feature Building (BM25 + Overlap + Embeddings)
**`compute_resume_job_sims()`**
- Input: Resume embeddings, job embeddings.
- Output: List of cosine similarity arrays (one per resume).
- Goal: Compute semantic similarity between resumes and jobs.
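
The same cosine similarities can also be computed with a single matrix product, which may be faster than a per-resume loop on larger inputs. The sketch below is an alternative formulation, not the notebook's implementation; `cosine_sim_matrix` and the toy 2-dimensional embeddings are hypothetical.

```python
import numpy as np

def cosine_sim_matrix(resume_embs, job_embs, eps=1e-9):
    """Return an (n_resumes, n_jobs) matrix of cosine similarities."""
    # Normalize each row to unit length; eps guards against zero vectors.
    r = resume_embs / (np.linalg.norm(resume_embs, axis=1, keepdims=True) + eps)
    j = job_embs / (np.linalg.norm(job_embs, axis=1, keepdims=True) + eps)
    # Dot products of unit vectors are cosine similarities.
    return r @ j.T

resume_embs = np.array([[1.0, 0.0], [1.0, 1.0]])
job_embs = np.array([[2.0, 0.0], [0.0, 3.0], [1.0, 1.0]])
sims = cosine_sim_matrix(resume_embs, job_embs)
print(sims.shape)  # (2, 3)
```

Row `i` of the result plays the role of the per-resume similarity array for resume `i`.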
    
**`is_positive_pair()`**
- Input: Resume text, job text, similarity scores, job index, vocab, thresholds.
- Output: Boolean indicating if the resume–job pair is a positive match.
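- Goal: Weak labeling heuristic. A pair is labeled positive when either condition holds:
  - The embedding similarity is at or above the per-resume percentile threshold and the pair shares at least one token from the informative vocabulary.
  - The Jaccard overlap of the informative tokens is at or above `jaccard_threshold`.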
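
A small worked example may make the labeling rule concrete. The sketch below takes token sets directly instead of raw text; `weak_label`, the similarity values, and the token sets are all hypothetical and chosen so that each branch of the rule is exercised.

```python
import numpy as np

def weak_label(r_tokens, j_tokens, sims_for_resume, j_idx, vocab,
               emb_sim_percentile=0.9, jaccard_threshold=0.2):
    """Positive if (informative overlap AND high similarity) OR high Jaccard."""
    shared = r_tokens & j_tokens
    union = r_tokens | j_tokens
    jaccard = len(shared) / len(union) if union else 0.0
    informative_overlap = len(shared & vocab) > 0
    # Per-resume cutoff: quantile of this resume's similarities to all jobs.
    threshold = np.quantile(sims_for_resume, emb_sim_percentile)
    high_sim = sims_for_resume[j_idx] >= threshold
    return bool((informative_overlap and high_sim) or jaccard >= jaccard_threshold)

sims = np.array([0.10, 0.20, 0.30, 0.40, 0.90])  # 0.9-quantile is 0.7
vocab = {"sql", "python"}
resume = {"sql", "python", "tableau", "reporting", "dashboards", "statistics"}

# Job 4: high similarity and shares "sql", but Jaccard is only 1/11.
label_a = weak_label(resume, {"sql", "gis", "mapping", "cartography", "surveying", "zoning"}, sims, 4, vocab)
# Job 0: low similarity, but Jaccard is 3/7, which exceeds the threshold.
label_b = weak_label(resume, {"sql", "python", "tableau", "excel"}, sims, 0, vocab)
# Job 1: low similarity and no shared tokens.
label_c = weak_label(resume, {"nursing", "patient", "care"}, sims, 1, vocab)
print(label_a, label_b, label_c)  # True True False
```

The second case shows that a job can be labeled positive on lexical overlap alone, even when its embedding similarity is the lowest in the list.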

**`build_features()`**
- Input: Resume/job embeddings, DataFrames, vocab, BM25 model, token lists, thresholds.
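- Output:
  - X: Feature matrix. Each row holds four scalar features (cosine similarity, dot product, BM25 score, token overlap count) followed by the element-wise absolute difference of the resume and job embeddings.
  - y: Labels (1 = positive, 0 = negative).
  - groups: Group sizes for LambdaRank (number of sampled jobs per resume).
  - pair_index: Resume–job index pairs, one per row of X.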
- Goal: Construct training data for the ranker.
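
The relationship between `groups` and the rows of `X` can be sketched as follows. LambdaRank treats consecutive rows as one query, so the group sizes must sum to the number of rows. The sizes and the feature width below are hypothetical.

```python
import numpy as np

# Hypothetical group sizes: three resumes with 3, 2, and 4 sampled jobs each.
groups = [3, 2, 4]
# Hypothetical feature matrix with one row per (resume, job) pair.
X = np.zeros((sum(groups), 5), dtype=np.float32)

# Rows [start, end) of X belong to a single resume.
bounds = np.cumsum([0] + groups)
row_ranges = [(int(s), int(e)) for s, e in zip(bounds[:-1], bounds[1:])]
print(row_ranges)  # [(0, 3), (3, 5), (5, 9)]
```

Because the grouping is positional, shuffling the rows of `X` without regrouping would silently mix jobs from different resumes.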


In [19]:
def compute_resume_job_sims(resume_embs, job_embs):
    job_norms = np.linalg.norm(job_embs, axis=1) + 1e-9
    sims_per_resume = []
    for r_emb in resume_embs:
        r_norm = np.linalg.norm(r_emb) + 1e-9
        sims = np.dot(job_embs, r_emb) / (job_norms * r_norm)
        sims_per_resume.append(sims)
    return sims_per_resume

def is_positive_pair(r_text, j_text, sims_for_resume, j_idx, vocab,
                     emb_sim_percentile=0.9, jaccard_threshold=0.2):
    # Weak label: positive if (informative overlap AND high embedding similarity) OR high Jaccard.
    r_tokens = set(tokenize_informative(r_text))
    j_tokens = set(tokenize_informative(j_text))
    # At least one shared token must belong to the informative vocabulary.
    informative_overlap = len((r_tokens & j_tokens) & vocab) > 0
    sim = sims_for_resume[j_idx]
    # Per-resume cutoff: the given quantile of this resume's similarities to all jobs.
    perc_threshold = np.quantile(sims_for_resume, emb_sim_percentile)
    inter = len(r_tokens & j_tokens)
    union = len(r_tokens | j_tokens)
    jaccard = (inter / union) if union > 0 else 0.0
    return (informative_overlap and sim >= perc_threshold) or (jaccard >= jaccard_threshold)

def build_features(resume_embs, job_embs, resume_df, job_df, vocab,
                   bm25, job_tokens_list,
                   emb_sim_percentile=0.9, jaccard_threshold=0.2,
                   cap_positives=CAP_POSITIVES_PER_RESUME,
                   negatives_per_resume=NEGATIVES_PER_RESUME):
    rng = np.random.default_rng(RANDOM_STATE)
    X, y, groups, pair_index = [], [], [], []
    job_texts = job_df["job_text"].tolist()
    resume_texts = resume_df["resume_text"].tolist()
    sims_all = compute_resume_job_sims(resume_embs, job_embs)

    for r_idx, r_emb in enumerate(resume_embs):
        sims_for_resume = sims_all[r_idx]
        positives, negatives = [], []
        for j_idx, j_text in enumerate(job_texts):
            if is_positive_pair(resume_texts[r_idx], j_text, sims_for_resume, j_idx, vocab,
                                emb_sim_percentile, jaccard_threshold):
                positives.append(j_idx)
            else:
                negatives.append(j_idx)
        if cap_positives and len(positives) > cap_positives:
            positives = list(rng.choice(positives, size=cap_positives, replace=False))
        chosen_jobs = positives.copy()
        if negatives:
            sampled_negatives = rng.choice(negatives, size=min(negatives_per_resume, len(negatives)), replace=False)
            chosen_jobs.extend(sampled_negatives)
        r_tokens = tokenize_informative(resume_texts[r_idx])
        scores_all = bm25.get_scores(r_tokens)   
        for j_idx in chosen_jobs:
            sim = sims_for_resume[j_idx]
            dot = float(np.dot(r_emb, job_embs[j_idx]))
            bm25_score = scores_all[j_idx]       
            overlap_count = len(set(r_tokens) & set(job_tokens_list[j_idx]))
            feat = np.concatenate([[sim, dot, bm25_score, overlap_count],
                                   np.abs(r_emb - job_embs[j_idx])])
            X.append(feat)
            y.append(1 if j_idx in positives else 0)
            pair_index.append((r_idx, j_idx))
        groups.append(len(chosen_jobs))
    return np.array(X, dtype=np.float32), np.array(y, dtype=np.int16), groups, pair_index

## Ranker (Learning-to-Rank with LightGBM)
**`class Ranker`**
- Purpose: Train and apply a LambdaRank model for resume–job ranking.
- Attributes:
  - params: Default LightGBM LambdaRank parameters.
  - model: Trained LightGBM model.

**`fit()`**
- Input: Training features, labels, group sizes, optional validation set.
- Output: Trains LightGBM LambdaRank model.
- Goal: Learn ranking function from weakly labeled pairs.
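
Since `fit()` accepts an optional validation set, one way to produce it is to hold out whole groups, so that no resume's candidate list is split between training and validation. The sketch below is one possible approach; `split_by_group`, its parameters, and the toy data are hypothetical and not part of the notebook.

```python
import numpy as np

def split_by_group(X, y, groups, valid_frac=0.2, seed=42):
    """Split rows into train/validation sets without breaking up any group."""
    if len(groups) < 2:
        raise ValueError("need at least two groups to split")
    rng = np.random.default_rng(seed)
    starts = np.cumsum([0] + list(groups[:-1]))  # first row of each group
    order = rng.permutation(len(groups))
    n_valid = min(max(1, int(round(valid_frac * len(groups)))), len(groups) - 1)
    valid_ids = np.sort(order[:n_valid])
    train_ids = np.sort(order[n_valid:])

    def take(ids):
        # Gather the contiguous row block of every selected group.
        rows = np.concatenate([np.arange(starts[g], starts[g] + groups[g]) for g in ids])
        return X[rows], y[rows], [groups[g] for g in ids]

    return take(train_ids), take(valid_ids)

# Toy data: five groups covering 12 rows.
groups = [3, 2, 4, 1, 2]
X = np.arange(24, dtype=np.float32).reshape(12, 2)
y = np.zeros(12, dtype=np.int16)
(X_tr, y_tr, g_tr), (X_va, y_va, g_va) = split_by_group(X, y, groups)
print(len(g_tr), len(g_va))  # 4 1
```

The resulting tuples could then be passed as `fit(X_tr, y_tr, g_tr, valid=(X_va, y_va, g_va))`, which would also enable the early-stopping callback.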
    
**`predict()`**
- Input: Feature matrix.
- Output: Predicted ranking scores.
- Goal: Rank jobs for resumes.
                 
**`build_pair_features_for_one_resume()`**
- Input: Resume embedding, job embeddings, BM25 model, job tokens, resume text.
- Output: Feature matrix for all job candidates for a single resume.
- Goal: Generate features for inference (ranking jobs for one resume).


In [20]:
class Ranker:
    def __init__(self, params=None):
        self.model = None
        self.params = params or {
            "objective": "lambdarank",
            "metric": "ndcg",
            "ndcg_eval_at": [5, 10],
            "learning_rate": 0.05,
            "num_leaves": 31,
            "verbosity": -1
        }

    def fit(self, X, y, groups, num_boost_round=200, valid=None,
            early_stopping_rounds=20, log_period=20):
        train_data = lgb.Dataset(X, label=y, group=groups)
        valid_sets = []
        if valid is not None:
            X_val, y_val, groups_val = valid
            valid_sets.append(lgb.Dataset(X_val, label=y_val, group=groups_val))
        callbacks = []
        if valid_sets and early_stopping_rounds:
            callbacks.append(lgb.early_stopping(early_stopping_rounds))
        callbacks.append(lgb.log_evaluation(period=log_period))
        self.model = lgb.train(self.params, train_data, num_boost_round=num_boost_round,
                               valid_sets=valid_sets if valid_sets else None, callbacks=callbacks)

    def predict(self, X):
        return self.model.predict(X, num_iteration=getattr(self.model, "best_iteration", None))

    def build_pair_features_for_one_resume(self, resume_emb, job_embs, bm25, job_tokens_list, resume_text):
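        # Use the same epsilon placement as compute_resume_job_sims so that
        # inference features match the training features.
        sims = np.dot(job_embs, resume_emb) / (
            (np.linalg.norm(job_embs, axis=1) + 1e-9) * (np.linalg.norm(resume_emb) + 1e-9)
        )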
        r_tokens = tokenize_informative(resume_text)
        X = []
        scores_all = bm25.get_scores(r_tokens)
        for j_idx in range(len(job_embs)):
            dot = float(np.dot(resume_emb, job_embs[j_idx]))
            bm25_score = scores_all[j_idx]
            overlap_count = len(set(r_tokens) & set(job_tokens_list[j_idx]))
            feat = np.concatenate([[sims[j_idx], dot, bm25_score, overlap_count],
                                   np.abs(resume_emb - job_embs[j_idx])])
            X.append(feat)
        return np.array(X, dtype=np.float32)

## Evaluation Helpers
**`compute_ranking_metrics()`**

Inputs:
- ranked_jobs: List of job dictionaries in ranked order (best first), each with keys:
  - "skills": set of tokens for the job.
  - "score": predicted relevance score.
  - "id": job identifier.
- resume_tokens: Set of tokens from the resume.
- thres: Dictionary of thresholds for Jaccard similarity, e.g.: {"high": 0.5, "medium": 0.3, "low": 0.1}

Outputs: 
- Dictionary of evaluation metrics:
  - precision5_weak: Fraction of the top-5 jobs with nonzero weak relevance (Jaccard at or above the "low" threshold).
  - ndcg10_weak: Normalized Discounted Cumulative Gain at top-10, using graded relevance (0 to 3) derived from the Jaccard thresholds.
  - separation: Mean score of the first five jobs minus the mean score of the last five jobs within the top-10.
  - diversity_entropy: Entropy of the token distribution across the top-10 jobs' skill sets (higher = more diverse).
  - duplicate_penalty: Count of duplicate job IDs in top-10.

Goal: Provide weak evaluation metrics for ranking quality, balancing relevance, diversity, and redundancy.
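
The NDCG computation can be worked through on a short list of graded relevances. The sketch below isolates that one step; `ndcg_at_k` and the relevance list are hypothetical, with grades on the same 0 to 3 scale described above.

```python
import math

def ndcg_at_k(rels, k=10):
    """NDCG over the first k graded relevances, using a log2 position discount."""
    k = min(k, len(rels))
    discounts = [1.0 / math.log2(i + 2) for i in range(k)]
    dcg = sum(r * d for r, d in zip(rels[:k], discounts))
    # Ideal ordering: the same relevances sorted from best to worst.
    ideal = sorted(rels, reverse=True)[:k]
    idcg = sum(r * d for r, d in zip(ideal, discounts))
    return dcg / idcg if idcg > 0 else 0.0

# A highly relevant job first, an irrelevant job second, then two partial matches.
ndcg = ndcg_at_k([3, 0, 2, 1])
print(round(ndcg, 3))  # 0.93
```

The score falls below 1.0 only because the irrelevant job sits at position 2; the ideal ordering `[3, 2, 1, 0]` would score exactly 1.0.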


In [21]:
def compute_ranking_metrics(ranked_jobs, resume_tokens, thres):
    rels = []
    for j in ranked_jobs:
        inter = len(resume_tokens & j["skills"])
        union = len(resume_tokens | j["skills"])
        jaccard = (inter / union) if union > 0 else 0.0
        if jaccard >= thres["high"]:
            rels.append(3)
        elif jaccard >= thres["medium"]:
            rels.append(2)
        elif jaccard >= thres["low"]:
            rels.append(1)
        else:
            rels.append(0)

    k5, k10 = 5, min(10, len(ranked_jobs))
    precision5 = sum(r > 0 for r in rels[:k5]) / max(1, k5)
    discounts = [1/np.log2(i+2) for i in range(k10)]
    dcg = sum(r * d for r, d in zip(rels[:k10], discounts))
    ideal = sorted(rels, reverse=True)[:k10]
    idcg = sum(r * d for r, d in zip(ideal, discounts)) or 1.0
    ndcg = dcg / idcg

    scores = [j["score"] for j in ranked_jobs[:k10]]
    sep = (np.mean(scores[:min(5, len(scores))]) - np.mean(scores[-min(5, len(scores)):])) if len(scores) >= 2 else 0.0

    top_skills = [s for j in ranked_jobs[:k10] for s in j["skills"]]
    skill_counts = {}
    for s in top_skills:
        skill_counts[s] = skill_counts.get(s, 0) + 1
    probs = np.array(list(skill_counts.values())) / max(1, len(top_skills))
    diversity_entropy = float(-np.sum(probs * np.log(probs + 1e-9))) if len(probs) > 0 else 0.0

    job_ids = [j["id"] for j in ranked_jobs[:k10]]
    duplicates = len(job_ids) - len(set(job_ids))

    return {
        "precision5_weak": float(precision5),
        "ndcg10_weak": float(ndcg),
        "separation": float(sep),
        "diversity_entropy": float(diversity_entropy),
        "duplicate_penalty": float(duplicates)
    }

## Orchestration: End-to-End Pipeline
This section ties together all components into a full workflow, training a LightGBM LambdaRank model.

In [25]:
print("Loading and preprocessing data...")
jobs_df = preprocess_jobs(load_jobs(JOBS_CSV))
resumes_df = preprocess_resumes(load_resumes(RESUMES_CSV))
jobs_df, resumes_df = subset_frames(jobs_df, resumes_df)

print("Embedding texts...")
embedder = Embedder(MODEL_NAME)
job_embeddings = embedder.encode(jobs_df["job_text"].tolist())
resume_embeddings = embedder.encode(resumes_df["resume_text"].tolist())

print("Building informative vocab...")
vocab = build_informative_vocab(jobs_df, resumes_df)

print("Building BM25 index...")
job_tokens_list = [tokenize_informative(t) for t in jobs_df["job_text"].tolist()]
bm25 = BM25Okapi(job_tokens_list)

X, y, groups, pair_index = build_features(
    resume_embeddings, job_embeddings, resumes_df, jobs_df,
    vocab=vocab, bm25=bm25, job_tokens_list=job_tokens_list,
    emb_sim_percentile=0.8, jaccard_threshold=0.1
)
print("Features built")

final_params = {
    "objective": "lambdarank",
    "metric": "ndcg",
    "ndcg_eval_at": [5, 10],
    "learning_rate": 0.1,
    "num_leaves": 15,
    "verbosity": -1
}
final_ranker = Ranker(final_params)
final_ranker.fit(X, y, groups, num_boost_round=300)

Loading and preprocessing data...
Using subset: jobs=3000, resumes=1000
Embedding texts...
Building informative vocab...
Informative vocab size: 7578
Building BM25 index...
Features built


## Inference
Testing the model on new resumes outside of the training data to identify the top 5 job matches it predicts for each.

In [29]:
test_resumes = {
    "Data Analyst": "Experienced data analyst with SQL, Python, and Tableau skills",
    "Healthcare Nurse": "Registered nurse with patient care, clinical experience, and hospital administration background",
    "Teacher": "High school teacher skilled in curriculum design, classroom management, and student engagement",
    "Software Engineer": "Full-stack developer with expertise in Java, React, cloud deployment, and API design",
    "Financial Analyst": "Finance professional with Excel, risk modeling, and investment analysis experience",
    "Graphic Designer": "Creative designer with Adobe Photoshop, Illustrator, and branding project experience"
}

for label, resume_text in test_resumes.items():
    print(f"\n=== Top 5 unique job matches for {label} resume ===")
    
    cleaned = clean_text(resume_text)
    resume_emb = embedder.encode([cleaned])[0]
    
    X_new = final_ranker.build_pair_features_for_one_resume(
        resume_emb, job_embeddings, bm25, job_tokens_list, cleaned
    )
    scores = final_ranker.predict(X_new)

    ranked = []
    for j_idx, score in enumerate(scores):
        job_tokens = set(job_tokens_list[j_idx])
        ranked.append({
            "title": jobs_df.iloc[j_idx]["Business Title"],
            "id": j_idx,
            "skills": job_tokens,
            "score": float(score)
        })

    ranked_df = pd.DataFrame(ranked)
    ranked_unique = (ranked_df.groupby("title", as_index=False)
                                .agg({"score": "max"}))
    
    ranked_unique = ranked_unique.sort_values("score", ascending=False).head(5)
    
    for _, row in ranked_unique.iterrows():
        print(f"  {row['title']} (score: {row['score']:.3f})")



=== Top 5 unique job matches for Data Analyst resume ===
  Analysis and Reporting Analyst (score: 5.285)
  Data Analyst (score: 5.044)
  COOP and Emergency Management Intern (score: 4.645)
  GEOGRAPHIC INFORMATION SYSTEMS (GIS) SPECIALIST (score: 4.541)
  ASSOCIATE DATA SCIENTIST (score: 4.310)

=== Top 5 unique job matches for Healthcare Nurse resume ===
  WTC Case Management Nurse (score: 4.837)
  Supervising Public Health Nurse, Bureau of School Health/SH Nursing Services & Prof Dev (score: 3.292)
  Public Health Nurse, I, Bureau of School Health (score: 2.815)
  Supervising Health Nurse, Bureau of School Health/SH Nursing Services & Prof Dev (score: 1.594)
  Staff Nurse (Part-Time) (score: 1.566)

=== Top 5 unique job matches for Teacher resume ===
  CIVIC ENGAGEMENT STUDIO SUMMER COLLEGE INTERN (score: 2.495)
  Planner (score: -2.151)
  Trout in the Classroom Program Coordinator (score: -2.315)
  RESIDENT ENGAGEMENT ZONE COORDINATOR (score: -2.640)
  BROOKLYN BOROUGH OFFICE SUMME

## Analysis & Improvements
### Current Strengths
- Semantic + Lexical Fusion: By combining transformer embeddings, BM25 scores, and token overlap, the model captures both deep semantic similarity and surface-level keyword alignment. This hybrid approach is expected to be more robust than using embeddings or BM25 alone.
- Weak Labeling Heuristics: The use of embedding similarity percentiles and Jaccard thresholds provides a practical way to generate training labels without manual annotation. While noisy, it enables scalable training.
- Ranking Model (LambdaRank): LightGBM’s LambdaRank objective is well-suited for ranking tasks. It weights pairwise gradients by their effect on NDCG, so errors near the top of the list are penalized most.
- Evaluation Metrics: Precision@5, NDCG@10, separation, diversity entropy, and duplicate penalties give a broader view of ranking quality, covering relevance, diversity, and redundancy.

### Observed Limitations

Stop-word Handling: The current stop-word list is manually defined. This risks being:
- Too broad (removing informative domain terms like “management” in some contexts).
- Too narrow (missing high-frequency filler words not in the list).
- Domain-misaligned (generic stop-words may not reflect the job/resume corpus).

Feature Scope: Current features focus on text similarity and token overlap. Important structured attributes (salary, employment type, seniority, department) are not yet incorporated, which limits the model’s ability to capture practical job–resume fit.

Weak Label Noise: The heuristic labeling (embedding percentile + Jaccard) may misclassify borderline cases, introducing noise into training.

Evaluation Ground Truth: Metrics are based on weak labels rather than human-annotated relevance judgments, so they measure internal consistency more than true accuracy.

Limited Training Data and Jobs: Both input files contain approximately 3,000 rows. Many jobs are not covered by these files, so the model cannot recommend roles that fall outside the scope of the training data.

### Planned Improvements
- Automatic stop-word detection: Replace static stop-word lists with corpus-driven methods (e.g., TF-IDF thresholds, entropy-based filtering, or embedding-cluster detection). This ensures the stop-word set adapts to the domain and avoids arbitrary exclusions.
- Structured job attributes: Incorporate salary range (aligns with candidate expectations), employment type (full-time, part-time, contract), seniority level (junior, mid, senior), and department/division (engineering, finance, HR). These features can be encoded as categorical embeddings or one-hot vectors and concatenated with text-based features.
- Label refinement: Explore semi-supervised learning or active learning to refine labels with minimal human input, and use pairwise preference signals (e.g., “Resume A is a better fit for Job X than Resume B”) to reduce noise.
- Evaluation with human judgments: Collect a small set of human-annotated resume–job matches to benchmark the model. This would provide a more reliable measure of accuracy than weak labels alone.
- Threshold tuning: Automate hyperparameter sweeps for the labeling thresholds (embedding similarity percentile, Jaccard cutoff).
- Reproducibility: Log diagnostics and leaderboard results to support empirical tuning.
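
As one possible starting point for corpus-driven stop-word detection, tokens could be flagged when they occur in more than a chosen fraction of documents. The sketch below uses plain document frequency, which is only one of the methods mentioned above; `detect_stopwords`, the 0.5 cutoff, and the sample corpus are assumptions for illustration.

```python
from collections import Counter

def detect_stopwords(docs, max_df_frac=0.5):
    """Return tokens that occur in more than max_df_frac of the documents."""
    n_docs = len(docs)
    df_counts = Counter()
    for doc in docs:
        # Count each token once per document.
        df_counts.update(set(doc.split()))
    return {tok for tok, df in df_counts.items() if df / n_docs > max_df_frac}

docs = [
    "department of finance budget analyst",
    "department of health public nurse",
    "department of parks gis specialist",
    "software engineer cloud deployment",
]
auto_stopwords = detect_stopwords(docs)
print(sorted(auto_stopwords))  # ['department', 'of']
```

A set derived this way would adapt to the corpus, but the cutoff would still need tuning so that frequent yet discriminative terms are not removed.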
