# Semantic Resume–Job Matching Using Fine-Tuned Sentence-BERT

## Project Documentation

### Team Members
- Maxwell Bernard & Johan Schommartz

### Central Problem, Domain, and Data Characteristics

**Problem:**  
We built a semantic similarity model that predicts how well a candidate’s CV matches a job description, with a focus on the tech industry. In future work, this model could be integrated into a job recommendation system that ranks job postings for a given CV.

**Domain:**  
Semantic matching, embeddings, pseudo-labeling, resume screening, job recommendation.

**Data:** 
- Two Kaggle datasets:
  - [Public resume dataset](https://www.kaggle.com/datasets/suriyaganesh/resume-dataset-structured) (50,000+ CVs with skills, job titles, experience)
  - [Scraped LinkedIn Job Postings](https://www.kaggle.com/datasets/asaniczka/1-3m-linkedin-jobs-and-skills-2024) (1.3m+ jobs with descriptions, required skills)
- Dataset cleaned, normalized and balanced:
  - Removed low-information CVs (e.g. fewer than 4 skills)
  - Filtered to tech-related job titles
  - Normalized skills and job titles using regular expressions (Python `re`)
  - Cleaned job titles and skills
  - Balanced to 25,000 samples per class 
- Paired CV text and job description text, each labeled as:
  - 1 → relevant match
  - 0 → non-match
- Train/val split with stratification for class balance

### Model Architecture and Training

**Model Architecture:**  
- Sentence-BERT (SBERT) bi-encoder (`sentence-transformers/msmarco-bert-base-dot-v5`) was used to generate independent embeddings for CV text and job description text.  
- These embeddings were compared using cosine similarity to estimate match relevance, and pseudo-labels were generated by combining similarity scores with heuristics of skill overlap and job-title alignment.  
- During fine-tuning, we updated the full SBERT model (no layers were frozen), allowing the transformer to adjust its internal representation space toward CV–job alignment.
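- We trained the model using `CosineSimilarityLoss`, which computes the cosine similarity of each CV–job embedding pair and penalizes its squared difference from the pair's label (1 for a match, 0 for a non-match). This pulls matching pairs together and pushes non-matching pairs apart in the embedding space.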
- We did not freeze any layers because SBERT is pretrained for general semantic similarity, not for the CV–job domain. Fine-tuning the entire model allows all transformer layers to adapt to domain-specific patterns (skills and job titles), which leads to a better embedding space for our task. Freezing layers is mainly useful for small datasets or closely related tasks.
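To make the training objective concrete, here is a minimal pure-Python sketch of a cosine-similarity regression loss, written to our understanding of how `CosineSimilarityLoss` behaves in its default configuration (mean squared error between the cosine similarity and the label). The 2-d vectors and the helper names `cosine_similarity` and `cosine_similarity_loss` are made up for illustration; they are not part of the notebook.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def cosine_similarity_loss(pairs):
    """Mean squared error between cos(u, v) and the pair label."""
    errors = [(cosine_similarity(u, v) - label) ** 2 for u, v, label in pairs]
    return sum(errors) / len(errors)


# Toy 2-d "embeddings" (hypothetical values, not real model output).
toy_pairs = [
    ([1.0, 0.0], [1.0, 0.0], 1.0),  # same direction, labelled match
    ([1.0, 0.0], [0.0, 1.0], 0.0),  # orthogonal, labelled non-match
    ([1.0, 0.0], [1.0, 1.0], 0.0),  # 45 degrees apart, labelled non-match
]

loss = cosine_similarity_loss(toy_pairs)
print(f"loss = {loss:.4f}")
```

Only the third pair contributes to the loss here: it is labelled as a non-match but still has a fairly high cosine similarity, which is exactly the kind of pair that fine-tuning pushes apart.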


**Why This Architecture:**  
- SBERT is purpose-built for semantic similarity and sentence-level embeddings.  
- The bi-encoder setup allows independent CV and job embeddings, enabling fast retrieval and scalable matching.  
- It consistently outperforms baseline SBERT on embedding-based tasks due to its contrastive pretraining strategy.
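The retrieval benefit of a bi-encoder can be sketched as follows: job embeddings are computed and normalized once, and each new CV only needs a single embedding plus a dot product per job. This is a toy illustration with hand-written 3-d vectors; `build_index` and `rank_jobs` are hypothetical helper names, and real embeddings would come from the SBERT model.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def build_index(job_embeddings):
    """Normalize every job embedding once, independently of any CV."""
    return {job_id: l2_normalize(emb) for job_id, emb in job_embeddings.items()}


def rank_jobs(cv_embedding, index, top_k=2):
    """Return the top_k (job_id, cosine similarity) pairs for one CV."""
    cv_unit = l2_normalize(cv_embedding)
    scored = [
        (job_id, sum(a * b for a, b in zip(cv_unit, job_unit)))
        for job_id, job_unit in index.items()
    ]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_k]


# Hypothetical embeddings for illustration only.
index = build_index(
    {
        "data_engineer": [0.9, 0.1, 0.0],
        "nurse": [0.0, 0.1, 0.9],
        "backend_developer": [0.7, 0.6, 0.1],
    }
)
top_jobs = rank_jobs([1.0, 0.2, 0.0], index)
print(top_jobs)
```

Because the index does not depend on the CV, it can be reused for every candidate, which is what makes the approach scale to large job corpora.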


**Training Mechanisms:**  
- Training documents were truncated to `max_seq_length = 256` tokens, a length that accommodates most CVs while retaining key information, in order to:
  - Reduce the CPU bottleneck when tokenizing long CVs and job skills.
  - Prevent memory issues from 1000+ token resumes.
- Batch size was set to **16**, which provided the best trade-off between:
  - Memory efficiency (larger batches caused out-of-memory issues due to long sequences)
  - Stable gradient updates for contrastive learning
  - Reasonable training speed on Kaggle hardware.

- Only **1 epoch** of fine-tuning was used because:
  - The dataset is large and already balanced (50k pairs total).
  - SBERT is pretrained for semantic similarity, so it adapts quickly.
  - Additional epochs gave diminishing returns while increasing training time.

- `warmup_steps` was set to **10% of the dataloader size**, following Sentence-Transformers best practices:
  - This gradually increases the learning rate at the start.
  - Prevents unstable early updates, especially with contrastive objectives.
  - Helps the model stabilise before reaching full learning rate.

- `collate_fn = ft_model.smart_batching_collate` was used because:
  - Smart batching groups samples by similar sequence lengths.
  - This minimizes padding and speeds up training.
  - It improves memory efficiency when working with long, uneven CV and job texts.
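The warmup setting above can be sketched numerically. The snippet below assumes a hypothetical training set of 40,000 pairs (the actual train/validation split sizes are not restated here) and a hypothetical peak learning rate of `2e-5`; `warmup_steps_for` and `lr_at_step` are illustrative helpers, not library functions.

```python
import math


def warmup_steps_for(num_pairs, batch_size, epochs=1, warmup_fraction=0.1):
    """Return (total optimizer steps, warmup steps) for a training run."""
    steps_per_epoch = math.ceil(num_pairs / batch_size)
    total_steps = steps_per_epoch * epochs
    return total_steps, round(total_steps * warmup_fraction)


def lr_at_step(step, total_steps, warmup_steps, peak_lr):
    """Linear warmup from 0 to peak_lr, then linear decay back to 0."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    remaining = total_steps - step
    return peak_lr * max(0.0, remaining / (total_steps - warmup_steps))


# Hypothetical numbers: 40,000 training pairs, batch size 16, 1 epoch.
total_steps, warmup_steps = warmup_steps_for(num_pairs=40_000, batch_size=16)
peak_lr = 2e-5  # assumed value for illustration

for step in (0, 125, 250, 1375, 2500):
    lr = lr_at_step(step, total_steps, warmup_steps, peak_lr)
    print(f"step {step:>4}: lr = {lr:.2e}")
```

With these assumed numbers the model takes 2,500 optimizer steps in total and spends the first 250 ramping up, so the largest updates only happen once the gradients have had time to settle.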

### Key Experiments and Results

#### Baseline (no fine-tuning)
Metrics:
- ROC–AUC: 0.9988 
- Accuracy: 0.4834 

Score distributions:
  - Positives: mean ≈ 0.95
  - Negatives: mean ≈ 0.89
  - Margin: 0.0627

Baseline SBERT ranked positive pairs above negatives almost perfectly (hence the very high ROC–AUC), but both classes received cosine similarities in nearly the same narrow, high range. Because the margin between the class means was tiny (≈ 0.06), accuracy depended heavily on where the decision threshold sat, and the threshold used in our evaluation produced very low accuracy.

Baseline SBERT understands language well, but lacks resume/job domain alignment.
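A near-perfect ROC–AUC next to a low accuracy can look contradictory, so here is a minimal sketch of how both can hold at once. The scores and both thresholds are made up for illustration and are not taken from our evaluation: ROC–AUC only measures ranking, while accuracy depends on one specific cut-off.

```python
def roc_auc(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def accuracy(labels, scores, threshold):
    """Accuracy when every score >= threshold is predicted as a match."""
    preds = [1 if s >= threshold else 0 for s in scores]
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)


# Hypothetical scores: every pair scores high, positives only slightly higher.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.97, 0.95, 0.94, 0.91, 0.89, 0.87]

auc = roc_auc(labels, scores)                   # ranking is perfect
acc_generic = accuracy(labels, scores, 0.5)     # generic cut-off: all predicted positive
acc_tuned = accuracy(labels, scores, 0.925)     # cut-off placed inside the small margin
print(auc, acc_generic, acc_tuned)
```

In this toy case the ranking is flawless, yet a generic cut-off labels every pair as a match. A wider margin, as obtained after fine-tuning, makes accuracy far less sensitive to the exact threshold.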

#### Fine-tuned SBERT
Metrics:
- ROC–AUC: 0.9993
- Accuracy: 0.9893 

Score distributions:
  - Positives: mean ≈ 0.93
  - Negatives: mean ≈ 0.03
  - Margin: 0.9039

Fine-tuning widened the separation between positive and negative examples. Positive pairs were pushed toward cosine ≈ 1.0 and negatives toward ≈ 0.0, making the classes easily separable. This resulted in a very high accuracy of 98.93% while maintaining a high ROC–AUC.

Fine-tuning significantly improves the model’s ability to capture CV–job relevance.

#### Embedding Visualisation
- The Distribution of Cosine Similarity Scores visual shows:
  - Before fine-tuning, positive and negative scores cluster tightly in nearly the same high-similarity range, leaving almost no margin between them.
  - After fine-tuning, the distributions separate cleanly with minimal overlap, reflecting the much larger similarity margin.

## Discussion: Summary and Lessons Learned

### What Worked Well
- Fine-tuning SBERT provided large performance improvements  
- Data balancing prevented the model from collapsing to the majority class  
- 256-token truncation gave a good trade-off between speed and accuracy  

### What Could Be Improved
- We could expand beyond tech-focused roles by incorporating CVs and job descriptions from other domains (e.g., healthcare, finance, education), which would make the model more generalisable and better at understanding a wider variety of career paths.
  
- Although we already performed basic normalisation (mapping acronyms like “js” → “javascript” and cleaning job-title variants), we could push this further by using a full standard classification for skills and titles and applying more aggressive variant merging. This would unify noisy data under consistent labels and produce cleaner, more reliable model inputs.
  
- We could apply hard-negative mining by adding non-matching CV–job pairs that appear highly similar (e.g., “Data Analyst” vs. “Business Analyst”), which would encourage the model to learn subtle semantic differences. This would make it easier for the model to clearly separate good matches from bad ones and reduce false-positive rates.


### Most Important Takeaways
- Preprocessing was the biggest challenge: Cleaning noisy, scraped CV and job data had more impact on performance than any modelling decision, highlighting the importance of well-structured inputs.
  
- Bi-encoder architectures work extremely well for CV–job matching: They provide scalable, independent embeddings and benefit strongly from domain-specific fine-tuning.
- Sequence length control is crucial: Truncation and batching strategies directly influenced training stability, speed, and memory usage.
  
- Fine-tuning reshapes the embedding space: Even one epoch of contrastive learning dramatically improved class separability, demonstrating the value of task-specific adaptation.


## Theoretical Foundation

Our approach is inspired by **ConFit** (Yun et al., 2024), which proposes a **BERT bi-encoder** trained with **contrastive learning** for semantic matching tasks.

### ConFit: BERT Bi-Encoder with Contrastive Loss

- **Bi-encoder architecture**  
  - A *single* BERT encoder maps each input text (here: CV or job posting) to a **dense embedding**.  
  - Semantic similarity between two texts is measured via **cosine similarity** of their embeddings.

- **Contrastive learning objective**  
  - The model is trained to **pull together** embeddings of **positive pairs** (texts that should match) and **push apart** embeddings of **negative pairs**.  
  - ConFit uses a **contrastive loss** (InfoNCE-style) where:
    - For each anchor text, the corresponding positive example should have higher similarity than the negatives in the batch.
    - All other examples in the batch act as **in-batch negatives**, making training efficient.

- **Pseudo-labelling in ConFit**  
  - ConFit does not rely solely on manually labelled pairs.  
  - Instead, it uses a base model to retrieve candidate pairs and applies **pseudo-labelling**:  
    - High-confidence pairs from retrieval are treated as **positives**.  
    - Low-confidence or mismatched pairs are treated as **negatives**.  
  - This allows scaling to many training pairs without expensive human annotation.
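The in-batch-negative idea described above can be sketched with a generic InfoNCE-style loss. This is not ConFit's exact formulation: the similarity matrices and the temperature of `0.05` are assumptions chosen for illustration, and `info_nce_loss` is a hypothetical helper.

```python
import math


def info_nce_loss(similarity, temperature=0.05):
    """Average InfoNCE loss over a batch.

    similarity[i][j] is the similarity between anchor i and candidate j.
    Candidate i is the positive for anchor i; all other candidates in the
    row act as in-batch negatives.
    """
    losses = []
    for i, row in enumerate(similarity):
        logits = [s / temperature for s in row]
        max_logit = max(logits)  # subtract the max for numerical stability
        log_denominator = max_logit + math.log(
            sum(math.exp(logit - max_logit) for logit in logits)
        )
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)


# Hypothetical batches of 3 anchors vs. 3 candidates.
well_separated = [
    [0.9, 0.1, 0.0],
    [0.1, 0.8, 0.2],
    [0.0, 0.2, 0.9],
]
confused = [[0.5, 0.5, 0.5] for _ in range(3)]

loss_separated = info_nce_loss(well_separated)
loss_confused = info_nce_loss(confused)
print(f"separated: {loss_separated:.4f}, confused: {loss_confused:.4f}")
```

When the model cannot tell positives from negatives, the loss equals the log of the batch size; it approaches zero as each anchor's positive clearly outscores the rest of the batch.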

### Heuristic Pseudo-Labelling

We follow the **architecture and training philosophy** of ConFit (BERT bi-encoder + contrastive loss), but build our own **pseudo-labelling pipeline**:

- Inspired by **Vanetik & Kogan (2023)**, we use **word-level matching** as a cheap but effective signal of semantic relatedness:
  - We represent each CV and job posting as **sets of cleaned skill tokens**.
  - We also compare simplified **job titles** (e.g. “Senior Data Engineer” → `{data, engineer}`).
- For each CV, we:
  - Retrieve candidate job postings.
  - Compute **skill overlap** and **title overlap**.
  - Label:
    - Pairs with **strong overlap** as **pseudo-positives**.
    - Pairs with **no overlap** (within a candidate pool) as **pseudo-negatives**.
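The labelling rule above can be sketched in a few lines. This is a simplified illustration, not the notebook's actual pipeline: the Jaccard measure, the `0.5` positive threshold, the seniority stopword list, and the helper names are all assumptions.

```python
def jaccard(a, b):
    """Jaccard overlap of two sets; 0.0 if either set is empty."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def title_tokens(title, stopwords=frozenset({"senior", "junior", "lead"})):
    """Lowercase a title and drop seniority words, e.g. {'data', 'engineer'}."""
    return {tok for tok in title.lower().split() if tok not in stopwords}


def pseudo_label(cv, job, pos_threshold=0.5):
    """Return 1 (pseudo-positive), 0 (pseudo-negative) or None (ambiguous)."""
    skill_overlap = jaccard(set(cv["skills"]), set(job["skills"]))
    title_overlap = jaccard(title_tokens(cv["title"]), title_tokens(job["title"]))
    if skill_overlap >= pos_threshold and title_overlap > 0:
        return 1
    if skill_overlap == 0 and title_overlap == 0:
        return 0
    return None  # partial overlap: leave the pair unlabelled


# Hypothetical CV and job postings.
cv = {"title": "Senior Data Engineer", "skills": ["python", "sql", "aws", "spark"]}
jobs = [
    {"title": "Data Engineer", "skills": ["python", "sql", "spark", "airflow"]},
    {"title": "Registered Nurse", "skills": ["patient care", "cpr"]},
    {"title": "Business Analyst", "skills": ["python", "excel"]},
]

labels = [pseudo_label(cv, job) for job in jobs]
print(labels)
```

Leaving ambiguous pairs unlabelled keeps the pseudo-labels clean at the cost of discarding some data, which matters less when the candidate pool is large.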

The **fine-tuning strategy** (bi-encoder SBERT + contrastive loss over positive and negative pairs) follows ConFit.  
The **word-overlap heuristic** for pseudo-labelling follows the spirit of Vanetik & Kogan’s word-level matching, adapted to our CV–job setting.

### Training Strategy References

**Batch size of 16:**
- The Reimers & Gurevych (2019) Sentence-BERT paper emphasises small batch sizes for contrastive learning due to memory constraints of transformer encoders.
- Reimers, N., & Gurevych, I. (2019). Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. ACL. https://arxiv.org/abs/1908.10084

**Single epoch training:**
- Reimers & Gurevych (2019) show effective adaptation after short fine-tuning, so we chose 1 epoch as the best trade-off between performance and training time.

**Warmup steps:**
- We set warmup_steps to 10% of the total training steps because this is the recommended proportion in the official Sentence-Transformers fit() documentation and is widely used to stabilise early updates when fine-tuning pretrained transformer models.
- The warmup phase gradually increases the learning rate from 0 to its maximum before linearly decaying, preventing sudden large gradient updates at the start of training, which is important when working with contrastive objectives and long sequences like CVs.
- Sentence-Transformers documentation: https://www.sbert.net/docs/package_reference/sentence_transformer/SentenceTransformer.html#sentence_transformers.SentenceTransformer.fit 

**Smart batching:**
- We used the built-in `smart_batching_collate` function from Sentence-Transformers to group samples of similar lengths together in each batch. This minimizes padding and speeds up training, improving memory efficiency when working with long, uneven CV and job texts.
- This approach was inspired by Reimers & Gurevych (2019) who implemented this smart batching strategy in their Sentence-BERT framework to reduce computational overhead from padding tokens.
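To show why padding matters for long, uneven texts, the following toy accounting compares three padding strategies. It is not the Sentence-Transformers collate function, only a sketch of the underlying idea; the token lengths are made up and `padded_tokens` is a hypothetical helper.

```python
def padded_tokens(lengths, batch_size, max_seq_length=256):
    """Total tokens processed when each batch is padded to its longest member."""
    clipped = [min(n, max_seq_length) for n in lengths]  # truncation
    total = 0
    for start in range(0, len(clipped), batch_size):
        batch = clipped[start:start + batch_size]
        total += max(batch) * len(batch)
    return total


# Hypothetical token counts for eight documents (one very long resume).
lengths = [30, 200, 40, 220, 35, 240, 45, 1000]

fixed_padding = 256 * len(lengths)                              # always pad to 256
per_batch_padding = padded_tokens(lengths, batch_size=4)        # pad to batch maximum
length_grouped = padded_tokens(sorted(lengths), batch_size=4)   # similar lengths together

print(fixed_padding, per_batch_padding, length_grouped)
```

In this example, padding per batch already saves some computation, and grouping documents of similar length saves considerably more, because short documents are no longer padded up to the length of a long neighbour.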

# Mini-Project Script

## Notebook Roadmap

To keep the workflow clear, we structure the notebook into the following steps:

1. **Import libraries & set paths**  
2. **Resume data preparation**
   - Load and aggregate experience and skills.
   - Clean and filter the skills field.
   - Remove low-information and duplicate CVs.
3. **Job posting data preparation**
   - Load LinkedIn job data and job–skill mappings.
   - Filter to relevant rows and build structured job representations.
4. **Label-document creation**
   - Build compact text inputs (“label documents”) for CVs and jobs.
5. **Pseudo-label generation**
   - Construct CV–job pairs using skill/title overlap.
   - Balance positive and negative pairs and split into train/validation.
6. **Model fine-tuning and baseline comparison**
   - Baseline SBERT: frozen encoder + cosine similarity.
   - Fine-tuned SBERT: contrastive training on pseudo-labels.
7. **Evaluation**
   - Compare similarity distributions of positive vs. negative pairs.
   - Visualise results and inspect qualitative example rankings.


## Import Libraries

In [None]:
import ast
import kagglehub
import math
import matplotlib.pyplot as plt
import numpy as np
import os
import pandas as pd
import re
from pathlib import Path
from sentence_transformers import SentenceTransformer, util, InputExample, losses
from torch import Tensor
from tqdm.auto import tqdm
from torch.utils.data import DataLoader
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score, accuracy_score, roc_curve
from transformers import AutoTokenizer

## CV Data Preparation

### Load Data

In [None]:
dataset_id = "suriyaganesh/resume-dataset-structured"
final_path = kagglehub.dataset_download(dataset_id)
# DATA_DIR = Path(final_path)  # FOR MY VS CODE
DATA_DIR = Path("/kaggle/input/resume-dataset-structured")

people = pd.read_csv(DATA_DIR / "01_people.csv")
abilities = pd.read_csv(DATA_DIR / "02_abilities.csv")
education = pd.read_csv(DATA_DIR / "03_education.csv")
experience = pd.read_csv(DATA_DIR / "04_experience.csv")
person_skills = pd.read_csv(DATA_DIR / "05_person_skills.csv")

print("people.columns =", people.columns)
print("abilities.columns =", abilities.columns)
print("education.columns =", education.columns)
print("experience.columns =", experience.columns)
print("person_skills.columns =", person_skills.columns)

### Create CV Dataframe

We merged the multiple resume dataframes based on the primary key "person_id" to create a comprehensive dataframe containing all relevant information about each candidate:
- Past experiences (Job titles)
- Skills

In [None]:
# Aggregate data
experience_agg = (
    experience.groupby("person_id")["title"].apply(list).reset_index(name="title")
)

skills_agg = (
    person_skills.groupby("person_id")["skill"].apply(list).reset_index(name="skills")
)

df_resume = people.merge(experience_agg, on="person_id", how="left").merge(
    skills_agg, on="person_id", how="left"
)

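# Drop contact columns and rename the aggregated title list.
# The skills list keeps the name "skills" (set in skills_agg above),
# because the cleaning cells below refer to it by that name.
df_resume = df_resume.drop(columns=["email", "phone", "linkedin", "name"])
df_resume = df_resume.rename(columns={"title": "past_experience"})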

print(df_resume.columns)
print(df_resume.shape)
df_resume.head()

### Past Experience Column EDA & Cleaning

Removing empty rows:

In [None]:
na_experience_count = df_resume["past_experience"].isna().sum()
print(f"past_experience null rows: {na_experience_count}")
na_skills_count = df_resume["skills"].isna().sum()
print(f"skills null rows: {na_skills_count}")

In [None]:
# Drop rows with NA
df_resume = df_resume.dropna(subset=["past_experience", "skills"])

# Verify no NA values
na_experience_count = df_resume["past_experience"].isna().sum()
print(f"past_experience null rows: {na_experience_count}")
na_skills_count = df_resume["skills"].isna().sum()
print(f"skills null rows: {na_skills_count}")

Data Exploration of *past_experience* column before cleaning:

In [None]:
df_resume.info()

In [None]:
all_titles = [title for titles in df_resume["past_experience"] for title in titles]
unique_job_titles = len(set(all_titles))
print(f"Number of unique job titles before cleaning: {unique_job_titles}")


avg_past_experiences = round(
    np.mean([len(titles) for titles in df_resume["past_experience"]]), 2
)
print(f"Average number of past jobs per candidate: {avg_past_experiences}")

Cleaning and normalizing *past_experience* column by removing special characters, replacing abbreviations and converting to lowercase:

In [None]:
ACRONYMS = [
    "SQL",
    "IT",
    "UI",
    "UX",
    "API",
    "ETL",
    "AWS",
    "IOS",
]


def normalize_past_experience(x):
    """
    Ensures past_experience becomes a clean list of raw title strings,
    splitting on commas if necessary.
    """
    if isinstance(x, list):
        titles = []
        for s in x:
            for t in str(s).split(","):
                t = t.strip()
                if t:
                    titles.append(t)
        return titles

    elif isinstance(x, str):
        return [t.strip() for t in x.split(",") if t.strip()]

    else:
        return []


def clean_title(t):
    t = str(t).strip()
    t = re.sub(r"\.", "", t)
    t = t.lower()
    t = re.sub(r"\s*-\s*", " ", t)  # replace hyphens with spaces

    replacements = {
        r"\bdba\b": "database administrator",
        r"\bdb\b": "database",
        r"\bsr\b": "senior",
        r"\bjr\b": "junior",
        r"\bdev\b": "developer",
        r"\beng\b": "engineer",
        r"\badmin\b": "administrator",
        r"\barch\b": "architect",
        r"\bpm\b": "project manager",
        r"\bqa\b": "quality assurance",
        r"\bbi\b": "business intelligence",
        r"\bfs\b": "full stack",
        r"\bceo\b": "chief executive officer",
        r"\bhr\b": "human resource",
        r"\bai\b": "artificial intelligence",
        r"\bml\b": "machine learning",
    }

    for pattern, replacement in replacements.items():
        t = re.sub(pattern, replacement, t)
    t = t.title()
    for acronym in ACRONYMS:
        t = re.sub(rf"\b{acronym.title()}\b", acronym, t)

    return t.strip()


def clean_and_dedupe_past_experience(past_titles: list[str]) -> str:
    """past_titles is a list of job title strings."""
    if not isinstance(past_titles, list):
        return ""
    cleaned = [clean_title(t) for t in past_titles if isinstance(t, str) and t.strip()]
    unique_titles = list(dict.fromkeys(cleaned))
    return ", ".join(unique_titles)


df_resume["past_experience"] = (
    df_resume["past_experience"]
    .apply(normalize_past_experience)
    .apply(clean_and_dedupe_past_experience)
)

Data Exploration of *past_experience* column after cleaning:

In [None]:
all_titles = [
    title for titles in df_resume["past_experience"] for title in titles.split(", ")
]
unique_job_titles = len(set(all_titles))

print(f"Unique cleaned job titles: {unique_job_titles}")


### Skills Column EDA & Cleaning

Data Exploration of *skills* column before cleaning:

In [None]:
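# The Series index is the DataFrame row label, not person_id,
# so we select the person_id column explicitly.
samples = df_resume[["person_id", "skills"]].sample(5, random_state=42)

for person_id, skills in samples.itertuples(index=False):
    print(f"person_id {person_id} | Skills: {skills}\n")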


The raw *skills* column mixes true technical skills with long, responsibility-style sentences.  
To make this field usable for modelling, we extract only **atomic, tool-level skills**.

##### Cleaning *skills* column:

**Why Is Cleaning Needed?**
- Skills include **verbose experience bullet points** (“Develop strategic direction for the information systems…”).  
- These entries are inconsistent, non-technical, and not suitable as skill tokens.  
- Some resumes contain **hundreds of tokens**, creating noise and memory issues.

**Cleaning Heuristic (Rule-Based)**

We remove entries that:
- Are **too long** or contain many words.
- End with **sentence-like punctuation**.
- Contain common **responsibility verbs** (“managed”, “implemented”, “maintained”, etc.).
- Begin with **task verbs** (“identify”, “review”, “create”, “plan”, “test”…).

Then we:
- **Normalize** acronyms and variants (e.g. *js → javascript*, consistent casing).
- **Deduplicate** skills within each resume.

**Handling Outlier Resumes**

Resumes with excessively large skill lists are dropped because they:
- Act as **outliers** during training.
- Create **memory/compute spikes** for BERT embeddings.
- Harm similarity learning by inflating token noise.

The resulting *clean_skills* field contains concise, consistent, modelling-ready skill tokens.

In [None]:
ACRONYMS = [
    "SQL",
    "IT",
    "UI",
    "UX",
    "API",
    "ETL",
    "AWS",
    "IOS",
    "XML",
    "HTML",
    "CSS",
    "PHP",
]

SPECIAL_MAP = {
    "js": "javascript",
    "py": "python",
    "ms": "microsoft",
    "node.js": "nodejs",
    "react.js": "react",
}


EXPERIENCE_VERBS = [
    "managed",
    "maintained",
    "responsible",
    "implemented",
    "supported",
    "provided",
    "ensured",
    "analyzed",
    "resolved",
    "troubleshot",
    "troubleshooting",
    "trained",
    "developed",
    "installed",
    "configured",
    "handled",
    "engaged",
    "drove",
    "communicated",
    "explained",
    "assumed",
    "acted",
    "diagnosed",
    "and",
]

START_VERBS = [
    "identify",
    "review",
    "gather",
    "create",
    "document",
    "perform",
    "schedule",
    "plan",
    "participate",
    "execute",
    "lead",
    "work",
    "test",
    "approve",
    "approved",
    "confirmed",
    "tracked",
    "monitored",
    "controlled",
    "managed",
    "evaluated",
    "defined",
    "formulated",
    "assembled",
    "coordinated",
    "follow",
    "followed",
    "upload",
    "uploaded",
    "research",
    "suggested",
    "verified",
]

verb_pattern = re.compile(r"\b(" + "|".join(EXPERIENCE_VERBS) + r")\b", re.IGNORECASE)


def normalize_skill(s):
    """Light normalization, safe for NaNs, suitable for BERT input."""
    if s is None or (isinstance(s, float) and math.isnan(s)):
        return ""

    s = str(s).strip()
    if not s:
        return ""

    s_lower = s.lower()

    if s_lower in SPECIAL_MAP:
        return SPECIAL_MAP[s_lower]

    for ac in ACRONYMS:
        if s_lower == ac.lower():
            return ac

    return s_lower


def is_valid_skill(s):
    """Filter out experience-style sentences, keep short skill-like tokens."""
    if not s:
        return False
    s_strip = s.strip()
    words = s_strip.split()

    if len(words) > 6:
        return False
    if len(s_strip) > 80:
        return False

    first_word = words[0].lower()
    if first_word in START_VERBS:
        return False

    if s_strip[-1] in ".!?":
        return False

    if verb_pattern.search(s_strip) and len(words) > 5:
        return False

    return True


def clean_skill_list(skills_list):
    """Apply normalization + filtering + simple dedupe for one resume."""
    cleaned = []
    seen = set()

    for raw in skills_list:
        norm = normalize_skill(raw)
        if is_valid_skill(norm):
            if norm not in seen:
                seen.add(norm)
                cleaned.append(norm)

    return cleaned


# Apply to your dataframe
df_resume["clean_skills"] = df_resume["skills"].apply(clean_skill_list)

Data Exploration of *skills* column after cleaning:

In [None]:
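# Select person_id explicitly so the printed ID is the real key, not the row label.
samples = df_resume[["person_id", "clean_skills"]].sample(5, random_state=42)

for person_id, skills in samples.itertuples(index=False):
    print(f"person_id {person_id} | Skills: {skills}\n")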

In [None]:
df_resume = df_resume.drop(columns=["skills"])

# drop rows with empty clean_skills
df_resume = df_resume[df_resume["clean_skills"].map(len) > 0]
empty_skills_count = (df_resume["clean_skills"].map(len) == 0).sum()
print(f"Rows with empty clean_skills: {empty_skills_count}")

avg_skills_per_candidate = round(
    np.mean([len(skills) for skills in df_resume["clean_skills"]]), 2
)
print(f"Average number of skills per candidate: {avg_skills_per_candidate}")

### Removing Low-Information CVs

After cleaning the skills field, we remove resumes with **fewer than four skills**, because they:

- Usually come from **noisy scraping** or from removing **responsibility-style sentences** during cleaning.
- Contain **too little information** to represent a candidate meaningfully.
- Act as **outliers** that destabilise similarity scores and BERT embeddings.
- Reduce the overall **quality and consistency** of the dataset.

In [None]:
df_resume = df_resume[df_resume["clean_skills"].apply(lambda x: len(x) >= 4)]
print(df_resume.shape)
df_resume.head(10)

In [None]:
all_skills = df_resume["clean_skills"].explode()
unique_skills = all_skills.nunique()
print(f"Number of unique skills after cleaning: {unique_skills}")


Data Exploration of entire CV dataset after cleaning:

In [None]:
df_resume

In [None]:
df_resume.info()

### Remove duplicate CVs

In [None]:
# we must convert the list to a tuple to make it hashable for duplication checking
df_resume["clean_skills_tuple"] = df_resume["clean_skills"].apply(
    lambda x: tuple(sorted(x))
)

dup_counts = df_resume.duplicated(
    subset=["past_experience", "clean_skills_tuple"]
).sum()

print(f"Number of duplicate resumes: {dup_counts}")
df_resume = df_resume.drop_duplicates(
    subset=["past_experience", "clean_skills_tuple"],
    keep="first",
)
# drop the helper column
df_resume = df_resume.drop(columns=["clean_skills_tuple"])

print(f"Dataframe shape after dropping duplicates: {df_resume.shape}")

### Visualize Cleaned & Filtered CV Dataset

In [None]:
def plot_top_with_cumulative(
    series,
    top_n=30,
    ylabel="Category",
    title="Top Categories with Cumulative Percentage",
):
    """
    Make a horizontal bar chart with a cumulative percentage line
    for the top_n values of a Series.
    """
    counts = series.value_counts().head(top_n)
    total = len(series)
    cumulative_pct = (counts.cumsum() / total) * 100
    counts_sorted = counts[::-1]
    cumulative_pct_sorted = cumulative_pct.reindex(counts_sorted.index)

    fig, ax1 = plt.subplots(figsize=(14, 8))

    ax1.barh(
        counts_sorted.index,
        counts_sorted.values.astype(float),
        color="blue",
    )
    ax1.set_xlabel("Count", color="blue")
    ax1.tick_params(axis="x", labelcolor="blue")
    ax1.set_ylabel(ylabel)
    ax1.set_title(title)

    ax2 = ax1.twiny()
    y_pos = np.arange(len(counts_sorted))

    ax2.plot(
        cumulative_pct_sorted.to_numpy(),
        y_pos,
        color="darkorange",
        marker="o",
        linestyle="--",
        linewidth=1.5,
        label="Cumulative %",
    )

    ax2.set_xlabel("Cumulative Percentage", color="darkorange")
    ax2.tick_params(axis="x", labelcolor="darkorange")
    ax2.set_xlim(0, 100)

    for i, percentage in enumerate(cumulative_pct_sorted.values):
        ax2.text(
            percentage + 1,
            i,
            f"{percentage:.1f}%",
            va="center",
            ha="left",
            color="darkorange",
            size=8,
        )

    ax1.grid(axis="x", linestyle="--", alpha=0.5)
    fig.tight_layout()
    plt.show()

Visualize final *skills* column:

In [None]:
number_of_skills = 30

# Explode clean_skills
skills_long = df_resume[["person_id", "clean_skills"]].explode("clean_skills")
skills_long = skills_long.rename(columns={"clean_skills": "skill"})

plot_top_with_cumulative(
    series=skills_long["skill"],
    top_n=number_of_skills,
    ylabel="Skill",
    title=f"Top {number_of_skills} Skills with Cumulative Percentage",
)


Visualize final *past_experience* column:

In [None]:
number_of_titles = 30
df_resume["past_experience_list"] = df_resume["past_experience"].apply(
    lambda s: [t.strip() for t in str(s).split(",") if t.strip()]
)

exploded_titles = df_resume.explode("past_experience_list")

plot_top_with_cumulative(
    series=exploded_titles["past_experience_list"],
    top_n=number_of_titles,
    ylabel="Job Title",
    title=f"Top {number_of_titles} Job Titles in Past Experience with Cumulative Percentage",
)

# drop helper column
df_resume = df_resume.drop(columns=["past_experience_list"])


#### Dataset Domain Characteristics

From the visualisations above it becomes clear that the resume dataset is heavily concentrated in the **tech domain**, including:

Tech jobs:
- Software development  
- Data and analytics roles  
- IT infrastructure, networking, and engineering  
- Cloud and DevOps-related positions  

As well as tech skills:
- A strong prevalence of **programming languages** (Python, Java, SQL)
- Frequent **cloud / tooling skills** (AWS, Azure, Docker, Git)
- Limited representation of **non-technical** roles

We acknowledge that the dataset:

- Does **not represent all industries or job categories** 
- May reflect **source-specific biases** from the scraping process

However, it still provides a **consistent, well-defined domain** for the purpose of our mini-project.


## Job Posting Data Preparation

### Load Data

In [None]:
dataset_id_jobs = "asaniczka/1-3m-linkedin-jobs-and-skills-2024"
final_path_jobs = kagglehub.dataset_download(dataset_id_jobs)
# DATA_DIR_JOBS = Path(final_path_jobs)  # FOR MY VS CODE
DATA_DIR_JOBS = Path("/kaggle/input/1-3m-linkedin-jobs-and-skills-2024")

job_skills = pd.read_csv(DATA_DIR_JOBS / "job_skills.csv")
linkedin_jobs = pd.read_csv(DATA_DIR_JOBS / "linkedin_job_postings.csv")

print("job_skills.columns =", job_skills.columns)
print("linkedin_jobs.columns =", linkedin_jobs.columns)

### Create Job Posting Dataframe

In [None]:
df_jobs = linkedin_jobs.merge(job_skills, on="job_link", how="left")
df_jobs = df_jobs.drop(
    columns=[
        "job_link",
        "last_processed_time",
        "got_summary",
        "got_ner",
        "is_being_worked",
        "company",
        "job_location",
        "first_seen",
        "search_city",
        "search_country",
        "search_position",
        "job_level",
        "job_type",
    ]
)

print(df_jobs.shape)
df_jobs.head(10)

In [None]:
df_jobs.info()

Remove missing values:

In [None]:
df_jobs = df_jobs.dropna()
df_jobs.info()

### Filtering & Sampling the Job Dataframe

Before pseudo-labelling, we filter the **1.3M LinkedIn job postings** using tech-specific keywords.  
This ensures the job data aligns with the **technical focus** of our resume corpus.

**Why filtering is necessary**

Without filtering:
- Random sampling would return mostly **irrelevant roles** (e.g., retail, nursing, hospitality).  
- We would get **almost no meaningful positive CV–job matches**.  
- Top-K retrieval would be dominated by **trivial negatives**.  
- Contrastive fine-tuning could **collapse**, because the model sees only non-matching pairs and learns nothing useful.

**Why filtering is not a problem**
- We are **not restricting** the model’s learning ability.  
- We are **increasing the signal-to-noise ratio** by focusing on jobs from the same domain as the resumes.  
- This helps the model learn **high-quality, domain-relevant semantic patterns** instead of noise.

In short:  
Filtering the job dataset ensures that pseudo-labelling produces **valid positives**, **meaningful negatives**, and a **stable contrastive learning signal**.
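One caveat of keyword filters of this kind is that a plain alternation of keywords matches substrings as well as whole words. The sketch below contrasts substring matching with a word-boundary variant on a few made-up job titles. The keyword subset and titles are hypothetical, and the word-boundary pattern is an optional alternative, not what the notebook uses.

```python
import re

# Small hypothetical subset of keywords, for illustration only.
TOY_KEYWORDS = ["sql", "dba", "developer"]

# Plain alternation: matches the keyword anywhere, even inside other words.
substring_pattern = re.compile("|".join(TOY_KEYWORDS), re.IGNORECASE)

# Word-boundary variant: the keyword must appear as a whole word.
word_pattern = re.compile(
    r"\b(?:" + "|".join(re.escape(k) for k in TOY_KEYWORDS) + r")\b",
    re.IGNORECASE,
)

titles = [
    "Senior SQL Developer",
    "Customer Feedback Coordinator",  # "feedback" contains the letters "dba"
    "MySQL DBA",
    "Registered Nurse",
]

substring_hits = [t for t in titles if substring_pattern.search(t)]
word_hits = [t for t in titles if word_pattern.search(t)]

print("substring:", substring_hits)
print("word boundary:", word_hits)
```

Word boundaries remove accidental hits such as "Feedback", but they also stop "sql" from matching inside "MySQL" or "PostgreSQL", so such variants would need their own keywords. Which behaviour is preferable depends on how much recall the filter should trade for precision.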

In [None]:
TECH_KEYWORDS = [
    "database",
    "sql",
    "dba",
    "data engineer",
    "data scientist",
    "data analyst",
    "software",
    "developer",
    "backend",
    "full stack",
    "cloud",
    "aws",
    "azure",
    "gcp",
    "devops",
    "linux",
    "oracle",
    "postgres",
    "mysql",
    "network engineer",
    "systems engineer",
]

pattern = "|".join(TECH_KEYWORDS)

mask = df_jobs["job_title"].str.contains(pattern, case=False, na=False) | df_jobs[
    "job_skills"
].str.contains(pattern, case=False, na=False)

df_jobs_relevant = df_jobs[mask].reset_index(drop=True)

print("Total jobs:", len(df_jobs))
print("Relevant tech jobs after filtering:", len(df_jobs_relevant))

In [None]:
df_jobs_relevant.sample(10)

**Sampling the Job Corpus**

From the ~220k technical job postings remaining after filtering, we **randomly sample 10,000 jobs** to build the job corpus used for candidate retrieval and pseudo-labelling.

Random sampling:
- **Preserves** the overall distribution of technical job families,
- While **drastically reducing** the computational cost of:
  - Embedding all job postings,
  - Running similarity search,
  - And generating CV–job pseudo-labels.

A 10k job subset is **large enough** to:
- Provide diverse positive and negative pairs,
- And ensure meaningful contrastive learning.

In practice, this setup produces **~125k pseudo-labelled pairs**, which is well within the typical range used in academic SBERT fine-tuning work (often **10k–300k** labelled or pseudo-labelled pairs) and is more than sufficient for stable bi-encoder training.


In [None]:
N_JOBS = 10_000
df_jobs_final = df_jobs_relevant.sample(n=N_JOBS, random_state=42).reset_index(
    drop=True
)

print("Using jobs:", len(df_jobs_final))

### Job Posting Columns EDA & Cleaning

Data Exploration of *job_title* and *job_skills* columns before cleaning:

In [None]:
unique_job_titles_before = df_jobs_final["job_title"].nunique()
print(f"Number of unique job titles before cleaning: {unique_job_titles_before}")

all_job_skills = (
    df_jobs_final["job_skills"].astype(str).str.split(",").explode().str.strip()
)

unique_job_skills_before = all_job_skills.nunique()
print(f"Unique job skills before cleaning: {unique_job_skills_before}")

We convert each comma-separated skills string into a list of individual skill phrases and then pass them through the same skill-cleaning pipeline that we used for the CV dataset.

This pipeline:
- Removes responsibility-style sentences and non-skill fragments,
- Normalizes acronyms (e.g., “api” → “API”),
- Standardizes common variants (“js” → “javascript”),
- Deduplicates repeated entries,
- Applies length and structure heuristics to keep only true skill tokens.


In [None]:
df_jobs_final["job_title_clean"] = df_jobs_final["job_title"].apply(clean_title)
unique_job_titles_after = df_jobs_final["job_title_clean"].nunique()
print(f"Number of unique job titles after cleaning: {unique_job_titles_after}")


def split_job_skill_string(s):
    """Convert a comma-separated skill string into a list of skill phrases."""
    if not isinstance(s, str):
        return []
    return [x.strip() for x in s.split(",") if x.strip()]


# Apply existing cleaning pipeline
df_jobs_final["skill_list_raw"] = df_jobs_final["job_skills"].apply(
    split_job_skill_string
)
df_jobs_final["job_skills_clean"] = df_jobs_final["skill_list_raw"].apply(
    clean_skill_list
)
unique_skills_after = df_jobs_final["job_skills_clean"].explode().nunique()
print(f"Unique job skills after cleaning: {unique_skills_after}")

After cleaning, the number of unique job skills dropped by roughly 15,000, effectively removing noisy, inconsistent, or non-technical entries from the original scraped dataset.

Unused columns are dropped:

In [None]:
df_jobs_final = df_jobs_final.drop(
    columns=["job_title", "job_skills", "skill_list_raw"]
)

### Visualize Cleaned Job Posting Dataset

Visualize the final *job_title_clean* column:

In [None]:
plot_top_with_cumulative(
    series=df_jobs_final["job_title_clean"],
    top_n=30,
    ylabel="Job Title",
    title="Top 30 Job Titles with Cumulative Percentage",
)

Visualize the final *job_skills_clean* column:

In [None]:
job_skills_long = df_jobs_final["job_skills_clean"].explode()

plot_top_with_cumulative(
    series=job_skills_long,
    top_n=30,
    ylabel="Skill",
    title="Top 30 Job Skills with Cumulative Percentage",
)


## Creating the Pseudo-Labelled CV–Job Matching Dataset

### Overview of the Pipeline
We construct pseudo-labelled CV–job pairs using lightweight heuristics and structured text inputs:

- Build **label-documents** for CVs and jobs (titles + skills)
- Create **skill and title token sets** for overlap computation
- Embed CVs and jobs with a base SBERT model and retrieve top-K jobs per CV
- Compute **skill and title overlaps** for each top-K CV–job pair
- Label top-K pairs with strong overlap as **positives** and non-top-K pairs with no overlap as **negatives**


### Creating Label Documents and Skill/Title Sets for Pseudo-Labelling

To prepare resumes and job postings for overlap-based matching and BERT encoding, we apply the following steps:

- Define a function that converts a skill list into a set

- Define skill overlap computation function

- Define normalization and tokenization function for job titles

- Define title overlap computation function

- Build structured label documents for BERT embedding

In [None]:
def list_to_set(x):
    if isinstance(x, list):
        return set(s.lower() for s in x)
    return set()


def compute_skill_overlap(cv_set: set[str], job_set: set[str]) -> int:
    return len(cv_set & job_set)


TITLE_STOPWORDS = {
    "senior",
    "junior",
    "lead",
    "assistant",
    "intern",
    "and",
    "&",
    "-",
    "/",
}


def title_to_tokens(title: str) -> set[str]:
    t = str(title).lower()
    t = re.sub(r"[|/,]", " ", t)
    t = re.sub(r"\s+", " ", t).strip()
    tokens = [tok for tok in t.split(" ") if tok and tok not in TITLE_STOPWORDS]
    return set(tokens)


def compute_title_overlap(cv_title: str, job_title: str) -> int:
    return len(title_to_tokens(cv_title) & title_to_tokens(job_title))


def build_cv_text(row) -> str:
    past = str(row["past_experience"]).strip()
    skills = ", ".join(row["clean_skills"])
    return f"Past experience: {past}. Technical skills: {skills}."


def build_job_text(row) -> str:
    title = str(row["job_title_clean"]).strip()
    skills = ", ".join(row["job_skills_clean"])
    return f"Job title: {title}. Required skills: {skills}."


cv_df = df_resume.copy()
jobs_df = df_jobs_final.copy()


# Convert skill lists → sets for overlap computation
cv_df["skill_set"] = cv_df["clean_skills"].apply(list_to_set)
jobs_df["skill_set"] = jobs_df["job_skills_clean"].apply(list_to_set)

# Build text for BERT encoder
cv_df["label_doc"] = cv_df.apply(build_cv_text, axis=1)
jobs_df["label_doc"] = jobs_df.apply(build_job_text, axis=1)
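
To make the overlap helpers concrete, here is a small self-contained sketch. It restates the title tokenizer in compact form (the name `tokens` and all example titles and skills are hypothetical) and shows which tokens count towards the overlap.

```python
import re

# Compact restatement of the helpers above so the sketch runs on its own.
STOPWORDS = {"senior", "junior", "lead", "assistant", "intern", "and", "&", "-", "/"}


def tokens(title: str) -> set[str]:
    """Lowercase the title, split on separators, and drop seniority stopwords."""
    t = re.sub(r"[|/,]", " ", str(title).lower())
    return {tok for tok in t.split() if tok not in STOPWORDS}


# Hypothetical CV and job titles.
cv_title = "Senior Database Administrator"
job_title = "Database Engineer / DBA"
title_overlap = tokens(cv_title) & tokens(job_title)

# Hypothetical skill sets (already lowercased, as list_to_set would produce).
cv_skills = {"sql", "oracle", "linux", "python"}
job_skills = {"sql", "oracle", "linux", "aws"}
skill_overlap = cv_skills & job_skills

print("title overlap:", title_overlap)
print("skill overlap size:", len(skill_overlap))
```

Here the pair shares one title token (`database`) and three skills. Note that a single generic token such as `data` or `engineer` is enough to count as a title overlap, which is why the title criterion is combined with a skill criterion in the labelling step.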


Save processed dataframes:

In [None]:
cv_df.to_csv("processed_cv_data.csv", index=False)
jobs_df.to_csv("processed_job_data.csv", index=False)

### Embedding label documents & computing cosine similarity

We load a pre-trained **Sentence-BERT** model (`msmarco-bert-base-dot-v5`), which embeds text into a vector space where semantically similar documents lie close together.

We apply this encoder separately to:
- all CV **label documents**
- all job **label documents**

This produces two embedding matrices: one for CVs and one for jobs.

Using these embeddings, we compute a **cosine similarity matrix**:
- Each row = a CV  
- Each column = a job posting  
- Each value = how semantically similar the CV and job are

High similarity values indicate semantic similarity between CV and job (potential **positive matches**), while low values indicate semantic differences between CV and job (potential **negative matches**).
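
The layout of the similarity matrix can be illustrated with a tiny pure-Python sketch. The 2-dimensional vectors are made-up stand-ins for real embeddings (which have 768 dimensions for this model), and `cosine` and `S_toy` are hypothetical names.

```python
import math


def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity: dot product divided by the product of the norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Toy "embeddings" (hypothetical values).
cv_vectors = [[1.0, 0.0], [1.0, 1.0]]
job_vectors = [[2.0, 0.0], [0.0, 3.0], [1.0, 1.0]]

# S_toy[i][j] = similarity between CV i and job j -> 2 rows (CVs), 3 columns (jobs).
S_toy = [[cosine(cv, job) for job in job_vectors] for cv in cv_vectors]

for row in S_toy:
    print([round(x, 3) for x in row])
```

The first CV scores 1.0 against the first job even though the vectors differ in length, because cosine similarity depends only on direction.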


In [None]:
base_model = SentenceTransformer("sentence-transformers/msmarco-bert-base-dot-v5")
ft_model = SentenceTransformer("sentence-transformers/msmarco-bert-base-dot-v5")

cv_texts = cv_df["label_doc"].tolist()
job_texts = jobs_df["label_doc"].tolist()

cv_emb = base_model.encode(
    cv_texts, batch_size=64, convert_to_tensor=True, show_progress_bar=True
)
job_emb = base_model.encode(
    job_texts, batch_size=64, convert_to_tensor=True, show_progress_bar=True
)

# Cosine similarity matrix: shape (num_cvs, num_jobs)
S: Tensor = util.cos_sim(cv_emb, job_emb)
S.shape

### Creating Pseudo-Labels

To generate training data for contrastive fine-tuning, we create positive and negative CV–job pairs using SBERT similarity and simple overlap rules.

**Prepare skill sets**  
If the dataframes were reloaded from CSV, the skill sets arrive as strings; convert them back into Python sets to enable fast overlap checks.

**Define thresholds for labelling**
- `TOP_K = 50` most similar jobs for each CV considered for positive labels
- Positives require: **≥3 skill-token overlaps** and **≥1 title-token overlap**  
- Cap per CV: **max 5 positives** and **max 5 negatives**  
- Negatives sampled from a **random pool of 200** non–Top-K jobs

**Positive pseudo-labels**  
For each CV, take the Top-K most similar jobs (via SBERT cosine similarity) and keep only those that satisfy the skill/title overlap thresholds.  

**Negative pseudo-labels**  
Sample jobs outside the Top-K pool and keep only those with **zero** skill and title overlap. 

We sample outside the Top-K pool and additionally require zero skill and title overlap so that negatives are clearly non-matching, which makes the contrastive training signal much cleaner.

**Output**  
All collected pairs are stored in `pseudo_pairs`, a set of high-quality positives and negatives for fine-tuning that covers a subset of the full CV and job datasets.
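
The labelling rule described above can be condensed into a single decision function. This is a simplified sketch: `pseudo_label` is a hypothetical helper, and the per-CV caps and the random negative pool are left out.

```python
from typing import Optional

POS_MIN_SKILLS = 3
POS_MIN_TITLE_TOKENS = 1


def pseudo_label(in_top_k: bool, skill_overlap: int, title_overlap: int) -> Optional[int]:
    """Return 1 (positive), 0 (negative), or None (pair is not used)."""
    if in_top_k:
        if skill_overlap >= POS_MIN_SKILLS and title_overlap >= POS_MIN_TITLE_TOKENS:
            return 1
        return None  # semantically similar, but not enough overlap: left unlabelled
    if skill_overlap == 0 and title_overlap == 0:
        return 0
    return None  # outside Top-K but with some overlap: too ambiguous for a negative


# (in_top_k, skill_overlap, title_overlap) for four hypothetical CV-job pairs
examples = [
    (True, 4, 1),
    (True, 2, 1),
    (False, 0, 0),
    (False, 1, 0),
]
labels = [pseudo_label(*e) for e in examples]
print(labels)
```

Pairs that fall between the two rules receive no label at all, so the training set contains only the clearest positives and negatives.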


In [None]:
cv_df["skill_set"] = cv_df["skill_set"].apply(
    lambda x: ast.literal_eval(x) if isinstance(x, str) else x
)
jobs_df["skill_set"] = jobs_df["skill_set"].apply(
    lambda x: ast.literal_eval(x) if isinstance(x, str) else x
)

TOP_K = 50

POS_MIN_SKILLS = 3
POS_MIN_TITLE_TOKENS = 1

MAX_POS_PER_CV = 5
MAX_NEG_PER_CV = 5
RANDOM_NEG_POOL = 200

pseudo_pairs = []

num_cvs = cv_df.shape[0]
num_jobs = jobs_df.shape[0]

for i in tqdm(range(num_cvs), desc="Pseudo-labelling CVs"):
    cv_row = cv_df.iloc[i]
    cv_text = cv_row["label_doc"]
    cv_skills = cv_row["skill_set"]
    cv_title = str(cv_row["past_experience"]).split(",")[0]

    sims = S[i].cpu().numpy()

    # POSITIVES
    top_k_idx = np.argsort(-sims)[:TOP_K]
    pos_count = 0

    for j in top_k_idx:
        job_row = jobs_df.iloc[j]
        job_text = job_row["label_doc"]
        job_skills = job_row["skill_set"]
        job_title = job_row["job_title_clean"]

        skill_overlap = compute_skill_overlap(cv_skills, job_skills)
        title_overlap = compute_title_overlap(cv_title, job_title)
        if (skill_overlap >= POS_MIN_SKILLS) and (
            title_overlap >= POS_MIN_TITLE_TOKENS
        ):
            pseudo_pairs.append((cv_text, job_text, 1))
            pos_count += 1
            if pos_count >= MAX_POS_PER_CV:
                break

    # NEGATIVES
    neg_count = 0

    all_idx = np.arange(num_jobs)
    neg_pool = np.setdiff1d(all_idx, top_k_idx)

    if len(neg_pool) == 0:
        continue
    rand_idx = np.random.choice(
        neg_pool,
        size=min(RANDOM_NEG_POOL, len(neg_pool)),
        replace=False,
    )

    for j in rand_idx:
        job_row = jobs_df.iloc[j]
        job_text = job_row["label_doc"]
        job_skills = job_row["skill_set"]
        job_title = job_row["job_title_clean"]

        skill_overlap = compute_skill_overlap(cv_skills, job_skills)
        title_overlap = compute_title_overlap(cv_title, job_title)

        if skill_overlap == 0 and title_overlap == 0:
            pseudo_pairs.append((cv_text, job_text, 0))
            neg_count += 1
            if neg_count >= MAX_NEG_PER_CV:
                break


In [None]:
pairs_df = pd.DataFrame(pseudo_pairs, columns=["cv_text", "job_text", "label"])
print(pairs_df["label"].value_counts())
pairs_df.head()

In [None]:
print(pairs_df[pairs_df["label"] == 1].sample(5, random_state=42))

In [None]:
print(pairs_df[pairs_df["label"] == 0].sample(5, random_state=42))

## Fine-Tuning the BERT Bi-Encoder Model

##### **Model:** Pre-Trained Sentence-BERT

We start from a pre-trained **Sentence-BERT bi-encoder** model: `msmarco-bert-base-dot-v5`

Each training example consists of: `(cv_label_doc, job_label_doc, binary_label)`

We convert these examples into `InputExample` objects and feed them into a mini-batch training loop.

We use **CosineSimilarityLoss** in a contrastive learning setup, which:

- Encourages **high cosine similarity** for **positive pairs** (label = 1)
- Encourages **low cosine similarity** for **negative pairs** (label = 0)

##### **Mechanism:** Contrastive Learning

1. The bi-encoder maps both texts (CV and job posting) into embeddings.
2. We compute the cosine similarity between the two embeddings.
3. The loss then pushes this similarity score towards the target label:
   - `1.0` for matches  
   - `0.0` for non-matches  

Over many such examples, the bi-encoder learns an **HR-specific similarity space**, where CVs that fit a job are closer to it than unrelated resumes.
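
As a rough illustration of the objective: with its default settings, `CosineSimilarityLoss` in Sentence-Transformers computes the mean squared error between the cosine similarity of the two embeddings and the target label. The sketch below reproduces that calculation in plain Python on made-up 2-dimensional embeddings; the function names are hypothetical and no gradient step is shown.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def cosine_similarity_loss(pairs):
    """Mean squared error between cos(u, v) and the target label."""
    errors = [(cosine(u, v) - label) ** 2 for u, v, label in pairs]
    return sum(errors) / len(errors)


# Hypothetical embeddings: (cv_embedding, job_embedding, label)
batch = [
    ([1.0, 0.0], [1.0, 0.0], 1.0),  # positive, already aligned -> squared error 0
    ([1.0, 0.0], [0.0, 1.0], 0.0),  # negative, orthogonal      -> squared error 0
    ([1.0, 0.0], [1.0, 0.0], 0.0),  # negative, but aligned     -> squared error 1
]
loss = cosine_similarity_loss(batch)
print(f"loss = {loss:.4f}")
```

Because the target for negatives is 0.0 rather than -1.0, the loss pushes non-matching pairs towards orthogonal embeddings, not towards opposite directions.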

#### **Class Balancing:**

The dataset was heavily imbalanced (≈85k negatives vs. 40k positives), so we applied balanced sampling by capping each class at 25,000 examples.

This ensured that the fine-tuning process was not dominated by negative pairs and that both classes contributed equally to the contrastive objective.


### Environment Configuration on Kaggle

On Kaggle, we configure two environment variables to avoid common execution issues:

- **Select the correct GPU:**  
  `CUDA_VISIBLE_DEVICES="0"` ensures that PyTorch uses Kaggle’s single available GPU. This prevents the notebook from accidentally trying to use a non-existent device.

- **Turn off Weights & Biases tracking:**  
  `WANDB_DISABLED="true"` stops the WandB tracking service from starting. Kaggle often blocks external network traffic, and this was done to prevent WandB from freezing or interrupting training (which it was doing).


In [None]:
os.environ["CUDA_VISIBLE_DEVICES"] = "0"
os.environ["WANDB_DISABLED"] = "true"

### Train–Validation Split

We apply a strict splitting strategy where **both CVs and job postings are completely disjoint** between the training and validation sets.

- No CV appearing in training appears in validation  
- No job posting appearing in training appears in validation  
- No CV–job pair is shared across splits  

This setup prevents **data leakage** of CV or job texts between splits, so the model cannot rely on memorized CV wording or recurring job templates.

Instead, it must **generalize to entirely unseen CVs and unseen job descriptions**, providing a robust and unbiased evaluation of real-world matching performance.
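
A side effect of splitting CVs and jobs independently is that mixed pairs (a training CV with a validation job, or the reverse) belong to neither split and are dropped. The toy sketch below makes this visible. It assumes every CV is paired with every job, which is not true of the real data, so the exact proportions there will differ.

```python
from itertools import product

# Toy assumption: 5 CVs x 5 jobs, every combination is a labelled pair.
cvs = [f"cv{i}" for i in range(5)]
jobs = [f"job{j}" for j in range(5)]
pairs = list(product(cvs, jobs))

# Hypothetical 80/20 split of each side, done independently.
train_cvs, val_cvs = set(cvs[:4]), set(cvs[4:])
train_jobs, val_jobs = set(jobs[:4]), set(jobs[4:])

# A pair is kept only if BOTH of its texts fall on the same side.
train = [(c, j) for c, j in pairs if c in train_cvs and j in train_jobs]
val = [(c, j) for c, j in pairs if c in val_cvs and j in val_jobs]
dropped = len(pairs) - len(train) - len(val)

print(f"train={len(train)}  val={len(val)}  dropped={dropped}  total={len(pairs)}")
```

In this idealised setting 0.8 × 0.8 = 64% of the pairs end up in training, 0.2 × 0.2 = 4% in validation, and the remaining 32% are discarded, which is why the pair counts printed by the next cell are smaller than the balanced dataset.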


In [None]:
# balance dataset through downsampling
target_per_class_strict = 25000
pairs_balanced = pairs_df.groupby("label", group_keys=False).apply(
    lambda g: g.sample(n=min(len(g), target_per_class_strict), random_state=42)
)

# unique CVs and Jobs from the balanced pairs
unique_cvs = pairs_balanced["cv_text"].unique()
unique_jobs = pairs_balanced["job_text"].unique()

# split independently into train/val sets
train_cvs, val_cvs = train_test_split(
    unique_cvs,
    test_size=0.2,
    random_state=42,
)

train_jobs, val_jobs = train_test_split(
    unique_jobs,
    test_size=0.2,
    random_state=42,
)

# enforce strict separation
train_df = pairs_balanced[
    pairs_balanced["cv_text"].isin(train_cvs)
    & pairs_balanced["job_text"].isin(train_jobs)
].copy()

val_df = pairs_balanced[
    pairs_balanced["cv_text"].isin(val_cvs) & pairs_balanced["job_text"].isin(val_jobs)
].copy()

print("Strict train pairs:", len(train_df))
print("Strict val pairs:", len(val_df))

print("Train label balance:\n", train_df["label"].value_counts())
print("Val label balance:\n", val_df["label"].value_counts())

Verifying the splits are disjoint:

In [None]:
train_cv_set = set(train_df["cv_text"])
val_cv_set = set(val_df["cv_text"])
print("CV overlap:", len(train_cv_set & val_cv_set))

train_job_set = set(train_df["job_text"])
val_job_set = set(val_df["job_text"])
print("Job overlap:", len(train_job_set & val_job_set))

train_pairs = set(zip(train_df["cv_text"], train_df["job_text"]))
val_pairs = set(zip(val_df["cv_text"], val_df["job_text"]))
print("Pair overlap:", len(train_pairs & val_pairs))

Check CUDA availability and the current device of `ft_model`:

In [None]:
import torch

print("CUDA available:", torch.cuda.is_available())

try:
    print("Model device:", next(ft_model.parameters()).device)
except Exception as e:
    print("Could not inspect model device:", e)


Analyzing the token-length distribution of CVs and job postings, using the same tokenizer as the SBERT model:


In [None]:
# use the same tokenizer as SBERT model
tokenizer = AutoTokenizer.from_pretrained(
    "sentence-transformers/msmarco-bert-base-dot-v5"
)


def compute_lengths(texts):
    return [len(tokenizer.encode(str(t), add_special_tokens=True)) for t in texts]


cv_lengths = compute_lengths(cv_df["label_doc"])
job_lengths = compute_lengths(jobs_df["label_doc"])

cv_lengths = np.array(cv_lengths)
job_lengths = np.array(job_lengths)

print("CV LENGTHS")
print("Mean:", cv_lengths.mean())
print("Median:", np.median(cv_lengths))
print("95th percentile:", np.percentile(cv_lengths, 95))
print("Max:", cv_lengths.max())

print("\nJOB LENGTHS")
print("Mean:", job_lengths.mean())
print("Median:", np.median(job_lengths))
print("95th percentile:", np.percentile(job_lengths, 95))
print("Max:", job_lengths.max())

thresholds = [32, 50, 64, 128, 256]

# Share of documents with at least t tokens, i.e. those truncated at max length t
print("\nCV LENGTH DISTRIBUTION")
for t in thresholds:
    print(f">= {t} tokens: {(cv_lengths >= t).mean() * 100:.1f}%")

print("\nJOB LENGTH DISTRIBUTION")
for t in thresholds:
    print(f">= {t} tokens: {(job_lengths >= t).mean() * 100:.1f}%")


### Model Training

To fine-tune the Sentence-BERT model on the strictly disjoint CV–job dataset, we prepared the training pipeline with three key steps:

**1. Selecting Maximum Sequence Length**

Token-length analysis showed that CVs reached **1200 tokens** and job postings up to **800 tokens**.  
Using the default token truncation caused slow CPU-side preprocessing on Kaggle because the tokenizer still processed the **entire long sequence before truncating**.

To avoid this bottleneck while keeping enough semantic content, we set **`max_seq_length = 256`**, which is a practical balance between **speed** and **context coverage**.


**2. Creating InputExamples & DataLoader**

Each training pair (CV text, job text, label) is converted into a Sentence-Transformers `InputExample`, which stores:

- `texts=[cv_text, job_text]`
- `label ∈ {0.0, 1.0}`

These examples are then wrapped in a PyTorch `DataLoader` to ensure clean batching and consistent pairwise training.

**3. Running the fine-tuning loop**

We train the model with the following configuration:
- **epochs:** 1
- **loss:** CosineSimilarityLoss
- **warmup:** 10 % of the training steps
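
The warmup setting translates into a concrete number of optimizer steps. The sketch below works through the arithmetic; `num_train_pairs` is a hypothetical value, since the real one depends on how many pairs survive the strict split.

```python
import math

num_train_pairs = 32_000  # hypothetical; the real value is len(train_examples)
batch_size = 16
epochs = 1

# A DataLoader without drop_last reports ceil(n / batch_size) batches per epoch.
steps_per_epoch = math.ceil(num_train_pairs / batch_size)
total_steps = steps_per_epoch * epochs
warmup_steps = int(0.1 * total_steps)

print(f"steps per epoch: {steps_per_epoch}, warmup steps: {warmup_steps}")
```

With a single epoch, 10% of the steps per epoch and 10% of all training steps are the same number; with more epochs the two would differ.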

In [None]:
train_examples = [
    InputExample(
        texts=[row.cv_text, row.job_text],
        label=float(row.label),
    )
    for row in train_df.itertuples(index=False)
]
print("num strict train:", len(train_examples))

# DataLoader
train_batch_size = 16
train_dataloader = DataLoader(
    train_examples,
    shuffle=True,
    batch_size=train_batch_size,
    num_workers=0,
    collate_fn=ft_model.smart_batching_collate,
)

# Loss
train_loss = losses.CosineSimilarityLoss(ft_model)

# Fine-tune
epochs = 1
warmup_steps = int(0.1 * len(train_dataloader))

ft_model.max_seq_length = 256
print("Current max_seq_length:", ft_model.max_seq_length)

ft_model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=epochs,
    warmup_steps=warmup_steps,
    show_progress_bar=True,
)

## Evaluation

### Quantitative Evaluation

To measure the effect of contrastive fine-tuning, we compare two models on the held-out validation set of pseudo-labelled CV–job pairs:

1. **Baseline SBERT:** `msmarco-bert-base-dot-v5`  
2. **Fine-tuned SBERT:** the same model after contrastive fine-tuning on our CV–job dataset.

We evaluate performance using **two quantitative metrics**:

**1. ROC–AUC (threshold-free ranking metric)**  
- Measures how well the model **ranks** true matches above non-matches.  
- Computed by checking whether positive pairs *consistently* receive higher similarity scores than negative ones.  
- Does **not** require picking a classification threshold.  
- Useful when we care about overall ranking quality.

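**2. Accuracy at a ROC-derived similarity threshold**
- We turn the similarity score into a binary decision:  
  - **score ≥ threshold → predict “match”**  
  - **score < threshold → predict “non-match”**  
- The threshold is taken from the fine-tuned model's ROC curve as the point that maximizes TPR − FPR (≈ 0.47 in our run), and the same cutoff is applied to both models.  
- The prediction is then compared against the **pseudo-labels** for correctness.  
- Accuracy reflects how well the model separates positive and negative pairs at this fixed cutoff.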

Together, these metrics give a clearer picture:  
**ROC–AUC evaluates ranking behaviour**, while **accuracy evaluates classification behaviour**.
 
Cosine similarity distributions alone cannot provide this full performance assessment.
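
The difference between the two metrics is easiest to see on a toy example. The sketch below computes ROC–AUC in its pairwise form (the probability that a random positive outscores a random negative) and accuracy at a fixed cutoff. All scores are made up, and the helper names are hypothetical.

```python
def pairwise_auc(scores, labels):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def accuracy_at(scores, labels, threshold):
    """Accuracy when every score >= threshold is predicted as a match."""
    preds = [1 if s >= threshold else 0 for s in scores]
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)


labels = [1, 1, 1, 0, 0, 0]
compressed = [0.97, 0.95, 0.94, 0.90, 0.89, 0.88]  # hypothetical: all scores high
spread = [0.98, 0.95, 0.90, 0.10, 0.05, 0.02]      # hypothetical: well separated

for name, scores in [("compressed", compressed), ("spread", spread)]:
    print(name, pairwise_auc(scores, labels), accuracy_at(scores, labels, 0.5))
```

Both score sets rank every positive above every negative, so both reach an AUC of 1.0. At a cutoff of 0.5, however, the compressed scores are all predicted as matches and accuracy drops to 0.5, while the spread scores are classified perfectly.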


In [None]:
def run_full_evaluation(val_df, cv_df, jobs_df, S_matrix, ft_model):
    """
    Full evaluation pipeline:
    - Baseline SBERT (pretrained)
    - Fine-tuned SBERT
    - ROC-based optimal threshold (selected on the fine-tuned scores)
    - Accuracy of both models at that threshold
    """
    # baseline evaluation (precomputed similarity matrix)
    cv_index_map = {text: i for i, text in enumerate(cv_df["label_doc"])}
    job_index_map = {text: i for i, text in enumerate(jobs_df["label_doc"])}

    baseline_scores = []
    labels = val_df["label"].astype(float).to_numpy()

    for row in val_df.itertuples(index=False):
        cv_idx = cv_index_map[row.cv_text]
        job_idx = job_index_map[row.job_text]
        baseline_scores.append(float(S_matrix[cv_idx, job_idx]))

    baseline_scores = np.array(baseline_scores)
    baseline_auc = roc_auc_score(labels, baseline_scores)

    # fine-tuned evaluation (encode with trained model)
    cv_texts = val_df["cv_text"].tolist()
    job_texts = val_df["job_text"].tolist()

    ft_cv_emb = ft_model.encode(cv_texts, batch_size=64, convert_to_tensor=True)
    ft_job_emb = ft_model.encode(job_texts, batch_size=64, convert_to_tensor=True)
    sim_matrix = util.cos_sim(ft_cv_emb, ft_job_emb)
    ft_scores = sim_matrix.diag().cpu().numpy()

    ft_auc = roc_auc_score(labels, ft_scores)

    # ROC curve of the FT model: pick the threshold that maximizes TPR - FPR
    fpr, tpr, thresholds = roc_curve(labels, ft_scores)
    optimal_idx = (tpr - fpr).argmax()
    best_threshold = thresholds[optimal_idx]
    ft_acc_opt = accuracy_score(labels, (ft_scores >= best_threshold).astype(float))
    baseline_acc_opt = accuracy_score(
        labels, (baseline_scores >= best_threshold).astype(float)
    )

    # summary
    print("\nEVALUATION SUMMARY\n")

    print("BASELINE SBERT")
    print(f"ROC-AUC            : {baseline_auc:.4f}")
    print(f"Accuracy @optimal  : {baseline_acc_opt:.4f}")
    print()

    print("FINE-TUNED SBERT")
    print(f"ROC-AUC            : {ft_auc:.4f}")
    print(f"Accuracy @optimal  : {ft_acc_opt:.4f}")
    print()

    print(f"Optimal threshold  : {best_threshold:.4f}")

    return {
        "baseline_scores": baseline_scores,
        "ft_scores": ft_scores,
        "labels": labels,
        "best_threshold": best_threshold,
    }
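
The threshold selection inside `run_full_evaluation` follows the idea of Youden's J statistic: choose the cutoff where TPR − FPR is largest. The plain-Python sketch below mirrors that idea on made-up scores. It is not the scikit-learn implementation, and `youden_threshold` is a hypothetical helper.

```python
def youden_threshold(scores, labels):
    """Return (t, J) for the cutoff t that maximizes TPR(t) - FPR(t), predicting 1 when score >= t."""
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    best_t, best_j = None, float("-inf")
    # Candidate cutoffs are the observed scores, from highest to lowest.
    for t in sorted(set(scores), reverse=True):
        tpr = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= t) / n_pos
        fpr = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= t) / n_neg
        if tpr - fpr > best_j:
            best_t, best_j = t, tpr - fpr
    return best_t, best_j


labels = [1, 1, 1, 1, 0, 0, 0, 0]
scores = [0.95, 0.90, 0.80, 0.15, 0.50, 0.20, 0.10, 0.05]  # hypothetical similarities
threshold, j_stat = youden_threshold(scores, labels)
print(f"threshold={threshold}, J={j_stat}")
```

On these scores the best cutoff is 0.80, which captures three of four positives without admitting any negative. A cutoff chosen this way is tuned to the score distribution it was computed on, which matters when the same value is reused for a different model.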


In [None]:
results = run_full_evaluation(
    val_df,
    cv_df,
    jobs_df,
    S,
    ft_model,
)

In [None]:
baseline_scores = results["baseline_scores"]
ft_scores = results["ft_scores"]
val_labels = results["labels"]

# a sanity check to ensure both have same length
assert len(baseline_scores) == len(ft_scores)

### Quantitative Results

Across both metrics, the **fine-tuned bi-encoder outperforms the baseline**:

- **Higher ROC–AUC:** 
    -  The fine-tuned model ranks matching pairs above mismatches slightly better than the baseline SBERT.

- **Much Higher Accuracy:** <50% → 98.9%+
    - At the ROC-optimal threshold of 0.47, the fine-tuned model produces substantially more correct match/non-match predictions.

This confirms that fine-tuning creates a **more discriminative similarity space**, better tailored to HR semantics.

### Evaluation Visualizations

In [None]:
plt.figure(figsize=(12, 6))

plt.hist(
    baseline_scores,
    bins=40,
    alpha=0.6,
    density=True,
    label="Baseline SBERT",
    color="blue",
)
plt.hist(
    ft_scores, bins=40, alpha=0.6, density=True, label="Fine-Tuned SBERT", color="green"
)

plt.title("Distribution of Cosine Similarity Scores\nBaseline vs Fine-Tuned SBERT")
plt.xlabel("Cosine Similarity")
plt.ylabel("Density")  # density=True normalizes the histogram
plt.legend()
plt.grid(alpha=0.3)
plt.show()


In [None]:
# Ensure 1D float arrays
baseline_scores = np.asarray(baseline_scores, dtype=float).ravel()
ft_scores = np.asarray(ft_scores, dtype=float).ravel()
val_labels = np.asarray(val_labels, dtype=int).ravel()

pos_idx = val_labels == 1
neg_idx = val_labels == 0

# baseline plot
plt.figure(figsize=(12, 6))

plt.hist(
    baseline_scores[pos_idx],
    bins=40,
    alpha=0.6,
    density=True,
    label="Baseline Positives",
)
plt.hist(
    baseline_scores[neg_idx],
    bins=40,
    alpha=0.6,
    density=True,
    label="Baseline Negatives",
)

plt.title("Baseline SBERT: Positive vs Negative Score Distribution")
plt.xlabel("Cosine Similarity")
plt.ylabel("Density")  # density=True normalizes the histogram
plt.legend()
plt.grid(alpha=0.3)
plt.show()

# fine-tuned plot
plt.figure(figsize=(12, 6))

plt.hist(
    ft_scores[pos_idx],
    bins=40,
    alpha=0.6,
    density=True,
    label="Fine-Tuned Positives",
)
plt.hist(
    ft_scores[neg_idx],
    bins=40,
    alpha=0.6,
    density=True,
    label="Fine-Tuned Negatives",
)

plt.title("Fine-Tuned SBERT: Positive vs Negative Score Distribution")
plt.xlabel("Cosine Similarity")
plt.ylabel("Density")  # density=True normalizes the histogram
plt.legend()
plt.grid(alpha=0.3)
plt.show()

In [None]:
def summarize(name, scores, labels):
    scores = np.array(scores)
    labels = np.array(labels)

    pos = scores[labels == 1]
    neg = scores[labels == 0]

    print(f"{name} SBERT Score Summary:")
    print(f"Positives: mean={pos.mean():.4f}, std={pos.std():.4f}, n={len(pos)}")
    print(f"Negatives: mean={neg.mean():.4f}, std={neg.std():.4f}, n={len(neg)}")
    print(f"Margin (pos_mean - neg_mean): {(pos.mean() - neg.mean()):.4f}")
    print()


summarize("BASELINE", baseline_scores, val_labels)
summarize("FINE-TUNED", ft_scores, val_labels)

### Discussion of Quantitative Results

- **Baseline SBERT ranked pairs well but with little separation in embedding space.**
  ROC–AUC was already extremely high (≈ 0.99) because the baseline model usually assigned slightly higher similarity scores to positive pairs than to negative ones. However, the gap between the two groups was very small (mean positives ~0.95 vs. mean negatives ~0.89).

- **A shared threshold caused low accuracy for base SBERT.**
  The cutoff of 0.47 was selected on the fine-tuned model's scores. Baseline similarities for both classes sit far above it (means of ~0.95 and ~0.89, see the histograms above), so almost every pair is predicted as a match and accuracy ends up near 50% despite good ranking.

- **Fine-tuning expanded the distance between positives and negatives.**  
  After contrastive training, positive examples clustered near **1.0**, while negative examples were pushed close to **0.0**.  
  The distribution gap increased drastically (from **0.064 to 0.91**), making the two classes clearly separable.

- **Accuracy improved dramatically after fine-tuning.**  
  Because the similarity gap widened, the same threshold (0.47) became meaningful.  
  Accuracy increased from **50% → 98.7%**, showing that the fine-tuned model learned a much more discriminative similarity space.

- **ROC–AUC remained nearly unchanged.**  
  Since the baseline was already ranking positives above negatives correctly, ROC–AUC could not improve much further. Its stability reflects that *ranking quality was already good* and stayed good after fine-tuning.

## Qualitative Evaluation

Testing the model on synthetic data of 1 CV and 5 job postings:

In [None]:
cv_text = """
I am a Data Analyst with 3+ years of experience working with SQL, Python, and BI tools.
I build dashboards in Power BI and Tableau, create ETL pipelines, and work closely with
business stakeholders to define KPIs, prepare reports, and automate data workflows.
Comfortable with statistics, A/B testing, and presenting findings to non-technical audiences.
"""

jobs = [
    {
        "id": 1,
        "title": "Data Analyst (Marketing Analytics)",
        "text": """
        We are looking for a Data Analyst to support our marketing team.
        The role involves writing SQL to query large datasets, building dashboards in Power BI
        or Tableau, and using Python for data cleaning and basic statistics.
        You will collaborate with stakeholders to define KPIs and create regular reports.
        """,
    },
    {
        "id": 2,
        "title": "Business Intelligence Developer",
        "text": """
        As a BI Developer you will design and maintain dashboards and reports,
        mainly using Power BI and SQL Server. You will work with business users to
        understand reporting requirements, build ETL-style data transformations,
        and ensure data quality across multiple sources.
        """,
    },
    {
        "id": 3,
        "title": "Machine Learning Engineer",
        "text": """
        We are hiring a Machine Learning Engineer responsible for developing and deploying
        ML models in production. The role requires strong Python skills, experience with
        frameworks such as TensorFlow or PyTorch, and knowledge of cloud platforms.
        SQL is a plus, but the focus is model development and MLOps, not reporting.
        """,
    },
    {
        "id": 4,
        "title": "Senior Frontend Engineer (React)",
        "text": """
        We are looking for a Senior Frontend Engineer with deep experience in React,
        TypeScript, CSS, and building complex, responsive web applications.
        You will work closely with designers to implement UI components and improve
        the user experience. No data analysis or BI background is required.
        """,
    },
    {
        "id": 5,
        "title": "Kindergarten Teacher",
        "text": """
        We are seeking a Kindergarten Teacher responsible for planning lessons,
        supporting early childhood development, communicating with parents,
        and creating a safe and engaging learning environment.
        No technical or data-related skills are required.
        """,
    },
]


In [None]:
def score_pair(model, cv_text: str, job_text: str) -> float:
    cv_emb = model.encode(cv_text, convert_to_tensor=True)
    job_emb = model.encode(job_text, convert_to_tensor=True)
    sim = util.cos_sim(cv_emb, job_emb).item()
    return float(sim)


def compare_baseline_vs_ft(cv_text, jobs, base_model, ft_model):
    rows = []
    for job in jobs:
        b = score_pair(base_model, cv_text, job["text"])
        f = score_pair(ft_model, cv_text, job["text"])
        rows.append((job["id"], job["title"], b, f))
    rows_sorted = sorted(rows, key=lambda x: x[3], reverse=True)

    print("ID | Title                              | Baseline  | Fine-tuned")
    print("-" * 70)
    for jid, title, b, f in rows_sorted:
        print(f"{jid:<2} | {title:<32} | {b:8.3f} | {f:10.3f}")


compare_baseline_vs_ft(cv_text, jobs, base_model, ft_model)


### Qualitative Results

- **Baseline SBERT fails to differentiate roles.**  
  All five test jobs received very high similarity scores (0.86–0.96), including clearly unrelated roles like *Kindergarten Teacher* and *Frontend Engineer*. This confirms that the baseline model cannot reliably separate relevant from irrelevant matches.

- **Fine-tuned SBERT produces a meaningful ranking.**  
  It assigns:
  - **very high** similarity to the correct role (*Data Analyst: 0.971*)  
  - **moderate** similarity to a partially related role (*BI Developer: 0.693*)  
  - **mid-level** similarity to loosely related roles (*ML Engineer: 0.391*)  
  - **very low** similarity to unrelated ones (*Frontend Engineer: 0.064*, *Kindergarten Teacher: 0.010*)  
  This matches human intuition and shows that the model learned real CV–job matching semantics.

- **Generalises beyond training format.**  
  Although trained only on structured job profiles (title + skill list), the fine-tuned model also performed well on **manually written, natural-language job descriptions** in this small test, suggesting it is not overfitting to the input format.

- **Learns true skill–role alignment.**  
  The model effectively captures the semantic relationship between:
  - a candidate’s **experience + skills**, and  
  - a job’s **title + required skills**,  
  which is exactly the desired behaviour for a resume–job matching system.


## Overall Conclusion
- The fine-tuning process successfully **reshaped the embedding space**, creating a clear separation between matching and non-matching CV–job pairs. This was the main objective: improving classification performance for a job recommendation system.

- Data cleaning helped reduce noise, but the **true performance gains** came from contrastive fine-tuning on pseudo-labelled pairs. The qualitative results suggest the model learned deeper semantic relationships, not just surface-level keyword or skill matching.

- The strong separation margin, with negative pairs pushed close to zero, demonstrates that the fine-tuned model now produces **far more meaningful and discriminative similarity scores**, making it highly effective for classification.
