# Embedding-Based Zero-Shot Classification (v2 - Improved)

**Approach**: Multi-prototype semantic similarity using sentence embeddings

This approach improves upon rule-based matching by:
1. Creating **rich input text** from CV positions (title + company + description)
2. Embedding all texts using sentence-transformers (multilingual model)
3. Creating **multiple prototypes per label** via K-Means clustering (captures diversity)
4. Computing cosine similarity between input and all prototypes
5. Predicting label with highest similarity to any of its prototypes
6. Storing top-k similar labels for explainability

**Key Improvements over v1**:
- **More context**: Uses position description (first 200 chars) for richer embeddings
- **Multiple prototypes**: 3 sub-prototypes per label capture semantic diversity
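- **Zero-shot**: No classifier is trained; prediction is purely similarity-based (only the fallback thresholds are tuned on a dev split)
- **Multilingual**: The multilingual model embeds German, French, and English text in a shared vector space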
- **Explainability**: Top-k most similar labels with confidence scores

**Model**: `paraphrase-multilingual-MiniLM-L12-v2` (384-dimensional embeddings)

**Training Data**: Lookup tables (10,145 dept + 9,428 seniority examples) → 11×3 + 6×3 prototypes  
**Validation Data**: 478 annotated LinkedIn CVs (used only for threshold tuning and evaluation, never for building prototypes)

In [1]:
import pandas as pd
import numpy as np
import json
from pathlib import Path
from datetime import datetime
from sklearn.metrics import accuracy_score, precision_recall_fscore_support, classification_report, confusion_matrix
import matplotlib.pyplot as plt
import seaborn as sns
import time

# Import data loaders and models
import sys
sys.path.append('../')
from src.data.loader import load_label_lists, load_inference_dataset, load_evaluation_dataset
from src.models.embedding_classifier import EmbeddingClassifier, create_domain_classifier, create_seniority_classifier

# Paths
DATA_DIR = Path('../data')
RESULTS_DIR = Path('./results')
RESULTS_DIR.mkdir(exist_ok=True)

%matplotlib inline

## 1. Load Training Data (Lookup Tables)

In [2]:
# Load lookup tables
dept_df, sen_df = load_label_lists(DATA_DIR, max_per_class=None)

print(f"Department lookup: {len(dept_df):,} examples")
print(f"Seniority lookup:  {len(sen_df):,} examples")
print(f"\nUnique departments: {dept_df['label'].nunique()}")
print(f"Unique seniority levels: {sen_df['label'].nunique()}")

Applying encoding fix...
  Deduplication: 10145 -> 10145 (removed 0 duplicates)
  Deduplication: 9428 -> 9428 (removed 0 duplicates)
Department lookup: 10,145 examples
Seniority lookup:  9,428 examples

Unique departments: 11
Unique seniority levels: 5


## 2. Zero-Shot Embedding Classifier

**Approach**: Pure zero-shot classification using semantic similarity

**Key Steps**:
1. Create consistent input text from CV positions (robust handling of missing values)
2. Generate embeddings for input texts and lookup table entries
3. Aggregate lookup embeddings per label into prototypes (a single mean in v1; K-Means sub-prototypes in v2, see Section 3)
4. Compute cosine similarity between input and all label prototypes
5. Select label with highest similarity (optional: threshold-based fallback)
6. Store top-k similarities for explainability
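
As a rough illustration of steps 3 to 5 in their simplest (single mean prototype) form, the sketch below uses hand-written 3-dimensional vectors in place of real sentence embeddings. The helper names `l2_normalize` and `mean_prototype` and all toy values are assumptions made for this example, not part of the notebook's code.

```python
import numpy as np

def l2_normalize(v):
    """Scale a vector (or each row of a matrix) to unit length."""
    v = np.asarray(v, dtype=float)
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

def mean_prototype(embeddings):
    """Average unit-length embeddings, then re-normalize the mean."""
    return l2_normalize(np.mean(embeddings, axis=0))

# Toy 3-dimensional "embeddings" (hypothetical values, not model output)
toy_lookup = {
    "Sales": l2_normalize([[1.0, 0.1, 0.0], [0.9, 0.2, 0.1]]),
    "Information Technology": l2_normalize([[0.0, 1.0, 0.2], [0.1, 0.9, 0.3]]),
}
toy_prototypes = {label: mean_prototype(e) for label, e in toy_lookup.items()}

query = l2_normalize([0.8, 0.3, 0.0])
# For unit vectors, cosine similarity reduces to a plain dot product
scores = {label: float(query @ proto) for label, proto in toy_prototypes.items()}
best_label = max(scores, key=scores.get)
```

Re-normalizing the mean matters because the average of unit vectors is generally shorter than unit length, which would otherwise deflate the dot-product scores.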

**Implementation Details**:
- Model: `paraphrase-multilingual-MiniLM-L12-v2` (multilingual support)
- Caching: Store prototypes locally for faster re-runs
- Evaluation: Accuracy, macro-F1, weighted-F1, precision, recall
- Missing "Professional" seniority: Added manually (found in EDA, missing in CSV)
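
Macro-F1 and weighted-F1 can diverge sharply on imbalanced label sets like the ones evaluated here. The following sketch computes both by hand on a made-up example (the labels and predictions are invented for illustration), so the difference is visible without relying on any library call.

```python
from collections import Counter

def per_class_f1(y_true, y_pred):
    """Return {label: F1} from raw counts, using 0.0 when F1 is undefined."""
    labels = sorted(set(y_true) | set(y_pred))
    f1 = {}
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        f1[label] = 2 * tp / denom if denom else 0.0
    return f1

def macro_and_weighted_f1(y_true, y_pred):
    """Macro: unweighted mean over classes. Weighted: mean weighted by true support."""
    f1 = per_class_f1(y_true, y_pred)
    support = Counter(y_true)
    macro = sum(f1.values()) / len(f1)
    weighted = sum(f1[label] * support[label] for label in f1) / len(y_true)
    return macro, weighted

# Hypothetical imbalanced example: the classifier always answers "Other"
y_true = ["Other"] * 8 + ["Sales"] * 2
y_pred = ["Other"] * 10
macro, weighted = macro_and_weighted_f1(y_true, y_pred)
```

Here the majority class scores well while the minority class scores zero, so the weighted average stays high and the macro average drops. This is why macro-F1 is the stricter of the two metrics when small classes are missed.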

In [3]:
# Helper function: Create RICH input text from position data with more context
def create_input_text(row):
    parts = []
    
    # 1. Title (most important)
    if 'title' in row and pd.notna(row['title']) and row['title'].strip():
        parts.append(row['title'])
    
    # 2. Company context
    if 'company' in row and pd.notna(row['company']) and row['company'].strip():
        parts.append(f"at {row['company']}")
    
    # 3. Position description (NEW: first 200 chars for additional context)
    # This is the KEY improvement - captures job responsibilities/skills
    if 'text' in row and pd.notna(row['text']) and row['text'].strip():
        # 'text' column often contains position description
        desc = row['text'][:200].strip()  # Limit to avoid noise
        if desc and desc not in ' '.join(parts):  # Avoid duplication
            parts.append(desc)
    
    if not parts:
        return "Unknown Position"
    
    return ' '.join(parts)

## 3. Build Multiple Prototypes per Label with K-Means Clustering

**New Approach**: Instead of 1 mean prototype per label, create **3 sub-prototypes** via K-Means.
This captures semantic diversity within each label (e.g., "Software Engineer" vs "Backend Developer").

Prototypes are cached to disk for faster re-runs.
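
To see why several prototypes per label can help, consider the toy case below. It is a sketch under simplifying assumptions: the vectors are 2-dimensional, and the two sub-prototypes of label "A" are hand-picked rather than produced by K-Means.

```python
import numpy as np

def unit(v):
    """Return v scaled to unit length."""
    v = np.asarray(v, dtype=float)
    return v / np.linalg.norm(v)

# Hypothetical toy vectors; sub-prototypes are hand-picked, not K-Means output
label_a_modes = [unit([1.0, 0.0]), unit([0.0, 1.0])]  # two distinct clusters of label "A"
label_a_mean = unit(np.mean(label_a_modes, axis=0))   # single mean prototype for "A"
label_b = unit([0.9, -0.3])                           # single prototype for label "B"

query = unit([1.0, 0.0])  # lies exactly on the first mode of "A"

# Single mean prototype per label
mean_scores = {"A": float(query @ label_a_mean), "B": float(query @ label_b)}
# Multiple prototypes: take the maximum similarity over a label's sub-prototypes
multi_scores = {
    "A": max(float(query @ p) for p in label_a_modes),
    "B": float(query @ label_b),
}

mean_pred = max(mean_scores, key=mean_scores.get)
multi_pred = max(multi_scores, key=multi_scores.get)
```

The mean of two dissimilar clusters lands between them and matches neither well, so the query is assigned to "B". Taking the maximum over sub-prototypes recovers "A".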

In [4]:
from sentence_transformers import SentenceTransformer
import pickle

# Cache directory
CACHE_DIR = RESULTS_DIR / 'embedding_cache'
CACHE_DIR.mkdir(exist_ok=True)

DEPT_CACHE = CACHE_DIR / 'dept_prototypes.pkl'
SEN_CACHE = CACHE_DIR / 'sen_prototypes.pkl'

# Model for embeddings
MODEL_NAME = 'paraphrase-multilingual-MiniLM-L12-v2'

print(f"Model: {MODEL_NAME}")
print(f"Cache directory: {CACHE_DIR}")
print(f"Department cache: {DEPT_CACHE.exists()}")
print(f"Seniority cache: {SEN_CACHE.exists()}")

Model: paraphrase-multilingual-MiniLM-L12-v2
Cache directory: results/embedding_cache
Department cache: False
Seniority cache: False


In [5]:
# Force cache rebuild to ensure latest changes are applied
import shutil

if CACHE_DIR.exists():
    shutil.rmtree(CACHE_DIR)
    print("Cache deleted")
    
CACHE_DIR.mkdir(exist_ok=True)
print(f"Cache directory ready: {CACHE_DIR}")

Cache deleted
Cache directory ready: results/embedding_cache


In [6]:
def build_label_prototypes(label_df, cache_path, model_name=MODEL_NAME, n_prototypes=3, force_rebuild=False):
    from sklearn.cluster import KMeans
    
    # Check cache
    if cache_path.exists() and not force_rebuild:
        print(f"Loading prototypes from cache: {cache_path.name}")
        with open(cache_path, 'rb') as f:
            return pickle.load(f)
    
    print(f"Building {n_prototypes} prototypes per label (model: {model_name})...")
    model = SentenceTransformer(model_name)
    
    prototypes = {}
    labels = label_df['label'].unique()
    
    for label in labels:
        # Get all examples for this label
        examples = label_df[label_df['label'] == label]['text'].tolist()
        
        # Embed all examples
        embeddings = model.encode(examples, convert_to_numpy=True, normalize_embeddings=True, 
                                batch_size=32, show_progress_bar=False)
        
        # K-Means clustering to find n_prototypes centers
        if len(examples) >= n_prototypes:
            kmeans = KMeans(n_clusters=n_prototypes, random_state=42, n_init=10)
            kmeans.fit(embeddings)
            # Use cluster centers as prototypes
            label_prototypes = []
            for center in kmeans.cluster_centers_:
                # Normalize each prototype
                center_normalized = center / np.linalg.norm(center)
                label_prototypes.append(center_normalized)
        else:
            # Fallback: too few examples for K-Means, so use a single normalized mean prototype
            mean_prototype = np.mean(embeddings, axis=0)
            mean_prototype = mean_prototype / np.linalg.norm(mean_prototype)
            label_prototypes = [mean_prototype]
        
        prototypes[label] = label_prototypes
        print(f"  {label}: {len(examples)} examples → {len(label_prototypes)} prototypes")
    
    # Save to cache
    with open(cache_path, 'wb') as f:
        pickle.dump(prototypes, f)
    print(f"Saved to cache: {cache_path.name}\n")
    
    return prototypes


# Build department prototypes
dept_prototypes = build_label_prototypes(dept_df, DEPT_CACHE)

print(f"Department prototypes: {len(dept_prototypes)} labels")

Building 3 prototypes per label (model: paraphrase-multilingual-MiniLM-L12-v2)...


  Marketing: 4295 examples → 3 prototypes
  Project Management: 201 examples → 3 prototypes
  Administrative: 83 examples → 3 prototypes
  Business Development: 620 examples → 3 prototypes
  Consulting: 167 examples → 3 prototypes
  Human Resources: 31 examples → 3 prototypes
  Information Technology: 1305 examples → 3 prototypes
  Other: 42 examples → 3 prototypes
  Purchasing: 40 examples → 3 prototypes
  Sales: 3328 examples → 3 prototypes
  Customer Support: 33 examples → 3 prototypes
Saved to cache: dept_prototypes.pkl

Department prototypes: 11 labels


In [7]:
# Build seniority prototypes
# CRITICAL: Missing "Professional" label (found in EDA, missing in CSV)
ALL_SENIORITY_LABELS = ['Junior', 'Professional', 'Senior', 'Lead', 'Management', 'Director']

# For missing "Professional", create synthetic examples
sen_df_extended = sen_df.copy()
professional_examples = pd.DataFrame({
    'text': ['Professional', 'Professional Position', 'Professional Role'],
    'label': ['Professional'] * 3
})
sen_df_extended = pd.concat([sen_df_extended, professional_examples], ignore_index=True)

sen_prototypes = build_label_prototypes(sen_df_extended, SEN_CACHE)

print(f"Seniority prototypes: {len(sen_prototypes)} labels")
print(f"Labels: {list(sen_prototypes.keys())}")

Building 3 prototypes per label (model: paraphrase-multilingual-MiniLM-L12-v2)...
  Junior: 409 examples → 3 prototypes
  Senior: 3733 examples → 3 prototypes
  Lead: 3546 examples → 3 prototypes
  Management: 756 examples → 3 prototypes
  Director: 984 examples → 3 prototypes
  Professional: 3 examples → 3 prototypes
Saved to cache: sen_prototypes.pkl

Seniority prototypes: 6 labels
Labels: ['Junior', 'Senior', 'Lead', 'Management', 'Director', 'Professional']


In [8]:
# Load evaluation dataset
print("Loading evaluation dataset...")
eval_df = load_evaluation_dataset(DATA_DIR)

print(f"Evaluation samples: {len(eval_df)}")
print(f"Columns: {list(eval_df.columns)}")
print(f"\nSample data:")
print(eval_df[['title', 'company', 'department', 'seniority']].head(3))

Loading evaluation dataset...
Evaluation samples: 478
Columns: ['cv_id', 'title', 'company', 'text', 'department', 'seniority']

Sample data:
                     title             company              department  \
0                Prokurist   Depot4Design GmbH                   Other   
1      Solutions Architect  Computer Solutions  Information Technology   
2  Medizintechnik Beratung           Udo Weber              Consulting   

      seniority  
0    Management  
1  Professional  
2  Professional  


In [9]:
def predict_with_topk(input_text, prototypes, model, top_k=3):
    # Embed input
    input_embedding = model.encode(input_text, convert_to_numpy=True, normalize_embeddings=True)
    
    # Compute similarity to each label (max over all prototypes)
    label_similarities = {}
    for label, label_prototypes in prototypes.items():
        # Compute similarity to all prototypes for this label
        similarities = [np.dot(input_embedding, proto) for proto in label_prototypes]
        # Take MAX similarity across prototypes
        max_sim = max(similarities)
        label_similarities[label] = max_sim
    
    # Sort by similarity
    sorted_labels = sorted(label_similarities.items(), key=lambda x: x[1], reverse=True)
    
    # Top-k results
    topk_dict = dict(sorted_labels[:top_k])
    top_pred = sorted_labels[0][0]
    top_conf = sorted_labels[0][1]
    
    return top_pred, top_conf, topk_dict
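
The same max-over-prototypes rule can also be applied to many inputs at once with a single matrix product, which avoids encoding and scoring one row at a time. The sketch below is one possible vectorized variant, not the notebook's implementation: `stack_prototypes` and `predict_batch` are hypothetical helpers, and the toy arrays stand in for embeddings that would come from one `model.encode(list_of_texts, convert_to_numpy=True, normalize_embeddings=True)` call.

```python
import numpy as np

def stack_prototypes(prototypes):
    """Flatten {label: [vector, ...]} into a matrix plus an array of owning-label indices."""
    labels = list(prototypes)
    matrix = np.vstack([p for label in labels for p in prototypes[label]])
    owner = np.array([i for i, label in enumerate(labels) for _ in prototypes[label]])
    return labels, matrix, owner

def predict_batch(embeddings, prototypes):
    """Return (predicted_labels, max_similarities) for an (n, d) array of unit vectors."""
    labels, matrix, owner = stack_prototypes(prototypes)
    sims = embeddings @ matrix.T  # shape: (n, total number of prototypes)
    # Per-label maximum over that label's prototypes -> shape: (n, number of labels)
    label_sims = np.stack(
        [sims[:, owner == i].max(axis=1) for i in range(len(labels))], axis=1
    )
    best = label_sims.argmax(axis=1)
    return [labels[i] for i in best], label_sims.max(axis=1)

# Hypothetical 2-D prototypes and inputs, already unit length
toy_prototypes = {
    "A": [np.array([1.0, 0.0]), np.array([0.0, 1.0])],
    "B": [np.array([-1.0, 0.0])],
}
toy_embeddings = np.array([[0.0, 1.0], [-0.8, 0.6]])
preds, confs = predict_batch(toy_embeddings, toy_prototypes)
```

The returned similarities play the same role as `top_conf` above, so they could feed the threshold step unchanged.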

## 4. Dev/Test Split

Split the annotated data 70/30 into a dev set (threshold tuning) and a test set (final evaluation), stratified by department.

In [10]:
from sklearn.model_selection import train_test_split

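# Split annotated data: 70% dev (for threshold tuning), 30% test (for final evaluation)
dev_df, test_df = train_test_split(eval_df, test_size=0.3, random_state=42, stratify=eval_df['department'])
# Work on explicit copies so adding columns later does not trigger SettingWithCopyWarning
dev_df, test_df = dev_df.copy(), test_df.copy()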

print(f"Total annotated samples: {len(eval_df)}")
print(f"Dev set (threshold tuning): {len(dev_df)} samples ({len(dev_df)/len(eval_df)*100:.1f}%)")
print(f"Test set (final evaluation): {len(test_df)} samples ({len(test_df)/len(eval_df)*100:.1f}%)")
print(f"\nDev set department distribution:")
print(dev_df['department'].value_counts().sort_index())
print(f"\nDev set seniority distribution:")
print(dev_df['seniority'].value_counts().sort_index())

Total annotated samples: 478
Dev set (threshold tuning): 334 samples (69.9%)
Test set (final evaluation): 144 samples (30.1%)

Dev set department distribution:
department
Administrative              6
Business Development       12
Consulting                 20
Customer Support            4
Human Resources            10
Information Technology     38
Marketing                  13
Other                     175
Project Management         22
Purchasing                  8
Sales                      26
Name: count, dtype: int64

Dev set seniority distribution:
seniority
Director         19
Junior            7
Lead             62
Management       98
Professional    118
Senior           30
Name: count, dtype: int64


## 5. Threshold Tuning on Dev Set

Find optimal similarity threshold for each task.
If max_similarity < threshold → use fallback label (Other/Professional), else use top-1 prediction.
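
The fallback rule itself is small enough to state as a function. In this minimal sketch, `apply_threshold` is a hypothetical helper name and the example values are invented.

```python
def apply_threshold(preds, sims, threshold, fallback):
    """Replace predictions whose max similarity is below the threshold with the fallback label."""
    return [fallback if sim < threshold else pred for pred, sim in zip(preds, sims)]

# Hypothetical raw predictions and their max similarities
raw_preds = ["Sales", "Marketing", "Consulting"]
max_sims = [0.71, 0.48, 0.52]
final_preds = apply_threshold(raw_preds, max_sims, threshold=0.52, fallback="Other")
```

Because the comparison is strict, a similarity exactly equal to the threshold keeps its raw prediction.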

In [11]:
# Create input texts for dev set
print("Creating input texts for dev set...")
dev_df['input_text'] = dev_df.apply(create_input_text, axis=1)
print(f"✓ Created input texts for {len(dev_df)} dev samples")

Creating input texts for dev set...
✓ Created input texts for 334 dev samples


In [12]:
# Load model for predictions
model = SentenceTransformer(MODEL_NAME)
print(f"Model loaded: {MODEL_NAME}")

Model loaded: paraphrase-multilingual-MiniLM-L12-v2


In [13]:
# Get predictions and max similarities for dev set
print("Computing predictions and similarities for dev set...")

dev_dept_preds = []
dev_dept_sims = []  # max similarity scores
dev_sen_preds = []
dev_sen_sims = []

for _, row in dev_df.iterrows():
    # Department
    dept_pred, dept_conf, _ = predict_with_topk(row['input_text'], dept_prototypes, model, top_k=1)
    dev_dept_preds.append(dept_pred)
    dev_dept_sims.append(dept_conf)
    
    # Seniority
    sen_pred, sen_conf, _ = predict_with_topk(row['input_text'], sen_prototypes, model, top_k=1)
    dev_sen_preds.append(sen_pred)
    dev_sen_sims.append(sen_conf)

dev_df['dept_pred_raw'] = dev_dept_preds  # predictions without threshold
dev_df['dept_max_sim'] = dev_dept_sims
dev_df['sen_pred_raw'] = dev_sen_preds
dev_df['sen_max_sim'] = dev_sen_sims

print(f"✓ Computed predictions for {len(dev_df)} dev samples")
print(f"\nDepartment max similarity stats:")
print(f"  Mean: {np.mean(dev_dept_sims):.3f}")
print(f"  Min:  {np.min(dev_dept_sims):.3f}")
print(f"  Max:  {np.max(dev_dept_sims):.3f}")
print(f"\nSeniority max similarity stats:")
print(f"  Mean: {np.mean(dev_sen_sims):.3f}")
print(f"  Min:  {np.min(dev_sen_sims):.3f}")
print(f"  Max:  {np.max(dev_sen_sims):.3f}")

Computing predictions and similarities for dev set...
✓ Computed predictions for 334 dev samples

Department max similarity stats:
  Mean: 0.599
  Min:  0.276
  Max:  0.857

Seniority max similarity stats:
  Mean: 0.599
  Min:  0.201
  Max:  0.894


### 5.1 Department Threshold Tuning

In [14]:
# Grid search for best department threshold
print("Grid search for optimal department threshold...")
print("=" * 80)

DEPT_FALLBACK = "Other"
thresholds = np.arange(0.20, 0.62, 0.02)  # 0.20 to 0.60 in 0.02 steps

dept_threshold_results = []

for threshold in thresholds:
    # Apply threshold: if max_sim < threshold, use fallback, else use raw prediction
    dept_preds_thresh = [
        DEPT_FALLBACK if sim < threshold else pred
        for pred, sim in zip(dev_df['dept_pred_raw'], dev_df['dept_max_sim'])
    ]
    
    # Compute metrics
    dept_true = dev_df['department'].tolist()
    acc = accuracy_score(dept_true, dept_preds_thresh)
    _, _, f1_macro, _ = precision_recall_fscore_support(dept_true, dept_preds_thresh, average='macro', zero_division=0)
    _, _, f1_weighted, _ = precision_recall_fscore_support(dept_true, dept_preds_thresh, average='weighted', zero_division=0)
    
    dept_threshold_results.append({
        'threshold': threshold,
        'accuracy': acc,
        'f1_macro': f1_macro,
        'f1_weighted': f1_weighted
    })

# Convert to DataFrame for easy analysis
dept_results_df = pd.DataFrame(dept_threshold_results)

# Find best threshold (by macro F1)
best_idx = dept_results_df['f1_macro'].idxmax()
best_dept_threshold = dept_results_df.loc[best_idx, 'threshold']
best_dept_f1 = dept_results_df.loc[best_idx, 'f1_macro']

print(f"\n{'Threshold':<10} {'Accuracy':<10} {'F1-Macro':<10} {'F1-Weighted':<12}")
print("-" * 80)
for _, row in dept_results_df.iterrows():
    marker = " ← BEST" if row['threshold'] == best_dept_threshold else ""
    print(f"{row['threshold']:<10.2f} {row['accuracy']:<10.3f} {row['f1_macro']:<10.3f} {row['f1_weighted']:<12.3f}{marker}")

print("\n" + "=" * 80)
print(f"✓ Best Department Threshold: {best_dept_threshold:.2f} (F1-Macro: {best_dept_f1:.3f})")
print("=" * 80)

Grid search for optimal department threshold...

Threshold  Accuracy   F1-Macro   F1-Weighted 
--------------------------------------------------------------------------------
0.20       0.335      0.392      0.324       
0.22       0.335      0.392      0.324       
0.24       0.335      0.392      0.324       
0.26       0.335      0.392      0.324       
0.28       0.338      0.393      0.329       
0.30       0.338      0.393      0.329       
0.32       0.338      0.393      0.329       
0.34       0.335      0.375      0.331       
0.36       0.338      0.378      0.335       
0.38       0.344      0.380      0.344       
0.40       0.353      0.384      0.357       
0.42       0.377      0.394      0.388       
0.44       0.404      0.415      0.422       
0.46       0.422      0.426      0.441       
0.48       0.434      0.427      0.455       
0.50       0.446      0.432      0.468       
0.52       0.467      0.439      0.484        ← BEST
0.54       0.476      0.433      0.

### 5.2 Seniority Threshold Tuning

In [15]:
# Grid search for best seniority threshold
print("Grid search for optimal seniority threshold...")
print("=" * 80)

SEN_FALLBACK = "Professional"

sen_threshold_results = []

for threshold in thresholds:
    # Apply threshold: if max_sim < threshold, use fallback, else use raw prediction
    sen_preds_thresh = [
        SEN_FALLBACK if sim < threshold else pred
        for pred, sim in zip(dev_df['sen_pred_raw'], dev_df['sen_max_sim'])
    ]
    
    # Compute metrics
    sen_true = dev_df['seniority'].tolist()
    acc = accuracy_score(sen_true, sen_preds_thresh)
    _, _, f1_macro, _ = precision_recall_fscore_support(sen_true, sen_preds_thresh, average='macro', zero_division=0)
    _, _, f1_weighted, _ = precision_recall_fscore_support(sen_true, sen_preds_thresh, average='weighted', zero_division=0)
    
    sen_threshold_results.append({
        'threshold': threshold,
        'accuracy': acc,
        'f1_macro': f1_macro,
        'f1_weighted': f1_weighted
    })

# Convert to DataFrame
sen_results_df = pd.DataFrame(sen_threshold_results)

# Find best threshold (by macro F1)
best_idx = sen_results_df['f1_macro'].idxmax()
best_sen_threshold = sen_results_df.loc[best_idx, 'threshold']
best_sen_f1 = sen_results_df.loc[best_idx, 'f1_macro']

print(f"\n{'Threshold':<10} {'Accuracy':<10} {'F1-Macro':<10} {'F1-Weighted':<12}")
print("-" * 80)
for _, row in sen_results_df.iterrows():
    marker = " ← BEST" if row['threshold'] == best_sen_threshold else ""
    print(f"{row['threshold']:<10.2f} {row['accuracy']:<10.3f} {row['f1_macro']:<10.3f} {row['f1_weighted']:<12.3f}{marker}")

print("\n" + "=" * 80)
print(f"✓ Best Seniority Threshold: {best_sen_threshold:.2f} (F1-Macro: {best_sen_f1:.3f})")
print("=" * 80)

Grid search for optimal seniority threshold...

Threshold  Accuracy   F1-Macro   F1-Weighted 
--------------------------------------------------------------------------------
0.20       0.407      0.352      0.414       
0.22       0.407      0.352      0.414       
0.24       0.410      0.355      0.419       
0.26       0.410      0.355      0.419       
0.28       0.410      0.355      0.419       
0.30       0.413      0.358      0.425       
0.32       0.413      0.358      0.426       
0.34       0.413      0.358      0.426       
0.36       0.419      0.363      0.435       
0.38       0.428      0.370      0.449       
0.40       0.428      0.372      0.451       
0.42       0.443      0.383      0.471       
0.44       0.458      0.394      0.491       
0.46       0.464      0.397      0.501       
0.48       0.476      0.406      0.514       
0.50       0.491      0.417      0.525       
0.52       0.503      0.415      0.534       
0.54       0.512      0.418      0.536     

## 6. Final Evaluation on Test Set

Apply optimized thresholds to test set for final performance assessment.

In [16]:
# Create input texts for test set
print("Preparing test set...")
test_df['input_text'] = test_df.apply(create_input_text, axis=1)

# Get predictions and similarities for test set
test_dept_preds = []
test_dept_sims = []
test_sen_preds = []
test_sen_sims = []

for _, row in test_df.iterrows():
    # Department
    dept_pred, dept_conf, _ = predict_with_topk(row['input_text'], dept_prototypes, model, top_k=1)
    test_dept_preds.append(dept_pred)
    test_dept_sims.append(dept_conf)
    
    # Seniority
    sen_pred, sen_conf, _ = predict_with_topk(row['input_text'], sen_prototypes, model, top_k=1)
    test_sen_preds.append(sen_pred)
    test_sen_sims.append(sen_conf)

# Apply optimized thresholds
test_dept_preds_final = [
    DEPT_FALLBACK if sim < best_dept_threshold else pred
    for pred, sim in zip(test_dept_preds, test_dept_sims)
]

test_sen_preds_final = [
    SEN_FALLBACK if sim < best_sen_threshold else pred
    for pred, sim in zip(test_sen_preds, test_sen_sims)
]

print(f"✓ Computed predictions for {len(test_df)} test samples")
print(f"  Applied thresholds: Dept={best_dept_threshold:.2f}, Sen={best_sen_threshold:.2f}")

Preparing test set...
✓ Computed predictions for 144 test samples
  Applied thresholds: Dept=0.52, Sen=0.54


### 6.1 Test Set Results

In [17]:
# Compute final test metrics
dept_true_test = test_df['department'].tolist()
sen_true_test = test_df['seniority'].tolist()

# Department metrics
dept_accuracy = accuracy_score(dept_true_test, test_dept_preds_final)
dept_precision, dept_recall, dept_f1, _ = precision_recall_fscore_support(
    dept_true_test, test_dept_preds_final, average='macro', zero_division=0
)
_, _, dept_f1_weighted, _ = precision_recall_fscore_support(
    dept_true_test, test_dept_preds_final, average='weighted', zero_division=0
)

# Seniority metrics
sen_accuracy = accuracy_score(sen_true_test, test_sen_preds_final)
sen_precision, sen_recall, sen_f1, _ = precision_recall_fscore_support(
    sen_true_test, test_sen_preds_final, average='macro', zero_division=0
)
_, _, sen_f1_weighted, _ = precision_recall_fscore_support(
    sen_true_test, test_sen_preds_final, average='weighted', zero_division=0
)

# Print results
print("\n" + "="*80)
print("FINAL TEST SET RESULTS (with optimized thresholds)")
print("="*80)
print(f"\nOptimized Thresholds:")
print(f"  Department:  {best_dept_threshold:.2f} (Fallback: '{DEPT_FALLBACK}')")
print(f"  Seniority:   {best_sen_threshold:.2f} (Fallback: '{SEN_FALLBACK}')")
print("\n" + "-"*80)
print(f"Department Accuracy:       {dept_accuracy:.3f}")
print(f"Department F1 (macro):     {dept_f1:.3f}")
print(f"Department F1 (weighted):  {dept_f1_weighted:.3f}")
print()
print(f"Seniority Accuracy:        {sen_accuracy:.3f}")
print(f"Seniority F1 (macro):      {sen_f1:.3f}")
print(f"Seniority F1 (weighted):   {sen_f1_weighted:.3f}")
print("="*80)


FINAL TEST SET RESULTS (with optimized thresholds)

Optimized Thresholds:
  Department:  0.52 (Fallback: 'Other')
  Seniority:   0.54 (Fallback: 'Professional')

--------------------------------------------------------------------------------
Department Accuracy:       0.451
Department F1 (macro):     0.301
Department F1 (weighted):  0.476

Seniority Accuracy:        0.458
Seniority F1 (macro):      0.360
Seniority F1 (weighted):   0.499


In [18]:
# Detailed classification reports
print("\nDetailed Classification Report (Department - Test Set):")
print(classification_report(dept_true_test, test_dept_preds_final, zero_division=0))

print("\nDetailed Classification Report (Seniority - Test Set):")
print(classification_report(sen_true_test, test_sen_preds_final, zero_division=0))


Detailed Classification Report (Department - Test Set):
                        precision    recall  f1-score   support

        Administrative       0.08      0.33      0.12         3
  Business Development       0.20      0.40      0.27         5
            Consulting       0.35      0.88      0.50         8
      Customer Support       0.00      0.00      0.00         2
       Human Resources       0.00      0.00      0.00         5
Information Technology       0.67      0.24      0.35        17
             Marketing       0.33      0.60      0.43         5
                 Other       0.74      0.52      0.61        75
    Project Management       0.43      0.33      0.38         9
            Purchasing       0.20      0.25      0.22         4
                 Sales       0.42      0.45      0.43        11

              accuracy                           0.45       144
             macro avg       0.31      0.36      0.30       144
          weighted avg       0.57      0.45  

### 6.2 Save Final Results

In [19]:
# Save results with optimized thresholds to JSON
results = {
    'approach': 'embedding_zero_shot_v2',
    'model': MODEL_NAME,
    'timestamp': datetime.now().isoformat(),
    'config': {
        'n_prototypes': 3,
        'multi_prototype_clustering': 'KMeans',
        'input_text_enrichment': 'title + company + description (200 chars)',
        'text_normalization': False
    },
    'threshold_tuning': {
        'dev_samples': len(dev_df),
        'test_samples': len(test_df),
        'dept_threshold': float(best_dept_threshold),
        'dept_fallback': DEPT_FALLBACK,
        'sen_threshold': float(best_sen_threshold),
        'sen_fallback': SEN_FALLBACK,
        'tuning_metric': 'f1_macro',
        'threshold_range': '0.20-0.60 (step=0.02)'
    },
    'department': {
        'accuracy': float(dept_accuracy),
        'precision_macro': float(dept_precision),
        'recall_macro': float(dept_recall),
        'f1_macro': float(dept_f1),
        'f1_weighted': float(dept_f1_weighted)
    },
    'seniority': {
        'accuracy': float(sen_accuracy),
        'precision_macro': float(sen_precision),
        'recall_macro': float(sen_recall),
        'f1_macro': float(sen_f1),
        'f1_weighted': float(sen_f1_weighted)
    },
    'notes': 'Zero-shot with multi-prototypes (K-Means), richer input context, and optimized similarity thresholds via grid search on dev set.'
}

output_path = RESULTS_DIR / 'embedding_zeroshot_v2_results.json'
with open(output_path, 'w') as f:
    json.dump(results, f, indent=2)

print("="*80)
print("Real-World (Annotated LinkedIn CVs - Test Set):")
print(f"  Optimized Thresholds: Dept={best_dept_threshold:.2f}, Sen={best_sen_threshold:.2f}")
print("-" * 80)
print(f"Department Accuracy:       {dept_accuracy:.3f}")
print(f"Department F1 (macro):     {dept_f1:.3f}")
print(f"Department F1 (weighted):  {dept_f1_weighted:.3f}")
print(f"\nSeniority Accuracy:        {sen_accuracy:.3f}")
print(f"Seniority F1 (macro):      {sen_f1:.3f}")
print(f"Seniority F1 (weighted):   {sen_f1_weighted:.3f}")
print("="*80)

Real-World (Annotated LinkedIn CVs - Test Set):
  Optimized Thresholds: Dept=0.52, Sen=0.54
--------------------------------------------------------------------------------
Department Accuracy:       0.451
Department F1 (macro):     0.301
Department F1 (weighted):  0.476

Seniority Accuracy:        0.458
Seniority F1 (macro):      0.360
Seniority F1 (weighted):   0.499
