# Hybrid: Rule-Based + Embedding Classifier

**Approach**: Combine strengths of both rule-based and embedding approaches

**Decision Logic**:
1. **Rule-Based Exact Match**: If exact/direct match found → Use rule-based prediction
2. **Embedding Fallback**: Otherwise → Use embedding-based prediction
3. **Rule-Based Safety Net**: If embedding confidence < 50% → Fall back to rule-based (fuzzy/keyword)

**Rationale**:
- **Rule-Based (Notebook 02)**: High precision on exact/fuzzy matches (90%+ confidence); scored 72.8% on department and 61.1% on seniority
- **Embeddings (Notebook 03)**: Better generalization to unseen terms through semantic similarity; scored 32.4% on department and 42.1% on seniority
- **Safety Net**: Avoids relying on embedding predictions when their confidence is low

**Expected Benefit**: Best of both worlds - precision + flexibility

**Code Reuse**: This notebook imports and combines classifiers from:
- Notebook 02: `create_department_classifier()`, `create_seniority_classifier()` from `src.models.rule_based`
- Notebook 03: Embedding prototypes and prediction logic

In [1]:
import pandas as pd
import numpy as np
import json
from pathlib import Path
from datetime import datetime
from sklearn.metrics import accuracy_score, precision_recall_fscore_support, classification_report
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
import seaborn as sns
import pickle

# Import data loaders and models
import sys
sys.path.append('../')
from src.data.loader import load_label_lists, load_evaluation_dataset
from src.models.rule_based import RuleConfig, create_department_classifier, create_seniority_classifier
from src.models.embedding_classifier import (
    EmbeddingConfig,
    create_domain_classifier,
    create_seniority_classifier as create_seniority_embedding_classifier,
)

# Paths
DATA_DIR = Path('../data')
RESULTS_DIR = Path('./results')
RESULTS_DIR.mkdir(exist_ok=True)

%matplotlib inline

## 1. Load Data

In [2]:
# Load lookup tables
dept_df, sen_df = load_label_lists(DATA_DIR, max_per_class=None)

print(f"Department lookup: {len(dept_df):,} examples")
print(f"Seniority lookup:  {len(sen_df):,} examples")

# Load evaluation data
eval_df = load_evaluation_dataset(DATA_DIR)
print(f"\nEvaluation samples: {len(eval_df)}")

Applying encoding fix...
  Deduplication: 10145 -> 10145 (removed 0 duplicates)
  Deduplication: 9428 -> 9428 (removed 0 duplicates)
Department lookup: 10,145 examples
Seniority lookup:  9,428 examples

Evaluation samples: 478


In [3]:
# Split into dev/test (same as embedding notebook)
dev_df, test_df = train_test_split(eval_df, test_size=0.3, random_state=42, stratify=eval_df['department'])

print(f"Dev set: {len(dev_df)} samples")
print(f"Test set: {len(test_df)} samples")

Dev set: 334 samples
Test set: 144 samples
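
To make the effect of `stratify=eval_df['department']` concrete, here is a minimal standard-library sketch of a stratified split. It is a conceptual illustration only: `stratified_split` is a hypothetical helper, the labels are toy data, and scikit-learn's actual allocation and shuffling differ in detail.

```python
import random
from collections import Counter, defaultdict

def stratified_split(labels, test_size=0.3, seed=42):
    """Return (dev_idx, test_idx) with roughly the same label mix in both parts."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    dev_idx, test_idx = [], []
    for idxs in by_label.values():
        rng.shuffle(idxs)
        # Take the same share of every class for the test part
        n_test = round(len(idxs) * test_size)
        test_idx.extend(idxs[:n_test])
        dev_idx.extend(idxs[n_test:])
    return sorted(dev_idx), sorted(test_idx)

# Toy labels (made up): one dominant class plus two smaller ones
labels = ["Other"] * 50 + ["Sales"] * 30 + ["Marketing"] * 20
dev_idx, test_idx = stratified_split(labels)
test_counts = Counter(labels[i] for i in test_idx)
print(len(dev_idx), len(test_idx))
print(dict(test_counts))
```

Each class contributes 30% of its samples to the test part, so rare departments are not lost from either split by chance.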


## 2. Rule-Based Classifiers (from Notebook 02)

Using the optimized rule-based classifiers with text normalization.

In [4]:
# Configure rule-based classifiers (from Notebook 02)
config_dept = RuleConfig(
    fuzzy_threshold=0.8, 
    use_text_normalization=True,
    default_label="Other"
)

config_sen = RuleConfig(
    fuzzy_threshold=0.8, 
    use_text_normalization=True,
    default_label="Professional"
)

# Create classifiers using factory functions from src.models.rule_based
dept_clf_rule = create_department_classifier(dept_df, config=config_dept)
sen_clf_rule = create_seniority_classifier(sen_df, config=config_sen)

print("✅ Rule-Based Classifiers created")
print(f"   Department: Text normalization + Keywords + Fuzzy matching")
print(f"   Seniority: Text normalization + Keywords + Fuzzy matching")

✅ Rule-Based Classifiers created
   Department: Text normalization + Keywords + Fuzzy matching
   Seniority: Text normalization + Keywords + Fuzzy matching
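
The internals of `src.models.rule_based` are not shown in this notebook. As a rough illustration of how normalization, exact lookup, and a fuzzy threshold of 0.8 can interact, the sketch below uses only the standard library. `normalize`, `rule_predict`, the toy lookup, and the method names `'fuzzy'` and `'default'` are assumptions, and the keyword step is omitted.

```python
import difflib
import re

def normalize(text):
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    text = re.sub(r"[^\w\s]", " ", text.lower())
    return " ".join(text.split())

def rule_predict(title, lookup, fuzzy_threshold=0.8, default_label="Other"):
    """Return (label, confidence, method), trying exact then fuzzy matching."""
    key = normalize(title)
    if key in lookup:
        return lookup[key], 1.0, "exact"

    # Best fuzzy candidate by difflib similarity ratio (between 0 and 1)
    best_key, best_score = None, 0.0
    for candidate in lookup:
        score = difflib.SequenceMatcher(None, key, candidate).ratio()
        if score > best_score:
            best_key, best_score = candidate, score
    if best_key is not None and best_score >= fuzzy_threshold:
        return lookup[best_key], best_score, "fuzzy"

    # Nothing close enough: fall back to the configured default label
    return default_label, 0.0, "default"

# Toy lookup table (hypothetical entries, already normalized)
lookup = {"software engineer": "Information Technology", "sales manager": "Sales"}
exact = rule_predict("Software Engineer", lookup)
fuzzy = rule_predict("Sofware Enginer", lookup)
miss = rule_predict("Beekeeper", lookup)
print(exact, fuzzy, miss, sep="\n")
```

A misspelled title still resolves through the fuzzy step, while an unrelated title falls through to `default_label`, which mirrors the role of `"Other"` and `"Professional"` in the configs above.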


## 3. Embedding Classifiers (from embedding_classifier.py)

Using the EmbeddingClassifier with averaged example embeddings.
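
"Averaged example embeddings" can be read as one prototype vector per label, with new inputs assigned to the most similar prototype. The sketch below illustrates that idea with toy 3-dimensional vectors and plain Python. It assumes cosine similarity as the confidence score, which the notebook does not confirm, and all function names are hypothetical.

```python
import math
from collections import defaultdict

def mean_vector(vectors):
    """Element-wise mean of equally sized vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def fit_prototypes(embeddings, labels):
    """Average the example embeddings of each label into one prototype."""
    grouped = defaultdict(list)
    for vec, label in zip(embeddings, labels):
        grouped[label].append(vec)
    return {label: mean_vector(vecs) for label, vecs in grouped.items()}

def predict(vec, prototypes):
    """Return the label of the most similar prototype and its similarity."""
    scores = {label: cosine(vec, proto) for label, proto in prototypes.items()}
    best = max(scores, key=scores.get)
    return best, scores[best]

# Toy "embeddings" standing in for 384-dimensional sentence vectors
embeddings = [[1.0, 0.1, 0.0], [0.9, 0.2, 0.0], [0.0, 0.1, 1.0], [0.1, 0.0, 0.9]]
labels = ["Sales", "Sales", "Marketing", "Marketing"]
prototypes = fit_prototypes(embeddings, labels)
label, score = predict([0.8, 0.1, 0.1], prototypes)
print(label, round(score, 3))
```

With thousands of lookup examples per task, each label collapses into a single averaged vector.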

In [5]:
# Model configuration (from embedding_classifier.py)
MODEL_NAME = 'paraphrase-multilingual-MiniLM-L12-v2'

print(f"Building embedding classifiers...")
print(f"Model: {MODEL_NAME}")

Building embedding classifiers...
Model: paraphrase-multilingual-MiniLM-L12-v2


In [6]:
# Helper function for rich input text
def create_input_text(row):
    """Create rich input text from CV position (title + company + description)."""
    parts = []
    
    if 'title' in row and pd.notna(row['title']) and row['title'].strip():
        parts.append(row['title'])
    
    if 'company' in row and pd.notna(row['company']) and row['company'].strip():
        parts.append(f"at {row['company']}")
    
    if 'text' in row and pd.notna(row['text']) and row['text'].strip():
        desc = row['text'][:200].strip()
        if desc and desc not in ' '.join(parts):
            parts.append(desc)
    
    return ' '.join(parts) if parts else "Unknown Position"

In [7]:
# Build embedding classifiers using factory functions from embedding_classifier.py
print("Creating Department Embedding Classifier...")
dept_clf_emb = create_domain_classifier(dept_df, model_name=MODEL_NAME, use_examples=True)

# Add missing "Professional" for seniority
sen_df_extended = sen_df.copy()
professional_examples = pd.DataFrame({
    'text': ['Professional', 'Professional Position', 'Professional Role'],
    'label': ['Professional'] * 3
})
sen_df_extended = pd.concat([sen_df_extended, professional_examples], ignore_index=True)

print("Creating Seniority Embedding Classifier...")
sen_clf_emb = create_seniority_embedding_classifier(sen_df_extended, model_name=MODEL_NAME, use_examples=True)

print(f"\n✅ Embedding Classifiers created")
print(f"   Department: {len(dept_df['label'].unique())} labels")
print(f"   Seniority: {len(sen_df_extended['label'].unique())} labels")

Creating Department Embedding Classifier...
Loading model 'paraphrase-multilingual-MiniLM-L12-v2' on cuda...
Model loaded successfully!
Fitted from examples: 11 labels, shape (11, 384)
Creating Seniority Embedding Classifier...
Loading model 'paraphrase-multilingual-MiniLM-L12-v2' on cuda...
Model loaded successfully!
Fitted from examples: 6 labels, shape (6, 384)

✅ Embedding Classifiers created
   Department: 11 labels
   Seniority: 6 labels


## 4. Hybrid Classifier

**Decision Flow**:
```
1. Try Rule-Based
   ├─ Exact Match Found → Use Rule-Based ✓
   └─ No Exact Match → Try Embedding
       ├─ Confidence ≥ 50% → Use Embedding ✓
       └─ Confidence < 50% → Use Rule-Based (fuzzy/keyword) ✓
```

In [8]:
def hybrid_predict(row, task='department', 
                  embedding_low_threshold=0.50):
    """
    Hybrid prediction combining rule-based and embedding approaches.
    
    Args:
        row: DataFrame row with CV position data
        task: 'department' or 'seniority'
        embedding_low_threshold: If embedding confidence < this, fall back to rule-based (default: 0.50)
    
    Returns:
        prediction: Final label
        confidence: Final confidence score
        method: Which method was used ('rule_based_exact', 'embedding', 'rule_based_low')
    """
    # Select classifiers based on task
    if task == 'department':
        rule_clf = dept_clf_rule
        emb_clf = dept_clf_emb
    else:  # seniority
        rule_clf = sen_clf_rule
        emb_clf = sen_clf_emb
    
    # Step 1: Get rule-based prediction using predict_with_details()
    # predict_with_details returns: [(prediction, confidence, method), ...]
    rule_pred, rule_conf, rule_method = rule_clf.predict_with_details([row['title']])[0]
    
    # Step 2: If rule-based found an EXACT match, use it (direct match only)
    if rule_method == 'exact':
        return rule_pred, rule_conf, 'rule_based_exact'
    
    # Step 3: Otherwise, try embedding
    # Use rich input text (title + company + description)
    input_text = create_input_text(row)
    
    # Use predict_single() method from EmbeddingClassifier
    emb_pred, emb_conf, _ = emb_clf.predict_single(input_text)
    
    # Step 4: If embedding confidence is decent (≥50%), use it
    if emb_conf >= embedding_low_threshold:
        return emb_pred, emb_conf, 'embedding'
    
    # Step 5: Safety net - fall back to rule-based (fuzzy/keyword match)
    return rule_pred, rule_conf, 'rule_based_low'
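
`hybrid_predict` reads the classifiers from notebook globals and fixes the threshold at 0.50. The sketch below separates the routing decision from the models so it can be exercised on precomputed results, and shows one way the dev split could be used to compare candidate thresholds. `hybrid_route`, `sweep_threshold`, the method names `'keyword'` and `'fuzzy'`, and the records are all hypothetical.

```python
def hybrid_route(rule_result, emb_result, threshold=0.50):
    """Pick a prediction from precomputed rule-based and embedding results.

    rule_result: (label, confidence, method)
    emb_result:  (label, confidence)
    """
    rule_pred, rule_conf, rule_method = rule_result
    emb_pred, emb_conf = emb_result
    if rule_method == "exact":
        return rule_pred, rule_conf, "rule_based_exact"
    if emb_conf >= threshold:
        return emb_pred, emb_conf, "embedding"
    return rule_pred, rule_conf, "rule_based_low"

def sweep_threshold(records, thresholds):
    """Accuracy of the hybrid route for each candidate threshold.

    records: list of (rule_result, emb_result, true_label) tuples.
    """
    accuracy = {}
    for t in thresholds:
        hits = sum(hybrid_route(rule, emb, t)[0] == truth
                   for rule, emb, truth in records)
        accuracy[t] = hits / len(records)
    return accuracy

# Made-up records, chosen only to exercise all three routes
records = [
    (("Sales", 1.0, "exact"), ("Marketing", 0.70), "Sales"),
    (("Other", 0.2, "keyword"), ("Consulting", 0.65), "Consulting"),
    (("Other", 0.1, "fuzzy"), ("Sales", 0.55), "Other"),
    (("Marketing", 0.3, "keyword"), ("Other", 0.40), "Marketing"),
]
routes = [hybrid_route(rule, emb) for rule, emb, _ in records]
sweep = sweep_threshold(records, [0.3, 0.5, 0.6, 0.7])
print(routes)
print(sweep)
```

On real data the records would come from running both classifiers once over `dev_df`, so each threshold can be scored without re-encoding any text.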

## 5. Evaluation on Test Set

In [9]:
# Run hybrid predictions on test set
print("Running hybrid predictions on test set...")

test_results = []

for idx, row in test_df.iterrows():
    # Department
    dept_pred, dept_conf, dept_method = hybrid_predict(row, task='department')
    
    # Seniority
    sen_pred, sen_conf, sen_method = hybrid_predict(row, task='seniority')
    
    test_results.append({
        'dept_pred': dept_pred,
        'dept_conf': dept_conf,
        'dept_method': dept_method,
        'dept_true': row['department'],
        'sen_pred': sen_pred,
        'sen_conf': sen_conf,
        'sen_method': sen_method,
        'sen_true': row['seniority']
    })

results_df = pd.DataFrame(test_results)
print(f"✓ Completed predictions for {len(results_df)} test samples")

Running hybrid predictions on test set...
✓ Completed predictions for 144 test samples


In [10]:
# Analyze method usage
print("\n" + "="*80)
print("METHOD USAGE STATISTICS")
print("="*80)

print("\nDepartment:")
dept_method_counts = results_df['dept_method'].value_counts()
for method, count in dept_method_counts.items():
    pct = count / len(results_df) * 100
    print(f"  {method:20s}: {count:3d} ({pct:5.1f}%)")

print("\nSeniority:")
sen_method_counts = results_df['sen_method'].value_counts()
for method, count in sen_method_counts.items():
    pct = count / len(results_df) * 100
    print(f"  {method:20s}: {count:3d} ({pct:5.1f}%)")

print("="*80)


METHOD USAGE STATISTICS

Department:
  embedding           :  93 ( 64.6%)
  rule_based_low      :  39 ( 27.1%)
  rule_based_exact    :  12 (  8.3%)

Seniority:
  embedding           :  66 ( 45.8%)
  rule_based_low      :  43 ( 29.9%)
  rule_based_exact    :  35 ( 24.3%)


In [11]:
# Compute metrics
dept_true = results_df['dept_true'].tolist()
dept_pred = results_df['dept_pred'].tolist()

sen_true = results_df['sen_true'].tolist()
sen_pred = results_df['sen_pred'].tolist()

# Department metrics
dept_accuracy = accuracy_score(dept_true, dept_pred)
dept_precision, dept_recall, dept_f1, _ = precision_recall_fscore_support(
    dept_true, dept_pred, average='macro', zero_division=0
)
_, _, dept_f1_weighted, _ = precision_recall_fscore_support(
    dept_true, dept_pred, average='weighted', zero_division=0
)

# Seniority metrics
sen_accuracy = accuracy_score(sen_true, sen_pred)
sen_precision, sen_recall, sen_f1, _ = precision_recall_fscore_support(
    sen_true, sen_pred, average='macro', zero_division=0
)
_, _, sen_f1_weighted, _ = precision_recall_fscore_support(
    sen_true, sen_pred, average='weighted', zero_division=0
)

In [12]:
print("="*80)
print("Real-World (Annotated LinkedIn CVs - Test Set):")
print(f"  Hybrid Strategy: Rule-Based Exact Match → Embedding → Rule-Based (<50%)")
print("-" * 80)
print(f"Department Accuracy:       {dept_accuracy:.3f}")
print(f"Department F1 (macro):     {dept_f1:.3f}")
print(f"Department F1 (weighted):  {dept_f1_weighted:.3f}")
print(f"\nSeniority Accuracy:        {sen_accuracy:.3f}")
print(f"Seniority F1 (macro):      {sen_f1:.3f}")
print(f"Seniority F1 (weighted):   {sen_f1_weighted:.3f}")
print("="*80)

Real-World (Annotated LinkedIn CVs - Test Set):
  Hybrid Strategy: Rule-Based Exact Match → Embedding → Rule-Based (<50%)
--------------------------------------------------------------------------------
Department Accuracy:       0.479
Department F1 (macro):     0.365
Department F1 (weighted):  0.503

Seniority Accuracy:        0.465
Seniority F1 (macro):      0.434
Seniority F1 (weighted):   0.495
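
The gap between macro and weighted F1 for Department (0.365 versus 0.503) reflects class imbalance: macro F1 gives every class equal weight, while weighted F1 scales each class by its support. The toy example below recomputes both averages in plain Python; it is a sketch meant to mirror `average='macro'` and `average='weighted'` with `zero_division=0`, not a replacement for scikit-learn.

```python
from collections import Counter

def per_class_f1(y_true, y_pred):
    """F1 per label, with 0.0 when a label has no true or predicted samples."""
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores

def macro_and_weighted_f1(y_true, y_pred):
    scores = per_class_f1(y_true, y_pred)
    support = Counter(y_true)
    macro = sum(scores.values()) / len(scores)
    weighted = sum(scores[c] * support[c] for c in scores) / len(y_true)
    return macro, weighted

# Toy data: a dominant class predicted for everything
y_true = ["Other"] * 8 + ["Sales"] * 2
y_pred = ["Other"] * 10
macro, weighted = macro_and_weighted_f1(y_true, y_pred)
print(round(macro, 3), round(weighted, 3))  # 0.444 0.711
```

A classifier that does well on the dominant class and fails on the small ones therefore shows a higher weighted than macro F1, which is the pattern in the Department results above.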


In [13]:
# Detailed classification reports
print("\nDetailed Classification Report (Department - Test Set):")
print(classification_report(dept_true, dept_pred, zero_division=0))

print("\nDetailed Classification Report (Seniority - Test Set):")
print(classification_report(sen_true, sen_pred, zero_division=0))


Detailed Classification Report (Department - Test Set):
                        precision    recall  f1-score   support

        Administrative       0.11      0.33      0.17         3
  Business Development       0.18      0.40      0.25         5
            Consulting       0.23      0.75      0.35         8
      Customer Support       0.00      0.00      0.00         2
       Human Resources       0.25      0.20      0.22         5
Information Technology       0.45      0.29      0.36        17
             Marketing       0.40      0.40      0.40         5
                 Other       0.70      0.51      0.59        75
    Project Management       0.50      0.67      0.57         9
            Purchasing       1.00      0.25      0.40         4
                 Sales       0.78      0.64      0.70        11

              accuracy                           0.48       144
             macro avg       0.42      0.40      0.36       144
          weighted avg       0.58      0.48      0.50       144

## 6. Method-Specific Performance Analysis

In [14]:
# Analyze accuracy by method used
print("\n" + "="*80)
print("PERFORMANCE BY METHOD")
print("="*80)

print("\nDepartment - Accuracy by Method:")
for method in results_df['dept_method'].unique():
    mask = results_df['dept_method'] == method
    subset = results_df[mask]
    acc = accuracy_score(subset['dept_true'], subset['dept_pred'])
    count = len(subset)
    avg_conf = subset['dept_conf'].mean()
    print(f"  {method:20s}: Acc={acc:.3f}, n={count:3d}, avg_conf={avg_conf:.3f}")

print("\nSeniority - Accuracy by Method:")
for method in results_df['sen_method'].unique():
    mask = results_df['sen_method'] == method
    subset = results_df[mask]
    acc = accuracy_score(subset['sen_true'], subset['sen_pred'])
    count = len(subset)
    avg_conf = subset['sen_conf'].mean()
    print(f"  {method:20s}: Acc={acc:.3f}, n={count:3d}, avg_conf={avg_conf:.3f}")

print("="*80)


PERFORMANCE BY METHOD

Department - Accuracy by Method:
  embedding           : Acc=0.312, n= 93, avg_conf=0.619
  rule_based_low      : Acc=0.718, n= 39, avg_conf=0.077
  rule_based_exact    : Acc=1.000, n= 12, avg_conf=1.000

Seniority - Accuracy by Method:
  rule_based_exact    : Acc=0.543, n= 35, avg_conf=1.000
  embedding           : Acc=0.333, n= 66, avg_conf=0.608
  rule_based_low      : Acc=0.605, n= 43, avg_conf=0.230
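
As a consistency check, the overall accuracies reported in Section 5 can be recovered by pooling the per-method figures. The snippet copies the accuracy and `n` values from the output above and rounds each product to the nearest whole number of correct predictions; `overall_accuracy` is a helper introduced here for illustration.

```python
# (accuracy, n) per method, copied from the printed output above
dept = {
    "embedding": (0.312, 93),
    "rule_based_low": (0.718, 39),
    "rule_based_exact": (1.000, 12),
}
sen = {
    "rule_based_exact": (0.543, 35),
    "embedding": (0.333, 66),
    "rule_based_low": (0.605, 43),
}

def overall_accuracy(by_method):
    """Recover correct counts from rounded accuracies, then pool them."""
    correct = sum(round(acc * n) for acc, n in by_method.values())
    total = sum(n for _, n in by_method.values())
    return correct, total, correct / total

for name, table in (("Department", dept), ("Seniority", sen)):
    correct, total, acc = overall_accuracy(table)
    print(f"{name}: {correct}/{total} = {acc:.3f}")
# Department: 69/144 = 0.479
# Seniority: 67/144 = 0.465
```

In both tasks the embedding route handles the largest share of cases at the lowest accuracy, so it dominates the pooled result.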


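## 7. Key Findings & Interpretation

### Results Summary

| Metric | Department | Seniority |
|--------|-----------|----------|
| **Accuracy** | 47.9% | 46.5% |
| **F1 Macro** | 0.365 | 0.434 |
| **F1 Weighted** | 0.503 | 0.495 |

### Key Insights

1. **Rule-based exact matches are reliable for Department but not for Seniority**: 100% accuracy on Department (n=12) versus 54.3% on Seniority (n=35).
2. **Embedding predictions are the weakest route** (31.2% Department, 33.3% Seniority), yet they handle the largest share of cases. Their average confidence (0.619 and 0.608) is roughly double their accuracy, so the confidence score overstates reliability. The low-confidence rule-based fallback is more accurate (71.8% Department, 60.5% Seniority).
3. **Seniority relies on exact matches more often** (24.3% of cases versus 8.3% for Department) and reaches a higher macro F1 (0.434 versus 0.365), although its accuracy is slightly lower (46.5% versus 47.9%).
4. **Department classification remains challenging**, plausibly because of semantic ambiguity and class imbalance: "Other" accounts for 75 of the 144 test samples, and most other classes have fewer than ten.

### Method Usage Analysis

- **Department**: Embeddings used 64.6% of the time, rule-based fallback 27.1%, exact matches 8.3%
- **Seniority**: More balanced, with embeddings at 45.8%, rule-based fallback at 29.9%, and exact matches at 24.3%

### When to Use This Approach

- ✅ **Good for**: Cases where precision on known terms is critical, multilingual job titles
- ❌ **Limitations**: Embedding predictions on ambiguous titles remain unreliable, even when their confidence is at or above 50%

## 8. Save Results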

In [15]:
# Save results to JSON
results = {
    'approach': 'hybrid_rule_based_embedding',
    'timestamp': datetime.now().isoformat(),
    'config': {
        'embedding_low_threshold': 0.50,
        'embedding_model': MODEL_NAME,
        'decision_logic': 'Rule-Based Exact Match → Embedding → Rule-Based (<50%)'
    },
    'method_usage': {
        'department': dept_method_counts.to_dict(),
        'seniority': sen_method_counts.to_dict()
    },
    'department': {
        'accuracy': float(dept_accuracy),
        'precision_macro': float(dept_precision),
        'recall_macro': float(dept_recall),
        'f1_macro': float(dept_f1),
        'f1_weighted': float(dept_f1_weighted)
    },
    'seniority': {
        'accuracy': float(sen_accuracy),
        'precision_macro': float(sen_precision),
        'recall_macro': float(sen_recall),
        'f1_macro': float(sen_f1),
        'f1_weighted': float(sen_f1_weighted)
    },
    'notes': 'Hybrid approach combining rule-based precision with embedding flexibility. Uses rule-based for high-confidence exact matches and low-confidence safety net, embeddings for mid-range cases.'
}

output_path = RESULTS_DIR / 'hybrid_rule_embedding_results.json'
with open(output_path, 'w') as f:
    json.dump(results, f, indent=2)