# 05 - Pseudo-Labeling Experiment

This notebook implements **Approach 2: Programmatic Labeling + Supervised Learning**.

## Targets
- **Department (Domain)**: Professional domain of the current job
- **Seniority**: Seniority level of the current position

## Strategy
1. Generate pseudo-labels for unannotated CVs using rule-based + embedding classifiers
2. Filter for high-confidence predictions only
3. Combine gold (annotated) + silver (pseudo-labeled) data
4. Train transformer on expanded dataset

## Selection Logic
```
IF rule_based.method in ["exact", "substring"]: use rule_based.label
ELIF embedding.confidence > 0.85: use embedding.label
ELSE: discard
```
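
The rule above can be sketched in Python as follows. This is a minimal illustration, not the actual `PseudoLabeler` implementation: the `Prediction` dataclass, the `select_pseudo_label` function, and the `"embedding"` source name are hypothetical. Only the `method`, `confidence`, and `label` fields come from the pseudocode.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Prediction:
    """Hypothetical container for one classifier output."""
    label: Optional[str]
    method: str = ""          # matching method reported by the rule-based classifier
    confidence: float = 0.0   # score reported by the embedding classifier


def select_pseudo_label(rule: Prediction, emb: Prediction,
                        threshold: float = 0.85) -> Optional[Tuple[str, str]]:
    """Return (label, source) for a trusted prediction, or None to discard."""
    # Rule-based hits from exact or substring matching are trusted first.
    if rule.label is not None and rule.method in ("exact", "substring"):
        return rule.label, f"rule_{rule.method}"
    # Otherwise fall back to the embedding classifier, strictly above the threshold.
    if emb.label is not None and emb.confidence > threshold:
        return emb.label, "embedding"
    return None
```

Because the rule-based branch is checked first, a confident embedding prediction never overrides an exact or substring match.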

In [11]:
import os
import sys
import pandas as pd
import torch
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, accuracy_score

sys.path.insert(0, os.path.abspath("../"))

from src.models.rule_based import create_department_classifier, create_seniority_classifier
from src.models.embedding_classifier import create_domain_classifier, create_seniority_classifier as create_emb_seniority
from src.models.transformer_classifier import TransformerClassifier
from src.data.loader import load_linkedin_data, prepare_dataset, load_label_lists
from src.data.pseudo_labeler import PseudoLabeler, create_combined_dataset

torch.manual_seed(42)
print(f"Using device: {'cuda' if torch.cuda.is_available() else 'cpu'}")

Using device: cuda


## 1. Load Data

In [12]:
# Load label lists for both targets
dept_df, seniority_df = load_label_lists("../data")

# Load annotated (gold) data
gold_data = load_linkedin_data("../data/linkedin-cvs-annotated.json")
gold_df = prepare_dataset(gold_data)

# Load unannotated data
unlabeled_data = load_linkedin_data("../data/linkedin-cvs-not-annotated.json")
unlabeled_df = prepare_dataset(unlabeled_data)

print(f"Gold (annotated): {len(gold_df)}")
print(f"  - With department: {gold_df['department'].notna().sum()}")
print(f"  - With seniority: {gold_df['seniority'].notna().sum()}")
print(f"Unlabeled: {len(unlabeled_df)}")

Gold (annotated): 478
  - With department: 478
  - With seniority: 478
Unlabeled: 314


---
# PART 1: DEPARTMENT PSEUDO-LABELING
---

## 2. Initialize Department Classifiers

In [13]:
# Create rule-based classifier for Department
rule_dept = create_department_classifier(dept_df)

# Create embedding classifier for Department
emb_dept = create_domain_classifier(dept_df, use_examples=True)

print("Department classifiers initialized!")

Loading model 'paraphrase-multilingual-MiniLM-L12-v2' on cuda...
Model loaded successfully!
Fitted from examples: 11 labels, shape (11, 384)
Department classifiers initialized!


## 3. Generate Pseudo-Labels (Department)

In [14]:
# Initialize pseudo-labeler for Department
dept_labeler = PseudoLabeler(
    rule_classifier=rule_dept,
    embedding_classifier=emb_dept,
    confidence_threshold=0.85
)

# Generate pseudo-labels
silver_dept_df = dept_labeler.get_high_confidence_subset(
    unlabeled_df, 
    text_column='text',
    label_column='pseudo_department'
)

print(f"Generated {len(silver_dept_df)} high-confidence DEPARTMENT pseudo-labels")
print("Label source distribution:")
print(silver_dept_df['pseudo_department_source'].value_counts())

Generated 35 high-confidence DEPARTMENT pseudo-labels
Label source distribution:
pseudo_department_source
rule_substring    35
Name: count, dtype: int64


## 4. Train Transformer on Combined Data (Department)

In [15]:
# Prepare gold data for department
gold_dept = gold_df.dropna(subset=['department']).copy()

# Combine datasets
combined_dept = create_combined_dataset(
    gold_df=gold_dept,
    silver_df=silver_dept_df,
    gold_label_col='department',
    silver_label_col='pseudo_department',
    gold_weight=1.0,
    silver_weight=0.7
)

# Prepare label mappings
dept_labels = combined_dept['label'].unique().tolist()
dept_label2id = {label: i for i, label in enumerate(dept_labels)}
dept_id2label = {i: label for label, i in dept_label2id.items()}

# Split for training/testing
gold_train_d, gold_test_d = train_test_split(gold_dept, test_size=0.2, random_state=42)

train_dept_df = create_combined_dataset(
    gold_df=gold_train_d,
    silver_df=silver_dept_df,
    gold_label_col='department',
    silver_label_col='pseudo_department'
)
train_dept_df = train_dept_df[train_dept_df['label'].isin(dept_label2id.keys())]

train_texts_d = train_dept_df['text'].tolist()
train_labels_d = [dept_label2id[l] for l in train_dept_df['label']]

test_texts_d = gold_test_d['text'].tolist()
test_labels_d = gold_test_d['department'].tolist()

print(f"Training on {len(train_texts_d)} examples (gold + silver)")
print(f"Testing on {len(test_texts_d)} gold examples")

Combined dataset: 478 gold + 35 silver = 513 total
Combined dataset: 382 gold + 35 silver = 417 total
Training on 417 examples (gold + silver)
Testing on 96 gold examples
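
For readers without access to `src/data/pseudo_labeler.py`, the sketch below shows one way the gold and silver frames could be merged. It is a hypothetical stand-in for `create_combined_dataset`, not the real function: the `text` and `label` column names match what the notebook uses, while the `source` and `weight` columns and the toy rows are assumptions made for illustration.

```python
import pandas as pd


def combine_gold_silver(gold_df, silver_df, gold_label_col, silver_label_col,
                        gold_weight=1.0, silver_weight=0.7):
    """Hypothetical stand-in for create_combined_dataset (the real one may differ)."""
    gold = pd.DataFrame({
        "text": gold_df["text"],
        "label": gold_df[gold_label_col],
        "source": "gold",          # assumed provenance column
        "weight": gold_weight,     # assumed per-row weight column
    })
    silver = pd.DataFrame({
        "text": silver_df["text"],
        "label": silver_df[silver_label_col],
        "source": "silver",
        "weight": silver_weight,
    })
    return pd.concat([gold, silver], ignore_index=True)


# Toy rows, invented for the example only.
gold = pd.DataFrame({"text": ["Software Engineer", "Sales Manager"],
                     "department": ["Information Technology", "Sales"]})
silver = pd.DataFrame({"text": ["HR Assistant"],
                       "pseudo_department": ["Human Resources"]})
combined = combine_gold_silver(gold, silver, "department", "pseudo_department")
print(combined)
```

The training call in this notebook passes only `texts` and `labels`, so whether the 0.7 silver weight influences the loss depends on the implementation in `src`, which is not shown here.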


In [16]:
# Train Department classifier
dept_pseudo_classifier = TransformerClassifier(
    model_name="distilbert-base-multilingual-cased",
    num_labels=len(dept_label2id),
    id2label=dept_id2label,
    label2id=dept_label2id
)

dept_pseudo_classifier.train(
    texts=train_texts_d,
    labels=train_labels_d,
    output_dir="./results/pseudo_dept",
    epochs=3,
    batch_size=16
)

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-multilingual-cased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Model loaded on cuda
Training on 417 examples...


Step,Training Loss
50,1.9119


Training complete!


## 5. Evaluate Department Classifier

In [17]:
# Predict and evaluate
dept_predictions = dept_pseudo_classifier.predict_labels(test_texts_d)

print("=" * 60)
print("TARGET 1: DEPARTMENT PSEUDO-LABELING RESULTS")
print("=" * 60)
print(f"Accuracy: {accuracy_score(test_labels_d, dept_predictions):.4f}")
print("\nClassification Report:")
print(classification_report(test_labels_d, dept_predictions, zero_division=0))

TARGET 1: DEPARTMENT PSEUDO-LABELING RESULTS
Accuracy: 0.4479

Classification Report:
                        precision    recall  f1-score   support

        Administrative       0.00      0.00      0.00         3
  Business Development       0.00      0.00      0.00         3
            Consulting       0.00      0.00      0.00         6
      Customer Support       0.00      0.00      0.00         3
       Human Resources       0.00      0.00      0.00         2
Information Technology       0.00      0.00      0.00        16
             Marketing       0.00      0.00      0.00         1
                 Other       0.45      1.00      0.62        43
    Project Management       0.00      0.00      0.00         7
            Purchasing       0.00      0.00      0.00         3
                 Sales       0.00      0.00      0.00         9

              accuracy                           0.45        96
             macro avg       0.04      0.09      0.06        96
          weighted avg       0.20      0.45      0.28        96
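
The report shows a recall of 1.00 for `Other` and 0.00 for every other class, which means the classifier predicted `Other` for all 96 test examples. The sketch below rebuilds that situation from the support counts in the table to check that the reported accuracy and macro-average F1 follow from a constant prediction. The label list is reconstructed from the supports and does not reflect the actual order of the test set.

```python
# Support counts copied from the classification report above.
supports = {
    "Administrative": 3, "Business Development": 3, "Consulting": 6,
    "Customer Support": 3, "Human Resources": 2, "Information Technology": 16,
    "Marketing": 1, "Other": 43, "Project Management": 7,
    "Purchasing": 3, "Sales": 9,
}
y_true = [label for label, n in supports.items() for _ in range(n)]
y_pred = ["Other"] * len(y_true)  # constant majority-class prediction

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1_for(label):
    """Per-class F1 from true positives, false positives, and false negatives."""
    tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
    fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


macro_f1 = sum(f1_for(label) for label in supports) / len(supports)
weighted_f1 = sum(f1_for(label) * n for label, n in supports.items()) / len(y_true)
print(f"accuracy={accuracy:.4f} macro_f1={macro_f1:.2f} weighted_f1={weighted_f1:.2f}")
# prints: accuracy=0.4479 macro_f1=0.06 weighted_f1=0.28
```

In other words, the 0.4479 accuracy is exactly the share of `Other` in the test split (43 of 96), so the department model has not learned to separate any class yet.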

In [18]:
# Save department model
dept_pseudo_classifier.save("../models/transformer_pseudo_dept")
print("Department pseudo-label model saved!")

Model saved to ..\models\transformer_pseudo_dept
Department pseudo-label model saved!


---
# PART 2: SENIORITY PSEUDO-LABELING
---

## 6. Initialize Seniority Classifiers

In [19]:
# Create rule-based classifier for Seniority
rule_sen = create_seniority_classifier(seniority_df)

# Create embedding classifier for Seniority
emb_sen = create_emb_seniority(seniority_df, use_examples=True)

print("Seniority classifiers initialized!")

Loading model 'paraphrase-multilingual-MiniLM-L12-v2' on cuda...
Model loaded successfully!
Fitted from examples: 5 labels, shape (5, 384)
Seniority classifiers initialized!


## 7. Generate Pseudo-Labels (Seniority)

In [20]:
# Initialize pseudo-labeler for Seniority
sen_labeler = PseudoLabeler(
    rule_classifier=rule_sen,
    embedding_classifier=emb_sen,
    confidence_threshold=0.85
)

# Generate pseudo-labels
silver_sen_df = sen_labeler.get_high_confidence_subset(
    unlabeled_df, 
    text_column='text',
    label_column='pseudo_seniority'
)

print(f"Generated {len(silver_sen_df)} high-confidence SENIORITY pseudo-labels")
print("Label source distribution:")
print(silver_sen_df['pseudo_seniority_source'].value_counts())

Generated 110 high-confidence SENIORITY pseudo-labels
Label source distribution:
pseudo_seniority_source
rule_substring    110
Name: count, dtype: int64


## 8. Train Transformer on Combined Data (Seniority)

In [21]:
# Prepare gold data for seniority
gold_sen = gold_df.dropna(subset=['seniority']).copy()

# Combine datasets
combined_sen = create_combined_dataset(
    gold_df=gold_sen,
    silver_df=silver_sen_df,
    gold_label_col='seniority',
    silver_label_col='pseudo_seniority',
    gold_weight=1.0,
    silver_weight=0.7
)

# Prepare label mappings
sen_labels = combined_sen['label'].unique().tolist()
sen_label2id = {label: i for i, label in enumerate(sen_labels)}
sen_id2label = {i: label for label, i in sen_label2id.items()}

print(f"Seniority classes: {sen_labels}")

# Split for training/testing
gold_train_s, gold_test_s = train_test_split(gold_sen, test_size=0.2, random_state=42)

train_sen_df = create_combined_dataset(
    gold_df=gold_train_s,
    silver_df=silver_sen_df,
    gold_label_col='seniority',
    silver_label_col='pseudo_seniority'
)
train_sen_df = train_sen_df[train_sen_df['label'].isin(sen_label2id.keys())]

train_texts_s = train_sen_df['text'].tolist()
train_labels_s = [sen_label2id[l] for l in train_sen_df['label']]

test_texts_s = gold_test_s['text'].tolist()
test_labels_s = gold_test_s['seniority'].tolist()

print(f"Training on {len(train_texts_s)} examples (gold + silver)")
print(f"Testing on {len(test_texts_s)} gold examples")

Combined dataset: 478 gold + 110 silver = 588 total
Seniority classes: ['Management', 'Professional', 'Director', 'Lead', 'Senior', 'Junior']
Combined dataset: 382 gold + 110 silver = 492 total
Training on 492 examples (gold + silver)
Testing on 96 gold examples
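
The `gold_weight=1.0` and `silver_weight=0.7` arguments suggest that silver examples are meant to count less than gold ones during training. One common way to do that is a weighted mean of per-example cross-entropy losses, sketched below in plain Python. This is an assumption about how such weights could be applied, not a description of `TransformerClassifier.train`; the function name `weighted_cross_entropy` is hypothetical.

```python
import math


def weighted_cross_entropy(logits, targets, weights):
    """Weighted mean of per-example cross-entropy losses.

    logits:  list of per-class score lists, one per example
    targets: list of integer class ids
    weights: list of per-example weights (e.g. 1.0 for gold, 0.7 for silver)
    """
    total, weight_sum = 0.0, 0.0
    for row, target, w in zip(logits, targets, weights):
        m = max(row)  # subtract the max for a numerically stable log-sum-exp
        log_norm = m + math.log(sum(math.exp(x - m) for x in row))
        nll = log_norm - row[target]  # negative log-likelihood of the target class
        total += w * nll
        weight_sum += w
    return total / weight_sum


# One gold and one silver example with two classes each (made-up logits).
loss = weighted_cross_entropy([[2.0, 0.0], [0.0, 0.0]], [0, 0], [1.0, 0.7])
print(f"{loss:.4f}")
```

With this scheme a mislabeled silver example shifts the loss by 70% of what a gold example would, which limits the damage from noisy pseudo-labels without discarding them.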


In [22]:
# Train Seniority classifier
sen_pseudo_classifier = TransformerClassifier(
    model_name="distilbert-base-multilingual-cased",
    num_labels=len(sen_label2id),
    id2label=sen_id2label,
    label2id=sen_label2id
)

sen_pseudo_classifier.train(
    texts=train_texts_s,
    labels=train_labels_s,
    output_dir="./results/pseudo_seniority",
    epochs=3,
    batch_size=16
)

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-multilingual-cased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Model loaded on cuda
Training on 492 examples...


Step,Training Loss
50,1.5785


Training complete!


## 9. Evaluate Seniority Classifier

In [23]:
# Predict and evaluate
sen_predictions = sen_pseudo_classifier.predict_labels(test_texts_s)

print("=" * 60)
print("TARGET 2: SENIORITY PSEUDO-LABELING RESULTS")
print("=" * 60)
print(f"Accuracy: {accuracy_score(test_labels_s, sen_predictions):.4f}")
print("\nClassification Report:")
print(classification_report(test_labels_s, sen_predictions, zero_division=0))

TARGET 2: SENIORITY PSEUDO-LABELING RESULTS
Accuracy: 0.7188

Classification Report:
              precision    recall  f1-score   support

    Director       0.00      0.00      0.00         5
      Junior       0.00      0.00      0.00         3
        Lead       0.88      0.39      0.54        18
  Management       0.72      0.88      0.79        24
Professional       0.69      0.92      0.79        37
      Senior       0.70      0.78      0.74         9

    accuracy                           0.72        96
   macro avg       0.50      0.49      0.48        96
weighted avg       0.68      0.72      0.67        96
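
The gap between the macro and weighted averages comes from how each one treats class size. The sketch below recomputes both F1 averages from the per-class rows of the report above. Because the inputs are already rounded to two decimals, the results agree with the report only up to rounding.

```python
# (precision, recall, f1, support) per class, copied from the report above.
report = {
    "Director":     (0.00, 0.00, 0.00, 5),
    "Junior":       (0.00, 0.00, 0.00, 3),
    "Lead":         (0.88, 0.39, 0.54, 18),
    "Management":   (0.72, 0.88, 0.79, 24),
    "Professional": (0.69, 0.92, 0.79, 37),
    "Senior":       (0.70, 0.78, 0.74, 9),
}

total_support = sum(row[3] for row in report.values())

# Macro average: every class counts equally, regardless of size.
macro_f1 = sum(row[2] for row in report.values()) / len(report)

# Weighted average: each class counts in proportion to its support.
weighted_f1 = sum(row[2] * row[3] for row in report.values()) / total_support

print(f"macro_f1={macro_f1:.2f} weighted_f1={weighted_f1:.2f}")
# prints: macro_f1=0.48 weighted_f1=0.67
```

`Director` and `Junior` have only 8 test examples between them, so their zero F1 scores barely move the weighted average but pull the macro average down to 0.48.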



In [24]:
# Save seniority model
sen_pseudo_classifier.save("../models/transformer_pseudo_seniority")
print("Seniority pseudo-label model saved!")

Model saved to ..\models\transformer_pseudo_seniority
Seniority pseudo-label model saved!


---
## Summary

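This notebook trained **two classifiers** using pseudo-labeling:

| Target | Gold (train split) | Silver (pseudo) | Total training | Model saved |
|--------|--------------------|-----------------|----------------|-------------|
| Department | 382 | 35 | 417 | `transformer_pseudo_dept` |
| Seniority | 382 | 110 | 492 | `transformer_pseudo_seniority` |

Both models were evaluated on the same held-out 96 gold examples. The seniority model reaches 0.72 accuracy (macro F1 0.48) but never predicts `Director` or `Junior`. The department model reaches only 0.45 accuracy (macro F1 0.06) because it predicts `Other` for every test example. For both targets, every retained silver label came from the rule-based substring match.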