# 05 - Pseudo-Labeling Experiment

This notebook implements **Approach 2: Programmatic Labeling + Supervised Learning**.

## Strategy
1. Generate pseudo-labels for the unannotated CVs with the rule-based and embedding classifiers
2. Keep only high-confidence predictions
3. Combine the gold (annotated) and silver (pseudo-labeled) data
4. Train a transformer on the expanded dataset

## Selection Logic
```
IF rule_based.method in ["exact", "substring"]: use rule_based.label
ELIF embedding.confidence > 0.85: use embedding.label
ELSE: discard
```
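
The selection rule above can be sketched as a small Python function. This is only an illustration: the `Prediction` dataclass and `select_pseudo_label` are hypothetical names, and the actual `PseudoLabeler` in `src.data.pseudo_labeler` may be structured differently.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Prediction:
    """Hypothetical container for one classifier's output."""
    label: str
    confidence: float
    method: str = ""  # e.g. "exact" or "substring"; only set by the rule-based classifier


def select_pseudo_label(
    rule: Prediction,
    embedding: Prediction,
    threshold: float = 0.85,
) -> Optional[str]:
    """Return a pseudo-label, or None when the example should be discarded."""
    # Exact and substring rule hits are accepted whatever their confidence score.
    if rule.method in ("exact", "substring"):
        return rule.label
    # Otherwise fall back to the embedding classifier, but only above the threshold.
    if embedding.confidence > threshold:
        return embedding.label
    return None


print(select_pseudo_label(Prediction("Sales", 0.32, "substring"), Prediction("Marketing", 0.40)))
print(select_pseudo_label(Prediction("Other", 0.0, "none"), Prediction("Consulting", 0.91)))
print(select_pseudo_label(Prediction("Other", 0.0, "none"), Prediction("Consulting", 0.60)))
```

Note that the threshold only gates the embedding branch, so rule-based labels are kept regardless of their confidence score.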

In [1]:
import os
import sys
import pandas as pd
import torch
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, accuracy_score

# Add src to path
sys.path.insert(0, os.path.abspath("../"))

from src.models.rule_based import create_department_classifier
from src.models.embedding_classifier import create_domain_classifier
from src.models.transformer_classifier import TransformerClassifier
from src.data.loader import load_linkedin_data, prepare_dataset, load_label_lists
from src.data.pseudo_labeler import PseudoLabeler, create_combined_dataset

torch.manual_seed(42)
print(f"Using device: {'cuda' if torch.cuda.is_available() else 'cpu'}")

Using device: cuda


## 1. Load Data

In [2]:
# Load label lists
dept_df, seniority_df = load_label_lists("../data")

# Load annotated (gold) data
gold_data = load_linkedin_data("../data/linkedin-cvs-annotated.json")
gold_df = prepare_dataset(gold_data)
gold_df = gold_df.dropna(subset=['department'])

# Load unannotated data
unlabeled_data = load_linkedin_data("../data/linkedin-cvs-not-annotated.json")
unlabeled_df = prepare_dataset(unlabeled_data)

print(f"Gold (annotated): {len(gold_df)}")
print(f"Unlabeled: {len(unlabeled_df)}")

Gold (annotated): 478
Unlabeled: 314


## 2. Initialize Baseline Classifiers

In [3]:
# Create rule-based classifier
rule_dept = create_department_classifier(dept_df)

# Create embedding classifier
emb_dept = create_domain_classifier(dept_df, use_examples=True)

print("Classifiers initialized!")

Loading model 'paraphrase-multilingual-MiniLM-L12-v2' on cuda...
Model loaded successfully!
Fitted from examples: 11 labels, shape (11, 384)
Classifiers initialized!


## 3. Generate Pseudo-Labels

In [4]:
# Initialize pseudo-labeler
labeler = PseudoLabeler(
    rule_classifier=rule_dept,
    embedding_classifier=emb_dept,
    confidence_threshold=0.85
)

# Generate pseudo-labels for unlabeled data
silver_df = labeler.get_high_confidence_subset(
    unlabeled_df, 
    text_column='text',
    label_column='pseudo_department'
)

print(f"Generated {len(silver_df)} high-confidence pseudo-labels")
print("\nLabel source distribution:")
print(silver_df['pseudo_department_source'].value_counts())

Generated 35 high-confidence pseudo-labels

Label source distribution:
pseudo_department_source
rule_substring    35
Name: count, dtype: int64
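
All 35 retained labels came from the substring rule, which suggests that no embedding prediction cleared the 0.85 threshold in this run. One way to judge how much coverage a lower threshold would buy is to count the share of confidence scores above each candidate value. The sketch below uses made-up scores and a hypothetical helper, `coverage_at`; real scores would have to come from the embedding classifier.

```python
def coverage_at(confidences, thresholds):
    """Map each threshold to the fraction of scores strictly above it."""
    total = len(confidences)
    if total == 0:
        return {t: 0.0 for t in thresholds}
    return {t: sum(c > t for c in confidences) / total for t in thresholds}


# Illustrative values only, not taken from this notebook's data.
toy_scores = [0.91, 0.78, 0.66, 0.59, 0.42]
for threshold, share in coverage_at(toy_scores, [0.5, 0.7, 0.85]).items():
    print(f"threshold {threshold:.2f}: {share:.0%} kept")
```

A lower threshold trades label noise for coverage, so any newly admitted pseudo-labels are worth spot-checking by hand before training on them.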


In [5]:
# Inspect some examples
print("Sample pseudo-labeled data:")
silver_df[['text', 'pseudo_department', 'pseudo_department_confidence', 'pseudo_department_source']].head(10)

Sample pseudo-labeled data:


| | text | pseudo_department | pseudo_department_confidence | pseudo_department_source |
|---|---|---|---|---|
| 3 | Marketing Manager at Tradeware AG | Marketing | 0.515152 | rule_substring |
| 6 | Business Analyst at ALSO Deutschland GmbH | Business Development | 0.390244 | rule_substring |
| 17 | Global Marketing Director, Nonwovens at Owens ... | Marketing | 0.471698 | rule_substring |
| 21 | Senior Project Manager at Selbstständig | Project Management | 0.564103 | rule_substring |
| 22 | Federal Account Manager at FCN Inc. | Sales | 0.428571 | rule_substring |
| 57 | Marketing Engineer at ams OSRAM | Marketing | 0.580645 | rule_substring |
| 91 | Sales Manager at Photron Deutschland GmbH | Sales | 0.317073 | rule_substring |
| 95 | Leiter Projektmanagement / Projektleiter at Ka... | Project Management | 0.421053 | rule_substring |
| 97 | HR Business Partner at Neopac | Business Development | 0.655172 | rule_substring |
| 99 | Unternehmensinhaber at 3F Kommunikation | Marketing | 0.333333 | rule_substring |


## 4. Combine Gold + Silver Data

In [6]:
# Combine datasets
combined_df = create_combined_dataset(
    gold_df=gold_df,
    silver_df=silver_df,
    gold_label_col='department',
    silver_label_col='pseudo_department',
    gold_weight=1.0,
    silver_weight=0.7
)

print("\nLabel distribution in combined data:")
print(combined_df['label'].value_counts().head(10))

Combined dataset: 478 gold + 35 silver = 513 total

Label distribution in combined data:
label
Other                     250
Information Technology     57
Sales                      46
Project Management         35
Consulting                 33
Marketing                  27
Business Development       20
Human Resources            17
Purchasing                 12
Administrative             10
Name: count, dtype: int64
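
The body of `create_combined_dataset` is not shown in this notebook. As a rough sketch of what such a helper might do, the function below stacks gold and silver rows into one frame with a shared `label` column. The name `combine_gold_silver` and the `weight` and `source` columns are assumptions made for illustration; only the `text` and `label` columns are confirmed by the cells above.

```python
import pandas as pd


def combine_gold_silver(gold_df, silver_df, gold_label_col, silver_label_col,
                        gold_weight=1.0, silver_weight=0.7):
    """Stack gold and silver rows into one frame with unified columns."""
    gold = (
        gold_df[["text", gold_label_col]]
        .rename(columns={gold_label_col: "label"})
        .assign(weight=gold_weight, source="gold")  # hypothetical columns
    )
    silver = (
        silver_df[["text", silver_label_col]]
        .rename(columns={silver_label_col: "label"})
        .assign(weight=silver_weight, source="silver")  # hypothetical columns
    )
    # ignore_index gives the result a fresh 0..n-1 index
    return pd.concat([gold, silver], ignore_index=True)


gold_demo = pd.DataFrame({"text": ["Sales Manager", "Baker"], "department": ["Sales", "Other"]})
silver_demo = pd.DataFrame({"text": ["Marketing Manager"], "pseudo_department": ["Marketing"]})
print(combine_gold_silver(gold_demo, silver_demo, "department", "pseudo_department"))
```

Whether the per-row weight is actually used during training depends on the training loop; a weight column has no effect unless the loss function reads it.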


## 5. Train Transformer on Combined Data

In [7]:
# Prepare label mappings
all_labels = combined_df['label'].unique().tolist()
label2id = {label: i for i, label in enumerate(all_labels)}
id2label = {i: label for label, i in label2id.items()}

# Split gold data for test (we only evaluate on gold)
gold_train, gold_test = train_test_split(gold_df, test_size=0.2, random_state=42)

# Training data = gold_train + all silver
train_df = create_combined_dataset(
    gold_df=gold_train,
    silver_df=silver_df,
    gold_label_col='department',
    silver_label_col='pseudo_department'
)

# Filter to known labels
train_df = train_df[train_df['label'].isin(label2id.keys())]

train_texts = train_df['text'].tolist()
train_labels = [label2id[l] for l in train_df['label']]

test_texts = gold_test['text'].tolist()
test_labels_str = gold_test['department'].tolist()

print(f"Training on {len(train_texts)} examples (gold + silver)")
print(f"Testing on {len(test_texts)} gold examples")

Combined dataset: 382 gold + 35 silver = 417 total
Training on 417 examples (gold + silver)
Testing on 96 gold examples


In [8]:
# Initialize and train
pseudo_classifier = TransformerClassifier(
    model_name="distilbert-base-multilingual-cased",
    num_labels=len(label2id),
    id2label=id2label,
    label2id=label2id
)

pseudo_classifier.train(
    texts=train_texts,
    labels=train_labels,
    output_dir="./results/pseudo_dept",
    epochs=3,
    batch_size=16
)

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-multilingual-cased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Model loaded on cuda
Training on 417 examples...


Step,Training Loss
50,1.9119


Training complete!
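
The training log contains a single entry, at step 50. A rough count of optimizer steps puts that in context. The sketch assumes a single device, no gradient accumulation, and that the last partial batch is kept; the internals of `TransformerClassifier.train` are not shown here, so these are assumptions, and `optimizer_steps` is a hypothetical helper.

```python
import math


def optimizer_steps(num_examples, batch_size, epochs):
    """Total update steps, assuming one device and no gradient accumulation."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return steps_per_epoch * epochs


# Values from the cells above: 417 examples, batch_size=16, epochs=3.
print(optimizer_steps(417, 16, 3))  # 27 steps per epoch * 3 epochs = 81

# Cross-entropy of a uniform guess over the 11 department labels, for reference.
print(round(math.log(11), 2))  # 2.4
```

Under these assumptions the model receives only about 81 updates in total. If the logged value is a cross-entropy loss, 1.9119 is still not far below the uniform-guess level of ln 11 ≈ 2.40.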


## 6. Evaluation

In [9]:
# Predict on gold test set
predictions = pseudo_classifier.predict_labels(test_texts)

# Metrics
print("=" * 50)
print("PSEUDO-LABELING APPROACH RESULTS")
print("=" * 50)
print(f"Accuracy: {accuracy_score(test_labels_str, predictions):.4f}")
print("\nClassification Report:")
print(classification_report(test_labels_str, predictions))

PSEUDO-LABELING APPROACH RESULTS
Accuracy: 0.4479

Classification Report:
                        precision    recall  f1-score   support

        Administrative       0.00      0.00      0.00         3
  Business Development       0.00      0.00      0.00         3
            Consulting       0.00      0.00      0.00         6
      Customer Support       0.00      0.00      0.00         3
       Human Resources       0.00      0.00      0.00         2
Information Technology       0.00      0.00      0.00        16
             Marketing       0.00      0.00      0.00         1
                 Other       0.45      1.00      0.62        43
    Project Management       0.00      0.00      0.00         7
            Purchasing       0.00      0.00      0.00         3
                 Sales       0.00      0.00      0.00         9

              accuracy                           0.45        96
             macro avg       0.04      0.09      0.06        96
          weighted avg      

  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
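
In the report above, `Other` is the only class with nonzero recall (1.00), and its precision of 0.45 equals 43/96. That means the model predicted `Other` for every test example. As a check, the short sketch below computes the accuracy of always predicting the most frequent label, using the support counts from the report; `majority_baseline_accuracy` is a hypothetical helper written for this illustration.

```python
from collections import Counter


def majority_baseline_accuracy(labels):
    """Accuracy obtained by always predicting the most frequent label."""
    if not labels:
        raise ValueError("labels must not be empty")
    _, count = Counter(labels).most_common(1)[0]
    return count / len(labels)


# Support counts copied from the classification report above.
support = {
    "Administrative": 3, "Business Development": 3, "Consulting": 6,
    "Customer Support": 3, "Human Resources": 2, "Information Technology": 16,
    "Marketing": 1, "Other": 43, "Project Management": 7,
    "Purchasing": 3, "Sales": 9,
}
test_labels = [label for label, n in support.items() for _ in range(n)]
print(f"{majority_baseline_accuracy(test_labels):.4f}")  # 0.4479
```

The result matches the reported accuracy of 0.4479, so on this split the pseudo-labeling model is indistinguishable from a majority-class baseline.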


In [10]:
# Save model
pseudo_classifier.save("../models/transformer_pseudo_dept")
print("Model saved!")

Model saved to ..\models\transformer_pseudo_dept
Model saved!
