# 04 - Transformer Classifier

This notebook implements **Approach 1: Fine-tuned Classification Model**.

## Strategy: Two-Stage Training
1. **Stage 1:** Pre-train on CSV patterns (~20K examples)
2. **Stage 2:** Domain adaptation on annotated LinkedIn CVs (478 annotated positions in this run)

## Objectives
- Train a transformer on pattern→label mappings
- Fine-tune on real LinkedIn data for domain adaptation
- Evaluate on a held-out split of the LinkedIn data

In [1]:
import os
import sys
import pandas as pd
from pathlib import Path
import torch
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, accuracy_score

# Add the project root to sys.path so the `src` package can be imported
sys.path.insert(0, os.path.abspath("../"))

from src.models.transformer_classifier import TransformerClassifier
from src.data.loader import load_linkedin_data, prepare_dataset, load_label_lists

# Reproducibility
torch.manual_seed(42)
if torch.cuda.is_available():
    torch.cuda.manual_seed_all(42)

print(f"Using device: {'cuda' if torch.cuda.is_available() else 'cpu'}")

Using device: cuda


## 1. Load Data

In [2]:
# Load CSV patterns
dept_df, seniority_df = load_label_lists("../data")

print(f"Department patterns: {len(dept_df)}")
print(f"Seniority patterns: {len(seniority_df)}")

# Load LinkedIn data
cv_data = load_linkedin_data("../data/linkedin-cvs-annotated.json")
cv_df = prepare_dataset(cv_data)

print(f"\nAnnotated LinkedIn positions: {len(cv_df)}")
print(f"With department label: {cv_df['department'].notna().sum()}")
print(f"With seniority label: {cv_df['seniority'].notna().sum()}")

Department patterns: 10145
Seniority patterns: 9428

Annotated LinkedIn positions: 478
With department label: 478
With seniority label: 478


## 2. Stage 1: Train on CSV Patterns (Department)

In [3]:
# Prepare label mappings for Department
dept_labels = dept_df['label'].unique().tolist()
label2id = {label: i for i, label in enumerate(dept_labels)}
id2label = {i: label for label, i in label2id.items()}

print(f"Department classes: {len(label2id)}")

# Prepare training data from CSV
pattern_texts = dept_df['text'].astype(str).tolist()
pattern_labels = [label2id[l] for l in dept_df['label']]

# Hold out 10% of the patterns for validation
train_texts, val_texts, train_labels, val_labels = train_test_split(
    pattern_texts, pattern_labels, test_size=0.1, random_state=42
)

print(f"Pattern train: {len(train_texts)}, val: {len(val_texts)}")

Department classes: 11
Pattern train: 9130, val: 1015

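The mapping above takes its order from `unique()`, which follows first appearance in the CSV, so reordering the rows would change the ids. A minimal sketch, assuming a plain list of label strings, sorts the labels first so the ids stay stable across runs. The `build_label_maps` helper and the sample labels are hypothetical and not part of `src`.

```python
def build_label_maps(labels):
    """Return (label2id, id2label) with ids assigned in sorted label order."""
    unique_labels = sorted(set(labels))
    label2id = {label: i for i, label in enumerate(unique_labels)}
    id2label = {i: label for label, i in label2id.items()}
    return label2id, id2label


# Hypothetical sample; the real labels come from dept_df['label'].
sample = ["Sales", "Other", "Marketing", "Other", "Sales"]
label2id, id2label = build_label_maps(sample)

# Row order no longer affects the ids.
reordered_label2id, _ = build_label_maps(list(reversed(sample)))
print(label2id)
```

Stable ids matter mostly when a saved model is reloaded against a mapping rebuilt from the CSV.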

In [4]:
# Initialize and train Stage 1
dept_classifier = TransformerClassifier(
    model_name="distilbert-base-multilingual-cased",
    num_labels=len(label2id),
    id2label=id2label,
    label2id=label2id
)

# Train on patterns
dept_classifier.train(
    texts=train_texts,
    labels=train_labels,
    val_texts=val_texts,
    val_labels=val_labels,
    output_dir="./results/stage1_dept",
    epochs=2,
    batch_size=32
)

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-multilingual-cased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Model loaded on cuda
Training on 9130 examples...


Epoch,Training Loss,Validation Loss
1,0.1294,0.09874
2,0.0683,0.062031


Training complete!


## 3. Stage 2: Domain Adaptation on LinkedIn Data

In [5]:
# Prepare LinkedIn data
cv_labeled = cv_df.dropna(subset=['department']).copy()

# Keep only rows whose department appears in the Stage 1 label set
cv_labeled = cv_labeled[cv_labeled['department'].isin(label2id.keys())]

# Split for train/test
cv_train, cv_test = train_test_split(cv_labeled, test_size=0.2, random_state=42)

cv_train_texts = cv_train['text'].tolist()
cv_train_labels = [label2id[l] for l in cv_train['department']]

# Test labels stay as strings so they can be compared with predict_labels output
cv_test_texts = cv_test['text'].tolist()
cv_test_labels = cv_test['department'].tolist()

print(f"LinkedIn train: {len(cv_train)}, test: {len(cv_test)}")

LinkedIn train: 382, test: 96

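With so few LinkedIn rows, a purely random split can leave a small department with no rows in train or in test. `train_test_split` accepts a `stratify=` argument for this, but it raises an error when a class has a single member. The sketch below shows the idea in plain Python. The `stratified_split` helper and the sample labels are hypothetical, and keeping single-example classes in train is an assumption, not the notebook's method.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_size=0.2, seed=42):
    """Return (train_idx, test_idx) with roughly equal label shares.

    Classes with a single example stay in train, since they cannot be split.
    """
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    train_idx, test_idx = [], []
    for label in sorted(by_label):
        indices = by_label[label]
        rng.shuffle(indices)
        # At least one test row for every class with two or more rows.
        n_test = max(1, round(len(indices) * test_size)) if len(indices) > 1 else 0
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Hypothetical imbalanced labels, not the real LinkedIn distribution.
labels = ["Other"] * 20 + ["Sales"] * 5 + ["Marketing"] * 2 + ["Purchasing"]
train_idx, test_idx = stratified_split(labels)
test_labels = [labels[i] for i in test_idx]
```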

In [6]:
# Continue training on LinkedIn data (Stage 2)
dept_classifier.train(
    texts=cv_train_texts,
    labels=cv_train_labels,
    output_dir="./results/stage2_dept",
    epochs=3,
    batch_size=8,
    learning_rate=1e-5  # Lower LR for fine-tuning
)

Training on 382 examples...


Step,Training Loss
50,1.7262
100,1.1003


Training complete!

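The Stage 2 log lists only steps 50 and 100, which matches a quick count of optimizer steps. The sketch assumes one optimizer step per batch (no gradient accumulation), a partial last batch that is kept, and logging every 50 steps. These are assumptions, since the internals of `TransformerClassifier.train` are not shown here.

```python
import math


def total_steps(num_examples, batch_size, epochs):
    """Optimizer steps when every batch, including a partial last one, is one step."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return steps_per_epoch * epochs


stage1_steps = total_steps(num_examples=9130, batch_size=32, epochs=2)
stage2_steps = total_steps(num_examples=382, batch_size=8, epochs=3)

logging_interval = 50  # assumed logging frequency
stage2_logged = list(range(logging_interval, stage2_steps + 1, logging_interval))

print(stage1_steps, stage2_steps, stage2_logged)
```

Under these assumptions Stage 2 makes 144 updates in total. A training loss of 1.1003 at step 100 may mean the model is still adapting, and a validation split for Stage 2 would show whether more epochs help.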

## 4. Evaluation

In [7]:
# Predict on test set
predictions = dept_classifier.predict_labels(cv_test_texts)

# Metrics
print("=" * 50)
print("DEPARTMENT CLASSIFICATION RESULTS")
print("=" * 50)
print(f"Accuracy: {accuracy_score(cv_test_labels, predictions):.4f}")
print("\nClassification Report:")
print(classification_report(cv_test_labels, predictions))

DEPARTMENT CLASSIFICATION RESULTS
Accuracy: 0.6146

Classification Report:
                        precision    recall  f1-score   support

        Administrative       0.00      0.00      0.00         3
  Business Development       0.50      0.33      0.40         3
            Consulting       1.00      0.83      0.91         6
      Customer Support       0.00      0.00      0.00         3
       Human Resources       0.00      0.00      0.00         2
Information Technology       0.73      0.50      0.59        16
             Marketing       0.50      1.00      0.67         1
                 Other       0.57      0.91      0.70        43
    Project Management       0.75      0.43      0.55         7
            Purchasing       0.00      0.00      0.00         3
                 Sales       0.67      0.22      0.33         9

              accuracy                           0.61        96
             macro avg       0.43      0.38      0.38        96
          weighted avg     


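Because 43 of the 96 test rows are labeled `Other`, the accuracy is easier to read against a majority-class baseline. The sketch below copies the per-class support values from the report above and computes the accuracy of always predicting the largest class. The variable names are illustrative.

```python
# Support per class, copied from the classification report above.
support = {
    "Administrative": 3,
    "Business Development": 3,
    "Consulting": 6,
    "Customer Support": 3,
    "Human Resources": 2,
    "Information Technology": 16,
    "Marketing": 1,
    "Other": 43,
    "Project Management": 7,
    "Purchasing": 3,
    "Sales": 9,
}

total = sum(support.values())
majority_label = max(support, key=support.get)
baseline_accuracy = support[majority_label] / total

# Number of correct predictions implied by the reported accuracy of 0.6146.
correct = round(0.6146 * total)

print(f"Majority class: {majority_label}, baseline accuracy: {baseline_accuracy:.4f}")
print(f"Correct predictions: {correct} of {total}")
```

The model's 0.6146 sits above this baseline of roughly 0.45, but the macro averages show that much of the gain comes from a few classes, and five departments have no correct predictions at all.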

In [8]:
# Save the trained model
dept_classifier.save("../models/transformer_dept")
print("Model saved!")

Model saved to ..\models\transformer_dept
Model saved!
