# 04 - Transformer Classifier

This notebook implements **Approach 1: Fine-tuned Classification Model**.

## Targets
- **Department (Domain)**: Professional domain of the current job
- **Seniority**: Seniority level of the current position

## Strategy: Two-Stage Training
1. **Stage 1:** Pre-train on CSV patterns (~9K-10K examples per target)
2. **Stage 2:** Domain adaptation on annotated LinkedIn CVs

In [9]:
import os
import sys
import pandas as pd
from pathlib import Path
import torch
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, accuracy_score

sys.path.insert(0, os.path.abspath("../"))

from src.models.transformer_classifier import TransformerClassifier
from src.data.loader import load_linkedin_data, prepare_dataset, load_label_lists

torch.manual_seed(42)
if torch.cuda.is_available():
    torch.cuda.manual_seed_all(42)

print(f"Using device: {'cuda' if torch.cuda.is_available() else 'cpu'}")

Using device: cuda


## 1. Load Data

In [10]:
# Load CSV patterns for both tasks
dept_df, seniority_df = load_label_lists("../data")

print(f"Department patterns: {len(dept_df)}")
print(f"Seniority patterns: {len(seniority_df)}")

# Load LinkedIn data
cv_data = load_linkedin_data("../data/linkedin-cvs-annotated.json")
cv_df = prepare_dataset(cv_data)

print(f"\nAnnotated LinkedIn positions: {len(cv_df)}")
print(f"With department label: {cv_df['department'].notna().sum()}")
print(f"With seniority label: {cv_df['seniority'].notna().sum()}")

Department patterns: 10145
Seniority patterns: 9428

Annotated LinkedIn positions: 478
With department label: 478
With seniority label: 478


---
# PART 1: DEPARTMENT CLASSIFIER
---

## 2. Stage 1: Train on CSV Patterns (Department)

In [11]:
# Prepare label mappings for Department
dept_labels = dept_df['label'].unique().tolist()
dept_label2id = {label: i for i, label in enumerate(dept_labels)}
dept_id2label = {i: label for label, i in dept_label2id.items()}

print(f"Department classes: {len(dept_label2id)}")

# Prepare training data from CSV
pattern_texts = dept_df['text'].astype(str).tolist()
pattern_labels = [dept_label2id[l] for l in dept_df['label']]

# Split patterns for validation
train_texts, val_texts, train_labels, val_labels = train_test_split(
    pattern_texts, pattern_labels, test_size=0.1, random_state=42
)

print(f"Pattern train: {len(train_texts)}, val: {len(val_texts)}")

Department classes: 11
Pattern train: 9130, val: 1015
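The mapping above follows `unique()`, which returns labels in order of first appearance, so the ids depend on how the CSV rows happen to be ordered. A minimal sketch of a row-order-independent alternative is shown below. The helper name `build_label_maps` and the sample rows are hypothetical and not part of the project code.

```python
def build_label_maps(labels):
    """Build label2id / id2label from the sorted set of label strings."""
    ordered = sorted(set(labels))
    label2id = {label: i for i, label in enumerate(ordered)}
    id2label = {i: label for label, i in label2id.items()}
    return label2id, id2label

# Two orderings of the same hypothetical rows yield the same mapping.
rows_a = ["Sales", "Other", "Marketing", "Sales"]
rows_b = ["Marketing", "Sales", "Sales", "Other"]
maps_a = build_label_maps(rows_a)
maps_b = build_label_maps(rows_b)

print(maps_a[0])
print(maps_a == maps_b)
```

Because the notebook passes `id2label` and `label2id` into the classifier, the existing mapping is not wrong; sorting only makes the ids reproducible if the CSV is reshuffled.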


In [12]:
# Initialize and train Stage 1 for DEPARTMENT
dept_classifier = TransformerClassifier(
    model_name="distilbert-base-multilingual-cased",
    num_labels=len(dept_label2id),
    id2label=dept_id2label,
    label2id=dept_label2id
)

# Train on patterns
dept_classifier.train(
    texts=train_texts,
    labels=train_labels,
    val_texts=val_texts,
    val_labels=val_labels,
    output_dir="./results/stage1_dept",
    epochs=2,
    batch_size=32
)

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-multilingual-cased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Model loaded on cuda
Training on 9130 examples...


| Epoch | Training Loss | Validation Loss |
|-------|---------------|-----------------|
| 1 | 0.1294 | 0.09874 |
| 2 | 0.0683 | 0.062031 |


Training complete!


## 3. Stage 2: Domain Adaptation on LinkedIn Data (Department)

In [13]:
# Prepare LinkedIn data for Department
cv_dept = cv_df.dropna(subset=['department']).copy()
cv_dept = cv_dept[cv_dept['department'].isin(dept_label2id.keys())]

cv_train_dept, cv_test_dept = train_test_split(cv_dept, test_size=0.2, random_state=42)

cv_train_texts_d = cv_train_dept['text'].tolist()
cv_train_labels_d = [dept_label2id[l] for l in cv_train_dept['department']]

cv_test_texts_d = cv_test_dept['text'].tolist()
cv_test_labels_d = cv_test_dept['department'].tolist()

print(f"LinkedIn train (dept): {len(cv_train_dept)}, test: {len(cv_test_dept)}")

LinkedIn train (dept): 382, test: 96
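The split above is purely random, so rare departments can end up with only a handful of test rows or none at all. The sketch below illustrates the idea of a stratified split using only the standard library. It is a hand-rolled illustration with hypothetical labels, not the notebook's method; in practice, passing `stratify=` to scikit-learn's `train_test_split` would be the usual route, with the caveat that it requires at least two rows per class.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_size=0.2, seed=42):
    """Return (train_idx, test_idx), splitting each class at roughly test_size."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Always keep at least one example of each class on the training side.
        n_test = min(len(indices) - 1, round(len(indices) * test_size))
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Hypothetical, imbalanced label column.
labels = ["Other"] * 10 + ["Sales"] * 5 + ["Marketing"] * 1
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))
```

Singleton classes stay entirely in the training set here, which is a design choice of this sketch rather than a property of stratification in general.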


In [14]:
# Continue training on LinkedIn data (Stage 2)
dept_classifier.train(
    texts=cv_train_texts_d,
    labels=cv_train_labels_d,
    output_dir="./results/stage2_dept",
    epochs=3,
    batch_size=8,
    learning_rate=1e-5
)

Training on 382 examples...


| Step | Training Loss |
|------|---------------|
| 50 | 1.7262 |
| 100 | 1.1003 |


Training complete!
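To put the Stage 2 log in context, the number of optimizer updates can be estimated from the dataset size, batch size, and epoch count. The sketch below assumes a single device, no gradient accumulation, a kept final partial batch, and logging every 50 steps. None of these are confirmed by the notebook, although the last one is consistent with the log above.

```python
import math

def optimizer_steps(num_examples, batch_size, epochs):
    """Return (steps_per_epoch, total_steps), counting the final partial batch."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return steps_per_epoch, steps_per_epoch * epochs

def logged_steps(total_steps, logging_every=50):
    """Steps at which a loss value would be logged (assumed interval)."""
    return list(range(logging_every, total_steps + 1, logging_every))

# Department classifier, using the settings from the cells above.
stage1 = optimizer_steps(num_examples=9130, batch_size=32, epochs=2)
stage2 = optimizer_steps(num_examples=382, batch_size=8, epochs=3)
stage2_logs = logged_steps(stage2[1])

print(f"Stage 1: {stage1[1]} updates, Stage 2: {stage2[1]} updates")
print(f"Stage 2 logging steps: {stage2_logs}")
```

Under these assumptions Stage 2 performs 144 updates against 572 in Stage 1, which would explain why only steps 50 and 100 appear in the log.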


## 4. Evaluate Department Classifier

In [15]:
# Predict on test set
dept_predictions = dept_classifier.predict_labels(cv_test_texts_d)

print("=" * 60)
print("TARGET 1: DEPARTMENT CLASSIFICATION RESULTS")
print("=" * 60)
print(f"Accuracy: {accuracy_score(cv_test_labels_d, dept_predictions):.4f}")
print("\nClassification Report:")
print(classification_report(cv_test_labels_d, dept_predictions, zero_division=0))

TARGET 1: DEPARTMENT CLASSIFICATION RESULTS
Accuracy: 0.6146

Classification Report:
                        precision    recall  f1-score   support

        Administrative       0.00      0.00      0.00         3
  Business Development       0.50      0.33      0.40         3
            Consulting       1.00      0.83      0.91         6
      Customer Support       0.00      0.00      0.00         3
       Human Resources       0.00      0.00      0.00         2
Information Technology       0.73      0.50      0.59        16
             Marketing       0.50      1.00      0.67         1
                 Other       0.57      0.91      0.70        43
    Project Management       0.75      0.43      0.55         7
            Purchasing       0.00      0.00      0.00         3
                 Sales       0.67      0.22      0.33         9

              accuracy                           0.61        96
             macro avg       0.43      0.38      0.38        96
          weighte
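Accuracy alone is hard to read here because "Other" makes up 43 of the 96 test positions. A small sketch that recomputes a majority-class baseline and the macro F1 from the per-class rows of the report above (values copied as printed, so already rounded to two decimals):

```python
# Per-class (f1-score, support) pairs copied from the report above.
report = {
    "Administrative": (0.00, 3),
    "Business Development": (0.40, 3),
    "Consulting": (0.91, 6),
    "Customer Support": (0.00, 3),
    "Human Resources": (0.00, 2),
    "Information Technology": (0.59, 16),
    "Marketing": (0.67, 1),
    "Other": (0.70, 43),
    "Project Management": (0.55, 7),
    "Purchasing": (0.00, 3),
    "Sales": (0.33, 9),
}

total = sum(support for _, support in report.values())

# Accuracy of a classifier that always predicts the most frequent class.
majority_label = max(report, key=lambda name: report[name][1])
majority_baseline = report[majority_label][1] / total

# Macro F1 is the unweighted mean of the per-class F1 scores.
macro_f1 = sum(f1 for f1, _ in report.values()) / len(report)

# Classes the model never predicted correctly on this split.
zero_f1 = [name for name, (f1, _) in report.items() if f1 == 0.0]

print(f"majority baseline: {majority_baseline:.4f}")
print(f"macro F1: {macro_f1:.2f}")
print(f"classes with F1 = 0: {zero_f1}")
```

Always predicting "Other" would already reach about 0.45 accuracy, so the 0.61 reported above is a gain of roughly 16 points over that baseline, and four of the eleven classes have an F1 of zero on this split.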

In [16]:
# Save department model
dept_classifier.save("../models/transformer_dept")
print("Department model saved!")

Model saved to ..\models\transformer_dept
Department model saved!


---
# PART 2: SENIORITY CLASSIFIER
---

## 5. Stage 1: Train on CSV Patterns (Seniority)

In [17]:
# Prepare label mappings for Seniority
seniority_labels = seniority_df['label'].unique().tolist()
sen_label2id = {label: i for i, label in enumerate(seniority_labels)}
sen_id2label = {i: label for label, i in sen_label2id.items()}

print(f"Seniority classes: {len(sen_label2id)}")
print(f"Classes: {seniority_labels}")

# Prepare training data from CSV
sen_pattern_texts = seniority_df['text'].astype(str).tolist()
sen_pattern_labels = [sen_label2id[l] for l in seniority_df['label']]

# Split patterns for validation
sen_train_texts, sen_val_texts, sen_train_labels, sen_val_labels = train_test_split(
    sen_pattern_texts, sen_pattern_labels, test_size=0.1, random_state=42
)

print(f"Pattern train: {len(sen_train_texts)}, val: {len(sen_val_texts)}")

Seniority classes: 5
Classes: ['Junior', 'Senior', 'Lead', 'Management', 'Director']
Pattern train: 8485, val: 943


In [18]:
# Initialize and train Stage 1 for SENIORITY
seniority_classifier = TransformerClassifier(
    model_name="distilbert-base-multilingual-cased",
    num_labels=len(sen_label2id),
    id2label=sen_id2label,
    label2id=sen_label2id
)

# Train on patterns
seniority_classifier.train(
    texts=sen_train_texts,
    labels=sen_train_labels,
    val_texts=sen_val_texts,
    val_labels=sen_val_labels,
    output_dir="./results/stage1_seniority",
    epochs=2,
    batch_size=32
)

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-multilingual-cased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Model loaded on cuda
Training on 8485 examples...


| Epoch | Training Loss | Validation Loss |
|-------|---------------|-----------------|
| 1 | 0.0734 | 0.061219 |
| 2 | 0.0248 | 0.028764 |


Training complete!


## 6. Stage 2: Domain Adaptation on LinkedIn Data (Seniority)

In [19]:
# Prepare LinkedIn data for Seniority
cv_sen = cv_df.dropna(subset=['seniority']).copy()
cv_sen = cv_sen[cv_sen['seniority'].isin(sen_label2id.keys())]

cv_train_sen, cv_test_sen = train_test_split(cv_sen, test_size=0.2, random_state=42)

cv_train_texts_s = cv_train_sen['text'].tolist()
cv_train_labels_s = [sen_label2id[l] for l in cv_train_sen['seniority']]

cv_test_texts_s = cv_test_sen['text'].tolist()
cv_test_labels_s = cv_test_sen['seniority'].tolist()

print(f"LinkedIn train (seniority): {len(cv_train_sen)}, test: {len(cv_test_sen)}")

LinkedIn train (seniority): 253, test: 64
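The train and test counts above add up to 317, while 478 positions carried a seniority label, so the `isin` filter removed 161 rows whose label does not occur in the seniority CSV. The notebook does not show which labels these are. A minimal sketch of how the dropped labels could be listed before filtering; the helper name and the sample labels (including "Professional") are hypothetical:

```python
from collections import Counter

def report_dropped(labels, known_labels):
    """Return (kept_count, Counter of labels that the isin filter would drop)."""
    known = set(known_labels)
    dropped = Counter(label for label in labels if label not in known)
    kept = sum(1 for label in labels if label in known)
    return kept, dropped

# Classes as printed by the Stage 1 seniority cell.
known = ["Junior", "Senior", "Lead", "Management", "Director"]

# Hypothetical annotated labels; matching is exact, so casing differences also drop rows.
annotated = ["Lead", "Management", "Professional", "Professional", "Senior", "lead"]

kept, dropped = report_dropped(annotated, known)
print(f"kept: {kept}")
print(f"dropped: {dict(dropped)}")
```

In the notebook this could be called with `cv_df['seniority'].dropna()` and `sen_label2id.keys()` to decide whether the dropped labels should be mapped onto existing classes instead of discarded.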


In [20]:
# Continue training on LinkedIn data (Stage 2)
seniority_classifier.train(
    texts=cv_train_texts_s,
    labels=cv_train_labels_s,
    output_dir="./results/stage2_seniority",
    epochs=3,
    batch_size=8,
    learning_rate=1e-5
)

Training on 253 examples...


| Step | Training Loss |
|------|---------------|
| 50 | 0.7888 |


Training complete!


## 7. Evaluate Seniority Classifier

In [21]:
# Predict on test set
seniority_predictions = seniority_classifier.predict_labels(cv_test_texts_s)

print("=" * 60)
print("TARGET 2: SENIORITY CLASSIFICATION RESULTS")
print("=" * 60)
print(f"Accuracy: {accuracy_score(cv_test_labels_s, seniority_predictions):.4f}")
print("\nClassification Report:")
print(classification_report(cv_test_labels_s, seniority_predictions, zero_division=0))

TARGET 2: SENIORITY CLASSIFICATION RESULTS
Accuracy: 0.9531

Classification Report:
              precision    recall  f1-score   support

    Director       0.50      1.00      0.67         1
      Junior       1.00      0.50      0.67         2
        Lead       1.00      0.96      0.98        27
  Management       0.96      0.96      0.96        26
      Senior       0.89      1.00      0.94         8

    accuracy                           0.95        64
   macro avg       0.87      0.88      0.84        64
weighted avg       0.96      0.95      0.95        64
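With only 64 test positions, and a single Director and two Junior examples among them, the headline accuracy carries noticeable sampling uncertainty. The sketch below computes a Wilson score interval for the accuracy as one way to quantify it. This is an added illustration rather than part of the original evaluation, and it treats the test positions as independent draws.

```python
import math

def wilson_interval(correct, total, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for about 95%)."""
    p = correct / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return center - half, center + half

# 61 of 64 correct corresponds to the reported accuracy of 0.9531.
low, high = wilson_interval(correct=61, total=64)
print(f"95% interval for accuracy: {low:.3f} to {high:.3f}")
```

The interval spans roughly 0.87 to 0.98, so the result is strong but not pinned down to the second decimal. The per-class scores for Director and Junior rest on one and two examples respectively and should be read with even more caution.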



In [22]:
# Save seniority model
seniority_classifier.save("../models/transformer_seniority")
print("Seniority model saved!")

Model saved to ..\models\transformer_seniority
Seniority model saved!


---
## Summary

This notebook trained **two transformer classifiers** using two-stage training:

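| Target | Stage 1 (CSV Patterns) | Stage 2 (LinkedIn) | Model Saved |
|--------|------------------------|--------------------|-------------|
| Department | 10,145 patterns (9,130 train / 1,015 val) | 478 positions (382 train / 96 test) | `transformer_dept` |
| Seniority | 9,428 patterns (8,485 train / 943 val) | 317 positions (253 train / 64 test) | `transformer_seniority` |

On the held-out LinkedIn test splits, accuracy was 0.61 for department (96 positions) and 0.95 for seniority (64 positions).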