# DistilBERT Resume Experience Classifier

This notebook trains **DistilBERT** to predict **experience_level** (junior/mid/senior) from resume data.

**Approach:**
- Uses **every column** from the CSV as input, excluding only the label
- Concatenates each row's column values into a single text sequence, tagging each value with its column name
- Standard multi-class classification with CrossEntropyLoss
- Fine-tunes with the Hugging Face `Trainer` for 5 epochs

In [None]:
import pandas as pd
import numpy as np
import re

from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, f1_score, classification_report, confusion_matrix

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification, TrainingArguments, Trainer

# Report whether training will run on a CUDA GPU or fall back to the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")

ModuleNotFoundError: No module named 'pandas'

# Load and Prepare Data

Load the CSV and inspect its shape, column names, and label distribution. Every column except the label is used as model input in the next step.


In [None]:
CSV_PATH = "cleaned_resumes.csv"
TARGET_COL = "experience_level"

# engine="python" selects pandas' slower but more feature-complete parser
df = pd.read_csv(CSV_PATH, engine="python")

print("Shape:", df.shape)
print("Columns:", list(df.columns))

print("\nTarget distribution:\n", df[TARGET_COL].value_counts(dropna=False))

Shape: (2100, 15)
Columns: ['experience', 'projects', 'skills', 'summary', 'education', 'job title', 'total_experience_time', 'last_experience_time', 'summary_count', 'last_experience_only', 'experience_level', 'name', 'email', 'linkedin', 'github']

Target distribution:
 experience_level
senior    700
mid       700
junior    700
Name: count, dtype: int64


# Concatenate Columns into Text

Combine every non-label column value into a single text sequence per resume, prefixing each value with its column name in brackets. Tokenization happens in a later cell, after the train/test split.


In [None]:
def clean_value(v):
    """Convert any cell to a clean string."""
    if pd.isna(v):
        return ""
    s = str(v)
    s = re.sub(r"\s+", " ", s).strip()
    return s

def row_to_text(row, target_col):
    parts = []
    for col, val in row.items():
        if col == target_col:
            continue
        s = clean_value(val)
        if s:
            parts.append(f"[{col}] {s}")
    return " ".join(parts)

df["text"] = df.apply(lambda r: row_to_text(r, TARGET_COL), axis=1)

print(df["text"].iloc[0][:600])
print("\nAverage text length (chars):", int(df["text"].str.len().mean()))

[experience] Experience 1: Title: qa engineer. Responsibilities: Perfected data analysis and data visualization using Python and Tableau. Developed and deployed scalable solutions. Integrated third-party services into existing systems. Performed software testing and resolved bugs efficiently.. Experience 2: Title: qa engineer. Responsibilities: Performed software testing and resolved bugs efficiently. Automated deployment processes and continuous integration. Optimized system performance and reduced latency. Collaborated with cross-functional teams to design new features. Implemented security 

Average text length (chars): 3132
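
To make the serialization format concrete, here is a minimal pandas-free sketch of the same idea applied to a plain dict. The toy row is hypothetical (it is not taken from `cleaned_resumes.csv`), and the NaN check uses `math.isnan` as a stand-in for `pd.isna`.

```python
import math
import re


def clean_value(v):
    """Return a whitespace-normalized string; None and float NaN become ''."""
    if v is None or (isinstance(v, float) and math.isnan(v)):
        return ""
    return re.sub(r"\s+", " ", str(v)).strip()


def row_to_text(row, target_col):
    """Join every non-empty, non-label value as '[column] value'."""
    parts = []
    for col, val in row.items():
        if col == target_col:
            continue  # never leak the label into the input text
        s = clean_value(val)
        if s:
            parts.append(f"[{col}] {s}")
    return " ".join(parts)


# Hypothetical toy row for illustration only.
toy_row = {
    "skills": "python,\n  sql",
    "summary": float("nan"),       # missing value: skipped entirely
    "job title": "qa   engineer",  # repeated spaces are collapsed
    "experience_level": "mid",     # label column: excluded
}
toy_text = row_to_text(toy_row, "experience_level")
print(toy_text)  # [skills] python, sql [job title] qa engineer
```

Empty cells produce no tag at all, so the model never sees a bare `[summary]` marker for a missing summary.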


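# Encode Labels

Map each `experience_level` string to an integer id, and keep the reverse mapping for decoding predictions.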


In [None]:
labels = sorted(df[TARGET_COL].dropna().unique().tolist())
label2id = {l: i for i, l in enumerate(labels)}
id2label = {i: l for l, i in label2id.items()}

df["label"] = df[TARGET_COL].map(label2id)

print("Labels:", labels)

Labels: ['junior', 'mid', 'senior']
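
As a small illustration of the mapping built above, the sketch below reproduces it on the three label names and decodes a list of predicted ids back to strings. The `predicted_ids` values are made up for the example.

```python
# Rebuild the mapping from the three label names printed above.
labels = sorted(["senior", "mid", "junior"])
label2id = {l: i for i, l in enumerate(labels)}
id2label = {i: l for l, i in label2id.items()}

print(label2id)  # {'junior': 0, 'mid': 1, 'senior': 2}

# Hypothetical model outputs (class ids), decoded back to label strings.
predicted_ids = [2, 0, 1]
decoded = [id2label[i] for i in predicted_ids]
print(decoded)  # ['senior', 'junior', 'mid']
```

Alphabetical sorting happens to match the seniority order here, but cross-entropy treats the ids as unordered categories either way.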


# Train/Test Split

Hold out 20% of the rows as a test set, stratified by label so each class keeps the same proportion in both splits.


In [None]:
train_df, test_df = train_test_split(
    df,
    test_size=0.2,
    random_state=42,
    stratify=df["label"]
)

print("Train:", train_df.shape, "Test:", test_df.shape)


Train: (1680, 17) Test: (420, 17)
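
The split sizes can be checked by hand. The sketch below assumes exact proportional allocation per class, which holds here because each class count (700) is divisible by 5. The 17 columns in the printed shapes are the 15 original columns plus the added `text` and `label` columns.

```python
from collections import Counter

# Class counts from the target distribution printed earlier.
class_counts = Counter({"junior": 700, "mid": 700, "senior": 700})

# test_size=0.2 means one row in five goes to the test set.
test_per_class = {label: n // 5 for label, n in class_counts.items()}
train_per_class = {label: n - test_per_class[label] for label, n in class_counts.items()}

n_test = sum(test_per_class.values())
n_train = sum(train_per_class.values())

print(test_per_class)   # {'junior': 140, 'mid': 140, 'senior': 140}
print(n_train, n_test)  # 1680 420
```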


# Tokenization

Tokenize the train and test texts with the DistilBERT tokenizer, truncating each sequence to `MAX_LEN` tokens and padding each split to its longest sequence.


In [None]:
MODEL_NAME = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

MAX_LEN = 512  # DistilBERT's maximum input length; try 384 if you run out of memory

def tokenize(texts):
    return tokenizer(
        texts,
        truncation=True,
        padding=True,
        max_length=MAX_LEN
    )

train_enc = tokenize(train_df["text"].tolist())
test_enc  = tokenize(test_df["text"].tolist())
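
With `truncation=True`, anything past `MAX_LEN` tokens is dropped from the end of the sequence. Since the texts average about 3,132 characters and start with the `[experience]` column, later columns may be cut off for long resumes. The sketch below illustrates that effect with a whitespace split as a rough stand-in for DistilBERT's WordPiece tokenizer; the `truncate_head` helper and the toy text are hypothetical.

```python
def truncate_head(text, max_tokens):
    """Keep only the first max_tokens whitespace-separated tokens.

    A stand-in for tokenizer truncation: real WordPiece tokens are
    smaller than words, so actual cut-offs come earlier in the text.
    """
    return " ".join(text.split()[:max_tokens])


# Toy text: a long first column followed by two short ones.
toy_text = "[experience] " + "word " * 10 + "[skills] python sql [education] bsc"
kept = truncate_head(toy_text, max_tokens=8)

# Column tags that survive truncation.
surviving_tags = [tok for tok in kept.split() if tok.startswith("[")]
print(surviving_tags)  # ['[experience]']
```

If later columns matter for the prediction, reordering the columns or shortening the longest one before concatenation are possible options to try.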


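# Datasets, Model, Training, and Evaluation

The remaining cells wrap the encodings in a PyTorch `Dataset`, load DistilBERT with a 3-class classification head, fine-tune it with `Trainer` (AdamW and cross-entropy loss by default), and evaluate the best checkpoint. The test split doubles as the evaluation set during training, so no separate validation split is used; the reported accuracy, macro-F1, classification report, and confusion matrix all come from that same split.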


In [None]:
class ResumeDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        item = {k: torch.tensor(v[idx]) for k, v in self.encodings.items()}
        item["labels"] = torch.tensor(self.labels[idx])
        return item

train_dataset = ResumeDataset(train_enc, train_df["label"].tolist())
test_dataset  = ResumeDataset(test_enc,  test_df["label"].tolist())


In [None]:
model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_NAME,
    num_labels=len(labels),
    id2label=id2label,
    label2id=label2id
)


Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [None]:
def compute_metrics(eval_pred):
    logits, y_true = eval_pred
    y_pred = np.argmax(logits, axis=1)
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "f1_macro": f1_score(y_true, y_pred, average="macro")
    }
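
Macro-F1 is the unweighted mean of the per-class F1 scores, so every class counts equally regardless of size. The sketch below computes it by hand on a small made-up example; the `macro_f1` helper is an illustration, not scikit-learn's implementation.

```python
def macro_f1(y_true, y_pred, n_classes):
    """Unweighted mean of per-class F1 = 2*TP / (2*TP + FP + FN)."""
    scores = []
    for c in range(n_classes):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        # A class with no true or predicted samples scores 0.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / n_classes


# Made-up labels: 0 = junior, 1 = mid, 2 = senior.
y_true_toy = [0, 0, 1, 1, 2, 2]
y_pred_toy = [0, 1, 1, 1, 2, 0]

# Per-class F1: junior 0.5, mid 0.8, senior 2/3.
score = macro_f1(y_true_toy, y_pred_toy, n_classes=3)
print(round(score, 4))  # 0.6556
```

On this balanced dataset macro-F1 and accuracy should track each other closely; they diverge mainly when one class is predicted much better than the others.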


In [None]:
args = TrainingArguments(
    output_dir="distilbert_resume_level",
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="f1_macro",

    learning_rate=5e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=32,
    num_train_epochs=5,
    # weight_decay=0.01,

    logging_steps=50,
    report_to="none"
)


ImportError: Using the `Trainer` with `PyTorch` requires `accelerate>=0.26.0`: Please run `pip install transformers[torch]` or `pip install 'accelerate>=0.26.0'`
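
The arguments above fix the length of the training schedule. The arithmetic below is a sketch that assumes a single device, the default `gradient_accumulation_steps=1`, and no dropped last batch.

```python
import math

n_train = 1680        # training rows after the 80/20 split
batch_size = 16       # per_device_train_batch_size
epochs = 5            # num_train_epochs
logging_steps = 50

steps_per_epoch = math.ceil(n_train / batch_size)
total_steps = steps_per_epoch * epochs
# Loss is logged at steps 50, 100, ..., 500.
n_loss_logs = total_steps // logging_steps
# eval_strategy="epoch" and save_strategy="epoch": one of each per epoch.
n_evals = epochs

print(steps_per_epoch, total_steps, n_loss_logs, n_evals)  # 105 525 10 5
```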

In [None]:
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,  # simple baseline; later you can make a validation split
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)

trainer.train()



In [None]:
pred = trainer.predict(test_dataset)

y_true = test_df["label"].to_numpy()
y_pred = np.argmax(pred.predictions, axis=1)

print("Accuracy:", accuracy_score(y_true, y_pred))
print("Macro-F1:", f1_score(y_true, y_pred, average="macro"))

print("\nReport:\n", classification_report(y_true, y_pred, target_names=labels))
print("\nConfusion matrix:\n", confusion_matrix(y_true, y_pred))
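
In scikit-learn's `confusion_matrix`, rows are true labels and columns are predicted labels. The sketch below rebuilds a matrix with that layout by hand on made-up labels and derives per-class recall from it; the `confusion` helper is hypothetical.

```python
def confusion(y_true, y_pred, n_classes):
    """Count matrix with m[true][pred], matching scikit-learn's layout."""
    m = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        m[t][p] += 1
    return m


# Made-up labels: 0 = junior, 1 = mid, 2 = senior.
y_true_toy = [0, 0, 1, 1, 2, 2]
y_pred_toy = [0, 1, 1, 1, 2, 0]

matrix = confusion(y_true_toy, y_pred_toy, n_classes=3)
print(matrix)  # [[1, 1, 0], [0, 2, 0], [1, 0, 1]]

# Recall per class: diagonal entry divided by its row sum.
recall = [row[i] / sum(row) for i, row in enumerate(matrix)]
print(recall)  # [0.5, 1.0, 0.5]
```

Off-diagonal entries show which levels get confused with each other, for example a junior resume predicted as mid in row 0, column 1.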


In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# 'y_true' and 'y_pred' are available from the previous execution
conf_mat = confusion_matrix(y_true, y_pred)

plt.figure(figsize=(8, 6))
sns.heatmap(conf_mat, annot=True, fmt="d", cmap="Blues",
            xticklabels=labels, yticklabels=labels)
plt.xlabel("Predicted label")
plt.ylabel("True label")
plt.title("Confusion Matrix")
plt.show()
