# Support Intelligence & Risk Monitoring: Transformer Baseline (T5)

This notebook adds a **lightweight Transformer baseline** to complement the TF-IDF models from T4.

## Objectives
- Train a **DistilBERT** (or similar) model for **priority prediction** (`low/medium/high`)
- Compare against the **T4 baseline** (F1-macro + confusion matrix)
- Export an artifact ready for later API inference

## Notes
- If you run locally and don't have the libraries, use the install cell below.
- For faster execution, consider running this notebook on **Kaggle** or with a GPU.


## 0) (Optional) Install dependencies

In [6]:
# If you're running locally and these packages are missing, uncomment:
# !python -m pip install -U transformers datasets accelerate evaluate scikit-learn torch

# Tip: On Windows, Python 3.12 is often the most compatible for ML tooling.


## 1) Imports

In [7]:
# accelerate>=0.26.0 is required by Trainer (see section 6).
%pip install -U datasets "accelerate>=0.26.0"






In [8]:
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score, classification_report, ConfusionMatrixDisplay

import torch
from datasets import Dataset

from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    TrainingArguments,
    Trainer,
    DataCollatorWithPadding,
    set_seed
)


## 2) Load cleaned dataset (T3)

In [9]:
DATA_DIR = "data/processed"
CSV_PATH = os.path.join(DATA_DIR, "tickets_clean_en.csv")
PARQUET_PATH = os.path.join(DATA_DIR, "tickets_clean_en.parquet")

if os.path.exists(CSV_PATH):
    print("Loading CSV:", CSV_PATH)
    df = pd.read_csv(CSV_PATH)
elif os.path.exists(PARQUET_PATH):
    print("Loading Parquet:", PARQUET_PATH)
    df = pd.read_parquet(PARQUET_PATH)
else:
    raise FileNotFoundError("Run T3 first to create data/processed/tickets_clean_en.csv (recommended).")

print("Loaded shape:", df.shape)
df.head(3)


Loading CSV: data/processed\tickets_clean_en.csv
Loaded shape: (16338, 21)


Unnamed: 0,subject,body,answer,type,queue,priority,language,version,subject_clean,body_clean,...,subject_filled,message,tags,tags_str,n_tags,body_len,message_len,is_very_short,priority_norm,category_mapped
0,Account Disruption,"Dear Customer Support Team,\n\nI am writing to...","Thank you for reaching out, <name>. We are awa...",Incident,Technical Support,high,en,51,Account Disruption,"Dear Customer Support Team,\n\nI am writing to...",...,Account Disruption,Account Disruption | Dear Customer Support Tea...,"['Account', 'Disruption', 'Outage', 'IT', 'Tec...",Account | Disruption | Outage | IT | Tech Support,5,544,565,0,high,Account
1,Query About Smart Home System Integration Feat...,"Dear Customer Support Team,\n\nI hope this mes...",Thank you for your inquiry. Our products suppo...,Request,Returns and Exchanges,medium,en,51,Query About Smart Home System Integration Feat...,"Dear Customer Support Team,\n\nI hope this mes...",...,Query About Smart Home System Integration Feat...,Query About Smart Home System Integration Feat...,"['Product', 'Feature', 'Tech Support']",Product | Feature | Tech Support,3,534,587,0,medium,Other
2,Inquiry Regarding Invoice Details,"Dear Customer Support Team,\n\nI hope this mes...",We appreciate you reaching out with your billi...,Request,Billing and Payments,low,en,51,Inquiry Regarding Invoice Details,"Dear Customer Support Team,\n\nI hope this mes...",...,Inquiry Regarding Invoice Details,Inquiry Regarding Invoice Details | Dear Custo...,"['Billing', 'Payment', 'Account', 'Documentati...",Billing | Payment | Account | Documentation | ...,5,605,641,0,low,Billing


## 3) Prepare data for Transformers

In [10]:
# Required columns
for c in ["message", "priority_norm"]:
    if c not in df.columns:
        raise ValueError(f"Missing column {c}. Re-run T3.")

df["message"] = df["message"].fillna("").astype(str)
df = df[df["message"].str.strip().str.len() > 0].copy()

# Map labels -> ids
labels = ["low", "medium", "high"]  # fixed order
label2id = {l:i for i,l in enumerate(labels)}
id2label = {i:l for l,i in label2id.items()}

df = df[df["priority_norm"].isin(labels)].copy()
df["label_id"] = df["priority_norm"].map(label2id).astype(int)

print(df["priority_norm"].value_counts())


priority_norm
medium    6618
high      6346
low       3374
Name: count, dtype: int64
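
The counts above also give a quick sanity floor for the macro-F1 scores reported later. The sketch below uses only those printed counts to work out what a trivial model that always predicts the majority class (`medium`) would score on the full dataset. The helper name `majority_macro_f1` is illustrative and not part of the notebook.

```python
# Class counts copied from the value_counts() output above.
counts = {"medium": 6618, "high": 6346, "low": 3374}


def majority_macro_f1(counts):
    """Macro-F1 of a model that always predicts the most frequent class."""
    total = sum(counts.values())
    majority = max(counts, key=counts.get)
    tp = counts[majority]   # every majority-class row is predicted correctly
    fp = total - tp         # every other row is wrongly labelled as majority
    fn = 0                  # the majority class is never missed
    f1_majority = 2 * tp / (2 * tp + fp + fn)
    # The remaining classes are never predicted, so their F1 is 0.
    return f1_majority / len(counts)


shares = {k: v / sum(counts.values()) for k, v in counts.items()}
floor = majority_macro_f1(counts)

print({k: round(v, 3) for k, v in shares.items()})
print("Majority-class macro-F1:", round(floor, 3))
```

The floor comes out at about 0.19, so any useful model should land well above that value.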


## 4) Train/Validation split

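We stratify by `priority_norm` so that the train and validation sets keep the same class proportions, which keeps per-class metrics comparable across runs.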


In [11]:
RANDOM_STATE = 42
TEST_SIZE = 0.20

train_df, val_df = train_test_split(
    df[["message","label_id","priority_norm"]],
    test_size=TEST_SIZE,
    random_state=RANDOM_STATE,
    stratify=df["priority_norm"]
)

print("Train:", train_df.shape, "Val:", val_df.shape)


Train: (13070, 3) Val: (3268, 3)
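
To illustrate what stratification does, here is a minimal standard-library sketch that splits each class separately so both sides keep the same class shares. It is a simplified illustration, not scikit-learn's actual algorithm. The function name `stratified_split` and the toy labels are assumptions made for the example.

```python
import random
from collections import Counter, defaultdict


def stratified_split(labels, test_size=0.2, seed=42):
    """Return (train_idx, val_idx) with roughly equal class shares in both."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, val_idx = [], []
    for idxs in by_class.values():
        rng.shuffle(idxs)
        n_val = round(len(idxs) * test_size)  # per-class validation quota
        val_idx.extend(idxs[:n_val])
        train_idx.extend(idxs[n_val:])
    return train_idx, val_idx


# Toy labels with an imbalance similar in spirit to the ticket data.
toy_labels = ["medium"] * 40 + ["high"] * 40 + ["low"] * 20
train_idx, val_idx = stratified_split(toy_labels)

val_counts = Counter(toy_labels[i] for i in val_idx)
print("Validation counts:", dict(val_counts))
print("Train size:", len(train_idx), "Val size:", len(val_idx))
```

Each class contributes 20% of its rows to validation, so the minority class (`low`) is neither over- nor under-represented in the evaluation set.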


## 5) Build HuggingFace Datasets + Tokenization

In [12]:
MODEL_NAME = "distilbert-base-uncased"  # lightweight baseline

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

train_ds = Dataset.from_pandas(train_df.reset_index(drop=True))
val_ds   = Dataset.from_pandas(val_df.reset_index(drop=True))

def tokenize(batch):
    return tokenizer(batch["message"], truncation=True, max_length=256)

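# Drop the raw text columns and rename `label_id` to `labels`, the column name
# Trainer passes to the model; otherwise the labels are dropped as unused.
train_ds = train_ds.map(tokenize, batched=True, remove_columns=["message","priority_norm"]).rename_column("label_id", "labels")
val_ds   = val_ds.map(tokenize, batched=True, remove_columns=["message","priority_norm"]).rename_column("label_id", "labels")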

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)


Map: 100% 13070/13070 [00:04<00:00, 3002.14 examples/s]
Map: 100% 3268/3268 [00:01<00:00, 2595.38 examples/s]
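
The `tokenize` function truncates to 256 tokens but does not pad. Padding is deferred to `DataCollatorWithPadding`, which pads each batch only up to its longest sequence. The sketch below is a simplified pure-Python illustration of that idea, not the library's implementation. `pad_batch` is a hypothetical helper and the middle token ids are made up. `0` is the `[PAD]` id in the `distilbert-base-uncased` vocabulary.

```python
def pad_batch(sequences, pad_id=0, max_length=256):
    """Truncate to max_length, then pad every sequence to the batch maximum."""
    # Plain slicing is a simplification: the real tokenizer keeps the final
    # [SEP] token when it truncates.
    truncated = [seq[:max_length] for seq in sequences]
    width = max(len(seq) for seq in truncated)
    input_ids = [seq + [pad_id] * (width - len(seq)) for seq in truncated]
    # 1 marks a real token, 0 marks padding the model should ignore.
    attention_mask = [
        [1] * len(seq) + [0] * (width - len(seq)) for seq in truncated
    ]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


batch = pad_batch([[101, 2023, 102], [101, 2023, 2003, 1037, 102]])
print(batch["input_ids"])
print(batch["attention_mask"])
```

Padding per batch rather than to a fixed 256 tokens means short tickets cost less compute, which matters most on CPU.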


## 6) Model + TrainingArguments

In [13]:
set_seed(RANDOM_STATE)

num_labels = len(labels)
model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_NAME,
    num_labels=num_labels,
    id2label=id2label,
    label2id=label2id
)

# Small, safe defaults (CPU-friendly). Increase batch size / epochs on GPU.
args = TrainingArguments(
    output_dir="outputs/t5_priority_transformer",
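    # Newer transformers releases rename this argument to `eval_strategy`;
    # switch to that name if this line raises a TypeError or a deprecation warning.
    evaluation_strategy="epoch",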
    save_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=16,
    num_train_epochs=2,
    weight_decay=0.01,
    logging_steps=50,
    load_best_model_at_end=True,
    metric_for_best_model="f1_macro",
    greater_is_better=True,
    fp16=torch.cuda.is_available(),
    report_to="none"
)


Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


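ImportError: Using the `Trainer` with `PyTorch` requires `accelerate>=0.26.0`: Please run `pip install transformers[torch]` or `pip install 'accelerate>=0.26.0'`

Note: the recorded run stopped here because `accelerate` was missing or older than 0.26.0. Install it (the install cell in section 0 includes it), restart the kernel, and re-run from the imports. The cells below were not executed in this run.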

## 7) Metrics (F1-macro)

In [None]:
def compute_metrics(eval_pred):
    logits, labels_true = eval_pred
    preds = np.argmax(logits, axis=-1)
    f1m = f1_score(labels_true, preds, average="macro")
    return {"f1_macro": f1m}
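
To make the metric concrete: macro-F1 is the unweighted mean of the per-class F1 scores, so the smaller `low` class counts as much as `medium` or `high`. The following standard-library sketch reproduces that calculation on a made-up six-sample example. `macro_f1` is an illustrative helper, and a class with no true or predicted samples is scored as 0 here.

```python
def macro_f1(y_true, y_pred, n_classes):
    """Unweighted mean of per-class F1 scores."""
    scores = []
    for c in range(n_classes):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / n_classes


# Ids follow the notebook's mapping: 0 = low, 1 = medium, 2 = high.
y_true_toy = [0, 0, 1, 1, 2, 2]
y_pred_toy = [0, 1, 1, 1, 2, 0]

# Per-class F1 is 0.5 (low), 0.8 (medium) and 2/3 (high).
score = macro_f1(y_true_toy, y_pred_toy, n_classes=3)
print("Macro-F1:", round(score, 4))
```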


## 8) Train

In [None]:
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_ds,
    eval_dataset=val_ds,
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics
)

trainer.train()


## 9) Evaluate + confusion matrix

In [None]:
pred = trainer.predict(val_ds)
logits = pred.predictions
y_true = pred.label_ids
y_pred = np.argmax(logits, axis=-1)

print("Transformer Priority F1-macro:", round(f1_score(y_true, y_pred, average='macro'), 4))
print("\nClassification report:\n")
print(classification_report(y_true, y_pred, target_names=labels))

fig, ax = plt.subplots(figsize=(6.6, 5.8))
ConfusionMatrixDisplay.from_predictions(
    [id2label[i] for i in y_true],
    [id2label[i] for i in y_pred],
    labels=labels,
    ax=ax,
    values_format="d"
)
ax.set_title("Priority Confusion Matrix - Transformer (DistilBERT)")
plt.tight_layout()
plt.show()


## 10) Save artifact

In [None]:
SAVE_DIR = "models/t5_priority_distilbert"
os.makedirs(SAVE_DIR, exist_ok=True)

trainer.save_model(SAVE_DIR)
tokenizer.save_pretrained(SAVE_DIR)

print("Saved model + tokenizer to:", SAVE_DIR)


## 11) How to use this later (API/batch inference)

At inference time, you will:
- load tokenizer + model from `models/t5_priority_distilbert`
- tokenize incoming `message`
- run forward pass and take `argmax` or apply a custom threshold policy for `high`
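
The steps above can be sketched as follows. This is a minimal illustration rather than a finished service. The threshold policy, the default `high_threshold=0.35`, and the helper names `pick_priority` and `predict_priority` are assumptions made for the example. The policy itself needs only the standard library; `predict_priority` also needs `torch`, `transformers`, and the saved artifact.

```python
import math

LABELS = ["low", "medium", "high"]  # same fixed order as in training


def softmax(logits):
    """Convert raw logits to probabilities (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def pick_priority(logits, high_threshold=0.35):
    """Return 'high' if its probability clears the threshold,
    otherwise the more likely of the remaining classes."""
    probs = softmax(logits)
    if probs[LABELS.index("high")] >= high_threshold:
        return "high"
    rest = [(p, label) for p, label in zip(probs, LABELS) if label != "high"]
    return max(rest)[1]


def predict_priority(message, model_dir="models/t5_priority_distilbert",
                     high_threshold=0.35):
    """Load the saved artifact and classify one message."""
    # Imported lazily so the policy above can be used without these packages.
    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_dir)
    model = AutoModelForSequenceClassification.from_pretrained(model_dir)
    model.eval()
    enc = tokenizer(message, truncation=True, max_length=256,
                    return_tensors="pt")
    with torch.no_grad():
        logits = model(**enc).logits[0].tolist()
    return pick_priority(logits, high_threshold)


# The policy alone, on made-up logits where 'medium' is the plain argmax:
print(pick_priority([0.2, 1.0, 0.9]))                      # threshold 0.35
print(pick_priority([0.2, 1.0, 0.9], high_threshold=0.5))  # stricter threshold
```

A threshold below the argmax point trades precision for recall on `high` tickets. In a real API, load the tokenizer and model once at startup instead of on every call.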
