# Fine-tuning DistilBERT for Customer Support Ticket Classification

In this notebook, we fine-tune DistilBERT (`distilbert-base-uncased`) on our cleaned customer support tickets dataset. We will:
1. Load and preprocess the data
2. Set up the DistilBERT model and tokenizer
3. Train for 3 epochs
4. Evaluate model performance
5. Save the fine-tuned model and label mapping

In [11]:
!pip install -q transformers datasets evaluate accelerate scikit-learn




In [9]:
# Import required libraries
import pandas as pd
import numpy as np
import json
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from transformers import (
    DistilBertForSequenceClassification,
    DistilBertTokenizerFast,
    Trainer,
    TrainingArguments,
)
from datasets import Dataset
import torch
import evaluate
import os

# Set random seed for reproducibility
torch.manual_seed(42)
np.random.seed(42)

In [14]:
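# Load the dataset
df = pd.read_csv('../data/customer_support_tickets_cleaned.csv')

# Map inconsistent label spellings to four canonical class names.
# Note: 'Technical' is capitalised while the other classes are lowercase.
label_mapping = {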
    'Billing ': 'billing',
    'Billng': 'billing',
    'billing': 'billing',
    'BILLING': 'billing',
    
    'ACCOUNT': 'account',
    'Accnt': 'account',
    'account': 'account',
    'Account ': 'account',
    
    ' Account ': 'account',
    
    'Other': 'other',
    'other': 'other',
    'Othr': 'other',
    'Other ': 'other',
    'OTHER': 'other',
    
    'TECHNICAL': 'Technical',
    'Tech': 'Technical',
    'technical': 'Technical',
    'Tech-support': 'Technical',
    ' Tech': 'Technical',
}

# Labels that are not keys in the mapping are left unchanged
df['label'] = df['label'].map(label_mapping).fillna(df['label'])

# Encode labels
label_encoder = LabelEncoder()
df['label_encoded'] = label_encoder.fit_transform(df['label'])

# Split the data
train_df, val_df = train_test_split(df, test_size=0.2, random_state=42, stratify=df['label_encoded'])

# Create Dataset objects
train_dataset = Dataset.from_pandas(train_df)
val_dataset = Dataset.from_pandas(val_df)

print(f"Training samples: {len(train_dataset)}")
print(f"Validation samples: {len(val_dataset)}")
print(f"\nLabel mapping:")
for i, label in enumerate(label_encoder.classes_):
    print(f"{label}: {i}")

Training samples: 396
Validation samples: 99

Label mapping:
Technical: 0
account: 1
billing: 2
other: 3
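
The hand-written mapping above lists each spelling variant separately, so a new variant such as `'billing '` would slip through unchanged. As a hedged alternative, the sketch below strips whitespace and lowercases each label before looking it up in a smaller alias table. The names `ALIASES` and `normalize_label` are illustrative and not part of the original notebook. The canonical names, including the capitalised `'Technical'`, are copied from the mapping above.

```python
# Alias table keyed by the cleaned (stripped, lowercased) label.
# Canonical names follow the notebook's mapping, including 'Technical'.
ALIASES = {
    "billing": "billing",
    "billng": "billing",
    "account": "account",
    "accnt": "account",
    "other": "other",
    "othr": "other",
    "technical": "Technical",
    "tech": "Technical",
    "tech-support": "Technical",
}


def normalize_label(raw: str) -> str:
    """Clean a raw label and map it to its canonical class name."""
    key = raw.strip().lower()
    # Unknown labels pass through unchanged, like .fillna(df['label']) above.
    return ALIASES.get(key, raw)


variants = ["Billing ", "BILLING", " Account ", "Accnt", "OTHER", " Tech", "Tech-support"]
normalized = [normalize_label(v) for v in variants]
print(normalized)
```

With pandas, the same function could be applied as `df['label'] = df['label'].map(normalize_label)`.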


In [16]:
# Initialize tokenizer and model (DistilBertTokenizerFast is imported above)
tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-uncased')

num_labels = len(label_encoder.classes_)
model = DistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased', num_labels=num_labels)

# Tokenization function
def tokenize_function(examples):
    return tokenizer(examples['text'], padding='max_length', truncation=True, max_length=128)

# Tokenize datasets
train_dataset = train_dataset.map(tokenize_function, batched=True)
val_dataset = val_dataset.map(tokenize_function, batched=True)

# Rename label column to 'labels' for Trainer compatibility
train_dataset = train_dataset.rename_column('label_encoded', 'labels')
val_dataset = val_dataset.rename_column('label_encoded', 'labels')

# Set format for PyTorch
train_dataset.set_format('torch', columns=['input_ids', 'attention_mask', 'labels'])
val_dataset.set_format('torch', columns=['input_ids', 'attention_mask', 'labels'])


Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
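
`tokenize_function` uses `padding='max_length'` and `truncation=True`, so every example becomes exactly 128 token ids plus an attention mask that marks real tokens with 1 and padding with 0. The toy sketch below illustrates that idea on plain Python lists. It is not the tokenizer's implementation: `pad_or_truncate` is a hypothetical helper, the token ids are made up, and the real tokenizer also keeps the closing special token when it truncates, which this sketch does not.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Return (input_ids, attention_mask), each exactly max_length long."""
    kept = token_ids[:max_length]            # mimics truncation=True
    n_pad = max_length - len(kept)
    input_ids = kept + [pad_id] * n_pad      # mimics padding='max_length'
    attention_mask = [1] * len(kept) + [0] * n_pad
    return input_ids, attention_mask


# A short sequence is padded; a long one is cut to max_length.
short_ids, short_mask = pad_or_truncate([101, 7592, 102], max_length=6)
long_ids, long_mask = pad_or_truncate(list(range(1, 11)), max_length=6)

print(short_ids, short_mask)
print(long_ids, long_mask)
```

Padding every ticket to 128 tokens is simple but spends compute on padding for short tickets. Padding each batch only to its longest example is a common alternative.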



In [17]:
# Define metrics computation
metric = evaluate.load("accuracy")

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    return metric.compute(predictions=predictions, references=labels)

# Set up training arguments
training_args = TrainingArguments(
    output_dir='../models/distilbert-ticket-classifier',
    num_train_epochs=3,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
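    # NOTE: 396 samples at batch size 16 give 25 steps per epoch, 75 in total,
    # so a 500-step warmup never completes and the learning rate stays low.
    warmup_steps=500,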
    weight_decay=0.01,
    logging_dir='../logs',
    logging_steps=10,
    eval_strategy='epoch',
    save_strategy='epoch',
    # With no metric_for_best_model set, "best" means lowest validation loss
    load_best_model_at_end=True,
)

# Initialize Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    compute_metrics=compute_metrics,
)


Evaluation metrics:
eval_loss: 0.9946
eval_accuracy: 0.8947
eval_runtime: 82.7175
eval_samples_per_second: 1.1480
eval_steps_per_second: 0.0730
epoch: 3.0000
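
`compute_metrics` reports overall accuracy only, which can hide a class the model rarely gets right. As a hedged illustration, the sketch below computes per-class recall alongside accuracy. The predictions and references are made-up toy values, not results from this notebook, and `per_class_recall` is a hypothetical helper. The class order follows the label mapping printed earlier.

```python
from collections import Counter


def per_class_recall(predictions, references, class_names):
    """Fraction of each true class that was predicted correctly."""
    total = Counter(references)
    correct = Counter(r for p, r in zip(predictions, references) if p == r)
    return {class_names[c]: correct[c] / total[c] for c in sorted(total)}


class_names = ["Technical", "account", "billing", "other"]

# Toy data: every 'other' ticket (id 3) is misclassified as 'Technical' (id 0).
references = [0, 0, 1, 1, 2, 2, 3, 3]
predictions = [0, 0, 1, 1, 2, 2, 0, 0]

accuracy = sum(p == r for p, r in zip(predictions, references)) / len(references)
recall = per_class_recall(predictions, references, class_names)

print(f"accuracy: {accuracy:.2f}")
for name, value in recall.items():
    print(f"recall[{name}]: {value:.2f}")
```

In this toy case accuracy is 0.75 while recall for `other` is 0.0, which is the kind of gap a single accuracy figure does not show.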



In [18]:
# Train the model
train_results = trainer.train()

# Print training metrics
print("\nTraining metrics:")
print(f"Total training steps: {train_results.global_step}")
print(f"Training loss: {train_results.training_loss:.4f}")

# Evaluate the model
eval_results = trainer.evaluate()
print("\nEvaluation metrics:")
for key, value in eval_results.items():
    print(f"{key}: {value:.4f}")



| Epoch | Training Loss | Validation Loss | Accuracy |
|-------|---------------|-----------------|----------|
| 1     | 1.3716        | 1.3651          | 0.40404  |
| 2     | 1.3211        | 1.286471        | 0.717172 |
| 3     | 1.1938        | 1.046064        | 0.858586 |





Training metrics:
Total training steps: 75
Training loss: 1.3129





Evaluation metrics:
eval_loss: 1.0461
eval_accuracy: 0.8586
eval_runtime: 13.8041
eval_samples_per_second: 7.1720
eval_steps_per_second: 0.5070
epoch: 3.0000
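
The run reports 75 total training steps, while `warmup_steps` is set to 500. The arithmetic below sketches what that implies. It assumes a single device, no gradient accumulation, the default linear warmup schedule, and the default `learning_rate` of 5e-5, since the notebook does not set one. `warmup_lr` is an illustrative helper, not a library function.

```python
import math

num_samples = 396      # training samples reported earlier
batch_size = 16        # per_device_train_batch_size
epochs = 3             # num_train_epochs
warmup_steps = 500     # value used in TrainingArguments
base_lr = 5e-5         # assumed: TrainingArguments default learning_rate

steps_per_epoch = math.ceil(num_samples / batch_size)
total_steps = steps_per_epoch * epochs


def warmup_lr(step):
    """Learning rate during linear warmup, capped at base_lr."""
    return base_lr * min(1.0, step / warmup_steps)


# Fraction of the target learning rate reached by the final training step.
peak_fraction = warmup_lr(total_steps) / base_lr

print(f"steps per epoch: {steps_per_epoch}")
print(f"total steps: {total_steps}")
print(f"fraction of base_lr reached: {peak_fraction:.2f}")
```

Under these assumptions, training ends while the learning rate is still at about 15% of its target, which may explain why the training loss falls slowly. A warmup of roughly 10% of the total steps, about 8 steps here, would be a more typical starting point.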


In [19]:
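# Save the model and tokenizer
# NOTE: despite the '.csv' suffix, output_dir is a directory, not a CSV file.
output_dir = '../src/modeling/distilbert-ticket-classifier.csv'
model.save_pretrained(output_dir)
tokenizer.save_pretrained(output_dir)

# Save label encoder classes (JSON stores the integer ids as string keys)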
label_classes = {i: label for i, label in enumerate(label_encoder.classes_)}
with open(os.path.join(output_dir, 'label_mapping.json'), 'w') as f:
    json.dump(label_classes, f)

print(f"\nModel saved to: {output_dir}")
print("Label mapping saved to: label_mapping.json")


Model saved to: ../src/modeling/distilbert-ticket-classifier.csv
Label mapping saved to: label_mapping.json
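
JSON object keys are always strings, so the integer class ids written to `label_mapping.json` come back as `"0"`, `"1"`, and so on. The sketch below shows the round trip and converts the keys back to integers for use at inference time. It writes to a temporary directory rather than the notebook's `output_dir`, and the name `id2label` is an illustrative choice.

```python
import json
import os
import tempfile

# Same shape as label_classes in the notebook, using the classes printed earlier.
label_classes = {0: "Technical", 1: "account", 2: "billing", 3: "other"}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "label_mapping.json")
    with open(path, "w") as f:
        json.dump(label_classes, f)
    with open(path) as f:
        raw = json.load(f)

# Keys were serialised as strings; convert them back to integer class ids.
id2label = {int(k): v for k, v in raw.items()}

print(list(raw.keys()))
print(id2label[0])
```

A predicted class index from `argmax` can then be looked up directly, for example `id2label[2]` for a billing ticket.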
