**Goal of the project**

One common approach to improving customer service efficiency is support ticket classification. By getting the customer service team to classify the tickets, it’s possible to analyse why customers are making contact so problems can be fixed, content clarified, and processes automated to allow customers to self-serve.

Classifying tickets manually works, but it’s time-consuming to do properly, so automating the process is better. In this project, I’ll build a model to classify support tickets using Lightning Flash and a pretrained DistilBERT model.

**Load the package**

In [1]:
import pandas as pd

**Load the data**

I’m using a [Microsoft support ticket classification dataset](https://github.com/karolzak/support-tickets-classification#22-dataset). It contains the text of each support ticket a user sent to the help desk, plus data on how the ticket was categorised: its impact, urgency, ticket type, and category.

Since we have lots of training data, we can train a model that examines the text of past tickets and predicts the categorisation of future ones. Support staff spend a lot of time doing this manually, so automating it could save the support team time and save the business money.

In [2]:
# Load dataset
df = pd.read_csv('../input/supportticketsclassification/all_tickets.csv')

In [3]:
# Examine the data
df.head()

Unnamed: 0,title,body,ticket_type,category,sub_category1,sub_category2,business_service,urgency,impact
0,,hi since recruiter lead permission approve req...,1,4,2,21,71,3,4
1,connection with icon,icon dear please setup icon per icon engineers...,1,6,22,7,26,3,4
2,work experience user,work experience user hi work experience studen...,1,5,13,7,32,3,4
3,requesting for meeting,requesting meeting hi please help follow equip...,1,5,13,7,32,3,4
4,reset passwords for external accounts,re expire days hi ask help update passwords co...,1,4,2,76,4,3,4


In [4]:
# Overview of all variables, their datatypes
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48549 entries, 0 to 48548
Data columns (total 9 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   title             47837 non-null  object
 1   body              48549 non-null  object
 2   ticket_type       48549 non-null  int64 
 3   category          48549 non-null  int64 
 4   sub_category1     48549 non-null  int64 
 5   sub_category2     48549 non-null  int64 
 6   business_service  48549 non-null  int64 
 7   urgency           48549 non-null  int64 
 8   impact            48549 non-null  int64 
dtypes: int64(7), object(2)
memory usage: 3.3+ MB


**Replace NaN values**

In [5]:
df.isnull().sum()

title               712
body                  0
ticket_type           0
category              0
sub_category1         0
sub_category2         0
business_service      0
urgency               0
impact                0
dtype: int64

In [6]:
# Replace NaN values with an empty string
df['title'] = df['title'].fillna('')

**Concatenate the text into a single column**

The next step is to merge the individual text columns into a single column.

In [7]:
df['all_text'] = df['title'] + ' ' + df['body']

In [8]:
# Keep only the combined text and the target column
df = df[['all_text', 'ticket_type']]
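
The title column is filled before concatenating because, in pandas, adding a string to NaN gives NaN, so the 712 tickets without a title would otherwise lose their text entirely. Here is a minimal plain-Python sketch of the same logic for a single record. `combine_text` is a hypothetical helper, not part of the notebook, and unlike the pandas version it also strips the leading space left by an empty title.

```python
def combine_text(title, body):
    """Join title and body, treating a missing title as an empty string."""
    title = title or ''  # None (a missing title) becomes ''
    return f'{title} {body}'.strip()


# Shortened example texts in the style of the dataset
with_title = combine_text('connection with icon', 'icon dear please setup icon')
without_title = combine_text(None, 'hi since recruiter lead permission')

print(with_title)
print(without_title)
```

A ticket without a title simply keeps its body text.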

**Select our target variable**

There are quite a few columns in this dataset that our model could predict. We’ll select just one of them for now: the ticket_type column. It contains two ticket types: 1, with 34,621 tickets, and 0, with 13,928.

In [9]:
df['ticket_type'].value_counts()

1    34621
0    13928
Name: ticket_type, dtype: int64
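
These counts show a clear imbalance. As a quick sanity check, the sketch below computes each class’s share and the accuracy a model would get by always predicting the majority class. The variable names are my own.

```python
counts = {1: 34621, 0: 13928}  # ticket_type counts from value_counts()
total = sum(counts.values())   # 48549 tickets in all

# Fraction of the dataset in each class
shares = {label: n / total for label, n in counts.items()}

# Accuracy of a "model" that always predicts the most common class
majority_baseline = max(shares.values())

for label, share in shares.items():
    print(f'ticket_type {label}: {share:.1%}')
print(f'Majority-class baseline accuracy: {majority_baseline:.1%}')
```

Roughly 71% of tickets are type 1, so a model has to beat about 71% accuracy before it tells us anything useful.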

**Define X and y**

In [10]:
X = df.drop('ticket_type', axis = 1) 

In [11]:
y = df['ticket_type']

**Split the test and training data**

With X and y created, we can now use the train_test_split() function to split the data. I’ve held out 30% of the data as a validation set and used the remaining 70% for training. Passing stratify = y keeps the ratio of the two ticket types the same in both sets.

In [12]:
from sklearn.model_selection import train_test_split

In [13]:
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size = 0.3, random_state = 42, stratify = y)

In [14]:
train_df = X_train.join(y_train)

In [15]:
valid_df = X_val.join(y_val)
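
To make the effect of a stratified split concrete, here is a simplified plain-Python sketch. It is not scikit-learn’s actual implementation; `stratified_split` and the toy labels are assumptions for illustration only.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_size=0.3, seed=42):
    """Return (train_idx, val_idx) with similar class ratios in both parts."""
    # Group row indices by their class label
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)
    train_idx, val_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        # Take the same fraction of every class for validation
        n_val = round(len(indices) * test_size)
        val_idx.extend(indices[:n_val])
        train_idx.extend(indices[n_val:])
    return sorted(train_idx), sorted(val_idx)


# Toy labels with an imbalance similar to ticket_type (made-up sizes)
labels = [1] * 70 + [0] * 30
train_idx, val_idx = stratified_split(labels)

val_share_of_ones = sum(labels[i] for i in val_idx) / len(val_idx)
print(len(train_idx), len(val_idx), val_share_of_ones)
```

Because each class is split separately, the validation set keeps the same 70/30 mix of labels as the full toy dataset.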

**Saving DataFrames as CSV files**

Next, we save both DataFrames as CSV files so that Flash can load them.

In [16]:
# Write the training set to a file
train_df.to_csv('train.csv', index = False, header = True)

In [17]:
# Write the validation set to a file
valid_df.to_csv('valid.csv', index = False, header = True)

**Constructing the DataModule**

With the data saved, we can use Flash’s TextClassificationData module, which handles the complexity of loading the text from the CSV files and converting it into the representation that deep learning models need for training.

In [20]:
import torch

import flash
from flash.text import TextClassificationData, TextClassifier

In [21]:
# Create the DataModule
datamodule = TextClassificationData.from_csv(
    'all_text',
    'ticket_type',
    train_file = 'train.csv',
    val_file = 'valid.csv',
    batch_size = 16,
)

**Creating the model**

Creating the model is straightforward. We pass the name of the backbone and the number of classes, which is read from the DataModule we created.

Flash integrates directly with the HuggingFace model hub, so you can use any model from this vast collection. In this baseline, we use the “distilbert-base-uncased-finetuned-sst-2-english” backbone.

We replace the default Accuracy metric with Precision, Recall, and F1 score from TorchMetrics to better monitor performance on our imbalanced dataset.

In [22]:
from torchmetrics import Precision, Recall, F1Score

In [23]:
metrics = [
    Precision(num_classes = datamodule.num_classes, average = 'macro', ignore_index = 1),
    Recall(num_classes = datamodule.num_classes, average = 'macro', ignore_index = 1),
    F1Score(num_classes = datamodule.num_classes, average = 'macro', ignore_index = 1),
]
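
To illustrate why accuracy alone can mislead on imbalanced data, here is a small plain-Python sketch with made-up labels. `precision_recall_f1` is a hypothetical helper written for this example, not the TorchMetrics implementation.

```python
def precision_recall_f1(y_true, y_pred, positive):
    """Precision, recall and F1 for one class, computed from raw counts."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Made-up data: 8 tickets of type 1, 2 of type 0; the "model" always predicts 1
y_true = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]
y_pred = [1] * 10

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
p0, r0, f0 = precision_recall_f1(y_true, y_pred, positive=0)

print(f'accuracy: {accuracy:.2f}')
print(f'class 0 precision/recall/F1: {p0:.2f} / {r0:.2f} / {f0:.2f}')
```

This toy model reaches 80% accuracy while never finding a single class 0 ticket, which is exactly the failure that precision, recall, and F1 expose.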

In [24]:
# Build the task
model = TextClassifier(backbone = 'distilbert-base-uncased-finetuned-sst-2-english', num_classes = datamodule.num_classes, metrics = metrics)

**Training the model**

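With the DataModule and model instantiated, we can start training. We fine-tune with the freeze strategy, which keeps the pretrained backbone frozen and trains only the classification head, and we stop early if the validation loss has not improved for two epochs in a row.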

In [25]:
from pytorch_lightning import seed_everything

In [26]:
seed_everything(42, workers = True)

42

In [27]:
from pytorch_lightning.callbacks.early_stopping import EarlyStopping
from pytorch_lightning.callbacks import ModelCheckpoint

In [28]:
callbacks = [
    EarlyStopping(monitor = 'val_cross_entropy', patience = 2, mode = 'min'),
    ModelCheckpoint(monitor = 'val_cross_entropy', save_top_k = 1, mode = 'min', every_n_epochs = 1),
]
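
The patience setting can be easier to follow with a small example. The following is a simplified sketch of the rule, not PyTorch Lightning’s implementation (which has more options, such as a minimum delta). The function name and the loss values are made up.

```python
def early_stop_epoch(val_losses, patience=2):
    """Return the epoch at which training would stop, or None if it never does."""
    best = float('inf')
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            # New best validation loss: remember it and reset the counter
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return None


# Hypothetical validation losses, one per epoch
stalled = early_stop_epoch([0.30, 0.20, 0.21, 0.22, 0.10])
improving = early_stop_epoch([0.30, 0.25, 0.20, 0.15])

print(stalled, improving)
```

In the first run the loss fails to improve in epochs 2 and 3, so training stops at epoch 3 and never sees the later improvement. That trade-off is why patience is usually set above 1.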

In [29]:
# Create the trainer and finetune the model
trainer = flash.Trainer(max_epochs = 10, callbacks = callbacks, gpus = 1 if torch.cuda.is_available() else None, deterministic = True)
trainer.finetune(model, datamodule = datamodule, strategy = 'freeze')

Sanity Checking: 0it [00:00, ?it/s]

Training: 0it [00:00, ?it/s]

Validation: 0it [00:00, ?it/s]

**Validating the model**

In [30]:
trainer.validate(model = model, datamodule = datamodule, ckpt_path = 'best')

Validation: 0it [00:00, ?it/s]

[{'val_precision': 0.9390184879302979,
  'val_recall': 0.9250837564468384,
  'val_f1score': 0.9319990277290344,
  'val_cross_entropy': 0.09886795282363892}]

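On the validation set, our model achieves a precision of 0.94, a recall of 0.93, and an F1 score of 0.93. Note that the metrics were created with ignore_index = 1, so, as far as I understand that setting, these scores describe class 0 (the minority ticket type) rather than an average over both classes.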
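
As a consistency check on these numbers, the F1 score should be the harmonic mean of precision and recall. This short sketch recomputes it from the reported values; the variable names are my own.

```python
precision = 0.9390184879302979  # reported val_precision
recall = 0.9250837564468384     # reported val_recall

# F1 is the harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)

print(f'{f1:.4f}')
```

The result agrees with the reported val_f1score of about 0.932.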