<a href="https://colab.research.google.com/github/RylieWeaver/Machine-Learning-Personal-Projects/blob/main/Customer_Ticket_Classifier.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## NLP Project: Customer Service Ticket Subject Classifier

This notebook uses the Kaggle customer support ticket dataset:
https://www.kaggle.com/datasets/suraj520/customer-support-ticket-dataset

Classify the ticket subject based on the ticket description provided by the customer.

### Load

In [2]:
!pip install transformers -U
!pip install accelerate -U
!pip install transformers[torch] -U

Collecting accelerate
  Downloading accelerate-0.24.1-py3-none-any.whl (261 kB)
Installing collected packages: accelerate
Successfully installed accelerate-0.24.1


In [3]:
import accelerate

In [4]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [5]:
!unzip "/content/drive/My Drive/Machine Learning/customer_support.zip" -d 'Machine Learning'

Archive:  /content/drive/My Drive/Machine Learning/customer_support.zip
  inflating: Machine Learning/customer_support_tickets.csv  


In [6]:
import pandas as pd
df = pd.read_csv('Machine Learning/customer_support_tickets.csv')

In [7]:
df.head()

Unnamed: 0,Ticket ID,Customer Name,Customer Email,Customer Age,Customer Gender,Product Purchased,Date of Purchase,Ticket Type,Ticket Subject,Ticket Description,Ticket Status,Resolution,Ticket Priority,Ticket Channel,First Response Time,Time to Resolution,Customer Satisfaction Rating
0,1,Marisa Obrien,carrollallison@example.com,32,Other,GoPro Hero,2021-03-22,Technical issue,Product setup,I'm having an issue with the {product_purchase...,Pending Customer Response,,Critical,Social media,2023-06-01 12:15:36,,
1,2,Jessica Rios,clarkeashley@example.com,42,Female,LG Smart TV,2021-05-22,Technical issue,Peripheral compatibility,I'm having an issue with the {product_purchase...,Pending Customer Response,,Critical,Chat,2023-06-01 16:45:38,,
2,3,Christopher Robbins,gonzalestracy@example.com,48,Other,Dell XPS,2020-07-14,Technical issue,Network problem,I'm facing a problem with my {product_purchase...,Closed,Case maybe show recently my computer follow.,Low,Social media,2023-06-01 11:14:38,2023-06-01 18:05:38,3.0
3,4,Christina Dillon,bradleyolson@example.org,27,Female,Microsoft Office,2020-11-13,Billing inquiry,Account access,I'm having an issue with the {product_purchase...,Closed,Try capital clearly never color toward story.,Low,Social media,2023-06-01 07:29:40,2023-06-01 01:57:40,3.0
4,5,Alexander Carroll,bradleymark@example.com,67,Female,Autodesk AutoCAD,2020-02-04,Billing inquiry,Data loss,I'm having an issue with the {product_purchase...,Closed,West decision evidence bit.,Low,Email,2023-06-01 00:12:42,2023-06-01 19:53:42,1.0


In [8]:
df.describe()

Unnamed: 0,Ticket ID,Customer Age,Customer Satisfaction Rating
count,8469.0,8469.0,2769.0
mean,4235.0,44.026804,2.991333
std,2444.934048,15.296112,1.407016
min,1.0,18.0,1.0
25%,2118.0,31.0,2.0
50%,4235.0,44.0,3.0
75%,6352.0,57.0,4.0
max,8469.0,70.0,5.0


In [60]:
df['Ticket Description'][0]

"I'm having an issue with the {product_purchased}. Please assist.\n\nYour billing zip code is: 71701.\n\nWe appreciate that you have requested a website address.\n\nPlease double check your email address. I've tried troubleshooting steps mentioned in the user manual, but the issue persists."

### Preprocess

In [10]:
X_data = df['Ticket Description']

In [11]:
# Check for NaN values
print(X_data.isnull().sum())

0


In [12]:
X_data.head()

0    I'm having an issue with the {product_purchase...
1    I'm having an issue with the {product_purchase...
2    I'm facing a problem with my {product_purchase...
3    I'm having an issue with the {product_purchase...
4    I'm having an issue with the {product_purchase...
Name: Ticket Description, dtype: object

In [13]:
y_data_unclean = df['Ticket Subject']

In [14]:
y_data_unclean.head()

0               Product setup
1    Peripheral compatibility
2             Network problem
3              Account access
4                   Data loss
Name: Ticket Subject, dtype: object

In [15]:
y_data_unclean.value_counts()

Refund request              576
Software bug                574
Product compatibility       567
Delivery problem            561
Hardware issue              547
Battery life                542
Network problem             539
Installation support        530
Product setup               529
Payment issue               526
Product recommendation      517
Account access              509
Peripheral compatibility    496
Data loss                   491
Cancellation request        487
Display issue               478
Name: Ticket Subject, dtype: int64

Note: The classes are roughly balanced (478 to 576 tickets each). They could be balanced exactly, for example by removing low-quality tickets from the larger classes.
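One way to balance the classes exactly is random downsampling to the size of the rarest class. The sketch below is only an illustration on toy data: `downsample_to_smallest` is a hypothetical helper, not part of this notebook, and it discards tickets at random rather than by quality.

```python
import random
from collections import Counter, defaultdict

def downsample_to_smallest(texts, labels, seed=42):
    """Randomly drop examples so every class keeps as many as the rarest class."""
    by_label = defaultdict(list)
    for text, label in zip(texts, labels):
        by_label[label].append(text)

    # Size of the rarest class: every class is cut down to this count
    target = min(len(items) for items in by_label.values())
    rng = random.Random(seed)

    balanced_texts, balanced_labels = [], []
    for label in sorted(by_label):
        for text in rng.sample(by_label[label], target):
            balanced_texts.append(text)
            balanced_labels.append(label)
    return balanced_texts, balanced_labels

# Toy data standing in for the ticket descriptions and subjects
toy_texts = [f"ticket {i}" for i in range(9)]
toy_labels = ["Refund request"] * 5 + ["Data loss"] * 3 + ["Display issue"] * 1
bal_texts, bal_labels = downsample_to_smallest(toy_texts, toy_labels)
print(Counter(bal_labels))
```

On the real data this would cap every class at 478 tickets, the size of `Display issue`, so the gain in balance costs some training examples.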

In [16]:
unique_labels = sorted(set(y_data_unclean))
label_map = {label: i for i, label in enumerate(unique_labels)}
print(unique_labels)
print(label_map)

['Account access', 'Battery life', 'Cancellation request', 'Data loss', 'Delivery problem', 'Display issue', 'Hardware issue', 'Installation support', 'Network problem', 'Payment issue', 'Peripheral compatibility', 'Product compatibility', 'Product recommendation', 'Product setup', 'Refund request', 'Software bug']
{'Account access': 0, 'Battery life': 1, 'Cancellation request': 2, 'Data loss': 3, 'Delivery problem': 4, 'Display issue': 5, 'Hardware issue': 6, 'Installation support': 7, 'Network problem': 8, 'Payment issue': 9, 'Peripheral compatibility': 10, 'Product compatibility': 11, 'Product recommendation': 12, 'Product setup': 13, 'Refund request': 14, 'Software bug': 15}


In [17]:
y_data_encoded = [label_map[label] for label in y_data_unclean]

In [18]:
y_data = pd.Series(y_data_encoded)

In [19]:
y_data.value_counts()

14    576
15    574
11    567
4     561
6     547
1     542
8     539
7     530
13    529
9     526
12    517
0     509
10    496
3     491
2     487
5     478
dtype: int64
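The `label_map` built above goes from subject name to integer id. Reading model predictions later needs the opposite direction, so a small sketch of the inverse mapping follows. The names `id_to_label` and `decode` are assumptions introduced for illustration.

```python
# Same 16 subjects as printed above, in sorted order
unique_labels = [
    'Account access', 'Battery life', 'Cancellation request', 'Data loss',
    'Delivery problem', 'Display issue', 'Hardware issue', 'Installation support',
    'Network problem', 'Payment issue', 'Peripheral compatibility',
    'Product compatibility', 'Product recommendation', 'Product setup',
    'Refund request', 'Software bug',
]
label_map = {label: i for i, label in enumerate(unique_labels)}

# Invert the mapping: integer id -> subject name
id_to_label = {i: label for label, i in label_map.items()}

def decode(ids):
    """Turn a sequence of predicted class ids back into subject names."""
    return [id_to_label[i] for i in ids]

print(decode([14, 15, 5]))  # ['Refund request', 'Software bug', 'Display issue']
```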

### Split

In [42]:
from sklearn.model_selection import train_test_split
import numpy as np

# Quick smoke test on a tiny subset: hold out 0.1% of the data as the test set
# (use test_size=0.2 for a real 80/20 split)
X_train, X_test, y_train, y_test = train_test_split(X_data, y_data, test_size=0.001, random_state=42)

# From the remainder, keep 0.4% for training and 0.1% for validation; the rest is unused
# (use test_size=0.25 for a real run: 0.25 x 0.8 = 0.2 of the full data)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, train_size=0.004, test_size=0.001, random_state=42)

In [43]:
print(X_train.shape, X_val.shape, X_test.shape)
print(y_train.shape, y_val.shape, y_test.shape)

(33,) (9,) (9,)
(33,) (9,) (9,)
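The shapes above follow from the fractions passed to `train_test_split`. The arithmetic below reproduces them, assuming scikit-learn rounds a fractional test size up and a fractional train size down, which is consistent with the printed shapes.

```python
import math

n_total = 8469  # rows reported by df.describe()

n_test = math.ceil(0.001 * n_total)    # test share rounds up
n_rest = n_total - n_test              # rows passed to the second split
n_train = math.floor(0.004 * n_rest)   # train share rounds down
n_val = math.ceil(0.001 * n_rest)      # validation share rounds up
n_unused = n_rest - n_train - n_val    # rows that end up in no split

print(n_train, n_val, n_test, n_unused)
```

Almost all of the dataset is left out of this run, which is worth keeping in mind when reading the accuracy numbers further down.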


### Make Model

In [44]:
from transformers import BertTokenizer, BertForSequenceClassification
from transformers import Trainer, TrainingArguments

# Load the tokenizer and model
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=16)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [45]:
def encode_data(tokenizer, texts, max_length=512):
    """Returns a BatchEncoding with token ids and attention masks (token type ids are disabled)."""
    return tokenizer.batch_encode_plus(
        texts,
        add_special_tokens=True,  # Add '[CLS]' and '[SEP]'
        max_length=max_length,   # Max length to truncate/pad
        padding='max_length',    # Pad to max_length
        truncation=True,         # Truncate longer messages
        return_attention_mask=True,
        return_token_type_ids=False,
        return_tensors='pt'      # Return PyTorch tensors
    )

In [46]:
max_length = 128  # Can be adjusted based on your data

train_encodings = encode_data(tokenizer, X_train.tolist(), max_length=max_length)
val_encodings = encode_data(tokenizer, X_val.tolist(), max_length=max_length)
test_encodings = encode_data(tokenizer, X_test.tolist(), max_length=max_length)

In [47]:
# Note: len() of the tokenizer output counts its keys (input_ids, attention_mask), not the tickets
print(len(train_encodings), len(y_train))
print(len(val_encodings), len(y_val))
print(len(test_encodings), len(y_test))

2 33
2 9
2 9


In [48]:
import torch
from torch.utils.data import Dataset

class SupportTicketsDataset(Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels  # This can be a Pandas Series

    def __getitem__(self, idx):
        item = {key: val[idx].clone().detach() if isinstance(val[idx], torch.Tensor) else torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels.iloc[idx], dtype=torch.long)  # Use iloc for Series
        return item

    def __len__(self):
        return len(self.labels)

# Create dataset objects
train_dataset = SupportTicketsDataset(train_encodings, y_train)
val_dataset = SupportTicketsDataset(val_encodings, y_val)
test_dataset = SupportTicketsDataset(test_encodings, y_test)

In [49]:
print(len(train_dataset), len(y_train))
print(len(val_dataset), len(y_val))
print(len(test_dataset), len(y_test))

33 33
9 9
9 9


In [50]:
from sklearn.metrics import accuracy_score

def compute_metrics(pred):
    labels = pred.label_ids
    preds = pred.predictions.argmax(-1)
    acc = accuracy_score(labels, preds)
    return {'accuracy': acc}

In [52]:
# Training arguments
training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=32,
    warmup_steps=500,  # Note: this tiny run has only 15 steps, so it ends while the learning rate is still warming up
    weight_decay=0.01,
    logging_dir='./logs',
)

# Create a Trainer to fine-tune the model
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    compute_metrics=compute_metrics
)

In [53]:
# Train the model
trainer.train()

Step,Training Loss


TrainOutput(global_step=15, training_loss=2.8104583740234377, metrics={'train_runtime': 261.4726, 'train_samples_per_second': 0.631, 'train_steps_per_second': 0.057, 'total_flos': 10854695301120.0, 'train_loss': 2.8104583740234377, 'epoch': 5.0})
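The run above finished after 15 optimizer steps, while `warmup_steps` is 500. The sketch below estimates the learning rate reached by the last step. It assumes the `TrainingArguments` defaults of a 5e-5 base learning rate and a linear warmup schedule, neither of which is set explicitly in this notebook, and `linear_warmup_lr` is a hypothetical helper.

```python
def linear_warmup_lr(step, base_lr=5e-5, warmup_steps=500):
    """Learning rate during the warmup phase only; decay after warmup is not modeled."""
    return base_lr * min(1.0, step / warmup_steps)

total_steps = 15  # global_step reported by trainer.train() above
final_lr = linear_warmup_lr(total_steps)
print(f"LR at step {total_steps}: {final_lr:.2e} ({final_lr / 5e-5:.0%} of the base rate)")
```

Under these assumptions the model is only ever updated with a small fraction of the intended learning rate, which may explain part of the high training loss. Lowering `warmup_steps` for small subsets would be one thing to try.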

### Evaluate Accuracy

Validation accuracy is about $1/9$ on a $16$-class classifier, which is close to chance. That is not good, but this run used only a very small subset of the dataset and only 5 epochs because of the required runtime. To-Do: add checkpointing so the model can be trained on the full data for more epochs.

In [56]:
train_results = trainer.evaluate(train_dataset)
print(train_results)

{'eval_loss': 2.790473461151123, 'eval_accuracy': 0.12121212121212122, 'eval_runtime': 17.8858, 'eval_samples_per_second': 1.845, 'eval_steps_per_second': 0.112, 'epoch': 5.0}


In [57]:
print("Training Accuracy:", train_results.get('eval_accuracy', 'Accuracy not calculated'))

Training Accuracy: 0.12121212121212122


In [55]:
val_results = trainer.evaluate(val_dataset)
print(val_results)

{'eval_loss': 2.6406145095825195, 'eval_accuracy': 0.1111111111111111, 'eval_runtime': 3.912, 'eval_samples_per_second': 2.301, 'eval_steps_per_second': 0.256, 'epoch': 5.0}


In [58]:
print("Val Accuracy:", val_results.get('eval_accuracy', 'Accuracy not calculated'))

Val Accuracy: 0.1111111111111111
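To put the accuracies above in context, the short calculation below compares them with two trivial baselines: uniform random guessing and always predicting the most frequent subject. The numbers are copied from the outputs in this notebook; nothing here is re-evaluated on the model.

```python
num_classes = 16
chance = 1 / num_classes   # uniform random guessing
majority = 576 / 8469      # always predict 'Refund request', the largest class

# Convert the reported accuracies back into counts of correct predictions
val_correct = round(0.1111111111111111 * 9)      # validation set has 9 tickets
train_correct = round(0.12121212121212122 * 33)  # training set has 33 tickets

print(f"Chance: {chance:.4f}, majority class: {majority:.4f}")
print(f"Validation: {val_correct}/9 correct, training: {train_correct}/33 correct")
```

With so few tickets, a single extra correct prediction moves validation accuracy by about 11 percentage points, so these figures say little about the model yet.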


### To-Do Checkpointing

Checkpoint every `save_steps` optimizer steps and at the end of each epoch. After a first run that starts from the `bert-base-uncased` base model, resume from a checkpoint with `model = BertForSequenceClassification.from_pretrained('/path/to/saved/model', num_labels=16)`. Evaluate on the validation set after each saved epoch to decide when to stop early.

In [29]:
def save_checkpoint(model, optimizer, epoch, batch_idx, checkpoint_path):
    checkpoint = {
        'epoch': epoch,
        'batch_idx': batch_idx,
        'model_state_dict': model.state_dict(),
        'optimizer_state_dict': optimizer.state_dict(),
    }
    torch.save(checkpoint, checkpoint_path)

In [30]:
def load_checkpoint(checkpoint_path, model, optimizer):
    checkpoint = torch.load(checkpoint_path)
    model.load_state_dict(checkpoint['model_state_dict'])
    optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
    return checkpoint['epoch'], checkpoint['batch_idx']

In [None]:
from transformers import BertForSequenceClassification

In [None]:
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=16)  # Must match the 16-class head that was saved
model.load_state_dict(torch.load('path/to/model.pt'))  # Expects a file holding a plain state_dict

In [None]:
trainer.train(resume_from_checkpoint='path/to/checkpoint')
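The plan above mentions evaluating each saved epoch on the validation set to stop early. A minimal sketch of that decision rule, independent of any training library, could look like this. The function name, the `patience` value, and the loss values are all made up for illustration.

```python
def best_epoch_with_patience(val_losses, patience=2):
    """Return (best_epoch, stop_epoch) for a list of per-epoch validation losses.

    Training stops once `patience` consecutive epochs fail to beat the best loss so far.
    """
    best_loss = float("inf")
    best_epoch = None
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return best_epoch, epoch
    # Never ran out of patience: training went through every epoch
    return best_epoch, len(val_losses) - 1

# Made-up validation losses, one per saved epoch checkpoint
losses = [2.64, 2.31, 2.05, 2.11, 2.20, 1.90]
print(best_epoch_with_patience(losses))  # (2, 4)
```

The checkpoint to keep is the one from `best_epoch`, not the one from the epoch where training stopped. The `transformers` library also ships an `EarlyStoppingCallback` that could replace a manual loop like this.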

### Future Questions

How high-quality is the data? The model can't perform better than the data I give it, and some of these tickets are hard even for me to classify. To find out, I will want to look at a fully trained model on well-curated data, potentially inspecting the tickets it got wrong and judging their difficulty.

Other fields, such as the product or the customer's age, could also influence the ticket subject and could be included as model inputs.
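One simple way to bring the product into the model input is through the text itself. The descriptions contain an unfilled `{product_purchased}` placeholder, and the `Product Purchased` column holds the value that belongs there. The sketch below fills the placeholder and prepends the product name. `build_model_input` and the `Product: ...` prefix format are assumptions, not something the dataset prescribes.

```python
def build_model_input(description, product):
    """Fill the unfilled template placeholder and prepend the product name."""
    filled = description.replace("{product_purchased}", product)
    return f"Product: {product}. {filled}"

# First sentence of ticket 0 and its product, taken from the outputs above
example = build_model_input(
    "I'm having an issue with the {product_purchased}. Please assist.",
    "GoPro Hero",
)
print(example)
```

Whether this helps is an open question: the subject may not depend on the product at all, so it would need to be compared against the description-only baseline.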

In [61]:
# Print the first 10 descriptions in full, without column-width truncation
pd.set_option('display.max_colwidth', None)
print(df['Ticket Description'][:10])
pd.reset_option('display.max_colwidth')

0                                                         I'm having an issue with the {product_purchased}. Please assist.\n\nYour billing zip code is: 71701.\n\nWe appreciate that you have requested a website address.\n\nPlease double check your email address. I've tried troubleshooting steps mentioned in the user manual, but the issue persists.
1                                                           I'm having an issue with the {product_purchased}. Please assist.\n\nIf you need to change an existing product.\n\nI'm having an issue with the {product_purchased}. Please assist.\n\nIf The issue I'm facing is intermittent. Sometimes it works fine, but other times it acts up unexpectedly.
2                                                                      I'm facing a problem with my {product_purchased}. The {product_purchased} is not turning on. It was working fine until yesterday, but now it doesn't respond.\n\n1.8.3 I really I'm using the original charger that came with my {produ