# Auto-Tagging Support Tickets Using LLMs

## Objective:
Automatically tag support tickets into appropriate categories using Large Language Models (LLMs).

### Tasks Covered:
- Use **zero-shot classification** via `facebook/bart-large-mnli`
- Simulate **few-shot learning** using prompt-based examples
- Perform **fine-tuning** of `distilbert-base-uncased` on labeled ticket data
- Evaluate **Top-1** and **Top-3** prediction accuracy across all approaches


In [6]:
import pandas as pd
from transformers import pipeline
from transformers import AutoTokenizer, AutoModelForSequenceClassification, TrainingArguments, Trainer
from datasets import Dataset
import numpy as np
from sklearn.metrics import f1_score, precision_score, recall_score, accuracy_score
from sklearn.model_selection import train_test_split
import torch
import os

## Dataset Preparation
We simulate a small support-ticket dataset of 10 tickets with hand-labeled tags for evaluation. Each ticket can carry more than one tag.

In [7]:
# Support ticket dataset
data = {
    'ticket_id': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'text': [
        "My internet is not working. I can't connect to any websites.",
        "I need to reset my password for my email account.",
        "The application crashes every time I try to open a new file.",
        "My printer is offline and I can't print anything.",
        "How do I add a new user to the system?",
        "I'm experiencing slow internet speeds, pages are loading very slowly.",
        "The software update failed and now the program won't start.",
        "My account is locked, I can't log in.",
        "I want to request a new feature for the reporting tool.",
        "My mouse is not responding, I tried different USB ports."
    ],
    'actual_tags': [
        ['Internet', 'Connectivity'],
        ['Account', 'Password Reset'],
        ['Software', 'Bug'],
        ['Hardware', 'Printer'],
        ['Software', 'User Management'],
        ['Internet', 'Performance'],
        ['Software', 'Installation'],
        ['Account', 'Login Issue'],
        ['Feature Request'],
        ['Hardware', 'Peripheral']
    ]
}
df = pd.DataFrame(data)

# Collect every distinct tag across all tickets, sorted for a stable label order
possible_tags = sorted({tag for tags in df['actual_tags'] for tag in tags})

print("Dataset loaded successfully.")
print(df.head())
print("\nPossible Tags:", possible_tags)

Dataset loaded successfully.
   ticket_id                                               text  \
0          1  My internet is not working. I can't connect to...   
1          2  I need to reset my password for my email account.   
2          3  The application crashes every time I try to op...   
3          4  My printer is offline and I can't print anything.   
4          5             How do I add a new user to the system?   

                   actual_tags  
0     [Internet, Connectivity]  
1    [Account, Password Reset]  
2              [Software, Bug]  
3          [Hardware, Printer]  
4  [Software, User Management]  

Possible Tags: ['Account', 'Bug', 'Connectivity', 'Feature Request', 'Hardware', 'Installation', 'Internet', 'Login Issue', 'Password Reset', 'Performance', 'Peripheral', 'Printer', 'Software', 'User Management']
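
Since each ticket can carry several tags, a common way to represent the targets is a multi-hot vector over the tag vocabulary. The sketch below shows the idea in plain Python for the first three tickets; `encode_multi_hot` is a hypothetical helper and is not used elsewhere in this notebook.

```python
# Sketch: encode multi-label tags as multi-hot vectors (pure Python).
# `encode_multi_hot` is a hypothetical helper, not part of the notebook.

def encode_multi_hot(tags, vocabulary):
    """Return a 0/1 list aligned with `vocabulary`, 1 where the tag is present."""
    present = set(tags)
    return [1 if tag in present else 0 for tag in vocabulary]

# Tag lists for the first three tickets, copied from the dataset above.
sample_tags = [
    ['Internet', 'Connectivity'],
    ['Account', 'Password Reset'],
    ['Software', 'Bug'],
]

# Vocabulary built the same way as `possible_tags`, restricted to the sample.
vocabulary = sorted({tag for tags in sample_tags for tag in tags})
multi_hot = [encode_multi_hot(tags, vocabulary) for tags in sample_tags]

print(vocabulary)
for row in multi_hot:
    print(row)
```

Each row has one position per tag, so a ticket with two tags has two ones. This is the target shape a true multi-label classifier would need, in contrast to the single-label simplification used later for fine-tuning.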


## Zero-Shot Classification using `facebook/bart-large-mnli`

We use Hugging Face's zero-shot classification pipeline to assign the top 3 tags to each support ticket without any training.


In [8]:
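# Zero-shot classification
classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli", device="cpu")

def zero_shot_classify(text, candidate_labels, top_k=3):
    """Return the top_k candidate labels for `text`, highest score first."""
    # multi_label=True scores each label independently, so several tags can score high at once
    result = classifier(text, candidate_labels, multi_label=True)
    # Sort explicitly by score so the ranking does not depend on the pipeline's output order
    sorted_scores = sorted(zip(result['labels'], result['scores']), key=lambda x: x[1], reverse=True)
    return [label for label, _ in sorted_scores[:top_k]]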

print("Performing zero-shot classification...")
df['zero_shot_predictions'] = df['text'].apply(lambda x: zero_shot_classify(x, possible_tags))

print("\nZero-Shot Predictions (first 3):")
for i, row in df.head(3).iterrows():
    print(f"Ticket: {row['text']}\nActual: {row['actual_tags']}\nPredicted (Zero-Shot): {row['zero_shot_predictions']}\n---")

Performing zero-shot classification...

Zero-Shot Predictions (first 3):
Ticket: My internet is not working. I can't connect to any websites.
Actual: ['Internet', 'Connectivity']
Predicted (Zero-Shot): ['Internet', 'Connectivity', 'Peripheral']
---
Ticket: I need to reset my password for my email account.
Actual: ['Account', 'Password Reset']
Predicted (Zero-Shot): ['Password Reset', 'User Management', 'Login Issue']
---
Ticket: The application crashes every time I try to open a new file.
Actual: ['Software', 'Bug']
Predicted (Zero-Shot): ['Software', 'Bug', 'Performance']
---
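
To make the scoring step less of a black box, the sketch below shows how an NLI-based zero-shot classifier can turn per-label logits into scores. As we understand the pipeline, each candidate label is inserted into a hypothesis template and the model judges whether the ticket entails it; with `multi_label=True` each label is scored on its own. The logits here are made-up illustrative numbers, and the two helper functions are hypothetical, not the pipeline's actual code.

```python
import math

# Sketch of how NLI-based zero-shot scoring turns logits into label scores.
# All logits below are made-up illustrative numbers, not model output.

HYPOTHESIS_TEMPLATE = "This example is {}."  # default template of the HF zero-shot pipeline

def softmax(values):
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def multi_label_scores(nli_logits):
    """Score each label independently: softmax over (contradiction, entailment)."""
    scores = {}
    for label, (contradiction, entailment) in nli_logits.items():
        scores[label] = softmax([contradiction, entailment])[1]
    return scores

def single_label_scores(nli_logits):
    """Labels compete: softmax over the entailment logits of all labels."""
    labels = list(nli_logits)
    probs = softmax([nli_logits[label][1] for label in labels])
    return dict(zip(labels, probs))

# Hypothetical (contradiction, entailment) logits per candidate label.
example_logits = {
    "Internet": (-2.0, 3.0),
    "Connectivity": (-1.0, 2.0),
    "Printer": (2.5, -2.0),
}

for label in example_logits:
    print(HYPOTHESIS_TEMPLATE.format(label))

independent = multi_label_scores(example_logits)
competing = single_label_scores(example_logits)
print(independent)
print(competing)
```

In the independent variant the scores do not have to sum to 1, which is why two related tags such as `Internet` and `Connectivity` can both rank near the top for the same ticket.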


## Few-Shot Learning

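We illustrate few-shot prompting by building a prompt that embeds 3 labeled examples ahead of the ticket to classify.
Because `facebook/bart-large-mnli` is an NLI classifier rather than a generative model, the prompt is not sent to a model here: `few_shot_classify_conceptual` falls back to the zero-shot classifier, so its predictions are identical to the zero-shot ones. The section demonstrates the prompt format without any model re-training.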

In [9]:
# Example few-shot pairs (from dataset)
few_shot_examples = [
    (df['text'][0], df['actual_tags'][0]),
    (df['text'][1], df['actual_tags'][1]),
    (df['text'][2], df['actual_tags'][2])
]

# This function shows how a prompt would be constructed for a generative LLM
def create_few_shot_prompt(ticket_text, examples, candidate_labels):
    prompt = "Classify the following support tickets into one or more of these categories:\n"
    prompt += f"Categories: {', '.join(candidate_labels)}\n\n"
    for example_text, example_tags in examples:
        prompt += f"Ticket: {example_text}\nTags: {', '.join(example_tags)}\n\n"
    prompt += f"Ticket: {ticket_text}\nTags:"
    return prompt

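def few_shot_classify_conceptual(text, candidate_labels, examples, top_k=3):
    # Placeholder: `examples` is accepted but unused because no generative model is called here.
    # The call falls back to the zero-shot classifier, so results match the zero-shot predictions.
    return zero_shot_classify(text, candidate_labels, top_k)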

print("\nPerforming few-shot (conceptual) classification...")
df['few_shot_predictions'] = df['text'].apply(
    lambda x: few_shot_classify_conceptual(x, possible_tags, few_shot_examples)
)

print("\nFew-Shot (Conceptual) Predictions (first 3):")
for i, row in df.head(3).iterrows():
    print(f"Ticket: {row['text']}\nActual: {row['actual_tags']}\nPredicted (Few-Shot/Conceptual): {row['few_shot_predictions']}\n---")


Performing few-shot (conceptual) classification...

Few-Shot (Conceptual) Predictions (first 3):
Ticket: My internet is not working. I can't connect to any websites.
Actual: ['Internet', 'Connectivity']
Predicted (Few-Shot/Conceptual): ['Internet', 'Connectivity', 'Peripheral']
---
Ticket: I need to reset my password for my email account.
Actual: ['Account', 'Password Reset']
Predicted (Few-Shot/Conceptual): ['Password Reset', 'User Management', 'Login Issue']
---
Ticket: The application crashes every time I try to open a new file.
Actual: ['Software', 'Bug']
Predicted (Few-Shot/Conceptual): ['Software', 'Bug', 'Performance']
---
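
If the prompt from `create_few_shot_prompt` were sent to a generative LLM, the reply would be free text that still has to be mapped back onto the known tags. The sketch below shows one way to do that; `parse_tags_from_completion` and the sample completion are hypothetical and assume the model answers with a comma-separated list on the first line.

```python
# Sketch: turn a generative model's few-shot completion into validated tags.
# `parse_tags_from_completion` and the sample completion are hypothetical.

def parse_tags_from_completion(completion, candidate_labels, top_k=3):
    """Keep only known labels, case-insensitively, in the order the model gave them."""
    canonical = {label.lower(): label for label in candidate_labels}
    stripped = completion.strip()
    first_line = stripped.splitlines()[0] if stripped else ""
    tags = []
    for raw in first_line.split(","):
        label = canonical.get(raw.strip().lower())
        if label is not None and label not in tags:
            tags.append(label)
    return tags[:top_k]

candidate_labels = ['Account', 'Bug', 'Hardware', 'Internet', 'Login Issue', 'Printer']

# Pretend this text came back from an LLM after the trailing "Tags:" of the prompt.
completion = " hardware, Printer, Toner\nTicket: (the model kept generating)"
print(parse_tags_from_completion(completion, candidate_labels))
# ['Hardware', 'Printer']
```

Unknown tags such as `Toner` are dropped and anything after the first line is ignored, which keeps the output comparable with the label set used by the other approaches.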


## Fine-Tuning a Transformer on Support Tickets

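We fine-tune a `distilbert-base-uncased` model with `Trainer` on the ticket dataset, using simplified single-label targets: each ticket keeps only its first tag as the `primary_tag`.
This gives a baseline to compare against the zero-shot and few-shot approaches. With the trained model, we predict the 3 most likely primary tags for each ticket. Note that the 10 tickets are split into only 8 training and 2 test examples, so the run demonstrates the workflow rather than a meaningful benchmark.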
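
With so few tickets, a held-out split can easily contain a class that never appears in training, in which case the model cannot predict it. The sketch below checks for that situation; `labels_missing_from_train` is a hypothetical helper, and the split shown is an illustrative one, not the split that `train_test_split` produces in this notebook.

```python
# Sketch: check which labels in a test split never appear in the training split.
# `labels_missing_from_train` is a hypothetical helper; the split below is an
# illustrative example, not the one produced by `train_test_split`.

from collections import Counter

def labels_missing_from_train(train_labels, test_labels):
    """Return test labels (sorted) that the model would never see during training."""
    return sorted(set(test_labels) - set(train_labels))

# Primary (first) tags of the ten tickets, in dataset order.
primary_tags = ['Internet', 'Account', 'Software', 'Hardware', 'Software',
                'Internet', 'Software', 'Account', 'Feature Request', 'Hardware']

# Illustrative 8/2 split: hold out the last two tickets.
train_labels, test_labels = primary_tags[:8], primary_tags[8:]

print(Counter(train_labels))
print(labels_missing_from_train(train_labels, test_labels))
# ['Feature Request']
```

A class with a single example, such as `Feature Request`, ends up either only in training or only in testing, so per-class metrics on a split like this should be read with caution.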


In [10]:
# Fine-Tuning
os.environ["TOKENIZERS_PARALLELISM"] = "false"

device = torch.device("cpu")

df['primary_tag'] = df['actual_tags'].apply(lambda x: x[0] if x else 'Unknown')

unique_primary_tags = sorted(list(set(df['primary_tag'])))
tag_to_id = {tag: i for i, tag in enumerate(unique_primary_tags)}
id_to_tag = {i: tag for i, tag in enumerate(unique_primary_tags)}

df['label'] = df['primary_tag'].map(tag_to_id)

print("\nPrimary Tag Value Counts (before splitting):")
print(df['primary_tag'].value_counts())

train_df, test_df = train_test_split(df, test_size=0.2, random_state=42)

train_dataset = Dataset.from_pandas(train_df[['text', 'label']])
test_dataset = Dataset.from_pandas(test_df[['text', 'label']])

# Load Tokenizer and Model
model_name = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=len(unique_primary_tags)).to(device)

def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)

tokenized_train_dataset = train_dataset.map(tokenize_function, batched=True)
tokenized_test_dataset = test_dataset.map(tokenize_function, batched=True)

# Define Metrics
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return {
        "accuracy": accuracy_score(labels, predictions),
        "f1": f1_score(labels, predictions, average="weighted"),
        "precision": precision_score(labels, predictions, average="weighted"),
        "recall": recall_score(labels, predictions, average="weighted"),
    }

# Configure Training Arguments
training_args = TrainingArguments(
    output_dir="./results_cpu",
    evaluation_strategy="epoch",  # newer transformers releases rename this argument to `eval_strategy`
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=3,
    weight_decay=0.01,
    report_to="none",
    logging_dir='./logs_cpu',
    logging_steps=10,
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="f1",
)

# Create Trainer and Train
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train_dataset,
    eval_dataset=tokenized_test_dataset,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)

print("\nStarting fine-tuning (this may take a while on CPU)...")
trainer.train()

print("\nFine-tuning complete. Evaluating fine-tuned model...")
eval_results = trainer.evaluate()
print(f"Fine-tuned model evaluation results: {eval_results}")

# Make predictions with fine-tuned model
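def fine_tuned_classify(text, tokenizer, model, id_to_tag, top_k=3):
    """Return the top_k primary tags predicted by the fine-tuned model for one ticket."""
    model.eval()  # make sure dropout is disabled at inference time
    inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True).to(device)
    with torch.no_grad():
        outputs = model(**inputs)
    # Softmax over the class logits of the single input in the batch
    probabilities = torch.softmax(outputs.logits, dim=-1)[0]
    # Never request more classes than the model has
    k = min(top_k, probabilities.shape[-1])
    _, top_k_indices = torch.topk(probabilities, k)
    return [id_to_tag[idx.item()] for idx in top_k_indices]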

df['fine_tuned_predictions'] = df['text'].apply(lambda x: fine_tuned_classify(x, tokenizer, model, id_to_tag))

print("\nFine-tuned Predictions (first 3):")
for i, row in df.head(3).iterrows():
    print(f"Ticket: {row['text']}\nActual: {row['actual_tags']}\nPredicted (Fine-tuned): {row['fine_tuned_predictions']}\n---")


Primary Tag Value Counts (before splitting):
primary_tag
Software           3
Internet           2
Account            2
Hardware           2
Feature Request    1
Name: count, dtype: int64


Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Map: 100%|██████████| 8/8 [00:00<00:00, 133.68 examples/s]
Map: 100%|██████████| 2/2 [00:00<00:00, 317.57 examples/s]



Starting fine-tuning (this may take a while on CPU)...


 33%|███▎      | 1/3 [00:21<00:40, 20.41s/it]

{'eval_loss': 1.6937503814697266, 'eval_accuracy': 0.0, 'eval_f1': 0.0, 'eval_precision': 0.0, 'eval_recall': 0.0, 'eval_runtime': 1.2469, 'eval_samples_per_second': 1.604, 'eval_steps_per_second': 0.802, 'epoch': 1.0}


 67%|██████▋   | 2/3 [00:47<00:23, 23.36s/it]

{'eval_loss': 1.7046622037887573, 'eval_accuracy': 0.0, 'eval_f1': 0.0, 'eval_precision': 0.0, 'eval_recall': 0.0, 'eval_runtime': 1.1375, 'eval_samples_per_second': 1.758, 'eval_steps_per_second': 0.879, 'epoch': 2.0}


100%|██████████| 3/3 [01:14<00:00, 22.57s/it]

{'eval_loss': 1.7103205919265747, 'eval_accuracy': 0.0, 'eval_f1': 0.0, 'eval_precision': 0.0, 'eval_recall': 0.0, 'eval_runtime': 0.8088, 'eval_samples_per_second': 2.473, 'eval_steps_per_second': 1.236, 'epoch': 3.0}


100%|██████████| 3/3 [01:26<00:00, 28.87s/it]


{'train_runtime': 86.6076, 'train_samples_per_second': 0.277, 'train_steps_per_second': 0.035, 'train_loss': 1.5724231402079265, 'epoch': 3.0}

Fine-tuning complete. Evaluating fine-tuned model...


100%|██████████| 1/1 [00:00<00:00, 117.90it/s]


Fine-tuned model evaluation results: {'eval_loss': 1.6937503814697266, 'eval_accuracy': 0.0, 'eval_f1': 0.0, 'eval_precision': 0.0, 'eval_recall': 0.0, 'eval_runtime': 0.7833, 'eval_samples_per_second': 2.553, 'eval_steps_per_second': 1.277, 'epoch': 3.0}

Fine-tuned Predictions (first 3):
Ticket: My internet is not working. I can't connect to any websites.
Actual: ['Internet', 'Connectivity']
Predicted (Fine-tuned): ['Hardware', 'Internet', 'Software']
---
Ticket: I need to reset my password for my email account.
Actual: ['Account', 'Password Reset']
Predicted (Fine-tuned): ['Hardware', 'Internet', 'Software']
---
Ticket: The application crashes every time I try to open a new file.
Actual: ['Software', 'Bug']
Predicted (Fine-tuned): ['Hardware', 'Internet', 'Software']
---
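
The three tickets above all receive the same ranking, which is what one would expect when the class logits barely depend on the input. The sketch below reproduces the top-k step of `fine_tuned_classify` in plain Python; the logits are made-up numbers chosen to be nearly flat, not values taken from the trained model.

```python
import math

# Sketch: pick the top-k labels from a vector of classifier logits (pure Python).
# The logits are made-up, nearly flat numbers, not output of the trained model.

def top_k_labels(logits, id_to_label, k=3):
    """Softmax the logits, then return the k (label, probability) pairs with highest probability."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return [(id_to_label[i], probs[i]) for i in ranked[:k]]

# Same label order as `unique_primary_tags` (sorted alphabetically).
id_to_label = {0: 'Account', 1: 'Feature Request', 2: 'Hardware', 3: 'Internet', 4: 'Software'}
hypothetical_logits = [-0.05, -0.10, 0.12, 0.08, 0.03]

ranking = top_k_labels(hypothetical_logits, id_to_label)
for label, prob in ranking:
    print(f"{label}: {prob:.3f}")
```

With five classes and nearly flat logits, every probability sits close to 0.2, so the ranking is decided by tiny differences and says little about the ticket itself.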


## Performance Comparison

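We evaluate all three methods using:
- Top-1 accuracy: Is the highest-ranked predicted tag among the ticket's actual tags?
- Top-3 accuracy: Is at least one of the top 3 predicted tags among the ticket's actual tags?

This lets us compare the zero-shot, few-shot, and fine-tuned strategies on the same tickets. Two caveats apply when reading the numbers: the few-shot predictions come from the same zero-shot classifier, so the two rows are identical by construction, and the fine-tuned model is scored on all 10 tickets, including the 8 it was trained on, while it can only predict the five primary tags.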


In [11]:
# Performance Comparison
def evaluate_top_k_predictions(actual_tags_list, predicted_top_k_tags_list, k=3):
    correct_at_1 = 0
    correct_at_k = 0
    total = len(actual_tags_list)

    for i in range(total):
        actual = set(actual_tags_list[i])
        predicted_top_k = predicted_top_k_tags_list[i][:k] 

        if actual and predicted_top_k and predicted_top_k[0] in actual:
            correct_at_1 += 1
        
        if actual and any(tag in actual for tag in predicted_top_k):
            correct_at_k += 1
            
    return correct_at_1 / total, correct_at_k / total

# Zero-shot evaluation
zero_shot_acc_at_1, zero_shot_acc_at_k = evaluate_top_k_predictions(
    df['actual_tags'].tolist(), df['zero_shot_predictions'].tolist()
)
print(f"\n Zero-Shot Performance (Top 3 predictions)")
print(f"  Accuracy at K=1: {zero_shot_acc_at_1:.4f}")
print(f"  Accuracy at K=3: {zero_shot_acc_at_k:.4f}")

# Few-shot (Conceptual) evaluation
few_shot_acc_at_1, few_shot_acc_at_k = evaluate_top_k_predictions(
    df['actual_tags'].tolist(), df['few_shot_predictions'].tolist()
)
print(f"\n Few-Shot (Conceptual/Simulated) Performance (Top 3 predictions)")
print(f"  Accuracy at K=1: {few_shot_acc_at_1:.4f}")
print(f"  Accuracy at K=3: {few_shot_acc_at_k:.4f}")

fine_tuned_predictions_for_eval = df['fine_tuned_predictions'].tolist()
fine_tuned_acc_at_1, fine_tuned_acc_at_k = evaluate_top_k_predictions(
    df['actual_tags'].tolist(), fine_tuned_predictions_for_eval
)
print(f"\n Fine-tuned (Simplified Single-Label) Performance (Top 3 predictions)")
print(f"  Accuracy at K=1: {fine_tuned_acc_at_1:.4f}")
print(f"  Accuracy at K=3: {fine_tuned_acc_at_k:.4f}")


 Zero-Shot Performance (Top 3 predictions)
  Accuracy at K=1: 0.9000
  Accuracy at K=3: 0.9000

 Few-Shot (Conceptual/Simulated) Performance (Top 3 predictions)
  Accuracy at K=1: 0.9000
  Accuracy at K=3: 0.9000

 Fine-tuned (Simplified Single-Label) Performance (Top 3 predictions)
  Accuracy at K=1: 0.2000
  Accuracy at K=3: 0.7000
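
Top-k accuracy counts a ticket as correct as soon as one predicted tag matches, so it does not show how many of a ticket's tags were recovered. As a complementary view, the sketch below computes precision@k and recall@k on the first three tickets and their zero-shot predictions shown earlier; `precision_recall_at_k` is a hypothetical helper that is not part of the evaluation above.

```python
# Sketch: precision@k and recall@k for multi-label tags (pure Python).
# `precision_recall_at_k` is a hypothetical helper, not part of the notebook.

def precision_recall_at_k(actual_tags_list, predicted_tags_list, k=3):
    """Average per-ticket precision@k and recall@k."""
    precisions, recalls = [], []
    for actual, predicted in zip(actual_tags_list, predicted_tags_list):
        top_k = predicted[:k]
        hits = len(set(actual) & set(top_k))
        precisions.append(hits / len(top_k) if top_k else 0.0)
        recalls.append(hits / len(actual) if actual else 0.0)
    n = len(precisions)
    return sum(precisions) / n, sum(recalls) / n

# Actual tags and zero-shot predictions for the first three tickets shown earlier.
actual = [['Internet', 'Connectivity'], ['Account', 'Password Reset'], ['Software', 'Bug']]
predicted = [
    ['Internet', 'Connectivity', 'Peripheral'],
    ['Password Reset', 'User Management', 'Login Issue'],
    ['Software', 'Bug', 'Performance'],
]

precision, recall = precision_recall_at_k(actual, predicted)
print(f"Precision@3: {precision:.4f}")  # 0.5556
print(f"Recall@3:    {recall:.4f}")     # 0.8333
```

All three tickets count as hits under Top-1 and Top-3 accuracy, yet recall@3 shows that the `Account` tag of the second ticket was missed.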
