<a href="https://colab.research.google.com/github/Tstrebe2/help-desk-agent-pilot/blob/main/notebooks/help_ticket_vector_search_poc.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Help Desk Ticket Analysis

This notebook demonstrates how to create sentence embeddings for help desk tickets, index them with FAISS for similarity search, and save the results for another workflow.

## Setup and Data Loading

First, we install the necessary libraries and load the help desk ticket data.

In [None]:
!pip install sentence-transformers faiss-cpu watermark --quiet

Mount Google Drive to access the data file.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


Import the required libraries.

In [None]:
import os

import pandas as pd
import numpy as np

import faiss
from sentence_transformers import SentenceTransformer

Define the root path for the data.

In [None]:
ROOT_PATH = "drive/MyDrive/help-desk-tickets-prototype"

Load the dataset and drop rows with missing values in the 'actionbody' column.

In [None]:
df = pd.read_csv(os.path.join(ROOT_PATH, "sample_utterances.csv")).convert_dtypes()
print("df shape:", df.shape)
df = df.dropna(subset=["actionbody"])
print("df shape after drop na:", df.shape)

df shape: (30104, 9)
df shape after drop na: (30081, 9)


Display the first few entries of the 'actionbody' column to understand the data.

In [None]:
df.head()["actionbody"].tolist()

['dear ph_name team',
 'we need your urgent support to fix list of vulnerabilities reported in ph_name application .',
 'please provide resolution date before end of ‘january ph_technical .',
 '.',
 'problem: .']

## Embedding and Indexing

We use a pre-trained sentence transformer model to create embeddings for the ticket descriptions and then build a FAISS index for efficient similarity search.

In [None]:
# Load a fast, local embedding model
model = SentenceTransformer('all-MiniLM-L6-v2')

# Embed all ticket descriptions
descriptions = df['actionbody'].tolist()
embeddings = model.encode(descriptions, show_progress_bar=True).astype('float32')

Batches:   0%|          | 0/941 [00:00<?, ?it/s]
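The index built in the next step ranks tickets by Euclidean distance, so it helps to know how that relates to cosine similarity. The sketch below uses random unit vectors (not real ticket embeddings) to check the identity that, for L2-normalized vectors, squared L2 distance equals `2 - 2 * cosine`. The dimension 384 matches the output size of `all-MiniLM-L6-v2`; the random vectors and variable names are assumptions for illustration only.

```python
import numpy as np

# Two random 384-dimensional vectors standing in for sentence embeddings.
rng = np.random.default_rng(0)
a, b = rng.normal(size=(2, 384))

# L2-normalize both vectors so each has unit length.
a /= np.linalg.norm(a)
b /= np.linalg.norm(b)

cosine = float(a @ b)                 # cosine similarity of unit vectors
sq_l2 = float(np.sum((a - b) ** 2))   # squared Euclidean distance

# For unit vectors the two measures are related by sq_l2 = 2 - 2 * cosine,
# so ranking by L2 distance and ranking by cosine similarity agree.
print(np.isclose(sq_l2, 2 - 2 * cosine))
```

If the embeddings are not already unit length, passing `normalize_embeddings=True` to `model.encode` is one way to get this behavior.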

Build a flat (exact, brute-force) FAISS index over the embeddings. FAISS expects `float32` input, which is why the embeddings were cast in the previous cell.

In [None]:
# Build FAISS index
index = faiss.IndexFlatL2(embeddings.shape[1])
index.add(embeddings)
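To make clear what a flat L2 index computes, here is a minimal NumPy sketch of the same exhaustive search on toy data. `brute_force_l2_search` is a hypothetical helper written for illustration, not part of FAISS. Like `IndexFlatL2`, it reports squared Euclidean distances.

```python
import numpy as np

def brute_force_l2_search(vectors, query, k):
    """Return (squared distances, ids) of the k rows of `vectors` nearest to `query`."""
    diffs = vectors - query                          # broadcast the query over all rows
    sq_dists = np.einsum("ij,ij->i", diffs, diffs)   # row-wise squared L2 distance
    order = np.argsort(sq_dists, kind="stable")[:k]  # ids of the k smallest distances
    return sq_dists[order], order

# Toy 2-d "embeddings"; ids are simply row positions, as in a flat FAISS index.
vectors = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 4.0]], dtype="float32")
query = np.array([0.9, 0.1], dtype="float32")

dists, ids = brute_force_l2_search(vectors, query, k=2)
print(ids, dists)
```

A flat index scans every stored vector per query, which is exact and usually fast enough for a few tens of thousands of short embeddings.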

## Similarity Search

Encode a user query, search the FAISS index for the nearest tickets, and display the top 10 results to verify they are related to the query.

In [None]:
# Define the number of top results to retrieve and the user query
top_k = 10
user_query = "Data migration"

# Encode the user query
query_vec = model.encode([user_query]).astype('float32')

# Search FAISS for the top_k most similar tickets (IndexFlatL2 returns squared Euclidean distances)
D, I = index.search(query_vec, top_k)
D, I = D.squeeze(), I.squeeze()
top_vectors = embeddings[I]
top_descriptions = df.iloc[I]['actionbody'].tolist()

# Print the top results so they can be checked against the query
print(f"Top {top_k} similar tickets to {user_query}:")
for i, desc in enumerate(top_descriptions):
    print(f"{i+1}. {desc}")

Top 10 similar tickets to Data migration:
1. 3)execution of the ph_technical migration procedure
2. the issue mainly with migrated data.
3. we are doing test of migration script.
4. use migration script on ph_technical:
5. due to data center migration we will do this upgrade next week.
6. i sent a migration script by email, please suggest a time to connect to proceed it.
7. please kindly note that we shutdown the database on the dr server in order to move it to the new location.
8. greetings reference to our previous conversation and based on our last session observations , kindly note that the current data existing on ph_technical schema on uat ph_technical has many constraint issues due to the restoration process used to rename current tables and keep our system constraints linked to the old tables, a new dump should be taken from prod ph_technical after working hours and restored to uat in order to ensure better data consistency , we also need the data from prod to test the archivin
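
The search steps above can be wrapped in a reusable helper. The following is a sketch under stated assumptions: `search_tickets` is a hypothetical name, and `ToyModel` and `ToyIndex` are small stand-ins that only mimic the `encode` and `search` calls used in this notebook so the example runs without downloading a model. In the notebook itself, `model`, `index`, and `descriptions` would be passed instead.

```python
import numpy as np

def search_tickets(query, model, index, texts, top_k=10):
    """Return (text, distance) pairs for the top_k tickets nearest to `query`."""
    query_vec = np.asarray(model.encode([query]), dtype="float32")
    distances, ids = index.search(query_vec, top_k)
    results = []
    for dist, idx in zip(distances[0], ids[0]):
        if idx < 0:  # FAISS pads with -1 when fewer than top_k results exist
            continue
        results.append((texts[int(idx)], float(dist)))
    return results

class ToyModel:
    """Hypothetical stand-in for SentenceTransformer: text -> 2-d vector."""
    def encode(self, sentences):
        return np.array([[len(s), s.count(" ")] for s in sentences], dtype="float32")

class ToyIndex:
    """Hypothetical stand-in for a flat L2 index (exhaustive squared-L2 search)."""
    def __init__(self, vectors):
        self.vectors = vectors

    def search(self, queries, k):
        d = ((self.vectors[None, :, :] - queries[:, None, :]) ** 2).sum(axis=2)
        ids = np.argsort(d, axis=1, kind="stable")[:, :k]
        return np.take_along_axis(d, ids, axis=1), ids

texts = ["vpn down", "data migration", "reset my password now"]
toy_model = ToyModel()
toy_index = ToyIndex(toy_model.encode(texts))

results = search_tickets("Data migration", toy_model, toy_index, texts, top_k=2)
for text, dist in results:
    print(f"{dist:.1f}  {text}")
```

Indexing `distances[0]` rather than calling `squeeze()` keeps the result one-dimensional even when `top_k` is 1.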

## Saving Data

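Save the FAISS index and the filtered dataframe for future use. FAISS ids are row positions in the filtered dataframe, so the two files must be kept together and the row order must not change.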

In [None]:
faiss.write_index(index, os.path.join(ROOT_PATH, 'faiss_index.idx'))

In [None]:
df.to_csv(os.path.join(ROOT_PATH, 'sample_utterances_drop_na.csv'))
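
One detail matters when the saved CSV is reloaded in another workflow: after `dropna`, the dataframe's index labels have gaps, while FAISS ids are plain row positions. The sketch below uses a toy dataframe and an in-memory buffer (both assumptions for illustration, as is the `faiss_ids` list) to show that lookups should go through `iloc` rather than the index labels.

```python
import io
import pandas as pd

# Toy stand-in for the ticket data; the None row mimics a missing 'actionbody'.
toy = pd.DataFrame({"actionbody": ["reset password", None, "data migration", "vpn down"]})
toy = toy.dropna(subset=["actionbody"])  # surviving index labels: 0, 2, 3

# Round-trip through CSV the same way the notebook saves it (index column included).
buffer = io.StringIO()
toy.to_csv(buffer)
buffer.seek(0)
reloaded = pd.read_csv(buffer, index_col=0)

# FAISS ids are positions in insertion order, so rows are looked up with iloc.
faiss_ids = [1, 2]  # hypothetical ids, as returned by index.search
hits = reloaded.iloc[faiss_ids]["actionbody"].tolist()
print(reloaded.index.tolist(), hits)
```

Using `reloaded.loc[faiss_ids]` here would fail or return the wrong rows, because label 1 was dropped.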

## Environment Information

In [None]:
%reload_ext watermark
%watermark
%watermark --iversions

Last updated: 2025-07-10T20:20:10.454024+00:00

Python implementation: CPython
Python version       : 3.11.13
IPython version      : 7.34.0

Compiler    : GCC 11.4.0
OS          : Linux
Release     : 6.1.123+
Machine     : x86_64
Processor   : x86_64
CPU cores   : 2
Architecture: 64bit

numpy                : 2.0.2
networkx             : 3.5
google               : 2.0.3
scipy                : 1.15.3
plotly               : 5.24.1
sentence_transformers: 4.1.0
faiss                : 1.11.0
pandas               : 2.2.2

