# Ticket Clustering — TF-IDF & Embedding-Based

This notebook explores clustering of support tickets using two unsupervised algorithms on two types of vector representations.

### Techniques:
- **K-Means Clustering**
- **Agglomerative Clustering** (Ward linkage)

### Representations:
- **TF-IDF Vectors**
- **Sentence Embeddings** (`MiniLM` from Sentence Transformers)

Each method is evaluated on how well the resulting clusters align with the true ticket categories. This step is optional and runs only when ground-truth labels are available. To enable comparison, we use the Hungarian algorithm to optimally match predicted cluster labels to ground-truth classes before computing metrics such as accuracy and F1-score.
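
As a minimal sketch of the matching step, assume a made-up 3×3 contingency table in which rows are true classes and columns are cluster ids. `scipy.optimize.linear_sum_assignment` then finds the one-to-one pairing with the largest total overlap. The counts are invented for illustration and are not results from this dataset.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Hypothetical contingency table: rows = true classes, columns = cluster ids
contingency = np.array([
    [ 2, 40,  3],
    [35,  1,  4],
    [ 5,  2, 30],
])

# linear_sum_assignment minimizes cost, so negate the counts to maximize overlap
row_ind, col_ind = linear_sum_assignment(-contingency)

# Map each cluster id (column) to the class index (row) it was matched with
cluster_to_class = {int(c): int(r) for r, c in zip(row_ind, col_ind)}
matched = int(contingency[row_ind, col_ind].sum())

print(cluster_to_class)
print(f"tickets on matched pairs: {matched} of {contingency.sum()}")
```

Once every cluster id is renamed to its matched class, ordinary classification metrics can be computed on the renamed predictions.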

All text is filtered for private information before clustering.


## Environment Setup and Imports

In [None]:
# Required packages:
# pip install pandas numpy matplotlib seaborn scikit-learn umap-learn sentence-transformers wordcloud


In [None]:
# === General Purpose ===
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import matplotlib.cm as cm
import matplotlib

# === Clustering + Evaluation ===
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.metrics import (
    classification_report, confusion_matrix, silhouette_samples,
    silhouette_score, calinski_harabasz_score, davies_bouldin_score,
    adjusted_rand_score, precision_score, recall_score, f1_score
)
from sklearn.metrics.cluster import contingency_matrix
from scipy.optimize import linear_sum_assignment  # Hungarian Algorithm

# === Vectorization + Reduction ===
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import PCA
import umap.umap_ as umap
from sentence_transformers import SentenceTransformer

# === Word Clouds ===
from collections import defaultdict
from wordcloud import WordCloud


## Configuration

In [None]:
# === Config ===
EMBEDDING_METHOD = "sentence-transformer"   # Options: "tfidf", "sentence-transformer"
EMBEDDING_MODEL = "all-MiniLM-L6-v2"        # Used only if method is "sentence-transformer"
N_CLUSTERS = 7                              # Set to number of expected groups (e.g., issue types)
VISUALIZE_WITH = "umap"                     # Options: "pca", "umap"
DATA_PATH = "cleaned.csv"                   # Path to your preprocessed dataset


## Load and Prepare Data

In [None]:
# === Load Dataset ===
# The file is read as semicolon-separated; change `sep` if your CSV uses another delimiter
df = pd.read_csv(DATA_PATH, sep=';')
texts = df["combined_text"].astype(str).tolist()


### Choose Embedding Model

In [None]:
# === Embedding Step ===
# Toggle between TF-IDF and SentenceTransformer via EMBEDDING_METHOD

if EMBEDDING_METHOD == "tfidf":
    print("Embedding with TF-IDF")
    vectorizer = TfidfVectorizer(max_features=1000)
    X = vectorizer.fit_transform(texts)
    X_dense = X.toarray()  # Dense copy used by the clustering, scoring, and plotting cells
elif EMBEDDING_METHOD == "sentence-transformer":
    print(f"Embedding with SentenceTransformer: {EMBEDDING_MODEL}")
    model = SentenceTransformer(EMBEDDING_MODEL)
    X = model.encode(texts, show_progress_bar=True)
    X_dense = X  # Already a dense NumPy array
else:
    raise ValueError("Invalid EMBEDDING_METHOD setting")


## KMeans Clustering
This section applies the KMeans algorithm to the embedded ticket data.

- Runs on the dense matrix `X_dense` (TF-IDF or sentence embeddings)

It helps reveal:
- Natural groupings of tickets based on content
- Cluster assignments for exploratory analysis or labeling

Note: KMeans minimizes the squared Euclidean distance between points and their cluster centroids, so it works best when clusters are roughly spherical and of similar size.
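
To make the Euclidean objective concrete, the sketch below fits KMeans on synthetic blobs (stand-in data, not the ticket embeddings) and checks that the fitted model's `inertia_` equals the sum of squared distances from each point to its assigned centroid. The names `X_demo` and `km` are illustrative only.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in data; not the ticket embeddings
X_demo, _ = make_blobs(n_samples=150, centers=3, n_features=4, random_state=0)

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_demo)

# Squared Euclidean distance from each point to its assigned centroid
assigned_centroids = km.cluster_centers_[km.labels_]
manual_inertia = float(np.sum((X_demo - assigned_centroids) ** 2))

# The two values should agree up to floating-point error
print(np.isclose(manual_inertia, km.inertia_))
```

Because this quantity is all KMeans optimizes, elongated or very unevenly sized groups can be split or merged even when they are visually distinct.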


### K-Means Configs

In [None]:
# === KMeans Clustering ===
kmeans = KMeans(
    n_clusters=N_CLUSTERS,    # Number of clusters to create
    init='k-means++',         # Smart centroid initialization
    n_init='auto',            # scikit-learn ≥1.2: 'auto' runs once with k-means++ init
    max_iter=300,             # Maximum number of iterations
    algorithm='lloyd',        # Standard KMeans algorithm
    random_state=42           # For reproducibility
)

clusters = kmeans.fit_predict(X_dense)
df["cluster"] = clusters


### Run Model

In [None]:
# === Fit KMeans ===
# Reuse the estimator configured above instead of creating a new one with default settings
clusters = kmeans.fit_predict(X_dense)
df["cluster"] = clusters


### Optional: Evaluation with True Labels


In [None]:
if "Issue Type" in df.columns:
    true_labels = df["Issue Type"]
    class_names = sorted(true_labels.unique())
    ari = adjusted_rand_score(true_labels, clusters)
    sil_score = silhouette_score(X_dense, clusters)

    def cluster_purity(y_true, y_pred):
        # Fraction of samples that fall in the majority class of their cluster
        contingency = contingency_matrix(y_true, y_pred)
        return np.sum(np.amax(contingency, axis=0)) / np.sum(contingency)

    purity = cluster_purity(true_labels, clusters)

    # Rows = true classes (sorted), columns = cluster ids (sorted)
    contingency = contingency_matrix(true_labels, clusters)
    # Hungarian matching: negate the counts to maximize total overlap
    row_ind, col_ind = linear_sum_assignment(-contingency)
    # Start from each cluster's majority class, then apply the one-to-one matches,
    # so clusters left unmatched (more clusters than classes) still get a label
    col_to_class = contingency.argmax(axis=0)
    col_to_class[col_ind] = row_ind
    cluster_to_label_map = dict(zip(np.unique(clusters), col_to_class))

    mapped_preds = np.array([cluster_to_label_map[c] for c in clusters])
    y_true_mapped = np.array(true_labels.map({label: i for i, label in enumerate(class_names)}))

    print("\n=== KMeans Clustering Evaluation Metrics ===")
    print(f"Adjusted Rand Index (ARI): {ari:.4f}")
    print(f"Silhouette Score: {sil_score:.4f}")
    print(f"Cluster Purity: {purity:.4f}")

    print("\n=== Metrics After Hungarian Mapping ===")
    print(classification_report(y_true_mapped, mapped_preds,
                                labels=list(range(len(class_names))), target_names=class_names))

    # Named conf_mat so it does not shadow the `matplotlib.cm` import
    conf_mat = confusion_matrix(y_true_mapped, mapped_preds)
    plt.figure(figsize=(10, 6))
    sns.heatmap(conf_mat, annot=True, fmt="d", cmap="Greens")
    plt.title("Confusion Matrix (Aligned Clusters vs True Labels)")
    plt.xlabel("Predicted")
    plt.ylabel("True")
    plt.tight_layout()
    plt.show()


### Cluster Sizes


In [None]:
cluster_counts = df["cluster"].value_counts().sort_index()
cluster_table = pd.DataFrame({'Cluster': cluster_counts.index, 'Number of Items': cluster_counts.values})

plt.figure(figsize=(10, 6))
bars = plt.bar(cluster_table['Cluster'], cluster_table['Number of Items'], color='skyblue')
plt.xlabel('Cluster')
plt.ylabel('Number of Items')
plt.title('Number of Items per Cluster')
plt.xticks(cluster_table['Cluster'])
for bar in bars:
    yval = bar.get_height()
    plt.text(bar.get_x() + bar.get_width()/2, yval, int(yval), va='bottom')
plt.tight_layout()
plt.show()


### Cluster Visualization (PCA or UMAP)

In [None]:
if VISUALIZE_WITH == "pca":
    reducer = PCA(n_components=2)
elif VISUALIZE_WITH == "umap":
    reducer = umap.UMAP(random_state=42)
else:
    raise ValueError("Invalid VISUALIZE_WITH setting")

X_2d = reducer.fit_transform(X_dense)
plt.figure(figsize=(8, 6))
plt.scatter(X_2d[:, 0], X_2d[:, 1], c=clusters, cmap='viridis', s=30)
plt.title("KMeans Clustering of Ticket Texts")
plt.xlabel("Component 1")
plt.ylabel("Component 2")
plt.grid(True)
plt.tight_layout()
plt.show()


### Intrinsic Clustering Metrics

In [None]:

def evaluate_intrinsic_metrics(X, labels, title="Clustering Quality Metrics"):
    sil = silhouette_score(X, labels)
    ch = calinski_harabasz_score(X, labels)
    db = davies_bouldin_score(X, labels)

    print(f"\n=== {title} ===")
    print(f"Silhouette Score: {sil:.4f} (higher = better)")
    print(f"Calinski-Harabasz Index: {ch:.4f} (higher = better)")
    print(f"Davies-Bouldin Index: {db:.4f} (lower = better)")

evaluate_intrinsic_metrics(X_dense, clusters, "KMeans")

### Cluster Word Clouds

In [None]:
def plot_wordclouds(X_text, clusters):
    # Collect the texts of each cluster, then join them once per cluster
    texts_by_cluster = defaultdict(list)
    for text, c in zip(X_text, clusters):
        texts_by_cluster[c].append(str(text))

    # Plot only the clusters that actually occur in `clusters`
    for c in sorted(texts_by_cluster):
        wc = WordCloud(width=600, height=400, background_color='white').generate(" ".join(texts_by_cluster[c]))
        plt.figure(figsize=(6, 4))
        plt.imshow(wc, interpolation='bilinear')
        plt.axis("off")
        plt.title(f"Word Cloud - Cluster {c}")
        plt.tight_layout()
        plt.show()

plot_wordclouds(df["combined_text"], clusters)

### Silhouette Plot

In [None]:
def plot_silhouette(X, labels, title="Silhouette Plot"):
    silhouette_vals = silhouette_samples(X, labels)
    y_lower = 10
    for i in range(N_CLUSTERS):
        ith_vals = silhouette_vals[labels == i]
        ith_vals.sort()
        size_i = ith_vals.shape[0]
        y_upper = y_lower + size_i
        color = plt.cm.nipy_spectral(float(i) / N_CLUSTERS)
        plt.fill_betweenx(np.arange(y_lower, y_upper), 0, ith_vals, facecolor=color, edgecolor=color)
        plt.text(-0.05, y_lower + 0.5 * size_i, str(i))
        y_lower = y_upper + 10

    plt.axvline(x=silhouette_vals.mean(), color="red", linestyle="--")
    plt.xlabel("Silhouette Coefficient Values")
    plt.ylabel("Cluster")
    plt.title(title)
    plt.tight_layout()
    plt.show()

plot_silhouette(X_dense, clusters, "Silhouette - KMeans")

## Agglomerative Clustering  
This section applies Agglomerative Clustering to the embedded ticket representations.  
The cells below use `ward` linkage only; other linkage strategies can be swapped in to see how hierarchical clustering groups the data.

### Clustering Setup  
We use `ward` linkage with Euclidean distance, which tends to form compact, spherical clusters.  
Other options such as `average`, `complete`, or `single` linkage can be tested by changing the `linkage` parameter. In scikit-learn, `ward` works only with the Euclidean metric.
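
As a rough sketch of how such a comparison could look, the snippet below fits each linkage on synthetic blobs (stand-in data, not the ticket embeddings) and records the silhouette score. The variable names are illustrative, and how the linkages rank on toy data says nothing about how they rank on real tickets.

```python
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in data; not the ticket embeddings
X_demo, _ = make_blobs(n_samples=120, centers=3, n_features=5, random_state=0)

linkage_scores = {}
for linkage in ("ward", "average", "complete", "single"):
    # The metric is left at its default (Euclidean), which is what 'ward' needs
    model = AgglomerativeClustering(n_clusters=3, linkage=linkage)
    labels = model.fit_predict(X_demo)
    linkage_scores[linkage] = silhouette_score(X_demo, labels)

for name, score in linkage_scores.items():
    print(f"{name:>8}: silhouette = {score:.3f}")
```

On the notebook's data, the same loop could be run with `X_dense` and `N_CLUSTERS` in place of the toy values.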


In [None]:
print("Running Agglomerative Clustering...")

agglo = AgglomerativeClustering(
    n_clusters=N_CLUSTERS,
    metric='euclidean',  # Required by 'ward'; this parameter was named `affinity` before scikit-learn 1.2
    linkage='ward'       # Options: 'average', 'complete', 'ward', 'single'
)


### Run Model  


In [None]:
agglomerative_clusters = agglo.fit_predict(X_dense)
df["agglo_cluster"] = agglomerative_clusters


### Evaluation with Ground Truth Labels  
If true labels are available (e.g. "Issue Type"), we evaluate cluster alignment using:  
- Adjusted Rand Index (ARI)  
- Silhouette Score  
- Cluster Purity  
We also apply the Hungarian algorithm to remap clusters for label-based metrics.
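
To illustrate two of these metrics, here is a small sketch on invented toy labels (not the ticket data). Purity gives each cluster its majority class and counts the share of samples that agree. ARI compares the two partitions directly, so it does not depend on how the cluster ids are numbered.

```python
import numpy as np
from sklearn.metrics import adjusted_rand_score
from sklearn.metrics.cluster import contingency_matrix

# Toy labels, invented for illustration
y_true = np.array([0, 0, 0, 1, 1, 1])
y_pred = np.array([1, 1, 0, 0, 0, 0])

# Purity: each cluster votes for its majority class
table = contingency_matrix(y_true, y_pred)   # rows = classes, columns = clusters
purity = table.max(axis=0).sum() / table.sum()

# ARI ignores cluster ids: swapping the two ids of the truth still scores 1.0
ari_relabelled = adjusted_rand_score(y_true, 1 - y_true)

print(f"purity = {purity:.4f}")
print(f"ARI for relabelled truth = {ari_relabelled:.1f}")
```

In this toy case, cluster 0 holds three samples of class 1 and one of class 0, and cluster 1 holds two samples of class 0, so purity is 5/6. Purity rises as clusters get smaller, which is why it is reported alongside ARI rather than on its own.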


### Confusion Matrix

In [None]:
if "Issue Type" in df.columns:
    true_labels = df["Issue Type"]
    class_names = sorted(true_labels.unique())
    ari_agglo = adjusted_rand_score(true_labels, agglomerative_clusters)
    sil_agglo = silhouette_score(X_dense, agglomerative_clusters)

    def cluster_purity(y_true, y_pred):
        # Fraction of samples that fall in the majority class of their cluster
        contingency = contingency_matrix(y_true, y_pred)
        return np.sum(np.amax(contingency, axis=0)) / np.sum(contingency)

    purity_agglo = cluster_purity(true_labels, agglomerative_clusters)

    print("\n=== Agglomerative Clustering Evaluation Metrics ===")
    print(f"Adjusted Rand Index (ARI): {ari_agglo:.4f}")
    print(f"Silhouette Score: {sil_agglo:.4f}")
    print(f"Cluster Purity: {purity_agglo:.4f}")

    # Hungarian matching (rows = true classes, columns = cluster ids, both sorted)
    contingency = contingency_matrix(true_labels, agglomerative_clusters)
    row_ind, col_ind = linear_sum_assignment(-contingency)
    # Start from each cluster's majority class, then apply the one-to-one matches,
    # so clusters left unmatched (more clusters than classes) still get a label
    col_to_class = contingency.argmax(axis=0)
    col_to_class[col_ind] = row_ind
    cluster_to_label_map = dict(zip(np.unique(agglomerative_clusters), col_to_class))

    mapped_preds = np.array([cluster_to_label_map[c] for c in agglomerative_clusters])
    y_true_mapped = np.array(true_labels.map({label: i for i, label in enumerate(class_names)}))

    print("\n=== Metrics After Hungarian Mapping ===")
    print(classification_report(y_true_mapped, mapped_preds,
                                labels=list(range(len(class_names))), target_names=class_names))

    # Confusion matrix (named conf_mat so it does not shadow the `matplotlib.cm` import)
    conf_mat = confusion_matrix(y_true_mapped, mapped_preds)
    plt.figure(figsize=(10, 6))
    sns.heatmap(conf_mat, annot=True, fmt="d", cmap="Purples")
    plt.title("Confusion Matrix (Aligned Agglomerative Clusters vs True Labels)")
    plt.xlabel("Predicted")
    plt.ylabel("True")
    plt.tight_layout()
    plt.show()


### Cluster Distribution  
Visualize the number of samples per cluster to detect imbalance or dominance.


In [None]:
plt.figure(figsize=(6, 4)) 
sns.countplot(x="agglo_cluster", data=df, palette="mako") 
plt.title("Number of Samples per Agglomerative Cluster") 
plt.xlabel("Cluster") 
plt.ylabel("Count") 
plt.tight_layout() 
plt.show()


### 2D Projection of Agglomerative Clusters  
Reduce embeddings to 2D using PCA or UMAP and plot clusters for spatial interpretation.
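
A 2D projection can discard much of the structure in a high-dimensional embedding. For the PCA option, one possible sanity check is to read `explained_variance_ratio_` from the fitted reducer, as sketched below on random stand-in data (not the notebook's `X_dense`). UMAP does not expose an explained-variance figure, so this check applies to PCA only.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 20))       # stand-in for X_dense

pca = PCA(n_components=2)
X_demo_2d = pca.fit_transform(X_demo)

# Share of the total variance preserved by the two plotted components
retained = float(pca.explained_variance_ratio_.sum())
print(X_demo_2d.shape, f"variance retained: {retained:.1%}")
```

A low retained share suggests that clusters overlapping in the scatter plot may still be well separated in the original space.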


In [None]:
if VISUALIZE_WITH == "pca": 
    reducer = PCA(n_components=2) 
elif VISUALIZE_WITH == "umap": 
    reducer = umap.UMAP(random_state=42) 
else: 
    raise ValueError("Invalid VISUALIZE_WITH setting")

agglo_2d = reducer.fit_transform(X_dense) 
plt.figure(figsize=(8, 6)) 
plt.scatter(agglo_2d[:, 0], agglo_2d[:, 1], c=agglomerative_clusters, cmap='mako', s=30) 
plt.title("Agglomerative Clustering of Ticket Texts") 
plt.xlabel("Component 1") 
plt.ylabel("Component 2") 
plt.grid(True) 
plt.tight_layout() 
plt.show()


### Additional Visualizations & Metrics  
Includes silhouette scores, word clouds, and intrinsic clustering quality metrics.
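
For reference, the silhouette coefficient of one sample is `(b - a) / max(a, b)`, where `a` is its mean distance to the other members of its own cluster and `b` is its mean distance to the members of the nearest other cluster. The sketch below checks this by hand against `silhouette_samples` on a tiny invented 1-D dataset.

```python
import numpy as np
from sklearn.metrics import silhouette_samples

# Toy 1-D data with two clusters, invented for illustration
X_demo = np.array([[0.0], [1.0], [2.0], [10.0], [11.0]])
labels = np.array([0, 0, 0, 1, 1])

# Manual silhouette for sample 0
point = X_demo[0, 0]
a = np.abs(X_demo[[1, 2], 0] - point).mean()   # mean distance within its own cluster
b = np.abs(X_demo[[3, 4], 0] - point).mean()   # mean distance to the other cluster
manual = (b - a) / max(a, b)

library = silhouette_samples(X_demo, labels)[0]
print(np.isclose(manual, library))
```

Values near 1 mean a sample sits well inside its cluster, values near 0 mean it lies between clusters, and negative values suggest it may be assigned to the wrong cluster.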


In [None]:
plot_silhouette(X_dense, agglomerative_clusters, f"Silhouette - Agglomerative ({EMBEDDING_METHOD})")
plot_wordclouds(df["combined_text"], agglomerative_clusters)
evaluate_intrinsic_metrics(X_dense, agglomerative_clusters, f"Agglomerative ({EMBEDDING_METHOD})")
