### 1. Environment Setup & Data Sanitization
Before vectorizing the text, we pass it through our TextSanitizer. This ensures that the embeddings reflect the "intent" of each call rather than being biased by specific personally identifiable information (PII) such as customer names or account IDs.
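The internals of TextSanitizer are not shown in this notebook. As a rough illustration of the idea, a minimal regex-based redaction pass might look like the sketch below. The patterns, the placeholder tokens, and the function name redact_pii_sketch are all assumptions made for this example and are not the project's implementation; redacting personal names usually needs a named-entity model rather than regular expressions.

```python
import re

# Hypothetical patterns for illustration only; real PII rules depend on the data.
PII_PATTERNS = [
    (re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"), "[EMAIL]"),
    (re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"), "[PHONE]"),
    (re.compile(r"\b\d{8,}\b"), "[ACCOUNT_ID]"),
]


def redact_pii_sketch(text: str) -> str:
    """Replace each matched PII span with a generic placeholder token."""
    for pattern, placeholder in PII_PATTERNS:
        text = pattern.sub(placeholder, text)
    return text


sample = (
    "My account 1234567890 is locked, email me at "
    "jane.doe@example.com or call 555-123-4567."
)
print(redact_pii_sketch(sample))
```

Replacing values with generic tokens, rather than deleting them, keeps the sentence structure intact, so the embedding model still sees that an account or contact detail was mentioned.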

In [1]:
# Standard library
import os
import sys

# Third-party
import hdbscan
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import umap

# Project modules (require the repository root on sys.path)
from src.preprocessing.cleaner import TextSanitizer
from src.features.embeddings import VectorEngine


In [2]:
import sys
sys.path[0]

'C:\\Program Files\\Python312\\python312.zip'

In [3]:
# Initialize our custom modules
sanitizer = TextSanitizer()
vector_engine = VectorEngine()

# processed_df must already be loaded with a 'clean_text' column;
# that step is not shown in this notebook.
# Apply PII redaction first, then general transcript cleaning
print("Redacting PII and cleaning text...")
processed_df['sanitized_text'] = (
    processed_df['clean_text']
    .apply(sanitizer.redact_pii)
    .apply(sanitizer.clean_transcript)
)

processed_df[['clean_text', 'sanitized_text']].head()

OSError: [E050] Can't find model 'en_core_web_sm'. It doesn't seem to be a Python package or a valid path to a data directory.
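
The E050 error above is raised by spaCy when the en_core_web_sm model package is not installed in the active environment. It is usually resolved by running `python -m spacy download en_core_web_sm` and restarting the kernel. A small pre-flight check, sketched below, can surface the problem before the sanitizer runs. The helper name model_is_installed is hypothetical, and the sketch assumes TextSanitizer loads the model by its package name.

```python
import importlib.util


def model_is_installed(package_name: str) -> bool:
    """Return True if a top-level package with this name can be imported."""
    return importlib.util.find_spec(package_name) is not None


MODEL_NAME = "en_core_web_sm"  # the model named in the error message

if not model_is_installed(MODEL_NAME):
    print(
        "Missing spaCy model. Install it with: "
        f"python -m spacy download {MODEL_NAME}"
    )
```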

### 2. High-Dimensional Vectorization using Transformers
We now convert the sanitized text into 768-dimensional dense vectors using the all-mpnet-base-v2 transformer. These embeddings capture semantic meaning, mapping phrases such as "I want to leave" and "Cancel my subscription" to nearby points in the vector space.
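
How close two embeddings are is commonly measured with cosine similarity. The sketch below illustrates the calculation on made-up 3-dimensional vectors that stand in for real 768-dimensional embeddings. The numbers and the third example phrase are invented for illustration and were not produced by the model.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up stand-ins for real embeddings (illustration only).
leave = [0.9, 0.1, 0.2]    # "I want to leave"
cancel = [0.8, 0.2, 0.1]   # "Cancel my subscription"
billing = [0.1, 0.9, 0.3]  # "Why was I charged twice?"

print(f"leave vs cancel:  {cosine_similarity(leave, cancel):.3f}")
print(f"leave vs billing: {cosine_similarity(leave, billing):.3f}")
```

With real embeddings, two phrases that share an intent should score closer to 1 than two phrases about unrelated topics.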

In [None]:
# Generate embeddings from the sanitized transcripts
embeddings = vector_engine.generate_embeddings(processed_df['sanitized_text'].tolist())

print(f"Embedding Matrix Shape: {embeddings.shape}")

# Save for later use to avoid re-computing; create the cache directory if needed
os.makedirs('data/embeddings', exist_ok=True)
np.save('data/embeddings/transcript_embeddings.npy', embeddings)
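
Saving the array only pays off if later runs load it back. One possible load-or-compute pattern is sketched below. The helper name load_or_compute_embeddings and its parameters are hypothetical, and the sketch assumes the embedding function returns something convertible to a NumPy array.

```python
import os

import numpy as np


def load_or_compute_embeddings(texts, compute_fn, cache_path):
    """Return cached embeddings if present; otherwise compute and cache them.

    cache_path should end in '.npy', because np.save appends that
    extension when it is missing.
    """
    if os.path.exists(cache_path):
        return np.load(cache_path)

    embeddings = np.asarray(compute_fn(texts))
    # Create the parent directory on first use.
    os.makedirs(os.path.dirname(cache_path) or ".", exist_ok=True)
    np.save(cache_path, embeddings)
    return embeddings
```

In this notebook it could be called as load_or_compute_embeddings(processed_df['sanitized_text'].tolist(), vector_engine.generate_embeddings, 'data/embeddings/transcript_embeddings.npy'). Note that this simple cache does not detect changes to the input texts, so the file must be deleted whenever the transcripts or the sanitizer change.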

### 3. Dimensionality Reduction for Cluster Stability (UMAP)
Density-based clustering algorithms like HDBSCAN struggle with the "curse of dimensionality" in 768-dimensional space, where distances between points become nearly uniform and dense regions are hard to tell apart. We use UMAP (Uniform Manifold Approximation and Projection) to compress our embeddings into 5 to 10 dimensions (5 in the cell below), preserving the "local neighborhoods" of similar calls while making density differences visible to the clustering algorithm.
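
The effect of dimensionality on distances can be seen in a small experiment. The sketch below uses uniformly random synthetic points, not the call embeddings, and measures how much farther the farthest point is than the nearest one from a random query. The function name relative_contrast and its defaults are choices made for this illustration.

```python
import math
import random


def relative_contrast(n_dims, n_points=200, seed=0):
    """(farthest - nearest) / nearest distance from a random query point."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(n_dims)]
    distances = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(n_dims)]
        distances.append(math.dist(query, point))
    return (max(distances) - min(distances)) / min(distances)


for dims in (2, 5, 768):
    print(f"{dims:>3}D relative contrast: {relative_contrast(dims):.2f}")
```

On this synthetic data the contrast is much smaller in 768 dimensions than in 2 or 5, which is the behavior that makes density estimates unreliable. Real embeddings have more structure than uniform noise, so the effect is typically less extreme in practice.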

In [None]:
# Reduce dimensions for better clustering performance
reducer = umap.UMAP(
    n_neighbors=15, 
    n_components=5, 
    metric='cosine', 
    random_state=42
)

umap_embeddings = reducer.fit_transform(embeddings)
print(f"Reduced Embeddings Shape: {umap_embeddings.shape}")

### 4. Latent Intent Discovery with HDBSCAN
Unlike K-Means, HDBSCAN does not require us to choose the number of clusters in advance; we set only a minimum cluster size. It finds "islands of high density" in the data, and any call that is too unusual to fit a pattern is labeled -1 (noise), which is useful for flagging one-off edge cases that need human review.

In [None]:
# Configure HDBSCAN
clusterer = hdbscan.HDBSCAN(
    min_cluster_size=15, 
    min_samples=5, 
    metric='euclidean', 
    cluster_selection_method='eom'
)

processed_df['cluster_id'] = clusterer.fit_predict(umap_embeddings)

# Check cluster distribution
print(processed_df['cluster_id'].value_counts())
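
Because -1 marks noise rather than a real cluster, it helps to report it separately from the cluster counts. A minimal sketch follows; the helper name summarize_labels and the example labels are made up for illustration.

```python
from collections import Counter


def summarize_labels(labels):
    """Count real clusters and noise points in a list of cluster labels."""
    counts = Counter(labels)
    n_noise = counts.pop(-1, 0)  # -1 is the noise label
    total = n_noise + sum(counts.values())
    return {
        "n_clusters": len(counts),
        "n_noise": n_noise,
        "noise_fraction": n_noise / total if total else 0.0,
    }


example_labels = [0, 0, 1, -1, 1, 0, -1, 2]  # made-up labels
print(summarize_labels(example_labels))
```

For the notebook's data this could be called as summarize_labels(processed_df['cluster_id'].tolist()). A very high noise fraction is often a hint that min_cluster_size or min_samples is too strict for the dataset.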

### 5. Visualizing the Call Archetypes
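To make our findings "stakeholder-ready," we project the embeddings into 2D space and color each call by its cluster. This visualization allows managers to see the "thematic clusters" of their call center, for example where Billing issues end and Technical Support begins. The 2D projection is for display only; the cluster labels come from the 5-dimensional reduction above.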

In [None]:
# Reduce to 2D for visualization only (same cosine metric as the clustering reducer)
viz_reducer = umap.UMAP(n_components=2, metric='cosine', random_state=42)
viz_embeddings = viz_reducer.fit_transform(embeddings)

plt.figure(figsize=(12, 8))
scatter = plt.scatter(
    viz_embeddings[:, 0], 
    viz_embeddings[:, 1], 
    c=processed_df['cluster_id'], 
    cmap='Spectral', 
    s=50, 
    alpha=0.6
)
plt.colorbar(scatter, label='Cluster ID')
plt.title('Call Center Intent Archetypes (UMAP Projection)')
plt.show()

### 6. Archetype Interpretation: Mapping Clusters to Strategy
The final step is translating Cluster IDs back into business terms. We examine the top words and average metrics (Talk Ratio, CSAT) for each cluster to name them (e.g., "The Churn Risk Group").

In [None]:
# Grouping by cluster to see behavioral signatures
cluster_profile = processed_df.groupby('cluster_id').agg({
    'talk_ratio': 'mean',
    'csat_score': 'mean',
    'duration_sec': 'mean',
    'escalated': 'mean',
    'clean_text': 'count'
}).rename(columns={'clean_text': 'volume'})

print("--- Cluster Behavioral Profiles ---")
cluster_profile.sort_values(by='csat_score')
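
The profile above covers the average metrics but not the top words per cluster. One simple way to get them is a per-cluster word count, sketched below. The function name top_words_per_cluster, the tiny stop-word list, and the sample texts are assumptions for illustration; a fuller analysis might use TF-IDF weighting instead of raw counts.

```python
import re
from collections import Counter, defaultdict

# Small illustrative stop-word list; a real one would be much longer.
STOP_WORDS = {
    "the", "a", "an", "i", "my", "to", "is", "it", "and", "of", "was", "for", "me",
}


def top_words_per_cluster(texts, labels, n=3):
    """Return the n most frequent non-stop-words for each cluster label."""
    counters = defaultdict(Counter)
    for text, label in zip(texts, labels):
        tokens = re.findall(r"[a-z']+", text.lower())
        counters[label].update(t for t in tokens if t not in STOP_WORDS)
    return {
        label: [word for word, _ in counter.most_common(n)]
        for label, counter in counters.items()
    }


# Made-up transcripts and labels for demonstration.
texts = [
    "I want to cancel my subscription",
    "Please cancel the subscription today",
    "My bill was charged twice",
    "The bill is wrong, I was charged extra",
]
labels = [0, 0, 1, 1]
print(top_words_per_cluster(texts, labels, n=2))
```

With the notebook's data, the inputs would be processed_df['sanitized_text'] and processed_df['cluster_id'].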

Production Pipeline: The notebook implements a multi-stage NLP pipeline (Clean -> Embed -> Reduce -> Cluster).

Advanced Architecture: It combines transformer embeddings with density-based clustering (UMAP followed by HDBSCAN).

Scalability: Embeddings are saved to disk so that later runs can reuse them instead of recomputing them.