### **Overview**

Build a clustering pipeline over the call-center interactions and use it to surface segments that the EDA notebook did not reveal.

**Steps:**
1. Select clustering features
2. Clean + impute missing values
3. Scale numeric columns
4. One-hot encode categorical columns
5. Vectorize topics
6. Combine features
7. Apply clustering (GMM / HDBSCAN)

In [1]:
# pip install hdbscan scikit-learn

In [2]:
import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.mixture import GaussianMixture
import hdbscan

In [3]:
df_copy = pd.read_csv(r"C:\Users\pc\Desktop\Pro_Jets\CC EDA&ML\EDA notebook\df_copy.csv")
df_copy.head()

Unnamed: 0,interaction_id,customer_id,agent_id,interaction_datetime,interaction_duration_seconds,call_direction,call_channel,call_status,customer_satisfaction_score,speech_sentiment_score,...,issue_resolved,follow_up_required,follow_up_due_date,language,customer_feedback_text,agent_notes,call_hour,call_dayofweek,csat_band,hour
0,INT00001,CUST00001,AGT0001,2024-01-01 09:10:13,415,inbound,phone,completed,4.7,0.82,...,True,False,,en,"Thank you for your help, great service.",Customer called regarding password reset. Issu...,9,Monday,4-5,9
1,INT00002,CUST00002,AGT0002,2024-01-01 11:24:50,23,outbound,phone,dropped,,,...,False,False,,en,,Call dropped instantly. No customer response.,11,Monday,,11
2,INT00003,CUST00003,AGT0003,2024-01-01 13:32:05,198,inbound,chat,completed,4.1,0.63,...,True,False,,es,"Gracias, todo bien.",Customer requested recent statements. Provided...,13,Monday,4-5,13
3,INT00004,CUST00001,AGT0004,2024-01-02 10:45:14,37,inbound,phone,abandoned,0.0,-0.95,...,False,False,,en,,Caller disconnected before an agent could answer.,10,Tuesday,,10
4,INT00005,CUST00004,AGT0001,2024-01-02 15:08:55,720,inbound,phone,completed,4.5,0.91,...,True,False,,en,Resolved my issue quickly.,Customer reported card decline online. Walked ...,15,Tuesday,4-5,15


#### **CALL CENTER CLUSTERING PIPELINE**

In [4]:
# --------- FEATURES USED FOR CLUSTERING ---------
num_features = [
    "interaction_duration_seconds",
    "speech_sentiment_score",
    "customer_satisfaction_score",
    "call_hour"
]

cat_features = [
    "call_direction",
    "call_channel",
    "call_dayofweek",
    "issue_resolved",
    "follow_up_required"
]

##### Some missing values were left in place in the EDA notebook; check them now, before they break the pipeline

In [5]:
df_copy[num_features + cat_features].isna().sum()

interaction_duration_seconds     0
speech_sentiment_score          56
customer_satisfaction_score     46
call_hour                        0
call_direction                   0
call_channel                     0
call_dayofweek                   0
issue_resolved                   0
follow_up_required               0
dtype: int64
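
The gaps are limited to the two score columns, and they may not be random: in the preview above, the dropped call has both scores missing. A minimal sketch of how missingness could be broken down by `call_status`, using only the five preview rows as a stand-in for the full frame (the `sample` frame and `missing_by_status` name are illustrative, not part of the notebook):

```python
import numpy as np
import pandas as pd

# Stand-in built from the five rows shown by df_copy.head(); not the full dataset.
sample = pd.DataFrame({
    "call_status": ["completed", "dropped", "completed", "abandoned", "completed"],
    "customer_satisfaction_score": [4.7, np.nan, 4.1, 0.0, 4.5],
    "speech_sentiment_score": [0.82, np.nan, 0.63, -0.95, 0.91],
})

score_cols = ["customer_satisfaction_score", "speech_sentiment_score"]

# Share of missing values per score column, within each call status.
missing_by_status = sample[score_cols].isna().groupby(sample["call_status"]).mean()
print(missing_by_status)
```

If the same pattern holds on `df_copy`, mean imputation would place dropped calls at the average score, which is worth keeping in mind when reading the cluster profiles.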

In [6]:
# --------- PREPROCESSING PIPELINE ---------
preprocessor = ColumnTransformer(
    transformers=[
        # Numeric → mean impute + scale
        ("num", Pipeline([
            ("imputer", SimpleImputer(strategy="mean")),
            ("scaler", StandardScaler())
        ]), num_features),

        # Categorical → fill missing + one-hot encode
        ("cat", Pipeline([
            ("imputer", SimpleImputer(strategy="most_frequent")),
            ("onehot", OneHotEncoder(handle_unknown="ignore"))
        ]), cat_features)
    ]
)

In [None]:
# --------- TRANSFORM DATA ---------
X = preprocessor.fit_transform(df_copy[num_features + cat_features])

# --------- GAUSSIAN MIXTURE MODEL ---------
gmm = GaussianMixture(
    n_components=4,
    covariance_type="full",
    random_state=42
)

gmm_labels = gmm.fit_predict(X)
df_copy["gmm_cluster"] = gmm_labels

# --------- HDBSCAN MODEL ---------
hdb = hdbscan.HDBSCAN(
    min_cluster_size=10,
    min_samples=5,
    metric="euclidean",
    prediction_data=True
).fit(X)

df_copy["hdbscan_cluster"] = hdb.labels_
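
The cell above fixes `n_components=4` without showing how that value was chosen. One common sanity check, not necessarily the one used for this notebook, is to compare the Bayesian information criterion (BIC) across candidate component counts. The sketch below runs on synthetic data (`X_demo` is a made-up stand-in with four well-separated groups), since the real feature matrix is not reproduced here.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Synthetic stand-in: four well-separated 2-D groups of 100 points each.
rng = np.random.default_rng(42)
centers = np.array([[0.0, 0.0], [6.0, 6.0], [0.0, 6.0], [6.0, 0.0]])
X_demo = np.vstack([c + rng.normal(scale=0.6, size=(100, 2)) for c in centers])

# Fit one GMM per candidate k and record its BIC (lower is better).
bic_by_k = {}
for k in range(1, 8):
    model = GaussianMixture(
        n_components=k, covariance_type="full", random_state=42
    ).fit(X_demo)
    bic_by_k[k] = model.bic(X_demo)

best_k = min(bic_by_k, key=bic_by_k.get)
print(best_k)
```

To apply the same check to the notebook, `X_demo` would be replaced by the dense `X` produced by `preprocessor.fit_transform`.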

#### **SECTION 1 — Cluster Sizes**

In [7]:
# --------- CHECK CLUSTER SIZES ---------
print(df_copy["gmm_cluster"].value_counts())
print(df_copy["hdbscan_cluster"].value_counts(dropna=False))

gmm_cluster
2    84
1    56
0    34
3    26
Name: count, dtype: int64
hdbscan_cluster
 2    162
 0     15
-1     12
 1     11
Name: count, dtype: int64




**Summary:** GMM splits the 200 calls into four groups of 26 to 84 calls for broad profiling, while HDBSCAN places 81% of calls (162) in one dominant "normal" cluster, finds two small clusters (15 and 11 calls), and labels the remaining 12 calls (6%) as noise (-1).

**Takeaway:** Use GMM for general call categorization and HDBSCAN to pinpoint anomalous behaviors that require targeted investigation.
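
The 81% share quoted in the summary follows directly from the printed HDBSCAN counts. A small sketch of the arithmetic, with the counts typed in from the output above (`hdb_counts` and `hdb_share` are illustrative names):

```python
import pandas as pd

# Counts copied from the hdbscan_cluster value_counts() output above.
hdb_counts = pd.Series({2: 162, 0: 15, -1: 12, 1: 11}, name="count")

# Share of all calls in each HDBSCAN label, including noise (-1).
hdb_share = hdb_counts / hdb_counts.sum()
print(hdb_share)
```

In the notebook itself, `df_copy["hdbscan_cluster"].value_counts(normalize=True)` gives the same shares without retyping the counts.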

#### **SECTION 2 — GMM Cluster Profiles Results**

In [19]:
# ---- Profile GMM clusters ----
gmm_profile = df_copy.groupby("gmm_cluster").agg({
    "interaction_duration_seconds": "mean",
    "customer_satisfaction_score": "mean",
    "speech_sentiment_score": "mean",
    "issue_resolved": "mean",
    "follow_up_required": "mean"
})

print(gmm_profile)

             interaction_duration_seconds  customer_satisfaction_score  \
gmm_cluster                                                              
0                              501.794118                     4.441176   
1                               50.107143                     0.000000   
2                              253.976190                     3.952381   
3                              539.692308                     1.052000   

             speech_sentiment_score  issue_resolved  follow_up_required  
gmm_cluster                                                              
0                          0.805588             1.0            0.000000  
1                               NaN             0.0            0.000000  
2                          0.606667             1.0            0.000000  
3                         -0.669615             0.0            0.653846  


**Summary:** GMM identifies four operational archetypes: Premium/complex resolutions (C0), abandoned/dropped attempts (C1), routine support (C2), and high-friction escalations (C3).

**Takeaway:** This segmentation separates high-performing service from critical failures, pinpointing exactly where negative sentiment and unresolved issues (C3) require targeted operational intervention.
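
The `NaN` sentiment mean for cluster 1 indicates that no call in that cluster has an observed sentiment score, so its position on that axis comes entirely from imputation. A hedged sketch of how each cluster mean could be reported alongside the amount of data behind it (the `demo` frame is made up for illustration; only the column names match the notebook):

```python
import numpy as np
import pandas as pd

# Made-up stand-in: cluster 1 has no observed sentiment, cluster 3 has half.
demo = pd.DataFrame({
    "gmm_cluster": [0, 0, 1, 1, 3, 3],
    "speech_sentiment_score": [0.82, 0.91, np.nan, np.nan, -0.95, np.nan],
})

grouped = demo.groupby("gmm_cluster")["speech_sentiment_score"]

# Mean plus the number of non-missing values that the mean is based on.
profile = grouped.agg(["mean", "count"])

# Fraction of rows in each cluster with an observed (non-imputed) score.
profile["share_observed"] = (
    demo["speech_sentiment_score"].notna().groupby(demo["gmm_cluster"]).mean()
)
print(profile)
```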

#### **Call Channel Patterns (GMM)**

In [21]:
# Call Channel Patterns (GMM)
df_copy.groupby("gmm_cluster")["call_channel"].value_counts(normalize=True)

gmm_cluster  call_channel
0            video           0.705882
             phone           0.294118
1            phone           1.000000
2            chat            0.630952
             phone           0.369048
3            phone           1.000000
Name: proportion, dtype: float64

**Results Explanation:** Call Channel Patterns (GMM)
- Cluster 0 → 71% video / 29% phone  
- Cluster 1 → 100% phone  
- Cluster 2 → 63% chat / 37% phone  
- Cluster 3 → 100% phone

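**Interpretation**

- Video calls are handled extremely well: they appear only in C0, the high-satisfaction cluster.
- Chat is efficient and successful: it appears only in C2.
- Phone is where most of the pain lives: the short-unhappy (C1) and long-unhappy (C3) clusters are 100% phone.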
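
The proportions above are normalized within each cluster. To judge the phone channel as a whole, the reverse view (how each channel's calls spread across clusters) is also useful. The sketch below uses counts reconstructed by multiplying the printed proportions by the GMM cluster sizes (34, 56, 84, 26) and rounding to whole calls; that reconstruction is an assumption, not notebook output.

```python
import pandas as pd

# Calls per (cluster, channel), reconstructed from printed proportions x cluster sizes.
counts = pd.DataFrame(
    {"chat": [0, 0, 53, 0], "phone": [10, 56, 31, 26], "video": [24, 0, 0, 0]},
    index=pd.Index([0, 1, 2, 3], name="gmm_cluster"),
)

# Normalize down each column: share of a channel's calls that land in each cluster.
share_within_channel = counts / counts.sum(axis=0)
print(share_within_channel.round(3))
```

On these reconstructed counts, about two thirds of phone calls (82 of 123) fall in C1 or C3 and about a quarter in C2. In the notebook, `pd.crosstab(df_copy["gmm_cluster"], df_copy["call_channel"], normalize="columns")` would produce this table directly from the data.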

In [15]:
# Call direction patterns (GMM)
df_copy.groupby("gmm_cluster")["call_direction"].value_counts(normalize=True)

gmm_cluster  call_direction
0            outbound          0.764706
             inbound           0.235294
1            inbound           0.660714
             outbound          0.339286
2            inbound           0.988095
             outbound          0.011905
3            outbound          0.653846
             inbound           0.346154
Name: proportion, dtype: float64

In [16]:
df_copy.groupby("gmm_cluster")["call_hour"].mean()

gmm_cluster
0    13.088235
1    11.946429
2    12.857143
3    11.730769
Name: call_hour, dtype: float64

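**Interpretation:** Mean call hour falls between roughly 11.7 and 13.1 for every cluster. The high-effort, unhappy calls (C3) therefore sit in the mid-day peak-demand window, but so do the other clusters, so call hour does little to separate them.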

#### **SECTION 3 — HDBSCAN Cluster Profiles**

In [17]:
hdb_profile = df_copy.groupby("hdbscan_cluster").agg({
    "interaction_duration_seconds": "mean",
    "customer_satisfaction_score": "mean",
    "speech_sentiment_score": "mean",
    "issue_resolved": "mean",
    "follow_up_required": "mean"
})

print(hdb_profile)

                 interaction_duration_seconds  customer_satisfaction_score  \
hdbscan_cluster                                                              
-1                                 305.583333                      0.29000   
 0                                 693.133333                      1.56000   
 1                                  53.090909                      0.00000   
 2                                 250.524691                      4.09322   

                 speech_sentiment_score  issue_resolved  follow_up_required  
hdbscan_cluster                                                              
-1                            -0.827273        0.000000            0.166667  
 0                            -0.554000        0.000000            1.000000  
 1                                  NaN        0.000000            0.000000  
 2                             0.663983        0.728395            0.000000  


In [18]:
df_copy.groupby("hdbscan_cluster")["call_channel"].value_counts(normalize=True)

hdbscan_cluster  call_channel
-1               phone           1.000000
 0               phone           1.000000
 1               phone           1.000000
 2               phone           0.524691
                 chat            0.327160
                 video           0.148148
Name: proportion, dtype: float64

**Summary:** HDBSCAN distinguishes stable multi-channel operations (C2) from severe phone-based failures, including chronic escalations (C0), abandoned calls (C1), and unique outliers (-1).

**Takeaway:** In this density-based view, chat and video calls appear only in the stable cluster (C2), while every call in the failure groups (C0, C1, and noise) is a phone call, so the phone channel carries all of the most critical unresolved risk.

#### **Section 4 - Final Conclusion**

**Summary:** Both models recover the same broad structure (GMM as four components, HDBSCAN as three clusters plus noise), and both identify the phone channel as the primary source of high-friction, unresolved escalations, in contrast to the more stable chat and video interactions.

**Takeaway:** Operational success depends on isolating HDBSCAN’s outliers for QA and optimizing phone-specific routing to resolve the long-duration, high-emotion cases identified by both algorithms.
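
The claim that the two models agree can be checked rather than read off the profiles. One possible approach, shown here on made-up labels (`gmm_demo` and `hdb_demo` are illustrative, not notebook results), is a cross-tabulation of the two label columns plus the adjusted Rand index from scikit-learn:

```python
import pandas as pd
from sklearn.metrics import adjusted_rand_score

# Made-up label vectors standing in for df_copy["gmm_cluster"] and df_copy["hdbscan_cluster"].
gmm_demo = [0, 0, 0, 1, 1, 2, 2, 3]
hdb_demo = [2, 2, 2, 1, 1, 2, 2, 0]

# Rows: GMM label, columns: HDBSCAN label, cells: number of calls with that pair.
overlap = pd.crosstab(
    pd.Series(gmm_demo, name="gmm_cluster"),
    pd.Series(hdb_demo, name="hdbscan_cluster"),
)
print(overlap)

# Adjusted Rand index: 1.0 for identical partitions, near 0 for chance-level agreement.
# HDBSCAN's noise label (-1), when present, is treated as just another group here.
ari = adjusted_rand_score(gmm_demo, hdb_demo)
print(round(ari, 3))
```

The index ignores label names, so a cluster called 2 in one model and 0 in the other still counts as agreement if the same calls are grouped together.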

In [23]:
# Next step: continue in the NLP notebook