# Clustering Analysis - Customer Segmentation

This notebook segments customers by fitting and comparing four clustering algorithms:
- K-Means
- Hierarchical Clustering
- DBSCAN
- Gaussian Mixture Models (GMM)

## Workflow:
1. Data Loading
2. Exploratory Data Analysis (EDA)
3. Data Preprocessing (One-Hot Encoding + Scaling)
4. Model Training & Evaluation
5. Model Comparison
6. Cluster Interpretation

## 1. Import Libraries

In [None]:
import sys
sys.path.append('..')

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.cluster import KMeans, DBSCAN, AgglomerativeClustering
from sklearn.mixture import GaussianMixture

# Import custom utilities
from utils.entrega1.data_loader import load_data
from utils.entrega1.eda import check_missing_values_viz, plot_distributions_numerical
from utils.entrega1.preprocessing import preprocess_pipeline
from utils.entrega1.modeling import (
    evaluate_clusters_kmeans,
    plot_knn_distance,
    optimize_dbscan_grid,
    evaluate_gmm_bic,
    compare_all_models_silhouette,
    visualize_clusters_pca,
    interpret_clusters
)

# Display settings
pd.set_option('display.max_columns', None)
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette('viridis')

## 2. Load Data

In [None]:
# Load dataset
df = load_data('../data/datos_caso_1.csv')
print(f"Dataset shape: {df.shape}")
df.head()

In [None]:
# Dataset info
df.info()

## 3. Exploratory Data Analysis (EDA)

In [None]:
# Check missing values
check_missing_values_viz(df)

In [None]:
# Numerical distributions
eda_numerical_cols = [
    'Year_Birth', 'Income', 'Recency', 'Kidhome', 'Teenhome',
    'MntWines', 'MntFruits', 'MntMeatProducts', 'MntFishProducts',
    'MntSweetProducts', 'MntGoldProds',
    'NumDealsPurchases', 'NumWebPurchases', 'NumCatalogPurchases',
    'NumStorePurchases', 'NumWebVisitsMonth',
    'Z_CostContact', 'Z_Revenue', 'Response'
]
plot_distributions_numerical(df, eda_numerical_cols)

## 4. Data Preprocessing

The preprocessing pipeline:
1. Handles missing values
2. Removes outliers (atypical values)
3. Creates engineered features (Age, Tenure, Total Purchases, etc.)
4. **One-Hot Encodes** categorical features (Education, Marital_Status)
5. **Scales** numerical features using StandardScaler

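**Important:** Besides the scaled data, the pipeline returns the fitted transformers (`scaler`, `encoder`) and the feature-name lists. Section 10 uses them to map cluster centers back to original units.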

In [None]:
# Preprocess data - returns scaled data, transformers, and feature lists
df_scaled, scaler, encoder, all_feature_cols, num_feature_cols, ohe_feature_cols, categorical_cols = preprocess_pipeline(df)

print(f"Scaled data shape: {df_scaled.shape}")
print(f"Total features: {len(all_feature_cols)}")
print(f"Numerical features: {len(num_feature_cols)}")
print(f"OHE features: {len(ohe_feature_cols)}")
df_scaled.head()

## 5. K-Means Clustering

### 5.1 Determine Optimal K (Elbow + Silhouette)

In [None]:
# Evaluate K-Means for k=2 to k=19 with reference line at k=3
evaluate_clusters_kmeans(df_scaled, range(2, 20), include_silhouette=True, ref_cluster=3, n_init=20)
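`evaluate_clusters_kmeans` is a project utility whose source is not shown here. A minimal sketch of the two quantities such a plot is typically built from, inertia for the elbow curve and the mean silhouette score, is given below on synthetic blobs. The data and the names `X_demo` and `scan` are illustrative assumptions.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with four well-separated groups (illustration only)
X_demo, _ = make_blobs(
    n_samples=400,
    centers=[[-5, -5], [-5, 5], [5, -5], [5, 5]],
    cluster_std=0.6,
    random_state=0,
)

rows = []
for k in range(2, 9):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_demo)
    rows.append({
        "k": k,
        "inertia": model.inertia_,  # within-cluster sum of squares (elbow curve)
        "silhouette": silhouette_score(X_demo, model.labels_),
    })

scan = pd.DataFrame(rows).set_index("k")
print(scan.round(3))
print("k with highest silhouette:", scan["silhouette"].idxmax())
```

Inertia keeps falling as k grows, so it is read for the bend in the curve, while the silhouette score can be compared directly across values of k.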

### 5.2 Fit Final K-Means Model

In [None]:
# Based on elbow/silhouette analysis, choose optimal k (e.g., 6)
optimal_k = 6

kmeans_final = KMeans(n_clusters=optimal_k, n_init=20, random_state=123)
kmeans_final.fit(df_scaled)

print(f"K-Means with k={optimal_k} fitted successfully")
print(f"Inertia: {kmeans_final.inertia_:.2f}")

In [None]:
# Display cluster centers (scaled values)
print("Cluster Centers (Scaled):")
centers_scaled = pd.DataFrame(kmeans_final.cluster_centers_, columns=all_feature_cols)
centers_scaled

### 5.3 Visualize K-Means Clusters (PCA)

In [None]:
labels_kmeans = kmeans_final.labels_
visualize_clusters_pca(df_scaled, labels_kmeans, title='K-Means Clusters (PCA)')

## 6. Hierarchical Clustering

In [None]:
# Fit hierarchical clustering (Ward linkage, the scikit-learn default) with the same number of clusters as K-Means
h_cluster_final = AgglomerativeClustering(n_clusters=optimal_k)
labels_h_clust = h_cluster_final.fit_predict(df_scaled)

print(f"Hierarchical Clustering with {optimal_k} clusters fitted")
print(f"Cluster distribution: {np.bincount(labels_h_clust)}")

In [None]:
# Visualize hierarchical clusters
visualize_clusters_pca(df_scaled, labels_h_clust, title='Hierarchical Clusters (PCA)')

## 7. DBSCAN Clustering

### 7.1 K-NN Distance Plot (Estimate Epsilon)

In [None]:
# Plot k-nearest neighbors distance
plot_knn_distance(df_scaled, k=5)
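The heuristic behind this plot: for every point, compute the distance to its k-th nearest neighbor, sort those distances, and look for the knee where they start rising sharply. That distance is a candidate `eps`. The sketch below computes the sorted curve with scikit-learn's `NearestNeighbors` on synthetic data. It illustrates the idea only and is not the source of `plot_knn_distance`; `kth_neighbor_distances` is a hypothetical helper name.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.neighbors import NearestNeighbors


def kth_neighbor_distances(X, k=5):
    """Return the sorted distance from each point to its k-th nearest neighbor."""
    # n_neighbors=k + 1 because, when querying the fitted data itself,
    # each point comes back as its own nearest neighbor (distance 0)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    distances, _ = nn.kneighbors(X)
    return np.sort(distances[:, -1])


# Synthetic data (illustration only)
X_demo, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.5, random_state=0)
k_dist = kth_neighbor_distances(X_demo, k=5)

# Plotting k_dist against its index gives the usual k-distance curve
print("median 5-NN distance:", round(float(np.median(k_dist)), 3))
print("95th percentile:", round(float(np.percentile(k_dist, 95)), 3))
```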

### 7.2 Grid Search for Optimal Hyperparameters

In [None]:
# Grid search for epsilon and min_samples
eps_values = np.arange(1.25, 1.50, 0.05)
min_samples_values = np.arange(2, 10)

print("Optimizing DBSCAN hyperparameters...")
dbscan_results = optimize_dbscan_grid(df_scaled, eps_values, min_samples_values)

In [None]:
# View best parameters ('Vecindad' is the results column holding min_samples)
best_idx = dbscan_results['Score'].idxmax()
best_params = dbscan_results.loc[best_idx]
print("Best DBSCAN parameters:")
print(f"  Epsilon: {best_params['Epsilon']:.2f}")
print(f"  Neighborhood size (min_samples): {int(best_params['Vecindad'])}")
print(f"  Silhouette Score: {best_params['Score']:.4f}")
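For readers without access to `optimize_dbscan_grid`, the sketch below shows one way such a grid search could be written. It reuses the column names seen above (`Epsilon`, `Vecindad`, `Score`), but the scoring rule is an assumption: noise points are excluded before computing the silhouette, and configurations with fewer than two clusters get `NaN`. The helper name `dbscan_grid` and the synthetic data are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score


def dbscan_grid(X, eps_values, min_samples_values):
    """Hypothetical grid search: one result row per (eps, min_samples) pair."""
    rows = []
    for eps in eps_values:
        for min_samples in min_samples_values:
            labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X)
            clustered = labels != -1  # True for points assigned to a cluster
            n_clusters = len(set(labels[clustered]))
            # The silhouette score is undefined for fewer than 2 clusters
            score = (
                silhouette_score(X[clustered], labels[clustered])
                if n_clusters >= 2
                else np.nan
            )
            rows.append({
                "Epsilon": eps,
                "Vecindad": min_samples,
                "Clusters": n_clusters,
                "Noise": int((~clustered).sum()),
                "Score": score,
            })
    return pd.DataFrame(rows)


# Synthetic data (illustration only)
X_demo, _ = make_blobs(
    n_samples=300, centers=[[-4, 0], [0, 4], [4, 0]], cluster_std=0.5, random_state=0
)
results = dbscan_grid(X_demo, eps_values=[0.3, 0.5, 0.7], min_samples_values=[3, 5, 8])
print(results.sort_values("Score", ascending=False).head())
```

Reporting the cluster count and noise count next to the score helps catch configurations that score well only because most points were discarded as noise.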

### 7.3 Fit Final DBSCAN Model

In [None]:
# Fit DBSCAN with optimal parameters
dbscan_final = DBSCAN(eps=best_params['Epsilon'], min_samples=int(best_params['Vecindad']))
labels_dbscan = dbscan_final.fit_predict(df_scaled)

print("DBSCAN fitted")
print(f"Unique clusters: {np.unique(labels_dbscan)}")
print(f"Cluster distribution: {np.bincount(labels_dbscan[labels_dbscan >= 0])}")
print(f"Noise points: {np.sum(labels_dbscan == -1)}")

In [None]:
# Visualize DBSCAN clusters
visualize_clusters_pca(df_scaled, labels_dbscan, title='DBSCAN Clusters (PCA)')

## 8. Gaussian Mixture Model (GMM)

### 8.1 BIC Analysis for Optimal Components

In [None]:
# Evaluate GMM using BIC across different covariance types
evaluate_gmm_bic(df_scaled, range(2, 30), covariance_types=['spherical', 'tied', 'diag', 'full'])
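The BIC scan can be reproduced in a few lines with `GaussianMixture.bic`. The sketch below is an illustration on synthetic data, not the source of `evaluate_gmm_bic`; the names `X_demo` and `bic_table` and the reduced component range are assumptions chosen to keep it fast.

```python
import pandas as pd
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

# Synthetic data with three groups (illustration only)
X_demo, _ = make_blobs(
    n_samples=400, centers=[[-5, -5], [0, 5], [5, -5]], cluster_std=0.8, random_state=0
)

rows = []
for cov_type in ["spherical", "tied", "diag", "full"]:
    for n in range(1, 7):
        gmm = GaussianMixture(
            n_components=n, covariance_type=cov_type, random_state=0
        ).fit(X_demo)
        rows.append({
            "covariance_type": cov_type,
            "n_components": n,
            "bic": gmm.bic(X_demo),
        })

bic_table = pd.DataFrame(rows)
best = bic_table.loc[bic_table["bic"].idxmin()]  # lower BIC is better
print(best)
```

BIC penalizes the number of free parameters, so a `full` covariance model has to fit noticeably better than a `diag` or `spherical` one to win.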

### 8.2 Fit Final GMM Model

In [None]:
# Based on the BIC plot, choose the configuration (lower BIC is better)
gmm_n_components = 20
gmm_cov_type = 'diag'

gmm_final = GaussianMixture(n_components=gmm_n_components, covariance_type=gmm_cov_type, random_state=123)
gmm_final.fit(df_scaled)
labels_gmm = gmm_final.predict(df_scaled)

print(f"GMM with {gmm_n_components} components ({gmm_cov_type} covariance) fitted")
print(f"BIC: {gmm_final.bic(df_scaled):.2f}")
print(f"AIC: {gmm_final.aic(df_scaled):.2f}")

In [None]:
# Visualize GMM clusters
visualize_clusters_pca(df_scaled, labels_gmm, title='GMM Clusters (PCA)')

## 9. Model Comparison

### 9.1 Silhouette Score Comparison

In [None]:
# Compare all models using Silhouette score
labels_dict = {
    'KMeans': labels_kmeans,
    'Hierarchical': labels_h_clust,
    'DBSCAN': labels_dbscan,
    'GMM': labels_gmm
}

scores = compare_all_models_silhouette(df_scaled, labels_dict)
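One caveat when comparing these label sets: DBSCAN marks noise as `-1`, and `silhouette_score` would treat that label as an ordinary cluster. Whether `compare_all_models_silhouette` handles this is not visible here. The sketch below shows one possible convention, dropping noise points before scoring and reporting the noise fraction alongside; `silhouette_table` and the synthetic data are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score


def silhouette_table(X, labels_by_model):
    """Hypothetical helper: silhouette per model, ignoring noise labels (-1)."""
    rows = []
    for name, labels in labels_by_model.items():
        labels = np.asarray(labels)
        keep = labels != -1
        n_clusters = len(np.unique(labels[keep]))
        # The silhouette score needs at least 2 clusters
        score = silhouette_score(X[keep], labels[keep]) if n_clusters >= 2 else np.nan
        rows.append({
            "model": name,
            "n_clusters": n_clusters,
            "noise_fraction": 1 - keep.mean(),
            "silhouette": score,
        })
    table = pd.DataFrame(rows).set_index("model")
    return table.sort_values("silhouette", ascending=False)


# Synthetic data (illustration only)
X_demo, _ = make_blobs(
    n_samples=300, centers=[[-4, 0], [0, 4], [4, 0]], cluster_std=0.5, random_state=0
)
demo_labels = {
    "KMeans": KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_demo),
    "DBSCAN": DBSCAN(eps=0.5, min_samples=5).fit_predict(X_demo),
}
comparison = silhouette_table(X_demo, demo_labels)
print(comparison.round(3))
```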

### 9.2 Select Best Model

Based on the Silhouette scores above, select the best-performing model for interpretation. Scores range from -1 to 1; higher values indicate more compact, better-separated clusters.

In [None]:
# Select best model (example: KMeans)
best_model_name = 'KMeans'
best_labels = labels_kmeans

print(f"Selected model: {best_model_name}")
print(f"Number of clusters: {len(np.unique(best_labels))}")

## 10. Cluster Interpretation

### 10.1 Inverse Transform Cluster Centers

Convert scaled cluster centers back to original units for interpretation.

In [None]:
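# Get cluster centers (scaled), labeled with all_feature_cols.
# These are the K-Means centers, so this step applies only when K-Means is the selected model.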
centers_scaled = pd.DataFrame(kmeans_final.cluster_centers_, columns=all_feature_cols)

print(f"Cluster centers shape: {centers_scaled.shape}")
print(f"Numerical columns: {len(num_feature_cols)}")
print(f"OHE columns: {len(ohe_feature_cols)}")

In [None]:
# Inverse transform numerical features (scaler was fitted only on these)
centers_num_inverse = pd.DataFrame(
    scaler.inverse_transform(centers_scaled[num_feature_cols]),
    columns=num_feature_cols
)

# Inverse transform categorical features. Center values are fractional;
# assuming encoder is scikit-learn's OneHotEncoder, each feature resolves
# to the category with the largest one-hot value
centers_cat_inverse = pd.DataFrame(
    encoder.inverse_transform(centers_scaled[ohe_feature_cols]),
    columns=categorical_cols
)

# Combine column-wise so numerical columns keep a numeric dtype
transformed_centers = pd.concat([centers_num_inverse, centers_cat_inverse], axis=1)

print("\nCluster Centers (Original Scale):")
transformed_centers
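A toy example may make the inverse step easier to follow. It assumes the transformers are scikit-learn's `StandardScaler` and `OneHotEncoder` (the notebook does not show how `preprocess_pipeline` builds them), and all data and names ending in `_demo` are made up. A K-Means center holds the mean z-score for numerical features and the share of each category for one-hot features.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy data (made up): one numerical and one categorical feature
toy = pd.DataFrame({
    "Income": [20000.0, 30000.0, 70000.0, 80000.0],
    "Education": ["Basic", "Basic", "PhD", "Master"],
})

income_values = toy[["Income"]].to_numpy()
scaler_demo = StandardScaler().fit(income_values)
encoder_demo = OneHotEncoder().fit(toy[["Education"]])

# A made-up cluster center: the income part is the mean z-score of the last
# two rows; the one-hot part holds category shares
income_z = scaler_demo.transform(income_values)[2:].mean()
ohe_center_demo = np.array([[0.2, 0.1, 0.7]])  # columns: Basic, Master, PhD

# Scaling is linear, so inverting the mean z-score gives the mean income
income_original = scaler_demo.inverse_transform([[income_z]])
# inverse_transform picks the category with the largest value per feature
education_mode = encoder_demo.inverse_transform(ohe_center_demo)

print(income_original[0, 0])
print(education_mode[0, 0])
```

The categorical value recovered this way is only the dominant category of the cluster; the shares themselves (here 0.2, 0.1, 0.7) carry more detail and are worth inspecting in `centers_scaled` when categories are close.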

### 10.2 Cluster Statistics

In [None]:
# Add cluster labels to scaled data
df_with_clusters = df_scaled.copy()
df_with_clusters['Cluster'] = best_labels

# Cluster size distribution
cluster_sizes = df_with_clusters['Cluster'].value_counts().sort_index()
print("Cluster Sizes:")
print(cluster_sizes)

# Visualize cluster sizes
plt.figure(figsize=(10, 6))
cluster_sizes.plot(kind='bar', color='skyblue', edgecolor='black')
plt.title('Cluster Size Distribution')
plt.xlabel('Cluster')
plt.ylabel('Number of Customers')
plt.xticks(rotation=0)
plt.grid(axis='y', alpha=0.3)
plt.show()

### 10.3 Cluster Profiles

Interpret each cluster based on the transformed centers.

In [None]:
# Display key characteristics for each cluster
key_features = ['Age', 'Income', 'Total_Mnt', 'Total_Num_Purchases', 'Recency', 'Education', 'Marital_Status']
available_features = [f for f in key_features if f in transformed_centers.columns]

print("\nKey Cluster Characteristics:")
transformed_centers[available_features]

## 11. Conclusions

Summary of findings:
1. **Best Model**: [Based on Silhouette scores]
2. **Number of Clusters**: [Optimal k]
3. **Key Insights**: [Describe main customer segments]

### Next Steps:
- Deploy segmentation model
- Create targeted marketing strategies for each segment
- Monitor cluster stability over time
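
One possible way to quantify cluster stability is to refit the model on resampled data and compare the resulting labels with the reference assignment using the adjusted Rand index (ARI). The sketch below does this for K-Means on synthetic data. The bootstrap scheme, the number of repetitions, and all variable names are assumptions, not part of this notebook's utilities.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

# Synthetic data with four well-separated groups (illustration only)
X_demo, _ = make_blobs(
    n_samples=300,
    centers=[[-5, -5], [-5, 5], [5, -5], [5, 5]],
    cluster_std=0.6,
    random_state=0,
)
reference = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X_demo)

rng = np.random.default_rng(0)
ari_scores = []
for _ in range(10):
    # Refit on a bootstrap resample, then label the full data with the new model
    sample_idx = rng.choice(len(X_demo), size=len(X_demo), replace=True)
    refit = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X_demo[sample_idx])
    ari_scores.append(adjusted_rand_score(reference.labels_, refit.predict(X_demo)))

print("mean ARI:", round(float(np.mean(ari_scores)), 3))
```

ARI is invariant to how cluster ids are numbered, so values near 1 indicate that the same groups are recovered even when the ids change between fits.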