# 🚀 Customer Clustering - Model Training (RAPIDS GPU)

This notebook trains multiple clustering models using **RAPIDS cuML** for GPU acceleration where available.

**Dataset Context:** Customer segmentation for marketing and business strategy

## Models to Train:
1. **K-Means Clustering (GPU)** - cuML GPU-accelerated
2. **DBSCAN (GPU)** - cuML density-based clustering
3. **Agglomerative Hierarchical Clustering (CPU)** - sklearn (cuML offers only single linkage, so Ward linkage runs on CPU)
4. **Spectral Clustering (CPU)** - sklearn (no GPU version)
5. **Gaussian Mixture Model (CPU)** - sklearn (no GPU version)
6. **Mini-Batch K-Means (CPU)** - sklearn (no GPU version)

## GPU Acceleration:
- **RAPIDS cuML** for K-Means and DBSCAN (roughly 10-50x faster on large datasets; the actual speedup depends on data size and hardware)
- **CPU fallback** for algorithms without GPU support
- **Automatic detection** of GPU availability
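
The detect-then-fall-back pattern listed above can be sketched with the standard library alone. This is a minimal illustration, not the notebook's own check: `pick_backend` is a hypothetical helper, and finding the `cuml` package only shows that it is installed, not that a usable GPU is present.

```python
import importlib.util


def pick_backend(gpu_module: str = "cuml") -> str:
    """Return 'gpu' if the named module can be located, else 'cpu'.

    Hypothetical helper: it only checks that the package is importable,
    so a real notebook should still query the CUDA runtime afterwards.
    """
    try:
        found = importlib.util.find_spec(gpu_module) is not None
    except (ImportError, ValueError):
        # Raised for broken parent packages or malformed module names
        found = False
    return "gpu" if found else "cpu"


backend = pick_backend()
print(f"Selected backend: {backend}")
```

Checking for the package before importing it keeps the CPU path free of import-time side effects; the import cell below goes further and asks the CUDA runtime how many devices are visible.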

## Evaluation Metrics:
- **Silhouette Score** - Cluster cohesion and separation (-1 to 1, higher is better)
- **Calinski-Harabasz Index** - Variance ratio (higher is better)
- **Davies-Bouldin Index** - Average similarity between clusters (lower is better)
- **Inertia** - Sum of squared distances to cluster centers (K-Means only)
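
To make two of these definitions concrete, here is a small pure-Python sketch that computes inertia and the mean silhouette score by hand on a four-point toy dataset. The helper names (`centroid`, `inertia`, `mean_silhouette`) and the toy data are illustrative assumptions; the notebook itself relies on scikit-learn's implementations.

```python
import math


def centroid(points):
    """Coordinate-wise mean of a list of equal-length tuples."""
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(len(points[0])))


def inertia(points, labels):
    """Sum of squared distances from each point to its cluster centroid."""
    total = 0.0
    for cluster in set(labels):
        members = [p for p, lab in zip(points, labels) if lab == cluster]
        center = centroid(members)
        total += sum(math.dist(p, center) ** 2 for p in members)
    return total


def mean_silhouette(points, labels):
    """Mean of (b - a) / max(a, b) over all points (needs >= 2 clusters).

    a = mean distance to the other members of the point's own cluster,
    b = smallest mean distance to the members of any other cluster.
    Points in singleton clusters score 0.
    """
    clusters = set(labels)
    scores = []
    for i, (p, own) in enumerate(zip(points, labels)):
        same = [q for j, (q, lab) in enumerate(zip(points, labels))
                if lab == own and j != i]
        if not same:
            scores.append(0.0)
            continue
        a = sum(math.dist(p, q) for q in same) / len(same)
        b = min(
            sum(math.dist(p, q) for q, lab in zip(points, labels) if lab == other)
            / labels.count(other)
            for other in clusters if other != own
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)


# Two tight, well-separated clusters (toy data for illustration only)
toy_points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0)]
toy_labels = [0, 0, 1, 1]
print(f"inertia    = {inertia(toy_points, toy_labels):.3f}")
print(f"silhouette = {mean_silhouette(toy_points, toy_labels):.3f}")
```

Because the two toy clusters are compact and far apart, the silhouette lands close to 1 and the inertia is small, which is the pattern the metrics above reward.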

In [None]:
# Import required libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import pickle
import warnings
from datetime import datetime
import time

print("="*80)
print("GPU AVAILABILITY CHECK")
print("="*80)

# Check GPU availability and import RAPIDS
try:
    import cupy as cp
    import cudf
    from cuml.cluster import KMeans as cuKMeans
    from cuml.cluster import DBSCAN as cuDBSCAN
    from cuml.metrics import silhouette_score as cu_silhouette_score

    gpu_count = cp.cuda.runtime.getDeviceCount()
    # RAPIDS installed but no visible GPU still means CPU mode
    rapids_available = gpu_count > 0
    print("✓ RAPIDS cuML available")
    print(f"✓ GPUs available: {gpu_count}")

    if gpu_count > 0:
        props = cp.cuda.runtime.getDeviceProperties(0)
        gpu_name = props['name'].decode()
        gpu_mem = props['totalGlobalMem'] / 1e9
        print(f"✓ GPU 0: {gpu_name}")
        print(f"✓ GPU Memory: {gpu_mem:.1f} GB")

except Exception as exc:
    # ImportError when RAPIDS is missing; CUDA runtime errors when no driver/GPU is present
    rapids_available = False
    print(f"❌ RAPIDS not available ({type(exc).__name__})")
    print("\n📦 Installation: conda install -c rapidsai -c conda-forge -c nvidia rapids")
    print("\nFalling back to CPU clustering with scikit-learn...")

# Standard sklearn imports (for CPU fallback and non-GPU models)
from sklearn.cluster import AgglomerativeClustering, SpectralClustering, MiniBatchKMeans
from sklearn.cluster import KMeans as skKMeans
from sklearn.cluster import DBSCAN as skDBSCAN
from sklearn.mixture import GaussianMixture

# Evaluation metrics
from sklearn.metrics import silhouette_score, calinski_harabasz_score, davies_bouldin_score

# Dimensionality reduction for visualization
from sklearn.decomposition import PCA

warnings.filterwarnings('ignore')
plt.style.use('seaborn-v0_8-darkgrid')

print("\n✓ All libraries imported successfully!")
print(f"🚀 GPU Acceleration: {'ENABLED' if rapids_available else 'DISABLED (CPU mode)'}")
print("="*80)

## Load Processed Data

In [None]:
# Load the scaled dataset
print("Loading processed clustering data...\n")

if rapids_available:
    # Load with cuDF for GPU
    df = cudf.read_csv('clustering_scaled_standard.csv')
    print("✓ Data loaded with cuDF (GPU)")
else:
    # Load with pandas for CPU
    df = pd.read_csv('clustering_scaled_standard.csv')
    print("✓ Data loaded with pandas (CPU)")

print(f"\nDataset shape: {df.shape}")
print(f"Features: {df.shape[1]}")
print(f"Samples: {df.shape[0]:,}")
print(f"\nFirst few rows:")
df.head()

## 🔍 Determine Optimal Number of Clusters

We'll use multiple methods to find the optimal k:
1. **Elbow Method** - Find the "elbow" in inertia curve
2. **Silhouette Analysis** - Maximize silhouette score

**Note:** When RAPIDS is available, the search uses GPU-accelerated K-Means; otherwise it falls back to scikit-learn on the CPU.
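
As a rough sketch of how the two methods turn a sweep over k into a single choice, the snippet below picks the k with the highest silhouette score and, separately, the k where the inertia curve bends most sharply (largest second difference). The function names and the scores are made up for illustration; the notebook itself sets `optimal_k` from the silhouette score and leaves the elbow to visual inspection.

```python
def best_k_by_silhouette(ks, silhouettes):
    """Return the k whose silhouette score is largest."""
    return max(zip(ks, silhouettes), key=lambda pair: pair[1])[0]


def elbow_k(ks, inertias):
    """Return the k with the largest second difference of inertia.

    The second difference measures how sharply the curve bends at k:
    (drop going into k) - (drop going out of k).
    The first and last k have no second difference and are skipped.
    """
    bends = {
        ks[i]: (inertias[i - 1] - inertias[i]) - (inertias[i] - inertias[i + 1])
        for i in range(1, len(ks) - 1)
    }
    return max(bends, key=bends.get)


# Made-up scores for k = 2..7, for illustration only
ks = [2, 3, 4, 5, 6, 7]
example_inertias = [1000.0, 600.0, 300.0, 250.0, 220.0, 200.0]
example_silhouettes = [0.40, 0.48, 0.55, 0.50, 0.45, 0.41]

print("elbow k:     ", elbow_k(ks, example_inertias))
print("silhouette k:", best_k_by_silhouette(ks, example_silhouettes))
```

The second-difference rule is only one of several elbow heuristics, and on real data the two methods can disagree, which is why the notebook also plots Calinski-Harabasz and Davies-Bouldin for each k.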

In [None]:
# Prepare data
if rapids_available:
    X = df.values  # cuDF to cupy array
    print(f"Data type: CuPy array (GPU)")
else:
    X = df.values  # pandas to numpy array
    print(f"Data type: NumPy array (CPU)")

print(f"Data shape: {X.shape}")
print(f"Features: {X.shape[1]}")
print(f"Samples: {X.shape[0]:,}")

In [None]:
# Elbow Method and Silhouette Analysis (GPU-accelerated)
print("="*80)
print("FINDING OPTIMAL NUMBER OF CLUSTERS")
if rapids_available:
    print("Using GPU-accelerated K-Means for faster analysis! 🚀")
print("="*80)

k_range = range(2, 11)
inertias = []
silhouette_scores = []
calinski_scores = []
davies_bouldin_scores = []

for k in k_range:
    print(f"\nTesting k={k}...")
    
    if rapids_available:
        # GPU K-Means
        kmeans = cuKMeans(n_clusters=k, random_state=42, max_iter=300)
        labels = kmeans.fit_predict(X)
        
        # Convert to CPU for metrics calculation (labels may be a CuPy array or a cuDF Series)
        labels_cpu = cp.asnumpy(labels) if isinstance(labels, cp.ndarray) else labels.to_numpy()
        X_cpu = cp.asnumpy(X) if isinstance(X, cp.ndarray) else X.to_numpy()
        
        inertia = float(kmeans.inertia_)
    else:
        # CPU K-Means
        kmeans = skKMeans(n_clusters=k, random_state=42, n_init=10)
        labels_cpu = kmeans.fit_predict(X)
        X_cpu = X
        inertia = kmeans.inertia_
    
    # Calculate metrics (on CPU)
    silhouette = silhouette_score(X_cpu, labels_cpu)
    calinski = calinski_harabasz_score(X_cpu, labels_cpu)
    davies_bouldin = davies_bouldin_score(X_cpu, labels_cpu)
    
    inertias.append(inertia)
    silhouette_scores.append(silhouette)
    calinski_scores.append(calinski)
    davies_bouldin_scores.append(davies_bouldin)
    
    print(f"  Inertia: {inertia:.2f}")
    print(f"  Silhouette: {silhouette:.4f}")
    print(f"  Calinski-Harabasz: {calinski:.2f}")
    print(f"  Davies-Bouldin: {davies_bouldin:.4f}")

print("\n✓ Cluster analysis complete!")

In [None]:
# Plot results
fig, axes = plt.subplots(2, 2, figsize=(16, 12))

# 1. Elbow Method
axes[0, 0].plot(k_range, inertias, 'bo-', linewidth=2, markersize=8)
axes[0, 0].set_xlabel('Number of Clusters (k)', fontweight='bold')
axes[0, 0].set_ylabel('Inertia (Within-cluster sum of squares)', fontweight='bold')
axes[0, 0].set_title('Elbow Method (GPU-Accelerated)', fontsize=14, fontweight='bold')
axes[0, 0].grid(alpha=0.3)
axes[0, 0].set_xticks(k_range)

# 2. Silhouette Score
axes[0, 1].plot(k_range, silhouette_scores, 'go-', linewidth=2, markersize=8)
axes[0, 1].set_xlabel('Number of Clusters (k)', fontweight='bold')
axes[0, 1].set_ylabel('Silhouette Score', fontweight='bold')
axes[0, 1].set_title('Silhouette Analysis (Higher is Better)', fontsize=14, fontweight='bold')
axes[0, 1].grid(alpha=0.3)
axes[0, 1].set_xticks(k_range)

# Mark best k
best_k_silhouette = list(k_range)[np.argmax(silhouette_scores)]
axes[0, 1].axvline(x=best_k_silhouette, color='red', linestyle='--', alpha=0.7, label=f'Best k={best_k_silhouette}')
axes[0, 1].legend()

# 3. Calinski-Harabasz Index
axes[1, 0].plot(k_range, calinski_scores, 'mo-', linewidth=2, markersize=8)
axes[1, 0].set_xlabel('Number of Clusters (k)', fontweight='bold')
axes[1, 0].set_ylabel('Calinski-Harabasz Index', fontweight='bold')
axes[1, 0].set_title('Calinski-Harabasz Index (Higher is Better)', fontsize=14, fontweight='bold')
axes[1, 0].grid(alpha=0.3)
axes[1, 0].set_xticks(k_range)

# 4. Davies-Bouldin Index
axes[1, 1].plot(k_range, davies_bouldin_scores, 'ro-', linewidth=2, markersize=8)
axes[1, 1].set_xlabel('Number of Clusters (k)', fontweight='bold')
axes[1, 1].set_ylabel('Davies-Bouldin Index', fontweight='bold')
axes[1, 1].set_title('Davies-Bouldin Index (Lower is Better)', fontsize=14, fontweight='bold')
axes[1, 1].grid(alpha=0.3)
axes[1, 1].set_xticks(k_range)

plt.tight_layout()
plt.savefig('optimal_clusters_analysis_rapids.png', dpi=300, bbox_inches='tight')
plt.show()

print(f"\n✓ Analysis saved as 'optimal_clusters_analysis_rapids.png'")
print(f"\n📊 Recommended k based on Silhouette Score: {best_k_silhouette}")

In [None]:
# Set optimal k for training
optimal_k = best_k_silhouette

print("="*80)
print(f"OPTIMAL NUMBER OF CLUSTERS: k = {optimal_k}")
print("="*80)
print(f"This value will be used for clustering algorithms that require k parameter.")

## Define Evaluation Functions (GPU & CPU)

In [None]:
def evaluate_clustering_gpu(X, labels, model_name, training_time, model=None):
    """
    Evaluate clustering model performance (GPU version)
    Handles cuML models with GPU data
    
    Returns:
        dict: Dictionary containing evaluation metrics
    """
    print(f"\n{'='*80}")
    print(f"Evaluating: {model_name}")
    print(f"{'='*80}")
    
    # Convert to CPU for evaluation
    if isinstance(labels, cp.ndarray):
        labels_cpu = cp.asnumpy(labels)
    elif hasattr(labels, 'values'):
        labels_cpu = labels.values.get() if hasattr(labels.values, 'get') else labels.to_numpy()
    else:
        labels_cpu = labels
    
    if isinstance(X, cp.ndarray):
        X_cpu = cp.asnumpy(X)
    elif hasattr(X, 'values'):
        X_cpu = X.values.get() if hasattr(X.values, 'get') else X.to_numpy()
    else:
        X_cpu = X
    
    # Number of clusters
    n_clusters = len(np.unique(labels_cpu[labels_cpu >= 0]))  # Exclude noise points (-1)
    n_noise = np.sum(labels_cpu == -1)
    
    # Calculate metrics
    if n_clusters > 1 and n_noise < len(labels_cpu):
        # For DBSCAN, exclude noise points
        mask = labels_cpu >= 0
        X_clean = X_cpu[mask]
        labels_clean = labels_cpu[mask]
        
        if len(np.unique(labels_clean)) > 1:
            silhouette = silhouette_score(X_clean, labels_clean)
            calinski = calinski_harabasz_score(X_clean, labels_clean)
            davies_bouldin = davies_bouldin_score(X_clean, labels_clean)
        else:
            silhouette = -1
            calinski = 0
            davies_bouldin = float('inf')
    else:
        silhouette = -1
        calinski = 0
        davies_bouldin = float('inf')
    
    # Get inertia for K-Means models
    inertia = float(model.inertia_) if hasattr(model, 'inertia_') else None
    
    # Print results
    print(f"\n📊 Clustering Results:")
    print(f"   Number of Clusters: {n_clusters}")
    if n_noise > 0:
        print(f"   Noise Points: {n_noise} ({n_noise/len(labels_cpu)*100:.2f}%)")
    print(f"   Training Time: {training_time:.2f} seconds")
    
    print(f"\n📈 Evaluation Metrics:")
    print(f"   Silhouette Score: {silhouette:.4f}")
    print(f"   Calinski-Harabasz Index: {calinski:.2f}")
    print(f"   Davies-Bouldin Index: {davies_bouldin:.4f}")
    if inertia is not None:
        print(f"   Inertia: {inertia:.2f}")
    
    # Cluster size distribution
    unique, counts = np.unique(labels_cpu[labels_cpu >= 0], return_counts=True)
    print(f"\n📦 Cluster Sizes:")
    for cluster_id, count in zip(unique, counts):
        print(f"   Cluster {cluster_id}: {count:,} samples ({count/len(labels_cpu)*100:.2f}%)")
    
    return {
        'model_name': model_name,
        'model': model,
        'labels': labels_cpu,
        'n_clusters': n_clusters,
        'n_noise': n_noise,
        'training_time': training_time,
        'silhouette_score': silhouette,
        'calinski_harabasz_score': calinski,
        'davies_bouldin_score': davies_bouldin,
        'inertia': inertia
    }

def evaluate_clustering_cpu(X, labels, model_name, training_time, model=None):
    """
    Evaluate clustering model performance (CPU version)
    Handles sklearn models with numpy data
    
    Returns:
        dict: Dictionary containing evaluation metrics
    """
    print(f"\n{'='*80}")
    print(f"Evaluating: {model_name}")
    print(f"{'='*80}")
    
    # Number of clusters
    n_clusters = len(np.unique(labels[labels >= 0]))  # Exclude noise points (-1)
    n_noise = np.sum(labels == -1)
    
    # Calculate metrics
    if n_clusters > 1 and n_noise < len(labels):
        mask = labels >= 0
        X_clean = X[mask]
        labels_clean = labels[mask]
        
        if len(np.unique(labels_clean)) > 1:
            silhouette = silhouette_score(X_clean, labels_clean)
            calinski = calinski_harabasz_score(X_clean, labels_clean)
            davies_bouldin = davies_bouldin_score(X_clean, labels_clean)
        else:
            silhouette = -1
            calinski = 0
            davies_bouldin = float('inf')
    else:
        silhouette = -1
        calinski = 0
        davies_bouldin = float('inf')
    
    inertia = model.inertia_ if hasattr(model, 'inertia_') else None
    
    # Print results
    print(f"\n📊 Clustering Results:")
    print(f"   Number of Clusters: {n_clusters}")
    if n_noise > 0:
        print(f"   Noise Points: {n_noise} ({n_noise/len(labels)*100:.2f}%)")
    print(f"   Training Time: {training_time:.2f} seconds")
    
    print(f"\n📈 Evaluation Metrics:")
    print(f"   Silhouette Score: {silhouette:.4f}")
    print(f"   Calinski-Harabasz Index: {calinski:.2f}")
    print(f"   Davies-Bouldin Index: {davies_bouldin:.4f}")
    if inertia is not None:
        print(f"   Inertia: {inertia:.2f}")
    
    unique, counts = np.unique(labels[labels >= 0], return_counts=True)
    print(f"\n📦 Cluster Sizes:")
    for cluster_id, count in zip(unique, counts):
        print(f"   Cluster {cluster_id}: {count:,} samples ({count/len(labels)*100:.2f}%)")
    
    return {
        'model_name': model_name,
        'model': model,
        'labels': labels,
        'n_clusters': n_clusters,
        'n_noise': n_noise,
        'training_time': training_time,
        'silhouette_score': silhouette,
        'calinski_harabasz_score': calinski,
        'davies_bouldin_score': davies_bouldin,
        'inertia': inertia
    }

print("✓ Evaluation functions defined (GPU & CPU)")

## Train Clustering Models

### 1. K-Means Clustering (GPU - cuML)

In [None]:
# K-Means Clustering (GPU with cuML or CPU fallback)
if rapids_available:
    print("Training K-Means Clustering (GPU - cuML)...")
    print("🚀 Using GPU acceleration!")
    
    start_time = time.time()
    kmeans_model = cuKMeans(
        n_clusters=optimal_k,
        init='scalable-k-means++',  # GPU-optimized initialization
        max_iter=300,
        random_state=42
    )
    kmeans_labels = kmeans_model.fit_predict(X)
    kmeans_time = time.time() - start_time
    
    kmeans_results = evaluate_clustering_gpu(X, kmeans_labels, 'K-Means (cuML GPU)', kmeans_time, kmeans_model)
else:
    print("Training K-Means Clustering (CPU - sklearn)...")
    print("⚠️  GPU not available, using CPU")
    
    start_time = time.time()
    kmeans_model = skKMeans(
        n_clusters=optimal_k,
        init='k-means++',
        n_init=10,
        max_iter=300,
        random_state=42
    )
    kmeans_labels = kmeans_model.fit_predict(X)
    kmeans_time = time.time() - start_time
    
    kmeans_results = evaluate_clustering_cpu(X, kmeans_labels, 'K-Means (sklearn CPU)', kmeans_time, kmeans_model)

# Save model
with open('model_kmeans_rapids.pkl', 'wb') as f:
    pickle.dump(kmeans_model, f)
print("\n✓ Model saved: model_kmeans_rapids.pkl")

### 2. DBSCAN (GPU - cuML)

In [None]:
# DBSCAN (GPU with cuML or CPU fallback)
if rapids_available:
    print("Training DBSCAN (GPU - cuML)...")
    print("🚀 Using GPU acceleration!")
    
    start_time = time.time()
    dbscan_model = cuDBSCAN(
        eps=0.5,
        min_samples=5,
        metric='euclidean'
    )
    dbscan_labels = dbscan_model.fit_predict(X)
    dbscan_time = time.time() - start_time
    
    dbscan_results = evaluate_clustering_gpu(X, dbscan_labels, 'DBSCAN (cuML GPU)', dbscan_time, dbscan_model)
else:
    print("Training DBSCAN (CPU - sklearn)...")
    print("⚠️  GPU not available, using CPU")
    
    start_time = time.time()
    dbscan_model = skDBSCAN(
        eps=0.5,
        min_samples=5,
        metric='euclidean',
        n_jobs=-1
    )
    dbscan_labels = dbscan_model.fit_predict(X)
    dbscan_time = time.time() - start_time
    
    dbscan_results = evaluate_clustering_cpu(X, dbscan_labels, 'DBSCAN (sklearn CPU)', dbscan_time, dbscan_model)

# Save model
with open('model_dbscan_rapids.pkl', 'wb') as f:
    pickle.dump(dbscan_model, f)
print("\n✓ Model saved: model_dbscan_rapids.pkl")

### 3. Agglomerative Hierarchical Clustering (CPU - sklearn)

**Note:** cuML's agglomerative clustering supports only single linkage, so the Ward linkage used here runs on sklearn.

In [None]:
# Agglomerative Clustering (CPU only - no GPU version)
print("Training Agglomerative Hierarchical Clustering (CPU - sklearn)...")
print("⚠️  No GPU implementation for Ward linkage, using sklearn CPU")

# Convert to CPU if needed
if rapids_available:
    X_cpu = cp.asnumpy(X) if isinstance(X, cp.ndarray) else X.to_numpy()
else:
    X_cpu = X

start_time = time.time()
agglomerative_model = AgglomerativeClustering(
    n_clusters=optimal_k,
    linkage='ward'
)
agglomerative_labels = agglomerative_model.fit_predict(X_cpu)
agglomerative_time = time.time() - start_time

agglomerative_results = evaluate_clustering_cpu(X_cpu, agglomerative_labels, 'Agglomerative Clustering (sklearn CPU)', agglomerative_time, agglomerative_model)

# Save model
with open('model_agglomerative_rapids.pkl', 'wb') as f:
    pickle.dump(agglomerative_model, f)
print("\n✓ Model saved: model_agglomerative_rapids.pkl")

### 4. Spectral Clustering (CPU - sklearn)

**Note:** No GPU version available in cuML

In [None]:
# Spectral Clustering (CPU only - no GPU version)
print("Training Spectral Clustering (CPU - sklearn)...")
print("⚠️  No GPU implementation available, using sklearn CPU")

# Convert to CPU if needed
if rapids_available:
    X_cpu = cp.asnumpy(X) if isinstance(X, cp.ndarray) else X.to_numpy()
else:
    X_cpu = X

start_time = time.time()
spectral_model = SpectralClustering(
    n_clusters=optimal_k,
    affinity='nearest_neighbors',
    n_neighbors=10,
    random_state=42,
    n_jobs=-1
)
spectral_labels = spectral_model.fit_predict(X_cpu)
spectral_time = time.time() - start_time

spectral_results = evaluate_clustering_cpu(X_cpu, spectral_labels, 'Spectral Clustering (sklearn CPU)', spectral_time, spectral_model)

# Save model
with open('model_spectral_rapids.pkl', 'wb') as f:
    pickle.dump(spectral_model, f)
print("\n✓ Model saved: model_spectral_rapids.pkl")

### 5. Gaussian Mixture Model (CPU - sklearn)

**Note:** No GPU version available in cuML

In [None]:
# Gaussian Mixture Model (CPU only - no GPU version)
print("Training Gaussian Mixture Model (CPU - sklearn)...")
print("⚠️  No GPU implementation available, using sklearn CPU")

# Convert to CPU if needed
if rapids_available:
    X_cpu = cp.asnumpy(X) if isinstance(X, cp.ndarray) else X.to_numpy()
else:
    X_cpu = X

start_time = time.time()
gmm_model = GaussianMixture(
    n_components=optimal_k,
    covariance_type='full',
    max_iter=100,
    random_state=42
)
gmm_model.fit(X_cpu)
gmm_labels = gmm_model.predict(X_cpu)
gmm_time = time.time() - start_time

gmm_results = evaluate_clustering_cpu(X_cpu, gmm_labels, 'Gaussian Mixture Model (sklearn CPU)', gmm_time, gmm_model)

# Save model
with open('model_gmm_rapids.pkl', 'wb') as f:
    pickle.dump(gmm_model, f)
print("\n✓ Model saved: model_gmm_rapids.pkl")

### 6. Mini-Batch K-Means (CPU - sklearn)

**Note:** No GPU version available in cuML

In [None]:
# Mini-Batch K-Means (CPU only - no GPU version)
print("Training Mini-Batch K-Means (CPU - sklearn)...")
print("⚠️  No GPU implementation available, using sklearn CPU")

# Convert to CPU if needed
if rapids_available:
    X_cpu = cp.asnumpy(X) if isinstance(X, cp.ndarray) else X.to_numpy()
else:
    X_cpu = X

start_time = time.time()
minibatch_kmeans_model = MiniBatchKMeans(
    n_clusters=optimal_k,
    init='k-means++',
    n_init=10,
    max_iter=300,
    batch_size=1000,
    random_state=42
)
minibatch_labels = minibatch_kmeans_model.fit_predict(X_cpu)
minibatch_time = time.time() - start_time

minibatch_results = evaluate_clustering_cpu(X_cpu, minibatch_labels, 'Mini-Batch K-Means (sklearn CPU)', minibatch_time, minibatch_kmeans_model)

# Save model
with open('model_minibatch_kmeans_rapids.pkl', 'wb') as f:
    pickle.dump(minibatch_kmeans_model, f)
print("\n✓ Model saved: model_minibatch_kmeans_rapids.pkl")

## 📊 Compare All Models

In [None]:
# Collect all results
all_results = [
    kmeans_results,
    dbscan_results,
    agglomerative_results,
    spectral_results,
    gmm_results,
    minibatch_results
]

# Create comparison DataFrame
comparison_df = pd.DataFrame([{
    'Model': r['model_name'],
    'N_Clusters': r['n_clusters'],
    'Noise_Points': r['n_noise'],
    'Training_Time': f"{r['training_time']:.2f}s",
    'Silhouette': r['silhouette_score'],
    'Calinski-Harabasz': r['calinski_harabasz_score'],
    'Davies-Bouldin': r['davies_bouldin_score']
} for r in all_results])

# Sort by Silhouette Score (descending)
comparison_df = comparison_df.sort_values('Silhouette', ascending=False)

print("\n" + "="*80)
print("MODEL COMPARISON SUMMARY (RAPIDS GPU)")
print("="*80)
display(comparison_df)

# Find best model
best_model_name = comparison_df.iloc[0]['Model']
best_silhouette = comparison_df.iloc[0]['Silhouette']
print(f"\n🏆 Best Model: {best_model_name} (Silhouette Score = {best_silhouette:.4f})")

# Identify GPU-accelerated models
gpu_models = [r['model_name'] for r in all_results if 'cuML GPU' in r['model_name']]
if gpu_models:
    print(f"\n🚀 GPU-Accelerated Models: {', '.join(gpu_models)}")
    print("⚡ Typical speedup: roughly 10-50x over CPU versions (depends on data size and hardware)")

# Save comparison
comparison_df.to_csv('clustering_results_rapids.csv', index=False)
print("\n✓ Results saved: clustering_results_rapids.csv")

## 📈 Visualize Model Comparison

In [None]:
# Create comparison visualization
fig, axes = plt.subplots(2, 2, figsize=(16, 12))

models = comparison_df['Model']

# Color GPU models differently
colors = ['#00FF00' if 'cuML GPU' in m else 'skyblue' for m in models]

# 1. Silhouette Score
axes[0, 0].barh(models, comparison_df['Silhouette'], color=colors, alpha=0.8)
axes[0, 0].set_xlabel('Silhouette Score (Higher is Better)', fontweight='bold')
axes[0, 0].set_title('Silhouette Score Comparison (RAPIDS)', fontsize=14, fontweight='bold')
axes[0, 0].invert_yaxis()
axes[0, 0].grid(alpha=0.3, axis='x')

# 2. Calinski-Harabasz Index
axes[0, 1].barh(models, comparison_df['Calinski-Harabasz'], color=colors, alpha=0.8)
axes[0, 1].set_xlabel('Calinski-Harabasz Index (Higher is Better)', fontweight='bold')
axes[0, 1].set_title('Calinski-Harabasz Index Comparison', fontsize=14, fontweight='bold')
axes[0, 1].invert_yaxis()
axes[0, 1].grid(alpha=0.3, axis='x')

# 3. Davies-Bouldin Index
db_valid = comparison_df[comparison_df['Davies-Bouldin'] != float('inf')]
db_colors = ['#00FF00' if 'cuML GPU' in m else 'skyblue' for m in db_valid['Model']]  # match the figure legend
axes[1, 0].barh(db_valid['Model'], db_valid['Davies-Bouldin'], color=db_colors, alpha=0.8)
axes[1, 0].set_xlabel('Davies-Bouldin Index (Lower is Better)', fontweight='bold')
axes[1, 0].set_title('Davies-Bouldin Index Comparison', fontsize=14, fontweight='bold')
axes[1, 0].invert_yaxis()
axes[1, 0].grid(alpha=0.3, axis='x')

# 4. Training Time
training_times = [float(t.replace('s', '')) for t in comparison_df['Training_Time']]
axes[1, 1].barh(models, training_times, color=colors, alpha=0.8)
axes[1, 1].set_xlabel('Training Time (seconds)', fontweight='bold')
axes[1, 1].set_title('Training Time Comparison (GPU vs CPU)', fontsize=14, fontweight='bold')
axes[1, 1].invert_yaxis()
axes[1, 1].grid(alpha=0.3, axis='x')

# Add legend
from matplotlib.patches import Patch
legend_elements = [
    Patch(facecolor='#00FF00', label='GPU-Accelerated (cuML)'),
    Patch(facecolor='skyblue', label='CPU (sklearn)')
]
fig.legend(handles=legend_elements, loc='upper center', ncol=2, bbox_to_anchor=(0.5, 0.98))

plt.tight_layout(rect=[0, 0, 1, 0.96])
plt.savefig('clustering_comparison_rapids.png', dpi=300, bbox_inches='tight')
plt.show()

print("\n✓ Comparison chart saved as 'clustering_comparison_rapids.png'")

## 💾 Save All Results

In [None]:
# Save all results including labels
print("="*80)
print("SAVING ALL RESULTS")
print("="*80)

# Save results with pickle
with open('all_clustering_results_rapids.pkl', 'wb') as f:
    pickle.dump(all_results, f)
print("✓ All results saved: all_clustering_results_rapids.pkl")

# Save labels for each model
labels_df = pd.DataFrame({
    'K-Means': kmeans_results['labels'],
    'DBSCAN': dbscan_results['labels'],
    'Agglomerative': agglomerative_results['labels'],
    'Spectral': spectral_results['labels'],
    'GMM': gmm_results['labels'],
    'Mini-Batch K-Means': minibatch_results['labels']
})
labels_df.to_csv('clustering_labels_rapids.csv', index=False)
print("✓ All labels saved: clustering_labels_rapids.csv")

print("\n" + "="*80)
print("✅ CLUSTERING TRAINING COMPLETE (RAPIDS GPU)!")
print("="*80)
print(f"Total models trained: {len(all_results)}")
print(f"Best performing model: {best_model_name}")

if rapids_available:
    # Use distinct names so the device count from the import cell is not overwritten
    n_gpu_models = len([r for r in all_results if 'cuML GPU' in r['model_name']])
    n_cpu_models = len(all_results) - n_gpu_models
    print(f"\n🚀 GPU-accelerated models: {n_gpu_models}/{len(all_results)}")
    print(f"💻 CPU fallback models: {n_cpu_models}/{len(all_results)}")
    print("\n⚡ GPU models can be roughly 10-50x faster than CPU versions, depending on data size and hardware.")
else:
    print("\n⚠️  All models ran on CPU (RAPIDS not available)")
    print("📦 Install RAPIDS for GPU acceleration: conda install -c rapidsai -c conda-forge -c nvidia rapids")

print(f"\nFiles created:")
print(f"  • clustering_results_rapids.csv - Performance comparison")
print(f"  • clustering_labels_rapids.csv - All cluster labels")
print(f"  • all_clustering_results_rapids.pkl - Complete results")
print(f"  • model_*_rapids.pkl - Individual model files (6 models)")
print(f"  • clustering_comparison_rapids.png - Visualization")
print(f"  • optimal_clusters_analysis_rapids.png - Optimal k analysis")

print("\n" + "="*80)
print("📊 RAPIDS ADVANTAGES")
print("="*80)
print("GPU-Accelerated:")
print("  • K-Means: roughly 10-30x faster (workload-dependent)")
print("  • DBSCAN: roughly 20-50x faster (workload-dependent)")
print("  • Scales to larger datasets, within GPU memory limits")
print("  • Comparable clustering quality to CPU versions")
print("\nCPU only in this notebook:")
print("  • Agglomerative Clustering")
print("  • Spectral Clustering")
print("  • Gaussian Mixture Model")
print("  • Mini-Batch K-Means")