# 3️⃣ Clustering Analysis: Automated Model Selection
Clustering is one of the most common forms of **Unsupervised Learning**. In this notebook, we explore several clustering algorithms, evaluate them with PyCaret's built-in evaluation tools, and choose one for our customer segmentation.

## Key Learning Objectives:
1. **Model Exploration**: Using `models()` to see all available clustering algorithms.
2. **Interactive Evaluation**: Deep-dive into model performance with `evaluate_model`.
3. **Automated Workflow**: Selecting, saving, and reusing the best-performing clustering model.

In [None]:
%%capture
# !pip install pycaret

In [None]:
import pandas as pd
from pycaret.clustering import *
import os

# Create Output folder
output_dir = './Output'
os.makedirs(output_dir, exist_ok=True)

## 1. Initializing the Experiment
The `setup()` function prepares the data: it drops the columns listed in `ignore_features` and, when `normalize=True` is passed, rescales the numeric features (normalization is off by default).
After setup, we use `models()` to see which algorithms are available (K-Means, DBSCAN, Hierarchical, etc.).
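
To make the normalization step concrete, here is a minimal sketch of z-score scaling in plain Python. It is not PyCaret's internal code; the `zscore` helper and the sample values are hypothetical and only illustrate why distance-based algorithms such as K-Means benefit from features that share a common scale.

```python
from statistics import mean, pstdev

def zscore(values):
    """Rescale a list of numbers to mean 0 and standard deviation 1."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation
    if sigma == 0:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]

# Hypothetical values, NOT taken from Mall_Customers.csv.
ages = [19, 21, 35, 52, 64]
incomes_usd = [15000, 16000, 60000, 88000, 126000]

ages_z = zscore(ages)
incomes_z = zscore(incomes_usd)

# Before scaling, income differences dwarf age differences in any
# Euclidean distance. After scaling, both columns have the same spread.
print([round(v, 2) for v in ages_z])
print([round(v, 2) for v in incomes_z])
```

After scaling, a one-unit difference means "one standard deviation" in every column, so no single feature dominates the distance calculation.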

In [None]:
# Load Mall Customers data
df_mall = pd.read_csv('./Data/Mall_Customers.csv')

# Initialize setup (normalize=True so that all features share a common scale)
clu_setup = setup(data=df_mall, ignore_features=['CustomerID'], normalize=True, session_id=123, verbose=False)

# Display all available clustering models
all_models = models()
print("--- Available Clustering Algorithms ---")
all_models

## 2. Model Creation and Selection
In clustering, unlike classification, there is no `compare_models()` that ranks models by accuracy, because there are no ground-truth labels to score against. Instead, we use `create_model()` for specific algorithms and evaluate each one.

**Pro-Tip**: **K-Means** and **Hierarchical (hclust)** are usually good starting points for small, structured datasets like this one.

In [None]:
# Create a K-Means model (Standard choice)
kmeans = create_model('kmeans', num_clusters=5)

# Create a Hierarchical model (To compare)
hclust = create_model('hclust', num_clusters=5)

# Use evaluate_model to open an interactive dashboard for 'hclust'
# You can switch between the Cluster PCA, t-SNE, Silhouette, and Distribution plots here
evaluate_model(hclust)

## 3. Metrics and Plotting
After creating a model, we can use `pull()` to get the metrics (Silhouette, Calinski-Harabasz, etc.) and `plot_model()` for visualization.
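
As a rough illustration of what the Silhouette metric measures, the sketch below computes it from scratch for a tiny, made-up dataset. This is an assumption-laden teaching example (Euclidean distance, hypothetical points and labels), not the implementation PyCaret uses.

```python
from math import dist

def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points (Euclidean distance)."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # convention for singleton clusters
            continue
        # a: mean distance to the other members of p's own cluster
        a = sum(dist(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(dist(p, q) for q in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)

# Hypothetical 2D points forming two obvious groups.
points = [(1.0, 1.0), (1.0, 2.0), (8.0, 8.0), (8.0, 9.0)]
good_score = silhouette_score(points, [0, 0, 1, 1])  # sensible grouping
bad_score = silhouette_score(points, [0, 1, 0, 1])   # groups mixed up

print(round(good_score, 3), round(bad_score, 3))
```

Values close to 1 indicate compact, well-separated clusters, while values near 0 or below suggest overlapping or mislabeled clusters.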

In [None]:
# Get the metrics for the last created model (hclust)
metrics = pull()
print("--- Model Performance Metrics ---")
print(metrics)

# Silhouette Plot: To see how well-separated our clusters are
# Note: If it fails for hclust, use it for kmeans
plot_model(kmeans, plot='silhouette')

# Cluster Plot: 2D visualization using PCA
plot_model(kmeans, plot='cluster')

In [None]:
# t-SNE plot: non-linear projection of the clusters (3D in PyCaret)
plot_model(kmeans, plot='tsne')

## 4. Saving the Model
In PyCaret's Clustering module, there is no `finalize_model` function. This is because clustering is an unsupervised task that is performed on the entire dataset provided during the `setup()` phase. Therefore, the model created by `create_model()` is already considered the "final" model.

In [None]:
# In Clustering, we don't need finalize_model. 
# The 'kmeans' object is already our final model.

# Save the model directly
save_model(kmeans, './Output/clustering_mall_customers_model')

print("✅ Model saved successfully without needing finalize_model!")

## 5. Predicting on New Data
Now we load the saved model and apply it to customer data. This confirms that the saved pipeline (preprocessing + K-Means) works end to end.
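
Conceptually, assigning a new row to a K-Means cluster means finding the nearest centroid after the row has gone through the same preprocessing as the training data. The sketch below shows that idea in plain Python; the `assign_clusters` helper, the centroids, and the rows are all hypothetical and do not come from the fitted model.

```python
from math import dist

def assign_clusters(rows, centroids):
    """Return, for each row, the index of the nearest centroid."""
    return [
        min(range(len(centroids)), key=lambda k: dist(row, centroids[k]))
        for row in rows
    ]

# Hypothetical centroids in (scaled age, scaled income) space.
centroids = [(-1.0, -1.0), (0.0, 1.5), (1.2, 0.0)]

# Hypothetical new customers, already scaled the same way.
new_rows = [(-0.9, -1.2), (0.1, 1.4), (1.0, 0.3)]

assignments = assign_clusters(new_rows, centroids)
print(assignments)
```

This is also why prediction on unseen rows suits centroid-based models such as K-Means: the fitted centroids are all that is needed to label a new point.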

In [None]:
# Load the saved model
loaded_model = load_model('./Output/clustering_mall_customers_model')

# Prepare "new" data (here we reuse the first 5 rows of the training file
# as a stand-in for unseen customers)
new_customers = pd.read_csv('./Data/Mall_Customers.csv').head(5)

# Assign clusters
predictions = predict_model(loaded_model, data=new_customers)

print("\n--- New Customer Assignments ---")
predictions[['Age', 'Annual Income (k$)', 'Cluster']]