Assignment Code: DA-AG-017

# Clustering | Assignment

1. What is the difference between K-Means and Hierarchical Clustering?
Provide a use case for each.
 - Here’s a clear comparison between K-Means and Hierarchical Clustering:

✅ 1. Algorithm Nature

K-Means:

Partitioning method → Divides data into k non-overlapping clusters.

Requires you to specify the number of clusters (k) in advance.

Hierarchical Clustering:

Agglomerative (bottom-up) or Divisive (top-down) method → Builds a tree-like structure (dendrogram).

No need to specify k initially; you can decide later by cutting the dendrogram.

✅ 2. Approach

K-Means:

Starts with k random centroids → Assigns points → Updates centroids until convergence.

Based on minimizing variance within clusters (Euclidean distance mostly).

Hierarchical:

Computes distance between clusters using linkage methods (single, complete, average).

Merges or splits clusters iteratively.

✅ 3. Computational Complexity

K-Means:

$O(n \times k \times i)$ → Efficient for large datasets (where $n$ = data points, $k$ = clusters, $i$ = iterations).

Hierarchical:

$O(n^2)$ or worse → Expensive for large datasets.

✅ 4. Output

K-Means:

Provides flat clusters only.

Hierarchical:

Provides a full hierarchy (can be visualized as a dendrogram) → Good for understanding relationships.

✅ 5. Assumptions

K-Means:

Assumes spherical clusters of similar size.

Hierarchical:

More flexible; can work with arbitrary-shaped clusters (depending on linkage).

Use Cases
K-Means Use Case: Customer Segmentation

A retail company wants to group customers based on annual income and spending score.

K-Means works well because:

Data is numerical and relatively large.

Pre-determined number of clusters (e.g., 5 customer segments).

Hierarchical Clustering Use Case: Document Similarity

A research organization wants to create a taxonomy of research papers based on content similarity.

Hierarchical clustering works because:

We don’t know the number of clusters in advance.

Dendrogram provides a clear hierarchy (e.g., AI → Machine Learning → Deep Learning).
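
The contrast above can be sketched in a few lines. This is a minimal illustration, not a benchmark: the toy data, the Ward linkage, and the variable names are assumptions chosen for the example. It shows that K-Means needs k before fitting, while a hierarchical merge tree is built once and can then be cut at several levels.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Toy data (an assumption for illustration): 150 points around 3 centers.
X, _ = make_blobs(n_samples=150, centers=3, cluster_std=0.6, random_state=42)

# K-Means: k is fixed before fitting; the result is a single flat partition.
kmeans_labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)

# Hierarchical (agglomerative, Ward linkage): build the merge tree once...
Z = linkage(X, method="ward")

# ...then cut the same tree at different levels without refitting.
labels_2 = fcluster(Z, t=2, criterion="maxclust")
labels_3 = fcluster(Z, t=3, criterion="maxclust")

print("K-Means clusters:", len(np.unique(kmeans_labels)))
print("Tree cut into 2 clusters:", len(np.unique(labels_2)))
print("Tree cut into 3 clusters:", len(np.unique(labels_3)))
```

The linkage matrix `Z` can also be passed to `scipy.cluster.hierarchy.dendrogram` to draw the tree.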

2. Explain the purpose of the Silhouette Score in evaluating clustering
algorithms.
 - The Silhouette Score is a metric used to evaluate the quality of clustering results by measuring how well each data point fits within its cluster compared to other clusters.

✅ Purpose

To assess cluster cohesion and separation:

Cohesion → How close a point is to other points in its own cluster.

Separation → How far the point is from points in other clusters.

Helps determine:

How well-defined the clusters are.

The optimal number of clusters (k).

✅ How It Works

For each point:

$a$ = Average distance to all other points in the same cluster (intra-cluster distance).

$b$ = Average distance to all points in the nearest different cluster (nearest-cluster distance).

The Silhouette Score for a point:

$$s = \frac{b - a}{\max(a, b)}$$


Ranges from -1 to +1:

+1 → Perfectly matched to its own cluster and far from others.

0 → On or near the decision boundary between clusters.

-1 → Possibly in the wrong cluster.

✅ Overall Silhouette Score

The mean of all points’ scores.

Interpretation:

0.71 – 1.00 → Strong structure (excellent clustering).

0.51 – 0.70 → Good structure.

0.26 – 0.50 → Weak structure (clusters overlap).

≤ 0.25 → No substantial structure (bad clustering).

✅ Why Use It?

To compare different clustering algorithms (e.g., K-Means vs Hierarchical).

To choose the best k in K-Means (pick the k with the highest score).
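
As a rough check of the definition, the per-point score can be computed by hand and compared with scikit-learn's `silhouette_samples` and `silhouette_score`. The toy data and the helper `silhouette_of_point` are assumptions made for this sketch. The last lines show how k could be picked by the highest mean score.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import pairwise_distances, silhouette_samples, silhouette_score

# Toy data (an assumption for illustration).
X, _ = make_blobs(n_samples=200, centers=3, cluster_std=0.7, random_state=1)
labels = KMeans(n_clusters=3, n_init=10, random_state=1).fit_predict(X)

D = pairwise_distances(X)  # Euclidean distances between all pairs


def silhouette_of_point(i):
    """Hypothetical helper: s = (b - a) / max(a, b) for point i."""
    own = labels[i]
    same = labels == own
    same[i] = False  # a excludes the point itself
    if not same.any():
        return 0.0  # convention for a cluster with a single point
    a = D[i, same].mean()
    # b is the smallest mean distance to the points of any other cluster
    b = min(D[i, labels == c].mean() for c in np.unique(labels) if c != own)
    return (b - a) / max(a, b)


manual = np.array([silhouette_of_point(i) for i in range(len(X))])
print("matches sklearn per point:", np.allclose(manual, silhouette_samples(X, labels)))
print("mean score:", manual.mean(), "sklearn:", silhouette_score(X, labels))

# Choosing k: fit for several k and keep the one with the highest mean score.
scores = {}
for k in range(2, 7):
    candidate = KMeans(n_clusters=k, n_init=10, random_state=1).fit_predict(X)
    scores[k] = silhouette_score(X, candidate)
best_k = max(scores, key=scores.get)
print("best k by silhouette:", best_k)
```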

3. What are the core parameters of DBSCAN, and how do they influence the
clustering process?
 - DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a clustering algorithm that groups together points that are closely packed and marks outliers as noise. Its behavior is controlled mainly by two core parameters:

✅ 1. eps (ε) – Epsilon (Neighborhood Radius)

Definition:
The maximum distance between two points for them to be considered neighbors.

Influence on Clustering:

Small eps:

Only points very close together are considered neighbors → many small clusters or too many noise points.

Large eps:

More points are considered neighbors → fewer clusters, possibly merging distinct clusters into one.

Tip:

A good choice of eps is crucial. Use k-distance graph (elbow method) to find a suitable value.

✅ 2. minPts – Minimum Points

Definition:
The minimum number of points required within an eps-neighborhood for a point to be considered a core point.

Influence on Clustering:

Small minPts (e.g., 2 or 3):

Many points become core points → clusters form easily (can capture noise as clusters).

Large minPts:

Requires denser neighborhoods to form clusters → fewer clusters, more points labeled as noise.

Rule of Thumb:

minPts ≥ dimensions + 1 (e.g., if data is 2D, minPts ≥ 3).

✅ Interaction of eps and minPts

Dense Region:

A point is a core point if at least minPts points are within eps distance.

Cluster Formation:

DBSCAN expands clusters from these core points, adding all directly and indirectly density-reachable points.

Noise Points:

Points not reachable from any core point are labeled as noise.

✅ Effect Summary

eps too small + large minPts → Many noise points, few clusters.

eps too large + small minPts → Few big clusters, possibly merging unrelated groups.

Example:

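eps = 0.5, minPts = 5 → A point is a core point if at least 5 points lie within a radius of 0.5 of it (scikit-learn's min_samples counts the point itself); clusters grow outward from such core points.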
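
The influence of the two parameters can be explored with a small experiment. This is a sketch under assumptions: the moon-shaped toy data, the parameter grid, and the helper `summarize` are chosen only for illustration. The exact counts depend on the data, so none are quoted here.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Toy data (an assumption for illustration).
X, _ = make_moons(n_samples=200, noise=0.1, random_state=0)


def summarize(eps, min_samples):
    """Hypothetical helper: return (number of clusters, number of noise points)."""
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X)
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
    n_noise = int(np.sum(labels == -1))
    return n_clusters, n_noise


# Small eps, moderate eps, very large eps, then a large min_samples.
for eps, min_samples in [(0.05, 5), (0.3, 5), (2.0, 5), (0.3, 30)]:
    n_clusters, n_noise = summarize(eps, min_samples)
    print(f"eps={eps}, min_samples={min_samples}: "
          f"{n_clusters} clusters, {n_noise} noise points")
```

For a fixed min_samples, increasing eps can only reduce the number of noise points; for a fixed eps, increasing min_samples can only add to them.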

4.  Why is feature scaling important when applying clustering algorithms like
K-Means and DBSCAN?
 - Feature scaling is crucial for clustering algorithms like K-Means and DBSCAN because these algorithms rely heavily on distance metrics (e.g., Euclidean distance) to form clusters. If features have very different scales, the clustering results can become biased.

✅ Why Feature Scaling Matters

Distance-Based Nature:

K-Means and DBSCAN compute distances between points (e.g., Euclidean distance).

If one feature has a large range (e.g., income in thousands) and another has a small range (e.g., age in years), the large-range feature will dominate distance calculations.

Impact on Clustering:

Clusters may be driven only by high-scale features, ignoring others.

Leads to incorrect cluster shapes and wrong assignments.

✅ Specific to Algorithms

K-Means:

Uses centroids and variance minimization → very sensitive to feature scale.

Example: If Annual Income ranges from 20,000 to 100,000 and Spending Score ranges from 1 to 100, income dominates cluster formation.

DBSCAN:

Uses eps (distance threshold) → scaling affects how neighborhood density is computed.

A wrong scale can make eps meaningless (too large for small-scale features or too small for large-scale features).

✅ Common Scaling Techniques

Standardization (Z-score):

$$z = \frac{x - \mu}{\sigma}$$


(mean = 0, std = 1)

Min-Max Normalization:

$$x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}}$$


(range = [0,1])

Robust Scaling:
Based on median and IQR (for outlier-heavy data).

✅ Example Without Scaling

Age: 18–70

Annual Income: ₹1,00,000–₹10,00,000

Spending Score: 1–100
→ Clustering will mostly group by income, ignoring age and spending.
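
A small numeric sketch makes the effect concrete. The three customers below are hypothetical values invented for illustration. Before scaling, customer 0 looks closest to customer 1 only because their incomes are similar. After standardization, customer 2, who matches on age and spending score, becomes the nearer neighbor.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical customers: [age, annual income, spending score]
X = np.array([
    [25, 200_000, 90],   # customer 0
    [60, 210_000, 10],   # customer 1: similar income, very different age/spending
    [26, 900_000, 88],   # customer 2: similar age/spending, very different income
], dtype=float)


def dist(a, b):
    """Euclidean distance between two feature vectors."""
    return float(np.linalg.norm(a - b))


# Raw features: income dominates the distance.
raw_d01 = dist(X[0], X[1])
raw_d02 = dist(X[0], X[2])
print("raw:    d(0,1) =", raw_d01, " d(0,2) =", raw_d02)

# Standardized features: every column contributes on a comparable scale.
X_scaled = StandardScaler().fit_transform(X)
scaled_d01 = dist(X_scaled[0], X_scaled[1])
scaled_d02 = dist(X_scaled[0], X_scaled[2])
print("scaled: d(0,1) =", scaled_d01, " d(0,2) =", scaled_d02)
```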

5. What is the Elbow Method in K-Means clustering and how does it help
determine the optimal number of clusters?
  - The Elbow Method is a heuristic used to determine the optimal number of clusters ($k$) in K-Means clustering by evaluating the trade-off between model complexity and the variance explained by the clusters.
How It Works:

Run K-Means for Different $k$ Values: Apply K-Means clustering for a range of cluster numbers (e.g., $k = 1$ to $k = 10$).
Calculate Within-Cluster Sum of Squares (WCSS): For each $k$, compute the WCSS, which measures the total squared distance between each data point and the centroid of its assigned cluster. Lower WCSS indicates tighter clusters.
Plot WCSS vs. $k$: Create a plot where the x-axis is the number of clusters ($k$) and the y-axis is the WCSS.
Identify the "Elbow": Look for a point on the plot where adding more clusters results in diminishing returns in reducing WCSS. This point, resembling an "elbow," suggests the optimal $k$, where increasing $k$ further provides little improvement in clustering quality.

Why It Helps:

Balances Fit and Complexity: The Elbow Method helps select a $k$ that captures meaningful patterns in the data without overfitting by creating too many clusters.
Visual Decision Tool: The plot provides a clear visual cue to identify when additional clusters yield minimal gains in explaining variance.

Example:
Suppose you run K-Means for $k = 1$ to $k = 10$, and the WCSS values are:

$k=1$: WCSS = 1000
$k=2$: WCSS = 600
$k=3$: WCSS = 300
$k=4$: WCSS = 250
$k=5$: WCSS = 230
$k=6$: WCSS = 220

Plotting these, you might notice a sharp drop in WCSS from $k=1$ to $k=3$, followed by smaller reductions beyond $k=3$. The "elbow" at $k=3$ suggests that 3 clusters provide a good balance between explaining variance and keeping the model simple.
Limitations:

Subjectivity: The elbow point can be ambiguous if the plot lacks a clear bend.
Data Dependency: Works best with spherical, well-separated clusters; may be less effective for complex or overlapping clusters.
Alternative Methods: Other methods like the Silhouette Score or Gap Statistic can complement the Elbow Method for more robust $k$ selection.
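
The procedure can be sketched with scikit-learn, where the fitted model's `inertia_` attribute is the WCSS. The toy data and the range of k are assumptions for illustration. Plotting `wcss` against k gives the elbow curve.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Toy data (an assumption for illustration): 4 well-separated blobs.
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=0)

ks = list(range(1, 11))
wcss = []
for k in ks:
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    wcss.append(model.inertia_)  # within-cluster sum of squares for this k

# Print WCSS and the reduction gained by each additional cluster.
for i, k in enumerate(ks):
    drop = wcss[i - 1] - wcss[i] if i > 0 else float("nan")
    print(f"k={k:2d}  WCSS={wcss[i]:10.2f}  reduction vs k-1={drop:10.2f}")
```

The elbow is the k after which the reductions become small relative to the earlier ones.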

6. Generate synthetic data using make_blobs(n_samples=300, centers=4),
apply KMeans clustering, and visualize the results with cluster centers.
 - Below is a Python script that generates synthetic data using make_blobs with 300 samples and 4 centers, applies K-Means clustering, and visualizes the results with cluster centers using Matplotlib.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

# Generate synthetic data
X, y_true = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=0)

# Apply K-Means clustering
kmeans = KMeans(n_clusters=4, random_state=0)
kmeans.fit(X)
y_kmeans = kmeans.predict(X)
centers = kmeans.cluster_centers_

# Visualize the results
plt.scatter(X[:, 0], X[:, 1], c=y_kmeans, s=50, cmap='viridis', alpha=0.6, label='Data points')
plt.scatter(centers[:, 0], centers[:, 1], c='red', s=200, marker='X', label='Cluster centers')
plt.title('K-Means Clustering with 4 Clusters')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.legend()
plt.show()
```

Explanation:

Data Generation: make_blobs creates 300 data points with 4 clusters, using a standard deviation of 0.6 for each cluster.
K-Means: The KMeans algorithm is applied with n_clusters=4 to match the known number of centers.
Visualization: The script plots the data points colored by their assigned cluster and marks the cluster centers with red 'X' markers.

Running this code will display a scatter plot showing the clustered data points and their respective cluster centers.

7. Load the Wine dataset, apply StandardScaler, and then train a DBSCAN
model. Print the number of clusters found (excluding noise).
 - Below is a Python script that loads the Wine dataset, applies StandardScaler to standardize the features, trains a DBSCAN model, and prints the number of clusters found (excluding noise points).

```python
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import DBSCAN
import numpy as np

# Load the Wine dataset
wine = load_wine()
X = wine.data

# Apply StandardScaler
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Train DBSCAN model
dbscan = DBSCAN(eps=2.0, min_samples=5)
labels = dbscan.fit_predict(X_scaled)

# Calculate number of clusters (excluding noise points, labeled as -1)
n_clusters = len(np.unique(labels)) - (1 if -1 in labels else 0)

print(f"Number of clusters found (excluding noise): {n_clusters}")
```

Explanation:

Wine Dataset: The dataset is loaded using sklearn.datasets.load_wine, containing 178 samples with 13 features.
StandardScaler: The features are standardized to have zero mean and unit variance, which is crucial for DBSCAN since it relies on distance metrics.
DBSCAN: The model is trained with eps=2.0 and min_samples=5. These parameters may need tuning depending on the dataset's characteristics.
Cluster Count: The number of unique labels (clusters) is calculated, excluding noise points (labeled as -1).

When you run this script, it prints the number of clusters found. The count depends strongly on eps and min_samples, so treat eps=2.0 and min_samples=5 as starting values to tune rather than as definitive settings.
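
One common way to pick a starting eps is the k-distance graph: sort every point's distance to its k-th nearest neighbor and look for the bend. The sketch below computes those distances for the standardized Wine data. Setting k equal to min_samples is an assumption that relies on scikit-learn counting the point itself, and the printed percentiles are only a rough guide to where the bend might lie.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler

X_scaled = StandardScaler().fit_transform(load_wine().data)

min_samples = 5
# Querying the training data itself returns each point as its own first
# neighbor (distance 0), so the last column is the radius at which the point
# would have min_samples points (itself included) in its neighborhood.
nn = NearestNeighbors(n_neighbors=min_samples).fit(X_scaled)
distances, _ = nn.kneighbors(X_scaled)
k_dist = np.sort(distances[:, -1])

# Plotting k_dist against its index gives the k-distance graph; the "elbow"
# of that curve is a candidate eps.
for q in (50, 75, 90, 95):
    print(f"{q}th percentile of k-distance: {np.percentile(k_dist, q):.3f}")
```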

8. Generate moon-shaped synthetic data using
make_moons(n_samples=200, noise=0.1), apply DBSCAN, and highlight the outliers in
the plot.
 - Below is a Python script that generates moon-shaped synthetic data using make_moons with 200 samples and noise of 0.1, applies DBSCAN clustering, and visualizes the results with outliers highlighted in a plot.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_moons
from sklearn.cluster import DBSCAN

# Generate moon-shaped synthetic data
X, y_true = make_moons(n_samples=200, noise=0.1, random_state=0)

# Apply DBSCAN clustering
dbscan = DBSCAN(eps=0.3, min_samples=5)
labels = dbscan.fit_predict(X)

# Identify outliers (labeled as -1)
outliers = X[labels == -1]
clustered = X[labels != -1]
clustered_labels = labels[labels != -1]

# Visualize the results
plt.scatter(clustered[:, 0], clustered[:, 1], c=clustered_labels, s=50, cmap='viridis', alpha=0.6, label='Clustered points')
plt.scatter(outliers[:, 0], outliers[:, 1], c='red', s=50, marker='x', label='Outliers')
plt.title('DBSCAN Clustering on Moon-Shaped Data')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.legend()
plt.show()
```

Explanation:

Data Generation: make_moons creates 200 data points in two crescent-shaped clusters with a noise level of 0.1.
DBSCAN: The model uses eps=0.3 and min_samples=5 to identify clusters and outliers. These parameters are chosen to suit the moon-shaped data, but you can adjust them for different density requirements.
Visualization: The plot shows clustered points colored by their cluster labels (using the 'viridis' colormap) and outliers marked as red 'x' markers.

9.  Load the Wine dataset, reduce it to 2D using PCA, then apply
Agglomerative Clustering and visualize the result in 2D with a scatter plot.
 - Below is a Python script that loads the Wine dataset, reduces its dimensionality to 2D using PCA, applies Agglomerative Clustering, and visualizes the results in a 2D scatter plot.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import AgglomerativeClustering

# Load the Wine dataset
wine = load_wine()
X = wine.data

# Standardize the features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Reduce to 2D using PCA
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

# Apply Agglomerative Clustering
agg_clustering = AgglomerativeClustering(n_clusters=3, linkage='ward')
labels = agg_clustering.fit_predict(X_pca)

# Visualize the results
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=labels, s=50, cmap='viridis', alpha=0.6)
plt.title('Agglomerative Clustering on Wine Dataset (PCA-Reduced to 2D)')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.colorbar(label='Cluster')
plt.show()
```

Explanation:

Wine Dataset: Loaded using sklearn.datasets.load_wine, containing 178 samples with 13 features.
StandardScaler: Features are standardized to ensure zero mean and unit variance, which is important for PCA and clustering.
PCA: Reduces the 13-dimensional data to 2D for visualization, capturing the most variance in the first two principal components.
Agglomerative Clustering: Applied with n_clusters=3 (since the Wine dataset has 3 classes) and ward linkage to minimize variance within clusters.
Visualization: A scatter plot shows the 2D data points colored by their cluster labels using the 'viridis' colormap.
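
Two optional checks can accompany the plot: how much variance the two components retain, and how closely the clusters agree with the known wine classes. The use of `adjusted_rand_score` against `wine.target` is an assumption added for illustration and is not part of the original task.

```python
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.metrics import adjusted_rand_score
from sklearn.preprocessing import StandardScaler

wine = load_wine()
X_scaled = StandardScaler().fit_transform(wine.data)

# Reduce to two principal components and inspect the retained variance.
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)
explained = pca.explained_variance_ratio_
print("variance explained by PC1, PC2:", explained)
print("total retained:", explained.sum())

# Cluster in the 2D space and compare with the true classes.
# ARI is 1.0 for identical partitions and close to 0 for random labelings.
labels = AgglomerativeClustering(n_clusters=3, linkage="ward").fit_predict(X_pca)
ari = adjusted_rand_score(wine.target, labels)
print("adjusted Rand index vs. true classes:", ari)
```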

10. You are working as a data analyst at an e-commerce company. The
marketing team wants to segment customers based on their purchasing behavior to run
targeted promotions. The dataset contains customer demographics and their product
purchase history across categories.
Describe your real-world data science workflow using clustering:
● Which clustering algorithm(s) would you use and why?
● How would you preprocess the data (missing values, scaling)?
● How would you determine the number of clusters?
● How would the marketing team benefit from your clustering analysis?

 - Here’s a structured real-world workflow for customer segmentation using clustering:

✅ 1. Clustering Algorithm(s) to Use & Why

Primary Choice: K-Means

Scales well with large datasets.

Produces easily interpretable, flat clusters for marketing segments.

Works well when clusters are roughly spherical and similar in size.

Alternative / Additional: DBSCAN or Hierarchical Clustering

DBSCAN → Detects arbitrary-shaped clusters and outliers (e.g., identifying VIP or inactive customers).

Hierarchical → For exploratory analysis and dendrogram visualization.

✅ 2. Data Preprocessing

Step 1: Handle Missing Values

Numerical features: Impute using mean/median.

Categorical features: Impute using mode or create a separate category like “Unknown.”

Step 2: Encode Categorical Variables

Use One-Hot Encoding for product categories, gender, etc.

Step 3: Normalize/Scale Features

Apply StandardScaler or Min-Max Scaling because K-Means and DBSCAN are distance-based.

Step 4: Feature Engineering

Create meaningful features like:

Average Order Value

Frequency of Purchases

Category Affinity Score

Recency of Purchase (RFM analysis)

✅ 3. Determine Number of Clusters

For K-Means:

Use Elbow Method → Plot k vs. Within-Cluster-Sum-of-Squares (WCSS) and look for the "elbow."

Use Silhouette Score → Higher score means better separation.

For DBSCAN:

No need for k; tune eps and minPts using a k-distance graph.

✅ 4. Marketing Team Benefits

Targeted Campaigns:

Segment-specific offers (e.g., “High-spending loyal customers” get premium deals).

Personalized Recommendations:

Recommend products based on category affinity.

Customer Retention:

Identify at-risk customers (low frequency & low spending) for retention campaigns.

Product Strategy:

Identify clusters with strong category preferences to plan promotions or bundle offers.

✅ Example Clusters You Might Find

Cluster 1: High-value, frequent buyers → Loyalty programs.

Cluster 2: Price-sensitive shoppers → Discount offers.

Cluster 3: Occasional luxury buyers → Premium product recommendations.

📊 Extra Value: You can visualize clusters using PCA (2D scatter plot) for presentation to the marketing team.
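
The workflow can be sketched end to end as follows. Everything about the data is an assumption: the column names, the randomly generated values, and the injected missing entries are hypothetical stand-ins for a real customer table. Because the data is random, the resulting segments carry no business meaning; the sketch only shows how the preprocessing, the choice of k, and the segment profiling fit together.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.metrics import silhouette_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical customer table with some missing values.
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "age": rng.integers(18, 70, n).astype(float),
    "avg_order_value": rng.gamma(2.0, 50.0, n),
    "purchase_frequency": rng.poisson(5, n).astype(float),
    "gender": rng.choice(["F", "M"], n),
})
df.loc[rng.choice(n, 10, replace=False), "age"] = np.nan
df.loc[rng.choice(n, 10, replace=False), "gender"] = np.nan

numeric_cols = ["age", "avg_order_value", "purchase_frequency"]
categorical_cols = ["gender"]

# Steps 1-3: impute, one-hot encode, and scale.
numeric_steps = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])
categorical_steps = Pipeline([
    ("impute", SimpleImputer(strategy="constant", fill_value="Unknown")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])
preprocess = ColumnTransformer(
    [("num", numeric_steps, numeric_cols), ("cat", categorical_steps, categorical_cols)],
    sparse_threshold=0,  # always return a dense array
)
features = preprocess.fit_transform(df)

# Choose k by the highest mean silhouette score over a small range.
scores = {}
for k in range(2, 7):
    candidate = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(features)
    scores[k] = silhouette_score(features, candidate)
best_k = max(scores, key=scores.get)

# Fit the final model and profile each segment for the marketing team.
final_model = KMeans(n_clusters=best_k, n_init=10, random_state=0).fit(features)
df["segment"] = final_model.labels_
profile = df.groupby("segment")[numeric_cols].mean()
print("chosen k:", best_k)
print(profile)
```

The `profile` table of per-segment averages is the kind of summary that can be turned into segment descriptions such as the example clusters above.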